Peterson’s Solution on Modern Multiprocessors
I wrote about the Peterson's Solution in one of my previous posts. I gave an example of how the Peterson’s n-Process Protocol can be implemented in C. It is a classical software based solution to the critical section problem. It isn't desirable to implement on modern systems since it wastes CPU time (busy waiting).
In addition, there is one -very important- reason why it isn't implemented today; there is no guarantee that Peterson's Solution will work on modern multiprocessor. When I had first heard it, I couldn't understand the exact reason why it will not work. Yesterday afternoon, I had time to search. I was thinking I can understand the whole picture, every details in a short time (how naive).
I found the first well described information in Solaris' Multithreaded Programming Guide. Briefly, it says, Peterson's Algorithm works when the multiprocessor has strongly ordered memory. However today's multiprocessors (not all) have store buffers. When a process/thread executes a store instruction, the data is put into the store buffer. The buffered data is sent to the cache sooner or later, but not necessarily right away.

Consequently, two threads running on separate processors both could enter the critical section. Each thread sets its own interested array slot (interested[process] = TRUE) and then loads from the others slot (interested[other] == TRUE). Both threads could read the old values (FALSE) and then assume that the other party is not interested. It means both enter the critical section.
#define TRUE 1
int turn;
int interested[2];
void enterRegion(int process)
{
int other = 1 - process;
interested[process] = TRUE;
turn = process;
while (interested[other] == TRUE && turn == process)
;
}
void leaveRegion(int process)
{
interested[process] = FALSE;
}
The first explanation is pretty clear. However when you dive into the details, it becomes a complex subject (memory models, instruction reordering etc.). In addition it is hardware depended, you should read different documents for Intel, SPARC or other architectures. Anyway, I don't like to write about advanced topics which I don't have accurate knowledge yet. I found and downloaded lots of documents. I think it will take at least few months to read and to understand them. I had estimated few hours at the beginning.
After explained the problem, you might think of the question, then how can we implement multithreading? It is really easy as an application developer, just use the synchronization objects provided by operating systems (mutex locks, condition variables, read-write locks, semaphores etc.). Kernel developers consider all hardware depended issues and implement the synchronization objects in proper way.
I had been wondering if I can reproduce the buggy case on modern processors. Therefore I have implemented a test code with Peterson's Algorithm and performed some tests on three different hardware. The test code is very simple, one thread adds 1 to a variable and the other subtracts 1. Results are interesting;
- First machine has got 2 x SPARC64-VII CPU as following
bash-3.00$ psrinfo -vp The physical processor has 8 virtual processors (0-7) SPARC64-VII (portid 1024 impl 0x7 ver 0x91 clock 2400 MHz) The physical processor has 8 virtual processors (8-15) SPARC64-VII (portid 1032 impl 0x7 ver 0x91 clock 2400 MHz)
Threads run about a minute and after 32.577.239 calculations (add/sub), I have seen 17 cases where both entered critical section.
- Second machine has got 2 x UltraSPARC-T2+ CPU as following
bash-3.00$ psrinfo -vp The physical processor has 48 virtual processors (0-23 32-55) UltraSPARC-T2+ (cpuid 0 clock 1162 MHz) The physical processor has 48 virtual processors (64-111) UltraSPARC-T2+ (cpuid 64 clock 1162 MHz)
Threads run about 5 hours (please consider that first machine performed 32.577.239 calculations in 1 minute), I haven't seen any case where both enter critical section (I am not sure if it means it never happens on UltraSPARC-T2+, I am going to read the CPU specs to understand its memory model).
- Last machine has got 1 x Intel(R) Core(TM)2 Quad CPU Q8400 @ 2.66GHz CPU as following
vendor_id : GenuineIntel cpu family : 6 model : 23 model name : Intel(R) Core(TM)2 Quad CPU Q8400 @ 2.66GHz stepping : 10 cpu MHz : 1998.000 cache size : 2048 KB
Threads run about 5 minutes, I have seen 3 cases where both entered critical section.
April 11th, 2010 - 18:12
Interesting topic. It reminded me of the “Double-Checked Locking” problem in Java related to Out-of-order writes in pre Java 5 memory model. http://www.ibm.com/developerworks/library/j-dcl.html
August 26th, 2010 - 16:38
Very interesting test and results. Just for my curiosity: have you ever tried to use the volatile attribute for the turn and interested variables?
Regards
Juergen
August 30th, 2010 - 11:42
I tested with the volatile, the test results are similar. I disassembler the code before and after the volatile, the produced codes are identical. Actually, GCC doesn’t perform any optimization like keeping the values in registers. The store instruction is executed. However, two threads enter the critical section at the same time because of the store buffer and/or instruction reordering.
Regards,
Alper