http://www.alper.net/ Eyup Alper Yoney

17Mar/103

Peterson’s Solution on Modern Multiprocessors

I wrote about the Peterson's Solution in one of my previous posts. I gave an example of how the Peterson’s n-Process Protocol can be implemented in C. It is a classical software based solution to the critical section problem. It isn't desirable to implement on modern systems since it wastes CPU time (busy waiting).

In addition, there is one -very important- reason why it isn't implemented today; there is no guarantee that Peterson's Solution will work on modern multiprocessor. When I had first heard it, I couldn't understand the exact reason why it will not work.  Yesterday afternoon, I had time to search. I was thinking I can understand the whole picture, every details in a short time (how naive).

I found the first well described information in Solaris' Multithreaded Programming Guide. Briefly, it says, Peterson's Algorithm works when the multiprocessor has strongly ordered memory. However today's multiprocessors (not all) have store buffers. When a process/thread executes a store instruction, the data is put into the store buffer. The buffered data is  sent to the cache sooner or later, but not necessarily right away.

Store Buffer

Consequently, two threads running on separate processors both could enter the critical section.  Each thread sets its own interested array slot (interested[process] = TRUE) and then loads from the others slot (interested[other] == TRUE). Both threads could read the old values (FALSE) and then assume that the other party is not interested. It means both enter the critical section.

#define FALSE  0
#define TRUE   1

int turn;
int interested[2];

void enterRegion(int process)
{
  int other = 1 - process;
  interested[process] = TRUE;
  turn = process;

  while (interested[other] == TRUE && turn == process)
    ;
}

void leaveRegion(int process)
{
  interested[process] = FALSE;
}
 

The first explanation is pretty clear. However when you dive into the details, it becomes a complex subject (memory models,  instruction reordering etc.).  In addition it is hardware depended, you should read different documents for Intel, SPARC or other architectures.  Anyway, I don't like to write about advanced topics which I don't have accurate knowledge yet. I found and downloaded lots of documents. I think it will take at least few months to read and to understand them. I had estimated few hours at the beginning. :)

After explained the problem, you might think of the question, then how can we implement multithreading? It is really easy as an application developer, just use the synchronization objects provided by operating systems (mutex locks, condition variables, read-write locks, semaphores etc.). Kernel developers consider all hardware depended issues and implement the synchronization objects in proper way.

I had been wondering if I can reproduce the buggy case on modern processors. Therefore I have implemented a test code with Peterson's Algorithm and performed some tests on three different hardware. The test code is very simple, one thread adds 1 to a variable and the other subtracts 1. Results are interesting;

- First machine has got 2 x SPARC64-VII CPU as following

bash-3.00$ psrinfo -vp
The physical processor has 8 virtual processors (0-7)
SPARC64-VII (portid 1024 impl 0x7 ver 0x91 clock 2400 MHz)
The physical processor has 8 virtual processors (8-15)
SPARC64-VII (portid 1032 impl 0x7 ver 0x91 clock 2400 MHz)

Threads run about a minute and after 32.577.239 calculations (add/sub), I have seen 17 cases where both entered critical section.

- Second machine has got 2 x UltraSPARC-T2+ CPU as following

bash-3.00$ psrinfo -vp
The physical processor has 48 virtual processors (0-23 32-55)
  UltraSPARC-T2+ (cpuid 0 clock 1162 MHz)
The physical processor has 48 virtual processors (64-111)
  UltraSPARC-T2+ (cpuid 64 clock 1162 MHz)

Threads run about 5 hours (please consider that first machine performed 32.577.239 calculations in 1 minute), I haven't seen any case where both enter critical section (I am not sure if it means it never happens on UltraSPARC-T2+, I am going to read the CPU specs to understand its memory model).

- Last machine has got 1 x Intel(R) Core(TM)2 Quad CPU Q8400 @ 2.66GHz CPU as following

vendor_id         : GenuineIntel
cpu family        : 6
model              : 23
model name      : Intel(R) Core(TM)2 Quad CPU    Q8400  @ 2.66GHz
stepping          : 10
cpu MHz          : 1998.000
cache size       : 2048 KB

Threads run about 5 minutes, I have seen 3 cases where both entered critical section.

19Dec/094

Peterson’s n-Process Protocol

Peterson's Solution is a classic software based solution to the critical-section problem. Peterson's original formulations works for 2 processes. However, it is possible to implement it for more than 2 processes.

The following code works for 2 processes and it can be found easily on the Internet.

#define FALSE  0
#define TRUE   1

int turn;
int interested[2];

void enterRegion(int process)
{
  int other = 1 - process;
  interested[process] = TRUE;
  turn = process;

  while (interested[other] == TRUE && turn == process)
    ;
}

void leaveRegion(int process)
{
  interested[process] = FALSE;
}

Unfortunately, there aren't many resources about n-Process Peterson solution. I think one of the good explanation can be found in "Operating Systems Review, January 1990 ('Proof of a Mutual Exclusion Algorithm', M Hofri)".

I just like to show the following implementation of Peterson's n-Process Protocol.

#define FALSE  0
#define N     10     /* First process is indicated with 1, not 0 */

int turn[N];
int stage[N + 1];

void enterRegion(int process)
{
  int i, j;

  for (i = 1; i <= N - 1; i++) {
    stage[process] = i;
    turn[i] = process;
    for (j = 1; j <= N; j++) {
      if (j == process)
        continue;
      while (stage[j] >= i  && turn[i] == process)
        ;
    }
  }
}

void leaveRegion(int process)
{
  stage[process] = FALSE;
}

Note: There are N processes from 1 to N. Therefore we need a turn array with size N - 1 and we need a stage array with size N. However, I declared the turn and the stage arrays with sizes N and N + 1 respectively. It is just because I like to keep the code clean. I don't want to subtract 1 like turn[i - 1]. I ignore the first elements -turn[0] and stage[0]- and never access them.