http://www.alper.net/ Eyup Alper Yoney

17Mar/103

Peterson’s Solution on Modern Multiprocessors

I wrote about the Peterson's Solution in one of my previous posts. I gave an example of how the Peterson’s n-Process Protocol can be implemented in C. It is a classical software based solution to the critical section problem. It isn't desirable to implement on modern systems since it wastes CPU time (busy waiting).

In addition, there is one -very important- reason why it isn't implemented today; there is no guarantee that Peterson's Solution will work on modern multiprocessor. When I had first heard it, I couldn't understand the exact reason why it will not work.  Yesterday afternoon, I had time to search. I was thinking I can understand the whole picture, every details in a short time (how naive).

I found the first well described information in Solaris' Multithreaded Programming Guide. Briefly, it says, Peterson's Algorithm works when the multiprocessor has strongly ordered memory. However today's multiprocessors (not all) have store buffers. When a process/thread executes a store instruction, the data is put into the store buffer. The buffered data is  sent to the cache sooner or later, but not necessarily right away.

Store Buffer

Consequently, two threads running on separate processors both could enter the critical section.  Each thread sets its own interested array slot (interested[process] = TRUE) and then loads from the others slot (interested[other] == TRUE). Both threads could read the old values (FALSE) and then assume that the other party is not interested. It means both enter the critical section.

#define FALSE  0
#define TRUE   1

int turn;
int interested[2];

void enterRegion(int process)
{
  int other = 1 - process;
  interested[process] = TRUE;
  turn = process;

  while (interested[other] == TRUE && turn == process)
    ;
}

void leaveRegion(int process)
{
  interested[process] = FALSE;
}
 

The first explanation is pretty clear. However when you dive into the details, it becomes a complex subject (memory models,  instruction reordering etc.).  In addition it is hardware depended, you should read different documents for Intel, SPARC or other architectures.  Anyway, I don't like to write about advanced topics which I don't have accurate knowledge yet. I found and downloaded lots of documents. I think it will take at least few months to read and to understand them. I had estimated few hours at the beginning. :)

After explained the problem, you might think of the question, then how can we implement multithreading? It is really easy as an application developer, just use the synchronization objects provided by operating systems (mutex locks, condition variables, read-write locks, semaphores etc.). Kernel developers consider all hardware depended issues and implement the synchronization objects in proper way.

I had been wondering if I can reproduce the buggy case on modern processors. Therefore I have implemented a test code with Peterson's Algorithm and performed some tests on three different hardware. The test code is very simple, one thread adds 1 to a variable and the other subtracts 1. Results are interesting;

- First machine has got 2 x SPARC64-VII CPU as following

bash-3.00$ psrinfo -vp
The physical processor has 8 virtual processors (0-7)
SPARC64-VII (portid 1024 impl 0x7 ver 0x91 clock 2400 MHz)
The physical processor has 8 virtual processors (8-15)
SPARC64-VII (portid 1032 impl 0x7 ver 0x91 clock 2400 MHz)

Threads run about a minute and after 32.577.239 calculations (add/sub), I have seen 17 cases where both entered critical section.

- Second machine has got 2 x UltraSPARC-T2+ CPU as following

bash-3.00$ psrinfo -vp
The physical processor has 48 virtual processors (0-23 32-55)
  UltraSPARC-T2+ (cpuid 0 clock 1162 MHz)
The physical processor has 48 virtual processors (64-111)
  UltraSPARC-T2+ (cpuid 64 clock 1162 MHz)

Threads run about 5 hours (please consider that first machine performed 32.577.239 calculations in 1 minute), I haven't seen any case where both enter critical section (I am not sure if it means it never happens on UltraSPARC-T2+, I am going to read the CPU specs to understand its memory model).

- Last machine has got 1 x Intel(R) Core(TM)2 Quad CPU Q8400 @ 2.66GHz CPU as following

vendor_id         : GenuineIntel
cpu family        : 6
model              : 23
model name      : Intel(R) Core(TM)2 Quad CPU    Q8400  @ 2.66GHz
stepping          : 10
cpu MHz          : 1998.000
cache size       : 2048 KB

Threads run about 5 minutes, I have seen 3 cases where both entered critical section.

5Mar/100

Linux RAM-Based Filesystem

I have been searching the Internet for the RAM-based filesystem in Linux. There are many posts and each one describes its solution and configuration well. However, after reading few of them, it may seem a little confusing because there are 3 different mechanisms -ramdisk, ramfs and tmpfs- to create a RAM-based filesystem in Linux and current posts generally describes and compares one or two. In addition, there is a name confusion, some posts used 'ramdisk' as a general name. RAM-based filesystem is more commonly known as ramdisk. However, one of the Linux mechanisms is also called as ramdisk. Therefore, I always use the word 'ramdisk' to indicate the Linux mechanism rather than general name throughout the post.

I like to describe all mechanisms and compare them as well. However there are some details you may not want to know if you like to  just use it rather than to care how the Linux kernel handles it inside. That's why I don't mention all details here and keep it as simple as possible. If you like to know all details, the Linux kernel documentation files filesystems/ramfs-rootfs-initramfs.txt and filesystems/tmpfs.txt are the right places to start and then of course the Linux kernel source code itself for the definite guide.

I assume you already know what the RAM-based filesystem and for what purpose it has been used. I skip the introductory information. In addition, I assume you have got some Linux background (grub, mount, shell etc.).

I have been using Fedora 12 for the configuration examples, I expect, the examples work with other GNU/Linux distributions running 2.6.x kernels (compiled with ramdisk support).

1 - ramdisk

The ramdisk is the oldest mechanism to create RAM-based filesystem in Linux. Basically, it is a way to use the system memory as a block device. This block device is fixed size and it needs a filesystem driver such as ext2 (It is possible to create ext3 filesystem on ramdisk. However, it doesn't make sense since we don't need journal on RAM-based filesystem). The ramdisk run similar to other block devices, it means ramdisk uses  Linux's file caching mechanism similar to other block devices. Consequently, the ramdisk wastes memory and CPU time compare to ramfs and tmpfs because it caches the data that already in memory.

Fedora 12 has got 16 ramdisks as default (at least mine does), the default size of the ramdisks is 16MB. If you like to increase the size of the ramdisks, you must pass a parameter -'ramdisk_size=X'- to kernel during boot. I increased to 64MB in my environment, the related grub.conf file line is below

kernel /boot/vmlinuz-2.6.31.12-174.2.22.fc12.x86_64 ro root=UUID=a2142f4e-7880-4419-9051-0433bdc36277 ramdisk_size=65536 rhgb quiet

and here is the list of default ramdisks. These ramdisks don't use any memory area until you format them and mount.

[root@localhost /]# ls -l /dev/ram*
brw-rw---- 1 root disk 1,  0 2010-03-04 03:44 /dev/ram0
brw-rw---- 1 root disk 1,  1 2010-03-04 03:44 /dev/ram1
brw-rw---- 1 root disk 1, 10 2010-03-04 03:44 /dev/ram10
brw-rw---- 1 root disk 1, 11 2010-03-04 03:44 /dev/ram11
brw-rw---- 1 root disk 1, 12 2010-03-04 03:44 /dev/ram12
brw-rw---- 1 root disk 1, 13 2010-03-04 03:44 /dev/ram13
brw-rw---- 1 root disk 1, 14 2010-03-04 03:44 /dev/ram14
brw-rw---- 1 root disk 1, 15 2010-03-04 03:44 /dev/ram15
brw-rw---- 1 root disk 1,  2 2010-03-04 03:44 /dev/ram2
brw-rw---- 1 root disk 1,  3 2010-03-04 03:44 /dev/ram3
brw-rw---- 1 root disk 1,  4 2010-03-04 03:44 /dev/ram4
brw-rw---- 1 root disk 1,  5 2010-03-04 03:44 /dev/ram5
brw-rw---- 1 root disk 1,  6 2010-03-04 03:44 /dev/ram6
brw-rw---- 1 root disk 1,  7 2010-03-04 03:44 /dev/ram7
brw-rw---- 1 root disk 1,  8 2010-03-04 03:44 /dev/ram8
brw-rw---- 1 root disk 1,  9 2010-03-04 03:44 /dev/ram9

Now, we can create the file system; in other words, format the ramdisk.

[root@localhost /]# mke2fs -t ext2 -m 0 /dev/ram0
mke2fs 1.41.9 (22-Aug-2009)
Filesystem label=
OS type: Linux
Block size=1024 (log=0)
Fragment size=1024 (log=0)
16384 inodes, 65536 blocks
0 blocks (0.00%) reserved for the super user
First data block=1
Maximum filesystem blocks=67108864
8 block groups
8192 blocks per group, 8192 fragments per group
2048 inodes per group
Superblock backups stored on blocks:
        8193, 24577, 40961, 57345

Writing inode tables: done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 21 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.

Then, create the mount point, mount and verify.

[root@localhost /]# mkdir /tmp/ramdisk
[root@localhost /]# mount /dev/ram0 /tmp/ramdisk
[root@localhost /]# df -h | grep ramdisk
/dev/ram0              62M  1.3M   61M   3% /tmp/ramdisk
[root@localhost /]# mount | grep ramdisk
/dev/ram0 on /tmp/ramdisk type ext2 (rw)

If you need to ramdisk on every boot up, you should do some scripting.

2 - ramfs

The ramfs is a very simple filesystem. It turns Linux's disk caching mechanisms into dynamically resizable RAM-based filesystem. Linux caches all files. Files read from device are kept in memory since they're likely to be needed again and they're set as clean to free in case virtual memory system needs the memory. Similarly, data written to files are kept in memory and set as clean as soon as they've been written to device. However, ramfs hasn't got any device to write back.  Files written to ramfs allocate cache structures as usual but there is nowhere to write them back. It means the cache never set as clean.

One downside of ramfs is you can not set size limit. Therefore you can keep writing to ramfs until you fill up all memory. The virtual memory  system can't free it (can't move to swap space). That's why you should be very careful if you decide to chose ramfs. In addition, you should carefully set the permissions to identify which users are allowed to write the ramfs. I think no one likes to freeze the system. :)

Configuring ramfs is very easy compare to ramdisk, we only need to create the mount point and mount it.

[root@localhost /]# mkdir /tmp/ramfs
[root@localhost /]# mount -t ramfs ramfs /tmp/ramfs
[root@localhost /]# mount | grep ramfs
ramfs on /tmp/ramfs type ramfs (rw)
[root@localhost /]# df -h | grep ramfs
[root@localhost /]#

It is ready to use (you can not see it in df output, it is fine). You can add an entry to /etc/fstab file to mount the ramfs during the boot.

3 - tmpfs

The tmpfs is derivative of ramfs. It is easy and short to describe tmpfs after ramfs. The tmpfs is created to add size limit and ability to use the swap in case of virtual memory system needs memory.

Configuring tmpfs is similar to ramfs, we only need to create the mount point and mount it.

[root@localhost /]# mkdir /tmp/tmpfs
[root@localhost /]# mount -t tmpfs -o size=64m tmpfs /tmp/tmpfs
[root@localhost /]# mount | grep tmpfs
tmpfs on /tmp/tmpfs type tmpfs (rw,size=64m)
[root@localhost /]# df -h | grep tmpfs
tmpfs                  64M     0   64M   0% /tmp/tmpfs

Similar to ramfs, you can add an entry to /etc/fstab file to mount the tmpfs during the boot. In addition, tmpfs' size limit can be adjusted on run time (mount -o remount ...).

Summary

ramdisk ramfs tmpfs
Size Limit Yes No Yes
Resizable Yes (on boot) Yes (no limit!) Yes (on runtime)
Use swap No No Yes
3Mar/100

The Mythical Man-Month

I am currently reading The Mythical Man-Month from Fred Brooks. This book consists of essays on software engineer. These essays draw from his experience as project manager for the IBM System/360 computer family and then OS/360, its massive software system.

I suggest read this book if you take part in any software project. I believe you will find parts similar to your own projects and learn from Brooks' experience.

For example, when a task cannot be partitioned because of sequential constraints, the application of more effort has no effect on the schedule. However, project managers sometimes try to shorten the schedule by assigning men. The following phase from the book perfectly explains the situation. :)



The Mythical Man-Month


The bearing of a child takes nine months, no matter how many women are assigned.

12Jan/100

Endianness

I found a paper about endianness while I was looking on Intel's developer documentations. I have been wondering what kind of advantages can be gained to store the data in little endian or big endian format. There are some explained advantages and disadvantages in that document. However, I haven't been enough convinced with these explanations. I think there are many others related to CPU and hardware design.

Anyway, that document well describes the endianness and the portability issues. It should be read if you have got some questions in mind.

I implemented a function which checks the endiannes of the CPU on run time.  It is possible to decide the architecture by comparing one byte. However, I like to perform strict check on multi-byte data.

Here is the link of the Intel's document.

#include <stdio.h>

union {
  char array[sizeof(int)];
  int intVal;
} testUnion;

int main()
{
  int i, bigEndVal = 0, littleEndVal = 0;
  char c = 'a';

  for(i = 0; i < sizeof(int); i++) {
    testUnion.array[i] = c;
    bigEndVal = (bigEndVal << 8) | c;
    littleEndVal = littleEndVal | (c << 8 * i);
    c++;
  }

  if (testUnion.intVal == bigEndVal)
    printf("Big Endiann");
  else if (testUnion.intVal == littleEndVal)
    printf("Little Endiann");
  else
    printf("Unknownn");

  return 0;
}
 


2Jan/100

C++, Functions That Return Two Values

I am going to explain the class pair that is provided to treat two values as one. The class pair is used several times in STL. The container classes map and multimap use the class pair to handle the key/value pairs. The other usage of the pair class is functions that return two values. I choose the title as 'Functions That Return Two Value' instead of 'The class pair' because it sounds much more attractive since developers used to hear functions can return one value.

The pair structure is defined in <utility> as follows;

namespace std {
  template<class T1, class T2>
  struct pair {
    typedef T1 first_type;
    typedef T2 second_type;

    T1 fist;
    T2 second;

    pair() : first(T1()), second(T2()) {}
    pair(const T1& val1, const T2& val2) : first(val1), second(val2) {}
    template<class Other1, class Other2>
    pair(const pair<Other1, Other2>& p) : first(p.first), second(p.second) {}
  };
}
 

The pair is defined as struct. Therefore, accessing the members (first and second) is possible. The definition is short and straightforward. However, the template version of the copy constructor may seen confusing. The template copy constructor is called whenever implicit conversion is needed. There are many good books about C++, STL and Templates, two of them are as follows;

1. C++ Primer, by Stanley B. Lippman, Josée LaJoie, Barbara E. Moo
2. The C++ Standard Library: A Tutorial and Reference, by Nicolai M. Josuttis

The make_pair() template function makes the life easier by creating the pair without explicit types.

namespace std {
  template<class T1, class T2>
  make_pair(const T1& val1, const T2& val2)
  {
    return pair<T1, T2>(val1, val2);
  }
}
 

In conclusion, here is an example of function that returns two values, including use of the pair class and make_pair function.

#include <cstdlib>
#include <iostream>
#include <utility>

using namespace std;

const double PI = 3.14;

pair<double, double> areaAndCircumferenceOfCircle(double r)
{
  return make_pair((PI * r * r), (2 * PI * r));
}

int main(int argc, char *argv[])
{
  pair<double, double> circle;
 
  circle = areaAndCircumferenceOfCircle(5);
  cout << "Area: " << circle.first << endl;
  cout << "Circumference: " << circle.second << endl;
 
  return 0;
}