CS 885 : Future Multicore Architectures and their Software


Assignment 1: pthreads

Due Date : May 30th

The purpose of this assignment is to give you some general experience in writing shared memory programs using the standard pthreads library package. For more information on pthreads, you may want to consult the pthreads man page and the tutorial from Lawrence Livermore National Lab.

Part I

Create a program that launches t threads and have them share a counter. Have each thread increment the counter i times

  • with no synchronization
  • with pthread locks
  • with TAS locks
  • with TATAS locks
  • with ticket locks
  • with MCS locks

Try each option with varying numbers of threads, both greater and fewer than the number of processors in the machine. Report final counter values and execution times. Try any other tests that occur to you. Explain your results (in writing).

To simplify testing of your code, please write your program to take the number of threads t and the number of iterations i as command-line arguments, specified with "-t t" and "-i i" (in either order). If the arguments are not specified, use t = 4 and i = 10,000.

Be sure to include a README.pdf file that explains what you did and what you learned. It should include your timing results and analysis of the various locks. We will be grading the assignment on a roughly equal mixture of completeness and correctness and quality of write-up.

Part II

Use the infrastructure you built for Part II to test concurrent queue algorithms.  Specifically, have each thread run a loop in which it repeatedly enqueues data to, or dequeues from (with 50-50 probability), a shared queue. 

  • Implement the queue using a TAS lock with exponential backoff.  For backoff, simply have your thread spin some number of times around a loop that reads a volatile variable (without the volatile read, the compiler may "optimize" the loop away: Wiki about volatile variables).  Experiment with different values for the backoff base (initial delay) and cap (maximum delay).  Pick (and report in your README file) values that seem to maximize throughput. 

  • Implement the queue using a two lock solution from M&S two-lock queue algorithm. [ Pseudo code ]

  • Implement the queue using an Michael and Scott non-blocking algorithm [ Paper and pseudo code ]

As described in the linked sources above, be sure to use "counted pointers" to avoid the ABA problem.  Also, to avoid "polluting" your results with the overhead of memory management (the standard malloc and free don't scale well), you should have each thread pre-allocate its own supply of nodes, which it can keep on a private list when they are not in the queue. 

Finally, to avoid pollution from calls to the pseudo-random (the standard rand isn't even thread-safe), you should pre-generate enough random bits to drive the choice between enqueueing and dequeueing for the entire test run.  Be sure to use a barrier (described in the notes below) to "synch up" your threads after this setup is complete, and before starting the timing run. 

As in Part II, your README.pdf file should contain a detailed description of the experiments you ran, graphs of the results, and discussion that explains those results—why they look the way they do.

Assignment Notes

  • Pseudocode for locking algorithms can be found here

  • The Gnu C compiler provides a very flexible mechanism to insert assembly language instructions (e.g. the various atomic primitives) into your code. Unfortunately it’s a rather confusing mechanism, so we’ve written the magic incantations for you: atomics_ops.h

  • Pthreads are not part of the C standard library. To use them you must link in a separate library explicity. Add -lpthread to the end of your gcc command line. We strongly recommend that you create a Makefile for your assignment, even if you have only a few files. Something like the following is a good start:

    CC = gcc
    # CFLAGS = -m32 -Wa -g


    test: test.c
    $(CC) -o test $(CFLAGS) test.c -lpthread

    test.s: test.c
    $(CC) -S -fverbose-asm $(CFLAGS) test.c

    clean:
    rm test
    This assumes that your test program is named test.  The test.s rule allows you to see the x86 assembly code you’re running, if you’re curious, or if you need to debug any in-line assembler.

  • 32 bit compilation. When encoding our atomic ops we assumed 32 bit mode and hence used instructions such as cmpxchg8b for double-word swaps. Hence, the whole program needs to be compiled in 32 bit mode. This is achieved by passing -m32 to the CFLAGs. We also need to compile against the 32 bit libstdc++ library. We have installed these libraries on our linux machine. Use the following LDFLAGS = -L/usr/lib32 .
  • To time programs, use the gethrtime() .  Run your tests multiple times.  See which results seem repeatable, and which vary greatly from one run to the next.  (And try to explain why.)  To gain the absolute maximum performance you may have to resort to techniques such as binding a thread to processor.  Take a look at techniques such as processor bind.

  • Submission Instruction

    • Create separate folders for Part I and Part II
    • We will be using portal.cs.sfu.ca for submission.
    • Submit tar.gz of assignment. Name the gz. [Groupid].tar.gz
    • Each group submits one copy of the assignment.