
Click on the link to clone your repo.


ERRATA 1 (Part 3 Mandelbrot)

Observation: for mandelbrot_ispc.bin, the Travis build is failing, but the code works on cs-cloud-02 and cs-cloud-04.

The instructions below assume you have cloned your repo; they are compatible with cs-cloud-04.

cd $REPO_DIR
wget "https://www2.cs.sfu.ca/~ashriram/Courses/CS431/assets/travis/Ass1/CMakeLists.txt" -O prob3_mandelbrot_ispc/CMakeLists.txt
git add prob3_mandelbrot_ispc/CMakeLists.txt
git commit -am "Updated CMakeLists Part 3"
git push

First-steps

Here is a list of things you should already have completed before embarking on this assignment:

We will be using CMake as our build system in 431. If you have not worked with CMake before, here you can see it in action. CMake is a cross-platform Makefile generator! Simply put, CMake automatically generates the Makefiles for your project.

Assignment 1 cmake video

If you are on sfu-cloud, run this dependency loader before following the steps below:

module load ispc
cd [Cloned directory]
export ASSIGNMENT=`pwd`
mkdir build
cd build
cmake ../; make help
mandelbrot.serial.bin # Default serial version of mandelbrot
mandelbrot.bin      # Part 1
array_vec.bin       # Part 2
mandelbrot_ispc.bin # Part 3
sqrt_ispc.bin       # Part 4

You should not have to add any new source files to the build system.

Logistics

This is an individual project. All handins are electronic.

  1. Commit your git repo - remember to fill in your SFUID file
  2. report.pdf (each part must be a subsection); submit to CourSys
  3. Supported machines: cs-cloud-02.cs.surrey.sfu.ca and cs-cloud-04.cs.surrey.sfu.ca
  4. Dependencies: cmake, g++-7, ispc

Your submission will consist of the code files you have modified. You may use any document preparation system you choose, but the final result must be stored as a single file in PDF format, named report.pdf. Make sure the report includes your name and SFU ID. More details on how to submit this information are provided at the end of this document.

Helper docs

Overview

In this assignment you will modify and experiment with code designed to exploit the two main forms of parallelism available on modern processors: the multiple cores that can execute programs independently, and the SIMD vector units that allow each processor to perform some of its arithmetic and memory operations on vectors of data.

You will gain experience measuring and reasoning about the performance of parallel programs, a challenging, but important, skill you will use throughout this class. This assignment involves only a small amount of programming, but a lot of analysis!

Problem 1 - Parallel Fractal Generation Using Pthreads - 15 points

First Steps

Build and run the code in the prob1_mandelbrot_threads directory of the Assignment 1 code base. This program produces the image file mandelbrot-serial.ppm, which is a visualization of a famous set of complex numbers called the Mandelbrot set. As you can see in the images below, the result is a familiar and beautiful fractal. Each pixel in the image corresponds to a value in the complex plane, and the brightness of each pixel is proportional to the computational cost of determining whether the value is contained in the Mandelbrot set: white pixels required the maximum number (256) of iterations, black ones only a few iterations, and gray pixels were somewhere in between. See the function mandel defined in mandelbrot.cpp. You can learn more about the definition of the Mandelbrot set here.
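For reference, an escape-time kernel in the spirit of mandel() looks roughly like the sketch below. It is reconstructed from the description above, not copied from the handout; see mandelbrot.cpp for the authoritative version.

// Sketch of an escape-time kernel (illustrative, reconstructed from the
// text above): iterate z = z^2 + c until |z| > 2 or the iteration cap
// is hit. The returned count drives the pixel's brightness.
static inline int mandel(float c_re, float c_im, int maxIterations) {
    float z_re = c_re, z_im = c_im;
    int i;
    for (i = 0; i < maxIterations; ++i) {
        if (z_re * z_re + z_im * z_im > 4.f)   // |z| > 2: the point escapes
            break;
        float new_re = z_re * z_re - z_im * z_im;  // Re(z^2)
        float new_im = 2.f * z_re * z_im;          // Im(z^2)
        z_re = c_re + new_re;
        z_im = c_im + new_im;
    }
    return i;  // expensive (white) pixels hit maxIterations
}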

Use the command option "--view V" for V between 1 and 6 to get the different images. You can click the links below to see the different images in a browser. These were generated using the Problem 0 serial version. Load the environment (works only on sfu-cloud machines; if you are running on your own machine, check the dependencies file. WE WILL NOT SUPPORT YOUR OWN MACHINES.)

# Loading ISPC on cs-cloud-02 and cs-cloud-04. Please read the slurm page before running this assignment.
module load ispc
cd [Cloned directory]
export ASSIGNMENT=`pwd`
mkdir build
cd build
cmake ../; make

Visualization

_Visualization (for your convenience; it is not required)_

Requires X11 on the machine and a windowing environment running on the client you are sshing from. If you are sshing from a personal machine to an SFU machine (e.g., linux.cs.sfu.ca, and then logging into cs-cloud-02/03), then you need to forward X through linux.cs.sfu.ca as well.

Install X11 on OS X
Install X11 on Windows

ssh -Y linux.cs.sfu.ca # on x-term on linux or OS X
ssh -Y cs-cloud-04.cs.surrey.sfu.ca -p 24 # from linux-cs

_The script below will open a window on the client and show the image:_

python3 ./mviz.py ./build/prob0_mandelbrot_serial/mandelbrot.serial.bin -v 1

Video demonstrating how to ssh and run the visualization

  -t  --threads <N>        Use N threads
  -v  --view <INT>         Use specified view settings (0-6)
  -f  --field x0:y0:x1:y1  Specify set boundaries
  -o  <outfile>            Specify output file
  -?  --help               This message
[Rendered images: V1, V2, V3, V4, V5, V6]

Your job is to parallelize the computation of the images using Pthreads. The assignment folder is: prob1_mandelbrot_threads

cd $ASSIGNMENT/prob1_mandelbrot_threads

The commandline option "--threads T" specifies that the computation is to be partitioned over T threads. In the function mandelbrotThread, located in mandelbrot.cpp, the main application thread creates T additional threads using pthread_create(). It waits for these threads to complete using pthread_join(). Currently, neither the launched threads nor the main thread do any computation, and so the program generates an error message. You should add code to the workerThreadStart() function to accomplish this task. You will not need to use any other Pthread API calls in this assignment.

What you need to do

  1. Modify the code in mandelbrot.cpp to parallelize the Mandelbrot generation using two cores. Specifically, compute the top half of the image in thread 0, and the bottom half of the image in thread 1. This type of problem decomposition is referred to as spatial decomposition since different spatial regions of the image are computed by different processors.
  2. Extend your code to utilize T threads for T between 2 and 16, partitioning the image generation work into the appropriate number of horizontal blocks. You will need to modify the code in the function workerThreadStart to partition the work over the threads. Note that the processor has only N hardware threads. Also, the images have 750 rows, and so you must handle the case where the number of rows is not evenly divisible by the number of threads.
  3. Measure the amount of time each thread requires to complete its work by inserting timing code at the beginning and end of workerThreadStart() (see the sketch after this list). How do your measurements explain the speedup graph you previously created?
  4. _Modify the mapping of work to threads to improve speedup_ to almost 8 on the first two views of the Mandelbrot set (if you're close to 8 that's fine, don't sweat it). You may not use any synchronization between threads. We expect you to come up with a single work decomposition policy that will work well for all thread counts; hard-coding a solution specific to each configuration is not allowed! (Hint: There is a very simple static assignment that will achieve this goal, and no communication/synchronization among threads is necessary.)
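For items 2 and 3, here is a minimal sketch of block partitioning with a remainder plus per-thread timing. WorkerArgs is a hypothetical stand-in for the starter code's argument struct; the field names and the omitted rendering call are illustrative assumptions, not the actual assignment API.

// Minimal sketch only: contiguous-block row partitioning with a
// remainder, plus per-thread timing. Adapt names to the starter code.
#include <pthread.h>
#include <algorithm>
#include <chrono>
#include <cstdio>

struct WorkerArgs {          // hypothetical; mirror the real struct
    int threadId;
    int numThreads;
    int height;              // e.g., 750 rows
    // ... image/view parameters from the starter code ...
};

void* workerThreadStart(void* threadArgs) {
    WorkerArgs* args = static_cast<WorkerArgs*>(threadArgs);
    auto start = std::chrono::steady_clock::now();

    // Split `height` rows into contiguous blocks; the first
    // (height % numThreads) threads each take one extra row.
    int base = args->height / args->numThreads;
    int extra = args->height % args->numThreads;
    int startRow = args->threadId * base + std::min(args->threadId, extra);
    int numRows = base + (args->threadId < extra ? 1 : 0);

    // ... render rows [startRow, startRow + numRows) here ...

    auto end = std::chrono::steady_clock::now();
    printf("Thread %d: rows [%d, %d) took %.3f ms\n",
           args->threadId, startRow, startRow + numRows,
           std::chrono::duration<double, std::milli>(end - start).count());
    return nullptr;
}

Note that this block decomposition is only a starting point; item 4 asks you to find a mapping that balances the uneven per-row costs.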

In your writeup, describe your approach and report the final 16-thread speedup obtained. Also comment on the difference in scaling behavior from 4 to 8 threads versus 8 to 16 threads.

What you need to turn in

  • Your report should contain the graphs, analyses, and answers specified above.
  • Your report should describe the decomposition strategy you used to maximize speedup.
  • In your write-up, produce a graph of speedup compared to the reference sequential implementation as a function of the number of cores used for view 1. Is speedup linear in the number of cores used? In your write-up, hypothesize why this is (or is not) the case. (You may also wish to produce a graph for some of the other views to help you come up with an answer.)

Problem 2 - Vectorizing Code Using SIMD Intrinsics - 20 points

cd $ASSIGNMENT/build
# Build and run
./prob2_vecintrin/array_vec.bin --help
Usage: ./prob2_vecintrin/array_vec.bin [options]
Program Options:
  -s  --size <N>     Use workload size N (Default = 16)
  -l  --log          Print vector unit execution log
  -?  --help         This message

Relevant lectures

  • https://www.cs.sfu.ca/~ashriram/Courses/CS431/assets/lecturesPart2/04_431_ParallelProg.pdf

Take a look at the function clampedExpSerial in prob2_vecintrin/functions.cpp of the Assignment 1 code base. The function raises values[i] to the integer power given by exponents[i] for all elements of the input array and clamps the resulting values at 4.18. The function computes x^p using the technique known as exponentiation by squaring. Whereas the usual technique of multiplying together p copies of x requires p - 1 multiplications, iterative squaring requires at most 2 log2(p) multiplications. For p = 1000, exponentiation by squaring requires fewer than 20 multiplications rather than 999. Your job is to vectorize this piece of code so that it can be run on a machine with SIMD vector instructions.
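As a generic illustration of the technique (this is not the assignment's clampedExpSerial, just the idea it is built on), scalar exponentiation by squaring looks like the following:

// Generic scalar exponentiation by squaring: at most ~2*log2(p)
// multiplications instead of p - 1. Illustrative only; see
// functions.cpp for the assignment's clampedExpSerial.
float expBySquaring(float x, unsigned int p) {
    float result = 1.f;
    while (p > 0) {
        if (p & 1)      // odd exponent: fold in the current power of x
            result *= x;
        x *= x;         // square
        p >>= 1;        // halve the exponent
    }
    return result;
}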

Vector intrinsics library

We won't ask you to craft an implementation using the SSE or AVX vector intrinsics. Instead, we're asking you to implement your version using 431's "fake vector intrinsics" defined in intrin.h. (These functions don't translate to real vector instructions; instead, we simulate these operations for you in our library and provide feedback that makes for easier debugging.)

As an example of using the intrinsics, a vectorized version of the abs() function is given in functions.cpp.

This example contains some basic vector loads and stores and manipulates mask registers. Note that abs() is only a simple example (it does not handle all corner cases; we will let you figure out why). You may wish to read through the comments and function definitions in intrin.h to learn what operations are available to you.

Background hints

  1. Every vector instruction is subject to an optional mask parameter. The mask parameter defines which lanes have their output "masked" for this operation. A 0 in the mask indicates a lane is masked, and so its value will not be overwritten by the results of the vector operation. If no mask is specified in the operation, no lanes are masked. (This is equivalent to providing a mask of all ones.) Your solution will need to use multiple mask registers and various mask operations provided in the library. (A short illustration of masked execution follows this list.)

  2. You will find the __sfu431_cntbits() function to be helpful in this problem.

  3. You must handle the case where the total number of loop iterations is not a multiple of the SIMD vector width. We suggest you test your code with

cd $ASSIGNMENT/build
./prob2_vecintrin/array_vec.bin -s 3

You might find _sfu431_init_ones() helpful.

  4. Use ./array_vec.bin -l to print a log of the executed vector instructions at the end. Use the function addUserLog to add customized debug information to the log. Feel free to add additional Logger.printLog() calls to help you debug.
cd $ASSIGNMENT/build
./prob2_vecintrin/array_vec.bin -l
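To make hint 1 concrete, the following plain C++ sketch models what a masked vector operation does. This is not the intrin.h API, only the semantics it simulates:

// Conceptual model of masked execution (plain C++, not intrin.h):
// a lane's output is written only where its mask bit is 1; masked-off
// lanes keep their previous value.
#include <array>

constexpr int kWidth = 4;  // stand-in for VECTOR_WIDTH

void maskedAdd(std::array<float, kWidth>& dst,
               const std::array<float, kWidth>& a,
               const std::array<float, kWidth>& b,
               const std::array<bool, kWidth>& mask) {
    for (int lane = 0; lane < kWidth; ++lane) {
        if (mask[lane])  // 0 in the mask: leave dst[lane] untouched
            dst[lane] = a[lane] + b[lane];
    }
}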

The output of the program will tell you if your implementation generates correct output. If there are incorrect results, the program will print the first one it finds along with a table of function inputs and outputs. Your function's output appears after "output = " and should match the results after "gold = ". The program also prints a list of statistics describing the utilization of the fake vector units. You should consider the performance of your implementation to be the value "Total Vector Instructions". "Vector Utilization" shows the percentage of vector lanes that are enabled.

What you need to do

  1. Implement a vectorized version of clampedExpSerial() as the function clampedExpVector() in file functions.cpp. Your implementation should work with any combination of input array size N and vector width W.

  2. Run

./array_vec.bin -s 10000

Sweep the vector width over the values 2, 4, 8, 16, 32. Record the resulting vector utilization. You can do this by changing the defined value of VECTOR_WIDTH in intrin.h and recompiling the code each time. How much does the vector utilization change as W changes? Explain the reason for these changes and how sensitive the utilization is to the vector width. Explain how the total number of vector instructions varies with W.

Extra credit: (1 point)

Implement a vectorized version of arraySumSerial() as the function arraySumVector() in file functions.cpp. Your implementation may assume that W is a factor of the input array size N. Whereas the serial implementation has O(N) span, your implementation should have at most O(N/W + log2(W)) span. You may find the hadd and interleave operations useful.
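For intuition on the O(N/W + log2(W)) bound: after accumulating the array into a single width-W vector in N/W steps, the log term corresponds to pairwise halving of that accumulator, sketched below in plain C++ (the actual solution would use the library's hadd and interleave operations):

// Plain C++ sketch of a log2(W)-depth lane reduction (not the intrin.h
// hadd/interleave API). Each round of the outer loop corresponds to one
// parallel step that halves the number of live partial sums.
#include <vector>

float reduceLanes(std::vector<float> lanes) {
    // Assumes lanes.size() is a power of two (the vector width W).
    for (size_t stride = lanes.size() / 2; stride > 0; stride /= 2)
        for (size_t i = 0; i < stride; ++i)
            lanes[i] += lanes[i + stride];  // one pairwise round
    return lanes[0];
}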

What you need to turn in

  1. Your report should contain tables giving the vector utilizations and total vector instruction counts for the different values of W.

  2. Your report should contain the analyses and answers to the questions listed above.

  3. If you did the extra credit problem, state in the report whether or not your code passed the correctness test.


Problem 3 Mandelbrot ISPC - 15 points

./prob3_mandelbrot_ispc/mandelbrot_ispc.bin --help
Usage: ./prob3_mandelbrot_ispc/mandelbrot_ispc.bin [options]
Program Options:
  -t  --tasks             Run ISPC code implementation with tasks
  -v  --view <INT>        Use specified view settings (0-6)
  -f  --field x0:y0:x1:y1 Specify set boundaries
  -o  outfile             Specify output file
  -?  --help              This message

Part 1 - A Few ISPC Basics - 7 of 15 points

When reading ISPC code, you must keep in mind that, although the code appears much like C/C++ code, the ISPC execution model differs from that of standard C/C++. In contrast to C, multiple program instances of an ISPC program are always executed in parallel on the CPU’s SIMD execution units.

cs-cloud-02 SIMD: sse2-i32x4; cs-cloud-04 SIMD: avx1-i32x8.

The number of program instances executed simultaneously is determined by the compiler (and chosen specifically for the underlying machine). This number of concurrent instances is available to the ISPC programmer via the built-in variable programCount. ISPC code can reference its own program instance identifier via the built-in programIndex. Thus, a call from C code to an ISPC function can be thought of as spawning a group of concurrent ISPC program instances (referred to in the ISPC documentation as a gang). The gang of instances runs to completion, then control returns back to the calling C code.

As an example, the following program uses a combination of regular C code and ISPC code to add two 1024-element vectors. As discussed in class, since each instance in a gang is independent and performs the exact same program logic, execution can be accelerated via SIMD instructions.

A simple ISPC program, in which main() invokes an ISPC kernel, is given below. First, the C program, which calls the ISPC-generated code:

const int TOTAL_VALUES = 1024;
float a[TOTAL_VALUES];
float b[TOTAL_VALUES];
float c[TOTAL_VALUES];

// Initialize arrays a and b here.
...

sum(TOTAL_VALUES, a, b, c);
// Upon return from sum(), the result of a + b is stored in c.

The function sum() called by the C code is generated by compiling the following ISPC code:

ISPC Kernel

export void sum(uniform int N, uniform float* a, uniform float* b, uniform float* c) {
    // Assumption: programCount divides N evenly.
    for (uniform int i = 0; i < N; i += programCount) {
        c[programIndex + i] = a[programIndex + i] + b[programIndex + i];
    }
}

The ISPC program code above interleaves the processing of array elements among program instances. Note the similarity to Problem 1, where you statically assigned parts of the image to threads.

However, rather than thinking about how to divide work among program instances (that is, how work is mapped to execution units), it is often more convenient, and more powerful, to instead focus only on the partitioning of a problem into independent parts. ISPC's foreach construct provides a mechanism to express problem decomposition. Below, the foreach loop in the ISPC function sum2() defines an iteration space where all iterations are independent and therefore can be carried out in any order. ISPC handles the assignment of loop iterations to concurrent program instances. The difference between sum() and sum2() below is subtle, but very important. sum() is imperative: it describes how to map work to concurrent instances. The sum2() function below is declarative: it specifies only the set of work to be performed.

export void sum2(uniform int N, uniform float* a, uniform float* b, uniform float* c) {
    foreach (i = 0 ... N) {
        c[i] = a[i] + b[i];
    }
}

Before proceeding, you are encouraged to familiarize yourself with ISPC language constructs by reading through the ISPC walkthrough available at http://ispc.github.com/example.html. The example program in the walkthrough is almost exactly the same as Problem 3’s implementation of mandelbrot_ispc() in mandelbrot.ispc. In the assignment code, we have changed the bounds of the foreach loop to yield a more straightforward implementation.

What you need to do

  • Compile and run the program mandelbrot.ispc. The ISPC compiler is configured to emit 4-wide SSE2 vector instructions (on cs-cloud-02) and 8-wide AVX1 vector instructions (on cs-cloud-04). What is the maximum speedup you expect given what you know about these CPUs?
  • Why might the number you observe be less than this ideal? Hint: Consider the characteristics of the computation you are performing.
  • What parts of the image present challenges for SIMD execution? Comparing the performance of rendering the different views of the Mandelbrot set may help confirm your hypothesis.
  • We remind you that for the code described in this subsection, the ISPC compiler maps gangs of program instances to SIMD instructions executed on a single core. This parallelization scheme differs from that of Problem 1, where speedup was achieved by running threads on multiple cores.

Part 2 - ISPC Tasks - 8 of 15 points

ISPC's SPMD execution model and the foreach mechanism facilitate the creation of programs that utilize SIMD processing. The language also provides the launch mechanism to utilize multiple cores in an ISPC computation, via a lightweight form of threading known as tasks.

See the launch command in the function mandelbrot_ispc_withtasks() in the file mandelbrot.ispc. This command launches multiple tasks (2 in the starter code). Each task defines a computation that will be executed by a gang of ISPC program instances. As given by the function mandelbrot_ispc_task(), each task computes a region of the final image. Similar to how the foreach construct defines loop iterations that can be carried out in any order (and in parallel by ISPC program instances), the tasks created by this launch operation can be processed in any order (and in parallel on different CPU cores).

What you need to do

  • Run mandelbrot_ispc with the commandline option "--tasks". What speedup do you observe on view 1? What is the speedup over the version of mandelbrot_ispc that does not partition the computation into tasks?

  • There is a simple way to improve the performance of mandelbrot_ispc --tasks by changing the number of tasks the code creates. By only changing code in the function mandelbrot_ispc_withtasks(), you should be able to achieve performance that exceeds the sequential version of the code by about 20-22 times! How did you determine how many tasks to create? Why does the number you chose work best?

Note: Your code must correctly handle the case where the number of rows in the image is not divisible by the number of tasks.

  • Extra Credit: (1 point) What are the differences between the Pthread abstraction (used in Problem 1) and the ISPC task abstraction? There are some obvious differences in semantics between the create/join and launch/sync mechanisms, but the implications of these differences are more subtle. Here's a thought experiment to guide your answer: what happens when you launch 10,000 ISPC tasks? What happens when you launch 10,000 pthreads?

The smart-thinking student’s question: Hey wait! Why are there two different mechanisms (foreach and launch) for expressing independent, parallelizable work to the ISPC system?

Couldn’t the system just partition the many iterations of foreach across all cores and also emit the appropriate SIMD code for the cores? Answer: Great question! And there are a lot of possible answers. We’ll talk more in lecture.

What you need to turn in

  • Your report must contain answers to the questions listed above.

  • Your report should describe performance gains you get from both SIMD and threaded parallelism.


Problem 4 Iterative Square Root - 10 points

# Build and run (the source is in prob4_sqrt)
cd $ASSIGNMENT/build
./prob4_sqrt/sqrt_ispc.bin --help
Usage: ./prob4_sqrt/sqrt_ispc.bin [options]
Program Options:
  -d  --data r|g|b   Initialize with random|good|bad data
  -?  --help         This message

Problem 4 concerns the program sqrt, generated from an ISPC program that computes the square root of 20 million random numbers between 0 and 3. For a value s, it uses the iterative Newton's method (named after Isaac Newton) to solve the equation 1/x^2 = s. This yields x = sqrt(1/s); multiplying x by s then gives an approximation to sqrt(s). The value 1.0 is used as the initial guess in this implementation. The graph below shows the number of iterations required for the program to converge to an accurate solution for values in the open interval (0, 3). (The implementation does not converge for inputs outside this range.) Notice how the rate of convergence depends on how close the solution is to the initial guess.

[Graph: iterations required for Newton's method to converge, as a function of the input value in (0, 3)]
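For reference, a scalar version of the iteration described above might look like the sketch below; the update rule comes from applying Newton's method to f(x) = 1/x^2 - s, while the function name and convergence threshold are illustrative assumptions, not the handout's exact code.

// Scalar sketch of the iteration (illustrative; see the handout's
// serial sqrt code for the authoritative version). Newton's method on
// f(x) = 1/x^2 - s converges to 1/sqrt(s); multiplying the result by s
// yields sqrt(s).
#include <cmath>

float sqrtNewton(float s) {
    const float kThreshold = 0.00001f;  // accuracy target (assumed)
    float x = 1.f;                      // initial guess from the text
    while (fabsf(x * x * s - 1.f) > kThreshold) {
        x = 0.5f * (3.f * x - s * x * x * x);  // x <- x*(3 - s*x^2)/2
    }
    return s * x;  // s * (1/sqrt(s)) = sqrt(s)
}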

What you need to do

  1. Build and run sqrt. Report the ISPC implementation speedup for a single CPU core (no tasks) and when using all cores (with tasks). What is the speedup due to SIMD parallelization? What is the speedup due to multi-core parallelization?
  2. Modify the function initGood() in the file data.cpp to generate data that will yield a very high relative speedup of the ISPC implementations. Describe why these input data maximize speedup over the sequential version and report the resulting speedup achieved (for both the with- and without-tasks ISPC implementations). You can test this version with the commandline argument --data g. Does your modification improve SIMD speedup? Does it improve multi-core speedup? Please explain why.
  3. Modify the function initBad() in the file data.cpp to generate data that will yield a very low (less than 1.0) relative speedup of the ISPC implementations. Describe why these input data cause the SIMD code to have very poor speedup over the sequential version and report the resulting speedup achieved (for both the with- and without-tasks ISPC implementations). You can test this version with the commandline argument --data b. Does your modification improve multi-core speedup? Please explain why.

Notes and comments: When running your "very-good-case input", take a look at the benefit of multi-core execution.

What you need to hand in

  1. Provide explanations and answers in your report.

Test Machines

Your programs have to pass validation on Travis, cs-cloud-02, and cs-cloud-04. Those are the only machines we will be using for tests.