Grace Hopper developed the first compiler for a computer programming language.

Errata

# Change line 379: $REPO/gauss-parallel/gauss_thread.cpp from
## CORRECT
error = 0.0
## INCORRECT 
error = 0xdeadbeef

Reference file

Github clone link

Click on on link to clone

The main goal of this assignment is to gain the experience of developing a parallel program in pthreads and understand the synchronization and work balance issues with irregular programs. Additionally, you will study how the parallel performance is affected by your parallelization strategy (including task decomposition, assignment, and synchronization management) and the multiprocessor platform.

_ Running and Collecting results may take upto a couple of hrs so get your measurements early; do not wait till the last minute. _

_ Make sure ncpus set in the slurm job matches the -p parameter in the applications _

Part 1 - SOR Successive over relaxation - 20 points

As a preparation, your first task is to run a sample program (Successive Over Relaxation) that we provide to you. The sequential version of the program is available at sor/sor_seq.bin and the pthreads parallel version is available at sor/sor_thread.bin. Please choose a multiprocessor machine and generate performance results for the sequential version of the program and the pthreads version at a range of processor numbers (at least to include 1, 2, 4, 8, 16). Besides running the program, you second task is to understand the assignment of tasks to the processors in the parallel program.

mkdir -p build
cd build
cmake ../; make
./sor/sor_seq.bin -m32768 -n2048 -i5 -v 0 
./sor/sor_seq.bin -m32768 -n2048 -i5 -v 1 # Verification. Dumps 100/200MB files. seq_res. Clean out after running.
./sor/sor_thread.bin -p 4 -m4096 -n2048 -i400 -v 1 

Param	Description
-m,-n	Creates grid of size M+2 x 2N+2
-p	the number of threads.
-v	0 (no check) 1 (check). verify dumps huge files. clean it out after running

Solves a M+2 by 2N+2 array the overall matrix is separated to one M+2 by N+1 red matrix,and one M+2 by N+1 black marix. And the two matrices form the overall matrix in a checkerboard pattern fashion as illustrated:

 *
 * 	-------------------------
 *	|R(1,0)	|B(1,0)	|R(1,1)	|
 *	-------------------------
 *	|B(2,0)	|R(2,0)	|B(2,1)	|
 *	-------------------------
 *	|R(3,0)	|B(3,0)	|R(3,1)	|

Part 2 Gauss Parallel - 80 points (40 per style of parallelism)

You are asked to develop a parallel Gaussian Elimination (with partial pivoting) program using pthreads (see notes/ repo). Updated class notes can be found here. In the context of solving a system of linear equations, Gaussian Elimination is a classical method for reducing an equation matrix into an equivalent upper-diagonal matrix. The full solution requires an additional back-solver, which has a lower computation complexity and thus typically requires much less time than the Gaussian Elimination step. In this assignment, you only need to parallelize the Gaussian Elimination step.

A sequential version of Gaussian Elimination is provided to you. It is available at gauss-seq The program takes one parameter as the input matrix. Currently six matrices (five real-world sparse matrices and one artificially generated dense matrix) are available for your testing.

After developing and testing your program, you should measure the performance/speedup of your program with the input matrices on multipple threading levels (2—16).

You should analyze the speedup results. In addition, I expect that you have tried multiple ways of realizing parallel Gaussian Elimination (in terms of task decomposition, assignment, or synchronization management). You need to provide a comparison on at least two alternative approaches (pick a comparison that you have learned the most from) and analyze their performance. We specifically recommend two approaches

A cyclic assignment of rows to threads (e.g., Thread 0 works on Row 0, Thread 1 works on row 1…)
A block assignment of rows to threads (e.g., Thread 0 works on Row 0…3, Thread 1 works on row 4…)

We only care about the timing of the Gaussian Elimination step of your program. In your parallel program, please use barrier (see the sor code) to properly synchronize all processors at the beginning and end of Gaussian Elimination for timing. Code for using the high resolution timer on Intel processors is available at common/ (this is the same timer interface as assignment 2). To acquire accurate timing, you’ll need to make sure that the machine has enough number of processors for your experiments.

Parameters for the gauss binary

Param	Description
-m	location of your matrix file
-p	the number of threads.
-s	to use Stripe method or not. Default is Cyclical.

For example: This command run the Gaussian elimination on sherman3 matrix with 4 threads and Stripe method

mkdir build
cmake ../
make
./gauss-parallel/gauss_thread.bin -m ../input_matrics/sherman3.dat -p 4 -s
./gauss-seq/gauss_seq.bin /scratch/assignment-barriers/input

Input data set

The program takes one parameter as the input matrix.

Six matrices (extracted from real-world problems) are available for your testing:

input/jpwh_991.dat (size: 94KB, 991 rows)
input/orsreg_1.dat (size: 247KB,2205 rows)
input/sherman5.dat (size: 401KB,3312 rows)
input/saylr4.dat (size: 392KB,3564 rows)
input/sherman3.dat (size: 355KB,5005 rows)
input/memplus.dat (size: 2702KB,17758 rows)
matrix_2000.dat (size:80MB)

PS: memplus.dat has a lot of rows (~17000). i.e., takes a long time to run and is only available on the cs-cloud machines. You are not required to report results from matrix_2000 or memplus.

Submission

Report.

Successive Over Relaxation

Speedup plot (vary threads. Try different thread configurations e.g., more threads than even CPUs. Set number of CPUs in slurm job, less that -p parameter for your program). (5 points)
Problem size – Vary the problem size and plot speedup (5 points)
Plot what is happening on each thread at fixed intervals i.e., track what function a thread is executing over time.
Try smaller problem sizes and plot speedup curves (5 points)

Gauss Elimination

Plot the time breakdown for the sequential code and analyze it. (5)
Describe striped vs block implementation, the challenges and benefits of each approach. (5)
Describe the synchronization patterns (5).
Problem size – Vary the problem size and plot speedup with varying threads (10 points). Speedup when matrix size is small (matrix 991)
Plot how long each thread waits to start working for both striped vs block (5) sherman3
Plot cs-cloud-04 vs cs-cloud-02. Speedup plots for (matrix 991 and memplus).