# Change line 379: $REPO/gauss-parallel/gauss_thread.cpp from
## CORRECT
error = 0.0
## INCORRECT 
error = 0xdeadbeef
The main goal of this assignment is to gain the experience of developing a parallel program in pthreads and understand the synchronization and work balance issues with irregular programs. Additionally, you will study how the parallel performance is affected by your parallelization strategy (including task decomposition, assignment, and synchronization management) and the multiprocessor platform.
_ Running and Collecting results may take upto a couple of hrs so get your measurements early; do not wait till the last minute. _
_ Make sure ncpus set in the slurm job matches the -p parameter in the applications _
As a preparation, your first task is to run a sample program (Successive Over Relaxation) that we provide to you. The sequential version of the program is available at sor/sor_seq.bin and the pthreads parallel version is available at sor/sor_thread.bin. Please choose a multiprocessor machine and generate performance results for the sequential version of the program and the pthreads version at a range of processor numbers (at least to include 1, 2, 4, 8, 16). Besides running the program, you second task is to understand the assignment of tasks to the processors in the parallel program.
mkdir -p build
cd build
cmake ../; make
./sor/sor_seq.bin -m32768 -n2048 -i5 -v 0 
./sor/sor_seq.bin -m32768 -n2048 -i5 -v 1 # Verification. Dumps 100/200MB files. seq_res. Clean out after running.
./sor/sor_thread.bin -p 4 -m4096 -n2048 -i400 -v 1 
| Param | Description | 
|---|---|
| -m,-n | Creates grid of size M+2 x 2N+2 | 
| -p | the number of threads. | 
| -v | 0 (no check) 1 (check). verify dumps huge files. clean it out after running | 
Solves a M+2 by 2N+2 array the overall matrix is separated to one M+2 by N+1 red matrix,and one M+2 by N+1 black marix. And the two matrices form the overall matrix in a checkerboard pattern fashion as illustrated:
 *
 * 	-------------------------
 *	|R(1,0)	|B(1,0)	|R(1,1)	|
 *	-------------------------
 *	|B(2,0)	|R(2,0)	|B(2,1)	|
 *	-------------------------
 *	|R(3,0)	|B(3,0)	|R(3,1)	|
You are asked to develop a parallel Gaussian Elimination (with partial pivoting) program using pthreads (see notes/ repo). Updated class notes can be found here. In the context of solving a system of linear equations, Gaussian Elimination is a classical method for reducing an equation matrix into an equivalent upper-diagonal matrix. The full solution requires an additional back-solver, which has a lower computation complexity and thus typically requires much less time than the Gaussian Elimination step. In this assignment, you only need to parallelize the Gaussian Elimination step.
A sequential version of Gaussian Elimination is provided to you. It is available at gauss-seq The program takes one parameter as the input matrix. Currently six matrices (five real-world sparse matrices and one artificially generated dense matrix) are available for your testing.
After developing and testing your program, you should measure the performance/speedup of your program with the input matrices on multipple threading levels (2—16).
You should analyze the speedup results. In addition, I expect that you have tried multiple ways of realizing parallel Gaussian Elimination (in terms of task decomposition, assignment, or synchronization management). You need to provide a comparison on at least two alternative approaches (pick a comparison that you have learned the most from) and analyze their performance. We specifically recommend two approaches
We only care about the timing of the Gaussian Elimination step of your program. In your parallel program, please use barrier (see the sor code) to properly synchronize all processors at the beginning and end of Gaussian Elimination for timing. Code for using the high resolution timer on Intel processors is available at common/ (this is the same timer interface as assignment 2). To acquire accurate timing, you’ll need to make sure that the machine has enough number of processors for your experiments.
| Param | Description | 
|---|---|
| -m | location of your matrix file | 
| -p | the number of threads. | 
| -s | to use Stripe method or not. Default is Cyclical. | 
For example: This command run the Gaussian elimination on sherman3 matrix with 4 threads and Stripe method
mkdir build
cmake ../
make
./gauss-parallel/gauss_thread.bin -m ../input_matrics/sherman3.dat -p 4 -s
./gauss-seq/gauss_seq.bin /scratch/assignment-barriers/input
The program takes one parameter as the input matrix.
Six matrices (extracted from real-world problems) are available for your testing:
| input/jpwh_991.dat (size: 94KB, 991 rows) | |
| input/orsreg_1.dat (size: 247KB,2205 rows) | |
| input/sherman5.dat (size: 401KB,3312 rows) | |
| input/saylr4.dat (size: 392KB,3564 rows) | |
| input/sherman3.dat (size: 355KB,5005 rows) | |
| input/memplus.dat (size: 2702KB,17758 rows) | |
| matrix_2000.dat (size:80MB) | |
PS: memplus.dat has a lot of rows (~17000). i.e., takes a long time to run and is only available on the cs-cloud machines. You are not required to report results from matrix_2000 or memplus.
Report.