


Assignment on graphs and parallelization.

In this assignment, we will learn how to parallelize simple programs using C++11 threads. There are three problems in this assignment, and for each problem you are provided the serial C++ implementation, the expected parallelization strategy, and the expected output to be generated by your parallel solution.

Refs

  • C++11 revision demos: https://github.com/CMPT-431-SFU/Cpp11-Demos
  • C++ Concurrency Book
  • Before starting this assignment, you should have completed the Slurm Tutorial, which walks you through how to use our servers for your code development.
  • Assignment 2 Slides

General Instructions

  1. You are provided with the serial version of all the programs. To build:
mkdir build
cd build; cmake ../; make
# This will build the serial versions; the parallel versions will throw an error until you complete them
  2. We have provided a template for parallel programs in all *_parallel.cpp files. All parallel programs take the command-line argument --nWorkers to specify the number of threads for the program. Example: --nWorkers 4. While testing your solutions, make sure that cpus-per-task is correctly specified in your slurm config file based on your requirement.
  3. The templated code has a number of TODOs: you can simply use the provided template and fill in the TODOs to complete the assignment, OR you can modify the template however you see fit. Your output has to match the test script’s expectations (see below).
  4. Sample outputs for all the programs can be found in the sample_outputs directory. Programs will be evaluated and graded automatically. Please make sure that your program output strictly follows the sample output format. We have already included the tester-compatible print statements in the *_parallel.cpp files.
  5. You have also been asked to print the time spent by different threads on specific code regions. The time spent by any code region can be computed as follows:
        timer t1;
        t1.start();
        /* ---- Code region whose time is to be measured --- */
        double time_taken = t1.stop();
  6. Certain programs operate on graph datasets. Sample input graphs are available at /scratch/assignment1/input_graphs/ (only available on cs-cloud-02 and cs-cloud-04).

  7. Test Graphs If you’d like to test your solution with more graph datasets, you can create your own simple graphs as follows:

    • We have checked in a sample graph roadNet-CA (under pagerank/maps/)
    • Create a file called testGraph.txt with the list of edges (one edge on each line in “<source> <destination>” form) in the graph. For example,

      1 2
      2 3
      
    • Run /scratch/assignment1/input_graphs/SNAPtoBinary testGraph.txt testGraphConverted. This will create testGraphConverted.csr and testGraphConverted.csc files which are CSR and CSC representations of the graph.
    • To use the graphs in your solutions, use the command line argument --inputFile "testGraphConverted".
  8. Graph Inputs Maps tend to be big, so we have checked in only a single map, roadNet-CA, on the repository. Additional maps (lj, test_25M_50M) are available on cs-cloud-02 and cs-cloud-04.

  9. Testing We have provided test scripts for you to quickly test your solutions during your development process. You can test your code using the test scripts available at pi/scripts/, pagerank/scripts/, and triangle/scripts/. Note that these test scripts only validate the output formats; a different evaluation script will be used for grading the assignments. Important: you should use slurm when performing these and other tests. The test scripts are set up for 4 threads; make sure --cpus-per-task=4 is set in your slurm job. The reported timings in the test script are also measured on cs-cloud-02 (your report should include measurements from 02 and 04).

Problem 1. Monte Carlo Pi Estimation

[10 Points]

The value of Pi (3.14159) can be estimated using the Monte Carlo method as described below:

  1. Consider a circle of radius 0.5 units that is inscribed in a unit square (each side is 1 unit).
  2. The ratio of their areas is: (pi * r * r) / (2r * 2r) = pi / 4.
  3. We randomly generate n points inside the square. Let c out of the n points fall inside the circle.
  4. Pi is then approximated as: pi / 4 = c / n    ==>    pi = 4 * c / n.

The program below implements the above algorithm.

        uint circle_count = 0;
        double x_coord, y_coord;
        for (uint i = 0; i < n; i++) {
            x_coord = (2.0 * get_random_coordinate(&random_seed)) - 1.0;
            y_coord = (2.0 * get_random_coordinate(&random_seed)) - 1.0;
            if ((sqr(x_coord) + sqr(y_coord)) <= 1.0)
                circle_count++;
        }
        double pi_value = 4.0 * (double)circle_count / (double)n;

Our goal is to parallelize the above algorithm. Specifically, we are interested in parallelizing the for loop such that each thread generates (approximately) n/T points, where T is the number of threads. Below is the pseudo-code showing the logic of our parallel solution:

        Create T threads
        for each thread in parallel {
            Get the circle_count for (approximately) n/T points
        }
        total_circle_points = Accumulate the circle_counts from all threads
        pi_value = 4.0 * total_circle_points / n;

The serial implementation is available in pi_serial.cpp. You have to parallelize the given serial implementation using C++11 threads.

Your parallel solution must satisfy the following:

  1. The file should be named pi_parallel.cpp.
  2. Your parallel solution must output the following information:
    • Total number of threads used.
    • For each thread: the number of random points generated, the number of points within the circle, and the time taken to generate and process these points (your threads should be numbered between [0, T)).
    • The total number of points generated.
    • The total number of points within the circle.
    • The total time taken for the entire execution (the code region to be timed is highlighted using comments in the serial code).
  3. The sample output can be found in sample_outputs/pi_calculation.output.

Please note that the output format should strictly match the expected format (including “spaces” and “commas”). You can test your code using the test script as follows:

Testing

We have provided test scripts for you to quickly test your solutions during your development process. You can test your code using the test script available at pi/scripts/. Note that these test scripts only validate the output formats; a different evaluation script will be used for grading the assignments. Important: you should use slurm when performing these and other tests. The test scripts test for up to 4 threads; make sure --cpus-per-task=4 is set in your slurm job.

cd build
./pi/pi_parallel.bin --nWorkers 4 --nPoints 500000007
TESTER=$PWD/../pi; python2.7 $TESTER/scripts/pi_calculation_tester.pyc --execPath=$PWD/pi/pi_parallel.bin --scriptPath=$TESTER/scripts/pi_calculation_evaluator.pyc
# execPath is the absolute path of your pi parallel binary
# scriptPath is the absolute path of pi/scripts/pi_calculation_evaluator.pyc

Problem 2. Triangle Counting

[10 Points]

The number of triangles in a graph can be computed by counting the number of triangles formed by each edge in the graph. For an edge (u, v), the number of triangles it forms is given by |A ∩ B|, where A is the set of inNeighbors of u excluding v, and B is the set of outNeighbors of v excluding u.

The program below counts the number of triangles in a given graph (count_triangles_between(u, v) computes the above formula).

        triangle_count = 0
        for u in Graph.all_vertices {
            for v in u.outNeighbors() {
                triangle_count += count_triangles_between(u, v)
            }
        }
        triangle_count = triangle_count / 3 // divide by 3 to get the number of unique triangles

Our goal is to parallelize the above algorithm such that each thread works on a sub-graph. Below is the pseudo-code showing the logic of our parallel solution:

        Create T threads
        for each thread in parallel {
            Compute the number of triangles created by the vertices allocated to the thread
        }
        triangle_count = Accumulate the triangle counts from all the threads
        triangle_count = triangle_count / 3

The serial implementation is available in triangle_counting.cpp. You have to parallelize the given serial implementation using C++11 threads.

  1. The file should be named triangle_counting_parallel.cpp. The input graph file should be specified using the command-line parameter --inputFile (similar to the serial code).
  2. Your parallel solution must output the following information:
    • Total number of threads used.
    • For each thread: the number of triangles counted and the time taken to count the triangles (your threads should be numbered between [0, T)).
    • The total number of triangles in the graph.
    • The total number of unique triangles in the graph.
    • The total time taken for the entire execution (the code region to be timed is highlighted using comments in the serial code).
  3. The sample console output can be found in the sample_outputs directory.

Please note that the output format should strictly match the expected format (including “spaces” and “commas”). You can test your code using the test script as follows:

Testing

cd build
./triangle/triangle_parallel.bin --nWorkers 4 --inputFile /scratch/assignment1/input_graphs/roadNet-CA
python2.7 ../triangle/scripts/triangle_tester.pyc --execPath=$PWD/triangle/triangle_parallel.bin --scriptPath=$PWD/../triangle/scripts/triangle_evaluator.pyc --mapPath=/scratch/assignment1/input_graphs/ --map=roadNet-CA
# execPath is the absolute path of your triangle parallel binary
# scriptPath is the absolute path of triangle/scripts/triangle_evaluator.pyc
# mapPath is the absolute path of the directory containing the maps.

The python autograding WILL NOT WORK IF YOU DECIDE TO GENERATE YOUR OWN GRAPH.

Map options: roadNet-CA, lj, or test_25M_50M.

Problem 3. PageRank

[20 Points]

Given a graph, the PageRank of a vertex v is computed as:

pagerank[v] = (1 - DAMPING) + (DAMPING * Σ(pagerank[u] / out_degree[u]))

where DAMPING is a constant set to 0.85, and Σ represents summation over all incoming edges (u, v) of v.

Note that v might be an incoming neighbor of some other vertex w, and hence, we cannot simply overwrite pagerank[v] before pagerank[w] gets computed. To handle this, we separate the pagerank values on the left of the equation from the pagerank values on the right of the equation, as shown here:

    pagerank_next[v] = (1 - DAMPING) + (DAMPING * Σ (pagerank_curr[u] / out_degree[u]))

where pagerank_next[v] is the new pagerank of v and pagerank_curr[u] is the old pagerank of u.

The pagerank of every vertex in the graph is recomputed for a certain number of iterations. In this problem, we will look at an implementation of pagerank where each vertex u pushes its pagerank value to its outgoing neighbors, after which the cumulative value received by each vertex is used to compute the new pagerank of that vertex. The serial code shown below repeats this process for max_iters iterations.

       // Initialization
       for (uintV i = 0; i < n; i++) {
           pr_curr[i] = 1.0;
           pr_next[i] = 0.0;
       }

       // ----------------------------------------------------------------
       for (int iter = 0; iter < max_iters; iter++) {
           for (uintV u = 0; u < n; u++) {
               uintE out_degree = g.vertices_[u].getOutDegree();
               for (uintE i = 0; i < out_degree; i++) {
                   uintV v = g.vertices_[u].getOutNeighbor(i);
                   pr_next[v] += (pr_curr[u] / out_degree);
               }
           }
           for (uintV u = 0; u < n; u++) {
               // pr_next[u] contains the cumulative value received by u
               pr_next[u] = (1 - DAMPING) + (DAMPING * pr_next[u]);
               // Update pr_curr and reset pr_next for the next iteration
               pr_curr[u] = pr_next[u];
               pr_next[u] = 0.0;
           }
       }
       // ----------------------------------------------------------------

Our goal is to parallelize the above algorithm. Specifically, we are interested in parallelizing the for loop demarcated by comments (// ---) such that each thread works on a sub-graph. Below is the pseudo-code showing the logic of our parallel solution:

        Create T threads
        for(i=0; i<max_iterations; i++) {
            for each thread in parallel {
                for each vertex 'u' allocated to the thread {
                    for vertex 'v' in outNeighbor(u)
                        next_page_rank[v] += (current_page_rank[u]/outdegree[u])
                }
            }

            for each thread in parallel {
                for each vertex 'v' allocated to the thread {
                    compute the new_pagerank using the accumulated values in next_page_rank[v].
                    current_page_rank[v] = new_pagerank
                    Reset next_page_rank[v] to 0
                }
            }
        }

Key things to note:

  1. Observe the variables shared by multiple threads. Ensure that access to shared resources is properly synchronized using locks. Use std::mutex as needed.

  2. Observe the separation between the two for loops. The second for loop should be processed by each thread only after all the threads have finished the first for loop. You will need a barrier to achieve this. We have provided CustomBarrier in utils.h, which you can use as shown below:

        CustomBarrier my_barrier(4);   //Create a barrier object. 4 --> number of workers/threads
        // Share this my_barrier object with all the threads
        // ---
            // Inside the thread function, wait on the barrier as follows:
            my_barrier.wait();
        // ---

The serial implementation is available in page_rank.cpp. You have to parallelize the given serial implementation using C++11 threads and locks using std::mutex.

Your parallel solution must satisfy the following:

  1. The file should be named pagerank_parallel.cpp. The input graph file should be specified using the command-line parameter --inputFile (similar to the serial code).
  2. Your parallel solution must output the following information:
    • Total number of threads used.
    • For each thread: the time taken to compute pageranks (your threads should be numbered between [0, T)).
    • The sum of pageranks of all vertices.
    • The total time taken for the entire execution (the code region to be timed is highlighted using comments in the serial code).
  3. The sample console output can be found at sample_outputs/page_rank.output.

Please note that the output format should strictly match the expected format (including “spaces” and “commas”). You can test your code using the test script as follows:

Testing

cd build
./pagerank/pagerank_parallel.bin --nWorkers 4 --nIterations 10 --inputFile /scratch/assignment1/input_graphs/lj
python2.7 ../pagerank/scripts/pagerank_tester.pyc --execPath=$PWD/pagerank/pagerank_parallel.bin --scriptPath=$PWD/../pagerank/scripts/pagerank_evaluator.pyc --mapPath=/scratch/assignment1/input_graphs/ --map=roadNet-CA
python2.7 ../pagerank/scripts/pagerank_tester.pyc --execPath=$PWD/pagerank/pagerank_parallel.bin --scriptPath=$PWD/../pagerank/scripts/pagerank_evaluator.pyc --mapPath=/scratch/assignment1/input_graphs/ --map=lj
# Note that the tester invocation is different if you use a map name other than roadNet-CA

NOT REQUIRED (FOR YOUR OWN VERIFICATION)

Note that the floating-point implementation is the default (i.e., it doesn’t require any flags). The output of your program can be verified by comparing the sum of all the pagerank values with that generated by the serial implementation. It is important to note that floating-point addition is not associative, so there may be minor variation in the pagerank values compared to the serial implementation. For quick verification, we have also provided an integer version of the program. To run the integer version of PageRank, use the flag USE_INT=1 during make as follows:

cd build; rm -rf *;
cmake -DUSE_INT=1 ../
make
# Sample parallel invocation
# ./pagerank/pagerank_parallel.bin --nWorkers 4 --nIterations 10 --inputFile /scratch/assignment1/input_graphs/roadNet-CA

Submission Guidelines

  • Assignment commit
  • Report with speedup curves for each part (coursys). Speedup curve: fix all command-line parameters other than --nWorkers, vary --nWorkers from 1 to 16, and plot the execution time (or speedup) against the number of workers.
  • For triangle and pagerank, try different graphs.

@copyright Arrvindh Shriraman and Keval Vora