Assignment on graphs and parallelization.
In this assignment, we will learn how to parallelize simple programs using C++11 threads. There are three problems in this assignment, and for each problem you are provided the serial C++ implementation, the expected parallelization strategy, and the expected output to be generated by your parallel solution.
```shell
mkdir build
cd build; cmake ../; make
# This builds the serial versions; the parallel versions will throw an error until you complete them
```
Your parallel solutions go in the `*_parallel.cpp` files. All parallel programs accept the command-line argument `--nWorkers` to specify the number of threads (for example, `--nWorkers 4`). While testing your solutions, make sure that `cpus-per-task` is correctly specified in your slurm config file based on your requirement.

Sample outputs are available in the `sample_outputs` directory. Programs will be evaluated and graded automatically, so make sure that your program output strictly follows the sample output format. We have already included tester-compatible print statements in the `*_parallel.cpp` files.
Time is measured with the provided `timer` class:

```cpp
timer t1;
t1.start();
/* ---- Code region whose time is to be measured ---- */
double time_taken = t1.stop();
```
Certain programs operate on graph datasets. Sample input graphs are available at `/scratch/assignment1/input_graphs/` (only available on cs-cloud-02 and cs-cloud-04).
Test Graphs: If you’d like to test your solution with more graph datasets, you can create your own simple graphs (e.g., under `pagerank/maps/`) as follows. Create a file called `testGraph.txt` with the list of edges in the graph, one edge per line in `<source> <destination>` form. For example:

```
1 2
2 3
```

Then run `/scratch/assignment1/input_graphs/SNAPtoBinary testGraph.txt testGraphConverted`. This will create `testGraphConverted.csr` and `testGraphConverted.csc` files, which are the CSR and CSC representations of the graph. Pass the converted graph to your program with `--inputFile "testGraphConverted"`.
Graph Inputs: Maps tend to be big, so we have checked in only a single map, roadNet-CA, in the repository. Additional maps (`lj`, `test_25M_50M`) are available on cs-cloud-02 and cs-cloud-03.
Testing: We have provided test scripts for you to quickly test your solutions during your development process. You can test your code using the test scripts available at `pi/scripts/`, `pagerank/scripts/`, and `triangle/scripts/`. Note that these test scripts only validate the output formats; a different evaluation script will be used for grading the assignments. Important: you should use slurm when performing these and other tests. The test scripts are set up for 4 threads; make sure `--cpus-per-task=4` is set in your slurm job. _The reported timings in the test script are also measured on cs-cloud-02 (your report should include measurements from cs-cloud-02 and cs-cloud-04)._
[10 Points]
The value of Pi (≈ 3.14159) can be estimated using the Monte Carlo method as described below:

- Consider a circle of radius `0.5` units inscribed in a unit square (each side is `1` unit).
- The ratio of the circle's area to the square's area is `(pi * r * r) / (2r * 2r) = pi / 4`.
- Generate `n` random points inside the square, and let `c` of the `n` points fall inside the circle. Then `pi / 4 = c / n ==> pi = 4 * c / n`.

The program below implements the above algorithm.
```cpp
uint circle_count = 0;
double x_coord, y_coord;
for (uint i = 0; i < n; i++) {
  x_coord = (2.0 * get_random_coordinate(&random_seed)) - 1.0;
  y_coord = (2.0 * get_random_coordinate(&random_seed)) - 1.0;
  if ((sqr(x_coord) + sqr(y_coord)) <= 1.0)
    circle_count++;
}
double pi_value = 4.0 * (double)circle_count / (double)n;
```
Our goal is to parallelize the above algorithm. Specifically, we are interested in parallelizing the for loop such that each thread generates (approximately) `n/T` points, where `T` is the number of threads. Below is the pseudo-code showing the logic of our parallel solution:
```
Create T threads
for each thread in parallel {
    Get the circle_count for (approximately) n/T points
}
total_circle_points = Accumulate the circle_counts from all threads
pi_value = 4.0 * total_circle_points / n
```
The serial implementation is available in `pi_serial.cpp`. You have to parallelize the given serial implementation using C++11 threads. Your parallel solution must satisfy the following:

- It should be named `pi_parallel.cpp`.
- Each thread should be given a thread id in `[0, T)`.
- Its output should match `sample_outputs/pi_calculation.output`.

Please note that the output format should strictly match the expected format (including “spaces” and “commas”). You can test your code using the test script as follows:
We have provided a test script (under `pi/scripts/`) so you can quickly test your solution during development. Note that it only validates the output format; a different evaluation script will be used for grading. Important: you should use slurm when performing these and other tests. The test scripts test for up to 4 threads; make sure `--cpus-per-task=4` is set in your slurm job.
```shell
cd build
./pi/pi_parallel.bin --nWorkers 4 --nPoints 500000007
TESTER=$PWD/../pi; python2.7 $TESTER/scripts/pi_calculation_tester.pyc --execPath=$PWD/pi/pi_parallel.bin --scriptPath=$TESTER/scripts/pi_calculation_evaluator.pyc
# execPath is the absolute path of your pi parallel binary
# scriptPath is the absolute path of pi/scripts/pi_calculation_evaluator.pyc
```
[10 Points]
The number of triangles in a graph can be computed by counting the number of triangles formed by each edge in the graph. For an edge `(u, v)`, the number of triangles it forms is `|A ∩ B|`, where `A` is the set of in-neighbors of `u` excluding `v`, and `B` is the set of out-neighbors of `v` excluding `u`.

The program below counts the number of triangles in a given graph (`count_triangles_between(u, v)` computes the above formula).
```
triangle_count = 0
for u in Graph.all_vertices {
    for v in u.outNeighbors() {
        triangle_count += count_triangles_between(u, v)
    }
}
triangle_count = triangle_count / 3  // divide by 3 to get the number of unique triangles
```
Our goal is to parallelize the above algorithm such that each thread works on a sub-graph. Below is the pseudo-code showing the logic of our parallel solution:
```
Create T threads
for each thread in parallel {
    Compute the number of triangles created by the vertices allocated to the thread
}
triangle_count = Accumulate the triangle counts from all the threads
triangle_count = triangle_count / 3
```
The serial implementation is available in `triangle_counting.cpp`. You have to parallelize the given serial implementation using C++11 threads. Your parallel solution must satisfy the following:

- It should be named `triangle_counting_parallel.cpp`.
- The input graph file should be specified using the command-line parameter `--inputFile` (similar to the serial code).
- Each thread should be given a thread id in `[0, T)`.
- Its output should match `sample_outputs/pi_calculation.output`.

Please note that the output format should strictly match the expected format (including “spaces” and “commas”). You can test your code using the test script as follows:
```shell
cd build
./triangle/triangle_parallel.bin --nWorkers 4 --inputFile /scratch/assignment1/input_graphs/roadNet-CA
python2.7 ../triangle/scripts/triangle_tester.pyc --execPath=$PWD/triangle/triangle_parallel.bin --scriptPath=$PWD/../triangle/scripts/triangle_evaluator.pyc --mapPath=/scratch/assignment1/input_graphs/ --map=roadNet-CA
# execPath is the absolute path of your triangle parallel binary
# scriptPath is the absolute path of triangle/scripts/triangle_evaluator.pyc
# mapPath is the absolute path of the directory containing the maps
```
_The python autograding WILL NOT WORK IF YOU DECIDE TO GENERATE YOUR OWN GRAPH._
_Map options: roadNet-CA, lj, or test_25M_50M._
[20 Points]
Given a graph, the PageRank of a vertex v
is computed as:
pagerank[v] = (1 - DAMPING) + (DAMPING * Σ(pagerank[u] / out_degree[u]))
where `DAMPING` is a constant set to `0.85`, and `Σ` represents summation over all incoming edges `(u, v)` of `v`.

Note that `v` might be an incoming neighbor of some other vertex `w`, and hence, we cannot simply overwrite `pagerank[v]` before `pagerank[w]` gets computed. To avoid this, we separate the pagerank values on the left of the equation from the pagerank values on the right of the equation, as shown here:

pagerank_next[v] = (1 - DAMPING) + (DAMPING * Σ (pagerank_curr[u] / out_degree[u]))

where `pagerank_next[v]` is the new pagerank of `v` and `pagerank_curr[u]` is the old pagerank of `u`.
The pagerank of all the vertices in the graph is computed repeatedly for a certain number of iterations. In this problem, we will look at an implementation of pagerank where each vertex `u` pushes its pagerank value to its outgoing neighbors, after which the cumulative value received by each vertex is used to compute the new pagerank of that vertex. The serial code shown below repeats this process for `max_iters` iterations.
```cpp
// Initialization
for (uintV i = 0; i < n; i++) {
  pr_curr[i] = 1.0;
  pr_next[i] = 0.0;
}
// ----------------------------------------------------------------
for (int iter = 0; iter < max_iters; iter++) {
  for (uintV u = 0; u < n; u++) {
    uintE out_degree = g.vertices_[u].getOutDegree();
    for (uintE i = 0; i < out_degree; i++) {
      uintV v = g.vertices_[u].getOutNeighbor(i);
      pr_next[v] += (pr_curr[u] / out_degree);
    }
  }
  for (uintV u = 0; u < n; u++) {
    // pr_next[u] contains the cumulative value received by u
    pr_next[u] = (1 - DAMPING) + (DAMPING * pr_next[u]);
    // Update pr_curr and reset pr_next for the next iteration
    pr_curr[u] = pr_next[u];
    pr_next[u] = 0.0;
  }
}
// ----------------------------------------------------------------
```
Our goal is to parallelize the above algorithm. Specifically, we are interested in parallelizing the region demarcated by the `// ----` comments such that each thread works on a sub-graph. Below is the pseudo-code showing the logic of our parallel solution:
```
Create T threads
for (i = 0; i < max_iterations; i++) {
    for each thread in parallel {
        for each vertex 'u' allocated to the thread {
            for vertex 'v' in outNeighbor(u)
                next_page_rank[v] += (current_page_rank[u] / outdegree[u])
        }
    }
    for each thread in parallel {
        for each vertex 'v' allocated to the thread {
            compute the new_pagerank using the accumulated values in next_page_rank[v]
            current_page_rank[v] = new_pagerank
            Reset next_page_rank[v] to 0
        }
    }
}
```
Key things to note:

- Observe the variables shared by multiple threads. Ensure that access to shared resources is properly synchronized using locks; use `std::mutex` as needed.
- Observe the separation between the two for loops. Each thread should start processing the second for loop only after all threads have finished the first for loop. You will need a barrier to achieve this. We have provided `CustomBarrier` in `utils.h`, which you can use as shown below:
```cpp
CustomBarrier my_barrier(4); // Create a barrier object. 4 --> number of workers/threads
// Share this my_barrier object with all the threads
// ---
// Inside the thread function, wait on the barrier as follows:
my_barrier.wait();
// ---
```
The serial implementation is available in `page_rank.cpp`. You have to parallelize the given serial implementation using C++11 threads and locks (`std::mutex`). Your parallel solution must satisfy the following:

- It should be named `pagerank_parallel.cpp`.
- The input graph file should be specified using the command-line parameter `--inputFile` (similar to the serial code).
- Each thread should be given a thread id in `[0, T)`.
- Its output should match `sample_outputs/page_rank.output`.

Please note that the output format should strictly match the expected format (including “spaces” and “commas”). You can test your code using the test script as follows:
```shell
cd build
./pagerank/pagerank_parallel.bin --nWorkers 4 --nIterations 10 --inputFile /scratch/assignment1/input_graphs/lj
python2.7 ../pagerank/scripts/pagerank_tester.pyc --execPath=$PWD/pagerank/pagerank_parallel.bin --scriptPath=$PWD/../pagerank/scripts/pagerank_evaluator.pyc --mapPath=/scratch/assignment1/input_graphs/ --map=roadNet-CA
python2.7 ../pagerank/scripts/pagerank_tester.pyc --execPath=$PWD/pagerank/pagerank_parallel.bin --scriptPath=$PWD/../pagerank/scripts/pagerank_evaluator.pyc --mapPath=/scratch/assignment1/input_graphs/ --map=lj
# Note that the tester script invocation is different if you use a map name other than roadNet-CA
```
Note that the floating-point implementation is the default (i.e., it doesn’t require any flags). The output of your program can be verified by comparing the sum of all the pagerank values with that generated by the serial implementation. It is important to note that floating-point arithmetic is not associative, so there may be minor variation in the pagerank values compared to the serial implementation. For quick verification, we have also provided an integer version of the program. To build the integer version of PageRank, pass `-DUSE_INT=1` to cmake as follows:
```shell
cd build; rm -rf *;
cmake -DUSE_INT=1 ../
make
# Sample parallel invocation:
# ./pagerank/pagerank_parallel.bin --nWorkers 4 --nIterations 10 --inputFile /scratch/assignment1/input_graphs/roadNet-CA
```
@copyright Arrvindh Shriraman and Keval Vora