Project 2: Dynamic Dependence Profiling

This project is intended to help you get acquainted with dynamic program analysis and tools that can assist in implementing dynamic program analyses.

For this project, you'll be instrumenting a program to track dependences data dependences within a program at runtime. Similar profiling techniques have proven useful in helping developers to identify code that may be refactored to exploit parallelism or even to automatically refactor code to optimistically exploit parallelism.

This project only considers a narrow subset of the potential ways that dynamic dependences may assist parallelizing a program. In particular, you will identify potential parallelism in loops when each iteration of a loop may be run in parallel simultaneously instead of sequentially. You only need to consider accesses to memory via loads and stores, as dependences produced via registers could normally be determined through static analysis. While this may make your results a bit less sound in practice, you should feel free to handle dependences from registers as well, if you so desire.

The particular dependences that you will track identify different ways that one iteration of a loop may affect another iteration of a loop. Such iteration crossing dependences are collectively called loop carried dependences.

Flow dependences occur when a memory location a is written in one iteration and read in a later iteration. Thus the later iteration uses results from an earlier iteration and prevents parallelization.
Anti-dependences occur when a memory location is read in one iteration and written in a later iteration. Parallelizing the two iterations might allow the later write to move before the earlier read, so these also prevent parallelization.
Output dependences occur when a memory location is written in one iteration and written again by a later iteration. Parallelizing the iterations may allow the write of the earlier iteration to occur after the write of the later iteration, changing the values along any of its prior flow dependences. Thus, these also prevent parallelization.

Your program will track all of these dependences and the conflicts that they represent between loop iterations. Any loop without such conflicts is a candidate for parallelization. You shall compute the potential benefits from parallelizing these loops and present it as a result of your analysis.

While relatively new techniques exist to improve the performance, precision and generality of such profiling techniques [1][2][3], we shall use and older and simpler approach within this project [4]. The technique is nicely described in section 4, pages 3-5 of the original paper. To implement the technique, you will need to keep track of different types of events that may happen at runtime: loads & stores of memory, loop entry & exit & backedge (also known as latch) traversal, function entry & exit.

You will need to keep track of the stack of currently active loops as well as some additional information about each loop, such as the current iteration, the longest amount of time taken for any iteration to execute (clock_gettime suffices for time measurement), and a dynamic loop ID. As each dynamic loop instance is encountered during the execution of the program, it will be given a loop ID, starting with the ID 1. By tracking whether or not a loop has conflicts from loop carried dependences as well as the longest running iteration and total time taken by a loop, you can calculate the possible speedup from parallelizing a loop.

You will also need to to keep track of the call stack. When a function returns, you should forget the memory accesses of local variables within that function, as they may create false dependences that prevent parallelizing the iterations of a loop.

The speedup for a loop is the ratio of the original runtime to an optimistic, fully parallelized version. This can be broken into two case:

If the loop is conflict free, this is simply the ratio of the original loop time to the time taken by the longest running iteration. Note, the longest running iteration may have a deeper recursive speedup from nested dynamic loops.
If the loop has conflicts, then you should subtract from the original duration of the loop the improvements made by speeding up any directly nested loops.

The speed up of the overall program is simply the ratio of the original runtime versus the runtime with the time taken by top level loops replaced by their sped-up versions.

The final output of your analysis should list (1) the speed up ratio of the overall program and (2) the speed of ratio of each conflict-free loop within the analyzed execution, e.g.

Program speed up: 3.5
Loop 10: 15.4
Loop 12: 10.8

The precise rules for tracking dependences across iterations and recognizing conflicts are presented at the bottom of page 4 of the paper.

For sanity checking your implementation, you can consider the simple examples in test1.c and test2.c. These certainly don't comprise the full behavior that you may expect in real software, but they capture core problems in the underlying implementation. Note, these tests will behave differently when compiled with different optimization settings. That is, -O2 and -O3 will exhibit substantially different behavior than -O0.

Once again, there is no explicitly correct solution, as the solution that you create will balance design decisions in ways that affect not only which loops have conflicts, but the perceived speed up that may be gained regardless of conflicts.

Tools

The project may be completed using either LLVM and static instrumentation or Pin and dynamic binary instrumentation. For your convenience, I have alsoprovided a tool that can instrument the loop entry, exit, and backedges of programs with stub functions using LLVM. You should compile programs using thistool to aid the process of developing a solution using Pin.

Note: Unlike the previous project, the task of manipulating the underlying IR of the provided tools has been done for you. Function calls have been inserted at key points inside the execution of analyzed programs that provide you with the information you need to complete the task. Thus, it is your job to fill in the runtime behavior necessary to complete the dynamic analysis itself.

LLVM

A template for the project based on LLVM is available here. The compilation process for the project should be the same as in project 1. To complete the project, you must implement the correct runtime behavior for the dynamic analysis in lib/prof-rt/*. You may add or change any other files as you see fit.

To use the provided tool, run depprof <inputbitcode>.bc -o <output binary>.

To simplify your introduction to Pin, I decided to provide you with some tools that will ease the instrumentation process. In particular, a program that inserts stub function calls at loop entry, exit, and iteration points is available here. To use the provided tool, run loopstubber <inputbitcode>.bc -o <output binary>.

A template for the project based on Pin is available here. To complete the project, you must implement the correct runtime behavior for the dynamic analysis in the handle* functions in DepProf.cpp. You may add or change any other functionality as you see fit. It is easiest to build the project by copying it into the source/tools directory of your pin distribution.

To use the provided tool, run pin -injection child -t <path to project>/obj-intel64/DepProf.so -- <path to program>.

What you turn in

You must turn in two things. Make sure that your name is clearly specified in both.

The source files for your project. Compress your project directory into a .tar.gz file in order to send it to me. I will build and test all projects on a 64-bit machine. Make sure that I can test your project by simply running the tool that gets built by the provided template.
A brief (1-2 page) write up of your project that explains your basic design as well as what results you may have observed when running your tool on tests and a real world program. Did the analysis meet your expectations? When might it be useful? When would it not be useful? How might you change things to make it more useful (feel free to brainstorm)?

Both of these should be emailed to me by the deadline.

Deadline

Thursday, February 27 at 11:59pm

Project 2: Dynamic Dependence Profiling

Tools

LLVM

Pin

What you turn in

Deadline

References