Project 3: Efficient Path Profiling

This project is intended to help you get acquainted with creating dynamic analyses using LLVM.

Identifying which paths in a program execute most frequently can be an effective tool for identifying regions of a program to optimize and for double checking that a program is behaving as expected. Beyond these more expected uses, it can benefit and drive additional analyses ranging from dataflow analysis to statistical debugging For this project, you will implement a limited form of a technique called "Efficient Path Profiling". You have previously seen this technique presented in class, so the basics should already be familiar to you.

Recall that Efficient Path Profiling (EPP) compactly numbers all paths through a directed acyclic graph (DAG) and then annotates DAGs within a control flow graph (CFG) so that executing a path of a DAG within a CFG will increase a counter for that path. When this newly instrumented program is then compiled and executed, the counters for all paths are logged when the program terminates and can be decoded to identify the frequency of all paths.

Note that there are three key components to this system:

  1. The encoding functionality that computes the compact numbering for a CFG.
  2. The instrumented runtime logging that actually computes the identifiers for paths as they execute and increments the counters for those identifiers.
  3. The decoding functionality that decodes a given identifier.

You will be implementing all three of these components, but you will only have to do so for the bodies of innermost loops within a function. These are loops that do not contain any loops within themselves. Thus, in the following function, you will increment a counter for the path taken by each iteration of loops A and C.

def foo():
    while someCondition: # Loop A
        # Identify the paths here and increment their counters

    while anotherCondition: # Loop B
        # Some code that is ignored
        while yetAThirdCondition # Loop C
            Identify the paths here and increment their counters

A function containing no loops is uninstrumented. The body of a loop with multiple exits should be modeled as though all actual exits are immediately followed by a unique pseudo-exit for the loop body. Similarly, the latch, or last basic block before the backedge of the loop that returns to the loop header, represents the last node of the DAG before exiting the loop body. Information about loops inside LLVM is available via the LoopInfo class, which is enabled in the template provided to you. You will be particularly interested in the begin(), end(), and empty() methods method of LoopInfo as well as the methods of the same name and contains, getLoopLatch, and getExitingBlocks or isExitingBlock for Loop. When instrumenting code, you may also find the SplitEdge() function in BasicBlockUtils.h to be useful.

The tool that you create, called pathprofiler shall instrument and compile a given program so that it identifies the acyclic paths within the innermost loops of every function and counts how many times they execute. When the instrumented program runs and terminates, it should write out a file called path-profile-results containing these path identifiers as well as how many times each of the paths was executed in that one run. If your tool is instead given the original bitcode module and the path-profile-results then it shall print out the top five most frequent paths and how many times they executed, e.g.

Top 3 Paths
=========================================
Path, occurrences: 1
File c/02-simpleloop.c line 7

Path, occurrences: 3
File c/02-simpleloop.c line 7
File c/02-simpleloop.c line 8
File c/02-simpleloop.c line 11
File c/02-simpleloop.c line 13
File c/02-simpleloop.c line 7

Path, occurrences: 7
File c/02-simpleloop.c line 7
File c/02-simpleloop.c line 8
File c/02-simpleloop.c line 9
File c/02-simpleloop.c line 10
File c/02-simpleloop.c line 13
File c/02-simpleloop.c line 7

The exact nature of the input is discussed in the LLVM section. The exact algorithm for computing edge increments can be found in Figure 5 on page six of the Efficient Path Profiling paper. Do not forget that we had a lecture on how to write an analysis using LLVM. In particular remember that you must initialize your data when the program starts, update it as the program runs using your runtime code, and write out your data when the program ends.

You may assume:

LLVM

A template for the project is available here. The compilation process for the project should be the same as before. To complete the project, you must implement the analysis in lib/pathprofiler-inst/PathProfilingPass.cpp. You may add or change any other files as you see fit.

To use the tool, run

./pathprofiler <input bitcode or llvm assembly> -o <compiled path>

to instrument and compile the module into a new and executable binary. Running the compiled program then produces a file named path-profile-results in the current directory. Running

./pathprofiler <input bitcode or llvm assembly> -p <path-profile-results>

shall print out the appropriate list of most executed paths.

By default, the project template assumes that your static instrumentation code can be found in lib/pathprofiler-inst/ and your runtime library that will be linked with the instrumented program can be found in lib/pathprofiler-rt/.

What you turn in

You must turn in one thing via CourSys. Make sure that your name is clearly specified.

This should be submitted to CourSys by the deadline.

Deadline

Tuesday, March 10 at 11:59pm

References

  1. Efficient Path Profiling
  2. Imrpoving Dataflow Analysis with Path Profiles
  3. Holmes: Effective Statistical Debugging via Efficient Path Profiling
  4. So much more related work…. So much.