Project: Efficient Path Profiling

This project is intended to help you get further acquainted with creating dynamic analyses using LLVM.

Identifying which paths in a program execute most frequently can be an effective tool for identifying regions of a program to optimize and for double checking that a program is behaving as expected. Beyond these more expected uses, it can benefit and drive additional analyses ranging from dataflow analysis to statistical debugging to designing custom hardware. For this project, you will implement a limited form of a technique called "Efficient Path Profiling". You have previously seen this technique presented in class, so the basics should already be familiar to you.

Recall that Efficient Path Profiling (EPP) compactly numbers all paths through a directed acyclic graph (DAG) and then annotates DAGs within a control flow graph (CFG) so that executing a path of a DAG within a CFG will increase a counter for that path. When this newly instrumented program is then compiled and executed, the counters for all paths are logged when the program terminates and can be decoded to identify the frequency of all paths.

Note that there are three key components to this system:

  1. The encoding functionality that computes the compact numbering for a CFG.
  2. The instrumented runtime logging that actually computes the identifiers for paths as they execute and increments the counters for those identifiers.
  3. The decoding functionality that decodes a given identifier.

You will be implementing all three of these components, but you will only have to do so for the bodies of functions that do not contain loops (or other back edges in the control flow graph). A function containing any back edges in the control flow graph is left uninstrumented. A function with multiple returns should be modeled as though all actual returns are immediately followed by a single unique exit point for the function body, as in the paper.

Information about loops inside LLVM is available via the LoopInfo class, which is enabled in the template provided to you. When instrumenting code, you may also find the SplitEdge() function in BasicBlockUtils.h to be useful.

The tool that you create, called pathprofiler shall instrument and compile a given program so that it identifies the paths within every acyclic function and counts how many times they execute. When the instrumented program runs and terminates, it should write out a file called path-profile-results containing the path identifiers for executed paths as well as how many times each of the paths was executed in that one run. The exact format of the file is up to you. I will not examine it. If your tool is instead given the original bitcode module and the path-profile-results file, then it shall create a file called top-five-paths.csv in the current directory containing information for the five most frequently executed paths with each line in the following format:

<# Occurrences>, <Function Name> (, <File Name>, <Line #>)+

for example, given the following program:

example.c
#include <stdio.h>
#include <stdlib.h>


void
profiled(int argc, int i) {
  if ((i + argc) % 3) {
    printf("Truey\n");
  } else {
    printf("Falsey\n");
  }
  if ((i + argc) % 5) {
    printf("Finn\n");
  } else {
    printf("Jake\n");
  }
}


int
main(int argc, char **argv) {
  for (int j = 0, e = 100; j != e; ++j) {
    for (int i = 0, e = atoi(argv[1]); i < e; ++i) {
      profiled(argc, i);
    }
  }
  return 0;
}

The final contents of top-five-paths.csv should look like:

top-five-paths.csv
500, profiled, c/05-multibranch.c, 7, c/05-multibranch.c, 8, c/05-multibranch.c, 9, c/05-multibranch.c, 10, c/05-multibranch.c, 13, c/05-multibranch.c, 14, c/05-multibranch.c, 15, c/05-multibranch.c, 18
300, profiled, c/05-multibranch.c, 7, c/05-multibranch.c, 8, c/05-multibranch.c, 11, c/05-multibranch.c, 13, c/05-multibranch.c, 14, c/05-multibranch.c, 15, c/05-multibranch.c, 18
200, profiled, c/05-multibranch.c, 7, c/05-multibranch.c, 8, c/05-multibranch.c, 9, c/05-multibranch.c, 10, c/05-multibranch.c, 13, c/05-multibranch.c, 16, c/05-multibranch.c, 18

Note that only executed paths are written to the CSV. You may break ties arbitrarily.

The exact nature of the input is discussed in the LLVM section. The exact algorithm for computing edge increments can be found in Figure 5 on page six of the Efficient Path Profiling paper. Do not forget that we had a lecture on how to write an analysis using LLVM. In particular remember that you must initialize your data when the program starts, update it as the program runs using your runtime code, and write out your data when the program ends.

Note, in general, adding instrumentation may require splitting critical edges in the graph. LLVM has a SplitEdge() function that serves this purpose. It may not do exactly what you expect, so you may wish to create a wrapper for it.

LLVM

A template for the project is available here. The compilation process for the project should be the same as before. To complete the project, you must implement the analyses in

You may add or change any other files as you see fit.

To use the tool, run

./pathprofiler <input bitcode or llvm assembly> -o <compiled path>

to instrument and compile the module into a new and executable binary. Running the compiled program then produces a file named path-profile-results in the current directory. Running

./pathprofiler <input bitcode or llvm assembly> -p <path-profile-results>

shall create the text file top-five-paths.csv containing the appropriate list of the top five most executed paths.

By default, the project template assumes that your static instrumentation code can be found in lib/pathprofiler-inst/ and your runtime library that will be linked with the instrumented program can be found in lib/pathprofiler-rt/.

What you turn in

You must turn in one thing via CourSys. Make sure that your name is clearly specified.

This should be submitted to CourSys by the deadline.

Deadline

Monday, October 2 at 11:59pm

References

  1. Efficient Path Profiling
  2. Improving Dataflow Analysis with Path Profiles
  3. Holmes: Effective Statistical Debugging via Efficient Path Profiling
  4. NEEDLE: Leveraging Program Analysis to Analyze and Extract Accelerators from Whole Programs
  5. So much more related work…. So much.