This project is intended to help you get further acquainted with creating dynamic analyses using LLVM.
Identifying which paths in a program execute most frequently can be an effective tool for identifying regions of a program to optimize and for double checking that a program is behaving as expected. Beyond these more expected uses, it can benefit and drive additional analyses ranging from dataflow analysis to statistical debugging to designing custom hardware. For this project, you will implement a limited form of a technique called "Efficient Path Profiling". You have previously seen this technique presented in class, so the basics should already be familiar to you.
Recall that Efficient Path Profiling (EPP) compactly numbers all paths through a directed acyclic graph (DAG) and then annotates DAGs within a control flow graph (CFG) so that executing a path of a DAG within a CFG will increase a counter for that path. When this newly instrumented program is then compiled and executed, the counters for all paths are logged when the program terminates and can be decoded to identify the frequency of all paths.
Note that there are three key components to this system:
You will be implementing all three of these components, but you will only have to do so for the bodies of functions that do not contain loops (or other back edges in the control flow graph). A function containing any back edges in the control flow graph is left uninstrumented. A function with multiple returns should be modeled as though all actual returns are immediately followed by a single unique exit point for the function body, as in the paper.
Information about loops inside LLVM is available via the
LoopInfo
class, which is enabled in the template provided to you.
When instrumenting code, you may also find the SplitEdge()
function in
BasicBlockUtils.h
to be useful.
The tool that you create, called pathprofiler
shall instrument and compile
a given program so that it identifies the paths within every acyclic function
and counts how many times they execute.
When the instrumented program runs and terminates,
it should write out a file called path-profile-results
containing the path
identifiers for executed paths as well as how many times each of the paths was
executed in that one run.
The exact format of the file is up to you. I will not examine it.
If your tool is instead given the original bitcode module and the
path-profile-results
file, then it shall create a file called
top-five-paths.csv
in the current directory containing information for the
five most frequently executed paths with each line in the following format:
<# Occurrences>, <Function Name> (, <File Name>, <Line #>)+
for example, given the following program:
#include <stdio.h>
#include <stdlib.h>
void
profiled(int argc, int i) {
if ((i + argc) % 3) {
printf("Truey\n");
} else {
printf("Falsey\n");
}
if ((i + argc) % 5) {
printf("Finn\n");
} else {
printf("Jake\n");
}
}
int
main(int argc, char **argv) {
for (int j = 0, e = 100; j != e; ++j) {
for (int i = 0, e = atoi(argv[1]); i < e; ++i) {
profiled(argc, i);
}
}
return 0;
}
The final contents of top-five-paths.csv
should look like:
500, profiled, c/05-multibranch.c, 7, c/05-multibranch.c, 8, c/05-multibranch.c, 9, c/05-multibranch.c, 10, c/05-multibranch.c, 13, c/05-multibranch.c, 14, c/05-multibranch.c, 15, c/05-multibranch.c, 18
300, profiled, c/05-multibranch.c, 7, c/05-multibranch.c, 8, c/05-multibranch.c, 11, c/05-multibranch.c, 13, c/05-multibranch.c, 14, c/05-multibranch.c, 15, c/05-multibranch.c, 18
200, profiled, c/05-multibranch.c, 7, c/05-multibranch.c, 8, c/05-multibranch.c, 9, c/05-multibranch.c, 10, c/05-multibranch.c, 13, c/05-multibranch.c, 16, c/05-multibranch.c, 18
Note that only executed paths are written to the CSV. You may break ties arbitrarily.
The exact nature of the input is discussed in the LLVM section. The exact algorithm for computing edge increments can be found in Figure 5 on page six of the Efficient Path Profiling paper. Do not forget that we had a lecture on how to write an analysis using LLVM. In particular remember that you must initialize your data when the program starts, update it as the program runs using your runtime code, and write out your data when the program ends.
Note, in general, adding instrumentation may require splitting
critical edges
in the graph. LLVM has a
SplitEdge()
function that serves this purpose. It may not do exactly what you expect,
so you may wish to create a wrapper for it.
A template for the project is available here. The compilation process for the project should be the same as before. To complete the project, you must implement the analyses in
lib/pathprofiler-inst/PathProfilingPass.cpp
lib/pathprofiler-inst/PathEncodingPass.cpp
lib/pathprofiler-inst/PathDecodingPass.cpp
.You may add or change any other files as you see fit.
To use the tool, run
./pathprofiler <input bitcode or llvm assembly> -o <compiled path>
to instrument and compile the module into a new and executable binary. Running
the compiled program then produces a file named path-profile-results
in
the current directory. Running
./pathprofiler <input bitcode or llvm assembly> -p <path-profile-results>
shall create the text file top-five-paths.csv
containing the appropriate list
of the top five most executed paths.
By default, the project template assumes that your static instrumentation
code can be found in lib/pathprofiler-inst/
and your runtime library that
will be linked with the instrumented program can be found in
lib/pathprofiler-rt/
.
You must turn in one thing via CourSys. Make sure that your name is clearly specified.
This should be submitted to CourSys by the deadline.
Monday, October 2 at 11:59pm