This project will help you get acquainted with using infrastructures like LLVM to gather basic information about computer programs. You will also gain some experience recognizing limitations and trade-offs made when designing and constructing a dynamic analysis tool.
For this project, you will construct an LLVM tool that can compute and output a weighted dynamic call graph for an execution of an input program.
As a reminder, a call graph is a directed graph where the nodes represent the
functions within a program.
An edge exists from foo()
to bar()
when foo()
may call bar()
.
Such call graphs can be helpful for examining the structure of a program,
and they are also a crucial first step in many other analyses.
They are especially useful for understanding programs with indirect calls or
function pointers.
For such programs, a single call site in a program may call many different
functions across different program executions or even within a single execution
of a program.
These graphs can also be made more informative by noting the precise call sites
or lines in foo()
that make calls to bar()
.
The dynamic weighted call graphs that you construct for this project shall show the observed call targets for each call site within an execution of the program. In addition, they shall show the weight of an edge in the call graph, the number of times that particular function was the target of that call site. For example, consider the simple program:
#include <stdio.h>
void
foo(int i) {
printf("Blimey!\n");
}
void
bar(int i) {
if (i > 0) {
bar(i - 1);
}
}
int
main(int argc, char** argv) {
for (unsigned i = 0; i < 10; ++i) {
void (*fptr)(int) = (i % argc) ? foo : bar;
fptr(i);
}
return 0;
}
Then running the commands:
clang -S -emit-llvm -g example.c
callgraph-profiler example.ll -o example
./example 2 3
shall produce a CSV (Comma Separated Value) file called profile-results.csv
in the current directory that looks like:
foo, example.c, 6, printf, 6
bar, example.c, 13, bar, 18
main, example.c, 22, bar, 4
main, example.c, 22, foo, 6
This CSV format encodes each edge in the dynamic call graph on a separate line. The format of a single line is:
<caller function name>, <call site file name>, <call site line #>, <callee function name>, <(call site,callee) frequency>
The last column on the line is the number of times that the callee (the call target) was called from that particular call site.
The CSV format is easy for you to produce and easier for me to grade, but it is perhaps challenging to understand. You are also provided with a script that can automatically convert the CSV format into a Graphviz formatted output like:
digraph {
node [shape=record];
"main"[label="{main|<l0>example.c:22}"];
"main":l0 -> "bar" [label="4",penwidth="1.11",labelfontcolor=black,color="#380000"];
"main":l0 -> "foo" [label="6",penwidth="1.67",labelfontcolor=black,color="#550000"];
"foo"[label="{foo|<l0>example.c:6}"];
"foo":l0 -> "printf" [label="6",penwidth="1.67",labelfontcolor=black,color="#550000"];
"bar"[label="{bar|<l0>example.c:13}"];
"bar":l0 -> "bar" [label="18",penwidth="5.0",labelfontcolor=black,color="#ff0000"];
"printf"[label="{printf}"];
}
which can in turn be passed to the dot
command to automatically produce
pictures of the dynamic call graph like the following:
Examples of how you can automatically produce these pictures to better understand your results are provided to you.
IR Intrinsics – LLVM inserts calls to some functions in order to represent
information within the IR.
In particular, some debugging information is anchored by calls to functions
that have names starting with llvm.dbg
.
You should ignore these functions in your callgraph.
Recursion – Both direct and mutual recursion must work correctly. For this project, recursion shouldn't pose any special problems, but it is always a useful corner case to bear in mind.
Pointers – Function pointers allow one call site to have many different potential targets/callees. They also complicate the identification of the call target at the call site.
External Functions – Some functions, like printf()
, are implemented elsewhere
in a library. Thus, you have access to the call site, but you cannot modify the
call target.
I have created a
basic template
for the project.
This template can be used to create an LLVM project that compiles using
cmake
similar to the LLVM demo we saw in class.
The tools
directory contains source for a simple program called
callgraph-profiler
that takes in a single bitcode file, runs the
instrumentation analysis that you will write, and produces a compiled program
that includes any instrumentation you have added.
The libs/callgraph-profiler-inst/
directory contains a template in
ProfilingInstrumentationPass.cpp
for the analysis that you will write in
order to consruct a call graph.
The header for the analysis is in include/ProfilingInstrumentationPass.h
.
The runtime support library for any functions to which you wish to insert calls
can be found in libs/callgraph-profiler-rt/runtime.cpp
.
Feel free to modify these sources as much as you wish.
Some example tests are provided in the tests/
directory to help you
refine your implementation.
I encourage you to discuss and share additional test cases in the class
discussion forum.
You can find step-by-step instructions for building the project in the README.md document in the project template.
To build this project, you should use LLVM 6.0, available on the
LLVM download page.
Downloading the pre-built binaries should provide you everything you need.
This should already be available in the CSIL lab at
/usr/shared/CMPT/faculty/wsumner/llvm/
. The bin/
subdirectory also
contains appropriate versions of clang
and other LLVM related tools that you
may use for the project.
cmake
First create a build directory separate from your source directory in order to perform an out-of-source build. From within the build directory for the project, run:
cmake -DLLVM_DIR=<installation, build, or extraction path of LLVM>/lib/cmake/llvm <path to project source>
Then you can simply run make
to build the project within the build directory.
The callgraph-profiler program will reside at bin/callgraph-profiler
in the
build directory.
On CSIL in particular, you should use a recent version of clang when compiling. You can do this by configuring the build with:
CXX=/usr/shared/CMPT/faculty/wsumner/llvm/bin/clang++ \
CC=/usr/shared/CMPT/faculty/wsumner/llvm/bin/clang \
cmake -DLLVM_DIR=/usr/shared/CMPT/faculty/wsumner/llvm/lib/cmake/llvm <path to project source>
You must turn in two things. Make sure that your name is clearly specified in both.
The source files for your project.
Compress your project directory into a .tar.gz file in order to
submit it via CourSys
All projects will be built and tested on a 64-bit machine.
Make sure that we can test your project by simply running the
callgraph-profiler
program to instrument a module and running the resulting
program.
Grading of the projects will be automatic, so make sure that your output
format is correct!
A brief (up to 1 page) write up of your project that explains your basic design as well as the limitations and advantages of your approach. In particular, explain how you dealt with function pointers and external function calls. Are there ways that you did or could have made the analysis more efficient for common cases? Feel free to include any other points of interest, such as trade offs between design complexity and notions of efficiency.
Both of these should be submitted via CourSys by the deadline.
Monday, September 17 at 11:59pm