This project will help you get acquainted with using infrastructures like LLVM to gather basic information about computer programs. You will also gain experience recognizing limitations and trade-offs made when designing and constructing a static analysis tool.
For this project, you will construct an LLVM tool that can compute and output a static call graph for an input program. You may notice that LLVM already has some functionality for computing and printing call graphs; however, the graphs that LLVM computes do not necessarily provide the same information that you will be required to present in this project.
As a reminder, a call graph is a directed graph where the nodes represent the
functions within a program.
An edge exists from foo()
to bar()
when foo()
may call bar()
.
Such call graphs can be helpful for examining the structure of a program,
and they are also a crucial first step in many other analyses.
They are especially useful for understanding programs with indirect calls or
function pointers.
For such programs, a single call site in a program may call many different
functions across different program executions or even within a single execution
of a program.
These graphs can also be made more informative by noting the precise call sites
or lines in foo()
that make calls to bar()
.
The call graphs that you construct for this project shall show the potentially called functions for each call site within a program. In addition, they shall show the weight of a function in the call graph, the number of incoming edges to that particular function. For example, for the simple program:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25</td>void foo();
void bar() {
foo();
bar();
}
void baz() {
foo();
bar();
}
int main(int argc, char **argv) {
foo();
bar();
baz();
void (*bam)() = 0;
switch (argc%3) {
case 0: bam = foo; break;
case 1: bam = bar; break;
case 2: bam = baz; break;
}
bam();
return 0;
}
</pre></div>
</tr></table></code>
</figure>
Your program shall produce a CSV (Comma Separated Value) formatted output like:
bar,4,0,example1.c,4,1,example1.c,5
foo,4
baz,2,0,example1.c,9,1,example1.c,10
main,0,0,example1.c,14,1,example1.c,15,2,example1.c,16,3,example1.c,23
bar,0,foo
bar,1,bar
baz,0,foo
baz,1,bar
main,0,foo
main,1,bar
main,2,baz
main,3,foo
main,3,baz
main,3,bar
This CSV format is divided into two sections, separated by a blank line.
The first section encodes each function in the call graphs on a separate line.
The format of a single line is:
, (, , , )*
The call site ID is a unique identifier for each individual call site in a
function. It should consist of only numbers and/or letters, but apart from that
it may be anything you like. The second section encodes each edge of the call
graph on a separate line. The format of a single line is:
, ,
The CSV format is easy for you to produce and easier for me to grade, but it is
perhaps challenging to understand. You are also provided with a script that
can automatically convert the CSV format into a
[Graphviz](http://www.graphviz.org) formatted output like:
which can in turn be passed to the `dot` command to automatically produce
pictures of the call graph like the following:
Examples of how you can automatically produce these pictures to better
understand your results are provided to you.
## Issues to keep in mind
IR Intrinsics -- LLVM inserts calls to some functions in order to represent
information within the IR.
In particular, some debugging information is anchored by calls to functions
that have names starting with `llvm.dbg`.
You should ignore these functions in your callgraph.
Recursion -- Both direct and mutual recursion must work correctly.
For this project, recursion shouldn't pose any special problems, but it is
always a useful corner case to bear in mind.
Disconnected graphs -- Not every function may be reachable from the main
function.
As a result, the call graph may form several disconnected components.
Your implementation must be able to handle this.
Pointers -- Handling indirect function calls inherently leads to imprecision.
You must select and implement one approach for constructing a call graph even
in the presence of function pointers.
On top of learning the basics of LLVM, this issue poses the greatest challenge
for the project.
There are several different approaches that you may take for trying to compute
a more precise call graph even in the presence of function pointers.
In (I think) increasing order of difficulty, some possible approaches are:
* Use argument and return type information to disambiguate possible targets of
function calls.
Don't forget that only a function that has its address taken may be the
target of a function call (unless its address escapes somehow...).
Some functions can take a flexible number of arguments.
* Use an existing alias analysis component to determine the possible points-to
sets of function pointers.
There are some resources on such components for LLVM online.
* Use class hierarchy analysis to determine possible targets for C++ programs.
That is, if you know that a method is called upon an object of a certain
class, the possible targets of the call are the specific functions implemented
by that class or its subclasses.
Be careful; C++ programs use both *call* and *invoke* to call functions
depending on whether or not the call may throw an exception.
The `CallSite` helper class can conveniently identify both.
The only approach you *may not* use is the naïve method of assuming that any
indirect call may point to any function that has its address taken.
You should make sure that you understand the trade offs of the particular
approach that you use.
You will be expected to discuss them in a *brief* document when you turn your
project in.
## What I provide
I have created a
[basic template](https://github.com/nsumner/callgrapher-template.git) for the
project that includes the Graphviz formatted output.
This template can be used to create an LLVM project that compiles using
`cmake` similar to the LLVM demo we saw in class.
The `tools` directory contains source for a simple program called `callgrapher`
that takes in a single bitcode file, runs the analysis that you will write, and
prints the resulting callgraph.
The `libs` directory contains a template in `CallGraph.cpp` for the analysis
that you will write in order to consruct a call graph.
The header for the analysis is in `include/CallGraph.h`.
Feel free to modify these sources as much as you wish.
Some example tests are provided in the tests directory to help you
refine your implementation. I encourage you to discuss and share additional
test cases.
There are a few different ways that you may build the project.
I recommend using whichever approach is most familiar to you.
You can also find step-by-step instructions in the docs/BUILDING document in
the project template.
To build this project, you should use LLVM 3.7, available on the
[LLVM download page](http://llvm.org/releases/download.html#3.7.0).
Downloading the pre-built binaries should provide you everything you need.
Due to recent ABI incompatibility issues, if you are using Ubuntu 15.10 or
later, you will need to build LLVM from source yourself.
### Building using `cmake`
First create a build directory separate from your source directory in order
to perform an out-of-source build.
From within the build directory for the project, run:
Then you can simply run `make` to build the project within the build directory.
The callgrapher program will reside at `bin/callgrapher` in the
build directory.
## What you turn in
You must turn in two things.
Make sure that your name is clearly specified in both.
1. The *source* files for your project.
Compress your project directory into a .tar.gz file in order to
submit it via [CourSys](http://courses.cs.sfu.ca/)
I will build and test all projects on a 64-bit machine.
Make sure that I can test your project by simply running the `callgrapher`
program that gets built by the provided template.
If you have used an external alias analysis component, include that as well.
Grading of the projects will be automatic, so make sure that your output
format is correct!
2. A brief (up to 1 page) write up of your project that explains your basic
design as well as the limitations and advantages of your approach.
In particular, explain how you dealt with function pointers and how this
relates to false positives and false negatives with respect to the 'may call'
relationship that the call graph captures.
Feel free to include any other points of interest, such as trade offs
between design complexity and precision.
Both of these should be submitted via [CourSys](http://courses.cs.sfu.ca) by the deadline.
## Deadline
Tuesday, January 19 at 11:59pm
## Useful links
* [LLVM Doxygen](http://llvm.org/docs/doxygen/html/)
* [Debugging Information in LLVM](http://llvm.org/docs/SourceLevelDebugging.html).
Pay extra attention to the section on C and C++ specific debugging information.
* [Function::hasAddressTaken](http://llvm.org/docs/doxygen/html/classllvm_1_1Function.html#aafc2232f97cd2d2fae9b4f5bda77a363)
* [CallSite](http://llvm.org/docs/doxygen/html/CallSite_8h_source.html)
* [DerivedTypes](http://llvm.org/docs/doxygen/html/DerivedTypes_8h_source.html)
## Additional Resources on Call Graphs
* [Optimization of Object-Oriented Programs using Static Class Hierarchy Analysis](ftp://ftp.cs.washington.edu/pub/chambers/hierarchy.ps.Z)
* [Fast Static Analysis of C++ Virtual Function Calls](http://www.cs.cornell.edu/courses/cs711/2005fa/papers/bs-oopsla96.pdf)
* [Scalable Propagation-Based Call Graph Construction Algorithms](http://web.cs.ucla.edu/~palsberg/paper/oopsla00.pdf)
* [Precise Call Graphs for C Programs with Function Pointers](http://www.cs.rpi.edu/~milanova/docs/paper_kluw.pdf)
* [A Framework for Call Graph Construction Algorithms](http://www.cs.washington.edu/research/projects/cecil/pubs/cgc-toplas.pdf)
* [Comparing Call Graphs](https://plg.uwaterloo.ca/~olhotak/pubs/paste07.pdf)
* [Efficient Construction of Approximate Call Graphs for JavaScript IDE Services](http://www.franktip.org/pubs/icse2013approximate.pdf)
* [Function pointers in C](http://blog.frama-c.com/index.php?post/2013/08/24/Function-pointers-in-C)