Project 1: Weighted Call Graphs

This project will help you get acquainted with using infrastructures like LLVM to gather basic information about computer programs. You will also gain experience recognizing limitations and trade-offs made when designing and constructing a static analysis tool.

For this project, you will construct an LLVM tool that can compute and output a static call graph for an input program. You may notice that LLVM already has some functionality for computing and printing call graphs; however, the graphs that LLVM computes do not necessarily provide the same information that you will be required to present in this project.

As a reminder, a call graph is a directed graph where the nodes represent the functions within a program. An edge exists from foo() to bar() when foo() may call bar(). Such call graphs can be helpful for examining the structure of a program, and they are also a crucial first step in many other analyses. They are especially useful for understanding programs with indirect calls or function pointers. For such programs, a single call site in a program may call many different functions across different program executions or even within a single execution of a program. These graphs can also be made more informative by noting the precise call sites or lines in foo() that make calls to bar().

The call graphs that you construct for this project shall show the possibly called functions for each call site within a program. In addition, they shall show the weight of each function, defined as the number of incoming edges to that function in the call graph. For example, for the simple program:

 1 void foo();
 2 
 3 void bar() {
 4   foo();
 5   bar();
 6 }
 7 
 8 void baz() {
 9   foo();
10   bar();
11 }
12 
13 int main(int argc, char **argv) {
14   foo();
15   bar();
16   baz();
17   void (*bam)() = 0;
18   switch (argc%3) {
19    case 0: bam = foo; break;
20    case 1: bam = bar; break;
21    case 2: bam = baz; break;
22   }
23   bam();
24   return 0;
25 }

Your program shall produce Graphviz formatted output like:

digraph {
  node [shape=record];
  bar[label="{bar|Weight: 4|<l0>example.c:4|<l1>example.c:5}"];
  foo[label="{foo|Weight: 4}"];
  baz[label="{baz|Weight: 2|<l0>example.c:9|<l1>example.c:10}"];
  main[label="{main|Weight: 0|<l0>example.c:14|<l1>example.c:15|<l2>example.c:16|<l3>example.c:23}"];
  bar:l0 -> foo;
  bar:l1 -> bar;
  baz:l0 -> foo;
  baz:l1 -> bar;
  main:l0 -> foo;
  main:l1 -> bar;
  main:l2 -> baz;
  main:l3 -> foo;
  main:l3 -> baz;
  main:l3 -> bar;
}

Instructions for producing this output are provided in the template. Here, a:b -> c means that call site b in function a may call function c, and a[...] specifies the name, weight, and call sites of function a. This produces the following call graph:

Note that each node in the graph contains the name of the function that it represents, along with the weight of the function and a list of the line numbers of the call sites in the function. Edges connect the call sites to their potential call targets.
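The weight computation and the node/edge syntax can be sketched in standard C++ as follows. This is only an illustration under stated assumptions: CallSiteInfo, Program, computeWeights, and toDot are hypothetical names that come from neither the project template nor LLVM, and call sites are modeled as plain strings rather than LLVM instructions.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical model of one call site: its source location and the
// names of the functions it may call.
struct CallSiteInfo {
  std::string location;              // e.g. "example.c:4"
  std::vector<std::string> targets;  // possible callees
};

// Hypothetical model of a program: (function name, call sites) pairs,
// listed in the order in which the nodes should be printed.
using Program =
    std::vector<std::pair<std::string, std::vector<CallSiteInfo>>>;

// Weight of a function = number of incoming (call site -> function) edges.
std::map<std::string, unsigned> computeWeights(const Program &program) {
  std::map<std::string, unsigned> weights;
  for (const auto &fn : program) {
    weights[fn.first];  // make sure uncalled functions appear with weight 0
    for (const auto &site : fn.second)
      for (const auto &target : site.targets)
        ++weights[target];
  }
  return weights;
}

// Render the graph using the record-node format of the sample output.
std::string toDot(const Program &program) {
  const auto weights = computeWeights(program);
  std::ostringstream out;
  out << "digraph {\n  node [shape=record];\n";
  for (const auto &fn : program) {
    const auto it = weights.find(fn.first);
    assert(it != weights.end());  // computeWeights registers every function
    out << "  " << fn.first << "[label=\"{" << fn.first
        << "|Weight: " << it->second;
    for (std::size_t i = 0; i < fn.second.size(); ++i)
      out << "|<l" << i << ">" << fn.second[i].location;
    out << "}\"];\n";
  }
  for (const auto &fn : program)
    for (std::size_t i = 0; i < fn.second.size(); ++i)
      for (const auto &target : fn.second[i].targets)
        out << "  " << fn.first << ":l" << i << " -> " << target << ";\n";
  out << "}\n";
  return out.str();
}
```

Note that a call site with several possible targets, such as the indirect call on line 23 of the example, contributes one incoming edge to each of its targets.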

Issues to keep in mind

IR Intrinsics – LLVM inserts calls to some functions in order to represent information within the IR. In particular, some debugging information is anchored by calls to functions that have names starting with llvm.dbg. You should ignore these functions in your call graph.
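A minimal sketch of such a filter, written against plain strings so that it does not depend on any particular LLVM version, might look like this. The helper name isDebugIntrinsicName is hypothetical; inside your analysis you would apply the same prefix test to the name of the called function.

```cpp
#include <string>

// Returns true when a callee name denotes a debug intrinsic such as
// "llvm.dbg.declare" or "llvm.dbg.value", i.e. when it starts with
// the prefix "llvm.dbg". The helper name is illustrative only.
bool isDebugIntrinsicName(const std::string &name) {
  static const std::string prefix = "llvm.dbg";
  // compare() looks at no more than prefix.size() characters of name, so
  // names shorter than the prefix can never compare equal to it.
  return name.compare(0, prefix.size(), prefix) == 0;
}
```

Calls whose callee passes this test would simply be skipped, so they add neither a call site entry nor an edge.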

Recursion – Both direct and mutual recursion must work correctly. For this project, recursion shouldn't pose any special problems, but it is always a useful corner case to bear in mind.

Disconnected graphs – Not every function may be reachable from the main function. As a result, the call graph may form several disconnected components. Your implementation must be able to handle this.
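To see why seeding the analysis from main alone is not enough, consider the following sketch. It assumes a hypothetical EdgeMap that maps every function name to its callees; reachableFrom and missedByTraversal are illustrative names, not part of the template. The visited set also keeps the traversal from looping on direct or mutual recursion.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical representation: caller name -> callee names. Every
// function in the program is assumed to have a key, even with no callees.
using EdgeMap = std::map<std::string, std::vector<std::string>>;

// Worklist traversal from a root; the 'seen' set guarantees termination
// in the presence of direct and mutual recursion.
std::set<std::string> reachableFrom(const EdgeMap &edges,
                                    const std::string &root) {
  std::set<std::string> seen;
  std::vector<std::string> worklist{root};
  while (!worklist.empty()) {
    const std::string fn = worklist.back();
    worklist.pop_back();
    if (!seen.insert(fn).second)
      continue;  // already visited
    const auto it = edges.find(fn);
    if (it == edges.end())
      continue;  // external function with no known body
    for (const auto &callee : it->second)
      worklist.push_back(callee);
  }
  return seen;
}

// Functions that a traversal starting at 'root' would never visit.
std::vector<std::string> missedByTraversal(const EdgeMap &edges,
                                           const std::string &root) {
  const auto seen = reachableFrom(edges, root);
  std::vector<std::string> missed;
  for (const auto &entry : edges)
    if (seen.count(entry.first) == 0)
      missed.push_back(entry.first);
  return missed;
}
```

Any function reported by missedByTraversal still needs a node in the output, which suggests iterating over every function in the module rather than walking the graph from main.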

Pointers – Handling indirect function calls inherently leads to imprecision. You must select and implement one approach for constructing a call graph even in the presence of function pointers. On top of learning the basics of LLVM, this issue poses the greatest challenge for the project. There are several different approaches that you may take for trying to compute a more precise call graph even in the presence of function pointers. In (I think) increasing order of difficulty, some possible approaches are:

Use argument and return type information to disambiguate possible targets of function calls. Don't forget that only a function that has its address taken may be the target of an indirect function call. Also keep in mind that some functions are variadic and can take a flexible number of arguments.
Use an existing alias analysis component to determine the possible points-to sets of function pointers. There are some resources on such components for LLVM online.
Use class hierarchy analysis to determine possible targets for C++ programs. That is, if you know that a method is called upon an object of a certain class, the possible targets of the call are the specific functions implemented by that class or its subclasses. Be careful; C++ programs use both call and invoke to call functions depending on whether or not the call may throw an exception. The CallSite helper class can conveniently identify both.
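As an illustration of the first approach, the sketch below filters candidate targets by comparing signatures. It is a simplified model under stated assumptions: types are represented as strings, and Signature, Candidate, IndirectCall, mayBeTarget, and possibleTargets are hypothetical names rather than LLVM classes or template code. A real implementation would compare LLVM types instead.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, string-based model of a function type.
struct Signature {
  std::string returnType;
  std::vector<std::string> paramTypes;  // fixed parameters only
  bool isVarArg;                        // accepts extra trailing arguments
};

struct Candidate {
  std::string name;
  Signature signature;
  bool addressTaken;  // only such functions can be called indirectly
};

// What is known at an indirect call site: result and argument types.
struct IndirectCall {
  std::string returnType;
  std::vector<std::string> argTypes;
};

// A candidate is kept when its address is taken, its return type matches,
// and its fixed parameters match the leading arguments of the call.
bool mayBeTarget(const Candidate &fn, const IndirectCall &call) {
  if (!fn.addressTaken)
    return false;
  const Signature &sig = fn.signature;
  if (sig.returnType != call.returnType)
    return false;
  const std::size_t fixed = sig.paramTypes.size();
  const std::size_t given = call.argTypes.size();
  if (sig.isVarArg ? given < fixed : given != fixed)
    return false;
  return std::equal(sig.paramTypes.begin(), sig.paramTypes.end(),
                    call.argTypes.begin());
}

std::vector<std::string> possibleTargets(const std::vector<Candidate> &fns,
                                         const IndirectCall &call) {
  std::vector<std::string> names;
  for (const auto &fn : fns)
    if (mayBeTarget(fn, call))
      names.push_back(fn.name);
  return names;
}
```

A filter like this can still report targets that are never called at run time, and it can miss real targets when a program casts function pointers between incompatible types. These are the kinds of trade-offs worth discussing in your write-up.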

The only approach you may not use is the naïve method of assuming that any indirect call may point to any function that has its address taken. You should make sure that you understand the trade-offs of the particular approach that you use. You will be expected to discuss them in a brief document when you turn your project in.

What I provide

I have created a basic template for the project that includes the Graphviz formatted output. This template can be used to create an LLVM project that compiles using either configure or cmake. The tools directory contains source for a simple program called callgrapher that takes in a single bitcode file, runs the analysis that you will write, and prints the resulting call graph. The libs directory contains a template in CallGraph.cpp for the analysis that you will write in order to construct a call graph. The header for the analysis is in include/CallGraph.h. Feel free to modify these sources as much as you wish. Some example tests are available to help you refine your implementation.

There are a few different ways that you may build the project. I recommend using whichever approach is most familiar to you. You can also find step-by-step instructions in the docs/BUILDING document in the project template.

Either way, you should use LLVM 3.5, available on the LLVM download page.

Option 1) Building using `autoconf`

First, you must build LLVM as described here, skipping steps 4, 5, 6, and 7. Make sure to run configure with the flags --disable-optimized --enable-debug-runtime --enable-assertions. Next, follow the instructions for building LLVM projects:
cd autoconf
./AutoRegen.sh
Enter the paths to the LLVM source and build directories.
From a separate build directory, run:
<path to project source>/configure --with-llvmsrc=<path to llvm src dir> --with-llvmobj=<path to llvm build dir> --disable-optimized --enable-debug-runtime --enable-assertions
Then you can simply run make to build the project within the build directory. The callgrapher program will reside at Debug+Asserts/bin/callgrapher in the build directory.

Option 2) Building using `cmake`

First, you must build LLVM as described here. Make sure to run cmake with -DCMAKE_BUILD_TYPE=Debug. You can use -DCMAKE_INSTALL_PREFIX=<path to local install directory> to fully install LLVM to a local path. From a separate build directory for the project, run:
CC=clang CXX=clang++ cmake -DLLVM_DIR=<installation or build path of LLVM>/share/llvm/cmake <path to project source>
Then you can simply run make to build the project within the build directory. The callgrapher program will reside at tools/callgrapher/callgrapher in the build directory.

What you turn in

You must turn in two things. Make sure that your name is clearly specified in both.

The source files for your project. Compress your project directory into a .tar.gz file in order to send it to me. I will build and test all projects on a 64-bit machine. Make sure that I can test your project by simply running the callgrapher program that gets built by the provided template. If you have used an external alias analysis component, include that as well.
A brief (1 page) write-up of your project that explains your basic design as well as the limitations and advantages of your approach. In particular, explain how you dealt with function pointers and how this relates to false positives and false negatives with respect to the 'may call' relationship (->) that the call graph captures. Feel free to include any other points of interest, such as trade-offs between design complexity and precision.

Both of these should be submitted via CourSys by the deadline.

Deadline

Tuesday, January 22 at 11:59pm

Useful links

LLVM Doxygen
Debugging Information in LLVM. Pay extra attention to the section on C and C++ specific debugging information.
Function::hasAddressTaken
CallSite
DerivedTypes