As we discussed in class, random testing provides a way to get more power than traditional unit tests can provide. This exercise will help you gain experience with two forms of random testing: (1) fuzz testing and (2) property-based testing.
Fuzz testing provides a way to perform ongoing, security-oriented testing. It tries to explore behaviors that may not have otherwise been considered during the testing process. From a security perspective, fuzz testing is sometimes even called automated penetration testing (a term now also used to refer to breach and attack simulation). It is a critical part of developing proactively with an eye toward security.
Property-based testing, on the other hand, attempts to get more power out of each unit test. Rather than arranging, acting, and asserting on a single scenario, property-based testing attempts to sample and test many similar scenarios. In addition, it enables testing of otherwise hard-to-test code, such as code with nondeterminism or multiple possible valid outputs.
Note that while the programming skills involved in this exercise may be simple, fuzz testing again takes time. You will need to allocate 8 hours of total running time for the fuzz testing experiments. This can be done overnight. For fuzz testing, it is recommended that you first run each technique for a few minutes in order to look for progress and sanity check that the way you are using the tool is working.
You will work in groups of two or three students for the exercise.
For the two fuzzing tasks, you will need to select at least one project to analyze.
The software that you analyze should be an open source project of some sort. Any analyzed project should contain at least 4000 lines of code and must include an established test suite and automated test process. You are also free to analyze two different projects, one for each fuzzing task. Once again, you are free to consider different projects listed on www.github.com, www.sourceforge.net, www.openhub.net, www.gnu.org, or other major collections of open source software.
Once again, you should identify and consider:
Again, include this information in your report.
libFuzzer is a fuzzing engine produced at Google within the LLVM project and constantly running within their OSS-Fuzz initiative. According to that project, OSS-Fuzz had found more than 30,000 bugs as of 2021. You can read more about fuzzing at Google here, or watch some of their videos [1] [2].
Of the tools you will use, libFuzzer requires the most work to set up, but it isn't too bad. The burden comes because libFuzzer does not fuzz an entire application. Instead, it fuzzes only a single function (and others it transitively calls) at a time. This gives developers more precise control over where time is spent when fuzzing, but it also requires a developer to specify how a random string of bits is translated into input for that function, e.g., by constructing the arguments out of a sequence of bits. This is particularly useful because many bugs and security vulnerabilities are found in file parsing functions, and libFuzzer tends to be straightforward to use in these contexts. It also enables libFuzzer to perform in-process fuzzing, where the tested function is called repeatedly in a loop instead of in a fresh process. This tends to be an order of magnitude faster than out-of-process fuzzing, thus finding more bugs with fewer resources.
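As an illustration of how a driver can carve structured arguments out of the raw byte sequence, LLVM ships a FuzzedDataProvider helper header that libFuzzer drivers commonly use. The sketch below is only illustrative: decode_record and its parameters are hypothetical stand-ins, not functions from any particular project.

    // Illustrative sketch only: decode_record() and its arguments are hypothetical.
    // FuzzedDataProvider is a real helper header shipped with LLVM/Clang.
    #include <fuzzer/FuzzedDataProvider.h>
    #include <cstddef>
    #include <cstdint>
    #include <string>

    bool decode_record(const std::string &payload, int max_depth);  // hypothetical target

    extern "C" int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
      FuzzedDataProvider provider(data, size);
      // Slice the random bytes into the arguments the target function expects.
      int max_depth = provider.ConsumeIntegralInRange<int>(0, 16);
      std::string payload = provider.ConsumeRemainingBytesAsString();
      decode_record(payload, max_depth);
      return 0;  // nonzero return values are reserved by libFuzzer
    }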
You will use libFuzzer to test a single parsing function for some open source
software. You will need to write a small test driver that calls a single
function from the software and passes in a sequence of bytes. You can find a
complete tutorial for libFuzzer here.
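For a single parsing function that already accepts a buffer or string, the driver can be even simpler than the sketch above: just forward the raw bytes. Again, parse_input and the library name in the build comment are hypothetical placeholders for your chosen project.

    // Minimal libFuzzer driver sketch; parse_input() is a hypothetical stand-in
    // for the parsing function of the project you choose.
    // Build (roughly): clang++ -g -O1 -fsanitize=address,fuzzer harness.cpp libproject.a -o fuzz_parse
    // Run:             ./fuzz_parse corpus_dir/
    #include <cstddef>
    #include <cstdint>
    #include <string>

    bool parse_input(const std::string &text);  // hypothetical function under test

    // libFuzzer calls this entry point repeatedly with generated inputs.
    extern "C" int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
      std::string input(reinterpret_cast<const char *>(data), size);
      parse_input(input);  // crashes and sanitizer errors are caught by libFuzzer
      return 0;
    }

Running the resulting binary with a corpus directory lets libFuzzer both seed from and save interesting inputs to that directory.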
Note that writing the test harness may require you to link against some of the libraries or object files of the software project in question. It may also require you to modify the build process of the project in question to include the -fsanitize=address,fuzzer command-line options. You should document the process that you used to get it working in your project write-up.
You should run the fuzzer for at least 4 hours or until the first crash. Note this limitation of in-process fuzzing: after finding the first crash, it stops. Consider: why does this limitation exist?
Ideally, you would run it for at least 24 hours to get a better picture of the behavior and find more interesting things. Interesting behaviors with fuzzing frequently arise after a week or more. Feel free to do so if you are able to. You can use the screen or tmux command to log out of a machine while your tool continues to run and then log back in later to see the results.
Google now also provides support for evaluating novel fuzz testing methodologies. Consider what the presentations in their reports indicate about the behavior of the fuzzer and how well it performed.
libFuzzer is distributed as a part of the LLVM project and Clang. In CSIL, I have made a development-ready install of Clang 15.0 along with libFuzzer available via a shared directory. You can take advantage of these by modifying your path. Specifically, at the end of your ~/.bash_profile, you can add:
export PATH=/usr/shared/CMPT/faculty/wsumner/base/bin/:/usr/shared/CMPT/faculty/wsumner/llvm/bin:$PATH
The next time you log in, these will be available to you. You can try the provided toy example to double-check that it is working.
Alternatively, you can also get access to that version of Clang and LLVM through the virtual environment set up for the class in CSIL:
source /usr/shared/CMPT/faculty/wsumner/base/env473/bin/activate
AFLplusplus is a community-maintained fork of american fuzzy lop (AFL). It includes many additional patches that affect the fuzzing process or can be configured to control it.
AFL (and its forks) is a coverage-guided fuzzer. It tracks coverage by changing the behavior of the compiler to add extra instructions into the program that measure coverage as it executes (it is a dynamic analysis). Specifically, classical AFL-style fuzzers approximate unique paths through the program by computing hash codes that identify the transitions between basic blocks (edges) traversed during an execution. They then keep inputs that traverse new paths around for further exploration. In practice, AFLplusplus employs many additional approaches and heuristics on top of this.
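To make the hashing idea more concrete, here is a conceptual sketch of the classic AFL-style instrumentation added at the start of each basic block. This is a simplified rendition of the design described in the AFL documentation, not AFLplusplus's actual implementation.

    // Conceptual sketch of classic AFL-style edge coverage (simplified, not AFL++'s real code).
    #include <cstddef>
    #include <cstdint>

    constexpr std::size_t MAP_SIZE = 1 << 16;  // AFL's default coverage bitmap size
    uint8_t shared_bitmap[MAP_SIZE];           // shared with the fuzzer process
    static uint16_t prev_location = 0;

    // Inserted at the start of every basic block; cur_location is a random
    // compile-time constant that identifies the block.
    inline void log_edge(uint16_t cur_location) {
      // XOR of the previous and current block IDs hashes the edge just taken.
      shared_bitmap[(cur_location ^ prev_location) % MAP_SIZE]++;
      // Shift so that the edges A->B and B->A land in different bitmap entries.
      prev_location = cur_location >> 1;
    }

The fuzzer then treats an input as interesting when its execution touches bitmap entries (or coarse hit-count ranges) that no previous input has.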
You will use AFLplusplus to test the behavior of a program or library when reading in some sort of input from the user (or a file). You can examine the results inside the output directory in order to gain an understanding of what the tool has done.
You should again run the fuzzer for at least 4 hours. For fuzzers based on american fuzzy lop, curating the initial set of tests (the seed corpus) can be useful. Try to use the test inputs provided with the software you are testing. If the inputs are too varied (e.g., source code from many different programming languages), then perhaps limit the test suite to focus the process more.
AFLplusplus is also available within the virtual environment for the class in CSIL, but I recommend running it outside of CSIL if you are able to. The current configuration within CSIL interferes with some of the standard crash recognition and reporting mechanisms, making it a bit harder to find interesting behaviors.
As discussed in class, Hypothesis is a tool for performing property-based testing (primarily in Python). In order to use it effectively, you must: (1) define a generator that can create inputs in your domain, and (2) define validity properties that apply to inputs in that domain. We considered a handful of ways to define such properties in class. This task will simply require you to define validity properties for a few functions. You will not need to submit a write-up for this task. Rather, you will submit the properties that you write as a part of the testing process.
A template for this task is available here.
The code is written in Python and depends on networkx and hypothesis. It should run without extra effort on your part in CSIL, but if you run it on your own machine, you will need to install those packages.
Inside the template, you can find 3 files: invariants.py, relational.py, and test_relational.py. You will only need to modify and submit invariants.py, but the other files can help you to understand the task and run Hypothesis. There are two functions defined inside relational.py: sortish(sequence, cmp) and toposortish(nodes, edges). These define two sorting functions that you will test by writing invariant properties.
sortish: The first function is a nonstable sorting algorithm for lists. It takes in as arguments a list of elements, sequence, and a comparison function, cmp. cmp takes in two elements from the list and returns -1 if the first is smaller, 0 if they are equal, and 1 if the second is smaller. It mutates the passed-in list and puts the elements into ascending sorted order.

toposortish: The second function takes in a list of nodes (labeled by integers) in a directed acyclic graph (DAG) and a list of directed edges (as tuples of the form (source, target)). It returns a new list containing the nodes of the graph in a topological order. In any DAG, a topological ordering gives an order in which nodes can be visited that guarantees a node will be visited before any of its successors. For example, with nodes [1, 2, 3] and edges [(1, 2), (1, 3)], both [1, 2, 3] and [1, 3, 2] are valid topological orders. You can find more details in the wiki link above.

Your task is to write the properties / invariants to check that these functions are implemented correctly.
You should write the properties entirely within invariants.py. You should not change the imports of invariants.py. During grading, the implementations inside relational.py could change, and your property functions may even be called directly with custom inputs from a test suite. You can complete the task by modifying the behavior of is_valid_sort() and is_valid_toposort() to contain appropriate property checks.
This testing is driven by Hypothesis, which will automatically generate random inputs for you to check your work. The basic tests are actually already provided for you inside test_relational.py. You can see how the Hypothesis framework is invoked, and even how I set up the automated generation of random structured inputs like DAGs for the testing process.
As a result, by just writing your invariants inside invariants.py, you are able to achieve significant confidence in the implementations. The provided implementations of sortish and toposortish are correct. After all, they rely on common Python libraries! This means that you should initially work to get your tests to pass. After that, you might consider what incorrect implementations look like in order to make sure that your tests can fail when appropriate, too.

When you are done, you will submit only your invariants.py to CourSys. Changes to other files will be disregarded.
All deliverables should be submitted via CourSys.
For the fuzzing based tasks, you will need to write up a brief report / summary along with artifacts from the fuzzing process.
For the property-based testing task, you should submit your invariants.py.
As a group, you should reflect on the challenges faced, the effort required, and the potential or realized benefits of the tools you used for the projects you examined. What are the strengths and weaknesses of the different fuzz testing tools you used? Are these reflected in your results? Why or why not? How?
Your write-up for the exercise should again include the challenges you faced during this process, as well as your approaches for overcoming them. You should also report any errors indicated by the analyses. For a group of k members, the group should also explain whether k of the crashes/hangs found were real bugs or not. If fewer than k bugs were found, then all discovered errors should be explained. You should also explain, if possible, why you believe fewer than k bugs were found; for instance, the fuzz tester may have gotten stuck performing the same tests on irrelevant code. This may depend on both the tool and the observed results. As a group, you should contrast your experiences with both AFLplusplus and libFuzzer and discuss their strengths and weaknesses with evidence based on your experiences.
In addition to this write-up, you should also submit any test harnesses,
invocations, and the full set of constructed outputs that cause problems as
well as the overall statistics. For AFLplusplus, this means that you should submit the crashes/ and hangs/ subdirectories of the output directory as well as the fuzzer_stats file for each experiment.
For libFuzzer, you should include any test files
corresponding to failures along with the output of the overall testing process.
Be careful. AFLplusplus
may warn you that the program you are analyzing is not
instrumented or that you are only identifying one path after many executions.
These can be signs that you are not correctly running the fuzzer to fuzz the
program of interest. Again, I expect you not to make these mistakes. We have
discussed some of the potential causes in class.
We are only scratching the surface of fuzzing in this exercise. Other fuzzing tools can do things like combine coverage-guided fuzzing with language-based models to make sure that generated inputs are syntactically valid [3]. They can help to automatically identify side-channel vulnerabilities [4]. Google has even worked to integrate libFuzzer with protobufs to generate valid inputs by piggybacking on common infrastructure in many projects [5]. With all of these advances also come challenges. Fuzzing can be nondeterministic, so knowing when one fuzzer is better than another requires careful methodology [6].