As we discussed in class, random testing provides a way to get more power than traditional unit tests provide. This exercise will help you gain experience with two forms of random testing: (1) fuzz testing and (2) property based testing.
Fuzz testing provides a way to perform ongoing, security oriented testing. It tries to explore behaviors that may not have been otherwise considered during the testing process. From a security perspective, fuzz testing is sometimes even called automated penetration testing (also now used to refer to breach and attack simulation). It provides a critical part of developing proactively with an eye toward security.
Property based testing, on the other hand, attempts to get more power out of each unit test. Rather than arranging, acting, and asserting on a single scenario, property based testing attempts to sample and test many similar scenarios. In addition, it enables testing of otherwise hard to test code, such as code with nondeterminism or multiple possible valid outputs.
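To make the contrast with a single arrange, act, assert scenario concrete, here is a minimal sketch of sampling many similar scenarios. It uses only Python's standard library rather than a real property based testing framework, and the names `dedupe`, `dedupe_property`, and `check_property` are hypothetical, chosen only for illustration.

```python
import random


def dedupe(items):
    """Code under test: return the distinct elements in first-seen order."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


def dedupe_property(items):
    """A property that should hold for every input, not just one scenario."""
    output = dedupe(items)
    no_duplicates = len(output) == len(set(output))
    same_elements = set(output) == set(items)
    return no_duplicates and same_elements


def check_property(prop, trials=1000, seed=0):
    """Sample many random inputs; return a counterexample, or None if none was found."""
    rng = random.Random(seed)
    for _ in range(trials):
        items = [rng.randint(-5, 5) for _ in range(rng.randint(0, 20))]
        if not prop(items):
            return items
    return None


counterexample = check_property(dedupe_property)
```

Note that the property never names a specific expected output. It only states what must be true of any output, which is what makes the same approach usable when several different outputs are valid.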
Note, while the programming skills involved in this exercise may be simple, fuzz testing again takes time. You will need to allocate 8 hours of total running time for running the fuzz testing experiments. This can be done overnight. For fuzz testing, it is recommended that you try running each technique for a few minutes first in order to look for progress and sanity check that the way you are using the tool is working.
You will work in groups of two or three students for the exercise.
For the two fuzzing tasks, you will need to select at least one project to analyze.
The software that you analyze should be an open source project of some sort. Any analyzed project should contain at least 4000 lines of code and must include an established test suite and automated test process. You are also free to analyze two different projects, one for each fuzzing task. Once again, you are free to consider different projects listed on www.github.com, www.sourceforge.net, www.openhub.net, www.gnu.org, or other major collections of open source software.
Once again, you should identify and consider:
Again, include this information in your report.
libFuzzer is a fuzzing engine produced at Google within the LLVM project and constantly running within their OSS-Fuzz initiative. As per that site, OSS-Fuzz had found more than 30,000 bugs by 2021. You can read more about fuzzing at Google here. Or watch some of their videos [1] [2].
Of the tools you will use, libFuzzer requires more work to set up, but it isn't too bad. The burden comes because libFuzzer does not fuzz an entire application. Instead, it only fuzzes a single function (and others transitively called) at a time. This gives developers more precise control over where time is spent when fuzzing, but it also requires a developer to specify how a random string of bits is translated into input for that function, e.g. by constructing the arguments out of a sequence of bits. This function level focus is particularly useful because many bugs and security vulnerabilities are found in file parsing functions, and libFuzzer tends to be straightforward to use in these contexts. It also enables libFuzzer to perform in process fuzzing, where the tested function is called repeatedly in a loop instead of in a fresh process. This tends to be an order of magnitude faster than out of process fuzzing, thus finding more bugs with fewer resources.
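The translation from raw bytes into arguments can be sketched in Python. A real libFuzzer harness is written in C or C++ as an `LLVMFuzzerTestOneInput` function. Everything below is a hypothetical illustration of the idea rather than libFuzzer itself, including `parse_rectangle`, `fuzz_one_input`, and the random byte loop standing in for the fuzzing engine.

```python
import random
import struct


def parse_rectangle(width, height, label):
    """Hypothetical function under test: it takes structured arguments."""
    if width < 0 or height < 0:
        raise ValueError("negative dimension")
    return {"area": width * height, "label": label}


def fuzz_one_input(data: bytes) -> None:
    """Harness: carve the arguments out of an arbitrary byte string."""
    if len(data) < 4:
        return  # not enough bytes to build the arguments; skip this input
    # The first 4 bytes become two signed 16-bit integers, the rest a label.
    width, height = struct.unpack_from("<hh", data, 0)
    label = data[4:].decode("latin-1")
    try:
        parse_rectangle(width, height, label)
    except ValueError:
        pass  # a documented rejection of bad input is not a bug


def run_in_process(iterations=10_000, seed=0):
    """Call the harness repeatedly within one process, with no fresh process per input."""
    rng = random.Random(seed)
    executed = 0
    for _ in range(iterations):
        length = rng.randint(0, 16)
        data = bytes(rng.getrandbits(8) for _ in range(length))
        fuzz_one_input(data)  # an uncaught exception would end the loop here
        executed += 1
    return executed


executions = run_in_process()
```

The harness decides which failures count: here a `ValueError` is treated as expected behavior, while any other exception would propagate and stop the run.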
You will use libFuzzer to test a single parsing function for some open source
software. You will need to write a small test driver that calls a single
function from the software and passes in a sequence of bytes. You can find a
complete tutorial for libFuzzer here.
Note, writing the test harness may require you to link against some of the libraries or object files of the software project in question. It may also require you to modify the build process of the project in question to include the -fsanitize=address,fuzzer command line options. You should document the process that you used to get it working in your project write up.
You should run the fuzzer for at least 4 hours or until the first crash.
Note this limitation of in process fuzzing: after finding the first crash, it
stops. Consider: why does this limitation exist?
Ideally, you would run it for at least 24 hours to get a better picture of the
behavior and find more interesting things.
Interesting behaviors with fuzzing frequently arise after a week or more.
Feel free to do so if you are able to.
You can use the screen or tmux command to log out of a machine while your tool continues to run and then log back in later to see the results.
Google now also provides support for evaluating novel fuzz testing methodologies.
Consider what their report presentations indicate about the behavior of the
fuzzer and how well it performed.
libFuzzer is distributed as a part of the LLVM project and Clang. In CSIL, I
have made a development ready install of Clang 15.0 along with libFuzzer
available via a shared directory.
You can take advantage of these by modifying your path. Specifically, you can add the following line at the end of your ~/.bash_profile:
export PATH=/usr/shared/CMPT/faculty/wsumner/base/bin/:/usr/shared/CMPT/faculty/wsumner/llvm/bin:$PATH
The next time you log in, these will be available to you. You can try the provided toy example to double check that it is working.
Alternatively, you should also get access to that version of clang and LLVM through the virtual environment set up for the class in CSIL:
source /usr/shared/CMPT/faculty/wsumner/base/env473/bin/activate
AFLplusplus is a community maintained fork of american fuzzy lop. It includes many additional patches that can affect or be configured to control the fuzzing process.
AFL (and its forks) is a coverage guided fuzzer. It tracks coverage by changing the behavior of the compiler to add extra instructions into the program to measure coverage as it executes (it is a dynamic analysis). Specifically, classical AFL style fuzzers approximate unique paths through the program by computing a hash code that identifies the path based on the basic blocks traversed. It then keeps inputs that traverse new paths around for further exploration. In practice, AFLplusplus employs many additional approaches and heuristics on top of this.
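The idea of keeping inputs that reach new paths can be sketched as a toy loop. This is a simplified illustration in Python, not the AFLplusplus implementation. The target `toy_target`, the block identifiers, and the single byte mutation are all assumptions made for the example, and the explicit `trace.append(...)` calls stand in for the instructions a compiler would insert.

```python
import random


def toy_target(data: bytes, trace: list) -> None:
    """Hypothetical program under test; each append marks a basic block."""
    trace.append(1)
    if len(data) > 0 and data[0] == ord("F"):
        trace.append(2)
        if len(data) > 1 and data[1] == ord("U"):
            trace.append(3)
            if len(data) > 2 and data[2] == ord("Z"):
                trace.append(4)


def path_signature(trace):
    """Approximate the path by hashing the (previous, current) block pairs."""
    edges = set()
    previous = 0
    for block in trace:
        edges.add((previous, block))
        previous = block
    return hash(frozenset(edges))


def mutate(data: bytes, rng) -> bytes:
    """Overwrite one byte or append one byte, chosen at random."""
    buffer = bytearray(data)
    if buffer and rng.random() < 0.5:
        buffer[rng.randrange(len(buffer))] = rng.randrange(256)
    else:
        buffer.append(rng.randrange(256))
    return bytes(buffer)


def fuzz(seeds, iterations=20_000, seed=0):
    """Keep any mutated input whose path signature has not been seen before."""
    rng = random.Random(seed)
    queue = list(seeds)
    seen = set()
    for _ in range(iterations):
        candidate = mutate(rng.choice(queue), rng)
        trace = []
        toy_target(candidate, trace)
        signature = path_signature(trace)
        if signature not in seen:
            seen.add(signature)
            queue.append(candidate)  # kept around for further exploration
    return queue


queue = fuzz([b"A"])
```

Because an input that gets one comparison deeper is kept and mutated further, the loop can build on partial progress instead of having to guess a whole input at once.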
You will use AFLplusplus to test the behavior of a program or library when reading in some sort of input from the user (or a file). You can examine the results inside the output directory in order to gain an understanding of what the tool has done.
You should again run the fuzzer for at least 4 hours. For fuzzers based on american fuzzy lop, curating the initial set of tests can be useful. Try to use the test inputs provided with the software you are testing. If the inputs are too varied (e.g. source code from many different programming languages), then perhaps limit the test suite to focus the process more.
AFLplusplus is also available within the virtual environment for the class in CSIL, but I recommend running it outside of CSIL if you are able to. The current configuration within CSIL interferes with some of the standard crash recognition and reporting mechanisms, making it a bit harder to find interesting behaviors.
As discussed in class, Hypothesis is a tool for performing property based testing (primarily in Python). In order to use it effectively you must: (1) define a generator that can create inputs in your domain, and (2) define validity properties that apply to inputs in that domain. We considered a handful of ways to define such properties in class. This task will simply require you to define validity properties for a few functions. You will not need to submit a write-up for this task. Rather, you will submit the properties that you write as a part of the testing process.
A template for this task is available here.
The code is written in Python and depends on networkx and hypothesis. It should run without extra effort on your part in CSIL, but if you run it on your own machine, you will need to install those packages.
Inside the template, you can find 3 files: invariants.py, relational.py, and test_relational.py. You will only need to modify and submit invariants.py, but the other files can help you to understand the task and run Hypothesis.
There are two functions defined inside relational.py: sortish(sequence,cmp) and toposortish(nodes,edges). These define two sorting functions that you will test by writing invariant properties.
sortish: The first function is a nonstable sorting algorithm for lists. It takes in as arguments a list of elements sequence and a comparison function cmp. cmp takes in two elements from the list and returns -1 if the first is smaller, 0 if they are equal, and 1 if the second is smaller. It mutates the passed in list and puts the elements into ascending sorted order.
toposortish: The second function takes in a list of nodes (labeled by integers) in a directed acyclic graph (DAG) and a list of directed edges (as tuples of the form (source, target)). It returns a new list containing the nodes of the graph in a topological order. In any DAG, a topological ordering gives an order in which nodes can be visited that guarantees a node will be visited before any of its successors. You can find more details in the wiki link above.
Your task is to write the properties / invariants to check that these functions are implemented correctly.
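As a hedged illustration of the conventions described above, the sketch below shows a comparison function in the cmp style and a small DAG with orderings that do and do not respect its edges. The names `compare_ints` and `respects_edges` are hypothetical and are not part of the template. The sort uses `functools.cmp_to_key` from the standard library rather than sortish itself, and `respects_edges` checks only the ordering constraint, which is not by itself a complete validity property.

```python
from functools import cmp_to_key


def compare_ints(a, b):
    """cmp-style comparison: -1 if a is smaller, 0 if equal, 1 if b is smaller."""
    if a < b:
        return -1
    if a > b:
        return 1
    return 0


values = [3, 1, 2, 1]
# list.sort works in place, similar to how sortish mutates the list it is given.
values.sort(key=cmp_to_key(compare_ints))

# A small DAG with edges 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3.
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]


def respects_edges(order, edges):
    """True if every source appears before its target in the given order."""
    position = {node: index for index, node in enumerate(order)}
    return all(position[source] < position[target] for source, target in edges)


# More than one ordering can be valid for the same graph.
both_valid = respects_edges([0, 1, 2, 3], edges) and respects_edges([0, 2, 1, 3], edges)
# Visiting node 1 before its predecessor 0 violates the edge (0, 1).
invalid = respects_edges([1, 0, 2, 3], edges)
```

The DAG example shows why comparing against a single expected output does not work here: both `[0, 1, 2, 3]` and `[0, 2, 1, 3]` are acceptable answers.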
You should write the properties entirely within invariants.py. You should not change the imports of invariants.py.
During grading the implementations inside relational.py could change, and your property functions may even be called directly with custom inputs from a test suite.
You can complete the task by modifying the behavior of is_valid_sort() and is_valid_toposort() to contain appropriate property checks.
This testing is driven by Hypothesis, which will automatically generate
random inputs for you to check your work.
The basic tests are actually already provided for you inside test_relational.py. You can see how the Hypothesis framework is invoked, and even how I set up the automated generation of random structured inputs like DAGs for the testing process. As a result, by just writing your invariants inside invariants.py, you are able to achieve significant confidence in the implementations.
The provided implementations of sortish and toposortish are correct. After all, they rely on common Python libraries! This means that you should initially work to get your tests to pass. After that, you might consider what incorrect implementations look like in order to make sure that your tests can fail when appropriate, too.
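One way to check that a property can fail is to run it against a deliberately broken implementation. The sketch below does this for a different, hypothetical function (`largest` and the buggy `largest_buggy`) rather than for the functions in the template, and it uses plain random sampling from the standard library instead of Hypothesis.

```python
import random


def largest(items):
    """Correct implementation: the maximum of a nonempty list."""
    return max(items)


def largest_buggy(items):
    """Deliberately wrong: ignores the last element of longer lists."""
    return max(items[:-1]) if len(items) > 1 else items[0]


def weak_property(impl, items):
    """Too weak: only checks that the result came from the input."""
    return impl(items) in items


def strong_property(impl, items):
    """Also checks that no element of the input exceeds the result."""
    result = impl(items)
    return result in items and all(result >= item for item in items)


def find_counterexample(prop, impl, trials=1000, seed=0):
    """Return a sampled input on which the property fails, or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        items = [rng.randint(0, 9) for _ in range(rng.randint(1, 6))]
        if not prop(impl, items):
            return items
    return None


# The weak property cannot tell the two implementations apart.
weak_result = find_counterexample(weak_property, largest_buggy)
# The strong property rejects the buggy one on an input such as [1, 5].
strong_rejects = not strong_property(largest_buggy, [1, 5])
```

A property that passes on both the correct and the broken implementation is not checking enough, which is the situation to look for in your own invariants.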
When you are done, you will submit only your invariants.py to CourSys. Changes to other files will be disregarded.
All deliverables should be submitted via CourSys.
For the fuzzing based tasks, you will need to write up a brief report / summary along with artifacts from the fuzzing process.
For the property based testing task, you should submit your invariants.py.
As a group, you should reflect on the challenges faced, effort required, and either potential or received benefits of the tools you used for the projects you examined. What are the strengths and weaknesses of the different fuzz testing tools you used? Are these reflected in your results? Why or why not? How?
Your write-up for the exercise should again include the challenges you faced during this process, as well as your approaches for overcoming them. You should also report any errors indicated by the analyses. For groups of k members, the groups should also explain whether k of the crashes/hangs found were real bugs or not. If fewer than k bugs were found, then all discovered errors should be explained. You should also explain why you believe fewer than k bugs were found if possible. For instance, the fuzz tester may have gotten stuck performing the same tests on irrelevant code. The explanation may depend on both the tool and the observed results. As a group, you should contrast your experiences with both AFLplusplus and libFuzzer and discuss their strengths and weaknesses with evidence based on your experiences.
In addition to this write-up, you should also submit any test harnesses, invocations, and the full set of constructed outputs that cause problems as well as the overall statistics. For AFLplusplus, this means that you should submit the crashes/ and hangs/ subdirectories of the output directory as well as the fuzzer_stats file for each experiment. For libFuzzer, you should include any test files corresponding to failures along with the output of the overall testing process.
Be careful. AFLplusplus may warn you that the program you are analyzing is not instrumented or that you are only identifying one path after many executions. These can be signs that you are not correctly running the fuzzer to fuzz the program of interest. Again, I expect you to not make these mistakes. We have discussed some of the potential causes in class.
We are only scratching the surface of fuzzing in this exercise. Other fuzzing tools can do things like combine coverage guided fuzzing with language based models to make sure that generated inputs are syntactically valid [3]. They can help to automatically identify side-channel vulnerabilities [4]. Google has even worked to integrate libFuzzer with protobufs to generate valid inputs by piggybacking on common infrastructure in many projects [5]. With all of these advances also come challenges. Fuzzing can be nondeterministic, so knowing when a fuzzer is better than other fuzzers requires careful methodology [6].