Unit Testing¶
Introduction¶
testing is an important topic!
- imagine an engineer made a new bridge, but had never actually put anything on it, i.e. they never tested it: would you be willing to walk across that bridge?
how do you know your program works correctly if you don’t test it?
- some computer scientists argue that it might be possible to write a mathematical proof that a program is correct
- but so far that idea is not practical, and it is still a research topic used only in special cases
in these notes we’ll look just at unit testing, which means testing individual functions
unit testing is in contrast to system testing, where the entire system is tested
- unit testing tests just part of the entire system
- a problem with system testing is that you can’t do it until the system is implemented
- when developing a program, it’s a good idea to test as you go, i.e. after you write an important function you should immediately test it to make sure it’s correct
- unit testing lets you “test as you go”, while system testing waits to do all the testing at the very end
Introduction¶
testing is sometimes seen as tedious, unglamorous work
- it doesn’t seem to be as creative, or fun, as writing code
- however, in reality testing can be a creative and interesting challenge that is important to do well
some programmers don’t like “breaking” things they’ve made
- some testers say they try to get into a “mean and nasty” state of mind to do good testing
- good testers try hard to break the software they’re testing
yet, testing is an important part of all good software, and cannot be skipped
What is a test?¶
the basic structure of a test is this (f is the function being tested):
if (f(test_input) == expected_result) {
    // test passed!
} else {
    // test failed
}
you run f on a test input, and check that the value it returns matches the expected result
where does test_input come from?
- it is often hand-picked, i.e. the programmer chooses important or tricky inputs
- it might be randomly generated by some other function
where does expected_result come from?
- for hand-generated test inputs, the expected result is also typically hand generated
- if you happen to have another function that computes the same value, you could compare f’s result to its result
- this is reasonably common, e.g. f could be a new and improved version of some older function
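Comparing a new version of a function against an older, trusted version works especially well with randomly generated inputs. Here is a minimal sketch of the idea; the names trim_old, trim_new, random_string, and random_trim_tests are all made up for this example, and trim_old is simply assumed to be correct:

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

using namespace std;

// hypothetical older, trusted implementation used as the reference
string trim_old(const string& s) {
    size_t first = s.find_first_not_of(' ');
    if (first == string::npos) return "";
    size_t last = s.find_last_not_of(' ');
    return s.substr(first, last - first + 1);
}

// hypothetical "new and improved" version being tested
string trim_new(const string& s) {
    int begin = 0;
    while (begin < (int)s.size() && s[begin] == ' ') ++begin;
    int end = (int)s.size() - 1;
    while (end >= 0 && s[end] == ' ') --end;
    return s.substr(begin, end - begin + 1);
}

// returns a random string (length 0 to max_len) of spaces and letters
string random_string(int max_len) {
    string result;
    int n = rand() % (max_len + 1);
    for (int i = 0; i < n; ++i) {
        result += " abc"[rand() % 4];  // ' ', 'a', 'b', or 'c'
    }
    return result;
}

// run num_tests randomly generated tests; returns true if the new
// version always agrees with the old one
bool random_trim_tests(int num_tests) {
    for (int i = 0; i < num_tests; ++i) {
        string s = random_string(10);
        if (trim_new(s) != trim_old(s)) return false;
    }
    return true;
}
```

Testing one implementation against another like this is sometimes called back-to-back testing.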
Testing Should be Done Automatically¶
all our testing will be done automatically, i.e. all tests will be run and checked automatically
sometimes programmers just print out the results of their tests and manually check whether the results are correct
the problem with manual testing is that it is tedious and error-prone, and almost all programmers stop doing it (or don’t do it carefully)
Testing Workflow 1: Code First¶
suppose you are writing the function f
- write f fully, as well as you can
- then write some tests for f
- run the tests on f
- if any tests fail, then fix f (or fix the test, if it’s wrong) and go to the previous step
- if no tests fail, then maybe f is correct
Testing Workflow 2: Test First¶
called test-driven development (TDD)
suppose you are writing the function f
- first write some tests for f
- then write f fully, as well as you can
- run the tests on f
- if any tests fail, then fix f (or fix the test, if it’s wrong) and go to the previous step
- if no tests fail, then maybe f is correct
many programmers like TDD because writing the test cases first helps them understand and implement f
Testing Workflow 3: Mixed¶
suppose you are writing the function f
- write a little bit of f
- then write some tests for this partial version of f
- run the tests on f
- if any tests fail, then fix f (or fix the test, if it’s wrong) and go to the previous step
- if f is not fully implemented, go to the first step
- if no tests fail, then maybe f is correct
this mixed approach mixes testing and implementation, and in general this is the recommended approach
Success and Failure¶
what should you do when a test succeeds?
- in simple test situations, you could just print messages to the screen saying whether a test succeeded
- or, more commonly in practice, the success messages are written to a log file so that you can look at the results later
- if all your tests pass, then maybe your code is correct
what should you do when a test fails?
- in simple test situations, you could just print a message saying the test failed, then keep going and do more tests
- or you could immediately stop the program with a message indicating which test failed
- that’s what assert-style testing below does: a failed test immediately stops the program
- crashing after a failed test is fine for us since we are dealing with relatively small programs
- but for larger programs that run a lot of tests, one failed test should not stop the entire system
- instead, the test that caused the failure should be recorded (e.g. written to a file), and the following tests should be run
- after you (hopefully) fix a bug, previously failed tests should be run again to ensure that your changes really do fix them: this is known as regression testing
- you can also record (e.g. in a file) which tests succeeded/failed
- tests that cause failure are valuable: a test that catches a bug is useful!
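The record-and-keep-going style of test runner described above can be sketched like this; the Test struct, the sample test functions, and run_all_tests are all invented for this example (a real framework would be more elaborate, e.g. writing to a log file instead of a stream parameter):

```cpp
#include <iostream>
#include <string>
#include <vector>

using namespace std;

// a test is a name plus a function that returns true on success
struct Test {
    string name;
    bool (*run)();
};

// sample tests (stand-ins for real test functions)
bool passing_test() { return true; }
bool failing_test() { return false; }

// run every test, record each result in a log, and keep going
// after failures; returns the number of failed tests
int run_all_tests(const vector<Test>& tests, ostream& log) {
    int failures = 0;
    for (const Test& t : tests) {
        if (t.run()) {
            log << t.name << ": passed\n";
        } else {
            log << t.name << ": FAILED\n";
            ++failures;  // record the failure, but don't stop
        }
    }
    return failures;
}
```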
Example: trim Function¶
the trim function removes leading and trailing spaces from a string
for example, trim(" cat ") returns "cat"
// returns a copy of s with all leading and trailing
// spaces removed
string trim(const string& s)
we’ll use it as an example in the following discussion
Blackbox Testing¶
black box testing is when you create tests without knowing (or looking at) a function’s implementation
black box test cases can be created before (!) you implement a function
Some Blackbox Test Cases for trim¶
trim(" cat ") should return "cat"
trim(" cat") should return "cat"
trim("cat ") should return "cat"
trim("cat") should return "cat"
More Blackbox Test Cases for trim¶
spaces in the middle of the string should be untouched
trim(" cat dog ") should return "cat dog"
trim(" cat dog") should return "cat dog"
trim("cat dog ") should return "cat dog"
trim("cat dog") should return "cat dog"
Extreme Values¶
a useful rule of thumb is to test extreme values
e.g. empty strings, strings with 1 character, empty files, min/max possible numbers, etc.
trim("") should return ""
trim(" ") should return ""
trim("   ") should return ""
trim(" a ") should return "a"
trim("a ") should return "a"
trim(" a") should return "a"
trim("a") should return "a"
as another example of extreme values, think of what extreme values you would test f(int n) with
- some extreme values of int: 0, -1, 1, max int, min int
- if you happen to know more about what f is supposed to do, then you might be able to come up with other extreme values
- note that if you have a function that takes two int values, say g(int m, int n), then both int inputs could take on any one of the 5 extreme values for int, so that means there are 5*5=25 extreme value test cases for g
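The 25 extreme-value pairs can be generated with two nested loops rather than written out by hand. A sketch, using a made-up g that returns the larger of its arguments; note that the expected results here are computed with the same max logic as g itself, so in real testing they would be worked out by hand or by a trusted reference:

```cpp
#include <climits>
#include <iostream>
#include <vector>

using namespace std;

// hypothetical function being tested: returns the larger of m and n
int g(int m, int n) {
    return (m > n) ? m : n;
}

// run g on all 5*5 = 25 pairs of extreme int values;
// returns the number of test cases run
int test_g_extremes() {
    vector<int> extremes = {0, -1, 1, INT_MAX, INT_MIN};
    int count = 0;
    for (int m : extremes) {
        for (int n : extremes) {
            int expected = (m > n) ? m : n;  // would normally be hand-computed
            if (g(m, n) != expected) {
                cout << "failed: g(" << m << ", " << n << ")\n";
            }
            ++count;
        }
    }
    return count;
}
```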
Whitebox Testing¶
whitebox tests are tests that are created by looking at the implementation of a function
here is one possible implementation of trim:
string trim(const string& s) {
    int begin = 0;
    while (begin < s.size() && s[begin] == ' ') ++begin;
    int end = s.size() - 1;
    while (end >= 0 && s[end] == ' ') --end;
    return s.substr(begin, end - begin + 1);
}
looking at the function, you can write tests to “cover” (i.e. execute) each line of the function
there are usually multiple ways to implement a function, and different implementations may suggest different whitebox tests
thus the usefulness of some whitebox tests can change when you change a function’s implementation
for example, this implementation might lead you to create different whitebox tests:
string trim(const string& s) {
    int first = s.find_first_not_of(' ');
    if (first == string::npos) return "";
    int last = s.find_last_not_of(' ');
    return s.substr(first, last - first + 1);
}
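For instance, this implementation suggests a whitebox test where every character is a space, since that is the only way to execute the early `return ""` line. A sketch of tests chosen to cover each line (whitebox_trim_tests is a name made up for this example; size_t is used here to avoid signed/unsigned comparison warnings):

```cpp
#include <cassert>
#include <string>

using namespace std;

// the find-based implementation of trim from the notes
string trim(const string& s) {
    size_t first = s.find_first_not_of(' ');
    if (first == string::npos) return "";
    size_t last = s.find_last_not_of(' ');
    return s.substr(first, last - first + 1);
}

void whitebox_trim_tests() {
    // covers the early return: every character is a space, so
    // find_first_not_of returns string::npos
    assert(trim("   ") == "");
    assert(trim("") == "");

    // covers the substr line: both find calls succeed
    assert(trim(" cat ") == "cat");
    assert(trim("cat") == "cat");
}
```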
How Many Test Cases is Enough?¶
good question!
generally, you want to do the least amount of testing necessary to make you confident that your function works correctly in all important cases
for some functions, it might be possible to test all possible inputs and outputs, e.g.:
// if a and b are different, then true is returned;
// otherwise, false is returned
bool exclusive_or(bool a, bool b);
exclusive_or only has 4 possible different inputs, so we can easily test them all
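An exhaustive test for exclusive_or is easy to write out. This sketch assumes the obvious implementation (true exactly when a and b are different):

```cpp
#include <cassert>

// assumed implementation: XOR of two bools is "are they different?"
bool exclusive_or(bool a, bool b) {
    return a != b;
}

// exhaustively test all 4 possible inputs
void test_exclusive_or() {
    assert(exclusive_or(false, false) == false);
    assert(exclusive_or(false, true)  == true);
    assert(exclusive_or(true,  false) == true);
    assert(exclusive_or(true,  true)  == false);
}
```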
but more typically, functions have far more input values than could ever be practically tested; for example:
// returns the sum of all the ints in v
int sum(const vector<int>& v);
assuming an int is 32 bits, and that, say, v could have at most 2 billion elements, then there are 32 * 2 billion = 64 billion bits of input to sum, which means there are \(2^{64\, billion}\) cases, an astronomical amount: there’s no way to test that many different cases!
sometimes extra test cases add little value
e.g. trim("cat"), trim("dog"), trim("mouse"), trim("hat"): one or two of these would be enough
try to categorize inputs into different groups, and have one test case for each group
also, have a reason for each test case: what is it testing for?
test cases that find a lot of bugs are more valuable than those that find only a few bugs
Coding Test Cases¶
easy but tedious!
#include "cmpt_error.h"

void trim_test() {
    if (trim(" cat ") != "cat") {
        cmpt::error("test case failed");
    }
    if (trim(" cat") != "cat") {
        cmpt::error("test case failed");
    }
    if (trim("cat ") != "cat") {
        cmpt::error("test case failed");
    }
    // ...
    cout << "All trim tests passed\n";
}
cmpt::error(msg) intentionally crashes the program and prints msg in the error message to help with debugging
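cmpt_error.h is a course-provided header, but its essential behavior can be sketched like this; this is an assumption about the real header, which may differ in detail:

```cpp
#include <stdexcept>
#include <string>

// a minimal stand-in for the course's cmpt_error.h: the essential
// behavior is to stop the program with an informative message,
// here by throwing an exception carrying msg
namespace cmpt {
    void error(const std::string& msg) {
        throw std::runtime_error(msg);
    }
}
```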
Coding Test Cases¶
a more convenient way to write many tests is to use the standard assert
macro, e.g.
#include <cassert>

void trim_test() {
    assert(trim(" cat ") == "cat");
    assert(trim(" cat") == "cat");
    assert(trim("cat ") == "cat");
    // ...
    cout << "All trim tests passed\n";
}
assert(bool_expr) does nothing if bool_expr is true; but if bool_expr is false, it (intentionally) crashes your program and includes the line number of the assertion that failed
Coding Test Cases¶
another nice way to organize test cases is use what is sometimes called table-based testing:
struct Test_case {
    string input;
    string expected_output;
};

vector<Test_case> all_tests = {
    Test_case{"", ""},
    Test_case{" ", ""},
    Test_case{"  ", ""},
    Test_case{"a", "a"},
    Test_case{" a", "a"},
    Test_case{"a ", "a"},
    Test_case{" a ", "a"},
    Test_case{"ab", "ab"},
    Test_case{" ab", "ab"},
    Test_case{" ab ", "ab"},
    Test_case{"a b", "a b"},
    Test_case{" a b", "a b"},
    Test_case{" a b ", "a b"},
};

void do_tests() {
    for (const Test_case& tc : all_tests) {
        cout << "trim(\"" << tc.input << "\") ... ";
        string actual = trim(tc.input);
        if (actual == tc.expected_output) {
            cout << "passed\n";
        } else {
            cout << "failed:\n"
                 << "  expected \"" << tc.expected_output << "\"\n"
                 << "  returned \"" << actual << "\"\n";
        }
    }
}
now it is easy to run, add, and remove tests
of course, this is a lot of work to test just one function!
but for important functions that must be checked carefully, it may well be worth the effort
if you search online, you can find many C++ testing frameworks that can help organize your test cases and make testing easier and more productive
Testing Doesn’t Catch All Bugs!¶
just because your code passes all its tests doesn’t mean it is bug free!
for example, consider this intentionally wrong version of trim:
// bug: the first while loop uses begin <= s.size(),
// but it should use begin < s.size()
string trim_bug(const string& s) {
    int begin = 0;
    while (begin <= s.size() && s[begin] == ' ') ++begin;  // bug is here
    int end = s.size() - 1;
    while (end >= 0 && s[end] == ' ') --end;
    return s.substr(begin, end - begin + 1);
}
the bug is easy to miss: the expression begin <= s.size() is incorrect, and should be begin < s.size()
all the tests in all_tests above pass! they don’t catch this bug!
the problem is that C++ strings don’t do bounds checking for [], i.e. if s is a string then s[s.size()] is an out-of-bounds access, and so is an error
the first while-loop sometimes checks s[s.size()], but as long as that location is not a space, then it doesn’t cause a problem
but if s[s.size()] happens to be a space, then there will probably be a noticeable bug
the tricky part is that s[s.size()] is invalid, and it evaluates to some unknown value
so, in addition to the test cases, you also need to check for valid array accesses to catch this error
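One way to get that checking is to use the string’s at() member function instead of [] indexing: at() checks its index and throws std::out_of_range when the index is out of bounds, so the bad access becomes visible immediately. A sketch (trim_bug_checked and triggers_bug are names made up for this example):

```cpp
#include <stdexcept>
#include <string>

using namespace std;

// the buggy trim from the notes, but with s[...] replaced by s.at(...);
// at() does bounds checking, so the out-of-bounds access now throws
// std::out_of_range instead of silently reading past the end
string trim_bug_checked(const string& s) {
    int begin = 0;
    while (begin <= s.size() && s.at(begin) == ' ') ++begin;  // bug: <= should be <
    int end = s.size() - 1;
    while (end >= 0 && s.at(end) == ' ') --end;
    return s.substr(begin, end - begin + 1);
}

// returns true if calling trim_bug_checked(s) throws out_of_range,
// i.e. if input s makes the bug visible
bool triggers_bug(const string& s) {
    try {
        trim_bug_checked(s);
        return false;
    } catch (const out_of_range& e) {
        return true;
    }
}
```

Note that only empty or all-space inputs make the first loop run off the end of the string, which is why ordinary test cases like "cat" never expose the bug.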
Testing Doesn’t Catch All Bugs!¶
one way to deal with indexing errors like this is to try to write your functions in a way that uses less indexing
for example, this version of trim has no explicit string indexing:
string trim(const string& s) {
    int first = s.find_first_not_of(' ');
    if (first == string::npos) return "";
    int last = s.find_last_not_of(' ');
    return s.substr(first, last - first + 1);
}
this code is easier to test because it is less complex, and relies on standard C++ functions that are almost certainly correct (they have been used by 1000s of programmers for decades)
in general, the simpler the implementation of a function the better … simpler code is easier to understand and to test
Bugs that Testing has a Hard Time Catching¶
here’s an interesting function:
// meant to always return 3, but has a bug
int three() {
    if (rand() == 0) {
        return 2;
    } else {
        return 3;
    }
}
three() almost always returns 3, as expected
but every once in a (usually) long while, it returns 2, which is incorrect
- if rand() returns values in the range 0 to \(2^{31}-1\), you’d expect about 1 out of every \(2^{31}\) calls to three() to return the wrong result
there is no particular test case that would cause 2 to be returned — it is totally random
there are really only two ways to notice this error:
- read the implementation and notice that it has this crazy error
- do a lot of testing, e.g. billions and billions of calls
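To see how ineffective ordinary testing is here, you can call three() in a loop and count wrong results; for any practical number of calls the count is almost certainly 0, so the test "passes" even though the bug remains. (One caveat: on systems where RAND_MAX is small, e.g. 32767, the bug would actually fire much more often.)

```cpp
#include <cstdlib>

// the buggy function from above: meant to always return 3
int three() {
    if (rand() == 0) {
        return 2;
    } else {
        return 3;
    }
}

// call three() n times and count how many results are wrong;
// with a large RAND_MAX this count is almost always 0, so this
// "test" passes even though the bug is still in the code
int count_wrong_results(int n) {
    int wrong = 0;
    for (int i = 0; i < n; ++i) {
        if (three() != 3) ++wrong;
    }
    return wrong;
}
```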
this might seem like a contrived example, but
- low-level errors like this can occur in hardware
- such bugs are usually disguised in ordinary code, e.g. the begin <= s.size() bug in the trim function from above is also a time bomb bug that no particular test case is guaranteed to catch
- in concurrent programs, i.e. programs where more than one thread of control may be active at the same time (e.g. on computers with 2 or more CPUs), extremely subtle and rare bugs can be caused by interactions of the threads of control
- the exact conditions under which a bug in a concurrent system occurs can sometimes be extremely difficult to re-create
- such bugs are sometimes called race conditions
- concurrent programs are extremely common in practice
Practice Questions¶
- Explain the difference between unit testing and system testing. What’s one of the problems with system testing?
- Describe the general form of a test for a function f.
- Why is manually checking the results of tests usually a bad idea?
- Describe how test-driven development (TDD) works.
- Describe two different things testing code might do when a test fails.
- Explain blackbox and whitebox testing. How do they differ?
- Explain extreme value testing. What are some extreme test values for a function that takes one int as an input?
- Suppose you want to test the function g(int a, int b) with extreme values only. What extreme values would you test g with?
- What is the general rule you should follow when deciding how much testing to do for a function?
- Explain why a function might not be correct even though it passes all its tests.
- Give an example of a function with a bug that testing would have an extremely hard time catching. Instead of testing, how might the bug be caught?