Unit Testing

Introduction

testing is an important topic!

  • imagine an engineer built a new bridge, but never actually put any load on it, i.e. they never tested it: would you be willing to walk across that bridge?

how do you know your program works correctly if you don’t test it?

  • some computer scientists argue that it might be possible to write a mathematical proof that a program is correct
  • but so far that idea is not practical: it is still a research topic, used only in special cases

in these notes we’ll look just at unit testing, which means testing individual functions

unit testing is in contrast to system testing, where the entire system is tested

  • unit testing tests just part of the entire system
  • a problem with system testing is that you can’t do it until the system is implemented
  • when developing a program, it’s a good idea to test as you go, i.e. after you write an important function you should immediately test it to make sure it’s correct
  • unit testing lets you “test as you go”, while system testing waits to do all the testing at the very end

Introduction

testing is sometimes seen as tedious, unglamorous work

  • it doesn’t seem to be as creative, or fun, as writing code
  • however, in reality testing can be a creative and interesting challenge that is important to do well

some programmers don’t like “breaking” things they’ve made

  • some testers say they try to get into a “mean and nasty” state of mind to do good testing
  • good testers try hard to break the software they’re testing

yet, testing is an important part of all good software, and cannot be skipped

What is a test?

the basic structure of a test is this (f is the function being tested):

if (f(test_input) == expected_result) {
    // test passed!
} else {
    // test failed
}

you run f on a test input, and check that the value it returns matches the expected result

where does test_input come from?

  • it is often hand-picked, i.e. the programmer chooses important or tricky inputs
  • it might be randomly generated by some other function

where does expected_result come from?

  • for hand-generated test inputs, the expected result is also typically hand generated
  • if you happen to have another function that computes the same result, you could compare f’s output to it
    • this is reasonably common, e.g. f could be a new and improved version of some older function

Testing Should be Done Automatically

all our testing will be done automatically, i.e. all tests will be run and checked automatically

sometimes programmers just print out the results of their tests and manually check to see if that’s correct

the problem with manual testing is that it is tedious and error-prone, and almost all programmers stop doing it (or don’t do it carefully)

Testing Workflow 1: Code First

suppose you are writing the function f

  • write f fully, as well as you can
  • then write some tests for f
  • run the tests on f
  • if any tests fail, then fix f (or fix the test, if it’s wrong) and go to the previous step
  • if no tests fail, then maybe f is correct

Testing Workflow 2: Test First

called test-driven development (TDD)

suppose you are writing the function f

  • first write some tests for f
  • then write f fully, as well as you can
  • run the tests on f
  • if any tests fail, then fix f (or fix the test, if it’s wrong) and go to the previous step
  • if no tests fail, then maybe f is correct

many programmers like TDD because writing the test cases first helps them understand and implement f

Testing Workflow 3: Mixed

suppose you are writing the function f

  • write a little bit of f
  • then write some tests for this partial version of f
  • run the tests on f
  • if any tests fail, then fix f (or fix the test, if it’s wrong) and go to the previous step
  • if f is not fully implemented, go to the first step
  • if no tests fail, then maybe f is correct

this mixed approach interleaves testing and implementation, and in general it is the recommended approach

Success and Failure

what should you do when a test succeeds?

  • in simple test situations, you could just print messages to the screen saying whether a test succeeded
  • or, more commonly in practice, the success messages are written to a log file so that you can look at the results later
  • if all your tests pass, then maybe your code is correct

what should you do when a test fails?

  • in simple test situations, you could just print a message saying the test failed, then keep going and do more tests
  • or you could immediately stop the program with a message indicating which test failed
    • that’s what assert-style testing below does: a failed test immediately stops the program
    • crashing after a failed test is fine for us since we are dealing with relatively small programs
    • but for larger programs that run a lot of tests, one failed test should not stop the entire system
      • instead, the test that caused the failure should be recorded (e.g. written to a file), and the following tests should be run
  • after you (hopefully) fix a bug, the failed tests should be run again to ensure that your changes really do fix it: this is known as regression testing
  • you can also record (e.g. in a file) which tests succeeded/failed
    • tests that cause failure are valuable: a test that catches a bug is useful!

Example: trim Function

the trim function removes leading and trailing spaces from a string

for example, trim("  cat ") returns "cat"

// returns a copy of s with all leading and trailing
// spaces removed
string trim(const string& s)

we’ll use it as an example in the following discussion

Blackbox Testing

black box testing is when you create tests without knowing (or looking at) a function’s implementation

black box test cases can be created before (!) you implement a function

Some Blackbox Test Cases for trim

trim(" cat  ") should return "cat"

trim(" cat") should return "cat"

trim("cat  ") should return "cat"

trim("cat") should return "cat"

Some Blackbox Test Cases

spaces in the middle of the string should be untouched

trim(" cat  dog ") should return "cat  dog"

trim(" cat  dog") should return "cat  dog"

trim("cat  dog ") should return "cat  dog"

trim("cat  dog") should return "cat  dog"

Extreme Values

a useful rule of thumb is to test extreme values

e.g. empty strings, strings with 1 character, empty files, min/max possible numbers, etc.

trim("") should return ""

trim(" ") should return ""

trim("  ") should return ""

trim(" a ") should return "a"

trim("a ") should return "a"

trim(" a") should return "a"

trim("a") should return "a"

as another example, think of what extreme values you would test a function f(int n) with

  • some extreme values of int: 0, -1, 1, max int, min int
  • if you happen to know more about what f is supposed to do, then you might be able to come up with other extreme values
  • note that if you have a function that takes two int values, say g(int m, int n), then both int inputs could take on any one of the 5 extreme values for int, so that means there are 5*5=25 extreme value test cases for g

Whitebox Testing

whitebox tests are tests that are created by looking at the implementation of a function

here is one possible implementation of trim

string trim(const string& s) {
    int begin = 0;
    while (begin < s.size() && s[begin] == ' ') ++begin;
    int end = s.size() - 1;
    while (end >= 0 && s[end] == ' ') --end;
    return s.substr(begin, end - begin + 1);
}

looking at the function, you can write tests to “cover” (i.e. execute) each line of the function

there are usually multiple ways to implement a function, and different implementations may suggest different whitebox tests

thus the usefulness of some whitebox tests can change when you change a function’s implementation

for example, this implementation might lead you to create different whitebox tests:

string trim(const string& s) {
    size_t first = s.find_first_not_of(' ');
    if (first == string::npos) return "";
    size_t last = s.find_last_not_of(' ');
    return s.substr(first, last - first + 1);
}

How Many Test Cases is Enough?

good question!

generally, you want to do the least amount of testing necessary to make you confident that your function works correctly in all important cases

for some functions, it might be possible to test all possible inputs and outputs, e.g.:

// if a and b are different, then true is returned;
// otherwise, false is returned
bool exclusive_or(bool a, bool b);

exclusive_or only has 4 possible different inputs, so we can easily test them all

but more typically, functions have far more input values than could ever be practically tested; for example:

// returns the sum of all the elements of v;
// the sum of an empty vector is 0
int sum(const vector<int>& v);

assuming an int is 32 bits, and that, say, v could have at most 2 billion elements, then there are 32 * 2 billion bits of input to sum, which means there are \(2^{64\, billion}\) cases, which is an astronomical amount — there’s no way to test that many different cases!

sometimes extra test cases add little value

e.g. trim("cat"), trim("dog"), trim("mouse"), trim("hat")

one or two of these would be enough

try to categorize inputs into different groups, and have one test case for each group

also, have a reason for each test case: what is it testing for?

test cases that find a lot of bugs are more valuable than those that find only a few bugs

Coding Test Cases

easy but tedious!

#include "cmpt_error.h"

void trim_test() {
    if (trim(" cat  ") != "cat") {
        cmpt::error("test case failed");
    }

    if (trim(" cat") != "cat") {
        cmpt::error("test case failed");
    }

    if (trim("cat ") != "cat") {
        cmpt::error("test case failed");
    }

    // ...

    cout << "All trim tests passed\n";
}

cmpt::error(msg) intentionally crashes the program and prints msg in the error to help with debugging

Coding Test Cases

a more convenient way to write many tests is to use the standard assert macro, e.g.

#include <cassert>

void trim_test() {
    assert(trim(" cat  ") == "cat");
    assert(trim(" cat") == "cat");
    assert(trim("cat ") == "cat");

    // ...

    cout << "All trim tests passed\n";
}

assert(bool_expr) does nothing if bool_expr is true

but if bool_expr is false, it (intentionally) crashes your program and prints the file name and line number of the assertion that failed

note that asserts are disabled if the program is compiled with the macro NDEBUG defined, so never put code your program needs (i.e. code with side effects) inside an assert

Coding Test Cases

another nice way to organize test cases is use what is sometimes called table-based testing:

struct Test_case {
    string input;
    string expected_output;
};

vector<Test_case> all_tests = {
    Test_case{"", ""},
    Test_case{" ", ""},
    Test_case{"  ", ""},
    Test_case{"a", "a"},
    Test_case{" a", "a"},
    Test_case{"a ", "a"},
    Test_case{" a ", "a"},
    Test_case{"ab", "ab"},
    Test_case{" ab", "ab"},
    Test_case{" ab ", "ab"},
    Test_case{"a b", "a b"},
    Test_case{" a b", "a b"},
    Test_case{" a b ", "a b"},
};

void do_tests() {
    for (const Test_case& tc : all_tests) {
        cout << "trim(\"" << tc.input << "\") ... ";
        string actual = trim(tc.input);
        if (actual == tc.expected_output) {
            cout << "passed\n";
        } else {
            cout << "failed:\n"
                 << "     expected \"" << tc.expected_output << "\"\n"
                 << "     returned \"" << actual << "\"\n";
        }
    }
}

now it is easy to run, add, and remove tests

of course, this is a lot of work to test just one function!

but for important functions that must be checked carefully, it may well be worth the effort

if you search online, you can find many C++ testing frameworks that can help organize your test cases and make testing easier and more productive

Testing Doesn’t Catch All Bugs!

just because your code passes all its tests doesn’t mean it is bug free!

for example, consider this intentionally wrong version of trim:

// bug: begin <= s.size() in first while loop
//      should be begin < s.size()
string trim_bug(const string& s) {
    int begin = 0;
    while (begin <= s.size() && s[begin] == ' ') ++begin; // bug: should be begin < s.size()
    int end = s.size() - 1;
    while (end >= 0 && s[end] == ' ') --end;
    return s.substr(begin, end - begin + 1);
}

the bug is easy to miss: the expression begin <= s.size() is incorrect, and should be begin < s.size()

all the tests in all_tests above pass! they don’t catch this bug!

the problem is that indexing a C++ string with [] does no bounds checking: in older versions of C++, s[s.size()] was an out-of-bounds access with an undefined result

the first while-loop sometimes reads s[s.size()], but as long as the value at that location is not a space the bug stays hidden

if s[s.size()] happened to be a space, there would probably be a noticeable bug

since C++11, s[s.size()] is actually guaranteed to be the null character '\0', so on a modern compiler this particular bug may never visibly misbehave; but the same off-by-one mistake on a vector or an array is undefined behavior, and could evaluate to any value at all

so, in addition to test cases, you also need to check for valid indexing, e.g. by using s.at(i), which throws an exception for an out-of-range i, to catch this kind of error

Testing Doesn’t Catch All Bugs!

one way to deal with indexing errors like this is to write your functions in a way that uses less indexing

for example, this version of trim has no explicit string accesses:

string trim(const string& s) {
    size_t first = s.find_first_not_of(' ');
    if (first == string::npos) return "";
    size_t last = s.find_last_not_of(' ');
    return s.substr(first, last - first + 1);
}

this code is easier to test because it is less complex, and relies on standard C++ functions that are almost certainly correct (they have been used by 1000s of programmers for decades)

in general, the simpler the implementation of a function the better … simpler code is easier to understand and to test

Bugs that Testing has a Hard Time Catching

here’s an interesting function:

// meant to always return 3, but has a bug
int three() {
    if (rand() == 0) {
        return 2;
    } else {
        return 3;
    }
}

three() almost always returns 3, as expected

but every once in a (usually) long while, it returns 2 — which is incorrect

  • rand() returns a value from 0 to RAND_MAX, so you’d expect about 1 out of every RAND_MAX + 1 calls to three() to return the wrong result; on many systems RAND_MAX is \(2^{31} - 1\)

there is no particular test case that would cause 2 to be returned — it is totally random

there are really only two ways to notice this error:

  1. read the implementation and notice that it has this crazy error
  2. do a lot of testing, e.g. billions and billions of calls

this might seem like a contrived example, but

  • low-level errors like this can occur in hardware
  • such bugs are usually disguised in ordinary code, e.g. the begin <= s.size() bug in the trim function from above is also a time bomb bug that no particular test case can catch
  • in concurrent programs, i.e. programs where more than one thread of control may be active at the same time (e.g. on computers with 2 or more CPUs), extremely subtle and rare bugs can arise from interactions between the threads of control
    • the exact conditions under which a bug in a concurrent system occurs can sometimes be extremely difficult to re-create
    • such bugs are sometimes called race conditions
    • concurrent programs are extremely common in practice

Practice Questions

  1. Explain the difference between unit testing and system testing. What’s one of the problems with system testing?
  2. Describe the general form of a test for a function f.
  3. Why is manually checking the results of tests usually a bad idea?
  4. Describe how test-driven development (TDD) works.
  5. Describe two different things testing code might do when a test fails.
  6. Explain blackbox and whitebox testing. How do they differ?
  7. Explain extreme value testing. What are some extreme test values for a function that takes one int as input?
  8. Suppose you want to test the function g(int a, int b) with extreme values only. What extreme values would you test g with?
  9. What is the general rule you should follow when deciding how much testing to do for a function?
  10. Explain why a function might not be correct even though it passes all its tests.
  11. Give an example of a function with a bug that testing would have an extremely hard time catching. Instead of testing, how might the bug be caught?