IMPORTANT NOTE: This has been moved to `here <https://csil-git1.cs.surrey.sfu.ca/cmpt-125-summer-2020/cmpt-125-summer-2020/-/tree/master/C_testing>`_. The notes below are not up to date, and will soon be deleted.

Testing

Imagine an engineer who built a new bridge but never actually put anything on it, i.e. they never tested it: would you be willing to walk across that bridge? Or a biologist who created a new medicine but never tried it out to see if it was lethal: would you take it?

Just as in cases like this, testing is an essential part of programming. If you don’t test your program, how do you know it is working correctly?

Note

Some computer scientists believe it might be possible to prove a program is correct using mathematics, but that idea has yet to catch on in practice. Plus, how do you know the proof is correct? Would you trust your life to a program that had been proven correct, but had never actually been run?

Testing is a large topic, and so to make things manageable we will look just at testing functions, which is a kind of unit testing.

What is a Test?

If f is the function being tested, then this is a test:

if (f(test_input) == expected_result) {
    // test passed!
} else {
    // test failed
}

We give f a test input, and check that the value it returns matches the expected result.

We often write tests using assert like this:

assert(f(test_input) == expected_result);

This assertion will do nothing if the test passes, and crash the program if it fails. Generally, intentionally crashing a program is bad, but in this case we allow it because it is an easy way to write tests.

Where does test_input come from? Typically one of two places:

  • it might be hand-picked, i.e. the programmer chose it because it is an important or tricky input
  • it could be randomly generated by some other function

Where does expected_result come from? Here are two possibilities:

  • it could be generated by hand
  • if you happen to have another function that returns the same value as f, you could compare against that, as in the sketch below; this is reasonably common in practice, e.g. f could be a new and improved version of an older function
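For example, here is a minimal sketch of the second approach. The names old_sum and new_sum are hypothetical: old_sum stands for a trusted older function, and we check new_sum against it on many random inputs:

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

// hypothetical trusted older function
int old_sum(int a, int b) {
    return a + b;
}

// hypothetical new and improved version being tested
int new_sum(int a, int b) {
    return b + a;
}

void random_sum_test() {
    for (int i = 0; i < 10000; i++) {
        int a = rand() % 201 - 100;  // random int in [-100, 100]
        int b = rand() % 201 - 100;
        // old_sum provides the expected_result for each random test_input
        assert(new_sum(a, b) == old_sum(a, b));
    }
    printf("all random_sum tests passed\n");
}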

Testing Should be Done Automatically

All our testing will be done automatically, i.e. all tests will be run and checked automatically by the computer. Sometimes programmers just print out the results of their tests and manually check to see if they’re correct. The problem with manual testing is that it is tedious and error-prone, and most programmers give up on it sooner or later.
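For example, a manual test just prints the result and relies on a human to read the output and check it (assuming here that f returns an int):

printf("f(test_input) = %d, expected %d\n", f(test_input), expected_result);

An assert version of the same test makes the computer do the comparison, and fails loudly instead of risking being misread.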

The test-as-you-go Workflow

Here is the basic approach to testing that we recommend, which we call test-as-you-go.

Suppose you are writing a function named f. Then:

  1. write a little bit of f
  2. then write some tests for this partial version of f
  3. run the tests on f
  4. if any tests fail, then fix f (or fix the test, if it’s wrong) and go to step 3
  5. if f is not fully implemented, go to step 1
  6. if no tests fail, then maybe f is correct

Some programmers like to change the order of step 1 and step 2, i.e. write tests before writing any code. Writing tests before code is called test-driven development (TDD). It can help you understand the function, and consider tricky cases. It also ensures you always have tests to check your function with.

Example: Testing the right_dot_at Function

The right_dot_at function finds the index of the right-most '.' in a string. For example, right_dot_at("a.b.c") returns 3.

Here’s the function header and specification:

// Pre-condition:
//    none
// Post-condition:
//    Returns the index location of the right-most '.' in s.
//    If s has no '.', -1 is returned
// Example:
//    right_dot_at("readme.txt") returns 6
//    right_dot_at("config.sys.old") returns 10
//    right_dot_at("makefile") returns -1
int right_dot_at(char* s)

We have not provided the implementation yet because this is all we need to create some good test cases.

Blackbox Testing

Blackbox testing is when you create tests by looking just at the function header and its specification (i.e. its pre-condition and post-condition), without looking at its implementation.

Thus, blackbox test cases can be created before you implement a function, which makes blackbox testing useful in test-driven development.

Here are some blackbox test cases for right_dot_at:

right_dot_at("") should return -1

right_dot_at(".") should return 0

right_dot_at("m") should return -1

right_dot_at("..") should return 1

right_dot_at("readme.txt") should return 6

right_dot_at("config.sys.old") should return 10

right_dot_at("makefile") should return -1

These test cases were created by thinking about the kinds of strings right_dot_at will probably be used with. Also, the '.' character is important, so many test cases contain a '.'.

Coding Test Cases

A convenient way to organize tests is to put them in a function and check results with the assert macro, e.g.:

#include <assert.h>
#include <stdio.h>

void right_dot_at_test() {
    assert(right_dot_at("") == -1);
    assert(right_dot_at(".") == 0);
    assert(right_dot_at("m") == -1);
    assert(right_dot_at("..") == 1);
    assert(right_dot_at("readme.txt") == 6);
    assert(right_dot_at("config.sys.old") == 10);
    assert(right_dot_at("makefile") == -1);

    printf("all right_dot_at tests passed\n");
}

assert(bool_expr) does nothing if bool_expr is true, but crashes the program if bool_expr is false. A nice feature of assert is that, when it causes a crash, its error message will tell you the line number of the assert in the source code.
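For example, if the second assertion in right_dot_at_test failed, the error message would look something like this (the exact format depends on your compiler and C library):

test.c:12: right_dot_at_test: Assertion `right_dot_at(".") == 0' failed.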

Running Test Cases

Here is an implementation of right_dot_at:

int right_dot_at(char* s) {
    int result = -1;
    for(int i = 0; s[i] != '\0'; i++) {
        if (s[i] == '.') {
            result = i;
        }
    }
    return result;
}

From looking at the source code, it might not be obvious whether it is correct. To test it, we call right_dot_at_test(). If any of the assertions fail, then we fix the error. Once right_dot_at_test() prints “all right_dot_at tests passed”, we know all the tests have passed.
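Assuming right_dot_at and right_dot_at_test are defined as above in the same file, a minimal sketch of a test program looks like this:

int main() {
    right_dot_at_test();  // crashes on the first failing assertion
    return 0;
}

Putting the tests in their own function like this makes it easy to re-run all of them every time the program changes.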

Testing Tricks: Extreme Values and Code Coverage

A useful rule of thumb is to test extreme values. What counts as an extreme value depends upon the function, but in general extreme values are often very large, or very small, or near some important value.

For example, for strings, extreme values might be:

  • the empty string
  • strings with 1 character
  • very large strings
  • strings with all the same character
  • strings with only non-printable characters

For ints, extreme values might be: 0, 1, -1, INT_MAX, INT_MAX - 1, INT_MIN (the largest and smallest int values, defined in limits.h), etc. What counts as extreme depends in part on the function you are testing.

Another useful idea is to check for code coverage, i.e. choose test cases that cause every line of code in the function to be executed. Of course, you can only do this if you can see the implementation of the function, and so code coverage testing is not blackbox testing. Instead, it’s an example of whitebox testing, i.e. creating test cases by looking at a function’s implementation.
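For example, here is a sketch of a whitebox test for the right_dot_at implementation shown above. Together, these three cases cause every line of the function to be executed at least once:

void right_dot_at_coverage_test() {
    assert(right_dot_at("") == -1);    // the loop body never runs
    assert(right_dot_at("ab") == -1);  // the loop runs, but the if-branch is never taken
    assert(right_dot_at("a.b") == 1);  // the if-branch runs, and result is updated
}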

How Many Test Cases is Enough?

Good question! You might think that more testing is always better, but that’s not necessarily true. In practice, testing time is limited, and so it is better to test a variety of carefully chosen cases that reflect how you think your function will be used. Test cases that find lots of bugs are more valuable than test cases that have found few, or none.

The Dutch computer scientist Edsger Dijkstra famously said this about testing:

Program testing can be a very effective way to show the presence of bugs, but it is hopelessly inadequate for showing their absence. – Edsger Dijkstra

If your goal is to be certain that a function is 100% correct in all cases, then testing is not enough.

Practically speaking, this is a good rule to follow when testing:

do the least amount of testing necessary to make you confident that your function works correctly in all important cases

For some functions, it might be possible to test all possible inputs and outputs. This is sometimes called brute force testing. However, most functions have so many possible inputs that brute force testing is impractical. For example, consider this function:

int f(int a, int b)

If an int is 32 bits (the usual size in C), then f has \(32 + 32 = 64\) bits of input, which means the total number of test cases is \(2^{64}\), which is 18 446 744 073 709 551 616, i.e. over 18 quintillion test cases. Even if you could run 1 trillion tests a second, it would take over 200 days to check all of these, and with 64-bit ints brute force testing would be utterly hopeless!
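In contrast, a function that takes a single char has few enough inputs that brute force testing is easy. Here is a sketch, where my_toupper is a hypothetical function under test that we check against the standard library’s toupper for every ASCII value:

#include <assert.h>
#include <ctype.h>

// hypothetical function under test: converts lowercase letters to uppercase
char my_toupper(char c) {
    if (c >= 'a' && c <= 'z') {
        return c - 'a' + 'A';
    }
    return c;
}

// brute force test: checks every ASCII value against the trusted
// standard library implementation
void my_toupper_test() {
    for (int c = 0; c <= 127; c++) {
        assert(my_toupper((char)c) == (char)toupper(c));
    }
}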

Testing Doesn’t Catch All Bugs!

Just because your code passes all its tests doesn’t mean it works correctly! All you can be certain of is that it works correctly for those particular test cases run at that particular time.

As an example of a function that can be hard to test, consider this:

// meant to always return 3, but has a bug
int three() {
    if (rand() == 0) {  // rand() is from stdlib.h
        return 2;
    } else {
        return 3;
    }
}

three() almost always returns 3. But every once in a (usually) long while, it returns 2, which is incorrect. If RAND_MAX is \(2^{31} - 1\) (a common value), you’d expect about 1 out of every \(2^{31}\) calls to three() to return the wrong result.

There is no particular test case that would cause 2 to be returned — three() doesn’t even take in any input! There seem to be only two ways to notice this error:

  1. Read the implementation and notice that it has this crazy error. While the problem is easy to spot here, it might not be so easy to find in a program with hundreds of functions and thousands of lines of code.
  2. Do a lot of testing, e.g. re-run your tests billions and billions of times. Not many programmers have time to do so much testing.

While this is a contrived example, this sort of error does show up in practice. For example, in concurrent programs, i.e. programs where more than one thread of control is active at the same time (e.g. on computers with 2 or more CPUs), interactions of the two threads can cause extremely subtle and rare bugs. The exact conditions under which a bug in a concurrent system occurs can sometimes be extremely difficult to re-create, and may effectively occur at random times. Such bugs are sometimes called race conditions. Concurrent programs are very common in practice, and so this is a major source of difficult bugs.
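To see how such a bug can arise, here is a sketch using POSIX threads (which are beyond the scope of these notes): two threads increment a shared counter with no synchronization, so some increments are lost, and the final value varies from run to run:

#include <pthread.h>
#include <stdio.h>

long counter = 0;

// each thread increments the shared counter 1000000 times;
// counter++ is really three steps (read, add 1, write back),
// so increments from the two threads can interleave and be lost
void* increment(void* arg) {
    for (int i = 0; i < 1000000; i++) {
        counter++;
    }
    return NULL;
}

int main() {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);  // usually less than 2000000
    return 0;
}

A test that asserts counter == 2000000 might pass hundreds of times and then fail once: exactly the kind of rare, hard-to-reproduce failure described above.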

Finally, here is one example that shows that testing alone is not enough to determine if a function is correct. Consider this:

// adds 2 to n
int h(int n) {
    return n + 3;
}

What is h supposed to do? Add 2, or add 3? The error is either in the comment, or in the return statement, but no amount of testing can tell us which.

Questions

  1. Show the general form of a test for a function f in two different ways: one way using if-statements, another way using assert.
  2. Why is manually checking the results of tests usually a bad idea?
  3. What is the main idea of test-driven development (TDD)?
  4. What is brute force testing?
  5. Compare and contrast blackbox and whitebox testing.
  6. In the context of testing, what is code coverage?
  7. Explain extreme value testing. What are some extreme test values for a function that takes one int as an input? What might be extreme values for char?
  8. What is the general rule you should follow when deciding how much testing to do for a function?
  9. Explain why a function might not be correct even though it passes all its tests.
  10. Give an example of a function with a bug that testing would have an extremely hard time catching. Instead of testing, how might the bug be caught?