Testing

Based on Chapter 8 of Sommerville, “Software Engineering” (9th edition)

Verification and Validation

testing is one aspect of a larger process known as verification and validation (V & V)

Validation: “Are we building the right product?”

  • i.e. does the software meet the customer’s expectations?

Verification: “Are we building the product right?”

  • i.e. does the software meet its requirements?

the ultimate goal of V&V is to determine if the software is fit for purpose, i.e. if it is good enough for its intended use

Inspections

an inspection is when a group of people carefully examines and discusses some aspect of the system under development

almost any aspect of a system can be inspected, e.g.

  • requirements
  • design documents
  • source code

in some software groups, no source code can be checked in until the original programmer gets at least one other programmer to read through their code

code inspections can be an extremely effective way of discovering all kinds of issues in source code

since humans are doing the inspection, they might notice things like:

  • missing features
  • unneeded complexity
  • poor or missing documentation
  • poor organization
  • inconsistent coding style
  • etc.

Testing

experience and reports suggest that manual code inspections are very effective at finding many kinds of code defects, and they seem to be well worth the time

however, inspections can’t completely replace testing because

  • testing can catch cases that can be hard for humans to notice
  • testing can be done automatically and more quickly than inspections

the goal of testing is twofold: to demonstrate that a program meets its requirements, and to discover defects/bugs in it

validation testing is when you test a system to show that it meets its requirements

defect testing is when you test a system to expose defects/bugs, i.e. situations where its behavior is incorrect

Unit Testing

during development, unit testing is testing of a single unit of code in isolation from the rest of the program

a “unit” could be a function, class, module, etc.

generally, artificial test data is created for the unit

  • usually hand-crafted, or randomly generated

then the test data is run through the unit, and the results are checked to see if they are correct

  • correctness could be checked by comparing against hand-computed expected results
  • or sometimes by comparing to another unit that does the same thing, e.g. you could compare a new sorting function to an older, slower one (as in the sketch below)
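
for example, here is a minimal sketch of such a unit test in Python, where new_sort is a made-up stand-in for the unit under test, and the built-in sorted plays the role of the trusted older implementation:

    def new_sort(v):
        return sorted(v)  # made-up stand-in for a fancy new sorting algorithm

    def test_new_sort():
        # hand-crafted test data, checked against a trusted reference
        for data in ([], [1], [3, 1, 2], [5, 5, 5], [2, -7, 0, 2]):
            assert new_sort(data) == sorted(data)

    test_new_sort()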

Choosing Test Cases

generally, you want test cases that tend to find defects

  • test cases that never find any defects are candidates for removal!

blackbox testing is testing that is done by looking just at the specification of the unit, and without looking at the implementation

  • thus if the implementation changes, the blackbox test cases can be re-used

whitebox testing is testing that is done looking at the implementation of the unit

  • whitebox testing often aims for code coverage, i.e. enough test cases to ensure that every line of code (or even path through the code) is executed at least once
  • when the implementation changes, whitebox test cases may need to be changed as well
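
for example, here is a sketch of whitebox testing in Python using a made-up classify function: the test cases are chosen by reading the implementation so that both paths through the code are executed at least once (tools like coverage.py can measure this automatically):

    def classify(n):
        if n < 0:
            return "negative"
        return "non-negative"

    # chosen by looking at the implementation: one case per branch
    assert classify(-1) == "negative"      # exercises the if-branch
    assert classify(7) == "non-negative"   # exercises the fall-through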

testing principle: use extreme values, e.g.

  • extreme values for a number: 0, 1, -1, max num, min num, epsilon (smallest possible positive value), NaN (not a number), Inf (infinity)
  • extreme values for a string: “”, a single-character string, a very large random string, a string whose characters are all the same, etc.
  • extreme values for an array: empty array, single-entry array, a very large random array, an array with all entries the same, etc.
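
for example, here is a sketch of extreme-value testing in Python for a made-up clamp function; NaN is a classic troublemaker, since every comparison with it is false:

    import math, sys

    def clamp(x, lo=0.0, hi=1.0):
        # made-up unit under test: restrict x to the range [lo, hi]
        return max(lo, min(hi, x))

    # extreme values for a number, as listed above
    extremes = [0.0, 1.0, -1.0, sys.float_info.max, -sys.float_info.max,
                sys.float_info.min, math.inf, -math.inf, math.nan]

    for x in extremes:
        r = clamp(x)
        # NaN must be handled separately, since NaN comparisons are always false
        assert math.isnan(x) or 0.0 <= r <= 1.0, f"clamp mishandled {x}"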

testing principle: partition testing

  • partition the input space for the function into categories relevant to the unit being tested
  • choose one candidate from each partition

for example, you might have an input space partitioned like this, where each * represents an input chosen for testing:

+-----------------------------------+----------------------------------+
|                                   |                                  |
|                                   |          *                       |
|                                   |                                  |
|                                   |                                  |
|                                   |                                  |
|             *                     |                                  |
|                                   +----------------------------------+
|                                   |                                  |
|                                   |                                  |
|                                   |                                  |
|                                   |                                  |
+--------------+-----+--------------+                   *              |
|              |     |              |                                  |
|              |     |              |                                  |
|  *           |     |              |                                  |
|              |   * |              |                                  |
|              |     |              |                                  |
|              |     |              +----------------------------------+
|              |     |                                                 |
|              |     |                                                 |
|              |     |          *                                      |
+--------------+-----+-------------------------------------------------+

testing principle: choose test cases near partition boundaries, e.g. inputs on or just beside the dividing lines in the diagram would likely be good test cases

  • this is a generalization of extreme value testing
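
for example, here is a sketch combining both principles in Python, using a made-up letter_grade function: one test case from inside each partition, plus cases on and just beside each boundary:

    def letter_grade(score):
        # made-up unit under test; the partitions are the four grade ranges
        if score >= 80:
            return "A"
        if score >= 70:
            return "B"
        if score >= 60:
            return "C"
        return "F"

    # one representative value from inside each partition
    assert letter_grade(90) == "A"
    assert letter_grade(75) == "B"
    assert letter_grade(65) == "C"
    assert letter_grade(30) == "F"

    # values on and just beside each partition boundary
    assert letter_grade(80) == "A"
    assert letter_grade(79) == "B"
    assert letter_grade(70) == "B"
    assert letter_grade(69) == "C"
    assert letter_grade(60) == "C"
    assert letter_grade(59) == "F"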

testing principle: test important requirements of the system

  • in most systems, some use cases are more important than others, so it makes sense to explicitly test the most important ones

testing principle: test error messages and failure cases, not just successes

  • how a system handles failure is often very important, and so should be tested
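
for example, here is a sketch of failure-case testing using pytest (a popular Python testing framework); withdraw is a made-up unit under test:

    import pytest

    def withdraw(balance, amount):
        # made-up unit under test
        if amount <= 0:
            raise ValueError("amount must be positive")
        if amount > balance:
            raise ValueError("insufficient funds")
        return balance - amount

    def test_withdraw_failures():
        # check that the *right* error is raised, not just that some error occurs
        with pytest.raises(ValueError, match="must be positive"):
            withdraw(100, -5)
        with pytest.raises(ValueError, match="insufficient funds"):
            withdraw(100, 200)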

testing principle: stress testing

  • stress testing is when you test a unit by running a very large number of tests on it
  • e.g. you might stress test a server to see how it responds to a large number of simultaneous users (see the sketch below)
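
here is a small-scale sketch of that idea in Python, where handle_request is a made-up stand-in for a real server call:

    from concurrent.futures import ThreadPoolExecutor

    def handle_request(i):
        # made-up stand-in for a real server call, e.g. an HTTP request
        return i * 2

    # fire a large number of simultaneous requests, and check that every
    # one of them still gets the correct answer
    with ThreadPoolExecutor(max_workers=50) as pool:
        results = list(pool.map(handle_request, range(10_000)))

    assert results == [i * 2 for i in range(10_000)]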

Automated Testing

a lot of software unit testing is done using hand-crafted test cases created by developers

this can be tedious, and even the best testers can miss defects hiding in unusual situations

also, the process of running and checking test cases is tedious and exacting work

humans can do all of this manually, but probably not for long!

  • people get bored and tired
  • management might put time constraints on development that make it impossible to spend enough time on testing

so in practice it is important to automate test cases whenever possible

  • some developers would go so far as to say that if you don’t have automated testing, you don’t really have any testing at all!

this makes it easy to run (and re-run) test cases with very little effort

plus it becomes possible to keep statistics, e.g. which test cases are good at finding defects

there are various unit testing frameworks for most programming languages that automate at least some testing — use them!
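
for example, with Python’s pytest framework, any function whose name starts with test_ in a file named test_*.py is discovered and run automatically just by typing pytest; add here is a made-up unit under test:

    # file: test_add.py -- run the whole suite with:  pytest
    def add(a, b):
        return a + b  # made-up unit under test

    def test_add():
        assert add(2, 3) == 5

    def test_add_negatives():
        assert add(-2, -3) == -5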

Property Testing

one interesting and effective way of automating testing is to test properties of a unit

for example, suppose you are writing a fancy new sorting algorithm

one of the properties of any correct sorting algorithm is that sort(sort(v)) == sort(v)

  • there are other properties that you could test, e.g. sort(v) == v if v is already in sorted order, or sort(v).size() == v.size()
  • choosing the best properties to test is a bit of an art

property testing is where you test that this property holds for sort

typically, all you need to do is state that sort(sort(v)) == sort(v) holds, and the test cases are automatically generated and run

random inputs to sort are created

  • since the testing is random, there are no human-like biases, and so this can sometimes catch unusual errors that humans never even think to look for

these are then tested to ensure that sort(sort(v)) == sort(v) holds for them
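
here is a hand-rolled sketch of that loop in Python, with my_sort as a made-up stand-in for the new algorithm; a real property-testing framework automates all of this, including the shrinking described next:

    import random

    def my_sort(v):
        return sorted(v)  # made-up stand-in for the fancy new algorithm

    # generate 100 random inputs and check the properties on each one
    for _ in range(100):
        v = [random.randint(-1000, 1000) for _ in range(random.randint(0, 50))]
        assert my_sort(my_sort(v)) == my_sort(v)
        assert len(my_sort(v)) == len(v)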

if the test case fails, then the input is shrunk to a smaller and simpler input that is easy for a human to use for tracing through the unit

  • random data is often long and messy, and hard for humans to work with
  • the shrinking is typically done by using various simplification heuristics
  • can be quite helpful (and surprising!) when you find an extremely small and simple test case that fails

the idea of property testing was popularized in the QuickCheck package for Haskell

  • many other languages have borrowed the idea (or parts of it)

  • for example, the Python Hypothesis package is a very easy-to-use property testing framework, e.g.:

    from hypothesis import given, strategies as st

    @given(st.lists(st.integers()))
    def test_reversing_twice_gives_same_list(xs):
        ys = list(xs)
        ys.reverse()
        ys.reverse()
        assert xs == ys
    
    • the @given decorator at the top is used to generate the right kind of test data
    • then the assert is checked for 100 random lists
    • the developer does not need to create the test cases or even run them — it is all automatic after this testing function is created
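
to see shrinking in action, here is a sketch of a deliberately false property; when Hypothesis finds a failing random list, it shrinks it before reporting, typically down to something as small as [0, 0]:

    from hypothesis import given, strategies as st

    @given(st.lists(st.integers()))
    def test_no_duplicates(xs):
        # deliberately false: random lists certainly can contain duplicates
        assert len(set(xs)) == len(xs)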

System Testing

system testing is about testing the entire (or nearly complete) system, not just isolated units (as in unit testing)

it’s often the case that systems can pass all unit tests, but then fail in unexpected ways when the units are connected

system testing need not be done by the developers

  • in practice, you may ask a subset of real users to be “alpha” testers or “beta” testers
  • they essentially become early users of a partially complete system, and help find defects, evaluate features, etc.

system tests may ask testers to

  • work through a set of use cases
  • use the system in an ordinary way that reflects their regular expected usage
  • purposefully try crazy things in order to stress the system
    • e.g. video game testers might be asked to play a racing game driving in reverse the entire time

another kind of user-oriented system testing is acceptance testing

  • the idea of acceptance testing is to ask users to try an essentially complete and finished system, with the goal of discovering whether they “accept” the system, i.e. whether they like using it
  • acceptance testing is not about finding defects, or checking if the requirements are met
    • it’s possible that requirements are met but users don’t accept it!
  • leaving acceptance testing to the very end of software development could be a disaster
    • if users don’t like the system, you want to try to figure that out as soon as possible

When Should You Test?

the traditional waterfall model puts testing at the very end of development

but that’s often too late

  • testing at the end of a project is sometimes called big bang testing
  • big bang testing is usually bad because if it fails there isn’t really time to go back and fix things

in practice, it is usually better to interleave testing and development

the sooner you get the feedback that testing gives you, the better

  • you will have more time to fix things that aren’t working

Question: Is Testing Enough?

suppose you have a function f(n) that takes a 32-bit integer n as input

suppose you test it by calling f on all \(2^{32}\) inputs, and you verify that f returns the correct answer for all those inputs

can you conclude that f is correct?

it would seem so — all possible inputs and outputs have been checked!

but even such exhaustive testing can miss bugs, e.g. what if f were this function:

#include <stdlib.h>  // for rand()

// Pre-condition:
//    none
// Post-condition:
//    returns 5
int f(int n) {
  if (rand() == 0) {
    return 6;  // wrong answer, but returned extremely rarely
  }
  return 5;
}

suppose the rand() function returns a non-negative integer less than \(2^{32}\)

that means that once in every 4.3 billion calls to f, it will return the wrong value

it’s possible that stress testing could catch this — but that would be a huge amount of testing for one very simple function

  • more practically, a code review would be the best way to find this sort of error
  • if you saw this function in a code review, you would hopefully at least ask a question about why it is the way it is

of course, this particular function f is not realistic

  • but it is representative of a nasty kind of error that can occur in real life
  • for example, race conditions can occur in concurrent systems where hard-to-reproduce bugs show up at seemingly random times
  • or some unusual set of circumstances could cause a strange bug, e.g. maybe date/time code goes wrong during daylight savings time in locations with non-standard time zones?