We examine the problem of testing a UTF-8 to UTF-16 transcoder to illustrate the following concepts in testing.
This case study is illustrated by the test case generator and execution scripts in the http://u8u16.costar.sfu.ca/browser/QA/ directory.
| Unicode | UTF-8 | UTF-16 | ||||
|---|---|---|---|---|---|---|
| Codepoint | Byte Sequence | Doublebyte Sequence | ||||
| 00-0x7F | 0x00-0x7F | 0x0000-0x007F | ||||
| 0x80-0x7FF | 0xC2-0xDF | 0x80-0xBF | 0x0080-0x07FF | |||
| 0x800-0xFFFF | 0xE0-0xEF | 0x80-0xBF | 0x80-0xBF | 0x07FF-0xFFFF | ||
| 0x10000-0x10FFFF | 0xF0-0xF4 | 0x80-0xBF | 0x80-0xBF | 0x80-0xBF | 0xD800-0xDBFF | 0xDC00-0xDFFF |
Some byte sequences do not represent valid UTF-8.
0xC0 and 0xC1 are illegal.0xF5 through 0xFF are illegal.0xC2 through 0xDF must be followed by
exactly one suffix byte in the range 0x80through 0xBF.0xE0 through 0xEF must be followed by
exactly two suffix bytes in the range 0x80through 0xBF.0xF0 through 0xF4 must be followed by
exactly three suffix bytes in the range 0x80through 0xBF.0xE0 requires that the range of the immediately next suffix byte be limited to
the range 0xA0through 0xBF.0xED requires that the range of the immediately next suffix byte be limited to
the range 0x80through 0x9F.0xF0 requires that the range of the immediately next suffix byte be limited to
the range 0x90through 0xBF.0xF4 requires that the range of the immediately next suffix byte be limited to
the range 0x80through 0x8F.Based on the requirements of UTF-8 to UTF-16 conversion, we take a functional testing approach, also known as black-box testing. This is in contrast to a structural testing approach in which we can see inside the box, often called white-box (or, sometimes, glass-box) testing.
Equivalence partititioning is the technique of dividing up the domain of possible inputs into equivalence classes. Testing any member of the equivalence class is assumed to be a representative
We divide the UTF-8 inputs up into the equivalence classes based on the following ideas.
0xC0 or 0xC1 together with
a legal suffix byte form an equivalence class., as do
four-byte sequences with an illegal prefix byte 0xF5 through 0xFF (2 classes)
0xE0, 0xED, 0xF0, and 0xF4,
an otherwise correct three or four byte sequence that violates the constraint for the first suffix
byte forms a class. (4 classes).
Given the equivalence classes above, we can automate the
generation of test sequences. A Python implementation of generating
a random element from each erroneous and each incomplete UTF-8 sequence
is given by the functions gen_UTF8_error_sequences and
gen_UTF8_incomplete_sequences, respectively
in the u8u16_testgen.py
script.
Boundary-Value Analysis often looks at boundary cases associated with equivalence classes as ones that may need particular testing. Sometimes the boundary-value analysis is based on boundaries involving in the actual code being tested.
Because the UTF-8 to UTF-16 implementations being tested were based on parallel techniques with potential block and boundary value problems, it was desired to insert instances of each equivalence class at randomly chosen positions reflecting boundary problems.
These categories were implemented as the prefix_groups list in the script.
Similarly, for each type of error at each boundary class, it was chosen to generate different length sequences of correct input past the error position in the following 4 classes.
These categories were implemented as the suffix_groups list in the script.
Using the equivalence classes defined above, a test data generator was then implemented to produce test files with examples of erroneous input and incomplete input. In each case, a sample input file was constructed, together with an expected output file and an expected message file according to the conventions of a driver program written for the purpose.
With the expected output and error message files available,
a simple shell script was written to perform the tests and produce
test output and test message directory in parallel to the expected
output and expected messages directory. A simple shell script
was used and the standard diff invocation implemented the
test oracle.
run_all script