We examine the problem of testing a UTF-8 to UTF-16 transcoder to illustrate the following concepts in testing.
This case study is illustrated by the test case generator and execution scripts in the http://u8u16.costar.sfu.ca/browser/QA/ directory.
Unicode | UTF-8 | UTF-16 | ||||
---|---|---|---|---|---|---|
Codepoint | Byte Sequence | Doublebyte Sequence | ||||
00-0x7F | 0x00-0x7F | 0x0000-0x007F | ||||
0x80-0x7FF | 0xC2-0xDF | 0x80-0xBF | 0x0080-0x07FF | |||
0x800-0xFFFF | 0xE0-0xEF | 0x80-0xBF | 0x80-0xBF | 0x07FF-0xFFFF | ||
0x10000-0x10FFFF | 0xF0-0xF4 | 0x80-0xBF | 0x80-0xBF | 0x80-0xBF | 0xD800-0xDBFF | 0xDC00-0xDFFF |
Some byte sequences do not represent valid UTF-8.
0xC0
and 0xC1
are illegal.0xF5
through 0xFF
are illegal.0xC2
through 0xDF
must be followed by
exactly one suffix byte in the range 0x80
through 0xBF
.0xE0
through 0xEF
must be followed by
exactly two suffix bytes in the range 0x80
through 0xBF
.0xF0
through 0xF4
must be followed by
exactly three suffix bytes in the range 0x80
through 0xBF
.0xE0
requires that the range of the immediately next suffix byte be limited to
the range 0xA0
through 0xBF
.0xED
requires that the range of the immediately next suffix byte be limited to
the range 0x80
through 0x9F
.0xF0
requires that the range of the immediately next suffix byte be limited to
the range 0x90
through 0xBF
.0xF4
requires that the range of the immediately next suffix byte be limited to
the range 0x80
through 0x8F
.Based on the requirements of UTF-8 to UTF-16 conversion, we take a functional testing approach, also known as black-box testing. This is in contrast to a structural testing approach in which we can see inside the box, often called white-box (or, sometimes, glass-box) testing.
Equivalence partititioning is the technique of dividing up the domain of possible inputs into equivalence classes. Testing any member of the equivalence class is assumed to be a representative
We divide the UTF-8 inputs up into the equivalence classes based on the following ideas.
0xC0
or 0xC1
together with
a legal suffix byte form an equivalence class., as do
four-byte sequences with an illegal prefix byte 0xF5
through 0xFF
(2 classes)
0xE0
, 0xED
, 0xF0
, and 0xF4
,
an otherwise correct three or four byte sequence that violates the constraint for the first suffix
byte forms a class. (4 classes).
Given the equivalence classes above, we can automate the
generation of test sequences. A Python implementation of generating
a random element from each erroneous and each incomplete UTF-8 sequence
is given by the functions gen_UTF8_error_sequences
and
gen_UTF8_incomplete_sequences
, respectively
in the u8u16_testgen.py
script.
Boundary-Value Analysis often looks at boundary cases associated with equivalence classes as ones that may need particular testing. Sometimes the boundary-value analysis is based on boundaries involving in the actual code being tested.
Because the UTF-8 to UTF-16 implementations being tested were based on parallel techniques with potential block and boundary value problems, it was desired to insert instances of each equivalence class at randomly chosen positions reflecting boundary problems.
These categories were implemented as the prefix_groups list in the script.
Similarly, for each type of error at each boundary class, it was chosen to generate different length sequences of correct input past the error position in the following 4 classes.
These categories were implemented as the suffix_groups list in the script.
Using the equivalence classes defined above, a test data generator was then implemented to produce test files with examples of erroneous input and incomplete input. In each case, a sample input file was constructed, together with an expected output file and an expected message file according to the conventions of a driver program written for the purpose.
With the expected output and error message files available,
a simple shell script was written to perform the tests and produce
test output and test message directory in parallel to the expected
output and expected messages directory. A simple shell script
was used and the standard diff
invocation implemented the
test oracle.
run_all
script