Case Study: UTF-8 to UTF-16 Transcoder Testing

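We examine the problem of testing a UTF-8 to UTF-16 transcoder to illustrate several concepts in testing: functional (black-box) testing, equivalence partitioning, random test value generation, boundary value analysis, test data generation, and the use of a test execution harness with an oracle.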

This case study is illustrated by the test case generator and execution scripts in the http://u8u16.costar.sfu.ca/browser/QA/ directory.

The Problem Statement: UTF-8 to UTF-16 Conversion

The problem is to convert valid, complete sequences of UTF-8 code units into their corresponding UTF-16 representation. UTF-8 is a variable-length representation of Unicode using 1 to 4 bytes per code point. UTF-16 is a variable-length representation using 1 or 2 doublebytes per code point. The following table shows the conversion ranges for individual character code points.
Unicode           UTF-8                                    UTF-16
Codepoint         Byte Sequence                            Doublebyte Sequence
0x00-0x7F         0x00-0x7F                                0x0000-0x007F
0x80-0x7FF        0xC2-0xDF 0x80-0xBF                      0x0080-0x07FF
0x800-0xFFFF      0xE0-0xEF 0x80-0xBF 0x80-0xBF            0x0800-0xFFFF
0x10000-0x10FFFF  0xF0-0xF4 0x80-0xBF 0x80-0xBF 0x80-0xBF  0xD800-0xDBFF 0xDC00-0xDFFF
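
The ranges in the table follow from the bit layout of the two encodings. As a minimal sketch (the function names encode_utf8 and encode_utf16 are hypothetical and are not taken from the QA scripts), the conversion of a single code point might look like this in Python:

```python
def encode_utf8(cp):
    """Encode one Unicode code point as UTF-8 bytes (illustrative sketch)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp <= 0x7F:        # 1 byte: 0x00-0x7F
        return bytes([cp])
    if cp <= 0x7FF:       # 2 bytes: prefix 0xC2-0xDF, one suffix byte
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # 3 bytes: prefix 0xE0-0xEF, two suffix bytes
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: prefix 0xF0-0xF4, three suffix bytes
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def encode_utf16(cp):
    """Encode one Unicode code point as a list of 16-bit doublebytes."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp <= 0xFFFF:      # a single doublebyte equal to the code point
        return [cp]
    v = cp - 0x10000      # 20 bits split across a surrogate pair
    return [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]
```

For example, encode_utf16(0x10000) yields the surrogate pair 0xD800 0xDC00, the first entry of the last row of the table.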

Valid and Invalid UTF-8 Input Sequences

Some byte sequences do not represent valid UTF-8.

  1. The prefix bytes 0xC0 and 0xC1 are illegal.
  2. The prefix bytes 0xF5 through 0xFF are illegal.
  3. Any prefix byte in the range 0xC2 through 0xDF must be followed by exactly one suffix byte in the range 0x80 through 0xBF.
  4. Any prefix byte in the range 0xE0 through 0xEF must be followed by exactly two suffix bytes in the range 0x80 through 0xBF.
  5. Any prefix byte in the range 0xF0 through 0xF4 must be followed by exactly three suffix bytes in the range 0x80 through 0xBF.
  6. A prefix byte 0xE0 requires the immediately following suffix byte to be in the range 0xA0 through 0xBF.
  7. A prefix byte 0xED requires the immediately following suffix byte to be in the range 0x80 through 0x9F.
  8. A prefix byte 0xF0 requires the immediately following suffix byte to be in the range 0x90 through 0xBF.
  9. A prefix byte 0xF4 requires the immediately following suffix byte to be in the range 0x80 through 0x8F.
  10. A suffix byte standing alone or beyond the specified number of suffix bytes for a given prefix byte is illegal.
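
The ten rules above can be read as a small decision procedure. The following Python function is a minimal sketch of that procedure, written for illustration only; the name is_valid_utf8 is hypothetical and is not part of the transcoder or the QA scripts. It treats a sequence truncated at the end of input as invalid, without distinguishing erroneous from incomplete input.

```python
def is_valid_utf8(data):
    """Return True if data (a bytes object) satisfies rules 1-10 above."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b <= 0x7F:                      # single-byte character
            i += 1
            continue
        if 0xC2 <= b <= 0xDF:              # rule 3
            n_suffix, lo, hi = 1, 0x80, 0xBF
        elif 0xE0 <= b <= 0xEF:            # rule 4, narrowed by rules 6 and 7
            n_suffix = 2
            lo = 0xA0 if b == 0xE0 else 0x80
            hi = 0x9F if b == 0xED else 0xBF
        elif 0xF0 <= b <= 0xF4:            # rule 5, narrowed by rules 8 and 9
            n_suffix = 3
            lo = 0x90 if b == 0xF0 else 0x80
            hi = 0x8F if b == 0xF4 else 0xBF
        else:
            # rules 1, 2 and 10: 0xC0, 0xC1, 0xF5-0xFF, or a stray suffix byte
            return False
        if i + n_suffix >= n:              # sequence runs past the end of input
            return False
        if not lo <= data[i + 1] <= hi:    # first suffix byte has the narrowed range
            return False
        for k in range(2, n_suffix + 1):   # remaining suffix bytes: 0x80-0xBF
            if not 0x80 <= data[i + k] <= 0xBF:
                return False
        i += n_suffix + 1
    return True
```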

Based on the requirements of UTF-8 to UTF-16 conversion, we take a functional testing approach, also known as black-box testing. This is in contrast to a structural testing approach in which we can see inside the box, often called white-box (or, sometimes, glass-box) testing.

Equivalence Partitioning

Equivalence partitioning is the technique of dividing the domain of possible inputs into equivalence classes. Testing any one member of an equivalence class is assumed to be representative of testing every member of that class.

We divide the UTF-8 inputs up into the equivalence classes based on the following ideas.

Test Value Automation: Random Value Generation

Given the equivalence classes above, we can automate the generation of test sequences. Python implementations that generate a random element from each class of erroneous and each class of incomplete UTF-8 sequences are given by the functions gen_UTF8_error_sequences and gen_UTF8_incomplete_sequences, respectively, in the u8u16_testgen.py script.
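
To make the idea concrete, here is a minimal sketch of random value generation. It does not reproduce gen_UTF8_error_sequences or gen_UTF8_incomplete_sequences; the function names, the class labels, and the particular choice of classes are assumptions, derived only from the validity rules listed earlier.

```python
import random


def sample_error_sequences(rng):
    """Return one random invalid sequence per illustrative error class."""
    r = rng.randint
    return {
        "illegal prefix 0xC0-0xC1": bytes([r(0xC0, 0xC1), r(0x80, 0xBF)]),
        "illegal prefix 0xF5-0xFF": bytes([r(0xF5, 0xFF)]),
        "suffix byte standing alone": bytes([r(0x80, 0xBF)]),
        "2-byte prefix, non-suffix follows": bytes([r(0xC2, 0xDF), r(0x00, 0x7F)]),
        "0xE0, suffix below 0xA0": bytes([0xE0, r(0x80, 0x9F), r(0x80, 0xBF)]),
        "0xED, suffix above 0x9F": bytes([0xED, r(0xA0, 0xBF), r(0x80, 0xBF)]),
        "0xF0, suffix below 0x90": bytes([0xF0, r(0x80, 0x8F),
                                          r(0x80, 0xBF), r(0x80, 0xBF)]),
        "0xF4, suffix above 0x8F": bytes([0xF4, r(0x90, 0xBF),
                                          r(0x80, 0xBF), r(0x80, 0xBF)]),
    }


def sample_incomplete_sequences(rng):
    """Return one random truncated (but so far valid) sequence per class."""
    r = rng.randint
    return {
        "2-byte sequence, 1 byte present": bytes([r(0xC2, 0xDF)]),
        "3-byte sequence, 1 byte present": bytes([r(0xE1, 0xEC)]),
        "3-byte sequence, 2 bytes present": bytes([r(0xE1, 0xEC), r(0x80, 0xBF)]),
        "4-byte sequence, 1 byte present": bytes([r(0xF1, 0xF3)]),
        "4-byte sequence, 2 bytes present": bytes([r(0xF1, 0xF3), r(0x80, 0xBF)]),
        "4-byte sequence, 3 bytes present": bytes([r(0xF1, 0xF3), r(0x80, 0xBF),
                                                   r(0x80, 0xBF)]),
    }
```

Passing a seeded generator such as random.Random(0) makes a run reproducible, which helps when a generated case exposes a failure that must be investigated later.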

White-Box Boundary Value Analysis

Boundary-value analysis looks at the boundary cases associated with equivalence classes as ones that may need particular testing. Sometimes the analysis is instead based on boundaries that arise in the actual code being tested.

Because the UTF-8 to UTF-16 implementations being tested were based on parallel techniques with potential block and boundary value problems, the goal was to insert instances of each equivalence class at randomly chosen positions reflecting boundary problems.

These categories were implemented as the prefix_groups list in the script.

Similarly, for each type of error at each boundary class, sequences of correct input of different lengths were generated past the error position, in the following 4 classes.

These categories were implemented as the suffix_groups list in the script.
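
The following sketch shows one way an erroneous sequence could be placed at a chosen offset between runs of valid input. It is not the script's prefix_groups or suffix_groups logic: BLOCK_SIZE, the helper names, and the choice of offsets around block multiples are all assumptions made for illustration, and the offsets are fixed here rather than randomly chosen.

```python
import random

BLOCK_SIZE = 128  # hypothetical; the transcoder's real block size is not stated here

# Code point ranges that encode to exactly 1, 2, 3 and 4 UTF-8 bytes
# (surrogates 0xD800-0xDFFF are avoided in the 3-byte range).
RANGES = {1: (0x20, 0x7E), 2: (0x80, 0x7FF), 3: (0x800, 0xD7FF), 4: (0x10000, 0x10FFFF)}


def valid_filler(n_bytes, rng):
    """Return exactly n_bytes of valid UTF-8 made of random characters."""
    out = bytearray()
    while len(out) < n_bytes:
        width = rng.randint(1, min(4, n_bytes - len(out)))
        out += chr(rng.randint(*RANGES[width])).encode("utf-8")
    return bytes(out)


def boundary_prefix_lengths(block_size=BLOCK_SIZE, blocks=2):
    """Offsets at the start of input and just before, at, and after block edges."""
    lengths = [0]
    for k in range(1, blocks + 1):
        lengths += [k * block_size - 1, k * block_size, k * block_size + 1]
    return lengths


def build_case(bad, prefix_len, suffix_len, rng):
    """Return (input bytes, error position) with bad placed after prefix_len valid bytes."""
    data = valid_filler(prefix_len, rng) + bad + valid_filler(suffix_len, rng)
    return data, prefix_len
```

A call such as build_case(b"\xc0\x80", 127, 1, random.Random(1)) would place an illegal prefix byte on the last byte of the first assumed block, followed by one byte of correct input.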

Test Data Generation

Using the equivalence classes defined above, a test data generator was then implemented to produce test files with examples of erroneous input and incomplete input. In each case, a sample input file was constructed, together with an expected output file and an expected message file according to the conventions of a driver program written for the purpose.

Test Execution Harness and Oracle

With the expected output and error message files available, a simple shell script was written to run the tests and produce test output and test message directories parallel to the expected output and expected message directories. A standard diff invocation implemented the test oracle.
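
The same oracle idea can be expressed in Python. The sketch below is an analogue of the diff-based comparison, not the shell script itself; the function name failing_cases and the directory layout (one file per test case, with matching names in both directories) are assumptions.

```python
import filecmp
from pathlib import Path


def failing_cases(expected_dir, actual_dir):
    """Return names of expected files whose actual counterpart is missing or differs."""
    failures = []
    for expected in sorted(Path(expected_dir).iterdir()):
        actual = Path(actual_dir) / expected.name
        # shallow=False forces a byte-by-byte comparison, as diff would do
        if not actual.is_file() or not filecmp.cmp(expected, actual, shallow=False):
            failures.append(expected.name)
    return failures
```

An empty result means every test output matched its expected file. Running the function once on the output directories and once on the message directories would cover both kinds of expected result.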