The Performance Dimension of Software Quality

An important aspect of many software systems is how well they perform. Performance may refer to speed of execution, responsiveness to user input, memory footprint, energy efficiency and/or other resource utilization characteristics.

But:

There is no doubt that the grail of efficiency leads to abuse. Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.

Donald E. Knuth, Structured Programming with go to Statements, Computing Surveys 6(4), December 1974.

Structured Performance Improvement Engineering

Performance considerations in software system development should first be addressed using high-level approaches.

Design system and software architecture for good performance.
Choose good data structures and algorithms in critical areas.
Use compiler settings and/or optimizing compilers for initial performance improvement.
When necessary, profile then optimize source code.

Computer Architecture and Software Performance

Performance improvement engineering requires a strong knowledge of computer and system architecture. This is particularly true for software deployed on modern multicore processors.

Branch Mispredictions

Control structures such as if-statements and while-loops are implemented using conditional branch instructions.
Depending on some condition (e.g., if-test or while-test), conditional branches may transfer control from one point in an instruction stream to a different instruction address.
If a conditional branch is not taken, control flows to the next instruction.
Modern processors typically have a pipeline of 10-25 instructions in different stages of execution.
When a conditional branch instruction is encountered, a branch prediction is used to identify the next instruction to be loaded into the pipeline.
Processors use branch target buffers and branch prediction algorithms to improve prediction rates.
If the predicted branch is not taken, this is a branch misprediction.
Branch mispredictions cause a pipeline flush to clear out the incorrectly loaded instructions.
The pipeline flush in turn causes a processor stall, 10-20 CPU cycles of no useful work while the pipeline reloads.

Consider the following code that attempts to speed things up by avoiding a divison in many cases.

// Optimize z = x/y, given that 50% of the time y = 2.
// In this case,  we can just shift in this case.
if (y == 2) {
  z = x>>1;
}
else {
  z = x/y;
}

With a random pattern of divisor values being equal to 2 about half the time, a 50% branch prediction rate is likely. At 10 cycles per misprediction, this optimization costs 5 cycles to save half of the divisions.

Although division may take several cycles, a design goal of pipelined processors is to achieve an effective throughput of a single cycle per operation. Thus the "optimization" actually slows down performance.

An detailed example shows the effects of 50% branch mispredictions.

Cache Misses

Most modern processors use at least 2 levels of cache memory to avoid the cost of always retrieving data from off-chip memory.
The L1 cache is tightly integrated with the processor and may cost only 2-3 cycles to access.
L2 cache is generally much larger than L1 but may cost 10-20 cycles to access in the event of a cache miss.
A miss in the last-level cache may result in a penalty of 100-200 cycles to access main memory

Algorithms which attempt to use large tables of data in memory to save computation costs may be much slower than ones that perform computation (even recomputation) on demand.

Cache memory limitations affect instruction streams as well as data streams. Programs with large memory footprints can have very poor performance due to instruction cache misses.

Cache limitations become more severe in multicore systems, when processors typically share at least the last-level cache.

Interesting processor cache effects can show some surprisingly nonintuitive behaviour.

Short Vector SIMD Parallelism

Modern commodity processors generally include a set of SIMD instructions that operate simultaneously on short vectors of data held in wide registers. SSE on AMD/Intel, Neon on ARM and Altivec on Power PC all use 128-bit wide registers, which can allow four 32-bit operations at a time, eight simultaneous 16-bit operations or sixteen simultaneous single-byte operations.

Intel/AMD AVX and AVX2 instructions now support SIMD parallelism with 256-bit registers.

The Parabix project takes advantage of SIMD registers for text processing using parallel bit vectors, each having 1-bit per character. Thus SSE registers can be used to process 128 characters at a time.

Multicore Processors

All major processor families now feature multicore processors, i.e., processor chips with multiple processors on the chip.

Multicore systems can potentially increase throughput by increasing the number of processors being used to solve a particular problem. However, synchronization overheads may limit the effectiveness of multithreaded solutions. Furthermore the speedup that may be achieved is limited to that portion of a program that may be parallelized, in accord with Amdahl's Law.

Program Profiling and Performance Measurement

When software system performance is an important issue to overall software quality, systematic profiling and measurement of performance is important to identify system bottlenecks and opportunities for performance improvement.

Program profiling determines a profile of execution time broken down to the level of individual routines, using varous approaches.

Sampling statistically determines a program profile by periodically sampling program counter locations and reporting the relative frequencies that samples area taken in each of the systems individual routines.
Event-based approaches, record start and stop timing data triggered by specific events, such as the call to and return from individual functions.
Instrumentation approaches add specific monitoring and/or measurement statements to the software being evaluated. This may be done through source code addition or by compiler-aided instrumentation.

Processor Performance Counters

Modern processors have built-in performance counters for directly accessing various aspects of performances. The following perf list data shows some measurements available on an Intel SandyBridge processor.

  cpu-cycles OR cycles                               [Hardware event]
  stalled-cycles-frontend OR idle-cycles-frontend    [Hardware event]
  stalled-cycles-backend OR idle-cycles-backend      [Hardware event]
  instructions                                       [Hardware event]
  cache-references                                   [Hardware event]
  cache-misses                                       [Hardware event]
  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  bus-cycles                                         [Hardware event]

  cpu-clock                                          [Software event]
  task-clock                                         [Software event]
  page-faults OR faults                              [Software event]
  minor-faults                                       [Software event]
  major-faults                                       [Software event]
  context-switches OR cs                             [Software event]
  cpu-migrations OR migrations                       [Software event]
  alignment-faults                                   [Software event]
  emulation-faults                                   [Software event]

  L1-dcache-loads                                    [Hardware cache event]
  L1-dcache-load-misses                              [Hardware cache event]
  L1-dcache-stores                                   [Hardware cache event]
  L1-dcache-store-misses                             [Hardware cache event]
  L1-dcache-prefetches                               [Hardware cache event]
  L1-dcache-prefetch-misses                          [Hardware cache event]
  L1-icache-loads                                    [Hardware cache event]
  L1-icache-load-misses                              [Hardware cache event]
  L1-icache-prefetches                               [Hardware cache event]
  L1-icache-prefetch-misses                          [Hardware cache event]
  LLC-loads                                          [Hardware cache event]
  LLC-load-misses                                    [Hardware cache event]
  LLC-stores                                         [Hardware cache event]
  LLC-store-misses                                   [Hardware cache event]
  LLC-prefetches                                     [Hardware cache event]
  LLC-prefetch-misses                                [Hardware cache event]
  dTLB-loads                                         [Hardware cache event]
  dTLB-load-misses                                   [Hardware cache event]
  dTLB-stores                                        [Hardware cache event]

Here is an example of performance data extracted for three versions of an XML parser used with a GML-to-SVG conversion application. These versions include the original Xerces C++ parser, plus two versions of that parser systematically accelerated using Parabix technology.

Using the linux perf stat tool, measurements can be collected which report performance only while executing user (:u), ignoring the costs of system code.

GML to SVG using the original Xerces C++ parser.

perf stat -e instructions:u,cycles:u,cache-misses:u,branches:u,branch-misses:u\
,L1-dcache-load-misses:u,L1-icache-load-misses:u,LLC-load-misses:u,LLC-store-misses:u\
./gml2svg_xerces ../../data/layer/gml-10 out_4

 Performance counter stats for './gml2svg_3_1_1 ../../data/layer/gml-10 out_4':

    24,295,196,260 instructions:u            #    1.82  insns per cycle         [62.50%]
    13,366,940,989 cycles:u                  #    0.000 GHz                     [62.49%]
        92,645,936 L1-dcache-load-misses:u                                      [62.50%]
       152,881,279 L1-icache-load-misses:u                                      [62.47%]
        42,917,671 branch-misses:u           #    0.71% of all branches         [62.42%]
     6,067,392,465 branches:u                                                   [50.02%]
                 0 LLC-load-misses:u                                            [50.10%]
                 0 LLC-store-misses:u                                           [50.09%]

       3.781265157 seconds time elapsed

GML to SVG using the icXML accelerated version of Xerces C++.

perf stat -e instructions:u,cycles:u,cache-misses:u,branches:u,branch-misses:u\
,L1-dcache-load-misses:u,L1-icache-load-misses:u,LLC-load-misses:u,LLC-store-misses:u\
./gml2svg_icx ../../data/layer/gml-10 out_4

 Performance counter stats for './gml2svg_icx ../../data/layer/gml-10 out_4':

    16,400,836,143 instructions:u            #    1.90  insns per cycle         [62.56%]
     8,631,121,947 cycles:u                  #    0.000 GHz                     [62.56%]
        68,404,206 L1-dcache-load-misses:u                                      [62.36%]
        34,659,162 L1-icache-load-misses:u                                      [62.37%]
        15,480,827 branch-misses:u           #    0.48% of all branches         [62.37%]
     3,223,965,560 branches:u                                                   [50.12%]
                 0 LLC-load-misses:u                                            [50.21%]
                 0 LLC-store-misses:u                                           [50.12%]

       2.592457480 seconds time elapsed

GML to SVG using multithreaded icXML with 2-stage pipeline parallelism.

perf stat -e instructions:u,cycles:u,cache-misses:u,branches:u,branch-misses:u\
,L1-dcache-load-misses:u,L1-icache-load-misses:u,LLC-load-misses:u,LLC-store-misses:u\
./gml2svg_icx_pipeline ../../data/layer/gml-10 out_4

 Performance counter stats for './gml2svg_icx_pipeline ../../data/layer/gml-10 out_4':

    16,497,759,243 instructions:u            #    1.35  insns per cycle         [62.15%]
    12,187,579,701 cycles:u                  #    0.000 GHz                     [61.85%]
        71,401,265 L1-dcache-load-misses:u                                      [62.73%]
        24,600,865 L1-icache-load-misses:u                                      [63.61%]
        12,606,003 branch-misses:u           #    0.38% of all branches         [64.70%]
     3,285,914,450 branches:u                                                   [52.55%]
                 0 LLC-load-misses:u                                            [52.10%]
                12 LLC-store-misses:u                                           [50.33%]

       1.996408290 seconds time elapsed

An interesting aspect of these performance measurements shows that dramatic performance changes that can be achieved with parallel methods based on SIMD and multicore parallelism.

Dramatic L1 icache miss improvement.
Dramatic branch-miss improvement.