18. Sorting and Searching a vector

Now we come to sorting, which is an important and well-studied topic in computer science. Sorting a vector means to arrange its elements in order from smallest to largest.

Hundreds of different sorting algorithms have been created. Here we will only look at one of the most efficient simple sorting methods known as insertion sort.

We will restrict our attention to sorting vector<int> objects for simplicity, and also assume there are duplicate numbers among the values being sorted.

18.1. Linear Insertion Sort

Linear insertion sort is a simple (but slow!) sorting algorithm that you are probably already familiar with: many people use it to sort their cards when playing card games. For brevity, we usually refer to it as just insertion sort.

Here’s the idea. Suppose we want to arrange these values into ascending order:

5  6  1  2  4

Insertion sort conceptually divides this lists into a sorted part and an unsorted part:

sorted    unsorted
     5 |  6  1  2  4

The | is an imaginary line that shows where the sorted/unsorted portions begin/end. Initially, only the first element is on the sorted part, and the rest are on the unsorted part.

Insertion sort now repeats the following until the unsorted part is empty:

  • Take the first element of unsorted and insert it into its correct position in sorted (so that sorted is always in ascending sorted order).

Lets trace this on our sample data:

5  |  6  1  2  4

5  6  |  1  2  4

1  5  6  |  2  4

1  2  5  6  |  4

1  2  4  5  6  |

Note that finding the insertion point requires some searching. Typically, we use a variation of linear search to do this, hence the name linear insertion sort. It’s also possible to do the insertion using binary search (discussed below). In that case, the algorithm is called binary insertion sort.

How fast is insertion sort? Not very, it turns out. To estimate it’s performance, note that n - 1 insertions are needed to sort an n-element vector. Each insertion requires doing a linear search followed by an insertion. We can’t get an exact count of how many comparisons (the standard measure of sort efficiency) insert sort does without knowing the details of the data being sorted, but we can make a useful worst-case estimate.

So for the sake of simplifying the analysis, suppose that the linear search part of insertion sort always does the worst-case number of comparisons, i.e. it always searches to the end of the sorted part of the list. Then suppose we are sorting 5 numbers:

a | b c d e     0 comparisons to determine location of a
a b | c d e     1 comparison to determine location of b
a b c | d e     2 comparisons to determine location of c
a b c d | e     3 comparisons to determine location of d
a b c d e |     4 comparisons to determine location of e

This shows that, in the worst case, 0 + 1 + 2 + 3 + 4 = 10 comparisons are needed to sort 5 objects. Generalizing this is not too hard: to sort n objects, then, in the worst case, 0 + 1 + 2 + ... + n - 1 =
\frac{n(n-1)}{2} comparisons will be done. The expression \frac{n(n-1)}{2} is quadratic, because the power of its highest term is 2. This tells us that as n gets bigger, the number of comparisons done by insertion sort grows like n^2.

This analysis only considers comparisons: it ignores the work done when inserting new elements. So lets consider how many numbers are moved by insertion sort. In the worst case, insertion sort does the most work when we insert a number at the beginning of the vector (because it has to move all the elements after it up 1 position). In a vector with n elements, inserting an item at the beginning requires moving n - 1 elements. Since, in the worst case, we will be inserting n - 1 elements at the start (not n — the first element is always automatically considered to be sorted), (n-1)(n-1) = n^2 - 2n + 1 \approx n^2 items must be moved.

So whether you count comparisons or data moves, the result is the same. To sort n items in a vector, linear insertion sort takes time proportional to n^2. In practice, this turns out to be quite slow, and insertion sort is really only useful for sorting a small number of items (maybe a few thousand, depending upon the speed of your computer).

Here’s an implementation of insertion sort:

void insertion_sort(vector<int> v) {
  for(int i = 1; i < v.size(); ++i) {

    int key = v[i];             // key is the value we are going to insert
                                // into the sorted part of v

    int j = i - 1;              // j points to one position before the
                                // (possible) insertion point of the key;
                                // thus, key will eventually be inserted at
                                // v[j + 1]

    //
    // This loop determines where to insert the key into the sorted part
    // of v. It does this by searching backwards through the sorted part of v
    // for the first value that is less than, or equal to, key.
    //
    // This loop combines searching with element moving. Every time an element
    // bigger than the key is found, it is moved up one position.
    //
    // It's possible that there is no value in the sorted part that is
    // smaller than key, in which case key gets inserted at the very
    // start of v. This is a special case that is handled by the j >= 0 term
    // in the loop condition.
    //
    while (j >= 0 && v[j] > key) {
      v[j + 1] = v[j];
      --j;
    }
    // True after loop ends: (j < 0 || v[j] <= key)

    v[j + 1] = key;             // j points to the location *before*
                                // the one where key will be inserted
  }
}

While insertion sort is conceptually simple, you need to be quite careful to get the details exactly right when you implement it. This can be tricky, and so you should test it carefully on a variety of different arrays.

Note

What sorts of arrays should you test insertion sort on? Testing with small arrays is often a good way to find bugs, so we suggest at least the following:

  • The empty array, {}.
  • An array with one element, e.g. {5}.
  • An array with two elements already in sorted order, e.g. {5, 8}.
  • An array with two elements not in sorted order, e.g. {8, 5}.
  • An array with three elements in sorted order, e.g. {5, 8, 9}.
  • An array with three elements in reverse sorted order, e.g. {9, 8, 5}.
  • An array with three elements in no particular order, e.g. {5, 9, 8}.
  • A larger array, e.g. {7, 5, 3, 9, 0, 8, 1, 5}.

18.2. Faster Sorting

Linear insertion sort is far from being the most efficient sorting algorithm in practice. For large amounts of data, it runs extremely slowly. More efficient sorting algorithms exist that perform a number of comparisons proportional to n \log n. For example, mergesort, quicksort, and heapsort are three well-know sorting algorithms that all do about n
\log n on average. They are a little more complex than simple sorts like insertion sort, but they sort large amounts of data much more quickly.

Real-world sorting algorithms can be quite complex. For example, timsort is a popular sorting algorithm that can often do less than n \log n comparisons when sorting many real- world data sets. It does this by identifying parts of the data that are already partially sorted. Timsort is a very complex algorithm (see the source code) consisting of hundreds of lines of carefully crafted C code.

Table Of Contents

Previous topic

17. Some Basic Algorithms

Next topic

19. Calculating Powers

This Page