Sorting and Searching

Sorting is an important and well-studied topic in computer science. A sorting algorithm re-arranges the elements of a vector (or an array, a string, or a similar linear data structure) from smallest to biggest. For simplicity, in these notes we’ll stick to sorting a vector<int>.

For example, these numbers are in sorted order:

2, 3, 8, 8, 10, 101   // sorted

These are not in sorted order:

2, 4, 1, 6, 2, 7, 10   // unsorted

Notice that finding the min and max values of a sorted list is easy: the first value is the min, and the last value is the max.
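For example (a small sketch; sorted_min and sorted_max are names introduced here just for illustration):

```cpp
#include <vector>
using namespace std;

// In a sorted vector, the smallest value is at the front
// and the biggest value is at the back.
int sorted_min(const vector<int>& v) { return v.front(); }
int sorted_max(const vector<int>& v) { return v.back(); }
```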

Hundreds of different sorting algorithms have been created. Most of them can be described by the following specification:

// Pre-condition:
//     none
// Post-condition:
//    The elements of v are re-arranged into ascending
//    order, i.e. v[0] <= v[1] <= ... <= v[n-1].
void sort(vector<int>& v)

In what follows, we will look at two popular sorting algorithms: insertion sort and mergesort.

Linear Insertion Sort

Linear insertion sort, or just insertion sort, is a simple (but slow!) sorting algorithm that you are probably already familiar with: many people use it to sort their cards when playing card games.

Here’s the idea. Suppose we want to arrange these values into ascending order:

5  6  1  2  4

Insertion sort divides the list into a sorted part and an unsorted part:

sorted    unsorted
     5 |  6  1  2  4

The | is an imaginary line that shows where the sorted/unsorted parts begin and end. Initially, only the first element starts in the sorted part, and the rest are in the unsorted part.

Insertion sort then repeats the following action until the unsorted part is empty:

  • Take the first element of unsorted and insert it into its correct position in sorted (so that sorted is always in ascending sorted order).

Let’s trace this on our sample data:

5  |  6  1  2  4

5  6  |  1  2  4

1  5  6  |  2  4

1  2  5  6  |  4

1  2  4  5  6  |

Note that finding the correct insertion point requires searching through the sorted part. Typically, we use a variation of linear search to do this, hence the name linear insertion sort. It’s also possible to do the insertion using binary search (discussed below), in which case the algorithm is called binary insertion sort.

Implementation of Linear Insertion Sort

Here’s an implementation of insertion sort:

void insertion_sort(vector<int>& v) {
  for(int i = 1; i < v.size(); ++i) {

    int key = v[i];             // key is the value we are going to insert
                                // into the sorted part of v

    int j = i - 1;              // j points to one position before the
                                // (possible) insertion point of the key;
                                // thus, key will eventually be inserted at
                                // v[j + 1]

    //
    // This loop determines where to insert the key into the sorted part
    // of v. It does this by searching backwards through the sorted part
    // for the first value that is less than, or equal to, key.
    //
    // This loop combines searching with element moving. Every time an element
    // bigger than the key is found, it is moved up one position.
    //
    // It's possible that there is no value in the sorted part that is
    // smaller than key, in which case key gets inserted at the very
    // start of v. This is a special case that is handled by j >= 0
    // in the loop condition.
    //
    while (j >= 0 && v[j] > key) {
       v[j + 1] = v[j];
       --j;
    }
    // After loop ends this is true: (j < 0 || v[j] <= key)

    v[j + 1] = key;             // j points to the location *before*
                                // the one where key will be inserted
  }
}

While insertion sort is conceptually simple, the details can be tricky to get right and so you should test it carefully on a variety of different inputs.

The following function is helpful for testing:

// returns true if v is in ascending sorted order, and false otherwise
bool is_sorted(const vector<int>& v) {
  for(int i = 1; i < v.size(); i++) {  // note i starts at 1
    if (!(v[i-1] <= v[i])) {
      return false;
    }
  }
  return true;
}

Testing with small arrays is often a good way to find bugs, and so we can test insertion_sort like this:

bool sort_ok(vector<int> v) {
  insertion_sort(v);
  return is_sorted(v);
}

void insertion_sort_test() {
  vector<int> empty;
  vector<int> one = {5};
  vector<int> two_a = {2, 7};
  vector<int> two_b = {4, 1};
  vector<int> two_c = {3, 3};
  vector<int> same = {4,4,4,4,4,4};
  vector<int> ordered = {-1, 0, 5, 9, 10};
  vector<int> rev = {8, 7, 3, 1, 0, -5};

  assert(sort_ok(empty));
  assert(sort_ok(one));
  assert(sort_ok(two_a));
  assert(sort_ok(two_b));
  assert(sort_ok(two_c));
  assert(sort_ok(same));
  assert(sort_ok(ordered));
  assert(sort_ok(rev));
  cout << "all insertion_sort tests passed\n";
}

The Performance of Linear Insertion Sort

How fast is insertion sort? Not very, it turns out. To estimate its performance, note that n-1 insertions are needed to sort an n-element vector, and each insertion requires a linear search followed by moving elements. Getting an exact count of how many comparisons (the standard measure of sort efficiency) are done is rather tricky, and so we will answer a simpler question: in the worst case, how many comparisons does insertion sort do?

So, suppose that the linear search part of insertion sort always does the worst-case number of comparisons, i.e. it always searches to the end of the sorted part of the list. If we are sorting 5 numbers, it would go like this:

a | b c d e     the 1st element, a, is already in correct position
a b | c d e     1 comparison  to determine where to insert b
a b c | d e     2 comparisons to determine where to insert c
a b c d | e     3 comparisons to determine where to insert d
a b c d e |     4 comparisons to determine where to insert e

In total, \(0 + 1 + 2 + 3 + 4 = 10\) comparisons are needed in the worst case (in some cases fewer comparisons might be necessary). Generalizing this is not too hard: to sort n objects in the worst case, insertion sort does \(0 + 1 + 2 + ... + (n - 1) = \frac{n(n-1)}{2}\) comparisons. The expression \(\frac{n(n-1)}{2} = \frac{1}{2}n^2 - \frac{1}{2}n\) is quadratic, because the power of its highest term is 2. When \(n\) is big, the low-order term \(\frac{1}{2}n\) doesn’t make much difference and can be ignored. So, when \(n\) is big, insertion sort does about \(\frac{1}{2}n^2\) comparisons in the worst case, a number proportional to \(n^2\).

This analysis only considers comparisons: it ignores the work done when inserting new elements. So let’s consider how many numbers are moved by insertion sort. Again, to make the analysis easier, we will only consider the worst case, when every insertion must move all of the sorted part of the vector up one position. When there are \(k\) numbers in the sorted part, the next insertion must move all \(k\) of those numbers. As shown above, the sorted part of the vector grows by 1 after every insertion, so the total number of data moves is \(1+2+3+\ldots +(n-1) = \frac{n(n-1)}{2}\), which is again proportional to \(n^2\). So, when n is big, insertion sort does about \(\frac{1}{2}n^2\) data moves in the worst case.

Whether you count comparisons or data moves (or both), the result is the same: to sort n items in a vector, linear insertion sort does an amount of work proportional to \(n^2\). In practice, this turns out to be quite slow, and so insertion sort should only be used for sorting a small number of items (maybe a few thousand, depending upon the speed of your computer).
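One way to check the \(\frac{n(n-1)}{2}\) worst-case figure is to instrument insertion sort so that it counts its own comparisons. Here is a sketch (count_comparisons is a name introduced here for illustration); it sorts a copy of v and returns the number of key comparisons made:

```cpp
#include <vector>
using namespace std;

// Insertion sort, instrumented to return the number of
// key comparisons (evaluations of v[j] > key) it performs.
long long count_comparisons(vector<int> v) {
    long long count = 0;
    for (int i = 1; i < (int)v.size(); ++i) {
        int key = v[i];
        int j = i - 1;
        while (j >= 0) {
            ++count;                 // compare key with v[j]
            if (v[j] <= key) break;  // insertion point found
            v[j + 1] = v[j];         // shift the bigger element up
            --j;
        }
        v[j + 1] = key;
    }
    return count;
}
```

On reverse-sorted input of length n this returns exactly \(\frac{n(n-1)}{2}\), while on already-sorted input it returns only \(n-1\).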

Faster Sorting: Mergesort

There are sorting algorithms that are much more efficient than insertion sort. Here’s a high-level description of one of them, a divide-and-conquer sorting algorithm called mergesort.

Suppose you want to sort these 8 elements:

4 8 2 1 3 7 6 5

Using mergesort, you first divide the numbers into a left part and a right part:

        |
  left  |  right
4 8 2 1 | 3 7 6 5
        |

Second, recursively sort the left part and then recursively sort the right part to get this:

        |
  left  |  right
1 2 4 8 | 3 5 6 7
        |

The two parts are sorted, but the entire vector is not sorted. So the next step is to merge the parts into a single sorted vector.

Merging is the process of combining two sorted vectors into one sorted vector, and it can be done very efficiently. The basic technique is as follows. Create two pointers, a and b, initialized to point to the beginning of the two halves:

        |
  left  |  right
1 2 4 8 | 3 5 6 7
^       | ^
|         |
a         b

Also, create a result vector big enough to hold both halves:

        |
  left  |  right
1 2 4 8 | 3 5 6 7
^       | ^
|         |
a         b

result: { }

To merge left and right, compare the elements that a and b point to, and append a copy of the smaller to the end of result. Then, increment the appropriate pointer to point to the next element. For example, in the above diagram 1 is less than 3, so 1 is appended to result and a is incremented:

  left  |  right
1 2 4 8 | 3 5 6 7
  ^     | ^
  |       |
  a       b

result: { 1 }

Next, 2 is smaller than 3, so 2 is appended to result and a is incremented:

  left  |  right
1 2 4 8 | 3 5 6 7
    ^   | ^
    |     |
    a     b

result: { 1, 2 }

Next, 3 is less than 4, so 3 is appended to result and b is incremented:

  left  |  right
1 2 4 8 | 3 5 6 7
    ^   |   ^
    |       |
    a       b

result: { 1, 2, 3 }

The merge continues in the same way until all the elements in left and right have been appended to result:

  left  |  right
1 2 4 8 | 3 5 6 7
      ^ |   ^
      |     |
      a     b

result: { 1, 2, 3, 4 }


  left  |  right
1 2 4 8 | 3 5 6 7
      ^ |     ^
      |       |
      a       b

result: { 1, 2, 3, 4, 5 }


  left  |  right
1 2 4 8 | 3 5 6 7
      ^ |       ^
      |         |
      a         b

result: { 1, 2, 3, 4, 5, 6 }


  left  |  right
1 2 4 8 | 3 5 6 7
      ^ |        ^
      |          |
      a          b

result: { 1, 2, 3, 4, 5, 6, 7 }


  left  |  right
1 2 4 8 | 3 5 6 7
       ^|        ^
       |         |
       a         b

result: { 1, 2, 3, 4, 5, 6, 7, 8 }

Now the entire vector is sorted.

Here’s pseudocode for mergesort:

mergesort(v)    // v has n elements
  if n is 0 or 1, then return               // base case: already sorted

  half = n / 2
  left = {v[0], v[1], ..., v[half-1]}       // split v into 2 halves
  right = {v[half], v[half+1], ..., v[n-1]}

  mergesort(left)                           // recursively sort the 2 halves
  mergesort(right)

  v = merge(left, right)                    // combine the sorted halves

The high-level structure is quite straightforward, and also naturally recursive.

Here is a C++ implementation of mergesort:

// Pre-condition:
//    is_sorted(v, begin, mid) and
//    is_sorted(v, mid, end)
// Post-condition:
//    is_sorted(v, begin, end)
void merge(vector<int>& v, int begin, int mid, int end) {
  int a = begin;
  int b = mid;
  vector<int> result(v);   // scratch copy of v; only positions
                           // begin to end - 1 are re-written
  for(int i = begin; i < end; i++) {
    if (a >= mid) {
      result[i] = v[b];
      b++;
    } else if (b >= end) {
      result[i] = v[a];
      a++;
    } else if (v[a] < v[b]) {
      result[i] = v[a];
      a++;
    } else {
      result[i] = v[b];
      b++;
    }
  } // for
  v = result;
} // merge

void mergesort(vector<int>& v, int begin, int end) {
  const int n = end - begin;
  if (n <= 1) return;  // base case: empty, or single-element, sub-vector
  int mid = begin + (end - begin) / 2;  // avoids overflow of begin + end
  mergesort(v, begin, mid);
  mergesort(v, mid, end);
  merge(v, begin, mid, end);
}

void mergesort(vector<int>& v) {
  mergesort(v, 0, v.size());
}
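mergesort deserves the same careful testing as insertion_sort. Here is a self-contained sketch (the mergesorted helper is a name introduced here for convenience; it sorts a copy of v and returns it):

```cpp
#include <vector>
using namespace std;

// merge and mergesort as in the text
void merge(vector<int>& v, int begin, int mid, int end) {
    int a = begin;
    int b = mid;
    vector<int> result(v);
    for (int i = begin; i < end; i++) {
        if (a >= mid)         { result[i] = v[b]; b++; }  // left half used up
        else if (b >= end)    { result[i] = v[a]; a++; }  // right half used up
        else if (v[a] < v[b]) { result[i] = v[a]; a++; }
        else                  { result[i] = v[b]; b++; }
    }
    v = result;
}

void mergesort(vector<int>& v, int begin, int end) {
    if (end - begin <= 1) return;         // base case
    int mid = begin + (end - begin) / 2;
    mergesort(v, begin, mid);
    mergesort(v, mid, end);
    merge(v, begin, mid, end);
}

// helper for testing: sort a copy of v and return it
vector<int> mergesorted(vector<int> v) {
    mergesort(v, 0, (int)v.size());
    return v;
}
```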

In practice, mergesort is very efficient: in the worst case it does about \(n \log n\) comparisons. This table shows the approximate number of comparisons done by mergesort and insertion sort for various values of n:

n         n log n (mergesort)    n^2 (insertion sort)
10        33                     100
100       664                    10000
1000      9966                   1000000
10000     132877                 100000000
100000    1660964                10000000000

For example, to sort one hundred thousand items, mergesort does about 1.6 million comparisons, while insertion sort does about 10 billion comparisons. In other words, insertion sort does about 6000 times as much work as mergesort to sort one hundred thousand items.
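The numbers in the table are easy to reproduce; here is a small sketch using <cmath> (the two function names are introduced here just for this check):

```cpp
#include <cmath>

// approximate comparison counts from the table above
double mergesort_comparisons(double n) { return n * std::log2(n); }
double insertion_comparisons(double n) { return n * n; }
```

For n = 100000 this gives roughly 1.66 million versus 10 billion, a ratio of about 6000.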

Mergesort can also be used as the starting point for many more sophisticated sorts. For example, timsort is a popular sorting algorithm that works well on many real-world data sets. It is essentially a variation of mergesort that is able to notice partially sorted data. Timsort is remarkably intricate (see the source code) and consists of hundreds of lines of carefully crafted code.

Optional Extra: Quicksort

Another very popular divide-and-conquer sorting algorithm is quicksort. In practice, it can be more efficient and use less memory than mergesort, although occasionally its performance can degrade if you’re unlucky.

Here is a C++ implementation of quicksort:

void swap(int& x, int& y) {
    int temp = x;
    x = y;
    y = temp;
}

// returns index of the final location of pivot value
int partition(vector<int>& v, int begin, int end) {
    int p = begin;
    for (int i = begin + 1; i < end; ++i) {
        if (v[i] <= v[p]) {
            swap(v[i], v[p + 1]);
            swap(v[p], v[p + 1]);
            ++p;
        }
    }
    return p;
}

// re-arranges the elements from v[begin] to v[end - 1] such that
// v[begin] <= v[begin + 1] <= ... <= v[end - 1]
void quicksort(vector<int>& v, int begin, int end) {
    const int n = end - begin;
    if (n <= 1) return;
    int p = partition(v, begin, end);
    quicksort(v, begin, p);
    quicksort(v, p + 1, end);
}

void quicksort(vector<int>& v) {
    quicksort(v, 0, v.size());
}
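Like the other sorts, quicksort should be tested on small inputs. Here is a self-contained sketch (quicksorted is a helper name introduced here; it sorts a copy of v and returns it):

```cpp
#include <vector>
using namespace std;

void swap(int& x, int& y) { int temp = x; x = y; y = temp; }

// partition as in the text: the pivot is v[begin]
int partition(vector<int>& v, int begin, int end) {
    int p = begin;
    for (int i = begin + 1; i < end; ++i) {
        if (v[i] <= v[p]) {
            swap(v[i], v[p + 1]);
            swap(v[p], v[p + 1]);
            ++p;
        }
    }
    return p;
}

void quicksort(vector<int>& v, int begin, int end) {
    if (end - begin <= 1) return;     // base case
    int p = partition(v, begin, end);
    quicksort(v, begin, p);           // pivot is already in place
    quicksort(v, p + 1, end);
}

// helper for testing: sort a copy of v and return it
vector<int> quicksorted(vector<int> v) {
    quicksort(v, 0, (int)v.size());
    return v;
}
```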

Like mergesort, quicksort splits the vector it’s sorting into two parts, and then recursively sorts them. But instead of merging the results, quicksort splits the vector using partitioning, which moves all the “small” elements (those no bigger than the pivot) to the left, and all the “big” elements to the right. So once the two parts are sorted, the entire vector is sorted! If partitioning produces two nearly-equal parts, then quicksort runs extremely quickly, typically faster than mergesort. However, in rare cases where many bad partitions (where one side is much bigger than the other) occur, it can slow to a crawl, running at about the same speed as insertion sort.
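To see partitioning on its own, here is a sketch that applies the partition function from the text to the example vector from the mergesort section (the trace in the note below assumes the first element, 4, is the pivot, as in the code):

```cpp
#include <vector>
using namespace std;

void swap(int& x, int& y) { int temp = x; x = y; y = temp; }

// partition as in the text: the pivot is v[begin]
int partition(vector<int>& v, int begin, int end) {
    int p = begin;
    for (int i = begin + 1; i < end; ++i) {
        if (v[i] <= v[p]) {
            swap(v[i], v[p + 1]);
            swap(v[p], v[p + 1]);
            ++p;
        }
    }
    return p;   // final index of the pivot value
}
```

Partitioning {4, 8, 2, 1, 3, 7, 6, 5} around the pivot 4 re-arranges it into {2, 1, 3, 4, 8, 7, 6, 5} and returns index 3: everything left of the pivot is at most 4, everything right of it is bigger, and neither side is itself sorted yet.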

Extra: Why is Binary Search so Fast?

Each time binary search iterates through its main loop, it discards half of the remaining elements of v. So if v initially has n elements, the number of elements that could possibly be equal to x shrinks like this:

\[n, \quad \frac{n}{2}, \quad \frac{n}{4}, \quad \frac{n}{8}, \quad \frac{n}{16}, \quad \ldots, \quad \frac{n}{2^i}\]

Eventually, \(2^i\) is bigger than n, meaning there are no more elements to consider. We want to know the first value of \(i\) that makes \(\frac{n}{2^i} < 1\). That is, we want to solve for \(i\) in this inequality:

\[\frac{n}{2^i} < 1 \leq \frac{n}{2^{i-1}}\]

Multiplying everything by \(2^i\) gives this:

\[n < 2^i \leq 2n\]

And taking the base-2 logarithm:

\[\log_2 n < i \leq \log_2 2n\]

Since \(\log_2 2n = \log_2 2 + \log_2 n = 1 + \log_2 n\), it further simplifies to:

\[\log_2 n < i \leq 1 + \log_2 n\]

Thus, i is approximately equal to \(\log_2 n\).

This result shows that binary search does about \(\log_2 n\) comparisons in the worst case. Sometimes it might do fewer, of course, but \(\log_2 n\) is small for even big values of n. For instance, \(\log_2 10^9 \approx 30\), meaning that binary search does at most 30 comparisons to find a given number in a vector with a billion elements.
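The main loop referred to above can be sketched as follows: a standard iterative binary search, instrumented to count how many times it probes and halves the range (binary_search_steps is a name introduced here for illustration):

```cpp
#include <vector>
using namespace std;

// Standard iterative binary search on a sorted vector.
// Returns how many times the loop ran, i.e. how many times
// the search range was probed and halved.
int binary_search_steps(const vector<int>& v, int x) {
    int lo = 0;
    int hi = (int)v.size();          // search range is [lo, hi)
    int steps = 0;
    while (lo < hi) {
        ++steps;
        int mid = lo + (hi - lo) / 2;
        if (v[mid] == x) {
            return steps;            // found x
        } else if (v[mid] < x) {
            lo = mid + 1;            // discard the left half
        } else {
            hi = mid;                // discard the right half
        }
    }
    return steps;                    // x is not in v
}
```

Searching a sorted vector of 16 elements for a value that isn’t there takes \(\log_2 16 + 1 = 5\) halvings; with a billion elements it would take about 30.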

Note

One way to think about the base-2 logarithm of a number is this: \(\log_2 n\) is, to within 1, the number of bits used in the binary representation of n.