Sorting and Searching (in C)¶
Sorting is an important and well-studied topic in computer science. A sorting algorithm re-arranges the elements of an array (or string, or any similar linear data structure) from smallest to biggest. For simplicity, in these notes we’ll stick to sorting arrays of ints.
For example, these numbers are in sorted order:
2, 3, 8, 8, 10, 101 // sorted
These are not in sorted order:
2, 4, 1, 6, 2, 7, 10 // unsorted
Notice that finding that min and max value of a sorted list is easy: the first value is the min, and the last value is the max.
Hundreds of different sorting algorithms have been created. Most of them can be described by the following specification:
// Pre-condition:
// none
// Post-condition:
// The elements of v are re-arranged into ascending
// order, i.e. v[0] <= v[1] <= ... <= v[n-1].
void sort(int* v, int n)
In what follows, we will look at two popular sorting algorithms: insertion sort and mergesort.
Linear Insertion Sort¶
Linear insertion sort, or just insertion sort, is a simple (but slow!) sorting algorithm that you are probably already familiar with: many people use it to sort their cards when playing card games.
Here’s the idea. Suppose we want to arrange these values into ascending order:
5 6 1 2 4
Insertion sort divides the list into a sorted part and an unsorted part:
sorted unsorted
5 | 6 1 2 4
The |
is an imaginary line that shows where the sorted/unsorted parts
begin and end. Initially, only the first element starts in the sorted part,
and the rest are in the unsorted part.
Insertion sort then repeats the following action until the unsorted part is empty:
- Take the first element of unsorted and insert it into its correct position in sorted (so that sorted is always in ascending sorted order).
Lets trace this on our sample data:
5 | 6 1 2 4
5 6 | 1 2 4
1 5 6 | 2 4
1 2 5 6 | 4
1 2 4 5 6 |
Note that finding the correct insertion point requires searching through the sorted part. Typically, we use a variation of linear search to do this, hence the name linear insertion sort. It’s also possible to do the insertion using binary search (discussed below), in which case the algorithm is called binary insertion sort.
Implementation of Linear Insertion Sort¶
Here’s an implementation of insertion sort:
void insertion_sort(int* v, int n) {
for(int i = 1; i < n; ++i) {
int key = v[i]; // key is the value we are going to insert
// into the sorted part of v
int j = i - 1; // j points to one position before the
// (possible) insertion point of the key;
// thus, key will eventually be inserted at
// v[j + 1]
//
// This loop determines where to insert the key into the sorted part
// of v. It does this by searching backwards through the sorted part
// for the first value that is less than, or equal to, key.
//
// This loop combines searching with element moving. Every time an element
// bigger than the key is found, it is moved up one position.
//
// It's possible that there is no value in the sorted part that is
// smaller than key, in which case key gets inserted at the very
// start of v. This is a special case that is handled by j >= 0
// in the loop condition.
//
while (j >= 0 && v[j] > key) {
v[j + 1] = v[j];
--j;
}
// After loop ends this is true: (j < 0 || v[j] <= key)
v[j + 1] = key; // j points to the location *before*
// the one where key will be inserted
}
}
While insertion sort is conceptually simple, the details can be tricky to get right and so you should test it carefully on a variety of different inputs.
The following function is helpful for testing:
// returns true if v is in ascending sorted order, and false otherwise
int is_sorted(int* v, int n) {
for(int i = 1; i < n; i++) { // note i starts at 1
if (!(v[i-1] <= v[i])) {
return 0;
}
}
return 1;
}
The Performance of Linear Insertion Sort¶
How fast is insertion sort? Not very, it turns out. To estimate it’s performance, note that n-1 insertions are needed to sort an n-element array. Each insertion requires doing a linear search followed by an insertion. Getting an exact count of how many comparisons (the standard measure of sort efficiency) are done is rather tricky, and so we will answer a simpler question: in the worst-case, how many comparisons does insertion sort do?
So, suppose that the linear search part of insertion sort always does the worst-case number of comparisons, i.e. it always searches to the end of the sorted part of the list. If we are sorting 5 numbers, it would go like this:
a | b c d e the 1st element, a, is already in correct position
a b | c d e 1 comparison to determine where to insert b
a b c | d e 2 comparisons to determine where to insert c
a b c d | e 3 comparisons to determine where to insert d
a b c d e | 4 comparisons to determine where to insert e
In total, \(0 + 1 + 2 + 3 + 4 = 10\) comparisons are needed in the worst case (in some cases fewer comparisons might be necessary). Generalizing this is not too hard: to sort n objects in the worst case, insertion sort does \(0 + 1 + 2 + ... + n - 1 = \frac{n(n-1)}{2}\) comparisons. The expression \(\frac{n(n-1)}{2} = \frac{1}{2}n^2 - \frac{1}{2}n\) is quadratic, because the power of its highest term is 2. When \(n\) is big, the low-order term \(\frac{1}{2}n\) doesn’t make much difference and can be ignored. So, when \(n\) is big, insertion sort does about \(n^2\) comparisons.
This analysis only considers comparisons: it ignores the work done when inserting new elements. So lets consider how many numbers are moved by insertion sort. Again, to make the analysis easier, we will only consider the worst case when all of the sorted part of the array up one position. When there are \(k\) numbers in the sorted part, the next insertion must move all \(k\) of those numbers. As shown above, the sorted part of the array increases by 1 after every insertion, so the total number of data moves is \(1+2+3+\ldots +(n-1) = \frac{n(n-1)}{2}\), which is approximately \(n^2\). So, when n is big, insertion sort does about \(n^2\) data moves in the worst case.
Whether you count comparisons or data moves (or both), the result is the same: to sort n items in an array, linear insertion sort does about \(n^2\) comparisons. In practice, this turns out to be quite slow, and so insertion sort should only be used for sorting a small number of items (maybe a few thousand, depending upon the speed of your computer).
Faster Sorting: Mergesort¶
There are sorting algorithms that are much more efficient than insertion sort. Here’s a high-level description of one of them, a divide-and-conquer sorting algorithm called mergesort.
Suppose you want to sort these 8 elements:
4 8 2 1 3 7 6 5
Using mergesort, you first divide the numbers into a left part and a right part:
|
left | right
4 8 2 1 | 3 7 6 5
|
Second, recursively sort the left part and then recursively sort the right part to get this:
|
left | right
1 2 4 8 | 3 5 6 7
|
The two parts are sorted, but the entire array is not sorted. So the next step is to merge the parts into a single sorted array.
Merging is the process of combining two sorted arrays into one sorted
array, and it can be done very efficiently. The basic technique is as follows.
Create two pointers, a
and b
, initialized to point to the beginning of
the two halves:
|
left | right
1 2 4 8 | 3 5 6 7
^ | ^
| |
a b
Also, create a result array big enough to hold both halves:
|
left | right
1 2 4 8 | 3 5 6 7
^ | ^
| |
a b
result: { }
To merge left and right, compare the elements that a
and b
point to,
and append a copy of the smaller to the end of result
. Then, increment
the appropriate pointer to point to the next element. For example, in the
above diagram 1 is less than 3, so 1 is appended to result and a
is
incremented:
left | right
1 2 4 8 | 3 5 6 7
^ | ^
| |
a b
result: { 1 }
Next, 2 is smaller than 3, so 2 is appended to result
and a
is
incremented:
left | right
1 2 4 8 | 3 5 6 7
^ | ^
| |
a b
result: { 1, 2 }
Next, 3 is less than 4, so 3 is appended to result
and b
is
incremented:
left | right
1 2 4 8 | 3 5 6 7
^ | ^
| |
a b
result: { 1, 2, 3 }
The merge continues in the same way until all the elements in left and right
have been appended to result
:
left | right
1 2 4 8 | 3 5 6 7
^ | ^
| |
a b
result: { 1, 2, 3, 4 }
left | right
1 2 4 8 | 3 5 6 7
^ | ^
| |
a b
result: { 1, 2, 3, 4, 5 }
left | right
1 2 4 8 | 3 5 6 7
^ | ^
| |
a b
result: { 1, 2, 3, 4, 5, 6 }
left | right
1 2 4 8 | 3 5 6 7
^ | ^
| |
a b
result: { 1, 2, 3, 4, 5, 6, 7 }
left | right
1 2 4 8 | 3 5 6 7
^| ^
| |
a b
result: { 1, 2, 3, 4, 5, 6, 7, 9 }
Now the entire array is sorted.
Here’s pseudocode for mergesort:
mergesort(v) // v has n elements
if v is size 0 or 1, then return // base case
half = n / 2
left = {v[0], v[1], ..., v[half-1]} // split v into 2 equal parts
right = {v[half], v[half+1], ..., v[n-1]}
left = mergesort(left) // recursively sort the 2 parts
right = mergesort(right)
v = merge(left, right) // combine the 2 parts
The high-level structure is quite straightforward, and also naturally recursive.
Here is a C implementation of mergesort:
// Pre-condition:
// is_sorted(v, begin, mid) and
// is_sorted(v, mid, end) and
// merge_arr is an array the same length as v
// Post-condition:
// is_sorted(begin, end)
void merge(int* v, int* merge_arr, int begin, int mid, int end) {
int a = begin;
int b = mid;
int n = end - begin;
// copy all elements into merge_arr
for(int i = 0; i < n; i++) {
if (a >= mid) {
merge_arr[i] = v[b];
b++;
} else if (b >= end) {
merge_arr[i] = v[a];
a++;
} else if (v[a] < v[b]) {
merge_arr[i] = v[a];
a++;
} else {
merge_arr[i] = v[b];
b++;
}
} // for
// copy elements from merge_arr back into correct location in v
for(int i = 0; i < n; i++) {
v[begin+i] = merge_arr[i];
}
} // merge
void mergesort_help(int* v, int* merge_arr, int begin, int end) {
const int n = end - begin;
if (n <= 1) return; // base case: empty, or single-element, sub-array
int mid = (begin + end) / 2;
mergesort_help(v, merge_arr, begin, mid);
mergesort_help(v, merge_arr, mid, end);
merge(v, merge_arr, begin, mid, end);
}
void mergesort(int* v, int n) {
// mergesort needs an extra array of length n to do the merging
int merge_arr[n];
mergesort_help(v, merge_arr, 0, n);
}
In practice, mergesort is very efficient, and in the worst case does about \(n \log n\) comparisons. This table shows the expected number of comparisons done by mergesort and insertion sort:
n \(n\log n\) (mergesort) \(n^2\) (insertion sort) 10 33 100 100 664 10000 1000 9966 1000000 10000 132877 100000000 100000 1660964 10000000000
For example, to sort one hundred thousand items, mergesort does about 1.6 million comparisons, while insertion sort does about 10 billion comparisons. In other words, insertion sort does 6000 times more work than mergesort to sort one hundred thousand items.
Mergesort can also been used as the starting point for many more sophisticated sorts. For example, timsort is a popular sorting algorithm that works well on many real-world data sets. It is essentially a variation of mergesort that is able to notice partially sorted data. Timsort is remarkably intricate (see the source code) and consists of hundreds of lines of carefully crafted code.
Binary Search¶
Sorted data is extremely valuable. One of the best demonstrations of this is the amazing performance of binary search, an algorithm that finds the location of a given element in a sorted array:
// Pre-condition:
// v is in ascending sorted order
// Post-condition:
// returns an index i such that v[i] == x
int binary_search(int* v, int x)
The pre-condition is essential here: binary search only works on data that is in (ascending) sorted order.
Binary search works by first looking for x
in the middle of the array.
There are three possible cases:
- The middle element is equal to
x
. In this casex
has been found and so binary search stops. x
is smaller than the middle element. In this case, ifx
is in the array it’s among the values to the left of the middle element, and so binary search is run on this left half.x
is bigger than the middle element. In this case, ifx
is in the array it’s among the values to the right of the middle element, and so binary search is run on this right half.
What makes this so efficient is that each comparison cause half of the remaining elements to be discarded.
For example, suppose our array looks like this:
4, 5, 8, 11, 15, 20, 21
Does it contain 6? Binary search would do the following:
4, 5, 8, 11, 15, 20, 21
^
|
check the middle-most element
Since 11 is not 6, we know that if 6 is in the array, it must be to the left of 11. Thus we can ignore all the elements from 11 to 21:
4, 5, 8
^
|
check the middle-most element
Since 5 is not 11, we know that if 6 is among these numbers it must be to the right of 5. Thus we can ignore 4 and 5:
8
^
|
check the middle-most element
After checking that 8 is not equal to 6, we are done: we’ve proven that 6 is not in the list of numbers. And it only tool 3 comparisons to do this. In contrast, linear search would have had to have checked all the numbers to make none are 3, and so would have done 7 comparisons.
Here’s an implementation:
// Pre-condition:
// v[begin] to v[end - 1] is in ascending sorted order
// Post-condition:
// returns an index i such that v[i] == x and begin <= i < end;
// if x is not found, -1 is returned
int binary_search_help(int* v, int x, int begin, int end) {
while (begin < end) {
int mid = (begin + end) / 2;
if (v[mid] == x) { // found x!
return mid;
} else if (x < v[mid]) {
end = mid;
} else if (x > v[mid]) {
begin = mid + 1;
}
}
return -1; // x not found
}
// Pre-condition:
// v is in ascending sorted order
// Post-condition:
// returns an index i such that v[i] == x; if x is not in
// v, -1 is returned
int binary_search(int* v, int n, int x) {
return binary_search_help(v, x, 0, n);
}
The algorithm is rather subtle at points, and it is easy to make mistakes. For example, this line is easy to get wrong:
int mid = (begin + end) / 2;
This calculates the mid-point of a range. You can derive this as follows. The length of the range is \(end - begin\), and half of that is \(\frac{end - begin}{2}\). Since the range starts at \(begin\), the mid-point is \(begin + \frac{end - begin}{2}\), which simplifies to \(\frac{end + begin}{2}\).
If \(end - begin\) happens to be odd, then \(\frac{end + begin}{2}\) ends with .5, and we simply chop that off (e.g. 34.5 becomes 34). This is not obviously the right thing to do, but if you test it with a few actual values by hand, and write some good tests, you can be pretty confident that it’s correct.
Here’s a recursive version of binary search:
int binary_search_rec_help(int* v, int x, int begin, int end) {
const int n = end - begin;
// if the sub-array being searched is empty, then x is not in it
if (n <= 0) return -1; // x not found
int mid = (begin + end) / 2;
if (x == v[mid]) {
return mid;
} else if (x < v[mid]) {
return binary_search_rec_help(v, x, begin, mid);
} else {
return binary_search_rec_help(v, x, mid + 1, end);
}
}
// Pre-condition:
// v is in ascending sorted order
// Post-condition:
// returns an index i such that v[i] == x; if x is not in
// v, -1 is returned
int binary_search_rec(int* v, int n, int x) {
return binary_search_rec_help(v, x, 0, n);
}
The code is similar in length and complexity to the non-recursive version, and so in practice the non-recursive version is generally preferred. But this recursive version is still extremely efficient.
How fast is binary search? Blazingly fast, as it turns out: for an n-element array, it needs to check, at most, \(log_2 n\) different values. So, for instance, on a million-element array binary search will never do at most 20 comparisons (because \(log_2 1000000 \approx 20\)).
While binary search is extremely efficient for searching large amounts of data, it comes with a price: the array must be in sorted order. The problem with sorting is that even the fastest general-purpose sorting algorithms are slower than linear search. Thus sorting an array and then calling binary search is usually slower than doing linear search. However, sorting the data once followed by multiple binary searches may be faster than lots of linear searches.
Note
Did you know that you can drive your car a mile without using a drop of gas? Just start at the top of a mile-long hill and roll down. Of course, the hard part is getting your car up there in the first place!
Binary search suffers from a similar problem.
Extra: Why is Binary Search so Fast?¶
Every time binary search iterates through its main loop one time, it discards
half of all the elements of v
. So if v
initially has n elements, the
number of possible elements that could be equal to x
decreases in size
like this:
n
n/2
n/4
n/8
n/16
...
n/2^i
Eventually, \(2^i\) is bigger than n, meaning there are no more elements to consider. We want to know the first value of \(i\) that makes \(\frac{n}{2^i} < 1\). That is, we want to solve for \(i\) in this inequality:
Multiplying everything by \(2^i\) gives this:
And taking the base-2 logarithm:
Since \(\log_2 2n = \log_2 2 + \log_2 n = 1 + \log_2 n\), it further simplifies to:
Thus, i
is approximately equal to \(\log_2 n\).
This result shows that binary search does about \(\log_2 n\) comparisons in the worst case. Sometimes it might do fewer, of course, but \(\log_2 n\) is small for even big values of n. For instance, \(\log_2 10^9 \approx 30\), meaning that binary search does at most 30 comparisons to find a given number in an array with a billion elements.
Note
One way to think about the base-2 logarithm of a number is this: \(log_2 n\) is the number of bits used in the binary representation of n.
Source Code¶
// sorting.c
#include <stdio.h>
#include <assert.h>
// Prints the given array in C-style, e.g. { 3, 1, -4 }.
void print_array(int* v, int n) {
assert(n >= 0);
if (n == 0) {
printf("{}");
} else if (n == 1) {
printf("{ %d }", v[0]);
} else { // n >= 2
printf("{ %d", v[0]);
for(int i = 1; i < n; i++) {
printf(", %d", v[i]);
}
printf(" }");
}
}
// Same as print_array, but prints a newline at the end.
void println_array(int* arr, int n) {
print_array(arr, n);
printf("\n");
}
// returns true if v is in ascending sorted order, and false otherwise
int is_sorted(int* v, int n) {
for(int i = 1; i < n; i++) { // note i starts at 1
if (!(v[i-1] <= v[i])) {
return 0;
}
}
return 1;
}
void insertion_sort(int* v, int n) {
for(int i = 1; i < n; ++i) {
int key = v[i]; // key is the value we are going to insert
// into the sorted part of v
int j = i - 1; // j points to one position before the
// (possible) insertion point of the key;
// thus, key will eventually be inserted at
// v[j + 1]
//
// This loop determines where to insert the key into the sorted part
// of v. It does this by searching backwards through the sorted part
// for the first value that is less than, or equal to, key.
//
// This loop combines searching with element moving. Every time an element
// bigger than the key is found, it is moved up one position.
//
// It's possible that there is no value in the sorted part that is
// smaller than key, in which case key gets inserted at the very
// start of v. This is a special case that is handled by j >= 0
// in the loop condition.
//
while (j >= 0 && v[j] > key) {
v[j + 1] = v[j];
--j;
}
// After loop ends this is true: (j < 0 || v[j] <= key)
v[j + 1] = key; // j points to the location *before*
// the one where key will be inserted
}
}
// Pre-condition:
// is_sorted(v, begin, mid) and
// is_sorted(v, mid, end) and
// merge_arr is an array the same length as v
// Post-condition:
// is_sorted(begin, end)
void merge(int* v, int* merge_arr, int begin, int mid, int end) {
int a = begin;
int b = mid;
int n = end - begin;
// copy all elements into merge_arr
for(int i = 0; i < n; i++) {
if (a >= mid) {
merge_arr[i] = v[b];
b++;
} else if (b >= end) {
merge_arr[i] = v[a];
a++;
} else if (v[a] < v[b]) {
merge_arr[i] = v[a];
a++;
} else {
merge_arr[i] = v[b];
b++;
}
} // for
// copy elements from merge_arr back into correct location in v
for(int i = 0; i < n; i++) {
v[begin+i] = merge_arr[i];
}
} // merge
void mergesort_help(int* v, int* merge_arr, int begin, int end) {
const int n = end - begin;
if (n <= 1) return; // base case: empty, or single-element, sub-array
int mid = (begin + end) / 2;
mergesort_help(v, merge_arr, begin, mid);
mergesort_help(v, merge_arr, mid, end);
merge(v, merge_arr, begin, mid, end);
}
void mergesort(int* v, int n) {
// mergesort needs an extra array of length n to do the merging
int merge_arr[n];
mergesort_help(v, merge_arr, 0, n);
}
// Pre-condition:
// v[begin] to v[end - 1] is in ascending sorted order
// Post-condition:
// returns an index i such that v[i] == x and begin <= i < end;
// if x is not found, -1 is returned
int binary_search_help(int* v, int x, int begin, int end) {
while (begin < end) {
int mid = (begin + end) / 2;
if (v[mid] == x) { // found x!
return mid;
} else if (x < v[mid]) {
end = mid;
} else if (x > v[mid]) {
begin = mid + 1;
}
}
return -1; // x not found
}
// Pre-condition:
// v is in ascending sorted order
// Post-condition:
// returns an index i such that v[i] == x; if x is not in
// v, -1 is returned
int binary_search(int* v, int n, int x) {
return binary_search_help(v, x, 0, n);
}
int binary_search_rec_help(int* v, int x, int begin, int end) {
const int n = end - begin;
// if the sub-array being searched is empty, then x is not in it
if (n <= 0) return -1; // x not found
int mid = (begin + end) / 2;
if (x == v[mid]) {
return mid;
} else if (x < v[mid]) {
return binary_search_rec_help(v, x, begin, mid);
} else {
return binary_search_rec_help(v, x, mid + 1, end);
}
}
// Pre-condition:
// v is in ascending sorted order
// Post-condition:
// returns an index i such that v[i] == x; if x is not in
// v, -1 is returned
int binary_search_rec(int* v, int n, int x) {
return binary_search_rec_help(v, x, 0, n);
}
void test1() {
int n = 4;
int a[] = {4, 1, 9, 5};
println_array(a, n);
insertion_sort(a, n);
println_array(a, n);
assert(is_sorted(a, n));
n = 1;
int b[] = {5};
println_array(b, n);
insertion_sort(b, n);
println_array(b, n);
assert(is_sorted(b, n));
n = 2;
int c[] = {3, 5};
println_array(c, n);
insertion_sort(c, n);
println_array(c, n);
assert(is_sorted(c, n));
n = 2;
int d[] = {5, 3};
println_array(d, n);
insertion_sort(d, n);
println_array(d, n);
assert(is_sorted(d, n));
printf("all insertion_sort tests passed\n");
}
void test2() {
int n = 4;
int a[] = {4, 1, 9, 5};
println_array(a, n);
mergesort(a, n);
println_array(a, n);
assert(is_sorted(a, n));
n = 1;
int b[] = {5};
println_array(b, n);
mergesort(b, n);
println_array(b, n);
assert(is_sorted(b, n));
n = 2;
int c[] = {3, 5};
println_array(c, n);
mergesort(c, n);
println_array(c, n);
assert(is_sorted(c, n));
n = 2;
int d[] = {5, 3};
println_array(d, n);
mergesort(d, n);
println_array(d, n);
assert(is_sorted(d, n));
printf("all mergesort tests passed\n");
}
void test3() {
int n = 4;
int a[] = {1, 4, 5, 9};
println_array(a, n);
assert(binary_search(a, n, 4) == 1);
assert(binary_search(a, n, 1) == 0);
assert(binary_search(a, n, 9) == 3);
assert(binary_search(a, n, 5) == 2);
assert(binary_search(a, n, 0) == -1);
assert(binary_search(a, n, 11) == -1);
assert(binary_search(a, n, 3) == -1);
assert(binary_search(a, n, 8) == -1);
printf("all binary_search tests passed\n");
}
void test4() {
int n = 4;
int a[] = {1, 4, 5, 9};
println_array(a, n);
assert(binary_search_rec(a, n, 4) == 1);
assert(binary_search_rec(a, n, 1) == 0);
assert(binary_search_rec(a, n, 9) == 3);
assert(binary_search_rec(a, n, 5) == 2);
assert(binary_search_rec(a, n, 0) == -1);
assert(binary_search_rec(a, n, 11) == -1);
assert(binary_search_rec(a, n, 3) == -1);
assert(binary_search_rec(a, n, 8) == -1);
printf("all binary_search_rec tests passed\n");
}
int main() {
test1();
test2();
test3();
test4();
}