Base Conversion

There are several algorithms that are needed to work with integers in a computer.
- Most obvious problem: computers store binary (0s and 1s). That's all we have to work with.
- All other data types must be build from bits.
Theorem: For an integer \(b>1\) (the base), a positive integer \(n\) can be represented uniquely as \[n=a_kb^k + a_{k-1}n^{k-1} +\cdots + a_1b^1 + a_0b^0\,,\] where \(k>0\) is an integer and each \(a_i\) is an intger \(0\le a_i\lt b\).
For \(b=10\), you already knew this. The number 1253 is \[1253 = 1\cdot10^{3} + 2\cdot10^{2} + 5\cdot 10^1+3\cdot 10^0\,.\]
- That is, each \(a_i\) is the decimal digit in each position.
- Decimal (the way you're used to counting) is also called “base 10”.
The theorem implies that we can find an representation of any integer is any base.
- e.g. with base 7, we can represent 1257 as \[1257 = 3\cdot 7^3 + 4\cdot 7^2 + 4\cdot 7^1 + 0\cdot 7^0\,.\]
- We'll write this as a number with a subscript to indicate the base: \[1253_{10} = 3440_7\,.\]
If we represent a number in base 2, we will get a binary representation: only 0s and 1s.
- With our earlier example, we get \(1253_{10} = 10011100101_2\)
- But how did we get it?

This algorithm will convert a value (\(n\)) to any base (\(b\)):

procedure to_base(n, b)
    k = 0
    while n ≠ 0
        a[k] = n mod b
        n = ⌊n/b⌋
        k = k+1
    return a[k-1], a[k-2], …, a[1], a[0]

If we try it on the base 7 example above, the variable change as follows for each iteration: (all values base 10)

\(n\) \(k\) \(n\mathop{\mathbf{mod}}7\) \(\lfloor n/7\rfloor\)

1253 0 0 179

179 1 4 25

25 2 4 3

3 3 3 0
- We can read the base \(b\) expansion up the \(n\mathop{\mathbf{mod}}b\) column.
If we do this with base 16, we get:

\(n\) \(k\) \(n\mathop{\mathbf{mod}}16\) \(\lfloor n/16\rfloor\)

1253 0 5 78

78 1 14 4

4 2 4 0
- We don't have enough digits to represent the 14. We'll resort to letters: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F.
- Then, \(1253_{10}=4\mathrm{E}5_{16}\)
- Base 16 is often called hexadecimal.
Other base occasionally used in computing: base 8 or octal.

\(n\)	\(k\)	\(n\mathop{\mathbf{mod}}7\)	\(\lfloor n/7\rfloor\)
1253	0	0	179
179	1	4	25
25	2	4	3
3	3	3	0

\(n\)	\(k\)	\(n\mathop{\mathbf{mod}}16\)	\(\lfloor n/16\rfloor\)
1253	0	5	78
78	1	14	4
4	2	4	0

Application: Base64 Encoding

Many older email servers could only deal with ASCII characters: no non-English, no binary data (attachments, etc).
- When it became necessary to do those things in email, that data needed to be encoded somehow so that it wouldn't be mangled by older servers.
Characters that are definitely safe in any system: letters A–Z and a–z, digits 0–9.
- That's 62. Add in “+” and “/” and we have 64 values.
- 64 unique values means we have a way to encode base 64 integers.
- Take the message you're trying to send, convert to base 64, and represent that using those characters.
- Since 64 is a power of 2, the algorithm can be simplified a little.
Encoding in base64:
- Take the bits you have to encode. Break into 6-bit chunks. Each 6-bit string can be encoded as one base64 character.
- Convert the 6 bits (e.g. \(100110_2\)) to a number (\(2^5+2^2+2^1=38_{10}\)).
- Take the character that maps to that value (38 is “m”).
- Do that for each 6-bit chunk.
- Fix a little if you don't have a multiple of 6 bits in the original message.

For example, the ASCII string “Hello world!” gets encoded as “SGVsbG8gd29ybGQh”.

Longer messages are typically split across lines (long lines were another problem in old mail software):

RGlkIHlvdSByZWFsbHkgZGVjb2RlIHRoaXM/IEdvb2QgZm9yIHlvdS4gSXQncyBuaWNlIHRvIHNl
ZSBzdHVkZW50cyB0YWtpbmcgaW5pdGlhdGl2ZS4KCk1heWJlIHlvdSBleHBlY3RlZCBzb21lIG1p
ZHRlcm0gaGludHMgaGVyZT8gU29ycnkgYWJvdXQgdGhhdC4=

Base64 is also used in data URLs in HTML: [from Wikipedia “Data URI scheme”]

<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAUA
AAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO
9TXL0Y4OHwAAAABJRU5ErkJggg==" alt="Red dot" />

Result:

Base-n Arithmetic

It turns out, you already know how to do basic arithmetic on base-\(n\) values.
- You learned it years ago: the teacher just always used base 10.
- We just have to adapt to other bases.
Adding, the elementary school way: \(849+423\)

1 0 1

8 4 9

+ 4 2 3

1 2 7 2
What we did:
- Add up each column, starting from the right.
- If the total is 10 or more, subtract 10 and carry one.
- Include the carry when adding the next column.
Adding any other base: replace “10” with “n” in the above.
e.g. add \(1011_2+0010_2\): (remember \(2_{10}=10_2\) and \(3_{10}=11_2\))

0 0 1 0

1 0 1 1

+ 0 0 1 0

0 1 1 0 1
- Did that work?
- We started with \(11_{10}+2_{10}\) and got \(01101_2=13_{10}\). So, yes.
e.g. add \(\mathrm{F}227_{16}+3\mathrm{A}65_{16}\):

1 0 0 0

F 2 2 7

+ 3 A 6 5

1 2 C 8 C
- Again, we can check by converting to decimal: \(\mathrm{F}227_{16}+3\mathrm{A}65_{16}=61991_{10}+14949_{10}\) and got \(12\mathrm{C}8\mathrm{C}_{16}=76940_{10}\).
Multiplication can be adapted too: \(849\times 23\) in base 10 looks like this

8 4 9

× 2 3

2 5 4 7

1 6 9 8

1 9 5 2 7
Again, we do the same thing, using the new base instead of 10: multiply each digit, shifting once each time; add up the results.
e.g. multiply \(1011_2+0101_2\):

1 0 1 1

× 0 1 0 1

1 0 1 1

0 0 0 0

1 0 1 1

0 0 0 0

0 1 1 0 1 1 1
- That was \(1011_2=11_{10}\) times \(101_2=5_{10}\) with result \(110111_2=55_{10}\). Again correct.

The Euclidean Algorithm

The Euclidean algorithm is an algorithm to find the greatest common divisor of two integers.
- i.e. given integers \(a\) and \(b\), find the largest integer that divides both.
- It relies on this theorem…
Theorem: For integers \(a\), \(b\), \(q\), and \(r\) with \(a=bq+r\), \(\gcd(a,b)=\gcd(b,r)\).

Proof: Suppose \(d\) is a common divisor of both \(a\) and \(b\). Then from earlier theorems about divisibility, \(d\) also divides \(a-bq=r\). Thus any common divisor of \(a\) and \(b\) is also a divisor of \(r\).

Similarly, consider a factor \(e\) that divides both \(b\) and \(r\). Then \(e\) also divides \(bq+r=a\). Thus the set of common divisors of \(a\) and \(b\) and of \(b\) and \(r\) are the same. Then the largest element of those sets must also be identical.∎
- Note that \(r\) here could be \(a\mathop{\mathbf{mod}}b\), but doesn't have to be.
This is the algorithm:
```
procedure gcd(a, b)
    while b ≠ 0
        a, b = b, a mod b
    return a
```
- From the above theorem, the GCD of \(a\) and \(b\) doesn't change as the loop runs.
- We exit after \(a\mathop{\mathbf{mod}}b = 0\). Then \(b\mid a\), so the GCD of the two is \(b\) (which has shifted into \(a\) by the time we exit the loop).
An example to find \(\gcd(123,345)\):

\(a\) \(b\) \(a\mathop{\mathbf{mod}}b\)

123 345 123

345 123 99

123 99 24

99 24 3

24 3 0

3 0
- So, \(\gcd(123,345)=3\).
What is the running time of Euclid's algorithm?
- \(O(\log( \min(a,b)))\), but not for any obvious reason. (The text does prove it in section 4.3.)

\(a\)	\(b\)	\(a\mathop{\mathbf{mod}}b\)
123	345	123
345	123	99
123	99	24
99	24	3
24	3	0
3	0