Base Conversion
- There are several algorithms that are needed to work with integers in a computer.
- Most obvious problem: computers store binary (0s and 1s). That's all we have to work with.
- All other data types must be build from bits.
Theorem: For an integer \(b>1\) (the base), a positive integer \(n\) can be represented uniquely as \[n=a_kb^k + a_{k-1}n^{k-1} +\cdots + a_1b^1 + a_0b^0\,,\] where \(k>0\) is an integer and each \(a_i\) is an intger \(0\le a_i\lt b\).
- For \(b=10\), you already knew this. The number 1253 is
\[1253 = 1\cdot10^{3} + 2\cdot10^{2} + 5\cdot 10^1+3\cdot 10^0\,.\]
- That is, each \(a_i\) is the decimal digit in each position.
- Decimal (the way you're used to counting) is also called “base 10”.
- The theorem implies that we can find an representation of any integer is any base.
- e.g. with base 7, we can represent 1257 as \[1257 = 3\cdot 7^3 + 4\cdot 7^2 + 4\cdot 7^1 + 0\cdot 7^0\,.\]
- We'll write this as a number with a subscript to indicate the base: \[1253_{10} = 3440_7\,.\]
- If we represent a number in base 2, we will get a binary representation: only 0s and 1s.
- With our earlier example, we get \(1253_{10} = 10011100101_2\)
- But how did we get it?
- This algorithm will convert a value (\(n\)) to any base (\(b\)):
procedure to_base(n, b) k = 0 while n ≠ 0 a[k] = n mod b n = ⌊n/b⌋ k = k+1 return a[k-1], a[k-2], …, a[1], a[0]
- If we try it on the base 7 example above, the variable change as follows for each iteration: (all values base 10)
\(n\) \(k\) \(n\mathop{\mathbf{mod}}7\) \(\lfloor n/7\rfloor\) 1253 0 0 179 179 1 4 25 25 2 4 3 3 3 3 0 - We can read the base \(b\) expansion up the \(n\mathop{\mathbf{mod}}b\) column.
- If we do this with base 16, we get:
\(n\) \(k\) \(n\mathop{\mathbf{mod}}16\) \(\lfloor n/16\rfloor\) 1253 0 5 78 78 1 14 4 4 2 4 0 - We don't have enough digits to represent the 14. We'll resort to letters: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F.
- Then, \(1253_{10}=4\mathrm{E}5_{16}\)
- Base 16 is often called hexadecimal.
- Other base occasionally used in computing: base 8 or octal.
Application: Base64 Encoding
- Many older email servers could only deal with ASCII characters: no non-English, no binary data (attachments, etc).
- When it became necessary to do those things in email, that data needed to be encoded somehow so that it wouldn't be mangled by older servers.
- Characters that are definitely safe in any system: letters A–Z and a–z, digits 0–9.
- That's 62. Add in “+” and “/” and we have 64 values.
- 64 unique values means we have a way to encode base 64 integers.
- Take the message you're trying to send, convert to base 64, and represent that using those characters.
- Since 64 is a power of 2, the algorithm can be simplified a little.
- Encoding in base64:
- Take the bits you have to encode. Break into 6-bit chunks. Each 6-bit string can be encoded as one base64 character.
- Convert the 6 bits (e.g. \(100110_2\)) to a number (\(2^5+2^2+2^1=38_{10}\)).
- Take the character that maps to that value (38 is “m”).
- Do that for each 6-bit chunk.
- Fix a little if you don't have a multiple of 6 bits in the original message.
- For example, the ASCII string “Hello world!” gets encoded as “SGVsbG8gd29ybGQh”.
- Longer messages are typically split across lines (long lines were another problem in old mail software):
RGlkIHlvdSByZWFsbHkgZGVjb2RlIHRoaXM/IEdvb2QgZm9yIHlvdS4gSXQncyBuaWNlIHRvIHNl ZSBzdHVkZW50cyB0YWtpbmcgaW5pdGlhdGl2ZS4KCk1heWJlIHlvdSBleHBlY3RlZCBzb21lIG1p ZHRlcm0gaGludHMgaGVyZT8gU29ycnkgYWJvdXQgdGhhdC4=
- Longer messages are typically split across lines (long lines were another problem in old mail software):
- Base64 is also used in data URLs in HTML: [from Wikipedia “Data URI scheme”]
<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAUA AAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO 9TXL0Y4OHwAAAABJRU5ErkJggg==" alt="Red dot" />
- Result:
- Result:
Base-n Arithmetic
- It turns out, you already know how to do basic arithmetic on base-\(n\) values.
- You learned it years ago: the teacher just always used base 10.
- We just have to adapt to other bases.
- Adding, the elementary school way: \(849+423\)
1 0 1 8 4 9 + 4 2 3 1 2 7 2 - What we did:
- Add up each column, starting from the right.
- If the total is 10 or more, subtract 10 and carry one.
- Include the carry when adding the next column.
- Adding any other base: replace “10” with “n” in the above.
- e.g. add \(1011_2+0010_2\): (remember \(2_{10}=10_2\) and \(3_{10}=11_2\))
0 0 1 0 1 0 1 1 + 0 0 1 0 0 1 1 0 1 - Did that work?
- We started with \(11_{10}+2_{10}\) and got \(01101_2=13_{10}\). So, yes.
- e.g. add \(\mathrm{F}227_{16}+3\mathrm{A}65_{16}\):
1 0 0 0 F 2 2 7 + 3 A 6 5 1 2 C 8 C - Again, we can check by converting to decimal: \(\mathrm{F}227_{16}+3\mathrm{A}65_{16}=61991_{10}+14949_{10}\) and got \(12\mathrm{C}8\mathrm{C}_{16}=76940_{10}\).
- Multiplication can be adapted too: \(849\times 23\) in base 10 looks like this
8 4 9 × 2 3 2 5 4 7 1 6 9 8 1 9 5 2 7 - Again, we do the same thing, using the new base instead of 10: multiply each digit, shifting once each time; add up the results.
- e.g. multiply \(1011_2+0101_2\):
1 0 1 1 × 0 1 0 1 1 0 1 1 0 0 0 0 1 0 1 1 0 0 0 0 0 1 1 0 1 1 1 - That was \(1011_2=11_{10}\) times \(101_2=5_{10}\) with result \(110111_2=55_{10}\). Again correct.
The Euclidean Algorithm
- The Euclidean algorithm is an algorithm to find the greatest common divisor of two integers.
- i.e. given integers \(a\) and \(b\), find the largest integer that divides both.
- It relies on this theorem…
Theorem: For integers \(a\), \(b\), \(q\), and \(r\) with \(a=bq+r\), \(\gcd(a,b)=\gcd(b,r)\).
Proof: Suppose \(d\) is a common divisor of both \(a\) and \(b\). Then from earlier theorems about divisibility, \(d\) also divides \(a-bq=r\). Thus any common divisor of \(a\) and \(b\) is also a divisor of \(r\).
Similarly, consider a factor \(e\) that divides both \(b\) and \(r\). Then \(e\) also divides \(bq+r=a\). Thus the set of common divisors of \(a\) and \(b\) and of \(b\) and \(r\) are the same. Then the largest element of those sets must also be identical.∎
- Note that \(r\) here could be \(a\mathop{\mathbf{mod}}b\), but doesn't have to be.
- This is the algorithm:
procedure gcd(a, b) while b ≠ 0 a, b = b, a mod b return a
- From the above theorem, the GCD of \(a\) and \(b\) doesn't change as the loop runs.
- We exit after \(a\mathop{\mathbf{mod}}b = 0\). Then \(b\mid a\), so the GCD of the two is \(b\) (which has shifted into \(a\) by the time we exit the loop).
- An example to find \(\gcd(123,345)\):
\(a\) \(b\) \(a\mathop{\mathbf{mod}}b\) 123 345 123 345 123 99 123 99 24 99 24 3 24 3 0 3 0 - So, \(\gcd(123,345)=3\).
- What is the running time of Euclid's algorithm?
- \(O(\log( \min(a,b)))\), but not for any obvious reason. (The text does prove it in section 4.3.)