We have talked a lot about transmitting information over the Internet. In order for any information to be transmitted over the network (or stored in your computer's memory, or stored on its disks), it must be encoded as a sequence of bits.
Each bit can have one of two values, usually written as 0 and 1. How these are represented on the network or disk doesn't matter to us, but it will be something like “an electrical pulse for 1 but none for 0” or “a magnetic field oriented this way for 0 but that way for 1”.
All of the information we have been working with has been turned into bits by our software. The details of encoding files like images are beyond the scope of this course: we can just be assured that the PNG and JPEG formats specify (very different) ways to turn image data into a sequence of bits (and back again). Text files, on the other hand, deserve a little more attention.
Bits and Bytes
You have likely seen the words “bits” and “bytes” used to describe amounts of information. As stated above, a bit is a single 0 or 1.
A byte is a sequence of eight bits: it is often a more convenient unit to start with when talking about storing or transmitting information. There are 2⁸ = 256 possible bytes, usually taken to represent the numbers 0–255:
00000000 = 0
00000001 = 1
00000010 = 2
00000011 = 3
00000100 = 4
…
11111111 = 255
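This correspondence between bit patterns and numbers is easy to check for yourself. A small Python sketch (the language choice here is just for illustration):

```python
# Interpret a string of bits as a base-2 number.
assert int("00000011", 2) == 3
assert int("11111111", 2) == 255

# Go the other way: format a number as eight bits.
assert format(4, "08b") == "00000100"

# Eight bits give 2**8 = 256 possible values.
assert 2 ** 8 == 256
```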
A kilobit (kb) is 2¹⁰ = 1024 bits and a kilobyte (kB) is 1024 bytes. A megabit and megabyte (Mb and MB) are 2²⁰ = 1048576 bits and bytes, respectively. A gigabit and gigabyte represent multiples of 2³⁰ = 1073741824.
The powers of two here (2¹⁰ = 1024 instead of 10³ = 1000) are used because powers of two occur frequently in computing and are often more convenient: 2ⁿ is the number of unique values that can be represented by n bits. The kilo- and mega- prefixes are usually used this way to represent powers of 1024 in computing, but are also sometimes used to indicate multiples of 1000 like they are for other metric units. It won't always be clear from the context, but usually the difference is not big enough to worry too much about.
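To make the arithmetic concrete, here is a quick Python check of the binary prefixes and of how far the power-of-1024 meaning drifts from the power-of-1000 meaning:

```python
kilobyte = 2 ** 10   # 1024 bytes
megabyte = 2 ** 20   # 1048576 bytes
gigabyte = 2 ** 30   # 1073741824 bytes

# The gap between the two meanings of "giga" is already noticeable:
# about 7% of a decimal gigabyte.
assert gigabyte - 10 ** 9 == 73741824
```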
We touched on some of these topics when discussing image file sizes in Using Images on Web Pages and when discussing image formats in Image Formats and Bitmap Image Formats. In order to make image files smaller, we were forced to decrease the amount of information we were storing, by decreasing the number of pixels, number of possible colours (which determines bits per pixel), and choice of compression algorithm (which does a better job of packing information into a smaller number of bits).
Character Encoding
Text must also be turned into bits: each character in a text file must be encoded as a sequence of bits. As long as the client and server agree on this encoding, everything will work. If they don't, the characters won't be decoded correctly.
For example, a tweet that was received by Obama during a campaign event contained the word “that’s”. The fifth character there isn't the apostrophe character that is on your keyboard; it is a right single quote character.
That was encoded (properly) by the sender and Twitter, but decoded incorrectly by the display software where it became “thatâ€™s”. This is a very easy mistake to make as a programmer or author, but very annoying for users when it happens (although most of us won't be on TV when it happens).
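That mistake is easy to reproduce. We don't know exactly which encoding the display software assumed, but decoding UTF-8 bytes as Windows-1252 (a common legacy encoding) produces the same garbage seen in the tweet:

```python
text = "that’s"  # the fifth character is U+2019, the right single quote

# The sender and Twitter encoded the text correctly as UTF-8;
# the quote alone becomes three bytes: 0xE2 0x80 0x99.
utf8_bytes = text.encode("utf-8")

# The display software then decoded those bytes with the wrong encoding:
garbled = utf8_bytes.decode("windows-1252")
assert garbled == "thatâ€™s"
```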
In order to have everything work, we first need to decide on a character set: a numbered list of all possible characters. The Unicode character set defines characters for all written languages, as we saw in Character References.
For example, in Unicode “A” is character number 65, and “€” is 8364 (so you can produce € in HTML with the character reference &#8364;). Unicode lists about 120,000 characters.
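In Python, the built-in ord and chr functions expose these character numbers (code points) directly, so the values above are easy to verify:

```python
# ord gives a character's Unicode number; chr goes the other way.
assert ord("A") == 65
assert ord("€") == 8364
assert chr(8364) == "€"
```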
Then we need to decide on a character encoding: a way to encode the character numbers into bits. There are several character encodings for Unicode. The choice for web pages is easy: use UTF-8.
The <meta charset="UTF-8" /> element we have been including on every page indicates that the document uses the UTF-8 encoding of Unicode, and specifying it explicitly ensures that the web browser will decode everything properly. This is the character encoding that should be used for all documents on the Web.
The way UTF-8 encodes characters is very similar to some other character sets (which is why most of the characters in the Obama tweet were decoded correctly). If you have any characters in a document that aren't on a standard English keyboard (€, á, ⽑, Я, …), make sure to test that they are being displayed correctly. (And programmers should always try some non-English characters to make sure their software works with them.)
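One detail worth knowing: UTF-8 is a variable-length encoding. Characters on a standard English keyboard take one byte each (and match older encodings, which is why most of the tweet survived), while other characters take two to four bytes. A short Python check of the characters mentioned above:

```python
# ASCII characters take one byte in UTF-8, matching older encodings...
assert len("A".encode("utf-8")) == 1

# ...while other characters take two to four bytes.
assert len("Я".encode("utf-8")) == 2
assert len("€".encode("utf-8")) == 3
assert len("⽑".encode("utf-8")) == 3
```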