HTTP

The protocol used to transfer web resources.
- See also example HTTP conversation.
- Why HTTP?
- Web Basics: HTTP
- HTTP [zhwiki]
- HyperText Transfer Protocol
- a text-based conversation, quite simple.
There are more features in HTTP than most people realize.
An example request/response for: http://cmpt470.csil.sfu.ca/~ggbaker/test.html
The user agent (client/browser) initiates the conversation by making a TCP connection to cmpt470.csil.sfu.ca on port 80 (the default) like this:
```
GET /~ggbaker/test.html HTTP/1.1
Host: cmpt470.csil.sfu.ca
Connection: close
(blank line)
```
The request line (GET …)
- GET is the method; (more later)
- then the path part of the URL;
  - includes the query string if present
- and the HTTP version: 1.1 is current.
Followed by one or more HTTP (request) headers.
- … in the format “Field-name: value”
- The Host header is required in HTTP 1.1.
  - allows for name-based hosting: multiple sites on same IP address
- Other headers can give more info about the browser's capabilities, etc.
Blank line indicates end of the request (or start of request body).
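A minimal sketch of assembling the request above as raw bytes. The function name `build_request` is an assumption for illustration; the point is the layout: request line, headers, then a blank line, each terminated by CRLF.

```python
def build_request(host: str, path: str = "/") -> bytes:
    """Assemble a minimal HTTP/1.1 GET request as raw bytes."""
    lines = [
        f"GET {path} HTTP/1.1",   # request line: method, path, version
        f"Host: {host}",          # required in HTTP/1.1
        "Connection: close",      # ask the server to close after responding
        "",                       # blank line ends the headers
        "",
    ]
    # HTTP separates lines with CRLF, not a bare newline.
    return "\r\n".join(lines).encode("ascii")


request = build_request("cmpt470.csil.sfu.ca", "/~ggbaker/test.html")
print(request.decode("ascii"))
```

The bytes could then be written to a TCP connection opened with `socket.create_connection((host, 80))`; that step is left out here so the sketch runs without network access.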
Request method:
- identifies the overall “action”.
- GET is the most common: all link clicks and typed URLs indicate a GET request.
  - Form data submitted by GET are URI-encoded in the query string:
    http://…?field1=value1&field2=value
- POST can also be used for form submission.
  - Form data is encoded in the request body.
- The difference: GET requests should be safe.
  - i.e. have no significant side-effects.
  - GET can be safely reloaded, bookmarked, etc.
  - The Post/Redirect/Get pattern is used to avoid reloading POST pages.
- There are other methods, which are mostly used for REST. (later?)
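A small sketch of how the same form data is encoded for GET and for POST, using the standard library's `urllib.parse.urlencode`. The field values and the `example.com` URL are made up for illustration.

```python
from urllib.parse import urlencode

form = {"field1": "value1", "field2": "two words & more"}
encoded = urlencode(form)

# GET: the encoded data is appended to the URL as the query string.
get_url = "http://example.com/search?" + encoded

# POST: the same encoded data travels in the request body instead.
post_body = encoded.encode("ascii")

print(get_url)
print(post_body)
```

Spaces and the `&` inside a value are escaped so they cannot be confused with the separators between fields.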

Response from the server will look like:

HTTP/1.1 200 OK
Server: Apache/2.2.0
Content-language: en
Content-type: text/html; charset=utf-8
(blank line)
<html><head>…

Status line (HTTP/…)
- HTTP/1.1: HTTP version of the response.
- 200: status code—success/failure/whatever of the request (more later).
- OK: the reason phrase, a human-readable version of the status code.
The response headers indicate various info about the response.
- e.g. Content-type indicates the Internet media type (MIME type) of the content.
The blank line separates the headers from the message body.
The message body contains the contents of the response.
- Interpreted according to the header info.
- Empty for some responses.
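A rough sketch of splitting a response into the parts described above. `parse_response` is a hypothetical helper, and the sample body is a placeholder rather than the real page; a real client would also need to handle things like chunked bodies.

```python
def parse_response(raw: bytes):
    """Split a raw HTTP response into (version, code, reason, headers, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")      # blank line ends the headers
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()  # names are case-insensitive
    return version, int(code), reason, headers, body


raw = (b"HTTP/1.1 200 OK\r\n"
       b"Server: Apache/2.2.0\r\n"
       b"Content-language: en\r\n"
       b"Content-type: text/html; charset=utf-8\r\n"
       b"\r\n"
       b"<html><head></head></html>")   # placeholder body
version, code, reason, headers, body = parse_response(raw)
print(code, reason, headers["content-type"])
```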

Status Codes

The status code is used to indicate what kind of response is being sent.
The first digit indicates the overall type.
1xx: informational
2xx: success
- 200 OK: everything is fine; contents of resource follow in message body.
- 206 Partial Content: client requested only part of the resource (with Range header); that part is being sent.
3xx: fixable without the user doing anything
- 301 Moved Permanently
- 302 Found
- 303 See Other
  - All indicate that the resource has moved.
  - Web browser automatically takes user to the address in the Location header:
    HTTP/1.1 301 Moved Permanently
    Location: http://www.cs.sfu.ca/~ggbaker/foo.html
  - Should always be used instead of a “this page has moved” message.
  - Difference is in interpretation. Should bookmarks be changed? How should search engine update its database?
- 304 Not Modified
  - Can be returned if the browser's cached copy is current.
  - … if the request was sent with an If-modified-since or If-none-match header. These indicate the age of the cached copy and the specific content cached, respectively.
  - No body is returned.
4xx: client error
- 403 Forbidden: you don't have permission.
- 404 Not Found: the server can't find the requested resource.
  - Should only be seen after a mistyped URL: retired URLs should keep returning one of 301, 303, 410 indefinitely.
- 410 Gone: the resource isn't available anymore.
  - The informative version of 404.
5xx: server error
- 500 Internal Server Error: something bad happened and it wasn't the client's fault.
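A quick sketch of classifying a status code by its first digit. The `CLASSES` table and `status_class` are illustrative names; the reason phrases come from Python's standard `http.HTTPStatus`.

```python
from http import HTTPStatus

CLASSES = {1: "informational", 2: "success", 3: "redirection",
           4: "client error", 5: "server error"}


def status_class(code: int) -> str:
    """Return the overall type of a status code from its first digit."""
    return CLASSES.get(code // 100, "unknown")


for code in (200, 206, 301, 304, 404, 410, 500):
    print(code, HTTPStatus(code).phrase, "->", status_class(code))
```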

Web Servers

The basic job of a web server is like any other server: answer requests.
- … which they do using HTTP.
- Requests come in; responses go out.
But the job of a web server can be much more complicated than that of other servers (FTP, SSH, etc.)
- … because so much web content must be created when it is requested.
- In other words, it is dynamic content (动态内容).
- Content that doesn't need to be created (and can just be read off disk) is static content (静态内容).
There are many servers to choose from.
- Apache is the most common: more than half of all sites.
- MS IIS: under 20%.
- Rising fast: Nginx.
For dynamic content, the server has to be configured to run the code to generate the content.
- Different for each server and server-side language/framework.
- Don't ignore the difficulties of deployment (部署) when scheduling a project: it can be a big problem.
- The result is the same in all cases: user gets a custom-built page when they send a request.

Server-Side Programming

There are many tools for server-side programming. Too many to talk about now.
The old-fashioned way to generate dynamic content: CGI scripts.
- Any executable program that prints the contents of the resource to stdout.
- with print, printf, cout, System.out.print, …
- Web server runs the program and sends its output to the client.
- You'll probably never use CGI, but it's a good place to see the basics.
- We will discuss why CGI is bad (slow) later.

Minimal CGI script in Python:

#!/usr/bin/env python3
# Headers first, then a blank line, then the message body.
print("Content-type: text/html")
print()
print("<title>Page</title>A web page.")

Program must output part of the HTTP response: last header(s), blank line, message body.
CGI scripts (and other dynamic pages) can also read input.
- … from form submission, URL contents, etc.
- “CGI” actually describes how the program can find this info.
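A hedged sketch of a CGI script reading input. Under CGI the web server passes the query string in the `QUERY_STRING` environment variable; the form field `name` and the helper `render` are assumptions made up for this example.

```python
import html
import os
from urllib.parse import parse_qs


def render(environ) -> str:
    """Build the CGI output (headers, blank line, body) from the query string."""
    # The web server passes the part of the URL after "?" in QUERY_STRING.
    params = parse_qs(environ.get("QUERY_STRING", ""))
    name = params.get("name", ["world"])[0]   # "name" is a hypothetical field
    # Escape user input before putting it into HTML.
    body = "<title>Page</title>Hello, " + html.escape(name) + "."
    return "Content-type: text/html; charset=utf-8\n\n" + body


if __name__ == "__main__":
    print(render(os.environ))
```

Keeping the logic in a function that takes the environment as an argument makes it possible to try the script without a web server.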
Except for the input/output details, it's just programming.
- You can do whatever you can do in any programming language: access a database, read files, do calculations, ….
There are many frameworks (Rails, Django, Cake, ASP.NET MVC, Spring, …) that make the common jobs in web programming easier.
- They will be the subject of the technology evaluation in this course.

Character Sets and Encodings

What is sent from server to browser is really a sequence of bits, not characters.
- Both must agree on how to convert bits to characters.
- i.e. characters must be encoded.
- If they don't agree, some characters will be displayed wrong.
Character set (字集?): a list of characters and character ↔ number mappings.
- ASCII is the most universal. Defines 128 characters (95 of them printable): English-only but each char fits in 7 bits.
- Many extensions use 8 bits and define ~220 characters, e.g. ISO-8859-1 for western European languages.
- GB2312 or GBK or GB18030 extend ASCII to Chinese characters.
- But these are mutually incompatible and only one can be used per document.
Unicode
- Character set designed to represent all written languages.
- Defines ~100k characters.
Character encoding (字符编码): how to convert character numbers ↔ bits.
- Before Unicode/GB2312/etc, this was easy: one byte per character.
- With Unicode, there are more choices.
But your choice is easy: always use Unicode encoded with UTF-8.
- This encodes Unicode efficiently: ASCII characters are one byte each.
… even for Chinese pages (I suggest).
- Most Chinese pages use GB2312 (but should use GB18030 which is more modern).
- Experimentally: GB2312 saves less than 10% over UTF-8 when encoding the same page with lots of Chinese text.
- Remember that a lot of your text (HTML code, URLs, etc) is ASCII anyway.
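A small sketch for comparing encoded sizes yourself. The sample text and markup are made up; the exact savings on a real page depend on how much of it is ASCII markup, so treat the numbers as illustrative only.

```python
text = "你好世界"                          # four Chinese characters
markup = '<p lang="zh">' + text + "</p>"  # real pages mix in ASCII markup

for encoding in ("utf-8", "gb2312", "gb18030"):
    # Print the byte length of the bare text and of the text inside markup.
    print(encoding, len(text.encode(encoding)), len(markup.encode(encoding)))
```

The ASCII markup costs one byte per character in every one of these encodings, which is why the gap narrows on real pages.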
GBK (and predecessors) make it impossible to encode many languages.
- What if your commenters want to write about something in their Thai/Hindi/Arabic course? (สวัสดี)
- … or their phonetics (发音学) course? (/fəˈnɛtɪks/)
- … or you want to use a funny bullet (✏) for a list?
- I tried to save a Wikipedia page in GB2312 and it failed (probably because of the “in other languages” section).
- Data point: cc98.org uses UTF-8.
- GB18030 is actually Unicode encoded in a complicated way to maintain legacy compatibility with GBK. If you use it (correctly), you've got Unicode.
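A sketch for checking which encodings can represent the examples above, using Python's built-in codecs. `can_encode` is a hypothetical helper name.

```python
samples = ["สวัสดี", "/fəˈnɛtɪks/", "✏"]


def can_encode(text: str, encoding: str) -> bool:
    """Report whether every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


for encoding in ("gb2312", "gbk", "gb18030", "utf-8"):
    print(encoding, [can_encode(s, encoding) for s in samples])
```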
Always explicitly declare character encoding on documents!
- HTTP header: Content-type: text/plain; charset=utf-8 (probably best)
- in HTML/XHTML: <meta http-equiv="Content-type" content="text/html;charset=UTF-8" />
- In HTML5: <meta charset="utf-8">
- Declaring character encodings in HTML
- Language Code List
While you're at it, declare your language with the lang attribute: <html lang="zh">, <p lang="en">
As a programmer, never guess character sets, or assume that byte==character.

Redirects

We have seen the HTTP status codes 301, 302, 303 for moved resources, and 410 for removed ones.
Redirects can be produced in several ways.
- By the web server: Apache's mod_alias, etc. (Definitely best for whole sites/directories.)
- In your code: you can return a 303 status from your app. (Good for specific cases that need logic to detect.)
General principle: URLs should never become invalid.
- Cool URIs don't change
- When planning a site/app, take some time to come up with a good URL schema and plan for the future.
- Redirect if it's really necessary to move stuff.
This seems to be something Chinese sites have a real problem with.
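A minimal sketch of issuing a redirect from application code, written as a plain WSGI callable so it needs no framework. The `MOVED` table and its paths are hypothetical.

```python
# Hypothetical mapping from retired paths to their new homes.
MOVED = {"/old/page.html": "/new/page.html"}


def app(environ, start_response):
    """A tiny WSGI app that redirects moved paths and serves the rest."""
    path = environ.get("PATH_INFO", "/")
    if path in MOVED:
        # 301 tells browsers and search engines the move is permanent.
        start_response("301 Moved Permanently", [("Location", MOVED[path])])
        return [b""]
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [b"Current content for " + path.encode("utf-8")]
```

To try it locally, the callable can be served with the standard library's `wsgiref.simple_server.make_server("", 8000, app).serve_forever()`. For whole sites or directories, server configuration is still the better tool.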

HTTP Caching

Both browsers and proxy servers can cache content to save network traffic and time later (高速缓存).
- Loading from disk or sending from nearby proxy server (代理服务器) is faster than fetching from remote server.
- A caching proxy is usually shared by all users of an ISP or network.
When a request is made through a cache, there are three possibilities: [decreasing goodness]
1. The cached copy is “fresh”.
  - The original was sent with an Expires header and the expiry time hasn't passed yet.
  - No need to check with the server: just use cached copy.
2. A cached copy exists, but it might not be up to date.
  - Request to server with an If-modified-since and/or If-none-match header.
  - Unchanged: server responds with 304 Not Modified (and Etag header indicating file to use).
  - Changed: server sends 200 OK and new version (which gets cached).
3. No cached version: request and cache.
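The three cases can be sketched as a small decision function. `CacheEntry` and `plan_request` are illustrative names, and real caches consider more than the Expires time; this only mirrors the list above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheEntry:
    body: bytes
    expires: Optional[float] = None   # Unix time from the Expires header, if any
    etag: Optional[str] = None        # validator from the Etag header, if any


def plan_request(entry: Optional[CacheEntry], now: float) -> str:
    """Pick one of the three cases for a request made through a cache."""
    if entry is None:
        return "fetch"                # case 3: nothing cached
    if entry.expires is not None and now < entry.expires:
        return "use cached copy"      # case 1: still fresh
    return "revalidate"               # case 2: ask with If-none-match etc.


print(plan_request(CacheEntry(b"...", expires=200.0), now=100.0))
print(plan_request(CacheEntry(b"...", expires=50.0, etag='"abc"'), now=100.0))
print(plan_request(None, now=100.0))
```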
For static content, the server will have configuration for expiry times.
- e.g. Apache mod_expires, nginx expires directive
The Etag header identifies the specific content.
- Should change if and only if the contents change.
- Generated by the server for static content.
- Can be used to cache multiple copies (e.g. for caching + content negotiation).
Dynamic content is not cached by default: no way for the server to guess how long it will be the same.
- The programmer can set expires headers where possible.
- … or generate 304 Not Modified if possible.
- Frameworks often provide support for caching.
- Can be combined with a reverse proxy for huge savings. (more later)
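One possible way to generate 304 Not Modified for dynamic content: derive the Etag from a hash of the generated body and compare it with the client's If-none-match value. The helper `respond` and the choice of a truncated SHA-256 are assumptions for illustration; frameworks usually offer their own mechanism.

```python
import hashlib


def respond(body: bytes, if_none_match=None):
    """Return (status, headers, body), sending 304 when the client's copy matches."""
    # Hash of the content: changes when the body changes
    # (ignoring the negligible chance of a hash collision).
    etag = '"' + hashlib.sha256(body).hexdigest()[:16] + '"'
    headers = {"Etag": etag}
    if if_none_match == etag:
        return "304 Not Modified", headers, b""   # no body is returned
    return "200 OK", headers, body


status, headers, _ = respond(b"<p>generated page</p>")
print(status, headers["Etag"])
print(respond(b"<p>generated page</p>", headers["Etag"])[0])
```

This saves network traffic but not server work, since the page is still generated before it is hashed.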
General tips:
- For dynamic sites, remember your CSS, JS, images. They are usually static and can have long expiry times: can save a lot of network access.
- Use URLs consistently: don't use both /dir/index.html and /dir/
- Minimize dynamic content and POST requests. They aren't generally cached.
- Set reasonable expires headers, especially on shared CSS, JS, images.
- Consider JS libraries: if you have a directory jquery-1.2.3/, you know that will never change.
  - It may be replaced by jquery-1.2.4/, but that's a new URL.
  - Will be loaded with every page view, and could be a large download.
  - Cache for a long time, maybe forever.
  - Or use a hosted version of the library that can be shared by many sites.
    - Google Hosted Libraries, or many libraries have their own hosted versions.

Content Negotiation

Different HTTP clients have different capabilities. So do their users.
- File types handled, natural languages read, etc.
HTTP provides a way to automatically and transparently deal with some of these: content negotiation.
- With every request, browsers send HTTP headers indicating their capabilities.
Media type:
Accept: application/xhtml+xml, text/html;q=0.9, */*;q=0.1
- Indicates media types browser can handle.
- … or at least types it prefers over others.
- “q=” indicates the “quality” of that type.
- “*/*” means “anything else”.
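A simplified sketch of reading the quality values out of an Accept header. `parse_accept` is a hypothetical helper; it ignores parameters other than `q` and does no validation.

```python
def parse_accept(header: str):
    """Return (media_type, q) pairs from an Accept header, best first."""
    prefs = []
    for item in header.split(","):
        media_type, *params = [part.strip() for part in item.split(";")]
        q = 1.0                                  # default quality when q= is absent
        for param in params:
            if param.startswith("q="):
                q = float(param[2:])
        prefs.append((media_type, q))
    # sorted() is stable, so equal-q entries keep their header order.
    return sorted(prefs, key=lambda pair: pair[1], reverse=True)


print(parse_accept("application/xhtml+xml, text/html;q=0.9, */*;q=0.1"))
```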
Language(s) the user can read:
Accept-language: en, fr;q=0.5
- What natural language(s) can the user read?
- Set by user configuration, so not always reliable: often just the OS install language.
- If you're going to use this, definitely allow the user to set a preference to override.
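A sketch of language selection that lets a stored user preference override the header. `choose_language` and its parameters are assumptions for this example, and the matching is deliberately crude (primary subtag only).

```python
def choose_language(accept_language, available, default="en", override=None):
    """Pick a page language: stored user preference first, then the header."""
    if override in available:
        return override                          # explicit user choice wins
    ranked = []
    for item in accept_language.split(","):
        tag, _, param = item.strip().partition(";")
        param = param.strip()
        q = float(param[2:]) if param.startswith("q=") else 1.0
        ranked.append((q, tag.strip().lower()))
    for q, tag in sorted(ranked, key=lambda pair: pair[0], reverse=True):
        # Match "fr-CA" against an available "fr" by comparing the primary subtag.
        primary = tag.split("-")[0]
        if q > 0 and primary in available:
            return primary
    return default


print(choose_language("en, fr;q=0.5", {"fr", "zh"}))
print(choose_language("en, fr;q=0.5", {"fr", "zh"}, override="zh"))
```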
Character encodings:
Accept-charset: iso-8859-1, utf-8;q=0.9, *;q=0.7
- What character encodings can the browser handle?
- Probably not very useful: everybody can do UTF-8 and that's what you should be using anyway.
File encodings:
Accept-encoding: gzip, deflate
- How can the data be encoded for transport?
- Typically used for transparent compression: most browsers can handle it.
- The server can compress the content (or find a compressed version on disk); the client knows how to uncompress it at their end.
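A rough sketch of transparent compression on the server side, using the standard `gzip` module. `encode_body` is a hypothetical helper, and for simplicity it ignores `q` values in the header (including `q=0`).

```python
import gzip


def encode_body(body: bytes, accept_encoding: str):
    """Gzip the body when the client's Accept-encoding header allows it."""
    offered = [item.split(";")[0].strip().lower()
               for item in accept_encoding.split(",")]
    if "gzip" in offered:
        return {"Content-Encoding": "gzip"}, gzip.compress(body)
    return {}, body                      # fall back to sending it uncompressed


page = b"<p>hello</p>" * 200
headers, payload = encode_body(page, "gzip, deflate")
print(headers, len(page), "->", len(payload))
assert gzip.decompress(payload) == page  # the client can recover the original
```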
For static content, you can configure the server.
- In Apache: either Multiviews or type maps to indicate the variants of the content you have.
- Most servers at least do gzip/deflate compression of static content, if not the other content negotiation features.
For dynamic content, it's up to the programmer.
- Most frameworks have some support to make negotiation easier.
- … especially language negotiation.