HTTP
- The protocol used to transfer web resources.
- HyperText Transfer Protocol
- a text-based conversation, quite simple.
- There are more features in HTTP than most people realize.
- An example request/response for:
http://cmpt470.csil.sfu.ca/~ggbaker/test.html
- The user agent (client/browser) initiates the conversation by making a TCP connection to cmpt470.csil.sfu.ca on port 80 (the default) and sending a request like this:
GET /~ggbaker/test.html HTTP/1.1
Host: cmpt470.csil.sfu.ca
Connection: close
(blank line)
- The request line (GET …):
- GET is the method; (more later)
- then the path part of the URL;
- includes the query string if present
- and the HTTP version: 1.1 is current.
- Followed by one or more HTTP (request) headers.
- … in the format “Field-name: value”
- The Host header is required in HTTP 1.1.
- allows for name-based hosting: multiple sites on same IP address
- Other headers can give more info about the browser's capabilities, etc.
- Blank line indicates end of the request (or start of request body).
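The request described above can be sketched in Python. This is a minimal illustration rather than a full client: the function names `build_get_request` and `fetch` are made up for this example, and it assumes plain HTTP on port 80 with CRLF line endings between the lines of the request.

```python
import socket


def build_get_request(host: str, path: str) -> bytes:
    """Assemble a minimal HTTP/1.1 GET request as raw bytes."""
    lines = [
        f"GET {path} HTTP/1.1",  # request line: method, path, version
        f"Host: {host}",         # required in HTTP/1.1
        "Connection: close",     # ask the server to close when done
        "",                      # blank line: end of the headers
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def fetch(host: str, path: str, port: int = 80) -> bytes:
    """Send the request over a plain TCP connection; return the raw response."""
    with socket.create_connection((host, port), timeout=10) as conn:
        conn.sendall(build_get_request(host, path))
        chunks = []
        while True:
            chunk = conn.recv(4096)
            if not chunk:  # the server closed the connection
                break
            chunks.append(chunk)
    return b"".join(chunks)
```

Calling `fetch("cmpt470.csil.sfu.ca", "/~ggbaker/test.html")` would attempt the conversation from the example, provided that host is still reachable.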
- Request method:
- identifies the overall “action”.
- GET is the most common: all link clicks and typed URLs indicate a GET request.
- Form data submitted by GET are URI-encoded in the query string:
http://…?field1=value1&field2=value
- POST can also be used for form submission.
- Form data is encoded in the request body.
- The difference: GET requests should be safe.
- i.e. have no significant side-effects.
- GET can be safely reloaded, bookmarked, etc.
- The Post/Redirect/Get pattern is used to avoid reloading POST pages.
- There are other methods, which are mostly used for REST. (later?)
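As a rough illustration of the GET/POST difference, the same URI-encoded form data can travel either in the query string or in the request body. The field names and the `example.com` URL below are placeholders, not part of the original example.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical form fields, chosen to include characters that need escaping.
fields = {"field1": "value 1", "field2": "a&b"}

# URI-encode the fields: spaces and "&" are escaped.
encoded = urlencode(fields)

# GET: the encoded fields become the query string of the URL.
get_url = "http://example.com/form?" + encoded

# POST: the same encoding is sent in the request body instead.
post_body = encoded.encode("ascii")

# The server side reverses the encoding to recover the fields.
decoded = parse_qs(urlsplit(get_url).query)

print(encoded)
print(decoded)
```

Note that `parse_qs` returns a list of values per field, since a form may submit the same field name more than once.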
- Response from the server will look like:
HTTP/1.1 200 OK
Server: Apache/2.2.0
Content-language: en
Content-type: text/html; charset=utf-8
(blank line)
<html><head>…
- Status line (HTTP/…):
- HTTP/1.1: HTTP version of the response.
- 200: the status code, indicating success/failure/whatever of the request (more later).
- OK: the reason phrase, a human-readable version of the status code.
- The response headers indicate various info about the response.
- e.g. Content-type indicates the Internet media type (MIME type) of the content.
- The blank line separates the headers from the message body.
- The message body contains the contents of the response.
- Interpreted according to the header info.
- Empty for some responses.
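The parts of a response (status line, headers, blank line, body) can be pulled apart with a few lines of Python. This is a simplified sketch: `parse_response` is a name invented here, and it assumes the whole response is already in memory and ignores details such as chunked transfer encoding.

```python
def parse_response(raw: bytes):
    """Split a raw HTTP response into its status line parts, headers and body."""
    # The first blank line separates the headers from the message body.
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")

    # Status line: version, status code, reason phrase.
    version, code, reason = status_line.split(" ", 2)

    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so normalize to lowercase.
        headers[name.strip().lower()] = value.strip()

    return version, int(code), reason, headers, body
```

The body is left as bytes on purpose: how to interpret it depends on the header info, such as the `Content-type`.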
Status Codes
- The status code is used to indicate what kind of response is being sent.
- The first digit indicates the overall type.
- 1xx: informational
- 2xx: success
- 200 OK: everything is fine; contents of resource follow in message body.
- 206 Partial Content: client requested only part of the resource (with Range header); that part is being sent.
- 3xx: redirection (fixable without the user doing anything)
- 301 Moved Permanently
- 302 Found
- 303 See Other
- All indicate that the resource has moved.
- Web browser automatically takes user to the address in the Location header:
HTTP/1.1 301 Moved Permanently
Location: http://www.cs.sfu.ca/~ggbaker/foo.html
- Should always be used instead of a “this page has moved” message.
- Difference is in interpretation. Should bookmarks be changed? How should a search engine update its database?
- 304 Not Modified
- Can be returned if the browser's cached copy is current.
- … if the request was sent with an If-modified-since or If-none-match header. These indicate the age and specific content cached.
- No body is returned.
- 4xx: client error
- 403 Forbidden: you don't have permission.
- 404 Not Found: the server can't find the requested resource.
- Should only be seen after a typo: old URLs should return one of 301, 303, 410 forever.
- 410 Gone: the resource isn't available anymore.
- The informative version of 404.
- 5xx: server error
- 500 Internal Server Error: something bad happened and it wasn't the client's fault.
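The "first digit gives the overall type" rule is easy to express in code. The sketch below uses Python's standard `http.HTTPStatus` enum for the reason phrases; the `CLASSES` table and the `describe` function are names made up for this illustration.

```python
from http import HTTPStatus

# Overall meaning of the first digit of a status code.
CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe(code: int) -> str:
    """Return a one-line description of an HTTP status code."""
    kind = CLASSES.get(code // 100, "unknown")
    try:
        phrase = HTTPStatus(code).phrase  # standard reason phrase
    except ValueError:
        phrase = "(unregistered)"  # not a code the standard library knows
    return f"{code} {phrase}: {kind}"


for code in (200, 301, 304, 404, 410, 500):
    print(describe(code))
```

A client that meets an unfamiliar code can still fall back on the first digit, which is why the classes matter more than any single code.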
Web Servers
- The basic job of a web server is like any other server: answer requests.
- … which they do using HTTP.
- Requests come in; responses go out.
- But the job of a web server can be much more complicated than others (FTP, SSH, etc)
- … because so much web content must be created when it is requested.
- In other words, it is dynamic content (动态内容).
- Content that doesn't need to be created (and can just be read off disk) is static content (静态内容).
- There are many servers to choose from.
- Apache is the most common: more than half of all sites.
- MS IIS: under 20%.
- Rising fast: Nginx.
- For dynamic content, the server has to be configured to run the code to generate the content.
- Different for each server and server-side language/framework.
- Don't underestimate deployment (部署) when planning your schedule: it can be a big problem.
- The result is the same in all cases: user gets a custom-built page when they send a request.
Server-Side Programming
- There are many tools for server-side programming. Too many to talk about now.
- The old fashioned way to generate dynamic content: CGI scripts.
- Any executable program that prints the contents of the resource to stdout.
- with print, printf, cout, System.out.print, …
- Web server runs the program and sends its output to the client.
- You'll probably never use CGI, but it's a good place to see the basics.
- We will discuss why CGI is bad (slow) later.
- Minimal CGI script in Python:
print("Content-type: text/html")
print()
print("<title>Page</title>A web page.")
- Program must output part of the HTTP response: last header(s), blank line, message body.
- CGI scripts (and other dynamic pages) can also read input.
- … from form submission, URL contents, etc.
- “CGI” actually describes how the program can find this info.
- Except for the input/output details, it's just programming.
- You can do whatever you can do in any programming language: access a database, read files, do calculations, ….
- There are many frameworks (Rails, Django, Cake, ASP.NET MVC, Spring, …) that make the common jobs in web programming easier.
- They will be the subject of the technology evaluation in this course.
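To make the input side of CGI concrete, here is a small sketch of a script that reads a value from the URL. CGI passes the query string to the program in the `QUERY_STRING` environment variable; the `name` parameter and the `cgi_response` function are hypothetical, chosen only for this example.

```python
import html
import os
import sys
from urllib.parse import parse_qs


def cgi_response(environ) -> str:
    """Build the CGI output: last header(s), blank line, message body."""
    # The web server puts the query string into the environment.
    query = parse_qs(environ.get("QUERY_STRING", ""))
    name = query.get("name", ["world"])[0]

    # Escape user input before putting it into HTML.
    body = f"<title>Page</title>Hello, {html.escape(name)}."
    return "Content-type: text/html; charset=utf-8\r\n\r\n" + body


if __name__ == "__main__":
    # When run by the web server, the output goes to stdout.
    sys.stdout.write(cgi_response(os.environ))
```

Keeping the logic in a function that takes the environment as an argument makes it possible to try the script without a web server.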
Character Sets and Encodings
- What is sent server to browser is really a sequence of bits, not characters.
- Both must agree on how to convert bits to characters.
- i.e. characters must be encoded.
- If they don't agree, some characters will be displayed wrong.
- Character set (字集?): a list of characters and character ↔ number mappings.
- ASCII is the most universal. Defines 128 characters (95 of them printable): English-only, but each char fits in 7 bits.
- Many extensions use 8 bits and define ~220 characters, e.g. ISO-8859-1 for western European languages.
- GB2312 or GBK or GB18030 extend ASCII to Chinese characters.
- But these are mutually-incompatible and only one can be used per document.
- Unicode
- Character set designed to represent all written languages.
- Defines well over 100k characters.
- Character encoding (字符编码): how to convert character numbers ↔ bits.
- Before Unicode/GB2312/etc, this was easy: one byte per character.
- With Unicode, there are more choices.
- But your choice is easy: always use Unicode encoded with UTF-8.
- This encodes Unicode efficiently: ASCII characters are one byte each.
- … even for Chinese pages (I suggest).
- Most Chinese pages use GB2312 (but should use GB18030 which is more modern).
- Experimentally: GB2312 saves <10% compared with UTF-8 when encoding the same page with lots of Chinese text.
- Remember that a lot of your text (HTML code, URLs, etc) is ASCII anyway.
- GBK (and predecessors) make it impossible to encode many languages.
- What if your commenters want to write about something in their Thai/Hindi/Arabic course? (สวัสดี)
- … or their phonetics (发音学) course? (/fəˈnɛtɪks/)
- … or you want to use a funny bullet (✏) for a list?
- I tried to save a Wikipedia page in GB2312 and it failed (probably because of the “in other languages” section).
- Data point: cc98.org uses UTF-8.
- GB18030 is actually Unicode encoded in a complicated way to maintain legacy compatibility with GBK. If you use it (correctly), you've got Unicode.
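Some of these points can be checked directly with Python's built-in codecs. The sketch below is only an illustration: `can_encode` is a helper invented here, and the sample strings are arbitrary.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Report whether every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


markup = "<p>hello</p>"
chinese = "中文"
thai = "สวัสดี"

# ASCII characters take one byte each in UTF-8.
print(len(markup), len(markup.encode("utf-8")))

# Chinese characters: three bytes each in UTF-8, two in GB2312.
print(len(chinese.encode("utf-8")), len(chinese.encode("gb2312")))

# GB2312 cannot represent Thai; UTF-8 and GB18030 can.
for encoding in ("gb2312", "gb18030", "utf-8"):
    print(encoding, can_encode(thai, encoding))

# If sender and receiver disagree, characters are displayed wrong.
mojibake = "é".encode("utf-8").decode("iso-8859-1")
print(mojibake)
```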
- Always explicitly declare character encoding on documents!
- HTTP header: Content-type: text/plain; charset=utf-8 (probably best)
- in HTML/XHTML: <meta http-equiv="Content-type" content="text/html;charset=UTF-8" />
- In HTML5: <meta charset="utf-8">
- While you're at it, declare your language with the lang attribute: <html lang="zh">, <p lang="en">
- As a programmer, never guess character sets, or assume that byte==character.
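One way to follow the "never guess" advice on the receiving side is to decode a body only with the charset the sender declared. The sketch below borrows the header parser from Python's standard `email` package; `declared_charset` and `decode_body` are names made up for this example, and refusing to decode when nothing is declared is one possible policy, not the only one.

```python
from email.message import Message


def declared_charset(content_type: str):
    """Return the charset parameter of a Content-type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def decode_body(body: bytes, content_type: str) -> str:
    """Decode bytes to text using only the declared encoding."""
    charset = declared_charset(content_type)
    if charset is None:
        # No declaration: refuse rather than guess.
        raise ValueError("no charset declared in Content-type")
    return body.decode(charset)


print(decode_body("发音学".encode("utf-8"), "text/html; charset=UTF-8"))
```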
Redirects
- We have seen the HTTP status codes 301, 302, 303 for moved resources, and 410 for removed ones.
- Can be produced in several ways.
- By the web server: Apache's mod_alias, etc. (Definitely best for whole sites/directories.)
- In your code: you can return a 303 status from your app. (Good for specific cases that need logic to detect.)
- General principle: URLs should never become invalid.
- When planning a site/app, take some time to come up with a good URL structure and plan for the future.
- Redirect if it's really necessary to move stuff.
- This seems to be something Chinese sites have a real problem with.
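Producing a redirect from application code might look like the following WSGI sketch (WSGI is Python's standard interface between web servers and applications). The `MOVED` table and its paths are hypothetical; a real app would decide what has moved with whatever logic it needs.

```python
# Hypothetical table of resources that have moved: old path -> new path.
MOVED = {"/old/page.html": "/new/page.html"}


def app(environ, start_response):
    """Minimal WSGI app: redirect moved URLs, serve everything else."""
    path = environ.get("PATH_INFO", "/")

    if path in MOVED:
        # The Location header tells the browser where to go instead.
        start_response("301 Moved Permanently", [("Location", MOVED[path])])
        return [b""]

    start_response("200 OK", [("Content-type", "text/plain; charset=utf-8")])
    return [b"Hello"]
```

For a quick local trial, the standard library's `wsgiref.simple_server.make_server("", 8000, app)` can serve this app.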
HTTP Caching
- Both browsers and proxy servers can cache content to save network traffic and time later (高速缓存).
- Loading from disk or sending from nearby proxy server (代理服务器) is faster than fetching from remote server.
- A caching proxy is usually shared by all users of an ISP or network.
- When a request is made through a cache, there are three possibilities: [decreasing goodness]
- The cached copy is “fresh”.
- The original was sent with an Expires header and the expiry time hasn't passed yet.
- No need to check with the server: just use cached copy.
- Have a cached copy, but might not be up to date.
- Request to server with an If-modified-since and/or If-none-match header.
- Unchanged: server responds with 304 Not Modified (and Etag header indicating file to use).
- Changed: server sends 200 OK and new version (which gets cached).
- No cached version: request and cache.
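The three possibilities can be written as a small decision function. This is a simplified model that looks only at the `Expires` header; `cache_decision` and its return strings are invented for this sketch, and real caches consider more than this.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def cache_decision(cached_headers, now: datetime) -> str:
    """Decide what a cache should do for a request.

    cached_headers is the header dict stored with the cached copy,
    or None if nothing is cached.
    """
    if cached_headers is None:
        return "fetch"  # no cached version: request and cache

    expires = cached_headers.get("Expires")
    if expires is not None and now < parsedate_to_datetime(expires):
        return "use cached copy"  # still fresh: no need to ask the server

    # Have a copy, but it might be stale: ask the server with
    # If-modified-since and/or If-none-match.
    return "revalidate"


stored = {"Expires": "Wed, 21 Oct 2015 07:28:00 GMT"}
print(cache_decision(stored, datetime(2015, 10, 21, 7, 0, tzinfo=timezone.utc)))
print(cache_decision(stored, datetime.now(timezone.utc)))
```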
- For static content, the server will have configuration for expiry times.
- e.g. Apache mod_expires, nginx expires directive
- The Etag header identifies the specific content.
- Should change if and only if the contents change.
- Generated by the server for static content.
- Can be used to cache multiple copies (e.g. for caching + content negotiation).
- Dynamic content is not cached by default: no way for the server to guess how long it will be the same.
- The programmer can set expires headers where possible.
- … or generate 304 Not Modified if possible.
- Frameworks often provide support for caching.
- Can be combined with a reverse proxy for huge savings. (more later)
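For dynamic content, generating a 304 yourself could look roughly like this. The sketch derives an entity tag from a hash of the body, which is one common approach but an assumption here; `make_etag` and `conditional_response` are hypothetical names, and weak validators and the `*` form of `If-none-match` are ignored for brevity.

```python
import hashlib


def make_etag(body: bytes) -> str:
    """Derive an entity tag that changes if and only if the body changes."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def conditional_response(body: bytes, if_none_match=None):
    """Return (status, headers, body), answering 304 when the client is current."""
    etag = make_etag(body)
    if if_none_match is not None:
        # The header may list several tags separated by commas.
        client_tags = [tag.strip() for tag in if_none_match.split(",")]
        if etag in client_tags:
            return 304, {"ETag": etag}, b""  # no body is returned
    return 200, {"ETag": etag}, body


status, headers, _ = conditional_response(b"generated page")
print(status, headers["ETag"])
print(conditional_response(b"generated page", headers["ETag"])[0])
```

This still costs the work of generating the page on the server, but it saves sending it over the network.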
- General tips:
- For dynamic sites, remember your CSS, JS, images. They are usually static and can have long expiry times: can save a lot of network access.
- Use URLs consistently: don't use both /dir/index.html and /dir/
- Minimize dynamic content, POST. They aren't generally cached.
- Set reasonable expires headers, especially on shared CSS, JS, images.
- Consider JS libraries: if you have a directory jquery-1.2.3/, you know that will never change.
- It may be replaced by jquery-1.2.4/, but that's a new URL.
- Will be loaded with every page view, and could be a large download.
- Cache for a long time, maybe forever.
- Or use hosted version of the libraries that can be shared by many sites.
- Google Hosted Libraries, or many libraries have their own hosted versions.
Content Negotiation
- Different HTTP clients have different capabilities. So do their users.
- File types handled, natural languages read, etc.
- HTTP provides a way to automatically and transparently deal with some of these: content negotiation.
- With every request, browsers send HTTP headers indicating their capabilities.
- Media type:
Accept: application/xhtml+xml, text/html;q=0.9, */*;q=0.1
- Indicates media types browser can handle.
- … or at least types it prefers over others.
- “q=” indicates the “quality” of that type.
- “*/*” means “anything else”.
- Language(s) the user can read:
Accept-language: en, fr;q=0.5
- What natural language(s) can the user read?
- Set by user configuration, so not always reliable: often just the OS install language.
- If you're going to use this, definitely allow the user to set a preference to override.
- Character encodings:
Accept-charset: iso-8859-1, utf-8;q=0.9, *;q=0.7
- What character encodings can the browser handle?
- Probably not very useful: everybody can do UTF-8 and that's what you should be using anyway.
- File encodings:
Accept-encoding: gzip, deflate
- How can the data be encoded for transport?
- Typically used for transparent compression: most browsers support it.
- The server can compress the content (or find a compressed version on disk); the client knows how to uncompress it at their end.
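On the server side, transparent compression could be sketched as follows. `maybe_compress` is a name invented for this example; it checks only for gzip and ignores any q-values in the header.

```python
import gzip


def maybe_compress(body: bytes, accept_encoding: str):
    """Compress the body with gzip if the client says it can handle it.

    Returns the (possibly compressed) body and any extra response headers.
    """
    # Header looks like "gzip, deflate"; drop any ";q=..." parameters.
    offered = [item.split(";")[0].strip().lower()
               for item in accept_encoding.split(",")]

    if "gzip" in offered:
        # Content-Encoding tells the client how to undo the encoding.
        return gzip.compress(body), {"Content-Encoding": "gzip"}
    return body, {}


page = b"<p>hello</p>" * 100
sent, extra_headers = maybe_compress(page, "gzip, deflate")
print(len(page), len(sent), extra_headers)
```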
- For static content, you can configure the server.
- In Apache: either Multiviews or type maps to indicate the variants of the content you have.
- Most servers at least do gzip/deflate compression of static content, if not the other content negotiation features.
- For dynamic content, it's up to the programmer.
- Most frameworks have some support to make negotiation easier.
- … especially language negotiation.
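If a framework is not doing it for you, language negotiation can be approximated in a few lines. This sketch parses the q-values of an `Accept-language` style header and picks the best available variant; `parse_accept` and `choose` are hypothetical names, and matching is by exact tag only (no `en-US` to `en` fallback).

```python
def parse_accept(header: str):
    """Parse 'en, fr;q=0.5' into [(value, q), ...], best first."""
    items = []
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        value, quality = fields[0], 1.0  # q defaults to 1
        for param in fields[1:]:
            name, _, val = param.partition("=")
            if name.strip() == "q":
                quality = float(val)
        if value:
            items.append((value, quality))
    # sorted() is stable, so equal q-values keep the header's order.
    return sorted(items, key=lambda item: item[1], reverse=True)


def choose(header: str, available, default):
    """Pick the variant the client prefers most among those we have."""
    for value, quality in parse_accept(header):
        if quality == 0:
            continue  # q=0 means "not acceptable"
        if value == "*":
            return available[0]
        if value in available:
            return value
    return default


print(parse_accept("en, fr;q=0.5"))
print(choose("en, fr;q=0.5", ["fr", "zh"], default="zh"))
```

As noted above, the result should only be a starting point: a stored user preference should override it.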