Library/Framework Support
- The big lesson of the technology evaluation: you don't have to code everything from scratch.
- There are many common tasks in web development: user authentication and sessions, HTML generation, processing form submissions, etc.
- Modern frameworks take care of many of these for you.
- … and there are third-party libraries that do a lot more.
- Before you tackle any big problem, first check whether someone else has already solved it.
- Frameworks also promote MVC separation.
- … a good habit to get into.
- Don't fight it: keep your logic separated and maintainable.
- But the frameworks are big and take time to learn.
- There are many new functions/modules to learn, and different workflows.
- But that up-front time is repaid quickly: development should speed up soon after.
Architecture and Speed
- The baseline for our comparison: CGI.
- To generate a response, the web server sets some environment variables and runs an external program.
- For every request, the server must:
- Load the interpreter and start the process.
- Load the code and parse/compile it.
- Do any startup work: load config files, open DB connection, etc.
- Produce the response.
- If we can avoid some of these steps, the server can handle requests faster.
- Of course, the goal is to do as little work as possible for each request.
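- For a concrete picture, here is a minimal Python CGI sketch (an illustration, using only the standard library): all four steps above happen on every single request.

    #!/usr/bin/env python3
    # Minimal CGI program. The web server sets environment variables
    # (REQUEST_METHOD, QUERY_STRING, ...) and runs this script once per
    # request: interpreter startup, parsing this code, and any setup work
    # are all repeated every time.
    import os

    print("Content-Type: text/plain")
    print()  # a blank line ends the CGI response headers
    print("Method:", os.environ.get("REQUEST_METHOD", ""))
    print("Query: ", os.environ.get("QUERY_STRING", ""))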
- Step 1: start the interpreter. Solution: keep it running.
- If we incorporate the interpreter into the server, it will already be there.
- PHP was designed as an Apache module.
- mod_perl, mod_python, and mod_wsgi integrate their interpreters as Apache modules.
- JSP and Servlets are incorporated into Tomcat.
- ASP and CLR are integrated into IIS.
- Or could start a separate process but keep it running.
- … server communicates with that process through a socket (or similar).
- Better for shared hosting: processes can have different owners, and many languages can be supported.
- e.g. FastCGI, SCGI, mod_passenger, mod_wsgi.
- Reverse-proxying solutions can be thought of this way too. (covered later)
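- A toy sketch of the long-running-process idea (an ad-hoc protocol over a socket, not real FastCGI): the worker below starts once and then answers any number of requests.

    import socketserver

    class WorkerHandler(socketserver.StreamRequestHandler):
        def handle(self):
            # One front-end request arrives per connection; the interpreter,
            # the compiled code, and any setup stay resident between requests.
            path = self.rfile.readline().strip()
            self.wfile.write(b"generated response for " + path + b"\n")

    if __name__ == "__main__":
        # Started once; the front-end web server connects to this socket
        # for every request instead of launching a new process each time.
        with socketserver.TCPServer(("127.0.0.1", 9000), WorkerHandler) as server:
            server.serve_forever()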
- Step 2: load and compile the code.
- Unavoidable, but can cache compiled version of recently-run code.
- e.g. PHP accelerators
- Or keep the code running.
- e.g. FastCGI, Servlets
- It can be hard to tell which is happening: mod_wsgi, mod_passenger, etc. all do one of these, but it's not clear (to me) which.
- Step 3: startup work.
- Opening DB connections and reading config files can be a lot of overhead for a single request.
- Initializing a framework can be a huge amount of work.
- Usual solution: structure your code so that startup and page generation work are in separate functions/methods.
- First request: do both, but keep environment set up (i.e. keep the process running).
- Later requests: run page generation code only.
- e.g. Java Servlets are very explicit:
public class FooServlet extends HttpServlet {
    public void init(…) {…}      // startup work: run once, when the servlet loads
    public void doGet(…) {…}     // page generation: run for each GET request
    public void destroy() {…}    // cleanup when the servlet is unloaded
}
- mod_wsgi, mod_passenger, FastCGI, etc. allow similar separation.
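- e.g. a WSGI application (as run by mod_wsgi and friends) gets a similar separation almost for free; a minimal sketch:

    # Module-level code runs once, when the server process loads the module.
    CONFIG = {"greeting": "Hello"}  # in real code: parse config files, open DB connections

    def application(environ, start_response):
        # Per-request work only: the environment above is already set up.
        body = ("%s, you requested %s" % (CONFIG["greeting"],
                environ.get("PATH_INFO", "/"))).encode("utf-8")
        start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8"),
                                  ("Content-Length", str(len(body)))])
        return [body]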
- Step 4: produce the response.
- Also unavoidable in general.
- This is usually much less work than steps 1–3 take under CGI.
- Can cache responses and avoid this where possible.
Reverse Proxying/Accelerating
- Remember that we already talked about HTTP caching.
- That can be used to cache at the client/proxy level.
- … but it's often not very effective at reducing server load.
- There are too many clients: the server still has to generate the page many times.
- Even with a lot of optimization work, apps can be slow.
- Dynamic languages (Python, Ruby, PHP, etc.) tend to be slower.
- Lots of work to render templates, do ORM, etc.
- Can involve big DB queries, lengthy calculations.
- Solution: don't run your code for every request.
- HTTP caching can help, but still must serve to each client/proxy once.
- Can do the caching within your own network
- … with an HTTP accelerator (or reverse proxy).
- Setup:
- The reverse proxy looks like a web server to the outside world.
- … but acts like a caching proxy: keeps an internal cache of content, and only asks the server for missing/stale pages.
- So, frequently-accessed pages (with an Expires header) can be served quickly without regenerating.
- e.g. Wikipedia: if not logged in, everybody gets the same content, but converting wikitext to HTML is expensive.
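- The application's side of the bargain is to mark its responses cacheable; a WSGI sketch (the 10-minute lifetime is an arbitrary example):

    import time
    from email.utils import formatdate

    def application(environ, start_response):
        body = b"<p>expensively generated page</p>"
        start_response("200 OK", [
            ("Content-Type", "text/html"),
            ("Cache-Control", "public, max-age=600"),
            # An HTTP-date 10 minutes from now: the accelerator can serve
            # this page until then without asking the application again.
            ("Expires", formatdate(time.time() + 600, usegmt=True)),
        ])
        return [body]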
- Ways to do this:
- Varnish: designed for exactly this use-case; very fast.
- Squid: originally designed as a caching proxy, but can also be used as a reverse proxy.
- Memcached + Nginx: fast cache + fast server. Can link the two so cache is checked first for each request.
- But: be careful of all the things that can make a cache less useful.
- especially responses that vary by cookie for logged-in users.
- inconsistent URLs
- etc.
- For truly dynamic content (i.e. every person sees a different page), this won't help.
- Best solution is probably memcached for page-pieces and function memoization. (see the sketch at the end of this section)
- Remember to think about cache invalidation (or cache coherence).
- If something changes in your DB but the old version stays cached for a long time, that's bad.
- Solution 1: don't cache for a long time.
- Solution 2: when the change occurs, tell your cache to forget the out-of-date stuff.
- #2 is better but trickier: you have to trigger the invalidation (probably in the model) and know about all the code that caches data depending on it.
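- A minimal sketch of both solutions, assuming a local memcached server and the third-party pymemcache client; render_sidebar is a hypothetical stand-in for some expensive rendering:

    from pymemcache.client.base import Client

    cache = Client(("localhost", 11211))

    def render_sidebar(user_id):
        # Stand-in for the real (slow) template rendering / DB queries.
        return "<div>sidebar for user %d</div>" % user_id

    def cached_sidebar(user_id):
        # Memoize one page-piece, keyed by the data it depends on.
        key = "sidebar:%d" % user_id
        cached = cache.get(key)
        if cached is not None:
            return cached.decode("utf-8")
        html = render_sidebar(user_id)
        cache.set(key, html, expire=300)  # solution 1: short expiry as a backstop
        return html

    def user_changed(user_id):
        # Solution 2: the model calls this when the user's data changes,
        # so the stale page-piece is forgotten immediately.
        cache.delete("sidebar:%d" % user_id)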
Multitier Architectures
- As sites grow in traffic, one server can't keep up with the load.
- First things: reduce server overhead; cache templates/calculations internally, and pages in an accelerator.
- Then: a faster server.
- But eventually, that reaches a limit.
- Can separate roles onto multiple machines:
- A multitier architecture.
- Specifically, a three-tier architecture (web server, application server, database).
- If load increases more, can start replicating roles onto more machines.
- Load balancing can be done by a reverse proxy/accelerator, or dedicated hardware. (see the sketch at the end of this section)
- DB replication or sharding can be expensive/slow/unreliable/etc. Choose your DB technology carefully.
- Multi-tier architectures provide redundancy as well.
- e.g. if one of the replicated web servers goes down, things keep running (slower).
- Important for a large-scale busy site.
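- To make the load-balancing step concrete, a toy round-robin sketch (the backend addresses are made up):

    import itertools

    # Hypothetical pool of replicated web servers behind the balancer.
    _pool = itertools.cycle(["10.0.0.11:8000", "10.0.0.12:8000", "10.0.0.13:8000"])

    def pick_backend():
        # Round-robin: each incoming request goes to the next server in turn.
        # A real balancer also skips servers that fail health checks, which
        # is where the redundancy mentioned above comes from.
        return next(_pool)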
Distributed Web Architecture
- Managing traditional multi-tier architectures can be difficult.
- Can be hard to bring an extra server online to meet increased demand.
- Might involve significant re-configuration.
- Requires actually buying a server.
- Many companies offer cloud services. The idea:
- They run your app on their server(s).
- They have a huge collection of servers: you can use more when you need them.
- You pay for only the bandwidth/processor/storage you use.
- e.g. Google App Engine, Amazon EC2/S3, Windows Azure
- Great for a small site that you think might grow.
- Costs little/nothing while developing.
- Can handle growth.
- No up-front server costs.
- Software as a Service (SaaS)
- Cloud offerings at the application level.
- e.g. Gmail handling a company's email; Github for source management.
- Platform as a Service (PaaS)
- Company provides a development environment/framework you can use.
- Code written in that framework can run on their servers.
- Functionality is usually limited: e.g. no files can be written, but file-like data can be stored elsewhere; you must use their NoSQL database.
- So, you must code to their framework: not all code will run there; that code probably won't run elsewhere.
- e.g. Google App Engine, Windows Azure
- Infrastructure as a Service (IaaS)
- You get virtual machines (with certain CPU/memory limits).
- Can be configured as you like: no restrictions on programs you run.
- Typically the VM itself is considered temporary: can only write temporary files. Permanent storage happens on a fileserver.
- New VMs can be provisioned as you need them.
- e.g. Amazon EC2, Rackspace Cloud
- But…
- Consider privacy laws: who is storing the data? Who can see it?
- What if the company goes out of business? (especially for SaaS, PaaS)
- There are some “private cloud” solutions that let you provide a cloud-like architecture within a company.
Content Delivery Networks
- Many sites have a lot of static media to serve.
- e.g. Flickr has small HTML files to generate dynamically, but lots of static images.
- The large static files are often the slow part.
- Can be too much data to serve from one place.
- Since they are static data, they are easy to distribute across multiple locations.
- e.g. keep the “official” copy of the files in the main data centre in California, but copy them to servers in Germany on demand.
- This can also be done with dynamic content, but it will be trickier.
- Will require the app/code to handle distributed data storage.
- … and data coherence problems that come with it.
- But if you want low ping times to everybody, it's the only option.
- Once this is set up, you can serve content from the closest site.
- e.g. a Spanish visitor might get the files from the German servers.
- … but visitors shouldn't have to figure this out manually.
- In either case, you have a content delivery network (CDN).
- The trick is done in DNS.
- When a DNS request comes in, the nameserver gives back the IP address of a server “near” the client.
- Then the client will contact that server for the content.
- Minimizes response time.
- Compare:
nslookup google.com
nslookup google.com 8.8.8.8   # a nameserver in California
- But requires a distributed collection of coordinated servers.
- Several companies would be happy to sell you space on theirs.
- … or just get big enough to buy your own.
MapReduce
- If your site (or any other computational problem) gets really big, you're going to need a cluster of servers.
- i.e. multiple computers connected by a fast network, all coordinated to work on a task.
- “big” could be computation, storage, or both.
- For web stuff, the map-reduce paradigm seems to dominate.
- The term originated at Google, but it's basically just “divide and conquer” on a massive scale.
- The idea: each node in the cluster has some of the data. Each node processes its portion of the data, and returns the results to be combined.
- e.g. Google has its search index split across hundreds of servers. To handle a search, each server checks its part of the index, and those results are combined/filtered to get the top results.
- Computation is split into parts. (e.g. search data for occurrences of some value.)
- Split: chop input into appropriately-sized chunks. (e.g. give each of the n nodes 1/n of the data.)
- Map: do the work. (e.g. search your part of the set and collect locations that match.)
- Reduce: Combine the results from the chunks. (e.g. join the sets of matches.)
- Hadoop is an open source map-reduce architecture.
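- To make split/map/reduce concrete: a toy, single-machine sketch that uses a process pool as the “cluster” (Hadoop and friends distribute the same pattern across many machines; the search term is hard-coded for simplicity):

    from multiprocessing import Pool

    def find_matches(offset_and_lines):
        # Map: each worker searches its own chunk for the target value.
        offset, lines = offset_and_lines
        return {offset + i for i, line in enumerate(lines) if "needle" in line}

    def search(lines, n_workers=4):
        # Split: give each of the n workers roughly 1/n of the data.
        size = max(1, len(lines) // n_workers)
        chunks = [(i, lines[i:i + size]) for i in range(0, len(lines), size)]
        with Pool(n_workers) as pool:
            partial_results = pool.map(find_matches, chunks)
        # Reduce: join the per-chunk sets of matching line numbers.
        return set().union(*partial_results)

    if __name__ == "__main__":
        data = ["hay", "a needle here", "more hay", "needle again"]
        print(sorted(search(data)))   # -> [1, 3]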