Library/Framework Support
- The big lesson of the technology evaluation: you don't have to code everything from scratch.
- There are many common tasks in web development: user authentication and sessions, HTML generation, processing form submissions, etc.
- Modern frameworks take care of many of these for you.
- … and there are third-party libraries that do a lot more.
- Before you tackle any big problem, first check whether someone else has already solved it.
- Frameworks also promote MVC separation.
- … a good habit to get into.
- Don't fight it: keep your logic separated and maintainable.
- But the frameworks are big and take time to learn.
- There are many new functions/modules to learn, and different workflows.
- But that up-front time is repaid quickly: development should speed up soon after.
Architecture and Speed
- The baseline for our comparison: CGI.
- To generate a response, the web server sets some environment variables and runs an external program.
- For every request, the server must:
- Load the interpreter and start the process.
- Load the code and parse/compile it.
- Do any startup work: load config files, open DB connection, etc.
- Produce the response.
- If we can avoid some of these steps, the server can handle requests faster.
- Of course, the goal is to do as little work as possible for each request.
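- For a concrete picture, here is a minimal Python CGI sketch (an illustration, using only the standard library): all four steps above happen on every single request.

    #!/usr/bin/env python3
    # Minimal CGI program. The web server sets environment variables
    # (REQUEST_METHOD, QUERY_STRING, ...) and runs this script once per
    # request: interpreter startup, parsing this code, and any setup work
    # are all repeated every time.
    import os

    print("Content-Type: text/plain")
    print()  # a blank line ends the CGI response headers
    print("Method:", os.environ.get("REQUEST_METHOD", ""))
    print("Query: ", os.environ.get("QUERY_STRING", ""))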
- Step 1: start the interpreter. Solution: keep it running.
- If we incorporate the interpreter into the server, it will already be there.
- PHP was designed as an Apache module.
- mod_perl, mod_python, and mod_wsgi integrate their interpreters as Apache modules.
- JSP and Servlets are incorporated into Tomcat.
- ASP and CLR are integrated into IIS.
- Or could start a separate process but keep it running.
- … server communicates with that process through a socket (or similar).
- Better for shared hosting: processes can have different owners, and many languages can be supported.
- e.g. FastCGI, SCGI, mod_passenger, mod_wsgi.
- Reverse-proxying solutions can be thought of this way too. (covered later)
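- A toy sketch of the long-running-process idea (an ad-hoc protocol over a socket, not real FastCGI): the worker below starts once and then answers any number of requests.

    import socketserver

    class WorkerHandler(socketserver.StreamRequestHandler):
        def handle(self):
            # One front-end request arrives per connection; the interpreter,
            # the compiled code, and any setup stay resident between requests.
            path = self.rfile.readline().strip()
            self.wfile.write(b"generated response for " + path + b"\n")

    if __name__ == "__main__":
        # Started once; the front-end web server connects to this socket
        # for every request instead of launching a new process each time.
        with socketserver.TCPServer(("127.0.0.1", 9000), WorkerHandler) as server:
            server.serve_forever()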
- Step 2: load and compile the code.
- Unavoidable, but can cache compiled version of recently-run code.
- e.g. PHP accelerators
- Or keep the code running.
- e.g. FastCGI, Servlets
- It can be hard to tell which is happening: mod_wsgi, mod_passenger, etc. all do one of these, but it's not clear (to me) which.
- Step 3: startup work.
- Opening DB connections and reading config files can be a lot of overhead for a single request.
- Initializing a framework can be a huge amount of work.
- Usual solution: structure your code so that startup and page generation work are in separate functions/methods.
- First request: do both, but keep environment set up (i.e. keep the process running).
- Later requests: run page generation code only.
- e.g. Java Servlets are very explicit:
public class FooServlet extends HttpServlet {
    public void init(…) {…}      // startup work: run once, when the servlet loads
    public void doGet(…) {…}     // page generation: run for each GET request
    public void destroy() {…}    // cleanup when the servlet is unloaded
}
- mod_wsgi, mod_passenger, FastCGI, etc. allow similar separation.
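- e.g. a WSGI application (as run by mod_wsgi and friends) gets a similar separation almost for free; a minimal sketch:

    # Module-level code runs once, when the server process loads the module.
    CONFIG = {"greeting": "Hello"}  # in real code: parse config files, open DB connections

    def application(environ, start_response):
        # Per-request work only: the environment above is already set up.
        body = ("%s, you requested %s" % (CONFIG["greeting"],
                environ.get("PATH_INFO", "/"))).encode("utf-8")
        start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8"),
                                  ("Content-Length", str(len(body)))])
        return [body]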
- Step 4: produce the response.
- Also unavoidable in general.
- This is usually much less work than steps 1–3 take under CGI.
- Can cache responses and avoid this where possible.
Reverse Proxying/Accelerating
- Remember that we already talked about HTTP caching.
- That can be used to cache at the client/proxy level.
- … but it's often not very effective at reducing server load.
- There are too many clients: the server still has to generate the page many times.
- Even with a lot of optimization work, apps can be slow.
- Dynamic languages (Python, Ruby, PHP, etc.) tend to be slower.
- Lots of work to render templates, do ORM, etc.
- Can involve big DB queries, lengthy calculations.
- Solution: don't run your code for every request.
- HTTP caching can help, but still must serve to each client/proxy once.
- Can do the caching within your own network
- … with an HTTP accelerator (or reverse proxy).
- Setup:
- The reverse proxy looks like a web server to the outside world.
- … but acts like a caching proxy: keeps an internal cache of content, and only asks the server for missing/stale pages.
- So, frequently-accessed pages (with an Expires header) can be served quickly without regenerating.
- e.g. Wikipedia: if not logged in, everybody gets the same content, but converting wikitext to HTML is expensive.
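- The application's side of the bargain is to mark its responses cacheable; a WSGI sketch (the 10-minute lifetime is an arbitrary example):

    import time
    from email.utils import formatdate

    def application(environ, start_response):
        body = b"<p>expensively generated page</p>"
        start_response("200 OK", [
            ("Content-Type", "text/html"),
            ("Cache-Control", "public, max-age=600"),
            # An HTTP-date 10 minutes from now: the accelerator can serve
            # this page until then without asking the application again.
            ("Expires", formatdate(time.time() + 600, usegmt=True)),
        ])
        return [body]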
- Ways to do this:
- Varnish: designed for exactly this use-case; very fast.
- Squid: originally designed as a caching proxy, but can also be used as a reverse proxy.
- Memcached + Nginx: fast cache + fast server. Can link the two so cache is checked first for each request.
- But: be careful of all the things that can make a cache less useful.
- especially responses that vary by cookie for logged-in users.
- inconsistent URLs
- etc.
- For truly dynamic content (i.e. every person sees a different page), this won't help.
- Best solution is probably memcached for page-pieces and function memoization. (see the sketch at the end of this section)
- Remember to think about cache invalidation (or cache coherence).
- If something changes in your DB but the old version stays cached for a long time, that's bad.
- Solution 1: don't cache for a long time.
- Solution 2: when the change occurs, tell your cache to forget the out-of-date stuff.
- #2 is better but trickier: you have to trigger the invalidation (probably in the model) and know about all the code that caches data depending on it.
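- A minimal sketch of both solutions, assuming a local memcached server and the third-party pymemcache client; render_sidebar is a hypothetical stand-in for some expensive rendering:

    from pymemcache.client.base import Client

    cache = Client(("localhost", 11211))

    def render_sidebar(user_id):
        # Stand-in for the real (slow) template rendering / DB queries.
        return "<div>sidebar for user %d</div>" % user_id

    def cached_sidebar(user_id):
        # Memoize one page-piece, keyed by the data it depends on.
        key = "sidebar:%d" % user_id
        cached = cache.get(key)
        if cached is not None:
            return cached.decode("utf-8")
        html = render_sidebar(user_id)
        cache.set(key, html, expire=300)  # solution 1: short expiry as a backstop
        return html

    def user_changed(user_id):
        # Solution 2: the model calls this when the user's data changes,
        # so the stale page-piece is forgotten immediately.
        cache.delete("sidebar:%d" % user_id)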
Multitier Architectures
- As sites grow in traffic, one server can't keep up with the load.
- First things: reduce server overhead; cache templates/calculations internally, and pages in an accelerator.
- Then: a faster server.
- But eventually, that reaches a limit.
- Can separate roles onto multiple machines:
- A multitier architecture.
- Specifically, a three-tier architecture (web server, application server, database).
- If load increases more, can start replicating roles onto more machines.
- Load balancing can be done by a reverse proxy/accelerator, or dedicated hardware. (see the sketch at the end of this section)
- DB replication or sharding can be expensive/slow/unreliable/etc. Choose your DB technology carefully.
- Multi-tier architectures provide redundancy as well.
- e.g. if one of the replicated web servers goes down, things keep running (slower).
- Important for a large-scale busy site.
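- To make the load-balancing step concrete, a toy round-robin sketch (the backend addresses are made up):

    import itertools

    # Hypothetical pool of replicated web servers behind the balancer.
    _pool = itertools.cycle(["10.0.0.11:8000", "10.0.0.12:8000", "10.0.0.13:8000"])

    def pick_backend():
        # Round-robin: each incoming request goes to the next server in turn.
        # A real balancer also skips servers that fail health checks, which
        # is where the redundancy mentioned above comes from.
        return next(_pool)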
Distributed Web Architecture
- Managing traditional multi-tier architectures can be difficult.
- Can be hard to bring an extra server online to meet increased demand.
- Might involve significant re-configuration.
- Requires actually buying a server.
- Many companies offer cloud services. The idea:
- They run your app on their server(s).
- They have a huge collection of servers: you can use more when you need them.
- You pay for only the bandwidth/processor/storage you use.
- e.g. Google App Engine, Amazon EC2/S3, Windows Azure
- Great for a small site that you think might grow.
- Costs little/nothing while developing.
- Can handle growth.
- No up-front server costs.
- Software as a Service (SaaS)
- Cloud offerings at the application level.
- e.g. Gmail handling a company's email; Github for source management.
- Platform as a Service (PaaS)
- Company provides a development environment/framework you can use.
- Code written in that framework can run on their servers.
- Functionality is usually limited: e.g. no files can be written, but file-like data can be stored elsewhere; you must use their NoSQL database.
- So, you must code to their framework: not all code will run there; that code probably won't run elsewhere.
- e.g. Google App Engine, Windows Azure
- Infrastructure as a Service (IaaS)
- You get virtual machines (with certain CPU/memory limits).
- Can be configured as you like: no restrictions on programs you run.
- Typically the VM itself is considered temporary: can only write temporary files. Permanent storage happens on a fileserver.
- New VMs can be provisioned as you need them.
- e.g. Amazon EC2, Rackspace Cloud
- But…
- Consider privacy laws: who is storing the data? Who can see it?
- What if the company goes out of business? (especially for SaaS, PaaS)
- There are some “private cloud” solutions that let you provide a cloud-like architecture within a company.
Content Delivery Networks
- Many sites have a lot of static media to serve.
- e.g. Flickr has small HTML files to generate dynamically, but lots of static images.
- The large static files are often the slow part.
- Can be too much data to serve from one place.
- Since they are static data, they are easy to distribute across multiple locations.
- e.g. keep the “official” copy of the files in the main data centre in California, but copy them to servers in Germany on demand.
- This can also be done with dynamic content, but it will be trickier.
- Will require the app/code to handle distributed data storage.
- … and data coherence problems that come with it.
- But if you want low ping times to everybody, it's the only option.
- Once this is set up, you can serve content from the closest site.
- e.g. a Spanish visitor might get the files from the German servers.
- … but visitors shouldn't have to figure this out manually.
- In either case, you have a content delivery network (CDN).
- The trick is done in DNS.
- When a DNS request comes in, the nameserver gives back the IP address of a server “near” the client.
- Then the client will contact that server for the content.
- Minimizes response time.
- Compare:
nslookup google.com
nslookup google.com 8.8.8.8   # a nameserver in California
- But requires a distributed collection of coordinated servers.
- Several companies would be happy to sell you space on theirs.
- … or just get big enough to buy your own.
MapReduce
- If your site (or any other computational problem) gets really big, you're going to need a cluster of servers.
- i.e. multiple computers connected by a fast network, all coordinated to work on a task.
- “big” could be computation, storage, or both.
- For web stuff, the map-reduce paradigm seems to dominate.
- The term originated at Google, but it's basically just “divide and conquer” on a massive scale.
- The idea: each node in the cluster has some of the data. Each node processes its portion of the data, and returns the results to be combined.
- e.g. Google has its search index split across hundreds of servers. To handle a search, each server checks its part of the index, and those results are combined/filtered to get the top results.
- Computation is split into parts. (e.g. search data for occurrences of some value.)
- Split: chop input into appropriately-sized chunks. (e.g. give each of the n nodes 1/n of the data.)
- Map: do the work. (e.g. search your part of the set and collect locations that match.)
- Reduce: Combine the results from the chunks. (e.g. join the sets of matches.)
- Hadoop is an open source map-reduce architecture.
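- To make split/map/reduce concrete: a toy, single-machine sketch that uses a process pool as the “cluster” (Hadoop and friends distribute the same pattern across many machines; the search term is hard-coded for simplicity):

    from multiprocessing import Pool

    def find_matches(offset_and_lines):
        # Map: each worker searches its own chunk for the target value.
        offset, lines = offset_and_lines
        return {offset + i for i, line in enumerate(lines) if "needle" in line}

    def search(lines, n_workers=4):
        # Split: give each of the n workers roughly 1/n of the data.
        size = max(1, len(lines) // n_workers)
        chunks = [(i, lines[i:i + size]) for i in range(0, len(lines), size)]
        with Pool(n_workers) as pool:
            partial_results = pool.map(find_matches, chunks)
        # Reduce: join the per-chunk sets of matching line numbers.
        return set().union(*partial_results)

    if __name__ == "__main__":
        data = ["hay", "a needle here", "more hay", "needle again"]
        print(sorted(search(data)))   # -> [1, 3]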