First: let's go back to HTTP and talk about Content Negotiation.
The lang attribute
- The first thing you should do on every page, whether you're translating or not: say what language you're using.
- The lang attribute can be put on any tag in HTML (or xml:lang in XML/XHTML):
<html lang="en">
⋮
<p>Confucius (<span lang="zh">孔子</span>) once said</p>
<blockquote lang="zh">學而時習之,不亦悅乎? <cite lang="en">Analects</cite></blockquote>
⋮
- Why bother? It helps various automated tools.
- Google categorization by language (to get your page to the right readers).
- Chrome's “would you like to translate” offer.
- Correct spelling/grammar checkers for input boxes. (Browsers probably don't do that now, but could.)
- Better typography (hyphenation, ligatures, spacing).
- Speech synthesizers. (<span lang="en">porter</span> is pronounced differently than <span lang="fr">porter</span>. So are <span lang="en">he</span> and <span lang="zh" class="pinyin">he</span>.)
- If you need to (or can) be more specific, there are subcodes for dialect or national variations.
- For example:
<span lang="zh">我吃饭</span>
<span lang="yue">我食饭</span> <!-- probably best -->
or <span lang="zh-yue">我食饭</span>
or <span lang="zh-HK">我食饭</span>
<span lang="zh-Hans">电脑</span>
<span lang="zh-Hant">電腦</span>
<span lang="en">How are you?</span>
<span lang="en-AU">How ya goin', mate?</span>
<span lang="en-CA">How ya doin', eh?</span>
<span lang="fr">Pourquoi?</span>
<span lang="fr-CA">À cause?</span>
- I would probably only bother with these if it was really important.
- For example, on a page about regional language differences.
- Every page should at least have lang on the <html> element.
- Include that as part of your base template.
- And of course fill in the right value as necessary.
Translating Web Apps
- Internationalization and localization (or i18n and L10n) refer to the process of making software adaptable to other languages and cultures.
- It's not exactly difficult, but does take some discipline when developing.
- It's probably easier to do this from the start, rather than trying to retrofit it into existing code later.
- Let's start with language, and get to other changes later.
- The hard part is catching every piece of user-visible language in your app.
- You suddenly can't just show the user a string ("hello"); you have to show them the appropriately translated version of the string.
- Most of your language will come from templates.
- But it also might be generated in code, be pulled from the database, etc.
- You also have to know what language a user wants to read.
- The HTTP Accept-Language header is certainly the best guess: it comes from the user's language preferences (which default to the OS install language).
- Please don't use country to infer the language users want: the header is much better.
- You should always have an easily-settable preference so the user can override it.
- e.g. I'm in China but want English. Google makes it really hard to get it sometimes, and the preference doesn't stick, even though I'm logged in. Don't do that.
- Frameworks will provide some support for this.
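The language-selection rules above can be sketched in a few lines. This is a minimal illustration, not what any particular framework does: the function names, the `supported` set, and the `override` parameter (standing in for the user's saved preference) are all assumptions made for the example, and the parsing handles only the common `tag;q=weight` form of the header.

```python
def parse_accept_language(header):
    """Return language tags from an Accept-Language header, best first."""
    choices = []
    for position, part in enumerate(header.split(",")):
        pieces = part.strip().split(";")
        tag = pieces[0].strip().lower()
        if not tag:
            continue
        quality = 1.0  # a missing q parameter means q=1
        for param in pieces[1:]:
            name, _, value = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # treat a malformed weight as "not wanted"
        if quality > 0:
            # Sort by descending quality, then by position in the header.
            choices.append((-quality, position, tag))
    return [tag for _, _, tag in sorted(choices)]


def choose_language(header, supported, default="en", override=None):
    """Pick a supported language: saved preference first, then the header."""
    if override in supported:
        return override
    for tag in parse_accept_language(header):
        if tag in supported:
            return tag
        primary = tag.split("-")[0]  # fall back from "fr-ca" to "fr"
        if primary in supported:
            return primary
    return default
```

For example, `choose_language("zh-CN,zh;q=0.9,en;q=0.8", {"en", "fr"})` falls through to `"en"`, while passing `override="fr"` returns `"fr"` regardless of the header. The saved preference wins because the header is only a guess.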
- Frameworks provide support for translations.
- There are basically two places to worry about: your code itself and your templates.
- In your code, you can wrap each user-visible string in a function call that gets the appropriate translation.
- In Django: (from the Django translation docs)
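from django.http import HttpResponse
from django.utils.translation import gettext as _  # ugettext was removed in Django 4.0

def my_view(request):
    # _() looks up the translation for the currently active language
    output = _("Welcome to MySite")
    return HttpResponse(output)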
- In Rails: (from the Rails i18n docs)
class HomeController < ApplicationController
  def index
    flash[:notice] = t(:welcome_to_mysite)
  end
end
- In templates, you wrap the string in a template tag that does the same thing.
- In Django:
{% load i18n %}
⋮
<title>{% trans "Welcome to MySite" %}</title>
- In Rails:
<title><%= t :welcome_to_mysite %></title>
- Translations probably provide even more of a reason to keep the model/view/controller separation clean.
- It's very easy to be lazy and let some user-facing content leak into your model code.
- e.g.
class SomeModel(models.Model):
    ⋮
    def property_description(self):
        if self.property > 10:
            return "best"
        elif self.property > 5:
            return "okay"
        else:
            return "bad"
- Eventually, somebody's going to use that in a template:
Current Status: {{ m.property_description }}
- Now you have three strings in the model that have to be translated.
- Will anybody ever find them? Shouldn't the strings come from a template anyway?
- … or at least somewhere other than the model?
- The translations themselves are usually contained in separate files that are basically dictionaries: identifier → translated string.
- The format depends on the tools, and there might be a compilation step to build something for faster lookups.
- The English translation file might contain something like this:
msgid "welcome_to_mysite"
msgstr "Welcome to MySite"
- And the Spanish file:
msgid "welcome_to_mysite"
msgstr "Bienvenido a MySite"
- The tools probably also have something to help you find missing translation strings.
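To make the "basically dictionaries" idea concrete, here is a small sketch of catalog lookup with a fallback, plus a check for missing translations. The in-memory `CATALOGS` dictionary, the `sign_out` identifier, and both function names are hypothetical; real tools load catalogs from files and provide their own equivalents.

```python
# Hypothetical in-memory catalogs; real tools load these from files.
CATALOGS = {
    "en": {"welcome_to_mysite": "Welcome to MySite", "sign_out": "Sign out"},
    "es": {"welcome_to_mysite": "Bienvenido a MySite"},
}


def translate(msgid, language, fallback="en"):
    """Look up msgid, falling back to another language, then to the id itself."""
    for lang in (language, fallback):
        text = CATALOGS.get(lang, {}).get(msgid)
        if text is not None:
            return text
    return msgid


def missing_translations(language, reference="en"):
    """List identifiers present in the reference catalog but not in this one."""
    return sorted(set(CATALOGS[reference]) - set(CATALOGS.get(language, {})))
```

Here `translate("sign_out", "es")` falls back to the English string, and `missing_translations("es")` reports `sign_out` as still needing a translator.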
- There are lots of ways this becomes difficult.
- In English, nouns need to be pluralized: “1 apple” “2 apples”
- In Chinese they don't, but you do need a measure word: “一个苹果” “两个苹果”
- Other languages have more categories, like “numbers ending with 1” being a separate category, or two being another special case (like one in English).
- Often you can work around the whole problem: “Number of Apples: 4”.
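A rough sketch of why plural handling needs per-language rules, not just a single if/else. The function and table names are made up for the example. Russian is my own choice of a language with a "numbers ending in 1" category, and the Chinese strings use digits where the examples above use Chinese numerals. Translation frameworks ship these rules, so treat this only as an illustration.

```python
def plural_index(language, n):
    """Return which plural form to use for n in the given language."""
    if language == "zh":
        return 0  # one form for every count
    if language == "ru":
        if n % 10 == 1 and n % 100 != 11:
            return 0  # 1, 21, 31, ... but not 11
        if 2 <= n % 10 <= 4 and not 12 <= n % 100 <= 14:
            return 1  # 2-4, 22-24, ... but not 12-14
        return 2      # everything else
    return 0 if n == 1 else 1  # English-style: singular or plural


# Illustrative message table: one entry per plural form, per language.
APPLES = {
    "en": ["{n} apple", "{n} apples"],
    "zh": ["{n}个苹果"],
    "ru": ["{n} яблоко", "{n} яблока", "{n} яблок"],
}


def count_apples(language, n):
    forms = APPLES[language]
    return forms[plural_index(language, n)].format(n=n)
```

With these rules, 21 takes the same Russian form as 1, while 11 takes the same form as 5. An English-style `n == 1` test cannot express that.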
- Changing all of the strings in your app might throw off your layout.
- … especially if your sizing is very rigid.
- The button that you size for <button lang="en">Stop</button> is going to be too big for <button lang="zh">停止</button> and too small for <button lang="de">Abbrechen</button>.
- Be as flexible as possible in your CSS.
- [Automated translation by Google often breaks very rigid layout too.]
- This is another reason to use Unicode and UTF-8 everywhere: every language will work if you start translating.
- Demo: translation with Django. (notes)
Localization
- In addition to languages, there are some other differences that you need to worry about.
- Date formats: “March 1, 2002”, “03/01/02”, “01/03/2002”, “2002-03-01”, …
- You could try localizing.
- Or just always use ISO 8601 format: 2002-03-01.
- Number formats: 1234567.89, 1,234,567.89, 1.234.567,89.
- Currency symbol and placement: $100, USD100, CAD100, ¥100, CNY100, 100₲.
- Using an ambiguous currency symbol is annoying: is “$100” in American or Canadian or Hong Kong dollars? Is “¥100” in Yen or Renminbi?
- Suggestion: always include the ISO 4217 currency code (CAD, USD, HKD, JPY, CNY, …) somewhere.
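The date, number, and currency suggestions above can be sketched with the standard library alone. The helper names are invented for this example, and the separator swap is a deliberately simple stand-in for the locale-aware formatting a framework would provide.

```python
from datetime import date


def iso_date(d):
    """ISO 8601 calendar date, e.g. 2002-03-01."""
    return d.isoformat()


def format_number(value, thousands=",", decimal="."):
    """Group digits in threes using the given separators."""
    text = f"{value:,.2f}"  # always produces the 1,234,567.89 style
    # Swap separators through a placeholder so they do not clobber each other.
    return (text.replace(",", "\x00")
                .replace(".", decimal)
                .replace("\x00", thousands))


def format_money(amount, currency_code):
    """Show the amount with an explicit ISO 4217 code instead of a bare symbol."""
    return f"{currency_code} {format_number(amount)}"
```

For instance, `format_number(1234567.89, ".", ",")` gives the `1.234.567,89` style, and `format_money(100, "CAD")` makes it clear which dollar is meant.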
- String comparison, sorting, collation.
- Should “apple” sort before or after “Orange”? In most programming languages the default comparison gives "apple" > "Orange" because lowercase letters come after uppercase in code point order.
- How do “苹果” and “橘子” sort?
- What about “苹果” and “apple”?
- In Java, a java.text.Collator object does locale-aware comparisons on strings.
- The Unicode strings "\u00e9" and "\u0065\u0301" should be equal: they are both “é”.
- Unicode normal forms will help with that.
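A short Python sketch of the two points above: default string comparison follows code point order, and normalization makes the two spellings of “é” compare equal. The helper names are invented here, and the case-folding sort key is only a rough stand-in for real locale-aware collation (it does nothing useful for ordering Chinese text, for example).

```python
import unicodedata


def same_text(a, b):
    """Compare strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def simple_sort_key(s):
    """Case-insensitive key; a rough stand-in for real locale collation."""
    return unicodedata.normalize("NFC", s).casefold()


words = ["apple", "Orange", "banana"]
by_code_point = sorted(words)                     # uppercase sorts first
by_folded_case = sorted(words, key=simple_sort_key)

raw_equal = "\u00e9" == "\u0065\u0301"            # different code point sequences
normalized_equal = same_text("\u00e9", "\u0065\u0301")
```

Here `by_code_point` puts “Orange” first, `by_folded_case` puts it last, `raw_equal` is `False`, and `normalized_equal` is `True`.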
- Names.
- In the west: given name before family name. (e.g. “John Smith” has relatives with the name “Smith”)
- Most of Asia: family name before given name. (e.g. “王伟” has relatives with the name “王”)
- Others: single-word names (e.g. in Indonesia “Gema” is a perfectly complete legal name), “family name” isn't inherited the same way, …
- If you ask for “first name” and “last name”, it's probably a useless question.
- Best advice: just ask for “full name” and “what should we call you?” and remember you're getting Unicode, not ASCII.
- There are a lot of small things you can easily get wrong.
- There's a character called the “en dash” that is supposed to be used for number ranges: 1&ndash;10 becomes “1–10”. That probably isn't right in Chinese: “一–十” looks wrong and is more commonly “一~十” or maybe “1~10”.
- Quotes vary by language: “hello” «bonjour» „hallo“.
- Using red text to display someone's name: bad for Chinese users.
- Probably a thousand other things I don't know about.
- Realistically, you probably aren't going to get all of this perfect: there is too much localized detail, and too many locales.
- Doing your best is better than doing nothing.
- Most of your users are probably going to be understanding.
- Again, frameworks can help with some of this (like number and date formatting).
Timezones
- Remember when talking about Unicode and encodings, I (and Joel) said “there's no such thing as plain text”?
- There's no such thing as “time”.
- Knowing that an event will happen at “2014-03-01 12:34:56” is useless.
- … unless you know that time refers to Vancouver (Pacific) time.
- You need to keep track of time zones.
- Time zones change.
- Vancouver isn't “UTC minus 8 hours”. Sometimes it's minus 7 hours, with complicated rules about when the change occurs.
- January 1 1980 at 00:00 was a different time in Shanghai and Harbin. It was the same time in 1981.
- Storing “this user is UTC+08:00” doesn't make sense.
- Storing “this user is in the America/Toronto time zone” does.
- Suggestion #1: treat time zones as a formatting issue.
- Store all times in UTC.
- Display them in the user's preferred time zone.
- Good for time stamps on comments, etc.
- Suggestion #2: store time and timezone.
- For every date+time you have, also store a time zone reference.
- e.g. “2014-03-01 12:34:56” and “America/Vancouver”.
- Good for future meeting times, etc.
- In either case, let a time zone library take care of the details.
- You don't want to update your code every time some government changes their daylight saving time rules, or some city decides to switch to the neighbouring state's time zone.
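Both suggestions can be sketched with Python's standard `zoneinfo` module (Python 3.9+), assuming the system has a time zone database installed. The helper names are made up for the example; the point is that the library, not your code, knows when Vancouver is 8 hours behind UTC and when it is 7.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # standard library since Python 3.9

VANCOUVER = ZoneInfo("America/Vancouver")


def display_in_zone(utc_moment, zone):
    """Suggestion #1: store UTC, convert only for display."""
    return utc_moment.astimezone(zone)


def local_event(year, month, day, hour, minute, zone_name):
    """Suggestion #2: keep the wall-clock time together with the zone name."""
    return datetime(year, month, day, hour, minute, tzinfo=ZoneInfo(zone_name))


# A comment timestamp stored in UTC, shown to a user in Vancouver.
stored = datetime(2014, 3, 1, 20, 34, 56, tzinfo=timezone.utc)
shown = display_in_zone(stored, VANCOUVER)

# The same named zone has different offsets in winter and summer.
winter = local_event(2014, 1, 15, 12, 0, "America/Vancouver")
summer = local_event(2014, 7, 15, 12, 0, "America/Vancouver")
```

`shown` is 12:34:56 local time on 2014-03-01, `winter.utcoffset()` is minus 8 hours, and `summer.utcoffset()` is minus 7 hours, with no offset arithmetic in the application code.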
- Guessing the time zone for a user is another tricky thing.
- With JavaScript you can call getTimezoneOffset() on a Date object to get the user's current offset from UTC.
- But remember that can change (for daylight saving time, or time zone changes).
- You could do a GeoIP lookup on their IP address, guess where they are on the planet, and guess their time zone from that.
- You're going to be wrong sometimes. Always let them override your guess by setting a preference.
International Hosting
- If you're targeting users across the world, you have a problem: it's a big planet.
- It takes a noticeable amount of time to get packets around the world. (Diameter of earth) ÷ (speed of light) = 43 ms.
- Typical ping time is >100 ms and can be >200 ms.
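The arithmetic behind those numbers, as a quick sketch. The Earth's mean diameter and the speed of light are standard figures. The assumption that signals in optical fibre travel at roughly two thirds of that speed is mine, added to show why real ping times are so much higher than the straight-line bound.

```python
import math

EARTH_DIAMETER_KM = 12_742          # mean diameter
SPEED_OF_LIGHT_KM_S = 299_792.458   # in a vacuum
FIBRE_FRACTION = 2 / 3              # assumption: light in glass fibre is roughly 2/3 c


def one_way_ms(distance_km, speed_km_s=SPEED_OF_LIGHT_KM_S):
    """Travel time in milliseconds for a signal covering distance_km."""
    return distance_km / speed_km_s * 1000


# Straight through the planet at the speed of light: the figure quoted above.
through_earth = one_way_ms(EARTH_DIAMETER_KM)

# More realistic: along the surface to the far side of the planet, in fibre, and back.
half_way_round_km = math.pi * EARTH_DIAMETER_KM / 2
fibre_round_trip = 2 * one_way_ms(half_way_round_km,
                                  SPEED_OF_LIGHT_KM_S * FIBRE_FRACTION)
```

`through_earth` comes out at about 42.5 ms, and `fibre_round_trip` at roughly 200 ms, before counting any routing or processing delay.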
- If some of your users are on the wrong side of the firewall, you have to worry about outages, DNS failure.
- Some of the network links across the globe are heavily used. Nearby users will get faster responses.
- The solution is obvious: serve your site from multiple locations and make sure users visit the closest one.
- The first part is easy: set up servers in multiple locations.
- Getting your users to the best server is a little harder. [We'll talk more about Content Distribution Networks later.]
- If you're going to host from multiple locations, you have to store your data in multiple locations.
- That means keeping somewhat-consistent data on multiple continents.
- That's never going to be easy or perfect.
- But some of the NoSQL databases can help.
- Two of the features many NoSQL databases support are very helpful here: (1) redundant data replicated on multiple servers, (2) imperfect consistency between nodes.
- Those together mean we can replicate between continents but still have fast read/writes at all servers.
- The cost: we have to remember that our database might always have slightly old data.
- That might not be acceptable for some applications: auction site, anything with money.
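A toy model of the trade-off just described: writes and reads touch only the local replica, so they are fast, but another continent may see old data until replication catches up. The `Replica` class is purely illustrative. It ignores conflict resolution entirely and is not how any of the databases below are implemented.

```python
class Replica:
    """A toy key-value store that replicates asynchronously (illustration only)."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.outbox = []   # writes not yet sent to peers
        self.peers = []

    def write(self, key, value):
        self.data[key] = value              # fast: only touches the local copy
        self.outbox.append((key, value))

    def read(self, key):
        return self.data.get(key)           # fast, but possibly stale

    def replicate(self):
        """Push pending writes to peers; in reality this crosses an ocean."""
        for key, value in self.outbox:
            for peer in self.peers:
                peer.data[key] = value
        self.outbox.clear()


vancouver, shanghai = Replica("vancouver"), Replica("shanghai")
vancouver.peers, shanghai.peers = [shanghai], [vancouver]

vancouver.write("bid", 100)
stale = shanghai.read("bid")    # the write has not arrived yet
vancouver.replicate()
fresh = shanghai.read("bid")    # visible once replication catches up
```

Between the write and the `replicate()` call, Shanghai still sees no bid at all. That window is why an auction site might not accept this model.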
- Several technologies will do this nicely.
- Cassandra: has built-in logic to replicate appropriately between computers in the same data centre, and between data centres.
- CouchDB: designed for replication between servers, and even offline (very high latency) replication.
- Riak: does multi-data-centre replication in the for-pay enterprise version.
- MySQL: does have a fairly new “geographic redundancy” feature, but it requires a low latency network.