Validating HTML

When writing HTML, you might make a few mistakes as you're learning, and that's completely normal. But you may have noticed that if you try to load some pages with small mistakes into a web browser, they actually look okay.

For example, we know that every (non-empty) tag must be closed. What if one isn't closed? How will the browser (or a mobile browser, or Google, or other tools that read HTML) interpret that markup?

Let's look at an invalid HTML page to see what happens. That page contains this incorrect markup (hopefully you can spot the mistake):

<p>… the <a href="http://www.w3.org/"><abbr title="World Wide Web
Consortium">W3C</a>, the group that …</p>

The <abbr> element is not closed. The browser will do its best to display the markup anyway, but we can't be sure how: is the text “, the group that …” part of the abbreviation or not? We could end up with either interpretation (square brackets mark the extent of the abbreviation):

… the [W3C], the group that …

… the [W3C, the group that …]

My web browser chooses the first of these: the <abbr> element contains only three characters. But how can I be sure about other browsers, or Google, or other tools? I probably can't: this error could be interpreted in different ways without me knowing. That could lead to display problems on my page for some users, or to bad parsing by search engines and, in turn, bad rankings.

There is another error on that invalid HTML page: the quote ending an attribute value is missing:

<p>An <a href="http://validator.w3.org/>HTML Validator</a>
is a tool …</p>

In my browser, this makes the rest of the paragraph not display (because it's interpreting the rest of the text as part of the attribute value), so it's pretty likely I would have noticed. If I had been using a different browser when writing that page, would it have done the same?

The solution to this problem is easy: don't make mistakes like that. Make sure your markup follows the rules and it should be interpreted the same way by every tool. (Well, it's easy to say at least.)

HTML Validators

As we have seen, web browsers won't alert us about errors like the ones above. So it's easy to say “don't make mistakes” but hard to notice when we do. Another tool is needed that will actually tell us about problems in our markup.

An HTML validator (or just validator) is a tool that does exactly that: it compares our document against the rules of HTML markup and tells us if there are any problems.
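To get a feel for what a validator does, here is a minimal sketch in Python of just one of its checks: noticing tags that are never closed. It is only an illustration and not how the W3C validator works. The names VOID_ELEMENTS, UnclosedTagChecker and find_problems are made up for this example, and the sketch assumes the rule used in this course that every non-empty tag must be explicitly closed (real HTML allows some closing tags to be omitted).

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (the "empty" elements).
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}


class UnclosedTagChecker(HTMLParser):
    """Toy check: every non-empty tag must be closed, in the right order."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of (tag name, line where it was opened)
        self.problems = []   # messages describing what went wrong

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.open_tags.append((tag, self.getpos()[0]))

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return  # e.g. <br />: there is nothing to match
        line = self.getpos()[0]
        if all(name != tag for name, _ in self.open_tags):
            self.problems.append(f"line {line}: </{tag}> has no opening tag")
            return
        # Anything opened after the matching tag was left unclosed.
        while self.open_tags[-1][0] != tag:
            name, opened = self.open_tags.pop()
            self.problems.append(
                f"line {line}: </{tag}> found while <{name}> "
                f"(opened on line {opened}) is still open")
        self.open_tags.pop()


def find_problems(markup):
    """Return a list of messages; an empty list means no problems were found."""
    checker = UnclosedTagChecker()
    checker.feed(markup)
    checker.close()
    # Whatever is still on the stack at the end was never closed at all.
    for name, opened in checker.open_tags:
        checker.problems.append(f"line {opened}: <{name}> is never closed")
    return checker.problems
```

Given the unclosed <abbr> example from earlier, find_problems should return a single message pointing at the <abbr> element, and an empty list once the missing </abbr> is added. A real validator checks far more than this, which is why it is worth using one rather than a homemade script.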

The best validator to use is probably the W3C Markup Validation Service, which you can access online. You can give it the URL of your page once it is uploaded, upload the file from your computer, or just copy and paste your markup.

Once the validator has your markup, it will either report that it is “valid” (yay!) or give a list of errors it finds (boo!).

If there are errors, please remember that you're working with an automated tool: it won't magically know exactly how to fix the problem. The validator may point to the problem, or to something just before or after the thing you actually want to fix. It will probably also give multiple “errors” for one actual problem that you need to fix, just because the mistake left it confused about the structure of the page.

A suggestion: if the validator gives many errors, fix the first one and then try validating again. It's likely that some of the remaining errors are the result of the validator being confused, rather than additional problems you need to fix.

Try the validator now. Give it the invalid example above, or other examples from this course (which should all be valid), or some of the pages you have created for this course.

The Robustness Principle

It might help to understand why it's important for authors to make sure their HTML is as “correct” as possible, as well as why browsers don't break pages when small errors are present.

Web browsers (and hopefully you as an author) are following the robustness principle of software design:

Be conservative in what you do, be liberal in what you accept from others.

Originally from Jon Postel, in the Transmission Control Protocol (TCP) specification

The idea here is that when two pieces of software are communicating, the one sending out information should be very careful what it sends, to make sure that the receiver understands it. On the other hand, the receiver should do its best to gracefully handle small errors in what it receives.

If both sides uphold their end of this bargain, everything should be transmitted correctly and there should be no errors or miscommunication.
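In code, the two halves of this bargain might look like the following sketch. It is a made-up Python example, not something from the TCP specification: the functions send_flag and receive_flag and the yes/no text format are assumptions chosen only to illustrate the principle.

```python
# Hypothetical example: two programs exchange a yes/no flag as text.

ACCEPTED_YES = {"yes", "y", "true", "1", "on"}
ACCEPTED_NO = {"no", "n", "false", "0", "off"}


def send_flag(value: bool) -> str:
    """Sender: conservative. Always emits exactly "yes" or "no"."""
    return "yes" if value else "no"


def receive_flag(text: str) -> bool:
    """Receiver: liberal. Tolerates case, stray whitespace and common synonyms."""
    cleaned = text.strip().lower()
    if cleaned in ACCEPTED_YES:
        return True
    if cleaned in ACCEPTED_NO:
        return False
    # Being liberal does not mean guessing: unknown input is still an error.
    raise ValueError(f"cannot interpret {text!r} as yes or no")
```

Here the sender only ever produces “yes” or “no”, while the receiver also copes with input such as “ YES ” or “0”. Input it really cannot interpret still raises an error rather than being guessed at.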

If we try to do the analogous thing in human language, I claim these statements all carry the same information:

1. “You are smart.”
2. “Smart is a thing that you is.”
3. “LOL ur smrt!!!”
4. “It is smart is a kind of thing you are.”

If I want to express this idea, will my readers understand 2, 3, or 4? Would they all be understood to carry the same information? Maybe, but how can I be sure? The first is the safe, robust way to express this idea: I'm fairly sure any reader can understand it. Someone reading could probably understand at least the first three, and would do their best to figure them all out, but that doesn't mean they are all equally good ways to get my point across.

Back on the web, web browsers are doing their part: when there are errors in HTML, they handle them as best they can, to make sure as many pages as possible display, even if not exactly as the author intended.

As authors, we need to do our part as well. Don't rely on the browser handling errors in the “right” way: do your best to not produce errors. That way, you make sure your page works in as many situations as possible.