XML
- XML stands for eXtensible Markup Language.
- A text-based markup language for describing data.
- “text”: UTF-8 by default (and always, unless you are asking for problems).
- Like HTML and XHTML, XML uses tags, attributes, entities to describe information.
- … but there are no fixed tags/attributes/entities predefined.
- e.g. HTML has <p> defined to be a paragraph. In XML in general, it has no special meaning.
- XML can be used to describe any kind of structured data.
- Specific tags/attributes/entities can be created to describe particular types of data.
- A document that follows the XML syntax rules is called well formed.
- i.e. all tags closed, attribute values quoted, etc.
- In XML, all tags must be closed, even empty tags (unlike HTML):
<sometag>…</sometag> <emptytag /> or <emptytag></emptytag>
- Being well-formed is half of being valid.
- An XML parser will fail on any non-well-formed document.
- i.e. if you forget to close a tag, an XML parser will raise an exception, not guess how to fix your markup.
- A specific set of tags/attributes can be defined with a DTD or Schema.
- e.g. rules like “<a> can go in a <p> and must have an href attribute”.
- With a DTD or Schema, an XML document can be validated.
- Some XML schemas:
- XHTML
- SVG: Scalable Vector Graphics. Tags like <circle> and <line>.
- GPX: GPS data (tracks, waypoints, etc.)
- MathML: formulas and calculations.
- OpenDocument files are XML wrapped in a ZIP (with media files).
- thousands of others.
- Easy to work with.
- Most languages have (several) XML parsing libraries.
- Text-based, so easy to create by hand or check when debugging.
- Can validate output XML as part of a test suite.
- Easy to share between systems, as long as you know the schema.
- But…
- If you don't know the schema, working with the file will be hard.
- Relatively expensive to process.
- Verbose: hard to type; file size usually larger than necessary (but easily compressible).
- My opinion:
- Good for storing data and exchanging between systems.
- For well-defined messages within a system (server↔browser), there's too much overhead.
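The well-formedness rule above (a parser rejects bad markup instead of guessing) can be sketched with Python's standard `xml.etree.ElementTree` module. The sample documents and the `is_well_formed` helper are hypothetical, invented for illustration. Note that this only checks well-formedness; validating against a DTD or Schema needs a validating parser, which this sketch does not cover.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample documents, invented for illustration.
WELL_FORMED = '<student number="300012345"><fname>John</fname><nickname /></student>'
NOT_WELL_FORMED = '<student><fname>John</student>'  # <fname> is never closed


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        # The parser raises instead of trying to repair the markup.
        return False
    return True


# Parsing a well-formed document gives a tree of elements.
root = ET.fromstring(WELL_FORMED)
print(root.tag, root.attrib["number"], root.find("fname").text)

print(is_well_formed(WELL_FORMED))
print(is_well_formed(NOT_WELL_FORMED))
```

The tag names here mean nothing to the parser: `student` and `fname` are accepted simply because the document follows the XML syntax rules.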
DOM
- It's often necessary to operate on particular parts of an HTML/XML document.
- e.g. grab every <li> from within a particular <ul> and collect the contents
- e.g. find the element with id="foo" and change its contents
- e.g. the April Fools thing
- There are several contexts where this would be useful.
- Client-side scripting in HTML, to do dynamic effects.
- Parsing and manipulating XML data files (with many possible programming languages).
- The DOM (Document Object Model) is a standard API to do this.
- Gives a standard object-oriented way to access and manipulate an XML document.
- i.e. a standard tree-like data structure to represent XML data.
- Independent of programming language.
- Some DOM classes: element, text, attribute, document.
- Each element object contains:
- its contents: an ordered collection (list, array) of element and text objects.
- its attributes: an associative array (hash table, dictionary) of attribute/value pairs.
- Each class has several useful methods:
elt.getAttribute("class")
document.getElementsByTagName("p")
elt.setAttribute("class", "important")
- Any OO language you use will have a DOM library somewhere.
- But in particular: it's built into JavaScript.
- DOM is useful and universal.
- … but very verbose. Not fun to use for frequent XML manipulations.
- In JavaScript, many libraries exist to make it easier to work with the page. e.g. jQuery.
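Since the DOM is independent of programming language, the same method names appear outside JavaScript. As a minimal sketch, here is the "grab every <li> and collect the contents" task using Python's standard `xml.dom.minidom`, which implements a subset of the DOM API. The sample document is hypothetical.

```python
from xml.dom import minidom

# Hypothetical document, invented for illustration.
SOURCE = '<ul id="todo"><li>milk</li><li class="done">eggs</li></ul>'

doc = minidom.parseString(SOURCE)

# Grab every <li> and collect its text contents.
items = []
for li in doc.getElementsByTagName("li"):
    # An element's contents are an ordered list of child nodes;
    # keep only the text nodes and join their data.
    text = "".join(
        node.data for node in li.childNodes if node.nodeType == node.TEXT_NODE
    )
    items.append(text)

# Read and change an attribute on one element.
first = doc.getElementsByTagName("li")[0]
old_class = first.getAttribute("class")  # empty string when the attribute is absent
first.setAttribute("class", "important")

print(items)
print(doc.documentElement.toxml())
```

Even this small task takes a loop, a node-type check, and a join, which gives a sense of why the DOM feels verbose for frequent manipulation.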
JSON
- For many applications, XML is more complex than necessary.
- Must design a tag/attribute collection for your problem—often many choices to be made there.
- Difficult to parse, awkward to work with through the DOM, etc.
- JSON is a simpler alternative.
- JavaScript Object Notation
- Based on JS syntax for some basic data structures, but can be used with any language.
- JSON libraries exist for any common language.
- Basic types: number, string, boolean, array/list (in […]), object/dictionary/map (in {…}), null.
- e.g. an object:
{ "number": "300012345", "fname": "John", "age": 22, "courses": ["CMPT 470", "CMPT 383"] }
- Simpler to work with than XML, but somewhat less flexible.
- Usually very easy to decide how to represent your data: dump your internal data structure.
- … especially in a dynamically typed language like Python, Ruby, JavaScript.
- Often easier for data exchange within a system.
- e.g. web system requests client ↔ server.
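The "dump your internal data structure" idea can be sketched with Python's standard `json` module, using the example object from above. The variable names are only illustrative.

```python
import json

# The example object from the notes, as an ordinary Python dictionary.
student = {
    "number": "300012345",
    "fname": "John",
    "age": 22,
    "courses": ["CMPT 470", "CMPT 383"],
}

# Serialize: the internal data structure becomes JSON text.
text = json.dumps(student)

# Parse: JSON text becomes a data structure again.
restored = json.loads(text)

# Each basic JSON type maps onto a built-in Python type.
types = json.loads('[1.5, "s", true, null, [], {}]')

print(text)
print(restored == student)
print(types)
```

No tag or attribute design is needed here: the dictionary and list map directly onto a JSON object and array, and the round trip gives back an equal structure.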