Copyright 1998, Robert D. Cameron. Permission to copy for individual use is permitted. Multiple copies may be made for use in classrooms, discussion groups, or committee meetings, provided that notice of the intent and extent of the copying is sent to the author (e-mail is satisfactory). Reproduction or republication in any general distribution medium is not permitted without explicit consent of the author. All copying requires that the integrity of the paper be preserved and that this copyright notice be reproduced in full.
The syntax of XML is simple enough that it is possible to parse an XML document into a list of its markup and text items using a single regular expression. Such a shallow parse of an XML document can be very useful for the construction of a variety of lightweight XML processing tools. However, complex regular expressions can be difficult to construct and even more difficult to read. Using a form of literate programming for regular expressions, this paper documents a set of XML shallow parsing expressions that can be used as a basis for simple, correct, efficient, robust and language-independent XML shallow parsing. Complete shallow parser implementations of fewer than 50 lines each in Perl, JavaScript and Lex/Flex are also given.
The syntax of Extensible Markup Language (XML) is simple enough that it is possible to contemplate lightweight XML parsers based entirely on regular expression technology. This is no accident; the ease of writing programs to process XML documents was an important design goal for XML [XML 1.0]. Indeed, early working drafts of the XML specification considered a "trivial text grammar" that could be used as the basis for the "simplest form of XML processor" [WD-961114, WD-970807]; a version thereof also appears in the SGML FAQ Book [DeRose]. Unfortunately, these grammars never seem to have been debugged and developed to the point of usability for parsing; the trivial text grammar was dropped from the final version of the XML 1.0 specification.
Perhaps the most intriguing application of regular expressions
to XML parsing is that a single regular expression can serve
as a basis for completely parsing a document into a list
of its individual markup and text components.
Such an operation is called a shallow parse
of the document, because the internal structure of the markup terms
is not parsed in the initial instance.
For example, if $XML_SPE
is a Perl variable
denoting a shallow parsing regular expression, then
the following Perl subroutine computes and returns
the shallow parse of an XML document as an array of
strings.
sub ShallowParse {
  my($XML_document) = @_;
  return $XML_document =~ /$XML_SPE/g;
}
Shallow parsing can be implemented with similar ease using other languages or tools that provide regular expression support. Full implementations of shallow parsing in Perl, JavaScript and Lex/Flex are provided as Appendices to this paper.
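The Perl subroutine carries over almost verbatim to other languages. The JavaScript sketch below illustrates this; the expression assigned to XML_SPE here is a deliberately simplified stand-in (the full shallow parsing expression is developed in section III), but the calling pattern is the same: matching with the global flag returns the list of all matched items.

```javascript
// A JavaScript analogue of the Perl ShallowParse subroutine.
// XML_SPE below is a simplified stand-in for the full shallow
// parsing expression developed in section III; a real tool would
// substitute the complete expression.
var XML_SPE = /[^<]+|<[^>]*>?/g;

function ShallowParse(xmlDocument) {
  // String.match with a global regular expression returns the
  // array of all non-overlapping matches, i.e. the shallow parse.
  return xmlDocument.match(XML_SPE);
}
```

As with the Perl version, the concatenation of the returned strings reproduces the input document.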
Once shallow parsing has returned the list of strings representation of an XML document, further parsing operations can be applied to extract the deep structure of markup items. Regular expressions can also be applied to a number of these subsequent processing steps. In general, the regular expressions for subsequent processing can be relatively simple adaptations of corresponding structures from the overall shallow parsing expression. In some cases, completion of the parsing task also requires some recursive parsing procedures to deal with, for example, the recursive structure of element content models.
The combination of the initial shallow parser of XML documents together with the secondary parsing operations suggests a third form of XML processor API (application programmer interface) to supplement the event-based APIs such as those based on SAX and the tree-based APIs such as those based on the DOM. The shallow parser approach has some interesting advantages. The list of strings representation is very simple and may often be directly coded using a native data structure in the host language. In comparison with the tree-based approach this simplifies the interface by eliminating the need for a programmed data structure and corresponding access routines. The representation of markup terms by their string representation may also be considerably more compact than corresponding parsed representations. Another advantage is that it is easy to regenerate the original XML document or components thereof using simple string concatenation operations. Finally, the shallow parsing approach is fault-tolerant: it provides a correct parse for correct XML documents and a useful parse for attempted but incorrect XML documents. In fact, shallow parsing need never fail; given any document whatsoever, an appropriate list of strings representation of that document can be produced. For the implementation of lightweight XML-to-XML filters and fault-tolerant XML editing tools, these advantages may strongly favour the shallow parser approach. A full comparative analysis of the three approaches is beyond the scope of this paper, however.
Implementation of shallow parsing using regular expressions has the additional benefit that it is a relatively language-independent technique. Modern scripting languages now generally provide built-in regular expression support, while other languages can make use of widely available regular expression libraries or lexical analyzer generators. Thus, regular-expression based designs for XML processing tools can often be contemplated without a premature commitment to implementation language. Furthermore, once an implementation has been developed, rehosting of that implementation in a different language may be relatively straightforward, at least insofar as the regular-expression based processing is concerned. However, language independence is not automatic; a number of nonstandard extensions or special processing modes exist in the various regular expression support packages. To maximize language independence, the regular expressions developed in this paper generally avoid nonstandard features.
The goal of this paper, then, is to present a language-independent toolkit of XML regular expressions that can be used as the basis of simple, correct, and efficient XML processing tools. Section II of this paper briefly discusses important preliminaries of regular expression notation, character set issues and the efficiency of regular-expression based processing. Section III describes the most important regular expression developed in this paper: a single regular expression that can be used to produce a shallow parse of an entire XML document. Subsequent to shallow parsing there are a variety of other tasks that can be aided by carefully constructed regular expressions; these are discussed in section IV. Section V concludes the main body of the paper; the Appendices give complete code for shallow parser implementations in Perl, JavaScript and Lex/Flex.
Regular expressions are presented in this paper using a grammatical notation adapted from the XML specification. This notation allows a free-format presentation (spacing is irrelevant) and avoids the readability problems of excessive backslash escapes for special characters. Compared to the notation of the XML specification, restrictions are imposed in order to ensure that the grammar is indeed regular and can be easily translated into the syntax used by typical regular expression support facilities.
The following conventions are used for regular expression construction.
"^"
)
immediately following the opening bracket.*
(zero or more),
R+
(one or more) and R?
(zero or one) are repetition regular expressions.
|
R2 is a
regular expression
matching any string matched by either R1 or R2.
Overall, the translation of regular expressions given using these conventions to the equivalent notation of a regular expression support package is relatively straightforward. The appendices illustrate the translation in several instances.
An apparent problem with the use of regular expressions as a basis for XML processing is that many regular expression packages are based on single-octet (8-bit) extended ASCII (e.g., ISO Latin 1) character sets, while XML requires support for the full ISO 10646/Unicode character set using various alternative multi-octet encodings. Over time, this problem may be solved by the general upgrading of regular expression support packages to provide explicit support for ISO 10646 character sets. However, for some applications, including XML parsing, it is possible to make fairly convenient use of 8-bit regular expression packages on the UTF-8 encoding of ISO 10646, provided that some care is taken in the treatment of non-ASCII characters.
UTF-8 is a particular encoding of ISO 10646 designed for interoperability with some forms of ASCII-based single-octet processing. UTF-8 is a variable length (1 to 6 octet) encoding with two important properties with respect to the ASCII representation. First, characters in the 7-bit ASCII character subset (high-order bit 0) have the same bit representation as single octets in UTF-8. Second, octets 2 through 6 in UTF-8 always have a high-order bit 1 to avoid confusion with 7-bit ASCII characters. Thus, ASCII-based applications that make processing decisions only on the occurrence of 7-bit characters and otherwise pass through 8-bit characters (non-ASCII octets) unaltered can often be used with equivalent effect on the UTF-8 representation.
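These two properties are easy to verify. The check below assumes a Node.js Buffer simply as a convenient way to expose the raw UTF-8 octets: "<" encodes as the single ASCII octet 0x3C, while the non-ASCII "é" (U+00E9) encodes as two octets that both have their high-order bit set.

```javascript
// Octets of the UTF-8 encoding of the two-character string "é<".
const octets = Array.from(Buffer.from("\u00E9<", "utf8"));

// "é" contributes the two octets 0xC3 0xA9, both with the
// high-order bit set; "<" contributes the single ASCII octet 0x3C.
const nonAsciiOctets = octets.filter(o => o >= 0x80);
```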
There are two important techniques that allow UTF-8 processing
with 8-bit extended ASCII regular expression packages.
First, a simple restricted form of regular
expression construction can ensure that they are suitable
for processing UTF-8 documents directly. In short, the restriction
is that non-ASCII characters may only be accepted by negative character
set constructors in combinations equivalent to "[^A][^B]*", where
A and B represent possibly different sets of 7-bit ASCII characters.
This restriction is explained in subsection II.2.1 below.
The second technique is to directly write regular expressions
over UTF-8 octets using hexadecimal character code notation.
For XML processing, this technique is only relevant for
processing of the non-ASCII name characters that may appear
in names and name tokens. Processing of XML name characters is
discussed in subsection II.2.2.
Encodings other than UTF-8 are also permitted for XML documents.
However, as UTF-8 is a universal encoding format for ISO 10646,
it is always possible to convert the alternatively encoded
documents to the UTF-8 representation. If output documents
are then to be generated based on the UTF-8 representation,
an important technicality
is to ensure that any encoding declaration in the XML document
is rewritten to use the string "UTF-8".
ASCII-based processing for UTF-8 applications can be formalized based on the following notion of ASCII-based automata without counting. An ASCII-based automaton without counting is a deterministic finite automaton over a character set that includes the 7-bit ASCII set as a subset and meets the constraints that (a) all super-ASCII transitions are total, and (b) all super-ASCII acceptance states include a super-ASCII self transition. A super-ASCII transition is one which accepts any non-ASCII character as well as possibly some ASCII characters; a super-ASCII transition is total if it accepts all non-ASCII characters. A super-ASCII acceptance state is one reached by a super-ASCII transition. In essence, constraint (a) means that the automaton is ASCII-based, that is, transition decisions are based on the presence or absence of specific 7-bit ASCII characters. If the test is based on the absence of specific ASCII characters, then all non-ASCII characters are acceptable and the super-ASCII transition is total. Constraint (b) means that non-ASCII characters are accepted without counting. Once a non-ASCII character is accepted, the automaton accepts arbitrarily more non-ASCII characters in the same state. An exit transition for the state may only be defined based on the recognition of specific ASCII characters.
The essential result that enables ASCII-based processing for UTF-8 applications is that an ASCII-based automaton without counting defined over an 8-bit extended ASCII character set implements an equivalent automaton over UTF-8. In the 8-bit case, note that the super-ASCII transitions are the ones that accept octets with the high-order bit set. Consider then what happens if a UTF-8 document is supplied to an 8-bit ASCII-based automaton without counting. In essence, the automaton is being applied to the octet stream of the UTF-8 document which is not necessarily the same as the character stream. However, octets with zero in the high-order bit will be accepted and processed correctly as the ASCII characters they represent. When an octet with high-order bit set to one is encountered, it is the start of a multi-octet sequence coding for a non-ASCII character of ISO 10646. The automaton will initially treat this as a single-octet non-ASCII character. However, the character will only be accepted if a super-ASCII transition exists in the current state. If so, a super-ASCII acceptance state is entered. This state will have a super-ASCII self transition that will accept additional octets with the high-order bit set indefinitely. Thus the remaining octets of the UTF-8 sequence coding for the non-ASCII character (which octets must all have their high-order bits set) will be accepted without leaving this state. In essence, the state has been entered with the acceptance of the full multi-octet coding of the non-ASCII character. As a result, the 8-bit ASCII-based automaton accepts UTF-8 octet sequences in precisely the fashion required for implementation of the corresponding automaton over the UTF-8 character set.
The constraints for UTF-8 processing using 8-bit extended ASCII automata
can be expressed as corresponding constraints on the use of 8-bit ASCII
regular expression packages.
Again, the requirement that all super-ASCII transitions are total
is equivalent to the statement that transitions decisions are
based on the presence or absence of specific ASCII characters.
In regular expression terms, this corresponds to a requirement
that only the 7-bit ASCII characters be used in literal strings
or character set constructors. A super-ASCII transition then
arises whenever a character set is constructed in
negative form, for example, "[^>;.]". Such
an expression accepts any single character except those
listed. Because only 7-bit ASCII characters are listed, every
non-ASCII character is acceptable and the resulting super-ASCII
transition is total.
The second constraint for UTF-8 processing using 8-bit ASCII
regular expression packages arises from the requirement that
super-ASCII acceptance states have super-ASCII self-transitions.
In regular expression terms, a super-ASCII self-transition
corresponds to a zero-or-more repetition of a negated character
set expression. The prototypical regular expression for
super-ASCII transitions in a super-ASCII acceptance state is
thus "[^A][^B]*
", where
A and B represent possibly
different sets of 7-bit ASCII characters.
In this case, the initial super-ASCII transition based on
"[^A]
" enters
a state with a super-ASCII self-transition based on "[^B]
".
However, other regular expression constructions can have
a similar effect. Note, for example, that the form "[^A]+"
is acceptable because it is equivalent to "[^A][^A]*".
A more complex example that may also be implemented
by an ASCII-based DFA without counting is
"([^A]|R)([^B]*|R2)
",
provided that neither subexpression R nor R2
initially matches a non-ASCII character.
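The restriction can be seen in action with a small octet-level experiment. The sketch below (assuming Node.js, with the "latin1" codec used only as a device for treating the string one octet at a time) applies a negated character set listing only ASCII characters to the UTF-8 octets of a string containing "é"; the multi-octet encoding is accepted and passed through intact.

```javascript
// Expose the UTF-8 octets of "abécd<" as a string of single-octet
// characters (one character per octet via the latin1 codec).
const octetString = Buffer.from("ab\u00E9cd<", "utf8").toString("latin1");

// The pattern [^<>]+ is of the form [^A][^B]*, with only ASCII
// characters listed, so every non-ASCII octet is accepted.
const matched = octetString.match(/[^<>]+/)[0];

// Re-decoding the matched octets recovers the original characters,
// demonstrating that "é" was passed through unaltered.
const recovered = Buffer.from(matched, "latin1").toString("utf8");
```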
In general, except for the processing of names and name tokens as discussed in the next subsection, XML parsing actions may be implemented by ASCII-based automata without counting. Every regular expression defined in sections III and IV of this paper meets the restrictions outlined here.
XML documents can largely be processed using ASCII-based
techniques compatible with the constraints described above.
The only difficulty is that names and name tokens are permitted
to include some, but not all, non-ASCII characters from
ISO 10646 as name characters (NameChar in the
XML specification). The super-ASCII transitions
of the Unicode subautomata for name recognition are thus nontotal
and cannot be directly implemented using the techniques
of the previous section.
However, there are two relatively straightforward approaches to dealing with the issue of processing non-ASCII name characters using extended-ASCII regular expression packages. The first is to use direct hexadecimal coding of UTF-8 octets to define regular expressions for all and only the legal name structures of XML. Construction of these expressions is tedious, but not conceptually complex. The second approach is to relax the restrictions on non-ASCII name characters. That is, because non-ASCII characters play no role in delimiting names within correct XML markup, every non-ASCII character may safely be accepted as a name character during XML shallow parsing. Using this approach, correct XML documents will be correctly parsed, and documents that are erroneous by virtue of an illegal non-ASCII name character will nevertheless be usefully parsed in a way that supports the implementation of robust XML filters.
For simplicity and its support of robustness, then, REX adopts the approach of relaxing the restrictions on non-ASCII name characters. The following definitions result.
NameStrt = [A-Za-z_:] | [^\x00-\x7F]
NameChar = [A-Za-z0-9_:.-] | [^\x00-\x7F]
Name = NameStrt NameChar*
These definitions meet the constraints for implementation via ASCII-based DFAs without counting, allowing correct regular-expression based processing of UTF-8 documents to be implemented using 8-bit extended ASCII regular expression packages.
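The definitions above transcribe directly into a regular expression in a host language. The JavaScript version below builds the Name expression from the NameStrt and NameChar definitions exactly as given; it works both at the character level and, because only ASCII characters are listed positively, at the octet level on UTF-8 input.

```javascript
// Direct transcription of the NameStrt, NameChar and Name
// definitions. Every non-ASCII character is accepted as a name
// character, per the relaxation described above.
const NameStrt = "[A-Za-z_:]|[^\\x00-\\x7F]";
const NameChar = "[A-Za-z0-9_:.-]|[^\\x00-\\x7F]";
const Name = new RegExp("^(?:" + NameStrt + ")(?:" + NameChar + ")*$");
```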
Although higher-level declarative programming techniques are often associated with inefficient implementations, regular-expression based processing can be made quite efficient. Indeed, there are good arguments why regular-expression based implementations may be more efficient than carefully hand-crafted implementations under certain circumstances.
In general, regular-expression based processing is implemented by compilation of the regular expressions into finite automata. The automata may either be nondeterministic (NFAs) or deterministic (DFAs) depending on the implementation strategy. NFAs may be more compact than DFAs, but next state transitions may be ambiguous. NFA interpretation may hence require backtracking strategies to explore alternative state transition possibilities.
When the number of states is not excessive, DFAs can be a very efficient approach to regular-expression processing. In essence, each DFA state has a lookup table giving the next state transition for each possible input character. For 8-bit extended ASCII automata, complete lookup tables of 256 entries are tolerably small, provided that the total number of states is not excessive. Input characters can thus be examined and next states computed with a single memory reference. Long strings can be scanned very efficiently, with each input character being touched only once, and each touch being very lightweight.
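The table-driven scheme can be illustrated with a deliberately tiny automaton. The sketch below (an illustrative construction, not taken from the paper's generated implementations) hand-builds the two-state DFA for the expression [^<]+, a text run extending to the next "<"; each input character costs exactly one table lookup.

```javascript
// Two-state DFA for [^<]+. Each state owns a 256-entry next-state
// table indexed by input octet; -1 marks a rejecting transition.
const REJECT = -1;
const table = [new Int32Array(256).fill(REJECT),   // state 0: start
               new Int32Array(256).fill(REJECT)];  // state 1: in text
for (let c = 0; c < 256; c++) {
  if (c !== 0x3C) {          // any octet except "<"
    table[0][c] = 1;         // first text character
    table[1][c] = 1;         // subsequent text characters
  }
}

function scanText(input, pos) {
  let state = 0, i = pos;
  while (i < input.length) {
    // One lookup per character: the touch-once property.
    const next = table[state][input.charCodeAt(i) & 0xFF];
    if (next === REJECT) break;
    state = next;
    i += 1;
  }
  return state === 1 ? input.slice(pos, i) : null;  // state 1 accepts
}
```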
DFA-based implementation may indeed outperform carefully coded manual implementations. A manual implementation may also be able to achieve a touch-once property, but usually will involve heavier touches. For example, character and line counting is often an integral part of such implementations so that error messages can be helpfully correlated with the input document when necessary. However, this sacrifices the speed of processing correct documents for the convenience of error analysis of incorrect documents. This paper develops regular-expression based approaches that reverse the trade-off; correct documents can be parsed very efficiently, while error analysis and reporting may be somewhat slower.
Furthermore, when regular expression support is built into a scripting language such as Perl or JavaScript, it may not be possible to manually implement scanners with equivalent speed. In essence, interpretation and/or compilation of manually written source code statements may involve considerably more overhead than the built-in string processing code associated with regular-expression support.
For XML parsing, it is generally possible to develop regular expressions for which reasonably compact DFAs may be generated. Furthermore, it is also possible to write these expressions in a deterministic form that allows NFA-based implementation without excessive backtracking. In essence, the regular expressions can be developed so that, at each decision point, there is an unambiguous choice dependent on the next input character. Sections III and IV follow this approach in developing the regular expressions for XML parsing.
Literate programming is a term coined by Knuth for a particular
style of program development in which high quality typesettable
program documentation is integrated with program source code
[Literate Programming].
Literate programming is supported by a representation that combines
both a particular programming language and a particular typesetting language,
for example, Pascal and TeX in the original web system (unrelated
to the World-Wide Web).
Two programs, generally referred to as tangle and weave,
operate on this representation to produce the required
language processor input or typesetter input, respectively.
By construction, literate programming leads to program documentation
that is consistent with the program it documents. This is a
particularly valuable property both for program validation and subsequent
program maintenance.
This paper employs a form of literate programming adapted for use with
regular expressions. In essence, each named regular expression in this
paper is defined in terms of a structured representation intended
to serve as input to programs retangle and reweave.
The reweave program weaves the regular expression definitions
into the tapestry of this paper in the notation of Section II.1
above. The retangle program produces programming language
implementations of these expressions tangled together with all
the necessary quotes, backslash escapes, string operators and so on.
Furthermore, the retangle program is parametric in the target
programming language of the generated output. As implemented for
this paper, retangle generates the Perl, JavaScript and Lex/Flex
implementations presented in the Appendices.
Perhaps the most intriguing aspect of literate regular-expression
programming as used in this paper is that it is based
on an XML representation. This provides an interesting
opportunity to test the XML shallow parsing expressions in the
implementation of the reweave and retangle tools.
Indeed, after a bootstrapping process using hand-written shallow
parsers, the reweave and retangle tools have in fact been
rewritten to use the very regular expressions documented in
this paper.
Overall, the use of these literate regular-expression programming
techniques has a considerable benefit in ensuring that
the regular expressions developed herein are consistent,
reliable and well-documented. By being based on a common XML
representation, the Perl, JavaScript and Lex/Flex implementations
of the regular expressions are consistent with each other and with
the forms actually appearing in this paper. Errors identified in
any one setting are fed back to the original XML encoding and
result in regenerated versions for each implementation and the
paper. Furthermore, regenerated versions are then tested
with the retangle and reweave programs. If those tests produce
erroneous results, the errors must be corrected until stable,
consistent and correct processing is once again achieved.
In summary form, shallow parsing may be described as the parsing of an XML document into a list of its individual markup and character data strings. However, this statement can be interpreted in a variety of ways with respect to the actual markup and text strings that are returned and the behaviour of shallow parsing in the case of an XML document containing errors. To clarify these issues, this section analyzes the general properties desirable for XML shallow parsing, together with some of the alternatives.
The first desirable property of shallow parsing is that the output list of strings is an ordered partition of the input document. That is, every character from the input document is represented exactly once in an output string, and the order of the character occurrences is preserved. In other words, the concatenation of the output list of strings precisely reproduces the input document.
The input partitioning property for shallow parsing is particularly useful for the construction of lightweight XML-to-XML filters. These filters generally take an XML document and make some small changes at one or more locations. The partition property allows the implementation of such filters without the imposition of additional changes by the parsing process. This will often be considered highly desirable by the users of such filters, particularly when the filter is used as part of a document editing or transformation task.
It is possible to contemplate a shallow parsing system that does not preserve all input characters, for example, deleting some white space characters within markup items and/or expanding some references. However, these effects may be considered undesirable for some types of XML filters.
Each of the strings in the list returned by an XML shallow parser may be classified as one of three types: markup strings, error strings, or parsed character data strings. The following breakdown is particularly simple and useful.
Character, general and parameter-entity references are not extracted in this approach. A second process of reference extraction can be used for this purpose. An alternative design would be to separate out character and general entity references from text items in the initial shallow parse. However, reference extraction would still be needed as a separate process for application to markup items, for example, extracting character and general references from the literal attribute values of element start tags. Furthermore, early reference extraction from text strings complicates the specification and implementation of the shallow parser and leads to a somewhat heavier weight representation. More importantly, reference extraction from text may be irrelevant to a large class of lightweight XML filters, that is, those filters that can preserve references because the input and output document type definitions are the same. For such filters, early reference extraction from text is simply an unnecessary task that needs to be reversed. For these reasons, the approach taken in this paper is to avoid premature reference extraction; it is also more consistent with the overall philosophy of shallow parsing.
Another implication of this approach is that XML document type declarations are returned as single strings, even though document type declarations may contain nested markup items such as processing instructions, comments or markup declarations. When necessary, these nested items must be extracted and processed in a secondary parsing process.
Shallow parsing is intended to support processing of partial, invalid or otherwise ill-formed XML documents. In fact, it is not difficult to arrange that, given any document whatsoever, the shallow parser returns a viable list of strings representing its interpretation as an XML document. For a fully robust shallow parser, there are no fatal errors. Errors are identified during shallow processing whenever the opening left angle bracket of a markup item has been found, but there is either an internal syntactic error in the item or the corresponding closing bracket cannot be found. The simplest approach to error identification in these cases is to return the character string consisting of a single left angle bracket as a universal error token. In the shallow parse representation, this would generally be followed by a text string object containing both the remainder of the attempted markup item as well as the following actual text up to the next occurrence of an angle bracket.
However, it is possible to design regular expressions for shallow
parsing that improve upon this.
In most cases involving erroneous markup items,
it is possible to return at least the
full opening delimiter of a markup item, possibly followed by its "name"
(where relevant). Thus, strings such as "<!--",
"<![CDATA[", "<?xml", "<!DOCTYPE", "<H3" and "</H3"
can serve as tokens denoting errors in particular types of
structure. In some cases, longer error tokens can also be
usefully returned to more precisely locate an error.
For example, given the erroneous comment
<!-- Embedded double hyphen (--) not allowed -->
the error token "<!-- Embedded double hyphen (--" usefully
locates the precise point of error.
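This error token can be reproduced with a small prefix expression matching "<!--" followed by everything up to and including the first occurrence of two consecutive hyphens. The sketch below is written in the style of the comment expressions developed later in the paper; applied to the erroneous comment above, it yields exactly the error token shown.

```javascript
// Match text up to and including the first "--": scan non-hyphens,
// consume a hyphen, repeat until a second consecutive hyphen.
const Until2Hyphens = "[^-]*-(?:[^-][^-]*-)*-";
const commentErrorPrefix = new RegExp("^<!--" + Until2Hyphens);

const bad = "<!-- Embedded double hyphen (--) not allowed -->";
const token = bad.match(commentErrorPrefix)[0];
// token is the error string locating the embedded double hyphen.
```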
Nevertheless, a single-character "<" error token can still
occur; it usually denotes an erroneous occurrence of an opening
angle bracket in text data, such as "x < 0".
The final general issue with respect to shallow parsing specification is what properties can be assumed when a fully delimited markup string is returned. In general, regular expressions cannot enforce all of the well-formedness constraints for correct XML markup, let alone the context-sensitive validity constraints. Furthermore, in support of fault-tolerant XML tools it may be desirable to relax certain constraints that are technically possible to enforce. For example, various different contexts within the XML grammar impose different constraints on the characters that may be contained within quoted strings. However, most of the constraints do not affect the construction of XML-to-XML filters and so their enforcement within the XML shallow parser may be unnecessarily limiting.
The specific properties that can be assumed for each type of markup item are documented in the relevant subsection of section IV.
Parsed character data strings returned by the shallow parser consist of maximal length arbitrary text and whitespace strings found between markup items (correct or erroneous) in the document source. The maximal length property means that the strings always extend from one markup or error item (or beginning of file) to another markup or error item (or end of file). Thus, no two character data strings will occur consecutively in the shallow parse. Alternatively, a shallow parser could be designed that further divides parsed character data in some fashion, for example, at line boundaries. However, this leads to a heavier weight representation. In the case of line boundary divisions, the strategy is also incomplete, in that such divisions occurring within markup or error items would not be identified.
Three additional observations about parsed character data
may be of interest. First, character data strings may
consist entirely of whitespace. This is necessary to
preserve the input partitioning property when consecutive
markup items are found, say, with an intervening blank line.
Second, as noted previously, parsed character data may also
include character or general entity references (which in turn
may be either correctly or incorrectly formed).
Third, parsed character data strings returned in shallow parsing
may also include "]]>" sequences even though
those sequences are not permitted in XML.
The principal implication of this is that
a check for this sequence must be made whenever
a text object is enclosed in a CDATA section.
However, a general purpose CDATA construction operation
must always make this test anyway, because the
text argument for such an operation might come
from a source other than a text string of an unexpanded input document.
Based on the general analysis of the previous section,
this section develops the overall XML shallow parsing expression
XML_SPE
.
This expression is intended to be repeatedly matched against an XML document
to return the list of individual text, error and markup items
that comprise the document shallow parse.
The overall structure of XML_SPE
is given by the following
definitions.
XML_SPE = TextSE | MarkupSPE
TextSE = [^<]+
MarkupSPE = '<' ('!' DeclCE? | '?' PI_CE? | '/' EndTagCE? | ElemTagCE?)
DeclCE = '--' CommentCE? | '[CDATA[' CDATA_CE? | 'DOCTYPE' DocTypeCE?
If the first character of the input to be matched against XML_SPE
is any character but a left angle bracket, then the text scanning
expression TextSE
is used to extract a text item.
Otherwise, a markup item (or related error item) should be parsed,
the type of which is determined by one or more characters following the
opening angle bracket. Once the correct markup type is determined,
further parsing of the item is handled by a "completion expression" (CE
)
for the type.
In the overall structure of XML_SPE, note that the completion expressions are each optional. If no string matching the relevant completion expression is found, then the opening delimiter itself may be returned as an error token. Thus any of the tokens "<!", "<!--", "<![CDATA[", "<!DOCTYPE", "<?" or "</" may be returned to indicate malformed markup of a particular type. Furthermore, a solitary "<" may be returned as a general error token in the event that it is followed neither by one of the special characters ("!", "?" or "/") nor by a string matching the completion expression for element tags.
Note also that XML_SPE
always matches at least one character for any
input. In this way, global matching of XML_SPE
against an input
document is guaranteed to produce a shallow parse satisfying the input
partitioning property.
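The partitioning property can be checked directly in JavaScript. The following sketch uses a reduced subset of XML_SPE (text, comments, end tags and attribute-free element tags only; the full definitions appear in the appendices) and shows that the matched items reconstitute the input exactly, even in the presence of malformed markup.

```javascript
// Reduced XML_SPE: TextSE plus the comment, end tag and simple element
// tag branches of MarkupSPE (no attributes, PIs or DOCTYPE).
var TextSE    = "[^<]+";
var CommentCE = "[^-]*-(?:[^-][^-]*-)*->?";  // Until2Hyphens '>'?
var Name      = "(?:[A-Za-z_:]|[^\\x00-\\x7F])(?:[A-Za-z0-9_:.-]|[^\\x00-\\x7F])*";
var S_opt     = "(?:[ \\n\\t\\r]+)?";
var EndTagCE  = Name + S_opt + ">?";
var ElemTagCE = Name + S_opt + "/?>?";
var MarkupSPE = "<(?:!(?:--(?:" + CommentCE + ")?)?|/(?:" + EndTagCE + ")?|(?:" + ElemTagCE + ")?)";
var XML_SPE   = new RegExp(TextSE + "|" + MarkupSPE, "g");

var doc   = "<a>text<!-- note --><b/></a><!--unterminated";
var items = doc.match(XML_SPE);
// items.join("") === doc: every input character appears in exactly one
// item, and the malformed trailing comment yields the error token "<!--".
```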
The remainder of this section develops the regular expressions for
extraction of each type of markup item. Each expression is
developed in standalone form as well as in the completion
expression form for use within XML_SPE
as defined here.
An XML comment opens with a "<!--
" delimiter
and generally closes with the first subsequent occurrence of the
closing "-->
" delimiter. An explicitly stated
exception is that a double hyphen is not permitted within
the body of a comment. This rule ensures that unterminated comments
are detected if a new comment opening delimiter is encountered.
There is an additional restriction
that comments cannot be terminated with the "--->
"
sequence, that is, that the body of the comment cannot terminate
with a hyphen. This rule may be inferred from the
grammar rule for comments given in the
specification [XML 1.0].
Comment ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'
Here Char
stands for the set of all characters and
the unquoted "-
" denotes subtraction from this set.
This rule does implement the stated restriction on double hyphens,
but also has the additional effect of disallowing a hyphen
at the end. This is compatible with the SGML treatment of comments.
In fact, this grammar rule can be directly converted
to a regular expression by substitution of "[^-]
"
for each occurrence of "(Char - '-')
".
However, the result is nondeterministic with respect to
hyphens found after the opening delimiter.
In essence, in a direct NFA-based implementation it forces
a choice of matching the hyphen within the comment body
or as the first character of the closing delimiter.
Backtracking to reverse this decision may consequently be required.
An equivalent deterministic regular expression may be constructed as follows.
CommentRE = '<!--' Until2Hyphens '>'
Until2Hyphens = UntilHyphen ([^-] UntilHyphen)* '-'
UntilHyphen = [^-]* '-'
The steps used in matching this expression to an XML comment may be summarized as follows.
1. Match the opening delimiter ("<!--").
2. Scan up to and including the next hyphen (UntilHyphen).
3. While the character following the hyphen is not itself a hyphen, scan past that character up to and including the next hyphen ([^-] UntilHyphen, repeated as necessary).
4. Upon finding a second consecutive hyphen (the final "-" of Until2Hyphens), match the closing ">".
The comment matching process scans until the first occurrence of a double hyphen. Because a double hyphen is not legal within the body of a comment, it must signal the position of the closing delimiter. If the closing angle bracket does not follow immediately, the comment scan fails.
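The deterministic expression can be transcribed directly into a JavaScript regular expression; the anchors confine it to complete comments. The two error cases discussed above can then be observed:

```javascript
var UntilHyphen   = "[^-]*-";
var Until2Hyphens = UntilHyphen + "(?:[^-]" + UntilHyphen + ")*-";
var CommentRE     = new RegExp("^<!--" + Until2Hyphens + ">$");

CommentRE.test("<!-- a comment -->");        // true
CommentRE.test("<!-- a -- b -->");           // false: internal double hyphen
CommentRE.test("<!-- ends in hyphen --->");  // false: body ends with a hyphen
```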
Applied to shallow parsing, CommentRE
will either match complete and correct comments or fail to match.
For robustness, the shallow parsing expression
returns two types of comment error token.
CommentSPE = '<!--' CommentCE?
CommentCE = Until2Hyphens '>'?
As mentioned previously, the comment completion expression is optional,
allowing the opening delimiter to be
identified as an error string even if the rest of the structure
is not matched (unterminated comment).
Within the completion expression, the optional closing angle bracket
means that a comment
erroneously terminated with a double hyphen may nevertheless
be identified.
Under the longest substring match rule, then, CommentSPE
matches the
full structure of a legal XML comment when it exists,
an erroneously terminated comment ending in a double hyphen
as a second priority, or the opening comment delimiter, otherwise.
The properties of comment shallow parsing may be summarized as follows.
- A string delimited by "<!--" and "-->" is a complete and correct XML comment.
- A string delimited by "<!--" and "--" (without the closing ">") is an erroneously terminated comment.
- A solitary "<!--" indicates an unterminated comment. There can be no subsequent double hyphen (and hence no subsequent comment) in the input file.
XML processing instructions (including the XML declaration as a special case) have the following syntax.
PI ::= '<?' Name (S (Char* - (Char* '?>' Char*)))? '?>'
Here "Name
" and "S
" are nonterminals in the
XML grammar for names and whitespace.
After the opening "<?
" delimiter,
a processing instruction terminates with the
first occurrence of the corresponding "?>
" sequence
(this sequence cannot occur in a name or whitespace).
A deterministic regular expression that precisely captures this syntax may be constructed as follows.
PI_RE = '<?' Name PI_Tail
PI_Tail = '?>' | S1 UntilQMs ([^>?] UntilQMs)* '>'
S1 = [\n\r\t ]
UntilQMs = [^?]* '?'+
After the opening delimiter and the processing instruction target (name),
the tail of the processing instruction may immediately terminate with
the closing delimiter or may first have arbitrary content following whitespace.
The key to the scanning process for this arbitrary content is to
use the UntilQMs
subexpression to
scan for question mark sequences (one or more
question marks) until a closing angle bracket is found
immediately after such a sequence. This may be contrasted
to the scan for a single hyphen in comment matching.
Technically, a processing instruction may terminate with "?>", "??>", "???>" and so on.
Failure to properly scan for question mark sequences is
an easy mistake to make. For example, the
following erroneous scanning expression was
found in early working drafts of the XML specification
[WD-970807, WD-961114]
as well as the SGML FAQ book [DeRose].
Erroneous_PI_SE = '<?' [^?]* ('?' [^>]+)* '?>'
This scanning expression was designed as a minimal form to find the closing delimiter of a processing instruction, but fails to accept any processing instruction that terminates with exactly two question marks before the right angle bracket.
A further subtle point in the PI_RE expression is the use of the S1 expression to match a single whitespace character rather than matching one or more whitespace characters with S.
The reason for this is to avoid nondeterminism;
the extra whitespace characters are also matched by
the [^?]
subexpression that immediately follows.
In turn, this means that NFA-based implementations
of this expression can avoid backtracking.
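A JavaScript transcription of PI_RE makes both points concrete: the question-mark scan accepts a processing instruction closing with "??>", while the erroneous working-draft expression rejects it.

```javascript
var Name     = "(?:[A-Za-z_:]|[^\\x00-\\x7F])(?:[A-Za-z0-9_:.-]|[^\\x00-\\x7F])*";
var S1       = "[\\n\\r\\t ]";
var UntilQMs = "[^?]*\\?+";
var PI_Tail  = "\\?>|" + S1 + UntilQMs + "(?:[^>?]" + UntilQMs + ")*>";
var PI_RE    = new RegExp("^<\\?" + Name + "(?:" + PI_Tail + ")$");

PI_RE.test("<?php echo 1 ?>");   // true
PI_RE.test("<?target data??>");  // true: terminates with "??>"

// The erroneous expression from the early working drafts:
var Erroneous_PI_SE = /^<\?[^?]*(?:\?[^>]+)*\?>$/;
Erroneous_PI_SE.test("<?target data??>");  // false: misses the "??>" case
```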
PI_RE
suffices to identify all and only
syntactically correct processing instructions.
In the event of an incorrect or unterminated processing instruction, the shallow parser should return the opening "<?" delimiter followed by any legal portion of the target name that exists. The following shallow parsing expression is suitable for the task.
PI_SPE = '<?' PI_CE?
PI_CE = Name PI_Tail?
The properties of shallow parsing of processing instructions may be summarized as follows.
- If a string delimited by "<?" and ">" is returned, it represents either a processing instruction meeting all the required constraints of the XML specification or an XML declaration. In the latter case, the internal structure of the XML declaration has not been checked.
- If a solitary "<?" is returned, it indicates a processing instruction opening delimiter that is not immediately followed by the name of the processing instruction target.
- If a string consisting of the "<?" delimiter and a name is returned, it indicates either that an end delimiter for the processing instruction is missing, or that the required whitespace following the processing instruction target was not found.
CDATA sections open with the delimiter "<![CDATA[
"
and have arbitrary content terminated
with the first occurrence of the "]]>
"
delimiter. The key to the scanning process in this
case is to scan for sequences of two or more
right square brackets followed immediately by a right
angle bracket. The precise regular expression for CDATA
sections can thus be developed as follows.
CDATA_RE = '<![CDATA[' CDATA_CE
CDATA_CE = UntilRSBs ([^]>] UntilRSBs)* '>'
UntilRSBs = [^]]* ']' ([^]]+ ']')* ']'+
Although the UntilRSBs
expression is more
complex, it plays a similar role here
to that of UntilHyphen
in comment scanning
and UntilQMs
in processing-instruction scanning.
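Transcribed into JavaScript (anchored to complete sections), the expression behaves as follows.

```javascript
var UntilRSBs = "[^\\]]*\\](?:[^\\]]+\\])*\\]+";
var CDATA_RE  = new RegExp("^<!\\[CDATA\\[" + UntilRSBs +
                           "(?:[^\\]>]" + UntilRSBs + ")*>$");

CDATA_RE.test("<![CDATA[x < y && y > z]]>");    // true
CDATA_RE.test("<![CDATA[nested ]] inside]]>");  // true
CDATA_RE.test("<![CDATA[unterminated]]");       // false: no "]]>" delimiter
```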
Following the previous examples, adapting this expression for error signalling can be accommodated by making the tail part of the CDATA section optional.
CDATA_SPE = '<![CDATA[' CDATA_CE?
This gives a shallow parsing expression which may match and return two types of string only.
- A string delimited by "<![CDATA[" and "]]>" is a complete and correct CDATA section.
- A solitary "<![CDATA[" may be returned to indicate an unterminated CDATA section. In this case, there will be no subsequent occurrence of the "]]>" sequence in the input document.
The element tagging structures consist of element start tags, empty element
tags and element end tags.
Element start tags and empty element tags have a common syntactic
structure distinguished only by the occurrence of "/
"
before the closing ">
" delimiter.
AttValue ::= '"' ([^<&"] | Reference)* '"'
| "'" ([^<&'] | Reference)* "'"
Attribute ::= Name S? '=' S? AttValue
STag ::= '<' Name (S Attribute)* S? '>'
EmptyElemTag ::= '<' Name (S Attribute)* S? '/>'
There are several design alternatives for element tag regular expressions ranging from minimal to maximal enforcement of syntactic constraints. On the minimal side, after matching the opening delimiter and a name start character, a scanning expression could match arbitrary text up to the first unquoted right angle bracket. On the maximal side, a regular expression can be designed to enforce every constraint given in the grammar above. Each approach gives regular expressions that can be used to correctly extract element tagging from well-formed XML documents. The approaches differ in terms of fault-tolerance and support for subsequent processing. The scanning expression approach allows markup items to be identified even in the presence of internal syntactic errors, while the maximal enforcement approach ensures that markup items are in a form enabling extraction of internal components.
The pure scanning expression approach suffers from the problem that,
if a closing quote mark or a closing angle bracket is missing,
scanning can continue into subsequent text and markup.
This can often generate misleading results.
For example, the erroneous construction
<E1 Att1='error><e2 Att2='>'>
would be parsed as the element tag
<E1 Att1='error><e2 Att2='>
followed by "'>
" as text.
The element tag might then be
erroneously interpreted as a correct
E1
element start tag with
"error><e2 Att2=
" as the value of attribute Att1.
However, it is possible to avoid these problems by designing a scanning expression that terminates upon occurrence of an opening angle bracket of a new markup item. Note that such brackets are legal neither within quoted attribute value strings nor unquoted elsewhere in the body of the element tag.
This scanning expression is maximally tolerant of internal syntactic faults without overscanning into subsequent markup items. Such fault tolerance may be useful to applications intended to work with erroneous XML documents.
A further problem with the scanning approach is that one cannot count on returned markup items to be sufficiently well structured to support extraction of attribute value lists. This represents an impediment to element tag processing in the "normal case," that is, when element tags are correctly formed. In order to remove this impediment, the following expression may be used as the basis for shallow parsing.
ElemTagRE = '<' Name (S Name S? '=' S? AttValSE)* S? '/'? '>'
AttValSE = '"' [^<"]* '"' | "'" [^<']* "'"
This expression ensures that identified markup items are well structured for attribute value extraction. However, a scanning philosophy is still maintained for attribute value strings. This allows applications to robustly handle documents in which such strings contain ill-formed references.
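Using the completion-expression form given in the appendices, the behavior on the erroneous construction above can be illustrated in JavaScript: the faulty tag yields only the error prefix "<E1 ", while a correct tag is matched in full.

```javascript
var S          = "[ \\n\\t\\r]+";
var Name       = "(?:[A-Za-z_:]|[^\\x00-\\x7F])(?:[A-Za-z0-9_:.-]|[^\\x00-\\x7F])*";
var AttValSE   = "\"[^<\"]*\"|'[^<']*'";
var ElemTagCE  = Name + "(?:" + S + Name + "(?:" + S + ")?=(?:" + S + ")?(?:" +
                 AttValSE + "))*(?:" + S + ")?/?>?";
var ElemTagSPE = new RegExp("<(?:" + ElemTagCE + ")?");

"<E1 Att1='error><e2 Att2='>'>".match(ElemTagSPE)[0];  // "<E1 "
"<E1 Att1='ok'>".match(ElemTagSPE)[0];                 // "<E1 Att1='ok'>"
```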
Finally, to adapt ElemTagRE for error handling, it is useful to make the closing delimiter optional. This allows the return of an error string consisting of the opening delimiter and element name together with as many attribute-value pairs as exist in the correct form.
ElemTagSPE = '<' ElemTagCE?
ElemTagCE = Name (S Name S? '=' S? AttValSE)* S? '/'? '>'?
A minor point about this expression is that there is a slight nondeterminism
in whitespace processing. In a nonoptimizing NFA-based implementation
this may require some backtracking if whitespace exists immediately before the
final "/>
" or ">
" delimiter.
However, a DFA-based implementation can be achieved without state
expansion.
The regular expression for element end tags follows directly from the grammar in the XML specification.
EndTagRE = '</' Name S? '>'
The shallow parsing expression modifies this structure to match appropriate error strings.
EndTagSPE = '</' EndTagCE?
EndTagCE = Name S? '>'?
Either the opening end tag delimiter followed by the element name and possible whitespace, or the opening delimiter alone may be identified as error strings.
Document type declarations are complex syntactic objects
for which regular-expression based parsing would not normally
be attempted. However, from the point of view of scanning
and locating the appropriate closing ">
" delimiter, they
do in fact have a regular structure. By generally ignoring the internal
structure of document type declarations except as specifically
relevant to delimiter determination, a reasonably straightforward
scanning expression can be developed. This in turn permits
implementation by a DFA of a reasonably small number of states
and is also tolerant of minor internal syntactic errors.
As discussed in the example of element tagging, however, it is desirable to modify the pure scanning approach in three ways. First, it is desirable to avoid overscanning through a faulty construct into subsequent markup items. Thus, although other kinds of syntactic error might be tolerated so long as the delimiters can be correctly identified, any inappropriate occurrence of an opening markup delimiter (angle bracket) should immediately terminate the scan. Second, in order to facilitate subsequent processing, it is desirable to ensure that a string returned as a fully-delimited document type declaration indeed has sufficient internal structure to enable correct extraction of the individual components. Third, when internal syntactic errors do exist, it is desirable to return a partial document type declaration consisting of a correct prefix prior to the point of error.
Document type declarations have two basic parts: an identification part specifying the document type name and possible external identifiers, and an optional body consisting of a sequence of declarations and other items enclosed in square brackets. The top-level structure of a shallow parsing expression for document type declarations enforces this syntax.
DocTypeSPE = '<!DOCTYPE' DocTypeCE?
DocTypeCE = DT_IdentSE S? ('[' DT_ItemSE* ']' S?)? '>'?
Optional whitespace is included in places specified by the XML grammar. The closing angle bracket is optional for shallow parsing; if it is indeed missing, the almost complete document type declaration is returned as an error string.
The identification part of a document type declaration consists of a
document type name, possibly followed by an external identifier
consisting of a keyword
SYSTEM
or PUBLIC
and one or two quoted string
literals. Whitespace is required to separate these elements
from each other and the opening "<!DOCTYPE
" delimiter.
From the perspective of scanning for the opening "[
"
of the document type body or the closing ">
" of the overall
document type declaration, occurrences of these symbols within the
quoted strings must be ignored.
Beyond this, it is reasonable to enforce the existence of a document type name, since this is required of all correct document type declarations. The existence and structure of the external identifier part can be modelled as a list of zero or more whitespace-separated names (keywords) or quoted string literals.
DT_IdentSE = S Name (S (Name | QuoteSE))*
QuoteSE = '"' [^"]* '"' | "'" [^']* "'"
This ensures that the external identifier part can be extracted as a well-formed list of symbols while tolerating possible errors in the actual symbols used.
Within the square brackets of a document type declaration, four types of item may be found, with or without whitespace to separate.
DocTypeItem ::= MarkupDeclaration | PI | Comment | PEReference | S
MarkupDeclaration ::= '<!' DeclType DeclBody '>'
DeclType ::= 'ELEMENT' | 'ATTLIST' | 'ENTITY' | 'NOTATION'
PEReference ::= '%' Name ';'
Here DeclBody
denotes a common lexical structure
for element, attlist, entity and notation declarations, although
their detailed syntax varies considerably. For shallow parsing purposes,
the bodies of these declarations have the common property that they may contain
an internal occurrence of a closing ">
"
delimiter only within a quoted string.
This syntax is captured by the following
declaration completion expression.
MarkupDeclCE = ([^]"'><]+ | QuoteSE)* '>'
This expression scans through arbitrary content
searching for the closing ">
" delimiter.
Quoted strings are skipped and the scan may terminate
with failure if an erroneous "]
" or "<
"
delimiter is encountered.
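A small JavaScript sketch shows both behaviors: a ">" inside a quoted string is skipped, while a stray "<" terminates the scan.

```javascript
var QuoteSE      = "\"[^\"]*\"|'[^']*'";
var MarkupDeclCE = "(?:[^\\]\"'><]+|" + QuoteSE + ")*>";
// '<!' followed by a non-hyphen character, then the declaration body,
// as in the DT_ItemSE expression.
var MarkupDecl   = new RegExp("^<![^-]" + MarkupDeclCE + "$");

MarkupDecl.test("<!ENTITY gt '>'>");     // true: quoted '>' is skipped
MarkupDecl.test("<!ENTITY bad <oops>");  // false: '<' terminates the scan
```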
Combining MarkupDeclCE
with the previously developed
expressions for comments and processing instructions
and parameter-entity reference expression gives an overall
expression for scanning items in the body of document type declarations.
DT_ItemSE = '<' ('!' ('--' Until2Hyphens '>' | [^-] MarkupDeclCE) | '?' Name PI_Tail) | '%' Name ';' | S
This completes the definition of the scanning expression for document type declarations.
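Assembling the pieces, the complete document type scanning expression can be exercised in JavaScript (definitions transcribed from the appendices):

```javascript
var S             = "[ \\n\\t\\r]+";
var Name          = "(?:[A-Za-z_:]|[^\\x00-\\x7F])(?:[A-Za-z0-9_:.-]|[^\\x00-\\x7F])*";
var QuoteSE       = "\"[^\"]*\"|'[^']*'";
var Until2Hyphens = "[^-]*-(?:[^-][^-]*-)*-";
var UntilQMs      = "[^?]*\\?+";
var PI_Tail       = "\\?>|[\\n\\r\\t ]" + UntilQMs + "(?:[^>?]" + UntilQMs + ")*>";
var DT_IdentSE    = S + Name + "(?:" + S + "(?:" + Name + "|" + QuoteSE + "))*";
var MarkupDeclCE  = "(?:[^\\]\"'><]+|" + QuoteSE + ")*>";
var DT_ItemSE     = "<(?:!(?:--" + Until2Hyphens + ">|[^-]" + MarkupDeclCE +
                    ")|\\?" + Name + "(?:" + PI_Tail + "))|%" + Name + ";|" + S;
var DocTypeSPE    = new RegExp("^<!DOCTYPE(?:" + DT_IdentSE + "(?:" + S + ")?" +
                    "(?:\\[(?:" + DT_ItemSE + ")*](?:" + S + ")?)?>?)?$");

DocTypeSPE.test("<!DOCTYPE greeting SYSTEM \"hello.dtd\">");     // true
DocTypeSPE.test("<!DOCTYPE doc [ <!ELEMENT doc (#PCDATA)> ]>");  // true
```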
Shallow parsing performs the primary task of parsing an XML document into its individual markup and character data components. Subsequent to this, other processing tasks are generally required, depending on the specific XML application. Some of these secondary tasks can also use regular-expression based techniques for parsing of the individual markup and text items. In most cases, the regular expressions for these secondary parsing tasks can be developed by adaptation of those documented in the previous section. However, there is one additional task unaddressed by the previously developed expressions, namely that of reference extraction.
Reference extraction is the process of extracting character, entity and parameter-entity references from quoted strings or parsed character data. The quoted string literals may be either literal entity values from entity declarations, or literal attribute values from attribute list declarations, element start tags or empty element tags. Character and entity references may be found in any of these contexts, while parameter-entity references may occur in literal entity values only. Parsed character data is the text data found between markup terms during shallow parsing; only character and entity references occur in parsed character data.
The XML grammatical productions for each of these reference types are in fact in regular form as given in the XML specification.
CharRef ::= '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';'
EntityRef ::= '&' Name ';'
PEReference ::= '%' Name ';'
Direct use of regular expressions based on these forms can allow syntactically correct references to be extracted.
From the standpoint of fault tolerance, however, the reference
extraction expressions can be improved to account
for syntactically incorrect references as well.
A useful technique is to extract maximal length reference prefixes. For example, in the text item "A carriage return (&#13:)", the character reference is incorrect because it is terminated with a colon rather than a semicolon. In this case, the maximal length reference prefix "&#13" could be returned by the extraction process to indicate the error. Of course, whenever a reference is syntactically correct, the maximal length prefix is the full reference itself. Correct references can always be identified by the presence of the terminating semicolon in the maximal length prefix.
Using the longest match rule, maximal length prefixes
can be determined using
regular expressions that match all possible prefixes.
The simplest case is the all-prefix expression for parameter-entity references.
PERef_APE = '%' (Name ';'?)?
The shortest prefix is just "%
", so everything
after this is optional.
Longer prefixes must have one or more characters of a name.
However, the Name
construct may be used directly,
since every nonempty prefix of a name is also a valid name.
After the name characters, the prefix may be extended
with a semicolon to give a full parameter-entity reference.
An all-prefix expression for character and general entity references may be built using similar techniques.
HexPart = 'x' ([0-9a-fA-F]+ ';'?)?
NumPart = '#' ([0-9]+ ';'? | HexPart)?
CGRef_APE = '&' (Name ';'? | NumPart)?
After the opening "&", there may or may not be either a name part or a numeric part. A name part, if it exists, may have a following semicolon to complete the reference. A numeric part starts with the "#" character, followed by a possible decimal sequence or a hexadecimal part.
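A JavaScript transcription of CGRef_APE illustrates maximal-prefix extraction on the carriage-return example discussed earlier:

```javascript
var Name      = "(?:[A-Za-z_:]|[^\\x00-\\x7F])(?:[A-Za-z0-9_:.-]|[^\\x00-\\x7F])*";
var HexPart   = "x(?:[0-9a-fA-F]+;?)?";
var NumPart   = "#(?:[0-9]+;?|" + HexPart + ")?";
var CGRef_APE = new RegExp("&(?:" + Name + ";?|" + NumPart + ")?");

"&#13;".match(CGRef_APE)[0];                      // "&#13;": a full reference
"A carriage return (&#13:)".match(CGRef_APE)[0];  // "&#13": prefix marks the error
"&amp and more".match(CGRef_APE)[0];              // "&amp": missing semicolon
```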
From an efficiency perspective, the all-prefix expressions have
the attractive property that their DFA implementations require
no more states than do the corresponding implementations
of the simpler expressions for full references.
In essence, the DFAs for the full reference expressions are easily
converted to DFAs for the all-prefix form by making every
state except for the start state (before the "&
"
or "%
") into acceptance states.
Finally, the extraction expressions can be easily converted into expressions suitable for parsing and determination of replacement texts.
Text_PE = CGRef_APE | [^&]+
EntityValue_PE = CGRef_APE | PERef_APE | [^%&]+
Using these expressions input strings can be parsed
into lists of alternating pure text and reference prefix strings.
In JavaScript, for example, if textString
is a string object, and Text_PE_g
is a global
match version of the Text_PE
expression,
then textString.match(Text_PE_g)
parses textString
into such a list of
strings.
The replacement text for the entire string can then be
determined by replacing the references with their appropriate
values and reconcatenating the string list.
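The JavaScript example just described can be filled out as follows; the Text_PE_g construction shown here is an assumed transcription of the definitions above into string form with the global flag.

```javascript
var Name      = "(?:[A-Za-z_:]|[^\\x00-\\x7F])(?:[A-Za-z0-9_:.-]|[^\\x00-\\x7F])*";
var HexPart   = "x(?:[0-9a-fA-F]+;?)?";
var NumPart   = "#(?:[0-9]+;?|" + HexPart + ")?";
var CGRef_APE = "&(?:" + Name + ";?|" + NumPart + ")?";
var Text_PE_g = new RegExp(CGRef_APE + "|[^&]+", "g");

var textString = "AT&amp;T &#62; rest";
textString.match(Text_PE_g);  // ["AT", "&amp;", "T ", "&#62;", " rest"]
```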
Regular-expression based processing allows for very simple implementations of a variety of lightweight XML processing tools. These include lightweight XML-to-XML filters that may generate output texts only slightly modified from their inputs (with limited entity replacement, for example), tools for special purpose XML processing (constrained to a fixed document type or set of document types, for example), and fault-tolerant editing or construction tools for working with partial, ill-formed or invalid XML documents before they meet the requirements of conforming XML processors. Implementation of these kinds of tools is not well supported, and may even be hindered, by XML toolkits based on either validating or nonvalidating processors conforming to the XML specification.
The simplicity of the shallow parsing model based on regular expressions suggests some interesting possible directions for development of XML. First of all, a shallow parsing representation such as that produced by REX could be a useful reference representation for a revised XML specification. Such a reference representation would have the advantage of providing a language-independent approach to shallow parsing encoded in the standard, with a language-independent implementation framework based on regular expressions. Furthermore, it may be possible to relax certain XML restrictions that can be easily accommodated by regular-expression processing, such as the restriction that attribute values must always be quoted. However, possibilities such as these must be carefully weighed by the overall XML development community.
# REX/Perl 1.0
# Robert D. Cameron "REX: XML Shallow Parsing with Regular Expressions",
# Technical Report TR 1998-17, School of Computing Science, Simon Fraser
# University, November, 1998.
# Copyright (c) 1998, Robert D. Cameron.
# The following code may be freely used and distributed provided that
# this copyright and citation notice remains intact and that modifications
# or additions are clearly identified.

$TextSE = "[^<]+";
$UntilHyphen = "[^-]*-";
$Until2Hyphens = "$UntilHyphen(?:[^-]$UntilHyphen)*-";
$CommentCE = "$Until2Hyphens>?";
$UntilRSBs = "[^\\]]*](?:[^\\]]+])*]+";
$CDATA_CE = "$UntilRSBs(?:[^\\]>]$UntilRSBs)*>";
$S = "[ \\n\\t\\r]+";
$NameStrt = "[A-Za-z_:]|[^\\x00-\\x7F]";
$NameChar = "[A-Za-z0-9_:.-]|[^\\x00-\\x7F]";
$Name = "(?:$NameStrt)(?:$NameChar)*";
$QuoteSE = "\"[^\"]*\"|'[^']*'";
$DT_IdentSE = "$S$Name(?:$S(?:$Name|$QuoteSE))*";
$MarkupDeclCE = "(?:[^\\]\"'><]+|$QuoteSE)*>";
$S1 = "[\\n\\r\\t ]";
$UntilQMs = "[^?]*\\?+";
$PI_Tail = "\\?>|$S1$UntilQMs(?:[^>?]$UntilQMs)*>";
$DT_ItemSE = "<(?:!(?:--$Until2Hyphens>|[^-]$MarkupDeclCE)|\\?$Name(?:$PI_Tail))|%$Name;|$S";
$DocTypeCE = "$DT_IdentSE(?:$S)?(?:\\[(?:$DT_ItemSE)*](?:$S)?)?>?";
$DeclCE = "--(?:$CommentCE)?|\\[CDATA\\[(?:$CDATA_CE)?|DOCTYPE(?:$DocTypeCE)?";
$PI_CE = "$Name(?:$PI_Tail)?";
$EndTagCE = "$Name(?:$S)?>?";
$AttValSE = "\"[^<\"]*\"|'[^<']*'";
$ElemTagCE = "$Name(?:$S$Name(?:$S)?=(?:$S)?(?:$AttValSE))*(?:$S)?/?>?";
$MarkupSPE = "<(?:!(?:$DeclCE)?|\\?(?:$PI_CE)?|/(?:$EndTagCE)?|(?:$ElemTagCE)?)";
$XML_SPE = "$TextSE|$MarkupSPE";

sub ShallowParse {
  my($XML_document) = @_;
  return $XML_document =~ /$XML_SPE/g;
}
// REX/Javascript 1.0
// Robert D. Cameron "REX: XML Shallow Parsing with Regular Expressions",
// Technical Report TR 1998-17, School of Computing Science, Simon Fraser
// University, November, 1998.
// Copyright (c) 1998, Robert D. Cameron.
// The following code may be freely used and distributed provided that
// this copyright and citation notice remains intact and that modifications
// or additions are clearly identified.

TextSE = "[^<]+";
UntilHyphen = "[^-]*-";
Until2Hyphens = UntilHyphen + "([^-]" + UntilHyphen + ")*-";
CommentCE = Until2Hyphens + ">?";
UntilRSBs = "[^]]*]([^]]+])*]+";
CDATA_CE = UntilRSBs + "([^]>]" + UntilRSBs + ")*>";
S = "[ \\n\\t\\r]+";
NameStrt = "[A-Za-z_:]|[^\\x00-\\x7F]";
NameChar = "[A-Za-z0-9_:.-]|[^\\x00-\\x7F]";
Name = "(" + NameStrt + ")(" + NameChar + ")*";
QuoteSE = '"[^"]' + "*" + '"' + "|'[^']*'";
DT_IdentSE = S + Name + "(" + S + "(" + Name + "|" + QuoteSE + "))*";
MarkupDeclCE = "([^]\"'><]+|" + QuoteSE + ")*>";
S1 = "[\\n\\r\\t ]";
UntilQMs = "[^?]*\\?+";
PI_Tail = "\\?>|" + S1 + UntilQMs + "([^>?]" + UntilQMs + ")*>";
DT_ItemSE = "<(!(--" + Until2Hyphens + ">|[^-]" + MarkupDeclCE + ")|\\?" + Name + "(" + PI_Tail + "))|%" + Name + ";|" + S;
DocTypeCE = DT_IdentSE + "(" + S + ")?(\\[(" + DT_ItemSE + ")*](" + S + ")?)?>?";
DeclCE = "--(" + CommentCE + ")?|\\[CDATA\\[(" + CDATA_CE + ")?|DOCTYPE(" + DocTypeCE + ")?";
PI_CE = Name + "(" + PI_Tail + ")?";
EndTagCE = Name + "(" + S + ")?>?";
AttValSE = '"[^<"]' + "*" + '"' + "|'[^<']*'";
ElemTagCE = Name + "(" + S + Name + "(" + S + ")?=(" + S + ")?(" + AttValSE + "))*(" + S + ")?/?>?";
MarkupSPE = "<(!(" + DeclCE + ")?|\\?(" + PI_CE + ")?|/(" + EndTagCE + ")?|(" + ElemTagCE + ")?)";
XML_SPE = TextSE + "|" + MarkupSPE;

function ShallowParse(XMLdoc) {
  return XMLdoc.match(new RegExp(XML_SPE, "g"));
}
/* REX/Lex 1.0
Robert D. Cameron "REX: XML Shallow Parsing with Regular Expressions",
Technical Report TR 1998-17, School of Computing Science, Simon Fraser
University, November, 1998.
Copyright (c) 1998, Robert D. Cameron.
The following code may be freely used and distributed provided that
this copyright and citation notice remains intact and that modifications
or additions are clearly identified.
*/
TextSE [^<]+
UntilHyphen [^-]*-
Until2Hyphens {UntilHyphen}([^-]{UntilHyphen})*-
CommentCE {Until2Hyphens}>?
UntilRSBs [^\]]*]([^\]]+])*]+
CDATA_CE {UntilRSBs}([^\]>]{UntilRSBs})*>
S [ \n\t\r]+
NameStrt [A-Za-z_:]|[^\x00-\x7F]
NameChar [A-Za-z0-9_:.-]|[^\x00-\x7F]
Name ({NameStrt})({NameChar})*
QuoteSE \"[^"]*\"|'[^']*'
DT_IdentSE {S}{Name}({S}({Name}|{QuoteSE}))*
MarkupDeclCE ([^\]"'><]+|{QuoteSE})*>
S1 [\n\r\t ]
UntilQMs [^?]*\?+
PI_Tail \?>|{S1}{UntilQMs}([^>?]{UntilQMs})*>
DT_ItemSE \<(!(--{Until2Hyphens}>|[^-]{MarkupDeclCE})|\?{Name}({PI_Tail}))|%{Name};|{S}
DocTypeCE {DT_IdentSE}({S})?(\[({DT_ItemSE})*]({S})?)?>?
DeclCE --({CommentCE})?|\[CDATA\[({CDATA_CE})?|DOCTYPE({DocTypeCE})?
PI_CE {Name}({PI_Tail})?
EndTagCE {Name}({S})?>?
AttValSE \"[^<"]*\"|'[^<']*'
ElemTagCE {Name}({S}{Name}({S})?=({S})?({AttValSE}))*({S})?\/?>?
MarkupSPE \<(!({DeclCE})?|\?({PI_CE})?|\/({EndTagCE})?|({ElemTagCE})?)
XML_SPE {TextSE}|{MarkupSPE}
This appendix contains an interactive regular expression demo using the regular expression support facilities of Javascript 1.2.