Regular Expressions
CMPT 479/880 Notes
Robert D. Cameron
January, 2001
Regular Expressions
Regular expressions are a grammatical notation often used for
describing the structure of programming language tokens or symbols.
Other practical applications:
- Powerful find-and-replace operations in editors such
emacs
and ex
.
-
Powerful string processing features of scripting languages
like PHP, JavaScript, Perl and Python.
- URL Rewriting Rules in Apache web server.
-
Automatic generation of lexical analyzers using tools
such as Flex.
Classical Regular Expressions
A simplified version of regular expressions is the following
definition of regular expressions over an alphabet S.
- Base Rule 1: Epsilon is a regular expression that stands for the
the empty string.
- Base Rule 2: Any character of S is a regular expression that stands
itself.
- Inductive Rule 1: If A and B are regular expressions,
then A B is a regular expression that
stands for the strings which are formed by concatenating
a string represented by A and a string represented by B.
- Inductive Rule 2: If A and B are regular expressions,
then A|B is a regular expression that
stands for all the strings which are
represented by either A or B.
- Inductive Rule 3: If A is a regular expression,
then A* is a regular expression that
stands for all the strings which consist of the concatentation
of zero or more strings represented by A.
- Inductive Rule 4: If A is a regular expression,
then (A) is an equivalent regular expression.
Examples:
- integer-numeral = digit digit*
- identifier = letter (letter | digit)*
- real-numeral = digit digit* . digit digit* E (epsilon | + | -) digit digit*
A common extension is:
- Inductive Rule 5: If A is a regular expression,
then A+ is a regular expression that
stands for all the strings which consist of the concatentation
of one or more strings represented by A.
This shortens things:
- integer-numeral = digit+
- identifier = letter (letter | digit)*
- real-numeral = digit+ . digit+ E (epsilon | + | -) digit+
Regular Expressions Define Sets Of Strings
Formally, regular expressions define sets of strings.
- Base Rule 1: Epsilon stands for the set whose only member
is the empty string.
- Base Rule 2: Any character of S is a regular expression that stands for
the set whose only member is the string consisting of that
single character.
- Inductive Rule 1: A B
stands for the set of all possible concatenations
of one string from the set A and one string from the
set B.
- Inductive Rule 2: A|B
is the union of sets A and B.
- Inductive Rule 3: A* is equivalent to
the "infinite" regular expression (epsilon | A | A A | A A A
| A A A A | ...).
That is, it is the set of all possible concatenations of
zero or more strings from set A.
- Inductive Rule 4: (A) denotes the same set as A.
Matching and Backtracking
Matching is the process of determining whether a string is in
the set defined by a regular expression.
Matching may require backtracking. Consider the regular
expression (0|1)*. This stands for the set of all strings
of binary digits. If we match this pattern
against 0101B, there are 4 possible substrings it can match,
namely 0, 01, 010, and 0101. Often we try to match the
longest possible substring, in this case 0101. But
now consider the regular expression (0|1)*1B. This stands
for the set of all binary numerals ending in 1 followed by a
capital B. Matching this pattern against the string 0101B
might first result in the match of (0|1)* with 0101. But now
the remainder of the pattern 1B will not match the remainder
of the string which is simply B. To remedy this, we have
to backtrack, undo the decision to match (0|1)* against
0101 and try another alternative. If we try the next longest
match, i.e., 010, then the remainder of the string is 1B and
this matches the remainder of the pattern precisely.
Longest Match Rule for Language Tokens
In programming languages, regular expressions are often
used to describe how a stream of characters can be
grouped as a stream of significant language tokens.
As noted in Chapter 4 of Louden, the longest match rule
(match the longest possible substring) is
almost always used in this case. For example, consider
an expression language whose tokens are described
by the following regular expressions.
- integer-numeral = digit+
- identifier = letter (letter | digit)*
- operator = + | - | * | / | **
Using the longest match rule, a string input such as
a4**73 is broken up into three tokens:
the identifier "a4", the operator "**" and the integer-numeral
"73". If we did not use the longest match rule,
then there are many other ways of dividing up this
input expression into a series of tokens; for example,
we could use a "shortest match" interpretation giving
six tokens: the identifier "a", the numeral "4",
the operators "*", another operator "*", and numerals "7"
and "3".
Unix/Perl Regular Expressions
Regular expressions turn out to be quite useful
for programming various string processing operations.
Unix editors like "vi" and "emacs" and many other
Unix tools use regular expressions. Regular
expressions are also a programming feature of the
many scripting languages.
Here we summarize the main features of Unix/Perl
regular expressions; Perl regular expression syntax
has been adopted by many other languages including Python and PHP.
- Base Rule 1: Any printable ASCII character
that is not a metacharacter (".", "^", "$", "/", etc.) is
regular expression which matches itself.
- Base Rule 2: "." is a regular expression which
matches any single character. (So "..." matches a field
of length 3).
- Base Rule 3: "^" is a regular expression which
matches the beginning of a line. (So "^From:" only
matches the string "From:" appearing at the beginning of a line.
- Base Rule 4: "$" is a regular expression which
matches the end of a line.
- Base Rule 5: The character "\" followed by
any metacharacter is a regular expression that stands
for that metacharacter. (Thus "\." matches ".",
"\^" matches "^ and "\\" matches "\".)
- Base Rule 6: [] are set bracket metacharacters.
[123] matches any one of the characters 1, 2 or 3.
It stands for the set of the three single character strings
"1", "2" and "3".
Inside set brackets, "-" between two characters is used to
specify a contiguous range of ASCII characters. Thus
[A-Z] matches any upper case letter, and
[A-Za-z0-9] matches any alphanumeric character.
If the first character in the set brackets is "^",
then the set is interpreted as all the characters not
listed. Thus [^0-9] matches any character that is not a digit.
- Inductive Rule 1: If A and B are
regular expressions, then AB matches
concatenations of A and B.
- Inductive Rule 2: If A and B are regular expressions,
then A|B is a regular expression that
matches stands matched by either A or B.
- Inductive Rule 3: If A is a regular expression,
then A* is a regular expression that
matches zero or more occurrences of A.
- Inductive Rule 4: If A is a regular expression,
then A+ is a regular expression that
matches one or more occurrences of A.
- Inductive Rule 5: If A is a regular expression,
then (A) is an equivalent regular expression (Perl).
The function of the parentheses is for grouping and
substring selection. For example, the Perl statement
($hrs, $min, $sec) =
$timeStr =~ m/([0-9]+):([0-5][0-9]):([0-5][0-9])/;
could select three subfields from $timeStr and assign them to
appropriate variables. Here the Perl "m//" syntax is used to specify
the regular expression for matching. Substring selection may
also be used for matching and replacement with the "s///" syntax.
For example:
$timeStr =~
s/([0-9]+):([0-5][0-9]):([0-5][0-9])/\2 min., \3 sec. after \1/;
This is an operation that might change $timeStr from "4:25:18" to
"25 min., 18 sec. after 4".
In this example, the notation \1 stands for the first matched
substring, \2 the second matched substring and so on.
In Perl, the matched substrings are also assigned to variables
$1, $2 and so on. These variables can be used after matching
or even in the replacement string. Thus an equivalent statement
to that above is:
$timeStr =~
s/([0-9]+):([0-5][0-9]):([0-5][0-9])/$2 min., $3 sec. after $1/;
In other Unix tools, grouping uses "\(" and "\)" instead of
"(" and ")". In those tools "(" and ")" simply stand for themselves.
For example, in the "ex" or "vi" editors, a match and replace
command might be:
s/\([0-9]+\):\([0-5][0-9]\):\([0-5][0-9]\)/\2 min., \3 sec. after \1/;
- Precedence rules:
The repetition operators ("*", "+", and "{m,n}") have highest
precedence.
If A and B are regular expressions, then
AB*
is interpreted A(B*) not
(AB)*.
The concatenation operation takes precedence
over alternation.
If A, B and C
are regular expressions, then A|BC
is interpreted as A|(BC) not
(A|B)C.
These descriptions of Unix/Perl regular expression
features are all written in terms of the
"pattern matching" interpretation of regular
expressions. However, it is still the case that
Unix/Perl regular expressions can be understood as
specifying sets of strings and the matching operation
is just checking to see whether a particular string is
in the set specified by the regular expression.
Regular Expressions Versus BNF
Regular expressions and BNF are two grammatical formalisms
for describing sets of strings. Regular expressions are
a concise and convenient notation for describing
syntax when "nesting" is not an issue. BNF is a more
powerful notation that allows for the description of
nested language constructs using nonterminal symbols
in arbitrary recursive combinations.
Thus regular expressions are appropriate for token-level
syntax of programming languages, while BNF is required
for the higher-level recursive syntax of expressions,
statements and so on.
In formal language theory, the class of languages
definable by regular expressions is called the class of
regular languages. The languages
definable by BNF grammars are called context-free languages.
The class of regular languages is a subset of the class of
context-free languages.