Regular Expressions

CMPT 479/880 Notes
Robert D. Cameron
January, 2001

Regular Expressions

Regular expressions are a grammatical notation often used for describing the structure of programming language tokens or symbols. Other practical applications:

Powerful find-and-replace operations in editors such emacs and ex.
Powerful string processing features of scripting languages like PHP, JavaScript, Perl and Python.
URL Rewriting Rules in Apache web server.
Automatic generation of lexical analyzers using tools such as Flex.

Classical Regular Expressions

A simplified version of regular expressions is the following definition of regular expressions over an alphabet S.

Base Rule 1: Epsilon is a regular expression that stands for the the empty string.
Base Rule 2: Any character of S is a regular expression that stands itself.
Inductive Rule 1: If A and B are regular expressions, then A B is a regular expression that stands for the strings which are formed by concatenating a string represented by A and a string represented by B.
Inductive Rule 2: If A and B are regular expressions, then A|B is a regular expression that stands for all the strings which are represented by either A or B.
Inductive Rule 3: If A is a regular expression, then A* is a regular expression that stands for all the strings which consist of the concatentation of zero or more strings represented by A.
Inductive Rule 4: If A is a regular expression, then (A) is an equivalent regular expression.

Examples:

integer-numeral = digit digit*
identifier = letter (letter | digit)*
real-numeral = digit digit* . digit digit* E (epsilon | + | -) digit digit*

A common extension is:

Inductive Rule 5: If A is a regular expression, then A+ is a regular expression that stands for all the strings which consist of the concatentation of one or more strings represented by A.

This shortens things:

integer-numeral = digit+
identifier = letter (letter | digit)*
real-numeral = digit+ . digit+ E (epsilon | + | -) digit+

Regular Expressions Define Sets Of Strings

Formally, regular expressions define sets of strings.

Base Rule 1: Epsilon stands for the set whose only member is the empty string.
Base Rule 2: Any character of S is a regular expression that stands for the set whose only member is the string consisting of that single character.
Inductive Rule 1: A B stands for the set of all possible concatenations of one string from the set A and one string from the set B.
Inductive Rule 2: A|B is the union of sets A and B.
Inductive Rule 3: A* is equivalent to the "infinite" regular expression (epsilon | A | A A | A A A | A A A A | ...). That is, it is the set of all possible concatenations of zero or more strings from set A.
Inductive Rule 4: (A) denotes the same set as A.

Matching and Backtracking

Matching is the process of determining whether a string is in the set defined by a regular expression. Matching may require backtracking. Consider the regular expression (0|1)*. This stands for the set of all strings of binary digits. If we match this pattern against 0101B, there are 4 possible substrings it can match, namely 0, 01, 010, and 0101. Often we try to match the longest possible substring, in this case 0101. But now consider the regular expression (0|1)*1B. This stands for the set of all binary numerals ending in 1 followed by a capital B. Matching this pattern against the string 0101B might first result in the match of (0|1)* with 0101. But now the remainder of the pattern 1B will not match the remainder of the string which is simply B. To remedy this, we have to backtrack, undo the decision to match (0|1)* against 0101 and try another alternative. If we try the next longest match, i.e., 010, then the remainder of the string is 1B and this matches the remainder of the pattern precisely.

Longest Match Rule for Language Tokens

In programming languages, regular expressions are often used to describe how a stream of characters can be grouped as a stream of significant language tokens. As noted in Chapter 4 of Louden, the longest match rule (match the longest possible substring) is almost always used in this case. For example, consider an expression language whose tokens are described by the following regular expressions.

integer-numeral = digit+
identifier = letter (letter | digit)*
operator = + | - | * | / | **

Using the longest match rule, a string input such as a4**73 is broken up into three tokens: the identifier "a4", the operator "**" and the integer-numeral "73". If we did not use the longest match rule, then there are many other ways of dividing up this input expression into a series of tokens; for example, we could use a "shortest match" interpretation giving six tokens: the identifier "a", the numeral "4", the operators "*", another operator "*", and numerals "7" and "3".

Unix/Perl Regular Expressions

Regular expressions turn out to be quite useful for programming various string processing operations. Unix editors like "vi" and "emacs" and many other Unix tools use regular expressions. Regular expressions are also a programming feature of the many scripting languages.

Here we summarize the main features of Unix/Perl regular expressions; Perl regular expression syntax has been adopted by many other languages including Python and PHP.

Base Rule 1: Any printable ASCII character that is not a metacharacter (".", "^", "$", "/", etc.) is regular expression which matches itself.
Base Rule 2: "." is a regular expression which matches any single character. (So "..." matches a field of length 3).
Base Rule 3: "^" is a regular expression which matches the beginning of a line. (So "^From:" only matches the string "From:" appearing at the beginning of a line.
Base Rule 4: "$" is a regular expression which matches the end of a line.
Base Rule 5: The character "\" followed by any metacharacter is a regular expression that stands for that metacharacter. (Thus "\." matches ".", "\^" matches "^ and "\\" matches "\".)
Base Rule 6: [] are set bracket metacharacters. [123] matches any one of the characters 1, 2 or 3. It stands for the set of the three single character strings "1", "2" and "3". Inside set brackets, "-" between two characters is used to specify a contiguous range of ASCII characters. Thus [A-Z] matches any upper case letter, and [A-Za-z0-9] matches any alphanumeric character. If the first character in the set brackets is "^", then the set is interpreted as all the characters not listed. Thus [^0-9] matches any character that is not a digit.
Inductive Rule 1: If A and B are regular expressions, then AB matches concatenations of A and B.
Inductive Rule 2: If A and B are regular expressions, then A|B is a regular expression that matches stands matched by either A or B.
Inductive Rule 3: If A is a regular expression, then A* is a regular expression that matches zero or more occurrences of A.
Inductive Rule 4: If A is a regular expression, then A+ is a regular expression that matches one or more occurrences of A.
Inductive Rule 5: If A is a regular expression, then (A) is an equivalent regular expression (Perl). The function of the parentheses is for grouping and substring selection. For example, the Perl statement
```
  ($hrs, $min, $sec) = 
      $timeStr =~ m/([0-9]+):([0-5][0-9]):([0-5][0-9])/;
```
could select three subfields from $timeStr and assign them to appropriate variables. Here the Perl "m//" syntax is used to specify the regular expression for matching. Substring selection may also be used for matching and replacement with the "s///" syntax. For example:
```
  $timeStr =~ 
     s/([0-9]+):([0-5][0-9]):([0-5][0-9])/\2 min., \3 sec. after \1/;
```
This is an operation that might change $timeStr from "4:25:18" to "25 min., 18 sec. after 4". In this example, the notation \1 stands for the first matched substring, \2 the second matched substring and so on. In Perl, the matched substrings are also assigned to variables $1, $2 and so on. These variables can be used after matching or even in the replacement string. Thus an equivalent statement to that above is:
```
  $timeStr =~ 
     s/([0-9]+):([0-5][0-9]):([0-5][0-9])/$2 min., $3 sec. after $1/;
```
In other Unix tools, grouping uses "$" and "$" instead of "(" and ")". In those tools "(" and ")" simply stand for themselves. For example, in the "ex" or "vi" editors, a match and replace command might be:
```
s/$[0-9]+$:$[0-5][0-9]$:$[0-5][0-9]$/\2 min., \3 sec. after \1/;
```
Precedence rules: The repetition operators ("*", "+", and "{m,n}") have highest precedence. If A and B are regular expressions, then AB* is interpreted A(B*) not (AB)*. The concatenation operation takes precedence over alternation. If A, B and C are regular expressions, then A|BC is interpreted as A|(BC) not (A|B)C.

These descriptions of Unix/Perl regular expression features are all written in terms of the "pattern matching" interpretation of regular expressions. However, it is still the case that Unix/Perl regular expressions can be understood as specifying sets of strings and the matching operation is just checking to see whether a particular string is in the set specified by the regular expression.

Regular Expressions Versus BNF

Regular expressions and BNF are two grammatical formalisms for describing sets of strings. Regular expressions are a concise and convenient notation for describing syntax when "nesting" is not an issue. BNF is a more powerful notation that allows for the description of nested language constructs using nonterminal symbols in arbitrary recursive combinations. Thus regular expressions are appropriate for token-level syntax of programming languages, while BNF is required for the higher-level recursive syntax of expressions, statements and so on.

In formal language theory, the class of languages definable by regular expressions is called the class of regular languages. The languages definable by BNF grammars are called context-free languages. The class of regular languages is a subset of the class of context-free languages.

Regular Expressions

CMPT 479/880 Notes Robert D. Cameron January, 2001

Regular Expressions

Classical Regular Expressions

Regular Expressions Define Sets Of Strings

Matching and Backtracking

Longest Match Rule for Language Tokens

Unix/Perl Regular Expressions

Regular Expressions Versus BNF

CMPT 479/880 Notes
Robert D. Cameron
January, 2001