Regular Expressions

CMPT 479/880 Notes
Robert D. Cameron
January, 2001

Regular Expressions

Regular expressions are a grammatical notation often used for describing the structure of programming language tokens or symbols. Other practical applications:

Classical Regular Expressions

A simplified version of regular expressions is the following definition of regular expressions over an alphabet S.

Examples:

A common extension is:

This shortens things:

Regular Expressions Define Sets Of Strings

Formally, regular expressions define sets of strings.

Matching and Backtracking

Matching is the process of determining whether a string is in the set defined by a regular expression. Matching may require backtracking. Consider the regular expression (0|1)*. This stands for the set of all strings of binary digits. If we match this pattern against 0101B, there are 4 possible substrings it can match, namely 0, 01, 010, and 0101. Often we try to match the longest possible substring, in this case 0101. But now consider the regular expression (0|1)*1B. This stands for the set of all binary numerals ending in 1 followed by a capital B. Matching this pattern against the string 0101B might first result in the match of (0|1)* with 0101. But now the remainder of the pattern 1B will not match the remainder of the string which is simply B. To remedy this, we have to backtrack, undo the decision to match (0|1)* against 0101 and try another alternative. If we try the next longest match, i.e., 010, then the remainder of the string is 1B and this matches the remainder of the pattern precisely.

Longest Match Rule for Language Tokens

In programming languages, regular expressions are often used to describe how a stream of characters can be grouped as a stream of significant language tokens. As noted in Chapter 4 of Louden, the longest match rule (match the longest possible substring) is almost always used in this case. For example, consider an expression language whose tokens are described by the following regular expressions.

Using the longest match rule, a string input such as a4**73 is broken up into three tokens: the identifier "a4", the operator "**" and the integer-numeral "73". If we did not use the longest match rule, then there are many other ways of dividing up this input expression into a series of tokens; for example, we could use a "shortest match" interpretation giving six tokens: the identifier "a", the numeral "4", the operators "*", another operator "*", and numerals "7" and "3".

Unix/Perl Regular Expressions

Regular expressions turn out to be quite useful for programming various string processing operations. Unix editors like "vi" and "emacs" and many other Unix tools use regular expressions. Regular expressions are also a programming feature of the many scripting languages.

Here we summarize the main features of Unix/Perl regular expressions; Perl regular expression syntax has been adopted by many other languages including Python and PHP.

These descriptions of Unix/Perl regular expression features are all written in terms of the "pattern matching" interpretation of regular expressions. However, it is still the case that Unix/Perl regular expressions can be understood as specifying sets of strings and the matching operation is just checking to see whether a particular string is in the set specified by the regular expression.

Regular Expressions Versus BNF

Regular expressions and BNF are two grammatical formalisms for describing sets of strings. Regular expressions are a concise and convenient notation for describing syntax when "nesting" is not an issue. BNF is a more powerful notation that allows for the description of nested language constructs using nonterminal symbols in arbitrary recursive combinations. Thus regular expressions are appropriate for token-level syntax of programming languages, while BNF is required for the higher-level recursive syntax of expressions, statements and so on.

In formal language theory, the class of languages definable by regular expressions is called the class of regular languages. The languages definable by BNF grammars are called context-free languages. The class of regular languages is a subset of the class of context-free languages.