Derivations, Ambiguity and Semantics

Robert D. Cameron
January 7, 2002

Derivations and Derivation Trees

BNF grammars are finite descriptions of infinite languages, i.e., languages with infinitely many possible strings. These grammars are often said to be generative grammars because they generate all strings of the language. The way in which a string of the language is generated by the grammar is called its derivation and can be illustrated by its derivation tree (also called parse tree or syntax tree).

A derivation tree is formed in an upside-down fashion, starting with a single nonterminal symbol at the top. One of the alternatives in the production rule for this nonterminal is chosen and placed below the initial nonterminal symbol; branches are then drawn connecting the initial nonterminal symbol to each terminal and nonterminal symbol in the alternative chosen. Each nonterminal symbol in the alternative is then expanded in a similar fashion. The process completes when only terminal symbols remain at the tips of all the branches.

Consider, for example, the statement list S1; S2; S3. Using the right-recursive production for statement-lists, the derivation tree is as follows.

<statement-list> ::= <statement> | <statement> ; <statement-list>
 <statement-list> 
/|\ 
<statement>; <statement-list> 
|/ |\
S1<statement>; <statement-list>
 | |
 S2 <statement>
 |
 S3

Here S1, S2 and S3 stand for statements which would normally be further expanded according to the productions describing their structure.

Alternatively, if the left-recursive rule for statement-lists were used, the derivation tree would be as follows.

<statement-list> ::= <statement> | <statement-list> ; <statement>
 <statement-list> 
 /|\
 <statement-list>; <statement>
/ |\|
<statement-list>; <statement>S3
| | 
<statement> S2 
| 
S1 

Grammatical and Semantic Ambiguity

A grammar is said to be ambiguous if there is more than one possible derivation for a string of the language. In itself, grammatical ambiguity may be harmless as the grammar can still be perfectly accurate in describing the language syntax. Because semantics is assigned to language constructs based on their derivation from the grammar, however, grammatical ambiguity may lead to an ambiguity in semantics, i.e., language constructs for which two conflicting meanings are defined. Ambiguity is a serious problem in programming language definition.

Unfortunately, there is no way of detecting ambiguity in general. There are, however, certain well-known types of ambiguity which can be watched for. An important case that often arises in the syntax of expressions is that of productions which are both left- and right-recursive; such productions are always ambiguous. Consider, for example, the following grammar.

<expression> ::= <variable> | ( <expression> ) | <expression> + <expression> |
  <expression> * <expression>
<variable> ::= x | y | z
This grammar is ambiguous with respect to expressions such as x * y + z, for which the following two derivation trees are possible.
 <expression> 
/|\ 
<expression>* <expression> 
|/ |\
<variable><expression>+ <expression>
|| |
x<variable> <variable>
 | |
 y z
 <expression> 
 /|\
 <expression>+ <expression>
/ |\|
<expression>* <expression><variable>
| ||
<variable> <variable>z
| | 
x y 

In the first of these derivations, the variables y and z are grouped together in a subtree with the operator +; the implied semantics is equivalent to x * (y + z). The second of these derivations has a different interpretation, however, with the subtree for x * y implying the interpretation (x * y) + z.

In general, ambiguities can be eliminated by adding new productions and re-arranging the grammar. The ambiguity, above, for example can be eliminated using the following equivalent grammar.

<expression> ::= <element> | <expression> + <element> |
  <expression> * <element>
<element> ::= <variable> | ( <expression> )
<variable> ::= x | y | z
Using this grammar, the expression x * y + z has the following unique derivation.
 <expression> 
 /|\
 <expression>+ <element>
/ |\|
<expression>* <element><variable>
| ||
<element> <variable>z
| | 
<variable> y 
| 
x 

Even when ambiguity is removed from the grammar, it may not provide an appropriate basis for language semantics. Consider, for example, the expression x + y * z as described according to our revised grammar. The derivation for this expression groups x and y together for addition before z is used for multiplication, as follows.

 <expression> 
 /|\
 <expression>* <element>
/ |\|
<expression>+ <element><variable>
| ||
<element> <variable>z
| | 
<variable> y 
| 
x 

The implied semantics, then, is equivalent to (x + y) * z when the standard interpretation desired might be x + (y * z).

Re-arrangement of the grammar is again the basis for solving such problems. In this case, the grammar must be arranged to indicated the higher precedence of multiplication over addition. This may be achieved as follows.

<expression> ::= <term> | <expression> + <term>
<term> ::= <element> | <term> * <element>
<element> ::= <variable> | ( <expression> )
<variable> ::= x | y | z

Using this grammar, the derivation tree for x + y * z implies the desired result.

 <expression> 
/|\ 
<expression>+ <term> 
|/ |\
<term><term>* <element>
|| |
<element><element> <variable>
|| |
<variable><variable> z
||  
xy 

In general, then, language grammars must be designed carefully not only to describe the syntax of the language, but also to ensure that the structure of derivation trees allows proper semantic interpretations to be defined. Reading a grammar also requires care in order to understand its semantic implications.