Given a context-free grammar, it is possible to define a a set of predicates that encode the logic of the grammar.
Consider the following simple grammar of constant expressions.
<expr> | ::= | <num> | <num> + <expr> | <num> - <expr> |
As a first step in considering how the logic of such a grammar may be encoded in Prolog, let us assume that input to be parsed will be represented as lists of numerals and symbols. For example:
[11, +, 2, -, 7]
is a valid <expr>, and
[11, +, 2, -]
is not a valid <expr>.
Parsing may be implemented in a straightforward manner by creating one predicate for each non-terminal in the grammar. These predicates will take as an argument a list of items representing a possible instance of the non-terminal, having the value true if a given phrase is such an instance, false otherwise. Each predicates is programmed using one clause for each alternative form of the corresponding non-terminal.
expr(L) :- num(L). expr(L) :- append(L1, [+|L2], L), num(L1), expr(L2). expr(L) :- append(L1, [-|L2], L), num(L1), expr(L2). num([D]) :- number(D).
The append
predicate can generate all possible
partitions of the input, parsing succeeds if a particular
partition corresponds to the grammatical structure of a phrase.
Given these definitions, we can easily check to see if particular lists parse as expressions.
| ?- expr([11, +, 2, -, 7]). yes | ?- expr([11, +, 2, -]). no | ?- expr([11, +, 2]). yes
Unfortunately, the general strategy of parsing by partitioning is
quite inefficient. In the given examples, the partitions
are easily checked at the top level by virtue of the infix operator
+
or -
. In other cases, there
is no easy way to check if a top-level partition is the right one to try.
This may mean that a lot of work is done before a particular partitioning
is found to be incorrect; much of this work must be repeated for
the next attempt at partitioning (and the next, and so on, if they also
are incorrect).
A much more efficient approach to parsing is to have predicates that consume as many elements of the input list as they need to match a particular nonterminal, and then return the rest of the list for subsequent processing. In this case, parsing predicates have two arguments, the first being the list representation of the input stream, and the second being instantiated to the list of input elements that remain after a complete syntactic structure has been found.
For example, let expr(L, Remain)
be true if
a prefix of L
matches the production <expr> and
Remain
is the list of input remaining after this
initial prefix. Using this convention, the parser for our grammar
may be recast as follows.
expr(L, Remain) :- num(L, Remain). expr(L, Remain) :- num(L, L1), 'C'(L1, +, L2), expr(L2, Remain). expr(L, Remain) :- num(L, L1), 'C'(L1, -, L2), expr(L2, Remain). num([D|Remain], Remain) :- number(D). 'C'([X|L], X, L).Here, the Prolog's DCG (definite clause grammar) predicate
'C'(L1, X, L2)
is introduced
to say that L1
begins with terminal symbol
X after which L2 remains to be parsed.
The following examples illustrate that parsing an expression as the initial prefix of an input list may indeed have several correct answers.
| ?- expr([1, +, 2, -, 3], A). A = [+,2,-,3] ? ; A = [-,3] ? ; A = [] ? ; no
To ensure that all of an input list is parsed, we can define a top-level predicate that requires the list of remaining elements after parsing to be empty.
expr-complete(L) :- expr(L, []). | ?- expr-complete([11, +, 2, -, 7]). yes | ?- expr-complete([11, +, 2, -]). no
Given a context-free grammar, it is a relatively mechanical job
to construct the parsing predicates shown above.
Because this is such a frequent task in Prolog-based symbolic
computing applications, a special syntax exists to let Prolog
take care of the mechanical work!
In short, the -->
predicate
introduces a definite clause that allows BNF-like
specification of syntax.
expr --> num. expr --> num, [+], expr. expr --> num, [-], expr. num --> [D], {number(D)}. expr-complete(L) :- expr(L, []).On the right hand side of any definite clause, non-terminals may be simply given as symbolic atoms, while terminal symbols must be embedded in lists. In addition, arbitrary prolog predicates may be enclosed in braces.
Whenever Prolog reads in a definite clause grammar, it in fact performs the transformation to generate predicates equivalent to those shown previously!
To go beyond mere yes or no answers to parsing operations, the DCG syntax allows non-terminals to be parameterized. For example, to parse a numeric expression and compute its value, we can add a parameter to represent the computed value to each non-terminal.
expr(Z) --> num(Z). expr(Z) --> num(X), [+], expr(Y), {Z is X+Y}. expr(Z) --> num(X), [-], expr(Y), {Z is X-Y}. num(D) --> [D], {number(D)}. expr_value(L, V) :- expr(V, L, []).
In transforming these DCG rules to the corresponding parsing
predicates, the predicates expr
and num
each get an extra parameter.
This implements the semantics of numeric expression
evaluation as illustrated below.
| ?- expr_value([11, +, 2, -, 7], V). V = 6 ? | ?- expr_value([8, -, 6, -, 2], V). V = 4 ?
The astute reader will note that the final example shows an unusual feature of the defined expression grammar: right associativity. Left associativity requires left-recursive grammar rules, which in turn require slightly more complicated handling with DCGs.