Introduction to EBNF for SyntaxΒΆ
How can we describe the syntax and semantics of a programming language?
Syntax: the form of the statements and expressions in a language, e.g.:
x = 5; // okay in C++ x = 5 // syntax error in C++ x := 5; // syntax error in C++ = x 5; // syntax error in C++
The syntax of a language is important because it tells us which expressions are valid, and which are invalid. The syntax rules for a language need to be precise so that compilers and other tools can do their job.
Semantics: the meaning of the statement and expressions in a language, e.g.:
x = 5; // in C++, this means put a copy of the value 5 into the memory location // x refers to assign(x, 5); // theoretically, this statement could do exactly the same thing // (it doesn't; I just made this up)
Consider the follow two statements:
var x int // a local Go variable int x; // a local C++ variable
These both declare a variable
x
. But the syntax is obviously different. Also, the semantics are different: the Gox
is guaranteed to be initialized to 0, but the C++x
could be any value.The problem of formally describing syntax has is well-studied in CS and math, and some more or less standard techniques have arisen.
For the purposes of defining syntax, we think of a language as being a set of strings, or sentences.
Consider this statement
var x = 5
It consists of 4 tokens, each lexeme being a kind of token (a category of lexemes):
var
is a VAR tokenx
is an IDENTIFIER token=
is an EQUALS token5
is an INTEGER token
The syntax rules of a language typically define the allowable sentences using rules about the order of tokens. The set of such rules is called a grammar.
One popular grammar formalism for specifying language syntax is BNF, aka Backus-Naur Form. Go, for example, use an extended for of BNF (EBNF) to specify its syntax.
Another visually interesting way to expression grammars is to use syntax diagrams.
Here are the basic definitions the EBNF the Go specification uses:
Anything in double-quotes, like “-48” or “{”, is a token in the language.
Non-quoted words refer to grammar rule names that can be expanded.
{}-brackets mean “repeat 0 or more times”, e.g. { ”!” } matches the strings “”, ”!”, ”!!”, ”!!!”, ...
[]-brackets mean “0 or 1 times”, e.g. [“-“] “56” matches the strings “-56” and “56”
()-brackets are used for grouping, e.g. “-” “0” | int could mean either (“-” “0”) | int or, more likely, “-” (“0” | int)
- means “alternative, e.g. “5” | “6” matches the strings “5” and “6”
... indicates a sequence of elements, e.g. “0” ... “3” means “0” “1” “2” “3”
Here’s how Go defines letters and digits:
unicode_letter = /* a Unicode code point classified as "Letter" */ letter = unicode_letter | "_" decimal_digit = "0" ... "9" octal_digit = "0" ... "7" hex_digit = "0" ... "9" | "A" ... "F" | "a" ... "f"
- The bar
|
is alternation, i.e. indicating alternatives. It is usually read as “or”.
- The bar
Note that these definitions are precise enough to let us write recognizer functions, e.g.:
// Go func is_octal_digit(s string) bool { return s == "0" || s == "1" || s == "2" || s == "3" || s == "4" || s == "5" || s == "6" || s == "7" }
Indeed, it is not hard to image writing a program whose input would be one of these simple sorts of grammar rules, and whose output would be a Go recognizer function for it.
Here is the Go definition of an identifier (i.e. the name of a program entity, such as a variable, function, or type):
unicode_digit = /* a Unicode code point classified as "Decimal Digit" */ . identifier = letter { letter | unicode_digit }
In EBNF, the curly braces indicate that something is repeated 0 or more times.
Here is the definition of a decimal literal in Go:
decimal_lit = ( "1" | ... | "9" ) { decimal_digit }
In EBNF, the round brackets indicate a group of items to make the scope of the
|
operator clear. So( "1" ... "9" )
means one of the digits"1"
to"9"
.Here is the EBNF definition for a Go expression:
Expression = UnaryExpr | Expression binary_op UnaryExpr UnaryExpr = PrimaryExpr | unary_op UnaryExpr binary_op = "||" | "&&" | rel_op | add_op | mul_op rel_op = "==" | "!=" | "<" | "<=" | ">" | ">=" add_op = "+" | "-" | "|" | "^" mul_op = "*" | "/" | "%" | "<<" | ">>" | "&" | "&^" unary_op = "+" | "-" | "!" | "^" | "*" | "&" | "<-"
Notice that the right-hand side of a grammar can contain a reference to itself. EBNF rules are thus recursive, and this is what allows for expressions of arbitrary size.
PrimaryExpr is not listed here since it is quite long. It’s worth a look.
The following two rules show how to define lists of other items:
IdentifierList = identifier { "," identifier } ExpressionList = Expression { "," Expression }
Curly braces mean “0 or more”.
For example, the first rule,
IdentifierList
, describes strings like this:a a, b a, a, a, cat3, Up, Down
We usually add spaces between tokens for readability. Like many languages, Go ignores whitespace between tokens.
Here is the EBNF definition of the Go if-statement:
IfStmt = "if" [ SimpleStmt ";" ] Expression Block [ "else" ( IfStmt | Block ) ]
Square brackets,
[
and]
, indicate that something is optional, i.e. can occur 0 or 1 times. The terms in double-quotes indicate strings that should appear as-is, while the other terms refer to other grammar rules.You can see that Go allows for an optional
SimpleStmt
at the beginning of an if-statement. This is often used for declaring variables that are used in the if-statement condition and body.It is important to understand that these EBNF rules are precise enough to write programs with. This is useful if you to write a compiler, or some other sort of tool that needs to understand Go syntax, like an automatic formatter or code-colorer in an editor.