Introduction to EBNF for SyntaxΒΆ

  • How can we describe the syntax and semantics of a programming language?

  • Syntax: the form of the statements and expressions in a language, e.g.:

    x = 5;   // okay in C++
    
    x = 5    // syntax error in C++
    
    x := 5;  // syntax error in C++
    
    = x 5;   // syntax error in C++
    

    The syntax of a language is important because it tells us which expressions are valid, and which are invalid. The syntax rules for a language need to be precise so that compilers and other tools can do their job.

  • Semantics: the meaning of the statement and expressions in a language, e.g.:

    x = 5; // in C++, this means put a copy of the value 5 into the memory location
           // x refers to
    
    assign(x, 5);  // theoretically, this statement could do exactly the same thing
                   // (it doesn't; I just made this up)
    
  • Consider the follow two statements:

    var x int // a local Go variable
    
    int x;    // a local C++ variable
    

    These both declare a variable x. But the syntax is obviously different. Also, the semantics are different: the Go x is guaranteed to be initialized to 0, but the C++ x could be any value.

  • The problem of formally describing syntax has is well-studied in CS and math, and some more or less standard techniques have arisen.

  • For the purposes of defining syntax, we think of a language as being a set of strings, or sentences.

  • Consider this statement

    var x = 5

    It consists of 4 tokens, each lexeme being a kind of token (a category of lexemes):

    • var is a VAR token
    • x is an IDENTIFIER token
    • = is an EQUALS token
    • 5 is an INTEGER token
  • The syntax rules of a language typically define the allowable sentences using rules about the order of tokens. The set of such rules is called a grammar.

  • One popular grammar formalism for specifying language syntax is BNF, aka Backus-Naur Form. Go, for example, use an extended for of BNF (EBNF) to specify its syntax.

  • Another visually interesting way to expression grammars is to use syntax diagrams.

  • Here are the basic definitions the EBNF the Go specification uses:

    • Anything in double-quotes, like “-48” or “{”, is a token in the language.

    • Non-quoted words refer to grammar rule names that can be expanded.

    • {}-brackets mean “repeat 0 or more times”, e.g. { ”!” } matches the strings “”, ”!”, ”!!”, ”!!!”, ...

    • []-brackets mean “0 or 1 times”, e.g. [“-“] “56” matches the strings “-56” and “56”

    • ()-brackets are used for grouping, e.g. “-” “0” | int could mean either (“-” “0”) | int or, more likely, “-” (“0” | int)

    • means “alternative, e.g. “5” | “6” matches the strings “5” and “6”
    • ... indicates a sequence of elements, e.g. “0” ... “3” means “0” “1” “2” “3”

  • Here’s how Go defines letters and digits:

    unicode_letter = /* a Unicode code point classified as "Letter" */
    
    letter        = unicode_letter | "_"
    decimal_digit = "0" ... "9"
    octal_digit   = "0" ... "7"
    hex_digit     = "0" ... "9" | "A" ... "F" | "a" ... "f"
    
    The bar | is alternation, i.e. indicating alternatives. It is usually read as

    “or”.

  • Note that these definitions are precise enough to let us write recognizer functions, e.g.:

    // Go
    func is_octal_digit(s string) bool {
      return s == "0" || s == "1" || s == "2" ||
             s == "3" || s == "4" || s == "5" ||
             s == "6" || s == "7"
    }
    

    Indeed, it is not hard to image writing a program whose input would be one of these simple sorts of grammar rules, and whose output would be a Go recognizer function for it.

  • Here is the Go definition of an identifier (i.e. the name of a program entity, such as a variable, function, or type):

    unicode_digit  = /* a Unicode code point classified as "Decimal Digit" */ .
    
    identifier = letter { letter | unicode_digit }
    

    In EBNF, the curly braces indicate that something is repeated 0 or more times.

  • Here is the definition of a decimal literal in Go:

    decimal_lit = ( "1" | ... | "9" ) { decimal_digit }
    

    In EBNF, the round brackets indicate a group of items to make the scope of the | operator clear. So ( "1" ... "9" ) means one of the digits "1" to "9".

  • Here is the EBNF definition for a Go expression:

    Expression = UnaryExpr | Expression binary_op UnaryExpr
    UnaryExpr  = PrimaryExpr | unary_op UnaryExpr
    
    binary_op  = "||" | "&&" | rel_op | add_op | mul_op
    rel_op     = "==" | "!=" | "<" | "<=" | ">" | ">="
    add_op     = "+" | "-" | "|" | "^"
    mul_op     = "*" | "/" | "%" | "<<" | ">>" | "&" | "&^"
    
    unary_op   = "+" | "-" | "!" | "^" | "*" | "&" | "<-"
    

    Notice that the right-hand side of a grammar can contain a reference to itself. EBNF rules are thus recursive, and this is what allows for expressions of arbitrary size.

    PrimaryExpr is not listed here since it is quite long. It’s worth a look.

  • The following two rules show how to define lists of other items:

    IdentifierList = identifier { "," identifier }
    ExpressionList = Expression { "," Expression }
    

    Curly braces mean “0 or more”.

    For example, the first rule, IdentifierList, describes strings like this:

    a
    a, b
    a, a, a, cat3, Up, Down
    

    We usually add spaces between tokens for readability. Like many languages, Go ignores whitespace between tokens.

  • Here is the EBNF definition of the Go if-statement:

    IfStmt = "if" [ SimpleStmt ";" ] Expression Block [ "else" ( IfStmt | Block ) ]
    

    Square brackets, [ and ], indicate that something is optional, i.e. can occur 0 or 1 times. The terms in double-quotes indicate strings that should appear as-is, while the other terms refer to other grammar rules.

    You can see that Go allows for an optional SimpleStmt at the beginning of an if-statement. This is often used for declaring variables that are used in the if-statement condition and body.

  • It is important to understand that these EBNF rules are precise enough to write programs with. This is useful if you to write a compiler, or some other sort of tool that needs to understand Go syntax, like an automatic formatter or code-colorer in an editor.