Introduction to EBNF for Syntax =============================== - How can we describe the syntax and semantics of a programming language? - **Syntax**: the *form* of the statements and expressions in a language, e.g.:: x = 5; // okay in C++ x = 5 // syntax error in C++ x := 5; // syntax error in C++ = x 5; // syntax error in C++ The syntax of a language is important because it tells us which expressions are valid, and which are invalid. The syntax rules for a language need to be precise so that compilers and other tools can do their job. - **Semantics**: the meaning of the statement and expressions in a language, e.g.:: x = 5; // in C++, this means put a copy of the value 5 into the memory location // x refers to assign(x, 5); // theoretically, this statement could do exactly the same thing // (it doesn't; I just made this up) - Consider the follow two statements:: var x int // a local Go variable int x; // a local C++ variable These both declare a variable ``x``. But the syntax is obviously different. Also, the semantics are different: the Go_ ``x`` is guaranteed to be initialized to 0, but the C++ ``x`` could be any value. - The problem of formally describing syntax has is well-studied in CS and math, and some more or less standard techniques have arisen. - For the purposes of defining syntax, we think of a language as being a set of strings, or **sentences**. - Consider this statement var x = 5 It consists of 4 **tokens**, each lexeme being a kind of token (a category of lexemes): - ``var`` is a VAR token - ``x`` is an IDENTIFIER token - ``=`` is an EQUALS token - ``5`` is an INTEGER token - The syntax rules of a language typically define the allowable sentences using rules about the order of tokens. The set of such rules is called a **grammar**. - One popular grammar formalism for specifying language syntax is BNF, aka Backus-Naur Form. Go_, for example, use an extended for of BNF (EBNF) to specify its syntax. - Another visually interesting way to expression grammars is to use `syntax diagrams `_. - Here are the basic definitions the `EBNF the Go specification `_ uses: - Anything in double-quotes, like "-48" or "{", is a token in the language. - Non-quoted words refer to grammar rule names that can be expanded. - {}-brackets mean "repeat 0 or more times", e.g. { "!" } matches the strings "", "!", "!!", "!!!", ... - []-brackets mean "0 or 1 times", e.g. ["-"] "56" matches the strings "-56" and "56" - ()-brackets are used for grouping, e.g. "-" "0" | int could mean either ("-" "0") | int or, more likely, "-" ("0" | int) - | means "alternative, e.g. "5" | "6" matches the strings "5" and "6" - ... indicates a sequence of elements, e.g. "0" ... "3" means "0" "1" "2" "3" - Here's how Go_ defines letters and digits:: unicode_letter = /* a Unicode code point classified as "Letter" */ letter = unicode_letter | "_" decimal_digit = "0" ... "9" octal_digit = "0" ... "7" hex_digit = "0" ... "9" | "A" ... "F" | "a" ... "f" The bar ``|`` is alternation, i.e. indicating alternatives. It is usually read as "or". - Note that these definitions are precise enough to let us write recognizer functions, e.g.:: // Go func is_octal_digit(s string) bool { return s == "0" || s == "1" || s == "2" || s == "3" || s == "4" || s == "5" || s == "6" || s == "7" } Indeed, it is not hard to image writing a program whose input would be one of these simple sorts of grammar rules, and whose output would be a Go_ recognizer function for it. - Here is the Go_ definition of an *identifier* (i.e. the name of a program entity, such as a variable, function, or type):: unicode_digit = /* a Unicode code point classified as "Decimal Digit" */ . identifier = letter { letter | unicode_digit } In EBNF, the curly braces indicate that something is repeated **0 or more** times. - Here is the definition of a decimal literal in Go_:: decimal_lit = ( "1" | ... | "9" ) { decimal_digit } In EBNF, the round brackets indicate a group of items to make the scope of the ``|`` operator clear. So ``( "1" ... "9" )`` means one of the digits ``"1"`` to ``"9"``. - Here is the EBNF definition for a Go_ expression:: Expression = UnaryExpr | Expression binary_op UnaryExpr UnaryExpr = PrimaryExpr | unary_op UnaryExpr binary_op = "||" | "&&" | rel_op | add_op | mul_op rel_op = "==" | "!=" | "<" | "<=" | ">" | ">=" add_op = "+" | "-" | "|" | "^" mul_op = "*" | "/" | "%" | "<<" | ">>" | "&" | "&^" unary_op = "+" | "-" | "!" | "^" | "*" | "&" | "<-" Notice that the right-hand side of a grammar can contain a reference to itself. EBNF rules are thus recursive, and this is what allows for expressions of arbitrary size. `PrimaryExpr `_ is not listed here since it is quite long. It's worth a look. - The following two rules show how to define lists of other items:: IdentifierList = identifier { "," identifier } ExpressionList = Expression { "," Expression } Curly braces mean "0 or more". For example, the first rule, ``IdentifierList``, describes strings like this:: a a, b a, a, a, cat3, Up, Down We usually add spaces between tokens for readability. Like many languages, Go_ ignores whitespace between tokens. - Here is the EBNF definition of the Go_ if-statement:: IfStmt = "if" [ SimpleStmt ";" ] Expression Block [ "else" ( IfStmt | Block ) ] Square brackets, ``[`` and ``]``, indicate that something is *optional*, i.e. can occur 0 or 1 times. The terms in double-quotes indicate strings that should appear as-is, while the other terms refer to other grammar rules. You can see that Go_ allows for an optional ``SimpleStmt`` at the beginning of an if-statement. This is often used for declaring variables that are used in the if-statement condition and body. - It is important to understand that these EBNF rules are precise enough to write programs with. This is useful if you to write a compiler, or some other sort of tool that needs to understand Go_ syntax, like an automatic formatter or code-colorer in an editor.