More About EBNF
===============

- Lets use Go-style EBNF to formally specify a few small examples. For
  instance, here is a definition of a bit string::

    bit        = "0" | "1"      // a bit is a 0 or a 1
    bit-string = bit { bit }    // a bit string is a bit followed by 0 or more bits

- SFU uses a `code system for semesters
  <http://www.cs.sfu.ca/Forms/Grads/TAapp/sem_explain.html>`_ that, while
  clever, is not is not very user friendly. For example, the code 1047 means
  the Fall 2004 semester.
  
  Here is an EBNF definition of the SFU semester codes::

    digit         = "0" | ... | "9"
    century       = digit               // 0=1900, 1=2000, 2=2100, ...
    decade        = digit               // 0=00, 1=10, 2=20, ...
    year          = digit      
    semester      = "1" | "4" | "7"
    semester-code = century decade year semester

  From this it is not hard to write corresponding recognizer functions::

      package main

      import "fmt"

      func is_digit(s string) bool {
        return s == "0" || s == "1" || s == "2" ||
        s == "3" || s == "4" || s == "5" ||
        s == "6" || s == "7" || s == "8" ||
        s == "9"
      }

      func is_century(s string) bool { return is_digit(s) }
      func is_decade(s string) bool { return is_digit(s) }
      func is_year(s string) bool { return is_digit(s) }
      func is_semester(s string) bool { return s == "1"  || s == "4" || s == "7" }

      func is_semester_code(s string) bool {
        return len(s) == 4 &&
        is_century(s[0:1]) &&    // s[a:a+1] is slice notation that
        is_decade(s[1:2]) &&     // returns character s[a] as a string
        is_year(s[2:3]) &&
        is_semester(s[3:4])
      }

      func main() {
        fmt.Println(is_semester_code("1047"))
        fmt.Println(is_semester_code("1141"))
        fmt.Println(is_semester_code("1142"))
      }

  In practice, string matching of this sort is often done using **regular
  expressions**. For example, the (Python) regular expression
  ``r"\d\d\d[147]"`` matches just valid semester codes.

- Here is the simple grammar of example 3.2 (p. 121) written in our EBNF 
  notation::

    assign = id "=" expr
    id     = "A" | "B" | "C"
    expr   =  id "+" expr
            | id "*" expr
            | "(" expr ")"
            | id

  For instance, consider this expression::

    A = B * (A + C)

  We can show how it is derived from the grammar step by step::

    assign = id "=" expr
           = "A" "=" expr
           = "A" "=" id "*" expr
           = "A" "=" "B" "*" expr
           = "A" "=" "B" "*" "(" expr ")"
           = "A" "=" "B" "*" "(" id "+" expr ")"
           = "A" "=" "B" "*" "(" "A" "+" expr ")"
           = "A" "=" "B" "*" "(" "A" "+" id ")"
           = "A" "=" "B" "*" "(" "A" "+" "C" ")"

- Grammars naturally describe parse trees. The rules of a grammar form a
  hierarchy that we can view as a tree. Parse trees are typically generated by
  compilers when the evaluate a program, and they contain a lot of useful
  information.

  See figure 3.1 (p. 122) of Sebesta for an example parse tree of ``A = B * (A
  + C)``.

- Some grammars are *ambiguous*, i.e. they generate more than one parse tree
  for a given expression. Here's the ambiguous grammar of example 3.3 (p. 122)
  in EBNF::

    assign = id "=" expr
    id     = "A" | "B" | "C"
    expr   =  expr "+" expr    // the above grammar rule was id "+" expr
            | expr "*" expr    // the above grammar rule was id "*" expr
            | "(" expr ")"
            | id

  Figure 3.2 (p. 123) shows two different parse trees for the expression  ``A
  = B + C * A`` corresponding to ``A = B + (C * A)`` and ``A = (B + C) * A``.
  Clearly, this bad: the first parse tree is the one we want  because the
  rules of arithmetic say that multiplication is done before addition.

- Unambiguous grammars can be written for arithmetic expressions always apply
  operators in the correct order. It requires a bit more work. Here is example
  3.4 (p. 125) in our EBNF notation::

    assign = id "=" expr
    id = "A" | "B" | "C"
    expr =  expr "+" term
          | term
    term =  term "*" factor
          | factor
    factor =  "(" expr ")"
            | id

  This is clearly not as simple as the previous grammars, but it is
  unambiguous and evaluates operators in the correct order.

- Another problem that grammars can suffer from is *left recursion*. This
  occurs in a grammar rule when the first thing on the right side of the rule
  is the same as the name of the rule, e.g.::

    expr = expr "+" term   // left recursive

  Some grammar processing algorithms don't work with left-recursive rules, and
  so they would need to be re-written in a way the avoids left recursion.

- Our examples of parse trees use arithmetic expressions for simplicity. But
  keep in mind that an entire program can be thought of as one large parse
  tree.