Lecture 5

Chapter 4: Lexical and Syntax Analysis

  • Chapter 4 of the text discusses the problem of tokenization (i.e. lexical scanning) and parsing.

  • It shows an example of a lexical scanner written in C for a simple expression language. Project 1 asks you to writer a lexical scanner for JSON using Go, so this might be a useful example to refer to.

    However, keep in mind the text’s example is written in C, and so uses C-style coding idioms that often don’t make sense in Go. But the basic idea is the same: process a character at a time, and use if-statements or switch to decide what to do.

  • We won’t cover parsing algorithm in this course; look at them if you are interested.

Chapter 5: Names, Bindings, and Scopes

Identifiers (names)

Names and identifiers are essentially the same thing. Names name variables, functions, modules, etc.

Issues: What characters are allowed in a name? Does case matter?

e.g. in Go, identifiers are defined as follows:

identifier = letter { letter | unicode_digit }

Case matters, and so x and X are different. Identifiers can’t start with a number, so x2 is an identifier but 2x is not.

Underscores are used to help make compound names, like file_name or current_temp. Camel case is also useful, e.g. fileName or currentTemp.

e.g. In languages such as PHP, Perl, and Ruby, special symbols (sometimes called sigils) are used with variables. For example, in Ruby:

  • x refers to a local variable
  • @x refers to a variable in an object (member of an object)
  • @@x refers to a static class variable
  • x? is a convention for functions that return a boolean (i.e. true or false)
  • x! is a convention for functions that modify the object they’re called on

e.g. Go identifiers that start with an uppercase letter, such as Trim or Open are called exported identifiers, meaning that they can be accessed from other packages.

e.g. Most languages reserve certain identifiers for special use. For example, the following identifiers are reserved in Go as keywords:

break        default      func         interface    select
case         defer        go           map          struct
chan         else         goto         package      switch
const        fallthrough  if           range        type
continue     for          import       return       var

You cannot create your own identifiers with the same name.

Go also has a number of pre-declared identifiers that you can’t use:

Types:
        bool byte complex64 complex128 error float32 float64
        int int8 int16 int32 int64 rune string
        uint uint8 uint16 uint32 uint64 uintptr

Constants:
        true false iota

Zero value:
        nil

Functions:
        append cap close complex copy delete imag len
        make new panic print println real recover

Constants, such as true and false, are constant so as to prevent re- defining them like this:

fmt.Println(true)  // okay: prints "true"
true = false       // compiler error: cannot assign to true
fmt.Println(true)

However, you can do this:

fmt.Println(true)  // prints "true"
true := false
fmt.Println(true)  // prints "false"

e.g. Python is another language whose predeclared identifiers can be changed, e.g.:

>>> print True
True
>>> 2 == 2
True
>>> True = "haha"
>>> print True
haha
>>> True = False
>>> True == False
True

Of course, there rarely, if ever, is a reason to do something like this, but the fact that Python allows this sort of change opens up a whole class of errors that don’t exist in other languages.

Too many reserved words can frustrate a programmer by taking away names they might like to use in their programs.

Variables

Variables are a surprisingly intricate topic with many details. They are how most high-level languages let us access memory.

Variables typically have the following associated information:

  • name
  • address in memory
  • value
  • type
  • lifetime
  • scope

The address of a variable is where it is located in memory. In Go and C/C++, you can use the & operator to access the address of a variable, e.g.:

s := "apple"     // Go
fmt.Println(&s)  // prints 0x10500168 (for me)

There’s usually no guarantee about the exact address of a variable in memory, which can be good and bad. It’s good because it allows the compiler and operating system to choose where in memory the program runs. It can be bad if for some reason you need to access a specific memory location for, say, hardware reasons.

The address of a variable is sometimes called its l-value, because it is what is on the left-hand side of an assignment statement, e.g.:

x = 5   // put value 5 into the address referred to by x

Values 5 or "apple" don’t have l-values, and so can’t be used on the left side of an assignment.

Constant variables also lack l-values in Go:

const pie string = "apple"    // Go

func main() {
        fmt.Println(pie)   // okay, prints: "apple"
        fmt.Println(&pie)  // compiler error: can't use & with constants
}

It may be that pie is, in fact, stored at some particular memory location. But it might also be that the Go compiler treated it in some special way, e.g. replaced instances of pie in the program with the string "apple". If so, then there is no sensible address for it. For example:

const one int = 1    // Go
const two int = 2

func main() {
        three := one + two   // think about this line
        fmt.Println(three)
}

A smart compiler might be able to realize that the expression one + two can be calculated at compile-time instead of run-time, and so effectively change that line to:

three := 3    // Go

Plus, since neither one nor two are used elsewhere in the program, perhaps the compiler doesn’t even bother storing them anywhere in the running program.

Aliasing is when the same address is referred to by more than one name. For example, consider this C++ code:

void incr(int& a) {    // C++
        ++a;
}

void main() {
        int x = 5;
        incr(x);
}

When incr is called in main, a and x are aliases for the same memory address. You can be even more direct like this:

int x = 5;     // C++
int& a = x;
++x;
++a;
cout << x << endl;  // prints 7

Unrestricted aliasing like in this code seems to be a generally bad idea since in more complicated examples it might be hard to know that two different names refer to the same address.

Values are the contents of memory cells, and they are also sometimes known as r-values because they can appear on the right-hand side of an assignment statement. Many variables have both an l-value (an address) and an r-value (the value it stores), and before you access its r-value you must determine its l-value (which might not always be simple).

Binding

An important idea in programming is the notion of binding. In general, binding means associating two things, such as a variable with a value or a function name with a function.

Binding can occur at different times in the programming process:

  • Go values like true, false, and nil were bound at design-time by the language designers.
  • A value like int in C can be bound at implementation time, i.e. the exact number of bits in a C int can be set through the compiler.
  • A constant value could be bound to a value at compile-time.
  • Variables are often bound at run-time, e.g. when you read a string from a file and store that string in a variable, the variable is being bound to the string at run-time.
  • A function in a library, such as printf in C, is bound at link time. A linker is, essentially, a program that “links” function calls to their code. Typically, the code is in a library.

A static binding is a binding that occurs before run-time and is unchanged during run-time. If a binding can change during run-time, it is called a dynamic binding.

Variables can be bound to types either explicitly or implicitly. For example:

double x = 3.14;  // C: explicit

auto x = 3.14;    // C++: implicit (type inference)

x := 3.14         // Go: implicit (type inference)

In all three of these examples, x is bound to a floating-point type at compile-time. That is, the compiler determines the type of x.

Another example of implicit type binding is FORTRAN, which, by default, implicitly declares variables beginning with I, J, K, L, M, or N to be Integer types, and all other variables to be Real types. This kind of implicit typing has not been very popular. One of the problems is that if you accidentally forget to declare a variable’s type, FORTRAN may automatically assign it the wrong type. Modern FORTRAN lets you use an Implicit none statement to turn off this behaviour.

Dynamic Type Binding

Some languages, such as Python, Ruby, JavaScript, and PHP, the type of a variable is not specified by a type declaration (or some other naming convention). So in Python you can write code like this:

x = 5
x = "house"

The variable x is not restricted to contain only variables of a particular type. This makes Python (and languages like it) very flexible.

This examples shows another use of dynamic type binding:

>>> def twice(x): return x + x
>>> twice(5)
10
>>> twice("house")
'househouse'

The twice function works with any value that is defined for +. There’s no need to specify the type of x ahead of time.

Ruby is an interesting language for many reasons. All Ruby values are objects, and so Ruby variables don’t have different types: Ruby variables are all of the same object type. Thus any value can be assigned to any Ruby variable.

There are at least two problems with dynamically typed languages. One problem is that they can suffer from the errors that compilers might easily catch in statically typed languages. In a dynamically typed language, such errors might crash the program at run-time, which hurts their robustness. The second major problem is performance: dynamic type binding requires type-checking to be done at run-time (instead of compile-time), and this extra checking can slow down programs significantly.

Variable Lifetime

The lifetime of a variable is the time the variable is bound to a memory cell to the time it is unbound. We distinguish three lifetime categories:

  • Static. Static variables are bound to memory cells before program execution. This can make them quite efficient, but also less flexible since you can’t typically add new static variables at run-time (and so, for instance, recursion is impossible in a language with only static variables).

  • Stack-dynamic. Stack-dynamic variables are variables instantiated at run-time, usually on the program’s stack. For example, in languages like C++, Java, and C#, local variables in a function or method are, by default, stack-dynamic. They are instantiated when their declaration statement is reached, and de-allocated when the function ends.

    For example:

    int abs(int a, int b) {   // Java
          int diff = a - b;
          if (diff < 0)
                  return -diff;
          else
                  return diff;
    }
    

    a, b, and diff are pushed onto the program stack each time abs is called. They are popped when it ends.

  • Heap-dynamic. Heap-dynamic variables are variables whose memory is explicitly allocated and de-allocated at run-time. For example, in C++:

        int* ip = new int;    // new int allocates a new int memory cell on the heap
                              // ip holds the address of this cell
    
    *ip = 4;
    cout << *ip << endl;  // prints 4
    
    delete ip;            // de-allocate the memory cell ip points to
    

    In many languages, the deallocation of heap values is done automatically by a garbage collector. The reason for this is that it turns out to be surprisingly challenging to call delete at just the right time in complex programs. Call delete too soon and you’ve destroyed a value your program needs; call delete too late, and you’ve wasted memory (e.g. caused a memory leak).

Scope

A variable’s scope is the range of statements over which it is visible. A local variable is a variable whose scope is restricted to a block of code. A nonlocal variable is visible outside of the block in which it was declared. Global variables are a kind of nonlocal variable.

Many modern languages are statically (or, lexically) scoped. This means that the scope of the variable can be determined before the program runs just by looking at the source code. This helps programmers to read source code and figure out a things type without having to run it.

Here’s an example of static scoping from JavaScript:

function big() {
        function sub1() {
                var x = 7;   // hides the x defined in big
                sub2();
        }

        function sub2() {
                var y = x;      // which x does this refer to?
        }
        var x = 3;
        sub1();
}

Under static scoping rules, when x is seen in sub2, you first check to see if it is declared in sub2. It’s not, so you then check to see if x is declared anywhere in it parent function (big). Indeed it is, so the x in sub2 is the one declared in big.

C-based languages let you create blocks that introduce new scopes. For example, in C++ you can write code like this:

void f() {
        int x = 5;
        cout << x << endl;     // prints 5
        {
                int x = 7;
                cout << x << endl; // prints 7
        }
        cout << x << endl;     // prints 5
}

This example has two blocks, one nested inside the other. Blocks are not often used on their own like this, but instead tend to be used as the body for functions, if-statements, loops, etc.

Another interesting example of scoping arises in functional languages, such as LISP, Haskell, and ML. For example, in Clojure (a LISP-based language), you can write this code:

(let [x 4
      y x]
  (+ x y))

This is a let-expression, and it starts by binding 4 to x and y. You can then use x and y in the body of let, but nowhere outside of it. The entire expression evaluates to 8. Also, x and y cannot be changed once they’re assigned: they are read-only constants.

Dynamic Scoping

Dynamic scoping is an alternative to static scoping that has largely fallen out of favor. Most examples of dynamic scoping occur in older languages, such as APL, SNOBOL, and early versions of LISP. Perl lets you optionally declare variables that follow dynamic scoping rules.

The idea of dynamic scoping is that the meaning of a variable depends upon the value of the most recent variable with the same name in the current function call stack (as opposed to the enclosing block of source code).

Here’s some sample code (from http://msujaws.wordpress.com/2011/05/03/static- vs-dynamic-scoping/) that gives different results if you use dynamic or static scoping:

const int b = 5;

int foo()
{
   int a = b + 5;  // What is b?
   return a;
}

int bar()
{
   int b = 2;
   return foo();
}

int main()
{
   foo(); // returns 10 for static scoping; 10 for dynamic scoping
   bar(); // returns 10 for static scoping; 7 for dynamic scoping
}

The idea of dynamic scoping is that when the value of a variable is needed, the program searches for variables with that name among the names of variables in functions that are currently active. So in this code, when foo() is called inside bar(), the it’s the value of the local b in bar that gets used in b + 5.

In general, dynamic scoping is unpopular because it makes it harder to reason about the meaning of programs from their source code alone. Under dynamic scoping, you can’t always tell for sure what a variable refers to until the code runs, because the order in which functions are called matters.

Another problem with dynamic scoping is that it exposes the local variables of a function to other functions, thus allow the possibility that they could be modified.

Also, dynamic scoping has performance and type-checking difficulties. Probably the major good thing about dynamic scoping is that it is easier to implement than static scoping.