Lecture 5
=========

Chapter 4: Lexical and Syntax Analysis
--------------------------------------

- Chapter 4 of the text discusses the problem of tokenization (i.e. lexical 
  scanning) and parsing. 

- It shows an example of a lexical scanner written in C for a simple expression 
  language. Project 1 asks you to writer a lexical scanner for JSON using Go_, so
  this might be a useful example to refer to.

  However, keep in mind the text's example is written in C, and so uses
  C-style coding idioms that often don't make sense in Go_. But the basic idea
  is the same: process a character at a time, and use if-statements or switch
  to decide what to do.

- We won't cover parsing algorithm in this course; look at them if you are
  interested.


Chapter 5: Names, Bindings, and Scopes
--------------------------------------

Identifiers (names)
~~~~~~~~~~~~~~~~~~~

Names and identifiers are essentially the same thing. Names name variables,
functions, modules, etc.

Issues: What characters are allowed in a name? Does case matter?

e.g. in Go_, identifiers are defined as follows::

	identifier = letter { letter | unicode_digit }

Case matters, and so ``x`` and ``X`` are different. Identifiers can't start
with a number, so ``x2`` is an identifier but ``2x`` is not.

Underscores are used to help make compound names, like ``file_name`` or
``current_temp``. Camel case is also useful, e.g. ``fileName`` or
``currentTemp``.

e.g. In languages such as PHP, Perl, and Ruby, special symbols (sometimes
called **sigils**) are used with variables. For example, in Ruby:

- ``x`` refers to a local variable

- ``@x`` refers to a variable in an object (member of an object)

- ``@@x`` refers to a static class variable

- ``x?`` is a convention for functions that return a boolean (i.e. true or
  false)

- ``x!`` is a convention for functions that modify the object they're called
  on

e.g. Go_ identifiers that start with an uppercase letter, such as ``Trim`` or
``Open`` are called exported identifiers, meaning that they can be accessed
from other packages.

e.g. Most languages reserve certain identifiers for special use. For example,
the following identifiers are reserved in Go_ as keywords::

	break        default      func         interface    select
	case         defer        go           map          struct
	chan         else         goto         package      switch
	const        fallthrough  if           range        type
	continue     for          import       return       var

You cannot create your own identifiers with the same name.

Go_ also has a number of pre-declared identifiers that you can't use::

	Types:
		bool byte complex64 complex128 error float32 float64
		int int8 int16 int32 int64 rune string
		uint uint8 uint16 uint32 uint64 uintptr

	Constants:
		true false iota

	Zero value:
		nil

	Functions:
		append cap close complex copy delete imag len
		make new panic print println real recover

Constants, such as ``true`` and ``false``, are constant so as to prevent re-
defining them like this::

	fmt.Println(true)  // okay: prints "true"
	true = false       // compiler error: cannot assign to true
	fmt.Println(true)

However, you can do this::

	fmt.Println(true)  // prints "true"
	true := false
	fmt.Println(true)  // prints "false"

e.g. Python is another language whose predeclared identifiers can be changed,
e.g.::

	>>> print True
	True
	>>> 2 == 2
	True
	>>> True = "haha"
	>>> print True
	haha
	>>> True = False
	>>> True == False
	True

Of course, there rarely, if ever, is a reason to do something like this, but
the fact that Python allows this sort of change opens up a whole class of
errors that don't exist in other languages.

Too many reserved words can frustrate a programmer by taking away names
they might like to use in their programs.


Variables
~~~~~~~~~

Variables are a surprisingly intricate topic with many details. They are how
most high-level languages let us access memory.

Variables typically have the following associated information:

- name

- address in memory

- value

- type

- lifetime

- scope

The address of a variable is where it is located in memory. In Go_ and C/C++,
you can use the ``&`` operator to access the address of a variable, e.g.::

	s := "apple"     // Go
	fmt.Println(&s)  // prints 0x10500168 (for me)

There's usually no guarantee about the exact address of a variable in memory,
which can be good and bad. It's good because it allows the compiler and
operating system to choose where in memory the program runs. It can be bad if
for some reason you need to access a specific memory location for, say,
hardware reasons.

The address of a variable is sometimes called its **l-value**, because it is
what is on the left-hand side of an assignment statement, e.g.::

	x = 5   // put value 5 into the address referred to by x

Values ``5`` or ``"apple"`` don't have l-values, and so can't be used
on the left side of an assignment.

Constant variables also lack l-values in Go_::
 
	const pie string = "apple"    // Go

	func main() {
		fmt.Println(pie)   // okay, prints: "apple"
		fmt.Println(&pie)  // compiler error: can't use & with constants
	}

It may be that ``pie`` is, in fact, stored at some particular memory location.
But it might also be that the Go_ compiler treated it in some special way,
e.g. replaced instances of ``pie`` in the program with the string ``"apple"``.
If so, then there is no sensible address for it. For example::

	const one int = 1    // Go
	const two int = 2

	func main() {
		three := one + two   // think about this line
		fmt.Println(three)
	}

A smart compiler might be able to realize that the expression ``one + two``
can be calculated at compile-time instead of run-time, and so effectively
change that line to::

	three := 3    // Go

Plus, since neither ``one`` nor ``two`` are used elsewhere in the program,
perhaps the compiler doesn't even bother storing them anywhere in the running
program.

**Aliasing** is when the same address is referred to by more than one name. For
example, consider this C++ code::

	void incr(int& a) {    // C++
		++a;
	}

	void main() {
		int x = 5;
		incr(x);
	}

When ``incr`` is called in ``main``, ``a`` and ``x`` are aliases for the same
memory address. You can be even more direct like this::

	int x = 5;     // C++
	int& a = x;
	++x;
	++a;
	cout << x << endl;  // prints 7

Unrestricted aliasing like in this code seems to be a generally bad idea since
in more complicated examples it might be hard to know that two different names
refer to the same address.

Values are the contents of memory cells, and they are also sometimes known as
**r-values** because they can appear on the right-hand side of an assignment
statement. Many variables have both an l-value (an address) and an r-value
(the value it stores), and before you access its r-value you must determine
its l-value (which might not always be simple).


Binding
-------

An important idea in programming is the notion of **binding**. In general,
binding means associating two things, such as a variable with a value or a
function name with a function.

Binding can occur at different times in the programming process:

- Go_ values like ``true``, ``false``, and ``nil`` were bound at design-time 
  by the language designers.

- A value like ``int`` in C can be bound at implementation time, i.e. the
  exact number of bits in a C ``int`` can be set through the compiler.

- A constant value could be bound to a value at compile-time.

- Variables are often bound at run-time, e.g. when you read a string from a
  file and store that string in a variable, the variable is being bound to the
  string at run-time.

- A function in a library, such as ``printf`` in C, is bound at link time. A 
  linker is, essentially, a program that "links" function calls to their 
  code. Typically, the code is in a library.

A **static binding** is a binding that occurs *before* run-time and is
unchanged during run-time. If a binding can change during run-time, it is
called a **dynamic binding**.

Variables can be bound to types either explicitly or implicitly. For example::

	double x = 3.14;  // C: explicit

	auto x = 3.14;    // C++: implicit (type inference)

	x := 3.14         // Go: implicit (type inference)

In all three of these examples, ``x`` is bound to a floating-point type at
compile-time. That is, the compiler determines the type of ``x``.

Another example of implicit type binding is FORTRAN, which, by default,
implicitly declares variables beginning with ``I``, ``J``, ``K``, ``L``,
``M``, or ``N`` to be ``Integer`` types, and all other variables to be
``Real`` types. This kind of implicit typing has not been very popular. One of
the problems is that if you accidentally forget to declare a variable's type,
FORTRAN may automatically assign it the wrong type. Modern FORTRAN lets you
use an ``Implicit none`` statement to turn off this behaviour.


Dynamic Type Binding
~~~~~~~~~~~~~~~~~~~~

Some languages, such as Python, Ruby, JavaScript, and PHP, the type of a
variable is *not* specified by a type declaration (or some other naming
convention). So in Python you can write code like this::

	x = 5
	x = "house"

The variable ``x`` is not restricted to contain only variables of a particular
type. This makes Python (and languages like it) very flexible.

This examples shows another use of dynamic type binding::

	>>> def twice(x): return x + x
	>>> twice(5)
	10
	>>> twice("house")
	'househouse'

The ``twice`` function works with any value that is defined for ``+``. There's
no need to specify the type of ``x`` ahead of time.

Ruby is an interesting language for many reasons. All Ruby values are objects,
and so Ruby variables don't have different types: Ruby variables are all of the
same object type. Thus any value can be assigned to any Ruby variable.

There are at least two problems with dynamically typed languages. One problem
is that they can suffer from the errors that compilers might easily catch in
statically typed languages. In a dynamically typed language, such errors might
crash the program at run-time, which hurts their robustness. The second major
problem is performance: dynamic type binding requires type-checking to be done
at run-time (instead of compile-time), and this extra checking can slow down
programs significantly.


Variable Lifetime
~~~~~~~~~~~~~~~~~

The **lifetime** of a variable is the time the variable is bound to a memory
cell to the time it is unbound. We distinguish three lifetime categories:

- **Static.** Static variables are bound to memory cells *before* program
  execution. This can make them quite efficient, but also less flexible since
  you can't typically add new static variables at run-time (and so, for
  instance, recursion is impossible in a language with only static variables).

- **Stack-dynamic.** Stack-dynamic variables are variables instantiated at
  run-time, usually on the program's stack. For example, in languages like
  C++, Java, and C#, local variables in a function or method are, by default,
  stack-dynamic. They are instantiated when their declaration statement is
  reached, and de-allocated when the function ends.

  For example::

	  int abs(int a, int b) {   // Java
	  	int diff = a - b;
	  	if (diff < 0) 
	  		return -diff;
	  	else
	  		return diff;
	  } 

  ``a``, ``b``, and ``diff`` are pushed onto the program stack each time
  ``abs`` is called. They are popped when it ends.

- **Heap-dynamic.** Heap-dynamic variables are variables whose memory is
  explicitly allocated and de-allocated at run-time. For example, in C++::

  	int* ip = new int;    // new int allocates a new int memory cell on the heap
  	                      // ip holds the address of this cell

    *ip = 4;
    cout << *ip << endl;  // prints 4

    delete ip;            // de-allocate the memory cell ip points to

  In many languages, the deallocation of heap values is done automatically by
  a garbage collector. The reason for this is that it turns out to be
  surprisingly challenging to call ``delete`` at just the right time in
  complex programs. Call ``delete`` too soon and you've destroyed a value
  your program needs; call ``delete`` too late, and you've wasted memory (e.g.
  caused a memory leak).


Scope
~~~~~

A variable's **scope** is the range of statements over which it is visible. A
local variable is a variable whose scope is restricted to a block of code. A
nonlocal variable is visible outside of the block in which it was declared.
Global variables are a kind of nonlocal variable.

Many modern languages are **statically** (or, **lexically**) scoped. This
means that the scope of the variable can be determined *before* the program
runs just by looking at the source code. This helps programmers to read source
code and figure out a things type without having to run it.

Here's an example of static scoping from JavaScript::

	function big() {
		function sub1() {
			var x = 7;   // hides the x defined in big
			sub2();
		}

		function sub2() {
			var y = x;      // which x does this refer to?
		}
		var x = 3;
		sub1();
	}

Under static scoping rules, when ``x`` is seen in ``sub2``, you first check to
see if it is declared in ``sub2``. It's not, so you then check to see if ``x``
is declared anywhere in it parent function (``big``). Indeed it is, so the
``x`` in ``sub2`` is the one declared in ``big``.

C-based languages let you create blocks that introduce new scopes. For
example, in C++ you can write code like this::

	void f() {
		int x = 5;
		cout << x << endl;     // prints 5
		{
			int x = 7;
			cout << x << endl; // prints 7
		}
		cout << x << endl;     // prints 5
	}

This example has two blocks, one nested inside the other. Blocks are not often
used on their own like this, but instead tend to be used as the body for
functions, if-statements, loops, etc.

Another interesting example of scoping arises in functional languages, such as
LISP, Haskell, and ML. For example, in Clojure_ (a LISP-based language), you
can write this code::

	(let [x 4
	      y x]
	  (+ x y))

This is a let-expression, and it starts by binding 4 to ``x`` and ``y``. You
can then use ``x`` and ``y`` in the body of let, but nowhere outside of it.
The entire expression evaluates to 8. Also, ``x`` and ``y`` cannot be changed
once they're assigned: they are read-only constants.


Dynamic Scoping
~~~~~~~~~~~~~~~

Dynamic scoping is an alternative to static scoping that has largely fallen
out of favor. Most examples of dynamic scoping occur in older languages, such
as APL, SNOBOL, and early versions of LISP. Perl lets you optionally declare
variables that follow dynamic scoping rules.

The idea of dynamic scoping is that the meaning of a variable depends upon the
value of the most recent variable with the same name in the current function
call stack (as opposed to the enclosing block of source code).

Here's some sample code (from http://msujaws.wordpress.com/2011/05/03/static-
vs-dynamic-scoping/) that gives different results if you use dynamic or static
scoping::

	const int b = 5;    

	int foo()
	{
	   int a = b + 5;  // What is b?
	   return a;
	}
	 
	int bar()
	{
	   int b = 2;
	   return foo();
	}
	 
	int main()
	{
	   foo(); // returns 10 for static scoping; 10 for dynamic scoping
	   bar(); // returns 10 for static scoping; 7 for dynamic scoping
	}

The idea of dynamic scoping is that when the value of a variable is needed,
the program searches for variables with that name among the names of variables
in functions that are currently active. So in this code, when ``foo()`` is
called inside ``bar()``, the it's the value of the local ``b`` in ``bar`` that
gets used in ``b + 5``.

In general, dynamic scoping is unpopular because it makes it harder to reason
about the meaning of programs from their source code alone. Under dynamic
scoping, you can't always tell for sure what a variable refers to until the
code runs, because the order in which functions are called matters.

Another problem with dynamic scoping is that it exposes the local variables of
a function to other functions, thus allow the possibility that they could be
modified.

Also, dynamic scoping has performance and type-checking difficulties. Probably
the major good thing about dynamic scoping is that it is easier to implement
than static scoping.