Lecture 5¶
Chapter 4: Lexical and Syntax Analysis¶
Chapter 4 of the text discusses the problem of tokenization (i.e. lexical scanning) and parsing.
It shows an example of a lexical scanner written in C for a simple expression language. Project 1 asks you to writer a lexical scanner for JSON using Go, so this might be a useful example to refer to.
However, keep in mind the text’s example is written in C, and so uses C-style coding idioms that often don’t make sense in Go. But the basic idea is the same: process a character at a time, and use if-statements or switch to decide what to do.
We won’t cover parsing algorithm in this course; look at them if you are interested.
Chapter 5: Names, Bindings, and Scopes¶
Identifiers (names)¶
Names and identifiers are essentially the same thing. Names name variables, functions, modules, etc.
Issues: What characters are allowed in a name? Does case matter?
e.g. in Go, identifiers are defined as follows:
identifier = letter { letter | unicode_digit }
Case matters, and so x
and X
are different. Identifiers can’t start
with a number, so x2
is an identifier but 2x
is not.
Underscores are used to help make compound names, like file_name
or
current_temp
. Camel case is also useful, e.g. fileName
or
currentTemp
.
e.g. In languages such as PHP, Perl, and Ruby, special symbols (sometimes called sigils) are used with variables. For example, in Ruby:
x
refers to a local variable@x
refers to a variable in an object (member of an object)@@x
refers to a static class variablex?
is a convention for functions that return a boolean (i.e. true or false)x!
is a convention for functions that modify the object they’re called on
e.g. Go identifiers that start with an uppercase letter, such as Trim
or
Open
are called exported identifiers, meaning that they can be accessed
from other packages.
e.g. Most languages reserve certain identifiers for special use. For example, the following identifiers are reserved in Go as keywords:
break default func interface select
case defer go map struct
chan else goto package switch
const fallthrough if range type
continue for import return var
You cannot create your own identifiers with the same name.
Go also has a number of pre-declared identifiers that you can’t use:
Types:
bool byte complex64 complex128 error float32 float64
int int8 int16 int32 int64 rune string
uint uint8 uint16 uint32 uint64 uintptr
Constants:
true false iota
Zero value:
nil
Functions:
append cap close complex copy delete imag len
make new panic print println real recover
Constants, such as true
and false
, are constant so as to prevent re-
defining them like this:
fmt.Println(true) // okay: prints "true"
true = false // compiler error: cannot assign to true
fmt.Println(true)
However, you can do this:
fmt.Println(true) // prints "true"
true := false
fmt.Println(true) // prints "false"
e.g. Python is another language whose predeclared identifiers can be changed, e.g.:
>>> print True
True
>>> 2 == 2
True
>>> True = "haha"
>>> print True
haha
>>> True = False
>>> True == False
True
Of course, there rarely, if ever, is a reason to do something like this, but the fact that Python allows this sort of change opens up a whole class of errors that don’t exist in other languages.
Too many reserved words can frustrate a programmer by taking away names they might like to use in their programs.
Variables¶
Variables are a surprisingly intricate topic with many details. They are how most high-level languages let us access memory.
Variables typically have the following associated information:
- name
- address in memory
- value
- type
- lifetime
- scope
The address of a variable is where it is located in memory. In Go and C/C++,
you can use the &
operator to access the address of a variable, e.g.:
s := "apple" // Go
fmt.Println(&s) // prints 0x10500168 (for me)
There’s usually no guarantee about the exact address of a variable in memory, which can be good and bad. It’s good because it allows the compiler and operating system to choose where in memory the program runs. It can be bad if for some reason you need to access a specific memory location for, say, hardware reasons.
The address of a variable is sometimes called its l-value, because it is what is on the left-hand side of an assignment statement, e.g.:
x = 5 // put value 5 into the address referred to by x
Values 5
or "apple"
don’t have l-values, and so can’t be used
on the left side of an assignment.
Constant variables also lack l-values in Go:
const pie string = "apple" // Go
func main() {
fmt.Println(pie) // okay, prints: "apple"
fmt.Println(&pie) // compiler error: can't use & with constants
}
It may be that pie
is, in fact, stored at some particular memory location.
But it might also be that the Go compiler treated it in some special way,
e.g. replaced instances of pie
in the program with the string "apple"
.
If so, then there is no sensible address for it. For example:
const one int = 1 // Go
const two int = 2
func main() {
three := one + two // think about this line
fmt.Println(three)
}
A smart compiler might be able to realize that the expression one + two
can be calculated at compile-time instead of run-time, and so effectively
change that line to:
three := 3 // Go
Plus, since neither one
nor two
are used elsewhere in the program,
perhaps the compiler doesn’t even bother storing them anywhere in the running
program.
Aliasing is when the same address is referred to by more than one name. For example, consider this C++ code:
void incr(int& a) { // C++
++a;
}
void main() {
int x = 5;
incr(x);
}
When incr
is called in main
, a
and x
are aliases for the same
memory address. You can be even more direct like this:
int x = 5; // C++
int& a = x;
++x;
++a;
cout << x << endl; // prints 7
Unrestricted aliasing like in this code seems to be a generally bad idea since in more complicated examples it might be hard to know that two different names refer to the same address.
Values are the contents of memory cells, and they are also sometimes known as r-values because they can appear on the right-hand side of an assignment statement. Many variables have both an l-value (an address) and an r-value (the value it stores), and before you access its r-value you must determine its l-value (which might not always be simple).
Binding¶
An important idea in programming is the notion of binding. In general, binding means associating two things, such as a variable with a value or a function name with a function.
Binding can occur at different times in the programming process:
- Go values like
true
,false
, andnil
were bound at design-time by the language designers. - A value like
int
in C can be bound at implementation time, i.e. the exact number of bits in a Cint
can be set through the compiler. - A constant value could be bound to a value at compile-time.
- Variables are often bound at run-time, e.g. when you read a string from a file and store that string in a variable, the variable is being bound to the string at run-time.
- A function in a library, such as
printf
in C, is bound at link time. A linker is, essentially, a program that “links” function calls to their code. Typically, the code is in a library.
A static binding is a binding that occurs before run-time and is unchanged during run-time. If a binding can change during run-time, it is called a dynamic binding.
Variables can be bound to types either explicitly or implicitly. For example:
double x = 3.14; // C: explicit
auto x = 3.14; // C++: implicit (type inference)
x := 3.14 // Go: implicit (type inference)
In all three of these examples, x
is bound to a floating-point type at
compile-time. That is, the compiler determines the type of x
.
Another example of implicit type binding is FORTRAN, which, by default,
implicitly declares variables beginning with I
, J
, K
, L
,
M
, or N
to be Integer
types, and all other variables to be
Real
types. This kind of implicit typing has not been very popular. One of
the problems is that if you accidentally forget to declare a variable’s type,
FORTRAN may automatically assign it the wrong type. Modern FORTRAN lets you
use an Implicit none
statement to turn off this behaviour.
Dynamic Type Binding¶
Some languages, such as Python, Ruby, JavaScript, and PHP, the type of a variable is not specified by a type declaration (or some other naming convention). So in Python you can write code like this:
x = 5
x = "house"
The variable x
is not restricted to contain only variables of a particular
type. This makes Python (and languages like it) very flexible.
This examples shows another use of dynamic type binding:
>>> def twice(x): return x + x
>>> twice(5)
10
>>> twice("house")
'househouse'
The twice
function works with any value that is defined for +
. There’s
no need to specify the type of x
ahead of time.
Ruby is an interesting language for many reasons. All Ruby values are objects, and so Ruby variables don’t have different types: Ruby variables are all of the same object type. Thus any value can be assigned to any Ruby variable.
There are at least two problems with dynamically typed languages. One problem is that they can suffer from the errors that compilers might easily catch in statically typed languages. In a dynamically typed language, such errors might crash the program at run-time, which hurts their robustness. The second major problem is performance: dynamic type binding requires type-checking to be done at run-time (instead of compile-time), and this extra checking can slow down programs significantly.
Variable Lifetime¶
The lifetime of a variable is the time the variable is bound to a memory cell to the time it is unbound. We distinguish three lifetime categories:
Static. Static variables are bound to memory cells before program execution. This can make them quite efficient, but also less flexible since you can’t typically add new static variables at run-time (and so, for instance, recursion is impossible in a language with only static variables).
Stack-dynamic. Stack-dynamic variables are variables instantiated at run-time, usually on the program’s stack. For example, in languages like C++, Java, and C#, local variables in a function or method are, by default, stack-dynamic. They are instantiated when their declaration statement is reached, and de-allocated when the function ends.
For example:
int abs(int a, int b) { // Java int diff = a - b; if (diff < 0) return -diff; else return diff; }
a
,b
, anddiff
are pushed onto the program stack each timeabs
is called. They are popped when it ends.Heap-dynamic. Heap-dynamic variables are variables whose memory is explicitly allocated and de-allocated at run-time. For example, in C++:
int* ip = new int; // new int allocates a new int memory cell on the heap // ip holds the address of this cell *ip = 4; cout << *ip << endl; // prints 4 delete ip; // de-allocate the memory cell ip points to
In many languages, the deallocation of heap values is done automatically by a garbage collector. The reason for this is that it turns out to be surprisingly challenging to call
delete
at just the right time in complex programs. Calldelete
too soon and you’ve destroyed a value your program needs; calldelete
too late, and you’ve wasted memory (e.g. caused a memory leak).
Scope¶
A variable’s scope is the range of statements over which it is visible. A local variable is a variable whose scope is restricted to a block of code. A nonlocal variable is visible outside of the block in which it was declared. Global variables are a kind of nonlocal variable.
Many modern languages are statically (or, lexically) scoped. This means that the scope of the variable can be determined before the program runs just by looking at the source code. This helps programmers to read source code and figure out a things type without having to run it.
Here’s an example of static scoping from JavaScript:
function big() {
function sub1() {
var x = 7; // hides the x defined in big
sub2();
}
function sub2() {
var y = x; // which x does this refer to?
}
var x = 3;
sub1();
}
Under static scoping rules, when x
is seen in sub2
, you first check to
see if it is declared in sub2
. It’s not, so you then check to see if x
is declared anywhere in it parent function (big
). Indeed it is, so the
x
in sub2
is the one declared in big
.
C-based languages let you create blocks that introduce new scopes. For example, in C++ you can write code like this:
void f() {
int x = 5;
cout << x << endl; // prints 5
{
int x = 7;
cout << x << endl; // prints 7
}
cout << x << endl; // prints 5
}
This example has two blocks, one nested inside the other. Blocks are not often used on their own like this, but instead tend to be used as the body for functions, if-statements, loops, etc.
Another interesting example of scoping arises in functional languages, such as LISP, Haskell, and ML. For example, in Clojure (a LISP-based language), you can write this code:
(let [x 4
y x]
(+ x y))
This is a let-expression, and it starts by binding 4 to x
and y
. You
can then use x
and y
in the body of let, but nowhere outside of it.
The entire expression evaluates to 8. Also, x
and y
cannot be changed
once they’re assigned: they are read-only constants.
Dynamic Scoping¶
Dynamic scoping is an alternative to static scoping that has largely fallen out of favor. Most examples of dynamic scoping occur in older languages, such as APL, SNOBOL, and early versions of LISP. Perl lets you optionally declare variables that follow dynamic scoping rules.
The idea of dynamic scoping is that the meaning of a variable depends upon the value of the most recent variable with the same name in the current function call stack (as opposed to the enclosing block of source code).
Here’s some sample code (from http://msujaws.wordpress.com/2011/05/03/static- vs-dynamic-scoping/) that gives different results if you use dynamic or static scoping:
const int b = 5;
int foo()
{
int a = b + 5; // What is b?
return a;
}
int bar()
{
int b = 2;
return foo();
}
int main()
{
foo(); // returns 10 for static scoping; 10 for dynamic scoping
bar(); // returns 10 for static scoping; 7 for dynamic scoping
}
The idea of dynamic scoping is that when the value of a variable is needed,
the program searches for variables with that name among the names of variables
in functions that are currently active. So in this code, when foo()
is
called inside bar()
, the it’s the value of the local b
in bar
that
gets used in b + 5
.
In general, dynamic scoping is unpopular because it makes it harder to reason about the meaning of programs from their source code alone. Under dynamic scoping, you can’t always tell for sure what a variable refers to until the code runs, because the order in which functions are called matters.
Another problem with dynamic scoping is that it exposes the local variables of a function to other functions, thus allow the possibility that they could be modified.
Also, dynamic scoping has performance and type-checking difficulties. Probably the major good thing about dynamic scoping is that it is easier to implement than static scoping.