Lecture 5 ========= Chapter 4: Lexical and Syntax Analysis -------------------------------------- - Chapter 4 of the text discusses the problem of tokenization (i.e. lexical scanning) and parsing. - It shows an example of a lexical scanner written in C for a simple expression language. Project 1 asks you to writer a lexical scanner for JSON using Go_, so this might be a useful example to refer to. However, keep in mind the text's example is written in C, and so uses C-style coding idioms that often don't make sense in Go_. But the basic idea is the same: process a character at a time, and use if-statements or switch to decide what to do. - We won't cover parsing algorithm in this course; look at them if you are interested. Chapter 5: Names, Bindings, and Scopes -------------------------------------- Identifiers (names) ~~~~~~~~~~~~~~~~~~~ Names and identifiers are essentially the same thing. Names name variables, functions, modules, etc. Issues: What characters are allowed in a name? Does case matter? e.g. in Go_, identifiers are defined as follows:: identifier = letter { letter | unicode_digit } Case matters, and so ``x`` and ``X`` are different. Identifiers can't start with a number, so ``x2`` is an identifier but ``2x`` is not. Underscores are used to help make compound names, like ``file_name`` or ``current_temp``. Camel case is also useful, e.g. ``fileName`` or ``currentTemp``. e.g. In languages such as PHP, Perl, and Ruby, special symbols (sometimes called **sigils**) are used with variables. For example, in Ruby: - ``x`` refers to a local variable - ``@x`` refers to a variable in an object (member of an object) - ``@@x`` refers to a static class variable - ``x?`` is a convention for functions that return a boolean (i.e. true or false) - ``x!`` is a convention for functions that modify the object they're called on e.g. Go_ identifiers that start with an uppercase letter, such as ``Trim`` or ``Open`` are called exported identifiers, meaning that they can be accessed from other packages. e.g. Most languages reserve certain identifiers for special use. For example, the following identifiers are reserved in Go_ as keywords:: break default func interface select case defer go map struct chan else goto package switch const fallthrough if range type continue for import return var You cannot create your own identifiers with the same name. Go_ also has a number of pre-declared identifiers that you can't use:: Types: bool byte complex64 complex128 error float32 float64 int int8 int16 int32 int64 rune string uint uint8 uint16 uint32 uint64 uintptr Constants: true false iota Zero value: nil Functions: append cap close complex copy delete imag len make new panic print println real recover Constants, such as ``true`` and ``false``, are constant so as to prevent re- defining them like this:: fmt.Println(true) // okay: prints "true" true = false // compiler error: cannot assign to true fmt.Println(true) However, you can do this:: fmt.Println(true) // prints "true" true := false fmt.Println(true) // prints "false" e.g. Python is another language whose predeclared identifiers can be changed, e.g.:: >>> print True True >>> 2 == 2 True >>> True = "haha" >>> print True haha >>> True = False >>> True == False True Of course, there rarely, if ever, is a reason to do something like this, but the fact that Python allows this sort of change opens up a whole class of errors that don't exist in other languages. Too many reserved words can frustrate a programmer by taking away names they might like to use in their programs. Variables ~~~~~~~~~ Variables are a surprisingly intricate topic with many details. They are how most high-level languages let us access memory. Variables typically have the following associated information: - name - address in memory - value - type - lifetime - scope The address of a variable is where it is located in memory. In Go_ and C/C++, you can use the ``&`` operator to access the address of a variable, e.g.:: s := "apple" // Go fmt.Println(&s) // prints 0x10500168 (for me) There's usually no guarantee about the exact address of a variable in memory, which can be good and bad. It's good because it allows the compiler and operating system to choose where in memory the program runs. It can be bad if for some reason you need to access a specific memory location for, say, hardware reasons. The address of a variable is sometimes called its **l-value**, because it is what is on the left-hand side of an assignment statement, e.g.:: x = 5 // put value 5 into the address referred to by x Values ``5`` or ``"apple"`` don't have l-values, and so can't be used on the left side of an assignment. Constant variables also lack l-values in Go_:: const pie string = "apple" // Go func main() { fmt.Println(pie) // okay, prints: "apple" fmt.Println(&pie) // compiler error: can't use & with constants } It may be that ``pie`` is, in fact, stored at some particular memory location. But it might also be that the Go_ compiler treated it in some special way, e.g. replaced instances of ``pie`` in the program with the string ``"apple"``. If so, then there is no sensible address for it. For example:: const one int = 1 // Go const two int = 2 func main() { three := one + two // think about this line fmt.Println(three) } A smart compiler might be able to realize that the expression ``one + two`` can be calculated at compile-time instead of run-time, and so effectively change that line to:: three := 3 // Go Plus, since neither ``one`` nor ``two`` are used elsewhere in the program, perhaps the compiler doesn't even bother storing them anywhere in the running program. **Aliasing** is when the same address is referred to by more than one name. For example, consider this C++ code:: void incr(int& a) { // C++ ++a; } void main() { int x = 5; incr(x); } When ``incr`` is called in ``main``, ``a`` and ``x`` are aliases for the same memory address. You can be even more direct like this:: int x = 5; // C++ int& a = x; ++x; ++a; cout << x << endl; // prints 7 Unrestricted aliasing like in this code seems to be a generally bad idea since in more complicated examples it might be hard to know that two different names refer to the same address. Values are the contents of memory cells, and they are also sometimes known as **r-values** because they can appear on the right-hand side of an assignment statement. Many variables have both an l-value (an address) and an r-value (the value it stores), and before you access its r-value you must determine its l-value (which might not always be simple). Binding ------- An important idea in programming is the notion of **binding**. In general, binding means associating two things, such as a variable with a value or a function name with a function. Binding can occur at different times in the programming process: - Go_ values like ``true``, ``false``, and ``nil`` were bound at design-time by the language designers. - A value like ``int`` in C can be bound at implementation time, i.e. the exact number of bits in a C ``int`` can be set through the compiler. - A constant value could be bound to a value at compile-time. - Variables are often bound at run-time, e.g. when you read a string from a file and store that string in a variable, the variable is being bound to the string at run-time. - A function in a library, such as ``printf`` in C, is bound at link time. A linker is, essentially, a program that "links" function calls to their code. Typically, the code is in a library. A **static binding** is a binding that occurs *before* run-time and is unchanged during run-time. If a binding can change during run-time, it is called a **dynamic binding**. Variables can be bound to types either explicitly or implicitly. For example:: double x = 3.14; // C: explicit auto x = 3.14; // C++: implicit (type inference) x := 3.14 // Go: implicit (type inference) In all three of these examples, ``x`` is bound to a floating-point type at compile-time. That is, the compiler determines the type of ``x``. Another example of implicit type binding is FORTRAN, which, by default, implicitly declares variables beginning with ``I``, ``J``, ``K``, ``L``, ``M``, or ``N`` to be ``Integer`` types, and all other variables to be ``Real`` types. This kind of implicit typing has not been very popular. One of the problems is that if you accidentally forget to declare a variable's type, FORTRAN may automatically assign it the wrong type. Modern FORTRAN lets you use an ``Implicit none`` statement to turn off this behaviour. Dynamic Type Binding ~~~~~~~~~~~~~~~~~~~~ Some languages, such as Python, Ruby, JavaScript, and PHP, the type of a variable is *not* specified by a type declaration (or some other naming convention). So in Python you can write code like this:: x = 5 x = "house" The variable ``x`` is not restricted to contain only variables of a particular type. This makes Python (and languages like it) very flexible. This examples shows another use of dynamic type binding:: >>> def twice(x): return x + x >>> twice(5) 10 >>> twice("house") 'househouse' The ``twice`` function works with any value that is defined for ``+``. There's no need to specify the type of ``x`` ahead of time. Ruby is an interesting language for many reasons. All Ruby values are objects, and so Ruby variables don't have different types: Ruby variables are all of the same object type. Thus any value can be assigned to any Ruby variable. There are at least two problems with dynamically typed languages. One problem is that they can suffer from the errors that compilers might easily catch in statically typed languages. In a dynamically typed language, such errors might crash the program at run-time, which hurts their robustness. The second major problem is performance: dynamic type binding requires type-checking to be done at run-time (instead of compile-time), and this extra checking can slow down programs significantly. Variable Lifetime ~~~~~~~~~~~~~~~~~ The **lifetime** of a variable is the time the variable is bound to a memory cell to the time it is unbound. We distinguish three lifetime categories: - **Static.** Static variables are bound to memory cells *before* program execution. This can make them quite efficient, but also less flexible since you can't typically add new static variables at run-time (and so, for instance, recursion is impossible in a language with only static variables). - **Stack-dynamic.** Stack-dynamic variables are variables instantiated at run-time, usually on the program's stack. For example, in languages like C++, Java, and C#, local variables in a function or method are, by default, stack-dynamic. They are instantiated when their declaration statement is reached, and de-allocated when the function ends. For example:: int abs(int a, int b) { // Java int diff = a - b; if (diff < 0) return -diff; else return diff; } ``a``, ``b``, and ``diff`` are pushed onto the program stack each time ``abs`` is called. They are popped when it ends. - **Heap-dynamic.** Heap-dynamic variables are variables whose memory is explicitly allocated and de-allocated at run-time. For example, in C++:: int* ip = new int; // new int allocates a new int memory cell on the heap // ip holds the address of this cell *ip = 4; cout << *ip << endl; // prints 4 delete ip; // de-allocate the memory cell ip points to In many languages, the deallocation of heap values is done automatically by a garbage collector. The reason for this is that it turns out to be surprisingly challenging to call ``delete`` at just the right time in complex programs. Call ``delete`` too soon and you've destroyed a value your program needs; call ``delete`` too late, and you've wasted memory (e.g. caused a memory leak). Scope ~~~~~ A variable's **scope** is the range of statements over which it is visible. A local variable is a variable whose scope is restricted to a block of code. A nonlocal variable is visible outside of the block in which it was declared. Global variables are a kind of nonlocal variable. Many modern languages are **statically** (or, **lexically**) scoped. This means that the scope of the variable can be determined *before* the program runs just by looking at the source code. This helps programmers to read source code and figure out a things type without having to run it. Here's an example of static scoping from JavaScript:: function big() { function sub1() { var x = 7; // hides the x defined in big sub2(); } function sub2() { var y = x; // which x does this refer to? } var x = 3; sub1(); } Under static scoping rules, when ``x`` is seen in ``sub2``, you first check to see if it is declared in ``sub2``. It's not, so you then check to see if ``x`` is declared anywhere in it parent function (``big``). Indeed it is, so the ``x`` in ``sub2`` is the one declared in ``big``. C-based languages let you create blocks that introduce new scopes. For example, in C++ you can write code like this:: void f() { int x = 5; cout << x << endl; // prints 5 { int x = 7; cout << x << endl; // prints 7 } cout << x << endl; // prints 5 } This example has two blocks, one nested inside the other. Blocks are not often used on their own like this, but instead tend to be used as the body for functions, if-statements, loops, etc. Another interesting example of scoping arises in functional languages, such as LISP, Haskell, and ML. For example, in Clojure_ (a LISP-based language), you can write this code:: (let [x 4 y x] (+ x y)) This is a let-expression, and it starts by binding 4 to ``x`` and ``y``. You can then use ``x`` and ``y`` in the body of let, but nowhere outside of it. The entire expression evaluates to 8. Also, ``x`` and ``y`` cannot be changed once they're assigned: they are read-only constants. Dynamic Scoping ~~~~~~~~~~~~~~~ Dynamic scoping is an alternative to static scoping that has largely fallen out of favor. Most examples of dynamic scoping occur in older languages, such as APL, SNOBOL, and early versions of LISP. Perl lets you optionally declare variables that follow dynamic scoping rules. The idea of dynamic scoping is that the meaning of a variable depends upon the value of the most recent variable with the same name in the current function call stack (as opposed to the enclosing block of source code). Here's some sample code (from http://msujaws.wordpress.com/2011/05/03/static- vs-dynamic-scoping/) that gives different results if you use dynamic or static scoping:: const int b = 5; int foo() { int a = b + 5; // What is b? return a; } int bar() { int b = 2; return foo(); } int main() { foo(); // returns 10 for static scoping; 10 for dynamic scoping bar(); // returns 10 for static scoping; 7 for dynamic scoping } The idea of dynamic scoping is that when the value of a variable is needed, the program searches for variables with that name among the names of variables in functions that are currently active. So in this code, when ``foo()`` is called inside ``bar()``, the it's the value of the local ``b`` in ``bar`` that gets used in ``b + 5``. In general, dynamic scoping is unpopular because it makes it harder to reason about the meaning of programs from their source code alone. Under dynamic scoping, you can't always tell for sure what a variable refers to until the code runs, because the order in which functions are called matters. Another problem with dynamic scoping is that it exposes the local variables of a function to other functions, thus allow the possibility that they could be modified. Also, dynamic scoping has performance and type-checking difficulties. Probably the major good thing about dynamic scoping is that it is easier to implement than static scoping.