Aggregate Data Types and Pointers

Initializing Arrays

Initializing arrays is an interesting issue because often arrays contain a lot of data, and so a lot of code may be required to initialize them. Thus, many languages provide convenient way to initialize arrays. For example, in C:

int arr [] = { 4, 5, 7, 83 };  // C array

In Java, you do essentially the same thing:

int[] arr = { 4, 5, 7, 83 };  // Java array

You can create an array with all its locations initialized to 0 like this in Java:

int[] arr = new int[4];  // Java array of all 0s

Python is similarly easy:

arr = [4, 5, 7, 83]  # Python

It’s worth noting that in all these examples the size of the array is calculated automatically. While this saves some typing, it can also fail to catch errors where you forget a value in the array, or an add an extra one.

Ada uses a slightly different approach:

List : array (1..5) of Integer := (1, 3, 5, 7, 9);     // Ada

Bunch : array (1..5) of Integer := (1 => 17, 3 => 34,
                                    others => 0);

In the Bunch array, only the first and third elements of the array are given explicit initial values. The other elements are set to 0. Also note that the length of the array is given.

In C++, the situation for vector initialization has changed for the better in C++11. Before C++11, you had to initialize a vector like this:

vector<int> arr;  // C++ before C++11
arr.push_back(4);
arr.push_back(5);
arr.push_back(7);
arr.push_back(83);

Or like this:

vector<int> arr(4);  // C++ before C++11
arr[0] = 4;
arr[1] = 5;
arr[2] = 7;
arr[3] = 83;

Or more cleverly:

static const int arr[] = { 4, 5, 6, 83};    // C++ before C++11
vector<int> vec (arr, arr + sizeof(arr) / sizeof(arr[0]) );

In contrast, initializing a C++ array is as simple as in C (since it is the same!), and so this is sometimes a real annoyance when using vectors.

But in C++11, there is a new initializer notation for C++ that simplifies things greatly:

vector<int> arr = { 4, 5, 6, 83 }; // C++11

Comprehensions

A feature that is worth mentioning here is the idea of a comprehensions. Basically, comprehensions are a shorthand for making sequences. For instance, here are some list comprehensions in Python:

>>> [x * x for x in range(5)]
[0, 1, 4, 9, 16]

>>> ['Hello %s!' % n for n in ['Will', 'May', 'Bonnie', 'Chuck']]
['Hello Will!' 'Hello May!' 'Hello Bonnie!' 'Hello Chuck!']

>>> [n for n in ['Will', 'May', 'Bonnie', 'Chuck'] if len(n) % 2 == 0]
['Will', 'Bonnie']

>>> [(x, y) for x in range(3) for y in range(3)]
[(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2), (2, 0), (2, 1), (2, 2)]

>>> [(x, y) for x in range(3) for y in range(3) if x != y]
[(0, 1), (0, 2), (1, 0), (1, 2), (2, 0), (2, 1)]

Comprehensions don’t provide any features we can’t already do with loops or recursion, but they are a clear and compact notation that many programmers enjoy using. They are also close to the mathematical notation often used for sets.

Array Processing in J

Most languages provide functions are processing arrays, either in a library or, sometimes, directly as part of the language. For example, Python has some useful built-in functions:

>>> [3, 1, 2] + [6, 7]                 # Python
[3, 1, 2, 6, 7]

>>> [i * i for i in range(5)]
[0, 1, 4, 9, 16]

Some languages, notably APL _, provide much more support for arrays, to the point where they are sometimes called array- based programming languages. APL lets you write very compact and efficient programs for processing arrays, although it comes at the cost of requiring esoteric symbols not available as keys on standard keyboards.

J is similar in spirit to APL, but can be typed with a standard keyboard.

Here’s an example of J function that calculates the average of a sequence of numbers:

avg=: +/ % #

You call it like this:

=> avg 1 2 3 4
2.5

# counts the number of numbers, and +/ sums them. Then % divides those two results. This is both compact and cryptic — experienced programmers unfamiliar with J find it is almost impossible to understand. J has an extensive vocabulary of briefly-defined functions that programmers need at hand.

The cryptic nature of J is likely the reason it has never become a mainstream language. It can be very difficult to read (and thus debug or modify) even simple J programs.

Slices

Slices refer to subsequences within an existing an array. For example, consider this Python code:

>>> arr = [5, 8, 2, 1, 4]     # Python
>>> arr[2:4]
[2, 1]

arr[2:4] is a copy of the subsequence within arr that starts at index location 2 and ends at index location 3. Ruby has similar notations(!) for slices.

In Go, slices are a fundamental data type that are different than arrays. In practice, slices are used far more frequently than arrays in Go.

List Types

List types are similar to array types, although in practice list types are often (but not always) implemented as singly-linked lists.

Lisp is the canonical example of a language with good support for lists (and comparatively poor support for arrays).

One thing we will mention here is that the values stored in a linked-list are not necessarily stored contiguously in memory the way the values of an array are. Accessing the ith element of a linked list takes O(n) time, compared to O(1) time for an array.

On the plus side, adding or deleting a node (that you have a reference to) is an O(1) operation for a linked list, but the same sort of operation in an array is typically O(n).

Thus, for some operations arrays are more efficient, while for others linked lists are more efficient. Good programmers need to be aware of both so they can choose the right data structure for the job.

Associative Arrays

An associative array, also know as a dictionary, map, or hash (or ash table), is, essentially, and unordered collection of (key, value) pairs that is efficiently indexed by the keys. Often, but not always, they are implemented using hash tables, and so there may not be a natural ordering to the pairs.

Here’s a Python example of how you can use an associative array to count the number of occurrences of characters in a string:

line = "once upon a time a frog bounded into the kingdom"
char_count = {}
for c in line:
    if c in char_count:
        char_count[c] += 1
    else:
        char_count[c] = 1

>>> char_count
{'a': 2, ' ': 9, 'c': 1, 'b': 1, 'e': 4, 'd': 3, 'g': 2, 'f': 1,
 'i': 3, 'h': 1, 'k': 1, 'm': 2, 'o': 6, 'n': 5, 'p': 1, 'r': 1,
 'u': 2, 't': 3}

>>> char_count['e']
4

Record Types

A record is a collection of data values, possibly of different types, accessed by name. For example, here is a Go record that contains two values:

type Rectangle struct {
    length, width int
}

You can use it like this:

r := Rectangle{}  // makes a new Rectangle
fmt.Println(r)    // "{0 0}"

r.width = 4
r.height = 2

fmt.Println(r)    // "{2 4}"

In practice, the basic difference between a record and array is that array elements are accessed by an index value, and record values are accessed by a field name.

See p. 277 and 278 of the textbook for examples of record definitions in COBOL and Ada.

Tuple Types

Tuple types are similar to records except that the fields of a tuple are unnamed. In Python, for example, tuples are denoted with round brackets, e.g.:

>>> (2, 'a', [8, 2])
(2, 'a', [8, 2])
>>> (4,)
(4,)
>>> (4)
4

Notice in the second example a comma comes after the 4. This is required by Python to indicate that this is a tuple and not just the number 4 in brackets.

In Python, tuples are fixed in size. Unlike a Python list, you can’t make a tuple bigger or smaller. This makes tuples less flexible then lists, but possibly more efficient. Tuples may also be less error-prone since there are fewer functions that operate on them.

A very useful application of tuples is for functions that return multiple values. For example, Go functions often return tuples in support of error handling:

resp, err := http.Get("http://example.com/")

Here the http.Get function returns two values, a response and an error flag. This is essentially a tuple of 2 elements. Interestingly, Go only supports tuples in function calls and assignments, and doesn’t let you create a stand-alone tuple (as in Python).

Pointers and References

Pointer types are variables that can store the addresses of memory cells in the computer. They also have a nil value indicating that they are not referring to a memory address.

In most languages, pointers are used to manage objects instantiated on the heap. In C, pointers are perhaps the defining feature of the language. C pointers are notoriously powerful, with a single pointer allowing practically unlimited access to the memory of a running programming.

For example, suppose an int is 32 bits:

int* ip = new int;  // C++

ip points to the address of a newly allocated int somewhere on the heap. You can access the value as *ip.

You can also write the expression ip + 1. which points to the memory address 32 bits after ip. There’s no guarantee what is contained at address ip + 1, so this is a source of confusion, errors, and even security breaches. But for many low-level applications, such quick and easy access to specific memory locations is one of the strong points of C.

Most other languages that supports pointers don’t allow the above kind of pointer arithmetic. For instance, Go has pointers, but if ip is an int pointer, the expression ip + 1 is an error because you are not allowed to do arithmetic with pointer types.

An important detail of C-style pointers is that they have a type. It is perhaps not immediately obvious why this is the case since all address in a computer are the same type. The type of the pointer helps in pointer arithmetic. For example, if p is a pointer of some type T, then p + 1 is the address of the memory cell that is offset sizeof(T) bytes away from p. For instance, if p is a 32 bit int, p + 1 is the address starting 32 bits after p. If p were instead a 64-bit double, p + 1 is the addressing starting 64 bits after p.

A dangling pointer is a pointer that points to a heap value that has been de-allocated. For example:

int* ip = new int(5);  // C++

cout << *ip;

delete ip;

// ip is now a dangling pointer --- it no longer points to a valid int!

Nothing stops a programmer from accessing *ip after delete has been called. The problem is that the value at *ip is no longer anything sensible — it has been deleted and so has no known meaning. In C++, calling delete on a dangling pointer can corrupt your programs memory (or not — there is no well-defined behaviour here!).

In C++, std::unique_ptr is a safer way to deal with dynamic memory:

std::unique_ptr<int> ip = new int;

cout << *ip;

// memory ip points to is automatically deleted when ip goes out of scope
// ip cannot be assigned: must use std::move to ensure only one owner for
// the data

In this example the programmer does not need to remember to explicitly delete the memory ip refers to. Indeed, it even correctly calls delete[] when the value pointed to is an array.

Another difficulty with pointers is memory leaks. For instance:

int oops() {
    int* ip = new int(5);  // C++
    return *ip;
}

Every time you call oops, you are creating a new int on the heap, and then destroying the pointer that refers to it, which means there is no longer any way to access it. This is a memory leak: call oops enough times and your program will run out of memory.

These problems can be difficult to deal with in larger programs, and so they are one of the major reasons why garbage collectors have become popular. A garbage collector essentially keeps track of all allocated memory, and automatically deletes it when there are no more pointers to it (and so inaccessible).