Aggregate Data Types and Pointers¶
Initializing Arrays¶
Initializing arrays is an interesting issue because often arrays contain a lot of data, and so a lot of code may be required to initialize them. Thus, many languages provide convenient way to initialize arrays. For example, in C:
int arr [] = { 4, 5, 7, 83 }; // C array
In Java, you do essentially the same thing:
int[] arr = { 4, 5, 7, 83 }; // Java array
You can create an array with all its locations initialized to 0 like this in Java:
int[] arr = new int[4]; // Java array of all 0s
Python is similarly easy:
arr = [4, 5, 7, 83] # Python
It’s worth noting that in all these examples the size of the array is calculated automatically. While this saves some typing, it can also fail to catch errors where you forget a value in the array, or an add an extra one.
Ada uses a slightly different approach:
List : array (1..5) of Integer := (1, 3, 5, 7, 9); // Ada
Bunch : array (1..5) of Integer := (1 => 17, 3 => 34,
others => 0);
In the Bunch
array, only the first and third elements of the array are
given explicit initial values. The other elements are set to 0. Also note that
the length of the array is given.
In C++, the situation for vector
initialization has changed for the better
in C++11. Before C++11, you had to initialize a vector like this:
vector<int> arr; // C++ before C++11
arr.push_back(4);
arr.push_back(5);
arr.push_back(7);
arr.push_back(83);
Or like this:
vector<int> arr(4); // C++ before C++11
arr[0] = 4;
arr[1] = 5;
arr[2] = 7;
arr[3] = 83;
Or more cleverly:
static const int arr[] = { 4, 5, 6, 83}; // C++ before C++11
vector<int> vec (arr, arr + sizeof(arr) / sizeof(arr[0]) );
In contrast, initializing a C++ array is as simple as in C (since it is the same!), and so this is sometimes a real annoyance when using vectors.
But in C++11, there is a new initializer notation for C++ that simplifies things greatly:
vector<int> arr = { 4, 5, 6, 83 }; // C++11
Comprehensions¶
A feature that is worth mentioning here is the idea of a comprehensions. Basically, comprehensions are a shorthand for making sequences. For instance, here are some list comprehensions in Python:
>>> [x * x for x in range(5)]
[0, 1, 4, 9, 16]
>>> ['Hello %s!' % n for n in ['Will', 'May', 'Bonnie', 'Chuck']]
['Hello Will!' 'Hello May!' 'Hello Bonnie!' 'Hello Chuck!']
>>> [n for n in ['Will', 'May', 'Bonnie', 'Chuck'] if len(n) % 2 == 0]
['Will', 'Bonnie']
>>> [(x, y) for x in range(3) for y in range(3)]
[(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2), (2, 0), (2, 1), (2, 2)]
>>> [(x, y) for x in range(3) for y in range(3) if x != y]
[(0, 1), (0, 2), (1, 0), (1, 2), (2, 0), (2, 1)]
Comprehensions don’t provide any features we can’t already do with loops or recursion, but they are a clear and compact notation that many programmers enjoy using. They are also close to the mathematical notation often used for sets.
Array Processing in J¶
Most languages provide functions are processing arrays, either in a library or, sometimes, directly as part of the language. For example, Python has some useful built-in functions:
>>> [3, 1, 2] + [6, 7] # Python
[3, 1, 2, 6, 7]
>>> [i * i for i in range(5)]
[0, 1, 4, 9, 16]
Some languages, notably APL _, provide much more support for arrays, to the point where they are sometimes called array- based programming languages. APL lets you write very compact and efficient programs for processing arrays, although it comes at the cost of requiring esoteric symbols not available as keys on standard keyboards.
J is similar in spirit to APL, but can be typed with a standard keyboard.
Here’s an example of J function that calculates the average of a sequence of numbers:
avg=: +/ % #
You call it like this:
=> avg 1 2 3 4
2.5
#
counts the number of numbers, and +/
sums them. Then %
divides
those two results. This is both compact and cryptic — experienced
programmers unfamiliar with J find it is almost impossible to understand. J
has an extensive vocabulary of briefly-defined functions that
programmers need at hand.
The cryptic nature of J is likely the reason it has never become a mainstream language. It can be very difficult to read (and thus debug or modify) even simple J programs.
Slices¶
Slices refer to subsequences within an existing an array. For example, consider this Python code:
>>> arr = [5, 8, 2, 1, 4] # Python
>>> arr[2:4]
[2, 1]
arr[2:4]
is a copy of the subsequence within arr that starts at index
location 2 and ends at index location 3. Ruby has similar notations(!) for
slices.
In Go, slices are a fundamental data type that are different than arrays. In practice, slices are used far more frequently than arrays in Go.
List Types¶
List types are similar to array types, although in practice list types are often (but not always) implemented as singly-linked lists.
Lisp is the canonical example of a language with good support for lists (and comparatively poor support for arrays).
One thing we will mention here is that the values stored in a linked-list are not necessarily stored contiguously in memory the way the values of an array are. Accessing the ith element of a linked list takes O(n) time, compared to O(1) time for an array.
On the plus side, adding or deleting a node (that you have a reference to) is an O(1) operation for a linked list, but the same sort of operation in an array is typically O(n).
Thus, for some operations arrays are more efficient, while for others linked lists are more efficient. Good programmers need to be aware of both so they can choose the right data structure for the job.
Associative Arrays¶
An associative array, also know as a dictionary, map, or hash (or ash table), is, essentially, and unordered collection of (key, value) pairs that is efficiently indexed by the keys. Often, but not always, they are implemented using hash tables, and so there may not be a natural ordering to the pairs.
Here’s a Python example of how you can use an associative array to count the number of occurrences of characters in a string:
line = "once upon a time a frog bounded into the kingdom"
char_count = {}
for c in line:
if c in char_count:
char_count[c] += 1
else:
char_count[c] = 1
>>> char_count
{'a': 2, ' ': 9, 'c': 1, 'b': 1, 'e': 4, 'd': 3, 'g': 2, 'f': 1,
'i': 3, 'h': 1, 'k': 1, 'm': 2, 'o': 6, 'n': 5, 'p': 1, 'r': 1,
'u': 2, 't': 3}
>>> char_count['e']
4
Record Types¶
A record is a collection of data values, possibly of different types, accessed by name. For example, here is a Go record that contains two values:
type Rectangle struct {
length, width int
}
You can use it like this:
r := Rectangle{} // makes a new Rectangle
fmt.Println(r) // "{0 0}"
r.width = 4
r.height = 2
fmt.Println(r) // "{2 4}"
In practice, the basic difference between a record and array is that array elements are accessed by an index value, and record values are accessed by a field name.
See p. 277 and 278 of the textbook for examples of record definitions in COBOL and Ada.
Tuple Types¶
Tuple types are similar to records except that the fields of a tuple are unnamed. In Python, for example, tuples are denoted with round brackets, e.g.:
>>> (2, 'a', [8, 2])
(2, 'a', [8, 2])
>>> (4,)
(4,)
>>> (4)
4
Notice in the second example a comma comes after the 4. This is required by Python to indicate that this is a tuple and not just the number 4 in brackets.
In Python, tuples are fixed in size. Unlike a Python list, you can’t make a tuple bigger or smaller. This makes tuples less flexible then lists, but possibly more efficient. Tuples may also be less error-prone since there are fewer functions that operate on them.
A very useful application of tuples is for functions that return multiple values. For example, Go functions often return tuples in support of error handling:
resp, err := http.Get("http://example.com/")
Here the http.Get
function returns two values, a response and an error
flag. This is essentially a tuple of 2 elements. Interestingly, Go only
supports tuples in function calls and assignments, and doesn’t let you create
a stand-alone tuple (as in Python).
Pointers and References¶
Pointer types are variables that can store the addresses of memory cells in
the computer. They also have a nil
value indicating that they are not
referring to a memory address.
In most languages, pointers are used to manage objects instantiated on the heap. In C, pointers are perhaps the defining feature of the language. C pointers are notoriously powerful, with a single pointer allowing practically unlimited access to the memory of a running programming.
For example, suppose an int
is 32 bits:
int* ip = new int; // C++
ip
points to the address of a newly allocated int
somewhere on the
heap. You can access the value as *ip
.
You can also write the expression ip + 1
. which points to the memory
address 32 bits after ip
. There’s no guarantee what is contained at
address ip + 1
, so this is a source of confusion, errors, and even
security breaches. But for many low-level applications, such quick and easy
access to specific memory locations is one of the strong points of C.
Most other languages that supports pointers don’t allow the above kind of
pointer arithmetic. For instance, Go has pointers, but if ip
is an
int
pointer, the expression ip + 1
is an error because you are not
allowed to do arithmetic with pointer types.
An important detail of C-style pointers is that they have a type. It is
perhaps not immediately obvious why this is the case since all address in a
computer are the same type. The type of the pointer helps in pointer
arithmetic. For example, if p
is a pointer of some type T
, then p +
1
is the address of the memory cell that is offset sizeof(T)
bytes away
from p
. For instance, if p
is a 32 bit int
, p + 1
is the
address starting 32 bits after p
. If p
were instead a 64-bit
double
, p + 1
is the addressing starting 64 bits after p
.
A dangling pointer is a pointer that points to a heap value that has been de-allocated. For example:
int* ip = new int(5); // C++
cout << *ip;
delete ip;
// ip is now a dangling pointer --- it no longer points to a valid int!
Nothing stops a programmer from accessing *ip
after delete
has been
called. The problem is that the value at *ip
is no longer anything
sensible — it has been deleted and so has no known meaning. In C++, calling
delete
on a dangling pointer can corrupt your programs memory (or not —
there is no well-defined behaviour here!).
In C++, std::unique_ptr
is a safer way to deal with dynamic memory:
std::unique_ptr<int> ip = new int;
cout << *ip;
// memory ip points to is automatically deleted when ip goes out of scope
// ip cannot be assigned: must use std::move to ensure only one owner for
// the data
In this example the programmer does not need to remember to explicitly delete
the memory ip
refers to. Indeed, it even correctly calls delete[]
when
the value pointed to is an array.
Another difficulty with pointers is memory leaks. For instance:
int oops() {
int* ip = new int(5); // C++
return *ip;
}
Every time you call oops
, you are creating a new int
on the heap, and
then destroying the pointer that refers to it, which means there is no longer
any way to access it. This is a memory leak: call oops
enough times and
your program will run out of memory.
These problems can be difficult to deal with in larger programs, and so they are one of the major reasons why garbage collectors have become popular. A garbage collector essentially keeps track of all allocated memory, and automatically deletes it when there are no more pointers to it (and so inaccessible).