Basic Data Types

In programming, a data type is a set of values and operations that manipulate those values.

A data type that is not defined in terms of other data types is called a primitive data type.

Numeric Types

Integers are a common numeric data type. Many languages have different sizes of integers, e.g. Java has byte, short, int, and long. These are signed integers, i.e. they have both positive and negative values, and are usually stored using the twos-complement encoding. C and C++ also support unsigned integers, which don’t have negative values.

Integers are typically represent by a finite sequence of bits, and so they have min and max values. Some languages, such as Python and many versions of Lisp, support arbitrarily long integers that can be much larger than the typical integers in other languages. For example, in Python 2 you can calculate 2 to the power of 1000 like this:

>>> 2 ** 1000
1071508607186267320948425049060001810561404811705533607443750388
3703510511249361224931983788156958581275946729175531468251871452
8569231404359845775746985748039345677748242309854210746050623711
4187795418215304647498358194126739876755916554394607706291457119
6477686542167660429831652624386837205668069376L

The L at the end denotes that this is a Python “long” variable, which uses a very different internal representation than a non-L integer.

Floating-point numbers are numbers that include a decimal point, like -76.103. They are trickier to represent than integers because binary encodings always result in some sort of rounding errors. In programs that do thousands and thousands of floating point calculations, even small rounding errors can add up to be significant problems. Most computers use the IEEE 754 standard for floating point numbers, which address many of the difficulties with representing floating point numbers with bits. For example, a single-precision IEEE float might be formatted like this:

| sign bit | exponent | fraction |
  1 bits     8 bits     23 bits

A few languages, such as Go, FORTRAN, and Python, also provide built-in support for complex numbers. For instance, in Python:

>>> (7 + 3j) * (7 + 3j)    # j is the square root of -1
(40+42j)
>>> 1j*1j
(-1+0j)

Another kind of numeric type that is most common in business software is the decimal type. Decimal types used a fixed number of decimal digits to represent numbers. Notably, COBOL (a business language), has decimal types, as do C# and F#.

A big advantage of decimal types is they can store, exactly, some values that floating point numbers can only approximate. For example, 0.1 can be represented precisely as a decimal type, but can only approximated as a floating point type.

Decimal types are stored like strings: each digit is sequence of bits. Thus decimal types often use more memory than floating point types.

Boolean Types

A boolean type has only two values: true and false. While many modern languages have explicit boolean types, they did not appear in some important older languages. For example, C originally did not have any standard boolean type, and so 0 was treated as false, and all other numbers as true. This allows for some tricky coding to be in C, which is fun for experts but unpleasant for beginners. More recent versions of C now have a standard boolean type, but C still lets you interpret numbers as booleans.

Character Types

Traditionally, character types were 8-bits used to denote a character. With 8 bits you can represent 256 different characters, which is enough for English, and languages with a similar character set, such as French. However, 256 is not enough to represent all the common characters and symbols found internationally, and so now many modern programming languages support more complex character types that can be used to represent Unicode characters. Java was the first popular language to provide support for Unicode.

Strings

String are sequence of characters, and in many applications they are an extremely important data type. Designing a good string data type is tricky because there are so many different ways that strings are used.

In C, strings are little more than arrays of characters that end with a null (to mark the end of the string), e.g.:

char[] s = "apple";   // compiler adds a 0 after 'e' to mark the end
                      // of the string

Thus, C strings don’t store their length, and calculating the length of such a string requires looping over all of its characters.

C strings are sometimes called limited dynamic strings, because a string’s size can change to be any value less than the size of its underlying array.

C has a rich library of string functions, but they are generally trickier and more awkward to use than the string functions in higher-level languages. Indeed, some C string functions are a common source of serious programming errors, e.g. consider strcpy(dest, src). Since strcpy doesn’t know the length of either of the 2 passed-in strings, nothing but programmer diligence stops copying a 50 character source into a 20 character destination (thus overwriting 30 characters past the end of dest).

Many languages, such as Java, C++, Python, Go, etc., support a high-level string data type that prevents many of the problems with C-style string handling. These strings classes are easy to use and come with many useful functions.

Immutable Strings

In Java, Python, and Go, strings are immutable (aka static strings), i.e. you cannot modify a string after you create it. Immutable strings often perform better than mutable ones because immutable strings can be safely shared by multiple variables with no aliasing problems. Also, immutable strings are useful in data structures such as maps or dictionaries. For example, Python dictionaries are essentially hash tables. If you use a string as a key in a dictionary, then the hash value is based on the string. If Python strings were mutable, then it would be possible to change the value of a string without changing its associated hash value, thus corrupting the dictionary.

In C++, however, strings (the ones based on the string class) are mutable, meaning you can edit them. For some applications, this is quite useful. But copying strings is expensive, and efficiently managing the memory they use can be hard.

Some languages, such as Perl, Ruby, JavaScript, and PHP have regular expressions built into the language itself. Regular expressions are useful for many string processing problems. Regular expressions are also common in many Linux/Unix utilities.

Implementations of Strings

Static strings compile-time descriptor:

-------------------------
| characters            |
-------------------------
| length                |
-------------------------
| address of first char |
-------------------------

Limited dynamic strings have this run-time descriptor:

-------------------------
| characters            |
-------------------------
| max length            |
-------------------------
| current length        |
-------------------------
| address of first char |
-------------------------

Fully dynamic strings, as in C++ or JavaScript, are more complicated. One way to implement them is to use an array and special rules for when to increase the size of the array. The hard part about this approach is how to increase the size of the string in a way that is efficient in both time and memory.

Enumeration Types

Enumeration types let programmers define a small set of named values. For example:

public enum Day {                         // Java
    SUNDAY, MONDAY, TUESDAY, WEDNESDAY,
    THURSDAY, FRIDAY, SATURDAY
}

This lets you create variables of type Day whose values are restricted to the ones inside the enum. You will get a compiler error if you assign a value not listed in the enum for Day.

Java’s approach to enumerated types is somewhat more sophisticated than in, say, C:

enum Day {                                 // C
    SUNDAY, MONDAY, TUESDAY, WEDNESDAY,
    THURSDAY, FRIDAY, SATURDAY
};

In a C enum, the values are assigned to integers, and so can be used interchangeably with integers. But each value of a Java enum is its own object, and so can’t intermingle with integers or any other primitive types.

In Haskell, there is a more powerful kind of enumerated type called a sum type. One example would be something like this:

data Bool = False | True

One of the nice features of this is that in its equivalent of a switch statement, Haskell can check that all possible values of Bool have been handled. This is only the simplest example of a Haskell sum type, and they have many more useful features.

Go does not have enumerated types in the usual sense, but it does provide a novel way of creating a series of related numeric constants. Consider:

const (
    Sunday = iota   // 0
    Monday          // 1
    Tuesday         // 2
    Wednesday       // 3
    Thursday        // 4
    Friday          // 5
    Saturday        // 6
)

iota is a predefined identifier that is used inside a const declaration to generate a sequence of integers. The first constant is initialized to 0, and the rest are numbered successively.

Subrange Types

Ordinal types are, roughly speaking, types that can be mapped to the set of integers. For instance, integer types, boolean types, and enumerated types can all be considered ordinal types.

A subrange type is a contiguous subsequence of an ordinal type. For example:

--
-- Ada code
--
type Days is (Mon, Tue, Wed, Thu, Fri, Sat, Sun);
subtype Weekdays is Days range Mon..Fri;
subtype Index is Integer range 1..100;

Arrays

An array is a sequence of values where each element is identified by its relative position in the sequence. In most languages, the elements of an array must all be of the same type. In languages like Python or Ruby, arrays can appear to contain different types of values, but the arrays contain references to the values, and references in those languages are all of the same type.

Since all the values stored in an array are of the same type, we sometimes say that an array is a homogeneous sequence.

Array Index Values

Arrays are accessed via index values. For instance, in the languages Ada and Pascal, if A is an array, then A(i) refers to the element at location i. Notice the use of round brackets. Function calls also use round brackets in those languages, and so you can’t tell if the expression f(i) is an array access or a function call. For this reason, many other languages use square-bracket notation for array access, writing A[i]. This makes it easier for both the compiler and the human programmer to recognize when arrays are being used.

Negative Indices

In C-style languages, an array of length n has its indices range from 0 to n - 1. Some languages, such as Lua, instead range from 1 to n. Python is interesting because it allows you both positive and negative indices:

      0      1      2
A = ["cat", "dog", "wet"]    # Python
     -3     -2     -1

In Python, you can use either A[0] or A[-3] to access the first element. Perhaps the most useful aspect of negative indices is that they give a simple way to access the last element of an array, i.e. A[-1] is always the last element of A. This is much easier than typing A[len(A) - 1].

Why Start at 0?

Many programmers wonder why starting at 0 is so common in arrays. The answer seems to be that experience shows it is a little more convenient for many common calculations than starting at 1. 0-based arrays can be seen as treating indices like distances, i.e. the first element is distance 0 from the start of the array, the second element is distance 1 from the start of the array, and so on. So, just like a ruler, it can be quite natural to think of arrays as starting at 0.

Range Checking

Another important issue with indices is range-checking. If array A has index values that range from 0 to n - 1, then A[i] only makes sense when 0 <= i < n. Values of i less than 0, or greater than n - 1, are errors. Most modern languages (and many, many older ones!) would automatically check that array indices were in this range, and issue an error if they weren’t. But C did not do this, presumably for performance reasons (i.e. checking every array access takes time). Thus, out-of-range bugs are a big problem in C code, and have been the source of many security bugs.

Static and Dynamics Arrays

It’s useful to distinguish between at least two types of arrays: static and dynamic. A static array has a fixed size that cannot change. C, C++, and Java are examples of languages with static arrays. An advantage of static arrays is that they always use a fixed amount of memory, and so after they are allocated they never need to be re-sized. But on the downside, there are many cases where you really do want an array that can change size, either shrink or grow, and static arrays offer no help with that.

So dynamic arrays, as in Python or with vector in C++, can grow and shrink on demand. This is very convenient for many applications, but the tricky part is to make dynamic arrays efficient in both time and memory. Dynamic arrays are often implemented as a static array plus a variable keeping track of the true length of the dynamic array. When the size of the dynamic array must expand past the size of the static backing array, a new, larger static array is created and the old values copied into it. As you can imagine, this takes a lot of time and space, and so dynamic arrays typically do tricks like make the backing array twice the size of the dynamic array so that future increases in its size are cheap.

Go arrays are static, but in practice they are not used that often. Instead, Go provides slices, which are objects that point to a sub-sequence of an array. Slices are essentially dynamic arrays, and using the append function you can safely increase their size when necessary.