Pointers, Memory, and Simple Sorting

C++ Memory

managing memory turns out to be one of the trickiest and most important parts of C++ programming

a C++ program divides its memory into three main components

static memory has a fixed size (that is known at compile-time) that never changes; global variables are stored here

stack memory changes in size, and is where local variables and function parameters are stored; as the name suggests, stack memory is implemented as a stack, where new is pushed/popped only on the top of the stack; stack memory is handled automatically by C++, e.g. variables are pushed onto the stack when they are defined, and popped from the stack when they go out of scope

free store (heap) memory is all the other memory, and it is accessed in C++ using new and delete/delete[]; it allows programmers to access chunks of memory while a program is running, and that memory is allowed to exist between function calls; by default, free store memory must be manually managed by the programmer, which turns out to be a surprisingly difficult thing to do correctly

in C++, free store memory is accessed via pointers

a pointer is a data type that, essentially, stores the address of some memory

double x = 5.2; // x is a local variable, thus it’s stored on the stack

here, x names a chunk of memory that stores the value 5.2

5.2 is stored at some location in the computer’s memory, and that location has an address

double* x_addr = &x;

the & operator returns the address of a variable

double* is the data type “pointer to a double”, i.e. x_addr is only allowed to store the address of a memory location that store’s a double

The Free Store

when we access the free store, we must do it through pointers

double* y_addr = new double(6.4);

new always returns a pointer

use the * to de-reference a pointer, i.e. to access the value it points to

cout << *y_addr << "\n";
*y_addr += 2;
cout << *y_addr << "\n";

in C++, having a pointer to a value lets us read/write that value in the usual way using the *

Deleting Memory

whenever you new some memory, you must also be sure to delete it when you’re done

delete y_addr;

if you forget to do this, your program is said to have a memory leak

undeleted memory cannot be re-used

for small, toy programs that’s fine: when the program ends any undeleted memory is given back to the operating system

for long-running programs (such as an operating system, a web server, a web browser, a text editor, etc.), this can be a major problem that eventually causes them to crash

experience shows that finding and fixing memory leaks is very difficult in general

many other programming languages use an automatic memory deleter know as a garbage collector to deal with this

but garbage collectors can have a run-time performance impact that is too high for many of the applications that C++ is used for

instead, C++ provides destructors, and also smart pointers, to give programmers control over memory in a way that it is just about as easy having a garbage collector

Problem

Problem: Given a path name to a folder (on Linux), print all its files (not folders!) ordered from largest to smallest. Print the name of each file along with its size (in bytes).

a good strategy for solving programming problems like this to first write down a high-level description of the steps you think will solve the problem

  1. get the name of the folder (using cin)
  2. get a list of all the file names, and their sizes, contained in the folder
  3. sort the list of file names and sizes in order from biggest size to smallest size
  4. print the list

this is quite high-level — it looks nothing like C++ code!

but it is a description of a program that, if we could do each step, should solve our problem

the first step is easy, but the other steps are not so simple

indeed, it is not at all clear how, in C++, one is supposed to get files, and their sizes, given just the name of a folder

so lets list the sub-problems we’ll need to solve

we will have to

  • determine the number of bytes in a given file
  • determine if something in a folder is a folder is a file or a folder (we’ll ignore symbolic links)
  • get all the files in a given folder

lets figure out how to do these basic things first

note that we are assuming a Linux file system

the functions needed for other file systems might be different!

Getting the Size of a File

in Linux, the stat command can be used to get lots of information about a file, e.g.:

$ stat error.h
  File: ‘error.h’
  Size: 556         Blocks: 8          IO Block: 4096   regular file
Device: 802h/2050d  Inode: 75240436    Links: 1
Access: (0664/-rw-rw-r--)  Uid: ( 1000/     tjd)   Gid: ( 1000/     tjd)
Access: 2015-01-10 14:32:17.145603274 -0800
Modify: 2014-12-03 14:40:32.385175073 -0800
Change: 2014-12-03 14:40:32.385175073 -0800
 Birth: -

among other things, you can see that the size of the file error.h is 556 bytes

it turns out there is a similar stat function available in C++ that returns a stat struct defined like this:

//
// from http://linux.die.net/man/2/stat
//
struct stat {
    dev_t     st_dev;     // ID of device containing file
    ino_t     st_ino;     // inode number
    mode_t    st_mode;    // protection
    nlink_t   st_nlink;   // number of hard links
    uid_t     st_uid;     // user ID of owner
    gid_t     st_gid;     // group ID of owner
    dev_t     st_rdev;    // device ID (if special file)
    off_t     st_size;    // total size, in bytes
    blksize_t st_blksize; // blocksize for file system I/O
    blkcnt_t  st_blocks;  // number of 512B blocks allocated
    time_t    st_atime;   // time of last access
    time_t    st_mtime;   // time of last modification
    time_t    st_ctime;   // time of last status change
};

this struct contains all the basic info associated with a file in Linux

the size, in bytes, is st_size

the data type off_t is an integral type (perhaps unsigned long)

the stat struct is important because it is what the stat function gives us:

#include <sys/stat.h>

void print_file_info(const string& s) {
    struct stat info; // note that we must put struct first!

    // the stat function expects a C-style string, so we can't pass it s
    // instead, we call s.c_str() to get a C-style string
    //
    // the second parameter to the stat function expects a pointer to a
    // stat object
    //
    // if the call to stat returns -1, then no file of that name was found
    //
    if (stat(s.c_str(), &info) == -1) // note the & at the start of info!
    {
        cmpt::error("file does not exist");
    }
    cout << s << " is " << info.st_size << " bytes\n";
}

you can use this function to print the size of any file

this even works if s is the name of a directory

notice how the return value of stat is used to signal success/failure

and the stat struct is modified through C-style pass-by-reference

Testing for a File/Folder

it turns out if we have a stat structure for a file/folder, we can test if it is folder by passing the value st_mode to the function S_ISDIR:

#include <sys/stat.h>

void print_file_info(const string& s) {
    struct stat info; // note that we must put struct first

    // the stat function expects a C-style string, so we can't pass it s
    // instead, we call s.c_str() to get a C-style string
    //
    // the second parameter to the stat function expects a pointer to a
    // stat object
    //
    // if the call to stat returns -1, then no file of that name was found
    //
    if (stat(s.c_str(), &info) == -1) // note that it's &info
    {
        cmpt::error("file does not exist");
    }
    cout << s << " is " << info.st_size << " bytes, ";

    if (S_ISDIR(info.st_mode)) {
        cout << "it's a folder\n";
    } else {
        cout << "it's a file\n";
    }
}

Getting All File Names in a Folder

the final sub-problem we want to solve is how to get a list of all the files in a given folder

to do this, we’ll use dirent.h

while not an official standard library file, dirent.h is pretty commonly used to get directory info in C/C++

dirent.h has some structures and functions we need to know

DIR is a struct that is used to store information about the entries in a directory

strictly speaking, we know nothing about it’s actual contents — we only know that it is named DIR

the opendir function opens a directory:

DIR* opendir(const char *);

opendir takes a path name (as a C-style string) and returns a pointer to a newly created DIR object

if the folder can’t be opened (e.g. because there is no folder of that name), a null pointer is returned

when you are done, use closedir to close a directory:

int closedir(DIR*);

closedir takes a pointer to the DIR object to close

dirent is a struct that contains information about the individual files/folders in a DIR object

a dirent struct has, at least, these two entries:

ino_t  d_ino      // file serial number
char   d_name[]   // name of entry (C-style string)

we will only be concerned with d_name here

to get the next dirent object from a DIR, use

struct dirent* readdir(DIR*);

readdir takes a pointer to a DIR object as input, and returns a pointer to the next dirent object

readdir returns a null pointer when there are no more dirent objects

now we combine all these structure and functions into a function that returns all the names in a directory:

vector<string> dirlist(const string& path) {
    DIR* cwd = opendir(path.c_str());
    if (cwd == nullptr) cmpt::error("couldn't open directory");

    vector<string> result;
    dirent* entry;
    while ((entry = readdir(cwd))) {
        result.push_back(entry->d_name);
    }

    closedir(cwd);

    return result;
}

Solving the Original Problem

now that we know how to solve the sub-problems, let solve the original problem

since we will need to store the name and size of a file, lets create a struct to do that:

struct File {
    string name;
    off_t size;  // off_t is an integer type from the stat object
};

we use the data type off_t because that’s the type of the file size, and we don’t want to have to do any type conversions (which might cause errors/confusion)

now the idea is to read all the files, and their sizes, in a vector<File> that we can sort in whatever order is convenient

vector<File> get_files(const string& folder_name) {
    // open the given folder
    DIR* cwd = opendir(folder_name.c_str());
    if (cwd == nullptr) cmpt::error("couldn't open directory");

    // store all the file names in DIR in result
    vector<File> result;
    dirent* entry;
    while ((entry = readdir(cwd))) {
        struct stat info;
        if (stat(entry->d_name, &info) == -1) {
            cmpt::error("file does not exist");
        }
        if (S_ISDIR(info.st_mode)) {
            // it's a folder, ignore it
        } else {
            result.push_back(File{entry->d_name, info.st_size});
        }
    } // while

    closedir(cwd);
    return result;
} // get_files

now we can get a vector of all the files in a given folder like this:

vector<File> files = get_files(".");  // "." means current directory
for(const File& f : files) {
    cout << f.name << ", " << f.size << " bytes\n";
}

"." is the standard Linux notation for “current directory”

Sorting a Vector

sorting data into order is an interesting and important topic

sorting has applications in many different areas of programming

many different sorting algorithms have been created

some are fast, some are slow, some use extra memory, some don’t

first, let’s see how to sort a vector<File> using the standard sort function that comes with C++

then we’ll implement our own simple sorting routine

Sorting with the Standard C++ sort

the C++ sort function can be used to sort any vector or array (and pretty much any other sequential data structure defined in the right way)

it is very efficient in both time and memory usage

for sort to work, it has to know how to compare the objects being sorted

there is no < operator defined for our File objects, so what we can do is pass in a boolean comparison function that sort will use to do the comparisons

#include <algorithm>

// ...

// comparison function
bool less_than_size(const File& a, const File& b) {
    return a.size < b.size;
}

void test() {
    vector<File> files = get_files(".");

    sort(files.begin(), files.end(), less_than_size);
    reverse(files.begin(), files.end());

    for(const File& f : files) {
        cout << f.name << ", " << f.size << " bytes\n";
    }
}

note that we use the reverse function to reverse the order of the files so they are from most number of bytes to least number of bytes (as required by the original problem)

here’s how we could sort the files alphabetically by name:

bool less_than_name(const File& a, const File& b) {
    return a.name < b.name;
}

bool less_than_size(const File& a, const File& b) {
    return a.size < b.size;
}

void test3() {
    vector<File> files = get_files(".");

    sort(files.begin(), files.end(), less_than_size);
    reverse(files.begin(), files.end());

    for(const File& f : files) {
        cout << f.name << ", " << f.size << " bytes\n";
    }

    cout << "-------\n";

    sort(files.begin(), files.end(), less_than_name);

    for(const File& f : files) {
        cout << f.name << ", " << f.size << " bytes\n";
    }
}

Writing Our Own Sorting Function

in practice, it almost always makes sense to use the C++ sort function when you need to sort data

it is fast, efficient, flexible, and almost certainly error-free

but it is instructive to see how to write our own simple sorting function

initially, we won’t be overly concerned with performance and instead aim just for correctness

later in the course we will see some more efficient sorting algorithms

A Very Slow Sorting Algorithm: Bubblesort

bubblesort is a simple sorting algorithm with a cute name

however, it is terribly slow in practice — its slowness is probably the most interesting thing about it!

so never use bubblesort for sorting large amounts of data!

but it’s instructive to walk through its development

the idea of bubblesort is to repeatedly scan through the list of items you want to sort, and swap adjacent items that are out of order

you stop after you scan through the vector without having swapped any items

lets trace an example by hand

we start with this list of numbers that we want to put into ascending order

8 3 6 1 5

Scan 1

8 3 6 1 5
^ ^

3 8 6 1 5
  ^ ^

3 6 8 1 5
    ^ ^

3 6 1 8 5
      ^ ^

3 6 1 5 8
      ^ ^

Scan 2

3 6 1 5 8
^ ^

3 6 1 5 8
  ^ ^

3 1 6 5 8
    ^ ^

3 1 5 6 8
      ^ ^

3 1 5 6 8

Scan 2

3 1 5 6 8
^ ^

1 3 5 6 8
  ^ ^

1 3 5 6 8
    ^ ^

1 3 5 6 8
      ^ ^

Scan 3

1 3 5 6 8
^ ^

1 3 5 6 8
  ^ ^

1 3 5 6 8
    ^ ^

1 3 5 6 8
      ^ ^

1 3 5 6 8

notice that no swaps are done during the final scan

that’s how bubblesort knows that the numbers must be in ascending order

Implementing Bubblesort

now lets write our own bubblesort to sort a vector<File> object from smallest size to largest size

void bubblesort(vector<File>& files) {
    bool sorted = false;
    while (!sorted) {
        sorted = true;
        for(int i = 1; i < files.size(); ++i) {  // note i starts at 1
            if (files[i - 1].size > files[i].size) {
                sorted = false;
                // swap values in files[i - 1] and files[i]
                File temp = files[i - 1];
                files[i - 1] = files[i];
                files[i] = temp;
            }
        }
    }
}

look at the bool variable sorted

it controls the outer while-loop

it must start out false so that the while runs at least once

then, inside the loop, it is immediately set to true

then, in the for-loop, if one, or more, swaps, are made, we set sorted to false

this means that at least one more scan will be done after this one

using a bool variable like this to signal and event is a common kind of trick in programming

we sometimes call such variables “flags”

we could have simplified the main bubblesort code a little bit by putting the swapping code in its own function:

void swap(File& a, File& b) {
    File temp = a;
    a = b;
    b = temp;
}

void bubblesort(vector<File>& files) {
    bool sorted = false;
    while (!sorted) {
        sorted = true;
        for(int i = 1; i < files.size(); ++i) {  // note i starts at 1
            if (files[i - 1].size > files[i].size) {
                sorted = false;
                swap(files[i - 1], files[i]);
            }
        }
    }
}

this doesn’t change the functionality or performance (much, at least)

but it does make the code easier to read and thus easier to understand