Assignment 4 - Hash Tables and the Dictionary ADT

In this assignment you are to implement a Dictionary ADT (also known as Map, or Associative Array) using a hash table data structure. You will use this to analyze a collection of text documents efficiently.

Suppose your job is to organize the world's information. Let us start with a small collection of fine works of English literature (provided data are public domain in Canada).

Let's try to provide two pieces of functionality to users of these data.

Build a histogram of word frequencies in a document. Provide a list of all unique words in a document with the number of times each occurs. E.g. how many times did the word "comrades" appear in Animal Farm?
Find all documents that contain a particular word. E.g. find all documents in which "thoughtcrime" occurs.

Start by downloading the assignment files. This zipfile contains two directories, one for each part, with makefiles, answers.txt, a test script and ground truths, a few books, and stubs for all of the .h and .cpp files you need.

Please do not create any additional .h or .cpp files.
You can add to the .h files, but do not modify any of the provided method prototypes. (Except for part 2.)
You can modify hashtable_test.cpp for testing, but your code must work with the original.

Naive Algorithms

Start by considering naive algorithms and data structures for performing these queries.

Histograms of word frequencies.

read all w words into an array "words"
for i = 1 to w
  count = 0
  for j = 1 to w
    if words[i] == words[j]
      if j < i
        break
      end
      count++
    end
  end
  print words[i] + ": " + count
end

Answer the questions about this code in answers.txt (Naive 1).

Finding documents.

for i = 1 to d
  read all w words from document i into array "words"
  for j = 1 to w
    if words[j] == keyword
      print "Document " + i + " contains " + keyword
      break
    end
  end
end

Answer the questions about this code in answers.txt (Naive 2).

Faster Algorithms

These naive algorithms are very slow for reasonably large numbers of words, documents, and/or users (see answers.txt). You will now develop better versions, using a hash table to implement the Dictionary ADT.

Recall the specification of the Dictionary ADT. It allows the following operations:

Insertion of (key, value) pair
Removal of (key, value) pair
Lookup of value for key
Modification of value for existing (key, value) pair

We will implement the Dictionary ADT using a hash table data structure. All of the operations should be performed in O(1) average case (O(n) worst case) time. Conveniently, the hash table has the same operations, so we will not actually have a Dictionary class. But note that the Dictionary ADT can be implemented using data structures other than hash tables.

Part 0: Implementing the hash table

For this part, please look at HashTable.h and HashTable.cpp. Much of this class has been written for you. Please examine it, and understand how the hash table is implemented.

Your task is to write:

remove
lookup
modify

Don't modify the hash function yet.

Verify that you have implemented the hash function correctly. You can run the test script test.py to test your code.

This test script is more intelligent than previous ones -- if you add cout statements in your code for debugging, it will ignore these.

There is also a debugging program ht_debug.cpp (builds into ht_debug with the makefile). You can use this to test out parts of your code too.

Part 1: Word frequencies

The provided code word_frequencies.cpp is complete. This uses your hash table to compute word frequencies according to the following algorithm.

for each word (w words)
  // Each word
  lookup word in hash table
  if it is present, increment its value by 1
  else insert it into the hash table
end

keys = array of all keys (words) in the hash table
for each key
  lookup count in hashtable
  print key + ": " + count
end

What is the average case time complexity of this pseudocode, using the provided hash function? If we used a "good" hash function, what do you think the average case time complexity would be? Write your answer in answers.txt.
Try running the code to see how fast it is.
```
./word_frequencies data/1400.txt
./word_frequencies data/pg2600.txt
```
(Ctrl-C terminates a running program.)
Modify the hash function in HashTable.cpp to improve the performance. Try running word_frequencies on War and Peace (pg2600.txt) again. You shouldn't feel like using Ctrl-C.
A good hash function to use would be the following:
```
ch₀ * 32^n-1 + ch₁ * 32^n-2 + ... + ch_n-2 * 32¹ + ch_n-1 * 32⁰
```
where ch₀ is the left most character of the string, characters are represented as numbers in 1-26 (case insensitive), and n is the size of the string. For example the string "cat" has the value:
```
3 * 32² + 1 * 32 + 20 = 3124
```
Note that using this calculation on large strings will result in numbers that will cause overflow. You should therefore use Horner's rule to perform the calculation, and apply the mod operator after computing each expression in Horner's rule. Horner's rule will be briefly discussed in class.

Part 2: Indexing documents (bonus marks)

The provided code index_directory.cpp is complete. It uses a hash table to allow indexing documents, finding all documents that contain a certain substring.

// Preprocessing to build the index
for each document
  for each word in the document
    add document to the list of documents for this word
  end
end

// Answer queries
while true
  ask user for query word
  find all documents containing word
end

InvertedIndex is similar to the first HashTable. However, instead of storing integer counts, it should store lists of document names. This is done in the methods InvertedIndex::add and InvertedIndex::lookup. You can modify how these are implemented if you wish, e.g. with a linked list, a vector, a set, etc. You might need to modify index_directory.cpp based on your decisions.

This data structure is called an inverted index, and is a standard, very useful data structure for document retrieval. Modern search engines essentially use this (with many details).

You can use the program document_index to find documents in a specified directory. document_index takes a directory as a parameter, and builds an inverted index for all files in all subdirectories. It can further limit the index to specified file types (e.g. just .h and .cpp). After building the index, the user can query for different words.

> ./index_directory .
Found a file: ./ii_debug
Found a file: ./ii_debug.cpp
Found a file: ./index_directory
Found a file: ./index_directory.cpp
Found a file: ./InvertedIndex.cpp
Found a file: ./InvertedIndex.h
Found a file: ./InvertedIndex.o
Found a file: ./Makefile
Indexed .
  found 0 subdirectories and 8 files
  built index with 789 keys
Enter a word to search for, q to terminate
break
Searched for break
 found 1 occurrences: 
{./index_directory.cpp, }
Enter a word to search for, q to terminate
for
Searched for for
 found 4 occurrences: 
{./InvertedIndex.cpp, ./InvertedIndex.h, ./index_directory, ./index_directory.cpp, }
Enter a word to search for, q to terminate
q

The next part is open-ended. Feel free to be creative, and make improvements to this application and InvertedIndex class. Here are some suggestions:

Create a templated version that allows one to map anything to anything, and use it to map strings to vectors for example, and use this in the application.
Modify InvertedIndex to use separate chaining.
Modify InvertedIndex to use a different probing strategy (double hashing, quadratic).
Make the hash table dynamically growable -- insert should double the table size if the hash table is getting too full. This application might make a big table, but its size is difficult to predict a priori.
Change the inverted index to return a sorted list of documents, sorted by a score, for instance the number of occurrences of the word in the document.
Modify the application and inverted index to allow saving a built index to disk, and reading it back.
Modify the application and inverted index to support matching non-alphabetic characters, e.g. search for "break;".
Modify the application and inverted index to support partial matching of strings, e.g. "brea" matches "break;".

Assessment and Submission

Submission

You should submit your assignment online to the CourSys submission server. You should submit the following:

Part 1:

Modified HashTable.h
Modified HashTable.cpp

Part 2:

A single file part2.zip with only your directory part2. This zip file should contain everything needed to build your code -- running the following should build your code:
```
unzip part2.zip
cd part2
make
```
A PDF file report.pdf with a description of what you did. Please include a description of the modifications you chose to make, any interesting design decisions you made, and example output of running your application.

Please read the documentation on the submission site for further information. The assignment is due at 11:59pm on April 10.

Assessment

The submitted files must build with the provided makefile in order to receive marks. Part 1 is worth 3% and marks are allocated to the assignment as follows:

Correctness 1.5% (provided and held-out test cases, efficient O(1) average case hash table)
Answers to questions 0.75%
Memory management 0.25% (valgrind --error-exitcode=1 -q --leak-check=full ./word_frequencies data/pg2600.txt)
Coding style 0.5% (use of functions and loops, code indentation and spacing, comments and variable naming)

Part 2 is open-ended, and bonus marks will be awarded based on the creativity, scope, correctness, and description of the application.

Back to the CMPT 225 homepage.

CMPT 225 Assignment 4: Hash Tables and the Dictionary ADT