CMPT 225 Assignment 4: Hash Tables and the Dictionary ADT


In this assignment you are to implement a Dictionary ADT (also known as Map, or Associative Array) using a hash table data structure. You will use this to analyze a collection of text documents efficiently.

Suppose your job is to organize the world's information. Let us start with a small collection of fine works of English literature (provided data are public domain in Canada).

Let's try to provide two pieces of functionality to users of these data.

  1. Build a histogram of word frequencies in a document. Provide a list of all unique words in a document with the number of times each occurs. E.g. how many times did the word "comrades" appear in Animal Farm?
  2. Find all documents that contain a particular word. E.g. find all documents in which "thoughtcrime" occurs.

Start by downloading the assignment files. This zipfile contains two directories, one for each part, with makefiles, answers.txt, a test script and ground truths, a few books, and stubs for all of the .h and .cpp files you need.


Naive Algorithms

Start by considering naive algorithms and data structures for performing these queries.

  1. Histograms of word frequencies.
    read all w words into an array "words"
    for i = 1 to w
      count = 0
      for j = 1 to w
        if words[i] == words[j]
          if j < i
            break
          end
          count++
        end
      end
      print words[i] + ": " + count
    end
    
    Answer the questions about this code in answers.txt (Naive 1).
  2. Finding documents.
    for i = 1 to d
      read all w words from document i into array "words"
      for j = 1 to w
        if words[j] == keyword
          print "Document " + i + " contains " + keyword
          break
        end
      end
    end
    
    Answer the questions about this code in answers.txt (Naive 2).

Faster Algorithms

These naive algorithms are very slow for reasonably large numbers of words, documents, and/or users (see answers.txt). You will now develop better versions, using a hash table to implement the Dictionary ADT.

Recall the specification of the Dictionary ADT. It allows the following operations:

We will implement the Dictionary ADT using a hash table data structure. All of the operations should be performed in O(1) average case (O(n) worst case) time. Conveniently, the hash table has the same operations, so we will not actually have a Dictionary class. But note that the Dictionary ADT can be implemented using data structures other than hash tables.


Part 0: Implementing the hash table

For this part, please look at HashTable.h and HashTable.cpp. Much of this class has been written for you. Please examine it, and understand how the hash table is implemented.

Your task is to write:

Don't modify the hash function yet.

Verify that you have implemented the hash function correctly. You can run the test script test.py to test your code.

This test script is more intelligent than previous ones -- if you add cout statements in your code for debugging, it will ignore these.

There is also a debugging program ht_debug.cpp (builds into ht_debug with the makefile). You can use this to test out parts of your code too.


Part 1: Word frequencies

The provided code word_frequencies.cpp is complete. This uses your hash table to compute word frequencies according to the following algorithm.

for each word (w words)
  // Each word
  lookup word in hash table
  if it is present, increment its value by 1
  else insert it into the hash table
end

keys = array of all keys (words) in the hash table
for each key
  lookup count in hashtable
  print key + ": " + count
end

Part 2: Indexing documents (bonus marks)

The provided code index_directory.cpp is complete. It uses a hash table to allow indexing documents, finding all documents that contain a certain substring.

// Preprocessing to build the index
for each document
  for each word in the document
    add document to the list of documents for this word
  end
end

// Answer queries
while true
  ask user for query word
  find all documents containing word
end

InvertedIndex is similar to the first HashTable. However, instead of storing integer counts, it should store lists of document names. This is done in the methods InvertedIndex::add and InvertedIndex::lookup. You can modify how these are implemented if you wish, e.g. with a linked list, a vector, a set, etc. You might need to modify index_directory.cpp based on your decisions.

This data structure is called an inverted index, and is a standard, very useful data structure for document retrieval. Modern search engines essentially use this (with many details).

You can use the program document_index to find documents in a specified directory. document_index takes a directory as a parameter, and builds an inverted index for all files in all subdirectories. It can further limit the index to specified file types (e.g. just .h and .cpp). After building the index, the user can query for different words.

> ./index_directory .
Found a file: ./ii_debug
Found a file: ./ii_debug.cpp
Found a file: ./index_directory
Found a file: ./index_directory.cpp
Found a file: ./InvertedIndex.cpp
Found a file: ./InvertedIndex.h
Found a file: ./InvertedIndex.o
Found a file: ./Makefile
Indexed .
  found 0 subdirectories and 8 files
  built index with 789 keys
Enter a word to search for, q to terminate
break
Searched for break
 found 1 occurrences: 
{./index_directory.cpp, }
Enter a word to search for, q to terminate
for
Searched for for
 found 4 occurrences: 
{./InvertedIndex.cpp, ./InvertedIndex.h, ./index_directory, ./index_directory.cpp, }
Enter a word to search for, q to terminate
q

The next part is open-ended. Feel free to be creative, and make improvements to this application and InvertedIndex class. Here are some suggestions:


Assessment and Submission

Submission

You should submit your assignment online to the CourSys submission server. You should submit the following:

Part 1: Part 2:

Please read the documentation on the submission site for further information. The assignment is due at 11:59pm on April 10.

Assessment

The submitted files must build with the provided makefile in order to receive marks. Part 1 is worth 3% and marks are allocated to the assignment as follows: Part 2 is open-ended, and bonus marks will be awarded based on the creativity, scope, correctness, and description of the application.
Back to the CMPT 225 homepage.