In this assignment you are to implement a Dictionary ADT (also known as Map, or Associative Array) using a hash table data structure. You will use this to analyze a collection of text documents efficiently.
Suppose your job is to organize the world's information. Let us start with a small collection of fine works of English literature (provided data are public domain in Canada).
Let's try to provide two pieces of functionality to users of these data.
Start by downloading the assignment files. This zipfile contains two directories, one for each part, with makefiles, answers.txt, a test script and ground truths, a few books, and stubs for all of the .h and .cpp files you need.
Start by considering naive algorithms and data structures for performing these queries.
read all w words into an array "words" for i = 1 to w count = 0 for j = 1 to w if words[i] == words[j] if j < i break end count++ end end print words[i] + ": " + count endAnswer the questions about this code in answers.txt (Naive 1).
for i = 1 to d read all w words from document i into array "words" for j = 1 to w if words[j] == keyword print "Document " + i + " contains " + keyword break end end endAnswer the questions about this code in answers.txt (Naive 2).
Recall the specification of the Dictionary ADT. It allows the following operations:
We will implement the Dictionary ADT using a hash table data structure. All of the operations should be performed in O(1) average case (O(n) worst case) time. Conveniently, the hash table has the same operations, so we will not actually have a Dictionary class. But note that the Dictionary ADT can be implemented using data structures other than hash tables.
For this part, please look at HashTable.h and HashTable.cpp. Much of this class has been written for you. Please examine it, and understand how the hash table is implemented.
Your task is to write:
Don't modify the hash function yet.
Verify that you have implemented the hash function correctly. You can run the test script test.py to test your code.
This test script is more intelligent than previous ones -- if you add cout statements in your code for debugging, it will ignore these.
There is also a debugging program ht_debug.cpp (builds into ht_debug with the makefile). You can use this to test out parts of your code too.
The provided code word_frequencies.cpp is complete. This uses your hash table to compute word frequencies according to the following algorithm.
for each word (w words) // Each word lookup word in hash table if it is present, increment its value by 1 else insert it into the hash table end keys = array of all keys (words) in the hash table for each key lookup count in hashtable print key + ": " + count end
./word_frequencies data/1400.txt ./word_frequencies data/pg2600.txt(Ctrl-C terminates a running program.)
A good hash function to use would be the following:
ch0 * 32n-1 + ch1 * 32n-2 + ... + chn-2 * 321 + chn-1 * 320
where ch0 is the left most character of the string, characters are represented as numbers in 1-26 (case insensitive), and n is the size of the string. For example the string "cat" has the value:
3 * 322 + 1 * 32 + 20 = 3124
Note that using this calculation on large strings will result in numbers that will cause overflow. You should therefore use Horner's rule to perform the calculation, and apply the mod operator after computing each expression in Horner's rule. Horner's rule will be briefly discussed in class.
The provided code index_directory.cpp is complete. It uses a hash table to allow indexing documents, finding all documents that contain a certain substring.
// Preprocessing to build the index for each document for each word in the document add document to the list of documents for this word end end // Answer queries while true ask user for query word find all documents containing word end
InvertedIndex is similar to the first HashTable. However, instead of storing integer counts, it should store lists of document names. This is done in the methods InvertedIndex::add and InvertedIndex::lookup. You can modify how these are implemented if you wish, e.g. with a linked list, a vector, a set, etc. You might need to modify index_directory.cpp based on your decisions.
This data structure is called an inverted index, and is a standard, very useful data structure for document retrieval. Modern search engines essentially use this (with many details).
You can use the program document_index to find documents in a specified directory. document_index takes a directory as a parameter, and builds an inverted index for all files in all subdirectories. It can further limit the index to specified file types (e.g. just .h and .cpp). After building the index, the user can query for different words.
> ./index_directory . Found a file: ./ii_debug Found a file: ./ii_debug.cpp Found a file: ./index_directory Found a file: ./index_directory.cpp Found a file: ./InvertedIndex.cpp Found a file: ./InvertedIndex.h Found a file: ./InvertedIndex.o Found a file: ./Makefile Indexed . found 0 subdirectories and 8 files built index with 789 keys Enter a word to search for, q to terminate break Searched for break found 1 occurrences: {./index_directory.cpp, } Enter a word to search for, q to terminate for Searched for for found 4 occurrences: {./InvertedIndex.cpp, ./InvertedIndex.h, ./index_directory, ./index_directory.cpp, } Enter a word to search for, q to terminate q
The next part is open-ended. Feel free to be creative, and make improvements to this application and InvertedIndex class. Here are some suggestions:
You should submit your assignment online to the CourSys submission server. You should submit the following:
Part 1:unzip part2.zip cd part2 make
Please read the documentation on the submission site for further information. The assignment is due at 11:59pm on April 10.