Thursday, March 1, 2012

Search a Word

Given a large file, how will you store the words in the file so that searching of a word can be done in constant time? Also how will you find the 10 most frequently occurring words in the file?

Approach:
Searching for words in a file is a very common interview problem and is asked a lot. There are 2 approaches to deal with this.
On would be to make use of Hash Table and store the count of the word as the value and the word as the key. Once the Hash Table has been made we can use a min-heap to find out the 10 most frequently occurring words. A new word is compared with the root and if its count is more than the root element then the root is replaced by that word and heapify is called again on the heap.
The other approach would be to use trie. In trie each node represents a character of a word and has pointer to a character which comes after it in a valid word. This structure supports insertion and search operation in O(k) where k is the length of the word so it can be treated as constant time. You can read more about trie here.
An even more space efficient variation of trie would be a radix tree in which if a node has only one child then the node of the tree is merged with it. Read here
Once the trie has been constructed we can traverse through it in O(n) time (through each of the leaf node --> time spent in reaching a leaf node is considered to be constant time) and again use a min-heap to find the 10 most frequently occurring words. Both the approaches have the space complexity of O(n) and time complexity of O(nlog10).
As ever, Hash tables have the problem of collisions, hence some would argue that tries are better. But the point is the argument that all the words in trie can be found using O(n) times, is fallacious because we will be actually visiting each node and that will be equal to the total number of nodes in the trie.
However, we must realize that trie or suffix tree, is useful data structure, and has a number of practical applications as can be read in this discussion thread . You can find some extensions and implementations here. 
 
Food for thought:
The interviewer can make the problem more complex by saying that we may choose to get any number of the most frequently occurring words.
The solution to this problem would be to construct the hash table with the word count, sort the hash table once it has been created (of course it will destroy the structure) and then get any number of most frequently occurring words. We should pay attention in this case tries cannot be used. until we are ready to build a different min-heap for each request.
If we want to avoid sorting, there has to be some constraint on the number of most frequently occurring words that we can ask for. If there is an upper bound 'k', we can again resort to the use of min-heap and give the desired most frequently occurring words.

No comments:

Post a Comment