Sunday, June 19, 2016

Design a Search typehead

Clarifying questions :
  • How many typeahead suggestions are to be provided ?
  • Do we need to account for spelling mistakes ? Example : Should typing “mik” give michael as a suggestion because michael is really popular as a query ?
  • What can be the length of a search query ?

Lets assume for this question, we focus on only providing 5 suggestions at max. We need not account for spelling mistakes, and assume that the suggestions will have the typed phrase as the strict prefix.

So, in effect we have a system which does 2 major things :
  • Given a query, it gives back upto 5 typeahead suggestions.
  • Given a search query ( which is actually search for by the user ), it updates the suggestions if needed.

One approach would be to store the data as a trie. But, do you construct a complete trie ? Is there a limit to the number of characters in a query ? Users can type any random string as a query and  that can cause the trie size to blow up.
Alright, so we only construct the nodes that are needed.

How do we calculate the top 5 suggestions then ? Top 5 frequency query terms with the user typed query as prefix seems to be a good approach. How do we find top 5 frequency query results for a query ?
Do we go and traverse the whole subtree in the trie to find the top frequency terms ? If so, what can be the size of the subtree ? Do you think its going to grow too inefficient ( especially because typeahead is latency sensitive ) ?  
Can we store some additional information on the node itself ? How about we store the top 5 terms along with the frequency on the top 5 terms ? Query becomes really fast then. Update in the trie would mean percolating up the new term with its frequency, and see if its eligible to be in top 5 at every node.
What about the updates ? Do you update the trie with every update ? Would that cause things to be really slow ? Would sampling work here ?
Now, what do you think about the trie size ? Do you think it fits on a single machine ?
There are 2 options here :
  1. You shard the trie. How do you shard the trie ? Do you only shard it on the first level ?
  2. Maintain the trie as a refined set of queries which are more frequent than a certain threshold. All query terms along with the actual frequencies are stored in another hashmap. How do you do the update ? Batch update ? Would you compromise on real time updates for recent trending search terms ? What if you trigger the entry / update on search terms when it crosses certain threshold post sampling ?

What about fault tolerance ? Replication ?
How about optimizations on the client side ? Do you trigger off a request to the backend on every keystroke ? Or do you wait for 100ms and trigger off request if there have been no other keystroke ?


No comments:

Post a Comment