Tuesday, June 21, 2016

Design Google Search

The first phase of implementing Google (or any search engine) is to build an indexer. This is the piece of software that crawls the corpus of data and produces the results in a data structure that is more efficient for doing reads.
To implement this, consider two parts: a crawler and indexer.

Phase 1: Crawling
Crawling is where it all begins – the acquisition of data about a website. This involves scanning the site and getting a complete list of everything on there – the page title, images, keywords it contains, and any other pages it links to – at a bare minimum. Modern crawlers may cache a copy of the whole page, as well as look for some additional information such as the page layout, where the advertising units are, where the links are on the page (featured prominently in the article text, or hidden in the footer?).

How is a website crawled exactly? An automated bot – a spider – visits each page, just like you or I would, only very quickly. Even in the earliest days, Google reported that they were reading a few hundred pages a second.The crawler then adds all the new links it found to a list of places to crawl next – in addition to re-crawling sites again to see if anything has changed. It’s a never-ending process, really.

The web crawler's job is to spider web page links and dump them into a set. The most important step here is to avoid getting caught in infinite loop or on infinitely generated content. Place each of these links in one massive text file (for now).

Phase 2i: Indexing Infrastructure
Indexer in layman Terms:
Indexing is the process of taking all of that data you have from a crawl, and placing it in a big database. Imagine trying to a make a list of all the books you own, their author and the number of pages. Going through each book is the crawl and writing the list is the index. But now imagine it’s not just a room full of books, but every library in the world. That’s pretty much a small-scale version of what Google does.

What Indexer does:
The indexer will run as part of a Map/Reduce job. (Map a function to every item in the input, and then Reduce the results into a single 'thing'.) The indexer will take a single web link, retrieve the website, and convert it into an index file. (Discussed next.) The reduction step will simply be aggregating all of these index files into a single unit. (Rather than millions of loose files.) Since the indexing steps can be done in parallel, you can farm this Map/Reduce job across an arbitrarily-large data center.

How Indexer actually works:

Phase 2ii: Specifics of Indexing Algorithms

Once you have stated how you will process web pages, the next part is explaining how you can compute meaningful results. The short answer here is 'a lot more Map/Reduces', but consider the sorts of things you can do:

For each web site, count the number of incoming links. (More heavily linked-to pages should be 'better'.)
For each web site, look at how the link was presented. (Links in an < h1 > or < b > should be more important than those buried in an < h3 >.)
For each web site, look at the number of outbound links. (Nobody likes spammers.)
For each web site, look at the types of words used. For example, 'hash' and 'table' probably means the web site is related to Computer Science. 'hash' and 'brownies' on the other hand would imply the site was about something far different.
Unfortunately I don't know enough about the sorts of ways to analyze and process the data to be super helpful. But the general idea is scalable ways to analyze your data.

Phase 3: Serving Results

The final phase is actually serving the results. Hopefully you've shared some interesting insights in how to analyze web page data, but the question is how do you actually query it? Anecdotally 10% of Google search queries each day have never been seen before. This means you cannot cache previous results.

You cannot have a single 'lookup' from your web indexes, so which would you try? How would you look across different indexes? (Perhaps combining results -- perhaps keyword 'stackoverflow' came up highly in multiple indexes.)

Also, how would you look it up anyways? What sorts of approaches can you use for reading data from massive amounts of information quickly? (Feel free to namedrop your favorite NoSQL database here and/or look into what Google's BigTable is all about.) Even if you have an awesome index that is highly accurate, you need a way to find data in it quickly. (E.g., find the rank number for 'stackoverflow.com' inside of a 200GB file.)

Phase 4 : Ranking
The last step is what you see – you type in a search query, and the search engine attempts to display the most relevant documents it finds that match your query. This is the most complicated step, but also the most relevant to you or I, as web developers and users.
The ranking algorithm checks your search query against billions of pages to determine how relevant each one is. This operation is so complex that companies closely guard their own ranking algorithms as patented industry secrets. Why? Competitive advantage for a start – so long as they are giving you the best search results, they can stay on top of the market. Secondly, to prevent gaming of the system and giving an unfair advantage to one site over another.


Random Issues
Once you have covered the 'bones' of your search engine, feel free to rat hole on any individual topic you are especially knowledgeable about.

Performance of the website frontend
Managing the data center for your Map/Reduce jobs
A/B testing search engine improvements
Integrating previous search volume / trends into indexing. (E.g., expecting frontend server loads to spike 9-5 and die off in the early AM.)

Google Architecture

Useful Links:

No comments:

Post a Comment