Saturday, June 25, 2016

Design a scalable web crawling system

Complete Design:
http://webpages.uncc.edu/sakella/courses/cloud09/papers/Mercator.pdf

Questions To Ask:
If you were designing a web crawler, how would you avoid getting into infinite loops?
http://stackoverflow.com/questions/5834808/designing-a-web-crawler
https://github.com/filipegoncalves/interview-questions/blob/master/systems_design/WebCrawler.md
http://baozitraining.org/blog/design-a-basic-web-crawler/
http://massivetechinterview.blogspot.in/2015/06/design-web-crawler.html

Final Design:
http://blog.gainlo.co/index.php/2016/06/29/build-web-crawler/
https://www.quora.com/How-can-I-build-a-web-crawler-from-scratch
http://flexaired.blogspot.in/2011/09/design-web-crawler.html


At least three components are required:
1. HTTP requester (fetches the page)
2. HTML parser
3. URL tracker

The first component requests a given URL and either downloads the page to the machine or just keeps it in memory. (Downloading requires a storage design that allows easy retrieval of the web page.)
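
A minimal sketch of this component, assuming Python's standard library is enough for the job. The names `fetch`, `page_path`, `save_page` and `load_page` are made up for illustration, and naming files by the SHA-256 digest of the URL is just one possible storage layout, not something the designs linked above prescribe.

```python
import hashlib
import urllib.request
from pathlib import Path


def fetch(url: str, timeout: float = 10.0) -> bytes:
    """Download a page and return its raw bytes (the in-memory option)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def page_path(store_dir: Path, url: str) -> Path:
    """Map a URL to a file name derived from its SHA-256 digest (assumed layout)."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return store_dir / f"{digest}.html"


def save_page(store_dir: Path, url: str, body: bytes) -> Path:
    """Write the page to disk so it can be found again from the URL alone."""
    store_dir.mkdir(parents=True, exist_ok=True)
    path = page_path(store_dir, url)
    path.write_bytes(body)
    return path


def load_page(store_dir: Path, url: str) -> bytes:
    """Read a previously stored page back by its URL."""
    return page_path(store_dir, url).read_bytes()
```

Because the file name is computed from the URL, retrieval needs no separate index; a larger crawler would probably shard the directory or use a key-value store instead.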

HTML Parser - Removes the HTML tags and retains the text of interest (I needed only part of the page, based on some pattern) and the URLs in the current page. A more generic web crawler will also have to save other components, such as images and sound.
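
One way to sketch the parser is with `html.parser.HTMLParser` from the Python standard library. The class name `PageParser` and the helper `parse_page` are hypothetical; this version keeps all visible text rather than matching a pattern, and it only follows `<a href>` links.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collects visible text and absolute link URLs from one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text_parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when it is outside script/style and not blank.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def parse_page(base_url, html):
    """Return (text, links) for an HTML string."""
    parser = PageParser(base_url)
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.links
```

Saving images or sound would mean also collecting attributes such as `src` on `<img>` and `<audio>` tags, which this sketch leaves out.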

URL Tracker - Makes sure that no URL is visited twice within a set time frame. A simple mechanism is a hash table with a user-defined comparator function; some URLs may still point to the exact same page, e.g. www.abc.com and www.abc.com/index.htm.
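
A rough sketch of such a tracker in Python. Instead of a custom comparator, it normalizes each URL into a canonical key before the hash-table lookup, which has the same effect. The names `normalize`, `UrlTracker` and `DEFAULT_PAGES` are invented for this example, and treating `index.htm` / `index.html` as the directory's default page is an assumption that does not hold on every server.

```python
import time
from urllib.parse import urlsplit, urlunsplit

# Assumption: these file names serve the same content as the bare directory.
DEFAULT_PAGES = ("index.htm", "index.html")


def normalize(url):
    """Reduce a URL to a canonical key so equivalent URLs compare equal."""
    if "://" not in url:
        url = "http://" + url  # assume http when no scheme is given
    parts = urlsplit(url)
    path = parts.path or "/"
    for name in DEFAULT_PAGES:
        if path.endswith("/" + name):
            path = path[: -len(name)]
            break
    # Lowercase scheme and host; drop the fragment, which never reaches the server.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


class UrlTracker:
    """Remembers when each URL was last visited."""

    def __init__(self, revisit_after=3600.0, clock=time.time):
        self.revisit_after = revisit_after  # seconds before a revisit is allowed
        self.clock = clock
        self._last_visit = {}  # canonical URL -> timestamp of last visit

    def should_visit(self, url):
        last = self._last_visit.get(normalize(url))
        return last is None or self.clock() - last >= self.revisit_after

    def mark_visited(self, url):
        self._last_visit[normalize(url)] = self.clock()
```

With this, `www.abc.com` and `www.abc.com/index.htm` map to the same key, so the second one is skipped until the time frame has passed.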

Crawler basic algorithm:
  1. Remove a URL from the unvisited URL list
  2. Determine the IP Address of its host name
  3. Download the corresponding document
  4. Extract any links contained in it.
  5. For each extracted link, if the URL is new, add it to the list of unvisited URLs
  6. Process the downloaded document
  7. Back to step 1
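
The seven steps above could be sketched in Python roughly as follows. The function `crawl` and its parameters (`resolve`, `fetch`, `extract_links`, `process`, `max_pages`) are hypothetical names; the network and parsing pieces are passed in as functions so the loop itself stays small. The `seen` set and the `max_pages` cap are assumptions added here as a simple guard against looping forever on cyclic links.

```python
from collections import deque
from urllib.parse import urlsplit


def crawl(seeds, resolve, fetch, extract_links, process, max_pages=100):
    """Breadth-first crawl starting from `seeds`; returns the number of pages crawled."""
    unvisited = deque(seeds)
    seen = set(seeds)  # every URL ever queued, so cycles are not re-queued
    crawled = 0
    while unvisited and crawled < max_pages:
        url = unvisited.popleft()                     # 1. remove a URL from the list
        address = resolve(urlsplit(url).hostname)     # 2. look up the host's IP address
        document = fetch(url, address)                # 3. download the document
        for link in extract_links(url, document):     # 4. extract its links
            if link not in seen:                      # 5. queue only new URLs
                seen.add(link)
                unvisited.append(link)
        process(url, document)                        # 6. process the document
        crawled += 1                                  # 7. back to step 1
    return crawled
```

In a real crawler `resolve` could be `socket.gethostbyname`, and `fetch` and `extract_links` would wrap the HTTP and HTML parser components described earlier.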
