Code Zone: Design a scalable web crawling system

Saturday, June 25, 2016

Design a scalable web crawling system

Complete Design:

http://cis.poly.edu/suel/papers/crawl.pdf

http://webpages.uncc.edu/sakella/courses/cloud09/papers/Mercator.pdf

Questions To Ask:

If you were designing a web crawler, how would you avoid getting into infinite loops?

http://stackoverflow.com/questions/5834808/designing-a-web-crawler
https://github.com/filipegoncalves/interview-questions/blob/master/systems_design/WebCrawler.md
http://baozitraining.org/blog/design-a-basic-web-crawler/
http://massivetechinterview.blogspot.in/2015/06/design-web-crawler.html

Final Design:

http://blog.gainlo.co/index.php/2016/06/29/build-web-crawler/
https://www.quora.com/How-can-I-build-a-web-crawler-from-scratch
http://flexaired.blogspot.in/2011/09/design-web-crawler.html

A crawler needs at least three components:
1. HTTP fetcher (requests and retrieves a page)
2. HTML parser
3. URL tracker

The first component requests a given URL and either saves the response to disk or just keeps it in memory. (Saving to disk needs a storage design that makes the web page easy to retrieve later.)
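A minimal sketch of this first component, using only the Python standard library. The function names `fetch` and `store`, the `example-crawler/0.1` User-Agent string, and the choice to name each file after the SHA-256 hash of its URL are assumptions made for illustration, not part of the original design.

```python
import hashlib
import urllib.request
from pathlib import Path


def fetch(url, timeout=10.0):
    """Download the page at `url` and return the raw response body as bytes."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "example-crawler/0.1"}  # hypothetical agent name
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def page_path(url, root):
    """Map a URL to a file path under `root`, named by the SHA-256 of the URL."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return Path(root) / (digest + ".html")


def store(url, body, root):
    """Write `body` to disk so it can be found again from the URL alone."""
    path = page_path(url, root)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(body)
    return path
```

Because the file name is derived from the URL, retrieval is a single lookup: `page_path(url, root).read_bytes()`. A larger crawler would likely shard the directory or use a proper document store instead.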

HTML Parser - Strips the HTML tags and keeps the text of interest (I needed only the part of the page that matched a certain pattern) along with the URLs found in the current page. A more generic web crawler will also have to save other kinds of content, such as images and sound.
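The parser described above could be sketched with the standard library's `html.parser` module. The class name `PageParser`, the helper `parse_page`, and the decision to skip `script` and `style` contents are assumptions for illustration; pattern-based filtering of the text is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collects visible text and absolute link URLs from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text_parts = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the URL of the current page.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def parse_page(base_url, html):
    """Return (text, links) for the given HTML string."""
    parser = PageParser(base_url)
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.links
```

For example, `parse_page("http://www.abc.com/index.htm", html)` returns the page text with tags removed and every `href` turned into an absolute URL, ready to be handed to the URL tracker.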

URL Tracker - Makes sure that no URL is visited twice within a set time frame. A simple mechanism is a hash table with a user-defined comparator function, although some URLs may still point to exactly the same page, e.g. www.abc.com and www.abc.com/index.htm.
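One way to sketch the tracker is a dictionary keyed by a canonical form of the URL, which plays the role of the user-defined comparator. The names `canonicalize` and `UrlTracker`, the list of index file names, and the default `http` scheme for bare host names are all assumptions for illustration. Canonicalization of this kind only catches the aliases it knows about; pages that are identical for other reasons would need content hashing.

```python
import time
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}
INDEX_NAMES = ("index.htm", "index.html")  # assumed aliases for a directory URL


def canonicalize(url):
    """Reduce equivalent spellings of a URL to a single key."""
    if "://" not in url:
        url = "http://" + url  # assumption: bare host names mean http
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    for name in INDEX_NAMES:
        if path.endswith("/" + name):
            path = path[: -len(name)]
            break
    # The fragment is dropped because it never changes which page is fetched.
    return urlunsplit((scheme, host, path, parts.query, ""))


class UrlTracker:
    """Remembers when each canonical URL was last visited."""

    def __init__(self, revisit_after_seconds, clock=time.time):
        self.revisit_after = revisit_after_seconds
        self.clock = clock
        self.last_visit = {}

    def should_visit(self, url):
        """Return True and record the visit unless the URL was seen recently."""
        key = canonicalize(url)
        now = self.clock()
        last = self.last_visit.get(key)
        if last is not None and now - last < self.revisit_after:
            return False
        self.last_visit[key] = now
        return True
```

With this sketch, `www.abc.com` and `www.abc.com/index.htm` both map to `http://www.abc.com/`, so the second one is rejected until the revisit window has passed.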

Crawler basic algorithm

1. Remove a URL from the list of unvisited URLs.
2. Determine the IP address of its host name.
3. Download the corresponding document.
4. Extract any links contained in it.
5. For each extracted link, if the URL is new, add it to the list of unvisited URLs.
6. Process the downloaded document.
7. Go back to step 1.
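The loop above can be sketched as a single function. The name `crawl` and its parameters are hypothetical: `fetch`, `extract_links`, and `process` are callables supplied by the caller, `resolve` defaults to `socket.gethostbyname`, and `max_pages` is an added safety limit that the original steps do not mention. The `seen` set is what keeps pages that link to each other from causing an infinite loop.

```python
import socket
from collections import deque
from urllib.parse import urlsplit


def crawl(seeds, fetch, extract_links, process,
          resolve=socket.gethostbyname, max_pages=100):
    """Breadth-first crawl starting from `seeds`; returns the number of pages processed."""
    unvisited = deque(seeds)
    seen = set(seeds)   # every URL ever queued, so nothing is queued twice
    dns_cache = {}      # host name -> IP address
    pages = 0
    while unvisited and pages < max_pages:
        url = unvisited.popleft()                  # step 1
        host = urlsplit(url).hostname
        if host not in dns_cache:
            dns_cache[host] = resolve(host)        # step 2
        body = fetch(url)                          # step 3
        for link in extract_links(url, body):      # step 4
            if link not in seen:                   # step 5
                seen.add(link)
                unvisited.append(link)
        process(url, body)                         # step 6
        pages += 1                                 # step 7: loop again
    return pages
```

In this sketch the resolved address is only cached, not used for the download itself; a production crawler would typically pass it to the fetcher and also add per-host politeness delays and error handling.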
