Monday, August 24, 2015

Build a crawler

Have you ever wondered how a web crawler works under the hood?  Me too... so I wrote a quick-and-dirty one.

What does a web crawler do?  It starts at a URL, collects all the links found on that page, then follows those links and does the same on each page it reaches.  This continues either indefinitely or until some kind of limit has been hit.  Sounds simple enough, right?  So go build one yourself just for fun.  If you wish to productize it, go ahead; the world could always use another good crawler for testing purposes.  In the past I have used crawlers to warm up a site after a deploy, or as a smoke test by watching the logs for exceptions during a crawl.
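As a rough illustration of that loop, here is a minimal Python sketch. It is not the crawler from this post: the names `crawl`, `get_links`, `max_pages`, and `fake_site` are all made up for the example, and a small in-memory dictionary stands in for real HTTP fetching so the snippet runs on its own.

```python
from collections import deque
from typing import Callable, Iterable, List


def crawl(start_url: str,
          get_links: Callable[[str], Iterable[str]],
          max_pages: int = 100) -> List[str]:
    """Visit pages starting at start_url, stopping after max_pages visits."""
    queue = deque([start_url])   # URLs waiting to be visited
    seen = {start_url}           # every URL ever queued, to avoid revisits
    visited = []                 # URLs in the order they were visited

    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        # get_links is assumed to return the links found on the page at url.
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical four-page "site": each key maps to the links on that page.
fake_site = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["d"],
    "d": [],
}
print(crawl("a", lambda url: fake_site.get(url, [])))
# prints ['a', 'b', 'c', 'd']
```

The `seen` set is what keeps the crawl from looping forever when pages link back to each other, and `max_pages` is one example of the "some kind of limit" mentioned above.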

Here is the code for the crawler I wrote over the span of a couple hours:

What I felt were the tricky bits:
  • Parsing an HTML page to find the links
    • there are libraries out there to help
    • the regex used in my code works but is not perfect
    • doesn't handle JavaScript-bound actions... I would love to see the code for a crawler that can handle those
    • found links to images, which really should have been ignored
  • Determining where to crawl next
    • all links discovered were put into a queue
    • this created a breadth-first crawling behavior
    • in retrospect I should have limited the number of links that could be discovered on a page
    • followed external links... this is often an option on crawlers, but I did it by default
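Several of the points above (using a parser instead of a regex, ignoring image links, capping the links taken from one page, and making external links optional) can be sketched with only the Python standard library. This is one possible approach under stated assumptions, not the code from this post: `LinkParser`, `extract_links`, `max_links`, `follow_external`, and the list of image suffixes are all hypothetical choices for illustration, and it still does nothing about JavaScript-bound actions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed list of suffixes to treat as images; extend as needed.
IMAGE_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".ico")


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(page_url, html, max_links=50, follow_external=False):
    """Return up to max_links absolute http(s) links found in html."""
    parser = LinkParser()
    parser.feed(html)
    page_host = urlparse(page_url).netloc
    links = []
    for href in parser.hrefs:
        absolute = urljoin(page_url, href)   # resolve relative links
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        if parts.path.lower().endswith(IMAGE_SUFFIXES):
            continue  # skip links that point at images
        if not follow_external and parts.netloc != page_host:
            continue  # stay on the starting host unless asked otherwise
        if absolute not in links:
            links.append(absolute)
        if len(links) >= max_links:
            break     # cap the number of links taken from one page
    return links


sample = """
<a href="/about">About</a>
<a href="photo.JPG">Photo</a>
<a href="http://other.example/page">Elsewhere</a>
<a href="mailto:me@example.com">Mail</a>
<a href="/about">About again</a>
"""
print(extract_links("http://example.com/blog/", sample))
# prints ['http://example.com/about']
```

Here external links are off by default and switched on with `follow_external=True`, which is the reverse of the behavior described in the list above.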

How does your design and implementation improve upon what I wrote?  I am super curious what people come up with after spending a couple of hours creating a crawler and trying to stick to core libraries.


