Problem to solve: crawl any given site (URL) and scrape each page exactly once for the content matching a given CSS selector or XPath.
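The "each page uniquely" part of the problem comes down to remembering which URLs have already been visited. Here is a minimal sketch of that idea using only the Ruby standard library. The names `normalize` and `crawl`, the `fetch` callable and the `example.com` pages are all hypothetical and exist only for illustration; the libraries discussed in this post handle this internally in their own way.

```ruby
require "set"
require "uri"

# Normalise a URL so the same page is not queued twice.
# Assumed policy for this sketch: drop the fragment and any trailing slash.
def normalize(url)
  uri = URI.parse(url)
  uri.fragment = nil
  uri.path = uri.path.chomp("/")
  uri.to_s
end

# Breadth-first crawl restricted to the host of the start URL.
# `fetch` is any callable that takes a URL and returns [content, links];
# in real use it would download the page and extract both.
def crawl(start_url, fetch)
  start   = normalize(start_url)
  host    = URI.parse(start).host
  seen    = Set[start]
  queue   = [start]
  results = {}

  until queue.empty?
    url = queue.shift
    content, links = fetch.call(url)
    results[url] = content

    links.each do |link|
      absolute = normalize(URI.join(url, link).to_s)
      next unless URI.parse(absolute).host == host # stay on the same site
      queue << absolute if seen.add?(absolute)     # add? returns nil if already seen
    end
  end

  results
end

# A fake three-page site standing in for real HTTP requests.
SITE = {
  "http://example.com"       => ["Home",  ["/about", "/about/", "/blog#top", "http://other.test/"]],
  "http://example.com/about" => ["About", ["/", "/blog"]],
  "http://example.com/blog"  => ["Blog",  ["/about"]]
}.freeze

pages = crawl("http://example.com/", ->(url) { SITE.fetch(url, ["", []]) })
puts pages.keys
```

Passing the fetcher in as a callable keeps the de-duplication logic separate from the HTTP and parsing code, so any of the tools below could fill that role.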
Available options for crawling and scraping:
Built on [libxml2], [Nokogiri] is one of the fastest HTML/XML parsers. Almost all scraping tools written in Ruby, such as Mechanize, Wombat and Anemone, use Nokogiri as their base for scraping.
[Mechanize] - is a beast in this category. It makes efficient use of Nokogiri for HTML scraping. Do see Abi's post on [Automating browser navigation's through Script].
[Wombat] takes scraping a level higher: we can specify XPath as well as CSS selectors for the content that needs to be extracted. The eye-catcher in Wombat is the structured data it returns after scraping, i.e. a Hash of the links, subheadings and all the labels that we specified in the crawl block. Simple example [here].
Next is [Upton] - it is very similar to Wombat, with the additional feature of saving the scrape results to a file (.csv or any other desired format).
[Anemone] is a simple Ruby library that crawls and scrapes websites in a multi-threaded fashion and is therefore fast. A distinctive feature of Anemone is that it lets us store the crawled data in one of several storage options.
Anemone also gives an option to ignore a path on the website.
Hope this helps someone who is looking for a crawling or scraping technique.