
3 ways to parse HTML with Ruby

In this post I am going to show 3 ways to parse HTML documents and extract data from them with Ruby. This is useful for scraping websites and for analysing or converting offline documents.

- Nokogiri gem -

    The nokogiri gem is a popular Ruby HTML/XML parser built on libxml2, a C library for parsing XML documents.
    Parse HTML with nokogiri using the `Nokogiri::HTML` method, where `input` is the HTML to parse:

        require 'nokogiri'
        document = Nokogiri::HTML(input)

- [Oga gem][2] -

    The oga gem is a Ruby XML/HTML parser with a small native extension.
    Parse HTML with oga using the `Oga.parse_html` method:

        require 'oga'
        document = Oga.parse_html(input)

    You might want to use oga if you have difficulties installing nokogiri.

- Nokogumbo gem -

    The nokogumbo gem is a wrapper for gumbo, Google’s pure-C HTML5 parser. Parse HTML with nokogumbo using the `Nokogiri::HTML5` method:

        require 'nokogumbo'
        document = Nokogiri::HTML5(input)

    Nokogumbo returns nokogiri data structures, which makes it relatively straightforward to switch to from nokogiri. Note that nokogumbo was merged into nokogiri in version 1.12, so on CRuby with a recent nokogiri `Nokogiri::HTML5` is available without the separate gem.


**Parsing HTML fragments**

You can also parse fragments of HTML instead of complete documents. Use the `fragment` class method with nokogiri and nokogumbo; with oga, call `Oga.parse_html` as before:

- Nokogiri

        require 'nokogiri'
        fragment = Nokogiri::HTML.fragment('<span>Hello World</span>')

- Nokogumbo

        require 'nokogumbo'
        fragment = Nokogiri::HTML5.fragment('<span>Hello World</span>')

- Oga

        require 'oga'
        fragment = Oga.parse_html('<span>Hello World</span>')

**Searching by CSS selector**

The easiest way to identify specific elements in a document is to search for them by CSS selector.

Nokogiri provides the `#search` method (which also accepts XPath expressions), and oga provides the `#css` method. For example, here’s how you would search for all anchor elements within a document:

- Nokogiri and Nokogumbo

        document.search('a')

- Oga

        document.css('a')

To search for a single element, nokogiri provides the `#at` method and oga provides the `#at_css` method. For example, searching for the title element:

- Nokogiri and Nokogumbo

        document.at('title')

- Oga

        document.at_css('title')

There are several other techniques, such as traversing every element, extracting element text, extracting attribute values, extracting attribute hashes and extracting tabular data, which I might cover in the next post.

We mostly use nokogiri to parse HTML and extract data from it, but out of curiosity I looked around, found these two alternatives, nokogumbo and oga, and thought of sharing them with you all.

Hope this helps.