In one of our projects (ACES), we needed to parse huge (50 MB+) XML files. Efficiency was critical because of the file size. We also needed the parsing results quickly for further analysis, so that processing time stayed low and application performance was not affected.
We chose the SAX parsing technique (DOM vs SAX)
There are two ways to parse an XML file: DOM and SAX. SAX is more efficient because it processes the XML file incrementally, so memory usage stays minimal. A DOM parser, in contrast, first builds a tree representation of the entire document in memory, so its memory use grows with the length of the document. This takes considerable time and space for large documents.
For more details on SAX, please read: http://en.wikipedia.org/wiki/SAX_parser
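To make the difference concrete, here is a minimal sketch using REXML (which ships with Ruby) rather than any of the gems discussed below. The sample document, the `OrderIdListener` class, and the variable names are made up for illustration. The DOM half builds the whole tree before querying it; the SAX half only reacts to callbacks as the parser moves through the input.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical sample input; the real files in our case were 50 MB+.
xml = <<~DOC
  <orders>
    <order id="1"><item>pen</item></order>
    <order id="2"><item>ink</item></order>
  </orders>
DOC

# DOM style: the entire tree is built in memory first, then queried.
doc = REXML::Document.new(xml)
dom_ids = REXML::XPath.match(doc, '//order').map { |e| e.attributes['id'] }

# SAX style: the parser calls back for each tag it meets; no tree is kept.
class OrderIdListener
  include REXML::StreamListener

  attr_reader :ids

  def initialize
    @ids = []
  end

  # Called once per opening tag, with its attributes as a Hash.
  def tag_start(name, attrs)
    @ids << attrs['id'] if name == 'order'
  end
end

listener = OrderIdListener.new
REXML::Document.parse_stream(xml, listener)
sax_ids = listener.ids

puts dom_ids == sax_ids
```

Both approaches collect the same ids, but the SAX listener only ever holds the values it chose to keep, which is why its memory use does not grow with the size of the document.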
Ruby Options to parse XML
Next, we wanted to weigh the Ruby options that provide SAX parsing and thoroughly measure the performance of each.
The two most common options were Nokogiri and LibXML, but we found both slow and ineffective with larger files. On further analysis, we found a gem named OX. Compared with the former two, OX proved superior and fit our needs.
Why did we choose OX?
- OX is built to address the need for a more optimized XML parser, so that the advantages of XML are available in Ruby without suffering a performance impact.
- OX is the fastest of the three, in both DOM and SAX parsing.
A detailed article with speed comparisons, performance graphs, and code:
http://www.ohler.com/dev/xml_with_ruby/xml_with_ruby.html
Benchmark figures comparing Nokogiri and OX:
https://gist.github.com/danneu/3977120
Nokogiri and LibXML are effective gems with many added features that OX lacks (e.g. XPath support). But when it comes to raw parsing and writing performance, OX is far better.