Lately I found myself looking for an easy-to-use streaming XML parser, and everywhere I looked I found libraries that build entire XML trees in memory. This is great for reasonably sized documents, but with large files running out of memory can be a real issue. Another drawback of a DOM-based approach is that while you wait for the entire document to load into memory, no actual work on the document (or any fragment thereof) can be performed.
With a stream-based parser, work can be performed as soon as a pertinent piece of the document is available. The only thing that really fits the bill for streaming XML in Ruby is the SAX paradigm, and several popular Ruby XML libraries make such an API available...
The Problem
I'm never entirely happy when I have to write SAX code directly. Inevitably I end up writing a state machine whose entire purpose is to pluck out the bits of XML I'm interested in, and depending on the document structure these parsers can get complicated. The code itself is hard to follow (particularly when revisiting it later) because it's spread across various callback methods and does not follow a linear flow.
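To make the complaint concrete, here is a minimal sketch of that kind of hand-rolled state machine. It uses REXML's stream listener rather than any particular library the post has in mind, and the TitleListener class, the bookshelf root element, and the sample document are assumptions made up for illustration. Even for something as small as "collect every book title", the logic ends up split across three callbacks and a couple of flags:

```ruby
require 'rexml/parsers/streamparser'
require 'rexml/streamlistener'

# Hypothetical listener: collects the text of every <title> inside a <book>.
class TitleListener
  include REXML::StreamListener

  attr_reader :titles

  def initialize
    @titles = []
    @in_book = false   # state: are we inside a <book>?
    @in_title = false  # state: are we inside a <title> of that book?
    @buffer = +''
  end

  def tag_start(name, _attrs)
    case name
    when 'book'
      @in_book = true
    when 'title'
      if @in_book
        @in_title = true
        @buffer = +''
      end
    end
  end

  def text(data)
    # Text can arrive in several chunks, so accumulate it.
    @buffer << data if @in_title
  end

  def tag_end(name)
    case name
    when 'title'
      if @in_title
        @titles << @buffer.strip
        @in_title = false
      end
    when 'book'
      @in_book = false
    end
  end
end

# Assumed sample document; only the book contents come from the post.
xml = <<~XML
  <bookshelf>
    <book>
      <authors>
        <author>Leviticus Alabaster</author>
        <author>Eunice Diesel</author>
      </authors>
      <title>How to eat an airplane</title>
    </book>
  </bookshelf>
XML

listener = TitleListener.new
REXML::Parsers::StreamParser.new(xml, listener).parse
puts listener.titles.inspect
```

Every additional field of interest tends to add another flag and another branch in each callback, which is where the complexity comes from.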
The Solution
I have settled on a pattern by which I ignore all tags until I hit my desired tag, then parse everything underneath it into an object, and when I hit the end of my desired tag I make a call out to some method that can deal with that document fragment. Parse pertinent fragment, callback, repeat, ignore all else.
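The pattern can be sketched roughly as follows. This is not Saxerator's actual implementation, only an illustration built on REXML's stream listener; the FragmentListener class, the bookshelf root element, and the rule of turning repeated child tags into arrays are all assumptions of the sketch:

```ruby
require 'rexml/parsers/streamparser'
require 'rexml/streamlistener'

# Hypothetical helper: ignores everything until the desired tag opens,
# builds a hash for the fragment, and hands it to a callback when it closes.
class FragmentListener
  include REXML::StreamListener

  def initialize(tag, &callback)
    @tag = tag.to_s
    @callback = callback
    @stack = [] # one [name, children, chars] frame per open element in the fragment
  end

  def tag_start(name, _attrs)
    return if @stack.empty? && name != @tag # ignore all else
    @stack.push([name, {}, +''])
  end

  def text(data)
    @stack.last[2] << data unless @stack.empty?
  end

  def tag_end(_name)
    return if @stack.empty? # closing a tag we never cared about
    name, children, chars = @stack.pop
    value = children.empty? ? chars.strip : children
    if @stack.empty?
      @callback.call(value) # fragment complete: call back, then repeat
    else
      siblings = @stack.last[1]
      if siblings.key?(name)
        existing = siblings[name]
        # Repeated tags collapse into an array of values.
        siblings[name] = existing.is_a?(Array) ? existing << value : [existing, value]
      else
        siblings[name] = value
      end
    end
  end
end

# Assumed sample document; only the book contents come from the post.
xml = <<~XML
  <bookshelf>
    <book>
      <authors>
        <author>Leviticus Alabaster</author>
        <author>Eunice Diesel</author>
      </authors>
      <title>How to eat an airplane</title>
    </book>
  </bookshelf>
XML

books = []
listener = FragmentListener.new(:book) { |book| books << book }
REXML::Parsers::StreamParser.new(xml, listener).parse
puts books.inspect
```

The callback fires as each fragment closes, so the caller never has to hold more than one fragment in memory at a time.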
From this pattern Saxerator was born!
Drawing inspiration from nori (a SAX-based XML-to-hash parser) and a Practicing Ruby article on Enumerable, I created a syntax for working with chunks of XML parsed into a hash.
Given some XML like this:
<book>
  <authors>
    <author>Leviticus Alabaster</author>
    <author>Eunice Diesel</author>
  </authors>
  <title>How to eat an airplane</title>
</book>
...
Saxerator allows you to iterate over each item as a Ruby hash as soon as it is parsed:
parser = Saxerator.parser File.new('bookshelf.xml')
parser.for_tag(:book).each do |book|
  # book looks like { 'authors' => { 'author' => ["Leviticus Alabaster", "Eunice Diesel"] }, 'title' => 'How to eat an airplane' }
  puts book['title'] # or whatever
end
Because I mix in Enumerable, you get handy syntactic sugar like first, find, select and many more (see the Enumerable documentation):
parser.for_tag(:author).first # 'Leviticus Alabaster'
eunice_books = parser.for_tag(:book).select do |x|
  x['authors']['author'].include?('Eunice Diesel')
end
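For readers unfamiliar with the mixin, here is a small sketch of why this works: any class that defines each and includes Enumerable gets first, find, select and the rest for free. The FakeTagCollection class below is hypothetical and wraps a plain array instead of a parser, and the second book title is made up. The counter shows that first stops iterating after one item, which is the property that matters for a streaming source:

```ruby
# Hypothetical stand-in for a parser-backed collection.
class FakeTagCollection
  include Enumerable

  attr_reader :yielded

  def initialize(items)
    @items = items
    @yielded = 0 # how many items have been handed to a block so far
  end

  # Enumerable only needs each; everything else is built on top of it.
  def each
    return enum_for(:each) unless block_given?

    @items.each do |item|
      @yielded += 1
      yield item
    end
  end
end

books = FakeTagCollection.new([
  { 'title' => 'How to eat an airplane' },
  { 'title' => 'A second, made-up title' }
])

first_book = books.first
puts first_book['title']
puts books.yielded
```

Methods such as select still walk the whole collection, since they cannot know the answer until every item has been seen.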
Future plans
Right now the basic parser works, but there are several improvements I'd like to make. Some ideas include:
- Configurable conversion of some XML values ('true' to true, date strings to Date objects, etc.)
- Element attributes need to be dealt with somehow. They are not handled at all right now; I'm still trying to determine what syntax would be good for this.
- Ability to specify the depth at which the parser should look for tags. For example, there may be a book element nested inside a tag adjacent to other books, and you only want to parse the top-level book elements.
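Until something like the first idea exists, values can be converted after the fact with ordinary Ruby. The sketch below is not part of Saxerator; the convert_value helper, the in_print and published keys, and the choice to treat only YYYY-MM-DD strings as dates are all assumptions made for illustration:

```ruby
require 'date'

# Hypothetical post-processing helper: walks a parsed hash and converts
# 'true'/'false' strings to booleans and YYYY-MM-DD strings to Date objects.
def convert_value(value)
  case value
  when Hash then value.transform_values { |v| convert_value(v) }
  when Array then value.map { |v| convert_value(v) }
  when 'true' then true
  when 'false' then false
  when /\A\d{4}-\d{2}-\d{2}\z/ then Date.iso8601(value) # raises on impossible dates
  else value
  end
end

# Made-up fragment in the same shape the parser yields.
book = {
  'title' => 'How to eat an airplane',
  'in_print' => 'true',
  'published' => '2012-04-15'
}

converted = convert_value(book)
puts converted.inspect
```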
Can I make a nested parser?
For example, if I have an |item|, can I then analyse it further and call item.for_tag on it?
Posted by: Arno.Nyhm | 2012.04.15 at 08:34
There's not really any way to do that from within Saxerator. However, since you have the Hash for that element, you can use normal Hash semantics to deal with the subtags stored within it.
For example:
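Saxerator.parser(file).for_tag(:item).each do |item|
  # assuming the same shape as the book example above,
  # the names live under the 'author' key of 'authors'
  authors = item['authors']['author']
  authors.each do |author|
    # do something with author here
  end
end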
Posted by: Soulcutter | 2012.04.15 at 11:41