An Overview of XML Parsers

We’ve mentioned before that working with XML is a vital skill for developers. Parsing XML is an essential task, but it can be a confusing place for beginners. There is no one-stop shop for libraries that handle creating, editing, and parsing XML documents. Let’s take a look specifically at parsing, which involves transforming the XML document into a form that is useful to a program.

For those who are unfamiliar with XML, one can think of XML documents as database tables and XML parsers as database engines. Anyone with database experience will tell you there is no single best database engine. Likewise, there is no single best XML parser. In either case, one has to look at what best fits the data in use. With that in mind, let’s look at the various types of XML parsers and their use cases.

Stream vs. Tree

There are two types of XML parsers: stream-based (event-based) and tree-based. Stream-based parsers are fast, efficient, and require less memory, but they do not do well with accessing data at random. Stream parsers are also called event parsers because they only grab information when an event is fired. Tree-based parsers aren’t nearly as fast and require loading the entire document into memory, but they can access the data at random. A tree parser parses the whole document and creates a tree structure, so it has all the information about the document at all times. Within these two categories there are many variations, but let’s look at a few major ones.

Push (SAX)

Stream parsers have two subsets: push and pull. Push parsers in essence “push” data to the application. Thus, the application has to keep track of where the push parser is in the document. Additionally, data that has already been parsed cannot be accessed again without re-parsing the document. This can get very complex. The most popular implementation of this is the Simple API for XML (SAX). A SAX parser is best used for reading rather than editing, because the application has to work through callbacks, which are not intuitive.
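To make the callback style concrete, here is a minimal sketch using Python’s standard xml.sax module. The sample document, the TitleHandler class, and the collect_titles function are made up for illustration. Notice that the handler has to remember on its own whether it is currently inside a title element.

```python
import xml.sax

# Hypothetical sample document, used only for illustration.
SAX_SAMPLE = (
    "<library>"
    "<book><title>Dune</title></book>"
    "<book><title>Emma</title></book>"
    "</library>"
)


class TitleHandler(xml.sax.ContentHandler):
    """Collects the text of every <title> element as the parser pushes events."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False  # the application tracks its own position
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buffer))
            self._in_title = False


def collect_titles(xml_text):
    handler = TitleHandler()
    # The parser drives the loop and calls back into the handler.
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.titles


if __name__ == "__main__":
    print(collect_titles(SAX_SAMPLE))
```

The in-title flag is the kind of bookkeeping that grows quickly once the document has more than a couple of levels of nesting.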

Pull (StAX or XMLReader)

Pull parsers in essence “pull” the data to the application. This means the application always knows where the parser is in the document. As one article mentions, this is a much more intuitive approach because the application isn’t reacting to events as they happen; rather, the application requests the events it wants. The article also notes that it is much easier to pass an input stream as a function parameter than to pass SAX events.
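StAX is a Java API and XMLReader is a PHP one, but the same pull style can be sketched with Python’s standard xml.dom.pulldom module. This is only an illustration under assumed names: the sample document and the collect_titles_pull function are hypothetical. Here the application’s own loop asks for the next event, so no position flags are needed.

```python
from xml.dom import pulldom

# Hypothetical sample document, used only for illustration.
PULL_SAMPLE = (
    "<library>"
    "<book><title>Dune</title></book>"
    "<book><title>Emma</title></book>"
    "</library>"
)


def collect_titles_pull(xml_text):
    titles = []
    events = pulldom.parseString(xml_text)
    # The application drives the loop and pulls one event at a time.
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "title":
            # Ask the parser to read the rest of this element only.
            events.expandNode(node)
            text = "".join(
                child.data
                for child in node.childNodes
                if child.nodeType == child.TEXT_NODE
            )
            titles.append(text)
    return titles


if __name__ == "__main__":
    print(collect_titles_pull(PULL_SAMPLE))
```

Because the event stream is an ordinary object, it can be handed to another function that continues pulling from it.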

Tree (DOM)

Tree parsers don’t have any subsets, and the most popular implementation is the DOM parser. The DOM API parses the entire XML document and converts it into a DOM tree before you can begin processing it. This cost can be worthwhile if you know that you need to access the entire document. If you only occasionally need to access part of the XML document, the cost could decrease the performance of your application with no added benefit. In that case the SAX or StAX API is preferable.
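For comparison, here is a minimal sketch of tree-style access with Python’s standard xml.dom.minidom module. The sample document and the last_then_first_title function are hypothetical. Once the whole tree is in memory, the elements can be visited in any order, here last before first, without re-parsing.

```python
from xml.dom import minidom

# Hypothetical sample document, used only for illustration.
DOM_SAMPLE = (
    "<library>"
    "<book id='b1'><title>Dune</title></book>"
    "<book id='b2'><title>Emma</title></book>"
    "</library>"
)


def book_title(book):
    # Each <book> is assumed to hold exactly one <title> with plain text.
    return book.getElementsByTagName("title")[0].firstChild.data


def last_then_first_title(xml_text):
    # The entire document is parsed into a tree before any access happens.
    doc = minidom.parseString(xml_text)
    books = doc.getElementsByTagName("book")
    # Random access: read the last book first, then jump back to the first.
    return [book_title(books[len(books) - 1]), book_title(books[0])]


if __name__ == "__main__":
    print(last_then_first_title(DOM_SAMPLE))
```

The convenience comes at the price described above: the parse step builds every node, even if only two of them are ever read.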

As a general principle, one would look for a stream-based parser if the document is large, and a tree-based parser if the document is small. One benchmarking test conveys just how much faster SAX parsers are than DOM parsers on large files: the DOM average is about 3.063s and the SAX average is 1.001s, roughly 3 times faster! (If looking for parsers on iPhone, be sure to check out this benchmark test.) Another consideration is that if the document is deeply nested, or doesn’t have unique element names, stream parsers begin to lose their benefit, since the application has to track more of the document’s context itself. As an additional note, when parsing many documents it may help to parse them in multiple parser instances, as a StackOverflow post discusses. Hopefully, this article helps make apparent which parsers are best in a given situation.

By Joe Purcell

Joe Purcell is a technology virtuoso, cyberspace frontiersman, and connoisseur of Linux, Mac, and Windows alike.