How to Parse Large Files With PHP

PHP has great tools built in for parsing XML files. SimpleXML is perfect for almost anything, especially XHR responses, but it loads the whole document into memory, which makes it too resource intensive for large files. There are no effortless solutions for parsing large files: the major drawback of non-DOM parsers is that they require more code and are less intuitive. Still, PHP’s XML Parser extension offers great power and flexibility and, when presented well, is easy to understand.

Create the Parser

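The first step is to create the parser with `xml_parser_create()`, and though it seems trivial there are settings you should be aware of. The first is `XML_OPTION_CASE_FOLDING`, which upper-cases element and attribute names (not the text content) before they reach your handlers. It is enabled by default, so you will typically want to turn it off unless you need uniform names. The second is `XML_OPTION_SKIP_WHITE`, which skips values that consist entirely of whitespace. Whether to enable it is only a matter of whether or not you want to process empty data. Note that it takes effect in `xml_parse_into_struct()`; a character data handler can still receive whitespace-only strings, so trim there if that matters.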
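
As a minimal sketch of this step, the snippet below creates a parser and sets the two options discussed above. The variable names `$fold_ok` and `$skip_ok` are only illustrative; they hold the boolean result of each `xml_parser_set_option()` call so a failure can be noticed early.

```php
<?php
// Create a new parser instance.
$xml_parser = xml_parser_create();

// Case folding is on by default; switch it off so element and
// attribute names arrive exactly as written in the document.
$fold_ok = xml_parser_set_option($xml_parser, XML_OPTION_CASE_FOLDING, 0);

// Skip values made up only of whitespace (this matters mostly when
// using xml_parse_into_struct()).
$skip_ok = xml_parser_set_option($xml_parser, XML_OPTION_SKIP_WHITE, 1);

if (!$fold_ok || !$skip_ok) {
    throw new RuntimeException('Could not configure the XML parser.');
}
```

Integer values are used here instead of booleans because they are accepted across PHP versions.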

Set Up the Handlers

The next step is to set up the handlers. Here is where the power is, but also the confusion. The handlers are where you actually program what you want done with the data. By the nature of parsing itself, you only need to know the start element, end element, and character data handlers. Let’s look at what each of these does:

xml_set_element_handler($xml_parser, "startElement", "endElement")

This call registers the start and end handlers that will be invoked by `xml_parse()`. In our case, we use functions called `startElement` and `endElement`, but any callable works, including class methods passed as `[$object, 'methodName']` (see docs). If any processing needs to be done with an element’s attributes, it will be done in the `startElement` handler. The `startElement` function receives the parser, the element’s name, and the attributes as an array of key-value pairs. It is here that you want to do things that initialize a new element, such as incrementing the depth level.

The `endElement` function will be sent the parser and the element name. There is no data given here that isn’t given to the start or data handlers, but it can be used for decrementing the depth level and perhaps running queries or writing files.
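
The two element handlers can be sketched as follows. This is an illustration under assumptions, not the only way to write them: the sample XML string, the `$log` array, and the `id` attribute are made up for the example, and closures are used in place of named functions so the depth counter can be shared without globals.

```php
<?php
// Illustrative input; a real document would be read from a file.
$xml = '<library><book id="1"><title>Dune</title></book></library>';

$depth = 0;   // how deep the parser currently is in the document
$log   = [];  // hypothetical record of what the handlers saw

// Called for every opening tag: parser, element name, attribute array.
$startElement = function ($parser, $name, $attrs) use (&$depth, &$log) {
    $depth++;
    $id = isset($attrs['id']) ? ' id=' . $attrs['id'] : '';
    $log[] = "open {$name}{$id} at depth {$depth}";
};

// Called for every closing tag: parser and element name only.
$endElement = function ($parser, $name) use (&$depth, &$log) {
    $log[] = "close {$name} at depth {$depth}";
    $depth--;
};

$xml_parser = xml_parser_create();
xml_parser_set_option($xml_parser, XML_OPTION_CASE_FOLDING, 0);
xml_set_element_handler($xml_parser, $startElement, $endElement);

// The third argument marks this as the final (here: only) piece of data.
if (!xml_parse($xml_parser, $xml, true)) {
    throw new RuntimeException(
        xml_error_string(xml_get_error_code($xml_parser))
    );
}
```

After parsing, `$depth` is back at zero and `$log` holds one entry per opening and closing tag, which is a handy way to confirm the handlers fire in the order you expect.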

xml_set_character_data_handler($xml_parser, "contents")

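This handler is called with the character data found between XML tags. Typically, this is where you get the data you’re looking for. Be aware that the parser may deliver the text of a single element in several pieces, so append each piece to a buffer rather than assuming one call per element. Perhaps the biggest trick to using stream parsers like XML Parser is that the programmer has to keep track of where the parser is in the document. Putting the parser into a class yields enormous benefit in this case, because the handlers can share that state through properties.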
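
To make the class-based approach concrete, here is a rough sketch that ties the pieces together. The class name `TitleCollector`, its properties, and the `<title>` element it looks for are all hypothetical; the point is only to show handlers sharing state and a large file being fed to `xml_parse()` in chunks rather than loaded at once.

```php
<?php
// Hypothetical example class: collects the text of every <title> element.
class TitleCollector
{
    public $titles = [];   // results gathered while parsing
    private $path = [];    // stack of currently open element names
    private $buffer = '';  // text accumulated for the current <title>

    public function startElement($parser, $name, $attrs)
    {
        $this->path[] = $name;
    }

    public function contents($parser, $data)
    {
        // Text can arrive in several pieces, so append instead of assign,
        // and only keep text that sits directly inside a <title>.
        if (end($this->path) === 'title') {
            $this->buffer .= $data;
        }
    }

    public function endElement($parser, $name)
    {
        if ($name === 'title') {
            $this->titles[] = trim($this->buffer);
            $this->buffer = '';
        }
        array_pop($this->path);
    }

    public function parseFile($file, $chunkSize = 8192)
    {
        $parser = xml_parser_create();
        xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
        xml_set_element_handler(
            $parser,
            [$this, 'startElement'],
            [$this, 'endElement']
        );
        xml_set_character_data_handler($parser, [$this, 'contents']);

        $handle = fopen($file, 'rb');
        if ($handle === false) {
            throw new RuntimeException("Cannot open {$file}");
        }

        // Feed the file piece by piece so memory use stays flat.
        $ok = true;
        while ($ok && !feof($handle)) {
            $chunk = fread($handle, $chunkSize);
            if ($chunk === false) {
                break;
            }
            $ok = (bool) xml_parse($parser, $chunk, false);
        }
        // An empty final call tells the parser the document is complete.
        $ok = $ok && (bool) xml_parse($parser, '', true);
        fclose($handle);

        if (!$ok) {
            throw new RuntimeException(sprintf(
                'XML error: %s at line %d',
                xml_error_string(xml_get_error_code($parser)),
                xml_get_current_line_number($parser)
            ));
        }
    }
}
```

Usage would look like `$collector = new TitleCollector(); $collector->parseFile('books.xml');` followed by reading `$collector->titles`, where `books.xml` is a placeholder file name. Only one chunk of the file is held in memory at a time, which is what makes this approach suitable for large files.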

For additional information, read the beginner or intermediate tutorials at Kirupa. If this is your first time parsing XML, using roan’s note as groundwork is a great way to see how the XML Parser extension works.

By Joe Purcell

Joe Purcell is a technology virtuoso, cyberspace frontiersman, and connoisseur of Linux, Mac, and Windows alike.
