Parsing Large XML Files Using PHP

I ran into a situation where I needed to parse a large (1 GB) XML file in order to extract the data into a MySQL table. As usual, I did my initial round of research, and my first decision was to use PHP’s DOMDocument class.

First Mistake

For my testing, I used a small subset of the data… weighing in at a measly 24 records.

Initially, all of my tests ran quite nicely. Then I decided to throw the complete (1 GB) XML file at it. Epic fail… I mean, it ran well for a while, but eventually ran out of memory. (And, yes… I did increase the memory_limit* to 1.5 GB and max_execution_time* to 5 hours.) I had feared this might happen.

The problem with utilizing DOMDocument on large XML files is that it parses the entire document into an in-memory node tree before you can do anything with it. While parsing, that tree just keeps growing. Not good when you’re dealing with massive XML files.
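
For illustration, here is a minimal sketch of that first approach; the “record” element name is a placeholder, not the actual schema:

// Sketch of the DOMDocument approach that fell over on the 1 GB file.
// "record" is a hypothetical element name standing in for the real one.
$doc = new DOMDocument();
$doc->load('PATH_TO_FILE'); // loads the ENTIRE document tree into memory
foreach ($doc->getElementsByTagName('record') as $record) {
    // By this point, the whole tree is already resident in memory
}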

With this fail under my belt, I went back to the drawing board. Knowledge is power… knowledge is power… knowledge is power.

My Next Move

XMLReader. From the PHP website: “The XMLReader extension is an XML Pull parser. The reader acts as a cursor going forward on the document stream and stopping at each node on the way.” OK, sounds considerably more promising.

And Survey Says, Ding!

$file = "PATH_TO_FILE";
$reader = new XMLReader();
$reader->open($file);
while( $reader->read() )
{
// Execute processing here
}
$reader->close();
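
To give a flavor of that processing, here is a sketch of the general pattern rather than the exact code I ran. The “record” element name, the “records” table, and its columns are all hypothetical placeholders:

// Stream through the file, pulling one record at a time into SimpleXML.
// Memory stays bounded by the size of a single record, not the whole file.
$pdo  = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$stmt = $pdo->prepare('INSERT INTO records (name, value) VALUES (?, ?)');

$reader = new XMLReader();
$reader->open($file);
while ($reader->read()) {
    // Only act when the cursor stops on an opening <record> element
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'record') {
        // readOuterXml() grabs just this one node's markup
        $record = simplexml_load_string($reader->readOuterXml());
        $stmt->execute([(string) $record->name, (string) $record->value]);
    }
}
$reader->close();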

After that, it was gravy. Well, aside from the additional logic that had to go into it. That’s easily a topic all its own, perhaps perfect for a “Part 2” of this post. No promises though… unless, of course, incoming requests prompt for more information!

* How to modify PHP’s “memory_limit” and “max_execution_time” on a per-script basis:

// Tweak some PHP configurations
ini_set('memory_limit','1536M'); // 1.5 GB
ini_set('max_execution_time', 18000); // 5 hours
