Random thoughts on working with XML

It’s no real secret that I love XML but truly hate working with XML parsers in general (^_^).

Xerces-C++ and libxml++ are not too bad, but I have never met a parser that I love. The main reason I chose Xerces was how painless it is to compile and link against the library; I really do not want to go through the bother of setting up libxml++ in MSVC, especially after taking a look at the pkg-config output on my workstation:

FreeBSD$  pkg-config libxml++-2.6 --cflags --libs                      18:07
-I/usr/local/include/libxml++-2.6 -I/usr/local/include/libxml++-2.6/include
-I/usr/local/include/libxml2 -I/usr/local/include -I/usr/local/include/glibmm-2.4
-I/usr/local/lib/glibmm-2.4/include -I/usr/local/include/sigc++-2.0
-I/usr/local/lib/sigc++-2.0/include -I/usr/local/include/glib-2.0
-I/usr/local/lib/glib-2.0/include -L/usr/local/lib -lxml++-2.6 -lxml2 -lglibmm-2.4
-lgobject-2.0 -lsigc-2.0 -lglib-2.0

SAX, DOM, or whatever else: the parser style doesn’t really matter that much to me, as long as it gets the job *done*. Although obviously, I am more familiar with DOMs (thank you, JavaScript). I tend to use XML for storing structured data without having to resort to a binary file/database, or a hodgepodge of files within a zip archive. So operations tend to be very straightforward, using a couple of glue functions.

Personally, my idea of fun XML parsing is to take data like this as input:

<rootnode>
    <child1 attr="val">string of text</child1>
    <child1>
        <child2>another string of text</child2>
    </child1>
</rootnode>

and in turn receive a nested data structure like this as output:

# example in Perl
my $structure = {
    node       => 'rootnode',
    attributes => undef,
    data       => [
        {
            node       => 'child1',
            attributes => { attr => 'val' },
            data       => 'string of text',
        },
        {
            node       => 'child1',
            attributes => undef,
            data       => [
                {
                    node       => 'child2',
                    attributes => undef,
                    data       => 'another string of text',
                },
            ],
        },
    ],
};

Probably because that is how my brain sees the preceding XML xD.
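
For the curious, here is a rough sketch of going from XML text to that shape using nothing but core Perl. To be clear about what is assumed: `xml_to_structure` and `_build` are names I made up for illustration, and this is a toy that only understands plain elements, double-quoted attributes and text. There are no comments, entities, CDATA, self-closing tags, or mixed text-and-element content, and it has nothing to do with how Xerces-C++ or libxml++ work internally.

```perl
use strict;
use warnings;

# Toy XML-to-structure converter (hypothetical helper, not a real parser).
sub xml_to_structure {
    my ($xml) = @_;
    # Split into tags and text runs, dropping whitespace-only text.
    my @tokens = grep { /\S/ } $xml =~ /(<[^>]+>|[^<]+)/g;
    my $tree = _build(\@tokens);
    die "trailing content after the root element\n" if @tokens;
    return $tree;
}

# Consume one element (and everything inside it) from the token list.
sub _build {
    my ($tokens) = @_;
    my $open = shift(@$tokens) // die "unexpected end of input\n";
    my ($name, $attr_text) = $open =~ /^<([\w:.-]+)(.*?)>$/s
        or die "expected an opening tag, got '$open'\n";

    # Collect attr="val" pairs; only double-quoted values are understood.
    my %attrs;
    $attrs{$1} = $2 while $attr_text =~ /([\w:.-]+)\s*=\s*"([^"]*)"/g;

    my (@children, $text);
    while (@$tokens && $tokens->[0] !~ m{^</}) {
        if ($tokens->[0] =~ /^</) {
            push @children, _build($tokens);   # nested element
        } else {
            $text = shift @$tokens;            # plain text run
        }
    }

    my $close = shift(@$tokens) // '';
    $close eq "</$name>" or die "expected </$name>, got '$close'\n";

    return {
        node       => $name,
        attributes => (%attrs ? \%attrs : undef),
        # Child elements win over text; mixed content is not modelled.
        data       => (@children ? \@children : $text),
    };
}

# Sample input made up to mirror the $structure shown above.
my $structure = xml_to_structure(<<'XML');
<rootnode>
  <child1 attr="val">string of text</child1>
  <child1>
    <child2>another string of text</child2>
  </child1>
</rootnode>
XML

print $structure->{data}[1]{data}[0]{data}, "\n";
```

However the parsing is done, what comes out the other end is just hashes and arrays, which is the whole appeal.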

Not to mention it makes writing something like a pretty printer easy as pi:

# for some reason, writing this subroutine was very relaxing...
sub pp_xml {

    my $xhr    = shift;
    my $depth  = shift;
    my $indent = sub { "\t" x shift };
    my $node   = $xhr->{node} or warn "XML node has no data!\n";

    if ($xhr->{attributes}) {
        while (my ($attr, $val) = each %{$xhr->{attributes}}) {
            $node .= " " . $attr . "='" . $val . "'";
        }
    }
    print $indent->($depth), '<', $node, '>', "\n";

    $xhr = $xhr->{data};
    if (ref $xhr eq 'ARRAY') {
        pp_xml($_, $depth+1) foreach @$xhr;
    } else {
        print $indent->($depth+1), $xhr, "\n";
    }
    # $node now carries the attributes too, so the closing tag repeats them
    print $indent->($depth), '</', $node, '>', "\n";
}

pp_xml($structure, 0);

Making it accept a callback indent function as a third argument is left as an exercise for others who are equally in need of R&R 8=).
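
For anyone who would rather peek at one possible answer, here is a minimal sketch of that exercise. `pp_xml_cb` is a hypothetical name, and it deliberately differs from `pp_xml` in two ways: it returns the text instead of printing it, and it keeps the bare node name around so the closing tag comes out without attributes. Attributes are sorted only so the output is predictable.

```perl
use strict;
use warnings;

# Variant of pp_xml that takes an optional indent callback as its
# third argument and returns the formatted text.
sub pp_xml_cb {
    my ($xhr, $depth, $indent) = @_;
    $depth  //= 0;
    $indent //= sub { "\t" x shift };   # default: one tab per level

    my $name = $xhr->{node} // die "XML node has no name!\n";

    # Build the opening tag separately so the closing tag stays bare.
    my $open = $name;
    if (my $attrs = $xhr->{attributes}) {
        $open .= " $_='$attrs->{$_}'" for sort keys %$attrs;
    }

    my $out  = $indent->($depth) . "<$open>\n";
    my $data = $xhr->{data};
    if (ref $data eq 'ARRAY') {
        $out .= pp_xml_cb($_, $depth + 1, $indent) for @$data;
    } elsif (defined $data) {
        $out .= $indent->($depth + 1) . "$data\n";
    }
    return $out . $indent->($depth) . "</$name>\n";
}

my $node = {
    node       => 'child1',
    attributes => { attr => 'val' },
    data       => 'string of text',
};

# Two spaces per level instead of the default tab.
print pp_xml_cb($node, 0, sub { '  ' x shift });
```

The callback receives the current depth and returns whatever string should sit in front of the line, so switching from tabs to spaces (or to no indentation at all) does not touch the printer itself.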

Terry@dixie$ perl -Mstrict /tmp/xml.pl -Mwarnings                         21:57
<rootnode>
	<child1 attr='val'>
		string of text
	</child1 attr='val'>
	<child1>
		<child2>
			another string of text
		</child2>
	</child1>
</rootnode>