Tuesday, May 19, 2009

Random thoughts on working with XML

It's no real secret that I love XML but truly hate working with XML parsers in general (^_^).

Xerces-C++ and libxml++ are not too bad, but I have never met a parser that I love. The main reason I chose Xerces was the painlessness of compiling and linking against the library; I really do not want to go through the bother of setting up libxml++ in MSVC. Especially after taking a look at the pkg-config output on my workstation:

FreeBSD$  pkg-config libxml++-2.6 --cflags --libs                      18:07
-I/usr/local/include/libxml++-2.6 -I/usr/local/include/libxml++-2.6/include 
-I/usr/local/include/libxml2 -I/usr/local/include -I/usr/local/include/glibmm-2.4 
-I/usr/local/lib/glibmm-2.4/include -I/usr/local/include/sigc++-2.0 
-I/usr/local/lib/sigc++-2.0/include -I/usr/local/include/glib-2.0 
-I/usr/local/lib/glib-2.0/include  -L/usr/local/lib -lxml++-2.6 -lxml2 -lglibmm-2.4
-lgobject-2.0 -lsigc-2.0 -lglib-2.0  

SAX, DOM, or whatever else, the parser style doesn't really matter to me that much, as long as it gets the job *done*. Although obviously, I am more familiar with the DOM (thank you, JavaScript). I tend to use XML for storing structured data without having to resort to a binary file/database, or a jumble of files within a zip archive. So operations tend to be very straightforward using a couple of glue functions.

Personally, my idea of fun XML parsing is to take data like this as input:

<rootnode>
  <child1 attr="val">string of text</child1>
  <child1>
    <child2>another string of text</child2>
  </child1>
</rootnode>

and to in turn receive a nested data structure like this as output:

# example in Perl
my $structure = {
    node       => 'rootnode',
    attributes => undef,
    data       => [
        {
            node       => 'child1',
            attributes => { attr => 'val' },
            data       => 'string of text',
        },
        {
            node       => 'child1',
            attributes => undef,
            data       => [
                {
                    node       => 'child2',
                    attributes => undef,
                    data       => 'another string of text',
                },
            ],
        },
    ],
};

Probably because that is how my brain sees the preceding XML xD.
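To make the input-to-structure idea concrete, here is a rough, pure-Perl sketch that turns the sample XML into the same kind of nested hash. It is a toy under loud assumptions: `parse_xml` is a made-up name, the regex only understands plain elements, quoted attributes, and text, and it knows nothing about comments, CDATA, entities, self-closing tags, or mixed content. It is not how Xerces-C++ or libxml++ work, and a real parser is the right tool for real documents.

```perl
use strict;
use warnings;

# Toy converter: XML text in, nested { node, attributes, data } hash out.
# parse_xml is a hypothetical helper, not part of any library.
sub parse_xml {
    my ($xml) = @_;

    # Sentinel parent so the root element has somewhere to be pushed.
    my $top  = { node => '#document', attributes => undef, data => [] };
    my @open = ($top);    # stack of elements that are still open

    # Walk the input token by token: either a tag or a run of text.
    while ($xml =~ /<(\/?)([\w:.-]+)([^>]*)>|([^<]+)/g) {
        my ($closing, $name, $rest, $text) = ($1, $2, $3, $4);

        if (defined $text) {
            next unless $text =~ /\S/;    # skip whitespace between tags
            # Leaf node: data is a plain string (mixed content is ignored).
            $open[-1]{data} = $text unless ref $open[-1]{data};
        }
        elsif ($closing) {
            my $done = pop @open;
            die "mismatched </$name>\n"
                unless @open && $done->{node} eq $name;
        }
        else {
            # Pull name="value" pairs out of the rest of the opening tag.
            my %attrs = $rest =~ /([\w:.-]+)\s*=\s*["']([^"']*)["']/g;
            my $node  = {
                node       => $name,
                attributes => (%attrs ? \%attrs : undef),
                data       => undef,
            };
            # First child turns the parent's data into an array.
            $open[-1]{data} = [] unless ref $open[-1]{data};
            push @{ $open[-1]{data} }, $node;
            push @open, $node;
        }
    }
    die "unclosed <$open[-1]{node}>\n" if @open > 1;
    return $top->{data}[0];
}

my $tree = parse_xml(<<'XML');
<rootnode>
  <child1 attr="val">string of text</child1>
  <child1>
    <child2>another string of text</child2>
  </child1>
</rootnode>
XML

print $tree->{data}[1]{data}[0]{data}, "\n";    # another string of text
```

The stack of open elements is the whole trick: an opening tag pushes a new hash, a closing tag pops it, and text lands in whatever is on top.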

Not to mention it makes writing something like a pretty printer easy as pi:

# for some reason, writing this subroutine was very relaxing...
sub pp_xml {

    my $xhr     = shift;                    # current node (hash ref)
    my $depth   = shift;                    # nesting level, 0 for the root
    my $indent  = sub { "\t" x shift };     # one tab per level
    my $node    = $xhr->{node}  or warn "XML node has no data!\n";

    if ($xhr->{attributes}) {
        while (my ($attr, $val) = each %{$xhr->{attributes}}) {
            # note: $node now carries the attributes too, so the
            # closing tag printed further down repeats them
            $node .= " " . $attr . "='" . $val . "'";
        }
    }
    print $indent->($depth), '<', $node, '>', "\n";

    $xhr = $xhr->{data};
    if (ref $xhr eq 'ARRAY') {
        # child nodes: recurse one level deeper
        pp_xml($_, $depth+1) foreach @$xhr;
    } else {
        # plain text content
        print $indent->($depth+1), $xhr, "\n";
    }
    print $indent->($depth), '</', $node, '>', "\n";
}

pp_xml($structure, 0);

Making it accept a callback indent function as a 3rd argument is left as an exercise for others who are equally in need of R&R 8=).
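For anyone who would rather peek at one possible answer, here is a minimal sketch of that exercise. Everything new in it is an assumption for illustration: the name `pp_xml_cb` is hypothetical, it returns a string instead of printing (so it is easier to test), it sorts attributes so the output is stable, and it keeps the bare element name separately so the closing tag does not repeat the attributes.

```perl
use strict;
use warnings;

# Variant of the pretty printer taking an optional indent callback as its
# third argument. The callback receives the depth and returns the leading
# whitespace for that level. pp_xml_cb is a hypothetical name.
sub pp_xml_cb {
    my ($xhr, $depth, $indent) = @_;
    $indent ||= sub { "\t" x shift };    # default: one tab per level

    my $name = $xhr->{node};
    my $open = $name;                    # opening tag text, with attributes
    if ($xhr->{attributes}) {
        for my $attr (sort keys %{ $xhr->{attributes} }) {
            $open .= " $attr='$xhr->{attributes}{$attr}'";
        }
    }

    my $out  = $indent->($depth) . "<$open>\n";
    my $data = $xhr->{data};
    if (ref $data eq 'ARRAY') {
        # Child nodes: recurse, handing the same callback down.
        $out .= pp_xml_cb($_, $depth + 1, $indent) for @$data;
    }
    elsif (defined $data) {
        # Plain text content sits one level deeper than its tag.
        $out .= $indent->($depth + 1) . "$data\n";
    }
    return $out . $indent->($depth) . "</$name>\n";
}

# Usage: two spaces per level instead of a tab.
my $doc = {
    node       => 'child1',
    attributes => { attr => 'val' },
    data       => 'string of text',
};
print pp_xml_cb($doc, 0, sub { '  ' x shift });
```

Passing the callback down through the recursion keeps the indent policy in one place, so switching between tabs, spaces, or no indentation at all is a one-argument change for the caller.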

Terry@dixie$ perl -Mstrict /tmp/xml.pl -Mwarnings                         21:57
<rootnode>
        <child1 attr='val'>
                string of text
        </child1 attr='val'>
        <child1>
                <child2>
                        another string of text
                </child2>
        </child1>
</rootnode>
