The STAMAN Project: Phase III, of tasks and storage formats

At least for me, there are only two pieces of STAMAN that are not trivial to work out before writing the code: choosing the storage format and the implementation language. Both also happen to be areas where experience strongly augments one’s intuition, more so than for the rest of the app.

In the design outline, I noted that YAML would work quite nicely, yet a closer look at the outline suggests that something closer to SQL could better serve the application’s design. The reasons behind it should be fairly obvious, if you’ve ever worked with textual data before.

During Phase I, I concentrated on the data involved with task management. It’s not hard to implement an SQL schema capable of representing that. Even better, most dialects offer useful features for handling times/dates. Virtually every programming language has a way of interfacing with such an SQL database, either through native bindings or by calling out to scriptable client programs. SQLite, MySQL, and PostgreSQL in fact provide both means; I’m not familiar with MSSQL. So that’s a big set of pluses all the way around. We even get a reusable DSL to help, without having to write it!
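To make that a little more concrete, here is a minimal sketch using Python’s bundled sqlite3 module. The table and column names are made up for illustration (they are not STAMAN’s actual schema), and the due date is assumed to be stored as an ISO 8601 string so that SQLite’s date() function can do the comparison.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS tasks (
    id    INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    due   TEXT,                       -- ISO 8601 date, e.g. '2009-03-01'
    done  INTEGER NOT NULL DEFAULT 0  -- 0 = pending, 1 = finished
);
"""

def open_store(path=":memory:"):
    """Open (or create) a database and make sure the schema exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn

def overdue(conn, today):
    """Return titles of unfinished tasks due before `today` (an ISO date string)."""
    rows = conn.execute(
        "SELECT title FROM tasks WHERE done = 0 AND due < date(?) ORDER BY due",
        (today,),
    )
    return [title for (title,) in rows]

conn = open_store()
conn.executemany(
    "INSERT INTO tasks (title, due, done) VALUES (?, ?, ?)",
    [
        ("write schema", "2009-01-10", 0),
        ("pick language", "2009-02-01", 1),
        ("ship it", "2009-12-31", 0),
    ],
)
print(overdue(conn, "2009-06-01"))
```

The date handling and the filtering both live in the query, which is the point: none of that code had to be written by hand.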

The problem, however, becomes one of migration paths: what happens if you need to change the data structures, perhaps heavily? That means a lot more work whenever restructuring is needed, and it’s, IMHO, less scriptable than a little Perl golf; sufficiently so that I’m not going to screw with it. Insert shameless plug for Ruby on Rails here ;).
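For a sense of what that extra work looks like, here is one common way of handling it, sketched with Python’s sqlite3 module: keep an ordered list of migration statements and use SQLite’s user_version pragma to remember how many have been applied. The statements and the migrate() helper are hypothetical, not something STAMAN does.

```python
import sqlite3

# Hypothetical migration history; each entry moves the schema forward one version.
MIGRATIONS = [
    "CREATE TABLE tasks (id INTEGER PRIMARY KEY, title TEXT NOT NULL)",
    "ALTER TABLE tasks ADD COLUMN due TEXT",
    "ALTER TABLE tasks ADD COLUMN priority INTEGER NOT NULL DEFAULT 0",
]

def migrate(conn):
    """Apply any migrations newer than the stored version; return the new version."""
    (version,) = conn.execute("PRAGMA user_version").fetchone()
    for number, statement in enumerate(MIGRATIONS[version:], start=version + 1):
        conn.execute(statement)
        # PRAGMA does not accept bound parameters; `number` is always an int here.
        conn.execute(f"PRAGMA user_version = {number}")
    conn.commit()
    return conn.execute("PRAGMA user_version").fetchone()[0]

conn = sqlite3.connect(":memory:")
print(migrate(conn))  # applies all three steps
print(migrate(conn))  # nothing left to apply
```

Adding a column is the easy case. Splitting or merging tables means copying data between the old and new layouts, and that is where the real work piles up.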

In a commercial environment, i.e. one oriented on making money off the program, XML would be more likely than any other textual format, but it is not very convenient for me. I also hate XML parsing with a passion. It is, however, sufficient for getting the job done, even if it does, ahem, jack the amount of internal documentation you need to write (or later wish you had) several notches higher than it needs to be.

Someone might think of a simple Comma Separated Values (CSV) format, but CSV is anything but simple. Don’t believe me? Just think about data that may contain commas. That being said, the only good thing I can say about CSV, from a programming perspective, is that CPAN rocks. Unless you’re munging address books or spreadsheet data around and need a lowest common denominator (LCD), it is best to avoid CSV, period.
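The comma problem is easy to demonstrate. This is a small sketch using Python’s standard csv module (the sample task fields are invented): joining and splitting on commas by hand corrupts the row, while a real CSV library has to quote and escape fields to get it back intact.

```python
import csv
import io

# Two fields: one contains a comma, the other contains double quotes.
row = ["Buy milk, eggs", 'He said "soon"']

# Naive approach: join and split on commas.
naive_line = ",".join(row)
naive_fields = naive_line.split(",")
print(len(row), "fields in,", len(naive_fields), "fields out")

# Library approach: the writer quotes fields and doubles embedded quotes.
buf = io.StringIO()
csv.writer(buf).writerow(row)
encoded = buf.getvalue()
decoded = next(csv.reader(io.StringIO(encoded, newline="")))
print(repr(encoded))
print(decoded == row)
```

Embedded newlines, differing quote conventions, and encodings add further cases on top of this one, which is why hand-rolled CSV handling tends to go wrong.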

The best bet, in terms of structured text, is one sufficiently able to represent the data set and still be easily edited by hand. What is really needed is a dedicated format: enter YAML. It’s basically a hierarchical way of recording data as sequences of elements and key/value mappings. Works excellently.

The SQL solution relinquishes fine control over the operations, whereas the YAML method is assured to slurp up memory in proportion to the input. It’s a lot more like DOM-oriented XML, only the translation between the code and textual representation is a hell of a lot more natural. When working with program-generated output, it also doesn’t need to be fed through a pretty printer to be comprehensible, which can’t be said of XML, at least not without more pain for someone.

Pro YAML:

  • Easily edited by hand (notepad) and many unix tools.
  • So simple you can skip reading the spec [0].
  • If you have to write your own parser, make it YAML and save grey hairs.
  • It’s easy to serialize/marshal data around, as easy as it gets without eval().
  • More likely to benefit from compression.

Pro SQL:

  • Less imperative-style code to be written.
  • The hardest processing code is already in the database engine.
  • Can focus on querying data, not parsing it.
  • Languages/frameworks are more likely to ship SQLite bindings than a YAML parser.

Con YAML:

  • It really is as simple as it looks.
  • You have to write your own list/dictionary handling code.
  • Scales less well.

Con SQL:

  • You have to learn basic SQL.
  • Not the most fun in some languages (C, C++, Java, and C#).
  • Can’t really get at the data, short of a database client.

Note that I haven’t said anything about separating the data store from the client application: using an SQL server is just as viable as storing YAML files on a network drive. It really is that simple.

My personal view? SQL’s virtues likely outweigh YAML’s here, unless you’re going to be designing by exploration. I’m not in this case, and I am also competent enough not to shoot myself in the foot. If I were smart, I would make the application-wide interface to the data store more abstract than writing SQL queries all over the place like an asshole. Yes, I can be that smart. Don’t tell your neighbours.
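As a rough idea of what “more abstract” could mean, here is a minimal sketch in Python with sqlite3. The TaskStore class, its method names, and the table layout are all hypothetical; the only point is that the SQL lives in one place and the rest of the application calls plain methods.

```python
import sqlite3

class TaskStore:
    """Hypothetical application-wide interface; callers never see SQL."""

    def __init__(self, path=":memory:"):
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS tasks ("
            "id INTEGER PRIMARY KEY, "
            "title TEXT NOT NULL, "
            "done INTEGER NOT NULL DEFAULT 0)"
        )

    def add(self, title):
        """Store a new task and return its id."""
        cur = self._conn.execute("INSERT INTO tasks (title) VALUES (?)", (title,))
        self._conn.commit()
        return cur.lastrowid

    def complete(self, task_id):
        """Mark a task as finished."""
        self._conn.execute("UPDATE tasks SET done = 1 WHERE id = ?", (task_id,))
        self._conn.commit()

    def pending(self):
        """Return the titles of unfinished tasks, oldest first."""
        rows = self._conn.execute(
            "SELECT title FROM tasks WHERE done = 0 ORDER BY id"
        )
        return [title for (title,) in rows]

store = TaskStore()
first = store.add("choose storage format")
store.add("choose implementation language")
store.complete(first)
print(store.pending())
```

A nice side effect of this shape is that a YAML-backed class with the same three methods could be swapped in later without touching the callers.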
[0]: I read the YAML specification the first time I used it for a project, which was for a rake-based build system. How else could I expect to hand-write my build specs in YAML? :-).