Designing XML: Some Notes

December 12, 2010

Back in 2003, a couple of things happened. The program that I had bought to catalog my comic books had finally died. The company that sold it had gone under years before, the program simply stopped working under Windows XP, and I didn’t have the money to buy a new one at the time (I was freelancing in the aftermath of the burst Internet bubble). I was also having a lot of trouble managing my books, CDs, and DVDs…they were piling up, and I’d occasionally buy something I already owned.

Excel really didn’t fit my needs, and I didn’t want to use the ‘cheap’ office-suite database applications. They provide immediate benefits, but if you don’t keep up with the upgrades, your data is eventually locked into a dead proprietary format…I’d just experienced that with my comics, and I didn’t want to re-enter thousands of entries every few years.

This was also long before the Apple App Store, librarything.com, or any of the other services that have come along in the last 5 years.

My needs were modest. I didn’t even need an application, really…just a plain-text format that I could use to script reports from.

At the time, ASP Classic was still being heavily used, and I had enough chops to build basic pages. I was also becoming interested in PHP. All I needed was a data format.

Enter XML.

I created formats for my comics, books, CDs, DVDs…even my video games. I’d like to talk about the books.xml file I put together, and how it’s changed across 7 years.

The first cut

I started without a DTD, because I didn’t want the overhead of maintaining one at first. What I wanted to track was simple:

  • One root element, containing multiple book elements
  • Each book tracked contributors, title, whether I owned it, whether I read it, and whether I wanted to buy it
  • Each book element could contain as many contributors as needed
  • Each contributor had a role assigned to them (author, editor, illustrator, etc)
  • If the book had been signed, which contributor had signed it
  • A notes field, to contain any other information

Here’s the first cut:


<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<bookcollection>
	<book>
		<person type="author">
			<first></first>
			<last></last>
			<signed />
		</person>
		<person type="editor">
			<first></first>
			<last></last>
		</person>
		<title></title>
		<notes></notes>
		<flags own="true" read="true" wanted="false" />
	</book>
</bookcollection>

The <notes> element was never really used. Neither was the wanted attribute in the <flags> element. I have to admit that splitting the author’s name into a <first> and <last> element has remained a thorn in my side, but I haven’t found a better way to do alphabetical sorting.
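For what it’s worth, the split is what makes the sort mechanical. Here is a minimal sketch in Python (standard library only, and not the XSLT this collection actually runs on) that orders books by the first contributor’s last name, then first name. The `sort_key` and `sorted_titles` helpers are hypothetical names for illustration, and the sample records are trimmed to the first-cut format.

```python
import xml.etree.ElementTree as ET

# Trimmed sample records in the first-cut format.
SAMPLE = """<bookcollection>
  <book>
    <person type="author"><first>Sergei</first><last>Lukyanenko</last></person>
    <title>The Night Watch</title>
    <flags own="false" read="true" wanted="false" />
  </book>
  <book>
    <person type="author"><first>Terry</first><last>Eagleton</last></person>
    <title>Saints and Scholars</title>
    <flags own="true" read="false" wanted="false" />
  </book>
</bookcollection>"""


def sort_key(book):
    """Sort on the first contributor's (last, first), ignoring case."""
    person = book.find("person")
    if person is None:
        return ("", "")
    last = person.findtext("last", default="")
    first = person.findtext("first", default="")
    return (last.lower(), first.lower())


def sorted_titles(xml_text):
    """Return the titles ordered by contributor name."""
    root = ET.fromstring(xml_text)
    books = sorted(root.findall("book"), key=sort_key)
    return [book.findtext("title") for book in books]


if __name__ == "__main__":
    print(sorted_titles(SAMPLE))
```

With a single name field, the key function would have to guess where the surname starts, which is the whole reason the split survives.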

I’m lazy, and I edit this file by hand. No admin screens, no passwords, no fancy forms, no JavaScript. I could, if I wanted, run XSLT transforms on this from the command line, but I found it easier to make a simple ASP script that pushed books.xml together with a stylesheet, and pushed the output to a browser.

I didn’t (and still don’t) care much about styling. This is how it looked back in the day:

Maintaining a record with this format is fairly simple. When I buy a book, I add it with the flags <flags own="true" read="false" wanted="false" />; when I’ve read it, I set: read="true"; when I sell it, I set: own="false"; if I sell it without reading it, I delete the whole record. I add the <signed /> element to any <person> who signs the book. If a person signs it who hasn’t been a contributor, I can add them with type="non-contributor". I can run totals against most of the elements and attributes to give me some basic stats.

Adding years

One thing I couldn’t do with this format is track when I read something. Because I’d been reading my whole life, a lot of that data is lost, but I could start tracking what I was reading now.

So, the data format was modified in late 2009 to this:



<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<bookcollection>
	<book>
		<person type="author">
			<first></first>
			<last></last>
			<signed />
		</person>
		<person type="editor">
			<first></first>
			<last></last>
		</person>
		<title></title>
		<notes></notes>
		<flags own="true" read="true" wanted="false" year="2010" />
	</book>
</bookcollection>

Notice that we still haven’t added a DTD, or dropped the useless wanted attribute…yet. I still don’t have a way to track when the book was added to the collection, but I think that would add some needless overhead.

I have a ‘mini-format’ for the year attribute that runs like this:

  • for all books read prior to 2009: year="1970-2009"
  • for books read in 2009: year="2009"
  • moving ahead, add the year the book was read as YYYY
  • leave the attribute blank if unread

This gives me an easy way to run stats in XSLT:

  • count(book/person/signed)
  • count(//book/flags[@year='1970-2009'])
  • count(//book/flags[@year='2009'])
  • count(//book/flags[@year='2010'])
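The same totals can be sketched outside XSLT. This is a rough Python equivalent using the standard library’s ElementTree, which understands a small XPath subset. The `stats` helper and the one-letter names in the sample are made up for illustration, and an empty year is counted as "unread" per the mini-format above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one signed book, one per year bucket, one unread.
SAMPLE = """<bookcollection>
  <book>
    <person type="author"><first>A</first><last>B</last><signed /></person>
    <title>One</title>
    <flags own="true" read="true" wanted="false" year="1970-2009" />
  </book>
  <book>
    <person type="author"><first>C</first><last>D</last></person>
    <title>Two</title>
    <flags own="true" read="true" wanted="false" year="2010" />
  </book>
  <book>
    <person type="author"><first>E</first><last>F</last></person>
    <title>Three</title>
    <flags own="true" read="false" wanted="false" year="" />
  </book>
</bookcollection>"""


def stats(xml_text):
    """Count signed contributors and tally books per year value."""
    root = ET.fromstring(xml_text)
    counts = {"signed": len(root.findall("book/person/signed"))}
    for flags in root.findall("book/flags"):
        year = flags.get("year", "")
        key = year if year else "unread"  # blank year means not read yet
        counts[key] = counts.get(key, 0) + 1
    return counts


if __name__ == "__main__":
    print(stats(SAMPLE))
```

Tallying whatever year values turn up means a new year needs no new query, at the cost of also tallying any typo in the attribute.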

February 2010: The Huge Re-org

In February, I decided to do a major re-org. I hadn’t added a single CD, DVD, or video game since I built those formats, and had switched to PHP as the main way to run transformations on my files. I deleted everything but the book and comics XML files, killed all the ASP, and moved a lot of stylesheets and support files into a common directory.

I also moved around a lot of records in books.xml. It became clear that although the flags were working, grouping the records into 3 large buckets would cut down on the amount of time I needed to move around in the file.

This was also a prelude to adding more elements, but let’s look at books.xml before that happened:


<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<bookcollection>
	<!-- ========================================== -->
	<!-- OWN AND NOT READ ========================= -->
	<book>
		<person type="author">
			<first>Terry</first>
			<last>Eagleton</last>
		</person>
		<title>Saints and Scholars</title>
		<flags own="true" read="false" wanted="false" year="" />
	</book>
	<!-- ========================================== -->
	<!-- OWN AND READ ============================= -->
	<book>
		<person type="editor">
			<first>Karl</first>
			<last>Hyde</last>
		</person>
		<person type="editor">
			<first>John</first>
			<last>Warwicker</last>
		</person>
		<title>mmm..skyscraper i love you</title>
		<flags own="true" read="true" wanted="false" year="1970-2009" />
	</book>
	<!-- ========================================== -->
	<!-- READ BUT DON'T OWN                         -->
	<book>
		<person type="author">
			<first>Sergei</first>
			<last>Lukyanenko</last>
		</person>
		<title>The Night Watch</title>
		<flags own="false" read="true" wanted="false" year="1970-2009" />
	</book>
</bookcollection>

Going Online

I’m now registered at goodreads.com and librarything.com, and both are good services, but the bulk import options leave something to be desired. You need to have a CSV file, and you absolutely need ISBNs. Sigh…by this point, I had 1200 books in my books.xml file, and none of them had ISBNs. I also don’t have a bar-code scanner.

I was also getting into trouble with validation, as I would occasionally misplace an element, and only know about it from malformed HTML coming out the other end. Time for a DTD:


<!ELEMENT bookcollection (book*)>

<!ELEMENT book (person*, title, series?, notes?, flags)>
	<!ELEMENT person (first, last, signed?)>
		<!ATTLIST person type CDATA #IMPLIED>
		<!ELEMENT first  (#PCDATA)>
		<!ELEMENT last   (#PCDATA)>
		<!ELEMENT signed EMPTY>
	<!ELEMENT title  (#PCDATA)>
	<!ELEMENT series (#PCDATA)>
	<!ELEMENT notes  (#PCDATA)>
	<!ELEMENT flags  EMPTY>
		<!ATTLIST flags own  CDATA #IMPLIED>
		<!ATTLIST flags read CDATA #IMPLIED>
		<!ATTLIST flags isbn CDATA #IMPLIED>
		<!ATTLIST flags year CDATA #IMPLIED>

You’ll notice a new element, a new attribute, and a dropped attribute. Let’s look at books.xml again:


<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE bookcollection PUBLIC 
	 "-//Jeff Wyonch//DTD Books//EN//"
	 "Common/dtd/Books.dtd">
<bookcollection>
	<book>
		<person type="author">
			<first></first>
			<last></last>
			<signed />
		</person>
		<person type="editor">
			<first></first>
			<last></last>
		</person>
		<title></title>
		<series></series>
		<notes></notes>
		<flags own="true" read="true" isbn="" year="2010" />
	</book>
</bookcollection>

Technically, the owner field of the public identifier in the DOCTYPE declaration is supposed to carry the organization’s name, but there is no company here, so my name will have to do.

The wanted attribute is gone from <flags>, replaced by an isbn attribute.

I also wanted to start tracking whether the book was part of a series (The Lord of the Rings, the Amber series) or part of an imprint (Penguin Classics, Peter Pauper Press). This led to the first shiny new element in 7 years, the <series> element. As we’ll see, this was not the best solution.

Another thing: the <notes> element only supports text, so no bold, italics, etc. You can ‘mix in’ other DTDs with external parameter entities, and I may, at some point, add some XHTML to this field.

This led to the first (ENORMOUS…ok, just kidding) style refresh in 7 years:

I know…I’m Picasso. Not. Anyway, it gets the job done.

By this point, I had turned my simple little PHP script into something that could handle the transformation needs for both my book and comic collections, validating as well:


<?php

	$DEBUG = 'NO';

	error_reporting(E_ALL);

	/* Define variables; fall back to empty strings so a missing
	   query-string key doesn't raise an undefined-index notice */
	$Type = isset($_GET['type']) ? $_GET['type'] : '';
	$Report = isset($_GET['report']) ? $_GET['report'] : '';
	$Params = isset($_GET['params']) ? $_GET['params'] : '';
	$paramList = explode(',',$Params);
	$xmlPath =  $Type . '/' . $Type . '.xml';
	$xslPath = 'Common/xsl/' . $Type . '_' . $Report . '.xsl';

	/* load xslt and xml docs */
	$xsl = new DOMDocument;
	$xsl->load($xslPath);

	$xslt = new XSLTProcessor();
	$xslt->importStylesheet($xsl);

	$xml = new DOMDocument;
	$xml->load($xmlPath);

	/* validate against the DTD named in the DOCTYPE */
	if($xml->validate()) {
		/* Keep on trucking */
	} else {
		/* print the path; a DOMDocument can't be cast to a string */
		echo "$xmlPath is invalid.\n";
	}

	/* add parameter/PHP/xinclude support out-of-the-box */
	if( empty($paramList[0]) ) {
		/* Keep on trucking */
	} else {
		foreach ($paramList as $value) {
			/* for now, each param carries its own name as its value */
			$xslt->setParameter('',$value,$value);
		}
	}
	$xslt->registerPHPFunctions();
	$xml->xinclude();

	/* TODO: add char encoding functions here */

	/* echo results */
	if($DEBUG == 'YES') {
		echo '<h2>Type: ' . $Type . '</h2>';
		echo '<h2>Report: ' . $Report . '</h2>';
		echo '<h2>xslPath: ' . $xslPath . '</h2>';
		echo '<h2>xmlPath: ' . $xmlPath . '</h2>';
		echo '<h2>paramList: ';
		print_r( $paramList );
		echo '</h2>';
	} else {
		$results = $xslt->transformToXML($xml);
		echo $results;
	}

?>

This isn’t the greatest script on the face of the earth. A lot of it should be in functions. One thing I’ve done is activate PHP function and parameter support inside the XSLT, so in the future, I’ll be able to pass parameters in through a query string, set variables with the params, and use PHP functions inside the XSLT when I need to.

This turns my URLs into something like this: transform.php?type=Books&report=chronological&params=foo,bar. Eventually, I’ll need to pass key/value pairs in, so params will look like this: params=foo.bar,read.true. This is largely a convenience, so I don’t have to code an endless number of variables every time I add or subtract a param in the URL.
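As a rough sketch of how that key/value form might be parsed (in Python rather than the PHP above, and not something the script does yet), the hypothetical `parse_params` helper below splits on commas, then on the first dot. Treating a bare name as its own value is an assumption, chosen to mirror how the script currently passes each param as both name and value.

```python
def parse_params(raw):
    """Turn 'foo.bar,read.true' into {'foo': 'bar', 'read': 'true'}.

    A bare name such as 'foo' maps to itself (assumption: this keeps
    the current name-equals-value behaviour for old-style URLs).
    """
    params = {}
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue  # skip empty entries, e.g. from a trailing comma
        key, sep, value = item.partition(".")
        params[key] = value if sep else key
    return params


if __name__ == "__main__":
    print(parse_params("foo.bar,read.true"))
```

Splitting on the first dot only means a value may itself contain dots, but a key may not.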

You’ll notice the very strong naming conventions that cut down on how many variables you need to initialize. Naming conventions are your friend. They can automate an enormous amount of grunt work.
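The convention is small enough to sketch in a few lines. This Python version (a stand-in for the PHP, with a hypothetical `resolve_paths` helper) derives both file paths from the two query-string values. The whitelist check is an addition that the original script does not have, on the assumption that type and report names are plain words.

```python
import re

# Assumption: type and report names are letters, digits, or underscores.
SAFE_NAME = re.compile(r"[A-Za-z0-9_]+")


def resolve_paths(type_name, report):
    """Derive the XML and XSL paths from the naming convention."""
    for name in (type_name, report):
        if not SAFE_NAME.fullmatch(name):
            raise ValueError("unexpected characters in %r" % name)
    xml_path = "%s/%s.xml" % (type_name, type_name)
    xsl_path = "Common/xsl/%s_%s.xsl" % (type_name, report)
    return xml_path, xsl_path


if __name__ == "__main__":
    print(resolve_paths("Books", "chronological"))
```

Because the paths are built straight from user input, rejecting anything that isn’t a plain name keeps a stray `../` from wandering outside the collection directories.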

A lot of books start off telling you how to build XML documents in memory. I think this is a big waste of time and server resources for any type of records format. It’s almost always better to bake the files, and I have no need to create files in memory.

Adding Subjects

It’s strange, how your interests can trap you. Take this blog, for example. Now that I can track what I’ve read in a year, I do a year-end round-up posting, talking about the year’s books, and I like to split up my books into subjects. Well, this is another thing I’d rather have the computer do for me. Enter the <subject> element:


<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE bookcollection PUBLIC 
	 "-//Jeff Wyonch//DTD Books//EN//"
	 "Common/dtd/Books.dtd">
<bookcollection>
	<book>
		<person type="author">
			<first></first>
			<last></last>
			<signed />
		</person>
		<person type="editor">
			<first></first>
			<last></last>
		</person>
		<subject></subject>
		<title></title>
		<series></series>
		<imprint></imprint>
		<notes></notes>
		<flags own="true" read="true" isbn="" year="2010" />
	</book>
</bookcollection>

One thing you’ll note is that I’m not revving the DTD, which is another little error.

You’ll also notice that <series> and <imprint> are now separate elements.

By this point, the format is getting a little messy. There’s an enormous amount of redundant data. Every time you add a subject, one typo can lead to it not being tracked properly when you run reports.
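One way to catch those typos before they skew a report is to count the subjects and flag any rare one that looks like a more common one. The sketch below is Python with the standard library’s difflib, not part of the actual toolchain. The helper names, the 0.85 cutoff, and the deliberately misspelled sample are all assumptions for illustration, and the sample records are trimmed to the two elements that matter here.

```python
import difflib
import xml.etree.ElementTree as ET
from collections import Counter

# Trimmed, hypothetical records; the third subject is a deliberate typo.
SAMPLE = """<bookcollection>
  <book><subject>Science Fiction</subject><title>A</title></book>
  <book><subject>Science Fiction</subject><title>B</title></book>
  <book><subject>Sceince Fiction</subject><title>C</title></book>
  <book><subject>History</subject><title>D</title></book>
</bookcollection>"""


def subject_counts(xml_text):
    """Count how often each subject string appears."""
    root = ET.fromstring(xml_text)
    return Counter((s.text or "").strip() for s in root.iter("subject"))


def suspicious_subjects(counts, cutoff=0.85):
    """Pair each rarer subject with a more common look-alike, if any."""
    pairs = []
    names = sorted(counts)
    for name in names:
        commoner = [n for n in names if n != name and counts[n] > counts[name]]
        match = difflib.get_close_matches(name, commoner, n=1, cutoff=cutoff)
        if match:
            pairs.append((name, match[0]))
    return pairs


if __name__ == "__main__":
    print(suspicious_subjects(subject_counts(SAMPLE)))
```

This only points at likely typos; deciding which spelling is right is still a job for the person editing the file.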

Towards a new format

I’ve begun experimenting with a new format for storing this information. So far, I’m limited by PHP only having an XSLT 1.0 processor. I haven’t looked at PHP in a while, so this may have changed, or change with the next version.

This is a tentative first step:


<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE books PUBLIC 
	 "-//Jeff Wyonch//DTD Books//EN//"
	 "Common/dtd/BooksV2.dtd">
<books>
	<book>
		<title>Foundation</title>
		<contributor person="1" type="author">
			<signed />
		</contributor>
		<labels subject="30" series="1" imprint="1" />
		<flags own="true" read="true" year="2010" isbn="0553293354" />
	</book>

	<people>
		<person id="1">
			<first>Isaac</first>
			<last>Asimov</last>
		</person>
	</people>

	<categories>
		<category id="30">Science Fiction</category>
	</categories>

	<series>
		<seriesname id="1">Foundation Novels</seriesname>
	</series>

	<imprints>
		<imprint id="1">Ballantine</imprint>
	</imprints>

</books>

This is an improvement in terms of normalizing data, but it makes hand-editing a huge pain in the ass. I also haven’t found a way to sort alphabetically without the author’s name being part of the <book> element.

I haven’t really played enough with this yet to understand if I’m just barking up the wrong tree. During a transformation, copying the data from the people, categories, series and imprint elements back into the book element, then applying sorting and formatting may be the way to go. If this is the case, it may be better to go with full elements in book and use attributes only as ‘unique keys’ between the books and the other ‘table’ elements.
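To make the "copy it back, then sort" idea concrete, here is a minimal Python sketch (standard library only, standing in for whatever the XSLT would end up doing). It indexes the ‘table’ elements by id, resolves each book’s references, and sorts on the resolved contributor name. The helper names are hypothetical, and the second book and person id="2" are added purely to show the sort and a book with no labels.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<books>
  <book>
    <title>Saints and Scholars</title>
    <contributor person="2" type="author" />
  </book>
  <book>
    <title>Foundation</title>
    <contributor person="1" type="author"><signed /></contributor>
    <labels subject="30" series="1" imprint="1" />
  </book>
  <people>
    <person id="1"><first>Isaac</first><last>Asimov</last></person>
    <person id="2"><first>Terry</first><last>Eagleton</last></person>
  </people>
  <categories><category id="30">Science Fiction</category></categories>
  <series><seriesname id="1">Foundation Novels</seriesname></series>
  <imprints><imprint id="1">Ballantine</imprint></imprints>
</books>"""


def index_by_id(root, path):
    """Map each element's id attribute to the element itself."""
    return {el.get("id"): el for el in root.findall(path)}


def lookup(table, key):
    """Return the text stored under key, or None if the key is unknown."""
    el = table.get(key)
    return el.text if el is not None else None


def flatten(xml_text):
    """Resolve id references into plain values, sorted by contributor."""
    root = ET.fromstring(xml_text)
    people = index_by_id(root, "people/person")
    subjects = index_by_id(root, "categories/category")
    series = index_by_id(root, "series/seriesname")
    imprints = index_by_id(root, "imprints/imprint")
    rows = []
    for book in root.findall("book"):
        labels = book.find("labels")
        attrs = labels.attrib if labels is not None else {}
        names = []
        for contributor in book.findall("contributor"):
            person = people.get(contributor.get("person"))
            if person is not None:
                names.append((person.findtext("last", ""),
                              person.findtext("first", "")))
        rows.append({
            "title": book.findtext("title", ""),
            "contributors": names,
            "subject": lookup(subjects, attrs.get("subject")),
            "series": lookup(series, attrs.get("series")),
            "imprint": lookup(imprints, attrs.get("imprint")),
        })
    # Sort on the first contributor (last, first), then on title.
    rows.sort(key=lambda r: (r["contributors"][:1], r["title"]))
    return rows


if __name__ == "__main__":
    for row in flatten(SAMPLE):
        print(row)
```

Once the references are resolved, sorting is back to being a one-liner, which suggests the normalized file and the sortable view don’t have to be the same document.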

Why the hell is he even doing this?

If you’ve read this far, I suppose I owe at least a little bit of an explanation.

I love the online book collection services. I also love software. I get a big kick out of sharing what I’ve read, and talking about it.

But a lot of software and software services realize that once they’ve locked you into a particular data format, they have you for life. The problem is, they usually don’t last that long. It would be grand if Google was around for our grandchildren, and you could open the document you write on Google Docs today in that future. (Not slagging on Google, just using them as an example.)

But that won’t happen. At the end of the day, XML is just plain-text with angle brackets. Even if XML parsers won’t be around 50 years from now, plain-text probably will. And there will be somebody willing to create a custom conversion program for you. You can argue this point about most proprietary data formats, but most are not as copiously documented as XML.

I wish there was a perfect world where universal data formats existed, and you could choose what program you wanted to use to do your data entry. HTML gets most of the way there, but not quite. XML is almost perfect.

2011

I’m working on creating an XSLT that exports to CSV, so I can start to get a lot of my books into goodreads and librarything. I also have a lot of ISBNs, subjects, imprints, and series information to enter, before I can really start to migrate to a new format.
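The shape of that export can be sketched in Python with the standard csv module (the real thing is meant to be an XSLT, so treat this as a stand-in). The `to_csv` helper and the column names are assumptions, not either service’s documented import layout, and books without an ISBN are skipped because the importers insist on one.

```python
import csv
import io
import xml.etree.ElementTree as ET

SAMPLE = """<bookcollection>
  <book>
    <person type="author"><first>Isaac</first><last>Asimov</last></person>
    <title>Foundation</title>
    <flags own="true" read="true" isbn="0553293354" year="2010" />
  </book>
  <book>
    <person type="author"><first>Terry</first><last>Eagleton</last></person>
    <title>Saints and Scholars</title>
    <flags own="true" read="false" isbn="" year="" />
  </book>
</bookcollection>"""


def to_csv(xml_text):
    """Write one CSV row per book that has an ISBN."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    # Hypothetical column names; check the importer's own template.
    writer.writerow(["Title", "Author", "ISBN", "Year Read"])
    for book in root.findall("book"):
        flags = book.find("flags")
        if flags is None or not flags.get("isbn"):
            continue  # no ISBN, nothing the importer can match on
        authors = [
            "%s %s" % (p.findtext("first", ""), p.findtext("last", ""))
            for p in book.findall("person")
            if p.get("type") == "author"
        ]
        writer.writerow([
            book.findtext("title", ""),
            "; ".join(authors),
            flags.get("isbn"),
            flags.get("year", ""),
        ])
    return out.getvalue()


if __name__ == "__main__":
    print(to_csv(SAMPLE), end="")
```

Letting a CSV library handle the quoting matters more than it looks, since plenty of titles contain commas.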

Another plus would be the ability to get it all into a mobile format. Then, if I ever end up with a phone that accesses the Internet, I can peruse the collection in the store. Of course, I’ll need a personal website too.

If I move to a new format, I’ll need to create filters…with almost 1300 books listed, there’s no way I could do it manually without enormous effort. I’ve already built a filter that outputs authors, series, and imprints and removes the duplicates. For the record: currently 911 unique authors, 12 unique series, and 20 unique imprints.
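The filter itself isn’t shown here, but the idea fits in a few lines. This Python sketch (standard library only, with hypothetical helper names and a made-up three-book sample) collects each kind of value into a set so duplicates fall away, and drops the empty elements that the current format leaves behind.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: two books share an author, series, and imprint.
SAMPLE = """<bookcollection>
  <book>
    <person type="author"><first>Isaac</first><last>Asimov</last></person>
    <title>Foundation</title>
    <series>Foundation Novels</series>
    <imprint>Ballantine</imprint>
  </book>
  <book>
    <person type="author"><first>Isaac</first><last>Asimov</last></person>
    <title>Foundation and Empire</title>
    <series>Foundation Novels</series>
    <imprint>Ballantine</imprint>
  </book>
  <book>
    <person type="author"><first>Terry</first><last>Eagleton</last></person>
    <title>Saints and Scholars</title>
    <series></series>
    <imprint></imprint>
  </book>
</bookcollection>"""


def unique_text(root, tag):
    """Sorted, de-duplicated text of every <tag>, ignoring empty ones."""
    values = {(el.text or "").strip() for el in root.iter(tag)}
    return sorted(values - {""})


def unique_people(root):
    """Sorted, de-duplicated (last, first) pairs across all books."""
    return sorted({
        (p.findtext("last", "").strip(), p.findtext("first", "").strip())
        for p in root.iter("person")
    })


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    print(len(unique_people(root)), "unique people")
    print(unique_text(root, "series"))
    print(unique_text(root, "imprint"))
```

The sorted lists are a natural starting point for assigning the ids that a normalized format would need.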

I also want to cut down on the number of XSLT stylesheets I have, and create different views of the data by feeding parameters in with the query string. If I can do that, I can slice the data any way I want, before I choose an export format like HTML or CSV. I’d also like to work with Atom and JSON, just to get those chops.

And, of course, read more books. See you online.

3 Comments
  1. Bryan Cook
    December 13, 2010 1:02 am

    XML is so baked into our apps today that I have a hard time imagining a future without it. No doubt other formats like JSON will compete in this space, but XML’s self-describing nature makes it hard to beat.

    I’m surprised that you’ve been managing this by hand for seven years! Every Christmas I write an app in twelve days, an editor may be fun…

    • Jeff Wyonch
      December 14, 2010 10:11 pm

      Interestingly, managing it by hand has its advantages, the main one being portability. It was trivial to move from ASP to PHP, and modifying the data model only meant updating some XSLT…if I had built an admin app, it wouldn’t have been nearly as flexible.

Trackbacks

  1. Dude…you got your XML in my comic books. « Ramblebramble's Weblog
