Dude…you got your XML in my comic books.

February 21, 2012

They’re the two great tastes that taste great together! But I digress.

A while back, I wrote a post describing how I used XML to keep track of my books. I also use XML to keep track of my comics. I recently created a custom report via XSLT and HTML5 that lets me carry my want list with me as a web app on my iPod Touch. ‘Web app’ is a little misleading here…basically, it’s just a URL saved to my home screen that uses HTML5 storage and offline capabilities, along with some CSS3 styling. But hey, it works. No more copying the list out onto a sheet of paper.

I’d like to talk about how I store comics data in XML in this post. Once the bugs have all been ironed out in the web app, I’ll talk about that (I’m still weeding some things out with cache.manifest, and there are some other weird bugs I’m tracking down).

A lot of what I said at the beginning of the original post still holds true now:

Back in 2003, a couple of things happened. The program that I had bought to catalog my comic books finally died. The company that sold it had gone under years before, the program simply stopped working under Windows XP, and I didn’t have the money to buy a new one at the time (I was freelancing in the aftermath of the burst Internet bubble). I was also having a lot of trouble managing my books, CDs, and DVDs…they were piling up, and I’d occasionally buy something I already owned.

Excel really didn’t fit my needs, and I didn’t want to use the ‘cheap’ office-suite database applications. They provide immediate benefits, but if you don’t keep up with the upgrades, your data is eventually locked into a dead proprietary format…I’d just experienced that with my comics, and I didn’t want to re-enter thousands of entries every few years.

I started out with a full DTD. That was crazy, and there was a lot of blood on the keyboard for the first month or so. What I ended up with was something that looked like this:



<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE cbml PUBLIC 
	 "-//Jeff Wyonch//DTD CBML//EN//"
	 "Common/dtd/cbml.dtd">

<cbml xmlns:xi="http://www.w3.org/2001/XInclude">

<!-- Series Stored -->
	<xi:include href="series/1602.xml" />

<!--  Series In File  -->
	<series>
		<title></title>
		<volume></volume>
		<buylist />

		<comic>
			<publisher></publisher>
			<issue date="" binding="regular" printing="1st" no="" />
			<coverp cur="US Dollar">0</coverp>
			<creator media="cover"></creator>
			<creator media="pinup"></creator>
			<creator media="writer"></creator>
			<creator media="artist"></creator>
			<notes></notes>
			<copies>
				<copy grade="Near Mint" value="0" cur="US Dollar">
					<signed></signed>
					<tosell />
				</copy>
				<wanted />
			</copies>
		</comic>

	</series>

</cbml>

You’ll notice a few things right off the bat:

  • The use of XInclude to chop up the main document into more manageable pieces.
  • The appearance of duplicated data. It isn’t, but we’ll get to that in a moment.
  • The use of a lot of elements as booleans. If the <buylist /> element is there, then it’s on the buylist; if it isn’t, then it isn’t.
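The element-as-boolean idea is easy to consume from code. Here is a minimal sketch using Python’s standard library; the sample data, the series titles, and the on_buylist helper are all made up for illustration and are not part of my actual tooling:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data in the CBML shape shown above; titles are placeholders.
CBML = """
<cbml>
  <series>
    <title>Series A</title>
    <volume>1</volume>
    <buylist />
  </series>
  <series>
    <title>Series B</title>
    <volume>2</volume>
  </series>
</cbml>
"""

def on_buylist(series):
    """An empty element acts as a flag: present means True, absent means False."""
    # find() returns None when there is no matching child. Compare against None
    # explicitly rather than relying on the truthiness of the element itself.
    return series.find("buylist") is not None

root = ET.fromstring(CBML)
buying = [s.findtext("title") for s in root.iter("series") if on_buylist(s)]
print(buying)  # ['Series A']
```

The same test works for the other flag elements (wanted and tosell), since they are also empty elements whose presence is the whole message.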

What I needed out of the data format was fairly simple:

  • I want to track which series I regularly buy (the <buylist /> element)
  • I want to track contributors to each comic in a series, and their role
  • I want to know how many physical copies I have of each comic in each series, and whether those copies differ (for instance, if one copy is signed and the other isn’t)
  • I want to track whether I want to sell or buy a comic (regardless of how many copies I actually own)
  • I want to track the cover price separately from the actual current value

The reality is that some of these elements and attributes are copied verbatim whenever I add new series, because I collect to read, not for value, so condition and current value aren’t necessarily important to me.

Using XInclude was very important to me, primarily because I needed a way to store the series that wasn’t a monolithic file, and I wanted that to be as technology-agnostic as possible. Any XML parser that supports XInclude can handle this DTD and file (at least, in theory). Because a document included as XML must be well-formed, with a single root element, I split the master document up via the <series> element. This is a benefit, because if I needed to move to a different technology, or wanted to re-use the separate documents in another XML format, they are logically organized already.
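To show what “any parser that supports XInclude” looks like in practice, here is a rough sketch using Python’s xml.etree.ElementInclude module. The in-memory FILES dictionary and the load_from_memory function are assumptions that stand in for the real files on disk, and the contents of series/1602.xml are invented for the example:

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Hypothetical stand-in for the files on disk: href -> file contents.
FILES = {
    "series/1602.xml": "<series><title>1602</title><volume>1</volume></series>",
}

MASTER = """
<cbml xmlns:xi="http://www.w3.org/2001/XInclude">
  <xi:include href="series/1602.xml" />
  <series><title>Cerebus Jam</title><volume>1</volume></series>
</cbml>
"""

def load_from_memory(href, parse, encoding=None):
    """Loader callback for ElementInclude: return the parsed root for an href."""
    if parse != "xml":
        raise ValueError("only parse='xml' is handled in this sketch")
    return ET.fromstring(FILES[href])

root = ET.fromstring(MASTER)
# Replaces each xi:include element, in place, with the root of the loaded document.
ElementInclude.include(root, loader=load_from_memory)

titles = [s.findtext("title") for s in root.findall("series")]
print(titles)  # ['1602', 'Cerebus Jam']
```

Leaving out the loader argument makes ElementInclude read each href from disk instead. After the include step the master document is just a flat list of series elements, which is why splitting the files at the series level works so cleanly.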

This is a pragmatic format, not a bibliographic one. Concision is more important than precision. Here’s an example of a filled-out series, with only one issue:


<series>
	<title>Cerebus Jam</title>
	<volume>1</volume>
	<comic>
		<publisher>Aardvark-Vanaheim</publisher>
		<issue date="April 1985" binding="regular" printing="1st" no="1" />
		<coverp cur="US Dollar">1.70</coverp>
		<creator media="cover">Bill Sienkiewicz</creator>
		<creator media="writer">Dave Sim</creator>
		<creator media="artist">Scott Hampton</creator>
		<creator media="artist">Bo Hampton</creator>
		<creator media="artist">Murphy Anderson</creator>
		<creator media="artist">Gerhard</creator>
		<creator media="artist">Terry Austin</creator>
		<creator media="artist">Will Eisner</creator>
		<copies>
			<copy grade="Near Mint" value="1.70" cur="US Dollar">
				<signed>Will Eisner</signed>
				<signed>Dave Sim</signed>
				<signed>Gerhard</signed>
				<signed>Terry Austin</signed>
				<signed>Bill Sienkiewicz</signed>
			</copy>
		</copies>
	</comic>
</series>

The signatures are tracked inside the copy element because a second copy may have different signatures, or none at all. The publisher is tracked inside the comic element because the publisher may change while the volume and numbering continue (this happens more often than you may think). All the creators that are important to me are listed, but not sub-divided by contribution. In the case above, most of those creators contributed a different story…I could get to this level of detail, but I’d never leave the house.

If I want to leave the record intact, but don’t own a copy, I can leave the copies element blank. If I have a copy, but want another (for whatever reason, but usually because I can finally afford a more respectable copy), I can add the wanted flag. Honestly, the tosell flag hasn’t really been used, and I’m thinking of removing it.
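As a sketch of how a want list could fall out of these flags, here is a small Python example. It assumes the want list is simply every comic whose copies element contains the wanted flag; the sample series is a placeholder trimmed down to the elements the check reads (so it would not validate against the full DTD), and the want_list function is hypothetical rather than the XSLT report I actually use:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample series; the title and issue numbers are placeholders.
SERIES = """
<series>
  <title>Example Series</title>
  <volume>1</volume>
  <comic>
    <issue no="1" />
    <copies><copy grade="Near Mint" /></copies>
  </comic>
  <comic>
    <issue no="2" />
    <copies />
  </comic>
  <comic>
    <issue no="3" />
    <copies><copy grade="Good" /><wanted /></copies>
  </comic>
  <comic>
    <issue no="4" />
    <copies><wanted /></copies>
  </comic>
</series>
"""

def want_list(series):
    """Yield (title, issue number, copies owned) for every comic flagged as wanted."""
    title = series.findtext("title")
    for comic in series.iter("comic"):
        copies = comic.find("copies")
        if copies is None or copies.find("wanted") is None:
            continue  # no flag, so not on the want list
        owned = len(copies.findall("copy"))
        yield title, comic.find("issue").get("no"), owned

wanted = list(want_list(ET.fromstring(SERIES)))
print(wanted)  # [('Example Series', '3', 1), ('Example Series', '4', 0)]
```

Issue 2 is recorded but unowned and unflagged, so it stays off the list; issue 3 is owned but still wanted, which is the “more respectable copy” case.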

Although the above seems like a lot of information to fill out by hand, I have three or four canned templates saved as text files that let me add things very quickly.

If you want to see an example of a truly bibliographic comic book data format, check out the massive amount of information tracked at the Grand Comics Database. My hat’s off to the folks who contribute to this, but if I tried to manage this much information on my own, I’d still be recording my first longbox.

Here is the super-simple DTD for this baby:


<!-- CBML 0.0.7 DTD	 -->
<!-- cbml holds any mix of included series files and inline series, in any order -->
<!ELEMENT cbml (xi:include | series)*>
	<!ATTLIST cbml xmlns:xi CDATA #IMPLIED>

<!ELEMENT series (title, volume, buylist?, comic*)>
   <!ELEMENT title (#PCDATA)>
   <!ELEMENT volume (#PCDATA)>
   <!ELEMENT buylist EMPTY>

   <!ELEMENT comic (publisher, issue, coverp, creator+, notes?, copies, appearance*)>
      <!ELEMENT publisher (#PCDATA)>
      <!ELEMENT issue EMPTY>
         <!ATTLIST issue date     CDATA #IMPLIED>
         <!ATTLIST issue binding  CDATA #IMPLIED>
         <!ATTLIST issue printing CDATA #IMPLIED>
         <!ATTLIST issue no       CDATA #IMPLIED>
      <!ELEMENT coverp (#PCDATA)>
         <!ATTLIST coverp cur CDATA #IMPLIED>
      <!ELEMENT creator (#PCDATA)>
         <!ATTLIST creator media CDATA #IMPLIED>

      <!ELEMENT notes (#PCDATA)>

      <!ELEMENT copies (copy*, wanted?)>
         <!ELEMENT copy (signed*, tosell?)>
            <!ATTLIST copy grade CDATA #IMPLIED>
            <!ATTLIST copy value CDATA #IMPLIED>
            <!ATTLIST copy cur   CDATA #IMPLIED>
            <!ELEMENT signed (#PCDATA)>
            <!ELEMENT tosell EMPTY>
         <!ELEMENT wanted EMPTY>

      <!ELEMENT appearance (#PCDATA)>

<!ELEMENT xi:include EMPTY>
	<!ATTLIST xi:include href CDATA #IMPLIED>

You’ll notice a completely unused element, appearance. This was originally intended to be used to denote the appearance of characters, when important (in the case of first appearances, important guest appearances, etc). Frankly, this is something that, while cool, is just too much information. I can just Google it. If I really need to record it, it can go in the notes element.

The DTD probably has a few subtle bugs in determining content models, which I’ll probably investigate later.

And, of course, I was hopeful enough, at the beginning, to think this would become a markup language and gave it a shiny acronym, CBML (Comic Book Markup Language). I’m sure I’m not the only nerd to attempt this.

A small rant about MVC

When most folk talk about MVC frameworks, they often make an unspoken assumption that each part (the model, view, and controller) is equally important: effectively, the ‘three legs of the chair’. I find myself disagreeing with this assumption more and more.

To me, the model is almost always the most critically important piece of the puzzle, as it is the one thing you want to outlast the application.

O’Reilly seems to understand this. If you read the colophons for many of their books, you’ll notice that they chose DocBook as their publication/storage format. This decision was made years ago, and some of their earliest titles were written in DocBook when it was still an SGML application.

What is the major benefit to O’Reilly? Well, DocBook is technology-agnostic: it doesn’t care what you use to work with it, or what the output format is. This has allowed O’Reilly to create Safari, output their books in Mobi, ePub, and Kindle formats, syndicate via RSS/Atom, and provide camera-ready hi-res output to printers.

Now, full disclosure: I’ve never worked for O’Reilly, and I’m guessing at some of this (ok, quite a bit of it). Plus, I’m sure there is a lot of blood on the floor when they add new output formats, and change the view and controller parts of their CMS architecture. It’s never easy.

What I’m trying to get at is this: protecting your data format allows it to outlast the temporary needs of your client-base. This is probably the single most important part of managing content.

Back to the CBML show

What’s next for this format? Well, I mentioned my ‘web app’. Beyond that, a little pruning and an updated DTD.

This is something I’d like to actually store on GitHub, but most of it is just content. And, of course, somebody somewhere will take offence at something you list as owned. I don’t need that hassle.

Once I get the web app truly working, I’ll write a post about that. In the meantime, if you have any comments about the format, drop me a line.
