2007-08-14

Extreme Markup 2007: Friday

This is a report on Extreme Markup 2007 for Friday.

Friday was a half-day, but Extreme 2007 saved the best of all its many excellent papers for almost the last. Why is it that we repeat the "content over presentation" mantra so incessantly, but throw up our hands when it comes to tables? All the standard table models -- HTML, TEI, CALS -- are entirely presentational: a table is a sequence of rows, each of which is a sequence of cells. There are ways to give each column a label, but row labels are just the leftmost cell, perhaps identified presentationally but certainly not in a semantic way. If you have a table of sales by region in various years, pulling out North American sales for 2005 with XPath is no trivial matter. Why must this be? Inquiring minds (notably David Birnbaum's) want to know.

David's proposal, really more of a meta-proposal, is to use semantic markup throughout. Mark up the table using an element expressing the sort of data, perhaps salesdata. Provide child elements naming the regions and years, and grandchildren showing the legal values of each. Then each cell in the table is simply a sales element whose value is the sales data, and whose attributes signify the region and year that it is data for. That's easy to query, and not particularly hard to XSLT into HTML either. (First use of the verb to XSLT? Probably not.)
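
As a rough sketch of what such markup and a query against it might look like (the element and attribute names, the figures, and the helper functions are all invented here for illustration and are not taken from David's paper), here is a small Python example that uses the limited XPath support in the standard library:

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the names salesdata, regions, years, sales, region
# and year, and every figure below, are invented for illustration.
DOC = """
<salesdata>
  <regions>
    <region>North America</region>
    <region>Europe</region>
  </regions>
  <years>
    <year>2005</year>
    <year>2006</year>
  </years>
  <sales region="North America" year="2005">120</sales>
  <sales region="North America" year="2006">135</sales>
  <sales region="Europe" year="2005">90</sales>
  <sales region="Europe" year="2006">110</sales>
</salesdata>
"""

root = ET.fromstring(DOC)


def lookup(root, region, year):
    """Return the text of the sales cell for one region and year, or None."""
    cell = root.find(f"sales[@region='{region}'][@year='{year}']")
    return None if cell is None else cell.text


def pivot(root):
    """Rebuild a presentational grid: a header row, then one row per region."""
    regions = [r.text for r in root.findall("regions/region")]
    years = [y.text for y in root.findall("years/year")]
    rows = [[""] + years]
    for region in regions:
        rows.append([region] + [lookup(root, region, y) for y in years])
    return rows


print(lookup(root, "North America", "2005"))  # prints 120
for row in pivot(root):
    print(row)
```

The query names the region and year directly instead of counting rows and columns, and the `pivot` helper stands in for the step that an XSLT stylesheet would perform when producing an HTML table.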

Of course, there is no reason to stop at two categories. You can have a cube, or a hypercube, or as many dimensions as you want. The OLAP community knows all about the n-cubes you get from data mining, and the Lotus Improv tradition, now proudly carried on at quantrix.com (insert plug from highly satisfied customer here), has always been about presenting such n-cubes cleverly as spreadsheets.
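
To illustrate why extra dimensions cost nothing in this style, here is a minimal sketch that assumes a hypothetical third attribute, product, on each sales cell. The data and the `select` helper are invented; the point is only that one function can slice along any combination of dimensions.

```python
import xml.etree.ElementTree as ET

# Hypothetical three-dimensional data: region x year x product.
# All names and figures are invented for illustration.
CUBE = """
<salesdata>
  <sales region="North America" year="2005" product="widgets">70</sales>
  <sales region="North America" year="2005" product="gadgets">50</sales>
  <sales region="North America" year="2006" product="widgets">80</sales>
  <sales region="Europe" year="2005" product="widgets">40</sales>
  <sales region="Europe" year="2005" product="gadgets">50</sales>
</salesdata>
"""

cube = ET.fromstring(CUBE)


def select(root, **coords):
    """Return the values of all cells matching every given coordinate.

    Dimensions that are not mentioned are left free, so the result is a
    slice of the cube rather than a single cell.
    """
    return [
        int(cell.text)
        for cell in root.iter("sales")
        if all(cell.get(dim) == value for dim, value in coords.items())
    ]


# Fix two dimensions and sum over the third (product).
print(sum(select(cube, region="North America", year="2005")))  # prints 120
# Fix one dimension and leave the other two free.
print(select(cube, product="widgets"))  # prints [70, 80, 40]
```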

The conference proper ended in plenary session with work by Henry Thompson of W3C that extends my own TagSoup project, which provides a stream of SAX events based on nasty, ugly HTML, to have a different back end. TagSoup is divided into a lexical scanner for HTML, which uses a state machine driven by a private markup language, and a rectifier, which is responsible for outputting the right SAX events in the right order. It's a constraint on TagSoup that although it can retain tags, it always pushes character content through right away, so it is streaming (modulo the possibility of a very large start-tag).

Henry's PyXup replaces the TagSoup rectifier, which uses another small markup language specifying the characteristics of elements, with his own rectifier using a pattern-action language. In TagSoup, you can say what kind of element an element is and something about its content model, but you can't specify the actions to be taken without modifying the code. (The same is technically true of the lexical scanner, but the list of existing actions is pretty generic.) In PyXup, you can specify actions to take, some of which take arguments which can be either constants or parts of the input stream matched by the associated pattern. This is a huge win, and I wish I'd thought of it when designing TagSoup. The question period involved both Henry and me giving answers to lots of the questions.
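
For readers who have not met the pattern-action style, here is a toy Python sketch of the general idea. It is emphatically not PyXup's rule language or TagSoup's code: the token format, the rule tuples, the `Group` marker and the action names are all hypothetical, and the sketch only rewrites tags without attempting the tag balancing a real rectifier does.

```python
import re

# Toy tokens: ("start", name), ("end", name) or ("text", data).
# Everything below is a hypothetical illustration, not the PyXup or
# TagSoup rule format.


class Group:
    """Marks an argument taken from the pattern match, not a constant."""

    def __init__(self, index):
        self.index = index


def emit_start(name):
    return [("start", name)]


def emit_end(name):
    return [("end", name)]


def drop():
    return []


# Each rule: (token kind, pattern for the token value, action, arguments).
RULES = [
    ("start", re.compile(r"b"), emit_start, ["strong"]),     # constant argument
    ("end", re.compile(r"b"), emit_end, ["strong"]),
    ("start", re.compile(r"x-(.+)"), emit_start, [Group(1)]),  # matched input
    ("end", re.compile(r"x-(.+)"), emit_end, [Group(1)]),
    ("start", re.compile(r"blink"), drop, []),
    ("end", re.compile(r"blink"), drop, []),
]


def rewrite(tokens, rules):
    """Apply the first matching rule to each token, one token at a time."""
    for kind, value in tokens:
        for rule_kind, pattern, action, args in rules:
            match = pattern.fullmatch(value) if rule_kind == kind else None
            if match:
                resolved = [
                    match.group(a.index) if isinstance(a, Group) else a
                    for a in args
                ]
                yield from action(*resolved)
                break
        else:
            yield (kind, value)  # no rule matched: pass the token through


tokens = [
    ("start", "b"), ("text", "hi"), ("end", "b"),
    ("start", "blink"), ("start", "x-note"), ("end", "x-note"), ("end", "blink"),
]
print(list(rewrite(tokens, RULES)))
```

Because the rules live in a data table rather than in the code, adding a new behaviour means adding a row, which is the property the paragraph above describes. The generator also handles each token as it arrives, so text is never held back.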

To wrap up we had, as we always have (it's a tradition), a talk by Michael Sperberg-McQueen. These talks are not recorded, nor are any notes or slides published, still less a full paper. You just hafta be there to appreciate the full (comedic and serious) impact. The title of this one was "Topic maps, RDF, and mushroom lasagne"; the relevance of the last item was that if you provide two lasagnas labeled "Meat" and "Vegetarian", most people avoid the latter, but if it's labeled "Mushroom" (or some other non-privative word) instead, it tends to get eaten about as much as the meat item. The talk was, if I have to nutshell it, about the importance of doing things rather than fighting about how to do them. At least that was my take-away; probably everyone in the room had a different view of the real meaning.

That's all, folks. Hope to see you in Montreal next August at Extreme Markup 2008.

6 comments:

Unknown said...

Thanks John.

Good clear summary.

Anonymous said...

Actually, John, several of CMSMcQ's splendiferous closing keynotes in years past were laboriously transcribed and included in the Extreme Proceedings. The hard part is telling the difference between them and the numerous other conference papers Michael has produced.

Not that this has any bearing on your statement about haffing ta be there.

Anonymous said...

You say "Hope to see you in Montreal next August at Extreme Markup 2008." I hope to see you and a hotel-full of markup geeks next August in Montreal at Balisage: The Markup Conference.

John Cowan said...

Hurrah! Thanks, Anon. (You misspelled the link: it's http://www.balisage.net .)

Len Bullard said...

When working with the USAMICOM IADS product, I semantically tagged all of the tables (early 1990s), specifically the RPSTLs (repair/parts) lists. I even published a paper on using this with frames (the predecessor of divs) that Yuri presented.

Why?

1. We had a stylesheet and putting style in the SGML was ugly.

2. It was easier to see the errors when eyeball parsing.

3. It was easier for the loggies to check the data by searching for tag names.

4. The frames made it easy to divvy it up into logical chunks.

It all comes back around if you wait. The HTML browser was a tremendous throwback, as most page-metaphor systems are. Holy Scroller Forever!

Len Bullard said...

http://xml.coverpages.org/sgml92.html

It's in there. These papers are a decent read for those who want to look forward by looking backward.

I still think a fun app to have would be a semantic tracer that finds the first mentions of certain ideas and then graphs their development over time (find out if independent development really is a good metric as TimBL claims).