2007-08-13

Extreme Markup 2007: Monday

This is my report on the Monday of Extreme Markup 2007.

Monday was not technically part of Extreme at all; it was a separate workshop called "International Workshop on Markup of Overlapping Structures 2007". Overlap is a special interest of mine: how to deal with marking up things like books, where you may want to preserve both the chapter/paragraph/sentence structure and chapter/page structure (or even just page structure, in books where the chapter boundary doesn't mean the beginning of a page). Bibles can add chapter/verse structure, and indeed one of the presenters was formerly with the Society of Biblical Literature.

I won't give details of the presentations: you will eventually be able to see them on the Extreme Markup site, and googling for LMNL, Trojan Horse markup, TexMECS, GODDAG structures, and rabbit/duck grammars should provide more than enough information. I will, however, say a bit about how overlap is typically represented in XML, and then about the two different kinds of (or use cases for) overlap that this workshop identified.

A variety of techniques have been used to represent overlap, but work in the field is generally converging on using XML augmented with milestone elements. A milestone is an empty element that represents the start-tag or end-tag of a virtual element that can overlap with other virtual elements or the actual elements of the document. A milestone is tagged with an attribute to show whether it is meant to be seen as a start-tag or an end-tag, and how the start- and end-tags match up. So <foo sID="xyz"/> would indicate the beginning of a logical element, and <foo eID="xyz"/> would indicate the end. The xyz has no particular significance, but must be present in exactly one start milestone and one end milestone. This technique has the advantage that the virtual element's name is an element name, potentially belonging to a namespace. Clever XSLT processing can be used to convert real elements into virtual ones or (provided no overlap of real elements results) vice versa.
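To make the milestone convention concrete, here is a minimal Python sketch that pairs sID/eID milestones and reports the stretch of text each virtual element covers. The function name milestone_ranges, the sample document, and the choice to report character offsets into the concatenated text are all assumptions made for illustration; nothing here was presented at the workshop.

```python
import xml.etree.ElementTree as ET


def milestone_ranges(xml_text):
    """Pair sID/eID milestones.

    Returns (text, ranges), where text is the document's character data and
    ranges maps each milestone id to (element name, start, end) offsets.
    """
    root = ET.fromstring(xml_text)
    parts = []    # pieces of character data, in document order
    offset = 0    # number of characters seen so far
    starts = {}   # id -> (element name, start offset) for open milestones
    ranges = {}   # id -> (element name, start offset, end offset)

    def add_text(s):
        nonlocal offset
        if s:
            parts.append(s)
            offset += len(s)

    def walk(elem):
        sid, eid = elem.get("sID"), elem.get("eID")
        if sid is not None:
            if sid in starts or sid in ranges:
                raise ValueError(f"duplicate start milestone {sid!r}")
            starts[sid] = (elem.tag, offset)
        elif eid is not None:
            if eid not in starts:
                raise ValueError(f"end milestone {eid!r} has no start")
            name, start = starts.pop(eid)
            if name != elem.tag:
                raise ValueError(f"milestone {eid!r} changes element name")
            ranges[eid] = (name, start, offset)
        add_text(elem.text)
        for child in elem:
            walk(child)
            add_text(child.tail)

    walk(root)
    if starts:
        raise ValueError(f"unclosed milestones: {sorted(starts)}")
    return "".join(parts), ranges


# A virtual <page> that starts in one real <p> and ends inside the next.
doc = ('<book><p><page sID="p1"/>First para.</p>'
       '<p>Second<page eID="p1"/> para.</p></book>')
text, ranges = milestone_ranges(doc)
# ranges == {"p1": ("page", 0, 17)} and text[0:17] == "First para.Second"
```

The real p elements and the virtual page element overlap here, yet the document stays well-formed because the page is only ever present as two empty elements.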

Overlap of the first kind descends from the old SGML CONCUR feature, and is basically about having multiple hierarchies over a single document, like the examples I gave above. Typically you want to be able to pull out each individual hierarchy as a well-formed XML expression without milestones, and then look at it or validate it or process it however you like. Often this is all that's needed for overlapping documents, and it's then a matter of deciding whether all the elements are to be milestones in the base form of the document, or whether some virtual elements (the most important ones) are left as real XML elements rather than milestones.
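Pulling out one hierarchy can be sketched as follows, assuming the milestones have already been resolved into (name, start, end) character ranges over the plain text. The function name hierarchy_to_xml and the range representation are hypothetical, and the sketch assumes every range lies within the text; it refuses any set of ranges that does not nest, which is exactly the case where a single XML tree is not enough.

```python
from xml.sax.saxutils import escape


def hierarchy_to_xml(text, ranges):
    """Serialize properly nested (name, start, end) ranges as real XML elements."""
    out = []     # output fragments
    stack = []   # currently open elements as (name, end)
    pos = 0      # how much of text has been written so far

    def close_until(limit):
        # Close every open element that ends at or before limit.
        nonlocal pos
        while stack and stack[-1][1] <= limit:
            name, end = stack.pop()
            out.append(escape(text[pos:end]))
            pos = end
            out.append(f"</{name}>")

    # Outer ranges sort before inner ranges that start at the same offset.
    for name, start, end in sorted(ranges, key=lambda r: (r[1], -r[2])):
        close_until(start)
        if stack and end > stack[-1][1]:
            raise ValueError(f"{name} overlaps {stack[-1][0]}; not one hierarchy")
        out.append(escape(text[pos:start]))
        pos = start
        out.append(f"<{name}>")
        stack.append((name, end))

    close_until(len(text))
    out.append(escape(text[pos:]))
    return "".join(out)


text = "First para.Second para."
paragraphs = [("book", 0, 23), ("p", 0, 11), ("p", 11, 23)]
print(hierarchy_to_xml(text, paragraphs))
# prints <book><p>First para.</p><p>Second para.</p></book>
```

Adding a range such as ("page", 0, 17) to the paragraph hierarchy raises ValueError, because the page ends in the middle of the second paragraph. That range would have to go into a separate hierarchy or stay a milestone.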

Overlap of the second kind cares about ranges rather than elements, interval arithmetic (and computational geometry) rather than trees and DAGs, inclusion rather than dominance. It's the kind that most interests me, and LMNL is in my highly biased opinion the state of the art here. If you want dominance information, you can mark it up using annotations (LMNL's broadened version of attributes), or just infer it from what ranges include what, but you aren't assumed to care about it: basic LMNL just has a simple list of ranges over text.
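The range view can be illustrated with a few lines of Python. This is only a sketch of the idea of a flat list of ranges with inclusion inferred from offsets; the Range type and the includes and overlaps helpers are invented for this example and are not LMNL's data model or any LMNL API.

```python
from typing import NamedTuple


class Range(NamedTuple):
    name: str
    start: int  # offset of the first character covered
    end: int    # offset just past the last character covered


def includes(outer: Range, inner: Range) -> bool:
    """True when inner lies entirely within outer."""
    return outer.start <= inner.start and inner.end <= outer.end


def overlaps(a: Range, b: Range) -> bool:
    """True when a and b share text but neither includes the other."""
    shares_text = a.start < b.end and b.start < a.end
    return shares_text and not includes(a, b) and not includes(b, a)


# A flat list of ranges over "First para.Second para."; no tree is stored.
first_p = Range("p", 0, 11)
second_p = Range("p", 11, 23)
page = Range("page", 0, 17)

print(includes(page, first_p))   # True: the page contains the first paragraph
print(overlaps(page, second_p))  # True: the page ends inside the second paragraph
print(overlaps(first_p, second_p))  # False: adjacent ranges share no text
```

Nothing in the list says that the page dominates the first paragraph; that relationship is recovered, if anyone wants it, purely by comparing offsets.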

From 2:00 PM onward there was a kind of Quaker meeting, with people going to the mike and talking for 5 minutes. Originally there were speakers and audience members asking questions, but very quickly the discussion became general, as signaled by the audience mike being moved to the front of the room. (When you are in this situation, always use the mike: people with hearing aids can't hear you no matter how loud you yell.) Nothing new emerged here, but participants got to understand one another's viewpoints and systems pretty well. Extreme probably will do this workshop again in a few years, and some other kind of workshop (XML data binding was suggested) next year.
