2010-12-23

MicroRNG

This is a contribution to the MicroXML conversation. It's a stripped-down version of RELAX NG suitable for validating MicroXML documents. It excludes namespaces, since MicroXML doesn't have them either. Somewhat reluctantly, I have also jettisoned all simple types but a few and all value types except the default, since I figure that MicroXML will mostly be used by applications that need to validate string values in more complicated ways anyhow.

Generalized interleave has a high implementation cost, so I've removed it as well, except for mixed content, which I consider essential. Finally, I've ditched lists, datatype libraries other than stripped-down XSD, foreign markup, name classes, nested grammars, external file inclusion, the notAllowed pattern, divs (which are just for documentation), and definition combining methods.

Here's what's left, in the form of a RELAX NG grammar in compact syntax. When translated to XML format, this is also a MicroRNG grammar (modulo namespace issues).

start = elementElem | grammarElem

grammarElem = element grammar {startElem, defineElem*}

startElem = element start {elementElem | refElem | element choice {(startElem | refElem)+ } }

defineElem = element define {attribute name {text}, pattern+}

pattern = elementElem | textElem | mixedElem | attributeElem | valueElem | groupElem | choiceElem | optionalElem | zeroOrMoreElem | oneOrMoreElem | refElem | dataElem

elementElem = element element {attribute name {text}, (emptyElem | pattern+)}

emptyElem = element empty {empty}

textElem = element text {empty}

mixedElem = element mixed {pattern+}

attributeElem = element attribute {attribute name {text}, (valueElem | textElem)?}

valueElem = element value {text}

groupElem = element group {pattern+}

choiceElem = element choice {pattern+}

optionalElem = element optional {pattern+}

oneOrMoreElem = element oneOrMore {pattern+}

zeroOrMoreElem = element zeroOrMore {pattern+}

refElem = element ref {attribute name {text}}

dataElem = element data {"string" | "decimal" | "double" | "integer" | "date" | "dateTime" | "boolean" | "base64Binary"}
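
To make the grammar a little more concrete, here is a rough Python sketch that parses a small schema written in the XML form and reports any element names the grammar above does not allow. The address-book vocabulary (`addressBook`, `card`, `name`, `email`) and the helper name `unknown_elements` are made up for illustration, and this is only a vocabulary check, not a validator for the full grammar.

```python
import xml.etree.ElementTree as ET

# Element names that appear in the MicroRNG grammar above.
MICRORNG_ELEMENTS = {
    "grammar", "start", "define", "element", "empty", "text", "mixed",
    "attribute", "value", "group", "choice", "optional", "zeroOrMore",
    "oneOrMore", "ref", "data",
}

# Hypothetical example schema; the vocabulary is invented for this sketch.
SCHEMA = """
<grammar>
  <start><ref name="addressBook"/></start>
  <define name="addressBook">
    <element name="addressBook">
      <zeroOrMore><ref name="card"/></zeroOrMore>
    </element>
  </define>
  <define name="card">
    <element name="card">
      <attribute name="kind"><value>personal</value></attribute>
      <element name="name"><text/></element>
      <optional><element name="email"><text/></element></optional>
    </element>
  </define>
</grammar>
"""

def unknown_elements(schema_xml):
    """Return the sorted list of element names not in the MicroRNG vocabulary."""
    root = ET.fromstring(schema_xml)
    return sorted({el.tag for el in root.iter() if el.tag not in MICRORNG_ELEMENTS})

if __name__ == "__main__":
    print(unknown_elements(SCHEMA))
```

A schema that uses something dropped from full RELAX NG, such as `interleave`, would show up in the returned list.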

In addition, MicroRNG allows only a single element element for any given element name (that is, no more than one definition of an element), even though that reduces the convenience of RNG definitions to that of their DTD equivalents. There are other possible simplifications, like disallowing an element element as the root, or removing zeroOrMore elements in favor of optional elements wrapped around oneOrMore elements, but I judge them to be more annoying to schema authors than helpful to implementers.
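
Reading the restriction above as "at most one element pattern per element name in the whole schema", an implementation could check it in a few lines. The following is a minimal Python sketch under that assumption; the function name `duplicate_element_names` is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def duplicate_element_names(schema_xml):
    """Return the sorted names that are defined by more than one <element> pattern."""
    root = ET.fromstring(schema_xml)
    # root.iter("element") also visits the root itself when the schema
    # is a bare <element> rather than a <grammar>.
    counts = Counter(
        el.get("name")
        for el in root.iter("element")
        if el.get("name") is not None
    )
    return sorted(name for name, n in counts.items() if n > 1)

if __name__ == "__main__":
    # Assumed example: "b" is defined twice, once in each branch of a choice.
    schema = (
        '<element name="a"><choice>'
        '<element name="b"><text/></element>'
        '<element name="b"><empty/></element>'
        '</choice></element>'
    )
    print(duplicate_element_names(schema))
```

An empty result means the schema satisfies the restriction as interpreted here.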

Comments are gratefully solicited either here or at James Clark's blog.

5 comments:

Anonymous said...

Why no xml:id John?
Was it just simplicity?
no id,idref?

JC included it?


DaveP

James Clark said...

To make this really straightforward to implement, I would suggest that, for any name, a schema must contain at most one element pattern with that name (were you suggesting this restriction or something slightly different?). This reduces pattern matching to a regex-level of complexity.

Also start should be able to be a ref.

I wonder about losing enumerations. Matching a specific string with or without whitespace ignoring is a little bit different from general datatyping, and the value element (with a type of either string or token) makes sense without the rest of the datatyping facilities.

John Cowan said...

DaveP: RNG's ID/IDREF support is only for DTD compatibility. I suppose there could be a hardwired check for the uniqueness of "xml:id" values, but I expect people will use an "id" attribute just as often, maybe more so.

James: That's the further restriction I had in mind, but didn't actually state; I'll fix that up. I'll also fix up start as a ref.

I figure that for enumerations the code will contain a case statement matching the character content or attribute value against the strings that the application understands. Validating the enumerated values would just let you drop the else-clause of the case, which seems a very small benefit to me.

James Clark said...

@John

The key reason for allowing value is that it is common for the presence of an attribute with a specific value to affect what attributes and child elements are allowed. For this to work, you also need to be able to match an attribute that does match an enumerated set.

John Cowan said...

Fair enough. I was thinking that with only one declaration per element name allowed, such attribute declarations didn't make sense, but they actually do if you have a choice between two content models within the definition.

So I've added value patterns, and allowed either a value pattern or a text pattern (or neither, which means a text pattern) as the content of an attribute pattern.