2005-06-15

URIs, URNs, and SGML/XML public identifiers

When I was helping to develop the spec for URIs beginning urn:publicid:, which are the URI equivalent of SGML and XML public identifiers (formal or not), I worked out this table of what is and is not legal according to publicid syntax, URN syntax, and URI syntax. I'm referencing an older RFC for URI syntax rather than the current one, because the older one is what I used and some terminology has since changed.

Character Name(s)       Pubid  URN       URI       Status
LATIN CAPITAL LETTER ?  yes    upper     upalpha   NORM
LATIN SMALL LETTER ?    yes    lower     lowalpha  NORM
DIGIT *                 yes    number    digit     NORM
HYPHEN-MINUS            yes    other     mark      NORM
LEFT PARENTHESIS        yes    other     mark      NORM
RIGHT PARENTHESIS       yes    other     mark      NORM
FULL STOP               yes    other     mark      NORM
EXCLAMATION MARK        yes    other     mark      NORM
ASTERISK                yes    other     mark      NORM
LOW LINE                yes    other     mark      NORM
PLUS SIGN               yes    other     reserved  AVAIL
COMMA                   yes    other     reserved  AVAIL
COLON                   yes    other     reserved  AVAIL
EQUALS SIGN             yes    other     reserved  AVAIL
SEMICOLON               yes    other     reserved  AVAIL
COMMERCIAL AT           yes    other     reserved  AVAIL
DOLLAR SIGN             yes    other     reserved  AVAIL
QUESTION MARK           yes    reserved  reserved  ENCODE
SOLIDUS                 yes    reserved  reserved  ENCODE
NUMBER SIGN             yes    reserved  delims    ENCODE
PERCENT SIGN            yes    reserved  delims    ENCODE
SPACE                   yes    excluded  space     ENCODE
APOSTROPHE              yes    excluded  mark      ENCODE
AMPERSAND               no     excluded  reserved  AVAIL
TILDE                   no     excluded  mark      NULL
REVERSE SOLIDUS         no     excluded  delims    NULL
QUOTATION MARK          no     excluded  delims    NULL
LESS-THAN SIGN          no     excluded  delims    NULL
GREATER-THAN SIGN       no     excluded  delims    NULL
LEFT SQUARE BRACKET     no     excluded  unwise    NULL
RIGHT SQUARE BRACKET    no     excluded  unwise    NULL
CIRCUMFLEX ACCENT       no     excluded  unwise    NULL
GRAVE ACCENT            no     excluded  unwise    NULL
LEFT CURLY BRACKET      no     excluded  unwise    NULL
VERTICAL LINE           no     excluded  unwise    NULL
RIGHT CURLY BRACKET     no     excluded  unwise    NULL

What the codewords in the table mean:

URN
    upper, lower, number, other: MAY be used without %-encoding.
    reserved: SHOULD NOT be used without %-encoding.
    excluded: MUST NOT be used without %-encoding.
URI
    lowalpha, upalpha, digit, mark: MAY be used without %-encoding; %-encoding MUST NOT affect semantics.
    reserved: MAY be used without %-encoding; %-encoding MAY affect semantics.
    space, delims, unwise: MUST NOT be used without %-encoding.
Status
    NORM: No encoding needed, can't be used as syntax.
    ENCODE: MUST be encoded (%-encoded or privately).
    AVAIL: Available for use as syntax character if literal use is %-encoded (AMPERSAND has no literal use).
    NULL: Not usable in pubids, included for completeness.
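
To make the Status column concrete, here is a minimal Python sketch of how the codewords could drive an encoder. The function names (status, encode_literals) and the syntax_chars parameter are my own illustrations, not part of any spec; the character sets are simply transcribed from the table, and this is not the actual urn:publicid: transcription algorithm.

```python
# Status sets transcribed from the table (ASCII letters and digits are NORM too).
NORM = set("-().!*_")
AVAIL = set("+,:=;@$&")
ENCODE = set("?/#% '")
NULL = set('~\\"<>[]^`{|}')


def status(ch: str) -> str:
    """Return the table's Status codeword for one character."""
    if ch.isascii() and ch.isalnum():
        return "NORM"
    for name, chars in (("NORM", NORM), ("AVAIL", AVAIL),
                        ("ENCODE", ENCODE), ("NULL", NULL)):
        if ch in chars:
            return name
    raise ValueError(f"character not covered by the table: {ch!r}")


def encode_literals(text: str, syntax_chars: str = "") -> str:
    """%-encode ENCODE characters, plus literal uses of any AVAIL
    characters that the caller has claimed as syntax (hypothetical choice)."""
    if not set(syntax_chars) <= AVAIL:
        raise ValueError("only AVAIL characters may be used as syntax")
    out = []
    for ch in text:
        st = status(ch)
        # NULL characters and AMPERSAND are not legal in a public identifier.
        if st == "NULL" or ch == "&":
            raise ValueError(f"{ch!r} is not usable in a public identifier")
        if st == "ENCODE" or ch in syntax_chars:
            out.append(f"%{ord(ch):02X}")
        else:
            out.append(ch)
    return "".join(out)


print(encode_literals("ISO/IEC 10744:1997"))                    # ISO%2FIEC%2010744:1997
print(encode_literals("ISO/IEC 10744:1997", syntax_chars=":"))  # ISO%2FIEC%2010744%3A1997
```

In the second call the colon has been claimed as a syntax character, so its literal occurrence has to be %-encoded, which is exactly the condition attached to AVAIL.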
