2008-02-07

Which characters are excluded in XML 5th Edition names?

The list of allowed name characters in the XML 1.0 Fifth Edition looks pretty miscellaneous. The clue to what's really going on is that unlike the rule of earlier XML 1.0 versions, where everything not permitted was forbidden, now everything that is not forbidden is permitted. (I emphasize that this is only about name characters: every character is and always has been permitted in running text and attribute values except the ASCII controls.)

So what's forbidden, and why?

  • The ASCII control characters and their 8-bit counterparts. Obviously.
  • The ASCII and Latin-1 symbolic characters, with the exceptions of hyphen, period, colon, underscore, and middle dot, which have always been permitted in XML names. These characters are commonly used as syntax delimiters either in XML itself or in other languages, and so are excluded.
  • The Greek question mark, which looks like a semicolon and is canonically equivalent to a regular semicolon.
  • The General Punctuation block of Unicode, with the exceptions of the zero-width joiner, zero-width non-joiner, undertie, and character-tie characters, which are required in certain languages to spell words correctly. Various kinds of blank spaces and assorted punctuation don't make sense in names.
  • The various Unicode symbols blocks reserved for "pattern syntax", from U+2190 to U+2BFF. These characters should never appear in identifiers of any sort, as they are reserved for use as syntactic delimiters in future languages that exploit non-ASCII syntax. Many are assigned, some are not.
  • The Ideographic Description Characters block, which is used to describe (not create) uncoded Chinese characters.
  • The surrogate code units (which don't correspond to Unicode characters anyhow) and private-use characters. Using the latter, in names or otherwise, is very bad for interoperability.
  • The Plane 0 non-characters at U+FDD0 to U+FDEF, U+FFFE, and U+FFFF. The non-characters on the other planes are allowed, not because they are a good idea, but to simplify implementation.

Note that the hyphen, period, middle dot, undertie, and character tie, the European digits 0-9, and the diacritics in the Combining Diacritical Marks block (U+0300 to U+036F) are permitted in names but not at the start of a name. Other characters could have sensibly been excluded, particularly combining characters that don't happen to be in that block, but it simplifies implementation to permit them.
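For the curious, here is a minimal sketch in Python of a checker for the rules described above. The code point ranges are transcribed from the NameStartChar and NameChar productions of the Fifth Edition; the function names (`is_name_start_char`, `is_name_char`, `is_name`) are my own inventions for illustration and are not part of any XML library. Treat it as a sketch, not a conformance test.

```python
# Ranges transcribed from the NameStartChar production (XML 1.0 Fifth Edition).
NAME_START_RANGES = [
    (0x3A, 0x3A),                    # colon
    (0x41, 0x5A),                    # A-Z
    (0x5F, 0x5F),                    # underscore
    (0x61, 0x7A),                    # a-z
    (0xC0, 0xD6), (0xD8, 0xF6),      # Latin-1 letters, skipping the multiplication
    (0xF8, 0x2FF),                   #   sign (U+00D7) and division sign (U+00F7)
    (0x370, 0x37D), (0x37F, 0x1FFF), # skips the Greek question mark (U+037E)
    (0x200C, 0x200D),                # zero-width non-joiner and joiner
    (0x2070, 0x218F),                # stops before the pattern-syntax symbols
    (0x2C00, 0x2FEF),                # stops before Ideographic Description Characters
    (0x3001, 0xD7FF),                # stops before the surrogates and private use
    (0xF900, 0xFDCF), (0xFDF0, 0xFFFD),  # skips the Plane 0 non-characters
    (0x10000, 0xEFFFF),              # other planes, non-characters included
]

# Additional ranges allowed by NameChar, but not as the first character.
NAME_EXTRA_RANGES = [
    (0x2D, 0x2E),      # hyphen and period
    (0x30, 0x39),      # European digits 0-9
    (0xB7, 0xB7),      # middle dot
    (0x300, 0x36F),    # combining diacritical marks
    (0x203F, 0x2040),  # undertie and character tie
]


def _in_ranges(code_point, ranges):
    """Return True if code_point falls inside any inclusive (low, high) pair."""
    return any(low <= code_point <= high for low, high in ranges)


def is_name_start_char(ch):
    """True if the single character ch may begin an XML name."""
    return _in_ranges(ord(ch), NAME_START_RANGES)


def is_name_char(ch):
    """True if the single character ch may appear anywhere after the start."""
    cp = ord(ch)
    return _in_ranges(cp, NAME_START_RANGES) or _in_ranges(cp, NAME_EXTRA_RANGES)


def is_name(s):
    """True if s is non-empty, starts with a start char, and continues with name chars."""
    return (
        len(s) > 0
        and is_name_start_char(s[0])
        and all(is_name_char(ch) for ch in s[1:])
    )
```

With this sketch, `is_name("élément")` and a name written in Ethiopic script both pass, while `is_name("1abc")`, `is_name("a;b")`, and a name containing the Greek question mark (U+037E) are all rejected. Note how short the tables are: everything not listed as a gap is permitted.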

This list is intentionally sparse. The new Appendix J gives a simplified set of non-binding suggestions for choosing names that are actually sensible.

2008-02-06

Who do I work for?

Well, a company that provides an email service with about 10^7 users, and a calendar service with about 10^6 users, and a news syndicate with about 10^4 sources, and a video sharing facility that displays about 10^8 video views a day, and an image index with about 10^9 images. And it connects about 10^5 advertisers with about 10^5 online publishers and 10^3 offline ones, and provides online wallets for about 10^6 buyers and 10^5 sellers, and is localized in about 10^2 interface languages, and employs about 10^4 people, and is rated 10^0 in the list of best companies to work for. And it is not best known for any of these things.

Who are they?

10^100.

Justice at last, part two

The Fifth Edition of XML 1.0 is now a Proposed Edited Recommendation.

So what, you say. Ho hum, you say. A bunch of errata folded in to a new edition, you say. No real change here, you say.

But no, not at all, but quite otherwise. There's a big change here, assuming this PER gets past the W3C membership vote and becomes a full W3C Recommendation. There's something happening here, and what it is is eminently clear.

Justice is coming at last to XML 1.0.

For a long time, the characters used in the markup of an XML document -- element names, attribute names, processing instruction targets, and so on -- have been limited to those that were allowed in Unicode 2.0, which was issued in July 1996. If you wanted your element names in English, or French, or Arabic, or Hindi, or Mandarin Chinese, all was good. But if you wanted them in the national languages of Sri Lanka, or Eritrea, or Cambodia, or in Cantonese Chinese, to say nothing of lots and lots of minority languages, you were simply out of luck -- forever.

Not fair, people.

I tried fixing this the right way, by pushing the XML Core WG of the W3C to issue XML 1.1. It acquired some additional cruft along the way, some good, some in hindsight bad. It was roundly booed and even more roundly ignored. In particular, at least one 800-pound gorilla voted against it at W3C and refused to implement it.

Now it's being done the wrong way. We are simply extending the set of legal name characters to almost every Unicode character, relying on document authors and schema authors not to be idiots about it. Is that an incompatible change to XML 1.0 well-formedness? Hell yes. Is any existing XML 1.0 document going to become not well-formed? Hell no. We learned our lesson on that one.

Who supports this? I won't name names, but XML parser authors and distributors from gorillas to gibbons have been consulted in advance this time, and there are no screaming objections. Some will probably provide an option to turn Fifth Edition support on, others will turn it on by default. Unlike XML 1.1 support, this is actually a simplification: the big table of legal characters in Appendix B just isn't needed any more.

"Hot diggity (or however you say that in Amharic). When can I start using this?" Not so fast. First the W3C has to vote it in -- if they don't, all bets are off. Then implementations have to spread through the XML ecosystem, including not only development but deployment. It'll take years. But it only has to be done once, because the writing systems that aren't in Unicode yet will all Just Work when they do get encoded.

Ask not what you can do for XML, but what XML can do for you.

It's morning in the world.

(Oh yes: Send comments before 16 May 2008 to xml-editor@w3.org.)

Justice at last

There was an old man from Nantucket
Who kept all his cash in a bucket.
   But his daughter Nan
   Ran away with a man
And as for the bucket, Nantucket.

The pair of them went to Manhasset,
The man and his Nan with the asset.
   Pa followed them there
   But they left in a tear
And as for the asset, Manhasset.

He followed them next to Pawtucket,
Nan and her man and the bucket.
   Pa said to the man,
   "You can keep my sweet Nan",
But as for the bucket, Pawtucket.

(This works best if you pronounce "Pa" as "paw", assuming you make any difference between the two -- in New England, there definitely is. If your "aw" sounds like "ah", you can hear the "aw" sound the rest of us use by saying "Awwwwwwwwww!")

Here's the trio's route:
   note the doubling back
      to avoid pursuit.
