A rose by any other name is incorrect: a rant on character sets and encodings

I have been working with XML since SGML. I have been working with character-sets since before cuneiform was contemporary. That is, a long time. I have seen, and continue to see, characters very poorly handled. Even the term "characters" is misleading. Unicode deals in "code points", and a single character (graphic impression) can be made from one or more of them. Unicode supports both a single code point that represents an e with an acute, U+00E9, and a multiple code point representation, U+0065 followed by the combining acute U+0301. The single code point is 'é' while the multiple code points are 'é'. For most users and most web browsers there is no difference in how this character is presented, even though the underlying sequences differ.
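To make the one-versus-two code point distinction concrete, here is a small Python sketch using the standard unicodedata module. The helper name same_text is my own invention for illustration, and normalising to NFC before comparing is one reasonable policy, not the only one.

```python
import unicodedata

COMPOSED = "\u00e9"      # single code point: e with acute
DECOMPOSED = "e\u0301"   # two code points: e followed by combining acute


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalising both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# The two spellings look alike on screen but are not equal as raw sequences.
raw_equal = COMPOSED == DECOMPOSED
normalised_equal = same_text(COMPOSED, DECOMPOSED)
```

Here raw_equal is False and normalised_equal is True, which is exactly the trap: two strings that look identical to a person can fail an equality test in code.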

Now é is represented differently in different character-sets and in different character-encodings. In UTF-8 it is two bytes, 0xC3 and 0xA9. In XML it is the numbered entity "&#xE9;" or "&#233;". In HTML both of the XML forms can be used, and so can the named entity "&eacute;". In percent-encoding within a URL's path and query it is "%C3%A9". Within a domain name that uses punycode it is "xn--9ca". While these differences are not inherently problematic, it has been my experience that their combined use very much is.
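As a rough illustration, each of these representations can be produced from the same code point with Python's standard library. The function name representations and its dictionary keys are made up for this sketch.

```python
import html
from urllib.parse import quote

E_ACUTE = "\u00e9"


def representations(ch: str) -> dict:
    """Return several encodings of one code point (illustrative helper)."""
    return {
        "utf8_bytes": ch.encode("utf-8"),                     # raw UTF-8 bytes
        "hex_entity": f"&#x{ord(ch):X};",                     # XML/HTML hex numbered entity
        "dec_entity": f"&#{ord(ch)};",                        # XML/HTML decimal numbered entity
        "percent": quote(ch, safe=""),                        # URL percent-encoding of the UTF-8 bytes
        "punycode_label": ch.encode("idna").decode("ascii"),  # IDNA label for a domain name
    }


reps = representations(E_ACUTE)
# Going the other way, HTML decoding accepts the named entity as well.
from_named = html.unescape("&eacute;")
```

Five different byte sequences, one code point. None of them is wrong on its own; the trouble starts when one is used where another is expected.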

For example, if I have a URL that contains an é and I want to use it in an HTML A tag's href, how do I encode it? All of the following are usable but only the first is correct:

  • http://foo.com/bar?%C3%A9
  • http://foo.com/bar?é
  • http://foo.com/bar?&#xE9;
  • http://foo.com/bar?&eacute;
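One way to arrive at the correct form is to percent-encode the URL first and only then escape the result for the HTML attribute. The sketch below assumes the query is made of named parameters (the "q" and "page" keys are invented for the example), and href_value is a hypothetical helper, not a library function.

```python
from html import escape
from urllib.parse import urlencode


def href_value(base: str, params: dict) -> str:
    """Build a URL with a percent-encoded query, then escape it for an HTML attribute."""
    query = urlencode(params)                      # non-ASCII becomes %XX of its UTF-8 bytes
    return escape(f"{base}?{query}", quote=True)   # HTML-escape &, <, > and quotes


simple = href_value("http://foo.com/bar", {"q": "\u00e9"})
two_params = href_value("http://foo.com/bar", {"q": "\u00e9", "page": "2"})
```

The first value comes out as http://foo.com/bar?q=%C3%A9. In the second, the ampersand between the parameters is written as &amp;amp; in the attribute, which is the only entity the URL should need once it has been percent-encoded.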

And being correct matters more and more because this stuff is being handled by code. Code is written by programmers and programmers don't universally understand character-encodings and character-sets. Moreover, you have programmers at both ends of the supply chain: The writer writes it wrong and the reader reads it wrong. How many times have you seen the following?

  • An HTML named entity being used in XML.
  • An entity double encoded, e.g. &#xE9; incorrectly encoded as &amp;#xE9;.
  • A French text, for instance, with question marks oddly scattered throughout.
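All three failures are easy to reproduce, which is part of why they are so common. The following Python sketch shows one plausible cause for each; xml_accepts is a hypothetical helper, and real-world causes vary.

```python
import html
import xml.etree.ElementTree as ET


def xml_accepts(fragment: str) -> bool:
    """Return True if the fragment parses as XML (illustrative helper)."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


# 1. An HTML named entity is undefined in plain XML; the numbered form is fine.
named_ok = xml_accepts("<p>&eacute;</p>")
numbered_ok = xml_accepts("<p>&#xE9;</p>")

# 2. Escaping text that is already escaped double encodes the entity.
double_encoded = html.escape("&#xE9;")
decoded_once = html.unescape(double_encoded)   # yields the entity text, not the character

# 3. Forcing text into a character-set that lacks the character leaves question marks.
question_marks = "Fran\u00e7ais".encode("ascii", errors="replace")
```

Note that the third case is the result of "being generous": the replace error handler quietly swaps in a question mark instead of failing, and the original character is gone for good.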

I don't have a solution to the misunderstandings and misuses. My only advice is that you have some simple rules and adhere to them:

  1. Be clear that you and your supplier know the difference between a character-set and a character-encoding.
  2. Don't ever accept bad data. Being generous and accepting (and "correcting") bad data never turns out well for either you or your supplier.
  3. Only accept percent character-encoding for URLs.
  4. Only accept numbered-entity character-encoding for XML & HTML (except for the named entities &lt;, &gt;, &amp;, and &quot;).
  5. Only accept content for which there is a specified character-set. (For XML it is UTF-8 by default.) So, for HTML form elements make sure to use the accept-charset attribute. For HTTP requests make sure to require the charset parameter of the Content-Type header. If you can, only accept UTF-8, mostly because the supplier's tools will almost always do the right thing with UTF-8.
  6. Never store text in your byte-oriented repositories as anything other than UTF-8. (So, when you get a percent-encoded URL make sure to store its decoded form, not the form as supplied.)
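Rules 2, 5, and 6 translate fairly directly into code. What follows is a minimal Python sketch, assuming an HTTP-style Content-Type header and a UTF-8-only policy; the function names decode_body and decode_url_component are my own, and a real service would want more careful error reporting.

```python
from email.message import Message
from urllib.parse import unquote


def decode_body(body: bytes, content_type: str) -> str:
    """Decode a request body, refusing anything without a declared UTF-8 charset."""
    header = Message()
    header["Content-Type"] = content_type
    charset = header.get_content_charset()   # None when no charset parameter is present
    if charset is None:
        raise ValueError("no charset declared")
    if charset != "utf-8":
        raise ValueError(f"unsupported charset: {charset}")
    # Strict decoding: malformed bytes raise UnicodeDecodeError instead of being "corrected".
    return body.decode("utf-8")


def decode_url_component(component: str) -> str:
    """Percent-decode a URL component as UTF-8, rejecting invalid byte sequences."""
    return unquote(component, encoding="utf-8", errors="strict")


stored_text = decode_body(b"\xc3\xa9", "text/plain; charset=UTF-8")
stored_query = decode_url_component("%C3%A9")
```

Both values end up as the single code point U+00E9, ready to be written to storage as UTF-8. A body sent without a charset, or a "%E9" that was percent-encoded from Latin-1, is rejected rather than guessed at.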

I hope this helps.