Now é is represented differently in different character-sets and in different character-encodings. For UTF-8 it is represented as two bytes, 0xC3 and 0xA9. For XML it is represented as the numbered entity "&#233;" or "&#xE9;". For HTML both the XML encodings can be used, and so can the named entity "&eacute;". For percent-encoding within a URL's path and query it is represented as "%C3%A9". Within a domain name that uses punycode it is "xn--9ca". While these differences are not inherently problematic, it has been my experience that their combined use very much is.
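To make the list of representations concrete, here is a minimal Python sketch that derives each one from the single character é. The function name `representations` and the dictionary keys are my own, chosen for illustration; only standard-library calls are used.

```python
import html.entities
from urllib.parse import quote


def representations(ch: str) -> dict:
    """Return the common encodings of a single character (illustrative helper)."""
    code_point = ord(ch)
    utf8 = ch.encode("utf-8")
    return {
        # Raw UTF-8 bytes, shown as hex pairs.
        "utf8_bytes": " ".join(f"0x{b:02X}" for b in utf8),
        # XML/HTML numbered entities, decimal and hexadecimal.
        "decimal_entity": f"&#{code_point};",
        "hex_entity": f"&#x{code_point:X};",
        # HTML-only named entity, looked up from the standard table.
        "named_entity": f"&{html.entities.codepoint2name[code_point]};",
        # Percent-encoding of the UTF-8 bytes, for URL paths and queries.
        "percent": quote(ch),
        # Punycode label with the "xn--" prefix used in domain names.
        "punycode": "xn--" + ch.encode("punycode").decode("ascii"),
    }


for name, value in representations("\u00e9").items():
    print(f"{name:15} {value}")
```

Every value comes from the same code point, U+00E9. Only the encoding layer changes, which is why mixing the layers goes wrong so easily.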
For example, if I have a URL that contains an é and I want to use it in an HTML A tag's href, how do I encode it? All of the following are usable but only the first is correct:
- An HTML named entity being used in XML.
- An entity double-encoded, e.g. é incorrectly encoded as &amp;eacute;.
- A French text, for instance, with question marks oddly scattered throughout.
- Be clear that you and your supplier know the difference between a character-set and a character-encoding.
- Don't ever accept bad data. Being generous and accepting (and "correcting") bad data never turns out well for either you or your supplier.
- Only accept percent character-encoding for URLs.
- Only accept numbered entities for XML & HTML (except for &lt;, &gt;, &amp;, and &quot;).
- Only accept content for which there is a specified character-set. (For XML it is UTF-8 by default.) So, for HTML form elements make sure to use the accept-charset attribute. For HTTP requests make sure to require the charset parameter of the Content-Type header. If you can, only accept UTF-8 -- mostly because the supplier's tools will almost always do the right thing with UTF-8.
- Never store text in your byte-oriented repositories in any character-encoding other than UTF-8. (So, when you get a percent-encoded URL, make sure to store it not as supplied but in its decoded form.)
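
Several of the rules above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function names `require_utf8_charset`, `decode_body` and `decode_url_path` are hypothetical, and a real service would need to decide how to report the rejection to the supplier. The point is that bad input raises an error instead of being silently "corrected".

```python
from email.message import Message
from urllib.parse import unquote_to_bytes


def require_utf8_charset(content_type: str) -> None:
    """Reject a Content-Type header that does not declare charset=utf-8."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()  # lowercased, or None when absent
    if charset != "utf-8":
        raise ValueError(f"charset must be utf-8, got {charset!r}")


def decode_body(body: bytes, content_type: str) -> str:
    """Decode a request body strictly; malformed bytes raise UnicodeDecodeError."""
    require_utf8_charset(content_type)
    return body.decode("utf-8", errors="strict")


def decode_url_path(path: str) -> str:
    """Percent-decode a URL path and insist that the result is valid UTF-8."""
    # unquote_to_bytes leaves the decision about the character-set to us,
    # so an invalid sequence is rejected rather than replaced.
    return unquote_to_bytes(path).decode("utf-8", errors="strict")


print(decode_url_path("/caf%C3%A9"))  # stored as its decoding, in UTF-8
try:
    decode_url_path("/caf%E9")  # Latin-1 byte, not valid UTF-8
except UnicodeDecodeError as err:
    print("rejected:", err.reason)
```

Strict decoding is the deliberate choice here. Python's `urllib.parse.unquote` defaults to `errors="replace"`, which is exactly the kind of generosity that leaves replacement characters scattered through stored text.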