A rose by any other name is incorrect: a rant on character sets and encodings

I have been working with XML since SGML. I have been working with character-sets since before cuneiform was contemporary. That is, a long time. I have seen and continue to see characters very poorly handled. Even the use of the term "characters" is wrong. Unicode calls them "code points" as a single character (graphic impression) can be made from one or more code points. Unicode supports both a single code point that represents an e with an acute, U+00E9, and a multiple code point representation, U+0065 with a U+0301. The single code point is 'é' while the multiple code points are 'é'. For most users and most web browsers there is no difference in how this character is presented.
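The two spellings of é can be told apart in code even when they look identical on screen. The following is a minimal sketch in Python (the language is a choice made for illustration, not something the text above depends on) using the standard `unicodedata` module; the variable names are made up for the example.

```python
import unicodedata

precomposed = "\u00e9"        # one code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "\u0065\u0301"   # two code points: 'e' + COMBINING ACUTE ACCENT

# Same graphic impression, different code point sequences.
print(precomposed == decomposed)           # False
print(len(precomposed), len(decomposed))   # 1 2

# Normalizing to a common form (NFC composes, NFD decomposes) makes them comparable.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)
print(nfc == precomposed)   # True
print(nfd == decomposed)    # True
```

Normalizing on input, before comparing or storing, is one way to keep the two forms from being treated as different text.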
Now é is represented differently in different character-sets and character-encodings. In UTF-8 it is two bytes, 0xC3 and 0xA9. In XML it is the numbered entity "&#xE9;" or "&#233;". In HTML both of the XML forms can be used, as can the named entity "&eacute;". Percent-encoded within a URL's path and query it is "%C3%A9". Within a domain name that uses punycode it is "xn--9ca". While these differences are not inherently problematic, it has been my experience that their combined use very much is.
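As a rough illustration, each of these representations of é can be produced with Python's standard library. This is only a sketch; the variable names are invented for the example, and the `idna` codec used for the domain-name form is Python's built-in one.

```python
import html
import urllib.parse

ch = "\u00e9"  # é as a single code point

utf8_bytes = ch.encode("utf-8")          # the two UTF-8 bytes
hex_entity = f"&#x{ord(ch):X};"          # XML/HTML numbered entity, hexadecimal
dec_entity = f"&#{ord(ch)};"             # XML/HTML numbered entity, decimal
named_entity = "&eacute;"                # HTML-only named entity
percent = urllib.parse.quote(ch)         # percent-encoding of the UTF-8 bytes
domain_label = ch.encode("idna")         # punycode form of a domain label

print(utf8_bytes)     # b'\xc3\xa9'
print(hex_entity)     # &#xE9;
print(dec_entity)     # &#233;
print(percent)        # %C3%A9
print(domain_label)   # b'xn--9ca'

# All three entity forms decode back to the same character in HTML.
for entity in (hex_entity, dec_entity, named_entity):
    assert html.unescape(entity) == ch
```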

For example, if I have a URL that contains an é and I want to use it in an HTML A tag's href, how do I encode it? All of the following are usable but only the first is correct:
  • http://foo.com/bar?%C3%A9
  • http://foo.com/bar?&#xE9;
  • http://foo.com/bar?&#233;
  • http://foo.com/bar?&eacute;
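One way to arrive at the first form mechanically is to percent-encode the query text and then HTML-escape the whole URL for use inside the attribute. The sketch below assumes Python's standard library; `href_for` is a hypothetical helper name, not part of any library.

```python
import html
from urllib.parse import quote

def href_for(base: str, query_text: str) -> str:
    """Build an href value: percent-encode first, then escape for HTML."""
    # safe="" makes quote() encode every reserved character in the query text.
    url = base + "?" + quote(query_text, safe="")
    # Escaping for the attribute only touches &, <, >, and quotes.
    return html.escape(url, quote=True)

link = f'<a href="{href_for("http://foo.com/bar", "é")}">bar</a>'
print(link)   # <a href="http://foo.com/bar?%C3%A9">bar</a>
```

Keeping the two steps separate means the URL layer only ever sees percent-encoding and the HTML layer only ever sees entity escaping.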
And being correct matters more and more because this stuff is being handled by code. Code is written by programmers and programmers don't universally understand character-encodings and character-sets. Moreover, you have programmers at both ends of the supply chain: The writer writes it wrong and the reader reads it wrong. How many times have you seen the following?
  • An HTML named entity being used in XML.
  • An entity double encoded, e.g. &eacute; is incorrectly encoded as &amp;eacute;.
  • A French text, for instance, with question marks oddly scattered throughout.
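Each of these three failures is easy to reproduce. The following Python sketch shows one plausible cause for each; it is an illustration under assumed conditions, not a diagnosis of any particular system, and `parses_as_xml` is a made-up helper.

```python
import html
import xml.etree.ElementTree as ET

def parses_as_xml(doc: str) -> bool:
    """Return True if the standard-library XML parser accepts the document."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

# 1. An HTML named entity in XML: &eacute; is not defined in plain XML.
print(parses_as_xml("<p>caf&eacute;</p>"))   # False
print(parses_as_xml("<p>caf&#xE9;</p>"))     # True

# 2. Double encoding: escaping text that was already entity-encoded.
already_encoded = "caf&eacute;"
double_encoded = html.escape(already_encoded)
print(double_encoded)                        # caf&amp;eacute;

# 3. Question marks: converting to a character-set that lacks the character.
lossy = "café".encode("ascii", errors="replace")
print(lossy)                                 # b'caf?'
```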
I don't have a solution to the misunderstandings and misuses. My only advice is that you have some simple rules and adhere to them:
  1. Be clear that you and your supplier know the difference between a character-set and a character-encoding.
  2. Don't ever accept bad data. Being generous and accepting (and "correcting") bad data never turns out well for either you or your supplier.
  3. Only accept percent character-encoding for URLs.
  4. Only accept numbered-entity character-encoding for XML & HTML (except for &lt;, &gt;, &amp;, and &quot;).
  5. Only accept content for which there is a specified character-set. (For XML it is UTF-8 by default.) So, for HTML form elements make sure to use the accept-charset attribute. For HTTP requests make sure to require the charset parameter of the Content-Type header. If you can, only accept UTF-8, mostly because the supplier's tools will almost always do the right thing with UTF-8.
  6. Never store text in your byte-oriented repositories in any encoding other than UTF-8. (So, when you get a percent-encoded URL make sure to store it not as supplied but as its decoding.)
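Rules 2, 5, and 6 can be enforced at the point where data arrives. The sketch below is one possible shape for that in Python, using only the standard library; the three function names are hypothetical and the choice to raise an exception on bad input is an assumption about how rejection should be signalled.

```python
from email.message import Message
from urllib.parse import unquote

def require_utf8(content_type: str) -> None:
    """Reject a Content-Type header that does not declare charset=utf-8."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()   # lowercased charset, or None if absent
    if charset != "utf-8":
        raise ValueError(f"unsupported or missing charset: {charset!r}")

def accept_utf8(raw: bytes) -> str:
    """Decode strictly: invalid bytes raise instead of being 'corrected'."""
    return raw.decode("utf-8", errors="strict")

def url_text_for_storage(component: str) -> str:
    """Store the decoding of a percent-encoded URL component, not the encoding."""
    return unquote(component, encoding="utf-8", errors="strict")

require_utf8("text/html; charset=UTF-8")    # passes silently
print(accept_utf8(b"caf\xc3\xa9"))          # café
print(url_text_for_storage("bar%C3%A9"))    # baré
```

With `errors="strict"`, a Latin-1 byte such as 0xE9 on its own raises `UnicodeDecodeError` rather than being silently replaced, which is the point of rule 2.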
I hope this helps.




Need to paint my soldiers.

I have returned to earth after some months of all-things-wargames-all-the-time. I have learned much about Roman, feudal, and 18C and 19C warfare. Perhaps not enough to author a Bluffer's Guide but enough to ask respectable initial and followup questions. But what I have not yet done is actually play a game! My Baccus Saxons and Vikings are still unpainted. I do not have a game table. I haven't tracked down local DBA wargamers. So over the coming weeks I intend to paint my toy soldiers: that is, after helping with kids' projects, house projects, garden projects, etc. Wish me luck.

Library catalog, data: URIs, & bookmarklets

My public library, and I expect yours too, uses Innovative Interfaces' catalog tools. This web application is the poster child for poor user interface and user experience. It is not my objective to enumerate the problems. Instead, I want to share a useful mechanism I discovered for getting around the problem I have with remembering my library card's barcode. (Why on earth the designers of this software expect me to remember my barcode is unfathomable.)

My original solution was to embed the barcode in a URL that mimicked the web application's form-based login. This URL was then bookmarked to allow for immediate access. This worked for a good number of years until the catalog software was updated, at which point the URL broke and I was unable to reproduce the solution with the updated software. (The obstacle seemed to be the need for a session identifier that I could not contrive.) And so I needed a different approach.

The approach I took was to show the barcode whenever I used the catalog. Having a fixed-position DIV at the bottom of the browser's window is easy to do with CSS. The problem was that to do this I needed to dynamically alter the catalog's HTML. I really did not want to install GreaseMonkey to accomplish this. The next best solution was to have a standalone page with an embedded IFRAME. This gives me full control over what is on the page. The HTML is:

<html>
    <head>
        <title>South Kingstown Library Catalog</title>
        <style>
            * {
                margin: 0;
                padding: 0;
                border: 0;
            }
            #tab {
                font-family: Verdana;
                position: fixed;
                bottom: 0px;
                right: 4ex;
                padding: 2ex;
                border-top-left-radius: 1ex;
                border-top-right-radius: 1ex;
                color: white;
                background: gray;
            }
        </style>
    </head>
    <body>
        <div id="tab">
        barcode: 123456789
        </div>
        <iframe
            src="https://catalog.oslri.net/search/"
            width="100%"
            height="100%"
            scrolling="auto"
            marginheight="0"
            marginwidth="0"></iframe>
    </body>
</html>

I didn't like having to store this page as a file on my local machine, and storing it on an HTTP server seemed overkill. Clearly, I was overthinking, but this is what I do for my day job. It then occurred to me that I could encode the page as a data: URI and, I hoped, the browser would render the encoded page when I used this URI. I used David Wilkinson's data: URI creation tool to create the URI. And, to my wonderment, Safari, Firefox, and Chrome (all on OS X) did exactly as hoped for.
What this means is that my future bookmarklets can be more sophisticated than I ever considered practical before. Perhaps you can use this discovery too.
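For anyone who would rather not depend on an online tool, a data: URI for a page can also be built with a few lines of Python. This is a sketch of the general technique, not a description of how the tool mentioned above works; `to_data_uri` and the shortened `page` string are made up for the example.

```python
import base64
from urllib.parse import quote

def to_data_uri(html_text: str, use_base64: bool = True) -> str:
    """Encode an HTML page as a data: URI that declares UTF-8."""
    raw = html_text.encode("utf-8")
    if use_base64:
        payload = base64.b64encode(raw).decode("ascii")
        return "data:text/html;charset=utf-8;base64," + payload
    # Without base64 the bytes are percent-encoded instead.
    return "data:text/html;charset=utf-8," + quote(raw)

page = '<html><body><div id="tab">barcode: 123456789</div></body></html>'
uri = to_data_uri(page)
print(uri[:40])

# Round trip: the payload decodes back to the original page.
assert base64.b64decode(uri.split(",", 1)[1]).decode("utf-8") == page
```

The resulting string can be pasted into the URL field of a bookmark. The base64 form is usually safer to paste because it contains no characters that need further escaping.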