While working on integrating automation scripts with Testuff, I encountered an interesting Unicode-related issue that I’d like to share.
The integration allows an automated testing script to report the results of its run to the Testuff server. In order for the results to be grouped, displayed and summarized correctly, the automation script needs to tell the server which test it ran, and whether the test passed or failed. A long discussion emerged on the best way to uniquely identify tests.
After quite a bit of back and forth, we’ve settled on permalinks, those more-or-less-readable URLs that are in common use in blogs. The idea of a permalink is to take the title (of a blog post or a test) and replace any characters that aren’t numbers or letters with an underscore or a hyphen. Using this simple scheme, “Unicode and permalinks” becomes “unicode-and-permalinks”, which is quite suitable for use in a URL.
The implementation is a simple regular expression:
return re.sub("[^a-zA-Z0-9]+", "_", string).lower()
While this code works perfectly for the English language, it doesn’t work at all if string is a Unicode string containing something in Hebrew, Russian or Polish, languages that some of our customers use. And so, I set out to write code that would essentially behave like the regular expression above, but would work for letters and numbers in all the languages of the world.
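To make the failure concrete, here is a small sketch that wraps the one-liner above in a function and feeds it a Russian title. The function name `ascii_permalink` is made up for this example, and Python 3 is assumed.

```python
import re

def ascii_permalink(title):
    # Same pattern as the one-liner above: every run of characters
    # outside a-z, A-Z and 0-9 collapses into a single underscore.
    return re.sub("[^a-zA-Z0-9]+", "_", title).lower()

print(ascii_permalink("Unicode and permalinks"))  # unicode_and_permalinks
print(ascii_permalink("Привет мир"))              # _
```

Every Cyrillic letter counts as "not a letter" for this pattern, so the whole title collapses into one underscore and all such tests would end up with the same identifier.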
Fortunately, the Unicode standard includes a rarely used classification of characters into various categories. For each given character we can find out whether it is an uppercase letter, a lowercase letter, a number, a punctuation mark and so on. Surprisingly, Python includes a module called unicodedata that contains all that information. The function category accepts a character and returns a two-letter string that tells us what the character is: “Lu” denotes an uppercase letter, “Nd” denotes a decimal digit, etc. The first letter of the code gives the general class (“L” for letters, “N” for numbers).
All that remains to be done is to go over the characters in the title, keep the letters and numbers, and replace all the other characters with a dash or an underscore. The regular expression at the end replaces any sequence of underscores with a single underscore to make the resulting URLs even nicer to look at.
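As a quick illustration of what category returns, the sketch below prints the category of a few characters from different scripts. The helper `major_class` is a made-up name for this example; Python 3 is assumed.

```python
import unicodedata

def major_class(ch):
    # First letter of the two-letter category code:
    # "L" letter, "N" number, "P" punctuation, "Z" separator, ...
    return unicodedata.category(ch)[0]

# The last two characters are Hebrew shin and Cyrillic zhe.
for ch in ["A", "a", "7", "-", " ", "\u05e9", "\u0436"]:
    print(repr(ch), unicodedata.category(ch), major_class(ch))
```

Note that Hebrew letters have no case, so they fall under “Lo” (other letter) rather than “Lu” or “Ll”. Checking only the first letter of the code covers all of these at once.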
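import re, unicodedata

def permalink(s):
    """Converts sequences of characters that aren't letters or numbers
    to a single underscore to achieve Wikipedia-like Unicode URLs."""
    def conv(c):
        # category() returns two-letter codes such as "Lu" or "Nd"; the first letter is the major class.
        return c if unicodedata.category(c)[0] in ["L", "N"] else "_"
    s2 = "".join([conv(c) for c in s])
    return re.sub("_+", "_", s2)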
[Update] Or, as Almad correctly pointed out, you could just use the re module’s support for Unicode and be done with it in two lines, which kind of takes the air out of this post.
return re.compile(r"\W+", re.UNICODE).sub("_", s)
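Here is a quick check of the regular-expression approach, assuming Python 3, where patterns applied to str are Unicode-aware by default (so the re.UNICODE flag is redundant there). The name `unicode_permalink` is made up for this sketch.

```python
import re

def unicode_permalink(title):
    # \W matches any character that is not a letter, digit or underscore,
    # in any script, when both pattern and input are str.
    return re.sub(r"\W+", "_", title)

print(unicode_permalink("Unicode and permalinks"))  # Unicode_and_permalinks
print(unicode_permalink("Привет, мир!"))            # Привет_мир_
```

Trailing punctuation leaves a trailing underscore, so you may want to add .strip("_") to the result if that bothers you.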
There’s one other thing to consider when dealing with Unicode permalinks. If you’re a native speaker of a language other than English, you’ve probably seen URLs in your own language on Wikipedia.
From the looks of it, URLs can include characters in any language. Right?
Not quite. RFC 3986 defines the syntax for URLs (actually URIs, but that’s a moot point) and states explicitly which characters are allowed in a URL. The list includes little more than English letters, digits and a handful of punctuation marks, all of them plain ASCII.
If you look at the headers your browser passes when you access such a URL, you’ll see that it encodes all the characters with percent encoding, so neither the browser nor the web server is violating the standard. This is what the server saw when I navigated to the main Hebrew page of Wikipedia:
GET /wiki/%D7%A2%D7%9E%D7%95%D7%93_%D7%A8%D7%90%D7%A9%D7%99 HTTP/1.1
Host: he.wikipedia.org
In order to understand what this percent encoding means, you need to know a bit about Unicode. Basically, the Unicode URL is encoded in UTF-8 and each byte of the UTF-8-encoded string is then written using percent encoding. The browser apparently recognizes this specific encoding scheme (which isn’t documented anywhere I could find) and displays nice internationalized URLs to the user.
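The two steps can be reproduced by hand. The sketch below, assuming Python 3, encodes the first word of the Hebrew page title to UTF-8 and writes each byte as %XX, then compares the result with the standard library’s urllib.parse.quote. The function name `percent_encode` is made up for this example.

```python
from urllib.parse import quote

def percent_encode(text):
    # Step 1: encode the text to UTF-8 bytes.
    # Step 2: write every byte as a percent sign plus two hex digits.
    return "".join("%{:02X}".format(b) for b in text.encode("utf-8"))

# The first word of the Hebrew main page title, written with escapes
# to avoid right-to-left display issues in the source.
title = "\u05e2\u05de\u05d5\u05d3"

print(percent_encode(title))  # %D7%A2%D7%9E%D7%95%D7%93
print(quote(title))           # %D7%A2%D7%9E%D7%95%D7%93
```

Each Hebrew letter takes two bytes in UTF-8, which is why every letter shows up as two percent-encoded groups. Unlike quote, this toy version also escapes ASCII letters, so the two only agree on text that has no ASCII in it.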
If you want to support such URLs in your server, you’ll probably need to write some code to translate the percent-encoded URLs into their actual Unicode representation.
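In Python 3 the standard library already handles that translation. A minimal sketch, using the request path captured above:

```python
from urllib.parse import unquote

path = "/wiki/%D7%A2%D7%9E%D7%95%D7%93_%D7%A8%D7%90%D7%A9%D7%99"

# unquote() turns each %XX back into a byte and decodes the
# resulting byte sequence as UTF-8 by default.
decoded = unquote(path)
print(decoded)
```

Many web frameworks do this decoding before your handler sees the path, so it is worth checking what yours hands you before decoding a second time.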