Unicode and permalinks

While integrating automation scripts with Testuff, I encountered an interesting Unicode-related issue I’d like to share.

The integration allows an automated testing script to report the results of its run to the Testuff server. In order for the results to be grouped, displayed and summarized correctly, the automation script needs to tell the server which test it ran, and whether the test passed or failed. A long discussion emerged on the best way to uniquely identify tests.

After quite a bit of back and forth, we’ve settled on permalinks, those more-or-less-readable URLs that are in common use in blogs. The idea of a permalink is to take the title (of a blog post or a test) and replace any characters that aren’t numbers or letters with an underscore or a hyphen. Using this simple scheme, “Unicode and permalinks” becomes “unicode-and-permalinks”, which is quite suitable for use in a URL.

The implementation is a simple regular expression:

import re

def to_permalink(string):
    return re.sub(r"[^a-zA-Z0-9]+", "_", string).lower()

While this code works perfectly for the English language, it doesn’t work at all if string is a Unicode string containing something in Hebrew, Russian or Polish, languages that some of our customers use. And so, I set out to write code that would essentially behave like the regular expression above, but would work for letters and numbers in all the languages of the world.
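To make the failure concrete, here is a small sketch (Python 3 assumed; the name `ascii_permalink` and the sample titles are only illustrative) of what the ASCII-only expression above does to non-English titles:

```python
import re

def ascii_permalink(title):
    # Same ASCII-only pattern as above: every run of characters
    # outside a-z, A-Z and 0-9 collapses into a single underscore.
    return re.sub(r"[^a-zA-Z0-9]+", "_", title).lower()

english = ascii_permalink("Unicode and permalinks")
# Hebrew title of the Wikipedia main page, written with escapes.
hebrew = ascii_permalink("\u05e2\u05de\u05d5\u05d3 \u05e8\u05d0\u05e9\u05d9")
russian = ascii_permalink("Привет, мир")

print(english)  # unicode_and_permalinks
print(hebrew)   # _
print(russian)  # _
```

Every non-English title ends up as the same single underscore, so such permalinks can no longer identify anything uniquely.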

Fortunately, the Unicode standard includes a rarely used classification of characters into various categories. For each given character we can find out whether it is an uppercase letter, a lowercase letter, a number, a punctuation mark and so on. Surprisingly, Python includes a module called unicodedata that contains all that information. The function category accepts a character and returns a string that tells us what the character is: “Lu” denotes an uppercase letter, “Nd” denotes a decimal digit, etc.
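As a quick illustration, the following sketch (Python 3 assumed; the sample characters are arbitrary picks) prints the category that `unicodedata.category` reports for a few characters:

```python
import unicodedata

# A few sample characters: Latin letters, a digit, punctuation,
# a space, a Hebrew letter (U+05E2) and a Cyrillic letter (U+0436).
samples = ["A", "a", "7", "-", " ", "\u05e2", "\u0436"]

categories = {ch: unicodedata.category(ch) for ch in samples}

for ch, cat in categories.items():
    print(repr(ch), cat)
```

The first letter of the category is the major class: "L" for letters, "N" for numbers, "P" for punctuation, "Z" for separators. Hebrew letters have no case, so they are reported as "Lo" (letter, other).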

All that remains to be done is to go over the characters in the title, keep the letters and numbers, and replace all the other characters with an underscore. The regular expression at the end replaces any sequence of underscores with a single underscore to make the resulting URLs even nicer to look at.

def to_permalink(s):
    """
    Converts sequences of characters that aren’t letters or numbers
    to a single underscore to achieve Wikipedia-like Unicode URLs.
    """
    import re
    import unicodedata
    def conv(c):
        if unicodedata.category(c)[0] in ["L", "N"]:
            return c
        else:
            return "_"
    s2 = "".join([conv(c) for c in s])
    return re.sub("_+", "_", s2)

[Update] Or, as Almad correctly pointed out, you could just use the re module support for Unicode and be done with it in two lines, which kind of takes the air out of this post.

def to_permalink(s):
    import re
    return re.compile(r"\W+", re.UNICODE).sub("_", s)
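For readers on Python 3, a minimal sketch of the same idea follows. It assumes Python 3, where `str` patterns are already Unicode-aware, so `re.UNICODE` is redundant and kept only for explicitness. The flag is passed by keyword because the fourth positional argument of `re.sub` is `count`, not `flags`.

```python
import re

def to_permalink(title):
    # \W matches anything that is not a letter, a digit or an
    # underscore, in any script; each run becomes one underscore.
    return re.sub(r"\W+", "_", title, flags=re.UNICODE)

english = to_permalink("Unicode and permalinks")
hebrew = to_permalink("\u05e2\u05de\u05d5\u05d3 \u05e8\u05d0\u05e9\u05d9")
messy = to_permalink("Hello,  world!")

print(english)  # Unicode_and_permalinks
print(messy)    # Hello_world_
```

Unlike the first version, this one does not lowercase the result, and a trailing run of punctuation leaves a trailing underscore; whether to strip it is a design choice.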

There’s one other thing to consider when dealing with Unicode permalinks. If you’re a native speaker of a language other than English, you’ve probably seen URLs in your own language on Wikipedia.

From the looks of it, URLs can include characters in any language. Right?

Wrong.

RFC 3986 defines the syntax for URLs (actually URIs, but that’s a moot point) and states explicitly which characters are allowed in a URL. This includes little more than English letters and numbers from the lower half of the ASCII chart.

If you look at the headers your browser passes when you access such a URL, you’ll see that it encodes all the characters with percent encoding, so neither the browser nor the web server is violating the standard. This is what the server saw when I navigated to the main Hebrew page of Wikipedia:

GET /wiki/%D7%A2%D7%9E%D7%95%D7%93_%D7%A8%D7%90%D7%A9%D7%99 HTTP/1.1
Host: he.wikipedia.org

In order to understand what this percent encoding means, you need to know a bit about Unicode. Basically, the Unicode URL is encoded in UTF-8 and each byte of the UTF-8-encoded string is encoded using percent encoding. The browser apparently recognizes this specific encoding scheme (which isn’t documented anywhere I could find) and displays nice internationalized URLs for the user.
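The two steps can be sketched in a few lines of Python 3. The helper name `percent_encode` is made up for illustration; the standard library's `urllib.parse.quote` does the same job and is used here as a cross-check.

```python
import string
from urllib.parse import quote

# Bytes that RFC 3986 calls "unreserved" and that stay as they are.
UNRESERVED = set((string.ascii_letters + string.digits + "-._~").encode("ascii"))

def percent_encode(text):
    # Step 1: encode the text as UTF-8 bytes.
    # Step 2: write every byte outside the unreserved set as %XX.
    out = []
    for byte in text.encode("utf-8"):
        if byte in UNRESERVED:
            out.append(chr(byte))
        else:
            out.append("%{:02X}".format(byte))
    return "".join(out)

# Title of the Hebrew Wikipedia main page, written with escapes.
title = "\u05e2\u05de\u05d5\u05d3_\u05e8\u05d0\u05e9\u05d9"
encoded = percent_encode(title)

print(encoded)  # %D7%A2%D7%9E%D7%95%D7%93_%D7%A8%D7%90%D7%A9%D7%99
print(encoded == quote(title))  # True
```

Each Hebrew letter takes two bytes in UTF-8, which is why every letter shows up as a pair of %XX groups in the request line above.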

If you want to support such URLs in your server, you’ll probably need to write some code to translate the percent-encoded URLs into their actual Unicode representation.
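In Python 3 that translation does not have to be written by hand. A minimal sketch, assuming the client sent UTF-8, uses `urllib.parse.unquote`, which decodes the percent-encoded bytes as UTF-8 by default:

```python
from urllib.parse import unquote

# The request path from the Wikipedia example above.
path = "/wiki/%D7%A2%D7%9E%D7%95%D7%93_%D7%A8%D7%90%D7%A9%D7%99"

# Undo the percent encoding and decode the resulting bytes as UTF-8.
decoded = unquote(path)

print(decoded)
```

By default `unquote` replaces byte sequences that are not valid UTF-8 with U+FFFD; passing `errors="strict"` makes it raise `UnicodeDecodeError` instead, which is useful if the server wants to reject or specially handle such requests.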

16 Comments on “Unicode and permalinks”


By Love Encounter Flow. September 22nd, 2008 at 13:22

there’s even more things to be aware of in regard to unicode in urls:

(1) some browsers, including firefox 2, may choose to send *some* urls not in utf-8, but in a legacy encoding such as the system default or plain latin-1. for ffx2, this would appear to be true whenever the characters entered by the user happen to be encodable as latin-1 etc. there is no rfc that would govern usage and announcement of encodings in urls (what a joke), so you have to guess yourself when doing url decoding (i always use a routine that first tries utf-8, then latin-1 or similar as a fallback. web application frameworks surprisingly often fail in doing that for me).

(2) there is yet another percent-encoding style using unicode character ids, using four hex digits à la %a34f (see http://en.wikipedia.org/wiki/Percent-encoding#Non-standard_implementations); this has been rejected by the w3c (probably on the grounds that this would definitely make things too easy, which is presumably why the standards bodies adopted http://en.wikipedia.org/wiki/Punycode, by far the weirdest character encoding standard ever released).

(3) on the bright side, there is a healthy tendency among browser vendors to show, in the address bar, the intended, not the encoded likeness of the url typed in. the ffx2 locationbar² extension does it, google chrome does it, flock does it (the fine people over at ffx3 have sadly missed the trend so far, although your screenshots seem to indicate otherwise?). this is an important feature to get readable urls for the people of the world. hopefully, with browser vendors paying more attention to this issue, encoding problems will also become less of a burden in the future.

By Love Encounter Flow. September 22nd, 2008 at 14:55

oops, that alternative percent-encoding would be %ua34f and so on, the u being an indicator that four digits are used and the character set referred to is unicode.

By Almad. September 23rd, 2008 at 14:16

Is it not sufficient to use re.sub("\W", "_", re.UNICODE)?

By gooli. September 23rd, 2008 at 14:39

Damn! I didn’t know the re module could do that.

By The Cave » Blog Archive » Unicode Permalinks. September 27th, 2008 at 19:08

[...] Even better, solid information on uncommon (and poorly understood) Unicode handling in Python. [...]

By andy mckay. October 27th, 2008 at 15:12

Perhaps a quick peek at Plone code might help; it takes utf-8 and runs it through a decode to form a nice ASCII url. There are going to be problems with it, but Plone uses this to make urls. Here’s some sample code, https://svn.plone.org/svn/plone/CMFPlone/tags/2.5.5/UnicodeNormalizer.py and a sample that does it in JSONP: http://clearwind-labs.appspot.com/

By Lawrence Sheed. February 17th, 2009 at 19:47

@Love Encounter Flow.

Quote “…no rfc that would govern usage and announcement of encodings in urls”

Actually, there is: RFC 3987

Non Ascii (and valid) Characters are assumed to be UTF-8, and should be encoded in percent encoding.

This article at the W3C http://www.w3.org/International/articles/idn-and-iri/ talks about this.

I’m in the middle of a discussion about URI implementation in an open source CMS at the moment (which is how I found this link).

Lawrence / Computer Solutions Design China.

By video encoding. July 23rd, 2009 at 01:57

thank you, great post.

i never knew it could be done, and this could really help me in my new project.
