A couple of days ago I posted a comment on Google+ out of frustration while trying to update the QUrl hostname-parsing code. It turns out that rewriting the parser wasn’t that difficult for QUrl, but dealing with hostnames very much is. The old code in QUrl simply handles hostnames directly, including what are supposed to be IPv4 and IPv6 addresses.
Trying to validate them according to the Augmented Backus–Naur Form grammar found in Appendix A of RFC 3986 is extremely difficult. What’s more, the grammar is very strict and doesn’t allow for common forms of v4-compat and v4-mapped IPv6 addresses (that is, “::18.104.22.168” and “::ffff:10.0.0.1”).
So I took it upon myself to rewrite the IP-parsing and reconstructing routines, which were previously in src/network/kernel/qhostaddress.cpp. I wrote a lot of unit tests for it and tried to match the behaviour of inet_aton(3) for IPv4 and of inet_pton(3) for IPv6. That means accepting some non-standard, old behaviour like an IP address of “127.1” or “2130706433” for “127.0.0.1”.
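To make the legacy forms concrete, here is a minimal Python sketch of inet_aton(3)-style parsing. It is not the Qt code: the function name `parse_ipv4_legacy` is made up for this illustration, and it only models the traditional rules (one to four parts, C-style hex and octal prefixes, the last part filling all remaining bytes).

```python
# Digits allowed for each numeric base (checked by hand so that only
# plain ASCII digits are accepted).
_DIGITS = {8: "01234567", 10: "0123456789", 16: "0123456789abcdef"}


def parse_ipv4_legacy(text):
    """Parse an IPv4 address roughly the way inet_aton(3) does.

    Returns the canonical dotted-quad string, or None if rejected.
    """
    parts = text.split(".")
    if not 1 <= len(parts) <= 4:
        return None

    values = []
    for part in parts:
        # C-style prefixes: 0x means hexadecimal, a leading 0 means octal.
        if part[:2].lower() == "0x":
            digits, base = part[2:], 16
        elif len(part) > 1 and part[0] == "0":
            digits, base = part[1:], 8
        else:
            digits, base = part, 10
        digits = digits.lower()
        if not digits or any(c not in _DIGITS[base] for c in digits):
            return None
        values.append(int(digits, base))

    # Every part except the last is one byte; the last part fills
    # whatever bytes remain (so "127.1" means 127.0.0.1).
    *head, last = values
    if any(v > 0xFF for v in head):
        return None
    if last >= 1 << (8 * (4 - len(head))):
        return None

    address = last
    for index, value in enumerate(head):
        address |= value << (8 * (3 - index))
    return ".".join(str((address >> s) & 0xFF) for s in (24, 16, 8, 0))


print(parse_ipv4_legacy("127.1"))       # 127.0.0.1
print(parse_ipv4_legacy("2130706433"))  # 127.0.0.1
```

Validating the digits by hand rather than relying on `int()` alone keeps the sketch from silently accepting non-ASCII digits.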
Why am I accepting that? Well, when I talked to Adam Barth on IRC, he pointed me to his list of results comparing how different browsers handle broken URLs found in the wild. QUrl in Qt 4 probably fails or has broken behaviour for most, if not all, of the entries in the “LayoutTests/fast/url/host.html” section.
So when rewriting the parser, the first thing I noticed is that the hostnames may come in percent-encoded form, so we need to decode them to find a proper address that may fit the rules. For example:
Table 1. Percent-encoded hostnames
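As an illustration of that decode-then-classify step, here is a small Python sketch using only the standard library. The helper `classify_host` and the sample hostnames are assumptions made for this example; note also that Python's `ipaddress` module is strict and accepts only plain dotted-quad notation, unlike inet_aton(3).

```python
import ipaddress
from urllib.parse import unquote


def classify_host(encoded_host):
    """Percent-decode a hostname, then decide which rule it matches.

    Returns (decoded_host, "IPv4Address") or (decoded_host, "reg-name").
    """
    decoded = unquote(encoded_host)
    try:
        ipaddress.IPv4Address(decoded)
    except ipaddress.AddressValueError:
        return decoded, "reg-name"
    return decoded, "IPv4Address"


# Hypothetical inputs: every character of the address is percent-encoded
# where possible, yet the result is still an ordinary IPv4 address.
print(classify_host("%31%32%37%2e0%2e0%2e1"))  # ('127.0.0.1', 'IPv4Address')
print(classify_host("%65xample.com"))          # ('example.com', 'reg-name')
```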
The next thing I noted from the tests is this particular URL: "http://０Ｘｃ０．０２５０．０１/" (I used a fixed-width font here so you can see the difference). This particular URL uses characters found in the Fullwidth Latin Letters range of Unicode (from U+FF00 to U+FFEF). They are exactly the same letters and numbers as found in the regular range, but they occupy one full width, like the ideographs in the CJK block. The regular codepoints used in mostly Latin text, like this blog, are considered halfwidth in Unicode parlance.
What’s interesting about that URL is that when you apply the ToASCII transformation of the IDNA process, the step called Nameprep (described in RFC 3491) transforms the fullwidth forms into their halfwidth counterparts. So the URL above, after going through the ToASCII process, becomes simply “http://0xc0.0250.01” and, despite having non-digits, the new IPv4 address parser accepts it as “192.168.0.1”. So let’s add to our table:
Table 2. Unicode latin fullwidth hostnames
(Note that IPv6Address is not in the table; that will be important later.)
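The fullwidth-to-halfwidth folding can be reproduced with Python's `unicodedata` module. This is only an approximation of Nameprep: it applies NFKC normalisation and lowercasing, and skips the rest of the profile (prohibited characters, bidi checks). The helper name `fold_fullwidth` is made up for this sketch.

```python
import unicodedata


def fold_fullwidth(host):
    """Approximate the Nameprep mapping for fullwidth forms.

    NFKC replaces compatibility characters (fullwidth letters, digits
    and the fullwidth full stop) with their ordinary counterparts;
    lower() stands in for Nameprep's case folding.
    """
    return unicodedata.normalize("NFKC", host).lower()


print(fold_fullwidth("０Ｘｃ０．０２５０．０１"))  # 0xc0.0250.01
print(fold_fullwidth("Ｅｘａｍｐｌｅ．ｃｏｍ"))    # example.com
```

The first result is exactly the string that an inet_aton(3)-style parser then reads as 192.168.0.1.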
In other words, a hostname can be encoded in either the percent-encoded form or in Unicode and still be a regular IPv4 address. To make matters worse, it can be encoded in both!
The RFC describing URIs and URLs (RFC 3986) has a companion describing IRIs (Internationalised Resource Identifiers): RFC 3987. The IRI spec requires that a Unicode codepoint be equivalent to its percent-encoded UTF-8 form. That is, the letter “é” (U+00E9 LATIN SMALL LETTER E WITH ACUTE) is equivalent to “%C3%A9”. If that is so, then the Unicode fullwidth forms can be percent-encoded as UTF-8 too. If we encode the hostnames found in Table 2 above, we get:
| %ef%bc%90 %ef%bc%b8 %ef%bd%83 %ef%bc%90 %ef%bc%8e %ef%bc%90 %ef%bc%92 %ef%bc%95 %ef%bc%90 %ef%bc%8e %ef%bc%90 %ef%bc%91 | 192.168.0.1 | IPv4Address |
| %ef%bc%a5 %ef%bd%98 %ef%bd%81 %ef%bd%8d %ef%bd%90 %ef%bd%8c %ef%bd%85 %ef%bc%8e %ef%bd%83 %ef%bd%8f %ef%bd%8d | example.com | reg-name |
Table 3. Hostnames in Unicode fullwidth latin letters and percent-encoded
(spaces are for legibility purposes only)
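The equivalence between a codepoint and its percent-encoded UTF-8 form is easy to check with `urllib.parse`. The sketch below round-trips “é” and decodes the first row of Table 3; the variable names are only for illustration. Python's `quote` emits uppercase hex digits, which is equivalent to the lowercase form used in the table.

```python
from urllib.parse import quote, unquote

# A codepoint and its percent-encoded UTF-8 form are interchangeable.
print(quote("é", safe=""))   # %C3%A9
print(unquote("%C3%A9"))     # é

# First row of Table 3, with the legibility spaces removed.
row = ("%ef%bc%90 %ef%bc%b8 %ef%bd%83 %ef%bc%90 %ef%bc%8e %ef%bc%90 "
       "%ef%bc%92 %ef%bc%95 %ef%bc%90 %ef%bc%8e %ef%bc%90 %ef%bc%91")
encoded_host = "".join(row.split())

# Decoding gives back the fullwidth hostname from the earlier example.
fullwidth_host = unquote(encoded_host)
print(fullwidth_host)        # ０Ｘｃ０．０２５０．０１

# Encoding it again reproduces the table entry (modulo hex digit case).
print(quote(fullwidth_host, safe="").lower() == encoded_host)  # True
```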
Could it get any uglier? I thought it could. If there are fullwidth characters that transform to regular numbers and letters, is there one that transforms to the percent sign? Well, it turns out that there is: “％”. If we apply NFKC to that, we obtain a regular ‘%’. And if you pay close attention to Adam Barth’s list, you see it being used (lines 39-47): “http://%ef%bc%85%ef%bc%90%ef%bc%90.com/” and “http://%ef%bc%85%ef%bc%94%ef%bc%91.com/” (the fullwidth percent is “%ef%bc%85”).
At this point, I was about to pull my hair out (thankfully, I had a haircut last week, so I can’t get a grip on my hair). Which operation should I do first: decoding the percent-encoding or applying Nameprep (ToASCII)? Moreover, what’s stopping me from writing “%ef%bc%85” (the percent-encoded representation of the fullwidth percent) in its fullwidth form (“％ＥＦ％ＢＣ％８５”)? And then encoding that in percent-encoding (“%ef%bc%85 %ef%bc%a5 %ef%bc%a6 %ef%bc%85 %ef%bc%a2 %ef%bc%a3 %ef%bc%85 %ef%bc%98 %ef%bc%95”)? And then repeating the process ad nauseam?
If you’re still with me, we’ve just found a problem: this is infinite recursion. We have to put a stop to it.
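To see how the layers nest, the Python sketch below alternates percent-decoding and NFKC folding until the text stops changing. It only illustrates the problem and is not how QUrl resolves it; the function `peel_layers` and its `max_rounds` cap are hypothetical.

```python
import unicodedata
from urllib.parse import unquote


def peel_layers(host, max_rounds=8):
    """Alternate percent-decoding and NFKC folding until stable.

    Returns (final_text, rounds_used). Raises ValueError when more
    than max_rounds layers are found, so that hostile input cannot
    keep the loop busy indefinitely.
    """
    rounds = 0
    while True:
        peeled = unicodedata.normalize("NFKC", unquote(host))
        if peeled == host:
            return host, rounds
        if rounds == max_rounds:
            raise ValueError("too many nested encodings")
        host, rounds = peeled, rounds + 1


# Fullwidth percent followed by fullwidth "41": two layers hide an "A".
print(peel_layers("%ef%bc%85%ef%bc%94%ef%bc%91.com"))  # ('A.com', 2)

# The doubly wrapped fullwidth percent sign from the paragraph above.
nested = ("%ef%bc%85%ef%bc%a5%ef%bc%a6%ef%bc%85%ef%bc%a2"
          "%ef%bc%a3%ef%bc%85%ef%bc%98%ef%bc%95")
print(peel_layers(nested))                             # ('%', 2)
```

Each extra layer of wrapping costs the attacker only a longer string, which is why some hard stop is needed.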
Then I remembered another detail: there are also fullwidth characters for slash (“／”), question mark (“？”) and hash (“＃”), all of which are special in URL encoding. Those characters, especially the slash, were the source of a security problem a year or two ago, in which you could hide one in a specially crafted domain name: for example, in “www.bank.com.xn--6g7c.com”, a blind ToUnicode operation results in “www.bank.com.／.com”. Since this attack appeared, QUrl enforces strict STD 3 compliance. After the Nameprep operation, QUrl applies these steps from RFC 3490 Section 4.1:
(a) Verify the absence of non-LDH ASCII code points; that is, the
absence of 0..2C, 2E..2F, 3A..40, 5B..60, and 7B..7F.
(b) Verify the absence of leading and trailing hyphen-minus; that
is, the absence of U+002D at the beginning and end of the
sequence.
With this, the “%” and the “[” characters are rejected and the hostname is considered invalid. For that reason, a hostname containing “％” or its UTF-8 percent-encoding is never valid and we put a stop to the iteration.
That also means the hostname field of QUrl will continue to reject anything that doesn’t conform to the rules above (an exception was made for accepting the underscore character), even if the ABNF for URIs would otherwise accept them. In particular, none of the “sub-delims” or non-URL characters are permitted, either in decoded or percent-encoded forms. That’s why “ed2k://” URLs are not allowed: they use the pipe character (“|”) in the hostname and that fails to comply with STD 3.
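The two STD 3 checks, plus the underscore exception, can be sketched in a few lines of Python. This is an illustration of the rule rather than QUrl's implementation: the function `violates_std3` and its `allow_underscore` parameter are invented for this example, and NFKC again stands in for the full Nameprep step.

```python
import unicodedata


def violates_std3(label, allow_underscore=True):
    """Return True if a hostname label breaks the STD 3 ASCII rules.

    Step (a): no ASCII code point other than letters, digits and
    hyphen-minus (optionally tolerating the underscore).
    Step (b): no leading or trailing hyphen-minus.
    """
    folded = unicodedata.normalize("NFKC", label)
    for ch in folded:
        if ord(ch) > 0x7F:
            continue  # non-ASCII is left for the Punycode step
        if ch.isalnum() or ch == "-":
            continue
        if ch == "_" and allow_underscore:
            continue
        return True
    return folded.startswith("-") or folded.endswith("-")


# The label from the spoofing example decodes to a fullwidth slash,
# which Nameprep-style folding turns into a plain "/".
spoofed = b"6g7c".decode("punycode")
print(spoofed == "\uff0f")          # True
print(violates_std3(spoofed))       # True

print(violates_std3("％41"))        # True: folds to "%41"
print(violates_std3("a|b"))         # True: pipe, as in ed2k URLs
print(violates_std3("my_host"))     # False: underscore exception
```

Because the check runs after the folding step, the fullwidth percent and slash are caught in the same way as their ASCII counterparts.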
I’m almost done with QUrl. Yesterday, after completing the code, I started to run the unit tests, which are down to 87 failures (from 269 the first time I ran them, after fixing the crashes). I should be done with 90% of the failures by tonight, through a mix of fixing the code that isn’t correct and fixing tests that are now wrong.