In the previous blog post about QUrl in Qt 5, Kenneth suggested I talk to Adam Barth about the issues with URLs in WebKit. So I went to the #webkit channel on Freenode, pinged him, and we discussed it for a bit. He pointed me to a lot of unit tests that WebKit runs related to URL parsing and interpretation, including some of his own results on how different browsers (WebKit-based and not) handle some of the garbage we find out there.
Turns out he has a very extreme view on the subject of URLs and URIs. His position was that there is no standard for what a URL truly is and the RFCs trying to define it (RFC 3986 and RFC 3987) are to be ignored. My position — which matters to you because I’m doing QUrl for Qt 5 — is that the RFC standards are valid and specify how to handle those URIs. Everything else is undefined behaviour and could be rightfully rejected, but we won’t because there’s just too much out there.
RFC 3986 defines an ABNF grammar in Appendix A for parsing URIs. Qt 4’s QUrl followed this grammar strictly. If you look at the source code, you’ll find matches for exactly the same terms as defined by the grammar. I’m not sure if this is fast, however: could the parser be faster if we had coded it differently? We’ll soon find out, as I’m about to rewrite it.
There was a twist, however: QUrl started parsing not from the URI rule but from the broader URI-reference rule, which can be either a URI as we’re used to or a relative-ref. The latter is what you usually find in fields where URLs are expected, like the HREF attribute for the A element or the SRC attribute for the IMG element in HTML. That meant that the QUrl::isValid() function was mostly useless, as most inputs were considered valid. What people expected to be invalid did match the relative-ref part of the grammar, and the data ended up in the URL’s path component.
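To illustrate why so much input counts as valid under URI-reference, here is a minimal sketch in plain C++ (no Qt). The function name hasScheme is made up for this example and is not part of QUrl. It only checks for the RFC 3986 scheme rule followed by a colon. Anything that fails the check is a candidate relative-ref, so its text is treated as a path rather than rejected.

```cpp
#include <cassert>
#include <string>

// ASCII-only character classes, independent of the current locale.
static bool isAsciiAlpha(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

static bool isAsciiDigit(char c)
{
    return c >= '0' && c <= '9';
}

// Hypothetical helper (not a QUrl function): returns true if the input
// starts with an RFC 3986 scheme followed by ':'.
//   scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
bool hasScheme(const std::string &input)
{
    if (input.empty() || !isAsciiAlpha(input[0]))
        return false;
    for (std::string::size_type i = 1; i < input.size(); ++i) {
        const char c = input[i];
        if (c == ':')
            return true;            // a complete scheme was found
        if (!isAsciiAlpha(c) && !isAsciiDigit(c)
            && c != '+' && c != '-' && c != '.')
            return false;           // not a scheme character: no scheme
    }
    return false;                   // ran out of input before any ':'
}
```

With this check, "http://example.com/" has a scheme, while "images/logo.png" and "garbage-input" do not, which is why inputs like the last two fall through to relative-ref and end up as a path.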
So despite being strictly conforming, the parser was actually quite liberal. Couple that with the QUrl::TolerantMode parsing, which corrected mistakes in the percent-encoding, and QUrl almost never rejected a URL. The only things it started to reject were bad hostnames, because I considered them a security issue (homograph attacks). QUrl started to apply strict STD 3 conformance and rejected anything malformed there.
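As an illustration of the kind of percent-encoding correction a tolerant parser can make, here is a small sketch in plain C++. It is an assumption for the sake of example, not QUrl’s actual code, and the name fixStrayPercents is made up. It escapes any '%' that is not followed by two hexadecimal digits and leaves well-formed triplets alone.

```cpp
#include <cassert>
#include <string>

static bool isHexDigit(char c)
{
    return (c >= '0' && c <= '9')
        || (c >= 'a' && c <= 'f')
        || (c >= 'A' && c <= 'F');
}

// Hypothetical helper (not QUrl's implementation): escapes every '%'
// that does not start a valid "%XX" triplet as "%25".
std::string fixStrayPercents(const std::string &input)
{
    std::string out;
    out.reserve(input.size());
    for (std::string::size_type i = 0; i < input.size(); ++i) {
        // A valid triplet needs two more characters, both hex digits.
        const bool validTriplet = input[i] == '%'
            && i + 2 < input.size()
            && isHexDigit(input[i + 1])
            && isHexDigit(input[i + 2]);
        if (input[i] == '%' && !validTriplet)
            out += "%25";           // stray percent sign: encode it
        else
            out += input[i];        // everything else is copied as-is
    }
    return out;
}
```

Under this sketch, "a%20b" comes back unchanged, while "100%" becomes "100%25".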
For Qt 5, I will relax the parser even further and I’ll accept some of the really strange inputs that I found in WebKit’s unit tests. QUrl in Qt 5 will accept strictly-conforming URLs as expected and will only produce standards-compliant URIs and URLs. The new parser I’ll write is actually closer to what people expect a URL to be. Take this example from the QUrl documentation:
Instead of following the grammar to parse it, I’ll just delimit at the expected boundaries and then try to correct the components as extracted. I mean, I’ll try — we’ll see if I manage or if I need to scrap this method. Hopefully, this will be a faster algorithm.
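To make the idea of delimiting at the expected boundaries concrete, here is a rough sketch in plain C++. It is not the parser I will end up writing, and the names UrlParts and splitUrl are invented for this example. It cuts the input at '#', '?', the first ':' and the "//" prefix, and leaves validating or correcting each component to a later step.

```cpp
#include <cassert>
#include <string>

// Hypothetical container for the raw, unvalidated pieces of a URL.
struct UrlParts {
    std::string scheme;
    std::string authority;
    std::string path;
    std::string query;
    std::string fragment;
};

// Hypothetical helper: splits at delimiters only, no grammar matching.
UrlParts splitUrl(const std::string &input)
{
    UrlParts parts;
    std::string rest = input;

    // Fragment: everything after the first '#'.
    const std::string::size_type hash = rest.find('#');
    if (hash != std::string::npos) {
        parts.fragment = rest.substr(hash + 1);
        rest.erase(hash);
    }

    // Query: everything after the first '?' in what remains.
    const std::string::size_type question = rest.find('?');
    if (question != std::string::npos) {
        parts.query = rest.substr(question + 1);
        rest.erase(question);
    }

    // Scheme: text before the first ':' if no '/' comes before it.
    const std::string::size_type colon = rest.find(':');
    if (colon != std::string::npos && colon > 0 && rest.find('/') > colon) {
        parts.scheme = rest.substr(0, colon);
        rest.erase(0, colon + 1);
    }

    // Authority: introduced by "//" and ending at the next '/'.
    if (rest.compare(0, 2, "//") == 0) {
        const std::string::size_type slash = rest.find('/', 2);
        if (slash == std::string::npos) {
            parts.authority = rest.substr(2);
            rest.clear();
        } else {
            parts.authority = rest.substr(2, slash - 2);
            rest.erase(0, slash);
        }
    }

    // Whatever is left is the path.
    parts.path = rest;
    return parts;
}
```

For "http://user@example.com:8080/a/b?x=1#top" this yields the scheme "http", the authority "user@example.com:8080", the path "/a/b", the query "x=1" and the fragment "top". The sketch does not distinguish an empty component from an absent one and does no checking of the scheme characters. Both would belong to the correction step.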
What does this mean for you? If you were passing QUrl strictly conforming URIs and URLs, nothing will change. In fact, it should be 1:1 and give you back exactly what you gave it. If you had URLs in which some percent-encoded characters or UTF-8 sequences were left decoded, without making the URL ambiguous, QUrl will also still accept your input.
If you had really broken URLs which QUrl accepted and corrected in Qt 4, there’s a good chance that QUrl in Qt 5 will continue to accept them and interpret them the same way. That’s because the set of unit tests for QUrl is quite extensive and I’ll do my best to keep compatibility.
Finally, if you had really, really broken URLs, especially those with broken hostnames, I haven’t decided yet. I will accept some more URLs but, as I said, I consider them undefined behaviour. They may be accepted or they may be rejected. What’s more, the behaviour might change in new versions of Qt.
If your application breaks because of parsing of URLs, please report the bug. I will pay attention to each report. If we can prove that QUrl is failing to comply with the RFC, then the bug is proven and we’ll need to fix Qt. If your input fails to comply, I’ll need convincing arguments why QUrl should accept and correct your input.
PS: ed2k URIs will never be accepted.