Oct 12

QUrl in Qt 5: validity

In the previous blog post about QUrl in Qt 5, Kenneth suggested I talk to Adam Barth about the issues with URLs in WebKit. So I went to the #webkit channel on Freenode, pinged him, and we talked for a bit. He pointed me to a lot of unit tests that WebKit runs for URL parsing and interpretation, including some of his own results on how different browsers (WebKit-based or not) handle some of the garbage we find out there.

Turns out he has a very extreme view on the subject of URLs and URIs. His position was that there is no standard for what a URL truly is and the RFCs trying to define it (RFC 3986 and RFC 3987) are to be ignored. My position — which matters to you because I’m doing QUrl for Qt 5 — is that the RFC standards are valid and specify how to handle those URIs. Everything else is undefined behaviour and could be rightfully rejected, but we won’t because there’s just too much out there.

RFC 3986 defines an ABNF grammar in Appendix A for parsing of URIs. Qt 4’s QUrl followed this grammar strictly. If you look at the source code, you’ll find matches for exactly the same terms as defined by the grammar. I’m not sure if this is fast, however: could the parser be faster if we had coded it differently? We’ll soon find out, as I’m about to rewrite it.

As it turned out, however, QUrl started not from the URI production but from the broader URI-reference, which can be either a URI as we’re used to or a relative-ref. The latter is what you usually find in fields where URLs are expected, like the HREF attribute for the A element or the SRC attribute for the IMG element in HTML. That meant that the QUrl::isValid() function was mostly useless, as most inputs were considered valid. What people expected to be invalid did match the relative-ref part of the grammar, and the data ended up in the URL’s path component.
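To make the distinction concrete, here is a minimal sketch in plain C++ (not QUrl’s actual code) of the one test that separates a URI from a relative-ref: whether the input begins with a scheme followed by a colon. The helper name hasScheme is made up for this illustration. Anything that fails the test is still a candidate relative-ref, which is why so few inputs ended up invalid.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// ASCII-only character classes, so the result does not depend on the locale.
static bool isAsciiAlpha(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

static bool isAsciiDigit(char c)
{
    return c >= '0' && c <= '9';
}

// Hypothetical helper: true if the input starts with an RFC 3986 scheme
//   scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
// immediately followed by ':'. Such input is parsed as a URI; everything
// else can only be a relative-ref.
bool hasScheme(const std::string &input)
{
    if (input.empty() || !isAsciiAlpha(input[0]))
        return false;
    for (std::size_t i = 1; i < input.size(); ++i) {
        const char c = input[i];
        if (c == ':')
            return true;   // a complete scheme was found
        if (!isAsciiAlpha(c) && !isAsciiDigit(c) && c != '+' && c != '-' && c != '.')
            return false;  // not a scheme character: no scheme here
    }
    return false;          // ran out of input without seeing ':'
}

// Usage sketch: inputs without a scheme fall through to relative-ref.
void demoHasScheme()
{
    assert(hasScheme("http://example.com/"));
    assert(hasScheme("mailto:user@example.com"));
    assert(!hasScheme("index.html"));
    assert(!hasScheme("images/logo.png"));
}
```

A string such as index.html has no scheme, so under the URI-reference rule it is simply a relative path rather than an error.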

So despite being strictly conforming, the parser was actually quite liberal. Couple that with QUrl::TolerantMode parsing, which corrected mistakes in the percent-encoding, and QUrl almost never rejected a URL. The only things it did start to reject were bad hostnames, because I considered them a security issue (homograph attacks). QUrl started to apply strict STD 3 conformance and rejected anything malformed there.
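For illustration, the letter-digit-hyphen rule that STD 3 (RFC 1123) imposes on hostnames can be sketched in plain C++ as below. This is not the check QUrl performs, which also has to deal with internationalised names. The name isStd3Hostname is hypothetical, and the sketch assumes an ASCII hostname written without a trailing dot.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// ASCII letters and digits only; the locale plays no part.
static bool isLetterOrDigit(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
}

// Hypothetical helper: checks each dot-separated label of an ASCII hostname.
// A label must be 1 to 63 characters long (the DNS label limit), consist of
// letters, digits and hyphens, and neither start nor end with a hyphen.
bool isStd3Hostname(const std::string &host)
{
    if (host.empty())
        return false;

    std::size_t labelStart = 0;
    while (labelStart <= host.size()) {
        std::size_t labelEnd = host.find('.', labelStart);
        if (labelEnd == std::string::npos)
            labelEnd = host.size();

        const std::size_t length = labelEnd - labelStart;
        if (length == 0 || length > 63)
            return false;  // empty label (also rejects a trailing dot) or too long
        if (host[labelStart] == '-' || host[labelEnd - 1] == '-')
            return false;  // hyphen at either end of the label

        for (std::size_t i = labelStart; i < labelEnd; ++i) {
            if (!isLetterOrDigit(host[i]) && host[i] != '-')
                return false;  // e.g. underscore, space, or a non-ASCII byte
        }
        labelStart = labelEnd + 1;
    }
    return true;
}

// Usage sketch.
void demoIsStd3Hostname()
{
    assert(isStd3Hostname("example.com"));
    assert(!isStd3Hostname("bad_host.example"));
}
```

A name containing an underscore or a space fails here even though it would have passed as part of a path, which is the asymmetry described above.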

For Qt 5, I will relax the parser even further and I’ll accept some of the really strange inputs that I found in WebKit’s unit tests. QUrl in Qt 5 will accept strictly-conforming URLs as expected and will only produce standards-compliant URIs and URLs. The new parser I’ll write is actually closer to what people expect a URL to be. Take this example from the QUrl documentation:

Instead of following the grammar to parse it, I’ll just delimit at the expected boundaries and then try to correct the components as extracted. I mean, I’ll try — we’ll see if I manage or if I need to scrap this method. Hopefully, this will be a faster algorithm.
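The idea of delimiting first and correcting later can be sketched as follows. This is only an illustration in plain C++, not the parser that will ship: UrlParts and splitUrl are made-up names, and the correction step for each component is left out. The boundaries used are the usual ones: the first ':' ends the scheme, '//' introduces the authority, '?' starts the query and '#' starts the fragment.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical container for the raw, uncorrected components.
struct UrlParts {
    std::string scheme;
    std::string authority;
    std::string path;
    std::string query;
    std::string fragment;
    bool hasAuthority = false;  // distinguishes "foo:///x" from "foo:/x"
};

// Splits the input at the delimiters only; no component is validated here.
UrlParts splitUrl(const std::string &input)
{
    const std::size_t npos = std::string::npos;
    UrlParts parts;
    std::size_t pos = 0;

    // Scheme: text before the first ':' if no '/', '?' or '#' comes earlier.
    const std::size_t first = input.find_first_of(":/?#");
    if (first != npos && first > 0 && input[first] == ':') {
        parts.scheme = input.substr(0, first);
        pos = first + 1;
    }

    // Authority: only after "//", running up to the next '/', '?' or '#'.
    if (input.compare(pos, 2, "//") == 0) {
        parts.hasAuthority = true;
        const std::size_t start = pos + 2;
        const std::size_t end = input.find_first_of("/?#", start);
        parts.authority = input.substr(start, end == npos ? npos : end - start);
        pos = (end == npos) ? input.size() : end;
    }

    // Path: everything up to '?' or '#'.
    const std::size_t pathEnd = input.find_first_of("?#", pos);
    parts.path = input.substr(pos, pathEnd == npos ? npos : pathEnd - pos);
    pos = (pathEnd == npos) ? input.size() : pathEnd;

    // Query: between '?' and '#'.
    if (pos < input.size() && input[pos] == '?') {
        const std::size_t queryEnd = input.find('#', pos + 1);
        parts.query = input.substr(pos + 1, queryEnd == npos ? npos : queryEnd - pos - 1);
        pos = (queryEnd == npos) ? input.size() : queryEnd;
    }

    // Fragment: everything after '#'.
    if (pos < input.size() && input[pos] == '#')
        parts.fragment = input.substr(pos + 1);

    return parts;
}

// Usage sketch.
void demoSplitUrl()
{
    const UrlParts p = splitUrl("http://example.com/path?q=1#frag");
    assert(p.scheme == "http" && p.authority == "example.com");
    assert(p.path == "/path" && p.query == "q=1" && p.fragment == "frag");
}
```

With this split, file:///tmp/x yields an authority that is present but empty, while mailto:user@example.com has no authority at all and everything after the colon lands in the path. A real parser would then go on to check and fix each component separately.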

What does this mean for you? If you were passing QUrl strictly conforming URIs and URLs, nothing will change. In fact, the round trip should be 1:1 and give you back exactly what you gave it. If you had URLs in which some percent-encoded characters or UTF-8 sequences appeared in decoded form without making the URL ambiguous, QUrl will also still accept your input.

If you had really broken URLs which QUrl accepted and corrected in Qt 4, there’s a good chance that QUrl in Qt 5 will continue to accept them and interpret them the same way. That’s because the set of unit tests for QUrl is quite extensive and I’ll do my best to keep compatibility.

Finally, if you had really really broken URLs, especially those with broken hostnames, I haven’t decided yet what to do. I will accept some more URLs but, as I said, I consider them undefined behaviour. They may be accepted or they may be rejected; what’s more, the behaviour might change in new versions of Qt.

If your application breaks because of parsing of URLs, please report the bug. I will pay attention to each report. If we can prove that QUrl is failing to comply with the RFC, then the bug is proven and we’ll need to fix Qt. If your input fails to comply, I’ll need convincing arguments why QUrl should accept and correct your input.

PS: ed2k URIs will never be accepted.

7 comments

  1. sebsauer

    “PS: ed2k URIs will never be accepted.”

    ed2k, magnet, sig2dat and slsk then at least. For reference the bug-report dealing with that back then was http://bugs.kde.org/show_bug.cgi?id=62425

  2. Mike Lothian

    Does anyone still use eDonkey?

  3. Paolo

    is there hope for second life uri?
    http://wiki.secondlife.com/wiki/Secondlife://_URL_scheme

  4. Thiago Macieira

    Paolo: I see that secondlife “URIs” can use two or three slashes. In Qt 4, three slashes equals one. In Qt 5, I plan on maintaining the difference.

    When used with two slashes, the component after the slashes and before the next one is an authority. For QUrl, it must contain a valid hostname and will be treated as a hostname. That means it will be lowercased and subject to STD 3 rules. If Second Life region names match STD 3, it will be fine. If they don’t, those URIs won’t work.

    Hint for implementors: don’t use double slashes. Use plain URIs, like mailto.

  5. sebsauer

    @Mike Lothian

    No idea. But it’s maybe a good explanation for Adam Barth’s extreme view. It’s just interesting to see how defined standards and actual usage differ. We are collecting similar experiences with ISO ODF and MSOOXML in Calligra every day. Sometimes it’s indeed better to go with the standard and ignore implementations, because otherwise you easily end up in situations where old mistakes (and the way ed2k defined URLs was one) are carried on forever.

  6. Giorgos Tsiapaliwkas

    @thiago

    Do you intend to merge the functionality of KUrl into QUrl? If I recall correctly, this scenario was addressed in Randa.

    Thank you for your time.

  7. Thiago Macieira

    The objective is that most of the functionality that is in KUrl should be rolled back into QUrl. That’s why I’m working on this.

    The one functionality that I refuse to merge is to make the constructor become “fromPathOrUrl”. See the discussion on the qt5-feedback mailing list.
