Sep 19

QUrl in Qt 5: encoding

One of the classes I used to maintain in Qt was QUrl, and I had previously worked quite extensively with KUrl, so I knew what was wrong with the current API. During the Contributor Summit in Berlin, I volunteered to rewrite QUrl and to add some features KUrl needed.

I described the original goals of the change in the “QUrl in 5” thread on the qt5-feedback mailing list. After that, I began implementing them. This blog post is a status update. As I’m not done yet, things might change a bit before the code is finished and accepted into QtCore.

This is the first part and I’ll write more later.

QUrl in Qt 4

To understand the design decisions, I need to explain what a URL is. It is governed by a pair of specifications: RFC 3986 (Uniform Resource Identifier (URI): Generic Syntax) and RFC 3987 (Internationalized Resource Identifiers (IRIs)). The first is the base definition of a URI using 7-bit ASCII only, whereas the second specifies what you’re supposed to do when you find non-ASCII codepoints. Long story short, RFC 3986 says that a URI is a sequence of letters, digits, hyphens, periods, underscores and the tilde character, separated by some delimiters. Anything else must be represented by its percent-encoding. RFC 3987 says non-ASCII characters are allowed and are equivalent to their UTF-8 encoding in percent-encoded form.
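As a rough illustration of the percent-encoding rule just described, here is a minimal sketch in plain C++. It is not QUrl code: the helper names `hexDigit`, `isUnreserved` and `percentEncode` are made up for this example, and the input is assumed to be one component's data already in UTF-8.

```cpp
#include <cassert>
#include <string>

// Upper-case hexadecimal digit for a value in the range 0..15.
char hexDigit(unsigned value)
{
    assert(value < 16);
    return "0123456789ABCDEF"[value];
}

// Unreserved characters per RFC 3986 section 2.3: these never need encoding.
bool isUnreserved(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
        || (c >= '0' && c <= '9')
        || c == '-' || c == '.' || c == '_' || c == '~';
}

// Percent-encode every byte of a UTF-8 string that is not unreserved.
// Hypothetical helper: it treats its input as data inside one component,
// so delimiters such as '/' or '#' are encoded as well.
std::string percentEncode(const std::string &utf8)
{
    std::string out;
    for (unsigned char c : utf8) {
        if (isUnreserved(c)) {
            out += static_cast<char>(c);
        } else {
            out += '%';
            out += hexDigit(c >> 4);
            out += hexDigit(c & 0x0F);
        }
    }
    return out;
}
```

With this sketch, a space becomes `%20` and the two UTF-8 bytes of “é” become `%C3%A9`, which is the equivalence RFC 3987 relies on.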

Originally, QUrl in Qt 4.0 tried to implement those specifications: it parsed an 8-bit QByteArray by searching for the boundaries of the components of the URL, then decoded the percent-encoded forms into proper QStrings, applying the UTF-8-to-UTF-16 transformation in the process. This allowed you to get components like the path or the fragment fully decoded into QStrings, which are nicer for us humans to read. The parser follows the URI ABNF definition strictly, so we know that the parsing follows the specification to the letter.

However, back in Qt 4.4, it turned out that the second operation (the decoding) had a problem: it could not handle an encoded sequence like %80%81, which is not a valid UTF-8 sequence, but is valid in a URL. The solution I found back then was to add an “encoded” form of each of the components of the URL, which returned a QByteArray with proper percent encoding. Those forms are capable of handling anything you throw at them, restoring full compatibility. So QUrl keeps a pair of forms for each component of the URL: one in UTF-16 decoded form and one in 8-bit fully encoded form.
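To make the %80%81 case concrete, the following sketch decodes percent-encoded triplets into raw bytes and then checks whether the result is valid UTF-8. It is only an illustration in plain C++: `hexValue`, `percentDecode` and `isValidUtf8` are names invented here, not Qt functions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Value of a hexadecimal digit, or -1 if c is not one.
int hexValue(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

// Decode every well-formed %XX triplet into the raw byte it stands for.
std::string percentDecode(const std::string &in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '%' && i + 2 < in.size()
            && hexValue(in[i + 1]) >= 0 && hexValue(in[i + 2]) >= 0) {
            out += static_cast<char>(hexValue(in[i + 1]) * 16 + hexValue(in[i + 2]));
            i += 2;
        } else {
            out += in[i];
        }
    }
    return out;
}

// Strict UTF-8 check: rejects stray continuation bytes, truncated sequences,
// overlong forms, surrogates and values above U+10FFFF.
bool isValidUtf8(const std::string &bytes)
{
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t extra;
        long cp, min;
        if (lead < 0x80)                { extra = 0; cp = lead;        min = 0; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; min = 0x10000; }
        else return false;              // this byte cannot start a sequence
        assert(extra <= 3);
        if (i + extra >= bytes.size()) return false;   // truncated sequence
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        i += extra + 1;
    }
    return true;
}
```

Here `isValidUtf8(percentDecode("%80%81"))` is false, because 0x80 is a continuation byte with no lead byte before it, while `%C3%A9` decodes to a valid two-byte sequence. A decoder that must produce a UTF-16 string has nowhere to put the first kind of input, which is why an encoded form was needed.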

It also has a problem I did not identify until much later: the QUrl constructor and the QUrl::setUrl() function, which take a fully-decoded QString URL, cannot handle the presence of some special characters in non-delimiter roles (e.g., the ‘#’ character used in a path name or as part of a query). In order for them to be interpreted as non-special, they need to appear percent-encoded, but these functions expect fully decoded forms! Worse, the QUrl::toString() function blindly decodes all the percent-encoded forms it might have and, in the process, may lose the distinction between a delimiter and a regular character.
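The loss of the delimiter distinction can be shown with a small sketch, again in plain C++ rather than with QUrl itself. `decodeAll` and `splitFragment` are hypothetical helpers: the first decodes every triplet blindly, the second splits a URL at the first ‘#’.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>

// Blindly decode every %XX triplet, the way a fully-decoding output would.
std::string decodeAll(const std::string &in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '%' && i + 2 < in.size()
            && std::isxdigit(static_cast<unsigned char>(in[i + 1]))
            && std::isxdigit(static_cast<unsigned char>(in[i + 2]))) {
            const int byte = std::stoi(in.substr(i + 1, 2), nullptr, 16);
            assert(byte >= 0 && byte <= 0xFF);
            out += static_cast<char>(byte);
            i += 2;
        } else {
            out += in[i];
        }
    }
    return out;
}

// Split at the first '#', the fragment delimiter; returns {rest, fragment}.
std::pair<std::string, std::string> splitFragment(const std::string &url)
{
    const std::size_t hash = url.find('#');
    if (hash == std::string::npos)
        return {url, std::string()};
    return {url.substr(0, hash), url.substr(hash + 1)};
}
```

The encoded path `/docs/c%23/intro` has no fragment. After `decodeAll` it reads `/docs/c#/intro`, and parsing that string again yields the path `/docs/c` and the fragment `/intro`: the round trip silently changed the URL.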

To avoid those problems, it is recommended in Qt 4 that you use the “encoded” functions, especially QUrl::fromEncoded() and QUrl::toEncoded(). Unfortunately, the output of such functions is not always human-friendly.

Solution

All of those problems had one common cause: using UTF-16 QStrings in the internal representation and in the API, but using an 8-bit strict parser. As I said before, the RFCs say that URIs are a sequence of arbitrary 8-bit characters, some of which may appear in decoded form. The specification does not contemplate a UTF-16 implementation or give hints on how to write one. You can take the easy way out, like Java does, and ignore any encoding and decoding. But note how java.net.URL refers to the old URI specification and doesn’t claim to support IRIs. For QUrl, we want to do the right thing.

The centrepieces of QUrl in Qt 5 are a rewritten parser, operating on UTF-16, and what I’ve called the “recoder” (you can also call it a “transcoder”). The principle of both is that QUrl should operate exclusively on QString (UTF-16) data, without having to convert to and from 8-bit, especially for parsing. That is matched in the API by removing all the QByteArray methods: everything is now using QString.

To continue supporting arbitrary sequences that are not valid UTF-8 sequences, the QUrl API now only supports the “encoded” mode. In Qt 4, we had encoded in QByteArray and decoded in QString; in Qt 5, it’s encoded in QString. To avoid the problem of ugly encoded sequences, I’ve introduced a new enum allowing you to specify what you want to see decoded:

    enum ComponentFormattingOption {
        FullyEncoded = 0x0000,
        DecodeSpaces = 0x1000,
        DecodeUnambiguousDelimiters = 0x2000,
        DecodeAllDelimiters = DecodeUnambiguousDelimiters | 0x4000,
        DecodeUnicode = 0x8000,

        PrettyDecoded = DecodeSpaces | DecodeUnambiguousDelimiters
                        | DecodeUnicode,
        MostDecoded = PrettyDecoded | DecodeAllDelimiters
    };
    Q_DECLARE_FLAGS(ComponentFormattingOptions, ComponentFormattingOption)

The FullyEncoded value is equivalent to the Qt 4 QUrl “encoded” methods, strictly following the RFC 3986 specification. If you add the DecodeUnicode option, then it will follow RFC 3987 and decode all the percent-encoded sequences that form valid UTF-8 characters into their QString equivalents. The decoding of delimiters will cause the result to be non-compliant, but won’t lose information, ever. The default value for all functions is PrettyDecoded.

Since the inputs and outputs of QUrl are always encoded, the function that transforms them according to the flags above doesn’t encode or decode: it simply “recodes”. That’s where the name comes from. I then spent a day writing unit tests to ensure it works as I want it to.
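The recoding idea can be sketched as follows. This is not the actual QUrl recoder: it works on UTF-8 `std::string` instead of UTF-16 QString, it handles only two of the flags (their values are copied from the enum above), and `hexValue`, `tripletAt`, `utf8RunAt` and `recode` are names invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Flag values copied from the enum above; only these are handled here.
enum ComponentFormattingOption : unsigned {
    FullyEncoded  = 0x0000,
    DecodeSpaces  = 0x1000,
    DecodeUnicode = 0x8000
};

// Value of a hexadecimal digit, or -1 if c is not one.
static int hexValue(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

// Byte encoded by the %XX triplet starting at pos, or -1 if there is none.
static int tripletAt(const std::string &s, std::size_t pos)
{
    if (pos + 2 >= s.size() || s[pos] != '%')
        return -1;
    const int hi = hexValue(s[pos + 1]);
    const int lo = hexValue(s[pos + 2]);
    return (hi < 0 || lo < 0) ? -1 : hi * 16 + lo;
}

// Number of consecutive triplets at pos that form one valid multi-byte UTF-8
// character (2 to 4), or 0 if they do not. On success the raw bytes are
// appended to out.
static std::size_t utf8RunAt(const std::string &s, std::size_t pos, std::string &out)
{
    const int lead = tripletAt(s, pos);
    std::size_t len;
    long cp, min;
    if (lead < 0) return 0;
    if ((lead & 0xE0) == 0xC0)      { len = 2; cp = lead & 0x1F; min = 0x80; }
    else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; min = 0x800; }
    else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; min = 0x10000; }
    else return 0;                  // ASCII, stray continuation or bad lead byte
    std::string bytes(1, static_cast<char>(lead));
    for (std::size_t k = 1; k < len; ++k) {
        const int c = tripletAt(s, pos + 3 * k);
        if (c < 0 || (c & 0xC0) != 0x80) return 0;
        cp = (cp << 6) | (c & 0x3F);
        bytes += static_cast<char>(c);
    }
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
    out += bytes;
    return len;
}

// Recode an already-encoded component: input and output are both "encoded"
// strings, only the amount of percent-encoding left in place changes.
std::string recode(const std::string &encoded, unsigned options)
{
    std::string out;
    std::size_t i = 0;
    while (i < encoded.size()) {
        const int byte = tripletAt(encoded, i);
        if (byte == 0x20 && (options & DecodeSpaces)) {
            out += ' ';
            i += 3;
            continue;
        }
        if (byte >= 0x80 && (options & DecodeUnicode)) {
            const std::size_t n = utf8RunAt(encoded, i, out);
            if (n > 0) {
                assert(n >= 2 && n <= 4);
                i += 3 * n;
                continue;
            }
        }
        // Everything else, including sequences like %80%81, is copied as is.
        if (byte >= 0) { out.append(encoded, i, 3); i += 3; }
        else           { out += encoded[i++]; }
    }
    return out;
}
```

Calling `recode("a%20b%C3%A9%80%81%23", DecodeSpaces | DecodeUnicode)` in this sketch decodes the space and the “é” but leaves `%80%81` and `%23` untouched, so the output can always be fed back in without losing information.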

The new parser will work directly on QStrings, without converting them to 8-bit like the parser in Qt 4 requires. Moreover, given the new possibilities for encoding, the parser needs to be rewritten to be less strict, without sacrificing compatibility. Specifically, the new parser will accept certain characters in decoded form where RFC 3986 would require them to be encoded. With that, I expect to gain quite considerably in performance.

So the new process of parsing a QString will be to find the positions of the components of the URL, then recode straight from the QString into the internal representation (which should be the PrettyDecoded representation to avoid recoding in the common case). The parsing itself should be fast enough that I can drop the lazy-parsing functionality of QUrl (which is currently horribly broken in Qt 4).
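The first step, finding the positions of the components directly in UTF-16 data, might look roughly like the sketch below. It uses `std::u16string` as a stand-in for QString and follows only the generic scheme://authority/path?query#fragment layout from RFC 3986. The `UrlParts` struct and `splitUrl` function are assumptions made for this example, not the parser that will end up in QtCore.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Components located in a URL; each member holds the still-encoded text.
struct UrlParts {
    std::u16string scheme, authority, path, query, fragment;
};

// Locate component boundaries directly on UTF-16 data, with no conversion
// to 8-bit and no decoding.
UrlParts splitUrl(const std::u16string &url)
{
    const std::size_t npos = std::u16string::npos;
    const std::size_t end = url.size();
    UrlParts parts;
    std::size_t pos = 0;

    // The scheme ends at the first ':' that precedes any '/', '?' or '#'.
    const std::size_t colon = url.find_first_of(u":/?#");
    if (colon != npos && colon > 0 && url[colon] == u':') {
        parts.scheme = url.substr(0, colon);
        pos = colon + 1;
    }

    // The authority follows "//" and runs up to the next '/', '?' or '#'.
    if (url.compare(pos, 2, u"//") == 0) {
        std::size_t stop = url.find_first_of(u"/?#", pos + 2);
        if (stop == npos)
            stop = end;
        parts.authority = url.substr(pos + 2, stop - (pos + 2));
        pos = stop;
    }

    // The fragment starts at the first '#'; the query at a '?' before it.
    std::size_t hash = url.find(u'#', pos);
    if (hash == npos)
        hash = end;
    else
        parts.fragment = url.substr(hash + 1);

    std::size_t question = url.find(u'?', pos);
    if (question == npos || question > hash)
        question = hash;
    else
        parts.query = url.substr(question + 1, hash - question - 1);

    // Whatever lies between the authority and the query is the path.
    assert(pos <= question && question <= hash);
    parts.path = url.substr(pos, question - pos);
    return parts;
}
```

Each component would then be handed to the recoder separately, which is what lets an encoded `%23` inside the path stay distinct from the ‘#’ that introduces the fragment.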

In the next blog post, I’ll go over the front-end API.

Tags: qt, qt5, unicode, url

10 comments

TheBlackCat
September 19, 2011 at 19:13 UTC (UTC 0)
Nice to see some work going into this. Will this fix ?
Kenneth Christiansen
September 19, 2011 at 19:30 UTC (UTC 0)
Did you ever look into all the fuss in WebKit about its KURL implementation and the Google URL implementation? It popped up at one WebKit Summit and Adam Barth did some work on getting one common implementation. Maybe it would be good if you could talk to Adam (abarth on #webkit) to figure out what the issues were.
Thiago Macieira
September 19, 2011 at 20:21 UTC (UTC 0)
Fix what?
Thiago Macieira
September 19, 2011 at 21:15 UTC (UTC 0)
@Kenneth: I’ve talked to him now and his main concern was the fact that the specifications do not cover many cases found in the wild that we need to deal with anyway. Some of what he told me isn’t applicable to QUrl, but instead lies in the WebKit layer. But I came out with a lot of new testcases, some of which I’m sure my code doesn’t pass without even having to try.
TheBlackCat
September 20, 2011 at 07:53 UTC (UTC 0)
I screwed up the html I think. What I meant to ask was:
“Nice to see some work going into this. Will this fix KDE bug #165044”
With a link to https://bugs.kde.org/show_bug.cgi?id=165044
Thiago Macieira
September 20, 2011 at 17:52 UTC (UTC 0)
@TheBlackCat: no, no one has the intention of ever fixing that in Qt. Broken filename encodings will be forever considered filesystem corruption.
TheBlackCat
September 21, 2011 at 11:04 UTC (UTC 0)
So there is no possible way to make it so people can at least rename or delete the files? There seems to already be patches available in the bug report to allow exactly that, can the approach used in the patches not be integrated with what you are doing?
Thiago Macieira
September 21, 2011 at 22:02 UTC (UTC 0)
The file name cannot be represented properly. No, please consider them as filesystem corruption and it will be better for everyone: applications generating them are buggy and need fixing, their presence causes undefined behaviour in other applications; the fixing of these problems does not need to be done in general applications.
TheBlackCat
September 22, 2011 at 06:29 UTC (UTC 0)
First, this isn’t about displaying them properly, no one is asking for that as far as I can tell. They are currently displayed with a ? symbol and that seems fine. What I asked about was being able to rename or delete them. For example, show the same ? symbol in the file name editor so the broken character can be deleted and replaced. kate and kwrite, for instance, do this just fine with characters they cannot recognize because of encoding issues, you can even do find/replace to batch change them. For instance, if a character with broken encoding is found, have it stored internally as a special character or byte sequence that tells QUrl it shouldn’t try to handle it and should just treat it as that ? symbol both in display and in text entry fields.
I understand from a technical standpoint (although I don’t agree it is filesystem corruption, that has a very specific meaning, this is an issue with how Qt handles specific sorts of text). I don’t agree, however, from a user standpoint. Most users won’t understand the technical side, they will just see an error message that says clearly-visible files, files that apparently can be handled fine by other desktop environments, don’t exist, and they will think (appropriately in my opinion) that the behavior of Qt-based applications is broken.
Most users aren’t going to know they need to open a terminal or another file browser like nautilus to fix the problem, they are just going to see a cryptic and apparently nonsensical error message, think something is very wrong with the file manager, and go back to whatever they were using before that, in their opinion, worked just fine.
I wish KDE and Qt were in a position to force proprietary software developers to fix such problems, but unfortunately that is not the case yet. Users need to be able to deal with files produced by programs that KDE and Qt have zero control over, whose developers have probably never even heard of KDE or Qt. If users can’t handle the files they need to handle, they are going to abandon the KDE or Qt software, not the other software, since the KDE or Qt software will seem like the one that is broken. Users need to be able to handle files given to them by users of proprietary software. They won’t be able to tell their CEO “your software is broken, please use something that follows international standards on file text encoding properly”, that will get him or her fired.
Whatever the case, it is your decision (although I disagree with it and will continue to do so). But if you aren’t going to accept patches to fix the problem, you should probably tell people on that bug report to stop wasting their time. Several of them seem to be working really hard to provide patches to fix the problem and have been for a while, and it sounds like a working solution is almost finished, so it is probably better to tell them now that you won’t accept any patches to fix this.
Thiago Macieira
September 28, 2011 at 13:05 UTC (UTC 0)
@TheBlackCat: QUrl can handle broken encoding just fine. It’s QString that can’t since QString is UTF-16. And all the filesystem API is done on top of QString, not QUrl (QFile, QDir, etc.). In order to store those broken encodings, we need to either switch the entire filesystem API to QUrl or we need to change QString to support storing brokenly-encoded data. The latter is a flagrant violation of the UTF-8 and UTF-16 specs. We had that in Qt from 3.2 to 4.4 and we thought that people had had enough time to switch completely to UTF-8 in those 6 years.
From the user’s point of view, that’s exactly how it should behave and that’s exactly why I call it filesystem corruption. If you have a literally corrupted filesystem, behaviour is unpredictable. Files may not be removable at all, with any command. So Dolphin or other file managers are not required to cope. You simply run a filesystem recovery program as administrator and it should fix things for you.
If you have a novel solution that allows this to be fixed, send us your patches. But I doubt there’s anything you can do to fix this which doesn’t cause more trouble. Violating UTF-8 will not be allowed, for example. If you don’t have a novel solution for this problem, then yes, stop wasting your time and ours because there’s nothing we can or will do.

Comments have been disabled.


Thiago Macieira's blog

An Open Source hacker's ramblings
