May 11

Source code must be UTF-8 and QString wants it

I’ve talked about source code encoding in the past, arguing that the C++ language lacks a fundamental setting. However, as of this Monday, Qt 5 starts to enforce that source code must be UTF-8. In a way.

The commit that landed in the qtbase repository finally changed the codec used by QString’s 8-bit methods to UTF-8. That concludes a long series of changes that we had planned for Qt 5, which started with Robin Burchell’s work on removing the QTextCodec::setCodecForCStrings function. But to be clear: QString still stores data internally as UTF-16 and that won’t change.

To understand what the change is, we need to go back a little in history. Four years ago, I wrote a blog post called “String Theory” in which I presented QString’s history and said:

what encoding is your file? Even today, with the widespread use of UTF-8, we can’t rely on that fact (text editors in Windows being the worst example).

In 2008, we were still struggling with UTF-8 encoding in source code, and we definitely were in 2003 when QTextCodec::setCodecForCStrings came about in Qt 3. The reason is that, back then, text editors usually saved code only in the operating system’s locale encoding and very seldom supported writing anything else. Unicode wasn’t widespread enough, so people ended up with a variety of different encodings. That wasn’t a problem, provided that the data exchange only happened with people who used the same encoding — usually people in the same country, using the same operating system.

Times have changed. The protocols from the late 90s that did not possess an encoding marker quickly became obsolete or gained such a tag (I remember when the Kopete developers were struggling to decode ICQ messages properly, and Russian users often ended up with mojibake). Protocols designed in the 2000s all had such a tag, and soon began to standardise on one of the Unicode transforms.

Last year, when revisiting the subject, I wrote:

this is 2011, why are we still restricting ourselves to ASCII? I mean, even if you’re just writing your non-translated messages in English, you sometimes need some non-ASCII codepoints, like the “micro” sign (µ), the degree sign (°), the copyright sign (©) or even the Euro currency sign (€). [...] Besides, this is 2011, the de-facto encoding for text interchange is UTF-8.

The next line of that blog post was the decision: we would change the default codec of QString’s 8-bit functions from Latin 1 to UTF-8 in Qt 5 (note that we hadn’t started thinking about Qt 5 at that point; that came about 15 days later). That’s what the commit I made this Monday finally accomplishes.

What does this mean for you? Well, first of all, it depends on whether you use these methods or not. If you compile your source code with the QT_NO_CAST_FROM_ASCII and QT_NO_CAST_TO_ASCII macros, you will feel absolutely no difference. And I really mean none, zero, zilch: if you use those macros, you’ve disabled all of the functions affected by my change.

If you do use the functions that are disabled by those macros, then the question is which encoding is used in those strings. My assumption from 2008 is still valid today: most of the strings found in source code are 7-bit, US-ASCII, English text. The 7-bit text will not be affected at all: it will get converted to QString’s UTF-16 internal encoding just like it used to. There might be a slight performance impact, but I do plan on optimising the UTF-8 decoder like I said last year. However, if you can, I recommend wrapping such strings with QLatin1String, especially if you’re using them with a QString function that has a QLatin1String overload.
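As a rough way to tell whether a given literal falls into the unaffected 7-bit category, the check below scans a byte string for any byte with the high bit set. It is a minimal sketch in plain standard C++, not code from Qt; the function name isSevenBitAscii and the sample literals are made up for illustration.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not a Qt function): returns true when every byte is
// in the 7-bit US-ASCII range 0x00-0x7F. Such bytes mean the same thing in
// Latin 1 and in UTF-8, so a literal that passes is not affected.
bool isSevenBitAscii(const std::string &bytes)
{
    for (unsigned char c : bytes) {
        if (c > 0x7F)
            return false;   // high bit set: the encoding now matters
    }
    return true;
}

// Small self-check with illustrative literals.
void checkSevenBitExamples()
{
    assert(isSevenBitAscii("Hello, world"));
    // "5 µs" with the micro sign stored as the UTF-8 bytes C2 B5; the literal
    // is split so the trailing 's' cannot be read as part of the hex escape.
    assert(!isSevenBitAscii("5 \xC2\xB5" "s"));
}
```

A literal that fails this check is one of the strings with the high bit set, which is the case where the choice of codec matters.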

On the other hand, if you do have text with the high bit set in the QString 8-bit functions, you might need to change your code. You’ll either have to recode your source code to UTF-8, or you will need to wrap those strings with a suitable QLatin1String or QTextCodec::toUnicode call. I highly recommend choosing the former option: use UTF-8 in your source code. You’ll also gain the ability to use QStringLiteral properly, which requires UTF-8 source code anyway.
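To see why high-bit text is affected, it may help to compare how the same bytes turn into UTF-16 under each codec. The sketch below uses plain standard C++ rather than Qt; decodeLatin1 and decodeUtf8 are hypothetical names, and the UTF-8 decoder is deliberately simplified (it replaces malformed input with U+FFFD and does not reject overlong forms or encoded surrogates), so it should not be read as QString’s actual implementation or error handling.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Latin 1: every byte maps to the code point with the same value.
std::u16string decodeLatin1(const std::string &bytes)
{
    std::u16string out;
    for (unsigned char c : bytes)
        out.push_back(static_cast<char16_t>(c));
    return out;
}

// Simplified UTF-8 decoder (assumption: U+FFFD for anything malformed).
std::u16string decodeUtf8(const std::string &bytes)
{
    const char16_t replacement = 0xFFFD;
    std::u16string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;   // number of continuation bytes expected
        if (lead < 0x80) { cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(replacement); ++i; continue; }

        bool ok = i + extra < bytes.size();
        for (std::size_t k = 1; ok && k <= extra; ++k) {
            const unsigned char cont = static_cast<unsigned char>(bytes[i + k]);
            if ((cont & 0xC0) != 0x80)
                ok = false;
            else
                cp = (cp << 6) | (cont & 0x3F);
        }
        if (!ok || cp > 0x10FFFF) { out.push_back(replacement); ++i; continue; }
        i += extra + 1;

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {                 // encode as a UTF-16 surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}

void checkDecoders()
{
    const std::string microUtf8 = "\xC2\xB5";     // µ saved as UTF-8
    const std::string microLatin1 = "\xB5";       // µ saved as Latin 1
    assert(decodeUtf8(microUtf8) == std::u16string(1, char16_t(0x00B5)));
    assert(decodeLatin1(microUtf8).size() == 2);  // mojibake: U+00C2 U+00B5
    assert(decodeUtf8(microLatin1) == std::u16string(1, char16_t(0xFFFD)));
    assert(decodeUtf8("abc") == decodeLatin1("abc"));
}
```

In this sketch, a micro sign saved as UTF-8 reads correctly only through the UTF-8 decoder, while the Latin 1 byte B5 is not valid UTF-8 on its own. That mismatch is why Latin 1 sources need either recoding or an explicit wrapper once the default codec is UTF-8.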

[As an interesting twist of history, the seed that became QStringLiteral was in the second of my encoding blogs last year, after the part I quoted above asking for the change to UTF-8, but it landed in Qt 5 before the change of this Monday.]

For Qt’s own source code, we have decreed that the source should be UTF-8 only, and so I proceeded a few weeks ago to find and recode all non-UTF-8 sources. And I’m going even further than that: if you don’t use UTF-8 for your source code, you’ll be on your own. Though it’s possible to make it work, do not ask us for help and do not expect us to add convenience functions. I am also discarding any arguments of the form “my editor/IDE/OS/environment does not support UTF-8”. This is 2012 and we live in a global world, with global data. Any such editor or environment should be left where it belongs: in a museum dedicated to the 80s and 90s.

Long live Unicode!

Tags: qt, qt5, unicode, utf-8

7 comments

Kevin Kofler
May 11, 2012 at 09:06 UTC (UTC 0)
Can we also have a QUtf8String for the sources which want to use QT_NO_CAST_FROM_ASCII and still have the literals be in UTF-8?
jen
May 11, 2012 at 10:17 UTC (UTC 0)
“QString still stores data internally as UTF-16 and that won’t change.”
Why not?
Olivier Goffart
May 11, 2012 at 17:23 UTC (UTC 0)
@Kevin: You can use the new QStringLiteral
@jen: Because one of the goals of Qt 5 is to keep as much source compatibility as possible, and there is probably a lot of code out there which works with QChar and assumes UTF-16.
Ralf
May 11, 2012 at 17:26 UTC (UTC 0)
Great to see that UTF-8 will finally be the default. I still remember I had to dig through some docs back with Qt 4.2 to get UTF-8 source code working the way I wanted.
However, making toAscii and fromAscii operate on UTF-8 is IMHO totally confusing. The function was obviously already mis-named in Qt4 since it actually converted to/from the local 8 bit encoding. The only sane thing (as in, what you’d expect if you just read the function name) this function can do is to actually return the string in ASCII encoding, and complain if any character is beyond 128. Qt is overall great at resulting in code that you can just read, and you know what it does. Please keep it that way.
I already asked this at http://www.kdab.com/last-week-in-qt-development/ but then I was told that this was just a temporary change – does that still apply?
Thiago Macieira
May 11, 2012 at 20:14 UTC (UTC 0)
@Kevin: we could, but we could question the motive now. Why can’t you use the QString constructor?
The biggest benefit of QLatin1String is that it has extra overloads in QString, for things like startsWith, endsWith, contains, indexOf, etc. And we can only do them because the Latin1-to-UTF16 comparison is straightforward. If we added a QUtf8String without those overloads, it would gain you no benefit. And those overloads would be extremely hard to write.
Thiago Macieira
May 11, 2012 at 20:16 UTC (UTC 0)
@Ralf: yes, it’s still temporary. I have a pending commit which deprecates those and changes them back to fromLatin1/toLatin1. Hey, today is Friday, I guess I can stage it now.
Ralf
May 18, 2012 at 08:22 UTC (UTC 0)
That’s good to hear, thanks