I’ve talked about source code encoding in the past, arguing that the C++ language lacks a fundamental setting. As of this Monday, however, Qt 5 starts to enforce that source code be UTF-8. In a way.
The commit that landed on the qtbase repository finally changed the codec used by QString’s 8-bit methods to be UTF-8. That concludes a long series of changes that we had planned for Qt 5, that started with Robin Burchell’s work on removing the QTextCodec::setCodecForCStrings function. But to be clear: QString still stores data internally as UTF-16 and that won’t change.
To understand what the change is, we need to go back a little in history. Four years ago, I wrote a blog post called “String Theory” that presented QString’s history, and in it I said:
what encoding is your file? Even today, with the widespread use of UTF-8, we can’t rely on that fact (text editors in Windows being the worst example).
In 2008, we were still struggling with UTF-8 encoding in source code, and we definitely were in 2003 when QTextCodec::setCodecForCStrings came about in Qt 3. The reason is that, back then, text editors usually saved code only in the operating system’s locale encoding and very seldom supported writing anything else. Unicode wasn’t widespread enough, so people ended up with a variety of different encodings. That wasn’t a problem, provided that the data exchange only happened with people who used the same encoding — usually people in the same country, using the same operating system.
Times have changed. The protocols from the late 90s that did not possess an encoding marker quickly became obsolete or gained such a tag (I remember when the Kopete developers were struggling to decode ICQ messages properly, and Russian users often ended up with mojibake). Protocols designed in the 2000s all had such a tag, and soon began to standardise on one of the Unicode transforms.
Last year, when revisiting the subject, I wrote:
this is 2011, why are we still restricting ourselves to ASCII? I mean, even if you’re just writing your non-translated messages in English, you sometimes need some non-ASCII codepoints, like the “micro” sign (µ), the degree sign (°), the copyright sign (©) or even the Euro currency sign (€). [...] Besides, this is 2011, the de-facto encoding for text interchange is UTF-8.
The next line of the blog was the decision: we would change the default codec of QString’s 8-bit functions from Latin 1 to UTF-8 in Qt 5 (note that we didn’t start thinking about Qt 5 until about 15 days later). That’s what the commit I made this Monday finally accomplishes.
What does this mean to you? Well, the first thing is that it depends on whether you use these methods or not. If you compile your source code with the QT_NO_CAST_FROM_ASCII and QT_NO_CAST_TO_ASCII macros, you will feel absolutely no difference. And I really mean none, zero, zilch: if you use those macros, you’ve disabled all of the functions affected by my change.
If you do use the functions that are disabled by those macros, then the question is what encoding is used in those strings. My assumption in 2008 is still valid today: most of the strings found in source code are 7-bit, US-ASCII, English text. The 7-bit text will not be affected at all: it will get converted to QString’s UTF-16 internal encoding just like it used to. There might be a slight performance impact, but I do plan on optimising the UTF-8 decoder like I said last year. However, if you can, I recommend wrapping such strings with QLatin1String, especially if you’re using them with a QString function that has a QLatin1String overload.
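If you want to check whether a given literal falls into the unaffected 7-bit category, a compile-time test is one way to do it. The following is only a sketch in plain C++11: isSevenBit is a hypothetical helper I made up for illustration, not something Qt provides.

```cpp
// Hypothetical helper, not part of Qt: returns true if every byte of a
// NUL-terminated string is below 0x80, i.e. the string is plain US-ASCII.
// Written recursively so that it is a valid C++11 constexpr function.
constexpr bool isSevenBit(const char *s)
{
    return *s == '\0'
        ? true
        : (static_cast<unsigned char>(*s) < 0x80 && isSevenBit(s + 1));
}

// Plain English text: decodes identically as Latin 1 and as UTF-8.
static_assert(isSevenBit("Hello, world"), "expected 7-bit text");

// "café" with the é stored as the single Latin 1 byte 0xE9: high bit set.
static_assert(!isSevenBit("caf\xE9"), "expected high-bit text");
```

Strings that pass this check behave the same before and after the change. Strings that fail it are the ones discussed next.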
On the other hand, if you do have text with the high bit set in the QString 8-bit functions, you might need to change your code. You’ll either have to recode your source code to UTF-8, or you will need to wrap those strings with a suitable QLatin1String or QTextCodec::toUnicode call. I highly recommend choosing the former option: use UTF-8 in your source code. You’ll also gain the ability to use QStringLiteral properly, which requires UTF-8 source code anyway.
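To see why high-bit text matters, it helps to compare what the two codecs do with the same bytes. The sketch below is plain standard C++, not Qt’s actual decoder: decodeLatin1 and decodeUtf8 are hypothetical names, and the UTF-8 decoder is deliberately simplified, replacing each malformed sequence with a single U+FFFD.

```cpp
#include <cstddef>
#include <string>

// Latin 1: each byte maps to the code point with the same value.
std::u16string decodeLatin1(const std::string &bytes)
{
    std::u16string out;
    for (unsigned char c : bytes)
        out.push_back(static_cast<char16_t>(c));
    return out;
}

// Simplified UTF-8 decoder (illustration only, not Qt's implementation).
// Malformed input is replaced with U+FFFD.
std::u16string decodeUtf8(const std::string &bytes)
{
    // Smallest code point allowed for each sequence length (rejects overlongs).
    static const char32_t minimum[] = { 0x0, 0x80, 0x800, 0x10000 };
    std::u16string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i++]);
        char32_t cp = 0;
        int extra = 0;
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); continue; }   // stray continuation byte

        bool ok = true;
        for (int k = 0; k < extra; ++k) {
            if (i >= bytes.size()
                || (static_cast<unsigned char>(bytes[i]) & 0xC0) != 0x80) {
                ok = false;                          // truncated or bad sequence
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i++]) & 0x3F);
        }
        if (!ok || cp < minimum[extra] || cp > 0x10FFFF
            || (cp >= 0xD800 && cp <= 0xDFFF)) {
            out.push_back(0xFFFD);
            continue;
        }
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {                                     // encode a surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

With these two functions, "caf\xE9" (é saved as Latin 1) decodes to “café” under Latin 1 but ends in U+FFFD under UTF-8, while "caf\xC3\xA9" (é saved as UTF-8) decodes to “café” under UTF-8 but to “cafÃ©” under Latin 1. Pure ASCII input gives the same result either way.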
[As an interesting twist of history, the seed that became QStringLiteral was in the second of my encoding blogs last year, after the part I quoted above asking for the change to UTF-8, but it landed in Qt 5 before the change of this Monday.]
For Qt’s own source code, we have decreed that the source should be UTF-8 only, and so I proceeded a few weeks ago to find and recode all non-UTF-8 sources. And I’m going even further than that: if you don’t use UTF-8 for your source code, you’ll be on your own. Though it’s possible to make it work, do not ask us for help and do not expect us to add convenience functions. I am also discarding any arguments of the form “my editor/IDE/OS/environment does not support UTF-8”. This is 2012 and we live in a global world, with global data. Any such editor or environment should be left where it belongs: in a museum dedicated to the 80s and 90s.
Long live Unicode!