Jul 17

QString improved

On my birthday, I blogged about how I’d like QString to support proper UTF-8 strings and be much easier to use. The code I said I would prefer was:

    QString s = u"Résumé"q;

Recently, in Qt 5.0, we have begun taking steps towards that. Most of the work was done by Lars Knoll, but since he is on vacation right now, I’ll take the opportunity to explain it all. (Also, many thanks to Olivier Goffart for helping with the reviewing.)

Analysing Qt 4

One of the things we most wanted to do in Qt 5.0 was to make QStrings storable in read-only memory. As I said in my post back in March, we want to ask the compiler to write the UTF-16 strings for us in the .rodata section of the binary. Right now, in Qt 4.x, whenever you write code such as:

    QString s = QLatin1String("Hello, World");

The compiler will emit a standard, 8-bit C string in the read-only section of the binary, then make a call to the QString constructor to convert that to UTF-16. For the string above, which contains 12 characters plus the terminating NUL, we store 13 bytes in .rodata and we must allocate 26 bytes plus the QString::Data overhead on the heap to create our string.
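The sizes involved can be checked at compile time. This is only a sketch, assuming a C++11 compiler on which char16_t is two bytes wide (the standard only promises at least 16 bits); the names narrow, wide and elementCount are made up for illustration:

    ```cpp
    #include <cstddef>

    // The 8-bit literal that a Qt 4 build stores in .rodata.
    constexpr char narrow[] = "Hello, World";
    // The same text as UTF-16, which is the encoding QString holds internally.
    constexpr char16_t wide[] = u"Hello, World";

    // Number of array elements, counting the terminating NUL.
    template <typename T, std::size_t N>
    constexpr std::size_t elementCount(const T (&)[N]) { return N; }

    static_assert(elementCount(narrow) == elementCount(wide),
                  "both literals hold the same number of code units");
    static_assert(sizeof(wide) == sizeof(char16_t) * sizeof(narrow),
                  "the UTF-16 copy is sizeof(char16_t) times as large");
    ```

The heap allocation made by the conversion has to hold at least the second array, on top of the bookkeeping in QString::Data.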

As fast as malloc() can be, it’s still a non-negligible cost. For example, you should always avoid it when doing benchmarks, since its runtime can vary a lot and skew your results. Not to mention, of course, that dynamic memory (the heap) cannot be shared among applications.

If we look at QString::Data in Qt 4, we see:

    struct Data {
        QBasicAtomicInt ref;
        int alloc, size;
        ushort *data; // QT5: put that after the bit field to fill alignment gap; don't use sizeof any more then
        ushort clean : 1;
        ushort simpletext : 1;
        ushort righttoleft : 1;
        ushort asciiCache : 1;
        ushort capacity : 1;
        ushort reserved : 11;
        // ### Qt5: try to ensure that "array" is aligned to 16 bytes on both 32- and 64-bit
        ushort array[1];
    };

The way QString works is that the data pointer is initialised to point to the beginning of the actual UTF-16 data. On a normal QString, this pointer points to the first element of the array member, whereas it points elsewhere in the case of a QString created using QString::fromRawData.

The flags in the 16-bit bitfield were set by a couple of functions and are technically a legacy of the Qt 3 days.

Read-only QStringData

In order to let QString data be saved in read-only memory, a couple of modifications were required. The first thing that needed changing was the reference counting: QString increased and decreased the counter all the time. If we want to save the QStringData object in read-only memory, we must not try to increment or decrement the reference counter, so we chose the value -1 to indicate a constant QStringData. Whenever this value is seen, the new code will avoid the atomic operations.
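The idea can be sketched as follows. This is not Qt's actual reference-counting class: RefCount and refCountExample are hypothetical names, and std::atomic stands in for QBasicAtomicInt.

    ```cpp
    #include <atomic>
    #include <cassert>

    // Hypothetical stand-in for the reference counter; -1 marks static data.
    struct RefCount {
        std::atomic<int> value;

        explicit RefCount(int initial) : value(initial) {}

        void ref() {
            if (value.load(std::memory_order_relaxed) == -1)
                return;               // static data: never written to
            value.fetch_add(1, std::memory_order_relaxed);
        }

        // Returns false when the last reference is gone and the data must be freed.
        bool deref() {
            if (value.load(std::memory_order_relaxed) == -1)
                return true;          // static data: never freed
            return value.fetch_sub(1, std::memory_order_acq_rel) != 1;
        }
    };

    // Usage: a static count is left untouched, a normal one moves up and down.
    inline void refCountExample() {
        RefCount fixed(-1);
        fixed.ref();
        assert(fixed.deref());
        assert(fixed.value.load() == -1);

        RefCount shared(1);
        shared.ref();                 // two owners
        assert(shared.deref());       // one owner left
        assert(!shared.deref());      // last owner gone
    }
    ```

The early return is what keeps the code from ever writing to an object that lives in read-only memory.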

The next thing that needed changing was the pointer. First of all, it’s not possible to initialise the pointer to the value of another member in the object. Second, even if it were possible, having a pointer means the linker cannot place a value there in Position-Independent Code. The loader would need to do that and then the object would be stored in a read-write section because of the relocations.

The solution we found was to replace the pointer with an offset (using qptrdiff), pointing to how far after the beginning of the array the actual data is located. When this member is zero, it means the data is stored in the array in the QStringData, which is how we initialise it.
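A minimal sketch of such a layout, assuming the offset is counted in UTF-16 code units from the start of the array (the real QStringData may count differently). StaticStringData and hi are hypothetical names:

    ```cpp
    #include <cstddef>
    #include <type_traits>

    // Hypothetical layout, loosely modelled on the description above.
    template <int N>
    struct StaticStringData {
        int ref;                // -1 marks data that is never reference-counted
        int size;               // length in code units, without the NUL
        std::ptrdiff_t offset;  // distance from `array` to the text (assumed unit)
        char16_t array[N + 1];

        constexpr const char16_t *data() const { return array + offset; }
    };

    // No member stores an address, so every initialiser is a plain constant and
    // the object needs no relocation when the binary is loaded.
    static_assert(std::is_standard_layout<StaticStringData<2>>::value,
                  "plain aggregate, eligible for static initialisation");

    constexpr StaticStringData<2> hi = { -1, 2, 0, u"Hi" };

    static_assert(hi.data() == hi.array, "offset 0: the text lives in array");
    ```

With a pointer member instead of the offset, the initialiser would have to contain the object's own address, which is exactly what the linker cannot provide in position-independent code.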

Then we were only left with the problem of getting UTF-16 data generated by the compiler. With C++0x (C++11), as I pointed out in my post back in March, it’s easy. The alternative we found, for compilers without C++0x support, is on Windows: there, the wchar_t type is 2 bytes wide and wide string literals are encoded as UTF-16. By the way, it’s possible to get this behaviour with GCC on other platforms using the -fshort-wchar option.
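The choice between the two sources of UTF-16 data could be made in the preprocessor. The following is only a sketch: MY_UTF16_LITERAL, my_utf16_char and greeting are made-up names, not the macros Qt uses.

    ```cpp
    #include <wchar.h>  // WCHAR_MAX

    #if __cplusplus >= 201103L
    // C++11: the u prefix makes the compiler emit char16_t code units.
    #  define MY_UTF16_LITERAL(str) u"" str
    typedef char16_t my_utf16_char;
    #elif WCHAR_MAX <= 0xFFFF
    // 16-bit wchar_t (Windows, or GCC with -fshort-wchar): use wide literals.
    #  define MY_UTF16_LITERAL(str) L"" str
    typedef wchar_t my_utf16_char;
    #else
    #  error "this compiler cannot produce UTF-16 string literals"
    #endif

    // The data below is written by the compiler, with no run-time conversion.
    static const my_utf16_char greeting[] = MY_UTF16_LITERAL("Hello, World");
    ```

The empty prefixed literal relies on adjacent string literals being concatenated, with the prefix applying to the whole result.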

All of this, Lars implemented in commit ee85e9cc10bc6874c892b09fa54b5dbd79854069 (Gitorious won’t display it, it’s too large). He added a new macro called QStringLiteral which can be used as:

    QString s = QStringLiteral("Hello, World\n");

Producing a non-temporary

Lars’s implementation worked fine in the compilers he tested: GCC 4.4 and 4.5. However, when I tested with GCC 4.6, I started getting crashes. After analysing the assembly output, it turned out that the compiler initialised the QStringData object on the stack. If you had a function like the following:

    QString foo()
    {
        return QStringLiteral("Hello, World\n");
    }

and compiled it in -O3 mode, GCC 4.6 even skipped all the initialisation since it figured it was dead code. It simply set the d-pointer in the returned QString to an address on the stack.

In order to produce a non-temporary, I needed to figure out a way to create a static variable inside an expression. My solution was to use the GCC Statement Expressions extension. It works fine for code inside a function, but not outside. And, of course, it doesn’t work on other compilers.

This was implemented in commit 571785b31d21715857228b00f96cd24601b28c8c.

Olivier then had an inspired suggestion: to use C++0x lambdas. They work very similarly to GCC’s statement expressions in that they allow us to have code (including static variables) in what is otherwise an expression. I implemented that in commit cd80fcb5d6db9d99684b94a90d2c798b712442c4.
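The lambda trick can be sketched like this. It is a simplified illustration rather than the macro from the commit: LiteralData, MY_LITERAL_SIZE, MY_STRING_LITERAL and greeting are all hypothetical names.

    ```cpp
    #include <cassert>

    // Hypothetical, much simpler than QStringData.
    template <int N>
    struct LiteralData {
        int ref;
        int size;
        char16_t text[N + 1];
    };

    // Length of the literal in UTF-16 code units, without the terminating NUL.
    #define MY_LITERAL_SIZE(str) (int(sizeof(u"" str) / sizeof(char16_t)) - 1)

    // The lambda body gives us a place for a static variable in the middle of
    // an expression; calling the lambda right away yields its address.
    #define MY_STRING_LITERAL(str) \
        ([]() -> const LiteralData<MY_LITERAL_SIZE(str)> * { \
            static const LiteralData<MY_LITERAL_SIZE(str)> data = \
                { -1, MY_LITERAL_SIZE(str), u"" str }; \
            return &data; \
        }())

    // Safe to return: the pointer refers to static storage, not to a temporary.
    inline const char16_t *greeting() {
        return MY_STRING_LITERAL("Hello, World\n")->text;
    }

    // Usage: every call hands back the address of the same static object.
    inline void literalExample() {
        assert(greeting() == greeting());
    }
    ```

Each expansion of the macro is a distinct lambda with its own static object, so two literals never share storage by accident.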

Current status

The new macro QStringLiteral is present in Qt 5.0 and can be used almost anywhere a QLatin1String is currently used. It also works in all compilers, albeit not in the same way. If your compiler supports C++0x lambdas or statement expressions, and it supports one way of producing UTF-16 strings (C++0x’s Unicode strings or UTF-16 wide chars), then this macro will produce read-only, sharable data. There’s no creation cost at all for using this: the code generated is only an assignment and an integer comparison (which always has the same result). If your compiler doesn’t support that, then QStringLiteral is defined to be QLatin1String and we fall back to the current Qt 4.x behaviour.

The one caveat is that not every compiler lets you use QStringLiteral outside a function, since GCC statement expressions don’t support that. Moreover, the following code would work, but isn’t read-only and sharable:

    static const QString s = QStringLiteral("Hello, World\n");

If you can, use the following C++0x expression, which is read-only and sharable:

    static const auto s = QStringLiteral("Hello, World\n");
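One way to see the difference between the two declarations is to look at the type the variable ends up with. The sketch below uses stand-in types only: Data, DataPtr, String and helloLiteral are hypothetical and merely mimic a string class and a plain holder of static data.

    ```cpp
    #include <type_traits>

    struct Data { int ref; int size; char16_t text[6]; };  // hypothetical

    // What the literal expression yields in this sketch: a plain pointer wrapper.
    struct DataPtr { const Data *ptr; };

    // Stand-in for a string class with a user-provided constructor and destructor.
    class String {
    public:
        String(DataPtr p) : d(p.ptr) {}
        ~String() {}
        const Data *d;
    };

    static const Data hello = { -1, 5, u"Hello" };
    inline DataPtr helloLiteral() { return DataPtr{ &hello }; }

    // Naming the class forces a conversion: the object is built at start-up
    // and torn down at exit.
    static const String a = helloLiteral();

    // auto keeps whatever type the expression has, here the plain holder.
    static const auto b = helloLiteral();

    static_assert(std::is_same<decltype(b), const DataPtr>::value,
                  "auto deduces the holder, not the string class");
    static_assert(std::is_trivially_destructible<DataPtr>::value,
                  "the holder needs no clean-up code");
    static_assert(!std::is_trivially_destructible<String>::value,
                  "the string class does");
    ```

Whether the holder really ends up in a read-only section still depends on the compiler and on how the literal expression is implemented.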

Future plans

The next step is to convert all uses of QLatin1String with a string literal to QStringLiteral. I have such a commit in my repository’s master branch (warning: I rebase it often), proving that QtCore compiles just fine.

After that, I’d like to go back to my ideal solution using C++0x User-Defined Literals. However, it’s also clear that, unlike the prototype I presented at the end of the blog post in March, we’ll need to use the template version of the operator, as Wikipedia shows it. It would probably look something like this:

    template<char16_t... str> inline QConstStringData<sizeof(str)+1> operator"" q()
    {
        static const QStringData<sizeof(str) + 1> qstring_literal =
            { { Q_REFCOUNT_INITIALIZER(-1), sizeof(str), 0, 0, { 0 }, str } };
        QConstStringData<sizeof(str) + 1> holder = { &qstring_literal };
        return holder;
    }

Unfortunately, no compiler currently supports User-Defined Literals (the GCC C++0x page says someone is working on it). That means we cannot even try out the code above to see if it works or could be improved. When one does, I’ll play with this again.

In the meantime, I’m interested in any feedback you may have.

Update: The C++ standard says that all user-defined literals that do not start with an underscore are reserved. So the operator above should be _q.
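For reference, a literal operator with the _q suffix can be sketched with the (pointer, length) form that C++11 defines for string literals. As far as the final C++11 text goes, the variadic template form appears to apply only to numeric literals, so this sketch does not use it. Utf16View and resume are hypothetical names; a real implementation would return something QString can adopt without copying.

    ```cpp
    #include <cstddef>

    // Hypothetical view over the compiler-generated UTF-16 data.
    struct Utf16View {
        const char16_t *data;
        std::size_t size;
    };

    // The compiler passes a pointer to the code units of the literal and their
    // count, not including the terminating NUL.
    constexpr Utf16View operator""_q(const char16_t *str, std::size_t len)
    {
        return Utf16View{ str, len };
    }

    // \u00e9 is the universal character name for the letter é.
    constexpr Utf16View resume = u"R\u00e9sum\u00e9"_q;

    static_assert(resume.size == 6, "six UTF-16 code units");
    ```

Unlike the template form, this operator only receives a pointer, so it cannot by itself build a static object sized to the literal.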

5 comments

  1. Olivier Goffart

    I would not go so fast in replacing all the QLatin1String uses. It is often used with operator== or operator+ or other functions that take a QLatin1String and that are optimized to work fast (no additional conversion or copy). QStringLiteral has no benefit in that case and takes more memory in the binary (UTF-16 vs. ASCII, plus the header).

    So it should be done case by case.

  2. The User

    You have to use “sizeof…” and you cannot treat such a variadic template parameter as a pointer to initialise QStringData; you have to access the head and initialise it recursively. At least I do not know any good alternative (you could define some macros so that you do not have to write the recursion explicitly). You could at least test the variadic code: create a non-operator function template and pass the characters manually.

  3. Thiago Macieira

    @The User: it works with GCC 4.6, but some … are probably missing above.

    The code I wrote is:

    template<int N> struct QConstStringData
    {
        int size;
        char16_t data[N + 1];
    };
    template<int N> struct QConstStringDataPtr
    {
        const QConstStringData<N> *ptr;
    };
    template <char16_t... str> inline QConstStringDataPtr<sizeof...(str)> initString()
    {
        static const QConstStringData<sizeof...(str)> qstring_literal =
            { sizeof...(str), str... };
        return { &qstring_literal };
    }
    

    So the UDL operator would probably be:

    template<char16_t... str> inline QConstStringData<sizeof...(str)+1> operator"" q()
    {
        static const QStringData<sizeof...(str) + 1> qstring_literal =
            { { Q_REFCOUNT_INITIALIZER(-1), sizeof...(str), 0, 0, { 0 }, str... } };
        QConstStringData<sizeof...(str) + 1> holder = { &qstring_literal };
        return holder;
    }
    
  4. The User

    Delete the old comment, did not see the QConstStringData definition, no initializer_list, hm, that is nice that it can be treated as array initialiser, did not know that. :)

  5. Thiago Macieira

    You seem to be forgetting your plain old C…

    struct foo
    {
        int i;
        short j;
    };
    struct foo f = { 1, 2 };
    

    The QConstStringData and QConstStringDataPtr classes are both “standard-layout” and “trivially constructible”, see: http://en.wikipedia.org/wiki/C++0x#Modification_to_the_definition_of_plain_old_data
