May 18
Just a quick post so I can say I’m going to both events: Akademy 2012 and the Qt Contributors Summit 2012. I hope to see many of you there, and we have a lot to discuss and work on.

May 11
I’ve talked about source code encoding in the past, arguing that the C++ language lacks a fundamental setting. However, since this Monday, Qt 5 now starts to enforce that source code must be UTF-8. In a way.
The commit that landed on the qtbase repository finally changed the codec used by QString’s 8-bit methods to be UTF-8. That concludes a long series of changes that we had planned for Qt 5, that started with Robin Burchell’s work on removing the QTextCodec::setCodecForCStrings function. But to be clear: QString still stores data internally as UTF-16 and that won’t change.
To understand what the change is, we need to go back a little in history. Four years ago, I wrote a blog called “String Theory” that presented QString’s history and I said:
what encoding is your file? Even today, with the widespread use of UTF-8, we can’t rely on that fact (text editors in Windows being the worst example).
In 2008, we were still struggling with UTF-8 encoding in source code, and we definitely were in 2003 when QTextCodec::setCodecForCStrings came about in Qt 3. The reason is that, back then, text editors usually saved code only in the operating system’s locale encoding and very seldom supported writing anything else. Unicode wasn’t widespread enough, so people ended up with a variety of different encodings. That wasn’t a problem, provided that the data exchange only happened with people who used the same encoding — usually people in the same country, using the same operating system.
Times have changed. The protocols from the late 90s that did not possess an encoding marker quickly became obsolete or gained such a tag (I remember when the Kopete developers were struggling to decode ICQ messages properly, and Russian users often ended up with mojibake). Protocols designed in the 2000s all had such a tag, and soon began to standardise on one of the Unicode transforms.
Last year, when revisiting the subject, I wrote:
this is 2011, why are we still restricting ourselves to ASCII? I mean, even if you’re just writing your non-translated messages in English, you sometimes need some non-ASCII codepoints, like the “micro” sign (µ), the degree sign (°), the copyright sign (©) or even the Euro currency sign (€). [...] Besides, this is 2011, the de-facto encoding for text interchange is UTF-8.
The next line of the blog was the decision: we would change the default codec of QString’s 8-bit functions from Latin 1 to UTF-8 in Qt 5 (note that we hadn’t yet started thinking of Qt 5 until about 15 days later). That’s what the commit I made this Monday finally accomplishes.
What does this mean to you? Well, the first thing is that it depends on whether you use these methods or not. If you compile your source code with the QT_NO_CAST_FROM_ASCII and QT_NO_CAST_TO_ASCII macros, you will feel absolutely no difference. And I really mean none, zero, zilch: if you use those macros, you’ve disabled all of the functions affected by my change.
If you do use the functions that are disabled by those macros, then the question is what encoding is used in those strings. My assumption in 2008 is still valid today: most of the strings found in source code are 7-bit, US-ASCII, English text. The 7-bit text will not be affected at all: it will get converted to QString’s UTF-16 internal encoding just like it used to. There might be a slight performance impact, but I do plan on optimising the UTF-8 decoder like I said last year. However, if you can, I recommend wrapping such strings with QLatin1String, especially if you’re using them with a QString function that has a QLatin1String overload.
On the other hand, if you do have text with the high bit set in the QString 8-bit functions, you might need to change your code. You’ll either have to recode your source code to UTF-8, or you will need to wrap those strings with a suitable QLatin1String or QTextCodec::toUnicode call. I highly recommend choosing the former option: use UTF-8 in your source code. You’ll also gain the ability to use QStringLiteral properly, which requires UTF-8 source code anyway.
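To make the difference concrete, here is a minimal sketch in plain C++ (no Qt) of what the two codecs do with the same bytes. The function names decodeLatin1 and decodeUtf8 are made up for this illustration and are not Qt API; the UTF-8 decoder assumes well-formed input and does no validation, which a real decoder must do.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Latin-1: every byte maps to the code point with the same value.
std::u16string decodeLatin1(const std::string &bytes)
{
    std::u16string out;
    for (unsigned char c : bytes)
        out.push_back(static_cast<char16_t>(c));
    return out;
}

// Minimal UTF-8 to UTF-16 decoder (hypothetical helper).
// Assumes well-formed input; no validation of any kind.
std::u16string decodeUtf8(const std::string &bytes)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i++]);
        std::uint32_t cp;
        int extra;                      // number of continuation bytes
        if (lead < 0x80)      { cp = lead;        extra = 0; }
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }
        else                  { cp = lead & 0x07; extra = 3; }
        for (int k = 0; k < extra && i < bytes.size(); ++k, ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i]) & 0x3F);
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {                        // encode as a surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}

// Usage sketch: the micro sign saved as UTF-8 is the two bytes C2 B5.
void demoDecoding()
{
    assert(decodeUtf8("\xC2\xB5") == u"\u00B5");          // one character
    assert(decodeLatin1("\xC2\xB5") == u"\u00C2\u00B5");  // two characters
    assert(decodeUtf8("Qt") == decodeLatin1("Qt"));       // 7-bit text is unaffected
}
```

The last assertion is the reason most code will not notice the change: for 7-bit text the two codecs agree byte for byte.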
[As an interesting twist of history, the seed that became QStringLiteral was in the second of my encoding blogs last year, after the part I quoted above asking for the change to UTF-8, but it landed in Qt 5 before the change of this Monday.]
For Qt’s own source code, we have decreed that the source should be UTF-8 only, and so I proceeded a few weeks ago to find and recode all non-UTF-8 sources. And I’m going even further than that: if you don’t use UTF-8 for your source code, you’ll be on your own. Though it’s possible to make it work, do not ask us for help and do not expect us to add convenience functions. I am also discarding any arguments of the form “my editor/IDE/OS/environment does not support UTF-8”. This is 2012 and we live in a global world, with global data. Any such editor or environment should be left where it belongs: in a museum dedicated to the 80s and 90s.
Long live Unicode!
Apr 30
Update on last Friday’s post on the Qt Project’s statistics: the script ran again this morning, so we now have data for last week. The Qt Project Statistics Page now includes the number of contributors per week:
Visit the statistics page for more graphs.
Apr 27
For about a month, I’ve been improving a set of scripts to calculate statistics on the Qt Project. What I wanted to know, at first, was how well I was doing, how much I was contributing. Another question I had in mind, and I know many others did too, was “how much is the Qt Project dependent on Nokia?”
First it started with a simple “|wc -l” depending on whose statistics I wanted to get. This week, I decided to make graphs, so I spent a great deal of time learning gnuplot instead of doing other work. I’ll blog about the script itself in my next post.
The statistics are online now. You can see them at http://macieira.org/blog/qt-stats. And come back every week, as the page will update itself every Sunday to Monday evening.
Let me just point out the overall graph:
As you can see from the graph, the commit rate for the Qt Project was at its lowest during two holiday periods: New Year’s (week 52 of last year and week 1 of this year) and Easter (week 14). Aside from the first week of the project’s existence, it’s constantly been over 400 commits a week, and over 600 commits for 6 of the past 8 weeks. That’s impressive!
And answering the question of how much the project depends on Nokia, take a look at this other one:
You can see that the participation from Nokia developers still is quite high (and will probably remain so), at around 80%. But in turn that means around 20% of the commits going to the Qt Project come from other people, employed by other companies or in their free time, and this less than 6 months after the official launch of the Qt Project.
More than that, note the trend: Nokia’s participation tends to diminish, not because they’re doing less, but because others are doing more. The following graph, with Nokia’s numbers removed, shows the trend in participation from others:
Apr 03
Lars writes to let us know that the first (and hopefully only) Qt 5 alpha has been released! It’s the first major release series in 7 years and the first major release of the Qt Project (though not the first release of the project, since we released 4.8.1 just a few weeks ago).
I won’t copy what Lars said in his blog. Instead, here are some useful links:
Please note that the alpha release does not support make install yet. You really need to configure it with that -prefix option. We’ll work on an installable package and multiple tarballs for the beta.
Mar 28
I usually write about C++, since it’s the programming language that I use in my daily work. Today, however, I’m talking about its nearest cousin: C. Specifically, about a certain keyword introduced by the C99 standard, which was issued over 12 years ago. Usually, the C standard plays catch-up with the C++ standard (like the C11 standard bringing some C++11 features to C), but each new issue brings a few new things that C++ doesn’t have yet. This cross-pollination by the two standard teams is very welcome.
The one I’m thinking of today is one that, interestingly, has not been added to C++ yet, though many compilers support it. If you’ve paid attention to the blog title, you may realise I’m talking about the restrict keyword.
Raise your hand if you’ve seen it before. Now only the people who have seen it outside of the C library headers on their systems. Not many, eh?
The keyword appears defined in the C99 (N1256) and C11 (N1570) standards in section 6.7.3 “Type qualifiers” and 6.7.3.1 “Formal definition of restrict”, which, as usual, is barely readable for us. The Wikipedia definition is better:
The restrict keyword is a declaration of intent given by the programmer to the compiler. It says that for the lifetime of the pointer, only it or a value directly derived from it (such as pointer + 1) will be used to access the object to which it points.
Well, so what? Why do we need a keyword for that? Well, clearly it’s not just something that the programmer says — otherwise, we’d only write it in the documentation. The Wikipedia text continues by saying that “[t]his limits the effects of pointer aliasing“.
That should now tell you something. At least, it should bring back some memories of compiler warnings like “dereferencing type-punned pointer will break strict-aliasing rules”.
The C and C++ standards say that pointers of different types do not alias each other. That’s the strict aliasing, which you often break by dereferencing type-punned pointers. I’ve talked about this in the past, I think. In any case, what matters to us here is when the pointers are allowed to alias each other. Since the C99 standard couldn’t very well go and change a basic principle of the C90 standard, they instead created a keyword to allow the programmer to declare when aliasing will not happen.
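As a reminder of what breaking strict aliasing looks like, and how to avoid it, here is a small sketch. The helper name floatBits is made up for this illustration, and the expected bit patterns assume a 32-bit IEEE 754 float, which is the case on every platform discussed in this post but is not guaranteed by the language.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t),
              "this sketch assumes a 32-bit float");

// Reads the bit pattern of a float by copying its bytes.
// The type-punned alternative, *reinterpret_cast<std::uint32_t *>(&f),
// reads a float object through an unrelated pointer type, which is
// exactly what the strict aliasing rule forbids.
std::uint32_t floatBits(float f)
{
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}

// Usage sketch (values assume IEEE 754 single precision).
void demoFloatBits()
{
    assert(floatBits(0.0f) == 0u);
    assert(floatBits(1.0f) == 0x3F800000u);
}
```

Compilers typically turn the memcpy call into a single register move, so the well-defined version usually costs nothing.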
The simplest example is the following pair of functions from the C library (copied verbatim from glibc’s string.h header):
/* Copy N bytes of SRC to DEST. */
extern void *memcpy (void *__restrict __dest,
__const void *__restrict __src, size_t __n)
__THROW __nonnull ((1, 2));
/* Copy N bytes of SRC to DEST, guaranteeing
correct behavior for overlapping strings. */
extern void *memmove (void *__dest, __const void *__src, size_t __n)
__THROW __nonnull ((1, 2));
Note the difference: memcpy uses the restrict keyword, whereas memmove does not but does say that it is correct for overlapping strings.
Let’s try and implement these two functions to see if we understand what the keywords mean. Let’s start with memcpy, which is very simple at first approach and you must have written its equivalent hundreds of times already:
// C99 code
void *memcpy(void * restrict dest, const void * restrict src, size_t n)
{
char *d = dest;
const char *s = src;
size_t i;
for (i = 0; i != n; ++i)
d[i] = s[i];
return dest;
}
Having written that, we wonder: why do we need memmove at all? The comment in the header talks about “overlapping strings” and that’s where the code above has an issue. What if we tried to memcpy(ptr + 1, ptr, n)? In the first iteration of the loop above, the byte copied would overwrite the second byte to be read, or worse.
For that reason, the simplest memmove is usually implemented as:
void *memmove(void *dest, const void *src, size_t n)
{
char *d = dest;
const char *s = src;
size_t i;
if (d < s) {
for (i = 0; i != n; ++i)
d[i] = s[i];
} else {
i = n;
while (i) {
--i;
d[i] = s[i];
}
}
return dest;
}
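The overlap problem is easy to observe. The sketch below uses a helper named forwardCopy (a name made up here) containing the same byte-by-byte loop as the simple memcpy, but without the restrict qualifier and under a different name, so that calling it on overlapping buffers is well-defined and we can look at the damage. It shifts a buffer one byte to the right, which is the case where the destination starts one byte after the source.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Same loop as the simple memcpy, but not marked restrict.
void forwardCopy(char *d, const char *s, std::size_t n)
{
    for (std::size_t i = 0; i != n; ++i)
        d[i] = s[i];
}

// Copies buf[0..n-2] onto buf[1..n-1] with the forward loop.
// buf must not be empty.
std::string shiftRightNaive(std::string buf)
{
    forwardCopy(&buf[1], &buf[0], buf.size() - 1);
    return buf;
}

// Same operation with memmove, which handles the overlap.
std::string shiftRightMemmove(std::string buf)
{
    std::memmove(&buf[1], &buf[0], buf.size() - 1);
    return buf;
}

// Usage sketch: the forward loop smears the first byte over the buffer.
void demoOverlap()
{
    assert(shiftRightNaive("abcdef") == "aaaaaa");
    assert(shiftRightMemmove("abcdef") == "aabcde");
}
```

Each iteration of the forward loop reads a byte that the previous iteration has just overwritten, which is why the first character propagates through the whole buffer.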
If we know that the two pointers do not alias each other, we can do some more interesting things to optimise the copying performance. The first thing we can try is to increase the stride. That is, copy more than one byte at a time, like so:
// C99 code
void *memcpy(void * restrict dest, const void * restrict src, size_t n)
{
int *di = dest;
const int *si = src;
char *d = dest;
const char *s = src;
size_t i;
for (i = 0; i != n / sizeof(int); ++i)
di[i] = si[i];
i *= sizeof(int);
for ( ; i != n; ++i)
d[i] = s[i];
return dest;
}
The above code first copies the data in int-size chunks, then copies the remaining 0 to 3 bytes (assuming a 4-byte int) one byte at a time (epilog copy). It’s more efficient than the original code on architectures where unaligned loads and stores are efficient, or when we know both pointers to be aligned to the proper boundary. In those cases, since we have fewer iterations to execute, the copying is usually faster.
We can definitely improve this code further, for example by using 64-bit loads and stores on architectures that support them, by aligning the two pointers in a prolog copy where possible so that the approach applies to all architectures, by unrolling the prolog and epilog, or by using the Single Instruction Multiple Data instructions that the architecture may have.
Note that this is only possible because this is memcpy, not memmove. For the latter function, if we wanted to increase the stride, we would need to additionally check that the distance between the two pointers is at least the size of the chunk of data copied per iteration. Doing that is left as an exercise for the reader.
Now, I said above that the only reason why there’s a language keyword in the first place is so that the compiler can optimise better. Well, that’s exactly what it does. Unfortunately, it’s not easy to prove this straight away with assembly code, as we’re depending on optimisations performed by the compiler, which change over time and are implemented differently in each one. For example, if I use the Intel Compiler on the original memcpy function, it will insert a call to _intel_fast_memcpy if the pointers aren’t suitably aligned or the copy size isn’t big enough. GCC, on the other hand, will insert a prolog to align one of the pointers.
What is interesting to note is that the presence of the restrict keyword, everything else being the same, does cause different code generation. With GCC, the output without the keyword contains a couple of instructions comparing the dest pointer to src + 16 and only if the two pointers don’t overlap in the first 16 bytes will it execute SSE2 16-byte copies. ICC is even more extreme: without the keyword, the code generated for memcpy does only byte-sized copies.
In other words, the keyword is being used: when the compiler knows the two blocks don’t overlap, it can generate better code.
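Since the keyword is not part of C++, trying this from C++ code means relying on a compiler extension. The sketch below assumes the __restrict spelling, which GCC, Clang and MSVC accept; the function names are made up for this illustration. Without the qualifier, a store to out[i] could in principle modify *scale, so the compiler has to assume it may need to reload that value on every iteration.

```cpp
#include <cassert>
#include <cstddef>

// No promise about aliasing: out, in and scale may overlap.
void scaleAll(int *out, const int *in, const int *scale, std::size_t n)
{
    for (std::size_t i = 0; i != n; ++i)
        out[i] = in[i] * *scale;
}

// __restrict is a compiler extension in C++, not standard C++.
// The caller promises that the three pointers refer to distinct objects.
void scaleAllRestrict(int *__restrict out, const int *__restrict in,
                      const int *__restrict scale, std::size_t n)
{
    for (std::size_t i = 0; i != n; ++i)
        out[i] = in[i] * *scale;
}

// Usage sketch: when the promise holds, both versions agree.
void demoScaleAll()
{
    const int in[4] = {1, 2, 3, 4};
    const int scale = 3;
    int a[4] = {0, 0, 0, 0};
    int b[4] = {0, 0, 0, 0};
    scaleAll(a, in, &scale, 4);
    scaleAllRestrict(b, in, &scale, 4);
    for (std::size_t i = 0; i != 4; ++i)
        assert(a[i] == b[i] && a[i] == in[i] * scale);
}
```

Calling the restrict version with overlapping arguments is undefined behaviour; the keyword is a promise made by the caller, not something the compiler checks.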
Feb 22
I’ve written in the past about how passing certain types by value in C++ would be more efficient than passing by constant reference. But it turns out that the ABI rules are somewhat more complex than what I said back in 2008. Time to investigate.
This is also prompted by the discussion on qreal on the Qt development mailing list. In trying to decide on the fate of qreal, we also run into the discussion of the geometric classes (point, size, rectangle, polygon) and the algebraic classes (matrixes, 2D and 3D vectors) and whether they should use single- or double-precision. I’m not going to go into the arguments discussed there, I’m merely focussing here on the ABI.
Before we go into the ABI documentation and try to compile code, we need to define what problem we’re trying to solve. In general terms, I’m trying to find the optimal way of passing small C++ structures: when is it better to pass by value, as opposed to by constant reference? And under those conditions, are there any important implications to the qreal discussion?
In the String Theory blog, I concluded that a small structure like QLatin1String, which contained exactly one pointer as a member, would benefit from passing by value. What other types of structures should we look at?
I’ll investigate the x86-64, ARMv7 hard-float, MIPS hard-float (o32) and IA-64 ABIs because they are the ones for which I have access to compilers. All of them support passing parameters in registers and have at least 4 integer registers used in parameter passing. Except for MIPS, all of them also have at least 4 floating-point registers used in parameter passing. See my earlier ABI detail blog for more information.
So we will investigate what happens when you pass by value the following structures:
struct Pointers2
{
void *p1, *p2;
};
struct Pointers4
{
void *p1, *p2, *p3, *p4;
};
struct Integers2 // like QSize and QPoint
{
int i1, i2;
};
struct Integers4 // like QRect
{
int i1, i2, i3, i4;
};
template <typename F> struct Floats2 // like QSizeF, QPointF, QVector2D
{
F f1, f2;
};
template <typename F> struct Floats3 // like QVector3D
{
F f1, f2, f3;
};
template <typename F> struct Floats4 // like QRectF, QVector4D
{
F f1, f2, f3, f4;
};
template <typename F> struct Matrix4x4 // like QGenericMatrix<4, 4>
{
F m[4][4];
};
struct QChar
{
unsigned short ucs;
};
struct QLatin1String
{
const char *str;
int len;
};
template <typename F> struct QMatrix
{
F _m11, _m12, _m21, _m22, _dx, _dy;
};
template <typename F> struct QMatrix4x4 // like QMatrix4x4
{
F m[4][4];
int f;
};
And we’ll analyse the assembly of the following program:
template <typename T> void externalFunction(T);
template <typename T> void passOne()
{
externalFunction(T());
}
template <typename T> T externalReturningFunction();
template <typename T> void returnOne()
{
externalReturningFunction<T>();
}
// C++11 explicit template instantiation
template void passOne<Pointers2>();
template void passOne<Pointers4>();
template void passOne<Integers2>();
template void passOne<Integers4>();
template void passOne<Floats2<float> >();
template void passOne<Floats2<double> >();
template void passOne<Floats3<float> >();
template void passOne<Floats3<double> >();
template void passOne<Floats4<float> >();
template void passOne<Floats4<double> >();
template void passOne<Matrix4x4<float> >();
template void passOne<Matrix4x4<double> >();
template void passOne<QChar>();
template void passOne<QLatin1String>();
template void passOne<QMatrix<float> >();
template void passOne<QMatrix<double> >();
template void passOne<QMatrix4x4<float> >();
template void passOne<QMatrix4x4<double> >();
template void returnOne<Pointers2>();
template void returnOne<Pointers4>();
template void returnOne<Integers2>();
template void returnOne<Integers4>();
template void returnOne<Floats2<float> >();
template void returnOne<Floats2<double> >();
template void returnOne<Floats3<float> >();
template void returnOne<Floats3<double> >();
template void returnOne<Floats4<float> >();
template void returnOne<Floats4<double> >();
template void returnOne<Matrix4x4<float> >();
template void returnOne<Matrix4x4<double> >();
template void returnOne<QChar>();
template void returnOne<QLatin1String>();
template void returnOne<QMatrix<float> >();
template void returnOne<QMatrix<double> >();
template void returnOne<QMatrix4x4<float> >();
template void returnOne<QMatrix4x4<double> >();
In addition, we’re interested in what happens to non-structure floating point parameters: are they promoted or not? So we’ll also test the following:
void passFloat()
{
void externalFloat(float, float, float, float);
externalFloat(1.0f, 2.0f, 3.0f, 4.0f);
}
void passDouble()
{
void externalDouble(double, double, double, double);
externalDouble(1.0f, 2.0f, 3.0f, 4.0f);
}
float returnFloat()
{
return 1.0f;
}
double returnDouble()
{
return 1.0;
}
You might have noticed I skipped old-style 32-bit x86. That was intentional, since that platform does not support passing by registers anyway. The only conclusion we could draw from that would be:
Moreover, I’m intentionally ignoring it because I want people to start thinking of the new ILP32 ABI for x86-64, enabled by GCC 4.7’s -mx32 switch, which follows the same ABI as the one described below (with the exception that pointers are 32-bit).
So let’s take a look at the assembly results. For parameter passing, we find out that
For return values, the conclusion is the same as above: if the value is passed in registers, it's returned in registers too; if it's passed in memory, it's returned in memory. This leads us to the following conclusions, supported by careful reading of the ABI document:
Here are the results for parameter passing:
For the return values, we have:
The conclusions are:
I've compiled the code only for ARMv7, with the floating-point parameters passed in the VFP registers. If you're reading this blog, you're probably interested in performance and therefore you must be using the "hard-float" model for ARM. I will not concern myself with the slower "soft-float" mode. Also note that this is ARMv7 only: the ARMv8 64-bit (AArch64) rules differ slightly, but no compiler for it is available.
Here are the results for parameter passing:
For returning those types, we have:
Note that the return type is one of the places where the 32-bit AAPCS differs from the 64-bit one: there, if a type is passed in registers to a function where it is the first parameter, it is returned in those same registers. The 32-bit AAPCS restricts the return-in-registers to structures of 4 bytes or less.
My conclusions are:
I have attempted both a MIPS 32-bit build (using the GCC-default o32 ABI) and a MIPS 64-bit (using -mabi=o64 -mlong64). Unless noted otherwise, the results are the same for both architectures.
For passing parameters, they were:
For the return values, MIPS is easy: everything is returned in memory, even QChar.
The conclusions are even easier:
There are only a few aggregate conclusions that we can draw. One of them is that single-precision floating point values are not explicitly promoted to double when formal parameters are present. The automatic promotion probably happens only for floating-point values passed in ellipsis (...), but our problem statement was about calling functions where the parameters are known. The only slight deviation from the rule is IA-64, but it's unimportant as the hardware, like x87, only operates in one mode.
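The ellipsis case is indeed where the language itself requires the promotion: under the default argument promotions, a float passed through (...) is converted to double before the call. A minimal sketch, with function names made up for this illustration:

```cpp
#include <cassert>
#include <cstdarg>

// Variadic: float arguments arrive as double, so they must be read
// back with va_arg(ap, double). Reading them as float would be wrong.
double sumVariadic(int count, ...)
{
    va_list ap;
    va_start(ap, count);
    double total = 0;
    for (int i = 0; i < count; ++i)
        total += va_arg(ap, double);
    va_end(ap);
    return total;
}

// Formal parameters: the declared type is float, so no promotion
// is required by the language.
float sumFixed(float a, float b)
{
    return a + b;
}

// Usage sketch: the values are exactly representable, so == is safe.
void demoPromotion()
{
    assert(sumVariadic(2, 1.5f, 2.5f) == 4.0);
    assert(sumFixed(1.5f, 2.5f) == 4.0f);
}
```

This only shows the language rule; which registers the values travel in is still up to each ABI, as measured above.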
For the structures containing integer parameters (that includes pointers), there's nothing further to optimise: they are loaded into registers exactly as they appear in memory. That means the portion of the register corresponding to padding might contain uninitialised or garbage data, or it might do something really strange, like MIPS in big-endian mode. It also means, on all architectures, that types smaller than a register do not occupy the entire register, so they might be packed with other members.
Another is quite obvious: structures containing floats are smaller than structures containing doubles, so they will use less memory or fewer registers to be passed.
To continue drawing conclusions, we need to exclude MIPS since it passes everything in the integer registers and returns everything by memory. If we do that, we are able to see that all ABIs provide an optimisation for structures containing only one floating-point type. Those are called by slightly different names in the ABI documents, all meaning homogeneous floating-point structure. Those optimisations mean that the structure is passed in floating-point registers under certain conditions.
The first one to break down is actually x86-64: the upper limit is 16 bytes, limited to two SSE registers. The rationale for this seems to be passing one double-precision complex value, which takes 16 bytes. That we are able to pass four single-precision values is an unexpected benefit.
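The 16-byte limit can be checked against the test structures at compile time. This sketch encodes only the size rule stated above, not the ABI's full classification algorithm, and the helper name fitsInTwoSseRegisters is made up for this illustration. It assumes a 4-byte float and an 8-byte double.

```cpp
#include <cstddef>

template <typename F> struct Floats2 { F f1, f2; };
template <typename F> struct Floats3 { F f1, f2, f3; };
template <typename F> struct Floats4 { F f1, f2, f3, f4; };

// Simplified x86-64 rule for homogeneous floating-point structures:
// at most two SSE registers, that is, at most 16 bytes.
template <typename T> constexpr bool fitsInTwoSseRegisters()
{
    return sizeof(T) <= 16;
}

static_assert(fitsInTwoSseRegisters<Floats2<float>>(),   "8 bytes");
static_assert(fitsInTwoSseRegisters<Floats4<float>>(),   "16 bytes");
static_assert(fitsInTwoSseRegisters<Floats2<double>>(),  "16 bytes");
static_assert(!fitsInTwoSseRegisters<Floats3<double>>(), "24 bytes");
static_assert(!fitsInTwoSseRegisters<Floats4<double>>(), "32 bytes");
```

Read this way, four floats fit where only two doubles do, which is the unexpected benefit mentioned above.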
The remaining architectures (ARM and IA-64) can pass more values by register, and always at one value per register (no packing). IA-64 has more registers dedicated to parameter passing, so it can pass more than ARM.
The above is only valid for structures that are trivially copyable and trivially destructible. All C structures (POD in C++) meet those criteria.
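Both properties can be queried with the C++11 type traits. In the sketch below, the two structures and the helper name meetsTrivialityPrecondition are made up for this illustration; passing the check is a precondition for the register-passing rules, not a guarantee that a given ABI will use registers.

```cpp
#include <type_traits>

// Hypothetical example types.
struct PlainPoint
{
    double x, y;
};

struct TracedPoint
{
    double x, y;
    ~TracedPoint() {}   // user-provided destructor: no longer trivial
};

// Necessary (not sufficient) condition for the register-passing
// rules discussed in the text to apply.
template <typename T> constexpr bool meetsTrivialityPrecondition()
{
    return std::is_trivially_copyable<T>::value
        && std::is_trivially_destructible<T>::value;
}

static_assert(meetsTrivialityPrecondition<PlainPoint>(),
              "plain aggregates qualify");
static_assert(!meetsTrivialityPrecondition<TracedPoint>(),
              "a user-provided destructor disqualifies the type");
```

Note that even an empty user-provided destructor is enough to change the answer, so it is worth leaving destructors and copy constructors implicit in small value classes.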
I should note that the recommendations above do not always produce more efficient code. Even though the values can be passed in registers, every single compiler I tested (GCC 4.6, Clang 3.0, ICC 12.1) still does a lot of memory operations in some cases. It's quite common for the compiler to write the structure to memory and then load it into the registers. When it does that, passing by constant reference would be more efficient since it would replace the memory loads with arithmetic on the stack pointer.
However, those are simply a matter of further optimisation work by the compiler teams. The three compilers I tested for x86-64 optimise differently and, in almost all cases, at least one of them managed to do without memory access. Interestingly, the behaviour changes also when we replace the padding space with zeroes.
Jan 19
My last blog on the dynamic libraries on Linux attracted over 15000 visits, which was quite unexpected (it’s 15x more than the usual traffic). It got linked from reddit and ycombinator and comments there and in the previous post have raised some interesting questions I’ll try to answer.
First, a quick background: LD_PRELOAD and /etc/ld.so.preload tell the dynamic linker to load a certain ELF module before the rest of the normal initialisation sequence. It’s preloaded before the rest of the modules, but after two important modules have been loaded: the executable itself and the dynamic linker. By itself, it means nothing at all about symbol hijacking. Its sole purpose is to load something. I have, for example, used it for loading a different binary of a library that a program required. That works fine.
If you complained that I said it’s little-known, you’re somewhat biased. If you complained, it’s because you knew about it, therefore you’re part of the minority that knows about it. Just think about it: there are millions of people directly using Linux today in the world. How many do you think know about this feature?
Even more so, think about how often:
The ratio is at least 1:1000 for a heavy user of the feature (like me!) in the best of the circumstances. It’s probably several orders of magnitude more than that for the average. Something that is used in one case in a million qualifies as little-used to me.
Some people suggested I was thinking of getting rid of the preloading feature in exchange for a few cycles saved. I would still be within my rights to suggest that, given the improvements and how rarely the feature is used, but I wasn’t. I’ve never proposed getting rid of the preloading feature and my proposal would not harm the most often used cases of interposition.
This requires a bit more explanation, so bear with me please.
Symbol interposition works by adding a symbol to the symbol table before the “rightful” symbol appears. The dynamic linker will resolve the symbol to the first occurrence it finds in the search order, so if you preload a library out of its order, its symbols will have higher priority than they would otherwise. The extreme case is when you preload a library or module that wouldn’t otherwise be loaded. But remember something I said before: preloaded modules are loaded after two others are loaded, so they don’t get the chance to interpose symbols defined by those.
If the executable performed a copy relocation on a data symbol, then LD_PRELOAD’ed modules cannot interpose those. For that reason, I am not counting interposition of data symbols as valid. In fact, in 14 years I’ve been hacking on Linux, I’ve never done that, so I guess the chances of that happening are a billion to one or even lower. What’s more, my proposal would do away with copy relocation, which may make data interposition a valid case.
The next important thing you must understand is that my proposal would do away with interposition of intra-library symbols, but not inter-library ones. My friend Michael Meeks’s proposal of -Bdirect linking might, but even that proposal wouldn’t totally do away with it.
What do I mean by this? Intra-library means “within the same library,” while inter-library means “across libraries” (think of “Internet” vs “intranet”). My proposal was intended to improve binding of symbols inside one library because we can gain performance doing that without losing the Position-independent code and the advantages that come with it (like Address space layout randomisation). Specifically because we don’t want to lose the PIC support and we don’t want to go back to pre-ELF days and their problems (see Ulrich Drepper’s paper for some information on it), all inter-library symbol resolution would remain as-is, via PLTs and GOTs, including the ability to interpose symbols.
And here’s why I think we’re entitled to doing that: because you cannot do it anyway unless the library has been specifically designed to allow it, like glibc is. Let’s take the code from the last blog:
extern void *externalVariable;
extern void externalFunction(void);
void myFunction()
{
externalFunction();
externalVariable = &externalFunction;
}
And amend it like so:
void externalFunction(void)
{
}
If we compile this code with optimisation (GCC’s -O is enough) and inspect the assembly output, we can notice that both functions are present in the output but that myFunction does not call externalFunction. In other words, the compiler inlined one function into the other, even if the inline keyword was never added to it, and that expanded to zero code. With advances such as link-time optimisation, even moving the function to another compilation unit might not be enough to prevent the inlining.
That’s why I said that to support the case of intra-library symbol interposition, the library must be specifically designed to allow it, which is definitely still possible under my proposal. Most libraries aren’t designed like that and will never be, so I am confident that optimising for the greater majority of the libraries instead of the few is warranted (taking my system: I counted 3623 distinct libraries and plugins and I’m pretty sure none except libc and libpthread allow for interposition, so it’s probably a 1000:1 case again).
Another important remark I saw in the comment threads was about the lack of benchmarks in my previous blog. Here they are.
Please note that “benchmark” means “comparison.” It does not imply “speed executing something.”
I started by trying to find an executable I could run non-interactively, that executed a relatively CPU-intense activity and quit. That executable should be in my standard set of built executables, as I didn’t want to recompile the entire system. I settled on KDE’s kbuildsycoca4 with the options --noincremental --nosignal: it looks for all *.desktop files in the search paths and compiles a database for faster lookup, called the SYstem COnfiguration CAche. The options tell it to ignore existing databases and do it all, plus avoid signalling running applications over D-Bus to reload their settings.
The tests were run on my laptop, which is an Intel Core-i7-2620M, clocked at 2.6 GHz, with an SSD but no tmpfs temporary dir, with 2x32kB of L1 cache, 256 kB of L2 cache, 4MB of L3 cache and 4 GB of main RAM. I locked the CPU scaling governor to “performance” so the CPU was running at 2.6 GHz when the test started; it soon went over to turbo-mode and stayed there (3.2 GHz). The system was not completely idle while running the test, but relatively so. To try and avoid other problems, the native benchmarks were run under the FIFO real-time scheduler, with affinity set to a single processor. The tests were run in 64-bit mode and were run “warm”: I ran the benchmark first after any recompilation and discarded the results.
I did four sets of tests, as follows:
Each set of tests consisted of:
You can download the raw results I collected from here (that also includes results with LD_BIND_NOW=1).
First of all, I went into these benchmarks fully expecting that nothing would be visible in the performance benchmarks. It’s clear that these are micro-optimisations, so in a fairly large program they should be drowned out by inefficiencies in other parts. Also, considering that my system wasn’t completely idle when running the CPU benchmarks, the numbers have a degree of noise which could hide the faint results. The results have, however, shown a few clear improvements.
Here’s what I found:
The numbers are fairly small, as was expected, since we’re talking about micro-optimisations. However, three distinct benchmarks (execution time, cycle count and instruction count) have shown, with a degree of confidence that I find reasonable given the limited sample size I had, that there’s a performance improvement in the order of 3%. That’s more or less what I hoped to see, but much more than I expected to be able to show.
Another important aspect is that this was a non-GUI testcase, even though by virtue of library dependencies, both QtGui and kdeui libraries were present. Note how the two libraries have, together, 45824 relocations and 14708 PLT entries in the original library set, which corresponds to 73.3% and 62.4% of the total relocations in play respectively, as well as 65% of the PLT entries for local symbols. The number of relocations is indicative also of the size of the code in those libraries. But since the application isn’t a GUI one, that code is mostly not executed.
If we consider that the problem of cache misses increases with code size (and the cache miss rate could increase too, compounding the effect) and that of cycles lost due to mispredicted branches increases with the number of branches unless the misprediction ratio drops (which the benchmarks have shown to remain stable), we can expect that a GUI application could gain even more in performance due to these improvements. That’s difficult to prove however in a GUI application, so we’ll have to stay with just the theoretical exercise.
In all, I still think this is warranted. The drawbacks are fairly minor: the interposition of symbols is already rarely used, and interposition in intra-library lookups is close to non-existent in libraries that aren’t designed for it. All we need to do now is change the status quo, which is probably the hardest part.
Who will support me?
Jan 16
Last week, we identified a bug in Qt with Olivier’s new signal-slot syntax. Upon further investigation, it turns out it’s not a Qt issue, but an ABI one. Which prompted me to investigate more and decide that dynamic libraries need a big overhaul on Linux.
Shared libraries on Linux are compiled with -fPIC, which makes all variable references and function calls indirect, unless they are static. That’s because in addition to making the code position-independent, it makes every variable and function interposable by another module: it can be overridden by the executable and by LD_PRELOAD libraries. The indirectness of accesses has a performance impact and we should do away with it, without sacrificing position-independence.
Plus, there are a few more actions we should take (like prelinking) to improve performance even further.
Note: in the following, I will show x86-64 64-bit assembly and will restrict myself to that architecture. However, the problems and solutions also apply to many other architectures, like x86 and ARM, which should make you consider what I say. The only platform that this mostly does not apply to is actually IA-64.
Imagine the following C file, which also compiles in C++ mode:
extern void *externalVariable;
extern void externalFunction(void);
void myFunction()
{
    externalFunction();
    externalVariable = &externalFunction;
}
The code above demonstrates three features of the languages in one function: it loads the address of a function, it calls a function and it writes to a variable. The compiler does not know where the function and variable are: they might be in another .o file linked into this ELF module or they might be in another ELF module (i.e., a library) this module links to.
The compiler produces the following assembly output (gcc 4.6.0, -O3):
call externalFunction
movq $externalFunction, externalVariable(%rip)
This assembly snippet is making use of two symbols whose values the assembler does not know. When assembled, the assembler produces a .o with three relocations. This GCC has produced the most efficient and most compact compilation of the code I wrote.
When we link this .o into an executable, we start to see the drawbacks. The first is that both instructions need to encode, in their bits, the values of the symbols whose values we didn’t know. So the linker must somehow fix this. It fixes the call instruction by making it call a stub or a trampoline, which jumps to the actual address. This stub is placed in a separate section of code called the Procedure Linkage Table (PLT). The contents of the PLT stub are not that important, but suffice to say that it is an indirect jump.
The movq instruction cannot be fixed. There’s simply no way, because it writes a constant value to a constant location, directly. Even if we allowed for the instruction or a pair of instructions wide enough to write any 64-bit value to any variable in the 64-bit space, we still have a problem: those values are not known at link time. So instead of fixing the instruction, the linker “fixes” the values. For the address of externalFunction, it uses the address of the PLT stub it created in the previous paragraph. For the externalVariable variable, it will create a copy relocation, which means the dynamic linker will need to find the variable where it is, copy its value to a fixed location in the executable and then tell everyone that the variable is actually in the executable.
What are the consequences of this? For the PLT call, it’s a simple performance impact which could not be avoided. Since the address of the actual externalFunction function is not known at compile and link-time, and we don’t want to leave a text relocation, the only way to place that call is to find the address at run-time and call it indirectly.
For the copy relocation, the consequences for the executable are small. The code it will execute is still the most efficient and most compact. The dynamic linker will have to find where the symbol actually is at load-time, which is something that it would have to do anyway, plus copy its contents, checking that the size hasn’t changed. This is done only once, then the code runs in its most efficient form.
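The bookkeeping behind a copy relocation can be pictured with a toy model in plain C. This is only a sketch under stated assumptions, not what the dynamic linker actually executes: the names libraryStorage, executableStorage, gotEntry and applyCopyRelocation are all hypothetical, and both "modules" live in one file.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical model: where the library originally defined the variable. */
static int libraryStorage = 42;

/* Hypothetical model: space the linker reserved inside the executable. */
static int executableStorage;

/* Model of the library's GOT slot: library code always goes through it. */
static int *gotEntry = &libraryStorage;

/* What happens once, at load-time: check the size, copy the initial value
   and tell everyone that the variable now lives in the executable. */
void applyCopyRelocation(void)
{
    assert(sizeof executableStorage == sizeof libraryStorage);
    memcpy(&executableStorage, &libraryStorage, sizeof executableStorage);
    gotEntry = &executableStorage;
}

/* Executable code: direct access, a single memory reference. */
void executableWrite(int value)
{
    executableStorage = value;
}

/* Library code: indirect access through the GOT, two memory references. */
int libraryRead(void)
{
    return *gotEntry;
}
```

In this model, a value written by executableWrite() after applyCopyRelocation() is what libraryRead() returns, which is why the library can never assume its own copy of the variable is the live one.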
The fact that we resolved &externalFunction to the address of the PLT stub means that any use of that function pointer (an indirect call) will end up in a function that does an indirect call too. That is, it’s a doubly-indirect call. I seriously doubt any processor can do proper branch prediction, speculative execution, and prefetching of code under those circumstances.
So far we’ve analysed what happens in an executable. Now let’s see what happens when we try to build the same C code for a shared library. We do that by introducing the -fPIC compiler option, which tells the compiler to generate position-independent code. The compiler produces the following assembly output:
call externalFunction@PLT
movq externalFunction@GOTPCREL(%rip), %rdx
movq externalVariable@GOTPCREL(%rip), %rax
movq %rdx, (%rax)
When assembled, the .o still contains three relocations, albeit of different types.
When we compare the output of the position-dependent and the position-independent code, we notice the following:
The compiler needed to generate the code above since it doesn’t know where the symbols will actually be. As was the case before, those symbols can be linked into the same ELF module as this compilation unit, or they may be found elsewhere in another ELF module this one links to.
Moreover, the compiler and linker need to deal with the possibility that an executable might have done exactly what our executable in the previous section did: create a copy relocation on the variable and fix the address of the function to its own PLT stub. In order to work properly, this code must deal with the fact that its own variable might have ended up elsewhere, and that &externalFunction might have a different value.
That means the indirect call through the PLT and the three movq instructions remain, even if those two symbols were in the same compilation unit!
The problem is that even if at first glance you’d think that the compiler should know for a fact where those symbols are, it actually doesn’t. The -fPIC option doesn’t enable only position-independent code. It also enables ELF symbol interposition, which is when another module “steals” the symbol. That happens normally by way of the copy relocations, but can also happen if an LD_PRELOAD’ed module were to override those symbols. So the compiler and linker must produce code that deals with that possibility.
In the end, we’re left with indirect calls, indirect symbol address loadings and indirect variable references, which impact code performance. In addition, the linker must leave behind relocations by name for the dynamic linker to resolve at load-time.
Yes, it seems so. The impact is there for this little-known and little-used feature. Instead of optimising for the common-case scenario where the symbols are not overridden, the ABI optimises for the corner case.
Another argument is that the ABI optimises for executable code, placing the impact on the libraries. The argument is valid if the executables are much larger and more complex than the libraries themselves. It’s valid too if we consider that application developers write sloppy code, whereas library developers will write very optimised code.
I don’t think that argument holds anymore. Libraries have got much more complex in the past 10-15 years and do a lot more than they once did. They are not mere wrappers around system calls, like libc 4 and 5 were on Linux in the late 90s. Moreover, if we consider the rise of interpreted languages, like Perl, Python, Ruby, even QML and JavaScript, the code belonging to the ELF executables is negligible. Compare the size of the executables with the libraries that actually do the interpretation:
-rwxr-xr-x. 2 root root 13544 Aug 5 06:27 /usr/bin/perl
-rwxr-xr-x. 2 root root 9144 Apr 12 2011 /usr/bin/python
-rwxr-xr-x. 1 root root 5160 Dec 29 13:46 /usr/bin/ruby
-r-xr-xr-x. 1 root root 1763488 Apr 12 2011 /usr/lib64/libpython2.7.so.1.0
-rwxr-xr-x. 1 root root 947736 Dec 29 13:46 /usr/lib64/libruby.so.1.8.7
-rwxr-xr-x. 1 root root 1524064 Aug 5 06:27 /usr/lib64/perl5/CORE/libperl.so
That’s even valid for interpreters that JIT the code. As optimised as the code they generate can be, current understanding is that operations with critical performance are implemented in native code, which means libraries or plugins.
When developing your library, if you know that certain symbols are private and will never be used by any other library, you have an option. You can declare their ELF visibility to be “hidden”, which has two consequences. The clear one is that the linker will not add the hidden symbols to the dynamic symbol table, so other ELF modules simply cannot find them. If they can’t find them, they can’t steal them. And if they can’t steal them, the linker does not need to produce a PLT stub for the function call, so the call instruction will be linked to a simple, direct call as the executable in the first part had been.
The other consequence is an optimisation that the compiler does. Since it also knows that the externalVariable variable cannot be stolen, it does not need to address the variable indirectly. The generated assembly becomes:
call externalFunction@PLT
movq externalFunction@GOTPCREL(%rip), %rax
movq %rax, externalVariable(%rip)
The .o file will still contain three relocations. However, note how the getting of the address of the externalFunction function is still done indirectly, even though the compiler knows it cannot be interposed. That means the linker will still generate a load-time relocation for the dynamic linker, to get the address of that function. Fortunately, it’s a simpler relocation since the symbol name itself is not present.
If there’s a reason for getting the address indirectly like this, I have yet to find it.
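For reference, this is roughly what marking symbols as hidden looks like in source code with GCC. It is a minimal sketch: privateCounter, privateIncrement and publicBump are hypothetical names invented for the illustration, not part of any real library.

```c
#include <assert.h>

/* Hypothetical private data of a library: hidden symbols are left out of
   the dynamic symbol table, so no other ELF module can find or steal them. */
__attribute__((visibility("hidden"))) int privateCounter = 0;

/* Hypothetical private helper, callable from any .o in this library only. */
__attribute__((visibility("hidden"))) int privateIncrement(int step)
{
    assert(step >= 0);
    privateCounter += step;   /* the compiler may address this directly */
    return privateCounter;
}

/* Public entry point with the default visibility: still exported. */
int publicBump(void)
{
    return privateIncrement(1);   /* linked as a direct call, no PLT stub */
}
```

GCC also has the -fvisibility=hidden option, which makes hidden the default for a whole compilation so that only the symbols explicitly marked with the default visibility are exported.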
If your symbols are public, however, you cannot use the ELF “hidden” visibility trick. But if you know that they cannot and will not ever be stolen or interposed, you have another possibility, which is to tell that to the compiler and linker.
If you declare a variable with ELF “protected” visibility, you’re telling the compiler and linker that it cannot be stolen, yet can be placed in the dynamic symbol table for other ELF modules to reference. You just have to be absolutely sure that they will not ever be interposed, because that will create subtle bugs that are hard to track down. That includes access to those symbols by position-dependent executable code, like we did in the first section.
The GCC syntax __attribute__((visibility("protected"))) works in ELF platforms only, whereas the one with the “hidden” keyword is known to work in non-ELF platforms too, like Mac OS X (Mach-O) and IBM AIX (XCOFF).
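As a sketch of the syntax on an ELF platform, with hypothetical names (exportedLimit, exportedClamp and exportedClampTwice are made up for the example), protected symbols could be declared like this:

```c
#include <assert.h>

/* Hypothetical exported variable: "protected" keeps it in the dynamic
   symbol table, but promises that it will never be interposed. */
__attribute__((visibility("protected"))) int exportedLimit = 10;

/* Hypothetical exported function with the same promise. */
__attribute__((visibility("protected"))) int exportedClamp(int value)
{
    assert(exportedLimit >= 0);
    return value > exportedLimit ? exportedLimit : value;
}

/* Calls from inside the library bind to the library's own definition. */
int exportedClampTwice(int value)
{
    return exportedClamp(exportedClamp(value));
}
```

Other modules can still call exportedClamp() and read exportedLimit; the promise only concerns who provides the definition.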
Another way to do the same is to use one of two linker options: -Bsymbolic and -Bsymbolic-functions. They do basically the same as the protected visibility: they keep the symbols in the dynamic symbol table, but they make the linker use the symbol inside the library unconditionally. The difference between those two options is that the former applies to all symbols, whereas the latter applies to functions only.
The reason why -Bsymbolic-functions exists requires looking back at the executable code from the first section. While the variable reference required a copy relocation, the function call was done indirectly, through the PLT stub. A variable can be moved, but moving code isn’t possible, so the executable code needs to deal with the code being elsewhere anyway. For that reason, it’s possible to symbolically bind function calls inside a library without affecting executables.
Or so we thought. The problem we discovered last week arises when you treat a function as a data reference, that is, when you take its address. As we saw in the first part, the linker will resolve the address of the function to the address of the PLT stub found in the executable. But if you symbolically bind the function in the library, it will resolve to the real address. If you try to compare the two addresses, they won’t be the same.
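The shape of the code that trips over this can be sketched as follows. The function names isExternalFunction and callerCheck are hypothetical, and everything is placed in one file for brevity, so in this form the comparison succeeds. The failure needs the two halves split between a position-dependent executable and a library linked with -Bsymbolic-functions.

```c
#include <assert.h>
#include <stddef.h>

/* Stands in for the function exported by the library. */
void externalFunction(void)
{
}

/* Library side: compares the pointer it was given with the address it
   sees for the function. With symbolic binding, that is the real address. */
int isExternalFunction(void (*candidate)(void))
{
    assert(candidate != NULL);
    return candidate == &externalFunction;
}

/* Executable side: takes the address of the function and passes it on.
   In a position-dependent executable, this is the PLT stub's address. */
int callerCheck(void)
{
    return isExternalFunction(&externalFunction);
}
```

Within a single module, callerCheck() returns 1. In the split scenario described above, the same source would return 0, even though both sides are talking about the same function.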
Some of the solutions I propose are ABI and binary compatible with existing builds; some others are ABI incompatible and would require recompilation. Unfortunately, the best solution would require source-incompatible changes. Still, all the changes below are giving a bit of optimisation to libraries by making executables less optimised.
As we saw in the code generated for the library, with -fPIC, the compiler decided to make the call indirectly by adding “@PLT” to the symbol name. Turns out that the linker doesn’t really care about this and will generate (or not) the PLT stub if needed. If that’s the case, the compiler should not make a judgement call about where the symbol is located just because of -fPIC.
Function calls already require a pointer-sized variable somewhere and a relocation to make it point to the valid entry point of the function being called. What’s more, taking addresses of functions is a somewhat rare operation, compared to the number of function calls across ELF modules.
That being the case, we can take a small “hit” in performance: the loading of a function address should happen via the GOT in position-dependent code (executables), just like it is done for position-independent code.
The benefit of doing this is that the function address we load will point exactly to the function’s real entry point, instead of the PLT stub. When we call this function, we avoid the doubly-indirect branching we found earlier.
If a given function is both called and its address is taken, the PLT stub should reference the GOT entry that was used for the taking of the address. The reason why it isn’t already so, I guess, is because the entries in the .got.plt section aren’t initialised with the target function’s address, but the local module’s function resolver. This trick allows for the “lazy resolution” of functions: they are resolved only the first time they are called.
I wouldn’t ask for all functions to be resolved at load-time, but if the address of the function is taken anyway, the dynamic linker will need to resolve it at load time. So why waste CPU cycles in a function call if the address was computed already?
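Lazy resolution can be modelled in plain C with a function pointer that starts out pointing at a resolver. This is only an illustration of the idea: gotPltSlot, resolverStub, targetFunction and callThroughPlt are hypothetical names, and the real resolver does a symbol lookup rather than knowing the target in advance.

```c
#include <assert.h>

/* Counts how many times the slow path was taken. */
static int resolverRuns = 0;

/* Hypothetical real function living in another module. */
static int targetFunction(int x)
{
    return x * 2;
}

static int resolverStub(int x);

/* Model of a .got.plt slot: initially points at the resolver. */
static int (*gotPltSlot)(int) = resolverStub;

/* Model of the resolver: find the target, patch the slot, finish the call. */
static int resolverStub(int x)
{
    ++resolverRuns;
    gotPltSlot = targetFunction;
    return targetFunction(x);
}

/* Model of the PLT stub: always an indirect jump through the slot. */
int callThroughPlt(int x)
{
    assert(gotPltSlot != 0);
    return gotPltSlot(x);
}

int resolverRunCount(void)
{
    return resolverRuns;
}
```

Only the first call goes through the resolver; later calls jump straight to the target. If the slot had been filled at load-time because the address was taken anyway, even that first detour would disappear.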
Instead of copying the variable from the library into the executable, executables should use indirect addressing for reading variables and writing to them, as well as taking their addresses. One benefit of doing this is avoiding the actual copying. For example, for read-only variables, they may remain in read-only pages of memory, instead of being copied to read-write pages found in the executable.
The big drawback of this is that the indirect addressing is a lot more expensive, since it requires two memory references, not just one. The next suggestion might help alleviate the problem.
This is a suggestion found in the IA-64 ABI: the compiler generates the instructions needed to load the address of the variable from the GOT, then use it as it needs to. If the linker concludes (by whichever means, like protected or hidden symbols, the use of one of the symbolic options, or because this is an ELF application and the symbol is defined in it) that the symbol must reside in the current ELF module, it can change the load instruction into a register-to-register move or similar.
For our x86-64 64-bit case, the instructions the compiler generated were:
movq externalVariable@GOTPCREL(%rip), %rax
movq %rdx, (%rax)
By changing one bit in the opcode of the first instruction, with no code size change, we can produce:
leaq externalVariable@GOTPCREL(%rip), %rax
movq %rdx, (%rax)
The x86 instruction “LEA” means “Load Effective Address”. Instead of loading 64 bits from the memory address externalVariable@GOTPCREL(%rip) and storing them in the register, that instruction stores in the register the address it would have loaded from. This isn’t as optimised as the original code found in the executable for two reasons: it requires two instructions instead of just one and it requires an additional register.
It’s possible to generate an even more efficient code if the assembler leaves a 32-bit immediate offset in the second movq instruction, making it 6 bytes long. This extra immediate would be of no impact in the original code, besides making it longer, but it would allow the linker to optimise the code further:
The original would be:
movq externalVariable@GOTPCREL(%rip), %rax
movq.d32 %rdx, 0x0(%rax)
And it would get relaxed to:
nopl.d32 0x0(%rax)
movq %rdx, externalVariable@GOTPCREL(%rip)
That is, the first 6-byte instruction is resolved to a 6-byte NOP, whereas the second 6-byte instruction executes the actual store, with no extra register use. The compiler cannot know that the register will be left untouched, but at least there is no dependency between the two instructions that might cause a CPU stall.
The same applies to other architectures too. The full -fPIC code on ARM to store a value from a register into a variable is the following:
ldr r3, .L2+8 @ points to a constant whose value is: externalVariable(GOT)
.LPIC1:
ldr r3, [r4, r3] @ r4 contains the base address of the GOT
str r2, [r3, #0]
If the linker can conclude the symbol must be in the current ELF module and cannot change, it may be able to avoid the extra load (the middle instruction) by changing the code to be:
ldr r3, .L2+8 @ points to a constant whose value is: externalVariable-(.LPIC1-8)
.LPIC1:
add r3, pc, r3
str r2, [r3, #0]
Unlike x86, the ARM instructions cannot be optimised further, since the immediates encodable in the instructions have limited range.
Similar to the above, but instead looking at function addresses. The original library code is:
movq externalFunction@GOTPCREL(%rip), %rdx
But it can be relaxed to:
leaq externalFunction(%rip), %rdx
With ARM, the original code is:
ldr r3, .L2+8 @ points to a constant of value: externalFunction(GOT)
ldr r2, [r4, r3] @ r4 contains the address of the base of the GOT
But relaxed, it would be:
ldr r2, .L2+8 @ points to a constant of value: externalFunction-(.LPIC0+8)
.LPIC0:
add r2, pc, r2
We’re already able to tell the compiler that a symbol is in the current module, with the hidden visibility attribute. We should also be able to tell the compiler that we know a symbol is in the current module but exported, and likewise that we know a symbol is in another module.
I would suggest simply using the existing ELF markers and being explicit about them:
Considering the other suggestions, we know the references to symbols with “default” visibility can be relaxed into simpler and more efficient code in the presence of one of the symbolic binding options. That means we can use the “default” visibility for cases of uncertain symbols.
Some of the solutions I listed are already possible and they should be used immediately in all libraries. That is especially true about the use of the hidden visibility: all libraries, without exception, should make use of this feature. In fact, since this option was introduced in GCC 4.0 seven years ago, many libraries have started using it and are now “good citizens”, for they access their own private data most efficiently, they don’t have huge symbol tables (which impact lookup speed) and they don’t pollute the global namespace with unnecessary symbols.
Other solutions are not possible to implement yet. The solution I personally feel is most important to be implemented first is that of the ELF executables: they need to stop using copy relocations and they should resolve addresses of functions via the GOT. Only once that is done can libraries start using the “protected” visibility and generate improved code. This implies changing the psABI for the affected libraries, which may not be an easy transition.
An alternative to using the “protected” visibility is to use the symbolic binding options. The code relaxation optimisations would come in handy at this point to optimise at link-time the code that the compiler could not make a decision on. Unfortunately, those options apply to all symbols in a library, so libraries that must have overridable symbols need to use an extra option (--dynamic-list) and list each symbol one by one.
The compiler option -fPIE tells the compiler to generate position-independent code for executables. It is similar to the -fPIC option in that it generates position-independent code, but it has the added optimisation that the compiler can assume none of its symbols can be interposed.
With executables compiled with this option, copy relocations and direct loading of function addresses aren’t used. This solves the problem we had. Therefore, compiling executables with this option allows us to start using some of the optimisations I described before.
Unfortunately, as its description says, this option also generates position-independent code, which can be less efficient than position-dependent code in some situations. My preference would be to have position-dependent code executables without the copy relocations. However, there’s an added side-effect of this option: it defines the __PIC__ macro, whose absence can be used to abort compilations for libraries that have transitioned to the more efficient options.
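A check on that macro could look like the sketch below. The function names picLevel and builtForRelaxedLibrary are hypothetical; the only fact relied upon is that GCC defines __PIC__ for code compiled with -fpic, -fPIC, -fpie or -fPIE and leaves it undefined otherwise. A library header that wants to abort the compilation outright could instead place an #error directive inside an #ifndef __PIC__ block.

```c
#include <assert.h>

/* Returns the value the compiler gave __PIC__ for this translation unit,
   or 0 when the code was compiled position-dependent. */
int picLevel(void)
{
#ifdef __PIC__
    return __PIC__;
#else
    return 0;
#endif
}

/* Hypothetical guard: non-zero when this code may safely be linked against
   a library that no longer tolerates copy relocations. */
int builtForRelaxedLibrary(void)
{
    int level = picLevel();
    assert(level >= 0);
    return level != 0;
}
```

This version reports the situation at run-time only so that it compiles under any set of options; the compile-time variant is the one that would actually protect users from building a broken executable.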
I highly recommend Ulrich Drepper’s “How to Write Shared Libraries” paper. His recommendations did not go as far as suggesting changes to the ABI like I have, but he has many that library developers should adhere to, regardless of whether my recommendations are accepted or not. For example, using static functions and data where possible and avoiding arrays of pointers are recommendations I have made to many people.
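To illustrate the point about arrays of pointers, here is a small sketch with a hypothetical colour table. In position-independent code, every element of an array of pointers needs a load-time relocation, whereas an array of fixed-size character arrays contains no addresses at all.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Array of pointers: each element holds an address, so in a shared library
   the dynamic linker must relocate every entry at load-time. */
static const char *const colourPointers[] = { "red", "green", "blue" };

/* Array of fixed-size character arrays: no addresses stored, no relocations,
   and the table can stay in read-only pages shared between processes. */
static const char colourTable[][6] = { "red", "green", "blue" };

#define COLOUR_COUNT (sizeof colourTable / sizeof colourTable[0])

const char *colourNameViaPointers(size_t index)
{
    assert(index < COLOUR_COUNT);
    return colourPointers[index];
}

const char *colourName(size_t index)
{
    assert(index < COLOUR_COUNT);
    return colourTable[index];
}

/* Returns the index of the named colour, or -1 if it is not in the table. */
int colourIndex(const char *name)
{
    for (size_t i = 0; i < COLOUR_COUNT; ++i)
        if (strcmp(colourTable[i], name) == 0)
            return (int)i;
    return -1;
}
```

The cost is some padding, since every row is as wide as the longest string. For tables with very uneven string lengths, a single long string plus a table of integer offsets achieves the same effect.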
Other work necessary is to improve prelinking support. Shared libraries are position-independent, but they can be prelinked to a preferred location in memory. One optimisation I have yet to see done is to use the read-only pages of prelinked data when the library is loaded at that preferred address (the .data.rel.ro sections).
Jan 13
I’ve previously talked about how the Qt 5 Winter is coming. Since we started talking about that, people have begun asking what are the date limits for each thing, when the API would freeze, when Qt 5.0 would be stable, when we’d release, etc. This blog tries to answer that a little.
Last month, we were preparing a list of features that needed to be done for Qt 5.0. The result of that activity is Task QTBUG-20885, which is a meta-task containing as sub-tasks everything that needs to happen for Qt 5.0’s feature freeze. Those are the changes that must go into Qt 5.0 and not in any later release. They are major refactorings or other changes that would break source- or binary-compatibility.
That task is now mostly accomplished. Lars has suggested a feature freeze date of February 4th, on his post on the Qt development mailing list. There’s not a lot of time left, so if you have something that needs to go in and hasn’t been taken into account, create the task and post now to the mailing list.
What happens next? Well, I don’t have dates, but I can tell you what will be[1] the stages of API freezing for Qt 5.0:
There should be only one alpha release, sometime next month. There may be multiple beta releases, as time progresses and issues are fixed. The point of a beta is to find more issues, so we need to release often for our users to give feedback. There’s also likely going to be only one release candidate, but it’s possible to have more than one as we find issues. And ideally, the final release should be just the last RC rebadged, but history shows we will add a few minor fixes between the two.
This process may not be followed exactly as I listed, though. Given the number of important new features, Lars has said that he might accept new features past the freeze date, provided we can see that there is progress. In other words, we will not wait for features we’re not certain will be delivered soon.
Finally, this process applies only to Qt 5.0. The process for Qt 5.1 and onwards should be different. For one thing, those releases will not have BC breakages, so the provisions relating to BC will not apply. For another, we plan to put in place a different branching model (subject for another blog) and keep the Qt Project maintainers true to their duty of “code is always ready for beta,” meaning that the feedback we’re scheduling for the period between alpha and beta right now should happen before the feature is accepted into the mainline.
Happy hacking.