Apr 30

Quick update to the Qt Project statistics

An update on last Friday’s post about the Qt Project’s statistics: the script ran again this morning, so we now have data for last week. The Qt Project Statistics Page now includes the number of contributors per week:

Visit the statistics page for more graphs.

Apr 27

Qt Project Statistics

For about a month, I’ve been improving a set of scripts to calculate statistics on the Qt Project. What I wanted to know, at first, was how well I was doing and how much I was contributing. Another question I had in mind, and I know many others did too, was “how much does the Qt Project depend on Nokia?”

It started with a simple “|wc -l”, depending on whose statistics I wanted to get. This week, I decided to make graphs, so I spent a great deal of time learning gnuplot instead of doing other work. I’ll blog about the script itself in my next post.

The statistics are online now. You can see them at http://macieira.org/blog/qt-stats. And come back every week, as they will update every Sunday-to-Monday evening.

Let me just point out the overall graph:

As you can see from the graph, the commit rate for the Qt Project was at its lowest during two holiday periods: New Year’s (week 52 of last year and week 1 of this year) and Easter (week 14). Aside from the first week of the project’s existence, it’s consistently been over 400 commits a week, and over 600 commits for 6 of the past 8 weeks. That’s impressive!

And answering the question of how much the project depends on Nokia, take a look at this other one:

You can see that the participation of Nokia developers is still quite high (and will probably remain so), at around 80%. But in turn that means around 20% of the commits going into the Qt Project come from other people, employed by other companies or working in their free time, and this less than 6 months after the official launch of the Qt Project.

More than that, note the trend: Nokia’s participation tends to diminish, not because they’re doing less, but because others are doing more. The following graph, with Nokia’s numbers removed, shows the trend in participation from others:

Apr 03

Qt 5 alpha released

Lars writes to let us know that the first (and hopefully only) Qt 5 alpha has been released! It’s the first new major release series in 7 years and the first major release of the Qt Project (though not the first release of the project, since we released 4.8.1 just a few weeks ago).

I won’t copy what Lars said in his blog. Instead, here are some useful links:

Please note that the alpha release does not support make install yet. You really need to configure it with the -prefix option. We’ll work on an installable package and multiple tarballs for the beta.

Mar 28

Restricting what you can do

I usually write about C++, since it’s the programming language that I use in my daily work. Today, however, I’m talking about its nearest cousin: C. Specifically, about a certain keyword introduced by the C99 standard, which was issued over 12 years ago. Usually, the C standard plays catch-up with the C++ standard (like the C11 standard bringing some C++11 features to C), but each new issue brings a few new things that C++ doesn’t have yet. This cross-pollination between the two standards teams is very welcome.

The one I’m thinking of today is one that, interestingly, has not been added to C++ yet, though many compilers support it. If you’ve paid attention to the blog title, you may realise I’m talking about the restrict keyword.

Raise your hand if you’ve seen it before. Now keep it up only if you’ve seen it outside of the C library headers on your system. Not many, eh?

What does restrict do?

The keyword is defined in the C99 (N1256) and C11 (N1570) standards in sections 6.7.3 “Type qualifiers” and 6.7.3.1 “Formal definition of restrict”, which, as usual, are barely readable for us. The Wikipedia definition is better:

The restrict keyword is a declaration of intent given by the programmer to the compiler. It says that for the lifetime of the pointer, only it or a value directly derived from it (such as pointer + 1) will be used to access the object to which it points.

Well, so what? Why do we need a keyword for that? Well, clearly it’s not just something that the programmer says — otherwise, we’d only write it in the documentation. The Wikipedia text continues by saying that “[t]his limits the effects of pointer aliasing”.

That should now tell you something. At least, it should bring back some memories of compiler warnings about “dereferencing type-punned pointer will break strict-aliasing rules”.
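To see the intent in action, here is a minimal sketch (the function and its names are illustrative, not from the standard): because all three pointers are restrict-qualified, the compiler may assume the stores through dst never modify *factor, so it can load the factor once instead of re-reading it on every iteration.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sketch: restrict promises that dst, src and factor do not
   alias, so the compiler is free to keep *factor in a register for the
   whole loop instead of reloading it after every store. */
void scale(int *restrict dst, const int *restrict src,
           const int *restrict factor, size_t n)
{
    for (size_t i = 0; i != n; ++i)
        dst[i] = src[i] * *factor;
}
```

Without the qualifiers, the compiler would have to assume that a store to dst[i] might have changed *factor and reload it each time.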

The dreaded strict aliasing (or the lack thereof)

The C and C++ standards say that pointers of different types do not alias each other. That’s the strict aliasing, which you often break by dereferencing type-punned pointers. I’ve talked about this in the past, I think. In any case, what matters to us here is when the pointers are allowed to alias each other. Since the C99 standard couldn’t very well go and change a basic principle of the C90 standard, they instead created a keyword to allow the programmer to declare when aliasing will not happen.
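The rule can be illustrated with a short sketch (illustrative names; the expected result assumes IEEE 754 single-precision floats): the cast is the classic type-pun the warning complains about, while memcpy is the well-defined way to reinterpret the bits.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Classic type-pun: float* and uint32_t* are assumed not to alias, so
   this dereference is undefined behaviour and triggers the warning. */
float punned(const uint32_t *u)
{
    return *(const float *)u;   /* UB: breaks strict aliasing */
}

/* The well-defined alternative: memcpy the representation. */
float bits_to_float(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);   /* compilers turn this into a plain move */
    return f;
}
```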

The simplest example is the following pair of functions from the C library (copied verbatim from glibc’s string.h header):

/* Copy N bytes of SRC to DEST.  */
extern void *memcpy (void *__restrict __dest,
		     __const void *__restrict __src, size_t __n)
     __THROW __nonnull ((1, 2));
/* Copy N bytes of SRC to DEST, guaranteeing
   correct behavior for overlapping strings.  */
extern void *memmove (void *__dest, __const void *__src, size_t __n)
     __THROW __nonnull ((1, 2));

Note the difference: memcpy uses the restrict keyword, whereas memmove does not but does say that it is correct for overlapping strings.

Implementing memcpy and memmove

Let’s try to implement these two functions to see if we understand what the keywords mean. Let’s start with memcpy, which looks very simple at first and whose equivalent you must have written hundreds of times already:

// C99 code
void *memcpy(void * restrict dest, const void * restrict src, size_t n)
{
    char *d = dest;
    const char *s = src;
    size_t i;
    for (i = 0; i != n; ++i)
        d[i] = s[i];
    return dest;
}
Having written that, we wonder: why do we need memmove at all? The comment in the header talks about “overlapping strings” and that’s where the code above has an issue. What if we tried to memcpy(ptr, ptr + 1, n)? In the first iteration of the loop above, the byte copied would overwrite the second byte to be read — or worse.

For that reason, the simplest memmove is usually implemented as:

void *memmove(void *dest, const void *src, size_t n)
{
    char *d = dest;
    const char *s = src;
    size_t i;
    if (d < s) {
        for (i = 0; i != n; ++i)
            d[i] = s[i];
    } else {
        i = n;
        while (i) {
            --i;
            d[i] = s[i];
        }
    }
    return dest;
}

Improving the code

If we know that the two pointers do not alias each other, we can do some more interesting things to optimise the copying performance. The first thing we can try is to increase the stride. That is, copy more than one byte at a time, like so:

// C99 code
void *memcpy(void * restrict dest, const void * restrict src, size_t n)
{
    int *di = dest;
    const int *si = src;
    char *d = dest;
    const char *s = src;
    size_t i;
    for (i = 0; i != n / sizeof(int); ++i)
        di[i] = si[i];
    for (i *= sizeof(int); i != n; ++i)
        d[i] = s[i];
    return dest;
}

The above code first copies the data in int-size chunks, then copies the remaining 1 to 3 bytes one byte at a time (epilog copy). It’s more efficient than the original code on architectures where unaligned loads and stores are efficient, or when we know both pointers to be aligned to the proper boundary. In those cases, since we have fewer iterations to execute, the copying is usually faster.

We can definitely improve this code further, for example by using 64-bit loads and stores on architectures that support them, applying this to all architectures by aligning the two pointers in a prolog copy where possible, unrolling the prolog and epilog, or using Single Instruction Multiple Data (SIMD) instructions that the architecture may have.
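As a rough illustration of the prolog idea (a sketch assuming unaligned loads are acceptable, not glibc’s actual implementation), here is a variant that byte-copies until the destination is aligned and then copies in 32-bit chunks:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch only: aligns the destination with a byte-copy prolog, then does
   4-byte stores. The loads through si may still be unaligned, which this
   sketch assumes the architecture handles. Valid only for memcpy-style
   non-overlapping buffers, as the restrict qualifiers declare. */
void *my_memcpy(void *restrict dest, const void *restrict src, size_t n)
{
    char *d = dest;
    const char *s = src;

    /* prolog: byte copies until d is aligned to sizeof(uint32_t) */
    while (n && ((uintptr_t)d & (sizeof(uint32_t) - 1)) != 0) {
        *d++ = *s++;
        --n;
    }

    /* main loop: 4 bytes per iteration, aligned stores */
    uint32_t *di = (uint32_t *)d;
    const uint32_t *si = (const uint32_t *)s;
    size_t i;
    for (i = 0; i != n / sizeof(uint32_t); ++i)
        di[i] = si[i];

    /* epilog: the remaining 0 to 3 bytes */
    for (i *= sizeof(uint32_t); i != n; ++i)
        d[i] = s[i];
    return dest;
}
```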

Note that this is only possible because this is memcpy, not memmove. For the latter function, if we wanted to increase the stride, we would need to additionally check that the distance between the two pointers is at least the size of the chunk of data copied per iteration. Doing that is left as an exercise for the reader.

I’m lazy

Now, I said above that the only reason there’s a language keyword in the first place is so that the compiler can optimise better. Well, that’s exactly what it does. Unfortunately, it’s not easy to prove this straight away with assembly code, as we’re depending on optimisations performed by the compiler, which change over time and are implemented differently in each one. For example, if I use the Intel compiler on the original memcpy function, it will insert a call to _intel_fast_memcpy if the pointers aren’t suitably aligned or the copy size isn’t big enough. GCC, on the other hand, will insert a prolog to align one of the pointers.

What is interesting to note is that the presence of the restrict keyword, everything else being the same, does cause different code generation. With GCC, the output without the keyword contains a couple of instructions comparing the dest pointer to src + 16 and only if the two pointers don’t overlap in the first 16 bytes will it execute SSE2 16-byte copies. ICC is even more extreme: without the keyword, the code generated for memcpy does only byte-sized copies.

In other words, the keyword is being used: when the compiler knows the two blocks don’t overlap, it can generate better code.

Feb 22

The value of passing by value

I’ve written in the past about how passing certain types by value in C++ would be more efficient than passing by constant reference. But it turns out that the ABI rules are somewhat more complex than what I said back in 2008. Time to investigate.

This is also prompted by the discussion on qreal on the Qt development mailing list. In trying to decide on the fate of qreal, we also ran into the discussion of the geometric classes (point, size, rectangle, polygon) and the algebraic classes (matrices, 2D and 3D vectors) and whether they should use single or double precision. I’m not going to go into the arguments discussed there; I’m merely focussing here on the ABI.

Problem statement

Before we go into the ABI documentation and try to compile code, we need to define what problem we’re trying to solve. In general terms, I’m trying to find the most efficient way of passing small C++ structures: when is it better to pass by value, as opposed to by constant reference? And under those conditions, are there any important implications for the qreal discussion?

In the String Theory blog, I concluded that a small structure like QLatin1String, which contained exactly one pointer as a member, would benefit from passing by value. What other types of structures should we look at?

  • Structures with more than one pointer
  • Structures with 32-bit integers on 64-bit architectures
  • Structures with floating-point (single and double precision)
  • Mixed-type and specialised structures found in Qt

I’ll investigate the x86-64, ARMv7 hard-float, MIPS hard-float (o32) and IA-64 ABIs because they are the ones for which I have access to compilers. All of them support passing parameters in registers and have at least 4 integer registers used in parameter passing. Except for MIPS, all of them also have at least 4 floating-point registers used in parameter passing. See my earlier ABI detail blog for more information.

So we will investigate what happens when you pass by value the following structures:

struct Pointers2 { void *p1, *p2; };
struct Pointers4 { void *p1, *p2, *p3, *p4; };
struct Integers2 { int i1, i2; }; // like QSize and QPoint
struct Integers4 { int i1, i2, i3, i4; }; // like QRect
template <typename F> struct Floats2 { F f1, f2; }; // like QSizeF, QPointF, QVector2D
template <typename F> struct Floats3 { F f1, f2, f3; }; // like QVector3D
template <typename F> struct Floats4 { F f1, f2, f3, f4; }; // like QRectF, QVector4D
template <typename F> struct Matrix4x4 { F m[4][4]; }; // like QGenericMatrix<4, 4>
struct QChar { unsigned short ucs; };
struct QLatin1String { const char *str; int len; };
template <typename F> struct QMatrix { F _m11, _m12, _m21, _m22, _dx, _dy; };
template <typename F> struct QMatrix4x4 { F m[4][4]; int f; }; // like QMatrix4x4

And we’ll analyse the assembly of the following program:

template <typename T> void externalFunction(T);
template <typename T> void passOne()
{ externalFunction(T()); }

template <typename T> T externalReturningFunction();
template <typename T> void returnOne()
{ externalReturningFunction<T>(); }

// explicit template instantiation
template void passOne<Pointers2>();
template void passOne<Pointers4>();
template void passOne<Integers2>();
template void passOne<Integers4>();
template void passOne<Floats2<float> >();
template void passOne<Floats2<double> >();
template void passOne<Floats3<float> >();
template void passOne<Floats3<double> >();
template void passOne<Floats4<float> >();
template void passOne<Floats4<double> >();
template void passOne<Matrix4x4<float> >();
template void passOne<Matrix4x4<double> >();
template void passOne<QChar>();
template void passOne<QLatin1String>();
template void passOne<QMatrix<float> >();
template void passOne<QMatrix<double> >();
template void passOne<QMatrix4x4<float> >();
template void passOne<QMatrix4x4<double> >();
template void returnOne<Pointers2>();
template void returnOne<Pointers4>();
template void returnOne<Integers2>();
template void returnOne<Integers4>();
template void returnOne<Floats2<float> >();
template void returnOne<Floats2<double> >();
template void returnOne<Floats3<float> >();
template void returnOne<Floats3<double> >();
template void returnOne<Floats4<float> >();
template void returnOne<Floats4<double> >();
template void returnOne<Matrix4x4<float> >();
template void returnOne<Matrix4x4<double> >();
template void returnOne<QChar>();
template void returnOne<QLatin1String>();
template void returnOne<QMatrix<float> >();
template void returnOne<QMatrix<double> >();
template void returnOne<QMatrix4x4<float> >();
template void returnOne<QMatrix4x4<double> >();

In addition, we’re interested in what happens to non-structure floating point parameters: are they promoted or not? So we’ll also test the following:

void passFloat()
{
    void externalFloat(float, float, float, float);
    externalFloat(1.0f, 2.0f, 3.0f, 4.0f);
}

void passDouble()
{
    void externalDouble(double, double, double, double);
    externalDouble(1.0, 2.0, 3.0, 4.0);
}

float returnFloat()
{
    return 1.0f;
}

double returnDouble()
{
    return 1.0;
}

Analysis of the output

x86-64

You might have noticed I skipped old-style 32-bit x86. That was intentional, since that platform does not support passing by registers anyway. The only conclusion we could draw from that would be:

  • whether the structures are stored on the stack in the place of the argument, or whether they’re stored elsewhere and passed by pointer;
  • whether single-precision floating-point is promoted to double-precision

Moreover, I’m intentionally ignoring it because I want people to start thinking of the new ILP32 ABI for x86-64, enabled by GCC 4.7’s -mx32 switch, which follows the same ABI as the one described below (with the exception that pointers are 32-bit).

So let’s take a look at the assembly results. For parameter passing, we find out that

  • Pointers2 is passed in registers;
  • Pointers4 is passed in memory;
  • Integers2 is passed in a single register (two 32-bit values per 64-bit register);
  • Integers4 is passed in two registers only (two 32-bit values per 64-bit register);
  • Floats2<float> is passed packed into a single SSE register, no promotion to double;
  • Floats3<float> is passed packed into two SSE registers, no promotion to double;
  • Floats4<float> is passed packed into two SSE registers, no promotion to double;
  • Floats2<double> is passed in two SSE registers, one value per register;
  • Floats3<double> and Floats4<double> are passed in memory;
  • Matrix4x4 and QMatrix4x4 are passed in memory regardless of the underlying type;
  • QChar is passed in a register;
  • QLatin1String is passed in registers;
  • The floating-point parameters are passed one per register, without float promotion to double.

For return values, the conclusion is the same as above: if the value is passed in registers, it's returned in registers too; if it's passed in memory, it's returned in memory. This leads us to the following conclusions, supported by careful reading of the ABI document:

  • Single-precision floating-point types are not promoted to double;
  • Single-precision floating-point types in a structure are packed into SSE registers if registers are still available;
  • Structures bigger than 16 bytes are passed in memory, with an exception for __m256, the type corresponding to one AVX 256-bit register.
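The 16-byte cutoff can be made concrete with plain C mirrors of the structures above (names are mine; the sizes assume the usual LP64 x86-64 data model):

```c
#include <assert.h>

/* C mirrors of the template structures; sizeof shows on which side of the
   16-byte SysV x86-64 cutoff each one falls. */
struct Floats2f { float f1, f2; };           /*  8 bytes: one SSE register     */
struct Floats4f { float f1, f2, f3, f4; };   /* 16 bytes: two SSE registers    */
struct Floats2d { double f1, f2; };          /* 16 bytes: two SSE registers    */
struct Floats3d { double f1, f2, f3; };      /* 24 bytes: passed in memory     */
struct Pointers2 { void *p1, *p2; };         /* 16 bytes: two integer registers */
```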

IA-64

Here are the results for parameter passing:

  • Both Pointers structures are passed in registers, one pointer per register;
  • Both Integers structures are passed in registers, packed like x86-64 (two ints per register);
  • All of the Floats structures are passed in registers, one value per register (unpacked);
  • Matrix4x4<float> is passed entirely in registers: half of it (the first 8 floats) in floating-point registers, one value per register (unpacked); the other half in integer registers out4 to out7, in the memory representation (packed);
  • Matrix4x4<double> is passed partly in registers: half of it (the first 8 doubles) in floating-point registers, one value per register (unpacked); the other half in memory;
  • QChar and QLatin1String are passed in registers;
  • Both QMatrix are passed entirely in registers, one value per register (unpacked);
  • QMatrix4x4 is passed like Matrix4x4, except that the integer is always in memory (the structure is larger than 8*8 bytes);
  • Individual floating-point parameters are passed one per register; type promotion happens internally in the register.

For the return values, we have:

  • The floating-point structures with up to 8 floating-point members are returned in registers;
  • The integer structures of up to 32 bytes are returned in registers;
  • All the rest is returned in memory supplied by the caller.

The conclusions are:

  • Type promotion happens in hardware, as IA-64 does not have specific registers for single or double precision (its FP registers hold only extended-precision data);
  • Homogeneous structures of floating-point types are passed in registers, up to 8 values; the rest goes to the integer registers if there are some still available or in memory;
  • All other structures are passed in the integer registers, up to 64 bytes;
  • Integer registers are allocated for passing any and all types, even if they aren't used (the ABI says they should be used in the case of C without prototypes).

ARM

I've compiled the code only for ARMv7, with the floating-point parameters passed in the VFP registers. If you're reading this blog, you're probably interested in performance and therefore you must be using the "hard-float" model for ARM. I will not concern myself with the slower "soft-float" mode. Also note that this is ARMv7 only: the ARMv8 64-bit (AArch64) rules differ slightly, but no compiler for it is available.

Here are the results for parameter passing:

  • Pointers2, Pointers4, Integers2, and Integers4 are passed in registers (note that the Pointers and Integers structures are the same in 32-bit mode);
  • All of the Floats structures are passed in registers, one value per register, without promotion of floats to doubles; the values are also stored in memory, but I can't tell if this is required or just GCC being dumb;
  • All types of Matrix4x4, QMatrix and QMatrix4x4 are passed in both memory and registers, with the registers holding the first 16 bytes;
  • QChar and QLatin1String are passed in registers;
  • The floating-point parameters are passed one per register, without float promotion to double.

For returning those types, we have:

  • All of the Floats structures are returned in registers, and GCC then stores them all to memory even if they are never used afterwards;
  • QChar is returned in a register;
  • Everything else is returned in memory.

Note that the return type is one of the places where the 32-bit AAPCS differs from the 64-bit one: there, if a type is passed in registers to a function where it is the first parameter, it is returned in those same registers. The 32-bit AAPCS restricts the return-in-registers to structures of 4 bytes or less.

My conclusions are:

  • Single-precision floating-point types are not promoted to double;
  • Homogeneous structures (that is, structures containing one single type) of a floating-point type are passed in floating-point registers if the structure has 4 members or fewer.

MIPS

I have attempted both a MIPS 32-bit build (using the GCC-default o32 ABI) and a MIPS 64-bit (using -mabi=o64 -mlong64). Unless noted otherwise, the results are the same for both architectures.

For passing parameters, they were:

  • Both types of Integers and Pointers structures are passed in registers; on 64-bit, two 32-bit integers are packed into a single 64-bit register like x86-64;
  • Floats2<float>, Floats3<float>, and Floats4<float> are passed in integer registers, not in the floating-point registers; on 64-bit, two floats are packed into a single 64-bit register;
  • Floats2<double> is passed in integer registers; on 32-bit, two 32-bit registers are required to store each double;
  • On 32-bit, the first two doubles of Floats3<double> and Floats4<double> are passed in integer registers, the rest in memory;
  • On 64-bit, Floats3<double> and Floats4<double> are passed entirely in integer registers;
  • Matrix4x4, QMatrix, and QMatrix4x4 are passed in integer registers (the portion that fits) and in memory (the rest);
  • QChar is passed in a register (on MIPS big-endian, it's passed in bits 16-31);
  • QLatin1String is passed in two registers;
  • The floating point parameters are passed one per register, without float promotion to double.

For the return values, MIPS is easy: everything is returned in memory, even QChar.

The conclusions are even easier:

  • No float is promoted to double;
  • No structure is ever passed in floating-point registers;
  • No structure is ever returned in registers.

General conclusion

There are only a few aggregate conclusions we can draw. One of them is that single-precision floating-point values are not explicitly promoted to double when formal parameters are present. The automatic promotion probably happens only for floating-point values passed via an ellipsis (...), but our problem statement was about calling functions where the parameters are known. The only slight deviation from the rule is IA-64, but it's unimportant as the hardware, like x87, only operates in one mode.

For the structures containing integer members (that includes pointers), there's nothing further to optimise: they are loaded into registers exactly as they appear in memory. That means the portion of the register corresponding to padding might contain uninitialised or garbage data, or something really strange might happen, as with MIPS in big-endian mode. It also means, on all architectures, that types smaller than a register do not occupy the entire register, so they might be packed with other members.

Another conclusion is quite obvious: structures containing floats are smaller than structures containing doubles, so they will use less memory or fewer registers when passed.

To continue drawing conclusions, we need to exclude MIPS, since it passes everything in the integer registers and returns everything in memory. If we do that, we can see that all ABIs provide an optimisation for structures containing only one floating-point type. These are called by slightly different names in the ABI documents, all meaning homogeneous floating-point structure. The optimisation is that the structure is passed in floating-point registers under certain conditions.

The first one to break down is actually x86-64: the upper limit is 16 bytes, limited to two SSE registers. The rationale for this seems to be passing one double-precision complex value, which takes 16 bytes. That we are able to pass four single-precision values is an unexpected benefit.

The remaining architectures (ARM and IA-64) can pass more values by register, and always at one value per register (no packing). IA-64 has more registers dedicated to parameter passing, so it can pass more than ARM.

Recommendations for code

  • Structures of up to 16 bytes containing integers and pointers should be passed by value;
  • Homogeneous structures of up to 16 bytes containing floating-point should be passed by value (2 doubles or 4 floats);
  • Mixed-type structures should be avoided; if they exist, passing by value is still a good idea.

The above is only valid for structures that are trivially copyable and trivially destructible. All C structures (PODs in C++) meet those criteria.
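As a sketch of what these recommendations look like in an API (illustrative names and plain C syntax, not Qt's actual signatures), small POD types travel by value while the large matrix travels by const pointer, the C analogue of a constant reference:

```c
#include <assert.h>

/* Illustrative only: following the recommendations above. */
struct PointF { double x, y; };          /* 16 bytes: pass by value          */
struct Matrix4x4 { double m[4][4]; };    /* 128 bytes: pass by reference     */

/* 16-byte homogeneous double structure: by value, lands in two registers
   on the ABIs discussed (except MIPS). */
double pointX(struct PointF p)
{
    return p.x;
}

/* Far over any register limit: by const pointer, the caller passes only
   an address. */
double trace(const struct Matrix4x4 *m)
{
    double t = 0.0;
    for (int i = 0; i < 4; ++i)
        t += m->m[i][i];
    return t;
}
```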

Final note

I should note that the recommendations above do not always produce more efficient code. Even though the values can be passed in registers, every single compiler I tested (GCC 4.6, Clang 3.0, ICC 12.1) still does a lot of memory operations in some cases. It's quite common for the compiler to write the structure to memory and then load it into the registers. When it does that, passing by constant reference would be more efficient since it would replace the memory loads with arithmetic on the stack pointer.

However, those are simply a matter of further optimisation work by the compiler teams. The three compilers I tested for x86-64 optimise differently and, in almost all cases, at least one of them managed to do without memory access. Interestingly, the behaviour changes also when we replace the padding space with zeroes.

Jan 19

Update and benchmark on the dynamic library proposals

My last blog on the dynamic libraries on Linux attracted over 15000 visits, which was quite unexpected (it’s 15x more than the usual traffic). It got linked from reddit and ycombinator and comments there and in the previous post have raised some interesting questions I’ll try to answer.

Background

First, a quick background: LD_PRELOAD and /etc/ld.so.preload tell the dynamic linker to load a certain ELF module before the rest of the normal initialisation sequence. It’s preloaded before the rest of the modules, but after two important modules have been loaded: the executable itself and the dynamic linker. By itself, this means nothing at all about symbol hijacking. Its sole purpose is to load something. I have, for example, used it for loading a different binary of a library that a program required. That works fine.

Yes, it is little-known and little-used

If you complained that I said it’s little-known, you’re somewhat biased. If you complained, it’s because you knew about it, therefore you’re part of the minority that knows about it. Just think about it: there are millions of people directly using Linux today in the world. How many do you think know about this feature?

Even more so, think about how often:

  • LD_PRELOAD is used compared to running applications without it
  • LD_PRELOAD is used to load an ELF module compared to how many ELF modules are loaded by regular means
  • how many functions are interposed using LD_PRELOAD versus how many aren’t

The ratio is at least 1:1000 for a heavy user of the feature (like me!) in the best of the circumstances. It’s probably several orders of magnitude more than that for the average. Something that is used in one case in a million qualifies as little-used to me.

No, I wasn’t proposing to get rid of it (not entirely)

Some people suggested I was thinking of getting rid of the preloading feature in exchange for a few cycles saved. I would still be within my rights to suggest that, given the improvements and how rarely it is used, but I wasn’t. I’ve never proposed getting rid of the preloading feature, and my proposal would not harm the most often used cases of interposition.

This requires a bit more explanation, so bear with me please.

Symbol interposition works by adding a symbol to the symbol table before the “rightful” symbol appears. The dynamic linker will resolve the symbol to the first occurrence it finds in the search order, so if you preload a library out of its order, its symbols will have higher priority than they would otherwise. The extreme case is when you preload a library or module that wouldn’t otherwise be loaded. But remember something I said before: preloaded modules are loaded after two others are loaded, so they don’t get the chance to interpose symbols defined by those.
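The classic use of this mechanism is a malloc interposer. Here is a hedged sketch (glibc-specific: it assumes __libc_malloc, glibc's internal entry point, which other C libraries do not provide). Built as a shared object and preloaded, this definition appears earlier in the search order than libc's, so calls resolve to it:

```c
#include <stddef.h>
#include <stdlib.h>

/* glibc's internal allocator entry point (assumption: glibc only). */
extern void *__libc_malloc(size_t);

static unsigned long mallocCalls;   /* how many calls we intercepted */

/* Our definition shadows libc's because it comes earlier in the dynamic
   linker's search order; we count the call, then forward to the real
   allocator. */
void *malloc(size_t n)
{
    ++mallocCalls;
    return __libc_malloc(n);
}

unsigned long mallocCallCount(void)
{
    return mallocCalls;
}
```

The same shadowing happens when the definition lives in the main executable, which is why interposing works at all: the resolver simply takes the first definition it finds.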

If the executable performed a copy relocation on a data symbol, then LD_PRELOAD’ed modules cannot interpose it. For that reason, I am not counting interposition of data symbols as valid. In fact, in the 14 years I’ve been hacking on Linux, I’ve never done that, so I guess the chances of it happening are a billion to one or even lower. What’s more, my proposal would do away with copy relocations, which may make data interposition a valid use case.

The next important thing you must understand is that my proposal would do away with interposition of intra-library symbols, but not inter-library ones. My friend Michael Meeks’s proposal of -Bdirect linking might, but even that proposal wouldn’t totally do away with it.

What do I mean by this? Intra-library means “within the same library,” while inter-library means “across libraries” (think of “Internet” vs “intranet”). My proposal was intended to improve binding of symbols inside one library, because we can gain performance doing that without losing position-independent code (PIC) and the advantages that come with it (like address space layout randomisation). Specifically because we don’t want to lose PIC support and we don’t want to go back to pre-ELF days and their problems (see Ulrich Drepper’s paper for some information on them), all inter-library symbol resolution would remain as-is, via PLTs and GOTs, including the ability to interpose symbols.

And here’s why I think we’re entitled to doing that: because you cannot do it anyway unless the library has been specifically designed to allow it, like glibc is. Let’s take the code from the last blog:

extern void *externalVariable;
extern void externalFunction(void);
void myFunction()
{
    externalVariable = &externalFunction;
}

And amend it like so:

void externalFunction(void)
{
}

If we compile this code with optimisation (GCC’s -O is enough) and inspect the assembly output, we can notice that both functions are present in the output but that myFunction does not call externalFunction. In other words, the compiler inlined one function into the other, even if the inline keyword was never added to it, and that expanded to zero code. With advances such as link-time optimisation, even moving the function to another compilation unit might not be enough to prevent the inlining.

That’s why I said that to support the case of intra-library symbol interposition, the library must be specifically designed to allow it, which is definitely still possible under my proposal. Most libraries aren’t designed like that and will never be, so I am confident that optimising for the greater majority of the libraries instead of the few is warranted (taking my system: I counted 3623 distinct libraries and plugins and I’m pretty sure none except libc and libpthread allow for interposition, so it’s probably a 1000:1 case again).
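The inlining effect can be reproduced in a self-contained sketch (illustrative; compile with -O and inspect the assembly to confirm, as the exact result varies by compiler):

```c
/* Illustrative sketch: with both functions in the same translation unit
   and optimisation enabled, the compiler typically inlines the body of
   externalFunction (the increment) directly into myFunction, leaving no
   call instruction behind for an interposer to redirect. */
static int calls;

void externalFunction(void)
{
    ++calls;
}

void myFunction(void)
{
    externalFunction();   /* typically inlined away at -O */
}
```

The behaviour is the same either way; what disappears is the interposable reference, which is exactly the point being made above.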

Benchmarks

Another important remark I saw in the comment threads was about the lack of benchmarks in my previous blog. Here they are.

Please note that “benchmark” means “comparison.” It does not imply “speed executing something.”

How I did it

I started by trying to find an executable I could run non-interactively, that executed a relatively CPU-intense activity and quit. That executable should be in my standard set of built executables, as I didn’t want to recompile the entire system. I settled on KDE’s kbuildsycoca4 with the options --noincremental --nosignal: it looks for all *.desktop files in the search paths and compiles a database for faster lookup, called the SYstem COnfiguration CAche. The options tell it to ignore existing databases and do it all, plus avoid signalling running applications over D-Bus to reload their settings.

The tests were run on my laptop, which has an Intel Core i7-2620M clocked at 2.6 GHz, with an SSD but no tmpfs temporary dir, with 2x32 kB of L1 cache, 256 kB of L2 cache, 4 MB of L3 cache and 4 GB of main RAM. I locked the CPU scaling governor to “performance”, so the CPU was running at 2.6 GHz when the test starts; it soon goes into turbo mode and stays there (3.2 GHz). The system was not completely idle while running the test, but relatively so. To try to avoid other problems, the native benchmarks were run under the FIFO real-time scheduler, with affinity to a single processor. The tests were run in 64-bit mode and were run “warm”: I ran the benchmark once after any recompilation and discarded those results.

I did four sets of tests, as follows:

  1. The first, the baseline, was a regular build on my system, with no change to default KDE 4 build options or to Qt 4.8’s.
  2. The second was modified by adding -Bsymbolic-functions to the five KDE libraries and six Qt libraries used by the program.
  3. The third was modified by replacing -Bsymbolic-functions with -Bsymbolic and recompiling the same 11 libraries.
  4. Finally, on the fourth, in addition to keeping -Bsymbolic, I made all symbols exported from those 11 libraries have protected visibility. This required surprisingly few modifications, as the libraries were more-or-less ready to be built on Windows too. Each library already has an XXXX_EXPORT macro associated with it because of the “hidden” visibility support, which right now expands to __attribute__((visibility("default"))). Moreover, the buildsystem for those libraries already defines a specific macro only during their builds. So it was easy to arrange that, when the buildsystem’s macro is defined, the XXXX_EXPORT macro expands to __attribute__((visibility("protected"))) instead; otherwise it remains unchanged.

Each set of tests consisted of:

  • Run Ulrich Drepper’s relinfo script on the 11 libraries and tally up the types of relocations
  • Run Valgrind’s cachegrind tool with branch-prediction and the cache sizes set to match my machine
  • Run the perf stat tool to gather hardware counters. Each run of the tool reported the average of 10 runs of kbuildsycoca4, all run under FIFO real-time scheduler. After the first warm-up run, I chose the best of 3 runs in quick succession

You can download the raw results I collected from here (they also include results with LD_BIND_NOW=1).


First of all, I went into these benchmarks fully expecting that nothing would be visible in the performance benchmarks. It’s clear that these are micro-optimisations, so in a fairly large program they should be drowned out by inefficiencies in other parts. Also, considering that my system wasn’t completely idle when running the CPU benchmarks, the numbers have a degree of noise which could hide the faint results. The results have, however, shown a few clear improvements.

Here’s what I found:

  • Relocations: relocations are work that the dynamic linker must do either at load-time (non-PLT relocations) or during run-time (PLT). Reducing or simplifying relocations improves start-up and run-time performance.
    • The number of non-PLT relocations drops by 2.65% with protected visibility: that was expected because the linker options affect only the PLT. To change the non-PLT relocation count, a change to the compilation was necessary.
    • The number of relative relocations doubles with the linker options: that was also expected, because the linker can bind the relocation to the symbol that is inside the library being linked. Instead of referring to the symbol by its name and triggering a full look-up, a relative relocation simply records how many bytes past a fixed mark (the load address) the relocation should be, which is much simpler to execute. The number increases again with -Bsymbolic compared to -Bsymbolic-functions because the linker can bind non-functions too. The number dropped with protected visibility, but by less than the number of total relocations removed.
    • The number of PLT entries is one-third of the original because the linker can make intra-library function calls directly instead of going through the PLT stub. Each PLT entry corresponds to 8 bytes in the .got.plt section and 16 bytes of stub, which means this reduction saved as many as 15571 relocations and as much as 373 kB of memory size. This is confirmed by the count of PLT entries used for local symbols, which drops to nearly zero. The number isn’t exactly zero because both QtCore and QtGui have been prepared for five of their symbols to be interposed when built with -Bsymbolic-functions, a concern I didn’t take into account in the protected-visibility work because it wasn’t relevant there.
        Note that there must have been an error with the -Bsymbolic builds, because two libraries had a higher PLT count than they should have. I have not investigated whether this was a mistake on my part or a bug in the linker.
  • Valgrind results: valgrind executes the program in a simulated CPU, which on one hand means we get consistent results independent of what CPU I run this in and how idle or busy my system was, but on the other hand may or may not reflect reality (YMMV).
    • Instruction count decreases slightly, by 0.9%, 1.1% and 1.2%
    • Data accesses to the L1 data cache decrease slightly, by 1.4%, 1.6% and 2.1%
    • Last-level cache references decrease by 7% while the LL cache miss rate remains constant, probably because there are fewer instructions executed, fewer data accesses and a slight improvement in the L1D miss rate
    • Number of indirect branches executed drops by 22%
    • The indirect branch misprediction rate drops considerably from 22% in the original to 16% with just the linker options and 8.8% with the protected visibility, while the overall branch misprediction rate drops from 4.7% to 4.3% and then to 4.1%. With 2.9 million fewer mispredicted branches, at a 20-cycle misprediction penalty, that’s 57 million cycles saved.
  • Perf results: perf uses hardware counters from the CPU to do its bidding, but it is subject to scheduling issues. The kbuildsycoca4 program does context-switch in its execution because it tries to verify with the D-Bus daemon if another instance isn’t already running. Moreover, this program is I/O intensive, meaning it makes a lot of system calls, which is why I let the benchmarks run with a “warm” system cache. Unlike the Valgrind results, there’s a great deal of noise and error in the numbers from perf because they represent an actual CPU.
    • There’s a roughly 3% overall performance improvement as measured by the execution time. The noise in the number doesn’t show which solution is best, but it shows that all three are better than the unmodified library code.
    • There’s a 3 to 4% improvement in the number of cycles required to complete the operation. Unfortunately, the numbers show performance decreasing as I optimise more, which is counter-intuitive and which I cannot explain (noise or real mis-optimisation). I think my machine was slightly less idle during the last test set, as the last results I got showed much worse performance with a much bigger standard deviation.
    • There’s roughly a 3% improvement in the number of instructions executed, which is similar to the reduction in cycles, but also shows that more instructions are executed per cycle with the optimisations. I cannot say exactly why, but I imagine it’s because of reductions in branching, branch misprediction and cache misses. The calculation of instructions per cycle shows an improvement of close to 1% in two of the three benchmarks.
    • Branches executed fall by 4 to 5%, but the reduction is in the opposite order of the number of branches I know are in the code, which means there was a considerable amount of noise in this test. Another similar metric shows a roughly 5% improvement in branch loads.
    • The rates of cache misses and branch mispredictions remain more or less constant, which coupled with the number of branches reducing means we have an improvement in performance due to fewer absolute mispredictions happening. I cannot conclude anything about a reduction in cache references because the numbers varied too much.
        This is supported by the calculation of cycles gained in the reduction of branch misprediction. The SandyBridge architecture has a 20-cycle penalty for branch misprediction, so if we calculate how many cycles were lost in each benchmark due to mispredicted branches and subtract from the original, we get roughly 6 million cycles gained (0.24% of the total), which is in the same order as the improvement in instruction throughput (instructions per cycle).


The numbers are fairly small, as was expected, since we’re talking about micro-optimisations. However, three distinct benchmarks (execution time, cycle count and instruction count) have shown, with a reasonable degree of confidence given the limited sample size I had, that there’s a performance improvement in the order of 3%. That’s more or less what I hoped to see, but much more than I expected to be able to show.

Another important aspect is that this was a non-GUI testcase, even though, by virtue of library dependencies, both the QtGui and kdeui libraries were present. Note how the two libraries have, together, 45824 relocations and 14708 PLT entries in the original library set, which correspond to 73.3% of the total relocations and 62.4% of the total PLT entries in play, respectively, as well as 65% of the PLT entries for local symbols. The number of relocations is indicative also of the size of the code in those libraries. But since the application isn’t a GUI one, that code is mostly not executed.

If we consider that the problem of cache misses increases with code size (and the cache miss rate could increase too, compounding the effect) and that of cycles lost due to mispredicted branches increases with the number of branches unless the misprediction ratio drops (which the benchmarks have shown to remain stable), we can expect that a GUI application could gain even more in performance due to these improvements. That’s difficult to prove however in a GUI application, so we’ll have to stay with just the theoretical exercise.

In all, I still think this is warranted. The drawbacks are fairly minor: interposition of symbols is rarely used already, and interposition of symbols in intra-library lookups is close to non-existent in libraries that aren’t designed for it. All we need to do now is change the status quo, which is probably the hardest part.

Who will support me?

Jan 16

Sorry state of dynamic libraries on Linux

Last week, we identified a bug in Qt with Olivier’s new signal-slot syntax. Upon further investigation, it turned out not to be a Qt issue, but an ABI one. That prompted me to investigate further and to conclude that dynamic libraries need a big overhaul on Linux.

tl;dr (a.k.a. Executive Summary)

Shared libraries on Linux are linked with -fPIC, which makes all variable references and function calls indirect, unless they are static. That’s because in addition to making it position-independent, it makes every variable and function interposable by another module: it can be overridden by the executable and by LD_PRELOAD libraries. The indirectness of accesses is a performance impact and we should do away with it, without sacrificing position-independence.

Plus, there are a few more actions we should take (like prelinking) to improve performance even further.

Jump to existing or proposed solutions, Google+ discussion.


Note: in the following, I will show x86-64 64-bit assembly and will restrict myself to that architecture. However, the problems and solutions apply to many other architectures too, like x86 and ARM, so the discussion is relevant even if you don’t use x86-64. The only platform to which this mostly does not apply is actually IA-64.

The basics

Imagine the following C file, which also compiles in C++ mode:

extern void *externalVariable;
extern void externalFunction(void);
void myFunction()
{
    externalVariable = &externalFunction;
}

The code above demonstrates three features of the languages in one function: it loads the address of a function, it calls a function and it writes to a variable. The compiler does not know where the function and variable are: they might be in another .o file linked into this ELF module or they might be in another ELF module (i.e., a library) this module links to.

The compiler produces the following assembly output (gcc 4.6.0, -O3):

        call    externalFunction
        movq    $externalFunction, externalVariable(%rip)

This assembly snippet is making use of two symbols whose values the assembler does not know. When assembled, the assembler produces a .o with three relocations. Here, GCC has produced the most efficient and most compact compilation of the code I wrote.

When we link this .o into an executable, we start to see the drawbacks. The first is that both instructions need to encode, in their bits, the values of the symbols, which we didn’t know. So the linker must somehow fix this. It fixes the call instruction by making it call a stub or trampoline, which jumps to the actual address. This stub is placed in a separate section of code called the Procedure Linkage Table (PLT). The contents of the PLT stub are not that important; suffice it to say that it is an indirect jump.

The movq instruction cannot be fixed. There’s simply no way, because it writes a constant value to a constant location, directly. Even if we allowed for an instruction or a pair of instructions wide enough to write any 64-bit value to any variable in the 64-bit space, we still have a problem: those values are not known at link time. So instead of fixing the instruction, the linker “fixes” the values. For the address of externalFunction, it uses the address of the PLT stub it created in the previous paragraph. For the externalVariable variable, it will create a copy relocation, which means the dynamic linker will need to find the variable where it is, copy its value to a fixed location in the executable and then tell everyone that the variable is actually in the executable.

What are the consequences of this? For the PLT call, it’s a simple performance impact which could not be avoided. Since the address of the actual externalFunction function is not known at compile and link time, and we don’t want to leave a text relocation, the only way to place that call is to find the address at run-time and call it indirectly.

For the copy relocation, the consequences for the executable are small. The code it will execute is still the most efficient and most compact. The dynamic linker will have to find where the symbol actually is at load-time, which is something that it would have to do anyway, plus copy its contents, checking that the size hasn’t changed. This is done only once, then the code runs in its most efficient form.

The fact that we resolved &externalFunction to the address of the PLT stub means that any use of that function pointer (an indirect call) will end up in a function that does an indirect call too. That is, it’s a doubly-indirect call. I seriously doubt any processor can do proper branch prediction, speculative execution, and prefetching of code under those circumstances.

It gets worse

So far we’ve analysed what happens in an executable. Now let’s see what happens when we try to build the same C code for a shared library. We do that by introducing the -fPIC compiler option, which tells the compiler to generate position-independent code. The compiler produces the following assembly output:

        call    externalFunction@PLT
        movq    externalFunction@GOTPCREL(%rip), %rdx
        movq    externalVariable@GOTPCREL(%rip), %rax
        movq    %rdx, (%rax)

When assembled, the .o still contains three relocations, albeit of different type.

When we compare the output of the position-dependent and the position-independent code, we notice the following:

  1. The call is still a call, but now we’re explicitly calling the PLT stub. This might seem irrelevant, since the linker would have fixed the call anyway to point to the PLT if it had to, but it isn’t.
  2. The single movq instruction was split in three. This is required by the x86-64 processor, since the instruction set cannot encode a 64-bit value and the 64-bit address to store it at in the same instruction (such an instruction would be at least 17 bytes long, which is two bytes longer than the maximum instruction length).
  3. The values for the two symbols are loaded indirectly. Instead of encoding the two values in those two middle movq instructions, the compiler is loading the values from another linker-generated structure called the Global Offset Table (GOT).

The compiler needed to generate the code above since it doesn’t know where the symbols will actually be. As was the case before, those symbols can be linked into the same ELF module as this compilation unit, or they may be found elsewhere in another ELF module this one links to.

Moreover, the compiler and linker need to deal with the possibility that an executable might have done exactly what our executable in the previous section did: created a copy relocation on the variable and fixed the address of the function to its own PLT stub. In order to work properly, this code must deal with the fact that its own variable might have ended up elsewhere, and that &externalFunction might have a different value.

That means the indirect call through the PLT and the three movq instructions remain, even if those two symbols were in the same compilation unit!

The problem is that even if at first glance you’d think that the compiler should know for a fact where those symbols are, it actually doesn’t. The -fPIC option doesn’t enable only position-independent code. It also enables ELF symbol interposition, which is when another module “steals” the symbol. That happens normally by way of the copy relocations, but can also happen if an LD_PRELOAD’ed module were to override those symbols. So the compiler and linker must produce code that deals with that possibility.

In the end, we’re left with indirect calls, indirect symbol address loadings and indirect variable references, which impact code performance. In addition, the linker must leave behind relocations by name for the dynamic linker to resolve at load-time.

All this for the possibility of interposition?

Yes, it seems so. The impact is there for this little-known and little-used feature. Instead of optimising for the common-case scenario where the symbols are not overridden, the ABI optimises for the corner case.

Another argument is that the ABI optimises for executable code, placing the impact on the libraries. The argument is valid if the executables are much larger and more complex than the libraries themselves. It’s valid too if we consider that application developers write sloppy code, whereas library developers will write very optimised code.

I don’t think that argument holds anymore. Libraries have got much more complex in the past 10-15 years and do a lot more than they once did. They are not mere wrappers around system calls, like libc 4 and 5 were on Linux in the late 90s. Moreover, if we consider the rise of interpreted languages, like Perl, Python, Ruby, even QML and JavaScript, the code belonging to the ELF executables is negligible. Compare the size of the executables with the libraries that actually do the interpretation:

-rwxr-xr-x. 2 root root   13544 Aug  5 06:27 /usr/bin/perl
-rwxr-xr-x. 2 root root    9144 Apr 12  2011 /usr/bin/python
-rwxr-xr-x. 1 root root    5160 Dec 29 13:46 /usr/bin/ruby
-r-xr-xr-x. 1 root root 1763488 Apr 12  2011 /usr/lib64/libpython2.7.so.1.0
-rwxr-xr-x. 1 root root  947736 Dec 29 13:46 /usr/lib64/libruby.so.1.8.7
-rwxr-xr-x. 1 root root 1524064 Aug  5 06:27 /usr/lib64/perl5/CORE/libperl.so

That’s even valid for interpreters that JIT the code. As optimised as the code they generate can be, current understanding is that operations with critical performance are implemented in native code, which means libraries or plugins.

Existing solutions

Partial solution for private symbols

When developing your library, if you know that certain symbols are private and will never be used by any other library, you have an option: you can declare their ELF visibility to be “hidden”, which has two consequences. The clear one is that the linker will not add the hidden symbols to the dynamic symbol table, so other ELF modules simply cannot find them. If they can’t find them, they can’t steal them. And if they can’t steal them, the linker does not need to produce a PLT stub for the function call, so the call instruction will be linked into a simple, direct call, as it was in the executable in the first part.

The other consequence is an optimisation that the compiler does. Since it also knows that the externalVariable variable cannot be stolen, it does not need to address the variable indirectly. The generated assembly becomes:

        call    externalFunction@PLT
        movq    externalFunction@GOTPCREL(%rip), %rax
        movq    %rax, externalVariable(%rip)

The .o file will still contain three relocations. However, note how taking the address of the externalFunction function is still done indirectly, even though the compiler knows it cannot be interposed. That means the linker will still generate a load-time relocation for the dynamic linker to get the address of that function. Fortunately, it’s a simpler relocation, since the symbol name itself is not present.

If there’s a reason for getting the address indirectly like this, I have yet to find it.

Partial solution for public non-interposable symbols

If your symbols are public, however, you cannot use the ELF “hidden” visibility trick. But if you know that they cannot and will not ever be stolen or interposed, you have another possibility, which is to tell that to the compiler and linker.

If you declare a variable with ELF “protected” visibility, you’re telling the compiler and linker that it cannot be stolen, yet can be placed in the dynamic symbol table for other ELF modules to reference. You just have to be absolutely sure that they will not ever be interposed, because that will create subtle bugs that are hard to track down. That includes access to those symbols by position-dependent executable code, like we did in the first section.

The GCC syntax __attribute__((visibility("protected"))) works on ELF platforms only, whereas the one with the “hidden” keyword is known to work on non-ELF platforms too, like Mac OS X (Mach-O) and IBM AIX (XCOFF).

Another way to do the same is to use one of two linker options: -Bsymbolic and -Bsymbolic-functions. They do basically the same as the protected visibility: they keep the symbols in the dynamic symbol table, but they make the linker use the symbol inside the library unconditionally. The difference between those two options is that the former applies to all symbols, whereas the latter applies to functions only.

The reason why -Bsymbolic-functions exists requires looking back at the executable code from the first section. While the variable reference required a copy relocation, the function call was done indirectly, through the PLT stub. A variable can be moved, but moving code isn’t possible, so the executable code needs to deal with the code being elsewhere anyway. For that reason, it’s possible to symbolically bind function calls inside a library without affecting executables.

Or so we thought. The problem we discovered last week concerns what happens when you treat a function as a data reference: taking its address. As we saw in the first part, the linker will resolve the address of the function to the address of the PLT stub found in the executable. But if you symbolically bind the function in the library, it will resolve to the real address. If you compare the two addresses, they won’t be the same.

Proposed solutions

Some of the solutions I propose are ABI- and binary-compatible with existing builds; others are ABI-incompatible and would require recompilation. Unfortunately, the best solution would require source-incompatible changes. Still, all the changes below give libraries a bit of optimisation by making executables less optimised.

Use of PLT in function calls should rest only with the linker

As we saw in the code generated for the library, with -fPIC, the compiler decided to make the call indirectly by adding “@PLT” to the symbol name. Turns out that the linker doesn’t really care about this and will generate (or not) the PLT stub if needed. If that’s the case, the compiler should not make a judgement call about where the symbol is located just because of -fPIC.

Function addresses should always be resolved through the GOT

Function calls already require a pointer-sized variable somewhere and a relocation to make it point to the valid entry point of the function being called. What’s more, taking addresses of functions is a somewhat rare operation, compared to the number of function calls across ELF modules.

That being the case, we can take a small “hit” in performance and the loading of a function address should happen via the GOT in position-dependent code (executables) just like it is done for position-independent code.

The benefit of doing this is that the function address we load will point exactly to the function’s real entry point, instead of the PLT stub. When we call this function, we avoid the doubly-indirect branching we found earlier.

PLT stubs should use the regular GOT’s address, if it exists

If a given function is both called and has its address taken, the PLT stub should reference the GOT entry that was used for taking the address. The reason why it isn’t already so, I guess, is that the entries in the .got.plt section aren’t initialised with the target function’s address, but with the local module’s function resolver. This trick allows for the “lazy resolution” of functions: they are resolved only the first time they are called.

I wouldn’t ask for all functions to be resolved at load-time, but if the address of the function is taken anyway, the dynamic linker will need to resolve it at load time. So why waste CPU cycles in a function call if the address was computed already?

Copy relocations should be deprecated

Instead of copying the variable from the library into the executable, executables should use indirect addressing for reading variables and writing to them, as well as taking their addresses. One benefit of doing this is avoiding the actual copying. For example, for read-only variables, they may remain in read-only pages of memory, instead of being copied to read-write pages found in the executable.

The big drawback of this is that the indirect addressing is a lot more expensive, since it requires two memory references, not just one. The next suggestion might help alleviate the problem.

The linker should relax instructions used for loading variable addresses

This is a suggestion found in the IA-64 ABI: the compiler generates the instructions needed to load the address of the variable from the GOT, then use it as it needs to. If the linker concludes (by whichever means, like protected or hidden symbols, the use of one of the symbolic options, or because this is an ELF application and the symbol is defined in it) that the symbol must reside in the current ELF module, it can change the load instruction into a register-to-register move or similar.

For our x86-64 64-bit case, the instructions the compiler generated were:

        movq    externalVariable@GOTPCREL(%rip), %rax
        movq    %rdx, (%rax)

By changing one bit in the opcode of the first instruction, with no code size change, we can produce:

        leaq    externalVariable@GOTPCREL(%rip), %rax
        movq    %rdx, (%rax)

The x86 instruction “LEA” means “Load Effective Address”. Instead of loading 64 bits from the memory address externalVariable@GOTPCREL(%rip) and storing them in the register, that instruction places the address it would have loaded from in the register. This isn’t as optimised as the original code found in the executable, for two reasons: it requires two instructions instead of just one, and it requires an additional register.

It’s possible to generate even more efficient code if the assembler leaves a 32-bit immediate offset in the second movq instruction, making it 6 bytes long. This extra immediate would have no impact on the original code, besides making it longer, but it would allow the linker to optimise the code further:

The original would be:

        movq     externalVariable@GOTPCREL(%rip), %rax
        movq.d32 %rdx, 0x0(%rax)

And it would get relaxed to:

        nopl.d32 0x0(%rax)
        movq     %rdx, externalVariable(%rip)

That is, the first 6-byte instruction is resolved to a 6-byte NOP, whereas the second 6-byte instruction executes the actual store, with no extra register use. The compiler cannot know that the register will be left untouched, but at least there is no dependency between the two instructions that might cause a CPU stall.

The same applies to other architectures too. The full -fPIC code on ARM to store a value from a register into a variable is the following:

        ldr     r3, .L2+8     @ points to a constant whose value is: externalVariable(GOT)
.LPIC1: ldr     r3, [r4, r3]  @ r4 contains the base address of the GOT
        str     r2, [r3, #0]

If the linker can conclude the symbol must be in the current ELF module and cannot change, it may be able to avoid the extra load (the middle instruction) by changing the code to be:

        ldr     r3, .L2+8     @ points to a constant whose value is: externalVariable-(.LPIC1-8)
.LPIC1: add     r3, pc, r3
        str     r2, [r3, #0]

Unlike x86, the ARM instructions cannot be optimised further, since the immediates encodable in the instructions have limited range.

The linker should relax instructions used for loading function addresses

Similar to the above, but instead looking at function addresses. The original library code is:

        movq    externalFunction@GOTPCREL(%rip), %rdx

But it can be relaxed to:

        leaq    externalFunction(%rip), %rdx

With ARM, the original code is:

        ldr     r3, .L2+8     @ points to a constant of value: externalFunction(GOT)
        ldr     r2, [r4, r3]  @ r4 contains the address of the base of the GOT

But relaxed, it would be:

        ldr     r2, .L2+8    @ points to a constant of value: externalFunction-(.LPIC0+8)
.LPIC0: add     r2, pc, r2

There should be a way to tell the compiler where the symbol is

We’re already able to tell the compiler that a symbol is in the current module, with the hidden visibility attribute. We should also be able to tell the compiler that a symbol is in the current module but exported, as well as that a symbol is in another module.

I would suggest simply using the existing ELF markers and being explicit about them:

  • __attribute__((visibility("hidden"))): symbol is in this ELF module and is not exported (equivalent on Windows: no decoration);
  • __attribute__((visibility("protected"))): symbol is in this ELF module and is exported (equivalent on Windows: __declspec(dllexport));
  • __attribute__((visibility("default"))): symbol is in another ELF module (equivalent on Windows: __declspec(dllimport)); this also applies to symbols that must be overridable according to the library’s API (like C++’s global operator new).

Considering the other suggestions, we know the references to symbols with “default” visibility can be relaxed into simpler and more efficient code in the presence of one of the symbolic binding options. That means we can use the “default” visibility for cases of uncertain symbols.

Getting there

Some of the solutions I listed are already possible and they should be used immediately in all libraries. That is especially true about the use of the hidden visibility: all libraries, without exception, should make use of this feature. In fact, since this option was introduced in GCC 4.0 seven years ago, many libraries have started using it and are now “good citizens”, for they access their own private data most efficiently, they don’t have huge symbol tables (which impact lookup speed) and they don’t pollute the global namespace with unnecessary symbols.

Other solutions are not possible to implement yet. The solution I personally feel is most important to be implemented first is that of the ELF executables: they need to stop using copy relocations and they should resolve addresses of functions via the GOT. Only once that is done can libraries start using the “protected” visibility and generate improved code. This implies changing the psABI for the affected libraries, which may not be an easy transition.

An alternative to using the “protected” visibility is to use the symbolic binding options. The code relaxation optimisations would come in handy at this point to optimise at link-time the code that the compiler could not make a decision on. Unfortunately, those options apply to all symbols in a library, so libraries that must have overridable symbols need to use an extra option (--dynamic-list) and list each symbol one by one.

Using -fPIE

The compiler option -fPIE tells the compiler to generate position-independent code for executables. It is similar to the -fPIC option in that it generates position-independent code, but it has the added optimisation that the compiler can assume none of its symbols can be interposed.

With executables compiled with this option, copy relocations and direct loading of function addresses aren’t used. This solves the problem we had. Therefore, compiling executables with this option allows us to start using some of the optimisations I described before.

Unfortunately, as its description says, this option also generates position-independent code, which can be less efficient than position-dependent code in some situations. My preference would be to have position-dependent executables without the copy relocations. However, there’s a useful side effect of this option: it defines the __PIC__ macro, whose absence can be used to abort compilation of libraries that have transitioned to the more efficient options.

Further work and further reading

I highly recommend Ulrich Drepper’s “How to Write Shared Libraries” paper. His recommendations do not go as far as suggesting a change to the ABI like I have, but he makes many that library developers should adhere to, regardless of whether my recommendations are accepted. For example, using static functions and data where possible and avoiding arrays of pointers are recommendations I have made to many people.

Other work necessary is to improve prelinking support. Shared libraries are position-independent, but they can be prelinked to a preferred location in memory. One optimisation I have yet to see done is to use the read-only pages of prelinked data when the library is loaded at that preferred address (the .data.rel.ro sections).

Jan 13

Qt temperatures drop from January to June

I’ve previously talked about how the Qt 5 Winter is coming. Since we started talking about that, people have begun asking what the date limits are for each stage: when the API will freeze, when Qt 5.0 will be stable, when we’ll release, etc. This post tries to answer that a little.

Last month, we were preparing a list of features that needed to be done for Qt 5.0. The result of that activity is Task QTBUG-20885, a meta-task containing as sub-tasks everything that needs to happen for Qt 5.0’s feature freeze. Those are the changes that must go into Qt 5.0 and not into any later release. They are major refactorings or other changes that would break source- or binary-compatibility.

That task is now mostly accomplished. Lars has suggested a feature freeze date of February 4th, on his post on the Qt development mailing list. There’s not a lot of time left, so if you have something that needs to go in and hasn’t been taken into account, create the task and post now to the mailing list.

What happens next? Well, I don’t have dates, but I can tell you what will be[1] the stages of API freezing for Qt 5.0:

  • Alpha (Feature freeze): the first step, where all the features are in and work as best we can determine, in all the reference platforms[2] of Qt. The purpose of the Alpha release is to validate the API and get feedback from our own developers as well as bleeding-edge testers whether the code really works and solves the problems it was intended to. Since the point of the Alpha release is to get feedback on the API and whether it works, the API is definitely not frozen at this point. After this point, no new features are accepted.
  • Beta: the API is soft-frozen, which means it will almost not change anymore. Most of the feedback that we expected to receive regarding the API has been received and acted upon. From this point on, early users of Qt can start depending on the API. If any further API changes are required, they can still be done but must be clearly documented and communicated to those early users. The purpose of the Beta release is to start using the API and to start validating the implementation of the solutions present. That means the focus after the Beta release is to discover issues and fix bugs, not to completely refactor something that isn’t solving the problems.
  • Release Candidate: the API is now deep-frozen and will not change unless a catastrophic flaw is discovered. If that happens, the developer who wants to change the API must convince the Release Team to postpone the release. At this time, the ABI (binary compatibility) should be soft-frozen too, but issues with it may still be solved.
  • Final Release: the API and ABI is completely frozen; the source- and binary-compatibilities of Qt kick in. This release will be called Qt 5.0.0. All programs compiled with this release will run without recompilation on any Qt 5.x.y release. Additionally, any programs compiled with Qt 5.0.y will also run without recompilation on Qt 5.0.0.
  • Patch Releases: the Qt 5.0.y releases, to be had in the second half of this year, fixing issues reported, but not adding new features.

There should be only one alpha release, sometime next month. There may be multiple beta releases, as time progresses and issues are fixed. The point of a beta is to find more issues, so we need to release often for our users to give feedback. There’s also likely going to be only one release candidate, but it’s possible to have more than one as we find issues. And ideally, the final release should be just the last RC rebadged, but history shows we will add a few minor fixes between the two.

This process may not be followed exactly as I listed, though. Given the number of important new features, Lars has said that he might accept new features past the freeze date, provided we can see that there is progress. In other words, we will not wait for features we’re not certain will be delivered soon.

Finally, this process applies only to Qt 5.0. The process for Qt 5.1 and onwards should be different. For one thing, those releases will not have BC breakages, so the provisions relating to BC will not apply. For another, we plan to put in place a different branching model (subject for another blog) and keep the Qt Project maintainers true to their duty of “code is always ready for beta,” meaning that the feedback we’re scheduling for the period between alpha and beta right now should happen before the feature is accepted into the mainline.

Happy hacking.


  1. The list presented is the one I sent to the mailing list in December. Lars agreed to it and no one else challenged it.
  2. The current reference platforms for Qt are: Windows; Mac OS X 10.6 and above, using Cocoa; Linux using XCB; and Linux using Wayland.

Jan 10

Architectures and ABIs detailed

Yesterday I wrote about instruction set and ABI manuals. Today I’d like to go into details about the ABIs I listed there. This was done mostly as a summary for me: it’s tiresome to search for the information in the manuals, especially since some of the manuals are PDFs without links. For example, I never remember what is the order of the registers used in parameter passing on x86-64. So what you’ll find here is a listing of what I found interesting for when I might need to read or write assembly code.

As a bonus for you, dear reader, I added a few words about each platform.

First, a summary with numbers.

                                      i386           x86-64         Itanium    ARM        AArch64   MIPS           POWER     SPARC
Instruction width (bits)              8 to 112       8 to 112       41 or 82   16 or 32   32        16 or 32       32        32
# general-purpose registers           8              16             128 (a)    16 (b)     32        32             32        32 (a)
GPR width in bits                     32             64             64+1 (c)   32         64        32 / 64        32 / 64   32 / 64
# special GPRs, architectural + ABI   1 + 0          1 + 0          1 + 3      1 + 1      1 + 2     1 + 3          1 + 1     2 + 8
# GPRs used in parameter passing      0              4 or 6 (d)     8          4          8         4 or 8 (e)     8         6
# scratch GPRs (f)                    4              7 or 9 (d)     24+96      5          18        18 or 17 (e)   11        7+9+7
# saved GPRs (f)                      3              8 or 6 (d)     7+96       8          10        9 or 10 (e)    20        15
Number of floating-point registers    8+8            8+16           128        16 or 32   32        32             32        32
FPR width in bits                     80 / 128 (g)   80 / 128 (g)   82         64         128       32 or 64       64        64
# FPRs used in parameter passing      0+0            4 or 8 (d)     8          8          8         2 or 8 (e)     8         0


  a. 128 registers on Itanium can be accessed, but the processor has a minimum of 144 and can have more; for SPARC, 32 registers can be accessed, but the processor has anywhere between 64 and 528
  b. in Thumb mode, some instructions can only access 8 of the 16 registers
  c. the extra bit, called the Not-A-Thing (NAT) bit, is only used in some special circumstances
  d. the first number applies to Windows, the second to Unix systems
  e. the first number applies to o32 and o64; the second applies to n32 and n64
  f. “scratch” registers are those that a function may overwrite and need not save, also known as “caller-saved”, including the registers used for parameter passing; “saved” registers are those that a function must save before using; the concept does not apply directly to the rotating registers found on the Itanium and on SPARC (see below)
  g. the 387 registers are 80 bits wide and the SSE registers are at least 128 bits wide; they have been extended to 256 bits with the AVX extensions


i386 or x86 or IA-32

The x86 architecture is the oldest in consideration and its age shows. The 32-bit architecture debuted with the Intel 80386 (whence the name “i386”) in 1985. It expanded on the 16-bit Intel 8086 assembly by widening the registers to 32 bits, among other things. This architecture is still in use today and even modern processors like my Intel® Core™ i7-2620M (Sandy Bridge) boot into 8086 real mode. I have some applications running on my Linux that are still i386 (like Skype).

The name x86 is because the 80386 (family 3) was followed by the 80486 (family 4), the Pentium (family 5) and the Pentium Pro (P6 architecture, family 6). Some Linux distributions compile their packages for higher architectures, so you’ll find .i586.rpm and .i686.rpm too. The name IA-32 means “Intel Architecture, 32-bit,” a name created to distinguish it from IA-64.

The instructions on x86 have variable lengths and can be anywhere from 1 to 15 bytes, averaging usually between 3 and 5 bytes, making the code density around 4 instructions per 16 bytes. That means jump and call targets can use all 32 bits of the addressing space. For performance and ABI reasons, jump targets and functions are usually aligned to 16 bytes (the ABI requires the low 3 bits to be clear for C++ member functions).

The traditional parameter passing uses no registers: the parameter values are pushed from right to left as 32-bit slots onto the stack, which is popped by the caller. The stack is memory, so it suffers some penalties for its use. For that reason, most compilers offer alternative calling conventions, which allow passing some values in registers, pushing from left to right, and/or having the stack popped by the callee. You can find them by the names of “regparm” (GCC), “stdcall”, “syscall”, “pascal”. On Windows, the Win32 API is actually “stdcall”, whereas on Linux you’ll seldom find public API using anything other than the default convention. You can find more details about them in the Wikipedia article about x86 calling conventions.

The base i386 processor has 8 general-purpose registers and 8 stacked 80-bit wide floating-point registers. All the floating-point registers are scratch and can hold IEEE 754 single-, double- or extended-precision values, while the general-purpose registers are distributed as follows (on Linux at least, though I think it applies to Windows too):

Special register: ESP
Registers used for return values: EAX, EDX
Scratch registers: EAX, ECX, EDX
Saved registers: EBX, ESI, EDI, EBP
Floating-point registers used for return values: ST(0), ST(1)
Scratch floating-point registers: all (ST(0) to ST(7))

The ESP register is special for architectural reasons: instructions that manipulate the stack work on it exclusively. That includes the procedure call and return mechanism, which store the return address on the stack. All the other registers can be used in almost any condition, even though there are certain instructions preferring one register over another. Only a few special instructions refer exclusively to a particular register (ECX in looping instructions and ESI and EDI in streaming instructions).

The EBP register is most often used as the “frame pointer” register: its value is the memory address where the previous function’s frame pointer was saved. It is used to load and store the incoming and local values at a fixed position. When writing assembly, it’s important to remember too that the EBX register is often used as the PIC register and is thus unavailable. If you need to use it in a special instruction, you’ll need to save it and restore it afterwards (such as by pushing it onto the stack or by xchg’ing it with another register).

The x86 architecture gained 8 MMX technology registers with the Pentium MMX, which are aliased to the floating-point registers and are all thus scratch. Later, with the Pentium III, 8 SSE registers, 128-bit wide, were added and then extended to 256-bits with the Sandy Bridge family. They are also all scratch and they can hold IEEE 754 single- or double-precision floating-point values. They can also be used in a variety of scalar, integer or floating point SIMD instructions.


x86-64 (AMD64)

When the x86 architecture gained 64-bit support, not only were the registers expanded to 64 bits, the register set itself was expanded to 16 general-purpose registers, 16 MMX-technology registers and 16 SSE-technology registers. The floating-point registers are unchanged, though, as they are considered legacy. Unlike the i386 before it, the 64-bit expansion did away with compatibility with the 16- and 32-bit assembler instructions. Programs running in 64-bit mode (the “long mode”) run with a slightly different list of instructions. (Note that the 16-bit assembly is technically source-compatible with the 32-bit one, but it’s not binary-compatible.)

As with x86, instructions have variable length in bytes, but the ABI and performance requirements are the same, so jump targets and functions are often aligned to 16 bytes.

As this architecture was created after SSE registers were introduced, the SSE registers are part of the calling convention. In fact, the SSE and SSE2 instructions are the preferred way of manipulating single- and double-precision floating-point values. The ABI for this architecture was specified by AMD when it launched the first 64-bit processor and by Microsoft for its Windows operating system.

Special register: RSP
Function return address: top of stack
GPRs used for return values: RAX, RDX
GPRs used in parameter passing: RDI, RSI, RDX, RCX, R8, R9 (Unix); RCX, RDX, R8, R9 (Windows)
Scratch GPRs: RAX, RCX, RDX, RSI, RDI, R8-R11 (Unix); RAX, RCX, RDX, R8-R11 (Windows)
Saved GPRs: RBX, RBP, R12-R15 (Unix); RBX, RBP, RSI, RDI, R12-R15 (Windows)
387 registers used for return values (long double): ST(0), ST(1)
Scratch 387 registers: all (ST(0) to ST(7))
Floating-point registers used for return values: XMM0, XMM1
Floating-point parameter registers: XMM0-XMM7 (Unix); XMM0-XMM3 (Windows)
SSE scratch registers: all (XMM0-XMM15, YMM0-YMM15)

As on 32-bit x86, the RSP register is architecturally special and is manipulated by the push, pop, call, ret and similar instructions. The RBP register is also used as a frame pointer. The x86-64 architecture does allow for RIP-relative addressing, which was introduced so that a PIC register wouldn’t be necessary. Yet RBX is still used by some compilers under some conditions, so it’s best to apply the same saving mechanisms as before.

On Windows, this architecture runs in LLP64 mode: long longs and pointers are 64-bit wide, but longs and ints are 32-bit. On Linux, it can run in both LP64 mode (longs and pointers are 64-bit wide) and ILP32 mode (ints, longs and pointers are 32-bit). The ILP32 mode, called “x32”, makes use of the 8 additional GPRs and 8 additional SSE registers along with this calling convention, as an effort to renew the 32-bit x86 world.

Itanium (IA-64)

The Intel Itanium architecture was the result of a joint project between Hewlett-Packard and Intel in the late 1990s and was released in 2001. It was designed to take the best of the expertise of the time and produce a new, future-proof architecture for years to come. It was intended to replace the old 32-bit x86 architecture, which is why it got the name IA-64.

Itanium uses a concept called Very Long Instruction Word (VLIW): each instruction is 41 bits long, with a few taking 82 bits to encode 61- to 64-bit immediates. Every three instructions are grouped into a “bundle” occupying 128 bits (16 bytes), so all jump and function targets are aligned to 16 bytes. Another concept used in the architecture is Explicitly Parallel Instruction Computing (EPIC), where the compiler must tell the processor which instructions can be executed in parallel and which must wait for others. This is encoded in the assembly as “stop bits”, stored in the 5 remaining bits of the 128-bit bundle. Not all combinations of instructions and stop bits are possible, so Itanium code often has many “nop” instructions and is very big. Code density is 3 instructions per 16 bytes, counting the “nop”s.

The Itanium architecture is still the record-holder in terms of raw number of accessible registers. Application programmers have access to 128 general-purpose registers, 128 floating-point registers, 128 architecturally-specific registers, 64 1-bit predicate registers and 8 branch registers. That’s 4 times as many GPRs and FPRs as any other architecture I listed, plus the other special registers. They are divided thus:

General-purpose registers
  Architecturally-special GPR: r0 (reads are always 0, writes are discarded)
  ABI-defined special GPRs: r1 (gp), r12 (sp), r13 (tp)
  GPRs used for return values: r8-r11
  GPRs used in integer parameter passing: r32-r39 (in0 to in7), specially r8
  Non-rotating scratch GPRs: r2, r3, r8-r11, r14-r31
  Non-rotating saved GPRs: r4-r7
  Rotating GPRs: r32-r127 (in0-in96, loc0-loc96, out0-out96)
Floating-point registers
  Architecturally-special FPRs: f0 (0.0) and f1 (+1.0)
  Registers used for return values: f8-f15
  Registers used in parameter passing: f8-f15
  Non-rotating scratch FPRs: f6-f15
  Non-rotating saved FPRs: f2-f5, f16-f31
  Rotating FPRs: f32-f127
Predicate registers
  Architecturally-special PR: p0 (always 1)
  Non-rotating scratch PRs: p6-p15
  Non-rotating saved PRs: p1-p5
  Rotating PRs: p16-p63
Branch registers
  Function return address: b0 (rp)
  Scratch registers: b0 (rp), b6, b7
  Saved registers: b1-b5

On function entry, the r8 register contains the address of a memory region for the return value if the struct or union being returned is larger than 32 bytes (i.e., doesn’t fit r8-r11).

The three architecturally-special registers (r0, f0 and f1) always have the same value when read: integer 0, floating-point 0.0 and floating-point 1.0 respectively. This allows the assembly to do away with some instructions by making them aliases of others: for example, there is no instruction to load a small immediate value into a GPR. Instead, the instruction is replaced by an addition instruction where one of the operands is r0. The same applies to the floating-point multiplication and addition instructions: the Itanium only has a 4-operand fused multiply-add, so pure additions are done by multiplying one of the sources by f1 and pure multiplications are done by using f0 as the addend.

The 96 upper GPRs, FPRs and the 48 upper PRs are rotating: that means that some instructions can cause the register names to rotate. The three types of registers can be used in rotating loops, where several iterations of the loop are running in parallel with different registers. When not used in rotating fashion, all those registers can be used as scratch.

In addition to loops, the 96 upper GPRs can be rotated on function calls and returns. For that reason, each function can consider it has up to 96 saved registers because those registers simply cannot be seen by functions it calls. They are saved by the Register Stack Engine, asynchronously and at processor-specified times. The architecture allows each function to select how many rotating registers it wants to use and how many of those are to be available to functions called (those are the outgoing registers), though when writing assembly, one specifies how many registers are incoming, local and outgoing, so the named registers are available in the function body. The ABI limits the number of outgoing registers to only 8 rotating registers.

A leaf function or one with tail-call optimisation may opt to keep the rotating registers unchanged. It has available 24 non-rotating scratch GPRs, 10 non-rotating FPRs and 96 rotating ones, 10 non-rotating PRs and 48 rotating, and 3 scratch branch registers (one of them containing the return address), plus the incoming registers. That’s more than enough for most leaf functions, without even using the stack. A tail-call optimisation, however, requires that the called function take no more arguments than this function took, as expanding the number of outgoing registers would destroy another register that must be saved (ar.pfs).

An interesting feature of the GPRs is that they are actually 65 bits in width: the extra bit is called the “Not a Thing” (NAT) bit and indicates whether the other 64 contain a valid value. The Itanium has instructions that allow a “speculative load”: the instruction will try to load the value from memory so long as it doesn’t cause a page fault. If the value could not be loaded, the NAT bit is set and software must later check it, once it determines that it really needs that value. Using the value contained in a GPR while the NAT bit is set, other than copying the contents to another register or saving them with a special spill instruction, causes an exception.

The floating-point registers are 82 bits in width, allowing each to hold intermediate values of higher precision than IEEE 754 extended precision. The “application registers” are 128 special 64-bit registers, each with a special meaning. Some of those registers are read-only; some are used by certain instructions and are thus scratch; most have a special purpose. In particular, the ar.pfs register must be saved across function calls.

Itanium is defined for LP64 and ILP32 mode for Unix and LLP64 mode for Windows. The ILP32 mode is supported by a special instruction for dealing with pointers: once loaded from 32-bit storage, the pointer is “pointer-extended” to 64-bit before it can be used.

The ABI for Itanium was specified by Intel in the document I linked to in the last blog. Interestingly, Intel specified almost everything relating to the Itanium, including a full C++ ABI. This became known as the Itanium C++ ABI and is what GCC uses in all platforms, not just Itanium.

ARM 32-bit mode (AArch32)

Instructions in the ARM architecture, when running in “ARM mode”, are all 32 bits wide. For that reason, all jump targets and function addresses are aligned to 4 bytes and the low 2 bits are always unused. However, when ARM code is “interworking” with Thumb code, those two bits are special, which means that function addresses on ARM require the use of all bits. This has implications for the C++ ABI: since all bits are used in function addresses, the bit indicating whether a pointer-to-member-function is virtual is moved to the adjustment field.

32-bit ARM has 16 registers, one of which is the program counter. All registers are 32 bits wide and can be used in all instructions alike, including the PC, which makes branching via arithmetic instructions possible (for example, “add pc, pc, r0”). The PC register is special, though, and not all operations on it are supported. Moreover, reading from it yields the address of the current instruction plus 8. “nop” instructions are not common in ARM assembly, so the code density is 4 instructions per 16 bytes. However, due to the limited range of immediates, ARM code is often littered with nearby constants that must be loaded, not executed.

The ARMv6 architecture mandates at least 16 floating-point 64-bit wide registers, and ARMv7 allows optionally for 16 more of them to exist. The registers are divided so:

General-purpose registers
  Architecturally-special GPR: r15 (pc)
  ABI-defined special GPR: r13 (sp)
  GPRs used for return values: r0-r3
  GPRs used in parameter passing: r0-r3
  Function return address: r14 (lr)
  Scratch GPRs: r0-r3, r12 (ip), r14 (lr)
  Saved GPRs: r4-r11
Floating-point registers
  FPRs used for return values: d0-d7
  FPRs used in parameter passing: d0-d7
  Scratch FPRs: d0-d7, d16-d31
  Saved FPRs: d8-d15

The table above assumes that one is using the floating-point hardware registers to pass parameters, in what is called in the ARM world “hard float”. According to the ARM Architecture Procedure Call Standard, this is optional: if not enabled, the floating-point parameters are converted to their 32- or 64-bit representations and passed in the GPRs.

The floating-point registers can be accessed in 64-bit mode, each holding one IEEE 754 double-precision value, or as two 32-bit registers each holding an IEEE 754 single-precision value. Extended precision is not supported by the hardware; in the ARM ABI, the “long double” type is an alias for “double”. Each of the original sixteen 64-bit FPRs can be accessed as two 32-bit FPRs by using the “s” prefix instead of “d”: s(2N) corresponds to the lower half of dN and s(2N+1) to the upper half. Pairs of sequential FPRs, starting at an even-numbered register, can also be accessed as sixteen quad-word (128-bit) registers with the “q” prefix.

The r13 (sp) register was chosen by the ABI more or less arbitrarily, as any other register could have been used to store the current address of the top of the stack. However, this register becomes architecturally special when Thumb mode is in use.

Thumb sub-mode

ARM CPUs can also run a sub-mode called Thumb, in which most instructions are 16 bits wide. Older ARM processors can only run 16-bit Thumb instructions, while newer ones support additional 32-bit Thumb instructions. Thumb instructions are therefore not aligned to 4-byte boundaries. When ARM and Thumb code interwork, the lowest bit of jump and call addresses indicates the instruction mode: 0 indicates ARM code, 1 indicates Thumb code. However, when the PC register is accessed in Thumb mode, the lowest two bits are forced to zero, so a read of the PC always yields a 4-byte aligned value. Unlike in ARM mode, reads of the PC give the address of the current instruction plus 4 (rounded down to a multiple of 4).

The 16-bit instructions are a reduced set, with restricted access to registers. The r13 (sp) register is hardcoded in some stack operations, which makes it architecturally special. The ABI itself does not change, but 16-bit instructions can only encode the lower 8 of the 16 registers, which means that 16-bit Thumb functions are limited to 4 scratch and 4 saved registers.

ARM 64-bit (AArch64)

The ARM 64-bit architecture has expanded the ARM 32-bit register set from 16 to 32 general-purpose registers and 32 floating-point registers. The program counter register is no longer architecturally visible and the stack pointer register is architecturally special.

There are still no ARMv8 chips available and I have not seen any compiler generating code for it. The information below comes from the public ARM manuals I could find. The instruction width appears to be 32-bit, like in AArch32 ARM mode (A32). The ABI was specified by ARM too.

General-purpose registers
  Architecturally-special GPRs: SP and ZR
  GPRs used for return values: r0-r7
  GPRs used in parameter passing: r0-r7, specially r8
  Function return address: r30 (lr)
  Scratch GPRs: r0-r18, r30 (lr)
  Saved GPRs: r19-r29
Floating-point registers
  FPRs used for return values: v0-v7
  FPRs used in parameter passing: v0-v7
  Scratch FPRs: v0-v7, v16-v31
  Saved FPRs: v8-v15 (lower 64-bit values only)

On function entry, the r8 register contains the address of a memory region for the return value if the type being returned would not be stored in registers if it were the first parameter in a function call.

The SP and ZR registers are actually the same register; the difference is how each instruction treats it. Some instructions are specified to work on SP and will read and write values to it. Others treat the register as zero when read, or discard the output when it is the destination. When used under those conditions, the assembly lists it as “ZR” or, like Itanium, uses different mnemonics to indicate the absence of a source.

The floating-point registers can be accessed as 32-, 64- or 128-bit wide registers, holding single-precision, double-precision and quad-precision values respectively. However, the hardware only supports floating-point math in single- and double-precision. The ABI asks that only the 64 lower bits of the registers v8 to v15 be saved. That means if a function stores data in the high bits, it must save them on its own before calling another.

In assembly code, one will not see the registers named with the prefixes “r” or “v”. Instead, the GPRs are prefixed with “w” to mean 32-bit access or “x” for 64-bit, while the FPRs are prefixed “s” (32-bit), “d” (64-bit) or “q” (128-bit). The “v” prefix is seen on SIMD operations.

The ARM 64-bit architecture is defined for LP64 mode, but it might be possible to run it in ILP32 mode to make use of the extra registers and improved calling convention.


MIPS

MIPS processors exist with both 32- and 64-bit registers, respectively called MIPS32 and MIPS64. Unlike the x86 and ARM architectures, where the assembly language differs considerably between 32- and 64-bit, the MIPS64 instruction set is a complete superset of MIPS32 and differences in programs are due only to the ABI. MIPS64 processors do not have a special architectural mode to run MIPS32 code. MIPS64 processors were mostly used with the IRIX operating system, which has been discontinued (SGI now sells Linux machines running on Itanium). MIPS32 processors are quite common in the embedded world, finding their use in many WiFi routers and set-top boxes.

MIPS instructions are 32-bit in width, but some processors support an extension called microMIPS16 and can run 16-bit wide instruction code. I have not investigated if the instruction streams can be mixed or a technique similar to the ARM-Thumb interworking is necessary. GCC does not seem able to generate microMIPS16 instructions.

MIPS processors have 32 general-purpose registers, which are 64-bit wide on MIPS64 and 32-bit on MIPS32. Additionally, an FPU and a second co-processor are optional. The FPU, if present, has 32 registers that are 32-bit wide and support (at least) single precision, double precision and 32-bit integers, with double-precision values stored in a pair of registers. 64-bit support in the FPU is possible, but optional.

|                                | o32 and o64 ABIs | n32 ABI | n64 ABI |
|--------------------------------|------------------|---------|---------|
| Architecturally-special GPR    | $0 (zero) | $0 (zero) | $0 (zero) |
| ABI-specified special GPRs     | $26 (kt0), $27 (kt1), $29 (sp) | $26 (kt0), $27 (kt1), $29 (sp) | $26 (kt0), $27 (kt1), $29 (sp) |
| GPRs used for returning values | $2, $3 | $2, $3 | $2, $3 |
| GPRs used in parameter passing | $4 to $7 | $4 to $11 | $4 to $11 |
| Function return address        | $31 (ra) | $31 (ra) | $31 (ra) |
| Scratch GPRs                   | $1 (at), $2 to $15, $24, $25, $28 (gp) | $1 (at), $2 to $15, $24, $25 | $1 (at), $2 to $15, $24, $25 |
| Saved GPRs                     | $16 to $23, $30 | $16 to $23, $28 (gp), $30 | $16 to $23, $28 (gp), $30 |
| FPRs used for returning values | $f0 to $f3 | $f0 and $f2 (not $f1) | $f0 and $f2 (not $f1) |
| FPRs used in parameter passing | $f12 to $f15 | $f12 to $f19 | $f12 to $f19 |
| Scratch FPRs                   | $f0 to $f19 | $f0 to $f19, $f20, $f22, $f24, $f26, $f28, $f30 | $f0 to $f23 |
| Saved FPRs                     | $f20 to $f31 | $f21, $f23, $f25, $f27, $f29, $f31 | $f24 to $f31 |

Some of the registers deserve special mention:

  • $1 is used by the assembler for some operations where an assembly mnemonic does not fit; it is known as “at” (assembler temporary)
  • $25 contains the address of the function on entry on PIC code; since the compiler usually doesn’t know whether the target function is PIC or not, it will most likely load its address on $25
  • $26 and $27 are reserved to hold values from the kernel and should not be modified
  • $28 (gp): unlike on the previous architectures, on o32 and o64 the global pointer (the PIC register) is not stored in a saved register
  • $f20 to $f31: since the early MIPS double-precision operations operate on a register pair, the registers must be saved in pairs too (o32 only)

In particular, the o32 ABI always uses the floating-point registers in pairs: the odd-numbered registers are never used alone. The o64, n32 and n64 ABIs allow using them independently and assume each can hold a double-precision value.

An interesting feature of the MIPS assembly, shared with SPARC, is that the instruction immediately following a branch is still executed.

POWER and PowerPC

The Power Architecture defines processors with both 32- and 64-bit registers. Unlike the x86 and ARM architectures, where the assembly language differs considerably between 32- and 64-bit, the 64-bit instruction set is a complete superset of the 32-bit one, and differences in programs are due only to the ABI.

The Power Architecture specifies two profiles for its processors: the Server Platform, which is mandatorily 64-bit, and the Embedded Platform. It has 32 general-purpose registers, one of which is special. If the FPU is present, it provides 32 registers capable of holding 64-bit double-precision floating point values. It also has an optional vector unit extension, known as Altivec.

The Power Architecture ABI document I found specifies a lot of optional functionality, including soft-float. I am listing here the use of the floating-point registers for parameter passing, as a common-sense profile.

| Register use | Registers |
|--------------|-----------|
| **General-purpose registers** | |
| Architecturally-special GPR | r0 |
| ABI-defined special GPR | r1 (sp) |
| GPRs used for return values | r3-r6 |
| GPRs used for parameter passing | r3-r10 |
| Function return address | LR |
| Scratch GPRs | r0, r3-r12 |
| Saved GPRs | r2 (tp), r13-r31 |
| **Floating-point registers** | |
| FPR used for return values | f1 |
| FPRs used in parameter passing | f1-f8 |
| Scratch FPRs | f0-f13 |
| Saved FPRs | f14-f31 |
| **Vector registers** | |
| VR used for return values | v2 |
| VRs used in parameter passing | v2-v13 |
| Scratch VRs | v0-v19 |
| Saved VRs | v20-v31 |


  • lr: a special register that receives the return address after a function call; it is a scratch register, so a function that makes calls must save it itself
  • r0: a valid register for holding values, but some instructions cannot access it; those instructions always read it as a zero value and discard results written to it

The register names above are prefixed with a letter for convenience. The POWER assembly unfortunately uses no prefixes for registers, addresses or absolute values: when you see a number like “2” in the disassembly, you need to understand the instruction in question to determine whether it refers to r2, f2 or the value 2.
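For instance, in a hypothetical disassembly fragment (illustrative only, not taken from a real program), the bare numbers mean different things depending on the instruction:

```asm
addi 3, 3, 2      # r3 = r3 + 2: two register operands and an immediate
lwz  4, 8(1)      # r4 = word loaded from r1 (the stack pointer) plus offset 8
fadd 2, 2, 3      # f2 = f2 + f3: the same digits now name floating-point registers
```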

An interesting feature of the assembly is the Enforce In-Order Execution of I/O instruction, whose mnemonic reads “EIEIO”.


SPARC

The SPARC architecture began as 32-bit and was extended to 64-bit with SPARCv9 in 1993. SPARC processors are most commonly known as UltraSPARC today. 32-bit processors are not sold any more; however, ILP32 applications still exist and run unmodified on current processors, as the difference is only in the ABI.

The SPARC architecture has 32 general-purpose registers, 64 bits wide on SPARCv9 and above. Like Itanium, SPARC has a rotating register window: 24 GPRs rotate and 8 are fixed. Unlike Itanium, the window rotates by a fixed amount. The save instruction rotates it by 16, making registers %r8 to %r15 become %r24 to %r31 and exposing 16 clean registers at %r8 to %r23, while the restore instruction does the opposite and restores the 16 registers that had been saved. So, unlike Itanium, the outgoing parameters are in lower-numbered registers and the incoming ones in higher-numbered ones.

The general-purpose register set is divided in four groups of 8 registers because of the window. The lowest 8 registers are fixed and are named %g0 to %g7; the next 8 are shared with a function being called, so they are the outgoing registers and named %o0 to %o7; the next 8 are only visible in the current function and are therefore named %l0 to %l7; finally, the upper 8 registers are shared with this function’s caller and are named %i0 to %i7.

Because of the register rotation, the definitions of “scratch” and “saved” do not apply directly: the registers a function must preserve for its caller are not the same registers its callee will preserve for it. The following table shows the registers from the point of view of a function after it has rotated the register window.

| Register use | Registers |
|--------------|-----------|
| **General-purpose registers** | |
| Architecturally-special GPR | %g0, %i7 (partially) |
| ABI-defined special GPRs | %g2-%g7, %o6 (%sp), %i6 (%fp) |
| GPRs used in return values | %i0 |
| GPRs used in passing parameters | %i0-%i5 |
| Function return address | %i7 |
| GPRs preserved by a callee | %l0-%l7, %i0-%i7 |
| GPRs destroyed by a callee | %g1, %o0-%o5, %o7 |
| **Floating-point registers** | |
| FPRs used in passing parameters | none; values are passed in the outgoing registers or on the stack |
| FPRs used in returning values | %f0, %f1 |
| Scratch FPRs | all |

The registers %g2 to %g4 are reserved by the ABI for the application, for uses to be decided by the application and compiler, while registers %g5 to %g7 are to be considered read-only by the application and compiler.

When a function is called, it has 6 rotating registers available containing its incoming values, which can be considered scratch. Additionally, %g1 is also scratch. For that reason, leaf functions or functions with tail-call optimisation do not have to rotate the register window if they are satisfied with those 7 registers.

The rotating window also makes the %sp register become %fp upon rotation, allowing that value to be saved easily. Unlike on other architectures, it is the %fp register that must be preserved for the caller, so a function can use the %sp register as scratch if it needs to (provided it has rotated the register window).

Like MIPS, SPARC’s branch instructions also execute the instruction immediately following the branch, which is called the “delay slot”. Disassemblers often indent that instruction further for clarity. However, unlike on other architectures, the SPARC call instruction does not save the return address but its own address. To return, a function must jump to %i7 + 8.
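A sketch of the resulting idiom (Sun assembly syntax; `somewhere` is a placeholder label, and the comments state the conventional behaviour):

```asm
call    somewhere    ! %o7 receives the address of this call instruction
mov     3, %o0       ! delay slot: executed before the branch takes effect
...
ret                  ! synthetic instruction for "jmpl %i7 + 8, %g0"
restore              ! delay slot: rotate the register window back
```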

The FPU contains 32 registers that are 64 bits wide, numbered %f0, %f2, %f4, … %f62. Each pair of those registers starting at a number that is a multiple of four (%f0, %f4, %f8, … %f60) can be accessed as a quad-precision value. Single-precision access uses sequentially numbered registers %f0, %f1, %f2, … %f31, where each pair aliases one 64-bit register. Note that SPARC is a big-endian platform, so the upper half of a larger register is found in the lower-numbered register.

Jan 09

Assembly developer’s library

Every now and then, when coding in C++, I find myself needing to know some assembly to understand what’s going on. Sometimes, it’s because I am actually writing assembly code, such as when I was writing the new atomic classes for Qt. More often, it’s because I need to read the assembly generated by the compiler to figure out if it’s optimal or if it’s doing something weird.

So I often found myself downloading the same manuals over and over. I decided to put together a small library of the manuals I use often and those I seldom use but might want some day. This is the list.

| Architecture | Instruction set manual | ABI description (calling convention) |
|--------------|------------------------|--------------------------------------|
| i386 (IA-32) | Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 2 | traditional / many / varied (Wikipedia article) |
| x86-64 64-bit, a.k.a. x64 | Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 2 | System V ABI for x86-64 (LP64) and Windows x64 calling convention (LLP64) |
| x86-64 32-bit (ILP32), a.k.a. x32 | Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 2 | x32 ABI |
| Itanium (IA-64) | Intel® Itanium® Architecture Developer’s Manual, Vol. 3 | Itanium® Software Conventions and Runtime Architecture Guide (both ILP32 and LP64) |
| ARM 32-bit (AArch32) | ARM assembler reference | ARM Architecture Procedure Call Standard (AAPCS) |
| ARM 64-bit (AArch64) | ARMv8 Instruction Set (registration required) | ARM 64-bit Architecture Procedure Call Standard (AAPCS64) |
| MIPS32 | The MIPS32® Instruction Set and The microMIPS32™ Instruction Set (registration required) | o32 ABI, n32 and n64 |
| MIPS64 | The MIPS64® Instruction Set (registration required) | o64 ABI, n32 and n64 |
| POWER Architecture (includes PowerPC) | Power Instruction Set Architecture Version 2.06 | Power Architecture 32-bit ABI Supplement 1.0 Unified (applies to 64-bit too, I think) |
| SPARC | SPARC Architecture V9 | SPARC psABI 3.0 |

Of course there are more architectures. Those are just the ones that are (somewhat) relevant to me as a Qt developer. Also, please note that I have not looked in detail into the POWER and SPARC manuals, other than a cursory glance to ensure that they contain the relevant information. For example, one site I found says that 32-bit Linux on PowerPC uses an ABI different from the one defined by power.org.

I have not listed quick reference guides, many of which exist. I don’t use them because I often need details of the instructions.

What do you usually use when you code in assembly?

Additional IA-32 and x86-64 resources

The IA-32 and x86-64 architectures contain a very large number of extensions added in different generations of the processors, both by Intel and AMD. The manuals above include the main extensions: MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2 and AVX, plus AES, FMA, F16C and RDRAND. They do not include extensions specific to AMD processors, such as 3DNow! (effectively deprecated), SSE4a, FMA4 or XOP.

Other useful manuals:

Additionally, most of the non-general-purpose instructions have intrinsic functions associated with them, so you can write really low-level code in C or C++ without actually having to write assembly. Unfortunately, I haven’t found a good, downloadable and up-to-date intrinsics reference manual. The Intel® C++ Intrinsics Reference is the closest I’ve found, but it stops at the SSE4 intrinsics, not including the AVX or AES ones.

If you know of one, please leave me a comment.


Application Binary Interface

  • ILP32: int, long and pointer are 32-bit; virtually all 32-bit platforms (all of the ABIs listed here)
  • LP64: long and pointer are 64-bit; the 64-bit environment of all the Unix platforms here
  • LLP64: long long and pointer are 64-bit, but long is 32-bit; the 64-bit environment of Windows
