The value of passing by value

I’ve written in the past about how passing certain types by value in C++ would be more efficient than passing by constant reference. But it turns out that the ABI rules are somewhat more complex than what I said back in 2008. Time to investigate.

This is also prompted by the discussion on qreal on the Qt development mailing list. In trying to decide on the fate of qreal, we also run into the discussion of the geometric classes (point, size, rectangle, polygon) and the algebraic classes (matrixes, 2D and 3D vectors) and whether they should use single- or double-precision. I’m not going to go into the arguments discussed there, I’m merely focussing here on the ABI.

Problem statement

Before we go into the ABI documentation and try to compile code, we need to define what problem we’re trying to solve. In general terms, I’m trying to find the most optimal way of passing small C++ structures: when is it better to pass by value, as opposed to by constant reference? And under those conditions, are there any important implications to the qreal discussion?

In the String Theory blog, I concluded that a small structure like QLatin1String, which contained exactly one pointer as a member, would benefit from passing by value. What other types of structures should we look at?

Structures with more than one pointer
Structures with 32-bit integers on 64-bit architectures
Structures with floating-point (single and double precision)
Mixed-type and specialised structures found in Qt

I’ll investigate the x86-64, ARMv7 hard-float, MIPS hard-float (o32) and IA-64 ABIs because they are the ones I for which I have access to compilers. All of them support passing parameters by registers and have at least 4 integer registers used in parameter passing. Besides MIPS, all of them also have at least 4 floating-point registers used in parameter passing. See my earlier ABI detail blog for more information.

So we will investigate what happens when you pass by value the following structures:

struct Pointers2
{
    void *p1, *p2;
};
struct Pointers4
{
    void *p1, *p2, *p3, *p4;
};
struct Integers2 // like QSize and QPoint
{
    int i1, i2;
};
struct Integers4 // like QRect
{
    int i1, i2, i3, i4;
};
template <typename F> struct Floats2 // like QSizeF, QPointF, QVector2D
{
    F f1, f2;
};
template <typename F> struct Floats3 // like QVector3D
{
    F f1, f2, f3;
};
template <typename F> struct Floats4 // like QRectF, QVector4D
{
    F f1, f2, f3, f4;
};
template <typename F> struct Matrix4x4 // like QGenericMatrix<4, 4>
{
    F m[4][4];
};
struct QChar
{
    unsigned short ucs;
};
struct QLatin1String
{
    const char *str;
    int len;
};
template <typename F> struct QMatrix
{
    F _m11, _m12, _m21, _m22, _dx, _dy;
};
template <typename F> struct QMatrix4x4 // like QMatrix4x4
{
    F m[4][4];
    int f;
};

And we’ll analyse the assembly of the following program:

template <typename T> void externalFunction(T);
template <typename T> void passOne()
{
    externalFunction(T());
}
template <typename T> T externalReturningFunction();
template <typename T> void returnOne()
{
    externalReturningFunction<T>();
}
// C++11 explicit template instantiation
template void passOne<Pointers2>();
template void passOne<Pointers4>();
template void passOne<Integers2>();
template void passOne<Integers4>();
template void passOne<Floats2<float> >();
template void passOne<Floats2<double> >();
template void passOne<Floats3<float> >();
template void passOne<Floats3<double> >();
template void passOne<Floats4<float> >();
template void passOne<Floats4<double> >();
template void passOne<Matrix4x4<float> >();
template void passOne<Matrix4x4<double> >();
template void passOne<QChar>();
template void passOne<QLatin1String>();
template void passOne<QMatrix<float> >();
template void passOne<QMatrix<double> >();
template void passOne<QMatrix4x4<float> >();
template void passOne<QMatrix4x4<double> >();
template void returnOne<Pointers2>();
template void returnOne<Pointers4>();
template void returnOne<Integers2>();
template void returnOne<Integers4>();
template void returnOne<Floats2<float> >();
template void returnOne<Floats2<double> >();
template void returnOne<Floats3<float> >();
template void returnOne<Floats3<double> >();
template void returnOne<Floats4<float> >();
template void returnOne<Floats4<double> >();
template void returnOne<Matrix4x4<float> >();
template void returnOne<Matrix4x4<double> >();
template void returnOne<QChar>();
template void returnOne<QLatin1String>();
template void returnOne<QMatrix<float> >();
template void returnOne<QMatrix<double> >();
template void returnOne<QMatrix4x4<float> >();
template void returnOne<QMatrix4x4<double> >();

In addition, we’re interested in what happens to non-structure floating point parameters: are they promoted or not? So we’ll also test the following:

void passFloat()
{
    void externalFloat(float, float, float, float);
    externalFloat(1.0f, 2.0f, 3.0f, 4.0f);
}
void passDouble()
{
    void externalDouble(double, double, double, double);
    externalDouble(1.0f, 2.0f, 3.0f, 4.0f);
}
float returnFloat()
{
    return 1.0f;
}
double returnDouble()
{
    return 1.0;
}

Analysis of the output

x86-64

You might have noticed I skipped old-style 32-bit x86. That was intentional, since that platform does not support passing by registers anyway. The only conclusion we could draw from that would be:

whether the structures are stored in the stack in the place of the argument, or whether they’re stored elsewhere and it’s passed by pointer
whether single-precision floating-point is promoted to double-precision

Moreover, I’m intentionally ignoring it because I want people to start thinking of the new ILP32 ABI for x86-64, enabled by GCC 4.7′s -mx32 switch, which follows the same ABI as the one described below (with the exception that pointers are 32-bit).

So let’s take a look at the assembly results. For parameter passing, we find out that

Pointers2 is passed in registers;
Pointers4 is passed in memory;
Integers2 is passed in a single register (two 32-bit values per 64-bit register);
Integers4 is passed in two registers only (two 32-bit values per 64-bit register);
Floats2<float> is passed packed into a single SSE register, no promotion to double
Floats3<float> is passed packed into two SSE registers, no promotion to double;
Floats4<float> is passed packed into two SSE registers, no promotion to double;
Floats2<double> is passed in two SSE registers, one value per register
Floats3<double> and Floats4<double> are passed in memory;
Matrix4x4 and QMatrix4x4 are passed in memory regardless of the underlying type;
QChar is passed in a register;
QLatin1String is passed in registers.
The floating point parameters are passed one per register, without float promotion to double.

For return values, the conclusion is the same as above: if the value is passed in registers, it's returned in registers too; if it's passed in memory, it's returned in memory. This leads us to the following conclusions, supported by careful reading of the ABI document:

Single-precision floating-point types are not promoted to double;
Single-precision floating-point types in a structure are packed into SSE registers if they are still available
Structures bigger than 16 bytes are passed in memory, with an exception for __m256, the type corresponding to one AVX 256-bit register.

IA-64

Here are the results for parameter passing:

Both Pointers structures are passed in registers, one pointer per register;
Both Integers structures are passed in registers, packed like x86-64 (two ints per register);
All of the Floats structures are passed in registers, one value per register (unpacked);
QMatrix4x4<float> is passed entirely in registers: half of it (the first 8 floats) are in floating-point registers, one value per register (unpacked); the other half is passed in integer registers out4 to out7 as the memory representations (packed);
QMatrix4x4<double> is passed partly in registers: half of it (the first 8 doubles) are in floating-point registers, one value per register (unpacked); the other half is passed in memory;
QChar and QLatin1String are passed in registers;
Both QMatrix are passed entirely in registers, one value per register (unpacked);
QMatrix4x4 is passed like Matrix4x4, except that the integer is always in memory (the structure is larger than 8*8 bytes);
Individual floating-point parameters are passed one per register; type promotion happens internally in the register.

For the return values, we have:

The floating-point structures with up to 8 floating-point members are returned in registers;
The integer structures of up to 32 bytes are returned in registers;
All the rest is returned in memory supplied by the caller.

The conclusions are:

Type promotion happens in hardware, as IA-64 does not have specific registers for single or double precision (is FP registers hold only extended precision data);
Homogeneous structures of floating-point types are passed in registers, up to 8 values; the rest goes to the integer registers if there are some still available or in memory;
All other structures are passed in the integer registers, up to 64 bytes;
Integer registers are allocated for passing any and all types, even if they aren't used (the ABI says they should be used if in the case of C without prototypes).

ARM

I've compiled the code only for ARMv7, with the floating-point parameters passed in the VFP registers. If you're reading this blog, you're probably interested in performance and therefore you must be using the "hard-float" model for ARM. I will not concern myself with the slower "soft-float" mode. Also note that this is ARMv7 only: the ARMv8 64-bit (AArch64) rules differ slightly, but no compiler for it is available.

Here are the results for parameter passing:

Pointers2, Pointers4, Integers2, and Integers4 are passed in registers (note that the Pointers and Integers structures are the same in 32-bit mode);
All of the Float types are passed in registers, one value per register, without promotion of floats to doubles; the values are also stored in memory but I can't tell if this is required or just GCC being dumb;
All types of Matrix4x4, QMatrix and QMatrix4x4 are passed in both memory and registers, which contains the first 16 bytes;
QChar and QLatin1String are passed in registers;
are passed in memory regardless of the underlying type.
The floating point parameters are passed one per register, without float promotion to double.

For returning those types, we have:

All of the Float types are returned in registers and GCC then stores them all to memory even if they are never used afterwards;
QChar is returned in a register;
Everything else is returned in memory.

Note that the return type is one of the places where the 32-bit AAPCS differs from the 64-bit one: there, if a type is passed in registers to a function where it is the first parameter, it is returned in those same registers. The 32-bit AAPCS restricts the return-in-registers to structures of 4 bytes or less.

My conclusions are:

Single-precision floating-point types are not promoted to double;
Homogeneous structures (that is, structures containing one single type) of a floating-point type are passed in floating-point registers if the structure has 4 members or fewer;

MIPS

I have attempted both a MIPS 32-bit build (using the GCC-default o32 ABI) and a MIPS 64-bit (using -mabi=o64 -mlong64). Unless noted otherwise, the results are the same for both architectures.

For passing parameters, they were:

Both types of Integers and Pointers structures are passed in registers; on 64-bit, two 32-bit integers are packed into a single 64-bit register like x86-64;
Float2<float>, Float3<float>, and Float4<float> are passed in integer registers, not on the floating-point registers; on 64-bit, two floats are packed into a single 64-bit register;
Float2<double> is passed in integer registers; on 32-bit, two 32-bit registers are required to store each double;
On 32-bit, the first two doubles of Float3<double> and Float3<double> are passed in integer registers, the rest are passed in memory;
On 64-bit, Float3<double> and Float3<double> are passed entirely in integer registers;
Matrix4x4, QMatrix, and QMatrix4x4 are passed in integer registers (the portion that fits) and in memory (the rest);
QChar is passed in a register (on MIPS big-endian, it's passed on bits 16-31);
QLatin1String is passed on two registers;
The floating point parameters are passed one per register, without float promotion to double.

For the return values, MIPS is easy: everything is returned in memory, even QChar.

The conclusions are even easier:

No float is promoted to double;
No structure is ever passed in floating-point registers;
No structure is ever returned in registers.

General conclusion

There are only few aggregate conclusion that we can take. One of them is that single-precision floating point values are not explicitly promoted to double when formal parameters are present. The automatic promotion probably happens only for floating-point values passed in ellipsis (...), but our problem statement was about calling functions where the parameters are know. The only slight deviation from the rule is IA-64, but it's unimportant as the hardware, like x87, only operates in one mode.

For the structures containing integer parameters (that includes pointers), there's nothing further to optimise: they are loaded into registers exactly as they appear in memory. That means the portion of the register corresponding to padding might contain uninitialised or garbage data, or it might make something really strange like MIPS in big-endian mode. It also means, on all architectures, that types smaller than a register do not occupy the entire register, so they might be packed with other members.

Another is quite obvious: structures containing floats are smaller than structures containing doubles, so they will use less memory or fewer registers to be passed.

To continue taking conclusions, we need to exclude MIPS since it passes everything in the integer registers and returns everything by memory. If we do that, we are able to see that all ABIs provide an optimisation for structures containing only one floating-point type. Those are called by slightly different names in the ABI documents, all meaning homogeneous floating-point structure. Those optimisations mean that the structure is passed on floating-point registers under certain conditions.

The first one to break down is actually x86-64: the upper limit is 16 bytes, limited to two SSE registers. The rationale for this seems to be passing one double-precision complex value, which takes 16 bytes. That we are able to pass four single-precision values is an unexpected benefit.

The remaining architectures (ARM and IA-64) can pass more values by register, and always at one value per register (no packing). IA-64 has more registers dedicated to parameter passing, so it can pass more than ARM.

Recommendations for code

Structures of up to 16 bytes containing integers and pointers should be passed by value;
Homogeneous structures of up to 16 bytes containing floating-point should be passed by value (2 doubles or 4 floats);
Mixed-type structures should be avoided; if they exist, passing by value is still a good idea;

The above is only valid for structures that are trivially-copiable and trivially-destrucitble. All C structures (POD in C++) meet those criteria.

Final note

I should note that the recommendations above do not always produce more efficient code. Even though the values can be passed in registers, every single compiler I tested (GCC 4.6, Clang 3.0, ICC 12.1) still does a lot of memory operations in some cases. It's quite common for the compiler to write the structure to memory and then load it into the registers. When it does that, passing by constant reference would be more efficient since it would replace the memory loads with arithmetic on the stack pointer.

However, those are simply a matter of further optimisation work by the compiler teams. The three compilers I tested for x86-64 optimise differently and, in almost all cases, at least one of them managed to do without memory access. Interestingly, the behaviour changes also when we replace the padding space with zeroes.

7 comments

Andreas Aardal Hanssen
February 22, 2012 at 20:56 UTC (UTC 0)
Can compilers rewrite passing by const& as pass by value?
Volker Hilsheimer
February 22, 2012 at 21:42 UTC (UTC 0)
Useful and detailed analysis! However, the conclusion omits that the majority of structures has (copy-)constructors and destructors which will be executed for each function call when you pass by value, generating both more code and runtime overhead. Unless I miss something obvious your recommendation seems to apply only to POD types.
Thiago Macieira
February 22, 2012 at 21:53 UTC (UTC 0)
@Andreas: The compiler can only replace a const-ref if it’s inlining the function. I’m thinking of non-inlinable cases here.
Thiago Macieira
February 22, 2012 at 21:54 UTC (UTC 0)
@Volker: you’re right. I did not say it, but all of this depends on the type in question being trivially copyable and trivially destructible. Otherwise, the type is never passed in registers since the copy constructor and destructor need a value for “this”. I’ve updated the conclusion to say that.
Pritam
February 23, 2012 at 08:32 UTC (UTC 0)
can compiers inline class functions on their own? I remember reading in kde’s techbase that making existing functions will break BC.
http://techbase.kde.org/Policies/Binary_Compatibility_Issues_With_C++#The_Do.27s_and_Don.27ts
Nick Shaforostoff
February 23, 2012 at 11:25 UTC (UTC 0)
Hey, you were talking about QLatin1String.
I checked out http://doc.qt.nokia.com/qt5/qstring.html and it still uses const reference.
Thiago Macieira
February 23, 2012 at 20:38 UTC (UTC 0)
@Nick: yes, we need to remove the reference. QLatin1String should be passed by value.

Thiago Macieira's blog

An Open Source hacker's ramblings

Problem statement

Analysis of the output

x86-64

IA-64

ARM

MIPS

General conclusion

Recommendations for code

Final note

7 comments

Andreas Aardal Hanssen

Volker Hilsheimer

Thiago Macieira

Thiago Macieira

Pritam

Nick Shaforostoff

Thiago Macieira

Comments have been disabled.

Categories

Me

Google Plus

Blogroll

My Gitorious activity

Archives

Meta

Copyright

Thiago Macieira's blog

An Open Source hacker's ramblings

The value of passing by value

Problem statement

Analysis of the output

x86-64

IA-64

ARM

MIPS

General conclusion

Recommendations for code

Final note

7 comments

Andreas Aardal Hanssen

Volker Hilsheimer

Thiago Macieira

Thiago Macieira

Pritam

Nick Shaforostoff

Thiago Macieira

Comments have been disabled.

Categories

Tags

Me

Google Plus

Blogroll

My Gitorious activity

Archives

Meta

Copyright