
Jan 16

Sorry state of dynamic libraries on Linux

Last week, we identified a bug in Qt with Olivier’s new signal-slot syntax. Upon further investigation, it turned out not to be a Qt issue but an ABI one, which prompted me to investigate further and conclude that dynamic libraries need a big overhaul on Linux.

tl;dr (a.k.a. Executive Summary)

Shared libraries on Linux are built with -fPIC, which makes all variable references and function calls indirect, unless they are static. That’s because in addition to making the code position-independent, -fPIC makes every variable and function interposable by another module: each one can be overridden by the executable and by LD_PRELOAD libraries. The indirection imposes a performance cost, and we should do away with it without sacrificing position-independence.

Plus, there are a few more actions we should take (like prelinking) to improve performance even further.

Jump to existing or proposed solutions, Google+ discussion.

Details

Note: in the following, I will show x86-64 (64-bit) assembly and will restrict myself to that architecture. However, the problems and solutions also apply to many other architectures, such as x86 and ARM, so the discussion is broadly relevant. The only platform to which most of this does not apply is IA-64.

The basics

Imagine the following C file, which also compiles in C++ mode:

extern void *externalVariable;
extern void externalFunction(void);
 
void myFunction()
{
    externalFunction();
    externalVariable = &externalFunction;
}

The code above demonstrates three features of the languages in one function: it loads the address of a function, it calls a function and it writes to a variable. The compiler does not know where the function and variable are: they might be in another .o file linked into this ELF module or they might be in another ELF module (i.e., a library) this module links to.

The compiler produces the following assembly output (gcc 4.6.0, -O3):

        call    externalFunction
        movq    $externalFunction, externalVariable(%rip)

This assembly snippet makes use of two symbols whose values the assembler does not know, so the assembled .o contains three relocations. GCC has produced the most efficient and most compact compilation of this code.

When we link this .o into an executable, we start to see the drawbacks. The first is that both instructions need to encode, in their bits, the values of the symbols we didn’t know. So the linker must somehow fix this. It fixes the call instruction by making it call a stub (a trampoline), which jumps to the actual address. This stub is placed in a separate section of code called the Procedure Linkage Table (PLT). The contents of the PLT stub are not that important; suffice it to say that it performs an indirect jump.

The movq instruction cannot be fixed. There’s simply no way, because it writes a constant value to a constant location, directly. Even if we allowed for an instruction, or a pair of instructions, wide enough to write any 64-bit value to any address in the 64-bit space, we would still have a problem: those values are not known at link time. So instead of fixing the instruction, the linker “fixes” the values. For the address of externalFunction, it uses the address of the PLT stub it created in the previous paragraph. For externalVariable, it creates a copy relocation, which means the dynamic linker will need to find the variable where it is, copy its value to a fixed location in the executable and then tell everyone that the variable actually lives in the executable.

What are the consequences of this? For the PLT call, it’s a simple performance impact that could not be avoided. Since the address of the actual externalFunction is not known at compile or link time, and we don’t want to leave a text relocation, the only way to place that call is to find the address at run time and call it indirectly.

For the copy relocation, the consequences for the executable are small. The code it will execute is still the most efficient and most compact. The dynamic linker will have to find where the symbol actually is at load-time, which is something that it would have to do anyway, plus copy its contents, checking that the size hasn’t changed. This is done only once, then the code runs in its most efficient form.

The fact that we resolved &externalFunction to the address of the PLT stub means that any use of that function pointer (an indirect call) will end up in a function that does an indirect call too. That is, it’s a doubly-indirect call. I seriously doubt any processor can do proper branch prediction, speculative execution, and prefetching of code under those circumstances.

It gets worse

So far we’ve analysed what happens in an executable. Now let’s see what happens when we try to build the same C code for a shared library. We do that by introducing the -fPIC compiler option, which tells the compiler to generate position-independent code. The compiler produces the following assembly output:

        call    externalFunction@PLT
        movq    externalFunction@GOTPCREL(%rip), %rdx
        movq    externalVariable@GOTPCREL(%rip), %rax
        movq    %rdx, (%rax)

When assembled, the .o still contains three relocations, albeit of different types.

When we compare the output of the position-dependent and the position-independent code, we notice the following:

  1. The call is still a call, but now we’re explicitly calling the PLT stub. This might seem irrelevant, since the linker would have fixed the call to point to the PLT anyway if it had to, but it isn’t.
  2. The single movq instruction was split in three. This is required by the x86-64 architecture, since the instruction set cannot encode a 64-bit value and the 64-bit address to store it to in the same instruction (such an instruction would be at least 17 bytes long, two bytes longer than the maximum instruction length).
  3. The values for the two symbols are loaded indirectly. Instead of encoding the two values in those two middle movq instructions, the compiler is loading the values from another linker-generated structure called the Global Offset Table (GOT).

The compiler needed to generate the code above since it doesn’t know where the symbols will actually be. As was the case before, those symbols can be linked into the same ELF module as this compilation unit, or they may be found elsewhere in another ELF module this one links to.

Moreover, the compiler and linker need to deal with the possibility that an executable might have done exactly what our executable in the previous section did: create a copy relocation on the variable and fix the address of the function to its own PLT stub. To work properly, this code must deal with the fact that its own variable might have ended up elsewhere, and that &externalFunction might have a different value.

That means the indirect call through the PLT and the three movq instructions remain, even if those two symbols were in the same compilation unit!

The problem is that even if at first glance you’d think that the compiler should know for a fact where those symbols are, it actually doesn’t. The -fPIC option doesn’t enable only position-independent code. It also enables ELF symbol interposition, which is when another module “steals” the symbol. That happens normally by way of the copy relocations, but can also happen if an LD_PRELOAD’ed module were to override those symbols. So the compiler and linker must produce code that deals with that possibility.

In the end, we’re left with indirect calls, indirect symbol address loadings and indirect variable references, which impact code performance. In addition, the linker must leave behind relocations by name for the dynamic linker to resolve at load-time.

All this for the possibility of interposition?

Yes, it seems so. The impact is there for this little-known and little-used feature. Instead of optimising for the common-case scenario where the symbols are not overridden, the ABI optimises for the corner case.

Another argument is that the ABI optimises for executable code, placing the impact on the libraries. The argument is valid if the executables are much larger and more complex than the libraries themselves. It’s valid too if we consider that application developers write sloppy code, whereas library developers will write very optimised code.

I don’t think that argument holds anymore. Libraries have become much more complex in the past 10-15 years and do a lot more than they once did. They are no longer mere wrappers around system calls, like libc 4 and 5 were on Linux in the late 90s. Moreover, if we consider the rise of interpreted languages, like Perl, Python, Ruby, even QML and JavaScript, the code belonging to the ELF executables is negligible. Compare the size of the executables with the libraries that actually do the interpretation:

-rwxr-xr-x. 2 root root   13544 Aug  5 06:27 /usr/bin/perl
-rwxr-xr-x. 2 root root    9144 Apr 12  2011 /usr/bin/python
-rwxr-xr-x. 1 root root    5160 Dec 29 13:46 /usr/bin/ruby
-r-xr-xr-x. 1 root root 1763488 Apr 12  2011 /usr/lib64/libpython2.7.so.1.0
-rwxr-xr-x. 1 root root  947736 Dec 29 13:46 /usr/lib64/libruby.so.1.8.7
-rwxr-xr-x. 1 root root 1524064 Aug  5 06:27 /usr/lib64/perl5/CORE/libperl.so

That’s even valid for interpreters that JIT the code. As optimised as the code they generate may be, current practice is that performance-critical operations are implemented in native code, which means libraries or plugins.

Existing solutions

Partial solution for private symbols

When developing your library, if you know that certain symbols are private and will never be used by any other library, you have an option: you can declare their ELF visibility to be “hidden”, which has two consequences. The obvious one is that the linker will not add the hidden symbols to the dynamic symbol table, so other ELF modules simply cannot find them. If they can’t find them, they can’t steal them. And if they can’t be stolen, the linker does not need to produce a PLT stub for the function call, so the call instruction is linked into a simple, direct call, as in the executable in the first part.

The other consequence is an optimisation that the compiler does. Since it also knows that the externalVariable variable cannot be stolen, it does not need to address the variable indirectly. The generated assembly becomes:

        call    externalFunction@PLT
        movq    externalFunction@GOTPCREL(%rip), %rax
        movq    %rax, externalVariable(%rip)

The .o file will still contain three relocations. However, note that the address of externalFunction is still obtained indirectly, even though the compiler knows it cannot be interposed. That means the linker will still generate a load-time relocation for the dynamic linker to get the address of that function. Fortunately, it’s a simpler relocation, since the symbol name itself is not present.

If there’s a reason for getting the address indirectly like this, I have yet to find it.
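In practice, it is usually easier to hide everything by default with -fvisibility=hidden and mark only the public API. A sketch, assuming gcc and binutils (the symbol names are invented):

```shell
cat > vis.c <<'EOF'
__attribute__((visibility("default"))) int publicApi(void) { return 1; }
int internalHelper(void) { return 2; }  /* hidden by -fvisibility=hidden */
EOF
gcc -shared -fPIC -fvisibility=hidden -o libvis.so vis.c
nm -D --defined-only libvis.so    # only publicApi is in the dynamic symbol table
```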

Partial solution for public non-interposable symbols

If your symbols are public, however, you cannot use the ELF “hidden” visibility trick. But if you know that they cannot and will not ever be stolen or interposed, you have another possibility, which is to tell that to the compiler and linker.

If you declare a variable with ELF “protected” visibility, you’re telling the compiler and linker that it cannot be stolen, yet can be placed in the dynamic symbol table for other ELF modules to reference. You just have to be absolutely sure that they will not ever be interposed, because that will create subtle bugs that are hard to track down. That includes access to those symbols by position-dependent executable code, like we did in the first section.

The GCC syntax __attribute__((visibility("protected"))) works on ELF platforms only, whereas the “hidden” variant is known to work on non-ELF platforms too, like Mac OS X (Mach-O) and IBM AIX (XCOFF).

Another way to do the same is to use one of two linker options: -Bsymbolic and -Bsymbolic-functions. They do basically the same as the protected visibility: they keep the symbols in the dynamic symbol table, but they make the linker use the symbol inside the library unconditionally. The difference between those two options is that the former applies to all symbols, whereas the latter applies to functions only.

The reason why -Bsymbolic-functions exists requires looking back at the executable code from the first section. While the variable reference required a copy relocation, the function call was done indirectly, through the PLT stub. A variable can be moved, but moving code isn’t possible, so the executable code needs to deal with the code being elsewhere anyway. For that reason, it’s possible to symbolically bind function calls inside a library without affecting executables.

Or so we thought. The problem we discovered last week arises when you treat a function as a data reference: taking its address. As we saw in the first part, the linker resolves the address of the function to the address of the PLT stub found in the executable. But if you symbolically bind the function in the library, there it resolves to the real address. If you compare the two addresses, they won’t be the same.

Proposed solutions

Some of the solutions I propose are ABI- and binary-compatible with existing builds; others are ABI-incompatible and would require recompilation. Unfortunately, the best solution would require source-incompatible changes. Still, all the changes below give libraries a bit of optimisation by making executables slightly less optimised.

Use of PLT in function calls should rest only with the linker

As we saw in the code generated for the library with -fPIC, the compiler decided to make the call indirect by adding “@PLT” to the symbol name. It turns out that the linker doesn’t really care about this and will generate (or not) the PLT stub as needed. That being the case, the compiler should not make a judgement call about where the symbol is located just because of -fPIC.

Function addresses should always be resolved through the GOT

Function calls already require a pointer-sized variable somewhere and a relocation to make it point to the valid entry point of the function being called. What’s more, taking addresses of functions is a somewhat rare operation, compared to the number of function calls across ELF modules.

That being the case, we can take a small “hit” in performance: the loading of a function address should happen via the GOT in position-dependent code (executables), just as it is done for position-independent code.

The benefit of doing this is that the function address we load will point exactly to the function’s real entry point, instead of the PLT stub. When we call this function, we avoid the doubly-indirect branching we found earlier.

PLT stubs should use the regular GOT’s address, if it exists

If a given function is both called and has its address taken, the PLT stub should reference the GOT entry that was used for taking the address. The reason why it isn’t already so, I guess, is that the entries in the .got.plt section aren’t initialised with the target function’s address, but with the local module’s function resolver. This trick allows for “lazy resolution” of functions: they are resolved only the first time they are called.

I wouldn’t ask for all functions to be resolved at load-time, but if the address of the function is taken anyway, the dynamic linker will need to resolve it at load time. So why waste CPU cycles in a function call if the address was computed already?

Copy relocations should be deprecated

Instead of copying the variable from the library into the executable, executables should use indirect addressing for reading variables and writing to them, as well as taking their addresses. One benefit of doing this is avoiding the actual copying. For example, for read-only variables, they may remain in read-only pages of memory, instead of being copied to read-write pages found in the executable.

The big drawback of this is that the indirect addressing is a lot more expensive, since it requires two memory references, not just one. The next suggestion might help alleviate the problem.

The linker should relax instructions used for loading variable addresses

This is a suggestion found in the IA-64 ABI: the compiler generates the instructions needed to load the address of the variable from the GOT, then use it as it needs to. If the linker concludes (by whichever means, like protected or hidden symbols, the use of one of the symbolic options, or because this is an ELF application and the symbol is defined in it) that the symbol must reside in the current ELF module, it can change the load instruction into a register-to-register move or similar.

For our x86-64 64-bit case, the instructions the compiler generated were:

        movq    externalVariable@GOTPCREL(%rip), %rax
        movq    %rdx, (%rax)

By changing one bit in the opcode of the first instruction, with no code size change, we can produce:

        leaq    externalVariable(%rip), %rax
        movq    %rdx, (%rax)

The x86 instruction LEA means “load effective address”. Instead of loading 64 bits from a memory address and storing them in the register, the instruction stores the address itself in the register; the linker also retargets the relocation so that this address is the variable’s, not the GOT entry’s. This isn’t as optimised as the original code found in the executable, for two reasons: it requires two instructions instead of one, and it requires an additional register.

It’s possible to generate even more efficient code if the assembler leaves a 32-bit immediate offset in the second movq instruction, making it 6 bytes long. This extra immediate would have no impact on the original code, besides making it longer, but it would allow the linker to optimise the code further.

The original would be:

        movq     externalVariable@GOTPCREL(%rip), %rax
        movq.d32 %rdx, 0x0(%rax)

And it would get relaxed to:

        nopl.d32 0x0(%rax)
        movq     %rdx, externalVariable(%rip)

That is, the first 6-byte instruction is resolved to a 6-byte NOP, whereas the second 6-byte instruction executes the actual store, with no extra register use. The compiler cannot know that the register will be left untouched, but at least there is no dependency between the two instructions that might cause a CPU stall.

The same applies to other architectures too. The full -fPIC code on ARM to store a value from a register into a variable is the following:

        ldr     r3, .L2+8     @ points to a constant whose value is: externalVariable(GOT)
.LPIC1: ldr     r3, [r4, r3]  @ r4 contains the base address of the GOT
        str     r2, [r3, #0]

If the linker can conclude the symbol must be in the current ELF module and cannot change, it may be able to avoid the extra load (the middle instruction) by changing the code to be:

        ldr     r3, .L2+8     @ points to a constant whose value is: externalVariable-(.LPIC1-8)
.LPIC1: add     r3, pc, r3
        str     r2, [r3, #0]

Unlike x86, the ARM instructions cannot be optimised further, since the immediates encodable in the instructions have limited range.

The linker should relax instructions used for loading function addresses

Similar to the above, but instead looking at function addresses. The original library code is:

        movq    externalFunction@GOTPCREL(%rip), %rdx

But it can be relaxed to:

        leaq    externalFunction(%rip), %rdx

With ARM, the original code is:

        ldr     r3, .L2+8     @ points to a constant of value: externalFunction(GOT)
        ldr     r2, [r4, r3]  @ r4 contains the address of the base of the GOT

But relaxed, it would be:

        ldr     r2, .L2+8    @ points to a constant of value: externalFunction-(.LPIC0+8)
.LPIC0: add     r2, pc, r2

There should be a way to tell the compiler where the symbol is

We’re already able to tell the compiler that a symbol is in the current module, with the hidden visibility attribute. We should also be able to tell it that a symbol is in the current module but exported, and that a symbol is in another module.

I would suggest simply using the existing ELF markers and being explicit about them:

  • __attribute__((visibility("hidden"))): symbol is in this ELF module and is not exported (equivalent on Windows: no decoration);
  • __attribute__((visibility("protected"))): symbol is in this ELF module and is exported (equivalent on Windows: __declspec(dllexport));
  • __attribute__((visibility("default"))): symbol is in another ELF module (equivalent on Windows: __declspec(dllimport)); this also applies to symbols that must be overridable according to the library’s API (like C++’s global operator new).

Considering the other suggestions, we know the references to symbols with “default” visibility can be relaxed into simpler and more efficient code in the presence of one of the symbolic binding options. That means we can use the “default” visibility for cases of uncertain symbols.

Getting there

Some of the solutions I listed are already possible and they should be used immediately in all libraries. That is especially true about the use of the hidden visibility: all libraries, without exception, should make use of this feature. In fact, since this option was introduced in GCC 4.0 seven years ago, many libraries have started using it and are now “good citizens”, for they access their own private data most efficiently, they don’t have huge symbol tables (which impact lookup speed) and they don’t pollute the global namespace with unnecessary symbols.

Other solutions cannot be implemented yet. The one I personally feel is most important to implement first concerns ELF executables: they need to stop using copy relocations, and they should resolve addresses of functions via the GOT. Only once that is done can libraries start using the “protected” visibility and generate improved code. This implies changing the psABI for the affected platforms, which may not be an easy transition.

An alternative to using the “protected” visibility is to use the symbolic binding options. The code relaxation optimisations would come in handy at this point to optimise at link-time the code that the compiler could not make a decision on. Unfortunately, those options apply to all symbols in a library, so libraries that must have overridable symbols need to use an extra option (--dynamic-list) and list each symbol one by one.

Using -fPIE

The compiler option -fPIE tells the compiler to generate position-independent code for executables. It is similar to the -fPIC option in that it generates position-independent code, but it has the added optimisation that the compiler can assume none of its symbols can be interposed.

With executables compiled with this option, copy relocations and direct loading of function addresses aren’t used. This solves the problem we had. Therefore, compiling executables with this option allows us to start using some of the optimisations I described before.

Unfortunately, as its description says, this option also generates position-independent code, which can be less efficient than position-dependent code in some situations. My preference would be position-dependent executables without the copy relocations. However, this option has a useful side effect: it defines the __PIC__ macro, whose absence can be used to abort compilation for libraries that have transitioned to the more efficient options.

Further work and further reading

I highly recommend Ulrich Drepper’s “How to Write Shared Libraries” paper. His recommendations do not go as far as suggesting ABI changes like I have, but he makes many that library developers should adhere to, regardless of whether my recommendations are accepted. For example, using static functions and data where possible and avoiding arrays of pointers are recommendations I have made to many people.

Other work necessary is to improve prelinking support. Shared libraries are position-independent, but they can be prelinked to a preferred location in memory. One optimisation I have yet to see done is to use the read-only pages of prelinked data when the library is loaded at that preferred address (the .data.rel.ro sections).

23 comments

  1. Aleksey Khudyakov

    Are there any measurements of the impact of doubly-indirect calls? It could be anything between a serious problem and something undetectable.

  2. Thiago Macieira

    No, I don’t have any measurements. I’m going from traditional thinking about assembly: an indirect jump requires a memory fetch in addition to the control transfer. Double that and you have double the damage.

  3. Artie Gold

    Interposition is indeed a rarely used but very powerful technique; I’d hate to see it go away for the cost of a couple of cycles when cycles are cheap.

  4. Thiago Macieira

    @Artie: I understand that. I have done my share of LD_PRELOAD modules too.

    However, for certain libraries, I really want to have the ability to forbid interposition if I want to and save some cycles. Right now, it’s not an option since it also causes subtle bugs.

  5. Mark Jewell

    It’s awfully unconvincing to declare interposition an unnecessary corner-case and doom-say its performance implications all based on a thought experiment. Yes, sure, indirect jumps have their cost, but you’re not going to convince anyone that your position is valid without demonstrable evidence.

  6. leek

    Minor nit: The code example is not standard-conforming C, because function addresses cannot be assigned to data pointers, not even void *.

  7. brigade

    In OS X, the GOT keeps the real address of the function rather than the stub, so there’s no doubly indirect calls through function pointers. DYLD_INSERT_LIBRARIES works fine, so I don’t see why Linux’s dynamic linker couldn’t do this, and it shouldn’t even need any change except to the linker.

    Also, for doubly indirect calls, the only hit to branch prediction is using another entry in the tables. Though that’s in addition to the extra L1I cache line the stub requires and a small hiccup in instruction decoding that can often be hidden by the processor.

    > I wouldn’t ask for all functions to be resolved at load-time, but if the address of the function is taken anyway, the dynamic linker will need to resolve it at load time.

    Actually, since the GOT points to the PLT stub, function resolution can be completely lazy even if the address is taken.

  8. Soprt

    I am left unconvinced. What is the performance penalty of this indirection? Or, alternatively, how much is there to win in terms of performance by removing it (a few instructions)? The relevance of this post depends on this imo. If the performance improvement is as small as I think it might be, then this is pretty irrelevant.

  9. nwf

    There’s also the diametrically opposite view: we should dramatically reduce the number of shared libraries by rethinking the system architecture. As an easy example, we could do away with almost all of the authentication libraries by having a factotum process and a single, small, static linkage library for its use, like Plan 9 does. The result would actually be two-fold: improved security and elimination of lots of dynamic linkage.

    Just to stave off the inevitable, incorrect point: Plan 9’s fully-static executables are not all that big. /386/bin/acme, which is a full text editor and graphical windowing system, is 444544 bytes; libgtk-x11-2.0 has a smidge over a tenth of that just in .data and .bss (and is hardly the only library of its kind in this regard) and a smidge under 10 times that in .text. I am not proposing that we jettison dynamic linkage altogether, but that revisiting the “no static linkage” decision may do us some good.

  10. renoX

    For some reason, this makes me think of Michael Meeks’s optimisations ( http://lwn.net/Articles/192624/ ), but I’m not 100% sure that this is the same issue.

  11. anonymous

    Next: the sorry state of static libraries :-)

  12. wh

    I am waiting for statically compiled Linux distribution with GNUless userland. But I need more static friendly programs.
    At least static core and most used programs.

  13. Thiago Macieira

    @Mark Jewell: if you read the assembly code, you see that there are more instructions and more memory references, not to mention the extra relocations by name left in the GOT. What’s more, data references cannot be interposed if they are used in the executable. The game is already over because the variable must reside in the executable and cannot be hijacked.

    There’s simply no way that the interposition has zero impact. It has an impact. What I’m asking is for a way to disable it in order to gain performance when I want to, without causing subtle bugs due to the copy relocations.

    And yes, I call it a corner-case: how often do you use that facility, compared to how often you use libraries? I can bet you it’s 1000:1 or more.

  14. Thiago Macieira

    @brigade: it’s necessary because of how the psABI refers to ELF executables. The instructions that it leaves in the executable code, as you can see in the first part of the blog, encode directly a value that must reside in the executable’s image. So everyone else must adapt and point to wherever that executable decided to point. What I’m asking for is that this be relaxed and even the executable should load the address indirectly via the GOT, so that the GOT entry can point to the real function entry point, instead of the PLT stub.

    > Actually, since the GOT points to the PLT stub, function resolution can be completely lazy even if the address is taken.

    The PLT has a separate GOT section (the .got.plt) on ELF platforms. So there are two entries for the same function if its address is taken and it is called (this applies to shared libraries, not executables). Because the address is taken, the function’s address was resolved at load-time anyway — that cannot be done lazily. Yet the function call resolves lazily with a small run-time impact.

  15. Thiago Macieira

    @Soprt: don’t think of just the indirection. There’s more associated with the indirect call. Compared to the direct call that is demonstrably possible, the indirect call adds: an entry to the PLT, its corresponding entry in the GOT, another public symbol in this ELF module’s dynamic symbol table, an indirect jump, and a lazy resolution of the function by name at the time of the first call.

    Yes, the impact of each is small. But how many functions are called via the PLT when they didn’t have to be? Multiply that by the number of times each one is called. Think of the cumulative effect of the larger binary, more indirections, larger symbol tables.

  16. Thiago Macieira

    @nwf: I think there’s a middle-road to be taken: have fewer, but larger libraries. Ulrich Drepper’s paper pointed out that the time required for resolving a symbol depends on the number of libraries, not just the number of symbols. A larger library would give the compiler and linker more opportunities to optimise, not to mention the developers themselves, compared to several smaller ones.

    One step that is missing there is that the intra-library calls and references should be optimised too. That was the point of my blog.

  17. Thiago Macieira

    @renoX: it’s not the same issue. The option Michael proposed to the linker, -Bdirect, is intended to speed up the resolving of undefined symbols: that is, those that this ELF module must find in other modules.

    The issue I’m talking about here, which the option -Bsymbolic resolves, is about symbols that are already defined, because they are found in this module itself.

    The two are complementary.

  18. Michael Meeks

    My take is that Thiago is right – interposing is a horrible and completely unnecessary evil – every use I’ve seen of it outside of the glibc/-lpthread nightmare (which is a royal PITA for new developers too) is a bug; I even wrote a quick tool to find and file those bugs, of which there were rather many back in the day. Worse – sometimes/often (think two different copies of sleepycatdb in the same executable) it presents horrendous problems that are extremely hard to work around.

    Unfortunately, the powers that be in glibc land – happily oblivious of the unpleasant realities of large-scale software, love this feature for unknown reasons. Perhaps because of that it has become entrenched into our precious C++ ABI: this used to use string compares for exception handling, but now uses pointer comparisons (similar to Thiago’s function comparisons), and of course the same for dynamic_cast and RTTI using methods. That makes removing the mis-feature (even if we could persuade the glibc people to surrender their fondness for this ugliness) practically impossible.

    So – in the absence of a sane linker – there is only really one option: write a new linker. Android in part forces this, and the Mozilla team are doing some more work there. Luckily ld-linux.so could (feasibly) be replaced, and the toolchain guys above glibc are able to discuss, interact and include new features in a pleasant way that glibc is institutionally incapable of (I’m afraid).

    So – the mix of a new linker and a tweaked C++ ABI makes this quite a challenge. Though if we tweak the linker we can also fix the idiocy around symbol-driven cross-shlib vtable relocations, which has a rather significant impact on both size & startup.

    Failing that – whacking all your code into one big shared library, linked symbolically, and/or (better) heavily LTO’d is prolly the solution for big apps. That’s what we’re working towards in LibreOffice.

  19. Thiago Macieira

    Thanks for the support, Michael.

    To be honest, I wasn’t calling for removal of LD_PRELOAD completely. I was pointing out that the system is heavily biased to this functionality, causing performance problems to code that doesn’t use it. So we have performance impacts on everything, every function call, every library load, for the comparatively rare possibility of symbol interposition.

    I think we can start making these changes even without support from the toolchain community — though of course I’d love them to help us out. The only thing we really can’t do without them is to change the ABI. That includes your -Bdirect linking as well as my proposed position-dependent executables without copy relocations. What I’m prepared to do is to get a set of libraries, starting with QtCore, to refuse to work unless you compile your executable with -fPIE.

    Most people have complained, in the comments as well as on reddit and ycombinator, that my blog is lacking benchmarks. I’ll give you benchmarks soon; I’m going into this knowing that the runtime performance impact might be negligible or undetectable, but hoping that it’s measurable. I’ll also give you other benchmarks: not of run-time “time required to complete a task”, but of other kinds.

  20. Diederik van der Boor

    Wow. While I don’t feel qualified to say anything about the technical details discussed here, I do think this shouldn’t be taken lightly.

    For all newcomers: did you ever notice how much faster Firefox or OpenOffice start up on Windows? A few years ago the difference was something like 2x. There is a lot to optimize, and C++ hasn’t historically received the attention it needs from glibc, the linker and the loader. If this step improves that situation, I’m all for it, and I think we should pursue it.

  21. Allan Jensen

    Pretty good blog. You have too little confidence in modern CPUs though: they will predict the destinations of double-indirect calls easily and accurately, especially in common calling-convention cases like this. The cache misses are more of a problem, since both the cache line and the entry in the BTB (branch target buffer) take up space even when the branch is predicted correctly.

  22. Thiago Macieira

    @Allan: As for the CPU’s capabilities, I’m very confident. In fact, the numbers show it (see the other blog). The branch misprediction ratio remained roughly constant at 1.8% throughout the tests. If the CPU were getting stumped by the indirect branches, removing them should have shown a decrease in the ratio, but it didn’t. That means the CPU is doing pretty well with them there.

    However, consider that the cost of a branch misprediction is still high (20 cycles), so by lowering the overall number of branches by 5% we get an improvement: number of mispredicted branches * 20 cycles * 5%. Note that 21% of the instructions executed are branches.

    Also note that the misprediction ratio of indirect branches remained constant too, but it was at 13.2%. One in every 7.5 indirect jumps was mispredicted…
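    The back-of-envelope estimate from the two comments above can be written out explicitly; the constants are the figures quoted (21% of instructions are branches, 1.8% overall misprediction ratio, a 20-cycle penalty, and a 5% reduction in branch count), normalised to a billion instructions:

    ```c
    #include <stdio.h>

    int main(void)
    {
        /* Figures quoted in the comments above (from the benchmark blog). */
        const double branch_fraction  = 0.21;  /* 21% of instructions are branches */
        const double mispredict_ratio = 0.018; /* 1.8% overall misprediction */
        const double branch_reduction = 0.05;  /* 5% fewer branches without the PLT */
        const double mispredict_cost  = 20.0;  /* cycles lost per misprediction */

        const double instructions = 1e9;       /* per billion instructions executed */

        /* mispredicted branches * 20 cycles * 5% */
        double saved = instructions * branch_fraction * mispredict_ratio
                     * branch_reduction * mispredict_cost;
        printf("~%.0f cycles saved per 1e9 instructions\n", saved);
        return 0;
    }
    ```

    This is only a rough model, of course: it assumes the removed branches mispredict at the average rate, whereas the indirect branches in question mispredict far more often (13.2%), so the real saving would be larger.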

  23. Uoti Urpala

    The part about -Bsymbolic, “Another way to do the same is to use one of two linker options”, was somewhat misleading. Currently -Bsymbolic does not give the same performance gains as using visibility attributes at the compiler level: the compiler has already emitted the inefficient indirect instructions (as described in the article), and the linker cannot change them.
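    A minimal sketch of the compiler-level alternative Uoti refers to (GCC/Clang attribute syntax; the names are illustrative): declaring the visibility at compile time lets the compiler emit a direct call in the first place, which -Bsymbolic applied later at link time cannot do.

    ```c
    #include <stdio.h>

    /* Hidden visibility tells the compiler this symbol cannot be
       interposed, so even under -fPIC it may emit a direct call
       instead of going through the PLT. */
    __attribute__((visibility("hidden")))
    int helper(void)
    {
        return 42;
    }

    int caller(void)
    {
        return helper(); /* bound at static link time, no PLT stub needed */
    }

    int main(void)
    {
        printf("%d\n", caller());
        return 0;
    }
    ```

    Compiled with `-O2 -fPIC -shared`, the call in caller() compiles to a direct call; with default visibility the same source would have to go through the PLT, and -Bsymbolic would only redirect the already-indirect code back into the module.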

  1. Links 16/1/2012: Red Hat RHEV 3.0, LibreOffice 3.4.5 | Techrights

    [...] Sorry state of dynamic libraries on Linux [...]

Comments have been disabled.
