
Jan 19

Update and benchmark on the dynamic library proposals

My last blog post on dynamic libraries on Linux attracted over 15000 visits, which was quite unexpected (15x the usual traffic). It got linked from reddit and ycombinator, and the comments there and on the previous post raised some interesting questions that I’ll try to answer.

LD_PRELOAD

First, a quick background: LD_PRELOAD and /etc/ld.so.preload tell the dynamic linker to load a certain ELF module before the rest of the normal initialisation sequence. It’s preloaded before the rest of the modules, but after two important modules have been loaded: the executable itself and the dynamic linker. By itself, it implies nothing at all about symbol hijacking. Its sole purpose is to load something. I have, for example, used it to load a different binary of a library that a program required. That works fine.
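
To make that concrete, here is a minimal sketch of a preloaded module that interposes nothing at all and merely announces that it was loaded (the file names and message are my illustration, not from the original post):

/* preload.c: illustrative preload-only module.
 * Build: gcc -shared -fPIC -o libpreload.so preload.c
 * Use:   LD_PRELOAD=./libpreload.so some_program
 */
#include <stdio.h>

/* The dynamic linker runs the ELF constructors of every module it
   loads; a preloaded module is simply loaded (and constructed) early. */
__attribute__((constructor))
static void announce(void)
{
    fprintf(stderr, "libpreload.so was loaded\n");
}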

Yes, it is little-known and little-used

If you complained that I said it’s little-known, you’re somewhat biased: you complained because you knew about it, which makes you part of the minority that knows about it. Just think about it: there are millions of people directly using Linux in the world today. How many of them do you think know about this feature?

Even more so, think about how often:

  • LD_PRELOAD is used compared to running applications without it
  • LD_PRELOAD is used to load an ELF module compared to how many ELF modules are loaded by regular means
  • functions are interposed using LD_PRELOAD compared to how many aren’t

The ratio is at least 1:1000 for a heavy user of the feature (like me!) in the best of circumstances. It’s probably several orders of magnitude more than that for the average user. Something that is used in one case in a million qualifies as little-used to me.

No, I wasn’t proposing to get rid of it (not entirely)

Some people suggested I was thinking of getting rid of the preloading feature in exchange for a few cycles saved. I would still be within my rights to suggest that, given the improvements and how rarely it is used, but I wasn’t. I’ve never proposed getting rid of the preloading feature, and my proposal would not harm the most common uses of interposition.

This requires a bit more explanation, so bear with me please.

Symbol interposition works by adding a symbol to the symbol table ahead of the “rightful” one. The dynamic linker resolves a symbol to the first occurrence it finds in the search order, so if you preload a library out of its usual order, its symbols get higher priority than they would otherwise. The extreme case is preloading a library or module that wouldn’t otherwise be loaded at all. But remember something I said before: preloaded modules are loaded after two others, so they don’t get the chance to interpose symbols defined by those two.
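
Here’s a minimal sketch of what that looks like in practice (my own illustration; the choice of puts() and all names are arbitrary): a preloadable module that overrides a libc function and forwards to the “rightful” definition.

/* interpose.c: illustrative function interposition.
 * Build: gcc -shared -fPIC -o libinterpose.so interpose.c -ldl
 * Use:   LD_PRELOAD=./libinterpose.so some_program
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

/* The preloaded module comes earlier in the search order, so this
   puts() wins the lookup; RTLD_NEXT finds the definition that would
   have been chosen otherwise (glibc's). */
int puts(const char *s)
{
    int (*real_puts)(const char *) =
        (int (*)(const char *)) dlsym(RTLD_NEXT, "puts");
    fprintf(stderr, "[interposed] ");
    return real_puts(s);
}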

If the executable performed a copy relocation on a data symbol, then LD_PRELOAD’ed modules cannot interpose it. For that reason, I am not counting interposition of data symbols as valid. In fact, in the 14 years I’ve been hacking on Linux, I’ve never done it, so I guess the chances of that happening are a billion to one or even lower. What’s more, my proposal would do away with copy relocations, which might make data interposition a valid case.
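
To illustrate the copy-relocation case (a sketch with made-up names): when a non-PIE executable references a data symbol exported by a library, the static linker reserves space for the variable inside the executable and emits a copy relocation, and from then on every module resolves the symbol to the executable’s copy.

/* libfoo.c: illustrative library.
 * Build: gcc -shared -fPIC -o libfoo.so libfoo.c */
int fooVariable = 42;

/* main.c: non-PIE executable.
 * Build: gcc -o app main.c -L. -lfoo
 * The link emits an R_X86_64_COPY relocation (on x86-64) for
 * fooVariable: its storage moves into the executable, which comes
 * first in the lookup scope, so a preloaded library defining its own
 * fooVariable can no longer interpose it. */
extern int fooVariable;
int main(void) { return fooVariable; }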

The next important thing you must understand is that my proposal would do away with interposition of intra-library symbols, but not inter-library ones. My friend Michael Meeks’ proposal of -Bdirect linking might affect inter-library interposition too, but even that proposal wouldn’t do away with it entirely.

What do I mean by this? Intra-library means “within the same library,” while inter-library means “across libraries” (think of “Internet” vs “intranet”). My proposal was intended to improve the binding of symbols inside one library, because we can gain performance there without losing position-independent code (PIC) and the advantages that come with it, like address space layout randomisation (ASLR). Precisely because we don’t want to lose PIC support and we don’t want to go back to the pre-ELF days and their problems (see Ulrich Drepper’s paper for some information on that), all inter-library symbol resolution would remain as-is, via PLTs and GOTs, including the ability to interpose symbols.

And here’s why I think we’re entitled to do that: because you cannot do it anyway unless the library has been specifically designed to allow it, as glibc is. Let’s take the code from the last blog:

extern void *externalVariable;      /* declared here, defined elsewhere */
extern void externalFunction(void);

void myFunction()
{
    externalFunction();
    externalVariable = &externalFunction;
}

And amend it like so:

/* the amendment: externalFunction is now defined in the same file */
void externalFunction(void)
{
}

If we compile this code with optimisation (GCC’s -O is enough) and inspect the assembly output, we notice that both functions are present in the output, but that myFunction does not call externalFunction. In other words, the compiler inlined one function into the other, even though the inline keyword never appeared, and since the function is empty, the inlined call expanded to zero code. With advances such as link-time optimisation, even moving the function to another compilation unit might not be enough to prevent the inlining.

That’s why I said that to support the case of intra-library symbol interposition, the library must be specifically designed to allow it, which is definitely still possible under my proposal. Most libraries aren’t designed like that and never will be, so I am confident that optimising for the great majority of libraries instead of the few is warranted. Taking my system as an example: I counted 3623 distinct libraries and plugins, and I’m pretty sure none except libc and libpthread allow for interposition, so it’s probably a 1000:1 case again.
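
For completeness, here is a rough sketch of one way a library can be designed to keep a symbol interposable (my own illustration, using GNU ld’s --dynamic-list; all names are made up): define the symbol in its own translation unit so intra-library callers cannot inline it, and list it in a dynamic-list file so the linker leaves references to it preemptible.

/* hooks.c: an illustrative, deliberately interposable hook, kept in
   its own translation unit so callers elsewhere in the library
   cannot inline it. */
void my_debug_hook(void)
{
    /* intentionally empty; tools may replace it via LD_PRELOAD */
}

/* mylib.dynlist, passed to the linker as
 * -Wl,--dynamic-list,mylib.dynlist so that references to the listed
 * symbols are not bound to the in-library definition and thus
 * remain interposable:
 *
 *   { my_debug_hook; };
 */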

Benchmarks

Another important remark I saw in the comment threads was about the lack of benchmarks in my previous blog. Here they are.

Please note that “benchmark” means “comparison”; it does not necessarily imply measuring the speed of executing something.

How I did it

I started by trying to find an executable that I could run non-interactively, that performed a relatively CPU-intensive activity and quit. That executable had to be in my standard set of built executables, as I didn’t want to recompile the entire system. I settled on KDE’s kbuildsycoca4 with the options --noincremental --nosignal: it looks for all *.desktop files in the search paths and compiles a database for faster lookup, called the SYstem COnfiguration CAche. The options tell it to ignore existing databases and rebuild everything, and to avoid signalling running applications over D-Bus to reload their settings.

The tests were run on my laptop, an Intel Core i7-2620M clocked at 2.6 GHz, with an SSD but no tmpfs temporary dir, with 2×32 kB of L1 cache, 256 kB of L2 cache, 4 MB of L3 cache and 4 GB of main RAM. I locked the CPU scaling governor to “performance” so the CPU was running at 2.6 GHz when each test started; it soon went into turbo mode (3.2 GHz) and stayed there. The system was not completely idle while running the tests, but relatively so. To try to avoid other problems, the native benchmarks were run under the FIFO real-time scheduler, with affinity to a single processor. The tests were run in 64-bit mode and were run “warm”: I ran the benchmark once after any recompilation and discarded the results.

I did four sets of tests, as follows:

  1. The first, the baseline, was a regular build on my system, with no change to the default KDE 4 build options or to Qt 4.8’s.
  2. The second was modified by adding -Bsymbolic-functions to the five KDE libraries and six Qt libraries used by the program.
  3. The third was modified by replacing -Bsymbolic-functions with -Bsymbolic and recompiling the same 11 libraries.
  4. Finally, on the fourth, in addition to keeping -Bsymbolic, I made all symbols exported from those 11 libraries have protected visibility. This required surprisingly few modifications, as the libraries were more or less ready to be built on Windows too. Each library already has an XXXX_EXPORT macro associated with it because of the “hidden” visibility support, which currently expands to __attribute__((visibility("default"))). Moreover, the buildsystem for those libraries already defines a specific macro only during their own builds. So it was easy to arrange that, when that buildsystem macro is defined, the XXXX_EXPORT macro expands to __attribute__((visibility("protected"))) instead, and otherwise remains unchanged (see the sketch after this list).
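
Here is a sketch of that macro arrangement (MYLIB_EXPORT and BUILDING_MYLIB stand in for each library’s real macro names, which vary):

/* mylib_export.h: illustrative reconstruction of the change in item 4 */
#ifdef BUILDING_MYLIB
/* during the library's own build, exported symbols get protected
   visibility, so intra-library references can bind directly */
#  define MYLIB_EXPORT __attribute__((visibility("protected")))
#else
/* users of the library see no change */
#  define MYLIB_EXPORT __attribute__((visibility("default")))
#endif

/* combined with -fvisibility=hidden on the command line, anything
   not marked MYLIB_EXPORT stays internal to the library */
MYLIB_EXPORT void myPublicFunction(void);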

Each set of tests consisted of:

  • Run Ulrich Drepper’s relinfo script on the 11 libraries and tally up the types of relocations
  • Run Valgrind’s cachegrind tool with branch prediction simulation enabled and the cache sizes set to match my machine
  • Run the perf stat tool to gather hardware counters. Each run of the tool reported the average of 10 runs of kbuildsycoca4, all under the FIFO real-time scheduler. After the first warm-up run, I chose the best of 3 runs in quick succession

You can download the raw results I collected from here (they also include results with LD_BIND_NOW=1).

Results

First of all, I went into these benchmarks fully expecting that nothing would be visible in the performance numbers. These are clearly micro-optimisations, so in a fairly large program they should be drowned out by inefficiencies in other parts. Also, considering that my system wasn’t completely idle when running the CPU benchmarks, the numbers carry a degree of noise that could hide faint results. The results did, however, show a few clear improvements.

Here’s what I found:

  • Relocations: relocations are work that the dynamic linker must do either at load-time (non-PLT relocations) or during run-time (PLT). Reducing or simplifying relocations improves start-up and run-time performance.
    • The number of non-PLT relocations drops by 2.65%, and only with protected visibility: that was expected, because the linker options affect only the PLT; changing the non-PLT relocation count required a change to the compilation.
    • The number of relative relocations doubles with the linker options: that was also expected, because the linker can bind the relocation to the symbol inside the library being linked. Instead of referring to the symbol by name and triggering a full look-up, a relative relocation simply records how many bytes past a fixed mark (the load address) the target lies, which is much simpler to process. The number increases again with -Bsymbolic compared to -Bsymbolic-functions because the linker can bind non-functions too. The number dropped with protected visibility, but by less than the total number of relocations removed.
    • The number of PLT entries is one-third of the original because the linker can make intra-library function calls directly instead of going through the PLT stub. Each PLT entry corresponds to 8 bytes in the .got.plt section and 16 bytes of stub, which means this reduction saved as many as 15571 relocations and as much as 373 kB of memory. This is confirmed by the count of PLT entries used for local symbols, which drops to nearly zero. The number isn’t exactly zero because QtCore and QtGui have been prepared for 5 of their symbols to be interposable when built with -Bsymbolic-functions, a concern I didn’t carry over into the protected-visibility work because it wasn’t relevant there.
        Note that there must have been an error with the -Bsymbolic builds, because two libraries had a higher PLT count than they should. I have not investigated whether this was a mistake on my part or a bug in the linker.
  • Valgrind results: valgrind executes the program on a simulated CPU, which on the one hand means we get consistent results independent of which CPU I run it on and how idle or busy my system is, but on the other hand may or may not reflect reality (YMMV).
    • The instruction count decreases slightly: by 0.9%, 1.1% and 1.2% for the three modified builds
    • Accesses to the L1 data cache decrease slightly: by 1.4%, 1.6% and 2.1%
    • Last-level cache references decrease by 7% while the LL cache miss rate remains constant, probably because there are fewer instructions executed, fewer data accesses and a slight improvement in the L1D miss rate
    • The number of indirect branches executed drops by 22%
    • The indirect branch misprediction rate drops considerably, from 22% in the original to 16% with just the linker options and 8.8% with protected visibility, while the overall branch misprediction rate drops from 4.7% to 4.3% and then to 4.1%. With 2.9 million fewer mispredicted branches at a 20-cycle misprediction penalty, that’s roughly 57 million cycles saved.
  • Perf results: perf uses the CPU’s hardware counters to do its bidding, but it is subject to scheduling issues. The kbuildsycoca4 program does context-switch during its execution, because it checks with the D-Bus daemon whether another instance is already running. Moreover, this program is I/O-intensive, meaning it makes a lot of system calls, which is why I let the benchmarks run with a “warm” system cache. Unlike the Valgrind results, the numbers from perf contain a great deal of noise and error because they come from an actual CPU.
    • There’s a roughly 3% overall performance improvement as measured by execution time. The noise in the numbers doesn’t show which solution is best, but it does show that all three are better than the unmodified library code.
    • There’s a 3 to 4% improvement in the number of cycles required to complete the operation. Unfortunately, the numbers show performance decreasing as I optimise more, which is counter-intuitive and which I cannot explain (noise or a real mis-optimisation). I think my machine was slightly less idle during the last test set, as the last results I got showed much worse performance with a much bigger standard deviation.
    • There’s a roughly 3% improvement in the number of instructions executed, which is similar to the reduction in cycles, but it also shows that more instructions are executed per cycle with the optimisations. I cannot say exactly why, but I imagine it’s due to the reductions in branching, branch misprediction and cache misses. The calculation of instructions per cycle shows an improvement of close to 1% in two of the three benchmarks.
    • The number of branches executed drops by 4 to 5%, but the reduction is in the opposite order of the number of branches I know are in the code, which means there was a considerable amount of noise in this test. Another similar metric shows a roughly 5% improvement in branch loads.
    • The rates of cache misses and branch mispredictions remain more or less constant, which, coupled with the reduced number of branches, means we have a performance improvement due to fewer absolute mispredictions. I cannot conclude anything about a reduction in cache references because the numbers varied too much.
        This is supported by the calculation of cycles gained from the reduction in branch misprediction. The Sandy Bridge architecture has a 20-cycle penalty for branch misprediction, so if we calculate how many cycles were lost in each benchmark due to mispredicted branches and subtract that from the original, we get roughly 6 million cycles gained (0.24% of the total, which puts the total at around 2.5 billion cycles), which is in the same order as the improvement in instruction throughput (instructions per cycle).

Conclusions

The numbers are fairly small, as expected, since we’re talking about micro-optimisations. However, three distinct measurements (execution time, cycle count and instruction count) have shown, with a reasonable degree of confidence given the limited sample size I had, that there’s a performance improvement on the order of 3%. That’s more or less what I hoped to see, but much more than I expected to be able to show.

Another important aspect is that this was a non-GUI testcase, even though both the QtGui and kdeui libraries were present by virtue of library dependencies. Note that the two libraries have, together, 45824 relocations and 14708 PLT entries in the original library set, which correspond to 73.3% of the total relocations and 62.4% of the total PLT entries in play, as well as 65% of the PLT entries for local symbols. The number of relocations is also indicative of the amount of code in those libraries. But since the application isn’t a GUI one, that code is mostly not executed.

If we consider that the problem of cache misses grows with code size (and the cache miss rate could increase too, compounding the effect), and that cycles lost to mispredicted branches grow with the number of branches unless the misprediction ratio drops (which the benchmarks have shown to remain stable), we can expect that a GUI application would gain even more from these improvements. That’s difficult to prove in a GUI application, however, so we’ll have to settle for the theoretical exercise.

In all, I still think this is warranted. The drawbacks are fairly minor: interposition of symbols is rarely used as it is, and interposition of symbols in intra-library lookups is close to non-existent in libraries that weren’t designed for it. All we need to do now is change the status quo, which is probably the hardest part.

Who will support me?

5 comments

  1. Allan Jensen

    Have you tried the effect of adding LTO information to libraries with GCC 4.6+?

    If you compile and link all the files with -flto, it will leave behind information GCC can use to optimize when linking. By adding -fwhole-program it will even remove all exports and optimize accordingly (but that assumes you are making a program, not a library).

  2. Thiago Macieira

    @Allan: I have tested LTO and I have it on by default in my Qt builds. I have even tested the equivalent option in the Intel compiler (called “IPO”), but I don’t think LTO is relevant here. The point was that you don’t need LTO enabled to create libraries where certain exported symbols cannot be properly interposed.

  3. Oleg

    I’m sorry, but I can’t understand what your proposal is.

    If it is “-fvisibility=hidden by default and -fvisibility=protected for exported symbols,” then I’m all for it! If it is something else, please explain :-)

  4. Allan Jensen

    @Thiago Of course; my suggestion was only to see whether it could work as a compromise: keeping the features while still improving link-time optimization. I am really unsure if, or how well, LTO even works on shared libraries.

    I don’t personally enable LTO on Qt because it takes forever to compile, and uses an obscene amount of memory.

  5. Thiago Macieira

    I’m proposing that we:

    • compile all libraries with -fvisibility=hidden
    • each library exports its own public API with __attribute__((visibility("protected")))
    • the intra-library references to the public API be done in the most efficient way (they currently are, but there's a GCC bug fix that will change that)
    • it works

    The last item requires that symbols exported under protected visibility do not suffer copy relocations or the equivalent moving of the function’s official address. You can accomplish that by compiling the executable with -fPIE, but it could also be done differently if the psABI were changed.
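
    As a sketch of that last point (my example, reusing the illustrative fooVariable from earlier): a position-independent executable accesses external data through the GOT, so no copy relocation is needed and the library’s definition keeps its single official address.

    /* main.c: illustrative.
     * Build: gcc -fPIE -pie -o app main.c -L. -lfoo
     * With -fPIE the access to fooVariable goes through the GOT
     * instead of requiring an R_X86_64_COPY relocation, so a
     * protected-visibility definition in libfoo.so keeps one
     * official address. */
    extern int fooVariable;   /* defined in libfoo.so */
    int main(void) { return fooVariable; }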

Comments have been disabled.
