<?xml version="1.0" encoding="UTF-8"?> <rss
version="2.0"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:wfw="http://wellformedweb.org/CommentAPI/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:atom="http://www.w3.org/2005/Atom"
xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
> <channel><title>Thiago Macieira&#039;s blog &#187; MeeGo</title> <atom:link href="http://www.macieira.org/blog/category/meego/feed/" rel="self" type="application/rss+xml" /><link>http://www.macieira.org/blog</link> <description>An Open Source hacker&#039;s ramblings</description> <lastBuildDate>Thu, 18 Apr 2013 15:34:17 +0000</lastBuildDate> <language>en</language> <sy:updatePeriod>hourly</sy:updatePeriod> <sy:updateFrequency>1</sy:updateFrequency> <generator>http://wordpress.org/?v=3.3.1</generator> <item><title>Update and benchmark on the dynamic library proposals</title><link>http://www.macieira.org/blog/2012/01/update-and-benchmark-on-the-dynamic-library-proposals/</link> <comments>http://www.macieira.org/blog/2012/01/update-and-benchmark-on-the-dynamic-library-proposals/#comments</comments> <pubDate>Thu, 19 Jan 2012 15:17:39 +0000</pubDate> <dc:creator>Thiago Macieira</dc:creator> <category><![CDATA[KDE]]></category> <category><![CDATA[Linux]]></category> <category><![CDATA[MeeGo]]></category> <category><![CDATA[Qt]]></category> <category><![CDATA[linux]]></category> <category><![CDATA[low-level]]></category> <category><![CDATA[optimisation]]></category> <guid
isPermaLink="false">http://www.macieira.org/blog/?p=302</guid> <description><![CDATA[My last blog on the dynamic libraries on Linux attracted over 15000 visits, which was quite unexpected (it&#8217;s 15x more than the usual traffic). It got linked from reddit and ycombinator and comments there and in the previous post have raised some interesting questions I&#8217;ll try to answer. LD_PRELOAD First, a quick background: LD_PRELOAD and &#8230;</p><p><a
class="more-link block-button" href="http://www.macieira.org/blog/2012/01/update-and-benchmark-on-the-dynamic-library-proposals/">Continue reading &#187;</a>]]></description> <content:encoded><![CDATA[<p>My last blog <a
href="https://www.macieira.org/blog/2012/01/sorry-state-of-dynamic-libraries-on-linux/">on the dynamic libraries on Linux</a> attracted over 15000 visits, which was quite unexpected (it&#8217;s 15x more than the usual traffic). It got linked from <a
href="http://www.reddit.com/r/programming/comments/ojm5z/sorry_state_of_dynamic_libraries_on_linux_thiago/">reddit</a> and <a
href="http://news.ycombinator.com/item?id=3472142">ycombinator</a> and comments there and in the previous post have raised some interesting questions I&#8217;ll try to answer.</p><h1>LD_PRELOAD</h1><p>First, a quick background: LD_PRELOAD and /etc/ld.so.preload tell the dynamic linker to load a certain ELF module before the normal initialisation sequence. It&#8217;s preloaded before the rest of the modules, but after two important modules have been loaded: the executable itself and the dynamic linker. By itself, it means nothing at all about symbol hijacking. Its sole purpose is to load something. I have, for example, used it for loading a different binary of a library that a program required. That works fine.</p><h2>Yes, it is little-known and little-used</h2><p>If you complained that I said it&#8217;s little-known, you&#8217;re somewhat biased. If you complained, it&#8217;s because you knew about it, therefore you&#8217;re part of the minority that knows about it. Just think about it: there are millions of people directly using Linux today in the world. How many do you think know about this feature?</p><p>Even more so, think about how often:</p><ul><li>LD_PRELOAD is used compared to running applications without it</li><li>LD_PRELOAD is used to load an ELF module compared to how many ELF modules are loaded by regular means</li><li>functions are interposed using LD_PRELOAD compared to how many aren&#8217;t</li></ul><p>The ratio is at least 1:1000 for a heavy user of the feature (like me!) in the best of circumstances. It&#8217;s probably several orders of magnitude more than that for the average user. Something that is used in one case in a million qualifies as little-used to me.</p><h2>No, I wasn&#8217;t proposing to get rid of it (not entirely)</h2><p>Some people suggested I was thinking of getting rid of the preloading feature in exchange for a few cycles saved. 
I would still be within my rights to suggest that, given the improvements and how often it is used, but I wasn&#8217;t. I&#8217;ve never proposed getting rid of the preloading feature and my proposal would not harm the most often used cases of interposition.</p><p>This requires a bit more explanation, so bear with me please.</p><p>Symbol interposition works by adding a symbol to the symbol table before the &#8220;rightful&#8221; symbol appears. The dynamic linker will resolve the symbol to the first occurrence it finds in the search order, so if you preload a library out of its order, its symbols will have higher priority than they would otherwise. The extreme case is when you preload a library or module that wouldn&#8217;t otherwise be loaded. But remember something I said before: preloaded modules are loaded after two others are loaded, so they don&#8217;t get the chance to interpose symbols defined by those.</p><p>If the executable performed a copy relocation on a data symbol, then LD_PRELOAD&#8217;ed modules cannot interpose it. For that reason, I am not counting interposition of data symbols as valid. In fact, in the 14 years I&#8217;ve been hacking on Linux, I&#8217;ve never done that, so I guess the chances of that happening are a billion to one or even lower. What&#8217;s more, my proposal would do away with copy relocations, which may make data interposition a valid case.</p><p>The next important thing you must understand is that my proposal would do away with interposition of <strong><em>intra-library</em></strong> symbols, but not <strong><em>inter-library</em></strong> ones. My friend Michael Meeks&#8217;s <a
href="http://lwn.net/Articles/192624/">proposal of -Bdirect linking</a> might, but even that proposal wouldn&#8217;t totally do away with it.</p><p>What do I mean by this? <em>Intra-library</em> means &#8220;within the same library,&#8221; while <em>inter-library</em> means &#8220;across libraries&#8221; (think of &#8220;Internet&#8221; vs &#8220;intranet&#8221;). My proposal was intended to improve binding of symbols inside one library because we can gain performance doing that without losing the <a
href="http://en.wikipedia.org/wiki/Position-independent_code">Position-independent code</a> and the advantages that come with it (like <a
href="http://en.wikipedia.org/wiki/Address_space_layout_randomization">Address space layout randomisation</a>). Specifically because we don&#8217;t want to lose the PIC support and we don&#8217;t want to go back to pre-ELF days and their problems (see Ulrich Drepper&#8217;s <a
href="http://www.akkadia.org/drepper/dsohowto.pdf">paper</a> for some information on it), all <em>inter-library</em> symbol resolution would remain as-is, via PLTs and GOTs, including the ability to interpose symbols.</p><p>And here&#8217;s why I think we&#8217;re entitled to do that: you cannot interpose <em>intra-library</em> symbols anyway unless the library has been <strong>specifically designed to allow it</strong>, like glibc is. Let&#8217;s take the code from the last blog:</p><pre class="brush: cpp; title: ; notranslate">
extern void *externalVariable;
extern void externalFunction(void);
 
void myFunction()
{
    externalFunction();
    externalVariable = &amp;externalFunction;
}
</pre><p>And amend it like so:</p><pre class="brush: cpp; first-line: 9; title: ; notranslate">
void externalFunction(void)
{
}
</pre><p>If we compile this code with optimisation (GCC&#8217;s -O is enough) and inspect the assembly output, we can notice that both functions are present in the output but that <tt>myFunction</tt> does not call <tt>externalFunction</tt>. In other words, the compiler inlined one function into the other, even if the <tt>inline</tt> keyword was never added to it, and that expanded to zero code. With advances such as <a
href="http://en.wikipedia.org/wiki/Link-time_optimization">link-time optimisation</a>, even moving the function to another compilation unit might not be enough to prevent the inlining.</p><p>That&#8217;s why I said that to support the case of <em>intra-library</em> symbol interposition, the library must be specifically designed to allow it, which is definitely still possible under my proposal. Most libraries aren&#8217;t designed like that and will never be, so I am confident that optimising for the greater majority of the libraries instead of the few is warranted (taking my system: I counted 3623 distinct libraries and plugins and I&#8217;m pretty sure none except libc and libpthread allow for interposition, so it&#8217;s probably a 1000:1 case again).</p><h1>Benchmarks</h1><p>Another important remark I saw in the comment threads was about the lack of benchmarks in my previous blog. Here they are.</p><p>Please note that &#8220;benchmark&#8221; means &#8220;comparison.&#8221; It does not imply &#8220;speed executing something.&#8221;</p><h2>How I did it</h2><p>I started by trying to find an executable I could run non-interactively, that executed a relatively CPU-intense activity and quit. That executable should be in my standard set of built executables, as I didn&#8217;t want to recompile the entire system. I settled on KDE&#8217;s <tt>kbuildsycoca4</tt> with the options <tt>--noincremental --nosignal</tt>: it looks for all *.desktop files in the search paths and compiles a database for faster lookup, called the SYstem COnfiguration CAche. The options tell it to ignore existing databases and do it all, plus avoid signalling running applications over D-Bus to reload their settings.</p><p>The tests were run on my laptop, which is an Intel Core-i7-2620M, clocked at 2.6 GHz, with an SSD but no tmpfs temporary dir, with 2x32kB of L1 cache, 256 kB of L2 cache, 4MB of L3 cache and 4 GB of main RAM. 
I locked the CPU scaling governor to &#8220;performance&#8221; so that the CPU was running at 2.6 GHz when the test started; it soon went into turbo mode (3.2 GHz) and stayed there. The system was not <strong>completely</strong> idle while running the test, but relatively so. To try and avoid other problems, the native benchmarks were run under the FIFO real-time scheduler, with affinity set to a single processor. The tests were run in 64-bit mode and were run &#8220;warm&#8221;: I ran the benchmark first after any recompilation and discarded the results.</p><p>I did four sets of tests, as follows:</p><ol><li>The first, the baseline, was a regular build on my system, with no change to default KDE 4 build options or to Qt 4.8&#8217;s.</li><li>The second was modified by adding <tt>-Bsymbolic-functions</tt> to the five KDE libraries and six Qt libraries used by the program</li><li>The third was modified by replacing <tt>-Bsymbolic-functions</tt> with <tt>-Bsymbolic</tt> and recompiling the same 11 libraries</li><li>Finally, on the fourth, in addition to keeping <tt>-Bsymbolic</tt>, I made all symbols exported from those 11 libraries have protected visibility. This required surprisingly few modifications to them, as they were more-or-less ready to be built on Windows too. Each library already has a <tt>XXXX_EXPORT</tt> macro associated because of the &#8220;hidden&#8221; visibility support, which right now expands to <tt>__attribute__((visibility("default")))</tt>. Moreover, the buildsystem for those libraries already defines a specific macro only during their builds. So it was easy to ensure that, when that buildsystem macro is defined (<tt>#ifdef</tt>), the <tt>XXXX_EXPORT</tt> macro expands to <tt>__attribute__((visibility("protected")))</tt> instead, and otherwise remains unchanged.</li></ol><p>Each set of tests consisted of:</p><ul><li>Run Ulrich Drepper&#8217;s <a
href="/blog/wp-content/uploads/2012/01/relinfo.txt"><tt>relinfo</tt></a> script on the 11 libraries and tally up the types of relocations</li><li>Run <a
href="http://valgrind.org/">Valgrind&#8217;s</a> <a
href="http://valgrind.org/docs/manual/cg-manual.html">cachegrind tool</a> with branch-prediction and the cache sizes set to match my machine</li><li>Run the <a
href="https://perf.wiki.kernel.org/index.html">perf</a> <a
href="https://perf.wiki.kernel.org/articles/t/u/t/Tutorial.html#Counting_with_perf_stat">stat</a> tool to gather hardware counters. Each run of the tool reported the average of 10 runs of kbuildsycoca4, all run under the FIFO real-time scheduler. After the first warm-up run, I chose the best of 3 runs in quick succession</li></ul><p>You can download the raw results I collected from <a
href="/blog/wp-content/uploads/2012/01/benchmarking-abi.txt">here</a> (that also includes results with LD_BIND_NOW=1).</p><h2>Results</h2><p>First of all, I went into these benchmarks fully expecting that nothing would be visible in the performance benchmarks. It&#8217;s clear that these are micro-optimisations, so in a fairly large program they should be drowned out by inefficiencies in other parts. Also, considering that my system wasn&#8217;t completely idle when running the CPU benchmarks, the numbers have a degree of noise which could hide the faint results. The results have, however, shown a few clear improvements.</p><p>Here&#8217;s what I found:</p><ul><li><strong>Relocations</strong>: relocations are work that the dynamic linker must do either at load-time (non-PLT relocations) or during run-time (PLT). Reducing or simplifying relocations improves start-up and run-time performance.<ul><li><strong>The number of non-PLT relocations drops by 2.65% with protected visibility</strong>: that was expected because the linker options affect only the PLT. To change the non-PLT relocation count, a change to the compilation was necessary.</li><li><strong>The number of relative relocations doubles with the linker options</strong>: that was also expected, because the linker can bind the relocation to the symbol that is inside the library being linked. Instead of referring to the symbol by its name and triggering a full look-up, a relative relocation simply records how many bytes past a fixed mark (the load address) the relocation should be, which is much simpler to execute. The number increases again with <tt>-Bsymbolic</tt> compared to <tt>-Bsymbolic-functions</tt> because the linker can bind non-functions too. 
The number dropped with protected visibility, but by less than the total number of relocations removed.</li><li><strong>The number of PLT entries is one-third of the original</strong> because the linker can make <em>intra-library</em> function calls directly instead of going through the PLT stub. Each PLT entry corresponds to 8 bytes in the <tt>.got.plt</tt> section and 16 bytes of stub, which means this reduction saved as many as 15571 relocations and as much as 373 kB of memory. This is confirmed by the count of PLT entries used for local symbols, which drops to nearly zero. The number isn&#8217;t exactly zero because both QtCore and QtGui <strong>have</strong> been prepared for 5 of their symbols to be interposed when built with <tt>-Bsymbolic-functions</tt>, a concern I didn&#8217;t take into account in the protected visibility work because it wasn&#8217;t relevant.<ul>Note that there must have been an error with the <tt>-Bsymbolic</tt> builds because two libraries had a higher PLT count than they should have. 
I have not investigated whether this was a mistake on my part or a bug in the linker.</ul></li></ul></li><li><strong>Valgrind results</strong>: Valgrind executes the program in a simulated CPU, which on one hand means we get consistent results independent of which CPU I run this on and how idle or busy my system was, but on the other hand may or may not reflect reality (YMMV).<ul><li><strong>Instruction count decreases slightly</strong> by 0.9%, 1.1% and 1.2% across the three modified builds</li><li><strong>Data accesses to L1 data cache decrease slightly</strong> by 1.4%, 1.6% and 2.1%</li><li><strong>Last-level cache references decrease by 7%</strong> while the LL cache miss rate remains constant, probably because there are fewer instructions executed, fewer data accesses and a slight improvement in L1D miss rate</li><li><strong>Number of indirect branches executed drops by 22%</strong></li><li><strong>The indirect branch misprediction rate drops considerably</strong> from 22% in the original to 16% with just the linker options and 8.8% with the protected visibility, while the overall branch misprediction rate drops from 4.7% to 4.3% and then to 4.1%. With 2.9 million fewer mispredicted branches, at a 20-cycle misprediction penalty, that&#8217;s 57 million cycles saved.</li></ul></li><li><strong>Perf results</strong>: perf uses hardware counters from the CPU to do its job, but it is subject to scheduling issues. The kbuildsycoca4 program does context-switch in its execution because it tries to verify with the D-Bus daemon whether another instance is already running. Moreover, this program is I/O intensive, meaning it makes a lot of system calls, which is why I let the benchmarks run with a &#8220;warm&#8221; system cache. Unlike the Valgrind results, there&#8217;s a great deal of noise and error in the numbers from perf because they represent an actual CPU.<ul><li><strong>There&#8217;s a roughly 3% overall performance improvement</strong> as measured by the execution time. 
The noise in the number doesn&#8217;t show which solution is best, but it shows that all three are better than the unmodified library code.</li><li><strong>There&#8217;s a 3 to 4% improvement in number of cycles</strong> required to complete the operation. Unfortunately, the numbers are showing performance decreasing as I optimise more, which is counter-intuitive and I cannot explain (noise or real mis-optimisation). I think my machine was slightly less idle on the last test set, as the last results I got showed a much worse performance with a much bigger standard deviation.</li><li><strong>There&#8217;s roughly 3% improvement in the number of instructions executed</strong>, which is similar to the reduction in cycles, but also shows that more instructions are executed per cycle with the optimisations. I cannot say why exactly it is, but I imagine it&#8217;s because of reductions in branching, branch misprediction and cache misses. The calculation of instructions per cycle shows improvement in two of the three benchmarks by close to 1%.</li><li><strong>Branches executed reduce by 4 to 5%</strong> but the reduction is in the opposite order of the number of branches I know are in the code, which means there was a considerable amount of noise in this test. Another similar metric shows a roughly 5% improvement in branch loads.</li><li><strong>The rates of cache misses and branch mispredictions remain more or less constant</strong>, which coupled with the number of branches reducing means we have an improvement in performance due to fewer absolute mispredictions happening. I cannot conclude anything about a reduction in cache references because the numbers varied too much.<ul>This is supported by the calculation of cycles gained in the reduction of branch misprediction. 
The Sandy Bridge architecture has a 20-cycle penalty for branch misprediction, so if we calculate how many cycles were lost in each benchmark due to mispredicted branches and subtract from the original, we get roughly 6 million cycles gained (0.24% of the total), which is in the same order as the improvement in instruction throughput (instructions per cycle).</ul></li></ul></li></ul><h1>Conclusions</h1><p>The numbers are fairly small, as was expected, since we&#8217;re talking about micro-optimisations. However, three distinct benchmarks have shown with a reasonable degree of confidence that there&#8217;s a performance improvement in the order of 3% (execution time, cycle count and instruction count), which seems reasonable to me given the limited sample size I had. That&#8217;s more or less what I hoped to see, but much more than I expected to be able to show.</p><p>Another important aspect is that this was a non-GUI testcase, even though by virtue of library dependencies, both QtGui and kdeui libraries were present. Note how the two libraries have, together, 45824 relocations and 14708 PLT entries in the original library set, which correspond to 73.3% and 62.4% of the respective totals in play, as well as 65% of the PLT entries for local symbols. The number of relocations is indicative also of the size of the code in those libraries. But since the application isn&#8217;t a GUI one, that code is mostly not executed.</p><p>If we consider that the problem of cache misses increases with code size (and the cache miss rate could increase too, compounding the effect) and that of cycles lost due to mispredicted branches increases with the number of branches unless the misprediction ratio drops (which the benchmarks have shown to remain stable), we can expect that a GUI application could gain even more in performance due to these improvements. 
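</p><p>The misprediction arithmetic used in the results above can be sketched in a few lines of C. This is only an illustrative sketch, not code from the benchmark: the function name <tt>cyclesLostToMispredictions</tt> and its parameters are hypothetical, and it assumes a fixed penalty per mispredicted branch (the 20 cycles quoted above).</p><pre class="brush: cpp; title: ; notranslate">
/* Hypothetical helper (an assumption, not part of the original benchmark):
 * estimates the cycles lost to branch mispredictions as
 * mispredicted branches x fixed penalty in cycles per misprediction. */
unsigned long long cyclesLostToMispredictions(unsigned long long mispredictedBranches,
                                              unsigned int penaltyCycles)
{
    return mispredictedBranches * penaltyCycles;
}
</pre><p>With the rounded figures from the Valgrind results, <tt>cyclesLostToMispredictions(2900000, 20)</tt> gives 58 million cycles, in line with the 57 million quoted there, which was presumably computed from the unrounded counts. The same product, applied to the larger number of branches a GUI application would execute, is what suggests a bigger gain there. 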
That&#8217;s difficult to prove in a GUI application, however, so we&#8217;ll have to stay with just the theoretical exercise.</p><p>In all, I still think this is warranted. The drawbacks are fairly minor: the interposition of symbols is already rarely used, and interposition of symbols in <em>intra-library</em> lookups is close to non-existent in libraries that aren&#8217;t designed for it. All we need to do now is change the status quo, which is probably the hardest part.</p><p>Who will support me?</p><div
style="clear:both"></div>]]></content:encoded> <wfw:commentRss>http://www.macieira.org/blog/2012/01/update-and-benchmark-on-the-dynamic-library-proposals/feed/</wfw:commentRss> <slash:comments>5</slash:comments> </item> <item><title>Sorry state of dynamic libraries on Linux</title><link>http://www.macieira.org/blog/2012/01/sorry-state-of-dynamic-libraries-on-linux/</link> <comments>http://www.macieira.org/blog/2012/01/sorry-state-of-dynamic-libraries-on-linux/#comments</comments> <pubDate>Mon, 16 Jan 2012 15:12:14 +0000</pubDate> <dc:creator>Thiago Macieira</dc:creator> <category><![CDATA[Assembly]]></category> <category><![CDATA[C++]]></category> <category><![CDATA[KDE]]></category> <category><![CDATA[Linux]]></category> <category><![CDATA[MeeGo]]></category> <category><![CDATA[Qt]]></category> <category><![CDATA[abi]]></category> <category><![CDATA[assembly]]></category> <category><![CDATA[elf]]></category> <category><![CDATA[linux]]></category> <category><![CDATA[low-level]]></category> <category><![CDATA[optimisation]]></category> <guid
isPermaLink="false">http://www.macieira.org/blog/?p=287</guid> <description><![CDATA[Last week, we identified a bug in Qt with Olivier&#8217;s new signal-slot syntax. Upon further investigation, it turns out it&#8217;s not a Qt issue, but an ABI one. Which prompted me to investigate more and decide that dynamic libraries need a big overhaul on Linux. tl;dr (a.k.a. Executive Summary) Shared libraries on Linux are linked &#8230;</p><p><a
class="more-link block-button" href="http://www.macieira.org/blog/2012/01/sorry-state-of-dynamic-libraries-on-linux/">Continue reading &#187;</a>]]></description> <content:encoded><![CDATA[<p>Last week, we identified a bug in Qt with <a
href="http://woboq.com">Olivier</a>&#8217;s <a
href="http://developer.qt.nokia.com/wiki/New_Signal_Slot_Syntax">new signal-slot syntax</a>. Upon further investigation, it turns out it&#8217;s not a Qt issue, but an ABI one. Which prompted me to investigate more and decide that dynamic libraries need a big overhaul on Linux.</p><h1>tl;dr (a.k.a. Executive Summary)</h1><p>Shared libraries on Linux are linked with <tt><a
href="http://en.wikipedia.org/wiki/Position-independent_code">-fPIC</a></tt>, which makes all variable references and function calls indirect, unless they are <tt>static</tt>. That&#8217;s because in addition to making it position-independent, it makes every variable and function <strong>interposable</strong> by another module: it can be overridden by the executable and by <tt>LD_PRELOAD</tt> libraries. The indirectness of accesses is a performance impact and we should do away with it, without sacrificing position-independence.</p><p>Plus, there are a few more actions we should take (like prelinking) to improve performance even further.</p><p>Jump to <a
href="#existing_solutions">existing</a> or <a
href="#proposed_solutions">proposed</a> solutions, <a
href="https://plus.google.com/108138837678270193032/posts/No8T7VLoF33">Google+ discussion</a>.</p><h1>Details</h1><p>Note: in the following, I will show x86-64 64-bit assembly and will restrict myself to that architecture. However, the problems and solutions also apply to many other architectures, like x86 and ARM, which should make you consider what I say. The only platform that this mostly does not apply to is actually IA-64.</p><h2>The basics</h2><p>Imagine the following C file, which also compiles in C++ mode:</p><pre class="brush: cpp; title: ; notranslate">
extern void *externalVariable;
extern void externalFunction(void);
 
void myFunction()
{
    externalFunction();
    externalVariable = &amp;externalFunction;
}
</pre><p>The code above demonstrates three features of the languages in one function: it loads the address of a function, it calls a function and it writes to a variable. The compiler does not know where the function and variable are: they might be in another .o file linked into this ELF module or they might be in another ELF module (i.e., a library) this module links to.</p><p>The compiler produces the following assembly output (gcc 4.6.0, -O3):</p><div
class="wp-geshi-highlight-wrap5"><div
class="wp-geshi-highlight-wrap4"><div
class="wp-geshi-highlight-wrap3"><div
class="wp-geshi-highlight-wrap2"><div
class="wp-geshi-highlight-wrap"><div
class="wp-geshi-highlight"><div
class="asm"><pre class="de1">        <span class="kw1">call</span>    externalFunction
        movq    $externalFunction<span class="sy0">,</span> externalVariable<span class="br0">&#40;</span><span class="sy0">%</span>rip<span class="br0">&#41;</span></pre></div></div></div></div></div></div></div><p>This assembly snippet is making use of two symbols whose values the assembler does not know. When assembled, it produces a .o with three relocations. GCC has produced the most efficient and most compact compilation of the code I wrote.</p><p>When we link this .o into an executable, we start to see the drawbacks. The first is that both instructions need to encode, in their bits, the values of symbols that we didn&#8217;t know. So the linker must somehow fix this. It fixes the <tt>call</tt> instruction by making it call a stub or a trampoline, which jumps to the actual address. This stub is placed in a separate section of code called the Procedure Linkage Table (PLT). The contents of the PLT stub are not that important; suffice it to say that it is an indirect jump.</p><p>The <tt>movq</tt> instruction cannot be fixed. There&#8217;s simply no way, because it writes a constant value to a constant location, directly. Even if we allowed for an instruction, or a pair of instructions, wide enough to write any 64-bit value to any variable in the 64-bit space, we would still have a problem: those values are not known at link time. So instead of fixing the instruction, the linker &#8220;fixes&#8221; the values. For the address of <tt>externalFunction</tt>, it uses the address of the PLT stub it created, as described in the previous paragraph. For the <tt>externalVariable</tt> variable, it will create a <a
href="http://docs.oracle.com/cd/E19082-01/819-0690/chapter4-84604/index.html">copy relocation</a>, which means the dynamic linker will need to find the variable where it is, <strong>copy</strong> its value to a fixed location in the executable and then tell everyone that the variable is actually in the executable.</p><p>What are the consequences of this? For the PLT call, it&#8217;s a simple performance impact which could not be avoided. Since the address of the actual <tt>externalFunction</tt> function is not known at compile and link-time, and we don&#8217;t want to leave a <a
href="http://www.akkadia.org/drepper/textrelocs.html">text relocation</a>, the only way to place that call is to find the address at run-time and call it indirectly.</p><p>For the copy relocation, the consequences for the executable are small. The code it will execute is still the most efficient and most compact. The dynamic linker will have to find where the symbol actually is at load-time, which is something that it would have to do anyway, plus copy its contents, checking that the size hasn&#8217;t changed. This is done only once, then the code runs in its most efficient form.</p><p>The fact that we resolved <tt>&#038;externalFunction</tt> to the address of the PLT stub means that any use of that function pointer (an indirect call) will end up in a function that does an indirect call too. That is, it&#8217;s a <strong>doubly-indirect</strong> call. I seriously doubt any processor can do proper branch prediction, speculative execution, and prefetching of code under those circumstances.</p><h2>It gets worse</h2><p>So far we&#8217;ve analysed what happens in an executable. Now let&#8217;s see what happens when we try to build the same C code for a shared library. We do that by introducing the <tt>-fPIC</tt> compiler option, which tells the compiler to generate position-independent code. The compiler produces the following assembly output:</p><div
class="wp-geshi-highlight-wrap5"><div
class="wp-geshi-highlight-wrap4"><div
class="wp-geshi-highlight-wrap3"><div
class="wp-geshi-highlight-wrap2"><div
class="wp-geshi-highlight-wrap"><div
class="wp-geshi-highlight"><div
class="asm"><pre class="de1">        <span class="kw1">call</span>    externalFunction@PLT
        movq    externalFunction@GOTPCREL<span class="br0">&#40;</span><span class="sy0">%</span>rip<span class="br0">&#41;</span><span class="sy0">,</span> <span class="sy0">%</span>rdx
        movq    externalVariable@GOTPCREL<span class="br0">&#40;</span><span class="sy0">%</span>rip<span class="br0">&#41;</span><span class="sy0">,</span> <span class="sy0">%</span>rax
        movq    <span class="sy0">%</span>rdx<span class="sy0">,</span> <span class="br0">&#40;</span><span class="sy0">%</span>rax<span class="br0">&#41;</span></pre></div></div></div></div></div></div></div><p>When assembled, the .o still contains three relocations, albeit of different types.</p><p>When we compare the output of the position-dependent and the position-independent code, we notice the following:</p><ol><li>The <tt>call</tt> is still a call, but now we&#8217;re explicitly calling the PLT stub. This might seem irrelevant, since the linker would have fixed the call anyway to point to the PLT if it had to, but it isn&#8217;t.</li><li>The single <tt>movq</tt> instruction was split in three. This is required by the x86-64 processor, since the instruction set cannot encode both a 64-bit value and the 64-bit address to store it at in a single instruction (such an instruction would be at least 17 bytes long, which is two bytes longer than the maximum instruction length of 15 bytes).</li><li>The values for the two symbols are loaded indirectly. Instead of encoding the two values in those two middle <tt>movq</tt> instructions, the compiler is loading the values from another linker-generated structure called the Global Offset Table (GOT).</li></ol><p>The compiler needed to generate the code above since it doesn&#8217;t know where the symbols will actually be. As was the case before, those symbols can be linked into the same ELF module as this compilation unit, or they may be found elsewhere in another ELF module this one links to.</p><p>Moreover, the compiler and linker need to deal with the possibility that an executable might have done exactly what our executable in the previous section did: create a copy relocation on the variable and fix the address of the function to its own PLT stub. 
In order to work properly, this code must deal with the fact that its own variable might have ended up elsewhere, and that <tt>&#038;externalFunction</tt> might have a different value.</p><p>That means the indirect call through the PLT and the three <tt>movq</tt> instructions remain, even if those two symbols were in the same compilation unit!</p><p>The problem is that even if at first glance you&#8217;d think that the compiler should know for a fact where those symbols are, it actually doesn&#8217;t. The <tt>-fPIC</tt> option doesn&#8217;t enable only position-independent code. It also enables ELF symbol interposition, which is when another module &#8220;steals&#8221; the symbol. That happens normally by way of the copy relocations, but can also happen if an LD_PRELOAD&#8217;ed module were to override those symbols. So the compiler and linker must produce code that deals with that possibility.</p><p>In the end, we&#8217;re left with indirect calls, indirect symbol address loadings and indirect variable references, which impact code performance. In addition, the linker must leave behind relocations by name for the dynamic linker to resolve at load-time.</p><h2>All this for the possibility of interposition?</h2><p>Yes, it seems so. The impact is there for this little-known and little-used feature. Instead of optimising for the common-case scenario where the symbols are not overridden, the ABI optimises for the corner case.</p><p>Another argument is that the ABI optimises for executable code, placing the impact on the libraries. The argument is valid if the executables are much larger and more complex than the libraries themselves. It&#8217;s valid too if we consider that application developers write sloppy code, whereas library developers will write very optimised code.</p><p>I don&#8217;t think that argument holds anymore. Libraries have got much more complex in the past 10-15 years and do a lot more than they once did. 
They are not mere wrappers around system calls, like libc 4 and 5 were on Linux in the late 90s. Moreover, if we consider the rise of interpreted languages, like Perl, Python, Ruby, even QML and JavaScript, the code belonging to the ELF executables is negligible. Compare the size of the executables with the libraries that actually do the interpretation:</p><pre>
-rwxr-xr-x. 2 root root   13544 Aug  5 06:27 /usr/bin/perl
-rwxr-xr-x. 2 root root    9144 Apr 12  2011 /usr/bin/python
-rwxr-xr-x. 1 root root    5160 Dec 29 13:46 /usr/bin/ruby
-r-xr-xr-x. 1 root root 1763488 Apr 12  2011 /usr/lib64/libpython2.7.so.1.0
-rwxr-xr-x. 1 root root  947736 Dec 29 13:46 /usr/lib64/libruby.so.1.8.7
-rwxr-xr-x. 1 root root 1524064 Aug  5 06:27 /usr/lib64/perl5/CORE/libperl.so
</pre><p>That&#8217;s even valid for interpreters that JIT the code. As optimised as the code they generate can be, current understanding is that operations with critical performance are implemented in native code, which means libraries or plugins.</p><h1><a
name="existing_solutions"></a>Existing solutions</h1><h2>Partial solution for private symbols</h2><p>When developing your library, if you know that certain symbols are private and will never be used by any other library, you have an option. You can declare their ELF visibility to be &#8220;hidden&#8221;, which has two consequences. The clear one is that the linker will not add the hidden symbols to the dynamic symbol table, so other ELF modules simply cannot find them. If they can&#8217;t find them, they can&#8217;t steal them. And if they can&#8217;t steal them, the linker does not need to produce a PLT stub for the function call, so the <tt>call</tt> instruction will be linked to a simple, direct call as the executable in the first part had been.</p><p>The other consequence is an optimisation that the compiler does. Since it also knows that the <tt>externalVariable</tt> variable cannot be stolen, it does not need to address the variable indirectly. The generated assembly becomes:</p><div
class="wp-geshi-highlight-wrap5"><div
class="wp-geshi-highlight-wrap4"><div
class="wp-geshi-highlight-wrap3"><div
class="wp-geshi-highlight-wrap2"><div
class="wp-geshi-highlight-wrap"><div
class="wp-geshi-highlight"><div
class="asm"><pre class="de1">        <span class="kw1">call</span>    externalFunction@PLT
        movq    externalFunction@GOTPCREL<span class="br0">&#40;</span><span class="sy0">%</span>rip<span class="br0">&#41;</span><span class="sy0">,</span> <span class="sy0">%</span>rax
        movq    <span class="sy0">%</span>rax<span class="sy0">,</span> externalVariable<span class="br0">&#40;</span><span class="sy0">%</span>rip<span class="br0">&#41;</span></pre></div></div></div></div></div></div></div><p>The .o file will still contain three relocations. However, note how the getting of the address of the <tt>externalFunction</tt> function is still done indirectly, even though the compiler knows it cannot be interposed. That means the linker will still generate a load-time relocation for the dynamic linker, to get the address of that function. Fortunately, it&#8217;s a simpler relocation since the symbol name itself is not present.</p><p>If there&#8217;s a reason for getting the address indirectly like this, I have yet to find it.</p><h2>Partial solution for public non-interposable symbols</h2><p>If your symbols are public, however, you cannot use the ELF &#8220;hidden&#8221; visibility trick. But if you know that they cannot and will not ever be stolen or interposed, you have another possibility, which is to tell that to the compiler and linker.</p><p>If you declare a variable with ELF &#8220;protected&#8221; visibility, you&#8217;re telling the compiler and linker that it cannot be stolen, yet can be placed in the dynamic symbol table for other ELF modules to reference. You just have to be absolutely sure that they will not <strong>ever</strong> be interposed, because that will create subtle bugs that are hard to track down. That includes access to those symbols by position-dependent executable code, like we did in the first section.</p><p>The GCC syntax <tt>__attribute__((visibility("protected")))</tt> works in ELF platforms only, whereas the one with the &#8220;hidden&#8221; keyword is known to work in non-ELF platforms too, like Mac OS X (Mach-O) and IBM AIX (XCOFF).</p><p>Another way to do the same is to use one of two linker options: <tt>-Bsymbolic</tt> and <tt>-Bsymbolic-functions</tt>. 
They do basically the same as the protected visibility: they keep the symbols in the dynamic symbol table, but they make the linker use the symbol inside the library unconditionally. The difference between those two options is that the former applies to all symbols, whereas the latter applies to functions only.</p><p>Understanding why <tt>-Bsymbolic-functions</tt> exists requires looking back at the executable code from the first section. While the variable reference required a copy relocation, the function call was done indirectly, through the PLT stub. A variable can be moved, but moving code isn&#8217;t possible, so the executable code needs to deal with the code being elsewhere anyway. For that reason, it&#8217;s possible to symbolically bind function calls inside a library without affecting executables.</p><p>Or so we thought. The problem we discovered last week arises when you treat a function as data, that is, when you take its address. As we saw in the first part, the linker will resolve the address of the function to the address of the PLT stub found in the executable. But if you symbolically bind the function in the library, it will resolve to the real address. If you try to compare the two addresses, they won&#8217;t be the same.</p><h1><a
name="proposed_solutions"></a>Proposed solutions</h1><p>Some of the solutions I propose are ABI and binary compatible with existing builds; some others are ABI incompatible and would require recompilation. Unfortunately, the best solution would require source-incompatible changes. Still, all the changes below give libraries a bit of optimisation at the cost of making executables slightly less optimised.</p><h2>Use of PLT in function calls should rest only with the linker</h2><p>As we saw in the code generated for the library, with -fPIC, the compiler decided to make the call indirectly by adding &#8220;@PLT&#8221; to the symbol name. Turns out that the linker doesn&#8217;t really care about this and will generate (or not) the PLT stub if needed. If that&#8217;s the case, the compiler should not make a judgement call about where the symbol is located just because of -fPIC.</p><h2>Function addresses should always be resolved through the GOT</h2><p>Function calls already require a pointer-sized variable somewhere and a relocation to make it point to the valid entry point of the function being called. What&#8217;s more, taking addresses of functions is a somewhat rare operation, compared to the number of function calls across ELF modules.</p><p>That being the case, we can take a small &#8220;hit&#8221; in performance: the loading of a function address should happen via the GOT in position-dependent code (executables) just like it is done for position-independent code.</p><p>The benefit of doing this is that the function address we load will point exactly to the function&#8217;s real entry point, instead of the PLT stub. When we call this function, we avoid the doubly-indirect branching we found earlier.</p><h2>PLT stubs should use the regular GOT&#8217;s address, if it exists</h2><p>If a given function is both called and its address is taken, the PLT stub should reference the GOT entry that was used for taking the address. 
The reason why it isn&#8217;t already so, I guess, is because the entries in the <tt>.got.plt</tt> section aren&#8217;t initialised with the target function&#8217;s address, but the local module&#8217;s function resolver. This trick allows for the &#8220;lazy resolution&#8221; of functions: they are resolved only the first time they are called.</p><p>I wouldn&#8217;t ask for all functions to be resolved at load-time, but if the address of the function is taken <strong>anyway</strong>, the dynamic linker will need to resolve it at load time. So why waste CPU cycles in a function call if the address was computed already?</p><h2>Copy relocations should be deprecated</h2><p>Instead of copying the variable from the library into the executable, executables should use indirect addressing for reading variables and writing to them, as well as taking their addresses. One benefit of doing this is avoiding the actual copying. For example, for read-only variables, they may remain in read-only pages of memory, instead of being copied to read-write pages found in the executable.</p><p>The big drawback of this is that the indirect addressing is a lot more expensive, since it requires two memory references, not just one. The next suggestion might help alleviate the problem.</p><h2>The linker should relax instructions used for loading variable addresses</h2><p>This is a suggestion found in the IA-64 ABI: the compiler generates the instructions needed to load the address of the variable from the GOT, then use it as it needs to. If the linker concludes (by whichever means, like protected or hidden symbols, the use of one of the symbolic options, or because this is an ELF application and the symbol is defined in it) that the symbol must reside in the current ELF module, it can change the load instruction into a register-to-register move or similar.</p><p>For our x86-64 64-bit case, the instructions the compiler generated were:</p><div
class="wp-geshi-highlight-wrap5"><div
class="wp-geshi-highlight-wrap4"><div
class="wp-geshi-highlight-wrap3"><div
class="wp-geshi-highlight-wrap2"><div
class="wp-geshi-highlight-wrap"><div
class="wp-geshi-highlight"><div
class="asm"><pre class="de1">        movq    externalVariable@GOTPCREL<span class="br0">&#40;</span><span class="sy0">%</span>rip<span class="br0">&#41;</span><span class="sy0">,</span> <span class="sy0">%</span>rax
        movq    <span class="sy0">%</span>rdx<span class="sy0">,</span> <span class="br0">&#40;</span><span class="sy0">%</span>rax<span class="br0">&#41;</span></pre></div></div></div></div></div></div></div><p>By changing one bit in the opcode of the first instruction, with no code size change, we can produce:</p><div
class="wp-geshi-highlight-wrap5"><div
class="wp-geshi-highlight-wrap4"><div
class="wp-geshi-highlight-wrap3"><div
class="wp-geshi-highlight-wrap2"><div
class="wp-geshi-highlight-wrap"><div
class="wp-geshi-highlight"><div
class="asm"><pre class="de1">        leaq    externalVariable@GOTPCREL<span class="br0">&#40;</span><span class="sy0">%</span>rip<span class="br0">&#41;</span><span class="sy0">,</span> <span class="sy0">%</span>rax
        movq    <span class="sy0">%</span>rdx<span class="sy0">,</span> <span class="br0">&#40;</span><span class="sy0">%</span>rax<span class="br0">&#41;</span></pre></div></div></div></div></div></div></div><p>The x86 instruction &#8220;LEA&#8221; means &#8220;Load Effective Address&#8221;. Instead of loading 64 bits from the memory address externalVariable@GOTPCREL(%rip) and storing them in the register, that instruction stores in the register the address it would have loaded from. This isn&#8217;t as optimised as the original code found in the executable for two reasons: it requires two instructions instead of just one and it requires an additional register.</p><p>It&#8217;s possible to generate even more efficient code if the assembler leaves a 32-bit immediate offset in the second <tt>movq</tt> instruction, making it 7 bytes long. This extra immediate would be of no impact in the original code, besides making it longer, but it would allow the linker to optimise the code further.</p><p>The original would be:</p><div
class="wp-geshi-highlight-wrap5"><div
class="wp-geshi-highlight-wrap4"><div
class="wp-geshi-highlight-wrap3"><div
class="wp-geshi-highlight-wrap2"><div
class="wp-geshi-highlight-wrap"><div
class="wp-geshi-highlight"><div
class="asm"><pre class="de1">        movq     externalVariable@GOTPCREL<span class="br0">&#40;</span><span class="sy0">%</span>rip<span class="br0">&#41;</span><span class="sy0">,</span> <span class="sy0">%</span>rax
        movq<span class="sy0">.</span>d32 <span class="sy0">%</span>rdx<span class="sy0">,</span> <span class="nu0">0x0</span><span class="br0">&#40;</span><span class="sy0">%</span>rax<span class="br0">&#41;</span></pre></div></div></div></div></div></div></div><p>And it would get relaxed to:</p><div
class="wp-geshi-highlight-wrap5"><div
class="wp-geshi-highlight-wrap4"><div
class="wp-geshi-highlight-wrap3"><div
class="wp-geshi-highlight-wrap2"><div
class="wp-geshi-highlight-wrap"><div
class="wp-geshi-highlight"><div
class="asm"><pre class="de1">        nopl<span class="sy0">.</span>d32 <span class="nu0">0x0</span><span class="br0">&#40;</span><span class="sy0">%</span>rax<span class="br0">&#41;</span>
        movq     <span class="sy0">%</span>rdx<span class="sy0">,</span> externalVariable@GOTPCREL<span class="br0">&#40;</span><span class="sy0">%</span>rip<span class="br0">&#41;</span></pre></div></div></div></div></div></div></div><p>That is, the first 7-byte instruction is resolved to a 7-byte NOP, whereas the second 7-byte instruction executes the actual store, with no extra register use. The compiler cannot know that the register will be left untouched, but at least there is no dependency between the two instructions that might cause a CPU stall.</p><p>The same applies to other architectures too. The full <tt>-fPIC</tt> code on ARM to store a value from a register into a variable is the following:</p><div
class="wp-geshi-highlight-wrap5"><div
class="wp-geshi-highlight-wrap4"><div
class="wp-geshi-highlight-wrap3"><div
class="wp-geshi-highlight-wrap2"><div
class="wp-geshi-highlight-wrap"><div
class="wp-geshi-highlight"><div
class="asm"><pre class="de1">        ldr     r3<span class="sy0">,</span> <span class="sy0">.</span>L2<span class="sy0">+</span><span class="nu0">8</span>     @ points to a constant whose value is<span class="sy0">:</span> externalVariable<span class="br0">&#40;</span>GOT<span class="br0">&#41;</span>
<span class="sy0">.</span>LPIC1<span class="sy0">:</span> ldr     r3<span class="sy0">,</span> <span class="br0">&#91;</span>r4<span class="sy0">,</span> r3<span class="br0">&#93;</span>  @ r4 contains the base address of the GOT
        <span class="kw1">str</span>     r2<span class="sy0">,</span> <span class="br0">&#91;</span>r3<span class="sy0">,</span> #<span class="nu0">0</span><span class="br0">&#93;</span></pre></div></div></div></div></div></div></div><p>If the linker can conclude the symbol must be in the current ELF module and cannot change, it may be able to avoid the extra load (the middle instruction) by changing the code to be:</p><div
class="wp-geshi-highlight-wrap5"><div
class="wp-geshi-highlight-wrap4"><div
class="wp-geshi-highlight-wrap3"><div
class="wp-geshi-highlight-wrap2"><div
class="wp-geshi-highlight-wrap"><div
class="wp-geshi-highlight"><div
class="asm"><pre class="de1">        ldr     r3<span class="sy0">,</span> <span class="sy0">.</span>L2<span class="sy0">+</span><span class="nu0">8</span>     @ points to a constant whose value is<span class="sy0">:</span> externalVariable<span class="sy0">-</span><span class="br0">&#40;</span><span class="sy0">.</span>LPIC1<span class="sy0">+</span><span class="nu0">8</span><span class="br0">&#41;</span>
<span class="sy0">.</span>LPIC1<span class="sy0">:</span> <span class="kw1">add</span>     r3<span class="sy0">,</span> pc<span class="sy0">,</span> r3
        <span class="kw1">str</span>     r2<span class="sy0">,</span> <span class="br0">&#91;</span>r3<span class="sy0">,</span> #<span class="nu0">0</span><span class="br0">&#93;</span></pre></div></div></div></div></div></div></div><p>Unlike x86, the ARM instructions cannot be optimised further, since the immediates encodable in the instructions have limited range.</p><h2>The linker should relax instructions used for loading function addresses</h2><p>Similar to the above, but instead looking at function addresses. The original library code is:</p><div
class="wp-geshi-highlight-wrap5"><div
class="wp-geshi-highlight-wrap4"><div
class="wp-geshi-highlight-wrap3"><div
class="wp-geshi-highlight-wrap2"><div
class="wp-geshi-highlight-wrap"><div
class="wp-geshi-highlight"><div
class="asm"><pre class="de1">        movq    externalFunction@GOTPCREL<span class="br0">&#40;</span><span class="sy0">%</span>rip<span class="br0">&#41;</span><span class="sy0">,</span> <span class="sy0">%</span>rdx</pre></div></div></div></div></div></div></div><p>But it can be relaxed to:</p><div
class="wp-geshi-highlight-wrap5"><div
class="wp-geshi-highlight-wrap4"><div
class="wp-geshi-highlight-wrap3"><div
class="wp-geshi-highlight-wrap2"><div
class="wp-geshi-highlight-wrap"><div
class="wp-geshi-highlight"><div
class="asm"><pre class="de1">        leaq    externalFunction<span class="br0">&#40;</span><span class="sy0">%</span>rip<span class="br0">&#41;</span><span class="sy0">,</span> <span class="sy0">%</span>rdx</pre></div></div></div></div></div></div></div><p>With ARM, the original code is:</p><div
class="wp-geshi-highlight-wrap5"><div
class="wp-geshi-highlight-wrap4"><div
class="wp-geshi-highlight-wrap3"><div
class="wp-geshi-highlight-wrap2"><div
class="wp-geshi-highlight-wrap"><div
class="wp-geshi-highlight"><div
class="asm"><pre class="de1">        ldr     r3<span class="sy0">,</span> <span class="sy0">.</span>L2<span class="sy0">+</span><span class="nu0">8</span>     @ points to a constant of value<span class="sy0">:</span> externalFunction<span class="br0">&#40;</span>GOT<span class="br0">&#41;</span>
        ldr     r2<span class="sy0">,</span> <span class="br0">&#91;</span>r4<span class="sy0">,</span> r3<span class="br0">&#93;</span>  @ r4 contains the address of the base of the GOT</pre></div></div></div></div></div></div></div><p>But relaxed, it would be:</p><div
class="wp-geshi-highlight-wrap5"><div
class="wp-geshi-highlight-wrap4"><div
class="wp-geshi-highlight-wrap3"><div
class="wp-geshi-highlight-wrap2"><div
class="wp-geshi-highlight-wrap"><div
class="wp-geshi-highlight"><div
class="asm"><pre class="de1">        ldr     r2<span class="sy0">,</span> <span class="sy0">.</span>L2<span class="sy0">+</span><span class="nu0">8</span>    @ points to a constant of value<span class="sy0">:</span> externalFunction<span class="sy0">-</span><span class="br0">&#40;</span><span class="sy0">.</span>LPIC0<span class="sy0">+</span><span class="nu0">8</span><span class="br0">&#41;</span>
<span class="sy0">.</span>LPIC0<span class="sy0">:</span> <span class="kw1">add</span>     r2<span class="sy0">,</span> pc<span class="sy0">,</span> r2</pre></div></div></div></div></div></div></div><h2>There should be a way to tell the compiler where the symbol is</h2><p>We&#8217;re already able to tell the compiler that a symbol is in the current module, with the hidden visibility attribute. We should also be able to tell the compiler that a symbol is in the current module but exported, and that a symbol is in another module.</p><p>I would suggest simply using the existing ELF markers and being explicit about them:</p><ul><li><tt>__attribute__((visibility("hidden")))</tt>: symbol is in this ELF module and is not exported (equivalent on Windows: no decoration);</li><li><tt>__attribute__((visibility("protected")))</tt>: symbol is in this ELF module and is exported (equivalent on Windows: <tt>__declspec(dllexport)</tt>);</li><li><tt>__attribute__((visibility("default")))</tt>: symbol is in another ELF module (equivalent on Windows: <tt>__declspec(dllimport)</tt>); this also applies to symbols that must be overridable according to the library&#8217;s API (like C++&#8217;s global operator new).</li></ul><p>Considering the other suggestions, we know the references to symbols with &#8220;default&#8221; visibility can be relaxed into simpler and more efficient code in the presence of one of the symbolic binding options. That means we can use the &#8220;default&#8221; visibility for cases of uncertain symbols.</p><h1>Getting there</h1><p>Some of the solutions I listed are already possible and they should be used immediately in all libraries. That is especially true of the hidden visibility: all libraries, without exception, should make use of this feature. 
In fact, since this option was introduced in GCC 4.0 seven years ago, many libraries have started using it and are now &#8220;good citizens&#8221;, for they access their own private data most efficiently, they don&#8217;t have huge symbol tables (which impact lookup speed) and they don&#8217;t pollute the global namespace with unnecessary symbols.</p><p>Other solutions are not possible to implement yet. The solution I personally feel is most important to be implemented first is that of the ELF executables: they need to stop using copy relocations and they should resolve addresses of functions via the GOT. Only once that is done can libraries start using the &#8220;protected&#8221; visibility and generate improved code. This implies changing the psABI for the affected libraries, which may not be an easy transition.</p><p>An alternative to using the &#8220;protected&#8221; visibility is to use the symbolic binding options. The code relaxation optimisations would come in handy at this point to optimise at link-time the code that the compiler could not make a decision on. Unfortunately, those options apply to all symbols in a library, so libraries that must have overridable symbols need to use an extra option (<tt>--dynamic-list</tt>) and list each symbol one by one.</p><h2>Using -fPIE</h2><p>The compiler option <tt>-fPIE</tt> tells the compiler to generate position-independent code for executables. It is similar to the <tt>-fPIC</tt> option in that it generates position-independent code, but it has the added optimisation that the compiler can assume none of its symbols can be interposed.</p><p>With executables compiled with this option, copy relocations and direct loading of function addresses aren&#8217;t used. This solves the problem we had. 
Therefore, compiling executables with this option allows us to start using some of the optimisations I described before.</p><p>Unfortunately, as its description says, this option also generates position-independent code, which can be less efficient than position-dependent code in some situations. My preference would be to have position-dependent code executables without the copy relocations. However, there&#8217;s an added side-effect of this option: it defines the <tt>__PIC__</tt> macro, whose absence can be used to abort compilations for libraries that have transitioned to the more efficient options.</p><h1>Further work and further reading</h1><p>I highly recommend Ulrich Drepper&#8217;s <a
href="http://www.akkadia.org/drepper/dsohowto.pdf">&#8220;How to Write Shared Libraries&#8221;</a> paper. His recommendations did not go as far as suggesting changes to the ABI like I have, but he has many that library developers should adhere to, regardless of whether my recommendations are accepted or not. For example, using <tt>static</tt> functions and data where possible and avoiding arrays of pointers are recommendations I have made to many people.</p><p>Other work necessary is to improve prelinking support. Shared libraries are position-independent, but they can be prelinked to a preferred location in memory. One optimisation I have yet to see done is to use the read-only pages of prelinked data when the library is loaded at that preferred address (the <tt>.data.rel.ro</tt> sections).</p><div
style="padding-bottom:4px;"></div>]]></content:encoded> <wfw:commentRss>http://www.macieira.org/blog/2012/01/sorry-state-of-dynamic-libraries-on-linux/feed/</wfw:commentRss> <slash:comments>24</slash:comments> </item> <item><title>Qt-Project.org is live</title><link>http://www.macieira.org/blog/2011/10/qt-project-org-is-live/</link> <comments>http://www.macieira.org/blog/2011/10/qt-project-org-is-live/#comments</comments> <pubDate>Fri, 21 Oct 2011 12:28:32 +0000</pubDate> <dc:creator>Thiago Macieira</dc:creator> <category><![CDATA[KDE]]></category> <category><![CDATA[MeeGo]]></category> <category><![CDATA[Open Governance]]></category> <category><![CDATA[Qt]]></category> <category><![CDATA[codereview]]></category> <category><![CDATA[gerrit]]></category> <category><![CDATA[opengov]]></category> <category><![CDATA[qt]]></category> <category><![CDATA[qt-project]]></category> <guid
isPermaLink="false">http://www.macieira.org/blog/?p=200</guid> <description><![CDATA[As you may have noticed in Lars&#8217;s blog the new Qt Project website and organisation is live! Yeah! It&#8217;s the product of many people&#8217;s work over the course of a year and a half, changing the way how over 200 engineers work on their daily lives. The change is just in time for the Qt &#8230;</p><p><a
class="more-link block-button" href="http://www.macieira.org/blog/2011/10/qt-project-org-is-live/">Continue reading &#187;</a>]]></description> <content:encoded><![CDATA[<p>As you may have noticed in <a
href="http://labs.qt.nokia.com/2011/10/21/the-qt-project-is-live/">Lars&#8217;s blog</a> the new <a
href="http://qt-project.org">Qt Project</a> website and organisation are live! Yeah! It&#8217;s the product of many people&#8217;s work over the course of a year and a half, changing the way over 200 engineers work in their daily lives.</p><p>The change is just in time for the <a
href="http://qt.nokia.com/qtdevdays2011/">Qt Developer Days</a> event in Munich, which starts next Monday with a Qt Contributors&#8217; Unconference Day. I&#8217;ll be there and we&#8217;ll be discussing how to get started. It also comes soon after Research In Motion opened up its <a
href="http://blackberry.github.com/">Native BlackBerry SDK</a> with support for Qt.</p><p>Here are some resources you may want to get started with:</p><ul><li><strong>Mailing lists</strong>: Subscribe at <a
href="http://lists.qt-project.org/mailman/listinfo">http://lists.qt-project.org/mailman/listinfo</a> to the mailing lists that interest you. There&#8217;s no description available for them yet, but you can guess what they are from their names.</li><li><strong>Bug tracking</strong>: If you don&#8217;t have an account yet, create one by signing up at the <a
href="http://bugreports.qt.nokia.com">Qt Bugreports</a> website, which is still in a .nokia.com domain but should hopefully change soon.</li><li><strong>Code review</strong>: sign up first for the Bugreports account above, then head to the <a
href="https://codereview.qt-project.org">Codereview</a> website and log in with the same credentials. There, set your real name and add one or more email addresses you&#8217;re known by.</li></ul><p>The <a
href="https://codereview.qt-project.org">Codereview</a> website is where the reviews and approvals will all happen, and the <a
href="http://lists.qt-project.org/mailman/subscribe/development">Development mailing list</a> is where all discussions will happen. If you plan on being involved, you should be on both.</p><p>To contribute a code change to Qt, you&#8217;ll need to provide an SSH public key to authenticate yourself and you&#8217;ll need to agree to the terms of the Qt Contribution Agreement, now on version 1.1. If you choose not to do that, you won&#8217;t be able to contribute code, but you can of course contribute in many other ways, including reviewing and offering advice on how to improve other people&#8217;s code.</p><p>You may want to add this to your <tt>~/.ssh/config</tt>:</p><p
style="white-space: pre"><tt><br
/> Host codereview.qt-project.org<br
/> &#x200D;&#x200D;Port 29418<br
/> &#x200D;&#x200D;User <em>insert-username-here</em><br
/> &#x200D;&#x200D;IdentityFile <em>insert-path-to-ssh-key-here</em><br
/> </tt></p><p>And this is the SSH key and fingerprint for the website:</p><pre>
Fingerprint: 11:24:25:51:5d:ab:4f:b1:15:49:10:3a:68:6d:ec:0f
[codereview.qt-project.org]:29418,[87.238.53.162]:29418 ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAAAgQCvXdApmCFiAyXDiYU5+z6762Qv8+vrmM3+9YrxDKByyphaxblLJC9txPv3D/w7rzSyiMMHL/5ssCemwz+6QBqnemFl4B+FNv81fpZFsqCg5afrTi62WFllGWIQAiYb2JZmkmSAbxm+sAxLE1ritp+Syxz8Gb8WR27G/3TSHerdBQ==
</pre>]]></content:encoded> <wfw:commentRss>http://www.macieira.org/blog/2011/10/qt-project-org-is-live/feed/</wfw:commentRss> <slash:comments>3</slash:comments> </item> <item><title>New domain for Qt announced: qt-project.org</title><link>http://www.macieira.org/blog/2011/09/new-domain-for-qt-announced-qt-project-org/</link> <comments>http://www.macieira.org/blog/2011/09/new-domain-for-qt-announced-qt-project-org/#comments</comments> <pubDate>Mon, 12 Sep 2011 11:46:07 +0000</pubDate> <dc:creator>Thiago Macieira</dc:creator> <category><![CDATA[KDE]]></category> <category><![CDATA[MeeGo]]></category> <category><![CDATA[Open Governance]]></category> <category><![CDATA[Qt]]></category> <category><![CDATA[labs]]></category> <category><![CDATA[opengov]]></category> <category><![CDATA[qt]]></category> <guid
isPermaLink="false">http://www.macieira.org/blog/?p=147</guid> <description><![CDATA[Lars has just announced on his blog that the Qt Open Source Project, under the Open Governance, will be moved to a new domain: qt-project.org (don&#8217;t bother copy/pasting, there&#8217;s nothing there yet). At the official Nokia Qt Blog, Daniel Kihlberg gives us the date for the launch: October 17th. The moving to a new domain &#8230;</p><p><a
class="more-link block-button" href="http://www.macieira.org/blog/2011/09/new-domain-for-qt-announced-qt-project-org/">Continue reading &#187;</a>]]></description> <content:encoded><![CDATA[<p>Lars has just <a
href="http://labs.qt.nokia.com/2011/09/12/qt-project/">announced</a> on his blog that the Qt Open Source Project, under the Open Governance, will be moved to a new domain: <tt>qt-project.org</tt> (don&#8217;t bother copy/pasting, there&#8217;s nothing there yet). At the official Nokia Qt Blog, Daniel Kihlberg <a
href="http://blog.qt.nokia.com/2011/09/12/qt-project/">gives us</a> the date for the launch: October 17th.</p><p>Moving to a new domain name has always been in the plans. I remember registering a domain when we started discussing Open Governance, in April of last year. That was, of course, before I knew how long it would take to actually get off the ground and that transferring domains would be a hassle.</p><p>This domain, along with all the infrastructure required to run the project, will be owned by a non-profit foundation. It will not be owned by Nokia, nor any other company. This is to make it absolutely clear that the project is neutral, independent of how companies use it. Lars is also clear in his blog that decision-making is done by the community, following the guidelines that others and I have been talking about for months.</p><p>From the launch point forward, we will say &#8220;the Qt Project releases version x.y&#8221;, or &#8220;the Qt Project has decided to do Z&#8221;, where we understand &#8220;Qt Project&#8221; to be the community decision.</p><p>What does this mean for other projects using Qt, like KDE and MeeGo (the &#8220;downstreams&#8221;)? More access to the decision-making, to the inner workings, directing Qt to their needs; learning from Qt&#8217;s good and bad, also finding out where Qt isn&#8217;t going, so they can go there themselves. In fact, this topic was the subject of my Camp KDE presentation in April.</p><p>Both KDE and MeeGo have begun doing that, to some extent. MeeGo&#8217;s goal of a Wayland-based installation, with much improved graphics performance, rhymes with Qt&#8217;s. MeeGo has been helping drive the Wayland project. KDE, for its part, has begun the &#8220;KDE Frameworks&#8221; project, to refactor the KDE Libraries from KDE 4 and make them part of the Qt Ecosystem.</p><p>I&#8217;m glad to be part of all three projects, helping them see and help each other. 
Shameless plug: alone, I can&#8217;t do much, but together we can do a lot. Join us!</p>]]></content:encoded> <wfw:commentRss>http://www.macieira.org/blog/2011/09/new-domain-for-qt-announced-qt-project-org/feed/</wfw:commentRss> <slash:comments>4</slash:comments> </item> <item><title>Qt 4.8 beta 1 released</title><link>http://www.macieira.org/blog/2011/07/qt-4-8-beta-1-released/</link> <comments>http://www.macieira.org/blog/2011/07/qt-4-8-beta-1-released/#comments</comments> <pubDate>Thu, 21 Jul 2011 21:28:42 +0000</pubDate> <dc:creator>Thiago Macieira</dc:creator> <category><![CDATA[KDE]]></category> <category><![CDATA[MeeGo]]></category> <category><![CDATA[Qt]]></category> <category><![CDATA[qt]]></category> <category><![CDATA[qt4.8]]></category> <category><![CDATA[releases]]></category> <guid
isPermaLink="false">http://www.macieira.org/blog/?p=113</guid> <description><![CDATA[I&#8217;ve just realised that neither Eckhart&#8217;s post nor the QtWebKit post was aggregated on either Planet KDE or Planet MeeGo. Quoting: It has been some weeks since we released the Qt 4.8 Technology Preview to the community. The release raised a lot of interest and we have received many comments in response to the Qt &#8230;</p><p><a
class="more-link block-button" href="http://www.macieira.org/blog/2011/07/qt-4-8-beta-1-released/">Continue reading &#187;</a>]]></description> <content:encoded><![CDATA[<p>I&#8217;ve just realised that neither <a
href="http://labs.qt.nokia.com/2011/07/19/qt-4-8-beta-released/">Eckhart&#8217;s post</a> nor the <a
href="http://qtwebkit.blogspot.com/2011/07/qtwebkit-22-beta1-vs-qt-48-beta1.html">QtWebKit post</a> was aggregated on either Planet KDE or Planet MeeGo.</p><p>Quoting:</p><blockquote><p>It has been some weeks since we released the Qt 4.8 Technology Preview to the community. The release raised a lot of interest and we have received many comments in response to the Qt 4.8 TP blog.</p><p>Today we release the Qt 4.8 Beta. It should be noted it is not yet a final release candidate but it helps us make the quality of the final release even better. It will be available as an online Qt SDK 1.1 update only.</p></blockquote><p>The Qt 4.8 beta1 package contains QtWebKit 2.2 beta1. It also contains the new QPA port (from the Lighthouse project), which brings us partial Wayland support too.</p><p>For the impatient, here are the download links:</p><ul><li><a
href="http://get.qt.nokia.com/qt/source/qt-everywhere-opensource-src-4.8.0-beta1.zip">qt-everywhere-opensource-src-4.8.0-beta1.zip</a> (257.6 MB)</li><li><a
href="http://get.qt.nokia.com/qt/source/qt-everywhere-opensource-src-4.8.0-beta1.tar.gz">qt-everywhere-opensource-src-4.8.0-beta1.tar.gz</a> (223.6 MB)</li><li><a
href="http://get.qt.nokia.com/qt/source/md5sums.txt">md5sums.txt</a> (0.046 MB)</li></ul><p>PS: the <a
href="https://qt.gitorious.org/qt/qt/commit/v4.8.0-beta">Git tag</a> is called just &#8220;v4.8.0-beta&#8221;, missing the  number 1.</p>]]></content:encoded> <wfw:commentRss>http://www.macieira.org/blog/2011/07/qt-4-8-beta-1-released/feed/</wfw:commentRss> <slash:comments>2</slash:comments> </item> <item><title>QString improved</title><link>http://www.macieira.org/blog/2011/07/qstring-improved/</link> <comments>http://www.macieira.org/blog/2011/07/qstring-improved/#comments</comments> <pubDate>Sun, 17 Jul 2011 22:32:40 +0000</pubDate> <dc:creator>Thiago Macieira</dc:creator> <category><![CDATA[C++]]></category> <category><![CDATA[KDE]]></category> <category><![CDATA[MeeGo]]></category> <category><![CDATA[Qt]]></category> <category><![CDATA[c++11]]></category> <category><![CDATA[optimisation]]></category> <category><![CDATA[qt]]></category> <category><![CDATA[unicode]]></category> <guid
isPermaLink="false">http://www.macieira.org/blog/?p=15</guid> <description><![CDATA[On my birthday, I blogged about how I&#8217;d like QString to support proper UTF-8 strings and be much easier to use. The code that I said would be my preferred would be: QString s = u&#34;Résumé&#34;q; Recently, in Qt 5.0 we have begun to make steps to reach that. Most of the work was done &#8230;</p><p><a
class="more-link block-button" href="http://www.macieira.org/blog/2011/07/qstring-improved/">Continue reading &#187;</a>]]></description> <content:encoded><![CDATA[<p>On my birthday, I <a
href="http://labs.qt.nokia.com/2011/03/26/on-utf-8-latin-1-and-charsets/">blogged</a> about how I&#8217;d like QString to support proper UTF-8 strings and be much easier to use. The code I said I would prefer is:</p><pre>
    QString s = u<span style='color:#bf0303;'>&quot;Résumé&quot;</span>q;
</pre><p>Recently, in Qt 5.0 we have begun taking steps towards that. Most of the work was done by <a
href="http://labs.qt.nokia.com/author/lars/">Lars Knoll</a>, but since he is on vacation right now, I&#8217;ll take the opportunity to explain it all. (Also, many thanks to <a
href="http://labs.qt.nokia.com/author/olivier/">Olivier Goffart</a> for helping with reviewing)</p><p><span
id="more-15"></span></p><h2>Analysing Qt 4</h2><p>One of the things we most wanted to do in Qt 5.0 was to make QStrings be storable in read-only memory. As I said in my post back in March, we want to ask the compiler to write the UTF-16 strings for us in the <tt>.rodata</tt> section of the binary. Right now, in Qt 4.x, whenever you write code as:</p><pre class="brush: cpp; title: ; notranslate">
    QString s = QLatin1String(&quot;Hello, World&quot;);
</pre><p>The compiler will emit a standard, 8-bit C string in the read-only section of the binary, then make a call to the QString constructor to convert that to UTF-16. For the string above, which contains 12 characters plus the ending NUL, we store 13 bytes in <tt>.rodata</tt> and we must allocate 26 bytes plus the <tt>QString::Data</tt> overhead in the heap to create our string.</p><p>As fast as <tt>malloc()</tt> can be, it&#8217;s still a non-negligible cost. For example, you should always avoid it when doing benchmarks, since its runtime can vary a lot and skew your results. Not to mention, of course, that dynamic memory (the heap) cannot be shared among applications.</p><p>If we look at <tt>QString::Data</tt> in Qt 4, we see:</p><pre class="brush: cpp; title: ; notranslate">
    struct Data {
        QBasicAtomicInt ref;
        int alloc, size;
        ushort *data; // QT5: put that after the bit field to fill alignment gap; don't use sizeof any more then
        ushort clean : 1;
        ushort simpletext : 1;
        ushort righttoleft : 1;
        ushort asciiCache : 1;
        ushort capacity : 1;
        ushort reserved : 11;
        // ### Qt5: try to ensure that &quot;array&quot; is aligned to 16 bytes on both 32- and 64-bit
        ushort array[1];
    };
</pre><p>The way QString works is that the <tt>data</tt> pointer is initialised to point to the beginning of the actual UTF-16 data. On a normal QString, this pointer points to the first element of the <tt>array</tt> array, whereas it points elsewhere in case of a QString created using <a
href="http://doc.qt.nokia.com/latest/qstring.html#fromRawData"><tt>QString::fromRawData</tt></a>.</p><p>The flags in the 16-bit bitfield were set by a couple of functions and are technically a heritage from the Qt 3 days.</p><h2>Read-only QStringData</h2><p>To store a QString in read-only memory, a couple of modifications were required. The first thing that needed changing was the reference counting: QString always incremented and decremented it. If we want to save the <tt>QStringData</tt> object in read-only memory, we must not try to increment or decrement the reference counter, so we chose the value -1 to indicate the constant <tt>QStringData</tt>. Whenever this value is seen, the new code will avoid the atomic operations.</p><p>The next thing that needed changing was the pointer. First of all, it&#8217;s not possible to initialise the pointer to the value of another member in the object. Second, even if it were possible, having a pointer means the linker cannot place a value there in Position-Independent Code. The loader would need to do that and then the object would be stored in a read-write section because of the relocations.</p><p>The solution we found was to replace the pointer with an offset (using <tt>qptrdiff</tt>), indicating how far after the beginning of the array the actual data is located. When this member is zero, it means the data is stored in the array in the <tt>QStringData</tt>, which is how we initialise it.</p><p>Then we were only left with the problem of getting UTF-16 data generated by the compiler. With C++0x (C++11), as I pointed out in my post back in March, it&#8217;s easy. The alternative we found, for compilers without C++0x support, is on Windows: there, the <tt>wchar_t</tt> type is 2 bytes wide and encodes a UTF-16 string. By the way, it&#8217;s possible to get this behaviour with GCC on other platforms using the <tt>-fshort-wchar</tt> option.</p><p>All of this, Lars implemented in commit <a
href="https://qt.gitorious.org/qt/qtbase/commit/ee85e9cc10bc6874c892b09fa54b5dbd79854069">ee85e9cc10bc6874c892b09fa54b5dbd79854069</a> (Gitorious won&#8217;t display it, it&#8217;s too large). He added a new macro called <tt>QStringLiteral</tt> which can be used as:</p><pre class="brush: cpp; title: ; notranslate">
    QString s = QStringLiteral(&quot;Hello, World\n&quot;);
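
    // A minimal sketch of the offset idea described above, assuming a
    // simplified layout. This is not the actual Qt source: SketchStringData,
    // dataPointer and the use of long as a stand-in for qptrdiff are
    // hypothetical names and choices made only for illustration.
    struct SketchStringData {
        int ref;            // -1 marks constant data, so no atomic operations
        int size;           // number of UTF-16 code units, excluding the NUL
        long offset;        // distance from the start of array to the data
        char16_t array[6];  // room for u"Hello" and the terminating NUL

        // The address is computed on demand instead of being stored.
        const char16_t *dataPointer() const { return array + offset; }
    };

    // Every member is a compile-time constant and none is a pointer, so the
    // object needs no relocation and can be placed in read-only memory.
    static const SketchStringData helloData = { -1, 5, 0, u"Hello" };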
</pre><h2>Producing a non-temporary</h2><p>Lars&#8217;s implementation worked fine in the compilers he tested: GCC 4.4 and 4.5. However, when I tested with GCC 4.6, I started getting crashes. After analysing the assembly output, it turns out that the compiler initialised the <tt>QStringData</tt> object in the stack. If you had a function like the following:</p><pre class="brush: cpp; title: ; notranslate">
QString foo()
{
    return QStringLiteral(&quot;Hello, World\n&quot;);
}
</pre><p>and compiled it in -O3 mode, GCC 4.6 even skipped all the initialisation since it figured it was dead code. It simply set the d-pointer in the returned QString to an address on the stack.</p><p>In order to produce a non-temporary, I needed to figure out a way to create a <tt><strong>static</strong></tt> variable. My solution was to use the GCC <a
href="http://gcc.gnu.org/onlinedocs/gcc-4.6.1/gcc/Statement-Exprs.html">Statement Expressions</a> extension. It works fine for code inside a function, but not outside. And, of course, it doesn&#8217;t work on other compilers.</p><p>This was implemented in commit <a
href="https://qt.gitorious.org/qt/qtbase/commit/571785b31d21715857228b00f96cd24601b28c8c">571785b31d21715857228b00f96cd24601b28c8c</a>.</p><p>Olivier then had an inspired suggestion: to use C++0x lambdas. They work very similarly to GCC&#8217;s statement expressions in that they allow us to have code (including static variables) in what is otherwise an expression. I implemented that in commit <a
href="https://qt.gitorious.org/qt/qtbase/commit/cd80fcb5d6db9d99684b94a90d2c798b712442c4">cd80fcb5d6db9d99684b94a90d2c798b712442c4</a>.</p><h2>Current status</h2><p>The new macro <tt>QStringLiteral</tt> is present in Qt 5.0 and can be used almost anywhere a QLatin1String is currently used. It also works in all compilers, albeit not the same way. If your compiler supports C++0x lambdas or statement expressions, and it supports one way of producing UTF-16 strings (C++0x&#8217;s Unicode strings or UTF-16 wide chars), then this macro will produce read-only, sharable data. There&#8217;s no creation cost at all for using this: the code generated is only an assignment and an integer comparison (which always has the same result). If your compiler doesn&#8217;t support that, then <tt>QStringLiteral</tt> is defined to be <tt>QLatin1String</tt> and we fall back to current Qt 4.x behaviour.</p><p>The one caveat is that you cannot use QStringLiteral outside a function with every compiler, since GCC statement expressions don&#8217;t support that. Moreover, the following code would work, but isn&#8217;t read-only sharable:</p><pre class="brush: cpp; title: ; notranslate">
static const QString s = QStringLiteral(&quot;Hello, World\n&quot;);
</pre><p>If you can, use the following C++0x expression, which is read-only and sharable:</p><pre class="brush: cpp; title: ; notranslate">
static const auto s = QStringLiteral(&quot;Hello, World\n&quot;);
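
// A minimal sketch of the lambda technique mentioned earlier, assuming a
// compiler with C++11 lambdas and Unicode string literals. SketchLiteral and
// SKETCH_LITERAL are hypothetical names used for illustration only; this is
// not the real QStringLiteral macro.
struct SketchLiteral {
    const char16_t *data;   // points at the static array inside the lambda
    int size;               // number of UTF-16 code units, excluding the NUL
};

// The lambda body provides a scope for a static variable, and calling the
// lambda immediately turns the whole construct back into an expression.
#define SKETCH_LITERAL(str) \
    ([]() -> SketchLiteral { \
        static const char16_t text[] = u"" str; \
        return SketchLiteral{ text, int(sizeof(text) / sizeof(text[0])) - 1 }; \
    }())

static const SketchLiteral greeting = SKETCH_LITERAL("Hello, World\n");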
</pre><h2>Future plans</h2><p>The next step is to convert all uses of QLatin1String with a string literal to QStringLiteral. I have such a commit in my <a
href="https://qt.gitorious.org/qt/thiago-personals-qtbase">repository&#8217;s</a> <a
href="https://qt.gitorious.org/~thiago-personal/qt/thiago-personals-qtbase/commits/master">master branch</a> (warning: I rebase it often), proving that QtCore compiles just fine.</p><p>After that, I&#8217;d like to go back to my ideal solution using C++0x User-Defined Literals. However, it&#8217;s also clear that, unlike the prototype I presented at the end of the blog in March, we&#8217;ll need to use the template version of the operator, as <a
href="http://en.wikipedia.org/wiki/C++0x#User-defined_literals">Wikipedia</a> shows it. It would probably look something like this:</p><pre class="brush: cpp; title: ; notranslate">
template&lt;char16_t... str&gt; inline QConstStringData&lt;sizeof(str)+1&gt; operator&quot;&quot; q()
{
    static const QStringData&lt;sizeof(str) + 1&gt; qstring_literal =
        { { Q_REFCOUNT_INITIALIZER(-1), sizeof(str), 0, 0, { 0 }, str } };
    QConstStringData&lt;sizeof(str) + 1&gt; holder = { &amp;qstring_literal };
    return holder;
}
</pre><p>Unfortunately, no compiler currently supports User-Defined Literals (the <a
href="http://gcc.gnu.org/projects/cxx0x.html">GCC C++0x page</a> says someone is working on it). That means we cannot even try out the code above to see if it works or could be improved. When it does, I&#8217;ll play with this again.</p><p>In the meantime, I&#8217;m interested in any feedback you may have.</p><p>Update: The C++ standard says that all user-defined literals that do not start with an underscore are reserved. So the operator above should be <tt>_q</tt>.</p>]]></content:encoded> <wfw:commentRss>http://www.macieira.org/blog/2011/07/qstring-improved/feed/</wfw:commentRss> <slash:comments>5</slash:comments> </item> </channel> </rss>