<?xml version="1.0" encoding="UTF-8"?> <rss
version="2.0"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:wfw="http://wellformedweb.org/CommentAPI/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:atom="http://www.w3.org/2005/Atom"
xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
> <channel><title>Thiago Macieira&#039;s blog</title> <atom:link href="http://www.macieira.org/blog/feed/" rel="self" type="application/rss+xml" /><link>http://www.macieira.org/blog</link> <description>An Open Source hacker&#039;s ramblings</description> <lastBuildDate>Fri, 18 May 2012 10:42:42 +0000</lastBuildDate> <language>en</language> <sy:updatePeriod>hourly</sy:updatePeriod> <sy:updateFrequency>1</sy:updateFrequency> <generator>http://wordpress.org/?v=3.3.1</generator> <item><title>I&#8217;m going to Akademy and the Qt Contributor Summit</title><link>http://www.macieira.org/blog/2012/05/im-going-to-akademy-and-the-qt-contributor-summit/</link> <comments>http://www.macieira.org/blog/2012/05/im-going-to-akademy-and-the-qt-contributor-summit/#comments</comments> <pubDate>Fri, 18 May 2012 10:42:42 +0000</pubDate> <dc:creator>Thiago Macieira</dc:creator> <category><![CDATA[KDE]]></category> <category><![CDATA[Qt]]></category> <category><![CDATA[akademy]]></category> <category><![CDATA[berlin]]></category> <category><![CDATA[conferences]]></category> <category><![CDATA[qt]]></category> <category><![CDATA[qtcs]]></category> <category><![CDATA[tallinn]]></category> <guid
isPermaLink="false">http://www.macieira.org/blog/?p=379</guid> <description><![CDATA[Just a quick post so I can say I&#8217;m going to both events: Akademy 2012 and the Qt Contributors Summit 2012. I hope to see many of you there, and we have a lot to discuss and work on.]]></description> <content:encoded><![CDATA[<p>Just a quick post so I can say I&#8217;m going to both events: <a
href="http://akademy2012.kde.org">Akademy 2012</a> and the <a
href="http://qt-project.org/groups/qt-contributors-summit-2012/wiki">Qt Contributors Summit 2012</a>. I hope to see many of you there, and we have a lot to discuss and work on.</p><p><img
src="http://community.kde.org/images.community/0/03/Ak2012_imgoing2.png" alt="'I'm going to Akademy 2012'" /> <img
src="http://i.imgur.com/LYiEH.png" alt="" /></p> ]]></content:encoded> <wfw:commentRss>http://www.macieira.org/blog/2012/05/im-going-to-akademy-and-the-qt-contributor-summit/feed/</wfw:commentRss> <slash:comments>0</slash:comments> </item> <item><title>Source code must be UTF-8 and QString wants it</title><link>http://www.macieira.org/blog/2012/05/source-code-must-be-utf-8-and-qstring-wants-it/</link> <comments>http://www.macieira.org/blog/2012/05/source-code-must-be-utf-8-and-qstring-wants-it/#comments</comments> <pubDate>Fri, 11 May 2012 01:19:56 +0000</pubDate> <dc:creator>Thiago Macieira</dc:creator> <category><![CDATA[C++]]></category> <category><![CDATA[Qt]]></category> <category><![CDATA[qt]]></category> <category><![CDATA[qt5]]></category> <category><![CDATA[unicode]]></category> <category><![CDATA[utf-8]]></category> <guid
isPermaLink="false">http://www.macieira.org/blog/?p=373</guid> <description><![CDATA[I&#8217;ve talked about source code encoding in the past, arguing that the C++ language lacks a fundamental setting. However, since this Monday, Qt 5 now starts to enforce that source code must be UTF-8. In a way. The commit that landed on the qtbase repository finally changed the codec used by QString&#8217;s 8-bit methods to &#8230;</p><p><a
class="more-link block-button" href="http://www.macieira.org/blog/2012/05/source-code-must-be-utf-8-and-qstring-wants-it/">Continue reading &#187;</a>]]></description> <content:encoded><![CDATA[<p>I&#8217;ve talked about source code encoding in <a
href="http://labs.qt.nokia.com/2011/03/26/on-utf-8-latin-1-and-charsets/">the past</a>, arguing that the C++ language lacks a fundamental setting. However, since <a
href="http://qt.gitorious.org/qt/qtbase/commit/592fe0a02609503670cc9238d1a4ad29e4e65185">this Monday</a>, Qt 5 now starts to enforce that source code must be UTF-8. In a way.</p><p>The commit that landed on the <a
href="http://qt.gitorious.org/qt/qtbase">qtbase repository</a> finally changed the codec used by QString&#8217;s 8-bit methods to be UTF-8. That concludes a long series of changes that we had planned for Qt 5, that started with Robin Burchell&#8217;s work on removing the <a
href="http://qt-project.org/doc/qt-4.8/qtextcodec.html#setCodecForCStrings"><tt>QTextCodec::setCodecForCStrings</tt></a> function. But to be clear: QString still stores data internally as UTF-16 and that won&#8217;t change.</p><p>To understand what the change is, we need to go back a little in history. Four years ago, I <a
href="http://labs.qt.nokia.com/2008/04/28/string-theory/">wrote a blog called &#8220;String Theory&#8221;</a> that presented QString&#8217;s history and I said:</p><blockquote><p>what encoding is your file? Even today, with the widespread use of UTF-8, we can’t rely on that fact (text editors in Windows being the worst example).</p></blockquote><p>In 2008, we were still struggling with UTF-8 encoding in source code, and we definitely were in 2003 when <a
href="http://doc.trolltech.com/3.3/qtextcodec.html#setCodecForCStrings"><tt>QTextCodec::setCodecForCStrings</tt></a> came about in Qt 3. The reason is that, back then, text editors usually saved code only in the operating system&#8217;s locale encoding and very seldom supported writing anything else. Unicode wasn&#8217;t widespread enough, so people ended up with a variety of different encodings. That wasn&#8217;t a problem, provided that the data exchange only happened with people who used the same encoding &#8212; usually people in the same country, using the same operating system.</p><p>Times have changed. The protocols from the late 90s that did not possess an encoding marker quickly became obsolete or gained such a tag (I remember when the Kopete developers were struggling to decode ICQ messages properly, and Russian users often ended up with <a
href="http://en.wikipedia.org/wiki/Mojibake">mojibake</a>). Protocols designed in the 2000s all had such a tag, and soon began to standardise on one of the Unicode transforms.</p><p>Last year, when <a
href="http://labs.qt.nokia.com/2011/03/25/qstrings-and-unicode-optimising-qstringfromutf8/">revisiting the subject</a>, I wrote:</p><blockquote><p>this is 2011, why are we still restricting ourselves to ASCII? I mean, even if you’re just writing your non-translated messages in English, you sometimes need some non-ASCII codepoints, like the &#8220;micro&#8221; sign (µ), the degree sign (°), the copyright sign (©) or even the Euro currency sign (€). [...] Besides, this is 2011, the de-facto encoding for text interchange is UTF-8.</p></blockquote><p>The next line of the blog was the decision: we would change the default codec of QString&#8217;s 8-bit functions from Latin 1 to UTF-8 in Qt 5 (note that we hadn&#8217;t yet started thinking of Qt 5 until about 15 days later). That&#8217;s what the commit I made this Monday finally accomplishes.</p><p>What does this mean to you? Well, the first thing is that it depends on whether you use these methods or not. If you compile your source code with the <a
href="http://qt-project.org/doc/qt-4.8/qstring.html#QT_NO_CAST_FROM_ASCII"><tt>QT_NO_CAST_FROM_ASCII</tt></a> and <a
href="http://qt-project.org/doc/qt-4.8/qstring.html#QT_NO_CAST_TO_ASCII"><tt>QT_NO_CAST_TO_ASCII</tt></a> macros, you will feel absolutely no difference. And I really mean none, zero, zilch: if you use those macros, you&#8217;ve disabled all of the functions affected by my change.</p><p>If you do use the functions that are disabled by those macros, then the question is what encoding is used in those strings. My assumption in 2008 is still valid today: most of the strings found in source code are 7-bit, US-ASCII, English text. The 7-bit text will not be affected at all: it will get converted to QString&#8217;s UTF-16 internal encoding just like it used to. There might be a slight performance impact, but I do plan on optimising the UTF-8 decoder like I <a
href="http://labs.qt.nokia.com/2011/03/25/qstrings-and-unicode-optimising-qstringfromutf8/">said last year</a>. However, if you can, I recommend wrapping such strings with <a
href="http://qt-project.org/doc/qt-4.8/qlatin1string.html">QLatin1String</a>, especially if you&#8217;re using them with a QString function that has a QLatin1String overload.</p><p>On the other hand, if you do have text with the high bit set in the QString 8-bit functions, you might need to change your code. You&#8217;ll either have to recode your source code to UTF-8, or you will need to wrap those strings with a suitable <a
href="http://qt-project.org/doc/qt-4.8/qlatin1string.html">QLatin1String</a> or <a
href="http://qt-project.org/doc/qt-4.8/qtextcodec.html#toUnicode-4"><tt>QTextCodec::toUnicode</tt></a> call. I highly recommend choosing the former option: use UTF-8 in your source code. You&#8217;ll also gain the ability to use QStringLiteral properly, which requires UTF-8 source code anyway.</p><p><em>[As an interesting twist of history, the seed that became QStringLiteral was in the <a
href="http://labs.qt.nokia.com/2011/03/26/on-utf-8-latin-1-and-charsets/">second</a> of my encoding blogs last year, after the part I quoted above asking for the change to UTF-8, but it landed in Qt 5 before the change of this Monday.]</em></p><p>For Qt&#8217;s own source code, we have decreed that the source should be UTF-8 only, and so I proceeded a few weeks ago to find and recode all non-UTF-8 sources. And I&#8217;m going even further than that: if you don&#8217;t use UTF-8 for <strong>your</strong> source code, you&#8217;ll be on your own. Though it&#8217;s possible to make it work, do not ask us for help and do not expect us to add convenience functions. I am also discarding any arguments of the form &#8220;my editor/IDE/OS/environment does not support UTF-8&#8221;. This is 2012 and we live in a global world, with global data. Any such editor or environment should be left where it belongs: in a museum dedicated to the 80s and 90s.</p><p>Long live Unicode!</p> ]]></content:encoded> <wfw:commentRss>http://www.macieira.org/blog/2012/05/source-code-must-be-utf-8-and-qstring-wants-it/feed/</wfw:commentRss> <slash:comments>8</slash:comments> </item> <item><title>Quick update to the Qt Project statistics</title><link>http://www.macieira.org/blog/2012/04/quick-update-to-the-qt-project-statistics/</link> <comments>http://www.macieira.org/blog/2012/04/quick-update-to-the-qt-project-statistics/#comments</comments> <pubDate>Mon, 30 Apr 2012 10:33:11 +0000</pubDate> <dc:creator>Thiago Macieira</dc:creator> <category><![CDATA[Qt]]></category> <category><![CDATA[qt]]></category> <category><![CDATA[qt-project]]></category> <category><![CDATA[statistics]]></category> <guid
isPermaLink="false">http://www.macieira.org/blog/?p=368</guid> <description><![CDATA[Update on last Friday&#8217;s post on the Qt Project&#8217;s statistics: the script ran again this morning, so we now have data for last week. The Qt Project Statistics Page now includes the number of contributors per week: Visit the statistics page for more graphs.]]></description> <content:encoded><![CDATA[<p>Update on last Friday&#8217;s post on <a
href="/blog/2012/04/qt-project-statistics/">the Qt Project&#8217;s statistics</a>: the script ran again this morning, so we now have data for last week. The <a
href="/blog/qt-stats/">Qt Project Statistics Page</a> now includes the number of contributors per week:</p><p
align="center"><a
href="/~thiago/qt-stats/current/qt-all-full.author.unique.png"><img
src="/~thiago/qt-stats/current/thumb/qt-all-full.author.unique.png" alt="" /></a></p><p>Visit the <a
href="/blog/qt-stats/">statistics page</a> for more graphs.</p> ]]></content:encoded> <wfw:commentRss>http://www.macieira.org/blog/2012/04/quick-update-to-the-qt-project-statistics/feed/</wfw:commentRss> <slash:comments>1</slash:comments> </item> <item><title>Qt Project Statistics</title><link>http://www.macieira.org/blog/2012/04/qt-project-statistics/</link> <comments>http://www.macieira.org/blog/2012/04/qt-project-statistics/#comments</comments> <pubDate>Fri, 27 Apr 2012 20:31:06 +0000</pubDate> <dc:creator>Thiago Macieira</dc:creator> <category><![CDATA[Qt]]></category> <category><![CDATA[community]]></category> <category><![CDATA[qt]]></category> <category><![CDATA[qt-project]]></category> <guid
isPermaLink="false">http://www.macieira.org/blog/?p=359</guid> <description><![CDATA[For about a month, I&#8217;ve been improving a set of scripts to calculate statistics on the Qt Project. What I wanted to know, at first, was how well I was doing, how much I was contributing. Another question I had in mind, and I know many others did too, was &#8220;how much is the Qt &#8230;</p><p><a
class="more-link block-button" href="http://www.macieira.org/blog/2012/04/qt-project-statistics/">Continue reading &#187;</a>]]></description> <content:encoded><![CDATA[<p>For about a month, I&#8217;ve been improving a set of scripts to calculate statistics on the Qt Project. What I wanted to know, at first, was how well I was doing, how much I was contributing. Another question I had in mind, and I know many others did too, was &#8220;how much is the Qt Project dependent on Nokia?&#8221;</p><p>First it started with a simple &#8220;|wc -l&#8221; depending on whose statistics I wanted to get. This week, I decided to make graphs, so I spent a great deal of time learning <a
href="http://www.gnuplot.info/">gnuplot</a> instead of doing other work. I&#8217;ll write about the script itself in my next post.</p><p>The statistics are online now. You can see them at <a
href="/blog/qt-stats">http://macieira.org/blog/qt-stats</a>. And come back every week, as the page will update itself every Sunday to Monday evening.</p><p>Let me just point out the overall graph:</p><p
align="center"><a
href="/~thiago/qt-stats/2012-04-27/qt-all-full.author.total.png"><img
src="/~thiago/qt-stats/2012-04-27/thumb/qt-all-full.author.total.png" alt="" /></a></p><p>As you can see from the graph, the commit rate for the <a
href="http://qt-project.org">Qt Project</a> was at its lowest during two holiday periods: New Year&#8217;s (week 52 of last year and week 1 of this year) and Easter (week 14). Aside from the first week of the project&#8217;s existence, it&#8217;s consistently been over 400 commits a week, and over 600 commits for 6 of the past 8 weeks. That&#8217;s impressive!</p><p>And to answer the question of how much the project depends on Nokia, take a look at this other one:</p><p
align="center"><a
href="/~thiago/qt-stats/2012-04-27/qt-all-full.employer.relative.png"><img
src="/~thiago/qt-stats/2012-04-27/thumb/qt-all-full.employer.relative.png" alt="" /></a></p><p>You can see that the participation from Nokia developers is still quite high (and will probably remain so), at around 80%. But in turn that means around 20% of the commits going to the Qt Project come from other people, employed by other companies or working in their free time, and this less than 6 months after the official launch of the Qt Project.</p><p>More than that, note the trend: Nokia&#8217;s participation tends to diminish, not because they&#8217;re doing less, but because others are doing more. The following graph, with Nokia&#8217;s numbers removed, shows the trend in participation from others:</p><p
align="center"><a
href="/~thiago/qt-stats/2012-04-27/qt-all-full-no-nokia.employer.absolute.png"><img
src="/~thiago/qt-stats/2012-04-27/thumb/qt-all-full-no-nokia.employer.absolute.png" alt="" /></a></p> ]]></content:encoded> <wfw:commentRss>http://www.macieira.org/blog/2012/04/qt-project-statistics/feed/</wfw:commentRss> <slash:comments>11</slash:comments> </item> <item><title>Qt 5 alpha released</title><link>http://www.macieira.org/blog/2012/04/qt-5-alpha-released/</link> <comments>http://www.macieira.org/blog/2012/04/qt-5-alpha-released/#comments</comments> <pubDate>Tue, 03 Apr 2012 14:45:53 +0000</pubDate> <dc:creator>Thiago Macieira</dc:creator> <category><![CDATA[Qt]]></category> <category><![CDATA[qt]]></category> <category><![CDATA[qt-project]]></category> <category><![CDATA[qt5]]></category> <category><![CDATA[releases]]></category> <guid
isPermaLink="false">http://www.macieira.org/blog/?p=344</guid> <description><![CDATA[Lars writes to let us know that the first (and hopefully only) Qt 5 alpha has been released! It&#8217;s the first release of a new major series in 7 years, and the first major release of the Qt Project (though not the first release of the project, since we released 4.8.1 just a few weeks ago). &#8230;</p><p><a
class="more-link block-button" href="http://www.macieira.org/blog/2012/04/qt-5-alpha-released/">Continue reading &#187;</a>]]></description> <content:encoded><![CDATA[<p>Lars <a
href="http://labs.qt.nokia.com/author/lars/">writes</a> to let us know that the first (and hopefully only) <a
href="http://labs.qt.nokia.com/2012/04/03/qt-5-alpha/">Qt 5 alpha has been released</a>! It&#8217;s the first release of a new major series in 7 years, and the first major release of the <a
href="http://qt-project.org">Qt Project</a> (though not the first release of the project, since we released 4.8.1 just a few weeks ago).</p><p>I won&#8217;t copy what Lars said in his blog. Instead, here are some useful links:</p><ul><li>Download it from <a
href="http://qt-project.org/wiki/Qt-5-Alpha">http://qt-project.org/wiki/Qt-5-Alpha</a>;</li><li>Build instructions from <a
href="http://qt-project.org/wiki/Qt-5-Alpha-building-instructions">http://qt-project.org/wiki/Qt-5-Alpha-building-instructions</a></li></ul><p>Please note that the alpha release does not support <tt>make install</tt> yet. You really need to configure it with the <tt>-prefix</tt> option. We&#8217;ll work on an installable package and multiple tarballs for the beta.</p> ]]></content:encoded> <wfw:commentRss>http://www.macieira.org/blog/2012/04/qt-5-alpha-released/feed/</wfw:commentRss> <slash:comments>0</slash:comments> </item> <item><title>Restricting what you can do</title><link>http://www.macieira.org/blog/2012/03/restricting-what-you-can-do/</link> <comments>http://www.macieira.org/blog/2012/03/restricting-what-you-can-do/#comments</comments> <pubDate>Wed, 28 Mar 2012 15:30:43 +0000</pubDate> <dc:creator>Thiago Macieira</dc:creator> <category><![CDATA[Algorithms]]></category> <category><![CDATA[C++]]></category> <category><![CDATA[c99]]></category> <category><![CDATA[low-level]]></category> <category><![CDATA[optimisation]]></category> <guid
isPermaLink="false">http://www.macieira.org/blog/?p=335</guid> <description><![CDATA[I usually write about C++, since it&#8217;s the programming language that I use in my daily work. Today, however, I&#8217;m talking about its nearest cousin: C. Specifically, about a certain keyword introduced by the C99 standard, which was issued over 12 years ago. Usually, the C standard plays catch-up with the C++ standard (like &#8230;</p><p><a
class="more-link block-button" href="http://www.macieira.org/blog/2012/03/restricting-what-you-can-do/">Continue reading &#187;</a>]]></description> <content:encoded><![CDATA[<p>I usually write about C++, since it&#8217;s the programming language that I use in my daily work. Today, however, I&#8217;m talking about its nearest cousin: C. Specifically, about a certain keyword introduced by the <a
href="http://en.wikipedia.org/wiki/C99">C99 standard</a>, which was issued over 12 years ago. Usually, the C standard plays catch-up with the C++ standard (like the C11 standard bringing some C++11 features to C), but each new issue brings a few new things that C++ doesn&#8217;t have yet. This cross-pollination between the two standards committees is very welcome.</p><p>The one I&#8217;m thinking of today is one that, interestingly, has not been added to C++ yet, though many compilers support it. If you&#8217;ve paid attention to the blog title, you may realise I&#8217;m talking about the <tt>restrict</tt> keyword.</p><p>Raise your hand if you&#8217;ve seen it before. Now, only the people who have seen it outside of the C library headers on their systems. Not many, eh?</p><h2>What does <tt>restrict</tt> do?</h2><p>The keyword is defined in the <a
href="http://www.open-std.org/jtc1/sc22/WG14/www/docs/n1256.pdf">C99 (N1256)</a> and <a
href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf">C11 (N1570)</a> standards in sections 6.7.3 &#8220;Type qualifiers&#8221; and 6.7.3.1 &#8220;Formal definition of restrict&#8221;, which, as usual, are barely readable for us. The <a
href="http://en.wikipedia.org/wiki/Restrict">Wikipedia</a> definition is better:</p><blockquote><p>The restrict keyword is a declaration of intent given by the programmer to the compiler. It says that for the lifetime of the pointer, only it or a value directly derived from it (such as <tt>pointer + 1</tt>) will be used to access the object to which it points.</p></blockquote><p>Well, so what? Why do we need a keyword for that? Well, clearly it&#8217;s not just something that the programmer says &#8212; otherwise, we&#8217;d only write it in the documentation. The Wikipedia text continues by saying that &#8220;[t]his limits the effects of <a
href="http://en.wikipedia.org/wiki/Pointer_aliasing">pointer aliasing</a>&#8221;.</p><p>That should now tell you something. At least, it should bring back some memories of compiler warnings about &#8220;dereferencing type-punned pointer will break strict-aliasing rules&#8221;.</p><h2>The dreaded <em>strict aliasing</em> (or where it falls short)</h2><p>The C and C++ standards say that pointers to different types do not alias each other (with a few exceptions, such as the character types). That&#8217;s the strict aliasing rule, which you often break by dereferencing type-punned pointers. I&#8217;ve talked about this in the past, I think. In any case, what matters to us here is when the pointers <strong>are</strong> allowed to alias each other. Since the C99 standard couldn&#8217;t very well go and change a basic principle of the C90 standard, its authors instead created a keyword to allow the programmer to declare when aliasing will not happen.</p><p>The simplest example is the following pair of functions from the C library (copied verbatim from glibc&#8217;s <tt>string.h</tt> header):</p><pre class="brush: cpp; title: ; notranslate">
/* Copy N bytes of SRC to DEST.  */
extern void *memcpy (void *__restrict __dest,
		     __const void *__restrict __src, size_t __n)
     __THROW __nonnull ((1, 2));
/* Copy N bytes of SRC to DEST, guaranteeing
   correct behavior for overlapping strings.  */
extern void *memmove (void *__dest, __const void *__src, size_t __n)
     __THROW __nonnull ((1, 2));
</pre><p>Note the difference: <tt>memcpy</tt> uses the <tt>restrict</tt> keyword, whereas <tt>memmove</tt> does not but does say that it is correct for overlapping strings.</p><h2>Implementing <tt>memcpy</tt> and <tt>memmove</tt></h2><p>Let&#8217;s try and implement these two functions to see if we understand what the keywords mean. Let&#8217;s start with memcpy, which is very simple at first approach and you must have written its equivalent hundreds of times already:</p><pre class="brush: cpp; title: ; notranslate">
// C99 code
void *memcpy(void * restrict dest, const void * restrict src, size_t n)
{
    char *d = dest;
    const char *s = src;
    size_t i;
    for (i = 0; i != n; ++i)
        d[i] = s[i];
    return dest;
}
</pre><p>Having written that, we wonder: why do we need <tt>memmove</tt> at all? The comment in the header talks about &#8220;overlapping strings&#8221; and that&#8217;s where the code above has an issue. What if we tried to <tt>memcpy(ptr + 1, ptr, n)</tt>? In the first iteration of the loop above, the byte copied would overwrite the second byte to be read, or worse.</p><p>For that reason, the simplest <tt>memmove</tt> is usually implemented as:</p><pre class="brush: cpp; title: ; notranslate">
void *memmove(void *dest, const void *src, size_t n)
{
    char *d = dest;
    const char *s = src;
    size_t i;
    if (d &lt; s) {
        for (i = 0; i != n; ++i)
            d[i] = s[i];
    } else {
        i = n;
        while (i) {
            --i;
            d[i] = s[i];
        }
    }
    return dest;
}
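/* Sketch, not part of the original listing: overlap_demo is a hypothetical
   helper showing why the copy direction matters when dest is above src.
   It shifts "abcdef" up by one byte in place, once in each direction. */
int overlap_demo(void)
{
    char fwd[] = "abcdef";
    char bwd[] = "abcdef";
    int i;
    /* ascending copy: every source byte is overwritten before it is read */
    for (i = 0; i != 5; ++i)
        fwd[i + 1] = fwd[i];
    /* descending copy, as in the else branch above: each read happens first */
    for (i = 5; i != 0; --i)
        bwd[i] = bwd[i - 1];
    /* fwd now holds "aaaaaa" and bwd holds "aabcde" */
    return fwd[5] == 'a' ? bwd[5] == 'e' : 0;
}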
</pre><h2>Improving the code</h2><p>If we know that the two pointers do not alias each other, we can do some more interesting things to optimise the copying performance. The first thing we can try is to increase the stride. That is, copy more than one byte at a time, like so:</p><pre class="brush: cpp; title: ; notranslate">
// C99 code
void *memcpy(void * restrict dest, const void * restrict src, size_t n)
{
    int *di = dest;
    const int *si = src;
    char *d = dest;
    const char *s = src;
    size_t i;
 
    for (i = 0; i != n / sizeof(int); ++i)
        di[i] = si[i];
    i *= sizeof(int);
    for ( ; i != n; ++i)
        d[i] = s[i];
 
    return dest;
}
</pre><p>The above code first copies the data in <tt>int</tt>-size chunks, then copies the remaining bytes (up to <tt>sizeof(int) - 1</tt> of them) one byte at a time (epilog copy). It&#8217;s more efficient than the original code on architectures where unaligned loads and stores are efficient, or when we know both pointers to be aligned to the proper boundary. In those cases, since we have fewer iterations to execute, the copying is usually faster.</p><p>We can definitely improve this code further, by using for example 64-bit loads and stores on architectures that support them, applying this to all architectures by aligning the two pointers if possible in a prolog copy, unrolling the prolog and epilog, or using <a
href="http://en.wikipedia.org/wiki/SIMD">Single Instruction Multiple Data</a> instructions that the architecture may have.</p><p>Note that this is only possible because this is <tt>memcpy</tt>, not <tt>memmove</tt>. For the latter function, if we wanted to increase the stride, we would need to additionally check that the distance between the two pointers is at least the size of the chunk of data copied per iteration. Doing that is left as an exercise for the reader.</p><h2>I&#8217;m lazy</h2><p>Now, I said above that the only reason why there&#8217;s a language keyword in the first place is so that the compiler can optimise better. Well, that&#8217;s exactly what it does. Unfortunately, it&#8217;s not easy to prove this straight away with assembly code, as we&#8217;re depending on optimisations performed by the compiler, which change over time and are implemented differently in each one. For example, if I use the Intel Compiler on the original <tt>memcpy</tt> function, it will insert a call to <tt>_intel_fast_memcpy</tt> if the pointers aren&#8217;t suitably aligned or the copy size isn&#8217;t big enough. GCC, on the other hand, will insert a prolog to align one of the pointers.</p><p>What is interesting to note is that the presence of the <tt>restrict</tt> keyword, everything else being the same, does cause different code generation. With GCC, the output without the keyword contains a couple of instructions comparing the <tt>dest</tt> pointer to <tt>src + 16</tt> and only if the two pointers don&#8217;t overlap in the first 16 bytes will it execute SSE2 16-byte copies.
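</p><p>A minimal sketch for reproducing this comparison, assuming a C99 compiler: the two functions below are identical except for the qualifier, and their names are made up for this example. Compiling the file with something like <tt>gcc -std=c99 -O3 -S</tt> and comparing the two function bodies in the assembly output should show whether your compiler emits an overlap check for the first one; the exact instructions will vary with the compiler and its version.</p><pre class="brush: cpp; title: ; notranslate">
/* C99 code. Both function names are hypothetical, used only for this comparison. */
void copy_may_alias(char *dest, const char *src, unsigned long n)
{
    unsigned long i;
    /* dest and src may overlap, so the compiler has to allow for it */
    for (i = 0; i != n; ++i)
        dest[i] = src[i];
}
void copy_no_alias(char * restrict dest, const char * restrict src,
                   unsigned long n)
{
    unsigned long i;
    /* restrict promises that the two blocks do not overlap */
    for (i = 0; i != n; ++i)
        dest[i] = src[i];
}
</pre><p>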
ICC is even more extreme: without the keyword, the code generated for <tt>memcpy</tt> does only byte-sized copies.</p><p>In other words, the keyword is being used: when the compiler knows the two blocks don&#8217;t overlap, it can generate better code.</p> ]]></content:encoded> <wfw:commentRss>http://www.macieira.org/blog/2012/03/restricting-what-you-can-do/feed/</wfw:commentRss> <slash:comments>0</slash:comments> </item> <item><title>The value of passing by value</title><link>http://www.macieira.org/blog/2012/02/the-value-of-passing-by-value/</link> <comments>http://www.macieira.org/blog/2012/02/the-value-of-passing-by-value/#comments</comments> <pubDate>Wed, 22 Feb 2012 18:36:25 +0000</pubDate> <dc:creator>Thiago Macieira</dc:creator> <category><![CDATA[Assembly]]></category> <category><![CDATA[C++]]></category> <category><![CDATA[Linux]]></category> <category><![CDATA[Qt]]></category> <category><![CDATA[abi]]></category> <category><![CDATA[arm]]></category> <category><![CDATA[assembly]]></category> <category><![CDATA[ia64]]></category> <category><![CDATA[low-level]]></category> <category><![CDATA[mips]]></category> <category><![CDATA[optimisation]]></category> <category><![CDATA[qt]]></category> <category><![CDATA[qt5]]></category> <category><![CDATA[x86]]></category> <guid
isPermaLink="false">http://www.macieira.org/blog/?p=314</guid> <description><![CDATA[I&#8217;ve written in the past about how passing certain types by value in C++ would be more efficient than passing by constant reference. But it turns out that the ABI rules are somewhat more complex than what I said back in 2008. Time to investigate. This is also prompted by the discussion on qreal on &#8230;</p><p><a
class="more-link block-button" href="http://www.macieira.org/blog/2012/02/the-value-of-passing-by-value/">Continue reading &#187;</a>]]></description> <content:encoded><![CDATA[<p>I&#8217;ve <a
href="http://labs.qt.nokia.com/2008/04/28/string-theory/">written in the past</a> about how passing certain types by value in C++ would be more efficient than passing by constant reference. But it turns out that the <a
href="http://www.macieira.org/blog/2012/01/assembly-developers-library/">ABI</a> rules are somewhat more complex than what I said back in 2008. Time to investigate.</p><p>This is also prompted by the <a
href="http://groups.google.com/group/qt-project-list-development/browse_thread/thread/74f70fdcd0126359">discussion on <tt>qreal</tt></a> on the <a
href="http://lists.qt-project.org/mailman/listinfo/development">Qt development</a> mailing list. In trying to decide on the fate of <tt>qreal</tt>, we also ran into the discussion of the geometric classes (point, size, rectangle, polygon) and the algebraic classes (matrices, 2D and 3D vectors) and whether they should use single or double precision. I&#8217;m not going to go into the arguments discussed there; I&#8217;m merely focussing here on the ABI.</p><h2>Problem statement</h2><p>Before we go into the ABI documentation and try to compile code, we need to define what problem we&#8217;re trying to solve. In general terms, I&#8217;m trying to find the best way of passing small C++ structures: when is it better to pass by value, as opposed to by constant reference? And under those conditions, are there any important implications for the <tt>qreal</tt> discussion?</p><p>In the <a
href="http://labs.qt.nokia.com/2008/04/28/string-theory/"><em>String Theory</em></a> blog, I concluded that a small structure like <tt>QLatin1String</tt>, which contained exactly one pointer as a member, would benefit from passing by value. What other types of structures should we look at?</p><ul><li>Structures with more than one pointer</li><li>Structures with 32-bit integers on 64-bit architectures</li><li>Structures with floating-point members (single and double precision)</li><li>Mixed-type and specialised structures found in Qt</li></ul><p>I&#8217;ll investigate the x86-64, ARMv7 hard-float, MIPS hard-float (o32) and IA-64 ABIs because they are the ones for which I have access to compilers. All of them support passing parameters in registers and have at least 4 integer registers used in parameter passing. Except for MIPS, all of them also have at least 4 floating-point registers used in parameter passing. See my earlier <a
href="http://www.macieira.org/blog/2012/01/architectures-and-abis-detailed/">ABI detail</a> blog for more information.</p><p>So we will investigate what happens when you pass by value the following structures:</p><pre class="brush: cpp; title: ; notranslate">
struct Pointers2
{
    void *p1, *p2;
};
struct Pointers4
{
    void *p1, *p2, *p3, *p4;
};
struct Integers2 // like QSize and QPoint
{
    int i1, i2;
};
struct Integers4 // like QRect
{
    int i1, i2, i3, i4;
};
template &lt;typename F&gt; struct Floats2 // like QSizeF, QPointF, QVector2D
{
    F f1, f2;
};
template &lt;typename F&gt; struct Floats3 // like QVector3D
{
    F f1, f2, f3;
};
template &lt;typename F&gt; struct Floats4 // like QRectF, QVector4D
{
    F f1, f2, f3, f4;
};
template &lt;typename F&gt; struct Matrix4x4 // like QGenericMatrix&lt;4, 4&gt;
{
    F m[4][4];
};
struct QChar
{
    unsigned short ucs;
};
struct QLatin1String
{
    const char *str;
    int len;
};
template &lt;typename F&gt; struct QMatrix
{
    F _m11, _m12, _m21, _m22, _dx, _dy;
};
template &lt;typename F&gt; struct QMatrix4x4 // like QMatrix4x4
{
    F m[4][4];
    int f;
};
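// Sketch, not part of the original test: compile-time size checks.
// Assumes a C++11 compiler (static_assert) and an ABI that adds no padding
// to structures whose members all have the same scalar type.
static_assert(sizeof(Pointers2) == 2 * sizeof(void *), "Pointers2 has no padding");
static_assert(sizeof(Integers4) == 4 * sizeof(int), "Integers4 has no padding");
static_assert(sizeof(Floats4&lt;float&gt;) == 4 * sizeof(float), "Floats4&lt;float&gt; has no padding");
static_assert(sizeof(Floats3&lt;double&gt;) == 3 * sizeof(double), "Floats3&lt;double&gt; has no padding");
// Mixed-type structures may carry tail padding (on x86-64, QLatin1String is 8 + 4 + 4 bytes).
static_assert(sizeof(QLatin1String) &gt;= sizeof(const char *) + sizeof(int), "QLatin1String holds both members");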
</pre><p>And we&#8217;ll analyse the assembly of the following program:</p><pre class="brush: cpp; title: ; notranslate">
template &lt;typename T&gt; void externalFunction(T);
template &lt;typename T&gt; void passOne()
{
    externalFunction(T());
}
template &lt;typename T&gt; T externalReturningFunction();
template &lt;typename T&gt; void returnOne()
{
    externalReturningFunction&lt;T&gt;();
}
// explicit template instantiation
template void passOne&lt;Pointers2&gt;();
template void passOne&lt;Pointers4&gt;();
template void passOne&lt;Integers2&gt;();
template void passOne&lt;Integers4&gt;();
template void passOne&lt;Floats2&lt;float&gt; &gt;();
template void passOne&lt;Floats2&lt;double&gt; &gt;();
template void passOne&lt;Floats3&lt;float&gt; &gt;();
template void passOne&lt;Floats3&lt;double&gt; &gt;();
template void passOne&lt;Floats4&lt;float&gt; &gt;();
template void passOne&lt;Floats4&lt;double&gt; &gt;();
template void passOne&lt;Matrix4x4&lt;float&gt; &gt;();
template void passOne&lt;Matrix4x4&lt;double&gt; &gt;();
template void passOne&lt;QChar&gt;();
template void passOne&lt;QLatin1String&gt;();
template void passOne&lt;QMatrix&lt;float&gt; &gt;();
template void passOne&lt;QMatrix&lt;double&gt; &gt;();
template void passOne&lt;QMatrix4x4&lt;float&gt; &gt;();
template void passOne&lt;QMatrix4x4&lt;double&gt; &gt;();
template void returnOne&lt;Pointers2&gt;();
template void returnOne&lt;Pointers4&gt;();
template void returnOne&lt;Integers2&gt;();
template void returnOne&lt;Integers4&gt;();
template void returnOne&lt;Floats2&lt;float&gt; &gt;();
template void returnOne&lt;Floats2&lt;double&gt; &gt;();
template void returnOne&lt;Floats3&lt;float&gt; &gt;();
template void returnOne&lt;Floats3&lt;double&gt; &gt;();
template void returnOne&lt;Floats4&lt;float&gt; &gt;();
template void returnOne&lt;Floats4&lt;double&gt; &gt;();
template void returnOne&lt;Matrix4x4&lt;float&gt; &gt;();
template void returnOne&lt;Matrix4x4&lt;double&gt; &gt;();
template void returnOne&lt;QChar&gt;();
template void returnOne&lt;QLatin1String&gt;();
template void returnOne&lt;QMatrix&lt;float&gt; &gt;();
template void returnOne&lt;QMatrix&lt;double&gt; &gt;();
template void returnOne&lt;QMatrix4x4&lt;float&gt; &gt;();
template void returnOne&lt;QMatrix4x4&lt;double&gt; &gt;();
</pre><p>In addition, we&#8217;re interested in what happens to non-structure floating point parameters: are they promoted or not? So we&#8217;ll also test the following:</p><pre class="brush: cpp; title: ; notranslate">
void passFloat()
{
    void externalFloat(float, float, float, float);
    externalFloat(1.0f, 2.0f, 3.0f, 4.0f);
}
void passDouble()
{
    void externalDouble(double, double, double, double);
    externalDouble(1.0, 2.0, 3.0, 4.0);
}
float returnFloat()
{
    return 1.0f;
}
double returnDouble()
{
    return 1.0;
}
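// Hypothetical addition, not in the original test: a variadic callee.
// The language itself promotes float arguments passed through the
// ellipsis to double (default argument promotions), whatever the ABI,
// so this gives a reference point for the prototyped calls above.
void passVariadic()
{
    void externalVariadic(int, ...);
    externalVariadic(4, 1.0f, 2.0f, 3.0f, 4.0f);
}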
</pre><h2>Analysis of the output</h2><h3>x86-64</h3><p>You might have noticed I skipped old-style 32-bit x86. That was intentional, since that platform does not support passing in registers anyway. The only conclusions we could draw from it would be:</p><ul><li>whether the structures are stored on the stack in the place of the argument, or whether they&#8217;re stored elsewhere and passed by pointer;</li><li>whether single-precision floating-point is promoted to double-precision.</li></ul><p>Moreover, I&#8217;m intentionally ignoring it because I want people to start thinking of the new ILP32 ABI for x86-64, enabled by GCC 4.7&#8217;s <tt>-mx32</tt> switch, which follows the same ABI as the one described below (with the exception that pointers are 32-bit).</p><p>So let&#8217;s take a look at the assembly results. For parameter passing, we find that:</p><ul><li><tt>Pointers2</tt> is passed in registers;</li><li><tt>Pointers4</tt> is passed in memory;</li><li><tt>Integers2</tt> is passed in a single register (two 32-bit values per 64-bit register);</li><li><tt>Integers4</tt> is passed in two registers only (two 32-bit values per 64-bit register);</li><li><tt>Floats2&lt;float&gt;</tt> is passed packed into a single SSE register, no promotion to double;</li><li><tt>Floats3&lt;float&gt;</tt> is passed packed into two SSE registers, no promotion to double;</li><li><tt>Floats4&lt;float&gt;</tt> is passed packed into two SSE registers, no promotion to double;</li><li><tt>Floats2&lt;double&gt;</tt> is passed in two SSE registers, one value per register;</li><li><tt>Floats3&lt;double&gt;</tt> and <tt>Floats4&lt;double&gt;</tt> are passed in memory;</li><li><tt>Matrix4x4</tt> and <tt>QMatrix4x4</tt> are passed in memory regardless of the underlying type;</li><li><tt>QChar</tt> is passed in a register;</li><li><tt>QLatin1String</tt> is passed in registers;</li><li>The floating point parameters are passed one per register, without float promotion to 
double.</li></ul><p>For return values, the conclusion is the same as above: if the value is passed in registers, it's returned in registers too; if it's passed in memory, it's returned in memory. This leads us to the following conclusions, supported by careful reading of the ABI document:</p><ul><li>Single-precision floating-point types are not promoted to double;</li><li>Single-precision floating-point types in a structure are packed into SSE registers if they are still available;</li><li>Structures bigger than 16 bytes are passed in memory, with an exception for <tt>__m256</tt>, the type corresponding to one AVX 256-bit register.</li></ul><h3>IA-64</h3><p>Here are the results for parameter passing:</p><ul><li>Both <tt>Pointers</tt> structures are passed in registers, one pointer per register;</li><li>Both <tt>Integers</tt> structures are passed in registers, packed like x86-64 (two ints per register);</li><li>All of the <tt>Floats</tt> structures are passed in registers, one value per register (unpacked);</li><li><tt>Matrix4x4&lt;float&gt;</tt> is passed entirely in registers: half of it (the first 8 floats) is in floating-point registers, one value per register (unpacked); the other half is passed in integer registers <tt>out4</tt> to <tt>out7</tt> in its memory representation (packed);</li><li><tt>Matrix4x4&lt;double&gt;</tt> is passed partly in registers: half of it (the first 8 doubles) is in floating-point registers, one value per register (unpacked); the other half is passed in memory;</li><li><tt>QChar</tt> and <tt>QLatin1String</tt> are passed in registers;</li><li>Both <tt>QMatrix</tt> types are passed entirely in registers, one value per register (unpacked);</li><li><tt>QMatrix4x4</tt> is passed like <tt>Matrix4x4</tt>, except that the integer is always in memory (the structure is larger than 8*8 bytes);</li><li>Individual floating-point parameters are passed one per register; type promotion happens internally in the register.</li></ul><p>For the return 
values, we have:</p><ul><li>The floating-point structures with up to 8 floating-point members are returned in registers;</li><li>The integer structures of up to 32 bytes are returned in registers;</li><li>All the rest is returned in memory supplied by the caller.</li></ul><p>The conclusions are:</p><ul><li>Type promotion happens in hardware, as IA-64 does not have specific registers for single or double precision (its FP registers hold only extended-precision data);</li><li>Homogeneous structures of floating-point types are passed in registers, up to 8 values; the rest goes to the integer registers if some are still available, or to memory otherwise;</li><li>All other structures are passed in the integer registers, up to 64 bytes;</li><li>Integer registers are allocated for passing any and all types, even if they aren't used (the ABI says they should be used in the case of C without prototypes).</li></ul><h3>ARM</h3><p>I've compiled the code only for ARMv7, with the floating-point parameters passed in the VFP registers. If you're reading this blog, you're probably interested in performance and therefore you must be using the "hard-float" model for ARM. I will not concern myself with the slower "soft-float" mode. 
Also note that this is ARMv7 only: the ARMv8 64-bit (AArch64) rules differ slightly, but no compiler for it is available.</p><p>Here are the results for parameter passing:</p><ul><li><tt>Pointers2</tt>, <tt>Pointers4</tt>, <tt>Integers2</tt>, and <tt>Integers4</tt> are passed in registers (note that the Pointers and Integers structures are the same in 32-bit mode);</li><li>All of the <tt>Floats</tt> types are passed in registers, one value per register, without promotion of floats to doubles; the values are <strong>also</strong> stored in memory but I can't tell if this is required or just GCC being dumb;</li><li>All types of <tt>Matrix4x4</tt>, <tt>QMatrix</tt> and <tt>QMatrix4x4</tt> are passed in both memory and registers, with the registers holding the first 16 bytes;</li><li><tt>QChar</tt> and <tt>QLatin1String</tt> are passed in registers;</li><li> are passed in memory regardless of the underlying type.</li><li>The floating point parameters are passed one per register, without float promotion to double.</li></ul><p>For returning those types, we have:</p><ul><li>All of the <tt>Floats</tt> types are returned in registers and GCC then stores them all to memory even if they are never used afterwards;</li><li><tt>QChar</tt> is returned in a register;</li><li>Everything else is returned in memory.</li></ul><p>Note that returning values is one of the places where the 32-bit AAPCS differs from the 64-bit one: there, if a type is passed in registers to a function where it is the first parameter, it is returned in those same registers. 
The 32-bit AAPCS restricts the return-in-registers to structures of 4 bytes or less.</p><p>My conclusions are:</p><ul><li>Single-precision floating-point types are not promoted to double;</li><li>Homogeneous structures (that is, structures containing one single type) of a floating-point type are passed in floating-point registers if the structure has 4 members or fewer.</li></ul><h3>MIPS</h3><p>I have attempted both a MIPS 32-bit build (using the GCC-default o32 ABI) and a MIPS 64-bit build (using <tt>-mabi=o64 -mlong64</tt>). Unless noted otherwise, the results are the same for both architectures.</p><p>For passing parameters, they were:</p><ul><li>Both types of <tt>Integers</tt> and <tt>Pointers</tt> structures are passed in registers; on 64-bit, two 32-bit integers are packed into a single 64-bit register like x86-64;</li><li><tt>Floats2&lt;float&gt;</tt>, <tt>Floats3&lt;float&gt;</tt>, and <tt>Floats4&lt;float&gt;</tt> are passed in <strong>integer</strong> registers, not in the floating-point registers; on 64-bit, two <tt>float</tt> values are packed into a single 64-bit register;</li><li><tt>Floats2&lt;double&gt;</tt> is passed in integer registers; on 32-bit, two 32-bit registers are required to store each <tt>double</tt>;</li><li>On 32-bit, the first two doubles of <tt>Floats3&lt;double&gt;</tt> and <tt>Floats4&lt;double&gt;</tt> are passed in integer registers, the rest are passed in memory;</li><li>On 64-bit, <tt>Floats3&lt;double&gt;</tt> and <tt>Floats4&lt;double&gt;</tt> are passed entirely in integer registers;</li><li><tt>Matrix4x4</tt>, <tt>QMatrix</tt>, and <tt>QMatrix4x4</tt> are passed in integer registers (the portion that fits) and in memory (the rest);</li><li><tt>QChar</tt> is passed in a register (on MIPS big-endian, it's passed in bits 16-31);</li><li><tt>QLatin1String</tt> is passed in two registers;</li><li>The floating point parameters are passed one per register, without float promotion to double.</li></ul><p>For the return values, MIPS is easy: everything 
is returned in memory, even QChar.</p><p>The conclusions are even easier:</p><ul><li>No float is promoted to double;</li><li>No structure is ever passed in floating-point registers;</li><li>No structure is ever returned in registers.</li></ul><h2>General conclusion</h2><p>There are only a few aggregate conclusions that we can draw. One of them is that single-precision floating point values are not explicitly promoted to double when formal parameters are present. The automatic promotion probably happens only for floating-point values passed through an ellipsis (...), but our problem statement was about calling functions where the parameters are known. The only slight deviation from the rule is IA-64, but it's unimportant as the hardware, like x87, only operates in one mode.</p><p>For the structures containing integer parameters (that includes pointers), there's nothing further to optimise: they are loaded into registers exactly as they appear in memory. That means the portion of the register corresponding to padding might contain uninitialised or garbage data, or it might do something really strange like MIPS in big-endian mode. It also means, on all architectures, that types smaller than a register do not occupy the entire register, so they might be packed with other members.</p><p>Another conclusion is quite obvious: structures containing floats are smaller than structures containing doubles, so they will use less memory or fewer registers to be passed.</p><p>To continue drawing conclusions, we need to exclude MIPS since it passes everything in the integer registers and returns everything in memory. If we do that, we are able to see that all ABIs provide an optimisation for structures containing only one floating-point type. Those are called by slightly different names in the ABI documents, all meaning homogeneous floating-point structure. 
Those optimisations mean that the structure is passed in floating-point registers under certain conditions.</p><p>The first one to break down is actually x86-64: the upper limit is 16 bytes, limited to two SSE registers. The rationale for this seems to be passing one double-precision complex value, which takes 16 bytes. That we are able to pass four single-precision values is an unexpected benefit.</p><p>The remaining architectures (ARM and IA-64) can pass more values in registers, and always at one value per register (no packing). IA-64 has more registers dedicated to parameter passing, so it can pass more than ARM.</p><h2>Recommendations for code</h2><ul><li>Structures of up to 16 bytes containing integers and pointers should be passed by value;</li><li>Homogeneous structures of up to 16 bytes containing floating-point values should be passed by value (2 doubles or 4 floats);</li><li>Mixed-type structures should be avoided; if they exist, passing by value is still a good idea.</li></ul><p>The above is only valid for structures that are trivially copyable and trivially destructible. All C structures (POD in C++) meet those criteria.</p><h2>Final note</h2><p>I should note that the recommendations above do not <strong>always</strong> produce more efficient code. Even though the values can be passed in registers, every single compiler I tested (GCC 4.6, Clang 3.0, ICC 12.1) still does a lot of memory operations in some cases. It's quite common for the compiler to write the structure to memory and then load it into the registers. When it does that, passing by constant reference would be more efficient since it would replace the memory loads with arithmetic on the stack pointer.</p><p>However, those are simply a matter of further optimisation work by the compiler teams. The three compilers I tested for x86-64 optimise differently and, in almost all cases, at least one of them managed to do without memory access. 
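</p><p>To make the recommendations for code a little more concrete, here is a minimal sketch of a compile-time check. It is not part of Qt or of the test program: <tt>PassByValueCandidate</tt> and <tt>WithDestructor</tt> are hypothetical names, the 16-byte limit is simply the rule of thumb from the recommendations, and the code assumes a C++11 compiler whose standard library provides <tt>std::is_trivially_copyable</tt>.</p><pre class="brush: cpp; title: ; notranslate">
#include &lt;type_traits&gt;

// Hypothetical helper: true when T is trivially copyable, trivially
// destructible and no larger than 16 bytes.
template &lt;typename T&gt; struct PassByValueCandidate
{
    static const bool value =
        std::is_trivially_copyable&lt;T&gt;::value &amp;&amp;
        std::is_trivially_destructible&lt;T&gt;::value &amp;&amp;
        sizeof(T) &lt;= 16;
};

// Same layout as the test structures used earlier in this post.
template &lt;typename F&gt; struct Floats2 { F f1, f2; };
template &lt;typename F&gt; struct Floats4 { F f1, f2, f3, f4; };
// A user-provided destructor makes the type non-trivial.
struct WithDestructor { int i; ~WithDestructor() {} };

static_assert(PassByValueCandidate&lt;Floats2&lt;double&gt; &gt;::value, "16 bytes: candidate");
static_assert(PassByValueCandidate&lt;Floats4&lt;float&gt; &gt;::value, "16 bytes: candidate");
static_assert(!PassByValueCandidate&lt;Floats4&lt;double&gt; &gt;::value, "32 bytes: too large");
static_assert(!PassByValueCandidate&lt;WithDestructor&gt;::value, "not trivially destructible");
</pre><p>A true result only says that passing by value is worth trying; the generated assembly is what settles the question for a given compiler.</p><p>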
Interestingly, the compilers&#8217; behaviour also changes when we replace the padding space with zeroes.</p> ]]></content:encoded> <wfw:commentRss>http://www.macieira.org/blog/2012/02/the-value-of-passing-by-value/feed/</wfw:commentRss> <slash:comments>7</slash:comments> </item> <item><title>Update and benchmark on the dynamic library proposals</title><link>http://www.macieira.org/blog/2012/01/update-and-benchmark-on-the-dynamic-library-proposals/</link> <comments>http://www.macieira.org/blog/2012/01/update-and-benchmark-on-the-dynamic-library-proposals/#comments</comments> <pubDate>Thu, 19 Jan 2012 15:17:39 +0000</pubDate> <dc:creator>Thiago Macieira</dc:creator> <category><![CDATA[KDE]]></category> <category><![CDATA[Linux]]></category> <category><![CDATA[MeeGo]]></category> <category><![CDATA[Qt]]></category> <category><![CDATA[linux]]></category> <category><![CDATA[low-level]]></category> <category><![CDATA[optimisation]]></category> <guid
isPermaLink="false">http://www.macieira.org/blog/?p=302</guid> <description><![CDATA[My last blog on the dynamic libraries on Linux attracted over 15000 visits, which was quite unexpected (it&#8217;s 15x more than the usual traffic). It got linked from reddit and ycombinator and comments there and in the previous post have raised some interesting questions I&#8217;ll try to answer. LD_PRELOAD First, a quick background: LD_PRELOAD and &#8230;</p><p><a
class="more-link block-button" href="http://www.macieira.org/blog/2012/01/update-and-benchmark-on-the-dynamic-library-proposals/">Continue reading &#187;</a>]]></description> <content:encoded><![CDATA[<p>My last blog <a
href="https://www.macieira.org/blog/2012/01/sorry-state-of-dynamic-libraries-on-linux/">on the dynamic libraries on Linux</a> attracted over 15000 visits, which was quite unexpected (it&#8217;s 15x more than the usual traffic). It got linked from <a
href="http://www.reddit.com/r/programming/comments/ojm5z/sorry_state_of_dynamic_libraries_on_linux_thiago/">reddit</a> and <a
href="http://news.ycombinator.com/item?id=3472142">ycombinator</a> and comments there and in the previous post have raised some interesting questions I&#8217;ll try to answer.</p><h1>LD_PRELOAD</h1><p>First, a quick background: LD_PRELOAD and /etc/ld.so.preload tell the dynamic linker to load a certain ELF module before the rest of the normal initialisation sequence. It&#8217;s preloaded before the rest of the modules, but after two important modules have been loaded: the executable itself and the dynamic linker. By itself, it means nothing at all about symbol hijacking. Its sole purpose is to load something. I have, for example, used it for loading a different binary of a library that a program required. That works fine.</p><h2>Yes, it is little-known and little-used</h2><p>If you complained that I said it&#8217;s little-known, you&#8217;re somewhat biased. If you complained, it&#8217;s because you knew about it, therefore you&#8217;re part of the minority that knows about it. Just think about it: there are millions of people directly using Linux today in the world. How many do you think know about this feature?</p><p>Even more so, think about how often:</p><ul><li>LD_PRELOAD is used compared to running applications without it</li><li>LD_PRELOAD is used to load an ELF module compared to how many ELF modules are loaded by regular means</li><li>functions are interposed using LD_PRELOAD compared to how many aren&#8217;t</li></ul><p>The ratio is at least 1:1000 for a heavy user of the feature (like me!) in the best of circumstances. It&#8217;s probably several orders of magnitude more than that for the average user. Something that is used in one case in a million qualifies as little-used to me.</p><h2>No, I wasn&#8217;t proposing to get rid of it (not entirely)</h2><p>Some people suggested I was thinking of getting rid of the preloading feature in exchange for a few cycles saved. 
I would still be within my rights to suggest that, given the improvements and how often it is used, but I wasn&#8217;t. I&#8217;ve never proposed getting rid of the preloading feature and my proposal would not harm the most often used cases of interposition.</p><p>This requires a bit more explanation, so bear with me please.</p><p>Symbol interposition works by adding a symbol to the symbol table before the &#8220;rightful&#8221; symbol appears. The dynamic linker will resolve the symbol to the first occurrence it finds in the search order, so if you preload a library out of its order, its symbols will have higher priority than they would otherwise. The extreme case is when you preload a library or module that wouldn&#8217;t otherwise be loaded. But remember something I said before: preloaded modules are loaded after two others are loaded, so they don&#8217;t get the chance to interpose symbols defined by those.</p><p>If the executable performed a copy relocation on a data symbol, then LD_PRELOAD&#8217;ed modules cannot interpose that symbol. For that reason, I am not counting interposition of data symbols as valid. In fact, in the 14 years I&#8217;ve been hacking on Linux, I&#8217;ve never done that, so I guess the chances of that happening are a billion to one or even lower. What&#8217;s more, my proposal would do away with copy relocation, which may make data interposition a valid case.</p><p>The next important thing you must understand is that my proposal would do away with interposition of <strong><em>intra-library</em></strong> symbols, but not <strong><em>inter-library</em></strong> ones. My friend Michael Meeks&#8217;s <a
href="http://lwn.net/Articles/192624/">proposal of -Bdirect linking</a> might, but even that proposal wouldn&#8217;t totally do away with it.</p><p>What do I mean by this? <em>Intra-library</em> means &#8220;within the same library,&#8221; while <em>inter-library</em> means &#8220;across libraries&#8221; (think of &#8220;Internet&#8221; vs &#8220;intranet&#8221;). My proposal was intended to improve binding of symbols inside one library because we can gain performance doing that without losing the <a
href="http://en.wikipedia.org/wiki/Position-independent_code">Position-independent code</a> and the advantages that come with it (like <a
href="http://en.wikipedia.org/wiki/Address_space_layout_randomization">Address space layout randomisation</a>). Specifically because we don&#8217;t want to lose the PIC support and we don&#8217;t want to go back to pre-ELF days and their problems (see Ulrich Drepper&#8217;s <a
href="http://www.akkadia.org/drepper/dsohowto.pdf">paper</a> for some information on it), all <em>inter-library</em> symbol resolution would remain as-is, via PLTs and GOTs, including the ability to interpose symbols.</p><p>And here&#8217;s why I think we&#8217;re entitled to doing that: because you cannot do it anyway unless the library has been <strong>specifically designed to allow it</strong>, like glibc is. Let&#8217;s take the code from the last blog:</p><pre class="brush: cpp; title: ; notranslate">
extern void *externalVariable;
extern void externalFunction(void);
 
void myFunction()
{
    externalFunction();
    externalVariable = &amp;externalFunction;
}
</pre><p>And amend it like so:</p><pre class="brush: cpp; first-line: 9; title: ; notranslate">
void externalFunction(void)
{
}
</pre><p>If we compile this code with optimisation (GCC&#8217;s -O is enough) and inspect the assembly output, we can notice that both functions are present in the output but that <tt>myFunction</tt> does not call <tt>externalFunction</tt>. In other words, the compiler inlined one function into the other, even if the <tt>inline</tt> keyword was never added to it, and that expanded to zero code. With advances such as <a
href="http://en.wikipedia.org/wiki/Link-time_optimization">link-time optimisation</a>, even moving the function to another compilation unit might not be enough to prevent the inlining.</p><p>That&#8217;s why I said that to support the case of <em>intra-library</em> symbol interposition, the library must be specifically designed to allow it, which is definitely still possible under my proposal. Most libraries aren&#8217;t designed like that and will never be, so I am confident that optimising for the greater majority of the libraries instead of the few is warranted (taking my system: I counted 3623 distinct libraries and plugins and I&#8217;m pretty sure none except libc and libpthread allow for interposition, so it&#8217;s probably a 1000:1 case again).</p><h1>Benchmarks</h1><p>Another important remark I saw in the comment threads was about the lack of benchmarks in my previous blog. Here they are.</p><p>Please note that &#8220;benchmark&#8221; means &#8220;comparison.&#8221; It does not imply &#8220;speed executing something.&#8221;</p><h2>How I did it</h2><p>I started by trying to find an executable I could run non-interactively, that executed a relatively CPU-intense activity and quit. That executable should be in my standard set of built executables, as I didn&#8217;t want to recompile the entire system. I settled on KDE&#8217;s <tt>kbuildsycoca4</tt> with the options <tt>--noincremental --nosignal</tt>: it looks for all *.desktop files in the search paths and compiles a database for faster lookup, called the SYstem COnfiguration CAche. The options tell it to ignore existing databases and do it all, plus avoid signalling running applications over D-Bus to reload their settings.</p><p>The tests were run on my laptop, which is an Intel Core-i7-2620M, clocked at 2.6 GHz, with an SSD but no tmpfs temporary dir, with 2x32kB of L1 cache, 256 kB of L2 cache, 4MB of L3 cache and 4 GB of main RAM. 
I locked the CPU scaling governor to &#8220;performance&#8221; so the CPU was running at 2.6 GHz when each test started; it soon went into turbo mode (3.2 GHz) and stayed there. The system was not <strong>completely</strong> idle while running the test, but relatively so. To try and avoid other problems, the native benchmarks were run under the FIFO real-time scheduler, with affinity set to a single processor. The tests were run in 64-bit mode and were run &#8220;warm&#8221;: I ran the benchmark first after any recompilation and discarded the results.</p><p>I did four sets of tests, as follows:</p><ol><li>The first, the baseline, was a regular build on my system, with no change to default KDE 4 build options or to Qt 4.8&#8217;s.</li><li>The second was modified by adding <tt>-Bsymbolic-functions</tt> to the five KDE libraries and six Qt libraries used by the program.</li><li>The third was modified by replacing <tt>-Bsymbolic-functions</tt> with <tt>-Bsymbolic</tt> and recompiling the same 11 libraries.</li><li>Finally, on the fourth, in addition to keeping <tt>-Bsymbolic</tt>, I made all symbols exported from those 11 libraries have protected visibility. This required surprisingly few modifications to them, as they were more-or-less ready to be built on Windows too. Each library already has a <tt>XXXX_EXPORT</tt> macro associated because of the &#8220;hidden&#8221; visibility support, which right now expands to <tt>__attribute__((visibility("default")))</tt>. Moreover, the buildsystem for those libraries already defines a specific macro only during their builds. So it was easy to ensure that, when that macro from the buildsystem is defined, the <tt>XXXX_EXPORT</tt> macro instead expands to <tt>__attribute__((visibility("protected")))</tt>; otherwise it remains unchanged.</li></ol><p>Each set of tests consisted of:</p><ul><li>Run Ulrich Drepper&#8217;s <a
href="/blog/wp-content/uploads/2012/01/relinfo.txt"><tt>relinfo</tt></a> script on the 11 libraries and tally up the types of relocations</li><li>Run <a
href="http://valgrind.org/">Valgrind&#8217;s</a> <a
href="http://valgrind.org/docs/manual/cg-manual.html">cachegrind tool</a> with branch-prediction and the cache sizes set to match my machine</li><li>Run the <a
href="https://perf.wiki.kernel.org/index.html">perf</a> <a
href="https://perf.wiki.kernel.org/articles/t/u/t/Tutorial.html#Counting_with_perf_stat">stat</a> tool to gather hardware counters. Each run of the tool reported the average of 10 runs of kbuildsycoca4, all run under the FIFO real-time scheduler. After the first warm-up run, I chose the best of 3 runs in quick succession</li></ul><p>You can download the raw results I collected from <a
href="/blog/wp-content/uploads/2012/01/benchmarking-abi.txt">here</a> (that also includes results with LD_BIND_NOW=1).</p><h2>Results</h2><p>First of all, I went into these benchmarks fully expecting that nothing would be visible in the performance benchmarks. It&#8217;s clear that these are micro-optimisations, so in a fairly large program they should be drowned out by inefficiencies in other parts. Also, considering that my system wasn&#8217;t completely idle when running the CPU benchmarks, the numbers have a degree of noise which could hide the faint results. The results have, however, shown a few clear improvements.</p><p>Here&#8217;s what I found:</p><ul><li><strong>Relocations</strong>: relocations are work that the dynamic linker must do either at load-time (non-PLT relocations) or during run-time (PLT). Reducing or simplifying relocations improves start-up and run-time performance.<ul><li><strong>The number of non-PLT relocations drops by 2.65% with protected visibility</strong>: that was expected because the linker options affect only the PLT. To change the non-PLT relocation count, a change to the compilation was necessary.</li><li><strong>The number of relative relocations doubles with the linker options</strong>: that was also expected, because the linker can bind the relocation to the symbol that is inside the library being linked. Instead of referring to the symbol by its name and triggering a full look-up, a relative relocation simply records how many bytes past a fixed mark (the load address) the relocation should be, which is much simpler to execute. The number increases again with <tt>-Bsymbolic</tt> compared to <tt>-Bsymbolic-functions</tt> because the linker can bind non-functions too. 
The number dropped with protected visibility, but by less than the number of total relocations removed.</li><li><strong>The number of PLT entries is one-third of the original</strong> because the linker can make <em>intra-library</em> function calls directly instead of going through the PLT stub. Each PLT entry corresponds to 8 bytes in the <tt>.got.plt</tt> section and 16 bytes of stub, which means this reduction saved as many as 15571 relocations and as much as 373 kB of memory. This is confirmed by the count of PLT entries used for local symbols, which drops to nearly zero. The number isn&#8217;t exactly zero because both QtCore and QtGui <strong>have</strong> been prepared for 5 of their symbols to be interposed when built with <tt>-Bsymbolic-functions</tt>, a concern I didn&#8217;t take into account in the protected visibility work because it wasn&#8217;t relevant.<ul>Note that there must have been an error with the <tt>-Bsymbolic</tt> builds because two libraries had a higher PLT count than they should have. 
I have not investigated whether this was a mistake on my part or a bug in the linker.</ul></li></ul></li><li><strong>Valgrind results</strong>: Valgrind executes the program in a simulated CPU, which on one hand means we get consistent results independent of which CPU I run this on and how idle or busy my system was, but on the other hand may or may not reflect reality (YMMV).<ul><li><strong>Instruction count decreases slightly</strong> by 0.9%, 1.1% and 1.2%</li><li><strong>Data accesses to L1 data cache decrease slightly</strong> by 1.4%, 1.6% and 2.1%</li><li><strong>Last-level cache references decrease by 7%</strong> while the LL cache miss rate remains constant, probably because there are fewer instructions executed, fewer data accesses and a slight improvement in L1D miss rate</li><li><strong>Number of indirect branches executed drops by 22%</strong></li><li><strong>The indirect branch misprediction rate drops considerably</strong> from 22% in the original to 16% with just the linker options and 8.8% with the protected visibility, while the overall branch misprediction rate drops from 4.7% to 4.3% and then to 4.1%. With 2.9 million fewer mispredicted branches, at a 20-cycle misprediction penalty, that&#8217;s 57 million cycles saved.</li></ul></li><li><strong>Perf results</strong>: perf uses hardware counters from the CPU to do its bidding, but it is subject to scheduling issues. The kbuildsycoca4 program does context-switch in its execution because it tries to verify with the D-Bus daemon whether another instance is already running. Moreover, this program is I/O intensive, meaning it makes a lot of system calls, which is why I let the benchmarks run with a &#8220;warm&#8221; system cache. Unlike the Valgrind results, there&#8217;s a great deal of noise and error in the numbers from perf because they represent an actual CPU.<ul><li><strong>There&#8217;s a roughly 3% overall performance improvement</strong> as measured by the execution time. 
The noise in the number doesn&#8217;t show which solution is best, but it shows that all three are better than the unmodified library code.</li><li><strong>There&#8217;s a 3 to 4% improvement in number of cycles</strong> required to complete the operation. Unfortunately, the numbers are showing performance decreasing as I optimise more, which is counter-intuitive and I cannot explain (noise or real mis-optimisation). I think my machine was slightly less idle on the last test set, as the last results I got showed a much worse performance with a much bigger standard deviation.</li><li><strong>There&#8217;s roughly 3% improvement in the number of instructions executed</strong>, which is similar to the reduction in cycles, but also shows that more instructions are executed per cycle with the optimisations. I cannot say why exactly it is, but I imagine it&#8217;s because of reductions in branching, branch misprediction and cache misses. The calculation of instructions per cycle shows improvement in two of the three benchmarks by close to 1%.</li><li><strong>Branches executed reduce by 4 to 5%</strong> but the reduction is in the opposite order of the number of branches I know are in the code, which means there was a considerable amount of noise in this test. Another similar metric shows a roughly 5% improvement in branch loads.</li><li><strong>The rates of cache misses and branch mispredictions remain more or less constant</strong>, which coupled with the number of branches reducing means we have an improvement in performance due to fewer absolute mispredictions happening. I cannot conclude anything about a reduction in cache references because the numbers varied too much.<ul>This is supported by the calculation of cycles gained in the reduction of branch misprediction. 
The SandyBridge architecture has a 20-cycle penalty for branch misprediction, so if we calculate how many cycles were lost in each benchmark due to mispredicted branches and subtract from the original, we get roughly 6 million cycles gained (0.24% of the total), which is in the same order as the improvement in instruction throughput (instructions per cycle).</ul></li></ul></li></ul><h1>Conclusions</h1><p>The numbers are fairly small, as was expected, since we&#8217;re talking about micro-optimisations. However, three distinct benchmarks have shown with a reasonable degree of confidence that there&#8217;s a performance improvement in the order of 3% (execution time, cycle count and instruction count, and that&#8217;s reasonable to me, with the limited sample size I had). That&#8217;s more or less what I hoped to see, but much more than I expected to be able to show.</p><p>Another important aspect is that this was a non-GUI testcase, even though by virtue of library dependencies, both QtGui and kdeui libraries were present. Note how the two libraries have, together, 45824 relocations and 14708 PLT entries in the original library set, which corresponds to 73.3% and 62.4% of the total relocations in play respectively, as well as 65% of the PLT entries for local symbols. The number of relocations is indicative also of the size of the code in those libraries. But since the application isn&#8217;t a GUI one, that code is mostly not executed.</p><p>If we consider that the problem of cache misses increases with code size (and the cache miss rate could increase too, compounding the effect) and that of cycles lost due to mispredicted branches increases with the number of branches unless the misprediction ratio drops (which the benchmarks have shown to remain stable), we can expect that a GUI application could gain even more in performance due to these improvements. 
That&#8217;s difficult to prove however in a GUI application, so we&#8217;ll have to stay with just the theoretical exercise.</p><p>In all, I still think this is warranted. The drawbacks are fairly minor: the interposition of symbols is rarely used already, and interposition of symbols in <em>intra-library</em> lookups is close to non-existent in libraries that aren&#8217;t designed to do that. All we need to do now is change the status quo, which is probably the hardest part.</p><p>Who will support me?</p> ]]></content:encoded> <wfw:commentRss>http://www.macieira.org/blog/2012/01/update-and-benchmark-on-the-dynamic-library-proposals/feed/</wfw:commentRss> <slash:comments>5</slash:comments> </item> <item><title>Sorry state of dynamic libraries on Linux</title><link>http://www.macieira.org/blog/2012/01/sorry-state-of-dynamic-libraries-on-linux/</link> <comments>http://www.macieira.org/blog/2012/01/sorry-state-of-dynamic-libraries-on-linux/#comments</comments> <pubDate>Mon, 16 Jan 2012 15:12:14 +0000</pubDate> <dc:creator>Thiago Macieira</dc:creator> <category><![CDATA[Assembly]]></category> <category><![CDATA[C++]]></category> <category><![CDATA[KDE]]></category> <category><![CDATA[Linux]]></category> <category><![CDATA[MeeGo]]></category> <category><![CDATA[Qt]]></category> <category><![CDATA[abi]]></category> <category><![CDATA[assembly]]></category> <category><![CDATA[elf]]></category> <category><![CDATA[linux]]></category> <category><![CDATA[low-level]]></category> <category><![CDATA[optimisation]]></category> <guid
isPermaLink="false">http://www.macieira.org/blog/?p=287</guid> <description><![CDATA[Last week, we identified a bug in Qt with Olivier&#8216;s new signal-slot syntax. Upon further investigation, it turns out it&#8217;s not a Qt issue, but an ABI one. Which prompted me to investigate more and decide that dynamic libraries need a big overhaul on Linux. tl;dr (a.k.a. Executive Summary) Shared libraries on Linux are linked &#8230;</p><p><a
class="more-link block-button" href="http://www.macieira.org/blog/2012/01/sorry-state-of-dynamic-libraries-on-linux/">Continue reading &#187;</a>]]></description> <content:encoded><![CDATA[<p>Last week, we identified a bug in Qt with <a
href="http://woboq.com">Olivier</a>&#8216;s <a
href="http://developer.qt.nokia.com/wiki/New_Signal_Slot_Syntax">new signal-slot syntax</a>. Upon further investigation, it turns out it&#8217;s not a Qt issue, but an ABI one. Which prompted me to investigate more and decide that dynamic libraries need a big overhaul on Linux.</p><h1>tl;dr (a.k.a. Executive Summary)</h1><p>Shared libraries on Linux are linked with <tt><a
href="http://en.wikipedia.org/wiki/Position-independent_code">-fPIC</a></tt>, which makes all variable references and function calls indirect, unless they are <tt>static</tt>. That&#8217;s because in addition to making it position-independent, it makes every variable and function <strong>interposable</strong> by another module: it can be overridden by the executable and by <tt>LD_PRELOAD</tt> libraries. The indirectness of accesses is a performance impact and we should do away with it, without sacrificing position-independence.</p><p>Plus, there are a few more actions we should take (like prelinking) to improve performance even further.</p><p>Jump to <a
href="#existing_solutions">existing</a> or <a
href="#proposed_solutions">proposed</a> solutions, <a
href="https://plus.google.com/108138837678270193032/posts/No8T7VLoF33">Google+ discussion</a>.</p><h1>Details</h1><p>Note: in the following, I will show x86-64 64-bit assembly and will restrict myself to that architecture. However, the problems and solutions also apply to many other architectures, like x86 and ARM, which should make you consider what I say. The only platform that this mostly does not apply to is actually IA-64.</p><h2>The basics</h2><p>Imagine the following C file, which also compiles in C++ mode:</p><pre class="brush: cpp; title: ; notranslate">
extern void *externalVariable;
extern void externalFunction(void);
 
void myFunction()
{
    externalFunction();
    externalVariable = &amp;externalFunction;
}
</pre><p>The code above demonstrates three features of the languages in one function: it loads the address of a function, it calls a function and it writes to a variable. The compiler does not know where the function and variable are: they might be in another .o file linked into this ELF module or they might be in another ELF module (i.e., a library) this module links to.</p><p>This compiler produces the following assembly output (gcc 4.6.0, -O3):</p><div
class="wp-geshi-highlight-wrap5"><div
class="wp-geshi-highlight-wrap4"><div
class="wp-geshi-highlight-wrap3"><div
class="wp-geshi-highlight-wrap2"><div
class="wp-geshi-highlight-wrap"><div
class="wp-geshi-highlight"><div
class="asm"><pre class="de1">        <span class="kw1">call</span>    externalFunction
        movq    $externalFunction<span class="sy0">,</span> externalVariable<span class="br0">&#40;</span><span class="sy0">%</span>rip<span class="br0">&#41;</span></pre></div></div></div></div></div></div></div><p>This assembly snippet is making use of two symbols whose values the assembler does not know. When assembled, the assembler produces a .o with three relocations. This GCC has produced the most efficient and most compact compilation of the code I wrote.</p><p>When we link this .o into an executable, we start to see the drawbacks. The first is that both instructions need to encode, in their bits, the values of the symbols whose values we didn&#8217;t know. So the linker must somehow fix this. It fixes the <tt>call</tt> instruction by making it call a stub or a trampoline, which jumps to the actual address. This stub is placed in a separate section of code called the Procedure Linkage Table (PLT). The contents of the PLT stub are not that important, but suffice to say that it is an indirect jump.</p><p>The <tt>movq</tt> instruction cannot be fixed. There&#8217;s simply no way, because it writes a constant value to a constant location, directly. Even if we allowed for the instruction or a pair of instructions wide enough to write any 64-bit value to any variable in the 64-bit space, we still have a problem: those values are not known at link time. So instead of fixing the instruction, the linker &#8220;fixes&#8221; the values. For the address of <tt>externalFunction</tt>, it uses the address of the PLT stub it created in the previous paragraph. For the <tt>externalVariable</tt> variable, it will create a <a
href="http://docs.oracle.com/cd/E19082-01/819-0690/chapter4-84604/index.html">copy relocation</a>, which means the dynamic linker will need to find the variable where it is, <strong>copy</strong> its value to a fixed location in the executable and then tell everyone that the variable is actually in the executable.</p><p>What are the consequences of this? For the PLT call, it&#8217;s a simple performance impact which could not be avoided. Since the address of the actual <tt>externalFunction</tt> function is not known at compile and link-time, and we don&#8217;t want to leave a <a
href="http://www.akkadia.org/drepper/textrelocs.html">text relocation</a>, the only way to place that call is to find the address at run-time and call it indirectly.</p><p>For the copy relocation, the consequences for the executable are small. The code it will execute is still the most efficient and most compact. The dynamic linker will have to find where the symbol actually is at load-time, which is something that it would have to do anyway, plus copy its contents, checking that the size hasn&#8217;t changed. This is done only once, then the code runs in its most efficient form.</p><p>The fact that we resolved <tt>&#038;externalFunction</tt> to the address of the PLT stub means that any use of that function pointer (an indirect call) will end up in a function that does an indirect call too. That is, it&#8217;s a <strong>doubly-indirect</strong> call. I seriously doubt any processor can do proper branch prediction, speculative execution, and prefetching of code under those circumstances.</p><h2>It gets worse</h2><p>So far we&#8217;ve analysed what happens in an executable. Now let&#8217;s see what happens when we try to build the same C code for a shared library. We do that by introducing the <tt>-fPIC</tt> compiler option, which tells the compiler to generate position-independent code. The compiler produces the following assembly output:</p><div
class="wp-geshi-highlight-wrap5"><div
class="wp-geshi-highlight-wrap4"><div
class="wp-geshi-highlight-wrap3"><div
class="wp-geshi-highlight-wrap2"><div
class="wp-geshi-highlight-wrap"><div
class="wp-geshi-highlight"><div
class="asm"><pre class="de1">        <span class="kw1">call</span>    externalFunction@PLT
        movq    externalFunction@GOTPCREL<span class="br0">&#40;</span><span class="sy0">%</span>rip<span class="br0">&#41;</span><span class="sy0">,</span> <span class="sy0">%</span>rdx
        movq    externalVariable@GOTPCREL<span class="br0">&#40;</span><span class="sy0">%</span>rip<span class="br0">&#41;</span><span class="sy0">,</span> <span class="sy0">%</span>rax
        movq    <span class="sy0">%</span>rdx<span class="sy0">,</span> <span class="br0">&#40;</span><span class="sy0">%</span>rax<span class="br0">&#41;</span></pre></div></div></div></div></div></div></div><p>When assembled, the .o still contains three relocations, albeit of different types.</p><p>When we compare the output of the position-dependent and the position-independent code, we notice the following:</p><ol><li>The <tt>call</tt> is still a call, but now we&#8217;re explicitly calling the PLT stub. This might seem irrelevant, since the linker would have fixed the call anyway to point to the PLT if it had to, but it isn&#8217;t.</li><li>The single <tt>movq</tt> instruction was split in three. This is required by the x86-64 processor, since the instruction set cannot encode a 64-bit value and the 64-bit address to store it in the same instruction (such an instruction would be at least 17 bytes long, which is two bytes longer than the maximum instruction length).</li><li>The values for the two symbols are loaded indirectly. Instead of encoding the two values in those two middle <tt>movq</tt> instructions, the compiler is loading the values from another linker-generated structure called the Global Offset Table (GOT).</li></ol><p>The compiler needed to generate the code above since it doesn&#8217;t know where the symbols will actually be. As was the case before, those symbols can be linked into the same ELF module as this compilation unit, or they may be found elsewhere in another ELF module this one links to.</p><p>Moreover, the compiler and linker need to deal with the possibility that an executable might have done exactly what our executable in the previous section did: create a copy relocation on the variable and fix the address of the function to its own PLT stub. 
In order to work properly, this code must deal with the fact that its own variable might have ended up elsewhere, and that <tt>&#038;externalFunction</tt> might have a different value.</p><p>That means the indirect call through the PLT and the three <tt>movq</tt> instructions remain, even if those two symbols were in the same compilation unit!</p><p>The problem is that even if at first glance you&#8217;d think that the compiler should know for a fact where those symbols are, it actually doesn&#8217;t. The <tt>-fPIC</tt> option doesn&#8217;t enable only position-independent code. It also enables ELF symbol interposition, which is when another module &#8220;steals&#8221; the symbol. That happens normally by way of the copy relocations, but can also happen if an LD_PRELOAD&#8217;ed module were to override those symbols. So the compiler and linker must produce code that deals with that possibility.</p><p>In the end, we&#8217;re left with indirect calls, indirect symbol address loadings and indirect variable references, which impact code performance. In addition, the linker must leave behind relocations by name for the dynamic linker to resolve at load-time.</p><h2>All this for the possibility of interposition?</h2><p>Yes, it seems so. The impact is there for this little-known and little-used feature. Instead of optimising for the common-case scenario where the symbols are not overridden, the ABI optimises for the corner case.</p><p>Another argument is that the ABI optimises for executable code, placing the impact on the libraries. The argument is valid if the executables are much larger and more complex than the libraries themselves. It&#8217;s valid too if we consider that application developers write sloppy code, whereas library developers will write very optimised code.</p><p>I don&#8217;t think that argument holds anymore. Libraries have got much more complex in the past 10-15 years and do a lot more than they once did. 
They are not mere wrappers around system calls, like libc 4 and 5 were on Linux in the late 90s. Moreover, if we consider the rise of interpreted languages, like Perl, Python, Ruby, even QML and JavaScript, the code belonging to the ELF executables is negligible. Compare the size of the executables with the libraries that actually do the interpretation:</p><pre>
-rwxr-xr-x. 2 root root   13544 Aug  5 06:27 /usr/bin/perl
-rwxr-xr-x. 2 root root    9144 Apr 12  2011 /usr/bin/python
-rwxr-xr-x. 1 root root    5160 Dec 29 13:46 /usr/bin/ruby
-r-xr-xr-x. 1 root root 1763488 Apr 12  2011 /usr/lib64/libpython2.7.so.1.0
-rwxr-xr-x. 1 root root  947736 Dec 29 13:46 /usr/lib64/libruby.so.1.8.7
-rwxr-xr-x. 1 root root 1524064 Aug  5 06:27 /usr/lib64/perl5/CORE/libperl.so
</pre><p>That&#8217;s even valid for interpreters that JIT the code. As optimised as the code they generate can be, current understanding is that operations with critical performance are implemented in native code, which means libraries or plugins.</p><h1><a
name="existing_solutions"></a>Existing solutions</h1><h2>Partial solution for private symbols</h2><p>When developing your library, if you know that certain symbols are private and will never be used by any other library, you have an option. You can declare their ELF visibility to be &#8220;hidden&#8221;, which has two consequences. The clear one is that the linker will not add the hidden symbols to the dynamic symbol table, so other ELF modules simply cannot find them. If they can&#8217;t find them, they can&#8217;t steal them. And if they can&#8217;t steal them, the linker does not need to produce a PLT stub for the function call, so the <tt>call</tt> instruction will be linked to a simple, direct call as the executable in the first part had been.</p><p>The other consequence is an optimisation that the compiler does. Since it also knows that the <tt>externalVariable</tt> variable cannot be stolen, it does not need to address the variable indirectly. The generated assembly becomes:</p><div
class="wp-geshi-highlight-wrap5"><div
class="wp-geshi-highlight-wrap4"><div
class="wp-geshi-highlight-wrap3"><div
class="wp-geshi-highlight-wrap2"><div
class="wp-geshi-highlight-wrap"><div
class="wp-geshi-highlight"><div
class="asm"><pre class="de1">        <span class="kw1">call</span>    externalFunction@PLT
        movq    externalFunction@GOTPCREL<span class="br0">&#40;</span><span class="sy0">%</span>rip<span class="br0">&#41;</span><span class="sy0">,</span> <span class="sy0">%</span>rax
        movq    <span class="sy0">%</span>rax<span class="sy0">,</span> externalVariable<span class="br0">&#40;</span><span class="sy0">%</span>rip<span class="br0">&#41;</span></pre></div></div></div></div></div></div></div><p>The .o file will still contain three relocations. However, note how the getting of the address of the <tt>externalFunction</tt> function is still done indirectly, even though the compiler knows it cannot be interposed. That means the linker will still generate a load-time relocation for the dynamic linker, to get the address of that function. Fortunately, it&#8217;s a simpler relocation since the symbol name itself is not present.</p><p>If there&#8217;s a reason for getting the address indirectly like this, I have yet to find it.</p><h2>Partial solution for public non-interposable symbols</h2><p>If your symbols are public, however, you cannot use the ELF &#8220;hidden&#8221; visibility trick. But if you know that they cannot and will not ever be stolen or interposed, you have another possibility, which is to tell that to the compiler and linker.</p><p>If you declare a variable with ELF &#8220;protected&#8221; visibility, you&#8217;re telling the compiler and linker that it cannot be stolen, yet can be placed in the dynamic symbol table for other ELF modules to reference. You just have to be absolutely sure that they will not <strong>ever</strong> be interposed, because that will create subtle bugs that are hard to track down. That includes access to those symbols by position-dependent executable code, like we did in the first section.</p><p>The GCC syntax <tt>__attribute__((visibility("protected")))</tt> works in ELF platforms only, whereas the one with the &#8220;hidden&#8221; keyword is known to work in non-ELF platforms too, like Mac OS X (Mach-O) and IBM AIX (XCOFF).</p><p>Another way to do the same is to use one of two linker options: <tt>-Bsymbolic</tt> and <tt>-Bsymbolic-functions</tt>. 
They do basically the same as the protected visibility: they keep the symbols in the dynamic symbol table, but they make the linker use the symbol inside the library unconditionally. The difference between those two options is that the former applies to all symbols, whereas the latter applies to functions only.</p><p>The reason why <tt>-Bsymbolic-functions</tt> exists requires looking back at the executable code from the first section. While the variable reference required a copy relocation, the function call was done indirectly, through the PLT stub. A variable can be moved, but moving code isn&#8217;t possible, so the executable code needs to deal with the code being elsewhere anyway. For that reason, it&#8217;s possible to symbolically bind function calls inside a library without affecting executables.</p><p>Or so we thought. The problem we discovered last week arises when you treat a function as a data reference: taking its address. As we saw in the first part, the linker will resolve the address of the function to the address of the PLT stub found in the executable. But if you symbolically bind the function in the library, it will resolve to the real address. If you try to compare the two addresses, they won&#8217;t be the same.</p><h1><a
name="proposed_solutions"></a>Proposed solutions</h1><p>Some of the solutions I propose are ABI and binary compatible with existing builds; some others are ABI incompatible and would require recompilation. Unfortunately, the best solution would require source-incompatible changes. Still, all the changes below are giving a bit of optimisation to libraries by making executables less optimised.</p><h2>Use of PLT in function calls should rest only with the linker</h2><p>As we saw in the code generated for the library, with -fPIC, the compiler decided to make the call indirectly by adding &#8220;@PLT&#8221; to the symbol name. Turns out that the linker doesn&#8217;t really care about this and will generate (or not) the PLT stub if needed. If that&#8217;s the case, the compiler should not make a judgement call about where the symbol is located just because of -fPIC.</p><h2>Function addresses should always be resolved through the GOT</h2><p>Function calls already require a pointer-sized variable somewhere and a relocation to make it point to the valid entry point of the function being called. What&#8217;s more, taking addresses of functions is a somewhat rare operation, compared to the number of function calls across ELF modules.</p><p>That being the case, we can take a small &#8220;hit&#8221; in performance: the loading of a function address should happen via the GOT in position-dependent code (executables) just like it is done for position-independent code.</p><p>The benefit of doing this is that the function address we load will point exactly to the function&#8217;s real entry point, instead of the PLT stub. When we call this function, we avoid the doubly-indirect branching we found earlier.</p><h2>PLT stubs should use the regular GOT&#8217;s address, if it exists</h2><p>If a given function is both called and its address is taken, the PLT stub should reference the GOT entry that was used for the taking of the address. 
The reason why it isn&#8217;t already so, I guess, is because the entries in the <tt>.got.plt</tt> section aren&#8217;t initialised with the target function&#8217;s address, but the local module&#8217;s function resolver. This trick allows for the &#8220;lazy resolution&#8221; of functions: they are resolved only the first time they are called.</p><p>I wouldn&#8217;t ask for all functions to be resolved at load-time, but if the address of the function is taken <strong>anyway</strong>, the dynamic linker will need to resolve it at load time. So why waste CPU cycles in a function call if the address was computed already?</p><h2>Copy relocations should be deprecated</h2><p>Instead of copying the variable from the library into the executable, executables should use indirect addressing for reading variables and writing to them, as well as taking their addresses. One benefit of doing this is avoiding the actual copying. For example, for read-only variables, they may remain in read-only pages of memory, instead of being copied to read-write pages found in the executable.</p><p>The big drawback of this is that the indirect addressing is a lot more expensive, since it requires two memory references, not just one. The next suggestion might help alleviate the problem.</p><h2>The linker should relax instructions used for loading variable addresses</h2><p>This is a suggestion found in the IA-64 ABI: the compiler generates the instructions needed to load the address of the variable from the GOT, then use it as it needs to. If the linker concludes (by whichever means, like protected or hidden symbols, the use of one of the symbolic options, or because this is an ELF application and the symbol is defined in it) that the symbol must reside in the current ELF module, it can change the load instruction into a register-to-register move or similar.</p><p>For our x86-64 64-bit case, the instructions the compiler generated were:</p><div
class="wp-geshi-highlight-wrap5"><div
class="wp-geshi-highlight-wrap4"><div
class="wp-geshi-highlight-wrap3"><div
class="wp-geshi-highlight-wrap2"><div
class="wp-geshi-highlight-wrap"><div
class="wp-geshi-highlight"><div
class="asm"><pre class="de1">        movq    externalVariable@GOTPCREL<span class="br0">&#40;</span><span class="sy0">%</span>rip<span class="br0">&#41;</span><span class="sy0">,</span> <span class="sy0">%</span>rax
        movq    <span class="sy0">%</span>rdx<span class="sy0">,</span> <span class="br0">&#40;</span><span class="sy0">%</span>rax<span class="br0">&#41;</span></pre></div></div></div></div></div></div></div><p>By changing only the opcode byte of the first instruction (from 0x8B, MOV, to 0x8D, LEA), with no code size change, we can produce:</p><div
class="wp-geshi-highlight-wrap5"><div
class="wp-geshi-highlight-wrap4"><div
class="wp-geshi-highlight-wrap3"><div
class="wp-geshi-highlight-wrap2"><div
class="wp-geshi-highlight-wrap"><div
class="wp-geshi-highlight"><div
class="asm"><pre class="de1">        leaq    externalVariable@GOTPCREL<span class="br0">&#40;</span><span class="sy0">%</span>rip<span class="br0">&#41;</span><span class="sy0">,</span> <span class="sy0">%</span>rax
        movq    <span class="sy0">%</span>rdx<span class="sy0">,</span> <span class="br0">&#40;</span><span class="sy0">%</span>rax<span class="br0">&#41;</span></pre></div></div></div></div></div></div></div><p>The x86 instruction &#8220;LEA&#8221; means &#8220;Load Effective Address&#8221;. Instead of loading 64 bits from the memory address externalVariable@GOTPCREL(%rip) and storing them in the register, that instruction stores the address it would have loaded from in the register. This isn&#8217;t as optimised as the original code found in the executable for two reasons: it requires two instructions instead of just one and it requires an additional register.</p><p>It&#8217;s possible to generate even more efficient code if the assembler leaves a 32-bit immediate offset in the second <tt>movq</tt> instruction, making it 6 bytes long. This extra immediate would be of no impact in the original code, besides making it longer, but it would allow the linker to optimise the code further:</p><p>The original would be:</p><div
class="wp-geshi-highlight-wrap5"><div
class="wp-geshi-highlight-wrap4"><div
class="wp-geshi-highlight-wrap3"><div
class="wp-geshi-highlight-wrap2"><div
class="wp-geshi-highlight-wrap"><div
class="wp-geshi-highlight"><div
class="asm"><pre class="de1">        movq     externalVariable@GOTPCREL<span class="br0">&#40;</span><span class="sy0">%</span>rip<span class="br0">&#41;</span><span class="sy0">,</span> <span class="sy0">%</span>rax
        movq<span class="sy0">.</span>d32 <span class="sy0">%</span>rdx<span class="sy0">,</span> <span class="nu0">0x0</span><span class="br0">&#40;</span><span class="sy0">%</span>rax<span class="br0">&#41;</span></pre></div></div></div></div></div></div></div><p>And it would get relaxed to:</p><div
class="wp-geshi-highlight-wrap5"><div
class="wp-geshi-highlight-wrap4"><div
class="wp-geshi-highlight-wrap3"><div
class="wp-geshi-highlight-wrap2"><div
class="wp-geshi-highlight-wrap"><div
class="wp-geshi-highlight"><div
class="asm"><pre class="de1">        nopl<span class="sy0">.</span>d32 <span class="nu0">0x0</span><span class="br0">&#40;</span><span class="sy0">%</span>rax<span class="br0">&#41;</span>
        movq     <span class="sy0">%</span>rdx<span class="sy0">,</span> externalVariable@GOTPCREL<span class="br0">&#40;</span><span class="sy0">%</span>rip<span class="br0">&#41;</span></pre></div></div></div></div></div></div></div><p>That is, the first 6-byte instruction is resolved to a 6-byte NOP, whereas the second 6-byte instruction executes the actual store, with no extra register use. The compiler cannot know that the register will be left untouched, but at least there is no dependency between the two instructions that might cause a CPU stall.</p><p>The same applies to other architectures too. The full <tt>-fPIC</tt> code on ARM to store a value from a register into a variable is the following:</p><div
class="wp-geshi-highlight-wrap5"><div
class="wp-geshi-highlight-wrap4"><div
class="wp-geshi-highlight-wrap3"><div
class="wp-geshi-highlight-wrap2"><div
class="wp-geshi-highlight-wrap"><div
class="wp-geshi-highlight"><div
class="asm"><pre class="de1">        ldr     r3<span class="sy0">,</span> <span class="sy0">.</span>L2<span class="sy0">+</span><span class="nu0">8</span>     @ points to a constant whose value is<span class="sy0">:</span> externalVariable<span class="br0">&#40;</span>GOT<span class="br0">&#41;</span>
<span class="sy0">.</span>LPIC1<span class="sy0">:</span> ldr     r3<span class="sy0">,</span> <span class="br0">&#91;</span>r4<span class="sy0">,</span> r3<span class="br0">&#93;</span>  @ r4 contains the base address of the GOT
        <span class="kw1">str</span>     r2<span class="sy0">,</span> <span class="br0">&#91;</span>r3<span class="sy0">,</span> #<span class="nu0">0</span><span class="br0">&#93;</span></pre></div></div></div></div></div></div></div><p>If the linker can conclude the symbol must be in the current ELF module and cannot change, it may be able to avoid the extra load (the middle instruction) by changing the code to be:</p><div
class="wp-geshi-highlight-wrap5"><div
class="wp-geshi-highlight-wrap4"><div
class="wp-geshi-highlight-wrap3"><div
class="wp-geshi-highlight-wrap2"><div
class="wp-geshi-highlight-wrap"><div
class="wp-geshi-highlight"><div
class="asm"><pre class="de1">        ldr     r3<span class="sy0">,</span> <span class="sy0">.</span>L2<span class="sy0">+</span><span class="nu0">8</span>     @ points to a constant whose value is<span class="sy0">:</span> externalVariable<span class="sy0">-</span><span class="br0">&#40;</span><span class="sy0">.</span>LPIC1<span class="sy0">-</span><span class="nu0">8</span><span class="br0">&#41;</span>
<span class="sy0">.</span>LPIC1<span class="sy0">:</span> <span class="kw1">add</span>     r3<span class="sy0">,</span> pc<span class="sy0">,</span> r3
        <span class="kw1">str</span>     r2<span class="sy0">,</span> <span class="br0">&#91;</span>r3<span class="sy0">,</span> #<span class="nu0">0</span><span class="br0">&#93;</span></pre></div></div></div></div></div></div></div><p>Unlike x86, the ARM instructions cannot be optimised further, since the immediates encodable in the instructions have limited range.</p><h2>The linker should relax instructions used for loading function addresses</h2><p>Similar to the above, but instead looking at function addresses. The original library code is:</p><div
class="wp-geshi-highlight-wrap5"><div
class="wp-geshi-highlight-wrap4"><div
class="wp-geshi-highlight-wrap3"><div
class="wp-geshi-highlight-wrap2"><div
class="wp-geshi-highlight-wrap"><div
class="wp-geshi-highlight"><div
class="asm"><pre class="de1">        movq    externalFunction@GOTPCREL<span class="br0">&#40;</span><span class="sy0">%</span>rip<span class="br0">&#41;</span><span class="sy0">,</span> <span class="sy0">%</span>rdx</pre></div></div></div></div></div></div></div><p>But it can be relaxed to:</p><div
class="wp-geshi-highlight-wrap5"><div
class="wp-geshi-highlight-wrap4"><div
class="wp-geshi-highlight-wrap3"><div
class="wp-geshi-highlight-wrap2"><div
class="wp-geshi-highlight-wrap"><div
class="wp-geshi-highlight"><div
class="asm"><pre class="de1">        leaq    externalFunction<span class="br0">&#40;</span><span class="sy0">%</span>rip<span class="br0">&#41;</span><span class="sy0">,</span> <span class="sy0">%</span>rdx</pre></div></div></div></div></div></div></div><p>With ARM, the original code is:</p><div
class="wp-geshi-highlight-wrap5"><div
class="wp-geshi-highlight-wrap4"><div
class="wp-geshi-highlight-wrap3"><div
class="wp-geshi-highlight-wrap2"><div
class="wp-geshi-highlight-wrap"><div
class="wp-geshi-highlight"><div
class="asm"><pre class="de1">        ldr     r3<span class="sy0">,</span> <span class="sy0">.</span>L2<span class="sy0">+</span><span class="nu0">8</span>     @ points to a constant of value<span class="sy0">:</span> externalFunction<span class="br0">&#40;</span>GOT<span class="br0">&#41;</span>
        ldr     r2<span class="sy0">,</span> <span class="br0">&#91;</span>r4<span class="sy0">,</span> r3<span class="br0">&#93;</span>  @ r4 contains the address of the base of the GOT</pre></div></div></div></div></div></div></div><p>But relaxed, it would be:</p><div
class="wp-geshi-highlight-wrap5"><div
class="wp-geshi-highlight-wrap4"><div
class="wp-geshi-highlight-wrap3"><div
class="wp-geshi-highlight-wrap2"><div
class="wp-geshi-highlight-wrap"><div
class="wp-geshi-highlight"><div
class="asm"><pre class="de1">        ldr     r2<span class="sy0">,</span> <span class="sy0">.</span>L2<span class="sy0">+</span><span class="nu0">8</span>    @ points to a constant of value<span class="sy0">:</span> externalFunction<span class="sy0">-</span><span class="br0">&#40;</span><span class="sy0">.</span>LPIC0<span class="sy0">+</span><span class="nu0">8</span><span class="br0">&#41;</span>
<span class="sy0">.</span>LPIC0<span class="sy0">:</span> <span class="kw1">add</span>     r2<span class="sy0">,</span> pc<span class="sy0">,</span> r2</pre></div></div></div></div></div></div></div><h2>There should be a way to tell the compiler where the symbol is</h2><p>We&#8217;re already able to tell the compiler that a symbol is in the current module, with the hidden visibility attribute. We should also be able to tell the compiler two more things: that a symbol is in the current module but exported, and that a symbol is in another module.</p><p>I would suggest simply using the existing ELF markers and being explicit about them:</p><ul><li><tt>__attribute__((visibility("hidden")))</tt>: symbol is in this ELF module and is not exported (equivalent on Windows: no decoration);</li><li><tt>__attribute__((visibility("protected")))</tt>: symbol is in this ELF module and is exported (equivalent on Windows: <tt>__declspec(dllexport)</tt>);</li><li><tt>__attribute__((visibility("default")))</tt>: symbol is in another ELF module (equivalent on Windows: <tt>__declspec(dllimport)</tt>); this also applies to symbols that must be overridable according to the library&#8217;s API (like C++&#8217;s global operator new).</li></ul><p>Considering the other suggestions, we know the references to symbols with &#8220;default&#8221; visibility can be relaxed into simpler and more efficient code in the presence of one of the symbolic binding options. That means we can use the &#8220;default&#8221; visibility for symbols whose location is uncertain.</p><h1>Getting there</h1><p>Some of the solutions I listed are already possible and they should be used immediately in all libraries. That is especially true of hidden visibility: all libraries, without exception, should make use of this feature. 
In fact, since this option was introduced in GCC 4.0 seven years ago, many libraries have started using it and are now &#8220;good citizens&#8221;, for they access their own private data most efficiently, they don&#8217;t have huge symbol tables (which impact lookup speed) and they don&#8217;t pollute the global namespace with unnecessary symbols.</p><p>Other solutions are not possible to implement yet. The solution I personally feel is most important to be implemented first is that of the ELF executables: they need to stop using copy relocations and they should resolve addresses of functions via the GOT. Only once that is done can libraries start using the &#8220;protected&#8221; visibility and generate improved code. This implies changing the psABI for the affected libraries, which may not be an easy transition.</p><p>An alternative to using the &#8220;protected&#8221; visibility is to use the symbolic binding options. The code relaxation optimisations would come in handy at this point to optimise at link-time the code that the compiler could not make a decision on. Unfortunately, those options apply to all symbols in a library, so libraries that must have overridable symbols need to use an extra option (<tt>--dynamic-list</tt>) and list each symbol one by one.</p><h2>Using -fPIE</h2><p>The compiler option <tt>-fPIE</tt> tells the compiler to generate position-independent code for executables. It is similar to the <tt>-fPIC</tt> option in that it generates position-independent code, but it has the added optimisation that the compiler can assume none of its symbols can be interposed.</p><p>With executables compiled with this option, copy relocations and direct loading of function addresses aren&#8217;t used. This solves the problem we had. 
Therefore, compiling executables with this option allows us to start using some of the optimisations I described before.</p><p>Unfortunately, as its description says, this option also generates position-independent code, which can be less efficient than position-dependent code in some situations. My preference would be to have position-dependent executables without the copy relocations. However, there&#8217;s an added side effect of this option: it defines the <tt>__PIC__</tt> macro, whose absence can be used to abort compilations for libraries that have transitioned to the more efficient options.</p><h1>Further work and further reading</h1><p>I highly recommend Ulrich Drepper&#8217;s <a
href="http://www.akkadia.org/drepper/dsohowto.pdf">&#8220;How to Write Shared Libraries&#8221;</a> paper. His recommendations did not go as far as suggesting changing the ABI, as I have, but he makes many that library developers should adhere to, regardless of whether my recommendations are accepted. For example, using <tt>static</tt> functions and data where possible and avoiding arrays of pointers are recommendations I have made to many people.</p><p>Another necessary piece of work is improving prelinking support. Shared libraries are position-independent, but they can be prelinked to a preferred location in memory. One optimisation I have yet to see done is to use the read-only pages of prelinked data when the library is loaded at that preferred address (the <tt>.data.rel.ro</tt> sections).</p> ]]></content:encoded> <wfw:commentRss>http://www.macieira.org/blog/2012/01/sorry-state-of-dynamic-libraries-on-linux/feed/</wfw:commentRss> <slash:comments>24</slash:comments> </item> <item><title>Qt temperatures drop from January to June</title><link>http://www.macieira.org/blog/2012/01/qt-temperatures-drop-from-january-to-june/</link> <comments>http://www.macieira.org/blog/2012/01/qt-temperatures-drop-from-january-to-june/#comments</comments> <pubDate>Fri, 13 Jan 2012 17:51:54 +0000</pubDate> <dc:creator>Thiago Macieira</dc:creator> <category><![CDATA[KDE]]></category> <category><![CDATA[Qt]]></category> <category><![CDATA[qt]]></category> <category><![CDATA[qt-project]]></category> <category><![CDATA[qt5]]></category> <category><![CDATA[releases]]></category> <guid
isPermaLink="false">http://www.macieira.org/blog/?p=280</guid> <description><![CDATA[I&#8217;ve previously talked about how the Qt 5 Winter is coming. Since we started talking about that, people have begun asking what are the date limits for each thing, when the API would freeze, when Qt 5.0 would be stable, when we&#8217;d release, etc. This blog tries to answer that a little. Last month, we &#8230;</p><p><a
class="more-link block-button" href="http://www.macieira.org/blog/2012/01/qt-temperatures-drop-from-january-to-june/">Continue reading &#187;</a>]]></description> <content:encoded><![CDATA[<p>I&#8217;ve previously talked about how the <a
href="http://www.macieira.org/blog/2011/12/winter-is-coming/">Qt 5 Winter is coming</a>. Since we started talking about that, people have begun asking what are the date limits for each thing, when the API would freeze, when Qt 5.0 would be stable, when we&#8217;d release, etc. This blog tries to answer that a little.</p><p>Last month, we were preparing a list of features that needed to be done for Qt 5.0. The result of that activity is <a
href="https://bugreports.qt.nokia.com/browse/QTBUG-20885">Task QTBUG-20885</a>, which is a meta-task containing as sub-tasks everything that needs to happen for Qt 5.0&#8217;s feature freeze. Those are the changes that <strong>must</strong> go into Qt 5.0 and not in any later release. They are major refactorings or other changes that would break source or binary compatibility.</p><p>That task is now mostly accomplished. Lars has <a
href="http://lists.qt-project.org/pipermail/development/2012-January/001240.html">suggested a feature freeze date</a> of February 4th in his post to the <a
href="http://lists.qt-project.org/mailman/listinfo/development">Qt development mailing list</a>. There&#8217;s not a lot of time left, so if you have something that needs to go in and hasn&#8217;t been taken into account, create the task and post now to the mailing list.</p><p>What happens next? Well, I don&#8217;t have dates, but I can tell you what will be<sup><a
href="#foot-1">[1]</a></sup> the stages of API freezing for Qt 5.0:</p><ul><li><strong>Alpha</strong> (Feature freeze): the first step, where all the features are in and work as best we can determine, in all the reference platforms<sup><a
href="#foot-2">[2]</a></sup> of Qt. The purpose of the Alpha release is to validate the API and get feedback from our own developers as well as bleeding-edge testers on whether the code really works and solves the problems it was intended to solve. Since the point of the Alpha release is to get feedback on the API and whether it works, the API is definitely not frozen at this point. After this point, no new features are accepted.</li><li><strong>Beta</strong>: the API is soft-frozen, which means it should change very little from here on. Most of the feedback that we expected to receive regarding the API has been received and acted upon. From this point on, early users of Qt can start depending on the API. If any further API changes are required, they can still be done but must be clearly documented and communicated to those early users. The purpose of the Beta release is to start using the API and to start validating the implementation of the solutions it contains. That means the focus after the Beta release is to discover issues and fix bugs, not to completely refactor something that isn&#8217;t solving the problems.</li><li><strong>Release Candidate</strong>: the API is now deep-frozen and will not change unless a catastrophic flaw is discovered. If that happens, the developer who wants to change the API must convince the Release Team to postpone the release. At this time, the ABI (binary compatibility) should be soft-frozen too, but issues with it may still be solved.</li><li><strong>Final Release</strong>: the API and ABI are completely frozen; Qt&#8217;s source- and binary-compatibility guarantees kick in. This release will be called Qt 5.0.0. All programs compiled with this release will run without recompilation on any Qt 5.x.y release. 
Additionally, any programs compiled with Qt 5.0.y will also run without recompilation on Qt 5.0.0.</li><li><strong>Patch Releases</strong>: the Qt 5.0.y releases, to be had in the second half of this year, fixing issues reported, but not adding new features.</li></ul><p>There should be only one alpha release, sometime next month. There may be multiple beta releases, as time progresses and issues are fixed. The point of a beta is to find more issues, so we need to release often for our users to give feedback. There&#8217;s also likely going to be only one release candidate, but it&#8217;s possible to have more than one as we find issues. And ideally, the final release should be just the last RC rebadged, but history shows we will add a few minor fixes between the two.</p><p>This process may not be followed exactly as I listed, though. Given the number of important new features, Lars has said that he might accept new features past the freeze date, provided we can see that there is progress. In other words, we will not wait for features we&#8217;re not certain will be delivered soon.</p><p>Finally, this process applies <strong>only</strong> to Qt 5.0. The process for Qt 5.1 and onwards should be different. For one thing, those releases will not have BC breakages, so the provisions relating to BC will not apply. For another, we plan to put in place a different branching model (subject for another blog) and keep the Qt Project maintainers true to their duty of &#8220;code is always ready for beta,&#8221; meaning that the feedback we&#8217;re scheduling for the period between alpha and beta right now should happen before the feature is accepted into the mainline.</p><p>Happy hacking.</p><h3>Footnotes</h3><ol><li><a
id="foot-1" name="foot-1"></a>The list presented is the one I <a
href="http://lists.qt-project.org/pipermail/development/2011-December/000890.html">sent to the mailing list</a> in December. Lars agreed to it and no one else challenged it.</li><li><a
id="foot-2" name="foot-2"></a>The current reference platforms for Qt are: Windows; Mac OS X 10.6 and above, using Cocoa; Linux using XCB; and Linux using Wayland.</li></ol> ]]></content:encoded> <wfw:commentRss>http://www.macieira.org/blog/2012/01/qt-temperatures-drop-from-january-to-june/feed/</wfw:commentRss> <slash:comments>2</slash:comments> </item> </channel> </rss>
