Yesterday, one of my contributions to Qt was merged, finally adding better support for optimised raster painting on Windows with SSE2 and AVX instructions. This feature has long been present on Unix systems, but it was somewhat lacking on Windows.
If you’ve read my past blog posts, you know I often talk about and work on Single Instruction Multiple Data (SIMD) improvements. The idea is quite simple: if you have a lot of identical operations to do and the data items are independent of one another, you can execute all of those operations in parallel, improving the throughput (processors are optimised for loading chunks of memory of a certain size, so if we only use small quantities, we’ve wasted resources). In the past, I’ve mostly worked on SIMD for string operations, like comparison, searching, and conversion to and from Latin-1. That’s sometimes unrewarding because strings are quite small, so we don’t get the full gain of the improved throughput.
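To make the idea concrete, here is a minimal sketch of the "many identical, independent operations" pattern. It is not code from Qt: the function name addSaturated and the SKETCH_HAVE_SSE2 macro are made up for this illustration. It adds two byte buffers with saturation, 16 bytes per instruction when SSE2 is available at compile time, and falls back to a plain loop for the tail or on other targets.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption for this sketch: these macros are how we guess that SSE2
// intrinsics are usable at compile time (GCC/Clang and MSVC respectively).
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#  include <emmintrin.h>
#  define SKETCH_HAVE_SSE2 1
#endif

// Hypothetical helper: out[i] = min(a[i] + b[i], 255) for n bytes.
void addSaturated(const std::uint8_t *a, const std::uint8_t *b,
                  std::uint8_t *out, std::size_t n)
{
    std::size_t i = 0;
#ifdef SKETCH_HAVE_SSE2
    // Process 16 independent bytes per iteration with one saturating add.
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i *>(a + i));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i *>(b + i));
        _mm_storeu_si128(reinterpret_cast<__m128i *>(out + i),
                         _mm_adds_epu8(va, vb));
    }
#endif
    // Scalar loop: handles the leftover bytes, or everything without SSE2.
    for (; i < n; ++i) {
        unsigned sum = unsigned(a[i]) + unsigned(b[i]);
        out[i] = static_cast<std::uint8_t>(sum > 255 ? 255 : sum);
    }
}
```

With a 20-byte buffer, the first 16 bytes go through the vector loop and the last 4 through the scalar one. That also shows why short strings are unrewarding: most of their bytes end up in the tail.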
But you might not know that SIMD in Qt actually started in the QtGui library, in the raster drawing code. There, the data sizes are often on the order of several kilobytes to multiple megabytes: a tiny 16×16 icon has 256 pixels, each of which is 4 bytes wide, which adds up to 1 kB; you reach 1 MB at 512×512. As you might gather, even copying such data blocks is a somewhat expensive operation. So it’s no wonder that the more common operations, such as compositing and alpha blending, needed optimisation. I cannot claim credit for that work: it was done by very talented hackers working at Trolltech back in the day.
My history with the drawing code started about 6 months ago, during the last romjul, when I realised that the optimisations applied to the raster painting code could use some love. Back then, we were still mixing MMX code into the painting code, even when we reported we were using SSE. In fact, when Qt said it was enabling SSE (not SSE2), it was actually just using some new instructions that came with SSE, but on the old MMX technology registers. My first action in that area for Qt 5.0 was to finally remove support for the old MMX-era optimisations, all of which only increased the code size in a Qt build but weren’t used anywhere. The next level of optimisations (SSE2 and above) overrode the older ones; remember that all 64-bit capable processors have SSE2 support.
Another thing I noticed back then was that we weren’t using the full extent of the optimisations possible. With GCC, we were forced to pass some extra compiler options so that GCC would allow us to use the intrinsic functions for SSE2 and SSSE3, but that was not the case for the Microsoft compiler. In addition, the Windows configuration did not try to use the intrinsics to verify that they were really available; it simply checked for the presence of the header that usually declares them. What’s more, those checks had not been updated for the SSSE3 optimisations that were done in 2010 in cooperation with Intel, which meant that those optimisations were disabled on Windows.
On Unix, right after removing the old MMX-era code, I proceeded to a very quick and easy gain: adding AVX support, the new generation of SIMD instructions from Intel. It was easy because I barely had to write code: if you compile SSE2-era code with GCC’s -msse2avx option (which is automatically enabled by -mavx), it will generate the code using the new AVX instructions. The advantage lies in the fact that the AVX instructions use a new encoding mechanism (called the VEX prefix) which lets an instruction specify an additional register, allowing the compiler to use fewer instructions to accomplish the same goal. Using the expanded 256-bit registers will have to wait for AVX2, coming next year.
Except that even this easy improvement had never come to Windows. Until now.
To enable Windows support, I had to update the way that the configuration detected the capabilities of the compiler, which is what took most of my time: dealing with building on Windows and with the binary configure.exe is not exactly my forte. Now, like on Unix, the Windows configuration will ask the compiler to try to compile some code. The checks are now shared with Unix, so we have the full range of checks available: SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, and AVX2. Previously, the only one that remained after I removed the MMX-era checks was SSE2.
Another update I made was to tell the Microsoft compiler to improve the code it generates. Since it did not require special compiler options to enable its support for SSE2, no one had thought until now to pass it the /arch:SSE2 option. Like on Unix, we now pass this option to the compiler whenever we’re compiling code that uses SSE2 anyway, making the compiler use the extended instruction set for generic code, not just for what we wrote with intrinsics. From there, adding support for /arch:AVX was trivial: if you have Microsoft Visual C++ 10.0 or higher (it comes with Visual Studio 2010), you also now get the AVX-era instructions, and Qt will enable them at runtime if it detects that your processor has them.
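For readers who haven’t seen runtime selection before, the general pattern can be sketched as follows. This is an illustration under assumptions, not Qt’s actual detection or dispatch code: every name here (FillFunc, fillGeneric, fillAvxStandIn, cpuHasAvx, selectFill) is hypothetical, and the CPU probe uses the GCC/Clang builtin __builtin_cpu_supports, which is not available with the Microsoft compiler (there the sketch simply reports no AVX). In a real build, the AVX variant would live in a separate source file compiled with -mavx or /arch:AVX; here it is only a stand-in that calls the generic code.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: we can only probe the CPU with the GCC/Clang builtin on x86.
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
#  define SKETCH_CAN_PROBE_CPU 1
#endif

// Hypothetical signature for a "fill a scanline with one colour" routine.
using FillFunc = void (*)(std::uint32_t *dest, std::size_t count, std::uint32_t color);

// Baseline version, safe on every processor.
static void fillGeneric(std::uint32_t *dest, std::size_t count, std::uint32_t color)
{
    for (std::size_t i = 0; i < count; ++i)
        dest[i] = color;
}

// Stand-in for a version that would be built with AVX code generation
// in its own translation unit; here it just forwards to the baseline.
static void fillAvxStandIn(std::uint32_t *dest, std::size_t count, std::uint32_t color)
{
    fillGeneric(dest, count, color);
}

// Returns true only if the processor (and OS) report AVX support.
bool cpuHasAvx()
{
#ifdef SKETCH_CAN_PROBE_CPU
    return __builtin_cpu_supports("avx") != 0;
#else
    return false; // unknown compiler or architecture: stay on the safe path
#endif
}

// Pick the implementation once; callers then go through the pointer.
FillFunc selectFill()
{
    return cpuHasAvx() ? fillAvxStandIn : fillGeneric;
}
```

The point of the design is that the AVX-compiled code is never executed on a processor that lacks the instructions, while the same binary still runs everywhere.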
I’m not done. I also have a couple of other quick wins in terms of performance, all by improving code generation. Those changes are a bit more complex than the previous ones and I haven’t cleaned them up properly after 6 months of rebasing. I hope to add them to Qt 5.1 soon after its branch opens.