«

»

Mar 28

Restricting what you can do

I usually write about C++, since it’s the programming language that I use on my daily work. Today, however, I’m talking about its nearest cousin: C. In specific, about a certain keyword introduced by the C99 standard, which was issued over 12 years ago. Usually, the C standard plays catch-up with the C++ standard (like the C11 standard bringing some C++11 features to C), but each new issue brings a few new things that C++ doesn’t have yet. This cross-pollinisation by the two standard teams is very welcome.

The one I’m thinking of today is one that, interestingly, has not been added to C++ yet, though many compilers support it. If you’ve paid attention to the blog title, you may realise I’m talking about the restrict keyword.

Raise your hand if you’ve seen it before. Now only the people who have seen it outside of the C library headers on their systems. Not many, eh?

What does restrict do?

The keyword appears defined in the C99 (N1256) and C11 (N1570) standards in section 6.7.3 “Type qualifiers” and 6.7.3.1 “Formal definition of restrict”, which, as usual, is barely readable for us. The Wikipedia definition is better:

The restrict keyword is a declaration of intent given by the programmer to the compiler. It says that for the lifetime of the pointer, only it or a value directly derived from it (such as ​pointer + 1​) will be used to access the object to which it points.

Well, so what? Why do we need a keyword for that? Well, clearly it’s not just something that the programmer says — otherwise, we’d only write it in the documentation. The Wikipedia text continues by saying that “[t]his limits the effects of pointer aliasing“.

That should now tell you something. At least, it should bring you back some memories of compiler warnings about “dereferencing type-punned pointer does break strict aliasing”.

The dreaded strict aliasing (or where it lacks)

The C and C++ standards say that pointers of different types do not alias each other. That’s the strict aliasing, which you often break by dereferencing type-punned pointers. I’ve talked about this in the past, I think. In any case, what matters to us here is when the pointers are allowed to alias each other. Since the C99 standard couldn’t very well go and change a basic principle of the C90 standard, they instead created a keyword to allow the programmer to declare when aliasing will not happen.

The simplest example is the following pair of functions from the C library (copied verbatim from glibc’s string.h header):

/* Copy N bytes of SRC to DEST.  */
extern void *memcpy (void *__restrict __dest,
		     __const void *__restrict __src, size_t __n)
     __THROW __nonnull ((1, 2));
/* Copy N bytes of SRC to DEST, guaranteeing
   correct behavior for overlapping strings.  */
extern void *memmove (void *__dest, __const void *__src, size_t __n)
     __THROW __nonnull ((1, 2));

Note the difference: memcpy uses the restrict keyword, whereas memmove does not but does say that it is correct for overlapping strings.

Implementing memcpy and memmove

Let’s try and implement these two functions to see if we understand what the keywords mean. Let’s start with memcpy, which is very simple at first approach and you must have written its equivalent hundreds of times already:

// C99 code
void *memcpy(void * restrict dest, const void * restrict src, size_t n)
{
    char *d = dest;
    const char *s = src;
    size_t i;
    for (i = 0; i != n; ++i)
        d[i] = s[i];
    return dest;
}

Having written that, we wonder: why do we need memmove at all? The comment in the header talks about “overlapping strings” and that’s where the code above has an issue. What if we tried to memcpy(ptr, ptr + 1, n)? In the first iteration of the loop above, the byte copied would overwrite the second byte to be read — or worse.

For that reason, the simplest memmove is usually implemented as:

void *memmove(void *dest, const void *src, size_t n)
{
    char *d = dest;
    const char *s = src;
    size_t i;
    if (d < s) {
        for (i = 0; i != n; ++i)
            d[i] = s[i];
    } else {
        i = n;
        while (i) {
            --i;
            dst[i] = src[i];
        }
    }
    return dest;
}

Improving the code

If we know that the two pointers do not alias each other, we can do some more interesting things to optimise the copying performance. The first thing we can try is to increase the stride. That is, copy more than one byte at a time, like so:

// C99 code
void *memcpy(void * restrict dest, const void * restrict src, size_t n)
{
    int *di = dest;
    const int *si = src;
    char *d = dest;
    const char *s = src;
    size_t i;
 
    for (i = 0; i != n / sizeof(int); ++i)
        di[i] = si[i];
    i *= sizeof(int);
    for ( ; i != n; ++i)
        d[i] = s[i];
 
    return dest;
}

The above code first copies the data in int-size chunks, then copies the remaining 1 to 3 bytes one byte at a time (epilog copy). It’s more efficient than the original code on architectures where unaligned loads and stores are efficient, or when we know both pointers to be aligned to the proper boundary. In those cases, since we have fewer iterations to execute, the copying is usually faster.

We can definitely improve this code further, by using for example 64-bit loads and stores in architectures that support them, applying this to all architectures by aligning the two pointers if possible in a prolog copy, unrolling the prolog and epilogs, or use Single Instruction Multiple Data instructions that the architecture may have.

Note that this is only possible because this is memcpy, not memmove. For the latter function, if we wanted to increase the stride, we would need to additionally check that the distance between the two pointers is at least the size of the chunk of data copied per iteration. Doing that is left as an exercise for the reader.

I’m lazy

Now, I said above that the only reason why there’s a language keyword in the first place is so that the compiler can optimise better. Well, that’s exactly what it does. Unfortunately, it’s easy to prove this straight-away with assembly code, as we’re depending on optimisations performed by the compiler, which change over time and are implemented differently in each one. For example, if I use the Intel Compiler on the original memcpy function, it will insert a call to _intel_fast_memcpy if the pointers aren’t suitably aligned or the copy size isn’t big enough. GCC, on the other hand, will insert a prolog to align one of the pointers.

What is interesting to note is that the presence of the restrict keyword, everything else being the same, does cause different code generation. With GCC, the output without the keyword contains a couple of instructions comparing the dest pointer to src + 16 and only if the two pointers don’t overlap in the first 16 bytes will it execute SSE2 16-byte copies. ICC is even more extreme: without the keyword, the code generated for memcpy does only byte-sized copies.

In other words, the keyword is being used: when the compiler knows the two blocks don’t overlap, it can generate better code.

Page optimized by WP Minify WordPress Plugin