Jul 13

forkfd part 2: Finding out that a child process exited on Unix

In my previous blog post, I said that the solutions we’ve got implemented on Linux are a good start, but not the full solution. We can start a child process properly, but we still can’t properly find out when it has exited.

Linux or nothing

First of all, let me get one thing straight: from this point on, I’ll only be thinking about Linux. If you’re running anything else, I don’t care about you. But you may continue reading anyway.

I have a couple of reasons for doing that, one of them being that Linux is the easiest place to get the new API I’m proposing accepted. It’s probably just as easy to get the open source BSDs to do the same, but not so much the commercial Unixes, which would involve a long lead time, product management, NDAs, etc.

But the most important reason is that all of those other Unixes are years behind Linux on the state of the art. If your OS hasn’t done its homework over the past 4 years and introduced the API I mentioned in the previous blog post, then I don’t care about that OS.

More to the point: the problem I am trying to solve is related to multithreading and the race conditions that are involved in such a scenario. Without those APIs, there’s no possibility of thread-safety anyway.

How to be notified of a child process exiting

There are two ways of being notified that a child process has exited: a blocking (synchronous) and a non-blocking (asynchronous) method. The blocking one is fairly simple: a call to waitpid without the WNOHANG flag (i.e., “do hang”). On Linux, the waitpid call is backed by the kernel system call of the same name, so we know the kernel is doing the right thing.
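As a minimal sketch of the blocking method, assuming a helper I’m calling run_child_and_wait (the name is made up for illustration), the caller forks a child and then sits in waitpid until that specific child exits:

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that exits with `code`, then block
 * until that specific child has exited. Returns the child's exit code,
 * or -1 on error. */
static int run_child_and_wait(int code)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);                /* child: exit immediately */

    int status;
    pid_t ret;
    do {
        /* options == 0, i.e. no WNOHANG: the calling thread blocks here */
        ret = waitpid(pid, &status, 0);
    } while (ret < 0 && errno == EINTR);    /* retry if a signal interrupted us */

    if (ret != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

The thread that calls this helper can do nothing else until the child is gone.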

The problem with the blocking API is that, as it turns out, it’s blocking. It’s unsuitable to be run in the same thread that is handling user interaction and painting in a GUI application. If you want to use it, you need to start a thread for it. To make matters worse, to implement this in a general-purpose library like Qt or Glib, you’d need to start one thread per child process.

The waitpid function can be used to wait for one of three things: one specific child process given by its PID, one specific process group given by its PGID, or all child processes. The process group idea looks interesting at first, since each library could create one such group and move all the child processes it cares about into it. However, process groups have other purposes and side-effects, including the fact that the child process can change its session ID and process group, which excludes them from a generic solution. And clearly waiting for any and all child processes is not acceptable for a generic library, for it cannot know whether there are processes started outside of its control, like when both Qt and Glib are being used.
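To make the “any child” problem concrete, here is a sketch with two children standing in for processes started by two different libraries (“library A” and “library B” are hypothetical, as is the function name). The code acting for library A waits for any child and ends up reaping the one that belongs to library B:

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if a wait-for-any call made on behalf of "library A" reaped
 * the child that belongs to "library B", 0 if not, -1 on error. */
static int wait_any_steals_other_child(void)
{
    int status;

    pid_t child_a = fork();         /* started by hypothetical library A */
    if (child_a < 0)
        return -1;
    if (child_a == 0) {
        for (;;)
            pause();                /* long-running: never exits on its own */
    }

    pid_t child_b = fork();         /* started by hypothetical library B */
    if (child_b < 0) {
        kill(child_a, SIGKILL);
        waitpid(child_a, &status, 0);
        return -1;
    }
    if (child_b == 0)
        _exit(0);                   /* exits right away */

    /* Library A waits for "any child", hoping to hear about child_a... */
    pid_t reaped = waitpid(-1, &status, 0);

    /* ...but the only child that has exited is library B's. */
    int stolen = (reaped == child_b);

    /* Clean up the long-running child so the sketch leaves no zombie. */
    kill(child_a, SIGKILL);
    waitpid(child_a, &status, 0);
    return stolen;
}
```

Once library A has consumed that exit status, library B can no longer obtain it: a later waitpid on child_b’s PID fails with ECHILD.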

Problem two: chaining signal handlers

That leaves us with the asynchronous method of being notified of a child process exiting, which is done via POSIX signals. More specifically, by the delivery of the SIGCHLD signal (also spelt SIGCLD). And here we run into a series of problems that aren’t solved today.
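Before getting to those problems, here is a sketch of the basic mechanism when a single piece of code owns SIGCHLD in a single-threaded program. The handler only sets a flag, and the child is then reaped with a non-blocking waitpid. The names are mine, and the sigsuspend loop is just a stand-in for an event loop going idle:

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile sig_atomic_t got_sigchld = 0;

static void on_sigchld(int signum)
{
    (void)signum;
    got_sigchld = 1;                /* only async-signal-safe work in here */
}

/* Hypothetical helper: returns the child's exit code, or -1 on error. */
static int spawn_and_reap_async(int code)
{
    struct sigaction sa, old;
    sigset_t block, prev, wait_mask;
    int result = -1;

    sa.sa_handler = on_sigchld;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_NOCLDSTOP;     /* only notify on exit, not on stop */
    if (sigaction(SIGCHLD, &sa, &old) != 0)
        return -1;

    /* Keep SIGCHLD blocked except while idle, so it cannot slip in
     * between checking the flag and going to sleep. */
    sigemptyset(&block);
    sigaddset(&block, SIGCHLD);
    sigprocmask(SIG_BLOCK, &block, &prev);

    got_sigchld = 0;
    pid_t pid = fork();
    if (pid == 0)
        _exit(code);

    if (pid > 0) {
        wait_mask = prev;
        sigdelset(&wait_mask, SIGCHLD);
        while (!got_sigchld)
            sigsuspend(&wait_mask); /* sleeps until a signal is delivered */

        int status;
        /* WNOHANG: never blocks; the child has already exited by now */
        if (waitpid(pid, &status, WNOHANG) == pid && WIFEXITED(status))
            result = WEXITSTATUS(status);
    }

    sigprocmask(SIG_SETMASK, &prev, NULL);
    sigaction(SIGCHLD, &old, NULL);
    return result;
}
```

This works only because the sketch assumes it is the sole user of SIGCHLD and that there is just one thread.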

The first of them is that there’s no thread-safe way of installing a signal handler in a generic library. The signal function can be used to install a signal handler. This function is fine if the handler is installed by the application developer, as in the case of handling SIGINT (the signal that Ctrl+C sends) or SIGTERM and performing some clean-ups.

But it’s not acceptable for a generic library. Again let’s take the case of an application using both Qt and Glib, either directly or indirectly. Since both libraries need to install a signal handler, it stands to reason that one handler must somehow call the other to let it do its work. That means signal is out.

Fortunately, there’s sigaction: this system call not only installs a new handler, it also returns the old one, so one handler can call the other. There’s a multithreading problem here, but let’s look into that later. The handler would look something like this:

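#include <signal.h>

/* Filled in by sigaction() when our handler is installed. */
static struct sigaction old_sigaction;

static void sigchld_handler(int signum)
{
    /* my code goes here */

    /* Chain to the previous handler, if it was a real function. */
    if (old_sigaction.sa_handler != SIG_IGN
            && old_sigaction.sa_handler != SIG_DFL)
        old_sigaction.sa_handler(signum);
}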
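For completeness, here is a sketch of the installation step and of the chain in action. It uses its own names (install_handler, earlier_handler and so on are made up), plays the part of “some other library” itself, and assumes the earlier handler was installed without SA_SIGINFO, so that sa_handler is the right member to call:

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

static volatile sig_atomic_t earlier_calls = 0;
static volatile sig_atomic_t our_calls = 0;

static struct sigaction saved_action;   /* whatever was installed before us */

/* Stand-in for a handler that some other library installed first. */
static void earlier_handler(int signum)
{
    (void)signum;
    earlier_calls++;
}

static void chaining_handler(int signum)
{
    our_calls++;
    /* Chain only to a real function, never to SIG_DFL or SIG_IGN. */
    if (saved_action.sa_handler != SIG_IGN
            && saved_action.sa_handler != SIG_DFL)
        saved_action.sa_handler(signum);
}

/* Install `handler` for SIGCHLD and store the previous action in *old. */
static int install_handler(void (*handler)(int), struct sigaction *old)
{
    struct sigaction sa;
    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(SIGCHLD, &sa, old);
}

/* Returns 1 if one signal reached both handlers, 0 if not, -1 on error. */
static int demo_chain(void)
{
    struct sigaction original;

    if (install_handler(earlier_handler, &original) != 0)
        return -1;
    if (install_handler(chaining_handler, &saved_action) != 0)
        return -1;

    /* In a single-threaded program, raise() returns only after the
     * handler has run. */
    raise(SIGCHLD);

    sigaction(SIGCHLD, &original, NULL);    /* put things back */
    return our_calls == 1 && earlier_calls == 1;
}
```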

Installation can be achieved with this API, but how about uninstallation? That’s where the unsolved problem lies. The first solution that comes to mind, a gross one and the simplest possible, is to ignore uninstallation and simply decree that the handler will remain there until the process exits completely.

That brute-force solution breaks down the moment we introduce unloading of libraries. Now, you may remember my saying that library and plugin unloading are a bad idea and should be avoided, because they create a whole host of problems. Yes, that’s true, and this is one of them. So I do recommend that you avoid unloading libraries and plugins in your applications. However, we’re looking for a generic solution here, so we must take unloading into account.

During the library unloading process, the signal handler must be uninstalled. The way to uninstall it is to install something else. But what? The options that come to mind are:

  1. install the default signal handler, SIG_DFL; or
  2. ignore the signal, by installing SIG_IGN; or
  3. install the previous signal handler, the one we saved from the sigaction call.

Unless you’re trying to be funny or you see some other problem that I don’t, you’ll suggest the third option, right? Therefore, we uninstall our handler like this:

    sigaction(SIGCHLD, &old_sigaction, NULL);

Can you see the problem?

What happens if our handler is not the currently-installed handler? That could happen if the de-initialisation order is different from the initialisation one. As a concrete example, imagine an application that uses Qt, so QtCore got loaded at process start and will not be unloaded until the process exits. Then the application does this, in order:

  1. it loads a plugin that uses Glib;
  2. the plugin uses the g_spawn_async function, causing Glib to install its SIGCHLD handler;
  3. the application uses QProcess, causing Qt to install its SIGCHLD handler;
  4. the Glib-using plugin is unloaded and Glib tries to uninstall its handler.

At this point, Glib will uninstall Qt’s handler too, rendering QProcess unusable. The current code in qprocess_unix.cpp tries to work around this problem by doing:

    struct sigaction currentAction;
    ::sigaction(SIGCHLD, 0, &currentAction);
    if (currentAction.sa_handler == qt_sa_sigchld_handler) {
        ::sigaction(SIGCHLD, &qt_sa_old_sigchld_handler, 0);
    }

Let’s ignore for a moment the fact that the above code is neither thread-safe nor async signal-safe. Another thread could be trying to install a handler at the same time, and a SIGCHLD could be delivered in-between the two calls to sigaction. Let’s ignore it because we have a bigger problem: if Qt’s handler isn’t the topmost handler installed, Qt is forced to leave its handler installed. And if QtCore is about to be unloaded, there’s a very big chance that the next SIGCHLD delivery will crash the application!
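Both failure modes can be reproduced in a few lines. In this sketch, “g” and “q” are two hypothetical libraries that each use the chaining pattern. For brevity, “g” gets both a naive uninstall (blindly restoring what it saved) and a guarded one shaped like the qprocess_unix.cpp snippet above:

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

typedef void (*handler_fn)(int);

static struct sigaction g_old, q_old;   /* each library's saved action */

static void g_handler(int signum)
{
    if (g_old.sa_handler != SIG_IGN && g_old.sa_handler != SIG_DFL)
        g_old.sa_handler(signum);
}

static void q_handler(int signum)
{
    if (q_old.sa_handler != SIG_IGN && q_old.sa_handler != SIG_DFL)
        q_old.sa_handler(signum);
}

static void install(handler_fn handler, struct sigaction *old)
{
    struct sigaction sa;
    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGCHLD, &sa, old);
}

static handler_fn current_handler(void)
{
    struct sigaction cur;
    sigaction(SIGCHLD, NULL, &cur);     /* query only, install nothing */
    return cur.sa_handler;
}

/* Naive uninstall: restore whatever "g" saved at install time. */
static void g_uninstall_naive(void)
{
    sigaction(SIGCHLD, &g_old, NULL);
}

/* Guarded uninstall: returns 1 if the handler was removed, 0 if it had
 * to be left in place because it is not the topmost one. */
static int g_uninstall_guarded(void)
{
    if (current_handler() != g_handler)
        return 0;
    sigaction(SIGCHLD, &g_old, NULL);
    return 1;
}

/* Returns 1 if the naive uninstall of "g" also removed "q"'s handler. */
static int naive_uninstall_drops_q(void)
{
    struct sigaction original;
    sigaction(SIGCHLD, NULL, &original);
    install(g_handler, &g_old);         /* "g" initialises first */
    install(q_handler, &q_old);         /* then "q" */
    g_uninstall_naive();                /* "g" goes away first */
    int dropped = (current_handler() != q_handler);
    sigaction(SIGCHLD, &original, NULL);
    return dropped;
}

/* Returns 1 if the guarded uninstall leaves g_handler in "q"'s chain. */
static int guarded_uninstall_leaves_g(void)
{
    struct sigaction original;
    sigaction(SIGCHLD, NULL, &original);
    install(g_handler, &g_old);
    install(q_handler, &q_old);
    int removed = g_uninstall_guarded();
    int still_chained = (current_handler() == q_handler
                         && q_old.sa_handler == g_handler);
    sigaction(SIGCHLD, &original, NULL);
    return !removed && still_chained;
}
```

In the second case, “q” still holds a pointer to g_handler. If the code for “g” were really unmapped at that point, the next SIGCHLD would jump into unmapped memory.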

Problem three: thread-safety in sigaction

This problem exists not because of the API, but because of the implementation. I am assuming that the kernel side of sigaction is correctly implemented and that it takes the proper locks in case two threads of the same process try to install signal handlers for the same signal at the same time. Let’s also assume that the kernel does not allow a signal to be delivered to the process while it is modifying its own signal-handler structures.

The problem exists in userland because glibc’s struct sigaction is different from the one in the kernel’s ABI. That forces glibc to allocate a local (stack-based) structure so its address can be passed to the system call. Upon return, it needs to copy the contents into our old_sigaction variable.

Do you see the problem? There’s a race condition there.

A signal could be delivered to the process after the kernel returned to user-space, but before the glibc code could copy the contents to our variable. That means our newly-installed signal handler could be called before old_sigaction was filled in, so the old handler would not get called as it should be. And chances are that the signal being delivered wasn’t meant for our handler anyway; after all, you’d have problems in your code if you could receive your own SIGCHLD before your handler was ready.

In reality, it is actually worse than the above description: since struct sigaction is not filled in atomically, our signal handler could see a partially-filled structure. And in any case, since glibc’s code allocates it on the stack and does not pre-fill it with zeroes before the system call, if the handler is called with the chain link not completely filled, there’s a good chance that the chain call will end up at a garbage address.
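One partial mitigation, along the lines of what is discussed in the comments below, is to publish the chain link through a real atomic flag so that the handler never follows a half-written pointer. This is only a sketch with made-up names, using C11 atomics. It does not close the race: a signal arriving inside the window still skips the old handler, and a second thread installing a handler at the same time is still a problem.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdatomic.h>
#include <stddef.h>

static struct sigaction chained_action; /* static storage: starts zeroed */
static atomic_int chained_action_ready; /* stays 0 until the link is complete */
static volatile sig_atomic_t chained_calls = 0;

static void guarded_handler(int signum)
{
    /* The acquire load pairs with the release store in install_guarded(). */
    if (!atomic_load_explicit(&chained_action_ready, memory_order_acquire))
        return;     /* link not published yet: skip it, don't jump to garbage */

    if (chained_action.sa_handler != SIG_IGN
            && chained_action.sa_handler != SIG_DFL)
        chained_action.sa_handler(signum);
}

static int install_guarded(void)
{
    struct sigaction sa;
    sa.sa_handler = guarded_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGCHLD, &sa, &chained_action) != 0)
        return -1;
    /* Publish the link only after sigaction() has finished writing it. */
    atomic_store_explicit(&chained_action_ready, 1, memory_order_release);
    return 0;
}

/* Stand-in for a handler that was installed before ours. */
static void previous_handler(int signum)
{
    (void)signum;
    chained_calls++;
}

/* Returns 1 if the handler skips the chain inside the window and follows
 * it afterwards, 0 if not. */
static int demo_guarded_chain(void)
{
    struct sigaction prev, original;
    prev.sa_handler = previous_handler;
    sigemptyset(&prev.sa_mask);
    prev.sa_flags = 0;
    sigaction(SIGCHLD, &prev, &original);

    guarded_handler(SIGCHLD);           /* simulates delivery inside the window */
    int during_window = chained_calls;  /* chain was skipped */

    install_guarded();
    raise(SIGCHLD);                     /* normal delivery after installation */
    int after_install = chained_calls;  /* chain was followed */

    sigaction(SIGCHLD, &original, NULL);
    return during_window == 0 && after_install == 1;
}
```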

Proposed solution for problem three

The solution for this problem is simple: glibc must not memcpy. That means the userspace struct sigaction must be identical to the kernel’s. And therein lies another problem: to change that structure now, we’d have to break the C library ABI. It can be done with ELF versioning, but that does not eliminate the ABI change, which could still cause trouble if there were code that shared struct sigaction across libraries.

However, there’s no solution that I can see for problem two and it’s a serious issue. In the next blog post, I’ll explore the solution that Qt has used for a few years and the problems with it.

4 comments

1 ping

  1. Plop

    Why not use signalfd?

  2. Thiago Macieira

    @Plop: see part of the discussion on my G+ stream.

    As a commenter there explained:

    1. If you create two signalfd’s for the same signal, calling read() on either one would consume the signal. i.e. the second read() would fail (non-blocking fd) or block.
    2. A signal has to be blocked in the signal mask of every thread in the process for it to be delivered by signalfd. An old, SIGCHLD-using libA and a new, signalfd()-using libB would fight over the signal mask.

    And I added:

    If two threads are blocked on waiting for a child process to exit, then both would be sleeping on a select(2) on that fd. Which one gets woken up is undetermined (possibly both!), which means that one thread needs to communicate to the other that its child may have exited.

  3. Olivier

    For problem three (thread-safety in sigaction), can’t you try the following (imperfect) solution:

    1. Declare a variable sig_atomic_t initialized;
    2. Assign 0 to initialized.
    3. Install the handler with sigaction and save the current action in (say) oldAction.
    4. Assign 1 to initialized.
    5. In the handler, read initialized. If it is 1, chain to oldAction.

    Of course the problem is that if the handler is executed before step 4, then the old action is not called; still, isn’t it better than chaining to a garbage address?

    One could also add the following step between steps 2 and 3:
    2.5. Call sigaction to save the current action in (say) oldActionTemp, but do not install the new handler.
    Then step 5 would become:
    5. In the handler, read initialized. If it is 1, chain to oldAction; otherwise chain to oldActionTemp.
    This solution is actually worse since some other thread might uninstall oldActionTemp between steps 2.5 and 3, and chaining to it would have potentially disastrous consequences.

  4. Thiago Macieira

    @Olivier: Either way, it’s still not thread-safe. We could have that sig_atomic_t extra to indicate whether the old handler is set, but it’s not thread-safe. The other option is worse because another thread could install another handler in-between and we’d lose it.

    By the way, sig_atomic_t is enough for atomicity with signals, but not with multithreading. For that, you need C11 or C++11 and a real atomic, with a store-release operation. My code uses the proper semantics.

  1. forkfd part 1: Launching processes on Unix » Thiago Macieira's blog

    [...] Continue using QPointerforkfd part 2: Finding out that a child process exited on Unix » Jul 13 forkfd part 1: Launching processes on Unix Categories: Algorithms, Linux, [...]
