Jul 21

forkfd part 4: proposed solutions

Last week, I wrote three blogs about the situation with starting child processes on Unix and being notified of their exit. I raised several problems with the current implementation, which I have tried to solve and for which I now have a proposal. If you haven’t yet, you should take some time to read the previous three blogs.

The road so far

I explained in the first blog how one launches processes on Unix, by way of the fork and execve system calls, and the problem associated with file descriptors being inherited without closing in the child processes. I also showed how Linux has solved this problem. Since no other Unix system has yet done the same, they are excluded from the discussion going forward. They need to be brought into the 2010s first and leave the 1970s behind.

In the second blog, I went over the contortions required to be notified that a child process has exited, which uses the SIGCHLD signal. Designed in the early Unix times, signal handlers have conceptual problems with two modern requirements: libraries and multi-threading. And in the third blog, I explained the requirements that QProcess presents and how Qt has tried so far to solve those problems.

Unfortunately, there are two issues that can’t be solved. One is the race condition involved in the installation of the SIGCHLD signal handler, and the other is its uninstallation when the library is being unloaded. With the current API, unless I missed something, it’s not possible to do this cleanly. That leads me to the conclusion that signal handlers should really have been left in the 1970s.

The solution I propose

The solution I’d like to see implemented requires another change to the Linux kernel. Attentive readers may have guessed what I want by the title of the blog: I want a new system call named forkfd. Similar to Linux additions like signalfd, timerfd_create and eventfd, this would be a function that opens a new file descriptor.

Its man page would be something like the following:

forkfd - create a child process and a file descriptor for being notified of its exit
int forkfd(int flags, pid_t *pid);
forkfd() creates a file descriptor that can be used to be notified of when a child process exits. This file descriptor can be monitored using select(2), poll(2) or similar mechanisms.

The flags parameter can contain the following values ORed to change the behaviour of forkfd():

FFD_NONBLOCK
Set the O_NONBLOCK file status flag on the new open file descriptor. Using this flag saves extra calls to fcntl(2) to achieve the same result.
FFD_CLOEXEC
Set the close-on-exec (FD_CLOEXEC) flag on the new file descriptor. This flag applies to the parent process side of the fork and new processes created after that. The child process created by forkfd() does not have this file descriptor open.

The file descriptor returned by forkfd() supports the following operations:

read(2)
When the child process exits, the buffer supplied to read(2) is used to return information about the status of the child in the form of one siginfo_t structure. The buffer must be at least sizeof(siginfo_t) bytes. The return value of read(2) is the total number of bytes read.
poll(2), select(2) (and similar)
The file descriptor is readable (the select(2) readfds argument; the poll(2) POLLIN flag) if the child has exited or signalled via SIGCHLD.
close(2)
When the file descriptor is no longer required it should be closed.
On success, in the parent process forkfd() returns a new forkfd file descriptor and stores the PID of the child process in *pid; in the child process, it returns FFD_CHILD_PROCESS and sets *pid to zero. On error, -1 is returned and errno is set to indicate the error, with no process being created.

This solution has the following benefits:

  • No signal handler installation or uninstallation is necessary, which avoids both outstanding unfixable issues;
  • If no signal handler is needed, there is no need to start a thread for managing the child process status;
  • Notification is sent via a read notification on a file descriptor, which all event-driven applications know how to handle, plus it matches the requirements that QProcess has in its own waitFor functions;
  • The child process is automatically reaped by the read() call, which avoids the need to call wait or waitpid.
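To make the read side of the proposal a bit more concrete, here is a minimal sketch of how a caller might consume the descriptor. The helper name ffd_wait and its return convention are my own assumptions, not part of the proposed API. It only relies on what the man page above promises: the descriptor polls as readable, and one read(2) yields one siginfo_t.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <signal.h>
#include <unistd.h>

/* Hypothetical helper (not part of the proposal): wait up to timeout_ms for
 * the descriptor to become readable, then read exactly one siginfo_t.
 * Returns 0 on success, 1 on timeout, -1 on error with errno set.
 * Note: the full timeout is restarted if poll() is interrupted. */
int ffd_wait(int ffd, int timeout_ms, siginfo_t *info)
{
    struct pollfd pfd = { .fd = ffd, .events = POLLIN, .revents = 0 };
    int ready;
    ssize_t n;

    assert(info != NULL);
    do {
        ready = poll(&pfd, 1, timeout_ms);
    } while (ready == -1 && errno == EINTR);
    if (ready <= 0)
        return ready == 0 ? 1 : -1;

    do {
        n = read(ffd, info, sizeof *info);
    } while (n == -1 && errno == EINTR);
    if (n != (ssize_t)sizeof *info) {
        if (n >= 0)
            errno = EIO;    /* short read: not a complete siginfo_t */
        return -1;
    }
    return 0;
}
```

Because the helper only needs something readable that delivers a siginfo_t, it works just as well on the reading end of the pipe that the userland implementations below hand out.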

Implementing it in userland

I tried to implement the above function in userland, in pure C using only POSIX calls. The idea was that this code could be used in many different libraries to solve their process-management problems. I came up with three implementations:

Using pthreads

Source code: header, source

The first attempt was a direct rewrite of the QProcess solution in C, using only POSIX calls. The code has a global pthread_mutex_t that protects a doubly-linked list of currently-running processes. It installs the SIGCHLD handler under a mutex lock, creates a pipe, and forks. In the child side of the fork, it closes the pipe and returns the magic constants. In the parent side, it adds the writing end of the pipe and the PID to the list, and returns the reading end.
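The bookkeeping described above can be sketched roughly as follows. This is an illustration only: the names child_entry, children_add and children_take are made up here and the linked source may be organised differently. It shows just the mutex-protected doubly-linked list, not the signal handler installation or the fork itself.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <sys/types.h>

/* Illustrative names only; not taken from the linked source. */
struct child_entry {
    struct child_entry *prev, *next;
    pid_t pid;       /* child being tracked */
    int write_fd;    /* writing end of the pipe handed out to the user */
};

static pthread_mutex_t children_lock = PTHREAD_MUTEX_INITIALIZER;
static struct child_entry *children;    /* head of the list */

/* Parent side, right after fork(). Returns 0, or -1 if out of memory. */
int children_add(pid_t pid, int write_fd)
{
    struct child_entry *e = malloc(sizeof *e);

    assert(pid > 0);
    if (!e)
        return -1;
    e->pid = pid;
    e->write_fd = write_fd;
    e->prev = NULL;

    pthread_mutex_lock(&children_lock);
    e->next = children;
    if (children)
        children->prev = e;
    children = e;
    pthread_mutex_unlock(&children_lock);
    return 0;
}

/* Process manager side, once it knows which PID exited. Unlinks the entry
 * and returns its pipe descriptor, or -1 if the PID is not in the list. */
int children_take(pid_t pid)
{
    int fd = -1;

    pthread_mutex_lock(&children_lock);
    for (struct child_entry *e = children; e; e = e->next) {
        if (e->pid != pid)
            continue;
        if (e->prev)
            e->prev->next = e->next;
        else
            children = e->next;
        if (e->next)
            e->next->prev = e->prev;
        fd = e->write_fd;
        free(e);
        break;
    }
    pthread_mutex_unlock(&children_lock);
    return fd;
}
```

Neither function may be called from a signal handler, since pthread_mutex_lock and malloc are not async-signal-safe. That is why this design needs a separate process manager thread.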

Since I wrote this code before I realised the fatal flaw with SA_SIGINFO, this code is still using it. It writes the siginfo_t structure received in the signal handler to the process manager, by way of a private pipe. That one, in turn, will read the PID from the structure and proceed to write the structure again to the user, via the writing end of the pipe that was saved in the forkfd() call.

This code is fixable, by making the signal handler write one byte to the pipe (or use eventfd) and have the process manager thread loop over the currently-known child processes, calling waitpid on each and synthesising siginfo_t for the user.
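A minimal sketch of the second half of that fix, under my own naming (reap_child is hypothetical): the process manager would call something like this for each PID it knows about after being woken up. It polls the child with waitpid and, if the child has terminated, fills in the siginfo_t fields that the user is going to read. The one-byte write in the signal handler is not shown.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: check one known child without blocking.
 * Returns 1 and fills *info if the child has terminated (this reaps it),
 * 0 if it is still running, -1 on error (for example, no such child). */
int reap_child(pid_t pid, siginfo_t *info)
{
    int status;
    pid_t r;

    assert(info != NULL);
    r = waitpid(pid, &status, WNOHANG);
    if (r <= 0)
        return (int)r;

    /* Synthesise what the kernel would have put in the SIGCHLD siginfo_t. */
    memset(info, 0, sizeof *info);
    info->si_signo = SIGCHLD;
    info->si_pid = pid;
    if (WIFEXITED(status)) {
        info->si_code = CLD_EXITED;
        info->si_status = WEXITSTATUS(status);
    } else if (WIFSIGNALED(status)) {
        info->si_code = CLD_KILLED;
#ifdef WCOREDUMP    /* common extension, not available everywhere */
        if (WCOREDUMP(status))
            info->si_code = CLD_DUMPED;
#endif
        info->si_status = WTERMSIG(status);
    }
    return 1;
}
```

Since only PIDs from the library's own list are passed in, this never steals the exit status of a child started by some other part of the application.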


Lock-free

Source code: same header, source code

The problem with the first implementation, besides relying on SA_SIGINFO, is that it requires pthreads and mutexes. I wanted something lock-free, so I started writing that. I wrote a lock-free structure to replace the doubly-linked list of PID and writing pipe pairs, based on previous experience with Qt’s lock-free timer ID allocator. I haven’t done exhaustive testing on it, but it’s simple enough that it’s probably correct (pending further reviews).

This solution is, unfortunately, still based on SA_SIGINFO (why? because I hadn’t realised it was a problem by then; I only did so when writing the blog). The way it works is that the signal handler will read from this structure and figure out, based on the PID from siginfo_t, which file descriptor to write to. The signal handler also does the necessary waitpid call to reap the child process.

Unfortunately, this solution doesn’t work, since it introduces several race conditions. One is an extremely rare situation, in which the library with this code is being unloaded in one thread while the signal handler is still running in another. More importantly, though, there’s a race condition between the time of the fork and the addition of the PID to the list of children. This condition didn’t happen before because of the mutex: the process manager thread would not read from the list until the forkfd function released the lock, after adding the child process.

This implementation is still salvageable, though. First, it needs to stop relying on SA_SIGINFO, which means it must iterate over all the known children inside the signal handler, doing waitpid calls on each. Second, with the absence of a lock, it must prevent the child process from exiting before its PID and pipe are added to the list. That can be done by adding an extra, blocking pipe between the parent and child process: the child process tries to read() from it, suspending itself, until the parent process releases it by writing something.
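The pipe-based hold described above could look roughly like this. It is a sketch, not code from the linked source, and the names fork_held and release_child are assumptions of mine. The child blocks in read() until the parent has finished its bookkeeping and writes one byte.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>   /* for callers that later waitpid() on the child */
#include <unistd.h>

/* Like fork(), but the child stays blocked in read() until the parent calls
 * release_child() on *release_fd. Returns the PID in the parent, 0 in the
 * child (only after it has been released) and -1 on error. */
pid_t fork_held(int *release_fd)
{
    int gate[2];
    pid_t pid;

    assert(release_fd != NULL);
    if (pipe(gate) == -1)
        return -1;

    pid = fork();
    if (pid == -1) {
        close(gate[0]);
        close(gate[1]);
        return -1;
    }
    if (pid == 0) {
        char byte;
        ssize_t n;

        close(gate[1]);                     /* keep only the reading end */
        do {
            n = read(gate[0], &byte, 1);    /* suspends until released */
        } while (n == -1 && errno == EINTR);
        close(gate[0]);
        if (n != 1)
            _exit(127);     /* parent closed the pipe without releasing */
        return 0;
    }
    close(gate[0]);                         /* keep only the writing end */
    *release_fd = gate[1];
    return pid;
}

/* Parent side: call once the PID and pipe have been added to the list. */
int release_child(int release_fd)
{
    ssize_t n;

    do {
        n = write(release_fd, "", 1);       /* one NUL byte is enough */
    } while (n == -1 && errno == EINTR);
    close(release_fd);
    return n == 1 ? 0 : -1;
}
```

One caveat with this sketch: if the child is killed by a signal while it is still held, the write in release_child() hits a pipe with no reader, so the parent needs SIGPIPE ignored or blocked.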

Adding a spin lock

Source code: same header, source code

The solution I ended up writing to the race conditions of the previous implementation was to add a spin lock (why this and not the pipe lock I described above? Because it hadn’t occurred to me until just now). It’s a step back from the fully lock-free solution, but not all the way back to the pthreads implementation. For one thing, it doesn’t start a thread for the process management. For another, since it implements the spinlock on its own, it can lock inside the signal handler (note that pthread_mutex_lock is not a permitted function inside one).

I just had to be careful about one thing: before locking the spin lock, the calling thread must block SIGCHLD using pthread_sigmask. If it didn’t do that, the signal handler could be called asynchronously in the same thread as the one where the spin lock is locked, producing a deadlock.
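The ordering matters, so here is a small sketch of it. The names are mine and the lock is a plain C11 atomic_flag, which may not match what the linked source does. SIGCHLD is blocked first and the spin lock is taken second; on release, the order is reversed.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <stdatomic.h>

/* Illustrative spin lock protecting the list of children. */
static atomic_flag list_lock = ATOMIC_FLAG_INIT;

/* Thread side: block SIGCHLD in this thread, then spin. *saved receives the
 * previous signal mask so that list_lock_release() can restore it. */
void list_lock_acquire(sigset_t *saved)
{
    sigset_t block;

    assert(saved != NULL);
    sigemptyset(&block);
    sigaddset(&block, SIGCHLD);
    pthread_sigmask(SIG_BLOCK, &block, saved);
    while (atomic_flag_test_and_set_explicit(&list_lock, memory_order_acquire))
        ;   /* held by another thread, or by the handler running elsewhere */
}

/* Thread side: drop the lock first, then let SIGCHLD through again. */
void list_lock_release(const sigset_t *saved)
{
    atomic_flag_clear_explicit(&list_lock, memory_order_release);
    pthread_sigmask(SIG_SETMASK, saved, NULL);
}
```

Inside the signal handler itself only the atomic_flag part is needed, because SIGCHLD is already blocked while its own handler runs (unless the handler was installed with SA_NODEFER).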

Choosing a solution

To be honest, none of the three solutions is ideal. If I had to choose one of them, I’d go for the lock-free one for personal reasons, but the spin-lock one might have fewer bugs in the threading code.

But that’s not what I want. What I really want is that forkfd be implemented in the kernel, so that no signal handler is involved, eliminating the unsolvable problems that those introduce.

If there are any kernel hackers listening in, do you think there’s a chance?


  1. Robert Ancell

    Yes please! That syscall would be great. I hope you can convince someone to implement it.

  2. Robert Ancell

    Also consider making a forkfd_read or similar to make the read simpler and type-safe.

  3. Thiago Macieira

    Some replies from the Google+ post:

    FFD_CLOEXEC semantics should be mandatory

    Since you can’t receive SIGCHLD or use waitpid() to wait for sibling processes, you shouldn’t be able to monitor siblings through forkfd’s either. Make FFD_CLOEXEC semantics mandatory and remove the FFD_CLOEXEC flag from the API.

    Whether FFD_CLOEXEC should be mandatory, I think it depends mostly on the internal implementation details in the kernel. The userland implementation I’ve written only works in the parent process because that’s the one that receives the SIGCHLD signal. But the kernel implementation, who knows? You may be able to pass this forkfd file descriptor to other processes via Unix sockets and let them receive the information.

    Reading forkfd’s should return { pid, wait_status } instead of siginfo_t

    Since you propose that the child process be automatically reaped by the read() call, you should provide a wait status so that the user can use WEXITSTATUS() and WTERMSIG(). Passing a siginfo_t seems odd, as all members besides si_pid seem useless for forkfd()?

    I think siginfo_t is a better solution because it has room for future expansion, if needed, and it contains that information already. For example, the QProcess implementation I’ve written looks like this:

        siginfo_t info;
        qt_safe_read(forkfd, &info, sizeof info);
        exitCode = info.si_status;
        crashed = info.si_code != CLD_EXITED;
  4. Mike Crowe

    Great idea!

    It would be even better if forkfd weren’t the only way to get hold of such an fd. Perhaps opening something under /proc/pid/ or a pidfd system call could create one for any process (subject to certain security constraints). This would make it easy to wait for any such process to exit in a blocking way (for example, waiting for a daemon to exit). Such a feature would require processes to be reaped only when the last forkfd was closed, though, since there may be several file descriptors open for it. AFAICR this would be similar to Win32 process handles.

    We seem to be gradually moving to a better world where everything is or can be a file descriptor so poll(2) can wait on them. It’s interesting to compare this with Win32 where (almost) everything is a handle so that WaitForMultipleObjects can wait on them.

  5. Jaroslav Šmíd

    I like this, but there are a few questions I have:

    1) Does the forkfd only report child exit or does it also report stop/continue in the same way as SIGCHLD+wait?
    2) Does the child created with forkfd also generate SIGCHLD or is the forkfd descriptor the only way to get those events?
    3) What happens if someone calls wait or waitpid? Can the child created by forkfd be waited for with wait()? Is the child reaped by the wait(), or only after both wait() and read() from forkfd are done? Would it still be required to call waitpid, or would we still have a zombie?
    4) If the descriptor gets duplicated (e.g. using dup or F_DUPFD), will every event be reported on each descriptor from then on, or will read() on one descriptor consume the event from the second descriptor?

    These should be answered somehow in the proposal so that it doesn’t lead to confusion.

    Here are a few suggestions for how it could work:

    The child created using forkfd() doesn’t generate SIGCHLD in the parent process.
    wait() never returns child created with forkfd(), waitpid() for a child created with forkfd() returns an error.

    The exited child is not reaped by read() of the descriptor, but after closing the last descriptor. If the child hasn’t exited yet when the last descriptor is closed, it will be reaped automatically by the system. wait() or waitpid() still won’t report such a process, nor will the child generate SIGCHLD in the parent process.


    Regarding Mike Crowe’s comment: I think there should be no way to create such an fd for an existing process. Why? Because it would lead to race conditions. E.g. you find out there is a process with PID 2556 you want to watch. But by the time you create such a descriptor, the process could have exited and a new process with the same PID could have appeared. How can you be sure you are watching the right process?
    The “watching for daemon to exit” example you provided can be achieved by parent-child relation where the parent creates the daemon using forkfd+exec.

  6. Thiago Macieira

    Hello Jaroslav, thanks for the questions.

    1) I think this should behave the same way, so it should probably have a flag to indicate whether SIGSTOP and SIGCONT are also expected.

    2) I’d say that the SIGCHLD is irrelevant and frameworks should not install or depend on the handler. Considering the automatic reaping, I’d say that SIGCHLD should not be generated at all.

    3) Since the child is automatically reaped, wait/waitpid should behave as if another wait/waitpid had reaped the child, whatever behaviour is currently implemented for that case. I’d say we should probably specify it as “undefined behaviour” and “you should be shot if you do this”.

    4) That’s a good question, but I’d leave it to the implementors to decide. I’d prefer if the event were reported on all copies, but I’ll also be happy if only one gets it or if it’s undefined behaviour.

    Those are more or less what you suggested too.

  7. Mike Crowe

    Jaroslav Šmíd: That race condition always exists with process IDs no matter what you do with them. If I determine the process ID of my daemon by ferreting around in /proc looking at process names or by reading a PID file there’s always going to be a chance that the process has died by the time I pass the pid to kill(2). The assumption appears to be that there are so many process IDs that this isn’t a problem in practice. Maybe we’ll have to switch to 32-bit process IDs by default in order to avoid such races in the long term.

    I don’t really follow your claim that the parent-child relationship solves the problem. This is the very case where you aren’t the parent of the process you wish to wait for the completion of.

    For example,

    # Start DHCP client (daemonised)
    dhcpcd -i eth0

    # Find and kill DHCP client
    dhcpcd -k eth0

    As far as I can see there’s no way for the second invocation of dhcpcd to block waiting for the first instance to exit. All the second can do is send the first a signal and poll the process ID. If the second invocation could create a file descriptor on the PID then it could block efficiently inside poll(2) waiting.

  8. Thiago Macieira

    @Mike: that problem does not exist with forkfd: the file descriptor you get is unique. There’s no race condition: even if the process exits and another is created with the same PID, the file descriptor refers to a single child process.

    That only works when you’ve just done the forking. You can’t attach a forkfd to an existing process, since that reintroduces the race condition problem.

  9. Mike Crowe

    @Thiago: True, the race condition doesn’t exist with forkfd.

    I wrote “That race condition always exists with process IDs no matter what you do with them.” which was incorrect in the case of fork. What I meant was that if you try and do anything with a PID of a process that you aren’t responsible for reaping then you’re open to race conditions. I don’t think being able to turn a process ID into a file descriptor adds any new race conditions to such code. In fact I think it reduces the opportunity window since you can keep the file descriptor around for some time and be sure that it won’t magically start referring to a different process which is not true of a PID.

  10.

    Congratulations for these highly interesting articles.

    Some of the limitations you describe also apply to another use of SIGCHLD: debugging another (possibly already started) process through ptrace. The ptracer needs to keep track of the state (stopped, running) of the ptracee by monitoring SIGCHLD signals. A library providing debugging facilities would have no satisfying way to do so for the same reasons you gave.

  11. Colin Walters

    One fd per child process is kind of unfortunate for some use cases; for example, I have a build program kind of like ‘make’ that may launch a lot of child processes, and it’s actually pretty easy to run up against the 1024 limit.

    Why not make it a bit more like signalfd, so you’d have forkfd_create()/forkfd_add_child(), and read() returns a structure that has data for a child?

  12. Thiago Macieira

    You’re more likely to run into a process limit than an fd limit.

    That functionality you propose is interesting, but doesn’t solve my problem for QProcess. I need in QProcess to be woken up when my child process has died. If the code wakes up for other processes, it’s just spurious CPU usage. What’s more, remember that this must work simultaneously from multiple threads, in which case one fd might not be a good idea.
