Have you ever tried to launch a sub-process on Unix? POSIX.1 has several APIs for doing that, including fork+execve and posix_spawn. Starting a child process is not difficult; ensuring that it behaves properly and that you get notified when the child dies, that is difficult.
I want to concentrate only on fork(2)+execve(2), so let’s get the newer API out of the way first. The posix_spawn API was first introduced in POSIX.1-2001, derived from the earlier “1d” specification. If your Unix system is not POSIX.1-2001-compliant, you won’t have it. And even if you have a 2001- or 2008-compliant system, the specification still says today:
These functions are part of the Spawn option and need not be provided on all implementations.
So if you want to write cross-platform code, you’re not going to depend on it. (Unless you need to use it for systems that don’t have fork(), like QNX Neutrino).
In any case, for my interest here — Linux — two other factors come into play:
- The kernel has no posix_spawn API, so the C library would need to implement it in userspace, using fork and execve anyway.
- Glibc does not implement it today; it simply returns ENOSYS.
With that in mind, let’s focus on the traditional API.
fork and execve
The traditional API for launching a process on Unix systems is to first fork your process and then replace it with another. This two-step process is extremely flexible and has allowed for many uses and abuses over the years. For example, before we had proper thread support, forking and communicating with the child forks was a common way of operating. In fact, even today the extremely popular Apache web server continues to offer a module that handles requests on the time-proven non-threaded fork-based implementation.
When the call to fork succeeds, execution will continue in two different processes: the parent and the child. The parent process receives the child process’s identifier (the PID) for later use, like kill(2) or to distinguish between notifications from multiple children.
Usually, the child process will perform some cleanup and preparation before later calling execve. The POSIX API offers several variations in the exec family, but they all boil down to execve: the path to the executable is absolute, the arguments are in a vector and the environment to be passed down is known.
As I said in the introduction, so far so good. This is easy, flexible and proven by time. Yet it has some problems we’ll explore.
First problem: inheriting file descriptors
The POSIX specification for execve declares:
File descriptors open in the calling process image shall remain open in the new process image, except for those whose close-on-exec flag FD_CLOEXEC is set.
This allows the child process to inherit the standard streams of the C library — stdin, stdout and stderr. When you launch a process from a shell, for example, the process will inherit the connection to the terminal so you’ll get the output on your screen.
This feature is also what allows parent and child process to communicate and for redirections to happen. When you type in the terminal something like:
$ process > output.log
What the shell is doing is making sure that the stdout stream is connected to the file output.log before it calls execve.
Yet the major flaw in this API is that the FD_CLOEXEC flag is not the default. That means that at every point where you call a function that opens a file descriptor, you must remember to also make the file descriptor close-on-exec if you don’t want to leak it to the child process.
It was a major flaw in the 1970s when this API was designed, but not catastrophic. With proper care, one could make it work. And if a particular function was going to close the file descriptor anyway before any chance of forking, it did not have to bother.
It became a showstopper at the end of the 1990s when we got threads. Even careful code that sets the flag immediately upon opening the file descriptor, like the following, is not safe:
int fd = open("/dev/null", O_WRONLY);
fcntl(fd, F_SETFD, FD_CLOEXEC);
Why isn’t it safe? Because another thread could call fork in between the opening of the file descriptor and the setting of the FD_CLOEXEC flag. That means that, despite the care taken to ensure that the file descriptor doesn’t leak, it can still leak.
First solution: add new APIs
Recently, through the efforts of the former glibc maintainer Ulrich Drepper, we’ve got a few new system calls, or modifications to existing ones, on Linux and in glibc that solve the problem above. All of the Linux kernel system calls that can create a file descriptor take an extra parameter indicating whether FD_CLOEXEC should be set upon creation. On a relatively recent glibc, the source code above becomes:
int fd = open("/dev/null", O_WRONLY | O_CLOEXEC);
And similarly for a call to socket:
int server_fd = socket(AF_INET, SOCK_STREAM | SOCK_CLOEXEC, IPPROTO_TCP);
For the system calls that created file descriptors but had no way of passing extra flags, new system calls were created, such as:
int pipe_fd[2];
pipe2(pipe_fd, O_CLOEXEC);
dup3(pipe_fd[0], STDIN_FILENO, O_CLOEXEC);
accept4(server_fd, &addr, &addrlen, SOCK_CLOEXEC);
This also allows us to pass O_NONBLOCK or SOCK_NONBLOCK and save us another pair of system calls to set the flag on.
The solution that Ulrich Drepper and the kernel community came up with is elegant and solves the race condition problem. I also made Qt use those system calls automatically a couple of releases ago and contributed a patch to Glib to do the same.
That part of the problem is solved, on Linux at least, when using a modern glibc or eglibc. Still, it’s not enough, as I’ll explore in my next blog post.