
File descriptor manipulation

Now that we know, from a system programmer's point of view, the basics of process and I/O manipulation in UNIX, we need to take a step backward and see how the system manages file descriptors, and what exactly is in the file table that the descriptors index. From there we'll be able to understand some useful file descriptor shuffling games that are commonplace in UNIX programming, and then take a step forward to use a particular kind of file descriptor, the pipe, to make processes ``talk'' to each other. This will also shed light on an issue that we have deliberately overlooked so far, namely the reason for separating process creation from program execution: after all, if the main reason for forking processes is to execute programs in them, what's the point of having two system calls, fork() and exec, and not just one that does the whole job, as is the case in M$-DOG?

The UNIX kernel uses a three-table data structure to manage open files. This apparently complicated structure is actually very flexible and elegant, and allows open files to be shared among processes in a simple way. With reference to fig. 6

  
Figure 6: Kernel data structures for open files.

we can see that each process's entry in the process table contains a table of open file descriptors. The file descriptors index the entries in this table, and each entry contains a pointer to an entry in a table that the kernel maintains for all open files. Each entry in this open file table contains the file status flags (read, write, append, etc.), the current file offset, and a pointer to the entry for this file in the so-called v-node table. This table (or part of it) is stored on the physical device; we won't concern ourselves with it for now: you can think of its entries as the ``real'' files on disk, with associated information like file location on disk, size, name, owner, etc.
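To make the three tables concrete, here is a toy user-space model in C. It is only a sketch: every name in it (`struct vnode`, `struct open_file`, `struct process`, `toy_install`, `TOY_MAX_FD`) is invented for illustration and does not correspond to any real kernel's definitions.

```c
#include <stddef.h>
#include <sys/types.h>

#define TOY_MAX_FD 8   /* hypothetical size of the per-process table */

/* Hypothetical v-node: stands for the file itself. */
struct vnode {
    const char *name;  /* kept here for illustration only */
    off_t size;
};

/* Hypothetical open file table entry: one is created per open(). */
struct open_file {
    int flags;         /* status flags, e.g. O_RDONLY, O_APPEND */
    off_t offset;      /* current file offset */
    struct vnode *vn;  /* the file this entry refers to */
};

/* Hypothetical per-process table: descriptor -> open file entry. */
struct process {
    struct open_file *fd_table[TOY_MAX_FD];
};

/* Install `entry` in the lowest free slot and return the slot index
   (the "file descriptor"), or -1 if the table is full. */
int toy_install(struct process *p, struct open_file *entry)
{
    for (int fd = 0; fd < TOY_MAX_FD; fd++) {
        if (p->fd_table[fd] == NULL) {
            p->fd_table[fd] = entry;
            return fd;
        }
    }
    return -1;
}
```

In this model a descriptor is just an index into `fd_table`, and two slots may point either to two distinct `open_file` entries for the same `vnode` or to one shared entry.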

Now, what if two processes want to independently open the same physical file, i.e. with each process maintaining its own access mode (read, write or append) and its own file offset? Easy: move the lower right arrow to point at the upper right box, and you get the situation depicted in fig. 7: here the open file table has two independent entries for the same file, each associated with one of the processes. Note that this is not what happens after fork(): the descriptors of a freshly forked child are equal to the parent's and point to the very same open file table entries, so parent and child share the status flags and the file offset.

  
Figure 7: Two independent processes with the same file open.
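Whether two descriptors share a file offset can be checked directly. On POSIX systems a descriptor inherited across fork() refers to the same open file table entry as the parent's, whereas a second open() of the same file gets an entry of its own. The sketch below lets a child read through an inherited descriptor and then reports the offset the parent sees; the function name `offset_after_child_read` is made up for this example.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that reads up to `nbytes` bytes from its inherited copy
   of `fd`, then return the offset the parent observes on `fd`
   (or -1 on error). */
off_t offset_after_child_read(int fd, size_t nbytes)
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {                       /* child */
        char buf[64];
        if (nbytes > sizeof buf)
            nbytes = sizeof buf;
        ssize_t n = read(fd, buf, nbytes);
        _exit(n == (ssize_t)nbytes ? 0 : 1);
    }

    int status;                           /* parent */
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return lseek(fd, 0, SEEK_CUR);        /* query the offset without moving it */
}
```

If the same file is opened twice and the function is called on the first descriptor, the first descriptor's offset advances with the child's read while the second descriptor's offset stays where it was.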

Now, what happens with this arrangement when I/O operations are performed?

It's easy to see that this scenario, though useful, is a source of the various kinds of trouble collectively referred to as race conditions, which arise precisely because different processes can access the same file concurrently.

Consider, for example, the following situation. Both processes in fig. 7 have opened file ``bozo'' in O_WRONLY mode; process #13 decides to append some data at the end, so it first moves to the end using lseek(), and ... right before it can write() it's suspended, and process #22 sneaks in, wanting to do the same thing. So it moves to the end too, then writes and goes off. Now the real end of the file has moved. Process #13 wakes up and resumes writing at what it thinks is still the end, actually overwriting #22's data.

Again, as with the race condition that we previously observed with creat(), the problem lies in the use of two system calls and in the possibility of concurrent access between them. The solution is to use a single call, making the append operation atomic, i.e. indivisible with respect to process switches. This is the reason for having the O_APPEND mode for open(). If this mode is set, the system guarantees that every write() atomically moves the offset to the current end of the file before writing the data.
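A minimal sketch of the single-call approach, assuming a POSIX system. The helper name `append_record` is made up for illustration; the point is only that with O_APPEND there is no separate lseek() for another process to slip in after.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Append `len` bytes from `buf` to the file `path`, creating it if
   needed. Returns 0 on success, -1 on error. */
int append_record(const char *path, const void *buf, size_t len)
{
    /* O_APPEND: each write() goes to the current end of file,
       wherever the descriptor's offset happened to be. */
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd == -1)
        return -1;

    ssize_t n = write(fd, buf, len);
    int rc = (n == (ssize_t)len) ? 0 : -1;

    if (close(fd) == -1)
        rc = -1;
    return rc;
}
```

Even an explicit lseek() to the beginning of a descriptor opened with O_APPEND does not change where the next write() lands: the data still goes to the end of the file.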

Now what if we need to have, in one process, two file descriptors open on the same file? Easy again: go back to fig. 6 and move the lower left arrow to the upper center box, and you get the situation depicted in fig. 8

  
Figure 8: Duplication of file descriptors.

Fine, this means that the kernel data structure can support this arrangement, but why would you want to do this? We'll see a sound reason in a moment. Let's first see how we achieve this result. Note that something like

...
fd1=open("samefile", O_RDONLY);
fd2=open("samefile", O_RDONLY);
...
doesn't work, since open() always creates a new entry in the open file table, and so you'd end up with, for example, two independent offsets. What we want instead is a way to ``duplicate'' one open file descriptor, getting a new one that refers to the same entry in the file table. The dup() and dup2() system calls do exactly this job. They are declared as:
int dup(int fildes);
int dup2(int fildes, int fildes2);
The former duplicates the passed file descriptor, returning the lowest-numbered unused one. The latter allows you to specify which file descriptor should receive the copy: it duplicates fildes into fildes2, and returns fildes2 as well. If fildes2 happens to be already open, dup2() closes it before duplication, unless it's equal to fildes: in this case no duplication occurs, and the file remains open. Both functions, needless to say, return -1 on error.
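The difference between a duplicated descriptor and a second open() shows up in the file offset. The following sketch (the helper names `read_byte` and `second_byte_via_dup` are invented for this example) reads one byte through a descriptor and the next one through its duplicate.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Read one byte from fd; return it as an int, or -1 on EOF or error. */
static int read_byte(int fd)
{
    unsigned char c;
    return read(fd, &c, 1) == 1 ? c : -1;
}

/* Read one byte via `fd`, then one byte via a dup() of `fd`.
   Returns the second byte, or -1 on error. */
int second_byte_via_dup(int fd)
{
    int copy = dup(fd);               /* same open file table entry as fd */
    if (copy == -1)
        return -1;

    int result = -1;
    if (read_byte(fd) != -1)
        result = read_byte(copy);     /* continues where fd left off */

    close(copy);
    return result;
}
```

Called on a descriptor positioned at the start of a file containing "xyz", the function returns 'y', because both descriptors advance the same offset. A descriptor obtained from a separate open() of the same file would still be at offset 0.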

Now, back to the reasons for bothering with this xeroxing of descriptors. Suppose that you have a file open for reading and writing, and you have already written some data in it and now want to feed it, via fork-exec, to one of those little ``filter'' commands of UNIX, like sed or awk, that read a stream of characters from their standard input (stdin, for short), do something useful with it, and write the result on their standard output. If your file were the stdin there would be no problem, since a child inherits it, but that's obviously not the case (you can't write on stdin). Then how do you let the child have the file on its stdin?

One solution is to close stdin first, then re-open the file immediately after. This makes use of a property of open(): it always returns the lowest-numbered unused file descriptor, so if you open a file immediately after closing stdin, you get stdin's file descriptor back. The resulting code would look like the following:

...
datafd=open("datafile", O_RDWR|O_CREAT, 0644);
...
/* Some data are written in the file */
...
close(STDIN_FILENO);
open("datafile", O_RDONLY); /* Now STDIN_FILENO -> "datafile" */

if (fork()==0)
{
    execlp("sed", "sed", (char *)0);
    perror("sed");  /* Reached only if execlp() failed */
    _exit(127);     /* Keep the child out of the parent's code */
}
...

However, this prevents the parent from using its stdin any further, so it's probably better to do the close-and-reopen trick in the child, since it inherits the data file as well. Hence a refined solution would be:

...
datafd=open("datafile", O_RDWR|O_CREAT, 0644);
...
/* Some data are written in the file */
...
if (fork()==0)
{
    close(STDIN_FILENO); /* Only child's stdin is closed */
    open("datafile", O_RDONLY);
    execlp("sed", "sed", (char *)0);
    perror("sed");  /* Reached only if execlp() failed */
    _exit(127);     /* Keep the child out of the parent's code */
}

This is the first example of the fact that separating process creation from program execution (i.e. fork from exec) is not a bad idea after all: it lets us rearrange the file descriptors in the child, differently from the parent's, before executing. We'll see that this feature is a great asset in lots of cases.

Fine, but what if the child doesn't know the name of the file? What is it supposed to open on its stdin then? Suppose, for instance, that the parent in the above example doesn't open ``datafile'' itself, but just inherits its descriptor from its own parent. The child could go through the pain of poking in the kernel tables, looking for the blessed name, but it's much easier to duplicate the descriptor as follows:

...
if (fork()==0)
{
    close(STDIN_FILENO);
    dup(datafd); /* datafd duplicated in stdin */
    execlp("sed", "sed", (char *)0);
    perror("sed");  /* Reached only if execlp() failed */
    _exit(127);     /* Keep the child out of the parent's code */
}

Fine again, except that there are two possible flaws in this scheme. The first is that before closing stdin we should, as a defensive measure, make sure that datafd does not happen to be the same as stdin; otherwise after the close it's gone for good, and the subsequent dup() fails. The second flaw is subtle but, by Murphy's Law, bound to happen sometime. The close-and-duplicate sequence above is not atomic. A signal handler, a peculiar UNIX dwarf that we'll meet later, may show up between close() and dup(), open a file for the child and go away. Then dup() finds stdin already occupied, returns another file descriptor, and the poor exec-ed sed ends up in a royal mess.

The use of dup2() solves both problems: it takes care of the closing, does nothing if its arguments are the same, and its closing-duplicating sequence is guaranteed to be atomic: a real piece of cake. So, a better solution is:

...
if (fork()==0)
{
    dup2(datafd, STDIN_FILENO);
    execlp("sed", "sed", (char *)0);
    perror("sed");  /* Reached only if execlp() failed */
    _exit(127);     /* Keep the child out of the parent's code */
}
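For reference, here is a complete version of the fork, dup2(), exec sequence, written as a sketch under a few assumptions that are not in the fragments above: `tr a-z A-Z` stands in for sed (so that no script argument is needed, and assuming `tr` is on the PATH), the child's stdout is also redirected to a second descriptor so the result can be inspected, and the function name `run_filter` is made up. Since a duplicated descriptor shares the offset of the original, the file is rewound before forking; otherwise the filter would start reading wherever the parent stopped writing.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run `tr a-z A-Z` with stdin taken from datafd and stdout sent to
   outfd. Returns the child's exit status, or -1 on error. */
int run_filter(int datafd, int outfd)
{
    /* The child's stdin will share datafd's offset: rewind first. */
    if (lseek(datafd, 0, SEEK_SET) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {                              /* child */
        if (dup2(datafd, STDIN_FILENO) == -1 ||
            dup2(outfd, STDOUT_FILENO) == -1)
            _exit(126);
        execlp("tr", "tr", "a-z", "A-Z", (char *)0);
        perror("tr");                            /* only if execlp() failed */
        _exit(127);
    }

    int status;                                  /* parent */
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The parent's own stdin and stdout are untouched: only the child's descriptors are rearranged, between fork() and exec.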

Note that these ``duplicate and go'' approaches assume that the mode the file is open in is the right one, since it remains unchanged through a duplication.

But what if we are not sure whether the mode matches the child's needs? Introducing fcntl(), the sophisticated multipurpose good-for-all etc. system call that you've always dreamed of. Declaration as follows:

#include <fcntl.h>

int fcntl(int fildes, int command, ...);
This call can accomplish five different tasks, of which only three matter to us for now. They are

Be warned that only the optional modes are one-bit flags in the returned integer; the fundamental access modes are not. In other words, a test like

...
mode=fcntl(fildes, F_GETFL);
if (mode & O_APPEND)
...
is correct, while
...
mode=fcntl(fildes, F_GETFL);
if (mode & O_RDONLY) 
...
is not. The reason is mainly historical: in the early implementations of UNIX the constants O_RDONLY, O_WRONLY and O_RDWR had the values 0, 1 and 2, and by the time it was realized that separate bits would have been much better it was too late: there was too much software around with those values hardwired in. So the POSIX committee made the due sacrifice to goddess Bckwrdcmptblty, and defined the constant O_ACCMODE, a bitmask that you AND with the value returned by fcntl() to isolate the access mode, which you then compare for equality with O_RDONLY, O_WRONLY or O_RDWR. Example follows:
...
mode=fcntl(fildes, F_GETFL);
if (mode==-1)
    exit(1);
else if ((mode & O_ACCMODE) == O_RDONLY)   /* parentheses needed: == binds tighter than & */
    do_something();
else if ((mode & O_ACCMODE) == O_WRONLY)
    do_something_else();
else if ((mode & O_ACCMODE) == O_RDWR)
    gimme_a_cookie();
...
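The two kinds of test can be packaged as small helpers. This is a sketch only: the function names `access_mode` and `is_append` and the one-letter return codes are invented for the example.

```c
#include <fcntl.h>

/* Returns 'r' (read only), 'w' (write only) or 'b' (both) for the
   access mode of fildes, or '?' on error or an unexpected mode. */
char access_mode(int fildes)
{
    int flags = fcntl(fildes, F_GETFL);
    if (flags == -1)
        return '?';

    switch (flags & O_ACCMODE) {   /* isolate the access mode first */
    case O_RDONLY: return 'r';
    case O_WRONLY: return 'w';
    case O_RDWR:   return 'b';
    default:       return '?';
    }
}

/* Returns 1 if O_APPEND is set on fildes, 0 if not, -1 on error.
   O_APPEND is a genuine one-bit flag, so a plain AND is enough. */
int is_append(int fildes)
{
    int flags = fcntl(fildes, F_GETFL);
    return flags == -1 ? -1 : (flags & O_APPEND) != 0;
}
```

A parent could call `access_mode(datafd)` before forking to check that the descriptor it is about to hand to a filter is actually readable.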



Franco Callari