Low-Level Input/Output

Opening and Closing Files

	#include <fcntl.h>
	int open (const char *filename, int flags[, mode_t mode])

The open function creates and returns a new file descriptor for the file named by filename. Initially, the file position indicator for the file is at the beginning of the file. The argument mode specifies file permissions and is used only when a file is created.

The flags argument controls how the file is to be opened. This is a bit mask; you create the value by the bitwise OR of the appropriate parameters (using the `|` operator in C). See File Access Flags below for the parameters available.

The normal return value from open is a non-negative integer file descriptor. In the case of an error, a value of -1 is returned instead.

	#include <unistd.h>
	int close (int filedes)

The function close closes the file descriptor filedes.

The normal return value from close is 0; a value of -1 is returned in case of failure. The usual failure mode is an argument that is not a valid file descriptor.
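As a minimal sketch of the two calls used together, the helper below opens a file read-only, checks for the -1 error return, and closes the descriptor again. The name `can_open_for_reading` is invented for this illustration and is not part of any library.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if `path` can be opened for reading,
   0 otherwise.  It only illustrates checking the results of open and close. */
int can_open_for_reading(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd == -1) {
        perror("open");     /* errno records why the open failed */
        return 0;
    }
    if (close(fd) == -1) {
        perror("close");
        return 0;
    }
    return 1;
}
```

A caller might use it as `if (can_open_for_reading("/etc/hostname")) ...`; the path is only an example.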

File Access Flags

Exactly one of the following three constants (from fcntl.h) must be specified in the flags argument:

Flag	Description
`O_RDONLY`	Open the file for read access.
`O_WRONLY`	Open the file for write access.
`O_RDWR`	Open the file for both reading and writing.

The following constants are optional:

Flag	Description
`O_APPEND`	Append to the end of the file on each write.
`O_CREAT`	Create the file if it doesn't already exist.
`O_EXCL`	If both O_CREAT and O_EXCL are set, then open fails if the specified file already exists.
`O_TRUNC`	Truncate the file to zero length.
`O_NOCTTY`	If the named file is a terminal device, don't make it the controlling terminal for the process.
`O_NONBLOCK`	This prevents open from blocking for a “long time” to open the file.
`O_NOLINK`	If the named file is a symbolic link, open the link itself instead of the file it refers to.

The use of O_NONBLOCK is only meaningful for some kinds of files, usually devices such as serial ports; when it is not meaningful, it is harmless and ignored. Often opening a port to a modem blocks until the modem reports carrier detection; if O_NONBLOCK is specified, open will return immediately without a carrier.
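The way one access flag is combined with optional flags can be sketched as follows. The helper name `create_exclusive` is an assumption made for this example; it relies on `O_CREAT | O_EXCL` making open fail (with errno set to EEXIST) when the file already exists.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: create `path` only if it does not exist yet.
   Returns 0 on success, 1 if the file already existed, -1 on other errors. */
int create_exclusive(const char *path)
{
    /* One access flag (O_WRONLY) OR-ed with two optional flags.  The mode
       argument is supplied because O_CREAT may create the file. */
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, S_IRUSR | S_IWUSR);
    if (fd == -1)
        return errno == EEXIST ? 1 : -1;
    close(fd);
    return 0;
}
```

Calling the helper twice on the same name should return 0 the first time and 1 the second, which is a common way to implement a simple lock file.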

Reading and Writing

	#include <unistd.h>
	ssize_t read (int filedes, void *buffer, size_t size);
	ssize_t write (int filedes, const void *buffer, size_t size);

Data Type: ssize_t is used to represent the sizes of blocks that can be read or written in a single operation. It is similar to size_t, but must be a signed type.

`read`

The read function reads up to size bytes from the file with descriptor filedes, storing the results in the buffer. (This is not necessarily a character string, and no terminating null character is added.)

The return value is the number of bytes actually read. This might be less than size; for example, if there aren't that many bytes left in the file or if there aren't that many bytes immediately available. The exact behavior depends on what kind of file it is. Note that reading less than size bytes is not an error.

A value of zero indicates end-of-file (except if the value of the size argument is also zero). This is not considered an error. If you keep calling read while at end-of-file, it will keep returning zero and doing nothing else.

If read returns at least one character, there is no way you can tell whether end-of-file was reached. But if you did reach the end, the next read will return zero.

In case of an error, read returns -1.
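Because a short read is not an error, code that needs a fixed number of bytes usually calls read in a loop. The following is one possible sketch; `read_full` is a hypothetical name, and retrying on EINTR (a call interrupted by a signal) is an extra assumption that the text above does not discuss.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: keep reading from `fd` until `size` bytes have been
   stored in `buffer` or end-of-file is reached.  Returns the number of
   bytes actually read, or -1 on error. */
ssize_t read_full(int fd, void *buffer, size_t size)
{
    char *p = buffer;
    size_t total = 0;

    while (total < size) {
        ssize_t n = read(fd, p + total, size - total);
        if (n == 0)                 /* zero means end-of-file */
            break;
        if (n == -1) {
            if (errno == EINTR)     /* interrupted by a signal: try again */
                continue;
            return -1;
        }
        total += (size_t) n;        /* a short read just means "call again" */
    }
    return (ssize_t) total;
}
```

A return value smaller than `size` then means end-of-file was reached, which a single read call cannot tell you.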

`write`

The write function writes up to size bytes from buffer to the file with descriptor filedes. The data in buffer is not necessarily a character string and a null character is output like any other character.

The return value is the number of bytes actually written. This may be size, but can always be smaller. Your program should always call write in a loop, iterating until all the data is written.

In the case of an error, write returns -1.
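The loop recommended above might look like the sketch below. The name `write_all` is invented here, and the EINTR retry is an assumption added for robustness rather than something the text requires.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: call write repeatedly until all `size` bytes of
   `buffer` have been accepted.  Returns `size` on success, -1 on error. */
ssize_t write_all(int fd, const void *buffer, size_t size)
{
    const char *p = buffer;
    size_t done = 0;

    while (done < size) {
        ssize_t n = write(fd, p + done, size - done);
        if (n == -1) {
            if (errno == EINTR)     /* interrupted by a signal: try again */
                continue;
            return -1;
        }
        done += (size_t) n;         /* write may accept fewer bytes than asked */
    }
    return (ssize_t) done;
}
```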

Once write returns, the data is enqueued to be written and can be read back right away, but it is not necessarily written out to permanent storage immediately. You can use fsync when you need to be sure your data has been permanently stored before continuing. (It is more efficient for the system to batch up consecutive writes and do them all at once when convenient. Normally they will always be written to disk within a minute or less.) Modern systems provide another function fdatasync which guarantees integrity only for the file data and is therefore faster.
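A small sketch of forcing data to storage with fsync follows. The helper `save_durably` and its error handling are assumptions made for illustration; a real program would probably loop on short writes instead of treating them as failures.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: replace the contents of `path` with `size` bytes
   from `buffer` and do not return until fsync reports the data as stored.
   Returns 0 on success, -1 on any failure. */
int save_durably(const char *path, const void *buffer, size_t size)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
    if (fd == -1)
        return -1;

    int rc = 0;
    ssize_t n = write(fd, buffer, size);
    if (n == -1 || (size_t) n != size)
        rc = -1;                    /* error or short write */
    else if (fsync(fd) == -1)       /* fdatasync(fd) would flush data only */
        rc = -1;

    if (close(fd) == -1)
        rc = -1;
    return rc;
}
```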

Effect of BUFSIZE on I/O Efficiency

The file system for this test was the Linux ext2 file system with 4,096-byte blocks. This accounts for the minimum in the system time occurring at a BUFSIZE of 4,096 bytes. Increasing the buffer size beyond this has little positive effect.

Most file systems support some kind of read-ahead to improve performance. When sequential reads are detected, the system tries to read in more data than the application requests, assuming that the application will read it shortly. From the table, it appears that read-ahead in ext2 stops having an effect after 128KB.

Beware when trying to measure the performance of programs that read and write files. The operating system will try to cache the file in memory, so if you measure the performance of the program repeatedly, the successive timings will likely be better than the first. This is because the first run will cause the file to be entered into the system's cache, and successive runs will access the file from the system's cache instead of from the disk.

Timing results for reading with different buffer sizes in Linux
Stevens and Rago, p 70

BUFSIZE	User CPU (seconds)	System CPU (seconds)	Clock time (seconds)	#loops
1	124.89	161.65	288.64	103,316,352
2	63.10	80.96	145.81	51,658,176
4	31.84	40.00	72.75	25,829,088
8	15.17	21.01	36.85	12,914,544
16	7.86	10.27	18.76	6,457,272
32	4.13	5.01	9.76	3,228,636
64	2.11	2.48	6.76	1,614,318
128	1.01	1.27	6.82	807,159
256	0.56	0.62	6.80	403,579
512	0.27	0.41	7.03	201,789
1,024	0.17	0.23	7.84	100,894
2,048	0.05	0.19	6.82	50,447
4,096	0.03	0.16	6.86	25,223
8,192	0.01	0.18	6.67	12,611
16,384	0.02	0.18	6.87	6,305
32,768	0.00	0.16	6.70	3,152
65,536	0.02	0.19	6.92	1,576
131,072	0.00	0.16	6.84	788
262,144	0.01	0.25	7.30	394
524,288	0.00	0.22	7.35	198

In the tests reported here, each run with a different buffer size was made using a different copy of the file so that the current run didn't find the data in the cache from the previous run. The files are large enough that they don't all remain in the cache (the test system was configured with 512 MB of RAM).
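The relationship between buffer size and the number of read calls can be explored with a sketch like the one below. It is not the program used for the measurements in the table; `count_reads` is a hypothetical helper that only counts how many read calls return data for a given buffer size.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: read `path` to end-of-file using a buffer of
   `bufsize` bytes.  Returns the number of read calls that returned data,
   or -1 on error. */
long count_reads(const char *path, size_t bufsize)
{
    char *buf = malloc(bufsize);
    if (buf == NULL)
        return -1;

    int fd = open(path, O_RDONLY);
    if (fd == -1) {
        free(buf);
        return -1;
    }

    long loops = 0;
    ssize_t n;
    while ((n = read(fd, buf, bufsize)) > 0)
        loops++;                    /* one system call per iteration */

    close(fd);
    free(buf);
    return n == -1 ? -1 : loops;
}
```

Timing such a loop (for example with the shell's `time` command) for several buffer sizes gives numbers of the same kind as the table, subject to the caching effects described above.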

Buffered and Unbuffered I/O

The following discussion is excerpted from Das, p 515.

To appreciate the debate that concerns system calls and library functions, you need to know something about the way disk I/O actually takes place. The read and write calls never access the disk directly. Rather, they read and write a pool of kernel buffers, called the buffer cache. If the buffer is found to be empty during a read, the kernel instructs the disk controller to read data from disk and fill up the cache. read blocks (waits) while the disk is being read and the process even relinquishes control of the CPU.

To ensure that a single invocation of read gathers all bytes stored in the kernel buffer, the size of the latter and buffer used by read (char buf[BUFSIZE] in a previous example) should be equal. Improper setting of the buffer size can make your program inefficient. So if each kernel buffer stores 8192 bytes, then BUFSIZE should also be set to 8192. A smaller figure makes I/O inefficient, but a larger figure doesn't improve performance.

write also uses the buffer cache, but it differs from read in one way: it returns immediately after the call is invoked. The kernel writes the buffer to disk later at a convenient time. Database applications often can't accept this behavior, in which case you should open a file with the O_SYNC status flag to ensure that write doesn't return until the kernel has finally written the buffer to disk.
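Opening a file with O_SYNC could be sketched as follows. The helper name `write_synchronously` is an assumption for this example, and the exact durability guarantee of O_SYNC depends on the system and the storage device.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: write one record to `path` through a descriptor
   opened with O_SYNC, so that write itself waits for the data to be
   written out.  Returns 0 on success, -1 on failure. */
int write_synchronously(const char *path, const void *record, size_t size)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_SYNC,
                  S_IRUSR | S_IWUSR);
    if (fd == -1)
        return -1;

    ssize_t n = write(fd, record, size);    /* returns after the flush */
    int rc = (n != -1 && (size_t) n == size) ? 0 : -1;

    if (close(fd) == -1)
        rc = -1;
    return rc;
}
```

Compared with calling fsync after a batch of writes, this makes every single write pay the cost of waiting for the disk.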

Unlike the standard library functions, the read and write calls are unbuffered when they interact with the terminal. When you use write to output a string to the terminal, the string appears on your display as soon as the call is invoked. On the other hand, the standard library functions (like printf) are line-buffered when they access the terminal. That means a string is printed on the terminal only when the newline character is encountered.

The size of the kernel buffer is system-dependent and is set at the time of installation of the operating system. To develop portable and optimized applications, you must not use a feature that is system-dependent. You can't arbitrarily set BUFSIZE to 8192. This is where library functions come in.

The I/O-bound library functions use a buffer in the FILE structure and adjust its size dynamically during runtime using malloc. Unless you are using system calls for their exclusive features, it makes sense to use library functions on most occasions.

Setting File Position

	#include <unistd.h>
	off_t lseek (int filedes, off_t offset, int whence)

The lseek function is used to change the file position of the file with descriptor filedes.

The whence argument specifies how the offset should be interpreted, in the same way as for the fseek function, and it must be one of the symbolic constants SEEK_SET, SEEK_CUR, or SEEK_END.

Flag	Description
`SEEK_SET`	Seek from the beginning of the file.
`SEEK_CUR`	Seek from the current file position. This count may be positive or negative.
`SEEK_END`	Seek from the end of the file.

The return value from lseek is normally the resulting file position, measured in bytes from the beginning of the file. You can use this feature together with SEEK_CUR to read the current file position:

	cur_pos = lseek(fd, 0, SEEK_CUR);

If the file position cannot be changed, or the operation is in some way invalid, lseek returns a value of -1.
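The return value of lseek can also be used to find the size of a file. The sketch below uses hypothetical helper names (`size_by_seeking`, `file_size`); it saves the current position with SEEK_CUR, seeks to the end, and then restores the saved position.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: return the size of the file open on `fd` and leave
   the file position unchanged.  Returns -1 if the descriptor cannot seek
   (a pipe or a terminal, for example). */
off_t size_by_seeking(int fd)
{
    off_t cur = lseek(fd, 0, SEEK_CUR);     /* remember where we are */
    if (cur == (off_t) -1)
        return -1;
    off_t end = lseek(fd, 0, SEEK_END);     /* offset of the end = size */
    if (end == (off_t) -1)
        return -1;
    if (lseek(fd, cur, SEEK_SET) == (off_t) -1)
        return -1;
    return end;
}

/* Hypothetical convenience wrapper taking a file name. */
off_t file_size(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;
    off_t size = size_by_seeking(fd);
    close(fd);
    return size;
}
```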

A negative count with SEEK_END specifies a position within the current extent of the file; a positive count specifies a position past the current end. If you set the position past the current end, and actually write data, you will extend the file with zeros up to that position.

If you want to append to the file, setting the file position to the current end of file with SEEK_END is not sufficient. Another process may write more data after you seek but before you write, extending the file so the position you write onto clobbers their data. Instead, use the O_APPEND operating mode.
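A sketch of appending with O_APPEND follows; `append_text` is a name invented for this example. The deliberate seek to the beginning is there only to show that, with O_APPEND set, the data still ends up at the end of the file.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: append `len` bytes of `text` to `path`, creating
   the file if necessary.  Returns 0 on success, -1 on failure. */
int append_text(const char *path, const char *text, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, S_IRUSR | S_IWUSR);
    if (fd == -1)
        return -1;

    /* Moving the position does not matter: with O_APPEND every write is
       placed at the current end of the file. */
    lseek(fd, 0, SEEK_SET);

    ssize_t n = write(fd, text, len);
    int rc = (n != -1 && (size_t) n == len) ? 0 : -1;
    if (close(fd) == -1)
        rc = -1;
    return rc;
}
```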

You can set the file position past the current end of the file. This does not by itself make the file longer; lseek never changes the file. But subsequent output at that position will extend the file. Characters between the previous end of file and the new position are filled with zeros. Extending the file in this way can create a “hole”: the blocks of zeros are not actually allocated on disk, so the file takes up less space than it appears to; it is then called a “sparse file”.
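The behaviour described above can be observed with a sketch such as this one. `make_file_with_hole` is a hypothetical helper; whether the gap is really stored as a hole depends on the file system, so the example only checks the apparent file size.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: create `path`, seek `gap` bytes past the start and
   write a single byte there.  Returns the resulting file size (gap + 1),
   or -1 on failure. */
off_t make_file_with_hole(const char *path, off_t gap)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
    if (fd == -1)
        return -1;

    struct stat st;
    off_t result = -1;
    if (lseek(fd, gap, SEEK_SET) == gap         /* seeking alone ...        */
        && fstat(fd, &st) == 0
        && st.st_size == 0                      /* ... leaves size at zero  */
        && write(fd, "x", 1) == 1               /* the write extends it     */
        && fstat(fd, &st) == 0)
        result = st.st_size;

    close(fd);
    return result;
}
```

Reading the file back returns zeros for every byte in the gap, whether or not disk blocks were allocated for them.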

The lseek function is the underlying primitive for the fseek, fseeko, ftell, ftello and rewind functions, which operate on streams instead of file descriptors.

Reading File Attributes

	#include <sys/stat.h>
	int stat (const char *filename, struct stat *buf);
	int fstat (int filedes, struct stat *buf);
	int lstat (const char *filename, struct stat *buf);

The stat function returns information about the attributes of the file named by filename in the structure pointed to by buf.

If filename is the name of a symbolic link, the attributes you get describe the file that the link points to. If the link points to a nonexistent file name, then stat fails reporting a nonexistent file.

The return value is 0 if the operation is successful, or -1 on failure.

The fstat function is like stat, except that it takes an open file descriptor as an argument instead of a file name. Like stat, fstat returns 0 on success and -1 on failure.

The lstat function is like stat, except that it does not follow symbolic links. If filename is the name of a symbolic link, lstat returns information about the link itself; otherwise lstat works like stat.
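The difference between the three calls can be sketched as below. Both helper names are invented for this example: `is_symlink` uses lstat so that the link itself is examined, and `same_file_via_fstat` checks that stat on a name and fstat on a descriptor for that name describe the same file (same st_dev and st_ino).

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: 1 if `path` itself is a symbolic link, 0 if it is
   not, -1 if lstat fails. */
int is_symlink(const char *path)
{
    struct stat st;
    if (lstat(path, &st) == -1)     /* lstat does not follow the link */
        return -1;
    return S_ISLNK(st.st_mode) ? 1 : 0;
}

/* Hypothetical helper: 1 if stat(path) and fstat on an open descriptor
   for `path` report the same device and file serial number, 0 if they
   differ, -1 on error. */
int same_file_via_fstat(const char *path)
{
    struct stat by_name, by_fd;
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;
    int rc = -1;
    if (stat(path, &by_name) == 0 && fstat(fd, &by_fd) == 0)
        rc = (by_name.st_dev == by_fd.st_dev &&
              by_name.st_ino == by_fd.st_ino);
    close(fd);
    return rc;
}
```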

The Meaning of the File Attributes

When you read the attributes of a file, they come back in a structure called struct stat. This section describes the names of the attributes, their data types, and what they mean.

The stat structure type is used to return information about the attributes of a file. It contains at least the following members:

Member	Description
`mode_t st_mode`	Specifies the mode of the file. This includes file type information (see Testing File Type) and the file permission bits (see Permission Bits).
`ino_t st_ino`	The file serial number, which distinguishes this file from all other files on the same device.
`dev_t st_dev`	Identifies the device containing the file. The `st_ino` and `st_dev`, taken together, uniquely identify the file. The `st_dev` value is not necessarily consistent across reboots or system crashes, however.
`nlink_t st_nlink`	The number of hard links to the file. This count keeps track of how many directories have entries for this file. If the count is ever decremented to zero, then the file itself is discarded as soon as no process still holds it open. Symbolic links are not counted in the total.
`uid_t st_uid`	The user ID of the file's owner.
`gid_t st_gid`	The group ID of the file.
`off_t st_size`	This specifies the size of a regular file in bytes. For files that are really devices this field isn't usually meaningful. For symbolic links this specifies the length of the file name the link refers to.
`time_t st_atime`	This is the last access time for the file.
`unsigned long int st_atime_usec`	This is the fractional part of the last access time for the file.
`time_t st_mtime`	This is the time of the last modification to the contents of the file.
`unsigned long int st_mtime_usec`	This is the fractional part of the time of the last modification to the contents of the file.
`time_t st_ctime`	This is the time of the last modification to the attributes of the file.
`unsigned long int st_ctime_usec`	This is the fractional part of the time of the last modification to the attributes of the file.
`blkcnt_t st_blocks`	This is the amount of disk space that the file occupies, measured in units of 512-byte blocks.
`unsigned int st_blksize`	The optimal block size for reading or writing this file, in bytes. You might use this size for allocating the buffer space for reading or writing the file. (This is unrelated to `st_blocks`.)

Some of the file attributes have special data type names which exist specifically for those attributes. (They are all aliases for well-known integer types that you know and love.) These typedef names are defined in the header file sys/types.h as well as in sys/stat.h.

The number of disk blocks in st_blocks is not strictly proportional to the size of the file, for two reasons: the file system may use some blocks for internal record keeping; and the file may be sparse—it may have “holes” which contain zeros but do not actually take up space on the disk.

You can tell (approximately) whether a file is sparse by comparing this value with st_size, like this:

               (st.st_blocks * 512 < st.st_size)

This test is not perfect because a file that is just slightly sparse might not be detected as sparse at all. For practical applications, this is not a problem.

File Times

Each file has three time stamps associated with it: its access time, its modification time, and its attribute modification time. These correspond to the st_atime, st_mtime, and st_ctime members of the stat structure.

All of these times are represented in calendar time format, as time_t objects. This data type is defined in time.h. For more information about representation and manipulation of time values, see Calendar Time. Reading from a file updates its access time attribute, and writing updates its modification time. When a file is created, all three time stamps for that file are set to the current time. In addition, the attribute change time and modification time fields of the directory that contains the new entry are updated.
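Reading the three time stamps is a matter of copying fields out of the stat structure, as in this sketch. The helper name `file_times` and its output parameters are invented for this illustration.

```c
#include <sys/stat.h>
#include <time.h>

/* Hypothetical helper: fetch the access, modification and attribute-change
   times of `path`.  Returns 0 on success, -1 if stat fails. */
int file_times(const char *path, time_t *access_time,
               time_t *modify_time, time_t *change_time)
{
    struct stat st;
    if (stat(path, &st) == -1)
        return -1;
    *access_time = st.st_atime;     /* last read of the contents     */
    *modify_time = st.st_mtime;     /* last change to the contents   */
    *change_time = st.st_ctime;     /* last change to the attributes */
    return 0;
}
```

The values can be turned into readable text with the functions from time.h, for example `ctime(&modify_time)`.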

Testing the Type of a File

The file mode, stored in the st_mode field of the file attributes, contains two kinds of information: the file type code, and the access permission bits. This section discusses only the type code, which you can use to tell whether the file is a directory, socket, symbolic link, and so on. For details about access permissions see Permission Bits.

There are two ways you can access the file type information in a file mode. Firstly, for each file type there is a predicate macro which examines a given file mode and returns whether it is of that type or not. Secondly, you can mask out the rest of the file mode to leave just the file type code, and compare this against constants for each of the supported file types.

All of the symbols listed in this section are defined in the header file sys/stat.h. The following predicate macros test the type of a file, given the value m which is the st_mode field returned by stat on that file:

Macro	Description
`int S_ISDIR (mode_t m)`	non-zero if the file is a directory.
`int S_ISCHR (mode_t m)`	non-zero if the file is a character special file (a device like a terminal).
`int S_ISBLK (mode_t m)`	non-zero if the file is a block special file (a device like a disk).
`int S_ISREG (mode_t m)`	non-zero if the file is a regular file.
`int S_ISFIFO (mode_t m)`	non-zero if the file is a FIFO special file, or a pipe. See Pipes and FIFOs.
`int S_ISLNK (mode_t m)`	non-zero if the file is a symbolic link. See Symbolic Links.
`int S_ISSOCK (mode_t m)`	non-zero if the file is a socket. See Sockets.

The Mode Bits for Access Permission

The file mode, stored in the st_mode field of the file attributes, contains two kinds of information: the file type code, and the access permission bits. This section discusses only the access permission bits, which control who can read or write the file.

All of the symbols listed in this section are defined in the header file sys/stat.h. These symbolic constants are defined for the file mode bits that control access permission for the file:

Mask	Description
`S_IRUSR`	Read permission bit for the owner of the file. On many systems this bit is 0400.
`S_IWUSR`	Write permission bit for the owner of the file. Usually 0200.
`S_IXUSR`	Execute (for ordinary files) or search (for directories) permission bit for the owner of the file. Usually 0100.
`S_IRWXU`	This is equivalent to `(S_IRUSR | S_IWUSR | S_IXUSR)`.
`S_IRGRP`	Read permission bit for the group owner of the file. Usually 040.
`S_IWGRP`	Write permission bit for the group owner of the file. Usually 020.
`S_IXGRP`	Execute or search permission bit for the group owner of the file. Usually 010.
`S_IRWXG`	This is equivalent to `(S_IRGRP | S_IWGRP | S_IXGRP)`.
`S_IROTH`	Read permission bit for other users. Usually 04.
`S_IWOTH`	Write permission bit for other users. Usually 02.
`S_IXOTH`	Execute or search permission bit for other users. Usually 01.
`S_IRWXO`	This is equivalent to `(S_IROTH | S_IWOTH | S_IXOTH)`.
`S_ISUID`	This is the set-user-ID on execute bit, usually 04000.
`S_ISGID`	This is the set-group-ID on execute bit, usually 02000.
`S_ISVTX`	This is the sticky bit, usually 01000.

The actual bit values of the symbols are listed in the table above so you can decode file mode values when debugging your programs. These bit values are correct for most systems, but they are not guaranteed.

Warning: Writing explicit numbers for file permissions is bad practice. Not only is it not portable, it also requires everyone who reads your program to remember what the bits mean. To make your program clean use the symbolic names.

The Sticky Bit

For a directory, the sticky bit gives permission to delete a file in that directory only if you own that file. Ordinarily, a user can either delete all the files in a directory or cannot delete any of them (based on whether the user has write permission for the directory). With the sticky bit set, that restriction still applies, and in addition you must own the file: you need both write permission for the directory and ownership of the file you want to delete. The one exception is that the owner of the directory can delete any file in the directory, no matter who owns it (provided the owner has write permission for the directory). This is commonly used for the /tmp directory, where anyone may create files but not delete files created by other users.

Originally the sticky bit on an executable file modified the swapping policies of the system. Normally, when a program terminated, its pages in core were immediately freed and reused. If the sticky bit was set on the executable file, the system kept the pages in core for a while as if the program were still running. This was advantageous for a program likely to be run many times in succession. This usage is obsolete in modern systems. When a program terminates, its pages always remain in core as long as there is no shortage of memory in the system. When the program is next run, its pages will still be in core if no shortage arose since the last run.

On some modern systems where the sticky bit has no useful meaning for an executable file, you cannot set the bit at all for a non-directory. If you try, chmod fails with EFTYPE.

Some systems (particularly SunOS) have yet another use for the sticky bit. If the sticky bit is set on a file that is not executable, it means the opposite: never cache the pages of this file at all. The main use of this is for the files on an NFS server machine which are used as the swap area of diskless client machines. The idea is that the pages of the file will be cached in the client's memory, so it is a waste of the server's memory to cache them a second time. With this usage the sticky bit also implies that the filesystem may fail to record the file's modification time onto disk reliably (the idea being that no-one cares for a swap file).

Truncate Files

	#include <unistd.h>
	int truncate (const char *filename, off_t length)
	int ftruncate (int fd, off_t length)

The truncate function changes the size of filename to length. If length is shorter than the previous length, data at the end will be lost. The file must be writable by the user to perform this operation.

If length is longer, holes will be added to the end. However, some systems do not support this feature and will leave the file unchanged.

The return value is 0 for success, or -1 for an error.

The ftruncate function is like truncate, but it works on a file descriptor fd for an opened file instead of a file name to identify the object. The file must be opened for writing to successfully carry out the operation.
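A short sketch of ftruncate in use follows. The helper `resize_file` is a hypothetical name; it opens the file for writing, as required, and reports the size after the call so the caller can see whether the change took effect.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: set the length of `path` to `length` bytes using
   ftruncate.  Returns the size reported afterwards, or -1 on failure. */
off_t resize_file(const char *path, off_t length)
{
    int fd = open(path, O_WRONLY);      /* must be open for writing */
    if (fd == -1)
        return -1;

    struct stat st;
    off_t result = -1;
    if (ftruncate(fd, length) == 0 && fstat(fd, &st) == 0)
        result = st.st_size;

    close(fd);
    return result;
}
```

Shrinking a file this way discards the data past the new length. Growing it may or may not work, as noted above, which is why the sketch returns the size actually observed.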

References

Low-Level Input/Output (from the GNU C Library Reference Manual).

Sumitabha Das, Your Unix, the Ultimate Guide, Second Edition, McGraw-Hill, 2006. ISBN 0-07-252042-6. Chapter 16.

Neil Matthew and Richard Stone, Beginning Linux Programming, Third Edition, Wrox, 2004. ISBN 0-7645-4497-7. pp. 96-106.

W. Richard Stevens and Stephen A. Rago, Advanced Programming in the UNIX Environment, Second Edition, Addison Wesley, 2005. ISBN 0-201-43307-9. pp. 60-70.

Maintained by John Loomis, last updated 10 September 2006