Kernel Korner

Intro to inotify

Robert Love

Issue #139, November 2005

Applications that watch thousands of files for changes, or that need to know when a storage device gets disconnected, need a clean, fast solution to the file change notification problem. Here it is.

John McCutchan and I had been working on inotify for about a year when it was finally merged into Linus' kernel tree and released with kernel version 2.6.13. Although a long struggle, the effort culminated in success and was ultimately worth every rewrite, bug and debate.

What Is inotify?

inotify is a file change notification system—a kernel feature that allows applications to request the monitoring of a set of files against a list of events. When the event occurs, the application is notified. To be useful, such a feature must be simple to use, lightweight with little overhead and flexible. It should be easy to add new watches and painless to receive notification of events.

To be sure, inotify is not the first of its kind. Every modern operating system provides some sort of file notification system; many network and desktop applications require such functionality—Linux too. For years, Linux has offered dnotify. The problem was, dnotify was not very good. In fact, it stank.

dnotify, which ostensibly stands for directory notify, was never considered easy to use. Sporting a cumbersome interface and several painful features that made life arduous, dnotify failed to meet the demands of the modern desktop, where asynchronous notification of events and a free flow of information rapidly are becoming the norm. dnotify has, in particular, several problems:

  • dnotify can watch only directories.

  • dnotify requires maintaining an open file descriptor to the directory that the user wants to watch. First, this open file descriptor pins the directory, disallowing the device on which it resides from being unmounted. Second, watching a large number of directories requires too many open file descriptors.

  • dnotify's interface to user space is signals. Yes, seriously, signals!

  • dnotify ignores the issue of hard links.

The goal, therefore, was twofold: design a first-class file notification system and ensure that all of the deficiencies of dnotify were addressed.

inotify is an inode-based file notification system that does not require a file ever be opened in order to watch it. inotify does not pin filesystem mounts—in fact, it has a clever event that notifies the user whenever a file's backing filesystem is unmounted. inotify is able to watch any filesystem object whatsoever, and when watching directories, it is able to tell the user the name of the file inside of the directory that changed. dnotify can report only that something changed, requiring applications to maintain an in-memory cache of stat() results and compare for any changes.

Finally, inotify is designed with an interface that user-space application developers would want to use, enjoy using and benefit from using. Instead of signals, inotify communicates with applications via a single file descriptor. This file descriptor is select-, poll-, epoll- and read-able. Simple and fast—the world is happy.

Getting Started with inotify

inotify is available in kernel 2.6.13-rc3 and later. Because some bugs were found and subsequently fixed right after that release, kernel 2.6.13 or later is recommended. The inotify system calls, being the new kids on the block, might not yet be supported in your system's version of the C library, in which case the header files listed in the on-line Resources will provide the necessary C declarations and system call stubs.

If your C library supports inotify, all you should need is the following:


#include <sys/inotify.h>

If not, grab the two header files, stick them in the same directory as your source files, and use the following:

#include "inotify.h"
#include "inotify-syscalls.h"

The following examples are in straight C. You can compile them the same as any other C application.

Initialize, inotify!

inotify is initialized via the inotify_init() system call, which instantiates an inotify instance inside the kernel and returns the associated file descriptor:

int inotify_init (void);

On failure, inotify_init() returns minus one and sets errno as appropriate. The most common errno values are EMFILE and ENFILE, which signify that the per-user and the system-wide open file limit was reached, respectively.

Usage is simple:


int fd;

fd = inotify_init ();
if (fd < 0)
        perror ("inotify_init");

Watches

The heart of inotify is the watch, which consists of a pathname specifying what to watch and an event mask specifying what to watch for. inotify can watch for many different events: opens, closes, reads, writes, creates, deletes, moves, metadata changes and unmounts. Each inotify instance can have thousands of watches, each watch for a different list of events.

Adding Watches

Watches are added with the inotify_add_watch() system call:

int inotify_add_watch (int fd, const char *path, __u32 mask);

A call to inotify_add_watch() adds a watch for the one or more events given by the bitmask mask on the file path to the inotify instance associated with the file descriptor fd. On success, the call returns a watch descriptor, which is used to identify this particular watch uniquely. On failure, minus one is returned and errno is set as appropriate.

Usage is simple:


int wd;

wd = inotify_add_watch (fd,
                "/home/rlove/Desktop",
                IN_MODIFY | IN_CREATE | IN_DELETE);

if (wd < 0)
        perror ("inotify_add_watch");

This example adds a watch on the directory /home/rlove/Desktop for any modifications, file creations or file deletions.

Table 1 shows valid events.

Table 1. Valid Events

EventDescription
IN_ACCESSFile was read from.
IN_MODIFYFile was written to.
IN_ATTRIB File's metadata (inode or xattr) was changed.
IN_CLOSE_WRITE File was closed (and was open for writing).
IN_CLOSE_NOWRITEFile was closed (and was not open for writing).
IN_OPEN File was opened.
IN_MOVED_FROMFile was moved away from watch.
IN_MOVED_TO File was moved to watch.
IN_DELETE File was deleted.
IN_DELETE_SELFThe watch itself was deleted.

Table 2 shows the provided helper events.

Table 2. Helper Events

EventDescription
IN_CLOSEIN_CLOSE_WRITE | IN_CLOSE_NOWRITE
IN_MOVEIN_MOVED_FROM | IN_MOVED_TO
IN_ALL_EVENTSBitwise OR of all events.

As an example, if an application wanted to know whenever the file safe_combination.txt was opened or closed, it could do the following:


int wd;

wd = inotify_add_watch (fd,
                "safe_combination.txt",
                IN_OPEN | IN_CLOSE);

if (wd < 0)
        perror ("inotify_add_watch");

Receiving Events

With inotify initialized and watches added, your application is now ready to receive events. Events are queued asynchronously, in real time as the events happen, but they are read synchronously via the read() system call. The call blocks until events are ready and then returns all available events once any event is queued.

Events are delivered in the form of an inotify_event structure, which is defined as:

struct inotify_event {
        __s32 wd;             /* watch descriptor */
        __u32 mask;           /* watch mask */
        __u32 cookie;         /* cookie to synchronize two events */
        __u32 len;            /* length (including nulls) of name */
        char name[0];        /* stub for possible name */
};

The wd field is the watch descriptor originally returned by inotify_add_watch(). The application is responsible for mapping this identifier back to the filename.

The mask field is a bitmask representing the event that occurred.

The cookie field is a unique identifier linking together two related but separate events. It is used to link together an IN_MOVED_FROM and an IN_MOVED_TO event. We will look at it later.

The len field is the length of the name field or nonzero if this event does not have a name. The length contains any potential padding—that is, the result of strlen() on the name field may be smaller than len.

The name field contains the name of the object to which the event occurred, relative to wd, if applicable. For example, if a watch for writes in /etc triggers an event on the writing to /etc/vimrc, the name field will contain vimrc, and the wd field will link back to the /etc watch. Conversely, if watching the file /etc/fstab for reads, a triggered read event will have a len of zero and no associated name whatsoever, because the watch descriptor associates directly with the affected file.

The size of name is dynamic. If the event has no associated filename, no name is sent at all and no space is consumed. If the event does have an associated filename, the name field is dynamically allocated and trails the structure for len bytes. This approach allows the name's length to vary in size and consume no space when not needed.

Because the name field is dynamic, the size of the buffer passed to read() is unknown. If the size is too small, the system call returns zero, alerting the application. inotify, however, allows user space to “slurp” multiple events at once. Consequently, most applications should pass in a large buffer, which inotify will fill with as many events as possible.

It sounds complicated, but usage is simple:


/* size of the event structure, not counting name */
#define EVENT_SIZE  (sizeof (struct inotify_event))

/* reasonable guess as to size of 1024 events */
#define BUF_LEN        (1024 * (EVENT_SIZE + 16)

char buf[BUF_LEN];
int len, i = 0;

len = read (fd, buf, BUF_LEN);
if (len < 0) {
        if (errno == EINTR)
                /* need to reissue system call */
        else
                perror ("read");
} else if (!len)
        /* BUF_LEN too small? */

while (i < len) {
        struct inotify_event *event;

        event = (struct inotify_event *) &buf[i];

        printf ("wd=%d mask=%u cookie=%u len=%u\n",
                event->wd, event->mask,
                event->cookie, event->len);

        if (event->len)
                printf ("name=%s\n", event->name);

        i += EVENT_SIZE + event->len;
}

This approach is undertaken to allow many events to be read and processed in a single swoop and to deal with the dynamically sized name. Clever readers will immediately question whether the following code is safe with respect to alignment requirements:


while (i < len) {
        struct inotify_event *event;
        event = (struct inotify_event *) &buf[i];

        /* ... */

        i += EVENT_SIZE + event->len;
}

Indeed, it is. This is the reason that the len field may be longer than the string's length. Additional null characters may follow the string, padding it out to a size that ensures the following structure is properly aligned.

But I Don't Want to Read!

Having to sit blocked on a read() system call does not sound very appealing, unless your application is heavily threaded—in which case, hey, just one more thread! Thankfully, the inotify file descriptor can be polled or selected on, allowing inotify to be multiplexed along with other I/O and optionally integrated into an application's mainloop.

Here is an example of monitoring the inotify file descriptor with select():


struct timeval time;
fd_set rfds;
int ret;

/* timeout after five seconds */
time.tv_sec = 5;
time.tv_usec = 0;

/* zero-out the fd_set */
FD_ZERO (&rfds);

/*
 * add the inotify fd to the fd_set -- of course,
 * your application will probably want to add
 * other file descriptors here, too
 */
FD_SET (fd, &rfds);

ret = select (fd + 1, &rfds, NULL, NULL, &time);
if (ret < 0)
        perror ("select");
else if (!ret)
        /* timed out! */
else if (FD_ISSET (fd, &rfds)
        /* inotify events are available! */

You can follow a similar approach with pselect(), poll() or epoll()—take your pick.

Events

The mask field in the inotify_event structure describes the event that occurred. In addition to the events listed earlier, Table 3 shows events that are also sent, as applicable.

Table 3. Events That Cover General Changes

NameDescription
IN_UNMOUNTThe backing filesystem was unmounted.
IN_Q_OVERFLOWThe inotify queue overflowed.
IN_IGNOREDThe watch was automatically removed, because the file was deleted or its filesystem was unmounted.

Additionally, the bit IN_ISDIR is set telling the application if the event occurred against a directory. This is more than just a convenience—consider the case of a deleted file.

Because flags such as IN_ISDIR are present in the bitmask, it never should be compared to a possible event directly. Instead, the bits should be tested individually. For example:


if (event->mask & IN_DELETE) {
        if (event->mask & IN_ISDIR)
                printf ("Directory deleted!\n");
        else
                printf ("File deleted!\n");
}

Modifying Watches

A watch is modified by calling inotify_add_watch() with an updated event mask. If the watch already exists, the mask is simply updated and the original watch descriptor is returned.

Removing Watches

Watches are removed with the inotify_rm_watch() system call:

int inotify_rm_watch (int fd, int wd);

A call to inotify_rm_watch() removes the watch associated with the watch descriptor wd from the inotify instance associated with the file descriptor fd. The call returns zero on success and negative one on failure, in which case errno is set as appropriate.

Usage, as usual, is simple:


int ret;

ret = inotify_rm_watch (fd, wd);
if (ret)
        perror ("inotify_rm_watch");

Shutting inotify Down

To destroy any existing watches, pending events and the inotify instance itself, invoke the close() system call on the inotify instance's file descriptor. For example:


int ret;

ret = close (fd);
if (ret)
        perror ("close");

One-Shot Support

If the IN_ONESHOT value is OR'ed into the event mask at watch addition, the watch is atomically removed during generation of the first event. Subsequent events will not be generated against the file until the watch is added back. This behavior is desired by some applications, for example, Samba, where one-shot support mimics the behavior of the file change notification system on Microsoft Windows.

Usage is, naturally, simple:


int wd;

wd = inotify_add_watch (fd,
                "/home/rlove/Desktop",
                IN_MODIFY | IN_ONESHOT);

if (wd < 0)
        perror ("inotify_add_watch");

On Unmount

One of the biggest issues with dnotify (aside from the signals and basically everything else) is that a dnotify watch on a directory requires that said directory remain open. Consequently, watching a directory on, say, a USB keychain drive prevents the drive from unmounting. inotify solves this problem by not requiring that any file be open.

inotify takes this one step further, though, and sends out the IN_UNMOUNT event when the filesystem on which a file resides is unmounted. It also automatically destroys the watch and cleanup.

Moves

Move events are complicated because inotify may be watching the directory that the file is moved to or from, but not the other. Because of this, it is not always possible to alert the user of the source and destination of a file involved in a move. inotify is able to alert the application to both only if the application is watching both directories.

In that case, inotify emits an IN_MOVED_FROM from the watch descriptor of the source directory, and it emits an IN_MOVED_TO from the watch descriptor of the destination directory. If watching only one or the other, only the one event will be sent.

To tie together two disparate moved to/from events, inotify sets the cookie field in the inotify_event structure to a unique nonzero value. Two events with matching cookies are thus related, one showing the source and one showing the destination of the move.

Obtaining the Size of the Queue

The size of the pending event queue can be obtained via FIONREAD:


unsigned int queue_len;
int ret;

ret = ioctl (fd, FIONREAD, &queue_len);
if (ret < 0)
        perror ("ioctl");
else
        printf ("%u bytes pending in queue\n", queue_len);

This is useful to implement throttling: reading from the queue only when the number of events has grown sufficiently large.

Configuring inotify

inotify is configurable via procfs and sysctl.

/proc/sys/filesystem/inotify/max_queued_events is the maximum number of events that can be queued at once. If the queue reaches this size, new events are dropped, but the IN_Q_OVERFLOW event is always sent. With a significantly large queue, overflows are rare even if watching many objects. The default value is 16,384 events per queue.

/proc/sys/filesystem/inotify/max_user_instances is the maximum number of inotify instances that a given user can instantiate. The default value is 128 instances, per user.

/proc/sys/filesystem/inotify/max_user_watches is the maximum number of watches per instance. The default value is 8,192 watches, per instance.

These knobs exist because kernel memory is a precious resource. Although any user can read these files, only the system administrator can write to them.

Conclusion

inotify is a simple yet powerful file change notification system with an intuitive user interface, excellent performance, support for many different events and numerous features. inotify is currently in use in various projects, including Beagle, an advanced desktop indexing system, and Gamin, a FAM replacement.

What application will use inotify next?

Resources for this article: /article/8534.

Robert Love is a senior kernel hacker in Novell's Ximian Desktop group and the author of Linux Kernel Development (SAMS 2005), now in its second edition. He holds degrees in CS and Mathematics from the University of Florida. Robert lives in Cambridge, Massachusetts.