SetupFeatures

io_uring_params->features flags
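These bits are reported by the kernel in the features field of struct io_uring_params, which io_uring_setup(2) fills in when the ring is created. The following is a minimal sketch using the raw syscall to read and test the reported bits; the queue depth of 8 is arbitrary.

    #include <linux/io_uring.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdio.h>

    int main(void)
    {
        struct io_uring_params p;
        memset(&p, 0, sizeof(p));

        /* io_uring_setup(2) fills p.features with the IORING_FEAT_* bits
         * supported by the running kernel. */
        int ring_fd = syscall(__NR_io_uring_setup, 8, &p);
        if (ring_fd < 0) {
            perror("io_uring_setup");
            return 1;
        }

        if (p.features & IORING_FEAT_SINGLE_MMAP)
            printf("SQ and CQ rings can share a single mmap\n");
        if (p.features & IORING_FEAT_NODROP)
            printf("CQ overflow is backlogged instead of dropped\n");

        close(ring_fd);
        return 0;
    }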

Values

Value               Meaning
NONE                0
SINGLE_MMAP         1U << 0

IORING_FEAT_SINGLE_MMAP (from Linux 5.4)

If this flag is set, the SQ and CQ rings can be mapped with a single mmap(2) call, so the second mmap for the CQ ring can be avoided.
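A sketch of how the flag is typically applied when mapping the rings by hand; map_rings() is a hypothetical helper, the ring fd and params come from io_uring_setup(2), and error handling is omitted.

    #include <linux/io_uring.h>
    #include <sys/mman.h>

    /* Map the SQ and CQ rings of an already-created ring fd. With
     * IORING_FEAT_SINGLE_MMAP, one mapping at IORING_OFF_SQ_RING covers both
     * rings; otherwise the CQ ring needs its own mmap at IORING_OFF_CQ_RING. */
    static void map_rings(int ring_fd, struct io_uring_params *p,
                          void **sq_ptr, void **cq_ptr)
    {
        size_t sring_sz = p->sq_off.array + p->sq_entries * sizeof(unsigned);
        size_t cring_sz = p->cq_off.cqes + p->cq_entries * sizeof(struct io_uring_cqe);

        if (p->features & IORING_FEAT_SINGLE_MMAP) {
            if (cring_sz > sring_sz)
                sring_sz = cring_sz;
            cring_sz = sring_sz;
        }

        *sq_ptr = mmap(0, sring_sz, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_POPULATE, ring_fd, IORING_OFF_SQ_RING);

        if (p->features & IORING_FEAT_SINGLE_MMAP)
            *cq_ptr = *sq_ptr;          /* second mmap not needed */
        else
            *cq_ptr = mmap(0, cring_sz, PROT_READ | PROT_WRITE,
                           MAP_SHARED | MAP_POPULATE, ring_fd, IORING_OFF_CQ_RING);
    }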

NODROP              1U << 1

IORING_FEAT_NODROP (from Linux 5.5)

Without this feature, completion events are dropped if the CQ ring is full. That is fine for requests with bounded completion times, but it may make it harder or impossible to use io_uring with networked IO, where request completion times are generally unbounded. The same applies to POLL, which is also unbounded.

With this feature, the ring never overflows; instead, the kernel stores overflowed completions in a backlog and flushes them automatically later. To prevent the backlog from growing indefinitely, back pressure is applied to IO submissions while the backlog is non-empty: any attempt to submit new IO then gets an -EBUSY return from the kernel. This is a signal to the application that it has backlogged CQ events, and that it must reap those before being allowed to submit more IO.

Note that if -EBUSY is returned, the kernel will first have flushed whatever backlogged events it could into the CQ ring, if there is room. This means the application can safely reap events without entering the kernel and waiting for them; they are already available in the CQ ring.
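A sketch of the resulting submission loop, assuming liburing; submit_with_backpressure() is a hypothetical helper that drains the CQ ring on -EBUSY and retries.

    #include <liburing.h>
    #include <errno.h>

    /* Submit pending SQEs; if the kernel reports a completion backlog with
     * -EBUSY, drain the CQ ring and try again. */
    static int submit_with_backpressure(struct io_uring *ring)
    {
        for (;;) {
            int ret = io_uring_submit(ring);
            if (ret != -EBUSY)
                return ret;

            /* Backlogged completions have already been flushed into the CQ
             * ring where there was room, so they can be reaped without
             * entering the kernel. */
            struct io_uring_cqe *cqe;
            while (io_uring_peek_cqe(ring, &cqe) == 0) {
                /* ... handle cqe->res / cqe->user_data here ... */
                io_uring_cqe_seen(ring, cqe);
            }
        }
    }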

SUBMIT_STABLE       1U << 2

IORING_FEAT_SUBMIT_STABLE (from Linux 5.5)

If this flag is set, applications can be certain that any data for async offload has been consumed when the kernel has consumed the SQE.
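In practice this means per-request state that only the SQE refers to, such as the iovec array for a readv, only needs to survive until submission rather than until completion. A sketch assuming liburing, with queue_readv() as a hypothetical helper:

    #include <liburing.h>
    #include <sys/uio.h>
    #include <errno.h>

    /* With IORING_FEAT_SUBMIT_STABLE, the iovec below (referenced only by
     * the SQE) may go out of scope once io_uring_submit() returns; the data
     * buffer it points at must still stay valid until the request completes. */
    static int queue_readv(struct io_uring *ring, int fd, void *buf, size_t len)
    {
        struct iovec iov = { .iov_base = buf, .iov_len = len };  /* short-lived */
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        if (!sqe)
            return -EAGAIN;
        io_uring_prep_readv(sqe, fd, &iov, 1, 0);
        return io_uring_submit(ring);
    }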

RW_CUR_POS          1U << 3

IORING_FEAT_RW_CUR_POS (from Linux 5.6)

If this flag is set, applications can pass -1 as the file offset for read and write requests, meaning the request uses (and advances) the current file position.
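A sketch assuming liburing (and a liburing version that exposes the features field on struct io_uring); queue_read_at_cur_pos() is a hypothetical helper that checks the feature bit before queuing a read at offset -1.

    #include <liburing.h>
    #include <errno.h>

    /* Queue a read that uses (and advances) the file's current position by
     * passing -1 as the offset; only valid when IORING_FEAT_RW_CUR_POS is set. */
    static int queue_read_at_cur_pos(struct io_uring *ring, int fd,
                                     void *buf, unsigned len)
    {
        struct io_uring_sqe *sqe;

        if (!(ring->features & IORING_FEAT_RW_CUR_POS))
            return -EOPNOTSUPP;          /* kernel older than 5.6 */

        sqe = io_uring_get_sqe(ring);
        if (!sqe)
            return -EAGAIN;
        io_uring_prep_read(sqe, fd, buf, len, (__u64) -1);
        return io_uring_submit(ring);
    }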

CUR_PERSONALITY     1U << 4

IORING_FEAT_CUR_PERSONALITY (from Linux 5.6)

We currently set up the io-wq with a static set of mm and creds. Even for a single-use io-wq per io_uring, this is suboptimal, as we may have multiple enters of the ring. For sharing the io-wq backend, it doesn't work at all.

Switch to passing in the creds and mm when the work item is set up. This means that async work is no longer deferred to the io_uring mm and creds; it is done with the current mm and creds.

Flag this behavior with IORING_FEAT_CUR_PERSONALITY, so applications know they can rely on the current personality (mm and creds) being the same for direct issue and async issue.

FAST_POLL           1U << 5

IORING_FEAT_FAST_POLL (from Linux 5.7)

io_uring first tries any request in a non-blocking manner if it can. Previously, a request that returned -EAGAIN was retried from a worker thread; with this feature, a poll-based retry backend is used instead for files that support polling.

This means that, for example, an IORING_OP_RECVMSG on a socket no longer requires an async thread to complete the IO. If the non-blocking read from the socket returns -EAGAIN, a poll handler is armed to be notified when the socket becomes readable. When it does, the pending read is executed directly by the task again, through the io_uring task work handlers. Not only is this faster and more efficient, it also avoids spawning potentially large numbers of async threads that just sit and block, waiting for the IO to complete.

The feature is marked with IORING_FEAT_FAST_POLL, meaning that async pollable IO is fast, and that a POLL request linked to another operation (poll<link>other_op) is fast as well.
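From the application's point of view nothing changes: the same request is submitted either way, and on kernels reporting the flag it completes without an async worker thread. A sketch assuming liburing, with queue_recv() as a hypothetical helper:

    #include <liburing.h>
    #include <errno.h>
    #include <stddef.h>

    /* Queue a recv on a socket. With IORING_FEAT_FAST_POLL, an -EAGAIN from
     * the initial non-blocking attempt is retried via an internal poll once
     * the socket becomes readable, instead of being punted to an async
     * worker thread. The submission code is identical either way. */
    static int queue_recv(struct io_uring *ring, int sockfd, void *buf, size_t len)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        if (!sqe)
            return -EAGAIN;
        io_uring_prep_recv(sqe, sockfd, buf, len, 0);
        io_uring_sqe_set_data(sqe, buf);     /* tag the request for the reaper */
        return io_uring_submit(ring);
    }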

POLL_32BITS         1U << 6

IORING_FEAT_POLL_32BITS (from Linux 5.9)

Poll events should be 32 bits wide to cover EPOLLEXCLUSIVE. The poll32_events field is explicitly word-swapped on big-endian systems to make sure the ABI is not changed. This feature is advertised as IORING_FEAT_POLL_32BITS; applications that want to use EPOLLEXCLUSIVE should check the feature bit first.
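A sketch assuming a liburing recent enough to take a 32-bit poll mask: check the feature bit, then pass an event mask including EPOLLEXCLUSIVE to a poll request. queue_exclusive_poll() is a hypothetical helper.

    #include <liburing.h>
    #include <sys/epoll.h>
    #include <errno.h>

    /* Arm a poll request with an event mask that needs the full 32 bits
     * (EPOLLEXCLUSIVE is bit 28); refuse if the kernel predates the feature. */
    static int queue_exclusive_poll(struct io_uring *ring, int fd)
    {
        struct io_uring_sqe *sqe;

        if (!(ring->features & IORING_FEAT_POLL_32BITS))
            return -EOPNOTSUPP;          /* kernel older than 5.9 */

        sqe = io_uring_get_sqe(ring);
        if (!sqe)
            return -EAGAIN;
        io_uring_prep_poll_add(sqe, fd, EPOLLIN | EPOLLEXCLUSIVE);
        return io_uring_submit(ring);
    }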