In a nutshell: in skyzh/uring-positioned-io, I wrapped the io_uring interface provided by Tokio's io-uring crate and implemented io_uring based asynchronous random file reading in Rust. You can use it like this:

ctx.read(fid, offset, &mut buf).await?;

This article introduces the basic usage of io_uring, then describes the implementation of the asynchronous file reading library I wrote, and finally presents a benchmark comparing its performance with mmap.

Click here to visit the GitHub project.

io_uring Introduction

io_uring is an asynchronous I/O interface provided by the Linux kernel. It was introduced in Linux 5.1 in May 2019 and is now used in a variety of projects. For example:

  • RocksDB’s MultiRead is currently doing concurrent reads of files via io_uring.
  • Tokio provides a wrapper API for io_uring (tokio-rs/io-uring). With the release of Tokio 1.0, the developers have indicated that true asynchronous file operations will eventually be provided via io_uring (see Announcing Tokio 1.0). Currently, Tokio's asynchronous file operations are implemented by calling the synchronous API on a separate I/O thread.
  • QEMU 5.0 already uses io_uring (see ChangeLog).

Most existing benchmarks of io_uring compare its Direct I/O performance against Linux AIO; io_uring can typically achieve about twice the performance of AIO.

Random file reading scenarios

In database systems, we often need multiple threads to read a file at arbitrary locations, i.e. requests of the form (<fid>, <offset>, <size>). The commonly used read / write APIs cannot do this (they require a seek first and an exclusive file handle). The following methods allow random reads of files:

  • Map the file directly into memory via mmap. Reading a file becomes a direct memory read, and can be done concurrently in multiple threads.
  • pread can read count bytes starting at a given offset, and also supports concurrent reads from multiple threads.

However, both of these options block the current thread. For example, if mmap-ed memory triggers a page fault while being read, the current thread blocks; pread itself is a blocking API. Asynchronous APIs (e.g. Linux AIO / io_uring) can reduce context switches and thus improve throughput in some scenarios.
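For reference, here is a minimal sketch of the pread-style approach using the read_at method from Rust's standard library (std::os::unix::fs::FileExt); the function name is just an illustration:

use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt;

// read_at issues a pread(2): it reads at the given offset without moving the
// file cursor, so many threads can share the same File handle concurrently.
fn read_block(file: &File, offset: u64, buf: &mut [u8]) -> io::Result<usize> {
    file.read_at(buf, offset)
}

The call still blocks the calling thread until the data is available, which is exactly what the io_uring based interface below tries to avoid.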

Basic usage of io_uring

The io_uring related syscalls can be found here. liburing provides an easier-to-use API on top of them, and Tokio's io-uring crate builds on that to provide an io_uring API for the Rust language. Here is an example of how to use io_uring.

To use io_uring, you first need to create a ring. Here we use the concurrent API provided by tokio-rs/io-uring, which allows multiple threads to share the same ring.

use io_uring::IoUring;
let ring = IoUring::new(256)?;
let ring = ring.concurrent();

Each ring corresponds to a submission queue and a completion queue; here the queues are created to hold up to 256 entries.

I/O through io_uring proceeds in three steps: pushing tasks onto the submission queue, submitting the tasks to the kernel, and harvesting finished tasks from the completion queue. Here is an example of reading a file this way.

You can construct a read-file task with opcode::Read, and add the task to the submission queue with ring.submission().push(entry).

use io_uring::{opcode, types::Fixed};
// Fixed(fid) refers to a file registered with the ring; ptr / len describe the destination buffer
let read_op = opcode::Read::new(Fixed(fid), ptr, len).offset(offset);
let entry = read_op
            .build()
            .user_data(user_data); // user_data identifies this task when its completion arrives
unsafe { ring.submission().push(entry)?; }

Once the task has been added to the queue, submit it to the kernel; submit() returns the number of entries submitted.

assert_eq!(ring.submit()?, 1);

Finally, poll for completed tasks.

loop {
    if let Some(entry) = ring.completion().pop() {
        // entry.user_data() identifies the finished task and
        // entry.result() carries the return value of the read
        // do something
    }
}

In this way, we implement random reads of files based on io_uring.
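Putting the snippets above together, a minimal blocking sketch might look as follows. It uses types::Fd instead of Fixed so that no file registration is needed; in practice the ring would be created once and reused, and exact method signatures vary between versions of the io-uring crate:

use std::os::unix::io::AsRawFd;
use io_uring::{opcode, types::Fd, IoUring};

fn read_at_with_uring(file: &std::fs::File, offset: u64, buf: &mut [u8]) -> std::io::Result<usize> {
    // in real code the ring is created once and shared, not built per read
    let ring = IoUring::new(256)?;
    let ring = ring.concurrent();

    // step 1: push a read request targeting our buffer at the given offset
    let read_op = opcode::Read::new(Fd(file.as_raw_fd()), buf.as_mut_ptr(), buf.len() as u32)
        .offset(offset as _);
    if unsafe { ring.submission().push(read_op.build().user_data(42)) }.is_err() {
        return Err(std::io::Error::new(std::io::ErrorKind::Other, "submission queue full"));
    }

    // step 2: hand the request to the kernel
    ring.submit()?;

    // step 3: spin on the completion queue until our entry shows up
    loop {
        if let Some(entry) = ring.completion().pop() {
            assert_eq!(entry.user_data(), 42);
            return if entry.result() < 0 {
                Err(std::io::Error::from_raw_os_error(-entry.result()))
            } else {
                Ok(entry.result() as usize)
            };
        }
    }
}

The spinning loop at the end is what the asynchronous wrapper described below replaces with a future that is woken when the completion arrives.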

io_uring currently has three execution modes: the default (interrupt driven) mode, poll mode, and kernel poll (SQPOLL) mode. In kernel poll mode a kernel thread watches the submission queue, so you do not necessarily need to call submit() to hand tasks to the kernel.
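For example, assuming the builder API of the io-uring crate, the two non-default modes can be requested roughly like this (setup_sqpoll takes the kernel thread's idle timeout in milliseconds):

use io_uring::IoUring;

// poll mode (IORING_SETUP_IOPOLL): busy-poll the device instead of waiting for
// interrupts; requires Direct I/O on a device that supports polling
let ring = IoUring::builder().setup_iopoll().build(256)?;

// kernel poll mode (IORING_SETUP_SQPOLL): a kernel thread polls the submission
// queue, so explicit submit() calls can often be skipped while it is awake
let ring = IoUring::builder().setup_sqpoll(2000).build(256)?;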

Implementing an asynchronous read file interface with io_uring

Our goal is to implement an interface like this, wrapping io_uring and exposing only a simple read function to the developer.

ctx.read(fid, offset, &mut buf).await?;

After referring to tokio-linux-aio, which wraps Linux AIO asynchronously, I implemented io_uring based asynchronous reads in the following way.

  • The developer needs to create a UringContext before using io_uring.
  • When the UringContext is created, one (or more) UringPollFuture is spawned in the background to submit tasks and to poll for completed ones (corresponding to the second and third steps of reading a file in the previous section).
  • The developer reads a file through ctx: ctx.read creates a UringReadFuture, and awaiting it does the following (see the sketch after this list).
    1. UringReadFuture creates a UringTask object that is fixed in memory, then pushes the read task onto the submission queue, using the address of the UringTask as the user data of the read operation. The UringTask contains a channel.
    2. UringPollFuture submits the task in the background.
    3. UringPollFuture polls for completed tasks in the background.
    4. UringPollFuture reads back the user data, turns it back into a reference to the UringTask, and notifies the UringReadFuture through the channel that the I/O operation has completed.
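A minimal sketch of the core mechanism is shown below. The names and field layout are simplified compared to the real crate, and a tokio oneshot channel stands in for whatever notification primitive is actually used:

use io_uring::cqueue;
use tokio::sync::oneshot;

// The address of this struct travels through the kernel as user_data, so it
// must stay at a fixed location until the completion entry comes back.
struct UringTask {
    complete: Option<oneshot::Sender<i32>>, // delivers the read's result code
}

// Submission side (inside UringReadFuture): turn the task into a u64 user_data.
fn make_user_data(tx: oneshot::Sender<i32>) -> u64 {
    let task = Box::new(UringTask { complete: Some(tx) });
    Box::into_raw(task) as u64
}

// Completion side (inside UringPollFuture): recover the task and wake the future.
fn complete(entry: &cqueue::Entry) {
    let mut task = unsafe { Box::from_raw(entry.user_data() as *mut UringTask) };
    if let Some(tx) = task.complete.take() {
        let _ = tx.send(entry.result());
    }
}

The actual library keeps the UringTask fixed in memory for the whole read rather than boxing it like this; the sketch only shows how user_data ties a completion entry back to the future waiting on the channel.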

The whole process is shown in the following figure.

[Figure: flow of an asynchronous read through UringContext, UringReadFuture and UringPollFuture]

This makes it easy to call io_uring to read files asynchronously. It also has a side benefit: task submissions are automatically batched. Normally, every I/O operation would cost one syscall, but since a single future both submits and polls tasks, there may be several not-yet-submitted tasks in the queue at submission time, and they can all be submitted at once. This reduces syscall and context-switch overhead (while of course adding some latency). From the benchmark results, each submission packs about 20 read tasks.
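As a rough illustration of the batching effect, the background loop behaves like the sketch below (simplified; the real UringPollFuture is a proper Future rather than a blocking loop):

// one submit() flushes every entry pushed since the last call,
// so concurrent readers share a single syscall
loop {
    let _batched = ring.submit()?; // can be ~20 queued reads per call in the benchmark below
    while let Some(entry) = ring.completion().pop() {
        // hand entry.result() back through the UringTask channel
    }
}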

Benchmark

I compared the performance of the wrapped io_uring with mmap. The workload randomly reads aligned 4K blocks from 128 files of 1 GB each. My machine has 32 GB of RAM and a 1 TB NVMe SSD. Six cases were tested:

  • 8-thread mmap. (mmap_8)
  • 32-thread mmap. (mmap_32)
  • 512-thread mmap. (mmap_512)
  • 8 threads, 8 concurrent io_uring reads. (uring_8)
  • 8 threads, 32 concurrent io_uring reads, i.e. 8 worker threads with 32 futures reading simultaneously. (uring_32)
  • 8 threads, 512 concurrent io_uring reads. (uring_512)

Throughput (op/s) and latency (ns) were measured for each case.

case       throughput (op/s)   p50        p90        p999        p9999       max
uring_8    104086              83166      109183     246416      3105883     14973666
uring_32   227098              142869     212730     1111491     3321889     14336132
uring_512  212077              1973421    3521119    19478348    25551700    35433481
mmap_8     109698              78971      107021     204211      1787823     18522047
mmap_32    312830              100336     178914     419955      4408214     55129932
mmap_512   235369              2556429    3265266    15946744    50029659    156095218

mmap turns out to be far superior to io_uring. Well, it's true that this wrapper is not very good, but it is at least usable. Below are latency heatmaps over one minute; each pair of plots shows mmap first and then io_uring.

[Heatmap: mmap_8 / uring_8]

[Heatmap: mmap_32 / uring_32]

[Heatmap: mmap_512 / uring_512]

Some possible improvements

  • It looks like io_uring does not perform very well after being wrapped with Tokio this way. The overhead introduced by the wrapping could later be measured by comparing Rust / C performance on the io_uring nop opcode.
  • Test the performance of Direct I/O. Only Buffered I/O has been tested so far.
  • Compare against Linux AIO to confirm that performance is no worse than Linux AIO.
  • Use perf to see where the bottlenecks are now. Currently, once cargo flamegraph is attached, io_uring fails to allocate memory.
  • Currently, the user must ensure that &mut buf stays valid for the whole duration of the read. If the Future is aborted, there will be a memory leak. A similar issue in futures-rs is discussed at https://github.com/rust-lang/futures-rs/issues/1278 . Tokio's current file I/O solves this by copying twice (first into an internal buffer, then to the user). An owned-buffer interface is another option (see the sketch after this list).
  • Maybe also wrap file writes and other operations along the way.
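One common way around the buffer-validity problem above is to let the read take ownership of the buffer and hand it back on completion, so an aborted future cannot invalidate memory the kernel may still write to. A purely hypothetical signature, not part of the current crate:

use std::io;

// Hypothetical context type standing in for UringContext in this sketch.
struct Ctx;

impl Ctx {
    // The caller gives up the Vec for the duration of the read and receives it
    // back together with the result; if the future is dropped, the background
    // poller still owns the buffer and can release it safely after completion.
    async fn read_owned(&self, _fid: u32, _offset: u64, buf: Vec<u8>) -> (Vec<u8>, io::Result<usize>) {
        // ... submit using buf's pointer, stash buf inside the UringTask,
        // and return it once the completion entry arrives ...
        (buf, Ok(0))
    }
}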