01 How processes read data

A process has its own private heap and stack memory, and it can only read the memory that belongs to it. If it needs to access a device outside the process, it must ask the operating system to do so on its behalf. The operation by which a process requests such access from the operating system is called a system call.

The operating system kernel also has its own memory area, carved out of the high address range of system memory, while process stacks and heaps live in the lower address range. The kernel area differs fundamentally from process memory in its permission level. When the kernel reads data from a device such as a disk, it first copies the data from the disk into a kernel buffer, and then copies the buffered data into the user process's memory. This kernel-side copying always happens; whether the process itself blocks while it happens depends on the IO model in use, such as blocking IO versus non-blocking IO.

Thus, the exact flow of a process reading a file from disk is as follows; a minimal code sketch appears after the list.

  1. The process initiates a read request via a system call and either blocks or returns immediately, depending on the IO mode.
  2. The kernel issues an IO request to the target device (e.g., the disk); the device reads the data into the kernel buffer and raises an interrupt when it is done. This step itself may be handled synchronously or asynchronously.
  3. The kernel copies the data from the kernel buffer into the process's address space and notifies the process that the data is ready (e.g., via poll/epoll readiness notification).
  4. The process wakes up and the application processes the data.
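
To make this concrete, here is a minimal sketch of the classic blocking read path in C; the file path is hypothetical and error handling is kept to a minimum:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    char buf[4096];

    /* open() and read() are system calls: each one traps into the
     * kernel, which copies data disk -> kernel buffer -> buf. */
    int fd = open("/tmp/demo.txt", O_RDONLY);  /* hypothetical path */
    if (fd < 0) { perror("open"); return 1; }

    /* In the default (blocking) IO mode, the process sleeps here
     * until the kernel has filled buf. */
    ssize_t n = read(fd, buf, sizeof buf);
    if (n < 0) { perror("read"); close(fd); return 1; }

    printf("read %zd bytes\n", n);
    close(fd);
    return 0;
}
```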

In this flow, all data must pass through kernel space, and file access inside the kernel also goes through the virtual file system (VFS), so the whole path involves many layer transitions and CPU context switches. Note that while the CPU itself is receiving device data it can do nothing else, so when very heavy traffic (e.g., gigabit NIC data) lands on a single CPU core, that core spends essentially 100% of its time moving data, and the application faces a huge performance challenge.

So, is it possible to read data without going through the CPU?

02 DMA Technology

To solve the problem of high CPU load caused by heavy IO traffic, an effective solution emerged: DMA, or Direct Memory Access. The idea is simple: instead of having the CPU perform the data copies itself, the work is delegated to a DMA controller. Even in traffic-intensive scenarios, the CPU can then still keep a reasonable amount of capacity available for applications. DMA copies data from one address space to another; the transfer is carried out by the DMA controller, and the CPU is not involved in the copy itself. The CPU only needs to initialize the transfer and then hand it over to the DMA controller.

Compared with traditional IO, the DMA approach leaves the device-to-kernel copy to the DMA controller, freeing the CPU for other work.
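
To illustrate the division of labor, here is a heavily simplified, hypothetical sketch in C. The register layout and the memcpy standing in for the hardware transfer are invented for illustration, not taken from any real driver; the point is that the CPU's only job is to fill in a descriptor and start the transfer:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical DMA controller registers; real hardware exposes
 * something similar via memory-mapped IO. */
struct dma_controller {
    uintptr_t src;    /* source address of the transfer   */
    uintptr_t dst;    /* destination address              */
    uint32_t  len;    /* number of bytes to move          */
    uint32_t  start;  /* writing 1 kicks off the transfer */
};

/* The CPU's entire job: program the descriptor and start it. */
static void dma_start(struct dma_controller *dma,
                      const void *src, void *dst, uint32_t len) {
    dma->src = (uintptr_t)src;
    dma->dst = (uintptr_t)dst;
    dma->len = len;
    dma->start = 1;  /* from here on the controller copies; the CPU is free */
}

/* Simulated "hardware": in a real system this copy happens inside
 * the DMA controller, concurrently with the CPU's other work. */
static void dma_hw_run(struct dma_controller *dma) {
    if (dma->start) {
        memcpy((void *)dma->dst, (const void *)dma->src, dma->len);
        dma->start = 0;  /* transfer done; hardware would raise an interrupt */
    }
}

int main(void) {
    char disk_block[16] = "data from disk";
    char kernel_buf[16] = {0};
    struct dma_controller dma = {0};

    dma_start(&dma, disk_block, kernel_buf, sizeof disk_block);
    dma_hw_run(&dma);  /* stands in for the controller working asynchronously */

    printf("kernel buffer now holds: %s\n", kernel_buf);
    return 0;
}
```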

Thus, DMA effectively relieves the IO bottleneck for processes reading external data and lets them run more efficiently. But what about a service whose only job is to forward files on disk to other devices on the network? A common example is a file transfer server.

File transfer server: a user requests a file stored on the host's disk from the service.

This process involves both reading and sending, each requiring its own system call, and every system call means two switches between user mode and kernel mode, so a complete file transfer costs four such context switches. In Kafka, messages are persisted to disk, and consuming a message this way would likewise require four context switches. The switches themselves are costly, since each involves saving and restoring stack and register state. For this scenario, then, the bottleneck remains the CPU context switches.
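
A minimal sketch of this naive forwarding loop, assuming file_fd is an open file and client_fd an already-connected socket (both hypothetical, setup omitted); every chunk costs two system calls, i.e., four user/kernel switches:

```c
#include <unistd.h>

/* Naive forwarding: per chunk, read() and write() each cross the
 * user/kernel boundary twice, and the data is copied four times
 * (disk -> kernel buffer -> user buffer -> socket buffer -> NIC). */
ssize_t forward_file(int file_fd, int client_fd) {
    char buf[4096];
    ssize_t n, total = 0;
    while ((n = read(file_fd, buf, sizeof buf)) > 0) {   /* syscall #1 */
        if (write(client_fd, buf, (size_t)n) != n)       /* syscall #2 */
            return -1;
        total += n;
    }
    return n < 0 ? -1 : total;
}
```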

03 Zero-Copy Technology

The context switches exist because the user process's memory and the kernel's memory sit at different permission levels: a user process cannot touch memory outside its own stack and heap. Therefore, to solve the problem above, we need a way to cut down the trips into the kernel and the copies between kernel space and user space.

If the kernel relaxes its protection so that the user process can directly read and write a region of kernel memory, one copy can be eliminated, and a complete file transfer then needs only three copies. In the operating system, mmap() provides exactly this capability; the flow is as follows, with a code sketch after the list.

  1. The process calls mmap() to read the file. The CPU issues a request to the DMA controller, which reads the disk data and writes it into the kernel buffer.
  2. The process shares that buffer with the kernel and reads the data directly from it; no kernel-to-user copy is needed.
  3. The process calls write() to send the file. The CPU copies the data from the shared kernel buffer into the socket buffer, which is the one remaining CPU copy.
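
A minimal sketch of the mmap()-based path in C, assuming sock_fd is an already-connected socket (setup omitted):

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* mmap() maps the kernel's buffer pages into the process, so
 * reading the file needs no kernel-to-user copy; write() still
 * performs one CPU copy into the socket buffer. */
int send_file_mmap(const char *path, int sock_fd) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return -1; }

    void *addr = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) { close(fd); return -1; }

    /* One CPU copy: shared buffer -> socket buffer. Partial writes
     * are ignored here for brevity; this is a sketch. */
    ssize_t sent = write(sock_fd, addr, (size_t)st.st_size);

    munmap(addr, (size_t)st.st_size);
    close(fd);
    return sent == st.st_size ? 0 : -1;
}
```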

Even so, the process still involves three copies. Can the remaining CPU copy be eliminated as well? The answer is the SG-DMA (scatter-gather DMA) controller on the NIC, which can read data directly from the kernel buffer and copy it to the NIC, so the CPU no longer has to relay the data. The overall flow with SG-DMA is as follows, with a code sketch after the list.

  1. The process calls sendfile() to read and send the file in one system call. The CPU issues a request to the DMA controller, which reads the disk file and writes it into the kernel buffer.
  2. The CPU notifies the SG-DMA controller, which copies the data directly from the kernel buffer to the NIC and sends it out.
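
A minimal sketch of the sendfile() path on Linux, again assuming sock_fd is an already-connected socket:

```c
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

/* One system call replaces the read()/write() pair: the kernel
 * moves the data internally, and with SG-DMA support the CPU
 * never touches the file contents at all. */
int send_file_zero_copy(const char *path, int sock_fd) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return -1; }

    off_t offset = 0;
    ssize_t remaining = st.st_size;
    while (remaining > 0) {
        ssize_t n = sendfile(sock_fd, fd, &offset, (size_t)remaining);
        if (n <= 0) { close(fd); return -1; }
        remaining -= n;
    }
    close(fd);
    return 0;
}
```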

Neither of these two steps requires the CPU to copy data, which is why this is called a zero-copy technique. A complete transfer now takes a single system call, hence only two context switches, and no CPU copies at all.

04 Applications of Zero-Copy Technology

  • Kafka consumption: when consumers fetch messages, the broker sends log data on disk directly to the network card, located by offset.
  • Nginx caching: zero-copy is used directly for small files, while asynchronous IO is used for large files.