Sockmap is a Linux facility for socket splicing: it can forward (reverse-proxy) traffic between sockets directly in the kernel, without going through the application layer. That makes it an interesting option to explore for heavily loaded web proxy applications.

The name comes from eBPF, where the data structures used to store and share data are called maps. eBPF provides many map types; sockmap is a relatively new one, created by John Fastabend, an engineer at Cilium.

Weaknesses of the traditional approach

Why use sockmap? The answer goes back to the problems we had before sockmap was available. On Linux, if we want to forward data between sockets and other stream-like objects, the following system calls are commonly used.

System call | Both sides of the communication | Needs user-space memory? | Zero-copy?
sendfile    | disk file -> socket             | yes                      | no
splice      | pipe <-> socket                 | yes                      | yes?
vmsplice    | memory region -> pipe           | no                       | yes

Forwarding data with these calls has three main problems:

  • System call cost: these functions are themselves system calls, and making multiple system calls for each forwarded packet is expensive.
  • Wakeup latency: processes in user space must be woken up frequently to forward data. Depending on the scheduler, this can result in poor tail latency.
  • Copy cost: copying data from the kernel to user space and then immediately back to the kernel is not free and adds up to a measurable cost.
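To make these three costs concrete, here is a minimal sketch of the classic user-space forwarding loop. It is only an illustration: the function names `forward` and `demo` are made up for this example, and `socket.socketpair()` stands in for the real client and back-end TCP connections a proxy would hold.

```python
import socket


def forward(src: socket.socket, dst: socket.socket, bufsize: int = 65536) -> int:
    """Copy bytes from src to dst until src reaches EOF; return bytes forwarded."""
    total = 0
    while True:
        # System call 1: the process is woken up and the data is copied
        # from the kernel receive buffer into user-space memory.
        chunk = src.recv(bufsize)
        if not chunk:  # an empty read means the peer closed its side
            break
        # System call 2 (at least): the same bytes are copied straight
        # back from user space into the kernel send buffer.
        dst.sendall(chunk)
        total += len(chunk)
    return total


def demo() -> bytes:
    # Hypothetical stand-ins: client <-> proxy_in and proxy_out <-> backend.
    client, proxy_in = socket.socketpair()
    proxy_out, backend = socket.socketpair()
    payload = b"hello sockmap " * 1000  # small enough to fit in socket buffers

    client.sendall(payload)
    client.close()  # signals EOF to the proxy

    forward(proxy_in, proxy_out)
    proxy_in.close()
    proxy_out.close()  # signals EOF to the backend

    received = b""
    while True:
        chunk = backend.recv(65536)
        if not chunk:
            break
        received += chunk
    backend.close()
    return received
```

Every chunk costs at least two system calls, one wakeup of the proxy process, and two copies across the kernel boundary, which is exactly the overhead listed above.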

The extra cost of splice() and sendfile()

Although the splice() and sendfile() system calls improve TCP forwarding performance to some extent by avoiding copies between kernel and user space, some performance bottlenecks remain:

  1. System call overhead: both splice() and sendfile() are system calls that require context switching between user space and kernel space each time they are called. These context switches add additional overhead, which can affect performance.

  2. Multiple data buffer operations: Although splice() and sendfile() can do the data transfer in kernel space, they still require multiple operations on the data buffer. For example, in splice(), data needs to be moved from the receive buffer of one socket to the pipe buffer, and then to the send buffer of another socket. These extra data buffer operations may affect performance.

  3. Additional memory allocation and release: splice() and sendfile() require memory to be allocated and released in kernel space for data transfers. These memory operations may cause additional overhead and may lead to memory fragmentation under high load conditions.

  4. Copying cannot be completely avoided: in some cases sendfile() cannot avoid copying data. For example, when the data needs processing (such as encryption), sendfile() cannot do this directly in kernel space, and the data may need to be copied back to user space for processing.

In short, splice() and sendfile() reduce copying between kernel and user space, but the system call overhead, the repeated buffer operations, the extra memory allocation and release, and the cases where copying is unavoidable can still limit them in high-performance scenarios.

Performance comparison

Method          | Number of system calls | User-space wakeup | Number of copies
read/write loop | 2 syscalls             | yes               | 2 copies
splice          | 2 syscalls             | yes               | 0 copies (?)
io_submit       | 1 syscall              | yes               | 2 copies
SOCKMAP         | none                   | no                | 0 copies
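As a rough back-of-the-envelope reading of this table, the sketch below scales the numbers up to a whole transfer. It assumes the counts are per forwarded chunk and that every chunk is completely full, which real traffic will not guarantee; the `Method` class and `transfer_cost` function are made up for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Method:
    name: str
    syscalls_per_chunk: int
    copies_per_byte: int


# Values taken from the table; the splice copy count is marked "(?)" there.
METHODS = [
    Method("read/write loop", 2, 2),
    Method("splice", 2, 0),
    Method("io_submit", 1, 2),
    Method("SOCKMAP", 0, 0),
]


def transfer_cost(total_bytes: int, chunk_bytes: int) -> dict[str, tuple[int, int]]:
    """Return {method: (system calls, bytes copied)} for one transfer."""
    chunks = -(-total_bytes // chunk_bytes)  # ceiling division
    return {
        m.name: (chunks * m.syscalls_per_chunk, total_bytes * m.copies_per_byte)
        for m in METHODS
    }


if __name__ == "__main__":
    # Forward 1 GiB in 64 KiB chunks.
    for name, (syscalls, copied) in transfer_cost(1 << 30, 1 << 16).items():
        print(f"{name:16} {syscalls:6} syscalls, {copied / (1 << 30):.0f} GiB copied")
```

Under these assumptions, forwarding 1 GiB in 64 KiB chunks takes 16384 chunks, so the read/write loop issues 32768 system calls and copies 2 GiB in total, while SOCKMAP issues none from user space.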

Cloudflare in practice

Cloudflare uses Sockmap to implement TCP splicing in its edge network, where each request passes through a load balancer that forwards it to a back-end server. Sockmap lets the load balancer forward these requests efficiently.

Although Sockmap is promising, it comes with limitations and challenges, such as compatibility issues with other kernel features and limited support for complex network environments. For example:

  • Kernel version: Sockmap is a relatively new feature of the Linux kernel and is only supported by newer kernel versions. Adopting it may therefore require a kernel upgrade, which can be disruptive for systems that are already running stably.
  • Compatibility issues: There may be compatibility issues between Sockmap and some kernel features. For example, Sockmap does not currently work with the SO_ORIGINAL_DST socket option, which may affect applications that rely on this option.
  • Scope of support: Currently, Sockmap only supports the TCP protocol and does not support all TCP options. This may limit the use of Sockmap in some specific network environments.
  • Error handling: When using Sockmap for data forwarding, it can be a challenge to perform error handling and resume data transfer if an error is encountered.
  • Resource management: Sockmap consumes kernel resources, and these need to be managed carefully to prevent resource exhaustion or memory leaks.
  • Security: Although Sockmap itself is designed to be secure, it controls network data inside the kernel and can therefore become a target for attackers. Protecting it from abuse and limiting the possible security risks also needs to be considered.

Ref