eBPF, derived from BPF, is essentially an efficient and flexible virtual machine component in the kernel that safely executes bytecode at many kernel hook points. BPF was originally intended for efficient network packet filtering. eBPF has been redesigned and is no longer limited to the network stack; it has become a top-level subsystem of the kernel and evolved into a general-purpose execution engine. Developers can build performance analysis tools, software-defined networking, security tooling, and many other applications on top of eBPF. In this article, we introduce the history of eBPF and build an eBPF development environment for hands-on practice. All the code in this article can be found in this GitHub repository.

Technical Background

Development History

In 1992, Steven McCanne and Van Jacobson wrote a paper entitled The BSD Packet Filter: A New Architecture for User-level Packet Capture. In the paper, the authors described how they implemented network packet filtering in the Unix kernel, a new technique that was 20 times faster than the state-of-the-art packet filtering techniques of the time.

The BSD Packet Filter: A New Architecture for User-level Packet Capture

BPF introduces two major innovations in packet filtering.

  • A new virtual machine (VM) design that works efficiently on CPUs with a register-based architecture.
  • Per-application buffers that copy only the data relevant to filtering packets, not all of the packet’s information, which minimizes the amount of data BPF has to process.

Because of these improvements, Unix systems widely adopted BPF as their network packet filtering technique, and to this day many derivatives of the Unix kernel (including the Linux kernel) still use this implementation. tcpdump uses BPF as its underlying packet filtering technique, and we can pass the -d option to print the BPF instructions that a filter expression compiles to.

$ tcpdump -d 'ip and tcp port 8080'
(000) ldh      [12]
(001) jeq      #0x800           jt 2    jf 12
(002) ldb      [23]
(003) jeq      #0x6             jt 4    jf 12
(004) ldh      [20]
(005) jset     #0x1fff          jt 12   jf 6
(006) ldxb     4*([14]&0xf)
(007) ldh      [x + 14]
(008) jeq      #0x1f90          jt 11   jf 9
(009) ldh      [x + 16]
(010) jeq      #0x1f90          jt 11   jf 12
(011) ret      #262144
(012) ret      #0
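
To make the control flow of this listing concrete, here is a minimal sketch of a classic BPF machine in plain C that runs the same 13 instructions over a raw Ethernet frame. It is only a model for reading the listing: the struct insn layout, the opcode names and the run_filter function are illustrative assumptions, and the jump targets are stored as absolute instruction numbers exactly as tcpdump -d prints them, which is not how the kernel encodes them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative opcodes for the instruction forms used in the listing. */
enum op { LDH_ABS, LDB_ABS, LDH_IND, LDXB_MSH, JEQ, JSET, RET };

struct insn {
    enum op  op;
    uint32_t k;      /* immediate operand or packet offset */
    uint8_t  jt, jf; /* absolute jump targets, as printed by tcpdump -d */
};

/* The 13 instructions printed for 'ip and tcp port 8080'. */
static const struct insn prog[] = {
    { LDH_ABS,  12,      0,  0 },  /* (000) EtherType */
    { JEQ,      0x800,   2, 12 },  /* (001) IPv4? */
    { LDB_ABS,  23,      0,  0 },  /* (002) IP protocol */
    { JEQ,      0x6,     4, 12 },  /* (003) TCP? */
    { LDH_ABS,  20,      0,  0 },  /* (004) flags + fragment offset */
    { JSET,     0x1fff, 12,  6 },  /* (005) non-first fragment: drop */
    { LDXB_MSH, 14,      0,  0 },  /* (006) X = IP header length */
    { LDH_IND,  14,      0,  0 },  /* (007) TCP source port */
    { JEQ,      0x1f90, 11,  9 },  /* (008) == 8080? */
    { LDH_IND,  16,      0,  0 },  /* (009) TCP destination port */
    { JEQ,      0x1f90, 11, 12 },  /* (010) == 8080? */
    { RET,      262144,  0,  0 },  /* (011) accept */
    { RET,      0,       0,  0 },  /* (012) drop */
};

/* Read a big-endian 16-bit value; the caller has checked the bounds. */
static uint32_t be16(const uint8_t *p)
{
    return ((uint32_t)p[0] << 8) | p[1];
}

/* Run the filter over one Ethernet frame.
 * Returns the number of bytes to capture, or 0 to drop the packet. */
uint32_t run_filter(const uint8_t *pkt, size_t len)
{
    uint32_t A = 0, X = 0;  /* accumulator and index register */
    size_t pc = 0;

    assert(pkt != NULL);
    while (pc < sizeof(prog) / sizeof(prog[0])) {
        const struct insn *i = &prog[pc];
        switch (i->op) {
        case LDH_ABS:
            if (i->k + 2 > len) return 0;  /* out-of-bounds load drops the packet */
            A = be16(pkt + i->k); pc++; break;
        case LDB_ABS:
            if (i->k + 1 > len) return 0;
            A = pkt[i->k]; pc++; break;
        case LDH_IND:
            if ((size_t)X + i->k + 2 > len) return 0;
            A = be16(pkt + X + i->k); pc++; break;
        case LDXB_MSH:
            if (i->k + 1 > len) return 0;
            X = 4 * (pkt[i->k] & 0xf); pc++; break;
        case JEQ:
            pc = (A == i->k) ? i->jt : i->jf; break;
        case JSET:
            pc = (A & i->k) ? i->jt : i->jf; break;
        case RET:
            return i->k;
        }
    }
    return 0;
}
```

Feeding run_filter an IPv4/TCP frame whose source or destination port is 8080 follows the path that ends at instruction (011) and returns 262144, the snapshot length; every other frame ends at (012) and returns 0.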

In early 2014, Alexei Starovoitov implemented eBPF (extended Berkeley Packet Filter). After the redesign, eBPF evolved into a general-purpose execution engine on which performance analysis tools, software-defined networking, and many other applications can be built. eBPF first appeared in the 3.18 kernel, and since then the original BPF has been called classic BPF (cBPF). cBPF is now largely obsolete: the Linux kernel runs only eBPF, and it transparently converts loaded cBPF bytecode into eBPF before executing it.

eBPF vs. cBPF

The new design of eBPF is optimized for modern hardware, so the instructions eBPF generates run faster than the machine code produced by the old BPF interpreter. The extended version also increases the number of registers in the virtual machine from two 32-bit registers to ten 64-bit registers. With more and wider registers, developers can pass more information through function parameters and write more complex programs. All in all, these improvements make eBPF up to 4 times faster than the original BPF.

Dimension | cBPF | eBPF
Kernel version | Linux 2.1.75 (1997) | Linux 3.18 (2014); 4.x for kprobe/uprobe/tracepoint/perf-event
Number of registers | 2: A, X | 10: R0 - R9, plus R10 as a read-only frame pointer. R0 holds the return value of kernel function calls and the exit value of the eBPF program; R1 - R5 pass arguments from the eBPF program to kernel functions; R6 - R9 are callee-saved registers that the kernel function preserves; R10 is the read-only stack frame pointer.
Register width | 32-bit | 64-bit
Storage | 16 memory slots: M[0-15] | 512-byte stack, plus map storage of unlimited size
Restricted kernel calls | Very limited, JIT-specific | Yes, via the bpf_call instruction
Target events | Packets, seccomp-BPF | Packets, kernel functions, user functions, tracepoints, PMCs, etc.

In June 2014, eBPF was exposed to user space, which became a turning point for BPF technology. As Alexei wrote in the notes of the patch commit, “This patch demonstrates the potential of eBPF”. Today eBPF is no longer limited to the network stack and has become a top-level subsystem of the kernel.

eBPF and kernel modules

By analogy with the evolution of the Web, eBPF’s relationship to the kernel is somewhat similar to that of JavaScript to the browser engine. Compared with directly modifying the kernel or writing kernel modules, eBPF provides a new way to program the kernel. An eBPF program looks much like a kernel module, but its architecture emphasizes security and stability: unlike kernel modules, eBPF programs do not require recompiling the kernel, and the kernel can ensure that they run to completion without crashing the system.

Dimension | Linux kernel modules | eBPF
kprobes/tracepoints | Supported | Supported
Security | May introduce security vulnerabilities or cause a kernel panic | Checked by the verifier, which protects the kernel
Kernel functions | Can call kernel functions | Can only call BPF helper functions
Compilation | Needs to compile the kernel | No need to compile the kernel, just include the headers
Running | On the same kernel | Stable ABI-based BPF programs can be compiled once and run everywhere
Interaction with applications | Print logs or files | Via perf_event or map structures
Data structure richness | General | Rich
Barrier to entry | High | Low
Upgrade | Requires unloading and reloading, which may interrupt processing | Atomic replacement, which does not interrupt processing
Kernel built-in | Depends | Built-in kernel support

eBPF architecture

An eBPF application is divided into two parts: a user-space program and a kernel-space program.

  • The user space program is responsible for loading the BPF bytecode into the kernel and, if needed, for reading the statistics or event details returned by the kernel.
  • The BPF bytecode in the kernel runs when specific events occur in the kernel and, if needed, sends the results to user space via maps or perf-event events.
  • The user space program can communicate with the kernel BPF bytecode program in both directions using the map structure, which provides more flexible control over the BPF bytecode program running in the kernel.

The overall structure of eBPF is as follows.

overall structure of eBPF

The flow of interaction between the user space program and the BPF bytecode in the kernel is as follows.

  1. Compile the BPF program source into BPF bytecode using LLVM or GCC.

  2. Load the bytecode into the kernel using a loader.

  3. The kernel uses the verifier component to check that the bytecode is safe to execute so that it cannot harm the kernel, and after confirming its safety loads the bytecode into the corresponding kernel subsystem for execution.

  4. BPF bytecode programs running in the kernel can send data back to user space in two ways:

    • Maps can be used to pass summary statistics computed in the kernel (e.g. measured latency, stack information) back to user space.
    • perf-event is used to send events collected in the kernel to user space in real time, where user-space programs read and analyze them.

eBPF Restrictions

Although eBPF is powerful, the kernel imposes a number of restrictions on it to ensure safe and timely processing. As the technology evolves, these restrictions are gradually being relaxed or given workarounds.

  • eBPF programs cannot call arbitrary kernel functions; they are limited to the BPF helper functions provided by the kernel, and the list of supported helpers grows as the kernel evolves.

  • eBPF programs are not allowed to contain unreachable instructions, which prevents loading invalid code and delaying program termination.

  • Loops in an eBPF program must be bounded so that the program ends in a finite amount of time. This mainly prevents arbitrary loops inserted in kprobes from locking up the entire system; workarounds include unrolling loops and adding helper functions for common uses that require loops. Linux 5.3 added support for bounded loops, whose runtime has a verifiable upper limit.

  • The eBPF stack size is limited to MAX_BPF_STACK, which as of Linux 5.8 is set to 512 bytes; see include/linux/filter.h. This limit is particularly relevant when storing multiple string buffers on the stack: a char[256] buffer consumes half of the stack. There are no plans to increase this limit; the solution is to switch to BPF map storage, which is effectively unlimited.

    /* BPF program can access up to 512 bytes of stack space. */
    #define MAX_BPF_STACK   512
    
  • The eBPF program size was originally limited to 4096 instructions; as of Linux 5.8 the limit is 1 million instructions (BPF_COMPLEXITY_LIMIT_INSNS, see include/linux/bpf.h), while the 4096-instruction limit (BPF_MAXINSNS) is retained for unprivileged BPF programs. Newer versions of eBPF also support chained calls across multiple eBPF programs; although there are some limitations on passing information between them, they can be combined to achieve more powerful functionality.

    #define BPF_COMPLEXITY_LIMIT_INSNS      1000000 /* yes. 1M insns */
    

eBPF in action

Before we dive into the features of eBPF, let’s get our hands dirty and get a real feel for what eBPF programs are and how to develop them. As the eBPF ecosystem evolves, there are more and more toolchains for developing eBPF programs, which are also described in the following sections.

  • bcc-based development: bcc offers a Python API on the front end, with the eBPF program itself written in C on the back end. It is simple and easy to use, but performance is poor.
  • libbpf-bootstrap-based development: libbpf-bootstrap provides convenient scaffolding.
  • Kernel-source-based development: this has a higher barrier to entry but stays closest to the underlying eBPF principles, so this approach is used here as an example.

Kernel source code compilation

The system environment is as follows, using Tencent Cloud CVM, Ubuntu 20.04, kernel version 5.4.0

$ uname -a
Linux VM-1-3-ubuntu 5.4.0-42-generic #46-Ubuntu SMP Fri Jul 10 00:24:02 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

First install the necessary dependencies.

sudo apt install -y bison build-essential cmake flex git libedit-dev pkg-config libmnl-dev \
   python zlib1g-dev libssl-dev libelf-dev libcap-dev libfl-dev llvm clang pkg-config \
   gcc-multilib luajit libluajit-5.1-dev libncurses5-dev libclang-dev clang-tools

It is generally recommended to install the kernel source with apt, which is easy and installs only the source for the current kernel; the source is around 200 MB.

# apt-cache search linux-source

# apt install linux-source-5.4.0

The source code is installed in the /usr/src/ directory.

$ ls -hl
total 4.0K
drwxr-xr-x 4 root root 4.0K Nov  9 13:22 linux-source-5.4.0
lrwxrwxrwx 1 root root   45 Oct 15 10:28 linux-source-5.4.0.tar.bz2 -> linux-source-5.4.0/linux-source-5.4.0.tar.bz2
$ tar -jxvf linux-source-5.4.0.tar.bz2
$ cd linux-source-5.4.0

$ cp -v /boot/config-$(uname -r) .config # or: make defconfig / make menuconfig
$ make headers_install
$ make modules_prepare
$ make scripts     # optional
$ make M=samples/bpf  # if the configuration is broken, fix it with: make oldconfig && make prepare
After successful compilation, you can see a series of object files and binaries in the samples/bpf directory.

Hello World

As mentioned before, an eBPF application usually consists of two parts: a kernel-space program and a user-space program. There are many such programs in the samples/bpf directory, with kernel-space programs ending in _kern.c and user-space programs ending in _user.c. Rather than starting with these complicated programs, let’s write a Hello World eBPF program by hand.

The kernel-space program, hello_kern.c:

#include <linux/bpf.h>
#include "bpf_helpers.h"

#define SEC(NAME) __attribute__((section(NAME), used))

SEC("tracepoint/syscalls/sys_enter_execve")
int bpf_prog(void *ctx)
{
    char msg[] = "Hello BPF from houmin!\n";
    bpf_trace_printk(msg, sizeof(msg));
    return 0;
}

char _license[] SEC("license") = "GPL";

function entry

There are some differences between the above code and normal C programming.

  1. The program entry point is specified by the compiler section attribute SEC("tracepoint/syscalls/sys_enter_execve").
  2. The entry arguments are no longer argc and argv; they vary with the prog type. In our case, the prog type is BPF_PROG_TYPE_TRACEPOINT, and its entry argument is void *ctx.

headers

#include <linux/bpf.h>

The source of this header file is the kernel source header file. It is installed in /usr/include/linux/bpf.h.

It provides many of the symbols needed for bpf programming, for example

  1. enum bpf_func_id defines the ids of all kernel helper functions.
  2. enum bpf_prog_type defines all the types of prog supported by the kernel.
  3. struct __sk_buff is the interface in the bpf code to access the kernel struct sk_buff.

etc.

#include "bpf_helpers.h"

This header comes from libbpf, which needs to be installed separately. We include it because the program calls bpf_trace_printk(), which is a kernel helper function.

program explanation

Here is a brief explanation of the kernel-space eBPF program, which is very simple.

  • bpf_trace_printk is an eBPF helper function that prints messages to trace_pipe (/sys/kernel/debug/tracing/trace_pipe).
  • The code declares the SEC macro and defines a GPL license, because an eBPF program loaded into the kernel is subject to a license check, similar to a kernel module.

Loading the BPF code

The user-space program, hello_user.c:

#include <stdio.h>
#include "bpf_load.h"

int main(int argc, char **argv)
{
    if(load_bpf_file("hello_kern.o") != 0)
    {
        printf("The kernel didn't load BPF program\n");
        return -1;
    }

    read_trace_pipe();
    return 0;
}

The user-space eBPF program does the following.

  • Load the compiled kernel-space eBPF object file into the kernel via load_bpf_file.
  • Read the trace information from trace_pipe via read_trace_pipe and print it to the console.

Modify the Makefile file in the samples/bpf directory by adding the following three lines in the corresponding locations.

hostprogs-y += hello
hello-objs := bpf_load.o hello_user.o
always += hello_kern.o

Recompile, and you can see that the files were built successfully.

$ make M=samples/bpf
$ ls -hl samples/bpf/hello*
-rwxrwxr-x 1 ubuntu ubuntu 404K Mar 30 17:48 samples/bpf/hello
-rw-rw-r-- 1 ubuntu ubuntu  317 Mar 30 17:47 samples/bpf/hello_kern.c
-rw-rw-r-- 1 ubuntu ubuntu 3.8K Mar 30 17:48 samples/bpf/hello_kern.o
-rw-rw-r-- 1 ubuntu ubuntu  246 Mar 30 17:47 samples/bpf/hello_user.c
-rw-rw-r-- 1 ubuntu ubuntu 2.2K Mar 30 17:48 samples/bpf/hello_user.o

Go to the samples/bpf directory and run the hello program; you should see output like the following.

$ sudo ./hello
           <...>-102735 [001] ....  6733.481740: 0: Hello BPF from houmin!

           <...>-102736 [000] ....  6733.482884: 0: Hello BPF from houmin!

           <...>-102737 [002] ....  6733.483074: 0: Hello BPF from houmin!

Code Explanation

As mentioned earlier, the load_bpf_file function loads the eBPF bytecode compiled by LLVM into the kernel. How exactly is this done?

  • load_bpf_file is also implemented in the samples/bpf directory; see bpf_load.c.
  • Reading the load_bpf_file code shows that it mainly parses the ELF-formatted eBPF bytecode and then calls the load_and_attach function.
  • The load_and_attach function in turn calls bpf_load_program, which is a function provided by libbpf.
  • The license and kern_version arguments passed to bpf_load_program come from parsing the eBPF ELF file, and prog_type comes from the type specified in the SEC field of the BPF code.

static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
{
    bool is_socket = strncmp(event, "socket", 6) == 0;
    bool is_kprobe = strncmp(event, "kprobe/", 7) == 0;
    bool is_kretprobe = strncmp(event, "kretprobe/", 10) == 0;
    bool is_tracepoint = strncmp(event, "tracepoint/", 11) == 0;
    bool is_raw_tracepoint = strncmp(event, "raw_tracepoint/", 15) == 0;
    bool is_xdp = strncmp(event, "xdp", 3) == 0;
    bool is_perf_event = strncmp(event, "perf_event", 10) == 0;
    bool is_cgroup_skb = strncmp(event, "cgroup/skb", 10) == 0;
    bool is_cgroup_sk = strncmp(event, "cgroup/sock", 11) == 0;
    bool is_sockops = strncmp(event, "sockops", 7) == 0;
    bool is_sk_skb = strncmp(event, "sk_skb", 6) == 0;
    bool is_sk_msg = strncmp(event, "sk_msg", 6) == 0;

    //...

    fd = bpf_load_program(prog_type, prog, insns_cnt, license, kern_version,
                    bpf_log_buf, BPF_LOG_BUF_SIZE);
    if (fd < 0) {
        printf("bpf_load_program() err=%d\n%s", errno, bpf_log_buf);
        return -1;
    }
    //...
}
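
The prefix tests above decide which prog_type is passed to bpf_load_program. The elided part of load_and_attach is not reproduced here; the following is only a table-driven sketch of the same idea, assuming each section prefix maps to the enum bpf_prog_type value of the same name from linux/bpf.h (kprobe and kretprobe sections both use BPF_PROG_TYPE_KPROBE, since there is no separate kretprobe program type). The names sec_rule and prog_type_for_section are hypothetical.

```c
#include <assert.h>
#include <linux/bpf.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical rule table: section-name prefix -> program type. */
struct sec_rule {
    const char *prefix;
    enum bpf_prog_type type;
};

static const struct sec_rule rules[] = {
    { "socket",          BPF_PROG_TYPE_SOCKET_FILTER },
    { "kprobe/",         BPF_PROG_TYPE_KPROBE },
    { "kretprobe/",      BPF_PROG_TYPE_KPROBE },
    { "tracepoint/",     BPF_PROG_TYPE_TRACEPOINT },
    { "raw_tracepoint/", BPF_PROG_TYPE_RAW_TRACEPOINT },
    { "xdp",             BPF_PROG_TYPE_XDP },
    { "perf_event",      BPF_PROG_TYPE_PERF_EVENT },
    { "cgroup/skb",      BPF_PROG_TYPE_CGROUP_SKB },
    { "cgroup/sock",     BPF_PROG_TYPE_CGROUP_SOCK },
    { "sockops",         BPF_PROG_TYPE_SOCK_OPS },
    { "sk_skb",          BPF_PROG_TYPE_SK_SKB },
    { "sk_msg",          BPF_PROG_TYPE_SK_MSG },
};

/* Return the program type for an ELF section name, or
 * BPF_PROG_TYPE_UNSPEC if no prefix matches (e.g. "license"). */
enum bpf_prog_type prog_type_for_section(const char *event)
{
    assert(event != NULL);
    for (size_t i = 0; i < sizeof(rules) / sizeof(rules[0]); i++) {
        if (strncmp(event, rules[i].prefix, strlen(rules[i].prefix)) == 0)
            return rules[i].type;
    }
    return BPF_PROG_TYPE_UNSPEC;
}
```

For the Hello World program, the section name tracepoint/syscalls/sys_enter_execve matches the "tracepoint/" prefix, which is why the entry argument there is a tracepoint context.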

eBPF Features

Hook Overview

eBPF programs are event-driven: they run when the kernel or an application passes a defined hook point. These hook points are predefined and include system calls, function entry/exit, kernel tracepoints, network events, etc.

eBPF Hook

If no hook point exists for a specific requirement, you can attach an eBPF program almost anywhere in the kernel or in a user program by using a kprobe or uprobe.

kprobe/uprobe

Verification

With great power there must also come great responsibility.

Every eBPF program loaded into the kernel must pass verification, which ensures that eBPF programs are safe to run. The main checks are the following.

  • The process that loads the eBPF program has the necessary privileges; unless unprivileged eBPF is enabled on the node, only privileged processes can load eBPF programs.

    • The kernel provides a configuration item /proc/sys/kernel/unprivileged_bpf_disabled to disable the use of the bpf(2) system call by unprivileged users, which can be changed with the sysctl command

    • A special feature is that this configuration item is designed as a one-time kill switch, which means that once it is set to 1, there is no way to change it to 0 unless you restart the kernel

    • Once set to 1, only processes with the CAP_SYS_ADMIN privilege in their initial namespace can call the bpf(2) system call. Cilium will also set this configuration item to 1 when it starts.

      $ echo 1 > /proc/sys/kernel/unprivileged_bpf_disabled
      
  • The eBPF program does not crash or otherwise harm the system.

  • The eBPF program cannot get stuck in an infinite loop and always runs to completion.

  • The eBPF program meets the system’s size requirements; oversized eBPF programs are not allowed to be loaded into the kernel.

  • The eBPF program is of limited complexity: the verifier evaluates all possible execution paths of the program and must be able to complete this analysis in a limited time.
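
Two of these checks, rejecting loops and rejecting unreachable instructions, can be illustrated with a depth-first walk over a program’s control flow. The sketch below is a toy model and not the kernel’s verifier: the instruction kinds, struct toy_insn and toy_verify are hypothetical, jump targets are absolute indices, and any backward edge is rejected, which corresponds to the stricter rule that applied before bounded loops were supported.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy instruction kinds. */
enum kind { K_NEXT, K_JMP, K_COND, K_EXIT };

struct toy_insn {
    enum kind kind;
    int target;  /* absolute jump target for K_JMP and K_COND */
};

#define MAX_INSNS 64
enum { WHITE, GREY, BLACK };  /* unvisited, on the current path, done */

/* Depth-first walk over every path starting at pc. */
static bool visit(const struct toy_insn *p, int n, int pc, int *color)
{
    if (pc < 0 || pc >= n) return false;  /* runs off the program */
    if (color[pc] == GREY) return false;  /* back edge: a loop */
    if (color[pc] == BLACK) return true;  /* already explored */

    color[pc] = GREY;
    bool ok = true;
    switch (p[pc].kind) {
    case K_NEXT: ok = visit(p, n, pc + 1, color); break;
    case K_JMP:  ok = visit(p, n, p[pc].target, color); break;
    case K_COND: ok = visit(p, n, pc + 1, color) &&
                      visit(p, n, p[pc].target, color); break;
    case K_EXIT: break;
    }
    color[pc] = BLACK;
    return ok;
}

/* Accept only programs in which every path reaches an exit without
 * looping and every instruction is reachable from instruction 0. */
bool toy_verify(const struct toy_insn *p, int n)
{
    int color[MAX_INSNS] = { 0 };

    assert(p != NULL);
    if (n <= 0 || n > MAX_INSNS) return false;  /* size limit */
    if (!visit(p, n, 0, color)) return false;
    for (int i = 0; i < n; i++)
        if (color[i] != BLACK) return false;    /* unreachable instruction */
    return true;
}
```

The size check at the top of toy_verify plays the role of the instruction limit; the real verifier additionally tracks register and stack state along each path, which this model leaves out entirely.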

JIT Compilation

Just-In-Time (JIT) compilation translates generic eBPF bytecode into the machine’s native instruction set, thereby greatly accelerating the execution of BPF programs.

  • JIT compilers reduce the per-instruction overhead compared to the interpreter. Typically, instructions can be mapped 1:1 to the underlying architecture’s native instructions.
  • This also reduces the size of the generated executable image and is therefore more friendly to the CPU’s instruction cache.
  • In particular, for the CISC instruction set (e.g. x86), JIT makes a number of special optimizations aimed at generating the shortest possible opcode for a given instruction, in order to reduce the space required for the program translation process.

The 64-bit x86_64, arm64, ppc64, s390x, mips64, sparc64 and 32-bit arm, x86_32 architectures all have in-kernel eBPF JIT compilers, which are functionally equivalent and can be enabled as follows.

$ echo 1 > /proc/sys/net/core/bpf_jit_enable
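
A loader may want to check the current setting before relying on the JIT. The sketch below reads the same /proc file from C; read_sysctl_int and jit_state are hypothetical helper names, and the meaning of the values (0 for disabled, 1 for enabled, 2 for enabled with debug output to the kernel log) is how the sysctl is documented, not something this article’s setup depends on.

```c
#include <assert.h>
#include <stdio.h>

/* Return the integer stored in a /proc/sys file, or -1 if the file
 * cannot be opened or does not start with a number. */
int read_sysctl_int(const char *path)
{
    int value = -1;
    FILE *f;

    assert(path != NULL);
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (fscanf(f, "%d", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}

/* Describe a bpf_jit_enable value. */
const char *jit_state(int v)
{
    switch (v) {
    case 0:  return "interpreter only";
    case 1:  return "JIT enabled";
    case 2:  return "JIT enabled with debug output";
    default: return "unknown";
    }
}
```

A typical call would be jit_state(read_sysctl_int("/proc/sys/net/core/bpf_jit_enable")), which yields "unknown" when the file is missing or unreadable.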

The 32-bit mips, ppc, and sparc architectures currently have a built-in cBPF JIT compiler. These architectures that only have a cBPF JIT compiler, and those that do not even have a BPF JIT compiler at all, require an in-kernel interpreter to execute eBPF programs.

To determine which platforms support the eBPF JIT, grep for HAVE_EBPF_JIT in the kernel source tree.

$ git grep HAVE_EBPF_JIT arch/
arch/arm/Kconfig:       select HAVE_EBPF_JIT   if !CPU_ENDIAN_BE32
arch/arm64/Kconfig:     select HAVE_EBPF_JIT
arch/powerpc/Kconfig:   select HAVE_EBPF_JIT   if PPC64
arch/mips/Kconfig:      select HAVE_EBPF_JIT   if (64BIT && !CPU_MICROMIPS)
arch/s390/Kconfig:      select HAVE_EBPF_JIT   if PACK_STACK && HAVE_MARCH_Z196_FEATURES
arch/sparc/Kconfig:     select HAVE_EBPF_JIT   if SPARC64
arch/x86/Kconfig:       select HAVE_EBPF_JIT   if X86_64

bpf program

Maps

BPF Maps are efficient key/value stores that reside in kernel space; the kernel implements several types of maps.

BPF Maps

The interaction scenarios of BPF Map are as follows.

  • Interaction between a BPF program and a user-space program: the BPF program stores its results in a map, and the user-space program accesses the map through a file descriptor.
  • Interaction between BPF programs and the rest of the kernel: kernel code other than BPF programs can also use a map as an intermediary.
  • Interaction between BPF programs: BPF programs are not allowed to access global variables for security reasons, so when a program needs shared global state, a map can act as a global variable.
  • BPF tail calls: a tail call is a jump from one BPF program to another. The calling program looks up the target program in a map of type BPF_MAP_TYPE_PROG_ARRAY and then calls the bpf_tail_call() helper function to perform the jump.

BPF programs that share maps are not required to be of the same program type; for example, tracing programs can share maps with network programs, and a single BPF program can currently directly access up to 64 different maps.

BPF Maps

The currently available generic maps are:

  • BPF_MAP_TYPE_HASH
  • BPF_MAP_TYPE_ARRAY
  • BPF_MAP_TYPE_PERCPU_HASH
  • BPF_MAP_TYPE_PERCPU_ARRAY
  • BPF_MAP_TYPE_LRU_HASH
  • BPF_MAP_TYPE_LRU_PERCPU_HASH
  • BPF_MAP_TYPE_LPM_TRIE

The above maps all use the same set of BPF helper functions to perform lookup, update, or delete operations, but each implements a different backend with different semantics and performance characteristics. As multi-CPU architectures matured, BPF maps also gained per-CPU types such as BPF_MAP_TYPE_PERCPU_HASH and BPF_MAP_TYPE_PERCPU_ARRAY. With these map types, each CPU stores and sees its own copy of the map data, and the data belonging to different CPUs is isolated. This makes lookup and aggregation operations more efficient, especially when a BPF program mainly collects time-series data such as traffic counters or metrics.
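
The per-CPU idea can be modelled in a few lines of ordinary C. The sketch below is not the kernel API: struct percpu_array, percpu_inc and percpu_sum are hypothetical names, and NR_CPUS is an assumed CPU count. It only shows why updates need no locking (each CPU writes its own slot) and why a reader has to add up one value per CPU to get a total.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 4  /* assumed CPU count for this model */
#define NR_KEYS 8

/* One private copy of every value per CPU, in the spirit of
 * BPF_MAP_TYPE_PERCPU_ARRAY. */
struct percpu_array {
    uint64_t slot[NR_CPUS][NR_KEYS];
};

/* What a BPF program would do on its own CPU: a plain add is enough,
 * because no other CPU ever writes this slot. */
void percpu_inc(struct percpu_array *m, int cpu, int key, uint64_t delta)
{
    assert(cpu >= 0 && cpu < NR_CPUS && key >= 0 && key < NR_KEYS);
    m->slot[cpu][key] += delta;
}

/* What the reader does: aggregate the per-CPU copies of one key. */
uint64_t percpu_sum(const struct percpu_array *m, int key)
{
    uint64_t total = 0;

    assert(key >= 0 && key < NR_KEYS);
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        total += m->slot[cpu][key];
    return total;
}
```

The trade-off is memory: every key costs one value per CPU, in exchange for contention-free updates on the hot path.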

The current non-generic maps in the kernel are:

  • BPF_MAP_TYPE_PROG_ARRAY: an array map that holds other BPF programs.
  • BPF_MAP_TYPE_PERF_EVENT_ARRAY
  • BPF_MAP_TYPE_CGROUP_ARRAY: an array used to check the cgroup2 membership of an skb.
  • BPF_MAP_TYPE_STACK_TRACE: a map used to store stack traces.
  • BPF_MAP_TYPE_ARRAY_OF_MAPS: holds pointers to other maps so that an entire map can be replaced atomically at runtime.
  • BPF_MAP_TYPE_HASH_OF_MAPS: holds pointers to other maps so that an entire map can be replaced atomically at runtime.

Helper Calls

eBPF programs cannot call kernel functions at will, since doing so would bind them to a specific kernel version. Instead, the kernel provides a set of helper functions. Helper functions let BPF programs query data from the kernel or push data to it through a set of kernel-defined, stable function calls. All BPF helper functions are part of the core kernel and cannot be extended or added via kernel modules. There are currently dozens of BPF helper functions available, and the number is growing. You can see the helper functions Linux currently supports in the Linux manual page bpf-helpers.

Helper functions

Different types of BPF programs may be able to use different helper functions, for example:

  • BPF programs attached to sockets can call only a subset of the helper functions available to BPF programs attached to the tc layer.
  • The encapsulation and decapsulation helper functions used for lightweight tunneling can only be used at the lower tc layer, while the event output helper functions used to push notifications to user space can be used by both tc and XDP programs.

All helper functions share the same generic, system-call-like function signature, which is defined as follows.

u64 fn(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)

The kernel abstracts the helper functions into BPF_CALL_0() to BPF_CALL_5() macros, similar in form to the corresponding type of system calls. The definition of these macros can be found in include/linux/filter.h. Take bpf_map_update_elem for example, you can see that it updates the map element by calling the callback function of the corresponding map.

BPF_CALL_4(bpf_map_update_elem, struct bpf_map *, map, void *, key,
           void *, value, u64, flags)
{
    WARN_ON_ONCE(!rcu_read_lock_held());
    return map->ops->map_update_elem(map, key, value, flags);
}

const struct bpf_func_proto bpf_map_update_elem_proto = {
    .func           = bpf_map_update_elem,
    .gpl_only       = false,
    .ret_type       = RET_INTEGER,
    .arg1_type      = ARG_CONST_MAP_PTR,
    .arg2_type      = ARG_PTR_TO_MAP_KEY,
    .arg3_type      = ARG_PTR_TO_MAP_VALUE,
    .arg4_type      = ARG_ANYTHING,
};

This approach has a number of advantages.

While cBPF overloaded its load instructions to fetch data from an impossible packet offset in order to invoke auxiliary helper functions, each cBPF JIT still had to implement support for such a cBPF extension. In eBPF, by contrast, each newly added helper function is JIT-compiled in a transparent and efficient way: the JIT compiler only needs to emit a call instruction, because the register mapping is designed so that BPF register assignments already match the underlying architecture’s calling convention. This makes it very easy to extend the core kernel with new helper functions.

The function signatures mentioned above also allow the verifier to perform type checks. The struct bpf_func_proto above holds all the information the verifier needs to know about the helper function, so that it can ensure that the types the helper expects match the current contents of the BPF program’s registers.

Argument types range from arbitrary values to specific restricted types, such as a pointer/size pair describing a BPF stack buffer that the helper function reads from or writes to. In that case, the verifier can also perform additional checks, for example, whether the buffer has been initialized.
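
As a rough illustration of this type check, the sketch below compares a helper prototype against the state of registers r1 to r5. It is a simplified model under stated assumptions, not the verifier’s code: the enum names echo those in the kernel, but the rules in arg_ok are reduced to the bare minimum (for example, map keys and values are only accepted from the stack here), and struct func_proto and check_helper_call are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified argument and register types; the kernel has many more. */
enum arg_type {
    ARG_DONTCARE, ARG_ANYTHING, ARG_CONST_MAP_PTR,
    ARG_PTR_TO_MAP_KEY, ARG_PTR_TO_MAP_VALUE,
};
enum reg_type { NOT_INIT, SCALAR_VALUE, CONST_PTR_TO_MAP, PTR_TO_STACK };

/* Hypothetical prototype: the expected type of r1..r5. */
struct func_proto {
    enum arg_type arg[5];
};

/* Prototype mirroring bpf_map_update_elem_proto shown above. */
const struct func_proto map_update_elem_proto = {
    .arg = { ARG_CONST_MAP_PTR, ARG_PTR_TO_MAP_KEY,
             ARG_PTR_TO_MAP_VALUE, ARG_ANYTHING, ARG_DONTCARE },
};

/* Does the current register state satisfy one expected argument? */
static bool arg_ok(enum arg_type want, enum reg_type have)
{
    switch (want) {
    case ARG_DONTCARE:         return true;              /* unused argument */
    case ARG_ANYTHING:         return have != NOT_INIT;  /* any initialized value */
    case ARG_CONST_MAP_PTR:    return have == CONST_PTR_TO_MAP;
    case ARG_PTR_TO_MAP_KEY:
    case ARG_PTR_TO_MAP_VALUE: return have == PTR_TO_STACK;
    }
    return false;
}

/* Check registers r1..r5 (regs[0..4]) against a helper prototype. */
bool check_helper_call(const struct func_proto *fn, const enum reg_type regs[5])
{
    assert(fn != NULL && regs != NULL);
    for (int i = 0; i < 5; i++)
        if (!arg_ok(fn->arg[i], regs[i]))
            return false;
    return true;
}
```

With this prototype, a call is accepted when r1 holds a map pointer, r2 and r3 point into the stack, and r4 holds any initialized value; passing a scalar in r1 or leaving r4 uninitialized is rejected.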

Tail Calls

Tail calls are a mechanism by which a BPF program can call another BPF program and not return to the original program when the call is complete.

  • This call has minimal overhead compared to a normal function call, because it is implemented with a long jump, reusing the original stack frame.
  • BPF programs are verified independently, so to pass state between them, either use a per-CPU map as a scratch buffer or, in the case of tc programs, some fields of the skb (e.g. cb[]).
  • Only programs of the same type can tail-call each other, and they must also match in terms of JIT compilation: either JIT-compiled programs or interpreted programs can be chained, but not a mix of both.

Tail Calls

BPF to BPF Calls

In addition to BPF helper functions and BPF tail calls, a newer feature of the BPF core infrastructure is BPF-to-BPF calls. Before this feature was introduced to the kernel, a typical BPF C program had to take special care of all code that needed to be reused, for example by declaring it as always_inline in a header file. When LLVM compiles and generates the BPF object file, all of these functions are inlined and therefore duplicated many times in the generated object file, bloating the code size.

#include <linux/bpf.h>

#ifndef __section
# define __section(NAME)                  \
   __attribute__((section(NAME), used))
#endif

#ifndef __inline
# define __inline                         \
   inline __attribute__((always_inline))
#endif

static __inline int foo(void)
{
    return XDP_DROP;
}

__section("prog")
int xdp_drop(struct xdp_md *ctx)
{
    return foo();
}

char __license[] __section("license") = "GPL";

This was necessary because the BPF program loader, verifier, interpreter and JIT lacked support for function calls. Starting with Linux 4.16 and LLVM 6.0, this limitation has been lifted and BPF programs no longer need to use always_inline everywhere. As a result, the above code can be rewritten more naturally as:

#include <linux/bpf.h>

#ifndef __section
# define __section(NAME)                  \
   __attribute__((section(NAME), used))
#endif

static int foo(void)
{
    return XDP_DROP;
}

__section("prog")
int xdp_drop(struct xdp_md *ctx)
{
    return foo();
}

char __license[] __section("license") = "GPL";

The BPF-to-BPF call is an important performance optimization that greatly reduces the size of the generated BPF code and is therefore friendlier to the CPU instruction cache (i-cache).

BPF-to-BPF call

The BPF helper function calling convention also applies to BPF-to-BPF calls:

  • r1 - r5 are used to pass arguments, and the result is returned in r0.

  • r1 - r5 are scratch registers, and r6 - r9 are preserved across calls as usual.

  • The maximum nested call depth is 8.

  • The caller can pass a pointer (e.g., a pointer to the caller’s stack frame) to the callee, but not vice versa.

  • Currently, BPF-to-BPF calls and BPF tail calls are incompatible, because the latter require reusing the current stack setup, while the former add an extra stack frame and thus change the layout that tail calls expect.

The BPF JIT compiler emits separate images for each function body, and later modifies the address of the function call in the image in the final JIT pass. It has been shown that this approach requires minimal modification to the various JITs because they can be implemented to treat BPF inter-function calls as regular BPF helper function calls.

Object Pinning

BPF maps and programs can only be accessed as kernel resources via file descriptors behind anonymous inodes in the kernel. This brings a number of advantages.

  • User-space applications can use most of the file descriptor-related APIs.
  • Passing file descriptors over Unix sockets works transparently, and so on.

But at the same time, file descriptors are limited by the process lifecycle, making operations like map sharing very unwieldy, which introduces a lot of complexity in some specific scenarios.

Take iproute2 as an example: tc or XDP sets up the environment, loads the program into the kernel, and then exits. From that point on the maps are no longer accessible from user space, even though access would be useful, for example to maps shared between the ingress and egress points of the data path (which can count packets, bytes, PPS, etc.). In addition, third-party applications may want to monitor or update maps while the BPF program is running.

To solve this problem, the kernel implements a minimal kernel-space BPF file system into which BPF maps and BPF programs can be pinned, a process called object pinning. The BPF file system is not a singleton: it supports multiple mounted instances, hard links, soft links, and so on.

Accordingly, the BPF system call is extended with two new commands, as shown below.

  • BPF_OBJ_PIN : pin an object.
  • BPF_OBJ_GET : Get a pinned object.
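The two commands can be sketched with thin wrappers around the raw system call. This is a hedged illustration, not the article's or libbpf's code: `sys_bpf`, `pin_object` and `get_pinned_object` are names made up here, and the sketch assumes a Linux system with the `<linux/bpf.h>` UAPI header and a BPF file system mounted somewhere (conventionally under /sys/fs/bpf).

```c
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* glibc exports no bpf() function, so go through syscall(2). */
static long sys_bpf(int cmd, union bpf_attr *attr, unsigned int size)
{
    return syscall(__NR_bpf, cmd, attr, size);
}

/* Pin the map or program behind fd at path, which must lie on a
 * mounted BPF file system. Returns 0 on success, -1 with errno set. */
int pin_object(int fd, const char *path)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.pathname = (uint64_t)(uintptr_t)path;
    attr.bpf_fd = (uint32_t)fd;

    return (int)sys_bpf(BPF_OBJ_PIN, &attr, sizeof(attr));
}

/* Open a previously pinned object. Returns a new file descriptor,
 * or -1 with errno set. */
int get_pinned_object(const char *path)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.pathname = (uint64_t)(uintptr_t)path;

    return (int)sys_bpf(BPF_OBJ_GET, &attr, sizeof(attr));
}
```

A loader could call `pin_object(map_fd, path)` before exiting, and a separate monitoring process could later call `get_pinned_object(path)` to obtain its own descriptor for the same map.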

object pinning

Hardening

Execution Protection

To avoid code corruption, the kernel locks both the image run by the BPF interpreter (struct bpf_prog) and the JIT-compiled image (struct bpf_binary_header) as read-only for the lifetime of the program. Any data corruption in these locations (e.g., due to some kernel bug) triggers the generic protection mechanism and therefore causes the kernel to crash rather than allowing the corruption to occur silently.

To see which platforms support setting image memory to read-only, you can use the following search.

$ git grep ARCH_HAS_SET_MEMORY | grep select
arch/arm/Kconfig:    select ARCH_HAS_SET_MEMORY
arch/arm64/Kconfig:  select ARCH_HAS_SET_MEMORY
arch/s390/Kconfig:   select ARCH_HAS_SET_MEMORY
arch/x86/Kconfig:    select ARCH_HAS_SET_MEMORY

The CONFIG_ARCH_HAS_SET_MEMORY option is not configurable, so platforms either have built-in support or they don’t, and those architectures that don’t currently support it may do so in the future.

Mitigation Against Spectre

To defend against the Spectre v2 attack, the Linux kernel provides the CONFIG_BPF_JIT_ALWAYS_ON option. When it is turned on, the BPF interpreter is completely removed from the kernel and the JIT compiler is permanently enabled.

  • In a VM-based environment, the guest kernel can no longer reuse the host kernel’s BPF interpreter, which avoids certain related attacks.
  • In a container-based environment, this configuration is optional; if the JIT is turned on anyway, the interpreter may still be compiled out to reduce the complexity of the kernel.
  • For JIT on mainstream architectures (e.g. x86_64 and arm64) it is usually recommended to turn this switch on.

Setting /proc/sys/net/core/bpf_jit_harden to 1 will do some additional hardening for unprivileged user JIT compilations. These additional hardenings will slightly degrade the performance of the program, but are effective in reducing the potential attack surface in the event that a non-trusted user is operating on the system. Still, these performance losses are relatively small compared to switching to an interpreter altogether. For the x86_64 JIT compiler, if CONFIG_RETPOLINE is set, indirect jumps for tail calls are implemented with retpoline. At the time of writing this article, this configuration is turned on in most modern Linux distributions.

Constant Blinding

Currently, enabling hardening blinds all user-supplied 32- and 64-bit constants in BPF programs at JIT compile time to defend against JIT spraying attacks that inject native opcodes into the kernel as immediate values. This attack is effective because immediate values reside in executable kernel memory, so a jump triggered by some kernel bug that lands at the start of an immediate value will start executing it as native instructions.

Blinding of JIT constants is implemented by randomizing the actual instruction. In this approach, the original immediate-based operation is converted to a register-based operation by rewriting the instruction. Instruction rewriting breaks the loading of a value into two parts.

  1. Load a blinded immediate rnd ^ imm into a register.
  2. Perform an exclusive-or (xor) on that register with rnd.

This way the original immediate imm ends up in the register and can be used for the real operation. Only the blinding of a load operation is described here; in fact all general-purpose operations are blinded. The following is the result of JIT-compiling a program with hardening turned off.

$ echo 0 > /proc/sys/net/core/bpf_jit_harden

  ffffffffa034f5e9 + <x>:
  [...]
  39:   mov    $0xa8909090,%eax
  3e:   mov    $0xa8909090,%eax
  43:   mov    $0xa8ff3148,%eax
  48:   mov    $0xa89081b4,%eax
  4d:   mov    $0xa8900bb0,%eax
  52:   mov    $0xa810e0c1,%eax
  57:   mov    $0xa8908eb4,%eax
  5c:   mov    $0xa89020b0,%eax
  [...]

The following is the result of the same program being loaded via BPF by an unprivileged user after hardening has been turned on (constant blinding has been performed here).

$ echo 1 > /proc/sys/net/core/bpf_jit_harden

  ffffffffa034f1e5 + <x>:
  [...]
  39:   mov    $0xe1192563,%r10d
  3f:   xor    $0x4989b5f3,%r10d
  46:   mov    %r10d,%eax
  49:   mov    $0xb8296d93,%r10d
  4f:   xor    $0x10b9fd03,%r10d
  56:   mov    %r10d,%eax
  59:   mov    $0x8c381146,%r10d
  5f:   xor    $0x24c7200e,%r10d
  66:   mov    %r10d,%eax
  69:   mov    $0xeb2a830e,%r10d
  6f:   xor    $0x43ba02ba,%r10d
  76:   mov    %r10d,%eax
  79:   mov    $0xd9730af,%r10d
  7f:   xor    $0xa5073b1f,%r10d
  86:   mov    %r10d,%eax
  89:   mov    $0x9a45662b,%r10d
  8f:   xor    $0x325586ea,%r10d
  96:   mov    %r10d,%eax
  [...]

The two programs are semantically identical, but in the second one the original immediate values are no longer visible in the disassembly. In addition, hardening disables exposure of JIT kernel symbols (kallsyms) to privileged users, so JIT image addresses no longer appear in /proc/kallsyms.
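The two-step rewrite behind constant blinding can be checked with a few lines of plain C. This is only an arithmetic sketch of the idea, not the kernel's JIT code, and the function names `blind_imm` and `unblind` are made up for illustration.

```c
#include <stdint.h>

/* Step 1: the value emitted as an immediate in place of imm. */
uint32_t blind_imm(uint32_t imm, uint32_t rnd)
{
    return rnd ^ imm;
}

/* Step 2: the xor executed at run time on the register; it recovers
 * the original immediate because (rnd ^ imm) ^ rnd == imm. */
uint32_t unblind(uint32_t blinded, uint32_t rnd)
{
    return blinded ^ rnd;
}
```

Applied to the listings above, `unblind(0xe1192563, 0x4989b5f3)` gives 0xa8909090, the first immediate in the unhardened output, and `unblind(0x8c381146, 0x24c7200e)` gives 0xa8ff3148, the third one. The attacker-chosen constants therefore never appear verbatim in executable memory.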

Offloads

BPF network programs, especially tc and XDP BPF programs, have a hardware offload interface in the kernel so that BPF programs can be executed directly on the NIC.

Currently, Netronome’s nfp driver supports offloading BPF via a JIT compiler that translates BPF instructions into the instruction set implemented by the NIC. It also supports offloading BPF maps to the NIC, so offloaded BPF programs can perform map lookup, update, and delete operations.

Offloads

eBPF interface

BPF System Call

eBPF provides the bpf() system call to operate on BPF Map or program with the following function prototype.

#include <linux/bpf.h>
int bpf(int cmd, union bpf_attr *attr, unsigned int size);

The function takes three arguments:

  • cmd specifies the type of command to be executed by the bpf system call, and each cmd is accompanied by a parameter attr
  • bpf_attr union allows data to be passed between kernel and user space, the exact format depends on the cmd parameter
  • The size parameter indicates the size of the bpf_attr union object in bytes

The cmd can be of the following types, which can be basically divided into two types: manipulating an eBPF Map and manipulating an eBPF program.

  • BPF_MAP_CREATE: creates an eBPF Map and returns a file descriptor pointing to the Map
  • BPF_MAP_LOOKUP_ELEM: finds an element in a Map by its key and returns its value
  • BPF_MAP_UPDATE_ELEM: creates or updates an element key/value pair in a Map
  • BPF_MAP_DELETE_ELEM : Deletes an element in a Map based on key
  • BPF_MAP_GET_NEXT_KEY: Find an element in a Map based on key and return the key of the next element
  • BPF_PROG_LOAD : checks and loads an eBPF program, returning the file descriptor associated with the program

The structure of the bpf_attr union is shown below, which can be populated with different information depending on the cmd.

union bpf_attr {
  struct {    /* Used by BPF_MAP_CREATE */
    __u32         map_type;
    __u32         key_size;    /* size of key in bytes */
    __u32         value_size;  /* size of value in bytes */
    __u32         max_entries; /* maximum number of entries in a map */
  };

  struct {    /* Used by BPF_MAP_*_ELEM and BPF_MAP_GET_NEXT_KEY commands */
    __u32         map_fd;
    __aligned_u64 key;
    union {
      __aligned_u64 value;
      __aligned_u64 next_key;
    };
    __u64         flags;
  };

  struct {    /* Used by BPF_PROG_LOAD */
    __u32         prog_type;
    __u32         insn_cnt;
    __aligned_u64 insns;      /* 'const struct bpf_insn *' */
    __aligned_u64 license;    /* 'const char *' */
    __u32         log_level;  /* verbosity level of verifier */
    __u32         log_size;   /* size of user buffer */
    __aligned_u64 log_buf;    /* user supplied 'char *' buffer */
    __u32         kern_version; /* checked when prog_type=kprobe (since Linux 4.1) */
  };
} __attribute__((aligned(8)));
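To make the relationship between cmd and the bpf_attr union concrete, here is a minimal sketch that creates a hash map, stores one element and reads it back, zeroing and refilling the union for each command. The names `sys_bpf` and `map_roundtrip` are hypothetical, and the map parameters are arbitrary. The sketch assumes a Linux system with the `<linux/bpf.h>` header; on a system where the caller lacks the required privileges, the first call simply fails.

```c
#include <errno.h>
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* glibc exports no bpf() function, so go through syscall(2). */
static long sys_bpf(int cmd, union bpf_attr *attr, unsigned int size)
{
    return syscall(__NR_bpf, cmd, attr, size);
}

/* Create a small hash map, store value under key 1 and read it back
 * into *out. Returns 0 on success, -1 (errno set) if any bpf() call
 * fails, for example for lack of privileges. */
int map_roundtrip(uint64_t value, uint64_t *out)
{
    union bpf_attr attr;
    uint32_t key = 1;
    int fd, ret, saved_errno;

    /* BPF_MAP_CREATE uses the first struct of the union. */
    memset(&attr, 0, sizeof(attr));
    attr.map_type = BPF_MAP_TYPE_HASH;
    attr.key_size = sizeof(key);
    attr.value_size = sizeof(value);
    attr.max_entries = 16;
    fd = (int)sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
    if (fd < 0)
        return -1;

    /* BPF_MAP_UPDATE_ELEM: pointers travel as 64-bit integers. */
    memset(&attr, 0, sizeof(attr));
    attr.map_fd = (uint32_t)fd;
    attr.key = (uint64_t)(uintptr_t)&key;
    attr.value = (uint64_t)(uintptr_t)&value;
    attr.flags = BPF_ANY;
    ret = (int)sys_bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));

    if (ret == 0) {
        /* BPF_MAP_LOOKUP_ELEM: the kernel writes the value to *out. */
        memset(&attr, 0, sizeof(attr));
        attr.map_fd = (uint32_t)fd;
        attr.key = (uint64_t)(uintptr_t)&key;
        attr.value = (uint64_t)(uintptr_t)out;
        ret = (int)sys_bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
    }

    saved_errno = errno;
    close(fd);              /* the map is freed with its last reference */
    errno = saved_errno;
    return ret == 0 ? 0 : -1;
}
```

Zeroing the whole union before each command matters because the kernel rejects attributes whose unused fields contain non-zero bytes.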

Commands for using eBPF programs

The BPF_PROG_LOAD command verifies and loads an eBPF program described by a filled-in attribute structure. The implementation of bpf_load_program in libbpf is shown below; you can see that it eventually invokes the bpf system call.

int bpf_load_program(enum bpf_prog_type type, const struct bpf_insn *insns,
            size_t insns_cnt, const char *license,
            __u32 kern_version, char *log_buf,
            size_t log_buf_sz)
{
    struct bpf_load_program_attr load_attr;

    memset(&load_attr, 0, sizeof(struct bpf_load_program_attr));
    load_attr.prog_type = type;
    load_attr.expected_attach_type = 0;
    load_attr.name = NULL;
    load_attr.insns = insns;
    load_attr.insns_cnt = insns_cnt;
    load_attr.license = license;
    load_attr.kern_version = kern_version;

    return bpf_load_program_xattr(&load_attr, log_buf, log_buf_sz);
}

int bpf_load_program_xattr(const struct bpf_load_program_attr *load_attr,
            char *log_buf, size_t log_buf_sz)
{
  // ...
    fd = sys_bpf_prog_load(&attr, sizeof(attr));
    if (fd >= 0)
        return fd;
    // ...
}

static inline int sys_bpf_prog_load(union bpf_attr *attr, unsigned int size)
{
    int fd;

    do {
        fd = sys_bpf(BPF_PROG_LOAD, attr, size);
    } while (fd < 0 && errno == EAGAIN);

    return fd;
}

Implementation of bpf_map_lookup_elem in libbpf.

int bpf_map_lookup_elem(int fd, const void *key, void *value)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.map_fd = fd;
    attr.key = ptr_to_u64(key);
    attr.value = ptr_to_u64(value);

    return sys_bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
}

Implementation of bpf_map_update_elem in libbpf.

int bpf_map_update_elem(int fd, const void *key, const void *value,
        __u64 flags)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.map_fd = fd;
    attr.key = ptr_to_u64(key);
    attr.value = ptr_to_u64(value);
    attr.flags = flags;

    return sys_bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
}

Implementation of bpf_map_delete_elem in libbpf.

int bpf_map_delete_elem(int fd, const void *key)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.map_fd = fd;
    attr.key = ptr_to_u64(key);

    return sys_bpf(BPF_MAP_DELETE_ELEM, &attr, sizeof(attr));
}

Implementation of bpf_map_get_next_key in libbpf.

int bpf_map_get_next_key(int fd, const void *key, void *next_key)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.map_fd = fd;
    attr.key = ptr_to_u64(key);
    attr.next_key = ptr_to_u64(next_key);

    return sys_bpf(BPF_MAP_GET_NEXT_KEY, &attr, sizeof(attr));
}

Note that the libbpf functions here are not the same as the helper functions mentioned before; the helper functions currently supported by Linux are listed in the Linux manual page bpf-helpers. Take bpf_map_lookup_elem as an example: an eBPF program calls the helper function with the following parameters.

struct msg {
    __s32 seq;
    __u64 cts;
    __u8 comm[MAX_LENGTH];
};

struct bpf_map_def SEC("maps") map = {
    .type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
    .key_size = sizeof(int),
    .value_size = sizeof(__u32),
    .max_entries = 0,
};

void *bpf_map_lookup_elem(struct bpf_map *map, const void *key);

The first argument here comes from the bpf_map created by the SEC("maps") syntactic sugar.

For a user-space program, the function prototype is as follows, where the eBPF map is accessed via fd.

int bpf_map_lookup_elem(int fd, const void *key, void *value);

BPF Program Types

The program type passed to the BPF_PROG_LOAD command determines four things.

  • Where the program can be attached.
  • Which in-kernel helper functions the verifier allows the program to call.
  • Whether data from network packets can be accessed directly.
  • The type of object passed to the program as the first argument.

In fact, the program type essentially defines an API. New program types have even been created purely to distinguish between different lists of functions allowed to be called (e.g. BPF_PROG_TYPE_CGROUP_SKB versus BPF_PROG_TYPE_SOCKET_FILTER).

A bpf program is attached to one of several hook points in the kernel, and each hook point has its own entry parameters and capabilities. This is why different prog types are defined, each with its own set of kernel helper functions that it may call. When a bpf program is loaded into the kernel, the verifier uses the prog type to check the program’s entry parameters and the helper functions it calls.

The list of eBPF program types currently supported by the kernel is shown below.

  • BPF_PROG_TYPE_SOCKET_FILTER : a network packet filter
  • BPF_PROG_TYPE_KPROBE : Determines whether kprobe should be triggered
  • BPF_PROG_TYPE_SCHED_CLS : a network traffic control classifier
  • BPF_PROG_TYPE_SCHED_ACT : a network traffic control action
  • BPF_PROG_TYPE_TRACEPOINT : Determines whether a tracepoint should be triggered
  • BPF_PROG_TYPE_XDP : a network packet filter that runs in the receive path from the device driver
  • BPF_PROG_TYPE_PERF_EVENT : Determines if the perf event handler should be triggered
  • BPF_PROG_TYPE_CGROUP_SKB : a network packet filter for control groups
  • BPF_PROG_TYPE_CGROUP_SOCK : a network packet filter for control groups that is allowed to modify socket options
  • BPF_PROG_TYPE_LWT_* : a network packet filter for lightweight tunnels
  • BPF_PROG_TYPE_SOCK_OPS : a program for setting socket parameters
  • BPF_PROG_TYPE_SK_SKB : a network packet filter for forwarding packets between sockets
  • BPF_PROG_TYPE_CGROUP_DEVICE : Determines whether device operations are allowed

As new program types are added, kernel developers also find the need to add new data structures.

For example, which bpf helper functions can a bpf prog of type BPF_PROG_TYPE_SCHED_CLS call? Let’s see how the source code implements this.

Each prog type defines a struct bpf_verifier_ops. When a prog is loaded into the kernel, the verifier calls the get_func_proto function of the structure that corresponds to its type.

const struct bpf_verifier_ops tc_cls_act_verifier_ops = {
        .get_func_proto         = tc_cls_act_func_proto,
        .convert_ctx_access     = tc_cls_act_convert_ctx_access,
};

For BPF programs of type BPF_PROG_TYPE_SCHED_CLS, the verifier calls tc_cls_act_func_proto to check whether every helper function called by the program is permitted.
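The mechanism can be modelled in a few lines of ordinary C. The sketch below is a deliberately simplified toy, not kernel code: every `toy_*` and `TOY_*` name is invented for illustration, and the set of permitted helpers is only an example. It shows the shape of the check: a per-type callback returns a prototype for each helper the type may call and NULL otherwise, and the verifier side rejects a program on the first NULL.

```c
#include <stddef.h>

/* Hypothetical helper IDs; the kernel uses enum bpf_func_id. */
enum toy_func_id {
    TOY_FUNC_map_lookup_elem,
    TOY_FUNC_skb_store_bytes,
    TOY_FUNC_xdp_adjust_head,
};

/* Hypothetical stand-in for the kernel's struct bpf_func_proto. */
struct toy_func_proto {
    const char *name;
};

static const struct toy_func_proto toy_map_lookup_proto = { "map_lookup_elem" };
static const struct toy_func_proto toy_skb_store_proto  = { "skb_store_bytes" };

/* Per-type callback for a tc-like program type: returns the prototype
 * if the type may call the helper, NULL otherwise. */
const struct toy_func_proto *toy_tc_func_proto(enum toy_func_id id)
{
    switch (id) {
    case TOY_FUNC_map_lookup_elem:
        return &toy_map_lookup_proto;
    case TOY_FUNC_skb_store_bytes:
        return &toy_skb_store_proto;
    default:
        return NULL;    /* e.g. an XDP-only helper */
    }
}

typedef const struct toy_func_proto *(*toy_get_func_proto_t)(enum toy_func_id);

/* Verifier side: walk the helper calls found in a program and reject
 * it on the first one the type does not permit.
 * Returns 0 if all calls are allowed, -1 otherwise. */
int toy_check_helper_calls(const enum toy_func_id *calls, size_t n,
                           toy_get_func_proto_t get_func_proto)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (get_func_proto(calls[i]) == NULL)
            return -1;
    }
    return 0;
}
```

Because the callback is looked up through the prog type, adding a new type with a different helper list only requires a new callback, which matches the observation that a program type essentially defines an API.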

BPF code call timing

Each prog type is invoked at a different point in the kernel.

BPF_PROG_TYPE_SCHED_CLS

BPF_PROG_TYPE_SCHED_CLS is called as follows.

Egress direction

In the egress direction there is a hook point after the tcp/ip stack has run, to which a bpf prog of type BPF_PROG_TYPE_SCHED_CLS can be attached. After this bpf code is executed, the qos, tcpdump, and xmit-to-NIC-driver code runs. In this bpf code you can modify the contents of the packet, its addresses, etc. Because the tcpdump code is executed afterwards, the changes are visible in tcpdump.

static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
{
        skb = sch_handle_egress(skb, &rc, dev);
        // enqueue tc qos
        // dequeue tc qos
        // dev_hard_start_xmit
        // tcpdump works here! dev_queue_xmit_nit
        // nic driver->ndo_start_xmit
}

Ingress direction

In the ingress direction there is a hook point after tcpdump and before delivery to the tcp/ip stack, to which a bpf prog of type BPF_PROG_TYPE_SCHED_CLS can be attached. Here you can also modify the packet, but the result of the modification is not visible in tcpdump.

static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc,
                                    struct packet_type **ppt_prev)
{
        // generic xdp bpf hook
        // tcpdump 
        // tc ingress hook
        skb = sch_handle_ingress(skb, &pt_prev, &ret, orig_dev, &another);
        // deliver to tcp/ip stack or bridge/ipvlan device
}

Execution entry point cls_bpf_classify

The real entry point for executing the bpf instructions, regardless of egress or ingress direction, is cls_bpf_classify. It traverses the linked list of bpf progs in tcf_proto and executes BPF_PROG_RUN(prog->filter, skb) for each bpf prog.

static int cls_bpf_classify(struct sk_buff *skb, const struct tcf_proto *tp,
                            struct tcf_result *res)
{
        struct cls_bpf_head *head = rcu_dereference_bh(tp->root);
        bool at_ingress = skb_at_tc_ingress(skb);
        struct cls_bpf_prog *prog;

        list_for_each_entry_rcu(prog, &head->plist, link) {
                int filter_res;

                if (tc_skip_sw(prog->gen_flags)) {
                        /* prog is flagged to skip software execution */
                        filter_res = prog->exts_integrated ? TC_ACT_UNSPEC : 0;
                } else if (at_ingress) {
                        /* It is safe to push/pull even if skb_shared() */
                        __skb_push(skb, skb->mac_len);
                        bpf_compute_data_pointers(skb);
                        filter_res = BPF_PROG_RUN(prog->filter, skb);
                        __skb_pull(skb, skb->mac_len);
                } else {
                        bpf_compute_data_pointers(skb);
                        filter_res = BPF_PROG_RUN(prog->filter, skb);
                }
                // ...
        }
        // ...
}

BPF_PROG_RUN runs the JIT-compiled code of the bpf prog or, if the kernel does not support JIT, calls the interpreter to execute the bpf bytecode.

The entry parameter that BPF_PROG_RUN passes to the bpf prog is skb, whose type is struct sk_buff, defined in the file include/linux/skbuff.h.

But bpf code cannot access sk_buff directly, for security reasons. Instead it accesses struct __sk_buff, which exposes a subset of sk_buff and serves as the interface to sk_buff for bpf programs. Accesses to __sk_buff in bpf code are translated by the verifier into accesses to the corresponding fields of sk_buff.
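From the program author's side, this means the context is simply a pointer to struct __sk_buff. The following is a minimal sketch of a tc classifier, assuming the `<linux/bpf.h>` and `<linux/pkt_cls.h>` UAPI headers are available. The function name `cls_len_limit`, the `MAX_LEN` threshold and the section name are choices made for this illustration, not something the article prescribes.

```c
#include <linux/bpf.h>
#include <linux/pkt_cls.h>

#ifndef __section
# define __section(NAME)                  \
   __attribute__((section(NAME), used))
#endif

/* Hypothetical size limit for this sketch. */
#define MAX_LEN 1500

__section("classifier")
int cls_len_limit(struct __sk_buff *skb)
{
    /* skb->len is a field of struct __sk_buff. When the program is
     * loaded, the verifier rewrites this load into a load from the
     * real struct sk_buff. */
    if (skb->len > MAX_LEN)
        return TC_ACT_SHOT;   /* drop the packet */

    return TC_ACT_OK;         /* let the packet continue */
}

char __license[] __section("license") = "GPL";
```

The program never sees the layout of struct sk_buff, so the kernel remains free to change that structure without breaking loaded programs.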

When loading the bpf prog, the verifier calls the tc_cls_act_convert_ctx_access hook in the tc_cls_act_verifier_ops structure above. This hook rewrites the eBPF instructions so that each access to __sk_buff becomes an access to struct sk_buff.

BPF Attach type

A type of bpf prog can be attached to different hooks in the kernel, and these different hooks are different attach types.

The correspondence is defined in the following function.

static enum bpf_prog_type
attach_type_to_prog_type(enum bpf_attach_type attach_type)
{
        switch (attach_type) {
        case BPF_CGROUP_INET_INGRESS:
        case BPF_CGROUP_INET_EGRESS:
                return BPF_PROG_TYPE_CGROUP_SKB;
        case BPF_CGROUP_INET_SOCK_CREATE:
        case BPF_CGROUP_INET_SOCK_RELEASE:
        case BPF_CGROUP_INET4_POST_BIND:
        case BPF_CGROUP_INET6_POST_BIND:
                return BPF_PROG_TYPE_CGROUP_SOCK;
        /* ... */
        }
}

When a bpf prog attaches to a specific hook point via the bpf() system call, the attach type needs to be specified in its entry parameter.
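As a hedged sketch of what such a call looks like, the wrapper below fills the attach-related fields of bpf_attr and issues BPF_PROG_ATTACH. The names `sys_bpf` and `prog_attach` are made up here; the sketch assumes the `<linux/bpf.h>` header and, for cgroup attach types, that `target_fd` refers to an open cgroup directory.

```c
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* glibc exports no bpf() function, so go through syscall(2). */
static long sys_bpf(int cmd, union bpf_attr *attr, unsigned int size)
{
    return syscall(__NR_bpf, cmd, attr, size);
}

/* Attach the loaded program prog_fd to the hook identified by
 * target_fd and the given attach type.
 * Returns 0 on success, -1 with errno set on failure. */
int prog_attach(int prog_fd, int target_fd, enum bpf_attach_type type)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.target_fd = (uint32_t)target_fd;    /* e.g. a cgroup directory fd */
    attr.attach_bpf_fd = (uint32_t)prog_fd;  /* fd returned by BPF_PROG_LOAD */
    attr.attach_type = type;                 /* e.g. BPF_CGROUP_INET_INGRESS */

    return (int)sys_bpf(BPF_PROG_ATTACH, &attr, sizeof(attr));
}
```

The kernel maps the attach type back to a prog type and refuses the call if it does not match the type the program was loaded with.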

Interestingly, a bpf prog of type BPF_PROG_TYPE_SCHED_CLS cannot be attached via the bpf system call because no corresponding attach type is defined for it; it has to be attached through the netlink interface instead, which is considerably more complicated.

Introduction to common prog types

There are currently 30 prog types in the kernel. Each type can do different things, so I will only talk about the ones I usually use for work.

The best way to understand a prog type is to

  • Look up the attach_type_to_prog_type table and get its attach type.
  • Then search the kernel code to see where these attach types are called in the kernel.
  • Finally, look at its entry parameters and return value handling to basically understand what it does.

include/uapi/linux/bpf.h

enum bpf_prog_type {
}

BPF_PROG_TYPE_SOCKET_FILTER

BPF_PROG_TYPE_SOCKET_FILTER is the first program type that was added to the kernel. When you attach a bpf program to a socket, you get access to all packets that are processed by the socket. Socket filtering does not allow you to modify these packets or their destinations; it only lets you observe them. In your program you can get things like the protocol type.

In tcp, for example, the call chain tcp_v4_rcv->tcp_filter->sk_filter_trim_cap is used to filter or trim packets. udp and icmp have similar calls.

BPF_PROG_TYPE_SOCK_OPS

This bpf hook is called when a tcp protocol event occurs; 15 types of events are defined. The attach type for these events is BPF_CGROUP_SOCK_OPS. A different enum value is passed in at each call point, for example

  • BPF_SOCK_OPS_TCP_CONNECT_CB is passed when an active tcp connect is initiated.
  • BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB is passed when an active connect succeeds (BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB is its passive counterpart).

Main uses: tcp tuning, event statistics, etc.

This program type allows you to modify the connection options of a socket while packets pass through the various stages of the kernel network stack. These programs attach to cgroups much like BPF_PROG_TYPE_CGROUP_SOCK and BPF_PROG_TYPE_CGROUP_SKB, but the difference is that they can be called several times throughout the life of the connection. Your bpf program receives an op argument that represents the operation the kernel is about to perform on the socket connection, so you know at which point in the connection’s lifetime the program is being called. You can also read the ip addresses, ports, etc., and you can modify the connection options to set timeouts and change the round-trip delay time of packets.

As an example, Facebook uses it to set a short retransmission timeout (RTO) for connections within the same data center. The RTO is how long TCP waits for an acknowledgment before retransmitting a segment. Facebook considers that connections between machines in the same data center should have a very short RTO, and uses a bpf program to change this value.
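A sock_ops program of this kind can be sketched as follows. This is a simplified illustration loosely modelled on the kernel's SYN-RTO sample, not Facebook's actual program: the function name `short_syn_rto`, the value `SHORT_SYN_RTO` and the "same leading 32 bits means same data center" rule are all assumptions made here. It relies on the BPF_SOCK_OPS_TIMEOUT_INIT operation, for which the program reports the initial timeout it wants (or -1 for the kernel default) through the reply field.

```c
#include <linux/bpf.h>
#include <sys/socket.h>   /* AF_INET6 */

#ifndef __section
# define __section(NAME)                  \
   __attribute__((section(NAME), used))
#endif

/* Hypothetical timeout value handed back for same-data-center peers. */
#define SHORT_SYN_RTO 10

__section("sockops")
int short_syn_rto(struct bpf_sock_ops *skops)
{
    int rv = -1;   /* -1 asks the kernel to keep its default */

    /* op tells the program which event it is being called for. */
    if (skops->op == BPF_SOCK_OPS_TIMEOUT_INIT &&
        skops->family == AF_INET6) {
        /* Assumed rule: equal leading 32 bits of the IPv6 addresses
         * mean both ends are in the same data center. */
        if (skops->local_ip6[0] == skops->remote_ip6[0])
            rv = SHORT_SYN_RTO;
    }

    skops->reply = rv;   /* the answer travels in the reply field */
    return 1;
}

char __license[] __section("license") = "GPL";
```

The same program is invoked for every other sock_ops event on the cgroup as well; for those it leaves the default untouched by replying -1.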

BPF_PROG_TYPE_CGROUP_SOCK_ADDR

This type corresponds to many attach types, which are usually called on operations such as bind and connect, with the socket address passed in.

Main use: for example, the clusterip implementation in cilium uses it to rewrite the destination ip address during an active connect.

This type of program allows you to manipulate the IP addresses and port numbers used by userspace programs controlled by a specific cgroup. This is useful when a system has multiple IP addresses and you want to ensure that a specific set of userspace programs use the same IP address and port. When you place these userspace programs in the same cgroup, these BPF programs give you the flexibility to manipulate their bindings. This ensures that all incoming and outgoing connections of these applications use the IP and port provided by the BPF program.

BPF_PROG_TYPE_SK_MSG

This type of program lets you control whether a message sent to a socket should be delivered. When the kernel creates a socket, it is stored in a socket map (see BPF_PROG_TYPE_SK_SKB below). When you attach a program to this socket map, all messages sent to those sockets are filtered. Before filtering a message, the kernel copies the data, so you can read the message and return your decision, for example SK_PASS or SK_DROP.

BPF_PROG_TYPE_SK_SKB

Call point: called on tcp sendmsg.

Main purpose: socket redirection.

This kind of program gives you access to socket maps and socket redirects. Socket maps let you keep references to sockets. When you have these references, you can use the associated helpers to redirect an incoming packet from one socket to another socket. You can forward network packets between sockets without leaving kernel space; Cilium and Facebook’s Katran use this type of program extensively for traffic control.

BPF_PROG_TYPE_CGROUP_SOCKOPT

Call points: getsockopt, setsockopt.

BPF_PROG_TYPE_KPROBE

Similar to ftrace’s kprobe: a hook point at function entry/exit, used for debugging.

BPF_PROG_TYPE_TRACEPOINT

Similar to ftrace’s tracepoint.

BPF_PROG_TYPE_SCHED_CLS

As in the example above.

BPF_PROG_TYPE_XDP

A hook point that runs when the NIC driver receives a packet, before the sk_buff data structure is created.

BPF_PROG_TYPE_XDP allows your bpf program to run very early in a network packet’s arrival at the kernel. In such a bpf program you get only a limited amount of information, because the kernel has not yet had time to process the packet. Because it runs so early, however, you have a great deal of control over how the packet is handled.

XDP defines a number of ways to handle a packet, for example

  • XDP_PASS means that the packet is passed on to the next subsystem of the kernel.
  • XDP_DROP means that the kernel should discard the packet.
  • XDP_TX means that the packet is transmitted back out of the same network interface card (NIC) that received it.

BPF_PROG_TYPE_CGROUP_SKB

BPF_PROG_TYPE_CGROUP_SKB allows you to filter the network traffic of an entire cgroup. With this program type you can exercise control over network traffic before it reaches the programs in the cgroup: any packet that the kernel tries to deliver to any process in the cgroup passes through one of these filters. You can likewise decide what happens when processes in the cgroup send network packets. It is in fact very similar to BPF_PROG_TYPE_SOCKET_FILTER. The biggest difference is that a cgroup_skb program is attached to all processes in the cgroup rather than to one specific process, which makes it very useful in container environments.

  • In the ingress direction, tcp calls this bpf prog to filter a packet when it is received (tcp_v4_rcv).
  • In the egress direction, ip calls it to filter outgoing packets (ip_finish_output). The input parameter is skb.

BPF_PROG_TYPE_CGROUP_SOCK

Called on sock create, release, and post_bind. It is mainly used for permission checking.

This type of bpf program is executed when any process in a cgroup opens a socket. The behavior is similar to that of CGROUP_SKB, but it is triggered when a process in the cgroup opens a new socket rather than when network packets pass through. This is useful for providing security and access control for groups of programs that can open sockets without having to restrict the functionality of each process separately.

eBPF tool chain

bcc

BCC is a toolkit for compiling BPF programs. It provides a Python/Lua API on the front end, is itself implemented in C/C++, and integrates LLVM/Clang to rewrite, compile and load BPF programs, along with a number of user-friendly functions.

While BCC does its best to simplify the work of BPF program developers, its “black magic” (using the Clang front-end to modify user-written BPF programs) makes it difficult to find problems and solutions when they occur. Naming conventions and automatically generated tracepoint structures must be remembered. The libbcc library also embeds a large LLVM/Clang library, which causes several problems in use.

  1. Compiling BPF programs at the start of each tool uses a lot of CPU and memory, which may cause problems on a server that is already short of system resources.
  2. It relies on kernel header packages that must be installed on each target host. Even so, if something not exported by the kernel is required, the type definition needs to be manually copied into the BPF code.
  3. Since BPF programs are compiled at runtime, many simple compilation errors can only be detected at runtime, which affects the development experience.

With the implementation of BPF CO-RE, we can use the libbpf library provided by kernel developers to develop BPF programs directly, in the same way as writing ordinary user-space C programs: compile once to generate a small binary. libbpf, as the BPF program loader, takes over relocation, loading, verification, etc., so BPF program developers only need to focus on the correctness and performance of their BPF programs. This approach minimizes overhead and removes large dependencies, making the overall development process much smoother.

Brendan Gregg, a performance optimization guru, gives comparative performance data after converting a BCC tool with libbpf + BPF CO-RE.

As my colleague Jason pointed out, the memory footprint of opensnoop as CO-RE is much lower than opensnoop.py. 9 Mbytes for CO-RE vs 80 Mbytes for Python.

We can see that at runtime the libbpf + BPF CO-RE version uses roughly one ninth of the memory of the BCC version (9 Mbytes versus 80 Mbytes), which is friendlier to servers that are already short of physical memory.

bcc

bpftrace

bpftrace is a high-level tracing language for Linux eBPF and available in recent Linux kernels (4.x). bpftrace uses LLVM as a backend to compile scripts to eBPF bytecode and makes use of BCC for interacting with the Linux eBPF subsystem as well as existing Linux tracing capabilities: kernel dynamic tracing (kprobes), user-level dynamic tracing (uprobes), and tracepoints. The bpftrace language is inspired by awk, C and predecessor tracers such as DTrace and SystemTap.

bpftrace

eBPF Go Library

eBPF Go Library

libbpf

libbpf