The examples in the previous two articles, “Developing eBPF programs in C” and “Developing eBPF programs in Go”, were at hello-world level: useful as an introduction, but not very practical.
Generally speaking, a practical eBPF program exchanges data between its kernel-state part and its user-state part, and this exchange is what lets eBPF play a more powerful role. The mechanism that makes it possible, and that no practical eBPF program can bypass, is the BPF MAP.
In this article about eBPF program development, let’s see how to use Go and BPF MAP to implement bidirectional data exchange between the kernel state and user state of eBPF programs.
1. Why BPF MAP?
Remember that BPF bytecode runs in the OS kernel state, which is separate from the user state. The only way for user-state code to access kernel-state data is to make a system call into the kernel. The variables created in a BPF kernel-state program can therefore only be accessed directly by kernel-state code.
How do we return useful data obtained by the BPF code in the kernel state to the user state for monitoring, computing, decision making, presentation, and storage? And how does the user state code pass data to the kernel state at runtime to change the BPF code’s runtime policy?
To answer these questions, the Linux kernel BPF developers introduced the BPF MAP mechanism. BPF MAP provides a channel for bidirectional data exchange between the kernel state and the user state of a BPF program. At the same time, since a bpf map is stored in memory allocated by the kernel, it can be shared by multiple BPF programs running in the kernel state, and can also serve as a mechanism for those programs to exchange and share data.
2. BPF MAP is not a narrowly defined map data structure
What exactly is a BPF MAP? It is not a hash table in the narrow sense, but a generic data structure that can store different types of data. In the words of Andrii Nakryiko, a well-known kernel BPF developer, a MAP is a concept representing an abstract data container in BPF.
So far the kernel supports more than 20 MAP types; the currently supported types are listed in the bpf_map_type enumeration in bpf.h in libbpf.
There are many types of data structures here, but they are not the focus of this article, so we won’t introduce them one by one. BPF_MAP_TYPE_HASH was the first MAP type supported by BPF. It can be understood as the hash table we work with every day, indexing data in the form of key-value pairs. We will use this type of MAP in the subsequent examples.
So how can BPF MAP share data between the kernel state and the user state? What is the principle?
We can find out from the description of the bpf system call. Its prototype, as documented in the bpf(2) man page, is: int bpf(int cmd, union bpf_attr *attr, unsigned int size).
The prototype looks relatively simple, but bpf is actually a “rich call”: it can do many different things around BPF depending on the value passed in through cmd. Its main use is loading a bpf program (cmd=BPF_PROG_LOAD), followed by a series of operations on MAPs, including creating a MAP (cmd=BPF_MAP_CREATE), looking up MAP elements (cmd=BPF_MAP_LOOKUP_ELEM), and updating MAP element values (cmd=BPF_MAP_UPDATE_ELEM).
When cmd=BPF_MAP_CREATE, the bpf call creates a MAP and returns a file descriptor fd, through which the newly created MAP can subsequently be manipulated. Accessing the map via an fd is very unix!
Of course, BPF user-state developers generally do not need to touch such a low-level system call directly. libbpf, for example, wraps a series of map operation functions that do not expose the map fd to the user, which simplifies usage and improves the experience.
Let’s take a look at how to implement map-based data exchange between the BPF user state and the kernel state in C.
3. Example of using a map in C with libbpf
This example is adapted from the helloworld example. The original helloworld example outputs a kernel log (available in /sys/kernel/debug/tracing/trace_pipe) when the execve system call is called, and the user-state program does not exchange any data with the kernel-state program.
In this new example (execve_counter), we still track the system call to execve, except that we count the calls to execve and store the counts in the BPF MAP. And the user state part of the program reads the count in this MAP and outputs the count value at regular intervals.
Let’s take a look at the source code of the kernel state part of BPF.
Unlike the helloworld example, we have defined a map structure execve_counter in the new example, which is marked as a BPF MAP variable by the SEC macro.
This map structure has four fields as follows.
- type: the BPF MAP type used (see the bpf_map_type enumeration mentioned earlier). Here we use BPF_MAP_TYPE_HASH, i.e. a hash table structure.
- max_entries: the maximum number of key-value pairs within the map.
- key: a pointer to the key memory space. Here we define a custom type, stringkey (a fixed-size char array), as the type of each key element.
- value: a pointer to the value memory space, here the value element is of type u64, a 64-bit integer.
The implementation of the kernel state function bpf_prog is also relatively simple: look up the key “execve_counter” in the map above, and if you find it, add 1 to the value in the memory pointed to by the value pointer.
Let’s take a look at the source code of the user state part of the execve_counter example.
The map is created in execve_counter_bpf__load, and as you will see by tracing the code (refer to the libbpf source code), the bpf system call is eventually called to create the map.
The difference from the helloworld example is that, before attaching the handler, we initialize the key in the bpf map using the bpf_map__update_elem function wrapped by libbpf. The value is initialized to 0; without this step, the bpf program reports that the key cannot be found the first time it runs.
Then after attaching the handler, we look up the value of key=“execve_counter” every 5s in a loop via bpf_map__lookup_elem and output it to the console.
Next we run make to compile the ebpf program, then execute it and observe the output.
Note: If you don’t know how to compile the execve_counter example, please first read “Developing a Hello World-level eBPF program from scratch using C” to understand how it is built.
The bpftool tool provides the feature to view the map, through which we can view the map created by the example.
We can also dump the entire map.
We see that there is only one key-value pair (key=“execve_counter”) in the entire map, and its value is the same as the output of the user-state part of the example program.
Well, with the C example as a base, let’s see how to implement this example based on Go.
4. Example of using Go to implement execve-counter based on cilium/ebpf
It is much easier to use Go to develop the user state part of a BPF program, and the packages provided by cilium/ebpf are very easy to use. If you don’t know how to use Go to develop the user state part of ebpf programs, please go to the article “Developing eBPF programs using Go language” to learn more.
The essential raw material for the Go example is execve_counter.bpf.c. The only difference between this C source file and execve_counter.bpf.c in the execve_counter example above is that the include header file has been changed to common.h.
Based on the raw material execve_counter.bpf.c, the bpf2go tool generates the Go source code needed for the user state part, e.g. the bpf map instance contained in bpfObject.
Finally, we can use these generated Go functions related to bpf objects directly in the main function of the main package, here is the main.go part of the source code.
In the main function, we access the map instance directly through objs.bpfMaps.ExecveCounter, and can manipulate the map directly through its Put and Lookup methods. Note that the type of the key must be consistent with the key type (the char array) in execve_counter.bpf.c, with the same memory layout. You cannot use the Go string type directly; otherwise an error is reported at run time.
Compiling and executing execve-counter-go is no different from helloworld-go.
This article introduced the main method for exchanging data between the eBPF kernel-state part and the user-state part: the BPF MAP mechanism. MAP here is not a hash table in the narrow sense, but a container of abstract data structures, currently supporting more than two dozen data structures, so you can pick the appropriate structure according to your needs (you can consult the manual to understand the characteristics of various data structures).
MAP is also essentially created by the bpf system call. The bpf program only needs to declare the key, value, type and other composition information of MAP. The user state can operate the map through the fd returned by the bpf system call. libbpf and cilium/ebpf, etc. encapsulate the operation of the fd, which simplifies the use of the API.
Updating a map value in the kernel is not atomic, so when multiple bpf programs access a map concurrently, the operation needs to be synchronized. bpf provides bpf_spin_lock for this purpose: we can add a bpf_spin_lock field to the value type to synchronize changes to the value. An example of this pattern appears in the book Linux Observability with BPF.
The code involved in this article can be downloaded here.