
The examples in the previous two articles, “Developing eBPF programs in C” and “Developing eBPF programs in Go”, were hello-world level: useful for getting started, but not very practical.

Generally speaking, a practical eBPF program exchanges data between its kernel-state part and its user-state part, and it is this data exchange that lets eBPF play a more powerful role. And to make an eBPF program practical, BPF MAP is the mechanism that cannot be bypassed.

In this article on eBPF program development, let’s see how to use BPF MAP from Go to implement bidirectional data exchange between the kernel state and the user state of an eBPF program.

1. Why BPF MAP?

Never forget that BPF bytecode is code that runs in the OS kernel state, which means it is “isolated” from the user state. We know that the only way for user-state code to access kernel-state data is through a system call into the kernel. Therefore, the variable instances created in a BPF kernel-state program can only be accessed by kernel-state code.

So how do we return the useful data collected by BPF code in the kernel state to the user state for monitoring, computation, decision making, presentation, and storage? And how does user-state code pass data to the kernel state at runtime to change the BPF code’s runtime policy?

For this purpose, the Linux kernel BPF developers introduced the BPF MAP mechanism. BPF MAP provides a channel for bidirectional data exchange between the kernel state and the user state of a BPF program. At the same time, since a BPF map is stored in kernel-allocated memory, it can be shared by multiple BPF programs running in the kernel state and can also serve as a mechanism for multiple BPF programs to exchange and share data.

2. BPF MAP is not a narrowly defined map data structure

What exactly is a BPF MAP? It is not the hash table data structure we narrowly understand a “map” to be, but a generic data structure that can store different types of data. In the words of Andrii Nakryiko, a well-known kernel BPF developer, a MAP is a concept representing an abstract data container in BPF.

So far, the kernel supports 20+ MAP types; the following are the currently supported MAP types as listed in bpf.h in libbpf.

// libbpf/include/uapi/linux/bpf.h
enum bpf_map_type {
    BPF_MAP_TYPE_UNSPEC,
    BPF_MAP_TYPE_HASH,
    BPF_MAP_TYPE_ARRAY,
    BPF_MAP_TYPE_PROG_ARRAY,
    BPF_MAP_TYPE_PERF_EVENT_ARRAY,
    BPF_MAP_TYPE_PERCPU_HASH,
    BPF_MAP_TYPE_PERCPU_ARRAY,
    BPF_MAP_TYPE_STACK_TRACE,
    BPF_MAP_TYPE_CGROUP_ARRAY,
    BPF_MAP_TYPE_LRU_HASH,
    BPF_MAP_TYPE_LRU_PERCPU_HASH,
    BPF_MAP_TYPE_LPM_TRIE,
    BPF_MAP_TYPE_ARRAY_OF_MAPS,
    BPF_MAP_TYPE_HASH_OF_MAPS,
    BPF_MAP_TYPE_DEVMAP,
    BPF_MAP_TYPE_SOCKMAP,
    BPF_MAP_TYPE_CPUMAP,
    BPF_MAP_TYPE_XSKMAP,
    BPF_MAP_TYPE_SOCKHASH,
    BPF_MAP_TYPE_CGROUP_STORAGE,
    BPF_MAP_TYPE_REUSEPORT_SOCKARRAY,
    BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE,
    BPF_MAP_TYPE_QUEUE,
    BPF_MAP_TYPE_STACK,
    BPF_MAP_TYPE_SK_STORAGE,
    BPF_MAP_TYPE_DEVMAP_HASH,
    BPF_MAP_TYPE_STRUCT_OPS,
    BPF_MAP_TYPE_RINGBUF,
    BPF_MAP_TYPE_INODE_STORAGE,
    BPF_MAP_TYPE_TASK_STORAGE,
    BPF_MAP_TYPE_BLOOM_FILTER,
};

There are many types of data structures here, but they are not the focus of this article, so we won’t introduce them one by one. BPF_MAP_TYPE_HASH was the first MAP type supported by BPF. It can be understood as the hash table we encounter every day, indexing data in the form of key-value pairs. We will use this type of MAP in the subsequent examples.
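Before diving into real BPF code, the data-exchange pattern built on such a hash map can be previewed in plain Go: one goroutine plays the role of the kernel-state program incrementing a counter under a well-known key, while the main goroutine plays the user-state program that reads it. This is only an in-memory analogy of a BPF_MAP_TYPE_HASH map (the key name `execve_counter` anticipates the example below), not actual BPF code:

```go
package main

import (
	"fmt"
	"sync"
)

// counterMap is an in-memory stand-in for a BPF_MAP_TYPE_HASH map:
// string keys mapping to u64 counters, shared between two "sides".
type counterMap struct {
	mu sync.Mutex
	m  map[string]uint64
}

// inc models the kernel-state side: look up the key and increment.
func (c *counterMap) inc(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.m[key]++
}

// lookup models the user-state side: read the current counter value.
func (c *counterMap) lookup(key string) uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.m[key]
}

func main() {
	cm := &counterMap{m: map[string]uint64{"execve_counter": 0}}

	var wg sync.WaitGroup
	wg.Add(1)
	go func() { // plays the role of the BPF program in the kernel
		defer wg.Done()
		for i := 0; i < 10; i++ {
			cm.inc("execve_counter")
		}
	}()
	wg.Wait()

	// plays the role of the user-state program reading the shared map
	fmt.Println("execve_counter is", cm.lookup("execve_counter"))
	// → execve_counter is 10
}
```

The real mechanism differs in one key way: the shared memory lives in the kernel, so the user state cannot touch it directly and must go through a system call, which is what the next section explains.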

So how can BPF MAP share data between the kernel state and the user state? What is the principle?

We can find out from the description of the system call bpf. Here is the function prototype of the bpf system call.

// https://man7.org/linux/man-pages/man2/bpf.2.html

#include <linux/bpf.h>

int bpf(int cmd, union bpf_attr *attr, unsigned int size);

Looking at its prototype, bpf seems relatively simple. But bpf is actually a “rich call”, i.e. it can do more than one thing: it does many things around BPF depending on the value passed in cmd. Its primary function is loading a BPF program (cmd=BPF_PROG_LOAD), followed by a series of operations around MAPs, including creating a MAP (cmd=BPF_MAP_CREATE), looking up a MAP element (cmd=BPF_MAP_LOOKUP_ELEM), and updating a MAP element’s value (cmd=BPF_MAP_UPDATE_ELEM), etc.
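The update command also takes a flags argument that controls create-vs-update behavior; per the bpf(2) man page, BPF_ANY creates or updates, BPF_NOEXIST only creates (failing with EEXIST if the key is present), and BPF_EXIST only updates (failing with ENOENT if it is not). We will meet BPF_ANY again in the examples below. These semantics can be sketched over a plain Go map; this is an in-memory model, not a real bpf(2) call:

```go
package main

import (
	"errors"
	"fmt"
)

// Update flags for cmd=BPF_MAP_UPDATE_ELEM, as documented in bpf(2).
const (
	BPF_ANY     = 0 // create a new element or update an existing one
	BPF_NOEXIST = 1 // create only; fail if the key already exists
	BPF_EXIST   = 2 // update only; fail if the key does not exist
)

var (
	errExist   = errors.New("EEXIST: key already present")
	errNoEntry = errors.New("ENOENT: key not found")
)

// updateElem mimics the flag semantics of BPF_MAP_UPDATE_ELEM
// on a plain Go map (no bpf system call is made here).
func updateElem(m map[string]uint64, key string, val uint64, flags int) error {
	_, ok := m[key]
	switch flags {
	case BPF_NOEXIST:
		if ok {
			return errExist
		}
	case BPF_EXIST:
		if !ok {
			return errNoEntry
		}
	}
	m[key] = val
	return nil
}

func main() {
	m := map[string]uint64{}
	fmt.Println(updateElem(m, "execve_counter", 0, BPF_NOEXIST)) // <nil>
	fmt.Println(updateElem(m, "execve_counter", 1, BPF_NOEXIST)) // EEXIST: key already present
	fmt.Println(updateElem(m, "other", 5, BPF_EXIST))            // ENOENT: key not found
	fmt.Println(updateElem(m, "execve_counter", 2, BPF_ANY))     // <nil>
}
```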

When cmd=BPF_MAP_CREATE, i.e. when bpf performs the operation of creating a MAP, the call returns a file descriptor fd through which the newly created MAP can subsequently be manipulated. The map is accessed via an fd, which is very Unix!

Of course, BPF user-state developers generally do not need to touch such a low-level system call: libraries like libbpf wrap a series of map operation functions that do not expose the map fd to the user, simplifying usage and improving the experience.

Let’s take a look at how to implement map-based data exchange between the BPF user state and the kernel state in C.

3. Example of using map based on libbpf using C

This example is adapted from the helloworld example. The original helloworld example outputs a kernel log (available in /sys/kernel/debug/tracing/trace_pipe) when the execve system call is called, and the user-state program does not exchange any data with the kernel-state program.

In this new example (execve_counter), we still track the system call to execve, except that we count the calls to execve and store the counts in the BPF MAP. And the user state part of the program reads the count in this MAP and outputs the count value at regular intervals.

Let’s take a look at the source code of the kernel state part of BPF.

// https://github.com/bigwhite/experiments/tree/master/ebpf-examples/execve-counter/execve_counter.bpf.c

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

typedef __u64 u64;
typedef char stringkey[64];

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 128);
    //__type(key, stringkey);
    stringkey* key;
    __type(value, u64);
} execve_counter SEC(".maps");

SEC("tracepoint/syscalls/sys_enter_execve")
int bpf_prog(void *ctx) {
  stringkey key = "execve_counter";
  u64 *v = NULL;
  v = bpf_map_lookup_elem(&execve_counter, &key);
  if (v != NULL) {
    *v += 1;
  }
  return 0;
}

char LICENSE[] SEC("license") = "Dual BSD/GPL";

Unlike the helloworld example, in the new example we define a map structure execve_counter, marked as a BPF MAP variable by the SEC(".maps") macro.

This map structure has four fields:

  • type: the BPF MAP type used (see the bpf_map_type enumeration above); here we use BPF_MAP_TYPE_HASH, i.e. a hash table structure.
  • max_entries: the maximum number of key-value pairs in the map.
  • key: a pointer declaring the key type. Here we define a custom type stringkey (char[64]) as the type of each key element.
  • value: the value type; here each value element is of type u64, a 64-bit unsigned integer.

The implementation of the kernel-state function bpf_prog is also relatively simple: look up the key “execve_counter” in the map above, and if it is found, add 1 to the value in the memory the returned pointer points to.

Let’s take a look at the source code of the user state part of the execve_counter example.

// https://github.com/bigwhite/experiments/tree/master/ebpf-examples/execve_counter/execve_counter.c

#include <stdio.h>
#include <unistd.h>
#include <sys/resource.h>
#include <bpf/libbpf.h>
#include <linux/bpf.h>
#include "execve_counter.skel.h"

typedef __u64 u64;
typedef char stringkey[64];

static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args)
{
    return vfprintf(stderr, format, args);
}

int main(int argc, char **argv)
{
    struct execve_counter_bpf *skel;
    int err;

    libbpf_set_strict_mode(LIBBPF_STRICT_ALL);
    /* Set up libbpf errors and debug info callback */
    libbpf_set_print(libbpf_print_fn);

    /* Open BPF application */
    skel = execve_counter_bpf__open();
    if (!skel) {
        fprintf(stderr, "Failed to open BPF skeleton\n");
        return 1;
    }

    /* Load & verify BPF programs */
    err = execve_counter_bpf__load(skel);
    if (err) {
        fprintf(stderr, "Failed to load and verify BPF skeleton\n");
        goto cleanup;
    }

    /* init the counter */
    stringkey key = "execve_counter";
    u64 v = 0;
    err = bpf_map__update_elem(skel->maps.execve_counter, &key, sizeof(key), &v, sizeof(v), BPF_ANY);
    if (err != 0) {
        fprintf(stderr, "Failed to init the counter, %d\n", err);
        goto cleanup;
    }

    /* Attach tracepoint handler */
    err = execve_counter_bpf__attach(skel);
    if (err) {
        fprintf(stderr, "Failed to attach BPF skeleton\n");
        goto cleanup;
    }

    for (;;) {
            // read counter value from map
            err = bpf_map__lookup_elem(skel->maps.execve_counter, &key, sizeof(key), &v, sizeof(v), BPF_ANY);
            if (err != 0) {
               fprintf(stderr, "Lookup key from map error: %d\n", err);
               goto cleanup;
            } else {
               printf("execve_counter is %llu\n", v);
            }

            sleep(5);
    }

cleanup:
    execve_counter_bpf__destroy(skel);
    return -err;
}

The map is created in execve_counter_bpf__load; if you trace the code (see the libbpf source), you will find that the bpf system call is eventually invoked to create the map.

The difference from the helloworld example is that, before attaching the handler, we initialize the key in the BPF map to 0 using libbpf’s bpf_map__update_elem wrapper (without this step, the first run of the BPF program would report that the key cannot be found).

Then after attaching the handler, we look up the value of key=“execve_counter” every 5s in a loop via bpf_map__lookup_elem and output it to the console.

Next we run make to compile the ebpf program, then execute it and observe the output.

$sudo ./execve_counter
libbpf: loading object 'execve_counter_bpf' from buffer
libbpf: elf: section(3) tracepoint/syscalls/sys_enter_execve, size 192, link 0, flags 6, type=1
libbpf: sec 'tracepoint/syscalls/sys_enter_execve': found program 'bpf_prog' at insn offset 0 (0 bytes), code size 24 insns (192 bytes)
libbpf: elf: section(4) .reltracepoint/syscalls/sys_enter_execve, size 16, link 22, flags 0, type=9
libbpf: elf: section(5) .rodata, size 64, link 0, flags 2, type=1
libbpf: elf: section(6) .maps, size 32, link 0, flags 3, type=1
libbpf: elf: section(7) license, size 13, link 0, flags 3, type=1
libbpf: license of execve_counter_bpf is Dual BSD/GPL
libbpf: elf: section(13) .BTF, size 898, link 0, flags 0, type=1
libbpf: elf: section(15) .BTF.ext, size 176, link 0, flags 0, type=1
libbpf: elf: section(22) .symtab, size 744, link 1, flags 0, type=2
libbpf: looking for externs among 31 symbols...
libbpf: collected 0 externs total
libbpf: map 'execve_counter': at sec_idx 6, offset 0.
libbpf: map 'execve_counter': found type = 1.
libbpf: map 'execve_counter': found key [9], sz = 64.
libbpf: map 'execve_counter': found value [13], sz = 8.
libbpf: map 'execve_counter': found max_entries = 128.
libbpf: map 'execve_c.rodata' (global data): at sec_idx 5, offset 0, flags 480.
libbpf: map 1 is "execve_c.rodata"
libbpf: sec '.reltracepoint/syscalls/sys_enter_execve': collecting relocation for section(3) 'tracepoint/syscalls/sys_enter_execve'
libbpf: sec '.reltracepoint/syscalls/sys_enter_execve': relo #0: insn #15 against 'execve_counter'
libbpf: prog 'bpf_prog': found map 0 (execve_counter, sec 6, off 0) for insn #15
libbpf: map 'execve_counter': created successfully, fd=4
libbpf: map 'execve_c.rodata': created successfully, fd=5
execve_counter is 0
execve_counter is 0
execve_counter is 9
execve_counter is 23
... ...

Note: If you don’t know how to compile the execve_counter example, please first refer to “Developing a Hello World-level eBPF program from scratch using C” to understand how it is built.

The bpftool utility can inspect maps; with it we can view the map created by the example.

$sudo bpftool map
114: hash  name execve_counter  flags 0x0
    key 64B  value 8B  max_entries 128  memlock 20480B
    btf_id 120
116: array  name execve_c.rodata  flags 0x80
    key 4B  value 64B  max_entries 1  memlock 4096B
    frozen

We can also dump the entire map.

$sudo bpftool map dump id 114
[{
        "key": "execve_counter",
        "value": 23
    }
]

We see that there is only one key-value pair (key=“execve_counter”) in the entire map, and its value is the same as the output of the user-state part of the example program.

Well, with the C example as a base, let’s see how to implement this example based on Go.

4. Example of using Go to implement execve-counter based on cilium/ebpf

Developing the user-state part of a BPF program in Go is much easier, and the packages provided by cilium/ebpf are very convenient. If you don’t know how to develop the user-state part of eBPF programs in Go, please see the article “Developing eBPF programs using Go language”.

The essential raw material for the Go example is execve_counter.bpf.c. The only difference between this file and the execve_counter.bpf.c in the C example above is that the included header files have been replaced with common.h:

$diff execve_counter.bpf.c ../execve-counter/execve_counter.bpf.c
1,2c1,2
<
< #include "common.h"
---
> #include <linux/bpf.h>
> #include <bpf/bpf_helpers.h>

Based on execve_counter.bpf.c, the bpf2go tool generates the Go source code needed by the user-state part, e.g. the BPF map instance contained in bpfObjects:

// bpfMaps contains all maps after they have been loaded into the kernel.
//
// It can be passed to loadBpfObjects or ebpf.CollectionSpec.LoadAndAssign.
type bpfMaps struct {
    ExecveCounter *ebpf.Map `ebpf:"execve_counter"`
}

Finally, we can use these generated BPF-object-related Go types and functions directly in the main function of the main package; here is the main.go part of the source code.

// https://github.com/bigwhite/experiments/tree/master/ebpf-examples/execve-counter-go/main.go

// $BPF_CLANG, $BPF_CFLAGS and $BPF_HEADERS are set by the Makefile.
//go:generate bpf2go -cc $BPF_CLANG -cflags $BPF_CFLAGS -target bpfel,bpfeb bpf execve_counter.bpf.c -- -I $BPF_HEADERS
func main() {
    stopper := make(chan os.Signal, 1)
    signal.Notify(stopper, os.Interrupt, syscall.SIGTERM)

    // Allow the current process to lock memory for eBPF resources.
    if err := rlimit.RemoveMemlock(); err != nil {
        log.Fatal(err)
    }

    // Load pre-compiled programs and maps into the kernel.
    objs := bpfObjects{}
    if err := loadBpfObjects(&objs, nil); err != nil {
        log.Fatalf("loading objects: %s", err)
    }
    defer objs.Close()

    // init the map element
    var key [64]byte
    copy(key[:], []byte("execve_counter"))
    var val int64 = 0
    if err := objs.bpfMaps.ExecveCounter.Put(key, val); err != nil {
        log.Fatalf("init map key error: %s", err)
    }

    // attach to xxx
    kp, err := link.Tracepoint("syscalls", "sys_enter_execve", objs.BpfProg, nil)
    if err != nil {
        log.Fatalf("opening tracepoint: %s", err)
    }
    defer kp.Close()

    ticker := time.NewTicker(5 * time.Second)
    defer ticker.Stop()

    for {
        select {
        case <-ticker.C:
            if err := objs.bpfMaps.ExecveCounter.Lookup(key, &val); err != nil {
                log.Fatalf("reading map error: %s", err)
            }
            log.Printf("execve_counter: %d\n", val)

        case <-stopper:
            // Wait for a signal and close the perf reader,
            // which will interrupt rd.Read() and make the program exit.
            log.Println("Received signal, exiting program..")
            return
        }
    }
}

In the main function, we access the map instance directly through objs.bpfMaps.ExecveCounter and manipulate the map through its Put and Lookup methods. Note that the Go key type must match the key type in execve_counter.bpf.c (char[64]) in memory layout; you cannot use the string type directly, otherwise the following error will be reported at runtime:

init map key error: can't marshal key: string doesn't marshal to 64 bytes
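A simple way to satisfy the char[64] layout is a small helper that copies a Go string into a fixed-size byte array; the resulting [64]byte has the same 64-byte, zero-padded layout as the kernel side’s stringkey, so cilium/ebpf can marshal it to exactly the map’s key size. This is a plain Go sketch (the helper name is our own, not part of any API):

```go
package main

import "fmt"

// stringKey mirrors the kernel-side `typedef char stringkey[64]`:
// a fixed 64-byte, zero-padded array.
func stringKey(s string) [64]byte {
	var key [64]byte
	copy(key[:], s) // truncates silently if s is longer than 64 bytes
	return key
}

func main() {
	key := stringKey("execve_counter")
	fmt.Println(len(key)) // → 64, regardless of the string length
	fmt.Printf("%s\n", key[:14])
}
```

With such a helper, the init code in main.go reduces to `key := stringKey("execve_counter")` followed by `objs.bpfMaps.ExecveCounter.Put(key, val)`.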

Compiling and executing execve-counter-go is no different from helloworld-go.

$make
$go run -exec sudo main.go bpf_bpfel.go

2022/07/17 16:59:52 execve_counter: 0
2022/07/17 16:59:57 execve_counter: 14
^C2022/07/17 16:59:59 Received signal, exiting program..

5. Summary

This article introduced the main method for exchanging data between the kernel-state and user-state parts of an eBPF program: the BPF MAP mechanism. A MAP here is not a hash table in the narrow sense, but an abstract data container currently supporting more than two dozen data structures, so you can pick the appropriate one according to your needs (consult the manual for the characteristics of each).

A MAP is also essentially created through the bpf system call; the BPF program only needs to declare the key, value, type and other composition information of the MAP. The user state can operate on the map through the fd returned by the bpf system call; libbpf, cilium/ebpf, etc. encapsulate the fd operations, which simplifies the API.

Map update operations in the kernel are not atomic, so when multiple BPF programs access a map concurrently, the operations need to be synchronized. BPF provides bpf_spin_lock for this: we can embed a bpf_spin_lock in the value type to synchronize changes to the value, as in the following example (adapted from the book Linux Observability with BPF).

struct concurrent_element {
    struct bpf_spin_lock semaphore;
    int count;
};

struct bpf_map_def SEC("maps") concurrent_map = {
    .type = BPF_MAP_TYPE_HASH,
    .key_size = sizeof(int),
    .value_size = sizeof(struct concurrent_element),
    .max_entries = 100,
};

int bpf_program(struct pt_regs *ctx) {
    int key = 0;
    struct concurrent_element init_value = {};
    struct concurrent_element *read_value;

    bpf_map_update_elem(&concurrent_map, &key, &init_value, BPF_NOEXIST);
    read_value = bpf_map_lookup_elem(&concurrent_map, &key);
    if (read_value != NULL) { // the verifier requires a NULL check
        bpf_spin_lock(&read_value->semaphore);
        read_value->count += 100;
        bpf_spin_unlock(&read_value->semaphore);
    }
    return 0;
}

The code involved in this article can be downloaded here.