I have recently been learning about BPF, which is one of the more popular dynamic tracing techniques available today. Simply put, it allows us to insert a piece of code that executes along with the program without interrupting the program that is currently running. For example, if you want to know the value of each return of a function, you can write a BPF program that prints out the value of each return; then hook the function to the function call, so that the program will execute our BPF program every time the function is called.

This technique of observing the runtime of a program without terminating it is very useful.

BPF

credit: https://www.brendangregg.com/ebpf.html

It works roughly as follows.

  1. find the address of the function call to be observed (so with BPF we usually need binary with debug symbol), replace the instruction at this address with int3, i.e. a break, to save the instruction that should have been executed.
  2. when the program is executed here, instead of calling the function directly, it finds an uprobe handler here and executes the handler we defined.
  3. the original program continues to execute.
  4. when we stop the probe, we restore the original saved instructions.

Example: This line prints out all the commands executed in the system bpftrace -e 'tracepoint:syscalls:sys_enter_execve { join(args->argv); }' . The principle is that each time execve is executed, it will execute the join (print) that we defined.

bpftrace exmaple

bpftrace is a command-line program with a syntax similar to awk that does the following: compiles the program we wrote (in this case, the program refers to the syntax defined by bpftrace, not the BPF program) into a BPF program and attaches it to the appropriate event.

The above description shows that the necessary conditions needed to use this method to trace the program are

  1. the Kernel should support BPF; (4.19+);
  2. binary is available for tracing.

In an interpreted language like Python, binary is actually an interpreter, and inside the interpreter is the running Python code. If we follow the above method, we are actually tracing the code of the interpreter and not the code of our Python application.

Python Interpreter

Tracing Python programs using USDT

One solution is to use USDT tracing.

USDT is User-level statically defined tracing, and the Python interpreter defines a number of traces (which can be called probes, or markers) for us in advance, which are not activated by default, so they have no impact on performance. When you want to trace, you can activate a few of them and print out the arguments they provide.

There are several prerequisites to be able to use USDT:

  1. the Kernel should support BPF; (4.19+); (in fact, you can use Dtrace or systemtap)
  2. Python should be compiled with trace support turned on (compile with -with-dtrace option), on my servers, Python 3.10 on Ubuntu is turned on, Python 3.10 on Fedora does not seem to be turned on.

Tracing Python programs with USDT

First you need to install bpftrace. Installation instructions can be found here.

USDT markers in binary can be listed with this command: bpftrace -l 'usdt:/usr/bin/python3:*'

bpftrace

As you can see, there are not many markers supported, only 8 in total.

Let’s try one of them out: this command prints out the time of each Python function call, the address of the source code, the function name, and the line number.

1
bpftrace -e 'usdt:/usr/bin/python3.10:function__entry {time("%H:%M:%S"); printf(" line filename=%s, funcname=%s, lineno=%d\n", str(arg0), str(arg1), arg2);}' -p 15552

where -p 15552 is the pid of the running Python program, -e is followed by the code of bpftrace, much like awk, the first is probe, {} inside is the code, mainly two lines, the first line prints out the time, the second line prints out function__entry the probe provides three parameters. (It should be noted that the string argument needs to be printed out using bpftrace’s built-in str() function, otherwise it prints out the char * address; also note that although the Python documentation says the three arguments are $arg1 $arg2 $arg3, the actual printout should use arg0 , arg1 , arg2 …)

The effect is as follows.

bpftrace

Several other markers serve the following purposes.

  • function__entry: The entry point of the function
  • function__return: when the function exits
  • line: triggered for every line of Python code executed
  • gc__start and gc__done: triggered at the start and end of gc
  • import__find__load__start and import__find__load__done: triggered at the beginning and end of import
  • audit: triggered when sys.audit is called

As you can see, not many are currently supported, and as you can see from these markers, most of them are there to let you know what the Python interpreter is doing, not to debug, which can be used to solve performance problems if you encounter them, but not much help if you encounter logical errors. For example, you can’t see the value of a variable at a given moment via usdt.

How USDT works

USDT works similarly to the uprobe mentioned above, in that when it is activated it changes the original instruction and inserts int3 to execute our handler.

Seeing is believing.

We can use some tools to see how it works.

Taking function__entry as an example, we first find its marker in Binary, inside .note.stapsdt.

.note.stapsdt

You can see that the location of the function__entry we just used is 0x000000000005d839.

Then we find the command at this location.

gpb -p 17577 to attach to the pid that is running. Then we run info proc mappings to dump the address and we can see that Offset 0x0 is at 0x55bfcc88d000.

Offset 0x0 is at 0x55bfcc88d000

Then the location of function__entry should be 0x000000000005d839 + 0x55bfcc88d000.

The following command with disas will dump the instructions for this address.

disas

1
2
>>> hex(0x000000000005d839 + 0x55bfcc88d000)
'0x55bfcc8ea839'

Find the instruction for the address 0x55bfcc8ea839.

Find the instruction for the address 0x55bfcc8ea839

found to be a nop, i.e. doing nothing. So that’s why it says above that there is no loss of performance at all when usdt is not on. It’s just one more placeholder instruction that does nothing.

Use the previous command to attach the usdt probe.

Use the previous command to attach the usdt probe

Then revisit the instruction at this location 0x55bfcc8ea839.

Then revisit the instruction at this location 0x55bfcc8ea839

You can see that the instruction at this location has become int3 . When the program (interpreter) executes here, the kernel executes our BPF program with the variables provided by usdt. When the BPF probe exits, int3 will be restored to nop .

More possibilities?

Most of the Python related BPF content on the web is “how to use BPF with Python (BCC)”, not “how to profile Python code with BPF”. Maybe it’s not used much in interpreted languages.

USDT has the limitation that it can only use a few probes defined by the Python interpreter, and what it can do is basically see what the interpreter is doing, but not the parameters of some specific variables, the state of the interpreter, and so on. If we want to do more fine-grained dynamic tracing, we have two ideas.

  1. use uprobe to track the Python interpreter. But most of them are PyObject pointers, which require an understanding of how the interpreter works and are a bit complicated.
  2. Customizing more USDT, with python-stapsdt, a library that allows you to insert more USDT markers into your Python program, doesn’t make much sense to me: we’re trying to solve problems, and often we don’t know where the problem is, and you can’t stop the program when you’re trying to solve it, and then it’s too late to remember to insert markers. In other words, since observability is very important when you know that a place can go wrong, why not just print the log?

Ref

  • https://www.kawabangga.com/posts/4894