What is SystemTap

When debugging ordinary programs, the logs printed by the application are usually enough. When they are not, strace, lsof, or perf will generally reveal the performance bottleneck. System programming is different: you can't print logs freely, and many call stacks live in kernel space, so ordinary debugging tools fall short.

This is where SystemTap comes in handy: it attaches probes to kernel functions, aggregates statistics on kernel-space function calls, and can even intervene in them. Its support for user-space debugging, however, is not very good.

Installation

Local environment: DELL R720, Ubuntu 14.04 3.19.0-25-generic x86_64

apt-get  install systemtap systemtap-client systemtap-common systemtap-runtime systemtap-server -y

On CentOS, yum install works just as well. In either case, you also need the kernel debug-symbol package that matches your kernel; running stap-prep takes care of what is missing. For example, mine is:

linux-image-3.19.0-25-generic-dbgsym_3.19.0-25.26~14.04.1_amd64.ddeb

If any other packages are missing, install them from the package manager or download them manually. It is best not to build SystemTap from source: kernel-related packages are finicky, and the versions must match your running kernel, which you can check with uname -r.

1. How to find out which process is writing a file on Linux

Suppose a file is modified from time to time. If each write is brief, lsof cannot help: even if you run it periodically, it may miss some writes. This is a job for SystemTap. The script is as follows.

#!/usr/bin/env stap

# Usage: stap inodewatch.stp <major> <minor> <inode>
probe vfs.write, vfs.read
{
  # Use f_path when the kernel's struct file has it, otherwise fall back to f_dentry.
  if (@defined($file->f_path->dentry)) {
    dev_nr = $file->f_path->dentry->d_inode->i_sb->s_dev
    inode_nr = $file->f_path->dentry->d_inode->i_ino
  } else {
    dev_nr = $file->f_dentry->d_inode->i_sb->s_dev
    inode_nr = $file->f_dentry->d_inode->i_ino
  }
  # Report only accesses to the target device ($1 = major, $2 = minor) and inode ($3).
  if (dev_nr == MKDEV($1,$2) && inode_nr == $3) {
    printf("%s(%d) %s 0x%x/%u\n", execname(), pid(), ppfunc(), dev_nr, inode_nr)
  }
}

# Stop the script after 10 ms.
probe timer.ms(10) {
  exit()
}

The syntax resembles awk. The probe keyword defines a probe and is followed by one or more probe points, which can be a specific function name and may use * wildcards; the block in curly braces is the handler that runs when the probe fires.

$file is an argument of the probed kernel functions vfs_read and vfs_write. dev_nr and inode_nr are the device number and inode number, read from the file structure. Because the probe point is a kernel function, all of that function's arguments are available.

execname() returns the name of the program that called vfs_write or vfs_read.

pid() returns the ID of the process that called vfs_write or vfs_read.

ppfunc() returns the name of the probed function. This built-in function may differ between SystemTap versions.

1.1 Open a terminal and run dd

Open a terminal and run dd to write data continuously, then check the file's inode number.

dd if=/dev/zero of=test.dat
stat -c "%i" /disk1/test.dat
ls -al /dev/sdb1

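Here /dev/sdb1 is the device mounted at /disk1. The ls output shows its major and minor device numbers (8 and 17 in this example), and stat prints the file's inode number.

1.2 Run the stap probe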
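The three values that the script expects (major device number, minor device number, inode number) can also be collected in one go. The following is a minimal Python sketch, not part of the original workflow; stap_args is a hypothetical helper name, and it assumes a filesystem such as ext4 where the device reported by stat is the one the kernel probe sees.

```python
import os

def stap_args(path):
    """Return (major, minor, inode) for the file at `path`.

    Hypothetical helper: these are the three positional arguments
    that inodewatch.stp reads as $1, $2 and $3.
    """
    st = os.stat(path)
    # st_dev identifies the device that holds the file; split it into major/minor.
    return os.major(st.st_dev), os.minor(st.st_dev), st.st_ino

# Demo on the current directory; replace "." with the file you want to watch.
major, minor, inode = stap_args(".")
print(f"stap -v inodewatch.stp {major} {minor} {inode}")
```

For the file in this article you would pass /disk1/test.dat instead of the current directory.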

stap -v inodewatch.stp 8 17 15
Pass 1: parsed user script and 95 library script(s) using 84976virt/30204res/5152shr/25852data kb, in 200usr/0sys/456real ms.
Pass 2: analyzed script: 3 probe(s), 7 function(s), 5 embed(s), 0 global(s) using 610884virt/195324res/12432shr/180716data kb, in 1810usr/290sys/3605real ms.
Pass 3: translated to C into "/tmp/stapJEOYcQ/stap_20c430109956cd1ffc28c7ceaf0aa2f1_6899_src.c" using 599240virt/188844res/8908shr/180712data kb, in 0usr/0sys/73real ms.
Pass 4: compiled C into "stap_20c430109956cd1ffc28c7ceaf0aa2f1_6899.ko" in 1840usr/320sys/4180real ms.
Pass 5: starting run.
dd(25763) vfs_write 0x800011/15
dd(25763) vfs_write 0x800011/15
dd(25763) vfs_write 0x800011/15
dd(25763) vfs_write 0x800011/15
dd(25763) vfs_write 0x800011/15
Pass 5: run completed in 0usr/40sys/724real ms.

stap runs the script in five passes: it parses the script, analyzes it, translates it to C, compiles the C into a kernel module (.ko file), and finally loads and runs the module. The output shows that dd is writing the file by calling vfs_write.

2. Using SystemTap to inject delays to simulate IO device jitter

This is an interesting example from Master Ba: using SystemTap to simulate disk IO jitter, which is worth trying when stress-testing a storage system. The principle is simple: delay for a short time on each vfs_write and vfs_read; the delay could also be randomized. The code is as follows.
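The device number 0x800011 in the output is the kernel's packed form of major 8 and minor 17, which is what MKDEV($1,$2) computes in the script. A small Python sketch of that encoding, assuming the Linux in-kernel layout where the minor number occupies the low 20 bits; mkdev and split_dev are illustrative names, not SystemTap functions.

```python
MINORBITS = 20  # assumption: the in-kernel dev_t keeps the minor number in the low 20 bits

def mkdev(major, minor):
    """Pack major/minor the way the kernel's MKDEV macro does."""
    return (major << MINORBITS) | minor

def split_dev(dev):
    """Inverse of mkdev: recover (major, minor) from a packed device number."""
    return dev >> MINORBITS, dev & ((1 << MINORBITS) - 1)

print(hex(mkdev(8, 17)))     # 0x800011, the value printed by the probe
print(split_dev(0x800011))   # (8, 17)
```

Note that this is the in-kernel encoding. The st_dev value that user-space tools such as stat report uses a different bit layout, so the two numbers should not be compared directly.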

cat inject_ka.stp
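global inject, ka_cnt

# Reading /proc/systemtap/<module>/cnt returns the number of injected delays.
probe procfs("cnt").read {
  $value = sprintf("%d\n", ka_cnt);
}
# Writing "on" or "off" to /proc/systemtap/<module>/inject toggles injection.
probe procfs("inject").write {
  inject = $value;
  printf("inject count %d, ka %s", ka_cnt, inject);
}

probe vfs.read.return,
      vfs.write.return {
  # Use f_path when the kernel's struct file has it, otherwise fall back to f_dentry.
  if (@defined($file->f_path->dentry)) {
    dev_nr = $file->f_path->dentry->d_inode->i_sb->s_dev
    inode_nr = $file->f_path->dentry->d_inode->i_ino
  } else {
    dev_nr = $file->f_dentry->d_inode->i_sb->s_dev
    inode_nr = $file->f_dentry->d_inode->i_ino
  }

  # Delay only non-empty IO on the target device ($1 = major, $2 = minor)
  # while injection is switched on.
  if ($return &&
      dev_nr == MKDEV($1,$2) &&
      inject == "on\n")
  {
    # printf("dev %x func: %s\n", dev_nr, ppfunc())
    ka_cnt++;
    udelay($3);  # $3 = delay in microseconds
  }
}

probe begin {
  println("ik module begin:)");
}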

The code is a bit long, so start with the main probe. vfs.read.return and vfs.write.return fire just before the function returns; the handler checks whether dev_nr is the target device and whether injection is switched on, and if so calls udelay for a short time. The other two probes, procfs("cnt") and procfs("inject"), fire when the corresponding files under /proc/systemtap are read or written: reading cnt returns the number of injected delays, and writing inject sets the global variable inject, which decides whether IO injection is on.

2.1 Executing the code

Running this script may fail with a vfs_lookup_path error, which is quite annoying. I worked around it by updating procfs.c by one version and commenting out the vfs_lookup_path part.

stap -DMAXSKIPPED=9999 -m ik -g inject_ka.stp 8 17 400
ik module begin:)

8 and 17 are the disk's major and minor device numbers, and 400 is the udelay time in microseconds. At this point the script blocks and IO injection has not started yet. Open another terminal and turn injection on for 30 seconds.

echo on| tee /proc/systemtap/ik/inject  && sleep 30 && echo off| tee /proc/systemtap/ik/inject

At this point, you can see that stap has output.
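The same toggle can be driven from a small program, which also makes it easy to read the cnt file that the script exposes. This is a minimal Python sketch under a few assumptions: the module was loaded with -m ik so the files live under /proc/systemtap/ik, the caller runs as root, and the function names are hypothetical. The trailing newline matters because the script compares inject against "on\n".

```python
import time
from pathlib import Path

PROC_DIR = "/proc/systemtap/ik"  # assumption: module name "ik", as in the stap command above

def set_injection(enabled, proc_dir=PROC_DIR):
    """Write "on\\n" or "off\\n" to the inject file (the script matches "on\\n" exactly)."""
    Path(proc_dir, "inject").write_text("on\n" if enabled else "off\n")

def read_count(proc_dir=PROC_DIR):
    """Read the number of delays injected so far from the cnt file."""
    return int(Path(proc_dir, "cnt").read_text())

def inject_for(seconds, proc_dir=PROC_DIR):
    """Enable injection for `seconds`, then disable it; return how many delays were added."""
    before = read_count(proc_dir)
    set_injection(True, proc_dir)
    try:
        time.sleep(seconds)
    finally:
        # Always switch injection off again, even if interrupted.
        set_injection(False, proc_dir)
    return read_count(proc_dir) - before
```

Calling inject_for(30) while the module is loaded would mirror the shell one-liner above and report how many reads and writes were delayed.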

2.2 Testing Disk Performance

Simply use dd to test the effect of IO latency on sequential writes.

Before injection.

dd if=/dev/zero of=test.dat  bs=8k count=1000000
1000000+0 records in
1000000+0 records out
8192000000 bytes (8.2 GB) copied, 34.8372 s, 235 MB/s

After injection.

dd if=/dev/zero of=test.dat  bs=8k count=1000000
1000000+0 records in
1000000+0 records out
8192000000 bytes (8.2 GB) copied, 79.5475 s, 103 MB/s

dd throughput drops significantly, from 235 MB/s to 103 MB/s. By adjusting the udelay time you can simulate performance at different latencies. Delays that are random, or that follow a normal distribution, would probably be more realistic.
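To get a feel for what normally distributed delays would look like, the values can be sampled offline first. This is only a sketch of the idea, not something the script above does: sample_delays_us is a hypothetical helper, the 400 microsecond mean is taken from the test run, and the 100 microsecond standard deviation is an arbitrary assumption. How the values would be fed into the probe is left open.

```python
import random

def sample_delays_us(n, mean_us=400.0, stddev_us=100.0, seed=None):
    """Draw n delays (in microseconds) from a normal distribution.

    Negative samples are clamped to zero, since a delay cannot be negative.
    """
    rng = random.Random(seed)
    return [max(0, round(rng.gauss(mean_us, stddev_us))) for _ in range(n)]

delays = sample_delays_us(10000, seed=1)
print("min:", min(delays), "mean:", sum(delays) / len(delays), "max:", max(delays))
```

A fixed seed makes a run reproducible, which helps when comparing two stress tests against each other.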

Summary

The official website has many SystemTap examples and introductions; you can also trace the network stack, which is very powerful. You do need some kernel knowledge, or at least need to know where to place the probes. OpenResty uses SystemTap heavily for debugging and is a good reference to learn from.

Installation is the biggest hurdle. Pay close attention to versions, because a newer release will not necessarily work: the Ubuntu apt repository ships version 2.3, and when I tried to build a newer version from source I ran into many errors.