Principles of container technology (2): using Namespace to achieve process isolation

Namespace is a feature provided by the Linux kernel that wraps some system resources into an abstract space and makes the processes in that space think that these resources are the only resources available in the system. It isolates processes and resources from the host system and other containers.

There are several types of namespace, distinguished by the system resources they isolate, such as the cgroup namespace, the mount namespace, and so on. Here we take the pid namespace as an example and use runC as the container runtime to demonstrate how namespaces behave when we operate on a container.

As we described in the previous article, most container systems use runC as the underlying runtime implementation. If you are using docker on a Linux distribution, the runc command is already available and does not need to be installed separately.

Preparation

filesystem bundle

runC can only run containers from a filesystem bundle (as the name implies, a directory with a specific structure), but we can use docker to prepare a usable bundle.

# Create the top-level directory of the bundle
$ mkdir /mycontainer
$ cd /mycontainer

# Create the rootfs directory that will hold the root filesystem
$ mkdir rootfs

# Use Docker to export the root filesystem of a busybox container
$ docker export $(docker create busybox) | tar -C rootfs -xvf -

# Generate a config.json to serve as the spec of the whole bundle
$ runc spec

At this point, the entire bundle directory structure is as follows.

$ tree -L 2 /mycontainer

/mycontainer
├── config.json
└── rootfs
    ├── bin
    ├── dev
    ├── etc
    ├── home
    ├── proc
    ├── root
    ├── sys
    ├── tmp
    ├── usr
    └── var
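
The config.json in the bundle follows the OCI runtime specification, in which the namespaces a container should get are listed under linux.namespaces. As a minimal sketch in Python, the helper below lists the namespace types a bundle asks for. The function name and the trimmed sample config are assumptions made for illustration; the sample is not the full file that runc spec writes.

```python
import json


def requested_namespaces(config):
    """Return the namespace types listed under linux.namespaces."""
    linux = config.get("linux", {})
    return [ns["type"] for ns in linux.get("namespaces", [])]


# Trimmed, hypothetical sample shaped like an OCI runtime config.
SAMPLE_CONFIG = json.loads("""
{
  "ociVersion": "1.0.2",
  "process": {"args": ["sh"]},
  "root": {"path": "rootfs"},
  "linux": {
    "namespaces": [
      {"type": "pid"},
      {"type": "network"},
      {"type": "ipc"},
      {"type": "uts"},
      {"type": "mount"}
    ]
  }
}
""")

if __name__ == "__main__":
    print(requested_namespaces(SAMPLE_CONFIG))
```

To inspect a real bundle, load /mycontainer/config.json with json.load and pass the result to the same function.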

System monitoring tools

To complete the demo, we need some third-party system monitoring tools as an aid.

A tool that monitors process startup, so that we can get the host PID of the processes running in the container. On Ubuntu, forkstat can report process events such as fork(), exec() and exit() in real time. Install it as follows.

$ apt install forkstat

A tool for viewing namespace information, such as cinf, a command line tool that can list all namespaces on the system or show detailed information about a single namespace. Install it as follows.

$ curl -s -L https://github.com/mhausenblas/cinf/releases/latest/download/cinf_linux_amd64.tar.gz \
    -o cinf.tar.gz && \
    tar xvzf cinf.tar.gz cinf && \
    mv cinf /usr/local/bin && \
    rm cinf*

Running containers with runc

First we need to run forkstat in a window.

$ forkstat -e exec

Then create a new terminal window, switch to the /mycontainer directory, and use runC to run the container.

$ runc run mybox

Once it starts, we are dropped directly into a shell inside the newly created container, where we run the ps command.

PID   USER     TIME  COMMAND
    1 root      0:00 sh
    7 root      0:00 ps

The forkstat window will have the following output.

Time     Event     PID Info   Duration Process
12:35:22 exec    33040                 runc run mybox
12:35:22 exec    33047                 runc init
12:35:22 exec    33049                 dumpe2fs -h /dev/sdb3
12:35:22 exec    33050                 dumpe2fs -h /dev/sdb3
12:35:22 exec    33047                 runc init
12:35:22 exec    33052                 sh
12:35:37 exec    33062                 ps

Matching the two outputs by their timing, we can tell that the sh and ps processes reported by ps inside the container and by forkstat on the host are actually the same processes. Because the processes in the container live in a separate pid namespace, they have their own PIDs inside the container. From their point of view they are the only processes on the system, so PID numbering starts at 1.
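
One way to check this mapping from the host is the NSpid field of /proc/<pid>/status (available since Linux 4.1), which lists the PID of a process in each pid namespace it belongs to, outermost first. The following is a minimal Python sketch; the function names are made up for illustration, and the sample text is a hypothetical excerpt built from the PIDs in this article, not captured output.

```python
def parse_nspid(status_text):
    """Return the PIDs on the NSpid line of a /proc/<pid>/status dump.

    The first entry is the PID as seen through the /proc mount being read,
    the last entry is the PID in the innermost pid namespace.
    """
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split()[1:]]
    return []  # no NSpid line, e.g. on kernels older than 4.1


def read_nspid(pid):
    """Read and parse the NSpid line of a live process (Linux only)."""
    with open(f"/proc/{pid}/status") as status_file:
        return parse_nspid(status_file.read())


# Hypothetical excerpt for the container's sh process, for illustration.
SAMPLE_STATUS = "Name:\tsh\nPid:\t33052\nNSpid:\t33052\t1\n"

if __name__ == "__main__":
    print(parse_nspid(SAMPLE_STATUS))
```

On the host, read_nspid(33052) would be the call to try against the running container; a process that is not in a nested pid namespace yields a single entry.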

Find the namespace the process belongs to

To find the pid namespace used by the container, we adjust the output format of the ps command on the host.

$ ps -p 33052 -o pid,pidns
PID      PIDNS  
33052 4026532395

PIDNS is the pid namespace: the command above shows that the sh process with PID 33052 belongs to pid namespace 4026532395. Since we already have the host PID of the process in the container, we can also list all of the namespaces it belongs to through the /proc file system of the host.

$ ls -l /proc/33052/ns
lrwxrwxrwx 1 root root 0 Jul 21 12:37 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx 1 root root 0 Jul 21 12:36 ipc -> 'ipc:[4026532394]'
lrwxrwxrwx 1 root root 0 Jul 21 12:36 mnt -> 'mnt:[4026532383]'
lrwxrwxrwx 1 root root 0 Jul 21 12:36 net -> 'net:[4026532397]'
lrwxrwxrwx 1 root root 0 Jul 21 12:36 pid -> 'pid:[4026532395]'
lrwxrwxrwx 1 root root 0 Jul 21 12:37 pid_for_children -> 'pid:[4026532395]'
lrwxrwxrwx 1 root root 0 Jul 21 12:37 time -> 'time:[4026531834]'
lrwxrwxrwx 1 root root 0 Jul 21 12:37 time_for_children -> 'time:[4026531834]'
lrwxrwxrwx 1 root root 0 Jul 21 12:36 user -> 'user:[4026531837]'
lrwxrwxrwx 1 root root 0 Jul 21 12:36 uts -> 'uts:[4026532393]'

The output shows the namespaces to which the process belongs.

Each entry is a symbolic link whose name indicates the type of namespace, e.g. cgroup for the cgroup namespace and pid for the pid namespace.
Each symbolic link points to the actual namespace object the process belongs to, identified by an inode number that is unique on the host.
If two processes have links of the same type pointing to the same inode, they belong to the same namespace of that type.
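
The comparison rule above can be sketched in a few lines of Python. This is only an illustration under the assumption of a Linux /proc; the function names are invented here, and reading the ns directory of another user’s process generally requires root.

```python
import os
import re

# Link targets look like "pid:[4026532395]": a type, then an inode number.
NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target):
    """Split a link target such as 'pid:[4026532395]' into (type, inode)."""
    match = NS_LINK.match(target)
    if match is None:
        raise ValueError(f"not a namespace link: {target!r}")
    return match.group(1), int(match.group(2))


def namespaces_of(pid):
    """Map each entry name in /proc/<pid>/ns to the inode it points to."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    for name in sorted(os.listdir(ns_dir)):
        _, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        result[name] = inode
    return result


def shared_namespaces(first, second):
    """Return the entry names that point to the same inode in both maps."""
    return sorted(
        name for name in first
        if name in second and first[name] == second[name]
    )
```

For two host PIDs a and b, shared_namespaces(namespaces_of(a), namespaces_of(b)) returns the entries for which both processes are in the same namespace.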

In fact, every process belongs to exactly one namespace of each type. At boot time, the Linux kernel creates an initial (default) namespace of each type, and processes stay in those namespaces unless they are placed in new ones.

We can also look up the namespaces of sh from inside the container, where the process has PID 1.

$ ls -l /proc/1/ns
lrwxrwxrwx    1 root     root             0 Jul 21 04:37 cgroup -> cgroup:[4026531835]
lrwxrwxrwx    1 root     root             0 Jul 21 04:37 ipc -> ipc:[4026532394]
lrwxrwxrwx    1 root     root             0 Jul 21 04:37 mnt -> mnt:[4026532383]
lrwxrwxrwx    1 root     root             0 Jul 21 04:37 net -> net:[4026532397]
lrwxrwxrwx    1 root     root             0 Jul 21 04:37 pid -> pid:[4026532395]
lrwxrwxrwx    1 root     root             0 Jul 21 04:37 pid_for_children -> pid:[4026532395]
lrwxrwxrwx    1 root     root             0 Jul 21 04:37 time -> time:[4026531834]
lrwxrwxrwx    1 root     root             0 Jul 21 04:37 time_for_children -> time:[4026531834]
lrwxrwxrwx    1 root     root             0 Jul 21 04:37 user -> user:[4026531837]
lrwxrwxrwx    1 root     root             0 Jul 21 04:37 uts -> uts:[4026532393]

Watching processes in namespace

We will now look at things from the namespace’s point of view and list all the processes in the pid namespace. The /proc file system only maps a process to its namespaces, not the other way around, so we use the cinf tool installed above.

$ cinf --namespace 4026532395

 PID    PPID   NAME  CMD  NTHREADS  CGROUPS                                                          STATE

 33052  33040  sh    sh   1         12:devices:/user.slice/mybox                                     S (sleeping)
                                    11:blkio:/user.slice/mybox 10:rdma:/
                                    9:memory:/user.slice/user-0.slice/session-590.scope/mybox
                                    8:net_cls,net_prio:/mybox 7:freezer:/mybox
                                    6:pids:/user.slice/user-0.slice/session-590.scope/mybox
                                    5:cpu,cpuacct:/user.slice/mybox 4:cpuset:/mybox
                                    3:perf_event:/mybox 2:hugetlb:/mybox
                                    1:name=systemd:/user.slice/user-0.slice/session-590.scope/mybox
                                    0::/user.slice/user-0.slice/session-590.scope

Currently there is only one process in this namespace, and it is also the init process of the container we just created. When a new container is created, new namespaces are created as well, and the container’s init process is added to them.
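
The same reverse lookup can be approximated by scanning /proc and comparing each process’s namespace link against the inode we are interested in. The Python sketch below shows the idea; it is not how cinf itself is implemented, the function name is hypothetical, and the proc_root parameter exists only so the scan can be pointed at a test directory.

```python
import os


def pids_in_namespace(inode, ns_type="pid", proc_root="/proc"):
    """Return the sorted PIDs whose <ns_type> link points at the given inode."""
    wanted = f"{ns_type}:[{inode}]"
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        link = os.path.join(proc_root, entry, "ns", ns_type)
        try:
            target = os.readlink(link)
        except OSError:
            continue  # process exited, or we lack permission to inspect it
        if target == wanted:
            found.append(int(entry))
    return sorted(found)
```

Run as root on the host, pids_in_namespace(4026532395) is the call that corresponds to the cinf query above. Without root, processes owned by other users are silently skipped.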

With the pid namespace, processes running in the container can only see other processes in the same pid namespace, pid:[4026532395]. Inside the container, sh appears to be the first process on the system, with PID 1, while on the host it is an ordinary process with PID 33052. Giving the same process different PIDs in different namespaces is exactly what the pid namespace does. In a way, a container is just a new set of namespaces.

Create a new process in a container

Open a new terminal window and run a new process in the already running container.

$ runc exec mybox /bin/top -b

From the forkstat window, we can see the PID of the newly created process.

Time     Event     PID Info   Duration Process
12:40:23 exec    33132                 runc exec mybox /bin/top -b
12:40:23 exec    33140                 runc init
12:40:23 exec    33140                 runc init
12:40:23 exec    33142                 /bin/top -b

There is also a more direct way to see, from the host, the processes running in the container: the ps subcommand provided by runC.

$ runc ps mybox
UID          PID    PPID  C STIME TTY          TIME CMD
root       33052   33040  0 12:35 pts/0    00:00:00 sh
root       33142   33132  0 12:40 pts/1    00:00:00 /bin/top -b

Next, we again use cinf to find out which namespaces the newly created process belongs to.

$ cinf --pid 33142

 NAMESPACE   TYPE

 4026532383  mnt
 4026532393  uts
 4026532394  ipc
 4026532395  pid
 4026532397  net
 4026531837  user

The result shows that no new namespace was created: the namespaces of process 33142 are exactly the same as those of sh, the init process of the mybox container. In other words, creating a new process in a container simply adds that process to the namespaces of the container’s init process.

Here is a list of all the processes in pid namespace 4026532395.

$ cinf --namespace 4026532395

 PID    PPID   NAME  CMD     NTHREADS  CGROUPS                                                          STATE

 33052  33040  sh    sh      1         12:devices:/user.slice/mybox                                     S (sleeping)
                                       11:blkio:/user.slice/mybox 10:rdma:/
                                       9:memory:/user.slice/user-0.slice/session-590.scope/mybox
                                       8:net_cls,net_prio:/mybox 7:freezer:/mybox
                                       6:pids:/user.slice/user-0.slice/session-590.scope/mybox
                                       5:cpu,cpuacct:/user.slice/mybox 4:cpuset:/mybox
                                       3:perf_event:/mybox 2:hugetlb:/mybox
                                       1:name=systemd:/user.slice/user-0.slice/session-590.scope/mybox
                                       0::/user.slice/user-0.slice/session-590.scope
 33142  33132  top   top -b  1         12:devices:/user.slice/mybox                                     S (sleeping)
                                       11:blkio:/user.slice/mybox 10:rdma:/
                                       9:memory:/user.slice/user-0.slice/session-590.scope/mybox
                                       8:net_cls,net_prio:/mybox 7:freezer:/mybox
                                       6:pids:/user.slice/user-0.slice/session-590.scope/mybox
                                       5:cpu,cpuacct:/user.slice/mybox 4:cpuset:/mybox
                                       3:perf_event:/mybox 2:hugetlb:/mybox
                                       1:name=systemd:/user.slice/user-0.slice/session-590.scope/mybox
                                       0::/user.slice/user-0.slice/session-590.scope

If we run ps -ef inside the container, we see the same processes, but with different PIDs because of the pid namespace.

PID   USER     TIME  COMMAND
    1 root      0:00 sh
   19 root      0:00 top -b
   20 root      0:00 ps -ef

Now we know that docker/runc exec actually runs a new process in the namespaces of an existing container.

Summary

When you run a container, new namespaces are created and the init process is added to them; when you run a new process in a container, the new process is added to the namespaces that were created along with the container.

In fact, creating new namespaces when creating a container is only the default behavior: we can also specify that a new container should use existing namespaces.
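
In the OCI runtime specification this is expressed in config.json: an entry under linux.namespaces may carry a path pointing at an existing namespace file, in which case the runtime places the container process in that namespace instead of creating a new one. The Python sketch below edits a config dictionary accordingly. The helper name and the tiny sample config are assumptions for illustration, and the path reuses the host PID from this article as an example.

```python
import copy


def join_namespace(config, ns_type, path):
    """Return a copy of config whose <ns_type> namespace joins the given path."""
    updated = copy.deepcopy(config)  # leave the caller's config untouched
    namespaces = updated.setdefault("linux", {}).setdefault("namespaces", [])
    for ns in namespaces:
        if ns.get("type") == ns_type:
            ns["path"] = path  # reuse the existing entry for this type
            break
    else:
        # no entry of this type yet, so add one that points at the path
        namespaces.append({"type": ns_type, "path": path})
    return updated


# Hypothetical, heavily trimmed config used only for this example.
SAMPLE_CONFIG = {"linux": {"namespaces": [{"type": "pid"}, {"type": "mount"}]}}

if __name__ == "__main__":
    joined = join_namespace(SAMPLE_CONFIG, "pid", "/proc/33052/ns/pid")
    print(joined["linux"]["namespaces"])
```

Returning a modified copy rather than editing in place keeps the original spec available, which is convenient when generating several bundles from one template.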
