If you start a container with runc and run the following commands inside it, the result may be surprising.

$ whoami
root
$ id -u root
0
$ hostname mybox
hostname: sethostname: Operation not permitted

Even if we use the root user with a UID of 0, we do not have the privilege to change the hostname.

The idea that the root user always holds the highest privileges is a thing of the past: in version 2.2 the Linux kernel introduced a new privilege-checking mechanism called capabilities.

Finer-grained permissions control than superuser

The traditional Linux privilege checking model is simple, as the kernel only distinguishes between two types of processes when checking privileges.

  • Privileged processes, whose effective user ID is 0; this user is usually referred to as the superuser or root.
  • Non-privileged processes, whose effective user ID is not 0.

Privileged processes bypass all kernel permission checks, while non-privileged processes are subject to full permission checking based on their credentials, such as the process’s effective user ID and effective group ID.

In order to accommodate more complex privileges, the Linux kernel from version 2.2 onwards has been able to further break down superuser privileges into fine-grained units called capabilities; for example, capability CAP_CHOWN allows the user to make arbitrary changes to the UID and GID of a file by executing the chown command. Almost all superuser-related privileges have been broken down into separate capabilities.

The introduction of capabilities has the following benefits.

  • Individual capabilities can be removed from the superuser to reduce its power and improve system security.
  • Specific privileges can be granted to ordinary users precisely and on demand.

Security risks of privileged containers

Containers isolate processes and resources with namespaces, but not every resource can be namespaced, so a container is never completely isolated from its host; the system time, for example, is shared between containers and the host. If a process in a container held every privilege, a malicious program could access hardware directly or even modify the host’s file system. Operations inside the container therefore need to be restricted; otherwise they can affect the stability of the host and create serious security risks.

For these reasons, a container by default runs with only a whitelist of capabilities, which is applied when the container is created. As a result, even the superuser inside the container is not permitted to perform certain operations.

Let’s deepen our understanding of capabilities in containers with an example.

Preparation

Inside the container we will use an additional library and tool set, libcap, to interact with capabilities. It needs to be present in the filesystem bundle, which can be prepared as described in the previous article:

# Create the top-level directory of the bundle
mkdir /mycontainer2
cd /mycontainer2

# Create the rootfs directory that will hold the root filesystem
mkdir rootfs

# Use Docker to export the root filesystem of a container that has libcap installed
docker export $(docker create cmd.cat/capsh) | tar -C rootfs -xvf -

# Generate a config.json that serves as the spec for the whole bundle
runc spec

You can then execute runc run from the /mycontainer2 directory to start a base container that has the library installed.

Add capabilities when creating containers

In the opening example, we were unable to set the hostname in the container with the root user because the capability CAP_SYS_ADMIN was missing, which is not included in the whitelist of capabilities added to the container by default.

In a previous article, we described that the container runtime sets the runtime parameters and execution environment for the container it creates based on the config.json in the bundle, a process that also includes setting the capabilities of the processes in the container.

If we modify config.json and add "CAP_SYS_ADMIN" to the bounding, permitted and effective lists of the process.capabilities object, runc adds this capability to the corresponding capability sets of the container’s init process.

"capabilities": {
    "bounding": [
        "CAP_AUDIT_WRITE",
        "CAP_KILL",
        "CAP_NET_BIND_SERVICE",
        "CAP_SYS_ADMIN"
    ],
    "effective": [
        "CAP_AUDIT_WRITE",
        "CAP_KILL",
        "CAP_NET_BIND_SERVICE",
        "CAP_SYS_ADMIN"
    ],
    "inheritable": [
        "CAP_AUDIT_WRITE",
        "CAP_KILL",
        "CAP_NET_BIND_SERVICE"
    ],
    "permitted": [
        "CAP_AUDIT_WRITE",
        "CAP_KILL",
        "CAP_NET_BIND_SERVICE",
        "CAP_SYS_ADMIN"
    ],
    "ambient": [
        "CAP_AUDIT_WRITE",
        "CAP_KILL",
        "CAP_NET_BIND_SERVICE"
    ]
}
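
The same edit can be scripted instead of made by hand. The following is a minimal sketch that assumes jq is installed on the host and that config.json has the layout generated by runc spec; add_cap is a hypothetical helper name introduced here for illustration, not part of runc.

```shell
# Add one capability to the bounding, permitted and effective lists of
# process.capabilities in an OCI config.json. Requires jq (an assumption).
# $1: capability name, e.g. CAP_SYS_ADMIN
# $2: path to config.json (defaults to ./config.json)
add_cap() {
    cap="$1"
    config="${2:-config.json}"
    tmp="$(mktemp)" || return 1
    # "unique" removes duplicates, so running the helper twice is harmless;
    # as a side effect it also sorts each of the three lists.
    if jq --arg cap "$cap" '
        .process.capabilities |= (
            .bounding  = ((.bounding  // []) + [$cap] | unique) |
            .permitted = ((.permitted // []) + [$cap] | unique) |
            .effective = ((.effective // []) + [$cap] | unique)
        )' "$config" > "$tmp"
    then
        # Write back through the existing file to keep its permissions.
        cat "$tmp" > "$config" && rm -f "$tmp"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

For the bundle used in this article the call would be add_cap CAP_SYS_ADMIN /mycontainer2/config.json. The inheritable and ambient lists are deliberately left untouched, matching the JSON shown above.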

Technical details of capabilities

Capabilities can be applied to both files and processes (more precisely threads, since capabilities are a per-thread attribute). The capabilities of a file are stored in the file’s extended attributes, which are cleaned up when the image is built, so we generally do not need to consider file capabilities in a container.

The capabilities of a process are controlled by five capability sets maintained separately for each process, each of which contains zero or more capabilities.

  • Permitted: the limiting superset of the capabilities that the process may use
  • Inheritable: capabilities that can be passed on to the new program when the process executes the exec() system call
  • Effective: the set the kernel actually uses when it performs permission checks on the process
  • Bounding: an upper limit on the capabilities that can be gained through exec(); a capability must also be in the Bounding set before it can be added to Inheritable
  • Ambient: capabilities that are retained when the process uses exec() to run an unprivileged program
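
On Linux these five sets can be read from /proc/<pid>/status, where the kernel reports them as hexadecimal bitmasks in the fields CapInh, CapPrm, CapEff, CapBnd and CapAmb (the last one only on kernels that support ambient capabilities). The sketch below maps those fields back to the set names used in the list above; cap_sets is a hypothetical helper name chosen for this example.

```shell
# Print the capability sets of a process as raw hexadecimal bitmasks.
# $1: path to a status file; defaults to that of the calling shell.
cap_sets() {
    status_file="${1:-/proc/$$/status}"
    # Translate the kernel's abbreviated field names into the set names.
    sed -n \
        -e 's/^CapInh:[[:space:]]*/Inheritable /p' \
        -e 's/^CapPrm:[[:space:]]*/Permitted /p' \
        -e 's/^CapEff:[[:space:]]*/Effective /p' \
        -e 's/^CapBnd:[[:space:]]*/Bounding /p' \
        -e 's/^CapAmb:[[:space:]]*/Ambient /p' \
        "$status_file"
}
```

Running cap_sets with no argument inside the container inspects the current shell; on the host, cap_sets /proc/<pid>/status inspects any process whose PID is known.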

As shown above we have added CAP_SYS_ADMIN to the Permitted, Bounding and Effective sets of the init process, so the init process will pass the kernel’s check for CAP_SYS_ADMIN.
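
Conceptually, such a check asks whether one particular bit is set in the Effective set. The sketch below mimics that test in the shell. The bit numbers come from <linux/capability.h> (CAP_KILL is 5, CAP_NET_BIND_SERVICE is 10, CAP_SYS_ADMIN is 21, CAP_AUDIT_WRITE is 29); has_cap is a hypothetical helper, and the two masks are computed by hand from the lists in config.json rather than copied from a real process.

```shell
# Succeed (exit status 0) if bit number $2 is set in the hexadecimal mask $1.
has_cap() {
    mask="$1"
    bit="$2"
    [ $(( (0x$mask >> $bit) & 1 )) -eq 1 ]
}

CAP_SYS_ADMIN=21   # bit number from <linux/capability.h>

# 0x20000420 = bits 5, 10 and 29 (the three default capabilities)
has_cap 0000000020000420 "$CAP_SYS_ADMIN" || echo "default list: CAP_SYS_ADMIN missing"

# 0x20200420 = the same mask with bit 21 added
has_cap 0000000020200420 "$CAP_SYS_ADMIN" && echo "modified list: CAP_SYS_ADMIN present"
```

The first argument has the same format as the CapEff field in /proc/<pid>/status, so a value read from there can be passed in directly.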

Next we run a new container based on the new config.json, and now we can change the hostname.

$ runc run mybox2
$ hostname super
$ hostname
super

The commands above ran in the sh process that is the container’s init process. If we create further processes in the container, will they also have the newly added capability? Let’s find out by executing the following commands in a new terminal window.

$ runc exec -t mybox2 sh
$ hostname
super
$ hostname hello
$ hostname
hello

The hostname change succeeded because the newly created process receives the same capabilities as the init process.

Adding capabilities at container runtime

In addition to modifying config.json to add capabilities, we can also add capabilities during the container runtime phase.

First restore the original config.json, then start a new container, mybox3, and confirm that its sh process no longer has CAP_SYS_ADMIN.

Then create a new process in that container with runc exec and add CAP_SYS_ADMIN to that process with the --cap option.

runc exec --cap CAP_SYS_ADMIN mybox3 /bin/hostname origin

The idea is that since runc can set the capabilities set for the init process based on config.json, it can do the same for other processes running in the container.

Check the capabilities of the process

capsh

Execute capsh --print from within the container to get more information about capabilities.

$ capsh --print

Current: = cap_kill,cap_net_bind_service,cap_audit_write+eip cap_sys_admin+ep
Bounding set =cap_kill,cap_net_bind_service,cap_sys_admin,cap_audit_write
Ambient set =cap_kill,cap_net_bind_service,cap_audit_write
Securebits: 00/0x0/1'b0
 secure-noroot: no (unlocked)
 secure-no-suid-fixup: no (unlocked)
 secure-keep-caps: no (unlocked)
 secure-no-ambient-raise: no (unlocked)
uid=0(root)
gid=0(root)
groups=

This command prints the capabilities of the current process.

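The cap_sys_admin capability we added via config.json appears in both the Current and Bounding sets. The suffix after each group of capabilities shows which sets contain it: +eip means the capabilities are in the Effective, Inheritable, and Permitted sets, while cap_sys_admin carries only +ep because we did not add it to the inheritable list.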

pscap

First, on the host, get the PIDs of the processes running in the container.

$ runc ps mybox2
UID          PID    PPID  C STIME TTY          TIME CMD
root        9592    9580  0 14:39 pts/0    00:00:00 sh
root        9776    9765  0 14:46 pts/1    00:00:00 sh

Install the pscap program on the host.

$ apt-get install libcap-ng-utils

Using those PIDs, list the capabilities of the processes in the container.

$ pscap | grep "9592\|9776"
9580  9592  root        sh                kill, net_bind_service, sys_admin, audit_write
9765  9776  root        sh                kill, net_bind_service, sys_admin, audit_write
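
If pscap is not available, a similar check can be approximated on the host by scanning /proc directly. This is only a sketch under stated assumptions: pids_with_cap is a hypothetical helper, it looks at the Effective set alone, and it reports raw PIDs rather than the capability names that pscap prints.

```shell
# List the PIDs whose Effective set contains capability bit $1.
# $2: proc directory to scan (defaults to /proc); it is a parameter only so
#     that the function can be tried against a copy or a test fixture.
pids_with_cap() {
    bit="$1"
    proc_dir="${2:-/proc}"
    for status in "$proc_dir"/[0-9]*/status; do
        [ -r "$status" ] || continue
        # A process may exit while we scan; an empty mask is simply skipped.
        mask="$(sed -n 's/^CapEff:[[:space:]]*//p' "$status" 2>/dev/null)"
        [ -n "$mask" ] || continue
        if [ $(( (0x$mask >> $bit) & 1 )) -eq 1 ]; then
            pid="${status%/status}"
            echo "${pid##*/}"
        fi
    done
}
```

With CAP_SYS_ADMIN being bit 21, pids_with_cap 21 | grep -x -e 9592 -e 9776 would confirm that both container processes listed by runc ps hold the capability.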