On January 18, 2022, Linux maintainers and vendors disclosed a heap buffer overflow in the legacy_parse_param function of the Linux kernel’s file system context code (kernel 5.1-rc1 and later). The vulnerability is tracked as CVE-2022-0185 and is rated high risk, with a severity score of 7.8.
The vulnerability allows out-of-bounds writes in kernel memory. By exploiting it, an unprivileged attacker could bypass the restrictions of any Linux namespace and elevate their privileges to root. For example, an attacker who has compromised one of your containers could escape from the container and elevate privileges.
This vulnerability was introduced in March 2019, in Linux kernel 5.1-rc1. A patch was released on January 18, 2022 to fix the issue, and all Linux users are advised to download and install the latest version of the kernel.
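As a rough first triage step, the version range above can be checked programmatically. The sketch below only tests whether a kernel release string is 5.1 or later, which is the range where the vulnerable code exists. It cannot tell whether a distribution has already backported the fix, so a True result means "check your vendor advisory", not "vulnerable". The function name kernel_in_affected_range is made up for this illustration.

```python
import platform
import re


def kernel_in_affected_range(release: str) -> bool:
    """Return True if the release string is 5.1 or later.

    This only says the vulnerable code path may be present; it does NOT
    detect whether the distribution has already shipped the patch.
    """
    match = re.match(r"^(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= (5, 1)


if __name__ == "__main__":
    # platform.release() returns the running kernel release, e.g. "5.4.0-96-generic".
    release = platform.release()
    try:
        print(release, "->", kernel_in_affected_range(release))
    except ValueError:
        print(release, "-> could not parse release string")
```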
The vulnerability is caused by an integer underflow in the legacy_parse_param function of the file system context code (fs/fs_context.c). The file system context code creates the superblocks used for mounting and remounting file systems. A superblock records a file system’s characteristics, such as block size, file size, and any storage blocks.
By sending more than 4095 bytes of input to legacy_parse_param, an attacker can bypass the input length check and cause an out-of-bounds write, triggering the vulnerability. An attacker could use this to write malicious data to other parts of memory and crash the system, or to execute arbitrary code and elevate privileges.
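To make the underflow concrete, here is a small Python model of how an unsigned length check can be bypassed. It is a simplified sketch based on how the bug is commonly described in public write-ups, not a copy of the kernel source. The 4096-byte buffer, the 64-bit unsigned arithmetic, and both function names are assumptions made for illustration.

```python
PAGE_SIZE = 4096              # assumed size of the legacy parameter buffer
SIZE_T_MASK = (1 << 64) - 1   # models unsigned 64-bit (size_t) wraparound


def vulnerable_check_rejects(size: int, length: int) -> bool:
    """Model of a check of the form: len > PAGE_SIZE - 2 - size.

    The subtraction is done in unsigned arithmetic, so once `size`
    exceeds PAGE_SIZE - 2 the right-hand side wraps to a huge value
    and nothing is ever rejected.
    """
    limit = (PAGE_SIZE - 2 - size) & SIZE_T_MASK
    return length > limit


def fixed_check_rejects(size: int, length: int) -> bool:
    """Model of a check rewritten as an addition, which cannot underflow."""
    return size + length + 2 > PAGE_SIZE


if __name__ == "__main__":
    # 4095 bytes are already buffered and 100 more arrive.
    print(vulnerable_check_rejects(4095, 100))  # False: the write is allowed
    print(fixed_check_rejects(4095, 100))       # True: the write is rejected
```

The point of the model is that the two checks agree for small inputs and only diverge once the buffered size passes PAGE_SIZE - 2, which is exactly when the write would run past the end of the buffer.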
The input to legacy_parse_param is supplied through the fsconfig system call, which configures the file system creation context (e.g., an ext4 file system superblock).
To use the fsconfig system call this way, an unprivileged user must hold at least the CAP_SYS_ADMIN capability in their current namespace. The capability does not have to be held on the host: having it in any namespace the user can access is sufficient to exploit this vulnerability.
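One way to see whether a process holds CAP_SYS_ADMIN in its current namespace is to decode the CapEff bitmask that Linux exposes in /proc/<pid>/status. The following is a minimal sketch: CAP_SYS_ADMIN is capability number 21 in the kernel headers, while the helper name has_cap_sys_admin and the sample masks are illustrative.

```python
CAP_SYS_ADMIN = 21  # capability number from <linux/capability.h>


def has_cap_sys_admin(status_text: str) -> bool:
    """Return True if the CapEff line in /proc/<pid>/status has bit 21 set."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            mask = int(line.split()[1], 16)  # the mask is printed in hex
            return bool(mask & (1 << CAP_SYS_ADMIN))
    raise ValueError("no CapEff line found")


if __name__ == "__main__":
    # Inspect the current process (Linux only).
    try:
        with open("/proc/self/status") as f:
            print("CAP_SYS_ADMIN:", has_cap_sys_admin(f.read()))
    except OSError:
        print("/proc/self/status is not available on this system")
```

Because capabilities are evaluated per user namespace, running this check inside a freshly created user namespace can report True even though the same user has no capabilities on the host.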
If CAP_SYS_ADMIN is not available to an unprivileged user, an attacker can obtain it via the unshare(CLONE_NEWNS|CLONE_NEWUSER) system call. The unshare system call lets a process create new namespaces, here a new mount namespace and a new user namespace, in which it holds the privileges needed for further attacks.
This technique matters in the Kubernetes and container worlds, which use Linux namespaces to isolate Pods. An attacker can use it in a container escape attack: once successful, the attacker gains full control over the host OS and all containers running on it, and can go on to attack other machines on the internal network segment or even deploy malicious containers in the Kubernetes cluster.
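From a defender’s point of view, it is useful to know whether a given environment lets an unprivileged process make this call at all. The sketch below probes that from a throwaway child process, so the caller’s own namespaces are never changed. The flag values come from the Linux headers; the function name can_unshare_userns is hypothetical, and the probe assumes a Linux system whose libc exports unshare.

```python
import ctypes
import os

CLONE_NEWNS = 0x00020000    # new mount namespace
CLONE_NEWUSER = 0x10000000  # new user namespace


def can_unshare_userns() -> bool:
    """Try unshare(CLONE_NEWUSER | CLONE_NEWNS) in a forked child.

    Returns True only if the call succeeds. A Seccomp filter, a sysctl
    restriction, or a missing libc symbol all result in False.
    """
    pid = os.fork()
    if pid == 0:
        code = 1
        try:
            libc = ctypes.CDLL(None, use_errno=True)
            if libc.unshare(CLONE_NEWUSER | CLONE_NEWNS) == 0:
                code = 0
        finally:
            # Always leave the child here so it never runs the parent's code.
            os._exit(code)
    _, status = os.waitpid(pid, 0)
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0


if __name__ == "__main__":
    print("unprivileged user namespaces usable:", can_unshare_userns())
```

A True result inside a container suggests that no Seccomp filter or host-level restriction is blocking the call there.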
The research team that discovered the vulnerability posted code and a proof of concept to exploit it on GitHub on January 25.
Docker and other container runtimes use Seccomp profiles by default to prevent processes in a container from making dangerous system calls, which protects Linux namespace boundaries.
Seccomp (short for secure computing mode) was introduced in Linux kernel 2.6.12 (March 8, 2005). Initially it was a strict whitelist: a process in this mode could use only four system calls (read, write, _exit, and sigreturn) on file descriptors it had already opened, and the kernel terminated the process with SIGKILL or SIGSYS if it attempted any other system call.
However, Kubernetes does not apply any Seccomp or AppArmor/SELinux profile by default to restrict the system calls available to Pods. This is dangerous: processes in Pods can freely make dangerous system calls and wait for an opportunity to gain the privileges they need (such as CAP_SYS_ADMIN) for further attacks.
Let’s start with a Docker example. The unshare command does not work in a standard Docker environment, because Docker’s Seccomp filter blocks the system calls it uses.
Now let’s look at a Kubernetes Pod.
You can see that the root user in the Pod does not have the CAP_SYS_ADMIN capability, but we can obtain it with the unshare command.
So what can you do with CAP_SYS_ADMIN? Here are two examples to show how CAP_SYS_ADMIN can be used to infiltrate a system.
Normal user elevated to root user
The following procedure can be used to elevate a normal user on the host directly to the root user.
First, give python3 the CAP_SYS_ADMIN capability (note that you cannot set capabilities on a symbolic link, only on the original file).
Create a normal user.
Then switch to the normal user and go to the user home directory.
Copy /etc/passwd to the current directory and change the root user’s password in the copy to “password”.
Mount the modified passwd file to /etc/passwd.
Now comes the moment to witness the miracle! Switch directly to the root user and enter the password “password”.
Amazing, I’ve switched to the root user.
Let’s check whether we really have root privileges.
Well, it’s root, that’s right.
Finally, remember to unmount /etc/passwd.
So, System Reboot Engineers, check whether the normal users you hand out to others have the CAP_SYS_ADMIN capability!
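A quick way to act on that advice is to scan the output of getcap -r / for binaries that carry cap_sys_admin. The sketch below parses such output. It accepts both common print styles (with and without the " = " separator) and assumes paths contain no spaces; the function name and the sample paths are made up for the example.

```python
def find_cap_sys_admin(getcap_output: str) -> list[str]:
    """Return the paths whose file capabilities include cap_sys_admin.

    Expects one entry per line, as printed by `getcap -r <dir>`, e.g.
    "/usr/bin/python3.8 = cap_sys_admin+ep" or "/usr/bin/python3.8 cap_sys_admin=ep".
    """
    flagged = []
    for line in getcap_output.splitlines():
        parts = line.split()
        if not parts:
            continue
        path, caps = parts[0], " ".join(parts[1:])
        if "cap_sys_admin" in caps.lower():
            flagged.append(path)
    return flagged


if __name__ == "__main__":
    sample = (
        "/usr/bin/python3.8 = cap_sys_admin+ep\n"
        "/usr/bin/ping = cap_net_raw+ep\n"
    )
    print(find_cap_sys_admin(sample))
```

In practice the input would come from running getcap as root, for example by capturing its output with subprocess.run and passing stdout to the function.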
View all processes on the host in a container
Let’s look at another example, this time in a container. The following magic operation lets you list all the processes running on the host from inside a container.
We don’t need to use the --privileged argument to run a privileged container; that would be pointless.
Next, execute the following command in the container. The final effect is to execute the ps aux command on the host and save its output to the /output file in the container.
Eventually, from inside the container, you can see all the processes running on the host.
I won’t explain what each of these commands means; interested readers can work through them using the comments.
What is certain is that the CAP_SYS_ADMIN capability gives attackers more options, both on the host and in containers, and especially in container environments. If we cannot upgrade the kernel for reasons beyond our control, we should look for other mitigations.
Since v1.22, Kubernetes has been able to use SecurityContext to add default Seccomp or AppArmor profiles to resource objects, protecting Pods, Deployments, StatefulSets, DaemonSets, and more. Although this feature is currently in Alpha, users can already add their own Seccomp or AppArmor profile and reference it in SecurityContext.
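As an illustration of what such a definition can look like, the sketch below builds a Pod manifest that requests the container runtime’s default Seccomp profile through securityContext.seccompProfile. The Pod name and image are placeholders, and the helper build_pod_manifest is hypothetical. A custom profile would instead use type Localhost together with a localhostProfile path on the node.

```python
import json


def build_pod_manifest(name: str, image: str) -> dict:
    """Build a Pod manifest that asks for the runtime's default Seccomp profile."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Pod-level setting: applies to every container in the Pod.
            "securityContext": {"seccompProfile": {"type": "RuntimeDefault"}},
            "containers": [
                {
                    "name": name,
                    "image": image,
                    "securityContext": {"allowPrivilegeEscalation": False},
                }
            ],
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON as well as YAML, so the output can be piped
    # into `kubectl apply -f -`.
    print(json.dumps(build_pod_manifest("seccomp-demo", "ubuntu:20.04"), indent=2))
```

Whether the runtime default profile blocks unshare depends on the container runtime in use, so it is worth verifying in your own cluster.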
After creating the Pod, try to use unshare to get CAP_SYS_ADMIN capability.
The output shows that the unshare system call is blocked, so the attacker cannot use it to obtain the capability and attack.
Another option is to disable user namespaces at the host level, without rebooting the system. For example, on Ubuntu, the following two lines are all you need: the change takes effect immediately and persists after a reboot.
If you are on a Red Hat system, you can run the following command to achieve the same result.
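To verify that a host-level restriction is actually in place, the relevant sysctl values can be read back from /proc/sys. The sketch below checks two settings that are commonly documented for this purpose: kernel.unprivileged_userns_clone (present on Debian and Ubuntu kernels) and user.max_user_namespaces (available on mainline kernels, including Red Hat systems). These are assumptions about which settings a host uses, not a restatement of the commands above, and the helper names are made up for this example.

```python
from pathlib import Path

# sysctl name -> its location under /proc/sys (either may be absent on a given kernel)
SYSCTLS = {
    "kernel.unprivileged_userns_clone": "/proc/sys/kernel/unprivileged_userns_clone",
    "user.max_user_namespaces": "/proc/sys/user/max_user_namespaces",
}


def read_sysctls(paths=SYSCTLS) -> dict:
    """Read each sysctl as an int; missing or unreadable entries become None."""
    values = {}
    for name, path in paths.items():
        try:
            values[name] = int(Path(path).read_text().strip())
        except (OSError, ValueError):
            values[name] = None
    return values


def userns_disabled(values: dict) -> bool:
    """True if either setting forbids unprivileged users from creating user namespaces."""
    return (
        values.get("kernel.unprivileged_userns_clone") == 0
        or values.get("user.max_user_namespaces") == 0
    )


if __name__ == "__main__":
    current = read_sysctls()
    print(current, "-> user namespaces disabled:", userns_disabled(current))
```

Note that setting user.max_user_namespaces to 0 disables user namespaces for every user, which can break software that relies on them, such as rootless containers.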
To summarize, here are the recommendations for handling the vulnerability:
- If your environment can tolerate patching the kernel and rebooting the system, the best option is to apply the patch or upgrade the kernel.
- Reduce the use of privileged containers that have access to CAP_SYS_ADMIN.
- For unprivileged containers, make sure a Seccomp filter blocks their calls to unshare to reduce risk. Docker does this by default; Kubernetes requires additional action.
- In the future, Seccomp profiles can be enabled for all workloads in a Kubernetes cluster. This feature is currently in Alpha and needs to be enabled via a feature gate.
- Disable user namespaces for users at the host level.
Container environments are complex, especially distributed scheduling platforms like Kubernetes. Each component has its own lifecycle and attack surface and can easily expose security risks, so container cluster administrators must pay attention to every security detail. Overall, container security in the vast majority of cases depends on the security of the Linux kernel, so we need to keep an eye on kernel security issues and apply the corresponding fixes as soon as possible.