Kubernetes has a long-standing problem with subPath volume mounts: if a container that mounts a ConfigMap (or other volume) file via subPath exits unexpectedly after the volume's contents have changed, it fails to start again and keeps crashing.

Kubernetes 1.18 has now been released and the issue still exists.

Related community issue: #68211

Reproduction steps

---
apiVersion: v1
kind: Pod 
metadata:
  name: test-pod
spec:
  volumes:
  - configMap:
      name: extra-cfg
    name: extra-cfg
  containers:
  - name: test
    image: ubuntu:bionic
    command: ["sleep", "30"]
    resources:
      requests:
        cpu: 100m
    volumeMounts:
      - name: extra-cfg
        mountPath: /etc/extra.ini
        subPath: extra.ini
---
apiVersion: v1
data:
  extra.ini: |
        somedata
kind: ConfigMap
metadata:
  name: extra-cfg

Apply this configuration. After the Pod has started, modify the contents of the ConfigMap and wait about 30 seconds for the container to exit on its own (the `sleep 30` finishes). When the kubelet restarts the container, the subPath mount fails repeatedly.
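A rough reproduction sequence, assuming the manifest above is saved as pod.yaml (a file name chosen here for illustration):

kubectl apply -f pod.yaml
# once the Pod is running, change the value of extra.ini
kubectl edit configmap extra-cfg
# after the sleep 30 exits, watch the restart fail
kubectl get pod test-pod -w
kubectl describe pod test-pod

The Pod's events then show an error like the following: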

Error: failed to start container "test": Error response from daemon: OCI runtime create failed: container_linux.go:345: starting container process caused "process_linux.go:424: container init caused \"rootfs_linux.go:58: mounting \\\"/var/lib/kubelet/pods/e044883a-48da-4d28-b304-1a57dcb32203/volume-subpaths/extra-cfg/test/0\\\" to rootfs \\\"/var/lib/docker/overlay2/31b076d0012aad47aa938b482de24ecda8b41505489a22f63b8a3e4ce39b43ba/merged\\\" at \\\"/var/lib/docker/overlay2/31b076d0012aad47aa938b482de24ecda8b41505489a22f63b8a3e4ce39b43ba/merged/etc/extra.ini\\\" caused \\\"no such file or directory\\\"\"": unknown

Cause Analysis

Update of Configmap Volume

Before the container starts for the first time, the kubelet downloads the contents of the ConfigMap into the Pod's corresponding volume directory, for example /var/lib/kubelet/pods/{Pod UID}/volumes/kubernetes.io~configmap/extra-cfg.

To ensure that updates to the contents of this volume are atomic, the files in the directory are exposed through symlinks, as follows.

drwxrwxrwx 3 root root 4.0K Mar 29 03:12 .
drwxr-xr-x 3 root root 4.0K Mar 29 03:12 ..
drwxr-xr-x 2 root root 4.0K Mar 29 03:12 ..2020_03_29_03_12_44.788930127
lrwxrwxrwx 1 root root   31 Mar 29 03:12 ..data -> ..2020_03_29_03_12_44.788930127
lrwxrwxrwx 1 root root   16 Mar 29 03:12 extra.ini -> ..data/extra.ini

extra.ini is a symlink to ..data/extra.ini, ..data is a symlink to ..2020_03_29_03_12_44.788930127, and that timestamped directory holds the real content.

When the ConfigMap is updated, a new timestamped directory is created to hold the updated content.

A new symlink ..data_tmp is then created pointing to the new timestamped directory and renamed to ..data; the rename is an atomic operation.

Finally, the old timestamped directory is deleted.
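A roughly equivalent shell sequence (directory names and timestamps are illustrative; the kubelet performs these steps internally, not through a shell):

cd "/var/lib/kubelet/pods/{Pod UID}/volumes/kubernetes.io~configmap/extra-cfg"
# write the new payload into a fresh timestamped directory
mkdir ..2020_03_29_03_20_00.000000000
echo "newdata" > ..2020_03_29_03_20_00.000000000/extra.ini
# point a temporary symlink at the new directory, then atomically rename it over ..data
ln -s ..2020_03_29_03_20_00.000000000 ..data_tmp
mv -T ..data_tmp ..data
# finally, remove the old timestamped directory
rm -rf ..2020_03_29_03_12_44.788930127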

Preparing the subPath volume mount for the container

When the ConfigMap volume is ready, the kubelet bind mounts the file specified by subPath to a dedicated directory: /var/lib/kubelet/pods/{Pod UID}/volume-subpaths/extra-cfg/{container name}/0.

cat /proc/self/mountinfo|grep extra
2644 219 8:1 /var/lib/kubelet/pods/{Pod UID}/volumes/kubernetes.io~configmap/extra-cfg/..2020_03_29_03_12_13.444136014/extra.ini /var/lib/kubelet/pods/{Pod UID}/volume-subpaths/extra-cfg/test/0 rw,relatime shared:99 - ext4 /dev/sda1 rw,data=ordered

As you can see, the source of the bind mount is the real file inside the timestamped directory.
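The mount the kubelet sets up is roughly equivalent to the following command (using the timestamped directory from the mountinfo output above):

# note that the source is the fully resolved timestamped file,
# not the extra.ini symlink at the top of the volume directory
mount --bind \
  "/var/lib/kubelet/pods/{Pod UID}/volumes/kubernetes.io~configmap/extra-cfg/..2020_03_29_03_12_13.444136014/extra.ini" \
  "/var/lib/kubelet/pods/{Pod UID}/volume-subpaths/extra-cfg/test/0"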

When the ConfigMap is updated, this timestamped directory is removed, and the mount's source path gains a //deleted suffix in mountinfo.

cat /proc/self/mountinfo|grep extra
2644 219 8:1 /var/lib/kubelet/pods/{Pod UID}/volumes/kubernetes.io~configmap/extra-cfg/..2020_03_29_03_12_13.444136014/extra.ini//deleted /var/lib/kubelet/pods/{Pod UID}/volume-subpaths/extra-cfg/test/0 rw,relatime shared:99 - ext4 /dev/sda1 rw,data=ordered

Bind Mount

When the container is started, /var/lib/kubelet/pods/{Pod UID}/volume-subpaths/extra-cfg/test/0 needs to be bind mounted into the container.

If the original timestamped directory has been deleted, the mount fails with: mount: mount(2) failed: No such file or directory.

This problem can be simulated with a few simple commands:

# touch a b c
# mount --bind a b
# rm -f a
# mount --bind b c
mount: mount(2) failed: No such file or directory

As you can see, once a has been deleted, b can no longer be used as the source of another bind mount. In the same way, when the container exits abnormally and needs to be restarted, if the ConfigMap has been updated and the original timestamped directory deleted, the subPath can no longer be mounted into the container.

Solution

Unmount and remount after the ConfigMap changes

Related community PR: https://github.com/kubernetes/kubernetes/pull/82784

Before the container restarts, check whether the source of the existing subPath mount point is still consistent with the file the subPath should now resolve to.

If the ConfigMap has been updated and the timestamped directory has changed, the check detects the inconsistency; the kubelet then unmounts /var/lib/kubelet/pods/{Pod UID}/volume-subpaths/extra-cfg/test/0 and bind mounts the corresponding file from the latest timestamped directory again.
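A hedged shell-level sketch of that idea (the actual fix is implemented in the kubelet's Go code; the paths and variable names below are placeholders):

subpath_mnt="/var/lib/kubelet/pods/$POD_UID/volume-subpaths/extra-cfg/test/0"
# resolve the file the subPath should point to right now
current=$(readlink -f "/var/lib/kubelet/pods/$POD_UID/volumes/kubernetes.io~configmap/extra-cfg/extra.ini")

# -ef compares device and inode; it becomes false once the old timestamped file is gone
if ! [ "$subpath_mnt" -ef "$current" ]; then
    umount "$subpath_mnt"
    mount --bind "$current" "$subpath_mnt"
fi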

Judging from the comments on the community PR, this approach may be risky and its implications unclear (it was noted that kernels below 4.18 are not safe in this regard), so the PR made no progress for a long time.

Testing over time has not yet revealed any obvious problems.

Do not use subPath

Alternatively, bypass the problem by not using subPath at all.

For example, mount the whole ConfigMap at another path inside the container and create a symlink to the expected path when the container starts.
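For instance, if the whole ConfigMap is mounted at /etc/extra (a mount path chosen here purely for illustration), the container's entrypoint could create the expected path itself:

#!/bin/sh
# /etc/extra is the mountPath of the whole ConfigMap volume (illustrative)
ln -sfn /etc/extra/extra.ini /etc/extra.ini
# then start the real application with the original arguments
exec "$@"

Because /etc/extra.ini then points into the still-mounted ConfigMap volume, updates propagate through the ..data symlink switch and container restarts are unaffected.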

Refer to the article at https://kubernetes.io/blog/2018/04/04/fixing-subpath-volume-vulnerability/.

It shows that subPath originally mounted symlinks directly, but this opened a security vulnerability, a symlink race: a malicious program can construct a symlink that causes a privileged process (the kubelet) to mount files it should not be able to access into the user's container.

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  initContainers:
  - name: prep-symlink
    image: "busybox"
    command: ["bin/sh", "-ec", "ln -s / /mnt/data/symlink-door"]
    volumeMounts:
    - name: my-volume
      mountPath: /mnt/data
  containers:
  - name: my-container
    image: "busybox"
    command: ["/bin/sh", "-ec", "ls /mnt/data; sleep 999999"]
    volumeMounts:
    - mountPath: /mnt/data
      name: my-volume
      subPath: symlink-door
  volumes:
  - name: my-volume
    emptyDir: {}

With the above configuration, the initContainer creates a symlink pointing to the root directory inside the mounted emptyDir volume.

The main container then starts normally, but it specifies that symlink as its subPath. If the kubelet mounted the symlink directly, it would mount the host's root directory into the user's container.

To solve this, the kubelet must resolve the real path behind the symlink, verify that the path lies inside the volume directory, and only mount it into the container once the verification passes. However, because of the time gap between verification and mounting, the file can still be tampered with in between.

After community discussion, an intermediate bind mount mechanism was introduced. It effectively locks the file and pins the resolved path of the original file, so that a later mount into the container only ever uses the source file as it was when the intermediate mount point was created.
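The effect can be illustrated with a small shell experiment (this only demonstrates the bind mount semantics; it is not the kubelet's actual implementation):

echo real > real
ln -s real link
touch pinned
# the bind mount resolves the symlink once, at mount time
mount --bind link pinned
# retargeting the symlink afterwards does not affect the existing mount point
ln -sfn /etc/shadow link
cat pinned    # still prints "real"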

Update

The fix PR submitted to the community, #89629, has been merged.