Introduction

As a provider of a Kubernetes platform, you need to place restrictions on "rogue" applications to prevent them from abusing the platform's CPU, memory, disk, network, and other resources.

For example, Kubernetes provides CPU and memory limits to prevent applications from consuming those resources without bound, and PVCs backed by storage such as CephFS and RBD also support capacity limits.

However, earlier versions of Kubernetes did not limit the capacity of a container's rootfs. By default, container logs are stored under /var/lib/kubelet/ and the container rootfs under /var/lib/docker, both of which live in the root partition of the host node. A malicious workload can therefore quickly fill the node's root partition, for example by running dd repeatedly inside a container, and a Linux root partition at 100% usage is usually dangerous.

Kubernetes 1.8 introduced a new resource, local ephemeral storage, to manage local temporary storage; the corresponding feature gate is LocalStorageCapacityIsolation. Since 1.10 this feature has been beta and is enabled by default.

Temporary storage such as emptyDir volumes, container logs, image layers, and container writable layers lives under /var/lib/kubelet by default, so limiting its capacity also protects the node's root partition.

Local ephemeral storage management only applies to the root partition; it will not take effect if you customize the relevant parameters, such as --root-dir.

Configuration

My cluster runs version 1.14, where local ephemeral storage is enabled by default, so only the Pods need to be configured.

Each container of a Pod can be configured with:

  • spec.containers[].resources.limits.ephemeral-storage
  • spec.containers[].resources.requests.ephemeral-storage

The unit is bytes, which can be written directly or with the suffixes E/P/T/G/M/K or Ei/Pi/Ti/Gi/Mi/Ki; for example, 128974848, 129e6, 129M, and 123Mi all represent roughly the same capacity.
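
If you want to verify how these notations are parsed, a small program using the resource package from k8s.io/apimachinery (the same quantity parser the API server uses) prints their byte values:

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// Each string is a valid ephemeral-storage quantity; Value() returns bytes.
	for _, s := range []string{"128974848", "129e6", "129M", "123Mi"} {
		q := resource.MustParse(s)
		fmt.Printf("%-10s = %d bytes\n", s, q.Value())
	}
}

128974848 and 123Mi are exactly equal, while 129e6 and 129M both resolve to 129000000 bytes.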

The following creates a Deployment and sets the maximum temporary storage it uses to 2Gi.

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: nginx
  namespace: default
spec:
  selector:
    matchLabels:
      run: nginx
  template:
    metadata:
      labels:
        run: nginx
    spec:
      containers:
      - image: nginx
        name: nginx
        resources:
          limits:
            ephemeral-storage: 2Gi
          requests:
            ephemeral-storage: 2Gi

After the Pod starts, enter the container and run dd if=/dev/zero of=/test bs=4096 count=1024000 to try to create a roughly 4Gi file. After some time you will see that the Pod is evicted and the controller creates a new Pod in its place.

nginx-75bf8666b8-89xqm                    1/1     Running             0          1h
nginx-75bf8666b8-pm687                    0/1     Evicted             0          2h

Implementation

Pod eviction is performed by the kubelet. The kubelet on each node runs an eviction manager that checks every 10 seconds (evictionMonitoringPeriod), and ephemeral storage is checked as part of this loop.

The eviction manager checks pods and containers for usage that exceeds their declared limits.

func (m *managerImpl) localStorageEviction(summary *statsapi.Summary, pods []*v1.Pod) []*v1.Pod {
    statsFunc := cachedStatsFunc(summary.Pods)
    evicted := []*v1.Pod{}
    for _, pod := range pods {
        podStats, ok := statsFunc(pod)
        if !ok {
            continue
        }

        if m.emptyDirLimitEviction(podStats, pod) {
            evicted = append(evicted, pod)
            continue
        }

        if m.podEphemeralStorageLimitEviction(podStats, pod) {
            evicted = append(evicted, pod)
            continue
        }

        if m.containerEphemeralStorageLimitEviction(podStats, pod) {
            evicted = append(evicted, pod)
        }
    }

    return evicted
}

Here pods is the list of all Pods on this node that are not in a terminated state, obtained via GetActivePods.

The kubelet checks the Pod's emptyDir volumes, pod-level ephemeral storage, and container-level ephemeral storage in turn; any Pod that needs to be evicted is appended to the evicted slice, and those Pods are then evicted.

The container-level check is relatively simple: since ephemeral storage limits are set per container, the kubelet compares each container's usage with its limit in turn, and if a limit is exceeded the Pod is added to the list of Pods to evict.

The relevant code is in containerEphemeralStorageLimitEviction.
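
A simplified sketch of that logic looks like the following; the function and variable names are mine, not the upstream ones, and the assumed imports are listed in the comment.

// Assumed imports (paths as of Kubernetes 1.14):
//   v1 "k8s.io/api/core/v1"
//   "k8s.io/apimachinery/pkg/api/resource"
//   statsapi "k8s.io/kubernetes/pkg/kubelet/apis/stats/v1alpha1"

// exceedsContainerLimits reports whether any container in the Pod has used
// more ephemeral storage (writable layer + logs) than its declared limit.
func exceedsContainerLimits(pod *v1.Pod, podStats statsapi.PodStats) bool {
	// Collect the per-container limits declared in the Pod spec.
	limits := map[string]resource.Quantity{}
	for _, c := range pod.Spec.Containers {
		if l, ok := c.Resources.Limits[v1.ResourceEphemeralStorage]; ok {
			limits[c.Name] = l
		}
	}
	// Compare against the usage reported by the kubelet's stats provider.
	for _, cs := range podStats.Containers {
		limit, ok := limits[cs.Name]
		if !ok {
			continue
		}
		used := resource.NewQuantity(0, resource.BinarySI)
		if cs.Rootfs != nil && cs.Rootfs.UsedBytes != nil {
			used.Add(*resource.NewQuantity(int64(*cs.Rootfs.UsedBytes), resource.BinarySI))
		}
		if cs.Logs != nil && cs.Logs.UsedBytes != nil {
			used.Add(*resource.NewQuantity(int64(*cs.Logs.UsedBytes), resource.BinarySI))
		}
		if used.Cmp(limit) > 0 {
			return true // the kubelet evicts the whole Pod, not just the container
		}
	}
	return false
}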

The pod-level check is a little more complicated.

The first step is calculating the limit.

The kubelet sums the ephemeral storage limits of all regular containers in the Pod (not the init containers). An init container specifies the Pod's minimum quota requirement (somewhat like a minimum wage, just enough for life support): when the total quota of all regular containers exceeds the quota of an init container, the init container's quota is ignored. Mathematically:

max(sum(containers), initContainer1, initContainer2, ...)
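
In Go, this limit calculation can be sketched as follows (an illustrative helper with my own name, using the same assumed imports as the sketch above):

// podEphemeralStorageLimit returns max(sum(containers), initContainer1, ...).
func podEphemeralStorageLimit(pod *v1.Pod) resource.Quantity {
	limit := resource.Quantity{}
	// Sum the limits of all regular containers.
	for _, c := range pod.Spec.Containers {
		if l, ok := c.Resources.Limits[v1.ResourceEphemeralStorage]; ok {
			limit.Add(l)
		}
	}
	// Init containers run one at a time, so only the largest one matters;
	// it is ignored once the regular containers' sum exceeds it.
	for _, c := range pod.Spec.InitContainers {
		if l, ok := c.Resources.Limits[v1.ResourceEphemeralStorage]; ok {
			if l.Cmp(limit) > 0 {
				limit = l
			}
		}
	}
	return limit
}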

The actual ephemeral storage usage is then calculated over all containers, whether or not they specify an ephemeral storage limit, plus the Pod's emptyDir volumes.

When the actual usage exceeds this limit, the kubelet evicts the Pod, and the controller then creates a new Pod, which is scheduled again.

The relevant code is in podEphemeralStorageLimitEviction.
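
Putting the two together, the pod-level check is roughly the following sketch, which reuses the podEphemeralStorageLimit helper above (the real code filters volume stats down to local volume sources such as emptyDir rather than summing every volume):

// exceedsPodLimit aggregates usage from every container's writable layer and
// logs plus the Pod's volumes, and compares it with the pod-level limit.
func exceedsPodLimit(pod *v1.Pod, podStats statsapi.PodStats) bool {
	limit := podEphemeralStorageLimit(pod)
	if limit.IsZero() {
		return false // no pod-level limit declared
	}
	used := resource.NewQuantity(0, resource.BinarySI)
	for _, cs := range podStats.Containers {
		// Counted for every container, whether or not it declares a limit itself.
		if cs.Rootfs != nil && cs.Rootfs.UsedBytes != nil {
			used.Add(*resource.NewQuantity(int64(*cs.Rootfs.UsedBytes), resource.BinarySI))
		}
		if cs.Logs != nil && cs.Logs.UsedBytes != nil {
			used.Add(*resource.NewQuantity(int64(*cs.Logs.UsedBytes), resource.BinarySI))
		}
	}
	for _, vs := range podStats.VolumeStats {
		// emptyDir and other local volumes also count toward pod usage.
		if vs.UsedBytes != nil {
			used.Add(*resource.NewQuantity(int64(*vs.UsedBytes), resource.BinarySI))
		}
	}
	return used.Cmp(limit) > 0
}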

requests

Note that local ephemeral storage requests are not used by the eviction manager, but that does not make them useless.

After a Pod is created, the scheduler places it on one of the nodes in the cluster. Since each node can only carry a limited amount of local ephemeral storage, the scheduler ensures that the sum of the local ephemeral storage requests of all Pods on a node does not exceed the capacity of the node's root partition.
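
Conceptually the fit check is something like the sketch below (a simplification with my own function names, not the actual scheduler plugin, which also accounts for init containers and other resources):

// fitsEphemeralStorage reports whether newPod's ephemeral-storage request,
// added to the requests of Pods already on the node, fits within the node's
// allocatable ephemeral storage.
func fitsEphemeralStorage(node *v1.Node, podsOnNode []*v1.Pod, newPod *v1.Pod) bool {
	allocatable := node.Status.Allocatable[v1.ResourceEphemeralStorage]
	requested := resource.Quantity{}
	for _, p := range podsOnNode {
		requested.Add(podEphemeralStorageRequest(p))
	}
	requested.Add(podEphemeralStorageRequest(newPod))
	return requested.Cmp(allocatable) <= 0
}

// podEphemeralStorageRequest sums the requests of the Pod's regular containers.
func podEphemeralStorageRequest(pod *v1.Pod) resource.Quantity {
	total := resource.Quantity{}
	for _, c := range pod.Spec.Containers {
		if r, ok := c.Resources.Requests[v1.ResourceEphemeralStorage]; ok {
			total.Add(r)
		}
	}
	return total
}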

inode protection

Sometimes a disk write fails with a "disk full" error even though df shows that capacity is not 100% used; this may simply be caused by inode exhaustion. So the platform also needs inode protection.

podLocalEphemeralStorageUsage also counts the number of inodes used by containers and pods.

However, Kubernetes does not currently support setting inode limits/requests for a Pod's ephemeral storage.

Of course, if a node runs short of inodes, the kubelet marks the node as under pressure and stops accepting new Pods.

emptyDir

emptyDir is also a kind of temporary storage, so it needs to be limited as well.

When ephemeral storage usage is checked at the pod level, emptyDir usage is also taken into account, so if an emptyDir is used excessively, the Pod will likewise be evicted by the kubelet.

In addition, an emptyDir volume can itself be capped. In the following manifest excerpt, I specify memory as the emptyDir's storage medium so that users get excellent read and write performance, but since memory is precious I only provide 64Mi of space. When a user writes more than 64Mi to the /cache directory, the Pod is evicted by the kubelet.

        volumeMounts:
        - mountPath: /cache
          name: cache-volume
      volumes:
      - emptyDir:
          medium: Memory
          sizeLimit: 64Mi
        name: cache-volume

The relevant code is in emptyDirLimitEviction.
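
Its logic can be sketched as follows (again with my own function name and the same assumed imports; the upstream function compares each emptyDir volume's reported usage with its sizeLimit):

// exceedsEmptyDirLimits reports whether any emptyDir volume with a sizeLimit
// has been filled beyond that limit.
func exceedsEmptyDirLimits(pod *v1.Pod, podStats statsapi.PodStats) bool {
	usedByVolume := map[string]*resource.Quantity{}
	for _, vs := range podStats.VolumeStats {
		if vs.UsedBytes != nil {
			usedByVolume[vs.Name] = resource.NewQuantity(int64(*vs.UsedBytes), resource.BinarySI)
		}
	}
	for _, vol := range pod.Spec.Volumes {
		if vol.EmptyDir == nil || vol.EmptyDir.SizeLimit == nil || vol.EmptyDir.SizeLimit.IsZero() {
			continue
		}
		if used, ok := usedByVolume[vol.Name]; ok && used.Cmp(*vol.EmptyDir.SizeLimit) > 0 {
			return true // the kubelet evicts the Pod
		}
	}
	return false
}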

Ref: