Recently we noticed that many instances in our clusters were in the Evicted state. From the pod YAML we could see that they had been evicted because of insufficient node resources. However, these instances are not cleaned up automatically, and most users of the platform assume something is wrong with their service or with the platform when they see Evicted instances under a service, which hurts the user experience. The underlying containers of Evicted pods have already been destroyed, so they have no impact on the user's service; in other words, only an empty pod shell remains in k8s, yet it has to be cleaned up manually. In this article we analyze why Evicted instances are created, why they are not cleaned up automatically, and how to clean them up.

$ kubectl get pod | grep -i Evicted
cloud-1023955-84421-49604-5-deploy-c-7748f8fd8-hjqsh        0/1     Evicted   0          73d
cloud-1023955-84421-49604-5-deploy-c-7748f8fd8-mzd8x        0/1     Evicted   0          81d
cloud-1237162-276467-199844-2-deploy-7bdc7c98b6-26r2r       0/1     Evicted   0          18d

Evicted Instance status.

status:
  message: 'Pod The node had condition: [DiskPressure]. '
  phase: Failed
  reason: Evicted
  startTime: "2021-09-14T10:42:32Z"

Reasons for instance eviction

By default, the kubelet is configured with a policy that evicts instances when the node is low on resources: k8s stops the instance on that node and starts a new one on another node. The eviction policy can also be disabled by setting the --eviction-hard= parameter to an empty value, which is what we did in our previous production environment.

Insufficient node resources lead to instance eviction

The Evicted state in k8s is mainly the result of instances being evicted because of insufficient node resources. The kubelet's eviction_manager module periodically checks the node's memory usage, inode usage, disk usage, PID usage and other resources; according to the kubelet configuration, when usage reaches a certain threshold it first reclaims resources that can be reclaimed, and if usage still exceeds the threshold after reclamation, instances are evicted.

Eviction signals and how they are calculated:

  • memory.available := node.status.capacity[memory] - node.stats.memory.workingSet
  • nodefs.available := node.stats.fs.available
  • nodefs.inodesFree := node.stats.fs.inodesFree
  • imagefs.available := node.stats.runtime.imagefs.available
  • imagefs.inodesFree := node.stats.runtime.imagefs.inodesFree
  • pid.available := node.stats.rlimit.maxpid - node.stats.rlimit.curproc

The kubelet obtains these stats partly through the cAdvisor interface and partly through the CRI runtime interface.

  • memory.available: the available memory on the node, calculated as the node's memory capacity minus the working set, where the working set is memory.usage_in_bytes from the cgroup memory subsystem minus total_inactive_file from memory.stat.
  • nodefs.available: available disk space on the nodefs partition, i.e. the partition containing the kubelet's --root-dir (by default the partition where /var/lib/kubelet/ is located).
  • nodefs.inodesFree: free inodes on the nodefs partition.
  • imagefs.available: available disk space on the partition where container images are stored.
  • imagefs.inodesFree: free inodes on the partition where container images are stored.
  • pid.available: the number of PIDs still available on the system; the maximum is taken from /proc/sys/kernel/pid_max.

The thresholds for these signals are configured with the kubelet's --eviction-hard parameter, which defaults to imagefs.available<15%,memory.available<100Mi,nodefs.available<10%,nodefs.inodesFree<5%; when a threshold is reached, containers on the node are evicted.
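
The node-level stats that these signals are computed from can be inspected through the kubelet's stats/summary endpoint, proxied by the API server. Below is a minimal client-go sketch; the node name node-1 and the kubeconfig path are placeholders, so adjust them for your environment.

package main

import (
	"context"
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumption: a kubeconfig at the default location ($HOME/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Query the kubelet summary API through the API server's node proxy.
	// The response contains the node's memory, fs, runtime imagefs and rlimit
	// stats that the eviction signals above are computed from.
	raw, err := clientset.CoreV1().RESTClient().Get().
		Resource("nodes").
		Name("node-1").
		SubResource("proxy").
		Suffix("stats/summary").
		DoRaw(context.TODO())
	if err != nil {
		panic(err)
	}
	fmt.Println(string(raw))
}

This is equivalent to running kubectl get --raw "/api/v1/nodes/node-1/proxy/stats/summary".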

1. kubelet does not perceive changes in node memory data in real time

The kubelet collects node memory usage data periodically through the cAdvisor interface. When a node's memory usage spikes within a short period of time, the kubelet may not perceive it and no MemoryPressure-related event is generated, yet the OOM killer is still invoked to stop containers. The memcg API can be enabled by configuring the --kernel-memcg-notification parameter on the kubelet, so that memcg proactively notifies the kubelet when a memory usage threshold is crossed.

Active memcg notification is a capability of the cgroup itself: the kubelet writes the memory.available threshold into /sys/fs/cgroup/memory/cgroup.event_control, and since the threshold depends on the current size of inactive_file, the kubelet also updates it periodically. When memcg usage reaches the configured threshold the kernel proactively notifies the kubelet, which receives the notification through the epoll mechanism.
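
To illustrate the mechanism, the following is a minimal Go sketch of a cgroup v1 memory threshold notification; it is not the kubelet's code, and the root memory cgroup path and the 1 GiB threshold are assumptions for demonstration. A process registers an eventfd together with a threshold in cgroup.event_control and then waits for the kernel to signal the eventfd.

package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	const threshold = 1 << 30 // hypothetical threshold: notify at 1 GiB of usage

	// File whose value is being watched: the root memory cgroup's usage.
	usage, err := os.Open("/sys/fs/cgroup/memory/memory.usage_in_bytes")
	if err != nil {
		panic(err)
	}
	defer usage.Close()

	// eventfd that the kernel will signal when the threshold is crossed.
	efd, err := unix.Eventfd(0, 0)
	if err != nil {
		panic(err)
	}
	defer unix.Close(efd)

	// Register "<eventfd> <watched fd> <threshold>" in cgroup.event_control.
	ctl, err := os.OpenFile("/sys/fs/cgroup/memory/cgroup.event_control", os.O_WRONLY, 0)
	if err != nil {
		panic(err)
	}
	if _, err := fmt.Fprintf(ctl, "%d %d %d", efd, usage.Fd(), threshold); err != nil {
		panic(err)
	}
	ctl.Close()

	// Block until the kernel signals the eventfd; the kubelet waits via epoll instead.
	buf := make([]byte, 8)
	if _, err := unix.Read(efd, buf); err != nil {
		panic(err)
	}
	fmt.Println("memory usage crossed the threshold")
}

The kubelet registers a similar watch and re-registers it periodically, because the threshold it wants corresponds to memory.available and therefore moves as inactive_file changes.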

2. kubelet's memory.available does not treat active page cache as available

When the kubelet evicts instances based on memory usage, the usage it calculates includes active_file in the page cache, so in some scenarios an instance is evicted because a large page cache pushes the computed memory usage over the threshold.

Since the kernel reclaims inactive_file first when memory is tight, but will also reclaim active_file when memory is truly insufficient, the community has questioned this calculation; given how complicated kernel memory reclamation is, the community has not yet settled the matter. For details see kubelet counts active page cache against memory.available (maybe it shouldn't?).

The kubelet calculates the memory available to a node in the following way.

#!/usr/bin/env bash

# This script reproduces what the kubelet does
# to calculate memory.available relative to root cgroup.

# total memory capacity and current usage of the root memory cgroup
memory_capacity_in_kb=$(cat /proc/meminfo | grep MemTotal | awk '{print $2}')
memory_capacity_in_bytes=$((memory_capacity_in_kb * 1024))
memory_usage_in_bytes=$(cat /sys/fs/cgroup/memory/memory.usage_in_bytes)
memory_total_inactive_file=$(cat /sys/fs/cgroup/memory/memory.stat | grep total_inactive_file | awk '{print $2}')

memory_working_set=${memory_usage_in_bytes}
if [ "$memory_working_set" -lt "$memory_total_inactive_file" ];
then
    memory_working_set=0
else
    memory_working_set=$((memory_usage_in_bytes - memory_total_inactive_file))
fi

memory_available_in_bytes=$((memory_capacity_in_bytes - memory_working_set))
memory_available_in_kb=$((memory_available_in_bytes / 1024))
memory_available_in_mb=$((memory_available_in_kb / 1024))

echo "memory.capacity_in_bytes $memory_capacity_in_bytes"
echo "memory.usage_in_bytes $memory_usage_in_bytes"
echo "memory.total_inactive_file $memory_total_inactive_file"
echo "memory.working_set $memory_working_set"
echo "memory.available_in_bytes $memory_available_in_bytes"
echo "memory.available_in_kb $memory_available_in_kb"
echo "memory.available_in_mb $memory_available_in_mb"

Analysis of why evicted instances are not deleted

Reading the source code, I found that the StatefulSet and DaemonSet controllers automatically delete Evicted instances, but the Deployment controller does not. After going through the official documentation and related issues, I have not found an official explanation of why Deployment Evicted instances are not deleted.

statefulset

pkg/controller/statefulset/stateful_set_control.go

// Examine each replica with respect to its ordinal
for i := range replicas {
    // delete and recreate failed pods
    if isFailed(replicas[i]) {
        ssc.recorder.Eventf(set, v1.EventTypeWarning, "RecreatingFailedPod",
            "StatefulSet %s/%s is recreating failed Pod %s",
            set.Namespace,
            set.Name,
            replicas[i].Name)
        if err := ssc.podControl.DeleteStatefulPod(set, replicas[i]); err != nil {
            return &status, err
        }
        if getPodRevision(replicas[i]) == currentRevision.Name {
            status.CurrentReplicas--
        }
        if getPodRevision(replicas[i]) == updateRevision.Name {
            status.UpdatedReplicas--
        }
        ......

daemonset

pkg/controller/daemon/daemon_controller.go

func (dsc *DaemonSetsController) podsShouldBeOnNode(
		......
) (nodesNeedingDaemonPods, podsToDelete []string) {

		......

    switch {
		......
    case shouldContinueRunning:
				......
        for _, pod := range daemonPods {
            if pod.DeletionTimestamp != nil {
                continue
            }
            if pod.Status.Phase == v1.PodFailed {
                // This is a critical place where DS is often fighting with kubelet that rejects pods.
                // We need to avoid hot looping and backoff.
                backoffKey := failedPodsBackoffKey(ds, node.Name)
                ......

Solution

  1. Our team has a set of services that collects k8s cluster events. We consume the pod-related events, filter out the events that belong to Evicted instances, and then clean those instances up (see the sketch after this list).

    Evicted instance judgment logic.

    
    const (
        podEvictedStatus = "Evicted"
    )
    
    // Delete the pod directly if it is in the Evicted state, its phase is Failed,
    // and it no longer has any container statuses
    if strings.EqualFold(pod.Status.Reason, podEvictedStatus) && pod.Status.Phase == v1.PodFailed &&
            len(pod.Status.ContainerStatuses) == 0 {
        // delete the pod here
    }
    
  2. Some people in the community achieve automatic cleanup by configuring the podgc controller's --terminated-pod-gc-threshold parameter on kube-controller-manager.

    
    Podgc controller flags:
    
        --terminated-pod-gc-threshold int32
                    Number of terminated pods that can exist before the terminated pod garbage collector starts deleting terminated pods. If
                    <= 0, the terminated pod garbage collector is disabled. (default 12500)
    

    This parameter configures how many terminated instances to keep, with a default of 12500, but the podgc controller force-kills pods when recycling them and does not support graceful termination, so we are not considering it for now.

  3. Other ways of handling this can be found in the community issue Kubelet does not delete evicted pods.
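
For option 1, the following is a minimal client-go sketch of the cleanup; the kubeconfig path is an assumption and our own implementation runs inside the event-consuming service rather than as a standalone program. It lists Failed pods, applies the judgment logic shown above, and deletes the matching Evicted pods.

package main

import (
	"context"
	"fmt"
	"strings"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

const podEvictedStatus = "Evicted"

func main() {
	// Assumption: a kubeconfig at the default location ($HOME/.kube/config);
	// inside a cluster, rest.InClusterConfig() would be used instead.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	ctx := context.TODO()
	// Evicted pods are in the Failed phase, so a field selector keeps the list small.
	pods, err := clientset.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
		FieldSelector: "status.phase=Failed",
	})
	if err != nil {
		panic(err)
	}

	for _, pod := range pods.Items {
		// Same judgment logic as above: Evicted, Failed, and no containers left.
		if strings.EqualFold(pod.Status.Reason, podEvictedStatus) &&
			pod.Status.Phase == v1.PodFailed &&
			len(pod.Status.ContainerStatuses) == 0 {
			if err := clientset.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{}); err != nil {
				fmt.Printf("failed to delete pod %s/%s: %v\n", pod.Namespace, pod.Name, err)
				continue
			}
			fmt.Printf("deleted evicted pod %s/%s\n", pod.Namespace, pod.Name)
		}
	}
}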

Summary

Because my previous company placed a high value on stability, eviction was not enabled on the online nodes, so there were no instances in the Evicted state; when node resources became seriously insufficient, an alarm triggered manual intervention, supplemented by auxiliary measures such as rescheduling and failure self-healing. Analyzing Evicted instances shows how closely k8s is tied to the operating system: to understand some of these mechanisms thoroughly, you also need some understanding of operating system principles.