This article is based on reading the source code of Kubernetes v1.22.1

Kubelet evicts Pods when the node it runs on is short of resources. I have recently studied Kubelet’s eviction mechanism and found a lot worth learning from it, so I’ll share it with you.

Kubelet Configuration

Kubelet’s eviction feature needs to be turned on in the configuration, and the threshold value for eviction needs to be configured.

type KubeletConfiguration struct {
    ...
	// Map of signal names to quantities that defines hard eviction thresholds. For example: {"memory.available": "300Mi"}.
	EvictionHard map[string]string
	// Map of signal names to quantities that defines soft eviction thresholds.  For example: {"memory.available": "300Mi"}.
	EvictionSoft map[string]string
	// Map of signal names to quantities that defines grace periods for each soft eviction signal. For example: {"memory.available": "30s"}.
	EvictionSoftGracePeriod map[string]string
	// Duration for which the kubelet has to wait before transitioning out of an eviction pressure condition.
	EvictionPressureTransitionPeriod metav1.Duration
	// Maximum allowed grace period (in seconds) to use when terminating pods in response to a soft eviction threshold being met.
	EvictionMaxPodGracePeriod int32
	// Map of signal names to quantities that defines minimum reclaims, which describe the minimum
	// amount of a given resource the kubelet will reclaim when performing a pod eviction while
	// that resource is under pressure. For example: {"imagefs.available": "2Gi"}
	EvictionMinimumReclaim map[string]string
	...
}

Among them, EvictionHard means hard eviction: once a threshold is reached, the Pod is evicted immediately. EvictionSoft means soft eviction: eviction happens only after the threshold has been exceeded for longer than the grace period, which is set with EvictionSoftGracePeriod. EvictionMinimumReclaim sets the minimum amount of a resource (for example imagefs) that Kubelet will reclaim once it starts evicting for that resource.
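
To make these fields concrete, here is a minimal sketch, with illustrative values that are not the kubelet defaults, of what such a configuration could look like if built in Go against the struct above (the import path is assumed from the kubelet source tree). In practice the same settings are normally supplied through the kubelet configuration file or its eviction-related flags.

package main

import (
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	// Assumed import path for the KubeletConfiguration struct shown above.
	kubeletconfig "k8s.io/kubernetes/pkg/kubelet/apis/config"
)

func main() {
	// Illustrative thresholds only; not the kubelet defaults.
	cfg := kubeletconfig.KubeletConfiguration{
		// Hard thresholds: evict as soon as the signal crosses the value.
		EvictionHard: map[string]string{
			"memory.available":  "100Mi",
			"nodefs.available":  "10%",
			"imagefs.available": "15%",
		},
		// Soft threshold plus the grace period that must elapse before eviction.
		EvictionSoft:            map[string]string{"memory.available": "300Mi"},
		EvictionSoftGracePeriod: map[string]string{"memory.available": "30s"},
		// How long the node must stay below the threshold before the pressure condition clears.
		EvictionPressureTransitionPeriod: metav1.Duration{Duration: 5 * time.Minute},
		// Upper bound on the grace period used when killing Pods for a soft eviction.
		EvictionMaxPodGracePeriod: 60,
		// Keep reclaiming until at least this much of the resource has been freed.
		EvictionMinimumReclaim: map[string]string{"imagefs.available": "2Gi"},
	}
	fmt.Printf("%+v\n", cfg)
}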

The eviction signals that can be set are:

  • memory.available: node.status.capacity[memory] - node.stats.memory.workingSet, the node’s available memory (a small worked sketch of this calculation follows the list).
  • nodefs.available: node.stats.fs.available, the size of the available capacity of the file system used by Kubelet.
  • nodefs.inodesFree: node.stats.fs.inodesFree, the number of inodes available on the file system used by Kubelet.
  • imagefs.available: node.stats.runtime.imagefs.available, the available capacity of the filesystem the container runtime uses to store images and container-writable layers.
  • imagefs.inodesFree: node.stats.runtime.imagefs.inodesFree, the number of inodes available on the filesystem the container runtime uses to store images and container-writable layers.
  • allocatableMemory.available: the memory still available for allocating to Pods.
  • pid.available: node.stats.rlimit.maxpid - node.stats.rlimit.curproc, the number of PIDs still available for allocating to Pods.
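
As a tiny worked example (hypothetical numbers, not taken from any real node), this is how memory.available is derived and compared against a hard threshold:

package main

import "fmt"

func main() {
	// Hypothetical node: 8 GiB memory capacity, 7,800 MiB working set.
	capacity := int64(8 * 1024 * 1024 * 1024)
	workingSet := int64(7800 * 1024 * 1024)

	// memory.available = node.status.capacity[memory] - node.stats.memory.workingSet
	available := capacity - workingSet

	// With EvictionHard {"memory.available": "300Mi"}, the threshold is 300 MiB.
	threshold := int64(300 * 1024 * 1024)

	fmt.Printf("memory.available = %d MiB\n", available/(1024*1024)) // 392 MiB
	if available < threshold {
		fmt.Println("threshold crossed: eviction would be triggered")
	} else {
		fmt.Println("threshold not crossed")
	}
}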

How Eviction Manager works

The main work of the Eviction Manager is done in the synchronize function. There are two places where synchronize is triggered: the monitor task, which runs every 10s, and the notifier task, which is started to listen for kernel events according to the eviction signals configured by the user.
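
A minimal sketch of how these two triggers could be wired together, simplified and not taken from the kubelet source (the timings and function names here are made up for illustration):

package main

import (
	"fmt"
	"time"
)

// synchronize stands in for the eviction manager's main reconciliation pass.
func synchronize(trigger string) {
	fmt.Println("synchronize triggered by:", trigger)
}

func main() {
	// Trigger 1: a channel a thresholdNotifier writes to when the kernel reports
	// that the memory threshold has been crossed (simulated here with a timer).
	events := make(chan struct{}, 1)
	time.AfterFunc(3*time.Second, func() { events <- struct{}{} })

	// Trigger 2: a periodic monitor that re-runs synchronize on an interval
	// (10s in the kubelet; shortened here so the example finishes quickly).
	ticker := time.NewTicker(2 * time.Second)
	defer ticker.Stop()

	stop := time.After(7 * time.Second)
	for {
		select {
		case <-events:
			synchronize("kernel memcg notification")
		case <-ticker.C:
			synchronize("periodic monitor")
		case <-stop:
			return
		}
	}
}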


notifier

The notifiers are started by the thresholdNotifiers in the eviction manager. Each eviction signal configured by the user corresponds to one thresholdNotifier, and a thresholdNotifier and its notifier communicate through a channel: when the notifier sends a message to the channel, the corresponding thresholdNotifier triggers the synchronize logic.

The notifier relies on the kernel’s cgroup memory thresholds: cgroups allow a user-space process to ask the kernel, via eventfd, to send a notification when memory.usage_in_bytes reaches a certain threshold. This is done by writing "<event_fd> <fd of memory.usage_in_bytes> <threshold>" to cgroup.event_control.

The initialization code for the notifier is as follows (some extraneous code has been removed for readability). It mainly opens the file descriptor watchfd for memory.usage_in_bytes and controlfd for cgroup.event_control, and completes the registration of the cgroup memory threshold.

func NewCgroupNotifier(path, attribute string, threshold int64) (CgroupNotifier, error) {
	var watchfd, eventfd, epfd, controlfd int
	var err error // error checks after each call are omitted here for readability

	watchfd, err = unix.Open(fmt.Sprintf("%s/%s", path, attribute), unix.O_RDONLY|unix.O_CLOEXEC, 0)
	defer unix.Close(watchfd)
	
	controlfd, err = unix.Open(fmt.Sprintf("%s/cgroup.event_control", path), unix.O_WRONLY|unix.O_CLOEXEC, 0)
	defer unix.Close(controlfd)
	
	eventfd, err = unix.Eventfd(0, unix.EFD_CLOEXEC)
	defer func() {
		// Close eventfd if we get an error later in initialization
		if err != nil {
			unix.Close(eventfd)
		}
	}()
	
	epfd, err = unix.EpollCreate1(unix.EPOLL_CLOEXEC)
	defer func() {
		// Close epfd if we get an error later in initialization
		if err != nil {
			unix.Close(epfd)
		}
	}()
	
	config := fmt.Sprintf("%d %d %d", eventfd, watchfd, threshold)
	_, err = unix.Write(controlfd, []byte(config))

	return &linuxCgroupNotifier{
		eventfd: eventfd,
		epfd:    epfd,
		stop:    make(chan struct{}),
	}, nil
}

When it starts, the notifier also listens on the above eventfd via epoll, and sends a signal to the channel when it receives an event from the kernel indicating that memory usage has exceeded the threshold.

func (n *linuxCgroupNotifier) Start(eventCh chan<- struct{}) {
	err := unix.EpollCtl(n.epfd, unix.EPOLL_CTL_ADD, n.eventfd, &unix.EpollEvent{
		Fd:     int32(n.eventfd),
		Events: unix.EPOLLIN,
	})
	if err != nil {
		klog.InfoS("Eviction manager: error adding epoll eventfd", "err", err)
		return
	}

	for {
		select {
		case <-n.stop:
			return
		default:
		}
		event, err := wait(n.epfd, n.eventfd, notifierRefreshInterval)
		if err != nil {
			klog.InfoS("Eviction manager: error while waiting for memcg events", "err", err)
			return
		} else if !event {
			// Timeout on wait.  This is expected if the threshold was not crossed
			continue
		}
		// Consume the event from the eventfd
		buf := make([]byte, eventSize)
		_, err = unix.Read(n.eventfd, buf)
		if err != nil {
			klog.InfoS("Eviction manager: error reading memcg events", "err", err)
			return
		}
		eventCh <- struct{}{}
	}
}

The synchronize logic checks whether the notifiers have been updated within the last 10s each time it runs, and restarts them if they have not. The cgroup memory threshold is calculated by subtracting the user-configured eviction threshold from the node’s total memory capacity.
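
As a rough illustration of that subtraction (hypothetical values, using the apimachinery resource.Quantity type rather than the eviction manager’s own code):

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// Hypothetical node with 8Gi of memory and a 300Mi memory.available threshold.
	capacity := resource.MustParse("8Gi")
	evictionThreshold := resource.MustParse("300Mi")

	// The value registered with cgroup.event_control is capacity minus the eviction
	// threshold: the kernel notifies us once usage climbs above this point, which is
	// exactly when memory.available drops below the configured threshold.
	memcgThreshold := capacity.DeepCopy()
	memcgThreshold.Sub(evictionThreshold)

	fmt.Printf("memory.usage_in_bytes threshold: %d bytes\n", memcgThreshold.Value())
}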

synchronize

Eviction Manager’s main logic, synchronize, is rather long, so we won’t post the source code here; instead we will walk through what it does (a condensed sketch follows the list).

  1. Construct a ranking function for each signal.
  2. Update the thresholds and restart the notifiers.
  3. Get the current node’s resource usage (cgroup information) and all active Pods.
  4. For each signal, check whether the node’s resource usage has reached the eviction threshold; if none has, exit the current loop.
  5. Rank all signals, with memory-related signals handled first.
  6. Send an eviction event to the apiserver.
  7. Rank all active Pods.
  8. Evict the Pods in the sorted order.
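
Below is a heavily condensed sketch of that flow. The types and signatures are made up for illustration and do not match the real eviction manager; steps 1, 2 and 6 are omitted.

package main

import (
	"fmt"
	"sort"
)

// Simplified stand-ins for the real eviction-manager types; the names are illustrative.
type signal string

type pod struct {
	name        string
	memoryUsage int64
}

// observations holds the observed value per signal (e.g. bytes of memory.available).
type observations map[signal]int64

// thresholds holds the user-configured eviction threshold per signal.
type thresholds map[signal]int64

func synchronize(obs observations, th thresholds, activePods []pod) []pod {
	// Steps 3/4: gather usage and determine which signals crossed their thresholds.
	var crossed []signal
	for sig, limit := range th {
		if obs[sig] < limit {
			crossed = append(crossed, sig)
		}
	}
	if len(crossed) == 0 {
		return nil // nothing to do this round
	}

	// Step 5: handle memory-related signals first.
	sort.Slice(crossed, func(i, j int) bool {
		return crossed[i] == "memory.available" && crossed[j] != "memory.available"
	})

	// Step 7: rank the active Pods; here simply by memory usage, highest first.
	sort.Slice(activePods, func(i, j int) bool {
		return activePods[i].memoryUsage > activePods[j].memoryUsage
	})

	// Step 8: evict in the sorted order (the kubelet evicts at most one Pod per pass).
	fmt.Printf("signal %s under pressure, evicting %s\n", crossed[0], activePods[0].name)
	return activePods[:1]
}

func main() {
	obs := observations{"memory.available": 200 << 20} // 200Mi left on the node
	th := thresholds{"memory.available": 300 << 20}    // evict below 300Mi
	pods := []pod{{"web", 500 << 20}, {"batch", 900 << 20}}
	synchronize(obs, th, pods)
}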

Calculate the eviction order

The order in which Pods are evicted depends on three main factors:

  • whether the Pod’s resource usage exceeds its requests;
  • the Pod’s priority value;
  • the Pod’s memory usage.

The three factors are evaluated in the order in which they are registered with orderedBy. The multi-level ordering implemented by orderedBy is itself a worthwhile piece of Kubernetes code; interested readers can check the source, and a simplified sketch of the pattern follows the snippet below.

// rankMemoryPressure orders the input pods for eviction in response to memory pressure.
// It ranks by whether or not the pod's usage exceeds its requests, then by priority, and
// finally by memory usage above requests.
func rankMemoryPressure(pods []*v1.Pod, stats statsFunc) {
	orderedBy(exceedMemoryRequests(stats), priority, memory(stats)).Sort(pods)
}  
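
The sketch below shows the general shape of such a multi-level comparator, applied to plain strings. It mirrors the comparator-chain idea, but it is a simplified illustration rather than the kubelet’s actual sorting code.

package main

import (
	"fmt"
	"sort"
)

// cmpFunc compares two values and returns -1, 0 or 1, mirroring the shape of
// the comparators passed to orderedBy (names here are illustrative).
type cmpFunc func(a, b string) int

// multiSorter applies a list of comparators in order: the first one that
// distinguishes a from b decides, later ones only break ties.
type multiSorter struct {
	items []string
	cmps  []cmpFunc
}

func orderedBy(cmps ...cmpFunc) *multiSorter { return &multiSorter{cmps: cmps} }

func (ms *multiSorter) Sort(items []string) {
	ms.items = items
	sort.Sort(ms)
}

func (ms *multiSorter) Len() int      { return len(ms.items) }
func (ms *multiSorter) Swap(i, j int) { ms.items[i], ms.items[j] = ms.items[j], ms.items[i] }

func (ms *multiSorter) Less(i, j int) bool {
	a, b := ms.items[i], ms.items[j]
	for _, cmp := range ms.cmps {
		switch cmp(a, b) {
		case -1:
			return true
		case 1:
			return false
		}
		// 0 means "tied on this criterion": fall through to the next comparator.
	}
	return false
}

func main() {
	byLength := func(a, b string) int {
		switch {
		case len(a) < len(b):
			return -1
		case len(a) > len(b):
			return 1
		}
		return 0
	}
	alphabetical := func(a, b string) int {
		switch {
		case a < b:
			return -1
		case a > b:
			return 1
		}
		return 0
	}
	words := []string{"pear", "fig", "kiwi", "apple", "plum"}
	orderedBy(byLength, alphabetical).Sort(words)
	fmt.Println(words) // [fig kiwi pear plum apple]
}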

Eviction Pod

Next comes the eviction of the Pod itself. Eviction Manager evicts a Pod by cleanly killing it; the specific implementation is not analyzed here. It is worth noting that there is a check before eviction: if IsCriticalPod returns true, the Pod is not evicted.

func (m *managerImpl) evictPod(pod *v1.Pod, gracePeriodOverride int64, evictMsg string, annotations map[string]string) bool {
	// If the pod is marked as critical and static, and support for critical pod annotations is enabled,
	// do not evict such pods. Static pods are not re-admitted after evictions.
	// https://github.com/kubernetes/kubernetes/issues/40573 has more details.
	if kubelettypes.IsCriticalPod(pod) {
		klog.ErrorS(nil, "Eviction manager: cannot evict a critical pod", "pod", klog.KObj(pod))
		return false
	}
	// record that we are evicting the pod
	m.recorder.AnnotatedEventf(pod, annotations, v1.EventTypeWarning, Reason, evictMsg)
	// this is a blocking call and should only return when the pod and its containers are killed.
	klog.V(3).InfoS("Evicting pod", "pod", klog.KObj(pod), "podUID", pod.UID, "message", evictMsg)
	err := m.killPodFunc(pod, true, &gracePeriodOverride, func(status *v1.PodStatus) {
		status.Phase = v1.PodFailed
		status.Reason = Reason
		status.Message = evictMsg
	})
	if err != nil {
		klog.ErrorS(err, "Eviction manager: pod failed to evict", "pod", klog.KObj(pod))
	} else {
		klog.InfoS("Eviction manager: pod is evicted successfully", "pod", klog.KObj(pod))
	}
	return true
}

Then let’s look at the code for IsCriticalPod:

func IsCriticalPod(pod *v1.Pod) bool {
	if IsStaticPod(pod) {
		return true
	}
	if IsMirrorPod(pod) {
		return true
	}
	if pod.Spec.Priority != nil && IsCriticalPodBasedOnPriority(*pod.Spec.Priority) {
		return true
	}
	return false
}

// IsMirrorPod returns true if the passed Pod is a Mirror Pod.
func IsMirrorPod(pod *v1.Pod) bool {
	_, ok := pod.Annotations[ConfigMirrorAnnotationKey]
	return ok
}

// IsStaticPod returns true if the pod is a static pod.
func IsStaticPod(pod *v1.Pod) bool {
	source, err := GetPodSource(pod)
	return err == nil && source != ApiserverSource
}

func IsCriticalPodBasedOnPriority(priority int32) bool {
	return priority >= scheduling.SystemCriticalPriority
}

From the code, Static Pods, Mirror Pods, and Critical Pods are not evicted. Static and Mirror are determined from the Pod’s annotations; Critical is determined from the Pod’s Priority value: if the priority corresponds to system-cluster-critical or system-node-critical, it is a Critical Pod.

However, it is worth noting that the official documentation on Critical Pods says that if a non-static Pod is marked as Critical, it is not completely guaranteed not to be evicted. So it may be that the community has not fully decided whether to evict in this case and does not rule out changing this logic later, or it may simply be that the documentation is not up to date.

Summary

This article analyzed Kubelet’s Eviction Manager, including how it listens for Linux cgroup events and how it determines the Pod eviction order. Once we understand this, we can set priorities according to the importance of our applications, and even mark the most important ones as Critical Pods.
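
For instance, here is a minimal sketch of a Pod spec that requests node-critical priority (the names and image are placeholders; note that the built-in system priority classes are normally restricted to the kube-system namespace):

package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// When this Pod is created, the Priority admission plugin resolves
	// PriorityClassName into Spec.Priority, which is the value that
	// IsCriticalPodBasedOnPriority checks.
	p := v1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "important-agent", Namespace: "kube-system"},
		Spec: v1.PodSpec{
			PriorityClassName: "system-node-critical",
			Containers: []v1.Container{{
				Name:  "agent",
				Image: "example.com/agent:latest", // placeholder image
			}},
		},
	}
	fmt.Printf("%s requests priority class %s\n", p.Name, p.Spec.PriorityClassName)
}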