Summary of eviction strategies on a single k8s node

Process eviction: resource pressure on a machine may come from a malicious program that is consuming system resources, or from overcommit. By controlling which processes survive, the system limits the impact any single program can have on the machine. The most critical part of eviction is selecting the right process, so that system stability is preserved at minimal cost. At the execution level there are two types of eviction.

User-space eviction: process cleanup triggered proactively by mechanisms such as daemons.
Kernel-space eviction: when memory cannot be allocated, the kernel selects processes to terminate through the OOM killer in order to release resources.

In this article, we summarize the eviction process and the process selection strategy at each of these levels, starting from k8s.

Kubelet eviction policy

k8s supports API-initiated eviction as well as pod eviction performed in user space by the kubelet (termination of resource-intensive processes). For incompressible resources (memory, disk (nodefs) and PIDs), the kubelet monitors the corresponding metrics to trigger pod eviction. The kubelet evicts pods to reclaim resources based on their resource consumption and priority.

Pods whose resource usage exceeds their resource request are evicted first.
Pods are then ordered by pod priority.
The higher a pod's actual resource usage relative to its request, the earlier it is evicted.

We can conclude the following.

When the resource usage of BestEffort and Burstable pods exceeds the requested value, the eviction order is determined by pod priority and then by how far usage exceeds the request. No pod is exempt from eviction as a special case. Guaranteed pods, and Burstable pods whose usage is lower than the requested value, are evicted last, in an order determined by pod priority.

All this logic is implemented in the eviction manager of the kubelet.

Eviction manager

The Manager interface defines the function that starts the main loop and the functions the kubelet uses to report node status.

Start(): starts the eviction control loop, which collects monitoring data, determines whether any resource has reached its threshold, triggers pod eviction, and updates the local node status when the node is under pressure.
IsUnderMemoryPressure(): reports whether the node is under memory pressure, based on the node status updated within the control loop.
IsUnderDiskPressure(): reports whether the node is under disk pressure, based on the node status updated within the control loop.
IsUnderPIDPressure(): reports whether the node is under PID pressure, based on the node status updated within the control loop.

The kubelet calls the methods above in tryUpdateNodeStatus, its node status reporting loop, to determine the resource pressure of the node.

After initializing the evictionManager, the kubelet calls evictionManager.Start() to start eviction, and then calls the pressure methods above when synchronizing the node status. Besides the Manager interface, evictionManager also implements the PodAdmitHandler interface, which evaluates whether a pod is allowed to run during the pod lifecycle. Here evictionManager decides, mainly from the properties of the pod, whether its containers may be created on a machine that is already under resource pressure.

Eviction control loop

Initialization phase

The kubelet main process parses the configuration and initializes the evictionManager. The per-node resource thresholds are parsed by ParseThresholdConfig().

The kubelet sets resource thresholds per signal. Each signal identifies a resource metric and carries the resource threshold and other eviction parameters. For example, memory.available is the eviction signal for the node's available memory (memory.available = capacity - workingSet).

The kubelet builds the threshold for each resource signal from the following parameters.

--eviction-hard mapStringString: hard eviction thresholds, default is: imagefs.available<15%,memory.available<100Mi,nodefs.available<10%
--eviction-soft mapStringString: soft eviction thresholds; when one is triggered, the evicted pod is given a graceful exit time.
--eviction-soft-grace-period mapStringString: how long a soft threshold must hold before it triggers pod eviction.
--eviction-minimum-reclaim mapStringString: the minimum amount of resources to be released. Default is 0.

The eviction-soft and eviction-soft-grace-period settings for the same resource must both be present.
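To make the threshold format above concrete, here is a minimal sketch of splitting an expression such as the --eviction-hard default into per-signal thresholds. The Threshold type and the parseThresholds function are hypothetical simplifications for illustration; they are not the kubelet's ParseThresholdConfig() and they do not interpret quantities such as 100Mi.

```go
package main

import (
	"fmt"
	"strings"
)

// Threshold is a hypothetical, simplified stand-in for the kubelet's
// internal threshold type: one signal, a "<" comparison and a raw value.
type Threshold struct {
	Signal     string // e.g. "memory.available"
	Value      string // raw value, e.g. "100Mi" or "10%"
	Percentage bool   // true when the value is relative to capacity
}

// parseThresholds splits an expression such as
// "memory.available<100Mi,nodefs.available<10%" into thresholds.
func parseThresholds(expr string) ([]Threshold, error) {
	var out []Threshold
	for _, part := range strings.Split(expr, ",") {
		signal, value, ok := strings.Cut(strings.TrimSpace(part), "<")
		if !ok || signal == "" || value == "" {
			return nil, fmt.Errorf("invalid threshold %q", part)
		}
		out = append(out, Threshold{
			Signal:     signal,
			Value:      value,
			Percentage: strings.HasSuffix(value, "%"),
		})
	}
	return out, nil
}

func main() {
	ts, err := parseThresholds("imagefs.available<15%,memory.available<100Mi,nodefs.available<10%")
	if err != nil {
		panic(err)
	}
	for _, t := range ts {
		fmt.Printf("%s < %s (percentage=%v)\n", t.Signal, t.Value, t.Percentage)
	}
}
```

Keeping the raw value as a string is a deliberate shortcut here: turning 100Mi into bytes needs a quantity parser, which is outside the scope of this sketch.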

After setting the threshold for each resource signal by parsing the configuration items, the kubelet calls evictionManager.Start() to drive the evictionManager to work.

Start of evictionManager

Before starting the control loop, evictionManager sets up monitoring of the cgroup memory subsystem. This step listens for memory cgroup usage through the cgroup notifier mechanism, and the control loop periodically updates the notifier's threshold configuration.

MemoryThresholdNotifier

The evictionManager configures a MemoryThresholdNotifier for each of the memory.available and allocatableMemory.available signals, each monitoring a different cgroup path. allocatableMemory.available monitors cgroupRoot, which is the root cgroup of the pods on the node. memory.available monitors the root of the memory cgroup hierarchy (typically /sys/fs/cgroup/memory).

The workflow of MemoryThresholdNotifier is as follows.

Initialize MemoryThresholdNotifier
MemoryThresholdNotifier needs the path of the cgroup directory in the cgroup memory subsystem, and sets evictionManager.synchronize() as the threshold handler function thresholdHandler
Create a goroutine to start MemoryThresholdNotifier
In the MemoryThresholdNotifier.Start() loop: listen on the event channel and call the eviction function (synchronize)
UpdateThreshold() is called in the synchronize phase to update the memcg threshold and activate MemoryThresholdNotifier.
Calculate the cgroup memory usage threshold based on the currently collected metrics and the configuration.
If MemoryThresholdNotifier already has a notifier instance, create a new cgroupNotifier to replace it. The cgroupNotifier listens for memory-over-threshold events by epolling an eventfd descriptor.

There are two key points here.

Calculate the cgroup memory usage threshold in the UpdateThreshold function

As mentioned above, the notifier watches the memory.usage_in_bytes file (memory usage, not including swap) and fires when usage reaches the threshold. The memory usage threshold memcgThreshold itself is computed from the monitoring data.

// Set threshold on usage to capacity - eviction_hard + inactive_file,
// since we want to be notified when working_set = capacity - eviction_hard
inactiveFile := resource.NewQuantity(int64(*memoryStats.UsageBytes-*memoryStats.WorkingSetBytes), resource.BinarySI)
capacity := resource.NewQuantity(int64(*memoryStats.AvailableBytes+*memoryStats.WorkingSetBytes), resource.BinarySI)
evictionThresholdQuantity := evictionapi.GetThresholdQuantity(m.threshold.Value, capacity)
memcgThreshold := capacity.DeepCopy()
memcgThreshold.Sub(*evictionThresholdQuantity)
memcgThreshold.Add(*inactiveFile)

The memory usage threshold memcgThreshold is therefore capacity - eviction_hard + inactive_file (when the hard threshold is a percentage rather than an absolute value, eviction_hard is capacity * percentage).

Where

Memory capacity: capacity = memoryStats.AvailableBytes + memoryStats.WorkingSetBytes, i.e. memory available + memory used by the workload (both values are obtained from the monitoring module)
Hard threshold: eviction_hard is the configured parameter value
inactive_file = memoryStats.UsageBytes - memoryStats.WorkingSetBytes, i.e. memory used - memory used by the workload (the working set contains recently used memory, dirty memory to be reclaimed and kernel occupied memory; both values are also obtained from the monitoring module).
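The arithmetic above can be checked with a small worked example. This is a sketch that uses plain byte counts instead of resource.Quantity; the function names and the sample numbers are made up for illustration, and the percentage handling is an assumption based on the description above.

```go
package main

import "fmt"

// hardThreshold returns eviction_hard in bytes. When percent > 0 the
// threshold is taken relative to capacity, otherwise the absolute value
// is used (hypothetical helper, not kubelet code).
func hardThreshold(capacity, absolute, percent int64) int64 {
	if percent > 0 {
		return capacity * percent / 100
	}
	return absolute
}

// memcgThreshold mirrors the formula capacity - eviction_hard + inactive_file.
func memcgThreshold(usage, workingSet, available, absolute, percent int64) int64 {
	inactiveFile := usage - workingSet
	capacity := available + workingSet
	return capacity - hardThreshold(capacity, absolute, percent) + inactiveFile
}

func main() {
	const Mi = int64(1) << 20
	// Sample node: usage 4200Mi, working set 4000Mi, available 4000Mi,
	// eviction_hard 500Mi. capacity = 8000Mi, inactive_file = 200Mi.
	got := memcgThreshold(4200*Mi, 4000*Mi, 4000*Mi, 500*Mi, 0)
	fmt.Printf("%dMi\n", got/Mi) // 8000 - 500 + 200 = 7700Mi
}
```

Because inactive_file is added back, the notifier fires when the working set, not raw usage, reaches capacity - eviction_hard.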

Create cgroupNotifier in UpdateThreshold function

The cgroup notifier mechanism uses eventfd to listen for events raised when memory usage in a cgroup exceeds the threshold.
- memory.usage_in_bytes: the file object that is watched for memory usage.
- cgroup.event_control: the threshold monitoring control interface. event_fd, the watched fd and the threshold are configured in the format <event_fd> <fd of memory.usage_in_bytes> <threshold>.

    /sys/fs/cgroup/memory # cat memory.usage_in_bytes
    92459601920
    # ls -lt cgroup.event_control
    --w--w--w- 1 root root 0 Nov 24 12:05 cgroup.event_control
    # an interface for event_fd()

The cgroupNotifier pushes events into the channel based on cgroup events, triggering the event consumer (evictionManager) to process them. The channel does not carry the specific event content; it only serves as a trigger.

To register a cgroup threshold, there are 3 steps.
- Create an eventfd using eventfd(2)
- Open the memory.usage_in_bytes or memory.memsw.usage_in_bytes file to get a file descriptor
- Write the message "<event_fd> <fd of memory.usage_in_bytes> <threshold>" to cgroup.event_control

At the end of evictionManager.Start(), the control loop synchronize is started to periodically check whether the threshold condition for eviction is met and proceed to the next action.
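The point that the channel only acts as a trigger can be illustrated with a minimal sketch. runNotifier and the handler below are hypothetical names; in the kubelet the events originate from epoll on an eventfd, while here they are simply emitted by hand so the example runs anywhere.

```go
package main

import (
	"fmt"
	"sync"
)

// runNotifier is a hypothetical sketch of the consumer side of a threshold
// notifier: each value received on events only means "threshold crossed"
// and triggers handler. The event itself carries no payload.
func runNotifier(events <-chan struct{}, handler func()) {
	for range events {
		handler()
	}
}

func main() {
	events := make(chan struct{})
	var wg sync.WaitGroup
	calls := 0

	wg.Add(1)
	go func() {
		defer wg.Done()
		// The handler stands in for evictionManager.synchronize().
		runNotifier(events, func() { calls++ })
	}()

	// Stand-in for the cgroup notifier: emit three threshold events.
	for i := 0; i < 3; i++ {
		events <- struct{}{}
	}
	close(events)
	wg.Wait()
	fmt.Println("handler calls:", calls) // handler calls: 3
}
```

Using chan struct{} makes the "no payload" contract explicit: the consumer has to re-read the current metrics itself, which is what the periodic path does as well.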

Control loop synchronize

In the control loop of evictionManager, the synchronize function is called every 10s to select pods for eviction. The first thing eviction decides is whether its trigger condition holds, i.e. whether the monitored usage of system resources has hit a threshold. evictionManager has two trigger methods.

Eviction triggered by cgroup (event-based): the memory cgroupNotifier mechanism described above

Eviction triggered by monitoring data (periodic check)

Get the resource usage of nodes and pods through summaryProvider

Get the usage of each resource from the monitoring data in the signalObservations function

A single signalObservation records the total amount of a resource and the amount available.

// signalObservation is the observed resource usage
type signalObservation struct {
    // The resource capacity
    capacity *resource.Quantity
    // The available resource
    available *resource.Quantity
    // Time at which the observation was taken
    time metav1.Time
}

Determine whether eviction is needed to release resources in the thresholdsMet function

When the resource availability observed above falls below the threshold of a signal, the corresponding type of resource to be released is returned.
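The comparison described above can be sketched as follows. The observation and threshold types are simplified, hypothetical versions of the kubelet's types that use plain byte counts, and skipping signals without an observation is an assumption of this sketch.

```go
package main

import "fmt"

// observation is a simplified signalObservation using byte counts.
type observation struct {
	capacity  int64
	available int64
}

// threshold is a hypothetical, simplified eviction threshold.
type threshold struct {
	signal string
	min    int64 // the threshold is met when available drops below min
}

// thresholdsMet returns the thresholds whose signal currently has less
// available than the configured minimum.
func thresholdsMet(ths []threshold, obs map[string]observation) []threshold {
	var met []threshold
	for _, t := range ths {
		o, found := obs[t.signal]
		if !found {
			continue // no data for this signal, so nothing to decide
		}
		if o.available < t.min {
			met = append(met, t)
		}
	}
	return met
}

func main() {
	const Mi = int64(1) << 20
	ths := []threshold{
		{signal: "memory.available", min: 100 * Mi},
		{signal: "nodefs.available", min: 1024 * Mi},
	}
	obs := map[string]observation{
		"memory.available": {capacity: 8192 * Mi, available: 64 * Mi},
		"nodefs.available": {capacity: 102400 * Mi, available: 50000 * Mi},
	}
	for _, t := range thresholdsMet(ths, obs) {
		fmt.Println("threshold met:", t.signal) // threshold met: memory.available
	}
}
```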

Either way, the logic that follows in synchronize is executed to determine whether pods need to be evicted.

Update the state of the node: the resource pressure state is updated and reported to the cluster API, so that other components in the cluster can observe the state of the node and act on it from outside the node.
If the LocalStorageCapacityIsolation feature gate (local storage) is turned on, the manager first tries to evict pods that violate local disk limits. It checks whether the following resource usage of pods exceeds the limit value.
- sizeLimit of emptyDir
- ephemeralStorage limit of the pod
- ephemeralStorage limit of a container
This eviction is immediate, with no graceful exit time. When a local disk condition triggers eviction, the eviction behavior for other resources is skipped.

When the eviction process reaches this point, it checks whether any resource is actually under eviction pressure. If thresholdsMet returns an empty array, no resource has hit its eviction threshold. Otherwise, it goes on to reclaim node resources.
1. reclaimNodeLevelResource: Reclaiming node-level resources
  
  First try to reclaim node-level resources (nodefs/imagefs). This can be done by deleting unused containers and images, without affecting running pods. After calling the node resource reclaim function, collect the metrics once more. If the free resources are now greater than the threshold, the rest of this eviction cycle (pod-level eviction) is skipped.
2. Rank phase: Determining the priority of the resources that triggered the eviction condition
  
  Each synchronize call selects only one over-threshold resource to reclaim. When several resources hit their thresholds, the eviction priority of the resources is as follows.
  - Memory resources have the highest eviction priority
  - Signals with no associated resource have the lowest priority
3. Try to reclaim the resources of user pods
  
  Based on the resource signal obtained in the previous step, the eviction priority of the active pods on the node is determined, and the pods are ordered according to that priority.
  
  For example, the rules for judging the eviction priority of pods based on memory resources are as follows.
  - Based on whether the pod exceeds the resource request value: pods without resource usage metrics, and pods whose usage exceeds the requested value, are evicted first.
  - Based on the spec.priority of the pod: the pods are ordered according to their configured priority (the default is 0). The higher the priority, the later the pod is evicted.
  - Based on memory resource consumption: pods are sorted by the portion of consumed memory that exceeds the requested value. The larger the excess, the earlier the pod is evicted.
  The kubelet implements the multiSorter function, which sorts the active pods by the rules above in order. If two pods are equal under the current rule, the next rule decides their order. In other words: first find the pods whose resource usage exceeds the requested value (including those without metrics), then sort them by their spec.priority, and within the same priority rank the pods with the larger excess first.
  
  Besides rankMemoryPressure, there is also the logic of rankPIDPressure and rankDiskPressure.
4. Eviction
  
  After sorting based on the resource to be reclaimed, only one pod is deleted per eviction cycle. If it is not a hard eviction, MaxPodGracePeriodSeconds is also given to allow the container processes inside the pod to exit. The eviction itself consists of sending an event, deleting the pod and updating the eviction status of the pod.
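The memory ranking rules from step 3 can be sketched as a chain of comparators, where a later rule only decides when all earlier rules tie. This is an illustrative sketch, not the kubelet's multiSorter: the pod struct, the comparator names and the sample data are hypothetical, and pods without metrics are only treated specially by the first rule.

```go
package main

import (
	"fmt"
	"sort"
)

// pod is a hypothetical, flattened view of what the ranking needs.
type pod struct {
	name     string
	priority int32
	request  int64 // memory request, bytes
	usage    int64 // memory usage, bytes
	hasStats bool  // false when no usage metrics were found
}

// cmpFunc returns <0 if a is evicted before b, >0 if after, 0 on a tie.
type cmpFunc func(a, b pod) int

func exceedsRequest(p pod) bool { return !p.hasStats || p.usage > p.request }

// byExceedsRequest puts pods over their request (or without stats) first.
func byExceedsRequest(a, b pod) int {
	ea, eb := exceedsRequest(a), exceedsRequest(b)
	switch {
	case ea == eb:
		return 0
	case ea:
		return -1
	default:
		return 1
	}
}

// byPriority puts lower-priority pods first.
func byPriority(a, b pod) int {
	switch {
	case a.priority < b.priority:
		return -1
	case a.priority > b.priority:
		return 1
	default:
		return 0
	}
}

// byUsageOverRequest puts the pod with the larger excess first.
func byUsageOverRequest(a, b pod) int {
	oa, ob := a.usage-a.request, b.usage-b.request
	switch {
	case oa > ob:
		return -1
	case oa < ob:
		return 1
	default:
		return 0
	}
}

// rank sorts pods in eviction order, consulting the comparators in turn.
func rank(pods []pod, cmps ...cmpFunc) {
	sort.SliceStable(pods, func(i, j int) bool {
		for _, c := range cmps {
			if r := c(pods[i], pods[j]); r != 0 {
				return r < 0
			}
		}
		return false
	})
}

func main() {
	pods := []pod{
		{"a", 0, 100, 150, true},
		{"b", 0, 100, 300, true},
		{"c", 10, 100, 500, true},
		{"d", 0, 400, 100, true},
	}
	rank(pods, byExceedsRequest, byPriority, byUsageOverRequest)
	for _, p := range pods {
		fmt.Print(p.name, " ")
	}
	fmt.Println() // b a c d
}
```

Pod c uses the most memory but is ranked after a and b because its priority is higher, and d comes last because it stays below its request.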

System Eviction Policy

The above describes how the kubelet, in user space, limits node and pod resource usage through eviction. In kernel memory management, memory usage is limited at the single-machine level by the OOM killer.

OOM killer

OOM killer (Out Of Memory killer) is a memory management mechanism in the Linux kernel: when the system is running low on available memory, the kernel chooses processes to end in order to free memory resources, so that the system can continue to run.

Running mechanism

The OOM killer comes into play when running processes require more memory than is physically available. When the kernel allocates memory by calling alloc_pages() and more memory is needed than is physically available, it calls out_of_memory() to select processes whose resources will be released. The OOM killer checks all running processes and ends one or more of them to free system memory.

out_of_memory() function: it first runs some checks in the hope of avoiding ending processes to release memory. If memory can only be freed by ending a process, the function goes on to select the target process to reclaim. If resources cannot be freed even at this stage, the kernel eventually panics. The source code of the function is located at https://elixir.bootlin.com/linux/v5.17.2/source/mm/oom_kill.c#L1052 and the flow is as follows:

First notify the subscribers of the oom_notify_list chain: through the notification chain mechanism, the modules registered with oom_notify_list are notified to release memory. If a subscriber manages to release memory, the OOM killer exits and performs no further operations.
If the current task has a pending SIGKILL or is already exiting, the resources of the current process are released. Processes and threads that share the same memory descriptor mm_struct with the task are also killed.
For IO-less reclaim, based on gfp_mask: if 1) the allocation is of a non-FS operation type and 2) it is not a cgroup memory OOM, exit the OOM killer directly.
Check the memory allocation constraints (e.g. NUMA); the types are CONSTRAINT_NONE, CONSTRAINT_CPUSET, CONSTRAINT_MEMORY_POLICY and CONSTRAINT_MEMCG.
Check the setting of /proc/sys/vm/panic_on_oom and act on it. If panic_on_oom is set to 2, the kernel panics directly.
If /proc/sys/vm/oom_kill_allocating_task is true, call oom_kill_process to kill the process that is trying to allocate memory (when this process can be killed).
Call select_bad_process() to select the most suitable process, then call oom_kill_process.
If there is no suitable process, panic, unless the OOM was triggered by sysrq or memcg.

There are several details in the above process.

gfp_mask constraint

    /*
        * The OOM killer does not compensate for IO-less reclaim.
        * pagefault_out_of_memory lost its gfp context so we have to
        * make sure exclude 0 mask - all other users should have at least
        * ___GFP_DIRECT_RECLAIM to get here. But mem_cgroup_oom() has to
        * invoke the OOM killer even if it is a GFP_NOFS allocation.
        */
    if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc))
        return true;

gfp_mask is the set of flag bits passed when requesting memory (get free page). The first four bits are the memory zone modifiers (___GFP_DMA, ___GFP_HIGHMEM, ___GFP_DMA32, ___GFP_MOVABLE), and the bits from the fifth onwards are the memory allocation flags. Definition: https://elixir.bootlin.com/linux/v5.17.2/source/include/linux/gfp.h#L81. The default is empty, in which case the scan starts from ZONE_NORMAL, which is the default memory request type.

The OOM killer does not compensate for IO-less reclaim, so when the gfp_mask of the allocation indicates a non-FS operation type, the OOM killer exits directly.

oom_constraint constraints

Check whether the memory allocation is constrained. There are several constraint strategies, and they apply only to NUMA and memcg scenarios. oom_constraint can be CONSTRAINT_NONE, CONSTRAINT_CPUSET, CONSTRAINT_MEMORY_POLICY or CONSTRAINT_MEMCG. On a UMA architecture, oom_constraint is always CONSTRAINT_NONE, which means the OOM was not caused by a constraint, while on a NUMA architecture additional constraints may be what caused the OOM.

Then check_panic_on_oom(oc) is called to check whether /proc/sys/vm/panic_on_oom is configured, and if so, panic is triggered directly.

At this step, the OOM killer needs to select the process to terminate. There are two selection strategies.

Terminate whoever triggered the OOM: controlled by sysctl_oom_kill_allocating_task, which decides whether to kill the process currently requesting memory
Terminate whoever is the “baddest”: determine the “baddest” process by scoring

sysctl_oom_kill_allocating_task comes from /proc/sys/vm/oom_kill_allocating_task. When it is true, oom_kill_process is called to directly kill the process that is currently trying to allocate memory.

select_bad_process: selects the “worst” process

In normal scenarios, the oom_evaluate_task function is used to score each process and select the one to be terminated. In the case of a memory cgroup, mem_cgroup_scan_tasks is called. First, look at the logic of oom_evaluate_task.

Processes whose mm->flags contain MMF_OOM_SKIP are skipped and the next process is evaluated
A task flagged by oom_task_origin gets the highest score. The flag indicates that the task has allocated a lot of memory and is marked as a potential cause of the OOM, so it is killed first
Other processes have their scores calculated by the oom_badness function

The process with the highest final score is terminated first.

The termination score calculated by the oom_badness function consists of two parts, derived from the following two parameters.

Parameters.

oom_score_adj: the OOM kill score adjustment, set by the user. The range is from OOM_SCORE_ADJ_MIN (-1000) to OOM_SCORE_ADJ_MAX (1000). The higher the value, the more likely the process is to be terminated. The user can use this value to protect a process.
totalpages: the current upper limit of allocatable memory, which provides the baseline for the system score.

Calculation formula.

/*
 * The baseline for the badness score is the proportion of RAM that each
 * task's rss, pagetable and swap space use.
 */
points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
    mm_pgtables_bytes(p->mm) / PAGE_SIZE;
adj *= totalpages / 1000;
points += adj;

The base score process_pages consists of 3 parts.

get_mm_rss(p->mm): the rss part
get_mm_counter(p->mm, MM_SWAPENTS): memory occupied in swap
mm_pgtables_bytes(p->mm) / PAGE_SIZE: memory occupied by page tables

The 3 parts are added up and combined with oom_score_adj: the normalized adj is added to points, and the sum is the score of the current process.

So the process score is points = process_pages + oom_score_adj*totalpages/1000
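A small worked example shows how oom_score_adj can outweigh raw memory usage. This is a user-space sketch of the formula above, not kernel code: the task struct, the sample numbers and the selectVictim helper are hypothetical, and treating an oom_score_adj of -1000 as "never selected" is an assumption of this sketch.

```go
package main

import "fmt"

const oomScoreAdjMin = -1000

// task holds the inputs of the score, all sizes in pages (hypothetical type).
type task struct {
	name         string
	rssPages     int64
	swapEntries  int64
	pgtablePages int64
	oomScoreAdj  int64
}

// badness returns process_pages + oom_score_adj*(totalpages/1000) and
// whether the task is eligible at all.
func badness(t task, totalPages int64) (int64, bool) {
	if t.oomScoreAdj == oomScoreAdjMin {
		return 0, false // assumed to be exempt from selection
	}
	points := t.rssPages + t.swapEntries + t.pgtablePages
	adj := t.oomScoreAdj * (totalPages / 1000)
	return points + adj, true
}

// selectVictim returns the eligible task with the highest score.
func selectVictim(tasks []task, totalPages int64) (task, bool) {
	var victim task
	var best int64
	found := false
	for _, t := range tasks {
		score, ok := badness(t, totalPages)
		if !ok {
			continue
		}
		if !found || score > best {
			victim, best, found = t, score, true
		}
	}
	return victim, found
}

func main() {
	const totalPages = 1_000_000
	tasks := []task{
		{name: "web", rssPages: 200_000, pgtablePages: 1_000, oomScoreAdj: 0},     // 201000
		{name: "batch", rssPages: 100_000, pgtablePages: 500, oomScoreAdj: 500},   // 600500
		{name: "db", rssPages: 500_000, pgtablePages: 2_000, oomScoreAdj: -1000},  // exempt
	}
	if v, ok := selectVictim(tasks, totalPages); ok {
		fmt.Println("victim:", v.name) // victim: batch
	}
}
```

The batch task uses half the memory of web, but its positive adjustment adds 500 * 1000 pages to its score, so it is selected first.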

Older versions of the kernel also had some more complex calculation logic, such as the treatment of privileged processes. Root-privileged processes received a 3% discount on their memory usage: points = process_pages*0.97 + oom_score_adj*totalpages/1000. v4.17 removed this, making the calculation logic more concise and predictable.

/*
* Root processes get 3% bonus, just like the __vm_enough_memory()
* implementation used by LSMs.
*/
if (has_capability_noaudit(p, CAP_SYS_ADMIN))
    points -= (points * 3) / 100;

mem_cgroup_scan_tasks: for a memory cgroup OOM, the cgroup hierarchy is traversed and oom_evaluate_task is called to calculate each task's score. Reclaiming the memory of a parent cgroup also reclaims the memory of its child cgroups.

oom_kill_process

The next step is the logic that terminates the process. Before terminating it, the oom_kill_process function checks whether the task is already exiting, in which case the occupied memory will be freed anyway and duplicate processing is avoided. It then gets the memory cgroup information and determines whether all the tasks under the cgroup need to be killed. It also dumps a message with the cause of the OOM, keeping the clues for later analysis.

After that, put_task_struct is called inside the __oom_kill_process function to free the kernel stack and release system resources, and wake_oom_reaper(victim) wakes up the oom_reaper kernel thread to reap the victim.

oom_reaper sleeps until there is a cleanup task. wake_oom_reaper pushes the task onto the oom_reaper_list list, and oom_reaper uses that list to decide when to call oom_reap_task_mm to clean up the address space. The cleanup traverses the VMAs and skips areas flagged VM_LOCKED|VM_HUGETLB|VM_PFNMAP. The actual release is done by unmap_page_range.

for (vma = mm->mmap ; vma; vma = vma->vm_next) {
    if (!can_madv_lru_vma(vma))
        continue;

    /*
        * Only anonymous pages have a good chance to be dropped
        * without additional steps which we cannot afford as we
        * are OOM already.
        *
        * We do not even care about fs backed pages because all
        * which are reclaimable have already been reclaimed and
        * we do not want to block exit_mmap by keeping mm ref
        * count elevated without a good reason.
        */
    if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED)) {
        struct mmu_notifier_range range;
        struct mmu_gather tlb;

        mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0,
                    vma, mm, vma->vm_start,
                    vma->vm_end);
        tlb_gather_mmu(&tlb, mm);
        if (mmu_notifier_invalidate_range_start_nonblock(&range)) {
            tlb_finish_mmu(&tlb);
            ret = false;
            continue;
        }
        unmap_page_range(&tlb, vma, range.start, range.end, NULL);
        mmu_notifier_invalidate_range_end(&range);
        tlb_finish_mmu(&tlb);
    }
}

https://elixir.bootlin.com/linux/v5.17.2/source/mm/oom_kill.c#L528

Controlling the behavior of the OOM killer

Several of the file parameters mentioned above control the behavior of the OOM killer.

/proc/sys/vm/panic_on_oom: allows or disallows a kernel panic when an OOM occurs (default is 0)
- 0: When an OOM occurs, the kernel calls the OOM killer to select a process to kill
- 1: When an OOM occurs, the kernel normally panics directly, with one exception: if the OOM is caused by a mempolicy/cpusets restriction, the OOM killer kills a process without a panic
- 2: When an OOM occurs, the kernel panics unconditionally
/proc/sys/vm/oom_kill_allocating_task: can be 0 or non-0 (default is 0). 0 means that when an OOM occurs, the task list is traversed and a process is selected to kill, while non-0 means that the process that caused the OOM is killed directly, without traversing the task list.
/proc/sys/vm/oom_dump_tasks: can be 0 or non-0 (default is 1), indicating whether to print information about the tasks when the OOM killer runs.
/proc/<pid>/oom_score_adj: the score adjustment of a process. Use this value to protect a process from being killed, or to have a process killed first every time. The range of values is -1000 to 1000.
/proc/sys/vm/overcommit_memory: controls memory overcommit, which affects when the OOM killer runs (default is 0)
- 0: heuristic policy (the default). Serious overcommits are not allowed, for example a sudden request for 128TB of memory, while minor overcommits are allowed. Also, root can overcommit slightly more than normal users.
- 1: Always allow overcommit. This policy is suitable for applications that cannot afford memory allocation failures, such as certain scientific computing applications.
- 2: Always disallow overcommit. In this case the system can allocate no more memory than swap + RAM * factor (/proc/sys/vm/overcommit_ratio, default 50%, adjustable). Once this much has been used up, any later attempt to request memory will fail, which usually means that no new programs can be run at this point.
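The limit for mode 2 can be worked through with a short sketch. The commitLimit function is a hypothetical helper that implements swap + RAM * overcommit_ratio / 100 as described above; it is a simplification that ignores details such as huge pages.

```go
package main

import "fmt"

// commitLimit sketches the allocation limit when overcommit_memory is 2:
// swap + RAM * overcommit_ratio / 100 (simplified, huge pages ignored).
func commitLimit(ramBytes, swapBytes, overcommitRatio int64) int64 {
	return swapBytes + ramBytes*overcommitRatio/100
}

func main() {
	const Gi = int64(1) << 30
	// Sample machine: 16Gi of RAM, 4Gi of swap, default ratio of 50%.
	limit := commitLimit(16*Gi, 4*Gi, 50)
	fmt.Printf("%dGi\n", limit/Gi) // 4 + 16*50/100 = 12Gi
}
```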

Controls in the memory cgroup subsystem.

memory.use_hierarchy: enables hierarchical accounting for the cgroup (default is 0)
- 0: The parent cgroup does not reclaim memory from child cgroups
- 1: Memory is reclaimed from child cgroups that exceed the memory limit
memory.oom_control: OOM control for each cgroup in the memory subsystem (default is 0)
- 0: A process is killed by the OOM killer when it consumes more memory than allowed
- 1: The OOM killer is turned off; when a task tries to use more memory, it is stuck until memory is sufficient
- Reading the file reports the OOM state: oom_kill_disable (whether the OOM killer is disabled), under_oom (whether the cgroup is currently in an OOM state)

OOM killer in user space

Finally, a brief introduction to a user-space OOM killer: https://github.com/facebookincubator/oomd. oomd runs in user space and deals with memory resource usage there.

Operation mechanism

Use PSI and cgroup v2 to monitor memory usage on the system, so that oomd frees memory resources before the kernel's oom_killer steps in.
Monitor the memory pressure on the system and on cgroups.

The eviction policy can be configured, for example:

When a workload or the system is under memory pressure, select a memory hog (resource hog) to kill by memory size or growth rate.
When the system is under memory pressure, select a memory hog to kill by memory size or growth rate.
When the system is under swap pressure, select the cgroup that uses the most swap to kill.

As you can see, oomd plays a role similar to the kubelet: it is the agent for OOM management on a single machine.

Summary

You can see the difference between user-space and kernel-space eviction policies. User space triggers the eviction process by monitoring system resources, while kernel space triggers the eviction process when allocating memory. User-space eviction needs to come before kernel eviction.

In addition to process eviction, there are other means to achieve resource security and stability, such as resource suppression and recycling. cgroup v2's Memory QoS capability makes it possible to:

Guarantee the memory allocation performance of containers and reduce their memory allocation latency when the whole machine is under memory pressure
Suppress and quickly reclaim memory from containers that exceed their requests, reducing memory pressure on the whole machine
Protect the reserved memory of the whole machine
