Eviction is the termination of a Pod running on a Node in order to keep the workload available. Understanding the eviction mechanism is essential for using Kubernetes: a Pod is usually evicted because of some underlying problem that has to be solved, and knowing how eviction works lets you locate that problem quickly.

Reasons for Pod eviction

The Kubernetes documentation gives the following reasons for Pod eviction.

Preemption and priority

Preemption is the process by which kube-scheduler, when a node does not have enough resources to run a newly added Pod, looks for Pods of lower priority on that node and evicts them so that the freed resources can be allocated to the higher-priority Pod.
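
Priority is assigned through a PriorityClass referenced from the Pod spec. A minimal sketch (the class name, value, and Pod below are illustrative, not taken from this article):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority          # illustrative name
value: 1000000                 # larger value = higher priority
globalDefault: false
description: "For workloads that should not be preempted first"
---
apiVersion: v1
kind: Pod
metadata:
  name: important-app          # illustrative Pod
spec:
  priorityClassName: high-priority
  containers:
    - name: app
      image: nginx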

Node-pressure eviction

Node-pressure eviction concerns the resources of the node a Pod runs on, such as CPU, memory, disk space, and inodes. These are divided into compressible resources (CPU) and incompressible resources (memory, disk space, inodes, disk I/O). When incompressible resources run short, Pods on the node are evicted. This kind of eviction is handled per compute node: the kubelet monitors the resource usage of its node by collecting cAdvisor metrics.
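
The thresholds that trigger node-pressure eviction are configured on the kubelet. A minimal sketch of a KubeletConfiguration fragment (the values shown are illustrative, not necessarily the defaults on your cluster):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:                    # hard thresholds: evict as soon as they are crossed
  memory.available: "200Mi"
  nodefs.available: "10%"
  nodefs.inodesFree: "5%"
  imagefs.available: "15%"
evictionSoft:                    # soft thresholds: evict only after the grace period below
  memory.available: "500Mi"
evictionSoftGracePeriod:
  memory.available: "1m30s"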

Evicted by controller-manager

kube-controller-manager periodically checks the status of nodes. When a node has been NotReady for longer than a configured period, or Pods on it have been failing for a long time, the control plane evicts those Pods and the workload controllers create new Pods to replace the problematic ones.
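
The timing of this behaviour is governed by kube-controller-manager flags. A sketch of the two most relevant ones, shown as they might appear in a kubeadm-style static Pod manifest (the values are the upstream defaults):

# /etc/kubernetes/manifests/kube-controller-manager.yaml (excerpt)
- --node-monitor-grace-period=40s   # how long a node may be unresponsive before it is marked NotReady
- --pod-eviction-timeout=5m0s       # how long to wait before evicting Pods from a NotReady node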

Initiating evictions via API

Kubernetes provides an Eviction API that users can call directly to trigger evictions themselves.

For version 1.22 and above, evictions are initiated through the policy/v1 API; older versions use policy/v1beta1.

curl -v \
    -H 'Content-type: application/json' \
    https://your-cluster-api-endpoint.example/api/v1/namespaces/default/pods/quux/eviction \
    -d '{
        "apiVersion": "policy/v1",
        "kind": "Eviction",
        "metadata": {
            "name": "quux",
            "namespace": "default"
        }
    }'

For example, to evict the Pod netbox-85865d5556-hfg6v, you can use the following command.

# 1.22+
$ curl -v 'https://10.0.0.4:6443/api/v1/namespaces/default/pods/netbox-85865d5556-hfg6v/eviction' \
--header 'Content-Type: application/json' \
--cert /etc/kubernetes/pki/apiserver-kubelet-client.crt \
--key /etc/kubernetes/pki/apiserver-kubelet-client.key \
--cacert /etc/kubernetes/pki/ca.crt \
-d '{
    "apiVersion": "policy/v1",
    "kind": "Eviction",
    "metadata": {
        "name": "netbox-85865d5556-hfg6v",
        "namespace": "default"
    }
}'

# 1.22-
curl -v 'https://10.0.0.4:6443/api/v1/namespaces/default/pods/netbox-85865d5556-hfg6v/eviction' \
--header 'Content-Type: application/json' \
--cert /etc/kubernetes/pki/apiserver-kubelet-client.crt \
--key /etc/kubernetes/pki/apiserver-kubelet-client.key \
--cacert /etc/kubernetes/pki/ca.crt \
-d '{
    "apiVersion": "policy/v1beta1",
    "kind": "Eviction",
    "metadata": {
        "name": "netbox-85865d5556-hfg6v",
        "namespace": "default"
    }
}'

You can see that the old Pod is evicted and a new Pod is created. Because this experimental environment has only a few nodes, the replacement Pod ends up on the same node rather than on a different one.

$ kubectl get pods -o wide
NAME                      READY   STATUS        RESTARTS   AGE    IP              NODE             NOMINATED NODE   READINESS GATES
netbox-85865d5556-hfg6v   1/1     Terminating   0          101d   192.168.1.213   master-machine   <none>           <none>
netbox-85865d5556-vlgr4   1/1     Running       0          101d   192.168.0.4     node01           <none>           <none>
netbox-85865d5556-z6vqx   1/1     Running       0          11s    192.168.1.220   master-machine   <none>           <none>

Eviction API return status

  • 200 OK or 201 Success: the eviction is allowed; it then behaves like sending a DELETE request to the Pod's URL
  • 429 Too Many Requests: you may see this because of API rate limiting, or because the eviction is currently not allowed by a PodDisruptionBudget (a PDB is a protection mechanism that ensures a minimum number or percentage of Pods remain available during voluntary disruptions); see the example after this list
  • 500 Internal Server Error: the eviction is not allowed because of a misconfiguration, for example multiple PDBs referencing the same Pod
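
A PDB that could cause a 429 looks roughly like this (the name, selector, and minAvailable below are illustrative). While granting the eviction would leave fewer than minAvailable Pods available, the request is rejected:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: netbox-pdb              # illustrative name
  namespace: default
spec:
  minAvailable: 2               # at least 2 matching Pods must stay available during voluntary disruptions
  selector:
    matchLabels:
      app: netbox               # illustrative label; must match the Pods to protect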

drain Pods on Node

drain is a maintenance command available since Kubernetes 1.5. With kubectl drain <node_name> you can evict all Pods running on a node, which is typically done before performing work on the node host (e.g. a kernel upgrade or reboot).

Notes: The kubectl drain <node_name> command can only be sent to one node at a time.
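
A typical drain before node maintenance might look like the following sketch (on versions before 1.20 the flag is called --delete-local-data instead of --delete-emptydir-data):

# evict all Pods from node01, ignoring DaemonSet-managed Pods
# and deleting Pods that only use emptyDir volumes
kubectl drain node01 --ignore-daemonsets --delete-emptydir-data

# ... perform the kernel upgrade / reboot ...

# allow new Pods to be scheduled onto the node again
kubectl uncordon node01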

Taint eviction

Taints are usually used together with tolerations: a Pod is not scheduled onto a node that carries a taint unless the Pod has a toleration that matches it.

Since Kubernetes 1.18, taint-based eviction is enabled by default: under certain circumstances the control plane automatically adds taints to a node, and Pods that do not tolerate those taints are evicted from it.
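
Taints with the NoExecute effect are the ones that cause eviction: adding such a taint to a node evicts every Pod that does not tolerate it. A sketch (the key maintenance=true is an illustrative example):

# evict all Pods that do not tolerate this taint from node01
kubectl taint nodes node01 maintenance=true:NoExecute

# remove the taint again (note the trailing minus sign)
kubectl taint nodes node01 maintenance=true:NoExecute-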

Kubernetes has several built-in taints that the controller adds to a node automatically when the corresponding condition occurs:

  • node.kubernetes.io/not-ready: Node is not ready. Corresponds to NodeCondition Ready = False.
  • node.kubernetes.io/unreachable: Node controller cannot access the node. Corresponds to NodeCondition Ready = Unknown.
  • node.kubernetes.io/memory-pressure: Node memory pressure.
  • node.kubernetes.io/disk-pressure: Node disk pressure.
  • node.kubernetes.io/pid-pressure: Node has PID pressure.
  • node.kubernetes.io/network-unavailable: Node network is not available.
  • node.kubernetes.io/unschedulable: Node is not schedulable.
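
Note that Pods normally receive tolerations for the first two of these taints automatically (added by the DefaultTolerationSeconds admission plugin) with tolerationSeconds: 300, so a Pod on a NotReady or unreachable node is only evicted after about five minutes. You can see them in the output of kubectl get pod <name> -o yaml:

tolerations:
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 300
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 300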

Example: Pod eviction troubleshooting process

Imagine a scenario: a Kubernetes cluster (v1.19.0) with three worker nodes, where you find that some of the Pods running on worker 1 have been evicted.

Pod eviction troubleshooting process

Source: https://www.containiq.com/post/kubernetes-pod-evictions

The diagram above shows that many Pods have been evicted, and the error message is clear: the eviction was triggered by the kubelet because the node was short of storage resources.

Method 1: Enable auto-scaler

  • Add worker nodes to the cluster, or deploy cluster-autoscaler so that the cluster scales up and down automatically based on the configured conditions.
  • Alternatively, expand only the workers' local storage; this involves resizing the VMs and makes the worker nodes temporarily unavailable.

Method 2: Protecting Critical Pods

Specify resource requests and limits in the Pod manifest so that the Pod gets an appropriate QoS (Quality of Service) class. When the kubelet triggers an eviction, Pods in the Guaranteed QoS class are the last to be affected.
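
A sketch of a Pod that gets the Guaranteed QoS class (requests equal to limits for every container; names and values are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: critical-app             # illustrative name
spec:
  containers:
    - name: app
      image: nginx
      resources:
        requests:
          cpu: "500m"
          memory: "256Mi"
        limits:
          cpu: "500m"            # requests == limits for every container => Guaranteed QoS
          memory: "256Mi"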

This kind of policy guarantees the availability of critical Pods to a certain extent, but it does not fix the underlying node problem; additional troubleshooting steps are still needed to find the failure.

Running kubectl get pods shows a number of Pods in the Evicted state. The details are recorded in the kubelet log on the node; you can find the corresponding entries with cat /var/paas/sys/log/kubernetes/kubelet.log | grep -i Evicted -C3 (the exact log path depends on your distribution).

Troubleshooting ideas

Check Pod tolerations

When a node becomes unreachable or fails to respond, you can use tolerationSeconds to configure how long a Pod stays bound to the node before it is evicted.

tolerations:
  - key: "node.kubernetes.io/unreachable"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 6000

Check the conditions under which Pod eviction is suspended

By default, Kubernetes evicts the workloads (the applications running in Kubernetes) of a failed node. However, if the cluster has fewer than 50 nodes and more than 55% of the nodes are unhealthy, Pod eviction is suspended.
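
These thresholds correspond to kube-controller-manager flags; a sketch of the relevant flags with their upstream defaults:

# kube-controller-manager flags governing this behaviour (upstream defaults)
- --large-cluster-size-threshold=50   # clusters at or below this size are treated as "small"
- --unhealthy-zone-threshold=0.55     # fraction of NotReady nodes above which a zone counts as unhealthy
- --secondary-node-eviction-rate=0.01 # reduced eviction rate for unhealthy zones (evictions stop entirely in small clusters)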

The following JSON describes a healthy node.

"conditions": [
    {
        "type": "Ready",
        "status": "True",
        "reason": "KubeletReady",
        "message": "kubelet is posting ready status",
        "lastHearbeatTime": "2019-06-05T18:38:35Z",
        "lastTransitionTime": "2019-06-05T11:41:27Z"
    }
]

If the Ready condition stays Unknown or False for longer than pod-eviction-timeout (5 minutes by default), the node controller performs an API-initiated eviction of all Pods assigned to that node.
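
To check the current Ready condition of a node quickly, you can query it with jsonpath, for example:

# print the status of the Ready condition for node01 (True / False / Unknown)
kubectl get node node01 -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'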

Checking Pod’s allocated resources

Pods are evicted according to the actual resource usage of the node, but they are rescheduled according to the resources requested in the Pod spec. Eviction and scheduling follow different sets of rules, so an evicted container may be rescheduled onto its original node and evicted again. It is therefore important to set appropriate resource requests for each container.
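
To compare what has already been requested on a node against its capacity, the "Allocated resources" section of kubectl describe node is useful:

# show resource requests and limits already allocated on the node
kubectl describe node node01 | grep -A 8 "Allocated resources"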

Checking Pods for repeated evictions

Pods can be evicted multiple times: after a Pod is evicted and scheduled onto a new node, it may be evicted from that node as well, and so on.

If the eviction is triggered by kube-controller-manager, the Pod is kept in the Terminating state and is automatically removed once the node recovers. If the node has been deleted or otherwise cannot be recovered, the Pod can be forcibly deleted.

If the eviction is triggered by the kubelet, the Pod remains in the Evicted state. It is kept only for later fault location and can be deleted directly.

The command to delete an evicted Pod is as follows.

kubectl get pods | grep Evicted | awk '{print $1}' | xargs kubectl delete pod
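
The command above only covers the current namespace. A variant that uses a field selector to clean up failed Pods (evicted Pods have status.phase=Failed) across all namespaces is sketched below; note that it deletes all failed Pods, not only evicted ones:

kubectl delete pods --field-selector=status.phase=Failed --all-namespaces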

Notes:

  • Pods evicted by Kubernetes are not recreated automatically. To have them recreated you need a workload controller such as a ReplicationController, ReplicaSet, or Deployment, i.e. the Kubernetes workloads mentioned above.
  • These Pod controllers reconcile a set of Pods toward a desired state, so the evicted Pods are deleted and then rebuilt, which is also a feature of the Kubernetes declarative API.

How to monitor evicted Pods

Using Prometheus

kube_pod_status_reason{reason="Evicted"} > 0
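
This metric is exposed by kube-state-metrics. If you run the Prometheus Operator, an alerting rule built on it could look roughly like the following sketch (the rule name, duration, and labels are illustrative, and the PrometheusRule must carry whatever labels your Prometheus ruleSelector expects):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-eviction-alerts        # illustrative name
spec:
  groups:
    - name: pod-evictions
      rules:
        - alert: PodEvicted
          expr: kube_pod_status_reason{reason="Evicted"} > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} in {{ $labels.namespace }} was evicted"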

Using ContainIQ

ContainIQ is an observability tool designed for Kubernetes that includes a Kubernetes event dashboard, which includes Pod eviction events.

Ref

  • https://kubernetes.io/docs/concepts/scheduling-eviction/
  • https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/
  • https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/
  • https://kubernetes.io/docs/concepts/scheduling-eviction/api-eviction/
  • https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/
  • https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/#draining-multiple-nodes-in-parallel
  • https://www.containiq.com/post/kubernetes-pod-evictions
  • https://sysdig.com/blog/kubernetes-pod-evicted/
  • https://cylonchau.github.io/kubernetes-eviction.html