From the kube-scheduler’s perspective, scheduling means computing the best node for a Pod through a series of algorithms: when a new Pod appears, the scheduler makes the best decision it can based on its view of the cluster’s resources at that moment. But Kubernetes clusters are highly dynamic. Consider a node we drain for maintenance: all Pods on that node are evicted to other nodes, but when maintenance is finished, those Pods do not automatically come back, because a Pod never triggers rescheduling once it is bound to a node. As changes like this accumulate, the cluster can become unbalanced over time, so a rebalancing component is needed.

Kubernetes Descheduler

Of course, we could rebalance the cluster by hand, for example by manually deleting some Pods to trigger rescheduling, but that is obviously tedious and not a real solution. To address cluster resources being underutilized or wasted in practice, we can use the descheduler component to optimize Pod placement. The descheduler helps rebalance the cluster state according to rules and configuration policies: its core principle is to find Pods that can be removed and evict them based on its policy configuration. It does not reschedule the evicted Pods itself; it relies on the default scheduler for that. The following policies are currently supported:

  • RemoveDuplicates
  • LowNodeUtilization
  • HighNodeUtilization
  • RemovePodsViolatingInterPodAntiAffinity
  • RemovePodsViolatingNodeAffinity
  • RemovePodsViolatingNodeTaints
  • RemovePodsViolatingTopologySpreadConstraint
  • RemovePodsHavingTooManyRestarts
  • PodLifeTime
  • RemoveFailedPods

These policies can be enabled or disabled as part of the policy configuration, and the parameters associated with each policy can also be tuned; by default, all policies are enabled. In addition, there are some general settings:

  • nodeSelector: restricts the nodes to be processed
  • evictLocalStoragePods: whether to evict Pods that use local storage
  • ignorePvcPods: whether to ignore Pods with PVCs attached; defaults to false
  • maxNoOfPodsToEvictPerNode: the maximum number of Pods allowed to be evicted per node

We can configure it with the DeschedulerPolicy as shown below.

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
nodeSelector: "node=node1" # if unset, all nodes are processed
maxNoOfPodsToEvictPerNode: 5000 # if unset, no per-node limit is applied
maxNoOfPodsToEvictPerNamespace: 5000
profiles:
  - name: ProfileName
    pluginConfig:
    - name: "DefaultEvictor"
      args:
        evictSystemCriticalPods: true
        evictFailedBarePods: true
        evictLocalStoragePods: true
        nodeFit: true
    plugins:
      evict:
        enabled:
          - "DefaultEvictor"
      deschedule:
        enabled:
          - ...
      balance:
        enabled:
          - ...
      [...]

Installation

descheduler can run inside a Kubernetes cluster as either a CronJob or a Deployment; we can also use the Helm chart to install it.

➜ helm repo add descheduler https://kubernetes-sigs.github.io/descheduler/

Through the Helm chart we can configure descheduler to run as a CronJob or a Deployment. By default descheduler runs as a critical Pod to avoid being evicted by itself or by the kubelet, so you need to make sure the cluster has the system-cluster-critical PriorityClass:

➜ kubectl get priorityclass system-cluster-critical
NAME                      VALUE        GLOBAL-DEFAULT   AGE
system-cluster-critical   2000000000   false            87d
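
If you prefer a long-running descheduler over a periodic job, the chart can also deploy it as a Deployment. The values below are only a sketch based on the upstream chart’s values.yaml: the `kind` and `deschedulingInterval` keys are assumptions, so verify them with `helm show values descheduler/descheduler` for your chart version.

```yaml
# values-deployment.yaml -- sketch; `kind` and `deschedulingInterval`
# are assumed from the upstream chart's values.yaml, verify for your version
kind: Deployment
deschedulingInterval: 5m # how often the descheduling loop runs in Deployment mode
```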

By default, a Helm chart installation runs as a CronJob with schedule: "*/2 * * * *", so the descheduler task executes every two minutes. The default policy configuration is shown below.

apiVersion: v1
kind: ConfigMap
metadata:
  name: descheduler
data:
  policy.yaml: |
    apiVersion: "descheduler/v1alpha1"
    kind: "DeschedulerPolicy"
    strategies:
      LowNodeUtilization:
        enabled: true
        params:
          nodeResourceUtilizationThresholds:
            targetThresholds:
              cpu: 50
              memory: 50
              pods: 50
            thresholds:
              cpu: 20
              memory: 20
              pods: 20
      RemoveDuplicates:
        enabled: true
      RemovePodsHavingTooManyRestarts:
        enabled: true
        params:
          podsHavingTooManyRestarts:
            includingInitContainers: true
            podRestartThreshold: 100
      RemovePodsViolatingInterPodAntiAffinity:
        enabled: true
      RemovePodsViolatingNodeAffinity:
        enabled: true
        params:
          nodeAffinityType:
          - requiredDuringSchedulingIgnoredDuringExecution
      RemovePodsViolatingNodeTaints:
        enabled: true
      RemovePodsViolatingTopologySpreadConstraint:
        enabled: true
        params:
          includeSoftConstraints: false    

The strategies section of the DeschedulerPolicy specifies which policies descheduler executes; each policy can be enabled or disabled, and we will cover them in detail below. Here we can keep the default policy and install directly with the following command:

➜ helm upgrade --install descheduler descheduler/descheduler --set image.repository=cnych/descheduler -n kube-system

When deployment is complete, a CronJob resource object is created to balance the cluster state.

➜ kubectl get cronjob -n kube-system
NAME          SCHEDULE      SUSPEND   ACTIVE   LAST SCHEDULE   AGE
descheduler   */2 * * * *   False     1        8s              117s
➜ kubectl get job -n kube-system
NAME                   COMPLETIONS   DURATION   AGE
descheduler-28032982   1/1           15s        17s
➜ kubectl get pods -n kube-system -l job-name=descheduler-28032982
NAME                         READY   STATUS      RESTARTS   AGE
descheduler-28032982-vxn24   0/1     Completed   0          31s

Normally a corresponding Job is created to execute the descheduler task, and we can see what balancing operations were performed by looking at its logs:

➜ kubectl logs -f descheduler-28032982-vxn24 -nkube-system
I0420 08:22:10.019936       1 named_certificates.go:53] "Loaded SNI cert" index=0 certName="self-signed loopback" certDetail="\"apiserver-loopback-client@1681978930\" [serving] validServingFor=[apiserver-loopback-client] issuer=\"apiserver-loopback-client-ca@1681978929\" (2023-04-20 07:22:09 +0000 UTC to 2024-04-19 07:22:09 +0000 UTC (now=2023-04-20 08:22:10.019885292 +0000 UTC))"
I0420 08:22:10.020138       1 secure_serving.go:210] Serving securely on [::]:10258
I0420 08:22:10.020301       1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I0420 08:22:10.021237       1 policyconfig.go:211] converting Deschedule plugin: %sRemovePodsViolatingInterPodAntiAffinity
I0420 08:22:10.021255       1 policyconfig.go:211] converting Deschedule plugin: %sRemovePodsViolatingNodeAffinity
I0420 08:22:10.021262       1 policyconfig.go:211] converting Deschedule plugin: %sRemovePodsViolatingNodeTaints
I0420 08:22:10.021269       1 policyconfig.go:202] converting Balance plugin: %sRemovePodsViolatingTopologySpreadConstraint
I0420 08:22:10.021280       1 policyconfig.go:202] converting Balance plugin: %sLowNodeUtilization
I0420 08:22:10.021296       1 policyconfig.go:202] converting Balance plugin: %sRemoveDuplicates
I0420 08:22:10.021312       1 policyconfig.go:211] converting Deschedule plugin: %sRemovePodsHavingTooManyRestarts
# ......
I0420 08:22:11.630980       1 removeduplicates.go:162] "Duplicate found" pod="kruise-system/kruise-controller-manager-7d78fc5c97-pxsqx"
I0420 08:22:11.630997       1 removeduplicates.go:103] "Processing node" node="node2"
I0420 08:22:11.631052       1 removeduplicates.go:103] "Processing node" node="node3"
I0420 08:22:11.631113       1 removeduplicates.go:103] "Processing node" node="master1"
I0420 08:22:11.631184       1 removeduplicates.go:194] "Adjusting feasible nodes" owner={namespace:kruise-system kind:ReplicaSet name:kruise-controller-manager-7d78fc5c97 imagesHash:openkruise/kruise-manager:v1.3.0} from=4 to=3
I0420 08:22:11.631200       1 removeduplicates.go:203] "Average occurrence per node" node="node1" ownerKey={namespace:kruise-system kind:ReplicaSet name:kruise-controller-manager-7d78fc5c97 imagesHash:openkruise/kruise-manager:v1.3.0} avg=1
I0420 08:22:11.647438       1 evictions.go:162] "Evicted pod" pod="kruise-system/kruise-controller-manager-7d78fc5c97-pxsqx" reason="" strategy="RemoveDuplicates" node="node1"
I0420 08:22:11.647494       1 descheduler.go:408] "Number of evicted pods" totalEvicted=1
I0420 08:22:11.647583       1 reflector.go:227] Stopping reflector *v1.Namespace (0s) from k8s.io/client-go/informers/factory.go:150
I0420 08:22:11.647702       1 reflector.go:227] Stopping reflector *v1.PriorityClass (0s) from k8s.io/client-go/informers/factory.go:150
I0420 08:22:11.647761       1 tlsconfig.go:255] "Shutting down DynamicServingCertificateController"
I0420 08:22:11.647764       1 reflector.go:227] Stopping reflector *v1.Node (0s) from k8s.io/client-go/informers/factory.go:150
I0420 08:22:11.647811       1 secure_serving.go:255] Stopped listening on [::]:10258

From the logs we can clearly see which Pods were evicted and by which policies.

PDB

Since descheduler evicts Pods so they can be rescheduled, a service can become unavailable if all of its replicas are evicted at once. If a service is a single point of failure, eviction will certainly make it unavailable, so in that case we strongly recommend using anti-affinity and multiple replicas. But even when the replicas are spread across multiple nodes, evicting all of them at the same time would still take the service down. In that case we can prevent all replicas from being deleted at once by configuring a PDB (PodDisruptionBudget) object. For example, the manifest below allows at most one replica of an application to be unavailable during eviction.

# pdb-demo.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: pdb-demo
spec:
  maxUnavailable: 1 # maximum number of unavailable replicas; alternatively use minAvailable; either an integer or a percentage
  selector:
    matchLabels: # Match Pod Labels
      app: demo

More details about PDB can be found in the official documentation: https://kubernetes.io/docs/tasks/run-application/configure-pdb/.

So if we use descheduler to rebalance the cluster state, we strongly recommend creating a corresponding PodDisruptionBudget object for each application to protect it.

Policy

PodLifeTime: Evict pods that exceed the specified time limit

This policy evicts Pods older than maxPodLifeTimeSeconds. You can use podStatusPhases to configure which Pod status phases are considered, and it is recommended to create a PDB for each application to ensure availability. For example, the policy below evicts Pods that have been running for more than 7 days.

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "PodLifeTime":
    enabled: true
    params:
      maxPodLifeTimeSeconds: 604800 # Pods run for up to 7 days
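
The 604800 figure is simply seven days expressed in seconds:

```python
# maxPodLifeTimeSeconds for a 7-day limit
seconds_per_day = 24 * 60 * 60
max_pod_lifetime = 7 * seconds_per_day
print(max_pod_lifetime)  # 604800
```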

RemoveDuplicates

This policy ensures that only one Pod belonging to the same ReplicaSet, Deployment, or Job runs on a given node; any extra, duplicate Pods are evicted so that Pods spread better across the cluster. Duplicates can appear when some nodes crash and their Pods drift to other nodes, leaving multiple Pods of the same ReplicaSet on one node; enabling this policy evicts those duplicates once the failed nodes are Ready again.


When configuring the policy, you can use the excludeOwnerKinds parameter to exclude owner kinds; Pods owned by those kinds will not be evicted.

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemoveDuplicates":
    enabled: true
    params:
      removeDuplicates:
        excludeOwnerKinds:
          - "ReplicaSet"

LowNodeUtilization

This policy finds underutilized nodes and evicts Pods from other nodes so that kube-scheduler can reschedule them onto the underutilized ones. Its parameters are configured via the nodeResourceUtilizationThresholds field.

Underutilization is determined by the thresholds parameter, which can be configured as percentages for CPU, memory, and number of Pods. A node is considered underutilized if its utilization is below all of the thresholds.


In addition, the targetThresholds parameter identifies potential source nodes from which Pods may be evicted; it too can be configured as percentages for CPU, memory, and number of Pods. thresholds and targetThresholds can be tuned to your cluster’s needs, as in the example below.

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        thresholds:
          "cpu": 20
          "memory": 20
          "pods": 20
        targetThresholds:
          "cpu": 50
          "memory": 50
          "pods": 50

It is important to note that:

  • Only three resource types are supported: cpu, memory, and pods
  • thresholds and targetThresholds must be configured for the same resource types
  • Parameter values range from 0 to 100 (percent)
  • For a given resource type, thresholds cannot be set higher than targetThresholds

If no resource type is specified, its threshold defaults to 100% to prevent nodes from going from underutilized to overutilized. Another parameter of the LowNodeUtilization policy is numberOfNodes: the policy activates only when the number of underutilized nodes exceeds this value, which is useful in large clusters where a few nodes may be underutilized briefly or frequently. By default numberOfNodes is 0.
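
To make the two thresholds concrete, here is a small Python sketch of how nodes are classified. This is an illustration only, not the descheduler’s actual code: a node is underutilized when its usage is below all thresholds values, and overutilized (a source of eviction candidates) when its usage exceeds any targetThresholds value.

```python
# Illustration of LowNodeUtilization classification (not the real implementation).
# Usage and threshold values are percentages keyed by cpu, memory, and pods.

def classify_node(usage, thresholds, target_thresholds):
    # Underutilized: below ALL of the lower thresholds.
    if all(usage[r] < thresholds[r] for r in thresholds):
        return "underutilized"
    # Overutilized: above ANY of the upper (target) thresholds.
    if any(usage[r] > target_thresholds[r] for r in target_thresholds):
        return "overutilized"
    return "appropriately utilized"

thresholds = {"cpu": 20, "memory": 20, "pods": 20}
target_thresholds = {"cpu": 50, "memory": 50, "pods": 50}

print(classify_node({"cpu": 10, "memory": 15, "pods": 5}, thresholds, target_thresholds))   # underutilized
print(classify_node({"cpu": 60, "memory": 30, "pods": 40}, thresholds, target_thresholds))  # overutilized
print(classify_node({"cpu": 30, "memory": 30, "pods": 30}, thresholds, target_thresholds))  # appropriately utilized
```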

RemovePodsViolatingInterPodAntiAffinity

This policy ensures that Pods violating inter-Pod anti-affinity are removed from nodes. For example, if podA runs on a node, and podB and podC (running on the same node) have anti-affinity rules prohibiting them from running on the same node as podA, then podA is evicted so that podB and podC can run normally. This situation arises when the anti-affinity rule is created while the Pods are already running on the node.


To disable this policy, simply configure it to false.

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsViolatingInterPodAntiAffinity":
    enabled: false

RemovePodsViolatingNodeTaints

This policy ensures that Pods violating NoSchedule taints are removed from nodes. For example, suppose podA tolerates the taint key=value:NoSchedule and is therefore scheduled onto a node carrying that taint. If the node’s taint is later updated or removed, the Pod’s toleration no longer matches, and podA will be evicted.

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsViolatingNodeTaints":
    enabled: true

RemovePodsViolatingNodeAffinity

This policy ensures that Pods violating node affinity are removed from nodes. For example, podA is scheduled onto nodeA and satisfies the node affinity rule requiredDuringSchedulingIgnoredDuringExecution at scheduling time. If over time nodeA stops satisfying the rule, and another node nodeB that does satisfy it is available, podA is evicted from nodeA. An example policy configuration is shown below.

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsViolatingNodeAffinity":
    enabled: true
    params:
      nodeAffinityType:
        - "requiredDuringSchedulingIgnoredDuringExecution"

RemovePodsViolatingTopologySpreadConstraint

This policy ensures that Pods violating topology spread constraints are removed from nodes; specifically, it tries to evict the minimum number of Pods needed to bring each topology domain back within the constraint’s maxSkew. This policy requires Kubernetes 1.18 or later.

By default, this policy handles only hard constraints; if the includeSoftConstraints parameter is set to true, soft constraints are handled as well.

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsViolatingTopologySpreadConstraint":
    enabled: true
    params:
      includeSoftConstraints: false

RemovePodsHavingTooManyRestarts

This policy ensures that Pods with too many restarts are removed from nodes. Its parameters include podRestartThreshold, the number of restarts at which a Pod should be evicted, and includingInitContainers, which determines whether init container restarts count toward that number. The policy is configured as shown below.

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsHavingTooManyRestarts":
    enabled: true
    params:
      podsHavingTooManyRestarts:
        podRestartThreshold: 100
        includingInitContainers: true

Filter Pods

It is not always desirable to consider all Pods for eviction, so descheduler provides two main filtering mechanisms: namespace filtering and priority filtering.

Namespace filtering

This filter lets you configure which namespaces to include or exclude. It can be used with the following policies:

  • PodLifeTime
  • RemovePodsHavingTooManyRestarts
  • RemovePodsViolatingNodeTaints
  • RemovePodsViolatingNodeAffinity
  • RemovePodsViolatingInterPodAntiAffinity
  • RemoveDuplicates
  • RemovePodsViolatingTopologySpreadConstraint

For example, to evict only Pods in certain namespaces, use the include parameter, as shown below.

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "PodLifeTime":
    enabled: true
    params:
      podLifeTime:
        maxPodLifeTimeSeconds: 86400
      namespaces:
        include:
          - "namespace1"
          - "namespace2"

Or, to exclude Pods in certain namespaces, use the exclude parameter as shown below.

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "PodLifeTime":
    enabled: true
    params:
      podLifeTime:
        maxPodLifeTimeSeconds: 86400
      namespaces:
        exclude:
          - "namespace1"
          - "namespace2"

Priority Filtering

All policies can be configured with a priority threshold: only Pods with a priority below the threshold are evicted. We can specify the threshold either with thresholdPriorityClassName (which sets it to the value of the named priority class) or with thresholdPriority (which sets it directly). By default, the threshold is the value of the system-cluster-critical PriorityClass.

For example, use thresholdPriority.

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "PodLifeTime":
    enabled: true
    params:
      podLifeTime:
        maxPodLifeTimeSeconds: 86400
      thresholdPriority: 10000

Or use thresholdPriorityClassName for filtering.

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "PodLifeTime":
    enabled: true
    params:
      podLifeTime:
        maxPodLifeTimeSeconds: 86400
      thresholdPriorityClassName: "priorityclass1"

However, note that thresholdPriority and thresholdPriorityClassName cannot both be configured. Also, if the specified priority class does not exist, the descheduler will not create it and will raise an error instead.

Caution

When using descheduler to evict Pods, the following points need to be noted:

  • Critical Pods will not be evicted, such as Pods with priorityClassName set to system-cluster-critical or system-node-critical
  • Pods that are not managed by RS, Deployment or Job will not be evicted
  • Pods created by DaemonSet will not be evicted
  • Pods with LocalStorage will not be evicted unless evictLocalStoragePods: true is set
  • Pods with PVCs are evicted unless ignorePvcPods: true is set
  • Under the LowNodeUtilization and RemovePodsViolatingInterPodAntiAffinity policies, Pods are evicted in order of priority, lowest first; among Pods of the same priority, BestEffort Pods are evicted before Burstable and Guaranteed ones
  • Pods with the descheduler.alpha.kubernetes.io/evict annotation can be evicted; this annotation overrides the checks that would otherwise prevent eviction, letting the user choose which Pods are evictable
  • If a Pod fails to be evicted, you can set -v=4 to find out why in the descheduler logs; Pods whose eviction would violate a PDB constraint are not evicted
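
As an example of the descheduler.alpha.kubernetes.io/evict annotation mentioned above, a Pod opting in to eviction might look like the sketch below (the Pod name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: annotated-pod # placeholder name
  annotations:
    # Tells the descheduler this Pod may be evicted even if it would
    # otherwise be protected by the checks listed above.
    descheduler.alpha.kubernetes.io/evict: "true"
spec:
  containers:
    - name: app
      image: nginx # placeholder image
```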