Pod priority and preemption

Pod priority and preemption was introduced in Kubernetes v1.8, entered beta in v1.11, and reached GA in v1.14, so by now it is a mature feature.

As the name suggests, Pod priority and preemption divides applications into different priority levels and allocates resources to high-priority applications first, improving resource utilization while guaranteeing the quality of service of high-priority applications.

Let’s briefly try out Pod priority and preemption.

Ibu’s cluster is v1.14, so the PodPriority feature gate is enabled by default. Using priority and preemption takes two steps:

  1. Define PriorityClasses. Each PriorityClass has a value, and the larger the value, the higher the priority.
  2. Create a Pod and set its priorityClassName field to the desired PriorityClass.

Create PriorityClass

As shown below, Ibu first creates two PriorityClasses, high-priority and low-priority, with values 1000000 and 10 respectively.

Note that Ibu sets globalDefault of low-priority to true, making low-priority the cluster’s default PriorityClass: any Pod without a priorityClassName field gets a priority of 10 from low-priority. A cluster can have only one default PriorityClass; if no default is set, Pods without a priorityClassName field get a priority of 0.

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "for high priority pod"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 10
globalDefault: true
description: "for low priority pod"

After creating them, check the PriorityClasses currently in the cluster.

kubectl get priorityclasses.scheduling.k8s.io
NAME                      VALUE        GLOBAL-DEFAULT   AGE
high-priority             1000000      false            47m
low-priority              10           true             47m
system-cluster-critical   2000000000   false            254d
system-node-critical      2000001000   false            254d

As you can see, besides the two PriorityClasses created above, the cluster also ships with the built-in system-cluster-critical and system-node-critical for high-priority system workloads.

Set the Pod’s priorityClassName

For verification purposes, Ibu uses an extended resource here, setting the capacity of the extended resource example.com/foo to 1 on node x1.

curl -k --header "Authorization: Bearer ${token}" --header "Content-Type: application/json-patch+json" \
--request PATCH \
--data '[{"op": "add", "path": "/status/capacity/example.com~1foo", "value": "1"}]' \
https://{apiServerIP}:{apiServerPort}/api/v1/nodes/x1/status
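
If you would rather not handle a token manually, a simpler sketch is to run kubectl proxy locally (it listens on 127.0.0.1:8001 by default) and send the same JSON patch through the proxy:

kubectl proxy --port=8001 &
curl --header "Content-Type: application/json-patch+json" \
--request PATCH \
--data '[{"op": "add", "path": "/status/capacity/example.com~1foo", "value": "1"}]' \
http://127.0.0.1:8001/api/v1/nodes/x1/status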

Looking at x1’s capacity and allocatable (for example via kubectl describe node x1), you can see that x1 now has 1 example.com/foo resource.

Capacity:
 cpu:                2
 example.com/foo:    1
 hugepages-2Mi:      0
 memory:             4040056Ki
 pods:               110
Allocatable:
 cpu:                2
 example.com/foo:    1
 hugepages-2Mi:      0
 memory:             3937656Ki
 pods:               110

We first create the Deployment nginx, which requests one example.com/foo resource. We do not set priorityClassName, so the Pod falls back to the default PriorityClass low-priority and gets a priority of 10.

  template:
    spec:
      containers:
      - image: nginx
        imagePullPolicy: Always
        name: nginx
        resources:
          limits:
            example.com/foo: "1"
          requests:
            example.com/foo: "1"
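
The snippet above shows only the Pod template. For reference, a minimal complete Deployment it could belong to might look like the following; the metadata name, labels, and replica count here are my assumptions, not the author’s original manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - image: nginx
        imagePullPolicy: Always
        name: nginx
        resources:
          limits:
            example.com/foo: "1"
          requests:
            example.com/foo: "1"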

Then create the Deployment debian, which does not request the example.com/foo resource.

  template:
    spec:
      priorityClassName: high-priority
      containers:
      - args:
        - bash
        image: debian
        name: debian
        resources:
          limits:
            example.com/foo: "0"
          requests:
            example.com/foo: "0"

At this point, both Pods start up normally.
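
You can also check the priority that the admission controller filled in for each Pod; the Pod name below is a placeholder for the actual generated name. Given the manifests above, nginx should show the default low-priority with a numeric priority of 10, and debian should show high-priority with 1000000:

kubectl get pod <pod-name> -o jsonpath='{.spec.priorityClassName} {.spec.priority}{"\n"}'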

Start preemption

We now change the Deployment debian’s example.com/foo request to 1 and set priorityClassName to high-priority.

  template:
    spec:
      priorityClassName: high-priority
      containers:
      - args:
        - bash
        image: debian
        name: debian
        resources:
          limits:
            example.com/foo: "1"
          requests:
            example.com/foo: "1"
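
The author does not show the exact command used for this update; one possible way (just a sketch) is a JSON patch against the Deployment, where the / in the resource name is escaped as ~1, mirroring the node patch above:

kubectl patch deployment debian --type=json -p '[
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/requests/example.com~1foo", "value": "1"},
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/example.com~1foo", "value": "1"},
  {"op": "add", "path": "/spec/template/spec/priorityClassName", "value": "high-priority"}
]'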

At this point, since the cluster has only 1 example.com/foo resource (on x1) and debian has the higher priority, the scheduler starts preemption. Below is the observed sequence of Pod events.

kubectl get pods -o wide -w
NAME                      READY   STATUS    AGE     IP             NODE       NOMINATED NODE
debian-55d94c54cb-pdfmd   1/1     Running   3m53s   10.244.4.178   x201       <none>
nginx-58dc57fbff-g5fph    1/1     Running   2m4s    10.244.3.28    x1         <none>
// At this point, Deployment debian begins to Recreate
debian-55d94c54cb-pdfmd   1/1     Terminating   4m49s   10.244.4.178   x201       <none>
debian-55d94c54cb-pdfmd   0/1     Terminating   5m21s   10.244.4.178   x201       <none>
debian-55d94c54cb-pdfmd   0/1     Terminating   5m22s   10.244.4.178   x201       <none>
debian-55d94c54cb-pdfmd   0/1     Terminating   5m22s   10.244.4.178   x201       <none>
// example.com/foo cannot be satisfied; the new Pod is blocked in Pending
debian-5bc46885dd-rvtwv   0/1     Pending       0s      <none>         <none>     <none>
debian-5bc46885dd-rvtwv   0/1     Pending       0s      <none>         <none>     <none>
// the scheduler determines that evicting the Pod on x1 will satisfy the debian Pod, and sets NOMINATED NODE to x1
debian-5bc46885dd-rvtwv   0/1     Pending       0s      <none>         <none>     x1    
// the scheduler starts evicting Pod nginx
nginx-58dc57fbff-g5fph    1/1     Terminating   3m33s   10.244.3.28    x1         <none>
// the replacement nginx Pod has to wait; its priority is low, nothing to be done
nginx-58dc57fbff-29rzw    0/1     Pending       0s      <none>         <none>     <none>
nginx-58dc57fbff-29rzw    0/1     Pending       0s      <none>         <none>     <none>
// graceful termination period; nginx exits gracefully
nginx-58dc57fbff-g5fph    0/1     Terminating   3m34s   10.244.3.28    x1         <none>
nginx-58dc57fbff-g5fph    0/1     Terminating   3m37s   10.244.3.28    x1         <none>
nginx-58dc57fbff-g5fph    0/1     Terminating   3m37s   10.244.3.28    x1         <none>
// debian's NODE is bound to x1
debian-5bc46885dd-rvtwv   0/1     Pending       5s      <none>         x1         x1    
// the resource has been preempted; debian starts up
debian-5bc46885dd-rvtwv   0/1     ContainerCreating   5s      <none>         x1         <none>
debian-5bc46885dd-rvtwv   1/1     Running             14s     10.244.3.29    x1         <none>

Gentleman: Non-preempting PriorityClasses

Kubernetes v1.15 added a preemptionPolicy field to PriorityClass. When it is set to Never, Pods with this PriorityClass will not preempt lower-priority Pods; they are only placed ahead of lower-priority Pods in the scheduling queue (according to the PriorityClass value).

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-nonpreempting
value: 1000000
preemptionPolicy: Never
globalDefault: false
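
A Pod opts into this behavior the same way as with any other PriorityClass, via priorityClassName. The Pod below is only an illustrative sketch; the name and image are placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: data-science-job   # hypothetical name
spec:
  priorityClassName: high-priority-nonpreempting
  containers:
  - name: worker
    image: nginx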

So I call this kind of PriorityClass a “gentleman”: it quietly queues up according to its ability (its priority) and never steals other Pods’ resources. The official documentation gives data science workloads as a suitable example.

Comparison with Cluster Autoscaler

When a Kubernetes cluster running on the cloud is short of resources, Cluster Autoscaler can scale out the nodes automatically, i.e., request additional nodes from the cloud provider and add them to the cluster, thus providing more resources.

However, this approach has some shortcomings:

  • It is hard to implement in on-premises (non-cloud) environments
  • Adding nodes costs more money
  • It is not immediate; provisioning new nodes takes time

If users can clearly divide their applications by priority, then preempting resources from lower-priority Pods when resources run short improves resource utilization while preserving the quality of service of high-priority applications.