Taints and Tolerations are one of the key mechanisms of the Kubernetes scheduling system; they ensure that Pods are not scheduled onto, or kept running on, inappropriate nodes.

In this article, we will briefly introduce the Taints and Tolerations mechanism of Kubernetes.

Taints

Taints are attributes defined on Node objects in Kubernetes. Unlike Labels and Annotations, which record information as plain key=value pairs, a taint adds an effect attribute and is written in the key=value:effect format. The key and value can be user-defined strings, while the effect indicates how the taint affects Pod scheduling. Three effects are currently supported.

  • NoExecute: Pods already running on the node that do not tolerate the taint are evicted, and new Pods are not scheduled to the node.
  • NoSchedule: Pods already running on the node are not affected, but new Pods are not scheduled to the node.
  • PreferNoSchedule: Pods already running on the node are not affected; the scheduler tries to avoid placing new Pods on the node, but this is only a soft preference, not a guarantee.
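
On the Node object itself, taints are stored under spec.taints. As a rough illustration (the node name and taint here are hypothetical), a node carrying key1=value1:NoSchedule looks roughly like this:

apiVersion: v1
kind: Node
metadata:
  name: node-1              # hypothetical node name
spec:
  taints:
  - key: "key1"             # user-defined key
    value: "value1"         # user-defined value, may be omitted
    effect: "NoSchedule"    # NoSchedule / PreferNoSchedule / NoExecute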

With the kubectl taint command we can quickly add a taint to a Node.

# Add a taint with effect NoSchedule; subsequent Pods will not be scheduled to this node
kubectl taint nodes <node-name> key1=value1:NoSchedule
# Add a taint with effect NoExecute, omitting the value (it defaults to the empty string)
kubectl taint nodes <node-name> key2:NoExecute
# Append "-" to remove an existing taint
kubectl taint nodes <node-name> key2:NoExecute-
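
To verify which taints are currently set on a node, kubectl can print them directly (a small verification sketch; <node-name> is a placeholder):

# Show the node's taints in the describe output
kubectl describe node <node-name> | grep -A3 Taints
# Or print spec.taints as JSON
kubectl get node <node-name> -o jsonpath='{.spec.taints}'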

Tolerations

Tolerations are defined on a Pod and tell Kubernetes (the scheduler and the kubelet) whether the specified taints affect the scheduling and eviction of that Pod.

A Pod can define multiple tolerations, as in the following examples.

tolerations:
- key: "key1" 
  operator: "Equal"
  value: "value1"
  effect: "NoSchedule"

The above definition says that the taint key1=value1:NoSchedule has no effect on the scheduling of this Pod; the scheduler is allowed to place the Pod on a node that carries this taint.

tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoExecute"
  tolerationSeconds: 3600

The above definition means that the Pod tolerates the taint key1=value1:NoExecute for 3600 seconds; if the taint is still present on the node after that, the Pod is evicted and, if it is managed by a controller, recreated on another node.
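
For completeness, tolerations are declared under a Pod's spec. A minimal, hypothetical Pod manifest carrying the first toleration above might look like this (the Pod name and image are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: toleration-demo       # hypothetical Pod name
spec:
  containers:
  - name: app
    image: nginx              # placeholder image
  tolerations:
  - key: "key1"
    operator: "Equal"         # key, value and effect must all match the taint
    value: "value1"
    effect: "NoSchedule"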

When defining tolerations, we can omit the value, or both the key and the value; in that case the operator must be set to Exists, as follows.

# Omit value
tolerations:
- key: "example-key"
  operator: "Exists"
  effect: "NoSchedule"

# Omit key and value
tolerations:
- effect: NoExecute
  operator: Exists

Taint-based Evictions

Kubernetes’ Node Controller implements Pod migration in case of node failure based on the Taint and Tolerations mechanism described above.

When a node runs into problems, the Node Controller automatically adds the corresponding taint to the Node.

Key                                  Effect      Remark
node.kubernetes.io/not-ready         NoExecute   NodeCondition Ready == False
node.kubernetes.io/unreachable       NoExecute   NodeCondition Ready == Unknown
node.kubernetes.io/memory-pressure   NoSchedule  The node is under memory pressure
node.kubernetes.io/disk-pressure     NoSchedule  The node is running out of disk space
node.kubernetes.io/pid-pressure      NoSchedule  The node is running out of PIDs
node.kubernetes.io/unschedulable     NoSchedule  The node is unschedulable, typically because it was actively cordoned
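
A quick way to see this mechanism in action is to cordon a node: the node is marked unschedulable and the corresponding taint appears on it (a small sketch; <node-name> is a placeholder):

# Mark the node as unschedulable
kubectl cordon <node-name>
# The node.kubernetes.io/unschedulable:NoSchedule taint now shows up under spec.taints
kubectl get node <node-name> -o jsonpath='{.spec.taints}'
# Revert
kubectl uncordon <node-name>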

The Node Controller adds these taints based on the node’s NodeCondition information, which is updated continuously by each node’s kubelet. The scheduler acts directly on the node’s taints when scheduling Pods and does not re-evaluate the specific NodeCondition values.

Among the taints managed by the Node Controller above, only node.kubernetes.io/not-ready and node.kubernetes.io/unreachable have the NoExecute effect; the others use NoSchedule.

This is because a NoExecute taint immediately evicts all Pods on the node that do not tolerate it. To keep the system robust, Kubernetes instead marks a node under memory pressure with a NoSchedule taint; in the background, the kubelet reclaims resources by evicting Pods, starting with lower-priority ones such as BestEffort Pods, which their controllers then recreate on other nodes.

If memory usage then drops back below the eviction threshold, the kubelet updates the NodeCondition and the taint is removed. In other words, a taint such as memory-pressure does not mean the node is completely unavailable, so giving it the NoExecute effect would be inappropriate.
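
The thresholds behind these pressure conditions are part of the kubelet’s configuration. As a hedged sketch (the values below are illustrative, not recommendations), they can be set in a KubeletConfiguration file:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "200Mi"   # below this, MemoryPressure is reported and eviction starts
  nodefs.available: "10%"     # below this, DiskPressure is reported
  pid.available: "5%"         # below this, PIDPressure is reported
evictionSoft:
  memory.available: "500Mi"   # soft threshold, takes effect only after the grace period
evictionSoftGracePeriod:
  memory.available: "1m30s"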

By default, Kubernetes automatically adds a number of tolerations to newly created Pods to enable automatic Pod migration in these scenarios.

Kubernetes automatically adds the following tolerations to all Pods created by controllers other than DaemonSets:

tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300

For DaemonSet-managed Pods, Kubernetes injects the following tolerations:

tolerations:
- effect: NoExecute
  key: node.kubernetes.io/not-ready
  operator: Exists
- effect: NoExecute
  key: node.kubernetes.io/unreachable
  operator: Exists
- effect: NoSchedule
  key: node.kubernetes.io/disk-pressure
  operator: Exists
- effect: NoSchedule
  key: node.kubernetes.io/memory-pressure
  operator: Exists
- effect: NoSchedule
  key: node.kubernetes.io/pid-pressure
  operator: Exists
- effect: NoSchedule
  key: node.kubernetes.io/unschedulable
  operator: Exists

The official documentation mentions that for Guaranteed and Burstable Pods, Kubernetes automatically adds a node.kubernetes.io/memory-pressure toleration (even when a Burstable Pod has no memory requests or limits set). However, in real-world testing, I found that it was not added.
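
For anyone who wants to repeat that check, the tolerations Kubernetes actually attached to a Pod can be inspected directly (a small verification sketch; <pod-name> is a placeholder):

# Print the tolerations on the Pod as JSON
kubectl get pod <pod-name> -o jsonpath='{.spec.tolerations}'
# Or view them in the full manifest
kubectl get pod <pod-name> -o yaml | grep -A12 'tolerations:'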

Reference