OpenKruise (https://openkruise.io) is a suite of Kubernetes-based extensions focused on automating cloud-native applications: deployment, release, operations, and availability protection. Most of the capabilities OpenKruise provides are defined as CRD extensions; they have no external dependencies and can run on any vanilla Kubernetes cluster. Kubernetes itself offers some application deployment and management capabilities, but they are not sufficient for large-scale applications and clusters, and OpenKruise bridges those gaps in application deployment, upgrades, protection, and operations.

OpenKruise provides some core capabilities as follows:

  • Enhanced Versions of Workloads: OpenKruise includes a range of enhanced workloads such as CloneSet, Advanced StatefulSet, Advanced DaemonSet, BroadcastJob, etc. They support the basic functionality of Kubernetes’ native workloads and also offer features like in-place upgrades, configurable scaling/release policies, and concurrent operations. In particular, in-place upgrade is a new way to upgrade container images and even environment variables: only the specific containers in a Pod are rebuilt with the new image, while the Pod itself and its other containers are unaffected. This results in faster releases and avoids negative impacts on other components such as the scheduler, CNI, CSI, etc.
  • Bypass management of applications: OpenKruise offers several bypass ways to manage application sidecar containers and multi-region deployments, meaning you can implement them without modifying the application’s workloads. For example, SidecarSet can inject specific sidecar containers into all matching Pods at creation time, and can even upgrade an already injected sidecar container image in-place without affecting the other containers in the Pod. WorkloadSpread can constrain the regional distribution of Pods scaled out by stateless workloads, giving a single workload multi-region deployment and elasticity capabilities.
  • High Availability Protection: OpenKruise protects your Kubernetes resources from cascading deletion, including CRDs, Namespaces, and almost all workload-type resources. PodUnavailableBudget protects against Pod deletion, eviction, update, and many other voluntary disruption scenarios, whereas Kubernetes’ native PDB only protects against Pod eviction.
  • Advanced Application Ops: OpenKruise also offers a number of advanced operations capabilities to help you better manage your applications, such as pre-pulling certain images on an arbitrary set of nodes via ImagePullJob (a minimal example is sketched below), or restarting one or more containers in a Pod in-place.
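
The following is a minimal ImagePullJob sketch based on the upstream documentation; the image, node names, and timeout values are placeholders rather than part of this article’s examples:

apiVersion: apps.kruise.io/v1alpha1
kind: ImagePullJob
metadata:
  name: pull-nginx-alpine
spec:
  image: nginx:alpine        # image to pre-pull, tag included
  parallelism: 10            # how many nodes pull the image at the same time
  selector:                  # limit the target nodes; omit to cover all nodes
    names:
      - node1
      - node2
  completionPolicy:
    type: Always
    activeDeadlineSeconds: 1200
    ttlSecondsAfterFinished: 300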

Architecture

The following diagram shows the overall architecture of OpenKruise:

The overall architecture of OpenKruise

All OpenKruise functionality is provided through Kubernetes CRDs. The central component that runs the controllers and webhooks is kruise-manager, which is deployed as a Deployment in the kruise-system namespace. Logically, controllers such as cloneset-controller and sidecarset-controller run independently, but to reduce complexity they are packaged into a single binary and run inside the kruise-controller-manager-xxx Pod. In addition to the controllers, kruise-controller-manager-xxx contains the admission webhooks for Kruise CRDs and Pod resources. Kruise-manager creates the webhook configurations that specify which resources need to be handled, and provides a Service for kube-apiserver to call.

A new kruise-daemon component is available from v0.8.0; it is deployed to every node via a DaemonSet and provides functions such as image pre-warming and container restart.

Installation

Here we again use Helm for installation, but note that from v1.0.0 onwards OpenKruise requires Kubernetes >= 1.16.

First add the charts repository:

➜ helm repo add openkruise https://openkruise.github.io/charts/
➜ helm repo update

Then install the latest version of the application by executing the following command:

➜ helm upgrade --install kruise openkruise/kruise --version 1.3.0

This chart defines the namespace as kruise-system by default in its templates, so you do not need to specify it during installation.

For users in mainland China, if your environment cannot access the official Docker Hub images, you can use the following command to switch the image to AliCloud’s mirror registry instead.

➜ helm upgrade --install kruise openkruise/kruise --set manager.image.repository=openkruise-registry.cn-shanghai.cr.aliyuncs.com/openkruise/kruise-manager --version 1.3.0

Once deployed, two kruise-manager Pods run in the kruise-system namespace. They use leader election so that only one is active at a time, for high availability, and the kruise-daemon component is started as a DaemonSet on every node.

➜ kubectl get pods -n kruise-system
NAME                                         READY   STATUS    RESTARTS   AGE
kruise-controller-manager-7d78fc5c97-d6mbb   1/1     Running   0          52s
kruise-controller-manager-7d78fc5c97-wccbn   1/1     Running   0          52s
kruise-daemon-9f94k                          1/1     Running   0          52s
kruise-daemon-bqj69                          1/1     Running   0          52s
kruise-daemon-h95pf                          1/1     Running   0          52s

If you do not want to install with the default parameters, you can also customise the configurable values by referring to the chart documentation at https://github.com/openkruise/charts.

CloneSet

The CloneSet controller is OpenKruise’s enhancement of the native Deployment and is used in much the same way, as shown in the following CloneSet resource object:

# cloneset-demo.yaml
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
metadata:
  name: cs-demo
spec:
  replicas: 3
  selector:
    matchLabels:
      app: cs
  template:
    metadata:
      labels:
        app: cs
    spec:
      containers:
        - name: nginx
          image: nginx:alpine
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 80

Create the above CloneSet object directly.

➜ kubectl apply -f cloneset-demo.yaml
➜ kubectl get cloneset cs-demo
NAME      DESIRED   UPDATED   UPDATED_READY   READY   TOTAL   AGE
cs-demo   3         3         0               0       3       8s
➜ kubectl describe cloneset cs-demo
Name:         cs-demo
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  apps.kruise.io/v1alpha1
Kind:         CloneSet
# ......
Events:
  Type    Reason            Age   From                 Message
  ----    ------            ----  ----                 -------
  Normal  SuccessfulCreate  21s   cloneset-controller  succeed to create pod cs-demo-jfx5s
  Normal  SuccessfulCreate  21s   cloneset-controller  succeed to create pod cs-demo-kg9p2
  Normal  SuccessfulCreate  21s   cloneset-controller  succeed to create pod cs-demo-n72fr

After the object is created, we can check the corresponding Events with the kubectl describe command. We can see that the cloneset-controller creates Pods directly, unlike a native Deployment, which creates Pods through a ReplicaSet. In other words, CloneSet manages Pods directly, and the 3 Pod replicas have been created successfully.

➜ kubectl get pods -l app=cs
NAME            READY   STATUS    RESTARTS   AGE
cs-demo-jfx5s   1/1     Running   0          58s
cs-demo-kg9p2   1/1     Running   0          58s
cs-demo-n72fr   1/1     Running   0          58s

Although CloneSet is similar to Deployment in terms of use, it has many more advanced features than Deployment, which are described in more detail below.

Scaling Up and Down

Streaming Scaling

When scaling, CloneSet can limit the step size with scaleStrategy.maxUnavailable to minimise the impact on the application. The value can be an absolute number or a percentage; leaving it unset means no limit.

For example, we add the following data to the manifest above:

apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
metadata:
  name: cs-demo
spec:
  minReadySeconds: 60
  scaleStrategy:
    maxUnavailable: 1
  replicas: 5
  ......

Above we set scaleStrategy.maxUnavailable to 1. Combined with minReadySeconds, this means that when scaling up, CloneSet will only create the next Pod once the previously created Pod has been ready for more than one minute.

For example, here we scale up to 5 replicas and check the CloneSet’s events after applying the updated object:

➜ kubectl describe cloneset cs-demo
......
Events:
  Type     Reason            Age                From                 Message
  ----     ------            ----               ----                 -------
  Normal   SuccessfulCreate  4m25s              cloneset-controller  succeed to create pod cs-demo-jfx5s
  Normal   SuccessfulCreate  4m25s              cloneset-controller  succeed to create pod cs-demo-kg9p2
  Normal   SuccessfulCreate  4m25s              cloneset-controller  succeed to create pod cs-demo-n72fr
  Warning  ScaleUpLimited    66s                cloneset-controller  scaleUp is limited because of scaleStrategy.maxUnavailable, limit: 1
  Normal   SuccessfulCreate  66s                cloneset-controller  succeed to create pod cs-demo-x8ndf
  Warning  ScaleUpLimited    64s (x6 over 66s)  cloneset-controller  scaleUp is limited because of scaleStrategy.maxUnavailable, limit: 0
  Normal   SuccessfulCreate  5s                 cloneset-controller  succeed to create pod cs-demo-2sfzz

You can see that one Pod was created first, and because we configured minReadySeconds: 60, the next Pod was only created after the new Pod had been ready for more than one minute, as the Events above show. The Pods’ AGE values also show a gap of roughly one minute between the two newly created Pods.

➜ kubectl get pods -l app=cs
NAME            READY   STATUS    RESTARTS   AGE
cs-demo-2sfzz   1/1     Running   0          22s
cs-demo-jfx5s   1/1     Running   0          4m42s
cs-demo-kg9p2   1/1     Running   0          4m42s
cs-demo-n72fr   1/1     Running   0          4m42s
cs-demo-x8ndf   1/1     Running   0          83s

When a CloneSet is scaled down, we can also specify which Pods should be deleted. This is not possible with StatefulSet, which removes Pods by ordinal, or with Deployment/ReplicaSet, which currently only removes Pods according to the ordering defined in the controller. CloneSet, on the other hand, lets users specify the names of the Pods they want removed while reducing the replica count, as follows:

apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
metadata:
  name: cs-demo
spec:
  minReadySeconds: 60
  scaleStrategy:
    maxUnavailable: 1
    podsToDelete:
    - cs-demo-n72fr
  replicas: 4
  ......

After applying the manifest above, the application is scaled down to 4 Pods. If Pod names are listed in podsToDelete, the controller deletes those Pods first, and the deleted Pods are automatically removed from the podsToDelete list by the controller. Here, the Pod cs-demo-n72fr is deleted and the rest remain:

➜ kubectl get pods -l app=cs
NAME            READY   STATUS    RESTARTS   AGE
cs-demo-2sfzz   1/1     Running   0          61s
cs-demo-jfx5s   1/1     Running   0          5m21s
cs-demo-kg9p2   1/1     Running   0          5m21s
cs-demo-x8ndf   1/1     Running   0          2m2s

If you only add a Pod name to podsToDelete without changing the number of replicas, the controller first deletes the specified Pod and then creates a new one to replace it. Another way to delete a specific Pod is to label it with apps.kruise.io/specified-delete: true.

Compared with deleting the Pod manually, using podsToDelete or apps.kruise.io/specified-delete: true lets CloneSet protect the deletion with maxUnavailable/maxSurge and triggers the PreparingDelete lifecycle hook (described later).
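
For example, a sketch using one of the Pods above:

➜ kubectl label pod cs-demo-x8ndf apps.kruise.io/specified-delete=true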

PVC Templates

A unique feature of CloneSet is that it lets users configure PVC templates via volumeClaimTemplates to generate a dedicated PVC for each Pod, which Deployment does not support. Since stateful applications often need their own PVCs, here are some things to keep in mind when using CloneSet’s PVC templates:

  • Each PVC that is automatically created will have an ownerReference pointing to the CloneSet, so when the CloneSet is deleted, all Pods and PVCs created by it will be deleted.

  • Each Pod and PVC created by a CloneSet will have an apps.kruise.io/cloneset-instance-id: xxx label, the associated Pod and PVC will have the same instance-id and their names will be suffixed with this instance-id.

  • If a Pod is deleted by the CloneSet controller, all the PVCs associated with the Pod will be deleted together.

  • If a Pod is deleted or evicted externally, its associated PVCs remain; when the CloneSet controller scales up again to restore the replica count, the new Pod reuses the original Pod’s instance-id and is associated with the original PVCs.

  • When a Pod is rebuilt and upgraded, the associated PVC will be deleted and created along with the Pod.

  • When a Pod is upgraded in-place, the associated PVC will continue to be used.

The following is an example with PVC templates:

# cloneset-pvc.yaml
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
metadata:
  labels:
    app: sample
  name: sample-data
spec:
  replicas: 3
  selector:
    matchLabels:
      app: sample
  template:
    metadata:
      labels:
        app: sample
    spec:
      containers:
        - name: nginx
          image: nginx:alpine
          volumeMounts:
            - name: data-vol
              mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
    - metadata:
        name: data-vol
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 20Gi

Applying the resource object above automatically creates 3 Pods and 3 PVCs, and each Pod mounts its own PVC:

➜ kubectl get pods -l app=sample
NAME                READY   STATUS    RESTARTS   AGE
sample-data-t4vq6   0/1     Pending   0          2m13s
sample-data-vcjnl   0/1     Pending   0          2m13s
sample-data-znwjd   0/1     Pending   0          2m13s
➜ kubectl get pvc -l app=sample
NAME                         STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
data-vol-sample-data-t4vq6   Pending                                                     2m46s
data-vol-sample-data-vcjnl   Pending                                                     2m46s
data-vol-sample-data-znwjd   Pending                                                     2m46s

Upgrades

CloneSet offers a total of 3 upgrade methods:

  • ReCreate: deletes the old Pod and its PVCs, then recreates them with the new version; this is the default method

  • InPlaceIfPossible: tries to upgrade the Pod in-place first, and rebuilds it if that is not possible

  • InPlaceOnly: only in-place upgrades are allowed, so users may only modify the fields eligible for in-place upgrade (listed below); attempts to modify other fields will be rejected

Here an important concept appears: in-place upgrade, one of the core features provided by OpenKruise. When we want to upgrade an image in a Pod, the following diagram shows the difference between a rebuild upgrade and an in-place upgrade:

The difference between RebuildUpgrade and InPlaceUpgrade

With a rebuild upgrade, we need to delete the old Pod and create a new one:

  • The Pod name and uid change, because they are two completely different Pod objects (e.g. a Deployment upgrade)
  • The Pod name may stay the same, but the uid changes, because they are different Pod objects reusing the same name (e.g. a StatefulSet upgrade)
  • The Pod’s node name may change, as the new Pod will likely not be scheduled to the previous node
  • The Pod IP changes, as the new Pod will likely not be assigned the previous IP address

With an in-place upgrade, on the other hand, we reuse the same Pod object and only modify fields inside it:

  • Additional operations and costs such as scheduling, assigning IPs, and mounting volumes are avoided
  • Image pulls are faster, since most layers of the old image are reused and only the changed layers of the new image need to be pulled
  • When a container is upgraded in-place, the other containers in the Pod are unaffected and keep running

So, clearly, if we can upgrade our workloads with InPlaceIfPossible, the impact on running applications is minimal. As mentioned above, CloneSet supports the InPlaceIfPossible update type, which means Kruise will try to upgrade Pods in-place and fall back to rebuild upgrades when that is not possible. The following changes are allowed to be performed as in-place upgrades:

  • Updating spec.template.metadata.* in the workload, e.g. labels/annotations; Kruise only applies the metadata changes to the existing Pods.
  • Updating spec.template.spec.containers[x].image in the workload; Kruise upgrades the images of those containers in the Pod in-place without rebuilding the entire Pod.
  • Starting with Kruise v1.0, if spec.template.metadata.labels/annotations is updated and containers have env values derived from these changed labels/annotations, Kruise upgrades those containers in-place so the new env values take effect.

Changes to any other fields, such as spec.template.spec.containers[x].env or spec.template.spec.containers[x].resources, fall back to a rebuild upgrade.
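
To illustrate the third point, a container can read a Pod label through the downward API; this is only a minimal sketch (the label key and env name are placeholders), but with Kruise v1.0+ an in-place update of that label restarts just this container so the env picks up the new value:

apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
spec:
  template:
    metadata:
      labels:
        app-config: "v1"          # changing this label triggers an in-place container restart
    spec:
      containers:
        - name: nginx
          image: nginx:alpine
          env:
            - name: APP_CONFIG    # env derived from the label via the downward API
              valueFrom:
                fieldRef:
                  fieldPath: metadata.labels['app-config']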

For example, if we set the above application update method to InPlaceIfPossible, we just need to add spec.updateStrategy.type: InPlaceIfPossible to the manifest.

apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
metadata:
  name: cs-demo
spec:
  updateStrategy:
    type: InPlaceIfPossible
  ......
  # image: nginx:1.7.9

After the update, we can see that the Pods’ status has not changed much: the names and IPs are the same, and the only thing that has changed is the image tag:

➜ kubectl get pods -l app=cs
NAME            READY   STATUS    RESTARTS      AGE
cs-demo-2sfzz   1/1     Running   1 (18s ago)   36m
cs-demo-jfx5s   1/1     Running   0             40m
cs-demo-kg9p2   1/1     Running   0             40m
cs-demo-x8ndf   1/1     Running   0             37m
➜ kubectl describe cloneset cs-demo
Name:         cs-demo
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  apps.kruise.io/v1alpha1
Kind:         CloneSet
......
Events:
  Type    Reason                      Age    From                 Message
  ----    ------                      ----   ----                 -------
  #  ......
  Normal   SuccessfulUpdatePodInPlace  6m58s              cloneset-controller  successfully update pod cs-demo-2sfzz in-place(revision cs-demo-7cb9c88699)
  Normal   SuccessfulUpdatePodInPlace  5m46s              cloneset-controller  successfully update pod cs-demo-x8ndf in-place(revision cs-demo-7cb9c88699)
  Normal   SuccessfulUpdatePodInPlace  4m43s              cloneset-controller  successfully update pod cs-demo-kg9p2 in-place(revision cs-demo-7cb9c88699)
  Normal   SuccessfulUpdatePodInPlace  3m40s              cloneset-controller  successfully update pod cs-demo-jfx5s in-place(revision cs-demo-7cb9c88699)
➜ kubectl describe pod cs-demo-2sfzz
......
Events:
  Type    Reason     Age                  From               Message
  ----    ------     ----                 ----               -------
  Normal  Scheduled  44m                  default-scheduler  Successfully assigned default/cs-demo-2sfzz to node2
  Normal  Pulled     44m                  kubelet            Container image "nginx:alpine" already present on machine
  Normal  Killing    8m8s                 kubelet            Container nginx definition changed, will be restarted
  Normal  Pulling    8m8s                 kubelet            Pulling image "nginx:1.7.9"
  Normal  Created    7m58s (x2 over 44m)  kubelet            Created container nginx
  Normal  Started    7m58s (x2 over 44m)  kubelet            Started container nginx
  Normal  Pulled     7m58s                kubelet            Successfully pulled image "nginx:1.7.9" in 9.720841233s (9.720847295s including waiting)

This is the effect of an in-place upgrade; the overall workflow is shown in the following diagram:

The in-place upgrade process

If you enable the PreDownloadImageForInPlaceUpdate feature gate when installing or upgrading Kruise, the CloneSet controller will automatically pre-warm the new image version on the nodes where the old-version Pods are running, which helps speed up application releases.
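
For example, the feature gate can be turned on through the chart’s featureGates value when installing or upgrading (a sketch based on the chart’s documented values):

➜ helm upgrade --install kruise openkruise/kruise --version 1.3.0 --set featureGates="PreDownloadImageForInPlaceUpdate=true"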

By default, CloneSet pre-warms each new image with a concurrency of 1, i.e. images are pulled on one node at a time. If you need to adjust this, you can set the concurrency via the apps.kruise.io/image-predownload-parallelism annotation on the CloneSet.

Also, starting with Kruise v1.1.0, you can use apps.kruise.io/image-predownload-min-updated-ready-pods to make image pre-warming start only after a certain number of new-version Pods have been upgraded and become ready. The value can be an absolute number or a percentage.

apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
metadata:
  annotations:
    apps.kruise.io/image-predownload-parallelism: "5"
    apps.kruise.io/image-predownload-min-updated-ready-pods: "2"

Note that to avoid unnecessary image pulls, only CloneSets with replicas > 3 are currently pre-warmed automatically.

In addition, CloneSet supports batched canary releases. In the updateStrategy property you can configure the partition parameter, which specifies the number or percentage of Pods that should stay at the old version; the default is 0:

  • If it is a number, the controller will update (replicas - partition) Pods to the latest version
  • If it is a percentage, the controller will update (replicas * (100% - partition)) Pods to the latest version

For example, if we update the image in the example above to nginx:latest and set partition=2 (the manifest change is sketched below), then after the update only 2 Pods have been updated to the new version:
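
Only the relevant fields are shown, following the same abbreviated style as the earlier manifests:

apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
metadata:
  name: cs-demo
spec:
  updateStrategy:
    type: InPlaceIfPossible
    partition: 2        # keep 2 Pods at the old version
  ......
  # image: nginx:latest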

➜ kubectl get pods -l app=cs -L controller-revision-hash
NAME            READY   STATUS    RESTARTS      AGE   CONTROLLER-REVISION-HASH
cs-demo-2sfzz   1/1     Running   1 (11m ago)   47m   cs-demo-7cb9c88699
cs-demo-jfx5s   1/1     Running   2 (99s ago)   52m   cs-demo-7c4d79f5bc
cs-demo-kg9p2   1/1     Running   2 (27s ago)   52m   cs-demo-7c4d79f5bc
cs-demo-x8ndf   1/1     Running   1 (10m ago)   48m   cs-demo-7cb9c88699
➜ kubectl get pods -o custom-columns='DATA:metadata.name,CONTAINERS:spec.containers[*].name,IMAGES:spec.containers[*].image' -l app=cs
DATA            CONTAINERS   IMAGES
cs-demo-2sfzz   nginx        nginx:1.7.9
cs-demo-jfx5s   nginx        nginx:latest
cs-demo-kg9p2   nginx        nginx:latest
cs-demo-x8ndf   nginx        nginx:1.7.9

In addition, CloneSet supports some more advanced usages, such as priority policies to control the order in which Pods are released, scatter policies to spread a class of Pods across the release, and pausing a release.
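
These are all configured under updateStrategy; the following is a minimal sketch based on the upstream CloneSet documentation (the label keys and values are placeholders):

apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
spec:
  updateStrategy:
    paused: false                 # set to true to pause the release
    priorityStrategy:
      weightPriority:             # Pods matching the selector are released first
        - weight: 50
          matchSelector:
            matchLabels:
              canary: "true"
    scatterStrategy:              # scatter Pods carrying this label across the whole release
      - key: release-group
        value: group-a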

Life cycle hooks

Each Pod managed by a CloneSet has an explicit state, marked by the lifecycle.apps.kruise.io/state label on the Pod (you can list it with kubectl as shown after the states below):

  • Normal: normal state
  • PreparingUpdate: preparing for in-place upgrade
  • Updating: in-place upgrade in progress
  • Updated: in-place upgrade complete
  • PreparingDelete: preparing for deletion
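
A quick way to observe the state label, sketched here against the cs-demo Pods from earlier:

➜ kubectl get pods -l app=cs -L lifecycle.apps.kruise.io/state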

Lifecycle hooks, in turn, let you run custom actions (such as switching traffic, raising alerts, etc.) before and after in-place upgrades and before deletion, by pausing the state flow above at those points. preDelete and inPlaceUpdate are the main properties supported under a CloneSet’s lifecycle field.

apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
spec:
  # Defining hooks via finalizer
  lifecycle:
    preDelete:  # PreDelete is the hook before the Pod is deleted.
      finalizersHandler:
      - example.io/unready-blocker
    inPlaceUpdate: # InPlaceUpdate is the hook before and after the Pod update
      finalizersHandler:
      - example.io/unready-blocker

  # Alternatively, it can be defined by label
  lifecycle:
    inPlaceUpdate:
      labelsHandler:
        example.io/block-unready: "true"

Marking Pods as NotReady before upgrading/deleting them:

lifecycle:
  preDelete:
    markPodNotReady: true
    finalizersHandler:
      - example.io/unready-blocker
  inPlaceUpdate:
    markPodNotReady: true
    finalizersHandler:
      - example.io/unready-blocker

  • If preDelete.markPodNotReady=true is set:
    • Kruise will set the KruisePodReady Pod condition to False when the Pod enters the PreparingDelete state, and the Pod will become NotReady.
  • If inPlaceUpdate.markPodNotReady=true is set:
    • Kruise will set the KruisePodReady Pod condition to False when the Pod enters the PreparingUpdate state, and the Pod will become NotReady.
    • Kruise will try to set the KruisePodReady condition back to True when the Pod enters the Updated state.

We can use this feature to avoid traffic loss by removing the Pod from the traffic path before its containers are actually stopped.

Flow diagram

Life cycle diagram

  • When the CloneSet deletes a Pod (including normal scale-down and rebuild upgrades):

    • If no lifecycle hook is defined, or the Pod does not meet the preDelete condition, it is deleted directly.
    • Otherwise, the Pod state is first changed to PreparingDelete. Once the user controller has removed the label/finalizer and the Pod no longer meets the preDelete condition, Kruise deletes the Pod.
    • Note that Pods in the PreparingDelete state are in the deletion phase and will not be upgraded.
  • When a CloneSet upgrades a Pod in-place:

    • Before the upgrade, if a lifecycle hook is defined and the Pod meets the inPlaceUpdate condition, the Pod state is changed to PreparingUpdate.
    • Once the user controller has removed the label/finalizer and the Pod no longer meets the inPlaceUpdate condition, Kruise changes the Pod state to Updating and starts the upgrade.
    • After the upgrade is complete, if a lifecycle hook is defined and the Pod does not meet the inPlaceUpdate condition, the Pod state is changed to Updated.
    • Once the user controller has added back the label/finalizer and the Pod meets the inPlaceUpdate condition, Kruise changes the Pod state to Normal and considers the upgrade successful.

Returning to the Normal state from PreparingDelete is supported by design (by cancelling the specified deletion), but we generally do not recommend it. Because Pods in the PreparingDelete state are not upgraded, they may re-enter the release phase immediately after returning to Normal, which makes it harder for users to handle the hooks correctly.

Example of user controller logic

As per the above example, you can define:

  • example.io/unready-blocker finalizer as a hook
  • example.io/initialing annotation as an initialisation tag

Include these fields in the CloneSet template:

apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
spec:
  template:
    metadata:
      annotations:
        example.io/initialing: "true"
      finalizers:
        - example.io/unready-blocker
  # ...
  lifecycle:
    preDelete:
      finalizersHandler:
        - example.io/unready-blocker
    inPlaceUpdate:
      finalizersHandler:
        - example.io/unready-blocker

The logic of the user controller is then as follows:

  • For Pods in the Normal state, if the annotation example.io/initialing: true is present and the ready condition in the Pod status is True, register the Pod for traffic and remove the annotation.
  • For Pods in the PreparingDelete and PreparingUpdate states, cut off traffic and remove the example.io/unready-blocker finalizer.
  • For Pods in the Updated state, register the Pod for traffic and add the example.io/unready-blocker finalizer.

Usage Scenarios

For a variety of historical and practical reasons, some users cannot move their entire company architecture onto Kubernetes. For example, some are temporarily unable to use Kubernetes’ own service discovery mechanism and instead run a separate service registration and discovery system independent of Kubernetes. In such an architecture, moving services onto Kubernetes raises many problems. For example, whenever Kubernetes successfully creates a Pod, the Pod still needs to be registered with the service discovery centre before it can serve traffic internally or externally; conversely, to take a Pod offline gracefully, it usually has to be removed from the service discovery centre first, otherwise traffic may be lost. In native Kubernetes, however, a Pod’s lifecycle is managed by its workload (e.g. Deployment): when the workload’s replicas field changes, the controller immediately adds or removes Pods, making it difficult for users to insert custom steps into a Pod’s lifecycle.

There are two general solutions to this problem. One is to constrain Kubernetes’ elasticity, for example by only allowing workloads to scale through a specific pipeline that removes the Pod IP from the service registry before the Pod is deleted; this limits Kubernetes’ own elasticity and increases the difficulty and risk of controlling that pipeline. The other is to fundamentally overhaul the existing service discovery system, which is obviously a much longer and riskier undertaking.

CloneSet lifecycle transformation

Is there a way to take full advantage of Kubernetes’ resiliency capabilities while avoiding the need to modify existing service discovery systems and quickly bridge the gap between the two systems?

For this type of scenario, OpenKruise’s CloneSet provides a set of highly customisable extensions that let users manage the Pod lifecycle in a more fine-grained, customised way. CloneSet reserves hooks at several important points in the Pod lifecycle, allowing users to insert custom actions at those points; for example, removing the Pod IP from the service discovery centre before a Pod upgrade and registering it again after the upgrade completes, or performing special probing and monitoring actions.

Let’s assume we now have a scenario where:

  • users do not use Kubernetes Service as a service discovery mechanism, and the service discovery architecture is completely independent of Kubernetes;
  • Using CloneSet as a Kubernetes workload.

and make the following reasonable assumptions about the specific requirements:

  • When a Kubernetes Pod is created:
    • register the Pod IP with the service discovery centre once creation succeeds and the Pod is Ready;
  • When a Kubernetes Pod is upgraded in-place:
    • the Pod IP must be removed from the service discovery centre (or actively failed over) before the upgrade can take place;
    • after the upgrade completes and the Pod is Ready, register the Pod IP with the service discovery centre again;
  • When a Kubernetes Pod is deleted:
    • the Pod IP must be removed from the service discovery centre before the Pod can be deleted;

Based on the above assumptions, we can actually use CloneSet LifeCycle to write a simple Operator to implement a user-defined Pod lifecycle management mechanism.

As mentioned earlier, CloneSet lifecycle management defines 5 Pod states, and the transitions between them are controlled by a state machine. We can pick only the states we care about, write a standalone Operator to handle those transitions, control the Pod lifecycle, and insert our own custom logic at the points we care about.

apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
metadata:
  namespace: demo
  name: cloneset-lifecycle-demo
spec:
  replicas: 2
  ##########################################################################
  ## Life cycle configuration
  lifecycle:
    inPlaceUpdate:
      labelsHandler:
        ## Define label.
        ##    1. Block in-place Pod updates for cloneset controllers
        ##    2. Notify operator to execute inPlace update hook
        example.com/unready-blocker-inplace: "true"
    preDelete:
      labelsHandler:
        ## Define label.
        ##    1. Blocking pod deletions for cloneset controllers
        ##    2. Notify the operator to execute the preDelete hook
        example.com/unready-blocker-delete: "true"
  ##########################################################################
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
        ## This tag can be used to determine if this Pod is newly created
        example.com/newly-create: "true"
        ## Corresponds to spec.lifecycle.inPlaceUpdate.labelsHandler.example.com/unready-blocker-inplace
        example.com/unready-blocker-inplace: "true"
        ## Corresponds to spec.lifecycle.preDelete.labelsHandler.example.com/unready-blocker-delete
        example.com/unready-blocker-delete: "true"
    spec:
      containers:
        - name: main
          image: nginx:latest
          imagePullPolicy: Always
  updateStrategy:
    maxUnavailable: 20%
    type: InPlaceIfPossible

We explained how to develop an Operator in the earlier CRD chapter, so we will not repeat the process here; below is just the core controller code:

const (
    deleteHookLabel  = "example.com/unready-blocker-delete"
    inPlaceHookLabel = "example.com/unready-blocker-inplace"
    newlyCreateLabel = "example.com/newly-create"
)

func (r *SampleReconciler) Reconcile(req ctrl.Request) (ctrl.Result, error) {
    ... ...

    switchLabel := func(pod *v1.Pod, key, value string) error {
        body := fmt.Sprintf(`{"metadata":{"labels":{"%s":"%s"}}}`, key, value)
        if err := r.Patch(context.TODO(), pod, client.RawPatch(types.StrategicMergePatchType, []byte(body))); err != nil {
            return err
        }
        return nil
    }

    /*
        Pod LifeCycle Hook Logic
    */
    switch {
    // Handling newly created Pods
    case IsNewlyCreateHooked(pod):
        // Register this Pod to your Service Discovery Centre
        if err := postRegistry(pod); err != nil {
            return reconcile.Result{}, err
        }
        if err := switchLabel(pod, newlyCreateLabel, "false"); err != nil {
            return reconcile.Result{}, err
        }

    // Handling Pods that are ready for in-place upgrades
    case IsPreUpdateHooked(pod):
        // Have the Service Discovery Center fail over this Pod
        if err := postFailOver(pod); err != nil {
            return reconcile.Result{}, err
        }
        if err := switchLabel(pod, inPlaceHookLabel, "false"); err != nil {
            return reconcile.Result{}, err
        }

    // Processing the finished update Pod
    case IsUpdatedHooked(pod):
        // Get the Service Discovery Centre re-registered Pod
        if err := postRegistry(pod); err != nil {
            return reconcile.Result{}, err
        }
        if err := switchLabel(pod, inPlaceHookLabel, "true"); err != nil {
            return reconcile.Result{}, err
        }

    // Handling Pods ready for deletion
    case IsPreDeleteHooked(pod):
        // Unregister the Pod from your Service Discovery Centre
        if err := postUnregister(pod); err != nil {
            return reconcile.Result{}, err
        }
        if err := switchLabel(pod, deleteHookLabel, "false"); err != nil {
            return reconcile.Result{}, err
        }
    }

    return ctrl.Result{}, nil
}

func IsNewlyCreateHooked(pod *v1.Pod) bool {
    return kruiseappspub.LifecycleStateType(pod.Labels[kruiseappspub.LifecycleStateKey]) == kruiseappspub.LifecycleStateNormal && pod.Labels[newlyCreateLabel] == "true" && IsPodReady(pod)
}

func IsPreUpdateHooked(pod *v1.Pod) bool {
    return kruiseappspub.LifecycleStateType(pod.Labels[kruiseappspub.LifecycleStateKey]) == kruiseappspub.LifecycleStatePreparingUpdate && pod.Labels[inPlaceHookLabel] == "true"
}

func IsUpdatedHooked(pod *v1.Pod) bool {
    return kruiseappspub.LifecycleStateType(pod.Labels[kruiseappspub.LifecycleStateKey]) == kruiseappspub.LifecycleStateUpdated && pod.Labels[inPlaceHookLabel] == "false" && IsPodReady(pod)
}

func IsPreDeleteHooked(pod *v1.Pod) bool {
    return kruiseappspub.LifecycleStateType(pod.Labels[kruiseappspub.LifecycleStateKey]) == kruiseappspub.LifecycleStatePreparingDelete && pod.Labels[deleteHookLabel] == "true"
}

The four branches in the code above correspond, from top to bottom, to four important lifecycle points: after creation, before upgrade, after upgrade, and before deletion of the Pod. We can refine the corresponding hooks according to our actual needs; the behaviours of the hooks used above are as follows (a minimal sketch of the first one appears after the list):

  • postRegistry(pod *v1.Pod): sends a request to notify the Service Discovery Centre to register the Pod service;
  • postFailOver(pod *v1.Pod): sends a request to notify the Service Discovery Center of a Fail Over of the Pod service;
  • postUnregister(pod *v1.Pod): sends a request to notify the service discovery centre to deregister the Pod service.
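
As a rough illustration, a hook like postRegistry could be a simple HTTP call to your registry. The endpoint URL and payload below are entirely hypothetical and depend on your own service discovery system; it also needs the net/http and strings imports in addition to those used above.

// postRegistry notifies a (hypothetical) service discovery centre that the Pod
// is ready to receive traffic. Replace the URL and payload with whatever your
// own registry expects.
func postRegistry(pod *v1.Pod) error {
    payload := fmt.Sprintf(`{"name":%q,"namespace":%q,"ip":%q}`,
        pod.Name, pod.Namespace, pod.Status.PodIP)
    resp, err := http.Post("http://registry.example.com/register",
        "application/json", strings.NewReader(payload))
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return fmt.Errorf("registry returned status %d", resp.StatusCode)
    }
    return nil
}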

This is the power of CloneSet lifecycle hooks: we can insert custom logic into Pod lifecycle management exactly where we need it.