1. Overview

A running kubernetes cluster stores a lot of interrelated resources, such as deployment, replicaset and pod, which are a set of related resources that we commonly use. When we create a deployment, the relevant controller automatically creates the replicaset, and then the controller of the replicaset creates the pod to run the service we deployed. This mechanism is called garbage collection (hereinafter referred to as GC).

In earlier versions, GC was implemented on the client side, for example, by using commands like kubectl delete deployment nginx, and kubectl would delete pods and replicasets, but this approach increased the complexity of the client-side implementation and was not conducive to unified management. Therefore, the need to implement GC on the server side was raised. There are three main goals for implementing GC, and we will focus on them in our subsequent analysis.

  • Support cascade deletion on the server side
  • Centralize the logic of cascade deletion instead of distributing it within each component
  • Option to not delete dependent resources. For example, only the deployment is deleted, but the replicaset and pods remain

GC in kubernetes is implemented as a separate controller in the controller manager. It dynamically discovers and listens to all resources in the cluster that support delete, list and watch through the discovery client. It then constructs a relationship graph between resources to record the dependencies between them.

2. Preparatory Knowledge

To better explain the GC mechanism of kubernetes, here are some k8s basics to start with.

  • finalizer: finalizer can be translated as finalization period. It is a mechanism to ensure that resources have a chance to do some cleanup before they are deleted.
  • There are three deletion propagation strategies in kubernetes.
    1. Orphan. With this strategy, the dependent resources are kept. For example, only the deployment is deleted, but the replicaset and pods are kept.
    2. background. resources are removed from etcd, and dependent resources are removed by the GC mechanism.
    3. foreground. apiserver does not delete the resource. Instead, it adds foregroundDeletion to its finalizer and sets the current deletion timestamp. The GC will then delete the dependent resource with ownerReference.blockOwnerDeletion=true from etcd first. Finally, the current resource is deleted.
  • Each resource in k8s has a unique UID that is unique for each resource throughout the life of the cluster. This UID is unique for each resource throughout the life of the cluster, and is used when marking the dependencies of a resource.
  • ownerReferences. This field is present in the metadata of each resource and is an array that indicates which owners are available for that resource. Whenever an owner resource is deleted, it is removed from this array. When all owners have been deleted, the resource is reclaimed by the GC.
  • If the ownerReference of a set of resources G refers to a specific resource A, the dependents of A are G.

3. Implementation mechanism of garbage collection

The kubernetes GC consists of two main parts.

  • GraphBuilder is mainly used to listen to all resources on apiserver using monitors, build dependencies between resources by inserting events of all resources into the graphChanges queue, and then calling the processGraphChanges method to take out elements from the queue in turn. and insert them into the attemptToDelete or attemptToOrphan queue as appropriate.
  • GarbageCollector is responsible for taking out resources from the attemptToDelete and attemptToOrphan queues, and then going through a series of responsible processes to determine whether they can be deleted or not, and perform the relevant processing.

Therefore, the analysis of the garbage collection implementation mechanism is carried out mainly from these two parts.

3.1 Implementation of the graph builder

The graph builder can be seen as the maintainer of the cluster resource state. It does not modify any resources through the apiserver itself. It is defined as follows.

// GraphBuilder 处理 informers 提供的事件,更新 uidToNode,使用 LRU 缓存依赖资源,并将
// 资源送入到 attemptToDelete 和 attemptToOrphan 队列
type GraphBuilder struct {
    restMapper meta.RESTMapper

  // 每个 monitor 都会 list/watches 一个资源,结果会被导入到 dependencyGraphBuilder 中·
    monitors    monitors
    monitorLock sync.RWMutex

    informersStarted <-chan struct{}
    stopCh <-chan struct{}
    running bool

    metadataClient metadata.Interface
  // monitors 是该队列的生产者,graphBuilder 根据这些改变来修改内存中的 graph
    graphChanges workqueue.RateLimitingInterface
  // 资源 uid 对应到 graph 中的 node
    uidToNode *concurrentUIDToNode
  // GraphBuilder 是 attemptToDelete 和 attemptToOrphan 的生产者,GC 是消费者。
    attemptToDelete workqueue.RateLimitingInterface
    attemptToOrphan workqueue.RateLimitingInterface
  // GraphBuilder 和 GC 共享 absentOwnerCache. 目前已知的不存在的对象会被添加到缓存中
    absentOwnerCache *UIDCache
    sharedInformers  controller.InformerFactory
    ignoredResources map[schema.GroupResource]struct{}

The node that makes up the graph is defined as follows.

// 单线程的 GraphBuilder.processGraphChanges() 是 nodes 的唯一 writer。多线程的 GarbageCollector.attemptToDeleteItem() 读取 nodes。
type node struct {
    identity objectReference
    dependentsLock sync.RWMutex
    // dependents 是当前 node 的依赖资源。比如当前 node 是 replicaset,那么这里面保存的应该就是多个 pod
    dependents map[*node]struct{}
    // this is set by processGraphChanges() if the object has non-nil DeletionTimestamp
    // and has the FinalizerDeleteDependents.
    deletingDependents     bool
    deletingDependentsLock sync.RWMutex
    // this records if the object's deletionTimestamp is non-nil.
    beingDeleted     bool
    beingDeletedLock sync.RWMutex
    // this records if the object was constructed virtually and never observed via informer event
    virtual     bool
    virtualLock sync.RWMutex
    // when processing an Update event, we need to compare the updated
    // ownerReferences with the owners recorded in the graph.
    owners []metav1.OwnerReference

GraphBuilder synchronizes monitors with the apiserver, then creates a monitor for each resource and synchronizes the state of the resources via informer. All resources go directly to the graphChanges queue. They are then unified in the processGraphChanges method.

For Add and Update events:

  • If the current resource does not exist in the graph, a Node object is instantiated and added to the graph. The node is then added to the dependents array of its owners. One detail here is that there may be a situation where the resource represented by the current node is synced to the local cache via informer, but its owner has not been synced yet. This would result in missing dependents for updating owners. Therefore, each node has a virtual field, and a virtual owner node is instantiated and added to the graph when the owner is not yet synchronized. And add this virtual node to the attemptToDelete queue to be processed by the GC later. If the virtual node is found later by processGraphChanges, markObserved() is called to set virtual to false.
  • If it already exists, the changes in the ownerReferences of the old and new resources are compared. The changes in ownerReferences may result in the following cases to be handled.
    • As mentioned before about Foreground’s deletion, a resource with ownerReference with blockOwnerDeletion=true will block the owner’s deletion. So here, because of the change in ownerReferences, the following two things need to be done.
    • For a removed ownerReference, if blockOwnerDeletion is true, it means that the deletion of the node owner is currently not allowed to be blocked again. So put the owner in the attemptToDelete queue and wait for the GC to process it.
    • For updated ownerReference, if blockOwnerDeletion was true before and is now false, then it should also be added to the attemptToDelete queue.
    • For both added and removed, the dependents of the corresponding owner node need to be updated.
  • The processTransitions method is called on both Add and Update events.
    • If the old object is not deleted or does not have an orphan finalizer, but the new object is deleted or has an orphan finalizer, the node is inserted into the attemptToOrphan queue.
    • If the old object is not deleted or does not have a foregroundDeletion finalizer, but the new object is deleted or has a foregroundDeletion finalizer, all the dependents of the node are inserted into the attemptToDelete queue, and then insert the node into the attemptToDelete queue.

For the delete event.

  • removes the node from the current graph, starting by removing the node from uidToNode and updating all the owner’s dependents.
  • If the dependents of the current node are greater than 0, add the current node to the absentOwnerCache.
  • Add the node’s dependents to the attemptToDelete queue (garbage collection).
  • Finally, find the owner from the node with deletingDependents=true state and insert it into the attemptToDelete queue as well. This is for the GC to check if all dependents of the owner have been deleted, and if so, delete the owner as well (the owner is in deletingDependents, which means it is using foregroundDeletion, so it needs to delete dependents first, and then delete the (here the owner is in deletingDependents, which means that foregroundDeletion is used, so you need to delete dependents first and then delete the owner).

Therefore, it is known that resources with the following states are inserted into the attemptToDelete queue.

  • foregroundDelete in finalizers
  • foregroundDelete in the owner’s finalizers
  • The owner resource is deleted
  • Dependents with resources deleted and the current state is not yet deletingDependents
  • owner is in deletingDependents

Resources in the following states are inserted into the attemptToOrphan queue.

  • finalizers with orphan

3.2 Implementation of GarbageCollector

As mentioned in 3.1, the GC consumes the attemptToDelete and attemptToOrphan queues of the GraphBuilder to perform delete or orphan operations. So our main concern here is what kind of resources can be deleted or orphaned by the GC.

3.2.1 attemptToDeleteItem

  • For resources where the DeletionTimestamp is not empty and is not in the process of deleting dependents. The process is skipped directly.
  • If the resource is in the deletingDependents state, count the number of dependents with blockOwnerDeletion=true.
    • If it is 0, the resource is ready to be deleted, so remove the finalizer foregroundDeletion.
    • Otherwise, insert dependents into the attemptToDelete queue.
    • After that, the loop exits.
  • Sort the ownerReferences of the resource
    • Dangling: The resource corresponding to the owner no longer actually exists.
    • waitingForDependentsDeletion: owner’s DeletionTimeStamp is not empty, but there is a foregroundDeletion, so it is waiting for dependents to be deleted
    • solid: owner exists and is not waitingForDependentsDeletion
  • If solid is not empty, then the current resource cannot be GC’d, so only the ownerReferences of dangling and waitingForDependentsDeletion need to be removed via patch
  • If waitingForDependentsDeletion is not empty and the dependents of the current resource are not empty. This decision is used to handle the exception of circular dependencies, because the current resource is not in the deletion state and has dependents, and its owner is waiting for the item to be deleted, which means there is a circular dependency. The solution is to change the blockOwnerDeletion of the resource to false via patch.
  • If neither of the above is the case. The resource will be deleted based on the finalizer of the current resource
    • orphan
    • foreground
    • background

Therefore, it can be concluded that a delete request is called by GC for resources in the following states.

  • The resource is in the deletingDependents state and its blockOwnerDeletion without dependents is true. foregroundDeletion finalizer is removed first, then deleted.
  • If the dependents are in the deletingDependents state. To prevent circular dependencies, the owner is unblocked first, and then foreground is used to delete the current resource.
  • If the resource does not have a solid owner, then the resource is the one that should be cascaded and deleted. So it is deleted according to the finalizer of the resource. By default, the background method is used to delete.

3.2.2 attemptToOrphan

attemptToOrphan is a way to prevent resources from being reclaimed by GC in some cases. attemptToOrphan’s logic is a bit shorter, as follows.

  • Remove dependents from the current resource ownerReferences
  • Remove the orphan finalizer of the resource (this update event is fetched by the GraphBuilder, and the resource is then eligible for the attemptToDelete queue. (After that, it will be processed by GC and eventually deleted.)

4. Summary

Based on the above process, I attach an overall GC flowchart that I have put together.

overall GC flowchart