Container GC

Exited containers continue to consume system resources: their data stays on the filesystem, and the Docker daemon spends CPU and memory to keep track of them.

Docker itself does not automatically delete exited containers, so the kubelet takes over this responsibility. Container garbage collection (GC) in the kubelet deletes exited containers to free space on the node and improve performance.

While container GC is good for space and performance, deleting a container also wipes the evidence of whatever went wrong in it, which makes debugging and fault location harder, so deleting every exited container is not recommended. Container cleanup therefore follows a policy, which mainly tells the kubelet how many exited containers to keep. The kubelet startup parameters related to container GC are:

  • minimum-container-ttl-duration: how long a container must have been finished before it can be collected; the default is one minute
  • maximum-dead-containers-per-container: how many dead instances are kept per container; the default is 1, and a negative value means no limit
  • maximum-dead-containers: the maximum number of dead containers kept on the node; the default is -1, which means no limit

With these defaults, the kubelet runs container GC every minute, a container becomes eligible for deletion one minute after it exits, and only one exited instance is kept per container.
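
As a rough illustration of how the minimum age and the per-container limit interact for the history of a single container, here is a minimal sketch. The names gcPolicy, deadContainer, selectForRemoval and demo are hypothetical and exist only for this example; they are not kubelet APIs, and the node-wide limit (maximum-dead-containers) is not modeled here.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// gcPolicy mirrors two of the kubelet flags; the struct itself is illustrative.
type gcPolicy struct {
	MinAge             time.Duration // minimum-container-ttl-duration
	MaxPerPodContainer int           // maximum-dead-containers-per-container (<0: no limit)
}

// deadContainer is a hypothetical record of one exited container instance.
type deadContainer struct {
	ID  string
	Age time.Duration // assumed: time elapsed since the instance finished
}

// selectForRemoval returns the IDs to delete from ONE container's history.
func selectForRemoval(dead []deadContainer, p gcPolicy) []string {
	// Instances younger than MinAge are not eligible at all.
	eligible := make([]deadContainer, 0, len(dead))
	for _, c := range dead {
		if c.Age >= p.MinAge {
			eligible = append(eligible, c)
		}
	}
	if p.MaxPerPodContainer < 0 || len(eligible) <= p.MaxPerPodContainer {
		return nil
	}
	// Oldest instances come first and are removed first.
	sort.Slice(eligible, func(i, j int) bool { return eligible[i].Age > eligible[j].Age })
	n := len(eligible) - p.MaxPerPodContainer
	ids := make([]string, 0, n)
	for _, c := range eligible[:n] {
		ids = append(ids, c.ID)
	}
	return ids
}

// demo applies the default-like policy (1 minute, keep 1) to four instances.
func demo() []string {
	dead := []deadContainer{
		{ID: "c1", Age: 10 * time.Minute},
		{ID: "c2", Age: 5 * time.Minute},
		{ID: "c3", Age: 2 * time.Minute},
		{ID: "c4", Age: 30 * time.Second}, // too young, never considered
	}
	return selectForRemoval(dead, gcPolicy{MinAge: time.Minute, MaxPerPodContainer: 1})
}

func main() {
	fmt.Println(demo()) // prints [c1 c2]
}
```

In this sketch the instance that exited 30 seconds ago is ignored entirely, and of the three eligible instances only the newest one (c3) survives.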

type containerGC struct {
    // client talks to the Docker API, e.g. to list containers or inspect a single container
    client           DockerInterface
    podGetter        podGetter
    containerLogsDir string
}

func NewContainerGC(client DockerInterface, podGetter podGetter, containerLogsDir string) *containerGC {
    return &containerGC{
        client:           client,
        podGetter:        podGetter,
        containerLogsDir: containerLogsDir,
    }
}

func (cgc *containerGC) GarbageCollect(gcPolicy kubecontainer.ContainerGCPolicy, allSourcesReady bool) error {
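    // Find the containers that can be cleaned up: those that are not running and were created more than MinAge ago.
    // This step filters out containers that are not managed by the kubelet and sorts the rest by creation time (so the oldest containers are deleted first).
    // evictUnits holds the containers to be collected normally; the second return value holds the containers the kubelet cannot identify.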
    evictUnits, unidentifiedContainers, err := cgc.evictableContainers(gcPolicy.MinAge)
    ......

    // Remove the unidentified containers
    for _, container := range unidentifiedContainers {
        glog.Infof("Removing unidentified dead container %q with ID %q", container.name, container.id)
        err = cgc.client.RemoveContainer(container.id, dockertypes.ContainerRemoveOptions{RemoveVolumes: true})
        if err != nil {
            glog.Warningf("Failed to remove unidentified dead container %q: %v", container.name, err)
        }
    }

    // If a pod no longer exists, remove all of its containers
    if allSourcesReady {
        for key, unit := range evictUnits {
            if cgc.isPodDeleted(key.uid) {
                cgc.removeOldestN(unit, len(unit)) // Remove all.
                delete(evictUnits, key)
            }
        }
    }

    // Enforce the GC policy: each evict unit (a container within a pod) keeps at most MaxPerPodContainer exited containers
    if gcPolicy.MaxPerPodContainer >= 0 {
        cgc.enforceMaxContainersPerEvictUnit(evictUnits, gcPolicy.MaxPerPodContainer)
    }

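    // Enforce the GC policy: the node keeps at most MaxContainers exited containers.
    // First split the limit evenly across the evict units so that each stays at or below the average; if that is still not enough, delete the oldest containers first.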
    if gcPolicy.MaxContainers >= 0 && evictUnits.NumContainers() > gcPolicy.MaxContainers {
        // First trim per evict unit: each one may keep the average share of the total
        numContainersPerEvictUnit := gcPolicy.MaxContainers / evictUnits.NumEvictUnits()
        if numContainersPerEvictUnit < 1 {
            numContainersPerEvictUnit = 1
        }
        cgc.enforceMaxContainersPerEvictUnit(evictUnits, numContainersPerEvictUnit)

        // If the limit is still exceeded, remove individual containers, oldest first
        numContainers := evictUnits.NumContainers()
        if numContainers > gcPolicy.MaxContainers {
            flattened := make([]containerGCInfo, 0, numContainers)
            for uid := range evictUnits {
                flattened = append(flattened, evictUnits[uid]...)
            }
            sort.Sort(byCreated(flattened))

            cgc.removeOldestN(flattened, numContainers-gcPolicy.MaxContainers)
        }
    }

    ......
    return nil
}

This code is the core logic of container GC. It works roughly as follows:

  • first, find the containers that can be cleaned up among those that are not running: containers that meet the cleanup criteria, plus containers the kubelet does not recognize
  • delete the unrecognized containers and the containers whose pod no longer exists
  • delete the remaining containers according to the configured GC policy
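
The node-wide limit in the last step is the least obvious part, so here is a minimal, self-contained sketch of its two phases: an even split of the limit across evict units, followed by oldest-first removal across all units. The types and functions (container, trimUnit, enforceNodeLimit, exampleUnits) are hypothetical simplifications of the logic in GarbageCollect, and timestamps are plain integers for brevity.

```go
package main

import (
	"fmt"
	"sort"
)

// container is a hypothetical dead-container record; Created is a Unix timestamp.
type container struct {
	ID      string
	Created int64
}

// trimUnit keeps the newest `keep` containers of one unit and returns (kept, removed).
func trimUnit(unit []container, keep int) (kept, removed []container) {
	sorted := append([]container(nil), unit...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Created < sorted[j].Created })
	if len(sorted) <= keep {
		return sorted, nil
	}
	cut := len(sorted) - keep
	return sorted[cut:], sorted[:cut]
}

// enforceNodeLimit applies the node-wide cap and returns the removed IDs, sorted by name.
func enforceNodeLimit(units map[string][]container, maxContainers int) []string {
	total := 0
	for _, u := range units {
		total += len(u)
	}
	if maxContainers < 0 || total <= maxContainers || len(units) == 0 {
		return nil
	}
	// Phase 1: give every unit an equal share of the limit, but at least 1.
	perUnit := maxContainers / len(units)
	if perUnit < 1 {
		perUnit = 1
	}
	var removedIDs []string
	var remaining []container
	for _, u := range units {
		kept, removed := trimUnit(u, perUnit)
		remaining = append(remaining, kept...)
		for _, c := range removed {
			removedIDs = append(removedIDs, c.ID)
		}
	}
	// Phase 2: still over the limit, so drop the oldest containers overall.
	if len(remaining) > maxContainers {
		sort.Slice(remaining, func(i, j int) bool { return remaining[i].Created < remaining[j].Created })
		for _, c := range remaining[:len(remaining)-maxContainers] {
			removedIDs = append(removedIDs, c.ID)
		}
	}
	sort.Strings(removedIDs) // map iteration order is random; sort for stable output
	return removedIDs
}

// exampleUnits returns three evict units holding five dead containers in total.
func exampleUnits() map[string][]container {
	return map[string][]container{
		"pod-a/app": {{"a1", 100}, {"a2", 200}, {"a3", 300}},
		"pod-b/app": {{"b1", 150}},
		"pod-c/app": {{"c1", 50}},
	}
}

func main() {
	fmt.Println(enforceNodeLimit(exampleUnits(), 2)) // prints [a1 a2 c1]
}
```

With a limit of 2 and three units, the per-unit share rounds down to 0 and is raised to 1, so phase 1 removes a1 and a2; three containers remain, and phase 2 removes the oldest of them, c1.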

Image GC

Images mainly take up disk space. Although Docker uses image layering so that multiple images can share storage, a long-running node that downloads many images can still use too much of it, and if images fill up the disk, applications stop working properly. Docker does not clean up images by default: once downloaded, they stay on the local disk until they are deleted manually.

Many of these images are never actually used, so leaving them on disk is both a waste of space and a risk. The kubelet therefore also cleans up images periodically.

Unlike container cleanup, image cleanup is driven by disk usage: users configure the percentage of disk usage at which cleanup starts. Cleanup removes the images that have gone unused the longest first, and an image's last-used time is updated whenever it is pulled or used by a container.

When starting the kubelet, you can set these parameters to control the image cleanup policy:

  • image-gc-high-threshold: the upper limit of disk usage; image cleanup is triggered when usage reaches it. The default value is 90%.
  • image-gc-low-threshold: the lower limit of disk usage; a cleanup run does not stop until usage falls below this value or there are no more images to clean up. The default value is 80%.
  • minimum-image-ttl-duration: an image is cleaned up only if it has been unused for at least this long. It accepts the time units h (hours), m (minutes), s (seconds) and ms (milliseconds); the default is 2m (two minutes).

That is, by default, the kubelet starts cleaning up when usage of the disk that holds the images reaches 90%, and keeps deleting images until usage falls below 80%.
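
To make the two thresholds concrete, the following sketch works through the arithmetic for one sample disk. The function name bytesToFree and the sample figures (a 100 GiB image disk with 8 GiB available) are assumptions made for illustration; the formula follows the usagePercent and amountToFree computation in realImageGCManager.GarbageCollect().

```go
package main

import "fmt"

// bytesToFree returns the disk usage in percent and the number of bytes a
// cleanup run would try to free. It returns 0 bytes when usage is below the
// high threshold or when the capacity is invalid.
func bytesToFree(capacity, available int64, highPct, lowPct int) (usagePct int, toFree int64) {
	if capacity <= 0 {
		return 0, 0
	}
	if available > capacity {
		available = capacity // available space can never exceed the capacity
	}
	usagePct = 100 - int(available*100/capacity)
	if usagePct < highPct {
		return usagePct, 0 // below the trigger: nothing to do
	}
	// Target: leave (100 - lowPct)% of the disk free.
	toFree = capacity*int64(100-lowPct)/100 - available
	return usagePct, toFree
}

func main() {
	const gib = int64(1) << 30
	usage, toFree := bytesToFree(100*gib, 8*gib, 90, 80)
	fmt.Printf("usage=%d%% toFree=%d GiB\n", usage, toFree/gib) // prints usage=92% toFree=12 GiB
}
```

At 92% usage the 90% trigger is exceeded, and getting back to 80% usage means 20 GiB must be free, so 12 GiB have to be reclaimed on top of the 8 GiB already available.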

Parameter configuration

Users can tune image garbage collection by adjusting the thresholds with the following kubelet parameters.

  1. image-gc-high-threshold, the percentage of disk usage that triggers image garbage collection. The default value is 85 in recent kubelet releases (the 90% quoted above is the default of older releases). If this value is set to 100, image garbage collection is disabled.
  2. image-gc-low-threshold, the percentage of disk usage that image garbage collection tries to get down to by freeing resources. The default value is 80.
  3. minimum-image-ttl-duration, the minimum age an unused image must reach before it can be collected. The default value is 2m0s.

The following events may be reported during garbage collection.

  • ContainerGCFailed: container garbage collection runs every 1min; this event is reported if a run fails.
  • ImageGCFailed: image garbage collection runs every 5min; this event is reported if it fails on consecutive runs.
  • FreeDiskSpaceFailed: reported if image garbage collection frees less space than required.
  • InvalidDiskCapacity: reported if the capacity of the image disk is 0.

ImageGCManager

type ImageGCManager interface {
    // Applies the garbage collection policy; returns an error if the policy cannot free enough space
    GarbageCollect() error
    // Starts asynchronous garbage collection of images
    Start()

    GetImageList() ([]container.Image, error)
    // Deletes all unused images
    DeleteUnusedImages() error
}

Initialization

The ImageGCManager is initialized in the kubelet.NewMainKubelet() method.

// setup imageManager
imageManager, err := images.NewImageGCManager(klet.containerRuntime, klet.StatsProvider, kubeDeps.Recorder, nodeRef, imageGCPolicy, crOptions.PodSandboxImage)
if err != nil {
    return nil, fmt.Errorf("failed to initialize image manager: %v", err)
}
klet.imageManager = imageManager

realImageGCManager.Start()

ImageGCManager is started in the kubelet.initializeModules() method. Once started, it runs two tasks asynchronously:

  • Every 5min, detect the images in use and refresh the image records.
  • Every 30s, refresh the image cache.
func (im *realImageGCManager) Start() {
    go wait.Until(func() {
        var ts time.Time
        if im.initialized {
            ts = time.Now()
        }
        _, err := im.detectImages(ts) // refresh the image records and return the images currently in use
        if err != nil {
            klog.Warningf("[imageGCManager] Failed to monitor images: %v", err)
        } else {
            im.initialized = true
        }
    }, 5*time.Minute, wait.NeverStop) // detect every 5min

    // Refresh the image cache every 30s
    go wait.Until(func() {
        images, err := im.runtime.ListImages()
        if err != nil {
            klog.Warningf("[imageGCManager] Failed to update image list: %v", err)
        } else {
            im.imageCache.set(images)
        }
    }, 30*time.Second, wait.NeverStop)

}
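
Start() relies on wait.Until from k8s.io/apimachinery, which calls a function, waits one period, and repeats until the stop channel is closed. The sketch below approximates that behaviour with the standard library only, to show how two loops with different periods run side by side. The helpers until and demoRuns are hypothetical, the periods are shortened to milliseconds, and this is not the real wait.Until implementation.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// until is a simplified stand-in for wait.Until: call f, wait one period,
// and repeat until stop is closed.
func until(f func(), period time.Duration, stop <-chan struct{}) {
	for {
		select {
		case <-stop:
			return
		default:
		}
		f()
		select {
		case <-stop:
			return
		case <-time.After(period):
		}
	}
}

// demoRuns starts a slow and a fast loop, as Start() does with its two tasks,
// and returns how often each one ran before both were stopped.
func demoRuns() (slow, fast int64) {
	stop := make(chan struct{})
	done := make(chan struct{}, 2)
	go func() {
		until(func() { atomic.AddInt64(&slow, 1) }, 50*time.Millisecond, stop)
		done <- struct{}{}
	}()
	go func() {
		until(func() { atomic.AddInt64(&fast, 1) }, 5*time.Millisecond, stop)
		done <- struct{}{}
	}()
	time.Sleep(120 * time.Millisecond)
	close(stop)
	// Wait for both loops to exit before reading the counters.
	<-done
	<-done
	return atomic.LoadInt64(&slow), atomic.LoadInt64(&fast)
}

func main() {
	slow, fast := demoRuns()
	fmt.Printf("slow loop ran %d times, fast loop ran %d times\n", slow, fast)
}
```

Because each loop runs in its own goroutine, a slow image detection pass does not delay the more frequent cache refresh. The exact counts depend on scheduling, so none are given here.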

Start garbage collection

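When the kubelet starts, it launches garbage collection in asynchronous goroutines, which

  • perform a container garbage collection every 1min, and report the event ContainerGCFailed if it fails.
  • perform an image garbage collection every 5min, and report the event ImageGCFailed if it fails repeatedly.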

    func (kl *Kubelet) StartGarbageCollection() {
        loggedContainerGCFailure := false
        go wait.Until(func() {
            if err := kl.containerGC.GarbageCollect(); err != nil {  // runs every 1min; on failure, report the ContainerGCFailed event
                klog.Errorf("Container garbage collection failed: %v", err)
                kl.recorder.Eventf(kl.nodeRef, v1.EventTypeWarning, events.ContainerGCFailed, err.Error())
                loggedContainerGCFailure = true
            } else {
                var vLevel klog.Level = 4
                if loggedContainerGCFailure {
                    vLevel = 1
                    loggedContainerGCFailure = false
                }
    
                klog.V(vLevel).Infof("Container garbage collection succeeded")
            }
        }, ContainerGCPeriod, wait.NeverStop)
    
        // If --image-gc-high-threshold=100, image garbage collection is disabled.
        if kl.kubeletConfiguration.ImageGCHighThresholdPercent == 100 {
            klog.V(2).Infof("ImageGCHighThresholdPercent is set 100, Disable image GC")
            return
        }
    
        prevImageGCFailed := false
        go wait.Until(func() {
            if err := kl.imageManager.GarbageCollect(); err != nil { // runs every 5min; report the ImageGCFailed event on repeated failure
                if prevImageGCFailed {
                    klog.Errorf("Image garbage collection failed multiple times in a row: %v", err)
                    // Only create an event for repeated failures
                    kl.recorder.Eventf(kl.nodeRef, v1.EventTypeWarning, events.ImageGCFailed, err.Error())
                } else {
                    klog.Errorf("Image garbage collection failed once. Stats initialization may not have completed yet: %v", err)
                }
                prevImageGCFailed = true
            } else {
                var vLevel klog.Level = 4
                if prevImageGCFailed {
                    vLevel = 1
                    prevImageGCFailed = false
                }
    
                klog.V(vLevel).Infof("Image garbage collection succeeded")
            }
        }, ImageGCPeriod, wait.NeverStop)
    }
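
The image GC loop above treats a first failure differently from a repeated one: the first is only logged, and an event is emitted once failures occur back to back. That pattern can be isolated in a few lines. The type failureReporter, its observe method, and the errGC value are hypothetical names for this sketch and do not exist in the kubelet.

```go
package main

import (
	"errors"
	"fmt"
)

// errGC is a sample error used by the example.
var errGC = errors.New("image garbage collection failed")

// failureReporter remembers whether the previous run failed, like the
// prevImageGCFailed variable in StartGarbageCollection.
type failureReporter struct {
	prevFailed bool
}

// observe takes the result of one GC run and reports whether an event should
// be emitted. Only a failure that directly follows another failure qualifies.
func (r *failureReporter) observe(err error) (emitEvent bool) {
	if err == nil {
		r.prevFailed = false // a success resets the streak
		return false
	}
	emitEvent = r.prevFailed
	r.prevFailed = true
	return emitEvent
}

func main() {
	r := &failureReporter{}
	for _, err := range []error{errGC, errGC, nil, errGC} {
		fmt.Println(r.observe(err))
	}
	// prints false, true, false, false (one per line)
}
```

The first failure may simply mean that stats initialization has not finished, which is why it is tolerated. The container GC loop does not use this pattern and reports an event on every failure.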
    

realImageGCManager.GarbageCollect()

Image garbage collection runs as follows:

  1. get the image disk information from cadvisor.
  2. calculate the disk capacity and disk usage.
  3. if disk usage has reached the upper limit set by --image-gc-high-threshold, perform image garbage collection.
  4. if the space freed by image garbage collection falls short of the expected amount, report a FreeDiskSpaceFailed event.
func (im *realImageGCManager) GarbageCollect() error {
    // Get the image filesystem stats from cadvisor
    fsStats, err := im.statsProvider.ImageFsStats()
    if err != nil {
        return err
    }

    var capacity, available int64
    if fsStats.CapacityBytes != nil { // capacity of the image filesystem
        capacity = int64(*fsStats.CapacityBytes)
    }
    if fsStats.AvailableBytes != nil { // available space on the image filesystem
        available = int64(*fsStats.AvailableBytes)
    }

    if available > capacity { // clamp the available space to the capacity
        klog.Warningf("available %d is larger than capacity %d", available, capacity)
        available = capacity
    }

    // Check valid capacity.
    if capacity == 0 { // if the capacity is 0, report the InvalidDiskCapacity event
        err := goerrors.New("invalid capacity 0 on image filesystem")
        im.recorder.Eventf(im.nodeRef, v1.EventTypeWarning, events.InvalidDiskCapacity, err.Error())
        return err
    }

    usagePercent := 100 - int(available*100/capacity) // disk usage in percent
    if usagePercent >= im.policy.HighThresholdPercent { // usage has reached the high threshold
        amountToFree := capacity*int64(100-im.policy.LowThresholdPercent)/100 - available // number of bytes to free
        klog.Infof("[imageGCManager]: Disk usage on image filesystem is at %d%% which is over the high threshold (%d%%). Trying to free %d bytes down to the low threshold (%d%%).", usagePercent, im.policy.HighThresholdPercent, amountToFree, im.policy.LowThresholdPercent)
        freed, err := im.freeSpace(amountToFree, time.Now()) // remove images and return the number of bytes freed
        if err != nil {
            return err
        }

        if freed < amountToFree { // if too little space was freed, report the FreeDiskSpaceFailed event
            err := fmt.Errorf("failed to garbage collect required amount of images. Wanted to free %d bytes, but freed %d bytes", amountToFree, freed)
            im.recorder.Eventf(im.nodeRef, v1.EventTypeWarning, events.FreeDiskSpaceFailed, err.Error())
            return err
        }
    }

    return nil
}

Freeing disk space (freeSpace)

The detailed process of image garbage collection is as follows:

  1. list all images that are not in use.
  2. sort them by last-used time and then by detection time, oldest first.
  3. iterate through the list and clean up the images in that order.
  4. check again whether each image is in use, and skip it if so. Also check when the image was first detected and skip images that were pulled only a short time ago, because they may have just been pulled and are about to be used by a container.
  5. call the runtime interface to delete unused images until enough space has been freed.
func (im *realImageGCManager) freeSpace(bytesToFree int64, freeTime time.Time) (int64, error) {
    imagesInUse, err := im.detectImages(freeTime) // refresh the image records and return the images currently in use
    if err != nil {
        return 0, err
    }

    im.imageRecordsLock.Lock()
    defer im.imageRecordsLock.Unlock()

    // Collect all images that are not in use
    images := make([]evictionInfo, 0, len(im.imageRecords))
    for image, record := range im.imageRecords {
        if isImageUsed(image, imagesInUse) {
            klog.V(5).Infof("Image ID %s is being used", image)
            continue
        }
        images = append(images, evictionInfo{
            id:          image,
            imageRecord: *record,
        })
    }
    sort.Sort(byLastUsedAndDetected(images))  // sort by last-used time, then by first-detected time
    // Delete unused images until enough space has been freed
    var deletionErrors []error
    spaceFreed := int64(0)
    for _, image := range images {
        klog.V(5).Infof("Evaluating image ID %s for possible garbage collection", image.id)
        // Check again whether the image is in use
        if image.lastUsed.Equal(freeTime) || image.lastUsed.After(freeTime) {
            klog.V(5).Infof("Image ID %s has lastUsed=%v which is >= freeTime=%v, not eligible for garbage collection", image.id, image.lastUsed, freeTime)
            continue
        }
        // Skip recently pulled images: they may have just been pulled and are about to be used by a container
        if freeTime.Sub(image.firstDetected) < im.policy.MinAge {
            klog.V(5).Infof("Image ID %s has age %v which is less than the policy's minAge of %v, not eligible for garbage collection", image.id, freeTime.Sub(image.firstDetected), im.policy.MinAge)
            continue
        }
        // Remove the image; carry on with the next one even if an error occurs
        klog.Infof("[imageGCManager]: Removing image %q to free %d bytes", image.id, image.size)
        err := im.runtime.RemoveImage(container.ImageSpec{Image: image.id})
        if err != nil {
            deletionErrors = append(deletionErrors, err)
            continue
        }
        delete(im.imageRecords, image.id)
        spaceFreed += image.size

        if spaceFreed >= bytesToFree {
            break
        }
    }

    if len(deletionErrors) > 0 {
        return spaceFreed, fmt.Errorf("wanted to free %d bytes, but freed %d bytes space with errors in image deletion: %v", bytesToFree, spaceFreed, errors.NewAggregate(deletionErrors))
    }
    return spaceFreed, nil
}
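
The selection logic of freeSpace can be tried out without a container runtime. The sketch below keeps only the ordering and the two skip rules; the names imageRecord, pickImages and exampleImages are hypothetical, timestamps and ages are plain seconds rather than time.Time values, and the actual image removal is replaced by collecting IDs.

```go
package main

import (
	"fmt"
	"sort"
)

// imageRecord is a hypothetical record of an unused image; times are in seconds.
type imageRecord struct {
	ID            string
	LastUsed      int64
	FirstDetected int64
	Size          int64
}

// pickImages returns the IDs that would be deleted, in deletion order, and
// the number of bytes that deleting them would free.
func pickImages(unused []imageRecord, now, minAge, bytesToFree int64) (ids []string, freed int64) {
	sorted := append([]imageRecord(nil), unused...)
	// Least recently used first; ties are broken by first-detected time.
	sort.Slice(sorted, func(i, j int) bool {
		if sorted[i].LastUsed == sorted[j].LastUsed {
			return sorted[i].FirstDetected < sorted[j].FirstDetected
		}
		return sorted[i].LastUsed < sorted[j].LastUsed
	})
	for _, img := range sorted {
		if img.LastUsed >= now {
			continue // in use as of this GC pass
		}
		if now-img.FirstDetected < minAge {
			continue // detected too recently, may be about to be used
		}
		ids = append(ids, img.ID)
		freed += img.Size
		if freed >= bytesToFree {
			break // enough space reclaimed
		}
	}
	return ids, freed
}

// exampleImages returns four unused images with distinct usage histories.
func exampleImages() []imageRecord {
	return []imageRecord{
		{ID: "new", LastUsed: 900, FirstDetected: 800, Size: 200},
		{ID: "old", LastUsed: 100, FirstDetected: 50, Size: 300},
		{ID: "fresh", LastUsed: 0, FirstDetected: 950, Size: 500}, // never used, just pulled
		{ID: "mid", LastUsed: 400, FirstDetected: 300, Size: 400},
	}
}

func main() {
	ids, freed := pickImages(exampleImages(), 1000, 120, 600)
	fmt.Println(ids, freed) // prints [old mid] 700
}
```

The image "fresh" sorts first because it was never used, yet it survives because it was detected only 50 seconds before the pass, which is below the 120 second minimum age. Deleting "old" and "mid" frees 700 bytes, enough to meet the 600 byte target, so "new" is never examined.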