For unstructured data storage systems, LIST operations are usually very heavyweight: they not only consume a lot of disk IO, network bandwidth and CPU, but also affect other requests in the same time window (especially latency-sensitive leader-election requests), making them a major killer of cluster stability.

For example, for Ceph object storage, each LIST bucket request has to go to multiple disks to retrieve all of the bucket's data; it is not only slow itself, but since IO is shared it also affects other ordinary read and write requests in the same time window, increasing their response latency and even causing timeouts. If there are many objects in the bucket (e.g. when Ceph is the storage backend for harbor/docker-registry), LIST operations cannot even complete within a regular timeout (and thus registry GC, which relies on LIST bucket operations, cannot run).

Compared to Ceph, a production etcd cluster stores a relatively small amount of data (a few to tens of gigabytes), small enough to be cached entirely in memory. Unlike Ceph, however, the number of concurrent requests it faces can be several orders of magnitude higher, e.g. the etcd serving a ~4000-node k8s cluster. A single LIST request may only need to return a few tens of MB to a few GB of data, but too many concurrent requests are clearly more than etcd can handle, so it is better to put a cache layer in front of it, which is (partly) what the apiserver does. Most of K8s' LIST requests should be absorbed by the apiserver and served from its local cache, but if used improperly they can skip the cache and hit etcd directly, which carries significant stability risks.

This article digs into the processing logic and performance bottlenecks of LIST operations in the k8s apiserver/etcd path, and gives some recommendations on LIST stress testing, deployment and tuning of basic services, to improve the stability of large-scale K8s clusters.

The kube-apiserver LIST request processing logic is as follows.

kube-apiserver LIST request processing logic

The code is based on v1.24.0. However, the basic logic and code path of 1.19~1.24 are the same, so you can cross-reference if you need.

1 Introduction

1.1 K8s Architecture: A Hierarchical View of the Ring

From an architectural hierarchy and component dependency perspective, the analogy between a K8s cluster and a Linux host can be made as follows.

Fig 1. Analogy: a Linux host and a Kubernetes cluster

For a K8s cluster, the components and their roles, from the inside out:

  1. etcd: persistent KV storage, the sole authoritative data (state) source for cluster resources (pods/services/networkpolicies/…).
  2. apiserver: reads (ListWatch) the full amount of data from etcd and caches it in memory; stateless service, horizontally scalable.
  3. various basic services (e.g. kubelet, *-agent, *-operator): connect to apiserver and get (List/ListWatch) the data they each need.
  4. workloads within the cluster: created, managed and reconciled by the services in 3, provided 1 and 2 are healthy; e.g. kubelet creates pods, cilium configures networking and security policies.

1.2 The apiserver/etcd role

As you can see above, there are two levels of List/ListWatch in the system path (but the data is the same copy).

  1. apiserver List/ListWatch etcd
  2. base service List/ListWatch apiserver

So, in its simplest form, the apiserver is a caching proxy in front of etcd.

           +--------+              +---------------+                 +------------+
           | Client | -----------> | Proxy (cache) | --------------> | Data store |
           +--------+              +---------------+                 +------------+

         infra services               apiserver                         etcd
  1. In the vast majority of cases, the apiserver serves requests directly from its local cache (since it caches the full data of the cluster).
  2. In some special cases the apiserver has to forward the request to etcd, for example:
    1. the client explicitly requests to read the data from etcd (for the strongest consistency), or
    2. the apiserver local cache is not yet built.
    Pay special attention here: improperly set LIST parameters on the client side can also trigger this path.

1.3 apiserver/etcd List overhead

1.3.1 Example of a request

Consider the following LIST operations.

  1. LIST apis/cilium.io/v2/ciliumendpoints?limit=500&resourceVersion=0 Here both parameters are passed, but resourceVersion=0 causes apiserver to ignore limit=500, so the client gets the full ciliumendpoints data. The full data of a resource can be quite large, so think through whether you really need all of it. The quantitative measurement and analysis method is described later.

  2. LIST api/v1/pods?fieldSelector=spec.nodeName%3Dnode1 This request fetches all pods on node1 (%3D is the URL escape of =). Filtering by node name may give the impression that the amount of data involved is small, but it is more complicated behind the scenes than it looks:

    • First, resourceVersion=0 is not specified here, causing apiserver to skip the cache and go directly to etcd to read the data.
    • Secondly, etcd is just KV storage, with no filtering by label/field (only limit/continue is handled).
    • So, apiserver is pulling the full amount of data from etcd and then doing filtering in memory, which is also a lot of overhead, as analyzed in code later. This behavior is to be avoided unless there is an extremely high demand for data accuracy and you purposely want to bypass the apiserver cache.
  3. **LIST api/v1/pods?fieldSelector=spec.nodeName%3Dnode1&resourceVersion=0** The difference from 2 is that resourceVersion=0 is added, so apiserver reads the data from its cache, giving an order-of-magnitude performance improvement. But note that while what is actually returned to the client may only be a few hundred KB to a few hundred MB (depending on the number of pods on the node, the number of labels on the pods, etc.), the amount of data apiserver has to process can be several GB. A quantitative analysis follows later.

As you can see above, the impact of different LIST operations differs greatly, and the client may see only a small fraction of the data processed by apiserver/etcd. If base services start or restart en masse, they can easily blow up the control plane.
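To make the three variants above concrete, here is a minimal client-go sketch (illustrative only; it assumes an already-initialized clientset, and the node name is made up) showing how each request is expressed and which path it takes:

```go
package example

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// listVariants issues the three LIST variants discussed above.
func listVariants(ctx context.Context, clientset kubernetes.Interface) error {
	// 1. resourceVersion=0: served from the apiserver cache; the limit is ignored,
	//    so the full pod data set comes back anyway.
	if _, err := clientset.CoreV1().Pods("").List(ctx, metav1.ListOptions{
		Limit:           500,
		ResourceVersion: "0",
	}); err != nil {
		return err
	}

	// 2. No resourceVersion: apiserver bypasses its cache, pulls the full range from
	//    etcd, and filters by the field selector in memory. Avoid this unless you
	//    really need the strongest consistency.
	if _, err := clientset.CoreV1().Pods("").List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=node1",
	}); err != nil {
		return err
	}

	// 3. Field selector + resourceVersion=0: filtered from the apiserver cache; the cheap path.
	_, err := clientset.CoreV1().Pods("").List(ctx, metav1.ListOptions{
		FieldSelector:   "spec.nodeName=node1",
		ResourceVersion: "0",
	})
	return err
}
```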

1.3.2 Processing overhead

List requests can be divided into two types.

  1. list full amount of data: the overhead is spent mainly on data transfer.
  2. list with a label or field filter: only the matching data needs to be returned.

The second case, where the LIST request carries a filter, deserves special attention.

  • In most cases, apiserver will use its own cache to do the filtering, which is fast, so **time is mostly spent on data transfer**.
  • In the case where the request has to be forwarded to etcd: as mentioned earlier, etcd is just KV storage and does not understand label/field information, so it cannot handle filtering. The actual process is: apiserver pulls the full data from etcd, filters it in memory, and returns the result to the client. So in addition to the data transfer overhead (network bandwidth), this case also consumes a lot of apiserver CPU and memory.
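To make the in-memory filtering step concrete, here is a small standalone sketch using the apimachinery labels package (the objects and labels are made up; apiserver's real predicate code is more involved, but the matching idea is the same):

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/labels"
)

func main() {
	// The selector a request like ?labelSelector=app=nginx is parsed into.
	sel, err := labels.Parse("app=nginx")
	if err != nil {
		panic(err)
	}

	// apiserver walks every object of the requested kind (from its cache, or the
	// full range pulled from etcd) and keeps only the ones whose labels match.
	objects := []labels.Set{
		{"app": "nginx", "tier": "frontend"},
		{"app": "redis"},
	}
	for _, obj := range objects {
		fmt.Println(obj, "matches:", sel.Matches(obj))
	}
}
```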

1.4 Potential problems with large scale deployments

As another example, the following line of code uses k8s client-go to filter pods by nodeName:

 podList, err := Client().CoreV1().Pods("").List(ctx(), metav1.ListOptions{FieldSelector: "spec.nodeName=node1"})

It looks like a very simple operation, so let's look at the amount of data behind it. Using a 4000-node, 100K-pod cluster as an example, the full pod data is:

  1. in etcd: compact, unstructured KV storage, on the order of 1GB.
  2. in the apiserver cache: already structured golang objects, on the order of 2GB (TODO: further confirmation required).
  3. in the apiserver response: the client generally receives the default json format, which is also structured data; the json for the full pod set is also in the 2GB range.

As you can see, some requests may look simple, a matter of a single line of client code, but the amount of data behind them is staggering. Specifying a pod filter by nodeName may return only 500KB of data, but apiserver needs to filter 2GB of data; in the worst case, etcd has to process its 1GB of data as well (and the parameter configuration above does hit the worst case, see the code analysis below).

When the cluster is small, this problem may not be visible (etcd only starts printing warning logs after the LIST response latency exceeds a certain threshold); when the cluster is large, a modest number of such requests is enough to overwhelm apiserver/etcd.
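A back-of-the-envelope estimate for the example cluster above (the pod count and average per-pod JSON size are the rough figures used in this article):

```go
package main

import "fmt"

func main() {
	const (
		pods         = 100_000 // ~100K pods in the example cluster
		avgPodJSONKB = 20      // rough average JSON size per pod
	)
	// ~1953 MB, i.e. roughly 2 GB of JSON for one unfiltered LIST of all pods.
	fmt.Printf("full pod list ~= %d MB\n", pods*avgPodJSONKB/1024)
}
```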

1.5 Purpose of this article

To deepen the understanding of these performance issues by digging into the k8s List/ListWatch implementation, and to provide some reference for optimizing the stability of large-scale K8s clusters.

2 apiserver List() operation source code analysis

With the above theoretical warm-up, you can next look at the code implementation.

2.1 Call stack and flowchart

store.List
|-store.ListPredicate
   |-if opt == nil
   |   opt = ListOptions{ResourceVersion: ""}
   |-Init SelectionPredicate.Limit/Continue field
   |-list := e.NewListFunc()                               // objects will be stored in this list
   |-storageOpts := storage.ListOptions{opt.ResourceVersion, opt.ResourceVersionMatch, Predicate: p}
   |
   |-if MatchesSingle ok                                   // 1. when "metadata.name" is specified,  get single obj
   |   // Get single obj from cache or etcd
   |
   |-return e.Storage.List(KeyRootFunc(ctx), storageOpts)  // 2. get all objs and perform filtering
      |-cacher.List()
         | // case 1: list all from etcd and filter in apiserver
         |-if shouldDelegateList(opts)                     // true if resourceVersion == ""
         |    return c.storage.List                        // list from etcd
         |             |- fromRV *int64 = nil
         |             |- if len(storageOpts.ResourceVersion) > 0
         |             |     rv = ParseResourceVersion
         |             |     fromRV = &rv
         |             |
         |             |- for hasMore {
         |             |    objs := etcdclient.KV.Get()
         |             |    filter(objs)                   // filter by labels or fields
         |             | }
         |
         | // case 2: apiserver cache not ready yet, fall back to etcd
         |-if cache.notready()
         |   return c.storage.List                         // get from etcd
         |
         | // case 3: list & filter from apiserver local cache (memory)
         |-obj := watchCache.WaitUntilFreshAndGet
         |-for elem in obj.(*storeElement)
         |   listVal.Set()                                 // append results to listObj
         |-return  // results stored in listObj

Corresponding flow chart.

Fig 2-1. List operation processing in apiserver

2.2 Request processing entry: List()

// https://github.com/kubernetes/kubernetes/blob/v1.24.0/staging/src/k8s.io/apiserver/pkg/registry/generic/registry/store.go#L361

// Filter by the LabelSelector and FieldSelector specified via PredicateFunc, and return a list of objects
func (e *Store) List(ctx, options *metainternalversion.ListOptions) (runtime.Object, error) {
    label := labels.Everything()
    if options != nil && options.LabelSelector != nil
        label = options.LabelSelector // label filter, e.g. app=nginx

    field := fields.Everything()
    if options != nil && options.FieldSelector != nil
        field = options.FieldSelector // field filter, e.g. spec.nodeName=node1

    out := e.ListPredicate(ctx, e.PredicateFunc(label, field), options) // pull (List) the data and filter it (Predicate)
    if e.Decorator != nil
        e.Decorator(out)

    return out, nil
}

2.3 ListPredicate()

// https://github.com/kubernetes/kubernetes/blob/v1.24.0/staging/src/k8s.io/apiserver/pkg/registry/generic/registry/store.go#L411

func (e *Store) ListPredicate(ctx , p storage.SelectionPredicate, options *metainternalversion.ListOptions) (runtime.Object, error) {
    // Step 1: initialization
    if options == nil
        options = &metainternalversion.ListOptions{ResourceVersion: ""}

    p.Limit    = options.Limit
    p.Continue = options.Continue
    list      := e.NewListFunc()        // the returned results will be stored here
    storageOpts := storage.ListOptions{ // convert the API-side ListOptions into the storage-side ListOptions; field differences are described below
        ResourceVersion:      options.ResourceVersion,
        ResourceVersionMatch: options.ResourceVersionMatch,
        Predicate:            p,
        Recursive:            true,
    }

    // Step 2: if the request specifies metadata.name, fetch that single object; no need to filter the full data
    if name, ok := p.MatchesSingle(); ok { // check whether the metadata.name field is set
        if key := e.KeyFunc(ctx, name); err == nil { // get this object's key in etcd (unique, or non-existent)
            storageOpts.Recursive = false
            e.Storage.GetList(ctx, key, storageOpts, list)
            return list
        }
        // else: if we get here, no key for filtering could be obtained from the context, so fall back to fetching the full data and filtering below
    }

    // Step 3: filter the full data
    e.Storage.GetList(ctx, e.KeyRootFunc(), storageOpts, list) // KeyRootFunc() returns the root key (i.e. the prefix, without the trailing /) of this resource type in etcd
    return list
}

The code above is from 1.24.0, where cases 1 & 2 both call e.Storage.GetList(); previous versions used different methods:

  • e.Storage.GetToList() in case 1
  • e.Storage.List() in case 2

But the basic process is the same:

  1. If the client does not pass ListOptions, a default value is initialized in which ResourceVersion is the empty string; this causes apiserver to pull the data from etcd and return it to the client instead of serving it from the local cache. Note that this is different from ResourceVersion=0: for example, when the client lists ciliumendpoints with ListOptions{Limit: 500, ResourceVersion: "0"}, the request sent is /apis/cilium.io/v2/ciliumendpoints?limit=500&resourceVersion=0, which is served from the cache. The exact behavior of the empty-string ResourceVersion will be seen when it is parsed later.
  2. initialize the limit/continue fields of the filter (SelectionPredicate) with the fields in listoptions respectively.
  3. initialize the returned result, list := e.NewListFunc().
  4. convert the API-side ListOptions into the ListOptions of the underlying storage; the field differences are shown below. metainternalversion.ListOptions is the API-side structure:
 // staging/src/k8s.io/apimachinery/pkg/apis/meta/internalversion/types.go
    
 // ListOptions is the query options to a standard REST list call.
 type ListOptions struct {
     metav1.TypeMeta
    
     LabelSelector labels.Selector // label filter, e.g. app=nginx
     FieldSelector fields.Selector // field filter, e.g. spec.nodeName=node1
    
     Watch bool
     AllowWatchBookmarks bool
     ResourceVersion string
     ResourceVersionMatch metav1.ResourceVersionMatch
    
     TimeoutSeconds *int64         // Timeout for the list/watch call.
     Limit int64
     Continue string               // a token returned by the server. return a 410 error if the token has expired.
 }

storage.ListOptions is the structure passed to the underlying storage, with a few differences in the fields:

 // staging/src/k8s.io/apiserver/pkg/storage/interfaces.go
    
 // ListOptions provides the options that may be provided for storage list operations.
 type ListOptions struct {
     ResourceVersion string
     ResourceVersionMatch metav1.ResourceVersionMatch
     Predicate SelectionPredicate // Predicate provides the selection rules for the list operation.
     Recursive bool               // true: 'key' is treated as a prefix (list everything under it); false: 'key' is an exact match (single object)
     ProgressNotify bool          // storage-originated bookmark, ignored for non-watch requests.
 }

2.4 The request specifies a resource name: Get a single object

Next, depending on whether meta.Name is specified in the request, there are two cases.

  1. if specified, it is a query for a single object, since the name is unique (within a namespace), and the logic reduces to fetching that one object.
  2. if it is not specified, you need to get the full amount of data, and then filter it in apiserver memory according to the filter conditions in SelectionPredicate, and return the final result to the client.

The code is as follows.

    // case 1: get a single object by metadata.name; no need to filter the full data
    if name, ok := p.MatchesSingle(); ok { // check whether the metadata.name field is set
        if key := e.KeyFunc(ctx, name); err == nil {
            e.Storage.GetList(ctx, key, storageOpts, list)
            return list
        }
        // else: if we get here, no key for filtering could be obtained from the context, so fall back to fetching the full data and filtering below
    }

e.Storage is an Interface.

// staging/src/k8s.io/apiserver/pkg/storage/interfaces.go

// Interface offers a common interface for object marshaling/unmarshaling operations and
// hides all the storage-related operations behind it.
type Interface interface {
    Create(ctx , key string, obj, out runtime.Object, ttl uint64) error
    Delete(ctx , key string, out runtime.Object, preconditions *Preconditions,...)
    Watch(ctx , key string, opts ListOptions) (watch.Interface, error)
    Get(ctx , key string, opts GetOptions, objPtr runtime.Object) error

    // unmarshall objects found at key into a *List api object (an object that satisfies runtime.IsList definition).
    // If 'opts.Recursive' is false, 'key' is used as an exact match; if is true, 'key' is used as a prefix.
    // The returned contents may be delayed, but it is guaranteed that they will
    // match 'opts.ResourceVersion' according 'opts.ResourceVersionMatch'.
    GetList(ctx , key string, opts ListOptions, listObj runtime.Object) error
    // ... (other methods omitted)
}

e.Storage.GetList() ends up in the cacher code.

Whether fetching a single object or the full amount of data, it goes through a similar process.

  1. try the apiserver local cache first (the decision depends on ResourceVersion, etc.), and
  2. go to etcd as a last resort.

The logic of getting individual objects is relatively simple, so we won’t look at it here. The next step is to look at the logic of filtering the full amount of data in the list.

2.5 Request without a resource name: get the full data and filter it

2.5.1 apiserver cache layer: GetList() processing logic

// https://github.com/kubernetes/kubernetes/blob/v1.24.0/staging/src/k8s.io/apiserver/pkg/storage/cacher/cacher.go#L622

// GetList implements storage.Interface
func (c *Cacher) GetList(ctx , key string, opts storage.ListOptions, listObj runtime.Object) error {
    recursive := opts.Recursive
    resourceVersion := opts.ResourceVersion
    pred := opts.Predicate

    // case 1: ListOptions requires reading from etcd
    if shouldDelegateList(opts)
        return c.storage.GetList(ctx, key, opts, listObj) // c.storage points to etcd

    // If resourceVersion is specified, serve it from cache.
    listRV := c.versioner.ParseResourceVersion(resourceVersion)

    // case 2: the apiserver cache is not ready yet, so we can only read from etcd
    if listRV == 0 && !c.ready.check()
        return c.storage.GetList(ctx, key, opts, listObj)

    // case 3: the apiserver cache is healthy; read from the cache, guaranteeing the returned objects are no older than `listRV`
    listPtr := meta.GetItemsPtr(listObj)
    listVal := conversion.EnforcePtr(listPtr)
    filter  := filterWithAttrsFunction(key, pred) // the final filter

    objs, readResourceVersion, indexUsed := c.listItems(listRV, key, pred, ...) // pre-filter using indexes, a performance optimization
    for _, obj := range objs {
        elem := obj.(*storeElement)
        if filter(elem.Key, elem.Labels, elem.Fields)                           // the real filtering
            listVal.Set(reflect.Append(listVal, reflect.ValueOf(elem)))
    }

    // update the last-read ResourceVersion
    if c.versioner != nil
        c.versioner.UpdateList(listObj, readResourceVersion, "", nil)
    return nil
}

2.5.2 Determining whether data must be read from etcd: shouldDelegateList()

// https://github.com/kubernetes/kubernetes/blob/v1.24.0/staging/src/k8s.io/apiserver/pkg/storage/cacher/cacher.go#L591

func shouldDelegateList(opts storage.ListOptions) bool {
    resourceVersion := opts.ResourceVersion
    pred            := opts.Predicate
    pagingEnabled   := DefaultFeatureGate.Enabled(features.APIListChunking)      // enabled by default
    hasContinuation := pagingEnabled && len(pred.Continue) > 0                   // Continue is a token
    hasLimit        := pagingEnabled && pred.Limit > 0 && resourceVersion != "0" // hasLimit can only be true when resourceVersion != "0"

    // 1. if resourceVersion is not specified, pull the data from the underlying storage (etcd);
    // 2. if there is a continuation, also pull the data from the underlying storage;
    // 3. the limit is only passed down to the underlying storage (etcd) when resourceVersion != "0", because the watch cache does not support continuation
    return resourceVersion == "" || hasContinuation || hasLimit || opts.ResourceVersionMatch == metav1.ResourceVersionMatchExact
}

Several points here are very important:

  1. Q: If the client does not set the ResourceVersion field in ListOptions{}, does that correspond to resourceVersion == "" here?

    A: Yes. So the example in Section 1.4 results in pulling the full data from etcd.

  2. Q: Will the client setting limit=500&resourceVersion=0 cause hasLimit==true?

    A: No. resourceVersion=0 causes the limit to be ignored (see the hasLimit line above), meaning the request returns the full data even though limit=500 is specified (see the sketch after this Q&A).

  3. Q: What is the purpose of ResourceVersionMatch?

    A: It’s used to tell apiserver how to interpret ResourceVersion, and there’s a complicated official table that you can look at if you’re interested.
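To make these parameter combinations concrete, here is a standalone sketch that mirrors the decision logic above (simplified: paging is assumed enabled and ResourceVersionMatch is omitted):

```go
package main

import "fmt"

// shouldDelegateList, simplified from the cacher code above: "true" means the
// request is delegated to etcd instead of being served from the watch cache.
func shouldDelegateList(resourceVersion, continueToken string, limit int64) bool {
	hasContinuation := len(continueToken) > 0
	hasLimit := limit > 0 && resourceVersion != "0"
	return resourceVersion == "" || hasContinuation || hasLimit
}

func main() {
	fmt.Println(shouldDelegateList("", "", 0))     // true:  no resourceVersion -> etcd
	fmt.Println(shouldDelegateList("", "", 500))   // true:  limit honored, read from etcd
	fmt.Println(shouldDelegateList("0", "", 500))  // false: limit ignored, served from cache
	fmt.Println(shouldDelegateList("0", "tok", 0)) // true:  continue token -> etcd
}
```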

Next, we return to cacher’s GetList() logic and look at the specific cases of processing.

2.5.3 Case 1: ListOption asks to read data from etcd

In this case, apiserver reads all objects directly from etcd, filters them, and returns them to the client; this is for scenarios where data consistency is extremely important. Of course, it is also easy to fall into this path by mistake and overstress etcd, as in the example in Section 1.4.

// https://github.com/kubernetes/kubernetes/blob/v1.24.0/staging/src/k8s.io/apiserver/pkg/storage/etcd3/store.go#L563

// GetList implements storage.Interface.
func (s *store) GetList(ctx , key string, opts storage.ListOptions, listObj runtime.Object) error {
    listPtr   := meta.GetItemsPtr(listObj)
    v         := conversion.EnforcePtr(listPtr)
    key        = path.Join(s.pathPrefix, key)
    keyPrefix := key // append '/' if needed

    newItemFunc := getNewItemFunc(listObj, v)

    var fromRV *uint64
    if len(resourceVersion) > 0 { // if RV is non-empty (it defaults to the empty string when the client does not pass it)
        parsedRV := s.versioner.ParseResourceVersion(resourceVersion)
        fromRV = &parsedRV
    }

    // handling of ResourceVersion, ResourceVersionMatch, etc.
    switch {
    case recursive && s.pagingEnabled && len(pred.Continue) > 0: ...
    case recursive && s.pagingEnabled && pred.Limit > 0        : ...
    default                                                    : ...
    }

    // loop until we have filled the requested limit from etcd or there are no more results
    for {
        getResp = s.client.KV.Get(ctx, key, options...) // pull data from etcd
        numFetched += len(getResp.Kvs)
        hasMore = getResp.More

        for i, kv := range getResp.Kvs {
            if limitOption != nil && int64(v.Len()) >= pred.Limit {
                hasMore = true
                break
            }

            lastKey = kv.Key
            data := s.transformer.TransformFromStorage(ctx, kv.Value, kv.Key)
            appendListItem(v, data, kv.ModRevision, pred, s.codec, s.versioner, newItemFunc) // filtering happens in here
            numEvald++
        }

        key = string(lastKey) + "\x00"
    }

    // instruct the client to begin querying from immediately after the last key we returned
    if hasMore {
        // we want to start immediately after the last key
        next := encodeContinue(string(lastKey)+"\x00", keyPrefix, returnedRV)
        return s.versioner.UpdateList(listObj, uint64(returnedRV), next, remainingItemCount)
    }

    // no continuation
    return s.versioner.UpdateList(listObj, uint64(returnedRV), "", nil)
}
  • client.KV.Get() goes into the etcd client library; keep digging down if you're interested.
  • appendListItem() filters the data it gets; this is the apiserver in-memory filtering mentioned in Section 1.
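For reference, a minimal sketch of what the prefix-plus-limit read above boils down to at the etcd client level (the endpoint, TLS setup and key prefix are illustrative; the real apiserver also passes revision/range options and decodes values through its transformer):

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://127.0.0.1:2379"}, // illustrative endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// One prefix read with a limit, roughly one iteration of the paging loop above.
	// etcd only understands keys and limits here; label/field filtering happens
	// later, inside apiserver.
	resp, err := cli.Get(context.Background(), "/registry/pods/",
		clientv3.WithPrefix(), clientv3.WithLimit(500))
	if err != nil {
		panic(err)
	}
	fmt.Printf("got %d KVs, more=%v\n", len(resp.Kvs), resp.More)
}
```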

2.5.4 Case 2: The local cache is not yet built, so you can only read data from etcd

The procedure is the same as in case 1.

2.5.5 Case 3: Using local cache

// https://github.com/kubernetes/kubernetes/blob/v1.24.0/staging/src/k8s.io/apiserver/pkg/storage/cacher/cacher.go#L622

// GetList implements storage.Interface
func (c *Cacher) GetList(ctx , key string, opts storage.ListOptions, listObj runtime.Object) error {
    // case 1: ListOptions requires reading from etcd
    ...
    // case 2: the apiserver cache is not ready yet, so we can only read from etcd
    ...
    // case 3: the apiserver cache is healthy; read from the cache, guaranteeing the returned objects are no older than `listRV`
    listPtr := meta.GetItemsPtr(listObj) // List elements with at least 'listRV' from cache.
    listVal := conversion.EnforcePtr(listPtr)
    filter  := filterWithAttrsFunction(key, pred) // the final filter

    objs, readResourceVersion, indexUsed := c.listItems(listRV, key, pred, ...) // pre-filter using indexes, a performance optimization
    for _, obj := range objs {
        elem := obj.(*storeElement)
        if filter(elem.Key, elem.Labels, elem.Fields)                           // the real filtering
            listVal.Set(reflect.Append(listVal, reflect.ValueOf(elem)))
    }

    if c.versioner != nil
        c.versioner.UpdateList(listObj, readResourceVersion, "", nil)
    return nil
}

3 LIST test

To avoid client-side libraries (such as client-go) automatically setting some parameters for us, we test directly with curl, specifying the credentials.

$ cat curl-k8s-apiserver.sh
curl -s --cert /etc/kubernetes/pki/admin.crt --key /etc/kubernetes/pki/admin.key --cacert /etc/kubernetes/pki/ca.crt $@

Usage.

$ ./curl-k8s-apiserver.sh "https://localhost:6443/api/v1/pods?limit=2"
{
  "kind": "PodList",
  "metadata": {
    "resourceVersion": "2127852936",
    "continue": "eyJ2IjoibWV0YS5rOHMuaW8vdjEiLCJ...",
  },
  "items": [ {pod1 data }, {pod2 data}]
}

3.1 Specify limit=2: the response will return paging information (continue)

3.1.1 curl test

$ ./curl-k8s-apiserver.sh "https://localhost:6443/api/v1/pods?limit=2"
{
  "kind": "PodList",
  "metadata": {
    "resourceVersion": "2127852936",
    "continue": "eyJ2IjoibWV0YS5rOHMuaW8vdjEiLCJ...",
  },
  "items": [ {pod1 data }, {pod2 data}]
}

As you can see, the response:

  • does return two pods, in the items[] field;
  • also returns a continue field in metadata. If the client sends this token with the next request, apiserver will keep returning the remaining data until it no longer returns a continue field.

3.1.2 kubectl testing

Cranking up the logging level of kubectl also shows that it uses continue behind the scenes to get the full amount of pods.

$ kubectl get pods --all-namespaces --v=10
# everything below is log output, lightly edited for readability
# curl -k -v -XGET  -H "User-Agent: kubectl/v1.xx" -H "Accept: application/json;as=Table;v=v1;g=meta.k8s.io,application/json;as=Table;v=v1beta1;g=meta.k8s.io,application/json"
#   'http://localhost:8080/api/v1/pods?limit=500'
# GET http://localhost:8080/api/v1/pods?limit=500 200 OK in 202 milliseconds
# Response Body: {"kind":"Table","metadata":{"continue":"eyJ2Ijoib...","remainingItemCount":54},"columnDefinitions":[...],"rows":[...]}
# 
# curl -k -v -XGET  -H "Accept: application/json;as=Table;v=v1;g=meta.k8s.io,application/json;as=Table;v=v1beta1;g=meta.k8s.io,application/json" -H "User-Agent: kubectl/v1.xx"
#   'http://localhost:8080/api/v1/pods?continue=eyJ2Ijoib&limit=500'
# GET http://localhost:8080/api/v1/pods?continue=eyJ2Ijoib&limit=500 200 OK in 44 milliseconds
# Response Body: {"kind":"Table","metadata":{"resourceVersion":"2122644698"},"columnDefinitions":[],"rows":[...]}

The first request got 500 pods, and the second request carried the returned continue token: GET http://localhost:8080/api/v1/pods?continue=eyJ2Ijoib&limit=500. The continue value is a token and is fairly long, so it is truncated here for better presentation.
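The same pagination can be driven from client-go by feeding each response's continue token into the next request, roughly like this (a sketch; clientset construction is omitted):

```go
package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// listAllPods pages through all pods, 500 at a time, until the server stops
// returning a continue token.
func listAllPods(ctx context.Context, clientset kubernetes.Interface) ([]corev1.Pod, error) {
	var pods []corev1.Pod
	opts := metav1.ListOptions{Limit: 500}
	for {
		page, err := clientset.CoreV1().Pods("").List(ctx, opts)
		if err != nil {
			return nil, err
		}
		pods = append(pods, page.Items...)
		if page.Continue == "" { // last page
			return pods, nil
		}
		opts.Continue = page.Continue
	}
}
```

client-go also ships a pager helper (k8s.io/client-go/tools/pager) that wraps this kind of loop.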

3.2 Specify limit=2&resourceVersion=0: limit=2 will be ignored and the full amount of data will be returned

$ ./curl-k8s-apiserver.sh "https://localhost:6443/api/v1/pods?limit=2&resourceVersion=0"
{
  "kind": "PodList",
  "metadata": {
    "resourceVersion": "2127852936",
    "continue": "eyJ2IjoibWV0YS5rOHMuaW8vdjEiLCJ...",
  },
  "items": [ {pod1 data }, {pod2 data}, ...]
}

items[] contains the full amount of pod information.

3.3 Specify spec.nodeName=node1&resourceVersion=0 vs. spec.nodeName=node1

Same result

$ ./curl-k8s-apiserver.sh "https://localhost:6443/api/v1/namespaces/default/pods?fieldSelector=spec.nodeName%3Dnode1" | jq '.items[].spec.nodeName'
"node1"
"node1"
"node1"
...

$ ./curl-k8s-apiserver.sh "https://localhost:6443/api/v1/namespaces/default/pods?fieldSelector=spec.nodeName%3Dnode1&resourceVersion=0" | jq '.items[].spec.nodeName'
"node1"
"node1"
"node1"
...

The result is the same, unless there is an inconsistency between apiserver cache and etcd data, which is extremely unlikely and we won’t discuss here.

Speed varies greatly

Using time to measure the elapsed time in the above two cases, you will find that for larger clusters, there is a significant difference in response time between the two types of requests.

$ time ./curl-k8s-apiserver.sh <url> > result

For a cluster size of 4K nodes, 100K pods, the following data is provided for reference.

  • without resourceVersion=0 (read etcd and filter at apiserver): time consumed 10s
  • with resourceVersion=0 (read apiserver cache): time consumed 0.05s

200x worse.

The full pod data totals roughly 2GB, averaging about 20KB per pod.

4 LIST Request Pressure on the Control Plane: A Quantitative Analysis

This section uses cilium-agent as an example to quantify the pressure it puts on the control plane when it starts.

4.1 Collecting LIST requests

The first step is to find out which resources the agent LISTs from k8s when it starts. There are several ways to collect this:

  1. in the k8s access log, filtered by ServiceAccount, verb, request_uri, etc.
  2. through agent logs.
  3. by further code analysis, etc.

Suppose we collect the following LIST requests.

  1. api/v1/namespaces?resourceVersion=0
  2. api/v1/pods?fieldSelector=spec.nodeName%3Dnode1&resourceVersion=0
  3. api/v1/nodes?fieldSelector=metadata.name%3Dnode1&resourceVersion=0
  4. api/v1/services?labelSelector=%21service.kubernetes.io%2Fheadless%2C%21service.kubernetes.io%2Fservice-proxy-name
  5. apis/discovery.k8s.io/v1beta1/endpointslices?resourceVersion=0
  6. apis/networking.k8s.io/networkpolicies?resourceVersion=0
  7. apis/cilium.io/v2/ciliumnodes?resourceVersion=0
  8. apis/cilium.io/v2/ciliumnetworkpolicies?resourceVersion=0
  9. apis/cilium.io/v2/ciliumclusterwidenetworkpolicies?resourceVersion=0

4.2 Testing the amount of data and time consumed by LIST requests

With the list of LIST requests, you can then manually execute these requests and get the following data.

  1. request time consumption

  2. the amount of data processed by the request, which is divided into two types.

    1. the amount of data processed by apiserver (the full data); the performance impact on apiserver/etcd should be evaluated based on this
    2. the amount of data that the agent finally gets (filtered by selector)

The following script (run on a k8s master in the real environment) executes the test once.

$ cat benchmark-list-overheads.sh
apiserver_url="https://localhost:6443"

# List k8s core resources (e.g. pods, services)
# API: GET/LIST /api/v1/<resources>?<field/label selector>&resourceVersion=0
function benchmark_list_core_resource() {
    resource=$1
    selectors=$2

    echo "----------------------------------------------------"
    echo "Benchmarking list $2"
    listed_file="listed-$resource"
    url="$apiserver_url/api/v1/$resource?resourceVersion=0"

    # first perform a request without selectors, this is the size apiserver really handles
    echo "curl $url"
    time ./curl-k8s-apiserver.sh "$url" > $listed_file

    # perform another request if selectors are provided, this is the size client receives
    listed_file2="$listed_file-filtered"
    if [ ! -z "$selectors" ]; then
        url="$url&$selectors"
        echo "curl $url"
        time ./curl-k8s-apiserver.sh "$url" > $listed_file2
    fi

    ls -ahl $listed_file $listed_file2 2>/dev/null

    echo "----------------------------------------------------"
    echo ""
}

# List resources under other API groups (e.g. endpointslices, cilium CRDs)
# API: GET/LIST /apis/<api group>/<resources>?<field/label selector>&resourceVersion=0
function benchmark_list_apiextension_resource() {
    api_group=$1
    resource=$2
    selectors=$3

    echo "----------------------------------------------------"
    echo "Benchmarking list $api_group/$resource"
    api_group_flatten_name=$(echo $api_group | sed 's/\//-/g')
    listed_file="listed-$api_group_flatten_name-$resource"
    url="$apiserver_url/apis/$api_group/$resource?resourceVersion=0"
    if [ ! -z "$selectors" ]; then
        url="$url&$selectors"
    fi

    echo "curl $url"
    time ./curl-k8s-apiserver.sh "$url" > $listed_file
    ls -ahl $listed_file
    echo "----------------------------------------------------"
    echo ""
}

benchmark_list_core_resource "namespaces" ""
benchmark_list_core_resource "pods"       "filedSelector=spec.nodeName%3Dnode1"
benchmark_list_core_resource "nodes"      "fieldSelector=metadata.name%3Dnode1"
benchmark_list_core_resource "services"   "labelSelector=%21service.kubernetes.io%2Fheadless%2C%21service.kubernetes.io%2Fservice-proxy-name"

benchmark_list_apiexternsion_resource "discovery.k8s.io/v1beta1" "endpointslices"                   ""
benchmark_list_apiexternsion_resource "apiextensions.k8s.io/v1"  "customresourcedefinitions"        ""
benchmark_list_apiexternsion_resource "networking.k8s.io"        "networkpolicies"                  ""
benchmark_list_apiexternsion_resource "cilium.io/v2"             "ciliumnodes"                      ""
benchmark_list_apiexternsion_resource "cilium.io/v2"             "ciliumendpoints"                  ""
benchmark_list_apiexternsion_resource "cilium.io/v2"             "ciliumnetworkpolicies"            ""
benchmark_list_apiexternsion_resource "cilium.io/v2"             "ciliumclusterwidenetworkpolicies" ""

The execution effect is as follows.

$ benchmark-list-overheads.sh
----------------------------------------------------
Benchmarking list
curl https://localhost:6443/api/v1/namespaces?resourceVersion=0

real    0m0.090s
user    0m0.038s
sys     0m0.044s
-rw-r--r-- 1 root root 69K listed-namespaces
----------------------------------------------------

Benchmarking list fieldSelector=spec.nodeName%3Dnode1
curl https://localhost:6443/api/v1/pods?resourceVersion=0

real    0m18.332s
user    0m1.355s
sys     0m1.822s
curl https://localhost:6443/api/v1/pods?resourceVersion=0&fieldSelector=spec.nodeName%3Dnode1

real    0m0.242s
user    0m0.044s
sys     0m0.188s
-rw-r--r-- 1 root root 2.0G listed-pods
-rw-r--r-- 1 root root 526K listed-pods-filtered
----------------------------------------------------

...

Note: for a LIST with a selector, e.g. LIST pods?fieldSelector=spec.nodeName=node1, this script executes the request without the selector first, in order to measure the amount of data apiserver needs to process. Taking the list pods case above as an example:

  1. the agent really executes pods?resourceVersion=0&fieldSelector=spec.nodeName%3Dnode1, so the request time consumption should be based on this
  2. the extra execution of pods?resourceVersion=0 is to test how much data the apiserver needs to process for a request of 1

Note: listing all pods generates a 2GB file, so use this benchmark tool with caution: first understand what your script is actually testing, and in particular do not automate it or run it concurrently, or you may blow up apiserver/etcd.

4.3 Analysis of test results

The above output has the following key information.

  1. the type of resources in the LIST, e.g. pods/endpoints/services
  2. the time consumed by the LIST operation
  3. the amount of data involved in the LIST operation
    1. the amount of data (in json format) to be processed by apiserver: the above list pods, for example, corresponds to the listed-pods file, totaling 2GB.
    2. the amount of data received by the agent (since the agent may have specified label/field filters): in the case of the list pods above, corresponding to the listed-pods-filtered file, totaling 526K

By collecting and sorting all LIST requests in the above way, we know how much pressure the agent puts on apiserver/etcd for one startup operation.

$ ls -ahl listed-*
-rw-r--r-- 1 root root  222 listed-apiextensions.k8s.io-v1-customeresourcedefinitions
-rw-r--r-- 1 root root 5.8M listed-apiextensions.k8s.io-v1-customresourcedefinitions
-rw-r--r-- 1 root root 2.0M listed-cilium.io-v2-ciliumclusterwidenetworkpolicies
-rw-r--r-- 1 root root 193M listed-cilium.io-v2-ciliumendpoints
-rw-r--r-- 1 root root  185 listed-cilium.io-v2-ciliumnetworkpolicies
-rw-r--r-- 1 root root 6.6M listed-cilium.io-v2-ciliumnodes
-rw-r--r-- 1 root root  42M listed-discovery.k8s.io-v1beta1-endpointslices
-rw-r--r-- 1 root root  69K listed-namespaces
-rw-r--r-- 1 root root  222 listed-networking.k8s.io-networkpolicies
-rw-r--r-- 1 root root  70M listed-nodes    # only used to estimate how much data apiserver has to process
-rw-r--r-- 1 root root  25K listed-nodes-filtered
-rw-r--r-- 1 root root 2.0G listed-pods     # only used to estimate how much data apiserver has to process
-rw-r--r-- 1 root root 526K listed-pods-filtered
-rw-r--r-- 1 root root  23M listed-services # only used to estimate how much data apiserver has to process
-rw-r--r-- 1 root root  23M listed-services-filtered

Again using cilium as an example, the ranking looks roughly like this (amount of data processed by apiserver, json format):

| Listed resource type   | Data processed by apiserver (json) | Time consumed |
|------------------------|------------------------------------|---------------|
| CiliumEndpoints (full) | 193MB                              | 11s           |
| CiliumNodes (full)     | 70MB                               | 0.5s          |

5 Large Scale Foundation Services: Deployment and Tuning Recommendations

5.1 Set ResourceVersion=0 on LIST requests by default

As described earlier, not setting this parameter causes apiserver to pull the full data from etcd and then filter it, which is:

  1. very slow, and
  2. a load that etcd may not be able to handle.

Therefore, unless you have to pull data from etcd because of high data accuracy requirements, you should set the ResourceVersion=0 parameter on LIST requests and let apiserver serve it with cache.

If you are using client-go’s ListWatch/informer interface, then it already has ResourceVersion=0 set by default.

5.2 Preferring the namespaced API

If the resources to be LISTed are in a single or a few namespaces, consider using the namespaced API.

  • Namespaced API: /api/v1/namespaces/<ns>/pods?query=xxx
  • Un-namespaced API: /api/v1/pods?query=xxx

5.3 Restart backoff

For base services deployed on every node, such as kubelet, cilium-agent and other daemonsets, the stress on the control plane during mass restarts needs to be reduced by an effective restart backoff.

For example, the number of agents restarted per minute after a simultaneous hang should not exceed 10% of the cluster size (configurable, or calculated automatically).
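A minimal sketch of such a backoff on the agent side (the delay bound is made up; real agents would typically combine this with client-go's rate-limiting/backoff helpers or an operator-level rollout limit):

```go
package example

import (
	"math/rand"
	"time"
)

// startupJitter sleeps for a random delay (0..maxDelay, maxDelay must be > 0)
// before the agent starts its expensive LIST/ListWatch calls, so that a mass
// restart does not hit apiserver/etcd all at once.
func startupJitter(maxDelay time.Duration) {
	time.Sleep(time.Duration(rand.Int63n(int64(maxDelay))))
}

// Usage (illustrative): spread the initial LISTs of a 4000-node daemonset over ~5 minutes:
//
//	startupJitter(5 * time.Minute)
//	runListWatch()
```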

5.4 Prioritize filtering on the server side via label/field selector

If you need to cache some resources and listen for changes, use the ListWatch mechanism to pull the data locally, and let the business logic filter from that local cache as needed. This is client-go's ListWatch/informer mechanism.

But if it is just a one-off LIST with filtering criteria, like the earlier example of filtering pods by node name, then obviously we should let apiserver do the filtering for us by setting label or field selectors. LISTing 100K pods takes tens of seconds (most of the time is spent on data transfer, and it also consumes a lot of apiserver CPU, bandwidth and IO), whereas if only the pods on the local node are needed, a LIST with spec.nodeName=node1 may return in just 0.05s. It is also very important not to forget to include resourceVersion=0 in the request.
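The earlier sketches already show a one-off LIST with FieldSelector plus resourceVersion=0. If you instead need a long-lived local cache restricted to one node, client-go's informer factory can apply the same server-side filter, roughly like this (a sketch; the node name and resync period are made up):

```go
package example

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
)

// newNodeLocalPodInformerFactory builds a shared informer factory whose pod
// informer only ListWatches the pods on one node, so both the initial LIST and
// the subsequent watch are filtered on the server side.
func newNodeLocalPodInformerFactory(clientset kubernetes.Interface, nodeName string) informers.SharedInformerFactory {
	return informers.NewSharedInformerFactoryWithOptions(
		clientset,
		30*time.Minute, // resync period (illustrative)
		informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
			opts.FieldSelector = "spec.nodeName=" + nodeName
		}),
	)
}
```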

5.4.1 Label selector

In-memory filtering in apiserver.

5.4.2 Field selector

In-memory filtering in apiserver.

5.4.3 Namespace selector

Namespace is part of the key prefix in etcd, so filtering resources by namespace is much faster than using selectors, which cannot make use of the prefix.

5.5 Supporting infrastructure (monitoring, alerting, etc.)

The above analysis shows that a single client request may only return a few hundred KB of data while the apiserver (or worse, etcd) has to process GBs of data. Mass restarts of basic services should therefore be avoided, and for that reason monitoring and alerting should be as thorough as possible.

5.5.1 Use independent ServiceAccount

Each basic service (e.g. kubelet, cilium-agent) and every operator that issues many LIST requests to the apiserver should use its own independent ServiceAccount. This makes it easy for the apiserver to distinguish the sources of requests, which is useful for monitoring, troubleshooting, and server-side rate limiting.

5.5.2 Liveness Monitoring Alerts

The base service must be covered by liveness monitoring.

There must be P1-level liveness alerts so that mass-hang scenarios are detected as early as possible; the restart backoff then reduces the pressure on the control plane.

5.5.3 Monitoring and tuning etcd

The key performance-related metrics need monitoring and alerting:

  1. memory

  2. bandwidth

  3. number of expensive LIST requests and their response times, e.g. the following log of a LIST of all pods:

    
    {
        "level":"warn",
        "msg":"apply request took too long",
        "took":"5357.87304ms",
        "expected-duration":"100ms",
        "prefix":"read-only range ",
        "request":"key:\"/registry/pods/\" range_end:\"/registry/pods0\" ",
        "response":"range_response_count:60077 size:602251227"
    }
    

Deployment and configuration tuning:

  1. split K8s events into a separate etcd cluster;
  2. others.

6 Other

6.1 Get requests: GetOptions{}

The basic principle is the same as with ListOptions{}: not setting ResourceVersion=0 causes the apiserver to go to etcd for the data, so avoid it unless you really need to.
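A minimal sketch (names are illustrative) of the cache-friendly Get:

```go
package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// getPodFromCache reads a single pod with ResourceVersion "0", allowing apiserver
// to serve it from its watch cache instead of doing a read against etcd.
func getPodFromCache(ctx context.Context, clientset kubernetes.Interface, ns, name string) (*corev1.Pod, error) {
	return clientset.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{ResourceVersion: "0"})
}
```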