1. Overview

The apiserver is the communication hub for all k8s components, so its importance is self-evident. It exposes HTTP-based services to the outside world, so what steps does a request go through between being sent and being processed? The following briefly describes the whole process based on the code, to give a general impression of it.

Since the code structure of apiserver is not simple, we will quote as little code as possible. The following analysis is based on k8s 1.18.

2. The processing chain of requests

// Build the request handler chain
func DefaultBuildHandlerChain(apiHandler http.Handler, c *Config) http.Handler {
   handler := genericapifilters.WithAuthorization(apiHandler, c.Authorization.Authorizer, c.Serializer)
   if c.FlowControl != nil {
      handler = genericfilters.WithPriorityAndFairness(handler, c.LongRunningFunc, c.FlowControl)
   } else {
      handler = genericfilters.WithMaxInFlightLimit(handler, c.MaxRequestsInFlight, c.MaxMutatingRequestsInFlight, c.LongRunningFunc)
   }
   handler = genericapifilters.WithImpersonation(handler, c.Authorization.Authorizer, c.Serializer)
   handler = genericapifilters.WithAudit(handler, c.AuditBackend, c.AuditPolicyChecker, c.LongRunningFunc)
   failedHandler := genericapifilters.Unauthorized(c.Serializer, c.Authentication.SupportsBasicAuth)
   failedHandler = genericapifilters.WithFailedAuthenticationAudit(failedHandler, c.AuditBackend, c.AuditPolicyChecker)
   handler = genericapifilters.WithAuthentication(handler, c.Authentication.Authenticator, failedHandler, c.Authentication.APIAudiences)
   handler = genericfilters.WithCORS(handler, c.CorsAllowedOriginList, nil, nil, nil, "true")
   handler = genericfilters.WithTimeoutForNonLongRunningRequests(handler, c.LongRunningFunc, c.RequestTimeout)
   handler = genericfilters.WithWaitGroup(handler, c.LongRunningFunc, c.HandlerChainWaitGroup)
   handler = genericapifilters.WithRequestInfo(handler, c.RequestInfoResolver)
   if c.SecureServing != nil && !c.SecureServing.DisableHTTP2 && c.GoawayChance > 0 {
      handler = genericfilters.WithProbabilisticGoaway(handler, c.GoawayChance)
   }
   handler = genericfilters.WithPanicRecovery(handler)
   return handler
}

Each call wraps the handler built so far, so the chain executes from back to front: the filter added last runs first. A request therefore passes through the handlers in the following order.

  • PanicRecovery
  • ProbabilisticGoaway
  • RequestInfo
  • WaitGroup
  • TimeoutForNonLongRunningRequests
  • CORS
  • Authentication
  • failedHandler: FailedAuthenticationAudit
  • failedHandler: Unauthorized
  • Audit
  • Impersonation
  • PriorityAndFairness / MaxInFlightLimit
  • Authorization
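
To make the back-to-front order concrete, here is a minimal sketch using only the standard library. The withName filter and the three-filter chain are hypothetical stand-ins for illustration, not the real apiserver filters.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// withName is a hypothetical filter that records its name and then
// delegates to the handler it wraps.
func withName(name string, order *[]string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		*order = append(*order, name)
		next.ServeHTTP(w, r)
	})
}

// visitOrder builds a small chain the same way DefaultBuildHandlerChain
// does (innermost handler first) and returns the order in which one
// request actually passes through it.
func visitOrder() string {
	var order []string
	var handler http.Handler = http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		order = append(order, "apiHandler")
	})
	handler = withName("Authorization", &order, handler)
	handler = withName("Authentication", &order, handler)
	handler = withName("PanicRecovery", &order, handler) // wrapped last, runs first

	handler.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest("GET", "/api/v1/pods", nil))
	return strings.Join(order, " -> ")
}

func main() {
	fmt.Println(visitOrder())
	// PanicRecovery -> Authentication -> Authorization -> apiHandler
}
```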

After the filter chain, the request is passed to the director, which dispatches it to either gorestfulContainer or nonGoRestfulMux. gorestfulContainer is the main part of the apiserver.

director := director{
   name:               name,
   goRestfulContainer: gorestfulContainer,
   nonGoRestfulMux:    nonGoRestfulMux,
}

PanicRecovery

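This filter uses runtime.HandleCrash to recover from a panic raised while a request is being handled, so that the panic does not take down the apiserver, and logs the details of the request that caused it.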

ProbabilisticGoaway

The client and the apiserver communicate over long-lived HTTP/2 connections, so even when the apiservers are load balanced, a given client's requests keep hitting the same apiserver. With this filter, the apiserver has a small, configurable chance of responding to a request with GOAWAY, so that the client creates a new TCP connection, which can then be load balanced to a different apiserver. The chance can range from 0 to 0.02.

Related PR: https://github.com/kubernetes/kubernetes/pull/88567
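
A rough sketch of the mechanism follows. It assumes that setting "Connection: close" on an HTTP/2 response is what triggers the GOAWAY in Go's HTTP/2 server; the shouldClose callback and the connectionHeader helper are hypothetical names introduced here so that the random decision can be replaced by a fixed one.

```go
package main

import (
	"fmt"
	"math/rand"
	"net/http"
	"net/http/httptest"
)

// withProbabilisticGoaway asks shouldClose, for HTTP/2 requests only,
// whether this response should tell the client to reconnect.
func withProbabilisticGoaway(next http.Handler, shouldClose func() bool) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.ProtoMajor == 2 && shouldClose() {
			// Assumption: the HTTP/2 server turns this header into a GOAWAY.
			w.Header().Set("Connection", "close")
		}
		next.ServeHTTP(w, r)
	})
}

// connectionHeader runs one request with the given protocol version and
// returns the Connection response header. roll plays the role of the
// random number that is compared against chance.
func connectionHeader(protoMajor int, chance, roll float64) string {
	h := withProbabilisticGoaway(
		http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {}),
		func() bool { return roll < chance },
	)
	req := httptest.NewRequest("GET", "/api/v1/pods", nil)
	req.ProtoMajor = protoMajor
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Header().Get("Connection")
}

func main() {
	// With a chance of 0.02, roughly one request in fifty gets "close".
	fmt.Printf("%q\n", connectionHeader(2, 0.02, rand.Float64()))
}
```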

RequestInfo

RequestInfo parses the HTTP request. The following information is obtained.

// RequestInfo holds information parsed from the http.Request
type RequestInfo struct {
    // IsResourceRequest indicates whether or not the request is for an API resource or subresource
    IsResourceRequest bool
    // Path is the URL path of the request
    Path string
    // Verb is the kube verb associated with the request for API requests, not the http verb.  This includes things like list and watch.
    // for non-resource requests, this is the lowercase http verb
    Verb string

    APIPrefix  string
    APIGroup   string
    APIVersion string
    Namespace  string
    // Resource is the name of the resource being requested.  This is not the kind.  For example: pods
    Resource string
    // Subresource is the name of the subresource being requested.  This is a different resource, scoped to the parent resource, but it may have a different kind.
    // For instance, /pods has the resource "pods" and the kind "Pod", while /pods/foo/status has the resource "pods", the sub resource "status", and the kind "Pod"
    // (because status operates on pods). The binding resource for a pod though may be /pods/foo/binding, which has resource "pods", subresource "binding", and kind "Binding".
    Subresource string
    // Name is empty for some verbs, but if the request directly indicates a name (not in body content) then this field is filled in.
    Name string
    // Parts are the path parts for the request, always starting with /{resource}/{name}
    Parts []string
}
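
To show how these fields relate to a URL, here is a deliberately simplified parser. The requestInfo type and parsePath function are hypothetical; they cover only the common /api/{version}/... and /apis/{group}/{version}/... shapes and ignore verbs and the other special cases the real resolver handles.

```go
package main

import (
	"fmt"
	"strings"
)

// requestInfo is a trimmed-down version of RequestInfo for illustration.
type requestInfo struct {
	APIPrefix, APIGroup, APIVersion        string
	Namespace, Resource, Name, Subresource string
}

// parsePath splits a resource path into its parts. It reports false for
// paths that do not look like resource requests.
func parsePath(path string) (requestInfo, bool) {
	var info requestInfo
	parts := strings.Split(strings.Trim(path, "/"), "/")
	if len(parts) < 3 {
		return info, false
	}
	info.APIPrefix = parts[0]
	switch info.APIPrefix {
	case "api": // the legacy core group has no group segment
		info.APIVersion = parts[1]
		parts = parts[2:]
	case "apis":
		if len(parts) < 4 {
			return info, false
		}
		info.APIGroup, info.APIVersion = parts[1], parts[2]
		parts = parts[3:]
	default:
		return info, false
	}
	// namespaces/{ns}/{resource}/... scopes the request to a namespace.
	if parts[0] == "namespaces" && len(parts) > 2 {
		info.Namespace = parts[1]
		parts = parts[2:]
	}
	info.Resource = parts[0]
	if len(parts) > 1 {
		info.Name = parts[1]
	}
	if len(parts) > 2 {
		info.Subresource = parts[2]
	}
	return info, true
}

func main() {
	info, ok := parsePath("/api/v1/namespaces/default/pods/foo/status")
	fmt.Printf("%+v %v\n", info, ok)
}
```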

WaitGroup

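The WaitGroup filter tracks in-flight short-lived (non-long-running) requests so that they can finish before the apiserver exits.

How can we tell whether a request is long-running? It is determined by the verb or the subresource that RequestInfo parsed from the request path: the verbs watch and proxy, and the subresources attach, exec, proxy, log and portforward, are treated as long-running.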

serverConfig.LongRunningFunc = filters.BasicLongRunningRequestCheck(
  sets.NewString("watch", "proxy"),
  sets.NewString("attach", "exec", "proxy", "log", "portforward"),
)

// BasicLongRunningRequestCheck returns true if the given request has one of the specified verbs or one of the specified subresources, or is a profiler request.
func BasicLongRunningRequestCheck(longRunningVerbs, longRunningSubresources sets.String) apirequest.LongRunningRequestCheck {
    return func(r *http.Request, requestInfo *apirequest.RequestInfo) bool {
        if longRunningVerbs.Has(requestInfo.Verb) {
            return true
        }
        if requestInfo.IsResourceRequest && longRunningSubresources.Has(requestInfo.Subresource) {
            return true
        }
        if !requestInfo.IsResourceRequest && strings.HasPrefix(requestInfo.Path, "/debug/pprof/") {
            return true
        }
        return false
    }
}

This way, the WaitGroup handler is marked done only after all subsequent handlers have returned, so the apiserver can exit gracefully.
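
A minimal sketch of this pattern with a plain sync.WaitGroup is shown below. The isLongRunning predicate is a hypothetical stand-in for LongRunningFunc, and the sketch leaves out what happens to requests that arrive after shutdown has begun.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync"
)

// withWaitGroup counts in-flight short-lived requests so that shutdown
// can wait for them. Long-running requests are not tracked.
func withWaitGroup(next http.Handler, isLongRunning func(*http.Request) bool, wg *sync.WaitGroup) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !isLongRunning(r) {
			wg.Add(1)
			defer wg.Done() // runs only after all inner handlers return
		}
		next.ServeHTTP(w, r)
	})
}

// drainOrder starts one request, then waits on the WaitGroup the way a
// shutdown path would, and returns the order of the two events.
func drainOrder() string {
	var wg sync.WaitGroup
	var mu sync.Mutex
	var events []string
	record := func(e string) {
		mu.Lock()
		defer mu.Unlock()
		events = append(events, e)
	}

	started := make(chan struct{})
	release := make(chan struct{})
	h := withWaitGroup(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		close(started)
		<-release
		record("handler finished")
	}), func(*http.Request) bool { return false }, &wg)

	go h.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest("GET", "/api/v1/pods", nil))
	<-started      // the request is in flight, wg.Add has been called
	close(release) // let the handler finish
	wg.Wait()      // blocks until wg.Done has run
	record("shutdown complete")
	return strings.Join(events, ", ")
}

func main() {
	fmt.Println(drainOrder())
}
```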

TimeoutForNonLongRunningRequests

For requests that are not long-running, the request is cancelled through its context (ctx) once the timeout expires.
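
The context part of this can be sketched as follows. The withTimeout function and isLongRunning predicate are hypothetical names, and the sketch only attaches a deadline to the request context; it does not write a timeout response to the client.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// withTimeout gives every non-long-running request a context that is
// cancelled once the timeout expires.
func withTimeout(next http.Handler, isLongRunning func(*http.Request) bool, timeout time.Duration) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if isLongRunning(r) {
			next.ServeHTTP(w, r) // watch, exec, etc. keep their original context
			return
		}
		ctx, cancel := context.WithTimeout(r.Context(), timeout)
		defer cancel()
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

// hasDeadline reports whether the inner handler sees a context deadline.
func hasDeadline(longRunning bool) bool {
	var got bool
	inner := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		_, got = r.Context().Deadline()
	})
	h := withTimeout(inner, func(*http.Request) bool { return longRunning }, time.Minute)
	h.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest("GET", "/api/v1/pods", nil))
	return got
}

func main() {
	fmt.Println("short request has deadline:", hasDeadline(false))
	fmt.Println("long-running request has deadline:", hasDeadline(true))
}
```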

CORS

Set some CORS response headers.

Authentication

This is where the user is authenticated. On success, the Authorization header is removed from the request and the request is passed to the next handler; otherwise it is passed to the failed handler.

Several authentication methods are supported. These include:

  • Requestheader, which takes out X-Remote-User, X-Remote-Group, X-Remote-Extra from the request.
  • X509 certificate validation.
  • BearerToken
  • WebSocket
  • Anonymous: in case anonymity is allowed

Some further authentication methods are provided in the form of plugins.

  • bootstrap token
  • Basic auth
  • password
  • OIDC
  • Webhook

Authentication is considered successful as soon as one of them succeeds. If the user is system:anonymous, or the user's groups already contain system:unauthenticated or system:authenticated, the result is returned as is; otherwise the user information is modified as follows before it is returned.

r.User = &user.DefaultInfo{
        Name:   r.User.GetName(),
        UID:    r.User.GetUID(),
        Groups: append(r.User.GetGroups(), user.AllAuthenticated),
        Extra:  r.User.GetExtra(),
    }

Notice that the user now belongs to the system:authenticated group. That is, it is marked as authenticated.
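
The "first success wins, then add the group" flow can be sketched like this. The userInfo type, the authenticator function type, the two toy authenticators and the demo-token value are all hypothetical and much simpler than the real ones (for example, no certificate checks on the request header path).

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

type userInfo struct {
	Name   string
	Groups []string
}

// authenticator is a reduced, hypothetical form of a request authenticator.
type authenticator func(r *http.Request) (*userInfo, bool)

// union tries each authenticator in order; the first success wins.
func union(auths ...authenticator) authenticator {
	return func(r *http.Request) (*userInfo, bool) {
		for _, a := range auths {
			if u, ok := a(r); ok {
				return u, true
			}
		}
		return nil, false
	}
}

// addAuthenticatedGroup appends system:authenticated unless the user is
// anonymous or already carries one of the two marker groups.
func addAuthenticatedGroup(u *userInfo) *userInfo {
	if u.Name == "system:anonymous" {
		return u
	}
	for _, g := range u.Groups {
		if g == "system:authenticated" || g == "system:unauthenticated" {
			return u
		}
	}
	groups := append(append([]string{}, u.Groups...), "system:authenticated")
	return &userInfo{Name: u.Name, Groups: groups}
}

func requestHeader(r *http.Request) (*userInfo, bool) {
	name := r.Header.Get("X-Remote-User")
	if name == "" {
		return nil, false
	}
	return &userInfo{Name: name, Groups: r.Header.Values("X-Remote-Group")}, true
}

func bearerToken(tokens map[string]string) authenticator {
	return func(r *http.Request) (*userInfo, bool) {
		token := strings.TrimPrefix(r.Header.Get("Authorization"), "Bearer ")
		name, ok := tokens[token]
		if !ok {
			return nil, false
		}
		return &userInfo{Name: name}, true
	}
}

// describe authenticates a request carrying a single header.
func describe(header, value string) string {
	r, _ := http.NewRequest("GET", "/api/v1/pods", nil)
	r.Header.Set(header, value)
	auth := union(requestHeader, bearerToken(map[string]string{"demo-token": "alice"}))
	u, ok := auth(r)
	if !ok {
		return "unauthorized"
	}
	u = addAuthenticatedGroup(u)
	return u.Name + " " + strings.Join(u.Groups, ",")
}

func main() {
	fmt.Println(describe("Authorization", "Bearer demo-token"))
	fmt.Println(describe("Authorization", "Bearer wrong"))
}
```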

FailedAuthenticationAudit

This will only be executed after an authentication failure. It mainly provides auditing capabilities.

Unauthorized

Handles requests that failed authentication by responding with 401 Unauthorized. It is called after FailedAuthenticationAudit.

Audit

Provides auditing for requests.

Impersonation

Impersonation lets the current user act as another user, which helps administrators test, for example, whether the permissions of different users are configured correctly. The identity to impersonate is read from the following request headers.

  • Impersonate-User: user
  • Impersonate-Group: group
  • Impersonate-Extra-: additional information

An impersonated identity is either a service account or a user. A value formatted as namespace/name is treated as a service account; anything else is treated as a user.

The final format of a service account is: system:serviceaccount:namespace:name
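
As an illustration of how these headers might be collected, here is a small sketch. The impersonation type and both functions are hypothetical helpers written for this article; in particular, lower-casing the Impersonate-Extra- suffix is an assumption of the sketch.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// impersonation holds what the three header kinds ask for.
type impersonation struct {
	User   string
	Groups []string
	Extra  map[string][]string
}

// parseImpersonation reads Impersonate-User, Impersonate-Group and every
// Impersonate-Extra-<name> header from h.
func parseImpersonation(h http.Header) impersonation {
	imp := impersonation{
		User:   h.Get("Impersonate-User"),
		Groups: h.Values("Impersonate-Group"),
		Extra:  map[string][]string{},
	}
	const extraPrefix = "Impersonate-Extra-"
	for key, values := range h {
		if strings.HasPrefix(key, extraPrefix) {
			name := strings.ToLower(strings.TrimPrefix(key, extraPrefix))
			imp.Extra[name] = values
		}
	}
	return imp
}

// serviceAccountUsername builds the final service account user name.
func serviceAccountUsername(namespace, name string) string {
	return "system:serviceaccount:" + namespace + ":" + name
}

// demo parses a hand-built header set and formats the result.
func demo() string {
	h := http.Header{}
	h.Set("Impersonate-User", "jane")
	h.Add("Impersonate-Group", "developers")
	h.Add("Impersonate-Group", "admins")
	h.Set("Impersonate-Extra-scopes", "view")
	imp := parseImpersonation(h)
	return fmt.Sprintf("%s %v %v", imp.User, imp.Groups, imp.Extra)
}

func main() {
	fmt.Println(demo())
	fmt.Println(serviceAccountUsername("kube-system", "default"))
}
```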

PriorityAndFairness / MaxInFlightLimit

If flow control is set, PriorityAndFairness is used, otherwise MaxInFlightLimit is used.

PriorityAndFairness: classifies requests by priority. Requests of the same priority are subject to fairness controls among themselves.

MaxInFlightLimit: the maximum number of non-mutating requests in flight at any given time. When this limit is exceeded, the server rejects further requests. A value of 0 means no limit. (Default value: 400)

Reference: https://kubernetes.io/zh/docs/concepts/cluster-administration/flow-control/

Authorization

// AttributesRecord implements Attributes interface.
type AttributesRecord struct {
   User            user.Info
   Verb            string
   Namespace       string
   APIGroup        string
   APIVersion      string
   Resource        string
   Subresource     string
   Name            string
   ResourceRequest bool
   Path            string
}

Authorization takes the information needed for the structure above from the context and then authorizes the request. The following authorization methods are built in.

  • Always allow
  • Always deny
  • Path: Allows partial paths to always be accessible

Some other common authorization methods are provided mainly through plugins.

  • Webhook
  • RBAC
  • Node

Node is designed specifically for the kubelet: the node authorizer allows a kubelet to perform API operations. These include:

Read operations:

  • services
  • endpoints
  • nodes
  • pods
  • secrets, configmaps, pvcs, and pod-related persistent volumes bound to kubelet nodes

Write operations:

  • Nodes and node status (enable the NodeRestriction admission plugin to restrict the kubelet to modifying only its own node)
  • Pods and pod status (enable the NodeRestriction admission plugin to restrict the kubelet to modifying only Pods bound to itself)
  • Events

Auth-related operations:

  • read/write permissions for the certificatesigningrequests API used during TLS bootstrapping
  • Ability to create tokenreviews and subjectaccessreviews for delegated authentication/authorization checks

In future releases, the node authorizer may add or remove permissions to ensure that the kubelet has the minimum set of permissions needed to operate correctly.

In order to be authorized by the node authorizer, the kubelet must use a credential that identifies it as being in the system:nodes group, with the username system:node:<nodeName>. This group name and username format should match the identity created for each kubelet during the kubelet TLS bootstrapping process.
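
The identity requirement can be expressed as a small check. The nodeName function below is a hypothetical helper that only encodes the group and username rule stated above; it is not the node authorizer itself.

```go
package main

import (
	"fmt"
	"strings"
)

// nodeName returns the node name when the identity looks like a kubelet:
// a member of system:nodes whose username is system:node:<nodeName>.
func nodeName(username string, groups []string) (string, bool) {
	const prefix = "system:node:"
	inNodesGroup := false
	for _, g := range groups {
		if g == "system:nodes" {
			inNodesGroup = true
			break
		}
	}
	if !inNodesGroup || !strings.HasPrefix(username, prefix) {
		return "", false
	}
	name := strings.TrimPrefix(username, prefix)
	if name == "" {
		return "", false
	}
	return name, true
}

func main() {
	fmt.Println(nodeName("system:node:worker-1", []string{"system:nodes", "system:authenticated"}))
	fmt.Println(nodeName("alice", []string{"system:authenticated"}))
}
```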

director

The director's ServeHTTP method is defined as follows. If the request path matches one of the registered webservices, the request is dispatched to gorestfulContainer; otherwise nonGoRestfulMux handles it.

func (d director) ServeHTTP(w http.ResponseWriter, req *http.Request) {
    path := req.URL.Path

    // check to see if our webservices want to claim this path
    for _, ws := range d.goRestfulContainer.RegisteredWebServices() {
        switch {
        case ws.RootPath() == "/apis":
            // if we are exactly /apis or /apis/, then we need special handling in loop.
            // normally these are passed to the nonGoRestfulMux, but if discovery is enabled, it will go directly.
            // We can't rely on a prefix match since /apis matches everything (see the big comment on Director above)
            if path == "/apis" || path == "/apis/" {
                klog.V(5).Infof("%v: %v %q satisfied by gorestful with webservice %v", d.name, req.Method, path, ws.RootPath())
                // don't use servemux here because gorestful servemuxes get messed up when removing webservices
                // TODO fix gorestful, remove TPRs, or stop using gorestful
                d.goRestfulContainer.Dispatch(w, req)
                return
            }

        case strings.HasPrefix(path, ws.RootPath()):
            // ensure an exact match or a path boundary match
            if len(path) == len(ws.RootPath()) || path[len(ws.RootPath())] == '/' {
                klog.V(5).Infof("%v: %v %q satisfied by gorestful with webservice %v", d.name, req.Method, path, ws.RootPath())
                // don't use servemux here because gorestful servemuxes get messed up when removing webservices
                // TODO fix gorestful, remove TPRs, or stop using gorestful
                d.goRestfulContainer.Dispatch(w, req)
                return
            }
        }
    }

    // if we didn't find a match, then we just skip gorestful altogether
    klog.V(5).Infof("%v: %v %q satisfied by nonGoRestful", d.name, req.Method, path)
    d.nonGoRestfulMux.ServeHTTP(w, req)
}
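
The "exact match or path boundary match" test in the second case is easy to get wrong with a plain prefix check, so here it is isolated as a small function. The claims name is hypothetical; the logic is the same two conditions used above.

```go
package main

import (
	"fmt"
	"strings"
)

// claims reports whether a webservice rooted at root should handle path:
// path must equal root or continue with a '/' right after it.
func claims(path, root string) bool {
	if !strings.HasPrefix(path, root) {
		return false
	}
	return len(path) == len(root) || path[len(root)] == '/'
}

func main() {
	fmt.Println(claims("/apis/apps/v1/deployments", "/apis/apps/v1"))      // boundary match
	fmt.Println(claims("/apis/apps/v1beta1/deployments", "/apis/apps/v1")) // prefix only
}
```

Without the boundary check, a webservice rooted at /apis/apps/v1 would also claim /apis/apps/v1beta1/... because of the shared prefix.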

admission webhook

The last step before the request is actually processed is admission, which includes admission webhooks. Admission is invoked from the REST handling code of each operation: for create, update and delete, the mutating step is called first, followed by the validating step. k8s itself also has a number of admission controllers built in, provided as plugins, as follows.

  • AlwaysAdmit
  • AlwaysPullImages
  • LimitPodHardAntiAffinityTopology
  • CertificateApproval/CertificateSigning/CertificateSubjectRestriction
  • DefaultIngressClass
  • DefaultTolerationSeconds
  • ExtendedResourceToleration
  • OwnerReferencesPermissionEnforcement
  • ImagePolicyWebhook
  • LimitRanger
  • NamespaceAutoProvision
  • NamespaceExists
  • NodeRestriction
  • TaintNodesByCondition
  • PodNodeSelector
  • PodPreset
  • PodTolerationRestriction
  • Priority
  • ResourceQuota
  • RuntimeClass
  • PodSecurityPolicy
  • SecurityContextDeny
  • ServiceAccount
  • PersistentVolumeLabel
  • PersistentVolumeClaimResize
  • DefaultStorageClass
  • StorageObjectInUseProtection
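
The mutate-then-validate order described for admission can be sketched as follows. The pod, mutator and validator types are hypothetical, and the sample mutator is only loosely modeled on what AlwaysPullImages does; the point is that validation sees the object after all mutations.

```go
package main

import (
	"errors"
	"fmt"
)

// pod is a tiny stand-in for the object under admission.
type pod struct {
	Image      string
	PullPolicy string
}

type mutator func(*pod)
type validator func(pod) error

// admit runs every mutator first, then every validator on the result.
func admit(p *pod, mutators []mutator, validators []validator) error {
	for _, m := range mutators {
		m(p)
	}
	for _, v := range validators {
		if err := v(*p); err != nil {
			return err
		}
	}
	return nil
}

// admitPod admits a pod with the given image through one mutator and
// one validator and describes the outcome.
func admitPod(image string) string {
	p := &pod{Image: image}
	mutators := []mutator{
		func(p *pod) { p.PullPolicy = "Always" }, // force the pull policy
	}
	validators := []validator{
		func(p pod) error {
			if p.Image == "" {
				return errors.New("image must not be empty")
			}
			if p.PullPolicy != "Always" {
				return errors.New("pull policy must be Always")
			}
			return nil
		},
	}
	if err := admit(p, mutators, validators); err != nil {
		return "rejected: " + err.Error()
	}
	return "admitted with pullPolicy=" + p.PullPolicy
}

func main() {
	fmt.Println(admitPod("nginx"))
	fmt.Println(admitPod(""))
}
```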

The repository I am looking at is https://github.com/kubernetes/kubernetes. The apiserver code is mainly scattered in the following locations.

  • cmd/kube-apiserver: apiserver main function entry. It mainly encapsulates a lot of startup parameters.
  • pkg/kubeapiserver: Provides code shared by kube-apiserver and federation-apiserver, but is not part of the generic API server.
  • plugin/pkg: Contains the plugins related to authentication, authorization and admission control.
  • staging/src/k8s.io/apiserver: This is the core code of apiserver. The pkg/server below it is the entry point for starting the server.