Kubernetes takes its name from the Greek word for “helmsman”: the one who commands and steers. It is exactly that kind of container orchestration and scheduling infrastructure platform, derived from Borg, Google’s internal container cluster management platform that dates back to 2003. Borg grew from a small project into the cluster management system supporting thousands of applications and jobs inside Google, and its success speaks for itself.

In 2014, Google released Kubernetes as an open-source descendant of Borg, which was exciting news, and Microsoft, IBM, and Red Hat soon joined the Kubernetes community and contributed to it. The project has matured, the community is very active, and today Kubernetes is one of the brightest open source projects around.

The success of Kubernetes lies in filling the gap in orchestration and scheduling for large-scale container clusters. Before it, deploying and managing applications felt like Stone Age work in the cloud era: time-consuming and laborious. The concept of containers had become popular, yet people were still using them in fairly primitive ways. Some technology frameworks, such as Docker Swarm, tried to change this status quo, but the response was lukewarm. Then Kubernetes appeared and people were stunned: could Kubernetes, Docker, and microservices really combine so organically? It is hard to believe they are separate projects with different design ideas, yet integrated together they fit so well.

Kubernetes is therefore well worth learning and understanding, and grasping its architecture and its operating mechanism from a macro perspective is an essential part of that.

Architecture Overview

The overall Kubernetes system architecture is a client/server (C/S) architecture: the Master acts as the server and each Worker node acts as a client. In a production cluster, multiple Master nodes are usually deployed to achieve high availability (HA).

We will then outline the “responsibilities” of the two different node types, Master and Worker.

Master Node

Main responsibilities

  1. Managing all Nodes in the cluster.
  2. Scheduling the cluster’s Pods.
  3. Managing the operational state of the cluster.

Main components

  • kube-apiserver: responsible for handling CRUD requests for resources, providing a REST API interface.
  • kube-scheduler: responsible for scheduling Pod resources in the cluster (deciding which Pod runs on which Node).
  • kube-controller-manager: controller manager that automates cluster state management (e.g., automatic scaling, rolling updates).

Worker Node

Main responsibilities

  1. Managing the lifecycle of containers, networking, storage, and so on.
  2. Monitoring and reporting the operational status of Pods.

Main components

  • kubelet: manages the lifecycle of containers and communicates with the Master node; it can be understood as Kubernetes’ agent on the Worker node.
  • kube-proxy: responsible for implementing communication for Kubernetes Service objects. It works by dynamically generating iptables or ipvs rules for the Pods on the current node, and it keeps watching kube-apiserver: once it sees that a Service’s backend Pods have changed, it updates those rules accordingly.
  • container engine: responsible for receiving kubelet commands and performing basic management of containers.

Component Analysis

[Master] kube-apiserver

As the name suggests, “apiserver” is essentially an API server, and a RESTful one at that. REST APIs immediately bring to mind the concept of “resources”, and indeed, kube-apiserver is the REST API that provides CRUD operations for the various resources in a Kubernetes cluster.

It is worth noting that kube-apiserver is the only core component in the cluster that talks to the Etcd cluster, for a simple reason: cluster state information, metadata, and cluster resource objects are all stored under the /registry prefix in the Etcd storage cluster. Since kube-apiserver handles the CRUD of these resources, it has to deal with Etcd frequently.

The next question is how we, as cluster operators or developers, interact with kube-apiserver. There are two common ways.

  • kubectl command line tool: it talks to kube-apiserver over HTTP/JSON. A kubectl invocation works roughly as follows: the user types a kubectl command and executes it; kubectl converts it into the corresponding HTTP request and sends it to kube-apiserver; kube-apiserver processes the request and returns a response; kubectl receives the response and displays the result.

  • client-go: for secondary development on top of Kubernetes, we usually prefer to manage Kubernetes resources through code. As you might expect, there is an official Go language client for exactly this need.

    client-go is heavily optimized for Kubernetes, and many core Kubernetes components, including kube-scheduler and kube-controller-manager, interact with kube-apiserver through it internally. Anyone who wants to do secondary development based on Kubernetes should therefore master client-go. I have been practicing with it for some time and found it an easy-to-use and efficient toolkit; I will probably write a separate article about client-go sometime. A minimal example follows below.
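Here is a minimal client-go sketch that lists the Pods in the default namespace, the sort of request the components above issue all the time. The kubeconfig path is an assumption, and error handling is kept deliberately terse.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client config from a local kubeconfig (the path is an assumption).
	config, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
	if err != nil {
		panic(err)
	}

	// The clientset wraps the REST calls to kube-apiserver for every resource group.
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Roughly equivalent to `kubectl get pods -n default`, but as a typed API call.
	pods, err := clientset.CoreV1().Pods("default").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, pod := range pods.Items {
		fmt.Println(pod.Name)
	}
}
```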

[Master] kube-scheduler

The default scheduler for Kubernetes clusters. The basic scheduling unit in Kubernetes is the Pod, so the responsibility of kube-scheduler is to find a suitable Worker Node for each Pod resource object in the cluster to run on. What does that look like at the micro level? It simply fills the Pod’s spec.nodeName field with the name of a Worker Node. Of course, the algorithms kube-scheduler runs to choose that name are very involved. A sketch of this binding step follows below.
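To make the “just fill in spec.nodeName” point concrete, here is a hedged sketch of the binding step a scheduler ultimately performs through the API. It assumes a clientset built as in the client-go example above; the pod and node names are hypothetical.

```go
package scheduler

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// bindPodToNode is the essence of a scheduling decision: creating a Binding
// object, which results in the Pod's spec.nodeName being set to the chosen node.
// The pod and node names below are purely illustrative.
func bindPodToNode(clientset *kubernetes.Clientset) error {
	binding := &corev1.Binding{
		ObjectMeta: metav1.ObjectMeta{Name: "demo-pod", Namespace: "default"},
		Target:     corev1.ObjectReference{Kind: "Node", Name: "worker-1"},
	}
	return clientset.CoreV1().Pods("default").Bind(context.TODO(), binding, metav1.CreateOptions{})
}
```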

Scheduling happens in two phases: a predicate (pre-selection) phase and a priority (scoring) phase. The former works by one-vote veto: only nodes that satisfy every predicate remain candidates. The latter scores the remaining nodes and schedules the Pod onto the highest-scoring one, choosing randomly among ties.

  • Predicate (pre-selection) phase: from all the nodes, exclude those that cannot meet the basic operational requirements of the Pod. Common predicates include:

    • CheckNodeCondition: check whether the node itself is healthy
    • HostName: if the Pod specifies a host/node name, check that this node matches it
    • MatchNodeSelector: check whether the node’s labels match the Pod’s pods.spec.nodeSelector
    • PodFitsResources: check whether the node can satisfy the Pod’s resource requests
    • NoDiskConflict: check whether the storage volumes the Pod depends on can be satisfied without conflict
    • NoExecute: a taint check; Pods that do not tolerate a node’s NoExecute taint are evicted from it
    • CheckNodePIDPressure: check whether the node is under PID pressure
    • CheckNodeDiskPressure: check whether the node is under disk pressure
    • MatchInterPodAffinity: check whether the node satisfies the Pod’s inter-Pod affinity or anti-affinity rules (these must be defined by the user)
    • … (and more, not enumerated here)
  • Priority (scoring) phase: each remaining node is fed into a series of priority functions that compute a score; after all functions have run, the node with the highest total score is the best match. Some priority functions are briefly listed below.

    • LeastRequested: scores nodes by the ratio of free resources to total capacity; the more free capacity, the higher the score
    • BalancedResourceAllocation: nodes whose CPU and memory utilization are close to each other win
    • NodePreferAvoidPods: carries a higher weight; scores a node based on whether it carries the annotation “scheduler.alpha.kubernetes.io/preferAvoidPods”
    • TaintToleration: the Pod’s spec.tolerations list is checked against the node’s taints; the more taints the Pod cannot tolerate, the lower the score
    • SelectorSpreading: spreads Pods matched by the same label selector across multiple nodes
    • InterPodAffinity: the more affinity terms a node matches, the higher its score
    • NodeAffinity: scores nodes according to node affinity rules
    • MostRequested: the opposite of LeastRequested; the less idle capacity, the higher the score, so a node’s resources are used up as much as possible (off by default)
    • NodeLabel: scores a node based on whether it carries specified labels (off by default)
    • ImageLocality: scores a node by the total size of the images it already has that the Pod needs (off by default)

kube-scheduler supports HA and implements the Leader election mechanism through distributed locks based on Etcd clusters. The instance that obtains the lock becomes the Leader node and executes the main logic of scheduling, while other Candidate nodes are in a blocking state.
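This election is also available through client-go, so a rough sketch of what kube-scheduler (and kube-controller-manager below) does internally might look like the following. The lock here is a Lease object that lives behind kube-apiserver (and thus ultimately in Etcd); the lock name, namespace, identity, and kubeconfig path are assumptions.

```go
package main

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config") // path is an assumption
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	// The lock is an ordinary API object (a Lease); whoever holds it is the Leader.
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "demo-scheduler", Namespace: "kube-system"},
		Client:     clientset.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: "instance-1"},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// Run the real scheduling / controller logic only while we hold the lock.
			},
			OnStoppedLeading: func() {
				// Lost the lock: stop doing Leader work and go back to waiting.
			},
		},
	})
}
```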

[Master] kube-controller-manager

kube-controller-manager manages cluster state, including resource state, node state, and so on, and it does so automatically: its core job is to keep the cluster in the desired state at all times, driving the resources it controls to converge to the state declared in their spec.

kube-controller-manager is called a “controller manager” because it hosts a collection of controllers such as the Deployment controller, the Namespace controller, and the PersistentVolume controller. These controllers watch the status of resource objects through the interface provided by kube-apiserver and, whenever a resource’s status deviates from the desired state declared in its spec, try to drive it back toward that spec. A simplified sketch of such a control loop follows below.
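The “converge to the desired state” idea is essentially a control loop. The sketch below is a deliberately simplified, hypothetical reconcile loop, not actual kube-controller-manager code, but it captures the pattern: compare actual with desired, then take one corrective step at a time.

```go
package main

import (
	"fmt"
	"time"
)

// State is a stand-in for a resource's replica count (purely illustrative).
type State struct {
	DesiredReplicas int // what the spec asks for
	ActualReplicas  int // what is actually running
}

// reconcile nudges the actual state toward the desired state one step at a time,
// which is the essence of what every controller does in its loop.
func reconcile(s *State) {
	switch {
	case s.ActualReplicas < s.DesiredReplicas:
		s.ActualReplicas++ // e.g. create a missing Pod
	case s.ActualReplicas > s.DesiredReplicas:
		s.ActualReplicas-- // e.g. delete an extra Pod
	}
}

func main() {
	s := &State{DesiredReplicas: 3, ActualReplicas: 1}
	for i := 0; i < 5; i++ {
		reconcile(s)
		fmt.Printf("desired=%d actual=%d\n", s.DesiredReplicas, s.ActualReplicas)
		time.Sleep(100 * time.Millisecond)
	}
}
```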

Similarly, kube-controller-manager supports HA: Leader election is implemented through distributed locks based on the Etcd cluster, with instances competing over a kube-apiserver resource. The instance that obtains the lock becomes the Leader and executes the main controller-manager logic, while the other Candidate instances remain blocked.

[Worker] kubelet

kubelet is one of the core components on the Worker Node. Its core responsibility is to manage the lifecycle of the Pod resource objects on that node: it interacts with kube-apiserver on the Master Node, receives the tasks assigned to it, and carries out the Pod management logic. The kubelet also monitors the node’s resource usage and reports it to kube-apiserver periodically, feedback that helps kube-scheduler make scheduling decisions. In addition, the kubelet cleans up unused images and containers on the Worker Node, acting as the steward of the node’s resources.

Another central concept around the kubelet is the three interfaces it builds on.

  • CRI (Container Runtime Interface): container runtime interface, providing compute resources
  • CNI (Container Network Interface): container network interface, providing network resources
  • CSI (Container Storage Interface): container storage interface, providing storage resources

The concept of an interface should be familiar. To some extent, an interface can be understood as a “protocol” that decouples a piece of logic from its concrete implementation. Kubernetes introduced three highly open interfaces, CRI, CNI, and CSI, to improve the extensibility of containers at the runtime, network, and storage levels respectively.

In short, take CRI as an example: although the default container runtime is still Docker, CRI lets us plug in other excellent container runtimes as our Pod backend, such as kata or rkt. The same goes for CNI: Calico and Flannel are familiar to everyone, each with its own merits, and the ecosystem keeps blossoming.

[Worker] kube-proxy

To be clear, kube-proxy implements the Service resource functionality in Kubernetes, i.e., internal access from Pod -> Service and external access from NodePort -> Service. A Service is an abstraction over a set of Pods, acting as an “LB” in front of them and distributing requests to the corresponding Pods.

Service provides an IP for this “LB”, generally called the cluster IP (a minimal Service definition is sketched after the list below).

The Cluster IP is a virtual IP; in fact it behaves more like a “fake” IP, for several reasons:

  • The Cluster IP exists only for the Kubernetes Service object; it is managed and assigned its IP address by Kubernetes
  • The Cluster IP cannot be pinged, because there is no “physical network object” behind it to respond
  • A Cluster IP only forms a usable communication endpoint when combined with the Service port; on its own it has no basis for communication, and it belongs to the closed space of the Kubernetes cluster
  • Pods behind different Services can reach each other within the cluster via Cluster IPs
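For reference, a Service with a cluster IP is just an API object like the hedged sketch below; when it is created through kube-apiserver with spec.clusterIP left empty, Kubernetes assigns the virtual IP automatically. The names, labels, and ports here are assumptions.

```go
package demo

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// demoService builds an illustrative Service object: it selects Pods labeled
// app=demo and exposes port 80, forwarding to the Pods' port 8080. When created
// through kube-apiserver, Kubernetes fills in spec.clusterIP automatically.
func demoService() *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "demo-svc", Namespace: "default"},
		Spec: corev1.ServiceSpec{
			Selector: map[string]string{"app": "demo"},
			Ports: []corev1.ServicePort{
				{Port: 80, TargetPort: intstr.FromInt(8080)},
			},
		},
	}
}
```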

kube-proxy works by monitoring changes in the Service and Endpoint resources of kube-apiserver and dynamically configuring iptables / ipvs to achieve load balancing of Service backend Pods.
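As a hedged illustration of that watch pattern (using client-go informers rather than kube-proxy’s actual source), the sketch below simply prints a line whenever an Endpoints object changes; kube-proxy reacts to exactly this kind of event by regenerating its iptables/ipvs rules. The kubeconfig path is an assumption.

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config") // path is an assumption
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	// Shared informer factory: watches resources through kube-apiserver and caches them locally.
	factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
	endpointsInformer := factory.Core().V1().Endpoints().Informer()

	// kube-proxy reacts to events like these by regenerating iptables/ipvs rules;
	// here we only log them.
	endpointsInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			ep := newObj.(*corev1.Endpoints)
			fmt.Printf("Endpoints changed: %s/%s\n", ep.Namespace, ep.Name)
		},
	})

	stopCh := make(chan struct{})
	factory.Start(stopCh)
	cache.WaitForCacheSync(stopCh, endpointsInformer.HasSynced)
	select {} // block forever
}
```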

Note that kube-proxy is only for Service communication and Pod requests, and not for other scenarios.