Why use cluster federation

Since version 1.8, Kubernetes has claimed to support up to 5,000 nodes and 150,000 Pods in a single cluster, yet few companies actually deploy a single cluster that large. In many cases we deploy multiple clusters for a variety of reasons but still want to manage them in a unified way, and that is where cluster federation comes in.

Some typical application scenarios for cluster federation.

  • High availability: Deploying applications on multiple clusters minimizes the impact of cluster failures.
  • Avoid vendor lock-in: Distribute application load across multiple vendor clusters and migrate directly to other vendors when needed
  • Fault isolation: Having multiple small clusters may be more conducive to fault isolation than a single large cluster

Federation v1

Federation v1 was the earliest multi-cluster project, proposed and maintained by the K8s community.

Federation v1 started around K8s v1.3 (Design Proposal); the related components and command line tool (kubefed) were released in later versions to help users quickly build federated clusters, and the project entered the Beta phase at v1.6. After Beta, however, Federation v1 saw no further development and was officially abandoned around K8s v1.11 due to flexibility and API maturity issues.

Federation v1

The basic architecture of v1 is shown in the figure above, with three main components.

  • Federation API Server: similar to the K8s API Server, providing a unified entry point for resource management, but the set of supported K8s resources can only be extended by writing Adapters.
  • Controller Manager: Provides resource scheduling and state synchronization among multiple clusters, similar to kube-controller-manager.
  • Etcd: stores the Federation’s resources

In v1, the general steps to create a federated resource are as follows: all federation-related configuration is written into the annotations of the resource object; the overall creation process is similar to plain K8s, with the resource first created in the Federation API Server, after which the Federation Controller Manager creates the resource in each member cluster according to the configuration in the annotations. Here is a ReplicaSet example.

ReplicaSet
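As a rough illustration only (a sketch based on the v1 design documents; the annotation key and JSON structure shown here may differ between releases), the per-cluster preferences of a federated ReplicaSet were carried in an annotation:

apiVersion: extensions/v1beta1
kind: ReplicaSet
metadata:
  name: nginx
  annotations:
    # v1-style scheduling preferences: per-cluster minimum replicas and weights
    federation.kubernetes.io/replica-set-preferences: |
      {
        "rebalance": true,
        "clusters": {
          "cluster1": {"minReplicas": 1, "weight": 1},
          "cluster2": {"minReplicas": 1, "weight": 2}
        }
      }
spec:
  replicas: 3
  ...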

There are two main problems associated with this architecture.

  • Not flexible enough: every newly supported resource requires a new Adapter (the code has to be committed and re-released), and objects carry many Federation-specific annotations
  • Lack of independent API object versioning: for example, Deployment is GA in Kubernetes but only Beta in Federation v1

Federation v2

With the experience and lessons learned from v1, the community has proposed a new cluster federation architecture: Federation v2; the evolution of the Federation project can also be found in the article Kubernetes Federation Evolution.

The v2 release leverages CRDs to implement the overall functionality, defining multiple custom resources (CRs) and thus eliminating the separate API Server used in v1. The v2 release consists of two components.

  • admission-webhook provides access control
  • controller-manager handles custom resources and coordinates state across clusters

The general process to create a federated resource in v2 is as follows.

general process to create a federated resource in v2

A Federated Resource is created in the API Server of the host cluster, and the controller-manager then steps in to distribute the corresponding resources to the different clusters; the distribution rules are written in the Federated Resource object itself.

Logically, Federation v2 is divided into two major parts: configuration and propagation. Configuration mainly consists of two pieces: Cluster Configuration and Type Configuration.

Cluster Configuration

This is used to store the API authentication information of the federated clusters; clusters are joined/removed via kubefedctl join/unjoin. When a cluster is successfully joined, a KubeFedCluster CR is created to store cluster-related information such as the API Endpoint, CA Bundle and Token. The controller-manager later uses this information to access the different Kubernetes clusters.
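A hedged sketch of the join/unjoin commands (cluster and context names are placeholders; flags may vary slightly between KubeFed releases):

# Register cluster1 with the federation whose control plane runs in the "host" kubeconfig context
kubefedctl join cluster1 --cluster-context cluster1 --host-cluster-context host

# Remove cluster1 from the federation again
kubefedctl unjoin cluster1 --cluster-context cluster1 --host-cluster-context host

The KubeFedCluster created by a successful join looks like the following.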

apiVersion: core.kubefed.io/v1beta1
kind: KubeFedCluster
metadata:
  creationTimestamp: "2019-10-24T08:05:38Z"
  generation: 1
  name: cluster1
  namespace: kube-federation-system
  resourceVersion: "647452"
  selfLink: /apis/core.kubefed.io/v1beta1/namespaces/kube-federation-system/kubefedclusters/cluster1
  uid: 4c5eb57f-5ed4-4cec-89f3-cfc062492ae0
spec:
  apiEndpoint: https://172.16.200.1:6443
  caBundle: LS....Qo=
  secretRef:
    name: cluster1-shb2x
status:
  conditions:
  - lastProbeTime: "2019-10-28T06:25:58Z"
    lastTransitionTime: "2019-10-28T05:13:47Z"
    message: /healthz responded with ok
    reason: ClusterReady
    status: "True"
    type: Ready
  region: ""


Type Configuration

Type Configuration defines which Kubernetes API resources are to be federated. For example, to federate ConfigMap resources across clusters, you first create the new FederatedConfigMap resource type in the host cluster via a CRD, then create a Type Configuration (FederatedTypeConfig) resource named configmaps that ties ConfigMap to FederatedConfigMap, so that the KubeFed controller-manager knows how to handle the federated resource. An example is as follows.

apiVersion: core.kubefed.k8s.io/v1beta1
kind: FederatedTypeConfig
metadata:
  name: configmaps
  namespace: kube-federation-system
spec:
  federatedType:
    group: types.kubefed.k8s.io
    kind: FederatedConfigMap
    pluralName: federatedconfigmaps
    scope: Namespaced
    version: v1beta1
  propagation: Enabled
  targetType:
    kind: ConfigMap
    pluralName: configmaps
    scope: Namespaced
    version: v1

Federated Resource CRD

There is another key CRD: the Federated Resource CRD. If you want a new resource type to be federated, you need to create a new FederatedXX CRD that describes the structure of the corresponding resource and its distribution strategy (to which clusters it needs to be distributed). A Federated Resource CRD consists of three main parts.

  • Templates, which describe the resource to be federated
  • Placement, which describes the clusters the resource will be distributed to; if not configured, the resource is not distributed to any cluster
  • Overrides, which allow parts of the resource to be overridden for specific clusters

An example is shown below.

apiVersion: types.kubefed.k8s.io/v1beta1
kind: FederatedDeployment
metadata:
  name: test-deployment
  namespace: test-namespace
spec:
  template: # Defines the entire content of the Deployment; it can be understood as the link between the Deployment and its Pods.
    metadata:
      labels:
        app: nginx
    spec:
      ...
  placement:
    clusters:
    - name: cluster2
    - name: cluster1
  overrides:
  - clusterName: cluster2
    clusterOverrides:
    - path: spec.replicas
      value: 5

These FederatedXX CRDs can be created with kubefedctl enable <target kubernetes API type>, or you can generate/write the corresponding CRDs yourself and create them manually.
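For example (a sketch; kubefedctl enable takes the target API type and generates both the FederatedXX CRD and the matching FederatedTypeConfig):

# Enable federation of Deployments and ConfigMaps
kubefedctl enable deployments.apps
kubefedctl enable configmaps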

Combined with the Cluster Configuration, Type Configuration and Federated Resource CRD introduced above, the overall architecture and related concepts of the v2 version are much clearer.

architecture

Scheduling

KubeFed currently supports only fairly simple inter-cluster scheduling, namely manual designation. Manual designation comes in two forms: setting the destination clusters directly in the resource, or using a ReplicaSchedulingPreference for proportional allocation.

Specifying clusters directly in the resource can be done either by listing clusters via clusters or by selecting clusters based on cluster labels via clusterSelector, but there are two points to note.

  • If the clusters field is specified, clusterSelector will be ignored
  • The selected clusters are treated equally; the resource is deployed as an identical copy in each selected cluster
spec:
  placement:
    clusters:
    - name: cluster2
    - name: cluster1
    clusterSelector:
      matchLabels:
        foo: bar

If you need to schedule differently across multiple clusters, you need to introduce a ReplicaSchedulingPreference for proportional scheduling.

apiVersion: scheduling.kubefed.io/v1alpha1
kind: ReplicaSchedulingPreference
metadata:
  name: test-deployment
  namespace: test-ns
spec:
  targetKind: FederatedDeployment
  totalReplicas: 9
  clusters:
    A:
      minReplicas: 4
      maxReplicas: 6
      weight: 1
    B:
      minReplicas: 4
      maxReplicas: 8
      weight: 2

totalReplicas defines the total number of replicas; clusters describes the minimum/maximum replicas and the weight for each cluster.

Currently ReplicaSchedulingPreference only supports deployments and replicasets as resources.

Karmada

Karmada is a multi-cloud container orchestration project open-sourced by Huawei. This project is a continuation of Kubernetes Federation v1 and v2, and some basic concepts are inherited from these two versions.

Karmada

Karmada has three main components.

  • Karmada API Server: essentially a normal K8s API Server, backed by a separate etcd that stores the resources to be federated
  • Karmada Controller Manager: a collection of controllers that listen to objects in the Karmada API Server and communicate with member cluster API servers
  • Karmada Scheduler: provides advanced multi-cluster scheduling policies

Similar to Federation v1, we write resources to Karmada’s own API Server, and the controller-manager then distributes them to each cluster according to certain policies. However, this API Server is a native K8s API Server, so it supports arbitrary resources and avoids the problems described earlier. The distribution policy of federated resources is controlled by a separate CRD, so there is no need to configure a Federated Resource CRD and Type Configuration as in v2.

Some basic concepts of Karmada.

  • Resource Template: Karmada uses K8s native API definitions as resource templates to facilitate quick interface to the K8s ecosystem toolchain.
  • Propagation Policy: Karmada provides a separate policy API to configure resource distribution policies.
  • Override Policy: Karmada provides a separate differentiation API to configure per-cluster differences, such as using different images in different clusters

karmada concepts

Cluster

The Cluster resource is similar to Federation v2’s KubeFedCluster in that it contains the information needed to access the managed cluster: API Endpoint, CA Bundle and Access Token.

spec:
  apiEndpoint: https://172.31.165.66:55428
  secretRef:
    name: member1
    namespace: karmada-cluster
  syncMode: Push

One difference, however, is that Karmada’s Cluster resources have two sync modes: Push and Pull. Push is the more common mode: the Karmada components in the host cluster are responsible for synchronizing and updating the status of such clusters. In Pull mode, the member cluster runs a karmada-agent component that collects its own state and updates the status of the corresponding Cluster resource in the host cluster.
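For comparison, a minimal sketch of a Pull-mode member (field values are placeholders; the spec shape matches the Push example above):

spec:
  apiEndpoint: https://172.31.165.67:55429
  secretRef:
    name: member3
    namespace: karmada-cluster
  syncMode: Pull   # the karmada-agent running inside member3 reports status back to the host cluster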

Propagation Policy

To distribute resources to member clusters in Karmada, you configure a separate PropagationPolicy CR. Take the following nginx application as an example, starting with the Resource Template, which is a normal K8s Deployment.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - image: nginx
        name: nginx

You can then configure a PropagationPolicy to control the distribution policy for this nginx Deployment resource. In the example below, the nginx application is distributed to the member1 and member2 clusters on a 1:1 weighting.

apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: nginx-propagation
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: nginx
  placement:
    clusterAffinity:
      clusterNames:
        - member1
        - member2
    replicaScheduling:
      replicaDivisionPreference: Weighted
      replicaSchedulingType: Divided
      weightPreference:
        staticWeightList:
          - targetCluster:
              clusterNames:
                - member1
            weight: 1
          - targetCluster:
              clusterNames:
                - member2
            weight: 1
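
A sketch of creating both objects against the Karmada API Server (the file names and kubeconfig name are placeholders for whatever points at the Karmada control plane):

# Apply the Resource Template and the PropagationPolicy to the Karmada API Server
kubectl apply -f nginx-deployment.yaml --kubeconfig karmada-apiserver.config
kubectl apply -f nginx-propagation.yaml --kubeconfig karmada-apiserver.config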

After creating these two resources in the Karmada API Server, you can query their status through the Karmada API Server.

$ kubectl get deploy
NAME    READY   UP-TO-DATE   AVAILABLE   AGE
nginx   2/2     2            2           51s

However, note that this does not mean that the application is running on the cluster where the Karmada API Server is located. In fact, there is no workload on that cluster; it only stores the Resource Templates. The actual workloads run in the member1 and member2 clusters configured in the PropagationPolicy above, which can be verified by switching to the member1/member2 cluster.

$ kubectl get deploy
NAME    READY   UP-TO-DATE   AVAILABLE   AGE
nginx   1/1     1            1           6m26s

$ kubectl get pod
NAME                     READY   STATUS    RESTARTS   AGE
nginx-6799fc88d8-7cgfz   1/1     Running   0          6m29s

In addition to the cluster names used above, the distribution policy also supports LabelSelector, FieldSelector and ExcludeClusters; if several of these filters are set, only clusters that satisfy all of the conditions are selected. Besides cluster affinity, it also supports SpreadConstraints: dynamically grouping clusters by region, zone and provider, which lets you distribute applications only to certain types of clusters.
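A hedged sketch of such a placement (field names follow the Karmada Placement API; the label key, excluded cluster and group sizes are illustrative):

spec:
  placement:
    clusterAffinity:
      labelSelector:
        matchLabels:
          environment: production   # only clusters carrying this label are candidates
      exclude:
        - member3                    # never propagate to this cluster
    spreadConstraints:
      - spreadByField: region        # spread the workload across 2 to 3 regions
        minGroups: 2
        maxGroups: 3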

For resources with replicas (such as the native Deployment and StatefulSet), Karmada can adjust the replica count as needed when distributing the resource to different clusters, e.g. 2 replicas of the nginx application on the member1 cluster and only 1 replica on member2. Several policies control this. The first is ReplicaSchedulingType, which has two possible values.

  • Duplicated: each candidate member cluster gets the same number of replicas, copied from the original resource; this has the same effect as not setting a ReplicaSchedulingStrategy at all.
  • Divided: the replicas are split among the valid candidate clusters, and the number of replicas per cluster is determined by ReplicaDivisionPreference.

This ReplicaDivisionPreference in turn has two possible values.

  • Aggregated: schedules the replicas onto as few clusters as possible, taking the resources available in each cluster into account. If one cluster can hold all the replicas, they are scheduled only to that cluster; the corresponding resource still exists on the other clusters, but with 0 replicas.
  • Weighted: the replicas are divided according to WeightPreference, which simply specifies how much weight each cluster gets.

The full and detailed structure can be found in Placement’s API definition.
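For instance, a minimal sketch of a division preference that packs replicas into as few clusters as possible, as opposed to the static Weighted example shown earlier:

spec:
  placement:
    replicaScheduling:
      replicaSchedulingType: Divided
      replicaDivisionPreference: Aggregated   # fill the fewest clusters whose available resources suffice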

Override Policy

Override Policy is as simple as it sounds: you add an OverridePolicy CR to give different clusters different configurations. See the example below.

apiVersion: policy.karmada.io/v1alpha1
kind: OverridePolicy
metadata:
  name: example-override
  namespace: default
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: nginx
  targetCluster:
    clusterNames:
      - member1
    labelSelector:
      matchLabels:
        failuredomain.kubernetes.io/region: dc1
  overriders:
    plaintext:
      - path: /spec/template/spec/containers/0/image
        operator: replace
        value: 'dc-1.registry.io/nginx:1.17.0-alpine'
      - path: /metadata/annotations
        operator: add
        value:
          foo: bar

Demo

Finally, see an official Karmada example.

Karmada example