kube-vip can provide Kubernetes-native HA load balancing for your control plane nodes, so we no longer need to set up HAProxy and Keepalived externally to achieve high availability for the cluster.

kube-vip is an open source project that provides high availability and load balancing for Kubernetes clusters, both internally and externally. It has been used in VMware’s Tanzu project to replace HAProxy as the load balancer for vSphere deployments. In this article we will look at how kube-vip can be used to provide high availability and load balancing for the Kubernetes control plane.

Features

Originally created to provide an HA solution for the Kubernetes control plane, kube-vip has evolved over time to also provide the same functionality for Kubernetes Services of type LoadBalancer.

  • VIP addresses can be IPv4 or IPv6
  • Control plane with ARP (layer 2) or BGP (layer 3)
  • Use leader election or raft control plane
  • Control plane HA with kubeadm (static Pod)
  • Control plane HA with K3s and other distributions (DaemonSets)
  • Service LoadBalancer with ARP leader election (layer 2)
  • Service LoadBalancer using multiple nodes via BGP
  • Service LoadBalancer address pool per namespace or globally
  • Service LoadBalancer addresses exposed to the gateway via UPNP

HAProxy and kube-vip for HA clusters

In the past, when we created a Kubernetes cluster in a private environment, we needed to prepare a hardware or software load balancer to build a multi-control-plane cluster; more often than not we would choose HAProxy + Keepalived for this. Typically we would create two load balancer VMs and assign a VIP; the VIP fronts the load balancer, which then redirects traffic to one of the Kubernetes control plane nodes on the backend.
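
For comparison, here is a minimal sketch of that traditional setup. The VIP 192.168.0.99 is hypothetical, the backend addresses are the control plane IPs that appear later in this article, and both files are fragments rather than complete configurations.

# On both load balancer VMs: a minimal keepalived fragment that manages the VIP
cat > /etc/keepalived/keepalived.conf <<EOF
vrrp_instance VI_1 {
    state MASTER            # BACKUP on the second VM
    interface eth0
    virtual_router_id 51
    priority 101            # use a lower priority on the BACKUP VM
    virtual_ipaddress {
        192.168.0.99        # hypothetical VIP that clients connect to
    }
}
EOF

# On both load balancer VMs: an HAProxy fragment (appended after the existing
# global/defaults sections) that forwards the VIP traffic to the apiservers
cat >> /etc/haproxy/haproxy.cfg <<EOF
frontend kube-apiserver
    bind *:6443
    mode tcp
    default_backend apiservers

backend apiservers
    mode tcp
    balance roundrobin
    server k8s-master-0 192.168.0.201:6443 check
    server k8s-master-1 192.168.0.202:6443 check
    server k8s-master-2 192.168.0.203:6443 check
EOF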

Next, let’s see what happens if we use kube-vip.

kube-vip runs as a static Pod on the control plane nodes. These Pods discover the other hosts through ARP, so you need to add the IP address of each node to the hosts file, as shown below. We can use either BGP or ARP to set up the load balancer, which is quite similar to MetalLB. Since we have no BGP service here and just want to test quickly, we will use ARP with static Pods.
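
For example, assuming the node names and addresses used later in this article, the hosts file on every node could look like this (a sketch, adjust to your own environment):

cat >> /etc/hosts <<EOF
192.168.0.201 k8s-master-0
192.168.0.202 k8s-master-1
192.168.0.203 k8s-master-2
192.168.0.204 k8s-worker-0
192.168.0.205 k8s-worker-1
192.168.0.206 k8s-worker-2
EOF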

kube-vip architecture

kube-vip offers a number of design options for providing high availability or networking functions as part of a VIP/load balancing solution.

Cluster

kube-vip builds a multi-node or multi-Pod cluster to provide high availability. In ARP mode a leader is elected; this node inherits the virtual IP and becomes the leader of the load balancing within the cluster. In BGP mode, all nodes advertise the VIP address.

When using ARP or layer 2, it uses leader election. There is also a raft clustering technique, but that approach has largely been superseded by leader election, especially when running inside a cluster.

Virtual IP

The leader in the cluster assumes the VIP and binds it to the interface declared in the configuration. When the leader changes, it revokes the VIP first; in failure scenarios the VIP is assumed directly by the next elected leader.

When the VIP moves from one host to another, any host that was using the VIP keeps the previous VIP <-> MAC address mapping until the ARP entry expires (typically 30 seconds) and a new VIP <-> MAC mapping is retrieved. This can be optimized with gratuitous ARP broadcasts, covered in the next section.
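
A quick way to observe this behaviour, assuming the interface eth0 and the VIP 192.168.0.100 that is used later in this article:

# On each control plane node: only the current leader should have the VIP bound
ip addr show dev eth0 | grep 192.168.0.100

# On a client in the same layer 2 segment: the cached VIP <-> MAC mapping
ip neigh | grep 192.168.0.100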

ARP

kube-vip can optionally be configured to broadcast a gratuitous ARP, which normally notifies all local hosts immediately that the VIP <-> MAC address mapping has changed.

Below we can see that when the ARP broadcast is received, failover usually completes within a few seconds.

64 bytes from 192.168.0.75: icmp_seq=146 ttl=64 time=0.258 ms
64 bytes from 192.168.0.75: icmp_seq=147 ttl=64 time=0.240 ms
92 bytes from 192.168.0.70: Redirect Host(New addr: 192.168.0.75)
Vr HL TOS  Len   ID Flg  off TTL Pro  cks      Src      Dst
 4  5  00 0054 bc98   0 0000  3f  01 3d16 192.168.0.95  192.168.0.75

Request timeout for icmp_seq 148
92 bytes from 192.168.0.70: Redirect Host(New addr: 192.168.0.75)
Vr HL TOS  Len   ID Flg  off TTL Pro  cks      Src      Dst
 4  5  00 0054 75ff   0 0000  3f  01 83af 192.168.0.95  192.168.0.75

Request timeout for icmp_seq 149
92 bytes from 192.168.0.70: Redirect Host(New addr: 192.168.0.75)
Vr HL TOS  Len   ID Flg  off TTL Pro  cks      Src      Dst
 4  5  00 0054 2890   0 0000  3f  01 d11e 192.168.0.95  192.168.0.75

Request timeout for icmp_seq 150
64 bytes from 192.168.0.75: icmp_seq=151 ttl=64 time=0.245 ms
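
A failover test like the trace above can be reproduced with ordinary tools. The commands below are only a sketch, again assuming interface eth0 and the VIP 192.168.0.100 used later in this article.

# Terminal 1, on any host in the same layer 2 segment:
# watch gratuitous ARP announcements for the VIP
tcpdump -ni eth0 arp and host 192.168.0.100

# Terminal 2: ping the VIP continuously
ping 192.168.0.100

# Then reboot the control plane node that currently holds the VIP (or stop
# its kubelet) and count how many pings are lost before the new leader
# announces the address.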

Using kube-vip

Next, let’s build a highly available Kubernetes cluster using kube-vip. Start by preparing 6 nodes.

  • 3 control plane nodes
  • 3 worker nodes

First install the dependencies on each host, including kubeadm, kubelet, kubectl, and a container runtime, in this case containerd.
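
One possible way to do this on Ubuntu 20.04 is sketched below; the apt repository and version pins match the Kubernetes 1.20 / containerd 1.4 versions that appear later in this article, so treat them as assumptions and adjust for your environment.

swapoff -a   # the kubelet requires swap to be disabled
apt-get update && apt-get install -y apt-transport-https ca-certificates curl containerd
curl -fsSL https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
echo "deb https://apt.kubernetes.io/ kubernetes-xenial main" > /etc/apt/sources.list.d/kubernetes.list
apt-get update
apt-get install -y kubelet=1.20.2-00 kubeadm=1.20.2-00 kubectl=1.20.2-00
apt-mark hold kubelet kubeadm kubectl
# generate a default containerd config; make sure the runtime and the kubelet
# agree on the cgroup driver (the kubeadm config below sets the kubelet to systemd)
mkdir -p /etc/containerd
containerd config default > /etc/containerd/config.toml
systemctl restart containerd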

Pull the kube-vip image and generate the static Pod YAML manifest in /etc/kubernetes/manifests, so that Kubernetes will automatically deploy the kube-vip Pod on each control plane node.

# Set the VIP address
export VIP=192.168.0.100
export INTERFACE=eth0
ctr image pull docker.io/plndr/kube-vip:0.3.1
ctr run --rm --net-host docker.io/plndr/kube-vip:0.3.1 vip \
    /kube-vip manifest pod \
    --interface $INTERFACE \
    --vip $VIP \
    --controlplane \
    --services \
    --arp \
    --leaderElection | tee /etc/kubernetes/manifests/kube-vip.yaml

Next, you can configure kubeadm as follows.

cat > ~/init_kubelet.yaml <<EOF
apiVersion: kubeadm.k8s.io/v1beta2
kind: InitConfiguration
bootstrapTokens:
- token: "9a08jv.c0izixklcxtmnze7"
  description: "kubeadm bootstrap token"
  ttl: "24h"
nodeRegistration:
  criSocket: "/var/run/containerd/containerd.sock"
---
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
controlPlaneEndpoint: "192.168.0.100:6443"
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: "systemd"
protectKernelDefaults: true
EOF
kubeadm init --config ~/init_kubelet.yaml --upload-certs
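
Once kubeadm init completes, copy the admin kubeconfig on this node so kubectl can reach the new cluster (this mirrors the instructions kubeadm prints at the end of its output):

mkdir -p $HOME/.kube
cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
chown $(id -u):$(id -g) $HOME/.kube/config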

Then install a CNI plugin; here we choose Cilium.

curl https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 | bash
helm repo add cilium https://helm.cilium.io/
helm install cilium cilium/cilium --version 1.9.4 \
--namespace kube-system
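
Before joining the remaining nodes it is worth waiting until the Cilium Pods are running. The label selector below is the one Cilium’s chart normally applies to its agent Pods; treat it as an assumption if your version differs.

kubectl -n kube-system get pods -l k8s-app=cilium -o wide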

After the first control plane node is ready, let the other nodes join your cluster. For the other control plane nodes, run the following command.

kubeadm join 192.168.0.100:6443 --token hash.hash \
     --discovery-token-ca-cert-hash sha256:hash \
     --control-plane --certificate-key key

For a worker node, run a command similar to the following.

kubeadm join 192.168.0.100:6443 --token hash.hash \
    --discovery-token-ca-cert-hash sha256:hash

Once these commands complete successfully, the cluster is up and running:

# kubectl get node -o wide
NAME           STATUS   ROLES                  AGE    VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
k8s-master-0   Ready    control-plane,master   121m   v1.20.2   192.168.0.201   <none>        Ubuntu 20.04.2 LTS   5.4.0-45-generic   containerd://1.4.3
k8s-master-1   Ready    control-plane,master   114m   v1.20.2   192.168.0.202   <none>        Ubuntu 20.04.2 LTS   5.4.0-45-generic   containerd://1.4.3
k8s-master-2   Ready    control-plane,master   113m   v1.20.2   192.168.0.203   <none>        Ubuntu 20.04.2 LTS   5.4.0-45-generic   containerd://1.4.3
k8s-worker-0   Ready    <none>                 114m   v1.20.2   192.168.0.204   <none>        Ubuntu 20.04.2 LTS   5.4.0-45-generic   containerd://1.4.3
k8s-worker-1   Ready    <none>                 114m   v1.20.2   192.168.0.205   <none>        Ubuntu 20.04.2 LTS   5.4.0-45-generic   containerd://1.4.3
k8s-worker-2   Ready    <none>                 112m   v1.20.2   192.168.0.206   <none>        Ubuntu 20.04.2 LTS   5.4.0-45-generic   containerd://1.4.3

Now you can see that the endpoint of our control plane is 192.168.0.100, and we achieved this without any additional load balancer nodes, which is very convenient.
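
As a final check, kubectl reaches the cluster through the VIP rather than through any single node; the output should look roughly like this:

kubectl cluster-info
# the control plane should be reported at https://192.168.0.100:6443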