1 Prerequisite knowledge

1.1 Introduction to Cilium

Cilium is a Kubernetes CNI plug-in built on eBPF. Its official website positions it as providing eBPF-based networking, observability, and security for container workloads. Cilium implements these features by using eBPF to dynamically insert control logic into the Linux kernel, which can be applied and updated without modifying application code or container configuration.

Cilium

1.2 Introduction to Cilium BGP

BGP (Border Gateway Protocol) is a dynamic routing protocol used between autonomous systems (AS). It provides rich and flexible routing-control policies and was originally used mainly to interconnect AS on the Internet. BGP is now also widely used in data centers: modern data center networks are usually built on a Spine-Leaf architecture, in which BGP can be used to propagate endpoint reachability information.

Cilium BGP

The Leaf layer consists of access switches that aggregate traffic from the servers and connect directly into the Spine, or network core; the Spine layer interconnects with every Leaf switch in a full-mesh topology.

As Kubernetes adoption grows in the enterprise, these endpoints are increasingly Kubernetes Pods, so it became clear that Cilium should support BGP in order to let networks outside the Kubernetes cluster dynamically learn routes to the Pods they access.

BGP support was first introduced in Cilium 1.10: applications were exposed through LoadBalancer-type Services, and Cilium used an integrated MetalLB to announce their routes to BGP neighbors.

k8s cluster

However, as IPv6 usage continued to grow, it became clear that Cilium needed BGP IPv6 functionality, including Segment Routing v6 (SRv6). MetalLB currently has only limited, experimental IPv6 support via FRR, so the Cilium team evaluated the options and decided to move to the more feature-rich GoBGP [1].

GoBGP

In the latest Cilium 1.12 release, BGP support is enabled simply by setting the --enable-bgp-control-plane=true flag, and configuration becomes more granular and scalable through a new CRD, CiliumBGPPeeringPolicy.

  • The same BGP configuration can be applied to multiple nodes by label selection via the nodeSelector parameter.
  • When exportPodCIDR is set to true, all Pod CIDRs are announced dynamically, eliminating the need to manually specify which route prefixes to announce.
  • The neighbors parameter sets BGP neighbor information, usually pointing to network devices outside the cluster.

apiVersion: "cilium.io/v2alpha1"
kind: CiliumBGPPeeringPolicy
metadata:
  name: rack0
spec:
  nodeSelector:
    matchLabels:
      rack: rack0
  virtualRouters:
  - localASN: 65010
    exportPodCIDR: true
    neighbors:
    - peerAddress: "10.0.0.1/32"
      peerASN: 65010

1.3 Introduction to Kind

Kind [2] (Kubernetes in Docker) is a tool for running local Kubernetes clusters that uses Docker containers as cluster nodes. With nothing more than Docker installed, one or more Kubernetes clusters can be created in a few minutes. For ease of experimentation, this article uses Kind to build the Kubernetes cluster environment.
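
As a quick illustration of the workflow (the cluster name demo below is arbitrary and not used elsewhere in this article):

kind create cluster --name demo   # create a single-node cluster
kind get clusters                 # list existing Kind clusters
kind delete cluster --name demo   # remove it again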

1.4 Introduction to Containerlab

Containerlab [3] provides a simple, lightweight, container-based solution for orchestrating network labs, and it supports a variety of containerized network operating systems from vendors such as Cisco, Juniper, Nokia, and Arista. Based on a user-defined configuration file, Containerlab starts the containers and creates virtual links between them to build the desired network topology.

name: sonic01

topology:
  nodes:
    srl:
      kind: srl
      image: ghcr.io/nokia/srlinux
    sonic:
      kind: sonic-vs
      image: docker-sonic-vs:2020-11-12

  links:
    - endpoints: ["srl:e1-1", "sonic:eth1"]

Each container's management interface is attached to a Docker bridge network named clab, while the data-plane interfaces are wired up according to the links rules defined in the configuration file. This mirrors the out-of-band and in-band management models used in data center networks.
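
Once a lab is running, the management network can be inspected with ordinary Docker tooling; for example, assuming the default management network name clab:

docker network inspect clab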

Containerlab

Containerlab also provides a rich set of example labs, which can be found in Lab examples [4]. We can even build a data-center-scale network architecture with Containerlab (see the 5-stage Clos fabric [5]).

network architecture

2 Prerequisite preparation

Before starting, install Docker, Kind, kubectl, Helm, and Containerlab, choosing the installation method appropriate for your operating system.
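
On a Linux amd64 host, one possible way to install Kind, Helm, and Containerlab looks like the following; the versions and download paths are illustrative, so check each project's documentation for your platform:

# Kind: download a static binary (adjust version and architecture as needed)
curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.14.0/kind-linux-amd64
chmod +x ./kind && sudo mv ./kind /usr/local/bin/kind

# Helm: official install script
curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

# Containerlab: official install script
bash -c "$(curl -sL https://get.containerlab.dev)"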

The configuration files used in this article are available at https://github.com/cr7258/kubernetes-guide/tree/master/containerlab/cilium-bgp.

3 Starting a Kubernetes Cluster with Kind

Prepare a Kind configuration file to create a 4-node Kubernetes cluster.

# cluster.yaml
kind: Cluster
name: clab-bgp-cplane-demo
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  disableDefaultCNI: true 
  podSubnet: "10.1.0.0/16"
nodes:
- role: control-plane
  kubeadmConfigPatches:
  - |
    kind: InitConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        node-ip: 10.0.1.2  
        node-labels: "rack=rack0"    

- role: worker
  kubeadmConfigPatches:
  - |
    kind: JoinConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        node-ip: 10.0.2.2
        node-labels: "rack=rack0"    

- role: worker
  kubeadmConfigPatches:
  - |
    kind: JoinConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        node-ip: 10.0.3.2
        node-labels: "rack=rack1"    

- role: worker
  kubeadmConfigPatches:
  - |
    kind: JoinConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        node-ip: 10.0.4.2
        node-labels: "rack=rack1"    

Execute the following command to create a Kubernetes cluster via Kind.

kind create cluster --config cluster.yaml

create cluster

Check the cluster node status: the nodes are NotReady because no CNI plugin has been installed yet.

kubectl get node

kubectl get node
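
To confirm the cause, one option is to read the Ready condition message on a node; the node name below is the one Kind generates for this cluster's control plane:

kubectl get node clab-bgp-cplane-demo-control-plane \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")].message}'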

4 Start Containerlab

Define the Containerlab configuration file, create the network infrastructure and connect to the Kubernetes cluster created by Kind.

  • router0, tor0, and tor1 act as network devices outside the Kubernetes cluster; their interface addresses and BGP configuration are set in the exec parameter. router0 establishes BGP neighbors with tor0 and tor1; tor0 peers with server0, server1, and router0; tor1 peers with server2, server3, and router0.
  • Setting network-mode: container:<container-name> lets a Containerlab node share the network namespace of a container started outside of Containerlab; here the server0, server1, server2, and server3 containers share the network namespaces of the nodes of the Kubernetes cluster created with Kind in Section 3.

# topo.yaml
name: bgp-cplane-demo
topology:
  kinds:
    linux:
      cmd: bash
  nodes:
    router0:
      kind: linux
      image: frrouting/frr:v8.2.2
      labels:
        app: frr
      exec:
      - iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
      - ip addr add 10.0.0.0/32 dev lo
      - ip route add blackhole 10.0.0.0/8
      - touch /etc/frr/vtysh.conf
      - sed -i -e 's/bgpd=no/bgpd=yes/g' /etc/frr/daemons
      - /usr/lib/frr/frrinit.sh start
      - >-
         vtysh -c 'conf t'
         -c 'router bgp 65000'
         -c ' bgp router-id 10.0.0.0'
         -c ' no bgp ebgp-requires-policy'
         -c ' neighbor ROUTERS peer-group'
         -c ' neighbor ROUTERS remote-as external'
         -c ' neighbor ROUTERS default-originate'
         -c ' neighbor net0 interface peer-group ROUTERS'
         -c ' neighbor net1 interface peer-group ROUTERS'
         -c ' address-family ipv4 unicast'
         -c '   redistribute connected'
         -c ' exit-address-family'
         -c '!'
            
                   
    tor0:
      kind: linux
      image: frrouting/frr:v8.2.2  
      labels:
        app: frr
      exec:
      - ip link del eth0
      - ip addr add 10.0.0.1/32 dev lo
      - ip addr add 10.0.1.1/24 dev net1
      - ip addr add 10.0.2.1/24 dev net2
      - touch /etc/frr/vtysh.conf
      - sed -i -e 's/bgpd=no/bgpd=yes/g' /etc/frr/daemons
      - /usr/lib/frr/frrinit.sh start
      - >-
         vtysh -c 'conf t'
         -c 'frr defaults datacenter'
         -c 'router bgp 65010'
         -c ' bgp router-id 10.0.0.1'
         -c ' no bgp ebgp-requires-policy'
         -c ' neighbor ROUTERS peer-group'
         -c ' neighbor ROUTERS remote-as external'
         -c ' neighbor SERVERS peer-group'
         -c ' neighbor SERVERS remote-as internal'
         -c ' neighbor net0 interface peer-group ROUTERS'
         -c ' neighbor 10.0.1.2 peer-group SERVERS'
         -c ' neighbor 10.0.2.2 peer-group SERVERS'
         -c ' address-family ipv4 unicast'
         -c '   redistribute connected'
         -c '  exit-address-family'
         -c '!'
                   
    

    tor1:
      kind: linux
      image: frrouting/frr:v8.2.2
      labels:
        app: frr
      exec:
      - ip link del eth0
      - ip addr add 10.0.0.2/32 dev lo
      - ip addr add 10.0.3.1/24 dev net1
      - ip addr add 10.0.4.1/24 dev net2
      - touch /etc/frr/vtysh.conf
      - sed -i -e 's/bgpd=no/bgpd=yes/g' /etc/frr/daemons
      - /usr/lib/frr/frrinit.sh start
      - >-
         vtysh -c 'conf t'
         -c 'frr defaults datacenter'
         -c 'router bgp 65011'
         -c ' bgp router-id 10.0.0.2'
         -c ' no bgp ebgp-requires-policy'
         -c ' neighbor ROUTERS peer-group'
         -c ' neighbor ROUTERS remote-as external'
         -c ' neighbor SERVERS peer-group'
         -c ' neighbor SERVERS remote-as internal'
         -c ' neighbor net0 interface peer-group ROUTERS'
         -c ' neighbor 10.0.3.2 peer-group SERVERS'
         -c ' neighbor 10.0.4.2 peer-group SERVERS'
         -c ' address-family ipv4 unicast'
         -c '   redistribute connected'
         -c '  exit-address-family'
         -c '!'               
    
    server0:
      kind: linux
      image: nicolaka/netshoot:latest
      network-mode: container:control-plane
      exec:
      - ip addr add 10.0.1.2/24 dev net0
      - ip route replace default via 10.0.1.1

    server1:
      kind: linux
      image: nicolaka/netshoot:latest
      network-mode: container:worker
      exec:
      - ip addr add 10.0.2.2/24 dev net0
      - ip route replace default via 10.0.2.1

    server2:
      kind: linux
      image: nicolaka/netshoot:latest
      network-mode: container:worker2
      exec:
      - ip addr add 10.0.3.2/24 dev net0
      - ip route replace default via 10.0.3.1

    server3:
      kind: linux
      image: nicolaka/netshoot:latest
      network-mode: container:worker3
      exec:
      - ip addr add 10.0.4.2/24 dev net0
      - ip route replace default via 10.0.4.1


  links:
  - endpoints: ["router0:net0", "tor0:net0"]
  - endpoints: ["router0:net1", "tor1:net0"]
  - endpoints: ["tor0:net1", "server0:net0"]
  - endpoints: ["tor0:net2", "server1:net0"]
  - endpoints: ["tor1:net1", "server2:net0"]
  - endpoints: ["tor1:net2", "server3:net0"]

Execute the following command to create the Containerlab experimental environment.

clab deploy -t topo.yaml

clab deploy

The created topology is shown below. Currently, only the BGP sessions between tor0, tor1, and router0 have been established. The BGP sessions between tor0/tor1 and the Kubernetes Nodes are not yet established, because we have not yet applied the BGP configuration for the Kubernetes cluster via a CiliumBGPPeeringPolicy.

Kubernetes

Execute the following commands separately to view the current BGP neighbor establishment status of tor0, tor1, router0.

docker exec -it clab-bgp-cplane-demo-tor0 vtysh -c "show bgp ipv4 summary wide"
docker exec -it clab-bgp-cplane-demo-tor1 vtysh -c "show bgp ipv4 summary wide"
docker exec -it clab-bgp-cplane-demo-router0 vtysh -c "show bgp ipv4 summary wide"

docker exec

Execute the following command to view the BGP routing entries now learned by the router0 device.

docker exec -it clab-bgp-cplane-demo-router0 vtysh -c "show bgp ipv4 wide"

There are currently a total of 8 route entries, and no Pod-related routes have been learned at this point.

Pod

To make it easier for users to visualize the network structure of the experiment, Containerlab provides the graph command to generate the network topology.

clab graph -t topo.yaml 

clab graph

Enter http://<host IP>:50080 in your browser to view the Containerlab-generated topology diagram.

topology diagram

5 Installing Cilium

In this example, we use Helm to install Cilium, setting the configuration parameters we need to adjust in the values.yaml file.

# values.yaml
tunnel: disabled

ipam:
  mode: kubernetes

ipv4NativeRoutingCIDR: 10.0.0.0/8

# Enabling BGP feature support is equivalent to executing --enable-bgp-control-plane=true on the command line
bgpControlPlane:  
  enabled: true

k8s:
  requireIPv4PodCIDR: true

Execute the following commands to install Cilium 1.12 and enable BGP feature support.

helm repo add cilium https://helm.cilium.io/
helm install -n kube-system cilium cilium/cilium --version v1.12.1 -f values.yaml

Once all the Cilium Pods are running, check the Kubernetes Node status again; all Nodes are now in the Ready state.
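
One way to wait for the agent rollout and then recheck the nodes, assuming the DaemonSet name cilium (the Helm chart default):

kubectl -n kube-system rollout status daemonset/cilium
kubectl get nodes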

Kubernetes Node status

6 Configuring BGP on Cilium Nodes

Next, configure a CiliumBGPPeeringPolicy for the Kubernetes Nodes in rack0 and rack1 respectively. rack0 and rack1 correspond to the Node labels set in the Kind configuration file in Section 3.

The Nodes in rack0 establish BGP neighbors with tor0, the Nodes in rack1 establish BGP neighbors with tor1, and both automatically announce their Pod CIDRs to those neighbors.

# cilium-bgp-peering-policies.yaml 
apiVersion: "cilium.io/v2alpha1"
kind: CiliumBGPPeeringPolicy
metadata:
  name: rack0
spec:
  nodeSelector:
    matchLabels:
      rack: rack0
  virtualRouters:
  - localASN: 65010
    exportPodCIDR: true # Automatically declare Pod CIDR
    neighbors:
    - peerAddress: "10.0.0.1/32" # IP address of tor0
      peerASN: 65010
---
apiVersion: "cilium.io/v2alpha1"
kind: CiliumBGPPeeringPolicy
metadata:
  name: rack1
spec:
  nodeSelector:
    matchLabels:
      rack: rack1
  virtualRouters:
  - localASN: 65011
    exportPodCIDR: true
    neighbors:
    - peerAddress: "10.0.0.2/32" # IP address of tor1
      peerASN: 65011

Execute the following command to apply the CiliumBGPPeeringPolicy.

kubectl apply -f cilium-bgp-peering-policies.yaml 

The resulting topology is shown below. tor0 and tor1 have now also established BGP neighbors with the Kubernetes Nodes.

topology

Execute the following commands separately to view the current BGP neighbor establishment status of tor0, tor1, router0.

docker exec -it clab-bgp-cplane-demo-tor0 vtysh -c "show bgp ipv4 summary wide"
docker exec -it clab-bgp-cplane-demo-tor1 vtysh -c "show bgp ipv4 summary wide"
docker exec -it clab-bgp-cplane-demo-router0 vtysh -c "show bgp ipv4 summary wide"

current BGP neighbor establishment status

Execute the following command to view the BGP routing entries now learned by the router0 device.

docker exec -it clab-bgp-cplane-demo-router0 vtysh -c "show bgp ipv4 wide"

There are currently a total of 12 route entries; the 4 additional routes in the 10.1.x.0/24 range are the Pod CIDRs learned from the 4 Kubernetes Nodes.

Kubernetes Nodes
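
To cross-check which /24 was allocated to each Node, the Pod CIDRs can be read straight from the node objects:

kubectl get nodes -o custom-columns=NAME:.metadata.name,PODCIDR:.spec.podCIDR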

7 Verification

Create one test Pod on a node in rack0 and another on a node in rack1 to verify network connectivity.

# nettool.yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    run: nettool-1
  name: nettool-1
spec:
  containers:
  - image: cr7258/nettool:v1
    name: nettool-1
  nodeSelector:
    rack: rack0 
---
apiVersion: v1
kind: Pod
metadata:
  labels:
    run: nettool-2
  name: nettool-2
spec:
  containers:
  - image: cr7258/nettool:v1
    name: nettool-2
  nodeSelector:
    rack: rack1

Execute the following command to create 2 test Pods.

kubectl apply -f nettool.yaml

View the IP address of the Pod.

kubectl get pod -o wide

The nettool-1 Pod is located on clab-bgp-cplane-demo-worker (server1, rack0) with IP address 10.1.2.185; the nettool-2 Pod is located on clab-bgp-cplane-demo-worker3 (server3, rack1) with IP address 10.1.3.56.

pod

Execute the following command to ping the nettool-2 Pod from the nettool-1 Pod.

kubectl exec -it nettool-1 -- ping 10.1.3.56 

You can see that the nettool-1 Pod can access the nettool-2 Pod normally.

nettool-1 Pod can access the nettool-2 Pod normally

Next, use the traceroute command to observe the direction of the network packets.

kubectl exec -it nettool-1 -- traceroute -n 10.1.3.56

kubectl

The packets are sent from the nettool-1 Pod and pass through the following hops in order.

  1. server1’s cilium_host interface: the default route for Pods in the Cilium network points to the local cilium_host. cilium_host and cilium_net are a veth pair, and Cilium uses a hardcoded ARP table entry to force the next hop of Pod traffic onto the host side of that veth pair (a quick way to see this from inside the Pod is sketched after this list).

    server1’s cilium_host interface

  2. tor0’s net2 interface.

  3. router0’s lo0 interface: tor0, tor1, and router0 establish BGP neighbors through their local loopback interface lo0, which makes BGP peering more robust when multiple redundant physical links exist: the failure of a single physical interface does not affect the neighbor relationship.

  4. tor1’s lo0 interface.

  5. server3’s net0 interface.

network
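
As a quick check of the first hop described in step 1, the Pod's routing table and neighbor entries can be inspected from inside the Pod (assuming the cr7258/nettool image ships the iproute2 tools):

kubectl exec -it nettool-1 -- ip route
kubectl exec -it nettool-1 -- ip neigh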

8 Clean up the environment

Execute the following commands to clean up the experimental environment created by Containerlab and Kind.
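
A minimal sketch, using the file and cluster names from earlier in this article (topo.yaml and the clab-bgp-cplane-demo Kind cluster):

clab destroy -t topo.yaml
kind delete cluster --name clab-bgp-cplane-demo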

9 References

  • [0] https://mp.weixin.qq.com/s?__biz=MzkxOTIwMDgxMg==&mid=2247486133&idx=1&sn=ae1dddc04ac194ff34f6eab670e72edb
  • [1] GoBGP: https://osrg.github.io/gobgp/
  • [2] Kind: https://kind.sigs.k8s.io/
  • [3] containerlab: https://containerlab.dev/
  • [4] Lab examples: https://containerlab.dev/lab-examples/lab-examples/
  • [5] 5-stage Clos fabric: https://containerlab.dev/lab-examples/min-5clos/
  • [6] BGP WITH CILIUM: https://nicovibert.com/2022/07/21/bgp-with-cilium/
  • [7] https://www.bilibili.com/video/BV1Qa411d7wm?spm_id_from=333.337.search-card.all.click&vd_source=1c0f4059dae237b29416579c3a5d326e
  • [8] https://www.koenli.com/fcdddb4a.html
  • [9] Cilium BGP Control Plane: https://docs.cilium.io/en/stable/gettingstarted/bgp-control-plane/#cilium-bgp-control-plane
  • [10] Cilium 1.12 - Ingress, Multi-Cluster, Service Mesh, External Workloads, and much more: https://isovalent.com/blog/post/cilium-release-112/#vtep-support
  • [11] Cilium 1.10: WireGuard, BGP Support, Egress IP Gateway, New Cilium CLI, XDP Load Balancer, Alibaba Cloud Integration and more: https://cilium.io/blog/2021/05/20/cilium-110/
  • [12] https://arthurchiao.art/blog/cilium-life-of-a-packet-pod-to-service-zh/