I’ve been using k8s for a while now, and I’ve run into iptables and other issues that break the network between k8s nodes, so I wanted to look into how the k8s network actually works.

Docker Networking

Let’s start by looking at how Docker networking is implemented. Docker first creates a bridge called docker0.

$ ip a show docker0
3: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:c4:87:73:bf brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:c4ff:fe87:73bf/64 scope link
       valid_lft forever preferred_lft forever

By default, each container gets its own netns. Docker then creates a veth pair, leaving one end in the host netns and placing the other end inside the container’s netns. The host-side veth is attached to docker0.

$ ip a show dev veth3db9316
21: veth3db9316@if20: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP group default
    link/ether e2:49:a6:2d:5a:bd brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::e049:a6ff:fe2d:5abd/64 scope link
       valid_lft forever preferred_lft forever
$ brctl show docker0
bridge name     bridge id               STP enabled     interfaces
docker0         8000.0242c48773bf       no              veth3db9316

For the network inside the container, Docker assigns and configures an address on the veth (e.g. 172.17.0.2) and sets the default route via 172.17.0.1. On the one hand, the container can reach the outside network through this default route and then iptables NAT.

$ iptables-save -t nat
# Generated by xtables-save v1.8.2 on Sat Sep 18 10:44:49 2021
*nat
:PREROUTING ACCEPT [0:0]
:INPUT ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:DOCKER - [0:0]
-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER
-A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE
-A OUTPUT ! -d 127.0.0.0/8 -m addrtype --dst-type LOCAL -j DOCKER
-A DOCKER -i docker0 -j RETURN
COMMIT
# Completed on Sat Sep 18 10:44:49 2021
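
The effect of the POSTROUTING MASQUERADE rule above can be sketched as a small decision function (a minimal Python model of the rule’s match logic, not Docker’s actual code): container traffic leaving through any interface other than docker0 gets its source address rewritten to the host’s.

```python
import ipaddress

# Docker's default bridge subnet, matching `-s 172.17.0.0/16` in the rule.
DOCKER_SUBNET = ipaddress.ip_network("172.17.0.0/16")

def should_masquerade(src_ip: str, out_iface: str) -> bool:
    """Mirror of `-A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE`:
    SNAT applies only to container sources leaving via a non-docker0 interface."""
    return (ipaddress.ip_address(src_ip) in DOCKER_SUBNET
            and out_iface != "docker0")

# A container reaching the internet via eth0: source gets rewritten.
print(should_masquerade("172.17.0.2", "eth0"))     # True
# Container-to-container traffic stays on docker0: no rewrite needed.
print(should_masquerade("172.17.0.2", "docker0"))  # False
```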

On the other hand, since the veths of the different containers are attached to the same bridge, the containers effectively sit on the same layer 2 network and can naturally reach each other.

K8s Network

In k8s, all pods are expected to be reachable from each other by IP address. One idea is to wire up the pods on each node in a Docker-like way, i.e. each netns connects to a bridge via a veth pair, and then find a way to route traffic to the pods on the other nodes.

Since I built my k8s cluster with k3s, it uses flannel as the CNI plugin, and flannel uses vxlan to implement network communication between nodes.

First, let’s see how the pods in the node are networked.

5: cni0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
    link/ether 6a:4f:ff:8b:b1:b3 brd ff:ff:ff:ff:ff:ff
    inet 10.42.0.1/24 brd 10.42.0.255 scope global cni0
       valid_lft forever preferred_lft forever
    inet6 fe80::7cf6:57ff:fed7:c49b/64 scope link
       valid_lft forever preferred_lft forever
6: vethc47d6140@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default
    link/ether da:19:f8:48:f6:49 brd ff:ff:ff:ff:ff:ff link-netns cni-9d2a5120-16a3-453e-bf64-c4006c06c93b
    inet6 fe80::d819:f8ff:fe48:f649/64 scope link
       valid_lft forever preferred_lft forever

First, flannel assigns each node a /24 segment: the first node gets 10.42.0.0/24, the second 10.42.1.0/24, and so on. Pods on a node are then assigned addresses from its segment, e.g. 10.42.0.50/24, with 10.42.0.1 as the default gateway. The veths are attached to the cni0 bridge. This part works on the same principle as Docker, just with different names. There are also corresponding iptables rules.

$ iptables-save | grep MASQUERADE
-A POSTROUTING -s 10.42.0.0/16 ! -d 224.0.0.0/4 -j MASQUERADE --random-fully
-A POSTROUTING ! -s 10.42.0.0/16 -d 10.42.0.0/16 -j MASQUERADE --random-fully

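The per-node /24 carving described above can be sketched with Python’s stdlib ipaddress module (an illustration of the addressing scheme, not flannel’s actual IPAM code; the 10.42.0.0/16 cluster CIDR is the k3s default seen in the outputs above):

```python
import ipaddress

# flannel's cluster CIDR; each node is handed one /24 out of it.
cluster = ipaddress.ip_network("10.42.0.0/16")
node_subnets = list(cluster.subnets(new_prefix=24))

print(node_subnets[0])  # 10.42.0.0/24 -- the first node's segment
print(node_subnets[1])  # 10.42.1.0/24 -- the second node's segment
# The cni0 bridge takes the first usable address of the node's segment,
# which becomes the pods' default gateway:
print(next(node_subnets[0].hosts()))  # 10.42.0.1
```
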
So how is the inter-node network implemented? Suppose pod 10.42.0.50/24 on the first node wants to reach pod 10.42.1.51/24 on the second node. The packet first follows the pod’s default route towards 10.42.0.1 and arrives at cni0 on the first node, where the host’s routing table is consulted.

$ ip r
10.42.0.0/24 dev cni0 proto kernel scope link src 10.42.0.1
10.42.1.0/24 via 10.42.1.0 dev flannel.1 onlink

As you can see, it will match the route 10.42.1.0/24 via 10.42.1.0 dev flannel.1. flannel.1 is a vxlan interface.
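
The kernel picks this route by longest-prefix match. A minimal Python sketch of that lookup (simplified: it ignores metrics and has no default route):

```python
import ipaddress

# Simplified routing table of the first node, from the `ip r` output above.
ROUTES = [
    (ipaddress.ip_network("10.42.0.0/24"), "cni0"),
    (ipaddress.ip_network("10.42.1.0/24"), "flannel.1"),
]

def lookup(dst: str) -> str:
    """Return the output device for dst via longest-prefix match."""
    dst_ip = ipaddress.ip_address(dst)
    matches = [(net, dev) for net, dev in ROUTES if dst_ip in net]
    return max(matches, key=lambda m: m[0].prefixlen)[1]

print(lookup("10.42.1.51"))  # traffic for the second node exits via flannel.1
print(lookup("10.42.0.50"))  # local pods are reached directly via cni0
```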

$ ip a show flannel.1
4: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default
    link/ether b6:2f:39:4a:02:c0 brd ff:ff:ff:ff:ff:ff
    inet 10.42.0.0/32 scope global flannel.1
       valid_lft forever preferred_lft forever
    inet6 fe80::b42f:39ff:fe4a:2c0/64 scope link
       valid_lft forever preferred_lft forever

When this interface receives a packet, it consults the FDB (forwarding database).

$ bridge fdb show brport flannel.1
...

The FDB stores (MAC address, remote node IP) tuples. When flannel.1 receives an Ethernet frame, if the destination MAC matches an entry, it encapsulates the frame in UDP and sends it to the corresponding node IP; otherwise, it floods the frame to every IP in the table. Either way, the second node receives the packet, decapsulates it, and forwards it to the actual pod.
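
The encapsulation step can be illustrated by packing the VXLAN header by hand (a sketch following the RFC 7348 header layout, not the kernel’s code; flannel.1 uses VNI 1, and the Linux vxlan driver defaults to UDP port 8472):

```python
import struct

def vxlan_encap(vni: int, inner_frame: bytes) -> bytes:
    """Prepend the 8-byte VXLAN header to an inner Ethernet frame.
    The result is what travels as the UDP payload between the nodes."""
    flags = 0x08 << 24                        # "I" bit set: VNI field is valid
    header = struct.pack("!II", flags, vni << 8)  # VNI sits in bits 8..31
    return header + inner_frame

# Wrap a dummy 14-byte Ethernet header with VNI 1 (as flannel.1 would):
payload = vxlan_encap(1, b"\x00" * 14)
print(payload[:8].hex())  # 0800000000000100
```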

Summary

To summarize how k8s achieves pod interconnection: within each node, the pods are attached to a bridge; each node is assigned its own subnet; and traffic bound for another node’s subnet is routed to the flannel.1 interface, with vxlan carrying it between the flannel.1 endpoints of the nodes.