k8s

Background

When upgrading cilium from v1.8.1 to v1.11.1, a business pod reported a MySQL authorization error. On inspection we found that the client IP seen by the MySQL server was the nodeIP of the node hosting the business pod, not the expected podIP. Since the MySQL server only authorizes the pod CIDR of the current K8s cluster, the connection failed with an authorization error.

The contradiction: with cilium v1.8.1 the source IP leaving the machine was still the podIP, but with v1.11.1 it was the nodeIP, and further testing showed v1.8.2 also uses the nodeIP. Our K8s network runs Cilium + BGP, and podIPs are reachable on the company intranet, so we want traffic from a business pod to leave the node with its podIP. We suspected cilium was doing SNAT, translating the podIP into the nodeIP.

Reason

The reason is that cilium masquerades podIPs by default; see the official v1.8 documentation: masquerading

The cilium configuration we deployed also set masquerade: true; in fact cilium defaults this value to true:

masquerade: 'true'
enable-bpf-masquerade: 'true'
native-routing-cidr: 10.20.30.0/24

When upgrading to cilium v1.11.1 we kept the configuration above, but in the new version the old option masquerade: true has been deprecated in favor of enable-ipv4-masquerade: true, and cilium enables podIP masquerading by default; see the code: daemon_main.go#L679-L680

So when upgrading to cilium v1.11.1, the configuration needs to be changed as follows to solve the problem:

enable-ipv4-masquerade: 'false'
enable-bpf-masquerade: 'false'
ipv4-native-routing-cidr: 10.20.30.0/24 # the new version deprecates native-routing-cidr; by default it is the cluster pod CIDR, the same value as cluster-pool-ipv4-cidr

Why did cilium v1.8.1 not show this problem? Although we configured masquerade: true in v1.8.1, that version had a bug that kept the option from taking effect, so podIPs never went through the corresponding eBPF SNAT rules. The bug was fixed in v1.8.2, which is why podIP Masq is effectively enabled by default from v1.8.2 onward, even though that is not what we want. The bug-fix code is at: pull/12456

If a pod accesses a target IP inside the ipv4-native-routing-cidr segment, the traffic does not go through podIP Masq either: the eBPF C code checks whether the destination falls within that segment and, if so, skips the masq logic. This way pods reach each other with their podIPs, and masquerading only applies to traffic leaving the cluster network, as sketched below.
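
To illustrate just that check, here is a minimal plain-C sketch, assuming the 10.20.30.0/24 native-routing CIDR from the configuration above; this is not cilium's actual eBPF code:

#include <stdbool.h>
#include <stdint.h>

/* Illustrative constants: 10.20.30.0/24 in host byte order,
 * mirroring the ipv4-native-routing-cidr value used above. */
#define NATIVE_ROUTING_NET   0x0a141e00u  /* 10.20.30.0 */
#define NATIVE_ROUTING_MASK  0xffffff00u  /* /24 */

/* Return true when the destination stays inside the native-routing
 * CIDR, i.e. pod-to-pod traffic whose source podIP must be kept. */
static bool skip_masq(uint32_t daddr)
{
    return (daddr & NATIVE_ROUTING_MASK) == NATIVE_ROUTING_NET;
}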

Otherwise, the packet leaves the container and goes out through the node's eth0 NIC, where the attached eBPF SNAT logic translates the podIP into the nodeIP.
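
Conceptually the rewrite is "swap the source address and patch the checksum". Below is a minimal plain-C sketch of that idea with hypothetical helper names; the real eBPF code also fixes the L4 checksum and records the mapping in a NAT table so replies can be reverse-translated:

#include <stdint.h>

/* Fold a 32-bit sum into a 16-bit one's-complement checksum. */
static uint16_t csum_fold(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* Incrementally update a checksum after a 32-bit field changes
 * from 'from' to 'to' (RFC 1624 style, no full recomputation). */
static uint16_t csum_update32(uint16_t csum, uint32_t from, uint32_t to)
{
    uint32_t sum = (uint16_t)~csum;
    sum += (uint16_t)~(from >> 16) + (uint16_t)~(from & 0xffff);
    sum += (to >> 16) + (to & 0xffff);
    return (uint16_t)~csum_fold(sum);
}

/* SNAT: replace the pod source IP with the node IP and patch the
 * IPv4 header checksum in place. */
static void snat_src(uint32_t *saddr, uint16_t *ip_check, uint32_t node_ip)
{
    *ip_check = csum_update32(*ip_check, *saddr, node_ip);
    *saddr = node_ip;
}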

The main difficulty is how to do SNAT in eBPF; this part of cilium's code is worth studying and is one of cilium's core pieces of logic.

Finally, cilium attaches the eBPF C program to the egress side of the eth0 NIC (eth0 by default; the egress NIC can be configured in the cilium daemon). You can see the to-netdev eBPF program on a K8s node by executing the following command; to-netdev can be understood as the name of this eBPF C program:

tc filter show dev eth0 egress
#filter protocol all pref 1 bpf chain 0
#filter protocol all pref 1 bpf chain 0 handle 0x1 bpf_netdev_eth0.o:[to-netdev] direct-action not_in_hw tag aed7375159f1f3a4

The to-netdev eBPF C program is the logic a packet goes through when it exits the egress side of the eth0 NIC.
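
For orientation, a minimal tc egress program skeleton looks roughly like this (a hypothetical sketch using libbpf conventions; cilium's real to-netdev in bpf_host.c is far more involved):

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

/* Hypothetical skeleton in the spirit of to-netdev; header parsing,
 * the native-routing-cidr check and the SNAT itself are omitted. */
SEC("tc")
int to_netdev_sketch(struct __sk_buff *skb)
{
    /* Real code would parse the packet here and apply SNAT when the
     * destination lies outside the native-routing CIDR. */
    return TC_ACT_OK; /* let the packet continue out of eth0 */
}

char _license[] SEC("license") = "GPL";

Compiled with clang -target bpf, a program like this is attached with tc filter add ... egress bpf da obj ..., which is why the tc filter show output above reports direct-action.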

Similarly, the from-netdev eBPF C program is the logic a packet goes through when it enters the ingress side of the eth0 NIC: bpf_host.c#L962-L987 . It mainly matters for the host firewall and BPF NodePort; in our podIP Masq case, for example, BPF NodePort is turned on.

iptables SNAT masquerading

Besides implementing SNAT masq with eBPF, cilium can also implement it by installing iptables rules; see the code: iptables.go#L1097-L1137

You can change the cilium configuration as follows and then view the rules with the command iptables -t nat -S CILIUM_POST_nat:

enable-ipv4-masquerade: 'true'
enable-bpf-masquerade: 'false'
ipv4-native-routing-cidr: 10.20.30.0/24 # the new version deprecates native-routing-cidr; by default it is the cluster pod CIDR, the same value as cluster-pool-ipv4-cidr

The installed iptables rules look like the following:

iptables -t nat -S POSTROUTING
#-P POSTROUTING ACCEPT
#-A POSTROUTING -m comment --comment "cilium-feeder: CILIUM_POST_nat" -j CILIUM_POST_nat
#-A POSTROUTING -m comment --comment "kubernetes postrouting rules" -j KUBE-POSTROUTING
#-A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE


iptables -t nat -S CILIUM_POST_nat
#-N CILIUM_POST_nat
#-A CILIUM_POST_nat -s 20.30.137.0/25 -m set --match-set cilium_node_set_v4 dst -m comment --comment "exclude traffic to cluster nodes from masquerade" -j ACCEPT
#-A CILIUM_POST_nat -s 20.30.137.0/25 ! -d 10.216.136.0/21 ! -o cilium_+ -m comment --comment "cilium masquerade non-cluster" -j MASQUERADE
#-A CILIUM_POST_nat -m mark --mark 0xa00/0xe00 -m comment --comment "exclude proxy return traffic from masquerade" -j ACCEPT
#-A CILIUM_POST_nat -s 127.0.0.1/32 -o cilium_host -m comment --comment "cilium host->cluster from 127.0.0.1 masquerade" -j SNAT --to-source 20.30.137.116
#-A CILIUM_POST_nat -o cilium_host -m mark --mark 0xf00/0xf00 -m conntrack --ctstate DNAT -m comment --comment "hairpin traffic that originated from a local pod" -j SNAT --to-source 20.30.137.116

When a packet traverses the netfilter POSTROUTING chain, it first jumps to the CILIUM_POST_nat chain and walks through all of its rules, then jumps to the rules of the KUBE-POSTROUTING chain.

The CILIUM_POST_nat chain contains the rules shown above; the podIP Masq rule is mainly this one. With iptables it is very simple to SNAT the podIP into the nodeIP:

-A CILIUM_POST_nat -s 20.30.137.0/25 ! -d 10.216.136.0/21 ! -o cilium_+ -m comment --comment "cilium masquerade non-cluster" -j MASQUERADE

That said, eBPF is still the preferred way to implement podIP Masq: it bypasses netfilter entirely, so packets do not have to traverse the netfilter hooks, giving higher performance than iptables. The trade-off is that the eBPF implementation is more complex.

Comparison with calico

calico can also masquerade podIP to nodeIP; see Configure outgoing NAT. It is configured via the natOutgoing parameter:

apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: default-ipv4-ippool
spec:
  cidr: 192.168.0.0/16
  natOutgoing: true

We run a small production K8s cluster whose container network plugin is calico, configured with natOutgoing: false to turn the feature off. By default calico presumably implements this SNAT by installing iptables rules.

Summary

cilium enables podIP Masq by default, so traffic from a pod to addresses outside the cluster network is SNATed from the podIP to the nodeIP. This is especially useful when podIPs are not reachable on the private network and pods need to access resources outside the cluster. However, since we use cilium + BGP and podIPs are reachable on the company intranet, we do not need this feature and have to configure it off.

An additional pitfall: our configuration never actually turned this feature off, so it had been wrong all along; it only appeared to work because cilium v1.8.1's own bug prevented podIP Masq from taking effect.

To be researched

The podIP Masq eBPF logic shares code with the NodePort service implementation; it would be worth investigating how cilium implements NodePort services.