Networking solutions for containers keep multiplying, and it is clearly unreasonable to adapt each new solution to every container runtime; CNI was designed to solve exactly this problem.

While maintaining a “home-level Kubernetes cluster” over the holidays, I got the idea to write a network plugin, and developed Village Net based on cni/plugins.

Taking this network plugin as an example, this article focuses on how to implement a CNI plugin.

How CNI works

To understand how to implement a CNI plugin, you first need to understand how CNI works. CNI stands for Container Network Interface, an interface protocol for configuring container networks. After the container management system creates the network namespace in which a container lives, CNI is responsible for inserting network interfaces into that namespace and configuring the corresponding IP addresses and routes.

CNI is essentially a bridge between the container runtime and the CNI plugins: CNI passes the container’s runtime information and the network configuration to the plugins, and each plugin carries out the actual work, so the CNI plugins are the concrete implementation of the container network. This can be summarized in the following diagram.

k8s CNI

What is a CNI Plugin

We now know that a CNI plugin is a concrete implementation of a container network. In a cluster, each plugin exists as a binary and is invoked by the kubelet through the CNI interface. The exact process is as follows.

CNI Plugin Process

CNI plugins fall into three categories: Main, IPAM and Meta. The Main and IPAM plugins complement each other and do the basic work of creating a network environment for containers.

IPAM Plugin

IPAM (IP Address Management) plugins are responsible for assigning IP addresses. The official plugins include the following.

  • dhcp: A daemon running on the host that makes DHCP requests on behalf of the container.
  • host-local: allocates IPs from a pre-configured address range and records the allocations locally on the host
  • static: assigns a static IP address to the container, mainly useful for debugging

Main Plugin

Main plugins are the binaries that create specific network devices. The official plugins include the following.

  • bridge: Create a bridge on the host and connect it to the container via veth pair.
  • macvlan: virtualizes multiple macvlan interfaces on top of a host interface, each with its own MAC address
  • ipvlan: similar to macvlan, it also virtualizes multiple interfaces on top of a single host interface; the difference is that ipvlan interfaces share the parent’s MAC address and differ only in IP address
  • loopback: lo device (set the loopback interface to up)
  • ptp: veth pair device
  • vlan: assign vlan device
  • host-device: move a device that already exists on the host to the container

Meta Plugin

Meta plugins are maintained by the CNI community and currently consist mainly of:

  • flannel: A plugin specifically for the Flannel project
  • tuning: binary for tuning network device parameters via sysctl
  • portmap: binary for configuring port mapping via iptables
  • bandwidth: binary for limiting traffic using Token Bucket Filter (TBF)
  • firewall: adds iptables or firewalld rules to control traffic into and out of the container

CNI Plugin Implementation

The CNI Plugin repository is located at https://github.com/containernetworking/plugins. Inside you can see the specific implementation of each type of Plugin. Each Plugin needs to implement the following three methods and register them in main.

// cmdCheck verifies that a previously added container is still correctly configured.
func cmdCheck(args *skel.CmdArgs) error {
    ...
}

// cmdAdd sets up networking for a new container (the ADD operation).
func cmdAdd(args *skel.CmdArgs) error {
    ...
}

// cmdDel tears down the container's networking (the DEL operation).
func cmdDel(args *skel.CmdArgs) error {
    ...
}

Taking host-local as an example, the registration looks like this: you specify the three methods implemented above, the supported CNI versions, and the plugin’s name.

func main() {
	skel.PluginMain(cmdAdd, cmdCheck, cmdDel, version.All, bv.BuildString("host-local"))
}
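
To make the shape of a plugin more concrete, here is a minimal sketch of what a plugin's cmdAdd typically does: parse the network configuration delivered on stdin, do its work, and print a result for the caller and the next plugin. The plugin name "my-plugin" and the empty result are illustrative; this is not the host-local implementation.

package main

import (
	"encoding/json"
	"fmt"

	"github.com/containernetworking/cni/pkg/skel"
	"github.com/containernetworking/cni/pkg/types"
	current "github.com/containernetworking/cni/pkg/types/current"
	"github.com/containernetworking/cni/pkg/version"
	bv "github.com/containernetworking/plugins/pkg/utils/buildversion"
)

func cmdAdd(args *skel.CmdArgs) error {
	// The network configuration arrives on stdin.
	conf := types.NetConf{}
	if err := json.Unmarshal(args.StdinData, &conf); err != nil {
		return fmt.Errorf("failed to parse network configuration: %v", err)
	}

	// ... allocate IPs / create devices here ...

	// Print the (here empty) result to stdout so the caller and the next
	// plugin in the chain can consume it.
	result := &current.Result{CNIVersion: conf.CNIVersion}
	return types.PrintResult(result, conf.CNIVersion)
}

func cmdCheck(args *skel.CmdArgs) error { return nil }

func cmdDel(args *skel.CmdArgs) error { return nil }

func main() {
	skel.PluginMain(cmdAdd, cmdCheck, cmdDel, version.All, bv.BuildString("my-plugin"))
}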

What is CNI

After understanding how the plugins work, let’s look at how CNI itself works. The CNI repository is at https://github.com/containernetworking/cni. The code analyzed in this article is based on v0.8.1, the latest version at the time of writing.

The community provides a tool, cnitool, that simulates the CNI interface being called to add or remove network devices from an existing network namespace.

First, let’s look at the implementation logic of cnitool.

func main() {
	...
	netconf, err := libcni.LoadConfList(netdir, os.Args[2])
    ...
	netns := os.Args[3]
	netns, err = filepath.Abs(netns)
    ...
	// Generate the containerid by hashing the netns path
	s := sha512.Sum512([]byte(netns))
	containerID := fmt.Sprintf("cnitool-%x", s[:10])
	cninet := libcni.NewCNIConfig(filepath.SplitList(os.Getenv(EnvCNIPath)), nil)

	rt := &libcni.RuntimeConf{
		ContainerID:    containerID,
		NetNS:          netns,
		IfName:         ifName,
		Args:           cniArgs,
		CapabilityArgs: capabilityArgs,
	}

	switch os.Args[1] {
	case CmdAdd:
		result, err := cninet.AddNetworkList(context.TODO(), netconf, rt)
		if result != nil {
			_ = result.Print()
		}
		exit(err)
	case CmdCheck:
		err := cninet.CheckNetworkList(context.TODO(), netconf, rt)
		exit(err)
	case CmdDel:
		exit(cninet.DelNetworkList(context.TODO(), netconf, rt))
	}
}

From the above code, we can see that the configuration netconf is first parsed from the CNI configuration file, and then the netns, containerID and other information is passed to the cninet.AddNetworkList interface as the container’s runtime information.

Next, look at the implementation of the interface AddNetworkList.

// AddNetworkList executes a sequence of plugins with the ADD command
func (c *CNIConfig) AddNetworkList(ctx context.Context, list *NetworkConfigList, rt *RuntimeConf) (types.Result, error) {
	var err error
	var result types.Result
	for _, net := range list.Plugins {
		result, err = c.addNetwork(ctx, list.Name, list.CNIVersion, net, result, rt)
		if err != nil {
			return nil, err
		}
	}
    ...
	return result, nil
}

The function simply executes the addNetwork operation of each plugin in order, feeding each one the previous plugin’s result. Next, look at the addNetwork function.

func (c *CNIConfig) addNetwork(ctx context.Context, name, cniVersion string, net *NetworkConfig, prevResult types.Result, rt *RuntimeConf) (types.Result, error) {
	c.ensureExec()
	pluginPath, err := c.exec.FindInPath(net.Network.Type, c.Path)
    ...

	newConf, err := buildOneConfig(name, cniVersion, net, prevResult, rt)
    ...
	return invoke.ExecPluginWithResult(ctx, pluginPath, newConf.Bytes, c.args("ADD", rt), c.exec)
}

The addNetwork operation for each plugin is divided into three parts.

  • First, the FindInPath function locates the plugin’s absolute path based on the plugin’s type.
  • Then, the buildOneConfig function extracts the NetworkConfig structure of the current plugin from the NetworkList; its prevResult field carries the result of the previous plugin’s execution.
  • Finally, the ExecPluginWithResult function actually executes the plugin’s ADD operation. newConf.Bytes holds the NetworkConfig structure together with the previous plugin’s result, encoded as a byte stream, while c.args builds an Args instance that mainly carries the container runtime information and the CNI operation being performed (see the sketch below).
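
For reference, c.args roughly fills the following invoke.Args; its AsEnv method turns these fields into the standard CNI environment variables (CNI_COMMAND, CNI_CONTAINERID, CNI_NETNS, CNI_ARGS, CNI_IFNAME, CNI_PATH) handed to the plugin process. The field names come from the cni repository's pkg/invoke; treat this as an approximation, not the exact libcni source.

package main

import (
	"fmt"

	"github.com/containernetworking/cni/pkg/invoke"
	"github.com/containernetworking/cni/pkg/libcni"
)

// argsForAdd sketches the mapping done by c.args: container runtime
// information becomes an invoke.Args, whose AsEnv() produces the CNI_*
// environment variables for the plugin process.
func argsForAdd(rt *libcni.RuntimeConf, pluginDirs string) *invoke.Args {
	return &invoke.Args{
		Command:     "ADD",          // CNI_COMMAND
		ContainerID: rt.ContainerID, // CNI_CONTAINERID
		NetNS:       rt.NetNS,       // CNI_NETNS
		PluginArgs:  rt.Args,        // CNI_ARGS
		IfName:      rt.IfName,      // CNI_IFNAME
		Path:        pluginDirs,     // CNI_PATH (list of plugin directories)
	}
}

func main() {
	rt := &libcni.RuntimeConf{ContainerID: "example", NetNS: "/var/run/netns/example", IfName: "eth0"}
	fmt.Println(argsForAdd(rt, "/opt/cni/bin").AsEnv())
}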

In fact, invoke.ExecPluginWithResult is just a thin wrapper that ends up calling the exec-based RawExec.ExecPlugin.

func (e *RawExec) ExecPlugin(ctx context.Context, pluginPath string, stdinData []byte, environ []string) ([]byte, error) {
	stdout := &bytes.Buffer{}
	stderr := &bytes.Buffer{}
	c := exec.CommandContext(ctx, pluginPath)
	c.Env = environ
	c.Stdin = bytes.NewBuffer(stdinData)
	c.Stdout = stdout
	c.Stderr = stderr

	// Retry the command on "text file busy" errors
	for i := 0; i <= 5; i++ {
		err := c.Run()
        ...
		// All other errors except than the busy text file
		return nil, e.pluginErr(err, stdout.Bytes(), stderr.Bytes())
	}
    ...
}

Here we finally see the core logic of the whole CNI, and it is surprisingly simple: just exec the plugin binary, retrying up to five times when a “text file busy” error occurs.

In short, a CNI plugin is an executable. It gets the network configuration from the configuration file and the container information from the container runtime; the former is passed to each plugin on standard input and the latter via environment variables. CNI invokes the plugins one by one in the order defined in the configuration file, passing the result of each plugin’s execution, together with the configuration, on to the next plugin.
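
As a concrete illustration, the following sketch mimics what libcni ultimately does: exec a plugin binary with the CNI_* environment variables set and the network configuration piped in on stdin. The binary path, namespace path and configuration values here are illustrative, not part of the article's setup.

package main

import (
	"bytes"
	"fmt"
	"os"
	"os/exec"
)

func main() {
	// Per-plugin network configuration, normally assembled by buildOneConfig.
	conf := []byte(`{"cniVersion":"0.2.0","name":"hdls-net","type":"bridge","bridge":"cni0","isGateway":true,"ipam":{"type":"host-local","subnet":"10.22.0.0/16"}}`)

	cmd := exec.Command("/opt/cni/bin/bridge") // path to the plugin binary
	cmd.Env = append(os.Environ(),
		"CNI_COMMAND=ADD",               // the CNI operation
		"CNI_CONTAINERID=example",       // container runtime information
		"CNI_NETNS=/var/run/netns/hdls", // network namespace to configure
		"CNI_IFNAME=eth0",               // interface to create in that namespace
		"CNI_PATH=/opt/cni/bin",         // where the plugin finds its IPAM plugin
	)
	cmd.Stdin = bytes.NewReader(conf) // configuration is delivered on stdin

	out, err := cmd.CombinedOutput()
	fmt.Println(string(out), err)
}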

That said, the mature network plugins we are familiar with (e.g. Calico) usually do not chain plugins this way; only the main plugin is invoked, and it calls the IPAM plugin itself and collects the result on the spot.
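
For example, the plugins in the official repository delegate to their IPAM plugin through the pkg/ipam helper, roughly as in the sketch below (simplified, with error handling and interface configuration trimmed; ipam.ExecAdd and ipam.ExecDel are from github.com/containernetworking/plugins/pkg/ipam).

package ipamdemo

import (
	current "github.com/containernetworking/cni/pkg/types/current"
	"github.com/containernetworking/plugins/pkg/ipam"
)

// allocateWithIPAM shows how a main plugin typically calls its IPAM plugin
// directly: ipam.ExecAdd finds the IPAM binary via CNI_PATH, feeds it the same
// stdin configuration, and returns the allocation it prints on stdout.
func allocateWithIPAM(ipamType string, stdinData []byte) (*current.Result, error) {
	r, err := ipam.ExecAdd(ipamType, stdinData)
	if err != nil {
		return nil, err
	}
	result, err := current.NewResultFromResult(r)
	if err != nil {
		// Roll back the allocation if the result cannot be converted.
		_ = ipam.ExecDel(ipamType, stdinData)
		return nil, err
	}
	return result, nil
}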

How kubelet uses CNI

After understanding how the CNI plugin works, let’s take a look at how kubelet uses the CNI plugin.

When kubelet creates a pod, it calls the CNI plugin to create a network environment for the pod. The source code is as follows. You can see that kubelet calls the plugin.addToNetwork function in the SetUpPod function (pkg/kubelet/dockershim/network/cni/cni.go).

func (plugin *cniNetworkPlugin) SetUpPod(namespace string, name string, id kubecontainer.ContainerID, annotations, options map[string]string) error {
	if err := plugin.checkInitialized(); err != nil {
		return err
	}
	netnsPath, err := plugin.host.GetNetNS(id.ID)
    ...
	if plugin.loNetwork != nil {
		if _, err = plugin.addToNetwork(cniTimeoutCtx, plugin.loNetwork, name, namespace, id, netnsPath, annotations, options); err != nil {
			return err
		}
	}

	_, err = plugin.addToNetwork(cniTimeoutCtx, plugin.getDefaultNetwork(), name, namespace, id, netnsPath, annotations, options)
	return err
}

Let’s take a look at the addToNetwork function. It first builds the pod’s runtime information, then reads the CNI network configuration, i.e. the configuration files in the /etc/cni/net.d directory. After assembling the parameters the plugin needs, it calls CNI’s cniNet.AddNetworkList interface.

The source code is as follows.

func (plugin *cniNetworkPlugin) addToNetwork(ctx context.Context, network *cniNetwork, podName string, podNamespace string, podSandboxID kubecontainer.ContainerID, podNetnsPath string, annotations, options map[string]string) (cnitypes.Result, error) {
	rt, err := plugin.buildCNIRuntimeConf(podName, podNamespace, podSandboxID, podNetnsPath, annotations, options)
    ...

	pdesc := podDesc(podNamespace, podName, podSandboxID)
	netConf, cniNet := network.NetworkConfig, network.CNIConfig
    ...
	res, err := cniNet.AddNetworkList(ctx, netConf, rt)
    ...
	return res, nil
}
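
buildCNIRuntimeConf is worth a glance: it is essentially the kubelet-side counterpart of the RuntimeConf that cnitool builds by hand, with the pod's identity passed to the plugins as CNI_ARGS key/value pairs. The following is a rough sketch of its shape, not the exact kubelet source.

package cnidemo

import "github.com/containernetworking/cni/pkg/libcni"

// buildRuntimeConfSketch approximates what buildCNIRuntimeConf assembles for a
// pod: the pod's identity travels to the plugins as CNI_ARGS key/value pairs.
func buildRuntimeConfSketch(podNamespace, podName, sandboxID, netnsPath string) *libcni.RuntimeConf {
	return &libcni.RuntimeConf{
		ContainerID: sandboxID, // the pod sandbox (pause) container ID
		NetNS:       netnsPath,
		IfName:      "eth0", // the pod's default interface name
		Args: [][2]string{ // surfaced to plugins via CNI_ARGS
			{"IgnoreUnknown", "1"},
			{"K8S_POD_NAMESPACE", podNamespace},
			{"K8S_POD_NAME", podName},
			{"K8S_POD_INFRA_CONTAINER_ID", sandboxID},
		},
	}
}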

Simulate the execution of CNI

After walking through the entire CNI execution flow, let’s simulate it ourselves. We use the cnitool tool, with bridge as the main plugin and host-local as the IPAM plugin, to simulate configuring a container network.

Compile plugins

First, compile the CNI plugins into executables, which can be done by running the build_linux.sh script from the official repository.

$ mkdir -p $GOPATH/src/github.com/containernetworking/plugins
$ git clone https://github.com/containernetworking/plugins.git  $GOPATH/src/github.com/containernetworking/plugins
$ cd $GOPATH/src/github.com/containernetworking/plugins
$ ./build_linux.sh
$ ls
bandwidth  dhcp      flannel      host-local  loopback  portmap  sbr     tuning   village-ipam  vrf
bridge     firewall  host-device  ipvlan      macvlan   ptp      static  village  vlan

Create a network configuration file

Next, create our own network configuration file: choose bridge as the main plugin and host-local as the IPAM plugin, and specify the available IP range.

$ mkdir -p /etc/cni/net.d
$ cat >/etc/cni/net.d/10-hdlsnet.conf <<EOF
{
	"cniVersion": "0.2.0",
	"name": "hdls-net",
	"type": "bridge",
	"bridge": "cni0",
	"isGateway": true,
	"ipMasq": true,
	"ipam": {
		"type": "host-local",
		"subnet": "10.22.0.0/16",
		"routes": [
			{ "dst": "0.0.0.0/0" }
		]
	}
}
EOF
$ cat >/etc/cni/net.d/99-loopback.conf <<EOF
{
	"cniVersion": "0.2.0",
	"name": "lo",
	"type": "loopback"
}
EOF

Create a network namespace

$ ip netns add hdls

Execute cnitool’s add

Finally, point CNI_PATH at the directory containing the plugin binaries compiled above, and run the cnitool tool from the official repository.

$ mkdir -p $GOPATH/src/github.com/containernetworking/cni
$ git clone https://github.com/containernetworking/cni.git  $GOPATH/src/github.com/containernetworking/cni
$ export CNI_PATH=$GOPATH/src/github.com/containernetworking/plugins/bin
$ go run cnitool.go  add hdls-net /var/run/netns/hdls
{
    "cniVersion": "0.2.0",
    "ip4": {
        "ip": "10.22.0.2/16",
        "gateway": "10.22.0.1",
        "routes": [
            {
                "dst": "0.0.0.0/0"
            }
        ]
    },
    "dns": {}
}#

The result shows that the network namespace hdls was assigned the IP 10.22.0.2 from the hdls-net network; in effect, the container we created by hand has the IP 10.22.0.2.

Verification

With the container’s IP in hand, we can verify that it is reachable with ping, and use the nsenter command to enter the container’s namespace, where we find that the container’s default network device, eth0, has been created as well.

$ ping 10.22.0.2
PING 10.22.0.2 (10.22.0.2) 56(84) bytes of data.
64 bytes from 10.22.0.2: icmp_seq=1 ttl=64 time=0.039 ms
64 bytes from 10.22.0.2: icmp_seq=2 ttl=64 time=0.046 ms
64 bytes from 10.22.0.2: icmp_seq=3 ttl=64 time=0.042 ms
64 bytes from 10.22.0.2: icmp_seq=4 ttl=64 time=0.073 ms
^C
--- 10.22.0.2 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3000ms
rtt min/avg/max/mdev = 0.039/0.050/0.073/0.013 ms
$ nsenter --net=/var/run/netns/hdls bash
[root@node-3 ~]# ip l
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
3: eth0@if5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
    link/ether be:6b:0c:93:3a:75 brd ff:ff:ff:ff:ff:ff link-netnsid 0

[root@node-3 ~]#

Finally, let’s check the host’s network devices, and we find that the host side of the veth pair corresponding to the container’s eth0 has been created as well.

$ ip l
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
    link/ether 00:0c:29:9a:04:8d brd ff:ff:ff:ff:ff:ff
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default
    link/ether 02:42:22:86:98:d9 brd ff:ff:ff:ff:ff:ff
4: cni0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 76:32:56:61:e4:f5 brd ff:ff:ff:ff:ff:ff
5: veth3e674876@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master cni0 state UP mode DEFAULT group default
    link/ether 62:b3:06:15:f9:39 brd ff:ff:ff:ff:ff:ff link-netnsid 0

Village Net

The plugin is named Village Net because it implements a layer-2 network based on macvlan. Within a layer-2 network, internal communication is like a small village: everyone communicates basically by shouting (ARP). There is also the connotation of a “village network”: simple, but good enough to use.

Working principle

The reason for choosing macvlan is that a “family Kubernetes cluster” has few nodes but many services, so services could previously only be distinguished by port mapping (NodePort). Since all the machines sit on the same switch and IP addresses are relatively plentiful, macvlan/ipvlan are simple and effective solutions. Considering that a MAC-based DHCP service can be used, and that a pod’s IP can even be pinned to its MAC address, I tried to implement the network plugin with macvlan.

However, macvlan has quite a few problems across network namespaces. For example, when the container has its own net namespace, its traffic bypasses the host’s network stack, which prevents the iptables/ipvs based cluster IP from working properly.

Family Kubernetes cluster working principle

Of course, for the same reason, the host and container networks cannot reach each other when macvlan is used, but this can be solved by creating an additional macvlan bridge device on the host.

To solve the problem of the cluster IP not working properly, the idea of using only macvlan was abandoned in favour of networking with multiple network interfaces.


Each pod has two network interfaces. One is a bridge-attached eth0 that acts as the default gateway, with the relevant routes added on the host to ensure that cross-node communication works. The other is a bridge-mode macvlan interface, which is assigned an IP from the host’s network segment.
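
To make “bridge-mode macvlan” concrete, here is a minimal sketch of creating such a device with the vishvananda/netlink library (which the official plugins also use). The interface names are illustrative and this is not Village Net’s actual code; a real plugin would additionally move the device into the container’s netns and assign it an IP from the host segment.

package main

import (
	"log"

	"github.com/vishvananda/netlink"
)

func main() {
	// The physical host interface the macvlan device hangs off.
	parent, err := netlink.LinkByName("ens33")
	if err != nil {
		log.Fatal(err)
	}

	// Create a macvlan device in bridge mode: devices on the same parent can
	// reach each other directly, and each gets its own MAC address.
	mv := &netlink.Macvlan{
		LinkAttrs: netlink.LinkAttrs{
			Name:        "net1",
			ParentIndex: parent.Attrs().Index,
		},
		Mode: netlink.MACVLAN_MODE_BRIDGE,
	}
	if err := netlink.LinkAdd(mv); err != nil {
		log.Fatal(err)
	}

	// From here the device would be moved into the container's network
	// namespace (e.g. netlink.LinkSetNsFd) and configured with an address.
}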

Workflow

Similar to the CNI workflow described earlier, Village Net is also split into a main plugin and an IPAM plugin.

workflow of CNI

The IPAM plugin’s job is to assign an available IP from each of the two network segments according to the configuration, and the main plugin creates the bridge, veth and macvlan devices based on the two assigned IPs and configures them.

Finally

The implementation of Village Net is still fairly simple and even requires some manual steps, such as the routing setup for the bridge. But the functionality basically meets expectations, and the pitfalls of CNI have been thoroughly sorted out. CNI itself is not complicated, but there are many details that were not considered at the outset, and some of them were in the end simply bypassed with workarounds. If I have time and energy to put into the network plugin later, I will consider how to optimize it.