As containers gradually replace virtual machines as the standard unit of cloud infrastructure, container networking is built almost entirely on Linux virtual network devices. Understanding the common Linux virtual network devices therefore goes a long way toward understanding container networking and the network architecture underlying other container-based systems. In this note, let's look at the common Linux virtual network devices and their typical usage scenarios.

Virtual Network Devices

As we know from the previous note, network device drivers do not interact with the kernel protocol stack directly; instead, they go through the kernel's network device management module as an intermediary. The advantage of this is that a driver does not need to know the details of the protocol stack, and the stack does not need to handle packets in a driver-specific way.

For the kernel network device management module, there is no difference between virtual and physical devices: both are network devices and both can be configured with IP addresses. Logically, both virtual and physical network devices behave like pipes, where data received on one end is sent out the other end. For example, the two ends of a physical network card are the protocol stack and the outside physical network: packets received from the physical network are handed to the protocol stack, and conversely, packets sent by an application through the protocol stack leave through the physical network card onto the physical network. However, where and how packets are sent differs between network devices and is determined entirely by each device's driver; it has nothing to do with the kernel device management module or the protocol stack.

In summary, a virtual network device is no different from a physical one: the kernel protocol stack is attached to one end, and the behavior of the other end depends on the device's driver implementation.

TUN/TAP

A TUN/TAP virtual network device has the protocol stack connected to one end; the other end is connected not to a physical network but to an application in user space. Packets that the stack sends to the TUN/TAP device can be read by the application, and the application can likewise write packets directly to the TUN/TAP device.

A typical example of using a TUN/TAP network device is shown in the following figure.

Example of using a TUN/TAP network device

In the figure above, the physical NIC eth0 is configured with IP 18.12.0.92, and tun0 is a TUN/TAP device with IP 10.0.0.12. The packet flow is as follows.

  1. Application A sends a packet through socket A. Suppose the destination IP address of the packet is 10.0.0.22.

  2. Socket A hands the packet to the network protocol stack.

  3. The protocol stack sends the packet to the tun0 device based on the local routing rules and the packet's destination IP (see the example route after this list).

  4. After receiving the packet, tun0 forwards the packet to application B in user space.

  5. Application B receives the packet and constructs a new packet, embedding the original packet inside it (an IP-in-IP packet), and finally sends the new packet out through socket B.

    Note: The source IP of the new (outer) packet is the address the stack chooses when sending through socket B, here eth0's 18.12.0.92, and its destination IP is the remote endpoint 18.13.0.91.

  6. Socket B sends the packet to the protocol stack.

  7. Based on the local routing rules and the destination IP of the packet, the protocol stack decides that the packet should be sent out through device eth0, and forwards the packet to device eth0.

  8. Device eth0 sends the packet out over the physical network.
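
The route mentioned in step 3 is typically just the connected route created when 10.0.0.12/24 was assigned to tun0, although a route could also be added explicitly. A quick way to check which device the stack would pick for the destination in step 1 (a sketch with trimmed output; tun0 must exist, be up, and have an application attached):

# The connected route created by assigning 10.0.0.12/24 to tun0
# ip route show dev tun0
10.0.0.0/24 proto kernel scope link src 10.0.0.12
# Ask the stack which device it would use for 10.0.0.22
# ip route get 10.0.0.22
10.0.0.22 dev tun0 src 10.0.0.12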

We see that the packet destined for 10.0.0.22 is wrapped by application B in user space and sent from 18.12.0.92 across the physical network to 18.13.0.91; when the outer packet arrives at 18.13.0.91, the remote end extracts the original packet inside and delivers it locally to 10.0.0.22. This is the basic principle behind VPN implementations.

Using TUN/TAP devices, we get the chance to divert some of the packets in the protocol stack to an application in user space and let that application process them. Common use cases include data compression, encryption, and similar functions.

Note: The difference between a TUN device and a TAP device is that a TUN device is a virtual point-to-point IP-layer device, meaning that user-space applications can only read and write IP packets (layer 3) through it, while a TAP device is a virtual link-layer device through which applications can read and write link-layer frames (layer 2). If you create a TUN/TAP device with the Linux networking toolkit iproute2, you specify mode tun or mode tap to distinguish between them.
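
For example, with iproute2 a persistent TUN or TAP device can be created and removed like this (the names tun0 and tap0 are arbitrary; a process still has to open the device before it can pass traffic):

# Create a layer-3 TUN device and a layer-2 TAP device
# ip tuntap add dev tun0 mode tun
# ip tuntap add dev tap0 mode tap
# List existing TUN/TAP devices
# ip tuntap show
# Remove them again
# ip tuntap del dev tun0 mode tun
# ip tuntap del dev tap0 mode tap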

veth

A veth virtual network device has the protocol stack connected to one end and, instead of a physical network, another veth device connected to the other end. A packet sent out of one veth device goes directly to its peer. Each veth device can be configured with an IP address and participate in layer-3 IP routing.

The following is a typical example of using a veth device pair.

Example of using a veth device pair

The physical NIC eth0 is configured with IP 12.124.10.11, and the veth pair consists of veth0 and veth1, with IPs 20.1.0.10 and 20.1.0.11 respectively.

# ip link add veth0 type veth peer name veth1
# ip addr add 20.1.0.10/24 dev veth0
# ip addr add 20.1.0.11/24 dev veth1
# ip link set veth0 up
# ip link set veth1 up

Then try pinging the other device, veth1, from veth0.

# ping -c 2 20.1.0.11 -I veth0
PING 20.1.0.11 (20.1.0.11) from 20.1.0.10 veth0: 56(84) bytes of data.
64 bytes from 20.1.0.11: icmp_seq=1 ttl=64 time=0.034 ms
64 bytes from 20.1.0.11: icmp_seq=2 ttl=64 time=0.052 ms

--- 20.1.0.11 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1500ms

Note: On some Ubuntu versions, the ping may fail because the default kernel network settings prevent the veth devices from answering ARP requests coming from local addresses; the fix is to relax the accept_local and rp_filter settings so that the ARP exchange can complete.

# echo 1 > /proc/sys/net/ipv4/conf/veth1/accept_local
# echo 1 > /proc/sys/net/ipv4/conf/veth0/accept_local
# echo 0 > /proc/sys/net/ipv4/conf/veth0/rp_filter
# echo 0 > /proc/sys/net/ipv4/conf/veth1/rp_filter
# echo 0 > /proc/sys/net/ipv4/conf/all/rp_filter

You can try using the tcpdump command to see the request packets on the veth device pair.

# tcpdump -n -i veth1
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on veth1, link-type EN10MB (Ethernet), capture size 458122 bytes
20:24:12.220002 ARP, Request who-has 20.1.0.11 tell 20.1.0.10, length 28
20:24:12.220198 ARP, Request who-has 20.1.0.11 tell 20.1.0.10, length 28
20:24:12.221372 IP 20.1.0.10 > 20.1.0.11: ICMP echo request, id 18174, seq 1, length 64
20:24:13.222089 IP 20.1.0.10 > 20.1.0.11: ICMP echo request, id 18174, seq 2, length 64

You can see that there are only ICMP echo request packets on veth1, and no reply packets. Think about why: veth1 receives the ICMP echo request and hands it to the protocol stack, but when the stack builds the reply it finds that the destination 20.1.0.10 is one of the host's own local addresses, so it sends the ICMP echo reply out through the lo device instead of back through veth1.

The lo device receives the packet and forwards it directly to the protocol stack and then to the ping process in user space.
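
The reason for this path is visible in the kernel's routing tables: both veth addresses are local addresses of this host, so traffic addressed to them is delivered internally via lo instead of leaving the device. A quick check (output trimmed):

# ip route get 20.1.0.10
local 20.1.0.10 dev lo src 20.1.0.10
# ip route show table local | grep 20.1.0
local 20.1.0.10 dev veth0 proto kernel scope host src 20.1.0.10
local 20.1.0.11 dev veth1 proto kernel scope host src 20.1.0.11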

We can try to use tcpdump to grab the data on the lo device.

# tcpdump -n -i lo
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on lo, link-type EN10MB (Ethernet), capture size 458122 bytes
20:25:49.486019 IP 20.1.0.11 > 20.1.0.10: ICMP echo reply, id 24177, seq 1, length 64
20:25:50.486122 IP 20.1.0.11 > 20.1.0.10: ICMP echo reply, id 24177, seq 2, length 64

It can be seen that for a veth pair, a packet going out of one device is sent directly to the other device. In practice, for example in container networks, the two ends of a veth pair sit in different network namespaces, so packets are forwarded between namespaces; this will be covered later in the notes on container networks.
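
As a small preview of that use case, here is a minimal sketch of putting one end of a veth pair into its own network namespace (starting from a clean state; the names ns1, veth0, and veth1 are just examples). The two namespaces then talk to each other through the veth pair, and the rp_filter workaround above is not needed because the two ends sit in separate protocol stack instances:

# ip netns add ns1
# ip link add veth0 type veth peer name veth1
# Move one end of the pair into the new namespace
# ip link set veth1 netns ns1
# Configure the host end
# ip addr add 20.1.0.10/24 dev veth0
# ip link set veth0 up
# Configure the end inside the namespace
# ip netns exec ns1 ip addr add 20.1.0.11/24 dev veth1
# ip netns exec ns1 ip link set veth1 up
# The host can now reach the namespace through the veth pair
# ping -c 1 20.1.0.11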

bridge

A bridge is a virtual network device, so it has the characteristics of a virtual network device and can be configured with an IP and a MAC address. Unlike other virtual network devices, however, a bridge is a virtual switch and works much like a physical switch: the protocol stack is attached to one end, multiple ports sit on the other end, and data is forwarded between the ports based on MAC addresses.

A bridge can work at either layer 2 (the link layer) or layer 3 (the IP layer). By default it works at layer 2, forwarding Ethernet frames between hosts on the same subnet; once an IP address is assigned to the bridge, it can also take part in layer-3 routing. Under Linux, a bridge can be managed with the iproute2 tools or with brctl.

Creating a bridge is similar to creating other virtual network devices, except that you need to specify the type parameter as bridge.

# ip link add name br0 type bridge
# ip link set br0 up

bridge
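
For reference, the same bridge could also be created and inspected with the legacy brctl tool from the bridge-utils package (a rough equivalent of the commands above; the rest of this note sticks to iproute2):

# brctl addbr br0
# ip link set br0 up
# A port would later be attached with "brctl addif br0 veth0"
# brctl show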

Either way, this creates a bridge with the protocol stack connected to one end and nothing attached to its ports, so we need to connect other devices to the bridge before it does anything useful.

# ip link add veth0 type veth peer name veth1
# ip addr add 20.1.0.10/24 dev veth0
# ip addr add 20.1.0.11/24 dev veth1
# ip link set veth0 up
# ip link set veth1 up
# Connect veth0 to br0
# ip link set dev veth0 master br0
# The bridge link command allows you to see which devices are connected to the bridge
# bridge link
6: veth0 state UP : <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br0 state forwarding priority 32 cost 2

bridge
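
Before looking at what this attachment changes, note in passing that a port can be detached again with nomaster (shown only for reference; veth0 stays attached in what follows):

# ip link set dev veth0 nomaster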

In fact, once veth0 is attached to br0, the two form a bidirectional channel, while the channel between the kernel protocol stack and veth0 becomes unidirectional: the stack can still send data to veth0, but the data veth0 receives from outside is handed to br0 and is no longer forwarded to the stack. Also, br0 takes over veth0's MAC address. We can verify this.

# ping -c 1 -I veth0 20.1.0.11
PING 20.1.0.11 (20.1.0.11) from 20.1.0.10 veth0: 56(84) bytes of data.
From 20.1.0.10 icmp_seq=1 Destination Host Unreachable

--- 20.1.0.11 ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms
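
The other claim, that br0 takes over veth0's MAC address, can be checked directly: the link/ether line of both devices should be identical (the MAC shown here is the one from the capture below and will differ on your machine):

# ip link show br0 | grep ether
    link/ether a2:85:26:b3:72:6c brd ff:ff:ff:ff:ff:ff
# ip link show veth0 | grep ether
    link/ether a2:85:26:b3:72:6c brd ff:ff:ff:ff:ff:ff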

If we use tcpdump to capture packets on br0, we will see the following:

# tcpdump -n -i br0
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on br0, link-type EN10MB (Ethernet), capture size 262144 bytes
21:45:48.225459 ARP, Reply 20.1.0.10 is-at a2:85:26:b3:72:6c, length 28

You can see that veth0 receives the reply packet and hands it directly to br0 instead of giving it to the protocol stack, so the stack never learns the MAC address of veth1 and the ping fails: br0 has intercepted the packet between veth0 and the protocol stack. But what happens if we configure an IP address on br0?

1
2
# ip addr del 20.1.0.10/24 dev veth0
# ip addr add 20.1.0.10/24 dev br0

Thus, the network structure becomes the following.

network structure

At this point, you can ping veth1 via br0 and find that it works.

# ping -c 1 -I br0 20.1.0.11
PING 20.1.0.11 (20.1.0.11) from 20.1.0.10 br0: 56(84) bytes of data.
64 bytes from 20.1.0.11: icmp_seq=1 ttl=64 time=0.121 ms

--- 20.1.0.11 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.121/0.121/0.121/0.000 ms

In fact, once veth0's IP is removed and assigned to br0, the protocol stack no longer routes packets to veth0. Put more intuitively, the link between the protocol stack and veth0 has been removed, and veth0 is now effectively just a piece of network cable.
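
The change is visible in the routing table: the connected route for 20.1.0.0/24 now points at br0 instead of veth0, so the stack hands outgoing packets for that subnet to the bridge (output trimmed):

# ip route show
20.1.0.0/24 dev br0 proto kernel scope link src 20.1.0.10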

In practice, bridges are commonly used in the following scenarios.

Virtual Machine

A typical virtual machine network connects the NIC inside the virtual machine to the host's br0 through a TAP device. Here br0 plays the role of a physical switch: packets sent by the virtual machine first reach br0, and br0 then hands them to eth0 to be sent out, so the packets do not need to pass through the host's protocol stack, which is very efficient.

Typical Virtual Machine Network
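
A rough sketch of that setup with iproute2 (tap0 is a hypothetical TAP device that the hypervisor, e.g. QEMU/KVM, would be configured to use as the VM's NIC backend, and eth0 stands for the host's physical uplink; note that enslaving eth0 to the bridge means the host's own IP configuration has to move to br0):

# Create a TAP device for the VM's virtual NIC and attach it to the bridge
# ip tuntap add dev tap0 mode tap
# ip link set tap0 up
# ip link set dev tap0 master br0
# Attach the physical uplink to the same bridge so VM traffic can reach the outside network
# ip link set dev eth0 master br0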

Container

As for container networks, each container's network devices live in a separate network namespace, which isolates the protocol stacks of different containers from one another. We will discuss the various container network implementations in the next notes.
