In the previous note on network devices, we covered the typical working principle of a VPN built on TUN/TAP devices, but did not get hands-on with how TUN/TAP virtual network devices actually work in Linux. In this note we will look at how the IPIP tunnel, which is very common in the cloud computing field, is implemented with TUN devices.

IPIP Tunneling

As mentioned in the previous note, a TUN network device can encapsulate a Layer 3 (IP) packet inside another Layer 3 packet, so a packet sent out through a TUN device looks like this:

MAC: xx:xx:xx:xx:xx:xx
IP Header: <new destination IP>
IP Body:
  IP: <original destination IP>
  TCP: stuff
  HTTP: stuff

This is the structure of a typical IPIP tunnel packet. Linux natively supports several types of IPIP tunnels, all of which rely on TUN network devices; we can view the supported tunnel types and their operations with the command ip tunnel help.

# ip tunnel help
Usage: ip tunnel { add | change | del | show | prl | 6rd } [ NAME ]
          [ mode { ipip | gre | sit | isatap | vti } ] [ remote ADDR ] [ local ADDR ]
          [ [i|o]seq ] [ [i|o]key KEY ] [ [i|o]csum ]
          [ prl-default ADDR ] [ prl-nodefault ADDR ] [ prl-delete ADDR ]
          [ 6rd-prefix ADDR ] [ 6rd-relay_prefix ADDR ] [ 6rd-reset ]
          [ ttl TTL ] [ tos TOS ] [ [no]pmtudisc ] [ dev PHYS_DEV ]

Where: NAME := STRING
       ADDR := { IP_ADDRESS | any }
       TOS  := { STRING | 00..ff | inherit | inherit/STRING | inherit/00..ff }
       TTL  := { 1..255 | inherit }
       KEY  := { DOTTED_QUAD | NUMBER }

Here mode represents the tunnel type; Linux natively supports a total of 5 types of IPIP tunnels:

  1. ipip: ordinary IPIP tunneling, i.e. encapsulating an IPv4 packet inside another IPv4 packet
  2. gre: Generic Routing Encapsulation, which defines a mechanism for encapsulating one network layer protocol on top of another network layer protocol, so it works for both IPv4 and IPv6
  3. sit: mainly used to encapsulate IPv6 packets inside IPv4 packets, i.e. IPv6 over IPv4
  4. isatap: Intra-Site Automatic Tunnel Addressing Protocol, similar to sit, also used for IPv6 tunnel encapsulation
  5. vti: Virtual Tunnel Interface, an IPsec tunneling technology
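
For example, an IPv6-over-IPv4 tunnel in sit mode could be set up roughly as follows. This is a sketch only: the endpoint addresses 203.0.113.1/203.0.113.2 and the IPv6 address fd00::1/64 are hypothetical, chosen just to illustrate the mode option.

# modprobe sit
# ip tunnel add sit1 mode sit remote 203.0.113.2 local 203.0.113.1 ttl 64
# ip link set sit1 up
# ip -6 addr add fd00::1/64 dev sit1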

Some other useful parameters:

  • ttl N sets the TTL of the outer header of packets entering the tunnel to N (N is a number between 1 and 255; 0 is a special value meaning the TTL is inherited from the inner packet); the default is inherit
  • tos T / dsfield T sets the TOS field of the outer header; the default is inherit
  • [no]pmtudisc disables or enables Path MTU Discovery on this tunnel; the default is enabled

Note: The nopmtudisc option is not compatible with a fixed ttl; if a fixed ttl parameter is used, the system will enable Path MTU Discovery.
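
To make these options concrete, here is a sketch of creating an ipip tunnel with a fixed outer TTL and an inherited TOS; per the note above, the fixed ttl implies that Path MTU Discovery is on. The device name tunl1 and the addresses 192.0.2.1/192.0.2.2 are just for illustration:

# ip tunnel add tunl1 mode ipip remote 192.0.2.2 local 192.0.2.1 ttl 64 tos inherit
# ip link set tunl1 up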

one-to-one

Let’s start with the most basic one-to-one IPIP tunnel mode as an example and walk through how to build an IPIP tunnel in Linux so that two different subnets can communicate.

Before starting, note that not all Linux distributions load the ipip.ko module by default. You can check whether the kernel has loaded the module with lsmod | grep ipip; if not, load it first with modprobe ipip. If everything is fine, it should look like this:

# lsmod | grep ipip
# modprobe ipip
# lsmod | grep ipip
ipip                   20480  0
tunnel4                16384  1 ipip
ip_tunnel              24576  1 ipip
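
If the module should be loaded automatically at boot, one common approach on systemd-based distributions is a modules-load.d entry (a sketch; the file name itself is arbitrary):

# echo ipip > /etc/modules-load.d/ipip.conf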

Now it’s time to start building the IPIP tunnel. Our network topology is shown in the following figure.

network topology

Hosts A and B are in the same network segment 172.16.0.0/16, so they can reach each other directly. What we need to do is create a different subnet on each of the two hosts.

Note: In fact, hosts A and B do not need to be in the same subnet; it is enough that they are in the same Layer 3 network, i.e. that they can reach each other via Layer 3 routing, to build an IPIP tunnel.
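
As a quick sanity check before building anything, each host should be able to reach the other's physical address; for example, on host A (using the addresses from the topology above, output omitted):

# ping 172.16.232.194 -c 1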

A: 10.42.1.0/24
B: 10.42.2.0/24

To keep things simple, we first create the bridge network device mybr0 on node A, set its IP address to the gateway address of the 10.42.1.0/24 subnet, and then bring mybr0 up.

# ip link add name mybr0 type bridge
# ip addr add 10.42.1.1/24 dev mybr0
# ip link set dev mybr0 up

The same operation is then performed on node B.

B:

# ip link add name mybr0 type bridge
# ip addr add 10.42.2.1/24 dev mybr0
# ip link set dev mybr0 up

Next, on each of the two nodes A and B, we:

  1. create the corresponding TUN network device
  2. set the corresponding local and remote addresses to the routable addresses of the nodes
  3. set the subnet's gateway address on the TUN network device we create
  4. bring the TUN network device up to establish the IPIP tunnel

Note: Step 3 assigns the gateway address directly to the TUN device, which saves us from creating an additional network device for the subnet and simplifies the setup.

A:

# modprobe ipip
# ip tunnel add tunl0 mode ipip remote 172.16.232.194 local 172.16.232.172
# ip addr add 10.42.1.1/24 dev tunl0
# ip link set tunl0 up

The above commands create a new tunnel device tunl0 and set the tunnel's remote and local IP addresses, which become the outer addresses of the IPIP packets; for the inner addresses we use the two subnet gateway addresses, so the IPIP packets will look as shown below.

IPIP packets

B:

# modprobe ipip
# ip tunnel add tunl0 mode ipip remote 172.16.232.172 local 172.16.232.194
# ip addr add 10.42.2.1/24 dev tunl0
# ip link set tunl0 up

To make sure the subnets on the two hosts can reach each other through the IPIP tunnel we just created, we need to manually add the following static routes.

A:

# ip route add 10.42.2.0/24 dev tunl0

B:

# ip route add 10.42.1.0/24 dev tunl0

The routing tables on hosts A and B now look like this.

A:

# ip route show
default via 172.16.200.51 dev ens3
10.42.1.0/24 dev tunl0 proto kernel scope link src 10.42.1.1
10.42.2.0/24 dev tunl0 scope link
172.16.0.0/16 dev ens3 proto kernel scope link src 172.16.232.172

B:

# ip route show
default via 172.16.200.51 dev ens3
10.42.1.0/24 dev tunl0 scope link
10.42.2.0/24 dev tunl0 proto kernel scope link src 10.42.2.1
172.16.0.0/16 dev ens3 proto kernel scope link src 172.16.232.194

At this point we can start verifying that the IPIP tunnel is working properly:

A:

# ping 10.42.2.1 -c 2
PING 10.42.2.1 (10.42.2.1) 56(84) bytes of data.
64 bytes from 10.42.2.1: icmp_seq=1 ttl=64 time=0.269 ms
64 bytes from 10.42.2.1: icmp_seq=2 ttl=64 time=0.303 ms

--- 10.42.2.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1013ms
rtt min/avg/max/mdev = 0.269/0.286/0.303/0.017 ms

B:

# ping 10.42.1.1 -c 2
PING 10.42.1.1 (10.42.1.1) 56(84) bytes of data.
64 bytes from 10.42.1.1: icmp_seq=1 ttl=64 time=0.214 ms
64 bytes from 10.42.1.1: icmp_seq=2 ttl=64 time=3.27 ms

--- 10.42.1.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1021ms
rtt min/avg/max/mdev = 0.214/1.745/3.277/1.532 ms

The pings go through. Next, let's capture the traffic on the TUN device with tcpdump.

# tcpdump -n -i tunl0
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on tunl0, link-type RAW (Raw IP), capture size 262144 bytes
01:32:05.486835 IP 10.42.1.1 > 10.42.2.1: ICMP echo request, id 3460, seq 1, length 64
01:32:05.486868 IP 10.42.2.1 > 10.42.1.1: ICMP echo reply, id 3460, seq 1, length 64
01:32:06.509617 IP 10.42.1.1 > 10.42.2.1: ICMP echo request, id 3460, seq 2, length 64
01:32:06.509668 IP 10.42.2.1 > 10.42.1.1: ICMP echo reply, id 3460, seq 2, length 64
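
On tunl0 we only see the inner, already-decapsulated packets. To observe the outer IPIP encapsulation from the figure above, we can instead capture on the physical interface and filter on IP protocol 4 (IPIP); a sketch, assuming the physical interface is ens3 as in the routing tables above (output omitted):

# tcpdump -n -i ens3 ip proto 4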

Up to this point, our experiment has been a success. Note, however, that if we use gre mode, we may need to configure the firewall to allow the two subnets to communicate; this comes up more often when building IPv6 tunnels.
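
For instance, if a restrictive host firewall is in place, a rule along these lines (a sketch using iptables; adapt it to your firewall of choice) would admit GRE traffic, which uses IP protocol 47, from the tunnel peer:

# iptables -A INPUT -p gre -s 172.16.232.194 -j ACCEPT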

one-to-many

In the previous section, we created a one-to-one IPIP tunnel by specifying the local and remote addresses of the TUN device. In fact, an IPIP tunnel can be created without specifying a remote address at all; it is enough to add the corresponding routes over the TUN device, and the ipip module will know how to encapsulate the IP packets and deliver them to the destination given by the route.

To illustrate, suppose we now have three nodes in the same Layer 3 network.

A: 172.16.165.33
B: 172.16.165.244
C: 172.16.168.113

At the same time, attach a different subnet to each of the three nodes:

A: 10.42.1.0/24
B: 10.42.2.0/24
C: 10.42.3.0/24

Unlike the previous section, instead of setting the subnet's gateway address directly on the TUN device, we create an additional bridge network device to simulate the common container network model. We create the bridge network device mybr0 on node A, set its IP address to the gateway address of the 10.42.1.0/24 subnet, and bring mybr0 up.

# ip link add name mybr0 type bridge
# ip addr add 10.42.1.1/24 dev mybr0
# ip link set dev mybr0 up

The same operation is then performed on nodes B and C, respectively.

B:

# ip link add name mybr0 type bridge
# ip addr add 10.42.2.1/24 dev mybr0
# ip link set dev mybr0 up

C:

# ip link add name mybr0 type bridge
# ip addr add 10.42.3.1/24 dev mybr0
# ip link set dev mybr0 up

Our ultimate goal is to build IPIP tunnels between each pair of the three nodes, so that the three subnets can communicate directly with one another. The next step is therefore to create the TUN network devices and set up the routing information. On each of the nodes A, B and C we:

  1. create the corresponding TUN network device and bring it up
  2. set the IP address of the TUN network device
  3. add routes to the other subnets, specifying the next hop address

Note: The IP address of the TUN network device is taken from the node's own subnet, but with a 32-bit mask: node A's subnet is 10.42.1.0/24, so the TUN device on node A gets 10.42.1.0/32. The reason is that addresses in the same subnet (e.g. 10.42.1.0/24) may be given the same MAC address and therefore cannot communicate directly at Layer 2; a /32 address guarantees the TUN device never shares a subnet with any other address, so no direct Layer 2 communication is ever attempted. This is related to how Calico works, where every container sees the same MAC address; we may explore this in depth another time.

Note: Another point worth noting is that the routes over the TUN network device are added with onlink. This forces the kernel to treat the next hop as directly attached to the TUN device, so the IPIP tunnel can be built even when the nodes are not in the same subnet.

A:

# modprobe ipip
# ip tunnel add tunl0 mode ipip
# ip link set tunl0 up
# ip addr add 10.42.1.0/32 dev tunl0
# ip route add 10.42.2.0/24 via 172.16.165.244 dev tunl0 onlink
# ip route add 10.42.3.0/24 via 172.16.168.113 dev tunl0 onlink

B:

# modprobe ipip
# ip tunnel add tunl0 mode ipip
# ip link set tunl0 up
# ip addr add 10.42.2.0/32 dev tunl0
# ip route add 10.42.1.0/24 via 172.16.165.33 dev tunl0 onlink
# ip route add 10.42.3.0/24 via 172.16.168.113 dev tunl0 onlink

C:

# modprobe ipip
# ip tunnel add tunl0 mode ipip
# ip link set tunl0 up
# ip addr add 10.42.3.0/32 dev tunl0
# ip route add 10.42.1.0/24 via 172.16.165.33 dev tunl0 onlink
# ip route add 10.42.2.0/24 via 172.16.165.244 dev tunl0 onlink
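
Before pinging, we can sanity-check the routing decision; for example, on node A, ip route get should report that traffic to node B's subnet goes via 172.16.165.244 over tunl0 (output omitted):

# ip route get 10.42.2.1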

At this point we can start verifying that the IPIP tunnel we built is working properly.

A:

# try to ping IP in 10.42.2.0/24 on Node B
# ping 10.42.2.1 -c 2
PING 10.42.2.1 (10.42.2.1) 56(84) bytes of data.
64 bytes from 10.42.2.1: icmp_seq=1 ttl=64 time=0.338 ms
64 bytes from 10.42.2.1: icmp_seq=2 ttl=64 time=0.302 ms

--- 10.42.2.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1028ms
rtt min/avg/max/mdev = 0.302/0.320/0.338/0.018 ms
...
# try to ping IP in 10.42.3.0/24 on Node C
# ping 10.42.3.1 -c 2
PING 10.42.3.1 (10.42.3.1) 56(84) bytes of data.
64 bytes from 10.42.3.1: icmp_seq=1 ttl=64 time=0.315 ms
64 bytes from 10.42.3.1: icmp_seq=2 ttl=64 time=0.381 ms

--- 10.42.3.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1029ms
rtt min/avg/max/mdev = 0.315/0.348/0.381/0.033 ms

Everything works as expected, and if we reverse the direction and ping the other subnets from node B or C, that works too. This shows that we can indeed create one-to-many IPIP tunnels, and this one-to-many pattern is very useful for building overlay communication models in typical multi-node networks.

under the hood

We then capture the traffic with tcpdump on the TUN devices of B and C, respectively.

B:

# tcpdump -n -i tunl0
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on tunl0, link-type RAW (Raw IP), capture size 262144 bytes
22:38:28.268089 IP 10.42.1.0 > 10.42.2.1: ICMP echo request, id 6026, seq 1, length 64
22:38:28.268125 IP 10.42.2.1 > 10.42.1.0: ICMP echo reply, id 6026, seq 1, length 64
22:38:29.285595 IP 10.42.1.0 > 10.42.2.1: ICMP echo request, id 6026, seq 2, length 64
22:38:29.285629 IP 10.42.2.1 > 10.42.1.0: ICMP echo reply, id 6026, seq 2, length 64

C:

# tcpdump -n -i tunl0
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on tunl0, link-type RAW (Raw IP), capture size 262144 bytes
22:36:18.236446 IP 10.42.1.0 > 10.42.3.1: ICMP echo request, id 5894, seq 1, length 64
22:36:18.236499 IP 10.42.3.1 > 10.42.1.0: ICMP echo reply, id 5894, seq 1, length 64
22:36:19.265946 IP 10.42.1.0 > 10.42.3.1: ICMP echo request, id 5894, seq 2, length 64
22:36:19.265997 IP 10.42.3.1 > 10.42.1.0: ICMP echo reply, id 5894, seq 2, length 64

In fact, from the process of creating one-to-many IPIP tunnels we can roughly infer that the Linux ipip module resolves the route for the inner IP packet and then wraps it with an outer IP header into a new IP packet. As for how IPIP packets are unpacked on the receive side, let's look at how packets are delivered to the ipip module.

void ip_protocol_deliver_rcu(struct net *net, struct sk_buff *skb, int protocol)
{
    const struct net_protocol *ipprot;
    int raw, ret;

resubmit:
    raw = raw_local_deliver(skb, protocol);

    ipprot = rcu_dereference(inet_protos[protocol]);
    if (ipprot) {
        if (!ipprot->no_policy) {
            if (!xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb)) {
                kfree_skb(skb);
                return;
            }
            nf_reset_ct(skb);
        }
        ret = INDIRECT_CALL_2(ipprot->handler, tcp_v4_rcv, udp_rcv,
                        skb);
        if (ret < 0) {
            protocol = -ret;
            goto resubmit;
        }
        __IP_INC_STATS(net, IPSTATS_MIB_INDELIVERS);
    } else {
        if (!raw) {
            if (xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb)) {
                __IP_INC_STATS(net, IPSTATS_MIB_INUNKNOWNPROTOS);
                icmp_send(skb, ICMP_DEST_UNREACH,
                        ICMP_PROT_UNREACH, 0);
            }
            kfree_skb(skb);
        } else {
            __IP_INC_STATS(net, IPSTATS_MIB_INDELIVERS);
            consume_skb(skb);
        }
    }
}

From https://github.com/torvalds/linux/blob/master/net/ipv4/ip_input.c#L187-L224

As you can see, the packet is dispatched to the handler registered in inet_protos[] for its protocol number; for IPIP packets (IP protocol 4) this is the tunnel receive path, which decapsulates the packet and feeds the inner packet back into the IP receive path so that it is delivered again as a normal IP packet (the resubmit loop in the code above serves the same purpose for handlers that return an inner protocol number). The above is only a very superficial analysis; if you are interested, we recommend reading more of the ipip module's source code.
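
For context, this dispatch finds the tunnel code because a handler for IP protocol 4 (IPPROTO_IPIP) is registered at module init time. Below is a simplified sketch of that registration, paraphrased from net/ipv4/tunnel4.c rather than quoted verbatim (the real init function also registers IPv6-in-IPv4 handling and does error cleanup):

/* tunnel4 registers a net_protocol for IPPROTO_IPIP (protocol number 4),
 * which is how ip_protocol_deliver_rcu finds a handler in inet_protos[4].
 * The ipip module then hooks its own receive function into tunnel4. */
static const struct net_protocol tunnel4_protocol = {
    .handler     = tunnel4_rcv,   /* walks the registered tunnel handlers */
    .err_handler = tunnel4_err,
    .no_policy   = 1,
};

static int __init tunnel4_init(void)
{
    /* simplified: register the handler for IP-in-IP packets */
    return inet_add_protocol(&tunnel4_protocol, IPPROTO_IPIP);
}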


Reference https://houmin.cc/posts/cc24de6a/