In the previous notes, we briefly saw how a VPN can be implemented with TUN/TAP devices when introducing Linux network devices, but we did not try out how virtual network devices of this kind actually behave in Linux. In this note we’ll look at how to build IPIP tunnels on top of TUN devices, a technique that is common in cloud computing.

IPIP Tunneling

As we mentioned in the previous notes, a TUN network device can encapsulate a Layer 3 packet (an IP packet) inside another Layer 3 packet, so a packet sent out through the TUN device looks like the following.

MAC: xx:xx:xx:xx:xx:xx
IP Header: <new destination IP>
IP Body:
  IP: <original destination IP>
  TCP: stuff
  HTTP: stuff
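
The layout above can be reproduced byte by byte. The following is a minimal sketch, not how the kernel builds packets: the helper names (ipv4_header, encapsulate, parse_header) are made up for illustration, the addresses are example values, the header checksum is left at zero, and the bytes "stuff" merely stand in for a real TCP segment. The one fixed fact it relies on is that IPv4-in-IPv4 encapsulation is identified by protocol number 4 in the outer header.

```python
import socket
import struct

IPPROTO_IPIP = 4  # IANA protocol number for IPv4-in-IPv4 encapsulation


def ipv4_header(src: str, dst: str, proto: int, payload_len: int, ttl: int = 64) -> bytes:
    """Build a minimal 20-byte IPv4 header (no options, checksum left at 0)."""
    version_ihl = (4 << 4) | 5          # IPv4, header length of 5 32-bit words
    total_len = 20 + payload_len
    return struct.pack(
        "!BBHHHBBH4s4s",
        version_ihl, 0, total_len,      # version/IHL, TOS, total length
        0, 0,                           # identification, flags/fragment offset
        ttl, proto, 0,                  # TTL, protocol, checksum (not computed here)
        socket.inet_aton(src), socket.inet_aton(dst),
    )


def encapsulate(inner_packet: bytes, outer_src: str, outer_dst: str) -> bytes:
    """Prepend an outer IPv4 header whose protocol field marks the payload as IPIP."""
    return ipv4_header(outer_src, outer_dst, IPPROTO_IPIP, len(inner_packet)) + inner_packet


def parse_header(packet: bytes) -> dict:
    """Decode the fields of interest from the first 20 bytes of an IPv4 packet."""
    fields = struct.unpack("!BBHHHBBH4s4s", packet[:20])
    return {
        "proto": fields[6],
        "src": socket.inet_ntoa(fields[8]),
        "dst": socket.inet_ntoa(fields[9]),
        "total_len": fields[2],
    }


payload = b"stuff"  # placeholder for the TCP/HTTP content
inner = ipv4_header("10.42.1.1", "10.42.2.1", socket.IPPROTO_TCP, len(payload)) + payload
outer = encapsulate(inner, "172.16.232.172", "172.16.232.194")

print(parse_header(outer))        # outer header: new destination, protocol 4
print(parse_header(outer[20:]))   # inner header: original destination
```

Skipping the first 20 bytes of the outer packet yields the original packet unchanged, which is all that decapsulation has to do on the receiving side.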

This is the structure of a typical IPIP packet. Linux natively supports several types of IP tunnels, all of which depend on a tunnel network device; the command ip tunnel help lists the available tunnel types and the supported operations.

# ip tunnel help
Usage: ip tunnel { add | change | del | show | prl | 6rd } [ NAME ]
          [ mode { ipip | gre | sit | isatap | vti } ] [ remote ADDR ] [ local ADDR ]
          [ [i|o]seq ] [ [i|o]key KEY ] [ [i|o]csum ]
          [ prl-default ADDR ] [ prl-nodefault ADDR ] [ prl-delete ADDR ]
          [ 6rd-prefix ADDR ] [ 6rd-relay_prefix ADDR ] [ 6rd-reset ]
          [ ttl TTL ] [ tos TOS ] [ [no]pmtudisc ] [ dev PHYS_DEV ]

Where: NAME := STRING
       ADDR := { IP_ADDRESS | any }
       TOS  := { STRING | 00..ff | inherit | inherit/STRING | inherit/00..ff }
       TTL  := { 1..255 | inherit }
       KEY  := { DOTTED_QUAD | NUMBER }

The mode option selects the tunnel type.

  1. ipip: a plain IPIP tunnel, which encapsulates an IPv4 packet inside another IPv4 packet
  2. gre: Generic Routing Encapsulation, which defines a mechanism for encapsulating one network layer protocol inside another, so it works for both IPv4 and IPv6
  3. sit: mainly used to encapsulate IPv6 packets inside IPv4 packets, i.e. IPv6 over IPv4
  4. isatap: Intra-Site Automatic Tunnel Addressing Protocol, similar to sit and also used for IPv6 tunnel encapsulation
  5. vti: Virtual Tunnel Interface, an IPsec tunneling technology

There are also some useful parameters.

  • ttl N sets a fixed TTL of N on tunneled packets (N is a number between 1 and 255; 0 is a special value meaning that the packet inherits the TTL). The default value of the ttl parameter is inherit
  • tos T/dsfield T sets the TOS field of tunneled packets; the default is inherit
  • [no]pmtudisc turns Path MTU Discovery on or off for this tunnel; it is turned on by default

Note: The nopmtudisc option is not compatible with a fixed ttl. If a fixed ttl is used, Path MTU Discovery is always turned on.
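
To make the ttl and MTU remarks concrete, here is a small model in Python. It is a simplification under stated assumptions, not the kernel's logic: the function names outer_ttl and tunnel_mtu are hypothetical, "inherit" is modeled as copying the TTL of the inner IPv4 packet, and the outer header is assumed to be a plain 20-byte IPv4 header without options. The 20 bytes of overhead are also the reason Path MTU Discovery matters for tunnels.

```python
IPV4_HEADER_LEN = 20  # bytes added by the outer IPv4 header (assumes no IP options)


def outer_ttl(inner_ttl: int, configured_ttl: int = 0) -> int:
    """Return the TTL for the outer header; 0 means 'inherit' from the inner packet."""
    if not 0 <= configured_ttl <= 255:
        raise ValueError("ttl must be 0 (inherit) or 1..255")
    return inner_ttl if configured_ttl == 0 else configured_ttl


def tunnel_mtu(link_mtu: int) -> int:
    """Largest inner packet that fits in one outer packet without fragmentation."""
    return link_mtu - IPV4_HEADER_LEN


print(outer_ttl(inner_ttl=57))                     # inherit: copies the inner TTL
print(outer_ttl(inner_ttl=57, configured_ttl=64))  # fixed TTL chosen at tunnel creation
print(tunnel_mtu(1500))                            # room left for the inner packet
```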

One-to-One

Let’s start with the basic one-to-one tunnel mode to introduce how to build an IPIP tunnel in Linux to communicate between two different subnets.

Before you start, note that not all Linux distributions load the ipip.ko module by default. You can check whether the kernel has loaded it by running lsmod | grep ipip; if not, load it first with modprobe ipip. Once the module is loaded, lsmod | grep ipip should show output similar to the following.

# lsmod | grep ipip
# modprobe ipip
# lsmod | grep ipip
ipip                   20480  0
tunnel4                16384  1 ipip
ip_tunnel              24576  1 ipip
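
If the same check has to be scripted, matching the module name exactly avoids false positives from substring matches. The sketch below parses lsmod-style text; loaded_modules is a hypothetical helper, and the sample text is simply the output shown above.

```python
def loaded_modules(lsmod_output: str) -> set:
    """Return the module names from lsmod-style output (first column of each line)."""
    names = set()
    for line in lsmod_output.splitlines():
        fields = line.split()
        if fields and fields[0] != "Module":   # skip the header row if present
            names.add(fields[0])
    return names


sample = """\
ipip                   20480  0
tunnel4                16384  1 ipip
ip_tunnel              24576  1 ipip
"""

print("ipip" in loaded_modules(sample))
```

On a real host the text could come from subprocess.run(["lsmod"], capture_output=True, text=True).stdout instead of the sample string.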

Now it’s time to start building the IPIP tunnel, and our target network topology is shown in the figure below.

Target Network Topology

There are two hosts, A and B, in the same network segment 172.16.0.0/16, so they can reach each other directly. What we need to do is create a different subnet on each of the two hosts.

Note: In fact, hosts A and B do not need to be on the same subnet. As long as they are on the same routable Layer 3 network, i.e., they can reach each other through Layer 3 routing, the IPIP tunnel can be built.

A: 10.42.1.0/24
B: 10.42.2.0/24

To simplify things, we first create the bridge network device mybr0 on node A and set the IP address to the gateway address of the 10.42.1.0/24 subnet, which is 10.42.1.1/24, and then enable the mybr0 bridge device.

# ip link add name mybr0 type bridge
# ip addr add 10.42.1.1/24 dev mybr0
# ip link set dev mybr0 up

A similar operation is then performed on node B, but with the subnet 10.42.2.0/24.

# ip link add name mybr0 type bridge
# ip addr add 10.42.2.1/24 dev mybr0
# ip link set dev mybr0 up

Next, we create the corresponding TUN network device on nodes A and B respectively.

  1. Create the corresponding TUN network device
  2. Set the local and remote addresses to the routable addresses of the nodes
  3. Assign the gateway address of the subnet to the TUN network device we just created
  4. Enable the TUN network device to bring up the IPIP tunnel

Note: Step 3 simplifies the subnet setup: the gateway address is assigned directly to the TUN device, so no additional network device needs to be created.

A:

# modprobe ipip
# ip tunnel add tunl0 mode ipip remote 172.16.232.194 local 172.16.232.172
# ip addr add 10.42.1.1/24 dev tunl0
# ip link set tunl0 up

With the above commands, we create a new tunnel device tunl0 and set the remote and local IP addresses of the tunnel, which become the outer addresses of the IPIP packets. The inner addresses come from the two separate subnets we configured, so the IPIP packets will look as shown in the figure below.

IPIP packets

B:

# modprobe ipip
# ip tunnel add tunl0 mode ipip remote 172.16.232.172 local 172.16.232.194
# ip addr add 10.42.2.1/24 dev tunl0
# ip link set tunl0 up

In order to ensure that we can access the subnets on two different hosts through the IPIP tunnels we created, we need to manually add the following static routes.

A:

# ip route add 10.42.2.0/24 dev tunl0

B:

# ip route add 10.42.1.0/24 dev tunl0

The routing tables on hosts A and B now look as follows.

A:

# ip route show
default via 172.16.200.51 dev ens3
10.42.1.0/24 dev tunl0 proto kernel scope link src 10.42.1.1
10.42.2.0/24 dev tunl0 scope link
172.16.0.0/16 dev ens3 proto kernel scope link src 172.16.232.172

B:

# ip route show
default via 172.16.200.51 dev ens3
10.42.1.0/24 dev tunl0 scope link
10.42.2.0/24 dev tunl0 proto kernel scope link src 10.42.2.1
172.16.0.0/16 dev ens3 proto kernel scope link src 172.16.232.194
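
To see why traffic for 10.42.2.0/24 leaves host A through tunl0 rather than ens3, it helps to replay the lookup by hand. The sketch below is a simplified longest-prefix match over host A's table as listed above; the lookup function is hypothetical and ignores metrics, scopes and policy routing, all of which the real kernel takes into account.

```python
import ipaddress

# Host A's routing table from the listing above: (prefix, device).
ROUTES_A = [
    ("0.0.0.0/0", "ens3"),        # default route
    ("10.42.1.0/24", "tunl0"),
    ("10.42.2.0/24", "tunl0"),
    ("172.16.0.0/16", "ens3"),
]


def lookup(routes, destination: str) -> str:
    """Return the device of the most specific route that contains the destination."""
    addr = ipaddress.ip_address(destination)
    candidates = [
        (ipaddress.ip_network(prefix), dev)
        for prefix, dev in routes
        if addr in ipaddress.ip_network(prefix)
    ]
    network, dev = max(candidates, key=lambda item: item[0].prefixlen)
    return dev


print(lookup(ROUTES_A, "10.42.2.1"))       # an address in the subnet on host B
print(lookup(ROUTES_A, "172.16.232.194"))  # host B's routable address
print(lookup(ROUTES_A, "8.8.8.8"))         # anything else falls back to the default route
```

The inner packet is matched against the tunnel route first; the encapsulated packet, addressed to host B's routable address, is then matched again and leaves through the physical interface.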

At this point we can start verifying that the IPIP tunnel is working properly.

A:

# ping 10.42.2.1 -c 2
PING 10.42.2.1 (10.42.2.1) 56(84) bytes of data.
64 bytes from 10.42.2.1: icmp_seq=1 ttl=64 time=0.269 ms
64 bytes from 10.42.2.1: icmp_seq=2 ttl=64 time=0.303 ms

--- 10.42.2.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1013ms
rtt min/avg/max/mdev = 0.269/0.286/0.303/0.017 ms

B:

# ping 10.42.1.1 -c 2
PING 10.42.1.1 (10.42.1.1) 56(84) bytes of data.
64 bytes from 10.42.1.1: icmp_seq=1 ttl=64 time=0.214 ms
64 bytes from 10.42.1.1: icmp_seq=2 ttl=64 time=3.27 ms

--- 10.42.1.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1021ms
rtt min/avg/max/mdev = 0.214/1.745/3.277/1.532 ms

The pings succeed, and we can capture the packets on the TUN device with the tcpdump command.

# tcpdump -n -i tunl0
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on tunl0, link-type RAW (Raw IP), capture size 262144 bytes
01:32:05.486835 IP 10.42.1.1 > 10.42.2.1: ICMP echo request, id 3460, seq 1, length 64
01:32:05.486868 IP 10.42.2.1 > 10.42.1.1: ICMP echo reply, id 3460, seq 1, length 64
01:32:06.509617 IP 10.42.1.1 > 10.42.2.1: ICMP echo request, id 3460, seq 2, length 64
01:32:06.509668 IP 10.42.2.1 > 10.42.1.1: ICMP echo reply, id 3460, seq 2, length 64

So far, our experiment is successful. Note, however, that if we use gre mode, we may need to adjust the firewall settings so that the two subnets can reach each other, which is more common when building IPv6 tunnels.

One-to-Many

In the previous section, we created a one-to-one IPIP tunnel by specifying the local and remote addresses of the TUN device. In fact, an IPIP tunnel can also be created without specifying the remote address: as long as the corresponding routes are added on the TUN device, the tunnel knows how to encapsulate each packet in a new IP packet and deliver it to the destination given by the route.

To illustrate with an example, suppose we now have 3 nodes on the same Layer 3 network.

A: 172.16.165.33
B: 172.16.165.244
C: 172.16.168.113

A different subnet is created on each of these three nodes.

A: 10.42.1.0/24
B: 10.42.2.0/24
C: 10.42.3.0/24

Unlike the previous subsection, we do not assign the subnet’s gateway address directly to the TUN device. Instead, we create an additional bridge network device to mimic a common container network model. We create the bridge network device mybr0 on node A, set its IP address to the gateway address of the 10.42.1.0/24 subnet, and then enable mybr0.

# ip link add name mybr0 type bridge
# ip addr add 10.42.1.1/24 dev mybr0
# ip link set dev mybr0 up

Then perform a similar operation on nodes B and C respectively.

B:

# ip link add name mybr0 type bridge
# ip addr add 10.42.2.1/24 dev mybr0
# ip link set dev mybr0 up

C:

# ip link add name mybr0 type bridge
# ip addr add 10.42.3.1/24 dev mybr0
# ip link set dev mybr0 up

Our ultimate goal is to build IPIP tunnels between each pair of the three nodes so that the three subnets can communicate with each other directly. The next step is therefore to create the TUN network devices and set up the routing information on nodes A, B, and C respectively.

  1. Create the corresponding TUN network device and enable it
  2. Set the IP address of the TUN network device
  3. Add the routes to the other subnets through the TUN network device, specifying the routable address of the node that hosts each subnet as the next hop

Note: The IP address of the TUN network device is the subnet address of the corresponding node, but with a 32-bit mask. For example, if the subnet on node A is 10.42.1.0/24, the IP address of the TUN network device on node A is 10.42.1.0/32. The reason is that addresses on the same subnet (e.g., 10.42.1.0/24) are sometimes assigned the same MAC address and therefore cannot communicate directly at the link layer (Layer 2). If the IP address of the TUN network device is guaranteed not to share a subnet with any other address, no direct link-layer communication is attempted. For this point, see how Calico is implemented, where every container has the same MAC address; we will have the opportunity to explore it later.
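
The effect of the 32-bit mask can be checked with Python's ipaddress module. This is only an illustration of the addressing arithmetic: the on_link helper and the address 10.42.1.5 are hypothetical, chosen to stand for some other address in node A's subnet.

```python
import ipaddress

tunnel_if = ipaddress.ip_interface("10.42.1.0/32")   # address used on tunl0 in this note
bridge_if = ipaddress.ip_interface("10.42.1.1/24")   # address used on mybr0
peer = ipaddress.ip_address("10.42.1.5")             # hypothetical container address


def on_link(interface, address) -> bool:
    """True if the address falls inside the interface's directly connected network."""
    return address in interface.network


print(on_link(bridge_if, peer))          # /24: the peer is in the connected network
print(on_link(tunnel_if, peer))          # /32: only the address itself is
print(tunnel_if.network.num_addresses)   # size of the /32 network
```

With a /32, the tunnel device has no connected network covering other hosts, so every destination has to be reached through an explicit route.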

Note: The onlink parameter is specified when adding the routes on the TUN network device. It makes the kernel treat the next hop as directly reachable through the TUN device, so IPIP tunnels can be built even if the nodes are not in the same subnet.

A:

# modprobe ipip
# ip tunnel add tunl0 mode ipip
# ip link set tunl0 up
# ip addr add 10.42.1.0/32 dev tunl0
# ip route add 10.42.2.0/24 via 172.16.165.244 dev tunl0 onlink
# ip route add 10.42.3.0/24 via 172.16.168.113 dev tunl0 onlink

B:

# modprobe ipip
# ip tunnel add tunl0 mode ipip
# ip link set tunl0 up
# ip addr add 10.42.2.0/32 dev tunl0
# ip route add 10.42.1.0/24 via 172.16.165.33 dev tunl0 onlink
# ip route add 10.42.3.0/24 via 172.16.168.113 dev tunl0 onlink

C:

# modprobe ipip
# ip tunnel add tunl0 mode ipip
# ip link set tunl0 up
# ip addr add 10.42.3.0/32 dev tunl0
# ip route add 10.42.1.0/24 via 172.16.165.33 dev tunl0 onlink
# ip route add 10.42.2.0/24 via 172.16.165.244 dev tunl0 onlink
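
Because no remote address is configured on the tunnel, the next hop of the matching route is what ends up as the outer destination. The following sketch models that choice for node A using the routes from the commands above. It is a toy model, not kernel code: outer_addresses is a hypothetical helper, and it assumes that the outer source is node A's own routable address.

```python
import ipaddress

# Node A's tunnel routes from the commands above: inner prefix -> next hop (node address).
TUNNEL_ROUTES_A = {
    "10.42.2.0/24": "172.16.165.244",   # node B
    "10.42.3.0/24": "172.16.168.113",   # node C
}
LOCAL_A = "172.16.165.33"               # assumed outer source: node A's routable address


def outer_addresses(inner_dst: str, routes=TUNNEL_ROUTES_A, local=LOCAL_A):
    """Pick the outer (source, destination) pair for an inner destination address."""
    addr = ipaddress.ip_address(inner_dst)
    for prefix, next_hop in routes.items():
        if addr in ipaddress.ip_network(prefix):
            return local, next_hop
    raise LookupError(f"no tunnel route for {inner_dst}")


print(outer_addresses("10.42.2.1"))   # packet for B's subnet is wrapped for node B
print(outer_addresses("10.42.3.1"))   # packet for C's subnet is wrapped for node C
```

A single tunnel device therefore serves any number of peers; adding a node only requires one more route on each existing node.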

At this point we can start verifying that the IPIP tunnel we built is working properly.

A:

# try to ping IP in 10.42.2.0/24 on Node B
# ping 10.42.2.1 -c 2
PING 10.42.2.1 (10.42.2.1) 56(84) bytes of data.
64 bytes from 10.42.2.1: icmp_seq=1 ttl=64 time=0.338 ms
64 bytes from 10.42.2.1: icmp_seq=2 ttl=64 time=0.302 ms

--- 10.42.2.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1028ms
rtt min/avg/max/mdev = 0.302/0.320/0.338/0.018 ms
...
# try to ping IP in 10.42.3.0/24 on Node C
# ping 10.42.3.1 -c 2
PING 10.42.3.1 (10.42.3.1) 56(84) bytes of data.
64 bytes from 10.42.3.1: icmp_seq=1 ttl=64 time=0.315 ms
64 bytes from 10.42.3.1: icmp_seq=2 ttl=64 time=0.381 ms

--- 10.42.3.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1029ms
rtt min/avg/max/mdev = 0.315/0.348/0.381/0.033 ms

Everything looks fine. Pinging the other subnets from node B or C works as well. This shows that we can indeed create one-to-many IPIP tunnels, a model that is very useful for building overlay networks across multiple nodes.

Under the Hood

While pinging from node A, we use the tcpdump command to capture the packets on the TUN devices of B and C respectively.

B:

# tcpdump -n -i tunl0
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on tunl0, link-type RAW (Raw IP), capture size 262144 bytes
22:38:28.268089 IP 10.42.1.0 > 10.42.2.1: ICMP echo request, id 6026, seq 1, length 64
22:38:28.268125 IP 10.42.2.1 > 10.42.1.0: ICMP echo reply, id 6026, seq 1, length 64
22:38:29.285595 IP 10.42.1.0 > 10.42.2.1: ICMP echo request, id 6026, seq 2, length 64
22:38:29.285629 IP 10.42.2.1 > 10.42.1.0: ICMP echo reply, id 6026, seq 2, length 64

C:

# tcpdump -n -i tunl0
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on tunl0, link-type RAW (Raw IP), capture size 262144 bytes
22:36:18.236446 IP 10.42.1.0 > 10.42.3.1: ICMP echo request, id 5894, seq 1, length 64
22:36:18.236499 IP 10.42.3.1 > 10.42.1.0: ICMP echo reply, id 5894, seq 1, length 64
22:36:19.265946 IP 10.42.1.0 > 10.42.3.1: ICMP echo request, id 5894, seq 2, length 64
22:36:19.265997 IP 10.42.3.1 > 10.42.1.0: ICMP echo reply, id 5894, seq 2, length 64

In fact, from the way we created the one-to-many IPIP tunnel, we can roughly guess that Linux’s ipip module uses the routing information to determine the outer destination IP for each inner IP packet, and then encapsulates the inner packet in a new IP packet carrying that outer address. But how is an IPIP packet unpacked? Let’s take a look at how the ipip module receives packets.

void ip_protocol_deliver_rcu(struct net *net, struct sk_buff *skb, int protocol)
{
    const struct net_protocol *ipprot;
    int raw, ret;

    resubmit:
    raw = raw_local_deliver(skb, protocol);

    ipprot = rcu_dereference(inet_protos[protocol]);
    if (ipprot) {
        if (!ipprot->no_policy) {
            if (!xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb)) {
                kfree_skb(skb);
                return;
            }
            nf_reset_ct(skb);
        }
        ret = INDIRECT_CALL_2(ipprot->handler, tcp_v4_rcv, udp_rcv,
                        skb);
        if (ret < 0) {
            protocol = -ret;
            goto resubmit;
        }
        __IP_INC_STATS(net, IPSTATS_MIB_INDELIVERS);
    } else {
        if (!raw) {
            if (xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb)) {
                __IP_INC_STATS(net, IPSTATS_MIB_INUNKNOWNPROTOS);
                icmp_send(skb, ICMP_DEST_UNREACH,
                        ICMP_PROT_UNREACH, 0);
            }
            kfree_skb(skb);
        } else {
            __IP_INC_STATS(net, IPSTATS_MIB_INDELIVERS);
            consume_skb(skb);
        }
    }
}

The above code is taken from: https://github.com/torvalds/linux/blob/master/net/ipv4/ip_input.c#L187-L224

As you can see, the receive path looks up the handler registered for the packet’s protocol type and hands the skb to it; for an IPIP packet this is where the outer header is stripped, after which the unpacked inner packet is processed again as an ordinary IP packet. This is only a very superficial analysis; if you are interested, we recommend reading the source code of the ipip module.
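
The dispatch-by-protocol idea can be mimicked in a few lines of Python. This is a toy model for intuition only and does not mirror the kernel's data structures: HANDLERS plays the role of the inet_protos table, and ip_receive, ipip_handler and icmp_handler are hypothetical names. Headers are assumed to be exactly 20 bytes with no options, and checksums are not computed.

```python
import socket
import struct

IPPROTO_ICMP = 1
IPPROTO_IPIP = 4


def ipv4(src: str, dst: str, proto: int, payload: bytes) -> bytes:
    """Build a minimal IPv4 packet: 20-byte header (checksum left at 0) plus payload."""
    header = struct.pack(
        "!BBHHHBBH4s4s",
        0x45, 0, 20 + len(payload), 0, 0, 64, proto, 0,
        socket.inet_aton(src), socket.inet_aton(dst),
    )
    return header + payload


delivered = []  # (protocol name, src, dst, payload) tuples that reached a final handler


def icmp_handler(src: str, dst: str, payload: bytes) -> None:
    delivered.append(("icmp", src, dst, payload))


def ipip_handler(src: str, dst: str, payload: bytes) -> None:
    # Decapsulation: the payload is itself an IPv4 packet, so feed it back in.
    ip_receive(payload)


HANDLERS = {IPPROTO_ICMP: icmp_handler, IPPROTO_IPIP: ipip_handler}


def ip_receive(packet: bytes) -> None:
    """Dispatch a packet to the handler registered for its protocol number."""
    proto = packet[9]                          # protocol field of the IPv4 header
    src = socket.inet_ntoa(packet[12:16])
    dst = socket.inet_ntoa(packet[16:20])
    handler = HANDLERS.get(proto)
    if handler is None:
        raise ValueError(f"no handler for protocol {proto}")
    handler(src, dst, packet[20:])


inner = ipv4("10.42.1.0", "10.42.2.1", IPPROTO_ICMP, b"echo request")
ip_receive(ipv4("172.16.165.33", "172.16.165.244", IPPROTO_IPIP, inner))
print(delivered)
```

The outer addresses never reach the final handler; only the inner source and destination do, which matches what tcpdump shows on tunl0.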