Some time ago, while running some tests, I needed to add delay to packets and came across Linux tc and its netem qdisc. Adding a simple delay is easy and takes a single command: $ tc qdisc add dev eth0 root netem delay 1s. You don’t even need to fully understand the parameters. But once you want something more specific, such as delaying only traffic to a particular IP address and port, or only inbound traffic, things get trickier, and a quick Google search no longer does the job. You have to understand some basic TC concepts and the meaning of the relevant parameters of the tc[2] command.

This article takes you through exactly these basic concepts in TC and links them to tc commands through a practical example.

Example commands

Since this is an introductory article, only a simple example is given here, but it covers the important concepts. The goal is that after reading it, you fully understand the following example and can write your own commands to suit your needs.

sudo tc qdisc add dev eth0 root handle 1: prio bands 4
sudo tc qdisc add dev eth0 parent 1:4 handle 40: netem loss 10% delay 40ms
sudo tc filter add dev eth0 protocol ip parent 1:0 prio 4 u32 match ip dst 192.168.190.7 match ip dport 36000 0xffff flowid 1:4

TC Basic Concepts

QDISCS

The full name is queueing discipline. It is a buffer layer between the protocol stack and the network interface. A qdisc decides how packets are queued and sent, so it is the place where operations such as reordering, shaping, and scheduling happen.

Qdiscs are divided into classless and classful qdiscs. A classless qdisc has no internal subdivision, while a classful qdisc can contain multiple classes. Each class can in turn contain a child qdisc, which may itself be classful, so the whole forms a tree-like hierarchy.

CLASSES

A classful qdisc can have multiple classes. Some qdiscs have predefined classes (e.g. prio), while others require the user to add them. Classes can be attached to a class as children. A class with no child classes is called a leaf class, and it has a qdisc attached to it. When a class is created, a fifo qdisc is attached by default, which is simply a queue and does not perform any operations on packets. This default qdisc is removed when a child class is added to the class. You can also replace the default fifo qdisc with any other qdisc you want.

FILTERS

Filters are used in classful qdiscs to determine which class a packet is queued to. Whenever a packet reaches a class that has child classes, it needs to be classified. One way to classify is to use filters (the other two are ToS and skb->priority). All filters attached to a class are called in turn until one of them returns a verdict. A filter contains conditions that decide, based on the packet’s characteristics, whether the packet matches when it arrives at that node.

These three are the most basic concepts in TC, and any complex traffic control setup is built recursively from this triplet.

Hierarchy

Each interface has an egress ‘root qdisc’, which is pfifo_fast by default. Each qdisc and class is assigned a handle, which is used to refer to it in subsequent configuration statements. In addition to the egress qdisc, an interface can also have an ingress qdisc, which handles inbound traffic. But the possibilities of an ingress qdisc are very limited compared to a classful qdisc. (That is why people say that traffic control governs sending but not receiving; control of inbound traffic is usually done with the help of ifb [6] or imq.)
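The ifb approach mentioned above can be sketched roughly as follows. This is a minimal outline, not a tested recipe: it assumes root privileges and that the ifb kernel module is available, and the function name delay_ingress and the device names are made up for illustration.

```shell
# Hypothetical helper: delay inbound traffic on DEV by redirecting it
# through an ifb device, where it can be treated like egress traffic.
# usage (as root): delay_ingress DEV IFB DELAY   e.g. delay_ingress eth0 ifb0 100ms
delay_ingress() {
    # Loading the module with numifbs=1 creates ifb0 only (assumption: IFB is ifb0).
    modprobe ifb numifbs=1 || return 1
    ip link set dev "$2" up || return 1
    # Attach the special ingress qdisc (handle ffff:) to the real device.
    tc qdisc add dev "$1" handle ffff: ingress || return 1
    # "match u32 0 0" matches every IP packet; redirect all of them to the ifb device.
    tc filter add dev "$1" parent ffff: protocol ip u32 match u32 0 0 \
        action mirred egress redirect dev "$2" || return 1
    # On the ifb device the packets are outgoing, so netem can be the root qdisc.
    tc qdisc add dev "$2" root netem delay "$3"
}
```

Calling delay_ingress eth0 ifb0 100ms would then add 100ms to everything eth0 receives. The netem qdisc on the ifb device could equally be replaced by a classful tree like the ones described below.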

The handles of these qdiscs have two parts, a major number and a minor number: <major>:<minor>. It is customary to name the root qdisc 1:, which is equivalent to 1:0. The minor number of a qdisc is always 0.

Classes need to have the same major number as their parent. Major numbers must be unique within an egress or ingress setup, and minor numbers must be unique within a qdisc and its classes.
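To make the handle notation concrete, the sketch below converts a handle into the 32-bit number behind it, with the major number in the upper 16 bits and the minor number in the lower 16 bits. tc reads both parts as hexadecimal. The function name handle_to_id is made up for illustration.

```shell
# Hypothetical helper: print the 32-bit id for a tc handle MAJOR:MINOR.
# Both parts are hexadecimal; an omitted part counts as 0.
handle_to_id() {
    major=${1%%:*}    # text before the colon
    minor=${1#*:}     # text after the colon (empty for "1:")
    printf '0x%08x\n' $(( (0x${major:-0} << 16) | 0x${minor:-0} ))
}

handle_to_id 1:      # 0x00010000
handle_to_id 1:0     # 0x00010000, the same as 1:
handle_to_id 12:2    # 0x00120002
```

This also shows why the minor number of a qdisc is always 0: the qdisc owns the whole major number, and its classes use the minor numbers beneath it.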

A typical hierarchy is as follows.

          1:   root qdisc
           |
          1:1    child class
        /  |  \
       /   |   \
      /    |    \
      /    |    \
   1:10  1:11  1:12   child classes
    |      |     | 
    |     11:    |    leaf class
    |            | 
    10:         12:   qdisc
   /   \       /   \
10:1  10:2   12:1  12:2   leaf classes

The kernel only communicates with the root qdisc. Whenever a packet needs to be enqueued or dequeued, the operation starts at the root node and works its way down to a leaf node, which decides where the packet is enqueued or which packet is dequeued.

For example, when a packet is enqueued, it may take the following path.

1: -> 1:1 -> 1:12 -> 12: -> 12:2

Of course, the packet may also take a more direct path.

1: -> 12:2

In this case, it is the filter on the root qdisc that decides to send the packet directly to 12:2.

Note that although the topology of the nodes is the same for enqueueing and dequeueing, each node plays a different role in the two directions [4]. On enqueue, filters and packet characteristics decide which path a packet takes. On dequeue, the choice depends on the scheduling algorithm of each qdisc itself, such as FIFO, priority queueing, or the round-robin scheduling of SFQ.

Filters

As already mentioned, filters classify packets into classes. But how exactly are packets classified? tc supports many types of classifiers, which make decisions based on different information related to a packet. One of the most commonly used is the u32 classifier, which decides based on fields in the packet (e.g. the source IP address). There is also the fw classifier, which decides based on how the firewall has marked the packet: you can use iptables to mark the target packets and then select them with the fw classifier. Others include the route, cgroup, and bpf classifiers, which are not discussed here for space reasons. Only the most common one, u32, is described below.
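The fw classifier idea can be sketched as follows. This is only an outline under assumptions: it assumes a root qdisc with handle 1: and the target class already exist, it marks locally generated TCP traffic in the mangle table, and the function name classify_by_mark and the mark value are made up for illustration.

```shell
# Hypothetical helper: mark packets with iptables, then classify by that mark.
# usage (as root): classify_by_mark DEV PORT MARK FLOWID
#   e.g. classify_by_mark eth0 36000 6 1:4
classify_by_mark() {
    # Set the firewall mark on locally generated TCP packets sent to PORT.
    iptables -t mangle -A OUTPUT -o "$1" -p tcp --dport "$2" \
        -j MARK --set-mark "$3" || return 1
    # For the fw classifier, the filter handle is the mark value to look for.
    tc filter add dev "$1" protocol ip parent 1:0 prio 1 \
        handle "$3" fw flowid "$4"
}
```

The design choice here is to keep the matching logic in iptables, which can be convenient when the firewall rules already identify the traffic of interest.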

Common parameters

Classifiers generally accept the following common parameters.

  • protocol: The protocol this classifier accepts; usually you will only accept IP traffic. Required.
  • parent: The handle this classifier is attached to. The handle must refer to an already existing qdisc or class. Required.
  • prio|pref: The priority of this classifier. Lower numbers are tried first.
  • handle: This handle means different things for different filters.

u32 classifier [3]

In its simplest form, a u32 filter is a set of selectors used to match packets; matched packets are classified into a specific class or trigger an action. The u32 classifier provides a variety of selectors, which fall roughly into two categories: special selectors and generic selectors.

Special selectors

The common ones are the ip selector and the tcp selector. Special selectors simplify the matching of some common fields in the packet header, for example:

tc filter add dev eth0 protocol ip parent 1:0 prio 10 u32 \
    match ip src 192.168.8.0/24 flowid 1:4

The above example matches packets with an ip source address on the 192.168.8.0/24 subnet.

tc filter add dev eth0 protocol ip parent 1:0 prio 10 u32 \
        match ip protocol 0x6 0xff \
        match tcp dport 53 0xffff \
        flowid 1:2

The above example matches packets with TCP protocol (0x6) and destination port 53.

Generic selectors

A special selector can always be rewritten as an equivalent generic selector. A generic selector can match almost any bit in the IP (or upper layer) header, but it is harder to write and read than a special selector. The syntax is as follows.

match [ u32 | u16 | u8 ] PATTERN MASK at [OFFSET | nexthdr+OFFSET]

Here u32|u16|u8 specifies the length of the pattern: 4 bytes, 2 bytes, and 1 byte respectively. PATTERN is the value the packet must contain, MASK tells the filter which bits to compare, and at gives the offset in the packet where matching starts.

Here is an example.

tc filter add dev eth0 protocol ip parent 1:0 pref 10 u32 \
    match u32 00100000 00ff0000 at 0 flowid 1:10

This selector matches packets whose second IP header byte is 0x10. at 0 means matching starts at the beginning of the header. The mask is 00ff0000, so only the second byte is compared, and the pattern is 00100000, i.e. the second byte must be 0x10.
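The pattern and mask comparison can be tried out with plain shell arithmetic. The sketch below is an illustration of the rule, not tc's actual code; the function name u32_match and the sample header words are made up. The second byte of the IP header is the ToS field.

```shell
# Hypothetical helper: does a 32-bit WORD match PATTERN under MASK?
# usage: u32_match WORD PATTERN MASK  (hexadecimal without 0x prefix, as in tc)
u32_match() {
    # Only the bits selected by the mask take part in the comparison.
    [ $(( 0x$1 & 0x$3 )) -eq $(( 0x$2 & 0x$3 )) ]
}

# First header word 45100054: version/IHL 0x45, ToS 0x10, total length 0x0054.
u32_match 45100054 00100000 00ff0000 && echo "match"
# The same word with ToS 0x00 is rejected.
u32_match 45000054 00100000 00ff0000 || echo "no match"
```

A mask of 00000000 compares no bits at all, which is why a selector such as match u32 0 0 matches every packet.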

Let’s look at another example.

tc filter add dev eth0 protocol ip parent 1:0 pref 10 u32 \
    match u32 00000016 0000ffff at nexthdr+0 flowid 1:10

The nexthdr option indicates the next header encapsulated in the IP packet, i.e. the header of the upper layer protocol. at nexthdr+0 indicates that the match starts from the first byte of the next header. Since the mask is 0000ffff, the match occurs in the third and fourth bytes of the header. In the TCP and UDP protocols these two bytes are the destination port of the packet. The numbers are given in big-endian format, so pattern 00000016 is converted to 22 in decimal, i.e. the selector will match packets with a destination port of 22.
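The conversion from a port number to a pattern is easy to get wrong by hand, so here is a small sketch that does it with printf. The function name dport_pattern is made up for illustration, and the result is only meant for use with the mask 0000ffff at nexthdr+0, as in the example above.

```shell
# Hypothetical helper: print the 8-digit hex pattern for a destination port.
dport_pattern() {
    printf '%08x\n' "$1"
}

dport_pattern 22       # 00000016
dport_pattern 36000    # 00008ca0
# Build the generic selector for destination port 36000.
echo "match u32 $(dport_pattern 36000) 0000ffff at nexthdr+0"
```

The last line prints match u32 00008ca0 0000ffff at nexthdr+0, which is the generic form of a destination port 36000 selector.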

Example Explanation

Okay, now we can go back to the original example and see what these commands really mean.

sudo tc qdisc add dev eth0 root handle 1: prio bands 4
sudo tc qdisc add dev eth0 parent 1:4 handle 40: netem loss 10% delay 40ms
sudo tc filter add dev eth0 protocol ip parent 1:0 prio 4 u32 match ip dst 192.168.190.7 match ip dport 36000 0xffff flowid 1:4

Let’s look at it line by line. The first line adds a root qdisc on device eth0 with handle 1:, qdisc type prio, and 4 bands.

prio is a classful qdisc. Its role is similar to that of the default qdisc pfifo_fast. pfifo_fast has three so-called bands, and traffic in different bands has different priorities. Within each band, FIFO rules apply.

By default, the prio qdisc creates 3 classes, each containing a plain FIFO qdisc, and classifies traffic into them according to the ToS bits. With bands 4, as in the example, there are 4 classes, 1:1 through 1:4. You can also use filters to classify the traffic, and you can attach other qdiscs to the classes to replace the default FIFOs.

Next, the second command. parent 1:4 attaches the new qdisc to class 1:4, handle 40: gives it the handle 40:, and netem selects the netem qdisc. loss 10% delay 40ms are netem parameters, meaning 10% packet loss and 40ms delay. netem[5] is a classless qdisc that provides network emulation and can simulate conditions such as delay, packet loss, packet duplication, and packet reordering.

The third command adds a filter. parent 1:0 attaches the filter to the root node. prio 4 is the priority of the filter; if there are several filters, they are tried in order of priority value. u32 selects the u32 classifier. match ip dst 192.168.190.7 matches packets with destination IP address 192.168.190.7, and match ip dport 36000 0xffff matches packets with destination port 36000. Multiple selectors are combined with a logical AND. flowid 1:4 means matched packets are classified into class 1:4.

So the final effect is that packets destined for 192.168.190.7 with destination port 36000 are classified into class 1:4, where they get a 40ms delay and a 10% packet loss rate. Other packets still behave as default, i.e. they are classified into class 1:1, 1:2, or 1:3 according to the ToS field and then sent in priority order.

The hierarchy of the example looks roughly as follows.

          1:     root qdisc (prio)
         / | \ \
       /   |  \  \
       /   |   \   \
     1:1  1:2  1:3  1:4      classes
      |    |    |    |
                     40:     qdiscs
   pfifo pfifo pfifo netem
band  0    1    2    3
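To check that the kernel built the tree shown above, and to undo it afterwards, something like the following can be used. The function names show_tc and reset_tc are made up for illustration; the tc subcommands themselves are standard.

```shell
# Hypothetical helpers; DEV is the interface name, e.g. eth0.
show_tc() {
    tc -s qdisc show dev "$1"     # qdiscs with statistics (sent, dropped)
    tc class show dev "$1"        # classes of the classful qdiscs
    tc filter show dev "$1"       # filters attached to the root qdisc
}

reset_tc() {
    # Deleting the root qdisc (needs root) also removes its classes,
    # child qdiscs and filters; the interface returns to its default qdisc.
    tc qdisc del dev "$1" root
}
```

Running show_tc eth0 before and after sending some test traffic is a simple way to see whether packets actually end up in the netem qdisc.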

Postscript

This article only introduces the basic concepts and simple usage of tc. The prio qdisc only classifies packets and does no shaping. You can also use more complex qdiscs that do shaping, such as CBQ or HTB, and add more layers. You can also attach an SFQ qdisc to the leaf nodes to achieve fairness between sessions. I believe that once you understand these basic concepts of TC, it is not difficult to use other qdiscs according to your needs.
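As a starting point for such experiments, here is a rough sketch of an HTB tree with SFQ leaves. It is an illustration under assumptions, not a tuned configuration: the rates, class ids, and the function name shape_htb are all made up, and the commands need root.

```shell
# Hypothetical helper: two HTB classes under one parent, each with an SFQ leaf.
# usage (as root): shape_htb DEV   e.g. shape_htb eth0
shape_htb() {
    # Root HTB qdisc; traffic that no filter classifies goes to class 1:20.
    tc qdisc add dev "$1" root handle 1: htb default 20 || return 1
    # Parent class caps the total rate; the children may borrow up to ceil.
    tc class add dev "$1" parent 1: classid 1:1 htb rate 100mbit || return 1
    tc class add dev "$1" parent 1:1 classid 1:10 htb rate 60mbit ceil 100mbit || return 1
    tc class add dev "$1" parent 1:1 classid 1:20 htb rate 40mbit ceil 100mbit || return 1
    # SFQ on each leaf class shares the class bandwidth fairly between flows.
    tc qdisc add dev "$1" parent 1:10 handle 10: sfq perturb 10 || return 1
    tc qdisc add dev "$1" parent 1:20 handle 20: sfq perturb 10
}
```

A u32 filter like the one in the example, with flowid 1:10 instead of 1:4, would then steer the selected traffic into the first class.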