The Linux networking stack is not lacking in features, and it performs well enough for most purposes. On high-speed networks, however, the overhead of traditional network programming takes up too large a share of the processing budget. In the previous article on syscall.Socket, we introduced the AF_PACKET socket type, whose performance is mediocre: all data has to be copied between user space and kernel space, and there are a lot of interrupts under high concurrency. eBPF XDP is a good fit for high-performance scenarios, and we introduced XDP in an earlier article. In Linux 4.18, Björn Töpel added a socket address family, AF_XDP, which allows high-performance network reads and writes through the socket interface on top of XDP.

In 2019, Intel’s Björn Töpel (the main implementer of the AF_XDP socket) gave a talk in which he presented a performance comparison between AF_XDP and ordinary AF_PACKET in three scenarios.

Performance comparison of AF_XDP and normal AF_PACKET in three scenarios

You can see that AF_XDP delivers much higher performance than AF_PACKET.

Introduction to AF_XDP Socket

AF_XDP is a socket address family built on XDP (eXpress Data Path). It provides a high-performance packet path between the NIC and user space and can support zero-copy data transfer and reception without per-packet interrupts. An AF_XDP socket is a socket type in the Linux kernel that uses this address family.

Compared with traditional sockets, AF_XDP socket has the following significant features:

  • Zero-copy transfer: When transferring data using an AF_XDP socket, data can be shared directly in memory without being copied between user space and kernel space (provided the NIC driver supports zero-copy mode), which reduces the number of memory copies and improves transfer efficiency.
  • Zero-interrupt reception: When receiving data using an AF_XDP socket, data can be received directly from the NIC without notifying the kernel through interrupts, which reduces the amount of interrupt processing and improves receive efficiency.
  • Support for multiple queues: AF_XDP sockets support multiple queues, so different network traffic can be routed to different queues for better load balancing and multi-core utilization.
  • Support for user-space protocol stacks: An AF_XDP socket can be combined with a protocol stack implemented in user space, which improves the performance and flexibility of network applications.

In summary, the AF_XDP socket is a data transfer mechanism for high-performance network applications that need to handle large amounts of data.

AF_XDP socket

We use the normal socket() system call to create an AF_XDP socket (XSK). Each XSK has two rings: RX RING and TX RING. The socket receives packets on the RX RING and sends packets on the TX RING. These rings are registered and sized via the XDP_RX_RING and XDP_TX_RING options of setsockopt(), respectively. Each socket must have at least one of these rings. An RX or TX descriptor ring points to a data buffer in a memory area called UMEM. RX and TX can share the same UMEM, so there is no need to copy packets between RX and TX.

UMEM also has two rings: FILL RING and COMPLETION RING. The application uses the FILL RING to hand the kernel addrs (each addr refers to a chunk in the UMEM) that the kernel can populate with RX packet data. Whenever a packet is received, references to these chunks appear in the RX RING. The COMPLETION RING, on the other hand, contains the addresses of chunks that the kernel has fully transmitted and that user space can use again for either TX or RX.
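
To make the relationship between an addr and a UMEM chunk concrete, here is a small user-space model. It is only a sketch: the umem type, the 2048-byte chunk size, and the helper names are assumptions made for illustration, and an ordinary byte slice stands in for the memory that would really be shared with the kernel.

```go
package main

import "fmt"

const (
	frameSize = 2048 // assumed chunk size in bytes
	numFrames = 4    // assumed number of chunks in the UMEM
)

// umem models the UMEM as one contiguous byte slice split into chunks.
type umem struct {
	mem []byte
}

func newUMEM() *umem {
	return &umem{mem: make([]byte, frameSize*numFrames)}
}

// fillAddrs returns the start address of every chunk. These are the values
// an application would post on the FILL RING.
func (u *umem) fillAddrs() []uint64 {
	addrs := make([]uint64, 0, numFrames)
	for i := 0; i < numFrames; i++ {
		addrs = append(addrs, uint64(i*frameSize))
	}
	return addrs
}

// frame returns the whole chunk that contains addr.
func (u *umem) frame(addr uint64) []byte {
	base := addr - addr%frameSize
	return u.mem[base : base+frameSize]
}

func main() {
	u := newUMEM()
	addrs := u.fillAddrs()
	fmt.Println("addresses for FILL RING:", addrs)

	// Writing through the chunk is visible in the underlying memory.
	copy(u.frame(addrs[1]), "hello")
	fmt.Printf("%q\n", u.mem[frameSize:frameSize+5])
}
```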

As you can see, there are four rings. The entries in the RX RING and TX RING are descriptors (xdp_desc), while the entries in the FILL RING and COMPLETION RING are addresses (u64).

Four rings

  1. Rx Ring: The Receive Ring is filled on the kernel side and stores the receive descriptors of received data frames, which are handed to the user-space program for processing. With multiple queues, each queue has its own Rx Ring. The producer of the Rx Ring is the XDP program and the consumer is the user-space program. The XDP program consumes the Fill Ring, obtains a desc that can carry a packet, and copies the packet to the address specified in the desc. It then puts the desc on the Rx Ring and notifies the user-space program, through the socket I/O mechanism, to receive the packet from the Rx Ring.
  2. Fill Ring: The Fill Ring is the ring through which the user-space program supplies new descriptors for reception, so that there are always enough buffers available for incoming packets. With multiple queues, each queue has its own Fill Ring. The producer of the Fill Ring is the user-space program and the consumer is the XDP program in kernel space. The user-space program passes the UMEM frames that can carry packets to the kernel through the Fill Ring. The kernel then consumes the desc in the Fill Ring and copies the packet to the address specified in the desc (which is the address of a UMEM frame).
  3. Tx Ring: The Transmit Ring is filled by the user-space program and stores the descriptors of the data frames to be sent. With multiple queues, each queue has its own Tx Ring. The producer of the Tx Ring is the user-space program and the consumer is the XDP program. The user-space program copies the packet to be sent to the address specified by a desc in the Tx Ring. The XDP program then consumes the desc in the Tx Ring, sends the packet, and reports the desc of each successfully sent packet to the user-space program via the Completion Ring.
  4. Completion Ring: The Completion Ring is used to return the descriptors of data frames that have already been processed. With multiple queues, each queue has its own Completion Ring. The producer of the Completion Ring is the XDP program and the consumer is the user-space program.

When the kernel finishes sending packets, it notifies the user-space program via the Completion Ring which packets have been sent successfully. The user-space program then consumes the descs in the Completion Ring (simply advancing the consumer index acts as the acknowledgement).

With these four rings working together, AF_XDP enables high-performance network data transfer and makes it possible to implement a network protocol stack in user space. The user-space program supplies buffers for incoming data through the Fill Ring, reads received packets from the Rx Ring, and sends the processed data out through the Tx Ring. It then fetches the completed descriptors from the Completion Ring and reuses them. Together these rings enable efficient data processing and network load balancing, improving the performance and throughput of network applications.
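
The interplay of the four rings can be modeled entirely in user space. The following is a toy simulation, not kernel code: the ring, desc, and xsk types and the kernelReceive and kernelTransmit functions are hypothetical names that play the kernel's part, and the rings are plain slices rather than shared memory. It requires Go 1.18 or later for generics.

```go
package main

import "fmt"

const frameSize = 2048 // assumed UMEM chunk size

// desc is a simplified RX/TX ring entry.
type desc struct {
	Addr uint64
	Len  uint32
}

// ring is a toy single-producer/single-consumer queue with free-running
// producer and consumer indices, similar in spirit to the real rings.
type ring[T any] struct {
	buf        []T
	prod, cons uint32
}

func newRing[T any](n int) *ring[T] { return &ring[T]{buf: make([]T, n)} }

func (r *ring[T]) push(v T) bool {
	if r.prod-r.cons == uint32(len(r.buf)) {
		return false // ring is full
	}
	r.buf[r.prod%uint32(len(r.buf))] = v
	r.prod++
	return true
}

func (r *ring[T]) pop() (T, bool) {
	var zero T
	if r.prod == r.cons {
		return zero, false // ring is empty
	}
	v := r.buf[r.cons%uint32(len(r.buf))]
	r.cons++ // advancing the consumer index releases the slot
	return v, true
}

type xsk struct {
	umem       []byte
	fill, comp *ring[uint64]
	rx, tx     *ring[desc]
}

func newXSK(frames int) *xsk {
	return &xsk{
		umem: make([]byte, frames*frameSize),
		fill: newRing[uint64](frames), comp: newRing[uint64](frames),
		rx: newRing[desc](frames), tx: newRing[desc](frames),
	}
}

// kernelReceive plays the kernel on RX: take a free chunk from the Fill Ring,
// copy the packet into it, and publish a descriptor on the Rx Ring.
func (s *xsk) kernelReceive(pkt []byte) bool {
	addr, ok := s.fill.pop()
	if !ok {
		return false // no buffer available: the packet is dropped
	}
	n := copy(s.umem[addr:addr+frameSize], pkt)
	return s.rx.push(desc{Addr: addr, Len: uint32(n)})
}

// kernelTransmit plays the kernel on TX: consume a Tx Ring descriptor,
// "send" the bytes, and return the chunk address on the Completion Ring.
func (s *xsk) kernelTransmit() ([]byte, bool) {
	d, ok := s.tx.pop()
	if !ok {
		return nil, false
	}
	sent := append([]byte(nil), s.umem[d.Addr:d.Addr+uint64(d.Len)]...)
	s.comp.push(d.Addr)
	return sent, true
}

func main() {
	s := newXSK(4)
	for i := uint64(0); i < 4; i++ {
		s.fill.push(i * frameSize) // user space hands all chunks to the kernel
	}
	s.kernelReceive([]byte("ping"))

	d, _ := s.rx.pop()            // user space consumes the Rx Ring
	copy(s.umem[d.Addr:], "pong") // modifies the packet in place
	s.tx.push(d)                  // and queues the same chunk for sending

	sent, _ := s.kernelTransmit()
	addr, _ := s.comp.pop() // the chunk is free again
	s.fill.push(addr)       // recycle it for reception
	fmt.Printf("sent %q, chunk %d recycled\n", sent, addr)
}
```

Note that the packet is never copied between RX and TX: the same chunk address travels from the Rx Ring to the Tx Ring and back to the Fill Ring.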

AF_XDP Socket is used in high performance network application scenarios, including DDoS attack defense, network traffic monitoring, load balancing, etc. In these application scenarios, AF_XDP can improve the performance and security of network applications by processing large amounts of network traffic data in real time, quickly identifying malicious traffic and load balancing.

Go AF_XDP Practice

The AF_XDP socket is at least an order of magnitude more complex than the traditional AF_PACKET socket, and that complexity makes it error-prone. Fortunately, there is a third-party library, asavie/xdp, that encapsulates it and makes it easier to use.

It encapsulates XSK and provides some very convenient methods for reading and sending data.

We introduce its features with two examples.

Example of sending

The following is an example of a DNS query that is constantly being sent.

package main

import (
    "encoding/hex"
    "flag"
    "fmt"
    "math"
    "net"
    "time"

    "github.com/asavie/xdp"
    "github.com/google/gopacket"
    "github.com/google/gopacket/layers"
    "github.com/miekg/dns"
    "github.com/vishvananda/netlink"
)

// ...
var (
    NIC        string
    QueueID    int
    SrcMAC     string
    DstMAC     string
    SrcIP      string
    DstIP      string
    DomainName string
)
func main() {
    flag.StringVar(&NIC, "interface", "enp3s0", "Network interface to attach to.")
    flag.IntVar(&QueueID, "queue", 0, "The queue on the network interface to attach to.")
    flag.StringVar(&SrcMAC, "srcmac", "b2968175b211", "Source MAC address to use in sent frames.")
    flag.StringVar(&DstMAC, "dstmac", "ffffffffffff", "Destination MAC address to use in sent frames.")
    flag.StringVar(&SrcIP, "srcip", "192.168.111.1", "Source IP address to use in sent frames.")
    flag.StringVar(&DstIP, "dstip", "192.168.111.10", "Destination IP address to use in sent frames.")
    flag.StringVar(&DomainName, "domain", "asavie.com", "Domain name to use in the DNS query.")
    flag.Parse()
    // Initialize the XDP socket.
    link, err := netlink.LinkByName(NIC)
    if err != nil {
        panic(err)
    }
    xsk, err := xdp.NewSocket(link.Attrs().Index, QueueID, nil)
    if err != nil {
        panic(err)
    }
    //-----------------
    // Generate DNS lookup requests
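    // The MAC addresses are given as 12 hex digits without separators.
    srcMAC, err := hex.DecodeString(SrcMAC)
    if err != nil {
        panic(err)
    }
    dstMAC, err := hex.DecodeString(DstMAC)
    if err != nil {
        panic(err)
    }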
    eth := &layers.Ethernet{
        SrcMAC:       net.HardwareAddr(srcMAC),
        DstMAC:       net.HardwareAddr(dstMAC),
        EthernetType: layers.EthernetTypeIPv4,
    }
    ip := &layers.IPv4{
        Version:  4,
        IHL:      5,
        TTL:      64,
        Id:       0,
        Protocol: layers.IPProtocolUDP,
        SrcIP:    net.ParseIP(SrcIP).To4(),
        DstIP:    net.ParseIP(DstIP).To4(),
    }
    udp := &layers.UDP{
        SrcPort: 1234,
        DstPort: 53,
    }
    udp.SetNetworkLayerForChecksum(ip)
    query := new(dns.Msg)
    query.SetQuestion(dns.Fqdn(DomainName), dns.TypeA)
    payload, err := query.Pack()
    if err != nil {
        panic(err)
    }
    buf := gopacket.NewSerializeBuffer()
    opts := gopacket.SerializeOptions{
        FixLengths:       true,
        ComputeChecksums: true,
    }
    err = gopacket.SerializeLayers(buf, opts, eth, ip, udp, gopacket.Payload(payload))
    if err != nil {
        panic(err)
    }
    frameLen := len(buf.Bytes())
    //-----------------
    // Populate all frames in UMEM, using pre-generated DNS queries
    descs := xsk.GetDescs(math.MaxInt32, false)
    for i := range descs {
        frameLen = copy(xsk.GetFrame(descs[i]), buf.Bytes())
    }
    fmt.Printf("sending DNS queries from %v (%v) to %v (%v) for domain name %s...\n", ip.SrcIP, eth.SrcMAC, ip.DstIP, eth.DstMAC, DomainName)
    // Output statistics sent per second
    go func() {
        var err error
        var prev xdp.Stats
        var cur xdp.Stats
        var numPkts uint64
        for i := uint64(0); ; i++ {
            time.Sleep(time.Duration(1) * time.Second)
            cur, err = xsk.Stats()
            if err != nil {
                panic(err)
            }
            numPkts = cur.Completed - prev.Completed
            fmt.Printf("%d packets/s (%d bytes/s)\n", numPkts, numPkts*uint64(frameLen))
            prev = cur
        }
    }()
    // Endless sending of query data
    for {
        descs := xsk.GetDescs(xsk.NumFreeTxSlots(), false)
        for i := range descs {
            descs[i].Len = uint32(frameLen)
        }
        xsk.Transmit(descs)
        _, _, err = xsk.Poll(1)
        if err != nil {
            panic(err)
        }
    }
}
  1. First it creates an XSK for the NIC. The initialization of this XSK hides a lot of low-level setup, which is one of the things this library does well.
  2. It generates a specific DNS query packet, which is used later as the data sent to the network.
  3. It gets all available Descs and initializes their frames with the DNS request data.
  4. It starts a goroutine that prints the number of packets and bytes sent every second, so that we can see how it performs.
  5. In an infinite loop, it first gets the Descs that can be sent, then calls Transmit to write them to the Tx ring.
  6. It then calls Poll to wait for the kernel to send or receive data before sending the next batch.

Thanks to the encapsulation provided by the xdp library, many troublesome details such as mmap creation, socket option setting, and ring operations are hidden behind an easy-to-use interface.

Next, let’s look at an example of simultaneous reading and writing.

Example of broadcasting

The following example receives all packets and changes the destination MAC address to the broadcast address before sending them out again.

package main

import (
    "os"
    "os/signal"
    "syscall"

    "github.com/asavie/xdp"
    "github.com/vishvananda/netlink"
)

func main() {
    const LinkName = "enp6s0"
    const QueueID = 0
    link, err := netlink.LinkByName(LinkName)
    if err != nil {
        panic(err)
    }
    // Create an XDP program 1
    program, err := xdp.NewProgram(QueueID + 1)
    if err != nil {
        panic(err)
    }
    if err := program.Attach(link.Attrs().Index); err != nil {
        panic(err)
    }
    // Initialize an XDP socket
    xsk, err := xdp.NewSocket(link.Attrs().Index, QueueID, nil)
    if err != nil {
        panic(err)
    }
    // Register this xsk in the XDP program
    if err := program.Register(QueueID, xsk.FD()); err != nil {
        panic(err)
    }
    // Remove this XDP BPF program when exiting
    // signal.Notify does not block when sending, so the channel must be buffered.
    c := make(chan os.Signal, 1)
    signal.Notify(c, os.Interrupt, syscall.SIGTERM)
    go func() {
        <-c
        program.Detach(link.Attrs().Index)
        os.Exit(1)
    }()
    // Start working
    for {
        // Fill, then wait for the kernel to write the received packets to the Rx ring
        xsk.Fill(xsk.GetDescs(xsk.NumFreeFillSlots(), true))
        numRx, _, err := xsk.Poll(-1) // Waiting to receive
        if err != nil {
            panic(err)
        }
        rxDescs := xsk.Receive(numRx) // Data received
        for i := range rxDescs {
            // Change the destination MAC address (the first 6 bytes of the
            // frame) to the broadcast address, i.e. ff:ff:ff:ff:ff:ff
            frame := xsk.GetFrame(rxDescs[i])
            for j := 0; j < 6; j++ {
                frame[j] = byte(0xff)
            }
        }
        xsk.Transmit(rxDescs) // Sending out the modified data
    }
}

As the comments in the code show, the main loop does the following:

  1. Fill the Fill ring first
  2. Call Poll to wait for incoming data
  3. Call Receive to read the received data
  4. Modify the MAC address in the data
  5. Send it out again

If you test this program yourself, you’d better do it on a dedicated test network. Otherwise the rebroadcast traffic may bring your network down.

Ref

  • https://colobu.com/2023/04/17/use-af-xdp-socket/