An analysis of how Linux receives network frames

This article will introduce how the Linux kernel receives network frames from a beginner’s perspective: starting with the reception of the data frame by the NIC device and ending with the frame being passed to the third layer of the network stack. This article will focus on the working mechanism of the kernel and will not go into too many driver-level details. The sample code is taken from Linux 2.6.

Device notification mechanisms

After a network device receives data and stores it in the device’s receive frame buffer (which may be located in the device’s memory or in the receive ring written to the host memory via DMA), it must notify the kernel to process the received data.

Polling

Polling means that the kernel actively checks the device, for example by periodically reading a device register to determine whether there are new incoming frames to be processed. Checking too rarely delays frames when the device is busy, while checking often wastes system resources when the device is idle, so polling is rarely used on its own by the operating system; it gives better results when combined with other mechanisms.

Hardware interrupts

A hardware interrupt signal is generated by the device when an event occurs, such as the arrival of a new data frame. The device usually sends the signal to the interrupt controller, which forwards it to the CPU. The CPU then suspends its current task and executes the interrupt handler that the device driver registered for device events. The interrupt handler adds the data frame to the kernel’s input queue and notifies the kernel for further processing. This technique performs well at low loads, because each data frame is handled promptly, but at higher loads the CPU is interrupted so often that the execution of other tasks suffers.

The processing of a received frame is usually divided into two parts: first the driver-registered interrupt handler copies the frame to the kernel-accessible input queue, and then the kernel processes it, usually by passing it to the handler of the relevant protocol, e.g., IPv4. The first part runs in interrupt context and can preempt the second part, which means that copying received frames to the input queue has a higher priority than the protocol stack code that consumes them.

The consequences under high load are obvious: the input queue eventually fills up, but the code that should dequeue and process these frames runs at a lower priority and never gets a chance to execute. As a result, new frames cannot be added because the input queue is full, and old frames are not processed because no CPU time is available. This situation is called receive livelock.
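The effect can be illustrated with a small user-space model. This is only a sketch under assumed numbers: the queue capacity, the per-tick CPU budget and the per-frame costs are hypothetical values chosen for illustration, and simulate_delivered is not a kernel function.

```c
/* Hypothetical model parameters, not kernel values. */
#define QUEUE_CAP     8   /* capacity of the input queue, in frames        */
#define CPU_PER_TICK  10  /* work units the CPU can spend in one tick      */
#define IRQ_COST      1   /* units the interrupt handler spends per frame  */
#define PROTO_COST    2   /* units the protocol stack spends per frame     */

/* Simulate `ticks` ticks with `arrivals` frames per tick and return how
 * many frames the protocol stack managed to consume. */
int simulate_delivered(int arrivals, int ticks)
{
    int queued = 0, delivered = 0;

    for (int t = 0; t < ticks; t++) {
        int cpu = CPU_PER_TICK;

        /* Interrupt context runs first and can starve everything else. */
        for (int f = 0; f < arrivals && cpu >= IRQ_COST; f++) {
            cpu -= IRQ_COST;
            if (queued < QUEUE_CAP)
                queued++;       /* frame copied into the input queue */
            /* else: queue full, frame dropped after paying IRQ_COST */
        }

        /* Protocol processing only gets whatever CPU time is left. */
        while (queued > 0 && cpu >= PROTO_COST) {
            cpu -= PROTO_COST;
            queued--;
            delivered++;
        }
    }
    return delivered;
}
```

With these assumed costs, 2 frames per tick are all delivered, while 10 frames per tick consume the whole budget in interrupt context and nothing is delivered, even though the queue is full.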

Hardware interrupts have the advantage of very low latency between frame reception and processing, but can severely disrupt the execution of other kernel or user programs under high load. Most network drivers will use some optimized version of hardware interrupts.

Handling multiple frames at once

Some device drivers use a modified approach in which the interrupt handler, once it is running, keeps queuing data frames until a time window or a frame count limit is reached. Since other interrupts are disabled while the interrupt handler executes, a reasonable limit must be set so that CPU resources are shared with other tasks.

This approach can be further optimized by having the device notify the kernel of pending receive frames only via hardware interrupts, leaving the queuing and processing of the receive frames to other kernel handlers. This is also how Linux’s new interface NAPI works.

Timed interrupts

In addition to generating interrupts immediately based on events, the device can also send interrupts at fixed intervals when there are received frames. The interrupt handler will check if there are new frames during this interval and process them all at once. If all received frames have been processed and there are no new frames, the device will stop sending interrupts.

This approach requires the device to implement timing at the hardware level and imposes a fixed processing delay that depends on the timing interval, but it is effective in reducing CPU usage and avoiding receive livelock at high loads.

Combinations in practice

Each notification mechanism has scenarios that suit it: pure interrupt models guarantee very low latency at low loads but perform poorly at high loads; timed interrupts may introduce excessive latency and waste CPU time at low loads, but at high loads they are very helpful in reducing CPU usage and avoiding receive livelock. In practice, network devices often do not rely on a single model but combine several of them.

Take as an example vortex_interrupt (located in /drivers/net/3c59x.c), the interrupt handling function registered by the Vortex driver in Linux 2.6.

The device maps multiple events onto a single interrupt type (it can even wait for a while before sending an interrupt signal, aggregating several interrupts into one signal). The interrupt triggers the execution of vortex_interrupt, which runs with interrupts disabled on that CPU.
If the interrupt was triggered by the receive frame event RxComplete, the handler calls other code to process the frames received by the device.
While it executes, vortex_interrupt keeps reading the device register to check whether a new interrupt has been signaled. If one has and the event is RxComplete, the handler continues processing received frames until the number of processed frames reaches the preset work_done value. Other types of interrupts will be ignored by the handler.
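The bounded loop described above can be sketched in user space as follows. This is a simplified model, not the 3c59x code: struct fake_nic, the EV_* bits, read_status and fake_interrupt are all hypothetical names, and the work limit is passed in as a parameter.

```c
#include <stdbool.h>

/* Hypothetical status bits and device model; not the 3c59x definitions. */
#define EV_RX_COMPLETE 0x1u
#define EV_OTHER       0x2u

struct fake_nic {
    unsigned int status;   /* pending event bits, as read from a register */
    int rx_pending;        /* frames waiting in the device                */
    int rx_handled;        /* frames the handler has processed            */
};

/* Re-read the "register": RxComplete stays set while frames remain. */
static unsigned int read_status(struct fake_nic *nic)
{
    if (nic->rx_pending > 0)
        nic->status |= EV_RX_COMPLETE;
    else
        nic->status &= ~EV_RX_COMPLETE;
    return nic->status;
}

/* Handle receive events until none are pending or max_work is used up.
 * Returns true if the handler stopped because it ran out of work budget. */
bool fake_interrupt(struct fake_nic *nic, int max_work)
{
    int work_left = max_work;

    while (read_status(nic) & EV_RX_COMPLETE) {
        if (work_left-- <= 0)
            return true;        /* give the CPU back; frames remain */
        nic->rx_pending--;      /* stand-in for real frame processing */
        nic->rx_handled++;
    }
    return false;               /* device drained within the budget */
}
```

Capping the loop keeps a busy device from holding the CPU in interrupt context indefinitely; events other than EV_RX_COMPLETE are left untouched in the status word, mirroring the handler's behavior of ignoring them.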

Soft interrupt handling mechanism

An interrupt usually triggers the following events.

The device generates an interrupt and notifies the kernel via hardware.
If the kernel is not processing another interrupt (i.e., the interrupt is not disabled), it will receive this notification.
The kernel disables interrupts on the local CPU and executes the handler associated with the type of interrupt received.
The kernel exits the interrupt handler and re-enables the local CPU’s interrupts.

When the CPU receives an interrupt notification, it calls the handler corresponding to that interrupt number. While the handler executes, the kernel code is in interrupt context and interrupts are disabled. This means that while the CPU is handling an interrupt, it neither handles other interrupts nor can be preempted by other processes; the CPU is occupied exclusively by that interrupt handler. This design decision reduces the possibility of race conditions, but also has a potential performance impact.

Obviously, interrupt handlers should do their job as fast as possible. Different interrupt events require different amounts of processing work. For example, when a key is pressed on the keyboard, the interrupt handler only needs to record the key code, and this event does not occur very often; when a network device receives a new data frame, the handler needs to allocate memory for an skb, copy the received data, and complete some initialization work such as determining the network protocol to which the data belongs.

For this reason the operating system divides interrupt handlers into a top half and a bottom half.

Bottom half of the handler

Even though the work triggered by an interrupt may require a lot of CPU time, most of it can usually wait. Interrupts are allowed to preempt the CPU in the first place because the hardware may lose data if the OS makes it wait too long. This applies both to real-time data and to data stored in fixed-size buffers. If the hardware loses data, there is generally no way to recover it (not considering sender retransmissions). On the other hand, there is generally nothing to lose when a process in kernel or user space is delayed or preempted (except on systems with extremely high real-time requirements, which need to handle processes and interrupts in a completely different way).

Given these considerations, modern interrupt handlers are divided into a top half and a bottom half. The top half performs work that must be done before the CPU can be released, such as saving received data, while the bottom half performs work that can be deferred, such as the further processing of received data.

You can think of the bottom half as a function that can be executed asynchronously. When an interrupt is triggered, some of the work does not have to be done immediately, and we can package this work as a bottom half handler to be executed later. The top half and bottom half model effectively reduces the time the CPU spends in interrupt context (i.e., with interrupts disabled).

The device signals the interrupt to the CPU, notifying it of a specific event.
The CPU executes the top half of the interrupt-related handler function, disabling subsequent interrupt notifications until the handler finishes its work:
a. Store some data in memory for further processing of the interrupt event by the kernel at a later time.
b. Set a flag bit to ensure that the kernel is aware of pending interrupts.
c. Re-enable interrupt notification for the local CPU before termination.
At a later point in time, when the kernel has no more urgent tasks to handle, it checks the flag bit set by the top half of the handler and calls the associated bottom half of the handler. After the call it resets this flag bit and moves on to the next round of processing.
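The sequence above can be sketched as a minimal single-threaded model. All names here (top_half, run_bottom_half_if_pending, saved, work_pending) are hypothetical, and summing the saved values merely stands in for the deferred work; real kernels also need to cope with interrupts arriving while the bottom half runs, which this sketch ignores.

```c
#include <stdbool.h>
#include <stddef.h>

#define BUF_SLOTS 16            /* hypothetical size of the deferred-work buffer */

static int    saved[BUF_SLOTS]; /* data stored by the top half      */
static size_t saved_count;
static bool   work_pending;     /* flag bit set by the top half     */
static long   processed_sum;    /* result produced by the bottom half */

/* Top half: do only what cannot wait, then flag the deferred work. */
void top_half(int data)
{
    if (saved_count < BUF_SLOTS)
        saved[saved_count++] = data;   /* a. store data for later      */
    work_pending = true;               /* b. mark the work as pending  */
    /* c. a real handler would re-enable interrupts before returning  */
}

/* Called later, when nothing more urgent is running.
 * Returns the number of saved items that were processed. */
size_t run_bottom_half_if_pending(void)
{
    size_t n;

    if (!work_pending)
        return 0;

    for (n = 0; n < saved_count; n++)
        processed_sum += saved[n];     /* stand-in for the deferred work */
    saved_count = 0;
    work_pending = false;              /* reset the flag after the call  */
    return n;
}
```

Calling top_half twice and then run_bottom_half_if_pending once processes both saved items in a single deferred pass; a second call finds the flag clear and does nothing.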

Linux implements several different mechanisms for bottom half processing: soft interrupts (softirqs), tasklets and work queues, which are also used for deferred tasks elsewhere in the operating system. Bottom half processing mechanisms usually have the following features in common.

Defining different types and associating each type with a specific processing task.
Scheduling the execution of processing tasks.
Notifying the kernel that there are scheduled tasks that need to be executed.

The next section focuses on the soft interrupt mechanism used to process network data frames.

Soft interrupts

There are several common types of soft interrupts, as follows.

enum
{
    HI_SOFTIRQ=0,
    TIMER_SOFTIRQ,
    NET_TX_SOFTIRQ,
    NET_RX_SOFTIRQ,
    BLOCK_SOFTIRQ,
    IRQ_POLL_SOFTIRQ,
    TASKLET_SOFTIRQ,
};

where NET_TX_SOFTIRQ and NET_RX_SOFTIRQ are used to handle the transmission and reception of network data, respectively.

Scheduling and execution timing

Each time a network device receives a frame, it sends a hardware interrupt to notify the kernel to call the interrupt handler, which triggers the scheduling of soft interrupts on the local CPU with the following functions.

__raise_softirq_irqoff: sets the bit corresponding to the soft interrupt type in a dedicated bitmap; the handler associated with the soft interrupt is called when a later check of the bitmap finds that bit set. Each CPU has its own bitmap.
raise_softirq_irqoff: wraps __raise_softirq_irqoff. If it is not called from interrupt or soft interrupt context, it additionally wakes up the ksoftirqd thread.
raise_softirq: wraps raise_softirq_irqoff, disabling interrupts on the local CPU around the call.
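As a rough illustration of the first two functions, the per-CPU bitmap can be modeled in user space. This is a sketch only: the sketch_* names, the CPU count and the in_interrupt parameter are assumptions made for the example, and counting wakeups stands in for actually waking the ksoftirqd thread.

```c
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS_SKETCH 4   /* hypothetical CPU count */

enum { SK_HI, SK_TIMER, SK_NET_TX, SK_NET_RX, SK_NR };

static uint32_t sk_pending_bits[NR_CPUS_SKETCH];   /* one bitmap per CPU */
static int      ksoftirqd_wakeups[NR_CPUS_SKETCH];

/* Analogue of __raise_softirq_irqoff: only set the bit. */
void sketch_raise_irqoff_raw(int cpu, unsigned int nr)
{
    sk_pending_bits[cpu] |= UINT32_C(1) << nr;
}

/* Analogue of raise_softirq_irqoff: set the bit and, when the caller is
 * not in interrupt context, record a wakeup of the per-CPU thread. */
void sketch_raise_irqoff(int cpu, unsigned int nr, bool in_interrupt)
{
    sketch_raise_irqoff_raw(cpu, nr);
    if (!in_interrupt)
        ksoftirqd_wakeups[cpu]++;
}

/* Check whether a softirq type is pending on a CPU. */
bool sketch_is_pending(int cpu, unsigned int nr)
{
    return (sk_pending_bits[cpu] >> nr) & 1u;
}
```

Because the state is a single bit per type, raising the same type twice before it runs leaves exactly one pending request, and a request raised on one CPU is invisible to the others.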

At a certain point, the kernel checks the bitmap unique to each CPU to determine if there are any scheduled soft interrupts waiting to be executed, and if so, do_softirq will be called to handle the soft interrupts. The kernel handles soft interrupts at the following times.

do_IRQ

Whenever the kernel receives an IRQ notification for a hardware interrupt, it calls do_IRQ to execute the handler for the interrupt. The interrupt handler may schedule new soft interrupts, so handling soft interrupts at the end of do_IRQ is a natural design that keeps latency low. In addition, because the kernel’s timer interrupt fires periodically, it bounds the maximum interval between two rounds of soft interrupt processing.

Most architectures call do_softirq from irq_exit(), the step that leaves the interrupt context.

unsigned int __irq_entry do_IRQ(struct pt_regs *regs)
{
    ......
    exit_idle();
    irq_enter();

    // handle irq with registered handler

    irq_exit();

    set_irq_regs(old_regs);
    return 1;
}

In irq_exit(), invoke_softirq() is called if the kernel has exited the interrupt context and there is a pending soft interrupt.

void irq_exit(void)
{
    account_system_vtime(current);
    trace_hardirq_exit();
    sub_preempt_count(IRQ_EXIT_OFFSET);
    if (!in_interrupt() && local_softirq_pending())
        invoke_softirq();

    rcu_irq_exit();

    preempt_enable_no_resched();
}

invoke_softirq is a simple wrapper around do_softirq.

static inline void invoke_softirq(void)
{
    if (!force_irqthreads)
        do_softirq();
    else
        wakeup_softirqd();
}

Besides do_IRQ, pending soft interrupts are also handled at the following times.

When returning from interrupts and exceptions (including system calls). This part of the processing logic is written directly in assembly code.
When local_bh_enable is called to re-enable soft interrupts, any pending soft interrupts are executed.
When the per-processor soft interrupt thread, ksoftirqd_CPUn, is scheduled to run.

CPU interrupts are enabled while soft interrupts execute, so a soft interrupt handler can be suspended by a new hardware interrupt. However, if a soft interrupt of a given type is already running or pending on a CPU, the kernel does not start another instance of that type on the same CPU, which significantly reduces the locking that soft interrupt handlers need.

Handling soft interrupts: do_softirq

When the time to execute a soft interrupt is reached, the kernel executes the do_softirq function.

do_softirq first saves a copy of the soft interrupts to be executed. The copy is needed because the same soft interrupt type may be scheduled again while do_softirq is running: a soft interrupt handler can be preempted by a hardware interrupt, and that interrupt’s handler may set bits in the CPU’s pending soft interrupt bitmap. For this reason, do_softirq disables interrupts, saves a copy of the pending bitmap in the local variable pending, resets the local CPU’s soft interrupt bitmap to 0, and then re-enables interrupts. Finally it checks each bit of pending in turn and, for every bit that is 1, calls the handler for the corresponding soft interrupt type.

do {
    if (pending & 1) {
        unsigned int vec_nr = h - softirq_vec;
        int prev_count = preempt_count();

        kstat_incr_softirqs_this_cpu(vec_nr);

        trace_softirq_entry(vec_nr);
        h->action(h);
        trace_softirq_exit(vec_nr);
        if (unlikely(prev_count != preempt_count())) {
            printk(KERN_ERR "huh, entered softirq %u %s %p"
                    "with preempt_count %08x,"
                    " exited with %08x?\n", vec_nr,
                    softirq_to_name[vec_nr], h->action,
                    prev_count, preempt_count());
            preempt_count() = prev_count;
        }

        rcu_bh_qs(cpu);
    }
    h++;
    pending >>= 1;
} while (pending);

The order of pending soft interrupt calls depends on the position of the flag bits in the bitmap and the direction in which they are scanned (from low to high), and is not executed on a first-in-first-out basis.
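A small sketch makes the ordering concrete. The function sketch_run_order and the constant SK_NR_SOFTIRQS are hypothetical; the loop mirrors the low-to-high scan of the excerpt above but only records which bit numbers it would run.

```c
#include <stddef.h>
#include <stdint.h>

#define SK_NR_SOFTIRQS 8   /* hypothetical number of softirq types */

/* Scan a snapshot of the pending bitmap from bit 0 upward and record the
 * softirq numbers in the order their handlers would be called.
 * Returns how many entries were written to `order`. */
size_t sketch_run_order(uint32_t pending, unsigned int order[SK_NR_SOFTIRQS])
{
    size_t n = 0;
    unsigned int nr = 0;

    while (pending && nr < SK_NR_SOFTIRQS) {
        if (pending & 1u)
            order[n++] = nr;    /* the handler for softirq nr runs here */
        nr++;
        pending >>= 1;
    }
    return n;
}
```

If softirq 3 was raised before softirq 1, the recorded order is still 1 and then 3: the bitmap keeps no record of arrival order, only of which types are pending.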

When all handlers have been executed, do_softirq disables interrupts again and re-checks the CPU’s pending soft interrupt bitmap. If new pending soft interrupts are found, it takes another copy and repeats the process above. This is repeated at most MAX_SOFTIRQ_RESTART times (usually 10) so that soft interrupts cannot hog the CPU indefinitely.

When the number of processing rounds reaches the MAX_SOFTIRQ_RESTART threshold, do_softirq must stop, and if there are still unexecuted soft interrupts, the ksoftirqd thread is woken up to handle them. However, do_softirq is called so frequently in the kernel that later calls to do_softirq may actually finish processing these soft interrupts before the ksoftirqd thread is scheduled.
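The restart limit can be sketched with a simplified model. Everything here is hypothetical scaffolding: sk_rearm simulates handlers that raise a soft interrupt again while running, and a boolean records the point at which the real code would wake ksoftirqd. Interrupt disabling is omitted because the model is single-threaded.

```c
#include <stdbool.h>
#include <stdint.h>

#define SK_MAX_RESTART 10     /* mirrors the usual MAX_SOFTIRQ_RESTART */

static uint32_t sk_pending;           /* pending bitmap of the "local CPU"   */
static int      sk_rearm;             /* times handlers will re-raise work   */
static int      sk_rounds;            /* processing rounds actually executed */
static bool     sk_ksoftirqd_woken;

/* Stand-in for running every handler whose bit is set in the snapshot. */
static void sk_run_handlers(uint32_t snapshot)
{
    (void)snapshot;
    sk_rounds++;
    if (sk_rearm > 0) {               /* a handler raises a softirq again */
        sk_rearm--;
        sk_pending |= 1u;
    }
}

void sk_do_softirq(void)
{
    int restarts_left = SK_MAX_RESTART;
    uint32_t snapshot = sk_pending;   /* copy taken with interrupts off */

    while (snapshot) {
        sk_pending = 0;               /* clear before running handlers  */
        sk_run_handlers(snapshot);
        snapshot = sk_pending;        /* anything raised in the meantime? */
        if (snapshot && --restarts_left == 0) {
            sk_ksoftirqd_woken = true;  /* leave the rest to ksoftirqd  */
            break;
        }
    }
}
```

With a handler that re-raises itself three times, the loop finishes after four rounds; with one that never stops re-raising, it gives up after ten rounds and defers the remaining work.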

ksoftirqd Kernel Threads

Each CPU has a kernel thread ksoftirqd (usually named ksoftirqd_CPUn according to the CPU serial number). When the mechanisms described above cannot handle all the soft interrupts, the ksoftirqd thread of that CPU is woken up in the background and, once it is scheduled, handles as many pending soft interrupts as possible.

The task function run_ksoftirqd associated with ksoftirqd is as follows.

static int run_ksoftirqd(void * __bind_cpu)
{
    set_current_state(TASK_INTERRUPTIBLE);

    while (!kthread_should_stop()) {
        preempt_disable();
        if (!local_softirq_pending()) {
            preempt_enable_no_resched();
            schedule();
            preempt_disable();
        }

        __set_current_state(TASK_RUNNING);

        while (local_softirq_pending()) {
            /* Preempt disable stops cpu going offline.
                If already offline, we'll be on wrong CPU:
                don't process */
            if (cpu_is_offline((long)__bind_cpu))
                goto wait_to_die;
            local_irq_disable();
            if (local_softirq_pending())
                __do_softirq();
            local_irq_enable();
            preempt_enable_no_resched();
            cond_resched();
            preempt_disable();
            rcu_note_context_switch((long)__bind_cpu);
        }
        preempt_enable();
        set_current_state(TASK_INTERRUPTIBLE);
    }
    __set_current_state(TASK_RUNNING);
    return 0;

wait_to_die:
    preempt_enable();
    /* Wait for kthread_stop */
    set_current_state(TASK_INTERRUPTIBLE);
    while (!kthread_should_stop()) {
        schedule();
        set_current_state(TASK_INTERRUPTIBLE);
    }
    __set_current_state(TASK_RUNNING);
    return 0;
}

ksoftirqd does basically the same thing as do_softirq: its main logic is to call __do_softirq (which is also the core of do_softirq) repeatedly in a while loop, and it stops only when one of the following two conditions is met.

There are no pending soft interrupts, in which case ksoftirqd calls schedule() to give up the CPU voluntarily.
The thread has used up its time slice and is asked to give up the CPU until it is next scheduled.

The scheduling priority of the ksoftirqd thread is set very low, which also keeps it from taking too many CPU resources when there are many soft interrupts.

Network frame reception

Linux’s network system uses the following two main types of soft interrupts.

NET_RX_SOFTIRQ to handle receiving (inbound) network data
NET_TX_SOFTIRQ to handle sending (outgoing) network data

This article focuses mainly on how to receive data.

Input queue

Each CPU has an input queue input_pkt_queue for incoming network frames, which is located in the softnet_data structure, but not all NIC device drivers will use this input queue.

struct softnet_data {
    struct Qdisc        *output_queue;
    struct Qdisc        **output_queue_tailp;
    struct list_head    poll_list;
    struct sk_buff      *completion_queue;
    struct sk_buff_head process_queue;

    /* stats */
    unsigned int        processed;
    unsigned int        time_squeeze;
    unsigned int        cpu_collision;
    unsigned int        received_rps;

    unsigned            dropped;
    struct sk_buff_head input_pkt_queue;
    struct napi_struct  backlog;
};

Linux New API (NAPI)

Every time the NIC device receives a layer 2 network frame, it uses a hardware interrupt to signal the CPU that there is a new frame to process. The CPU receiving the interrupt executes the do_IRQ function, which calls the handler associated with the hardware interrupt number. The handler is usually a function registered by the device driver during initialization. This interrupt handler runs with interrupts disabled, so the CPU temporarily stops receiving interrupt signals. The interrupt handler performs the tasks that cannot wait and schedules the rest for deferred execution in the bottom half. Specifically, the interrupt handler does the following.

Copy the network frame into the sk_buff data structure.
Initialize some sk_buff parameters for use by the upper network stack, in particular skb->protocol, which identifies the upper layer’s protocol handler.
Update other device-specific parameters.
Notify the kernel that the received frame needs further processing by scheduling the soft interrupt NET_RX_SOFTIRQ.

We have described the polling and interrupt notification mechanisms (including several variants) above; each has its own advantages and disadvantages and suits different workloads. Linux 2.6 also provides NAPI, a mechanism that mixes polling and interrupts to notify the kernel of new incoming frames and process them. This article will focus on the NAPI mechanism.

When the device driver supports NAPI, the device still uses an interrupt to notify the kernel when it receives a network frame. Once the kernel starts processing that interrupt, however, it disables further interrupts from the device and instead keeps polling the device’s input buffer, fetching and processing received frames until the buffer is empty. It then ends the handler and re-enables interrupt notification for the device. NAPI thus combines the advantages of polling and interrupts.

When the device is idle, the kernel is notified as soon as the device receives a new network frame, without wasting resources on polling.
Once the kernel has been notified that data is pending in the device buffer, it does not need to spend resources handling further interrupts and simply polls to process the data.
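The interplay of interrupts and polling can be sketched as a small state machine. This is a user-space illustration under simplifying assumptions: struct sk_dev, sk_frame_arrives and sk_poll are hypothetical names, and a frame arriving while the device's interrupt is masked simply waits in the ring.

```c
#include <stdbool.h>

struct sk_dev {
    int  rx_ring;        /* frames waiting in the device's receive ring */
    bool irq_enabled;    /* whether the device may raise interrupts     */
    bool poll_scheduled; /* whether the device is waiting to be polled  */
    int  interrupts;     /* interrupts actually delivered to the kernel */
    int  processed;      /* frames handled by polling                   */
};

/* A frame arrives: interrupt the kernel only if interrupts are enabled. */
void sk_frame_arrives(struct sk_dev *d)
{
    d->rx_ring++;
    if (d->irq_enabled) {
        d->interrupts++;
        d->irq_enabled = false;    /* handler masks the device interrupt */
        d->poll_scheduled = true;  /* and schedules the device for polling */
    }
}

/* Poll the device: drain up to `limit` frames; when the ring is empty,
 * stop polling and re-enable the device interrupt.
 * Returns the number of frames processed by this call. */
int sk_poll(struct sk_dev *d, int limit)
{
    int done = 0;

    if (!d->poll_scheduled)
        return 0;

    while (d->rx_ring > 0 && done < limit) {
        d->rx_ring--;
        d->processed++;
        done++;
    }
    if (d->rx_ring == 0) {
        d->poll_scheduled = false;
        d->irq_enabled = true;
    }
    return done;
}
```

In this model a burst of five frames costs a single interrupt; the remaining frames are picked up by polling, and interrupts only resume once the ring has been drained.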

For the kernel, NAPI effectively reduces the number of interrupts to be handled under high load, thus reducing CPU usage, and it also reduces contention between devices because they are accessed by polling. The kernel implements NAPI with the following members.

poll: a virtual function that dequeues network frames from the device’s inbound queue; each device has a separate inbound queue.
poll_list: a list of devices that are currently in the polling state. Multiple devices can share the same interrupt signal, so the kernel may need to poll several devices. A device’s interrupts are disabled once it has been added to the list.
quota and weight: the kernel uses these two values to limit how many frames are dequeued from a device at a time. A smaller quota gives data frames from different devices a fairer chance of being processed, but the kernel spends more time switching between devices; a larger quota has the opposite effect.

When the device sends an interrupt signal and it is received, the kernel executes the interrupt handler registered by the device driver. The interrupt handler will call napi_schedule to schedule the execution of the polling program. In napi_schedule, if the device sending the interrupt is not in the CPU’s poll_list, the kernel adds it to the poll_list and triggers the scheduling of NET_RX_SOFTIRQ soft interrupts via __raise_softirq_irqoff. The main logic is located in ____napi_schedule.

/* Called with irq disabled */
static inline void ____napi_schedule(struct softnet_data *sd,
        struct napi_struct *napi)
{
    list_add_tail(&napi->poll_list, &sd->poll_list);
    __raise_softirq_irqoff(NET_RX_SOFTIRQ);
}

NET_RX_SOFTIRQ Soft Interrupt Handler

The handler for NET_RX_SOFTIRQ is net_rx_action. Its code is as follows.

static void net_rx_action(struct softirq_action *h)
{
    struct softnet_data *sd = &__get_cpu_var(softnet_data);
    unsigned long time_limit = jiffies + 2;
    int budget = netdev_budget;
    void *have;

    local_irq_disable();

    while (!list_empty(&sd->poll_list)) {
        struct napi_struct *n;
        int work, weight;

        /* If softirq window is exhausted then punt.
            * Allow this to run for 2 jiffies since which will allow
            * an average latency of 1.5/HZ.
            */
        if (unlikely(budget <= 0 || time_after(jiffies, time_limit)))
            goto softnet_break;

        local_irq_enable();

        /* Even though interrupts have been re-enabled, this
            * access is safe because interrupts can only add new
            * entries to the tail of this list, and only ->poll()
            * calls can remove this head entry from the list.
            */
        n = list_first_entry(&sd->poll_list, struct napi_struct, poll_list);

        have = netpoll_poll_lock(n);

        weight = n->weight;

        /* This NAPI_STATE_SCHED test is for avoiding a race
            * with netpoll's poll_napi().  Only the entity which
            * obtains the lock and sees NAPI_STATE_SCHED set will
            * actually make the ->poll() call.  Therefore we avoid
            * accidentally calling ->poll() when NAPI is not scheduled.
            */
        work = 0;
        if (test_bit(NAPI_STATE_SCHED, &n->state)) {
            work = n->poll(n, weight);
            trace_napi_poll(n);
        }

        WARN_ON_ONCE(work > weight);

        budget -= work;

        local_irq_disable();

        /* Drivers must not modify the NAPI state if they
            * consume the entire weight.  In such cases this code
            * still "owns" the NAPI instance and therefore can
            * move the instance around on the list at-will.
            */
        if (unlikely(work == weight)) {
            if (unlikely(napi_disable_pending(n))) {
                local_irq_enable();
                napi_complete(n);
                local_irq_disable();
            } else
                list_move_tail(&n->poll_list, &sd->poll_list);
        }

        netpoll_poll_unlock(have);
    }
out:
    net_rps_action_and_irq_enable(sd);

#ifdef CONFIG_NET_DMA
    /*
        * There may not be any more sk_buffs coming right now, so push
        * any pending DMA copies to hardware
        */
    dma_issue_pending_all();
#endif

    return;

softnet_break:
    sd->time_squeeze++;
    __raise_softirq_irqoff(NET_RX_SOFTIRQ);
    goto out;
}

When net_rx_action is scheduled for execution, it proceeds as follows.

The devices in the poll_list are traversed from the head of the list, and each device’s poll virtual function is called to process the data frames in its inbound queue.
When the number of frames processed by a poll call reaches the maximum threshold (the device’s weight), the device is moved to the end of the poll_list and the next device is polled, even if the first device’s inbound queue has not been emptied.
If the device’s inbound queue is emptied, napi_complete is called to remove the device from the poll_list and re-enable interrupt notification for that device.
The process continues until the poll_list is empty or net_rx_action has used up its budget or time limit (so that it does not take up too many CPU resources); in the latter case net_rx_action reschedules itself before exiting.
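The round-robin behavior in these steps can be sketched as follows. This is a simplified model rather than the kernel code: struct sk_napi and sk_net_rx_action are hypothetical, the poll list is a plain array, the jiffies time limit is left out, and a device leaves the list as soon as its backlog reaches zero.

```c
#include <stddef.h>

struct sk_napi {
    int backlog;   /* frames waiting in this device's inbound queue */
    int weight;    /* maximum frames handled per poll call          */
    int handled;   /* frames handled so far                         */
};

/* Poll the devices in `list` (n entries) in round-robin order until every
 * queue is empty or the budget is spent. Returns the budget that is left;
 * a value <= 0 means the softirq would have to reschedule itself. */
int sk_net_rx_action(struct sk_napi *list[], size_t n, int budget)
{
    while (n > 0 && budget > 0) {
        struct sk_napi *dev = list[0];
        int work = dev->backlog < dev->weight ? dev->backlog : dev->weight;

        dev->backlog -= work;       /* stand-in for dev's poll function */
        dev->handled += work;
        budget -= work;

        /* Remove the head entry by shifting the others forward. */
        for (size_t i = 1; i < n; i++)
            list[i - 1] = list[i];

        if (dev->backlog > 0)
            list[n - 1] = dev;      /* weight used up: back of the list */
        else
            n--;                    /* queue drained: leave the poll list */
    }
    return budget;
}
```

With two devices holding 10 and 3 frames and a weight of 4 each, the first device is polled three times in total while the second is served in between, so neither can monopolize the softirq.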

Poll Virtual Functions

During initialization, each device driver sets the poll function pointer to its own routine, so different drivers use different poll functions. Here we look at process_backlog, the default poll function provided by Linux, which works much like most driver poll functions. The main difference is that process_backlog runs with interrupts enabled. Non-NAPI devices share the per-CPU input queue with their interrupt handlers, so process_backlog must temporarily disable interrupts as a form of locking whenever it dequeues a data frame from the input queue. NAPI devices each use a separate inbound queue, and a device that has joined the poll_list has its own interrupts disabled, so there is no need for locking during polling.

When process_backlog is executed, it first calculates the quota of the device, and then enters the following loop.

Disable interrupts, dequeue a data frame from the input queue associated with the CPU, and re-enable interrupts.
If the input queue was empty, remove the device from the poll_list and end execution.
If a frame was dequeued, call netif_receive_skb(skb) to process it, as described in the next section.
Check the following conditions; if neither is met, jump back to step 1 and continue the loop.
1. If the number of dequeued data frames has reached the device’s quota, end execution.
2. If the allotted CPU time has been used up, end execution.
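The loop can be sketched in user space as shown below. The names (struct sk_backlog, sk_enqueue, sk_process_backlog) are hypothetical, the comments mark where the real code would disable and re-enable interrupts, the CPU time check is omitted, and adding the frame value to a sum stands in for the call to netif_receive_skb.

```c
#include <stddef.h>

#define SK_QLEN 32   /* hypothetical capacity of the input queue */

struct sk_backlog {
    int    frames[SK_QLEN];
    size_t head;           /* index of the next frame to dequeue    */
    size_t count;          /* frames currently queued               */
    int    on_poll_list;   /* nonzero while the queue awaits polling */
    long   delivered_sum;  /* stand-in for netif_receive_skb results */
};

/* Producer side (what an interrupt handler would do). */
int sk_enqueue(struct sk_backlog *q, int frame)
{
    if (q->count == SK_QLEN)
        return -1;                       /* queue full: drop the frame */
    q->frames[(q->head + q->count) % SK_QLEN] = frame;
    q->count++;
    q->on_poll_list = 1;
    return 0;
}

/* Consumer side: dequeue and deliver at most `quota` frames.
 * Returns the number of frames processed. */
int sk_process_backlog(struct sk_backlog *q, int quota)
{
    int work = 0;

    while (work < quota) {
        /* interrupts would be disabled here */
        if (q->count == 0) {
            q->on_poll_list = 0;         /* queue empty: leave poll_list */
            break;
        }
        int frame = q->frames[q->head];
        q->head = (q->head + 1) % SK_QLEN;
        q->count--;
        /* interrupts would be re-enabled here */

        q->delivered_sum += frame;       /* hand the frame to the stack */
        work++;
    }
    return work;
}
```

When the quota runs out first, the queue stays on the poll list with its remaining frames; when the queue empties first, it removes itself and waits for the next enqueue.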

Processing receive frames

netif_receive_skb is a helper function used by the poll virtual function to process received frames. In short, it does the following for each data frame, in order.

Handle bonding. Linux can aggregate a group of devices into a bond device; before the data frame enters layer 3 processing, its receiving device skb->dev is changed to the master device of the bond.
Pass a copy of the data frame to the registered sniffers of each protocol.
Handle functions that must be done at layer 2, including bridging. If the data frame does not need to be bridged, continue with the next step.
Pass a copy of the data frame to the registered layer 3 protocol handler corresponding to skb->protocol. The data frame then enters the upper layers of the kernel network stack.

If no matching protocol handler is found and the data frame has not been consumed by a function such as bridging, the kernel discards it.
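The last step and the drop rule can be illustrated with a small dispatch table keyed by EtherType. This is a sketch, not the kernel's packet type list: struct sk_frame, sk_receive and the counters are hypothetical, the bridged flag stands in for layer 2 functions that consume a frame, and the bonding and sniffer steps are left out. The EtherType values themselves are the standard ones.

```c
#include <stddef.h>
#include <stdint.h>

#define SK_ETH_P_IP   0x0800   /* IPv4 EtherType */
#define SK_ETH_P_ARP  0x0806   /* ARP EtherType  */

struct sk_frame {
    uint16_t protocol;   /* plays the role of skb->protocol        */
    int      bridged;    /* nonzero if layer 2 consumed the frame  */
};

typedef int (*sk_handler)(const struct sk_frame *);

static int sk_ip_count, sk_arp_count, sk_dropped;

static int sk_ip_rcv(const struct sk_frame *f)  { (void)f; sk_ip_count++;  return 0; }
static int sk_arp_rcv(const struct sk_frame *f) { (void)f; sk_arp_count++; return 0; }

/* Registered layer 3 handlers, looked up by protocol number. */
static const struct { uint16_t type; sk_handler func; } sk_ptypes[] = {
    { SK_ETH_P_IP,  sk_ip_rcv  },
    { SK_ETH_P_ARP, sk_arp_rcv },
};

/* Returns 0 if the frame was consumed or delivered, -1 if it was dropped. */
int sk_receive(const struct sk_frame *f)
{
    if (f->bridged)
        return 0;                        /* consumed at layer 2 */

    for (size_t i = 0; i < sizeof(sk_ptypes) / sizeof(sk_ptypes[0]); i++)
        if (sk_ptypes[i].type == f->protocol)
            return sk_ptypes[i].func(f); /* hand over to layer 3 */

    sk_dropped++;                        /* nobody wants this frame */
    return -1;
}
```

An IPv4 frame reaches sk_ip_rcv, a frame with an unregistered protocol number is counted as dropped, and a bridged frame never reaches the lookup at all.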

Typically, layer 3 protocol handlers process data frames in one of the following ways.

Pass them to protocols higher up in the network stack such as TCP, UDP and ICMP, and finally to the application process.
Discard them in a frame processing framework such as netfilter.
Forward them to another machine if their destination is not the local host.

This concludes the discussion of how Linux receives network frames.
