Packet reception process

For simplicity, we will describe how Linux receives and sends network packets by following a single UDP packet through a physical NIC, and I will try to leave out irrelevant details.

From NIC to memory

Every network device (NIC) needs a driver in order to work, and that driver must be loaded into the kernel before the device can be used. Logically, the driver is the intermediate module that bridges the network device and the kernel network stack. Whenever the network device receives a new packet, it triggers an interrupt, and the corresponding interrupt handler is provided by the driver loaded into the kernel.

The following diagram shows in detail how packets enter memory from the network device and are processed by the driver and network stack in the kernel.

Data packets entering memory from the network device

  1. The packet enters the physical NIC. If the destination address does not belong to that network device and the device does not have promiscuous mode turned on, the packet is discarded.
  2. The physical NIC writes the packet by DMA to a memory address that was allocated and initialized by the NIC driver.
  3. The physical NIC notifies the CPU via a hardware interrupt (IRQ) that a new packet has arrived and needs to be processed.
  4. The CPU looks up the registered interrupt function in the interrupt table and calls it, and this interrupt function calls the corresponding function in the NIC driver.
  5. The driver first disables the NIC’s interrupt. This indicates that the driver already knows there is data in memory, and tells the NIC to write subsequent packets directly to memory without notifying the CPU, which improves efficiency and keeps the CPU from being interrupted non-stop.
  6. The driver raises a soft interrupt to continue processing the packets. The hard interrupt handler cannot be interrupted while it runs, so if it takes too long, the CPU cannot respond to other hardware interrupts. The kernel therefore introduces soft interrupts, so that the time-consuming part of the work can be moved out of the hard interrupt handler and handled later in the soft interrupt handler.
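
The interplay of steps 2 to 6 can be sketched with a toy model. This is only an illustration of the idea, not driver code: the class `ToyNic` and all of its attributes are hypothetical names invented for this example. It shows that once the first hard interrupt has masked further interrupts, later packets land in memory without interrupting the CPU again.

```python
from collections import deque


class ToyNic:
    """Toy model of a NIC with an RX ring and a maskable interrupt."""

    def __init__(self):
        self.rx_ring = deque()        # stands in for the DMA area in memory
        self.irq_enabled = True       # whether the NIC may raise an IRQ
        self.softirq_pending = False  # set when deferred work is scheduled
        self.irq_count = 0            # number of hard interrupts raised

    def receive(self, packet):
        self.rx_ring.append(packet)   # step 2: "DMA" the packet into memory
        if self.irq_enabled:          # step 3: notify the CPU
            self.irq_count += 1
            self.hard_irq_handler()   # step 4: CPU runs the driver's handler

    def hard_irq_handler(self):
        self.irq_enabled = False      # step 5: mask further NIC interrupts
        self.softirq_pending = True   # step 6: defer the real work


nic = ToyNic()
for i in range(5):
    nic.receive(f"pkt{i}")

print(nic.irq_count, len(nic.rx_ring), nic.softirq_pending)  # 1 5 True
```

Five packets arrive, but only the first one raises a hard interrupt; the rest wait in the ring for the soft interrupt handler.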

Kernel packet processing

The network device driver from the previous step hands the packet over by raising a soft interrupt whose handler lives in the kernel network module. The kernel then processes the packet as shown in the following diagram.

The flow of packet processing in the kernel

  1. For the soft interrupt raised by the driver in the previous step, the ksoftirqd process in the kernel calls the corresponding soft interrupt handler of the network module; to be precise, it calls the net_rx_action function.

  2. net_rx_action then calls the poll function in the NIC driver to process the packets one by one.

  3. The poll function has the driver read the packets that the NIC wrote to memory. The format of the packets in memory is known only to the driver.

  4. The driver converts the packets in memory into the skb (socket buffer) format recognized by the kernel network module, and then calls the napi_gro_receive function.

  5. The napi_gro_receive function handles GRO, that is, it merges the packets that can be merged so that only one call into the protocol stack is required. It then checks whether RPS is enabled; if it is, the enqueue_to_backlog function is called.

  6. The enqueue_to_backlog function puts the packet into the input_pkt_queue structure and returns.

    Note: If input_pkt_queue is full, the packet is dropped. The size of this queue can be configured with net.core.netdev_max_backlog.

  7. The CPU then processes the network data in its own input_pkt_queue in soft interrupt context, calling the __netif_receive_skb_core function to do so.

  8. If RPS is not enabled, the napi_gro_receive function calls the __netif_receive_skb_core function directly to process the network packets.

  9. Next, the CPU copies the data to any socket of type AF_PACKET (raw socket), if there is one. This copy is what tcpdump captures.

  10. The packet is passed to the kernel TCP/IP stack for processing.

  11. When all packets in memory have been processed (the poll function has finished), the NIC’s hard interrupts are re-enabled, so that the next time the NIC receives data it notifies the CPU again.
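
The polling and backlog behaviour from steps 2, 6 and 11 can be sketched as follows. This is a simplified model under stated assumptions: the function names only mirror the kernel names, the `budget` parameter stands in for the limit on how many packets one poll call may handle, and the queue limit of 3 is a deliberately tiny hypothetical value chosen so that a drop is visible.

```python
from collections import deque

NETDEV_MAX_BACKLOG = 3  # hypothetical tiny limit, only to make a drop visible


def enqueue_to_backlog(queue, skb, limit=NETDEV_MAX_BACKLOG):
    """Return True if the skb was queued, False if it was dropped (queue full)."""
    if len(queue) >= limit:
        return False
    queue.append(skb)
    return True


def poll(rx_ring, backlog, budget):
    """Drain up to `budget` packets from the ring; return (processed, dropped)."""
    processed = dropped = 0
    while rx_ring and processed < budget:
        skb = rx_ring.popleft()
        processed += 1
        if not enqueue_to_backlog(backlog, skb):
            dropped += 1
    return processed, dropped


rx_ring = deque(f"pkt{i}" for i in range(5))
backlog = deque()
processed, dropped = poll(rx_ring, backlog, budget=4)
irq_reenabled = not rx_ring  # step 11: re-enable only once the ring is empty

print(processed, dropped, list(backlog), irq_reenabled)
# 4 1 ['pkt0', 'pkt1', 'pkt2'] False
```

The fourth packet is dropped because the backlog is already full, and the interrupt stays masked because one packet is still waiting in the ring.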

Kernel Network Protocol Stack

The packets that reach the kernel TCP/IP stack at this point are Layer 3 (network layer) packets, so they go to the IP network layer first and then to the transport layer.

IP Network Layer

  • ip_rcv is the entry function of the IP network layer processing module. It first determines whether the packet needs to be discarded (the destination MAC address is not that of the current NIC, and the packet was only received because the NIC is in promiscuous mode). If further processing is required, it calls the processing functions registered on the netfilter NF_INET_PRE_ROUTING chain.

  • NF_INET_PRE_ROUTING is a hook that netfilter places in the protocol stack. Packet processing functions can be attached to it through iptables to modify or drop packets; if the packet is not dropped, it continues through the stack.

    The processing logic in netfilter chains such as NF_INET_PRE_ROUTING can be set via iptables.

  • Routing: if the destination IP is not a local IP and IP forwarding is not enabled, the packet is dropped; if forwarding is enabled, the packet goes to the ip_forward function for processing.

  • The ip_forward function first calls the processing functions registered by netfilter on the NF_INET_FORWARD chain, and if the packet is not dropped, it goes on to call the dst_output_sk function.

  • The dst_output_sk function calls the appropriate function in the IP network layer to send the packet out. The details of this step are described in the section on sending packets below.

  • ip_local_deliver: if the routing step finds that the destination IP is a local IP, the ip_local_deliver function is called. It first calls the processing functions registered on the NF_INET_LOCAL_IN chain, and if the packet passes, it is handed up to the transport layer.
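
The branching described in the list above can be summarized in a small decision function. This is a sketch for orientation only: `ip_rcv_path` and its parameters are hypothetical, the addresses come from the documentation ranges, and `drop_at` simply names hooks at which an imagined iptables rule drops the packet.

```python
import ipaddress


def ip_rcv_path(dst_ip, local_ips, forwarding_enabled, drop_at=()):
    """Return the hooks and functions a packet visits, ending in its fate.

    `drop_at` names hooks whose (hypothetical) rules drop the packet.
    """
    dst = ipaddress.ip_address(dst_ip)
    path = ["ip_rcv", "NF_INET_PRE_ROUTING"]
    if "NF_INET_PRE_ROUTING" in drop_at:
        return path + ["DROP"]
    if dst in local_ips:                       # destination is a local IP
        path += ["ip_local_deliver", "NF_INET_LOCAL_IN"]
        if "NF_INET_LOCAL_IN" in drop_at:
            return path + ["DROP"]
        return path + ["transport layer"]
    if not forwarding_enabled:                 # not local, no forwarding
        return path + ["DROP"]
    path += ["ip_forward", "NF_INET_FORWARD"]
    if "NF_INET_FORWARD" in drop_at:
        return path + ["DROP"]
    return path + ["dst_output_sk"]


local = {ipaddress.ip_address("192.0.2.10")}
print(ip_rcv_path("192.0.2.10", local, False))    # ends in 'transport layer'
print(ip_rcv_path("198.51.100.7", local, False))  # ends in 'DROP'
print(ip_rcv_path("198.51.100.7", local, True))   # ends in 'dst_output_sk'
```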

Transport layer

  • udp_rcv is the entry function of the UDP processing layer module. It first calls the __udp4_lib_lookup_skb function to find the corresponding socket based on the destination IP and port (for this purpose, a socket is essentially a structure identified by IP and port). If no matching socket is found, the packet is discarded; otherwise processing continues.
  • sock_queue_rcv_skb first checks whether the socket’s receive buffer is full and discards the packet if it is. It then calls sk_filter to check whether the packet meets the conditions: if a filter is set on the socket and the packet does not match it, the packet is also discarded.
  • __skb_queue_tail puts the packet at the end of the socket’s receive queue.
  • sk_data_ready informs the socket that the packet is ready.
  • After sk_data_ready has been called, the packet is fully processed and waits to be read by the application layer.

Note: All the execution procedures described above are executed in the soft interrupt context.
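
The three ways a datagram can be discarded in the transport layer steps above can be modelled in a few lines. This is a deliberately simplified sketch: `ToySocket` and this `udp_rcv` are hypothetical stand-ins, the lookup is an exact match on (IP, port) only, and the receive buffer is counted in payload bytes, which is coarser than what the kernel accounts for.

```python
from collections import deque


class ToySocket:
    def __init__(self, rcvbuf_bytes, accept=None):
        self.rcvbuf_bytes = rcvbuf_bytes  # receive buffer limit
        self.used = 0                     # bytes currently queued
        self.queue = deque()              # receive queue
        self.accept = accept              # optional filter predicate
        self.ready = False                # set when data is ready to read


def udp_rcv(sockets, dst_ip, dst_port, payload):
    """Return a string naming the outcome for one datagram."""
    sk = sockets.get((dst_ip, dst_port))          # socket lookup
    if sk is None:
        return "drop: no socket"
    if sk.used + len(payload) > sk.rcvbuf_bytes:  # receive buffer full
        return "drop: rcvbuf full"
    if sk.accept is not None and not sk.accept(payload):
        return "drop: filtered"                   # socket filter rejects it
    sk.queue.append(payload)                      # tail of the receive queue
    sk.used += len(payload)
    sk.ready = True                               # notify the reader
    return "queued"


sockets = {("192.0.2.10", 53): ToySocket(8, accept=lambda p: p != b"bad")}
results = [
    udp_rcv(sockets, "192.0.2.10", 53, b"hello"),
    udp_rcv(sockets, "192.0.2.10", 53, b"bad"),
    udp_rcv(sockets, "192.0.2.10", 53, b"world"),
    udp_rcv(sockets, "192.0.2.10", 99, b"x"),
]
print(results)
# ['queued', 'drop: filtered', 'drop: rcvbuf full', 'drop: no socket']
```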

The packet sending process

Logically, the sending process of a Linux network packet is the opposite of the receiving process, so we’ll still use the example of a UDP packet being sent through a physical NIC.

Application layer

The application layer part starts with the application calling the Linux network interface to create a socket. The diagram below shows in detail how the application layer creates the socket and passes data to the transport layer.

The application layer creates the socket and passes data to the transport layer

  • socket(...) is called to create a socket structure and initialize the corresponding operation functions.
  • sendto(sock, ...) is called by the application layer program to start sending packets; it in turn calls the inet_sendmsg function.
  • inet_sendmsg mainly checks whether the current socket has a bound source port. If not, it calls the inet_autobind function to assign one, and then calls the UDP layer function to transmit the data.
  • The inet_autobind function calls the get_port function to get an available port.
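
The automatic port assignment can be observed from user space. The following sketch assumes a Linux host with a working loopback interface; `demo_autobind` is a hypothetical helper name. It sends one datagram from a socket that never called bind() and compares the sender's local port before and after the send.

```python
import socket


def demo_autobind():
    """Send one datagram over loopback without calling bind() on the sender."""
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    receiver.bind(("127.0.0.1", 0))  # let the kernel pick a free port
    receiver.settimeout(2.0)
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        port_before = sender.getsockname()[1]  # no source port assigned yet
        sender.sendto(b"ping", receiver.getsockname())
        port_after = sender.getsockname()[1]   # assigned during the send
        data, peer = receiver.recvfrom(1024)
        return port_before, port_after, data, peer[1]
    finally:
        sender.close()
        receiver.close()


port_before, port_after, data, peer_port = demo_autobind()
print(port_before, port_after, data, peer_port)
```

On Linux the first value should be 0 and the second a non-zero ephemeral port, and the receiver should see the datagram arriving from that same port.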

Transport layer

  • udp_sendmsg is the entry point of the UDP transport layer module for sending packets. It first calls the ip_route_output_flow function to get the routing information (mainly the source IP and the NIC), then calls ip_make_skb to construct the skb structure, and finally associates the NIC information with the skb.
  • ip_route_output_flow mainly deals with routing information. Based on the routing table and the destination IP, it works out which network device the packet should be sent from. If the socket is not bound to a source IP, the function also picks the most appropriate source IP based on the routing table. If the socket is bound to a source IP but the NIC corresponding to that IP cannot reach the destination according to the routing table, the packet is discarded and an error is returned for the failed send. Finally, the function stores the network device and source IP it found in the flowi4 structure and returns it to udp_sendmsg.
  • ip_make_skb constructs the skb and assigns it an IP packet header (including the source IP information). It calls the __ip_append_data function to split the data into fragments as needed and to check whether the socket’s send buffer has been exhausted; if it has, an ENOBUFS error is returned.
  • udp_send_skb(skb, fl4) fills in the UDP packet header in the skb, handles the checksum, and then passes the skb to the corresponding function in the IP network layer.
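
To make "fills in the UDP packet header and handles the checksum" concrete, here is a user-space sketch of building a UDP header with its checksum over the IPv4 pseudo-header. It is not the kernel implementation; `ones_complement_sum` and `build_udp_datagram` are hypothetical helper names, and the addresses and ports are made-up example values.

```python
import socket
import struct


def ones_complement_sum(data: bytes) -> int:
    """16-bit one's-complement sum of `data`, padded to an even length."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return total


def pseudo_header(src_ip: str, dst_ip: str, udp_length: int) -> bytes:
    """IPv4 pseudo-header that is covered by the UDP checksum."""
    return (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack("!BBH", 0, socket.IPPROTO_UDP, udp_length))


def build_udp_datagram(src_ip, dst_ip, src_port, dst_port, payload: bytes) -> bytes:
    length = 8 + len(payload)  # the UDP header is 8 bytes long
    header = struct.pack("!HHHH", src_port, dst_port, length, 0)
    pseudo = pseudo_header(src_ip, dst_ip, length)
    checksum = ~ones_complement_sum(pseudo + header + payload) & 0xFFFF
    if checksum == 0:
        checksum = 0xFFFF      # 0 is reserved for "no checksum" in UDP
    header = struct.pack("!HHHH", src_port, dst_port, length, checksum)
    return header + payload


dgram = build_udp_datagram("192.0.2.1", "192.0.2.2", 40000, 53, b"hello")
# A receiver validates by summing pseudo-header and datagram: 0xFFFF means OK.
check = ones_complement_sum(pseudo_header("192.0.2.1", "192.0.2.2", len(dgram)) + dgram)
print(len(dgram), hex(check))
```

The final sum over the pseudo-header and the finished datagram comes out as 0xffff, which is how a receiver recognizes an intact datagram.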

IP Network Layer

  • ip_send_skb is the entry function of the IP network layer module for sending packets. It essentially calls the series of functions that follow to send network layer packets.
  • __ip_local_out_sk sets the length and checksum of the IP packet header, and then calls the processing functions registered on the netfilter hook chain NF_INET_LOCAL_OUT.
  • NF_INET_LOCAL_OUT is a netfilter hook; the processing functions on its chain can be configured via iptables. If the packet is not discarded, it continues on.
  • dst_output_sk calls the corresponding output function, ip_output, based on the information inside the skb.
  • ip_output writes the NIC information obtained from udp_sendmsg in the previous layer to the skb, and then calls the processing functions registered on the netfilter hook chain NF_INET_POST_ROUTING.
  • NF_INET_POST_ROUTING is a netfilter hook; the processing functions on its chain can be configured via iptables. Source address translation (SNAT) is mainly configured at this step, which can change the routing information of the skb.
  • ip_finish_output determines whether the routing information has changed since the previous step. If so, the dst_output_sk function needs to be called again (this time it may not take the branch that calls ip_output, but go to the output function specified by netfilter, possibly xfrm4_transport_output); otherwise processing continues.
  • ip_finish_output2 finds the next hop address in the routing table based on the destination IP, then calls the __ipv4_neigh_lookup_noref function to look up the next hop’s neigh information in the ARP table. If it is not found, it calls the __neigh_create function to construct an empty neigh structure.
  • dst_neigh_output calls the neigh_resolve_output function to get the neigh information and fill the MAC address from it into the skb, and then calls the dev_queue_xmit function to send the packet.
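
The last two steps, finding the next hop and then its link-layer address, can be sketched with a toy routing table and a toy neighbour table. Everything here is hypothetical illustration data: the routes, the device name `eth0`, the MAC address, and the helper names `next_hop` and `resolve`.

```python
import ipaddress

# (destination network, gateway or None if directly reachable, device)
ROUTES = [
    (ipaddress.ip_network("192.0.2.0/24"), None, "eth0"),
    (ipaddress.ip_network("0.0.0.0/0"), ipaddress.ip_address("192.0.2.1"), "eth0"),
]
# Toy ARP table: next hop IP -> MAC address (None means not yet resolved)
NEIGH = {ipaddress.ip_address("192.0.2.1"): "02:00:00:00:00:01"}


def next_hop(dst_ip):
    """Longest-prefix match; directly reachable routes use the destination itself."""
    dst = ipaddress.ip_address(dst_ip)
    matches = [route for route in ROUTES if dst in route[0]]
    _net, gateway, dev = max(matches, key=lambda route: route[0].prefixlen)
    return (gateway if gateway is not None else dst), dev


def resolve(dst_ip):
    """Return (next hop, device, MAC); MAC is None while still unresolved."""
    hop, dev = next_hop(dst_ip)
    # An unknown neighbour gets an empty entry that ARP would fill in later.
    mac = NEIGH.setdefault(hop, None)
    return hop, dev, mac


print(resolve("198.51.100.7"))  # goes via the gateway, MAC already known
print(resolve("192.0.2.50"))    # directly reachable, MAC not resolved yet
```

The second lookup creates an empty neighbour entry, mirroring the empty neigh structure that is constructed when the ARP table has no match.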

Kernel packet processing

  • dev_queue_xmit is the entry point where the kernel module starts processing outgoing packets. It first gets the qdisc of the device. If the device has none (e.g. loopback or IP tunnels), the dev_hard_start_xmit function is called directly; otherwise the packet goes through the traffic control module.
  • The traffic control module mainly filters and orders packets. If the queue is full, packets are dropped. For details, please refer to: http://tldp.org/HOWTO/Traffic-Control-HOWTO/intro.html
  • dev_hard_start_xmit first copies the skb to the “packet taps” (from which the tcpdump command gets its data), then calls the ndo_start_xmit function to send the packet. If dev_hard_start_xmit returns an error, its caller holds on to the skb and raises the NET_TX_SOFTIRQ soft interrupt, so that the soft interrupt handler net_tx_action can retry the transmission later.
  • ndo_start_xmit is bound to the transmit function of the specific driver.
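
The difference between a device with and without a qdisc can be sketched as follows. This is a toy model only: `ToyDevice`, `qdisc_run` and the plain FIFO with tail drop are assumptions for illustration, and real queueing disciplines can classify and reorder packets in far more elaborate ways.

```python
from collections import deque


class ToyDevice:
    def __init__(self, qdisc_limit=None):
        # qdisc_limit=None models a device without a queue (e.g. loopback)
        self.qdisc = None if qdisc_limit is None else deque()
        self.qdisc_limit = qdisc_limit
        self.sent = []

    def hard_start_xmit(self, skb):
        self.sent.append(skb)        # hand the packet to the "driver"


def dev_queue_xmit(dev, skb):
    """Return 'sent', 'queued' or 'dropped' for one packet."""
    if dev.qdisc is None:
        dev.hard_start_xmit(skb)     # no qdisc: transmit directly
        return "sent"
    if len(dev.qdisc) >= dev.qdisc_limit:
        return "dropped"             # queue full: tail drop
    dev.qdisc.append(skb)
    return "queued"


def qdisc_run(dev):
    """Dequeue everything in FIFO order and transmit it."""
    while dev.qdisc:
        dev.hard_start_xmit(dev.qdisc.popleft())


lo = ToyDevice()
eth0 = ToyDevice(qdisc_limit=2)
lo_result = dev_queue_xmit(lo, "a")
eth0_results = [dev_queue_xmit(eth0, p) for p in ("b", "c", "d")]
qdisc_run(eth0)

print(lo_result, eth0_results, eth0.sent)
# sent ['queued', 'queued', 'dropped'] ['b', 'c']
```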

Note: ndo_start_xmit points to the transmit function of the specific NIC driver. After this step, the task of sending the packet is handed over to the network device driver. Different drivers handle it in different ways, but the general process is basically the same:

  1. put skb into the NIC’s own transmit queue
  2. notify the NIC to send the packet
  3. send an interrupt to the CPU after the NIC finishes sending
  4. clean up the skb after receiving the interrupt
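
These four driver steps can be mimicked with a small model of a transmit ring. It is only a sketch under assumptions: `ToyTxDriver`, its ring size and its method names are invented for this example and do not correspond to any particular driver.

```python
class ToyTxDriver:
    def __init__(self, ring_size):
        self.ring_size = ring_size
        self.in_flight = []  # skbs handed to the NIC, not yet completed
        self.freed = []      # skbs released after the completion interrupt

    def start_xmit(self, skb):
        """Steps 1-2: queue the skb and 'notify' the NIC; False if the ring is full."""
        if len(self.in_flight) >= self.ring_size:
            return False     # the caller is expected to retry later
        self.in_flight.append(skb)
        return True

    def tx_complete_irq(self):
        """Steps 3-4: the NIC signals completion; release the sent skbs."""
        self.freed.extend(self.in_flight)
        self.in_flight.clear()


drv = ToyTxDriver(ring_size=2)
accepted = [drv.start_xmit(s) for s in ("skb0", "skb1", "skb2")]
drv.tx_complete_irq()
retry = drv.start_xmit("skb2")

print(accepted, drv.freed, retry)
# [True, True, False] ['skb0', 'skb1'] True
```

The third skb is refused while the ring is full and is accepted on retry once the completion interrupt has freed the earlier entries.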

Summary

Once we understand how Linux receives and sends network packets, we know where packets can be monitored and modified, and in which cases they may be dropped. In particular, knowing where the netfilter hooks sit in this path helps us understand how iptables is used, and also helps us better understand the virtual network devices under Linux.
