Basic implementation of Linux networking

In the TCP/IP network hierarchy model, the protocol stack is divided into the physical layer, link layer, network layer, transport layer, and application layer. The physical layer corresponds to the network card (NIC) and the network cable, and the application layer corresponds to applications such as Nginx and FTP. Linux implements the link layer, the network layer, and the transport layer.

In the Linux kernel implementation, the link layer protocol is implemented by the network card driver, and the kernel protocol stack implements the network and transport layers. The kernel exposes a socket interface to the application layer for user processes. Viewed from the Linux perspective, the layered TCP/IP model looks like the following.

How Linux handles network events

When data arrives on the device, a voltage change is triggered on the relevant pin of the CPU to notify the CPU to process the data.

This is also called a hard interrupt.

But the CPU runs very fast, while reading data from the network is very slow; if the CPU waited for the whole transfer it would be occupied for a long time and unable to handle other events, such as mouse movement.

So how does Linux solve this problem?

The Linux kernel splits interrupt processing into two parts: the hard interrupt mentioned above, and the soft interrupt.

In the first part, the voltage change on the CPU pin generates a hard interrupt, which does only the simplest processing and then lets the hardware asynchronously receive the data into a buffer. At that point the CPU is already free to accept other interrupts.

The second part is the soft interrupt. How does a soft interrupt work? It simply flips bits in memory, similar to the status field we often write in business code. For example, in network I/O, when the buffer has finished receiving data, the corresponding status is changed to "complete". Likewise, when epoll finishes reading data for some I/O event, the fd does not enter the ready state immediately; it waits until the next loop iteration checks the status and then puts the fd into the ready list (this delay is very short, but from the CPU's point of view it is very long).

From kernel version 2.4 onwards, this second half is implemented with soft interrupts, which are handled by the ksoftirqd kernel threads. Unlike hard interrupts, which apply a voltage change to physical CPU pins, a soft interrupt notifies the soft interrupt handler by setting a binary value on a variable in memory.

This is also why epoll was only formally introduced in 2.6: the kernel did not support this approach until 2.4.
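You can watch this soft interrupt activity from user space. Below is a minimal C sketch, not taken from the kernel sources discussed here, that prints the header and the NET_RX / NET_TX rows of /proc/softirqs, i.e. the per-CPU counters of the networking soft interrupts that ksoftirqd works through.

/* Minimal sketch: print the NET_RX / NET_TX rows of /proc/softirqs to see
 * how often the networking soft interrupts fire on each CPU.
 * Assumes a standard Linux /proc filesystem. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *fp = fopen("/proc/softirqs", "r");
    char line[1024];

    if (!fp) {
        perror("fopen /proc/softirqs");
        return 1;
    }
    while (fgets(line, sizeof(line), fp)) {
        /* The first line is the CPU header; the others are per-softirq counters. */
        if (strstr(line, "CPU") || strstr(line, "NET_RX") || strstr(line, "NET_TX"))
            fputs(line, stdout);
    }
    fclose(fp);
    return 0;
}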

The overall data flow diagram is as follows.

A frame arrives at the NIC and goes through the following steps before reception is complete.

  • The packet enters the physical NIC from an outside network. If the destination address is not that NIC and the NIC does not have promiscuous mode enabled, the packet is discarded by the NIC.
  • The NIC writes the packet by DMA to the specified memory address, which is allocated and initialized by the NIC driver. Note: Older NICs may not support DMA, though newer NICs generally do.
  • The NIC notifies the CPU via a hardware interrupt (IRQ) that data is coming
  • The CPU calls the registered interrupt function according to the interrupt table, and this interrupt function will call the corresponding function in the driver (NIC Driver)
  • The driver first disables the NIC's interrupt, indicating that it already knows there is data in memory, and tells the NIC to write subsequent packets directly to memory without notifying the CPU again; this improves efficiency and avoids interrupting the CPU constantly.
  • Raise a soft interrupt. After this step the hard interrupt handler ends and returns. Since a hard interrupt handler cannot be interrupted while it executes, taking too long would keep the CPU from responding to other hardware interrupts; the kernel therefore introduces soft interrupts, so that the time-consuming part of the handling can be moved into the soft interrupt handler and processed later.
  • ksoftirqd handles the soft interrupt by calling the corresponding soft interrupt handler. For the soft interrupt raised by the NIC driver in the previous step, ksoftirqd calls the net_rx_action function of the network module
  • net_rx_action calls the poll function in the NIC driver to process the packets one by one
  • In the poll function, the driver reads the packets written to memory one by one; the format of the packets in memory is known only to the driver
  • The driver converts the packets in memory to the skb format recognized by the kernel network module and then calls the napi_gro_receive function
  • napi_gro_receive handles GRO: packets that can be merged are merged, so only one pass through the protocol stack is needed. It then checks whether RPS is enabled; if so, enqueue_to_backlog is called
  • In the enqueue_to_backlog function, the packet is put into the input_pkt_queue of the CPU’s softnet_data structure and the function returns. If input_pkt_queue is full, the packet is discarded. The size of the queue can be configured via net.core.netdev_max_backlog (see the sketch after this list)
  • The CPU then processes the packets in its own input_pkt_queue in its soft interrupt context (calling __netif_receive_skb_core)
  • If RPS is not enabled, napi_gro_receive will call __netif_receive_skb_core directly
  • Check whether there is a socket of type AF_PACKET (often called a raw socket); if so, a copy of the data is delivered to it. This is where tcpdump captures packets.
  • Call the corresponding function of the protocol stack and give the packet to the stack for processing.
  • After all the packets in memory have been processed (i.e. the poll function has finished), re-enable the NIC’s hard interrupt, so that the NIC will notify the CPU the next time it receives data
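As noted in the enqueue_to_backlog step above, the length of input_pkt_queue is capped by net.core.netdev_max_backlog. Below is a minimal sketch, assuming the standard /proc/sys layout, that reads the current value (the same number that `sysctl net.core.netdev_max_backlog` would show).

/* Minimal sketch: read net.core.netdev_max_backlog, the cap on the per-CPU
 * input_pkt_queue used by enqueue_to_backlog. Assumes the standard /proc/sys path. */
#include <stdio.h>

int main(void)
{
    FILE *fp = fopen("/proc/sys/net/core/netdev_max_backlog", "r");
    int backlog;

    if (!fp) {
        perror("fopen");
        return 1;
    }
    if (fscanf(fp, "%d", &backlog) == 1)
        printf("netdev_max_backlog = %d\n", backlog);
    fclose(fp);
    return 0;
}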

epoll

poll function

The poll function here is a callback that is registered and then invoked from the soft interrupt path. For example, the epoll machinery registers ep_poll_callback.

Take go epoll as an example.

go: accept -> pollDesc.Init -> poll_runtime_pollOpen -> runtime.netpollopen (epoll_create) -> epollctl(EPOLL_CTL_ADD)

go: netpollblock (gopark) yields the CPU -> scheduled back later when netpoll(0) concurrently marks the fd ready -> other operations ……

epoll thread: epoll_create -> epoll_ctl(EPOLL_CTL_ADD) (ep_ptable_queue_proc registers ep_poll_callback on the fd's wait queue) -> epoll_wait (ep_poll yields the CPU)

kernel: NIC receives data -> DMA + hard interrupt -> soft interrupt -> dispatched to ksoftirqd, which ends up invoking ep_poll_callback (note that new connections reach the program via accept, not via the callback) -> find the previously registered fd -> copy the NIC data for that fd -> act on the fd according to the event type (ready list)
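Before diving into the Go and kernel sources, here is a minimal user-space sketch in C of the same epoll_create -> epoll_ctl(EPOLL_CTL_ADD) -> epoll_wait sequence that the Go runtime performs internally. The port number 8080 and the buffer size are arbitrary choices for illustration, and error handling is kept to a minimum.

/* Minimal sketch of the user-space side of the epoll flow described above. */
#include <stdio.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/epoll.h>
#include <sys/socket.h>

int main(void)
{
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port = htons(8080),
                                .sin_addr.s_addr = htonl(INADDR_ANY) };

    setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
    if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0 || listen(lfd, SOMAXCONN) < 0) {
        perror("bind/listen");
        return 1;
    }

    int epfd = epoll_create1(0);                 /* struct eventpoll is allocated here */
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = lfd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, lfd, &ev);    /* ep_insert() builds an epitem for lfd */

    for (;;) {
        struct epoll_event events[64];
        /* ep_poll() sleeps here until ep_poll_callback() puts something on the ready list */
        int n = epoll_wait(epfd, events, 64, -1);
        for (int i = 0; i < n; i++) {
            if (events[i].data.fd == lfd) {      /* new connection: handled via accept, not the callback */
                int cfd = accept(lfd, NULL, NULL);
                if (cfd < 0)
                    continue;
                struct epoll_event cev = { .events = EPOLLIN, .data.fd = cfd };
                epoll_ctl(epfd, EPOLL_CTL_ADD, cfd, &cev);
            } else {                             /* data ready on a client socket */
                char buf[4096];
                ssize_t r = read(events[i].data.fd, buf, sizeof(buf));
                if (r <= 0) {                    /* peer closed or error: stop watching */
                    epoll_ctl(epfd, EPOLL_CTL_DEL, events[i].data.fd, NULL);
                    close(events[i].data.fd);
                } else {
                    write(events[i].data.fd, buf, r);   /* echo back */
                }
            }
        }
    }
}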

Part of the code

go: accept

// Accept blocks, waiting for a system event (i.e. for a client connection to arrive)
func (fd *FD) Accept() (int, syscall.Sockaddr, string, error) {
	if err := fd.readLock(); err != nil {
		return -1, nil, "", err
	}
	defer fd.readUnlock()

	if err := fd.pd.prepareRead(fd.isFile); err != nil {
		return -1, nil, "", err
	}
	for {
		s, rsa, errcall, err := accept(fd.Sysfd)
		if err == nil {
			return s, rsa, "", err
		}
		switch err {
		case syscall.EAGAIN:
			if fd.pd.pollable() {
				if err = fd.pd.waitRead(fd.isFile); err == nil {
					continue
				}
			}
		case syscall.ECONNABORTED:
			// This means that a socket on the listen
			// queue was closed before we Accept()ed it;
			// it's a silly error, so try again.
			continue
		}
		return -1, nil, errcall, err
	}
}
// accept creates the netFD and wires it into netpoll
func (fd *netFD) accept() (netfd *netFD, err error) {
	d, rsa, errcall, err := fd.pfd.Accept()
	if err != nil {
		if errcall != "" {
			err = wrapSyscallError(errcall, err)
		}
		return nil, err
	}

	if netfd, err = newFD(d, fd.family, fd.sotype, fd.net); err != nil {
		poll.CloseFunc(d)
		return nil, err
	}
	if err = netfd.init(); err != nil { // pollOpen, then epoll_ctl(EPOLL_CTL_ADD)
		fd.Close()
		return nil, err
	}
	lsa, _ := syscall.Getsockname(netfd.pfd.Sysfd)
	netfd.setAddr(netfd.addrFunc()(lsa), netfd.addrFunc()(rsa))
	return netfd, nil
}

// syscall package: ultimately this calls the Linux accept(2)
func accept(s int, rsa *RawSockaddrAny, addrlen *_Socklen) (fd int, err error) {
	r0, _, e1 := syscall(funcPC(libc_accept_trampoline), uintptr(s), uintptr(unsafe.Pointer(rsa)), uintptr(unsafe.Pointer(addrlen)))
	fd = int(r0)
	if e1 != 0 {
		err = errnoErr(e1)
	}
	return
}

epoll source code

 static int __init eventpoll_init(void)
{
   mutex_init(&pmutex);
   ep_poll_safewake_init(&psw);
   epi_cache = kmem_cache_create("eventpoll_epi", sizeof(struct epitem), 0, SLAB_HWCACHE_ALIGN|EPI_SLAB_DEBUG|SLAB_PANIC, NULL);
   pwq_cache = kmem_cache_create("eventpoll_pwq", sizeof(struct eppoll_entry), 0, EPI_SLAB_DEBUG|SLAB_PANIC, NULL);
   return 0;
}

Basic data structure

epoll uses kmem_cache_create (the slab allocator) to allocate memory to hold struct epitem and struct eppoll_entry. When an fd is added to an epoll instance, an epitem structure is created for it; this is the basic data structure the kernel uses to manage epoll.

struct epitem {
	struct rb_node  rbn;        // red-black tree node used by the main structure
	struct list_head  rdllink;  // linkage in the ready-event queue
	struct epitem  *next;       // linked list used inside the main structure
	struct epoll_filefd  ffd;   // information about the monitored file descriptor
	int  nwait;                 // number of wait queues attached by the poll operation
	struct list_head  pwqlist;  // doubly linked list of the monitored file's wait queues, similar in role to the poll_table in select/poll
	struct eventpoll  *ep;      // the main structure this item belongs to (many epitems belong to one eventpoll)
	struct list_head  fllink;   // doubly linked list linking to the monitored fd's struct file; file->f_ep_links holds all epoll nodes watching this file
	struct epoll_event  event;  // the registered events of interest, i.e. the user-space epoll_event
};

And the main data structure corresponding to each epoll fd (epfd) is

struct eventpoll {
	spinlock_t        lock;        // protects access to this structure
	struct mutex      mtx;         // prevents the structure from being removed while in use
	wait_queue_head_t     wq;      // wait queue used by sys_epoll_wait()
	wait_queue_head_t   poll_wait;       // wait queue used by file->poll()
	struct list_head    rdllist;        // list of fds whose events are ready
	struct rb_root      rbr;            // root of the red-black tree that manages all monitored fds
	struct epitem      *ovflist;       // links fds whose events arrive while events are being sent to user space
};

struct eventpoll is created at epoll_create.

long sys_epoll_create(int size) {

    struct eventpoll *ep;

    // ...

    ep_alloc(&ep); // allocate memory for ep and initialize it

    /* Call anon_inode_getfd to create a new file instance; epoll can be seen
       as an (anonymous) file, which is why epoll_create returns an fd.
       All the fds managed by epoll live in one big eventpoll structure (a
       red-black tree). The main structure struct eventpoll *ep is stored in
       file->private_data, from where sys_epoll_ctl will fetch it later. */
    fd = anon_inode_getfd("[eventpoll]", &eventpoll_fops, ep, O_RDWR | (flags & O_CLOEXEC));
    return fd;
}

Here ep_alloc(struct eventpoll **pep) allocates memory for *pep and initializes it. The eventpoll_fops registered above is defined as follows: static const struct file_operations eventpoll_fops = { .release = ep_eventpoll_release, .poll = ep_eventpoll_poll, };. In this way, the kernel maintains a red-black tree with roughly the structure shown in the figure. Next comes the epoll_ctl function (error checking and similar code omitted).

asmlinkage long sys_epoll_ctl(int epfd, int op, int fd, struct epoll_event __user *event) {
    int error;
    struct file *file, *tfile;
    struct eventpoll *ep;
    struct epitem *epi;
    struct epoll_event epds;
    error = -EFAULT;
    // Validate the arguments and copy the user-space *event into epds.
    if(ep_op_has_event(op) && copy_from_user(&epds, event, sizeof(struct epoll_event)))
            goto error_return; // the code jumped to is omitted
    file  = fget(epfd); // file object of the epoll fd
    tfile = fget(fd);   // file object of the monitored fd
    // Stored at create time (anon_inode_getfd), retrieved here.
    ep = file->private_data;
    mutex_lock(&ep->mtx);
    // Prevent duplicate additions (look the fd up in ep's red-black tree).
    epi = epi_find(ep, tfile, fd);
    switch(op)
    {
        case EPOLL_CTL_ADD:  // start monitoring an fd
            if(!epi)
            {
                epds.events |= EPOLLERR | POLLHUP;     // POLLERR and POLLHUP are always monitored by default
                error = ep_insert(ep, &epds, tfile, fd);  // insert the epitem for this fd into ep's red-black tree
            } else  // duplicate add (the fd is already in ep's red-black tree)
                error = -EEXIST;
            break;
        ...
    }
    return error;
}

ep_insert is implemented as follows.

static int ep_insert(struct eventpoll *ep, struct epoll_event *event, struct file *tfile, int fd)
{
   int error, revents, pwake = 0;
   unsigned long flags;
   struct epitem *epi;
   /*
      struct ep_pqueue{
         poll_table pt;
         struct epitem *epi;
      }   */
   struct ep_pqueue epq;
   // Allocate an epitem structure for the fd being added
   if(!(epi = kmem_cache_alloc(epi_cache, GFP_KERNEL)))
      goto error_return;
   // Initialize the structure
   ep_rb_initnode(&epi->rbn);
   INIT_LIST_HEAD(&epi->rdllink);
   INIT_LIST_HEAD(&epi->fllink);
   INIT_LIST_HEAD(&epi->pwqlist);
   epi->ep = ep;
   ep_set_ffd(&epi->ffd, tfile, fd);
   epi->event = *event;
   epi->nwait = 0;
   epi->next = EP_UNACTIVE_PTR;
   epq.epi = epi;
   // Install the poll callback
   init_poll_funcptr(&epq.pt, ep_ptable_queue_proc);
   /* Call the file's poll function to get the current event bits; in reality this is used to
      invoke the registered callback ep_ptable_queue_proc (called from inside poll_wait).
      If the fd is a socket, f_op is socket_file_ops and the poll function is sock_poll();
      for a TCP socket this ends up in tcp_poll(). The current state of the file descriptor
      is returned in revents. Inside the poll handler (tcp_poll()), sock_poll_wait() is called,
      which invokes the function pointed to by epq.pt.qproc, i.e. ep_ptable_queue_proc(). */
   revents = tfile->f_op->poll(tfile, &epq.pt);
   spin_lock(&tfile->f_ep_lock);
   list_add_tail(&epi->fllink, &tfile->f_ep_links);
   spin_unlock(&tfile->f_ep_lock);
   ep_rbtree_insert(ep, epi); // insert this epi into ep's red-black tree
   spin_lock_irqsave(&ep->lock, flags);
   // revents & event->events: the events returned by fop->poll include events the user cares about.
   // !ep_is_linked(&epi->rdllink): the epi is not yet on the ready queue (ep_is_linked tests whether the list entry is linked).
   /* If the monitored file is already ready and not yet on the ready queue, add the
      current epitem to the ready queue. If a process is waiting for this file to
      become ready, wake one waiting process. */
   if((revents & event->events) && !ep_is_linked(&epi->rdllink)) {
      list_add_tail(&epi->rdllink, &ep->rdllist); // insert the current epi into ep's ready queue
      /* If a process is waiting for the file to become ready, i.e. a process is sleeping
         in epoll_wait, wake one waiting process.
         waitqueue_active(q) returns 1 if wait queue q has waiters, otherwise 0. */
      if(waitqueue_active(&ep->wq))
         __wake_up_locked(&ep->wq, TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE);
      /* If a process is waiting for events on the eventpoll file itself, increment the
         temporary variable pwake; when pwake is non-zero, the waiters are woken up
         after the lock is released. */
      if(waitqueue_active(&ep->poll_wait))
         pwake++;
   }
   spin_unlock_irqrestore(&ep->lock, flags);
   if(pwake)
      ep_poll_safewake(&psw, &ep->poll_wait); // wake processes waiting for the eventpoll file itself to become ready
   return 0;
}
init_poll_funcptr(&epq.pt, ep_ptable_queue_proc);
revents = tfile->f_op->poll(tfile, &epq.pt);

These two calls register ep_ptable_queue_proc as the qproc member of epq.pt: typedef struct poll_table_struct { poll_queue_proc qproc; unsigned long key; } poll_table; When f_op->poll(tfile, &epq.pt) is executed, the XXX_poll(tfile, &epq.pt) function calls poll_wait(), and poll_wait() in turn calls the function pointed to by epq.pt.qproc, i.e. ep_ptable_queue_proc. The ep_ptable_queue_proc function is as follows.

/* Called from the file's poll function; adds epoll's callback to the wakeup queue
   of the target file. If the monitored file is a socket, the parameter whead is the
   address of the sock structure's sk_sleep member. */
static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead, poll_table *pt) {
/* struct ep_pqueue{
         poll_table pt;
         struct epitem *epi;
      } */
    struct epitem *epi = ep_item_from_epqueue(pt); // get the epi field of struct ep_pqueue from pt
    struct eppoll_entry *pwq;
    if (epi->nwait >= 0 && (pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL))) {
        init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
        pwq->whead = whead;
        pwq->base = epi;
        add_wait_queue(whead, &pwq->wait);
        list_add_tail(&pwq->llink, &epi->pwqlist);
        epi->nwait++;
    } else {
        /* We have to signal that an error occurred */
        /* If the allocation fails, set nwait to -1 to indicate that an
           error occurred (memory allocation failure, or an earlier error). */
        epi->nwait = -1;
    }
}
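To see where the whead parameter handed to ep_ptable_queue_proc comes from, here is a hypothetical driver poll method, written as a sketch against the older 2.6-era interface used throughout this article. poll_wait() does not sleep; it only passes the device wait queue to pt->qproc, which is ep_ptable_queue_proc when the call arrives through epoll. The names mydev_waitq and mydev_data_ready are invented for illustration.

#include <linux/fs.h>
#include <linux/poll.h>
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(mydev_waitq);   /* woken by the driver's RX path */
static int mydev_data_ready;                   /* hypothetical readiness flag */

static unsigned int mydev_poll(struct file *file, poll_table *pt)
{
    unsigned int mask = 0;

    /* Hooks the caller onto mydev_waitq via pt->qproc (ep_ptable_queue_proc
     * for epoll); it does not block. */
    poll_wait(file, &mydev_waitq, pt);

    if (mydev_data_ready)
        mask |= POLLIN | POLLRDNORM;

    return mask;   /* current event bits, cf. revents in ep_insert() */
}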


The struct eppoll_entry used above is defined as follows.

struct eppoll_entry {
   struct list_head llink;
   struct epitem *base;
   wait_queue_t wait;
   wait_queue_head_t *whead;
};
/* ep_ptable_queue_proc adds the epitem to the wait queue of a specific file.
   It takes three parameters:
   struct file *file;            the file object of the fd
   wait_queue_head_t *whead;     the fd's device wait queue, like mydev->wait_address in select
   poll_table *pt;               the epq.pt passed in f_op->poll(tfile, &epq.pt)        */

The ep_ptable_queue_proc function introduces another very important data structure, eppoll_entry. It completes the association between the epitem and the callback function (ep_poll_callback) that is invoked when the epitem's event occurs. First, the whead of the eppoll_entry is pointed at the fd's device wait queue (the same as wait_address in select), then the base field of the eppoll_entry is initialized to point to the epitem, and finally the eppoll_entry is mounted on the fd's device wait queue with add_wait_queue. After this step, the eppoll_entry is attached to the fd's device wait queue.

Because ep_ptable_queue_proc sets ep_poll_callback as the wakeup function of the wait queue entry, when data arrives from the device hardware and the interrupt handling path wakes up the processes waiting on that wait queue, ep_poll_callback is called.

static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *key) {
   int pwake = 0;
   unsigned long flags;
   struct epitem *epi = ep_item_from_wait(wait);
   struct eventpoll *ep = epi->ep;
   spin_lock_irqsave(&ep->lock, flags);
   // Check the registered events of interest
   // #define EP_PRIVATE_BITS  (EPOLLONESHOT | EPOLLET)
   // If nothing besides EPOLLONESHOT / EPOLLET is registered, there is nothing to report.
   if (!(epi->event.events & ~EP_PRIVATE_BITS))
      goto out_unlock;
   if (unlikely(ep->ovflist != EP_UNACTIVE_PTR)) {
      if (epi->next == EP_UNACTIVE_PTR) {
         epi->next = ep->ovflist;
         ep->ovflist = epi;
      }
      goto out_unlock;
   }
   if (ep_is_linked(&epi->rdllink))
      goto is_linked;
   // *** key step ***: add this fd to the ready list that epoll watches
   list_add_tail(&epi->rdllink, &ep->rdllist);
is_linked:
   // Wake the process sleeping in epoll_wait(); the user-level epoll_wait(...) returns before its timeout.
   if (waitqueue_active(&ep->wq))
      __wake_up_locked(&ep->wq, TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE);
   if (waitqueue_active(&ep->poll_wait))
      pwake++;
out_unlock:
   spin_unlock_irqrestore(&ep->lock, flags);
   if (pwake)
      ep_poll_safewake(&psw, &ep->poll_wait);
   return 1;
}

So the main job of ep_poll_callback is to add the epitem corresponding to the file to the ready queue when a monitored file's wait event becomes ready; when the user calls epoll_wait(), the kernel reports the events in the ready queue to the user.

The epoll_wait implementation is as follows.

SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events, int, maxevents, int, timeout)  {
   int error;
   struct file *file;
   struct eventpoll *ep;
   /* Check the maxevents parameter. */
   if (maxevents <= 0 || maxevents > EP_MAX_EVENTS)
      return -EINVAL;
   /* Check that the user-space memory pointed to by events is writable. See __range_not_ok(). */
   if (!access_ok(VERIFY_WRITE, events, maxevents * sizeof(struct epoll_event))) {
      error = -EFAULT;
      goto error_return;
   }
   /* Get the file instance of the eventpoll file behind epfd; this file structure was created in epoll_create. */
   error = -EBADF;
   file = fget(epfd);
   if (!file)
      goto error_return;
   /* Check whether epfd really is an eventpoll file by checking whether its file
      operations are eventpoll_fops; if not, return -EINVAL. */
   error = -EINVAL;
   if (!is_file_epoll(file))
      goto error_fput;
   /* At this point it is safe to assume that the "private_data" contains our eventpoll structure. */
   ep = file->private_data;
   /* Time to fish for events ... */
   error = ep_poll(ep, events, maxevents, timeout);
error_fput:
   fput(file);
error_return:
   return error;
}

epoll_wait calls ep_poll. ep_poll is implemented as follows.

static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events, int maxevents, long timeout) {
   int res, eavail;
   unsigned long flags;
   long jtimeout;
   wait_queue_t wait;
   /* timeout is in milliseconds and has to be converted to jiffies; adding 999 (i.e. 1000 - 1) rounds up. */
   jtimeout = (timeout < 0 || timeout >= EP_MAX_MSTIMEO) ? MAX_SCHEDULE_TIMEOUT : (timeout * HZ + 999) / 1000;
retry:
   spin_lock_irqsave(&ep->lock, flags);
   res = 0;
   if (list_empty(&ep->rdllist)) {
      /* No events, so we need to sleep; when an event arrives, the sleep is woken by ep_poll_callback. */
      init_waitqueue_entry(&wait, current); // put the current process on the wait entry
      wait.flags |= WQ_FLAG_EXCLUSIVE;
      /* Add the current process to eventpoll's wait queue and wait until a file becomes ready, the timeout expires, or a signal arrives. */
      __add_wait_queue(&ep->wq, &wait);
      for (;;) {
         /* ep_poll_callback() must be able to wake the current process, so its state has to be TASK_INTERRUPTIBLE. */
         set_current_state(TASK_INTERRUPTIBLE);
         /* If the ready queue is not empty (some file is ready) or the timeout has expired, leave the loop. */
         if (!list_empty(&ep->rdllist) || !jtimeout)
            break;
         /* If the current process received a signal, leave the loop and return -EINTR. */
         if (signal_pending(current)) {
            res = -EINTR;
            break;
         }
         spin_unlock_irqrestore(&ep->lock, flags);
         /* Yield the processor and wait to be woken by ep_poll_callback() or by the timeout;
            the return value is the remaining time. From here on the current process sleeps
            until some file becomes ready or the timeout expires. When a file becomes ready,
            eventpoll's callback ep_poll_callback() wakes the processes on the wait queue
            pointed to by ep->wq. */
         jtimeout = schedule_timeout(jtimeout);
         spin_lock_irqsave(&ep->lock, flags);
      }
      __remove_wait_queue(&ep->wq, &wait);
      set_current_state(TASK_RUNNING);
   }
   /* The ep->ovflist list temporarily stores fds that become ready while events are being
    * transferred to user space. So whether the ready queue ep->rdllist is non-empty or
    * ep->ovflist is not equal to EP_UNACTIVE_PTR, a file may already be ready.
    * ep->ovflist != EP_UNACTIVE_PTR covers two cases: it can be NULL, meaning events may be
    * in the middle of being transferred to user space and a file is not necessarily ready,
    * or it can be non-NULL, in which case a file is definitely ready; see ep_send_events().
    */
   eavail = !list_empty(&ep->rdllist) || ep->ovflist != EP_UNACTIVE_PTR;
   spin_unlock_irqrestore(&ep->lock, flags);
   /* Try to transfer events to user space. In case we get 0 events and there's still timeout left over, we go trying again in search of more luck. */
   /* If we were not interrupted by a signal, events are available, but we did not actually
      fetch any (another process may have taken them) and the timeout has not expired, jump
      back to retry and wait again for a file to become ready. */
   if (!res && eavail && !(res = ep_send_events(ep, events, maxevents)) && jtimeout)
      goto retry;
   /* Return the number of events fetched, or an error code. */
   return res;
}

Trivia

Promiscuous mode

Promiscuous mode is a term used in computer networking. It refers to a machine’s network card being able to receive all traffic that passes through it, regardless of the destination address.

Promiscuous mode is commonly used in network analysis
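For illustration, the following C sketch enables promiscuous mode on an interface via ioctl(), which is roughly what libpcap/tcpdump arranges. The interface name "eth0" is an assumption, and the program needs CAP_NET_ADMIN (e.g. run as root).

/* Minimal sketch: set the IFF_PROMISC flag on an interface.
 * "eth0" is an assumed interface name; adjust for your machine. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct ifreq ifr;

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);

    if (ioctl(fd, SIOCGIFFLAGS, &ifr) < 0) {   /* read current interface flags */
        perror("SIOCGIFFLAGS");
        return 1;
    }
    ifr.ifr_flags |= IFF_PROMISC;              /* set the promiscuous bit */
    if (ioctl(fd, SIOCSIFFLAGS, &ifr) < 0) {   /* write the flags back */
        perror("SIOCSIFFLAGS");
        return 1;
    }
    printf("%s is now in promiscuous mode\n", ifr.ifr_name);
    close(fd);
    return 0;
}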

DMA

DMA, full name Direct Memory Access, means direct memory access.

A DMA transfer copies data from one address space to another, providing high-speed data transfer between peripherals and memory or between memory and memory. The CPU only initiates the transfer; the transfer itself is carried out and completed by the DMA controller. Unlike interrupt handling, there is no saving and restoring of CPU context: the hardware opens a direct data path between RAM and the I/O device, which makes the CPU much more efficient.

Main features of DMA:

  • Each channel is directly connected to a dedicated hardware DMA request, and each channel also supports software triggering; both are configured via software.
  • Priority between multiple requests on the same DMA module can be programmed by software (there are four levels: very high, high, medium and low), and priority settings are determined by hardware when they are equal (request 0 has priority over request 1, and so on).
  • Independent transfer widths (byte, half-word, word) for the source and destination data areas, emulating packing and unpacking. Source and destination addresses must be aligned to the transfer width.
  • Supports circular buffer management.
  • Each channel has 3 event flags (DMA half-transfer, DMA transfer complete, and DMA transfer error), which are logically ORed into a single interrupt request.
  • Transfers between memory and memory, peripheral and memory, and memory and peripheral.
  • Flash, SRAM, SRAM of peripherals, APB1, APB2 and AHB peripherals can be used as sources and targets for accesses.
  • Programmable number of data transfers: up to 65535 (0xFFFF).

Non-blocking socket programming to handle EAGAIN errors

On Linux, when receiving data from a non-blocking socket, you often see "Resource temporarily unavailable" with errno 11 (EAGAIN). What does this mean? It means you invoked an operation that would normally block while the socket is in non-blocking mode, and the operation could not complete immediately. For non-blocking sockets, EAGAIN is not an error. On VxWorks and Windows, EAGAIN is called EWOULDBLOCK.
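A typical way to handle this in C is a read helper that treats EAGAIN/EWOULDBLOCK as "no data yet, try again after the next EPOLLIN" rather than as a failure. This is a minimal sketch; the fd is assumed to be a connected, non-blocking socket.

/* Minimal sketch: read from a non-blocking socket, treating EAGAIN /
 * EWOULDBLOCK as "no data right now" instead of a fatal error. */
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>

ssize_t read_nonblocking(int fd, char *buf, size_t len)
{
    for (;;) {
        ssize_t n = recv(fd, buf, len, 0);
        if (n >= 0)
            return n;                  /* data received, or 0 on orderly shutdown */
        if (errno == EINTR)
            continue;                  /* interrupted by a signal: just retry */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return -1;                 /* not a failure: nothing to read yet, the caller
                                          should wait for the next readiness notification */
        perror("recv");                /* any other errno is a genuine error */
        return -1;                     /* errno is preserved for the caller */
    }
}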