Regardless of the field, building is often much harder than destroying. Constructing a building may take years; demolishing it can be a matter of seconds. At the root of this is the fact that construction is an entropy-reducing process, so it requires an input of energy from outside the system. One class of creation problems seems even more special: the problem of origins. How did the first life come into being? How did mankind originate? Questions like these have never ceased to fascinate us.

Today we are going to explore process creation in Linux. Most of you probably know about fork, and you know that a process splits in two much like a cell dividing. But have you considered the following questions: what must be prepared before the split? In this so-called division, what exactly is divided and what is inherited? How is the genetic material passed on? And how did the very first process come into being?

Let’s explore these questions together!

What is a process

Definition and Background

Before discussing any issue, we need to define it clearly; otherwise the discussion is pointless. Since we are talking about processes, we first need to know what a process is.

This is how processes are usually defined in books: a process is an instance of a program in execution. You can think of a process as a collection of data structures that describe how far a program's execution has progressed.

This is a simple definition, but it doesn’t help us much in understanding what lies underneath. The process is actually a concept that arrived with multiprogramming in operating systems. Early computers did not support multitasking and could only run one program at a time, which was too inefficient. So from the Intel 80286/80386 onwards, CPUs began to support protected mode, and operating systems began to support multitasking along with it. In the uniprogramming era, the CPU and memory were all yours, to use as you pleased. Not so in the multiprogramming era: system resources (CPU, memory, etc.) must be shared, and without corresponding management the whole system descends into chaos.

So the concept of the process was abstracted out. It takes on the role of the basic unit to which system resources (CPU time, memory, etc.) are allocated, and it provides isolation and protection for access to shared resources.

Processes are like humans: they are created, have their own life cycle, and may be in any of several states. Likewise, there are parent-child and other kinship relationships between processes.

Process Descriptors

In order to manage processes, the kernel must describe the state of each process precisely. This is where the process descriptor comes in. task_struct is a very complex structure that contains all the information related to a process. Even as early as Linux 0.11, it already contained many fields: basic process state information, memory-related information, file-system-related information, the local descriptor table ldt, and the task state segment tss. Here ldt holds the descriptors for the process's code and data segments, and tss records the CPU register state used when the process is switched.

struct task_struct {
/* these are hardcoded - don't touch */
    long state; /* -1 unrunnable, 0 runnable, >0 stopped */
    long counter;
    long priority;
    long signal;
    struct sigaction sigaction[32];
    long blocked;   /* bitmap of masked signals */
/* various fields */
    int exit_code;
    unsigned long start_code,end_code,end_data,brk,start_stack;
    long pid,father,pgrp,session,leader;
    unsigned short uid,euid,suid;
    unsigned short gid,egid,sgid;
    long alarm;
    long utime,stime,cutime,cstime,start_time;
    unsigned short used_math;
/* file system info */
    int tty;        /* -1 if no tty, so it must be signed */
    unsigned short umask;
    struct m_inode * pwd;
    struct m_inode * root;
    struct m_inode * executable;
    unsigned long close_on_exec;
    struct file * filp[NR_OPEN];
/* ldt for this task 0 - zero 1 - cs 2 - ds&ss */
    struct desc_struct ldt[3];
/* tss for this task */
    struct tss_struct tss;
};

The tss_struct structure is shown below; as you can see, it is mostly register state.

struct tss_struct {
    long    back_link;  /* 16 high bits zero */
    long    esp0;
    long    ss0;        /* 16 high bits zero */
    long    esp1;
    long    ss1;        /* 16 high bits zero */
    long    esp2;
    long    ss2;        /* 16 high bits zero */
    long    cr3;
    long    eip;
    long    eflags;
    long    eax,ecx,edx,ebx;
    long    esp;
    long    ebp;
    long    esi;
    long    edi;
    long    es;     /* 16 high bits zero */
    long    cs;     /* 16 high bits zero */
    long    ss;     /* 16 high bits zero */
    long    ds;     /* 16 high bits zero */
    long    fs;     /* 16 high bits zero */
    long    gs;     /* 16 high bits zero */
    long    ldt;        /* 16 high bits zero */
    long    trace_bitmap;   /* bits: trace 0, bitmap 16-31 */
    struct i387_struct i387;
};

Note: The code and narrative in this article are based on Linux 0.11

Process 0

Process 0 is like Adam, the ancestor of all processes, and Linus is like God: he pre-set the data structures of process 0 statically, then breathed life into it during the kernel initialization phase, and process 0 came alive.

As you can see below, the task_struct of process 0 is pre-set in the static variable init_task via the INIT_TASK macro. From this we can also see that the page holding the task_struct doubles as the kernel stack, with the stack pointer starting at the end of the page.

static union task_union init_task = {INIT_TASK,};

union task_union {
    struct task_struct task;
    char stack[PAGE_SIZE];
};

The macro INIT_TASK pre-sets the fields of the task_struct structure, including ldt and tss. You can work out the meaning of each initial value by matching it against the task_struct fields shown earlier. A few special ones deserve mention here:

  • ldt is the local descriptor table. 0x9f, 0xc0fa00 and 0x9f, 0xc0f200 are the code segment and data segment descriptors of process 0 respectively, with base address 0x0, limit 640 KB, G=1, D=1, DPL=3, P=1, and TYPE 0x0a (code) and 0x02 (data). Since their base is 0, they cover the same memory as the kernel's code and data segments, only at user privilege.
  • tss is the task state segment
    • The second field, PAGE_SIZE+(long)&init_task, is the kernel stack pointer esp0, which points to the end of the page holding init_task.
    • The following ss0 is initialized to 0x10, the kernel data segment selector, used as the kernel stack segment.
    • (long)&pg_dir initializes cr3 to the address of the page directory.
    • The next 10 registers are initialized to 0, including the user-mode eip and esp.
    • All 6 segment registers are initialized to 0x17, the user-mode data segment selector.
    • _LDT(0) is the LDT selector for process 0.
#define INIT_TASK \
/* state etc */ { 0,15,15, \
/* signals */   0,{{},},0, \
/* ec,brk... */ 0,0,0,0,0,0, \
/* pid etc.. */ 0,-1,0,0,0, \
/* uid etc */   0,0,0,0,0,0, \
/* alarm */ 0,0,0,0,0,0, \
/* math */  0, \
/* fs info */   -1,0022,NULL,NULL,NULL,0, \
/* filp */  {NULL,}, \
    { \
        {0,0}, \
/* ldt */   {0x9f,0xc0fa00}, \
        {0x9f,0xc0f200}, \
    }, \
/*tss*/ {0,PAGE_SIZE+(long)&init_task,0x10,0,0,0,0,(long)&pg_dir,\
     0,0,0,0,0,0,0,0, \
     0,0,0x17,0x17,0x17,0x17,0x17,0x17, \
     _LDT(0),0x80000000, \
        {} \
    }, \
}

You can see that most of the process information has been set statically, but process 0 is still lifeless. Let’s see how the kernel breathes air into it to bring it to life.

The following is the kernel main function. Note that main is not the entry point of the kernel image; the system bootloader runs first, and main is reached only after the system has entered protected mode and completed some initialization. Initially the system uses a temporary stack; after entering protected mode, the stack segment is set to the kernel data segment (0x10) and esp points to the end of the user_stack array, which is still in use when the system first enters main.

void main(void)
{
    // ...
    sched_init();
    sti();
    move_to_user_mode();
    if (!fork()) {      /* we count on this going ok */
        init();
    }
    for(;;) pause();
}

Part of the initialization of process 0 is done in the sched_init() function. First, the TSS and LDT descriptor entries for process 0 are set in the global descriptor table (GDT), then loaded into the tr and ldtr registers respectively. This is the only time the kernel loads an LDT descriptor explicitly; afterwards, the CPU loads a process's LDT automatically, according to the LDT selector in its TSS, during a task switch.

// Set process 0's TSS and LDT descriptors in the GDT
set_tss_desc(gdt+FIRST_TSS_ENTRY,&(init_task.task.tss));
set_ldt_desc(gdt+FIRST_LDT_ENTRY,&(init_task.task.ldt));
// Load process 0's TSS into tr and its LDT into ldtr
ltr(0);
lldt(0);

The next call, move_to_user_mode(), delivers the final breath. It accomplishes two historic tasks: moving from kernel mode to user mode, and starting process 0. It does so by manually simulating an interrupt return, pushing specific values onto the stack:

  • Push the user-mode stack segment selector: 0x17 selects the data/stack segment in the user-mode local table (previously ss was 0x10)
  • Push the user-mode stack pointer, which inherits the esp from just before move_to_user_mode
  • Push the flags register eflags
  • Push the user-mode code segment selector: 0x0f selects the code segment in the user-mode local table (previously cs was 0x08)
  • Push the return offset, which is the location of label 1 below
#define move_to_user_mode() \
__asm__ ("movl %%esp,%%eax\n\t" \
    "pushl $0x17\n\t" \
    "pushl %%eax\n\t" \
    "pushfl\n\t" \
    "pushl $0x0f\n\t" \
    "pushl $1f\n\t" \
    "iret\n" \
    "1:\tmovl $0x17,%%eax\n\t" \
    "movw %%ax,%%ds\n\t" \
    "movw %%ax,%%es\n\t" \
    "movw %%ax,%%fs\n\t" \
    "movw %%ax,%%gs" \
    :::"ax")

Finally, iret is executed, the system moves from kernel mode to user mode, and process 0 comes to life. Its kernel stack was set statically in the init_task structure, and its user-mode stack segment and code segment were also set statically in init_task's ldt. Their base address is 0, so they still cover the kernel code and data, but the privilege level is now user level. The user-mode stack pointer inherits the previous esp, so it still points to the original location, and eip was manually pushed as the address of the instruction after iret (label 1). So after iret, execution of the code simply continues.

The task of process 0 is simple: fork process 1, then enter an infinite loop. Process 0 runs only when no other process on the system is runnable, which is why it is also called the idle process.

if (!fork()) {      /* we count on this going ok */
    init();
}
for(;;) pause();

fork and system calls

Process 1 has a slightly special status: it is forked from process 0, so its code and data segments are the same as process 0's. All subsequent processes are descendants of process 1, created by fork followed by exec, so their code and data segments are no longer the kernel's code and data segments.

Next, let’s explore how fork splits one process into two, starting with the definition of the fork system call.

static inline _syscall0(int,fork)

_syscall0 is a macro; the kernel defines a family of such macros, where the trailing number indicates the number of arguments.

#define _syscall0(type,name) \
...
#define _syscall1(type,name,atype,a) \
...
#define _syscall2(type,name,atype,a,btype,b) \
...
#define _syscall3(type,name,atype,a,btype,b,ctype,c) \

Let’s look at the specific definition of _syscall0.

#define _syscall0(type,name) \
type name(void) \
{ \
long __res; \
__asm__ volatile ("int $0x80" \
    : "=a" (__res) \
    : "0" (__NR_##name)); \
if (__res >= 0) \
    return (type) __res; \
errno = -__res; \
return -1; \
}

These macros expand to functions containing inline assembly with a single int $0x80 instruction, where 0x80 is the interrupt number for system calls. There is one output operand and one input operand. The local variable __res is bound to the eax register as the output operand and receives the return value. The input operand __NR_fork is the system call number (each system call has its own number), and it, too, is bound to eax.

After the int instruction executes, the CPU looks up the corresponding interrupt descriptor in the IDT. Since the system call is installed as a system gate (a trap gate with DPL 3), it can be invoked from user mode. The gate descriptor holds the selector of the segment containing the interrupt routine and the offset within it, and the CPU jumps there. For system calls, that target is the system_call routine below.

In this way the process enters kernel space through the system call. The CPU's interrupt sequence, triggered by the int instruction, automatically pushes the original user-mode stack segment ss, stack pointer esp, flags register eflags, user code segment cs, and user eip (note: onto the kernel stack). Why does the CPU push these registers? Because it has to remember the way back. Just as a function call pushes a return address, so does an interrupt. In addition, kernel mode and user mode use separate stacks, so the user-mode stack segment ss and stack pointer esp must be pushed as well. Thus, upon first entering kernel space, the kernel stack pointer sits at esp0 in the diagram further below.

The code at the beginning of system_call is common to all system calls. It first checks whether the system call number in eax is out of range; if it is valid, it pushes the segment registers and general registers, then resets ds and es to the kernel data segment and fs to the user data segment. Finally, it looks up the handler for the system call number in a table and calls it.

system_call:
    cmpl $nr_system_calls-1,%eax
    ja bad_sys_call
    push %ds
    push %es
    push %fs
    pushl %edx
    pushl %ecx      # push %ebx,%ecx,%edx as parameters
    pushl %ebx      # to the system call
    movl $0x10,%edx     # set up ds,es to kernel space
    mov %dx,%ds
    mov %dx,%es
    movl $0x17,%edx     # fs points to local data space
    mov %dx,%fs
    call sys_call_table(,%eax,4)

The kernel stack during the fork system call looks like this:

HIGH  +----------------------+
      |          |  prev ss  |
      +----------------------+
      |       prev esp       |
      +----------------------+
      |        eflags        |
      +----------------------+
      |          |  prev cs  |
      +----------------------+
      |       prev eip       |<-- esp0
      +----------------------+
      |          |     ds    |
      +----------------------+
      |          |     es    |
      +----------------------+
      |          |     fs    |
      +----------------------+
      |         edx          |
      +----------------------+
      |         ecx          |
      +----------------------+
      |         ebx          |
      +----------------------+
      |         eip1         |<-- esp1
      +----------------------+
      |          |     gs    |
      +----------------------+
      |         esi          |
      +----------------------+
      |         edi          |
      +----------------------+
      |         ebp          |
      +----------------------+
      |         eax          | nr
      +----------------------+
      |         eip2         |<-- esp2
LOW   +----------------------+    

For the fork system call, control goes to sys_fork, at which point the kernel stack pointer is at esp1. The find_empty_process function just obtains a free process slot number (note: not a pid); if there are no free slots, it jumps straight to label 1 and returns. Then all the remaining program-visible registers are pushed, and finally the new slot number (in eax) is pushed.

sys_fork:
    call find_empty_process
    testl %eax,%eax
    js 1f
    push %gs
    pushl %esi
    pushl %edi
    pushl %ebp
    pushl %eax
    call copy_process
    addl $20,%esp
1:  ret

Next, the core function copy_process is called; on entry the kernel stack pointer is at esp2. The function has many parameters, which are in fact the registers pushed earlier. Because the C calling convention on x86 pushes arguments from right to left, the nr pushed last becomes the first parameter. So many parameters are needed because the parent must pass the values of all these registers on to the child, while most of the remaining information can be obtained through the process descriptor. The function first allocates a free page to serve as the new process's task_struct, then saves the pointer into the task array.

int copy_process(int nr,long ebp,long edi,long esi,long gs,long none,
        long ebx,long ecx,long edx,
        long fs,long es,long ds,
        long eip,long cs,long eflags,long esp,long ss)
{
    struct task_struct *p;
    int i;
    struct file *f;

    p = (struct task_struct *) get_free_page();
    if (!p)
        return -EAGAIN;
    task[nr] = p;

The next step is initializing the process descriptor of the child process p. The first line below copies the parent's process descriptor directly into the child, so any field not modified afterwards defaults to the parent's value. But the child is, after all, a separate process, and the necessary fields must be changed: pid, parent, time slice, signals, timer, accounting times, and so on.

*p = *current;  /* NOTE! this doesn't copy the supervisor stack */
p->state = TASK_UNINTERRUPTIBLE;
p->pid = last_pid;
p->father = current->pid;
p->counter = p->priority;
p->signal = 0;
p->alarm = 0;
p->leader = 0;      /* process leadership doesn't inherit */
p->utime = p->stime = 0;
p->cutime = p->cstime = 0;
p->start_time = jiffies;

The next step is initializing the child's tss. Note that its kernel stack points to the end of the page holding the task_struct: p->tss.esp0 = PAGE_SIZE + (long) p. The kernel stack segment selector ss0 is 0x10, the kernel data segment. The remaining register values are copied from the parent process, with one exception: eax. You can see that the child's eax is set to 0, whereas the parent's eax (its eventual return value) is last_pid. The eax value here becomes the return value of the fork call, which is exactly why fork returns 0 in the child and the child's pid in the parent.

p->tss.ldt = _LDT(nr) sets the LDT selector of the new process, so that the process can find its LDT when it is switched in.

p->tss.back_link = 0;
p->tss.esp0 = PAGE_SIZE + (long) p; /* kernel stack pointer */
p->tss.ss0 = 0x10;
p->tss.eip = eip;
p->tss.eflags = eflags;
p->tss.eax = 0;     /* the new process's return value is 0 */
p->tss.ecx = ecx;
p->tss.edx = edx;
p->tss.ebx = ebx;
p->tss.esp = esp;
p->tss.ebp = ebp;
p->tss.esi = esi;
p->tss.edi = edi;
p->tss.es = es & 0xffff;
p->tss.cs = cs & 0xffff;
p->tss.ss = ss & 0xffff;
p->tss.ds = ds & 0xffff;
p->tss.fs = fs & 0xffff;
p->tss.gs = gs & 0xffff;
p->tss.ldt = _LDT(nr);
p->tss.trace_bitmap = 0x80000000;
if (last_task_used_math == current)
    __asm__("clts ; fnsave %0"::"m" (p->tss.i387));

The next copy_mem() function starts copying the memory.

if (copy_mem(nr,p)) {
    task[nr] = NULL;
    free_page((long) p);
    return -EAGAIN;
}

Looking at its concrete implementation, we can see that it does not copy memory pages directly, only page tables, and only the page tables of the data segment. (This suffices because in Linux 0.11 all processes share one page directory, the code segment and data segment have the same base address, and the data segment is never shorter than the code segment.)

(Linux 0.11 supports at most 64 simultaneous processes, and each process's linear base address is its process number times 64 MB, i.e. nr * 0x4000000 in the code below.)

Earlier, only the ldt selector in the tss was set; the LDT table itself is still a copy of the parent's. So the user code and data segment descriptors must be updated here with the new base address.

This is only the first piece of the copy-on-write puzzle, not the complete copy-on-write mechanism, so for reasons of space I won't expand on it here. (Perhaps a separate article later.)

int copy_mem(int nr,struct task_struct * p)
{
    unsigned long old_data_base,new_data_base,data_limit;
    unsigned long old_code_base,new_code_base,code_limit;

    code_limit=get_limit(0x0f);
    data_limit=get_limit(0x17);
    old_code_base = get_base(current->ldt[1]);
    old_data_base = get_base(current->ldt[2]);
    if (old_data_base != old_code_base)
        panic("We don't support separate I&D");
    if (data_limit < code_limit)
        panic("Bad data_limit");
    new_data_base = new_code_base = nr * 0x4000000;
    p->start_code = new_code_base;
    set_base(p->ldt[1],new_code_base);
    set_base(p->ldt[2],new_data_base);
    if (copy_page_tables(old_data_base,new_data_base,data_limit)) {
        printk("free_page_tables: from copy_mem\n");
        free_page_tables(new_data_base,data_limit);
        return -ENOMEM;
    }
    return 0;
}

Let’s go ahead and finish the rest of the copy_process() function first.

for (i=0; i<NR_OPEN;i++)
    if ((f=p->filp[i]))
        f->f_count++;
if (current->pwd)
    current->pwd->i_count++;
if (current->root)
    current->root->i_count++;
if (current->executable)
    current->executable->i_count++;
set_tss_desc(gdt+(nr<<1)+FIRST_TSS_ENTRY,&(p->tss));
set_ldt_desc(gdt+(nr<<1)+FIRST_LDT_ENTRY,&(p->ldt));
p->state = TASK_RUNNING;    /* do this last, just in case */
return last_pid;

This code increments reference counts: on every file opened by the process, and on the three i-nodes pwd, root, and executable. Then the child's TSS and LDT descriptor entries are set in the GDT. Finally, the process state is changed to runnable and the child's pid is returned.

What precedes is the inbound half of the system call; what follows is the return half. Note that the eax pushed on the stack here is fork's return value. After the system call handler finishes, scheduling may occur at this point, so whether the parent or the child runs first after fork is not determined.

pushl %eax
movl current,%eax
cmpl $0,state(%eax)     # state
jne reschedule
cmpl $0,counter(%eax)       # counter
je reschedule

If the system call was made from user mode, pending signals are also handled, which is not our focus today and won't be expanded on here. Finally, all previously pushed registers are restored, and iret returns to user space to continue execution. At this point, the kernel stack is restored to its original clean state, cs/eip/ss/esp are restored to their values before the int instruction executed, and the process continues running right after the user-mode fork() call.

ret_from_sys_call:
    movl current,%eax       # task[0] cannot have signals
    cmpl task,%eax
    je 3f
    cmpw $0x0f,CS(%esp)     # was old code segment supervisor ?
    jne 3f
    cmpw $0x17,OLDSS(%esp)      # was stack segment = 0x17 ?
    jne 3f
    movl signal(%eax),%ebx
    movl blocked(%eax),%ecx
    notl %ecx
    andl %ebx,%ecx
    bsfl %ecx,%ecx
    je 3f
    btrl %ecx,%ebx
    movl %ebx,signal(%eax)
    incl %ecx
    pushl %ecx
    call do_signal
    popl %eax
3:  popl %eax
    popl %ebx
    popl %ecx
    popl %edx
    pop %fs
    pop %es
    pop %ds
    iret

Summary

To briefly review: creating a process mainly means initializing its process descriptor structure, including the ldt and tss inside it. The memory copy, thanks to copy-on-write, copies only the page tables (in Linux 0.11 all processes share one page directory). Creation also sets the process's LDT and TSS descriptors in the GDT global descriptor table.

Process 0 is special as the ancestor of all processes: its process descriptor structure is set statically, its LDT and TSS descriptors are set in the GDT during the initialization phase, and tr and ldtr are loaded manually. Finally, it is moved to user mode by manually simulating an interrupt return, which sets its stack pointer and eip.

Subsequent processes are created by fork, with most of the information in the process descriptor copied directly from the parent. To keep the child's state consistent with the parent's, all the program-visible registers are passed as arguments to copy_process. Still, as an independent process the child differs from the parent in a few places, notably the kernel stack pointer, the local descriptor table LDT, and the return value in eax.

The questions raised at the beginning should all be answered by now.