A system call is how a program requests a service from the operating system kernel at run time. Such services include hardware access, the creation and execution of new processes, and process scheduling. Anyone with a little knowledge of operating systems knows that system calls are the interface between user programs and the operating system.

glibc, the well-known C library, wraps the system calls provided by the operating system behind a well-defined interface, so engineers can build higher-level applications directly on the functions the library provides.

We deal with system calls constantly when using the standard library, but we often do not know what the library does underneath. Take the common Hello World program as an example: these few simple lines perform dozens of system calls when they actually run:

#include <stdio.h>
int main() {
   printf("Hello, World!");
   return 0;
}

$ gcc hello.c -o hello
$ strace ./hello
execve("./hello", ["./hello"], 0x7ffd64dd8090 /* 23 vars */) = 0
brk(NULL)                               = 0x557b449db000
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=26133, ...}) = 0
mmap(NULL, 26133, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f645455a000
close(3)                                = 0
...
munmap(0x7f645455a000, 26133)           = 0
fstat(1, {st_mode=S_IFCHR|0600, st_rdev=makedev(136, 0), ...}) = 0
brk(NULL)                               = 0x557b449db000
brk(0x557b449fc000)                     = 0x557b449fc000
write(1, "Hello, World!", 13Hello, World!)           = 13
exit_group(0)                           = ?
+++ exited with 0 +++

strace is a Linux tool for monitoring and tampering with interactions between processes and the kernel. The command above prints the system calls made during the execution of hello, together with their parameters and return values. Most of them come from program startup; only the system calls after munmap are triggered by the printf function. An application can do very little on its own and relies on services provided by the operating system for many functions.
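In the trace, printf ends up as a single write(1, ..., 13) call. As a minimal sketch, the same system call can be issued directly through glibc's generic syscall() wrapper, skipping stdio entirely. The function name hello_via_syscall is hypothetical and only used for illustration.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Write msg to standard output with a direct write system call,
 * bypassing stdio buffering. Returns the number of bytes written,
 * or -1 on error (with errno set by the syscall() wrapper). */
long hello_via_syscall(const char *msg) {
    return syscall(SYS_write, STDOUT_FILENO, msg, strlen(msg));
}
```

Calling hello_via_syscall("Hello, World!") from main and running the program under strace should show the same write line as the printf version.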

A function call in most programming languages simply allocates new stack space, writes arguments to registers, and executes the CALL instruction to jump to the target address; when the function returns, its return value is passed back on the stack or in a register. System calls consume more resources than function calls. As shown in the following figure, a system call made with the SYSCALL instruction takes tens of times longer than a C function call.

The vDSO in the above figure stands for virtual dynamic shared object. It can reduce the time consumed by system calls, and we will analyze its implementation in detail later.

getpid(2) is a relatively fast system call: it takes no arguments and only switches to the kernel state, reads a variable, and returns the PID, so its execution time can serve as a baseline for system calls. Besides getpid(2), calling close(999) to close a nonexistent file descriptor is also relatively cheap. Of course, if you want to measure the pure overhead of a system call, implementing a custom empty system call should be the perfect choice, and interested readers can try it out for themselves.
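A rough way to reproduce this kind of comparison is sketched below. It is not the benchmark behind the figure: it simply times a loop of getpid system calls against a loop of ordinary function calls. The names plain_call, ns_per_syscall, and ns_per_call are hypothetical, and the noinline attribute and empty asm statement assume GCC or Clang.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Ordinary function used as the baseline. noinline and the empty asm
 * statement keep the compiler from removing the call. */
__attribute__((noinline)) long plain_call(long x) {
    __asm__ volatile("" ::: "memory");
    return x + 1;
}

/* Nanoseconds elapsed between two timespec readings. */
static double elapsed_ns(struct timespec a, struct timespec b) {
    return (double)(b.tv_sec - a.tv_sec) * 1e9 +
           (double)(b.tv_nsec - a.tv_nsec);
}

/* Average cost in nanoseconds of one getpid system call.
 * syscall(SYS_getpid) is used so that every iteration enters the kernel. */
double ns_per_syscall(long iters) {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iters; i++)
        syscall(SYS_getpid);
    clock_gettime(CLOCK_MONOTONIC, &end);
    return elapsed_ns(start, end) / (double)iters;
}

/* Average cost in nanoseconds of one ordinary function call. */
double ns_per_call(long iters) {
    struct timespec start, end;
    volatile long sink = 0;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iters; i++)
        sink = plain_call(sink);
    clock_gettime(CLOCK_MONOTONIC, &end);
    return elapsed_ns(start, end) / (double)iters;
}
```

The absolute numbers depend heavily on the CPU, kernel version, and enabled mitigations, so only the ratio between the two results is meaningful.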

From the above benchmark of system calls versus function calls, we can see that system calls without vDSO acceleration take tens of times longer than normal function calls. Why does a system call incur so much extra overhead, and what does it actually do internally? This article describes three ways Linux performs system calls.

  • Triggering system calls using a software interrupt.
  • Triggering system calls using assembly instructions such as SYSCALL / SYSENTER.
  • Executing system calls using the virtual dynamic shared object (vDSO).

Software Interrupts

An interrupt is an input signal sent to the processor to indicate an event that requires immediate attention from the operating system. When the processor receives an interrupt, it suspends the current task, saves the context, and runs the interrupt handler to process the event. After the interrupt handler finishes, the processor restores the context and resumes the previous work.

Hardware interrupts are electronic signals triggered by devices external to the processor, while software interrupts are raised by the processor itself while executing certain instructions; some special instructions exist specifically to trigger a software interrupt intentionally.

On 32-bit x86 systems, the INT instruction triggers a software interrupt. Early Linux used INT 0x80 for system calls and registered a dedicated interrupt handler, entry_INT80_32, to process them. Let’s walk through how a system call is executed using a software interrupt.

  1. The application initiates the system call by calling a function in the C library.
  2. The C function receives the arguments passed in by the caller through the stack and copies the arguments needed for the system call into registers.
  3. Each system call in Linux has a specific number, and the function copies the number of the system call to the eax register.
  4. The function executes the INT 0x80 instruction, and the processor switches from the user state to the kernel state and runs the predefined handler.
  5. The interrupt handler entry_INT80_32 processes the system call.
    1. It executes SAVE_ALL to store the register values on the kernel stack and calls do_int80_syscall_32.
    2. It calls do_syscall_32_irqs_on to check whether the system call number is valid.
    3. It looks up the corresponding implementation in the system call table ia32_sys_call_table and passes in the register values.
    4. During execution, the system call validates its parameters and transfers data between user state memory and kernel state memory; the result of the system call is stored in the eax register.
    5. The register values are restored from the kernel stack and the return value is placed on the stack.
    6. The system call returns to the C function, and the wrapper function returns the result to the application.
  6. If an error occurs during the execution of the system call, the C function stores the error code in the global variable errno and returns an integer status based on the result of the system call.
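The dispatch and error-reporting steps above can be modelled in user space. The following toy sketch is not kernel code: every name in it (toy_table, toy_kernel_entry, toy_wrapper, and the two handlers) is hypothetical. It mirrors three ideas from the list: validating the system call number, dispatching through a table, and converting a negative error result into -1 plus errno in the wrapper. On Linux, results in the range -4095 to -1 are treated as error codes.

```c
#include <errno.h>
#include <stddef.h>

/* Toy model of the dispatch path; all names here are hypothetical. */
typedef long (*toy_handler)(long arg);

/* Pretend handlers: a fixed "PID" and a close that knows only fds 0-2. */
static long toy_getpid(long arg) { (void)arg; return 4242; }
static long toy_close(long fd) { return (fd >= 0 && fd < 3) ? 0 : -EBADF; }

/* Stand-in for the system call table: index = system call number. */
static const toy_handler toy_table[] = { toy_getpid, toy_close };

/* "Kernel" side: validate the number, then dispatch through the table.
 * Failures are reported as negative errno values. */
long toy_kernel_entry(long nr, long arg) {
    size_t count = sizeof toy_table / sizeof toy_table[0];
    if (nr < 0 || (size_t)nr >= count)
        return -ENOSYS;
    return toy_table[nr](arg);
}

/* "C library" side: turn a negative errno result into -1 plus errno. */
long toy_wrapper(long nr, long arg) {
    long ret = toy_kernel_entry(nr, arg);
    if (ret < 0 && ret >= -4095) {
        errno = (int)-ret;
        return -1;
    }
    return ret;
}
```

With this model, toy_wrapper(1, 999) behaves like close(999) on a process that has no such descriptor: it returns -1 and leaves EBADF in errno.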

From the process above, we can see that a system call based on a software interrupt is relatively complex. The application traps into the kernel state through the software interrupt, and the kernel looks up and executes the function registered in the system call table. The whole process requires saving the registers, switching from the user state to the kernel state, and validating the parameters, which adds a lot of overhead compared to a function call.

In fact, using INT 0x80 to trigger system calls is a thing of the past, and most programs try to avoid it. The rule is not universal, however: the Go language team found in benchmarks that on some operating systems INT 0x80 performs almost identically to the other methods, so interrupts are still used to execute system calls on platforms such as Android/386 and Linux/386.

Assembly Instructions

Because system calls implemented with software interrupts perform very poorly on Pentium 4 processors, Linux adopted the newer instructions SYSENTER and SYSCALL, which are used on Intel and AMD processors to implement fast system calls. SYSENTER / SYSEXIT are used on 32-bit operating systems and SYSCALL / SYSRET on 64-bit operating systems.

These are low-latency system call and return instructions. They assume that the operating system implements the linear memory model, which greatly simplifies entering and leaving a system call by removing unnecessary checks and preloading parameters. Using the fast system call instructions can reduce clock cycles by up to 25% compared to system calls driven by software interrupts.

The linear memory model is a common paradigm for memory addressing in which memory appears to the application as a single contiguous address space, and the CPU can access the available memory addresses directly without resorting to segmentation or paging.

On 64-bit operating systems, SYSCALL / SYSRET are used to enter and exit system calls, which run at the highest privilege level of the operating system. During initialization, the kernel calls the syscall_init function (common.c#L1662) to write the address of entry_SYSCALL_64 into a Model Specific Register (MSR). MSRs are control registers in the x86 instruction set used for debugging, tracing, and performance monitoring.

void syscall_init(void) {
	wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
	wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
	...
}

When the kernel receives a system call triggered by a user program, the processor reads the function to execute from the MSR, and the kernel reads the system call number and its arguments from registers according to the x86-64 calling convention, which you can find in the comments of the entry_SYSCALL_64 function.
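From user space, the register convention can be sketched with inline assembly. This is a minimal illustration that assumes x86-64 Linux with GCC or Clang: the system call number goes in rax, the first three arguments in rdi, rsi, and rdx, and the SYSCALL instruction overwrites rcx and r11. The function name raw_syscall3 is hypothetical, and the fallback branch for other architectures goes through the C library instead.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Issue a system call with up to three arguments and return the raw
 * kernel result: a non-negative value on success, or a negative errno
 * value on failure. */
long raw_syscall3(long nr, long a1, long a2, long a3) {
#if defined(__x86_64__)
    long ret;
    /* rax = number, rdi/rsi/rdx = arguments; rcx and r11 are clobbered. */
    __asm__ volatile("syscall"
                     : "=a"(ret)
                     : "a"(nr), "D"(a1), "S"(a2), "d"(a3)
                     : "rcx", "r11", "memory");
    return ret;
#else
    /* Fallback for other architectures: go through the C library and
     * undo its errno translation so both paths return the same kind
     * of value. */
    long ret = syscall(nr, a1, a2, a3);
    return ret == -1 ? -(long)errno : ret;
#endif
}
```

For example, raw_syscall3(SYS_getpid, 0, 0, 0) returns the same value as getpid(), while raw_syscall3(SYS_close, -1, 0, 0) returns -EBADF directly instead of setting errno.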

The assembly function entry_SYSCALL_64 calls do_syscall_64 during execution, and its implementation is somewhat similar to do_int80_syscall_32 in the previous section in that they both look up the function in the system call table and pass in the parameters in the registers.

Unlike INT 0x80, which implements system calls by triggering software interrupts, SYSENTER and SYSCALL are assembly instructions designed specifically for system calls. They do not need to look up the execution procedure corresponding to the system call in the Interrupt Descriptor Table (IDT), nor do they need to save information such as the stack and return address, so they can reduce the extra overhead required.

vDSO

A virtual dynamic shared object (vDSO) is a mechanism by which the Linux kernel exposes some kernel-space functions to user space. Simply put, system calls that raise no security concerns are mapped directly into user space, so applications do not need to switch to the kernel state when calling them, which avoids the performance loss.

The vDSO uses standard linking and loading techniques and behaves like a dynamically linked library. It is provided by the Linux kernel and mapped into every running process, and we can see where it is located in a process using the commands shown below.

$ ldd /bin/cat
	linux-vdso.so.1 (0x00007fff2709c000)
	...

$ cat /proc/self/maps
...
7f28953ce000-7f28953cf000 r--p 00027000 fc:01 2079                       /lib/x86_64-linux-gnu/ld-2.27.so
7f28953cf000-7f28953d0000 rw-p 00028000 fc:01 2079                       /lib/x86_64-linux-gnu/ld-2.27.so
7f28953d0000-7f28953d1000 rw-p 00000000 00:00 0
7ffe8ca4d000-7ffe8ca6e000 rw-p 00000000 00:00 0                          [stack]
7ffe8ca8d000-7ffe8ca90000 r--p 00000000 00:00 0                          [vvar]
7ffe8ca90000-7ffe8ca92000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

Because vDSO is provided directly by the operating system, it does not have a corresponding file, and we can see where it is loaded in virtual memory during program execution. vDSO can provide virtual system calls to user programs, which emulate system calls in the user state using data provided by the kernel.

The system call gettimeofday is a very good example. As shown above, gettimeofday served through the vDSO is set up as follows.

  1. The ELF loader in the kernel maps the memory pages of the vDSO and sets AT_SYSINFO_EHDR in the auxiliary vector, which stores the base address of the vDSO.
  2. The dynamic linker queries AT_SYSINFO_EHDR in the auxiliary vector and links the vDSO if the entry is set.
  3. During initialization, libc looks for the __vdso_gettimeofday symbol in the vDSO and binds it to a global function pointer.
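The first step can be observed from inside a process. The sketch below reads AT_SYSINFO_EHDR with getauxval (available in glibc and musl) and checks that the mapping at that address begins with the ELF magic bytes. The function names vdso_base and vdso_looks_like_elf are hypothetical.

```c
#include <elf.h>
#include <string.h>
#include <sys/auxv.h>

/* Base address of the vDSO in this process, or 0 if the auxiliary
 * vector has no AT_SYSINFO_EHDR entry. */
unsigned long vdso_base(void) {
    return getauxval(AT_SYSINFO_EHDR);
}

/* Returns 1 if the mapping at the vDSO base starts with the ELF magic
 * bytes, 0 if it does not or if no vDSO is mapped. */
int vdso_looks_like_elf(void) {
    unsigned long base = vdso_base();
    if (base == 0)
        return 0;
    return memcmp((const void *)base, ELFMAG, SELFMAG) == 0;
}
```

The address returned by vdso_base() should fall inside the [vdso] range reported by /proc/self/maps for the same process.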

In addition to gettimeofday, the vDSO on most architectures contains three other system calls: clock_gettime, clock_getres, and rt_sigreturn. These system calls are relatively simple and do not pose security problems, so mapping them to user space can significantly improve their performance. As we can see in Figure 2, using the vDSO can reduce the time of several of these system calls by a factor of tens.
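As a small illustration, the same clock can be read through the C library, which normally resolves clock_gettime to the vDSO, and through an explicit system call that always enters the kernel. This sketch assumes 64-bit Linux, where struct timespec matches what SYS_clock_gettime expects; the function names are hypothetical.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Convert a timespec to a single nanosecond count. */
static long long to_ns(struct timespec ts) {
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Monotonic time through the C library, normally served by the vDSO. */
long long mono_ns_libc(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return to_ns(ts);
}

/* The same clock, read by forcing a real kernel entry. */
long long mono_ns_syscall(void) {
    struct timespec ts;
    syscall(SYS_clock_gettime, CLOCK_MONOTONIC, &ts);
    return to_ns(ts);
}

/* Returns 1 if a libc reading followed by a syscall reading does not
 * go backwards, i.e. both paths observe the same monotonic clock. */
int clocks_agree(void) {
    long long a = mono_ns_libc();
    long long b = mono_ns_syscall();
    return b >= a;
}
```

Running a program that calls only mono_ns_libc() under strace typically shows no clock_gettime line at all, because the call never leaves user space.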

Summary

System calls are not a distant concept when we write applications: a simple Hello World triggers dozens of them during execution, and we may also need to deal with system calls when performance issues arise in production. Although system calls are very frequent in programs, they introduce significant overhead compared to normal function calls:

  • System calls triggered using software interrupts require saving information such as the stack and return address, as well as looking up the handler in the Interrupt Descriptor Table. Although most operating systems no longer use INT 0x80 to trigger system calls, there are still special scenarios where this old technique is useful.
  • Using the assembly instructions SYSCALL / SYSENTER to execute system calls is the most common method today. As instructions built specifically for system calls, they eliminate some unnecessary steps and reduce the overhead of system calls.
  • Executing system calls through the vDSO is the fastest path the OS provides. It brings the cost of a system call down to that of a function call, but the OS exposes only a limited number of system calls this way because mapping kernel state system calls into the user state carries a real security risk.

Applications are quite limited in what they can do on their own, and we need the services provided by the operating system to write feature-rich user programs. As an interface provided by the operating system, system calls are closely tied to the underlying hardware, and because hardware varies, different architectures have to use different instructions. Finally, here are some open-ended related questions that interested readers can think through.

  • What is the role of the system call rt_sigreturn provided by the vDSO?
  • Three of the four system calls provided by the vDSO are related to getting the time. Why can rt_sigreturn be provided in the user state without security risks?