Although we often think of Redis as a purely in-memory key-value store, we also rely on its persistence features. RDB and AOF are the two persistence mechanisms Redis provides, and RDB is a snapshot of the Redis data set.

In this article, we want to analyze why Redis needs to use subprocesses when persisting snapshots of data, rather than exporting the in-memory data structures directly to disk for storage.

Overview

Before analyzing today’s problem, we first need to understand what RDB, one of Redis’ persistence mechanisms, is. RDB periodically takes a snapshot of the current data set in the Redis server. The snapshot interval can be set in the Redis configuration file, and the Redis client also provides two commands that generate an RDB file, SAVE and BGSAVE. We can guess the difference between the two from their names.

The SAVE command blocks the current thread while it executes. Since Redis handles commands on a single thread, SAVE blocks all other client requests until the snapshot is finished, which is often unacceptable for Redis services that need to provide strong availability guarantees.

When we use the BGSAVE command, Redis forks a child process, which saves the in-memory data to disk in RDB format, while the Redis server continues to handle client requests for the duration of the BGSAVE.

rdbSaveBackground is the function that saves the data to disk in the background:

int rdbSaveBackground(char *filename, rdbSaveInfo *rsi) {
    pid_t childpid;

    if (hasActiveChildProcess()) return C_ERR;
    ...

    if ((childpid = redisFork()) == 0) {
        int retval;

        /* Child */
        redisSetProcTitle("redis-rdb-bgsave");
        retval = rdbSave(filename,rsi);
        if (retval == C_OK) {
            sendChildCOWInfo(CHILD_INFO_TYPE_RDB, "RDB");
        }
        exitFromChild((retval == C_OK) ? 0 : 1);
    } else {
        /* Parent */
        ...
    }
    ...
}

When BGSAVE is triggered, the Redis server calls redisFork to create a child process and then calls rdbSave in the child to persist the data. We have omitted parts of the function here, but the overall structure is still very clear. Interested readers can read the full implementation in the Redis source code.

The purpose of using fork is ultimately to keep the Redis service available by not blocking the main process, but it raises two questions:

  1. Why is the child process after fork able to access the data in the parent process’ memory?
  2. Does the fork function introduce additional performance overhead, and how can we avoid it?

Since Redis has chosen fork to solve the snapshot persistence problem, both questions must already have answers: the child process can access the data in the parent’s memory as it was at the time of the fork, and the extra overhead of fork is acceptable compared with blocking the main thread. Otherwise Redis would not have chosen this solution.

Design

To analyze the two questions raised in the previous section, we need to understand the following two points. They are the prerequisites for the Redis server to use fork, and they are what ultimately motivate this implementation.

  1. The child process created by fork sees the same memory contents that the parent had at the time of the fork.
  2. fork does not incur a significant performance overhead, and in particular it does not copy large amounts of memory up front; through copy-on-write it defers copying memory until it is really needed.

Subprocesses

In the field of computer programming, especially in Unix and Unix-like systems, fork is an operation used by a process to create a copy of itself. It is often a system call implemented by the operating system kernel and is the main method used by the operating system to create new processes in *nix systems.

Once the program calls fork, we can use its return value to tell the parent and child processes apart and perform different actions in each.

  • When fork returns 0, the current process is the child process.
  • When fork returns a positive value, the current process is the parent process and the return value is the pid of the child process.
  • When fork returns -1, the call failed and no child process was created.
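
#include <sys/types.h>
#include <unistd.h>

int main(void) {
    pid_t pid = fork();
    if (pid == 0) {
        // child process
    } else if (pid > 0) {
        // parent process; pid is the child's process ID
    } else {
        // fork failed, no child was created
    }
    return 0;
}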

In the manual page for fork, we find that the parent and child processes run in separate memory spaces after the call. At the moment of the fork both memory spaces have exactly the same contents, and memory writes and file mappings made afterwards by one process are independent and do not affect the other.

The child process and the parent process run in separate memory spaces. At the time of fork() both memory spaces have the same content. Memory writes, file mappings (mmap(2)), and unmappings (munmap(2)) performed by one of the processes do not affect the other.

In addition, the child process is an almost exact duplicate of the parent process, but the two processes differ in a few ways:

  • The child process has its own unique process ID.
  • The parent process ID of the child process is the process ID of the parent.
  • The child process does not inherit the memory locks of the parent process.
  • The child process starts with its resource utilization and CPU time counters reset to zero.

The key point is that the memory of the parent and child processes is identical at the time of fork, and writes and modifications after fork will not affect each other, which in fact solves the problem of the snapshot scenario perfectly – only the data in memory at a certain point in time is needed, and the parent process can continue to make changes to its own memory without blocking or affecting the generated snapshot.

Copy-on-write

Since the parent and child processes have exactly the same memory contents and writes by one are not visible to the other, does this mean that fork has to make a full copy of the parent’s memory for the child? If it did, the result would be basically catastrophic for the Redis service, especially in the following two scenarios.

  1. A large amount of data is stored in memory, so copying the memory space during fork consumes a lot of time and resources and can make the service unavailable for a while.
  2. Redis takes up 10 GB of memory while the physical or virtual machine is limited to 16 GB. A full copy would need another 10 GB, so we could not persist the data at all, which means Redis could never use more than 50% of the memory on the machine.

If these two problems could not be solved, using fork to generate a memory image would not be practical in a real project.

Suppose we need to execute a command at the command line: we create a new process via fork and then run the command in it via exec. The large amount of memory copied by fork would be of no use at all to the child process, yet it would introduce a huge additional overhead.

Copy-on-write was introduced to solve this problem. Its main purpose is to delay copying until a write actually occurs, which avoids a lot of pointless copy operations. On some early *nix systems, the fork system call did immediately copy the parent process’ memory space, but on most systems today it does not.

At the time of the fork call, the kernel gives the parent and child processes separate virtual memory spaces, so to the two processes it appears that they are accessing different memory:

  • When the virtual memory is actually accessed, the kernel maps it to physical memory, and the parent and child initially share the same physical pages.
  • A shared page is copied only when the parent or child writes to it, on a page-by-page basis. The process that writes gets its own private copy of the page, while the other process continues to see the original contents.

For most Redis services or databases, write requests are far less frequent than read requests, so relatively few pages are modified while a snapshot is being taken. Using fork with copy-on-write therefore performs very well and makes BGSAVE easy to implement.

Summary

The way Redis implements background snapshots is very clever: with fork and the copy-on-write behavior provided by the operating system, the feature is easy to implement. It also shows that the author’s knowledge of operating systems is very solid. Faced with a similar scenario, most people would probably implement a copy-on-write-like mechanism by hand, which not only increases the workload but also increases the chance of bugs.

Let’s briefly summarize why Redis implements snapshots by means of subprocesses when using RDB:

  1. The child process created by fork gets exactly the same memory contents as the parent process, and memory changes made by the parent process afterwards are not visible to the child process, so they do not affect each other.
  2. Creating a child process with fork does not immediately trigger a large number of memory copies. Memory is copied page by page only when it is modified, which avoids the performance problems caused by copying large amounts of memory.

These two reasons, one making the parent’s data accessible to the child process and the other reducing the additional overhead, are why Redis uses child processes for snapshot persistence. To conclude, here are some more open-ended related questions for the interested reader to ponder.

  • At runtime, Nginx’s main process forks a set of subprocesses that handle requests separately. What other services use this feature?
  • Copy-on-write is actually a relatively common mechanism. Where else is it used outside of Redis?