Redis has become the de facto standard for high-performance caching in today's technology stacks. As a result, it is part of the basic skill set for back-end developers, and its underlying principles are well worth learning.

Redis is essentially a network server, and for a network server, the network model is its essence. If you understand a server's network model, you understand the essence of the server.

This article introduces the Redis network model step by step and analyzes how it evolved from single-threaded to multi-threaded. We also examine the thinking behind many of the design choices in the Redis network model, to help the reader understand its design better.

How fast is Redis?

According to the official benchmark, a single Redis instance running on a Linux machine with average hardware can typically achieve 80,000+ QPS for simple commands (of O(N) or O(log(N)) complexity), and up to 1,000,000 QPS with pipeline batching.

Judging by performance alone, Redis can be called a high-performance caching solution.
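
If you want a feel for these numbers on your own hardware, the redis-benchmark tool that ships with Redis is enough. A minimal run might look like this (standard flags: -t selects commands, -n the request count, -q quiet mode, -P the pipeline depth; actual throughput depends entirely on your machine):

redis-benchmark -t set,get -n 1000000 -q        # one command per round trip
redis-benchmark -t set,get -n 1000000 -q -P 16  # pipeline 16 commands per round trip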

Why is Redis fast?

The high performance of Redis is due to the following fundamentals.

Why is Redis fast

  • C implementation: although C contributes to Redis' performance, the language is not the core factor.
  • In-memory-only I/O: operating purely on memory gives Redis a natural performance advantage over disk-based databases.
  • I/O multiplexing: high-throughput network I/O built on epoll/select/kqueue and other I/O multiplexing techniques.
  • Single-threaded model: a single thread cannot exploit multiple cores, but on the other hand it avoids the frequent context switching of multiple threads and the overhead of synchronization mechanisms such as locks.

Why did Redis choose to be single-threaded?

Redis’ core network model is single-threaded, which caused a lot of confusion at the beginning, and the official Redis answer to this is:

It’s not very frequent that CPU becomes your bottleneck with Redis, as usually Redis is either memory or network bound. For instance, using pipelining Redis running on an average Linux system can deliver even 1 million requests per second, so if your application mainly uses O(N) or O(log(N)) commands, it is hardly going to use too much CPU.

At its core, this means that CPU is usually not the bottleneck for a DB, because most requests are not CPU-intensive, but rather I/O-intensive. In the case of Redis specifically, if you don’t consider persistence schemes like RDB/AOF, Redis is a completely in-memory operation, which is very fast. The real performance bottleneck for Redis is network I/O, which is the latency of network transfers between the client and server, so Redis chooses single-threaded I/O multiplexing to implement its core network model.

The above is the official and fairly general answer; the more specific reasons for choosing single-threaded can be summarized as follows.

Avoid excessive context switching overhead

In multi-threaded scheduling, threads must be context-switched on the CPU. A context switch involves saving and restoring a series of registers such as the program counter, stack pointer, and program status word, resetting the program stack, and may even evict entries from the CPU cache and the TLB. Because threads within a single process share the process address space, a thread context is much smaller than a process context; in cross-process scheduling, the entire process address space has to be switched.

With single-threaded scheduling, the overhead of frequent thread switching within the process is avoided, because the program always runs in a single thread and there is no multi-threaded switching scenario.

Avoiding the overhead of synchronization mechanisms

If Redis had chosen a multi-threaded model, then as a database it would inevitably face underlying data synchronization issues, which would in turn require synchronization mechanisms such as locks. And Redis provides not just simple key-value pairs but rich data structures such as lists, sets, and hashes. Different data structures require different locking granularity for synchronized access, which could cause heavy lock/unlock overhead during data manipulation, increasing program complexity and reducing performance.

Simple and Maintainable

The author of Redis, Salvatore Sanfilippo (alias antirez), holds an almost obsessive philosophy of simplicity in Redis' design and code, and you can feel this when reading the Redis source or submitting PRs to Redis. So simple, maintainable code was necessarily one of the core guidelines of Redis in its early days, and introducing multithreading inevitably increases code complexity and decreases maintainability.

In fact, multi-threaded programming is not a silver bullet. First, once multithreading is introduced, the code is no longer logically serial and the order of execution becomes unpredictable, which can lead to all kinds of concurrency bugs if you are not careful; second, multithreading makes debugging considerably more complicated and troublesome. There is an amusing picture on the web that vividly depicts the dilemma of concurrent programming.

What you expect from multithreaded programming VS Actual multithreaded programming

If Redis uses a multi-threaded model, then all of the underlying data structures must be implemented as thread-safe. This in turn makes the Redis implementation more complex.

In short, Redis’ choice of single-threaded is a trade-off between keeping the code simple and maintainable while maintaining adequate performance.

Is Redis really single-threaded?

Before we get to that question, we need to clarify the boundaries of the concept of ‘single-threaded’: does it cover the core network model or Redis as a whole? If the former, then the answer is yes. The network model was single-threaded until Redis formally introduced multithreading in v6.0; if the latter, then the answer is no. Redis introduced multithreading as early as v4.0.

Therefore, when discussing multi-threading in Redis, it is important to distinguish two milestones in the Redis release history.

  1. Redis v4.0 (which introduced multithreading for asynchronous tasks)
  2. Redis v6.0 (formally implements I/O multithreading in the network model)

Single-threaded event loops

Let’s start by dissecting the core network model of Redis. From Redis v1.0 until v6.0, the core network model of Redis has been a typical single Reactor model: using multiplexing techniques such as epoll/select/kqueue to process events (client requests) in a single-threaded event loop, and finally writing back the response data to the client.

Single-threaded event loops

There are several core concepts to learn here.

  • client: the client object. Redis is a typical CS architecture (Client <–> Server): the client establishes a network connection to the server via a socket and sends request commands, and the server executes the commands and replies. Redis uses the client struct to store all of a client's relevant state, including but not limited to the wrapped socket connection -- *conn, a pointer to the currently selected database -- *db, the read buffer -- querybuf, the write buffer -- buf, and the write-out linked list -- reply (a simplified sketch of the struct follows this list).
  • aeApiPoll: the I/O multiplexing API, a wrapper over system calls such as epoll_wait/select/kevent. It waits for read/write events to fire so they can be processed; it is the core function of the event loop and the foundation the event driver runs on.
  • acceptTcpHandler: the connection accept handler. It uses the accept system call underneath to accept new client connections and registers the command read handler on each new connection for subsequent processing of the TCP connection; besides this handler, there are the corresponding acceptUnixHandler for Unix domain sockets and acceptTLSHandler for TLS-encrypted connections.
  • readQueryFromClient: the command read handler, which reads the client's request from the socket and then parses and executes its commands.
  • beforeSleep: a function executed each time the event loop is about to enter aeApiPoll to wait for events. It performs routine tasks such as writing responses from client->buf or client->reply (two buffers are needed here) back to clients and flushing the data in the AOF buffer to disk. There is also a corresponding afterSleep function executed after aeApiPoll returns.
  • sendReplyToClient: the command reply handler. When a client's write buffer still holds data after an event loop iteration, this handler is registered on the connection's write-ready event; when the connection becomes writable, it writes the remaining data in the write buffer back to the client.
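
To make the client object more concrete, here is a heavily trimmed sketch of the struct, keeping only the fields mentioned above (the real struct client in server.h has dozens more fields, and the field order is simplified):

typedef struct client {
    connection *conn;    /* Wrapped socket connection. */
    redisDb *db;         /* Pointer to the currently selected database. */
    sds querybuf;        /* Read buffer: accumulates the client's request. */
    list *reply;         /* Linked list of reply objects for large responses. */
    /* ... many fields omitted ... */
    int bufpos;          /* Current position in the fixed write buffer below. */
    char buf[PROTO_REPLY_CHUNK_BYTES]; /* Fixed 16KB write buffer, tried first. */
} client;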

Redis internally implements a high-performance event library called AE, based on epoll/select/kqueue/evport, to provide a high-performance event loop on Linux/macOS/FreeBSD/Solaris. Redis' core network model is built on top of AE: the I/O multiplexing and the registration and binding of the various handlers all rely on it.
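
To see how a handler gets bound, this is roughly what registering the accept handler on a listening socket looks like with AE (condensed from the server initialization code; error handling trimmed):

/* Associate the AE_READABLE event on the listening fd with acceptTcpHandler,
 * so the event loop calls it whenever a new connection is ready to accept. */
if (aeCreateFileEvent(server.el, server.ipfd[j], AE_READABLE,
                      acceptTcpHandler, NULL) == AE_ERR)
{
    serverPanic("Unrecoverable error creating server.ipfd file event.");
}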

At this point, we can walk through what happens when a client sends Redis a command (a condensed sketch of the underlying event loop follows this list).

  1. The Redis server starts, opens the main thread event loop, registers the acceptTcpHandler connection accept handler on the file descriptor of the user-configured listening port, and waits for new connections to arrive.
  2. A network connection is established between the client and the server.
  3. acceptTcpHandler is called; the main thread uses AE's API to bind the readQueryFromClient command read handler to the file descriptor of the new connection and initializes a client object bound to this client connection.
  4. The client sends a request command, triggering a read-ready event; the main thread calls readQueryFromClient to read the command the client sent via the socket into the client->querybuf read buffer.
  5. Next, processInputBuffer is called, in which processInlineBuffer or processMultibulkBuffer parses the command according to the Redis protocol, and finally processCommand is called to execute it.
  6. Depending on the type of the request command (SET, GET, DEL, EXEC, etc.), the appropriate command executor runs it, and finally a series of functions in the addReply family write the response data into the client's write buffer: client->buf or client->reply. client->buf is the preferred write buffer, with a fixed size of 16KB; it can generally hold enough response data, but if the client needs a very large response within the time window, Redis automatically switches to the client->reply linked list, which in theory can hold an unlimited amount of data (bounded by the machine's physical memory). Finally, the client is added to a LIFO queue, clients_pending_write.
  7. In the event loop, the main thread executes beforeSleep –> handleClientsWithPendingWrites, which traverses the clients_pending_write queue and calls writeToClient to return the data in each client's write buffer to the client; if any data remains in a write buffer, the sendReplyToClient command reply handler is registered on the connection's write-ready event, and once the connection becomes writable the remaining response data is written back in the event loop.

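Stripped of details, the loop driving all seven steps is the classic single-Reactor skeleton. Here is a condensed paraphrase of aeMain/aeProcessEvents from ae.c (timers, flags, and error handling removed; the real code is more involved):

void aeMain(aeEventLoop *eventLoop) {
    eventLoop->stop = 0;
    while (!eventLoop->stop) {
        if (eventLoop->beforesleep != NULL)
            eventLoop->beforesleep(eventLoop);      /* e.g. flush pending replies */
        int numevents = aeApiPoll(eventLoop, NULL); /* wait for ready sockets */
        for (int j = 0; j < numevents; j++) {
            int fd = eventLoop->fired[j].fd;
            int mask = eventLoop->fired[j].mask;
            aeFileEvent *fe = &eventLoop->events[fd];
            if (mask & AE_READABLE)  /* e.g. acceptTcpHandler / readQueryFromClient */
                fe->rfileProc(eventLoop, fd, fe->clientData, mask);
            if (mask & AE_WRITABLE)  /* e.g. sendReplyToClient */
                fe->wfileProc(eventLoop, fd, fe->clientData, mask);
        }
    }
}
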
For those who want to take advantage of multi-core performance, the official Redis solution is simple and crude: run more Redis instances on the same machine. In practice, to ensure high availability, an online business is unlikely to run in standalone mode anyway; it is more common to use Redis distributed clusters with multiple nodes and data sharding to improve performance and ensure high availability.
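
On a single box, that can be as simple as starting one process per core (the ports here are arbitrary; a real deployment would use separate config files):

redis-server --port 6379 &
redis-server --port 6380 &
redis-server --port 6381 &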

Multi-threaded asynchronous tasks

The above is the core network model of Redis, which was not transformed into a multi-threaded model until Redis v6.0. But that doesn’t mean that Redis has always been single-threaded.

Redis introduced multithreading in v4.0 to do some asynchronous operations, mainly for very time-consuming commands. By making the execution of these commands asynchronous, it avoids blocking the single-threaded event loop.

We know that the Redis DEL command deletes one or more keys and is a blocking command. In most cases, the key to delete does not hold many values, at most a few dozen or a few hundred objects, so it executes quickly. But if you delete a very large key-value pair containing millions of objects, the command may block for several seconds, and because the event loop is single-threaded, it blocks all subsequent events, reducing throughput.

Redis author antirez gave this problem a lot of thought. At first, he came up with an incremental solution: using a timer and a data cursor, delete a small batch at a time, say 1,000 objects, until all the data is cleared. But this approach had a fatal flaw: if other clients kept writing to a key that was being incrementally deleted, and the deletion rate could not keep up with the writes, memory would be consumed endlessly. That could be solved with clever adaptive logic, but it made the implementation more complex. Multithreading, by contrast, seemed like a watertight solution: simple and easy to understand. So, in the end, antirez chose to introduce multithreading to implement this class of non-blocking commands. More of antirez's thoughts on this can be found in his blog post: Lazy Redis is better Redis.

So, since Redis v4.0, non-blocking commands such as UNLINK, FLUSHALL ASYNC, and FLUSHDB ASYNC have been added.

async command in redis

The UNLINK command is essentially an asynchronous version of DEL: it does not delete the data synchronously, it just removes the key from the keyspace, adds a task to an asynchronous queue, and lets a background thread perform the actual deletion. There is one case to consider, though: if UNLINK is used on a very small key, doing the work asynchronously would cost more than it saves. So Redis first computes a deletion-cost value and only deletes the key asynchronously when this value is greater than 64; for container types such as List, Set, and Hash, the value is the number of objects stored in them, as the condensed source below shows.
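
That decision lives in lazyfree.c; condensed from the v6.0 source of dbAsyncDelete (surrounding bookkeeping omitted):

#define LAZYFREE_THRESHOLD 64

/* Estimate how expensive freeing the value would be: for aggregate types
 * such as List/Set/Hash this is (roughly) the number of stored objects. */
size_t free_effort = lazyfreeGetFreeEffort(val);

/* Only pay the cost of handing the job to a background thread when the
 * value is big enough; small keys are freed synchronously, as in DEL. */
if (free_effort > LAZYFREE_THRESHOLD && val->refcount == 1) {
    atomicIncr(lazyfree_objects,1);
    bioCreateBackgroundJob(BIO_LAZY_FREE,val,NULL,NULL);
    dictSetVal(db->dict,de,NULL);
}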

Redis Multi-Threaded Network Model

As mentioned earlier, Redis originally chose the single-threaded network model because the CPU is usually not the bottleneck; memory and network tend to be, so a single thread was sufficient. So why is Redis now introducing multithreading? Simply because Redis' network I/O bottleneck has become more and more apparent.

With the rapid growth of the Internet, online business systems push ever more traffic through Redis, and the single-threaded mode causes the system to burn a lot of CPU time on network I/O, reducing throughput. There are two ways to improve the performance of Redis:

  • Optimized network I/O modules
  • Improving the speed of machine memory reads and writes

The latter depends on the development of hardware and cannot be solved for now. So we can only start from the former, and the optimization of network I/O can go in two directions:

  • Zero-copy technology or DPDK technology
  • Take advantage of multi-core

Zero-copy techniques have their limitations and cannot fully cover complex network I/O scenarios like Redis'. (For more on how network I/O consumes CPU time and on Linux zero-copy techniques, read the previous article.) And DPDK, which bypasses the kernel network stack by taking over NIC I/O in user space, is too complex and requires special kernel or even hardware support.

Therefore, taking advantage of multiple cores is the most cost effective way to optimize network I/O.

After version 6.0, Redis formally introduced multithreading into the core network model, also known as I/O threading, and Redis now has a truly multithreaded model. In the previous section, we learned about the single-threaded event loop model of Redis before 6.0, which is actually a very classic Reactor model.

multithreaded model

The Reactor pattern is used by most mainstream high-performance networking libraries/frameworks on Linux, such as netty, libevent, libuv, POE (Perl), and Twisted (Python).

The Reactor pattern essentially refers to the combined use of I/O multiplexing and non-blocking I/O.
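
On Linux, that combination boils down to the following minimal standalone sketch (plain epoll plus O_NONBLOCK, not Redis code; handler dispatch left as comments):

#include <sys/epoll.h>
#include <fcntl.h>

/* Make a socket non-blocking so a read/write can never stall the loop. */
static void set_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Register descriptors for readiness events and dispatch in a loop. */
static void reactor_loop(int listen_fd) {
    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

    struct epoll_event events[64];
    for (;;) {
        int n = epoll_wait(epfd, events, 64, -1); /* the multiplexing half */
        for (int i = 0; i < n; i++) {
            /* If events[i].data.fd is the listening socket: accept, call
             * set_nonblocking on the new fd, register it for EPOLLIN.
             * Otherwise: read without blocking, parse, write the reply. */
        }
    }
}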

The core network model of Redis, until version 6.0, was a single Reactor model: all events were processed in a single thread, and although multithreading was introduced in version 4.0, it was more of a patch for specific scenarios (removing oversized key values, etc.) and could not be considered multithreading for the core network model.

In general, after the introduction of multithreading, the single Reactor model evolves into the Multi-Reactors model, whose basic working model is as follows.

multithreaded model

Instead of a single-threaded event loop, this pattern has multiple threads (Sub Reactors) each maintaining a separate event loop, with the Main Reactor receiving new connections and distributing them to the Sub Reactors for independent processing, and the Sub Reactors writing back responses to the client.

The Multiple Reactors pattern can often be equated with the Master-Workers pattern. Nginx and Memcached, for example, use this kind of multi-threaded model; the implementation details vary from project to project, but the pattern is broadly consistent.

Design Thinking

Redis also implements multithreading, but not in the standard Multi-Reactors/Master-Workers pattern, for reasons we will analyze later. For now, let’s look at the general design of the Redis multi-threaded network model.

multithreaded model in redis

  1. The Redis server starts, opens the main thread event loop, registers the acceptTcpHandler connection accept handler on the file descriptor of the user-configured listening port, and waits for new connections to arrive.
  2. A network connection is established between the client and the server.
  3. acceptTcpHandler is called; the main thread uses AE's API to bind the readQueryFromClient command read handler to the file descriptor of the new connection and initializes a client object bound to this client connection.
  4. The client sends a request command, triggering a read-ready event. Instead of reading the client's command from the socket right away, the main thread first puts the client into a LIFO queue, clients_pending_read.
  5. In the event loop, the main thread executes beforeSleep –> handleClientsWithPendingReadsUsingThreads, which uses a round-robin load-balancing strategy to distribute the connections in the clients_pending_read queue evenly between the I/O threads and the main thread itself. The I/O threads read each client's request command from the socket, store it in client->querybuf, and parse the first command without executing it, while the main thread busy-polls, waiting for all I/O threads to finish their read tasks.
  6. When the main thread and all I/O threads have finished reading, the main thread ends its busy polling and traverses the clients_pending_read queue, executing all clients' request commands: it first calls processCommandAndResetClient to execute the first, already-parsed command, then calls processInputBuffer to parse and execute the client's remaining commands, using processInlineBuffer or processMultibulkBuffer to parse them according to the Redis protocol and finally calling processCommand to execute them.
  7. Depending on the type of the request command (SET, GET, DEL, EXEC, etc.), the corresponding command executor runs it, and finally a series of functions in the addReply family write the response data into the client's write buffer: client->buf or client->reply. client->buf is the preferred write buffer, with a fixed size of 16KB; it generally holds enough response data, but Redis automatically switches to the client->reply linked list if the client needs a very large response within the time window; in theory a linked list can hold an unlimited amount of data (bounded by the machine's physical memory). Finally, the client is added to a LIFO queue, clients_pending_write.
  8. In the event loop, the main thread executes beforeSleep –> handleClientsWithPendingWritesUsingThreads, which uses a round-robin load-balancing strategy to distribute the connections in the clients_pending_write queue evenly between the I/O threads and the main thread itself. The I/O threads write the data in each client's write buffer back to the client by calling writeToClient, while the main thread busy-polls, waiting for all I/O threads to finish their write tasks.
  9. When the main thread and all I/O threads have finished writing, the main thread ends its busy polling and traverses the clients_pending_write queue; if a client's write buffer still has data left, sendReplyToClient is registered on that connection's write-ready event, and once the connection becomes writable the remaining response data is written back in the event loop.

Most of the flow here is the same as in the single-threaded model; the only change is that reading client requests and writing back response data is done asynchronously in the I/O threads. One special note: the I/O threads only read and parse client commands, they do not actually execute them; command execution ultimately happens on the main thread.

Source code dissection

All of the following code is based on the Redis v6.0.10 release.

Multi-threaded initialization

void initThreadedIO(void) {
    server.io_threads_active = 0; /* We start with threads not active. */

    // If the user configured only one I/O thread, don't create any new threads
    // (it would be inefficient); handle I/O directly in the main thread.
    if (server.io_threads_num == 1) return;

    if (server.io_threads_num > IO_THREADS_MAX_NUM) {
        serverLog(LL_WARNING,"Fatal: too many I/O threads configured. "
                             "The maximum number is %d.", IO_THREADS_MAX_NUM);
        exit(1);
    }

    // Spawn I/O threads according to the user-configured count.
    for (int i = 0; i < server.io_threads_num; i++) {
        // Initialize the I/O thread's local task queue.
        io_threads_list[i] = listCreate();
        if (i == 0) continue; // Thread 0 is the main thread.

        // Initialize the I/O thread and start it.
        pthread_t tid;
        // Each I/O thread gets a local mutex, used to put it to sleep and wake it up.
        pthread_mutex_init(&io_threads_mutex[i],NULL);
        // Each I/O thread gets an atomic counter recording its pending task count.
        io_threads_pending[i] = 0;
        // The main thread locks the mutex before starting the I/O thread,
        // so the thread sleeps until there are I/O tasks to wake it up for.
        pthread_mutex_lock(&io_threads_mutex[i]);
        // Start the thread; it runs the I/O thread main function, IOThreadMain.
        if (pthread_create(&tid,NULL,IOThreadMain,(void*)(long)i) != 0) {
            serverLog(LL_WARNING,"Fatal: Can't initialize IO thread.");
            exit(1);
        }
        io_threads[i] = tid;
    }
}

initThreadedIO is called at the end of initialization at Redis server startup to initialize I/O multithreading and start it.

Redis' multithreaded mode is off by default and needs to be enabled by the user in the redis.conf configuration file:

io-threads 4
io-threads-do-reads yes

Read Request

When a client sends a request command, it triggers the event loop in the Redis main thread and the command read handler readQueryFromClient is called back. In the earlier single-threaded model, this method would read, parse, and execute the client command directly. In multi-threaded mode, however, the client is added to the clients_pending_read task queue, and the main thread later assigns the read task to the I/O threads.

void readQueryFromClient(connection *conn) {
    client *c = connGetPrivateData(conn);
    int nread, readlen;
    size_t qblen;

    // Check whether multithreaded I/O is enabled; if so, add the client to the
    // async queue and return.
    if (postponeClientRead(c)) return;

    // Code omitted; the logic below is almost identical to the single-threaded version.
    ...
}

int postponeClientRead(client *c) {
    // When multithreaded I/O mode is on and the main thread is not handling a
    // blocking task, add the client to the async queue.
    if (server.io_threads_active &&
        server.io_threads_do_reads &&
        !ProcessingEventsWhileBlocked &&
        !(c->flags & (CLIENT_MASTER|CLIENT_SLAVE|CLIENT_PENDING_READ)))
    {
        // Mark the client with CLIENT_PENDING_READ, meaning it should be handled
        // by the I/O threads; later, after an I/O thread has read and parsed the
        // client's command, it checks this flag and skips execution, leaving the
        // command for the main thread to execute.
        c->flags |= CLIENT_PENDING_READ;
        listAddNodeHead(server.clients_pending_read,c);
        return 1;
    } else {
        return 0;
    }
}

The main thread then calls handleClientsWithPendingReadsUsingThreads in the beforeSleep() method of the event loop.

int handleClientsWithPendingReadsUsingThreads(void) {
    if (!server.io_threads_active || !server.io_threads_do_reads) return 0;
    int processed = listLength(server.clients_pending_read);
    if (processed == 0) return 0;

    if (tio_debug) printf("%d TOTAL READ pending clients\n", processed);

    // Iterate over the pending-read client queue clients_pending_read and
    // distribute the clients evenly, via round-robin, between the I/O threads
    // and the main thread itself (index 0).
    listIter li;
    listNode *ln;
    listRewind(server.clients_pending_read,&li);
    int item_id = 0;
    while((ln = listNext(&li))) {
        client *c = listNodeValue(ln);
        int target_id = item_id % server.io_threads_num;
        listAddNodeTail(io_threads_list[target_id],c);
        item_id++;
    }

    // Set the current I/O operation to "read" and set each I/O thread's counter
    // to the number of tasks assigned to it, so the I/O threads can start
    // working: they only read and parse commands, they do not execute them.
    io_threads_op = IO_THREADS_OP_READ;
    for (int j = 1; j < server.io_threads_num; j++) {
        int count = listLength(io_threads_list[j]);
        io_threads_pending[j] = count;
    }

    // The main thread also reads client request commands itself, to make the
    // most of the CPU.
    listRewind(io_threads_list[0],&li);
    while((ln = listNext(&li))) {
        client *c = listNodeValue(ln);
        readQueryFromClient(c->conn);
    }
    listEmpty(io_threads_list[0]);

    // Busy-poll, summing the atomic task counters of all I/O threads, until
    // every counter is down to 0, meaning all tasks are done; then stop polling.
    while(1) {
        unsigned long pending = 0;
        for (int j = 1; j < server.io_threads_num; j++)
            pending += io_threads_pending[j];
        if (pending == 0) break;
    }
    if (tio_debug) printf("I/O READ All threads finshed\n");

    // Iterate over the pending-read client queue again, clear the
    // CLIENT_PENDING_READ and CLIENT_PENDING_COMMAND flags, then parse and
    // execute all clients' commands.
    while(listLength(server.clients_pending_read)) {
        ln = listFirst(server.clients_pending_read);
        client *c = listNodeValue(ln);
        c->flags &= ~CLIENT_PENDING_READ;
        listDelNode(server.clients_pending_read,ln);

        if (c->flags & CLIENT_PENDING_COMMAND) {
            c->flags &= ~CLIENT_PENDING_COMMAND;
            // The client's first command has already been parsed; try to
            // execute it right away.
            if (processCommandAndResetClient(c) == C_ERR) {
                /* If the client is no longer valid, we avoid
                 * processing the client later. So we just go
                 * to the next. */
                continue;
            }
        }
        processInputBuffer(c); // Keep parsing and executing the client's commands.

        // After the commands are executed, if the client has response data to
        // write back, add it to the pending-write queue clients_pending_write.
        if (!(c->flags & CLIENT_PENDING_WRITE) && clientHasPendingReplies(c))
            clientInstallWriteHandler(c);
    }

    /* Update processed count on server */
    server.stat_io_reads_processed += processed;

    return processed;
}

The core work here is:

  • Iterate over the client queue clients_pending_read to be read, and assign all tasks to I/O threads and the main thread to read and parse client commands via the RR policy.
  • Busy polling waiting for all I/O threads to complete their tasks.
  • Finally iterate through clients_pending_read and execute all client commands.

Write back the response

After reading, parsing and executing the command, the response data of the client command has been stored in client->buf or client->reply. Next, you need to write the response data back to the client. Again, in beforeSleep, the main thread calls handleClientsWithPendingWritesUsingThreads.

int handleClientsWithPendingWritesUsingThreads(void) {
    int processed = listLength(server.clients_pending_write);
    if (processed == 0) return 0; /* Return ASAP if there are no clients. */

    // If the user configured a single I/O thread, or the number of pending
    // clients in clients_pending_write is less than twice the I/O thread count,
    // skip the multithreaded logic: put all I/O threads to sleep and write the
    // response data back to the clients directly in the main thread.
    if (server.io_threads_num == 1 || stopThreadedIOIfNeeded()) {
        return handleClientsWithPendingWrites();
    }

    // Wake up the sleeping I/O threads (if any).
    if (!server.io_threads_active) startThreadedIO();

    if (tio_debug) printf("%d TOTAL WRITE pending clients\n", processed);

    // Iterate over the pending-write client queue clients_pending_write and
    // distribute the clients evenly, via round-robin, between the I/O threads
    // and the main thread itself (index 0).
    listIter li;
    listNode *ln;
    listRewind(server.clients_pending_write,&li);
    int item_id = 0;
    while((ln = listNext(&li))) {
        client *c = listNodeValue(ln);
        c->flags &= ~CLIENT_PENDING_WRITE;

        /* Remove clients from the list of pending writes since
         * they are going to be closed ASAP. */
        if (c->flags & CLIENT_CLOSE_ASAP) {
            listDelNode(server.clients_pending_write, ln);
            continue;
        }

        int target_id = item_id % server.io_threads_num;
        listAddNodeTail(io_threads_list[target_id],c);
        item_id++;
    }

    // Set the current I/O operation to "write" and set each I/O thread's counter
    // to the number of tasks assigned to it, so the I/O threads can start
    // writing the response data in the write buffers (client->buf or c->reply)
    // back to the clients.
    io_threads_op = IO_THREADS_OP_WRITE;
    for (int j = 1; j < server.io_threads_num; j++) {
        int count = listLength(io_threads_list[j]);
        io_threads_pending[j] = count;
    }

    // The main thread also writes responses back to clients itself, to make
    // the most of the CPU.
    listRewind(io_threads_list[0],&li);
    while((ln = listNext(&li))) {
        client *c = listNodeValue(ln);
        writeToClient(c,0);
    }
    listEmpty(io_threads_list[0]);

    // Busy-poll, summing the atomic task counters of all I/O threads, until
    // every counter is down to 0, meaning all tasks are done; then stop polling.
    while(1) {
        unsigned long pending = 0;
        for (int j = 1; j < server.io_threads_num; j++)
            pending += io_threads_pending[j];
        if (pending == 0) break;
    }
    if (tio_debug) printf("I/O WRITE All threads finshed\n");

    // Finally, iterate over clients_pending_write one more time and check
    // whether any client still has leftover data in its write buffer; if so,
    // register the command reply handler sendReplyToClient for the client and
    // keep writing the data back once the connection becomes writable.
    listRewind(server.clients_pending_write,&li);
    while((ln = listNext(&li))) {
        client *c = listNodeValue(ln);

        // Check whether the client's write buffer still has leftover data.
        if (clientHasPendingReplies(c) &&
                connSetWriteHandler(c->conn, sendReplyToClient) == AE_ERR)
        {
            freeClientAsync(c);
        }
    }
    listEmpty(server.clients_pending_write);

    /* Update processed count on server */
    server.stat_io_writes_processed += processed;

    return processed;
}

The core work here is:

  • Check the current task load, and if the current number of tasks is not enough to be processed in multi-threaded mode, hibernate the I/O threads and write the response data back to the client directly and synchronously.
  • Wake up the I/O threads that are hibernating (if any).
  • Iterate through the client queue clients_pending_write and assign all tasks to the I/O threads and the main thread to write the response data back to the client via the RR policy.
  • Busy polling waits for all I/O threads to complete their tasks.
  • Finally iterate through clients_pending_write, register the command reply handler sendReplyToClient for those clients that still have response data left, and wait for the client to be writable before continuing to write back the remaining response data in the event loop.

I/O thread main logic

void *IOThreadMain(void *myid) {
    /* The ID is the thread number (from 0 to server.iothreads_num-1), and is
     * used by the thread to just manipulate a single sub-array of clients. */
    long id = (unsigned long)myid;
    char thdname[16];

    snprintf(thdname, sizeof(thdname), "io_thd_%ld", id);
    redis_set_thread_title(thdname);
    // Set the I/O thread's CPU affinity, binding the I/O threads (and the main
    // thread, which is not set here) to the user-configured CPU list as much
    // as possible.
    redisSetCpuAffinity(server.server_cpulist);
    makeThreadKillable();

    while(1) {
        // Busy-poll for 1,000,000 iterations, waiting for the main thread to
        // assign I/O tasks.
        for (int j = 0; j < 1000000; j++) {
            if (io_threads_pending[id] != 0) break;
        }

        // If there is still no task after 1,000,000 iterations of busy polling,
        // go to sleep by trying to acquire the lock, and wait for the main
        // thread to assign tasks and call startThreadedIO, which unlocks the
        // mutex and wakes the I/O thread up to work.
        if (io_threads_pending[id] == 0) {
            pthread_mutex_lock(&io_threads_mutex[id]);
            pthread_mutex_unlock(&io_threads_mutex[id]);
            continue;
        }

        serverAssert(io_threads_pending[id] != 0);

        if (tio_debug) printf("[%ld] %d to handle\n", id, (int)listLength(io_threads_list[id]));


        // Note: when the main thread assigns tasks to the I/O threads, it adds
        // them to each thread's local task queue io_threads_list[id], but once
        // the I/O threads start working, the main thread no longer touches
        // those queues, which avoids data races.
        listIter li;
        listNode *ln;
        listRewind(io_threads_list[id],&li);
        while((ln = listNext(&li))) {
            client *c = listNodeValue(ln);
            // If the current operation is a write, write the data in the
            // client's write buffer back to the client.
            if (io_threads_op == IO_THREADS_OP_WRITE) {
                writeToClient(c,0);
            // If the current operation is a read, read the client's request
            // command from the socket and parse the first command.
            } else if (io_threads_op == IO_THREADS_OP_READ) {
                readQueryFromClient(c->conn);
            } else {
                serverPanic("io_threads_op value is unknown");
            }
        }
        listEmpty(io_threads_list[id]);
        // After all tasks are done, reset this thread's counter to 0; the main
        // thread sums all I/O threads' counters to decide whether every I/O
        // thread has finished its work.
        io_threads_pending[id] = 0;

        if (tio_debug) printf("[%ld] Done\n", id);
    }
}

After an I/O thread starts, it first busy-polls, checking the task count in its atomic counter. If the counter is non-zero, the main thread has assigned it tasks and it starts executing them; otherwise it keeps busy-polling for a million iterations. If the counter is still 0 after that, it tries to acquire its local lock; since the main thread locked every I/O thread's local lock in advance when starting them, the I/O thread blocks there and sleeps, waiting for the main thread to wake it up.

In every event loop iteration, the main thread tries to call startThreadedIO to wake the I/O threads up to work. When client request commands arrive, the I/O threads are woken and start executing tasks, either reading and parsing commands or writing back response data, according to the io_threads_op flag set by the main thread. After being notified by the main thread, each I/O thread iterates over its own local task queue io_threads_list[id], taking out clients one by one and processing them.

  • If the current operation is a write operation, call writeToClient to write the response data from client->buf or client->reply back to the client via the socket.
  • If the current operation is a read, call readQueryFromClient to read the client's command from the socket into client->querybuf, then call processInputBuffer to parse it, stopping once the first command has been parsed, without executing it.
  • After all tasks have been executed, the thread resets its own atomic counter to 0 to tell the main thread that its work is done.
void processInputBuffer(client *c) {
// Code omitted.
...

    while(c->qb_pos < sdslen(c->querybuf)) {
        /* Return if clients are paused. */
        if (!(c->flags & CLIENT_SLAVE) && clientsArePaused()) break;

        /* Immediately abort if the client is in the middle of something. */
        if (c->flags & CLIENT_BLOCKED) break;

        /* Don't process more buffers from clients that have already pending
         * commands to execute in c->argv. */
        if (c->flags & CLIENT_PENDING_COMMAND) break;
        /* Multibulk processing could see a <= 0 length. */
        if (c->argc == 0) {
            resetClient(c);
        } else {
            // Check whether the client carries the CLIENT_PENDING_READ flag.
            // In multithreaded I/O mode, the flag was set earlier in
            // readQueryFromClient -> postponeClientRead, so break out of the
            // loop immediately: at this point the first command has been
            // parsed, but it is not executed.
            if (c->flags & CLIENT_PENDING_READ) {
                c->flags |= CLIENT_PENDING_COMMAND;
                break;
            }

            // Execute the client command.
            if (processCommandAndResetClient(c) == C_ERR) {
                /* If the client is no longer valid, we avoid exiting this
                 * loop and trimming the client buffer later. So we return
                 * ASAP in that case. */
                return;
            }
        }
    }

...
}

One more detail deserves attention: when an I/O thread first starts, it sets its own CPU affinity, binding itself to the user-configured CPU list as far as possible. The same CPU affinity is set when the main Redis server thread starts, and that thread runs the core network model. Redis is an extremely throughput- and latency-sensitive system, so users need finer-grained control over its CPU resources. There are two main considerations here: the CPU cache and the NUMA architecture.

First, the CPU cache (here we mean a hardware architecture where both the L1 and L2 caches are integrated into the CPU). Imagine a scenario where the main Redis process is running on CPU-1, serving clients, and Redis starts a child process for data persistence (BGSAVE or AOF rewriting). After system scheduling, the child process takes over CPU-1 and the main process is rescheduled onto CPU-2, so the instructions and data in CPU-1's cache are evicted and CPU-2 has to reload the instructions and data into its own local cache, wasting CPU resources and reducing performance.

CPU cache

Therefore, by setting CPU affinity, Redis can isolate the main process/thread from the child processes/threads, binding them to different cores so they do not interfere with each other, which effectively improves system performance.
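
Since v6.0, this pinning can be expressed directly in redis.conf; these four options exist in the stock 6.0 configuration, while the CPU lists below are just an illustration:

server_cpulist 0-3
bio_cpulist 4
aof_rewrite_cpulist 5
bgsave_cpulist 6,7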

The second consideration is the NUMA architecture. In a NUMA system, the memory controller is integrated into the processor, forming CPU-local memory. Access to local memory goes through the memory channel alone rather than the system bus, which greatly reduces access latency; the processors themselves are interconnected via QPI data links, and memory access across NUMA nodes is much more expensive than local memory access.

QuickPath Interconnect

Therefore, Redis can also greatly improve performance by setting CPU affinity so that the main process/thread runs on a fixed CPU on a NUMA node as much as possible, using more local memory instead of accessing data across nodes.

For more information about NUMA, please look it up yourself; I may write a separate article about it when I have time.

One final point, which readers who have gone through the source code may wonder about: Redis does not seem to lock any data in multithreaded mode. In fact, Redis' multithreaded model is lock-free throughout, achieved through atomic operations plus interleaved access. Three variables are shared between the main thread and the I/O threads: the io_threads_pending counters, the io_threads_op I/O flag, and the io_threads_list thread-local task queues.

io_threads_pending is an atomic variable and needs no lock protection, while io_threads_op and io_threads_list sidestep shared-data races through interleaved access between the main thread and the I/O threads: an I/O thread starts up and waits for the main thread's signal via busy polling and lock-based sleeping; the main thread only wakes the I/O threads after it has assigned all tasks to each I/O thread's local queue, and from then on it does not touch those queues, accessing only its own local queue io_threads_list[0] while the I/O threads run. This guarantees that the main thread always accesses io_threads_list strictly before the I/O threads and never again afterwards, ensuring interleaved access. Likewise for io_threads_op: the main thread sets its value before waking the I/O threads and does not touch it again while they are running.
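
The handshake is easy to lose inside the full source, so here is the same idea reduced to a standalone sketch in C11 atomics. This is my own minimal reconstruction of the pattern, not Redis code (the names are invented, and Redis uses its own atomicvar.h wrappers rather than stdatomic.h directly):

#include <stdatomic.h>

#define QUEUE_MAX 128

/* Owned alternately by main and worker, never touched concurrently. */
static void *io_queue[QUEUE_MAX];
static int io_queue_len;

/* The only variable both sides touch while the other is running. */
static atomic_ulong io_pending;

/* Main thread: fill the queue FIRST, publish the count LAST. */
void assign_tasks(void **tasks, int n) {
    for (int i = 0; i < n; i++) io_queue[i] = tasks[i];
    io_queue_len = n;
    atomic_store(&io_pending, (unsigned long)n); /* hand ownership to worker */
    /* From here until io_pending returns to 0, main must not touch io_queue. */
}

/* I/O thread: spin on the counter, drain the queue, hand ownership back. */
void worker_drain(void (*handle)(void *)) {
    while (atomic_load(&io_pending) == 0) { /* busy-poll */ }
    for (int i = 0; i < io_queue_len; i++) handle(io_queue[i]);
    io_queue_len = 0;
    atomic_store(&io_pending, 0); /* tell main the batch is done */
}

The sequentially consistent atomic_store/atomic_load pair provides the ordering that makes the plain accesses to io_queue safe in this strictly alternating scheme.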


Performance Improvements

Redis converted its core network model to multithreaded mode in pursuit of the ultimate performance gain, so the benchmark data is what really matters.

Redis multi-threaded mode benchmark

The test data shows that Redis' performance roughly doubles in multithreaded mode. More detailed benchmark data can be found in this article: Benchmarking the experimental Redis Multi-Threaded I/O.

The following is a comparison of the performance of the old and new Redis versions as measured by the Mito technical team, for reference only.

comparison of the performance of the old and new Redis versions as measured

comparison of the performance of the old and new Redis versions as measured

Model flaws

First, as mentioned earlier, Redis' multithreaded network model is not a standard Multi-Reactors/Master-Workers model, and it differs from the mainstream open-source server models: in the standard pattern, the Sub Reactors/Workers complete the whole network read -> data parsing -> command execution -> network write flow, and the Main Reactor/Master is only responsible for distributing tasks. In Redis' multithreaded scheme, however, the I/O threads only read and parse client requests from the socket; they do not actually execute the commands. All client commands ultimately come back to the main thread for execution, so the utilization of multiple cores is not very high. Moreover, the main thread must busy-poll until all I/O threads complete their tasks each time before continuing with the rest of the logic.

I think the main reason Redis designed its multithreaded network model this way was to maintain compatibility: Redis was previously single-threaded, with all client commands executed in the single-threaded event loop, so none of its data structures are thread-safe. With multithreading, following the standard Multi-Reactors/Master-Workers model would require refactoring all the built-in data structures to be thread-safe, which is a huge amount of work and hassle.

So, in my opinion, Redis’ current multi-threading solution is more of a compromise: it maintains the compatibility of the original system while leveraging multiple cores to improve I/O performance.

Second, communication between the main thread and the I/O threads in the current model is overly crude: busy polling plus locking. Waiting via spin-based busy polling causes Redis to show occasional brief CPU spikes at startup and at runtime. The final form of this communication mechanism also looks rather unintuitive and inelegant; hopefully Redis will improve on it later.

Summary

Redis is the de facto standard for caching systems, and its underlying principles are worth a deep dive. Its author antirez, a devotee of simplicity, was extremely cautious about adding any new feature to Redis, so the core network model was only converted to multithreading a decade after Redis' initial release, a period during which multithreaded Redis alternatives were even born. Although antirez kept postponing the multithreading solution, he never stopped thinking about its feasibility. The transformation of the Redis network model did not happen overnight and touched every aspect of the project, so we can see that the final solution is not perfect and avoided the mainstream multithreaded design.

Let’s review the design of the Redis multi-threaded network model.

  • Multi-threaded network I/O using I/O threads, where I/O threads are only responsible for network I/O and command parsing, and do not execute client commands.
  • Implement a lock-free multi-threaded model using atomic operations + interleaved access.
  • Isolating the main process from other sub-processes by setting CPU affinity, so that the multi-threaded network model can maximize performance.

After reading this article, readers should appreciate how many areas of computing go into a good networked system: design patterns, network I/O, concurrent programming, operating system internals, and even computer hardware. It also takes a careful approach to project iteration and refactoring, and deep thinking about technical solutions, not just writing good code.