Redis is a popular in-memory database that achieves very high read and write performance by keeping its data in memory. The drawback is that once the Redis process exits, all of that data is lost.

To solve this problem, Redis provides two persistence schemes, RDB and AOF, which save in-memory data to disk to avoid data loss. This article focuses on the AOF persistence scheme and some of its problems, and then discusses the design and implementation details of Multi Part AOF (hereinafter MP-AOF, a feature contributed by the AliCloud Database Tair team) in Redis 7.0 (released in RC1).

1. AOF

AOF (append only file) persistence records each write command in a separate log file, and replays the commands in the AOF file to recover the data when Redis starts.

Because AOF appends every Redis write command to the file, the AOF file grows as Redis processes more write commands, and the time needed to replay it grows as well. To solve this problem, Redis introduced the AOF rewrite mechanism (hereafter AOFRW). AOFRW removes redundant writes from the AOF and rewrites the remainder in an equivalent form, generating a new, smaller AOF file.
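
To make the effect of a rewrite concrete, here is a minimal sketch. It is not Redis code: the `set_cmd` type, `MAX_KEYS`, and `rewrite_log` are hypothetical, and the model only knows SET commands on small integer keys. Real AOFRW serializes the in-memory dataset; the sketch rebuilds the final state from the log first only so that it stays self-contained.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified log entry: "SET key value" with small integer keys. */
typedef struct {
    int key;
    int value;
} set_cmd;

#define MAX_KEYS 16

/* Rebuilds the final value of every key from `log`, then writes exactly one
 * SET per live key into `out`. Returns the number of commands written. */
size_t rewrite_log(const set_cmd *log, size_t n, set_cmd *out) {
    int value[MAX_KEYS] = {0};
    int present[MAX_KEYS] = {0};

    for (size_t i = 0; i < n; i++) {
        assert(log[i].key >= 0 && log[i].key < MAX_KEYS);
        value[log[i].key] = log[i].value;   /* later writes win */
        present[log[i].key] = 1;
    }

    size_t written = 0;
    for (int k = 0; k < MAX_KEYS; k++) {
        if (present[k]) {
            out[written].key = k;
            out[written].value = value[k];
            written++;
        }
    }
    return written;
}
```

Five SET commands that touch only two keys collapse to two commands, which is why the rewritten file is smaller and faster to replay.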

2. AOFRW

Figure 1 shows how AOFRW works. When AOFRW is triggered, Redis first forks a child process to perform the rewrite in the background; the child writes a snapshot of the Redis data as of the moment of the fork to a temporary AOF file named temp-rewriteaof-bg-pid.aof.

Because the rewrite runs in the background in the child process, the main process can still respond to user commands while the AOF is being rewritten. So that the child process eventually also receives the incremental changes that the main process produces during the rewrite, the main process writes each executed write command to aof_buf and, in addition, caches a copy in aof_rewrite_buf. In the later stage of the rewrite, the main process sends the data accumulated in aof_rewrite_buf to the child process through a pipe, and the child process appends it to the temporary AOF file.

When the main process is handling heavy write traffic, aof_rewrite_buf may accumulate so much data that the child process cannot consume all of it during the rewrite. In that case, the main process handles the data remaining in aof_rewrite_buf at the end of the rewrite.

When the child process finishes the rewrite and exits, the main process completes the remaining work in backgroundRewriteDoneHandler. First, it appends the data in aof_rewrite_buf that was not consumed during the rewrite to the temporary AOF file. Second, once everything is ready, Redis uses rename to atomically rename the temporary AOF file to server.aof_filename, which overwrites the original AOF file. At this point, the entire AOFRW process is complete.

[Figure 1: how AOFRW works]

3. Problems with AOFRW

1. Memory overhead

As you can see from Figure 1, during AOFRW, the main process writes the data changes after fork into aof_rewrite_buf. The vast majority of the contents of aof_rewrite_buf and aof_buf are duplicated, so this introduces additional memory redundancy overhead.

The amount of memory currently occupied by aof_rewrite_buf is reported in the aof_rewrite_buffer_length field of Redis INFO. As shown below, under high write traffic aof_rewrite_buffer_length takes up almost as much memory as aof_buffer_length, so nearly the same data is held in memory twice.

aof_pending_rewrite:0
aof_buffer_length:35500
aof_rewrite_buffer_length:34000
aof_pending_bio_fsync:0
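
The duplication can be illustrated with a minimal sketch of the write path described above. This is an assumption-laden model rather than the Redis implementation: `byte_buf`, `buf_append`, and `feed_command` are hypothetical names, and flushing of aof_buf to disk is left out.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable byte buffer standing in for aof_buf / aof_rewrite_buf. */
typedef struct {
    char  *data;
    size_t len;
    size_t cap;
} byte_buf;

static void buf_append(byte_buf *b, const char *s, size_t n) {
    if (b->len + n > b->cap) {
        size_t cap = b->cap ? b->cap : 64;
        while (cap < b->len + n) cap *= 2;
        char *p = realloc(b->data, cap);
        assert(p != NULL);
        b->data = p;
        b->cap = cap;
    }
    memcpy(b->data + b->len, s, n);
    b->len += n;
}

/* Models the pre-7.0 write path: every command goes to aof_buf, and while a
 * rewrite child is running it is copied into aof_rewrite_buf as well. */
void feed_command(byte_buf *aof_buf, byte_buf *aof_rewrite_buf,
                  int rewrite_in_progress, const char *cmd) {
    size_t n = strlen(cmd);
    buf_append(aof_buf, cmd, n);
    if (rewrite_in_progress) buf_append(aof_rewrite_buf, cmd, n);
}
```

While a rewrite is in progress, both buffers grow by the same number of bytes for every command, which matches the nearly equal values in the INFO output.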

When the memory size occupied by aof_rewrite_buf exceeds a certain threshold, we will see the following message in the Redis log. As you can see, aof_rewrite_buf is taking up 100MB of memory space and 2135MB of data is being transferred between the main process and the child process (the child process also has internal memory overhead for reading the buffer when it reads this data via pipe).

This is a significant amount of overhead for Redis, an in-memory database.

3351:M 25 Jan 2022 09:55:39.655 * Background append only file rewriting started by pid 6817
3351:M 25 Jan 2022 09:57:51.864 * AOF rewrite child asks to stop sending diffs.
6817:C 25 Jan 2022 09:57:51.864 * Parent agreed to stop sending diffs. Finalizing AOF...
6817:C 25 Jan 2022 09:57:51.864 * Concatenating 2135.60 MB of AOF diff received from parent.
3351:M 25 Jan 2022 09:57:56.545 * Background AOF buffer size: 100 MB

The memory overhead of AOFRW can cause Redis memory usage to suddenly reach the maxmemory limit, which affects normal write commands. It can even push the process past the operating system's limits so that it is killed by the OOM killer, leaving Redis unable to serve requests.

2. CPU Overhead

There are three main areas of CPU overhead, each explained as follows.

  1. During AOFRW, the main process spends CPU time writing data to aof_rewrite_buf and sending the data in aof_rewrite_buf to the child process through the event loop.

    /* Append data to the AOF rewrite buffer, allocating new blocks if needed. */
    void aofRewriteBufferAppend(unsigned char *s, unsigned long len) {
        // Other details omitted...
    
        /* Install a file event to send data to the rewrite child if there is
        * not one already. */
        if (!server.aof_stop_sending_diff &&
            aeGetFileEvents(server.el,server.aof_pipe_write_data_to_child) == 0)
        {
            aeCreateFileEvent(server.el, server.aof_pipe_write_data_to_child,
                AE_WRITABLE, aofChildWriteDiffData, NULL);
        } 
    
        // Other details omitted...
    }
    
  2. In the later stage of the rewrite, the child process repeatedly reads the incremental data that the main process sends through the pipe and appends it to the temporary AOF file.

    int rewriteAppendOnlyFile(char *filename) {
        // Other details omitted...
    
        /* Read again a few times to get more data from the parent.
        * We can't read forever (the server may receive data from clients
        * faster than it is able to send data to the child), so we try to read
        * some more data in a loop as soon as there is a good chance more data
        * will come. If it looks like we are wasting time, we abort (this
        * happens after 20 ms without new data). */
        int nodata = 0;
        mstime_t start = mstime();
        while(mstime()-start < 1000 && nodata < 20) {
            if (aeWait(server.aof_pipe_read_data_from_parent, AE_READABLE, 1) <= 0)
            {
                nodata++;
                continue;
            }
            nodata = 0; /* Start counting from zero, we stop on N *contiguous*
                        timeouts. */
            aofReadDiffFromParent();
        }
    
        // Other details omitted...
    }
    
  3. After the child process completes the rewrite, the main process performs the finishing work in backgroundRewriteDoneHandler. One of its tasks is to write the data in aof_rewrite_buf that was not consumed during the rewrite to the temporary AOF file. If a lot of data is left in aof_rewrite_buf, this consumes CPU time as well.

    void backgroundRewriteDoneHandler(int exitcode, int bysignal) {
        // Other details omitted...
    
        /* Flush the differences accumulated by the parent to the rewritten AOF. */
        if (aofRewriteBufferWrite(newfd) == -1) {
            serverLog(LL_WARNING,
                    "Error trying to flush the parent diff to the rewritten AOF: %s", strerror(errno));
            close(newfd);
            goto cleanup;
        }
    
        // Other details omitted...
    }
    

The CPU overhead of AOFRW can cause jitter in the response time (RT) of Redis commands, or even client timeouts.

3. Disk IO Overhead

As mentioned earlier, during AOFRW the main process writes each executed write command to aof_buf and also writes a copy to aof_rewrite_buf. The data in aof_buf is eventually written to the old AOF file currently in use, which generates disk IO. At the same time, the data in aof_rewrite_buf is also written to the new AOF file, again generating disk IO. Therefore, the same data generates disk IO twice.

4. Code Complexity

Redis uses the six pipes shown below for data transfer and control interactions between the main process and the child process, which makes the entire AOFRW logic more complex and harder to understand.

 /* AOF pipes used to communicate between parent and child during rewrite. */
 int aof_pipe_write_data_to_child;
 int aof_pipe_read_data_from_parent;
 int aof_pipe_write_ack_to_parent;
 int aof_pipe_read_ack_from_child;
 int aof_pipe_write_ack_to_child;
 int aof_pipe_read_ack_from_parent;

4. MP-AOF Implementation

1. Solution Overview

As the name implies, MP-AOF splits the original single AOF file into multiple AOF files. In MP-AOF, AOF files are divided into three types:

  • BASE: the base AOF, which is generally produced by the child process through a rewrite; there is at most one such file.
  • INCR: an incremental AOF, which is generally created when AOFRW starts; there may be more than one such file.
  • HISTORY: a historical AOF, which a BASE or INCR AOF turns into. Each time AOFRW completes successfully, the BASE and INCR AOFs that preceded it become HISTORY, and Redis deletes HISTORY AOFs automatically.

To manage these AOF files, we introduce a manifest file that tracks them. In addition, to make AOF backup and copying easier, all AOF files and the manifest file are placed in a single directory whose name is determined by the appenddirname configuration (a new configuration item in Redis 7.0).

[Figure 2: general flow of an AOFRW in MP-AOF]

Figure 2 shows the general flow of an AOFRW in MP-AOF. At the start we still fork a child process for the rewrite, and at the same time the main process opens a new INCR AOF file. While the child process is rewriting, all data changes are written to this newly opened INCR AOF. The rewrite by the child process is completely independent, and when it finishes it has produced a new BASE AOF; the newly generated BASE AOF together with the newly opened INCR AOF represents all the data in Redis at the current moment. At the end of AOFRW, the main process updates the manifest file: it adds the information of the new BASE AOF and INCR AOF and marks the previous BASE AOF and INCR AOFs as HISTORY (Redis deletes these HISTORY AOFs asynchronously). Once the manifest file is updated, the entire AOFRW process is finished.

As can be seen in Figure 2, we no longer need the aof_rewrite_buf during AOFRW, so the corresponding memory consumption is removed. At the same time, there is no more data transfer and control interaction between the main process and the child process, so the corresponding CPU overhead is also removed. Correspondingly, the six pipes and their corresponding code are also removed, making the AOFRW logic simpler and clearer.

2. Key implementations

Manifest

1) Representation in memory

MP-AOF depends heavily on the manifest file. In memory, the manifest is represented by the structures below, where:

  • aofInfo: describes one AOF file; it currently holds only the file name, the file sequence number, and the file type
  • base_aof_info: the BASE AOF information; this field is NULL when no BASE AOF exists
  • incr_aof_list: stores the information of all INCR AOF files, ordered by the time the files were opened
  • history_aof_list: stores the HISTORY AOF information; its elements are moved here from base_aof_info and incr_aof_list

typedef struct {
    sds           file_name;  /* file name */
    long long     file_seq;   /* file sequence */
    aof_file_type file_type;  /* file type */
} aofInfo;

typedef struct {
    aofInfo     *base_aof_info;       /* BASE file information. NULL if there is no BASE file. */
    list        *incr_aof_list;       /* INCR AOFs list. We may have multiple INCR AOF when rewrite fails. */
    list        *history_aof_list;    /* HISTORY AOF list. When the AOFRW success, The aofInfo contained in
                                         `base_aof_info` and `incr_aof_list` will be moved to this list. We
                                         will delete these AOF files when AOFRW finish. */
    long long   curr_base_file_seq;   /* The sequence number used by the current BASE file. */
    long long   curr_incr_file_seq;   /* The sequence number used by the current INCR file. */
    int         dirty;                /* 1 Indicates that the aofManifest in the memory is inconsistent with
                                         disk, we need to persist it immediately. */
} aofManifest;

To make atomic modification and rollback easier, the redisServer structure refers to the aofManifest through a pointer.

struct redisServer {
    // Other details omitted...

    aofManifest *aof_manifest;       /* Used to track AOFs. */

    // Other details omitted...
};

2) Representation on disk

A manifest is essentially a text file made up of multiple lines of records. Each line describes one AOF file, and the information is stored as key/value pairs so that it is easy for Redis to process and easy for people to read and modify. The following is a possible manifest file.

file appendonly.aof.1.base.rdb seq 1 type b
file appendonly.aof.1.incr.aof seq 1 type i
file appendonly.aof.2.incr.aof seq 2 type i
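
As a rough illustration of this record layout, the sketch below formats one line in the same shape as the example. The `file_type` enum and `format_manifest_line` are hypothetical helpers written for this article, not functions from the Redis source, and the sketch assumes file names without spaces.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Single-character type codes as they appear in the manifest examples. */
typedef enum { TYPE_BASE = 'b', TYPE_HISTORY = 'h', TYPE_INCR = 'i' } file_type;

/* Formats one manifest record into `out`. Returns the number of characters
 * written (excluding the terminator), or -1 if `out` is too small. */
int format_manifest_line(char *out, size_t cap, const char *name,
                         long long seq, file_type type) {
    assert(out != NULL && name != NULL && strlen(name) > 0);
    int n = snprintf(out, cap, "file %s seq %lld type %c\n",
                     name, seq, (char)type);
    return (n < 0 || (size_t)n >= cap) ? -1 : n;
}
```

Calling it with "appendonly.aof.1.base.rdb", sequence 1, and TYPE_BASE yields the first line of the example above.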

The manifest format itself needs to be extensible so that other features can be added or supported in the future. For example, it can easily accommodate new key/value pairs and annotations (similar to the annotations in the AOF), which ensures better forward compatibility.

file appendonly.aof.1.base.rdb seq 1 type b newkey newvalue
file appendonly.aof.1.incr.aof type i seq 1 
# this is an annotation
seq 2 type i file appendonly.aof.2.incr.aof
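
A tolerant reader for this format might look like the following sketch. It is a simplified assumption of how such parsing could work, not the parser that ships with Redis: `manifest_entry` and `parse_manifest_line` are hypothetical, file names are assumed to contain no spaces, and `strtok` is used for brevity even though it is not reentrant.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    char      name[128];
    long long seq;
    char      type;      /* 'b', 'i' or 'h' */
} manifest_entry;

/* Parses one manifest line made of whitespace-separated "key value" pairs in
 * any order. Returns 1 on success, 0 for blank lines and '#' annotations,
 * and -1 if the line is too long, a key has no value, a value is malformed,
 * or a required key is missing. Unknown keys are skipped so that newer
 * fields do not break this reader. */
int parse_manifest_line(const char *line, manifest_entry *e) {
    char buf[512];
    int have_file = 0, have_seq = 0, have_type = 0;

    assert(line != NULL && e != NULL);
    if (strlen(line) >= sizeof(buf)) return -1;
    strcpy(buf, line);

    char *key = strtok(buf, " \t\r\n");
    if (key == NULL || key[0] == '#') return 0;

    while (key != NULL) {
        char *val = strtok(NULL, " \t\r\n");
        if (val == NULL) return -1;              /* key without a value */
        if (strcmp(key, "file") == 0) {
            if (strlen(val) >= sizeof(e->name)) return -1;
            strcpy(e->name, val);
            have_file = 1;
        } else if (strcmp(key, "seq") == 0) {
            char *end;
            e->seq = strtoll(val, &end, 10);
            if (*end != '\0') return -1;         /* not a plain integer */
            have_seq = 1;
        } else if (strcmp(key, "type") == 0) {
            if (strlen(val) != 1) return -1;
            e->type = val[0];
            have_type = 1;
        }
        key = strtok(NULL, " \t\r\n");
    }
    return (have_file && have_seq && have_type) ? 1 : -1;
}
```

Because the keys are looked up by name, reordered pairs and an extra "newkey newvalue" pair are both accepted, which is the extensibility property the format is designed for.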

File Naming Rules

Before MP-AOF, the AOF file name was the value set by the appendfilename parameter (default is appendonly.aof).

In MP-AOF, the AOF files are named basename.suffix. The value of the appendfilename configuration is used as the basename, and the suffix has three parts in the format seq.type.format, where:

  • seq is the file sequence number, which starts at 1 and increases monotonically; BASE and INCR files are numbered separately
  • type is the type of the AOF, indicating whether the file is a BASE or an INCR
  • format is the internal encoding of the AOF. Because Redis supports the RDB preamble mechanism, a BASE AOF may be encoded in either RDB or AOF format.

#define BASE_FILE_SUFFIX           ".base"
#define INCR_FILE_SUFFIX           ".incr"
#define RDB_FORMAT_SUFFIX          ".rdb"
#define AOF_FORMAT_SUFFIX          ".aof"
#define MANIFEST_NAME_SUFFIX       ".manifest"

Therefore, with the default appendfilename configuration, possible names for the BASE and INCR files are as follows (the manifest file is named with the basename followed by the .manifest suffix).

appendonly.aof.1.base.rdb // RDB preamble enabled
appendonly.aof.1.base.aof // RDB preamble disabled
appendonly.aof.1.incr.aof
appendonly.aof.2.incr.aof
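
The naming rule can be sketched as a small helper. The suffix macros are the ones listed above; `make_aof_name` and its parameters are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define BASE_FILE_SUFFIX  ".base"
#define INCR_FILE_SUFFIX  ".incr"
#define RDB_FORMAT_SUFFIX ".rdb"
#define AOF_FORMAT_SUFFIX ".aof"

/* Builds "<base_name>.<seq><type suffix><format suffix>" into `out`.
 * Returns 0 on success, -1 if the buffer is too small. */
int make_aof_name(char *out, size_t cap, const char *base_name, long long seq,
                  int is_base, int rdb_format) {
    assert(out != NULL && base_name != NULL && seq >= 1);
    /* In this sketch only a BASE file may use the RDB format. */
    assert(is_base || !rdb_format);
    int n = snprintf(out, cap, "%s.%lld%s%s", base_name, seq,
                     is_base ? BASE_FILE_SUFFIX : INCR_FILE_SUFFIX,
                     rdb_format ? RDB_FORMAT_SUFFIX : AOF_FORMAT_SUFFIX);
    return (n < 0 || (size_t)n >= cap) ? -1 : 0;
}
```

With "appendonly.aof" as the base name, the helper reproduces the names in the listing above, for example appendonly.aof.1.base.rdb and appendonly.aof.2.incr.aof.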

Compatibility with upgrades from older versions

Since MP-AOF depends heavily on the manifest file, Redis loads AOF files strictly as the manifest directs when it starts. However, when upgrading from an older version (anything before Redis 7.0) to Redis 7.0, no manifest file exists yet, so Redis must be able to recognize that an upgrade is in progress and load the old AOF correctly and safely.

Recognition is the first step of this process. Before actually loading any AOF file, we check whether an AOF file named server.aof_filename exists in the Redis working directory. If it does, we may be upgrading from an older version of Redis. We then treat the start as an upgrade when any one of the following three conditions is met.

  1. The appenddirname directory does not exist.
  2. The appenddirname directory exists, but it contains no corresponding manifest file.
  3. The appenddirname directory exists and contains a manifest file, the manifest holds only BASE AOF information, the name of that BASE AOF equals server.aof_filename, and no file named server.aof_filename exists in the appenddirname directory.

/* Load the AOF files according to the aofManifest pointed to by am. */
int loadAppendOnlyFiles(aofManifest *am) {
    // Other details omitted...

    /* If the 'server.aof_filename' file exists in dir, we may be starting
     * from an old redis version. We will enter upgrade mode in three situations.
     *
     * 1. If the 'server.aof_dirname' directory does not exist
     * 2. If the 'server.aof_dirname' directory exists but the manifest file is missing
     * 3. If the 'server.aof_dirname' directory exists and the manifest file it contains
     *    has only one base AOF record, and the file name of this base AOF is 'server.aof_filename',
     *    and the 'server.aof_filename' file does not exist in the 'server.aof_dirname' directory
     * */
    if (fileExist(server.aof_filename)) {
        if (!dirExists(server.aof_dirname) ||
            (am->base_aof_info == NULL && listLength(am->incr_aof_list) == 0) ||
            (am->base_aof_info != NULL && listLength(am->incr_aof_list) == 0 &&
             !strcmp(am->base_aof_info->file_name, server.aof_filename) && !aofFileExist(server.aof_filename)))
        {
            aofUpgradePrepare(am);
        }
    }

    // Other details omitted...
}
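
The decision above can be restated as a pure function over what Redis finds on disk, which makes the three cases easy to check in isolation. This is a sketch under assumptions: `startup_state` and `is_upgrade_start` are hypothetical, and the file system probes are replaced by plain fields.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical summary of what Redis finds on disk at startup. */
typedef struct {
    bool        old_aof_in_workdir;   /* server.aof_filename exists in the working dir */
    bool        aof_dir_exists;       /* the appenddirname directory exists */
    const char *manifest_base_name;   /* BASE entry in the manifest, NULL if none */
    int         manifest_incr_count;  /* number of INCR entries in the manifest */
    bool        old_aof_in_aof_dir;   /* server.aof_filename exists inside appenddirname */
} startup_state;

bool is_upgrade_start(const startup_state *s, const char *aof_filename) {
    assert(s != NULL && aof_filename != NULL);
    if (!s->old_aof_in_workdir) return false;

    /* 1. The AOF directory does not exist. */
    if (!s->aof_dir_exists) return true;
    /* 2. The manifest is missing or holds no entries. */
    if (s->manifest_base_name == NULL && s->manifest_incr_count == 0) return true;
    /* 3. The manifest names only the old AOF as BASE, but the file has not
     *    been moved into the AOF directory yet (an interrupted upgrade). */
    return s->manifest_base_name != NULL && s->manifest_incr_count == 0 &&
           strcmp(s->manifest_base_name, aof_filename) == 0 &&
           !s->old_aof_in_aof_dir;
}
```

The third case is the one that lets an upgrade interrupted between persisting the manifest and moving the file be detected and retried on the next start.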

Once the start is recognized as an upgrade, the aofUpgradePrepare function carries out the preparation work.

The upgrade preparation has three main parts.

  1. Construct a BASE AOF entry using server.aof_filename as the file name.
  2. Persist the BASE AOF information to the manifest file.
  3. Use rename to move the old AOF file into the appenddirname directory.

void aofUpgradePrepare(aofManifest *am) {
    // Other details omitted...

    /* 1. Manually construct a BASE type aofInfo and add it to aofManifest. */
    if (am->base_aof_info) aofInfoFree(am->base_aof_info);
    aofInfo *ai = aofInfoCreate();
    ai->file_name = sdsnew(server.aof_filename);
    ai->file_seq = 1;
    ai->file_type = AOF_FILE_TYPE_BASE;
    am->base_aof_info = ai;
    am->curr_base_file_seq = 1;
    am->dirty = 1;

    /* 2. Persist the manifest file to AOF directory. */
    if (persistAofManifest(am) != C_OK) {
        exit(1);
    }

    /* 3. Move the old AOF file to AOF directory. */
    sds aof_filepath = makePath(server.aof_dirname, server.aof_filename);
    if (rename(server.aof_filename, aof_filepath) == -1) {
        sdsfree(aof_filepath);
        exit(1);
    }

    // Other details omitted...
}

The upgrade preparation is crash safe: if a crash occurs during any of the three steps above, Redis correctly identifies the situation on the next start and retries the entire upgrade.

Multi-file loading and progress calculation

Redis keeps track of the loading progress while loading an AOF and reports it in the loading_loaded_perc field of Redis INFO. In MP-AOF, the loadAppendOnlyFiles function loads AOF files according to the aofManifest passed in. Before loading, we calculate the total size of all the AOF files to be loaded and pass it to the startLoading function, and then report the loading progress continuously in loadSingleAppendOnlyFile.

Next, loadAppendOnlyFiles loads the BASE AOF and then the INCR AOFs in the order given by the aofManifest. Once all AOF files have been loaded, stopLoading ends the loading state.

int loadAppendOnlyFiles(aofManifest *am) {
    // Other details omitted...

    /* Here we calculate the total size of all BASE and INCR files in
     * advance, it will be set to `server.loading_total_bytes`. */
    total_size = getBaseAndIncrAppendOnlyFilesSize(am);
    startLoading(total_size, RDBFLAGS_AOF_PREAMBLE, 0);

    /* Load BASE AOF if needed. */
    if (am->base_aof_info) {
        aof_name = (char*)am->base_aof_info->file_name;
        updateLoadingFileName(aof_name);
        loadSingleAppendOnlyFile(aof_name);
    }

    /* Load INCR AOFs if needed. */
    if (listLength(am->incr_aof_list)) {
        listNode *ln;
        listIter li;

        listRewind(am->incr_aof_list, &li);
        while ((ln = listNext(&li)) != NULL) {
            aofInfo *ai = (aofInfo*)ln->value;
            aof_name = (char*)ai->file_name;
            updateLoadingFileName(aof_name);
            loadSingleAppendOnlyFile(aof_name);
        }
    }

    server.aof_current_size = total_size;
    server.aof_rewrite_base_size = server.aof_current_size;
    server.aof_fsync_offset = server.aof_current_size;

    stopLoading();

    // Other details omitted...
}
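
The reason the total size has to be known up front is that progress is measured against all files together, not against the file currently being read. A minimal sketch of that arithmetic, with hypothetical helper names and with "nothing to load" treated as fully loaded by assumption:

```c
#include <assert.h>
#include <stddef.h>

/* Sums the sizes (in bytes) of the BASE file and all INCR files. */
long long total_aof_size(long long base_size, const long long *incr_sizes, size_t n) {
    long long total = base_size;
    assert(base_size >= 0);
    for (size_t i = 0; i < n; i++) {
        assert(incr_sizes[i] >= 0);
        total += incr_sizes[i];
    }
    return total;
}

/* Percentage of the whole multi-file load that is complete. An empty load is
 * reported as 100% here; that is a choice made for this sketch. */
double loaded_percent(long long loaded_bytes, long long total_bytes) {
    if (total_bytes <= 0) return 100.0;
    return (double)loaded_bytes * 100.0 / (double)total_bytes;
}
```

For a 500-byte BASE file followed by INCR files of 300 and 200 bytes, finishing the BASE file corresponds to 50% of the load, not 100%.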

AOFRW Crash Safety

When the child process completes the rewrite, it has produced a temporary AOF file named temp-rewriteaof-bg-pid.aof. Redis still cannot see this file because it has not yet been added to the manifest. To make it visible to Redis, so that it is loaded correctly at startup, we need to rename it according to the naming rules described earlier and add its information to the manifest file.

Although renaming the AOF file and modifying the manifest file are two separate operations, together they must behave atomically so that Redis can load the correct AOFs at startup. MP-AOF relies on two design decisions to solve this problem.

  1. The name of each BASE AOF contains a file sequence number, which ensures that a newly created BASE AOF never conflicts with a previous one.
  2. The AOF file is renamed first, and the manifest file is modified afterwards.

For the sake of illustration, let’s assume that before AOFRW starts, the contents of the manifest file are as follows.

file appendonly.aof.1.base.rdb seq 1 type b
file appendonly.aof.1.incr.aof seq 1 type i

After AOFRW starts, the manifest file contains the following.

file appendonly.aof.1.base.rdb seq 1 type b
file appendonly.aof.1.incr.aof seq 1 type i
file appendonly.aof.2.incr.aof seq 2 type i

After the child process finishes the rewrite, the main process renames temp-rewriteaof-bg-pid.aof to appendonly.aof.2.base.rdb, adds it to the manifest, and marks the previous BASE and INCR AOFs as HISTORY. At this point the manifest file looks like this.

file appendonly.aof.2.base.rdb seq 2 type b
file appendonly.aof.1.base.rdb seq 1 type h
file appendonly.aof.1.incr.aof seq 1 type h
file appendonly.aof.2.incr.aof seq 2 type i

At this point, the results of this AOFRW are visible to Redis, and the HISTORY AOF is cleaned up asynchronously by Redis.

The backgroundRewriteDoneHandler function implements the above logic in seven steps.

  1. Before modifying the in-memory server.aof_manifest, dup a temporary manifest structure; all subsequent modifications are made to this temporary manifest. The advantage is that if a later step fails, we can simply destroy the temporary manifest to roll back the whole operation, without polluting the global server.aof_manifest structure.
  2. Get the new BASE AOF file name (denoted new_base_filename) from the temporary manifest and mark the previous BASE AOF, if any, as HISTORY.
  3. Rename the temp-rewriteaof-bg-pid.aof temporary file generated by the child process to new_base_filename.
  4. Mark all previous INCR AOFs in the temporary manifest structure as HISTORY.
  5. Persist the temporary manifest to disk (persistAofManifest internally guarantees that the modification of the manifest file itself is atomic).
  6. If all the steps above succeed, point the in-memory server.aof_manifest pointer at the temporary manifest structure and release the previous one. Only at this point does the whole modification become visible to Redis.
  7. Delete the AOFs of type HISTORY. This step is allowed to fail because a failure does not cause any data consistency problem.

void backgroundRewriteDoneHandler(int exitcode, int bysignal) {
    snprintf(tmpfile, 256, "temp-rewriteaof-bg-%d.aof",
        (int)server.child_pid);

    /* 1. Dup a temporary aof_manifest for subsequent modifications. */
    temp_am = aofManifestDup(server.aof_manifest);

    /* 2. Get a new BASE file name and mark the previous (if we have)
     * as the HISTORY type. */
    new_base_filename = getNewBaseFileNameAndMarkPreAsHistory(temp_am);

    /* 3. Rename the temporary aof file to 'new_base_filename'. */
    if (rename(tmpfile, new_base_filename) == -1) {
        aofManifestFree(temp_am);
        goto cleanup;
    }

    /* 4. Change the AOF file type in 'incr_aof_list' from AOF_FILE_TYPE_INCR
     * to AOF_FILE_TYPE_HIST, and move them to the 'history_aof_list'. */
    markRewrittenIncrAofAsHistory(temp_am);

    /* 5. Persist our modifications. */
    if (persistAofManifest(temp_am) == C_ERR) {
        bg_unlink(new_base_filename);
        aofManifestFree(temp_am);
        goto cleanup;
    }

    /* 6. We can safely let `server.aof_manifest` point to 'temp_am' and free the previous one. */
    aofManifestFreeAndUpdate(temp_am);

    /* 7. We don't care about the return value of `aofDelHistoryFiles`, because the history
     * deletion failure will not cause any problems. */
    aofDelHistoryFiles();
}
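
The core of these steps is a copy, modify, persist, then swap pattern. The sketch below isolates that pattern with a deliberately tiny stand-in for the manifest. Everything in it is hypothetical (`mini_manifest`, `finish_rewrite`, the `step_fn` callbacks that stand in for rename and persist), and it omits both the removal of the renamed file after a persist failure and the final deletion of HISTORY files.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, heavily simplified manifest: only sequence numbers are kept. */
typedef struct {
    long long base_seq;       /* 0 means "no BASE file" */
    long long incr_seqs[8];   /* INCR files, oldest first */
    int       incr_count;
    int       history_count;  /* files waiting for deletion */
} mini_manifest;

typedef int (*step_fn)(const mini_manifest *m);  /* returns 0 on success */

/* Sample callbacks standing in for the rename and persist steps. */
int always_ok(const mini_manifest *m)   { (void)m; return 0; }
int always_fail(const mini_manifest *m) { (void)m; return -1; }

/* Applies a finished rewrite to *current. All edits go to a private copy;
 * *current is replaced only after both callbacks succeed. */
int finish_rewrite(mini_manifest **current, step_fn rename_step, step_fn persist_step) {
    assert(current != NULL && *current != NULL);
    mini_manifest *tmp = malloc(sizeof(*tmp));
    if (tmp == NULL) return -1;
    *tmp = **current;                                     /* 1. dup */

    if (tmp->base_seq != 0) tmp->history_count++;         /* 2. old BASE -> HISTORY */
    tmp->base_seq++;
    if (rename_step(tmp) != 0) { free(tmp); return -1; }  /* 3. rename temp file */

    /* 4. every INCR except the newest one is covered by the new BASE */
    if (tmp->incr_count > 1) {
        tmp->history_count += tmp->incr_count - 1;
        tmp->incr_seqs[0] = tmp->incr_seqs[tmp->incr_count - 1];
        tmp->incr_count = 1;
    }

    if (persist_step(tmp) != 0) { free(tmp); return -1; } /* 5. persist */

    free(*current);                                       /* 6. swap pointers */
    *current = tmp;
    return 0;
}
```

If either callback fails, the caller's manifest pointer and its contents are untouched, which is the rollback property that the temporary manifest is meant to provide.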

Support for AOF truncation

If the process crashes while writing, the AOF may end with an incomplete write. Redis cannot load such an incomplete AOF as it stands, but it supports AOF truncation (enabled through the aof-load-truncated configuration). The principle is to use server.aof_current_size to track the last correct file offset of the AOF and then call ftruncate to remove everything after that offset. This may lose some data, but it guarantees the integrity of the AOF.

In MP-AOF, server.aof_current_size no longer represents the size of a single AOF file but the total size of all AOF files. Since only the last INCR AOF can contain an incomplete write, we introduced a separate field, server.aof_last_incr_size, to track the size of the last INCR AOF file. When the last INCR AOF has an incomplete write, we simply remove the file contents after server.aof_last_incr_size.

if (ftruncate(server.aof_fd, server.aof_last_incr_size) == -1) {
    // Other details omitted...
}
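
The truncation itself needs nothing beyond the POSIX fstat and ftruncate calls. The following sketch shows the idea on an arbitrary file descriptor; `truncate_incomplete_tail` is a hypothetical helper, and how the last valid offset is determined is outside its scope.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

/* Cuts the file behind `fd` back to `last_valid_offset` if it is longer.
 * Returns the number of bytes discarded, or -1 on error. */
off_t truncate_incomplete_tail(int fd, off_t last_valid_offset) {
    struct stat st;
    assert(last_valid_offset >= 0);
    if (fstat(fd, &st) == -1) {
        perror("fstat");
        return -1;
    }
    if (st.st_size <= last_valid_offset) return 0;   /* nothing to remove */
    if (ftruncate(fd, last_valid_offset) == -1) {
        perror("ftruncate");
        return -1;
    }
    return st.st_size - last_valid_offset;
}
```

For a file holding one complete 9-byte command followed by a 5-byte partial command, truncating to offset 9 discards the 5 trailing bytes and leaves a file that can be loaded.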

AOFRW Flow Limiting

Redis can run AOFRW automatically when the AOF size exceeds a certain threshold. If a disk failure or a code bug causes AOFRW to fail, Redis keeps retrying AOFRW until it succeeds. Before MP-AOF this was not a big problem (at worst, it cost some CPU time and fork overhead). In MP-AOF, however, each AOFRW opens a new INCR AOF, and the previous INCR and BASE files are converted to HISTORY and deleted only when an AOFRW succeeds. Consecutive AOFRW failures therefore inevitably leave multiple INCR AOFs side by side. In extreme cases, if AOFRW is retried frequently, we would see hundreds or thousands of INCR AOF files.

For this reason, we introduced the AOFRW flow-limiting mechanism: once AOFRW has failed three times in a row, the next AOFRW is forcibly delayed by 1 minute; if that one fails as well, the following one is delayed by 2 minutes, then 4, 8, 16, and so on, up to a current maximum delay of 1 hour.

While AOFRW is being limited, the bgrewriteaof command can still be used to run an AOFRW immediately.

if (server.aof_state == AOF_ON &&
    !hasActiveChildProcess() &&
    server.aof_rewrite_perc &&
    server.aof_current_size > server.aof_rewrite_min_size &&
    !aofRewriteLimited())
{
    long long base = server.aof_rewrite_base_size ?
        server.aof_rewrite_base_size : 1;
    long long growth = (server.aof_current_size*100/base) - 100;
    if (growth >= server.aof_rewrite_perc) {
        rewriteAppendOnlyFileBackground();
    }
}
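
The delay schedule described above (no delay for the first failures, then 1, 2, 4, 8, 16 minutes and so on, capped at one hour) can be sketched as follows. The function and macro names are hypothetical and this is not the body of aofRewriteLimited; it only reproduces the arithmetic.

```c
#include <assert.h>

#define LIMIT_AFTER_FAILURES 3    /* consecutive failures before limiting starts */
#define MAX_DELAY_MINUTES    60   /* cap on the delay */

/* Returns how many minutes the next automatic rewrite should be delayed
 * after `consecutive_failures` failed attempts in a row. */
int rewrite_delay_minutes(int consecutive_failures) {
    assert(consecutive_failures >= 0);
    if (consecutive_failures < LIMIT_AFTER_FAILURES) return 0;

    int delay = 1;
    for (int i = LIMIT_AFTER_FAILURES; i < consecutive_failures; i++) {
        delay *= 2;
        if (delay >= MAX_DELAY_MINUTES) return MAX_DELAY_MINUTES;
    }
    return delay;
}
```

Returning early once the cap is reached keeps the doubling from overflowing even for very long failure streaks.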

The AOFRW flow-limiting mechanism also effectively avoids the CPU and fork overhead caused by high-frequency AOFRW retries; much of the RT jitter in Redis is related to fork.

5. Summary

The introduction of MP-AOF has successfully eliminated the memory and CPU overhead of the previous AOFRW, which negatively affected Redis instances and even business access. While solving these problems we ran into many unforeseen challenges, mainly stemming from Redis' large user base and diverse usage scenarios, so we had to consider the issues users might face with MP-AOF in all kinds of scenarios, such as compatibility, ease of use, and keeping changes to the Redis code as unintrusive as possible. These concerns are a top priority for the Redis community whenever a feature evolves.

MP-AOF also opens up more possibilities for data persistence in Redis. For example, when aof-use-rdb-preamble is enabled, the BASE AOF is essentially an RDB file, so a full backup does not need a separate BGSAVE. MP-AOF can also turn off the automatic cleanup of HISTORY AOFs, which gives those historical AOFs a chance to be preserved, and Redis now supports adding timestamp annotations to the AOF, so a simple PITR (point-in-time recovery) capability can even be built on top of this.

The design prototype of MP-AOF comes from the binlog implementation of Tair for Redis Enterprise Edition, a set of core features proven on the AliCloud Tair service. Today we are contributing this core capability to the Redis community, hoping that community users can also benefit from these enterprise-class features and use them to better optimize and build their own business code. For more details on MP-AOF, please see the related PR ([#9788]), which contains more of the original design and the full code.