In-depth analysis of split locks, i++ can lead to disaster

Split lock is a memory bus lock supported by the CPU to support atomic memory accesses across a cache line.

Some processors like ARM and RISC-V do not allow unaligned memory accesses and do not generate atomic accesses across cache lines, so split lock is not generated, while X86 supports it.

Split lock is convenient for developers because there is no need to consider unaligned memory access, but it also comes at a cost: an instruction that generates split lock will occupy the memory bus exclusively for about 1000 clock cycles, compared to less than 10 clock cycles for a normal ADD instruction, and locking the memory bus so that other CPUs cannot access memory can seriously affect system performance.

Therefore, the detection and handling of split lock is very important. Nowadays, CPUs support the detection capability, and will directly panic if detected in the kernel state, and will try to actively sleep in the user state to reduce the frequency of split lock generation, or kill the user state process, thus alleviating the contention for the memory bus.

With the introduction of virtualization, it will try to handle this on the host side, and KVM will notify the vCPU threads in QEMU to actively sleep to reduce the frequency of split lock generation, or even kill the virtual machine. The above conclusions are only as of 2022/4/19 (same below), the community still has different views on split lock handling in the past 2 years, and the handling has changed many times, so the following analysis only discusses the current situation.

1. Split lock background

1.1 Starting with i++

Let’s assume the simplest computing model, a CPU (single core, no Hyper-threading enabled, no Cache), and a block of memory. A C program is running on it executing i++, which corresponds to the assembly code add 1, i.

Analyzing the semantics of the add instruction here, two operands are required, the source operand SRC and the destination operand DEST, to achieve the function DEST = DEST + SRC. Here SRC is the immediate number 1 and DEST is the memory address of i. The CPU needs to first read the content of i in memory, then add 1, and finally write the result to the memory address where i is located. In total, two serial memory operations are generated.

If the computational architecture is more complex, with 2 CPU cores CoreA and CoreB, the above i++ code has to consider the data consistency problem.

1.1.1 Concurrent Write Problem

What if CoreA is writing to i’s memory address while CoreB is writing to i’s memory address at the same time?

Concurrent Write Problem

Writing the same memory address concurrently is actually quite simple, and the CPU guarantees the atomicity of the underlying memory operations from hardware.

The specific operations are.

Read/write 1 byte
Read/write 16 bit aligned 2 byte
Read/write 32 bit aligned 4 byte
Read/write 64 bit aligned 8 byte

1.1.2 Write Overwrite Problem

What if CoreB writes data to i’s memory address during the time after CoreA reads i from memory and before it writes to i’s memory address?

Write Overwrite Problem

This situation causes the data written by CoreB to be overwritten by the data written by CoreA later, so that the written data of CoreB is lost and CoreA does not know that the written data has been updated after it has been read.

Why does this problem occur? Because the ADD instruction is not an atomic operation and will generate two memory operations.

So how do we solve this problem? Since the ADD instruction is not atomic in hardware, add a lock in software to implement atomic operations so that CoreB’s memory operations cannot be executed until CoreA’s memory operations are complete.

Write Overwrite Problem

The corresponding method is to declare the instruction prefix LOCK and the assembly code becomes lock add 1, i.

1.2 Bus lock

When the LOCK instruction prefix is declared, the accompanying instruction becomes an atomic instruction. The principle is to lock the system bus during the execution of the accompanying instruction, prohibiting other processors from performing memory operations and making them exclusive to achieve atomic operations.

Bus lock

Here are a few examples.

1.2.1 Atomic Accumulation in QEMU

The function qatomic_inc(ptr) in QEMU adds 1 to the memory data pointed to by the argument ptr.

`1`	`#define qatomic_inc(ptr) ((void) __sync_fetch_and_add(ptr, 1))`

The principle is to call the GCC built-in __sync_fetch_and_add function. Let’s write a C program by hand and see the assembly implementation of __sync_fetch_and_add.

int main() {
    int i = 1;
    int *p = &i;
    while(1) {
        __sync_fetch_and_add(p, 1);
    }
    return 0;
}

// add.s
        .file   "add.c"
        .text
        .globl  main
        .type   main, @function
main:
.LFB0:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register 6
        movl    $1, -12(%rbp)
        leaq    -12(%rbp), %rax
        movq    %rax, -8(%rbp)
.L2:
        movq    -8(%rbp), %rax
        lock addl       $1, (%rax)
        jmp     .L2
        .cfi_endproc
.LFE0:
        .size   main, .-main
        .ident  "GCC: (Debian 6.3.0-18+deb9u1) 6.3.0 20170516"
        .section        .note.GNU-stack,"",@progbits

You can see that the assembly implementation of __sync_fetch_and_add declares the lock instruction prefix before the add instruction.

1.2.2 Atomic accumulation in the Kernel

The atomic_inc function in the Kernel adds 1 to the memory data pointed to by the argument v.

static __always_inline void
atomic_inc(atomic_t *v)
{
        instrument_atomic_read_write(v, sizeof(*v));
        arch_atomic_inc(v);
}

static __always_inline void arch_atomic_inc(atomic_t *v)
{
        asm volatile(LOCK_PREFIX "incl %0"
                     : "+m" (v->counter) :: "memory");
}

#define LOCK_PREFIX LOCK_PREFIX_HERE "\n\tlock; "

As you can see, the same lock instruction prefix is declared.

1.2.3 CAS (Compare And Swap)

The CAS interface in programming languages provides developers with atomic operations to implement lock-free mechanisms.

Golang’s CAS

// bool Cas(int32 *val, int32 old, int32 new)
// Atomically:
//        if(*val == old){
//                *val = new;
//                return 1;
//        } else
//                return 0;
TEXT ·Cas(SB),NOSPLIT,$0-17
        MOVQ        ptr+0(FP), BX
        MOVL        old+8(FP), AX
        MOVL        new+12(FP), CX
        LOCK
        CMPXCHGL        CX, 0(BX)
        SETEQ        ret+16(FP)
        RET

Java’s CAS

inline jlong    Atomic::cmpxchg    (jlong    exchange_value, volatile jlong*    dest, jlong    compare_value) {
  bool mp = os::is_MP();
  __asm__ __volatile__ (LOCK_IF_MP(%4) "cmpxchgq %1,(%3)"
                        : "=a" (exchange_value)
                        : "r" (exchange_value), "a" (compare_value), "r" (dest), "r" (mp)
                        : "cc", "memory");
  return exchange_value;
}

// Adding a lock prefix to an instruction on MP machine
#define LOCK_IF_MP(mp) "cmp $0, " #mp "; je 1f; lock; 1: "

As you can see, CAS is also implemented using the lock instruction prefix, so how is the lock instruction prefix implemented specifically?

1.2.4 The LOCK# Signal

Specifically, when a LOCK prefix instruction is declared in front of an instruction in the code, the processor generates the LOCK# signal during the instruction run so that other processors cannot access the memory through the bus.

Let’s try to understand the principle of the LOCK# signal by looking at the pinout diagram of the 8086 CPU.

8086 CPU

The 8086 CPU has a LOCK pin (pin 29 in the diagram) that is active low. When the LOCK instruction prefix is declared, the LOCK pin level is pulled low for assert operation, and other devices cannot gain control of the system bus at this time. When the execution of the instruction modified by the LOCK instruction is completed, the LOCK pin level is pulled high for de-assert.

So the whole process is clear. When you want to achieve atomic operation by non-atomic instructions (e.g. add), you need to declare the lock instruction prefix before the instruction when programming, and the lock instruction prefix will be recognized by the processor at runtime and generate the LOCK# signal to make it exclusive to the memory bus, while other processors cannot access the memory through the memory bus, thus achieving atomic operation. So it also solves the above write overwrite problem.

This looks good, but it introduces a new problem.

1.2.5 Performance degradation caused by bus locks

Nowadays there are more and more cores in the processor, what if each core generates LOCK# signals frequently to monopolize the memory bus, so that the rest of the cores cannot access the memory, resulting in a significant performance degradation?

Performance degradation caused by bus locks

1.3 Cache locking

INTEL introduced the cache locking mechanism on post-P6 processors in order to optimize the performance problems caused by bus locks: the cache coherency protocol guarantees atomicity and consistency of multiple accesses to memory addresses across cache lines by multiple CPU cores without locking the memory bus.

1.3.1 MESI Protocol

Let’s start with a brief introduction to the cache coherency protocol with the common MESI, which is divided into four states:

Modified (M) The cache line is dirty and has a different value than the main memory. If another CPU core wants to read this data from main memory, the cache line must be written back to main memory and the state becomes shared (S).
Exclusive (E) The cache line is only in the current cache, but is clean - the cache data is the same as the main memory data. When another cache reads it, the state changes to shared; when the current data is written, the state changes to modified.
Shared (S) The cache line is also present in other caches and is clean. Cache lines can be discarded at any time.
Invalid (I) The cache line is invalid.

The MESI protocol state machine is as follows.

MESI protocol state machine

The transition of the state machine is based on two cases.

the CPU generates a request for a cache
- a. PrRd: CPU requests to read a cache block
- b. PrWr: CPU requests to write a cache block
The bus generates requests to the cache
- a. BusRd: A snooper request indicates that another processor is requesting to read a cache block
- b. BusRdX: A snooper request indicates that another processor is requesting to write a cache block that the processor does not own
- c. BusUpgr: The snooper request indicates that another processor is requesting to write a cache block owned by that processor
- d. Flush: The snooper request indicates that the entire cache is requested to be written back to main memory
- e. FlushOpt: The snooper request indicates that the entire cache block is sent to the bus to be sent to another processor (cache-to-cache copy)

Simply put, through the MESI protocol, each CPU not only knows its own read and write operations to the cache, but also performs bus sniffing (snooping) to know the read and write operations of other CPUs to the cache, so in addition to its own modifications to the cache, it will also change the state of the cache according to the modifications of other CPUs to the cache.

1.3.2 Cache Locking Principle

Cache locking relies on the cache coherency protocol to ensure atomicity of memory accesses, because the cache coherency protocol prevents memory addresses cached by multiple CPUs from being modified by multiple CPUs at the same time.

Here is an example of how cache locking achieves atomicity of memory reads and writes based on the MESI protocol.

Let’s assume that there are two CPU Cores, CoreA and CoreB, for the analysis.

Cache Locking Principle

Note the last operation step 4, after CoreB modifies the data in cache, when CoreA wants to modify it again, it will be sniffed by CoreB, and CoreA will modify it only after CoreB’s data is synchronized to main memory and CoreA.

We can see that the data modified by CoreB is not lost and is synchronized to CoreA and main memory. And the above operation is implemented without locking the memory bus, only CoreA’s modification is blocked for a while, which is manageable compared to locking the whole memory bus.

The above is a relatively simple case, where the writes to both CPU Cores are serial. What if CoreA and CoreB send write requests at the same time after operation step 2? Will the cache of both Cores enter the M state?

The answer is no. The MESI protocol ensures that the above scenario of simultaneous M does not happen. According to the MESI protocol, a Core’s PrWr operation can only be freely executed when its cache is in state M or E. If it is in state S, the other Core’s cache must first be set to state I. This is done through a bus broadcast called Request For Ownership (RFO), which is a bus transaction. If two Cores broadcast RFO to the bus at the same time and both want to invalidate each other’s cache, the bus will arbitrate and the final result will be that only one Core will succeed in broadcasting and the other Core will fail and its cache will be set to I state. So we can see that with the introduction of the cache layer, atomic operations are now implemented by bus arbitration instead of locking the memory bus.

If the LOCK instruction prefix is declared, then the corresponding cache address will be locked by the bus. In the above example, other Cores will wait until the end of instruction execution before accessing it, which means that it becomes a serial operation and achieves atomicity for cache reads and writes.

So, to summarize, cache lock: if the LOCK instruction prefix is declared in front of a code instruction to access memory data atomically, and if the memory data can be cached in the CPU’s cache, the runtime usually does not generate the LOCK# signal on the bus, but prevents two or more CPU cores from accessing the same address concurrently through the cache coherency protocol, bus arbitration mechanism and cache locking. The cache coherency protocol, bus arbitration mechanism and cache locking are used to prevent two or more CPU cores from accessing the same address concurrently.

So can all bus locks be optimized for cache locks? The answer is no. The case that cannot be optimized is split lock.

1.4 Split lock

Since the granularity of the cache coherency protocol is a cache line, when the data of an atomic operation crosses a cache line, the cache locking mechanism cannot guarantee data consistency and degenerates to a bus lock to ensure consistency, which is split lock.

For example, there is the following data structure.

struct Data {
   char padding[62]; // 62字节
    int32_t value;  // 4字节
} __attribute__((packed)) // 按实际字节对齐

When cached in a cache line of 64 bytes in size, the value member will span the cache line.

cache line

At this point, if you want to manipulate the value members of the Data structure by the LOCK ADD instruction, you cannot solve it by cache locking, and you have to go the old way, locking the bus to ensure data consistency.

Locking the bus causes a serious performance degradation, increasing access latency by a factor of 100, and in memory-intensive operations, performance drops by two orders of magnitude. So in modern X86 processors, it is important to avoid writing code that generates split lock and to have the ability to detect split lock generation.

2. Avoid Split lock

Recall the conditions under which Split lock is generated.

atomic access to the data is performed
the data to be accessed is stored across a cache line in the cache

Since atomic operations are relatively basic, let’s take data stored across a cache line as an intervention point.

If the data is stored in only one cache line, the problem can be solved.

2.1 Compiler optimizations

The GCC feature __attribute__((packed)), used in our previous data structure, indicates that no memory alignment optimization is performed.

If __attribute__((packed)) is not introduced, when using memory alignment optimizations, the compiler pads the memory data, for example by filling in 2 bytes after the padding so that the memory address of value can be divided by 4 bytes, thus achieving alignment. When the value is cached in the cache, it will not cross the cache line.

cache line

Why did the compiler introduce __attribute__((packed)) when it can be optimized to avoid cross-cache line accesses through memory alignment?

This is because there are benefits to forcing alignment by data structure through __attribute__((packed)). For example, network communication based on data structures, no need to fill extra bytes, etc.

2.2 Precautions

We have the following points to keep in mind when writing code.

Use the compiler’s memory alignment optimization as much as possible when available.
When compiler optimization is not available, consider the size and declaration order of structure members.
When generating potentially unaligned memory accesses, try not to use atomic instructions to do so.

3. Split lock detection and handling

3.1 Usage scenarios

hard real-time system: When a hard real-time application runs on some cores and another common program runs on other cores, the common program can generate bus lock to break the hard real-time requirement.
cloud computing: multi-tenant running on a physical machine, bus lock can be generated within a virtual machine to interfere with the performance of other virtual machines.

The following analysis is mainly for the cloud environment, from the bottom up.

3.2 Hardware detection support

An Alignment Check (#AC) exception is generated when a split lock operation is attempted, and a Debug(#DB) trap is generated when a bus lock is acquired and executed.

Hardware distinguishes between split lock and bus lock here.

split lock is an atomic operation where the operand spans two cache lines.
bus lock can be generated in two cases, either split lock for writeback memory, or any lock operation for non-writeback memory

Conceptually, split lock is a kind of bus lock, split lock tends to access across cache lines, bus lock tends to lock bus operations.

Whether or not the corresponding exceptions are generated when split lock and bus lock occur can be controlled by specific registers, the following are the related control registers.

MSR_MEMORY_CTRL/MSR_TEST_CTRL: 33H Bit 29 of this MSR controls the #AC exception caused by split lock.
IA32_DEBUGCTL: bit 2 of this MSR, controls the #DB exception caused by bus lock.

3.3 Kernel processing support

Take v5.17 as an example for analysis. The current version of the kernel supports a related startup parameter split_lock_detect with the following configuration items and corresponding functions.

split_lock_detect

The implementation of split_lock_detect is mainly divided into 3 parts: configuration, initialization, and processing, and we analyze the source code item by item below.

3.3.1 Configuration

-> start_kernel
    -> sld_setup
        -> split_lock_setup
            -> __split_lock_setup
        -> sld_state_setup

When the kernel starts, it will do the setup of sld(split lock detect) first.

static void __init __split_lock_setup(void)
{
        if (!split_lock_verify_msr(false)) {
                pr_info("MSR access failed: Disabled\n");
                return;
        }

        rdmsrl(MSR_TEST_CTRL, msr_test_ctrl_cache);

        if (!split_lock_verify_msr(true)) {
                pr_info("MSR access failed: Disabled\n");
                return;
        }

        /* Restore the MSR to its cached value. */
        wrmsrl(MSR_TEST_CTRL, msr_test_ctrl_cache);

        setup_force_cpu_cap(X86_FEATURE_SPLIT_LOCK_DETECT);
}

static bool split_lock_verify_msr(bool on)
{
        u64 ctrl, tmp;

        if (rdmsrl_safe(MSR_TEST_CTRL, &ctrl))
                return false;
        if (on)
                ctrl |= MSR_TEST_CTRL_SPLIT_LOCK_DETECT;
        else
                ctrl &= ~MSR_TEST_CTRL_SPLIT_LOCK_DETECT;
        if (wrmsrl_safe(MSR_TEST_CTRL, ctrl))
                return false;
        rdmsrl(MSR_TEST_CTRL, tmp);
        return ctrl == tmp;
}

#define MSR_TEST_CTRL               0x00000033

__split_lock_setup tries to enable/disable 33H MSR for verify, and ends without enabling split lock #AC exception, leaving only a global variable msr_test_ctrl_cache to be used as a cache for later manipulation of this MSR.

static void __init sld_state_setup(void)
{
        enum split_lock_detect_state state = sld_warn; // 默认配置
        char arg[20];
        int i, ret;

        if (!boot_cpu_has(X86_FEATURE_SPLIT_LOCK_DETECT) &&
            !boot_cpu_has(X86_FEATURE_BUS_LOCK_DETECT))
                return;

        ret = cmdline_find_option(boot_command_line, "split_lock_detect",
                                  arg, sizeof(arg));
        if (ret >= 0) {
                for (i = 0; i < ARRAY_SIZE(sld_options); i++) {
                        if (match_option(arg, ret, sld_options[i].option)) {
                                state = sld_options[i].state;
                                break;
                        }
                }
        }
        sld_state = state;
}

static inline bool match_option(const char *arg, int arglen, const char *opt)
{
        int len = strlen(opt), ratelimit;

        if (strncmp(arg, opt, len))
                return false;

        /*
         * Min ratelimit is 1 bus lock/sec.
         * Max ratelimit is 1000 bus locks/sec.
         */
        if (sscanf(arg, "ratelimit:%d", &ratelimit) == 1 &&
            ratelimit > 0 && ratelimit <= 1000) {
                ratelimit_state_init(&bld_ratelimit, HZ, ratelimit);
                ratelimit_set_flags(&bld_ratelimit, RATELIMIT_MSG_ON_RELEASE);
                return true;
        }

        return len == arglen;
}

What sld_state_setup does is resolve the configuration of the kernel boot parameter split_lock_detect (you can see that the default configuration is at the warn level). In case of ratelimit configuration, a bld_ratelimit global variable is initialized for the handle stage using the kernel’s ratelimit library.

3.3.2 Initialization

-> start_kernel
    -> init_intel
        -> split_lock_init
        -> bus_lock_init

After setup completes the basic verify and get the boot parameters configured, a hardware enbale operation is attempted.

3.3.2.1 split lock init

static void split_lock_init(void)
{
        /*
         * #DB for bus lock handles ratelimit and #AC for split lock is
         * disabled.
         */
        if (sld_state == sld_ratelimit) {
                split_lock_verify_msr(false);
                return;
        }

        if (cpu_model_supports_sld)
                split_lock_verify_msr(sld_state != sld_off);
}

The init of split lock disables hardware detection for split lock if the configured parameter is found to be ratelimit. Other non-off parameters (warn, fatal) will enable the hardware.

3.3.2.2 bus lock init

static void bus_lock_init(void)
{
        u64 val;

        /*
         * Warn and fatal are handled by #AC for split lock if #AC for
         * split lock is supported.
         */
        if (!boot_cpu_has(X86_FEATURE_BUS_LOCK_DETECT) ||
            (boot_cpu_has(X86_FEATURE_SPLIT_LOCK_DETECT) &&
            (sld_state == sld_warn || sld_state == sld_fatal)) ||
            sld_state == sld_off)
                return;

        /*
         * Enable #DB for bus lock. All bus locks are handled in #DB except
         * split locks are handled in #AC in the fatal case.
         */
        rdmsrl(MSR_IA32_DEBUGCTLMSR, val);
        val |= DEBUGCTLMSR_BUS_LOCK_DETECT;
        wrmsrl(MSR_IA32_DEBUGCTLMSR, val);
}

#define DEBUGCTLMSR_BUS_LOCK_DETECT (1UL <<  2)

In the init of bus lock, if the CPU does not support bus lock, the hardware detection of bus lock will not be enabled, and if the CPU supports both split lock and bus lock hardware detection and the configuration parameter is still warn or fatal, the hardware will not be enabled.

So from the code, there are two cases to enable CPU bus lock detection: one is when the CPU supports bus lock detection and the configuration parameter is specified as ratelimt, and the other is when the CPU does not support split lock detection but supports bus lock detection and the configuration parameter is not off.

3.3.3 Processing

3.3.3.1 split lock handle

DEFINE_IDTENTRY_ERRORCODE(exc_alignment_check)
{
        char *str = "alignment check";

        if (notify_die(DIE_TRAP, str, regs, error_code, X86_TRAP_AC, SIGBUS) == NOTIFY_STOP)
                return;

        if (!user_mode(regs)) // 如果不是用户态
                die("Split lock detected\n", regs, error_code);

        local_irq_enable();

        if (handle_user_split_lock(regs, error_code))
                goto out;

        do_trap(X86_TRAP_AC, SIGBUS, "alignment check", regs,
                error_code, BUS_ADRALN, NULL);

out:
        local_irq_disable();
}


bool handle_user_split_lock(struct pt_regs *regs, long error_code)
{
        if ((regs->flags & X86_EFLAGS_AC) || sld_state == sld_fatal)
                return false;
        split_lock_warn(regs->ip);
        return true;
}

static void split_lock_warn(unsigned long ip)
{
        pr_warn_ratelimited("#AC: %s/%d took a split_lock trap at address: 0x%lx\n",
                            current->comm, current->pid, ip);

        /*
         * Disable the split lock detection for this task so it can make
         * progress and set TIF_SLD so the detection is re-enabled via
         * switch_to_sld() when the task is scheduled out.
         */
        sld_update_msr(false);
        set_tsk_thread_flag(current, TIF_SLD);
}

When an #AC exception is generated, if it is not in the user state, it will call die and enter the panic process.

If it is in user state, then configured as fatal, it will send SIGBUS signal to the current user state process, and the user state process will be killed if it does not catch SIGBUS manually.

If the user state is configured as warn, it prints a warning log and outputs the current process information, disable split lock detection, and indicate that the process has been detected once by setting the TIF_SLD bit of the current process flags.

-> context_switch
    -> __switch_to_xtra

void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p)
{
        unsigned long tifp, tifn;

        tifn = read_task_thread_flags(next_p);
        tifp = read_task_thread_flags(prev_p);

        if ((tifp ^ tifn) & _TIF_SLD)
                switch_to_sld(tifn);
}

void switch_to_sld(unsigned long tifn)
{
        sld_update_msr(!(tifn & _TIF_SLD));
}

When context_switch is performed, if the TIF_SLD bits in the flags of the prev process and the next process are different, then a split lock detect switch is performed. The switch is based on the TIF_SLD bit of the next process.

For example, after process A on CPU 0 triggers a split lock warning detection, the split lock detection on CPU 0 is disable to avoid frequent warning logs, and the flags of process A are set to TIF_SLD, and after process A finishes execution, it switches to process B running on CPU 0. The flags of process A and B have different TIF_SLD bits, so we can enable split lock detection according to the flags of process B. When process B finishes execution, it will disable split lock again when it switches to process A again.

The above mechanism achieves the requirement that each process only warn once.

3.3.3.2 bus lock handle

static __always_inline void exc_debug_user(struct pt_regs *regs,
                                           unsigned long dr6)
{
        /* #DB for bus lock can only be triggered from userspace. */
        if (dr6 & DR_BUS_LOCK)
                handle_bus_lock(regs);
}

void handle_bus_lock(struct pt_regs *regs)
{
        switch (sld_state) {
        case sld_off:
                break;
        case sld_ratelimit:
                /* Enforce no more than bld_ratelimit bus locks/sec. */
                while (!__ratelimit(&bld_ratelimit))
                        msleep(20);
                /* Warn on the bus lock. */
                fallthrough;
        case sld_warn:
                pr_warn_ratelimited("#DB: %s/%d took a bus_lock trap at address: 0x%lx\n",
                                    current->comm, current->pid, regs->ip);
                break;
        case sld_fatal:
                force_sig_fault(SIGBUS, BUS_ADRALN, NULL);
                break;
        }
}

After the #DB trap is generated, the flow of warn and fatal is basically similar to #AC, we focus here on ratelimit, which is also a strong requirement for the introduction of bus lock.

Forcing the frequency of bus lock generation down to the configured ratelimit. The principle is that if the frequency exceeds the setting, it sleeps for 20 ms until the frequency drops (the frequency here is the frequency at which the whole system generates bus lock).

After the frequency drop, a warning log is generated in the sld_warn case via the compiler’s fallthrough stream.

4. Detection and handling of virtualized environments

The previous analysis was done for the kernel and user state programs (VMX root mode) on the physical machine. In a virtualized environment, we need to consider some more issues, such as how to detect split lock if it comes from the Guest? How to avoid the impact on other Guests? If you enable bus lock ratelimit directly on the Host, it may affect guests who are not ready for it, and if you expose the split lock detection switch directly to the Guest, what happens to the Host or other Guests, etc.

The CPU can support #AC trap for split lock detection in VMX mode, and it is up to the hypervisor to decide what to do later. Most hypervisors forward the trap directly to the Guest, which can cause a crash if the Guest is not ready, as was the case in previous versions of VMX.

So the first step is to get the hypervisor to handle traps correctly.

4.1 Processing flow of virtualized environment

The following is the overall processing flow for split lock and bus lock in a virtualized environment.

overall processing flow for split lock and bus lock in a virtualized environment

If the hardware does not support split lock detection or is legacy #AC, the #AC will be injected to the Guest for processing, if the hardware supports detection, then a warning will be generated according to the configuration, or even a SIGBUS will be attempted.

After Guest performs bus lock, it will vm exit to kvm, which will notify QEMU and the vCPU thread will actively sleep to downscale.

Let’s analyze this from the bottom up.

4.2 Hardware support

4.2.1 split lock

#AC exception

The #AC exception is included in the NMI exit reason.

4.2.2 bus lock

bus lock VM exit

In VMX non-root mode, the CPU detects the bus lock and generates a VM exit with a reason of 74.

4.3 KVM Support

-> vmx_handle_exit
    -> __vmx_handle_exit
        -> kvm_vmx_exit_handlers
            -> handle_exception_nmi // split lock
            -> handle_bus_lock_vmexit // bus lock

4.3.1 split lock

static int handle_exception_nmi(struct kvm_vcpu *vcpu)
{
        switch (ex_no) {
        case AC_VECTOR:
                if (vmx_guest_inject_ac(vcpu)) {
                        kvm_queue_exception_e(vcpu, AC_VECTOR, error_code);
                        return 1;
                }

                /*
                 * Handle split lock. Depending on detection mode this will
                 * either warn and disable split lock detection for this
                 * task or force SIGBUS on it.
                 */
                if (handle_guest_split_lock(kvm_rip_read(vcpu)))
                        return 1;
}

// Guest handle
bool vmx_guest_inject_ac(struct kvm_vcpu *vcpu)
{
        if (!boot_cpu_has(X86_FEATURE_SPLIT_LOCK_DETECT))
                return true;

        return vmx_get_cpl(vcpu) == 3 && kvm_read_cr0_bits(vcpu, X86_CR0_AM) &&
               (kvm_get_rflags(vcpu) & X86_EFLAGS_AC);
}

// KVM handle
bool handle_guest_split_lock(unsigned long ip)
{
        if (sld_state == sld_warn) {
                split_lock_warn(ip);
                return true;
        }

        pr_warn_once("#AC: %s/%d %s split_lock trap at address: 0x%lx\n",
                     current->comm, current->pid,
                     sld_state == sld_fatal ? "fatal" : "bogus", ip);

        current->thread.error_code = 0;
        current->thread.trap_nr = X86_TRAP_AC;
        force_sig_fault(SIGBUS, BUS_ADRALN, NULL);
        return false;
}

When a split lock operation is generated inside the Guest, the VM will exit because it is an #AC exception.

First of all, there are two types of #AC exceptions themselves.

legacy #AC exception
split lock #AC exception

KVM eventually produces two behaviors based on Host and Guest states.

Let the Guest process: Inject #AC into the Guest.
- If the hardware does not support split lock detection, inject into Guest unconditionally.
- If the Host enables split lock detection, it will only be injected into the Guest if the #AC exception is generated for the Guest user state and the legacy #AC.
Let HOST handle it: generate warnings or even send SIGBUS.
- When not injected into Guest, Host will only warn once per vCPU thread if it is configured to warn, or generate SIGBUS if it is configured to fatal.

4.3.2 bus lock

#define EXIT_REASON_BUS_LOCK            74

static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
    [EXIT_REASON_BUS_LOCK]                = handle_bus_lock_vmexit,
}

static int handle_bus_lock_vmexit(struct kvm_vcpu *vcpu)
{
        /*
         * Hardware may or may not set the BUS_LOCK_DETECTED flag on BUS_LOCK
         * VM-Exits. Unconditionally set the flag here and leave the handling to
         * vmx_handle_exit().
         */
        to_vmx(vcpu)->exit_reason.bus_lock_detected = true;
        return 1;
}

union vmx_exit_reason {
        struct {
                u32        basic                        : 16;
                u32        reserved16                : 1;
                u32        reserved17                : 1;
                u32        reserved18                : 1;
                u32        reserved19                : 1;
                u32        reserved20                : 1;
                u32        reserved21                : 1;
                u32        reserved22                : 1;
                u32        reserved23                : 1;
                u32        reserved24                : 1;
                u32        reserved25                : 1;
                u32        bus_lock_detected        : 1;
                u32        enclave_mode                : 1;
                u32        smi_pending_mtf                : 1;
                u32        smi_from_vmx_root        : 1;
                u32        reserved30                : 1;
                u32        failed_vmentry                : 1;
        };
        u32 full;
};

After VM exit, the function handle_bus_lock_vmexit is executed according to the reason 74 index, which simply sets bus_lock_detected of exit_reason to bit.

static int vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
{
        int ret = __vmx_handle_exit(vcpu, exit_fastpath);

        /*
         * Exit to user space when bus lock detected to inform that there is
         * a bus lock in guest.
         */
        if (to_vmx(vcpu)->exit_reason.bus_lock_detected) {
                if (ret > 0)
                        vcpu->run->exit_reason = KVM_EXIT_X86_BUS_LOCK;

                vcpu->run->flags |= KVM_RUN_X86_BUS_LOCK;
                return 0;
        }
        return ret;
}

After executing __vmx_handle_exit. The exit_reason and flags returned to the user state are set after the bus_lock_detected is detected.

4.4 QEMU support

Using the v6.2.0 version as an example for analysis.

-> kvm_cpu_exec
    -> kvm_vcpu_ioctl
    -> kvm_arch_post_run
    -> kvm_arch_handle_exit

int kvm_arch_handle_exit(CPUState *cs, struct kvm_run *run)
{
    switch (run->exit_reason) {
        case KVM_EXIT_X86_BUS_LOCK:
            /* already handled in kvm_arch_post_run */
            ret = 0;
            break;
    }
}

MemTxAttrs kvm_arch_post_run(CPUState *cpu, struct kvm_run *run) {
    if (run->flags & KVM_RUN_X86_BUS_LOCK) {
        kvm_rate_limit_on_bus_lock();
    }
}

static void kvm_rate_limit_on_bus_lock(void)
{
    uint64_t delay_ns = ratelimit_calculate_delay(&bus_lock_ratelimit_ctrl, 1);

    if (delay_ns) {
        g_usleep(delay_ns / SCALE_US);
    }
}

QEMU learns from the KVM return value that bus lock is generated in the Guest and enters kvm_rate_limit_on_bus_lock, which also implements ratelimit through sleep to reduce the frequency of bus lock generated by the Guest.

The ratelimit on the Host is controlled by the split_lock_detect startup parameter, but what about the Guest?

int kvm_arch_init(MachineState *ms, KVMState *s) {
    if (object_dynamic_cast(OBJECT(ms), TYPE_X86_MACHINE)) {
        X86MachineState *x86ms = X86_MACHINE(ms);

        if (x86ms->bus_lock_ratelimit > 0) {
            ret = kvm_check_extension(s, KVM_CAP_X86_BUS_LOCK_EXIT);
            if (!(ret & KVM_BUS_LOCK_DETECTION_EXIT)) {
                error_report("kvm: bus lock detection unsupported");
                return -ENOTSUP;
            }
            ret = kvm_vm_enable_cap(s, KVM_CAP_X86_BUS_LOCK_EXIT, 0,
                                    KVM_BUS_LOCK_DETECTION_EXIT);
            if (ret < 0) {
                error_report("kvm: Failed to enable bus lock detection cap: %s",
                             strerror(-ret));
                return ret;
            }
            ratelimit_init(&bus_lock_ratelimit_ctrl);
            ratelimit_set_speed(&bus_lock_ratelimit_ctrl,
                                x86ms->bus_lock_ratelimit, BUS_LOCK_SLICE_TIME);
        }
    }
}

static void x86_machine_get_bus_lock_ratelimit(Object *obj, Visitor *v,
                                const char *name, void *opaque, Error **errp)
{
    X86MachineState *x86ms = X86_MACHINE(obj);
    uint64_t bus_lock_ratelimit = x86ms->bus_lock_ratelimit;

    visit_type_uint64(v, name, &bus_lock_ratelimit, errp);
}

static void x86_machine_class_init(ObjectClass *oc, void *data)
{
    object_class_property_add(oc, X86_MACHINE_BUS_LOCK_RATELIMIT, "uint64_t",
                                x86_machine_get_bus_lock_ratelimit,
                                x86_machine_set_bus_lock_ratelimit, NULL, NULL);
}

#define X86_MACHINE_BUS_LOCK_RATELIMIT  "bus-lock-ratelimit"

QEMU ratelimit for Guest bus lock is obtained from the boot parameter bus-lock-ratelimit.

4.5 Libvirt Support

Libvirt supports QEMU’s bus-lock-ratelimit startup parameter which is currently not available from upsteam:

https://listman.redhat.com/archives/libvir-list/2021-December/225755.html

5. Summary

Due to the nature of X86 hardware, which supports atomic semantics across cache lines, implementation requires split lock to maintain atomicity, but this comes at the cost of system-wide access performance. Developers gradually realize that this feature is harmful and try not to use it, for example, kernel developers ensure that they do not generate split lock, even at the expense of kernel panic, while user-state programs generate warnings and then reduce the execution frequency of the program or have the kernel kill it. The virtualized environment has a large software stack, so it is handled by the host-side KVM and QEMU as much as possible to alert or even kill the virtual machine, or to notify QEMU to downscale.

Table of Contents