Who knew there was so much to averaging two unsigned integers?

A recent article by Raymond Chen, a Microsoft engineer, has been a hit across technology platforms and sparked numerous discussions.

Countless people clicked through full of confidence: isn’t this just an elementary-school programming exercise, add the two numbers and divide by two?

unsigned average(unsigned a, unsigned b)
{
    return (a + b) / 2;
}

But the further they read, the more surprised they became...

Finding the average is not so simple

Start with the method mentioned at the beginning, the one any elementary school student would know. This simple method has a fatal flaw.

If unsigned integers are 32 bits wide, the addition itself overflows whenever the true sum does not fit in 32 bits, for example when both values are 0x80000000 (half of 2^32). The carry out of the top bit is lost in the very first step, before the division happens.

That is, average(0x80000000U, 0x80000000U) = 0.
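To see the wraparound concretely, here is a minimal sketch that pins the width with `uint32_t` (an assumption: the article's `unsigned` is only 32 bits on typical platforms). The name `average_naive` is made up for illustration.

```c
#include <stdint.h>

/* Naive average with the width fixed at 32 bits.
   The sum is reduced modulo 2^32, so a carry out of bit 31 is lost. */
uint32_t average_naive(uint32_t a, uint32_t b)
{
    uint32_t sum = a + b;   /* wraps when a + b >= 2^32 */
    return sum / 2;
}
```

With this definition, `average_naive(1, 3)` gives 2 as expected, while `average_naive(0x80000000u, 0x80000000u)` and `average_naive(0xFFFFFFFFu, 1u)` both give 0, because the sum wraps to 0 before it is halved.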

There are quite a few solutions, though. The first one most experienced developers think of is to limit the magnitude of the operands before adding, so that the sum cannot overflow.

There are two specific methods.

  1. When you know which of the two unsigned integers is larger, subtract the smaller from the larger, halve the difference, and add it to the smaller value, so that no intermediate result leaves the range.

    unsigned average(unsigned low, unsigned high)
    {
        return low + (high - low) / 2;
    }
    
  2. Halve both unsigned integers first, then use & to correct the lowest bit so that the result is still right when both integers are odd.

    (Incidentally, this method was patented; the patent expired in 2016.)

    unsigned average(unsigned a, unsigned b)
    {
        return (a / 2) + (b / 2) + (a & b & 1);
    }
    

Both of these are fairly common ideas, and many readers said the first thing that came to their mind was the method whose patent expired in 2016.
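The two methods can be sanity-checked against a widened reference. The sketch below is only an illustration: the names `average_ordered`, `average_halves` and `average_reference` are hypothetical, and `uint32_t`/`uint64_t` are assumed in order to fix the widths.

```c
#include <assert.h>
#include <stdint.h>

/* Method 1: requires low <= high, otherwise (high - low) wraps around. */
uint32_t average_ordered(uint32_t low, uint32_t high)
{
    assert(low <= high);
    return low + (high - low) / 2;
}

/* Method 2: halve first, then add back the 1 that is lost
   when both inputs are odd. */
uint32_t average_halves(uint32_t a, uint32_t b)
{
    return (a / 2) + (b / 2) + (a & b & 1);
}

/* Reference: widen to 64 bits so the sum cannot overflow. */
uint32_t average_reference(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a + b) / 2);
}
```

For example, `average_halves(3, 5)` computes 1 + 2 + 1 = 4, and `average_ordered(0x80000000u, 0x80000000u)` returns 0x80000000 instead of wrapping to 0. Note that method 1 silently gives a wrong answer if the arguments are passed in the wrong order, which is why the sketch guards it with an assertion.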

Another method that many readers came up with quickly is SWAR (SIMD within a register).

unsigned average(unsigned a, unsigned b)
{
    return (a & b) + (a ^ b) / 2; // relies on a + b == (a ^ b) + (a & b) * 2
}

C++20 also provides the std::midpoint function.
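The bitwise version works because `a & b` marks the bit positions where both inputs have a 1 (each such bit counts twice in the sum), while `a ^ b` marks the positions where exactly one input has a 1. So a + b equals (a ^ b) + 2 * (a & b), and halving gives (a & b) + (a ^ b) / 2. As a rough check, the sketch below narrows the trick to 8 bits so that every input pair can be tested; the function names are invented for this example.

```c
#include <stdint.h>

/* The bitwise average, narrowed to 8 bits so that all input pairs
   can be checked exhaustively. */
uint8_t average_swar8(uint8_t a, uint8_t b)
{
    return (uint8_t)((a & b) + (a ^ b) / 2);
}

/* Returns 1 if average_swar8 matches floor((a + b) / 2)
   for all 65536 pairs of 8-bit inputs, 0 otherwise. */
int swar8_matches_all_pairs(void)
{
    for (unsigned a = 0; a < 256; a++) {
        for (unsigned b = 0; b < 256; b++) {
            unsigned expected = (a + b) / 2;   /* cannot overflow here */
            /* The identity behind the trick. */
            if ((a ^ b) + 2 * (a & b) != a + b)
                return 0;
            if (average_swar8((uint8_t)a, (uint8_t)b) != expected)
                return 0;
        }
    }
    return 1;
}
```

As a worked case, 200 and 100 give `a & b` = 64 and `a ^ b` = 172, so the result is 64 + 86 = 150.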

Next, the author proposes a second idea.

If unsigned integers are 32 bits and the native register size is 64 bits, or if the compiler supports multi-word arithmetic, the operands can be widened to a longer integer type before they are added.

unsigned average(unsigned a, unsigned b)
{
    // Suppose "unsigned" is a 32-bit type and
    // "unsigned long long" is a 64-bit type.
    return ((unsigned long long)a + b) / 2;
}

However, there is one particular point to note here.

You must ensure that the upper 32 bits of each 64-bit register are all zeros, so that they do not contaminate the 32-bit values in the lower half.
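A small C simulation shows what goes wrong when the upper halves are not zero. This is only a model, not real register code: `sign_extended_reg`, `average_regs` and `average_regs_masked` are hypothetical helpers, and a `uint64_t` stands in for a 64-bit register.

```c
#include <stdint.h>

/* Model a 64-bit register holding a 32-bit value in sign-extended form:
   the upper 32 bits are copies of bit 31. */
uint64_t sign_extended_reg(uint32_t v)
{
    return (v & 0x80000000u) ? (0xFFFFFFFF00000000ULL | v) : (uint64_t)v;
}

/* 64-bit add and shift on the raw register contents,
   keeping only the low 32 bits of the result. */
uint32_t average_regs(uint64_t ra, uint64_t rb)
{
    return (uint32_t)((ra + rb) >> 1);
}

/* Same computation, but with the upper halves cleared first. */
uint32_t average_regs_masked(uint64_t ra, uint64_t rb)
{
    ra &= 0xFFFFFFFFULL;
    rb &= 0xFFFFFFFFULL;
    return (uint32_t)((ra + rb) >> 1);
}
```

For a = 0x80000000 and b = 0, the sign-extended registers hold 0xFFFFFFFF80000000 and 0. The unmasked version shifts a stray 1 from the upper half into bit 31 and returns 0xC0000000, while the masked version returns the correct 0x40000000.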

Architectures such as x86-64 and AArch64 automatically zero-extend 32-bit values to 64-bit values.

// x86-64: Assume ecx = a, edx = b, upper 32 bits unknown
    mov     eax, ecx        ; rax = ecx zero-extended to 64-bit value
    mov     edx, edx        ; rdx = edx zero-extended to 64-bit value
    add     rax, rdx        ; 64-bit addition: rax = rax + rdx
    shr     rax, 1          ; 64-bit shift:    rax = rax >> 1
                            ;                  result is zero-extended
                            ; Answer in eax

// AArch64 (ARM 64-bit): Assume w0 = a, w1 = b, upper 32 bits unknown
    uxtw    x0, w0          ; x0 = w0 zero-extended to 64-bit value
    uxtw    x1, w1          ; x1 = w1 zero-extended to 64-bit value
    add     x0, x1          ; 64-bit addition: x0 = x0 + x1
    ubfx    x0, x0, 1, 32   ; Extract bits 1 through 32 from result
                            ; (shift + zero-extend in one instruction)
                            ; Answer in x0

In contrast, architectures such as Alpha AXP and MIPS64 sign-extend 32-bit values to 64-bit values.

In those cases, extra instructions are needed to zero the upper bits. On Power64, for example, the rldicl instruction (rotate left doubleword immediate then clear left) can do the shift and the zero-extension in one step.

// Alpha AXP: Assume a0 = a, a1 = b, both in canonical form
    insll   a0, #0, a0      ; a0 = a0 zero-extended to 64-bit value
    insll   a1, #0, a1      ; a1 = a1 zero-extended to 64-bit value
    addq    a0, a1, v0      ; 64-bit addition: v0 = a0 + a1
    srl     v0, #1, v0      ; 64-bit shift:    v0 = v0 >> 1
    addl    zero, v0, v0    ; Force canonical form
                            ; Answer in v0

// MIPS64: Assume a0 = a, a1 = b, sign-extended
    dext    a0, a0, 0, 32   ; Zero-extend a0 to 64-bit value
    dext    a1, a1, 0, 32   ; Zero-extend a1 to 64-bit value
    daddu   v0, a0, a1      ; 64-bit addition: v0 = a0 + a1
    dsrl    v0, v0, #1      ; 64-bit shift:    v0 = v0 >> 1
    sll     v0, #0, v0      ; Sign-extend result
                            ; Answer in v0

// Power64: Assume r3 = a, r4 = b, zero-extended
    add     r3, r3, r4      ; 64-bit addition: r3 = r3 + r4
    rldicl  r3, r3, 63, 32  ; Extract bits 63 through 32 from result
                            ; (shift + zero-extend in one instruction)
                            ; result in r3

Another option is to use SIMD registers, which are larger than the native registers. Of course, moving values from general-purpose registers to SIMD registers and back adds overhead of its own.

If the processor supports addition with carry, a third idea can be used.

In this case, if the register size is n bits, the sum of two n-bit unsigned integers can be treated as an (n+1)-bit value whose top bit lives in the carry flag. An RCR (rotate right through carry) instruction then produces the correct average without losing the overflowed bit.
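Without direct access to the carry flag, the same (n+1)-bit interpretation can be emulated in portable C. The sketch below assumes 32-bit operands via `uint32_t` and recovers the carry by checking whether the sum wrapped; `average_with_carry` is an illustrative name, not something from the original article.

```c
#include <stdint.h>

/* Treat a + b as a 33-bit value: 32 bits in `sum` plus one carry bit,
   then shift the whole 33-bit value right by one. */
uint32_t average_with_carry(uint32_t a, uint32_t b)
{
    uint32_t sum = a + b;              /* low 32 bits of the true sum */
    uint32_t carry = (sum < a);        /* 1 if the addition wrapped */
    return (sum >> 1) | (carry << 31); /* carry becomes the new top bit */
}
```

For the failing case from the beginning, `average_with_carry(0x80000000u, 0x80000000u)` has `sum` = 0 and `carry` = 1, so it returns 0x80000000 as it should.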

// x86-32
    mov     eax, a
    add     eax, b          ; Add, overflow goes into carry bit
    rcr     eax, 1          ; Rotate right one place through carry

// x86-64
    mov     rax, a
    add     rax, b          ; Add, overflow goes into carry bit
    rcr     rax, 1          ; Rotate right one place through carry

// 32-bit ARM (A32)
    mov     r0, a
    adds    r0, b           ; Add, overflow goes into carry bit
    rrx     r0              ; Rotate right one place through carry

// SH-3
    clrt                    ; Clear T flag
    mov     a, r0
    addc    b, r0           ; r0 = r0 + b + T, overflow goes into T bit
    rotcr   r0              ; Rotate right one place through carry

What if the processor does not support rotating right through the carry?

You can use a rotation intrinsic instead.

unsigned average(unsigned a, unsigned b)
{
#if defined(_MSC_VER)
    unsigned sum;
    auto carry = _addcarry_u32(0, a, b, &sum);
    sum = (sum & ~1) | carry;
    return _rotr(sum, 1);
#elif defined(__clang__)
    unsigned carry;
    auto sum = __builtin_addc(a, b, 0, &carry); // sum must be computed first
    sum = (sum & ~1) | carry;                   // put the carry into bit 0
    return __builtin_rotateright32(sum, 1);     // rotate it into bit 31
#else
#error Unsupported compiler.
#endif
}

The result is that code generation for x86 has not changed much, code generation with MSVC (_MSC_VER) has gotten worse, and code generation with clang for ARM Thumb-2 has gotten better.

// _MSC_VER
    mov     ecx, a
    add     ecx, b          ; Add, overflow goes into carry bit
    setc    al              ; al = 1 if carry set
    and     ecx, -2         ; Clear bottom bit
    movzx   eax, al         ; Zero-extend byte to 32-bit value
    or      eax, ecx        ; Combine
    ror     eax, 1          ; Rotate right one position
                            ; Result in eax

// __clang__
    mov     ecx, a
    add     ecx, b          ; Add, overflow goes into carry bit
    setc    al              ; al = 1 if carry set
    shld    eax, ecx, 31    ; Shift left 64-bit value

// __clang__ with ARM-Thumb2
    movs    r2, #0          ; Prepare to receive carry
    adds    r0, r0, r1      ; Calculate sum with flags
    adcs    r2, r2          ; r2 holds carry
    lsrs    r0, r0, #1      ; Shift sum right one position
    lsls    r1, r2, #31     ; Move carry to bit 31
    adds    r0, r1, r0      ; Combine

Reflections of a Microsoft Engineer

Raymond Chen joined Microsoft in 1992 and has worked there for more than 25 years, on UEX-Shell and on Windows development. Much of the initial UI architecture of Windows was his work.

His blog on MSDN, The Old New Thing, is also well known in the industry for its purely technical content.

The comments section of this post is also crowded with Microsoft experts from all walks of life, and the discussion keeps going deeper.

One commenter proposed the following approach, which takes 36 cycles in total in MIPS assembly (the per-instruction cycle counts are given in the comments).

unsigned avg(unsigned a, unsigned b)
{
    return (a & b) + (a ^ b) / 2;
}

// lw      $3,8($fp)  # 5
// lw      $2,12($fp) # 5
// and     $3,$3,$2   # 4
// lw      $4,8($fp)  # 5
// lw      $2,12($fp) # 5
// xor     $2,$4,$2   # 4
// srl     $2,$2,1    # 4
// addu    $2,$3,$2   # 4

Regarding the method whose patent expired in 2016, someone asked: instead of computing (a / 2) + (b / 2) + (a & b & 1), why not feed (a & 1) & (b & 1) into the adder as the carry-in?

Someone in the comments section also recommended the TopSpeed compiler, which lets you define an inline function by specifying the appropriate code bytes and calling convention. That handles the scenario where the result of a multiply followed by a divide fits in 16 bits but the intermediate value does not.

Let’s just say that there is no end to learning.