There’s a lot to be said for averaging unsigned integers by rounding?
A recent article by Raymond Chen, a Microsoft engineer, has been a direct hit on the technology platform, sparking numerous discussions.
Countless people clicked in with unbridled confidence: isn’t it just a simple elementary school programming problem of adding and dividing by two?
But following deeper, but gradually surprised ……
Not so simple to find the average
Starting with the method mentioned at the beginning, which any elementary school student would know, this simple method has a fatal flaw.
If the unsigned integers are 32 bits long, then a memory overflow will occur if the two summed values are both half of the maximum length, just in the first step of the summation.
That is, average(0x80000000U, 0x80000000U) = 0.
But there are quite a few solutions, and the first one most experienced developers can think of is to prelimit the length of the summed numbers to avoid overflow.
There are two specific methods.

When the larger of the two unsigned integers is known, subtract the smaller value and divide by two to advance reduce the length :)

Predivide two unsigned integers while correcting the lower digit by & to ensure that the result is still correct when both integers are odd.
(Incidentally, this is a method that was patented and expired in 2016)
Both of these are more common ideas, and many users also said that the fastest they could think of was 2016 patent method .
The same method that can be quickly thought of by the majority of users is also SWAR (SIMD within a register).
and the std: : midpoint
function in C++ version 20.
Next, the authors propose a second idea.
If the unsigned integer is 32 bits and the native register size is 64 bits, or if the compiler supports multiword operations, the summed value can be forced into long integer data.
However, there is one particular point to note here.
You must ensure that the first 32 bits of the 64bit register are all zeros in order not to affect the remaining 32bit values.
Architectures such as x8664 and aarch64 automatically extend 32bit values zero to 64bit values.


In contrast, architectures such as Alpha AXP and mips64 extend 32bit values symbols to 64bit values.
At such times, it is necessary to add additional instructions to return to zero, for example, through the delete instruction rldicl that rounds two words to the left.


Or directly access SIMD registers that are larger than the native registers. Of course, crossing from generalpurpose registers to SIMD registers will certainly increase memory consumption as well.
If the computer’s processor supports rounding addition, then a third idea can also be used.
In this case, if the register size is n bits, then the sum of two nbit unsigned integers can be interpreted as n+1 bits, and by using the RCR (cyclic right shift with rounding) instruction, the correct average value can be obtained without losing the overflowing bits.


What if the processor does not support rightshift operations with a roundrobin?
You can also use the inner loop (rotation intrinsic).


The result is that code generation for the x86 architecture has not changed much, code generation for the MSCver architecture has gotten worse, and code generation for the armthumb2 clang has gotten better.


Reflections of a Microsoft Engineer
Raymond Chen joined Microsoft in 1992 and has served for 25 years so far, doing UEXShell and also participating in Windows development, and he did much of the initial UI architecture for Windows.
His blog The Old New Thing on MSDN is also very well known in the industry as a purely technical output site.
The comments section of this blog is also infested with Microsoft experts from all walks of life, and continues to delve deeper.
A new approach has been proposed, with a total of 36 loops in the MIPS ASM.
In response to the 2016 patent law, someone said that instead of using (a / 2) + (b / 2) + (a & b & 1), why not just put (a & 1) & ( b & 1 ) ) into the adder as an integer?
Someone in the comments section also recommended the TopSpeed compiler, which can define an inline function by specifying the appropriate code bytes and calling conventions to solve the “multiply and divide result is 16 bits, but the middle calculated value is not” scenario.
Let’s just say that there is no end to learning.