Verify

>>> 0.3-0.2
0.09999999999999998
>>> 0.2-0.1
0.1

Verify a little deeper

First, write a tool: a function that displays 8-byte (64-bit) data in binary.

import struct


def byte2bin(s, g=8):
    # Render each byte as 8 binary digits, g bytes per line
    o = []
    for i in range(0, len(s), g):
        o.append(' '.join(f'{c:08b}' for c in s[i:i+g]))
    return '\n'.join(o)


print(byte2bin(b'abcdefghijklmnopqrstuvwxyz'))

The output is as follows.

01100001 01100010 01100011 01100100 01100101 01100110 01100111 01101000
01101001 01101010 01101011 01101100 01101101 01101110 01101111 01110000
01110001 01110010 01110011 01110100 01110101 01110110 01110111 01111000
01111001 01111010

Next, pack some integers and display them, taking care with the byte order (endianness): the '>' in the format string requests big-endian.

print(byte2bin(struct.pack('>h', 16385)))
print(byte2bin(struct.pack('>h', -16383)))

The output is as follows.

01000000 00000001
11000000 00000001
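As an aside, the byte order matters here: a minimal sketch contrasting big-endian ('>') with little-endian ('<') for the same value.

```python
import struct

# Big-endian: most significant byte first
print(struct.pack('>h', 16385).hex())  # 4001
# Little-endian: least significant byte first, i.e. the bytes reversed
print(struct.pack('<h', 16385).hex())  # 0140
```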

Finally, verify again.

def show_double(d):
    print(d)
    print(byte2bin(struct.pack('>d', d)))


a = 0.1
show_double(a)

b = 0.2
show_double(b)

c = b-a
show_double(c)
print(c == a)

d = 0.3
show_double(d)

e = d-b
show_double(e)
print(e == a)

The output is as follows.

0.1
00111111 10111001 10011001 10011001 10011001 10011001 10011001 10011010
0.2
00111111 11001001 10011001 10011001 10011001 10011001 10011001 10011010
0.1
00111111 10111001 10011001 10011001 10011001 10011001 10011001 10011010
True
0.3
00111111 11010011 00110011 00110011 00110011 00110011 00110011 00110011
0.09999999999999998
00111111 10111001 10011001 10011001 10011001 10011001 10011001 10011000
False

As can be seen, the result of 0.2-0.1 is bit-for-bit identical to 0.1, while the result of 0.3-0.2 clearly differs from 0.1 in its binary form: 0.1 ends in 1010, but the result of 0.3-0.2 ends in 1000.

The Principle of Floating Point Numbers

Binary floating-point numbers relate to decimal fractions in the same way that binary integers relate to decimal integers.

  • In the integer positional system, the rightmost digit counts 0-9; when the concept of "ten" is needed, the next digit to the left counts how many tens there are. And so on.
  • In the fractional part, the first digit to the right of the decimal point counts how many tenths (1⁄10) there are. And so on.

Why express numbers this way? Because it ties "shifting" to "multiplying by 10", and addition and subtraction can be done digit by digit.

  • 909*10 = 9090

  • 909 shifted one place to the left = 9090

  • 909+101 = 1010; concretely (9+1), (0+0), (9+1), with carries propagating from right to left

  • 90.9*10 = 909

  • 90.9 + 99.99 = 190.89; concretely (9+9), (0+9), decimal point, (9+9), (0+9), with carries propagating from right to left

  • There is nothing special about adding decimals versus adding integers, except that the decimal points must be aligned

  • written out, 90.9 + 99.99 = 190.89 looks like this:

    
    90.90
    99.99
    ------
    190.89
    

Similarly, binary integers and floating point numbers have the same pattern.

  • 0b101*0b10 = 0b1010

  • 0b101 shifted one place left = 0b1010

  • 0b101+0b11 = 0b1000; concretely (1+0), (0+1), (1+1), with carries propagating from right to left

  • 0b101.101*0b10 = 0b1011.01. Note that 0b101.101 is not valid syntax in most languages, but the meaning should be clear

  • 0b101.101 + 0b110.11 = 0b1100.011

  • The equation above may be hard to eyeball, so convert to decimal: 5.625 + 6.75 = 12.375, which checks out.

  • written out, 0b101.101 + 0b110.11 = 0b1100.011 looks like this:

    
    0b101.101
    0b110.110
    ----------
    0b1100.011
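The binary fraction arithmetic above can be cross-checked exactly with Python's fractions module. A sketch, where parse_bin is a hypothetical helper that parses strings like '101.101':

```python
from fractions import Fraction

def parse_bin(s):
    # Parse a string like '101.101' as an exact binary fraction
    intpart, _, fracpart = s.partition('.')
    value = Fraction(int(intpart, 2))
    if fracpart:
        value += Fraction(int(fracpart, 2), 2 ** len(fracpart))
    return value

a = parse_bin('101.101')   # 5.625
b = parse_bin('110.11')    # 6.75
print(a + b == parse_bin('1100.011'))  # True
print(float(a + b))                    # 12.375
```

Because Fraction arithmetic is exact, the equality test confirms the hand calculation with no rounding involved.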
    

In-depth understanding of floating-point representation

Now look more closely, with a ruler line that marks the IEEE 754 fields (s = sign, e = exponent, f = fraction):

def scale_ieee754d():
    # https://zh.m.wikipedia.org/zh/IEEE_754
    print('seeeeeee eeeeffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff')


def show_double(d):
    print(d)
    scale_ieee754d()
    print(byte2bin(struct.pack('>d', d)))

The output is as follows (the equality judgment is omitted).

0.1
seeeeeee eeeeffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff
00111111 10111001 10011001 10011001 10011001 10011001 10011001 10011010
0.2
seeeeeee eeeeffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff
00111111 11001001 10011001 10011001 10011001 10011001 10011001 10011010
0.1
seeeeeee eeeeffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff
00111111 10111001 10011001 10011001 10011001 10011001 10011001 10011010
0.3
seeeeeee eeeeffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff
00111111 11010011 00110011 00110011 00110011 00110011 00110011 00110011
0.09999999999999998
seeeeeee eeeeffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff
00111111 10111001 10011001 10011001 10011001 10011001 10011001 10011000

As you can see, the fraction parts of 0.2 and 0.1 are exactly the same; the only difference between them is the exponent. 0.3 differs from both.
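The same observation can be made without any bit-printing helper: float.hex() (standard library) displays the significand and exponent directly.

```python
# 0.1 and 0.2 have identical significands; only the power of two differs
print((0.1).hex())  # 0x1.999999999999ap-4
print((0.2).hex())  # 0x1.999999999999ap-3
# 0.3 has a different significand altogether
print((0.3).hex())  # 0x1.3333333333333p-2
```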

Subtraction of floating point numbers

First of all, we still need a helper function.

def str2bin(s):
    # Reinterpret a 64-bit binary string as a double
    b = int(s.replace(' ', ''), 2)
    return struct.unpack('>d', struct.pack('>Q', b))[0]


print(str2bin('00111111 11010011 00110011 00110011 00110011 00110011 00110011 00110011'))

We then look at 0.3-0.2.

seeeeeee eeeeffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff
00111111 11010011 00110011 00110011 00110011 00110011 00110011 00110011
00111111 11001001 10011001 10011001 10011001 10011001 10011001 10011010

Either of these two bit strings can be fed back into str2bin to verify that it is correct.

The first step is to restore the full significand. In the normalized form the leading 1 is implicit (not stored), so we write it back explicitly before operating.

seeeeeee eeee fffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff
00111111 1101 10011 00110011 00110011 00110011 00110011 00110011 00110011
00111111 1100 11001 10011001 10011001 10011001 10011001 10011001 10011010

Step 2: alignment. The exponent of 0.3 is one larger than that of 0.2, so 0.2's significand is shifted right by one bit and its exponent bumped to match.

seeeeeee eeee fffff ffffffff ffffffff ffffffff ffffffff ffffffff fffffffff
00111111 1101 10011 00110011 00110011 00110011 00110011 00110011 00110011
00111111 1101 01100 11001100 11001100 11001100 11001100 11001100 110011010

The third step is to subtract the significands.

seeeeeee eeee fffff ffffffff ffffffff ffffffff ffffffff ffffffff fffffffff
00111111 1101 10011 00110011 00110011 00110011 00110011 00110011 00110011
00111111 1101 01100 11001100 11001100 11001100 11001100 11001100 110011010
--------------------------------------------------------------------------
00111111 1101 00110 01100110 01100110 01100110 01100110 01100110 011001100

The fourth step is normalization. Because the leading bit of the result is not 1, the significand must be shifted left (with the exponent decreased accordingly) until it is.

seeeeeee eeee fffff ffffffff ffffffff ffffffff ffffffff ffffffff fffffffff
00111111 1101 00110 01100110 01100110 01100110 01100110 01100110 011001100
--------------------------------------------------------------------------
seeeeeee eeee fff ffffffff ffffffff ffffffff ffffffff ffffffff fffffffff
00111111 1011 110 01100110 01100110 01100110 01100110 01100110 011001100

The fifth step is to return to the stored (normalized) form: drop the now-implicit leading 1, keep 52 fraction bits, and pad the end with a 0.

seeeeeee eeee fff ffffffff ffffffff ffffffff ffffffff ffffffff fffffffff
00111111 1011 110 01100110 01100110 01100110 01100110 01100110 011001100
------------------------------------------------------------------------
seeeeeee eeee ff ffffffff ffffffff ffffffff ffffffff ffffffff fffffffff
00111111 1011 10 01100110 01100110 01100110 01100110 01100110 011001100
------------------------------------------------------------------------
seeeeeee eeee ffff ffffffff ffffffff ffffffff ffffffff ffffffff fffffff
00111111 1011 1001 10011001 10011001 10011001 10011001 10011001 10011000

Finally, bring down the machine result of 0.3-0.2 from earlier and compare it with the result just computed by hand.

seeeeeee eeeeffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff
00111111 10111001 10011001 10011001 10011001 10011001 10011001 10011000
-----------------------------------------------------------------------
00111111 10111001 10011001 10011001 10011001 10011001 10011001 10011000

Exactly the same.
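The five manual steps can also be mechanized with Python integers. This is a sketch that follows the walkthrough above for this specific kind of input (positive normalized doubles, x > y), not a general IEEE 754 subtractor; fields and sub are hypothetical helper names.

```python
import struct

def fields(d):
    # Split a double into (sign, 11-bit biased exponent, 52-bit fraction)
    bits = struct.unpack('>Q', struct.pack('>d', d))[0]
    return bits >> 63, (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)

def sub(x, y):
    # Manual subtraction for positive normalized doubles with x > y > 0
    _, ex, fx = fields(x)
    _, ey, fy = fields(y)
    sx = (fx | (1 << 52)) << 8      # step 1: restore the implicit leading 1;
    sy = (fy | (1 << 52)) << 8      #   8 guard bits keep shifted-out bits around
    sy >>= ex - ey                  # step 2: align exponents (assumes ex >= ey)
    s = sx - sy                     # step 3: subtract significands
    e = ex
    while s < (1 << 60):            # step 4: normalize until the leading 1
        s <<= 1                     #   is back at bit position 60
        e -= 1
    frac = (s >> 8) & ((1 << 52) - 1)  # step 5: drop guard bits and the leading 1
    bits = (e << 52) | frac
    return struct.unpack('>d', struct.pack('>Q', bits))[0]

print(sub(0.3, 0.2))                # the hand-computed bit pattern as a float
print(sub(0.3, 0.2) == 0.3 - 0.2)   # matches the machine result
```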

Reasons for accuracy problems

First, notice that after the subtraction above, a 0 had to be padded onto the end: the result does not carry enough significant bits to fill the whole fraction field.

This is because floating point stores only the "most significant part" of a number. The general logic: when a number has 200 digits, it hardly matters whether the last digit is 0 or 5, as long as the first 190 digits are correct.

In this situation, subtracting two large numbers that are close together easily yields a difference whose precision has been lost. For example:

  • 1000000002.0-1000000001.0 = 1.0, which is easy to understand.
  • 10000000002.0-10000000001.0 = 1.0, also very understandable.
  • 10000000000000002.0-10000000000000001.0 = 2.0, which is completely baffling.
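The spacing between adjacent representable doubles (the "unit in the last place") explains such results. A sketch using math.ulp, which requires Python 3.9+:

```python
import math

# Near 1e9, adjacent doubles are much closer together than 1.0
print(math.ulp(1000000001.0))
# Near 1e16, adjacent doubles are 2.0 apart, so "+1" cannot be represented:
# 10000000000000001.0 rounds down to 1e16, while ...002.0 is exact
print(math.ulp(10000000000000001.0))             # 2.0
print(10000000000000002.0 - 10000000000000001.0)  # 2.0
```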

In fact, when computing with floating-point numbers, once the leading magnitude is large enough, the small trailing part of the data (which need not even be fractional) is bound to lose precision.

Conclusion

  1. When comparing floating-point results for equality, check whether the difference between the two is smaller than some small tolerance (an epsilon), rather than using ==.
  2. When it comes to money, use decimal arithmetic (e.g. Python's decimal module).
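Both points can be illustrated with the standard library: math.isclose compares with a relative tolerance by default, and decimal.Decimal does exact decimal arithmetic when constructed from strings.

```python
import math
from decimal import Decimal

# Point 1: compare with a tolerance instead of ==
print(0.3 - 0.2 == 0.1)              # False
print(math.isclose(0.3 - 0.2, 0.1))  # True

# Point 2: for money, construct Decimal from strings, not from floats
print(Decimal('0.3') - Decimal('0.2'))                   # 0.1
print(Decimal('0.3') - Decimal('0.2') == Decimal('0.1')) # True
```

Note that Decimal(0.3) (constructed from a float) would inherit the binary rounding error; the string form avoids it.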