Years ago when I was studying C++ at university, I learned that IEEE 754 is the most common way of representing floating point numbers in computers today. But since I work with it very little, I often forget the specific details of this standard, so I wrote this article to document it in detail.

How do you represent numbers?

There are many ways to represent numbers. The most common one in everyday writing is fixed-point notation: a decimal point placed among the digits marks the fractional part, and a number written without a decimal point is a whole number.

Another way is scientific notation, which consists of a significand (the base part) and an exponent part. For example, the decimal integer 50 can be written in scientific notation in any of the following ways.

5 x (10 ^ 1)
0.5 x (10 ^ 2)
0.05 x (10 ^ 3)

...

123.45 can be expressed in the following ways.

0.12345 x (10 ^ 3)
1.2345 x (10 ^ 2)
12.345 x (10 ^ 1)
123.45 x (10 ^ 0)
1234.5 x (10 ^ -1)

...

Of these, 5 x (10 ^ 1) and 1.2345 x (10 ^ 2) are in Standard Scientific Notation: the left-hand part of the number has exactly one non-zero digit to the left of the decimal point.
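
As a quick check, JavaScript's built-in toExponential method prints a number in this standard form. This is only an illustration, and the variable names are made up for the example.

```javascript
// toExponential() writes a number with exactly one digit before the decimal point.
const fifty = (50).toExponential();      // "5e+1", i.e. 5 x (10 ^ 1)
const other = (123.45).toExponential();  // "1.2345e+2", i.e. 1.2345 x (10 ^ 2)
console.log(fifty, other);
```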

Binary numbers can be written in standard scientific notation as well; the exponent then has a base of 2.

The binary 10100.110 is expressed as 1.0100110 × (2 ^ 4).
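
To confirm that both forms describe the same number, here is a minimal sketch that converts a binary string to its value. The helper parseBinary is a hypothetical name written for this example, not a built-in function.

```javascript
// Hypothetical helper: convert a binary string such as "10100.110" to a number.
function parseBinary(text) {
  const [whole, frac = ""] = text.split(".");
  let value = whole === "" ? 0 : parseInt(whole, 2);
  // Each digit after the point is worth 1/2, 1/4, 1/8, ...
  for (let i = 0; i < frac.length; i++) {
    if (frac[i] === "1") value += 2 ** -(i + 1);
  }
  return value;
}

const plain = parseBinary("10100.110");               // 20.75
const normalized = parseBinary("1.0100110") * 2 ** 4; // 20.75
console.log(plain, normalized, plain === normalized);
```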

IEEE 754 is essentially standard scientific notation in binary.

IEEE 754

First look at the storage structure of IEEE 754.

Format            Sign     Exponent     Fraction
Single Precision  1 [31]   8 [30-23]    23 [22-00]
Double Precision  1 [63]   11 [62-52]   52 [51-00]

(Each cell gives the field width in bits, followed by the bit positions in brackets.)

In C++, a single-precision floating point number (float) typically occupies 4 bytes. One byte is 8 bits, so there are 32 bits in total. The leftmost bit (bit 31) stores the sign, bits 23-30 store the exponent part of the scientific notation, and bits 0-22 store the non-exponent part, which is called the fraction, or trailing part, here. Double precision (double) occupies 8 bytes: the leftmost bit (bit 63) is the sign bit, bits 52-62 hold the exponent part, and bits 0-51 hold the fraction.

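For example, if the sign bit is 0 (positive), the exponent part decodes to 70, and the fraction part gives a significand of 1.1001 (binary), the value represented is 1.1001 x (2 ^ 70). (The exponent is stored with an offset, which is explained in the Exponent section below.)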
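
The layout can be inspected directly. The following is a minimal sketch that uses the standard DataView API to split a single-precision value into its three fields; float32Fields is a hypothetical helper name, and the exponent is returned exactly as stored, without any decoding.

```javascript
// Hypothetical helper: split a number, stored as a 32-bit float, into its fields.
function float32Fields(value) {
  const view = new DataView(new ArrayBuffer(4));
  view.setFloat32(0, value);         // round to single precision and store
  const bits = view.getUint32(0);    // read the same 4 bytes back as an integer
  return {
    sign: bits >>> 31,               // bit 31
    exponent: (bits >>> 23) & 0xff,  // bits 30-23, as stored
    fraction: bits & 0x7fffff,       // bits 22-0
  };
}

// 20.75 is 10100.110 in binary, i.e. 1.0100110 x (2 ^ 4).
console.log(float32Fields(20.75));
```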

The following sections explain the sign, exponent, and fraction parts in order.

Sign

The sign bit of IEEE 754 indicates whether a number is positive or negative: 0 means positive and 1 means negative. It's that simple.

Exponent

In the single-precision layout, the exponent part has 8 bits, so the largest integer it can store is 255 (2 ^ 8 - 1). Because the exponent has to represent both positive and negative values, it is stored with an offset (bias), which for single precision is 127. A stored value of 200 therefore means 200 - 127 = 73, and by the same formula a stored 0 would mean 0 - 127 = -127. In practice the stored values 0 and 255 are reserved for special cases (zero, subnormal numbers, infinity, and NaN), so ordinary numbers use exponents from -126 to 127.

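In the double-precision layout, the exponent part has 11 bits and the offset is 1023 (2 ^ 10 - 1), so by the same formula a stored 0 would mean -1023. As in single precision, the all-zeros and all-ones patterns are reserved for special values.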
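
The offset arithmetic can be sketched as follows. Both helper names are hypothetical, and the formula 2 ^ (bits - 1) - 1 is how the offsets 127 and 1023 are obtained for 8-bit and 11-bit exponent fields.

```javascript
// Offset (bias) for an exponent field that is `bits` wide: 2^(bits - 1) - 1.
function exponentBias(bits) {
  return 2 ** (bits - 1) - 1;
}

// Hypothetical helper: read the stored exponent field (bits 62-52) of a double.
function storedExponent64(value) {
  const view = new DataView(new ArrayBuffer(8));
  view.setFloat64(0, value);
  // The first 16 bits hold the sign, the 11 exponent bits, and 4 fraction bits.
  return (view.getUint16(0) >>> 4) & 0x7ff;
}

console.log(exponentBias(8), exponentBias(11));      // 127 1023
console.log(200 - exponentBias(8));                  // 73
console.log(storedExponent64(8) - exponentBias(11)); // 3, since 8 = 1.0 x (2 ^ 3)
```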

Fraction

In standard binary scientific notation, the digit before the binary point must be 1: a binary digit can only be 0 or 1, and standard notation requires it to be non-zero. Since that leading 1 is always there, there is no need to spend a bit on it, so all bits of the fraction part represent the digits to the right of the point. For example, if the stored fraction begins with 100111, the value it stands for is 1.100111. So the Fraction of a 32-bit floating-point number has 23 bits, but it represents a 24-bit value.
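
Putting the three parts together, here is a minimal sketch that rebuilds a single-precision value from its fields. The function name fromFloat32Fields is hypothetical, and the sketch only covers ordinary (normal) numbers, where the implicit leading 1 applies.

```javascript
// Hypothetical helper: rebuild a normal single-precision value from its fields.
function fromFloat32Fields(sign, exponent, fraction) {
  const significand = 1 + fraction / 2 ** 23; // restore the implicit leading 1
  return (sign === 1 ? -1 : 1) * significand * 2 ** (exponent - 127);
}

// Fraction bits 100111 followed by 17 zeros; stored exponent 127 means 2 ^ 0.
const fractionBits = 0b100111 << 17;
console.log(fromFloat32Fields(0, 127, fractionBits)); // 1.609375, binary 1.100111
```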

That is how IEEE 754 represents floating point numbers. Once you understand it, it is easy to calculate the range of values that single and double precision can represent.
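
As a rough sketch of that calculation: assuming the all-ones exponent pattern is reserved for infinity and NaN, the largest usable exponent is 127 for single precision and 1023 for double precision, and the largest finite value has every fraction bit set. The function name largestFinite is made up for this example.

```javascript
// Largest finite value: all fraction bits set (significand 2 - 2^-fractionBits)
// combined with the largest usable exponent.
function largestFinite(fractionBits, maxExponent) {
  return (2 - 2 ** -fractionBits) * 2 ** maxExponent;
}

const maxSingle = largestFinite(23, 127);  // about 3.4e38
const maxDouble = largestFinite(52, 1023); // about 1.8e308
console.log(maxSingle, maxDouble, maxDouble === Number.MAX_VALUE);
```

JavaScript numbers are double precision, so the second result can be compared against the built-in constant Number.MAX_VALUE.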

0.1 + 0.2 != 0.3 ?

In languages that store floating point numbers as IEEE 754 double precision, such as JavaScript and C++ (with double), 0.1 + 0.2 evaluates to 0.30000000000000004. The reason 0.1 + 0.2 is not equal to 0.3 lies in the way IEEE 754 stores floating point numbers.

In binary, 0.1 and 0.2 are infinitely repeating fractions. Whether single or double precision is used, they must be rounded to fit the available fraction bits, so the stored value of 0.1 is not exactly 0.1, and the stored 0.2 carries a similar error. The result of adding them is therefore not always exact.
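
The stored values can be made visible by asking for more digits than the default formatting shows. This is a small illustration using the built-in toFixed method; the variable names are made up for the example.

```javascript
// toFixed(20) exposes digits that the default formatting hides.
const tenth = (0.1).toFixed(20);     // "0.10000000000000000555"
const fifth = (0.2).toFixed(20);     // "0.20000000000000001110"
const sum = (0.1 + 0.2).toFixed(20); // "0.30000000000000004441"
const three = (0.3).toFixed(20);     // "0.29999999999999998890"
console.log(tenth, fifth, sum, three);
console.log(0.1 + 0.2 === 0.3);      // false
```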

So whenever you check whether the sum of two floating point values equals a given number, you must allow for a small error.

In JavaScript, Number.EPSILON (the gap between 1 and the next larger representable number) can be used as this error tolerance, and we can check for equality this way.

const x = 0.2;
const y = 0.3;
const z = 0.1;
// Treat x + z and y as equal when their difference is below Number.EPSILON.
const equal = Math.abs(x - y + z) < Number.EPSILON;
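
A fixed tolerance of Number.EPSILON suits values close to 1, because the spacing between neighbouring floating point numbers grows with their magnitude. One possible variation, shown here only as a sketch, scales the tolerance by the size of the operands. The helper nearlyEqual and its scaling rule are assumptions made for this example, not a standard API.

```javascript
// Hypothetical helper: compare two numbers with a tolerance scaled to their size.
function nearlyEqual(a, b, tolerance = Number.EPSILON) {
  const scale = Math.max(1, Math.abs(a), Math.abs(b));
  return Math.abs(a - b) <= tolerance * scale;
}

console.log(nearlyEqual(0.1 + 0.2, 0.3)); // true
console.log(nearlyEqual(0.3, 0.3001));    // false
```

For real applications the tolerance usually has to be chosen to match how much error the preceding calculations can accumulate.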