The Code Segment: Floating Point Numbers and Round-Off Errors

Hi there. I was thinking of writing about this topic for a long time but I couldn't find plenty time. Since it's a broad topic, it took time to tidy it up in my mind and start typing. For the very same reason, I splited the article into two parts. In these parts, I will try to explain how floating point numbers are stored in computers using single precision numbers (the simplest one) and what kind of errors could happen using this representation of numbers. First, I have added two interesting screen captures. First is from Excel 2007, but this result could be reproduced in more recent versions:

The results of formulas are in column A. I rewrote the formulas on column A in text again, in column B. i.e. the cell A1 contains =1-0.2-0.2-0.2-0.2-0.2 and the cell A2 contains =1-0.1-0.1...-0.1 (10 times 0.1) etc. It can be seen that both results, that should be zero, deviate from zero. These results in A2 and A3 are interesting. Why did this happen? And second interesting thing is, why did this interesting result not happen in the cell A4 or A5? It seems that same results can be obtained in Octave/Matlab.

I will give a hint about this behavior: Octave's default data type is double precision number. Although 1 is an integer, it is treated as a double (see the next screen capture). The final error increases, when I specify the numbers as "single(0.2)" instead of just "0.2" (converting them from double to single precision numbers) and subtract them from 1 consecutively. For example, for 0.2, the magnitude of error rose from 10^-17 to 10^-8 however the error is still zero for "single(0.0625)".

Scientific Notation of Numbers
Before digging the reasons of the error deep, we should return to the school years and remember, what normalized number and scientific notation is. To write a very big number or a number with a lot of decimals; first, meaningful digits of the number are written down. Then factor such as 10ⁿ is added next to it. For numbers, close to zero, n is negative and for huge numbers, n is positive.

For example:
6 720 000 000 = 6.72 * 10⁹
0.000 000 007 51 = 7.51 * 10^-9

Assuming that, m is a real number and n is an integer, scientific notation in decimal number system can be generalized as m * 10ⁿ. Let's name m as coefficient, and n as exponent. I had initially chosen m as a real number but I need to put some other constraints over it. Let's assume that d_i's are significant figures {0, 1, ..., 9} and d₀ is not equal to zero. Thus, m can be written as follows:

Assume that d₀ is zero: For example: 0.672 * 10¹⁰ = 6.72 * 10⁹
Assume that d₀ is greater than 10: 67.2 * 10⁸ = 6.72 * 10⁹

Hence, under my assumptions each number can be written uniquely. I especially underlined "decimal number system" because if I generalize it to any number system the above expression would be:

For binary number system, b = 2 and considering the constraints, it can be seen that d₀ can only 1 in binary system.

Representation of Decimals in Binary Number System
This topic can be outside of high school math but the logic behind is simple. Positional values in decimal system are:

When generalized to binary number system, if positional values go like 2⁰, 2¹, 2²..., the positional values after the point continue as 2^-1, 2^-2, 2^-3..., similarly.

The number above is, 2³ + 2¹ + 2⁰ + 2^-2 + 2^-3 = 11.375. In short, binary numbers can be uniquely written using scientific notation. I previously wrote that the integer part of the coefficient will always be 1. Integers can be denoted as follows:

1 = 1 * 2⁰
2 = 1 * 2¹
3 = 1.5 * 2¹
4 = 1 * 2²
5 = 1.25 * 2²
6 = 1.5 * 2²
7 = 1.75 * 2²
8 = 1 * 2³
...

To interpret stored floating numbers, I will use a hex editor named Hexpert. I am using this editor ever since I met computers. This is my favorite editor but unfortunately no longer in development. HxD can be my second favorite editor and it is actively developed. I just noticed, that a newer version of HxD was released. In this latest version, file editing capabilities are much more better and it can also be used. I have taken the screenshots from Hexpert, throughout this article. Below is a screenshot of Hexpert, showing the content of a small file. Right now, it would be enough to create a text file with spaces in it and open it in Hexpert.

The bytes in the file are shown in the middle field. The boxes below contains 8-, 16- and 32-bit values of the bytes where the cursor is. Boxes, starting with 'S' contains signed and 'U' contains unsigned values of same byte(s). First bit of a signed number is sign bit, denotes a negative number if one and a positive number if zero. This bit is counted in the number if the byte is interpreted as an unsigned number. The boxes below also allows the user to enter some values manually. For example, 8-bit value (00h) and 16-bit value (00h 00h) at offset 30h are zero. Since this representation is Little Endian, when the four bytes (32-bit value) under the cursor, are read in reverse order (i.e. 3Fh 80h 00h 00h), it will take the value of 63 * 256³ + 128 * 256² + 0 * 256 + 0 = 1 065 353 216 (please compare this with the value in the box S32 and U32). What I am really interested in, is the value in FP32 box. Since floating point numbers are always handled by computer as signed numbers, there is no UFP32 or UFP64 boxes obviously. FP32 means floating point number 32-bit (single precision) and FP64 means floating point number 64-bit or double precision. I will not go into this right now and only thing, I can say for now is, 40h offset contains the value 1.0 in FP64 format.

This representation is determined by IEEE-754 standard. In the standard in 1985, the terms 'single' and 'double' are mentioned, while in the 2008 revision of the standard, these terms are referred as binary32 and binary64 respectively. According to this standard, the first bit of a single precision number is the sign bit, the next 8-bits [bits 30-23] are exponent and the remaining 22 bits [bits 22-0] are coefficient. Let's decompose the value 1.0 accordingly:

3F 80 00 00 = 0|011 1111 1|000 0000 0000 0000 0000 0000

[ You can switch to binary view with Alt + B keys. ]

Since the integer part is always one, there is no need to store this within the bits of floating value. It is added automatically in each calculation. Therefore, the coefficient is actually 0 + 1 = 1, here. The exponent 0111 1111b = 127 and the constant "127" is subtracted from this value according to the standard. So the real exponent becomes 127 - 127 = 0. This number is positive since the sign bit is 0. When calculated 1 * 2^(127-127) = 1 is found.

Two is written as "1 * 2¹". If 1 is subtracted from the coefficient 1 - 1 = 0 and if "127" is added to the exponent: 1 + 127 = 128. Thus, 2 is expressed as: 0|100 0000 0|000 0000 0000 0000 0000 0000 = (40 00 00 00)₁₆. Of course, these bytes must be written in reverse order before it's stored since it is Little-Endian.

It should be noted that, the coefficient bits actually represent fractional binary numbers. For example, let's convert 200 to floating point. 200 = 128 + 64 + 8 = 2⁷ + 2⁶ + 2³. The biggest exponent must be chosen as the exponent value of floating point which is 7. Therefore, 200 = 2⁷ * (1 + 2^-1 + 2^-4) = 1.1001b * 2⁷ = 1.5625 * 2⁷. If I remove "1" from the coefficient, the rest is ".1001". Let's calculate exponent of floating point by adding 127 to 7: 7 + 127 = 134 = 1000 0110b. Putting it all together, final result is: 0|100 0011 0|100 1000 0000 0000 0000 0000 = (43 48 00 00)₁₆.

Now, using this information, I can also express numbers with decimals:

Example: 0.25 = 1 * 2^-2. Exponent is -2 + 127 = 125: 0|011 1110 1|000 0000 0000 0000 0000 0000

Let's remember repeating decimals. When we try to express the divisions with decimal numbers instead of fractions and if the dividend cannot be divided by divisor without any remainder, decimals start to repeat each other somwhere after the period (not necessarily right after it). For example: 7 / 90 = 0.0707... Behind the logic of computer representation of any floating point number, there are continuous divisons and expressing the coefficient as a sum of binary fractions. Okay, would such repeating decimals occur in floating numbers? Let's take 0.1 as an example. Its single precision floating point representation is: 0|011 1101 1|100 1100 1100 1100 1100 1101. The exponent is 123 - 127 = -4. Let's separate the digits of the coefficient by multiplying them with their positional values and sum all of them, it looks like this:

If the lowest term with 2^-23 (on the last line) would not exist in this sum, the total would be equal to 0.599 999 904 632 568 359 15. In other words, the value 0.6 exactly, (actually 1.6) can never be obtained. Adding 1 to this number and multiplying the result with 2^-4 yields 0.100 000 001 490 116 119 370 703 125. But when Hexpert, Excel, Matlab or Octave shows the first 6-7 digits of this number, we come up with 0.1 (of course I ignore, the number being handled as double precision numbers in Matlab or Octave, for now). This small difference accumulates on each -0.1 operation (subtraction) and the accumulated error becomes really significant when the result is in the neighborhood of zero.

Please note that, this should not be seen as a deficiency of the binary number system. If computers could use decimal number system and humans used duodecimal (base 12) system the same situation would have happened again. This is caused by the translation itself regardless of number system.

Non-zero values are observed after the ninth decimal in the above number. But when I subtract that number from 1 ten times, the actual operation is "1-1.000 000 014 901 161 193 707 031 25". The error accumulates and overflows to the eighth decimal. I previously wrote that, when the same process is repeated with single precision numbers in Octave, the magnitude of error is 10^-8.

>> X=single(0.1)
X = 0.10000
>> 1-X-X-X-X-X-X-X-X-X-X
ans = -7.4506e-008
>>

The difference between having the last bit of the coefficient of a floating number set or reset, is the smallest difference which a computer can understand between two floating point value. If this difference is between 1 and the first floating point value greater than one, it is called machine epsilon, in scientific literature. In Matlab, eps function returns this value:

>> eps
ans = 2.2204e-016
>> eps('single')
ans = 1.1921e-007

The same values and the examples on Wikipedia are comparable:
Single-precision floating-point format
0|011 1111 1|000 0000 0000 0000 0000 0001₂ = 3f80 0001₁₆ = 1 + 2^-23 ≈ 1.0000001192
(smallest number larger than one)

Double-precision floating-point format

0|011 1111 1111 | 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0001₂ ≙ 1 + 2-⁵² ≈ 1.0000000000000002, the smallest number > 1

Okay, why there was no error in subtraction with 0.0625s? Because, since 0.0625 is actually 0.0001₂, this number can be actually represented in CPU (or FPU to be more precise) exactly as it is. Therefore no error occurs.

In the second part of the article, I will detail the occurring errors and give further examples.

The Code Segment

Saturday, January 5, 2019

Floating Point Numbers and Round-Off Errors

No comments:

Post a Comment

About Me