Friday, January 18, 2019

Floating Point Numbers and Round-Off Errors #2

 
Hi there. In the previous article, I wrote about how floating point numbers are stored in the computer and mentioned that the constraints of computer architecture cause interesting behaviors. In this article, I will give further examples of these behaviors.
 
The first interesting behavior is this: rather than comparing a floating point result with zero, a programmer has to test whether the result, which is expected to be close to zero, lies within an acceptable distance of zero. I will give an example of this further down in this article, marked with an asterisk (*). First, I need to clarify some intermediate points.

The examples in the previous article mostly used single precision numbers. This does not mean that double precision numbers are free of such problems. Since the coefficient is represented with more bits, the error in the calculation from the previous article (the "1-0.1-0.1-..." operation) occurred in the seventeenth decimal place, whereas with single precision numbers it was in the ninth. The result in Excel (the first example in the previous article) had an error in the sixteenth decimal place. The error in the numerical representation of "0.1" is in the seventeenth decimal place; since we subtracted it ten times, the error is also multiplied by ten. I chose single precision numbers for the examples for only one reason: to show the calculations and their errors more easily.
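To see both magnitudes side by side, here is a minimal C sketch (my own addition, not part of the original experiments), assuming IEEE-754 floats and doubles:

#include <stdio.h>

int main(void) {
    double d = 1.0;
    float  f = 1.0f;
    for (int i = 0; i < 10; i++) {
        d -= 0.1;     /* 0.1 is a double constant */
        f -= 0.1f;    /* 0.1f is a float constant */
    }
    /* on paper, both results are exactly zero */
    printf("double: %e\n", d);   /* a tiny residue on the order of 1e-16 */
    printf("float : %e\n", f);   /* a residue on the order of 1e-8 */
    return 0;
}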

These errors do not only occur in operations whose results are close to zero. Since the coefficient bits are limited, there are errors in the representation of big numbers as well. The fact that the coefficient field has only 23 bits also causes problems with numbers that can only be expressed with more than 23 binary fractions. Let's run an experiment in Hexpert. I opened a file, moved to offset 00H and entered 16 777 215 into the FP32 box. Then I moved to offset 04H, entered 16 777 216 into the FP32 box, and so on, entering all numbers up to 16 777 220.


As can be seen in the image, all the coefficient bits are filled, because 16 777 215 can be written as a sum of 23 binary fractions (as the image shows). Its successor is expressed as 1 * 2^24. But the exponent has grown so much that there is no place left to write the final "1" of 2^24 + 1. Note that, even though the number at offset 08H is the successor of the number at offset 04H, they are represented by the same floating point value, just as the value at offset 0CH is the same as the one at offset 10H. Between 2^24 and 2^25, odd numbers cannot be represented. Similarly, between 2^25 and 2^26, numbers that are not multiples of four cannot be represented. If small increments in such large numbers are considered insignificant, there is no problem; but depending on the problem at hand, 2^24 may not be a very large number at all. By the way, using a 32-bit integer, I can still express 2^24 and many of its successors with absolute precision.
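The same granularity can be demonstrated without a hex editor. A minimal C sketch (my own, assuming IEEE-754 single precision):

#include <stdio.h>

int main(void) {
    float f = 16777216.0f;           /* 2^24, still exactly representable */
    printf("%.1f\n", f + 1.0f);      /* 16777216.0: the +1 is lost */
    printf("%.1f\n", f + 2.0f);      /* 16777218.0: even successors survive */
    printf("%d\n", f + 1.0f == f);   /* 1: 2^24 + 1 rounds back to 2^24 */
    return 0;
}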

In short, since the bit fields of floating point numbers are finite, numbers must be truncated at some point. Examples of such errors are numerous. Irrational numbers like π or e cannot be expressed as a sum of finitely many fractions (in other words, they cannot be written with finitely many decimals) and therefore can never be fully expressed with floating point numbers. Moreover, it is impossible, at least in theory, to represent an arbitrary real number, drawn from an uncountably infinite set, using a finite number of bits. In practice, we can still work with these numbers, for one reason only: in engineering it is fine when the final result is "sufficiently" accurate after rounding. For example, 4218.14 kg can be rounded to 4200 kg or even 4000 kg. These kinds of errors, caused by truncated bits, are called truncation or round-off errors. Although the two terms are not the same in numerical analysis, they can be used interchangeably here, because the CPU's behavior is to round off these numbers. ("Occasionally, round-off error (the consequence of using finite precision floating point numbers on computers) is also called truncation error, especially if the number is rounded by truncation." -- ref: Truncation error (accessed on 05.09.2020))

Let's consider the example of comparing two floating point numbers when they are very close to zero (*). We know that sin(30°) = sin(π / 6) = 0.5, thus sin(π / 6) - 0.5 = 0. The statement "printf("%e\n", sin( M_PI / 6.0 ));" prints "5.000000e-01". As I wrote previously, M_PI itself contains an approximation of π. There must also be an approximation inside the sine function, because of the truncation of the Maclaurin series (the FPU mainly uses CORDIC, but I am trying to keep things simple). As a result, one can guess that the printed "5.000000e-01" is not exactly 2^-1. And indeed, the statement "printf("%e\n", sin( M_PI / 6.0 ) - 0.5);" prints -5.551115e-17. When comparing the result, one should check whether the difference is "close enough" to zero, not whether it equals zero. For example, I could check whether the result lies in a sufficiently small neighborhood of zero, such as the machine epsilon, i.e.

if(fabs(sin(M_PI / 6.0) - 0.5) < DBL_EPSILON)

is a better comparison. However, as I wrote in the previous article, the machine epsilon is only meaningful in the neighborhood of one, so this approach is not accurate enough either. Depending on the sensitivity of the operation, the machine epsilon can be too large. The machine epsilon of double precision numbers is about 10^-16, whereas the smallest possible double precision number is 2.225 * 10^-308 (this number is actually 4.94 * 10^-324; I will explain this later, in the paragraph marked with double asterisks (**)). Choosing the correct "small enough" value is a difficult problem, as this value differs along the number line. For example, 3 is a "small enough" number between 2^24 and 2^25, but it is not "small enough" between 2^25 and 2^26.
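One common workaround, sketched below under my own assumptions, is to scale the tolerance by the magnitude of the operands instead of using a fixed epsilon. The helper name nearly_equal and the safety factor of 4 are arbitrary illustrative choices, not a standard recipe:

#include <stdio.h>
#include <math.h>
#include <float.h>

/* Hypothetical helper: a and b count as equal when their distance is
   small relative to their magnitudes. Not a one-size-fits-all solution. */
static int nearly_equal(double a, double b) {
    double diff  = fabs(a - b);
    double scale = fmax(fabs(a), fabs(b));
    return diff <= 4.0 * DBL_EPSILON * scale;  /* headroom for accumulated rounding */
}

int main(void) {
    printf("%d\n", nearly_equal(sin(M_PI / 6.0), 0.5));  /* expected: 1 */
    return 0;
}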

If the exponent field is 00h or 0FFh, it carries a special meaning. If the exponent is 0FFh and the coefficient is zero, the value is positive or negative infinity, according to the sign bit. These infinities can come out of dividing a very large dividend by a small divisor, when the quotient does not fit into the register. If the exponent is 0FFh but the coefficient is non-zero, the value is called NaN, "Not a Number". It comes out as the result of any indeterminate form in math, or when the operation to be performed in the FPU is undefined (like the square root of a negative number). Any operation involving a NaN will also result in NaN.
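These special values are easy to produce in C. A minimal sketch (the overflow and the indeterminate form below are my own choice of examples):

#include <stdio.h>
#include <math.h>

int main(void) {
    double huge = 1e308;
    double inf  = huge * 10.0;   /* overflows the double range -> +inf */
    double nan1 = sqrt(-1.0);    /* undefined operation -> NaN */
    double nan2 = inf - inf;     /* indeterminate form -> NaN */

    printf("%e %e %e\n", inf, nan1, nan2);
    printf("%d %d\n", isinf(inf), isnan(nan1));  /* 1 1 */
    printf("%d\n", nan1 == nan1);                /* 0: NaN is not even equal to itself */
    return 0;
}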

The second special case is an exponent of zero. If the coefficient is zero as well (i.e. 00 00 00 00), the value is obviously zero. However, the exponent-zero case is more complicated than that. I previously mentioned that 1 is added to each coefficient during translation. If the coefficient were always greater than or equal to one, zero could never be obtained. Therefore, exponent zero is an exception: the value 127 is not subtracted from the exponent in this case. Instead, the exponent's value is assumed to be -126 and no leading 1 is added to the coefficient; in other words d0 = 0 (please refer to the previous article for the definition of d0). Subtracting two nearby numbers whose exponents are -126 can yield a difference smaller than 2^-126. For example:

1.5 * 2^-126 - 1.375 * 2^-126 = 0.125 * 2^-126 = 1 * 2^-129

To prevent an underflow and to be able to represent differences like the one above, the FPU needs a class of "very small" numbers. Therefore, numbers of the form 0.f * 2^-126, i.e. values with magnitudes below 2^-126, were added to the floating point standard. These numbers are called subnormal numbers (below the normalized numbers). When working with subnormal numbers, the FPU sets the DE flag, meaning that there is an ongoing operation with DEnormal (subnormal) numbers and that the least significant decimals of the result might be lost. In other words, the FPU cannot guarantee the accuracy of subtractions and divisions on subnormal numbers. In the sentence above marked with double asterisks (**), I noted two different smallest possible values for double precision numbers: the larger one is the smallest normal number and the other is the smallest subnormal number. (One can think of subnormal numbers as intended only for storing results, not for use as operands.)

Example code:
I wrote the following code to demonstrate operations on subnormal numbers:

#include<stdio.h>
#include<stdint.h>
union uSayi {
    unsigned char bayt[8];
    uint64_t q;
    double s;
};

int main() {
    union uSayi d1, d2, d3;
    double t1, t2, t3;

    d1.q = 0x0018000000000000;
    d2.q = 0x0016000000000000;
    d3.q = 0x0014000000000000;
    //printf("%e %e %e\n", d1.s, d2.s, d3.s);

    // 1.5 * 2^-1022 - 1.25 * 2^-1022 = 0.25 * 2^-1022
    t1 = d1.s - d3.s;
    // 1.5 * 2^-1022 - 1.375 * 2^-1022 = 0.125 * 2^-1022
    t2 = d1.s - d2.s;
    // 0.25 * 2^-1022 - 0.125 * 2^-1022 = 0.125 * 2^-1022
    t3 = t1 - t2;

    return 0;
}

I defined a union so I can access the same variable both as a double and as a 64-bit unsigned integer. Then I defined three variables, d1, d2 and d3, of type union uSayi, assigned small values near the lower limit of the normalized floating point range, and subtracted them from each other, storing the intermediate results in the variables t1, t2 and t3. I compiled it as follows, since I will debug it in gdb.

gcc -mfpmath=387 -g kod2.c
gdb -tui a.out

gcc handles floating point numbers with SSE instructions by default. I forced gcc to generate 80387 code because I want to watch the values in the FPU.

I loaded the program in gdb's TUI. Since I prefer Intel syntax for assembly, I gave the following two commands:

(gdb) set disassembly-flavor intel
(gdb) layout asm


In the above screenshot, the assignments to d1, d2 and d3 can be clearly seen starting at offset 0x400574. The first subtraction is at 0x4005a4, the second at 0x4005af and the last one at 0x4005b7, all using the fsub instruction. I put a breakpoint on the first subtraction with the break *main+71 command and ran the code with the run command. When execution stopped, I checked the FPU stack using info float. The numbers I entered manually are ready to be processed in the FPU:


I executed the subtraction using the nexti command, then checked the result in the FPU stack using info float. The next subtraction runs the same way, so I proceeded to the final subtraction using until *main+90. When I give the info float command at this step, I see that the FPU Status Word has changed from 0x3800 to 0x3802, which means the DE (DEnormalized) flag is set. I executed the last subtraction with a final nexti command. The result was +2.781342323134001729e-309. Then I let the program finish with the cont command. In the documentation and on Wikipedia, it is stated that instructions run slower with denormalized values than with normal values. Hence it is recommended to turn off operations on denormalized values in applications where speed matters more than accuracy of results.

At https://blogs.oracle.com/d/subnormal-numbers, it is mentioned that the -fns compiler parameter causes subnormal results to be rounded to zero during execution; however, I could not find such a parameter in gcc.
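For what it's worth, on x86 the SSE unit (not the x87 path used above) has flush-to-zero (FTZ) and denormals-are-zero (DAZ) bits in the MXCSR register, which can be set from C with intrinsics. A minimal sketch, assuming gcc on x86-64 (older gcc versions may require -msse3 for pmmintrin.h):

#include <stdio.h>
#include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE */
#include <pmmintrin.h>   /* _MM_SET_DENORMALS_ZERO_MODE */

int main(void) {
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);         /* round subnormal results to zero */
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON); /* treat subnormal inputs as zero */

    volatile double tiny = 5e-324;   /* smallest subnormal double */
    printf("%e\n", tiny * 0.5);      /* likely 0.000000e+00 with FTZ/DAZ on */
    return 0;
}

Note that this affects only SSE arithmetic; it would change nothing in the x87 experiment above.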

Saturday, January 5, 2019

Floating Point Numbers and Round-Off Errors

 
Hi there. I had been thinking about writing on this topic for a long time, but I couldn't find enough time. Since it's a broad topic, it took time to tidy it up in my mind and start typing. For the very same reason, I split the article into two parts. In them, I will try to explain how floating point numbers are stored in computers, using single precision numbers (the simplest case), and what kind of errors can arise from this representation. First, here are two interesting screen captures. The first is from Excel 2007, but the result can be reproduced in more recent versions:


The results of the formulas are in column A, and I wrote each formula out again as text in column B; i.e. cell A1 contains =1-0.2-0.2-0.2-0.2-0.2, cell A2 contains =1-0.1-0.1-...-0.1 (ten times 0.1), and so on. Both results, which should be zero, deviate from zero. The results in A2 and A3 are interesting: why did this happen? And the second interesting thing: why did this result not appear in cells A4 and A5? The same results can be obtained in Octave/Matlab.



I will give a hint about this behavior: Octave's default data type is the double precision number. Although 1 is an integer, it is treated as a double (see the next screen capture). The final error grows when I specify the numbers as "single(0.2)" instead of just "0.2" (converting them from double to single precision) and subtract them from 1 consecutively. For example, for 0.2, the magnitude of the error rose from 10^-17 to 10^-8; however, the error is still zero for "single(0.0625)".

Scientific Notation of Numbers
Before digging deep into the reasons for the error, we should return to our school years and remember what normalized numbers and scientific notation are. To write a very big number, or a number with a lot of decimals, the meaningful digits of the number are written down first, and then a factor such as 10^n is placed next to them. For numbers close to zero, n is negative; for huge numbers, n is positive.

For example:
6 720 000 000 = 6.72 * 10^9
0.000 000 007 51 = 7.51 * 10^-9

Assuming that m is a real number and n is an integer, scientific notation in the decimal number system can be generalized as m * 10^n. Let's name m the coefficient and n the exponent. I initially chose m as a real number, but I need to put some further constraints on it. Let's assume that the di are digits from {0, 1, ..., 9} and that d0 is not equal to zero. Thus, m can be written as follows:

m = d0.d1d2d3... = d0 + d1 * 10^-1 + d2 * 10^-2 + d3 * 10^-3 + ...
If d0 were allowed to be zero, the same number could be written in several ways, for example: 0.672 * 10^10 = 6.72 * 10^9.
If the integer part were allowed to reach 10 or more: 67.2 * 10^8 = 6.72 * 10^9.

Hence, under my assumptions, each number can be written uniquely. I especially underlined "decimal number system", because generalized to an arbitrary number system with base b, the above expression becomes:

m = d0 + d1 * b^-1 + d2 * b^-2 + d3 * b^-3 + ...,  with di ∈ {0, 1, ..., b-1} and d0 ≠ 0
For the binary number system, b = 2, and considering the constraints, it can be seen that d0 can only be 1.

Representation of Decimals in Binary Number System
This topic may be outside of high school math, but the logic behind it is simple. The positional values in the decimal system are:

... 10^3, 10^2, 10^1, 10^0 . 10^-1, 10^-2, 10^-3 ...
When generalized to the binary number system, the positional values go 2^0, 2^1, 2^2... before the point, and continue as 2^-1, 2^-2, 2^-3... after the point, similarly.

1011.011 (binary)
The number above is 2^3 + 2^1 + 2^0 + 2^-2 + 2^-3 = 11.375. In short, binary numbers can be written uniquely using scientific notation. I previously wrote that the integer part of the coefficient will always be 1. Integers can be denoted as follows (see the C sketch after this list):

1 = 1 * 2^0
2 = 1 * 2^1
3 = 1.5 * 2^1
4 = 1 * 2^2
5 = 1.25 * 2^2
6 = 1.5 * 2^2
7 = 1.75 * 2^2
8 = 1 * 2^3
...
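These decompositions can also be produced programmatically. The standard frexpf function splits a float into m * 2^e, but it normalizes m into [0.5, 1), so in the sketch below (my own) I shift the result into the [1, 2) convention used in this article:

#include <stdio.h>
#include <math.h>

int main(void) {
    for (float x = 1.0f; x <= 8.0f; x += 1.0f) {
        int e;
        float m = frexpf(x, &e);   /* x = m * 2^e with 0.5 <= m < 1 */
        printf("%g = %g * 2^%d\n", x, m * 2.0f, e - 1);
    }
    return 0;
}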

To interpret stored floating point numbers, I will use a hex editor named Hexpert. I have been using this editor ever since I met computers; it is my favorite, but unfortunately it is no longer in development. HxD would be my second favorite, and it is actively developed. I just noticed that a newer version of HxD has been released; its file editing capabilities are much better now, so it can be used as well. I have taken the screenshots from Hexpert throughout this article. Below is a screenshot of Hexpert showing the content of a small file. For now, it is enough to create a text file with some spaces in it and open it in Hexpert.

The bytes of the file are shown in the middle field. The boxes below contain the 8-, 16- and 32-bit values of the bytes at the cursor. Boxes starting with 'S' contain signed and boxes starting with 'U' contain unsigned interpretations of the same byte(s). The first bit of a signed number is the sign bit: it denotes a negative number if one and a positive number if zero. This bit counts toward the value if the byte is interpreted as unsigned. The boxes also allow the user to enter values manually. For example, the 8-bit value (00h) and the 16-bit value (00h 00h) at offset 30h are zero. Since this representation is little-endian, when the four bytes (the 32-bit value) under the cursor are read in reverse order (i.e. 3Fh 80h 00h 00h), they take the value 63 * 256^3 + 128 * 256^2 + 0 * 256 + 0 = 1 065 353 216 (compare this with the values in the S32 and U32 boxes). What I am really interested in is the value in the FP32 box. Since floating point numbers are always handled by the computer as signed numbers, there are obviously no UFP32 or UFP64 boxes. FP32 means 32-bit floating point number (single precision) and FP64 means 64-bit floating point number, or double precision. I will not go into detail right now; the only thing I can say for now is that offset 40h contains the value 1.0 in FP64 format.
 
This representation is defined by the IEEE-754 standard. The 1985 standard uses the terms 'single' and 'double', while the 2008 revision refers to them as binary32 and binary64, respectively. According to this standard, the first bit of a single precision number is the sign bit, the next 8 bits [bits 30-23] are the exponent and the remaining 23 bits [bits 22-0] are the coefficient. Let's decompose the value 1.0 accordingly:

3F 80 00 00 = 0|011 1111 1|000 0000 0000 0000 0000 0000

[ You can switch to binary view with Alt + B keys. ]

Since the integer part is always one, there is no need to store it within the bits of the floating point value; it is added back automatically in each calculation. Therefore, the coefficient here is actually 0 + 1 = 1. The exponent is 0111 1111b = 127, and according to the standard the constant 127 is subtracted from this value, so the real exponent becomes 127 - 127 = 0. The number is positive, since the sign bit is 0. Calculating 1 * 2^(127-127) = 1, we recover the value.

Two is written as 1 * 2^1. Subtracting 1 from the coefficient gives 1 - 1 = 0, and adding 127 to the exponent gives 1 + 127 = 128. Thus, 2 is expressed as: 0|100 0000 0|000 0000 0000 0000 0000 0000 = 40 00 00 00 (hex). Of course, these bytes must be written in reverse order before being stored, since the format is little-endian.

It should be noted that the coefficient bits actually represent binary fractions. For example, let's convert 200 to floating point. 200 = 128 + 64 + 8 = 2^7 + 2^6 + 2^3. The biggest exponent, 7, must be chosen as the exponent of the floating point value. Therefore, 200 = 2^7 * (1 + 2^-1 + 2^-4) = 1.1001b * 2^7 = 1.5625 * 2^7. If I remove the leading "1" from the coefficient, the rest is ".1001". The floating point exponent is 7 + 127 = 134 = 1000 0110b. Putting it all together, the final result is: 0|100 0011 0|100 1000 0000 0000 0000 0000 = 43 48 00 00 (hex).
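This result is easy to verify in code. A minimal C sketch (my own) that reinterprets the bits of 200.0f:

#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <inttypes.h>

int main(void) {
    float f = 200.0f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);    /* reinterpret the four bytes */
    printf("%08" PRIX32 "\n", bits);   /* expected output: 43480000 */
    return 0;
}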

Now, using this information, I can also express numbers with decimals:

Example: 0.25 = 1 * 2^-2. The exponent is -2 + 127 = 125: 0|011 1110 1|000 0000 0000 0000 0000 0000

Let's recall repeating decimals. When we try to express divisions as decimal numbers instead of fractions, and the dividend cannot be divided by the divisor without a remainder, the decimals start to repeat somewhere after the point (not necessarily right after it). For example: 7 / 90 = 0.0707... Behind the computer representation of any floating point number there are continuous divisions, expressing the coefficient as a sum of binary fractions. So, can such repeating digits occur in floating point numbers? Let's take 0.1 as an example. Its single precision floating point representation is: 0|011 1101 1|100 1100 1100 1100 1100 1101. The exponent is 123 - 127 = -4. If we expand the digits of the coefficient, multiplying each by its positional value, the sum looks like this:

2^-1 + 2^-4 + 2^-5 + 2^-8 + 2^-9 + 2^-12 + 2^-13 + 2^-16 + 2^-17 + 2^-20 + 2^-21 + 2^-23
If the lowest term, 2^-23 (the last one in the sum), were missing, the total would be 0.599 999 904 632 568 359 375. In other words, the value 0.6 exactly (actually 1.6, with the implicit leading one) can never be obtained. Adding 1 to this sum and multiplying the result by 2^-4 yields 0.100 000 001 490 116 119 384 765 625. But when Hexpert, Excel, Matlab or Octave shows only the first 6-7 digits of this number, we see 0.1 (ignoring, for now, that Matlab and Octave handle the number in double precision). This small difference accumulates with each "-0.1" subtraction, and the accumulated error becomes really significant when the result is in the neighborhood of zero.
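The stored value can be made visible by asking printf for more digits than single precision actually carries. A one-line sketch (my own):

#include <stdio.h>

int main(void) {
    /* %.30f requests more digits than a float holds, exposing the exact stored value */
    printf("%.30f\n", 0.1f);   /* 0.100000001490116119384765625000 */
    return 0;
}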

Please note that this should not be seen as a deficiency of the binary number system. If computers used the decimal number system and humans used the duodecimal (base 12) system, the same situation would arise again. It is caused by the translation itself, regardless of the number system.

Non-zero digits appear from the ninth decimal place onward in the number above. But when I subtract that number from 1 ten times, the actual operation is 1 - 1.000 000 014 901 161 193 847 656 25, so the error accumulates and spills into the eighth decimal place. I previously wrote that when the same process is repeated with single precision numbers in Octave, the magnitude of the error is 10^-8.

>> X=single(0.1)
X =  0.10000
>> 1-X-X-X-X-X-X-X-X-X-X
ans =  -7.4506e-008
>>

The difference that the last bit of the coefficient makes, set versus clear, is the smallest difference between two floating point values that the computer can represent. When this difference is taken between 1 and the first floating point value greater than one, it is called the machine epsilon in the scientific literature. In Matlab, the eps function returns this value:

>> eps
ans =   2.2204e-016
>> eps('single')
ans =   1.1921e-007
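The same numbers can be reproduced in C, either from the float.h constants or by measuring the distance from 1 to its successor. A minimal sketch (my own):

#include <stdio.h>
#include <float.h>
#include <math.h>

int main(void) {
    /* the epsilons as defined by the C standard library */
    printf("%e %e\n", FLT_EPSILON, DBL_EPSILON);

    /* the same values, computed as "next value after 1" minus 1 */
    printf("%e\n", nextafterf(1.0f, 2.0f) - 1.0f);
    printf("%e\n", nextafter(1.0, 2.0) - 1.0);
    return 0;
}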

These values are comparable with the examples on Wikipedia:
Single-precision floating-point format
0|011 1111 1|000 0000 0000 0000 0000 0001 (binary) = 3f80 0001 (hex) = 1 + 2^-23 ≈ 1.0000001192
(smallest number larger than one)

0|011 1111 1111|0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0001 (binary) ≙ 1 + 2^-52 ≈ 1.0000000000000002, the smallest number > 1

Okay, why was there no error in the subtractions with 0.0625? Because 0.0625 is exactly 0.0001 in binary (2^-4), the number can be represented in the CPU (or the FPU, to be more precise) exactly as it is, so no error occurs.
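To close the loop on the Excel experiment, a minimal C sketch (my own; I assume sixteen subtractions of 2^-4, which should land exactly on zero):

#include <stdio.h>

int main(void) {
    float x = 1.0f;
    for (int i = 0; i < 16; i++)
        x -= 0.0625f;    /* 0.0625 = 2^-4, exactly representable */
    printf("%e\n", x);   /* prints 0.000000e+00, no residue */
    return 0;
}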

In the second part of the article, I will detail these errors and give further examples.