Hi there. In the previous article, I wrote about how floating-point numbers are stored in the computer and mentioned that the constraints of computer architecture cause some interesting behaviors. In this article I will give further examples of these behaviors.
The examples in the previous article mostly used single precision numbers. This does not mean that double precision numbers are free of such problems. Since the coefficient is represented by more bits, the error in the calculation from the previous article (the repeated "1 - 0.1 - 0.1 - ..." operation) appears at the seventeenth decimal place, whereas with single precision numbers it appeared at the ninth. The result in Excel (the first example in the previous article) had an error at the sixteenth decimal place: the representation error of "0.1" is at the seventeenth decimal, and since we subtracted ten times, the error is also multiplied by ten. I chose single precision examples for one reason only: they make the calculations and their errors easy to show.
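For anyone who wants to reproduce this, here is a minimal sketch of that experiment in C (my own addition, assuming IEEE 754 single and double precision, which is what practically all current hardware uses):

#include <stdio.h>

int main() {
    double d = 1.0;
    float f = 1.0f;
    int i;

    /* Subtract 0.1 ten times; mathematically the result is exactly 0. */
    for (i = 0; i < 10; i++) {
        d -= 0.1;
        f -= 0.1f;
    }

    /* Instead of 0, the double result is off around the seventeenth
       decimal place and the float result around the ninth. */
    printf("double: %e\n", d);
    printf("float : %e\n", f);
    return 0;
}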
These errors do not only occur in operations whose results are close to zero. Since the coefficient bits are limited, there are errors in the representation of big numbers as well. The fact that the coefficient field has only 23 bits also causes problems with numbers whose binary expansion needs more than 23 fraction bits. Let's do an experiment in Hexpert. I opened a file, moved to offset 00h and entered 16 777 215 into the FP32 box. Then I moved to offset 04h and entered 16 777 216 into the FP32 box, and so on. I entered all numbers up to 16 777 220.
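The same experiment can be reproduced without a hex editor; a small C sketch of it (my own addition, assuming an IEEE 754 machine):

#include <stdio.h>

int main() {
    /* 2^24 = 16 777 216. Every integer up to this value has an exact
       float representation; above it the 23-bit coefficient runs out
       of digits and only every second integer is representable. */
    long i;

    for (i = 16777215; i <= 16777220; i++)
        printf("%ld -> %.1f\n", i, (float)i);
    return 0;
}

Above 2^24 the conversion silently rounds: 16 777 217 comes back as 16 777 216, and 16 777 219 comes back as 16 777 220.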
In short, since the bit fields in floating point numbers are finite, numbers must be truncated at some point. Examples of these errors are numerous. Irrational numbers like π or e cannot be expressed as a sum of finitely many fractions (in other words, cannot be written with finitely many digits) and therefore can never be fully expressed with floating point numbers either. More generally, it is impossible, at least in theory, to represent an arbitrary number from the set of real numbers, which is an uncountably infinite set, using a finite number of bits. In practice we can still work with these numbers, for one reason only: in engineering it is fine when the final result is "sufficiently" accurate after rounding. For example, 4218.14 kg can be rounded to 4200 kg or even 4000 kg. These kinds of errors, caused by truncated bits, are called truncation or round-off errors. Although the two are not the same thing in numerical analysis, they can be used interchangeably here because the CPU's behavior is to round these numbers off. ("Occasionally, round-off error (the consequence of using finite precision floating point numbers on computers) is also called truncation error, especially if the number is rounded by truncation." -- ref: Truncation error (accessed on 05.09.2020))
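The π case is easy to observe; a one-line sketch (M_PI comes from math.h):

#include <stdio.h>
#include <math.h>

int main() {
    /* The first digits of pi are 3.14159265358979323846...
       The nearest double diverges after roughly 16 significant
       digits; this typically prints 3.14159265358979311600. */
    printf("%.20f\n", M_PI);
    return 0;
}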
Let's consider an example of comparing two floating point numbers when they are very close to zero*. We know that sin(30°) = sin(π/6) = 0.5, thus sin(π/6) - 0.5 = 0. The statement printf("%e\n", sin(M_PI / 6.0)); prints "5.000000e-01". As I wrote previously, pi itself actually contains an approximation. There must also be an approximation in the sine function because of the truncation of the Maclaurin series (well, the FPU mainly uses CORDIC, but I am trying to keep things simple). As a result, it can be guessed that the result "5.000000e-01" is not exactly 2^-1. And as expected, the statement printf("%e\n", sin(M_PI / 6.0) - 0.5); prints -5.551115e-17. When comparing the result, we should check whether the difference is "close enough" to zero, not equal to zero. For example, I could check whether the result lies in a sufficiently small neighborhood of zero, such as the machine epsilon, i.e.
if(fabs(sin(M_PI / 6.0) - 0.5) < DBL_EPSILON)
is a better comparison. I wrote above that the machine epsilon is only meaningful in the neighborhood of one. Therefore this approach is not accurate enough either. Depending on the sensitivity of the operation, the machine epsilon could be too large. The machine epsilon of double precision numbers is about 10^-16, whereas the smallest possible double precision number is 2.225 * 10^-308 (this number is actually 4.94 * 10^-324; I will explain this later, in the paragraph marked with double asterisks **). Choosing the correct "small enough" value is a difficult problem, as this value differs along the number line. For example, 3 is a "small enough" number between 2^24 and 2^25, but it is not "small enough" between 2^25 and 2^26.
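One common workaround is to scale the tolerance with the magnitude of the operands instead of using a fixed epsilon. Here is a minimal sketch of that idea (the nearly_equal helper and its safety factor are my own illustration, not a standard function; compile with -lm):

#include <stdio.h>
#include <math.h>
#include <float.h>

/* Illustrative helper: a and b count as equal when their difference is
   small relative to the larger of their magnitudes. A fixed DBL_EPSILON
   works only near 1.0; scaling adapts the tolerance to the operands. */
int nearly_equal(double a, double b) {
    double diff = fabs(a - b);
    double largest = fmax(fabs(a), fabs(b));
    return diff <= largest * 4 * DBL_EPSILON; /* 4 is an arbitrary safety factor */
}

int main() {
    printf("%d\n", nearly_equal(sin(M_PI / 6.0), 0.5)); /* prints 1 */
    return 0;
}

Note that this scheme has a weak spot of its own: when both operands are themselves very close to zero, the scaled tolerance shrinks toward zero as well, so an additional absolute tolerance is often combined with it.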
If the exponent is 00h or 0FFh, it carries a special meaning. If the exponent is 0FFh and the coefficient is zero, the value is positive or negative infinity, according to the sign bit. These infinities can come out of dividing a very large dividend by a small divisor, when the quotient no longer fits into the representable range. If the exponent is 0FFh but the coefficient is non-zero, the value is called NaN, "Not a Number". This value comes out of any indeterminate form in mathematics, or when the operation the FPU is asked to perform is undefined (like the square root of a negative number). Operations with NaNs will also result in NaNs.
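All of these special values can be produced in a few lines; a short sketch, assuming IEEE 754 doubles (compile with -lm):

#include <stdio.h>
#include <math.h>

int main() {
    double huge = 1e308;

    /* Overflow: the result no longer fits into the exponent field. */
    printf("%e\n", huge * 10.0);          /* inf */
    /* An undefined operation and an indeterminate form both yield NaN. */
    printf("%e\n", sqrt(-1.0));           /* nan */
    printf("%e\n", INFINITY - INFINITY);  /* nan */
    /* NaN propagates through subsequent operations. */
    printf("%e\n", sqrt(-1.0) + 42.0);    /* nan */
    return 0;
}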
The second special case is a zero exponent. If the coefficient is zero as well (i.e. 00 00 00 00), the value is obviously zero. However, the zero-exponent case is more complicated than that. I previously mentioned that 1 is added to each coefficient during translation. If the coefficient were always greater than or equal to one, zero could never be obtained. Therefore a zero exponent is an exception, and the value 127 is not subtracted from the exponent in this case. Instead, the exponent's value is taken as -126 and no leading 1 is added to the coefficient, in other words d0 = 0 (please refer to the previous article for the definition of d0). Subtracting two numbers whose exponents are -126 can yield a difference of the order of 2^-127. For example:
1.5 * 2^-126 - 1.375 * 2^-126 = 0.125 * 2^-126 = 0.25 * 2^-127
To prevent an underflow, and to be able to represent differences like the one in the equation above, the FPU needs a class of "very small" numbers. Therefore numbers filling the gap between 0 and 2^-126 have been added to the floating point standard. These are called subnormal numbers (numbers below the normalized ones). When working with subnormal numbers, the FPU sets the DE flag. This means that there is an ongoing operation with DEnormal (subnormal) numbers and the least significant digits of the result might be lost. In other words, the FPU cannot guarantee the accuracy of subtractions and divisions on subnormal numbers. In the sentence above marked with double asterisks (**), I noted two different smallest possible values for double precision numbers. The larger one is the smallest of the normal numbers, and the other is the smallest of the subnormal numbers. (Subnormal numbers can be thought of as intended only for storing results, not as operands for further calculations.)
Example code:
I wrote the following code to demonstrate operations on subnormal numbers:
#include <stdio.h>
#include <stdint.h>

union uSayi {
    unsigned char bayt[8];
    uint64_t q;
    double s;
};

int main() {
    union uSayi d1, d2, d3;
    double t1, t2, t3;

    d1.q = 0x0018000000000000;
    d2.q = 0x0016000000000000;
    d3.q = 0x0014000000000000;
    //printf("%e %e %e\n", d1.s, d2.s, d3.s);

    // 1.5 * 2^-1022 - 1.25 * 2^-1022 = 0.25 * 2^-1022
    t1 = d1.s - d3.s;
    // 1.5 * 2^-1022 - 1.375 * 2^-1022 = 0.125 * 2^-1022
    t2 = d1.s - d2.s;
    // 0.25 * 2^-1022 - 0.125 * 2^-1022 = 0.125 * 2^-1022
    t3 = t1 - t2;
    return 0;
}
I defined a union so I could access the same variable both as a double and as a 64-bit unsigned integer. Then I defined three variables d1, d2 and d3 of type union uSayi, assigned them small values near the normalized floating point limit, and subtracted them from each other, storing the intermediate results in t1, t2 and t3. Since I will debug it in gdb, I compiled it as follows:
gcc -mfpmath=387 -g kod2.c
gdb -tui a.out
gcc handles floating point numbers with SSE instructions by default. I forced gcc to generate 80387 code because I want to watch the numbers in the FPU.
I loaded the program in gdb's TUI, and since I prefer Intel syntax in assembly, I gave the following two commands:
(gdb) set disassembly-flavor intel
(gdb) layout asm
In the above screenshot, the assignments to d1, d2 and d3 can be clearly seen starting at offset 0x400574. The first subtraction is at 0x4005a4, the second at 0x4005af and the last one at 0x4005b7, all using the fsub instruction. I put a breakpoint on the first subtraction with the break *main+71 command and ran the code with the run command. When execution stopped, I checked the FPU stack using info float. The numbers I entered manually are ready to be processed in the FPU:
I executed the subtraction using the nexti command, then checked the result in the FPU stack using info float. The next subtraction runs the same way, so I proceeded to the final subtraction with until *main+90. When I give the info float command at this step, I see that the FPU Status Word has changed from 0x3800 to 0x3802, which means the DE (DEnormalized) flag is set. I executed the last subtraction with a final nexti command. The result was +2.781342323134001729e-309. Then I let the program finish with the cont command. The documentation and Wikipedia state that instructions run slower with denormalized values than with normal values. Hence it is recommended to turn off operations on denormalized values in applications where speed matters more than the accuracy of the results.
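To get a feeling for this slowdown, here is a rough micro-benchmark sketch (entirely my own construction; the timings are machine dependent, and on CPUs without a subnormal penalty both loops may take the same time):

#include <stdio.h>
#include <time.h>

/* The recurrence acc = acc * 0.5 + x converges to 2 * x, so with
   x = 1.0 the loop works on normal numbers and with x = 1e-310 it
   works on subnormal ones. volatile keeps the loop from being
   optimized away. */
static double loop(double x) {
    volatile double acc = x;
    long i;
    for (i = 0; i < 100000000L; i++)
        acc = acc * 0.5 + x;
    return acc;
}

int main() {
    clock_t t0;

    t0 = clock();
    loop(1.0);      /* normal operands */
    printf("normal   : %.2f s\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

    t0 = clock();
    loop(1e-310);   /* subnormal operands */
    printf("subnormal: %.2f s\n", (double)(clock() - t0) / CLOCKS_PER_SEC);
    return 0;
}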
At https://blogs.oracle.com/d/subnormal-numbers, it is mentioned that the -fns compiler parameter causes subnormal results to be rounded to zero during execution; however, I could not find such a parameter in gcc.
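That said, on x86 a similar effect can be obtained by hand when SSE math is used (so it would not apply to the -mfpmath=387 build above): the MXCSR register has flush-to-zero and denormals-are-zero bits, reachable through the intrinsics headers. A sketch, assuming a reasonably recent gcc on x86:

#include <stdio.h>
#include <xmmintrin.h>  /* _MM_SET_FLUSH_ZERO_MODE */
#include <pmmintrin.h>  /* _MM_SET_DENORMALS_ZERO_MODE */

int main() {
    volatile double tiny = 1e-310;  /* a subnormal double */

    printf("before: %e\n", tiny * 0.5);   /* subnormal result */

    /* FTZ: subnormal results are replaced by zero.
       DAZ: subnormal operands are treated as zero. */
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);

    printf("after : %e\n", tiny * 0.5);   /* 0.000000e+00 */
    return 0;
}

As far as I know, gcc's -ffast-math option also sets these two bits at program startup, which seems to be the closest thing to the -fns parameter mentioned above.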