Sunday, March 10, 2019

Streaming SIMD Extensions (SSE) and glibc


Hi there. This article will be like the "behind the scenes" of previous floating point article. I had a strange problem while writing the article on floating point numbers. First, I realized that gcc compiles floating point operations with SSE instruction set by default when I checked assembly output, while compiling the code on Oracle Blog. (Platforms: Linux Mint 17.3, Centos 6.x ve 7.x)

...
40054d:  f2 0f 10 45 e8        movsd xmm0,QWORD PTR [rbp-0x18]
400552:  bf 18 06 40 00        mov edi,0x400618
400557:  b8 01 00 00 00        mov eax,0x1
40055c:  e8 af fe ff ff        call 400410 <printf@plt>
400561:  f2 0f 10 45 f8        movsd xmm0,QWORD PTR [rbp-0x8]
400566:  f2 0f 10 0d b2 00 00  movsd xmm1,QWORD PTR [rip+0xb2]
                               # 400620 <_IO_stdin_used+0x10>
40056d:  00
40056e:  f2 0f 5e c1           divsd xmm0,xmm1

The above code snippet contains printf and division commands.

When I first tried "-mno-sse" compiler parameter, to force this code to run in FPU, gcc generated the following code:

...
40054d:  bf 08 06 40 00      mov    edi,0x400608
400552:  b8 00 00 00 00      mov    eax,0x0
400557:  e8 b4 fe ff ff      call    400410 <printf@plt>
40055c:  dd 45 f8            fld QWORD PTR [rbp-0x8]
40055f:  dd 05 ab 00 00 00   fld QWORD PTR [rip+0xab]
                             # 400610 <_IO_stdin_used+0x10>
400565:  de f9               fdivrp st(1),st

Interestingly, the assembly output of the same code (with different compiler parameters of course) does not print anything out except 4.940656e-324 as its output. By the way, to make the output shorter, I changed double d = 1.0; statement with double d = pow(2.0, -1000);. The code started to produce 75 lines of output instead of 1075 lines.

The number of output lines does not change regardless with or without -mno-sse. And the program terminates. Which leads that the problem is not in the calculation but it is printed incorrectly on the screen. When two code snippets are compared, the result is written to xmm0 in the first one. edi holds the pointer of "%e\n" and eax has the number of extra arguments of printf. When I checked the assembly output of second code, the floating point value is not written to xmm0 because of -mno-sse parameter. edi has the same value and since no other value is available eax has zero (no extra arguments). The C equivalent of this code is actually  and since there are no extra arguments, a garbage value from a previous calculation which remained in xmm0 is written to the screen. By the way, when I check the result of division with gdb, divsd instruction works with correct values. That also proves that the error is in printf function.

I copied the code snippets above from Linux Mint (glibc v2.19) but there was no difference in CentOS either. I made the experiments in CentOS virtual machines. I think, there is no such problem with 32-bit code since xmm registers are 64-bit (they are not used at all, maybe??). But SSE instructions can be used in 32-bit codes as well: https://stackoverflow.com/questions/24386541/gcc-wont-use-sse-in-32bit

CentOS 6.x is using glibc v2.12. When I tried with CentOS 5.6 and 5.9 (glibc v2.5), same thing happened again. CentOS v3.3 is the first 64-bit CentOS ever and uses glibc v2.3.2. When I tried to compile same code in CentOS 3.3, -mno-see that no effect at all, supposedly due to a bug in gcc version (v3.2.3). In the assembly output, the result was stored in xmm0 prior to printf.

It looks like SSE instructions were used in the printf function during the development of glibc and floating point numbers must enter to printf in xmm0 only. Therefore, -mno-sse parameter should never be used because it is even in glibc. Instead of that, using -mfpmath=387 parameter to force the floating point operations executed by the FPU will do the job perfectly.
 

On the other hand, I also mentioned that gcc has no -fns parameter in contrary with Oracle Blog article. There is a -ffast-math parameter in gcc which does the same thing (or maybe even more). This parameter consists of six different flags: -fno-math-errno, -funsafe-math-optimizations, -ffinite-math-only, -fno-rounding-math, -fno-signaling-nans and -fcx-limited-range (ref: man gcc). -funsafe-math-optimizations parameter primarily works on subnormal numbers and the code compiled with this parameter contains following function:

0000000000400440 <set_fast_math>:
  400440:  0f ae 5c 24 fc        stmxcsr DWORD PTR [rsp-0x4]
  400445:  81 4c 24 fc 40 80 00  or DWORD PTR [rsp-0x4],0x8040
  40044c:  00
  40044d:  0f ae 54 24 fc        ldmxcsr DWORD PTR [rsp-0x4]
  400452:  c3                    ret

MXCSR is a register* similar to FPU Control Word (FCW) and it is used to control the execution of SSE instructions. The 15th and 6th bits of this register, namely "Flush to Zero (FZ)" and "Denormals are Zero (DAZ)" respectively, are set using the function above. When the code is compiled with this parameter and run, the last line of the output of that oracle code is 2.225074e-308, the smallest normalized double precision value. Since MXCSR is an SSE register, -ffast-math parameter has no effect on the execution of 80387 instructions.

80387 FPU does not have a similar mechanism for subnormal numbers. In previous article, I mentioned that FCW has a DE bit for subnormals. Using this, operations on subnormal values can be controlled. I wrote following function to access to FCW:

void setfst()   {
    short FPUCW;
    asm("fstcw %0": "=m" (FPUCW));
    FPUCW = FPUCW & ~0x02;
    asm("fldcw %0":: "m" (FPUCW));
}

Or another option:

void setfst2()   {
    short FPUCW;
    asm("fstcw %0\n\t"
        "and  %1,0xFFFD\n\t"
        "fstcw %1"
        : "=m" (FPUCW): "m" (FPUCW));
}

This function must be called just before the while loop and the code must be compiled with "-mfpmath=387 -masm=intel" parameters. When the code runs, it writes 2.225074e-308 and ends with "Floating point exception". This means, manually resetting DE flag causes exceptions when the value of an instruction is subnormal. If an exception handler is written for this and if the result of a subnormal operation (which causes an exception) is rounded to zero in this handler; the function of DAZ flag can be adapted to FPU in that way.