Saturday, April 21, 2018

GCC Compiler Optimizations



Hi there. I previously wrote that I will discuss about a low level topic in this post. First, I want to refer two blogs/vlogs, I am following. First one belongs to a guy who wrote his own operating system in C++. Its address is http://www.wyoos.org. He explains each coding steps in his videos. It took quite much time to follow this and analyze his code. Second one is Ben Eater, who is designing his own 8-bit computer and explaining the implementation of this on breadboards on his Youtube channel. I wanted to follow this vlog step by step, too but could not find any time because I also have followed last month RH442 Performance Tuning training and barely could find time for this post.

Ben Eater has wrote a simple Fibonacci code in C and compares the code with its assembly output and translates into his own machines code. Here is its video. In the next video, he programs his machine and runs. As I said, the code is pretty simple:

#include<stdio.h>

int main(void)  {
    int x,y,z;

    while(1)        {
        x = 0;
        y = 1;
        do      {
            printf("%d\n", x);

            z = x + y;
            x = y;
            y = z;
        } while (x < 255);
    }
}

I thought, compiling this code in different gcc optimization levels and comparing them would be interesting.

What are Optimization Levels?
Compilers mostly do some arrangements which do not exist in the source code in order to utilize the underlying CPU better. This arrangements does not impact the output of the code obviously but there may be different intermediate results if debugged. Examples for these arrangements are loop unrolling, instruction scheduling (swapping or moving the instructions further which are using different units of CPU (i.e. ALU, memory fetcher, FPU etc)), aligning/remapping of memory addresses for better cache memory utilization. Whole list can be found in Optimizing Compiler article.

No matter how good and optimized the programmer can code, it's nearly impossible to master all these CPU optimizations. From a C programmers perspective, it does not matter whether a push instruction is above or below a subtraction, in favor of code optimization. And from an assembly programmers perspective, it is really a hard work to optimize thousands lines of code. It requires to think and understand the relationship between each code line w.r.t. functional CPU units. If we consider that Intel Optimization Reference Manual has 672 pages, making assembly code level optimizations requires almost memorizing this book.

Even, in the beginning of 2000s, we had discussed this topic with a colleague. He had told me that we wrote the same code in Win32 assembly and also in Delphi and saw that Delphi code has significantly better performance. There is also another aspect with Intel. The instructions and CPU features, which Intel did not documented, could only be utilized with Intel compilers (and same is with AMD).

There are basically four levels of optimization in GCC:
* O0: Optimizations disabled. The code is compiled into assembly line by line. It is also used for debugging purposes.
* O1: Basic optimizations like branch prediction and stack optimizations.
* O2: More aggressive than O1. This is the recommended level of optimization which contains thread level optimizations, function call optimizations and optimizations in string operations.
* O3: This level is more aggressive than O2 but not always recommended. Loop unrolls may increase the size of assembly output or floating point numbers may lose their precision because of the FPU/SSE optimizations.

There are another options like -Os, -Ofast which I am not going into detail. Please review following sources about these parameters:
https://wiki.gentoo.org/wiki/GCC_optimization
https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html

As it is stated in second source, these parameters above are packages which contain different singular optimizations. For example, O2 optimization contains lre-remat optimization. Thus applying -O2 -fno-lre-remat parameters together means all O2 optimizations but without lre-remat.

Without extending this article too much, let's review assembly output of the code above. I have used gcc for compilation and objdump for disassembly. Objdump generated AT&T syntax by default. I used to know Intel syntax therefore I used -M parameter with objdump.

gcc -O0 -o fib fib.c
objdump -M intel -d fib

I will be interested only with the main() function.

  4004c4:  55                   push rbp
  4004c5:  48 89 e5             mov  rbp,rsp
  4004c8:  48 83 ec 10          sub  rsp,0x10
  4004cc:  c7 45 f4 00 00 00 00 mov  DWORD PTR [rbp-0xc],0x0
  4004d3:  c7 45 f8 01 00 00 00 mov  DWORD PTR [rbp-0x8],0x1
  4004da:  b8 18 06 40 00       mov  eax,0x400618
  4004df:  8b 55 f4             mov  edx,DWORD PTR [rbp-0xc]
  4004e2:  89 d6                mov  esi,edx
  4004e4:  48 89 c7             mov  rdi,rax
  4004e7:  b8 00 00 00 00       mov  eax,0x0
  4004ec:  e8 c7 fe ff ff       call 4003b8 <printf@plt>
  4004f1:  8b 45 f8             mov  eax,DWORD PTR [rbp-0x8]
  4004f4:  8b 55 f4             mov  edx,DWORD PTR [rbp-0xc]
  4004f7:  8d 04 02             lea  eax,[rdx+rax*1]
  4004fa:  89 45 fc             mov  DWORD PTR [rbp-0x4],eax
  4004fd:  8b 45 f8             mov  eax,DWORD PTR [rbp-0x8]
  400500:  89 45 f4             mov  DWORD PTR [rbp-0xc],eax
  400503:  8b 45 fc             mov  eax,DWORD PTR [rbp-0x4]
  400506:  89 45 f8             mov  DWORD PTR [rbp-0x8],eax
  400509:  81 7d f4 fe 00 00 00 cmp  DWORD PTR [rbp-0xc],0xfe
  400510:  7e c8                jle  4004da <main+0x16>
  400512:  eb b8                jmp  4004cc <main+0x8>

First three instructions are related with the stack frame arrangements at the entrance of main() function. They are internally added by C compiler and don't have any corresponding C code.

The instruction at the offset 40 04CCh writes a 0 to the address rbp-0Ch. This operation corresponds to x=0 line in C code. Thus we found that variable x is saved in address rbp-0Ch.
Next instruction writes a '1' to rbp-08h address and corresponds to y=1 line in C code. So, variable y is saved in address rbp-08h.
The instruction in 40 04DAh belongs to printf() and it writes the value 40 0618h to eax. If it is considered that the program entry resides at 40 0000h address, this value in eax should be most probably a pointer to some address. This address is not contained in the objdump output because the address is in the data segment of the code. If I check the file content with hexdump -c fib command, there are following values in 0618h offset:

00000610  00 00 00 00 00 00 00 00  25 64 0a 00 01 1b 03 3b  |........%d.....;|

The bytes 25h 64h 0Ah correspond to the format string of printf "%d\n" and eax is the pointer to this address.
In next instruction, x variable is assigned to edx. Then the value in edx is assigned to esi, the value in rdx is assigned to rdi and eax is zeroed. The reason for the last instruction is a bit complicated and actually a thing with glibc. In short, if floating point numbers were used, this value would be non-zero. Here is a detailed information: https://stackoverflow.com/questions/6212665/why-is-eax-zeroed-before-a-call-to-printf

printf call is in 40 04ECh offset.
In 40 04F1h, y variable is read from rbp-08h to eax. In next instruction x variable is read to edx and then the sum of rdx and rax is written to eax again using lea command. The right hand side of the z=x+y expression is done by lea command. In offset 40 04FAh, this sum in eax is stored in rbp-04h which is z variable. The variables are stored in memory in reverse order in which they are defined in the code.
With the instruction in 40 04FDh address, y variable is read to eax and with the next instruction, the value in eax is written to the address of x variable. This is x=y assignment. Similarly, next two instructions do the y=z assignment.
In 40 0509h, x value is compared with 254. x is an integer and the comparison in C code is x < 255, so x can be at most 254 in this comparison. The jle (jump if less or equal) instruction in 40 0510h is true when x equals at most 254. In other words 254 < 255 is true but 255 < 255 is false. If I rebuild this with '<=', same inequality is true when 254 <= 254 and false when 255 <= 254.
Since the offset of main() is 40 04C4h; if x is less than 254 then the jump will be to 40 04DAh, in other terms to the line beginning with do. With this command, the do loop ends. If x is greater than 254, then the unconditional jump will be to the address 40 04CCh. Because of this address is the beginning of the code, this jmp command corresponds to while(1).

Before making any compiler optimizations, I redefined the x variable as a register.

int y, z;
register x;

I recompiled the code with same parameters and generated assembly output:

  4004c4:  55                   push rbp
  4004c5:  48 89 e5             mov  rbp,rsp
  4004c8:  53                   push rbx
  4004c9:  48 83 ec 18          sub  rsp,0x18
  4004cd:  bb 00 00 00 00       mov  ebx,0x0
  4004d2:  c7 45 e8 01 00 00 00 mov  DWORD PTR [rbp-0x18],0x1
  4004d9:  b8 08 06 40 00       mov  eax,0x400608
  4004de:  89 de                mov  esi,ebx
  4004e0:  48 89 c7             mov  rdi,rax
  4004e3:  b8 00 00 00 00       mov  eax,0x0
  4004e8:  e8 cb fe ff ff       call 4003b8 <printf@plt>
  4004ed:  89 d8                mov  eax,ebx
  4004ef:  03 45 e8             add  eax,DWORD PTR [rbp-0x18]
  4004f2:  89 45 ec             mov  DWORD PTR [rbp-0x14],eax
  4004f5:  8b 5d e8             mov  ebx,DWORD PTR [rbp-0x18]
  4004f8:  8b 45 ec             mov  eax,DWORD PTR [rbp-0x14]
  4004fb:  89 45 e8             mov  DWORD PTR [rbp-0x18],eax
  4004fe:  81 fb fe 00 00 00    cmp  ebx,0xfe
  400504:  7e d3                jle  4004d9 <main+0x15>
  400506:  eb c5                jmp  4004cd <main+0x9>

With this rearrangement, the code got 12 bytes shorter because there is no pointer to x anymore and memory access got reduced. In the address 40 04C8h, rbx is pushed to stack. Unlike before, it is needed to subtract 18h instead of 10h from rsp, because rbx is 64-bit. If the program would terminate, we would see a pop corresponding to this push. 'x=0' assignment corresponds to 40 04CDh. In next instruction, value '1' is assigned to y variable, which resides in rbp-18h. esi register should have the value to be printed for printf command, so it is done with the instruction 40 04DEh. The rest of the code is the same with previous printf block.
The two instructions beginning with 40 04EDh makes up 'x+y' operation and the operation next to them writes the sum to z variable at rbp-14h.
40 04F5h contains 'x=y' assignment and it is now only one instruction since x is defined as register.
The instruction at 40 04F8h and the next one is 'y=z' and the rest is same as the previous example.

Defining a variable as register is not a real compiler optimization. For a robust comparison, I defined x as an int again before turning O1 optimization on. The compiler itself will be keeping x in the register already, as I turned O1 optimization. I recompiled the code with gcc -O1 -o fib fib.c and disassembled it with objdump.

  4004c4:  41 55             push r13
  4004c6:  41 54             push r12
  4004c8:  55                push rbp
  4004c9:  53                push rbx
  4004ca:  48 83 ec 08       sub  rsp,0x8
  4004ce:  bb 01 00 00 00    mov  ebx,0x1
  4004d3:  bd 00 00 00 00    mov  ebp,0x0
  4004d8:  41 bc 01 00 00 00 mov  r12d,0x1
  4004de:  41 bd 00 00 00 00 mov  r13d,0x0
  4004e4:  eb 06             jmp  4004ec <main+0x28>
  4004e6:  44 89 e3          mov  ebx,r12d
  4004e9:  44 89 ed          mov  ebp,r13d
  4004ec:  89 ee             mov  esi,ebp
  4004ee:  bf 08 06 40 00    mov  edi,0x400608
  4004f3:  b8 00 00 00 00    mov  eax,0x0
  4004f8:  e8 bb fe ff ff    call 4003b8 <printf@plt>
  4004fd:  81 fb fe 00 00 00 cmp  ebx,0xfe
  400503:  7f e1             jg   4004e6 <main+0x22>
  400505:  8d 04 2b          lea  eax,[rbx+rbp*1]
  400508:  89 dd             mov  ebp,ebx
  40050a:  89 c3             mov  ebx,eax
  40050c:  eb de             jmp  4004ec <main+0x28>


This output is longer than before where I defined x as register. Scratch registers like r12 and r13 keeps the x and y values. Thus, the values are kept in CPU instead of memory. First four push commands are for keeping the original values of registers at the function exit but as the function does never exit, corresponding pop commands are missing at the end.
The sub instruction at 40 04CAh is for stack frame arrangement.
The two instructions at 40 04CEh and at 40 04D3h are corresponding to y=1 and x=0 respectively.
Other two instructions at 40 04D8h and 40 04DEh correspond to y=1 and x=0 again but these values kept in r12 and r13 never change in run-time. These are solely used for reading the values.
The jmp instruction at 40 04D8h address jumps to the first of four instructions between 40 04ECh and 40 04FCh which makes up the printf command.
At 40 04FDh, the comparison is made with ebx which holds the y variable. This is a bit confusing because the comparison in the original code was made with x. I am still not pretty sure about that but y could be used instead of x, because there is a 'x=y' assignment two lines before. The logic behind, should be like this. If y (which is actually x) is not greater than 254, the execution continues with next instruction.
At 40 0505h, where rbp+rbx sum is written to eax, is 'z=x+y' expression.
The commands at 40 0508h and 40 050Ah correspond to 'x=y' and 'y=z' expressions respectively. After these, a jump is made to the first instruction of the loop which is the part of printf command.
If ebx is greater than 254, in the comparison at 40 0503h, then initial value of y (is 1) in r12 is assigned to ebx and the initial value of x (is 0) in r13 is assigned to ebp and whole operations repeat.

It is noticeable that this code has no memory access compared to the assembly output compiled with -O0. Now, I have recompiled the code with -O2:

  4004d0:  55                push rbp
  4004d1:  31 ed             xor  ebp,ebp
  4004d3:  53                push rbx
  4004d4:  bb 01 00 00 00    mov  ebx,0x1
  4004d9:  48 83 ec 08       sub  rsp,0x8
  4004dd:  0f 1f 00          nop  DWORD PTR [rax]
  4004e0:  31 c0             xor  eax,eax
  4004e2:  89 ee             mov  esi,ebp
  4004e4:  bf 08 06 40 00    mov  edi,0x400608
  4004e9:  e8 ca fe ff ff    call 4003b8 <printf@plt>
  4004ee:  81 fb fe 00 00 00 cmp  ebx,0xfe
  4004f4:  7f 0a             jg   400500 <main+0x30>
  4004f6:  8d 04 2b          lea  eax,[rbx+rbp*1]
  4004f9:  89 dd             mov  ebp,ebx
  4004fb:  89 c3             mov  ebx,eax
  4004fd:  eb e1             jmp  4004e0 <main+0x10>
  4004ff:  90                nop
  400500:  bb 01 00 00 00    mov  ebx,0x1
  400505:  31 ed             xor  ebp,ebp
  400507:  eb d7             jmp  4004e0 <main+0x10>

First of all, this code is shorter than others. main() starts from 40 04D0h and ends before other two and second, there is no memory access, too. Lastly, in contrary of O0 and O1 optimizations instead of mov <register>,0 operation there is xor <register>,<register> operation.

At 40 04D1h, ebp is zeroed as 'x=0'.
At 40 04D4h, '1' is assigned to ebx, as 'y=1'.
I am not sure, why there are nops there but they might be for alignment purposes. (Source, p79 and p81. Or https://www.agner.org/optimize/optimizing_assembly.pdf)
Four instructions between 40 04E0h and 40 04EDh are printf.
At 40 04EEh, y is compared with 254 and if it is not greater then execution continues, same as previous example.
The lea instruction at 40 04F6h corresponds to z=x+y and following instructions correspond to x=y and y=z respectively.
jmp instruction does a jump to beginning of printf. From here, it is understood that the structure of the loop changes with compiler optimizations but the logic  remains same.
If the comparison at 40 04EEh holds true, then this means, the inner do..while loop has ended. Then the conditional jump at 40 04F4h is executed and execution continues from 40 0500h. There, the variables are rewritten to their corresponding registers (at the offset 40 0500h and 40 0505h) and then jumped to the beginning of code.

If this code is compiled with -O3, the compiler does not generate a code different than compiled with -O2. In other words, the optimization parameters* which -O3 contains but -O2 doesn't, have no impact on this code.

In the beginning of this article, I had mentioned that these optimization parameters are actually a set of optimizations which can be applied separately. For example, -O1 contains a parameter named -fomit-frame-pointer. In order not to contain this optimization, the code must be compiled with gcc -O1 -fno-omit-frame-pointer -o fib fib.c command. Then the start of main() function (and of all other functions) becomes:

  4004c4:  55            push   rbp
  4004c5:  48 89 e5      mov    rbp,rsp
  4004c8:  41 56         push   r14

The frame-pointer is rbp register. The frame pointer is pushed to stack at the function entrance and then rbp value is overwritten with rsp, so that the whole stack access, which contains function arguments, is made using rbp. If omit-frame-pointer parameter is given before compilation, push and mov combination is removed from all the functions. So, all the function arguments are accessed using rsp register because rbp value is not backed up in stack.