Fractals And Intel x86_64 Assembler

fractasmSome time ago I wrote this article about using integer operations to calculate fractal images. Currently, I’m giving a course which prepares for malware analysis. Among other things we deal a lot with Intel assembler and how compilers create and optimize code.

The full code of everything discussed below (and also of the article referred to above) is found on Github at github.com/rahra/intfract.

One of the exercises was to write the integer version of the iteration function manually in assembler. Here is the original version in C:

int iterate(nint_t real0, nint_t imag0)
{
   nint_t realq, imagq, real, imag;
   int i;

   real = real0;
   imag = imag0;
   for (i = 0; i < MAXITERATE; i++)
   {
      realq = (real * real) >> NORM_BITS;
      imagq = (imag * imag) >> NORM_BITS;

      if ((realq + imagq) > (nint_t) 4 * NORM_FACT)
         break;

      imag = ((real * imag) >> (NORM_BITS - 1)) + imag0;
      real = realq - imagq + real0;
   }
   return i;
}

NORM_BITS, NORM_FACT, and MAXITERATE are preprozessor macros defining integer constants, and nint_t is typedef’ed to long.

Benchmark Setup

I did some benchmarks experimenting with several variants that the code could be written in assembler. For the benchmark I used gettimeofday(2) directly before and after calculating the image. The iteration loop is set to MAXITERATE = 50 000 and I ran every attempt 5 times and then calculated the average time value. At an image resolution of 640×400 the iteration loop would be executed 12.8×109 (12.8 billion) times in the worst case. At the given coordinates and zoom level the iteration loop is actually executed 2.8×109 times.

I ran the programs directly on the console without X11 to avoid interference with other tasks and to reduce the influence of multi-tasking as much as possible.

As a time reference I used the original C program compiled with gcc option -O2. I consider this time as 100%1 I used gcc version 4.9.2 on a 64 bit Intel Core 2 Duo 1.2Ghz (family 6, model 15, stepping 13, ucode 0x3a). I observed that the iteration loop is compiled in the same way, independently if optimized with -O2 or -O3.

Comparison

The benchmark showed that there are several different optimizations which influence the performance, some of which are not really obvious. The following listing shows the solution which performed most. It needs about 91,7% of the time compared to the compiler’s optimized code. This is a huge improvement.

   .section .text
   .align 16
   .global iterate
iterate:
   mov   $(4 * NORM_FACT),%rdx
   mov   %rdi,%r8             // real = real0
   mov   %rsi,%r9             // imag = imag0

   mov   $MAXITERATE,%ecx     // i = 64
   jmp   .Litloop
   .align 16
.Litloop:
   mov   %r8,%r10
   imul  %r10,%r10            // realq = real * real
   sar   $NORM_BITS,%r10      // realq >>= 13

   mov   %r9,%r11
   imul  %r11,%r11            // imagq = imag * imag
   sar   $NORM_BITS,%r11      // imagq >>= 13

   lea   (%r10,%r11),%rax     // realq + imagq
   cmp   %rdx,%rax            // > 4 * NORM_FACT ?
   jg    .Litbrk

   imul  %r8,%r9              // imag *= real
   sar   $(NORM_BITS - 1),%r9 // imag >>= NORM_BITS - 1
   add   %rsi,%r9             // imag += imag0

   sub   %r11,%r10            // realq - imagq
   lea   (%rdi,%r10),%r8
   
   dec   %ecx                 // i--
   jne   .Litloop

.Litbrk:
   mov   $MAXITERATE,%eax     // 64 - i
   sub   %ecx,%eax
   ret

Important Optimizations

The inner loop is between line numbers 12 and 33. The first important improvement is that the loop, i.e. the target of the branch in line #33 shall be aligned to 16 bytes. This is according to the »Intel 32/64 Optimization Reference Manual« and this is also done by the compilier’s optimizer. The benchmark showed that if it is not properly aligned that the runtime is at 93,9% which is more than 2% slower.

Line #32 and #33 shows the loop counter decrement and the conditional branch. Although the Intel core provides a LOOP instruction the benchmark shows that using LOOP is significantly slower than doing a DEC/JNE combination, independently if the target is properly aligned or not. The execution time with LOOP is at about 96,6%. If we use SUB $1,%ecx it slightly slower, about 0,2%.

Another trick which gains performance is to use the LEA instruction instead of a MOV/ADD combination. LEA actually is a three-operand instruction. Line #21 is equivalent to

   mov   %r10,%rax
   add   %r11,%rax

Conclusion

There are many reasons to not write any code in assembler today. But although modern code optimizers are highly efficient, manual written code with thoroughly chosen instructions may still perform better.

  1. The absolute time on my computer was slightly more than 30 seconds for one run.

Comments

Fractals And Intel x86_64 Assembler — 1 Comment

  1. its rather question than comment. so i have a windows machine that has its ass compromised more than a reputation of bill clinton back in a day. so let us begin… i got wireshark up that is listening on an interface which is responsible for in and out traffic . i used chrome and my firewall settings are literally in and out down but let specific apps to use internet. so in this case only chrome was used as the only allowed app that can bypass firewall. so all in all. my machine got hooked up somehow with an attacker and it started sending and receiving a bunch of tcp-retransmission, tcp-out-of order and malformed packets. eventually an attacker managed to control visual stream from desktop etc. so chrome ports where used somehow to control my machine. it would be really awesome to find out how it was done. if it is actually chrome exploit or windows somehow get fked up its ass using chrome ports. well i assume it was done with a help of chrome becuz it the only app with established connections and where was no indication of an additional app sending or receiving data. basically wireshark+tcp view were used monitor traffic. ones the attacker got disconnected and that attack ended tcp retransimission and out-of order packet stream ended for good. no issues whith timeouts were observed either that were common when the actual attack was ongoing.

    thanks if anyone knows anything valuable and can share some info on how it is performed and what windows or chrome parts are targeted.