Posted on January 16, 2019

The previous post covers preliminary topics that include tool-chain, compiling assembly programs and turning them into executables, a bit about anatomy of Mach-O object files, and finally SysV ABI and system calls.

With that out of the way, we are ready to get serious and solve some real problems… like FizzBuzz. This time around we’ll be doing dynamic linking with C runtime, using library functions, accessing command line arguments, and, of course, fighting segfaults.

This is not a post about how to solve a FizzBuzz. The problems is used merely as a means to illustrate low-level aspects of linking, executable loading, lazy symbol binding, etc.

Table of contents:

  1. Problem statement
  2. Dynamic executable
  3. How to use library functions
    1. How to print
    2. (Aside) When is stack alignment really required?
    3. (Aside) How printf is actually linked to the executable?
    4. How to parse command line args
  4. Finally, a solution to FizzBuzz
  5. (Bonus) FizzBuzz in other languages
    1. Effectiveness of C
    2. Expressiveness of Haskell
    3. Whatever of Python
  6. Footnotes

Problem statement

Below is the original FizzBuzz problem statement:

For each number from 1 to 100: print ‘Fizz’ if a number is divisible by 3, print ‘Buzz’ if it is divisible by 5, and print ‘FizzBuzz’ if it’s divisible by both 3 and 5.

To make things a bit more interesting (kek) we will work with numbers from 1 to n where n is a command line parameter.

Breaking down this problem further, we can identify specific pieces of functionality needed to implement a solution:

  1. Access command line arguments to read n.
  2. Parse a string representation of n into a number.
  3. Print to stdout either a string or a number.
  4. Arithmetic, logic and jumps to implement the logic of the method.

We could implement everything in pure assembly using only syscalls, but it would get rather tedious very soon. Instead we’ll use familiar printing and parsing functions from C stdlib (which we will link to dynamically), and use assembly only for the main control flow.

Dynamic executable

Previously, we were building statically linked executables. Static linking has its benefits as the resulting binary is self-contained and there are no runtime overhead to method calls and variable references induced by the usage of the dynamic linker. However, the final image size is larger and since library code is not shared memory consumption is higher. Depending on the use case, either static or dynamic linking may be preferable.

There are no practical differences between static and dynamic linking for our case though. So, let’s take a look at the code of a dynamic no-op executable that just return an exit code equal to 42, and then move on to the fun part:

To compile and link it we can use the following commands:

There are several things to note here:

  1. _main is no longer a real entry point in our program. When the executable is loaded the control is passed to the dynamic linker, which initializes C runtime among other things, and only after that _main is called.
  2. Since _main is called in a regular way, all calling conventions apply to it, in particular its integer result (exit code) should be returned in rax.
  3. We don’t need to make an exit syscall. Instead we just return control to the code that called _main which in turn will do a proper termination.
  4. The executable needs to be linked with libc. Otherwise linking fails with an error.

Running this binary produces the expected results:


Now that we know how to build dynamic executables and link with libc let’s proceed to the next section.

How to use library functions

We’ve identified 2 pieces of functionality that we need from libc: printing and parsing. For no particular reason we’ll start with printing, and implement a hello world.

How to print

Let’s look at the code first, and then proceed with a commentary about highlighted parts:

Now to the highlights from the code:

  1. Using library functions is pretty straightforward – it only requires an EXTERN declaration somewhere in the source code. On MacOS symbol names need to be prefixed with an underscore _, so it’s _printf instead of printf.
  2. .data section is used to declare initialized memory regions. In this particular case we define a label helloworldstr that points to the beginning of a byte array declared with db whose contents correspond to a zero-terminated string “Hello World!” ended by a line-feed.
  3. This part of the assembly does necessary preparations and calls printf.
    1. According to the calling conventions, the stack should be 16-byte aligned before the call instruction. Since call itself places on the stack a return address (8 bytes on 64-bit machines), we need to subtract an additional 8 bytes to align the stack.
    2. First parameter of printf is a pointer to the format string. We supply the address pointing to helloworlstr in rdi register according to the calling conventions.
    3. printf is a variadic function. Such functions accept a number of floating point arguments in al register, and since we don’t pass any, the register is zeroed out.
    4. After the function call, we need to restore a previous value of the stack pointer. Failure to do so will most likely result in a segmentation fault when ret pops a bogus value from the stack.

Commands to link and compile this program are the same as above. To save the clutter, I wrote a Makefile to manage the build, and will use make <target> instead of full commands on this and future occasions.

Hello World!

Since we will need to print both strings and numbers, it makes sense to write a macro for this. Macros will be stores in a separate file and included in the main source file with a special NASM directive.

With this macros included, we can easily print both pure strings and formatted strings with an integer argument:

Hello, there!
Today's number is 42

(Aside) When is stack alignment really required?

Suppose that we forgot aligning the stack. To simulate this, it’s enough to set to 0 the second parameter to print0.

The program compiles just fine, but during execution fails with a segmentation fault:

[1]    56909 segmentation fault  bin/unaligned-stack

A short session in debugger shows that it’s not even printf triggering this error, but it’s rather an explicit check in the dynamic linker:

->  0x7fff6b5f5ac4 <+0>:  push   rbp
    0x7fff6b5f5ac5 <+1>:  test   rsp, 0xf
    0x7fff6b5f5acc <+8>:  jne    0x7fff6b5f5c56 ; stack_not_16_byte_aligned_error

A couple steps forward show exactly how this stack_not_16_byte_aligned_error is implemented:

->  0x7fff6b5f5c56 <+0>: movdqa xmmword ptr [rsp], xmm0

movdqa instruction moves aligned 128 bits of memory into xmm0 register. When memory is not aligned the processor generates a general protection exception that translates to a segmentation fault.

Now consider a slightly modified example. The code is the same as before, except that there is a second call to printf and:

  • the first call has proper alignment,
  • the second call doesn’t have stack properly aligned.

It would be natural to expect this program to segfault on the second print0. However, the program runs just fine:

Hello, there!
Hello, there!

So, what’s going on here? A more detailed answer is provided in the following aside where linker’s work is explored in more detail. But in short, control goes to the dynamic linker only on a first call of a function when it does lazy symbol binding. For consecutive calls the memory location of printf is already known, so control goes directly there. Luckily, the code path in printf doesn’t have any instruction that require memory alignment1, so the call to printf finishes without issues, even though stack alignment is violated.

One thing to keep in mind is that stack alignment and other calling convention are just conventions. Unless you need to call/use external methods, you are free to do whatever you want with the stack and the registers.

(Aside) How printf is actually linked to the executable?

The only thing we need to use printf in our code, is define it with EXTERN keyword:

The assembler doesn’t know what printf is and where it will eventually come from. So, when it compiles the source code into an object file, it leaves a note in a form of a relocation entry2 to the linker that later builds a final executable object file.

We can peek into the assembly generated for printf call using objdump tool:

      10:	e8 00 00 00 00 	callq	0 <_main+0x15>

The address3 of printf is replaced with 0. This can’t be it, since call 0 just passes control to the next instruction. The missing part is the relocation table also generated by the assembler. For printf the entry is the following:

0000000000000011 X86_64_RELOC_BRANCH _printf

First comes the address where the linker needs to do the relocation, then the type of relocation, and finally the name of the symbol.

Note above that call instruction start at address 10: first byte is the opcode, and bytes 11–14 need to be filled with printf’s displacement. This matches the address in the relocation entry for printf.

So far, so good. When the linker does it’s job, we get an executable object file which can be loaded by the kernel. However, not all relocations are done yet. Because libc is linked dynamically to the executable, loading and binding of printf will happen at runtime.

To illustrate the mechanics, we will refer to unaligned-stack-2 program from the previous section: two calls to printf — one with aligned stack, and the other with incorrect alignment:

Now, this is a bit more involved as we need to look at and connect several pieces of information. First part is the resulting assembly in the linked executable:

Then, contents of sections __la_symbol_ptr and __nl_symbol_ptr4:

Contents of section __nl_symbol_ptr:
 2000 00000000 00000000 00000000 00000000  ................ ; (4)
Contents of section __la_symbol_ptr:
 2010 f41f0000 00000000                    ........         ; (5)

And finally, bind and lazy bind tables, which again are static pieces of information encoded in the executable:

   : Bind table:
   : segment  section            address    type       addend dylib            symbol
(6): __DATA   __nl_symbol_ptr    0x00002000 pointer         0 libSystem        dyld_stub_binder
   : Lazy bind table:
   : segment  section            address     dylib            symbol
(7): __DATA   __la_symbol_ptr    0x00002010 libSystem        _printf

The linker did necessary static relocations and prepared the object file for handling by the dynamic linker. Note numbered highlights in the outputs above:

  1. From (1) we see that calls to printf transfer control to address 0x1fde where the stub is placed by the linker.
  2. The instruction at 0x1fde passes control to the address stored at memory location 0x2010. From (5) we can see that initially 0x2010 stores address 0x1ff45 which pushes 0 onto the stack and jumps to the beginning of __stub_helper in the assembly above. From (7) we can see that 0x2010 is reserved for printf binding.
  3. Following 0x1fe4, there is another instruction at 0x1fed that transfers control to the address stored at memory location 0x2000. Bind entry (6) shows that 0x2000 will point to dyld_stub_binder which is a part of the dynamic linker. Initial value of 0x2000 is zero and is filled with a real address at executable load time.

Dynamic linking and lazy binding requires a level of indirection: function calls no longer point directly to the destination address, but rather retrieve this address from a memory location that is filled at execution time by the dynamic linker.

To get a better feel for how lazy binding works, let’s fire up a debugger and see it in action:

Current executable set to 'bin/unaligned-stack-2' (x86_64).

# Before the executable is loaded, 0x2000 holds 0s and 0x2010 points to 0x1ff4
(lldb) mem read 0x2000 --count 8
0x00002000: 00 00 00 00 00 00 00 00                          ........
(lldb) mem read 0x2010 --count 8
0x00002010: f4 1f 00 00 00 00 00 00                          .......

# Set breakpoint on main and run the program
(lldb) b main
Breakpoint 1: where = unaligned-stack-2`main, address = 0x0000000000001fa6
(lldb) r
Process 70864 launched: 'bin/unaligned-stack-2' (x86_64)
Process 70864 stopped
* thread #1, queue = '', stop reason = breakpoint 1.1
    frame #0: 0x0000000000001fa6 unaligned-stack-2`main
->  0x1fa6 <+0>:  sub    rsp, 0x8
    0x1faa <+4>:  movabs rdi, 0x2018
    0x1fb4 <+14>: xor    al, al
    0x1fb6 <+16>: call   0x1fde                    ; symbol stub for: printf
Target 0: (unaligned-stack-2) stopped.

# After the executable is loaded 0x2000 points to a real address
# to the dynamic linker stub binder.
(lldb) mem read 0x2000 --count 8
0x00002000: c4 5a 5f 6b ff 7f 00 00                          Z_k...

# Set a breakpoint after the first call to printf and continue execution
(lldb) b 0x1fbb
Breakpoint 2: where = unaligned-stack-2`main + 21, address = 0x0000000000001fbb
(lldb) c
Process 70864 resuming
Hello, there!
Process 70864 stopped
* thread #1, queue = '', stop reason = breakpoint 2.1
    frame #0: 0x0000000000001fbb unaligned-stack-2`main + 21
->  0x1fbb <+21>: add    rsp, 0x8
    0x1fbf <+25>: sub    rsp, 0x0
    0x1fc3 <+29>: movabs rdi, 0x2018
    0x1fcd <+39>: xor    al, al
Target 0: (unaligned-stack-2) stopped.

# After the first call to printf, 0x2010 stores a memory address
# where printf was actually loaded to.
(lldb) mem read 0x2010 --count 8
0x00002010: ec 78 69 6b ff 7f 00 00                          xik...
(lldb) dis -s 0x7fff6b6978ec
    0x7fff6b6978ec <+0>:  push   rbp
    0x7fff6b6978ed <+1>:  mov    rbp, rsp
    0x7fff6b6978f0 <+4>:  sub    rsp, 0xd0
    0x7fff6b6978f7 <+11>: mov    r10, rdi
    0x7fff6b6978fa <+14>: test   al, al
    0x7fff6b6978fc <+16>: je     0x7fff6b697924            ; <+56>
    0x7fff6b6978fe <+18>: movaps xmmword ptr [rbp - 0xa0], xmm0
    0x7fff6b697905 <+25>: movaps xmmword ptr [rbp - 0x90], xmm1

How to parse command line args

Since our main is not an entry point but rather a regular function, we can expect argc and argv to be in rdi and rsi respectively. First pointer in argv points to the executable name, and the second pointer (if provided) should point to n. Parsing can be done with a function from C stdlib:

With this in mind, let’s write an assembly program that echoes a number back to the console, or says that the provided value is not a number.

Compile, run, and enjoy the result:

Usage: echo-num <number>
Not a number
Your number is 42

There are several things in the code that deserve further elaboration:

  1. This is what is usually called a function prologue. Conventionally, rbp points to the beginning of a stack frame. Instructions push rbp; mov rbp, rsp; enable access to the current stack frame, previous call stack frame, and recursively all stack frames up the call chain.

    As a side effect, it also gets stack 16-byte aligned.

  2. This lea rax, [rsi + 8] instruction is equivalent to rax = rsi + 8 which stores in rax address of a second item in argv array. An alternative implementation could be mov rax, rsi; add rax, 8; but apart from requiring less instructions, on modern processors lea is also processed outside of ALU (in a different part of the execution pipeline that deals with addressing) and can happen in parallel with other arithmetic instructions.

Similarly, to printmacro.asm let’s move parsenum to a separate file, and proceed to the final part.

Finally, a solution to FizzBuzz

The code below implements a straightforward solution that doesn’t use the fact that 3 and 5 are prime numbers. It’s also extensively commented for reader’s convenience.

        GLOBAL _main
        EXTERN  _printf
        EXTERN  _strtol

        %include "printmacro.asm"
        %include "parse.asm"

        SECTION .data
        usestr  db      'Usage: fizzbuzz <number>', `\n`, 0
        nanstr  db      'Not a number', `\n`, 0
        numstr  db      '%d', 0
        fizzstr db      'Fizz', 0
        buzzstr db      'Buzz', 0
        lf      db      `\n`, 0

        SECTION .text
        ;; The beginning is the same as in echo-num.asm
        cmp     rdi, 1
        jle     .noargs

        push    rbp
        mov     rbp, rsp

        lea     rax, [rsi + 8]
        mov     rdi, [rax]
        call    parsenum

        cmp     rdx, 0
        jne     .invalid
        ;; At this point n is parsed and is in rax register

        ;; Here starts the main logic of the routine.
        ;; Since registers r12 - r15 are calee-saved, they
        ;; need to be pushed onto the stack before using.
        push    r13
        mov     r13, rax        ; save n in r13

        push    r12             ; r12 will serve as a counter
        mov     r12, 1

        push    r14             ; r14 will store r12 % 3

        cmp     r12, r13
        jg      .fin

        ;; div instruction does signed division
        ;; when the divisor is in 64-bit register (rcx in our case)
        ;; rdx:rax get divided by it with quotient stored in rax and
        ;; remainder -- in rdx
        mov     rax, r12
        xor     rdx, rdx
        mov     rcx, 3
        div     rcx

        mov     r14, rdx        ; save r13 % 3 in r14 for further check in .checkNum
        cmp     rdx, 0
        jne     .check5
        print0  fizzstr, 8
        ;; after .check3 we *need* to fall-through to .check5
        mov     rax, r12
        xor     rdx, rdx
        mov     rcx, 5
        div     rcx
        cmp     rdx, 0
        jne     .checkNum
        print0  buzzstr, 8
        jmp     .checkEnd       ; check5 is successful, we can go to .checkEnd

        cmp     r14, 0          ; here r12 % 5 != 0, *but* r12 % 3 could be 0
        je      .checkEnd       ; thus we need to check the result of div 3
        print1  numstr, r12, 8

        print0  lf, 8
        inc     r12
        jmp     .checkStart

        pop     r14
        pop     r12
        pop     r13
        pop     rbp
        mov     rax, 0

        print0  usestr, 8
        mov     rax, 1

        print0  nanstr, 0
        pop     rbp
        mov     rax, 1

Now, run it and stand back in amazement as the solution is being printed onto the screen:

Usage: fizzbuzz <number>
Not a number

To conclude, there are several things we covered in this post, including:

  • how to use library functions,
  • how to link programs with shared libraries,
  • what are static and dynamic linking, what is relocation and how lazy binding is happening at runtime,
  • stack alignment and whether it’s necessary at all times,
  • more assembly tricks, macros and function definitions.

Although, the low level mechanics are rather fiddly, understanding how compilation, linking and loading work as well as being able to read resulting assembly can help in debugging some baffling otherwise issues.

(Bonus) FizzBuzz in other languages

Disclaimer: this sections is written just for lulz, no hard conclusions should be drawn from it.

Out of curiosity, let’s calculate the execution time of a solution in assembly for a modest n = 10M:

0m0.002s 0m0.003s
0m1.993s 0m0.004s

A bit less than 2s, which is not bad at all! After all, it’s as close to the bare metal as possible, so it should be very performant, right? Well, yes… maybe. Let’s find out!

Effectiveness of C

Below is a pretty direct translation of the assembly solution to C:

Checking that it compiles and runs fine:

Usage: c-fizzbuzz <number>
Error: not a number

Good. In regard to the timings, we get a bit more than 2s, which agrees with the intuition — compilers generate great assembly, but there is still small level of overhead:

0m0.002s 0m0.002s
0m2.075s 0m0.004s

Case solved? Not exactly. Let’s compile the code with the maximum optimization level, and time the execution again:

0m0.002s 0m0.002s
0m1.621s 0m0.034s

Now we’re at ~1.6s which is ~23% and ~28% faster than “pure assembly” and “no-optimization C” solutions respectively. Compiler developers try hard to improve the efficiency of the generated assembly. Writing assembly by hand is cool, but C compilers apply lots of non-trivial optimizations to the code without programmers even knowing or understanding this optimizations.

In this particular case, clang optimized a couple things:

  • Replaced calls to printf without parameters with puts, thus avoiding the overhead of needlessly parsing a format string.
  • Used an optimized remainder calculation. In general, division is a costly operation and can take 10s of CPU cycles. But if the divisor is known, remainder can be found quicker without using division6.

Expressiveness of Haskell

The M-word was used only once in the source code. It may not be the most idiomatic Haskell, but it closely resembles other solutions and is arguably easy to read.

Compiling with the maximum optimization level gives the execution time of ~5.49s:

0m0.002s 0m0.003s
0m5.486s 0m0.108s

I’m not an expert in Haskell performance tuning, so there might be a way to implement this more efficiently. But again, these numbers should not be used to argue that FP is slow — this is a degenerate example where the overhead is noticeable. On a real project the compiler will have many more optimization opportunities.

Whatever of Python

I figured that the list would be incomplete without an interpreted language, so here is a solution in Python:

The code resembles Haskell’s version a lot (minus type annotations). The timings are not that bad — about 3x slower than Haskell’s version and 10x slower then C:

0m0.002s 0m0.004s
0m14.858s 0m0.039s


  1. There are instructions in printf that require memory alignment, but this code path is not taken during execution due to absence of floating point arguments.

  2. Loosely speaking, the assembler doesn’t know the final address of a symbol in memory. So, it puts a placeholder in the object file, and instructs the linker to put the final address there when the linker builds an executable. This process is called relocation.

  3. Or more specifically, a program counter (PC) relative displacement.

  4. __la_... stands for lazy and _nl_... — for non-lazy and describe the type of binding performed by the dynamic linker.

  5. The actual memory content is f41f0000 but due to little-endianness byte order is reversed and the least significant byte comes first.

  6. Remainder from division by 3 or by 5 can be found more efficiently using a combination of multiplication and shifts. See Hacker’s delight for more details. Looks like in this particular case remu3m and remu5m methods were used.