Let's Build a Compiler 01: Starting Out

13 minute read Published: 2019-03-21

Compilers, interpreters, runtimes, and the like can sometimes feel vaguely magical. They take your code, perform complex transformations that you only partially understand, and spit out an executable binary. Clearly, the inner workings of such a beast must contain arcane secrets, not meant for the minds of mortals.

Of course, even if it feels that way sometimes, that doesn't make it true. As with everything else in software, there's no magic. You can understand it, with a little research and experimentation. There's plenty of resources out there to understand how compilers work. I haven't found anything that takes you from zero to a fully usable compiler, though, so I figured I'd try to write something that fits the bill. I've always found that the best way to learn how a system works is to build one yourself, so let's do that.

In short, this series will walk you through building an optimizing compiler from scratch. Hopefully we'll both learn something along the way.

I will be assuming some knowledge of native-code build processes, Lisp, C, and x86 assembly language, but I'll try and provide enough detail that anyone who's not familiar with them can still follow along as long as they're not afraid of a little research.

The Basics

We're going to be building a compiler for Scheme, a variant of Lisp. Primarily due to how simple the parser can be, but also because the language is both simple enough to compile and practical enough that we can write our compiler in it. Once it's far enough along, we should be able to compile our compiler with itself - it'll become self-hosting, independent of any external software.

For the initial stages, we'll follow the steps outlined in An Incremental Approach to Compiler Construction, which is a great resource, even if it does tend to gloss over things that need to be described in more detail. If you do have trouble with a specific part, I'll be including working downloads of the current compiler source at the end of each post, so you can use that implementation as a reference.

Though our compiler will eventually be self-hosting, we'll need something to run it before that point. I'll be using GNU Guile v2.2, which runs pretty much everywhere and implements the R6RS specification, which just so happens to be what we're initially aiming to support. On Linux, you should just be able to install via your system's package manager. Once it's installed, you should see a prompt like the following if you start up its REPL:

GNU Guile 2.2.4
Copyright (C) 1995-2017 Free Software Foundation, Inc.

Guile comes with ABSOLUTELY NO WARRANTY; for details type `,show w'.
This program is free software, and you are welcome to redistribute it
under certain conditions; type `,show c' for details.

Enter `,help' for help.
scheme@(guile-user)>

You can also enable Readline support (so you can access history with up/down and use tab-completion) by running:

scheme@(guile-user)> (use-modules (ice-9 readline))
scheme@(guile-user)> (activate-readline)

Basic Framework

One nice thing about implementing a Scheme compiler in Scheme is that, at least at first, we can ignore the parser entirely. Any chunk of Lisp code is also a valid data structure, so we can just operate on that structure directly without worrying about writing a parser. For now, our compiler can just be a function compile-program that takes code as input and returns a bunch of 32-bit x86 assembly code as a string. Later on we'll improve upon this, but this is the easiest way to get something working quickly.

The output assembly will then get linked with a runtime library written in C to produce an executable binary file. Our initial runtime looks like this (stored in rts.c):

#include <stdio.h>
#include <stdlib.h>

__attribute__((__cdecl__))
extern int scheme_entry();

int main(int argc, const char **argv) {
    int val = scheme_entry();
    printf("%d\n", val);
    return 0;
}

This just calls a function named scheme_entry that our compiler generates, then prints the integer that it returns. The __attribute__((__cdecl__)) part is a GCC-specific syntax extension that tells the compiler to use the "cdecl" calling convention when executing the function. That means arguments are pushed onto the stack from right to left, and the return value is stored (since this is a 32-bit program) in the eax register. Our assembly just has to read any arguments it needs, run the program, store the result in eax, then return back to the caller and make sure none of the other registers are modified.

For generating the assembly and compiling the output binary, I wrote a basic framework using some of Guile's extension libraries to interact with the OS. We'll replace this later on, but it'll be enough to get going.

(define emit (lambda args (apply simple-format #t args)
                          (newline)))

(define (compile-program program)
  (emit ".text")
  (emit ".p2align 4,,15")
  (emit ".globl scheme_entry")
  (emit "scheme_entry:")

  ; handle incoming call from C
  (emit "push %esi")
  (emit "push %edi")
  (emit "push %edx")

  ; our code goes here
  (emit "movl $42, %eax")

  ; restore state for return to C
  (emit "pop %edx")
  (emit "pop %edi")
  (emit "pop %esi")
  (emit "ret"))

(define (compile-to-binary program)
  (begin
    (with-output-to-file "/tmp/compiled.s"
      (lambda () (compile-program program)))
    (system "gcc -fomit-frame-pointer -m32 /tmp/compiled.s rts.c")))

There's no actual code-generation yet - right now it just generates the equivalent of return 42; in C, but that should let us make sure everything else is working. Now we can fire up Guile and test it out!

scheme@(guile-user)> (load "compiler.scm")
...
scheme@(guile-user)> (compile-to-binary 0)
$1 = 0
scheme@(guile-user)> ^C
$ ls
a.out    compiler.scm   rts.c
$ ./a.out
42
$

Success! We generated the necessary assembly file, compiled it, and emitted a working binary! Before moving forward, let's unpack the Lisp code a little to make sure everyone's up to speed.

First, we're defining a new emit function we can use to output a line of assembly code. It's a relatively simple wrapper around the Guile library function (simple-format destination format-string args) that does basic printf-style string formatting and sends it to an output stream.

Second, the definition of (compile-program program), the entry point to our compiler. This function will take a Lisp program as input and print a bunch of assembly code to the default output stream that implements that program. Here, it ignores the argument entirely and just emits the basic instructions required to return 42.

Finally, the compile-to-binary program is how we actually use the compiler to generate our executable. It calls compile-program, redirecting its output into a file at /tmp/compiled.s (overwriting anything that's already there), then runs gcc -m32 /tmp/compiled.s rts.c to link everything together.

First Steps

Since we already have a runtime that will print out whatever we return, our first step will be to support the very simple language of "constant integers". Starting from the basic framework we already have, all we have to do is replace the 42 in movl $42, %eax with whatever value we need to return.

For the actual implementation, I split the actual emit call out into a (compile-expr e) function; where compile-program is responsible for emitting a full program, compile-expr only has to generate the fragment required to evaluate a passed expression, and replace the (emit "movl $42, %eax") in compile-program with (compile-expr program).

(define (compile-expr e)
  (emit "movl $~a, %eax" a))

The emit call here is analogous to f"movl ${e}, %eax" in Python - it inserts the stringified value of e where 42 previously went in the movl instruction.

I also added a compile-and-run function to make testing easier, which just calls compile-to-binary and then runs the resulting program. Later on, we'll add something to capture the output so we can run automated tests on our compiler and verify the generated code.

(define (compile-and-run program)
  (begin (compile-to-binary program)
         (system "./a.out")))

Then, we can load the new version into a REPL and test it out:

scheme@(guile-user) [4]> (load "compiler.scm")
...
scheme@(guile-user) [4]> (compile-and-run 27)
27
$10 = 0
scheme@(guile-user) [4]> (compile-program 27)
.text
.p2align 4,,15
.globl scheme_entry
scheme_entry:
push %esi
push %edi
push %edx
movl $27, %eax
pop %edx
pop %edi
pop %esi
ret
scheme@(guile-user) [4]>

Data Representation

Supporting integer constants is great, but Scheme includes more than just integers. We also have booleans (#t or #f), character constants (which are just written as the character prefixed by #\, for example #\a for the character 'a'), and the empty list () which is normally used as a "null" value. There's also pairs, lists, vectors, strings, and functions, but those all require heap allocation so we're going to ignore them for now.

The question is, how do we represent these values? We can't just store everything as an integer - the standard includes functions like boolean? that let programs check the type of an object at runtime, so we need a way to represent that type information.

Making everything into a heap-allocated structure is one option, but that'd mean building a lot of extra complexity just to store simple fixed-size datatypes. Ideally we should try to avoid using the heap for as long as possible, since that'd mean memory allocation, and memory allocation means we'd have to implement a garbage collector. Lucky for us, as long as we're willing to sacrifice a few bits of our integers, we can use a trick called "tagged pointers" to pack everything into our 32-bit machine word.

The basic idea behind tagged pointers is to recycle some unused bits of the word as metadata. Broken down at the bit level, the basic types we're trying to represent look like this:

  type  | 31            bit             0
-----------------------------------------
integer | iiiiiiiiiiiiiiiiiiiiiiiiiiiiiii
boolean | 000000000000000000000000000000b
   char | 00000000000000000000000cccccccc
pointer | ppppppppppppppppppppppppppppppp

Integers and pointers use every available bit, booleans use the least significant bit to represent zero or one, and characters use the lowest eight bits (Unicode adds complexity, so we'll stick to the normal ASCII set for now). There's no way to represent every available value in every type without adding more bits; there's be no way to distinguish between 0 and #f, or an integer and its corresponding pointer. To solve that, let's sacrifice the lowest two bits of our integers, setting them to always zero. Now, bit 2 is the low-order bit of our actual integer, and we only have 30 bits available to represent our values.

Here's where the optimization comes into play. Since integers are defined to always have the two low-order bits zeroed, we can define nonzero values in those positions to mean something else. For example, what if we shift the meaningful bits of integers and booleans up by eight, then put a marker value in the newly-unoccupied low-order bits?

  type  | 31            bit             0
-----------------------------------------
integer | iiiiiiiiiiiiiiiiiiiiiiiiiiiii00
boolean | 0000000000000000000000b00001111
   char | 000000000000000cccccccc00000111
pointer | ppppppppppppppppppppppppppppppp

Now the two low-order bits are zero if a value is an integer, and 1 when it's something else. You can further distinguish between booleans and characters based on the value of the low-order byte. But our pointers still present a problem - they still use all of the bits. We can't distinguish between the boolean #t and the pointer 0x0000010f. But... are those bits really all used? On x86, at least, we can technically access any memory address we want. Practically, though, the smallest object we need to address on the heap is 4 bytes in size, and most objects are 8 bytes or larger. What if we require that all pointers be 8-byte aligned?

If pointers are aligned to 8-byte boundaries, then the least-significant three bits of our pointers will always be zero, which means we can use them for something else! For example, we can merge the types together like this:

  type  | 31            bit             0
-----------------------------------------
integer | iiiiiiiiiiiiiiiiiiiiiiiiiiiii00
boolean | 0000000000000000000000b00001111
   char | 000000000000000cccccccc00000111
pointer | pppppppppppppppppppppppppppp010

Now, we can take an arbitrary 32-bit word and determine what type of object it is without any external reference, at the cost of a few bit operations when accessing booleans, pointers, and characters. We could also reverse the representations, so that integers always have the low two bits set and pointers have them set to zero. However, our current representation lets us add, subtract, multiply, and divide integers directly without actually shifting them. Essentially, the representation we use can either allow integer operations without conversions or pointer dereferences without conversions. In this series I'm choosing integer operations, but it's worth being aware of the trade-off.

But we're not done! We can take this further. Any given pointer can reference an object of the following types:

Pair
Vector
String
Closure
Symbol

That's 5 types of object, and we have 3 bits free in our pointer. Wouldn't it be cool if we could cram object type information into the pointer too?

Those three bits can be set to any of the following patterns:

We can't use anything ending in 00, since that would represent an integer, and we can't use 111, since that's the common suffix for characters and booleans. This leaves us with five patterns, which is exactly the number we need to represent our object types:

001 - pair
010 - vector
011 - string
101 - symbol
110 - closure

So now we have everything represented unambiguously within a 32-bit word:

      type      | 31            bit             0
-------------------------------------------------
        integer | iiiiiiiiiiiiiiiiiiiiiiiiiiiii00
        boolean | 0000000000000000000000b00001111
           char | 000000000000000cccccccc00000111
   pair pointer | pppppppppppppppppppppppppppp001
 vector pointer | pppppppppppppppppppppppppppp010
 string pointer | pppppppppppppppppppppppppppp011
 symbol pointer | pppppppppppppppppppppppppppp101
closure pointer | pppppppppppppppppppppppppppp110

Now that we've defined how our primitive datatypes are represented, we can move on and actually implement them.

More Simple Constants

First off, let's define some useful constants for our data representation:

(define fixnum-shift 2)
(define fixnum-mask 3)

(define ptr-mask 7) ; mask for pointer type tag
(define ptr-mask-inv #xfffffff8) ; mask for pointer value

(define pair-tag 1)
(define vec-tag 2)
(define str-tag 3)
(define sym-tag 5)
(define closure-tag 6)

(define char-mask 255) ; character type mask
(define char-shift 8)
(define char-tag 7)

(define bool-mask 255)
(define bool-shift 8)
(define bool-tag 15)

Next, a utility function that converts a primitive value into its word representation. We don't need to worry about pointers for now, so this just handles integers, booleans, and characters. For reference, ash, logand, and logior are the GNU Guile functions for bit-shift, bitwise AND, and bitwise OR, respectively.

(define (immediate-rep x)
  (cond ((integer? x) (logand (ash x fixnum-shift) #xffffffff))
        ((char? x) (logior (ash (char->integer x) char-shift) char-tag))
        ((boolean? x)
         (if x
             (logior (ash 1 bool-shift) bool-tag)
             bool-tag))
    ))

Once that's in place, we can expand our supported language to include the other primitive types by passing each expression through immediate-rep before storing it:

(define (compile-expr e)
  (emit "movl $~a, %eax" (immediate-rep e)))

Now just reload the file, and...

scheme@(guile-user)> (load "compiler.scm")
...
scheme@(guile-user)> (compile-and-run 2)
8

Oops - we haven't updated our runtime system to understand the new tagged value encoding. Let's do that now:

#define FIXNUM_MASK     3
#define FIXNUM_TAG      0
#define FIXNUM_SHIFT    2

#define CHAR_MASK       0xff
#define CHAR_SHIFT      8
#define CHAR_TAG        7

#define BOOL_MASK       0xff
#define BOOL_SHIFT      8
#define BOOL_TAG        15

#define PTR_MASK        7
#define PAIR_TAG        1
#define VEC_TAG         2
#define STR_TAG         3
#define SYM_TAG         5
#define CLOSURE_TAG     6

void show(int x) {
    if((x & FIXNUM_MASK) == FIXNUM_TAG) {
        // integer
        printf("%d", x >> FIXNUM_SHIFT);
    } else if((x & CHAR_MASK) == CHAR_TAG) {
        // character
        printf("#\\%c", (char)(x >> CHAR_SHIFT));
    } else if((x & BOOL_MASK) == BOOL_TAG) {
        // boolean
        if((x >> BOOL_SHIFT) != 0) {
            printf("#t");
        } else {
            printf("#f");
        }
    }
}

__attribute__((__cdecl__))
extern int scheme_entry();

int main(int argc, const char **argv) {
    int val = scheme_entry();
    show(val);
    printf("\n");
    return 0;
}

There's no support for the pointer types yet, but that's not a big deal since we aren't actually creating any. Now, let's try testing it again:

scheme@(guile-user)> (load "compiler.scm")
...
scheme@(guile-user)> (compile-and-run 2)
2
scheme@(guile-user)> (compile-and-run -272)
-272
scheme@(guile-user)> (compile-and-run #t)
#t

Success! We can represent simple constants, and they're displayed correctly by our runtime system. In 87 lines of actual code, we have something that can generate an executable, represent the primitive Scheme types, and display a result.

Next time, we'll start adding primitive operations so we can start performing actual computations!