Compilers, interpreters, runtimes, and the like can sometimes feel vaguely magical. They take your code, perform complex transformations that you only partially understand, and spit out an executable binary. Clearly, the inner workings of such a beast must contain arcane secrets, not meant for the minds of mortals.
Of course, even if it feels that way sometimes, that doesn't make it true. As with everything else in software, there's no magic. You can understand it, with a little research and experimentation. There's plenty of resources out there to understand how compilers work. I haven't found anything that takes you from zero to a fully usable compiler, though, so I figured I'd try to write something that fits the bill. I've always found that the best way to learn how a system works is to build one yourself, so let's do that.
In short, this series will walk you through building an optimizing compiler from scratch. Hopefully we'll both learn something along the way.
I will be assuming some knowledge of native-code build processes, Lisp, C, and x86 assembly language, but I'll try and provide enough detail that anyone who's not familiar with them can still follow along as long as they're not afraid of a little research.
The Basics
We're going to be building a compiler for Scheme, a variant of Lisp. Primarily due to how simple the parser can be, but also because the language is both simple enough to compile and practical enough that we can write our compiler in it. Once it's far enough along, we should be able to compile our compiler with itself - it'll become self-hosting, independent of any external software.
For the initial stages, we'll follow the steps outlined in An Incremental Approach to Compiler Construction, which is a great resource, even if it does tend to gloss over things that need to be described in more detail. If you do have trouble with a specific part, I'll be including working downloads of the current compiler source at the end of each post, so you can use that implementation as a reference.
Though our compiler will eventually be self-hosting, we'll need something to run it before that point. I'll be using GNU Guile v2.2, which runs pretty much everywhere and implements the R6RS specification, which just so happens to be what we're initially aiming to support. On Linux, you should just be able to install via your system's package manager. Once it's installed, you should see a prompt like the following if you start up its REPL:
GNU Guile 2.2.4 Copyright (C) 1995-2017 Free Software Foundation, Inc. Guile comes with ABSOLUTELY NO WARRANTY; for details type `,show w'. This program is free software, and you are welcome to redistribute it under certain conditions; type `,show c' for details. Enter `,help' for help. scheme@(guile-user)>
You can also enable Readline support (so you can access history with up/down and use tab-completion) by running:
scheme@(guile-user)> (use-modules (ice-9 readline)) scheme@(guile-user)> (activate-readline)
Basic Framework
One nice thing about implementing a Scheme compiler in Scheme is that, at least
at first, we can ignore the parser entirely. Any chunk of Lisp code is also a
valid data structure, so we can just operate on that structure directly without
worrying about writing a parser. For now, our compiler can just be a function
compile-program
that takes code as input and returns a bunch of 32-bit x86
assembly code as a string. Later on we'll improve upon this, but this is the
easiest way to get something working quickly.
The output assembly will then get linked with a runtime library written in C to
produce an executable binary file. Our initial runtime looks like this (stored
in rts.c
):
#include <stdio.h> #include <stdlib.h> __attribute__((__cdecl__)) extern int scheme_entry(); int main(int argc, const char **argv) { int val = scheme_entry(); printf("%d\n", val); return 0; }
This just calls a function named scheme_entry
that our compiler generates,
then prints the integer that it returns. The __attribute__((__cdecl__))
part
is a GCC-specific syntax extension that tells the compiler to use the
"cdecl" calling
convention when executing the function. That means arguments are pushed onto the
stack from right to left, and the return value is stored (since this is a 32-bit
program) in the eax
register. Our assembly just has to read any arguments it
needs, run the program, store the result in eax
, then return back to the
caller and make sure none of the other registers are modified.
For generating the assembly and compiling the output binary, I wrote a basic framework using some of Guile's extension libraries to interact with the OS. We'll replace this later on, but it'll be enough to get going.
(define emit (lambda args (apply simple-format #t args) (newline))) (define (compile-program program) (emit ".text") (emit ".p2align 4,,15") (emit ".globl scheme_entry") (emit "scheme_entry:") ; handle incoming call from C (emit "push %esi") (emit "push %edi") (emit "push %edx") ; our code goes here (emit "movl $42, %eax") ; restore state for return to C (emit "pop %edx") (emit "pop %edi") (emit "pop %esi") (emit "ret")) (define (compile-to-binary program) (begin (with-output-to-file "/tmp/compiled.s" (lambda () (compile-program program))) (system "gcc -fomit-frame-pointer -m32 /tmp/compiled.s rts.c")))
There's no actual code-generation yet - right now it just generates the
equivalent of return 42;
in C, but that should let us make sure everything
else is working. Now we can fire up Guile and test it out!
scheme@(guile-user)> (load "compiler.scm") ... scheme@(guile-user)> (compile-to-binary 0) $1 = 0 scheme@(guile-user)> ^C $ ls a.out compiler.scm rts.c $ ./a.out 42 $
Success! We generated the necessary assembly file, compiled it, and emitted a working binary! Before moving forward, let's unpack the Lisp code a little to make sure everyone's up to speed.
First, we're defining a new emit
function we can use to output a line of
assembly code. It's a relatively simple wrapper around the Guile library
function (simple-format destination format-string args)
that does basic
printf-style string formatting and sends it to an output stream.
Second, the definition of (compile-program program)
, the entry point to our
compiler. This function will take a Lisp program as input and print a bunch of
assembly code to the default output stream that implements that program. Here,
it ignores the argument entirely and just emit
s the basic instructions
required to return 42
.
Finally, the compile-to-binary
program is how we actually use the compiler
to generate our executable. It calls compile-program
, redirecting its output
into a file at /tmp/compiled.s
(overwriting anything that's already there),
then runs gcc -m32 /tmp/compiled.s rts.c
to link everything together.
First Steps
Since we already have a runtime that will print out whatever we return, our
first step will be to support the very simple language of "constant integers".
Starting from the basic framework we already have, all we have to do is replace
the 42
in movl $42, %eax
with whatever value we need to return.
For the actual implementation, I split the actual emit
call out into a
(compile-expr e)
function; where compile-program
is responsible for emitting
a full program, compile-expr
only has to generate the fragment required to
evaluate a passed expression, and replace the (emit "movl $42, %eax")
in
compile-program
with (compile-expr program)
.
(define (compile-expr e) (emit "movl $~a, %eax" a))
The emit
call here is analogous to f"movl ${e}, %eax"
in Python - it
inserts the stringified value of e
where 42 previously went in the movl
instruction.
I also added a compile-and-run
function to make testing easier, which just
calls compile-to-binary
and then runs the resulting program. Later on, we'll
add something to capture the output so we can run automated tests on our
compiler and verify the generated code.
(define (compile-and-run program) (begin (compile-to-binary program) (system "./a.out")))
Then, we can load the new version into a REPL and test it out:
scheme@(guile-user) [4]> (load "compiler.scm") ... scheme@(guile-user) [4]> (compile-and-run 27) 27 $10 = 0 scheme@(guile-user) [4]> (compile-program 27) .text .p2align 4,,15 .globl scheme_entry scheme_entry: push %esi push %edi push %edx movl $27, %eax pop %edx pop %edi pop %esi ret scheme@(guile-user) [4]>
Data Representation
Supporting integer constants is great, but Scheme includes more than just
integers. We also have booleans (#t
or #f
), character constants (which are
just written as the character prefixed by #\
, for example #\a
for the
character 'a'), and the empty list ()
which is normally used as a "null"
value. There's also pairs, lists, vectors, strings, and functions, but those all
require heap allocation so we're going to ignore them for now.
The question is, how do we represent these values? We can't just store
everything as an integer - the standard includes functions like boolean?
that
let programs check the type of an object at runtime, so we need a way to
represent that type information.
Making everything into a heap-allocated structure is one option, but that'd mean building a lot of extra complexity just to store simple fixed-size datatypes. Ideally we should try to avoid using the heap for as long as possible, since that'd mean memory allocation, and memory allocation means we'd have to implement a garbage collector. Lucky for us, as long as we're willing to sacrifice a few bits of our integers, we can use a trick called "tagged pointers" to pack everything into our 32-bit machine word.
The basic idea behind tagged pointers is to recycle some unused bits of the word as metadata. Broken down at the bit level, the basic types we're trying to represent look like this:
type | 31 bit 0 ----------------------------------------- integer | iiiiiiiiiiiiiiiiiiiiiiiiiiiiiii boolean | 000000000000000000000000000000b char | 00000000000000000000000cccccccc pointer | ppppppppppppppppppppppppppppppp
Integers and pointers use every available bit, booleans use the least
significant bit to represent zero or one, and characters use the lowest eight
bits (Unicode adds complexity, so we'll stick to the normal ASCII set for now).
There's no way to represent every available value in every type without adding
more bits; there's be no way to distinguish between 0 and #f
, or an integer
and its corresponding pointer. To solve that, let's sacrifice the lowest two
bits of our integers, setting them to always zero. Now, bit 2 is the low-order
bit of our actual integer, and we only have 30 bits available to represent our
values.
Here's where the optimization comes into play. Since integers are defined to always have the two low-order bits zeroed, we can define nonzero values in those positions to mean something else. For example, what if we shift the meaningful bits of integers and booleans up by eight, then put a marker value in the newly-unoccupied low-order bits?
type | 31 bit 0 ----------------------------------------- integer | iiiiiiiiiiiiiiiiiiiiiiiiiiiii00 boolean | 0000000000000000000000b00001111 char | 000000000000000cccccccc00000111 pointer | ppppppppppppppppppppppppppppppp
Now the two low-order bits are zero if a value is an integer, and 1 when it's
something else. You can further distinguish between booleans and characters
based on the value of the low-order byte. But our pointers still present a
problem - they still use all of the bits. We can't distinguish between the
boolean #t
and the pointer 0x0000010f
. But... are those bits really all
used? On x86, at least, we can technically access any memory address we want.
Practically, though, the smallest object we need to address on the heap is 4
bytes in size, and most objects are 8 bytes or larger. What if we require that
all pointers be 8-byte aligned?
If pointers are aligned to 8-byte boundaries, then the least-significant three bits of our pointers will always be zero, which means we can use them for something else! For example, we can merge the types together like this:
type | 31 bit 0 ----------------------------------------- integer | iiiiiiiiiiiiiiiiiiiiiiiiiiiii00 boolean | 0000000000000000000000b00001111 char | 000000000000000cccccccc00000111 pointer | pppppppppppppppppppppppppppp010
Now, we can take an arbitrary 32-bit word and determine what type of object it is without any external reference, at the cost of a few bit operations when accessing booleans, pointers, and characters. We could also reverse the representations, so that integers always have the low two bits set and pointers have them set to zero. However, our current representation lets us add, subtract, multiply, and divide integers directly without actually shifting them. Essentially, the representation we use can either allow integer operations without conversions or pointer dereferences without conversions. In this series I'm choosing integer operations, but it's worth being aware of the trade-off.
But we're not done! We can take this further. Any given pointer can reference an object of the following types:
- Pair
- Vector
- String
- Closure
- Symbol
That's 5 types of object, and we have 3 bits free in our pointer. Wouldn't it be cool if we could cram object type information into the pointer too?
Those three bits can be set to any of the following patterns:
000 001 010 011 100 101 110 111
We can't use anything ending in 00, since that would represent an integer, and we can't use 111, since that's the common suffix for characters and booleans. This leaves us with five patterns, which is exactly the number we need to represent our object types:
001 - pair 010 - vector 011 - string 101 - symbol 110 - closure
So now we have everything represented unambiguously within a 32-bit word:
type | 31 bit 0 ------------------------------------------------- integer | iiiiiiiiiiiiiiiiiiiiiiiiiiiii00 boolean | 0000000000000000000000b00001111 char | 000000000000000cccccccc00000111 pair pointer | pppppppppppppppppppppppppppp001 vector pointer | pppppppppppppppppppppppppppp010 string pointer | pppppppppppppppppppppppppppp011 symbol pointer | pppppppppppppppppppppppppppp101 closure pointer | pppppppppppppppppppppppppppp110
Now that we've defined how our primitive datatypes are represented, we can move on and actually implement them.
More Simple Constants
First off, let's define some useful constants for our data representation:
(define fixnum-shift 2) (define fixnum-mask 3) (define ptr-mask 7) ; mask for pointer type tag (define ptr-mask-inv #xfffffff8) ; mask for pointer value (define pair-tag 1) (define vec-tag 2) (define str-tag 3) (define sym-tag 5) (define closure-tag 6) (define char-mask 255) ; character type mask (define char-shift 8) (define char-tag 7) (define bool-mask 255) (define bool-shift 8) (define bool-tag 15)
Next, a utility function that converts a primitive value into its word
representation. We don't need to worry about pointers for now, so this just
handles integers, booleans, and characters. For reference, ash
, logand
, and
logior
are the GNU Guile
functions
for bit-shift, bitwise AND, and bitwise OR, respectively.
(define (immediate-rep x) (cond ((integer? x) (logand (ash x fixnum-shift) #xffffffff)) ((char? x) (logior (ash (char->integer x) char-shift) char-tag)) ((boolean? x) (if x (logior (ash 1 bool-shift) bool-tag) bool-tag)) ))
Once that's in place, we can expand our supported language to include the other
primitive types by passing each expression through immediate-rep
before
storing it:
(define (compile-expr e) (emit "movl $~a, %eax" (immediate-rep e)))
Now just reload the file, and...
scheme@(guile-user)> (load "compiler.scm") ... scheme@(guile-user)> (compile-and-run 2) 8
Oops - we haven't updated our runtime system to understand the new tagged value encoding. Let's do that now:
#define FIXNUM_MASK 3 #define FIXNUM_TAG 0 #define FIXNUM_SHIFT 2 #define CHAR_MASK 0xff #define CHAR_SHIFT 8 #define CHAR_TAG 7 #define BOOL_MASK 0xff #define BOOL_SHIFT 8 #define BOOL_TAG 15 #define PTR_MASK 7 #define PAIR_TAG 1 #define VEC_TAG 2 #define STR_TAG 3 #define SYM_TAG 5 #define CLOSURE_TAG 6 void show(int x) { if((x & FIXNUM_MASK) == FIXNUM_TAG) { // integer printf("%d", x >> FIXNUM_SHIFT); } else if((x & CHAR_MASK) == CHAR_TAG) { // character printf("#\\%c", (char)(x >> CHAR_SHIFT)); } else if((x & BOOL_MASK) == BOOL_TAG) { // boolean if((x >> BOOL_SHIFT) != 0) { printf("#t"); } else { printf("#f"); } } } __attribute__((__cdecl__)) extern int scheme_entry(); int main(int argc, const char **argv) { int val = scheme_entry(); show(val); printf("\n"); return 0; }
There's no support for the pointer types yet, but that's not a big deal since we aren't actually creating any. Now, let's try testing it again:
scheme@(guile-user)> (load "compiler.scm") ... scheme@(guile-user)> (compile-and-run 2) 2 scheme@(guile-user)> (compile-and-run -272) -272 scheme@(guile-user)> (compile-and-run #t) #t
Success! We can represent simple constants, and they're displayed correctly by our runtime system. In 87 lines of actual code, we have something that can generate an executable, represent the primitive Scheme types, and display a result.
Next time, we'll start adding primitive operations so we can start performing actual computations!