Let's Build a Compiler 02: Primitives

10 minute read Published: 2019-03-27

Last time, we implemented a basic framework that can generate executables, came up with a data model, and implemented a basic code generator that can handle simple constants. In this part, we're going to start implementing primitive operations so our Lisp can start doing real computations! If you're following along in An Incremental Approach to Compiler Construction, this post starts at section 3.3.

Structure of Primitive Calls

In the code our compiler eventually accepts, some expressions, like (my-fn 12 27 x), will refer to user-defined functions that are defined elsewhere in the program. Others, like (+ 1 2), reference primitive operations that are defined as part of the language.

We could just check the name of the function being called, but then we'd risk turning a call to a variable named + that shadows the global primitive into a call to the built-in +. Instead, we can just redefine the language: the primcall function is reserved, and any calls to built-in primitives will have to look like (primcall operation-name args...) to get compiled. Later on, we can do the transformation automatically, so the user doesn't have to worry about the details of what's primitive and what's not.

Now that we know what primitive calls look like, let's define some accessor functions to make our code a bit more readable:

; Check whether the passed form is a primitive call (primcall) form
(define (primitive-call? form) (eq? 'primcall (car form)))

; Get the primitive operation from a passed primcall form
(define (primitive-op form) (cadr form))

; Get the Nth argument of a passed primcall form
(define (primitive-op-arg1 form) (caddr form))
(define (primitive-op-arg2 form) (cadddr form))

; Get all arguments of a passed primcall form
(define (primitive-op-args form) (cddr form))

These functions all accept a primitive call expression, or 'form' in Lisp parlance, and return various properties of it. If you're not familiar with Lisp's naming schemes, the calls to caddr, cddr, and friends might seem a bit opaque. Not to worry - the naming scheme is actually quite logical when you know how Lisp's lists are structured. Lists (as opposed to vectors) in Scheme are defined as chains of pairs. In memory, the list (1 2 3) looks something like this:

   car   cdr        car   cdr        car   cdr
 |-----|-----|    |-----|-----|    |-----|-----|
 |  1  |  o--+--->|  2  |  o--+--->|  3  |     |
 |-----|-----|    |-----|-----|    |-----|-----|

In other words, it's a linked list whose cells are just pairs. Each pair has a car field, the first element, and a cdr field, the second element. So the first item in the list x is (car x), the second item is (car (cdr x)), and so on. Long chains of car and cdr calls get annoying, though, so Scheme defines c?r functions for everything up through cddddr. Those functions just compress the corresponding sequence of car and cdr calls into a more readable name - (cadr x) is short for (car (cdr x)), (cddr x) is short for (cdr (cdr x)), and so on.

Now that the accessors are defined and we know a little more about how lists are stored, we can move on to the actual implementation.

Primitive Unary Operations

First, we modify compile-expr to handle the new type of input form. As part of this, we also have to add a way to check whether a given form is a primitive value that can be passed to immediate-rep.

(define (immediate? x) (or (integer? x) (char? x) (boolean? x) (null? x)))

(define (compile-expr e)
  (cond
    ((immediate? e) (emit "movl $~a, %eax" (immediate-rep e)))
    ((primitive-call? e) (compile-primitive-call e))
    ))

Rather than emitting a movl instruction unconditionally, now we have to check whether the form we're compiling is an immediate or a primitive call, and act appropriately. Once we're dispatching on the expression type, we can implement the body of compile-primitive-call.

(define (compile-primitive-call form)
  (case (primitive-op form)
    ((add1)
     (compile-expr (primitive-op-arg1 form))
     (emit "addl $~a, %eax" (immediate-rep 1)))
    ((sub1)
     (compile-expr (primitive-op-arg1 form))
     (emit "subl $~a, %eax" (immediate-rep 1)))
    ))

Now our compiler should be able to understand the basic add1 and sub1 primitives, which increment or decrement the argument's value, respectively. The implementations here aren't very complex, mainly due to the way we chose to represent integers. If you'll recall, integers are stored as a 30-bit field of their value with the two low-order bits set to zero. So the integer 1 is stored as 4, 12 as 48, and so on. As a direct consequence, we can add and subtract integers without any conversions at all, since addition and subtraction won't change the lower two bits.

Time to test it! Remember that if you're entering Lisp forms like (primcall add1 2) directly, you have to prefix them with a single-quote (') so the host Scheme won't try to execute them as code.

scheme@(guile-user)> (load "compiler.scm")
...
scheme@(guile-user)> (compile-and-run '(primcall add1 5))
6
scheme@(guile-user)> (compile-and-run '(primcall add1 1))
0

More Unary Operations

Incrementing and decrementing numbers isn't very exciting. Let's add some more unary primitives!

For now, this just includes the type-checking predicates: integer?, zero?, boolean?, and char?. These are all implemented in basically the same way. They each AND the value with its corresponding type mask (e.g. 3 for integers), then check the result against the tag bits that are supposed to be in that area, convert the result to a boolean, and return it. We'll also define a helper function for the "check and convert to boolean" code sequence, since we use it so much.

(define (emit-is-eax-equal-to val)
  (emit "cmpl $~a, %eax" val) ; check eax against val
  (emit "movl $0, %eax")      ; zero eax, leaving equal flag in place
  (emit "sete %al")           ; set low bit of eax if they were equal
  (emit "sall $~a, %eax" bool-shift) ; shift the bit up to the bool position
  (emit "orl $~a, %eax" bool-tag))   ; add boolean type tag

(define (compile-primitive-call form)
  (case (primitive-op form)
    ...

    ; integer? - check whether the first arg is an integer
    ((integer?)
     (compile-expr (primitive-op-arg1 form))
     (emit "andl $~a, %eax" fixnum-mask)
     (emit-is-eax-equal-to 0))

    ; boolean? - check whether the first arg is a boolean
    ((boolean?)
     (compile-expr (primitive-op-arg1 form))
     (emit "andl $~a, %eax" bool-mask)
     (emit-is-eax-equal-to bool-tag))

    ; char? check whether the first arg is a character
    ((char?)
     (compile-expr (primitive-op-arg1 form))
     (emit "andl $~a, %eax" char-mask)
     (emit-is-eax-equal-to char-tag))

    ; zero? check whether the first arg is the fixnum 0
    ((zero?)
     (compile-expr (primitive-op-arg1 form))
     (emit-is-eax-equal-to 0))
  ))

Binary Primitives

Unary operations are great, but now it's time to extend our compiler to multi-argument expressions, but we run into a problem. We can't evaluate a + b by just keeping our intermediate value in %eax as we've been doing up to this point. We have to evaluate a, then evaluate b while preserving a's old value, and finally add the two and store the result back in %eax.

To solve this problem, we'll have to store the intermediary result somewhere; the stack's already conveniently allocated for us, there's a pointer to it in the %esp register, and storing temporary data is its whole purpose in life. Let's take advantage of it.

In memory, the stack is laid out starting at high addresses and growing downwards towards 0x00000000. As far as our code is concerned, the first word of memory it's safe to use is at %esp-4, the next is at %esp-8, and so on. Anything at the stack pointer and above belongs to the code calling us and isn't safe to overwrite.

   low addresses
  ---------------
  |             |
  |             |
  |             |
  |             |
  |             | ...
  |             | %esp-24
  |             | %esp-20
  |             | %esp-16
  |             | %esp-12
  |             | %esp-8
  |             | %esp-4
  |-------------|
  | saved regs  | <---- %esp
  |-------------|
  | return addr |
  |-------------|
   high addresses

So to store intermediate data on the stack, we have two (actually more, but I'll present two here) options: we can use the push and pop instructions to move the stack pointer up and down, storing and retrieving data as it goes, or we can keep track of the offsets ourselves and leave %esp in one place. While it might seem like more work to go with option 2, it'll help us out later when it comes time to add local variable bindings, so we're going to go with that.

First, we have to augment our compile-primitive-call and compile-expr functions with a new parameter, si (stack index). This is the first stack offset that's safe to store data in. Initially it'll be set to -4, but as we store more intermediate results it'll decrease to access further elements on the stack.

(define wordsize 4)

(define (compile-primitive-call form si)
  (case (primitive-op form)
    ...
    same contents, but every si is passed unchanged to every (compile-expr) call 
    ...
    ))

(define (compile-expr e si)
  (cond
    ((primitive-call? e) (compile-primitive-call e si))
    ))

(define (compile-program program)
  ...
  (compile-expr program (- wordsize))
  ...)

After testing that everything still works, we can move on to writing basic binary primitives. To store intermediate values, we copy them from %eax to offsets into the stack like -4(%esp), and adjust the stack offset passed to the second call to compile-expr. Note that the parameters to - are evaluated in reverse order, since the second value is getting subtracted from the first.

(define (compile-primitive-call form si)
  (case (primitive-op form)
    ...
    ; + - addition of two fixnums
    ((+)
     (compile-expr (primitive-op-arg1 form) si)
     (emit "movl %eax, ~a(%esp)" si)
     (compile-expr (primitive-op-arg2 form) (- si wordsize))
     (emit "addl ~a(%esp), %eax" si))

    ; - - subtraction of two fixnums
    ((-)
     (compile-expr (primitive-op-arg2 form) si)
     (emit "movl %eax, ~a(%esp)" si)
     (compile-expr (primitive-op-arg1 form) (- si wordsize))
     (emit "subl ~a(%esp), %eax" si))
  ))

We can also implement multiplication, equality, and a less-than operation. Equality and less-than work the same way as the type predicates did, and char equality is just a shifted version of the same, but multiplication's a little trickier. Since we have as input what are essentially two integers 4*a and 4*b, and we want a result of 4*(a*b) as output, we have to divide one of the inputs to the multiplication by four, since otherwise we'd get 16*a*b. We can do this efficiently by shifting the input right by two bits.

(define (compile-primitive-call form si)
  (case (primitive-op form)
    ...
    ; * - multiplication of two fixnums
    ((*)
     (compile-expr (primitive-op-arg1 form) si)
     (emit "movl %eax, ~a(%esp)" si)
     (compile-expr (primitive-op-arg2 form) (- si wordsize))
     (emit "shrl $~a, %eax" fixnum-shift) ; make sure only one is shifted
     (emit "imull ~a(%esp), %eax" si))

    ; = - fixnum equality test
    ((=)
     (compile-expr (primitive-op-arg1 form) si)
     (emit "movl %eax, ~a(%esp)" si)
     (compile-expr (primitive-op-arg2 form) (- si wordsize))
     (emit "cmpl %eax, ~a(%esp)" si)
     (emit "movl $0, %eax")
     (emit "sete %al")
     (emit "sall $~a, %eax" bool-shift)
     (emit "orl $~a, %eax" bool-tag))

    ; = - fixnum less-than test
    ((<)
     (compile-expr (primitive-op-arg1 form) si)
     (emit "movl %eax, ~a(%esp)" si)
     (compile-expr (primitive-op-arg2 form) (- si wordsize))
     (emit "cmpl %eax, ~a(%esp)" si)
     (emit "movl $0, %eax")
     (emit "setl %al")
     (emit "sall $~a, %eax" bool-shift)
     (emit "orl $~a, %eax" bool-tag))

    ; char=? - character equality test
    ((char=?)
     (compile-expr (primitive-op-arg1 form) si)
     (emit "shrl $~a, %eax" char-shift) ; remove tag bits
     (emit "movl %eax, ~a(%esp)" si)
     (compile-expr (primitive-op-arg2 form) (- si wordsize))
     (emit "shrl $~a, %eax" char-shift) ; remove tag bits
     (emit "cmpl %eax, ~a(%esp)" si)
     (emit "movl $0, %eax")
     (emit "sete %al")
     (emit "sall $~a, %eax" bool-shift)
     (emit "orl $~a, %eax" bool-tag))
  ))

Testing

At this point our compiler has gained more features than we can easily test by hand every time we make a change. If we want to keep scaling up its feature set with confidence, we'll need a set of automated tests we can run to make sure nothing's broken. Since the exact mechanics of the test suite aren't relevant to actually building the compiler, I'm not going to go over the implementation here. However, you can download it along with the rest of the source code, and I'll be adding new test cases as the compiler gains new features.

If you have Guile installed on your system, you can run it like so:

$ guile --fresh-auto-compile tests.scm
...
$

Each test case that passes will be printed along with "success", and any failures will be highlighted in ALL CAPS along with their actual and expected outputs. Next time, we'll add more basic forms and implement a simple memory allocation scheme.