A small guide on writing interpreters

This is an average-length (~10 pages) blog post about how to write a parser, AST builder, bytecode compiler, and a bytecode interpreter - in general, and in GameMaker specifically.

Intro

As an example that actually fits into a blog post (rather than a book), here I'm going to outline a very simple interpreter - one that computes result of an expression that contains numbers (15, -4, ...), instance variables (x, y, ...), parentheses (()), and basic operators (+,-,*,/,%,div).

It's going to look like this (click and try x, y, image_index, numbers, operators):

Canvas is not supported :(

(note: the demo isn't normal GM so it only has those 3 instance variables)

GMS1 notes

Before we start - examples here are for GameMaker Studio 2.

If you are still using GameMaker: Studio 1, you'll need to change a few things:

Array declarations

GMS1 didn't have inline array declarations yet so instead of

ds_list_add(out, [a, b, c]);

you could make some script called pack,

/// pack(...values)
var arr = array_create(argument_count);
for (var i = 0; i < argument_count; i++) arr[i] = argument[i];
return arr;

and do

ds_list_add(out, pack(a, b, c));

Macros

GMS1 didn't have inline macro declarations yet so for each

#macro name value

You'd add the according macro using the Macros editor (GMEdit can make this a bit faster).

Or you could just use global variables directly.

Non-GameMaker notes

If you are not using GameMaker at all, some of this might not make immediate sense, so

  • Syntax is largely akin to JS but with some C elements.
  • Scripts/functions are globally scoped.
  • #macro-s are globally scoped and work much like C's #define.
  • global is a globally visible object that "static" variables are commonly held in.
  • value[|index] is an array access operator for ds_list type. (because historically references are indexes rather pointers)
  • Arrays start at 0 but strings start at 1.

Initial setup

This example is going to be called "tiny expression runtime", so the prefix for script names is going to be txr_. The init script is going to be called txr_init.

For a start it'll just have a macro for a global variable that'll contain the error text (if any):

#macro txr_error global.txr_error_val

Now, about errors: with process being done in multiple steps (and, sometimes, nested calls) there is a need for a way to quickly back out of the compilation process when an error occurs. In some languages it is tempting to use exception handling for this, but that may limit compile-time optimizations, and GameMaker doesn't support exception handling yet, so here's for something different:

Instead of making scripts return whatever intended result, the result can be written into a global variable, an error (if any) can be written into other global variable, and the script can return whether an error occurred.

Then, such calls can be made as if (script(...)) return true, which will allow the program to easily back out of the nested call when an error occurs.

So you can have a script called txr_throw with

/// @desc txr_throw(error_text, position)
/// @param error_text
/// @param position
txr_error = argument0 + " at position " + string(argument1);
return true;

and do return txr_throw("<error text>", position); to dispatch an error.

Parser

A parser's job is to take a string and output a list of tokens.

This is done by looping over the string's characters and adding matching tokens to the resulting list.

Setup

Firstly, a bit of initial setup for txr_init.

In GM, a ds_list currently is better for accumulating values, so we're going to keep one around.

And make two enums - one for token types and one for operator types.

// parser:
#macro txr_parse_tokens global.txr_parse_tokens_val
global.txr_parse_tokens = ds_list_create();
enum txr_token {
    eof = 0, // <end of file>
    op = 1, // + - * / % div
    par_open = 2, // (
    par_close = 3, // )
    number = 4, // 37
    ident = 5, // some
}
enum txr_op {
    mul  = 0x01, // *
    fdiv = 0x02, // /
    fmod = 0x03, // %
    idiv = 0x04, // div
    add  = 0x10, // +
    sub  = 0x11, // -
    maxp = 0x20 // maximum priority
}

Operator enum fields have values not without a reason - it's so that the higher 4 bits contain operator precedence. But that's for a bit later. Also 0x10 would be $10 if you use GMS1.

Start

As per above, this is simple enough - we measure string length and then loop over it, extracting the current character's code and comparing it in a switch block:

/// @desc txr_parse(string)
/// @param str
var str = argument0;
var len = string_length(str);
var out = txr_parse_tokens;
ds_list_clear(out);
var pos = 1;
while (pos <= len) {
    var start = pos;
    var char = string_ord_at(str, pos);
    pos += 1;
    switch (char) {

(if you do not work in GM and have string character indexes start at 0, you would set pos to 0 instead and have (pos < len) instead of (pos <= len) in subsequent comparisons)

Whitespace

In common parsers, whitespace is insignificant and thus is just skipped over:

        case ord(" "): case ord("\t"): case ord("\r"): case ord("\n"): break;

If you do not work in GM, ord("<char>") is replaced by character's code at compile-time.

If you work in GMS1, this would have to be case 32: case 9: case 13: case 10: break; instead because there were no escape characters at that time.

Parentheses

These are one of the simpler things, each being given their own token type:

        case ord("("): ds_list_add(out, [txr_token.par_open, start]); break;
        case ord(")"): ds_list_add(out, [txr_token.par_close, start]); break;

Operators

Supported operators are single-character so are pretty straightforward to implement:

        case ord("+"): ds_list_add(out, [txr_token.op, start, txr_op.add]); break;
        case ord("-"): ds_list_add(out, [txr_token.op, start, txr_op.sub]); break;
        case ord("*"): ds_list_add(out, [txr_token.op, start, txr_op.mul]); break;
        case ord("/"): ds_list_add(out, [txr_token.op, start, txr_op.fdiv]); break;
        case ord("%"): ds_list_add(out, [txr_token.op, start, txr_op.fmod]); break;

As you can see, token information is packed into small arrays here. Ideally you would want algebraic data types for this, but, alas - most languages do not spot such a construct, so you've got to use arrays or small container classes.

If you have multi-character operators (e.g. +=), you would detect them sequentially - match the first character first, then check if subsequent characters match the expected as well.

Numbers

Numbers are detected in the default case,

        default:
            if (char >= ord("0") && char <= ord("9")) {
                while (pos <= len) {
                    char = string_ord_at(str, pos);
                    if (char >= ord("0") && char <= ord("9")) {
                        pos += 1;
                    } else break;
                }
                var val = real(string_copy(str, start, pos - start));
                ds_list_add(out, [txr_token.number, start, val]);
            }

Here we check if character is in 0..9 range, then skip over any additional characters that are digits as well, convert the resulting substring to a number, and add an according token type.

A more proper parser would have this a little more complex, checking for period . character, and possibly e# suffix for expontential notation (if you choose to support that).

Identifiers

Identifiers (variable names and keywords) are handled very similarly,

            else if (char == ord("_")
                || (char >= ord("a") && char <= ord("z"))
                || (char >= ord("A") && char <= ord("Z"))
            ) {
                while (pos <= len) {
                    char = string_ord_at(str, pos);
                    if (char == ord("_")
                        || (char >= ord("0") && char <= ord("9"))
                        || (char >= ord("a") && char <= ord("z"))
                        || (char >= ord("A") && char <= ord("Z"))
                    ) {
                        pos += 1;
                    } else break;
                }
                var name = string_copy(str, start, pos - start);
                switch (name) {
                    case "mod": ds_list_add(out, [txr_token.op, start, txr_op.fmod]); break;
                    case "div": ds_list_add(out, [txr_token.op, start, txr_op.idiv]); break;
                    default: ds_list_add(out, [txr_token.ident, start, name]); break;
                }
            }

First character of an identifier can be _, a..z, or A..Z. Subsequent characters can be digits too. When we're done reading the thing, we check if it matches keywords (just div and mod in this case) to output a non-ident token instead.

Error handling

Finally, if the character isn't anything expected, we show an error and quit early:

            else {
                ds_list_clear(out);
                return txr_throw("Unexpected character `" + chr(char) + "`", start);
            }

Parsers usually have very few error conditions - it's only that something is unexpected, or that you have an unclosed string.

Finish

When we're done reading, we add an EOF token (to avoid checking for out-of-bounds reads later) and return false to indicate that no error is occurred and the list is now populated:

    }
}
ds_list_add(out, [txr_token.eof, len + 1]);
return false;

As additional note here, since GameMaker stores strings directly in UTF8 format, if you intend to parse potentially long strings, it becomes beneficial to use string_byte_at or read from a buffer instead.

Building the AST

The next step is to build an abstract syntax tree from the tokens.

It is possible to avoid AST entirely (Lua is a good example), but this does have implications for language design and makes structure transformations harder.

Setup

We're going to need a couple more macros and enums in txr_init.

// builder:
#macro txr_build_list global.txr_build_list_val
#macro txr_build_node global.txr_build_node_val
#macro txr_build_pos  global.txr_build_pos_val
#macro txr_build_len  global.txr_build_len_val
enum txr_node {
    number = 1, // (number)
    ident = 2, // (name)
    unop = 3, // (unop, node)
    binop = 4, // (binop, a, b)
}
enum txr_unop {
    negate = 1, // -value
}
enum txr_build_flag {
    no_ops = 1,
}

Here we have an enum for AST node types (of which we have only a couple), an enum for unary operators (of which we only have negation), and a couple of global variables to share in scripts.

We'll also need a helper script to throw an error on a token/node position, txr_throw_at:

/// @param error_text
/// @param token
var tk = argument1;
if (tk[0] == txr_token.eof) {
    return txr_throw(argument0, "<EOF>");
} else return txr_throw(argument0, tk[1]);

Start

Most of the setup is going to happen in a separate script - say, txr_build:

txr_build_list = txr_parse_tokens;
txr_build_pos = 0;
txr_build_len = ds_list_size(txr_build_list);
if (txr_build_expr(0)) return true;
if (txr_build_pos < txr_build_len - 1) {
    return txr_throw_at("Trailing data", txr_parse_tokens[|txr_build_pos]);
} else return false;

Nothing particular to this - we set up the list and position/length variables, recursively read an expression from it, and throw an error if there was anything else after that.

When building statements instead of expressions, you would instead read statements into a list until end of file is reached.

Expression builder

txr_build_expr would be as following:

/// @param flags
var flags = argument0;
var tk = txr_build_list[|txr_build_pos++];
switch (tk[0]) {
    case txr_token.number: txr_build_node = [txr_node.number, tk[1], tk[2]]; break;
    case txr_token.ident: txr_build_node = [txr_node.ident, tk[1], tk[2]]; break;
    case txr_token.par_open: // (value)
        if (txr_build_expr(0)) return true;
        tk = txr_build_list[|txr_build_pos++];
        if (tk[0] != txr_token.par_close) return txr_throw_at("Expected a `)`", tk);
        break;
    case txr_token.op: // -value, +value
        switch (tk[2]) {
            case txr_op.add:
                if (txr_build_expr(txr_build_flag.no_ops)) return true;
                break;
            case txr_op.sub:
                if (txr_build_expr(txr_build_flag.no_ops)) return true;
                txr_build_node = [txr_node.unop, tk[1], txr_unop.negate, txr_build_node];
                break;
            default: return txr_throw_at("Unexpected token", tk);
        }
        break;
    default: return txr_throw_at("Unexpected token", tk);
}
if ((flags & txr_build_flag.no_ops) == 0) {
    tk = txr_build_list[|txr_build_pos];
    if (tk[0] == txr_token.op) {
        txr_build_pos += 1;
        if (txr_build_ops(tk)) return true;
    }
}
return false;

This is considerably straightforward - the type of value is decided based on first token, So, if you have txr_token.number 4, that's converted to txr_node.number straight away; If it's -4 (txr_op + txr_token), 4 is going to be read by calling txr_build_expr again after reading the minus-sign.

The flag comparison branch following it is what will start assembling operators when encountering an operator after an expression.

Things like function calls would be checked prior to going into that branch;

Other suffixes (such as array access []) would be handled by turning the branch into a loop that combines allowed trailing data and quits once encountering something else.

Operator precedence

Now, to the interesting part - the txr_build_ops script.

The common way of doing this is via Shunting-yard algorithm, which is nice, although not specifically intended for producing abstract syntax trees.

Here a purpose-specific solution is going to be used.

First, a pair of lists are going to be created, one for values, and one for operators:

/// @param first
var nodes = ds_list_create();
ds_list_add(nodes, txr_build_node);
var ops = ds_list_create();
ds_list_add(ops, argument0);

then, we are going to populate these:

Each iteration, an expression is read and added to the list. If the subsequent token is an operator, it is added to the list and another expression is read. The loop stops once a non-operator expression is encountered (for this, only EOF):

var tk;
while (1) {
    if (txr_build_expr(txr_build_flag.no_ops)) {
        ds_list_destroy(nodes);
        ds_list_destroy(ops);
        return true;
    }
    ds_list_add(nodes, txr_build_node);
    //
    tk = txr_build_list[|txr_build_pos];
    if (tk[0] == txr_token.op) {
        txr_build_pos++;
        ds_list_add(ops, tk);
    } else break;
}

After this, for 1 + 2 * 3 - 4, these would be as following (aligned for convenience):

nodes: 1 2 3 4
ops:    + * -

You might suspect where this is going - we are just going to pair expressions together from the "most important" to the "least important" operator:

var n = ds_list_size(ops);
var pmax = (txr_op.maxp >> 4);
var pri = 0;
while (pri < pmax) {
    for (var i = 0; i < n; i++) {
        tk = ops[|i];
        if ((tk[2] >> 4) != pri) continue;
        nodes[|i] = [txr_node.binop, tk[1], tk[2], nodes[|i], nodes[|i + 1]];
        ds_list_delete(nodes, i + 1);
        ds_list_delete(ops, i);
        n -= 1; i -= 1;
    }
    pri += 1;
}

So, for above, the steps would be like

1 + 2 * 3 - 4
1 + (2 * 3) - 4
(1 + (2 * 3)) - 4
((1 + (2 * 3)) - 4)

Finally, the combined expression is pulled from the list, and lists are cleaned up:

txr_build_node = nodes[|0];
ds_list_destroy(nodes);
ds_list_destroy(ops);
return false;

Also, to note: while it might be tempting to reuse a pair of global lists for this, you shouldn't, as something like 2 - (3 + 4) would trigger a separate txr_build_ops call for contents of parentheses.

Compiling bytecode

Next up is bytecode compilation.

Depending on your setup, you may do without this and execute the AST straight away, but that can be a little slower, and will offer it's own challenges when coming to statements.

Here we are going to do a stack machine, which is perhaps the most straightforward as far as compilation and execution goes.

So, if you had

A - (B + 3)

it would be compiled to

push variable A to stack
push variable B to stack
push number 3 to stack
pop 2 values from stack, add them, and push result to stack
pop 2 values from stack, subtract them, and push result to stack

which can be executed linearly inside a single script.

Setup

A few more things go to txr_init:

#macro txr_compile_list global.txr_compile_list_val
txr_compile_list = ds_list_create();
enum txr_action {
    number = 1, // (value): push(value)
    ident = 2, // (name): push(self[name])
    unop = 3, // (unop): push(-pop())
    binop = 4, // (op): a = pop(); b = pop(); push(binop(op, a, b))
}

Actions are pretty much like AST nodes, except flattened.

Start

Same as before, most pre-call setup goes into a separate script - in this case, txr_compile:

/// @param source_string
if (txr_parse(argument0)) return undefined;
if (txr_build()) return undefined;
//
var out = txr_compile_list;
ds_list_clear(out);
if (txr_compile_expr(txr_build_node)) return undefined;
//
var n = ds_list_size(out);
var arr = array_create(n);
for (var i = 0; i < n; i++) arr[i] = out[|i];
ds_list_clear(out);
return arr;

As can be seen, this also ties together previously written parser and AST builder sections.

Forming an array of arrays doesn't quite fit the conventional definition of bytecode, but in GameMaker array access is noticeably faster than memory access functions (even with split up memory access in mind), so there's little reason to complicate debugging for yourself by forming a raw memory buffer instead.

Compiling expressions

Recursive compilation is to happen in txr_compile_expr:

/// @param node
var q = argument0;
var out = txr_compile_list;
switch (q[0]) {
    case txr_node.number: ds_list_add(out, [txr_action.number, q[1], q[2]]); break;
    case txr_node.ident: ds_list_add(out, [txr_action.ident, q[1], q[2]]); break;
    case txr_node.unop:
        if (txr_compile_expr(q[3])) return true;
        ds_list_add(out, [txr_action.unop, q[1], q[2]]);
        break;
    case txr_node.binop:
        if (txr_compile_expr(q[3])) return true;
        if (txr_compile_expr(q[4])) return true;
        ds_list_add(out, [txr_action.binop, q[1], q[2]]);
        break;
    default: return txr_throw_at("Cannot compile node type " + string(q[0]), q);
}
return false;

This remains straightforward - for simple node types, you add the according actions directly; For nodes with child nodes, you add actions for child nodes first, and then the combining action.

Running bytecode

As per above, a bytecode interpreter's job is to take a list of "actions" and execute them, stopping only if an error is detected.

Setup

For once, there's no additional setup here.

However, we're going to need a small helper script for throwing an error and cleaning up the temporary stack structure, called txr_exec_exit:

/// @param error_text
/// @param action
/// @param stack
ds_stack_destroy(argument2);
return txr_throw(argument0, argument1[1]);

Executing bytecode

The magic happens entirely in a script that is to be named txr_exec:

/// @param actions
var arr = argument0;
var len = array_length_1d(arr);
var pos = 0;
var stack = ds_stack_create();
while (pos < len) {
    var q = arr[pos++];
    switch (q[0]) {
        case txr_action.number: ds_stack_push(stack, q[2]); break;
        case txr_action.unop: ds_stack_push(stack, -ds_stack_pop(stack)); break;
        case txr_action.binop:
            var b = ds_stack_pop(stack);
            var a = ds_stack_pop(stack);
            switch (q[2]) {
                case txr_op.add: a += b; break;
                case txr_op.sub: a -= b; break;
                case txr_op.mul: a *= b; break;
                case txr_op.fdiv: a /= b; break;
                case txr_op.fmod: if (b != 0) a %= b; else a = 0; break;
                case txr_op.idiv: if (b != 0) a = a div b; else a = 0; break;
                default: return txr_exec_exit("Can't apply operator " + string(q[2]), q, stack);
            }
            ds_stack_push(stack, a);
            break;
        case txr_action.ident:
            var v = variable_instance_get(id, q[2]);
            if (is_real(v) || is_int64(v) || is_bool(v)) {
                ds_stack_push(stack, v);
            } else return txr_exec_exit("Variable `" + q[2] + "` is not a number", q, stack);
            break;
        default: return txr_exec_exit("Can't run action " + string(q[0]), q, stack);
    }
}
var r = ds_stack_pop(stack);
ds_stack_destroy(stack);
txr_error = "";
return r;

Here we loop over the actions of a "program", executing them accordingly. Error handling can be kept to minimum, as data is freshly generated and checked.

While it may be tempting to make the stack global, doing so will produce oddities if you later decide to implement function calls that can result in execution of another program - a problem that GameMaker's own runtime suffers from at the time of writing (which is why you can't have DLLs call arbitrary GML code).

In conclusion

You know, I somehow thought that this was going to be shorter.

Some downloads:

  • Project from this blog post: txr.yyz | source code | haxe version
  • A slightly fancier expression interpreter: sxr.yyz | source code
    Supports more operators, strings, functions, constants, and compile-time optimizations.
    Note that this was made sometime mid-2017 so structure varies a little.

Have fun!

Update: Part 2 is now available!

Related posts:

5 thoughts on “A small guide on writing interpreters

      • I hate those people who doesn’t respect someone else’s work, and time. I hope that you don’t loose your motivation to create free content and guides like this :)

  1. In txr_parse this part:
    case ord(” “): case ord(“\t”): case ord(“\r”): case ord(“\n”): break;
    doesn’t work in GMS1 giving compile error. Acording to docs it only takes first character of string and returns its unicode value. So GMS1 returns unicode value only for backslash and compiler throws duplicate case statement error.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.