A small guide on writing interpreters, part 2

label hello: select dialog("Hello! What would you like to do?") {
    option "Count to 5":
        for (var i = 1; i <= 5; i = i + 1) {
            wait(1);
            trace(i + "!");
        }
        return 1;
    option "Nothing": jump yousure;
}
label yousure: select dialog("You sure?") {
    option "Yes": trace("Well then,"); return 0;
    option "No": jump hello;
}

Example of supported syntax

As some might remember, earlier this year I have published a small guide on writing interpreters, which went over the process of implementing a basic interpreter capable of evaluating expressions consisting of numbers, operators, and variables.

This continuation of the guide further expands upon concept, outlining how to support calls, statements, and branching - enough for a small scripting language.

Intro

Majority of the patterns are going to remain exactly the same as with the first part, therefore familiarizing yourself with it is recommended.

For this part, as there is a little more code, I'm going to outline the actual logic, and link to the associated code snippet for each step.

Expressions vs statements

Before we start, let's clarify what expressions and statements are [in context of this tutorial], and the difference between the two.

An expression, in short, is anything that you could use as a variable value or as a function argument.
Numbers are expressions, binary operations (a + b) too, and so are the function calls (fn(...)).

A statement is what you could put inside {} or as a then-action in an if-then-else.
return, conditional branching, and loops are all statements.

Some things (such as function calls) are allowed to be both a statement or an expression.
Some languages allow anything to be a statement (and, sometimes, statements to be expressions), but we shall not cover such things here for sake of syntactical clarity.

Small additions

Before we get to the actual statements, let's expand the syntax a tiny bit.

String literals

It's about time we get something that isn't numbers in here.

The parsing step is pretty simple - if we encounter a quote character (we shall allow both "string" and 'string'), we loop through the code until we find another of these or reach the end of it (in which case it's an unclosed string and an error). Then we get the snippet between quotes and pack it up into a token.
If you wanted to support escape characters, you would write characters to a buffer instead, and process escape characters to write something other than themselves.

Build and compile steps are pretty much the same as with numbers - we take the token/node and repackage it into a node/action.

Execution sees a few additional changes - as we now have not just the numbers, we shall make sure that operations cannot be carried out on incompatible types.
We shall allow "a" + 5 ~> "a5", but not go full JavaScript with "5" - 2 ~> 3.
Protip: Never go full JavaScript with implicit casts.

Fractional number literals

That is, not just 4, but 4.5 too.

This only requires to tweak the parser a little to permit a single period . in a number.

With this in mind, no additional validation is required before parsing the number.

Function calls

The butter of the most programming languages.

So long as you have functions, most syntactic omissions can be corrected.
Need array access for your oddish custom structure? Make a function for it.
Need to give interpreter access to game's globals? You can also make a function.
You could even allow interpreted code to generate and compile snippets of other interpreted code for ad-hoc reflection! But please don't.

The parser needs a single extra line to support commas.
Ever see a language without commas in function calls? Not a good sight.

The AST builder is where things get interesting.
So, where we would previously just output a single identifier node, we now check if an identifier is followed by a (, and then:

  • Skip over that (
  • Set up an array to hold the arguments, alongside with an argument counter.
  • Also set up a flag so that we can distinguish between the loop finishing normally and hitting EOF.
  • While there are still tokens left ahead,

    • If we're in front of a ), we skip that, set the flag, and leave the loop.
    • Otherwise we read the next argument expression and stuff it into the array.
    • If the expression is followed by a comma, we skip it.
      If it's neither that or a ), that's illegal.

  • If the function exists and the argument count matches, pack up the function call node.
    Function definitions are filled out in a global dictionary separately.

As an aside here, if you wanted to support functions-as-values, you would instead check for a ( after an expression (like it goes with binary operators), store the to-be-called expression with the arguments, and perform all checks during execution.

Compilation is nothing particular - you generate actions for each of the functions arguments, and then for the call itself (which instructs what to call and how many arguments to get off the stack).

Execution is generally straightforward:

  • Pop the now-computed arguments one by one off the stack into a temporary array.
    Keep in mind that due to execution order the last argument is on top of the stack.
    If your language of choice supports pointers, you may instead pass a pointer to the first argument in the stack, but be careful.
  • Reset the global error marker (which your functions can set to throw an error)
  • Call the function in question with the arguments we've got.
    GameMaker has this slightly worse than most language for the lack of a reflection function that would call a function with the specified array of arguments.
    In other words, you have a switch block for each number of arguments.
  • Check for error marker and either push the result onto stack, or handle the error.

Statements

Now that all of that is out of the way, we can finally get to the important things - statements!
This will involve a couple of modifications to how everything works.

The parser will stay largely unchanged for now, we'll only add a return token to it so that we can verify that anything works as such.

For the AST builder, we'll have a new, separate script for building statements. For now it will:

  • If there's a return token, read the subsequent expression to form a return-node.
  • Otherwise, try to read an expression and see if it's the kind that can be a statement too.
    As previously outlined, for now that's just the function calls.
    If it is, we pack it into a discard <expr> node, which will run the expression, but remove the result from the stack. This way we won't have to add a "push to stack?" flag to each such action.
    If it's not, throw an error about not getting a statement.

We shall also change how the base txr_build works - instead of reading a single expression and exiting, we shall read statements until we hit the end of input, and pack them into a block-node.

Compilation for the new node types is straightforward:

  • Both return and discard nodes process their expression first and then add their action, not too unlike the unary - operator.
  • A block {} node just outputs all of it's children statements.
    As we know that statements are warranted to leave the stack balanced, we don't have to output anything else here.

Execution sees minimal additions:

  • If upon reaching the end of the code the stack is empty, we output 0.
  • return action moves the program to the end of the code
    (after pushing the to-be-returned-value to the stack)
  • discard action pops a value from the stack
    (after it being pushed by preceding action(s))

Blocks & branching

Conditionals (and initial setup)

We're going to allow if-then-else statements. The syntax will be as usual:

if <condition-expression> <then-statement> [else <else-statement>]

This maps straightforwardly for the AST builder:

  • Read the condition-expression
  • Read the then-statement.
  • If the next token is else, skip it and read the else-statement, packing it as if-then-else.
    Otherwise it's a regular if-then statement.

Compilation is the interesting part here.
We are going to add jump and jump unless instructions for the runtime, which means that

if (some) trace("hi");
...

would be executed as:

push(some)
if !pop() jump(label1) // <- jump_unless
    push("hi!")
    call(trace, 1)
label1: ...

and

if (some) trace("yes") else trace("no")

would be executed as:

push(some)
if !pop() jump(label1) // <- jump_unless
    push("yes")
    call(trace, 1)
    jump(label2)
label1: // else
    push("no")
    call(trace, 1)
label2: ...

On compiler side we accomplish this by first outputting a jump/jump_unless action with no jump destination, then outputting the statement(s) from the branch, and then patching the jump-action to point at the end of the branch in compiled code.

Runtime side is just about what you would expect - jump simply changes the index of the next action to be executed, and jump_unless only does so if the freshly pop-ed value is not true-ish.

Blocks

Now that we have some actual branching, it's about time that we allow to put more than one statement per branch. To do so, we'll adopt the conventional curly bracket syntax { ...statements }.

For the parser we'll have a whole two extra lines outputting tokens for { and } accordingly.
This is no different from ( and ) or most other things in there.

AST builder part might remind you of how function calls were handled, except that we look for } instead of a ), and do not require delimiters between statements.
(by which I mean - you can later allow ; as a statement delimiter, but with well-thought-out syntax you don't need one)

There's nothing to be done on compiler or runtime sides because we've already made the builder assume the top-level node to be a block.

Setting variables

This would be the point where I get to explaining loops, but, wait - you can't have a loop if you can't change variables!
Not unless it's a GML-style repeat (times) { ... } loop anyway.
Or if your function calls have side effect on program state.
But, you know, still inconvenient.

As it goes, the parser will need another line to recognize our newly added = operator.

In the AST builder, we shall check if a non-statement expression is being followed by a =, and assume that to be a set-statement until further notice.

In the compiler, we shall check if the node is actually of a settable kind (for now, only variable-nodes are), and produce an action that would set the said expression kind if so.

Runtime side is simple here - we pop the new variable value from the stack, and feed it to the function for setting a variable by name.

Comparison operators

These aren't strictly needed, but, you know, may as well.

I'll try to keep this short as there's not much of interest, so here are the changes:

  • On the parser side we handle the new token types while distinguishing between single-character and two-character operators (< vs <=).
  • Builder is left untouched except for handling unary !, which reads the next expression (without any following operators) and packs it as !(value).
  • The compiler is left completely untouched.
  • The runtime gets new branches to handle the new operators.
    In my case I let base GM rules decide if something is true-ish for ! and whether the values are equal for == / !=.

Boolean & bitwise operators

A good scripting language needs boolean operators (AND, OR), and, ideally, short-circuit evaluation. And maybe some bitwise operators too. But let's look at the interesting part first:

Let's suppose that you have fn1() && fn2() and don't want to run fn2 if fn1 returns a true-ish value. While this is quite commonly done by using existing branching instructions,

call(fn1, 0)
if (!pop()) jump(label1) // <- jump_unless
    call(fn2, 0)
    jump(label2)
label1:
    push(false)
label2:

we can do better - by adding an instruction that checks the value on top of the stack, pops it if it's true-ish, or transits to an offset if it's not:

call(fn1, 0)
if (top()) pop() else jump(label1) // <- new action type
    call(fn2, 0)
label1:

thus, instead of 3 additional instructions, we only have a single one. Always nice.

The boolean OR is the same but in swapped order:

call(fn1, 0)
if (top()) jump(label1) else pop()
    call(fn2, 0)
label1:

(could be done by reusing AND and a unary inversion ! operator if you wanted)

As for bitwise operators, there's absolutely nothing remarkable about them - honestly I'm only adding them because I'm tweaking the parser anyway, and it's strange to not have any (see: Lua < 5.3).

The changes are reflective of what's said here - more unremarkable parser expansions, a special handling for AND/OR compilation, and AND/OR actions for runtime.

Loops

So we're finally here

A regular while-loop is by far the easiest one and is basically a looping if-statement:

// while (condition) { loop }
label1: // <- continue
    condition
    if (!pop()) jump(label2)
    
    loop
    jump(label1)
label2: // <- break

Continue/break statements also just transit to start/end of it accordingly.

A do-while loop performs a check after the iteration and is structured accordingly:

// do { loop } while (condition)
label1:
    loop
label2: // <- continue
    condition
    if (pop()) jump(label1) // <- jump_if
label2: // <- break

The loop body/condition order is swapped (and condition transits back to the loop), and continue statements jump forward to the condition, but otherwise it's fairly alike.

A for-loop is a slightly fancier version of the while-loop.
Some people argue that for this reason you can do without them, but it's the fact that doing continue still runs the post-statement that makes these valuable.

// for (init; condition; post) loop
init
label1:
    condition
    if (!pop()) jump(label3)
    
    loop
label2: // <- continue
    post
    jump(label1)
label3: // <- break

As far as the actual implementation goes:

  • We add new keywords to the parser
  • We build the AST as per earlier outlined rules.
    while and do-while loops are as straightforward as it gets.
    for loop has a bit of extra code to allow () around init/cond/post, and to allow semicolons
    (else you have for (i = 0 i < 5 i = i + 1), which is... strange)
    For the same reason, we shall permit optional semicolons after statements in general.
  • Building break/continue is a little more involved for the sole fact that you should not be able to use them just anywhere - only inside a loop. So we are going to have a pair of flags indicating whether both can be used, and a script that saves the previous values, flips them on, builds a statement, and then restores them.
  • Compilation will then make use of this - break/continue statements generate jump-actions with dummy offsets (-10 and -11) and then we replace them with real ones once we know where the points of interests are.
  • Runtime only needs a jump_if action and that's it.

Comments

Firstly, protip: comments aren't real.

That is, they are stripped during parse phase and have no influence on compilation.

But still nice to have! And not too hard to do. So we'll have those.

For single-line comments we check for // and skip till the end of the line.
For multi-line comments we check for /* and skip till we find */ or hit the end of the input.

The end?

Downloadable versions of the sample project can be found on itch.io or GM Marketplace.

Source code is also available in the in the repository - part-2 branch contains everything up until the end of this part of tutorial, and master branch contains any newer tweaks.

Have fun!

Related posts:

2 thoughts on “A small guide on writing interpreters, part 2

  1. Wooow this is incredible and full of awesome information. A must for everyone wanting to learn and dive into the world of interpreters :) thank you so much

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.