Tiny Great Languages: PL/0

This is part 6 from series “Tiny Great Languages”.

Let’s talk about Pascal. It’s the language I first learned many years ago, and it becomes the final one we cover in this series. Well, not the full version of Pascal, but a tiny educational subset defined by the legendary Niklaus Wirth.

Back in 1976 (around the same time most of the languages in this series were born), Wirth wrote a book titled Algorithms + Data Structures = Programs. This was one of the most influential books in computer science, inspiring Anders Hejlsberg to create the famous Turbo Pascal compiler. At the end of the book, there were chapters on language structures and compilers, explaining how to build a compiler with code generation for a small subset of Pascal known as PL/0.

For this post, we’ll keep things simple and build an interpreter for PL/0. It’s the largest code implementation in the series, but it’s also the most conventional one.

Here’s what a factorial program in PL/0 looks like:

var n, f;
begin
  n := 10;
  f := 1;
  while n > 0 do
  begin
    f := f * n;
    n := n - 1;
  end;
  ! f
end.

Looks familiar, right? It’s quite similar to Pascal.

Lexer

So far, none of the languages we’ve implemented had a real lexer. We used simple hacks like splitting strings by whitespace or brackets, or treating single characters as tokens.

Lexers are crucial for building interpreters and compilers. They transform a stream of bytes into a stream of language tokens, where each token has a type and a value:

# the code:
var n, f;

# becomes:
(keyword, "var")
(ident, "n")
(comma, ",")
(ident, "f")
(semicolon, ";")

In our case, we can cheat a bit by using regular expressions to tokenize the input with minimal code:

def lex(s):
    m = re.match(r"(?P<num>[0-9]+)|(?P<op>[-+*/()<>=])|(?P<ws>\s+)|(?P<kw>begin|end\.|end|if|then|while|do|var|!|\?|call|procedure)|(?P<id>[a-zA-Z]+)|(?P<semi>;)|(?P<asgn>:=)|(?P<comma>,)", s)
    if not m: raise SyntaxError("unexpected character")
    if m.lastgroup == "ws": return lex(s[m.end():])
    return s[:m.end()], m.lastgroup, s[m.end():]

Each call to lex() returns a token value, its type, and the remaining input string:

s = "var n, f;"
while s:
    tok, kind, s = lex(s)
    print(tok, kind)

# Output:
kw var
id n
comma ,
id f
semi ;

Of course, if the language supported more complex tokens (like string literals with escapes or unary operators), we’d need a more sophisticated lexer or even a lexer generator. But for PL/0, this approach will do.

Parser

With the lexer in place, we can now build a syntax tree. In previous languages, we didn’t need a formal syntax tree. Lisp was close, but it didn’t require one because of its “code is data” philosophy. This time, things are different.

To interpret the code, we need to parse it into a tree structure. The “root” of the tree is the program’s entry point, containing a block of statements. Each statement may contain expressions made of factors and terms.

The expression parser will look similar to what we implemented for BASIC, but the statements are very different. We will have conditional statements, loop statements, variable declarations, procedure declarations etc. We’ll represent each statement as a tuple, with the first element being the statement type, and the remaining elements being its components (nested statements).

Our main parser function will be parse(), and we’ll need a helper function eat() to consume tokens from the stream:

def parse(code):
    tok, kind, code = lex(code)

    def eat(expected):
        nonlocal code, tok, kind
        if kind != expected:
            raise SyntaxError(f"Expected {expected} but got {kind}")
        prevtok=tok
        tok, kind, code = lex(code)
        return prevtok

    # Example:
    eat("kw") # consume "var"
    v1 = eat("id") # consume "n"
    eat("comma")
    v2 = eat("id") # consume "f"
    eat("semi")

parse("var n, f;")

There might be a better way to organize the code, like turning the parser into a class and keeping the internal state there. But for simplicity, using nested functions with nonlocal variables works good enough.

The first statement in PL/0 is the “block”. Blocks begin with an optional list of variables, followed by procedures and a body of statements, often starting with “begin” and ending with “end”:

def block():
    var, proc = [], []
    # var <name> , ... ;
    while tok == "var":
        eat("kw") # consume "var"
        while tok != ";":
            var.append(eat("id")) # consume variable name
            if tok == ",": 
                eat("comma") # consume comma delimiter
        eat("semi") # consume final semicolon

    # procedure <name> ; begin ... end;
    while tok == 'procedure':
        eat('kw') # consume "procedure"
        n = eat('id') # consume procedure name
        eat('semi') # consume semicolon
        proc.append((n, block())) # consume procedure body (recursion)
        eat('semi') # consume final semicolon

    # begin ... end (statement)
    return 'block', var, proc, stmt() # consume main block body

This is the core of the parser, but it relies heavily on a stmt() function that can handle different types of statements:

def stmt():
    # <id> := <expr>
    if kind == "id":
        n = eat("id")
        eat("asgn")
        return "asgn", n, expr()

    # call <id> -- call a procedure
    elif tok == "call":
        eat("kw")
        n = eat("id")
        return "call", n

    # ? <id> -- console input
    elif tok == "?":
        eat("kw")
        n = eat("id")
        return "read", n

    # ! <expr> -- console output
    elif tok == "!":
        eat("kw")
        return "write", expr()

    # begin <stmt...> end -- composite statement
    elif tok == "begin":
        eat("kw")
        body = []
        while tok != "end" and tok != "end.":
            body.append(stmt())
            if tok == ";":
                eat("semi")
        eat("kw")
        return "begin", body

    # if <cond> then <stmt>
    elif tok == "if":
        eat("kw")
        c = cond()
        eat("kw")
        return ("if", c, stmt())

    # while <cond> do <stmt>
    elif tok == "while":
        eat("kw")
        c = cond()
        eat("kw")
        return ("while", c, stmt())

We now have a good overview of what PL/0 can do: it supports expressions, conditionals, loops, procedures (without arguments or return values), and basic input/output.

To finish the parser, we just need to handle expressions, conditions, factors, and terms:

def cond():
    e = expr()
    op = eat("op")
    return ("op", op, e, expr())

def expr():
    e = term()
    while tok in "+-":
        op = eat("op")
        e = ("op", op, e, term())
    return e

def term():
    t = factor()
    while tok in "*/":
        op = eat("op")
        t = ("op", op, t, factor())
    return t

def factor():
    if kind == "id":
        n = eat("id")
        return ("id", n)
    elif kind == "num":
        num = eat("num")
        return int(num)
    elif tok == "(":
        eat("op")
        e = expr()
        eat("op")
        return e

At the end of parse(), we return the top-level block, representing the main program:

return block()

Evaluation

With the syntax tree built, we can now traverse it to execute the code. You could also use the tree for other purposes, like pretty-printing, code analysis, or profiling, but in this case, we’ll just focus on running the code.

Our eval() function will handle all statement types and walk through the tree:

def eval(node, scope=(None, {}, {})):
    def find(x, i=1):
        frame = scope
        while frame != None:
            if x in frame[i]:
                return frame[i]
            else:
                frame = frame[0]

    if type(node) is int:
        return node
    elif node[0] == "id":
        return find(node[1], 1)[node[1]]
    elif node[0] == "asgn":
        env = find(node[1], 1)
        env[node[1]] = eval(node[2], scope)
    elif node[0] == "begin":
        for n in node[1]:
            eval(n, scope)
    elif node[0] == "read":
        env = find(node[1], 1)
        env[node[1]] = int(input("> "))
    elif node[0] == "op":
        return {
            "+": operator.add,
            "-": operator.sub,
            "*": operator.mul,
            "/": operator.floordiv,
            "<": operator.lt,
            ">": operator.gt,
            "=": operator.eq,
        }[node[1]](eval(node[2], scope), eval(node[3], scope))
    elif node[0] == "if":
        if eval(node[1], scope):
            eval(node[2], scope)
    elif node[0] == "while":
        while eval(node[1], scope):
            eval(node[2], scope)
    elif node[0] == "write":
        print(eval(node[1], scope))
    elif node[0] == "block":
        env = (scope, {}, {})
        for v in node[1]:
            env[1][v] = 0
        for p in node[2]:
            env[2][p[0]] = p[1]
        eval(node[3], env)
    elif node[0] == "call":
        eval(find(node[1], 2)[node[1]], scope)

The scope and find functions might seem a bit unclear. Since PL/0 allows for nested procedures, we need to manage variable scopes for them. In our interpreter, a scope is represented as a tuple containing three elements: a reference to the parent scope, a dictionary of variables, and a dictionary of procedures. Each new scope is created when entering a new block, and the find function works by traversing upwards through the scopes until it locates the required variable or procedure.

We can now run a factorial program with our PL/0 interpreter:

$ python pl0.py < factorial/fac.pas
3628800

For something a bit more challenging, let’s try running a program that finds prime numbers:

var max, arg, ret;

procedure isprime;
var i;
begin
  ret := 1;
  i := 2;
  while i < arg do
  begin
    if arg / i * i = arg then
    begin
      ret := 0;
      i := arg;
    end;
    i := i + 1;
  end;
end;

procedure primes;
begin
  arg := 2;
  while arg < max do
  begin
    call isprime;
    if ret = 1 then ! arg;
    arg := arg + 1;
  end;
end;

begin
  max := 100;
  call primes;
end.

This program should print the prime numbers between 2 and 100: 2, 3, 5, 7, 11, 13, 17, …, 89, 97.

Well, that’s it! We’ve journeyed through the programming languages of the 1970s, languages that have deeply influenced modern programming. Despite the constraints of hardware at the time, these languages and their implementations were often remarkably small and simple. Constraints can truly inspire creativity.

I hope reading this wasn’t a complete waste of time! As a bonus, the repository with all the code examples includes another little interpreter for the TCL language in just 48 lines of code. If you spot any errors in the interpreters, feel free to open a PR to fix them! All contributions are welcome, and I hope you have fun writing your own programming languages!

I hope you’ve enjoyed this article. You can follow – and contribute to – on Github, Mastodon, Twitter or subscribe via rss.

Sep 17, 2024

See also: Tiny Great Languages: APL and more.