Micro Pascal: from code to p-code

The Micro Pascal cross-compiler is coming a long nicely. User-defined functions or procedures aren’t supported yet. Neither are 8-bit characters or strings. FOR loops, IF-THEN-ELSE statements and 16-bit (unsigned) arithmetic operations do work! It can compile simple PASCAL programs like this:

VAR K : INTEGER;
BEGIN
    FOR K:=20 DOWNTO 1 DO
    BEGIN
       WRITE(K*K);
    END
END

Virtual machine instructions

The compiler emits instructions for a simple stack-based virtual machine. For example, the program above compiles to the following instructions:

0000 RESERVE 0002
0003 LIT 0014
0006 STOREP 0000
0009 LOAD 0000
000C LIT 0001
000F CGE
0010 JZ 0025
0013 LOAD 0000
0016 LOAD 0000
0019 MUL
001A WRITE
001B LOAD 0000
001E DEC
001F STOREP 0000
0022 JMP 0009

The ‘RESERVE 0002’ instruction allocates room for local variables on the call/return stack. In this case two bytes for the 16-bit integer K. ‘LIT 0014’ loads 14 hex (20 decimal) onto the operand stack. ‘STOREP 0000’ stores the value at the top of the operand stack (14 hex) into local memory 0, which is where K is stored. At the same time, the value is popped off the operand stack. So, these two instructions set K := 20.

Next, the value stored at memory location 0, again K, is loaded onto the stack. Then, a value of 1 is pushed onto the operand stack. The ‘CGE’, which means Compare Greater or Equal, compares the two top-most values on the stack, pops them off the stack, and pushes a 1 if the condition is true, or 0 if the condition is false.

If the top-most stack value is 0, JZ jumps to the address given as its argument. If not, it continues execution. The top-most stack value is always popped off. Here, the CGE and JZ check if we’re at the end of the loop. If so, a jump to address 25 is made: the program ends.

The instructions from 0013 to 001A square the value of K and print the result to the console. The instructions from 001B to 001F decrement the loop counter K. Finally, a jump to the end-of-loop code is made and the loop is conditionally executed again.

The abstract syntax tree

The Micro Pascal compiler generates the virtual machine code in the traditional way. A lexical analyzer reads the PASCAL source code and recognizes integers, strings, keywords, operators (+,-,/,*,:=,=,<,>,<=,>=,<>, ; , :), keywords (BEGIN, END, FOR, TO, DOWNTO, IF, THEN, ELSE, VAR, CONSTANT, WRITE, etc) and identifiers (variable/procedure/function names). When a part of the input is recognized, the type and position of occurrence is stored. This is termed a token.

When the entire source code as been tokenised, a parser checks the validity of the program. This is done by comparing the order of the tokens against a set of rules that describe the Micro Pascal language. For example, here is the rule that describes the IF-THEN-ELSE construct:

if_then_else

We see that an IF-THEN-ELSE construct must start with an IF token, which must be followed by an expression. The expression is not a token, but rather another construct that is defined elsewhere. The expression must be followed by a THEN token, which must be followed by a statement construct. Now there are two possibilities: we either encounter an ELSE token followed by another statement, or we have nothing (sometimes referred to as an epsilon).

During the parsing process, the parser collects information about the meaning of the program. For instance, it will find which operations are part of the expression and statements found in the IF-THEN-ELSE construct above. The collected information is then represented as a tree, termed the abstract syntax tree, or AST for short.

The Micro Pascal compiler has an option to dump the AST as a Graphviz .dot file for visualization. The compiling the example program results in the following graphical representation of the AST:

program

Each node in the tree corresponds to a language construct.

VM code generation

To generate the p-code for the vitual machine, the p-code generator traverses the tree, starting at the top, going left and depth first. Let’s do that for the example AST.

We start at the BLOCK node. No p-code needs to be emitted for this node, but it does indicate a new stack frame is needed to store local variables. We go left, to the VARDECL. Here we traverse all the child nodes (INT K, in this case) and collect their size in bytes so we can emit a RESERVE instruction. In this step, each variable is also allocated their storage address, which is address 0 for K.

Now we go up the tree again, to the second branch of the BLOCK node, down through the PROGBLOCK to the FOR node. We see that the FOR node uses ‘K’ as the loop variable. The left-most leaf-node is the FOR loop start expression (20), so we the p-code for K:=20. Then we check the loop variable for the end-of-loop condition, i.e. K<1, and exit the current PROGBLOCK if it is true. Now we emit the p-code belonging to the last FOR leaf-node; a PROGBLOCK.

Again, down the tree we go until we end up at the left-most ‘VAR K’ node. This node simply pushes the variable K onto the stack. We go back to the *-node and down to the other ‘VAR K’ node. Again, we push the variable K onto the stack. Now, we’re back at the *-node and emit a MUL operation. Back up the tree, to the WRITE node, where we emit the WRITE p-code. We keep going back until we reach the FOR node again.

Now, we emit the code to update the loop variable K. The parser has already stored the information on what to do, namely: decrement (–) it. We continue to go back up the tree until we reach the BLOCK node, and we’re done!

Address fixups

During the tree traversal, jump instructions are generated. For example, a conditional jump by the FOR node, to exit the FOR loop. At the time the jump instruction is emitted, the address to jump to is not yet known because the loop body code has not yet been generated. To solve this problem, a placeholder is inserted, termed a label, instead of the actual address. Only when all the p-code has been generated, and all the addresses are known, the labels are replaced with the correct addresses in a processes that is called fixup.

Up next…

There is still a lot to do; Micro Pascal still needs functions and procedures. Also, the 16-bit integer is the only available variable type. To comfortably work with characters and strings, an 8-bit character data type needs to be introduced. This is more complex than it might seem. Having more than one data type requires type checking to be implemented as well as extensions to the p-code instruction set.

Advertisements

Writing software for the HD6309 computer: Micro Pascal

At the time of writing, the HD6309 computer does not have an operating system. It can, however, load binary/SREC files via the UART terminal. I have been using the LWTOOLS toolchain to write programs in MC6809/HD6309 assembly language. This has been workable but not very efficient in terms of the time it takes to write them.

I have been searching for compilers and interpreters that I can port to the HD6309 computer. Several exist in the form of FLEX or OS9 binaries. None, that I could find, were available in source code format. Grant Searle solved this by recreating BASIC from the Tandy Color Computer based on the Unravelled series of books. I wanted a more structured language than basic, something more like C or PASCAL.

My search lead to a series of articles on the implementation of Tiny Pascal in BYTE magazine, which contains the full source code in BASIC. This piqued my interest because it, like many early PASCAL systems, produces byte code (also termed p-code) for a simple virtual machine, instead of code that runs directly on the CPU.
The rationale behind this is that it is much easier to port the virtual machine to new architectures and CPUs, compared to creating an architecture-dependent native code generator. The price to pay is lower execution speed; something I’m willing to live with.

Liking the idea of a virtual machine, I set out to re-write the Tiny Pascal compiler in C/C++. With an abundance of C/C++ compilers available for Windows, Linux and OSX, I though it would make a nice contribution to the retro community.
While the original BASIC code is well documented and the article explains most of it, I gave up — I had overestimated my tolerance level of BASIC 🙂 .

Not wanting to abandon the idea, I’m now in the process of writing a new Tiny Pascal-inspired cross-compiler called Micro Pascal.

Micro Pascal will have the following features:
* 16-bit integers.
* 8-bit characters.
* string constants.
* FOR .. DO loop.
* IF .. THEN .. ELSE.
* logical and arithmetic operations.
* direct memory access through built-in MEM[] function.
* a stack-based virtual machine.
* well-documented byte code.

The work-in-progress compiler is available via my GITHUB account. Check back from time to time for progress updates!