The Micro Pascal cross-compiler is coming a long nicely. User-defined functions or procedures aren’t supported yet. Neither are 8-bit characters or strings. FOR loops, IF-THEN-ELSE statements and 16-bit (unsigned) arithmetic operations do work! It can compile simple PASCAL programs like this:
VAR K : INTEGER; BEGIN FOR K:=20 DOWNTO 1 DO BEGIN WRITE(K*K); END END
Virtual machine instructions
The compiler emits instructions for a simple stack-based virtual machine. For example, the program above compiles to the following instructions:
0000 RESERVE 0002 0003 LIT 0014 0006 STOREP 0000 0009 LOAD 0000 000C LIT 0001 000F CGE 0010 JZ 0025 0013 LOAD 0000 0016 LOAD 0000 0019 MUL 001A WRITE 001B LOAD 0000 001E DEC 001F STOREP 0000 0022 JMP 0009
The ‘RESERVE 0002’ instruction allocates room for local variables on the call/return stack. In this case two bytes for the 16-bit integer K. ‘LIT 0014’ loads 14 hex (20 decimal) onto the operand stack. ‘STOREP 0000’ stores the value at the top of the operand stack (14 hex) into local memory 0, which is where K is stored. At the same time, the value is popped off the operand stack. So, these two instructions set K := 20.
Next, the value stored at memory location 0, again K, is loaded onto the stack. Then, a value of 1 is pushed onto the operand stack. The ‘CGE’, which means Compare Greater or Equal, compares the two top-most values on the stack, pops them off the stack, and pushes a 1 if the condition is true, or 0 if the condition is false.
If the top-most stack value is 0, JZ jumps to the address given as its argument. If not, it continues execution. The top-most stack value is always popped off. Here, the CGE and JZ check if we’re at the end of the loop. If so, a jump to address 25 is made: the program ends.
The instructions from 0013 to 001A square the value of K and print the result to the console. The instructions from 001B to 001F decrement the loop counter K. Finally, a jump to the end-of-loop code is made and the loop is conditionally executed again.
The abstract syntax tree
The Micro Pascal compiler generates the virtual machine code in the traditional way. A lexical analyzer reads the PASCAL source code and recognizes integers, strings, keywords, operators (+,-,/,*,:=,=,<,>,<=,>=,<>, ; , :), keywords (BEGIN, END, FOR, TO, DOWNTO, IF, THEN, ELSE, VAR, CONSTANT, WRITE, etc) and identifiers (variable/procedure/function names). When a part of the input is recognized, the type and position of occurrence is stored. This is termed a token.
When the entire source code as been tokenised, a parser checks the validity of the program. This is done by comparing the order of the tokens against a set of rules that describe the Micro Pascal language. For example, here is the rule that describes the IF-THEN-ELSE construct:
We see that an IF-THEN-ELSE construct must start with an IF token, which must be followed by an expression. The expression is not a token, but rather another construct that is defined elsewhere. The expression must be followed by a THEN token, which must be followed by a statement construct. Now there are two possibilities: we either encounter an ELSE token followed by another statement, or we have nothing (sometimes referred to as an epsilon).
During the parsing process, the parser collects information about the meaning of the program. For instance, it will find which operations are part of the expression and statements found in the IF-THEN-ELSE construct above. The collected information is then represented as a tree, termed the abstract syntax tree, or AST for short.
The Micro Pascal compiler has an option to dump the AST as a Graphviz .dot file for visualization. The compiling the example program results in the following graphical representation of the AST:
Each node in the tree corresponds to a language construct.
VM code generation
To generate the p-code for the vitual machine, the p-code generator traverses the tree, starting at the top, going left and depth first. Let’s do that for the example AST.
We start at the BLOCK node. No p-code needs to be emitted for this node, but it does indicate a new stack frame is needed to store local variables. We go left, to the VARDECL. Here we traverse all the child nodes (INT K, in this case) and collect their size in bytes so we can emit a RESERVE instruction. In this step, each variable is also allocated their storage address, which is address 0 for K.
Now we go up the tree again, to the second branch of the BLOCK node, down through the PROGBLOCK to the FOR node. We see that the FOR node uses ‘K’ as the loop variable. The left-most leaf-node is the FOR loop start expression (20), so we the p-code for K:=20. Then we check the loop variable for the end-of-loop condition, i.e. K<1, and exit the current PROGBLOCK if it is true. Now we emit the p-code belonging to the last FOR leaf-node; a PROGBLOCK.
Again, down the tree we go until we end up at the left-most ‘VAR K’ node. This node simply pushes the variable K onto the stack. We go back to the *-node and down to the other ‘VAR K’ node. Again, we push the variable K onto the stack. Now, we’re back at the *-node and emit a MUL operation. Back up the tree, to the WRITE node, where we emit the WRITE p-code. We keep going back until we reach the FOR node again.
Now, we emit the code to update the loop variable K. The parser has already stored the information on what to do, namely: decrement (–) it. We continue to go back up the tree until we reach the BLOCK node, and we’re done!
During the tree traversal, jump instructions are generated. For example, a conditional jump by the FOR node, to exit the FOR loop. At the time the jump instruction is emitted, the address to jump to is not yet known because the loop body code has not yet been generated. To solve this problem, a placeholder is inserted, termed a label, instead of the actual address. Only when all the p-code has been generated, and all the addresses are known, the labels are replaced with the correct addresses in a processes that is called fixup.
There is still a lot to do; Micro Pascal still needs functions and procedures. Also, the 16-bit integer is the only available variable type. To comfortably work with characters and strings, an 8-bit character data type needs to be introduced. This is more complex than it might seem. Having more than one data type requires type checking to be implemented as well as extensions to the p-code instruction set.