11 KiB
VM Specification
Data types
There are 4 main data types of the virtual machine. They are all unsigned.
| Name | Bits |
| Byte | 8 |
| Short | 16 |
| HWord | 32 |
| Word | 64 |
Generally, the abbreviations B, S, H and W are used for Byte, Short, HWord and Word respectively. The following table shows a comparison between the data types where an entry (row and column) $A\times{B}$ refers to "How many of A can I fit in B".
| Byte | Short | HWord | Word | |
| Byte | 1 | 2 | 4 | 8 |
| Short | 1/2 | 1 | 2 | 4 |
| HWord | 1/4 | 1/2 | 1 | 2 |
| Word | 1/8 | 1/4 | 1/2 | 1 |
These unsigned types can be trivially considered signed via 2s
complement. The signed version of some unsigned type is abbreviated
by prefixing the type with a S_. So the signed version of each type
is S_B, S_S, S_H, S_W.
TODO Storage
There are 4 forms of storage available in the virtual machine: the stack, registers and heap. The stack, registers and call stack are considered fixed storage in that they have an exact fixed capacity within the virtual machine. The heap, on the other hand, can grow dynamically as it supports user requested allocations and is thus considered dynamic storage.
Stack
- FILO data structure
Sin shorthandptrrepresents the top of the stack at any one point during execution,0refers to the address for the bottom of the stack (aka the minimal value ofptr) andnrefers to the address of the end of the usable stack space (aka the maximal value forptr,MAX_STACK)
Registers
- constant time read/write data structure
Rin shorthand- Reserves
mbytes of space (called theMAX_REG), where m must be a positive multiple of 8 -
May be indexed via a pointer in one of the 4 following forms:
b<i>: the ith byte, where i in [0, m)s<i>: the ith short, where i in [0, m/2)h<i>: the ith hword, where i in [0, m/4)w<i>: the ith word, where i in [0, m/8)
-
w<i> refers to the 8 bytes between [8i, 8(i+1)), which implicitly refers to the:
- 8 byte registers {b<j> | j in [8i, 8(i + 1))}
- 4 short registers {s<j> | j in [4i, 4(i + 1))}
- 2 hword registers {h<j> | j in [2i, 2(i + 1))}
TODO Heap
- Random access storage which can be allocated into chunks
Hin shorthand
Call stack
- FILO data structure containing program addresses (indexes in the program)
Cin shorthand- Is reserved for a very small subset of operations for control flow
WIP Instructions
An instruction for the virtual machine is composed of an opcode and, optionally, an operand. The opcode represents the specific behaviour of the instruction i.e. what the instruction does. The operand is an element of one of the data types described previously which the opcode uses as part of its function. The operand is optional based on the opcode: certain opcodes will never require an operand.
Operations: abstracting over opcodes
An operation is some generic behaviour, potentially involving data
storage. Many operations are generic over data types i.e. they
describe some behaviour that works for some subset of types. Opcodes
are simply specialisations of operations over some data type. For
example the generic behaviour of the operation PUSH, which pushes
the operand onto the stack, is specialised into the opcode
PUSH_WORD, which pushes the operand, a word, onto the stack. An
operation may, thus, describe many opcodes and each opcode is a
specialisation of exactly one operation.
The order of an operation is the number of specialisations it has i.e. the number of opcodes that specialise one operation.
Some operations may not be generic over data types in which case they are of order 1 i.e. the opcode describes the exact behaviour of only one operation.
There are only 3 possible orders for operations: 1, 4 and 8. They are given the names Nil, Unsigned and Signed for specialising over:
- No types
- The 4 unsigned data types described earlier
- The 4 unsigned data types and their signed variants as well
Arity
The arity of an operation is the number of input data it takes. An operation can take input in two ways:
- From the operand, encoded in the bytecode
- From the stack by popping from the top
An operation that takes n input data from the stack pops n data from the stack to use as input.
Since there can only be at most one operand, an operation that takes input from the operand must have an arity of at least one.
Hence the arity is the sum of inputs taken from both. This can be 0, in which case the operation is nullary. An operation that takes one input, whether that be from the stack or operand, is unary. An operation that takes two inputs, whichever source either are from, is binary.
Orientation
An operation can be considered oriented around a data storage if it only takes input from that data storage. So an operation that only takes input from the stack is stack-oriented. Or an operation that only takes input from the operand is operand-oriented.
Categorisation of operations
With the notation done, we can now describe all operations that the virtual machine supports. Through describing all of these operations, including their orders and what operand they accept (if any), we can describe all opcodes.
Trivial nullary operations
These are NIL order operations which are super simple to describe.
NOOP: Doesn't do anything.HALT: Stops execution at point
Moving data in fixed storage
There are 5 operations that move data through fixed storage in the virtual machine. They are of Unsigned order, unary and operand-oriented.
| Name | Behaviour |
PUSH |
Pushes operand onto stack |
POP |
Pops datum off stack |
PUSH_REGISTER |
Pushes datum from (operand)th register onto stack |
MOV |
Moves datum off stack to the (operand)th register |
DUP |
Pushes the (operand)th datum in stack onto stack |
Using the heap
The heap is utilised through a set of "helper" operations that safely abstract the underlying implementation. All of these operations are stack-oriented.
| Name | Behaviour | Arity |
MALLOC |
Allocate n amount of data in the heap, pushing a pointer | 1 |
MSET |
Pop a value, set the nth datum of data in the heap | 3 |
MGET |
Push the nth datum of data in the heap onto the stack | 3 |
MDELETE |
Free data in the heap | 1 |
MSIZE |
Get the size of allocation in the heap | 1 |
MALLOC, MSET and MGET are of Unsigned order. Due to unsigned
and signed types taking the same size, they can be used for signed
data as well.
Boolean operations
There are 5 boolean operations. They are of Unsigned order, binary and stack-oriented. These are:
NOTORANDXOREQ
Though they are all of unsigned order they can be used for signed data trivially.
Comparison operations
There are 4 comparison operations. They are all signed operations, binary and stack-oriented. They are:
- LT: Less Than
- LTE: Less Than or Equal
- GT: Greater Than
- GTE: Greater Than or Equal
As EQ is an unsigned order operation and doesn't assert anything on
the actual values, it can be used for comparing two signed inputs. It
doesn't perform a cast when comparing and unsigned and signed input
which may mean certain non equivalent values may be considered equal
(e.g. 0xFAA9 is a negative number in 2s complement but a positive
number in unsigned, considered the same under EQ).
Mathematical operations
There are 3 mathematical operations. They are of unsigned order, binary and stack-oriented. These are:
- PLUS
- SUB
- MULT
Though they are unsigned, any overflowing operation is wrapped around. With some thought these operations can treat unsigned data and be used to generate them.
Control flow operations
There are 2 control flow operations. Each perform a "jump", changing the point of execution to a different point in the program.
| Name | Order | Orientation | Arity |
JUMP_ABS |
NIL | Operand | 1 |
JUMP_IF |
UNSIGNED | Operand+Stack | 2 |
JUMP_ABSinterprets the operand as an absolute program address and sets point of execution to that addressJUMP_IFpops a datum off the stack and compares it to 0. If true, the point of execution is set to the operand (interpreted as an absolute program address). If false, execution continues past it.
Subroutine operations
There are 2 subroutine operations. They are the only operations that can mutate the call stack. Through utilising reserved storage in the virtual machine that can only be altered through these methods, they abstract control flow to a higher degree than the jump operations.
| Name | Orientation | Arity |
| CALL | Operand | 1 |
| RET | - | 0 |
The CALL* operations take a program address as input (either from the operand or from the stack). They push the current program address onto the call stack and perform a jump to the input address.
The RET operation pops a program address off the call stack, performing a jump to that address.
These operations allow the implementation of subroutines: sequences of code that can be self contained and generic over a variety of call sites i.e. can return to the address where it was called without hard coding the address.
TODO IO
Currently IO is really bad: the PRINT_* routines are not a nice abstraction over what's really happening and programs cannot take input from stdin.
TODO Bytecode format
Bytecode files are byte sequence which encode instructions for the virtual machine. Any instruction (even with an operand) has one and only one byte sequence associated with it.