diff --git a/spec.org b/spec.org index 4a995d2..cbab38c 100644 --- a/spec.org +++ b/spec.org @@ -3,86 +3,250 @@ #+description: A specification of instructions for the virtual machine #+date: 2023-11-02 -* WIP Data types -There are 3 main data types of the virtual machine. They are all -unsigned. There exist signed versions of these data types, though -there is no difference (internally) between them. For an unsigned -type the signed version is simply S_. +* Data types +There are 4 main data types of the virtual machine. They are all +unsigned. |-------+------| | Name | Bits | |-------+------| | Byte | 8 | +| Short | 16 | | HWord | 32 | | Word | 64 | |-------+------| -Generally, the abbreviations B, H and W are used for Byte, HWord and -Word respectively. The following table shows a comparison between the -data types where an entry (row and column) $A\times{B}$ refers to "How -many of A can I fit in B". -|-------+------+-------+------| -| | Byte | Hword | Word | -|-------+------+-------+------| -| Byte | 1 | 4 | 8 | -| HWord | 1/4 | 1 | 2 | -| Word | 1/8 | 1/2 | 1 | -|-------+------+-------+------| +Generally, the abbreviations B, S, H and W are used for Byte, Short, +HWord and Word respectively. The following table shows a comparison +between the data types where an entry (row and column) $A\times{B}$ +refers to "How many of A can I fit in B". +|-------+------+-------+-------+------| +| | Byte | Short | HWord | Word | +|-------+------+-------+-------+------| +| Byte | 1 | 2 | 4 | 8 | +| Short | 1/2 | 1 | 2 | 4 | +| HWord | 1/4 | 1/2 | 1 | 2 | +| Word | 1/8 | 1/4 | 1/2 | 1 | +|-------+------+-------+-------+------| + +These unsigned types can be trivially considered signed via 2s +complement. The signed version of some unsigned type is abbreviated +by prefixing the type with a =S_=. So the signed version of each type +is S_B, S_S, S_H, S_W. +* TODO Storage +There are 4 forms of storage available in the virtual machine: the +*stack*, *registers* and *heap*. The stack, registers and call stack +are considered *fixed storage* in that they have an exact fixed +capacity within the virtual machine. The heap, on the other hand, can +grow dynamically as it supports user requested allocations and is thus +considered *dynamic storage*. +** Stack ++ FILO data structure ++ ~S~ in shorthand ++ ~ptr~ represents the top of the stack at any one point during + execution, ~0~ refers to the address for the bottom of the stack + (aka the minimal value of ~ptr~) and ~n~ refers to the address of + the end of the usable stack space (aka the maximal value for + ~ptr~, ~MAX_STACK~) +** Registers ++ constant time read/write data structure ++ ~R~ in shorthand ++ Reserves ~m~ bytes of space (called the ~MAX_REG~), where m must + be a positive multiple of 8 ++ May be indexed via a pointer in one of the 4 following forms: + + ~b~: the ith byte, where i in [0, m) + + ~s~: the ith short, where i in [0, m/2) + + ~h~: the ith hword, where i in [0, m/4) + + ~w~: the ith word, where i in [0, m/8) ++ w refers to the 8 bytes between [8i, 8(i+1)), which implicitly + refers to the: + + 8 byte registers {b | j in [8i, 8(i + 1))} + + 4 short registers {s | j in [4i, 4(i + 1))} + + 2 hword registers {h | j in [2i, 2(i + 1))} +** TODO Heap ++ Random access storage which can be allocated into chunks ++ ~H~ in shorthand +** Call stack ++ FILO data structure containing program addresses (indexes in the + program) ++ ~C~ in shorthand ++ Is reserved for a very small subset of operations for control flow * WIP Instructions An instruction for the virtual machine is composed of an *opcode* and, -potentially, an *operand*. The /opcode/ represents the behaviour of -the instruction i.e. what _is_ the instruction. The /operand/ is an -element of one of the /data types/ described previously. +optionally, an *operand*. The /opcode/ represents the specific +behaviour of the instruction i.e. what the instruction does. The +/operand/ is an element of one of the /data types/ described +previously which the opcode uses as part of its function. The operand +is optional based on the opcode: certain opcodes will never require an +operand. +** Operations: abstracting over opcodes +An *operation* is some generic behaviour, potentially involving data +storage. Many operations are generic over data types i.e. they +describe some behaviour that works for some subset of types. Opcodes +are simply specialisations of operations over some data type. For +example the generic behaviour of the operation ~PUSH~, which pushes +the operand onto the stack, is specialised into the opcode +~PUSH_WORD~, which pushes the operand, a word, onto the stack. An +operation may, thus, describe many opcodes and each opcode is a +specialisation of exactly one operation. -Some instructions do have /operands/ while others do not. The former -type of instructions are called *UNIT* instructions while the latter -type are called *MULTI* instructions[fn:1]. +The *order* of an operation is the number of specialisations it has +i.e. the number of opcodes that specialise one operation. -All /opcodes/ (with very few exceptions[fn:2]) have two components: -the *root* and the *type specifier*. The /root/ represents the -general behaviour of the instruction: ~PUSH~, ~POP~, ~MOV~, etc. The -/type specifier/ specifies what /data type/ it manipulates. A -complete opcode will be a combination of these two e.g. ~PUSH_BYTE~, -~POP_WORD~, etc. Some /opcodes/ may have more /type specifiers/ than -others. +Some operations may not be generic over data types in which case they +are of order 1 i.e. the opcode describes the exact behaviour of only +one operation. + +There are only 3 possible orders for operations: 1, 4 and 8. They are +given the names Nil, Unsigned and Signed for specialising over: ++ No types ++ The 4 unsigned data types described earlier ++ The 4 unsigned data types and their signed variants as well +** Arity +The arity of an operation is the number of input data it takes. An +operation can take input in two ways: ++ From the operand, encoded in the bytecode ++ From the stack by popping from the top + +An operation that takes n input data from the stack pops n data from +the stack to use as input. + +Since there can only be at most one operand, an operation that takes +input from the operand must have an arity of at least one. + +Hence the arity is the sum of inputs taken from both. This can be 0, +in which case the operation is *nullary*. An operation that takes one +input, whether that be from the stack or operand, is *unary*. An +operation that takes two inputs, whichever source either are from, is +*binary*. +** Orientation +An operation can be considered *oriented* around a data storage if it +only takes input from that data storage. So an operation that only +takes input from the stack is *stack-oriented*. Or an operation that +only takes input from the operand is *operand-oriented*. +** Categorisation of operations +With the notation done, we can now describe all operations that the +virtual machine supports. Through describing all of these operations, +including their orders and what operand they accept (if any), we can +describe all opcodes. + +*** Trivial nullary operations +These are NIL order operations which are super simple to describe. ++ =NOOP=: Doesn't do anything. ++ =HALT=: Stops execution at point +*** Moving data in fixed storage +There are 5 operations that move data through fixed storage in the +virtual machine. They are of Unsigned order, unary and +operand-oriented. + +|-----------------+---------------------------------------------------| +| Name | Behaviour | +|-----------------+---------------------------------------------------| +| =PUSH= | Pushes operand onto stack | +| =POP= | Pops datum off stack | +| =PUSH_REGISTER= | Pushes datum from (operand)th register onto stack | +| =MOV= | Moves datum off stack to the (operand)th register | +| =DUP= | Pushes the (operand)th datum in stack onto stack | +|-----------------+---------------------------------------------------| +*** Using the heap +The heap is utilised through a set of "helper" operations that safely +abstract the underlying implementation. All of these operations are +stack-oriented. + +|-----------+----------------------------------------------------------+-------| +| Name | Behaviour | Arity | +|-----------+----------------------------------------------------------+-------| +| =MALLOC= | Allocate n amount of data in the heap, pushing a pointer | 1 | +| =MSET= | Pop a value, set the nth datum of data in the heap | 3 | +| =MGET= | Push the nth datum of data in the heap onto the stack | 3 | +| =MDELETE= | Free data in the heap | 1 | +| =MSIZE= | Get the size of allocation in the heap | 1 | +|-----------+----------------------------------------------------------+-------| + +=MALLOC=, =MSET= and =MGET= are of Unsigned order. Due to unsigned +and signed types taking the same size, they can be used for signed +data as well. +*** Boolean operations +There are 5 boolean operations. They are of Unsigned order, binary +and stack-oriented. These are: ++ =NOT= ++ =OR= ++ =AND= ++ =XOR= ++ =EQ= + +Though they are all of unsigned order they can be used for signed data +trivially. +*** Comparison operations +There are 4 comparison operations. They are all signed operations, +binary and stack-oriented. They are: ++ LT: Less Than ++ LTE: Less Than or Equal ++ GT: Greater Than ++ GTE: Greater Than or Equal + +As =EQ= is an unsigned order operation and doesn't assert anything on +the actual values, it can be used for comparing two signed inputs. It +doesn't perform a cast when comparing and unsigned and signed input +which may mean certain non equivalent values may be considered equal +(e.g. =0xFAA9= is a negative number in 2s complement but a positive +number in unsigned, considered the same under =EQ=). +*** Mathematical operations +There are 3 mathematical operations. They are of unsigned order, +binary and stack-oriented. These are: ++ PLUS ++ SUB ++ MULT + +Though they are unsigned, any overflowing operation is wrapped around. +With some thought these operations can treat unsigned data and be used +to generate them. +*** Control flow operations +There are 2 control flow operations. Each perform a "jump", changing +the point of execution to a different point in the program. + +|--------------+----------+---------------+-------| +| Name | Order | Orientation | Arity | +|--------------+----------+---------------+-------| +| =JUMP_ABS= | NIL | Operand | 1 | +| =JUMP_IF= | UNSIGNED | Operand+Stack | 2 | +|--------------+----------+---------------+-------| + ++ =JUMP_ABS= interprets the operand as an absolute program address and + sets point of execution to that address ++ =JUMP_IF= pops a datum off the stack and compares it to 0. If true, + the point of execution is set to the operand (interpreted as an + absolute program address). If false, execution continues past it. +*** Subroutine operations +There are 2 subroutine operations. They are the only operations that +can mutate the call stack. Through utilising reserved storage in the +virtual machine that can only be altered through these methods, they +abstract control flow to a higher degree than the jump operations. + +|------------+-------------+-------| +| Name | Orientation | Arity | +|------------+-------------+-------| +| CALL | Operand | 1 | +| RET | - | 0 | +|------------+-------------+-------| + +The CALL* operations take a program address as input (either from the +operand or from the stack). They push the current program address +onto the call stack and perform a jump to the input address. + +The RET operation pops a program address off the call stack, +performing a jump to that address. + +These operations allow the implementation of /subroutines/: sequences +of code that can be self contained and generic over a variety of call +sites i.e. can return to the address where it was called without hard +coding the address. +*** TODO IO +Currently IO is really bad: the PRINT_* routines are not a nice +abstraction over what's really happening and programs cannot take +input from stdin. * TODO Bytecode format Bytecode files are byte sequence which encode instructions for the virtual machine. Any instruction (even with an operand) has one and only one byte sequence associated with it. -* TODO Storage -Two types of storage: -+ Data stack which all core VM routines manipulate and work on (FILO) - + ~DS~ in shorthand, with indexing from 0 (referring to the top of the - stack) up to n (referring to the bottom of the stack). B(DS) - refers to the bytes in the stack (the default). -+ Register space which is generally reserved for user space code - i.e. other than ~mov~ no other core VM routine manipulates the - registers - + ~R~ in shorthand, with indexing from 0 to $\infty$. -* TODO Standard library -Standard library subroutines reserve the first 16 words (128 bytes) of -register space (W(R)[0] to W(R)[15]). The first 8 words (W(R)[0] to -W(R)[7]) are generally considered "arguments" to the subroutine while -the remaining 8 words (W(R)[8] to W(R)[15]) are considered additional -space that the subroutine may access and mutate for internal purposes. -The stack may have additional bytes pushed, which act as the "return -value" of the subroutine, but no bytes will be popped off (*Stack -Preservation*). - -If a subroutine requires more than 8 words for its arguments, then it -will use the stack. This is the only case where the stack is mutated -due to a subroutine call, as those arguments will always be popped off -the stack. - -Subroutines must always end in ~RET~. Therefore, they must always be -called via ~CALL~, never by ~JUMP~ (which will always cause error -prone behaviour). * Footnotes -[fn:2] ~NOOP~, ~HALT~, ~MDELETE~, ~MSIZE~, ~JUMP_*~ - -[fn:1] /UNIT/ refers to the fact that the internal representation of -these instructions are singular: two instances of the same /UNIT/ -instruction will be identical in terms of their binary. On the other -hand, two instances of the same /MULTI/ instruction may not be -equivalent due to the operand they take. Crucially, most if not all -/MULTI/ instructions have different versions for each /data type/.