430 lines
17 KiB
Org Mode
430 lines
17 KiB
Org Mode
#+title: TODOs
|
|
#+author: Aryadev Chavali
|
|
#+date: 2023-11-02
|
|
#+startup: noindent
|
|
|
|
* TODO Better documentation [0%] :DOC:
|
|
** TODO Comment coverage [0%]
|
|
*** WIP Lib [50%]
|
|
**** DONE lib/base.h
|
|
**** DONE lib/darr.h
|
|
**** TODO lib/heap.h
|
|
**** TODO lib/inst.h
|
|
*** TODO ASM [0%]
|
|
**** TODO asm/lexer.h
|
|
**** TODO asm/parser.h
|
|
*** TODO VM [0%]
|
|
**** TODO vm/runtime.h
|
|
** TODO Specification
|
|
* TODO Preprocessing directives :ASM:
|
|
Like in FASM or NASM where we can give certain helpful instructions to
|
|
the assembler. I'd use the ~%~ symbol to designate preprocessor
|
|
directives.
|
|
** TODO Macros
|
|
Essentially constants expressions which take literal parameters
|
|
(i.e. tokens) and can use them throughout the body. Something like
|
|
#+begin_src asm
|
|
%macro(name)(param1 param2 param3)
|
|
...
|
|
%end
|
|
#+end_src
|
|
Where each parameter is substituted in a call at preprocessing time.
|
|
A call should look something like this:
|
|
#+begin_src asm
|
|
$name 1 2 3
|
|
#+end_src
|
|
and those tokens will be substituted literally in the macro body.
|
|
* WIP Write assembler in a different language :ASM:
|
|
While the runtime and base library needs to deal with only
|
|
binary, the assembler has to deal with string inputs and a larger
|
|
variety of bugs. As the base library is written in C, and is all that
|
|
is necessary to write a program that targets the virtual machine, we
|
|
could realistically use another language to write the assembler in via
|
|
FFI with minimal pain.
|
|
|
|
Languages in the competition:
|
|
+ C++
|
|
+ Rust
|
|
+ Python
|
|
|
|
2024-04-14: Chose C++ cos it will require the least effort to rewrite
|
|
the currently existing codebase while still leveraging some less
|
|
efficient but incredibly useful features.
|
|
* TODO Rewrite preprocesser to create a custom unit instead of token streams
|
|
** Problem
|
|
A problem that occurs in the preprocessor is token column and line
|
|
count. Say =a.asm= has ~%use "b.asm"~. The tokens from the =b.asm=
|
|
file are inserted into =a.asm='s token stream, but the line/column
|
|
count from there isn't properly set in =a.asm=.
|
|
|
|
A naive solution would be to just recount the lines and columns, but
|
|
this removes information about where those tokens came from. Say an
|
|
error occurs in some of =b.asm='s code: I would like to be able to
|
|
report them.
|
|
|
|
Therefore, we can no longer just generate new token streams from the
|
|
preprocesser and should instead look at making more complex
|
|
abstractions.
|
|
|
|
A problem this could also solve is nested errors and recursive
|
|
constants. Say I have some assembly like so
|
|
#+begin_src asm
|
|
%const limit 20 %end
|
|
%const print-limit
|
|
...
|
|
push.byte $limit
|
|
print.byte
|
|
...
|
|
%end
|
|
#+end_src
|
|
|
|
A call to ~print-limit~ under the current system would insert the
|
|
tokens for print-limit but completely forget about ~push.byte $limit~
|
|
which would cause a parsing error. (This could be fixed under the
|
|
current system by allowing reference resolution inside of const
|
|
blocks, with the conceit that it would be hard to stop infinite recursion)
|
|
** Language model
|
|
The model I have in mind is that all constructs in this meta language
|
|
(the preprocessing language) are either singular tokens or collections
|
|
of tokens/constructs in a recursive sense. This naturally follows
|
|
from the fact that a single pass isn't enough to properly parse this
|
|
language: there must be some recursive nature which forces the
|
|
language to take multiple passes to completely generate a stream that
|
|
can be parsed.
|
|
|
|
This vague notion can be formalised like so. A preprocessing unit is
|
|
either a singular token or a named collection of units. The former
|
|
represents your standard symbols and literals while the later
|
|
represents ~%const~ and ~%use~ calls where there is a clear name
|
|
associated to a collection of one or more tokens (in the case of the
|
|
former it's the constant's name and the latter it's the filename).
|
|
We'll distinguish this as well.
|
|
|
|
#+begin_src text
|
|
Token = PP_USE | PP_CONST | String(Content) | Symbol(Content) | PUSH(Content) | ...
|
|
Type = File(String) | Constant(Symbol)
|
|
Unit = Token | Container(Type . Vector[Unit])
|
|
#+end_src
|
|
|
|
Through this model our initial stream of tokens can be considered
|
|
units. We can already see that this model may solve our original
|
|
problem: with named containers it doesn't matter that certain tokens
|
|
are from different parts of the file or different files as they are
|
|
distinctly typed from the general set of tokens, with a name which
|
|
states where they're from.
|
|
** Processing
|
|
We need this model to have a notion of "processing" though, otherwise
|
|
it's quite useless. A processing function is simply a function which
|
|
takes a unit and returns another unit. We currently have two
|
|
processing functions we can consider: ~process_const~ and
|
|
~process_use~.
|
|
|
|
~process_use~ takes a vector of tokens and, upon encountering PP_USE
|
|
accepts the next token (a string) and tokenises the file
|
|
with that name. Within our model we'd make the stream of tokens
|
|
created from opening the file a /container/.
|
|
|
|
~process_const~ takes a vector of tokens and does two things in an
|
|
iteration:
|
|
1) upon encountering PP_CONST accepts the next n tokens till PP_END is
|
|
encountered, with the first token being a symbol. This is
|
|
registered in a map of constants (~CONSTS~) where the symbol is the
|
|
key and the value associated is the n - 1 tokens accepted
|
|
2) upon encountering a PP_REFERENCE reads the content associated with
|
|
it (considered a symbol ~S~) and replaces it ~CONSTS[S]~ (if S is
|
|
in CONSTS).
|
|
|
|
One thing to note is that both of these definitions are easily
|
|
extensible to the general definition of units: if a unit is a
|
|
container of some kind we can recur through its vector of units to
|
|
resolve any further "calls". For ~process_const~ it's ~%const~ or
|
|
~$ref~ while for ~process_use~ it's ~%use~.
|
|
** History/versioning
|
|
One additional facet to this model I'd like to add is "history". Each
|
|
unit is actually a list (or a singly linked tree where each parent has
|
|
at most one child) of sub-units where the top of the list represents
|
|
the current version. Each descendant is a previous version of the
|
|
token.
|
|
|
|
Say I do some processing on an element of the unit list =a= (with
|
|
index =i=) such that it becomes a new "unit", call it =b=. Then we
|
|
update V by =V[i] = cons(b, a)=. Through this, the lists acts as a
|
|
history of processing that has occurred on the unit. This provides an
|
|
ability to trace the path of preprocessing to an eventual conclusion.
|
|
|
|
Processing occurs on a unit until it cannot be done further i.e. when
|
|
there are no more "calls" in the tree to resolve. The history list
|
|
provides all the versions of a unit till its resolved form.
|
|
|
|
To see what a unit with history may look like (where symbols are
|
|
terminals i.e. completely resolved):
|
|
+ Container('limit' . [a Container("b" . d e f) c])
|
|
+ Container('limit' . [a '$b' c])
|
|
+ Token(PP_REF('$limit'))
|
|
|
|
This shows resolution of the unit reference ~$limit~, which in turn
|
|
leads to the resolution of ~$b~ which is a sub-unit.
|
|
|
|
There are two ways of indefinite resolution, one per method of
|
|
processing. For ~process_use~ it is two files calling ~%use~ on each
|
|
other and for ~process_const~ it is a ~%const~ calling itself. We can
|
|
just disallow it through analysis.
|
|
** Pseudocode
|
|
#+begin_src text
|
|
process_use(V: Vector[Unit]) ->
|
|
[cons((if v is Token(PP_USE) and next(v) is Token(String(S))
|
|
-> Container(File(S) . tokenise(open(v')))
|
|
else if v is Container(name . units)
|
|
-> Container(name . process_use(units))
|
|
else
|
|
-> v),
|
|
v_x)
|
|
v = v_x[0]
|
|
for v_x in V]
|
|
|
|
CONSTS={}
|
|
process_const(V: Vector[Unit]) ->
|
|
[cons((if v is Token(PP_CONST) and next(v) is Token(Symbol(S))
|
|
do {
|
|
i := find(Token(PP_END), V[v:])
|
|
CONSTS[S] = V[next(v):prev(i)]
|
|
-> Container(Constant(S) . CONSTS[S])
|
|
}
|
|
else if v is Token(PP_REF(S))
|
|
-> CONSTS[S]
|
|
else if v is Container(name . units)
|
|
-> Container(name . process_const(units))
|
|
else
|
|
-> v)
|
|
v_x)
|
|
v = v_x[0]
|
|
for v_x in V]
|
|
#+end_src
|
|
* TODO Introduce error handling in base library :LIB:
|
|
There is a large variety of TODOs about errors. Let's fix them!
|
|
8 TODOs currently present.
|
|
* TODO Standard library :ASM:VM:
|
|
I should start considering this and how a user may use it. Should it
|
|
be an option in the VM and/or assembler binaries (i.e. a flag) or
|
|
something the user has to specify in their source files?
|
|
|
|
Something to consider is /static/ and /dynamic/ "linking" i.e.:
|
|
+ Static linking: assembler inserts all used library definitions into
|
|
the bytecode output directly
|
|
+ We could insert all of it at the start of the bytecode file, and
|
|
with [[*Start points][Start points]] this won't interfere with
|
|
user code
|
|
+ 2023-11-03: Finishing the Start point feature has made these
|
|
features more tenable. A program header which is compiled and
|
|
interpreted in bytecode works wonders.
|
|
+ Furthermore library code will have fixed program addresses (always
|
|
at the start) so we'll know at start of assembler runtime where to
|
|
resolve standard library subroutine calls
|
|
+ Virtual machine needs no changes to do this
|
|
** TODO Consider dynamic Linking
|
|
+ Dynamic linking: virtual machine has fixed program storage for
|
|
library code (a ROM), and assembler makes jump references
|
|
specifically for this program storage
|
|
+ When assembling subroutine calls, just need to put references to
|
|
this library storage (some kind of shared state between VM and
|
|
assembler to know what these references are)
|
|
+ VM needs to manage a ROM of some kind for library code
|
|
+ How do we ensure assembled links to subroutine calls don't
|
|
conflict with user code jumps?
|
|
|
|
What follows is a possible dynamic linking strategy. It requires
|
|
quite a few moving parts:
|
|
|
|
The address operand of every program control instruction (~CALL~,
|
|
~JUMP~, ~JUMP.IF~) has a specific encoding if the standard library is
|
|
dynamically linked:
|
|
+ If the most significant bit is 0, the remaining 63 bits encode an
|
|
absolute address within the program
|
|
+ Otherwise, the address encodes a standard library subroutine. The
|
|
bits within the address follow this schema:
|
|
+ The next 30 bits represent the specific module where the
|
|
subroutine is defined (over 1.07 *billion* possible library values)
|
|
+ The remaining 33 bits (4 bytes + 1 bit) encode the absolute
|
|
program address in the bytecode of that specific module for the
|
|
start of the subroutine (over 8.60 *billion* values)
|
|
|
|
The assembler will automatically encode this based on "%USE" calls and
|
|
the name of the subroutines called. On the virtual machine, there is
|
|
a storage location (similar to the ROM of real machines) which stores
|
|
the bytecode for modules of the standard library, indexed by the
|
|
module number. This means, on deserialising the address into the
|
|
proper components, the VM can refer to the module bytecode then jump
|
|
to the correct address.
|
|
|
|
2023-11-09: I'll need a way to run library code in the current program
|
|
system in the runtime. It currently doesn't support jumps or work in
|
|
programs outside of the main one unfortunately. Any proper work done
|
|
in this area requires some proper refactoring.
|
|
|
|
2023-11-09: Constants or inline macros need to be reconfigured for
|
|
this to work: at parse time, we work out the inlines directly which
|
|
means compiling bytecode with "standard library" macros will not work
|
|
as they won't be in the token stream. Either we don't allow
|
|
preprocessor work in the standard library at all (which is bad cos we
|
|
can't then set standard limits or other useful things) or we insert
|
|
them into the registries at parse time for use in program parsing
|
|
(which not only requires assembler refactoring to figure out what
|
|
libraries are used (to pull definitions from) but also requires making
|
|
macros "recognisable" in bytecode because they're essentially
|
|
invisible).
|
|
|
|
2024-04-15: Perhaps we could insert the linking information into the
|
|
program header?
|
|
1) A table which states the load order of certain modules would allow
|
|
the runtime to selectively spin up and properly delegate module
|
|
jumps to the right bytecode
|
|
2)
|
|
* Completed
|
|
** DONE Write a label/jump system :ASM:
|
|
Essentially a user should be able to write arbitrary labels (maybe
|
|
through ~label x~ or ~x:~ syntax) which can be referred to by ~jump~.
|
|
|
|
It'll purely be on the assembler side as a processing step, where the
|
|
emitted bytecode purely refers to absolute addresses; the VM should
|
|
just be dealing with absolute addresses here.
|
|
** DONE Allow relative addresses in jumps :ASM:
|
|
As requested, a special syntax for relative address jumps. Sometimes
|
|
it's a bit nicer than a label.
|
|
** DONE Calling and returning control flow :VM: :ASM:
|
|
When writing library code we won't know the addresses of where
|
|
callers are jumping from. However, most library functions want to
|
|
return control flow back to where the user had called them: we want
|
|
the code to act almost like an inline function.
|
|
|
|
There are two ways I can think of achieving this:
|
|
+ Some extra syntax around labels (something like ~@inline <label>:~)
|
|
which tells the assembly processor to inline the label when a "jump"
|
|
to that label is given
|
|
+ This requires no changes to the VM, which keeps it simple, but a
|
|
major change to the assembler to be able to inline code. However,
|
|
the work on writing a label system and relative addresses should
|
|
provide some insight into how this could be possible.
|
|
+ A /call stack/ and two new syntactic constructs ~call~ and ~ret~
|
|
which work like so:
|
|
+ When ~call <label>~ is encountered, the next program address is
|
|
pushed onto the call stack and control flow is set to the label
|
|
+ During execution of the ~<label>~, when a ~ret~ is encountered,
|
|
pop an address off the call stack and set control flow to that
|
|
address
|
|
+ This simulates the notion of "calling" and "returning from" a
|
|
function in classical languages, but requires more machinery on
|
|
the VM side.
|
|
|
|
2024-04-15: The latter option was chosen, though the former has been
|
|
implemented through [[*Constants][Constants]].
|
|
** DONE Start points :ASM:VM:
|
|
In standard assembly you can write
|
|
#+begin_src asm
|
|
global _start
|
|
_start:
|
|
...
|
|
#+end_src
|
|
and that means the label ~_start~ is the point the program should
|
|
start from. This means the user can define other code anywhere in the
|
|
program and specify something similar to "main" in C programs.
|
|
|
|
Proposed syntax:
|
|
#+begin_src asm
|
|
init <label>
|
|
#+end_src
|
|
|
|
2024-04-15: Used the same syntax as standard assembly, with the
|
|
conceit that multiple ~global~'s may be present but only the last one
|
|
has an effect.
|
|
** DONE Constants
|
|
Essentially a directive which assigns some literal to a symbol as a
|
|
constant. Something like
|
|
#+begin_src asm
|
|
%const(n) 20 %end
|
|
#+end_src
|
|
|
|
Then, during my program I could use it like so
|
|
#+begin_src asm
|
|
...
|
|
push.word $n
|
|
print.word
|
|
#+end_src
|
|
|
|
The preprocessor should convert this to the equivalent code of
|
|
#+begin_src asm
|
|
...
|
|
push.word 20
|
|
print.word
|
|
#+end_src
|
|
|
|
2023-11-04: You could even put full program instructions for a
|
|
constant potentially
|
|
#+begin_src asm
|
|
%const(print-1)
|
|
push.word 1
|
|
print.word
|
|
%end
|
|
#+end_src
|
|
which when referred to (by ~$print-1~) would insert the bytecode given
|
|
inline.
|
|
** DONE Rigid endian :LIB:
|
|
Say a program is compiled on a little endian machine. The resultant
|
|
bytecode file, as a result of using C's internal functions, will use
|
|
little endian.
|
|
|
|
This file, when distributed to other computers, will not work on those
|
|
that use big endian.
|
|
|
|
This is a massive problem; I would like bytecode compiled on one
|
|
computer to work on any other one. Therefore we have to enforce big
|
|
endian. This refactor is limited to only LIB as a result of only the
|
|
~convert_*~ functions being used in the runtime to convert between
|
|
byte buffers (usually read from the bytecode file directly or from
|
|
memory to use in the stack).
|
|
|
|
2024-04-09: Found the ~hto_e~ functions under =endian.h= that provide
|
|
both way host to specific endian conversion of shorts, half words and
|
|
words. This will make it super simple to just convert.
|
|
|
|
2024-04-15: Found it better to implement the functions myself as
|
|
=endian.h= is not particularly portable.
|
|
** DONE Import another file
|
|
Say I have two "asm" files: /a.asm/ and /b.asm/.
|
|
|
|
#+CAPTION: a.asm
|
|
#+begin_src asm
|
|
global main
|
|
main:
|
|
push.word 1
|
|
push.word 1
|
|
push.word 1
|
|
sub.word
|
|
sub.word
|
|
call b-println
|
|
halt
|
|
#+end_src
|
|
|
|
#+CAPTION: b.asm
|
|
#+begin_src asm
|
|
b-println:
|
|
print.word
|
|
push.byte '\n'
|
|
print.char
|
|
ret
|
|
#+end_src
|
|
|
|
How would one assemble this? We've got two files, with /a.asm/
|
|
depending on /b.asm/ for the symbol ~b-println~. It's obvious they
|
|
need to be assembled "together" to make something that could work. A
|
|
possible "correct" program would be having the file /b.asm/ completely
|
|
included into /a.asm/, such that compiling /a.asm/ would lead to
|
|
classical symbol resolution without much hassle. As a feature, this
|
|
would be best placed in the preprocessor as symbol resolution occurs
|
|
in the third stage of parsing (~process_presults~), whereas the
|
|
preprocessor is always the first stage.
|
|
|
|
That would be a very simple way of solving the static vs dynamic
|
|
linking problem: just include the files you actually need. Even the
|
|
standard library would be fine and not require any additional work.
|
|
Let's see how this would work.
|