ovm/todo.org

#+title: TODOs
#+author: Aryadev Chavali
#+date: 2023-11-02
#+startup: noindent

* TODO Better documentation [0%] :DOC:
** TODO Comment coverage [0%]
*** WIP Lib [50%]
**** DONE lib/base.h
**** DONE lib/darr.h
**** TODO lib/heap.h
**** TODO lib/inst.h
*** TODO ASM [0%]
**** TODO asm/lexer.h
**** TODO asm/parser.h
*** TODO VM [0%]
**** TODO vm/runtime.h
** TODO Specification
* TODO Preprocessing directives :ASM:
Like in FASM or NASM where we can give certain helpful instructions to
the assembler.  I'd use the ~%~ symbol to designate preprocessor
directives.
** TODO Macros
Essentially constants expressions which take literal parameters
(i.e. tokens) and can use them throughout the body.  Something like
#+begin_src asm
%macro(name)(param1 param2 param3)
...
%end
#+end_src
Where each parameter is substituted in a call at preprocessing time.
A call should look something like this:
#+begin_src asm
  $name 1 2 3
#+end_src
and those tokens will be substituted literally in the macro body.
* WIP Write assembler in a different language :ASM:
While the runtime and base library needs to deal with only
binary, the assembler has to deal with string inputs and a larger
variety of bugs.  As the base library is written in C, and is all that
is necessary to write a program that targets the virtual machine, we
could realistically use another language to write the assembler in via
FFI with minimal pain.

Languages in the competition:
+ C++
+ Rust
+ Python

2024-04-14: Chose C++ cos it will require the least effort to rewrite
the currently existing codebase while still leveraging some less
efficient but incredibly useful features.
* TODO Rewrite preprocesser to create a custom unit instead of token streams
** Problem
A problem that occurs in the preprocessor is token column and line
count.  Say =a.asm= has ~%use "b.asm"~.  The tokens from the =b.asm=
file are inserted into =a.asm='s token stream, but the line/column
count from there isn't properly set in =a.asm=.

A naive solution would be to just recount the lines and columns, but
this removes information about where those tokens came from.  Say an
error occurs in some of =b.asm='s code: I would like to be able to
report them.

Therefore, we can no longer just generate new token streams from the
preprocesser and should instead look at making more complex
abstractions.

A problem this could also solve is nested errors and recursive
constants.  Say I have some assembly like so
#+begin_src asm
  %const limit 20 %end
  %const print-limit
  ...
  push.byte $limit
  print.byte
  ...
  %end
#+end_src

A call to ~print-limit~ under the current system would insert the
tokens for print-limit but completely forget about ~push.byte $limit~
which would cause a parsing error.  (This could be fixed under the
current system by allowing reference resolution inside of const
blocks, with the conceit that it would be hard to stop infinite recursion)
** Language model
The model I have in mind is that all constructs in this meta language
(the preprocessing language) are either singular tokens or collections
of tokens/constructs in a recursive sense.  This naturally follows
from the fact that a single pass isn't enough to properly parse this
language: there must be some recursive nature which forces the
language to take multiple passes to completely generate a stream that
can be parsed.

This vague notion can be formalised like so.  A preprocessing unit is
either a singular token or a named collection of units.  The former
represents your standard symbols and literals while the later
represents ~%const~ and ~%use~ calls where there is a clear name
associated to a collection of one or more tokens (in the case of the
former it's the constant's name and the latter it's the filename).
We'll distinguish this as well.

#+begin_src text
Token = PP_USE | PP_CONST | String(Content) | Symbol(Content) | PUSH(Content) | ...
Type = File(String) | Constant(Symbol)
Unit = Token      | Container(Type . Vector[Unit])
#+end_src

Through this model our initial stream of tokens can be considered
units.  We can already see that this model may solve our original
problem: with named containers it doesn't matter that certain tokens
are from different parts of the file or different files as they are
distinctly typed from the general set of tokens, with a name which
states where they're from.
** Processing
We need this model to have a notion of "processing" though, otherwise
it's quite useless.  A processing function is simply a function which
takes a unit and returns another unit.  We currently have two
processing functions we can consider: ~process_const~ and
~process_use~.

~process_use~ takes a vector of tokens and, upon encountering PP_USE
accepts the next token (a string) and tokenises the file
with that name.  Within our model we'd make the stream of tokens
created from opening the file a /container/.

~process_const~ takes a vector of tokens and does two things in an
iteration:
1) upon encountering PP_CONST accepts the next n tokens till PP_END is
   encountered, with the first token being a symbol.  This is
   registered in a map of constants (~CONSTS~) where the symbol is the
   key and the value associated is the n - 1 tokens accepted
2) upon encountering a PP_REFERENCE reads the content associated with
   it (considered a symbol ~S~) and replaces it ~CONSTS[S]~ (if S is
   in CONSTS).

One thing to note is that both of these definitions are easily
extensible to the general definition of units: if a unit is a
container of some kind we can recur through its vector of units to
resolve any further "calls".  For ~process_const~ it's ~%const~ or
~$ref~ while for ~process_use~ it's ~%use~.
** History/versioning
One additional facet to this model I'd like to add is "history".  Each
unit is actually a list (or a singly linked tree where each parent has
at most one child) of sub-units where the top of the list represents
the current version.  Each descendant is a previous version of the
token.

Say I do some processing on an element of the unit list =a= (with
index =i=) such that it becomes a new "unit", call it =b=.  Then we
update V by =V[i] = cons(b, a)=.  Through this, the lists acts as a
history of processing that has occurred on the unit.  This provides an
ability to trace the path of preprocessing to an eventual conclusion.

Processing occurs on a unit until it cannot be done further i.e. when
there are no more "calls" in the tree to resolve.  The history list
provides all the versions of a unit till its resolved form.

To see what a unit with history may look like (where symbols are
terminals i.e. completely resolved):
+ Container('limit' . [a Container("b" . d e f) c])
  + Container('limit' . [a '$b' c])
    + Token(PP_REF('$limit'))

This shows resolution of the unit reference ~$limit~, which in turn
leads to the resolution of ~$b~ which is a sub-unit.

There are two ways of indefinite resolution, one per method of
processing.  For ~process_use~ it is two files calling ~%use~ on each
other and for ~process_const~ it is a ~%const~ calling itself.  We can
just disallow it through analysis.
** Pseudocode
#+begin_src text
process_use(V: Vector[Unit]) ->
    [cons((if v is Token(PP_USE) and next(v) is Token(String(S))
             -> Container(File(S) . tokenise(open(v')))
           else if v is Container(name . units)
             -> Container(name . process_use(units))
           else
             -> v),
          v_x)
     v = v_x[0]
     for v_x in V]

CONSTS={}
process_const(V: Vector[Unit]) ->
    [cons((if v is Token(PP_CONST) and next(v) is Token(Symbol(S))
                do {
                    i := find(Token(PP_END), V[v:])
                    CONSTS[S] = V[next(v):prev(i)]
                    -> Container(Constant(S) . CONSTS[S])
                }
           else if v is Token(PP_REF(S))
                -> CONSTS[S]
           else if v is Container(name . units)
               -> Container(name . process_const(units))
           else
               -> v)
          v_x)
     v = v_x[0]
     for v_x in V]
#+end_src
* TODO Introduce error handling in base library :LIB:
There is a large variety of TODOs about errors.  Let's fix them!
8 TODOs currently present.
* TODO Standard library :ASM:VM:
I should start considering this and how a user may use it.  Should it
be an option in the VM and/or assembler binaries (i.e. a flag) or
something the user has to specify in their source files?

Something to consider is /static/ and /dynamic/ "linking" i.e.:
+ Static linking: assembler inserts all used library definitions into
  the bytecode output directly
  + We could insert all of it at the start of the bytecode file, and
    with [[*Start points][Start points]] this won't interfere with
    user code
    + 2023-11-03: Finishing the Start point feature has made these
      features more tenable.  A program header which is compiled and
      interpreted in bytecode works wonders.
  + Furthermore library code will have fixed program addresses (always
    at the start) so we'll know at start of assembler runtime where to
    resolve standard library subroutine calls
  + Virtual machine needs no changes to do this
** TODO Consider dynamic Linking
+ Dynamic linking: virtual machine has fixed program storage for
  library code (a ROM), and assembler makes jump references
  specifically for this program storage
  + When assembling subroutine calls, just need to put references to
    this library storage (some kind of shared state between VM and
    assembler to know what these references are)
  + VM needs to manage a ROM of some kind for library code
  + How do we ensure assembled links to subroutine calls don't
    conflict with user code jumps?

What follows is a possible dynamic linking strategy.  It requires
quite a few moving parts:

The address operand of every program control instruction (~CALL~,
~JUMP~, ~JUMP.IF~) has a specific encoding if the standard library is
dynamically linked:
+ If the most significant bit is 0, the remaining 63 bits encode an
  absolute address within the program
+ Otherwise, the address encodes a standard library subroutine.  The
  bits within the address follow this schema:
  + The next 30 bits represent the specific module where the
    subroutine is defined (over 1.07 *billion* possible library values)
  + The remaining 33 bits (4 bytes + 1 bit) encode the absolute
    program address in the bytecode of that specific module for the
    start of the subroutine (over 8.60 *billion* values)

The assembler will automatically encode this based on "%USE" calls and
the name of the subroutines called.  On the virtual machine, there is
a storage location (similar to the ROM of real machines) which stores
the bytecode for modules of the standard library, indexed by the
module number.  This means, on deserialising the address into the
proper components, the VM can refer to the module bytecode then jump
to the correct address.

2023-11-09: I'll need a way to run library code in the current program
system in the runtime.  It currently doesn't support jumps or work in
programs outside of the main one unfortunately.  Any proper work done
in this area requires some proper refactoring.

2023-11-09: Constants or inline macros need to be reconfigured for
this to work: at parse time, we work out the inlines directly which
means compiling bytecode with "standard library" macros will not work
as they won't be in the token stream.  Either we don't allow
preprocessor work in the standard library at all (which is bad cos we
can't then set standard limits or other useful things) or we insert
them into the registries at parse time for use in program parsing
(which not only requires assembler refactoring to figure out what
libraries are used (to pull definitions from) but also requires making
macros "recognisable" in bytecode because they're essentially
invisible).

2024-04-15: Perhaps we could insert the linking information into the
program header?
1) A table which states the load order of certain modules would allow
   the runtime to selectively spin up and properly delegate module
   jumps to the right bytecode
2)
* Completed
** DONE Write a label/jump system :ASM:
Essentially a user should be able to write arbitrary labels (maybe
through ~label x~ or ~x:~ syntax) which can be referred to by ~jump~.

It'll purely be on the assembler side as a processing step, where the
emitted bytecode purely refers to absolute addresses; the VM should
just be dealing with absolute addresses here.
** DONE Allow relative addresses in jumps :ASM:
As requested, a special syntax for relative address jumps.  Sometimes
it's a bit nicer than a label.
** DONE Calling and returning control flow :VM: :ASM:
When writing library code we won't know the addresses of where
callers are jumping from.  However, most library functions want to
return control flow back to where the user had called them: we want
the code to act almost like an inline function.

There are two ways I can think of achieving this:
+ Some extra syntax around labels (something like ~@inline <label>:~)
  which tells the assembly processor to inline the label when a "jump"
  to that label is given
  + This requires no changes to the VM, which keeps it simple, but a
    major change to the assembler to be able to inline code.  However,
    the work on writing a label system and relative addresses should
    provide some insight into how this could be possible.
+ A /call stack/ and two new syntactic constructs ~call~ and ~ret~
  which work like so:
  + When ~call <label>~ is encountered, the next program address is
    pushed onto the call stack and control flow is set to the label
  + During execution of the ~<label>~, when a ~ret~ is encountered,
    pop an address off the call stack and set control flow to that
    address
  + This simulates the notion of "calling" and "returning from" a
    function in classical languages, but requires more machinery on
    the VM side.

2024-04-15: The latter option was chosen, though the former has been
implemented through [[*Constants][Constants]].
** DONE Start points :ASM:VM:
In standard assembly you can write
#+begin_src asm
  global _start
_start:
  ...
#+end_src
and that means the label ~_start~ is the point the program should
start from.  This means the user can define other code anywhere in the
program and specify something similar to "main" in C programs.

Proposed syntax:
#+begin_src asm
  init <label>
#+end_src

2024-04-15: Used the same syntax as standard assembly, with the
conceit that multiple ~global~'s may be present but only the last one
has an effect.
** DONE Constants
Essentially a directive which assigns some literal to a symbol as a
constant.  Something like
#+begin_src asm
%const(n) 20 %end
#+end_src

Then, during my program I could use it like so
#+begin_src asm
...
  push.word $n
  print.word
#+end_src

The preprocessor should convert this to the equivalent code of
#+begin_src asm
...
  push.word 20
  print.word
#+end_src

2023-11-04: You could even put full program instructions for a
constant potentially
#+begin_src asm
%const(print-1)
  push.word 1
  print.word
%end
#+end_src
which when referred to (by ~$print-1~) would insert the bytecode given
inline.
** DONE Rigid endian :LIB:
Say a program is compiled on a little endian machine.  The resultant
bytecode file, as a result of using C's internal functions, will use
little endian.

This file, when distributed to other computers, will not work on those
that use big endian.

This is a massive problem; I would like bytecode compiled on one
computer to work on any other one.  Therefore we have to enforce big
endian.  This refactor is limited to only LIB as a result of only the
~convert_*~ functions being used in the runtime to convert between
byte buffers (usually read from the bytecode file directly or from
memory to use in the stack).

2024-04-09: Found the ~hto_e~ functions under =endian.h= that provide
both way host to specific endian conversion of shorts, half words and
words.  This will make it super simple to just convert.

2024-04-15: Found it better to implement the functions myself as
=endian.h= is not particularly portable.
** DONE Import another file
Say I have two "asm" files: /a.asm/ and /b.asm/.

#+CAPTION: a.asm
#+begin_src asm
  global main
main:
  push.word 1
  push.word 1
  push.word 1
  sub.word
  sub.word
  call b-println
  halt
#+end_src

#+CAPTION: b.asm
#+begin_src asm
b-println:
  print.word
  push.byte '\n'
  print.char
  ret
#+end_src

How would one assemble this?  We've got two files, with /a.asm/
depending on /b.asm/ for the symbol ~b-println~.  It's obvious they
need to be assembled "together" to make something that could work.  A
possible "correct" program would be having the file /b.asm/ completely
included into /a.asm/, such that compiling /a.asm/ would lead to
classical symbol resolution without much hassle.  As a feature, this
would be best placed in the preprocessor as symbol resolution occurs
in the third stage of parsing (~process_presults~), whereas the
preprocessor is always the first stage.

That would be a very simple way of solving the static vs dynamic
linking problem: just include the files you actually need.  Even the
standard library would be fine and not require any additional work.
Let's see how this would work.