
#+title: TODOs
#+author: Aryadev Chavali
#+date: 2023-11-02
#+startup: noindent

* WIP Write assembler in a different language :ASM:
While the runtime and base library need to deal only with binary, the
assembler has to deal with string inputs and a larger variety of
bugs.  As the base library is written in C, and is all that is
necessary to write a program that targets the virtual machine, we
could realistically write the assembler in another language via FFI
with minimal pain.

Languages in the competition:
+ C++
+ Rust
+ Python

2024-04-14: Chose C++ because it will require the least effort to
rewrite the existing codebase while still leveraging some less
efficient but incredibly useful features.
** DONE Write Lexer
** WIP Write Preprocessor
** TODO Write parser
* TODO Better documentation [0%] :DOC:
** TODO Comment coverage [0%]
*** TODO ASM [0%]
**** TODO asm/lexer.h
**** TODO asm/parser.h
** TODO Write a specification
* TODO Preprocessing directives :ASM:
As in FASM or NASM, we should be able to give certain helpful
instructions to the assembler.  I'd use the ~%~ symbol to designate
preprocessor directives.
** TODO Macros
Essentially constant expressions which take literal parameters
(i.e. tokens) and can use them throughout the body.  Something like
#+begin_src asm
%macro(name)(param1 param2 param3)
...
%end
#+end_src
Where each parameter is substituted in a call at preprocessing time.
A call should look something like this:
#+begin_src asm
  $name 1 2 3
#+end_src
and those tokens will be substituted literally in the macro body.
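
A minimal sketch of that substitution step in C++, under a few
assumptions of mine: tokens are plain strings, parameters are
referred to in the body by their bare names, and the names ~Macro~
and ~expand_macro~ are hypothetical rather than anything in the
current codebase.
#+begin_src cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical token representation: plain strings.
using Tokens = std::vector<std::string>;

struct Macro
{
  Tokens params; // parameter names, e.g. {"param1", "param2"}
  Tokens body;   // body tokens; parameters appear as bare names (assumed)
};

// Expand one call such as `$name 1 2 3`: every body token that names a
// parameter is replaced literally by the matching argument token.
Tokens expand_macro(const Macro &macro, const Tokens &args)
{
  if (args.size() != macro.params.size())
    throw std::invalid_argument("macro called with wrong argument count");

  // Bind each parameter name to the argument in the same position.
  std::unordered_map<std::string, std::string> bindings;
  for (std::size_t i = 0; i < args.size(); ++i)
    bindings[macro.params[i]] = args[i];

  Tokens out;
  for (const std::string &tok : macro.body)
  {
    auto it = bindings.find(tok);
    out.push_back(it == bindings.end() ? tok : it->second);
  }
  return out;
}
#+end_src
Rejecting a call with the wrong number of arguments at this stage
keeps the error close to the call site instead of surfacing later in
the parser.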
* Completed
** DONE Write a label/jump system :ASM:
Essentially a user should be able to write arbitrary labels (maybe
through ~label x~ or ~x:~ syntax) which can be referred to by ~jump~.

It'll purely be an assembler-side processing step: the emitted
bytecode refers only to absolute addresses, so the VM never has to
know about labels.
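
A rough sketch of how such a processing step could look, as two
passes over the parsed program.  The ~Item~ and ~Resolved~ types are
hypothetical, and I'm assuming addresses are instruction indices
rather than byte offsets.
#+begin_src cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical parse result: either a label definition (`x:`) or an
// instruction that may refer to a label (e.g. `jump x`).
struct Item
{
  bool is_label;
  std::string name;   // label name, or the instruction mnemonic
  std::string target; // label referred to; empty if none
};

struct Resolved
{
  std::string name;
  bool has_address;
  std::size_t address; // absolute address operand, valid if has_address
};

std::vector<Resolved> resolve_labels(const std::vector<Item> &items)
{
  // Pass 1: a label's address is the index of the next instruction.
  std::unordered_map<std::string, std::size_t> addresses;
  std::size_t next = 0;
  for (const Item &item : items)
  {
    if (item.is_label)
      addresses[item.name] = next;
    else
      ++next;
  }

  // Pass 2: drop the labels and replace references with addresses.
  std::vector<Resolved> out;
  for (const Item &item : items)
  {
    if (item.is_label)
      continue;
    Resolved r{item.name, false, 0};
    if (!item.target.empty())
    {
      auto it = addresses.find(item.target);
      if (it == addresses.end())
        throw std::runtime_error("undefined label: " + item.target);
      r.has_address = true;
      r.address = it->second;
    }
    out.push_back(r);
  }
  return out;
}
#+end_src
Collecting every label before resolving anything means forward
references (a ~jump~ to a label defined later in the file) work
without special handling.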
** DONE Allow relative addresses in jumps :ASM:
As requested, a special syntax for relative address jumps.  Sometimes
it's a bit nicer than a label.
** DONE Calling and returning control flow :VM:ASM:
When writing library code we won't know the addresses callers are
jumping from.  However, most library functions want to return control
flow back to where the user called them: we want the code to act
almost like an inline function.

There are two ways I can think of achieving this:
+ Some extra syntax around labels (something like ~@inline <label>:~)
  which tells the assembly processor to inline the label when a "jump"
  to that label is given
  + This requires no changes to the VM, which keeps it simple, but a
    major change to the assembler to be able to inline code.  However,
    the work on writing a label system and relative addresses should
    provide some insight into how this could be possible.
+ A /call stack/ and two new syntactic constructs ~call~ and ~ret~
  which work like so:
  + When ~call <label>~ is encountered, the next program address is
    pushed onto the call stack and control flow is set to the label
  + During execution of the ~<label>~, when a ~ret~ is encountered,
    pop an address off the call stack and set control flow to that
    address
  + This simulates the notion of "calling" and "returning from" a
    function in classical languages, but requires more machinery on
    the VM side.

2024-04-15: The latter option was chosen, though the former has been
implemented through [[*Constants][Constants]].
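
For reference, the call stack option boils down to something like the
following sketch.  It is written in C++ for illustration only; the
~VM~ struct, ~op_call~ and ~op_ret~ are hypothetical names, and I'm
assuming the "next program address" is simply the program counter
plus one.
#+begin_src cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical slice of VM state, just enough for call/ret.
struct VM
{
  std::size_t pc = 0;                  // program counter
  std::vector<std::size_t> call_stack; // pending return addresses
};

// `call <label>`: remember where to come back to, then jump.
void op_call(VM &vm, std::size_t target)
{
  vm.call_stack.push_back(vm.pc + 1); // next program address (assumed pc + 1)
  vm.pc = target;
}

// `ret`: pop the most recent return address and resume there.
void op_ret(VM &vm)
{
  if (vm.call_stack.empty())
    throw std::runtime_error("ret with an empty call stack");
  vm.pc = vm.call_stack.back();
  vm.call_stack.pop_back();
}
#+end_src
Because the return addresses live on a stack, nested calls unwind in
the right order without the assembler knowing who the callers are.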
** DONE Start points :ASM:VM:
In standard assembly you can write
#+begin_src asm
  global _start
_start:
  ...
#+end_src
and that means the label ~_start~ is the point the program should
start from.  This means the user can define other code anywhere in the
program and specify something similar to "main" in C programs.

Proposed syntax:
#+begin_src asm
  init <label>
#+end_src

2024-04-15: Used the same syntax as standard assembly, with the
caveat that multiple ~global~ directives may be present but only the
last one has an effect.
** DONE Constants
Essentially a directive which assigns some literal to a symbol as a
constant.  Something like
#+begin_src asm
%const(n) 20 %end
#+end_src

Then, during my program I could use it like so
#+begin_src asm
...
  push.word $n
  print.word
#+end_src

The preprocessor should convert this to the equivalent code of
#+begin_src asm
...
  push.word 20
  print.word
#+end_src

2023-11-04: You could potentially even put full program instructions
in a constant:
#+begin_src asm
%const(print-1)
  push.word 1
  print.word
%end
#+end_src
which when referred to (by ~$print-1~) would insert the bytecode given
inline.
** DONE Rigid endian :LIB:
Say a program is compiled on a little endian machine.  The resultant
bytecode file, as a result of using C's internal functions, will use
little endian.

This file, when distributed to other computers, will not work on those
that use big endian.

This is a massive problem; I would like bytecode compiled on one
computer to work on any other one.  Therefore we have to enforce big
endian.  This refactor is limited to LIB, since the runtime only uses
the ~convert_*~ functions to convert to and from byte buffers
(usually read directly from the bytecode file, or from memory for use
on the stack).

2024-04-09: Found the ~hto_e~ family of functions (e.g. ~htobe64~,
~htole64~) in =endian.h=, which convert shorts, half words and words
between host byte order and a specific endianness in both directions.
This will make it super simple to just convert.

2024-04-15: Found it better to implement the functions myself as
=endian.h= is not particularly portable.
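
A sketch of what a hand-rolled, host-independent conversion can look
like.  The function names below are hypothetical (they are not the
library's ~convert_*~ functions), and I'm assuming a word is 64 bits.
#+begin_src cpp
#include <cstddef>
#include <cstdint>

// Write a 64-bit word into `out`, most significant byte first (big
// endian).  Shifting the value never depends on the host byte order.
void word_to_bytes_be(std::uint64_t word, std::uint8_t out[8])
{
  for (std::size_t i = 0; i < 8; ++i)
    out[i] = static_cast<std::uint8_t>(word >> (8 * (7 - i)));
}

// Read a big endian byte buffer back into a 64-bit word.
std::uint64_t bytes_to_word_be(const std::uint8_t in[8])
{
  std::uint64_t word = 0;
  for (std::size_t i = 0; i < 8; ++i)
    word = (word << 8) | in[i];
  return word;
}
#+end_src
Working with shifts rather than casting pointers or calling
platform headers means the same source produces the same bytecode
layout on little and big endian hosts.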
** DONE Import another file
Say I have two "asm" files: /a.asm/ and /b.asm/.

#+CAPTION: a.asm
#+begin_src asm
  global main
main:
  push.word 1
  push.word 1
  push.word 1
  sub.word
  sub.word
  call b-println
  halt
#+end_src

#+CAPTION: b.asm
#+begin_src asm
b-println:
  print.word
  push.byte '\n'
  print.char
  ret
#+end_src

How would one assemble this?  We've got two files, with /a.asm/
depending on /b.asm/ for the symbol ~b-println~.  It's obvious they
need to be assembled "together" to make something that could work.  A
possible "correct" program would be having the file /b.asm/ completely
included into /a.asm/, such that compiling /a.asm/ would lead to
classical symbol resolution without much hassle.  As a feature, this
would be best placed in the preprocessor as symbol resolution occurs
in the third stage of parsing (~process_presults~), whereas the
preprocessor is always the first stage.

That would be a very simple way of solving the static vs dynamic
linking problem: just include the files you actually need.  Even the
standard library would be fine and not require any additional work.
Let's see how this would work.
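
One way the inclusion pass could be sketched, working line by line
before any other preprocessing.  Everything here is an assumption:
the ~%include~ spelling of the directive, the ~include_files~ name,
and the ~Loader~ callback that stands in for reading a file from
disk.
#+begin_src cpp
#include <functional>
#include <set>
#include <string>
#include <vector>

using Lines = std::vector<std::string>;
// Stand-in for reading a file; returns its lines.
using Loader = std::function<Lines(const std::string &)>;

// Replace each `%include <path>` line (hypothetical syntax) with the
// contents of that file, recursively.  `seen` records files already
// included so that a file is pasted at most once and cycles terminate.
Lines include_files(const std::string &path, const Loader &load,
                    std::set<std::string> &seen)
{
  if (!seen.insert(path).second)
    return {}; // already included: contribute nothing

  const std::string directive = "%include ";
  Lines out;
  for (const std::string &line : load(path))
  {
    if (line.compare(0, directive.size(), directive) == 0)
    {
      Lines inner = include_files(line.substr(directive.size()), load, seen);
      out.insert(out.end(), inner.begin(), inner.end());
    }
    else
      out.push_back(line);
  }
  return out;
}
#+end_src
With /a.asm/ starting with ~%include b.asm~, the result is a single
stream of lines containing ~b-println~ followed by ~main~, which the
later stages can resolve like any single-file program.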
** DONE Rewrite lexer
~push.magic~ is a valid PUSH token according to the current lexer.
I'd like to clamp down on this obvious error at the lexer itself, so
the parser can be dedicated to just dealing with address resolution
and conversion to opcodes.

How about an enum which represents the possible type of the operator?
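
A small sketch of that idea.  The enum members are only the type
suffixes that appear in these notes (~byte~, ~char~, ~word~), so the
set is likely incomplete, and ~OperandType~ and ~lex_typed_operator~
are hypothetical names.
#+begin_src cpp
#include <cstddef>
#include <string>

// Type suffixes seen in these notes; the real set may be larger.
enum class OperandType { Byte, Char, Word };

// Map a suffix such as "word" to its enum value; false if unknown.
bool parse_operand_type(const std::string &suffix, OperandType &out)
{
  if (suffix == "byte")
    out = OperandType::Byte;
  else if (suffix == "char")
    out = OperandType::Char;
  else if (suffix == "word")
    out = OperandType::Word;
  else
    return false;
  return true;
}

// Split a token like `push.word` into mnemonic and operand type.
// Tokens with no suffix or an unknown one (`push.magic`) are rejected,
// so the parser never sees them.
bool lex_typed_operator(const std::string &token, std::string &mnemonic,
                        OperandType &type)
{
  const std::size_t dot = token.find('.');
  if (dot == std::string::npos)
    return false;
  if (!parse_operand_type(token.substr(dot + 1), type))
    return false;
  mnemonic = token.substr(0, dot);
  return true;
}
#+end_src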