#+title: TODOs
#+author: Aryadev Chavali
#+date: 2023-11-02
#+startup: noindent

* WIP Write assembler in a different language :ASM:
While the runtime and base library only need to deal with binary, the
assembler has to deal with string inputs and a larger variety of
bugs. As the base library is written in C, and is all that is
necessary to write a program that targets the virtual machine, we
could realistically write the assembler in another language, via FFI,
with minimal pain.

Languages in the competition:
+ C++
+ Rust
+ Python

2024-04-14: Chose C++ because it will require the least effort to
rewrite the existing codebase while still leveraging some less
efficient but incredibly useful features.
** DONE Write Lexer
** WIP Write Preprocessor
** TODO Write parser
* TODO Better documentation [0%] :DOC:
** TODO Comment coverage [0%]
*** TODO ASM [0%]
**** TODO asm/lexer.h
**** TODO asm/parser.h
** TODO Write a specification
* TODO Preprocessing directives :ASM:
Like in FASM or NASM where we can give certain helpful instructions to
the assembler. I'd use the ~%~ symbol to designate preprocessor
directives.
** TODO Macros
Essentially constant expressions which take literal parameters
(i.e. tokens) and can use them throughout the body. Something like
#+begin_src asm
%macro(name)(param1 param2 param3)
...
%end
#+end_src
where each parameter is substituted in a call at preprocessing time.
A call should look something like this:
#+begin_src asm
$name 1 2 3
#+end_src
and those tokens will be substituted literally in the macro body.
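
A minimal sketch of the substitution step in C++ (the language chosen
for the assembler), assuming the macro has already been parsed into
parameter names and body tokens. ~Macro~ and ~expand_macro~ are
hypothetical names for illustration, not part of the existing
codebase.
#+begin_src cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical representation of a parsed macro definition.
struct Macro
{
  std::vector<std::string> params; // e.g. {"param1", "param2", "param3"}
  std::vector<std::string> body;   // tokens between the header and %end
};

// Expand one call: each body token that names a parameter is replaced
// by the argument in the same position; other tokens are copied as is.
std::vector<std::string> expand_macro(const Macro &macro,
                                      const std::vector<std::string> &args)
{
  // A real preprocessor would report an arity mismatch as a user error.
  assert(args.size() == macro.params.size());

  std::unordered_map<std::string, std::string> bindings;
  for (std::size_t i = 0; i < macro.params.size(); ++i)
    bindings[macro.params[i]] = args[i];

  std::vector<std::string> expanded;
  expanded.reserve(macro.body.size());
  for (const std::string &token : macro.body)
  {
    const auto it = bindings.find(token);
    expanded.push_back(it == bindings.end() ? token : it->second);
  }
  return expanded;
}
#+end_src
With params ~x y~ and body ~push.word x push.word y sub.word~, the
call ~$name 1 2~ would expand to ~push.word 1 push.word 2 sub.word~.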
* Completed
** DONE Write a label/jump system :ASM:
Essentially a user should be able to write arbitrary labels (maybe
through ~label x~ or ~x:~ syntax) which can be referred to by ~jump~.

This lives purely on the assembler side as a processing step: the
emitted bytecode refers only to absolute addresses, so the VM only
ever deals with absolute addresses.
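
A rough sketch of how such a processing step could work, as two
passes over the program. It assumes addresses count instructions
(not bytes) and that labels emit no bytecode; ~AsmLine~, ~Emitted~
and ~resolve_labels~ are made-up names for illustration.
#+begin_src cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified view of one line of assembly.
struct AsmLine
{
  enum class Kind { Label, Jump, Other };
  Kind kind;
  std::string name; // label name for Label/Jump, mnemonic otherwise
};

// What reaches the VM: jumps carry an absolute address, no labels left.
struct Emitted
{
  std::string op;
  std::size_t operand; // only meaningful for "jump"
};

std::vector<Emitted> resolve_labels(const std::vector<AsmLine> &program)
{
  // Pass 1: record the address of the instruction following each label.
  std::unordered_map<std::string, std::size_t> addresses;
  std::size_t address = 0;
  for (const AsmLine &line : program)
  {
    if (line.kind == AsmLine::Kind::Label)
      addresses[line.name] = address;
    else
      ++address;
  }

  // Pass 2: drop labels and replace jump targets with addresses.
  std::vector<Emitted> out;
  for (const AsmLine &line : program)
  {
    switch (line.kind)
    {
    case AsmLine::Kind::Label:
      break;
    case AsmLine::Kind::Jump:
      // at() throws std::out_of_range for an undefined label.
      out.push_back({"jump", addresses.at(line.name)});
      break;
    case AsmLine::Kind::Other:
      out.push_back({line.name, 0});
      break;
    }
  }
  assert(out.size() == address);
  return out;
}
#+end_src
Doing the address collection as a separate first pass is what lets a
~jump~ refer to a label defined later in the file.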
** DONE Allow relative addresses in jumps :ASM:
As requested, a special syntax for relative address jumps. Sometimes
it's a bit nicer than a label.
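
Since the VM only sees absolute addresses, the assembler would have
to turn a relative offset into one. A small sketch of that
arithmetic, assuming offsets are relative to the address of the jump
instruction itself and counted in instructions;
~relative_to_absolute~ is a hypothetical helper, and it does not
guard against offsets large enough to overflow.
#+begin_src cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>

// Convert a jump relative to the current instruction into an absolute
// address, or nullopt if it would land outside [0, program_size).
std::optional<std::size_t> relative_to_absolute(std::size_t current,
                                                std::int64_t offset,
                                                std::size_t program_size)
{
  assert(current < program_size);
  const std::int64_t target = static_cast<std::int64_t>(current) + offset;
  if (target < 0 || static_cast<std::uint64_t>(target) >= program_size)
    return std::nullopt;
  return static_cast<std::size_t>(target);
}
#+end_src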
** DONE Calling and returning control flow :VM:ASM:
When writing library code we won't know the addresses callers are
jumping from. However, most library functions want to return control
flow back to where the user called them: we want the code to act
almost like an inline function.

There are two ways I can think of achieving this:
+ Some extra syntax around labels (something like ~@inline <label>:~)
  which tells the assembly processor to inline the label when a "jump"
  to that label is given
  + This requires no changes to the VM, which keeps it simple, but a
    major change to the assembler to be able to inline code. However,
    the work on writing a label system and relative addresses should
    provide some insight into how this could be possible.
+ A /call stack/ and two new syntactic constructs ~call~ and ~ret~
  which work like so:
  + When ~call <label>~ is encountered, the next program address is
    pushed onto the call stack and control flow is set to the label
  + During execution of the ~<label>~, when a ~ret~ is encountered,
    pop an address off the call stack and set control flow to that
    address
  + This simulates the notion of "calling" and "returning from" a
    function in classical languages, but requires more machinery on
    the VM side.

2024-04-15: The latter option was chosen, though the former has been
implemented through [[*Constants][Constants]].
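
The call stack option can be sketched with a toy interpreter loop.
This is only an illustration of the ~call~ / ~ret~ rules described
above: ~Op~, ~VMInst~ and ~run_program~ are hypothetical and much
simpler than the real VM, and addresses here are instruction indices.
#+begin_src cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Op { Call, Ret, Halt, Nop };

struct VMInst
{
  Op op;
  std::size_t target; // only used by Call
};

// Runs until Halt (or the end of the program) and returns the order
// in which addresses were executed.
std::vector<std::size_t> run_program(const std::vector<VMInst> &program)
{
  std::vector<std::size_t> call_stack;
  std::vector<std::size_t> trace;
  std::size_t pc = 0;
  while (pc < program.size())
  {
    trace.push_back(pc);
    const VMInst &inst = program[pc];
    switch (inst.op)
    {
    case Op::Call:
      call_stack.push_back(pc + 1); // resume here after the ret
      pc = inst.target;
      break;
    case Op::Ret:
      assert(!call_stack.empty()); // ret without a matching call
      pc = call_stack.back();
      call_stack.pop_back();
      break;
    case Op::Halt:
      return trace;
    case Op::Nop:
      ++pc;
      break;
    }
  }
  return trace;
}
#+end_src
For a program laid out as ~call 3~, ~nop~, ~halt~, ~nop~, ~ret~, the
addresses are visited in the order 0, 3, 4, 1, 2: the callee never
needs to know where it was called from.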
** DONE Start points :ASM:VM:
In standard assembly you can write
#+begin_src asm
global _start
_start:
...
#+end_src
and that means the label ~_start~ is the point the program should
start from. This means the user can define other code anywhere in the
program and specify something similar to "main" in C programs.

Proposed syntax:
#+begin_src asm
init <label>
#+end_src

2024-04-15: Used the same syntax as standard assembly, with the
caveat that multiple ~global~ directives may be present but only the
last one has an effect.
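
The "last one wins" rule is simple to express over a token stream. A
sketch, assuming whitespace-separated tokens with the label following
~global~ directly; ~find_start_label~ is a made-up name.
#+begin_src cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Returns the label named by the last "global <label>" pair, or
// nullopt if the program has no global directive at all.
std::optional<std::string>
find_start_label(const std::vector<std::string> &tokens)
{
  std::optional<std::string> start;
  for (std::size_t i = 0; i + 1 < tokens.size(); ++i)
    if (tokens[i] == "global")
      start = tokens[i + 1]; // later directives overwrite earlier ones
  return start;
}
#+end_src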
** DONE Constants
Essentially a directive which assigns some literal to a symbol as a
constant. Something like
#+begin_src asm
%const(n) 20 %end
#+end_src

Then, in my program, I could use it like so
#+begin_src asm
...
push.word $n
print.word
#+end_src

The preprocessor should convert this to the equivalent code of
#+begin_src asm
...
push.word 20
print.word
#+end_src

2023-11-04: You could potentially even put full program instructions
in a constant
#+begin_src asm
%const(print-1)
push.word 1
print.word
%end
#+end_src
which when referred to (by ~$print-1~) would insert the bytecode given
inline.
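
A sketch of that conversion as a single pass over tokens. It assumes
whitespace-separated tokens (so ~%const(n)~ arrives as one token),
that a constant is defined before it is used, and that bodies are
not themselves expanded; ~expand_constants~ is a hypothetical name.
#+begin_src cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

using Tokens = std::vector<std::string>;

// Strips every "%const(name) ... %end" block from the token stream and
// replaces each later "$name" with the recorded body tokens.
Tokens expand_constants(const Tokens &input)
{
  const std::string open = "%const(";
  std::unordered_map<std::string, Tokens> constants;
  Tokens output;

  for (std::size_t i = 0; i < input.size(); ++i)
  {
    const std::string &token = input[i];
    if (token.compare(0, open.size(), open) == 0 && token.back() == ')')
    {
      // "%const(n)" -> "n"
      const std::string name =
          token.substr(open.size(), token.size() - open.size() - 1);
      Tokens body;
      for (++i; i < input.size() && input[i] != "%end"; ++i)
        body.push_back(input[i]);
      constants[name] = body;
    }
    else if (token.size() > 1 && token[0] == '$')
    {
      const auto it = constants.find(token.substr(1));
      // A real preprocessor would report an unknown constant as an error.
      assert(it != constants.end());
      output.insert(output.end(), it->second.begin(), it->second.end());
    }
    else
      output.push_back(token);
  }
  return output;
}
#+end_src
Because the body is stored as a token list, a single literal like
~20~ and a multi-instruction body like ~print-1~ go through the same
code path.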
** DONE Rigid endian :LIB:
Say a program is compiled on a little endian machine. The resultant
bytecode file, as a result of using C's internal functions, will use
little endian.

This file, when distributed to other computers, will not work on those
that use big endian.

This is a massive problem; I would like bytecode compiled on one
computer to work on any other one. Therefore we have to enforce big
endian. This refactor is limited to LIB, since only the ~convert_*~
functions are used in the runtime to convert to and from byte buffers
(usually read from the bytecode file directly, or from memory for use
on the stack).

2024-04-09: Found the ~hto_e~ family of functions (e.g. ~htobe32~)
under =endian.h=, which convert in both directions between host and a
specific endianness for shorts, half words and words. This will make
it super simple to just convert.

2024-04-15: Found it better to implement the functions myself as
=endian.h= is not particularly portable.
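
One portable way to do this without =endian.h= is to build the bytes
with shifts, which operate on values rather than on memory layout and
so give the same result on any host. A sketch for an assumed 64 bit
word; ~word_to_bytes_be~ and ~bytes_to_word_be~ are illustrative
names, not the actual ~convert_*~ functions.
#+begin_src cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Write a 64 bit value into buf (8 bytes) in big endian order,
// most significant byte first, regardless of host byte order.
void word_to_bytes_be(std::uint64_t word, std::uint8_t *buf)
{
  assert(buf != nullptr);
  for (std::size_t i = 0; i < 8; ++i)
    buf[i] = static_cast<std::uint8_t>(word >> (8 * (7 - i)));
}

// Read 8 big endian bytes back into a host-order 64 bit value.
std::uint64_t bytes_to_word_be(const std::uint8_t *buf)
{
  assert(buf != nullptr);
  std::uint64_t word = 0;
  for (std::size_t i = 0; i < 8; ++i)
    word = (word << 8) | buf[i];
  return word;
}
#+end_src
The same pattern with 2 or 4 iterations would cover the narrower
types.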
** DONE Import another file
Say I have two "asm" files: /a.asm/ and /b.asm/.

#+CAPTION: a.asm
#+begin_src asm
global main
main:
push.word 1
push.word 1
push.word 1
sub.word
sub.word
call b-println
halt
#+end_src

#+CAPTION: b.asm
#+begin_src asm
b-println:
print.word
push.byte '\n'
print.char
ret
#+end_src

How would one assemble this? We've got two files, with /a.asm/
depending on /b.asm/ for the symbol ~b-println~. It's obvious they
need to be assembled "together" to make something that could work. A
possible "correct" approach would be to include /b.asm/ completely
into /a.asm/, such that assembling /a.asm/ would lead to classical
symbol resolution without much hassle. As a feature, this would be
best placed in the preprocessor, as symbol resolution occurs in the
third stage of parsing (~process_presults~), whereas the preprocessor
is always the first stage.

That would be a very simple way of solving the static vs dynamic
linking problem: just include the files you actually need. Even the
standard library would be fine and not require any additional work.
Let's see how this would work.
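
A sketch of textual inclusion at the preprocessor stage. The
directive name ~%include~ is an assumption made for this example
(these notes do not fix a syntax), and an in-memory map stands in for
the file system so the example stays self-contained. Skipping files
that were already included is an extra design choice here, to stop
include cycles.
#+begin_src cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

using Tokens = std::vector<std::string>;
using Files = std::map<std::string, Tokens>; // stand-in for the disk

// Appends the tokens of `name` to `out`, replacing every
// "%include <file>" pair with that file's (recursively expanded)
// tokens.  Files already seen are skipped.
void include_file(const Files &files, const std::string &name,
                  std::set<std::string> &seen, Tokens &out)
{
  if (!seen.insert(name).second)
    return; // already included once
  const Tokens &tokens = files.at(name); // throws if the file is unknown
  for (std::size_t i = 0; i < tokens.size(); ++i)
  {
    if (tokens[i] == "%include" && i + 1 < tokens.size())
      include_file(files, tokens[++i], seen, out);
    else
      out.push_back(tokens[i]);
  }
}

Tokens preprocess(const Files &files, const std::string &entry)
{
  std::set<std::string> seen;
  Tokens out;
  include_file(files, entry, seen, out);
  return out;
}
#+end_src
After this pass the later stages see one flat token stream, so
~call b-println~ resolves like any other label.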
** DONE Rewrite lexer
~push.magic~ is a valid PUSH token according to the current lexer.
I'd like to clamp down on this obvious error at the lexer itself, so
the parser can be dedicated to just dealing with address resolution
and conversion to opcodes.

How about an enum which represents the possible type of the operator?
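
One possible shape for that enum, as a sketch only. The set of
suffixes is an assumption: only ~.byte~ and ~.word~ appear in these
notes, so only those are listed, and ~OperandType~, ~PushToken~ and
~lex_push~ are hypothetical names.
#+begin_src cpp
#include <cassert>
#include <optional>
#include <string>

// Assumed type suffixes; extend with whatever the VM really supports.
enum class OperandType { Byte, Word };

struct PushToken
{
  OperandType type;
};

// Lex "push.<type>".  An unknown suffix such as "push.magic" is
// rejected here, so it can never reach the parser.
std::optional<PushToken> lex_push(const std::string &text)
{
  const std::string prefix = "push.";
  if (text.compare(0, prefix.size(), prefix) != 0)
    return std::nullopt;
  const std::string suffix = text.substr(prefix.size());
  if (suffix == "byte")
    return PushToken{OperandType::Byte};
  if (suffix == "word")
    return PushToken{OperandType::Word};
  return std::nullopt;
}
#+end_src
With the type carried in the token, the parser can switch on
~OperandType~ instead of re-inspecting the token's text.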