This repository has been archived on 2025-11-10. You can view files and clone it. You cannot open issues or pull requests or push a commit.
aal/todo.org
2024-07-07 19:08:09 +01:00

TODOs

WIP Write assembler in a different language   ASM

While the runtime and base library need to deal with only binary, the assembler has to deal with string inputs and a larger variety of bugs. As the base library is written in C, and is all that is necessary to write a program that targets the virtual machine, we could realistically write the assembler in another language and call the base library via FFI with minimal pain.

Languages in the competition:

  • C++
  • Rust
  • Python

2024-04-14: Chose C++ cos it will require the least effort to rewrite the currently existing codebase while still leveraging some less efficient but incredibly useful features.

DONE Write Lexer

WIP Write Preprocessor

TODO Write parser

TODO Better documentation [0%]   DOC

TODO Comment coverage [0%]

TODO ASM [0%]

TODO asm/lexer.h
TODO asm/parser.h

TODO Write a specification

TODO Preprocessing directives   ASM

Like in FASM or NASM where we can give certain helpful instructions to the assembler. I'd use the % symbol to designate preprocessor directives.

TODO Macros

Essentially constant expressions which take literal parameters (i.e. tokens) and can use them throughout the body. Something like

%macro(name)(param1 param2 param3)
...
%end

Where each parameter is substituted in a call at preprocessing time. A call should look something like this:

  $name 1 2 3

and those tokens will be substituted literally in the macro body.
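A minimal sketch of the substitution step, assuming the macro has already been parsed into a parameter list and a list of body tokens. The `Macro` struct, `expand_macro`, and the idea that the body refers to a parameter as `$param1` are all assumptions for illustration; the notes above don't fix how parameters are referenced inside the body.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical representation of a parsed %macro(name)(params...) ... %end block.
struct Macro {
  std::vector<std::string> params; // e.g. {"param1", "param2", "param3"}
  std::vector<std::string> body;   // body tokens; "$param1" names a parameter (assumed)
};

// Expand one call such as `$name 1 2 3`: each body token of the form
// "$<param>" is replaced by the matching argument token; every other
// token is copied through unchanged.
std::vector<std::string> expand_macro(const Macro &macro,
                                      const std::vector<std::string> &args) {
  if (args.size() != macro.params.size())
    throw std::runtime_error("macro called with wrong number of arguments");

  std::unordered_map<std::string, std::string> bindings;
  for (std::size_t i = 0; i < args.size(); ++i)
    bindings["$" + macro.params[i]] = args[i];

  std::vector<std::string> out;
  for (const std::string &tok : macro.body) {
    auto it = bindings.find(tok);
    out.push_back(it == bindings.end() ? tok : it->second);
  }
  return out;
}
```

Substitution is purely token-for-token here, which matches the "substituted literally" wording; nothing is evaluated at preprocessing time.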

Completed

DONE Write a label/jump system   ASM

Essentially a user should be able to write arbitrary labels (maybe through label x or x: syntax) which can be referred to by jump.

It'll live purely on the assembler side as a processing step: the emitted bytecode refers only to absolute addresses, so the VM only ever deals with absolute addresses.
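One way this processing step could look is the usual two-pass scheme: first record where each label points, then rewrite every symbolic operand. This is a sketch, not the actual assembler code; `Line`, `Resolved`, `resolve_labels`, and the choice to count addresses in whole instructions are assumptions.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical assembler-side line: either a label definition ("x:")
// or an instruction with an optional symbolic operand.
struct Line {
  bool is_label;
  std::string text;    // label name, or instruction mnemonic
  std::string operand; // label referenced by a jump; empty otherwise
};

struct Resolved {
  std::string op;
  std::uint64_t target; // absolute address; only meaningful for jumps
};

std::vector<Resolved> resolve_labels(const std::vector<Line> &lines) {
  // Pass 1: a label names the address of the next instruction.
  // Addresses are counted in instructions here (an assumption).
  std::unordered_map<std::string, std::uint64_t> addresses;
  std::uint64_t next = 0;
  for (const Line &l : lines) {
    if (l.is_label)
      addresses[l.text] = next;
    else
      ++next;
  }

  // Pass 2: drop the labels and replace symbolic operands with addresses.
  std::vector<Resolved> out;
  for (const Line &l : lines) {
    if (l.is_label)
      continue;
    std::uint64_t target = 0;
    if (!l.operand.empty()) {
      auto it = addresses.find(l.operand);
      if (it == addresses.end())
        throw std::runtime_error("unknown label: " + l.operand);
      target = it->second;
    }
    out.push_back({l.text, target});
  }
  return out;
}
```

Doing the label pass first means forward references (jumping to a label defined later in the file) work without any special handling.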

DONE Allow relative addresses in jumps   ASM

As requested, a special syntax for relative address jumps. Sometimes it's a bit nicer than a label.
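A sketch of how a relative jump could be lowered, assuming (the note doesn't say) that the assembler turns it into an absolute address just as it does for labels. `relative_to_absolute` and its parameters are hypothetical names, and addresses are again assumed to be instruction indices.

```cpp
#include <cstdint>
#include <optional>

// Convert "jump `offset` instructions away from `current`" into an absolute
// address. Returns std::nullopt if the target would fall outside the program.
std::optional<std::uint64_t> relative_to_absolute(std::uint64_t current,
                                                  std::int64_t offset,
                                                  std::uint64_t program_size) {
  if (current >= program_size)
    return std::nullopt;
  if (offset < 0) {
    // Compute |offset| without overflowing on the most negative value.
    const std::uint64_t magnitude =
        static_cast<std::uint64_t>(-(offset + 1)) + 1;
    if (magnitude > current)
      return std::nullopt; // would jump before address 0
    return current - magnitude;
  }
  const std::uint64_t forward = static_cast<std::uint64_t>(offset);
  if (forward >= program_size - current)
    return std::nullopt; // would jump past the end of the program
  return current + forward;
}
```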

DONE Calling and returning control flow   ASM VM

When writing library code we won't know the addresses of where callers are jumping from. However, most library functions want to return control flow back to where the user had called them: we want the code to act almost like an inline function.

There are two ways I can think of achieving this:

  • Some extra syntax around labels (something like @inline <label>:) which tells the assembly processor to inline the label when a "jump" to that label is given

    • This requires no changes to the VM, which keeps it simple, but a major change to the assembler to be able to inline code. However, the work on writing a label system and relative addresses should provide some insight into how this could be possible.
  • A call stack and two new syntactic constructs call and ret which work like so:

    • When call <label> is encountered, the next program address is pushed onto the call stack and control flow is set to the label
    • During execution of the <label>, when a ret is encountered, pop an address off the call stack and set control flow to that address
    • This simulates the notion of "calling" and "returning from" a function in classical languages, but requires more machinery on the VM side.

2024-04-15: The latter option was chosen, though the former has been implemented through Constants.
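The call stack option can be sketched in a few lines. This is an illustration of the two rules above rather than the VM's real code (which is written in C); `MiniVM`, `pc` and `call_stack` are hypothetical names, and "the next program address" is taken to be `pc + 1`, i.e. addresses are assumed to count instructions.

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical, stripped-down VM state: a program counter and a call stack.
struct MiniVM {
  std::uint64_t pc = 0;
  std::vector<std::uint64_t> call_stack;

  // call <label>: remember where to come back to, then jump to the label.
  void call(std::uint64_t target) {
    call_stack.push_back(pc + 1);
    pc = target;
  }

  // ret: pop the saved address and continue from there.
  void ret() {
    if (call_stack.empty())
      throw std::runtime_error("ret with an empty call stack");
    pc = call_stack.back();
    call_stack.pop_back();
  }
};
```

Because the return address lives on its own stack, library code never needs to know who called it, and nested calls return in the right order for free.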

DONE Start points   ASM VM

In standard assembly you can write

  global _start
_start:
  ...

and that means the label _start is the point the program should start from. This means the user can define other code anywhere in the program and specify something similar to "main" in C programs.

Proposed syntax:

  init <label>

2024-04-15: Used the same syntax as standard assembly, with the conceit that multiple global directives may be present but only the last one has an effect.
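The "last one wins" rule is easy to express over a token stream. A minimal sketch, assuming the source has already been split into whitespace-separated tokens; `entry_label` is a hypothetical helper, not a function from the assembler.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Return the label named by the last `global <label>` pair in the token
// stream, or std::nullopt if the program never declares a start point.
std::optional<std::string> entry_label(const std::vector<std::string> &tokens) {
  std::optional<std::string> entry;
  for (std::size_t i = 0; i + 1 < tokens.size(); ++i) {
    if (tokens[i] == "global")
      entry = tokens[i + 1]; // later declarations overwrite earlier ones
  }
  return entry;
}
```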

DONE Constants

Essentially a directive which assigns some literal to a symbol as a constant. Something like

%const(n) 20 %end

Then, during my program I could use it like so

...
  push.word $n
  print.word

The preprocessor should convert this to the equivalent code of

...
  push.word 20
  print.word

2023-11-04: You could even put full program instructions for a constant potentially

%const(print-1)
  push.word 1
  print.word
%end

which when referred to (by $print-1) would insert the bytecode given inline.
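A sketch of this preprocessing pass over whitespace-separated tokens, so that `%const(n)`, `20` and `%end` arrive as three tokens. `expand_constants` is a hypothetical name and the error handling is an assumption; the real preprocessor may well differ.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Collect every `%const(name) ... %end` definition and replace each later
// `$name` reference with the tokens of the constant's body.
std::vector<std::string> expand_constants(const std::vector<std::string> &tokens) {
  const std::string open = "%const(";
  std::unordered_map<std::string, std::vector<std::string>> constants;
  std::vector<std::string> out;

  for (std::size_t i = 0; i < tokens.size(); ++i) {
    const std::string &tok = tokens[i];
    if (tok.size() > open.size() + 1 &&
        tok.compare(0, open.size(), open) == 0 && tok.back() == ')') {
      // Definition: take the name between the parentheses, then gather the
      // body tokens up to (not including) the closing %end.
      const std::string name =
          tok.substr(open.size(), tok.size() - open.size() - 1);
      std::vector<std::string> body;
      for (++i; i < tokens.size() && tokens[i] != "%end"; ++i)
        body.push_back(tokens[i]);
      if (i == tokens.size())
        throw std::runtime_error("%const without matching %end");
      constants[name] = body;
    } else if (tok.size() > 1 && tok[0] == '$') {
      // Reference: splice the stored body in place of the `$name` token.
      auto it = constants.find(tok.substr(1));
      if (it == constants.end())
        throw std::runtime_error("unknown constant: " + tok);
      out.insert(out.end(), it->second.begin(), it->second.end());
    } else {
      out.push_back(tok);
    }
  }
  return out;
}
```

Storing the body as a token list rather than a single literal is what lets a constant like `print-1` hold whole instructions.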

DONE Rigid endian   LIB

Say a program is compiled on a little endian machine. The resultant bytecode file, as a result of using C's internal functions, will use little endian.

This file, when distributed to other computers, will not work on those that use big endian.

This is a massive problem; I would like bytecode compiled on one computer to work on any other one. Therefore we have to enforce big endian. The refactor is limited to LIB, because the runtime only uses the convert_* functions to convert to and from byte buffers (usually read directly from the bytecode file, or from memory for use on the stack).

2024-04-09: Found the htobe*/htole* functions (and their be*toh/le*toh inverses) in endian.h, which convert shorts, half words and words between host order and a specific endianness in both directions. This will make the conversion super simple.

2024-04-15: Found it better to implement the functions myself as endian.h is not particularly portable.
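Hand-rolled conversions can avoid caring about the host's endianness at all by using shifts instead of reinterpreting memory. A sketch for a word, assuming a word is 64 bits; `word_to_bytes_be` and `bytes_to_word_be` are hypothetical names and not the library's actual convert_* functions.

```cpp
#include <cstddef>
#include <cstdint>

// Write a 64-bit word into 8 bytes, most significant byte first.
// Shifting works on values, not memory, so the result is identical
// on little and big endian hosts.
void word_to_bytes_be(std::uint64_t word, unsigned char *out) {
  for (std::size_t i = 0; i < 8; ++i)
    out[i] = static_cast<unsigned char>(word >> (8 * (7 - i)));
}

// Read 8 big endian bytes back into a 64-bit word.
std::uint64_t bytes_to_word_be(const unsigned char *in) {
  std::uint64_t word = 0;
  for (std::size_t i = 0; i < 8; ++i)
    word = (word << 8) | in[i];
  return word;
}
```

The same pattern with 2 or 4 bytes would cover the smaller types.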

DONE Import another file

Say I have two "asm" files: a.asm and b.asm.

a.asm:

  global main
main:
  push.word 1
  push.word 1
  push.word 1
  sub.word
  sub.word
  call b-println
  halt

b.asm:

b-println:
  print.word
  push.byte '\n'
  print.char
  ret

How would one assemble this? We've got two files, with a.asm depending on b.asm for the symbol b-println. It's obvious they need to be assembled "together" to make something that could work. A possible "correct" program would be having the file b.asm completely included into a.asm, such that compiling a.asm would lead to classical symbol resolution without much hassle. As a feature, this would be best placed in the preprocessor as symbol resolution occurs in the third stage of parsing (process_presults), whereas the preprocessor is always the first stage.

That would be a very simple way of solving the static vs dynamic linking problem: just include the files you actually need. Even the standard library would be fine and not require any additional work. Let's see how this would work.
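A rough sketch of whole-file inclusion as a preprocessing step. The directive spelling `%include <file>` is an assumption (the notes don't name one), as are `Sources` and `include_file`; files are held in an in-memory map here to keep the example self-contained, where the real thing would read from disk.

```cpp
#include <map>
#include <set>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for the file system: file name -> lines of source.
using Sources = std::map<std::string, std::vector<std::string>>;

// Append the lines of `name` to `out`, recursively splicing in any file
// named by a `%include <file>` line. Each file is included at most once,
// which also stops circular includes from looping forever.
void include_file(const Sources &sources, const std::string &name,
                  std::set<std::string> &seen, std::vector<std::string> &out) {
  if (!seen.insert(name).second)
    return; // already included
  auto it = sources.find(name);
  if (it == sources.end())
    throw std::runtime_error("cannot find file: " + name);

  const std::string directive = "%include ";
  for (const std::string &line : it->second) {
    if (line.compare(0, directive.size(), directive) == 0)
      include_file(sources, line.substr(directive.size()), seen, out);
    else
      out.push_back(line);
  }
}
```

After this pass the later stages see a single flat program, so b-println resolves like any other label.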

DONE Rewrite lexer

push.magic is a valid PUSH token according to the current lexer. I'd like to clamp down on this obvious error at the lexer itself, so the parser can be dedicated to just dealing with address resolution and conversion to opcodes.

How about an enum which represents the possible type of the operator?
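Something along these lines, as a sketch: the lexer accepts a typed operator only if its suffix maps to a known enum value. `OperandType`, `parse_operand_type` and `lex_push` are hypothetical names, and only the `byte` and `word` suffixes that appear in these notes are listed; any other types would need adding.

```cpp
#include <optional>
#include <string>

// Possible operand types of a typed operator such as push.<type>.
enum class OperandType { Byte, Word };

// Map a suffix to its type; unknown suffixes are rejected.
std::optional<OperandType> parse_operand_type(const std::string &suffix) {
  if (suffix == "byte")
    return OperandType::Byte;
  if (suffix == "word")
    return OperandType::Word;
  return std::nullopt;
}

// Lex a PUSH token: it must start with "push." and carry a valid type.
// "push.magic" therefore fails here instead of reaching the parser.
std::optional<OperandType> lex_push(const std::string &token) {
  const std::string prefix = "push.";
  if (token.compare(0, prefix.size(), prefix) != 0)
    return std::nullopt;
  return parse_operand_type(token.substr(prefix.size()));
}
```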