todo.org


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259

#+title: TODOs
#+author: Aryadev Chavali
#+date: 2023-11-02

* TODO Better documentation [0%] :DOC:
** TODO Comment coverage [0%]
*** WIP Lib [50%]
**** DONE lib/base.h
**** DONE lib/darr.h
**** TODO lib/heap.h
**** TODO lib/inst.h
*** TODO ASM [0%]
**** TODO asm/lexer.h
**** TODO asm/parser.h
*** TODO VM [0%]
**** TODO vm/runtime.h
** TODO Specification
* TODO Preprocessing directives :ASM:
Like in FASM or NASM where we can give certain helpful instructions to
the assembler.  I'd use the ~%~ symbol to designate preprocessor
directives.
** TODO Macros
Essentially constants expressions which take literal parameters
(i.e. tokens) and can use them throughout the body.  Something like
#+begin_src asm
%macro(name)(param1 param2 param3)
...
%end
#+end_src
Where each parameter is substituted in a call at preprocessing time.
A call should look something like this:
#+begin_src asm
  $name 1 2 3
#+end_src
and those tokens will be substituted literally in the macro body.
* TODO Write assembler in a different language :ASM:
While the runtime and base library needs to deal with only
binary, the assembler has to deal with string inputs and a larger
variety of bugs.  As the base library is written in C, and is all that
is necessary to write a program that targets the virtual machine, we
could realistically use another language to write the assembler in via
FFI with minimal pain.

Languages in the competition:
+ C++
+ Rust
+ Python
* TODO Introduce error handling in base library :LIB:
There is a large variety of TODOs about errors
* TODO Standard library :ASM:VM:
I should start considering this and how a user may use it.  Should it
be an option in the VM and/or assembler binaries (i.e. a flag) or
something the user has to specify in their source files?

Something to consider is /static/ and /dynamic/ "linking" i.e.:
+ Static linking: assembler inserts all used library definitions into
  the bytecode output directly
  + We could insert all of it at the start of the bytecode file, and
    with [[*Start points][Start points]] this won't interfere with
    user code
    + 2023-11-03: Finishing the Start point feature has made these
      features more tenable.  A program header which is compiled and
      interpreted in bytecode works wonders.
  + Furthermore library code will have fixed program addresses (always
    at the start) so we'll know at start of assembler runtime where to
    resolve standard library subroutine calls
  + Virtual machine needs no changes to do this
+ Dynamic linking: virtual machine has fixed program storage for
  library code (a ROM), and assembler makes jump references
  specifically for this program storage
  + When assembling subroutine calls, just need to put references to
    this library storage (some kind of shared state between VM and
    assembler to know what these references are)
  + VM needs to manage a ROM of some kind for library code
  + How do we ensure assembled links to subroutine calls don't
    conflict with user code jumps?
    + Possibility: most significant bit of a program address is
      reserved such that if 0 it refers to user code and if 1 it
      refers to library code
    + 63 bit references user code (not a lot of loss in precision)
    + Easy to check if a reference is a library reference or a user
      code reference by checking "sign bit" (negativity)
** TODO Dynamic Linking
The address operand of every program control instruction (~CALL~,
~JUMP~, ~JUMP.IF~) has a specific encoding if the standard library is
dynamically linked:
+ If the most significant bit is 0, the remaining 63 bits encode an
  absolute address within the program
+ Otherwise, the address encodes a standard library subroutine.  The
  bits within the address follow this schema:
  + The next 15 bits (7 from the most significant byte, then 8 from
    the next byte) represent the specific module where the subroutine
    is defined (over 32767 possible library values)
  + The remaining 48 bits (6 bytes) encode the absolute program
    address in the bytecode of that specific module for the start of
    the subroutine (over 281 *trillion* values)

The assembler will automatically encode this based on "%USE" calls and
the name of the subroutines called.  On the virtual machine, there is
a storage location (similar to the ROM of real machines) which stores
the bytecode for modules of the standard library, indexed by the
module number.  This means, on deserialising the address into the
proper components, the VM can refer to the module bytecode then jump
to the correct address.

2023-11-09: I'll need a way to run library code in the current program
system in the runtime.  It currently doesn't support jumps or work in
programs outside of the main one unfortunately.  Any proper work done
in this area requires some proper refactoring.

2023-11-09: Constants or inline macros need to be reconfigured for
this to work: at parse time, we work out the inlines directly which
means compiling bytecode with "standard library" macros will not work
as they won't be in the token stream.  Either we don't allow
preprocessor work in the standard library at all (which is bad cos we
can't then set standard limits or other useful things) or we insert
them into the registries at parse time for use in program parsing
(which not only requires assembler refactoring to figure out what
libraries are used (to pull definitions from) but also requires making
macros "recognisable" in bytecode because they're essentially
invisible).
* Completed
** DONE Write a label/jump system :ASM:
Essentially a user should be able to write arbitrary labels (maybe
through ~label x~ or ~x:~ syntax) which can be referred to by ~jump~.

It'll purely be on the assembler side as a processing step, where the
emitted bytecode purely refers to absolute addresses; the VM should
just be dealing with absolute addresses here.
** DONE Allow relative addresses in jumps :ASM:
As requested, a special syntax for relative address jumps.  Sometimes
it's a bit nicer than a label.
** DONE Calling and returning control flow :VM: :ASM:
When writing library code we won't know the addresses of where
callers are jumping from.  However, most library functions want to
return control flow back to where the user had called them: we want
the code to act almost like an inline function.

There are two ways I can think of achieving this:
+ Some extra syntax around labels (something like ~@inline <label>:~)
  which tells the assembly processor to inline the label when a "jump"
  to that label is given
  + This requires no changes to the VM, which keeps it simple, but a
    major change to the assembler to be able to inline code.  However,
    the work on writing a label system and relative addresses should
    provide some insight into how this could be possible.
+ A /call stack/ and two new syntactic constructs ~call~ and ~ret~
  which work like so:
  + When ~call <label>~ is encountered, the next program address is
    pushed onto the call stack and control flow is set to the label
  + During execution of the ~<label>~, when a ~ret~ is encountered,
    pop an address off the call stack and set control flow to that
    address
  + This simulates the notion of "calling" and "returning from" a
    function in classical languages, but requires more machinery on
    the VM side.
** DONE Start points :ASM:VM:
You know how in standard assembly you can write
#+begin_src asm
  global _start
_start:
  ...
#+end_src
and that means the label ~_start~ is the point the program should
start from.  This means the user can define other code anywhere in the
program and specify something similar to "main" in C programs.

Proposed syntax:
#+begin_src asm
  init <label>
#+end_src
** DONE Constants
Essentially a directive which assigns some literal to a symbol as a
constant.  Something like
#+begin_src asm
%const(n) 20 %end
#+end_src

Then, during my program I could use it like so
#+begin_src asm
...
  push.word $n
  print.word
#+end_src

The preprocessor should convert this to the equivalent code of
#+begin_src asm
...
  push.word 20
  print.word
#+end_src

2023-11-04: You could even put full program instructions for a
constant potentially
#+begin_src asm
%const(print-1)
  push.word 1
  print.word
%end
#+end_src
which when referred to (by ~$print-1~) would insert the bytecode given
inline.
** DONE Rigid endian :LIB:
Say a program is compiled on a little endian machine.  The resultant
bytecode file, as a result of using C's internal functions, will use
little endian.

This file, when distributed to other computers, will not work on those
that use big endian.

This is a massive problem; I would like bytecode compiled on one
computer to work on any other one.  Therefore we have to enforce big
endian.  This refactor is limited to only LIB as a result of only the
~convert_*~ functions being used in the runtime to convert between
byte buffers (usually read from the bytecode file directly or from
memory to use in the stack).

2024-04-09: Found the ~hto_e~ functions under =endian.h= that provide
both way host to specific endian conversion of shorts, half words and
words.  This will make it super simple to just convert.
** DONE Import another file
Say I have two "asm" files: /a.asm/ and /b.asm/.

#+CAPTION: a.asm
#+begin_src asm
  global main
main:
  push.word 1
  push.word 1
  push.word 1
  sub.word
  sub.word
  call b-println
  halt
#+end_src

#+CAPTION: b.asm
#+begin_src asm
b-println:
  print.word
  push.byte '\n'
  print.char
  ret
#+end_src

How would one assemble this?  We've got two files, with /a.asm/
depending on /b.asm/ for the symbol ~b-println~.  It's obvious they
need to be assembled "together" to make something that could work.  A
possible "correct" program would be having the file /b.asm/ completely
included into /a.asm/, such that compiling /a.asm/ would lead to
classical symbol resolution without much hassle.  As a feature, this
would be best placed in the preprocessor as symbol resolution occurs
in the third stage of parsing (~process_presults~), whereas the
preprocessor is always the first stage.

That would be a very simple way of solving the static vs dynamic
linking problem: just include the files you actually need.  Even the
standard library would be fine and not require any additional work.
Let's see how this would work.