Update TODOs
todo.org
@@ -19,18 +19,15 @@ Languages in the competition:
2024-04-14: Chose C++ because it will require the least effort to
rewrite the currently existing codebase while still leveraging some
less efficient but incredibly useful features.
* TODO Rewrite lexer
~push.magic~ is a valid PUSH token according to the current lexer.
I'd like to clamp down on this obvious error at the lexer itself, so
the parser can be dedicated to just dealing with address resolution
and conversion to opcodes.

How about an enum which represents the possible type of the operator?
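
A minimal sketch of that enum (member names are illustrative, based
only on operators mentioned in this file):
#+begin_src cpp
// The lexer only accepts operator suffix pairings it knows about, so
// `push.magic` fails at lexing instead of reaching the parser.
enum class OpType
{
  PUSH_BYTE,  // push.byte
  PRINT_BYTE, // print.byte
  // ... one member per valid operator/suffix pairing
};
#+end_src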
** DONE Write Lexer
** WIP Write Preprocessor
** TODO Write parser
* TODO Better documentation [0%] :DOC:
** TODO Comment coverage [0%]
*** TODO ASM [0%]
**** TODO asm/lexer.h
**** TODO asm/parser.h
** TODO Write a specification
* TODO Preprocessing directives :ASM:
Like in FASM or NASM where we can give certain helpful instructions to
the assembler. I'd use the ~%~ symbol to designate preprocessor
@@ -49,159 +46,6 @@ A call should look something like this:
$name 1 2 3
#+end_src
and those tokens will be substituted literally in the macro body.
* TODO Rewrite preprocessor to create a custom unit instead of token streams
** Problem
A problem that occurs in the preprocessor is token column and line
counts. Say =a.asm= has ~%use "b.asm"~. The tokens from the =b.asm=
file are inserted into =a.asm='s token stream, but the line/column
counts from there aren't properly set in =a.asm=.

A naive solution would be to just recount the lines and columns, but
this removes information about where those tokens came from. Say an
error occurs in some of =b.asm='s code: I would like to be able to
report it.

Therefore, we can no longer just generate new token streams from the
preprocessor and should instead look at making more complex
abstractions.

A problem this could also solve is nested errors and recursive
constants. Say I have some assembly like so:
#+begin_src asm
%const limit 20 %end
%const print-limit
...
push.byte $limit
print.byte
...
%end
#+end_src

A call to ~print-limit~ under the current system would insert the
tokens for print-limit but completely forget about ~push.byte $limit~,
which would cause a parsing error. (This could be fixed under the
current system by allowing reference resolution inside of const
blocks, with the caveat that it would be hard to stop infinite
recursion.)
** Language model
The model I have in mind is that all constructs in this meta language
(the preprocessing language) are either singular tokens or collections
of tokens/constructs in a recursive sense. This naturally follows
from the fact that a single pass isn't enough to properly parse this
language: there must be some recursive nature which forces the
language to take multiple passes to completely generate a stream that
can be parsed.

This vague notion can be formalised like so. A preprocessing unit is
either a singular token or a named collection of units. The former
represents your standard symbols and literals while the latter
represents ~%const~ and ~%use~ calls, where there is a clear name
associated with a collection of one or more tokens (for ~%const~ it's
the constant's name and for ~%use~ it's the filename). We'll
distinguish this as well.

#+begin_src text
Token = PP_USE | PP_CONST | String(Content) | Symbol(Content) | PUSH(Content) | ...
Type  = File(String) | Constant(Symbol)
Unit  = Token | Container(Type . Vector[Unit])
#+end_src

Through this model our initial stream of tokens can be considered
units. We can already see that this model may solve our original
problem: with named containers it doesn't matter that certain tokens
are from different parts of the file or different files as they are
distinctly typed from the general set of tokens, with a name which
states where they're from.
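
A rough C++17 encoding of this grammar (every name here is a sketch,
not the repository's actual code):
#+begin_src cpp
#include <string>
#include <variant>
#include <vector>

// Unit = Token | Container(Type . Vector[Unit])
struct Token
{
  enum class Kind { PP_USE, PP_CONST, PP_END, PP_REF, STRING, SYMBOL /* ... */ };
  Kind kind;
  std::string content;
};

struct File     { std::string name;   }; // produced by %use "name"
struct Constant { std::string symbol; }; // produced by %const symbol ... %end

struct Container; // forward declaration to allow the recursion
using Unit = std::variant<Token, Container>;

struct Container
{
  std::variant<File, Constant> type; // Type = File(String) | Constant(Symbol)
  std::vector<Unit> units;           // C++17 permits the incomplete element
                                     // type at this point of declaration
};
#+end_src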
** Processing
We need this model to have a notion of "processing" though, otherwise
it's quite useless. A processing function is simply a function which
takes a unit and returns another unit. We currently have two
processing functions we can consider: ~process_const~ and
~process_use~.

~process_use~ takes a vector of tokens and, upon encountering PP_USE,
accepts the next token (a string) and tokenises the file with that
name. Within our model we'd make the stream of tokens created from
opening the file a /container/.

~process_const~ takes a vector of tokens and does two things in an
iteration:
1) upon encountering PP_CONST, accepts the next n tokens until PP_END
   is encountered, with the first token being a symbol. This is
   registered in a map of constants (~CONSTS~) where the symbol is the
   key and the value associated is the n - 1 tokens accepted
2) upon encountering a PP_REFERENCE, reads the symbol ~S~ associated
   with it and replaces the reference with ~CONSTS[S]~ (if S is in
   CONSTS).

One thing to note is that both of these definitions are easily
extensible to the general definition of units: if a unit is a
container of some kind we can recur through its vector of units to
resolve any further "calls". For ~process_const~ a call is ~%const~
or ~$ref~, while for ~process_use~ it's ~%use~.
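
Reusing the ~Unit~ sketch above, the recursion both functions share
could be factored as a generic bottom-up walk (again illustrative
only):
#+begin_src cpp
#include <functional>
#include <utility>
#include <variant>

// Apply `f` to every unit, recurring into containers first so nested
// "calls" (e.g. a $ref inside a %const body) get resolved as well.
Unit walk(Unit u, const std::function<Unit(Unit)> &f)
{
  if (auto *c = std::get_if<Container>(&u))
    for (auto &child : c->units)
      child = walk(child, f);
  return f(std::move(u));
}
#+end_src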
** History/versioning
One additional facet I'd like to add to this model is "history". Each
unit is actually a list (or a singly linked tree where each parent has
at most one child) of sub-units, where the top of the list represents
the current version. Each descendant is a previous version of the
unit.

Say I do some processing on an element =a= (at index =i=) of the unit
vector =V= such that it becomes a new "unit", call it =b=. Then we
update =V= by =V[i] = cons(b, a)=. Through this, the list acts as a
history of the processing that has occurred on the unit. This
provides an ability to trace the path of preprocessing to an eventual
conclusion.

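Reusing the ~Unit~ sketch from earlier, a unit's history could be a
cons cell chain; ~History~ and ~cons~ below are hypothetical names:
#+begin_src cpp
#include <memory>
#include <utility>

// Newest version first; each cell points at the previous version and
// prev == nullptr marks the original, unprocessed unit.
struct History
{
  Unit current;
  std::shared_ptr<const History> prev;
};

// V[i] = cons(b, a): push the new version b on top of the history a.
std::shared_ptr<const History> cons(Unit b, std::shared_ptr<const History> a)
{
  return std::make_shared<const History>(History{std::move(b), std::move(a)});
}
#+end_src
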
Processing occurs on a unit until it cannot be done further, i.e. when
there are no more "calls" in the tree to resolve. The history list
then provides all the versions of a unit up to its resolved form.

To see what a unit with history may look like (where symbols are
terminals, i.e. completely resolved):
+ Container('limit' . [a Container("b" . d e f) c])
+ Container('limit' . [a '$b' c])
+ Token(PP_REF('$limit'))

This shows resolution of the unit reference ~$limit~, which in turn
leads to the resolution of ~$b~, which is a sub-unit.

There are two ways indefinite resolution can occur, one per method of
processing: for ~process_use~ it is two files calling ~%use~ on each
other, and for ~process_const~ it is a ~%const~ referencing itself.
We can just disallow these through analysis, as sketched below.
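
A minimal sketch of that analysis for the ~%use~ case (the
~check_use~ helper and its signature are hypothetical, not this
repository's code):
#+begin_src cpp
#include <set>
#include <stdexcept>
#include <string>

// Track the files on the current %use inclusion path; revisiting a
// file already on the path means two files %use each other (a cycle).
void check_use(const std::string &file, std::set<std::string> &path)
{
  if (!path.insert(file).second)
    throw std::runtime_error("%use cycle detected at " + file);
  // ... tokenise `file` and recurse into each %use it contains ...
  path.erase(file); // leaving the file: remove it from the path
}
#+end_src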
** Pseudocode
#+begin_src text
process_use(V: Vector[Unit]) ->
  [cons((if v is Token(PP_USE) and next(v) is Token(String(S))
           -> Container(File(S) . tokenise(open(S)))
         else if v is Container(name . units)
           -> Container(name . process_use(units))
         else
           -> v),
        v_x)
   for v_x in V, where v = v_x[0]]

CONSTS = {}
process_const(V: Vector[Unit]) ->
  [cons((if v is Token(PP_CONST) and next(v) is Token(Symbol(S))
           do {
             i := find(Token(PP_END), V[v:])
             CONSTS[S] = V[next(v):prev(i)]
             -> Container(Constant(S) . CONSTS[S])
           }
         else if v is Token(PP_REF(S))
           -> CONSTS[S]
         else if v is Container(name . units)
           -> Container(name . process_const(units))
         else
           -> v),
        v_x)
   for v_x in V, where v = v_x[0]]
#+end_src
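
As a sanity check of the pseudocode, here is a rough C++ rendering of
~process_use~ over the earlier ~Unit~ sketch (history bookkeeping is
omitted and ~tokenise~ is an assumed helper, not this repository's
API):
#+begin_src cpp
#include <fstream>
#include <vector>

std::vector<Unit> tokenise(std::ifstream &); // assumed: lexes a stream into units

// A PP_USE token followed by a string becomes a File container
// holding the named file's tokens; containers are recursed into so
// nested %use calls are resolved too.
std::vector<Unit> process_use(const std::vector<Unit> &V)
{
  std::vector<Unit> out;
  for (size_t i = 0; i < V.size(); ++i)
  {
    const auto *tok = std::get_if<Token>(&V[i]);
    if (tok && tok->kind == Token::Kind::PP_USE && i + 1 < V.size())
    {
      const auto *name = std::get_if<Token>(&V[i + 1]);
      if (name && name->kind == Token::Kind::STRING)
      {
        std::ifstream f{name->content};
        out.push_back(Container{File{name->content}, tokenise(f)});
        ++i; // the filename token is consumed as well
        continue;
      }
    }
    if (const auto *c = std::get_if<Container>(&V[i]))
    {
      out.push_back(Container{c->type, process_use(c->units)});
      continue;
    }
    out.push_back(V[i]);
  }
  return out;
}
#+end_src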
* TODO Write a specification for the assembly language
In particular the preprocessor macro language and the direct relation
between opcodes and parse units in the assembler.
* Completed
** DONE Write a label/jump system :ASM:
Essentially a user should be able to write arbitrary labels (maybe
@@ -350,3 +194,10 @@ That would be a very simple way of solving the static vs dynamic
linking problem: just include the files you actually need. Even the
standard library would be fine and not require any additional work.
Let's see how this would work.
** DONE Rewrite lexer
~push.magic~ is a valid PUSH token according to the current lexer.
I'd like to clamp down on this obvious error at the lexer itself, so
the parser can be dedicated to just dealing with address resolution
and conversion to opcodes.

How about an enum which represents the possible type of the operator?