Wrote up some notes on how preprocesser language may work
Bit formal and really excessively written but I needed my thoughts down.
This commit is contained in:
152
todo.org
152
todo.org
@@ -1,6 +1,7 @@
|
|||||||
#+title: TODOs
|
#+title: TODOs
|
||||||
#+author: Aryadev Chavali
|
#+author: Aryadev Chavali
|
||||||
#+date: 2023-11-02
|
#+date: 2023-11-02
|
||||||
|
#+startup: noindent
|
||||||
|
|
||||||
* TODO Better documentation [0%] :DOC:
|
* TODO Better documentation [0%] :DOC:
|
||||||
** TODO Comment coverage [0%]
|
** TODO Comment coverage [0%]
|
||||||
@@ -49,9 +50,158 @@ Languages in the competition:
|
|||||||
2024-04-14: Chose C++ cos it will require the least effort to rewrite
|
2024-04-14: Chose C++ cos it will require the least effort to rewrite
|
||||||
the currently existing codebase while still leveraging some less
|
the currently existing codebase while still leveraging some less
|
||||||
efficient but incredibly useful features.
|
efficient but incredibly useful features.
|
||||||
|
* TODO Rewrite preprocesser to create a custom unit instead of token streams
|
||||||
|
** Problem
|
||||||
|
A problem that occurs in the preprocessor is token column and line
|
||||||
|
count. Say =a.asm= has ~%use "b.asm"~. The tokens from the =b.asm=
|
||||||
|
file are inserted into =a.asm='s token stream, but the line/column
|
||||||
|
count from there isn't properly set in =a.asm=.
|
||||||
|
|
||||||
|
A naive solution would be to just recount the lines and columns, but
|
||||||
|
this removes information about where those tokens came from. Say an
|
||||||
|
error occurs in some of =b.asm='s code: I would like to be able to
|
||||||
|
report them.
|
||||||
|
|
||||||
|
Therefore, we can no longer just generate new token streams from the
|
||||||
|
preprocesser and should instead look at making more complex
|
||||||
|
abstractions.
|
||||||
|
|
||||||
|
A problem this could also solve is nested errors and recursive
|
||||||
|
constants. Say I have some assembly like so
|
||||||
|
#+begin_src asm
|
||||||
|
%const limit 20 %end
|
||||||
|
%const print-limit
|
||||||
|
...
|
||||||
|
push.byte $limit
|
||||||
|
print.byte
|
||||||
|
...
|
||||||
|
%end
|
||||||
|
#+end_src
|
||||||
|
|
||||||
|
A call to ~print-limit~ under the current system would insert the
|
||||||
|
tokens for print-limit but completely forget about ~push.byte $limit~
|
||||||
|
which would cause a parsing error. (This could be fixed under the
|
||||||
|
current system by allowing reference resolution inside of const
|
||||||
|
blocks, with the conceit that it would be hard to stop infinite recursion)
|
||||||
|
** Language model
|
||||||
|
The model I have in mind is that all constructs in this meta language
|
||||||
|
(the preprocessing language) are either singular tokens or collections
|
||||||
|
of tokens/constructs in a recursive sense. This naturally follows
|
||||||
|
from the fact that a single pass isn't enough to properly parse this
|
||||||
|
language: there must be some recursive nature which forces the
|
||||||
|
language to take multiple passes to completely generate a stream that
|
||||||
|
can be parsed.
|
||||||
|
|
||||||
|
This vague notion can be formalised like so. A preprocessing unit is
|
||||||
|
either a singular token or a named collection of units. The former
|
||||||
|
represents your standard symbols and literals while the later
|
||||||
|
represents ~%const~ and ~%use~ calls where there is a clear name
|
||||||
|
associated to a collection of one or more tokens (in the case of the
|
||||||
|
former it's the constant's name and the latter it's the filename).
|
||||||
|
We'll distinguish this as well.
|
||||||
|
|
||||||
|
#+begin_src text
|
||||||
|
Token = PP_USE | PP_CONST | String(Content) | Symbol(Content) | PUSH(Content) | ...
|
||||||
|
Type = File(String) | Constant(Symbol)
|
||||||
|
Unit = Token | Container(Type . Vector[Unit])
|
||||||
|
#+end_src
|
||||||
|
|
||||||
|
Through this model our initial stream of tokens can be considered
|
||||||
|
units. We can already see that this model may solve our original
|
||||||
|
problem: with named containers it doesn't matter that certain tokens
|
||||||
|
are from different parts of the file or different files as they are
|
||||||
|
distinctly typed from the general set of tokens, with a name which
|
||||||
|
states where they're from.
|
||||||
|
** Processing
|
||||||
|
We need this model to have a notion of "processing" though, otherwise
|
||||||
|
it's quite useless. A processing function is simply a function which
|
||||||
|
takes a unit and returns another unit. We currently have two
|
||||||
|
processing functions we can consider: ~process_const~ and
|
||||||
|
~process_use~.
|
||||||
|
|
||||||
|
~process_use~ takes a vector of tokens and, upon encountering PP_USE
|
||||||
|
accepts the next token (a string) and tokenises the file
|
||||||
|
with that name. Within our model we'd make the stream of tokens
|
||||||
|
created from opening the file a /container/.
|
||||||
|
|
||||||
|
~process_const~ takes a vector of tokens and does two things in an
|
||||||
|
iteration:
|
||||||
|
1) upon encountering PP_CONST accepts the next n tokens till PP_END is
|
||||||
|
encountered, with the first token being a symbol. This is
|
||||||
|
registered in a map of constants (~CONSTS~) where the symbol is the
|
||||||
|
key and the value associated is the n - 1 tokens accepted
|
||||||
|
2) upon encountering a PP_REFERENCE reads the content associated with
|
||||||
|
it (considered a symbol ~S~) and replaces it ~CONSTS[S]~ (if S is
|
||||||
|
in CONSTS).
|
||||||
|
|
||||||
|
One thing to note is that both of these definitions are easily
|
||||||
|
extensible to the general definition of units: if a unit is a
|
||||||
|
container of some kind we can recur through its vector of units to
|
||||||
|
resolve any further "calls". For ~process_const~ it's ~%const~ or
|
||||||
|
~$ref~ while for ~process_use~ it's ~%use~.
|
||||||
|
** History/versioning
|
||||||
|
One additional facet to this model I'd like to add is "history". Each
|
||||||
|
unit is actually a list (or a singly linked tree where each parent has
|
||||||
|
at most one child) of sub-units where the top of the list represents
|
||||||
|
the current version. Each descendant is a previous version of the
|
||||||
|
token.
|
||||||
|
|
||||||
|
Say I do some processing on an element of the unit list =a= (with
|
||||||
|
index =i=) such that it becomes a new "unit", call it =b=. Then we
|
||||||
|
update V by =V[i] = cons(b, a)=. Through this, the lists acts as a
|
||||||
|
history of processing that has occurred on the unit. This provides an
|
||||||
|
ability to trace the path of preprocessing to an eventual conclusion.
|
||||||
|
|
||||||
|
Processing occurs on a unit until it cannot be done further i.e. when
|
||||||
|
there are no more "calls" in the tree to resolve. The history list
|
||||||
|
provides all the versions of a unit till its resolved form.
|
||||||
|
|
||||||
|
To see what a unit with history may look like (where symbols are
|
||||||
|
terminals i.e. completely resolved):
|
||||||
|
+ Container('limit' . [a Container("b" . d e f) c])
|
||||||
|
+ Container('limit' . [a '$b' c])
|
||||||
|
+ Token(PP_REF('$limit'))
|
||||||
|
|
||||||
|
This shows resolution of the unit reference ~$limit~, which in turn
|
||||||
|
leads to the resolution of ~$b~ which is a sub-unit.
|
||||||
|
|
||||||
|
There are two ways of indefinite resolution, one per method of
|
||||||
|
processing. For ~process_use~ it is two files calling ~%use~ on each
|
||||||
|
other and for ~process_const~ it is a ~%const~ calling itself. We can
|
||||||
|
just disallow it through analysis.
|
||||||
|
** Pseudocode
|
||||||
|
#+begin_src text
|
||||||
|
process_use(V: Vector[Unit]) ->
|
||||||
|
[cons((if v is Token(PP_USE) and next(v) is Token(String(S))
|
||||||
|
-> Container(File(S) . tokenise(open(v')))
|
||||||
|
else if v is Container(name . units)
|
||||||
|
-> Container(name . process_use(units))
|
||||||
|
else
|
||||||
|
-> v),
|
||||||
|
v_x)
|
||||||
|
v = v_x[0]
|
||||||
|
for v_x in V]
|
||||||
|
|
||||||
|
CONSTS={}
|
||||||
|
process_const(V: Vector[Unit]) ->
|
||||||
|
[cons((if v is Token(PP_CONST) and next(v) is Token(Symbol(S))
|
||||||
|
do {
|
||||||
|
i := find(Token(PP_END), V[v:])
|
||||||
|
CONSTS[S] = V[next(v):prev(i)]
|
||||||
|
-> Container(Constant(S) . CONSTS[S])
|
||||||
|
}
|
||||||
|
else if v is Token(PP_REF(S))
|
||||||
|
-> CONSTS[S]
|
||||||
|
else if v is Container(name . units)
|
||||||
|
-> Container(name . process_const(units))
|
||||||
|
else
|
||||||
|
-> v)
|
||||||
|
v_x)
|
||||||
|
v = v_x[0]
|
||||||
|
for v_x in V]
|
||||||
|
#+end_src
|
||||||
* TODO Introduce error handling in base library :LIB:
|
* TODO Introduce error handling in base library :LIB:
|
||||||
There is a large variety of TODOs about errors. Let's fix them!
|
There is a large variety of TODOs about errors. Let's fix them!
|
||||||
|
|
||||||
8 TODOs currently present.
|
8 TODOs currently present.
|
||||||
* TODO Standard library :ASM:VM:
|
* TODO Standard library :ASM:VM:
|
||||||
I should start considering this and how a user may use it. Should it
|
I should start considering this and how a user may use it. Should it
|
||||||
|
|||||||
Reference in New Issue
Block a user