aboutsummaryrefslogtreecommitdiff
diff options
context:
space:
mode:
authorAryadev Chavali <aryadev@aryadevchavali.com>2024-04-16 15:42:34 +0630
committerAryadev Chavali <aryadev@aryadevchavali.com>2024-04-16 15:42:34 +0630
commitd5c43b1c3f77e40201731ff298ce73416951ce72 (patch)
treebd2c6dfdd1a4adb53568468a829f636555786ffe
parent715facf0152ad98d7bc05c295cf8e527e45cb592 (diff)
downloadovm-d5c43b1c3f77e40201731ff298ce73416951ce72.tar.gz
ovm-d5c43b1c3f77e40201731ff298ce73416951ce72.tar.bz2
ovm-d5c43b1c3f77e40201731ff298ce73416951ce72.zip
Wrote up some notes on how preprocesser language may work
Bit formal and really excessively written but I needed my thoughts down.
-rw-r--r--todo.org152
1 files changed, 151 insertions, 1 deletions
diff --git a/todo.org b/todo.org
index 1c09ae0..4dc64e2 100644
--- a/todo.org
+++ b/todo.org
@@ -1,6 +1,7 @@
#+title: TODOs
#+author: Aryadev Chavali
#+date: 2023-11-02
+#+startup: noindent
* TODO Better documentation [0%] :DOC:
** TODO Comment coverage [0%]
@@ -49,9 +50,158 @@ Languages in the competition:
2024-04-14: Chose C++ cos it will require the least effort to rewrite
the currently existing codebase while still leveraging some less
efficient but incredibly useful features.
+* TODO Rewrite preprocesser to create a custom unit instead of token streams
+** Problem
+A problem that occurs in the preprocessor is token column and line
+count. Say =a.asm= has ~%use "b.asm"~. The tokens from the =b.asm=
+file are inserted into =a.asm='s token stream, but the line/column
+count from there isn't properly set in =a.asm=.
+
+A naive solution would be to just recount the lines and columns, but
+this removes information about where those tokens came from. Say an
+error occurs in some of =b.asm='s code: I would like to be able to
+report them.
+
+Therefore, we can no longer just generate new token streams from the
+preprocesser and should instead look at making more complex
+abstractions.
+
+A problem this could also solve is nested errors and recursive
+constants. Say I have some assembly like so
+#+begin_src asm
+ %const limit 20 %end
+ %const print-limit
+ ...
+ push.byte $limit
+ print.byte
+ ...
+ %end
+#+end_src
+
+A call to ~print-limit~ under the current system would insert the
+tokens for print-limit but completely forget about ~push.byte $limit~
+which would cause a parsing error. (This could be fixed under the
+current system by allowing reference resolution inside of const
+blocks, with the conceit that it would be hard to stop infinite recursion)
+** Language model
+The model I have in mind is that all constructs in this meta language
+(the preprocessing language) are either singular tokens or collections
+of tokens/constructs in a recursive sense. This naturally follows
+from the fact that a single pass isn't enough to properly parse this
+language: there must be some recursive nature which forces the
+language to take multiple passes to completely generate a stream that
+can be parsed.
+
+This vague notion can be formalised like so. A preprocessing unit is
+either a singular token or a named collection of units. The former
+represents your standard symbols and literals while the later
+represents ~%const~ and ~%use~ calls where there is a clear name
+associated to a collection of one or more tokens (in the case of the
+former it's the constant's name and the latter it's the filename).
+We'll distinguish this as well.
+
+#+begin_src text
+Token = PP_USE | PP_CONST | String(Content) | Symbol(Content) | PUSH(Content) | ...
+Type = File(String) | Constant(Symbol)
+Unit = Token | Container(Type . Vector[Unit])
+#+end_src
+
+Through this model our initial stream of tokens can be considered
+units. We can already see that this model may solve our original
+problem: with named containers it doesn't matter that certain tokens
+are from different parts of the file or different files as they are
+distinctly typed from the general set of tokens, with a name which
+states where they're from.
+** Processing
+We need this model to have a notion of "processing" though, otherwise
+it's quite useless. A processing function is simply a function which
+takes a unit and returns another unit. We currently have two
+processing functions we can consider: ~process_const~ and
+~process_use~.
+
+~process_use~ takes a vector of tokens and, upon encountering PP_USE
+accepts the next token (a string) and tokenises the file
+with that name. Within our model we'd make the stream of tokens
+created from opening the file a /container/.
+
+~process_const~ takes a vector of tokens and does two things in an
+iteration:
+1) upon encountering PP_CONST accepts the next n tokens till PP_END is
+ encountered, with the first token being a symbol. This is
+ registered in a map of constants (~CONSTS~) where the symbol is the
+ key and the value associated is the n - 1 tokens accepted
+2) upon encountering a PP_REFERENCE reads the content associated with
+ it (considered a symbol ~S~) and replaces it ~CONSTS[S]~ (if S is
+ in CONSTS).
+
+One thing to note is that both of these definitions are easily
+extensible to the general definition of units: if a unit is a
+container of some kind we can recur through its vector of units to
+resolve any further "calls". For ~process_const~ it's ~%const~ or
+~$ref~ while for ~process_use~ it's ~%use~.
+** History/versioning
+One additional facet to this model I'd like to add is "history". Each
+unit is actually a list (or a singly linked tree where each parent has
+at most one child) of sub-units where the top of the list represents
+the current version. Each descendant is a previous version of the
+token.
+
+Say I do some processing on an element of the unit list =a= (with
+index =i=) such that it becomes a new "unit", call it =b=. Then we
+update V by =V[i] = cons(b, a)=. Through this, the lists acts as a
+history of processing that has occurred on the unit. This provides an
+ability to trace the path of preprocessing to an eventual conclusion.
+
+Processing occurs on a unit until it cannot be done further i.e. when
+there are no more "calls" in the tree to resolve. The history list
+provides all the versions of a unit till its resolved form.
+
+To see what a unit with history may look like (where symbols are
+terminals i.e. completely resolved):
++ Container('limit' . [a Container("b" . d e f) c])
+ + Container('limit' . [a '$b' c])
+ + Token(PP_REF('$limit'))
+
+This shows resolution of the unit reference ~$limit~, which in turn
+leads to the resolution of ~$b~ which is a sub-unit.
+
+There are two ways of indefinite resolution, one per method of
+processing. For ~process_use~ it is two files calling ~%use~ on each
+other and for ~process_const~ it is a ~%const~ calling itself. We can
+just disallow it through analysis.
+** Pseudocode
+#+begin_src text
+process_use(V: Vector[Unit]) ->
+ [cons((if v is Token(PP_USE) and next(v) is Token(String(S))
+ -> Container(File(S) . tokenise(open(v')))
+ else if v is Container(name . units)
+ -> Container(name . process_use(units))
+ else
+ -> v),
+ v_x)
+ v = v_x[0]
+ for v_x in V]
+
+CONSTS={}
+process_const(V: Vector[Unit]) ->
+ [cons((if v is Token(PP_CONST) and next(v) is Token(Symbol(S))
+ do {
+ i := find(Token(PP_END), V[v:])
+ CONSTS[S] = V[next(v):prev(i)]
+ -> Container(Constant(S) . CONSTS[S])
+ }
+ else if v is Token(PP_REF(S))
+ -> CONSTS[S]
+ else if v is Container(name . units)
+ -> Container(name . process_const(units))
+ else
+ -> v)
+ v_x)
+ v = v_x[0]
+ for v_x in V]
+#+end_src
* TODO Introduce error handling in base library :LIB:
There is a large variety of TODOs about errors. Let's fix them!
-
8 TODOs currently present.
* TODO Standard library :ASM:VM:
I should start considering this and how a user may use it. Should it