From d5c43b1c3f77e40201731ff298ce73416951ce72 Mon Sep 17 00:00:00 2001 From: Aryadev Chavali Date: Tue, 16 Apr 2024 15:42:34 +0630 Subject: Wrote up some notes on how preprocesser language may work Bit formal and really excessively written but I needed my thoughts down. --- todo.org | 152 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 151 insertions(+), 1 deletion(-) diff --git a/todo.org b/todo.org index 1c09ae0..4dc64e2 100644 --- a/todo.org +++ b/todo.org @@ -1,6 +1,7 @@ #+title: TODOs #+author: Aryadev Chavali #+date: 2023-11-02 +#+startup: noindent * TODO Better documentation [0%] :DOC: ** TODO Comment coverage [0%] @@ -49,9 +50,158 @@ Languages in the competition: 2024-04-14: Chose C++ cos it will require the least effort to rewrite the currently existing codebase while still leveraging some less efficient but incredibly useful features. +* TODO Rewrite preprocesser to create a custom unit instead of token streams +** Problem +A problem that occurs in the preprocessor is token column and line +count. Say =a.asm= has ~%use "b.asm"~. The tokens from the =b.asm= +file are inserted into =a.asm='s token stream, but the line/column +count from there isn't properly set in =a.asm=. + +A naive solution would be to just recount the lines and columns, but +this removes information about where those tokens came from. Say an +error occurs in some of =b.asm='s code: I would like to be able to +report them. + +Therefore, we can no longer just generate new token streams from the +preprocesser and should instead look at making more complex +abstractions. + +A problem this could also solve is nested errors and recursive +constants. Say I have some assembly like so +#+begin_src asm + %const limit 20 %end + %const print-limit + ... + push.byte $limit + print.byte + ... + %end +#+end_src + +A call to ~print-limit~ under the current system would insert the +tokens for print-limit but completely forget about ~push.byte $limit~ +which would cause a parsing error. (This could be fixed under the +current system by allowing reference resolution inside of const +blocks, with the conceit that it would be hard to stop infinite recursion) +** Language model +The model I have in mind is that all constructs in this meta language +(the preprocessing language) are either singular tokens or collections +of tokens/constructs in a recursive sense. This naturally follows +from the fact that a single pass isn't enough to properly parse this +language: there must be some recursive nature which forces the +language to take multiple passes to completely generate a stream that +can be parsed. + +This vague notion can be formalised like so. A preprocessing unit is +either a singular token or a named collection of units. The former +represents your standard symbols and literals while the later +represents ~%const~ and ~%use~ calls where there is a clear name +associated to a collection of one or more tokens (in the case of the +former it's the constant's name and the latter it's the filename). +We'll distinguish this as well. + +#+begin_src text +Token = PP_USE | PP_CONST | String(Content) | Symbol(Content) | PUSH(Content) | ... +Type = File(String) | Constant(Symbol) +Unit = Token | Container(Type . Vector[Unit]) +#+end_src + +Through this model our initial stream of tokens can be considered +units. We can already see that this model may solve our original +problem: with named containers it doesn't matter that certain tokens +are from different parts of the file or different files as they are +distinctly typed from the general set of tokens, with a name which +states where they're from. +** Processing +We need this model to have a notion of "processing" though, otherwise +it's quite useless. A processing function is simply a function which +takes a unit and returns another unit. We currently have two +processing functions we can consider: ~process_const~ and +~process_use~. + +~process_use~ takes a vector of tokens and, upon encountering PP_USE +accepts the next token (a string) and tokenises the file +with that name. Within our model we'd make the stream of tokens +created from opening the file a /container/. + +~process_const~ takes a vector of tokens and does two things in an +iteration: +1) upon encountering PP_CONST accepts the next n tokens till PP_END is + encountered, with the first token being a symbol. This is + registered in a map of constants (~CONSTS~) where the symbol is the + key and the value associated is the n - 1 tokens accepted +2) upon encountering a PP_REFERENCE reads the content associated with + it (considered a symbol ~S~) and replaces it ~CONSTS[S]~ (if S is + in CONSTS). + +One thing to note is that both of these definitions are easily +extensible to the general definition of units: if a unit is a +container of some kind we can recur through its vector of units to +resolve any further "calls". For ~process_const~ it's ~%const~ or +~$ref~ while for ~process_use~ it's ~%use~. +** History/versioning +One additional facet to this model I'd like to add is "history". Each +unit is actually a list (or a singly linked tree where each parent has +at most one child) of sub-units where the top of the list represents +the current version. Each descendant is a previous version of the +token. + +Say I do some processing on an element of the unit list =a= (with +index =i=) such that it becomes a new "unit", call it =b=. Then we +update V by =V[i] = cons(b, a)=. Through this, the lists acts as a +history of processing that has occurred on the unit. This provides an +ability to trace the path of preprocessing to an eventual conclusion. + +Processing occurs on a unit until it cannot be done further i.e. when +there are no more "calls" in the tree to resolve. The history list +provides all the versions of a unit till its resolved form. + +To see what a unit with history may look like (where symbols are +terminals i.e. completely resolved): ++ Container('limit' . [a Container("b" . d e f) c]) + + Container('limit' . [a '$b' c]) + + Token(PP_REF('$limit')) + +This shows resolution of the unit reference ~$limit~, which in turn +leads to the resolution of ~$b~ which is a sub-unit. + +There are two ways of indefinite resolution, one per method of +processing. For ~process_use~ it is two files calling ~%use~ on each +other and for ~process_const~ it is a ~%const~ calling itself. We can +just disallow it through analysis. +** Pseudocode +#+begin_src text +process_use(V: Vector[Unit]) -> + [cons((if v is Token(PP_USE) and next(v) is Token(String(S)) + -> Container(File(S) . tokenise(open(v'))) + else if v is Container(name . units) + -> Container(name . process_use(units)) + else + -> v), + v_x) + v = v_x[0] + for v_x in V] + +CONSTS={} +process_const(V: Vector[Unit]) -> + [cons((if v is Token(PP_CONST) and next(v) is Token(Symbol(S)) + do { + i := find(Token(PP_END), V[v:]) + CONSTS[S] = V[next(v):prev(i)] + -> Container(Constant(S) . CONSTS[S]) + } + else if v is Token(PP_REF(S)) + -> CONSTS[S] + else if v is Container(name . units) + -> Container(name . process_const(units)) + else + -> v) + v_x) + v = v_x[0] + for v_x in V] +#+end_src * TODO Introduce error handling in base library :LIB: There is a large variety of TODOs about errors. Let's fix them! - 8 TODOs currently present. * TODO Standard library :ASM:VM: I should start considering this and how a user may use it. Should it -- cgit v1.2.3-13-gbd6f