Update TODOs

2024-07-07 19:07:35 +01:00
parent 6ae0bbedc5
commit e6c659a14d

todo.org

@@ -19,18 +19,15 @@ Languages in the competition:
2024-04-14: Chose C++ because it will require the least effort to
rewrite the currently existing codebase while still leveraging some
less efficient but incredibly useful features.
* TODO Rewrite lexer
~push.magic~ is a valid PUSH token according to the current lexer.
I'd like to clamp down on this obvious error at the lexer itself, so
the parser can be dedicated to just dealing with address resolution
and conversion to opcodes.
How about an enum which represents the possible type of the operator?
** DONE Write Lexer
** WIP Write Preprocessor
** TODO Write parser
* TODO Better documentation [0%] :DOC:
** TODO Comment coverage [0%]
*** TODO ASM [0%]
**** TODO asm/lexer.h
**** TODO asm/parser.h
** TODO Write a specification
* TODO Preprocessing directives :ASM:
Like in FASM or NASM where we can give certain helpful instructions to
the assembler. I'd use the ~%~ symbol to designate preprocessor
@@ -49,159 +46,6 @@ A call should look something like this:
$name 1 2 3
#+end_src
and those tokens will be substituted literally in the macro body.
* TODO Rewrite preprocessor to create a custom unit instead of token streams
** Problem
A problem that occurs in the preprocessor is tracking token line and
column numbers. Say =a.asm= has ~%use "b.asm"~. The tokens from
=b.asm= are inserted into =a.asm='s token stream, but their
line/column information isn't adjusted properly relative to =a.asm=.
A naive solution would be to just recount the lines and columns, but
this removes information about where those tokens came from. Say an
error occurs in some of =b.asm='s code: I would like to be able to
report it as coming from =b.asm=.
Therefore, we can no longer just generate new token streams from the
preprocessor and should instead look at making more complex
abstractions.
Another problem this could solve is nested errors and recursive
constants. Say I have some assembly like so:
#+begin_src asm
%const limit 20 %end
%const print-limit
...
push.byte $limit
print.byte
...
%end
#+end_src
A call to ~print-limit~ under the current system would insert the
tokens for ~print-limit~ but completely forget about ~push.byte
$limit~, which would cause a parsing error. (This could be fixed
under the current system by allowing reference resolution inside of
const blocks, with the caveat that it would be hard to stop infinite
recursion.)
** Language model
The model I have in mind is that all constructs in this meta language
(the preprocessing language) are either singular tokens or collections
of tokens/constructs in a recursive sense. This naturally follows
from the fact that a single pass isn't enough to properly parse this
language: there must be some recursive nature which forces the
language to take multiple passes to completely generate a stream that
can be parsed.
This vague notion can be formalised like so. A preprocessing unit is
either a singular token or a named collection of units. The former
represents your standard symbols and literals while the latter
represents ~%const~ and ~%use~ calls, where there is a clear name
associated with a collection of one or more tokens (for ~%const~ it's
the constant's name and for ~%use~ it's the filename). We'll
distinguish these two kinds of containers as well.
#+begin_src text
Token = PP_USE | PP_CONST | String(Content) | Symbol(Content) | PUSH(Content) | ...
Type = File(String) | Constant(Symbol)
Unit = Token | Container(Type . Vector[Unit])
#+end_src
Through this model our initial stream of tokens can be considered
units. We can already see that this model may solve our original
problem: with named containers it doesn't matter that certain tokens
come from different parts of a file, or from different files
entirely, as they are distinctly typed from the general set of tokens
and carry a name which states where they came from.
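A rough sketch of what this unit type might look like in C++ (purely
illustrative; these are not the codebase's actual types and all the
names here are hypothetical):
#+begin_src c++
// Hypothetical sketch of the unit model described above (C++17 for
// std::variant and std::vector's incomplete-type support).
#include <string>
#include <variant>
#include <vector>

// A single token: PP_USE, PP_CONST, strings, symbols, PUSH, etc.
struct Token
{
  enum class Kind
  {
    PP_USE, PP_CONST, PP_END, PP_REFERENCE, STRING, SYMBOL, PUSH // ...
  };
  Kind kind;
  std::string content; // literal text: symbol name, string body, etc.
};

struct Unit; // forward declaration so Container can hold Units

// A named collection of units: a file pulled in by %use or a constant
// defined by %const, named by the filename or the constant's symbol.
struct Container
{
  enum class Type { FILE, CONSTANT };
  Type type;
  std::string name;
  std::vector<Unit> units;
};

// Unit = Token | Container(Type . Vector[Unit])
struct Unit
{
  std::variant<Token, Container> value;
};
#+end_src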
** Processing
We need this model to have a notion of "processing" though, otherwise
it's quite useless. A processing function is simply a function which
takes a unit and returns another unit. We currently have two
processing functions we can consider: ~process_const~ and
~process_use~.
~process_use~ takes a vector of tokens and, upon encountering
PP_USE, accepts the next token (a string) and tokenises the file with
that name. Within our model we'd make the stream of tokens
created from opening the file a /container/.
~process_const~ takes a vector of tokens and does two things while
iterating:
1) upon encountering PP_CONST, accepts the next n tokens until PP_END
is encountered, with the first token being a symbol. This is
registered in a map of constants (~CONSTS~) where the symbol is the
key and the associated value is the n - 1 tokens accepted after it
2) upon encountering a PP_REFERENCE, reads the content associated
with it (considered a symbol ~S~) and replaces it with ~CONSTS[S]~
(if ~S~ is in ~CONSTS~).
One thing to note is that both of these definitions are easily
extensible to the general definition of units: if a unit is a
container of some kind we can recur through its vector of units to
resolve any further "calls". For ~process_const~ it's ~%const~ or
~$ref~ while for ~process_use~ it's ~%use~.
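As a rough illustration of that recursion in C++, ~process_use~ might
look something like this (the simplified ~Unit~ and the
~tokenise_file~ helper are assumptions for this sketch, not the
codebase's real API; history tracking and error handling are
ignored):
#+begin_src c++
// Hypothetical sketch of process_use over a simplified, tag-based
// stand-in for the Unit model.
#include <cstddef>
#include <string>
#include <vector>

struct Token
{
  enum class Kind { OTHER, PP_USE, STRING };
  Kind kind;
  std::string content;
};

struct Unit
{
  bool is_token;            // true: single token, false: named container
  Token token;              // valid when is_token
  std::string name;         // container name (a filename here)
  std::vector<Unit> units;  // container body
};

// Assumed helper: open `filename` and lex it into token units.
std::vector<Unit> tokenise_file(const std::string &filename)
{
  (void)filename;
  return {}; // body omitted in this sketch
}

std::vector<Unit> process_use(const std::vector<Unit> &V)
{
  std::vector<Unit> out;
  for (std::size_t i = 0; i < V.size(); ++i)
  {
    const Unit &v = V[i];
    if (v.is_token && v.token.kind == Token::Kind::PP_USE &&
        i + 1 < V.size() && V[i + 1].is_token &&
        V[i + 1].token.kind == Token::Kind::STRING)
    {
      // %use "file": wrap the file's tokens in a named container so we
      // remember where they came from.
      const std::string &filename = V[i + 1].token.content;
      out.push_back(Unit{false, {}, filename, tokenise_file(filename)});
      ++i; // skip the filename token
    }
    else if (!v.is_token)
    {
      // Recur into containers so nested %use calls get resolved too.
      out.push_back(Unit{false, {}, v.name, process_use(v.units)});
    }
    else
      out.push_back(v);
  }
  return out;
}
#+end_src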
** History/versioning
One additional facet I'd like to add to this model is "history". Each
unit is actually a list (or a singly linked tree where each parent has
at most one child) of sub-units where the top of the list represents
the current version. Each descendant is a previous version of the
unit.
Say I do some processing on an element =a= of the unit vector =V= (at
index =i=) such that it becomes a new "unit", call it =b=. Then we
update =V= by =V[i] = cons(b, a)=. Through this, the list acts as a
history of the processing that has occurred on the unit. This makes
it possible to trace the path of preprocessing to its eventual
conclusion.
Processing occurs on a unit until it cannot be done any further,
i.e. when there are no more "calls" in the tree to resolve. The
history list then provides all the versions of a unit up to its
resolved form.
To see what a unit with history may look like (where symbols are
terminals i.e. completely resolved):
+ Container('limit' . [a Container("b" . d e f) c])
+ Container('limit' . [a '$b' c])
+ Token(PP_REF('$limit'))
This shows resolution of the unit reference ~$limit~, which in turn
leads to the resolution of ~$b~ which is a sub-unit.
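A rough C++ sketch of this cons-style history (again illustrative
only; ~History~ and ~push_version~ are names made up for this note):
#+begin_src c++
// Hypothetical sketch of per-slot unit history, mirroring
// V[i] = cons(b, a) from above.
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

struct Unit
{
  // Token or named container, as in the unit-model sketch; elided here.
};

// A history cell: `current` is this version of the unit, `previous`
// points at the version it was derived from (null for the original).
struct History
{
  std::shared_ptr<Unit> current;
  std::shared_ptr<History> previous;
};

// Replace slot i with the new version b while keeping the old version
// reachable as its predecessor.
void push_version(std::vector<History> &V, std::size_t i,
                  std::shared_ptr<Unit> b)
{
  V[i] = History{std::move(b), std::make_shared<History>(V[i])};
}
#+end_src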
There are two ways resolution can fail to terminate, one per method
of processing. For ~process_use~ it is two files calling ~%use~ on
each other, and for ~process_const~ it is a ~%const~ referencing
itself. We can just disallow both through analysis.
** Pseudocode
#+begin_src text
process_use(V: Vector[Unit]) ->
  [cons((if v is Token(PP_USE) and next(v) is Token(String(S))
           -> Container(File(S) . tokenise(open(S)))
         else if v is Container(name . units)
           -> Container(name . process_use(units))
         else
           -> v),
        v_x)
   where v = v_x[0]
   for v_x in V]

CONSTS = {}
process_const(V: Vector[Unit]) ->
  [cons((if v is Token(PP_CONST) and next(v) is Token(Symbol(S))
           do {
             i := find(Token(PP_END), V[v:])
             CONSTS[S] = V[next(next(v)):prev(i)]
             -> Container(Constant(S) . CONSTS[S])
           }
         else if v is Token(PP_REF(S))
           -> CONSTS[S]
         else if v is Container(name . units)
           -> Container(name . process_const(units))
         else
           -> v),
        v_x)
   where v = v_x[0]
   for v_x in V]
#+end_src
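And a corresponding C++ sketch of ~process_const~ (same caveats as
the earlier sketches: hypothetical names, simplified representation,
no history tracking, no error handling):
#+begin_src c++
// Hypothetical sketch of process_const over a simplified, tag-based
// stand-in for the Unit model.
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct Token
{
  enum class Kind { OTHER, PP_CONST, PP_END, PP_REFERENCE, SYMBOL };
  Kind kind;
  std::string content;
};

struct Unit
{
  bool is_token;            // true: single token, false: named container
  Token token;              // valid when is_token
  std::string name;         // container name (constant or filename)
  std::vector<Unit> units;  // container body
};

// Registered constants: constant name -> body of units.
static std::map<std::string, std::vector<Unit>> CONSTS;

std::vector<Unit> process_const(const std::vector<Unit> &V)
{
  std::vector<Unit> out;
  for (std::size_t i = 0; i < V.size(); ++i)
  {
    const Unit &v = V[i];
    if (v.is_token && v.token.kind == Token::Kind::PP_CONST &&
        i + 1 < V.size() && V[i + 1].is_token &&
        V[i + 1].token.kind == Token::Kind::SYMBOL)
    {
      // %const <symbol> ... %end: collect the body and register it.
      const std::string name = V[i + 1].token.content;
      std::vector<Unit> body;
      std::size_t j = i + 2;
      for (; j < V.size() &&
             !(V[j].is_token && V[j].token.kind == Token::Kind::PP_END);
           ++j)
        body.push_back(V[j]);
      CONSTS[name] = body;
      // Keep a named container so we remember where these tokens came from.
      out.push_back(Unit{false, {}, name, std::move(body)});
      i = j; // skip past %end
    }
    else if (v.is_token && v.token.kind == Token::Kind::PP_REFERENCE &&
             CONSTS.count(v.token.content) != 0)
    {
      // $ref: splice in the registered constant's body.
      const std::vector<Unit> &body = CONSTS[v.token.content];
      out.insert(out.end(), body.begin(), body.end());
    }
    else if (!v.is_token)
    {
      // Recur into containers so nested references get resolved too.
      out.push_back(Unit{false, {}, v.name, process_const(v.units)});
    }
    else
    {
      out.push_back(v);
    }
  }
  return out;
}
#+end_src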
* TODO Write a specification for the assembly language
In particular the preprocessor macro language and the direct relation
between opcodes and parse units in the assembler.
* Completed
** DONE Write a label/jump system :ASM:
Essentially a user should be able to write arbitrary labels (maybe
@@ -350,3 +194,10 @@ That would be a very simple way of solving the static vs dynamic
linking problem: just include the files you actually need. Even the
standard library would be fine and not require any additional work.
Let's see how this would work.
** DONE Rewrite lexer
~push.magic~ is a valid PUSH token according to the current lexer.
I'd like to clamp down on this obvious error at the lexer itself, so
the parser can be dedicated to just dealing with address resolution
and conversion to opcodes.
How about an enum which represents the possible type of the operator?