arl.org: massive updates

2026-01-29 05:19:40 +00:00
parent 82b96e23d5
commit c11b69092d
1 changed files with 48 additions and 146 deletions
--- a/arl.org
+++ b/arl.org
@@ -5,157 +5,59 @@
 We need to be able to compile the following file:
 [[file:examples/hello-world.arl]].  All it does is print "Hello,
 world!".  Should be relatively straightforward.
 ** Stages
 We need the following stages in our MVP transpiler:
 - Source code reading (read bytes from a file)
 - Parse raw bytes into tokens (Lexer)
 - Interpret tokens into a classical AST (Parser)
 - Stack effect and type analysis of the AST for soundness
 - Translate AST into C code (Codegen)
 - Compile C code into native executable (Target)
 It's an Eulerian path from the source code to the native executable.
 ** DONE Read file
-** DONE Parser
+** DONE Lexer
-** TODO Intermediate representation (Virtual Machine)
+[[file:src/lexer/]]
-[[file:src/arl/vm/]]
+[[file:include/arl/lexer/]]
 ** WIP Parser
 [[file:src/parser/]]
 [[file:include/arl/parser/]]
-Before we get into generating C code and then compiling it, it might
+We need to generate some form of AST from the token stream.  We want
-be worth translating the parsed ARL code into a generic IR.
+something a stage above the tokeniser so it should distinguish the
 following cases:
 - Literal value
 - Primitive call
 *** TODO AST design
 *** TODO Token Stream to AST implementation
 ** TODO Stack effect/type analysis
 [[file:src/analysis/]]
 [[file:include/arl/analysis/]]
-The IR should be primitive in its semantics but should still
+Given the AST, we need to verify its soundness with regard to
-encapsulate the intention behind the original ARL code.  This should
+types and the stack.  We have this idea of "stack effects" attached to
-allow us to find a set of minimum requirements for target compilation:
+every node in the AST; literals push values onto the stack and pop
- what can we reasonably use from the target platform to satisfy
+nothing, while operations may pop some operands and push some values.
  supporting the primitive IR?
 - what do we need to hand-roll on the target in order to make this
  work?
-Essentially, we want to write a virtual machine, and translate ARL
+We need a way to:
-code into bytecode for that VM.  Goals:
+- Codify the stack effects of each type of AST node
- Type checking
+- Infer the total stack effect from a sequence of nodes
 - Optimiser (stretch)
-We need the following clear items in our IR:
+These stack effects work in tandem with our type analysis.  Stack
- Static type values
+shape analysis tells us what operands are being fed into primitives,
- Static type variables (possibly De Bruijn numbering or other such
+while the type analysis will tell us if the operands are well formed
-  mechanism to abstract naming away and leave it to the target to
+for the primitives.
  generate effectively)
 - Strongly typed primitive operators (numeric, strings, I/O) with
  packed arguments
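The two requirements for stack effects (codify the effect of each node, infer the total effect of a sequence) could be sketched as follows.  This is only a minimal sketch: ~StackEffect~, ~effect_compose~ and ~effect_of_sequence~ are hypothetical names, not taken from the ARL sources, and types are deliberately left out so only the stack shape is tracked.

```c
#include <stddef.h>

/* Hypothetical representation: a node first pops `pops` values off the
   stack, then pushes `pushes` values onto it. */
typedef struct {
  size_t pops;
  size_t pushes;
} StackEffect;

/* Effect of running `a` and then `b`.  If `b` pops more than `a` left
   behind, the shortfall is taken from whatever was on the stack before
   `a` ran, so the combined effect pops more. */
StackEffect effect_compose(StackEffect a, StackEffect b)
{
  StackEffect total;
  if (b.pops <= a.pushes) {
    total.pops = a.pops;
    total.pushes = a.pushes - b.pops + b.pushes;
  } else {
    total.pops = a.pops + (b.pops - a.pushes);
    total.pushes = b.pushes;
  }
  return total;
}

/* Fold the composition over a sequence of nodes, starting from the
   empty effect (pops nothing, pushes nothing). */
StackEffect effect_of_sequence(const StackEffect *nodes, size_t count)
{
  StackEffect total = {0, 0};
  for (size_t i = 0; i < count; ++i)
    total = effect_compose(total, nodes[i]);
  return total;
}
```

Under this encoding a literal is ~{0, 1}~ and a binary primitive such as ~+~ is ~{2, 1}~, so ~34 35 +~ composes to ~{0, 1}~: it needs nothing from the stack and leaves one value.  A program whose total effect has non-zero ~pops~ would underflow an empty stack.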
 We should have a rough grouping between AST objects and this IR.  As
 ARL is Forth-like, we can use the stack semantics to generate this IR
 as we walk the AST in a linear manner.  In practice this should almost
 look like emulating a really small subset of the ARL language itself
 and executing the program in that small subset.
 Looking at how
 [[https://en.wikipedia.org/wiki/Three-address_code][TAC]] works, I
 think it may be a good idea to do something like that for our IR.
 Essentially we should translate our AST into a sequence of really simple
 bindings, with the final expression being a reference to some binding.
 This also simplifies type checking to just verifying each little
 binding and operation.
 *** Examples
 **** Basic example
 Consider the following ARL code:
 #+begin_src text
 34 35 +
 #+end_src
 When we walk through the above code:
 - 34 (an integer) is pushed onto the stack
 - 35 (an integer) is pushed onto the stack
 - ~+~ primitive is encountered
  - Type check the top two values of the stack; they should be
    integral.
  - ~a b +~ should correspond to ~a + b~ so the IR expression should
    pack the arguments in that order: ~prim-add(34,35)~.
  - Bind the generated IR expression to some unique name, say ~v1~.
    - Ensure this works with type checking; looking up ~v1~'s type
      should give you the output type of the "+" operator (integer).
  - Push ~v1~ onto the stack.
 The final state of the stack should be something like ~[v1]~ where
 ~v1=prim-add(34,35)~.  The final state of the stack, along with the
 bindings we form, is the IR, to pass over to the later stages of the
 compiler.
 **** Slightly more complex example
 Let's look at a slightly more complex program:
 #+begin_src text
 34 35 + 70 swap -
 #+end_src
 - 34 (integer) pushed
 - 35 (integer) pushed
 - ~+~ primitive:
  - As stated previously, the final state of this primitive gives us
    the name ~v1~ on the stack with the association
    ~v1=prim-add(34,35)~.
 - 70 (integer) pushed
 - ~swap~ primitive:
  - Requires two values on the stack, but we care little about their
    types.  Just swaps their order on the stack.
  - We /could/ introduce generics here to make the input/output
    relationship explicit (forall T, U swap:-(-> (T U) (U T))), but
    at the same time we can just as easily get away with a type hole
    (essentially some kind of ~any~).  Up to debate.
  - We do not generate IR for this primitive as it simply isn't
    necessary.  Instead we perform the swap on our IR stack and
    continue.  The ~swap~ primitive is "transparent" in the final IR.
  - In this situation, the stack goes from ~[v1, 70]~ to
    ~[70, v1]~
 - ~-~ primitive:
  - Type checks the top two values of the stack (which are both
    integers)
  - ~a b -~ should correspond to ~a - b~, thus the corresponding IR
    expression should be ~prim-sub(70,v1)~
  - Associate IR expression with name ~v2~
  - Push ~v2~ onto the stack.
 The final state of the IR should be:
 - Stack: ~[v2]~
 - Bindings:
  - ~v1=prim-add(34,35)~
  - ~v2=prim-sub(70,v1)~
 Notice how some primitives generate IR, while others manipulate IR
 themselves?  They almost seem like macros!
 Another thing of note is how the final state of the stack is a single
 item in this case; an IR expression representing the entire program.
 When we introduce code level bindings we won't have such nice outputs,
 but it is certainly something to consider.
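The walk over ~34 35 + 70 swap -~ can be sketched in C roughly as below.  Every name here (~Ir~, ~ir_push_int~, ~ir_swap~, ~ir_binary~, the fixed capacity) is a hypothetical choice for illustration, and only integer operands are handled.  Bindings are stored by index, so index 0 plays the role of ~v1~ and index 1 the role of ~v2~.

```c
#include <stddef.h>

typedef enum { TYPE_INT, TYPE_STRING } Type;
typedef enum { OP_LITERAL, OP_BINDING } OperandKind;

/* An item on the IR stack: either an integer literal or a reference to
   a binding.  `value` holds the literal or the binding index. */
typedef struct {
  OperandKind kind;
  Type type;
  long value;
} Operand;

/* One binding, e.g. v1 = prim-add(34, 35). */
typedef struct {
  const char *prim;
  Operand lhs, rhs;
  Type type;
} Binding;

enum { IR_CAP = 16 }; /* arbitrary capacity for the sketch */

typedef struct {
  Operand stack[IR_CAP];
  size_t stack_len;
  Binding bindings[IR_CAP];
  size_t binding_count;
} Ir;

/* Literals generate no binding; they are simply pushed. */
int ir_push_int(Ir *ir, long v)
{
  if (ir->stack_len == IR_CAP) return -1;
  ir->stack[ir->stack_len++] = (Operand){OP_LITERAL, TYPE_INT, v};
  return 0;
}

/* swap is "transparent": it reorders the IR stack and emits nothing. */
int ir_swap(Ir *ir)
{
  if (ir->stack_len < 2) return -1;
  Operand top = ir->stack[ir->stack_len - 1];
  ir->stack[ir->stack_len - 1] = ir->stack[ir->stack_len - 2];
  ir->stack[ir->stack_len - 2] = top;
  return 0;
}

/* Integer binary primitive: `a b op` packs its arguments as op(a, b),
   binds the expression to a fresh index and pushes a reference to it. */
int ir_binary(Ir *ir, const char *prim)
{
  if (ir->stack_len < 2 || ir->binding_count == IR_CAP) return -1;
  Operand b = ir->stack[ir->stack_len - 1];
  Operand a = ir->stack[ir->stack_len - 2];
  if (a.type != TYPE_INT || b.type != TYPE_INT) return -1; /* type error */
  size_t index = ir->binding_count++;
  ir->bindings[index] = (Binding){prim, a, b, TYPE_INT};
  ir->stack_len -= 2;
  ir->stack[ir->stack_len++] = (Operand){OP_BINDING, TYPE_INT, (long)index};
  return 0;
}
```

Calling ~ir_push_int~ for 34 and 35, ~ir_binary~ with ~"prim-add"~, ~ir_push_int~ for 70, ~ir_swap~, then ~ir_binary~ with ~"prim-sub"~ leaves one binding reference on the stack and two bindings, the second having the literal 70 as its left operand and the first binding as its right operand.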
 **** Hello world! example
 For our hello world:
 #+begin_src text
 "Hello, world!\n" putstr
 #+end_src
 - "Hello, world!\n" (string) pushed
 - "putstr" primitive:
  - Type check the top of the stack (should be a string)
  - Generate IR ~prim-putstr("Hello, world!\n")~
  - Associate with name ~v1~ and push it onto the stack
 Much simpler than our
 *** TODO IR level type checking
 During IR compilation, the following should be type checked:
 - use of callables (primitives, user defined when implemented)
 - variable assignment (when implemented)
 - variable use (when implemented)
 - definition of callables (when implemented)
 We want to ensure no statement is unsound.
 **** TODO Primitive types
 Define the primitive types of the IR.  Remember, simplicity is key,
 but we need to mirror what we're getting on the ARL side.
 **** TODO Type contracts for callables
 Define how we can type check arguments on the stack against the types
 a callable expects for its inputs.  In the same vein, we also need to
 figure out the type of whatever is pushed onto the stack by the
 callable.
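One possible shape for such a contract, as a rough sketch only: a callable lists the types it expects on top of the stack and the types it leaves behind, and checking a call means matching the first list against a stack of types.  ~Contract~ and ~contract_apply~ are hypothetical names, and generics or type holes are not modelled.

```c
#include <stddef.h>

typedef enum { TYPE_INT, TYPE_STRING } Type;

/* Hypothetical contract: `inputs` is ordered bottom to top, so the last
   input must match the top of the stack. */
typedef struct {
  const Type *inputs;
  size_t n_inputs;
  const Type *outputs;
  size_t n_outputs;
} Contract;

/* Check the top of the type stack against the contract.  On success the
   inputs are replaced by the outputs and 0 is returned; on underflow,
   type mismatch or overflow the stack is left untouched and -1 is
   returned. */
int contract_apply(const Contract *c, Type *stack, size_t *len, size_t cap)
{
  if (*len < c->n_inputs) return -1; /* not enough operands */
  size_t base = *len - c->n_inputs;
  for (size_t i = 0; i < c->n_inputs; ++i)
    if (stack[base + i] != c->inputs[i]) return -1; /* type mismatch */
  if (base + c->n_outputs > cap) return -1; /* no room for outputs */
  for (size_t i = 0; i < c->n_outputs; ++i)
    stack[base + i] = c->outputs[i];
  *len = base + c->n_outputs;
  return 0;
}
```

With this encoding the ~+~ primitive would carry inputs ~{TYPE_INT, TYPE_INT}~ and outputs ~{TYPE_INT}~; applying it to a stack holding an integer and a string is rejected before any IR is generated.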
 *** TODO Use SSA for user level bindings
 [[https://en.wikipedia.org/wiki/Static_single-assignment_form][Static
 single-assignment form]] is something we should use when we introduce
 user level bindings.
 ** TODO Code generator
-[[file:src/arl/target-c/]]
+[[file:src/codegen/]]
 [[file:include/arl/codegen/]]
-This should take the IR translated from the AST generated by the
+This should take the AST generated by the parser (which should already
-parser, and write equivalent C code.
+have been analysed), and write equivalent C code.
 ** TODO Target compilation
 [[file:src/target/]]
 [[file:include/arl/target/]]
-After we've generated the C code, we need to call a C compiler on it
+=gcc= and =clang= take C code via /stdin/, so we don't need to write
-to generate a binary.  GCC and Clang allow passing source code through
+the C code to disk - we can just leave it as a buffer of bytes.  So
-stdin, so we don't even need to write to disk first which is nice.
+we'll call the compilers and feed the generated code from the previous
 stage into them via stdin.
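A minimal sketch of this stage, assuming a POSIX system where ~popen~ is available.  Both compilers read source from stdin when given a lone ~-~ as the input file, and need ~-x c~ because stdin carries no file extension to infer the language from.  The helper names ~arl_build_cc_command~ and ~arl_pipe_to_command~ are hypothetical, and paths are not shell-quoted here, which a real implementation would need to handle.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <stdio.h>
#include <sys/wait.h>

/* Hypothetical helper: format the compiler invocation into `out`.
   Returns 0 on success, -1 if the command does not fit in `cap` bytes. */
int arl_build_cc_command(char *out, size_t cap, const char *cc,
                         const char *exe_path)
{
  int n = snprintf(out, cap, "%s -x c - -o %s", cc, exe_path);
  return (n < 0 || (size_t)n >= cap) ? -1 : 0;
}

/* Hypothetical helper: run `cmd` through the shell and feed it `len`
   bytes from `buf` on its stdin.  Returns the command's exit status, or
   -1 if it could not be started, fed, or did not exit normally. */
int arl_pipe_to_command(const char *cmd, const char *buf, size_t len)
{
  FILE *pipe = popen(cmd, "w");
  if (pipe == NULL) return -1;
  size_t written = fwrite(buf, 1, len, pipe);
  int status = pclose(pipe); /* flushes, closes stdin, waits for exit */
  if (written != len || status == -1 || !WIFEXITED(status)) return -1;
  return WEXITSTATUS(status);
}
```

The generated C code stays in memory as a byte buffer; building the command with ~"gcc"~ or ~"clang"~ and passing that buffer to ~arl_pipe_to_command~ would produce the executable without an intermediate ~.c~ file.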