Initial implementation

2025-05-09 18:29:52 +01:00
parent 576bf0f308
commit ba5c0a4579
18 changed files with 2217 additions and 0 deletions
--- a/tasks.org
+++ b/tasks.org
@@ -0,0 +1,230 @@
+#+title: Tasks
+#+date: 2025-02-18
+
+* WIP Implement a reader
+We want something a bit generic: able to handle reading from some
+buffer of memory (a string, or contents of a file where we can read
+the entire thing at once) or directly from a file stream (STDIN,
+network streams, etc).
+
+We don't need a tokeniser - the basic grammar of a Lisp is really easy
+to narrow down, so we can skip tokenisation and go straight for
+parsing.
+
+We also want to be able to admit when reading went wrong for some
+reason with proper errors messages (i.e. can be read by Emacs) - this
+will need to be refactored when we introduce errors within the Lisp
+runtime itself.
+** TODO Implement floats and rationals
+Rationals are pretty easy - just two integers (quotient and divisor) -
+so a tagged cons cell would do the job.  Floats are a bit more
+difficult since I'd either need to box them or find a creative way of
+sticking IEEE-754 floats into < 64 bits.
+
+Also implement a reader macro for #e<scientific form>.  Also deal with
+[-,+,]inf(.0) and [-,+,]nan(.0).
+
+Need to do some reading.
+
+[[file:r7rs-tests.scm::test #t (real? #e1e10)][trigger]]
+** TODO Consider user instantiated reader macros
+We don't have an evaluator so we can't really interpret whatever a
+user wants for a reader macro currently, but it would be useful to
+think about it now.  Currently I have a single function which deals
+with parsing reader macros but that's just static.
+
+Thing is, it does have the context as a parameter to pass to delegate
+functions (such as ~parse_vec~) - wouldn't be a massive jump to also
+consider user environments via the context.
+
+[[file:reader.c::perr_t parse_reader_macro(context_t *ctx, input_t
+*inp, lisp_t **ret)][function link]]
+* TODO Consider Lisp runtime errors
+* TODO Admit arbitrarily sized integers
+Currently we admit fixed size integers of 63 bits.  They use 2s
+complement due to x86 which means our max and min are 62 bit based.
+
+However, to even try to be a scheme implementation we need to allow
+arbitrarily sized integers.  What are the specific tasks we need to
+complete in our model to achieve this?:
+ Allow "reading" of unfixed size integers
+  + This will require reading a sequence of base 10 digits without
+    relying on strtold
+ Implement unfixed size integers into our memory model
+  + Certainly, unfixed size integers cannot be carried around like
+    fixnums wherein we can embed an integer into the pointer.
+    Thus we have to allocate them in memory.
+    + NOTE: There will be definitely be an optimisation to be done
+      here; integers that are within the bound of a fixnum could be
+      left as a fixnum then "elevated" to an integer when needed
+  + I think the big idea is allocating them as a fixed set of bytes
+    like big symbols.  For big integers we have to read the memory
+    associated thus we need a pointer.  Due to 2s complement it should
+    be trivial to increase the size of an integer to fit a new result
+    i.e. if I'm adding two integers and that leads to an "overflow"
+    where the result is of greater width than its inputs, we should
+    just allocate new memory for it.
+
+Consequences:
+- Greater memory use
+  - In fact exponential if we need to allocate a whole new integer per
+    operation rather than utilising the input memory
+- Possible loss of performance due to making integers over fixnums
+  when they don't need to be
+- Comparison is harder on integers
+- Harder to cache for the CPU
+
+but all of this is to be expected when the user is an idiot.
+* TODO Think about how to perform operations on different types
+** TODO Integers
+** TODO Symbols
+** TODO Pairs
+* DONE More efficient memory model for symbols
+The primitive model for symbol allocation is an 8 byte number
+representing the size of the symbol, followed by a variable number of
+characters (as bytes).  This is stored somewhere in the memory
+context, which will likely be in the heap.
+
+We're actually wasting a ridiculous amount of memory with this model.
+We'll almost never be using the full 64 bits of the size to represent
+a symbol (who's ever going to go close to 1.8 quintillion bytes for a
+single symbol?) - and even more annoying, there are tons of small
+sized symbols where we actually need _more space_ for the 8 byte size
+than the underlying symbol data.
+
+I think there's definitely a better model available, at least for
+smaller symbols.  We already have inlined integers where the pointer
+itself is the integer, why can't we do the same for symbols?
+A pointer has 8 bytes of data to use - one character is one byte -
+thus we can represent 8 character symbols in one pointer.
+
+If we want this to still be within the remit of our pointer tagging
+scheme, we'll need a bit of space for our administrative purposes
+(admin slop...).  So let's take _one_ byte out of that 8 for that.  So
+we can represent any symbols 7 bytes long in a single pointer.  We'll
+need to keep in mind we want to represent symbols that may be less
+than 7 characters, so that one admin byte is going to be doing some
+heavy lifting.
+
+Let's examine that one admin byte:
+ At least 3 bits are necessary for the actual pointer tag: "at least"
+  because we might increase the size of the tag based on demand
+ Thus, 5 bits left for our use - let's fit the size in there.
+  pow(2,6) - 1 = 63, so we have _way_ more than we need
+
+What are the benefits of doing this?:
+ Symbol equality for small symbols is a cinch: just compare the two
+  tagged "pointers"
+ 7 or less character symbols require no memory allocation, just
+  working off the stack
+
+One thing to note is that for more than 7 character symbols, we'll
+need to allocate memory.  But, in the worst case of 8 character
+symbols, we're only allocating two 64 bit integers: these are easy to
+walk on x86 and we've reached at least parity between the memory
+required for administration (the size number) and the actual data.
+** Being more aggressive?
+Technically, ANSI bytes only need 7 bits.  For each of the 7 bytes
+used for the character data, we can take one bit off, leaving us with
+7 bits to use for an additional character.  We don't need to adjust
+anything else in the schema.
+
+So, hypothetically we could represent up to 8 character symbols!  This
+would require packing the characters more aggressively into a single
+pointer.  Let's look at the layout of our pointers.  This table is
+indexed from most significant to least i.e. 0 is the MSB and 63 is the
+LSB:
+
+|-------+------------|
+| Index | Usage      |
+|-------+------------|
+|  0-55 | Characters |
+| 56-60 | Size       |
+| 61-63 | Tag        |
+|-------+------------|
+
+Honestly though, for an extra byte of information we'll probably have
+to do a lot more work.  x86-64 CPUs are much better at walking bytes
+than they are walking 7 bit offsets.  This may be something to
+consider if CPU time is cheaper than allocating 8 byte symbols
+somewhere.
+* DONE Tagging scheme based on arena pages
+2025-04-09:21:59:29: We went for option (2) of just taking a byte for
+free from the memory address and using it as our management byte.
+** 1) Page-offset schema
+I've realised arenas are way better than the standard array dynamic I
+was going for before.  However, we lose the nicer semantics of using
+an array index for pointers, where we can implement our own semantics
+regarding what bits in that pointer are free to use, when using a
+normal stable pointer into the arena; the host operating system has
+its own semantics regarding how pointers are arranged and this _will_
+change between operating systems.  In particular, because of the way
+I've arranged pages, we can't use the classic "div by 8" methodology
+where new allocations on the heap generally must be aligned by 8 bytes
+(u64), so we can use those 3 bits at the bottom for our tagging;
+offsets into pages are where our pointers will lie and they won't
+necessarily be divisible by 8.
+
+So we can't use the pointers directly into the pages - we'll call
+these types of pointers `host pointers`, because once we have them
+it's trivial to access the underlying data.  We'll call the pointers
+we want to make `managed pointers` because we're managing the memory
+system associated with them.  We want to be able to translate from
+managed pointers to host pointers.
+
+Managed pointers are really just encodings for direct access into the
+arena memory.  So in 8 bytes, we need to encode both the page and the
+specific offset in that page where the pointer is pointing to.  We
+also want to leave space for tagging and any metadata we might want to
+store in the pointer to that data.  A schema I could think of was:
+
+|------------------+--------------------|
+| Index (in bytes) | Representation     |
+|------------------+--------------------|
+|                0 | Metadata (tagging) |
+|              1-4 | Offset in page     |
+|              4-7 | Page choice        |
+|------------------+--------------------|
+
+This gives us pow(2, 24) - 1 = 16777215 possible pages and
+pow(2, 32) - 1 = 4294967295 offsets in each page.  Thus our total
+addressable memory would be pow(2, 56) - 1 = 72057594037927935 bytes.
+
+Certainly no machine would ever have this much memory and so we're
+quite safe for most machines.  That reserved management byte for our
+purposes (tagging, currently) will make the math to translate it a bit
+easier.
+
+Let's reason about how we'd encode and decode these addresses.  The
+arena itself should provide addresses with the management byte set to
+0 for the user to encode what they wish.  The top bytes should be
+encoded as per the above i.e. top 3 bytes as the page index, next 4
+bytes as the offset in that page.  This shouldn't be super difficult
+when we're doing it within the management functions of the arena
+itself as this data should be handy when performing the allocation.
+
+When decoding these addresses i.e. retrieving data i.e. translating
+from a managed pointer to a host pointer, all it will need to do is
+convert the pointer into a byte buffer and copy the top 3 bytes as a
+page index and the next 4 bytes as the offset in the page.  Once these
+are verified to be valid, we can just access the underlying pages and
+get the host pointer.  Because of how arenas work, those host pointers
+will be stable regardless of any further memory management functions
+performed on the arena (excluding cleanup) - so once you have a host
+pointer, you can use it as much as you want without having to worry
+about the pointer becoming invalid in the next second.
+** 2) 48-bit addressing exploit
+Most x86 CPUs only use around 48-56 bits for actual memory addresses -
+mostly as a result of not needing _nearly_ as many addresses as a full
+64 bit word would provide.
+
+So we /could/ get away with using one of those bytes for our
+administrative tasks.  Since the smallest remit we have is one byte,
+we'll stick to that (but maybe we could go for two bytes - need to
+investigate further).
+
+This byte should be the MSB, but using that for tagging will require
+more work than the lowest byte (to look at it we'll need to push that
+byte all the way down).  So we'll be going for a low byte strategy by
+shifting the pointer up by one byte.  This leaves us with the lowest
+byte to play with as we choose.