#+title: Tasks #+date: 2025-02-18 * WIP Implement a reader We want something a bit generic: able to handle reading from some buffer of memory (a string, or contents of a file where we can read the entire thing at once) or directly from a file stream (STDIN, network streams, etc). We don't need a tokeniser - the basic grammar of a Lisp is really easy to narrow down, so we can skip tokenisation and go straight for parsing. We also want to be able to admit when reading went wrong for some reason with proper errors messages (i.e. can be read by Emacs) - this will need to be refactored when we introduce errors within the Lisp runtime itself. ** TODO Implement floats and rationals Rationals are pretty easy - just two integers (quotient and divisor) - so a tagged cons cell would do the job. Floats are a bit more difficult since I'd either need to box them or find a creative way of sticking IEEE-754 floats into < 64 bits. Also implement a reader macro for #e. Also deal with [-,+,]inf(.0) and [-,+,]nan(.0). Need to do some reading. [[file:r7rs-tests.scm::test #t (real? #e1e10)][trigger]] ** TODO Consider user instantiated reader macros We don't have an evaluator so we can't really interpret whatever a user wants for a reader macro currently, but it would be useful to think about it now. Currently I have a single function which deals with parsing reader macros but that's just static. Thing is, it does have the context as a parameter to pass to delegate functions (such as ~parse_vec~) - wouldn't be a massive jump to also consider user environments via the context. [[file:reader.c::perr_t parse_reader_macro(context_t *ctx, input_t *inp, lisp_t **ret)][function link]] * TODO Consider Lisp runtime errors * TODO Admit arbitrarily sized integers Currently we admit fixed size integers of 63 bits. They use 2s complement due to x86 which means our max and min are 62 bit based. However, to even try to be a scheme implementation we need to allow arbitrarily sized integers. What are the specific tasks we need to complete in our model to achieve this?: + Allow "reading" of unfixed size integers + This will require reading a sequence of base 10 digits without relying on strtold + Implement unfixed size integers into our memory model + Certainly, unfixed size integers cannot be carried around like fixnums wherein we can embed an integer into the pointer. Thus we have to allocate them in memory. + NOTE: There will be definitely be an optimisation to be done here; integers that are within the bound of a fixnum could be left as a fixnum then "elevated" to an integer when needed + I think the big idea is allocating them as a fixed set of bytes like big symbols. For big integers we have to read the memory associated thus we need a pointer. Due to 2s complement it should be trivial to increase the size of an integer to fit a new result i.e. if I'm adding two integers and that leads to an "overflow" where the result is of greater width than its inputs, we should just allocate new memory for it. Consequences: - Greater memory use - In fact exponential if we need to allocate a whole new integer per operation rather than utilising the input memory - Possible loss of performance due to making integers over fixnums when they don't need to be - Comparison is harder on integers - Harder to cache for the CPU but all of this is to be expected when the user is an idiot. * TODO Think about how to perform operations on different types ** TODO Integers ** TODO Symbols ** TODO Pairs * DONE More efficient memory model for symbols The primitive model for symbol allocation is an 8 byte number representing the size of the symbol, followed by a variable number of characters (as bytes). This is stored somewhere in the memory context, which will likely be in the heap. We're actually wasting a ridiculous amount of memory with this model. We'll almost never be using the full 64 bits of the size to represent a symbol (who's ever going to go close to 1.8 quintillion bytes for a single symbol?) - and even more annoying, there are tons of small sized symbols where we actually need _more space_ for the 8 byte size than the underlying symbol data. I think there's definitely a better model available, at least for smaller symbols. We already have inlined integers where the pointer itself is the integer, why can't we do the same for symbols? A pointer has 8 bytes of data to use - one character is one byte - thus we can represent 8 character symbols in one pointer. If we want this to still be within the remit of our pointer tagging scheme, we'll need a bit of space for our administrative purposes (admin slop...). So let's take _one_ byte out of that 8 for that. So we can represent any symbols 7 bytes long in a single pointer. We'll need to keep in mind we want to represent symbols that may be less than 7 characters, so that one admin byte is going to be doing some heavy lifting. Let's examine that one admin byte: + At least 3 bits are necessary for the actual pointer tag: "at least" because we might increase the size of the tag based on demand + Thus, 5 bits left for our use - let's fit the size in there. pow(2,6) - 1 = 63, so we have _way_ more than we need What are the benefits of doing this?: + Symbol equality for small symbols is a cinch: just compare the two tagged "pointers" + 7 or less character symbols require no memory allocation, just working off the stack One thing to note is that for more than 7 character symbols, we'll need to allocate memory. But, in the worst case of 8 character symbols, we're only allocating two 64 bit integers: these are easy to walk on x86 and we've reached at least parity between the memory required for administration (the size number) and the actual data. ** Being more aggressive? Technically, ANSI bytes only need 7 bits. For each of the 7 bytes used for the character data, we can take one bit off, leaving us with 7 bits to use for an additional character. We don't need to adjust anything else in the schema. So, hypothetically we could represent up to 8 character symbols! This would require packing the characters more aggressively into a single pointer. Let's look at the layout of our pointers. This table is indexed from most significant to least i.e. 0 is the MSB and 63 is the LSB: |-------+------------| | Index | Usage | |-------+------------| | 0-55 | Characters | | 56-60 | Size | | 61-63 | Tag | |-------+------------| Honestly though, for an extra byte of information we'll probably have to do a lot more work. x86-64 CPUs are much better at walking bytes than they are walking 7 bit offsets. This may be something to consider if CPU time is cheaper than allocating 8 byte symbols somewhere. * DONE Tagging scheme based on arena pages 2025-04-09:21:59:29: We went for option (2) of just taking a byte for free from the memory address and using it as our management byte. ** 1) Page-offset schema I've realised arenas are way better than the standard array dynamic I was going for before. However, we lose the nicer semantics of using an array index for pointers, where we can implement our own semantics regarding what bits in that pointer are free to use, when using a normal stable pointer into the arena; the host operating system has its own semantics regarding how pointers are arranged and this _will_ change between operating systems. In particular, because of the way I've arranged pages, we can't use the classic "div by 8" methodology where new allocations on the heap generally must be aligned by 8 bytes (u64), so we can use those 3 bits at the bottom for our tagging; offsets into pages are where our pointers will lie and they won't necessarily be divisible by 8. So we can't use the pointers directly into the pages - we'll call these types of pointers `host pointers`, because once we have them it's trivial to access the underlying data. We'll call the pointers we want to make `managed pointers` because we're managing the memory system associated with them. We want to be able to translate from managed pointers to host pointers. Managed pointers are really just encodings for direct access into the arena memory. So in 8 bytes, we need to encode both the page and the specific offset in that page where the pointer is pointing to. We also want to leave space for tagging and any metadata we might want to store in the pointer to that data. A schema I could think of was: |------------------+--------------------| | Index (in bytes) | Representation | |------------------+--------------------| | 0 | Metadata (tagging) | | 1-4 | Offset in page | | 4-7 | Page choice | |------------------+--------------------| This gives us pow(2, 24) - 1 = 16777215 possible pages and pow(2, 32) - 1 = 4294967295 offsets in each page. Thus our total addressable memory would be pow(2, 56) - 1 = 72057594037927935 bytes. Certainly no machine would ever have this much memory and so we're quite safe for most machines. That reserved management byte for our purposes (tagging, currently) will make the math to translate it a bit easier. Let's reason about how we'd encode and decode these addresses. The arena itself should provide addresses with the management byte set to 0 for the user to encode what they wish. The top bytes should be encoded as per the above i.e. top 3 bytes as the page index, next 4 bytes as the offset in that page. This shouldn't be super difficult when we're doing it within the management functions of the arena itself as this data should be handy when performing the allocation. When decoding these addresses i.e. retrieving data i.e. translating from a managed pointer to a host pointer, all it will need to do is convert the pointer into a byte buffer and copy the top 3 bytes as a page index and the next 4 bytes as the offset in the page. Once these are verified to be valid, we can just access the underlying pages and get the host pointer. Because of how arenas work, those host pointers will be stable regardless of any further memory management functions performed on the arena (excluding cleanup) - so once you have a host pointer, you can use it as much as you want without having to worry about the pointer becoming invalid in the next second. ** 2) 48-bit addressing exploit Most x86 CPUs only use around 48-56 bits for actual memory addresses - mostly as a result of not needing _nearly_ as many addresses as a full 64 bit word would provide. So we /could/ get away with using one of those bytes for our administrative tasks. Since the smallest remit we have is one byte, we'll stick to that (but maybe we could go for two bytes - need to investigate further). This byte should be the MSB, but using that for tagging will require more work than the lowest byte (to look at it we'll need to push that byte all the way down). So we'll be going for a low byte strategy by shifting the pointer up by one byte. This leaves us with the lowest byte to play with as we choose.