diff options
author | Aryadev Chavali <aryadev@aryadevchavali.com> | 2025-05-28 23:54:34 +0100 |
---|---|---|
committer | Aryadev Chavali <aryadev@aryadevchavali.com> | 2025-05-28 23:54:34 +0100 |
commit | bfff660d0e1a03063b889f84e8cfd046565c6046 (patch) | |
tree | 925ef120d7a20bafaed34fcf87dc5bb150ea3e3d /tasks.org | |
parent | 12de1e8db90bccd5a0eefd21075f07c7b7e3dfaa (diff) | |
download | oats-bfff660d0e1a03063b889f84e8cfd046565c6046.tar.gz oats-bfff660d0e1a03063b889f84e8cfd046565c6046.tar.bz2 oats-bfff660d0e1a03063b889f84e8cfd046565c6046.zip |
Rename tasks.org to oats.org, restructure
Diffstat (limited to 'tasks.org')
-rw-r--r-- | tasks.org | 230 |
1 files changed, 0 insertions, 230 deletions
diff --git a/tasks.org b/tasks.org deleted file mode 100644 index 179e1a2..0000000 --- a/tasks.org +++ /dev/null @@ -1,230 +0,0 @@ -#+title: Tasks -#+date: 2025-02-18 - -* WIP Implement a reader -We want something a bit generic: able to handle reading from some -buffer of memory (a string, or contents of a file where we can read -the entire thing at once) or directly from a file stream (STDIN, -network streams, etc). - -We don't need a tokeniser - the basic grammar of a Lisp is really easy -to narrow down, so we can skip tokenisation and go straight for -parsing. - -We also want to be able to admit when reading went wrong for some -reason with proper errors messages (i.e. can be read by Emacs) - this -will need to be refactored when we introduce errors within the Lisp -runtime itself. -** TODO Implement floats and rationals -Rationals are pretty easy - just two integers (quotient and divisor) - -so a tagged cons cell would do the job. Floats are a bit more -difficult since I'd either need to box them or find a creative way of -sticking IEEE-754 floats into < 64 bits. - -Also implement a reader macro for #e<scientific form>. Also deal with -[-,+,]inf(.0) and [-,+,]nan(.0). - -Need to do some reading. - -[[file:r7rs-tests.scm::test #t (real? #e1e10)][trigger]] -** TODO Consider user instantiated reader macros -We don't have an evaluator so we can't really interpret whatever a -user wants for a reader macro currently, but it would be useful to -think about it now. Currently I have a single function which deals -with parsing reader macros but that's just static. - -Thing is, it does have the context as a parameter to pass to delegate -functions (such as ~parse_vec~) - wouldn't be a massive jump to also -consider user environments via the context. - -[[file:reader.c::perr_t parse_reader_macro(context_t *ctx, input_t -*inp, lisp_t **ret)][function link]] -* TODO Consider Lisp runtime errors -* TODO Admit arbitrarily sized integers -Currently we admit fixed size integers of 63 bits. They use 2s -complement due to x86 which means our max and min are 62 bit based. - -However, to even try to be a scheme implementation we need to allow -arbitrarily sized integers. What are the specific tasks we need to -complete in our model to achieve this?: -+ Allow "reading" of unfixed size integers - + This will require reading a sequence of base 10 digits without - relying on strtold -+ Implement unfixed size integers into our memory model - + Certainly, unfixed size integers cannot be carried around like - fixnums wherein we can embed an integer into the pointer. - Thus we have to allocate them in memory. - + NOTE: There will be definitely be an optimisation to be done - here; integers that are within the bound of a fixnum could be - left as a fixnum then "elevated" to an integer when needed - + I think the big idea is allocating them as a fixed set of bytes - like big symbols. For big integers we have to read the memory - associated thus we need a pointer. Due to 2s complement it should - be trivial to increase the size of an integer to fit a new result - i.e. if I'm adding two integers and that leads to an "overflow" - where the result is of greater width than its inputs, we should - just allocate new memory for it. - -Consequences: -- Greater memory use - - In fact exponential if we need to allocate a whole new integer per - operation rather than utilising the input memory -- Possible loss of performance due to making integers over fixnums - when they don't need to be -- Comparison is harder on integers -- Harder to cache for the CPU - -but all of this is to be expected when the user is an idiot. -* TODO Think about how to perform operations on different types -** TODO Integers -** TODO Symbols -** TODO Pairs -* DONE More efficient memory model for symbols -The primitive model for symbol allocation is an 8 byte number -representing the size of the symbol, followed by a variable number of -characters (as bytes). This is stored somewhere in the memory -context, which will likely be in the heap. - -We're actually wasting a ridiculous amount of memory with this model. -We'll almost never be using the full 64 bits of the size to represent -a symbol (who's ever going to go close to 1.8 quintillion bytes for a -single symbol?) - and even more annoying, there are tons of small -sized symbols where we actually need _more space_ for the 8 byte size -than the underlying symbol data. - -I think there's definitely a better model available, at least for -smaller symbols. We already have inlined integers where the pointer -itself is the integer, why can't we do the same for symbols? -A pointer has 8 bytes of data to use - one character is one byte - -thus we can represent 8 character symbols in one pointer. - -If we want this to still be within the remit of our pointer tagging -scheme, we'll need a bit of space for our administrative purposes -(admin slop...). So let's take _one_ byte out of that 8 for that. So -we can represent any symbols 7 bytes long in a single pointer. We'll -need to keep in mind we want to represent symbols that may be less -than 7 characters, so that one admin byte is going to be doing some -heavy lifting. - -Let's examine that one admin byte: -+ At least 3 bits are necessary for the actual pointer tag: "at least" - because we might increase the size of the tag based on demand -+ Thus, 5 bits left for our use - let's fit the size in there. - pow(2,6) - 1 = 63, so we have _way_ more than we need - -What are the benefits of doing this?: -+ Symbol equality for small symbols is a cinch: just compare the two - tagged "pointers" -+ 7 or less character symbols require no memory allocation, just - working off the stack - -One thing to note is that for more than 7 character symbols, we'll -need to allocate memory. But, in the worst case of 8 character -symbols, we're only allocating two 64 bit integers: these are easy to -walk on x86 and we've reached at least parity between the memory -required for administration (the size number) and the actual data. -** Being more aggressive? -Technically, ANSI bytes only need 7 bits. For each of the 7 bytes -used for the character data, we can take one bit off, leaving us with -7 bits to use for an additional character. We don't need to adjust -anything else in the schema. - -So, hypothetically we could represent up to 8 character symbols! This -would require packing the characters more aggressively into a single -pointer. Let's look at the layout of our pointers. This table is -indexed from most significant to least i.e. 0 is the MSB and 63 is the -LSB: - -|-------+------------| -| Index | Usage | -|-------+------------| -| 0-55 | Characters | -| 56-60 | Size | -| 61-63 | Tag | -|-------+------------| - -Honestly though, for an extra byte of information we'll probably have -to do a lot more work. x86-64 CPUs are much better at walking bytes -than they are walking 7 bit offsets. This may be something to -consider if CPU time is cheaper than allocating 8 byte symbols -somewhere. -* DONE Tagging scheme based on arena pages -2025-04-09:21:59:29: We went for option (2) of just taking a byte for -free from the memory address and using it as our management byte. -** 1) Page-offset schema -I've realised arenas are way better than the standard array dynamic I -was going for before. However, we lose the nicer semantics of using -an array index for pointers, where we can implement our own semantics -regarding what bits in that pointer are free to use, when using a -normal stable pointer into the arena; the host operating system has -its own semantics regarding how pointers are arranged and this _will_ -change between operating systems. In particular, because of the way -I've arranged pages, we can't use the classic "div by 8" methodology -where new allocations on the heap generally must be aligned by 8 bytes -(u64), so we can use those 3 bits at the bottom for our tagging; -offsets into pages are where our pointers will lie and they won't -necessarily be divisible by 8. - -So we can't use the pointers directly into the pages - we'll call -these types of pointers `host pointers`, because once we have them -it's trivial to access the underlying data. We'll call the pointers -we want to make `managed pointers` because we're managing the memory -system associated with them. We want to be able to translate from -managed pointers to host pointers. - -Managed pointers are really just encodings for direct access into the -arena memory. So in 8 bytes, we need to encode both the page and the -specific offset in that page where the pointer is pointing to. We -also want to leave space for tagging and any metadata we might want to -store in the pointer to that data. A schema I could think of was: - -|------------------+--------------------| -| Index (in bytes) | Representation | -|------------------+--------------------| -| 0 | Metadata (tagging) | -| 1-4 | Offset in page | -| 4-7 | Page choice | -|------------------+--------------------| - -This gives us pow(2, 24) - 1 = 16777215 possible pages and -pow(2, 32) - 1 = 4294967295 offsets in each page. Thus our total -addressable memory would be pow(2, 56) - 1 = 72057594037927935 bytes. - -Certainly no machine would ever have this much memory and so we're -quite safe for most machines. That reserved management byte for our -purposes (tagging, currently) will make the math to translate it a bit -easier. - -Let's reason about how we'd encode and decode these addresses. The -arena itself should provide addresses with the management byte set to -0 for the user to encode what they wish. The top bytes should be -encoded as per the above i.e. top 3 bytes as the page index, next 4 -bytes as the offset in that page. This shouldn't be super difficult -when we're doing it within the management functions of the arena -itself as this data should be handy when performing the allocation. - -When decoding these addresses i.e. retrieving data i.e. translating -from a managed pointer to a host pointer, all it will need to do is -convert the pointer into a byte buffer and copy the top 3 bytes as a -page index and the next 4 bytes as the offset in the page. Once these -are verified to be valid, we can just access the underlying pages and -get the host pointer. Because of how arenas work, those host pointers -will be stable regardless of any further memory management functions -performed on the arena (excluding cleanup) - so once you have a host -pointer, you can use it as much as you want without having to worry -about the pointer becoming invalid in the next second. -** 2) 48-bit addressing exploit -Most x86 CPUs only use around 48-56 bits for actual memory addresses - -mostly as a result of not needing _nearly_ as many addresses as a full -64 bit word would provide. - -So we /could/ get away with using one of those bytes for our -administrative tasks. Since the smallest remit we have is one byte, -we'll stick to that (but maybe we could go for two bytes - need to -investigate further). - -This byte should be the MSB, but using that for tagging will require -more work than the lowest byte (to look at it we'll need to push that -byte all the way down). So we'll be going for a low byte strategy by -shifting the pointer up by one byte. This leaves us with the lowest -byte to play with as we choose. |