diff options
author | Aryadev Chavali <aryadev@aryadevchavali.com> | 2025-05-28 23:54:34 +0100 |
---|---|---|
committer | Aryadev Chavali <aryadev@aryadevchavali.com> | 2025-05-28 23:54:34 +0100 |
commit | bfff660d0e1a03063b889f84e8cfd046565c6046 (patch) | |
tree | 925ef120d7a20bafaed34fcf87dc5bb150ea3e3d /oats.org | |
parent | 12de1e8db90bccd5a0eefd21075f07c7b7e3dfaa (diff) | |
download | oats-bfff660d0e1a03063b889f84e8cfd046565c6046.tar.gz oats-bfff660d0e1a03063b889f84e8cfd046565c6046.tar.bz2 oats-bfff660d0e1a03063b889f84e8cfd046565c6046.zip |
Rename tasks.org to oats.org, restructure
Diffstat (limited to 'oats.org')
-rw-r--r-- | oats.org | 207 |
1 files changed, 207 insertions, 0 deletions
diff --git a/oats.org b/oats.org new file mode 100644 index 0000000..dbad845 --- /dev/null +++ b/oats.org @@ -0,0 +1,207 @@ +#+title: Oats tracker +#+author: Aryadev Chavali +#+description: A general tracker for work being done on the project +#+FILETAGS: :oats: + +* Issues :issues: +** TODO Fix issue with memcpy overlap when string concatenating +[[file:lisp/lisp.c::// FIXME: Something is going wrong here!]] + +Ideas on what's going wrong: +- String sizes seem off +- Maybe something is wrong with arena allocator; we use + [[file:lib/sv.c::newsv.data = arena_realloc(allocator, sv.data, + sv.size, newsv.size);][arena_realloc]] which seems to be the root of + the memcpy-overlap +* Features :features: +** WIP Reader :reader: +We want something a bit generic: able to handle reading from some +buffer of memory (a string, or contents of a file where we can read +the entire thing at once) or directly from a file stream (STDIN, +network streams, etc). + +We don't need a tokeniser - the basic grammar of a Lisp is really easy +to narrow down, so we can skip tokenisation and go straight for +parsing. + +We also want to be able to admit when reading went wrong for some +reason with proper errors messages (i.e. can be read by Emacs) - this +will need to be refactored when we introduce errors within the Lisp +runtime itself. +*** TODO Consider user instantiated reader macros +We don't have an evaluator so we can't really interpret whatever a +user wants for a reader macro currently, but it would be useful to +think about it now. Currently I have a single function which deals +with parsing reader macros but that's just static. + +Thing is, it does have the context as a parameter to pass to delegate +functions (such as ~parse_vec~) - wouldn't be a massive jump to also +consider user environments via the context. + +[[file:reader.c::perr_t parse_reader_macro(context_t *ctx, input_t +*inp, lisp_t **ret)][function link]] +*** TODO Parse exponential notation +We're erroring out here due to not having proper reader notation +[[file:examples/r7rs-tests.scm::test #t (real? #e1e10)]] +** TODO Evaluator +** TODO Runtime errors +** TODO Better numerics +We currently admit fixed size integers (63 bits). We _need_ more to +be a scheme. +*** Unfixed size integers +*** Rationals +*** Floats +*** Complex numbers +** TODO Primitive operations +** TODO Macros +** TODO Modules +* Completed :completed: +** DONE More efficient memory model for symbols +The primitive model for symbol allocation is an 8 byte number +representing the size of the symbol, followed by a variable number of +characters (as bytes). This is stored somewhere in the memory +context, which will likely be in the heap. + +We're actually wasting a ridiculous amount of memory with this model. +We'll almost never be using the full 64 bits of the size to represent +a symbol (who's ever going to go close to 1.8 quintillion bytes for a +single symbol?) - and even more annoying, there are tons of small +sized symbols where we actually need _more space_ for the 8 byte size +than the underlying symbol data. + +I think there's definitely a better model available, at least for +smaller symbols. We already have inlined integers where the pointer +itself is the integer, why can't we do the same for symbols? +A pointer has 8 bytes of data to use - one character is one byte - +thus we can represent 8 character symbols in one pointer. + +If we want this to still be within the remit of our pointer tagging +scheme, we'll need a bit of space for our administrative purposes +(admin slop...). So let's take _one_ byte out of that 8 for that. So +we can represent any symbols 7 bytes long in a single pointer. We'll +need to keep in mind we want to represent symbols that may be less +than 7 characters, so that one admin byte is going to be doing some +heavy lifting. + +Let's examine that one admin byte: ++ At least 3 bits are necessary for the actual pointer tag: "at least" + because we might increase the size of the tag based on demand ++ Thus, 5 bits left for our use - let's fit the size in there. + pow(2,6) - 1 = 63, so we have _way_ more than we need + +What are the benefits of doing this?: ++ Symbol equality for small symbols is a cinch: just compare the two + tagged "pointers" ++ 7 or less character symbols require no memory allocation, just + working off the stack + +One thing to note is that for more than 7 character symbols, we'll +need to allocate memory. But, in the worst case of 8 character +symbols, we're only allocating two 64 bit integers: these are easy to +walk on x86 and we've reached at least parity between the memory +required for administration (the size number) and the actual data. +*** Being more aggressive? +Technically, ANSI bytes only need 7 bits. For each of the 7 bytes +used for the character data, we can take one bit off, leaving us with +7 bits to use for an additional character. We don't need to adjust +anything else in the schema. + +So, hypothetically we could represent up to 8 character symbols! This +would require packing the characters more aggressively into a single +pointer. Let's look at the layout of our pointers. This table is +indexed from most significant to least i.e. 0 is the MSB and 63 is the +LSB: + +|-------+------------| +| Index | Usage | +|-------+------------| +| 0-55 | Characters | +| 56-60 | Size | +| 61-63 | Tag | +|-------+------------| + +Honestly though, for an extra byte of information we'll probably have +to do a lot more work. x86-64 CPUs are much better at walking bytes +than they are walking 7 bit offsets. This may be something to +consider if CPU time is cheaper than allocating 8 byte symbols +somewhere. +** DONE Tagging scheme based on arena pages +2025-04-09:21:59:29: We went for option (2) of just taking a byte for +free from the memory address and using it as our management byte. +*** 1) Page-offset schema +I've realised arenas are way better than the standard array dynamic I +was going for before. However, we lose the nicer semantics of using +an array index for pointers, where we can implement our own semantics +regarding what bits in that pointer are free to use, when using a +normal stable pointer into the arena; the host operating system has +its own semantics regarding how pointers are arranged and this _will_ +change between operating systems. In particular, because of the way +I've arranged pages, we can't use the classic "div by 8" methodology +where new allocations on the heap generally must be aligned by 8 bytes +(u64), so we can use those 3 bits at the bottom for our tagging; +offsets into pages are where our pointers will lie and they won't +necessarily be divisible by 8. + +So we can't use the pointers directly into the pages - we'll call +these types of pointers `host pointers`, because once we have them +it's trivial to access the underlying data. We'll call the pointers +we want to make `managed pointers` because we're managing the memory +system associated with them. We want to be able to translate from +managed pointers to host pointers. + +Managed pointers are really just encodings for direct access into the +arena memory. So in 8 bytes, we need to encode both the page and the +specific offset in that page where the pointer is pointing to. We +also want to leave space for tagging and any metadata we might want to +store in the pointer to that data. A schema I could think of was: + +|------------------+--------------------| +| Index (in bytes) | Representation | +|------------------+--------------------| +| 0 | Metadata (tagging) | +| 1-4 | Offset in page | +| 4-7 | Page choice | +|------------------+--------------------| + +This gives us pow(2, 24) - 1 = 16777215 possible pages and +pow(2, 32) - 1 = 4294967295 offsets in each page. Thus our total +addressable memory would be pow(2, 56) - 1 = 72057594037927935 bytes. + +Certainly no machine would ever have this much memory and so we're +quite safe for most machines. That reserved management byte for our +purposes (tagging, currently) will make the math to translate it a bit +easier. + +Let's reason about how we'd encode and decode these addresses. The +arena itself should provide addresses with the management byte set to +0 for the user to encode what they wish. The top bytes should be +encoded as per the above i.e. top 3 bytes as the page index, next 4 +bytes as the offset in that page. This shouldn't be super difficult +when we're doing it within the management functions of the arena +itself as this data should be handy when performing the allocation. + +When decoding these addresses i.e. retrieving data i.e. translating +from a managed pointer to a host pointer, all it will need to do is +convert the pointer into a byte buffer and copy the top 3 bytes as a +page index and the next 4 bytes as the offset in the page. Once these +are verified to be valid, we can just access the underlying pages and +get the host pointer. Because of how arenas work, those host pointers +will be stable regardless of any further memory management functions +performed on the arena (excluding cleanup) - so once you have a host +pointer, you can use it as much as you want without having to worry +about the pointer becoming invalid in the next second. +*** 2) 48-bit addressing exploit +Most x86 CPUs only use around 48-56 bits for actual memory addresses - +mostly as a result of not needing _nearly_ as many addresses as a full +64 bit word would provide. + +So we /could/ get away with using one of those bytes for our +administrative tasks. Since the smallest remit we have is one byte, +we'll stick to that (but maybe we could go for two bytes - need to +investigate further). + +This byte should be the MSB, but using that for tagging will require +more work than the lowest byte (to look at it we'll need to push that +byte all the way down). So we'll be going for a low byte strategy by +shifting the pointer up by one byte. This leaves us with the lowest +byte to play with as we choose. |