New Garbage Collector!
It’s only been a few weeks since the Alpha release of Microvium and I’ve already almost rewritten the engine! There are two major changes I wanted to make to improve performance:
- I’ve moved to a great new garbage collection algorithm which is much faster and has a smaller memory footprint in both RAM and codespace.
- I’ve changed the data model to allow Microvium values to embed native pointers on 16-bit platforms.
The Old Data Model
To understand the new data model, it will help if I first explain the old one, so you can see how they compare.
Microvium “slots” are 16-bit. The term “slot” is what I’m using to refer to a memory location that holds a value. For example, a variable is a “slot”, and an object property has a slot for its value (and another slot for its key), etc.
In the old data model, Microvium reserved the top 2 bits of the 16-bit value to be a tag, leaving a 14-bit space for data associated with that tag.
The 2-bit tag tells you how the remaining 14 bits should be interpreted:
00₂: the data is a 14-bit signed integer
01₂: the data is a 14-bit offset into heap memory
10₂: the data is a 14-bit offset into data memory
11₂: the data is a 14-bit offset into the bytecode
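To make the old layout concrete, here is a minimal sketch in C of packing and unpacking such a slot. The names (`OldTag`, `old_tag`, `old_payload`, `old_make`) are made up for illustration and are not taken from the Microvium source; only the tag values come from the list above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; the tag values follow the list above. */
typedef enum {
    OLD_TAG_INT14       = 0, /* 00: 14-bit signed integer */
    OLD_TAG_HEAP_OFFSET = 1, /* 01: offset into heap memory */
    OLD_TAG_DATA_OFFSET = 2, /* 10: offset into data memory */
    OLD_TAG_BYTECODE    = 3  /* 11: offset into the bytecode */
} OldTag;

/* The tag lives in the top 2 bits of the 16-bit slot. */
static OldTag old_tag(uint16_t slot) {
    return (OldTag)(slot >> 14);
}

/* The remaining 14 bits are the payload. */
static uint16_t old_payload(uint16_t slot) {
    return (uint16_t)(slot & 0x3FFF);
}

/* Pack a tag and a 14-bit payload into one slot. */
static uint16_t old_make(OldTag tag, uint16_t payload) {
    assert(payload <= 0x3FFF); /* payload must fit in 14 bits */
    return (uint16_t)(((uint16_t)tag << 14) | payload);
}
```

Every access to a heap, data, or bytecode value has to go through this kind of mask-and-shift step before the offset can even be resolved to an address.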
Having values fit into 16 bits is convenient for the kind of device that Microvium targets: small microcontrollers with < 64kB of memory. Many of these are 16-bit architectures, because 16 bits is all you need to address all your memory, and moving to 32 bits increases the silicon size and power consumption.
14-bit integers are large enough for many purposes, such as counters, loop variables, indexes, byte manipulation, etc. For situations where more than 14 bits are required, Microvium has the above-mentioned “offset” values which point from the 16-bit slot to a larger memory allocation, somewhere in ROM or RAM. These larger memory allocations can hold any number of other value types, such as arrays, strings, 32-bit integers, function bytecode, etc.
Depending on their characteristics, these allocations could be stored in either ROM or RAM. This design uses offsets rather than explicit pointers for 2 reasons:
- Not all devices can fit a pointer into 14 bits. In fact, most can’t.
- A pointer stored in ROM (i.e. in the bytecode image) can’t contain an absolute address, because the image doesn’t know ahead of time where it will be placed (it could be placed anywhere in ROM).
Side point: what I’ve called “data memory” and “heap memory” here are both memory regions in RAM, but data memory is permanently allocated while heap memory is garbage collected. This is analogous to the difference between the following two allocations in C:
MyStruct myVariable1; // Permanently allocated
MyStruct* myVariable2 = malloc(sizeof(MyStruct)); // Could be freed later
This 16-bit data model seems pretty clever (or at least to me it did), but there are a number of issues with it and optimizations we can make. The first issue is that each of these respective offsets can only be 14 bits wide, giving an addressable space of only 16kB each. On the whole, I don’t mind Microvium targeting use cases where applications can’t exceed 16kB, but it does feel awfully tight. This is compounded by the fact that the engine itself compiles to about 7kB[1], so the engine is necessarily quite chunky relative to the largest bytecode programs it can run.
The other issues are easiest to see when comparing it to the new data model:
The New Data Model
The new data model uses a number of clever tricks to improve both performance and addressable size on 16-bit architectures.
In the new model, the tag is not a fixed size. It is 1 or 2 bits, and now in the lower position of the word rather than the upper.
If the lowest bit is a 0, the word is interpreted as a 16-bit pointer:
This assumes that pointers are always even, which is a safe assumption since Microvium heap allocations are always 2-byte aligned. It means that after checking the low bit, we don’t need to do any bit manipulation to extract the pointer value: the value is already a 16-bit pointer.
This means that, at least on 16-bit architectures, a native pointer can be stored directly in a memory slot (variable) without any need for encoding and decoding. This speeds up a lot of things, but especially the garbage collector (which I’ll talk about in a moment). This also means that Microvium can now address 4x as much RAM as it could previously, since it can address 64kB of space rather than 16kB.
This speedup is quite significant. The heap in Microvium is not allocated contiguously from the host (doing so would make it difficult to expand the heap dynamically when more space is needed), so resolving a 14-bit heap offset, as per the old scheme, involved a function call with fairly complex logic to determine what heap fragment the offset belongs to, and to find the allocation being referenced. With the new scheme, this all goes away.
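A minimal sketch of what that check looks like in C follows. The helper names are hypothetical, and the cast in `slot_to_pointer` is only meaningful on a target where native pointers are 16 bits wide; on a desktop host it just illustrates that there is no decoding step.

```c
#include <stdbool.h>
#include <stdint.h>

/* Low bit 0 means the slot holds a native pointer (hypothetical helper name). */
static bool is_pointer(uint16_t slot) {
    return (slot & 1) == 0;
}

/* The slot value already is the address, so "decoding" is just a cast.
 * Assumption: only valid to dereference on a platform with 16-bit pointers. */
static void *slot_to_pointer(uint16_t slot) {
    return (void *)(uintptr_t)slot;
}
```

Compare this with the old scheme, where the same operation needed a lookup to find which heap fragment an offset fell into.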
Bytecode (ROM) Pointers
Not everything can be addressed by 16-bits, even on a 16-bit architecture. In particular, ROM is not necessarily addressable with 16-bit pointers, since ROM may be larger than RAM, or it may be stored in a completely different address space (or even on an external device such as a serial flash).
Microvium intends to cater for these common situations, so the next kind of value that can be stored in a slot has the lower 2 bits set to 01₂, with the upper 14 bits holding an offset into the bytecode image:
Again, this makes use of the fact that bytecode addresses will be even (this was not necessarily true before, but I’ve made it true in this latest version). So to convert this value to an offset, we just have to shift right by 1 bit.
This only gives us up to 32kB of addressable bytecode ROM, but this is still pretty good. The bytecode file itself can be up to 64kB in size, but the ROM segment is limited to 32kB. I think this is more than acceptable for the kind of use cases that Microvium is currently intended for.
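Here is a small sketch of that encoding in C, assuming the layout described above (tag 01₂ in the low 2 bits, even offsets below 32kB). The function names are invented for this example and are not Microvium's own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Low 2 bits == 01 marks a bytecode (ROM) offset. */
static bool is_bytecode_offset(uint16_t slot) {
    return (slot & 3) == 1;
}

/* Encode an even offset below 32 kB into a slot. */
static uint16_t encode_bytecode_offset(uint16_t offset) {
    assert((offset & 1) == 0); /* bytecode addresses are even */
    assert(offset < 0x8000);   /* only 32 kB is addressable */
    return (uint16_t)((offset << 1) | 1);
}

/* Decode: shifting right by 1 drops the low tag bit. The other tag bit is
 * always 0 and becomes bit 0 of the (even) offset. */
static uint16_t decode_bytecode_offset(uint16_t slot) {
    return (uint16_t)(slot >> 1);
}
```

The largest encodable offset is 0x7FFE, which is where the 32kB limit comes from.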
This leaves us with the last tag, 11₂, to represent integers:
We still only have signed 14-bit integers, like before. But even for these, this scheme is more efficient than the old! In the old scheme, to decode a 14-bit integer from the 16-bit word, we needed to perform a sign-extension, since the upper 2 bits (tag bits) were always zero, even when representing negative values. A 14-bit-to-16-bit sign extension typically involves multiple operations. In this new scheme, we need only perform a single signed-right-shift by 2 bits, and the sign will automatically be correct!
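The difference is easier to see side by side. The sketch below uses hypothetical function names, and it assumes what practically every compiler for these targets provides: two's complement conversion to `int16_t` and an arithmetic (sign-preserving) right shift on negative values, both of which are implementation-defined in standard C.

```c
#include <stdint.h>

/* Old scheme: tag 00 sits in the top 2 bits, so bit 13 is the sign bit and
 * has to be copied into bits 14 and 15 by hand (XOR-and-subtract trick). */
static int16_t old_decode_int14(uint16_t slot) {
    uint16_t payload = slot & 0x3FFF;
    return (int16_t)((payload ^ 0x2000) - 0x2000);
}

/* New scheme: tag 11 sits in the low 2 bits, value in the upper 14. */
static uint16_t new_encode_int14(int16_t value) {
    return (uint16_t)(((uint16_t)value << 2) | 3);
}

/* New scheme decode: one signed right shift. The sign bit of the 14-bit
 * value is already bit 15 of the word, so it is preserved automatically
 * (assuming an arithmetic shift, see above). */
static int16_t new_decode_int14(uint16_t slot) {
    return (int16_t)((int16_t)slot >> 2);
}
```

On most 16-bit instruction sets the new decode is a pair of arithmetic-shift instructions, with no masking or branching.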
The New Garbage Collector
I like it when code gets smaller instead of bigger, and the new garbage collector (GC) is one of those cases. And with the new data model, the garbage collector is an order of magnitude faster than it used to be.
All the principles and objectives from my previous garbage-collection post still apply. In fact, I was well into the new design by the time I finished the implementation of the old GC, but I wanted to get the alpha release out before embarking wholesale on a major overhaul. This animation from last time summarizes the kind of environment that Microvium is meant to be operating in, where the circles represent different modules or tasks, and only one or a few at a time are expected to be active/scheduled, taking center stage and consuming their peak memory[2]:
The key takeaway from the aforementioned post is that the memory consumed by a task for only a short burst is typically significantly cheaper than the amount of memory it permanently reserves. This is because the average permanent memory per task is multiplied by the total number of tasks, while average peak memory per task is only multiplied by the number of simultaneously-active tasks running at their peak.
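As a rough way to write that down (the symbols here are shorthand invented for this sketch, not measurements): with N tasks in total, at most k of them at their peak at the same time, an average permanent memory per task of m_perm and an average burst memory per task of m_peak, the total RAM needed is approximately:

```latex
M_{\text{total}} \approx N \cdot m_{\text{perm}} + k \cdot m_{\text{peak}}
```

With made-up numbers N = 20, k = 1, m_perm = 100 bytes and m_peak = 1000 bytes, this gives 20 × 100 + 1 × 1000 = 3000 bytes. Trimming 10 bytes of permanent memory from every task saves 200 bytes, while trimming 10 bytes from every task's peak saves only 10.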
Single-threaded environments, of which I’m a huge fan, are even better for this since they enforce that there is only one “simultaneously-active” task: microtasks are scheduled serially and so each can consume all the resources for a short time, as long as they clean up thoroughly before the next microtask in the queue. If microtasks are kept short (e.g. less than 1ms) then responsiveness is still good and the system can still be “real-time-enough” for many real-world applications.
The new garbage collector is based on Cheney’s Algorithm (wiki). The general idea is this:
- When it’s time to garbage collect, the GC finds reachable (live) allocations and moves them to a new virtual heap.
- It then discards the old heap all at once.
The coolest new feature of the new Microvium GC is that it costs nothing at all to free an allocation! Dead allocations are simply never visited by this algorithm, and just get freed all in one go when the algorithm discards the old heap.
All living allocations have a cost during collection (this is true of all GC algorithms[3]). But also recall from my last post on GCs that microcontrollers can have 1000x more processing power relative to their memory[4] than desktop machines, so the fact that living allocations incur some (small) amount of processing time during every collection is probably not as impactful as the fact that living allocations continue to consume memory after the collection. In other words, don’t worry about how much CPU overhead a garbage collector has, and instead worry about designing your scripts such that their idle memory consumption is small, as you should be doing regardless.
This algorithm makes it fairly cheap to make tons of temporary allocations for intermediate computations or for passing data between functions, etc. And then when the task’s turn completes and the GC is run to clean up, these temporaries are discarded with no processing or memory overhead.
Since allocations in the Microvium heap are contiguous[5], creating new allocations is also cheap: it’s almost just bumping a pointer forward[6].
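For readers who haven't seen a bump allocator before, here is a minimal sketch in C covering the three steps involved: check for overflow, write a 2-byte header, and move the cursor forward. The fixed 256-byte heap, the names, and the header contents (just the payload size) are assumptions for illustration; Microvium's real header format and heap growth are not shown.

```c
#include <stddef.h>
#include <stdint.h>

#define HEAP_SIZE 256 /* arbitrary size for this sketch */

static uint16_t heap_words[HEAP_SIZE / 2]; /* 2-byte aligned storage */
static size_t alloc_cursor = 0;            /* byte offset of the next free byte */

/* Returns a pointer to the payload, or NULL if the heap chunk is full. */
static void *bump_alloc(uint16_t payload_size) {
    if (payload_size > HEAP_SIZE - 2) return NULL;
    /* 2-byte header plus the payload rounded up to an even size, so that
     * every allocation stays 2-byte aligned. */
    size_t total = 2u + (((size_t)payload_size + 1u) & ~(size_t)1);
    if (total > HEAP_SIZE - alloc_cursor) return NULL; /* overflow check */

    uint16_t *header = &heap_words[alloc_cursor / 2];
    *header = payload_size; /* hypothetical header: just the size */
    alloc_cursor += total;  /* the "bump" */
    return header + 1;      /* payload starts right after the header */
}
```

There is no free list to search and no per-allocation bookkeeping beyond the header, which is why temporaries are cheap to create.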
In the following animation, blocks represent allocations, and a block turns gray to represent that the allocation has become unreachable[7]; blue blocks are live (reachable) allocations. The program[8] can allocate quickly and cheaply by just bumping the allocation cursor/pointer forwards. Dead objects remain on the heap until collection. The collector then copies the live allocations to a new space and the old space is discarded in one fell swoop.
With the new data model, tracing the reachability graph is almost trivial: for each variable, check if the lowest bit is a zero. If it is, then the variable holds a 16-bit pointer to a reachable allocation. If not, the variable holds anything else — an integer or a pointer to ROM (which doesn’t need tracing). Reachable objects are copied into the new memory space and their internal slots are in turn investigated the same way.
Collection with this algorithm is quick and proceeds in a single pass with constant memory overhead — no mucking about with mark bits, free lists, work queues/stacks, etc.
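To show the shape of the algorithm, here is a toy Cheney-style collector in C. It is a simulation under several assumptions of my own and not Microvium's implementation: both spaces live in one array, a "pointer" is an even byte offset into that array, each allocation has a one-word header holding `slot_count * 2 + 1` (odd), a moved allocation has its header overwritten with its new (even) address, and the value 0 is treated as a null pointer.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SPACE_WORDS 64 /* size of each semispace, in 16-bit words */

static uint16_t mem[2 * SPACE_WORDS];
static uint16_t from_base = 0; /* word index where the active space starts */
static uint16_t alloc_top = 0; /* word index of the next free word */

/* Bump-allocate header + slots at *top; returns the payload's byte offset. */
static uint16_t alloc_in(uint16_t *top, uint16_t slot_count) {
    uint16_t header = *top;
    mem[header] = (uint16_t)((slot_count << 1) | 1); /* odd = live header */
    *top = (uint16_t)(header + 1 + slot_count);
    return (uint16_t)((header + 1) * 2); /* even, so the low bit is 0 */
}

/* Allocation as seen by the program (the mutator). */
static uint16_t gc_new(uint16_t slot_count) {
    /* A real engine would collect or grow the heap here. */
    assert(alloc_top + 1 + slot_count <= from_base + SPACE_WORDS);
    return alloc_in(&alloc_top, slot_count);
}

/* Returns the slot's value after collection, copying the target if needed. */
static uint16_t forward(uint16_t slot, uint16_t *to_top) {
    if ((slot & 1) || slot == 0) return slot; /* integer, ROM offset, or null */
    uint16_t payload = slot / 2;              /* word index of the payload */
    uint16_t header = mem[payload - 1];
    if ((header & 1) == 0) return header;     /* already moved: new address */
    uint16_t count = header >> 1;
    uint16_t new_ptr = alloc_in(to_top, count);
    memcpy(&mem[new_ptr / 2], &mem[payload], count * sizeof(uint16_t));
    mem[payload - 1] = new_ptr;               /* leave a forwarding address */
    return new_ptr;
}

static void collect(uint16_t *roots, int root_count) {
    uint16_t to_base = (from_base == 0) ? SPACE_WORDS : 0;
    uint16_t to_top = to_base;
    uint16_t scan = to_base;
    for (int i = 0; i < root_count; i++)
        roots[i] = forward(roots[i], &to_top);
    /* The to-space doubles as the work queue: everything between scan and
     * to_top has been copied but not yet traced. */
    while (scan < to_top) {
        uint16_t count = mem[scan] >> 1;
        for (uint16_t i = 1; i <= count; i++)
            mem[scan + i] = forward(mem[scan + i], &to_top);
        scan = (uint16_t)(scan + 1 + count);
    }
    from_base = to_base; /* the old space is discarded all at once */
    alloc_top = to_top;
}
```

The only state the collector needs beyond the two spaces is the `scan` and `to_top` cursors, which is where the constant memory overhead comes from. Dead allocations are never read at all.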
Bonus: Array and Object compaction
As a bonus, this new implementation of the GC also compacts arrays and objects on the fly.
Dynamically-sized arrays consume more space than needed so that appending to them is quick: every time they run out of space, their capacity is doubled to make room for twice as many elements. The GC algorithm truncates the capacity of these arrays on the fly, so they use the smallest possible amount of space.
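A sketch of both halves of that trade-off in C: appending doubles the capacity, and the copy made during collection keeps only the used elements. `DynArray` and both functions are hypothetical stand-ins using `malloc`/`realloc`, not Microvium's array representation.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    uint16_t length;   /* elements in use */
    uint16_t capacity; /* elements allocated */
    uint16_t *items;
} DynArray;

/* Append, doubling the capacity when full. Returns 0 on failure. */
static int array_push(DynArray *a, uint16_t value) {
    if (a->length == a->capacity) {
        if (a->capacity > 0x7FFF) return 0; /* doubling would overflow */
        uint16_t new_cap = a->capacity ? (uint16_t)(a->capacity * 2) : 1;
        uint16_t *grown = realloc(a->items, new_cap * sizeof *grown);
        if (!grown) return 0;
        a->items = grown;
        a->capacity = new_cap;
    }
    a->items[a->length++] = value;
    return 1;
}

/* What a copying collector can do while moving the array: copy only the
 * used elements, so that capacity == length afterwards. */
static int array_copy_compact(const DynArray *from, DynArray *to) {
    to->items = NULL;
    to->length = to->capacity = from->length;
    if (from->length == 0) return 1;
    to->items = malloc(from->length * sizeof *to->items);
    if (!to->items) return 0;
    memcpy(to->items, from->items, from->length * sizeof *to->items);
    return 1;
}
```

Because the collector has to copy the array anyway, the truncation costs nothing extra.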
The case is similar for objects. Dynamic objects in Microvium are represented in memory by a linked list of property cells (key-value pairs with a next pointer), so that appending new properties is quick. But the linked list consumes twice the amount of space of an equivalent contiguous array, so the GC compacts these linked lists into contiguous property arrays on the fly. This gives the best of both worlds, making it efficient to link a new property cell onto an existing object, but having that property cell subsumed into the contiguous arrays during GC compaction so that each property takes the smallest possible space while dormant[9].
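The compaction step can be sketched in C as follows. The struct and function names are invented for this example, and it uses native pointers and `malloc` rather than Microvium's 16-bit slots and GC heap, so the cell sizes on a desktop host will not match the byte counts in the footnote.

```c
#include <stdint.h>
#include <stdlib.h>

/* Linked-list form: cheap to append to, but carries a next pointer
 * (and, in a real heap, an allocation header) per property. */
typedef struct PropertyCell {
    uint16_t key;
    uint16_t value;
    struct PropertyCell *next;
} PropertyCell;

/* Compacted form: just the key and the value. */
typedef struct {
    uint16_t key;
    uint16_t value;
} Property;

/* Walks the list twice: once to count, once to copy in order.
 * Stores the property count in *count_out. Returns NULL if the list is
 * empty or if allocation fails. */
static Property *compact_properties(const PropertyCell *head,
                                    uint16_t *count_out) {
    uint16_t count = 0;
    for (const PropertyCell *c = head; c != NULL; c = c->next) count++;
    *count_out = count;
    if (count == 0) return NULL;

    Property *props = malloc(count * sizeof *props);
    if (props == NULL) return NULL;

    uint16_t i = 0;
    for (const PropertyCell *c = head; c != NULL; c = c->next) {
        props[i].key = c->key;
        props[i].value = c->value;
        i++;
    }
    return props;
}
```

After compaction, looking up a property is a scan over a contiguous array instead of a pointer chase, and new properties added later can still be linked on as cells until the next collection.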
[1] This is measured on an MSP430, with floating-point support disabled. Presumably this number will grow over time.
[2] Read the full post for a more complete explanation.
[3] But this can be mitigated on desktop/server machines by the use of generational garbage collectors, an unnecessary complexity for MCUs.
[4] To put it another way, consider that many MCUs have the CPU power to loop through their entire RAM on the order of milliseconds, while desktop-class machines would do the same on the order of seconds.
[5] At least, contiguous within the large chunks of memory allocated from the host as the heap grows.
[6] Bumping the allocation pointer forward, checking for overflow, and assigning the 2-byte allocation header.
[7] Unreachable means that there is no way for your program/script to access it anymore, so it is subject to garbage collection.
[8] The program is called the mutator in GC parlance.
[9] A property in compacted form is 4 bytes: a 2-byte key and a 2-byte value. A property in linked-list form consumes extra bytes for the linked list pointers and the allocation header of each cell.