
Microvium has try-catch!

TL;DR: Microvium now supports exception handling, with just 4 bytes of overhead per active try block. This post covers some of the details of the design, for those who are interested.


Note: if you’re new to try-catch, see the MDN article on the subject. Note however that Microvium does not yet support finally, but in my experience finally is a less-used feature of JS.

Why try-catch?

One of my objectives with Microvium is to allow users to write JS code that is stylistically similar to the kind of code they would write for a larger engine like V8/Node.js. I want it to be easy to write JS libraries that run on both Microvium and Node, and the public interface of these libraries especially should not be contorted to fit the Microvium feature subset.

A major thing that’s been missing from this vision is the ability to use exceptions to pass errors. A library that can’t use exceptions to pass errors must use some other technique such as error codes, which are not typical JavaScript style and would clutter up the API of such a library.

A look at alternatives

Microvium is implemented in C. How does one implement exceptions in a language like C? In the journey of finding the best way to do exceptions, I thought of a number of different ideas.

The most obvious one to come to mind is something like setjmp/longjmp. This is a common way to implement exception-like behavior in a language like C, since it’s the only way to jump from one C function into another that’s earlier on the stack. At face value, this sounds sensible since both the engine and host are running on C.

Looking at other embedded JavaScript engines, mJS and Elk don’t support exceptions at all, but Moddable’s XS uses this setjmp/longjmp approach. At each try, XS mallocs a jump structure which is populated with the state of current virtual registers and physical registers, as well as a pointer to the catch code that would handle the exception. Upon encountering a throw, all the virtual and physical registers are restored to their saved state, which implicitly unwinds the stack, and then control is moved to the referenced catch block.

In my measurements[1], one of these jump structures in XS is 120-140 bytes on an embedded ARM architecture and 256-312 bytes on my x64 machine[2]. Even the jmp_buf on its own is 92 bytes on Cortex M0 (the architecture to which I’ve targeted all my size tests in Microvium). That’s quite heavy! Not to mention the processing overhead of doing a malloc and populating this structure (each time a try block is entered).

How it works in Microvium

After thinking about it for some time, the following is the solution I settled on. Consider the following code, where we have 2 variables on the stack, and 2 try-catch blocks:
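A listing consistent with the line references used below (a throw on line 5, catch (e1) on line 7, catch (e2) on line 11) might look like the following. This is a reconstruction for illustration and not necessarily the original listing; the variable values and the record helper are assumptions added so the control flow can be observed.

```javascript
try {
  let x = 'x';
  try {
    let y = 'y';
    throw new Error('thrown with ' + x + ' and ' + y);  // line 5
  }
  catch (e1) {                                           // line 7
    record('e1 caught it; x is still in scope: ' + x);
  }
}
catch (e2) {                                             // line 11
  record('e2 caught: ' + e2.message);
}

// Hypothetical helper for observing the control flow. Function declarations
// are hoisted, so it can be called from the code above.
function record(message) {
  record.messages = record.messages || [];
  record.messages.push(message);
}
```

Only the inner catch runs here: y has gone out of scope by the time control reaches catch (e1), while x is still alive because its enclosing block has not been exited.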

If we were to throw an exception on line 5, what needs to happen?

  1. We need to pop the variable y off the stack (but not x), since we’re exiting its containing scope. This is called unwinding the stack.
  2. We need to pass control to the e1 catch clause (line 7)
  3. We need to now record that the catch(e2) on line 11 is the next outer catch block, in case there is another throw

Microvium does this by keeping a linked list of try blocks on the stack. Each try block in the linked list is 2 words (4 bytes): the bytecode address of the corresponding catch block, plus a pointer to the next outer try block in the linked list. For the previous example, there will be 2 try blocks on the stack when we’re at line 5, as shown in the following diagram:

Each of the smaller rectangles here represents a 16-bit word (aka slot). The arrows here represent pointers. The red blocks (with an extra border) each represent an active try block consisting of 2 words (catch and nextTry). They form a linked list because each has a pointer to the next outer try block.

The topTry register points to the inner-most active try block — the head of the linked list. Each time the program enters a try statement, Microvium will push another try block to the stack and record its stack address in the topTry register.

The topTry register serves 2 purposes simultaneously. For one, it points to the head of the linked list, which allows us to find the associated catch block when an exception is thrown. But it also points to exactly the stack level we need to unwind to when an exception is thrown. This is similarly true for each nextTry pointer.

When an exception is thrown, Microvium will:

  1. Take the current topTry and unwind the stack to the level it points to.
  2. Transfer control (the program counter) to the bytecode address of the catch block which is now sitting at the top of the stack (i.e. pop the program counter off the stack).
  3. Save the nextTry pointer as the new topTry (i.e. pop the topTry register off the stack).

Et voila! We’ve implemented try-catch in Microvium with just 4 bytes of overhead per try.
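The three steps can be modelled in a few lines of JavaScript. This is a toy sketch, not Microvium's C implementation: the ToyVM class, the use of array lengths as stack pointers, and the exact slot order (nextTry pushed before the catch address) are assumptions chosen so that the pops happen in the order described above. The catch "addresses" 7 and 11 are stand-ins borrowed from the line numbers of the example.

```javascript
// Toy model only: a JavaScript array stands in for the VM stack and array
// lengths stand in for stack pointers. This is not Microvium's C source.
class ToyVM {
  constructor() {
    this.stack = [];
    this.topTry = null; // stack level of the innermost try block (null = none)
    this.pc = 0;        // program counter (a bytecode address)
  }

  // Entering a try statement pushes the two-word try block.
  startTry(catchAddress) {
    this.stack.push(this.topTry);    // nextTry: link to the next outer try block
    this.stack.push(catchAddress);   // catch: where to resume on a throw
    this.topTry = this.stack.length; // also the level to unwind to
  }

  // Leaving a try statement normally discards its try block.
  endTry() {
    this.stack.length = this.topTry;
    this.stack.pop();                // discard the catch address
    this.topTry = this.stack.pop();  // restore the outer try block
  }

  throwException() {
    if (this.topTry === null) throw new Error('Uncaught exception');
    this.stack.length = this.topTry; // 1. unwind the stack
    this.pc = this.stack.pop();      // 2. jump to the catch block
    this.topTry = this.stack.pop();  // 3. the next outer try becomes topTry
  }
}

const vm = new ToyVM();
vm.startTry(11);        // outer try; 11 and 7 are stand-in catch addresses
vm.stack.push('x');
vm.startTry(7);         // inner try
vm.stack.push('y');

vm.throwException();    // the throw inside the inner try
const afterFirstThrow = { pc: vm.pc, stack: vm.stack.slice(), topTry: vm.topTry };

vm.throwException();    // a second throw lands in the outer catch
const afterSecondThrow = { pc: vm.pc, stack: vm.stack.slice(), topTry: vm.topTry };
```

After the first throw, y has been unwound, x is still on the stack above the outer try block, and the program counter sits at the inner catch. The second throw unwinds everything and leaves no active try block.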

This also works when throwing from one function to a catch in a caller function. The only difference is that the unwind process in step 1 then needs to restore the registers of the caller (e.g. argument count, closure state, return address, etc). This information is already on the stack since it’s also used by the return instruction — we don’t need to save it separately during a try. This works for any level of nested calls if we just unwind repeatedly until reaching the frame containing the catch target.

Conclusion

I’m pretty satisfied with this design.

  • Only one new virtual register (topTry)
  • Only 4 bytes of RAM per active try[3]
  • No heap overhead
  • Conceptually simple
  • Only makes the engine 156 bytes larger in ROM

The static analysis to implement this is quite hairy: it’s complicated by the interaction between try-catch and return, break, and closures, among other things. For those interested, see the test cases for examples of the kinds of scenarios that try-catch needs to deal with.

On a meta-level, this is a good reminder of how the first idea that comes to mind is often not the best one, and the benefit of thinking through a problem before jumping right in. I’ve been brainstorming ideas about how exceptions might work in Microvium since July 2020 — almost 2 years ago — and I kept going back to the ideas to revise and reconsider, going back and forth between different options. Sometimes, mulling over an idea over a long period of time can help you get to a much better solution in the end.

Another lesson is that simplicity is deceptively time-consuming. Looking only at this apparently-simple final design, you might not guess at the amount of work required to get there. This is a general tragedy of software engineering: it often takes more time to get to a simpler solution, which gives the appearance of achieving less while taking longer to do so.


  1. Please, anyone correct me if I’ve made a mistake here or misrepresented XS’s design in any way. 

  2. The ranges in sizes here depend on whether I use the builtin jmp_buf type or the one supplied by XS 

  3. And 6 bytes of bytecode ROM per try block in the code. 

Snapshotting is like compiling but better

TL;DR: The final output of a traditional compiler like GCC bears a family resemblance to a Microvium snapshot, but the snapshotting paradigm is both easier to use and more powerful because it allows real application code to run at build time and its state to persist until runtime.

What is snapshotting?

My Microvium JavaScript engine is built on the paradigm of creating a VM snapshot as the deployable build artifact rather than creating a traditional compiled binary. As a developer, when you run the Microvium engine on your desktop with a command like microvium main.js it will execute the script until all the top-level code is complete and then output a snapshot file containing the final VM state. The snapshot file can then be “resumed” on an embedded device using Microvium’s embedded C library (for more details, see Getting Started). Although Microvium is designed especially for microcontrollers, the principle of snapshotting goes beyond the embedded space.

Comparing to GCC

For this post, I’ll mostly compare Microvium to GCC:

gcc main.c         # Compile a C program with GCC
microvium main.js  # "Compile" a Microvium program

These two commands are analogous. Both produce a single file as the result, and this file is what you want to deploy to the target environment. In the case of GCC, the output is of course the executable (e.g. a.out), while in the case of Microvium, it’s the snapshot (main.mvm-bc).

Both of these commands do some kind of compilation as part of the process. GCC translates your function code to machine instructions, while Microvium translates to virtual machine instructions (bytecode instructions).

Constants

Both the GCC output and the Microvium output have a section for constants, including function code. You may be familiar with this as the .text section (GCC typically places read-only data in the neighboring .rodata section). Among other things, this contains constant values, such as:

// JavaScript
const x = 42;
// C
const int x = 42;

… but better

You can do this in Microvium but not in C:

const x = foo();

function foo() {
  return 42;
}

In C, it’s a compile-time error to call a function in the initializer of a top-level (file-scope) constant, because such initializers must be constant expressions. But in Microvium, this is perfectly legal since there is no distinction between compile-time and runtime: there is only really runtime before the snapshot and runtime after the snapshot, although informally we may refer to the former as “compile-time” and the latter as “runtime”.

Apart from just being cleaner and easier to use, the Microvium snapshotting paradigm here allows computationally-intensive constants to be calculated at build time, using arbitrary functions and libraries that might also be useful at runtime.
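As a sketch of what a computationally-intensive constant could look like, the following builds a CRC-8 lookup table in top-level code, so that under the snapshotting model the loop would run before the snapshot and only the finished table would be carried to the device. The function names and the choice of polynomial (0x07) are illustrative assumptions, and the sketch assumes the language features it uses (arrays, loops, bitwise operators) are available in the engine.

```javascript
// Builds the 256-entry lookup table for a CRC-8 with the given polynomial.
function buildCrc8Table(polynomial) {
  const table = [];
  for (let byte = 0; byte < 256; byte++) {
    let crc = byte;
    for (let bit = 0; bit < 8; bit++) {
      // Shift left one bit; if the top bit fell out, XOR in the polynomial
      crc = (crc & 0x80) ? ((crc << 1) ^ polynomial) & 0xff : (crc << 1) & 0xff;
    }
    table[byte] = crc;
  }
  return table;
}

// Top-level code: this is the "expensive" part that runs before the snapshot
const CRC8_TABLE = buildCrc8Table(0x07);

// The function that would be used after the snapshot is resumed
function crc8(bytes) {
  let crc = 0;
  for (let i = 0; i < bytes.length; i++) {
    crc = CRC8_TABLE[crc ^ bytes[i]];
  }
  return crc;
}
```

The same buildCrc8Table function remains an ordinary function in the program, so it could also be called again later with a different polynomial if that were ever needed.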

Variable initializers… but better

For non-constant variables, both GCC and Microvium have an output section for the initial[1] value of all the variables, which is copied into RAM at runtime. You may know this traditionally as the .data section.

But a major difference between them is that a Microvium snapshot also contains heap data.

// JavaScript
let arr = [1, 2, 3];

// C
int* arr = malloc(3 * sizeof(int));  // ! Can't do this (in top-level code)

The best you could do in C for the above example would be to have an init function that runs early in the program to set up the initial runtime state of the program. Snapshotting is better because this initial structure can be established at build time.

Modules … but better

Microvium and C both have support for structuring your code in multiple files which get bundled into the same output artifact:

import { foo } from './foo.js'
#include "foo.h"

With the #include here, the dependent module implementation (e.g. foo.c) is not automatically compiled and linked into the program by GCC — you need to separately list foo.c to be compiled by GCC, or orchestrate the dependencies using a makefile.

But in the case of a Microvium import, the import statement itself is executed at build time, performing the module resolution, loading, parsing, and linking at build time, as well as executing the top-level code of the imported module. The top-level code of the imported module may in turn import other modules, transitively importing the whole module graph and executing its top-level initialization code.

Preprocessor… but better

Both C and Microvium support “compile-time” logic:

// C
#if USE_FOO_1
#define foo foo1
#else
#define foo foo2
#endif

// JavaScript
const foo = USE_FOO_1 ? foo1 : foo2;

Of course, you already knew that, because all the examples so far have demonstrated the fact that snapshotting allows you to run JavaScript at compile time. But I want to emphasize some of the key reasons why the snapshotting paradigm is better for this:

  • You don’t need two different programming languages (e.g. the preprocessor language and the C language).
  • Your “compile-time” code has the full power of the main language.
  • Your runtime code carries over the state from your compile-time code.

So in some sense, this unifies the preprocessor language with the main language. This applies similarly to other “compile-time languages”, such as makefiles and linker scripts. Or if you’re coming from the JS world, consider how the snapshotting paradigm obviates the need for a webpack.config (see Snapshotting vs Bundling).

But what about using the preprocessor to conditionally include different runtime logic, such as for different devices? For example, consider the following C code:

void myFunction(void) {
#if SOME_CONDITION
  doX();
#else
  doY();
#endif
}

Of course, it’s easy to see how this example might translate to JS:

function myFunction(someCondition) {
  if (someCondition) {
    doX();
  } else {
    doY();
  }
}

Now we have the bonus that our unit tests can inject someCondition to test both cases.

But doesn’t this code mean that both the doX and doY branches (and the someCondition parameter) now exist at runtime, taking up ROM space?

That’s why I’ve also been developing the experimental Microvium Boost: an optimizer that analyzes a snapshot and removes unused branches of code. For example, if the analysis shows that someCondition is always true in your program, it can remove it as a parameter from myFunction and also remove the call to doY as dead code. This is still experimental but has shown significant success so far.

Host exports… but better

So far we’ve been considering an executable output from GCC, but it would be more accurate to compare a Microvium snapshot with a compiled shared library (e.g. DLL). Like a shared library, Microvium snapshots do not have a single entry point but may contain many exported functions to be resolved at runtime on the device.

A Windows DLL suits the analogy better than a Linux shared library, so in this section the examples will use msbuild rather than GCC.

Both a Microvium snapshot and a DLL binary contain a section for dynamic linking information: a table that associates relevant functions in the DLL with a number[2] so that the host program using the DLL/snapshot at runtime can find them.

In the case of a DLL, you can provide the compiler with a DEF file that tells the compiler what to put into the DLL export table. If you wanted to export the functions foo, bar, and baz from the DLL with IDs 1, 2, and 3 respectively, your DEF file might look like this:

LIBRARY   MyLibrary
EXPORTS
   foo   @1
   bar   @2
   baz   @3

// C
int foo() { return 42; }
int bar() { return 43; }
int baz() { return 44; }

The equivalent in Microvium would be as follows:

// JavaScript
function foo() { return 42; }
function bar() { return 43; }
function baz() { return 44; }

vmExport(1, foo);
vmExport(2, bar);
vmExport(3, baz);

You may have noticed a recurring theme in this post: the Microvium snapshotting paradigm doesn’t require a whole new language in order to do different build-time tasks. In this case, Microvium doesn’t require a DEF file (or a special __declspec(dllexport) language extension), since vmExport is just a normal function. This is just simpler and more natural.

Another recurring theme here is that the snapshotting approach is more powerful, allowing you to do things that are impossible or impractical in the traditional paradigm. Take a look at the following example in Microvium:

// JavaScript
for (let i = 1; i <= 3; i++) {
  vmExport(i, () => 41 + i);
}

This has the same overall effect as the previous code, adding 3 functions to the export table of the deployed binary with IDs 1, 2, and 3 and which return 42, 43, and 44 respectively. But having vmExport be a normal function means that now we have the full power of the language for orchestrating these exports, or for writing an abstraction layer over the export system, or outsourcing this logic to a third-party library.
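As a rough sketch of what an abstraction layer over the export system could look like, the helper below assigns sequential IDs to a group of named functions. The exportAll helper, the recordedExports object, and the fallback used when vmExport is not defined are all hypothetical; the fallback exists only so that the sketch can also be run outside Microvium (for example under Node.js).

```javascript
// vmExport is the function used in the examples above. When this sketch runs
// somewhere it is not defined, fall back to a stand-in that records the
// exports in a plain object so they can be inspected.
const recordedExports = {};
const exportFunction = typeof vmExport === 'function'
  ? vmExport
  : (id, fn) => { recordedExports[id] = fn; };

// Hypothetical helper: export every function in `functions`, assigning IDs
// sequentially from `firstId`, and return the name-to-ID mapping.
function exportAll(firstId, functions) {
  const ids = {};
  let id = firstId;
  for (const name of Object.keys(functions)) {
    exportFunction(id, functions[name]);
    ids[name] = id;
    id++;
  }
  return ids;
}

const ids = exportAll(1, {
  foo: () => 42,
  bar: () => 43,
  baz: () => 44,
});
```

The returned mapping could then be used at build time to generate a matching list of IDs for the host firmware, which keeps the two sides in sync without a separate definition file.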

Side note: a more subtle point in this example for advanced readers is its code cohesion. The single line of code mixes both compile-time and runtime code (vmExport(i,...) and () => 41 + i respectively), but keeps the related parts of both in close proximity. This is the difference between temporal cohesion (grouping code by when it’s run) vs functional cohesion (grouping code by what feature it relates to) (see Wikipedia). A common disadvantage of having separate build-time or deploy-time code (e.g. a DEF file, makefile, linker script, webpack.config, dockerfile, terraform file, etc) is that it pushes you into temporal cohesion, which in turn damages modularity and reusability.

Conclusion

The idea of deploying a snapshot rather than a traditional compiled binary opens up a whole new paradigm for software development. The end result is very similar — a binary image with sections for different memory spaces, compiled function code, constants, initial variable values, and export/import tables — but the snapshotting paradigm is both simpler and more powerful.


  1. In the context of Microvium, the word “initial” here refers to the initial state when the snapshot is resumed, not the initial state when the program is started, since the program starts at build time, and variables in JavaScript start with the value undefined. 

  2. DLL exports can be by name or number, but Microvium exports are only by number, for efficiency reasons, so that’s what I’m using in the analogy here. 

Microvium is very small

TL;DR: The Microvium JavaScript engine for microcontrollers takes less than 16 kB of ROM and 64 bytes of RAM per VM while idle, making it possibly the smallest JavaScript engine to date with more language features than engines 4x its size.


I’ve designed Microvium from the ground up with the intention for it to be tiny, and it’s been an absolute success in that sense. Microvium may be the smallest JavaScript engine out there, but it still packs a punch in terms of features.

Edit: this post was originally written when Microvium was around 8.2 kB of ROM. Since then, new features have been added. As of August 2023, Microvium is now 12 kB.

Does size matter?

Size often matters in small MCU devices. A large proportion of microcontroller models available on the market still have less than 64 kB of flash and less than 2 kB of RAM. These are still used because they’re smaller, cheaper, and consume less power than their larger counterparts. All the microcontrollers I’ve worked with in my career as a firmware engineer have had ≤ 16 kB RAM.

Some might say that you shouldn’t even want JavaScript on such small devices, and certainly in some cases that would be true. But as I pointed out in my last post, juggling multiple operations in firmware can be both easier and more memory efficient if the high-level logic is described in terms of a language like JavaScript, even if that’s the only thing you’re using it for.

Even on larger devices, do you really want to dedicate a large chunk of it to a JavaScript engine? A smaller engine is a smaller commitment to make — a lower barrier to entry.

How does it compare?

If I Google “smallest JavaScript engine for microcontrollers”, the first one on the list is Elk. Elk is indeed pretty tiny. For me, it compiles to just 11.5 kB of flash[1]. Microvium compiled with the same settings compiles to about 12 kB, which is in the same ballpark.

What about RAM?

The amount of RAM Elk uses is not pre-defined — you give it a buffer of RAM of any size you want, but it needs to be at least 96 bytes for the VM kernel state. Microvium takes 36 bytes for the kernel state.

The big difference in memory requirements is that Elk requires all of its memory to be allocated upfront and keeps it for the lifetime of the VM. If your script’s peak memory in Elk is 1 kB then you need to give it a 1 kB buffer at startup, so its idle memory usage is 1 kB. Microvium on the other hand uses malloc and free to allocate when needed and free when not needed. Its idle memory usage can be as low as 88 bytes. In typical firmware, idle memory is much more important than peak memory, as I explained in my last post.

What about the feature set? This is another area where Microvium and Elk diverge significantly. The following table shows the differences:

Feature                                           Microvium   Elk
var, const (Elk supports let only)                Yes         No
do, switch, for                                   Yes         No
Computed member access a[b]                       Yes         No
Arrow functions, closures                         Yes         No
try, catch                                        Yes         No
async, await                                      Yes         No
Modules                                           Yes         No
Snapshotting                                      Yes         No
Uses intermediate bytecode (better performance)   Yes         No
Parser at runtime                                 No          Yes
ROM                                               12 kB       11.5 kB
Idle RAM                                          88 B        Lots
Peak kernel RAM                                   36 B        96 B
Slot size (size of simple variables)              2 B         8 B

The only thing that Elk can do that Microvium can’t do is execute strings of JavaScript text at runtime. So if your use case involves having human users directly provide scripts to the device, without any intermediate tools that could pre-process the script, then you can’t use Microvium and you might want to use Elk, mJS, or a larger engine like XS. On the other hand, if your use case has at any point a place where you can preprocess scripts before downloading them to the device then you can use Microvium.

Comparing with mJS

But Cesanta, the maker of Elk, also made a larger JS engine with more features: mJS, which is probably the closest match to Microvium in terms of feature set. mJS lets you write for-loops and switch statements for example.

Since they’re closely matched for intent and features, I did a more detailed comparison of mJS and Microvium here. But here’s a summary:

Feature                                           Microvium   mJS       Elk
var, const (mJS supports let only)                Yes         No        No
Template strings                                  Yes         No        No
Arrow functions and closures                      Yes         No        No
try, catch                                        Yes         No        No
async, await                                      Yes         No        No
ES Modules                                        Yes         No (*)    No
do, switch, for                                   Yes         Yes       No
Computed member access a[b]                       Yes         Yes       No
Uses intermediate bytecode (better performance)   Yes         Yes       No
Some builtin-functions                            No          Yes       No
Parser at runtime                                 No          Yes       Yes
ROM                                               12 kB       45.6 kB   11.5 kB
Slot size                                         2 B         8 B       8 B

(*) but mJS does support a non-standard load function

I’ve lumped “some builtin-functions” into one box because it’s not a language feature as such. mJS has a number of builtin functions that Microvium doesn’t have, most notably print, ffi, s2o, JSON.stringify, JSON.parse and Object.create. You can implement these yourself in Microvium quite easily without modifying the engine (or find implementations online), and it gives you the option of choosing what you want rather than having all that space forced on you[2].

In terms of features, mJS is a more “realistic” JavaScript engine, compared to Elk’s minimalistic approach. I wouldn’t want to write any substantial real-world JavaScript without a for-loop for example. Like Microvium, mJS also precompiles the scripts to bytecode and then executes the bytecode, which results in much better performance than trying to parse on the fly. Engines like Elk that parse as they execute also have the unexpected characteristic that comments and whitespace slow them down at runtime.

But the added features in mJS mean it costs a lot more in terms of ROM space: about 4x more than Elk and Microvium.

Microvium still has more core language features than mJS, making it arguably a more pleasant language to work in. These features are actually quite useful in certain scenarios:

  • Proper ES module support is important for code organization and means that your Microvium modules can also be imported into a node.js or browser environment. You can have the same algorithms shared by your edge devices (microcontrollers), backend servers, and web interfaces, to give your users a unified experience.
  • Closures are fundamental to callback-style asynchronous code, as I explained in my previous post.

Conclusion

I’m obviously somewhat biased since Microvium is my own creation, but the overall picture I get is this:

  • Microvium is the smallest JavaScript engine that I’m aware of[3]
  • In this tiny size, Microvium actually supports more core language features than engines more than 4x its size. Some of these features are really useful for writing real-world JS apps.
  • Having said that, Microvium has fewer built-in functions. It’s more of a pay-as-you-go philosophy where your upfront commitment is much less and you bring in support for what you need when you need it.
  • The big trade-off is that Microvium doesn’t have a parser at runtime. In the rare case that you really need a parser at runtime, Microvium simply won’t work for you.

Something that made me smile is this note by one of the authors of mJS in a blog post:

That makes mJS fit into less than 50k of flash space (!) and less than 1k of RAM (!!). That is hard to beat.

https://mongoose-os.com/blog/mjs-a-new-approach-to-embedded-scripting/

I have great respect for the authors of mJS and what they’ve done, which makes me all the more proud that Microvium is able to knock this out of the ballpark, beating what the seasoned professionals have called “hard to beat”. Of course, this comes with some tradeoffs (no parser and no builtin functions), but I’ve achieved my objective of making a JavaScript engine that has a super-low upfront commitment and will squeeze into the tiniest of free spaces, all while still including most of the language features I consider to be important for real-world JavaScript apps.


  1. All of the sizes quoted in this post are when targeting the 32-bit ARM Cortex M0 using GCC with optimization for size. I’m measuring these sizes in June 2022, and of course they may change over time. 

  2. The ffi in mJS is something that would need to be a built-in in most engines but Microvium’s unique snapshotting approach makes it possible to implement the ffi as a library just like any of the other functions 

  3. Please let me know if you know of a smaller JS engine than Microvium. 

Single-threading is more memory-efficient

TL;DR: Single-threading with super-loops or job queues may make more efficient use of a microcontroller’s memory over time, and Microvium’s closures make single-threading easier with callback-style async code.

Multi-threading

In my last post, I proposed the idea that we should think of the memory on a microcontroller not just as a space but as a space-time, with each memory allocation occupying a certain space for some duration of time. I suggested therefore that we should then measure the cost of an allocation in byte-seconds (the area of the above rectangles as bytes × seconds), so long as we assumed that allocations were each small and occurred randomly over time. Randomness like this is a natural byproduct of a multi-threaded environment, where at any moment you may coincidentally have multiple tasks doing work simultaneously and each taking up memory. In this kind of situation, tasks must be careful to use as little memory as possible because at any moment some other tasks may fire up and want to share the memory space for their own work.

The following diagram was generated by simulating a real malloc algorithm with random allocations over time (code here, and there’s a diagram with more allocations here):

A key thing to note is that the peak memory in this hypothetical chaotic environment can be quite a bit higher than the average, and that these peaks are not easily predictable and repeatable because they correspond to the coincidental execution of multiple tasks competing for the same memory space[1].

This leaves a random chance that at any moment you could run out of memory if too many things happen at once. You can guard against this risk by just leaving large margins — lots of free memory — but this is not a very efficient use of the space. There is a better way: single threading.

Single threading

Firmware is sometimes structured in a so-called super-loop design, where the main function has a single while(1) loop that services all the tasks in turn (e.g. calling a function corresponding to each task in the firmware). This structure can have a significant advantage for memory efficiency. In this way of doing things, each task essentially has access to all the free memory while it has its turn, as long as it cleans up before the next task, as depicted in the following diagram. (And there may still be some statically-allocated memory and “long-lived” memory that is dynamic but used beyond the “turn” of a task).

Overall, this is a much more organized use of memory and potentially more space-efficient.

In a multi-threaded architecture, if two memory-heavy tasks require memory around the same time, neither has to wait for the other to be finished — or to put it another way, malloc looks for available space but not available time for that space. On the other hand, in a super-loop architecture, those same tasks will each get a turn at different times. Each will have much more memory available to them during their turn while having much less impact on other tasks the rest of the time. And the overall memory profile is a bit more predictable and repeatable.

An animated diagram you may have seen before on my blog demonstrates the general philosophy here. A task remains idle until its turn, at which point it takes center stage and can use all the resources it likes, as long as it packs up and cleans up before the next task.

So, what counts as expensive in this new memory model?

It’s quite clear from the earlier diagram:

  • Memory that is only used for one turn is clearly very cheap. Tasks won’t be interrupted during their turn, so they have full access to all the free memory without impacting the rest of the system.
  • Statically-allocated memory is clearly the most expensive: it takes away from the available memory for all other tasks across all turns.
  • Long-lived dynamic allocations — or just allocations that live beyond a single turn — are back to the stochastic model we had with multi-threading. Their cost is the amount of space × the number of turns they occupy the space for. Because these are a bit unpredictable, they also have an additional cost because they add to the overall risk of randomly running out of memory, so these kinds of allocations should be kept as small and short as possible.
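To put rough numbers on the three cases, here is a small worked sketch of the "space × number of turns" cost. The allocation sizes, the turn counts, and the "byte-turns" unit name are made-up illustrative values, not measurements.

```javascript
// Each allocation occupies `bytes` from turn `start` to turn `end` inclusive.
function costInByteTurns(allocation) {
  return allocation.bytes * (allocation.end - allocation.start + 1);
}

const totalTurns = 100; // length of the observation window, in turns

const allocations = [
  // Used and released within a single turn
  { name: 'scratch buffer (one turn)', bytes: 512, start: 10, end: 10 },
  // Lives across 20 turns while waiting for something to complete
  { name: 'pending request (long-lived)', bytes: 64, start: 10, end: 29 },
  // Statically allocated: occupies its space for every turn
  { name: 'static table', bytes: 64, start: 0, end: totalTurns - 1 },
];

const costs = allocations.map(a => ({ name: a.name, cost: costInByteTurns(a) }));
```

With these values the 512-byte scratch buffer costs 512 byte-turns, the 64-byte long-lived allocation costs 1280, and the 64-byte static table costs 6400, even though the scratch buffer is by far the largest of the three.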

Microvium is designed this way

Microvium is built from the ground up on this philosophy — keeping the idle memory usage as small as possible so that other operations get a turn to use that memory afterward, but not worrying as much about short spikes in memory that last only a single turn.

  • The idle memory of a Microvium virtual machine is as low as 34 bytes[2].
  • Microvium uses a compacting garbage collector (one that consolidates and defragments all the living allocations into a single contiguous block) and releases any unused space back to the host firmware. The GC itself uses quite a bit of memory[3] but it does so only for a very short time and only synchronously.
  • The virtual call-stack and registers are deallocated when control returns from the VM back to the host firmware.
  • Arrays grow their capacity geometrically (they double in size each time) but a GC cycle truncates unused space in arrays when it compacts.

See here for some more details.

Better than super-loop: a Job Queue

The trouble with a super-loop architecture is that it services every single task in each cycle. It’s inefficient and doesn’t scale well as the number of tasks grows[4]. There’s a better approach, one that JavaScript programmers will be well familiar with: the job queue.

A job queue architecture in firmware is still pretty simple. Your main loop is just something like this:

while (1) {
  if (thereIsAJobInTheQueue) 
    doNextJob();
  else
    goToSleep();
}

When I write bare-metal firmware, often the first thing I do is to bring in a simple job queue like this. If you’re using an RTOS, you might implement it using RTOS queues, but I’ve personally found that the job-queue style of architecture often obviates the need for an RTOS at all.
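The shape of such a queue can be modelled in a few lines of JavaScript. This is a sketch for illustration only (firmware would typically implement it in C, and the job names are made up); it shows jobs running one at a time, in order, with a job free to enqueue follow-up work instead of blocking.

```javascript
const jobQueue = [];
const trace = [];

function enqueueJob(job) {
  jobQueue.push(job);
}

// One pass of the main loop body: run the next job if there is one.
// Returns false when the queue is empty, which is where firmware would sleep.
function runNextJob() {
  const job = jobQueue.shift();
  if (job === undefined) return false;
  job();
  return true;
}

enqueueJob(() => {
  trace.push('read sensor');
  // Follow-up work is scheduled as a separate job rather than done inline
  enqueueJob(() => trace.push('process reading'));
});
enqueueJob(() => trace.push('blink LED'));

while (runNextJob()) {
  // keep going until the queue is empty
}
```

The follow-up job runs after the LED job that was already waiting, so no single job monopolizes the processor.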

As JavaScript programmers may also be familiar with, working in a cooperative single-threaded environment has other benefits. You don’t need to think about locking, mutexes, race conditions, and deadlocks. There is less unpredictable behavior and fewer heisenbugs. In a microcontroller environment especially, a single-threaded design also means you save on the cost of having multiple dedicated call stacks being permanently allocated for different RTOS threads.

Advice for using job queues

JavaScript programmers have been working with a single-threaded job-queue-based environment for decades and are well familiar with the need to keep the jobs short. When running JS in a browser, long jobs mean that the page becomes unresponsive, and the same is true in firmware: long jobs make the firmware unresponsive, unable to respond to I/O or service accumulated buffers, etc. In a firmware scenario, you may want to keep all jobs below 1ms or 10ms, depending on what kind of responsiveness you need [5].

As a rule of thumb, to keep jobs short, they should almost never block or wait for I/O. For example, if a task needs to power on an external modem chip, it should not block while the modem boots up. It should probably schedule another job to handle the powered-on event later, allowing other jobs to run in the meantime.

But in a single-threaded environment, how do we implement long-running tasks without blocking the main thread? Do you need to create complicated state machines? JavaScript programmers will again recognize a solution…

Callback-based async

JavaScript programmers will be quite familiar with the pattern of using continuation-passing-style (CPS) to implement long-running operations in a non-blocking way. The essence of CPS is that a long-running operation should accept a callback argument to be called when the operation completes.

The recent addition of closures (nested functions) as a feature in Microvium makes this so much easier. Here is a toy example one might use for sending data to a server in a multi-step process that continues across 3 separate turns of the job queue:

function sendToServer(url, data) {
  modem.powerOn(powerOnCallback);

  function powerOnCallback() {
    modem.connectTo(url, connectedCallback);
  }

  function connectedCallback() {
    modem.send(data);
  } 
}

Here, the data parameter is in scope for the inner connectedCallback function (closure) to access, and the garbage collector will automatically free both the closure and the data when they aren’t needed anymore. A closure like this is much more memory-efficient than having to allocate a whole RTOS thread, and much less complicated than manually fiddling with state machines and memory ownership yourself.

Microvium also supports arrow functions, so you could write this same example more succinctly like this:

function sendToServer(url, data) {
  modem.powerOn( () => 
    modem.connectTo(url, () => 
      modem.send(data)));
}

Each of these 3 stages (powerOn, connectTo and send) happens in a separate job in the queue. Between each job, the VM is idle: it does not consume any stack space [6] and the heap is in a compacted state [7].

If you’re interested in more detail about the mechanics of how modem.powerOn etc. might be implemented in a non-blocking way, take a look at this gist where I go through this example in more detail.

Conclusion

So, we’ve seen that multi-threading can be a little hazardous when it comes to dynamic memory management because memory usage is unpredictable, and this also leads to inefficiencies because you need to leave a wider margin of error to avoid randomly running out of memory.

We’ve also seen how single-threading can help to alleviate this problem by allowing each operation to consume resources while it has control, as long as it cleans up before the next operation. The super-loop architecture is a simple way to achieve this but an event-driven job-queue architecture is more modular and efficient.

And lastly, we saw that the Microvium JavaScript engine for embedded devices is well suited to this kind of design, because its idle memory usage is particularly small and because it facilitates callback-style asynchronous programming. Writing code this way avoids the hassle and complexity of writing state machines in C, of manually keeping track of memory ownership across those states, and the pitfalls and overheads of multithreading.


  1. This simulation with random allocations is not a completely fair representation of how most firmware allocates memory during typical operation, but it shows the consequence of having many memory-consuming operations that can be preempted unpredictably or outside the control of the firmware itself. 

  2. Or 22 bytes on a 16-bit platform 

  3. In the worst case, it doubles the size of heap while it’s collecting 

  4. A super-loop also makes it more challenging to know when to put the device to sleep since the main loop doesn’t necessarily know when there aren’t any tasks that need servicing right now, without some extra work. 

  5. There will still be some tasks that need to be real-time and can’t afford to wait even a few ms in a job queue to be serviced. I’ve personally found that interrupts are sufficient for handling this kind of real-time behavior, but your needs may vary. Mixing a job queue with some real-time RTOS threads may be a way to get the best of both worlds — if you need it. 

  6. Closures are stored on the virtual heap. 

  7. It’s in a compacted state if you run a GC collection cycle after each event, which you would do if you cared a lot about idle memory usage. 

Microvium: updated memory model

TL;DR Microvium can now address up to 64 kB of ROM, up from 32 kB previously, and now runs more efficiently on small 32-bit devices such as an ARM Cortex MCU.

This is a minor update regarding the data model in Microvium (previous post here), for the kind of person who’s interested in this kind of thing.

Recap: Microvium uses 16-bit slots

As covered in the previous post, Microvium uses a 16-bit slot size — variables in Microvium are 16 bits each. This is unusual among the small embedded JavaScript engines. Cesanta’s mJS and Elk use a 64-bit slot size with NaN-boxing, and Moddable’s XS uses a 128-bit slot size (4x32bit words). So, if you have a variable with the number 42 in it, it will take 2 bytes in Microvium, 8 bytes in mJS, and 16 bytes in XS (on the other hand, the number 42.5 is a float and will take 12 bytes in Microvium since it overflows the slot into the heap, but it will still take 8 bytes in mJS and 16 bytes in XS).

Pointers and Paged Memory

If the lowest bit in the slot is 0, Microvium treats the value as a pointer to heap memory.

Since heap memory in Microvium is always 2-byte aligned, the lowest bit of a pointer will always be 0, so the value in the slot exactly corresponds to the pointer value, at least on 16-bit systems (or on 8-bit systems with a 16-bit address bus).

But what about 32-bit systems? Microvium is optimized for devices with 64 kB or less of RAM, but many devices with 64 kB of RAM are actually 32-bit devices. In these devices, there is usually a 32-bit address space, and some sub-range of these addresses will map to physical RAM (and some other range of addresses will map to physical ROM). Even though only 16 bits of information are required to index every byte of RAM, pointers are still 32 bits on these devices.

Previously, Microvium worked fine on 32-bit and 64-bit devices (I do all the testing on my 64-bit PC) but it did so through an expensive mapping table that mapped 16-bit VM addresses to their 32-bit or 64-bit native counterparts (like virtual memory implemented in software). The mapping itself was still O(1) in many cases [1], but it involved a function call every time a pointer needed to be mapped, which is a massive overhead to incur. I wasn’t too worried about this because I wasn’t aiming for 32-bit devices as my initial audience, but with the number of 32-bit devices out there, this would quickly become a problem.

To support this kind of scenario more elegantly, Microvium now has a port macro definition that allows you to specify the upper 16 bits of a 32-bit pointer (or upper 48 bits of a 64-bit pointer) for the platform [2].

For a more concrete example, the Arduino Nano 33 IOT has up to 32kB of SRAM, starting at the address 0x20000000. So the upper 16-bits of a real pointer into RAM will always be 0x2000. Here is a snippet of the data sheet that shows this:

So, in the Microvium port file, you can now specify the high bits of a 32-bit pointer to be 0x2000, and Microvium will interpret all pointers by simply indexing into the given memory page.

In compiled ARM machine code, the conversion from a 16-bit slot value to a 32-bit pointer is just one or two instructions [3]! This is a significant performance improvement over how it worked before, and makes pointer access in Microvium almost as efficient as native pointer access.
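To make the conversion concrete, here is a minimal sketch of what it amounts to in C. The macro name VM_PAGE_BASE and the two function names are hypothetical and are not the actual Microvium port macros. The sketch also assumes the page base has its low 16 bits clear, so a bitwise OR is enough, and it returns integer addresses rather than real pointers so that it can be compiled on a desktop machine.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical port setting: the native address at which the VM's 64 kB
   page starts. 0x20000000 matches the SRAM base address mentioned above. */
#define VM_PAGE_BASE ((uintptr_t)0x20000000u)

/* Convert a 16-bit slot value holding a RAM pointer to a native address. */
static inline uintptr_t slot_to_native(uint16_t slot) {
  assert((slot & 1u) == 0);   /* RAM pointers have a low bit of 0 */
  return VM_PAGE_BASE | slot; /* the low 16 bits of the base are zero */
}

/* Convert a native address inside the page back to a 16-bit slot value. */
static inline uint16_t native_to_slot(uintptr_t address) {
  assert((address & ~(uintptr_t)0xFFFF) == VM_PAGE_BASE);
  return (uint16_t)(address & 0xFFFF);
}
```

Because the base is a compile-time constant, a compiler can fold the OR into the load or store that follows, which is consistent with the one-or-two-instruction cost described above.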

I didn’t actually make this change for performance reasons. I did it because it makes the development of the Microvium engine much easier.

  • On my Windows machine where I develop, I can now use VirtualAlloc to pre-allocate a single 64 kB “page” of memory where the high bits of a pointer are always 0x5555, and run Microvium in just this region of memory. So if I see the Microvium value 0x002A, I know instantly that corresponds to the address 0x5555002A.
  • The addresses are consistent across runs, so when I note down an address in my notebook while debugging, I know it will be the same if I restart the program.
  • I can also have a memory view open in the debugger and it remains consistent across runs and shows all the VM memory in one place.

64 kB ROM

Previously, if the lower 2 bits of a slot were 01b then the value was considered a pointer into ROM after shifting right by 1 bit, giving us a 15-bit address space for ROM, and requiring ROM to be 2-byte aligned to keep the remaining 0 bit.

Now, the slot value is considered a pointer into ROM after zeroing the bottom 2 bits. This doesn’t change the performance, but it means that we can now address up to 64 kB of ROM.

A side effect is that ROM must now be 4-byte aligned since the lower 2 bits of ROM pointers must be zero. This means extra padding sometimes, but I’ve found that the ROM overhead doesn’t grow substantially with this change.
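The difference between the two decodings can be sketched in a few lines of C. The function names here are made up for the example and do not come from the engine source.

```c
#include <assert.h>
#include <stdint.h>

/* Both schemes mark a ROM pointer with 01 in the low 2 bits of the slot. */
static int is_rom_pointer(uint16_t slot) {
  return (slot & 3u) == 1u;
}

/* Old scheme: shift right by 1, giving an even 15-bit offset (up to 32 kB). */
static uint16_t rom_offset_old(uint16_t slot) {
  assert(is_rom_pointer(slot));
  return (uint16_t)(slot >> 1);
}

/* New scheme: clear the 2 tag bits, giving a 4-byte-aligned 16-bit offset
   (up to 64 kB). */
static uint16_t rom_offset_new(uint16_t slot) {
  assert(is_rom_pointer(slot));
  return (uint16_t)(slot & 0xFFFCu);
}
```

For the slot value 0x2A1, rom_offset_old gives 0x150 while rom_offset_new gives 0x2A0, so the new offset can be read almost directly off the slot value.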

Why did I do this?

  1. Debugging. Previously, if I saw the slot value 0x2A1 while stepping through the engine, I had to bring out a calculator to see that 0x2A1 corresponds to the bytecode address 0x150. Now, the value 0x2A1 corresponds to the bytecode address 0x2A0, which is much easier to follow.
  2. I was irked by the fact that I couldn’t just say that “Microvium supports up to 64kB of memory”. I previously had to qualify it every time — “it supports 64kB of RAM, and 64kB of snapshot image, but only up to 32kB of ROM”. Now I can just say “supports up to 64kB of memory” with no added asterisks or qualifications, since it supports 64 kB of each type of memory it uses. This is part of simplifying the mental model of Microvium, since Microvium is all about simplicity.


  1. The performance was actually O(n) where n is the number of memory blocks, but since Microvium uses a compacting garbage collector, memory is consolidated into a single block periodically, making it O(1) for most pointers most of the time. 

  2. More accurately, you can specify any native address as the origin of the VM address space, but I found it easier to explain this in terms of the high bits. 

  3. In the full ARM instruction set, the conversion is a single 32-bit instruction. In the ARM Thumb instruction set, it takes two 16-bit instructions. The specific number 0x20000000 is relevant here because it’s a power of 2 which is more efficient. 

Microvium Closure Variable Indexing

TL;DR: In many situations, a program in Microvium bytecode can access closure variables with just a single-byte bytecode instruction. The instruction contains a 4-bit index that either indexes into the current lexical scope or recursively overflows to the next-outer lexical scope, cascading up the lexical scope chain until the variable is found. In this post, I discuss some of the journey and the details of how this works.


What is a closure variable?

A closure is a function that accesses variables outside its local lexical environment. See the MDN article on this for a better explanation.

What I mean by a “closure variable” in this post is a local variable that is accessed by a closure [1]. For example, in the following code, callback is a closure because it accesses x which is outside its own local variables, and correspondingly x is a “closure variable” by this definition because it’s accessed by callback:

function foo() {
  let x = 10;
  setTimeout(callback, 1000);
  function callback() { 
    console.log(x);
  }
}

In Microvium (and other JavaScript engines), closure variables are treated differently from normal local variables because they need to outlive the stack frame in which they are declared. The variable x here needs to survive beyond the return of foo. This is done by allocating the slots for these variables on the heap instead of the stack.

If a variable is demoted to the heap, all access to that variable is via the heap, even if it’s being accessed locally. For example:

function foo() {
  let x = 10;
  console.log(x);  // <--- this is also accessing `x` on the heap
  setTimeout(callback, 1000);
  function callback() { 
    console.log(x);
  }
}

There is a static analysis pass in Microvium that decides whether a variable is a closure variable or not. In principle, it’s safe to allocate all variables on the heap rather than doing this analysis, but the heap is expensive, so as an optimization some variables can be promoted to the stack if they are not used by closures.

For the sake of the rest of this blog post, I will pretend that all variables are closure variables, even if I do not show the closure that accesses them. Or equivalently, I will pretend that there is no optimization analysis to determine which variables can be promoted to the stack.

Background: how it might have worked

Before I get to describing how it does work today, let me describe how I previously implemented it before having my eureka moment. You can skip this section if you want to get straight to the answer instead of going through the journey as I did.

The state of the virtual machine needs to keep track of the current lexical scope. So let’s add a new machine register called scope which points to the current lexical scope.

A goal with Microvium is to keep the engine implementation small. So rather than introducing a new allocation type for environment records and new bytecode instructions to access them, maybe we can reuse an existing type and existing instruction.

A natural solution that might come to mind is to use an Object, where the property keys of the object correspond to the variable names. Microvium has existing bytecode instructions ObjectGet and ObjectSet to get and set properties on an object.

However, Objects are very expensive at runtime. Each property is stored in memory as a key-value pair taking 4 bytes, and property lookup is a linear-time search through the properties to find the one with the right key.

Since we can statically determine the number of variables, a better choice of container for our variables would be a fixed-length array, where we statically compute an index for each variable rather than using its name. In Microvium, a fixed-length array is quite efficient, having constant-time random access to any slot by its index and each slot only consumes 2 bytes.

So let’s think about compiling the following JavaScript to IL:

let x, y, z;
x = 10;
y = 20;
z = 30;

We will have a fixed-length array with 3 slots for these 3 variables, and our new scope register will point to this fixed-length array to say that this is the current scope. Whenever we need to access one of these variables, we will read and write to the array.

I’m showing the allocation header here for completeness. These arrays are heap-allocated, so they each require this implicit memory slot for information about the size and type of the allocation.

So, the following is the IL sequence that might be produced for the above JavaScript:

// let x, y, z;
ArrayNew(3)         // Allocate 3 slots on the heap
StoreReg('scope')   // Save the array in the "scope" register

// x = 10;
LoadReg('scope')    // Fetch the current scope
Literal(10)
ArraySet(0)         // Set the first slot in the array

// y = 20;
LoadReg('scope')    // Fetch the current scope
Literal(20)
ArraySet(1)         // Set the second slot in the array

// z = 30;
LoadReg('scope')    // Fetch the current scope
Literal(30)
ArraySet(2)         // Set the third slot in the array

Side note: Microvium IL is based on a stack machine (see Wikipedia). The instruction Literal(10) pushes the value 10 to the top of the stack. The instruction ArraySet(0) pops the literal value off the stack, and pops the array reference off the stack (which was previously pushed by LoadReg('scope')) and then assigns the 0th slot in the array.

Nested scopes

What if there are nested lexical scopes, as in the following JavaScript code:

function foo() {
  let x;
  function bar() {
    let y;
    function baz() {
      let z;
      x = 10;
      y = 20;
      z = 30;
    }    
  }
}

In the above code, z is accessed in the same scope in which it is declared, as before. But x and y are accessed from parent (outer) scopes relative to the expressions that access them. So we need a way for the IL to access parent scopes.

Remember that the inner scopes can be instantiated multiple times for each single instantiation of an outer scope. For example, if bar() is called twice within foo, then there will be 2 instances of variable y for every one instance of variable x. So we can’t just put x, y, and z in the same array. We need each lexical scope to be in its own array, and we need a way to reference an outer scope from an inner scope.

This might seem like we need to abandon the fixed-length array as the underlying storage for these variables, since these don’t naturally form chains. But if we think about it a moment, we could reserve one of the slots in the fixed-length array as a pointer to the parent fixed-length array. Let’s mentally reserve the first slot in each array as the pointer to its parent scope.

In the above JS, we have 3 distinct lexical scopes, and so we may land up with a scope chain as follows:

Now, the IL to read variable x may look as follows:

LoadReg('scope')  // Get the current scope (the one containing z)
ArrayGet(0)       // Read the parent scope (the one containing y)
ArrayGet(0)       // Read the parent scope (the one containing x)
ArrayGet(1)       // Read variable x

Each of these instructions encodes to 1 byte, so this is a 4-byte sequence in total.

I thought this was a pretty good solution. It didn’t add any extra instructions to the IL instruction set, and so didn’t make the engine any bigger. An important goal in Microvium is to keep the engine small.

The final solution

In the end, I decided that closures were too important to have it cost so many instructions to read and write to them. Rather than emitting multiple IL instructions just to read or write a single variable, it seemed quite logical to bake this behavior into the engine itself, and add two additional instructions to the IL instruction set: LoadScoped and StoreScoped. I wanted to keep these instructions both compact and efficient, and that’s where the design challenge is interesting.

Consider the following JavaScript with a few more variables to make the pattern clear:

function foo() {
  let a;
  let b;
  function bar() {
    let c;
    let d;
    function baz() {
      let e;
      let f;
      a = 10;
      b = 20;
      c = 30;
      d = 40;
      e = 50;
      f = 60;
    }
  }
}

The assignment statements in this example, such as a = 10, are accessing the variables in one of 3 different lexical scopes. How can we design the instruction format for LoadScoped (and StoreScoped) so that the bytecode can specify which scope is being accessed as well as which variable in that scope?

The eureka moment for me was when I realized that I could use a single mapped index that specifies both the scope and the variable within that scope. I’ll represent this mapping in the following diagram, with the index on the left and the variable it maps to on the right (the allocation headers are omitted here for clarity).

The scopes in this design form a waterfall. Small indexes are accessing the inner-most lexical scope. Larger indexes “overflow” into the next outer closure scope, repeating until the variable is found.

So, the IL LoadScoped(1) would load variable e, and LoadScoped(8) would load variable b, for example. We can determine the indexes through static analysis, numbering the variables in the closest scope first and then the ones in the next closest, etc. [2]

Implementation in C

My original concern with adding new bytecode instructions was that it would make the engine much bigger and more complicated. But this design can be implemented efficiently in the engine, adding only a little bit more complexity. See the following C code for finding the variable with a given index [3]:

uint16_t* arr = registers->scope;
do {
  // The length of the array is in its header word
  uint16_t len = arr[-1] & 0xFFF;

  // Is the variable in this scope?
  if (index < len) return arr[index];

  // Otherwise, cascade/overflow to the outer scope
  arr = arr[0];
  index -= len;
} while (1);
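For readers who want to experiment with the waterfall lookup on a desktop machine, here is a self-contained toy version. It is a sketch under simplifying assumptions rather than the engine code: the heap is a small static array, a scope "pointer" is just an index into that array (with 0 meaning "no outer scope"), and the helper names scope_new and scoped_slot are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy heap of 16-bit slots. Index 0 is reserved so that 0 can mean
   "no scope". This layout is an assumption made for the example. */
enum { HEAP_SLOTS = 32 };
static uint16_t heap[HEAP_SLOTS];
static uint16_t heap_top = 1;

/* Allocate a scope array: a header word holding the length, then slot 0
   (the parent scope), then one slot per variable. Returns the index of
   slot 0, so the header sits at scope - 1 as in the snippet above. */
static uint16_t scope_new(uint16_t parent, uint16_t var_count) {
  uint16_t len = (uint16_t)(var_count + 1);
  assert(heap_top + 1 + len <= HEAP_SLOTS);
  heap[heap_top] = len;
  uint16_t scope = (uint16_t)(heap_top + 1);
  heap[scope] = parent;
  heap_top = (uint16_t)(scope + len);
  return scope;
}

/* Find the slot for a waterfall index, starting from the given scope.
   Returns NULL if the index runs past the outermost scope. */
static uint16_t *scoped_slot(uint16_t scope, uint16_t index) {
  while (scope != 0) {
    uint16_t len = heap[scope - 1] & 0xFFF; /* length from the header */
    if (index < len) return &heap[scope + index];
    index = (uint16_t)(index - len);        /* overflow to the outer scope */
    scope = heap[scope];                    /* slot 0 holds the parent */
  }
  return NULL;
}
```

Building the three scopes from the example (foo with a and b, bar with c and d, baz with e and f) and calling scoped_slot(baz, 8) lands on the slot for b, while scoped_slot(baz, 1) lands on e.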

Actual Instruction format

For the curious, the actual bytecode instruction format for LoadScoped in its simplest form is just 8 bits, where the upper nibble is the opcode (“LoadScoped”) and the lower nibble is the 4-bit index, allowing up to 15 closure variables to be addressed.

StoreScoped is the same, but with a different opcode.
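As a small illustration of the single-byte format, the packing and unpacking could look like the following. The opcode numbers are placeholders invented for this sketch; the real values are defined by the Microvium bytecode format and may well differ.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder opcode numbers for illustration only. */
enum { OP_LOAD_SCOPED = 0x3, OP_STORE_SCOPED = 0x4 };

/* Pack a 4-bit opcode and a 4-bit variable index into one byte. */
static uint8_t encode_scoped(uint8_t opcode, uint8_t index) {
  assert(opcode <= 0xF && index <= 0xF);
  return (uint8_t)((opcode << 4) | index);
}

/* The upper nibble is the opcode. */
static uint8_t instr_opcode(uint8_t instr) {
  return (uint8_t)(instr >> 4);
}

/* The lower nibble is the waterfall index. */
static uint8_t instr_index(uint8_t instr) {
  return (uint8_t)(instr & 0x0F);
}
```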

For the even-more-curious, if your code has more variables than will fit in the 4-bit index, there are 16-bit and 24-bit [4] instruction variants as well, as per the following snippet of the Microvium technical documentation:

Conclusion

I’m very satisfied with the final design so far. It brings closures up as first-class citizens in Microvium by making closure variable access almost as space-efficient as local variables, in terms of both bytecode space and memory usage, while also having only a small CPU cost. I think this is important because closures are very common in real-world JavaScript code, especially when programming with a functional style, and also because closures may in future form the foundation for the implementation of other features such as generators and async-await.

This has been a good reminder to me that it’s worth thinking deeply about designs and considering different options, rather than jumping straight in and implementing whatever comes to mind first. In this case, the final design was both simpler to implement and more efficient.

There’s a lot I haven’t talked about in this post, such as how the array is created in the first place and how the scope register changes as control moves between different lexical scopes. But that can be for another time.


  1. If you have a better name for a “closure variable”, please let me know 

  2. Note that one complexity with this design is that the index for a single variable is different depending on which scope the code is accessing it from. The index is not an absolute property of the variable itself. 

  3. This is not the actual code. The actual code needs to deal with error checking and multiple address spaces, for example. 

  4. Why would anyone ever need to address more than 255 closure-scoped variables? I heard a story that the C# compiler assumed that nobody would ever need more than 65536 variables in a single function, but that assumption was violated by a code generator that was generating massive functions. So I’m cautious about saying “nobody will ever need more than 255 variables” when it’s only 2 lines of code to support it. Maybe I’ll change my mind in the future. 

New Garbage Collector!

It’s only been a few weeks since the Alpha release of Microvium and I’ve already almost rewritten the engine! There are two major changes I wanted to make to improve performance:

  1. I’ve moved to a great new garbage collection algorithm which is much faster and has a smaller memory footprint in both RAM and codespace.
  2. I’ve changed the data model to allow Microvium values to embed native pointers on 16-bit platforms.

The Old Data Model

To understand the new data model, it will help if I first explain the old one, so you can see how they compare.

Microvium “slots” are 16-bit. The term “slot” is what I’m using to refer to a memory location that holds a value. For example, a variable is a “slot”, and an object property has a slot for its value (and another slot for its key), etc.

In the old data model, Microvium reserved the top 2 bits of the 16-bit value to be a tag, leaving a 14-bit space for data associated with that tag.

The 2-bit tag tells you how the 14-bits should be interpreted:

  • Tag 00₂: the data is a 14-bit signed integer
  • Tag 01₂: the data is a 14-bit offset into heap memory
  • Tag 10₂: the data is a 14-bit offset into data memory
  • Tag 11₂: the data is a 14-bit offset into the bytecode

Having values fit into 16-bits is convenient for the kind of device that Microvium targets: small microcontrollers with < 64kB of memory. Many of these are 16-bit architectures, because 16-bits is all you need to address all your memory, and moving to 32-bits increases the silicon size and power consumption.

14-bit integers are large enough for many purposes, such as counters, loop variables, indexes, byte manipulation, etc. For situations where more than 14 bits are required, Microvium has the above-mentioned “offset” values which point from the 16-bit slot to a larger memory allocation, somewhere in ROM or RAM. These larger memory allocations can hold any number of other value types, such as arrays, strings, 32-bit integers, function bytecode, etc.

Depending on the characteristics of the referenced data, they could be stored in either ROM or RAM. This design uses offsets rather than explicit pointers for 2 reasons:

  1. Not all devices can fit a pointer into 14-bits. In fact, most can’t.
  2. When a pointer is stored in ROM (i.e. in the bytecode image), it can’t know ahead of time what its address is (since the image could be placed anywhere in ROM).

Side point: what I’ve called “data memory” and “heap memory” here are both memory regions in RAM, but data memory is permanently allocated while heap memory is garbage collected. This is analogous to the difference between the following two allocations in C:

MyStruct myVariable1; // Permanently allocated
MyStruct* myVariable2 = malloc(sizeof(MyStruct)); // Could be freed later

This 16-bit data model seems pretty clever (or at least to me it did), but there are a number of issues with it and optimizations we can make. The first issue is that each of these respective offsets can only be 14-bit, giving only an addressable space of up to 16kB. On the whole, I don’t mind Microvium targeting use cases where applications can’t exceed 16kB, but it does feel awfully tight. This is compounded by the fact that the engine itself compiles to about 7kB [1], so the engine is necessarily quite chunky relative to the largest bytecode programs it can run.

The other issues are easiest to see when comparing it to the new data model:

The New Data Model

The new data model uses a number of clever tricks to improve both performance and addressable size on 16-bit architectures.

In the new model, the tag is not a fixed size. It is 1 or 2 bits, and now in the lower position of the word rather than the upper.

If the lowest bit is a 0, the word is interpreted as a 16-bit pointer:

This makes the assumption then that pointers are always even, which is a good assumption since Microvium heap allocations are always 2-byte aligned. This means that after checking the low bit, we don’t need to do any bit manipulation to extract the pointer value: the value is already a 16-bit pointer.

This means that, at least on 16-bit architectures, a native pointer can be stored directly in a memory slot (variable) without any need for encoding and decoding. This speeds up a lot of things, but especially the garbage collector (which I’ll talk about in a moment). This also means that Microvium can now address 4x the amount of RAM that it could previously, since it can address 64kB of space.

This speedup is quite significant. The heap in Microvium is not allocated contiguously from the host (doing so would make it difficult to expand the heap dynamically when more space is needed), so resolving a 14-bit heap offset, as per the old scheme, involved a function call with fairly complex logic to determine what heap fragment the offset belongs to, and to find the allocation being referenced. With the new scheme, this all goes away.

Bytecode (ROM) Pointers

Not everything can be addressed by 16-bits, even on a 16-bit architecture. In particular, ROM is not necessarily addressable with 16-bit pointers, since ROM may be larger than RAM, or it may be stored in a completely different address space (or even on an external device such as a serial flash).

Microvium intends to cater for these common situations, so the next kind of value that can be stored in a slot is as follows, having the lower 2 bits with a value of 01₂ and the upper 15 bits be an offset into the bytecode image:

Again, this makes use of the fact that bytecode addresses will be even (this was not necessarily true before, but I’ve made it true in this latest version). So to convert this value to an offset, we just have to shift right by 1 bit.

This only gives us up to 32kB of addressable bytecode ROM, but this is still pretty good. The bytecode file itself can be up to 64kB in size, but the ROM segment is limited to 32kB. I think this is more than acceptable for the kind of use cases that Microvium is currently intended for.

This leaves us with the last tag, 11₂, to represent integers:

We still only have signed 14-bit integers, like before. But even for these, this scheme is more efficient than the old! In the old scheme, to decode a 14-bit integer from the 16-bit word, we needed to perform a sign-extension, since the upper 2 bits (tag bits) were always zero, even when representing negative values. A 14-bit-to-16-bit sign extension typically involves multiple operations. In this new scheme, we need only perform a single signed-right-shift by 2 bits, and the sign will automatically be correct!
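The tag checks and the integer decoding described above can be sketched as follows. The type and function names are invented for the example. The new-scheme decode relies on two's complement conversion and on right-shifting a negative value being an arithmetic shift, which is implementation-defined in C but is what mainstream compilers do.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { SLOT_POINTER, SLOT_BYTECODE_OFFSET, SLOT_INT14 } SlotKind;

/* Classify a slot by its low-order tag bits (new data model). */
static SlotKind slot_kind(uint16_t slot) {
  if ((slot & 1u) == 0) return SLOT_POINTER;          /* tag 0  */
  if ((slot & 3u) == 1u) return SLOT_BYTECODE_OFFSET; /* tag 01 */
  return SLOT_INT14;                                  /* tag 11 */
}

/* New scheme: the integer sits in the upper 14 bits, above the tag 11. */
static uint16_t int14_encode(int16_t value) {
  assert(value >= -8192 && value <= 8191);
  return (uint16_t)(((uint16_t)value << 2) | 3u);
}

/* New scheme: one signed right shift drops the tag and keeps the sign
   (assumes an arithmetic shift, see the note above). */
static int16_t int14_decode(uint16_t slot) {
  assert(slot_kind(slot) == SLOT_INT14);
  return (int16_t)((int16_t)slot >> 2);
}

/* Old scheme, for comparison: tag 00 in the upper 2 bits, so bit 13 has
   to be sign-extended into bits 14 and 15 explicitly. */
static int16_t int14_decode_old(uint16_t slot) {
  assert((slot & 0xC000u) == 0);
  return (int16_t)(((int)slot ^ 0x2000) - 0x2000);
}
```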

The New Garbage Collector

I like it when code gets smaller instead of bigger, and the new garbage collector (GC) is one of those cases. And with the new data model, the garbage collector is an order of magnitude faster than it used to be.

All the principles and objectives from my previous garbage-collection post still apply. In fact, I was well into the new design by the time I finished the implementation of the old GC, but I wanted to get the alpha release out before embarking wholesale on a major overhaul. This animation from last time summarizes the kind of environment that Microvium is meant to be operating in, where the circles represent different modules or tasks, and only one or a few at a time are expected to be active/scheduled, taking center stage and consuming their peak memory [2]:

The key takeaway from the aforementioned post is that the memory consumed by a task for only a short burst is typically significantly cheaper than the amount of memory it permanently reserves. This is because the average permanent memory per task is multiplied by the total number of tasks, while average peak memory per task is only multiplied by the number of simultaneously-active tasks running at their peak.

Single-threaded environments, of which I’m a huge fan, are even better for this since they enforce that there is only one “simultaneously-active” task — micro-tasks are scheduled serially and so each can consume all the resources for a short time, as long as they clean up thoroughly before the next microtask in the queue. If microtasks are kept short (e.g. less than 1ms) then responsiveness is still good and the system can still be “real-time-enough” for many real-world applications.

The new garbage collector is based on Cheney’s Algorithm (wiki). The general idea is this:

  • When it’s time to garbage collect, the GC finds reachable (live) allocations and moves them to a new virtual heap.
  • It then discards the old heap all at once.

The coolest new feature of the new Microvium GC is that it costs nothing at all to free an allocation! Dead allocations are simply never visited by this algorithm, and just get freed all in one go when the algorithm discards the old heap.

All living allocations have a cost during collection (this is true of all GC algorithms3). But also recall from my last post on GCs that, relative to desktop machines, microcontrollers can have 1000x more processing power per byte of memory4, so the fact that living allocations incur some (small) amount of processing time during every collection is probably not as impactful as the fact that living allocations continue to consume memory after the collection. In other words, don’t worry about how much CPU overhead a garbage collector has, and instead worry about designing your scripts such that their idle memory consumption is small, as you should be doing regardless.

This algorithm makes it fairly cheap to make tons of temporary allocations for intermediate computations or for passing data between functions, etc. And then when the task’s turn completes and the GC is run to clean up, these temporaries are discarded with no processing or memory overhead.

Since allocations in the Microvium heap are contiguous5, creating new allocations is also cheap — it’s almost just bumping a pointer forward6.
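A bump allocator along these lines can be sketched in a few lines of JavaScript. This is a toy model, not the Microvium implementation: `createHeap` and `allocate` are hypothetical names, and storing the payload size in the 2-byte header and rounding sizes up to 2 bytes are assumptions made for the illustration.

```javascript
// A toy heap: a fixed byte buffer with a cursor that only ever moves forward.
function createHeap(capacityBytes) {
  return { view: new DataView(new ArrayBuffer(capacityBytes)), cursor: 0 };
}

function allocate(heap, payloadBytes) {
  const HEADER_BYTES = 2;
  const total = HEADER_BYTES + ((payloadBytes + 1) & ~1); // round payload up to 2 bytes
  if (heap.cursor + total > heap.view.byteLength) {
    return null; // overflow: a real engine would grow the heap or collect here
  }
  const headerAt = heap.cursor;
  heap.view.setUint16(headerAt, payloadBytes, true); // assign the allocation header
  heap.cursor += total;                              // bump the cursor forward
  return headerAt + HEADER_BYTES;                    // address of the payload
}
```

There is no search for a free block anywhere in `allocate`, which is why the cost does not depend on how many allocations already exist.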

In the following animation, blocks represent allocations, and a block turns gray to represent that the allocation has become unreachable7 — blue blocks are live (reachable) allocations. The program8 can allocate quickly and cheaply by just bumping the allocation cursor/pointer forwards. Dead objects remain on the heap until collection. The collector then copies the live allocations to a new space and the old space is discarded in one fell swoop.

With the new data model, tracing the reachability graph is almost trivial: for each variable, check if the lowest bit is a zero. If it is, then the variable holds a 16-bit pointer to a reachable allocation. If not, the variable holds anything else — an integer or a pointer to ROM (which doesn’t need tracing). Reachable objects are copied into the new memory space and their internal slots are in turn investigated the same way.

Collection with this algorithm is quick and proceeds in a single pass with constant memory overhead — no mucking about with mark bits, free lists, work queues/stacks, etc.
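The whole idea fits in a short sketch. The following is a toy Cheney-style collector in JavaScript, written for illustration only. The encodings are assumptions: the heap is a `Uint16Array` of words, an allocation is a header word followed by its slots, a header is `(slotCount << 1) | 1` (always odd), and a value with a zero lowest bit is a pointer equal to twice the word index of the target header. The name `collect` is hypothetical.

```javascript
function collect(fromSpace, roots) {
  const toSpace = new Uint16Array(fromSpace.length);
  let free = 0; // bump cursor in the new heap

  function forward(value) {
    if (value & 1) return value;           // lowest bit set: not a pointer, nothing to trace
    const from = value >> 1;
    const header = fromSpace[from];
    if ((header & 1) === 0) return header; // already moved: header was replaced by the new address
    const words = 1 + (header >> 1);
    toSpace.set(fromSpace.subarray(from, from + words), free); // copy header and slots
    const moved = free << 1;
    fromSpace[from] = moved;               // leave a forwarding pointer in the old heap
    free += words;
    return moved;
  }

  const newRoots = roots.map(forward);
  // The copied allocations double as the work queue: scan them in order.
  for (let scan = 0; scan < free; ) {
    const slots = toSpace[scan] >> 1;
    for (let i = 1; i <= slots; i++) {
      toSpace[scan + i] = forward(toSpace[scan + i]);
    }
    scan += 1 + slots;
  }
  return { heap: toSpace.subarray(0, free), roots: newRoots };
}

// A (1 slot) points to C; B (2 slots) is dead; C (2 slots) points back to A.
const oldHeap = Uint16Array.from([3, 10, 5, 7, 9, 5, 0, 43]);
const result = collect(oldHeap, [0, 15]);
console.log(Array.from(result.heap)); // [ 3, 4, 5, 0, 43 ]
```

Note that the dead allocation B is never read at all, and the only state the collector needs beyond the two heaps is the `scan` and `free` cursors.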

Bonus: Array and Object compaction

As a bonus, this new implementation of the GC also compacts arrays and objects on the fly.

Dynamically-sized arrays consume more space than needed so that appending to them is quick: every time they run out of space, their capacity is doubled to make room for twice as many elements. The GC algorithm truncates the capacity of these arrays on the fly, so they use the smallest possible amount of space.
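As a minimal sketch of that trade-off (with hypothetical names, and using plain JavaScript objects in place of Microvium’s real memory layout), a growable array keeps a `length` separate from the capacity of its storage, and a copying collector can simply copy only the used part:

```javascript
// A toy dynamic array: `storage.length` is the capacity, `length` is the used part.
function createArray() {
  return { length: 0, storage: new Array(1) };
}

function push(arr, value) {
  if (arr.length === arr.storage.length) {
    const bigger = new Array(arr.storage.length * 2); // out of space: double the capacity
    for (let i = 0; i < arr.length; i++) bigger[i] = arr.storage[i];
    arr.storage = bigger;
  }
  arr.storage[arr.length++] = value;
}

// What the collector can do while moving the array: copy only the used slots.
function copyTruncated(arr) {
  return { length: arr.length, storage: arr.storage.slice(0, arr.length) };
}

const arr = createArray();
for (let i = 0; i < 5; i++) push(arr, i);
console.log(arr.storage.length);                // 8
console.log(copyTruncated(arr).storage.length); // 5
```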

The case is similar for objects. Dynamic objects in Microvium are represented in memory by a linked-list of property cells (key-value pairs with a next pointer), so that appending new properties is quick. But the linked list consumes twice the amount of space of an equivalent contiguous array, so the GC compacts these linked lists into contiguous property arrays on the fly. This gives the best of both worlds, making it efficient to link a new property cell onto an existing object, but having that property cell subsumed into the contiguous arrays during GC compaction so that each property takes the smallest possible space while dormant9.
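Roughly, and again only as a sketch with made-up names rather than the engine’s actual structures, the two representations and the conversion between them look like this:

```javascript
// Linked-list form: each cell holds a key, a value and a pointer to the next cell.
function createObject() {
  return { head: null, tail: null };
}

function addProperty(obj, key, value) {
  const cell = { key, value, next: null };
  if (obj.tail) obj.tail.next = cell; // link the new cell onto the end: no copying needed
  else obj.head = cell;
  obj.tail = cell;
}

// Compacted form: one flat array [key0, value0, key1, value1, ...] with no next pointers.
function compactProperties(obj) {
  const flat = [];
  for (let cell = obj.head; cell !== null; cell = cell.next) {
    flat.push(cell.key, cell.value);
  }
  return flat;
}

const point = createObject();
addProperty(point, "x", 5);
addProperty(point, "y", 6);
console.log(compactProperties(point)); // [ 'x', 5, 'y', 6 ]
```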


  1. This is measured on an msp430, with floating-point support disabled. Presumably this number will grow over time 

  2. Read the full post for a more complete explanation 

  3. But can be mitigated on desktop/server machines by the use of generational garbage collectors, an unnecessary complexity for MCUs. 

  4. To put it another way, consider that many MCUs have the CPU power to loop through their entire RAM in the order of milliseconds, while desktop-class machines would do the same in the order of seconds 

  5. At least, contiguous within the large chunks of memory allocated from the host as the heap grows 

  6. Bumping the allocation pointer forward, checking for overflow, and assigning the 2-byte allocation header. 

  7. Unreachable means that there is no way for your program/script to access it anymore, so it is subject to garbage collection 

  8. The program is called the mutator in GC parlance. 

  9. A property in compacted form is 4 bytes: a 2-byte key and a 2-byte value. A property in linked-list form consumes extra bytes for the linked list pointers and the allocation header of each cell. 

Microvium Boost
It's like magic

TL;DR: This Microvium plugin (in development) optimizes Microvium bytecode by statically determining which variables and properties are accessible and how they might be accessed (read vs write), to decide whether to safely store them in ROM or remove them completely from the bytecode.


Since the recent alpha release of Microvium, I’ve turned my attention to a complementary piece of the puzzle, which I’m calling Microvium Boost (the name may change).

Microvium Boost is an optimization plugin for Microvium1. It hooks into the Microvium pipeline to optimize a snapshot of bytecode. As a recap, the snapshot is the file that you would download to the target MCU; it’s a capture of the full running state of the Microvium virtual machine (see the Concepts page in the Microvium documentation).

Like a typical executable file, a bytecode file contains separate sections for data variables versus constants and functions. I’ve called these sections data and rom2 (these are closely analogous to the .data and .text segments produced by a C compiler). When a virtual machine is restored (loaded) on the target MCU, the data section is copied into RAM so the program can execute it, while the rom section is accessed directly from the bytecode image in ROM during execution (assuming you store the bytecode in ROM).

If I can summarize what Microvium Boost does in a single sentence, it’s this:

Microvium Boost determines the best memory section for each element in the snapshot bytecode, or whether the element can be discarded completely.

For some things, the best memory section is obvious. For example, function code is immutable and will always be stored in ROM. In the future, frozen objects could also probably be put straight into ROM without any analysis, but Microvium does not currently support freezing.

For all other variables and objects in the snapshot, we can store them in ROM if they are not going to ever be mutated at runtime3. It’s the job of Microvium Boost to determine this ahead of time through static analysis.

A related problem that Microvium Boost solves is choosing whether an element in the snapshot needs to be kept at all. For example, it removes functions that are never called, and objects and variables that are never accessed.

This is a notoriously difficult problem to solve accurately. Take a look at a previous post of mine where I demonstrated that Webpack’s static analysis for tree shaking is quite easy to fool into producing erroneous results.

The best way to show the kind of thing that Microvium Boost can do is by a few examples.

Examples

Moving values to ROM

The following is a simple example4. Here we have a script that creates an object with two properties and then exports a function to the host5 which returns one of those properties.

var obj = { // Stored in ROM by optimization
  x: 5,
  y: 6  // Removed by optimization
};

function run() {
  var temp = obj.y;  // Removed by optimization
  return obj.x;
}

// Export the `run` function to be called by the MCU host
exportValue(0, run); 

In the above case, Microvium Boost determines that object obj is not mutated (not just the variable, but the object itself), and that property y and variable6 temp are not used at all in a particular application.

It’s worth highlighting that var temp = obj.y does not count as a use of property y, since temp is not used. What counts as “usage” is determined lazily, with only those values used in I/O being real usage7, and all other intermediate values only being tentatively needed in case they aid in the computation of I/O.

Built-in Functions and Objects

There is another optimization that Microvium Boost does in the previous example. There are actually a number of built-in library functions that are also culled from the bytecode because they aren’t used (for example, Array.push is not used anywhere here).

This is quite an important feature of Microvium Boost: the ability to have a rich common library of useful functions that any particular script might employ, and have the unused parts of it dropped from the snapshot on a case-by-case basis depending on usage.

Function Parameters

Here’s another interesting example, with the optimization results noted as comments:

var w = 1; // Removed by optimization
var x = 2;
var y = 3;
var z = 4; // Removed by optimization

function run() {
  // Arguments `w` and `z` are removed by optimization
  var result = bar(w, x, y, z); 
  return result;
}

// Parameters `a`, `d` and `e` are removed by optimization
function bar(a, b, c, d, e, f) {
  return b + c + f;
}

exportValue(0, run);

These optimization opportunities might seem obvious when you read the code, but it’s actually quite a difficult problem to solve in the face of multiple possible callers and ambiguity in the function target. To highlight this complexity, consider if we changed run to have the following implementation, which also optimizes just fine with Microvium Boost:

function run(b) {
  var f;

  if (b) f = bar;
  else f = foo; // (Assuming that we define a function `foo`)

  // Arguments `w` and `z` are removed by optimization
  var result = f(w, x, y, z); 
  return result;
}

In this example, the call f(w, x, y, z) might either be calling bar or foo, and it’s impossible to know ahead of time which it is because it depends on the value b passed from the host. If we (the optimizer) remove a parameter from bar, we equally need to remove it from foo so that the parameters for f are consistent. And in general, bar and foo might similarly be called from multiple call sites in the program, so removing a parameter from bar may have implications that affect calls to foo elsewhere.
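The consistency constraint for a single call site can be sketched as follows. This is a deliberate simplification, not how Microvium Boost is implemented: it assumes some earlier analysis has already produced, for each possible target, the set of parameter positions its body reads (`usedPositions`, a hypothetical name), and it ignores the knock-on effects on other call sites. The body assumed for `foo` is also invented for the example.

```javascript
// An argument position can be dropped from a call only if none of the
// functions that the call might reach reads that position.
function removableArguments(argCount, possibleTargets) {
  const removable = [];
  for (let position = 0; position < argCount; position++) {
    const unusedEverywhere = possibleTargets.every(
      (target) => !target.usedPositions.has(position)
    );
    if (unusedEverywhere) removable.push(position);
  }
  return removable;
}

const bar = { name: "bar", usedPositions: new Set([1, 2, 5]) }; // reads b, c and f
const foo = { name: "foo", usedPositions: new Set([1]) };       // assumed: reads only its 2nd parameter

// f(w, x, y, z) may reach either bar or foo:
console.log(removableArguments(4, [bar, foo])); // [ 0, 3 ], i.e. w and z
```

If `foo` did read its first parameter, position 0 would have to stay in the call, and therefore in `bar`’s signature as well, even though `bar` never reads it.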

How does it work?

Microvium Boost is a whole-program optimizer that works differently to any other optimizer I know of. I’ve been working on the concepts behind the algorithm for some years now — they are borrowed from the type inference algorithm for MetalScript.

Naturally, an algorithm like this is beyond what I can explain properly in a blog post, but here is the general idea.

There are two steps to the algorithm:

  1. Determine the dependency relationship between elements of the snapshot (i.e. the program), by stepping sequentially through the source code8. For example, a call operation is related to the corresponding function(s) it calls, such that both have consistent expectations about the parameters to be passed.
  2. Use the relationships to determine consistent facts about the snapshot. For example, to decide whether a particular parameter can be culled from a particular function signature.

The following diagram is the resulting dependency graph for the “Function Parameters” example earlier. I’ve left the labels off the graph just to give a general sense of it9.

In simplified terms, each node in the graph corresponds to an element in the snapshot, such as an instruction, parameter, property, or global variable. The arrows are the inferred directional relationships between these elements. For example, an instruction which reads a global variable is related to the variable that it reads (if the former is not culled, the latter cannot be culled).

Nodes colored solid blue are those determined to be required in the final, optimized snapshot, while the empty circles represent elements that can be culled (much of the culled program is off-screen).

In this particular diagram, as annotated below, the solid-blue root node on the left is the particular IL instruction that returns control to the host: it passes the final value to the host (it is a point of output I/O), and so the instruction that generates result cannot be culled from the snapshot. The dependency graph then pulls in a cascade of other operations and parameters that are transitively required in order to produce the single return value, ending on the far right of the graph with the global variables x and y, which are the ultimate leaf dependencies required to compute the aforementioned output return value to the host. The rest of the nodes are not too important to this discussion at a high level except that they connect the result to the variables x and y from which the result is ultimately derived.

Here’s the corresponding source code again for reference:

var w = 1; // Removed by optimization
var x = 2;
var y = 3;
var z = 4; // Removed by optimization

function run() {
  // Arguments `w` and `z` are removed by optimization
  var result = bar(w, x, y, z); 
  return result;
}

// Parameters `a`, `d` and `e` are removed by optimization
function bar(a, b, c, d, e, f) {
  return b + c + f;
}

exportValue(0, run);

If you change the source code by removing return result, Microvium Boost will then safely cull global variables x and y as well, since that whole island of the dependency graph is only anchored by the data required by the return operation.
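The second step of the algorithm can be approximated, very loosely, as reachability over the dependency graph starting from the points of I/O. The sketch below is not the real algorithm, which works on IL instructions and infers the graph itself; here the graph is written out by hand with made-up node names for the example above, and `requiredElements` is a hypothetical helper.

```javascript
// Edges point from an element to the elements it depends on.
const dependsOn = {
  "return result": ["call bar"],
  "call bar": ["param b", "param c", "param f"],
  "param a": ["global w"],
  "param b": ["global x"],
  "param c": ["global y"],
  "param d": ["global z"],
};

// Everything transitively needed by the I/O roots must be kept; the rest can be culled.
function requiredElements(graph, ioRoots) {
  const required = new Set();
  const pending = [...ioRoots];
  while (pending.length > 0) {
    const node = pending.pop();
    if (required.has(node)) continue;
    required.add(node);
    for (const dependency of graph[node] || []) pending.push(dependency);
  }
  return required;
}

const kept = requiredElements(dependsOn, ["return result"]);
console.log(kept.has("global x"), kept.has("global w")); // true false

// With the return removed there is no I/O root, so nothing is anchored:
console.log(requiredElements(dependsOn, []).size); // 0
```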

It’s Magic

I subtitled this post “It’s Magic” because of Microvium Boost’s ability to deal with dynamic information, which seems almost impossible to me, despite the fact that I wrote the code for it!

We could make the example a bit more dynamic to illustrate this by making the call to bar indirect, as in the following code:

var w = 1;
var x = 2;
var y = 3;
var z = 4;

function run() {
  var result = call(bar, w, x, y, z);
  return result;
}

function call(func, arg1, arg2, arg3, arg4) {
  return func(arg1, arg2, arg3, arg4);
}

function bar(a, b, c, d, e, f) {
  return b + c + f;
}

exportValue(0, run);

The inferred dependency graph for this looks roughly the same, with an extra layer of nodes between the root return operation and the leaf global variables from which the returned information ultimately feeds. And just like before, it determines that w and z are not used, along with the corresponding arguments to call and bar.

Why Microvium Boost?

This kind of analysis is expensive to compute, but I think it’s well suited to the kind of scenario that Microvium is targeting, where the machine performing the optimization is thousands of times more powerful than the microcontroller device on which the optimized bytecode will run; where every byte and every instruction counts.

I think Microvium Boost enables a style of script programming that was previously infeasible for these kinds of situations.


  1. Microvium Boost is not part of the open-source Microvium codebase 

  2. There is actually a third section in Microvium which you don’t see in a C compiler, which is called gc and is the initial state of the heap in the same way that data is the initial state of the global variables. The principles that apply to data extend to gc 

  3. I believe the XS engine from Moddable takes a different approach, which is to store everything in ROM *until* it’s mutated at runtime, but this requires extra runtime overhead and indirection 

  4. All these are working examples, mostly copied from the Microvium Boost automated test cases 

  5. The host of a typical Microvium program will be the surrounding C firmware on which the VM runs. 

  6. My choice of var over let or const for these examples is for readers who aren’t familiar with JavaScript, and may mistake the meaning of const or be confused by the name let

  7. Values used to discriminate control paths are counted as a special case of I/O, since most control paths eventually lead back to the host. 

  8. Actually, by stepping through the instructions in the intermediate language (IL)  

  9. Just ask me if you want to see the full, labeled graph. 

Microvium Alpha Release!

Microvium version 0.0.9 is published on npm, marking the first alpha release of Microvium!

I’m pleased with how quickly this has come together in the 4 months since I started the project back in February. Be ready for good things still to come! Subscribe to my blog if you want to be notified of new developments.

Please feel free to take a look, play around with it, and let me know your thoughts:

Here I have it running on a 16-bit device with 16 kB of ROM1 and 2 kB of RAM. This is roughly the scale of device that Microvium is primarily focused on.

The basic functionality is complete, but be aware that it’s not intended for production systems as of yet — I’m calling this an “alpha” release because the functionality is present but the testing is not complete and there are likely a few inevitable wrinkles still to be surfaced and ironed out.


  1. This device actually has 16 kB of FRAM, not flash, but I’m not using the read-write capabilities of the FRAM in this example. 

Microvium Garbage Collector

Edit: This garbage collector is superseded by the new-and-improved one. The design principles and objectives covered in this post still apply.


The Microvium garbage collector is here!

The garbage collector handles the automatic freeing of unused heap memory in the VM.

Microvium makes some interesting and perhaps unique tradeoffs with garbage collection which I’ll focus on in this post.

I’ll divide this post into two parts to cater for different audiences: I’ll start with a summary of the features at a high level, for those who don’t care about the details and design decisions, and then I’ll go into more detail.

Summary of Features

In no particular order, here are some features of the Microvium memory management system:

  1. Allocations are dynamically-sized, each having a 2-byte allocation header.
  2. Allocation headers are optional in some circumstances, so allocations can be even more compact1
  3. Allocations can be as small as 2 bytes2. For example, single-character strings are stored as 2-byte allocations, having a character3 and an implicit null terminator.
  4. The garbage collector is compacting (wiki: Mark-Compact). So, there is no heap fragmentation.
  5. Allocating on the heap is cheap and is constant-time — there is no searching through free lists.
  6. Data structures used for garbage collection are allocated only during garbage collection; no structures are persisted while the VM is idle.
  7. Memory is acquired from the host in chunks as-needed, so users don’t need to decide ahead of time how much memory they want to commit to a particular VM.

I’ll add in two limitations for completeness:

  1. The memory for a VM is currently limited to 16 kB
  2. A VM temporarily uses about twice its memory during the garbage collection phase.

Design Considerations

Microvium is for MCUs

While Microvium can and does run fine on desktop-class machines, it’s optimized primarily for microcontrollers. Microcontrollers have quite different characteristics that influenced the design decisions of the garbage collector (GC):

RAM is small relative to processing power.
A 3 GHz desktop computer might have 16 GB of RAM, while a 3 MHz MCU might have 16 kB of RAM — a thousand times less processing power but a million times smaller RAM (obviously this is a very rough calculation to illustrate a point). This is relevant to the GC design because it should be heavily biased towards having a small RAM footprint, since RAM is relatively more expensive than CPU on a microcontroller.

There is no separate CPU cache
Most MCUs I’ve dealt with have no cache — all memory is accessed with constant time4, and often on a single instruction cycle. This affects the design of a GC because things like locality, order-of-access, and prefetching are not factors that need to be considered.

Microvium expects small allocations

Microvium is optimized for 16-bit platforms. Its native value type is a constant-sized 16-bit dynamically-typed value. Anything that doesn’t fit in this small size needs to be heap-allocated, including strings, 32-bit integers5, floats, arrays, objects, etc.

Since Microvium expects these kinds of values to be frequently allocated on the heap instead of the stack, the heap is designed to be fast to allocate onto, allocations shouldn’t incur a lot of memory overhead, and allocations shouldn’t require a lot of padding.

For this reason, Microvium allocations are only 2-byte aligned, so that the most padding ever required will be 1 byte, and only for odd-sized allocations.
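The arithmetic is easy to check with a small sketch. The helper names are made up, and the comparison against 4-byte alignment is only there to show what the tighter alignment saves:

```javascript
// Round a size up to the next multiple of `alignment` (which must be a power of two).
function alignUp(sizeBytes, alignment) {
  return (sizeBytes + alignment - 1) & ~(alignment - 1);
}

function paddingFor(sizeBytes, alignment) {
  return alignUp(sizeBytes, alignment) - sizeBytes;
}

console.log(paddingFor(5, 2)); // 1 (odd size: one byte of padding)
console.log(paddingFor(6, 2)); // 0 (even size: no padding)
console.log(paddingFor(5, 4)); // 3 (what 4-byte alignment would cost instead)
```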

Microvium is not the main show

Another significant design consideration is that a Microvium program is not expected to typically be the main program running on a device. Rather, it is optimized for situations where there is an existing firmware application and a Microvium script is only consulted occasionally. The rest of the time, the script is “dormant” and should occupy as few resources as possible.

I think this point is important, so let me explain in a bit more detail…

The way I visualize this design consideration is to think that any one Microvium program is only one out of a number of “tasks” or “components” that exist in the firmware application as a whole. It doesn’t matter whether the other components of the application are other Microvium programs or are C modules — the design considerations are the same.

For argument’s sake, let’s say that some application has 12 components. In the following diagram, I’m representing the RAM space as the dark circle, and each of the colored circles is a “component” occupying RAM, and the space in the middle is vacant memory. Let’s say that each component in this diagram is dormant, but is designed to become active in a particular scenario.

You can think of these as 12 “modes” or “activities”. To use a concrete example, consider an application with Bluetooth support, where one of these 12 components is the part of the firmware that handles Bluetooth communication. Most of the time, there’s no Bluetooth activity, and so the component of the application which handles Bluetooth communication is dormant.

When there is some Bluetooth activity, we can think of it as if the Bluetooth component takes the stage for a short time to do its work, and then afterward packs itself up and goes back to its idle/dormant state. While the component is active, it will likely consume more memory, as it juggles Bluetooth messages in memory, or whatever else it needs to perform its activities. We hope that in most cases, when it’s done processing the Bluetooth event, it can pack up and release memory before waiting dormant until the next event. While it’s packed up, the “stage” is clear (there is free memory) for other components to wake up and do what they need to do to respond to other events as they come.

This model works best when you’re operating in a single-threaded environment.

This is the kind of situation that Microvium has been optimized for.

Why do I bring this up?

In this model, the memory that an average component uses while dormant is 12 times more costly than the additional memory it uses while active, because the average dormant memory is multiplied by the number of dormant components, while in this model, there is only one active component at a time.

While this example is an oversimplification, I think it still illustrates the point I’m making: for applications with dynamic memory allocation, memory held by a component for a long time is typically more expensive than memory held for a short time (see also Short-lived memory is cheaper). Microvium can be used in situations where this is not the case, but it’s optimized for situations where this is the case.
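To put some numbers on this, here is a small sketch of the model. The byte counts are invented purely for illustration, and `worstCaseRam` is a hypothetical helper, not something from Microvium:

```javascript
// RAM needed = every component's dormant footprint, plus the extra memory
// of whichever components are active at the same time.
function worstCaseRam({ components, dormantBytes, extraActiveBytes, simultaneouslyActive }) {
  return components * dormantBytes + simultaneouslyActive * extraActiveBytes;
}

const base = { components: 12, dormantBytes: 200, extraActiveBytes: 600, simultaneouslyActive: 1 };

const total = worstCaseRam(base);
const afterDormantTrim = worstCaseRam({ ...base, dormantBytes: 150 });    // 50 bytes less while dormant
const afterActiveTrim = worstCaseRam({ ...base, extraActiveBytes: 550 }); // 50 bytes less while active

console.log(total);                    // 3000
console.log(total - afterDormantTrim); // 600
console.log(total - afterActiveTrim);  // 50
```

Trimming 50 bytes from the average dormant footprint saves 12 times as much RAM as trimming 50 bytes from the active peak, which is the 12x factor described above.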

So, a VM that uses a lot of memory for a very short time is considered better than one that uses much less memory but holds onto it forever. This principle applies to any part of the VM, such as the garbage collector, or the Microvium engine as a whole, or user-level JS code. But here we’re talking about the implication for the GC design: if the GC has memory overhead that it holds for a long time, this is more costly than memory overhead for a short time.

What kind of memory overhead is related to the GC? Here are some examples:

  • Some GC algorithms store extra data in the allocation headers6. This is permanent overhead proportional to the dormant heap size of the VM, and so is expensive. Microvium does not do this.
  • Some managed heaps can get fragmented. Even if fragmentation only caused the VM to consume an additional 20% dormant memory, by our above example, this is roughly equivalent to consuming 20%*12 = 240% additional peak memory. So fragmentation is expensive. Microvium uses a compacting collector to fully eliminate fragmentation; to “squeeze” out all the unused spaces before a VM goes dormant.

On the other hand, the Microvium garbage collector uses quite a lot of memory while collecting. In fact, during a typical collection, the amount of memory consumed by the VM may double for a brief time.

This is because the Microvium collector is actually a semispace collector. At the time of collection, it allocates a whole new heap for the VM, and then copies reachable objects from the old heap to the new one, before discarding the old heap.

The above design considerations are critical to understanding this design choice: for situations where a Microvium script is a small component in a larger firmware system, and is dormant during most of the firmware’s activities, the overwhelming metric to optimize is how well the VM can be “packed up” while it’s dormant, and Microvium achieves this with great success.


  1. Property cells, for example, are 6 bytes each and do not use an allocation header 

  2. Excluding the allocation header 

  3. Strings are utf-8 encoded 

  4. For internal SRAM 

  5. Or any integer larger than 14-bits 

  6. Such as mark bits and forwarding pointers