TL;DR: Microvium closures are now as small as 6 bytes on the heap, down from 12 bytes in the previous design (and over 100 bytes in some other engines).
Microvium is all about squeezing things into tiny spaces, and the latest area to get this treatment is closures — the memory structure used to carry the state of nested functions. If you’re new to closures, have a look at my previous post which explains closures from a C programmer’s perspective.
One of my goals with Microvium has been to make it feasible and realistic to run JavaScript on really tiny devices. More than half of the microcontrollers selling on Digikey have less than 16kB of RAM and 128kB of flash. When you’re working with devices this small, you count every byte you use.
People typically program these kinds of devices in C. However, as I’ve shown before, there can be some real, practical benefits to using a language like JavaScript in these situations. For example, JavaScript-style single-threading can be more memory-efficient than multithreading, and using a garbage collector like Microvium’s can avoid heap fragmentation overhead and help to reduce the memory leaks associated with manually freeing memory. And of course, your code may be less complicated when working in a higher-level language, which itself can mean less development time and fewer bugs.
When you’re working with such small memory spaces, what kind of size would you consider to be acceptable for a closure function, like the myCounter in the following example?
function makeCounter() {
let x = 0;
function incCounter() {
return ++x;
}
return incCounter;
}
const myCounter = makeCounter();
As I demonstrated in my previous post, if you code up something like this in C, it might take 20 bytes of memory1, ignoring fragmentation overhead. So what kind of size would be acceptable to get the convenience of the JavaScript syntax? Maybe 30 bytes? 50 bytes?
Other JavaScript engines that I’ve measured take over 100 bytes for a closure like this! (See here for my measurement methodology and feel free to correct it). That’s not a fault of those engines — it’s a product of the JavaScript spec, which requires that all functions are objects. For function declarations (functions declared using function rather than arrow syntax), not only is the function an object, but it also has a fresh prototype object in case you use the function with new.
That’s a heavy price to pay! Maybe that’s ok on a desktop-class machine. But when you’re working on a device with tens of kilobytes of memory, that’s just not an affordable feature anymore. So, in Microvium, closures are stripped down to their bare minimum. Functions in Microvium are not objects — they cannot have properties. And in return for this trade-off, Microvium closures can be really tiny: the closure in the above example is just 8 bytes! (Including a 2-byte reference and 2-byte allocation header).
Another reason why this is so small in Microvium is its 16-bit slot size. It stores all pointers as 16-bit, even on a 32-bit machine. Numbers, such as x in the example, start out as 16-bit but grow as needed. This is great for things like counters, which are likely to be small most of the time but can still count up to 2^53 without overflowing. This is in contrast to C, where you typically need to reserve memory upfront for the largest possible value that a variable might hold.
In general, closures like this in Microvium take 4 + 2n bytes of memory on the heap, where n is the number of variables. This is down from 10 + 2n bytes of memory in the previous design which is a nice improvement.
Assuming here a 32-bit device and FreeRTOS heap. The size I’ve quoted here includes an 8-byte allocation, with 8-byte allocation header, and a 4-byte pointer to the allocation ↩
TL;DR: Closures are a good way to represent callbacks for non-blocking code. Although C doesn’t support closures, you can achieve something similar using structs on the heap, but you can get even better memory efficiency by using JavaScript running on Microvium.
I’ve recently made some improvements to the memory usage of closures in Microvium, which I’ll talk about in more detail in an upcoming post. But before I get there, I want to first talk about how a similar problem can be solved in C, as a point of comparison.
In this post, we’ll look at a scenario where closures are useful, starting with the most basic solution to the problem in C and then progressively improving the design to address its weaknesses until we end up with something closure-like, which I will then contrast with the same solution in JavaScript. This post is aimed at people who are more familiar with C than with JavaScript, and are not necessarily familiar with closures.
Preserving state during long-lived operations
Let’s start by imagining that we have a C program that needs to perform some slow, I/O-bound operation, such as sending data to a server. For simplicity, let’s assume that we just need to send a single integer, and let’s assume that someone has given us a library that does exactly this:
// Send an int to the server, and block until we get confirmation of receipt
void sendDataToServerAndWaitForResponse(int payload);
And let’s say that we have a requirement to log once the payload has been sent:
void sendToServerAndLog(int num) {
sendDataToServerAndWaitForResponse(num);
printf("This num was successfully received by the server: %i\n", num);
}
So here I’ve constructed an example where the value of num, which is available before we contact the server, is also used after we get a response from the server. I’ve simplified the example by ignoring failure cases. This is a special case of a common situation where some program state needs to be preserved across long-lived operations, such as a state machine or retry counter.
A problem with the above solution is that it blocks the thread. In terms of memory usage, we can think of it as requiring the memory of a whole call stack, which may occupy anything from hundreds of bytes to megabytes, depending on the system we’re running on. And more importantly, we can’t use the thread for anything else while it’s blocked.
Using a callback
To avoid blocking the thread, we could instead consider using a callback function. Imagine that our hypothetical library supported this by taking a callback argument, rather than blocking until completion:
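// Hypothetical non-blocking version of the library function: it returns
// immediately and invokes `callback` later, once the server confirms receipt
typedef void Callback(void);
void sendDataToServer(int payload, Callback* callback);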
Now we can write our application code like the following, where we’ve split the behavior of sendToServerAndLog into two parts — one part before sending the data and one that’s the continuation after we’ve received a response:
int save_num;
void continue_sendToServerAndLog(void); // forward declaration (defined below)
void sendToServerAndLog(int num) {
save_num = num;
sendDataToServer(num, &continue_sendToServerAndLog);
}
void continue_sendToServerAndLog() {
int num = save_num;
printf("This num was successfully received by the server: %i\n", num);
}
In order to have the state of num available in the callback, we needed to persist it somewhere, so we need the save_num variable. This solution is much more memory efficient than the previous one — if we’re running on a 32-bit platform, our app code occupies only the 4 bytes of save_num. The library code now also needs to preserve the callback pointer across the call, which uses an additional 4 bytes of memory. So this solution takes a total of 8 bytes of memory during the server operation, compared to using a whole thread and call stack of memory before. But there are still two issues with this design:
The save_num variable persists forever, even when we’re not sending data to the server. If the hypothetical library uses the same pattern to store the callback, then together the full 8 bytes are consuming memory forever.
We can’t have multiple calls to the server running in parallel here. If we call sendToServerAndLog a second time before the first response is received, the value of save_num is corrupted.
Adding context
A common pattern to get around the above problem is to have the library code accept an additional parameter that it passes back to our callback, which here we might name “context” because it represents the contextual information of the particular caller:
typedef void Callback(void* context);
// Whatever `context` is provided will be passed to the `Callback` untouched.
void sendDataToServer(int payload, Callback* callback, void* context);
Now we can use it like this:
void continue_sendToServerAndLog(void* context_); // forward declaration (defined below)

void sendToServerAndLog(int num) {
int* context = malloc(sizeof(int));
*context = num;
sendDataToServer(num, &continue_sendToServerAndLog, context);
}
void continue_sendToServerAndLog(void* context_) {
int* context = (int*)context_;
int num = *context;
printf("This num was successfully received by the server: %i\n", num);
free(context);
}
Now, if we call sendToServerAndLog multiple times, each call creates a distinct context that is completely independent of any other call.
Side note: Why make the context a void* instead of just an int, since we only need an int here? The reason is that we’re imagining here that sendDataToServer is part of a reusable library. Even if it’s a first-party library, it’s generally better practice to write it in such a way that it’s decoupled from the particular way you use the library. Making it a void* allows the user of the library to decide the size and shape of the preserved state, rather than coupling it to the library itself.
This is the same reason why we use a function pointer at all, rather than just hardcoding the library to invoke our particular callback directly: the library should not be coupled to the particular function that executes after it. And as the program evolves, we might land up calling it from multiple places that each require a different callback.
Applying the same pattern to sendToServerAndLog
We decided that a good interface for the library function sendDataToServer is one that takes a callback and a context, because the operation takes a long time, and a caller may want to continue doing something else after the operation completes. But similarly, sendToServerAndLog is also an operation that takes a long time, and its caller may also want to do something when the operation completes.
If we’re working with highly coupled code, then maybe we already know whether or not the caller of sendToServerAndLog needs to do anything else afterward, and exactly what it needs to do. But if we want sendToServerAndLog to be reusable code that is decoupled from its caller, then we should probably have it accept its own callback and context from its caller. If we do this, then we need to persist the caller’s callback and context until the whole operation completes, so let’s upgrade our context to a struct that includes these fields:
// Context for sendToServerAndLog
typedef struct sendToServerAndLog_Context {
int num;
Callback* caller_callback;
void* caller_context;
} sendToServerAndLog_Context;
void continue_sendToServerAndLog(void* context_); // forward declaration (defined below)

void sendToServerAndLog(int num, Callback* caller_callback, void* caller_context) {
sendToServerAndLog_Context* context = malloc(sizeof *context);
context->num = num;
context->caller_callback = caller_callback;
context->caller_context = caller_context;
sendDataToServer(num, &continue_sendToServerAndLog, context);
}
void continue_sendToServerAndLog(void* context_) {
sendToServerAndLog_Context* context = context_;
int num = context->num;
printf("This num was successfully received by the server: %i\n", num);
context->caller_callback(context->caller_context);
free(context);
}
Embedding the function pointer
Stylistically, it’s interesting to consider one further modification to this example. You may or may not agree with the following design change, but it leads nicely into the topic of closures in Microvium, which I’ll get to in a moment.
Rather than having the function save and copy around two pieces of state on behalf of the caller — the caller_callback and caller_context — we can combine them into one: we can just reference the caller_context and require that the first field in the caller_context is the callback function pointer, as in the following code. Also, rather than calling this a context, let’s now call it a closure, since it captures both a function pointer and some general state. The relationship between this and real closures will become clearer later.
// A general closure is expected to take this shape
typedef struct Closure Closure; // forward declaration so ClosureFunc can refer to it
typedef void ClosureFunc(Closure* closure);
struct Closure {
  // Must be the first field
  ClosureFunc* invoke;
  /* ...other fields may follow... */
};
// Here we have a specific closure shape for our function
typedef struct sendToServerAndLog_Closure {
// Must be the first field. If you're using C++, you might instead
// inherit `sendToServerAndLog_Closure` from `Closure`
ClosureFunc* invoke;
// Other fields:
int num;
// Note now that we don’t need to store a separate caller_context
// since both the context and the function pointer are combined
// into the single closure struct.
Closure* caller_callback;
} sendToServerAndLog_Closure;
void continue_sendToServerAndLog(Closure* closure_); // forward declaration (defined below)

void sendToServerAndLog(int num, Closure* caller_callback) {
sendToServerAndLog_Closure* closure = malloc(sizeof *closure);
closure->invoke = continue_sendToServerAndLog;
closure->num = num;
closure->caller_callback = caller_callback;
// (The library is now assumed to be declared as: void sendDataToServer(int payload, Closure* callback))
sendDataToServer(num, closure);
}
void continue_sendToServerAndLog(Closure* closure_) {
sendToServerAndLog_Closure* closure = (sendToServerAndLog_Closure*)closure_;
int num = closure->num;
printf("This num was successfully received by the server: %i\n", num);
Closure* caller_callback = closure->caller_callback;
caller_callback->invoke(caller_callback);
free(closure);
}
This final design is quite clean:
The memory for the transient state (e.g. num) is only allocated while the long-running operation is active. Remember that short-lived memory is cheaper memory.
It doesn’t block the thread, so it’s easier to parallelize multiple operations if needed, without each operation consuming a whole thread of memory.
Each layer is decoupled: sendDataToServer doesn’t need to know who its caller is, and similarly sendToServerAndLog doesn’t need to know who its caller is.
The callback is neatly encapsulated into a single pointer value that can be passed around as a first-class value. If you’re familiar with C++, this is a similar benefit to using the std::function<> type.
But there are some disadvantages to this design:
Although it represents the same behavior as the first design (the synchronous code), the code is now a whole lot more complicated.
Here we’ve only shown one closure signature. But what if we needed a return value to be passed to the closure? In general, each different return type is going to need its own type definitions for ClosureFunc and Closure, which will add up to a lot of boilerplate.
The memory efficiency is not great because it uses malloc and free.
On my Windows machine with Visual C++, I measure malloc to have an overhead cost of 40 bytes per allocation (compiling for x86).
In FreeRTOS, each allocation has an overhead of 8 bytes on a 32-bit platform. With this figure, the closure in the example takes 20 bytes of heap space.
The heap in C/C++ can get fragmented, which costs additional memory.
Using Microvium Instead
We can write this same example in JavaScript using nested functions, as follows:
function sendToServerAndLog(num, caller_callback) {
sendDataToServer(num, continue_sendToServerAndLog);
function continue_sendToServerAndLog() {
console.log(`This num was successfully received by the server: ${num}`);
caller_callback();
}
}
The nested function continue_sendToServerAndLog has access to variables in the outer function (in this case the parameters num and caller_callback). Here I tried to keep the function names consistent with the C example, but in practice, it may be more convenient to do the same thing using arrow function syntax, as follows:
function sendToServerAndLog(num, caller_callback) {
sendDataToServer(num, () => {
console.log(`This num was successfully received by the server: ${num}`);
caller_callback();
});
}
Either way, the values num and caller_callback are automatically captured into a closure on the JavaScript heap, making them available to the nested function automatically.
If you’re using the Microvium JavaScript engine, this created closure has a very similar structure in memory to the final example we did in C — it’s a single structure with a function pointer and two other variables. You may see now why I called the struct in the earlier C example a “closure”. The C code is a more explicit way of representing the same runtime structure, with similar benefits from a decoupling and modularity perspective, although clearly the JavaScript is more syntactically simple.
This closure heap allocation in Microvium will have the following characteristics:
If the num is an integer in the range -8192 to 8191, the closure occupies 8 bytes of memory, including a 2-byte allocation header, compared to the 20 bytes consumed by the C example on a FreeRTOS heap.
There is no fragmentation overhead, since the Microvium heap is compacting.
Allocation of the closure generally happens in constant time. Since the Microvium heap is contiguous, creating a new allocation is similar to just bumping a free pointer forward, as in the sketch below.
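Roughly, the allocation fast path looks like the following (a simplified sketch; gcAlloc, allocCursor, heapEnd and gcCollectAndRetry are illustrative names, not Microvium’s actual API):

#include <stdint.h>

typedef struct VM { uint8_t* allocCursor; uint8_t* heapEnd; } VM; // illustrative
void* gcCollectAndRetry(VM* vm, uint16_t size); // illustrative slow path

// Because the heap is contiguous and compacted, allocating is mostly just
// bumping a cursor forward; the slow path runs a collection and retries.
void* gcAlloc(VM* vm, uint16_t size) {
  uint8_t* p = vm->allocCursor;
  if (p + size > vm->heapEnd) {
    return gcCollectAndRetry(vm, size);
  }
  vm->allocCursor = p + size;
  return (void*)p;
}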
Conclusion
We’ve walked through an example that’s representative of a common situation in program development, especially when networking is involved: network requests take time, so we must either block the current thread or find another way to remember what we were doing so we can get back to it when the request completes. When writing your code in a modular and decoupled way, it’s better not to assume anything about the caller of your long-running operation, so it’s better not to block the thread or hard-code anything about which callback to run or what state to hold onto.
In this case, Microvium actually offers you a way to make your code more memory efficient than the equivalent C code, while also making it easier to follow, and preserving the nice decoupling characteristics. Depending on your situation, this might make Microvium a good choice for orchestrating this kind of high-level program flow, especially when long-running tasks are involved and when you need to keep track of state across those tasks.
I’d say this is another advantage of using Microvium: numbers automatically grow in size as-needed. Integers in the range -8192 to 8191 use 2 bytes of memory ↩
TL;DR: Microvium now has support for classes. Here are some details for the curious.
I’ve been putting off adding support for classes in Microvium because you can do a lot without them, and other features have been more important in my opinion. I’m also personally more of a fan of functional-style programming rather than object-orientated programming, so I seldom write classes myself.
But I kept running into little things that are difficult to do without support for classes. With the recent addition of try-catch, for example, you could now throw, but you couldn’t throw new Error(...) because Error is a class. Another example is the ECMA-419 API for embedded systems, which is an API standard for things like GPIO, but which relies heavily on support for classes.
Microvium now has support for classes, which I think makes it the only JavaScript engine under 100 kB of flash that has all the features required to implement ECMA-419 (Microvium itself is currently under 10 kB).
Microvium functions are not objects
Classes in JavaScript are more or less just syntactic sugar for old-style prototype-based code. For example, the following are more or less equivalent in normal JavaScript (but the prototype style doesn’t work in Microvium):
// ES6 class style
class Foo {
bar() { console.log('Method bar called') }
}
// Old prototype-style
function Foo() {}
Foo.prototype.bar = function() { console.log('Method bar called') }
Both styles are setting up the Foo.prototype object, but with the class syntax, it is implicit. Then if you evaluate new Foo(), the engine will create a new object whose __proto__ is Foo.prototype.
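Either way, the resulting class is used in the same way (though, as noted, only the class syntax works in Microvium):

const foo = new Foo();
foo.bar(); // prints "Method bar called"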
This is all possible because functions in JavaScript are also objects, and can hold properties like any other object.
But Microvium functions do not support properties, so the function Foo cannot have a prototype property (or any other static class properties like Foo.myStaticProperty).
I made the choice to omit this feature from the engine because this pattern is relatively uncommon outside of classes. It would be a massive waste of RAM to have every function also be an object. A closure function in Microvium currently uses only 6 bytes of RAM, and most functions will use no RAM at all if the static analysis determines them not to be closures (e.g. all top-level functions). Whereas a pair of normal objects like { prototype: {} } is already 16 bytes of RAM (6 bytes for each object plus 4 bytes for each property). It would be madness to require that every function declaration uses 16 bytes of RAM just in case someone wants to use it as a constructor.
So how do classes work in Microvium?
The way it works is quite simple: when you use the class syntax, Microvium creates a function that supports properties (i.e. “a class”), but when you use the normal function syntax, it will keep its old behavior.
A class in Microvium is a distinct type. Although typeof MyClass will return "function", as per the JS spec, if you use the C function mvm_typeOf to probe the type, it will tell you VM_T_CLASS not VM_T_FUNCTION.
Internally, a class is a tuple of a props object and a constructor function, as shown in the following diagram. When you access the class like an object (e.g. MyClass.myStaticProperty or MyClass.prototype), it delegates the property-access operation to the props object.
Memory layout for a minimal empty class and an instantiation of the class
When you construct an instance of the class using new MyClass(), it creates a new object whose prototype (__proto__) is props.prototype, and then calls the constructor function with the new object as the this parameter.
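In rough JavaScript pseudocode, the behavior described above looks something like this (just an illustrative sketch; the engine does this in C against its internal class tuple, and props and constructorFunc here simply stand for the two halves of that tuple):

// Sketch only: roughly what `new MyClass(...args)` does in Microvium
function constructInstance(theClass, args) {
  const instance = {};
  // The new object's __proto__ becomes the class's props.prototype
  Object.setPrototypeOf(instance, theClass.props.prototype);
  // The constructor is called with `this` bound to the new object
  theClass.constructorFunc.apply(instance, args);
  return instance;
}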
The constructor function itself could just be a bytecode function in ROM, but it can, in general, be any function because it delegates the call to the normal calling machinery of Microvium. For example, if you have a class declaration nested inside a function then the constructor will be a closure, with access to the locals of the surrounding function. Or the constructor can be an imported host function. But the simplest case is a class declared at the top level of the code, in which case the constructor part of the tuple just points directly to a function in the bytecode ROM image.
The key-value pair in the above diagram (where the key is "prototype") is how Microvium implements object properties. Each property is a 4-byte pair containing a 2-byte property key and a 2-byte value. Here I’m only assuming one property (MyClass.prototype) but other properties would follow contiguously if there were more.
The next field is unimportant for this example. When you add properties to an object, they’re actually added using a linked list rather than resizing the object structure (for CPU efficiency) but then the garbage collector compacts these linked lists into a contiguous form (for memory efficiency).
Classes and objects are expensive
Classes (and objects) are one of the more expensive features of Microvium. The above minimal/empty class is already 22 bytes of RAM. While each instance of the class is only 6 bytes of RAM, every property on an instance is an additional 4 bytes. So an object with just 2 properties is already 14 bytes.
Property lookup is also quite expensive:
It is linear-time because it has to search through the property list.
Many different things support property lookup, such as arrays, classes, buffers, and of course objects. The lookup code needs to figure out what it is in order to know how to find its properties.
Properties use string keys. For hard-coded literal strings, like x.y or x['y'], these strings don’t incur RAM overhead but they do add to the bytecode size.
For computed string properties like x['a' + 'b'], there is additional overhead to perform string interning — string interning is the process of consolidating different strings with the same content so that they also have the same reference identity, which makes property lookups more efficient.
String interning can potentially trigger a garbage collection cycle because it’s growing the intern table. Apart from the overhead of the collection itself, just the possibility of a garbage collection means that the implementation of property lookup needs to deal with the fact that all the memory may be shuffled around during the property access (e.g. all the temporaries need to be properly GC-reachable), which itself adds overhead.
A property lookup instruction in bytecode involves multiple steps consuming at least 5 bytes of bytecode:
Loading the object from which to get the property (a 1-byte instruction, at least)
Loading the string that represents the property (typically a 3-byte instruction because it embeds a reference to the string)
An instruction to actually trigger the property loading (a 1-byte instruction)
Closures, on the other hand, are asymptotically more efficient. A closure that closes over 2 variables is 14 bytes — the same size as an object with 2 properties. But:
Each additional variable closed over is another 2 bytes, rather than the 4 bytes of a property.
Closure variable access is typically O(1) because closures are random-access structures and the indexes are computed at compile time by static analysis (see my post on closure variable indexing).
Up to 15 different closure variables can be accessed by a single 1-byte instruction, compared to a minimum of 5 bytes for the instruction sequence for object property access.
Can objects be more efficient?
The experimental static analysis technique I designed for Microvium Boost a while back computes all the information that would be required to convert an object to a fixed-length array, in cases where it’s possible to infer the closed set of property keys that are ever used to access an object. My hope is that in the future I could implement this optimization (transforming objects and classes into fixed-length arrays) which could have some significant performance benefits:
An object with 2 properties would be 6 bytes instead of 14.
Each additional object property would take another 2 bytes.
Property lookup would be O(1).
Up to 15 different properties could be accessed using a single-byte instruction.
Conclusion
As it stands, classes are not a feature that I would want to use very often since they are so expensive. But having classes opens the door to different kinds of libraries and APIs, and in particular, is a stepping-stone towards implementing other standards-compliant features in Microvium.
TL;DR: Support for closures in Microvium sets it apart from other JS engines of a similar size. Closures simplify state machines and enable functional-style code. Closures in snapshots are a new way of sharing compile-time state with the runtime program. This post goes through some examples and design details.
What is a closure?
MDN has already done a great job of explaining what a closure is, so I’m just going to borrow their explanation:
A closure is the combination of a function bundled together (enclosed) with references to its surrounding state (the lexical environment). In other words, a closure gives you access to an outer function’s scope from an inner function. In JavaScript, closures are created every time a function is created, at function creation time.
Here is a simple example:
function makeCounter() {
let x = 0;
function incCounter() {
x++;
return x;
}
return incCounter;
}
const myCounter1 = makeCounter();
const myCounter2 = makeCounter();
console.log(myCounter1()); // 1
console.log(myCounter1()); // 2
console.log(myCounter1()); // 3
// myCounter2 is an independent counter
console.log(myCounter2()); // 1
console.log(myCounter2()); // 2
console.log(myCounter2()); // 3
In the above example, the function named incCounter is a closure because it closes over variable x in its outer lexical scope. A closure is just a function, nested in another function, which accesses variables in the outer function.
Microvium also supports the arrow function syntax, so the following example has the same output but is more concise:
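const makeCounter = () => {
  let x = 0;
  return () => ++x;
};

const myCounter1 = makeCounter();
const myCounter2 = makeCounter();

console.log(myCounter1()); // 1
console.log(myCounter1()); // 2
console.log(myCounter1()); // 3

// myCounter2 is still an independent counter
console.log(myCounter2()); // 1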
For more detail, take a look at the above-mentioned MDN article. The rest of this post will assume that you know what a closure is.
Why are closures useful?
Closures in snapshots
Let’s say that we want a script that exports two functions: one that prints “hello” and the other that prints “world”. Without closures, we could implement this in Microvium as follows:
vmExport(0, printHello);
vmExport(1, printWorld);
function printHello() {
console.log('hello');
}
function printWorld() {
console.log('world');
}
(Recall that vmExport is a function that the script calls to export a function to the host).
In this example, printHello and printWorld are each functions that take no arguments and will print the corresponding string to the console1.
With the introduction of closures, we could factor out the commonality between printHello and printWorld and just have a printX that can print either one:
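const printHello = makePrinter('hello');
const printWorld = makePrinter('world');

vmExport(0, printHello);
vmExport(1, printWorld);

function makePrinter(thingToPrint) {
  // printX closes over thingToPrint
  function printX() {
    console.log(thingToPrint);
  }
  return printX;
}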
This refactors the code so that console.log only appears once but is shared by both printHello and printWorld. For the simple case of console.log this doesn’t add much benefit, but you can imagine cases where printX is a lot more complicated and so this refactoring may be beneficial.
Another thing to note in this example is that makePrinter is called at compile time since it’s in the top-level code. The resulting closures printHello and printWorld instantiated at compile time are carried to runtime via the snapshot, along with their state (the value of thingToPrint). The closures feature here plays nicely with snapshotting, as a new way to share compile-time state with the runtime program.
Depending on the style of code you’re familiar with, we can also write the same example more concisely as:
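const makePrinter = thingToPrint => () => console.log(thingToPrint);

vmExport(0, makePrinter('hello'));
vmExport(1, makePrinter('world'));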
We could also get the list of things to export from a dynamic source, such as an array (or even something read from a file at compile time using fs.readFileSync):
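const thingsToPrint = ['hello', 'world'];

for (let i = 0; i < thingsToPrint.length; i++) {
  const thingToPrint = thingsToPrint[i];
  vmExport(i, () => console.log(thingToPrint));
}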
Side note: the above example also demonstrates the usefulness of vmExport being a normal function, rather than some special syntax. Think about your favorite language or engine and how you would implement the above in that language. You can’t define an extern void foo() {} inside a for-loop in C, or public static void foo() {} inside a for-loop in C#. The only solution in these environments to the objective of exporting a programmatically-defined set of functions would be to use a code generator and the result would be much more complicated.
Closures for state machines
It’s much easier and better performance to implement a finite state machine using closures. Consider the following two-state state machine:
The above state machine has 2 states, which I’ve called stateA and stateB. When event 1 is received while in stateA, the machine will transition to stateB, but if any other event (e.g. event 2) is received while in stateA, the machine will not transition to stateB. See Wikipedia for a more detailed description of FSMs.
The ability to use closures allows us to implement this state machine using a function for each state. In the following example, stateA is a function that receives events by its event parameter. We “make” stateA only when we need it, by calling enterStateA(). The example includes an eventCount as part of state A to show how states can have their own variables that are persisted across multiple events.
function enterStateA() {
console.log('Transitioned to State A!');
let eventCount = 0; // Some internal state only used by stateA
// A function that handles events while we're in stateA
function stateA(event) {
if (event === 1) {
currentState = enterStateB();
} else {
eventCount++;
console.log(`Received ${eventCount} events while in state A`);
}
}
return stateA;
}
function enterStateB() {
console.log('Transitioned to State B!');
return event => {
if (event === 2) {
currentState = enterStateA();
}
}
}
// We'll start in stateA
let currentState = enterStateA();
// Every time an event is received, we send it to the current state
const processEvent = event => currentState(event);
// Allow the host firmware to send events to the state machine
vmExport(0, processEvent);
// Some example events:
processEvent(5); // Received 1 events while in state A
processEvent(5); // Received 2 events while in state A
processEvent(5); // Received 3 events while in state A
processEvent(1); // Transitioned to State B!
processEvent(1); //
processEvent(2); // Transitioned to State A!
processEvent(2); // Received 1 events while in state A
In the above example, state A has the eventCount counter which is part of its closure. When the system transitions to state B, the counter can be garbage collected. This might not be very useful when only considering a single counter variable, but the pattern generalizes nicely to systems that have more expensive states that may hold buffers and other resources.
Once you understand closures and higher-order functions, this is a very natural way to represent state machines.
Closures under the hood
Let’s go back to the simple “counter” example for illustration:
function makeCounter() {
let x = 0;
function incCounter() {
return ++x;
}
return incCounter;
}
const myCounter = makeCounter();
console.log(myCounter()); // 1
console.log(myCounter()); // 2
console.log(myCounter()); // 3
The runtime memory layout for this example is as follows:
The global variable myCounter is a 2-byte slot, as are all variables in Microvium. The slot contains a pointer to the closure, which is an immutable tuple containing a reference to the function code (incCounter, in this example) and the enclosing lexical environment which in Microvium is called the scope.
The closure and scope are allocated on the garbage-collected heap — if and when the closure is no longer reachable, it will be freed.
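In C-like terms, the closure allocation is conceptually just a pair of slots (a sketch only; these are not Microvium’s actual declarations):

#include <stdint.h>

typedef uint16_t Value; // a Microvium slot: 2 bytes (illustrative name)

// Sketch of a closure: an immutable 2-slot tuple on the GC heap
typedef struct Closure {
  Value function; // reference to the function's bytecode (incCounter in this example)
  Value scope;    // reference to the enclosing lexical environment (which holds x)
} Closure;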
When the closure myCounter is called, the VM sees that the callee is a closure and sets a special scope register in the VM to the closure’s scope before running the target bytecode. The bytecode can then interact with the scope variables through special machine instructions that leverage the scope register.
For a look into how variables are accessed by closure code, see my post on Closure Variable Indexing.
More efficient than objects
Closures are much more efficient than objects in Microvium:
Every closure variable is one word (2 bytes) while an object property is 2 words (4 bytes) because a property is stored as a key-value pair. Variables aren’t stored with their name because the static analysis can determine an index for them.
Closure variables can be accessed with single-byte instructions (a 4-bit opcode and 4-bit literal variable index) whereas typical object properties take at least 5 bytes of bytecode instructions to access. This is in part because the “current” closure scope is implicit, while there is no such thing as the “current” object (the object needs to be specified explicitly), and also because object property access requires specifying the property key which is a reference to a string.
All the object key strings take up memory in the bytecode image.
Most closure variables are accessed in O(1) time — adding more variables does not slow down the access time, whereas adding more properties to an object makes property access slower, since lookup is O(n) in the number of properties.
Conclusion
Closures are useful and form the backbone of a lot of JavaScript programming. I’ve talked before about how closures are useful for callback-style asynchronous code, and in this post, I’ve also shown how they are useful for modeling state machines and make certain kinds of refactoring possible.
Other tiny JavaScript engines such as Elk and mJS don’t support closures, so this feature sets Microvium apart from the crowd2.
The closures feature has been the single most complicated and time-consuming feature to implement in Microvium because the static analysis requires multiple passes and completely changed how the compiler front-end works. It’s really not as simple as one would first imagine. Consider, as just one example, that the let i binding in for (let i = 0 ...) has multiple instances of i relative to the containing function, and that on each iteration the previous value is copied to the new instance.
But after a long journey, I can say that I’m happy with the result.
This example assumes that you’ve added `console.log` to the global scope ↩
Larger engines such as Espruino and XS support closures along with many other features, but come at a serious size premium. ↩
TL;DR: Microvium now supports exception handling, with just 4 bytes of overhead per active try block. This post covers some of the details of the design, for those who are interested.
Note: if you’re new to try-catch, see the MDN article on the subject. Note however that Microvium does not yet support finally, but in my experience finally is a less-used feature of JS.
Why try-catch?
One of my objectives with Microvium is to allow users to write JS code that is stylistically similar to the kind of code they would write for a larger engine like V8/Node.js. I want it to be easy to write JS libraries that run on both Microvium and Node, and the public interface of these libraries especially should not be contorted to fit the Microvium feature subset.
A major thing that’s been missing from this vision is the ability to use exceptions to pass errors. A library that can’t use exceptions to pass errors must use some other technique such as error codes, which are not typical JavaScript style and would clutter up the API of such a library.
A look at alternatives
Microvium is implemented in C. How does one implement exceptions in a language like C? In the journey of finding the best way to do exceptions, I thought of a number of different ideas.
The most obvious one to come to mind is something like setjmp/longjmp. This is a common way to implement exception-like behavior in a language like C, since it’s the only way to jump from one C function into another that’s earlier on the stack. At face value, this sounds sensible since both the engine and host are running on C.
Looking at other embedded JavaScript engines, mJS and Elk don’t support exceptions at all, but Moddable’s XS uses this setjmp/longjmp approach. At each try, XS mallocs a jump structure which is populated with the state of current virtual registers and physical registers, as well as a pointer to the catch code that would handle the exception. Upon encountering a throw, all the virtual and physical registers are restored to their saved state, which implicitly unwinds the stack, and then control is moved to the referenced catch block.
In my measurements1, one of these jump structures in XS is 120-140 bytes on an embedded ARM architecture and 256-312 bytes on my x64 machine2. Even the jmp_buf on its own is 92 bytes on Cortex M0 (the architecture to which I’ve targeted all my size tests in Microvium). That’s quite heavy! Not to mention the processing overhead of doing a malloc and populating this structure (each time a try block is entered).
How it works in Microvium
After thinking about it for some time, the following is the solution I settled on. Consider the following code, where we have 2 variables on the stack, and 2 try-catch blocks:
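Concretely, something along these lines (the line numbers referenced below are marked in comments):

let x;              // line 1
try {               // line 2
  try {             // line 3
    let y;          // line 4
    throw 'oops';   // line 5
  }                 // line 6
  catch (e1) {      // line 7
    // ...handle the inner error...
  }
}
catch (e2) {        // line 11
  // ...handle the outer error...
}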
If we were to throw an exception on line 5, what needs to happen?
We need to pop the variable y off the stack (but not x), since we’re exiting its containing scope. This is called unwinding the stack.
We need to pass control to the e1 catch clause (line 7)
We now need to record that the catch(e2) on line 11 is the next outer catch block, in case there is another throw
Microvium does this by keeping a linked list of try blocks on the stack. Each try block in the linked list is 2 words (4 bytes): a pointer to the bytecode address of the corresponding catch bytecode, plus a pointer to the next outer try block in the linked list. For the previous example, there will be 2 try blocks on the stack when we’re at line 5, as shown in the following diagram:
Each of the smaller rectangles here represents a 16-bit word (aka slot). The arrows here represent pointers. The red blocks (with an extra border) each represent an active try block consisting of 2 words (catch and nextTry). They form a linked list because each has a pointer to the next outer try block.
The topTry register points to the inner-most active try block — the head of the linked list. Each time the program enters a try statement, Microvium will push another try block to the stack and record its stack address in the topTry register.
The topTry register serves 2 purposes simultaneously. For one, it points to the head of the linked list, which allows us to find the associated catch block when an exception is thrown. But it also points to exactly the stack level we need to unwind to when an exception is thrown. This is similarly true for each nextTry pointer.
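Conceptually, each entry in that linked list is just two 16-bit words (a sketch, not the engine’s actual declarations):

#include <stdint.h>

// Sketch of one active try block: 2 words (4 bytes) pushed onto the VM stack
typedef struct TryBlock {
  uint16_t catchAddress; // bytecode address of the corresponding catch block
  uint16_t nextTry;      // stack address of the next outer try block (forms the linked list)
} TryBlock;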
When an exception is thrown, Microvium will:
Take the current topTry and unwind the stack to the level it points to.
Transfer control (the program counter) to the bytecode address of the catch block which is now sitting at the top of the stack (i.e. pop the program counter off the stack).
Save the nextTry pointer as the new topTry (i.e. it pops the topTry register off the stack).
Et voila! We’ve implemented try-catch in Microvium with just 4 bytes of overhead per try.
This also works when throwing from one function to a catch in a caller function. The only difference is that the unwind process in step 1 then needs to restore the registers of the caller (e.g. argument count, closure state, return address, etc). This information is already on the stack since it’s also used by the return instruction — we don’t need to save it separately during a try. This works for any level of nested calls if we just unwind repeatedly until reaching the frame containing the catch target.
The static analysis to implement this is quite hairy — it’s complicated by the interaction between try…catch and return, break, and closures, among other things. For those interested, see the test cases for examples of the kinds of scenarios that try-catch needs to deal with.
On a meta-level, this is a good reminder of how the first idea that comes to mind is often not the best one, and the benefit of thinking through a problem before jumping right in. I’ve been brainstorming ideas about how exceptions might work in Microvium since July 2020 — almost 2 years ago — and I kept going back to the ideas to revise and reconsider, going back and forth between different options. Sometimes, mulling over an idea over a long period of time can help you get to a much better solution in the end.
Another lesson is that simplicity is deceptively time-consuming. Looking only at this apparently-simple final design, you might not guess at the amount of work required to get there. This is a general tragedy of software engineering: it often takes more time to get to a simpler solution, which gives the appearance of achieving less while taking longer to do so.
Please, anyone correct me if I’ve made a mistake here or misrepresented XS’s design in any way. ↩
The ranges in sizes here depend on whether I use the builtin jmp_buf type or the one supplied by XS ↩
And 6 bytes of bytecode ROM per try block in the code. ↩
TL;DR: The Microvium JavaScript engine for microcontrollers takes less than 16 kB of ROM and 64 bytes of RAM per VM while idle, making it possibly the smallest JavaScript engine to date with more language features than engines 4x its size.
I’ve designed Microvium from the ground up with the intention for it to be tiny, and it’s been an absolute success in that sense. Microvium may be the smallest JavaScript engine out there, but it still packs a punch in terms of features.
*Edit: this post was originally written when Microvium was around 8.2 kB of ROM. Since then, new features have been added. As of August 2023, Microvium is now 12 kB.
Does size matter?
Size often matters in small MCU devices. A large proportion of microcontroller models available on the market still have less than 64 kB of flash and less than 2 kB of RAM. These are still used because they’re smaller, cheaper, and have lower power consumption than their larger counterparts. All the microcontrollers I’ve worked with in my career as a firmware engineer have had ≤ 16 kB of RAM.
Some might say that you shouldn’t even want JavaScript on such small devices, and certainly in some cases that would be true. But as I pointed out in my last post, juggling multiple operations in firmware can be both easier and more memory efficient if the high-level logic is described in terms of a language like JavaScript, even if that’s the only thing you’re using it for.
Even on larger devices, do you really want to dedicate a large chunk of it to a JavaScript engine? A smaller engine is a smaller commitment to make — a lower barrier to entry.
How does it compare?
If I Google “smallest JavaScript engine for microcontrollers”, the first one on the list is Elk. Elk is indeed pretty tiny. For me, it compiles to just 11.5 kB of flash1. Microvium compiled with the same settings compiles to about 12 kB — in the same ballpark.
What about RAM?
The amount of RAM Elk uses is not pre-defined — you give it a buffer of RAM of any size you want, but it needs to be at least 96 bytes for the VM kernel state. Microvium takes 36 bytes for the kernel state.
But where there’s a massive difference in memory requirement is that Elk requires all of its memory allocated upfront, and keeps it for the lifetime of the VM. If your script’s peak memory in Elk is 1 kB then you need to give it a 1 kB buffer at startup, so its idle memory usage is 1 kB. Microvium on the other hand uses malloc and free to allocate when needed and free when not needed. Its idle memory usage can be as low as 88 bytes. In typical firmware, idle memory is much more important than peak memory, as I explained in my last post.
What about the feature set? This is another area where Microvium and Elk diverge significantly. The following table shows the differences:
| | Microvium | Elk |
|---|---|---|
| var, const (Elk supports let only) | ✓ | |
| do, switch, for | ✓ | |
| Computed member access a[b] | ✓ | |
| Arrow functions, closures | ✓ | |
| try…catch | ✓ | |
| async–await | ✓ | |
| Modules | ✓ | |
| Snapshotting | ✓ | |
| Uses intermediate bytecode (better performance) | ✓ | |
| Parser at runtime | | ✓ |
| ROM | 12 kB | 11.5 kB |
| Idle RAM | 88 B | Lots |
| Peak kernel RAM | 36 B | 96 B |
| Slot size (size of simple variables) | 2 B | 8 B |
The only thing that Elk can do that Microvium can’t do is execute strings of JavaScript text at runtime. So if your use case involves having human users directly provide scripts to the device, without any intermediate tools that could pre-process the script, then you can’t use Microvium and you might want to use Elk, mJS, or a larger engine like XS. On the other hand, if your use case has at any point a place where you can preprocess scripts before downloading them to the device then you can use Microvium.
Comparing with mJS
But Cesanta, the maker of Elk, also made a larger JS engine with more features: mJS, which is probably the closest match to Microvium in terms of feature set. mJS lets you write for-loops and switch statements for example.
| | Microvium | mJS | Elk |
|---|---|---|---|
| ES Modules (but mJS does support a non-standard load function) | ✓ | | |
| do, switch, for | ✓ | ✓ | |
| Computed member access a[b] | ✓ | ✓ | |
| Uses intermediate bytecode (better performance) | ✓ | ✓ | |
| Some builtin-functions | | ✓ | |
| Parser at runtime | | ✓ | ✓ |
| ROM | 12 kB | 45.6 kB | 11.5 kB |
| Slot size | 2 B | 8 B | 8 B |
I’ve lumped “some builtin-functions” into one box because it’s not a language feature as such. mJS has a number of builtin functions that Microvium doesn’t have – most notably print, ffi, s2o, JSON.stringify, JSON.parse and Object.create. You can implement these yourself in Microvium quite easily without modifying the engine (or find implementations online), and it gives you the option of choosing what you want rather than having all that space forced on you2.
In terms of features, mJS is a more “realistic” JavaScript engine, compared to Elk’s minimalistic approach. I wouldn’t want to write any substantial real-world JavaScript without a for-loop for example. Like Microvium, mJS also precompiles the scripts to bytecode and then executes the bytecode, which results in much better performance than trying to parse on the fly. Engines like Elk that parse as they execute also have the unexpected characteristic that comments and whitespace slow them down at runtime.
But the added features in mJS means it costs a lot more in terms of ROM space — about 4x more than Elk and Microvium.
Microvium still has more core language features than mJS, making it arguably a more pleasant language to work in. These features are actually quite useful in certain scenarios:
Proper ES module support is important for code organization and means that your Microvium modules can also be imported into a node.js or browser environment. You can have the same algorithms shared by your edge devices (microcontrollers), backend servers, and web interfaces, to give your users a unified experience.
Closures are fundamental to callback-style asynchronous code, as I explained in my previous post.
Conclusion
I’m obviously somewhat biased since Microvium is my own creation, but the overall picture I get is this:
Microvium is the smallest JavaScript engine that I’m aware of3
In this tiny size, Microvium actually supports more core language features than engines more than 4x its size. Some of these features are really useful for writing real-world JS apps.
Having said that, Microvium has fewer built-in functions — it’s more of a pay-as-go philosophy where your upfront commitment is much less and you bring in support for what you need when you need it.
The big trade-off is that Microvium doesn’t have a parser at runtime. In the rare case that you really need a parser at runtime, Microvium simply won’t work for you.
Something that made me smile is this note by one of the authors of mJS in a blog post:
That makes mJS fit into less than 50k of flash space (!) and less than 1k of RAM (!!). That is hard to beat.
I have great respect for the authors of mJS and what they’ve done, which makes me all the more proud that Microvium is able to knock this out of the ballpark, beating what the seasoned professionals have called “hard to beat”. Of course, this comes with some tradeoffs (no parser and no builtin functions), but I’ve achieved my objective of making a JavaScript engine that has a super-low upfront commitment and will squeeze into the tiniest of free spaces, all while still including most of the language features I consider to be important for real-world JavaScript apps.
All of the sizes quoted in this post are when targeting the 32-bit ARM Cortex M0 using GCC with optimization for size. I’m measuring these sizes in June 2022, and of course they may change over time. ↩
The ffi in mJS is something that would need to be a built-in in most engines but Microvium’s unique snapshotting approach makes it possible to implement the ffi as a library just like any of the other functions ↩
Please let me know if you know of a smaller JS engine than Microvium. ↩
TL;DR: Single-threading with super-loops or job queues may make more efficient use of a microcontroller’s memory over time, and Microvium’s closures make single-threading easier with callback-style async code.
Multi-threading
In my last post, I proposed the idea that we should think of the memory on a microcontroller not just as a space but as a space-time, with each memory allocation occupying a certain space for some duration of time. I suggested therefore that we should then measure the cost of an allocation in byte-seconds (the area of the above rectangles as bytes × seconds), so long as we assumed that allocations were each small and occurred randomly over time. Randomness like this is a natural byproduct of a multi-threaded environment, where at any moment you may coincidentally have multiple tasks doing work simultaneously and each taking up memory. In this kind of situation, tasks must be careful to use as little memory as possible because at any moment some other tasks may fire up and want to share the memory space for their own work.
The following diagram was generated by simulating a real malloc algorithm with random allocations over time (code here, and there’s a diagram with more allocations here):
A key thing to note is that the peak memory in this hypothetical chaotic environment can be quite a bit higher than the average, and that these peaks are not easily predictable and repeatable because they correspond to the coincidental execution of multiple tasks competing for the same memory space1.
This leaves a random chance that at any moment you could run out of memory if too many things happen at once. You can guard against this risk by just leaving large margins — lots of free memory — but this is not a very efficient use of the space. There is a better way: single threading.
Single threading
Firmware is sometimes structured in a so-called super-loop design, where the main function has a single while(1) loop that services all the tasks in turn (e.g. calling a function corresponding to each task in the firmware). This structure can have a significant advantage for memory efficiency. In this way of doing things, each task essentially has access to all the free memory while it has its turn, as long as it cleans up before the next task, as depicted in the following diagram. (And there may still be some statically-allocated memory and “long-lived” memory that is dynamic but used beyond the “turn” of a task).
Overall, this is a much more organized use of memory and potentially more space-efficient.
In a multi-threaded architecture, if two memory-heavy tasks require memory around the same time, neither has to wait for the other to be finished — or to put it another way, malloc looks for available space but not available time for that space. On the other hand, in a super-loop architecture, those same tasks will each get a turn at different times. Each will have much more memory available to them during their turn while having much less impact on other tasks the rest of the time. And the overall memory profile is a bit more predictable and repeatable.
An animated diagram you may have seen before on my blog demonstrates the general philosophy here. A task remains idle until its turn, at which point it takes center stage and can use all the resources it likes, as long as it packs up and cleans up before the next task.
So, what counts as expensive in this new memory model?
It’s quite clear from the earlier diagram:
Memory that is only used for one turn is clearly very cheap. Tasks won’t be interrupted during their turn, so they have full access to all the free memory without impacting the rest of the system.
Statically-allocated memory is clearly the most expensive: it takes away from the available memory for all other tasks across all turns.
Long-lived dynamic allocations — or just allocations that live beyond a single turn — are back to the stochastic model we had with multi-threading. Their cost is the amount of space × the number of turns they occupy the space for. Because these are a bit unpredictable, they also have an additional cost because they add to the overall risk of randomly running out of memory, so these kinds of allocations should be kept as small and short as possible.
Microvium is designed this way
Microvium is built from the ground up on this philosophy — keeping the idle memory usage as small as possible so that other operations get a turn to use that memory afterward, but not worrying as much about short spikes in memory that last only a single turn.
The idle memory of a Microvium virtual machine is as low as 34 bytes2.
Microvium uses a compacting garbage collector — one that consolidates and defragments all the living allocations into a single contiguous block — and releases any unused space back to the host firmware. The GC itself uses quite a bit of memory3 but it does so only for a very short time and only synchronously.
The virtual call-stack and registers are deallocated when control returns from the VM back to the host firmware.
Arrays grow their capacity geometrically (they double in size each time) but a GC cycle truncates unused space in arrays when it compacts.
The trouble with a super-loop architecture is that it services every single task in each cycle. It’s inefficient and doesn’t scale well as the number of tasks grows4. There’s a better approach — one that JavaScript programmers will be well familiar with: the job queue.
A job queue architecture in firmware is still pretty simple. Your main loop is just something like this:
while (1) {
if (thereIsAJobInTheQueue)
doNextJob();
else
goToSleep();
}
When I write bare-metal firmware, often the first thing I do is to bring in a simple job queue like this. If you’re using an RTOS, you might implement it using RTOS queues, but I’ve personally found that the job-queue style of architecture often obviates the need for an RTOS at all.
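For reference, the kind of job queue I mean can be as simple as a fixed-size ring buffer of function pointers. The following is just a sketch of the idea (ignoring interrupt-safety details, which depend on your platform), not the exact code I use:

#include <stdint.h>
#include <stdbool.h>

typedef void (*Job)(void);

#define QUEUE_SIZE 16
static Job jobQueue[QUEUE_SIZE];     // ring buffer of pending jobs
static volatile uint8_t head, tail;  // "there is a job in the queue" when head != tail

bool enqueueJob(Job job) {
  uint8_t nextTail = (uint8_t)((tail + 1) % QUEUE_SIZE);
  if (nextTail == head) return false;  // queue full
  jobQueue[tail] = job;
  tail = nextTail;
  return true;
}

void doNextJob(void) {
  Job job = jobQueue[head];
  head = (uint8_t)((head + 1) % QUEUE_SIZE);
  job();  // run the job to completion
}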
As JavaScript programmers will also know, working in a cooperative single-threaded environment has other benefits. You don’t need to think about locking, mutexes, race conditions, and deadlocks. There is less unpredictable behavior and fewer heisenbugs. In a microcontroller environment especially, a single-threaded design also means you save on the cost of having multiple dedicated call stacks permanently allocated for different RTOS threads.
Advice for using job queues
JavaScript programmers have been working with a single-threaded, job-queue-based environment for decades and are well familiar with the need to keep jobs short. When running JS in a browser, long jobs mean that the page becomes unresponsive, and the same is true in firmware: long jobs make the firmware unresponsive — unable to respond to I/O, service accumulated buffers, etc. In a firmware scenario, you may want to keep all jobs below 1 ms or 10 ms, depending on what kind of responsiveness you need5.
As a rule of thumb, to keep jobs short, they should almost never block or wait for I/O. For example, if a task needs to power on an external modem chip, it should not block while the modem boots up. It should probably schedule another job to handle the powered-on event later, allowing other jobs to run in the meantime.
But in a single-threaded environment, how do we implement long-running tasks without blocking the main thread? Do you need to create complicated state machines? JavaScript programmers will again recognize a solution…
Callback-based async
JavaScript programmers will be quite familiar with the pattern of using continuation-passing style (CPS) to implement long-running operations in a non-blocking way. The essence of CPS is that a long-running operation should accept a callback argument to be called when the operation completes.
The recent addition of closures (nested functions) as a feature in Microvium makes this so much easier. Here is a toy example one might use for sending data to a server in a multi-step process that continues across 3 separate turns of the job queue:
function sendToServer(url, data) {
modem.powerOn(powerOnCallback);
function powerOnCallback() {
modem.connectTo(url, connectedCallback);
}
function connectedCallback() {
modem.send(data);
}
}
Here, the data parameter is in scope for the inner connectedCallback function (closure) to access, and the garbage collector will automatically free both the closure and the data when they aren’t needed anymore. A closure like this is much more memory-efficient than having to allocate a whole RTOS thread, and much less complicated than manually fiddling with state machines and memory ownership yourself.
Microvium also supports arrow functions, so you could write this same example more succinctly like this:
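function sendToServer(url, data) {
  modem.powerOn(() =>
    modem.connectTo(url, () =>
      modem.send(data)));
}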
Each of these 3 stages — powerOn, connectTo and send — happens in a separate job in the queue. Between each job, the VM is idle — it does not consume any stack space6 and the heap is in a compacted state7.
If you’re interested in more detail about the mechanics of how modem.powerOn etc. might be implemented in a non-blocking way, take a look at this gist where I go through this example in more detail.
Conclusion
So, we’ve seen that multi-threading can be a little hazardous when it comes to dynamic memory management because memory usage is unpredictable, and this also leads to inefficiencies because you need to leave a wider margin of error to avoid randomly running out of memory.
We’ve also seen how single-threading can help to alleviate this problem by allowing each operation to consume resources while it has control, as long as it cleans up before the next operation. The super-loop architecture is a simple way to achieve this, but an event-driven job-queue architecture is more modular and efficient.
And lastly, we saw that the Microvium JavaScript engine for embedded devices is well suited to this kind of design, because its idle memory usage is particularly small and because it facilitates callback-style asynchronous programming. Writing code this way avoids the hassle and complexity of writing state machines in C, of manually keeping track of memory ownership across those states, and the pitfalls and overheads of multithreading.
This simulation with random allocations is not a completely fair representation of how most firmware allocates memory during typical operation, but it shows the consequence of having many memory-consuming operations that can be preempted unpredictably or outside the control of the firmware itself. ↩
In the worst case, it doubles the size of the heap while it’s collecting ↩
A super-loop also makes it more challenging to know when to put the device to sleep, since the main loop doesn’t necessarily know, without some extra work, whether any tasks need servicing right now. ↩
There will still be some tasks that need to be real-time and can’t afford to wait even a few ms in a job queue to be serviced. I’ve personally found that interrupts are sufficient for handling this kind of real-time behavior, but your needs may vary. Mixing a job queue with some real-time RTOS threads may be a way to get the best of both worlds — if you need it. ↩
TL;DR: RAM on a microcontroller should not just be thought of as space but as space-time: a task that occupies the same memory but for a longer time is more expensive.
MCU memory is expensive
If you’re reading this, I probably don’t need to tell you: RAM on a microcontroller is typically very constrained. A 3 GHz desktop computer might have 16 GB of RAM, while a 3 MHz MCU might have 16 kB of RAM — a thousand times less processing power but a million times smaller RAM. So in some sense, RAM on an MCU may be a thousand times more valuable than on a desktop machine. Regardless of the exact number, I’m sure we can agree that RAM is a very constrained resource on an MCU1. This makes it important to think about the cost of various features, especially in terms of their RAM usage.
Statically-allocated memory
It’s common especially in smaller C firmware to just pre-allocate different pieces of memory to different components of the firmware (rather than using malloc and free). For example, at the global level, we may declare a 256-byte buffer for receiving data on the serial port:
uint8_t rxBuffer[256];
If we have 1kB of RAM on a device for example, maybe there are 4 components that each own 256 B. Or more likely, some features are more memory-hogging than others, but you get the idea: in this model, each component in the code owns a piece of the RAM for all time.
Dividing up RAM over time
It’s of course a waste to have a 256-byte buffer allocated forever if it’s only used occasionally. The use of dynamic memory allocation (malloc and free) can help to resolve this by allowing components to share the same physical memory at different times, requesting it when needed, and releasing it when not needed.
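As a contrived sketch (the function names are made up), a component might borrow its buffer only for the duration of the work instead of owning it forever:

#include <stdint.h>
#include <stdlib.h>

void readAndProcessSerialData(uint8_t* buffer, size_t size);  // hypothetical

void serviceSerialPort(void) {
  uint8_t* rxBuffer = malloc(256);  // request the 256 B only when needed
  if (!rxBuffer) return;            // (handle allocation failure somehow)
  readAndProcessSerialData(rxBuffer, 256);
  free(rxBuffer);                   // release it so other components can use the space
}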
This allows us to reconceptualize memory as a mosaic over time, with different pieces of the program occupying different blocks of space-time (occupying memory space for some time).
When visualizing memory like this, it’s easy to start to feel like the cost of a memory allocation is not just the amount of memory it locks, but also the time that it locks it for (e.g. the area of each rectangle in the above diagram). In this sense, if I allocate 256 bytes for 2 seconds then it costs 512 byte-seconds, which is an equivalent cost to allocating 512 bytes for 1 second or 128 bytes for 4 seconds.
Being a little bit more rigorous
Skip this section if you don’t care about the edge cases. I’m just trying to be more complete here.
This measure of memory cost is of course just one way of looking at it, and it breaks down in edge cases. For example, on a 64 kB device, a task2 that consumes 1 B for 64k seconds seems relatively noninvasive, while a task that consumes 64 kB for 1 s is much more costly. So the analogy breaks down in cases where the size of the allocations is significant compared to the total memory size.
Another way the model breaks down is if many of the tasks need memory around the same time — e.g. if there is some burst of activity that requires collaboration between many different tasks. The typical implementation of malloc will just fail if there is no memory available right now, as opposed to perhaps blocking the thread until the requested memory becomes available, as if the memory were a mutex to be acquired.
But the model is accurate if we make these assumptions:
The size of the individual allocations is small relative to the total memory size
Allocations happen randomly over time and space
Under these assumptions, the total memory usage becomes a random variable whose expected value is:
The expected allocation size × the expected allocation duration × the expected allocation frequency
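To put made-up numbers to it: if allocations average 64 bytes, last 0.5 seconds on average, and occur 4 times per second, the expected memory in use at any instant is 64 B × 0.5 s × 4/s = 128 bytes.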
We could also calculate the probability that the device runs out of memory at any given point (I won’t do the calculations here).
Conclusion
In situations where memory allocations can be approximated as being small and random, the duration of a memory allocation is just as important as its size. Much more care must be taken to optimize memory usage for permanent or long-lived operations.
I’m not very happy with this stochastic viewpoint and all these edge cases. It means that at any point, we could randomly exceed the amount of available memory and the program will just die. Is there a better way to organize memory space-time so we don’t need to worry as much? I believe there is… and I’ll cover that in the next post.
Another thing that makes it a constrained resource is the lack of virtual memory and the ability to page memory in and out of physical RAM, so the RAM size is a hard limit ↩
When talking about this space-time model, I think it’s easier to talk about a “task” than a “component”, where a task here is some activity that needs to be done by the program over a finite stretch of time, and will consume resources over that time. ↩
TL;DR: Microvium can now address up to 64 kB of ROM, up from 32 kB previously, and now runs more efficiently on small 32-bit devices such as an ARM Cortex MCU.
This is a minor update regarding the data model in Microvium (previous post here), for the kind of person who’s interested in this kind of thing.
Recap: Microvium uses 16-bit slots
As covered in the previous post, Microvium uses a 16-bit slot size — variables in Microvium are 16 bits each. This is unusual among the small embedded JavaScript engines. Cesanta’s mJS and Elk use a 64-bit slot size with NaN-boxing, and Moddable’s XS uses a 128-bit slot size (4x32bit words). So, if you have a variable with the number 42 in it, it will take 2 bytes in Microvium, 8 bytes in mJS, and 16 bytes in XS (on the other hand, the number 42.5 is a float and will take 12 bytes in Microvium since it overflows the slot into the heap, but it will still take 8 bytes in mJS and 16 bytes in XS).
Pointers and Paged Memory
If the lowest bit in the slot is 0, Microvium treats the value as a pointer to heap memory.
Since heap memory in Microvium is always 2-byte aligned, the lowest bit of a pointer will always be 0, so the value in the slot exactly corresponds to the pointer value, at least on 16-bit systems (or on 8-bit systems with a 16-bit address bus).
But what about 32-bit systems? Microvium is optimized for devices with 64 kB or less of RAM, but many devices with 64 kB of RAM are actually 32-bit devices. In these devices, there is usually a 32-bit address space, and some sub-range of these addresses will map to physical RAM (and some other range of addresses will map to physical ROM). Even though only 16 bits of information are required to index every byte of RAM, pointers are still 32 bits on these devices.
Previously, Microvium worked fine on 32-bit and 64-bit devices (I do all the testing on my 64-bit PC) but it did so through an expensive mapping table that mapped 16-bit VM addresses to their 32-bit or 64-bit native counterparts (like virtual memory implemented in software). The mapping itself was still O(1) in many cases1, but it involved a function call every time a pointer needed to be mapped, which is a massive overhead to incur. I wasn’t too worried about this because I wasn’t aiming for 32-bit devices as my initial audience, but with the number of 32-bit devices out there, this would quickly become a problem.
To support this kind of scenario more elegantly, Microvium now has a port macro definition that allows you to specify the upper 16 bits of a 32-bit pointer (or the upper 48 bits of a 64-bit pointer) for the platform2.
For a more concrete example, the Arduino Nano 33 IoT has up to 32 kB of SRAM, starting at the address 0x20000000, so the upper 16 bits of a real pointer into RAM will always be 0x2000 (the memory map in the device’s datasheet shows this).
So, in the Microvium port file, you can now specify the high bits of a 32-bit pointer to be 0x2000, and Microvium will interpret all pointers by simply indexing into the given memory page.
In compiled ARM machine code, the conversion from a 16-bit slot value to a 32-bit pointer is just one or two instructions3! This is a significant performance improvement over how it worked before, and it makes pointer access in Microvium almost as efficient as native pointer access.
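The principle is roughly the following (this isn’t the actual Microvium source, and the macro name here is invented): the port file supplies the fixed upper bits of the page, and the engine combines them with the 16-bit slot value.

#include <stdint.h>

#define RAM_PAGE_BASE 0x20000000u  /* hypothetical name for the port-file setting */

static inline void* slotToNativePointer(uint16_t slotValue) {
  /* Heap values have their low bit clear and the heap is 2-byte aligned,
     so the slot value can be used directly as the offset within the page. */
  return (void*)(uintptr_t)(RAM_PAGE_BASE | slotValue);
}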
I didn’t actually make this change for performance reasons. I did it because it makes the development of the Microvium engine much easier.
On my Windows machine where I develop, I can now use VirtualAlloc to pre-allocate a single 64 kB “page” of memory where the high bits of a pointer are always 0x5555, and run Microvium in just this region of memory. So if I see the Microvium value 0x002A, I know instantly that it corresponds to the address 0x5555002A.
The addresses are consistent across runs, so when I note down an address in my notebook while debugging, I know it will be the same if I restart the program.
I can also have a memory view open in the debugger and it remains consistent across runs and shows all the VM memory in one place.
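For the curious, the kind of call I mean looks roughly like this (error handling omitted, and the call can fail if that address range happens to be unavailable):

#include <windows.h>

void* allocateVmPage(void) {
  // Reserve and commit a fixed 64 kB region at 0x55550000, so that a
  // 16-bit Microvium value 0xXXXX maps to the native address 0x5555XXXX.
  return VirtualAlloc((void*)0x55550000, 0x10000,
                      MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
}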
64 kB ROM
Previously, if the lower 2 bits of a slot were 01b then the value was considered a pointer into ROM after shifting right by 1 bit, giving us a 15-bit address space for ROM and requiring ROM to be 2-byte aligned so that the remaining bit is 0.
Now, the slot value is considered a pointer into ROM after zeroing the bottom 2 bits. This doesn’t change the performance, but it means that we can now address up to 64 kB of ROM.
A side effect is that ROM must now be 4-byte aligned since the lower 2 bits of ROM pointers must be zero. This means extra padding sometimes, but I’ve found that the ROM overhead doesn’t grow substantially with this change.
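To illustrate the difference (this is just illustrative C, not the engine’s actual decoding code):

#include <stdint.h>

uint16_t decodeRomAddressOld(uint16_t slotValue) {
  return (uint16_t)(slotValue >> 1);      /* e.g. 0x2A1 -> 0x150; 15-bit range, 32 kB max */
}

uint16_t decodeRomAddressNew(uint16_t slotValue) {
  return (uint16_t)(slotValue & 0xFFFC);  /* e.g. 0x2A1 -> 0x2A0; 16-bit range, 64 kB max */
}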
Why did I do this?
Debugging. Previously, if I saw the slot value 0x2A1 while stepping through the engine, I had to bring out a calculator to see that 0x2A1 corresponds to the bytecode address 0x150. Now, the value 0x2A1 corresponds to the bytecode address 0x2A0, which is much easier to follow.
I was irked by the fact that I couldn’t just say that “Microvium supports up to 64kB of memory”. I previously had to qualify it every time — “it supports 64kB of RAM, and 64kB of snapshot image, but only up to 32kB of ROM”. Now I can just say “supports up to 64kB of memory” with no added asterisks or qualifications, since it supports 64 kB of each type of memory it uses. This is part of simplifying the mental model of Microvium, since Microvium is all about simplicity.
The performance was actually O(n) where n is the number of memory blocks, but since Microvium uses a compacting garbage collector, memory is consolidated into a single block periodically, making it O(1) for most pointers most of the time. ↩
More accurately, you can specify any native address as the origin of the VM address space, but I found it easier to explain this in terms of the high bits. ↩
In the full ARM instruction set, the conversion is a single 32-bit instruction. In the ARM Thumb instruction set, it takes two 16-bit instructions. The specific number 0x20000000 is relevant here because it’s a power of 2 which is more efficient. ↩
At first I didn’t think this would be something useful to blog about. But the more people I speak to, the more I realize that I might be one of the few with this perspective on how to debug embedded microcontroller/firmware code on small devices1. So let me share my “secret formula” with you.
Problems
For those who don’t traditionally write embedded microcontroller applications but are still following along for interest, let me quickly highlight some of the key problems we face when debugging embedded code.
Problem 1: Hardware
The most obvious problem with developing embedded code is that these microchips are generally embedded into a piece of hardware, such as your washing machine, and the code’s primary purpose is often closely related to controlling the hardware. You can often use what is called an in-circuit debugger to actually connect your IDE/debugger to the firmware program running on the embedded device2.
But let’s say you’re writing code for an x-ray machine, and let’s say that it needs to pulse the x-ray cathode for 10 ms. You do not want to accidentally hit a breakpoint during those 10 ms. The breakpoint will suspend the execution of the firmware, but it will generally leave the hardware in whatever state it was in – which may be a state you really don’t want it to be in for an extended period of time3.
There are other reasons why hardware can be a problem for debugging, such as when the program outputs physical voltages but does so incorrectly. You may need to resort to using a multimeter or oscilloscope just to see what the program is doing. You can’t add a software “debug watch” to a piece of hardware.
Problem 2: Download time
Another problem with debugging is the time it takes to cycle through your modify-build-execute-debug process, because it has an extra “download” phase. In other words, every time you make a change, you have to download the compiled executable to the embedded device. This can take anywhere from a few seconds to minutes, depending on the size of the program. It seems to be a common pattern for developers to do something like this:
Compile program
Download program
Execute in debug mode
Try to put it into the problem state
See the problem (perhaps by a breakpoint and watches)
Go back to your code
Add diagnostic printf statements or more breakpoints
Go back to step 1
I’ll call this the compile-debug cycle, and it can take minutes to get through it each time.
Problem 3: Non-Standard Compilers
Another problem, which perhaps isn’t quite as obvious, is that an overwhelming number of compilers targeted at embedded microcontrollers seem not to conform to the C or C++ standards. They often only support a subset of standard C/C++, and they include many extensions for things which are difficult or impossible to do in standard C. This can be a good thing: how do you declare an interrupt routine in standard C? But for debugging code it can cause problems. In the same way that hardware-dependent code forces you to debug on the physical hardware, compiler-dependent code forces you to debug with the specific debugger provided with the compiler. Why is this a problem? I’ll get to that in a moment.
Solution
So how do I solve these problems?
The answer is stupidly simple. Perhaps even insulting. It is simply:
Don’t write embedded code
Avoid writing code that accesses hardware, or that needs to be downloaded to run, or that uses non-standard extensions of C/C++. You could almost say, “Don’t write embedded code at all”. Embedded code has all of these problems, so just don’t do it!
Instead, write platform-independent code. Use all the modern techniques of dependency injection, unit testing, etc. Get your code working in isolation first, in an environment where your compile-debug cycle is on the order of seconds rather than minutes. Get it working on your local PC! This eliminates step 2 – “Download program” – from the above compile-debug cycle, and makes everything easy.
Also, ideally you should be debugging unit tests, not your fully integrated software. This eliminates step 4 in the above compile-debug cycle, because your unit tests automatically generate the problem state. If they don’t, then you may need to write more tests.
The unit tests for the module you’re working on should run automatically at the click of a button or keyboard shortcut, and should be integrated into the IDE so you don’t have to leave your code to run tests. Personally, I divide my code into small enough modules that the unit tests for the module under development can compile and run in under a second and cause the IDE to jump immediately to the first failing test assertion.
Doing it this way you can easily shorten the compile-debug cycle overhead by a factor of 100 (“overhead” meaning tasks not related to finding or fixing the actual problem). If you also write the code in such a way that you can debug-step backwards then you can also reduce the number of times you need to repeat the compile-debug cycle.
But, Hardware!
Of course, you do need to access hardware at some point. And you probably need to use compiler pragmas and intrinsics to squeeze out the most performance or do unusual things. But you can isolate these cases and confine them to very thin layers on which your main application code runs. For example, instead of having application-level code directly access an IO port, it can access the port through a function which is implemented by an ultra-thin HAL (hardware abstraction layer). The HAL should do as little as possible, and its interface should be pure C (or C++). When I say “hardware”, I also mean other platform-specific dependencies, such as compiler quirks, third-party libraries, and the OS.
The rest of your application code should be platform-independent. Don’t use features that aren’t standard C/C++. Obviously only use the subset of the standard that’s actually supported by your compiler. Don’t rely on unspecified “features” of the language, such as the size of a pointer, the result of signed integer overflow, the endianness of an integer, or the alignment of fields in a structure (but you weren’t anyway, right?).
Use dependency injection to inject the HAL into your application code4. This is just generally good advice for any program. And how else would you do unit testing? For different build configurations you can inject mocks and interactive simulators just as easily as the HAL on your release build.
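As a small sketch of what that can look like (the names here are invented), the HAL can be reduced to a tiny struct of function pointers, and the application only ever sees that interface:

#include <stdint.h>
#include <stdbool.h>

/* Ultra-thin HAL interface: pure C, no compiler extensions. */
typedef struct {
  void (*setPin)(uint8_t pin, bool level);
  uint32_t (*millis)(void);
} Hal;

/* Application code depends only on the interface, never on the hardware. */
void blinkStatusLed(const Hal* hal) {
  hal->setPin(13, (hal->millis() / 500) % 2 == 0);
}

/* The release build injects the real HAL implementation; unit tests inject
   a mock that records the calls so the tests can make assertions about them. */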
Remember that every line of HAL code might take 10 times longer to develop, so you really want to minimize how much HAL there is.
Conclusion
So, the solution to the long debug-compile cycle with embedded code is simply to avoid “embedded” code as much as possible. I do this, and it really works. I can go weeks developing code for a microcontroller without ever powering up the physical device, and then take a few hours or so to integrate it into the final compiled application. The result is that I can use all the modern techniques and practices to develop high quality firmware in a fraction of the time.
Why don’t many embedded developers seem to do this5? I really don’t know. But perhaps it’s related to the fact that developing for such limited systems is a lot like developing for the desktop computers of 20 years ago, before object-oriented patterns and unit testing were really around. Or perhaps it’s because C and C++ make it very difficult to do zero-cost dependency injection. Perhaps it’s just because the embedded microcontroller industry is much slower to move, since it targets a much smaller audience.
Whatever the reason, it doesn’t have to be that way, and hopefully I’ve given you a few ideas of how to accelerate your development process. If you have any other ideas or questions, feel free to leave them in the comments below.
I’m talking about anything that’s too small to run a proper operating system like Linux – something on the order of kilobytes of RAM ↩
Again, I’m talking about quite low-level programs here. If the “firmware” is running a full OS, such as Linux, then it may have its own debugging tools, such as a remote GDB server ↩
I’m not saying you would use a breakpoint at all in a real x-ray machine, but the point applies to other scenarios ↩
I’m not going to cover how to do dependency injection for this type of application, but techniques you could consider are: using the linker to “inject” a dependency C API or mock layer into your application; using CRTP to inject dependencies into C++ code statically; and of course plain old virtual classes and/or function pointers ↩
I may be generalizing. But realize again that I’m referring to embedded firmware programs, perhaps running in sub 100 kB of RAM, and not smartphone or embedded Linux development ↩