Author: Michael Hunter

Microvium has try-catch!

TL;DR: Microvium now supports exception handling, with just 4 bytes of overhead per active try block. This post covers some of the details of the design, for those who are interested.


Note: if you’re new to try-catch, see the MDN article on the subject. Microvium does not yet support finally, but in my experience finally is a less-used feature of JS.

Why try-catch?

One of my objectives with Microvium is to allow users to write JS code that is stylistically similar to the kind of code they would write for a larger engine like V8/Node.js. I want it to be easy to write JS libraries that run on both Microvium and Node, and the public interface of these libraries especially should not be contorted to fit the Microvium feature subset.

A major thing that’s been missing from this vision is the ability to use exceptions to pass errors. A library that can’t use exceptions to pass errors must use some other technique such as error codes, which are not typical JavaScript style and would clutter up the API of such a library.
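To make the contrast concrete, here is a minimal sketch of the same hypothetical library function written in both styles. The names parsePercentCode and parsePercent are invented for illustration and do not come from any real library. The sketch runs on Node.js; it assumes the built-in Error type is available.

```javascript
// Error-code style: every caller has to inspect a status field.
function parsePercentCode(text) {
  const value = Number(text);
  if (Number.isNaN(value) || value < 0 || value > 100) {
    return { ok: false, error: 'out of range' };
  }
  return { ok: true, value: value };
}

// Exception style: the return value is just the result.
function parsePercent(text) {
  const value = Number(text);
  if (Number.isNaN(value) || value < 0 || value > 100) {
    throw new Error('out of range');
  }
  return value;
}

// Usage: the caller decides where the error gets handled.
let message;
try {
  message = 'brightness ' + parsePercent('150');
} catch (e) {
  message = 'rejected: ' + e.message;
}
```

With the exception style, intermediate callers that cannot handle the error do not need to mention it at all.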

A look at alternatives

Microvium is implemented in C. How does one implement exceptions in a language like C? In the journey of finding the best way to do exceptions, I thought of a number of different ideas.

The most obvious one to come to mind is something like setjmp/longjmp. This is a common way to implement exception-like behavior in a language like C, since it’s the only way to jump from one C function into another that’s earlier on the stack. At face value, this sounds sensible since both the engine and host are written in C.

Looking at other embedded JavaScript engines, mJS and Elk don’t support exceptions at all, but Moddable’s XS uses this setjmp/longjmp approach. At each try, XS mallocs a jump structure which is populated with the state of current virtual registers and physical registers, as well as a pointer to the catch code that would handle the exception. Upon encountering a throw, all the virtual and physical registers are restored to their saved state, which implicitly unwinds the stack, and then control is moved to the referenced catch block.

In my measurements [1], one of these jump structures in XS is 120-140 bytes on an embedded ARM architecture and 256-312 bytes on my x64 machine [2]. Even the jmp_buf on its own is 92 bytes on Cortex M0 (the architecture to which I’ve targeted all my size tests in Microvium). That’s quite heavy! Not to mention the processing overhead of doing a malloc and populating this structure (each time a try block is entered).

How it works in Microvium

After thinking about it for some time, the following is the solution I settled on. Consider the following code, where we have 2 variables on the stack, and 2 try-catch blocks:
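A minimal sketch of such code, laid out so that the line references in the discussion match when counting from the first line of the listing. The variable values, the error message, and the console.log calls are illustrative assumptions.

```javascript
try {
  let x = 1;
  try {
    let y = 2;
    throw new Error('thrown with x=' + x + ', y=' + y); // line 5
  }
  catch (e1) {                                          // line 7
    console.log('caught by e1:', e1.message);
  }
}
catch (e2) {                                            // line 11
  console.log('caught by e2:', e2.message);
}
```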

If we were to throw an exception on line 5, what needs to happen?

  1. We need to pop the variable y off the stack (but not x), since we’re exiting its containing scope. This is called unwinding the stack.
  2. We need to pass control to the e1 catch clause (line 7)
  3. We need to now record that the catch(e2) on line 11 is the next outer catch block, in case there is another throw

Microvium does this by keeping a linked list of try blocks on the stack. Each try block in the linked list is 2 words (4 bytes): a pointer to the bytecode address of the corresponding catch bytecode, plus a pointer to the next outer try block in the linked list. For the previous example, there will be 2 try blocks on the stack when we’re at line 5, as shown in the following diagram:

Each of the smaller rectangles here represents a 16-bit word (aka slot). The arrows here represent pointers. The red blocks (with an extra border) each represent an active try block consisting of 2 words (catch and nextTry). They form a linked list because each has a pointer to the next outer try block.

The topTry register points to the inner-most active try block — the head of the linked list. Each time the program enters a try statement, Microvium will push another try block to the stack and record its stack address in the topTry register.

The topTry register serves 2 purposes simultaneously. For one, it points to the head of the linked list, which allows us to find the associated catch block when an exception is thrown. But it also points to exactly the stack level we need to unwind to when an exception is thrown. This is similarly true for each nextTry pointer.

When an exception is thrown, Microvium will:

  1. Take the current topTry and unwind the stack to the level it points to.
  2. Transfer control (the program counter) to the bytecode address of the catch block which is now sitting at the top of the stack (i.e. pop the program counter off the stack).
  3. Save the nextTry pointer as the new topTry (i.e. it pops the topTry register off the stack).

Et voila! We’ve implemented try-catch in Microvium with just 4 bytes of overhead per try.
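The steps above can be modeled in a few lines of JavaScript. This is a toy sketch and not Microvium's actual source: the stack is a plain array of slots, NO_TRY is an invented sentinel for the end of the list, the bytecode addresses 100 and 200 are hypothetical, and the slot order within a try block (nextTry below catch, with topTry pointing just past the block) is an assumption chosen so that the pops happen in the order described.

```javascript
const NO_TRY = -1; // invented sentinel: no enclosing try block

function createVm() {
  return { stack: [], topTry: NO_TRY, pc: 0 };
}

// Entering a try statement: push a 2-slot try block and make it the list head.
function enterTry(vm, catchAddress) {
  vm.stack.push(vm.topTry);    // nextTry: link to the next outer try block
  vm.stack.push(catchAddress); // catch: bytecode address of the catch clause
  vm.topTry = vm.stack.length; // assumed convention: points just past the block
}

// Throwing: unwind, jump to the catch clause, restore the outer try block.
function throwException(vm) {
  if (vm.topTry === NO_TRY) throw new Error('uncaught exception');
  vm.stack.length = vm.topTry; // 1. unwind to the level topTry points to
  vm.pc = vm.stack.pop();      // 2. pop the catch address into the program counter
  vm.topTry = vm.stack.pop();  // 3. pop nextTry into the topTry register
}

// Usage mirroring the example: outer try, x, inner try, y, then a throw.
const vm = createVm();
enterTry(vm, 200);   // catch (e2) at hypothetical address 200
vm.stack.push('x');
enterTry(vm, 100);   // catch (e1) at hypothetical address 100
vm.stack.push('y');
throwException(vm);
// Now vm.pc is 100, y is gone, x remains, and topTry refers to the outer block.
```

Note that nothing is allocated on the heap: entering a try costs two pushes, and a throw costs one stack truncation and two pops.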

This also works when throwing from one function to a catch in a caller function. The only difference is that the unwind process in step 1 then needs to restore the registers of the caller (e.g. argument count, closure state, return address, etc). This information is already on the stack since it’s also used by the return instruction — we don’t need to save it separately during a try. This works for any level of nested calls if we just unwind repeatedly until reaching the frame containing the catch target.

Conclusion

I’m pretty satisfied with this design.

  • Only one new virtual register (topTry)
  • Only 4 bytes of RAM per active try [3]
  • No heap overhead
  • Conceptually simple
  • Only makes the engine 156 bytes larger in ROM

The static analysis to implement this is quite hairy: it’s complicated by the interaction between try-catch and return, break, and closures, among other things. For those interested, see the test cases for examples of the kinds of scenarios that try-catch needs to deal with.

On a meta-level, this is a good reminder of how the first idea that comes to mind is often not the best one, and the benefit of thinking through a problem before jumping right in. I’ve been brainstorming ideas about how exceptions might work in Microvium since July 2020 — almost 2 years ago — and I kept going back to the ideas to revise and reconsider, going back and forth between different options. Sometimes, mulling over an idea over a long period of time can help you get to a much better solution in the end.

Another lesson is that simplicity is deceptively time-consuming. Looking only at this apparently-simple final design, you might not guess at the amount of work required to get there. This is a general tragedy of software engineering: it often takes more time to get to a simpler solution, which gives the appearance of achieving less while taking longer to do so.


  1. Please, anyone correct me if I’ve made a mistake here or misrepresented XS’s design in any way. 

  2. The ranges in sizes here depend on whether I use the builtin jmp_buf type or the one supplied by XS 

  3. And 6 bytes of bytecode ROM per try block in the code. 

Snapshotting is like compiling but better

TL;DR: The final output of a traditional compiler like GCC bears a family resemblance to a Microvium snapshot, but the snapshotting paradigm is both easier to use and more powerful because it allows real application code to run at build time and its state to persist until runtime.

What is snapshotting?

My Microvium JavaScript engine is built on the paradigm of creating a VM snapshot as the deployable build artifact rather than creating a traditional compiled binary. As a developer, when you run the Microvium engine on your desktop with a command like microvium main.js, it will execute the script until all the top-level code is complete and then output a snapshot file containing the final VM state. The snapshot file can then be “resumed” on an embedded device using Microvium’s embedded C library (for more details, see Getting Started). Although Microvium is designed especially for microcontrollers, the principle of snapshotting goes beyond the embedded space.

Comparing to GCC

For this post, I’ll mostly compare Microvium to GCC:

gcc main.c         # Compile a C program with GCC
microvium main.js  # "Compile" a Microvium program

These two commands are analogous. Both produce a single file as the result, and this file is what you want to deploy to the target environment. In the case of GCC, the output is of course the executable (e.g. a.out), while in the case of Microvium, it’s the snapshot (main.mvm-bc).

Both of these commands do some kind of compilation as part of the process. GCC translates your function code to machine instructions, while Microvium translates to virtual machine instructions (bytecode instructions).

Constants

Both the GCC output and the Microvium output have a section for constants, including function code. You may be familiar with this as the .text section. Among other things, this contains constant values, such as:

// JavaScript
const x = 42;
// C
const int x = 42;

… but better

You can do this in Microvium but not in C:

const x = foo();

function foo() {
  return 42;
}

In C, it’s a compile-time error to call runtime functions for the calculation of constants. But in Microvium, this is perfectly legal since there is no distinction between compile-time and runtime: there is only really runtime before the snapshot and runtime after the snapshot, although informally we may refer to the former as “compile-time” and the latter as “runtime”.

Apart from just being cleaner and easier to use, the Microvium snapshotting paradigm here allows computationally-intensive constants to be calculated at build time, using arbitrary functions and libraries that might also be useful at runtime.
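As an illustration of a computed constant, here is a sketch that builds a CRC-8 lookup table in top-level code and then uses it in an ordinary function. Under the snapshotting model described here, the table would be computed before the snapshot is taken, and only the finished array would be carried over to the device. The choice of CRC-8 with polynomial 0x07 and the names buildCrc8Table, CRC8_TABLE, and crc8 are assumptions made for the example.

```javascript
// Top-level code: under the snapshotting model this runs at build time.
function buildCrc8Table() {
  const table = [];
  for (let byte = 0; byte < 256; byte++) {
    let crc = byte;
    for (let bit = 0; bit < 8; bit++) {
      // Shift left; if the top bit was set, XOR in the polynomial 0x07
      crc = (crc & 0x80) ? ((crc << 1) ^ 0x07) & 0xff : (crc << 1) & 0xff;
    }
    table.push(crc);
  }
  return table;
}

const CRC8_TABLE = buildCrc8Table();

// Intended for use after the snapshot is resumed: one table lookup per byte.
function crc8(bytes) {
  let crc = 0;
  for (let i = 0; i < bytes.length; i++) {
    crc = CRC8_TABLE[crc ^ bytes[i]];
  }
  return crc;
}
```

The same buildCrc8Table function stays available at runtime too, which is the point made above about sharing functions between the two phases.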

Variable initializers… but better

For non-constant variables, both GCC and Microvium have an output section for the initial [1] value of all the variables, which is copied into RAM at runtime. You may know this traditionally as the .data section.

But a major difference between them is that a Microvium snapshot also contains heap data.

// JavaScript
let arr = [1, 2, 3];

// C
int* arr = malloc(3 * sizeof(int));  // ! Can't do this (in top-level code)

The best you could do in C for the above example would be to have an init function that runs early in the program to set up the initial runtime state of the program. Snapshotting is better because this initial structure can be established at build time.

Modules … but better

Microvium and C both have support for structuring your code in multiple files which get bundled into the same output artifact:

// JavaScript
import { foo } from './foo.js'
// C
#include "foo.h"

With the #include here, the dependent module implementation (e.g. foo.c) is not automatically compiled and linked into the program by GCC — you need to separately list foo.c to be compiled by GCC, or orchestrate the dependencies using a makefile.

But in the case of a Microvium import, the import statement itself is executed at build time, performing the module resolution, loading, parsing, and linking at build time, as well as executing the top-level code of the imported module. The top-level code of the imported module may in turn import other modules, transitively importing the whole module graph and executing its top-level initialization code.

Preprocessor… but better

Both C and Microvium support “compile-time” logic:

// C
#if USE_FOO_1
#define foo foo1
#else
#define foo foo2
#endif
// JavaScript
const foo = USE_FOO_1 ? foo1 : foo2;

Of course, you already knew that, because all the examples so far have demonstrated the fact that snapshotting allows you to run JavaScript at compile time. But I want to emphasize some of the key reasons why the snapshotting paradigm is better for this:

  • You don’t need two different programming languages (e.g. the preprocessor language and the C language).
  • Your “compile-time” code has the full power of the main language.
  • Your runtime code carries over the state from your compile-time code.

So in some sense, this unifies the preprocessor language with the main language. This applies similarly to other “compile-time languages”, such as makefiles and linker scripts. Or if you’re coming from the JS world, consider how the snapshotting paradigm obviates the need for a webpack.config (see Snapshotting vs Bundling).

But what about using the preprocessor to conditionally include different runtime logic, such as for different devices? For example, consider the following C code:

void myFunction(void) {
#if SOME_CONDITION
  doX();
#else
  doY();
#endif
}

Of course, it’s easy to see how this example might translate to JS:

function myFunction(someCondition) {
  if (someCondition) {
    doX();
  } else {
    doY();
  }
}

Now we have the bonus that our unit tests can inject someCondition to test both cases.

But doesn’t this code mean that now we have both doX and doY branches at runtime, taking up ROM space? (and someCondition)

That’s why I’ve also been developing the experimental Microvium Boost: an optimizer that analyzes a snapshot and removes unused branches of code. For example, if the analysis shows that someCondition is always true in your program, it can remove it as a parameter from myFunction and also remove the call to doY as dead code. This is still experimental but has shown significant success so far.

Host exports… but better

So far we’ve been considering an executable output from GCC, but it would be more accurate to compare a Microvium snapshot with a compiled shared library (e.g. DLL). Like a shared library, Microvium snapshots do not have a single entry point but may contain many exported functions to be resolved at runtime on the device.

A Windows DLL suits the analogy better than a Linux shared library, so in this section the examples will use msbuild rather than GCC.

Both a Microvium snapshot and a DLL binary contain a section for dynamic linking information: a table that associates relevant functions in the DLL with a number [2] so that the host program using the DLL/snapshot at runtime can find them.

In the case of a DLL, you can provide the compiler with a DEF file that tells the compiler what to put into the DLL export table. If you wanted to export the functions foo, bar, and baz from the DLL with IDs 1, 2, and 3 respectively, your DEF file might look like this:

LIBRARY   MyLibrary
EXPORTS
   foo   @1
   bar   @2
   baz   @3

// C
int foo() { return 42; }
int bar() { return 43; }
int baz() { return 44; }

The equivalent in Microvium would be as follows:

// JavaScript
function foo() { return 42; }
function bar() { return 43; }
function baz() { return 44; }

vmExport(1, foo);
vmExport(2, bar);
vmExport(3, baz);

You may have noticed a recurring theme in this post: the Microvium snapshotting paradigm doesn’t require a whole new language in order to do different build-time tasks. In this case, Microvium doesn’t require a DEF file (or a special __declspec(dllexport) language extension), since vmExport is just a normal function. This is just simpler and more natural.

Another recurring theme here is that the snapshotting approach is more powerful, allowing you to do things that are impossible or impractical in the traditional paradigm. Take a look at the following example in Microvium:

// JavaScript
for (let i = 1; i <= 3; i++) {
  vmExport(i, () => 41 + i);
}

This has the same overall effect as the previous code, adding 3 functions to the export table of the deployed binary with IDs 1, 2, and 3 and which return 42, 43, and 44 respectively. But having vmExport be a normal function means that now we have the full power of the language for orchestrating these exports, or for writing an abstraction layer over the export system, or outsourcing this logic to a third-party library.
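As a sketch of what such an abstraction layer might look like, the following hypothetical exportAll helper assigns consecutive IDs to a list of functions. The helper name and the fallback table are invented for this example. The fallback exists only so the sketch also runs under Node.js, where vmExport is not defined; inside Microvium the real vmExport would be picked up instead.

```javascript
// Fallback for environments without vmExport (an assumption for this sketch).
const fallbackExports = {};
const exportFn = typeof vmExport === 'function'
  ? vmExport
  : (id, fn) => { fallbackExports[id] = fn; };

// Hypothetical helper: exports each function under consecutive IDs.
function exportAll(firstId, functions) {
  for (let i = 0; i < functions.length; i++) {
    exportFn(firstId + i, functions[i]);
  }
}

exportAll(1, [
  () => 42,
  () => 43,
  () => 44,
]);
```

Because exportAll is ordinary code, it could live in a shared module and be unit tested like any other function.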

Side note: a more subtle point in this example for advanced readers is its code cohesion. The single line of code mixes both compile-time and runtime code (vmExport(i,...) and () => 41 + i respectively), but keeps the related parts of both in close proximity. This is the difference between temporal cohesion (grouping code by when it’s run) vs functional cohesion (grouping code by what feature it relates to) (see Wikipedia). A common disadvantage of having separate build-time or deploy-time code (e.g. a DEF file, makefile, linker script, webpack.config, dockerfile, terraform file, etc) is that it pushes you into temporal cohesion, which in turn damages modularity and reusability.

Conclusion

The idea of deploying a snapshot rather than a traditional compiled binary opens up a whole new paradigm for software development. The end result is very similar — a binary image with sections for different memory spaces, compiled function code, constants, initial variable values, and export/import tables — but the snapshotting paradigm is both simpler and more powerful.


  1. In the context of Microvium, the word “initial” here refers to the initial state when the snapshot is resumed, not the initial state when the program is started, since the program starts at build time, and variables in JavaScript start with the value undefined. 

  2. DLL exports can be by name or number, but Microvium exports are only by number, for efficiency reasons, so that’s what I’m using in the analogy here. 

Microvium is very small

TL;DR: The Microvium JavaScript engine for microcontrollers takes less than 16 kB of ROM and 64 bytes of RAM per VM while idle, making it possibly the smallest JavaScript engine to date with more language features than engines 4x its size.


I’ve designed Microvium from the ground up with the intention for it to be tiny, and it’s been an absolute success in that sense. Microvium may be the smallest JavaScript engine out there, but it still packs a punch in terms of features.

*Edit: this post was originally written when Microvium was around 8.2 kB of ROM. Since then, new features have been added. As of August 2023, Microvium is now 12 kB.

Does size matter?

Size often matters in small MCU devices. A large proportion of microcontroller models available on the market still have less than 64 kB of flash and less than 2 kB of RAM. These are still used because they’re smaller, cheaper, and draw less power than their larger counterparts. All the microcontrollers I’ve worked with in my career as a firmware engineer have had ≤ 16 kB RAM.

Some might say that you shouldn’t even want JavaScript on such small devices, and certainly in some cases that would be true. But as I pointed out in my last post, juggling multiple operations in firmware can be both easier and more memory efficient if the high-level logic is described in terms of a language like JavaScript, even if that’s the only thing you’re using it for.

Even on larger devices, do you really want to dedicate a large chunk of it to a JavaScript engine? A smaller engine is a smaller commitment to make — a lower barrier to entry.

How does it compare?

If I Google “smallest JavaScript engine for microcontrollers”, the first one on the list is Elk. Elk is indeed pretty tiny. For me, it compiles to just 11.5 kB of flash [1]. Microvium compiled with the same settings compiles to about 12 kB, which is in the same ballpark.

What about RAM?

The amount of RAM Elk uses is not pre-defined — you give it a buffer of RAM of any size you want, but it needs to be at least 96 bytes for the VM kernel state. Microvium takes 36 bytes for the kernel state.

But the massive difference in memory requirements is that Elk requires all of its memory to be allocated upfront, and keeps it for the lifetime of the VM. If your script’s peak memory in Elk is 1 kB then you need to give it a 1 kB buffer at startup, so its idle memory usage is 1 kB. Microvium on the other hand uses malloc and free to allocate when needed and free when not needed. Its idle memory usage can be as low as 88 bytes. In typical firmware, idle memory is much more important than peak memory, as I explained in my last post.

What about the feature set? This is another area where Microvium and Elk diverge significantly. The following table shows the differences:

Feature                                           Microvium   Elk
var, const (Elk supports let only)                Yes         No
do, switch, for                                   Yes         No
Computed member access a[b]                       Yes         No
Arrow functions, closures                         Yes         No
try-catch                                         Yes         No
async-await                                       Yes         No
Modules                                           Yes         No
Snapshotting                                      Yes         No
Uses intermediate bytecode (better performance)   Yes         No
Parser at runtime                                 No          Yes
ROM                                               12 kB       11.5 kB
Idle RAM                                          88 B        Lots
Peak kernel RAM                                   36 B        96 B
Slot size (size of simple variables)              2 B         8 B

The only thing that Elk can do that Microvium can’t do is execute strings of JavaScript text at runtime. So if your use case involves having human users directly provide scripts to the device, without any intermediate tools that could pre-process the script, then you can’t use Microvium and you might want to use Elk, mJS, or a larger engine like XS. On the other hand, if your use case has at any point a place where you can preprocess scripts before downloading them to the device then you can use Microvium.

Comparing with mJS

But Cesanta, the maker of Elk, also made a larger JS engine with more features: mJS, which is probably the closest match to Microvium in terms of feature set. mJS lets you write for-loops and switch statements for example.

Since they’re closely matched for intent and features, I did a more detailed comparison of mJS and Microvium here. But here’s a summary:

Feature                                           Microvium   mJS       Elk
var, const (mJS supports let only)                Yes         No        No
Template strings                                  Yes         No        No
Arrow functions and closures                      Yes         No        No
try-catch                                         Yes         No        No
async-await                                       Yes         No        No
ES Modules (but mJS does support a                Yes         No        No
  non-standard load function)
do, switch, for                                   Yes         Yes       No
Computed member access a[b]                       Yes         Yes       No
Uses intermediate bytecode (better performance)   Yes         Yes       No
Some builtin-functions                            No          Yes       No
Parser at runtime                                 No          Yes       Yes
ROM                                               12 kB       45.6 kB   11.5 kB
Slot size                                         2 B         8 B       8 B

I’ve lumped “some builtin-functions” into one box because it’s not a language feature as such. mJS has a number of builtin functions that Microvium doesn’t have, most notably print, ffi, s2o, JSON.stringify, JSON.parse and Object.create. You can implement these yourself in Microvium quite easily without modifying the engine (or find implementations online), and it gives you the option of choosing what you want rather than having all that space forced on you [2].

In terms of features, mJS is a more “realistic” JavaScript engine, compared to Elk’s minimalistic approach. I wouldn’t want to write any substantial real-world JavaScript without a for-loop for example. Like Microvium, mJS also precompiles the scripts to bytecode and then executes the bytecode, which results in much better performance than trying to parse on the fly. Engines like Elk that parse as they execute also have the unexpected characteristic that comments and whitespace slow them down at runtime.

But the added features in mJS mean it costs a lot more in terms of ROM space: about 4x more than Elk and Microvium.

Microvium still has more core language features than mJS, making it arguably a more pleasant language to work in. These features are actually quite useful in certain scenarios:

  • Proper ES module support is important for code organization and means that your Microvium modules can also be imported into a node.js or browser environment. You can have the same algorithms shared by your edge devices (microcontrollers), backend servers, and web interfaces, to give your users a unified experience.
  • Closures are fundamental to callback-style asynchronous code, as I explained in my previous post.

Conclusion

I’m obviously somewhat biased since Microvium is my own creation, but the overall picture I get is this:

  • Microvium is the smallest JavaScript engine that I’m aware of [3]
  • In this tiny size, Microvium actually supports more core language features than engines more than 4x its size. Some of these features are really useful for writing real-world JS apps.
  • Having said that, Microvium has fewer built-in functions: it’s more of a pay-as-you-go philosophy where your upfront commitment is much less and you bring in support for what you need when you need it.
  • The big trade-off is that Microvium doesn’t have a parser at runtime. In the rare case that you really need a parser at runtime, Microvium simply won’t work for you.

Something that made me smile is this note by one of the authors of mJS in a blog post:

That makes mJS fit into less than 50k of flash space (!) and less than 1k of RAM (!!). That is hard to beat.

https://mongoose-os.com/blog/mjs-a-new-approach-to-embedded-scripting/

I have great respect for the authors of mJS and what they’ve done, which makes me all the more proud that Microvium is able to knock this out of the park, beating what the seasoned professionals have called “hard to beat”. Of course, this comes with some tradeoffs (no parser and no builtin functions), but I’ve achieved my objective of making a JavaScript engine that has a super-low upfront commitment and will squeeze into the tiniest of free spaces, all while still including most of the language features I consider to be important for real-world JavaScript apps.


  1. All of the sizes quoted in this post are when targeting the 32-bit ARM Cortex M0 using GCC with optimization for size. I’m measuring these sizes in June 2022, and of course they may change over time. 

  2. The ffi in mJS is something that would need to be a built-in in most engines but Microvium’s unique snapshotting approach makes it possible to implement the ffi as a library just like any of the other functions 

  3. Please let me know if you know of a smaller JS engine than Microvium. 

Single-threading is more memory-efficient

TL;DR: Single-threading with super-loops or job queues may make more efficient use of a microcontroller’s memory over time, and Microvium’s closures make single-threading easier with callback-style async code.

Multi-threading

In my last post, I proposed the idea that we should think of the memory on a microcontroller not just as a space but as a space-time, with each memory allocation occupying a certain space for some duration of time. I suggested therefore that we should then measure the cost of an allocation in byte-seconds (the area of the above rectangles as bytes × seconds), so long as we assumed that allocations were each small and occurred randomly over time. Randomness like this is a natural byproduct of a multi-threaded environment, where at any moment you may coincidentally have multiple tasks doing work simultaneously and each taking up memory. In this kind of situation, tasks must be careful to use as little memory as possible because at any moment some other tasks may fire up and want to share the memory space for their own work.
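As a small worked example of the byte-seconds measure, the following sketch compares two hypothetical allocations. The sizes and lifetimes are made up for illustration.

```javascript
// Cost of an allocation in byte-seconds: size multiplied by lifetime.
function byteSeconds(sizeBytes, lifetimeSeconds) {
  return sizeBytes * lifetimeSeconds;
}

// A small buffer held for a minute costs more than a large one held briefly.
const smallButLongLived = byteSeconds(100, 60);  // 6000 byte-seconds
const largeButBrief = byteSeconds(2000, 0.5);    // 1000 byte-seconds
```

By this measure the 100-byte allocation is the more expensive of the two, even though it is 20 times smaller.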

The following diagram was generated by simulating a real malloc algorithm with random allocations over time (code here, and there’s a diagram with more allocations here):

A key thing to note is that the peak memory in this hypothetical chaotic environment can be quite a bit higher than the average, and that these peaks are not easily predictable and repeatable because they correspond to the coincidental execution of multiple tasks competing for the same memory space [1].

This leaves a random chance that at any moment you could run out of memory if too many things happen at once. You can guard against this risk by just leaving large margins — lots of free memory — but this is not a very efficient use of the space. There is a better way: single threading.

Single threading

Firmware is sometimes structured in a so-called super-loop design, where the main function has a single while(1) loop that services all the tasks in turn (e.g. calling a function corresponding to each task in the firmware). This structure can have a significant advantage for memory efficiency. In this way of doing things, each task essentially has access to all the free memory while it has its turn, as long as it cleans up before the next task, as depicted in the following diagram. (And there may still be some statically-allocated memory and “long-lived” memory that is dynamic but used beyond the “turn” of a task).

Overall, this is a much more organized use of memory and potentially more space-efficient.

In a multi-threaded architecture, if two memory-heavy tasks require memory around the same time, neither has to wait for the other to be finished — or to put it another way, malloc looks for available space but not available time for that space. On the other hand, in a super-loop architecture, those same tasks will each get a turn at different times. Each will have much more memory available to them during their turn while having much less impact on other tasks the rest of the time. And the overall memory profile is a bit more predictable and repeatable.

An animated diagram you may have seen before on my blog demonstrates the general philosophy here. A task remains idle until its turn, at which point it takes center stage and can use all the resources it likes, as long as it packs up and cleans up before the next task.

So, what counts as expensive in this new memory model?

It’s quite clear from the earlier diagram:

  • Memory that is only used for one turn is clearly very cheap. Tasks won’t be interrupted during their turn, so they have full access to all the free memory without impacting the rest of the system.
  • Statically-allocated memory is clearly the most expensive: it takes away from the available memory for all other tasks across all turns.
  • Long-lived dynamic allocations — or just allocations that live beyond a single turn — are back to the stochastic model we had with multi-threading. Their cost is the amount of space × the number of turns they occupy the space for. Because these are a bit unpredictable, they also have an additional cost because they add to the overall risk of randomly running out of memory, so these kinds of allocations should be kept as small and short as possible.

Microvium is designed this way

Microvium is built from the ground up on this philosophy — keeping the idle memory usage as small as possible so that other operations get a turn to use that memory afterward, but not worrying as much about short spikes in memory that last only a single turn.

  • The idle memory of a Microvium virtual machine is as low as 34 bytes [2].
  • Microvium uses a compacting garbage collector (one that consolidates and defragments all the living allocations into a single contiguous block) and releases any unused space back to the host firmware. The GC itself uses quite a bit of memory [3] but it does so only for a very short time and only synchronously.
  • The virtual call-stack and registers are deallocated when control returns from the VM back to the host firmware.
  • Arrays grow their capacity geometrically (they double in size each time) but a GC cycle truncates unused space in arrays when it compacts.
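The last point can be illustrated with a toy model that tracks capacity separately from length. This is not Microvium's implementation: the starting capacity of 1 and the names createArray, pushItem, and truncateOnCollect are assumptions for the sketch.

```javascript
// Toy model: only the bookkeeping is simulated, not the element storage.
function createArray() {
  return { length: 0, capacity: 0, reallocations: 0 };
}

function pushItem(arr) {
  if (arr.length === arr.capacity) {
    // Double the capacity when full (assumed to start at 1)
    arr.capacity = arr.capacity === 0 ? 1 : arr.capacity * 2;
    arr.reallocations++;
  }
  arr.length++;
}

// Models the truncation described above: a GC cycle drops the unused tail.
function truncateOnCollect(arr) {
  arr.capacity = arr.length;
}

const model = createArray();
for (let i = 0; i < 100; i++) pushItem(model);
const capacityBeforeGc = model.capacity; // 128, reached after 8 reallocations
truncateOnCollect(model);
const capacityAfterGc = model.capacity;  // 100
```

Doubling keeps the number of reallocations logarithmic in the array size, while the truncation means the slack only costs memory until the next collection.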

See here for some more details.

Better than super-loop: a Job Queue

The trouble with a super-loop architecture is that it services every single task in each cycle. It’s inefficient and doesn’t scale well as the number of tasks grows [4]. There’s a better approach, one that JavaScript programmers will be well familiar with: the job queue.

A job queue architecture in firmware is still pretty simple. Your main loop is just something like this:

while (1) {
  if (thereIsAJobInTheQueue) 
    doNextJob();
  else
    goToSleep();
}

When I write bare-metal firmware, often the first thing I do is to bring in a simple job queue like this. If you’re using an RTOS, you might implement it using RTOS queues, but I’ve personally found that the job-queue style of architecture often obviates the need for an RTOS at all.
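For readers who think better in JavaScript, here is a minimal model of the same idea. The names jobQueue, enqueueJob, and runUntilIdle are illustrative; real firmware would go to sleep where this sketch returns.

```javascript
// Minimal single-threaded job queue.
const jobQueue = [];

function enqueueJob(job) {
  jobQueue.push(job);
}

// Runs jobs in FIFO order until the queue is empty, one job per turn.
// In firmware, this is where the main loop would sleep until an interrupt
// enqueues more work.
function runUntilIdle() {
  let turnCount = 0;
  while (jobQueue.length > 0) {
    const job = jobQueue.shift();
    job();
    turnCount++;
  }
  return turnCount;
}

// Usage: job A schedules follow-up work instead of doing it inline.
const order = [];
enqueueJob(() => {
  order.push('A');
  enqueueJob(() => order.push('C')); // runs in a later turn, after B
});
enqueueJob(() => order.push('B'));
const turnsTaken = runUntilIdle();
// order is ['A', 'B', 'C'] and turnsTaken is 3
```

Unlike a super-loop, only tasks that actually have work queued get a turn.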

As JavaScript programmers may also be familiar with, working in a cooperative single-threaded environment has other benefits. You don’t need to think about locking, mutexes, race conditions, and deadlocks. There is less unpredictable behavior and fewer heisenbugs. In a microcontroller environment especially, a single-threaded design also means you save on the cost of having multiple dedicated call stacks being permanently allocated for different RTOS threads.

Advice for using job queues

JavaScript programmers have been working with a single-threaded job-queue-based environment for decades and are well familiar with the need to keep the jobs short. When running JS in a browser, long jobs mean that the page becomes unresponsive, and the same is true in firmware: long jobs make the firmware unresponsive, unable to respond to I/O or service accumulated buffers, etc. In a firmware scenario, you may want to keep all jobs below 1 ms or 10 ms, depending on what kind of responsiveness you need [5].

As a rule of thumb, to keep jobs short, they should almost never block or wait for I/O. For example, if a task needs to power on an external modem chip, it should not block while the modem boots up. It should probably schedule another job to handle the powered-on event later, allowing other jobs to run in the meantime.

But in a single-threaded environment, how do we implement long-running tasks without blocking the main thread? Do you need to create complicated state machines? JavaScript programmers will again recognize a solution…

Callback-based async

JavaScript programmers will be quite familiar with the pattern of using continuation-passing style (CPS) to implement long-running operations in a non-blocking way. The essence of CPS is that a long-running operation should accept a callback argument to be called when the operation completes.

The recent addition of closures (nested functions) as a feature in Microvium makes this so much easier. Here is a toy example one might use for sending data to a server in a multi-step process that continues across 3 separate turns of the job queue:

function sendToServer(url, data) {
  modem.powerOn(powerOnCallback);

  function powerOnCallback() {
    modem.connectTo(url, connectedCallback);
  }

  function connectedCallback() {
    modem.send(data);
  } 
}

Here, the data parameter is in scope for the inner connectedCallback function (closure) to access, and the garbage collector will automatically free both the closure and the data when they aren’t needed anymore. A closure like this is much more memory-efficient than having to allocate a whole RTOS thread, and much less complicated than manually fiddling with state machines and memory ownership yourself.

Microvium also supports arrow functions, so you could write this same example more succinctly like this:

function sendToServer(url, data) {
  modem.powerOn( () => 
    modem.connectTo(url, () => 
      modem.send(data)));
}

Each of these 3 stages (powerOn, connectTo and send) happens in a separate job in the queue. Between jobs, the VM is idle: it does not consume any stack space [6] and the heap is in a compacted state [7].

If you’re interested in more detail about the mechanics of how modem.powerOn etc. might be implemented in a non-blocking way, take a look at this gist where I go through this example in more detail.
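
As a rough illustration of the idea (not the implementation from the gist), here is a sketch with a simulated modem. The modem object and its behavior are assumptions made up for this example: each operation returns immediately and queues its callback as a later job, standing in for a hardware event arriving.

```javascript
const jobQueue = [];
const trace = [];

// Hypothetical modem driver. Each operation returns immediately and queues
// its callback as a later job, standing in for a hardware event.
const modem = {
  powerOn(callback) {
    trace.push('powering on');
    jobQueue.push(callback);
  },
  connectTo(url, callback) {
    trace.push('connecting to ' + url);
    jobQueue.push(callback);
  },
  send(data) {
    trace.push('sending ' + data);
  },
};

function sendToServer(url, data) {
  modem.powerOn(() =>
    modem.connectTo(url, () =>
      modem.send(data)));
}

// First turn: only the power-on stage runs before control returns
sendToServer('example.com', 'hello');
const traceAfterCall = trace.slice();

// Later turns: drain the queue, one job per turn
let turns = 0;
while (jobQueue.length > 0) {
  jobQueue.shift()();
  turns++;
}
console.log(traceAfterCall, turns, trace);
```

The initial call plus the two queued jobs make up the 3 turns, and nothing blocks in between.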

Conclusion

So, we’ve seen that multi-threading can be a little hazardous when it comes to dynamic memory management because memory usage is unpredictable, and this also leads to inefficiencies because you need to leave a wider margin of error to avoid randomly running out of memory.

We’ve also seen how single-threading can help to alleviate this problem by allowing each operation to consume resources while it has control, as long as it cleans up before the next operation. The super-loop architecture is a simple way to achieve this but an event-driven job-queue architecture is more modular and efficient.

And lastly, we saw that the Microvium JavaScript engine for embedded devices is well suited to this kind of design, because its idle memory usage is particularly small and because it facilitates callback-style asynchronous programming. Writing code this way avoids the hassle and complexity of writing state machines in C, of manually keeping track of memory ownership across those states, and the pitfalls and overheads of multithreading.


  1. This simulation with random allocations is not a completely fair representation of how most firmware allocates memory during typical operation, but it shows the consequence of having many memory-consuming operations that can be preempted unpredictably or outside the control of the firmware itself. 

  2. Or 22 bytes on a 16-bit platform 

  3. In the worst case, it doubles the size of heap while it’s collecting 

  4. A super-loop also makes it more challenging to know when to put the device to sleep since the main loop doesn’t necessarily know when there aren’t any tasks that need servicing right now, without some extra work. 

  5. There will still be some tasks that need to be real-time and can’t afford to wait even a few ms in a job queue to be serviced. I’ve personally found that interrupts are sufficient for handling this kind of real-time behavior, but your needs may vary. Mixing a job queue with some real-time RTOS threads may be a way to get the best of both worlds — if you need it. 

  6. Closures are stored on the virtual heap. 

  7. It’s in a compacted state if you run a GC collection cycle after each event, which you would do if you cared a lot about idle memory usage. 

Short-lived memory is cheaper


TL;DR: RAM on a microcontroller should not just be thought of as space but as space-time: a task that occupies the same memory but for a longer time is more expensive.

MCU memory is expensive

If you’re reading this, I probably don’t need to tell you: RAM on a microcontroller is typically very constrained. A 3 GHz desktop computer might have 16 GB of RAM, while a 3 MHz MCU might have 16 kB of RAM: a thousand times less processing power but a million times less RAM. So in some sense, RAM on an MCU may be a thousand times more valuable than on a desktop machine. Regardless of the exact number, I’m sure we can agree that RAM is a very constrained resource on an MCU [1]. This makes it important to think about the cost of various features, especially in terms of their RAM usage.

Statically-allocated memory

It’s common especially in smaller C firmware to just pre-allocate different pieces of memory to different components of the firmware (rather than using malloc and free). For example, at the global level, we may declare a 256-byte buffer for receiving data on the serial port:

uint8_t rxBuffer[256];

If we have 1kB of RAM on a device for example, maybe there are 4 components that each own 256 B. Or more likely, some features are more memory-hogging than others, but you get the idea: in this model, each component in the code owns a piece of the RAM for all time.

Dividing up RAM over time

It’s of course a waste to have a 256-byte buffer allocated forever if it’s only used occasionally. The use of dynamic memory allocation (malloc and free) can help to resolve this by allowing components to share the same physical memory at different times, requesting it when needed, and releasing it when not needed.

This allows us to reconceptualize memory as a mosaic over time, with different pieces of the program occupying different blocks of space-time (occupying memory space for some time).

When visualizing memory like this, it’s easy to start to feel like the cost of a memory allocation is not just the amount of memory it locks, but also the time that it locks it for (e.g. the area of each rectangle in the above diagram). In this sense, if I allocate 256 bytes for 2 seconds then it costs 512 byte-seconds, which is an equivalent cost to allocating 512 bytes for 1 second or 128 bytes for 4 seconds.
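
The arithmetic is simple enough to write down directly. This small sketch uses a hypothetical helper, byteSeconds, to show that the three allocations above all cost the same.

```javascript
// Cost of an allocation in byte-seconds: its size multiplied by its lifetime.
function byteSeconds(sizeBytes, durationSeconds) {
  return sizeBytes * durationSeconds;
}

const costs = [
  byteSeconds(256, 2), // 256 bytes held for 2 seconds
  byteSeconds(512, 1), // 512 bytes held for 1 second
  byteSeconds(128, 4), // 128 bytes held for 4 seconds
];
console.log(costs);
```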

Being a little bit more rigorous

Skip this section if you don’t care about the edge cases. I’m just trying to be more complete here.

This measure of memory cost is of course just one way of looking at it, and breaks down in edge cases. For example, on a 64 kB device, a task [2] that consumes 1 B for 64k seconds seems relatively noninvasive, while a task that consumes 64 kB for 1 s is much more costly. So the analogy breaks down in cases where the size of the allocations is significant compared to the total memory size.

Another way the model breaks down is if many of the tasks need memory around the same time, e.g. if there is some burst of activity that requires collaboration between many different tasks. The typical implementation of malloc will just fail if there is no memory available right now, as opposed to perhaps blocking the thread until the requested memory becomes available, as if the memory were a mutex to be acquired.

But the model is accurate if we make these assumptions:

  • The size of the individual allocations is small relative to the total memory size
  • Allocations happen randomly over time and space

Under these assumptions, the total memory usage becomes a random variable whose expected value is exactly:

The expected allocation size × the expected allocation duration × the expected allocation frequency

We could also calculate the probability that the device runs out of memory at any given point (I won’t do the calculations here).
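
As a sanity check, reading the product as size × duration × frequency (the same shape as Little’s law from queueing theory), here is a small deterministic sketch. The numbers are invented for illustration: one 100-byte allocation every 250 ms, each held for 500 ms.

```javascript
// Expected memory in use = mean size x mean duration x allocation frequency.
function expectedUsage(meanSizeBytes, meanDurationSeconds, allocationsPerSecond) {
  return meanSizeBytes * meanDurationSeconds * allocationsPerSecond;
}

// Deterministic check: sum the byte-milliseconds of every allocation that
// fits within the window, then average over the window length.
function simulateAverageUsage(sizeBytes, durationMs, periodMs, totalMs) {
  let byteMs = 0;
  for (let start = 0; start + durationMs <= totalMs; start += periodMs) {
    byteMs += sizeBytes * durationMs;
  }
  return byteMs / totalMs;
}

const predicted = expectedUsage(100, 0.5, 4);
const simulated = simulateAverageUsage(100, 500, 250, 100000);
console.log(predicted, simulated);
```

The simulated average lands just under the prediction because of edge effects at the end of the window. Real allocations would be random rather than periodic, which is what makes running out of memory a matter of probability.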

Conclusion

In situations where memory allocations can be approximated as being small and random, the duration of a memory allocation is just as important as its size. Much more care must be taken to optimize memory usage for permanent or long-lived operations.

I’m not very happy with this stochastic viewpoint and all these edge cases. It means that at any point, we could randomly exceed the amount of available memory and the program will just die. Is there a better way to organize memory space-time so we don’t need to worry as much? I believe there is… and I’ll cover that in the next post.


  1. Another thing that makes it a constrained resource is the lack of virtual memory and the ability to page memory in and out of physical RAM, so the RAM size is a hard limit 

  2. When talking about this space-time model, I think it’s easier to talk about a “task” than a “component”, where a task here is some activity that needs to be done by the program over a finite stretch of time, and will consume resources over that time. 

Microvium: updated memory model


TL;DR Microvium can now address up to 64 kB of ROM, up from 32 kB previously, and now runs more efficiently on small 32-bit devices such as an ARM Cortex MCU.

This is a minor update regarding the data model in Microvium (previous post here), for the kind of person who’s interested in this kind of thing.

Recap: Microvium uses 16-bit slots

As covered in the previous post, Microvium uses a 16-bit slot size — variables in Microvium are 16 bits each. This is unusual among the small embedded JavaScript engines. Cesanta’s mJS and Elk use a 64-bit slot size with NaN-boxing, and Moddable’s XS uses a 128-bit slot size (4x32bit words). So, if you have a variable with the number 42 in it, it will take 2 bytes in Microvium, 8 bytes in mJS, and 16 bytes in XS (on the other hand, the number 42.5 is a float and will take 12 bytes in Microvium since it overflows the slot into the heap, but it will still take 8 bytes in mJS and 16 bytes in XS).

Pointers and Paged Memory

If the lowest bit in the slot is 0, Microvium treats the value as a pointer to heap memory.

Since heap memory in Microvium is always 2-byte aligned, the lowest bit of a pointer will always be 0, so the value in the slot exactly corresponds to the pointer value, at least on 16-bit systems (or on 8-bit systems with a 16-bit address bus).

But what about 32-bit systems? Microvium is optimized for devices with 64 kB or less of RAM, but many devices with 64 kB of RAM are actually 32-bit devices. In these devices, there is usually a 32-bit address space, and some sub-range of these addresses will map to physical RAM (and some other range of addresses will map to physical ROM). Even though only 16 bits of information are required to index every byte of RAM, pointers are still 32 bits on these devices.

Previously, Microvium worked fine on 32-bit and 64-bit devices (I do all the testing on my 64-bit PC) but it did so through an expensive mapping table that mapped 16-bit VM addresses to their 32-bit or 64-bit native counterparts (like virtual memory implemented in software). The mapping itself was still O(1) in many cases [1], but it involved a function call every time a pointer needed to be mapped, which is a massive overhead to incur. I wasn’t too worried about this because I wasn’t aiming for 32-bit devices as my initial audience, but with the number of 32-bit devices out there, this would quickly become a problem.

To support this kind of scenario more elegantly, Microvium now has a port macro definition that allows you to specify the upper 16 bits of a 32-bit pointer (or upper 48 bits of a 64-bit pointer) for the platform [2].

For a more concrete example, the Arduino Nano 33 IOT has up to 32kB of SRAM, starting at the address 0x20000000. So the upper 16-bits of a real pointer into RAM will always be 0x2000. Here is a snippet of the data sheet that shows this:

So, in the Microvium port file, you can now specify the high bits of a 32-bit pointer to be 0x2000, and Microvium will interpret all pointers by simply indexing into the given memory page.

In compiled ARM machine code, the conversion from a 16-bit slot value to 32-bit pointer is just one or two instructions [3]! This is a significant performance improvement over how it worked before, and makes pointer access in Microvium almost as efficient as native pointer access.
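
The conversion itself can be sketched in a few lines. This is only an illustration of the arithmetic in JavaScript, not the engine’s C code, and the function names are hypothetical. The page argument stands for the high 16 bits configured in the port file.

```javascript
// Combine a configured 16-bit "page" with a 16-bit slot value to form
// a 32-bit native address. Heap pointers always have a low bit of 0.
function slotToNativeAddress(page, slot) {
  if ((slot & 1) !== 0) throw new Error('Not a heap pointer: low bit is set');
  return ((page << 16) | slot) >>> 0;
}

// Going the other way just keeps the low 16 bits.
function nativeAddressToSlot(address) {
  return address & 0xFFFF;
}

const onNano = slotToNativeAddress(0x2000, 0x002A);
const onWindows = slotToNativeAddress(0x5555, 0x002A);
console.log(onNano.toString(16), onWindows.toString(16));
```

With page 0x2000 the slot value 0x002A maps to 0x2000002A, and with page 0x5555 it maps to 0x5555002A.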

I didn’t actually make this change for performance reasons. I did it because it makes the development of the Microvium engine much easier.

  • On my Windows machine where I develop, I can now use VirtualAlloc to pre-allocate a single 64 kB “page” of memory where the high bits of a pointer are always 0x5555, and run Microvium in just this region of memory. So if I see the Microvium value 0x002A, I know instantly that corresponds to the address 0x5555002A.
  • The addresses are consistent across runs, so when I note down an address in my notebook while debugging, I know it will be the same if I restart the program.
  • I can also have a memory view open in the debugger and it remains consistent across runs and shows all the VM memory in one place.

64 kB ROM

Previously, if the lower 2 bits of a slot were 01b then the value was considered a pointer into ROM after shifting right by 1 bit, giving us a 15-bit address space for ROM, and requiring ROM to be 2-byte aligned to keep the remaining 0 bit.

Now, the slot value is considered a pointer into ROM after zeroing the bottom 2 bits. This doesn’t change the performance, but it means that we can now address up to 64 kB of ROM.

A side effect is that ROM must now be 4-byte aligned since the lower 2 bits of ROM pointers must be zero. This means extra padding sometimes, but I’ve found that the ROM overhead doesn’t grow substantially with this change.

Why did I do this?

  1. Debugging. Previously, if I saw the slot value 0x2A1 while stepping through the engine, I had to bring out a calculator to see that 0x2A1 corresponds to the bytecode address 0x150. Now, the value 0x2A1 corresponds to the bytecode address 0x2A0, which is much easier to follow.
  2. I was irked by the fact that I couldn’t just say that “Microvium supports up to 64kB of memory”. I previously had to qualify it every time — “it supports 64kB of RAM, and 64kB of snapshot image, but only up to 32kB of ROM”. Now I can just say “supports up to 64kB of memory” with no added asterisks or qualifications, since it supports 64 kB of each type of memory it uses. This is part of simplifying the mental model of Microvium, since Microvium is all about simplicity.
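
For anyone who wants to check the numbers, here is a sketch of the two decodings applied to the slot value 0x2A1 from the debugging example. The function names are hypothetical, and the sketch only models the arithmetic described here, not the engine’s actual code.

```javascript
// Old scheme: shift right by 1, giving a 15-bit ROM address space.
function decodeRomPointerOld(slot) {
  return slot >> 1;
}

// New scheme: zero the bottom 2 bits, giving a 16-bit ROM address space.
function decodeRomPointerNew(slot) {
  return slot & 0xFFFC;
}

const slot = 0x2A1;     // low 2 bits are 01b
const largest = 0xFFFD; // largest 16-bit value whose low 2 bits are 01b
console.log(
  decodeRomPointerOld(slot).toString(16),
  decodeRomPointerNew(slot).toString(16),
  decodeRomPointerOld(largest).toString(16),
  decodeRomPointerNew(largest).toString(16)
);
```

The largest reachable address goes from 0x7FFE under the old scheme to 0xFFFC under the new one, which is where the jump from 32 kB to 64 kB comes from.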


  1. The performance was actually O(n) where n is the number of memory blocks, but since Microvium uses a compacting garbage collector, memory is consolidated into a single block periodically, making it O(1) for most pointers most of the time. 

  2. More accurately, you can specify any native address as the origin of the VM address space, but I found it easier to explain this in terms of the high bits. 

  3. In the full ARM instruction set, the conversion is a single 32-bit instruction. In the ARM Thumb instruction set, it takes two 16-bit instructions. The specific number 0x20000000 is relevant here because it’s a power of 2 which is more efficient. 

Microvium Closure Variable Indexing


TL;DR: In many situations, a program in Microvium bytecode can access closure variables with just a single-byte bytecode instruction. The instruction contains a 4-bit index that either indexes into the current lexical scope or recursively overflows to the next-outer lexical scope, cascading up the lexical scope chain until the variable is found. In this post, I discuss some of the journey and the details of how this works.


What is a closure variable?

A closure is a function that accesses variables outside its local lexical environment. See the MDN article on this for a better explanation.

What I mean by a “closure variable” in this post is a local variable that is accessed by a closure [1]. For example, in the following code, callback is a closure because it accesses x which is outside its own local variables, and correspondingly x is a “closure variable” by this definition because it’s accessed by callback:

function foo() {
  let x = 10;
  setTimeout(callback, 1000);
  function callback() { 
    console.log(x);
  }
}

In Microvium (and other JavaScript engines), closure variables are treated differently from normal local variables because they need to outlive the stack frame in which they are declared. The variable x here needs to survive beyond the return of foo. This is done by allocating the slots for these variables on the heap instead of the stack.

If a variable is demoted to the heap, all access to that variable is via the heap, even if it’s being accessed locally. For example:

function foo() {
  let x = 10;
  console.log(x);  // <--- this is also accessing `x` on the heap
  setTimeout(callback, 1000);
  function callback() { 
    console.log(x);
  }
}

There is a static analysis pass in Microvium that decides whether a variable is a closure variable or not. In principle, it’s safe to allocate all variables on the heap rather than doing this analysis, but the heap is expensive, so as an optimization some variables can be promoted to the stack if they are not used by closures.

For the sake of the rest of this blog post, I will pretend that all variables are closure variables, even if I do not show the closure that accesses them. Or equivalently, I will pretend that there is no optimization analysis to determine which variables can be promoted to the stack.

Background: how it might have worked

Before I get to describing how it does work today, let me describe how I previously implemented it before having my eureka moment. You can skip this section if you want to get straight to the answer instead of going through the journey as I did.

The state of the virtual machine needs to keep track of the current lexical scope. So let’s add a new machine register called scope which points to the current lexical scope.

A goal with Microvium is to keep the engine implementation small. So rather than introducing a new allocation type for environment records and new bytecode instructions to access them, maybe we can reuse an existing type and existing instruction.

A natural solution that might come to mind is to use an Object, where the property keys of the object correspond to the variable names. Microvium has existing bytecode instructions ObjectGet and ObjectSet to get and set properties on an object.

However, Objects are very expensive at runtime. Each property is stored in memory as a key-value pair taking 4 bytes, and property lookup is a linear-time search through the properties to find the one with the right key.

Since we can statically determine the number of variables, a better choice of container for our variables would be a fixed-length array, where we statically compute an index for each variable rather than using its name. In Microvium, a fixed-length array is quite efficient, having constant-time random access to any slot by its index and each slot only consumes 2 bytes.

So let’s think about compiling the following JavaScript to IL:

let x, y, z;
x = 10;
y = 20;
z = 30;

We will have a fixed-length array with 3 slots for these 3 variables, and our new scope register will point to this fixed-length array to say that this is the current scope. Whenever we need to access one of these variables, we will read and write to the array.

I’m showing the allocation header here for completeness. These arrays are heap-allocated, so they each require this implicit memory slot for information about the size and type of the allocation.

So, the following is the IL sequence that might be produced for the above JavaScript:

// let x, y, z;
ArrayNew(3)         // Allocate 3 slots on the heap
StoreReg('scope')   // Save the array in the "scope" register

// x = 10;
LoadReg('scope')    // Fetch the current scope
Literal(10)
ArraySet(0)         // Set the first slot in the array

// y = 20;
LoadReg('scope')    // Fetch the current scope
Literal(20)
ArraySet(1)         // Set the second slot in the array

// z = 30;
LoadReg('scope')    // Fetch the current scope
Literal(30)
ArraySet(2)         // Set the third slot in the array

Side note: Microvium IL is based on a stack machine (see Wikipedia). The instruction Literal(10) pushes the value 10 to the top of the stack. The instruction ArraySet(0) pops the literal value off the stack, and pops the array reference off the stack (which was previously pushed by LoadReg('scope')) and then assigns the 0th slot in the array.
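
To see the stack behavior in action, here is a toy interpreter for just the instructions used above. It is a sketch for illustration only and bears no relation to how the real engine is written. The instruction names match the IL listing, but the tuple encoding is made up.

```javascript
// Toy interpreter for the IL above; not Microvium's actual engine.
function runIL(program) {
  const stack = [];
  const registers = { scope: undefined };
  for (const [op, arg] of program) {
    switch (op) {
      case 'ArrayNew': stack.push(new Array(arg).fill(undefined)); break;
      case 'StoreReg': registers[arg] = stack.pop(); break;
      case 'LoadReg': stack.push(registers[arg]); break;
      case 'Literal': stack.push(arg); break;
      case 'ArraySet': {
        const value = stack.pop(); // pushed last, so popped first
        const array = stack.pop(); // pushed earlier by LoadReg
        array[arg] = value;
        break;
      }
      default: throw new Error('Unknown instruction: ' + op);
    }
  }
  return { stack, registers };
}

const result = runIL([
  ['ArrayNew', 3], ['StoreReg', 'scope'],                  // let x, y, z;
  ['LoadReg', 'scope'], ['Literal', 10], ['ArraySet', 0],  // x = 10;
  ['LoadReg', 'scope'], ['Literal', 20], ['ArraySet', 1],  // y = 20;
  ['LoadReg', 'scope'], ['Literal', 30], ['ArraySet', 2],  // z = 30;
]);
console.log(result.registers.scope, result.stack.length);
```

After the sequence runs, the scope array holds 10, 20 and 30 and the stack is empty again.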

Nested scopes

What if there are nested lexical scopes, as in the following JavaScript code:

function foo() {
  let x;
  function bar() {
    let y;
    function baz() {
      let z;
      x = 10;
      y = 20;
      z = 30;
    }    
  }
}

In the above code, z is accessed in the same scope in which it is declared, as before. But x and y are accessed from parent (outer) scopes relative to the expression that accesses them. So we need a way for the IL to access parent scopes.

Remember that the inner scopes can be instantiated multiple times for each single instantiation of an outer scope. For example, if bar() is called twice within foo, then there will be 2 instances of variable y for every one instance of variable x. So we can’t just put x, y, and z in the same array. We need each lexical scope to be in its own array, and we need a way to reference an outer scope from an inner scope.
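
This is ordinary JavaScript behavior and can be observed directly. In this sketch (a variation on the example above, with counters added so the effect is visible), bar is called twice inside a single call to foo:

```javascript
function foo() {
  let x = 0;            // one x per call to foo
  function bar() {
    let y = 0;          // a fresh y for every call to bar
    return function baz() {
      x++;
      y++;
      return [x, y];
    };
  }
  return [bar(), bar()];
}

const [baz1, baz2] = foo();
const results = [baz1(), baz1(), baz2()];
console.log(JSON.stringify(results));
```

The third result has x at 3 but y back at 1, because both closures share the one x while each has its own y.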

This might seem like we need to abandon the fixed-length array as the underlying storage for these variables, since these don’t naturally form chains. But if we think about it a moment, we could reserve one of the slots in the fixed-length array as a pointer to the parent fixed-length array. Let’s mentally reserve the first slot in each array as the pointer to its parent scope.

In the above JS, we have 3 distinct lexical scopes, and so we may land up with a scope chain as follows:

Now, the IL to read variable x may look as follows:

LoadReg('scope')  // Get the current scope (the one containing z)
ArrayGet(0)       // Read the parent scope (the one containing y)
ArrayGet(0)       // Read the parent scope (the one containing x)
ArrayGet(1)       // Read variable x

Each of these instructions encodes to 1 byte, so this is a 4-byte sequence in total.
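
The same walk can be modeled with plain arrays. This sketch assumes the layout described above, with slot 0 of each array holding the parent scope. The readVariable helper is hypothetical and just mirrors the IL: one ArrayGet(0) per hop, then a final ArrayGet for the variable.

```javascript
// Each scope is a fixed-length array whose slot 0 points to its parent.
const fooScope = [null, 'x-value'];     // outermost scope, no parent
const barScope = [fooScope, 'y-value'];
const bazScope = [barScope, 'z-value'];

// Mirrors LoadReg('scope'), then one ArrayGet(0) per hop, then ArrayGet(slot).
function readVariable(scope, hops, slot) {
  let current = scope;
  for (let i = 0; i < hops; i++) {
    current = current[0];
  }
  return current[slot];
}

// Number of 1-byte instructions needed: LoadReg + hops + the final ArrayGet
function sequenceLength(hops) {
  return hops + 2;
}

console.log(readVariable(bazScope, 2, 1), sequenceLength(2));
```

The cost grows with nesting depth: reading z takes 2 instructions from here, y takes 3 and x takes 4.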

I thought this was a pretty good solution. It didn’t add any extra instructions to the IL instruction set, and so didn’t make the engine any bigger. An important goal in Microvium is to keep the engine small.

The final solution

In the end, I decided that closures were too important to have it cost so many instructions to read and write to them. Rather than emitting multiple IL instructions just to read or write a single variable, it seemed quite logical to bake this behavior into the engine itself, and add two additional instructions to the IL instruction set: LoadScoped and StoreScoped. I wanted to keep these instructions both compact and efficient, and that’s where the design challenge is interesting.

Consider the following JavaScript with a few more variables to make the pattern clear:

function foo() {
  let a;
  let b;
  function bar() {
    let c;
    let d;
    function baz() {
      let e;
      let f;
      a = 10;
      b = 20;
      c = 30;
      d = 40;
      e = 50;
      f = 60;
    }
  }
}

The assignment statements in this example, such as a = 10, are accessing the variables in one of 3 different lexical scopes. How can we design the instruction format for LoadScoped (and StoreScoped) so that the bytecode can specify which scope is being accessed as well as which variable in that scope?

The eureka moment for me was when I realized that I could use a single mapped index that specifies both the scope and the variable within that scope. I’ll represent this mapping in the following diagram, with the index on the left and the variable it maps to on the right (the allocation headers are omitted here for clarity).

The scopes in this design form a waterfall. Small indexes are accessing the inner-most lexical scope. Larger indexes “overflow” into the next outer closure scope, repeating until the variable is found.

So, the IL LoadScoped(1) would load variable e, and LoadScoped(8) would load variable b, for example. We can determine the indexes through static analysis: numbering the variables in the closest scope first and then the ones in the next closest, etc. [2]
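
The numbering can be sketched as a small function. This is not the compiler’s actual analysis pass, and scopedIndexOf is a hypothetical name. It assumes each scope array keeps its parent pointer in slot 0, as in the earlier sections, so every scope contributes one extra slot to the count.

```javascript
// Compute the waterfall index of a variable, given the chain of scopes
// visible from the accessing code, innermost first.
function scopedIndexOf(chain, name) {
  let base = 0;
  for (const variables of chain) {
    const position = variables.indexOf(name);
    if (position !== -1) return base + 1 + position; // +1 skips the parent slot
    base += variables.length + 1;                    // overflow to the next scope
  }
  throw new Error('Variable not found: ' + name);
}

const fromBaz = [['e', 'f'], ['c', 'd'], ['a', 'b']];
const fromBar = [['c', 'd'], ['a', 'b']];

console.log(scopedIndexOf(fromBaz, 'e'), scopedIndexOf(fromBaz, 'b'));
console.log(scopedIndexOf(fromBaz, 'a'), scopedIndexOf(fromBar, 'a'));
```

From inside baz, e gets index 1 and b gets index 8. Variable a gets index 7 from baz but index 4 from bar, since the index depends on where the access happens.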

Implementation in C

My original concern with adding new bytecode instructions was that it would make the engine much bigger and more complicated. But this design can be implemented efficiently in the engine, adding only a little bit more complexity. See the following C code for finding the variable with the given index [3]:

uint16_t* arr = registers->scope;
do {
  // The length of the array is in its header word
  uint16_t len = arr[-1] & 0xFFF;

  // Is the variable in this scope?
  if (index < len) return arr[index];

  // Otherwise, cascade/overflow to the outer scope
  arr = arr[0];
  index -= len;
} while (1);
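
For experimenting outside the engine, the same loop can be modeled in JavaScript. This is a sketch under simplifying assumptions: scopes are plain arrays with the parent in slot 0, the array length stands in for the header word, and findSlot, loadScoped and storeScoped are hypothetical helper names.

```javascript
// Walk outward through the scope chain until the index falls inside a scope.
function findSlot(scope, index) {
  let arr = scope;
  while (arr !== null) {
    if (index < arr.length) return { arr, index };
    index -= arr.length; // cascade/overflow to the outer scope
    arr = arr[0];
  }
  throw new RangeError('Scoped index out of range');
}

function loadScoped(scope, index) {
  const slot = findSlot(scope, index);
  return slot.arr[slot.index];
}

function storeScoped(scope, index, value) {
  const slot = findSlot(scope, index);
  slot.arr[slot.index] = value;
}

const fooScope = [null, undefined, undefined];     // holds a, b
const barScope = [fooScope, undefined, undefined]; // holds c, d
const bazScope = [barScope, undefined, undefined]; // holds e, f

storeScoped(bazScope, 7, 10); // a = 10
storeScoped(bazScope, 8, 20); // b = 20
storeScoped(bazScope, 4, 30); // c = 30
storeScoped(bazScope, 5, 40); // d = 40
storeScoped(bazScope, 1, 50); // e = 50
storeScoped(bazScope, 2, 60); // f = 60
console.log(fooScope.slice(1), barScope.slice(1), bazScope.slice(1));
```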

Actual Instruction format

For the curious, the actual bytecode instruction format for LoadScoped in its simplest form is just 8 bits, where the upper nibble is the opcode (“LoadScoped”) and the lower nibble is the 4-bit index, allowing up to 15 closure variables to be addressed.

StoreScoped is the same, but with a different opcode.
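
The nibble packing can be sketched as follows. The opcode values 0x3 and 0x4 are invented placeholders, not Microvium’s real opcode numbers, and the helper names are hypothetical.

```javascript
// Placeholder opcode values, chosen only for this illustration.
const OP_LOAD_SCOPED = 0x3;
const OP_STORE_SCOPED = 0x4;

// Pack a 4-bit opcode into the upper nibble and a 4-bit index into the lower.
function encodeInstruction(opcode, index) {
  if (opcode < 0 || opcode > 0xF) throw new RangeError('Opcode must fit in 4 bits');
  if (index < 0 || index > 0xF) throw new RangeError('Index must fit in 4 bits');
  return (opcode << 4) | index;
}

function decodeInstruction(byte) {
  return { opcode: byte >> 4, index: byte & 0xF };
}

const loadB = encodeInstruction(OP_LOAD_SCOPED, 8);
const decoded = decodeInstruction(loadB);
console.log(loadB.toString(16), decoded);
```

An index that doesn’t fit in the lower nibble is rejected here. In the real engine, that is the case the longer instruction variants exist for.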

For the even-more-curious, if your code has more variables than will fit in the 4-bit index, there are 16-bit and 24-bit [4] instruction variants as well, as per the following snippet of the Microvium technical documentation:

Conclusion

I’m very satisfied with the final design so far. It brings closures up as first-class citizens in Microvium by making closure variable access almost as space-efficient as local variables, in terms of both bytecode space and memory usage, while also having only a small CPU cost. I think this is important because closures are very common in real-world JavaScript code, especially when programming with a functional style, and also because closures may in future form the foundation for the implementation of other features such as generators and async-await.

This has been a good reminder to me that it’s worth thinking deeply about designs and considering different options, rather than jumping straight in and implementing whatever comes to mind first. In this case, the final design was both simpler to implement and more efficient.

There’s a lot I haven’t talked about in this post, such as how the array is created in the first place and how the scope register changes as control moves between different lexical scopes. But that can be for another time.


  1. If you have a better name for a “closure variable”, please let me know 

  2. Note that one complexity with this design is that the index for a single variable is different depending on which scope the code is accessing it from. The index is not an absolute property of the variable itself. 

  3. This is not the actual code. The actual code needs to deal with error checking and multiple address spaces, for example. 

  4. Why would anyone ever need to address more than 255 closure-scoped variables? I heard a story that the C# compiler assumed that nobody would ever need more than 65536 variables in a single function, but that assumption was violated by a code generator that was generating massive functions. So I’m cautious about saying “nobody will ever need more than 255 variables” when it’s only 2 lines of code to support it. Maybe I’ll change my mind in the future. 

Incidental vs Accidental Complexity


Developers often talk about accidental vs essential software complexity. But is there a difference between accidental complexity and incidental complexity? I think there is.

I like this definition of essential complexity:

Essential complexity is how hard something is to do, regardless of how experienced you are, what tools you use or what new and flashy architecture pattern you used to solve the problem.

(source)

That same source describes accidental complexity as any complexity that isn’t essential complexity, and this may indeed be the way that the terms are typically used. But I’d like to propose a distinction between accidental and incidental complexity:

Accidental complexity: non-essential complexity you introduce into the design by mistake (by accident), and you can realize your mistake in an “oops” moment (and consider correcting or learning from it).

Incidental complexity: any non-essential complexity you introduce into the design, which may be by mistake or on purpose (it may or may not be accidental and may or may not be accompanied by an “oops” moment).

Accidental complexity is when I accidentally knocked a drinking glass off the table and it shattered into pieces, making my morning more complicated as I cleaned it up. It’s marked with “oops” and the realization of causing something I didn’t mean to.

Incidental complexity is when I chose to turn off the hotplate on my stove by pressing the digital “down” button 8 times until the level reached zero, and my wife later informed me that pressing “up” once would achieve the same thing (since the level wraps around under certain circumstances). I did not accidentally press the “down” button 8 times, I did it intentionally, but was still ignorant of my inefficiency.

Clearly, there’s a fine line between the two. Defining something as incidental complexity but not accidental is a function of my ignorance and depends on my level of awareness, which changes over time and varies from person to person.

Why introduce unnecessary complexity on purpose?

1. Simplicity takes effort

Paradoxically, often the simpler solution is also one that takes longer to design and implement, so it can be a strategic design choice to go with a more complicated solution than a simpler one.

This is especially true for POCs and MVPs in unfamiliar domains or where the business requirements aren’t completely clear. In these cases, you can’t be confident of all the details upfront and you’re quite likely to be rewriting the solution a few times before it reaches its final state. Spending time simplifying and organizing each iteration can be a wasted effort if it will just be thrown away.

There may also just not be time for the simpler solution. The added effort defining a clean abstraction and structure takes time and there is an opportunity cost to taking longer on something, maybe through losing business revenue when the feature is delivered later, or through the cost of not working on the next feature. There are tradeoffs to be made.

2. Ignorance: You don’t know any better

This might be a little controversial, but I would argue that if you design something as simple as you can possibly conceive, then any remaining incidental complexity is not “accidental”. Like the example earlier of me pressing “down” 8 times instead of “up” once.

As a concrete example in software, JavaScript did not always have async/await or the Promise type, and as a result, code was often structured with nested callbacks to handle sequences of asynchronous actions (known as “callback hell”). This kind of code is clearly unnecessarily complicated, no matter how hard people tried to contain the complexity. But designs based on callbacks are not accidentally complicated: they have exactly the design intended by the author, who believed it to be the best possible design.

And yet, promises could be (and eventually were) implemented as a third-party library, and so could the async/await pattern (through the abuse of generator functions). So the potential was there from the beginning for every developer to structure their code in a simpler way, if they could just realize the potential to do so. The complexity was not imposed as a restriction in the programming language; it was imposed by ourselves because we hadn’t been awakened to a better way of seeing things.
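
To make the generator trick concrete, here is a minimal sketch of how an async/await-like style can be built in ordinary user code. The `run` helper and the `delayedDouble` stand-in are hypothetical names invented for this illustration, and real libraries that did this were more thorough.

```javascript
// Hypothetical helper: drives a generator that yields promises,
// resuming it with each resolved value (or throwing each rejection into it).
function run(generatorFn) {
  return new Promise((resolve, reject) => {
    const gen = generatorFn();
    function step(method, value) {
      let result;
      try {
        result = gen[method](value); // gen.next(value) or gen.throw(error)
      } catch (err) {
        reject(err); // the generator itself threw
        return;
      }
      if (result.done) {
        resolve(result.value);
        return;
      }
      // Wait for the yielded promise, then resume the generator.
      Promise.resolve(result.value).then(
        (v) => step('next', v),
        (e) => step('throw', e)
      );
    }
    step('next', undefined);
  });
}

// Stand-in asynchronous operation for the example.
const delayedDouble = (x) =>
  new Promise((resolve) => setTimeout(() => resolve(x * 2), 1));

// Reads top to bottom, with no nested callbacks.
const doubledTwice = run(function* () {
  const a = yield delayedDouble(1);
  const b = yield delayedDouble(a);
  return b;
});
```

Here `yield` plays the role that `await` plays today, and `doubledTwice` is a promise for the final result. Nothing in the sketch needs language support beyond generators and a promise implementation.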

It’s worth noting one more thing: if you’ve ever tried to explain JavaScript Promises to someone, you’ll know that an explanation of what a Promise is or does is unconvincing. Understanding how to think in terms of Promises requires a mental paradigm shift, which doesn’t occur merely by explanation. Even a concrete demonstration may be unconvincing since the code may be “shorter”, but is it really “simpler” when it’s full of all these new and unfamiliar concepts? The way to internalize a new paradigm like this is to use it and practice thinking in terms of it. Only then can we really see what we were missing all along.

Why does it matter?

I think the distinction between accidental and incidental complexity matters because it should keep us humble and aware that the way we write software today is probably ridiculously inefficient compared to the techniques, patterns, and paradigms of the future.

It shows that certain kinds of incidental complexity can look a lot like essential complexity if we don’t yet have the mental tools to see those inefficiencies. The biggest productivity boosts we could gain in our designs might be those that aren’t obvious from the perspective of our current ways of thinking, and which might take a lot of time and effort to see clearly.

It shows that finding incidental complexity is not just a case of looking at a project and identifying what we think is needless, since there’s a good chance that we’re blind to a lot of it and all we will see are things that make us say “yes, that thing is important for reason xyz, so we can’t remove that”. Like all the callbacks in callback hell: every callback seems important, so it seems like it is all essential complexity, even though there exists a completely different way of structuring the code that would make it easier to understand and maintain.

But on the positive side, it shows that we are capable of identifying a lot of incidental complexity, if we spend enough time and effort thinking through things carefully, and if we remain open and curious about better ways of doing things.

Immutable.js vs Immer

This week I came across the library Immer as a convenient way of manipulating immutable data in JavaScript. After reading this Reddit thread that raves about how much better Immer is than Immutable.js, I was worried I’d made the wrong decision to use Immutable.js in Microvium. But some performance tests quickly cleared up my concerns.

Immer certainly seems convenient to use. It uses a combination of proxies and copy-on-write to allow you to use the normal mutating syntax in JavaScript and it handles the underlying immutability automatically.

But “copy-on-write” raised a red flag in my mind. When dealing with small changes to large collections, as Microvium does, I can’t possibly imagine how copying the whole collection on every write can be efficient. And indeed, a quick performance test shows how bad the situation is:

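import { produce, enableMapSet } from "immer";
import { Map as ImmutableMap } from "immutable";

// Immer 6 and later only support Map and Set drafts after this opt-in
enableMapSet();

const count = 100;

// Test Immutable.JS: each set() returns a new persistent map
let map1 = ImmutableMap();
for (let i = 0; i < count; i++)
  map1 = map1.set(i, i);

// Test Immer: each produce() returns a new Map with the draft's changes
let map2 = new Map();
for (let i = 0; i < count; i++)
  map2 = produce(map2, m => { m.set(i, i); });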

For different values of count, here are my results:

count        | 100   | 1000   | 10k     | 100k          | 1M     | 10M
Immer        | 11 ms | 94 ms  | 8.8 sec | 17 min 47 sec | ???    | ???
Immutable.JS | 3 ms  | 9 ms   | 47 ms   | 257 ms        | 2.4 s  | 41 s
Builtin Map  | 37 µs | 180 µs | 2 ms    | 18 ms         | 170 ms | 3.4 s

Insertion into the map using Immer is unsurprisingly O(n) relative to the size of the map, making the whole test O(n²). Even though I had a hunch that this was going to be the case, it was worth checking that Immer wasn’t doing some other clever tricks under the hood to improve the performance.
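
To see where the O(n²) comes from, here is a small sketch of naive copy-on-write that counts how many existing entries get copied over the whole test. This is not Immer’s actual implementation, just the simplest scheme with the same asymptotic behavior, and the `cowSet` and `entriesCopied` names are made up for the illustration.

```javascript
// Naive copy-on-write insert: clone the whole map, then add one entry.
// `copied` counts how many existing entries each clone had to copy.
let copied = 0;

function cowSet(map, key, value) {
  copied += map.size;
  return new Map(map).set(key, value);
}

// Runs the same insertion loop as the benchmark and returns the
// total number of entries copied along the way.
function entriesCopied(count) {
  copied = 0;
  let map = new Map();
  for (let i = 0; i < count; i++) map = cowSet(map, i, i);
  return copied;
}
```

With this scheme `entriesCopied(100)` is 4950 and `entriesCopied(1000)` is 499500, so ten times as many insertions means roughly a hundred times as much copying. That is the same shape as the Immer row above, where going from 1000 to 10k entries took the time from 94 ms to 8.8 sec.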

I’ve never actually tested the performance of Immutable.JS before, so I’m quite pleased to see how well it scales. It seems only slightly worse than O(1) insertion time.

Both libraries are quite a bit slower than using the built-in Map class, so my conclusions are as follows:

  • Use Immer if readability and ease of use are more important than performance, which is often the case. But be aware that it’s not a simple process to switch to Immutable.JS later because of how different the API is.
  • Use Immutable.JS if you need high-performance persistent collections that scale well to large sizes, and if that is worth the reduced readability and type safety and the increased verbosity of the code.
  • Use the mutable built-in collections for performance-critical code that doesn’t strictly need immutability for the algorithm.

TC39 is looking to add built-in immutable collections to the JavaScript standard (see the records and tuples proposal). I’m excited to see how those perform.

Distributed apps in JavaScript

This post is different from my usual topic of Microvium. I’ve been recently frustrated with the way that distributed applications are written and I’ve been brainstorming ways it could be improved. In my last post on the topic, I suggested that maybe a new programming language would be useful for solving the problem, but I ended the post thinking about how the Microvium snapshotting paradigm might also be a suitable solution. This time, I’m going to consider a solution that doesn’t require Microvium.

A quick summary of the objectives

I want the experience of developing a distributed application to be basically as seamless as that of developing a single-process desktop application, such as the ability to get type checking between services, to debug-step directly from one service to another during development, to write regression tests that encompass multiple services, etc. And I want this all with minimal boilerplate.

My last post on the topic goes into more detail about the issues I have with the current state of affairs and the improvements that could be made.

A proposed solution

Here’s the summary: I think the Microvium snapshotting paradigm is indeed the answer, and I think we can in a sense “polyfill” the snapshotting ability in a limited context by deterministically replaying IO.

I’ll talk about this solution in a few parts:

  1. What is snapshotting and how could we use it to solve the problem?
  2. How can we implement snapshotting on a modern JavaScript engine?
  3. Is it necessary to do this in JavaScript?
  4. More details

What is snapshotting and how does it help?

I’m using the term “snapshotting” here to mean the ability to “hibernate” a process to a file and then resume it later on the same or a different machine. This means capturing to a file the full state of the process, such as the global variables, the stack and heap state, and machine registers, and being able to restore that full state in a completely new context.

Normally applications are deployed as either a compiled executable or as source code depending on whether the programming language is compiled or interpreted. Having the snapshotting capability gives us a third option: run the application code at build-time and deploy a snapshot of its final initialized state.

Doing it the latter way allows application code to have pre-runtime effects, such as defining infrastructure, and for pre-runtime code to pass information seamlessly to runtime code, such as secrets and connection strings for the aforementioned infrastructure.

When I talk about coding in JavaScript, I’m really talking about coding in TypeScript, and so another benefit you get with this paradigm is type checking across different infrastructural components and type checking between infrastructure and runtime code. There are other solutions like Pulumi that allow you to write IaC code in the same language as your application code, but to my knowledge, there are none that allow you to pass [typechecked] state directly from IaC code to runtime code.

How to implement snapshotting?

Microvium is a JavaScript engine that implements snapshotting at the engine level, but Microvium doesn’t yet implement much of the JavaScript spec and its virtual machines are size-limited to 64kB. It’s in no way practical for writing a distributed cloud application.

Node on the other hand is used for cloud software all the time, and I’m a huge fan of it, but it doesn’t support snapshotting in the way described.

But I think we can emulate the snapshotting feature on Node by having a membrane around the application that records all IO while it runs at build-time, and silently replays the IO history as an initialization step at runtime to recreate the final, initialized state. Although this will be slower than just recovering a direct dump of the memory state (if the engine had supported that), in theory, it should work.
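
As a rough sketch of the record-and-replay idea, assuming synchronous IO and results that survive a JSON round trip, the membrane could be reduced to two wrappers. All of the names here (`recording`, `replaying`, `initialize`, `realRead`) are hypothetical and only serve the illustration.

```javascript
// Build time: wrap a real IO function and log every result it returns.
function recording(io, log) {
  return (...args) => {
    const result = io(...args);
    log.push(result);
    return result;
  };
}

// Runtime: hand back the logged results in order, without doing real IO.
function replaying(log) {
  let next = 0;
  return () => {
    if (next >= log.length) throw new Error('IO log exhausted');
    return log[next++];
  };
}

// Hypothetical service initialization that depends on IO.
function initialize(readConfig) {
  return { id: readConfig('id'), secret: readConfig('secret') };
}

// Stand-in for non-deterministic build-time IO.
const realRead = (key) => `${key}-${Math.random()}`;

// Build machine: run the initialization for real, recording the IO.
const log = [];
const buildState = initialize(recording(realRead, log));

// Only the log is shipped (here via JSON); the state is rebuilt by replay.
const shippedLog = JSON.parse(JSON.stringify(log));
const runtimeState = initialize(replaying(shippedLog));
```

After this runs, `runtimeState` has the same contents as `buildState` even though `realRead` was never called at runtime. A fuller version would also record the arguments of each call so that the replay can verify the outgoing IO matches what was seen at build time.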

This will only be reliable if JavaScript is deterministic (that its state and behavior deterministically follow from its incoming IO), which it almost is already. Work by Agoric on SES and Compartments makes this even more so, by allowing the creation of a sandboxed environment for JavaScript modules, in which non-deterministic JavaScript library functions can be removed by default (e.g. Math.random and new Date), or replaced with deterministic implementations. They do this because they run JavaScript on the blockchain, where a collection of redundant nodes needs to process the same script and reach the exact same conclusion about the new state of the blockchain — a strong determinism requirement.

Does it need to be JavaScript?

Honestly, I can’t think of a way to do this outside of JavaScript. JavaScript has a number of key features that I think are really useful for this solution:

  • JavaScript is already almost completely deterministic. There are only a handful of non-deterministic operations like Math.random, and JavaScript’s flexibility allows us to easily remove these or replace them with deterministic proxies. Contrast this with C#, where it’s impossible to prevent some block of code from performing non-deterministic IO or having non-deterministic behavior by using threads.
  • JavaScript allows us to create perfect proxies and membranes. You can create proxies for classes and not just instances of the class, and you can create proxies for modules. In C#, there is no way to put a membrane around an assembly, namespace, or class.
  • Related to creating perfect membranes, JavaScript allows you to hook into all IO, including the importing of a dependent module. This would allow the framework to trace the dependency tree of each individual service and create the appropriate bundles for each.
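
For a feel of what a membrane looks like in practice, here is a deliberately small sketch built on the standard `Proxy` and `Reflect` APIs. It is far from a perfect membrane: it does not wrap arguments passed inward, and it ignores the proxy invariants that matter for frozen or non-writable properties. The `membrane` function and the `service` object are hypothetical.

```javascript
// Wrap `target` so that every call crossing the boundary is logged,
// and any object or function reached through it is wrapped as well.
function membrane(target, log, path = 'exports') {
  const isObject =
    target !== null &&
    (typeof target === 'object' || typeof target === 'function');
  if (!isObject) return target; // primitives pass through unchanged

  return new Proxy(target, {
    get(obj, prop, receiver) {
      const value = Reflect.get(obj, prop, receiver);
      return membrane(value, log, `${path}.${String(prop)}`);
    },
    apply(fn, thisArg, args) {
      log.push({ call: path, args });
      return membrane(Reflect.apply(fn, thisArg, args), log, `${path}()`);
    },
  });
}

// Hypothetical service exports, used only to exercise the membrane.
const service = {
  counter: {
    count: 0,
    increment(by) {
      this.count += by;
      return this.count;
    },
  },
};

const calls = [];
const wrapped = membrane(service, calls);
wrapped.counter.increment(2);
wrapped.counter.increment(3);
```

Both calls to `increment` end up in `calls` with their arguments, while the code using `wrapped` looks exactly like code using `service` directly.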

I think it would be possible to get some approximations of the idea in other programming languages, but it certainly appears to me that JavaScript is particularly well suited.

More Details

Most of the above has been quite abstract. I think it would be worth explaining the vision in a little more concrete detail.

I imagine there to be a thing, which I might call a “service”, which is the minimal distribution unit (we’ll say a service is “a thing that can be deployed”). A service might be a single microservice/lambda, or a database, or a whole distributed application made up of nested services.

A typical SaaS company using this solution would probably just have one root “service” that represents their whole distributed application, and that service would be composed of subservices [1], which could each have their own subservices, etc.

A service is a JavaScript module (file), which runs at build time on the build machine (or dev machine), and then with snapshotting is moved to the runtime machine(s) when it’s finished its initialization (when all the top-level code has run, including its transitive function calls, etc).

While executing at build time, the service code is then able to configure its own runtime “hardware” that it will eventually be moved to. For example, it might call a host function to declare “I’d like to run as a lambda with automatic scaling”, etc. The host can record this information and provision the necessary resources before moving the service to its own desired runtime environment.

A special case of this general rule is that a service can choose a runtime manifestation that doesn’t support code execution at all. For example, a database, queue, or pub-sub system.

A service can instantiate other services at build time. Concretely, this might look something like:

var myService = importService('./my-service-code.js');

I expect this to be roughly analogous to just importing another module into the current service, similar to using node’s require() function. It will import and execute the given JS module (in the same machine process), and return an object representing the script’s exports. But with the distinction that the new module is loaded inside a membrane, and the return value is therefore a proxy for the actual exports inside the membrane.

The membrane serves a few different purposes:

  • When the running services are “moved” to their runtime environments, services in different membranes will be on different physical runtime hardware. Service code may at build time configure its future runtime hardware.
  • At build time, the membrane records all IO exchanges such that it can deterministically replay the IO at runtime during initialization, to restore the service to its exact “snapshotted” state, as mentioned earlier. It can verify and silently absorb outgoing IO, and deterministically replay incoming IO.
  • At build time, services are allowed to pass around references to other services, as a kind of “dependency injection” phase. At runtime, the membranes around each service need to handle the marshalling of calls between services as they use these injected dependencies.

The importService function can construct a new Compartment, which allows it to virtualize the environment in which the service code runs. This can have a variety of effects:

  • We can provide a replayable variation of non-deterministic builtins, such as Math.random and new Date. These can be treated as a special case of IO. They can return real dates and random numbers at build time, as long as they return the same dates and random numbers when replayed at runtime.
  • We can provide service-specific APIs. For example, a function akin to pleaseLetMeRunAsALambda() could be injected into the globals for the service, or we could expose additional builtin-modules to the service code.
  • We can intercept module imports so that we can bundle the service with its module dependencies into a single package for deployment, without the overhead of bringing in dependencies used by other services.
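
The second and third effects can be sketched without a real Compartment by shadowing names with function parameters. This is only a stand-in to show the shape of the idea and is not a secure sandbox. The `runService` and `buildService` helpers, the `importModule` function, and the manifest layout are all assumptions made for the example; only the `pleaseLetMeRunAsALambda` name comes from the text above.

```javascript
// Run service source with injected globals by shadowing names with
// function parameters. This is NOT a secure sandbox like a Compartment.
function runService(source, globals) {
  const names = Object.keys(globals);
  const body = new Function(...names, `"use strict";\n${source}`);
  return body(...names.map((name) => globals[name]));
}

// Execute the service at "build time" and collect what it asked for.
function buildService(source, modules) {
  const manifest = { runtime: null, imports: [] };
  const serviceExports = runService(source, {
    // Service-specific API: the host records the requested runtime.
    pleaseLetMeRunAsALambda: () => {
      manifest.runtime = 'lambda';
    },
    // Intercepted import: record the dependency, then resolve it.
    importModule: (name) => {
      if (!modules.has(name)) throw new Error(`Unknown module: ${name}`);
      if (!manifest.imports.includes(name)) manifest.imports.push(name);
      return modules.get(name);
    },
  });
  return { serviceExports, manifest };
}

// Hypothetical module registry; `unused` belongs to some other service.
const modules = new Map([
  ['math', { square: (x) => x * x }],
  ['unused', {}],
]);

// Hypothetical service code, written against the injected globals.
const source = `
  pleaseLetMeRunAsALambda();
  const { square } = importModule('math');
  return { handler: (x) => square(x) + 1 };
`;

const { serviceExports, manifest } = buildService(source, modules);
```

After the build step, `manifest` says the service wants to run as a lambda and depends only on `math`, so a bundler could leave `unused` out of this service’s package.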

Pulumi

I haven’t done much research on this yet, but Pulumi seems to provide an API for defining infrastructure in JavaScript code. I speculate that the Pulumi API could be brought in as a low-level build-time API for services to define things like “I want to be a lambda when I grow up”.

Conclusion

I’m fairly confident that something like this would work, and I think it would make for much cleaner code for distributed applications. I also don’t think it would take that long to implement, given that many of the pieces are already available off-the-shelf. But even so, I probably shouldn’t get too distracted from Microvium.


  1. This kind of infinitely recursive hierarchical organization of the architecture is something I’ve seen missing from AWS and Azure