Author: Michael Hunter

Single-threading is more memory-efficient

Single-threading is more memory-efficient

TL;DR: Single-threading with super-loops or job queues may make more efficient use of a microcontroller’s memory over time, and Microvium’s closures make single-threading easier with callback-style async code.

Multi-threading

In my last post, I proposed the idea that we should think of the memory on a microcontroller not just as a space but as a space-time, with each memory allocation occupying a certain space for some duration of time:

I suggested that we should then measure the cost of an allocation in byte-seconds (the area of the above rectangles as bytes × seconds), so long as we assumed that allocations were each small and occurred randomly over time. Randomness like this is a natural byproduct of a multi-threaded environment, where at any moment you may coincidentally have multiple tasks doing work simultaneously and each taking up memory. In this kind of situation, tasks must be careful to use as little memory as possible because at any moment some other tasks may start up and want to share the memory space for their own work.

This leaves a random chance that at any moment you could run out of memory if too many things happen at once. You can guard against this by just leaving large margins — lots of free memory — but this is not a very efficient use of the space. There is a better way: single threading.

Single threading

Firmware is sometimes structured in a so-called super-loop design, where the main function has a single while(1) loop that services all the tasks in turn (e.g. calling a function corresponding to each task in the firmware). This structure can have a significant advantage for memory efficiency. In this way of doing things, each task essentially has access to all the free memory while it has its turn, as long as it cleans up before the next task, as depicted in the following diagram. (And there may still be some statically-allocated memory and “long-lived” memory that is dynamic but used beyond the “turn” of a task).

Overall, this is a much more organized use of memory and potentially more space-efficient.

In the multi-threaded model, if two memory-heavy tasks require memory around the same time, neither has to wait for the other to be finished — or to put it another way, malloc looks for available space but not available time for that space. On the other hand, in a super-loop model, those same tasks will each get a turn at different times. Each will have much more memory available to them during their turn while having much less impact on other tasks the rest of the time.

An animated diagram you may have seen before on my blog demonstrates the general philosophy here. A task remains idle until its turn, at which point it takes center stage and can use all the resources it likes, as long as it packs up and cleans up before the next task.

So, what counts as expensive in this new memory model?

It’s quite clear from the earlier diagram:

  • Memory that is only used for one turn is clearly very cheap. Tasks won’t be interrupted during their turn, so they have full access to all the free memory without impacting the rest of the system.
  • Statically-allocated memory is clearly the most expensive: it takes away from the available memory for all other tasks across all turns.
  • Long-lived dynamic allocations — or just allocations that live beyond a single turn — are back to the stochastic model we had with multi-threading. Their cost is the amount of space × the number of turns they occupy the space for. Because these are a bit unpredictable, they also have a cost in that they add to the overall risk of running out of memory, so these kinds of allocations should be kept as small and short as possible.

Microvium is designed this way

Microvium is built from the ground up on this philosophy — keeping the idle memory usage as small as possible so that other operations get a turn to use that memory afterward, but not worrying as much about short spikes in memory that last only a single turn.

  • The idle memory of a Microvium virtual machine is as low as 34 bytes1.
  • Microvium uses a compacting garbage collector — one that consolidates and defragments all the living allocations into a single contiguous block — and releases any unused space back to the host firmware. The GC itself uses quite a bit of memory2 but it does so only for a very short time and only synchronously.
  • The virtual call-stack and registers are deallocated when control returns from the VM back to the host firmware.
  • Arrays grow their capacity geometrically (they double in size each time) but a GC cycle truncates unused space in arrays when it compacts.

See here for some more details.

Better than super-loop: a Job Queue

The trouble with a super-loop architecture is that it services every single task in each cycle. It’s inefficient and doesn’t scale well as the number of tasks grows3. There’s a better approach — one that JavaScript programmers will be well familiar with: the job queue.

A job queue architecture in firmware is still pretty simple. Your main loop is just something like this:

while (1) {
  if (thereIsAJobInTheQueue) 
    doNextJob();
  else
    goToSleep();
}

When I write bare-metal firmware, often the first thing I do is to bring in a simple job queue like this. If you’re using an RTOS, you might implement it using RTOS queues, but I’ve personally found that the job-queue style of architecture often obviates the need for an RTOS at all.

As JavaScript programmers may also be familiar with, working in a cooperative single-threaded environment has other benefits. You don’t need to think about locking, mutexes, race conditions, and deadlocks. There is less unpredictable behavior and fewer heisenbugs. In a microcontroller environment especially, a single-threaded design also means you also save on the cost of having multiple dedicated call stacks being permanently allocated for different RTOS threads.

Advice for using job queues

JavaScript programmers have been working with a single-threaded job-queue-based environment for decades and are well familiar with the need to keep the jobs short. When running JS in a browser, long jobs means that the page becomes unresponsive, and the same is true in firmware: long jobs make the firmware unresponsive — unable to respond to I/O or service accumulated buffers, etc. In a firmware scenario, you may want to keep all jobs below 1ms or 10ms, depending on what kind of responsiveness you need4.

As a rule of thumb, to keep jobs short, they should almost never block or wait for I/O. For example, if a task needs to power-on an external modem chip, it should not block while the modem to boots up. It should probably schedule another job to handle the powered-on event later, allowing other jobs to run in the meantime.

But in a single-threaded environment, how do we implement long-running tasks without blocking the main thread? Do you need to create complicated state machines? JavaScript programmers will again recognize a solution…

Callback-based async

JavaScript programmers will be quite familiar the pattern of using continuation-passing-style (CPS) to implement long-running operations in a non-blocking way. The essence of CPS is that a long-running operation should accept a callback argument to be called when the operation completes.

The recent addition of closures (nested functions) as a feature in Microvium makes this so much easier. Here is a toy example one might use for sending data to a server in a multi-step process that continues across 3 separate turns of the job queue:

function sendToServer(url, data) {
  modem.powerOn(powerOnCallback);

  function powerOnCallback() {
    modem.connectTo(url, connectedCallback);
  }

  function connectedCallback() {
    modem.send(data);
  } 
}

Here, the data parameter is in scope for the inner connectedCallback function (closure) to access, and the garbage collector will automatically free both the closure and the data when they aren’t needed anymore. A closure like this is much more memory-efficient than having to allocate a whole RTOS thread, and much less complicated than manually fiddling with state machines and memory ownership yourself.

Microvium also supports arrow functions, so you could write this same example more succinctly like this:

function sendToServer(url, data) {
  modem.powerOn( () => 
    modem.connectTo(url, () => 
      modem.send(data)));
}

Each of these 3 stages — powerOn, connectTo and send — happen in a separate job in the queue. Between each job, the VM is idle — it does not consume any stack space5 and the heap is in a compacted state6.

If you’re interested in more detail about the mechanics of how modem.powerOn etc. might be implemented in a non-blocking way, take a look at this gist where I go through this example in more detail.

Conclusion

So, we’ve seen that multi-threading can be a little hazardous when it comes to dynamic memory management because memory usage is unpredictable, and this also leads to inefficiencies because you need to leave a wider margin of error to avoid randomly running out of memory.

We’ve also seen how single-threading can help to alleviate this problem by allowing each operation to consume resources while it has control, as long it cleans up before the next operation. The super-loop architecture is a simple way to achieve this but an event-driven job-queue architecture is more modular and efficient.

And lastly, we saw that the Microvium JavaScript engine for embedded devices is well suited to this kind of design, because its idle memory usage is particularly small and because it facilitates callback-style asynchronous programming. Writing code this way avoids the hassle and complexity of writing state machines in C, of manually keeping track of memory ownership across those states, and the pitfalls and overheads of multithreading.


  1. Or 22 bytes on a 16-bit platform 

  2. In the worst case, it doubles the size of heap while it’s collecting 

  3. A super-loop also makes it more challenging to know when to put the device to sleep since the main loop doesn’t necessarily know when there aren’t any tasks that need servicing right now, without some extra work. 

  4. There will still be some tasks that need to be real-time and can’t afford to wait even a few ms in a job queue to be serviced. I’ve personally found that interrupts are sufficient for handling this kind of real-time behavior, but your needs may vary. Mixing a job queue with some real-time RTOS threads may be a way to get the best of both worlds — if you need it. 

  5. Closures are stored on the virtual heap. 

  6. It’s in a compacted state if you run a GC collection cycle after each event, which you would do if you cared a lot about idle memory usage. 

Short-lived memory is cheaper

Short-lived memory is cheaper

TL;DR: RAM on a microcontroller should not just be thought of as space but as space-time: a task that occupies the same memory but for a longer time is more expensive.

MCU memory is expensive

If you’re reading this, I probably don’t need to tell you: RAM on a microcontroller is typically very constrained. A 3 GHz desktop computer might have 16 GB of RAM, while a 3 MHz MCU might have 16 kB of RAM — a thousand times less processing power but a million times smaller RAM. So in some sense, RAM on an MCU may be a thousand times more valuable than on a desktop machine. Regardless of the exact number, I’m sure we can agree that RAM is a very constrained resource on an MCU1. This makes it important to think about the cost of various features, especially in terms of their RAM usage.

Statically-allocated memory

It’s common especially in smaller C firmware to just pre-allocate different pieces of memory to different components of the firmware (rather than using malloc and free). For example, at the global level, we may declare a 256-byte buffer for receiving data on the serial port:

uint8_t rxBuffer[256];

If we have 1kB of RAM on a device for example, maybe there are 4 components that each own 256 B. Or more likely, some features are more memory-hogging than others, but you get the idea: in this model, each component in the code owns a piece of the RAM for all time.

Dividing up RAM over time

It’s of course a waste to have a 256-byte buffer allocated forever if it’s only used occasionally. The use of dynamic memory allocation (malloc and free) can help to resolve this by allowing components to share the same physical memory at different times, requesting it when needed, and releasing it when not needed.

This allows us to reconceptualize memory as a mosaic over time, with different pieces of the program occupying different blocks of space-time (occupying memory space for some time).

When visualizing memory like this, it’s easy to start to feel like the cost of a memory allocation is not just the amount of memory it locks, but also the time that it locks it for (e.g. the area of each rectangle in the above diagram). In this sense, if I allocate 256 bytes for 2 seconds then it costs 512 byte-seconds, which is an equivalent cost to allocating 512 bytes for 1 second or 128 bytes for 4 seconds.

Being a little bit more rigorous

Skip this section if you don’t care about the edge cases. I’m just trying to be more complete here.

This measure of memory cost is of course just one way of looking at it, and breaks down in edge cases. For example, on a 64kB device, a task2 that consumes 1B for 64k-seconds seems relatively noninvasive while a task that consumes 64kB for 1s is much more costly. So the analogy breaks down in cases where the size of the allocations are significant compared to the total memory size.

Another way the model breaks down is if many of the tasks need memory around the same time — e.g. if there is some burst of activity that requires collaboration between many different tasks. The typical implementation of malloc will just fail if there is not memory available right now, as opposed to perhaps blocking the thread until the requested memory becomes available, as if the memory was like a mutex to be acquired.

But the model is accurate if we make these assumptions:

  • The size of the individual allocations is small relative to the total memory size
  • Allocations happen randomly over time and space

Under these assumptions, the total memory usage becomes a stochastic random variable whose expected value is exactly:

The expected allocation size × the expected allocation size × the expected allocation frequency

We could also calculate the probability that the device runs out of memory at any given point (I won’t do the calculations here).

Conclusion

In situations where memory allocations can be approximated as being small and random, the duration of a memory allocation is just as important as its size. Much more care must be taken to optimize memory usage for permanent or long-lived operations.

I’m not very happy with this stochastic viewpoint and all these edge cases. It means that at any point, we could randomly exceed the amount of available memory and the program will just die. Is there a better way to organize memory space-time so we don’t need to worry as much? I believe there is… and I’ll cover that in the next post.


  1. Another thing that makes it a constrained resource is the lack of virtual memory and the ability to page memory in and out of physical RAM, so the RAM size is a hard limit 

  2. When talking about this space-time model, I think it’s easier to talk about a “task” than a “component”, where a task here is some activity that needs to be done by the program over a finite stretch of time, and will consume resources over that time. 

Microvium: updated memory model

Microvium: updated memory model

TL;DR Microvium can now address up to 64 kB of ROM, up from 32 kB previously, and now runs more efficiently on small 32-bit devices such as an ARM Cortex MCU.

This is a minor update regarding the data model in Microvium (previous post here), for the kind of person who’s interested in this kind of thing.

Recap: Microvium uses 16-bit slots

As covered in the previous post, Microvium uses a 16-bit slot size — variables in Microvium are 16 bits each. This is unusual among the small embedded JavaScript engines. Cesanta’s mJS and Elk use a 64-bit slot size with NaN-boxing, and Moddable’s XS uses a 128-bit slot size (4x32bit words). So, if you have a variable with the number 42 in it, it will take 2 bytes in Microvium, 8 bytes in mJS, and 16 bytes in XS (on the other hand, the number 42.5 is a float and will take 12 bytes in Microvium since it overflows the slot into the heap, but it will still take 8 bytes in mJS and 16 bytes in XS).

Pointers and Paged Memory

If the lowest bit in the slot is 0, Microvium treats the value as a pointer to heap memory.

Since heap memory in Microvium is always 2-byte aligned, the lowest bit of a pointer will always be 0, so the value in the slot exactly corresponds to the pointer value, at least on 16-bit systems (or on 8-bit systems with a 16-bit address bus).

But what about 32-bit systems? Microvium is optimized for devices with 64 kB or less of RAM, but many devices with 64 kB of RAM are actually 32-bit devices. In these devices, there is usually a 32-bit address space, and some sub-range of these addresses will map to physical RAM (and some other range of addresses will map to physical ROM). Even though only 16 bits of information are required to index every byte of RAM, pointers are still 32 bits on these devices.

Previously, Microvium worked fine on 32-bit and 64-bit devices (I do all the testing on my 64-bit PC) but it did so through an expensive mapping table that mapped 16-bit VM addresses to their 32-bit or 64-bit native counterparts (like virtual memory implemented in software). The mapping itself was still O(1) in many cases1, but it involved a function call every time a pointer needed to be mapped, which is a massive overhead to incur. I wasn’t too worried about this because I wasn’t aiming for 32-bit devices as my initial audience, but with the number of 32-bit devices out there, this would quickly become a problem.

To support this kind of scenario more elegantly, Microvium now has a port macro definition that allows you to specify the upper 16-bits of a 32-bit pointer (or upper 48-bits of a 64-bit pointer) for the platform2.

For a more concrete example, the Arduino Nano 33 IOT has up to 32kB of SRAM, starting at the address 0x20000000. So the upper 16-bits of a real pointer into RAM will always be 0x2000. Here is a snippet of the data sheet that shows this:

So, in the Microvium port file, you can now specify the high bits of a 32-bit pointer to be 0x2000, and Microvium will interpret all pointers by simply indexing into the given memory page.

In compiled ARM machine code, the conversion from a 16-bit slot value to 32-bit pointer is just one or two instructions3! This is a significant performance improvement over how it worked before, and makes pointer access in Microvium almost as efficient as native pointer access.

I didn’t actually make this change for performance reasons. I did it because it makes the development of the Microvium engine much easier.

  • On my Windows machine where I develop, I can now use VirtualAlloc to pre-allocate a single 64 kB “page” of memory where the high bits of a pointer are always 0x5555, and run Microvium in just this region of memory. So if I see the Microvium value 0x002A, I know instantly that corresponds to the address 0x5555002A.
  • The addresses are consistent across runs, so when I note down an address in my notebook while debugging, I know it will be the same if I restart the program.
  • I can also have a memory view open in the debugger and it remains consistent across runs and shows all the VM memory in one place.

64 kB ROM

Previously, if the lower 2 bits of a slot were 01b then the value was considered a pointer into ROM after shifting right by 1 bit, giving us a 15-bit address space for ROM, and requiring ROM to be 2-bit aligned to keep the remaining 0 bit.

Now, the slot value is considered a pointer into ROM after zeroing the bottom 2 bits. This doesn’t change the performance, but it means that we can now address up to 64 kB of ROM.

A side effect is that ROM must now be 4-byte aligned since the lower 2 bits of ROM pointers must be zero. This means extra padding sometimes, but I’ve found that the ROM overhead doesn’t grow substantially with this change.

Why did I do this?

  1. Debugging. Previously, if I saw the slot value 0x2A1 while stepping through the engine, I have to bring out a calculator to see that 0x2A1 corresponds to the bytecode address 0x150. Now, the value 0x2A1 corresponds to the bytecode address 0x2A0, which is much easier to follow.
  2. I was irked by the fact that I couldn’t just say that “Microvium supports up to 64kB of memory”. I previously had to qualify it every time — “it supports 64kB of RAM, and 64kB of snapshot image, but only up to 32kB of ROM”. Now I can just say “supports up to 64kB of memory” with no added asterisks or qualifications, since it supports 64 kB of each type of memory it uses. This is part of simplifying the mental model of Microvium, since Microvium is all about simplicity.


  1. The performance was actually O(n) where n is the number of memory blocks, but since Microvium uses a compacting garbage collector, memory is consolidated into a single block periodically, making it O(1) for most pointers most of the time. 

  2. More accurately, you can specify any native address as the origin of the VM address space, but I found it easier to explain this in terms of the high bits. 

  3. In the full ARM instruction set, the conversion is a single 32-bit instruction. In the ARM Thumb instruction set, it takes two 16-bit instructions. The specific number 0x20000000 is relevant here because it’s a power of 2 which is more efficient. 

Microvium Closure Variable Indexing

Microvium Closure Variable Indexing

TL;DR: In many situations, a program in Microvium bytecode can access closure variables with just a single-byte bytecode instruction. The instruction contains a 4-bit index that either indexes into the current lexical scope or recursively overflows to the next-outer lexical scope, cascading up the lexical scope chain until the variable is found. In this post, I discuss some of the journey and the details of how this works.


What is a closure variable?

A closure is a function that accesses variables outside its local lexical environment. See the MDN article on this for a better explanation.

What I mean by a “closure variable” in this post is a local variable that is accessed by a closure1. For example, in the following code, callback is a closure because it accesses x which is outside its own local variables, and correspondingly x is a “closure variable” by this definition because it’s accessed by callback:

function foo() {
  let x = 10;
  setTimeout(callback, 1000);
  function callback() { 
    console.log(x);
  }
}

In Microvium (and other JavaScript engines), closure variables are treated differently from normal local variables because they need to outlive the stack frame in which they are declared. The variable x here needs to survive beyond the return of foo. This is done by allocating the slots for these variables on the heap instead of the stack.

If a variable is demoted to the heap, all access to that variable is via the heap, even if it’s being accessed locally. For example:

function foo() {
  let x = 10;
  console.log(x);  // <--- this is also accessing `x` on the heap
  setTimeout(callback, 1000);
  function callback() { 
    console.log(x);
  }
}

There is a static analysis pass in Microvium that decides whether a variable is a closure variable or not. In principle, it’s safe to allocate all variables on the heap rather than doing this analysis, but the heap is expensive, so as an optimization some variables can be promoted to the stack if they are not used by closures.

For the sake of the rest of this blog post, I will pretend that all variables are closure variables, even if I do not show the closure that accesses them. Or equivalently, I will pretend that there is no optimization analysis to determine which variables can be promoted to the stack.

Background: how it might have worked

Before I get to describing how it does work today, let me describe how I previously implemented it before having my eureka moment. You can skip this section if you want to get straight to the answer instead of going through the journey as I did.

The state of the virtual machine needs to keep track of the current lexical scope. So let’s add a new machine register called scope which points to the current lexical scope.

A goal with Microvium is to keep the engine implementation small. So rather than introducing a new allocation type for environment records and new bytecode instructions to access them, maybe we can reuse an existing type and existing instruction.

A natural solution that might come to mind is to use an Object, where the property keys of the object correspond to the variable names. Microvium has existing bytecode instructions ObjectGet and ObjectSet to get and set properties on an object.

However, Objects are very expensive at runtime. Each property is stored in memory as a key-value pair taking 4 bytes, and property lookup is a linear-time search through the properties to find the one with the right key.

Since we can statically determine the number of variables, a better choice of container for our variables would be a fixed-length array, where we statically compute an index for each variable rather than using its name. In Microvium, a fixed-length array is quite efficient, having constant-time random access to any slot by its index and each slot only consumes 2 bytes.

So let’s think about compiling the following JavaScript to IL:

let x, y , z;
x = 10;
y = 20;
z = 30;

We will have a fixed-length array with 3 slots for these 3 variables, and our new scope register will point to this fixed-length array to say that this is the current scope. Whenever we need to access one of these variables, we will read and write to the array.

I’m showing the allocation header here for completeness. These arrays are heap-allocated, so they each require this implicit memory slot for information about the size and type of the allocation.

So, the following is the IL sequence that might be produced for the above JavaScript:

// let x, y, z;
ArrayNew(3)         // Allocate 3 slots on the heap
StoreReg('scope')   // Save the array in the "scope" register

// x = 10;
LoadReg('scope')    // Fetch the current scope
Literal(10)
ArraySet(0)         // Set the first slot in the array

// x = 20;
LoadReg('scope')    // Fetch the current scope
Literal(20)
ArraySet(1)         // Set the second slot in the array

// x = 30;
LoadReg('scope')    // Fetch the current scope
Literal(30)
ArraySet(2)         // Set the third slot in the array

Side note: Microvium IL is based on a stack machine (see Wikipedia). The instruction Literal(10) pushes the value 10 to the top of the stack. The instruction ArraySet(0) pops the literal value off the stack, and pops the array reference off the stack (which was previously pushed by LoadReg('scope')) and then assigns the 0th slot in the array.

Nested scopes

What if there are nested lexical scopes, as in the following JavaScript code:

function foo() {
  let x;
  function bar() {
    let y;
    function baz() {
      let z;
      x = 10;
      y = 20;
      z = 30;
    }    
  }
}

In the above code, z is accessed in the same scope in which it is declared, as before. But x and y are accessed from parent (outer) scopes relative to the expression that accesses it. So we need a way for the IL to access parent scopes.

Remember that the inner scopes can be instantiated multiple times for each single instantiation of an outer scope. For example, if bar() is called twice within foo, then there will be 2 instances of variable y for every one instance of variable x. So we can’t just put x, y, and z in the same array. We need each lexical scope to be in its own array, and we need a way to reference an outer scope from an inner scope.

This might seem like we need to abandon the fixed-length array as the underlying storage for these variables, since these don’t naturally form chains. But if we think about it a moment, we could reserve one of the slots in the fixed-length array as a pointer to the parent fixed-length array. Let’s mentally reserve the first slot in each array as the pointer to its parent scope.

In the above JS, we have 3 distinct lexical scopes, and so we may land up with a scope chain as follows:

Now, the IL to read variable x may look as follows:

LoadReg('scope')  // Get the current scope (the one containing z)
ArrayGet(0)       // Read the parent scope (the one containing y)
ArrayGet(0)       // Read the parent scope (the one containing x)
ArrayGet(1)       // Read variable x

Each of these instructions encodes to 1 byte, so this is a 4-byte sequence in total.

I thought this was a pretty good solution. It didn’t add any extra instructions to the IL instruction set, and so didn’t make the engine any bigger. An important goal in Microvium is to keep the engine small.

The final solution

In the end, I decided that closures were too important to have it cost so many instructions to read and write to them. Rather than emitting multiple IL instructions just to read or write a single variable, it seemed quite logical to bake this behavior into the engine itself, and add two additional instructions to the IL instruction set: LoadScoped and StoreScoped. I wanted to keep these instructions both compact and efficient, and that’s where the design challenge is interesting.

Consider the following JavaScript with a few more variables to make the pattern clear:

function foo() {
  let a;
  let b;
  function bar() {
    let c;
    let d;
    function baz() {
      let e;
      let f;
      a = 10;
      b = 20;
      c = 30;
      d = 40;
      e = 50;
      f = 60;
    }
  }
}

The assignment statements in this example, such as a = 10, are accessing the variables in one of 3 different lexical scopes. How can we design the instruction format for LoadScoped (and StoreScoped) so that the bytecode can specify which scope is being accessed as well as which variable in that scope?

The eureka moment for me was when I realized that I could use a single mapped index that specifies both the scope and the variable within that scope. I’ll represent this mapping in the following diagram, with the index on the left and the variable it maps to on the right (the allocation headers are omitted here for clarity).

The scopes in this design form a waterfall. Small indexes are accessing the inner-most lexical scope. Larger indexes “overflow” into the next outer closure scope, repeating until the variable is found.

So, the IL LoadScope(1) would load variable e, and LoadScope(8) would load variable b, for example. We can determine the indexes through static analysis — numbering the variables in the closest scope first and then the ones in the next closest, etc2.

Implementation in C

My original concern with adding new bytecode instructions was that it would make the engine much bigger and more complicated. But this design can be implemented efficiently in the engine, adding only a little bit more complexity. See the following C code for finding the variable with index index3:

uint16_t* arr = registers->scope;
do {
  // The length of the array is in its header word
  uint16_t len = arr[-1] & 0xFFF;

  // Is the variable in this scope?
  if (index < len) return arr[index];

  // Otherwise, cascade/overflow to the outer scope
  arr = arr[0];
  index -= len;
} while (1);

Actual Instruction format

For the curious, the actual bytecode instruction format for LoadScoped in its simplest form is just 8 bits, where the upper nibble is the opcode (“LoadScoped”) and the lower nibble is the 4-bit index, allowing up to 15 closure variables to be addressed.

StoreScoped is the same, but with a different opcode.

For the even-more-curious, if your code has more variables than will fit in the 4-bit index, there are a16-bit and 24-bit4 instruction variants as well, as per the following snippet of the Microvium technical documentation:

Conclusion

I’m very satisfied with the final design so far. It brings closures up as first-class citizens in Microvium by making closure variable access almost as space-efficient as local variables, in terms of both bytecode space and memory usage, while also having only a small CPU cost. I think this is important because closures are very common in real-world JavaScript code, especially when programming with a functional style, and also because closures may in future form the foundation for the implementation of other features such as generators and async-await.

This has been a good reminder to me that it’s worth thinking deeply about designs and considering different options, rather than jumping straight in and implementing whatever comes to mind first. In this case, the final design was both simpler to implement and more efficient.

There’s a lot I haven’t talked about in this post, such as how the array is created in the first place and how the scope register changes as control moves between different lexical scopes. But that can be for another time.


  1. If you have a better name for a “closure variable”, please let me know 

  2. Note that one complexity with this design is that the index for a single variable is different depending on which scope the code is accessing it from. The index is not an absolute property of the variable itself. 

  3. This is not the actual code. The actual code needs to deal with error checking and multiple address spaces, for example. 

  4. Why would anyone ever need to address more than 255 closure-scoped variables? I heard a story that the C# compiler assumed that nobody would ever need more than 65536 variables in a single function, but that assumption was violated by a code generator that was generating massive functions. So I’m cautious about saying “nobody will ever need more than 255 variables” when it’s only 2 lines of code to support it. Maybe I’ll change my mind in the future. 

Incidental vs Accidental Complexity

Incidental vs Accidental Complexity

Developers often talk about accidental vs essential software complexity. But is there a difference between accidental complexity and incidental complexity? I think there is.

I like this definition of essential complexity:

Essential complexity is how hard something is to do, regardless of how experienced you are, what tools you use or what new and flashy architecture pattern you used to solve the problem.

(source)

That same source describes accidental complexity as any complexity that isn’t essential complexity, and this may indeed be the way that the terms are typically used. But I’d like to propose a distinction between accidental and incidental complexity:

Accidental complexity: non-essential complexity you introduce into the design by mistake (by accident), and you can realize your mistake in an “oops” moment (and consider correcting or learning from it).

Incidental complexity: any non-essential complexity you introduce into the design, which may be by mistake or on purpose (it may or may not be accidental and may or may not be accompanied by an “oops” moment).

Accidental complexity is when I accidentally knocked a drinking glass off the table and it shattered into pieces, making my morning more complicated as I cleaned it up. It’s marked with “oops” and the realization of causing something I didn’t mean to.

Incidental complexity is when I choose to turn off the hotplate on my stove by pressing the digital “down” button 8 times until the level reaches zero, when my wife later informed me that pressing “up” once would achieve the same thing (since the level wraps around under certain circumstances). I did not accidentally press the “down” button 8 times, I did it intentionally, but was still ignorant of my inefficiency.

Clearly, there’s a fine line between the two. Defining something as incidental complexity but not accidental is a function of my ignorance and depends on my level of awareness, which changes over time and various from person to person.

Why introduce unnecessary complexity on purpose?

1. Simplicity takes effort

Paradoxically, often the simpler solution is also one that takes longer to design and implement, so it can be a strategic design choice to go with a more complicated solution than a simpler one.

This is especially true for POCs and MVPs in unfamiliar domains or where the business requirements aren’t completely clear. In these cases, you can’t be confident of all the details upfront and you’re quite likely to be rewriting the solution a few times before it reaches its final state. Spending time simplifying and organizing each iteration can be a wasted effort if it will just be thrown away.

There may also just not be time for the simpler solution. The added effort defining a clean abstraction and structure takes time and there is an opportunity cost to taking longer on something, maybe through losing business revenue when the feature is delivered later, or through the cost of not working on the next feature. There are tradeoffs to be made.

2. Ignorance: You don’t know any better

This might be a little controversial, but I would argue that if you design something as simple as you can possibly conceive, then any remaining incidental complexity is not “accidental”. Like the example earlier of me pressing “down” 8 times instead of “up” once.

As a concrete example in software, JavaScript never used to have a async/await or the Promise type, and as a result, code was often structured with nested callbacks to handle sequences of asynchronous actions (known as “callback hell”). This kind of code is clearly unnecessarily complicated, no matter how hard people tried to contain the complexity. But designs based on callbacks are not accidentally complicated — they have exactly the design intended by the author which is believed to be the best possible design.

But yet, promises could be (and were eventually) implemented as a third-party library, and so could the async/await pattern (through the abuse of generator functions). So the potential was there from the beginning for every developer to structure their code in a simpler way if they could just realize the potential to do so. The complexity was not imposed as a restriction in the programming language, it was imposed by ourselves because we hadn’t been awakened to a better way of seeing things.

It’s worth noting one more thing: if you’ve ever tried to explain JavaScript Promises to someone, you’ll know that an explanation of what a Promise is or does is unconvincing. Understanding how to think in terms of Promises requires a mental paradigm shift, which doesn’t occur merely by explanation. Even a concrete demonstration may be unconvincing since the code may be “shorter”, but is it really “simpler” when it’s full of all these new and unfamiliar concepts? The way to internalize a new paradigm like this is to use it and practice thinking in terms of it. Only then can we really see what we were missing all along.

Why does it matter?

I think the distinction between accidental and incidental complexity matters because it should keep us humble and aware that the way we write software today is probably ridiculously inefficient compared to the techniques, patterns, and paradigms of the future.

It shows that certain kinds of incidental complexity can look a lot like essential complexity if we don’t yet have the mental tools to see those inefficiencies. The biggest productivity boosts we could gain in our designs might be those that aren’t obvious from the perspective of our current ways of thinking, and which might take a lot of time and effort to see clearly.

It shows that finding incidental complexity is not just a case of looking at a project and identifying what we think is needless, since there’s a good chance that we’re blind to a lot of it and all we will just see are things that make us say “yes, that thing is important for reason xyz — we can’t remove that”. Like all the callbacks in callback hell: every callback seems important, so it seems like it is all essential complexity, even though there exists a completely different way of structuring the code that would make it easier to understand and maintain.

But on the positive side, it shows that we are capable of identifying a lot of incidental complexity, if we spend enough time and effort thinking through things carefully, and if we remain open and curious about better ways of doing things.

Immutable.js vs Immer

Immutable.js vs Immer

This week I came across the library Immer as a convenient way of manipulating immutable data in JavaScript. After reading this Reddit thread that raves about how much better Immer is than Immutable.Js, I was worried I’d made the wrong decision to use Immutable.js in Microvium. But some performance tests quickly cleared up my concerns.

Immer certainly seems convenient to use. It uses a combination of proxies and copy-on-write to allow you to use the normal mutating syntax in JavaScript and it handles the underlying immutability automatically.

But “copy-on-write” raised a red flag in my mind. When dealing with small changes to large collections, as Microvium does, I can’t possibly imagine how copying the whole collection on every write can be efficient. And indeed, a quick performance test shows how bad the situation is:

import { produce } from "immer";
import immutable from 'immutable';

const count = 100;

// Test Immutable.JS
let map1 = immutable.Map();
for (let i = 0; i < count; i++)
  map1 = map1.set(i, i);

// Test Immer
let map2 = new Map();
for (let i = 0; i < count; i++)
  map2 = produce(map2, m => m.set(i, i));

For different values of count, here are my results:

100100010k100k1M10M
Immer11 ms94 ms8.8 sec17 min 47 sec??????
Immutable.JS3 ms9 ms47 ms257 ms2.4 s41 s
Builtin Map37 µs180 µs2 ms18 ms170 ms3.4 s

Insertion into the map using Immer is unsurprisingly O(n) relative to the size of the map, making the whole test O(n²). Even though I had a hunch that this was going to be the case, it was worth checking that Immer wasn’t doing some other clever tricks under the hood to improve the performance.

I’ve never actually tested the performance of Immutable.JS before, so I’m quite pleased to see how well it scales. It seems only slightly worse than O(1) insertion time.

Both libraries are quite a bit slower than using the built-in Map class, so my conclusions are as follows:

  • Use Immer if readability and ease-of-use is more important than performance, which is often the case. But be aware that it’s not a simple process to switch to Immutable.JS later because of how different the API is.
  • Use Immutable.JS if you need high-performance persistent collections that scale well to large sizes, if it’s worth the decrease in readability, type safety, and increased verbosity of the code.
  • Use the mutable built-in collections for performance-critical code that doesn’t strictly need immutability for the algorithm.

TC39 is looking to add built-in immutable collections to the JavaScript standard (see the records and tuples proposal). I’m excited to see how those perform.

Distributed apps in JavaScript

Distributed apps in JavaScript

This post is different from my usual topic of Microvium. I’ve been recently frustrated with the way that distributed applications are written and I’ve been brainstorming ways it could be improved. In my last post on the topic, I suggested that maybe a new programming language would be useful for solving the problem, but I ended the post thinking about how the Microvium snapshotting paradigm might also be a suitable solution. This time, I’m going to consider a solution that doesn’t require Microvium.

A quick summary of the objectives

I want the experience of developing a distributed application to be basically as seamless as that of developing a single-process desktop application, such as the ability to get type checking between services, to debug-step directly from one service to another during development, to write regression tests that encompass multiple services, etc. And I want this all with minimal boilerplate.

My last post on the topic goes into more detail about the issues I have with the current state of affairs and the improvements that could be made.

A proposed solution

Here’s the summary: I think the Microvium snapshotting paradigm is indeed the answer, and I think we can in a sense “polyfill” the snapshotting ability in a limited context by deterministically replaying IO.

I’ll talk about this solution a few parts:

  1. What is snapshotting and how could we use it to solve the problem?
  2. How can we implement snapshotting on a modern JavaScript engine?
  3. Is it necessary to do this JavaScript?
  4. More details

What is snapshotting and how does it help?

I’m using the term “snapshotting” here to mean the ability to “hibernate” a process to a file and then resume it later on the same or a different machine. This means capturing to a file the full state of the process, such as the global variables, the stack and heap state, and machine registers. And the ability to restore the full state in a completely new context.

Normally applications are deployed as either a compiled executable or as source code depending on whether the programming language is compiled or interpreted. Having the snapshotting capability gives us a third option: run the application code at build-time and deploy a snapshot of its final initialized state.

Doing it the latter way allows application code to have pre-runtime effects, such as defining infrastructure, and for pre-runtime code to pass information seamlessly to runtime code, such as secrets and connection strings for the aforementioned infrastructure.

When I talk about coding in JavaScript, I’m really talking about coding in TypeScript, and so another you get with this paradigm is type-checking across different infrastructural components and type checking between infrastructure and runtime code. There are other solutions like Pulumi that allow you to write IaC code in the same language as your application code, but to my knowledge, there are none that allow you to pass state directly from IaC code to runtime code.

How to implement snapshotting?

Microvium is a JavaScript engine that implements snapshotting at the engine level, but Microvium doesn’t yet implement much of the JavaScript spec and its virtual machines are size-limited to 64kB. It’s in no way practical for writing a distributed cloud application.

Node on the other hand is used for cloud software all the time, and I’m a huge fan of it, but it doesn’t support snapshotting in the way described.

But I think we can emulate the snapshotting feature on Node by having a membrane around the application that records all IO while it runs at build-time, and silently replays the IO history as an initialization step at runtime to recreate the final, initialized state. Although this will be slower than just recovering a direct dump of the memory state (if the engine had supported that), in theory, it should work.

This will only be reliable if JavaScript is deterministic (that its state and behavior deterministically follow from its incoming IO), which it almost is already. Work by Agoric on SES and Compartments makes this even more so, by allowing the creation of a sandboxed environment for JavaScript modules, in which non-deterministic JavaScript library functions can be removed by default (e.g. Math.random and new Date), or replaced with deterministic implementations. They do this because they run JavaScript on the blockchain, where a collection of redundant nodes needs to process the same script and reach the exact same conclusion about the new state of the blockchain — a strong determinism requirement.

Does it need to be JavaScript?

Honestly, I can’t think of a way to do this outside of JavaScript. JavaScript has a number of key features that I think are really useful for this solution:

  • JavaScript is already almost completely deterministic. There are only a handful of non-deterministic operations like Math.random, and JavaScript’s flexibility allows us to easily remove these or replace them with deterministic proxies. Contrast this with C#, where it’s impossible to prevent some block of code from performing non-deterministic IO or having non-deterministic behavior by using threads.
  • JavaScript allows us to create perfect proxies and membranes. You can create proxies for classes and not just instances of the class, and you can create proxies for modules. In C#, there is no way to put a membrane around an assembly, namespace, or class.
  • Related to creating perfect membranes, JavaScript allows you to hook into all IO, including the importing a dependent module. This would allow the framework to trace the depenency tree of each individual service and create the appropriate bundles for each.

I think it would be possible to get some approximations of the idea in other programming languages, but it certainly appears to me that JavaScript is particularly well suited.

More Details

Most of the above has been quite abstract. I think it would be worth explaining the vision in a little more concrete detail.

I imagine there to be a thing, which I might call a “service”, which is the minimal distribution unit (we’ll say a service is “a thing that can be deployed”). A service might be a single microservice/lambda, or a database, or a whole distributed application made up of nested services.

A typically SaaS company using this solution would probably just have one root “service” that represents their whole distributed application, and that service would be composed of subservices1, which could each have their own subservices, etc.

A service is a JavaScript module (file), which runs at build time on the build machine (or dev machine), and then with snapshotting is moved to the runtime machine(s) when it’s finished its initialization (when all the top-level code has run, including its transitive function calls, etc).

While executing at build time, the service code is then able to configure its own runtime “hardware” that it will eventually be moved to. For example, it might call a host function to declare “I’d like to run as a lambda with automatic scaling”, etc. The host can record this information and provision the necessary resources before moving the service to its own desired runtime environment.

A special case of this general rule is that a service can choose a runtime manifestation that doesn’t support code execution at all. For example, a database, queue, or pub-sub system.

A service can instantiate other services at build time. Concretely, this might look something like:

var myService = importService('./my-service-code.js');

I expect this to be roughly analogous to just importing another module into the current service, similar to using node’s require() function. It will import and execute the given JS module (in the same machine process), and return an object representing the script’s exports. But with the distinction that new module is to be loaded inside a membrane, and the return value is therefore a proxy for the actual exports inside the membrane.

The membrane serves a few different purposes:

  • When the running services are “moved” to their runtime environments, services in different membranes will be on different physical runtime hardware. Service code may at build time configure its future runtime hardware.
  • At build time, the membrane records all IO exchanges such that it can deterministically replay the IO at runtime during initialization, to restore the service to its exact “snapshotted” state, as mentioned earlier. It can verify and silently absorb outgoing IO, and deterministically replay incoming IO.
  • At build time, services are allowed to pass around references to other services, as a kind of “dependency injection” phase. At runtime, the membranes around each service need to handle the marshalling of calls between services as they use these injected dependencies.

The importService function can construct a new Compartment which allows it virtualize the environment in which the service code runs. This can have a variety of effects:

  • We can provide a replayable variation of non-deterministic builtins, such as Math.random and new Date. These can be treated as a special case of IO. They can return real dates and random numbers at build time, as long as they return the same dates and random numbers when replayed at runtime.
  • We can provide service-specific APIs. For example, a function akin to pleaseLetMeRunAsALambda() could be injected into the globals for the service, or we could expose additional builtin-modules to the service code.
  • We can intercept module imports so that we can bundle the service with its module dependencies into a single package for deployment, without the overhead of bringing in dependencies used by other services.

Pulumi

I haven’t done much research on this yet, but Pulumi seems to provide an API for defining infrastructure in JavaScript code. I speculate that the Pulumi API could be brought in as a low-level build-time API for services to define things like “I want to be a lambda when I grow up”.

Conclusion

I’m fairly confident that something like this would work, and I think it would make for much cleaner code for distributed applications. I also don’t think it would take that long to implement, given that many of the pieces already available off-the-shelf. But even so, I probably shouldn’t get too distracted from Microvium.


  1. This kind of infinitely recursive hierarchical organization of the architecture is something I’ve seen missing from AWS and Azure 

Quality vs Quantity

Quality vs Quantity

It’s an age-old debate, but as I get older, my perspective is changing.

I feel increasingly frustrated by tools that don’t work smoothly; tools that weren’t thought out very well. And I ask myself, wouldn’t my life be better without these?

TL;DR The economic incentive for product development is not always aligned with what makes the world a better place. Tools that make people more productive do not necessarily make them happier. As creators, we should first-and-foremost create things that are pleasant to use, even when it’s not economically optimal for us or the companies we work for — economics should serve humankind, not be our master.


I’d be remiss creating such a post without whining about some concrete examples:

My Washing Machine

My washing machine1 has a “quick wash” setting that can do a load of washing in about 30min — great for those in a rush. And it allows you to further customize the quick wash with things like temperature and spin speed.

But for some unknown reason, it doesn’t allow you to select the maximum spin speed while in the “quick wash” setting!

If you want your clothes well-spun, you need to select another setting, the quickest of which is “whites”, and takes three-and-half-hours!

There’s a work-around, which I use most times — you run the machine on a quick-wash cycle (30 min), and then afterward change the setting to “whites” and then select the sub-setting “spin only”.

There are many other confusing and frustrating choices made for the interface and control of this machine, but it would take a whole post to go into them all!

How many extra man-hours would it have taken that company to properly think through the HMI2 design of the washing machine? How much extra would that have added to the price of each machine?

And here’s a deeper question: does the existence of this washing machine design make the world a better or worse place? If we could magically remove all washing machines with this design from history, would we be better or worse off?

What does it even mean to be better or worse off? This depends heavily on your world view. If you are a hedonist, then happiness/pleasure is the ultimate goal, and you can mentally compare the global happiness with or without this washing machine design in it, and make a value judgement about whether the washing machine is a net win or loss.

The difference between the two hypothetical worlds is subtle and complicated. The annoyance I feel when I do the washing is brief and small, but it’s multiplied by the thousands of people who have a similarly-designed machine.

And if we magically removed the design from the world timeline, we’d also be removing the satisfaction I felt when I discovered the work-around, and the catharsis of complaining about it in this blog post, or the social bonding it provides between people with the same issue who can enthusiastically affirm each other’s miserable experiences of such a damned machine.

But even with all these “benefits” of bad design, I honestly believe that we’d all be better off, in at least a small way, if this design didn’t exist.

So, what does this mean? Should humankind band together to create the perfect washing machine — a design which can be used for hundreds of years into the future, making people all of the world happy or even excited to do their washing?

In principle, I’d say “yes”. If we as humankind had collectively spent our efforts creating just one “great” washing machine design instead of a mix of good and bad designs across many companies, we’d probably all be better off overall.

The reality, of course, is more complicated. Different people’s ideas of the “perfect washing machine” are different. Technology makes this a moving target — the best washing machine 1000 years from now will be much better than the best we can do now, but we don’t expect a washing machine company to internally do a thousand years of technological research before creating their first product!

Adobe After Effects

Adobe After Effects is software for creating and editing video and animation. I learnt how to use After Effects this year so I could create little animated diagrams like that in my snapshot-vs-bundler post.

I find After Effects to be a really interesting example when it comes to armchair-philosophy about product quality. After Effects is, in my opinion, incredibly poorly thought out and frustrating to use. But it is also the only software that can do everything that it does, and using it makes you 100x more productive than if you were to try to do the same animations/effects manually.

I won’t go into a lot of specifics of what I think is bad design in After Effects, but I can’t help but mention one example: you can’t copy-paste something from one project to another (in fact, you can’t even open two projects at the same time). If you designed something great in one project and want to use it in the next (reuse, DRY), it’s a complicated and time-consuming process to export parts of one project into a temporary project and then import the temporary project as a nested child of the second project, from which you can now copy and paste.

You’re welcome to disagree with me and think that it’s a great design. The thought-experiment remains valid: in cases where a tool achieves something that was previously unachievable, or boosts their user’s productivity by some factor, is it better for the tool to have more capability or to be more pleasant to use?

The answer is not completely obvious, and it really got me thinking. My mind is in the weird paradox of simultaneously thinking that my life is worse off with After Effects while also I choose to use it because nothing else is available to do the job.

I wonder how many other people might also be in a similar position, perhaps frustrated by tools they use, but pushed to use them because the tools make their competitors more productive. It’s a prisoner’s dilemma situation (wiki).

50 years ago, nobody was complaining about the lack of After Effects software — they were fine with its absence — but today, we complain about its flaws. We’re “forced” to use it, either because our competitors in the space are using it or, in my case, because the mere knowledge of its existence makes work without it feel much slower, which itself is frustrating.

I honestly think the world would be a better place if After Effects didn’t exist and if knowledge of its absence also didn’t exist, nor a craving for something like it to exist. Or even better, the world would be a better place if After Effects was created with more design thought, making common workflows easy and pleasant, even at the expense of its variety of features.

Lessons

As a software engineer, I’m one of the lucky few who are actually contributing to the development and design of these kinds of tools. So what kind of lessons can I learn from some of my frustrations with other tools, like the washing machine or After Effects?

For me personally, the lesson is that quality is far more important than quantity. I think that humankind will be better off with fewer tools and less productivity, in exchange for tools that are more pleasant to work with. It’s better if technological development is slower, maybe 20 years behind where it would otherwise be, but the technologies developed are robust, dependable, and a pleasure to use.

In some ways, this goes against modern business practice. In our capitalistic world, there’s a temporary advantage gained for using a tool that makes the business more productive, even if it makes its employees less happy — the benefit is negated when the competitors do the same thing, removing the competitive advantage of the productivity boost and driving down overall prices of the product (but leaving employees still unhappy with the status quo).

Likewise, on the other side, it’s advantageous for a software company to sell tools that deliver the maximum economic value to their users, not necessarily maximum pleasure and peace. Remember that automating the repetitive parts of a user’s job does not necessarily make their life better, it just changes the nature of their work (now their job is about running the automation software). If a new piece of software allows users to complete their work in half the time, they don’t take the other half of their time to sit in the park and watch the birds play in the fountain!

If you create a tool that delivers more productivity in a way that’s more frustrating to use than the less-productive way, the machine of capitalism forces people to use your more-productive tool (which may make you rich in the process!) at the expense of increasing net frustration in the world.

Microvium

(Non-technical audiences may want to skip this section)

Microvium has parallels with After Effects, in that it’s a new tool that achieves something that was previously unachievable (it brings JavaScript to devices with only 16kB of flash). It boosts productivity significantly over alternatives such as EmbedVM.

So would it be ok if Microvium brought these benefits but was frustrating to use?

I think not. I’ve spent considerable effort in making Microvium as easy and pleasant to use as possible. For example:

  • I take great care in the full experience of integrating and using Microvium.
    • The one-liner install for node hosts, or the two-file copy-and-paste for C hosts (one h file for the interface, and one c file for the library).
    • The getting-started guide demonstrates this to the extreme, leading the user incrementally through all the concepts, building up to a full native host example.
    • The getting-started guide is even part of the regression tests — the automated test infrastructure literally extracts the example code from the getting-started markdown file and make sure it all still works. There’s little more frustrating than an outdated or incorrect introduction to a new tool.
  • I spent a great deal of effort making the FFI3 design super simple.
    • I hope that the way I designed it seems “obvious” and “intuitive” to people looking at it in hindsight — this is the way that designs should be. When you read through the getting-started guide, you probably don’t think at all about how good or bad the FFI system is. It’s hopefully invisible to the user, in the same way that a good washing machine should be invisible. It should just do its job, in the least-frustrating way possible.
    • Compare the FFI design of Microvium with that of Node.js for example, where one requires complicated binding files, additional configuration files (e.g. binding.gyp), the need to understand N-API vs NAN vs the V8 API, etc. In Microvium, you can export a JavaScript function to C code using a single line of JavaScript code and no configuration file is needed:
      vmExport(1234, sayHello);
  • Even for something like the module API for Microvium (for node.js hosts), which many people won’t encounter directly, I put a great deal of thought into making the model as conceptually simple as possible.
    • Compare to something like the Compartment API for JavaScript (under development), where the explanation of its module API necessitates the definition of 6 different kinds of “module specifier”, and the distinction between “resolving”, “loading”, “binding”, and “importing”. While Microvium does all of these, it doesn’t require the user to understand these concepts in order to use it.

This is to say that I’m practicing what I preach. However much I may fall short of the ideal, I put a lot of time and energy into making Microvium a pleasure to use and simple to understand, at the expense of not delivering as many features and it taking longer to develop.

I believe this was the right choice, and I’ll definitely apply it to other projects moving forward.

What are you optimizing for?

I’ll conclude by asking the question: what are you optimizing for?

If you are a manager or leader at a company, you may be inclined optimize the company to produce profits. You may even have a fiduciary duty to do so on behalf of your investors.

Sometimes the profit incentive is aligned with making good products: after all, happy customers are returning customers.

But clearly that’s not always the case — it’s common practice to release buggy software early, so you can iterate on it and use your users as a testbed to weed out the bugs. It’s common practice to “fail fast” in small startups — getting new ideas out the door quickly so you can establish their efficacy in the market, rather than spending significant time designing something that might be pleasant to use but nobody wants it because the idea was flawed. It’s common practice to aim for a first-mover advantage, especially in markets with a network effect, to trap people in your ecosystem before competitors (who may have a more pleasant product but were slower to get there) have a chance to penetrate the market.

Should fiduciary duty or duty as an employee transcend the betterment of mankind?

What goals are you optimizing your products for? What are you optimizing your company for? What are you optimizing your life for?


  1. Actually, my housemate’s washing machine 

  2. Human-machine interface 

  3. Foreign Function Interface — the way in which script code communicates host code and vice versa 

A Distributed Language

A Distributed Language

TL;DR I think it would be great to have a language where single instance of a program can run non-stop for 10 years across a dynamic, heterogeneous mix of hardware, where the program code can perform its own allocation and management of infrastructure resources such as databases, VMs, and public IP addresses, as well as allowing maintainers to safely hot-swap new code into the running program to update it.


This is a deviation from my normal topic — Microvium — to talk about a random idea1.

I recently got a new job at Flip Logistics. Like many businesses an engineer might get a job at, they offer a SaaS solution: a cloud-hosted application with a web front-end, some native apps for various devices, some public APIs for integration with 3rd parties, and backend storage and logic. So far, this aspect is the same as every company I’ve worked at, and I’m sure the same as many others.

Being a newcomer is a great time to see things with fresh eyes. In this post, I’ll go over some of the issues I see with the construction of SaaS software in general (none of this is specific to Flip in any way, and I don’t speak on behalf of Flip in this post), and propose a direction that may resolve a lot of the complexity.

The End Goal

I think the best way to express the issues I see with SaaS2 implementation is to contrast it to the implementation of a single native application. I believe that much of the implementation inefficiency/complexity of a cloud application comes from the fact that it’s distributed: it exists beyond a single OS process. A typical programming language only has first-class support for things pieces running within a single process.

Self-Contained

A native application is self-contained: it starts, it allocates all the resources it needs, including all the interfaces it uses to communicate to the outside environment (e.g. the user).

Contrast this with distributed applications where resources are imposed externally: virtual machines, databases, IP addresses, certificates, firewall rules, etc.

Imagine if simple applications worked this way: to start MS Word, you first need to go to your OS Dashboard where you instantiate a (blank) GUI window for MS Word to eventually use; then you instantiation a file for it work on; and reserve the amount of RAM you want it to run in, and the core of the processor, etc. Then you run your configuration manager to associate the MS Word “process” with the RAM, the file, the GUI window, etc. Then finally you boot the whole thing up and now you can edit your Word document.

Conversely, imagine a cloud “application” that you could just run (instantiate) by the equivalent of just “double-clicking” on it, and letting it allocate its own resources: VMs, network configuration, firewall rules, load balancing, etc.

Documentation, APIs, and Type Safety

In most application software I’ve written, little or no technical documentation needs to exist outside the source code itself. I’m a huge fan of self-documenting code, where the name and type signature of a function or class describes all you really care to know about what it does3.

The information in interface signatures is not only useful for humans, it’s useful because the type system can ensure that implementations and clients of the interface are in agreement about the contract that must be met. This is safe.

But this only applies to interfaces within the context of a single application. As soon as you enter the realm of distributed applications, you now have the issue of defining contracts between the individual distributed components. Much of the time, these are not type-checked, so producers and consumers have no guarantee that contacts are met.

These kinds of inter-service contacts are normally documented externally to the code, where it’s easy to have them get out of date. And hitting “navigate to definition” in your favorite IDE doesn’t take you from the client code to the server code where you can actually see the implementation of the API you might be invoking.

Here’s an example to illustrate the lack of type checking:
In C#, I can bring a queue into existence by new Queue<T>(). Or I could use a 3rd party queue library with similar syntax. The type checker enforces that the contract between the producer and consumer side of the queue. Contrast this with an AWS or Azure distributed application, where instantiating a queue service is done outside your program code and there is no type checking to verify the contract between producers and consumers of the queue, or to self-document the kind of things that consumers should expect to get out of the queue.

This isn’t a straightforward problem to solve. A good type system for distributed applications would probably need to account for tearing between the client and server, where the client and server are on different versions — so contracts defined by type signatures would need to span into the past and future.

Debugging

I’m not sure that any solution in the near future can solve this problem, but the debugging experience for a distributed application is severely lacking.

One area of issue is the fact that you cannot step across service boundaries. While debugging a client of an API, I can step into other functions within that client, but I cannot step from the client to the server code and see both the client and server in the same “call stack”.

Prior Solutions

I’m sure many people have tried to solve these kinds of problems before, to varying success. Two existing solutions off the top of my head:

ASP.NET Web Forms. Within the limited problem-space of a client web application and its backend server, Web Forms attempts to bridge the gap between client and server, having both in the same project and a fair amount of type safety between them. This only works for certain types of applications, and is far from a general solution to the issue.

Infrastructure as Code (IaC), in various shapes — this allows code to specify the so-called “infrastructure”, such as what VMs to create and how to route network traffic to them. This gets part of the way there, but it falls short of the ideal.

To go back to the MS Word analogy, IaC seems akin to having a script that you must run prior to running MS Word, so that the script can set up the “infrastructure” required for MS Word (reserving RAM, creating the file, instantiating the GUI, etc). This still seems unnatural, and it doesn’t give you type checking between the “infrastructure” and the other code.

My Solution

Starting at the end and working backwards, I want a solution where allocating a queue service is as easy as allocating a queue data structure. Something like:

// Instantiate a queue data structure in memory
let localQueue = new Queue<int>();
// Instantiate a queue service, which may in turn instantiate the VM it needs, etc
let infrastructureQueue = new QueueService<int>();

(I’m using a TypeScript-esque language here for my examples)

Similarly for a database:

let db = new DynamoDB<MySchema>({ ...options });

I don’t mean this to create a connection to a database, I mean that these actually create a database. The returned db variable would therefore be a local object representing the remote resource.

Application Lifetime

I also believe that the code that creates the database should be part of the application code. Just like in a simple native application, if a resource is needed by the application, the code at startup should acquire that resource (or it can be acquired on-demand when it’s later needed).

MS Word is instantiated when you double-click the icon (and this is when it acquires many of its resources, such as creating the GUI window), and it’s lifetime might be minutes, hours, or maybe days, before you shut it down and it reclaims all it’s resources.

The lifetime of a distributed cloud service is much longer — you might start it up in production one day, and then retire it 10 or 20 years later. If it allocates a database at startup, you would expect that database to be around for the full 20 years.

This should drive home a paradigm shift I’m trying to communicate: a distributed application, as I’m using the term, is not the thing we currently call a “service” or “microservice”. Rather, the code for the “distributed application” is something that might run for a decade, or a century.

An example is “the Google search engine”, which from the outside is just an application available at google.com which has been running continuously since 1998.

Today’s languages and application environments are simply not suited for this paradigm. We’re used to application code being immutable for the duration of the application: it needs to be right before you run the application. If you need to update the application code, you need to stop the application, and it releases all its resources to deploy the update.

When we move into the domain of distributed applications, and these resources include things like databases, message queues, load balancers, etc., we can’t afford to have the application tear them down when we “close” the application to update its code (nor can we afford to close the application to update its code).

But we may be heading towards a world where it’s possible to mutate a running application. An example that comes to mind with JavaScript is React-JS Hot Reloader (I think this video shows it in action). It is intended as a development tool: changes to your code files are detected when you hit “save”, and the changes are injected into the running application without stopping it. In particular, the in-memory state of your components and store are preserved.

I’m not suggesting that Hot Reloader is the solution here, but just that it shows that the paradigm is unfolding. It’s not unreasonable to think of a future where a single distributed program (single code project with a single entry point) can run for years on a cloud of physical machines and other hardware, allocating and deallocating resources according to the rules it defines in its code — such as making a new database for each client, and spinning up VMs according to its dynamic load4 — and allowing us to maintain it while it continues to run by hot-swapping code into it.

I think that likely the ultimate solution here involves a new language that is designed to handle this kind of thing. A language in which the capability of hot-swapping pieces of code is assumed from the start. A language with type-safety across a non-atomic release (patching different physical machines at different times). A language which can describe processes that run on unreliable hardware and with unreliable communication.

But possibly a lot of progress can also be made with existing languages. Maybe in future I’ll go into more detail about what that might look like.

Microvium

I know that this post isn’t about Microvium, but I want to tie this to Microvium as well, because there is certainly some overlap. I won’t dwell on it because Microvium was not designed to solve this problem.

The key feature of Microvium that has relevance here is the snapshotting feature. It plays with the relationship between code and physical hardware in a novel way:

  • A Microvium process (a running instance of a Microvium program) doesn’t run on just one machine. It is in some sense a “distributed” application because it runs first in one environment and then is suspended and moved to a different environment. In a typical use, it will first run on a server or development machine, and then resume execution on a client or microcontroller.
  • A Microvium process, and by extension the resource is it allocates (such as memory structures) can transcend the physical hardware on which it runs. For example, you can kill power to the machine, releasing all physical RAM, but yet the Microvium app can still have live objects in that powerless state. The program cannot make forward progress without some CPU, but because of the snapshotting feature, the machine (CPU and RAM) that continues execution is not necessarily the same that starts it — the program can jump between machines while keeping it’s live state.
  • A Microvium program can control its own compilation5, because compilation == snapshotting. I previously wrote about how snapshotting can supersede bundling and be superior to it. But looking at this more abstractly, we can think of this as internalizing an externally-imposed workflow, which is similar in principle to my proposal to bring IaC scripts into the program that runs on that infrastructure: giving a program its own control over the infrastructure and perhaps even the build process.

While Microvium may touch on some of the concepts required for a solution, just like Hot Reloading, it’s clearly not the complete picture.


  1. Although I actually do talk about Microvium at the bottom 

  2. I’m going to use the terms “SaaS application”, “cloud application” and “distributed application” interchangeably in this post, on the assumption that most large SaaS system are complicated because they run over distributed hardware 

  3. Or at least I strive toward this end goal 

  4. All by rules defined _in the program code_, or delegated to your favorite third party library, not necessarily baked into the platform 

  5. I’m using the term “compilation” here to mean compiling the source code to bytecode 

New Garbage Collector!

New Garbage Collector!

It’s only been a few weeks since the Alpha release of Microvium and I’ve already almost rewritten the engine! There are two major changes I wanted to make to improve performance:

  1. I’ve moved to a great new garbage collection algorithm which is much faster and has a smaller memory footprint in both RAM and codespace.
  2. I’ve changed the data model to allow Microvium values to embed native pointers on 16-bit platforms.

The Old Data Model

To understand the new data model, it will help if I first explain the old one, so you can see how they compare.

Microvium “slots” are 16-bit. The term “slot” is what I’m using to refer to a memory location that holds a value. For example, a variable is a “slot”, and an object property has a slot for its value (and another slot for its key), etc.

In the old data model, Microvium reserved the top 2 bits of the 16-bit value to be a tag, leaving a 14-bit space for data associated with that tag.

The 2-bit tag tells you how the 14-bits should be interpreted:

  • Tag 00₂: the data is a 14-bit signed integer
  • Tag 01₂: the data is a 14-bit offset into heap memory
  • Tag 10₂: the data is a 14-bit offset into data memory
  • Tag 11₂: the data is a 14-bit offset into the bytecode

Having values fit into 16-bits is convenient for the kind of device that Microvium targets: small microcontrollers with < 64kB of memory. Many of these are 16-bit architectures, because 16-bits is all you need to address all your memory, and moving to 32-bits increases the silicon size and power consumption.

14-bit integers are large enough for many purposes, such as counters, loop variables, indexes, byte manipulation, etc. For situations where more than 14 bits are required, Microvium has the above-mentioned “offset” values which point from the 16-bit slot to a larger memory allocation, somewhere in ROM or RAM. These larger memory allocations can hold any number of other value types, such as arrays, strings, 32-bit integers, function bytecode, etc.

Depending on the characteristics of the referenced data, they could be stored in either ROM or RAM. This design uses offsets rather than explicit pointers for 2 reasons:

  1. Not all devices can fit a pointer into 14-bits. In fact, most can’t.
  2. When a pointer is stored in ROM (i.e. in the bytecode image), it can’t know ahead of time what its address is (since the image could be placed anywhere in ROM).

Side point: what I’ve called “data memory” and “heap memory” here are both memory regions in RAM, but data memory is permanently allocated while heap memory is garbage collected. This is analogous to the difference between the following two allocations in C:

MyStruct myVariable1; // Permanently allocated
MyStruct* myVariable2 = malloc(sizeof(MyStruct)); // Could be freed later

This 16-bit data model seems pretty clever (or at least to me it did), but there are a number of issues with it and optimizations we can make. The first issue is that each of these respective offsets can only be 14-bit, giving only an addressable space of up to 16kB. On the whole, I don’t mind Microvium targetting use cases where applications can’t exceed 16kB, but it does feel awfully tight. This is compounded by the fact that the engine itself compiles to about 7kB1, so the engine is necessarily quite chunky relative to the largest bytecode programs it can run.

The other issues are easiest to see when comparing it to the new data model:

The New Data Model

The new data model uses a number of clever tricks to improve both performance and addressable size on 16-bit architectures.

In the new model, the tag is not a fixed size. It is 1 or 2 bits, and now in the lower position of the word rather than the upper.

If the lowest bit is a 0, the word is interpreted as a 16-bit pointer:

This makes the assumption then that pointers are always even, which is a good assumption since Microvium heap allocations are always 2-byte aligned. This means that once checking the low bit, we don’t need to do any bit manipulation to extract the pointer value: the value is already a 16-bit pointer.

This means that — at least on 16-bit architectures — a native pointer can be stored directly in a memory slot (variable) without any need for encoding and decoding. This speeds up a lot of things, but especially the garbage collector (which I’ll talk about in a moment). This also means that Microvium can now address 4x the amount of RAM than it could previously since it can address 64kB of space.

This speedup is quite significant. The heap in Microvium is not allocated contiguously from the host (doing so would make it difficult to expand the heap dynamically when more space is needed), so resolving a 14-bit heap offset, as per the old scheme, involved a function call with fairly complex logic to determine what heap fragment the offset belongs to, and to find the allocation being referenced. With the new scheme, this all goes away.

Bytecode (ROM) Pointers

Not everything can be addressed by 16-bits, even on a 16-bit architecture. In particular, ROM is not necessarily addressable with 16-bit pointers, since ROM may be larger than RAM, or it may be stored in a completely different address space (or even on an external device such as a serial flash).

Microvium intends to cater for these common situations, so the next kind of value that can be stored in a slot is as follows, having the lower 2 bits with a value of 01₂ and the upper 15 bits be an offset into the bytecode image:

Again, this makes use of the fact that bytecode addresses will be even (this was not necessarily true before, but I’ve made it true in this latest version). So to convert this value to an offset, we just have to shift right by 1 bit.

This only gives us up to 32kB of addressable bytecode ROM, but this is still pretty good. The bytecode file itself can be up to 64kB in size, but the ROM segment is limited to 32kB. I think this is more than acceptable for the kind of use cases that Microvium is currently intended for.

This leaves us with the last tag, 11₂, to represent integers:

We still only have signed 14-bit integers, like before. But even for these, this scheme is more efficient than the old! In the old scheme, to decode a 14-bit integer from the 16-bit word, we needed to perform a sign-extension, since the upper 2 bits (tag bits) were always zero, even when representing negative values. A 14-bit-to-16-bit sign extension typically involves multiple operations. In this new scheme, we need only perform a single signed-right-shift by 2 bits, and the sign will automatically be correct!

The New Garbage Collector

I like it when code gets smaller instead of bigger, and the new garbage collector (GC) is one of those cases. And with the new data model, the garbage collector is an order of magnitude faster than it used to be.

All the principles and objectives from my previous garbage-collection post still apply. In fact, I was well into the new design by the time I finished the implementation of the old GC, but I wanted to get the alpha release out before embarking wholesale on a major overhaul. This animation from last time summarizes the kind of environment that Microvium is meant to be operating in, where the circles represent different modules or tasks, and only one or a few at a time are expected to be active/scheduled, taking center stage and consuming their peak memory2:

The key takeaway from the aforementioned post is that the memory consumed by a task for only a short burst is typically significantly cheaper than the amount of memory it permanently reserves. This is because the average permanent memory per task is multiplied by the total number of tasks, while average peak memory per task is only multiplied by the number of simultaneously-active tasks running at their peak.

Single-threaded environments, of which I’m a huge fan, are even better for this since they enforce that there is only one “simultaneously-active” task — micro-tasks are scheduled serially and so each can consume all the resources for a short time, as long as they clean up thoroughly before the next microtask in the queue. If microtasks are kept short (e.g. less than 1ms) then responsiveness is still good and the system can still be “real-time-enough” for many real-world applications.

The new garbage collector is based on Cheney’s Algorithm (wiki). The general idea is this:

  • When it’s time to garbage collect, the GC finds reachable (live) allocations and moves them to a new virtual heap.
  • It then discards the old heap all at once.

The coolest new feature of the new Microvium GC is that it costs nothing at all to free an allocation! Dead allocations are simply never visited by this algorithm, and just get freed all in one go when the algorithm discards the old heap.

All living allocations have a cost during collection (this is true of all GC algorithms3 ). But also recall from my last post on GCs, that microcontrollers can have a 1000x more processing power than memory4, relative to desktop machines, so the fact that living allocations incur some (small) amount of processing time during every collection is probably not as impactful as the fact that living allocations continue to consume memory after the collection. In other words, don’t worry about how much CPU overhead a garbage collector has, and instead worry about designing your scripts such that their idle memory consumption is small, as you should be doing regardless.

This algorithm makes it fairly cheap to make tons of temporary allocations for intermediate computations or for passing data between functions, etc. And then when the task’s turn completes and the GC is run to clean up, these temporaries are discarded with no processing or memory overhead.

Since allocations in the Microvium heap are contiguous5, creating new allocations is also cheap — it’s almost just bumping a pointer forward6.

In the following animation, blocks represent allocations, and a block turns gray to represent that the allocation has become unreachable7 — blue blocks are live (reachable) allocations. The program8 can allocate quickly and cheaply by just bumping the allocation cursor/pointer forwards. Dead objects remain on the heap until collection. The collector then copies the live allocations to a new space and the old space is discarded in one fell swoop.

With the new data model, tracing the reachability graph is almost trivial: for each variable, check if the lowest bit is a zero. If it is, then the variable holds a 16-bit pointer to a reachable allocation. If not, the variable holds anything else — an integer or a pointer to ROM (which doesn’t need tracing). Reachable objects are copied into the new memory space and their internal slots are in turn investigated the same way.

Collection with this algorithm is quick and proceeds in a single pass with constant memory overhead — no mucking about with mark bits, free lists, work queues/stacks, etc.

Bonus: Array and Object compaction

As a bonus, this new implementation of the GC also compacts arrays and objects on the fly.

Dynamically-sized arrays consume more space than needed so that appending to them is quick: every time they run out of space, their capacity is doubled to make room for twice as many elements. The GC algorithm truncates the capacity of these arrays on the fly, so they use the smallest possible amount of space.

The case is similar for objects. Dynamic objects in Microvium are represented in memory by a linked-list of property cells (key-value pairs with a next pointer), so that appending new properties is quick. But the linked list consumes twice the amount of space of an equivalent contiguous array, so the GC compacts these linked lists into contiguous property arrays on the fly. This gives the best of both worlds, making it efficient to link a new property cell onto an existing object, but having that property cell subsumed into to the contiguous arrays during GC compaction so that each property takes the smallest possible space while dormant9.


  1. This is measured on an msp430, with floating-point support disabled. Presumably this number will grow over time 

  2. Read the full post for a more complete explanation 

  3. But can be mitigated on desktop/server machines by the use of generational garbage collectors, an unnecessary complexity for MCUs. 

  4. To put it another way, consider that many MCUs have the CPU power to loop through their entire RAM in the order of milliseconds, while desktop-class machines would do the same in the order of seconds 

  5. At least, contiguous within the large chunks of memory allocated from the host as the heap grows 

  6. Bumping the allocation pointer forward, checking for overflow, and assigning the 2-byte allocation header. 

  7. Unreachable means that there is no way for your program/script to access it anymore, so it is subject to garbage collection 

  8. The program is called the mutator in GC parlance. 

  9. A property in compacted form is 4 bytes: a 2-byte key and a 2-byte value. A property in linked-list form consumes extra bytes for the linked list pointers and the allocation header of each cell.