Category: Embedded

Single-threading is more memory-efficient

Single-threading is more memory-efficient

TL;DR: Single-threading with super-loops or job queues may make more efficient use of a microcontroller’s memory over time, and Microvium’s closures make single-threading easier with callback-style async code.

Multi-threading

In my last post, I proposed the idea that we should think of the memory on a microcontroller not just as a space but as a space-time, with each memory allocation occupying a certain space for some duration of time:

I suggested that we should then measure the cost of an allocation in byte-seconds (the area of the above rectangles as bytes × seconds), so long as we assumed that allocations were each small and occurred randomly over time. Randomness like this is a natural byproduct of a multi-threaded environment, where at any moment you may coincidentally have multiple tasks doing work simultaneously and each taking up memory. In this kind of situation, tasks must be careful to use as little memory as possible because at any moment some other tasks may start up and want to share the memory space for their own work.

This leaves a random chance that at any moment you could run out of memory if too many things happen at once. You can guard against this by just leaving large margins — lots of free memory — but this is not a very efficient use of the space. There is a better way: single threading.

Single threading

Firmware is sometimes structured in a so-called super-loop design, where the main function has a single while(1) loop that services all the tasks in turn (e.g. calling a function corresponding to each task in the firmware). This structure can have a significant advantage for memory efficiency. In this way of doing things, each task essentially has access to all the free memory while it has its turn, as long as it cleans up before the next task, as depicted in the following diagram. (And there may still be some statically-allocated memory and “long-lived” memory that is dynamic but used beyond the “turn” of a task).

Overall, this is a much more organized use of memory and potentially more space-efficient.

In the multi-threaded model, if two memory-heavy tasks require memory around the same time, neither has to wait for the other to be finished — or to put it another way, malloc looks for available space but not available time for that space. On the other hand, in a super-loop model, those same tasks will each get a turn at different times. Each will have much more memory available to them during their turn while having much less impact on other tasks the rest of the time.

An animated diagram you may have seen before on my blog demonstrates the general philosophy here. A task remains idle until its turn, at which point it takes center stage and can use all the resources it likes, as long as it packs up and cleans up before the next task.

So, what counts as expensive in this new memory model?

It’s quite clear from the earlier diagram:

  • Memory that is only used for one turn is clearly very cheap. Tasks won’t be interrupted during their turn, so they have full access to all the free memory without impacting the rest of the system.
  • Statically-allocated memory is clearly the most expensive: it takes away from the available memory for all other tasks across all turns.
  • Long-lived dynamic allocations — or just allocations that live beyond a single turn — are back to the stochastic model we had with multi-threading. Their cost is the amount of space × the number of turns they occupy the space for. Because these are a bit unpredictable, they also have a cost in that they add to the overall risk of running out of memory, so these kinds of allocations should be kept as small and short as possible.

Microvium is designed this way

Microvium is built from the ground up on this philosophy — keeping the idle memory usage as small as possible so that other operations get a turn to use that memory afterward, but not worrying as much about short spikes in memory that last only a single turn.

  • The idle memory of a Microvium virtual machine is as low as 34 bytes1.
  • Microvium uses a compacting garbage collector — one that consolidates and defragments all the living allocations into a single contiguous block — and releases any unused space back to the host firmware. The GC itself uses quite a bit of memory2 but it does so only for a very short time and only synchronously.
  • The virtual call-stack and registers are deallocated when control returns from the VM back to the host firmware.
  • Arrays grow their capacity geometrically (they double in size each time) but a GC cycle truncates unused space in arrays when it compacts.

See here for some more details.

Better than super-loop: a Job Queue

The trouble with a super-loop architecture is that it services every single task in each cycle. It’s inefficient and doesn’t scale well as the number of tasks grows3. There’s a better approach — one that JavaScript programmers will be well familiar with: the job queue.

A job queue architecture in firmware is still pretty simple. Your main loop is just something like this:

while (1) {
  if (thereIsAJobInTheQueue) 
    doNextJob();
  else
    goToSleep();
}

When I write bare-metal firmware, often the first thing I do is to bring in a simple job queue like this. If you’re using an RTOS, you might implement it using RTOS queues, but I’ve personally found that the job-queue style of architecture often obviates the need for an RTOS at all.

As JavaScript programmers may also be familiar with, working in a cooperative single-threaded environment has other benefits. You don’t need to think about locking, mutexes, race conditions, and deadlocks. There is less unpredictable behavior and fewer heisenbugs. In a microcontroller environment especially, a single-threaded design also means you also save on the cost of having multiple dedicated call stacks being permanently allocated for different RTOS threads.

Advice for using job queues

JavaScript programmers have been working with a single-threaded job-queue-based environment for decades and are well familiar with the need to keep the jobs short. When running JS in a browser, long jobs means that the page becomes unresponsive, and the same is true in firmware: long jobs make the firmware unresponsive — unable to respond to I/O or service accumulated buffers, etc. In a firmware scenario, you may want to keep all jobs below 1ms or 10ms, depending on what kind of responsiveness you need4.

As a rule of thumb, to keep jobs short, they should almost never block or wait for I/O. For example, if a task needs to power-on an external modem chip, it should not block while the modem to boots up. It should probably schedule another job to handle the powered-on event later, allowing other jobs to run in the meantime.

But in a single-threaded environment, how do we implement long-running tasks without blocking the main thread? Do you need to create complicated state machines? JavaScript programmers will again recognize a solution…

Callback-based async

JavaScript programmers will be quite familiar the pattern of using continuation-passing-style (CPS) to implement long-running operations in a non-blocking way. The essence of CPS is that a long-running operation should accept a callback argument to be called when the operation completes.

The recent addition of closures (nested functions) as a feature in Microvium makes this so much easier. Here is a toy example one might use for sending data to a server in a multi-step process that continues across 3 separate turns of the job queue:

function sendToServer(url, data) {
  modem.powerOn(powerOnCallback);

  function powerOnCallback() {
    modem.connectTo(url, connectedCallback);
  }

  function connectedCallback() {
    modem.send(data);
  } 
}

Here, the data parameter is in scope for the inner connectedCallback function (closure) to access, and the garbage collector will automatically free both the closure and the data when they aren’t needed anymore. A closure like this is much more memory-efficient than having to allocate a whole RTOS thread, and much less complicated than manually fiddling with state machines and memory ownership yourself.

Microvium also supports arrow functions, so you could write this same example more succinctly like this:

function sendToServer(url, data) {
  modem.powerOn( () => 
    modem.connectTo(url, () => 
      modem.send(data)));
}

Each of these 3 stages — powerOn, connectTo and send — happen in a separate job in the queue. Between each job, the VM is idle — it does not consume any stack space5 and the heap is in a compacted state6.

If you’re interested in more detail about the mechanics of how modem.powerOn etc. might be implemented in a non-blocking way, take a look at this gist where I go through this example in more detail.

Conclusion

So, we’ve seen that multi-threading can be a little hazardous when it comes to dynamic memory management because memory usage is unpredictable, and this also leads to inefficiencies because you need to leave a wider margin of error to avoid randomly running out of memory.

We’ve also seen how single-threading can help to alleviate this problem by allowing each operation to consume resources while it has control, as long it cleans up before the next operation. The super-loop architecture is a simple way to achieve this but an event-driven job-queue architecture is more modular and efficient.

And lastly, we saw that the Microvium JavaScript engine for embedded devices is well suited to this kind of design, because its idle memory usage is particularly small and because it facilitates callback-style asynchronous programming. Writing code this way avoids the hassle and complexity of writing state machines in C, of manually keeping track of memory ownership across those states, and the pitfalls and overheads of multithreading.


  1. Or 22 bytes on a 16-bit platform 

  2. In the worst case, it doubles the size of heap while it’s collecting 

  3. A super-loop also makes it more challenging to know when to put the device to sleep since the main loop doesn’t necessarily know when there aren’t any tasks that need servicing right now, without some extra work. 

  4. There will still be some tasks that need to be real-time and can’t afford to wait even a few ms in a job queue to be serviced. I’ve personally found that interrupts are sufficient for handling this kind of real-time behavior, but your needs may vary. Mixing a job queue with some real-time RTOS threads may be a way to get the best of both worlds — if you need it. 

  5. Closures are stored on the virtual heap. 

  6. It’s in a compacted state if you run a GC collection cycle after each event, which you would do if you cared a lot about idle memory usage. 

Short-lived memory is cheaper

Short-lived memory is cheaper

TL;DR: RAM on a microcontroller should not just be thought of as space but as space-time: a task that occupies the same memory but for a longer time is more expensive.

MCU memory is expensive

If you’re reading this, I probably don’t need to tell you: RAM on a microcontroller is typically very constrained. A 3 GHz desktop computer might have 16 GB of RAM, while a 3 MHz MCU might have 16 kB of RAM — a thousand times less processing power but a million times smaller RAM. So in some sense, RAM on an MCU may be a thousand times more valuable than on a desktop machine. Regardless of the exact number, I’m sure we can agree that RAM is a very constrained resource on an MCU1. This makes it important to think about the cost of various features, especially in terms of their RAM usage.

Statically-allocated memory

It’s common especially in smaller C firmware to just pre-allocate different pieces of memory to different components of the firmware (rather than using malloc and free). For example, at the global level, we may declare a 256-byte buffer for receiving data on the serial port:

uint8_t rxBuffer[256];

If we have 1kB of RAM on a device for example, maybe there are 4 components that each own 256 B. Or more likely, some features are more memory-hogging than others, but you get the idea: in this model, each component in the code owns a piece of the RAM for all time.

Dividing up RAM over time

It’s of course a waste to have a 256-byte buffer allocated forever if it’s only used occasionally. The use of dynamic memory allocation (malloc and free) can help to resolve this by allowing components to share the same physical memory at different times, requesting it when needed, and releasing it when not needed.

This allows us to reconceptualize memory as a mosaic over time, with different pieces of the program occupying different blocks of space-time (occupying memory space for some time).

When visualizing memory like this, it’s easy to start to feel like the cost of a memory allocation is not just the amount of memory it locks, but also the time that it locks it for (e.g. the area of each rectangle in the above diagram). In this sense, if I allocate 256 bytes for 2 seconds then it costs 512 byte-seconds, which is an equivalent cost to allocating 512 bytes for 1 second or 128 bytes for 4 seconds.

Being a little bit more rigorous

Skip this section if you don’t care about the edge cases. I’m just trying to be more complete here.

This measure of memory cost is of course just one way of looking at it, and breaks down in edge cases. For example, on a 64kB device, a task2 that consumes 1B for 64k-seconds seems relatively noninvasive while a task that consumes 64kB for 1s is much more costly. So the analogy breaks down in cases where the size of the allocations are significant compared to the total memory size.

Another way the model breaks down is if many of the tasks need memory around the same time — e.g. if there is some burst of activity that requires collaboration between many different tasks. The typical implementation of malloc will just fail if there is not memory available right now, as opposed to perhaps blocking the thread until the requested memory becomes available, as if the memory was like a mutex to be acquired.

But the model is accurate if we make these assumptions:

  • The size of the individual allocations is small relative to the total memory size
  • Allocations happen randomly over time and space

Under these assumptions, the total memory usage becomes a stochastic random variable whose expected value is exactly:

The expected allocation size × the expected allocation size × the expected allocation frequency

We could also calculate the probability that the device runs out of memory at any given point (I won’t do the calculations here).

Conclusion

In situations where memory allocations can be approximated as being small and random, the duration of a memory allocation is just as important as its size. Much more care must be taken to optimize memory usage for permanent or long-lived operations.

I’m not very happy with this stochastic viewpoint and all these edge cases. It means that at any point, we could randomly exceed the amount of available memory and the program will just die. Is there a better way to organize memory space-time so we don’t need to worry as much? I believe there is… and I’ll cover that in the next post.


  1. Another thing that makes it a constrained resource is the lack of virtual memory and the ability to page memory in and out of physical RAM, so the RAM size is a hard limit 

  2. When talking about this space-time model, I think it’s easier to talk about a “task” than a “component”, where a task here is some activity that needs to be done by the program over a finite stretch of time, and will consume resources over that time. 

Microvium: updated memory model

Microvium: updated memory model

TL;DR Microvium can now address up to 64 kB of ROM, up from 32 kB previously, and now runs more efficiently on small 32-bit devices such as an ARM Cortex MCU.

This is a minor update regarding the data model in Microvium (previous post here), for the kind of person who’s interested in this kind of thing.

Recap: Microvium uses 16-bit slots

As covered in the previous post, Microvium uses a 16-bit slot size — variables in Microvium are 16 bits each. This is unusual among the small embedded JavaScript engines. Cesanta’s mJS and Elk use a 64-bit slot size with NaN-boxing, and Moddable’s XS uses a 128-bit slot size (4x32bit words). So, if you have a variable with the number 42 in it, it will take 2 bytes in Microvium, 8 bytes in mJS, and 16 bytes in XS (on the other hand, the number 42.5 is a float and will take 12 bytes in Microvium since it overflows the slot into the heap, but it will still take 8 bytes in mJS and 16 bytes in XS).

Pointers and Paged Memory

If the lowest bit in the slot is 0, Microvium treats the value as a pointer to heap memory.

Since heap memory in Microvium is always 2-byte aligned, the lowest bit of a pointer will always be 0, so the value in the slot exactly corresponds to the pointer value, at least on 16-bit systems (or on 8-bit systems with a 16-bit address bus).

But what about 32-bit systems? Microvium is optimized for devices with 64 kB or less of RAM, but many devices with 64 kB of RAM are actually 32-bit devices. In these devices, there is usually a 32-bit address space, and some sub-range of these addresses will map to physical RAM (and some other range of addresses will map to physical ROM). Even though only 16 bits of information are required to index every byte of RAM, pointers are still 32 bits on these devices.

Previously, Microvium worked fine on 32-bit and 64-bit devices (I do all the testing on my 64-bit PC) but it did so through an expensive mapping table that mapped 16-bit VM addresses to their 32-bit or 64-bit native counterparts (like virtual memory implemented in software). The mapping itself was still O(1) in many cases1, but it involved a function call every time a pointer needed to be mapped, which is a massive overhead to incur. I wasn’t too worried about this because I wasn’t aiming for 32-bit devices as my initial audience, but with the number of 32-bit devices out there, this would quickly become a problem.

To support this kind of scenario more elegantly, Microvium now has a port macro definition that allows you to specify the upper 16-bits of a 32-bit pointer (or upper 48-bits of a 64-bit pointer) for the platform2.

For a more concrete example, the Arduino Nano 33 IOT has up to 32kB of SRAM, starting at the address 0x20000000. So the upper 16-bits of a real pointer into RAM will always be 0x2000. Here is a snippet of the data sheet that shows this:

So, in the Microvium port file, you can now specify the high bits of a 32-bit pointer to be 0x2000, and Microvium will interpret all pointers by simply indexing into the given memory page.

In compiled ARM machine code, the conversion from a 16-bit slot value to 32-bit pointer is just one or two instructions3! This is a significant performance improvement over how it worked before, and makes pointer access in Microvium almost as efficient as native pointer access.

I didn’t actually make this change for performance reasons. I did it because it makes the development of the Microvium engine much easier.

  • On my Windows machine where I develop, I can now use VirtualAlloc to pre-allocate a single 64 kB “page” of memory where the high bits of a pointer are always 0x5555, and run Microvium in just this region of memory. So if I see the Microvium value 0x002A, I know instantly that corresponds to the address 0x5555002A.
  • The addresses are consistent across runs, so when I note down an address in my notebook while debugging, I know it will be the same if I restart the program.
  • I can also have a memory view open in the debugger and it remains consistent across runs and shows all the VM memory in one place.

64 kB ROM

Previously, if the lower 2 bits of a slot were 01b then the value was considered a pointer into ROM after shifting right by 1 bit, giving us a 15-bit address space for ROM, and requiring ROM to be 2-bit aligned to keep the remaining 0 bit.

Now, the slot value is considered a pointer into ROM after zeroing the bottom 2 bits. This doesn’t change the performance, but it means that we can now address up to 64 kB of ROM.

A side effect is that ROM must now be 4-byte aligned since the lower 2 bits of ROM pointers must be zero. This means extra padding sometimes, but I’ve found that the ROM overhead doesn’t grow substantially with this change.

Why did I do this?

  1. Debugging. Previously, if I saw the slot value 0x2A1 while stepping through the engine, I have to bring out a calculator to see that 0x2A1 corresponds to the bytecode address 0x150. Now, the value 0x2A1 corresponds to the bytecode address 0x2A0, which is much easier to follow.
  2. I was irked by the fact that I couldn’t just say that “Microvium supports up to 64kB of memory”. I previously had to qualify it every time — “it supports 64kB of RAM, and 64kB of snapshot image, but only up to 32kB of ROM”. Now I can just say “supports up to 64kB of memory” with no added asterisks or qualifications, since it supports 64 kB of each type of memory it uses. This is part of simplifying the mental model of Microvium, since Microvium is all about simplicity.


  1. The performance was actually O(n) where n is the number of memory blocks, but since Microvium uses a compacting garbage collector, memory is consolidated into a single block periodically, making it O(1) for most pointers most of the time. 

  2. More accurately, you can specify any native address as the origin of the VM address space, but I found it easier to explain this in terms of the high bits. 

  3. In the full ARM instruction set, the conversion is a single 32-bit instruction. In the ARM Thumb instruction set, it takes two 16-bit instructions. The specific number 0x20000000 is relevant here because it’s a power of 2 which is more efficient. 

Debugging Embedded Code

Debugging Embedded Code

Embedded microcontroller

At first I didn’t think this would be something useful to blog about. But the more people I speak to, the more I realize that I might be one of the few who has my perspective on how to debug embedded microcontroller/firmware code on small devices1. So let me share my “secret formula” with you.

Problems

For those who don’t traditionally write embedded microcontroller applications but are still following along for interest, let me quickly highlight some of key problems we face debugging embedded code.

Problem 1: Hardware

The most obvious problem with developing embedded code is that these microchips are generally embedded into a piece of hardware, such as your washing machine, and the code’s primary purpose is often closely related to controlling the hardware. You can often use what is called an in-circuit debugger to actually connect your IDE/debugger to the firmware program running on the embedded device2.

But lets say you’re writing code for an x-ray machine, and lets say that it needs to pulse the x-ray cathode for 10 ms. You do not want to accidentally put a breakpoint during those 10 ms. The breakpoint will suspend the execution of the firmware, but will generally leave the hardware in whatever state it was in – which may be a state you really don’t want it to be in for an extended period of time3.

There are other reasons why hardware can be a problem for debugging, such as when the program outputs physical voltages but does it incorrectly. You may need to start resorting to using a multimeter or oscilloscope just to see what the program is doing. You can’t add a software “debug watch” to a piece of hardware.

Problem 2: Download time

Another problem with debugging is the time it takes to cycle through your modify-build-execute-debug process, because it has an extra “download” phase. In other words, every time you make a change, you have to download the compiled executable to the embedded device. This can take anywhere from a few seconds to minutes, depending on the size of the program. It seems a common pattern for developers to do something like this:

  1. Compile program
  2. Download program
  3. Execute in debug mode
  4. Try put it into the problem state
  5. See the problem (perhaps by a breakpoint and watches)
  6. Go back to your code
  7. Add diagnostic printf statements or more breakpoints
  8. Go back to step 1

I’ll call this the compile-debug cycle, and it can take minutes to get through it each time.

Problem 3: Non-Standard Compilers

Another problem, which perhaps isn’t quite as obvious, is that is seems that an overwhelming number of compilers targeted at embedded microcontrollers do not conform to the C or C++ standards. They often only support a subset of standard C/C++, and they include many extensions for things which are difficult or impossible to do in standard C. This can be a good thing: how do you declare an interrupt routine in standard C? But for debugging code it can cause problems. In the same way that hardware dependent code forces you debug code on the physical hardware, compiler dependent code forces you to debug with the specific debugger provided with the compiler. Why is this a problem? I’ll get to that in a moment.

Solution

Microchip solution!

So how do I solve these problems?

The answer is stupidly simple. Perhaps even insulting. It is simply:

Don’t write embedded code

Avoid writing code that accesses hardware, or that needs to be downloaded to run, or that uses non-standard extensions of C/C++. You could almost say, “Don’t write embedded code at all”. Embedded code has all of these problems, so just don’t do it!

Instead, write platform-independent code. Use all the modern techniques of dependency injection, unit testing, etc. Get your code working in isolation first, in an environment where your compile-debug cycle is in the order of seconds rather than minutes. Get it working on your local PC! This eliminates step 2 – “Download program” – from the above compile-debug cycle, and makes everything easy.

Also, ideally you should be debugging unit tests, not your fully integrated software. This eliminates step 4 in the above compile-debug cycle, because your unit tests automatically generate the problem state. If they don’t, then you may need to write more tests.

The unit tests for the module you’re working on should automatically run at the click of a button or keyboard shortcut, and should be integrated into the IDE so you don’t have to leave your code to run tests. Personally, I divide my code into small enough modules that my unit tests for the module-under-development can compile and run in under a second, and causes the IDE to jump immediately to the first failing test assertion.

Doing it this way you can easily shorten the compile-debug cycle overhead by a factor of 100 (“overhead” meaning tasks not related to finding or fixing the actual problem). If you also write the code in such a way that you can debug-step backwards then you can also reduce the number of times you need to repeat the compile-debug cycle.

But, Hardware!

Of course, you do need to access hardware at some point. And you probably need to use compiler pragmas and intrinsics to squeeze out the most performance or do unusual things. But you can isolate these cases, and confine them to very thin layers on which your main application code runs. For example, instead of having application-level code directly access an IO port, it can access it through a function which is implemented by an ultra-thin HAL (hardware abstraction layer). The HAL should do as little as possible, and its interface should be pure C (or C++). When I say “hardware”, I also mean other platform specific dependencies, such as a compiler quirks, third-party libraries, and the OS.

The rest of your application code should be platform independent. Don’t use features that aren’t standard C/C++. Obviously only use the subset of the standard that’s actually supported by your compiler. Don’t rely on unspecified “features” of the language, such as the size of a pointer, the result of signed integer overflow, the endianness of an integer, or the alignment of fields in structure (But you weren’t anyway, right?).

Use dependency injection to inject the HAL into your application code4. This is just generally good advice for any program. And how else would you do unit testing? For different build configurations you can inject mocks and interactive simulators just as easily as the HAL on your release build.

Remember that every line of HAL code might take 10 times longer to develop, so you really want to minimize how much HAL there is.

Conclusion

So, the solution to the long debug-compile cycle with embedded code is simply to avoid “embedded” code as much as possible. I do this, and it really works. I can go weeks developing code for a microcontroller without ever powering up the physical device, and then take a few hours or so to integrate it into the final compiled application. The result is that I can use all the modern techniques and practices to develop high quality firmware in a fraction of the time.

Why don’t many embedded developers seem to do this5? I really don’t know. But perhaps its related to the fact that developing for such limited systems is a lot like developing for the desktop computers of 20 years ago, before object orientated patterns and unit testing were really around. Or perhaps it’s because C and C++ make it very difficult to do zero-cost dependency injection. Perhaps its just because the embedded microcontroller industry is much slower to move, since it targets a much smaller audience.

Whatever the reason, it doesn’t have to be that way, and hopefully I’ve given you a few ideas of how to accelerate your development process. If you have any other ideas or questions, feel free to leave them in the comments below.


  1. I’m talking about anything that’s too small to run a proper operating systems like Linux – something in the order of kilobytes of RAM 

  2. Again, I’m talking about quite low level programs here. If the “firmware” is running a full OS, such as Linux, then it may have its own debugging tools such a remote GDB server 

  3. I’m not saying you would use a breakpoint at all in a real x-ray machine, but the point applies to other scenarios 

  4. I’m not going to cover how to do dependency injection for this type of application, but techniques you could consider are: using the linker to “inject” a dependency C API or mock layer into your application; using CRTP to inject dependencies into C++ code statically; and of course plain old virtual classes and/or function pointers 

  5. I may be generalizing. But realize again that I’m referring to embedded firmware programs, perhaps running in sub 100 kB of RAM, and not smartphone or embedded Linux development