The other day I had to write a function in C which decodes Unicode characters from a sequence of bytes in UTF-8.
UTF-8 is a variable length encoding, meaning that one character (known as a “code point”) can be represented by any number of bytes – usually one or two bytes. The exact format of UTF-8 is irrelevant for this series, but you can find more information on Wikipedia if you’re interested.
So what we want is something that takes a sequence of bytes, and produces a sequence of Unicode characters.
sequence of bytes -> sequence of Unicode code points
But in this series I don’t want to talk specifically about UTF-8 sequences. The concepts I want to investigate with this example aren’t unique to UTF-8 decoding. Many of the problems we solve in software development involve writing code that transforms a sequence of one thing into a sequence of another. Whether its parsing objects from a file, querying objects from a database, or presenting data to the user. In fact most programs don’t do anything except process sequences: they accept a sequence of user input, and produce a sequence of reactions to those inputs.
So, sequences are important, and figuring out how to represent and manipulate them is also important. In this series I hope look at a few different ways of representing sequences, and consider the pros and cons of each. I’m not going to be looking at any one particular programming language, so the insights I uncover should help understand sequences in any language at a deeper level. I’m going to start by looking at C, because of its simplicity and lack of abstraction. Then I’ll consider other common languages such as C++ and C#. And I’ll finish off by pointing out the common flaws across all of these languages, and consider a new hypothetical language to address these problems.
Let’s start with what is perhaps the most obvious implementation in C. The function must accept a sequence of bytes, and return a sequence of characters. In C, as in many languages, a common way of representing a sequence of something is with the use of an array. So let’s have a function that accepts an array as input, and produces an array as an output:
wchar_t* decodeUtf8_attempt1(const uint8_t* data);
Since this is a language-agnostic investigation, let me explain that
uint8_t is, unsurprisingly, the type representing a byte. And
wchar_t is one of the types that can be used to represent a character (in this case a Unicode code point). The
const modifier is good practice, and in this case means that the data passed to the function is strictly input data, and that the function promises not to modify it.
The implementation for this function is too boring to show here, but would do something like this:
- Run through the input data, counting the number of Unicode characters
- Allocate the memory for the output array of characters, probably by calling the traditional C
- Run through the input data again, this time storing each output character in the output array
- Return the output array
We can see immediately that there are problems with this implementation. Perhaps the most striking is that we’re making two passes over the input data. In terms of performance, this might not be serious: it might might make an
O(n) operation into an
O(2n) operation, which is still an
O(n) operation, and not much slower. On the other hand if the input data is large enough then the beginning of the input might not still be in the CPU cache by the time we make the second pass, and it could land up a lot slower.
A more important problem with making two passes over the input data is that we’re breaking the DRY principle. The second pass is very likely to repeat a lot of the code used in the first pass. It wouldn’t be difficult to accidentally predict a slightly different size on the first pass than you use on the second pass, and land up with some ugly memory bugs.
But actually, we can’t do much better an implementation of the function given this particular function interface (aka function “prototype”). We could perhaps have avoided the double pass over the input data if we’d chosen to provide a conservative over-estimate of the output array size by using the fact that there will never be more Unicode characters than bytes. But this would be a clever trick that wouldn’t apply to sequences in general, so it’s not relevant to our investigation. Or we could perhaps have avoided the double-pass if we were willing to incur extra overhead of resizing the output array as we encountered more characters while parsing the input. And indeed this might have been a preferable way to go since it avoids some duplicate code, although it’s debatable whether it would be more efficient.
Heap, Memory and Ownership
Another issue with this implementation/interface is that we’re allocating memory at all. The heap is slow by most definitions. If you’re in a multithreaded environment the heap is generally also a shared resource, causing thread locking, CPU multi-core cache invalidation, and a variety of other effects that would put the breaks on our program.
Once the heap space is allocated, somebody needs to free it. There are the usual problems with this: added risk forgetting to free memory results in a leak, or freeing it twice or prematurely and having difficult-to-track-down bugs in the program. But there’s also a more subtle issue: since the function abnegates ownership of the result-array to the caller, the caller is coupled to the implementation of the decoder function. That is to say, the decoder function chose to implement the output as a heap-allocated array, but now the caller also has to be aware of this implementation detail because it needs to also access the heap to free the array. This coupling would make it difficult to change the implementation down the line to use some other dynamic storage, such as a memory pool.
But there’s another memory-related problem that’s also pertinent to our investigation that’s hiding in our function interface: both the input and the output sequences are completely “populated” before they’re passed across the interface boundary. Or to put it another way: the entire input sequence is passed to the decoder function at once, and it returns the entire output sequence back to the caller at once.
It would be a terrible thing if all sequences were manifested like this. Imagine if, in your favorite desktop application, it were necessary for you to perform all the clicks and key presses before you ran the program (so that the whole sequence of input events is “populated” before the program is invoked), and then execute the program on those clicks and presses, and it in turn skips through all the GUI sequences that correspond to those clicks and presses. Some things clearly work better if you can provide input in a streaming fashion, rather than a batched fashion. We’ll talk more about streaming implementations later in the series.
The Good Parts
So there are lots of problems with this naïve implementation. But there are also some good things, which may not be obvious amongst all the flaws I’ve highlighted.
For one it’s a very simple interface and implementation, and simplicity is often severely underestimated. It’s very easy to understand how to use it and what it’s doing. Whoever is maintaining it doesn’t need to understand any complicated patterns or language features.
Another benefit that may not be immediately apparent is that the implementation decides exactly when and where it wants to read from the input (I call this “pull“, since the function actively pulls elements out of the input array), and exactly when and where it wants to write to the output array (I call this “push“, since the function actively pushes into the output array). This makes for a simpler and more maintainable implementation.
Next time I’m going to look at another alternative function interface – one which accepts a “push” input rather than a pull one – and we’ll compare this with our first solution.