
Sequences: Part 2
In this series we’re looking at different ways of designing interfaces that interact with sequences. To investigate different interface design choices we’re using an example function which decodes UTF-8 encoded text – one that consumes a sequence of bytes, and produces the corresponding sequence of Unicode characters. Last time we considered a very simple design where the function interface simply accepted a null-terminated, heap-allocated byte array as an input argument, and returned a null-terminated, heap-allocated character array as output. Here it is again for reference:
wchar_t* decodeUtf8_attempt1(const uint8_t* data);
Remember that we’re only looking at the interface of the function, since that’s the most important part when it comes to modularity and maintainability. Last week we considered some of the problems with the design of this function’s interface. One of the things we said was a problem is that the output sequence is passed as a fully populated heap-allocated array. This meant that our function would probably have to use the heap, which would add inefficiencies and possibly duplicated code for a double pass over the input data. It also raises the concern of pointer ownership, and couples the function caller to unnecessary implementation details.
So let’s try again with our second attempt.
Attempt 2
What happens if, instead of returning the whole output sequence at once in an array, we return the output sequence one element at a time? For example, we might do this:
// Takes null-terminated UTF-8 encoded string at `cursor`
// Returns first code character at the cursor
// Outputs cursor of the next character at `nextCursor`
wchar_t decodeUtf8_attempt2(const uint8_t* cursor, const uint8_t** nextCursor);
Again, since this is a language-agnostic investigation, I’d like to just clarify some points for those who might be a little rusty with C/C++. The double asterisk in const uint8_t** means that nextCursor is an output parameter. Both const qualifiers still mean that the input data is unchanged by the function.
So the function essentially accepts one argument: a pointer to the first byte of the UTF-8 data we wish to decode. It returns two outputs: the Unicode character, represented by a wchar_t, and a const uint8_t* pointer to where the next character begins. To decode a whole document or stream of data we would call the function multiple times – once for each Unicode character.
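To make the calling pattern concrete, here is a minimal sketch of how a function with this interface might be written and then driven in a loop. The body is my own assumption rather than a reference implementation: it assumes well-formed input, does no validation at all, and assumes wchar_t is wide enough to hold the decoded value. The countCharacters helper is a hypothetical name invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* Sketch only: assumes well-formed UTF-8 and performs no validation. */
wchar_t decodeUtf8_attempt2(const uint8_t* cursor, const uint8_t** nextCursor)
{
    assert(cursor != NULL && nextCursor != NULL);

    uint8_t lead = cursor[0];
    uint32_t codePoint;
    int continuationBytes;

    /* The lead byte says how many continuation bytes follow. */
    if (lead < 0x80)      { codePoint = lead;        continuationBytes = 0; } /* 0xxxxxxx */
    else if (lead < 0xE0) { codePoint = lead & 0x1F; continuationBytes = 1; } /* 110xxxxx */
    else if (lead < 0xF0) { codePoint = lead & 0x0F; continuationBytes = 2; } /* 1110xxxx */
    else                  { codePoint = lead & 0x07; continuationBytes = 3; } /* 11110xxx */

    cursor++;
    for (int i = 0; i < continuationBytes; i++) {
        /* Each continuation byte (10xxxxxx) contributes six payload bits. */
        codePoint = (codePoint << 6) | (uint32_t)(*cursor & 0x3F);
        cursor++;
    }

    *nextCursor = cursor;          /* the iteration state goes back to the caller */
    return (wchar_t)codePoint;
}

/* Hypothetical caller: counts the characters in a null-terminated UTF-8
   string by calling the decoder once per character. No heap is involved. */
size_t countCharacters(const uint8_t* data)
{
    size_t count = 0;
    const uint8_t* cursor = data;
    while (*cursor != 0) {
        decodeUtf8_attempt2(cursor, &cursor);
        count++;
    }
    return count;
}
```

Notice that the caller owns the cursor variable, and the function only reads it and says where it should move to next.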
Although this function has changed a little since now it returns only one character at a time, it hasn’t really changed in essence. The new function interface itself is still just a particular implementation of our overarching conceptual interface:
sequence of bytes -> sequence of Unicode code characters
That is to say, we can still think of it as a function that accepts a sequence of inputs and returns a sequence of outputs – because that was our original requirement and this function fulfills that requirement. The state of the iteration is now contained outside the function itself, which is why we have the extra parameter, but the function still manages that state (calculating the next cursor and moving through the input bytes).
For those of you who are unconvinced about the idea of it still returning a sequence when it appears to return only one item, consider how this function could be seen as a generator. Each time it’s called, it will generate the next item in the output sequence. The parameters it requires are simply for persisting state between generator calls, and could be seen as “private” to the generator.
We could say that the data representing the sequence is no longer associated with a sequence of contiguous memory, but is instead “stored” in a more mysterious form. Something like a chronological sequence of return values.
So, is it better?
This function now doesn’t need to do any heap allocation at all, which could improve its performance. It also removes the problem of pointer ownership for the returned sequence: nothing is allocated, so there is no returned pointer for anyone to own.
But now the function is called many more times for the same sequence. Will this be a problem? Well, function calls on their own aren’t necessarily a problem, since the optimizer can inline calls and remove their overhead. For example, if the caller is outputting directly into some container or array in a tight loop, then the optimizer might inline the whole decodeUtf8 function. Of course it might not, so it may be a consideration for you. But word on the street is that most modern compilers are probably better than us humans at figuring out when a call should be inlined, so I think of this as a win.
There’s also a nice separation of concerns with this implementation. Since the function doesn’t loop, the number of test cases required to verify its behavior is much smaller. If it operates correctly on one character, and sets up the state correctly for the next character, then by induction it must work correctly for all following characters in the sequence.
So, we’re done?
Nope. This second attempt is much better than the first. But it leaves a lot to be desired.
For one thing, the input sequence must still be represented by a contiguous block of memory, which gives similar problems to what we thought we just solved.
Another problem with the input being a solid block of memory, which may not be immediately evident, is that the input and output sequences use inconsistent representations. The output is pulled by the caller “on demand”, while the input must already be there and waiting for use. This would be a problem if we wanted to stack multiple such functions together.
What if the input bytes come from a decompression function, while the output characters go to some parser function?
Now we have a problem. Since the output of one function doesn’t match the representation of the input to the next function (assuming that each layer looks a lot like our decodeUtf8_attempt2), we will again need containers to act as buffers between the functions.
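The glue this forces on us can be sketched roughly as follows. Everything here is hypothetical and invented for illustration: ByteSource and nextByte stand in for an earlier stage (such as a decompressor) that hands out one byte per call, and drainToBuffer is the buffering step needed before a function shaped like our second attempt can consume those bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical pull-style stage: produces its output one element per call,
   just like the output side of attempt 2. This stand-in walks fixed data. */
typedef struct {
    const uint8_t* data;
    size_t length;
    size_t position;
} ByteSource;

/* Returns 1 and stores the next byte in *out, or returns 0 when exhausted. */
int nextByte(ByteSource* source, uint8_t* out)
{
    assert(source != NULL && out != NULL);
    if (source->position >= source->length) return 0;
    *out = source->data[source->position++];
    return 1;
}

/* The glue: drain the whole source into a contiguous, null-terminated heap
   array. Returns NULL if allocation fails. The caller must free() the result. */
uint8_t* drainToBuffer(ByteSource* source)
{
    size_t capacity = 16;
    size_t length = 0;
    uint8_t* buffer = (uint8_t*)malloc(capacity);
    if (buffer == NULL) return NULL;

    uint8_t byte;
    while (nextByte(source, &byte)) {
        if (length + 1 >= capacity) {          /* keep room for the terminator */
            capacity *= 2;
            uint8_t* grown = (uint8_t*)realloc(buffer, capacity);
            if (grown == NULL) { free(buffer); return NULL; }
            buffer = grown;
        }
        buffer[length++] = byte;
    }
    buffer[length] = 0;                        /* terminator the decoder relies on */
    return buffer;
}
```

Only once the entire input has been drained can the result be passed as the cursor argument, so the heap allocation and the ownership question from the first attempt are back, just moved to the input side.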
What we need is a way to get the input and output to use the same philosophy, but without forcing the implementation of the function to use the heap as in our first attempt. This is what we’ll be looking at next time.