Month: September 2014

Sequences: Part 3

Sequences: Part 3

I’m taking a bit longer than usual to write up new blog posts recently since I’m in the process of moving from San Diego to Melbourne, Australia. Hence the photo of the linked-list sequence of Koalas above1. Things should get back to normal in a couple of posts from now, and I’ll let you know how the move goes!

 

But enough about my life – you’re here because software is awesome! And together we’re exploring the best ways of working with sequences. Today we’re going to try to write a function2 that decodes UTF-8, without using the heap, and keeping the input “shape” the same as the output shape, so that multiple similar functions can be stacked together (something I touched on at the end of the previous post).

Let me quickly try to describe what I mean having the input “shape” match the the output. I’ll take last week’s example of having a “decompress” function which feeds bytes into our beloved “decodeUtf8” function, which feeds characters into a “parse” function. Often what we land up with something like the following situation:

StackingBadShapes

That is, the “stack” of functions doesn’t fit together on it’s own. One function wants to push data to the next function (it wants to be in control), while the next function wants to pull from the previous (it wants to be in control instead). What we land up needing is something in between each layer of the stack. Something that doesn’t mind being pushed to and pulled from. Something doesn’t doesn’t take any control. This is normally a container, such as a list or buffer:

StackingBadShapes2

Each of our two attempts so far has manifested this problem slightly differently. In our first attempt, the function pulled bytes from an array, and pushed the resulting characters to an array. In that case the buffer was built into the function itself, so this push-pull conflict was absorbed, but its memory inefficiencies and lack of asynchrony were still an issue.

Our second attempt could be said to have rightfully had a “pull-pull” shape – as we want – since it pulled out of the input array, and the caller pulled characters from it one-at-a-time by calling it. The shape mismatch in that case was simply that the caller was forced to provide an array input, while the output was definitely not array. What was the output exactly?

This takes us into the land of iterators.

Iterators

For reference, here’s last week’s function again.

// Takes null-terminated UTF-8 encoded string at `cursor`
// Returns first code character at the cursor
// Outputs cursor of the next character at `nextCursor`
wchar_t decodeUtf8_attempt2(const uint8_t* cursor, const uint8_t** nextCursor);

And let’s consider how we might use it:

const uint8_t* cursor = init_decodeUtf8(document);

const uint8_t* nextCursor;
wchar_t c = decodeUtf8_attempt2(cursor, &nextCursor);

while (c != 0)
{
    // Use c
    // ...

    cursor = nextCursor;
    c = decodeUtf8_attempt2(cursor, &nextCursor);
}

This code assumes that we have some function “init_decodeUtf8” that gives you the initial cursor state for some document. Notice that our code here doesn’t interact directly with the value of the cursor state, it only interacts with the functions init_decodeUtf8 and decodeUtf8_attempt2. This is intentional, and is done to encapsulate the state of the cursor. That is to say, the state of the cursor is managed only by those two functions, which limits the number of places in the code you need to consider when you think “what state can the cursor be in?”. Although in this case the encapsulation is manually enforced (we have to just “know” not to interact with the cursor state outside those two functions), if we upgrade our example to C++ we can get the compiler to enforce the encapsulation and data hiding:

class Utf8Decoder
{
public:
    Utf8Decoder(Document document)
    {
        cursor = init_decodeUtf8(document);
    }

    wchar_t next()
    {
        return decodeUtf8_attempt2(this->cursor, &this->nextCursor);
    }

    
private:
    const uint8_t* init_decodeUtf8(Document document)
    {
        // ...
    }

    wchar_t decodeUtf8_attempt2(const uint8_t* cursor, const uint8_t** nextCursor)
    {
        // ...
    }

    const uint8_t* cursor; 
    const uint8_t* nextCursor;
};

This C++ class has a public interface (aka “surface area”) that exposes two functions: the constructor to initialize the object state, and a “next” function to progress the state to the next element and retrieve that element. The class encapsulates the state of the cursor, and prevents clients of the class from accidentally modifying the cursor.3

This class is an iterator. It provides a way for users of the function/class to iterate through the output sequence of characters. I would say that we haven’t made it into an iterator by making the class, but that we’ve just revealed the true nature of the original decodeUtf8_attempt2 function: it always was an iterator.

For those who are familiar with C++, you’ll probably notice the similarity between the Utf8Decoder class and a standard C++ input iterator.

For those who aren’t that familiar with C++, you may notice the similarity with Java’s Iterator<T> (and corresponding Iterable<T>), or C#’s IEnumerator<T> (and corresponding IEnumerable<T>). These are each codifications of the pattern that we described in part 2.

Attempt 3

So, we said that we wanted to try get the input and output “shapes” to be the same. Since we’ve now said that the output surface area is an iterator, we can be more specific and say that we want the input to also be accessed via an iterator, rather than directly passing it a whole array of input bytes.

This is actually quite a challenging problem, and I’m going to first choose C# as my tool of choice to represent the solution:

IEnumerable<char> decodeUtf8_attempt3(IEnumerable<byte> bytes);

This is perhaps the most elegant solution we’ve encountered so far. It’s the most direct representation of the original problem statement, and works entirely using iterators. The input and output are the same “shape”, and it would be very easy to pipe the result of one function into another that accepts a sequence of that type.

Now let’s also take a look at how we might do this in C. What we want to do is abstract the input to decodeUtf8 function, so that while the input could be an array, but it could also be another iterator. We also want this function itself to be an iterator of the same shape. What about this:

typedef uint8_t ByteSourceFunc(void* state);

// state must be of type decodeUtf8_state
void init_decodeUtf8(void* state, ByteSourceFunc input, void* input_state);

// state must be of type decodeUtf8_state, initialized by init_decodeUtf8
wchar_t decodeUtf8_attempt3(void* state);

struct decodeUtf8_state
{
    ByteSourceFunc* input;
    void* input_state;
};

This is quite awful, and requires some explanation. Firstly, the decodeUtf8_attempt3 looks very much the same as it did in attempt 2. This new decodeUtf8 function is expected to yield a new character every time it’s called, the same as before. The significant difference is that now the cursor state isn’t statically typed (it just uses void* to represent “any type”), and that it holds some sort of abstracted state (the input_state field). State does have a runtime type, and for this to work the state must be of type decodeUtf8_state. Why is it typed void if it must be decodeUtf8_state? It’s because the caller of decodeUtf8_attemp3 doesn’t know that it’s calling this specific function, but instead could be calling any function that produces characters while maintaining state.

The input to the iterator is provided when we initialize it, by calling init_decodeUtf8. We tell it what state to initialize, and where it must get its input data from. It must get its input data from another iterator function, and that function itself requires some iterator state which decodeUtf8_attempt3 needs to provide, so we pass that in.

This is quite awful, and if it doesn’t make complete sense to you, don’t worry. The point is that it gets incredibly difficult to write code in C that has abstract dependencies. Not only is the abstraction apparent at runtime, since every byte needs to be read through an indirection function call accessing indirect state data, but it’s also just less readable and really hard to get right.

C++ is only marginally better. It provides standard containers with iterators, but this doesn’t solve the problem of chaining functions together since most functions that act on sequences must pull from an input iterator and push to an output iterator. Most often you then need to have a container as a buffer to be able to “fit” these functions together. This can be good, but if you’re operating under tight memory constraints or dealing with asynchronous data then this typical approach can be a problem4.

C++ also provides template programming, which could allow you to have an abstract iterator input to a function, without the runtime overhead. But this is not easy to do, and although I would always suggest having functions that depend on abstractions, I would never recommend writing all your functions using C++ template programming to get those abstractions.

C# provided a much better solution to the eye, although at runtime there are many similarities between our C implementation and the C# one. For example, both will be using indirect function calls, and both provide a level of runtime abstraction.

We may have run out of options on this one. The languages have just let us down. There seems to be no way to get the efficiency, abstraction, and syntactic simplicity in the same package.

But that’s not the end of our journey. This pull-pull pattern is only one answer to dealing with sequences. Next time, we’ll turn the problem on its head and consider how to deal with sequences that are asynchronous. That is, sequences where you can’t pull data from the source, but instead the source pushes data to your function. For example, when you’re processing data from a network stream, you don’t want to have to wait for all the data to be present before starting to operate on it.


  1. Which, by the way, I do not have rights to, and could not track down its original source. No copyright infringement is intended, so if the photo is yours, please let me know. 

  2. As before, we’re don’t care about the implementation of the function, but more about writing a good interface to the function 

  3. You’ll note that the class is a little bit more verbose than it needs to be, because I’ve intentionally kept the init_decodeUtf8 and decodeUtf8_attempt2 functions as similar as possible to the original forms to show the equivalence between the object orientated way of looking at it and the functional way of looking at it. 

  4. Newer versions of C++ may be starting to deal with these problems, but it still isn’t nearly as neat as it could be 

Sequences: Part 2

Sequences: Part 2

In this series we’re looking at different ways of designing interfaces that interact with sequences. To investigate different interface design choices we’re using an example function which decodes UTF-8 encoded text – one that consumes a sequence of bytes, and produces the corresponding sequence of Unicode characters. Last time we considered a very simple design where the function interface simply accepted a null-terminated, heap-allocated byte array as an input argument, and returned a null-terminated, heap-allocated character array as output. Here it is again for reference:

wchar_t* decodeUtf8_attempt1(const uint8_t* data);

decodeUtf8-1

Remember that we’re only looking at the interface of the function, since that’s the most important part when it comes to modularity and maintainability. Last week we considered some of the problems with the design of this function’s interface. One of things we said was a problem is that the output sequence is passed as a fully populated heap-allocated array. This meant that our function would probably have to use the heap, which would add inefficiencies and possibly duplicated code for a double-pass over the input data. It also raises the concern of pointer ownership, and coupling the function caller to unnecessary implementation details.

So let’s try again with our second attempt.

Attempt 2

What happens if, instead of returning the whole output sequence at once in an array1, we instead return the output sequence one element at a time. For example, we might do this:

// Takes null-terminated UTF-8 encoded string at `cursor`
// Returns first code character at the cursor
// Outputs cursor of the next character at `nextCursor`
wchar_t decodeUtf8_attempt2(const uint8_t* cursor, const uint8_t** nextCursor);

Again, since this is a language-agnostic investigation, I’d like to just clarify some points for those who might be a little rusty with C/C++. The double asterisk in const uint8_t** means that nextCursor is an output parameter2. Both consts still mean that the input data is unchanged by the function.

So the function essentially accepts one argument: a pointer to the first byte of the UTF-8 data we wish to decode. It returns two outputs: the Unicode character represented by a wchar_t, and a pointer const uint8_t*. To decode a whole document or stream of data we would call the function multiple times – once for each Unicode character.

Although this function has changed a little since now it returns only one character at a time, it hasn’t really changed in essence. The new function interface itself is still just a particular implementation of our overarching conceptual interface:

sequence of bytes -> sequence of Unicode code characters

That is to say, we can still think of it as a function that accepts a sequence of inputs and returns a sequence of outputs – because that was our original requirement and this function fulfills that requirement. The state of the iteration is now contained outside the function itself, which is why we have the extra parameter, but the function still manages that state (calculating the next cursor and moving through the input bytes).

For those of you who are unconvinced about the idea of it still returning a sequence when it appears to return only one item, consider how this function could be seen as a generator. Each time it’s called, it will generate the next item in the output sequence. The parameters it requires are simply for persisting state between generator calls, and could be seen as “private” to the generator.

We could say that the data representing the sequence is no longer associated with a sequence of contiguous memory, but is instead “stored” in a more mysterious form. Something like a chronological sequence of return values.

So, is it better?

This function now doesn’t need to do any heap allocation at all, which could improve its performance. It also alleviates the problem of pointer ownership for the returned sequence, since there is no pointer because there isn’t any heap allocation.

But now the function is called many more times for the same sequence. Will this be a problem? Well, function calls on their own aren’t a problem, since the optimizer can inline many calls that aren’t necessary. For example if the caller was indeed outputting directly into some container or array in a tight loop, then the optimizer might inline the whole decodeUtf8 function. Of course it might not, so it may be a consideration for you. But word on the street is that most modern compilers are probably better than us humans at figuring out when a call should be inlined, so I think of this as a win.

There’s also a nice separation of concerns with this implementation. Since the function doesn’t loop, the number of test cases required to verify its behavior is much smaller. If it operates correctly on one character, and sets up the state correctly for the next character, then by induction it must work correctly for all following characters in the sequence.

So, we’re done?

Nope. This second attempt is much better than the first. But it leaves a lot to be desired.

For one thing, the input sequence must still be represented by a contiguous block of memory, which gives similar problems to what we thought we just solved.

Another problem with the input being a solid block of memory, which may not be immediately evident, is that the input and output sequences use inconsistent representations. The output is pulled by the caller “on demand”, while the input must already be there and waiting for use. This would be a problem if we wanted to stack multiple such functions together.

What if the input bytes come from a decompression function, while the output characters go so some parser function?

LayeredDecode

Now we have a problem. Since the output of one function doesn’t match the representation of the input to the next function (assuming that each layer looks a lot like our decodeUtf8_attempt2), we will again need containers to act as buffers between the functions.

What we need is a way to get the input and output to use the same philosophy, but without forcing the implementation of the function to use the heap as in our first attempt. This is what we’ll be looking at next time.


  1. Or, in other languages, most other container types such as lists, queues vectors, etc 

  2. The exact details are more complicated if you aren’t familiar with pointers, and in different situations it will mean different things.