Month: October 2013

Saying more than you mean

Saying more than you mean

Last week I brought up the seemingly simple question “What is a Variable?” in a new light. I concluded by saying that variables are actually a bad thing from the compiler’s perspective, because it’s difficult for the compiler to reason about the overall behavior of the program based on how variables change.

I’d like to continue this line of thought by looking at an example. How would you write some C code to sum up a bunch of numbers? You might do it like this:

int sum(int* numbers, int count)
{
    int accumulator = 0;
    for (int i = 0; i < count; i++)
        accumulator += numbers[i];
    return accumulator;
}

The objective of the function is simply to add the numbers together - but what does the code actually semantically say? What the compiler sees is basically equivalent to this:

int a(int* b, int c)
{
    int d, e;
    d = 0;
    e = 0;
f:
    if (e < c)
        d = d + *(b + e);
    else
        goto g;
    e = e + 1;
    goto f;
g:
    return e;
}

All I've done here is remove some of the things that the compiler doesn't understand in any deeper way, such as names, and the fact that the commonly seen pattern for (i = 0; i < count; i++) actually means that i is meant to iteratively be all the values from 0 to count - 1. The code above is semantically equivalent to the previous code, and the compiler doesn't understand one better than the other[1]. What do you think when you read it? Did you even read it? It's incredibly hard to figure out what this code is doing just by looking how variables are assigned and in what order they're assigned. The only reason the first piece of code makes more sense is because it uses humanly recognizable patterns and constructs (including the names of things).

What did you mean to say?

One of the problems with the above code (either of them - because they mean the same thing), is that it says too much. For example, it says that count is a 32-bit integer [2], as opposed to, say, a byte. This means it's saying to the caller "this function can add up to 2147483647 items together". Did we mean to say that?

It's also saying that the accumulator is a 32-bit integer, which means that if the mathematical sum is more than 2147483647 then the function will return some number other than the sum. Did we mean to say that? And if this was indeed intentional, do we mean this behavior to be different on a 64-bit platform rather than a 32-bit platform? If we call the function with only a few small numbers, do we still intend to use an accumulator that goes to 2 billion? What did we mean to say?

The function can only add a list that is specifically a memory-contiguous array of integers. It can't add a linked list of integers, it can't add a sequence of integers as they arrive from a network source.

Apart from the types, there is also an apparent order to the statements. The code reads like recipe - a sequence of instructions to be followed to get to the final answer (especially if you read the latter code, where a goto is like "skip to step 7" or "go back to step 3"). Did we mean to only start at the first element and not the last? Is it important that we sum the second element only after the first element[3]? Did we mean to initialize the accumulator before the loop variable?

Saying too much

So it seems it's possible the code actually says too much. That is, it says more than we intended it to say when we designed it, and it says more than it needs to say to get the job done. There are two problems with saying too much. The one problem is related to the humans who interact with the code, and the other is related to computers that interact with the code.

For those who write the code, they may be writing too much because they're filling in details that are actually irrelevant to their intended outcome. For those reading the code, they need to first translate the code into the actual intention of the original author, leaving out details that they think aren't intended or relevant to the overall picture.

A compiler has the same problem, except it can't simply ignore what the code actually says and instead try to guess what the author meant by it. It can't guess which details are relevant and which aren't. This means the compiler is incredibly restricted with what it can do when it implements the compiled output. With all of these unintended restrictions on the sum function, pretty much the only way to implement this function is by imitating the recipe step for step in machine code. The optimize may be able to omit a step if it's obviously not needed, but nothing major.

Isn't this just life?

It doesn't need to be this way. For an example, lets look at the equivalent sum function in the language of Haskell [4]:

s = foldl (+) 0 numbers

For those who don't know Haskell, the foldl does the "loop through all the numbers", the (+) means that we're adding each one as we go, and the 0 means we start the accumulator with 0. What I'd like to point out is how much this code doesn't say[5]. Almost everything it doesn't say is something that the function is not restricted by. For example, since we didn't say anywhere that the input had to be 32-bit integers, we can actually pass in an array[6] of anything that can be added together - for example floats. We also didn't say the accumulator must be 32-bit [7], or put any restrictions on the length of the array.

The Point

If we don't say what we don't mean, the code is better, cleaner, more flexible, and easier to use. It leaves more time and space to say more of the things that we actually do mean, and it leaves the compiler more room to be intelligent about how it actually implements the code at a machine level. We need to design languages that don't coerce you into saying things, and to help our fellow programmers understand what they're actually saying with the code they write.


[1] From a look at the C standard, I think there's also an issue of variable scope, but this isn't relevant to the point

[2] Obviously this is compiler dependent

[3] I don't know enough about C to say that the order of the summation in this particular case is strictly part of the semantics of the language without the array contents being marked volatile. For example, how would our function be expected to behave if another thread were to start changing items in the array in some particular order? I know that C historically has not been well defined for multi-threading situations, so it's likely the behavior is undefined.

[4] Taken from Rosetta Code

[5] Haskell isn't perfect either, and this code also implicitly says more than it should, but it's a good example.

[6] Actually it works on lists and not arrays, but lets pretend

[7] I think Haskell lets us down here because the accumulator size is probably defined by the output of the + operator and not some intelligent idea about how big the result really is

What is a Variable?

What is a Variable?

The other day I was thinking about how variables should be represented in a language. This got me down the strange path of wondering, “What, actually, is a variable?”

Most people get taught the concept of a variable very early on in programming. The analogy I hear used a lot is that of a box, or pigeon hole – it is a placeholder in which values can be put. Only a single value can be put in the placeholder at a time, so new values overwrite older values.

int x; // placeholder x
x = 5; // x has a value of 5
x = 6; // x has a value of 6
print(x);

Accessing a variable is accessing the value that was most recently put in it.

When looking at a low level language like C or C++, we can literally think of variables as memory locations. We can use the syntax &x to find out the memory location. We could literally open up our computer, and if we had a sufficiently good microscope (and one which can detect electric charges), and some crazy surgery tools, and a really good map, we could actually see the electrons in this memory location that make up the value. Changing the value of the variable will literally change the charges of the electrons or currents making that physical memory location. Of course, in a modern computer this isn’t quite as simple as I’ve made it sound – but certainly in simple embedded microcontrollers this is literally the case.

So a variable is a memory location?

The problem I have, is that I think a program can be fully defined in terms of input and output. That is, a program can be defined by how it interacts with its environment. A memory location is, generally speaking, unobservable by the “outside” – it is neither an input nor an output. Typically when a program changes a variable, I (the user outside the program) can’t tell that it changed. When you download photoshop, it’s description doesn’t say “Photoshop is a program that changes the following variables…”.

So a program is defined by its input and output, and a variable is not input nor output. So a variable is not part of the program definition. So what is it then?

A variable is a language feature. It is a part of the special language that we use to describe the program definition to the compiler.

Consider the following program:

void main()
{
    int x;
    x = 5;
    x = 6;
    printf("x: %i\n", x); // Outputs "x: 6"
}

I would say that the program definition is to output “x: 6”. Whether or not the compiler uses a memory location to do this is irrelevant – in fact it probably shouldn’t. The statement x = 6 is not part of the definition, but rather a description to the compiler to “remember for next time, that x is 6”.

So here’s a question: what happens to the statement x = 5? Is it optimized out?

No. I think it’s incorrect to think that x = 5 is optimized out. It’s better to say that x = 5 is always there in the code, but has no influence on definition of the program behavior and so won’t affect the outputted binary. x = 5 is never in the specification to start with. The description of the behavior (the code) says the program must output that last assigned value of x to the console. The last assigned value of x is 6, so “6” must be outputted to the console. The value 5 doesn’t even feature here. It doesn’t contribute to the program description, and so is as good as a blank line.

In this case there is no program input, so we’re only looking at the program output. The compiler doesn’t even need to consider x = 5 when it decides what program must do. For the sake of making the compiler itself easier to implement, it may indeed originally consider x = 5 as setting a memory location, and then optimized it out later. But this is a detail of how the compiler is implemented, and it need not be that way at all.

Why does it matter?

To the average programmer, this distinction shouldn’t matter at all. Whether a variable is a physical memory location, or whether a variable is a tool for a describing program behavior to the compiler, makes no difference most of the time.

To me it makes a difference in how you think about programs. It makes a difference about how upset you are when the compiler doesn’t generate exactly what you expected, or it “optimizes” out something that you intended it not to. These cases are not the compiler being stupid because it’s not doing the code you wrote, it’s the compiler being intelligent and implementing your description exactly as you put it. It’s actually all the other times that the compiler is being “stupid” – when it can’t think of how to implement your program except by applying the very basic fall-back rules like “make a memory location for each variable”.

This is more and more important as compilers get more and more “intelligent”. It’s also important when thinking about the design of new programming languages. One needs to think about what language features to provide that make it easiest for the for compiler to be intelligent. Variables are not one of these features – it’s very difficult to reason about the behavior of a program by looking at how variables change, and very easy to instead dumbly map variables to memory locations.

Subscribing Member Functions to C Callbacks

Subscribing Member Functions to C Callbacks

Today I’m going to talk about another complaint I have with C/C++, regarding the incompatibility between member functions and standard functions.

The synopsis goes like this. For compatibility with C, I have some C++ callback functions that look like this:

void foo(void* state, int eventCode);

That is, they have a state argument which represents any persistent state that the subscriber wishes to have passed back to it later when the event is fired. For illustration, the subscription function might look something like this:

typedef void (*CallbackFunc)(void* state, int eventCode);

void subscribe(CallbackFunc callback, void* state)
{
    // add the callback to a list of functions to be called
}

And we can subscribe to the event like this:

void* myState = ...; // Some state we want passed to the callback
subscribe(&foo, myState);

I won’t even get into the problem of state “ownership” in this post – that’s a whole different problem that drives me up the wall in C and C++. When it comes to function pointers, there is an incompatibility between the above function and this member function:

class State
{
    void foo(int eventCode);
};

In other words we can’t subscribe the member function to the same event:

// Lets say we have some state here
State* myState = new State();

// We can't subscribe like this:
subscribe(State::foo, myState);

// Or like this
subscribe(myState.foo);

If you’re a C++ programmer, which you probably are (at least a little I’m sure), you’re probably thinking I’m crazy right now – “Of course you can’t – why would you even try?”

The reason it could work is simply because of the underlying mechanism of member functions. Everyone knows that there is a “hidden” argument passed to a (non-static) member function invocation – the value for the this parameter. Exactly how this invisible argument gets passed is not part of the standard, so it doesn’t have to use the same mechanism as a normal parameter, but for the sake of allowing this kind of compatibility I would have suggested that it be standardized to be passed the same way as a normal parameter.

Another way of approaching the problem is to look at C# for inspiration – a “function pointer” (a delegate in C#) can have any state attached to it. So a member function is the same as a static function and the same as a closure or multi-cast delegate etc, provided that the parameter types fit. To me this is the opposite way of looking at it: instead of saying that a member function has an additional invisible this parameter, we now say that actually every function has a this parameter which provides the attached state, but in the case of a non-member function it is just ignored by the callee.

This is probably not strictly what happens in C#, but it is a reasonable way of looking at it. If we were to apply this to C++, then the compiler could compile a hidden this pointer to all functions (including non-member functions). The optimizer would remove it again for all functions where it isn’t needed for whatever reason (such as for those functions whose addresses aren’t taken in the program). This would make it possible to have the expression myState.foo actually be a valid expression – it would evaluate to function pointer (along with its implicit state myState).

Of course this second option adds overhead to every function pointer, because it now has to point to both the function itself, and to the attached state. I think that if I were to redesign C++ from scratch, that I would probably accept this overhead and do it anyway. Perhaps I would include a back-door option to create a “no-state” function and it’s corresponding pointer, which would be compatible with C functions. And I would specify that the calling convention used to pass the this parameter is the same as that used for the first parameter of a “no-state” function. This would make C and C++ function pointers seamlessly interoperable.

Why does it bug me?

It bugs me because stateful events, callbacks, or continuations are part of any good design, and C++ makes them incredibly difficult to implement. C++ claims to be object orientated, but yet its member function pointers don’t refer to the object they’re a member of. So C++ function pointers are really C function pointers – except that they’re type-incompatible with C function pointers. So really they aren’t useful on their own in either context! It seems to me this is a massive oversight of the original language.