Saying more than you mean
Last week I brought up the seemingly simple question “What is a Variable?” in a new light. I concluded by saying that variables are actually a bad thing from the compiler’s perspective, because it’s difficult for the compiler to reason about the overall behavior of the program based on how variables change.
I’d like to continue this line of thought by looking at an example. How would you write some C code to sum up a bunch of numbers? You might do it like this:
int sum(int* numbers, int count)
{
    int accumulator = 0;
    for (int i = 0; i < count; i++)
        accumulator += numbers[i];
    return accumulator;
}
The objective of the function is simply to add the numbers together - but what does the code actually semantically say? What the compiler sees is basically equivalent to this:
int a(int* b, int c)
{
    int d, e;
    d = 0;
    e = 0;
f:
    if (e < c)
        d = d + *(b + e);
    else
        goto g;
    e = e + 1;
    goto f;
g:
    return d;
}
All I've done here is remove some of the things that the compiler doesn't understand in any deeper way, such as names, and the fact that the commonly seen pattern for (i = 0; i < count; i++) actually means that i is meant to take each of the values from 0 to count - 1 in turn. The code above is semantically equivalent to the previous code, and the compiler doesn't understand one any better than the other[1]. What do you think when you read it? Did you even read it? It's incredibly hard to figure out what this code is doing just by looking at how variables are assigned and in what order. The only reason the first piece of code makes more sense is that it uses humanly recognizable patterns and constructs (including the names of things).
What did you mean to say?
One of the problems with the above code (either version - they mean the same thing) is that it says too much. For example, it says that count is a 32-bit integer [2], as opposed to, say, a byte. This means it's saying to the caller "this function can add up to 2147483647 items together". Did we mean to say that?
It's also saying that the accumulator is a 32-bit integer, which means that if the mathematical sum is more than 2147483647 then the function will return some number other than the sum. Did we mean to say that? And if this was indeed intentional, do we mean the behavior to be different on a 64-bit platform than on a 32-bit platform? If we call the function with only a few small numbers, do we still intend to use an accumulator that goes up to 2 billion? What did we mean to say?
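For contrast, here's a minimal sketch of my own (not from the original) of a variant that says something different about the accumulator. The name sum64 and the use of int64_t are illustrative assumptions; because count is at most 2147483647 and each element fits in 32 bits, a 64-bit running total can never overflow, so this version always returns the exact mathematical sum - and it says so.

#include <stdint.h>

/* Hypothetical variant: same loop, but the accumulator is explicitly 64-bit.
   Since count <= INT_MAX and each element fits in 32 bits, the running total
   can never leave the int64_t range, so the result is always the exact
   mathematical sum. */
int64_t sum64(int* numbers, int count)
{
    int64_t accumulator = 0;
    for (int i = 0; i < count; i++)
        accumulator += numbers[i];
    return accumulator;
}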
The function can only add a list that is specifically a memory-contiguous array of integers. It can't add a linked list of integers, and it can't add a sequence of integers as they arrive from a network source.
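To make that restriction concrete, here's a sketch of what summing a linked list would look like. The struct node type and the name sum_list are hypothetical, invented here for illustration; the intent is identical to sum, yet a whole separate function is needed just because the numbers are stored differently.

#include <stddef.h>

/* A hypothetical node type; the article doesn't define one. */
struct node {
    int value;
    struct node* next;
};

/* Same intent as sum, rewritten only because the numbers live in a linked
   list rather than a contiguous array. */
int sum_list(struct node* head)
{
    int accumulator = 0;
    for (struct node* n = head; n != NULL; n = n->next)
        accumulator += n->value;
    return accumulator;
}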
Apart from the types, there is also an apparent order to the statements. The code reads like a recipe - a sequence of instructions to be followed to get to the final answer (especially in the latter version, where a goto is like "skip to step 7" or "go back to step 3"). Did we mean to start at the first element rather than the last? Is it important that we sum the second element only after the first[3]? Did we mean to initialize the accumulator before the loop variable?
Saying too much
So it seems the code actually says too much. That is, it says more than we intended it to say when we designed it, and more than it needs to say to get the job done. There are two problems with saying too much. One concerns the humans who interact with the code, and the other concerns the computers that interact with it.
Those who write the code may be writing too much because they're filling in details that are actually irrelevant to their intended outcome. Those reading the code must first translate it back into the original author's intention, discarding the details they judge to be unintended or irrelevant to the overall picture.
A compiler has the same problem, except it can't simply ignore what the code actually says and guess what the author meant by it. It can't tell which details are relevant and which aren't. This means the compiler is incredibly restricted in what it can do when it produces the compiled output. With all of these unintended restrictions on the sum function, pretty much the only way to implement it is by imitating the recipe step for step in machine code. The optimizer may be able to omit a step if it's obviously not needed, but nothing major.
Isn't this just life?
It doesn't need to be this way. As an example, let's look at the equivalent sum function in Haskell [4]:
s numbers = foldl (+) 0 numbers
For those who don't know Haskell, the foldl does the "loop through all the numbers" part, the (+) means that we're adding each one as we go, and the 0 means we start the accumulator at 0. What I'd like to point out is how much this code doesn't say[5]. Almost everything it doesn't say is something the function is not restricted by. For example, since we didn't say anywhere that the input had to be 32-bit integers, we can actually pass in an array[6] of anything that can be added together - for example floats. We also didn't say the accumulator must be 32-bit [7], or put any restrictions on the length of the array.
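To make that concrete, here's a minimal sketch of my own (not from Rosetta Code) that simply adds an explicit type signature to the same definition. The signature Num a => [a] -> a is all the code commits to: the elements just have to be something that can be added together.

-- The only stated requirement is "a list of things that can be added".
s :: Num a => [a] -> a
s numbers = foldl (+) 0 numbers

main :: IO ()
main = do
  print (s [1, 2, 3 :: Int])             -- machine integers: 6
  print (s [1.5, 2.5 :: Double])         -- floats: 4.0
  print (s [2 ^ 62, 2 ^ 62 :: Integer])  -- no fixed-size accumulator to overflow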
The Point
If we don't say what we don't mean, the code is better: cleaner, more flexible, and easier to use. It leaves more time and space to say the things we actually do mean, and it leaves the compiler more room to be intelligent about how it implements the code at the machine level. We need to design languages that don't coerce you into saying things you don't mean, and to help our fellow programmers understand what they're actually saying with the code they write.
[1] From a look at the C standard, I think there's also an issue of variable scope, but this isn't relevant to the point.
[2] Obviously this is compiler-dependent.
[3] I don't know enough about C to say that the order of the summation in this particular case is strictly part of the semantics of the language without the array contents being marked volatile. For example, how would our function be expected to behave if another thread were to start changing items in the array in some particular order? I know that C historically has not been well defined for multi-threading situations, so it's likely the behavior is undefined.
[4] Taken from Rosetta Code.
[5] Haskell isn't perfect either, and this code also implicitly says more than it should, but it's a good example.
[6] Actually it works on lists and not arrays, but let's pretend.
[7] I think Haskell lets us down here because the accumulator size is probably defined by the output of the + operator and not some intelligent idea about how big the result really is.