Monthly Archives: June 2014

Needy Code

Today I was reading about an implementation of a MultiDictionary in C# – a dictionary that can have more than one value with the same key, perhaps something like the C++ std::multimap. It all looks pretty straight forward: a key can have multiple values, so when you index into the MultiDictionary you get an ICollection<T> which represents all the values corresponding to the key you specified. This got me wondering, why did the author choose to use an ICollection<T> instead of an IEnumerable<T>? This wasn’t a question specific to MultiDictionary, but more of a philosophical musing about the interfaces between different parts of a program.

What is IEnumerable/ICollection?

In case you’re not familiar with these types, the essence is that IEnumerable<T> represents a read-only set of items (of type T), which can be sequentially iterated using foreach. Through the use of LINQ extension methods we can interact with IEnumerable<T> at high level of abstraction. For example we can map an IEnumerable<T> xusing the syntax x.Select(...), and there are methods for many other higher-order functions. Another great thing about IEnumerable<T> is that its type parameter T is covariant. This means that if a function is asking for an IEnumerable<Animal>, you can pass it an IEnumerable<Cat> – if it needs a box of animals, a box of cats will do just fine.

An ICollection<T> is an IEnumerable<T> with some extra functionality for mutation – adding and removing items from the collection. This added mutability comes at the cost of covariance: the type parameter of ICollection<T> is invariant. If a function needs an ICollection<Animal>, I can’t give it an ICollection<Cat> – if it needs a storage for animals, I can’t give it a cat storage because it might need to store an elephant.

What to choose?

IEnumerable<T> is a more general type than ICollection<T> – more objects can be described by IEnumerable<T> than by ICollection<T>. The philosophical question is: should you use a more general type or a more specific type when defining the interface to a piece of code?

So, if you’re writing a function that produces a collection of elements, is it better to use IEnumerable or ICollection1 (assuming that both interfaces validly describe the objects you intend to describe)? If you produce an IEnumerable then you have fewer requirements to fulfill, but are free to add more functionality that isn’t covered by IEnumerable. That is, even if the code implements all the requirements of ICollection, you may still choose to deliberately only expose IEnumerable as a public interface to your code. You are committing to less, but free to do more than you commit to. This seems like the better choice: you’re not tied down because you’ve made fewer commitments. In the future if you need to change something, you have more freedom to make more changes without using a different interface.

If you are consuming a collection of elements, then the opposite is true. If you commit to only needing an IEnumerable then you can’t decide later you now need more – like deciding you need to mutate it. You can demand more and use less of what you demanded, but you can’t demand less and use more than you asked for. From this perspective, even if your code only relies on the implementation of IEnumerable it seems better to expose the need for ICollection in its interface because it doesn’t tie it down as much.

This strikes me to be very similar to producers and consumers in the real world. The cost of producing a product benefits from committing to as little as possible, while the “cost” of consuming a product benefits from demanding as much as possible. But there is also another factor: a consumer who demands too much may not be able to find the product their they’re looking for. That is, more general products can be used by more people. In software terms: a function or class with very specific requirements can only be used in places where those very specific requirements are met – it’s less reusable.

Is it Subjective?

Perhaps this all comes down to a matter of taste or circumstance. Code that’s very “needy”2 is less reusable, but easier to refactor without breaking the interface boundaries. Code that’s very “generous”((“Generous” = produces very specific types but consumes very general types)) is more reusable but more restricted and brittle in the face of change.

I personally prefer to write code that is very generous. That is, given the same block of code, I write my interfaces to the code such that I consume the most general possible types but produce the most specific possible types. In many ways this seals the code – it’s very difficult to change it without breaking the interface. But on the other hand, it makes different parts more compatible and reusable. If I need to make a change, I either change the interface (and face the consequences), or I simply write a new function with a different interface.

I also think that it may be better to write very generous interfaces to code that is unlikely to change and will often be reused, such as in libraries and frameworks. Domain-specific code which is likely to change across different versions of the program may benefit from being more needy.

MultiDictionary

Although this whole discussion is not specifically about the MultiDictionary, the idea was seeded by my encountering of MultiDictionary. So lets spend a moment considering how I might have implemented MultiDictionary myself.

Firstly, most of this discussion so far has been centered around how you would define the interface, or function signature, of a piece of code that either produces or consumes generic types. I haven’t really gone into what that code should actually look like, but rather what its interface should look like given that the code already has intrinsic demands and guarantees. If we apply this to the MultiDictionary, then it probably has its most generous signature already. The code already implements ICollection, so it’s generous to expose it in its interface. The dictionary itself doesn’t really consume any interfaces beyond it’s basic insert and delete functions, so we can’t talk about its generosity as a consumer3.

On the other hand, if I had written the code myself I possibly would have only exposed IEnumerable. Not because I write needy interfaces4, but because it just seems like a better representation of the underlying principle of a multi-dictionary. As the code stands, there are two different ways of adding a value to the dictionary: one by calling the add method on the dictionary, and one by calling the add method of the collection associated with a key. This duplication strikes me as flaw in the abstraction: the MultiDictionary is probably implemented using a Dictionary and some kind of collection type, such as a List (I imagine Dictionary<TKey, List<TValue>>). It could just as easily, although perhaps not as efficiently, have been implemented as a List of key-value pairs (List<KeyValuePair<TKey,TValue>>). Would the author still have exposed ICollection<TValue>? Probably not. This strikes me as a leaky abstraction in some ways, and I would probably have avoided it. In my mind, exposing IEnumerable<TValue> is a better choice in this particular case. The neediness would actually simplify the interface, and make it easier to reason about.

Perhaps this is about not saying more than you mean. You define upfront what you mean by a MultiDictionary, and then you describe exactly that in the interface, before implementing it in code. If you mean it to be a set of collections, then ICollection is great. If you mean it to be a Dictionary which happens to support duplicate keys, then ICollection seems to say too much.

 


  1. For the remainder of the post I will use the term IEnumerable to mean IEnumerable<T>, and the same for ICollection 

  2. “Needy” = produces very general types but consumes very specific types 

  3. Actually, I could argue that since ICollection is mutable and accepts data in both directions, the exposing of ICollection makes it a consumer in a way that IEnumerable wouldn’t, but I won’t go into that 

  4. Perhaps a better word here would be stingy 

Everything you don’t say

Recently I was reading an old blog post I came across about productivity variations among software developers, where he says that “researchers have found 10-fold differences in productivity and quality between different programmers with the same levels of experience and also between different teams working within the same industries“. My immediate instinct as I was reading it was that it makes perfect sense, and it’s supported by my own experience, and from what I hear from others in the software industry.

But at the end of the post, he talks about a study which compares productivity of the teams for Excel and Lotus, and makes between their productivity in lines of code:

 Excel’s team produced about 13,000 lines of code per staff year. Lotus’s team produced 1,500 lines of code per staff year. 

It’s interesting the huge difference in lines of code per staff-year, but is that really a measure of productivity? It’s not clear from the post whether a high LOC-per-year is considered high or low productivity, but it seems to mean high productivity in this context. If this is so, then I am probably the least productive person in my team. I make an effort to write as few lines of code as possible.

In fact for the last few months my average LOC per day is probably negative, if you count deletion of lines as negative lines produced1. That sounds terrible, but it’s because I not only try to write as few lines as possible, but I also simplify other people’s verbose code as I go through it. It’s not uncommon for me to rewrite existing code using 5 or 10 times fewer lines (typically simpler lines).

Let me clarify though. Writing fewer lines of code is not my goal, it’s a side effect. What I try to do is write simpler code. Code that’s easier to reason about. Code that’s more pleasant to maintain. Code that has fewer bugs just because it’s simpler. What generally results from this is fewer lines of code. I do consider fewer lines of code to be a goal, but only in service of readability.

Leverage

Another reason why fewer lines of code can be a good sign, is when you consider how much leverage each line of code has. I don’t know if leverage is the right word here, so let me explain. What I mean is simply a small piece of code that has a big effect, because of some amplification mechanism. For example, a bad coder might repeat almost the same line of code 20 times, while a good coder will instead write a for-loop which amplifies his single line of code to cover 20 cases2. A good programmer will use abstractions like this (and polymorphism, generics, frameworks, code generators, DSLs, etc) to amplify small pieces of code to great effects. A good programmer is not one who can write 1000 lines of code, but one who knows how to write 1 line of code in just the right way that it can be reused 1000 times. If he continues to write another 999 lines of this kind of code, he will have produced 1000 times more than the man who wrote the 1000 lines which had only one use34.

This sort of leverage is not linear, because good software engineers can leverage the leverage. They can abstract the abstractions. Any time you find yourself repeating something, you write code to do it for you. The more abstractly you can recognize these patterns, the more generally the solutions will apply and the less code you’ll need to write to get things done (although you have to be careful to not make code more complicated in all the abstraction).

Conclusion

I do understand that getting a quantitative measurement of programming productivity can be incredibly difficult. If it was easy, I could just give potential employers my objective productivity “score” in various areas of expertise, and they could simply hire me according to how this fits their objectives. It would also make it easier to pay software engineers what they’re individually worth. But I think “number of lines of code” is terrible variable to proxy for productivity. Especially when you look at the extremes of the spectrum, where people are really productive or where people get very little done.

This doesn’t change the fact that, at least for me, a better solution is one with fewer lines of code. Every line of code you don’t write is valuable, and often this value outweighs the cost of planning how to not write that line.


  1. If you didn’t count deletion as negative, then the best way to be productive is simply to delete and rewrite the same function every day for the rest of your life 

  2. You might think this is a bad example because nobody would write the non-for-loop code, but it’s exactly the situation I found just the other day 

  3. It’s difficult to qualify the word “use”, since this doesn’t need to refer to the programmer himself reusing the code, but could also just refer to the number of times it’s executed, like in the example of the for-loop 

  4. You might think 1000-times reusability is ridiculous, but there are number of places where I’ve achieved this level of “amplification”. They seem to normally manifest themselves as frameworks 

Thoughts about Random Access

Recently I’ve been thinking about what it means for a data structure to support random access (aka “direct access”). This may seem so obvious that it’s not worth thinking about: an array supports “random access” because you can access any element with O(1) complexity by just calculating the elements position and going straight there. A linked list doesn’t have this property because the only way to get to an element is through other elements that it’s linked to. But let’s explorer this idea further.

Here’s one way Wikipedia explains random access quite well:

As a rule the assumption is that each element can be accessed roughly as easily and efficiently as any other, no matter how many elements may be in the set, nor how many coordinates may be available for addressing the data

Let me add another note to this which is perhaps obvious but not normally specified: we’re talking about runtime complexity. If we had a linked list completely defined at compile-time then I would still classify it as a “random access”, even though linked lists are normally not considered to be random access1.

So what is it about an array gives us random access? Let’s take a guess: it’s because the elements of the array are stored contiguously in memory.

ContiguousElements

This seems like a reasonable first guess. Because “element 2” is known to be after “element 1” which is after “element 0”, we can calculate where “element 2” is. To get the position of element 2, we multiply the size of an element by 2, and move to that binary position in the array.2

But what if the elements aren’t all the same size?3

ContiguousVariableElements

If element 1 in the array might be larger than another then we can’t multiply anymore so we need to somehow sequentially access the elements, right? Well, no. If the size of element 1 is known at compile time then we can simply offset this amount when calculating the positions of any element that comes after it.

Perhaps I deceived you when I said all the elements aren’t the same size in this example: I didn’t say whether or not there was static information about the sizes of elements. Lets say that we have an array, which we’ll define such that every second element is 4 bytes big, and every other element is 8 bytes big. To calculate the binary position of an even element in the array we just use index / 2 * 12 + (start of array) and to calculate the position of an odd element in the array we just say index / 2 * 12 + (start of array + 4) (assuming some implicit integer truncation). These are both constant time operations.

So to sum up, it’s not about whether the array elements are the same size, or how they are laid out in memory. It’s about what we know at compile time vs what calculations needs to be delayed until runtime, and how complicated those runtime calculations need to be. If we go back to our original definitions then this isn’t anything new – it’s just a restatement of our definition of random access. But it does make me think a bit differently about it. 


  1. This would be quite difficult to do in most languages. You could do it in C++ with template meta programming 

  2. Actually it’s not the size of the element that counts, but the pitch of the array – there might be padding bytes between elements to help properly align them. But let’s just imagine the “size” includes the padding between the elements 

  3. I don’t know any compiler that will generate arrays of elements that don’t all have the same inline value size, but we’re not so much talking about what compilers do as about what could be done