Dragons in the Algorithm
Adventures in Programming
by Michael Chermside

How to talk about Data Structures

I've been interviewing lately for "senior programmer" positions. I find it outright astonishing how many applicants for these positions lack what I would consider the most fundamental of programming skills. Some people have suggested using FizzBuzz (a trivial programming exercise) as a filter; what I use is a discussion of data structures.

I don't expect the people I'm hiring to be experts in the behavior of obscure data structures. I don't expect them to be able to invent something like XOR linked lists on their own. But I do wish they could describe the features of common data structures and talk about them in a way that demonstrates they understand the advantages and disadvantages of different choices.

The first key idea behind data structures is the distinction between interface and implementation. Of course, this is fundamental to all object-oriented programming, but it is particularly visible in the area of data structures. An interface is the specification of what a data structure can do; there can be, and often are, several implementations (actual working code) that satisfy the same interface. For instance, a "list" (an ordered collection of objects) can be implemented using an array, a doubly-linked list, or a hashmap, and each has different strengths and weaknesses. By separating WHAT needs to be done from HOW it is implemented, you allow the choice of "how" to be revised later without difficulty.
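
To make that separation concrete, here is a minimal Java sketch (my illustration only; the class and method names are made up for the example). The helper method is written purely against the List interface, so either implementation can be dropped in without changing it.

    import java.util.ArrayList;
    import java.util.LinkedList;
    import java.util.List;

    public class ListInterfaceDemo {
        // This method only knows about the List interface (WHAT),
        // not which implementation backs it (HOW).
        static void addNames(List<String> names) {
            names.add("Ada");
            names.add("Grace");
        }

        public static void main(String[] args) {
            List<String> arrayBacked = new ArrayList<>();   // array implementation
            List<String> linkedBacked = new LinkedList<>(); // doubly-linked implementation

            // The same code works with either; swapping implementations later
            // means changing only the constructor call above.
            addNames(arrayBacked);
            addNames(linkedBacked);

            System.out.println(arrayBacked);
            System.out.println(linkedBacked);
        }
    }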

The second key idea behind data structures is that implementations vary in two main ways, speed and space, and that both need to be measured using complexity theory ("Big-O notation").

In most cases, any structure or algorithm will do when the size of things is small. Even randomsort doesn't take very long to sort a list of length 6 on modern hardware. Building from this insight, complexity theory makes no attempt to measure just how long something takes; instead it measures how that running time increases with some key size (like the number of items in the data structure). If something takes O(items in the collection) time, that means (roughly) that the time is proportional to the size of the collection, for large enough collections. The thing that varies is usually given a name, so one will say "with this structure, re-sorting takes O(N) time" where N is defined (or implied) to be the number of items in the collection.
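
As a rough Java illustration of what that kind of statement means in practice (the particular classes here are just examples I'm picking, not anything prescribed above): looking an item up by scanning a list is O(N), while looking it up in a hash set is O(1) on average, so only the latter stays flat as N grows.

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class BigODemo {
        public static void main(String[] args) {
            int n = 1_000_000;                      // N = number of items in the collection
            List<Integer> list = new ArrayList<>();
            Set<Integer> set = new HashSet<>();
            for (int i = 0; i < n; i++) {           // building either one is O(N)
                list.add(i);
                set.add(i);
            }

            // contains() on the list scans every element: O(N) per lookup,
            // so doubling N roughly doubles the time taken.
            boolean foundInList = list.contains(n - 1);

            // contains() on the hash set is O(1) on average:
            // the time stays roughly flat as N grows.
            boolean foundInSet = set.contains(n - 1);

            System.out.println(foundInList + " " + foundInSet);
        }
    }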

These are the fundamental ways a programmer should think about any data structure: first separate the interface from the particular implementation details, then evaluate the implementation in terms of how long various operations take (and how much space is required), paying attention only to the asymptotic behavior. Thus one might think: "For this list, I'll be spending most of the time accessing the items out of order, so an array, where random access is O(1), would be a good choice." Or else: "In that list I'll often be inserting items at random positions within the list, so I'd better not use an array where insertion takes O(N)... I'd be better off with a linked list where insertion takes O(1) once the insertion point has been found."
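
Here is a small Java sketch of that style of reasoning (purely illustrative; the classes, names, and data are my own choices): pick the implementation whose costs match the operations the code will actually perform.

    import java.util.ArrayList;
    import java.util.LinkedList;
    import java.util.List;
    import java.util.ListIterator;

    public class AccessPatternDemo {
        public static void main(String[] args) {
            // Mostly out-of-order reads: an array-backed list gives O(1) get(i).
            List<String> readHeavy = new ArrayList<>(List.of("a", "b", "c", "d"));
            String third = readHeavy.get(2);                 // O(1) random access

            // Mostly insertions in the middle: a linked list splices in O(1)
            // once an iterator is positioned at the right spot
            // (walking to that spot by index is still an O(N) step).
            List<String> insertHeavy = new LinkedList<>(List.of("a", "b", "d"));
            ListIterator<String> it = insertHeavy.listIterator(2); // walk to position
            it.add("c");                                     // O(1) splice at the iterator

            System.out.println(third + " " + insertHeavy);
        }
    }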

My team is hiring right now, and unfortunately, much like those who recommend the use of FizzBuzz as a bozo filter, I too find that an enormous proportion of the applicants (even those with "8 years experience at J2EE development") are simply incapable of doing that reasoning. So if you are hoping to get hired on my team, expect to be answering a question about basic data structures. My dream is to someday hire a developer who can think: "Gee, this list is built using random insertions, then it is accessed in a random order. What kind of data structure could I use that would do both efficiently?"

Posted Sun 17 February 2008 by mcherm in Programming