Dragons in the Algorithm
Adventures in Programming
by Michael Chermside

How Even Immutables are Hard with Threads

Armin Rigo has a blog posting (worthy of an article of its own) proposing using STM (Software Transactional Memory) in PyPy. In a discussion on reddit, someone suggested that you could have weaker threading guarantees and just use locks manually. It wouldn't be so hard, they explained, because:

You really only have to do it for data that is not read-only. I would for example say that it's pretty rare for classes to change after they have been set up for the first time (presumably before any threads are even started), making the class basically read-only, which could be safely shared across threads.

I wanted to give a detailed response explaining why this approach is naive. Actually, it has been tried before and failed. It may work OK with certain kinds of languages (mostly "functional" languages), but it fails with other kinds of languages, and Python is an extreme example of the kind of language where it won't work.

For an example, consider Java. The JVM (Java Virtual Machine) has special features that were added to support exactly this behavior, but in practice few programmers use them. Let's take a really simple example: suppose you create some data structure and a function to initialize it. In thread A you create the object, then initialize it, then pass it off to existing threads B and C. Threads B and C simultaneously read stuff from the data structure in ways that WOULD be dangerous except that the data structure is immutable after initialization.
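The hand-off pattern described above might look like this in Python (the class and field names here are hypothetical, just to make the pattern concrete):

```python
import threading

class Config:
    """A data structure intended to be read-only after setup."""
    def __init__(self):
        self.settings = {}  # populated during initialization

def reader(config, results, idx):
    # Threads B and C read concurrently. This is safe ONLY IF the
    # structure is fully initialized, and never modified, by the
    # time they run -- which is exactly the assumption in question.
    results[idx] = config.settings["mode"]

# Thread A: create the object, then initialize it...
config = Config()
config.settings["mode"] = "fast"   # the initialization step

# ...then pass it off to threads B and C.
results = [None, None]
threads = [threading.Thread(target=reader, args=(config, results, i))
           for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # ['fast', 'fast'] here, because the threads are started
                # only after initialization finishes and we join() before
                # reading; the rest of the post is about why ordering like
                # this is NOT guaranteed by the hardware in general
```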

The problem is that the guarantees provided in threading are MUCH weaker than you think. It's not just that there are different threads all working at the same time, reading and writing the same memory locations; the architecture of modern CPUs makes a simple shared view of memory impossible. You see, it takes hundreds of times longer to read something from memory or write it to memory than it takes to process something in the registers. So to execute "X = Y + 1", the computer COULD spend 100 cycles reading Y, then 1 cycle adding 1, then 100 cycles writing X, for a total of 201 cycles. But that would be unbearably slow. Instead, it takes 100 cycles to do a bulk read of the whole memory area around where Y is stored into high-speed caches. It takes another 100 cycles to do a bulk read of the whole memory area around where X is stored. It takes 1 cycle to add, then 100 cycles to do a bulk write of the memory area containing X. That's 301 cycles... which sounds even worse.

But it's NOT worse if the compiler cheats. Instead, it spends 100 cycles reading Y and 100 cycles reading X. Then it executes the +1 in one cycle. Then, BEFORE writing out X, it does some OTHER calculations on the chunks of memory that have been read in. If the program has good cache locality (active objects are near each other in memory), it may get 75 cycles of useful work done before it needs to spend 100 cycles to "flush the cache out" (write X and the other things that were updated). That's a total of 375 cycles to do 75 operations' worth of work, or just 5 cycles per operation -- a LOT better than 201!
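The cycle counts above (which are illustrative round numbers, not measurements of any real CPU) tally up like this:

```python
# Naive approach: every single operation pays the full memory latency.
naive = 100 + 1 + 100            # read Y, add, write X = 201 cycles

# Batched approach: read two cache lines, do ~75 operations' worth of
# work in registers, then flush the dirty cache line back once.
batched_total = 100 + 100 + 75 + 100   # = 375 cycles
ops = 75
per_op = batched_total / ops           # = 5.0 cycles per operation

print(naive, batched_total, per_op)
```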

But in order to do this, the compiler has to "cheat". It has to execute bits of work out of order, although it takes special precautions to make sure that it gets the same answer as if it had executed them in the order written -- as seen by THIS thread. But as seen by a DIFFERENT thread, the steps may appear to happen in a very different order. The other thread won't see the effects until they get flushed to main memory, and that doesn't happen after every computation (unless the program is willing to run 100x too slow!!!).

WHEW!! Big wall of text there, but the story should explain why one thread in a program may see the computations by another thread happen in a different order. So imagine this:

"In thread A you create the object, then initialize it, then pass it off to existing threads B and C."

But imagine that from thread C's point of view, A created it, then passed it off, and only initializes it LATER. In fact, perhaps C will start using it at the same time that A is initializing it -- so it's not really immutable, and terrible errors result. This is NOT just a theoretical risk: I have written real code that exhibited this behavior when running on a multi-core machine.
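The conventional fix is to publish the object through a synchronization primitive, which acts as the memory barrier the bare hand-off lacks. Here is a minimal sketch using `threading.Event` (the names are hypothetical; in CPython the GIL already hides much of this, but the structure is what the locking approach would require everywhere):

```python
import threading

class Box:
    def __init__(self):
        self.value = None

box = Box()
ready = threading.Event()  # the synchronization point

def writer():
    box.value = 42     # initialization by thread A
    ready.set()        # publish: everything written before set() is
                       # guaranteed visible to a thread that returns
                       # from wait()

def reader(out):
    ready.wait()       # blocks until the writer has published
    out.append(box.value)

out = []
t_reader = threading.Thread(target=reader, args=(out,))
t_writer = threading.Thread(target=writer)
t_reader.start()       # reader may start first; it will block in wait()
t_writer.start()
t_writer.join()
t_reader.join()
print(out)  # [42] -- never [None], because wait() orders the read
            # after the write
```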

In order to help protect against this, the Java language added a special exception to the Java threading model. Despite all other threading rules, if a field is declared "final", then the value assigned to it within the constructor is guaranteed to be visible once the constructor ends, EVEN AS SEEN BY OTHER THREADS (so long as the object reference doesn't escape during construction). In theory, this is a great tool for creating data structures ahead of time and then reading them after initialization from other threads.

But in practice it isn't so good. Initializing everything within the constructor of an immutable object turns out to be a real pain. Often you really want to use a hashtable (not immutable), or use Spring injection to populate your objects after the constructor, or a slew of other choices that make it hard to stuff all your setup code inside of constructors. Some languages support this better: Scala and Clojure are examples of languages on the JVM that use this feature well, but in Java it is awkward because there's little special support for working with immutable objects. Python, as a language, is even worse: there is no way to make a user-defined object truly immutable in Python! So while it might be possible to do as the commenter suggests (create special locks and use them around every __init__ method, then carefully make sure nothing is modified outside of __init__), the resulting language wouldn't really read like Python.
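To see why Python can't enforce the "set up once, then read-only" discipline, consider that nothing stops any code, in any thread, from mutating an object after __init__ (the class here is a made-up illustration):

```python
class Settings:
    """Intended to be immutable after __init__... but Python
    provides no way to actually enforce that intent."""
    def __init__(self, mode):
        self.mode = mode

s = Settings("fast")
s.mode = "slow"        # any code can rebind the attribute later
print(s.mode)          # slow

# Even builtin "immutable" containers only freeze the top level;
# the objects they contain remain mutable:
frozen = ([1, 2],)     # a tuple holding a list
frozen[0].append(3)    # the tuple didn't stop this
print(frozen)          # ([1, 2, 3],)
```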

Posted Tue 23 August 2011 by mcherm in Programming