Dragons in the Algorithm
Adventures in Programming
by Michael Chermside

Immutable Trees and Threading Evil - Part 2

In part 1 I introduced immutable trees; this time I will talk about threading evil. If you haven't yet read part 1, go back and do so now.

We will start by recapping the argument showing that there cannot be a loop. This argument went as follows:

The RandomTreeBuilder and the random behavior of NodePool conspire to create an unpredictable mess, but we do know at least this much:

  1. The TreeNodes are immutable. The left and right children are passed in to the constructor, and they can never be modified later (because they are private and have only getters). So if there is no loop when the node is first created, there never will be.
  2. Since the left and right nodes are passed into the constructor, they must have been created before the new parent node. They cannot be created after or at the same time or they wouldn't be available to pass in to the constructor.
  3. "Created before" is transitive: if all children must be created before their parent, then every node must be created before every ancestor.
  4. No node can be created before itself, since that creation happened just once. But if there were a loop, then a node is an ancestor of itself. Thus, there cannot be any loops.
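For reference, the TreeNode from part 1 was shaped roughly like this (a sketch reconstructed from the description above, not the exact code — private children, getters only, and, crucially for what follows, no "final" modifier):

```java
// Sketch of TreeNode as described in part 1: the children are private and
// have only getters, so a node cannot be modified after it is constructed.
// Note that the fields are NOT declared final -- that detail turns out to
// matter by the end of this article.
public class TreeNode {
    private TreeNode left;
    private TreeNode right;

    public TreeNode(TreeNode left, TreeNode right) {
        this.left = left;
        this.right = right;
    }

    public TreeNode getLeft() { return left; }
    public TreeNode getRight() { return right; }
}
```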

It sounds very plausible. The fallacy in the argument is in step 2, where it concludes that the child nodes must have been created before the parent. With threads in the mix, it is possible that the child nodes weren't created before, after, OR at the same time: threading allows incomparable events that are none of these. The official Java memory model (other languages behave similarly, but I know the Java one in detail so that's what I'm describing) states that "happens before" is a partial order: certain events "happen before" others, but some pairs of events are simply incomparable. It seems like a very odd thing to permit, so I will start by explaining the reason.

Suppose that you have the following snippet of code:

{
    int sum = a + b;      // Line 1
    int product = a * b;  // Line 2
    setSum(sum);          // Line 3
    setProduct(product);  // Line 4
}

If the compiler were rather simpleminded, then the low-level code for this could read something like this:

# -- Line 1 --
1. Load variable "a" into register 1.
2. Load variable "b" into register 2.
3. Add register 1 and register 2; store in register 3.
4. Write register 3 to memory location for "sum".
# -- Line 2 --
5. Load variable "a" into register 1.
6. Load variable "b" into register 2.
7. Multiply register 1 and register 2; store in register 3.
8. Write register 3 to memory location for "product".
# -- Line 3 --
9. Load variable "sum" into register 1.
10. Invoke "setSum" with argument from register 1.
# -- Line 4 --
11. Load variable "product" into register 1.
12. Invoke "setProduct" with argument from register 1.

[NOTE: this example is made up—real assembly code doesn't look like this. But the basic idea is accurate.]

If we assume that an add or a multiply takes 1 clock cycle, while a memory read or write takes 100 clock cycles, then this takes 802 clock cycles (8 memory operations at 100 cycles each, plus 1 cycle each for the add and the multiply), plus the time for the subroutines. Now imagine an optimizing compiler that does this instead:

1. Load variable "a" into register 1.
2. Load variable "b" into register 2.
3. Add register 1 and register 2; store in register 3.
4. Invoke "setSum" with argument from register 3.
5. Multiply register 1 and register 2; store in register 3.
6. Invoke "setProduct" with argument from register 3.

That version takes 202 clock cycles (just 2 memory reads, plus the add and the multiply), plus the time for the subroutines. It also uses less memory: there is no need to write the temporary values sum and product out to memory at all. If a and b had already been in registers, it might have taken only _2_ clock cycles!

There are enormous performance gains available if compiler-writers are allowed to reorder statements for efficiency. They take into account registers, levels of caching, and other factors when doing so. In the above example it is not exactly clear whether "line 2" occurred before or after "line 3", but the end result is guaranteed to be the same as if everything HAD occurred in order.

So let's imagine that an optimizing compiler is reordering the statements in the run() method of the RandomTreeBuilder class. This time we won't try to reason out why it might choose one order or the other; we'll just accept that it picks the ordering it thinks will go fastest. Here is the original code:

for (int i=0; i<NUM_NODES_PER_THREAD; i++) {
    TreeNode leftChild = nodePool.takeRandomNode();
    TreeNode rightChild = nodePool.takeRandomNode();
    nodePool.putNode(new TreeNode(leftChild, rightChild));
}

Here is a straightforward same-as-the-text ordering:

1. Invoke nodePool.takeRandomNode(); store in register 1.
2. Invoke nodePool.takeRandomNode(); store in register 2.
3. Allocate memory for new TreeNode(); store address in register 3.
4. Invoke TreeNode.<init>() with argument registers 3, 1 and 2.
5. Invoke nodePool.putNode() with argument from register 3.

And here is another possible ordering:

1. Allocate memory for new TreeNode(); store address in register 3.
2. Invoke nodePool.putNode() with argument from register 3.
3. Invoke nodePool.takeRandomNode(); store in register 1.
4. Invoke nodePool.takeRandomNode(); store in register 2.
5. Invoke TreeNode.<init>() with argument registers 3, 1 and 2.

Perhaps they were swapped because TreeNode.<init>() and nodePool.takeRandomNode() both clobber register 3 so it's more efficient to call nodePool.putNode() first. It doesn't really matter what the reason is, just that the JVM is entitled to order it this way if it wants to.

But now imagine that there are two different threads doing this simultaneously. Thread A executes steps 1 and 2, storing its new (not-yet-initialized) node in slot 23 of the NodePool. Simultaneously, thread B executes steps 1 and 2, storing its new node in slot 75. Moments later, thread A executes steps 3 and 4, and one of its takeRandomNode() calls happens to pick slot 75, while one of thread B's picks slot 23. When each thread then executes step 5, invoking the constructor, we have created two nodes, each of which is a child of the other.

And THIS is why threads are "evil". The behavior under the covers is complicated in ways that are not intuitive. If you stick to some well-defined practices (e.g., if NodePool were synchronized using locks) then threads will work just fine, but if you don't then you can get very strange behavior which is nearly impossible to debug.
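For instance, a lock-based NodePool might look like the sketch below. The class and method names come from this series; the list-based internals (and the behavior when the pool is empty) are my assumptions. Because both methods are synchronized on the same lock, every putNode() happens-before any takeRandomNode() that observes its node, so no thread can ever see a partially-constructed TreeNode:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch of a lock-based NodePool. Both methods synchronize on the pool
// object itself, so all takes and puts are totally ordered by the lock.
public class NodePool {
    private final List<TreeNode> nodes = new ArrayList<>();
    private final Random random = new Random();

    public synchronized void putNode(TreeNode node) {
        nodes.add(node);
    }

    public synchronized TreeNode takeRandomNode() {
        // Assumed behavior: return null when the pool is empty.
        if (nodes.isEmpty()) {
            return null;
        }
        return nodes.remove(random.nextInt(nodes.size()));
    }
}

// Minimal TreeNode so this sketch compiles on its own; see part 1 for
// the real definition.
class TreeNode {
    private final TreeNode left;
    private final TreeNode right;

    public TreeNode(TreeNode left, TreeNode right) {
        this.left = left;
        this.right = right;
    }

    public TreeNode getLeft() { return left; }
    public TreeNode getRight() { return right; }
}
```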

Before I wrap up, there's something I mentioned early in part 1 that I want to come back to. Immutable objects like the TreeNode we designed are extremely handy things to use, particularly in threaded programs, because an immutable object is completely threadsafe. The problem I described here arises because the "immutable object" isn't really immutable: in this case another thread sees the partially-constructed value before all of the fields are set. Well, the Java memory model contains a special-case rule to make it easier to use immutable objects. This rule says that (unless the programmer passes the partially-constructed object to another thread from within the constructor) any fields which are declared "final" will be seen by other threads as set before the constructor exits. That rule would have prohibited the reordering that caused our loop.

If we declare the "left" and "right" fields to be "final", then loops in the tree become impossible just as we originally wanted. Yes, true immutability triumphs over evil! err.... I mean over threads.
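Concretely, the only change needed is the "final" modifier on the two fields (the same sketch as before, shown here with that one change):

```java
// TreeNode with final fields. Under the Java memory model's final-field
// rule, any thread that obtains a reference to this node (provided the
// constructor did not leak "this") is guaranteed to see left and right
// fully initialized -- so the loop-creating reordering is forbidden.
public class TreeNode {
    private final TreeNode left;
    private final TreeNode right;

    public TreeNode(TreeNode left, TreeNode right) {
        this.left = left;
        this.right = right;
    }

    public TreeNode getLeft() { return left; }
    public TreeNode getRight() { return right; }
}
```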

Posted Wed 19 December 2007 by mcherm in Programming