yaakov.online

yaakov.online


I fight with computers

Red-Green Trees

One of the many things I do at work is contribute towards one of our internal build tools. Without going into too many details, at a high level this build tool:

  1. Analyses all our project files
  2. Figures out what each project's output is
  3. Figures out what each project's dependencies are
  4. Builds a dependency tree/graph based on these dependencies
  5. Compiles each project in dependency order, with as much parallelism as possible

A while ago I was considering changing part of this tool to use immutable data structures, but it appears to be impossible to create an immutable tree. For example, consider the following example code (in C#1):

sealed class Tree<TNodeValue>  
{
    public Tree(TreeNode<TNodeValue> rootNode)
    {
        RootNode = rootNode;
    }

    public TreeNode<TNodeValue> RootNode { get; }
}

sealed class TreeNode<TValue>  
{
    public TreeNode(TValue value, IEnumerable<TreeNode<TValue>> children, TreeNode<TValue> parent)
    {
        Value = value;
        Children = children;
        Parent = parent;
    }

    public TValue Value { get; }
    public IEnumerable<TreeNode<TValue>> Children { get; }
    public TreeNode<TValue> Parent { get; }
}

This looks like a decent starting point, but if you try and actually use this tree you'll notice two pretty big limitations:

  1. A TreeNode must be initialised with a Parent value (if it's not the root), hence the tree must be created top-down.
  2. A TreeNode must be initialised with it's Children, hence the tree must be created bottom-up.

We now have a nice Tree library that nobody can use because we've effectively deadlocked the consumer into needing to create two different objects which point to each-other, at the exact same time.

After pondering the issue for quite some time, this co-worker pointed me to a concept in Roslyn (the C# / Visual Basic compiler) called a Red-Green tree. There are two major sources of information on this - this StackOverflow answer about the re-use of SyntaxNodes within Roslyn, and this blog post by Eric Lippert about the internals of Roslyn's syntax tree.

Effectively, what the Roslyn team have done with a Red-Green tree is create two trees, one wrapping the other.

The "Green" tree is created bottom-up by creating a node that has a value and children.

The "Red" tree is created top-down, and each "Red" node contains a "Green" node and a parent reference.

In C#, this would resemble the following:

sealed class RedGreenTree<T>  
{
        public RedGreenTree(GreenNode<T> rootNode)
        {
                this.rootNode = rootNode;
        }

        readonly GreenNode<T> rootNode;

        public RedNode<T> RootNode => new RedNode<T>(rootNode, null);
}

sealed class GreenNode<T>  
{
        public GreenNode(T value, IEnumerable<GreenNode<T>> children)
        {
                Value = value;
                Children = children;
        }

        public T Value { get; }
        public IEnumerable<GreenNode<T>> Children { get; }
}

sealed class RedNode<T>  
{
        public RedNode(GreenNode<T> value, RedNode<T> parent)
        {
                this.value = value;
                Parent = parent;
        }

        readonly GreenNode<T> value;

        public T Value => value.Value;
        public IEnumerable<RedNode<T>> Children => value.Children?.Select(c => new RedNode<T>(c, this)) ?? Enumerable.Empty<RedNode<T>>();
        public RedNode<T> Parent { get; }
}

Where the magic happens is inside RedNode<T>.Children. When you enumerate it's children, it creates a new RedNode<T> for each child, passing itself in as the parent, and returns that collection.

This means that:

  1. External consumers will never see a GreenNode<T>. This is purely an implementation detail.
  2. As long as you hold a reference to a RedNode<T>, you can guarantee that neither the parent, children or value will change. This makes it easier to reason about concurrency and multithreading, as it is immutable.

For a full implementation, we would still need some further details such as:

However, this overly simplified implementation works (yay!), and has helped me to understand this interesting data structure.

I'm hoping I can expand upon this in future and use it in a few places, such as the above-mentioned build tool.

Lastly, if like me you were wondering where "Red" and "Green" come from in this data structure, Eric Lippert explains this in his blog post (link above):

Incidentally, these are called “red/green trees” because those were the whiteboard marker colours we used to draw the data structure in the design meeting. There’s no other meaning to the colours.

  1. I was going to use Swift for this blog post but the Playgrounds editor, SourceKit and the LLDB server kept crashing on me. 🙄