The binary search tree (BST) is among the most important data structures. Previous concurrent implementations of balanced BSTs without locks either used coarse-grained transactions, which limit concurrency, or lacked rigorous proofs of correctness. In this paper, we describe a general technique for implementing any data structure based on a down-tree (a directed acyclic graph of indegree one), with updates that modify any connected subgraph of the tree atomically. The resulting implementations are non-blocking, which means that some process is always guaranteed to make progress, even if processes crash. Our approach drastically simplifies the task of proving correctness. This makes it feasible to develop provably correct implementations of non-blocking balanced BSTs with fine-grained synchronization (i.e., with updates that synchronize on a small constant number of nodes).
As with all concurrent implementations, the implementations obtained using our technique are more efficient if each update to the data structure involves a small number of nodes near one another. We call such an update localized. We use operation to denote an operation of the abstract data type (ADT) being implemented by the data structure. Operations that cannot modify the data structure are called queries. For some data structures, such as Patricia tries and leaf-oriented BSTs, operations modify the data structure using a single localized update. In some other data structures, operations that modify the data structure can be split into several localized updates that can be freely interleaved.
A particularly interesting application of our technique is to implement relaxed-balance versions of sequential data structures efficiently. Relaxed-balance data structures decouple updates that rebalance the data structure from operations, and allow updates that accomplish rebalancing to be delayed and freely interleaved with other updates. For example, a chromatic tree is a relaxed-balance version of a red-black tree (RBT) which splits up the insertion or deletion of a key and any subsequent rotations into a sequence of localized updates. There is a rich literature on relaxed-balance versions of sequential data structures, and several papers have described general techniques that can be used to easily produce them from large classes of existing sequential data structures. The small number of nodes involved in each update makes relaxed-balance data structures perfect candidates for efficient implementation using our technique.
We provide a simple template that can be filled in to obtain an implementation of any update for a data structure based on a down-tree. We prove that any data structure that follows our template for all of its updates will automatically be linearizable and non-blocking. The template takes care of all process coordination, so the data structure designer is able to think of updates as atomic steps.
To demonstrate the use of our template, we provide a complete, provably correct, non-blocking linearizable implementation of a chromatic tree, which is a relaxed-balance version of a RBT. To our knowledge, this is the first provably correct, non-blocking balanced BST implemented using fine-grained synchronization. Our chromatic trees always have height $O(c + \log n)$, where $n$ is the number of keys stored in the tree and $c$ is the number of insertions and deletions that are in progress (Section 5.3).
We show that sequential implementations of some queries are linearizable, even though they completely ignore concurrent updates. For example, an ordinary BST search (that works when there is no concurrency) also works in our chromatic tree. Ignoring updates makes searches very fast. We also describe how to perform successor queries in our chromatic tree, which interact properly with updates that follow our template (Section 5.5).
We show experimentally that our Java implementation of a chromatic tree rivals, and often significantly outperforms, known highly-tuned concurrent dictionaries, over a variety of workloads, contention levels and thread counts. For example, with 128 threads, our algorithm outperforms Java’s non-blocking skip-list by 13% to 156%, the lock-based AVL tree of Bronson et al. by 63% to 224%, and a RBT that uses software transactional memory (STM) by 13 to 134 times (Section 6).
2 Related Work
There are many lock-based implementations of search tree data structures. (See [1, 9] for state-of-the-art examples.) Here, we focus on implementations that do not use locks. Valois sketched an implementation of non-blocking node-oriented BSTs from CAS. Fraser gave a non-blocking BST using 8-word CAS, but did not provide a full proof of correctness. He also described how multi-word CAS can be implemented from single-word CAS instructions. Ellen et al. gave a provably correct, non-blocking implementation of leaf-oriented BSTs directly from single-word CAS. A similar approach was used for $k$-ary search trees and Patricia tries. All three used the cooperative technique originated by Turek, Shasha and Prakash and by Barnes. Howley and Jones used a similar approach to build node-oriented BSTs. They tested their implementation using a model checker, but did not prove it correct. Natarajan and Mittal give another leaf-oriented BST implementation, together with a sketch of correctness. Instead of marking nodes, it marks edges. This enables insertions to be accomplished by a single CAS, so they do not need to be helped. It also combines deletions that would otherwise conflict. None of these trees is balanced, so the height of a tree with $n$ keys can be $\Omega(n)$.
Tsay and Li gave a general approach for implementing trees in a wait-free manner using LL and SC operations (which can, in turn, be implemented from CAS). However, their technique requires every process accessing the tree (even for read-only operations such as searches) to copy an entire path of the tree starting from the root. Concurrency is severely limited, since every operation must change the root pointer. Moreover, an extra level of indirection is required for every child pointer.
Red-black trees [5, 18] are well-known BSTs that have height $\Theta(\log n)$, where $n$ is the number of keys. Some attempts have been made to implement RBTs without using locks. It was observed that the approach of Tsay and Li could be used to implement wait-free RBTs and, furthermore, this could be done so that only updates must copy a path; searches may simply read the path. However, the concurrency of updates is still very limited. Herlihy et al. and Fraser and Harris experimented on RBTs implemented using software transactional memory (STM), which only satisfied obstruction-freedom, a weaker progress property. Each insertion or deletion, together with the necessary rebalancing, is enclosed in a single large transaction, which can touch all nodes on a path from the root to a leaf.
Some researchers have attempted fine-grained approaches to build non-blocking balanced search trees, but they all use extremely complicated process coordination schemes. Spiegel and Reynolds  described a non-blocking data structure that combines elements of B-trees and skip lists. Prior to this paper, it was the leading implementation of an ordered dictionary. However, the authors provided only a brief justification of correctness. Braginsky and Petrank  described a B+tree implementation. Although they have posted a correctness proof, it is very long and complex.
In a balanced search tree, a process is typically responsible for restoring balance after an insertion or deletion by performing a series of rebalancing steps along the path from the root to the location where the insertion or deletion occurred. Chromatic trees, introduced by Nurmi and Soisalon-Soininen , decouple the updates that perform the insertion or deletion from the updates that perform the rebalancing steps. Rather than treating an insertion or deletion and its associated rebalancing steps as a single, large update, it is broken into smaller, localized updates that can be interleaved, allowing more concurrency. This decoupling originated in the work of Guibas and Sedgewick  and Kung and Lehman . We use the leaf-oriented chromatic trees by Boyar, Fagerberg and Larsen . They provide a family of local rebalancing steps which can be executed in any order, interspersed with insertions and deletions. Moreover, an amortized constant number of rebalancing steps per Insert or Delete is sufficient to restore balance for any sequence of operations. We have also used our template to implement a non-blocking version of Larsen’s leaf-oriented relaxed AVL tree . In such a tree, an amortized logarithmic number of rebalancing steps per Insert or Delete is sufficient to restore balance.
There is also a node-oriented relaxed AVL tree by Bougé et al. , in which an amortized linear number of rebalancing steps per Insert or Delete is sufficient to restore balance. Bronson et al.  developed a highly optimized fine-grained locking implementation of this data structure using optimistic concurrency techniques to improve search performance. Deletion of a key stored in an internal node with two children is done by simply marking the node and a later insertion of the same key can reuse the node by removing the mark. If all internal nodes are marked, the tree is essentially leaf-oriented. Crain et al. gave a different implementation using lock-based STM  and locks , in which all deletions are done by marking the node containing the key. Physical removal of nodes and rotations are performed by one separate thread. Consequently, the tree can become very unbalanced. Drachsler et al.  give another fine-grained lock-based implementation, in which deletion physically removes the node containing the key and searches are non-blocking. Each node also contains predecessor and successor pointers, so when a search ends at an incorrect leaf, sequential search can be performed to find the correct leaf. A non-blocking implementation of Bougé’s tree has not appeared, but our template would make it easy to produce one.
3 LLX, SCX and VLX Primitives
The load-link extended (LLX), store-conditional extended (SCX) and validate-extended (VLX) primitives are multi-word generalizations of the well-known load-link (LL), store-conditional (SC) and validate (VL) primitives and they have been implemented from single-word CAS . The benefit of using LLX, SCX and VLX to implement our template is two-fold: the template can be described quite simply, and much of the complexity of its correctness proof is encapsulated in that of LLX, SCX and VLX.
Instead of operating on single words, LLX, SCX and VLX operate on Data-records, each of which consists of a fixed number of mutable fields (which can change), and a fixed number of immutable fields (which cannot). LLX($r$) attempts to take a snapshot of the mutable fields of a Data-record $r$. If it is concurrent with an SCX involving $r$, it may return Fail instead. Individual fields of a Data-record can also be read directly. An SCX($V, R, \mathit{fld}, \mathit{new}$) takes as arguments a sequence $V$ of Data-records, a subsequence $R$ of $V$, a pointer $\mathit{fld}$ to a mutable field of one Data-record in $V$, and a new value $\mathit{new}$ for that field. The SCX tries to atomically store the value $\mathit{new}$ in the field that $\mathit{fld}$ points to and finalize each Data-record in $R$. Once a Data-record is finalized, its mutable fields cannot be changed by any subsequent SCX, and any LLX of the Data-record will return Finalized instead of a snapshot.
Before a process invokes SCX($V, R, \mathit{fld}, \mathit{new}$) or VLX($V$), it must perform an LLX($r$) on each Data-record $r$ in the sequence $V$. The last such LLX by the process is said to be linked to the SCX or VLX, and the linked LLX must return a snapshot of $r$ (not Fail or Finalized). An SCX by a process modifies the data structure only if each Data-record $r$ in $V$ has not been changed since its linked LLX($r$); otherwise the SCX fails. Similarly, a VLX($V$) returns True only if each Data-record $r$ in $V$ has not been changed since its linked LLX($r$) by the same process; otherwise the VLX fails. VLX can be used to obtain a snapshot of a set of Data-records. Although LLX, SCX and VLX can fail, their failures are limited in such a way that we can use them to build non-blocking data structures. See  for a more formal specification of these primitives.
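The interplay between linked LLXs and a later SCX can be illustrated with a small sequential model. The sketch below is only a toy: it is not the CAS-based implementation of the primitives, it runs in a single thread (so LLX never returns Fail), and the names `DataRecord`, `Process`, `version` and `target` are assumptions introduced for illustration. A per-record version counter stands in for "has not been changed since its linked LLX".

```python
FAIL, FINALIZED = "Fail", "Finalized"

class DataRecord:
    """Toy Data-record: mutable fields in a dict, plus bookkeeping (hypothetical)."""
    def __init__(self, **mutable):
        self.fields = dict(mutable)  # mutable fields
        self.version = 0             # bumped whenever an SCX changes this record
        self.finalized = False

class Process:
    """Remembers, per record, the version observed by its most recent LLX."""
    def __init__(self):
        self.linked = {}

    def llx(self, r):
        if r.finalized:
            return FINALIZED
        self.linked[id(r)] = r.version   # this LLX becomes the linked one
        return dict(r.fields)            # snapshot of the mutable fields

    def _unchanged(self, V):
        return all(self.linked.get(id(r)) == r.version and not r.finalized
                   for r in V)

    def vlx(self, V):
        return self._unchanged(V)

    def scx(self, V, R, target, fld, new):
        # `target` is the record in V that owns the mutable field named `fld`.
        assert target in V and all(r in V for r in R)
        if not self._unchanged(V):
            return False                 # some record changed since its linked LLX
        target.fields[fld] = new
        target.version += 1
        for r in R:
            r.finalized = True
        return True

# Two processes race to change the same child pointer.
p, q = Process(), Process()
leaf_a, leaf_b = DataRecord(), DataRecord()
node = DataRecord(left=leaf_a)
p.llx(node); p.llx(leaf_a)
q.llx(node)
ok_p = p.scx([node, leaf_a], [leaf_a], node, "left", leaf_b)  # succeeds
ok_q = q.scx([node], [], node, "left", leaf_a)                # node changed: fails
```

In this run the second SCX fails because `node` was modified after the LLX linked to it, and any later LLX on `leaf_a` reports Finalized.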
These new primitives were designed to balance ease of use and efficient implementability using single-word CAS. The implementation of the primitives from CAS in  is more efficient if the user of the primitives can guarantee that two constraints, which we describe next, are satisfied. The first constraint prevents the ABA problem for the CAS steps that actually perform the updates.
Constraint 1: Each invocation of SCX($V, R, \mathit{fld}, \mathit{new}$) tries to change $\mathit{fld}$ to a value $\mathit{new}$ that it never previously contained.
The implementation of SCX does something akin to locking the elements of $V$ in the order they are given. Livelock can be easily avoided by requiring all $V$ sequences to be sorted according to some total order on Data-records. However, this ordering is necessary only to guarantee that SCXs continue to succeed. Therefore, as long as SCXs are still succeeding in an execution, it does not matter how $V$ sequences are ordered. This observation leads to the following constraint, which is much weaker.
Constraint 2: Consider each execution that contains a configuration $C$ after which the value of no field of any Data-record changes. There is a total order of all Data-records created during this execution such that, for every SCX whose linked LLXs begin after $C$, the $V$ sequence passed to the SCX is sorted according to the total order.
It is easy to satisfy these two constraints using standard approaches, e.g., by attaching a version number to each field, and sorting sequences by any total order, respectively. However, we shall see that Constraints 1 and 2 are automatically satisfied in a natural way when LLX and SCX are used according to our tree update template.
Under these constraints, the implementation of LLX, SCX, and VLX in  guarantees that there is a linearization of all SCXs that modify the data structure (which may include SCXs that do not terminate because a process crashed, but not any SCXs that fail), and all LLXs and VLXs that return, but do not fail.
We assume there is a Data-record $\mathit{entry}$ which acts as the entry point to the data structure and is never deleted. This Data-record points to the root of a down-tree. We represent an empty down-tree by a pointer to an empty Data-record. A Data-record is in the tree if it can be reached by following pointers from $\mathit{entry}$. A Data-record $r$ is removed from the tree by an SCX if $r$ is in the tree immediately prior to the linearization point of the SCX and is not in the tree immediately afterwards. Data structures produced using our template automatically satisfy one additional constraint:
Constraint 3: A Data-record is finalized when (and only when) it is removed from the tree.
Under this additional constraint, the implementation of LLX and SCX in  also guarantees the following three properties.
If LLX($r$) returns a snapshot, then $r$ is in the tree just before the LLX is linearized.
If an SCX($V, R, \mathit{fld}, \mathit{new}$) is linearized and $\mathit{new}$ is (a pointer to) a Data-record, then this Data-record is in the tree immediately after the SCX is linearized.
If an operation reaches a Data-record $r$ by following pointers read from other Data-records, starting from $\mathit{entry}$, then $r$ was in the tree at some earlier time during the operation.
These properties are useful for proving the correctness of our template. In the following, we sometimes abuse notation by treating the sequences $V$ and $R$ as sets, in which case we mean the set of all Data-records in the sequence.
The memory overhead introduced by the implementation of LLX and SCX is fairly low. Each node in the tree is augmented with a pointer to a descriptor and a bit. Every node that has had one of its child pointers changed by an SCX points to a descriptor. (Other nodes have a Nil pointer.) A descriptor can be implemented to use only three machine words after the update it describes has finished. The implementation of LLX and SCX in  assumes garbage collection, and we do the same in this work. This assumption can be eliminated by using, for example, the new efficient memory reclamation scheme of Aghazadeh et al. .
4 Tree Update Template
Our tree update template implements updates that atomically replace an old connected subgraph in a down-tree by a new connected subgraph. Such an update can implement any change to the tree, such as an insertion into a BST or a rotation used to rebalance a RBT. The old subgraph includes all nodes with a field (including a child pointer) to be modified. The new subgraph may have pointers to nodes in the old tree. Since every node in a down-tree has indegree one, the update can be performed by changing a single child pointer of some node $\mathit{parent}$. (See Figure 1.) However, problems could arise if a concurrent operation changes the part of the tree being updated. For example, nodes in the old subgraph, or even $\mathit{parent}$, could be removed from the tree before $\mathit{parent}$’s child pointer is changed. Our template takes care of the process coordination required to prevent such problems.
Each tree node is represented by a Data-record with a fixed number of child pointers as its mutable fields (but different nodes may have different numbers of child fields). Each child pointer points to a Data-record or contains Nil (denoted by in our figures). For simplicity, we assume that any other data in the node is stored in immutable fields. Thus, if an update must change some of this data, it makes a new copy of the node with the updated data.
Our template for performing an update to the tree is fairly simple: An update first performs LLXs on nodes in a contiguous portion of the tree, including $\mathit{parent}$ and the set $R$ of nodes to be removed from the tree. Then, it performs an SCX that atomically changes the child pointer as shown in Figure 1 and finalizes nodes in $R$. Figure 2 shows two special cases where $R$ is empty. An update that performs this sequence of steps is said to follow the template.
We now describe the tree update template in more detail. An update UP that follows the template shown in Figure 3 takes any arguments, $\mathit{args}$, that are needed to perform the update. UP first reads a sequence of child pointers starting from $\mathit{entry}$ to reach some node $n_0$. Then, UP performs LLXs on a sequence $\sigma = \langle n_0, n_1, \ldots \rangle$ of nodes starting with $n_0$. For maximal flexibility of the template, the sequence $\sigma$ can be constructed on-the-fly, as LLXs are performed. Thus, UP chooses a non-Nil child of one of the previous nodes to be the next node of $\sigma$ by performing some deterministic local computation (denoted by NextNode in Figure 3) using any information that is available locally, namely, the snapshots of mutable fields returned by LLXs on the previous elements of $\sigma$, values read from immutable fields of previous elements of $\sigma$, and $\mathit{args}$. (This flexibility can be used, for example, to avoid unnecessary LLXs when deciding how to rebalance a BST.) UP performs another local computation (denoted by Condition in Figure 3) to decide whether more LLXs should be performed. To avoid infinite loops, this function must eventually return True in any execution of UP. If any LLX in the sequence returns Fail or Finalized, UP also returns Fail, to indicate that the attempted update has been aborted because of a concurrent update on an overlapping portion of the tree. If all of the LLXs successfully return snapshots, UP invokes SCX and returns a result calculated locally by the Result function (or Fail if the SCX fails).
UP applies the function SCX-Arguments to use locally available information to construct the arguments $V$, $R$, $\mathit{fld}$ and $\mathit{new}$ for the SCX. The postconditions that must be satisfied by SCX-Arguments are somewhat technical, but intuitively, they are meant to ensure that the arguments produced describe an update as shown in Figure 1 or Figure 2. The update must remove a connected set $R$ of nodes from the tree and replace it by a connected set $N$ of newly-created nodes that is rooted at $n$ by changing the child pointer stored in $\mathit{fld}$ to point to $n$. In order for this change to occur atomically, we include $R$ and the node containing $\mathit{fld}$ in $V$. This ensures that if any of these nodes has changed since it was last accessed by one of UP’s LLXs, the SCX will fail. The sequence $V$ may also include any other nodes in $\sigma$.
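The control flow of the template described above can be sketched as a higher-order function. This is a minimal sketch under stated assumptions, not the paper's Figure 3: the parameter names (`find_start`, `condition`, `next_node`, `scx_arguments`, `result`) are hypothetical stand-ins for the local computations, and the `llx` and `scx` used to exercise it are single-threaded stubs rather than real primitives.

```python
FAIL, FINALIZED = "Fail", "Finalized"

def tree_update(args, entry, llx, scx, find_start,
                condition, next_node, scx_arguments, result):
    """One attempt of an update that follows the template; returns FAIL on abort."""
    node = find_start(entry, args)        # plain reads of child pointers
    sigma, snaps = [], []                 # nodes visited and their LLX snapshots
    while True:
        snap = llx(node)
        if snap in (FAIL, FINALIZED):
            return FAIL                   # a concurrent update overlapped
        sigma.append(node)
        snaps.append(snap)
        if condition(sigma, snaps, args): # enough LLXs have been performed
            break
        node = next_node(sigma, snaps, args)
    V, R, fld, new = scx_arguments(sigma, snaps, args)
    return result(sigma, snaps, args) if scx(V, R, fld, new) else FAIL

# Single-threaded stand-ins, only to exercise the control flow.
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right
        self.finalized = False

def llx(node):
    return FINALIZED if node.finalized else (node.left, node.right)

def scx(V, R, fld, new):
    owner, name = fld                     # fld modelled as (record, field name)
    setattr(owner, name, new)
    for r in R:
        r.finalized = True
    return True

old_leaf = Node(5)
entry = Node(None, left=old_leaf)
outcome = tree_update(
    args=7, entry=entry, llx=llx, scx=scx,
    find_start=lambda e, a: e,
    condition=lambda sigma, snaps, a: len(sigma) == 2,
    next_node=lambda sigma, snaps, a: snaps[-1][0],   # left child per snapshot
    scx_arguments=lambda sigma, snaps, a:
        (sigma, [sigma[1]], (sigma[0], "left"), Node(a)),
    result=lambda sigma, snaps, a: sigma[1].key,
)
```

Here the update replaces the single leaf under `entry` with a newly created leaf and returns the key it removed; a caller would typically retry when `FAIL` is returned.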
More formally, we require SCX-Arguments to satisfy nine postconditions. The first three are basic requirements of SCX.
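PC1: $V$ is a subsequence of $\sigma$.
PC2: $R$ is a subsequence of $V$.
PC3: The node $\mathit{parent}$ containing the mutable field $\mathit{fld}$ is in $V$.
Let $G_N$ be the directed graph $(N \cup F_N, E_N)$, where $E_N$ is the set of all child pointers of nodes in $N$ when they are initialized, and $F_N = \{y : y \notin N \text{ and } (x, y) \in E_N \text{ for some } x \in N\}$. Let $\mathit{old}$ be the value read from $\mathit{fld}$ by the LLX on $\mathit{parent}$.
PC4: $G_N$ is a non-empty down-tree rooted at $n$.
PC5: If $\mathit{old} = \text{Nil}$ then $R = \emptyset$ and $F_N = \emptyset$.
PC6: If $R = \emptyset$ and $\mathit{old} \neq \text{Nil}$, then $F_N = \{\mathit{old}\}$.
PC7: UP allocates memory for all nodes in $N$ including $n$.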
Postcondition PC7 requires $n$ to be a newly-created node, in order to satisfy Constraint 1. There is no loss of generality in using this approach: If we wish to change a child $y$ of node $x$ to Nil (to chop off the entire subtree rooted at $y$) or to a descendant of $y$ (to splice out a portion of the tree), then, instead, we can replace $x$ by a new copy of $x$ with an updated child pointer. Likewise, if we want to delete the entire tree, then $\mathit{entry}$ can be changed to point to a new, empty Data-record.
The next postcondition is used to guarantee Constraint 2, which is used to prove progress.
PC8: The sequences $V$ constructed by all updates that take place entirely during a period of time when no SCXs change the tree structure must be ordered consistently according to a fixed tree traversal algorithm (for example, an in-order traversal or a breadth-first traversal).
Stating the remaining postcondition formally requires some care, since the tree may be changing while UP performs its LLXs. If $R \neq \emptyset$, let $G_R$ be the directed graph $(R \cup F_R, E_R)$, where $E_R$ is the union of the sets of edges representing child pointers read from each $r \in R$ when it was last accessed by one of UP’s LLXs and $F_R = \{y : y \notin R \text{ and } (x, y) \in E_R \text{ for some } x \in R\}$. The graph $G_R$ represents UP’s view of the nodes in $R$ according to its LLXs, and $F_R$ is the fringe of $G_R$. If other processes do not change the tree while UP is being performed, then $F_R$ contains the nodes that should remain in the tree, but whose parents will be removed and replaced. Therefore, we must ensure that the nodes in $F_R$ are reachable from nodes in $N$ (so they are not accidentally removed from the tree). Let $G_\sigma$ be the directed graph $(\sigma \cup F_\sigma, E_\sigma)$, where $E_\sigma$ is the union of the sets of edges representing child pointers read from each $r \in \sigma$ when it was last accessed by one of UP’s LLXs and $F_\sigma = \{y : y \notin \sigma \text{ and } (x, y) \in E_\sigma \text{ for some } x \in \sigma\}$. Since $G_\sigma$, $G_R$ and $G_N$ are not affected by concurrent updates, the following postcondition can be proved using purely sequential reasoning, ignoring the possibility that concurrent updates could modify the tree during UP.
PC9: If $G_\sigma$ is a down-tree and $R \neq \emptyset$, then $G_R$ is a non-empty down-tree rooted at $\mathit{old}$ and $F_N = F_R$.
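Two of the checks implied by these postconditions are easy to state in code: the subsequence requirements, and the requirement that the fringe of the removed nodes equals the fringe of the new nodes. The sketch below is illustrative only; the helper names `is_subsequence` and `fringe`, the string node labels, and the rotation example are assumptions, with a fringe taken to be the set of children of a node set that lie outside that set.

```python
def is_subsequence(sub, seq):
    """True if the elements of `sub` appear in `seq` in the same relative order."""
    it = iter(seq)
    return all(any(x == y for y in it) for x in sub)

def fringe(nodes, children):
    """Children of `nodes` that are not themselves in `nodes`.

    `children` maps a node to the list of its non-Nil children.
    """
    inside = set(nodes)
    return {y for x in nodes for y in children.get(x, ()) if y not in inside}

# A rotation: old nodes x, y are replaced by new copies x2, y2,
# while subtrees a, b, c must stay in the tree.
old_children = {"x": ["a", "y"], "y": ["b", "c"]}
new_children = {"y2": ["x2", "c"], "x2": ["a", "b"]}
R = ["x", "y"]
N = ["y2", "x2"]
sigma = ["parent", "x", "y"]
V = ["parent", "x", "y"]

same_fringe = fringe(R, old_children) == fringe(N, new_children)
```

For this rotation both fringes are the subtrees `a`, `b` and `c`, so none of them is accidentally dropped from the tree when the old nodes are replaced.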
4.1 Correctness and Progress
For brevity, we only sketch the main ideas of the proof here. The full proof appears in Appendix B. Consider a data structure in which all updates follow the tree update template and SCX-Arguments satisfies postconditions PC1 to PC9. We prove, by induction on the sequence of steps in an execution, that the data structure is always a tree, each call to LLX and SCX satisfies its preconditions, Constraints 1 to 3 are satisfied, and each successful SCX atomically replaces a connected subgraph containing nodes $R \cup F_N$ with another connected subgraph containing nodes $N \cup F_N$, finalizing and removing the nodes in $R$ from the tree, and adding the new nodes in $N$ to the tree. We also prove that no node in the tree is finalized, every removed node is finalized, and removed nodes are never reinserted.
We linearize each update UP that follows the template and performs an SCX that modifies the data structure at the linearization point of its SCX. We prove the following correctness properties.
If UP were performed atomically at its linearization point, then it would perform LLXs on the same nodes, and these LLXs would return the same values.
This implies that UP’s SCX-Arguments and Result computations must be the same as they would be if UP were performed atomically at its linearization point, so we obtain the following.
If UP were performed atomically at its linearization point, then it would perform the same SCX (with the same arguments) and return the same value.
If a process $p$ follows child pointers starting from a node in the tree at time $t$ and reaches a node $r$ at time $t' \geq t$, then $r$ was in the tree at some time between $t$ and $t'$. Furthermore, if $p$ reads $v$ from a mutable field of $r$ at time $t'' \geq t'$ then, at some time between $t$ and $t''$, node $r$ was in the tree and this field contained $v$.
The following properties, which come from , can be used to prove non-blocking progress of queries.
P1: If LLXs are performed infinitely often, then they return snapshots or Finalized infinitely often.
P2: If VLXs are performed infinitely often, and SCXs are not performed infinitely often, then VLXs return True infinitely often.
Each update that follows the template is wait-free. Since updates can fail, we also prove the following progress property.
P3: If updates that follow the template are performed infinitely often, then updates succeed infinitely often.
A successful update performs an SCX that modifies the tree. Thus, it is necessary to show that SCXs succeed infinitely often. Before an invocation of SCX($V, R, \mathit{fld}, \mathit{new}$) can succeed, the invoking process must perform an LLX($r$) that returns a snapshot, for each $r \in V$. Even if P1 is satisfied, it is possible for LLXs to always return Finalized, preventing any SCXs from being performed. We prove that any algorithm whose updates follow the template automatically guarantees that, for each Data-record $r$, each process performs at most one invocation of LLX($r$) that returns Finalized. We use this fact to prove P3.
5 Application: Chromatic Trees
Here, we show how the tree update template can be used to implement an ordered dictionary ADT using chromatic trees. Due to space restrictions, we only sketch the algorithm and its correctness proof. All details of the implementation and its correctness proof appear in Appendix C. The ordered dictionary stores a set of keys, each with an associated value, where the keys are drawn from a totally ordered universe. The dictionary supports five operations. If $\mathit{key}$ is in the dictionary, Get($\mathit{key}$) returns its associated value. Otherwise, Get($\mathit{key}$) returns $\bot$. Successor($\mathit{key}$) returns the smallest key in the dictionary that is larger than $\mathit{key}$ (and its associated value), or $\bot$ if no key in the dictionary is larger than $\mathit{key}$. Predecessor($\mathit{key}$) is analogous. Insert($\mathit{key}, \mathit{value}$) replaces the value associated with $\mathit{key}$ by $\mathit{value}$ and returns the previously associated value, or $\bot$ if $\mathit{key}$ was not in the dictionary. If the dictionary contains $\mathit{key}$, Delete($\mathit{key}$) removes it and returns the value that was associated immediately beforehand. Otherwise, Delete($\mathit{key}$) simply returns $\bot$.
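The intended sequential behaviour of the five operations can be pinned down with a small reference model. The sketch below is not the chromatic tree; it is a single-threaded specification built on a sorted list, with the class name `SequentialOrderedDict` chosen for illustration and Python's `None` standing in for the "no result" return value (so it assumes `None` is never stored as a value).

```python
import bisect

class SequentialOrderedDict:
    """Single-threaded reference model of the ordered dictionary ADT."""
    def __init__(self):
        self._keys, self._vals = [], []   # parallel lists, sorted by key

    def _find(self, key):
        i = bisect.bisect_left(self._keys, key)
        return i, i < len(self._keys) and self._keys[i] == key

    def get(self, key):
        i, found = self._find(key)
        return self._vals[i] if found else None

    def insert(self, key, value):
        i, found = self._find(key)
        if found:                         # replace and return the previous value
            old, self._vals[i] = self._vals[i], value
            return old
        self._keys.insert(i, key)
        self._vals.insert(i, value)
        return None

    def delete(self, key):
        i, found = self._find(key)
        if not found:
            return None
        self._keys.pop(i)
        return self._vals.pop(i)

    def successor(self, key):
        i = bisect.bisect_right(self._keys, key)   # first key strictly larger
        return (self._keys[i], self._vals[i]) if i < len(self._keys) else None

    def predecessor(self, key):
        i = bisect.bisect_left(self._keys, key) - 1  # last key strictly smaller
        return (self._keys[i], self._vals[i]) if i >= 0 else None

d = SequentialOrderedDict()
first = d.insert(10, "a")
second = d.insert(20, "b")
replaced = d.insert(10, "c")
```

A model like this is useful as an oracle when testing a concurrent implementation: each linearized operation should return what the model returns at its linearization point.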
A RBT is a BST in which the root and all leaves are coloured black, and every other node is coloured either red or black, subject to the constraints that no red node has a red parent, and the number of black nodes on a path from the root to a leaf is the same for all leaves. These properties guarantee that the height of a RBT is logarithmic in the number of nodes it contains. We consider search trees that are leaf-oriented, meaning the dictionary keys are stored in the leaves, and internal nodes store keys that are used only to direct searches towards the correct leaf. In this context, the BST property says that, for each node $x$, all descendants of $x$’s left child have keys less than $x$’s key and all descendants of $x$’s right child have keys that are greater than or equal to $x$’s key.
To decouple rebalancing steps from insertions and deletions, so that each is localized, and rebalancing steps can be interleaved with insertions and deletions, it is necessary to relax the balance properties of RBTs. A chromatic tree is a relaxed-balance RBT in which colours are replaced by non-negative integer weights, where weight zero corresponds to red and weight one corresponds to black. As in RBTs, the sum of the weights on each path from the root to a leaf is the same. However, RBT properties can be violated in the following two ways. First, a red child node may have a red parent, in which case we say that a red-red violation occurs at this child. Second, a node may have weight $w > 1$, in which case we say that $w - 1$ overweight violations occur at this node. The root always has weight one, so no violation can occur at the root.
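The two kinds of violation and the path-weight invariant can be checked with a short sequential traversal. The sketch below is illustrative: `Node`, `violations` and `path_weights` are hypothetical names, sentinel nodes are omitted, and an overweight node of weight w is counted as w - 1 violations.

```python
class Node:
    def __init__(self, weight, left=None, right=None):
        self.weight, self.left, self.right = weight, left, right

def violations(root):
    """Return (red-red violations, overweight violations) in the tree."""
    red_red = overweight = 0
    stack = [(root, None)]
    while stack:
        node, parent = stack.pop()
        if node is None:
            continue
        if parent is not None and node.weight == 0 and parent.weight == 0:
            red_red += 1                      # red child of a red parent
        if node.weight > 1:
            overweight += node.weight - 1
        stack.append((node.left, node))
        stack.append((node.right, node))
    return red_red, overweight

def path_weights(root):
    """Set of weight sums over all root-to-leaf paths (one value if balanced)."""
    sums, stack = set(), [(root, 0)]
    while stack:
        node, acc = stack.pop()
        acc += node.weight
        if node.left is None and node.right is None:
            sums.add(acc)
        else:
            for child in (node.left, node.right):
                if child is not None:
                    stack.append((child, acc))
    return sums

# Tree with one red-red violation: a weight-0 node under a weight-0 node.
t1 = Node(1,
          Node(0, Node(0, Node(1), Node(1)), Node(1)),
          Node(1))
# Tree with one overweight violation: a leaf of weight 2.
t2 = Node(1, Node(2), Node(1, Node(1), Node(1)))
```

Both example trees keep equal weight sums on every root-to-leaf path, so they are valid chromatic trees even though neither is a red-black tree.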
To avoid special cases when the chromatic tree is empty, we add sentinel nodes at the top of the tree (see Figure 10). The sentinel nodes and $\mathit{entry}$ have key $\infty$ to avoid special cases for Search, Insert and Delete, and weight one to avoid special cases for rebalancing steps. Without having a special case for Insert, we automatically get the two sentinel nodes in Figure 10(b), which also eliminate special cases for Delete. The chromatic tree is rooted at the leftmost grandchild of $\mathit{entry}$. The sum of weights is the same for all paths from the root of the chromatic tree to its leaves, but not for paths that include $\mathit{entry}$ or the sentinel nodes.
Rebalancing steps are localized updates to a chromatic tree that are performed at the location of a violation. Their goal is to eventually eliminate all red-red and overweight violations, while maintaining the invariant that the tree is a chromatic tree. If no rebalancing step can be applied to a chromatic tree (or, equivalently, the chromatic tree contains no violations), then it is a RBT. We use the set of rebalancing steps of Boyar, Fagerberg and Larsen , which have a number of desirable properties: No rebalancing step increases the number of violations in the tree, rebalancing steps can be performed in any order, and, after sufficiently many rebalancing steps, the tree will always become a RBT. Furthermore, in any sequence of insertions, deletions and rebalancing steps starting from an empty chromatic tree, the amortized number of rebalancing steps is at most three per insertion and one per deletion.
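The amortized bound stated above can be written as a single inequality. The symbols here are assumptions introduced for illustration: $i$ and $d$ denote the numbers of insertions and deletions in a sequence starting from an empty chromatic tree, and $s$ denotes the total number of rebalancing steps performed.

```latex
\[
  s \;\le\; 3i + d
\]
```

For example, a sequence of 1000 insertions and 400 deletions can trigger at most 3400 rebalancing steps in total, regardless of how those steps are interleaved.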
We represent each node by a Data-record with two mutable child pointers, and immutable fields $k$, $v$ and $w$ that contain the node’s key, associated value, and weight, respectively. The child pointers of a leaf are always Nil, and the value field of an internal node is always Nil.
Get, Insert and Delete each execute an auxiliary procedure, Search($\mathit{key}$), which starts at $\mathit{entry}$ and traverses nodes as in an ordinary BST search, using Reads of child pointers until reaching a leaf, which it then returns (along with the leaf’s parent and grandparent). Because of the sentinel nodes shown in Figure 10, the leaf’s parent always exists, and the grandparent exists whenever the chromatic tree is non-empty. If it is empty, Search returns Nil instead of the grandparent. We define the search path for $\mathit{key}$ at any time to be the path that Search($\mathit{key}$) would follow, if it were done instantaneously. The Get($\mathit{key}$) operation simply executes a Search($\mathit{key}$) and then returns the value found in the leaf if the leaf’s key is $\mathit{key}$, or $\bot$ otherwise.
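The Search and Get procedures described above can be sketched sequentially as follows. This is a minimal sketch under stated assumptions: keys are numbers, `math.inf` plays the role of the sentinel key, `None` stands in for both Nil and the "not found" result, and the sentinel layout (an entry node whose left child is either a sentinel leaf or a sentinel internal node above the tree root) is an assumed reading of the figure, which is not reproduced here.

```python
import math

class Node:
    def __init__(self, key, value=None, weight=1, left=None, right=None):
        self.key, self.value, self.weight = key, value, weight
        self.left, self.right = left, right

    def is_leaf(self):
        return self.left is None

def search(entry, key):
    """Return (grandparent, parent, leaf) reached by an ordinary BST descent."""
    gp, p, leaf = None, entry, entry.left
    while not leaf.is_leaf():
        gp, p = p, leaf
        # Leaf-oriented BST: smaller keys go left, greater-or-equal keys go right.
        leaf = p.left if key < p.key else p.right
    return gp, p, leaf

def get(entry, key):
    _, _, leaf = search(entry, key)
    return leaf.value if leaf.key == key else None

# Empty dictionary: entry's left child is a sentinel leaf.
empty = Node(math.inf, left=Node(math.inf))

# Dictionary holding keys 10 and 20, below entry and one sentinel internal node.
root = Node(20, left=Node(10, "a"), right=Node(20, "b"))
entry = Node(math.inf, left=Node(math.inf, left=root, right=Node(math.inf)))
```

Because the search only reads child pointers, it needs no coordination with updates in this sequential setting; the sentinel keys ensure that every descent ends at a leaf with a well-defined parent.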