1 Introduction
There has been much recent work on designing efficient concurrent implementations of set data structures [4, 5, 8, 10, 12, 13, 21, 29, 36, 38], which provide algorithms for Insert, Delete, and Find. There is increasing interest in providing additional operations for modern applications, including iterators [1, 32, 33, 35, 36, 37] or general range queries [6, 9]. These are required in many big-data applications [11, 26, 34], where shared in-memory tree-based data indices must be created for fast data retrieval and useful data analytics. Prevalent programming frameworks (e.g., Java [23], .NET [31], TBB [22]) that provide concurrent data structures have added operations to support (non-linearizable) iterators.
The Binary Search Tree (BST) is one of the most fundamental data structures. Ellen et al. [13] presented the first non-blocking implementation (which we will call NBBST) of a BST from single-word CAS. NBBST has several nice properties. Updates operating on different parts of the tree do not interfere with one another, and Finds never interfere with any other operation. The code of NBBST is modular and a detailed proof of correctness is provided in [14].
In this paper, we build upon NBBST to get a persistent version of it, called PNBBST. In a persistent data structure, old versions of the data structure are preserved when it is modified, so that one can access any old version. We achieve persistence on top of NBBST by applying a relatively simple technique which fully respects the modularity and simplicity of NBBST’s design.
In a concurrent setting, a major motivation for providing data structure persistence is that it facilitates the implementation, in a wait-free way [18], of advanced operations (such as range queries) on top of the data structure. We exploit persistence in PNBBST to provide the first wait-free implementation of RangeScan on top of tree data structures, using single-word CAS. RangeScan(a, b) returns a set containing all keys in the implemented set that are between the given keys a and b. PNBBST also provides non-blocking (also known as lock-free [18]) implementations of Insert, Delete, and Find.
PNBBST is linearizable [20], uses single-word CAS, and tolerates any number of crash failures. As in NBBST, updates in PNBBST on different parts of the tree are executed in parallel without interfering with one another. A Find simply follows tree edges from the root to a leaf and it may have to help an update operation only if the update is taking place at the parent or grandparent of the leaf that the search arrives at. Thus, Find employs a lightweight helping mechanism. Similarly, RangeScan helps only those operations that are in progress on the nodes that it traverses. RangeScan may print keys (or perform some processing of the nodes, e.g., counting them) as it traverses the tree, thus avoiding any space overhead. PNBBST does not require knowledge of the number of processes in the system, and therefore it works in a dynamic system where the set of participating processes changes.
The code of PNBBST is as modular as that of NBBST, making it fairly easy to understand. However, designing a linearizable implementation of RangeScan required solving several synchronization problems between RangeScans and concurrent update operations on the same part of the tree, so that a RangeScan sees all the successful update operations linearized before it but not those linearized after it. Specifically, we had to (a) apply a mechanism based on sequence numbers set by RangeScans, to split the execution into phases and assign each operation to a distinct phase, (b) design a scheme for linearizing operations that is completely different from that of NBBST by taking into consideration the phase to which each operation belongs, (c) ensure some additional necessary synchronization between RangeScans and updates, and (d) use a more elaborate helping scheme. The proof of correctness borrows from that of NBBST. However, due to the mentioned complications, many parts of it are more intricate. The proof that RangeScans work correctly is completely novel.
2 Related Work
Our implementation is based on NBBST, the binary search tree implementation proposed in [13]. Brown et al. [7] generalized the techniques in [13] to get the primitives LLX, SCX and VLX, which are generalizations of load-link, store-conditional and validate. These primitives can be used to simplify the non-blocking implementation of updates in every data structure based on a down tree (see [8, 17] for examples). Unfortunately, our technique for supporting range queries cannot directly be implemented using LLX and SCX: the functionality hidden inside LLX must be split in two parts between which some synchronization is necessary to coordinate RangeScans with updates. The work in [13] has also been generalized in [38] to get a non-blocking implementation of a Patricia trie. None of these implementations of non-blocking search trees supports range queries.
Prokopec et al. [36] presented a non-blocking implementation of a concurrent hash trie which supports a Scan operation that provides a consistent snapshot of the entire data structure. Their algorithm uses indirection nodes (i-nodes) [41] that double the height of the tree. To implement Scan, the algorithm provides a persistent implementation of the trie in which updates may have to copy the entire path of nodes they traverse to synchronize with concurrent Scans. Moreover, the algorithm causes a lot of contention on the root node. The algorithm could be adjusted to support RangeScan. However, every RangeScan would cause updates taking place anywhere in the tree to copy all the nodes they visit, even if they are not in the part of the tree being scanned.
Petrank and Timnat [35] gave a technique (based on [24]) to implement Scan on top of non-blocking set data structures such as linked lists and skip lists. Concurrent Scans share a snap collector object in which they record information about the nodes they traverse. To ensure that a Scan appropriately synchronizes with updates, processes executing updates or Finds must also record information about the operations they perform (or those executed by other processes they encounter) in the snap collector object. Although the snap collector object’s primitive operations are wait-free, the following example shows that the implementation of Scan using those primitives is non-blocking but not wait-free. Assume that the algorithm is applied on top of the non-blocking sorted linked list implementation presented by Harris [16]. A Scan must traverse the list, and this traversal may never complete if concurrent updates continue to add more elements to the end of the list faster than the Scan can traverse them. In this case, the lists maintained in the snap collector will grow infinitely long. In case is known, updates on different parts of the data structure do not interfere with one another and have been designed to be fast. However, Scan is rather costly in terms of both time and space. Chatterjee [9] generalizes the algorithm of Petrank and Timnat to get a non-blocking implementation of RangeScan using partial snapshots [2]. In a different direction, work in [1, 37] characterizes when implementing the technique of [35] on top of non-blocking data structures is actually possible.
Brown et al. [6] presented an implementation of a k-ary search tree supporting RangeScan in an obstruction-free way [19]. Avni et al. [3] presented a skip list implementation which supports RangeScan. It can be either lock-free or be built on top of a transactional memory system, so its progress guarantees are weaker than wait-freedom. Bronson et al. [5] presented a blocking implementation of a relaxed-balance AVL tree which provides support for Scan.
Some papers present wait-free implementations of Scan (or RangeScan) on data structures other than trees or in different settings. Nikolakopoulos et al. [32, 33] gave a set of consistency definitions for Scan and presented Scan algorithms for the lock-free concurrent queue in [28] that ensure different consistency and progress guarantees. Fatourou et al. [15] presented a wait-free implementation of Scan on top of the non-blocking deque implementation of [27]. Kanellou and Kallimanis [25] introduced a new graph model and provided a wait-free implementation of a node-static graph which supports partial traversals in addition to edge insertions, removals, and weight updates. Spiegelman et al. [39] presented two memory models and provided wait-free dynamic atomic snapshot algorithms for both.
3 Overview of the BST Implementation and Preliminaries
We provide a brief description of NBBST (following the presentation in [13]) and some preliminaries.
NBBST implements Binary Search Trees (BST) that are leaf-oriented, i.e., all keys are stored in the leaves of the tree. The tree is full and maintains the binary search tree property: for every node v in the tree, the key of v is larger than the key of every node in v’s left subtree and smaller than or equal to the key of every node in v’s right subtree. The keys of the Internal nodes are used solely for routing to the appropriate leaf during search. A leaf (internal) node is represented by an object of type Leaf (Internal, respectively); we say that Leaf and Internal nodes are of type Node (see Figure 2).
To insert a key k in a leaf-oriented tree, a search for k is first performed. Let l and p be the leaf that this search arrives at and its parent. If l does not contain k, then a subtree consisting of an internal node and two leaf nodes is created. The leaves contain k and the key of l (with the smaller key in the left leaf). The internal node contains the bigger of these two keys. The child pointer of p which was pointing to l is changed to point to the root of this subtree. Similarly, for a Delete(k), let l, p and gp be the leaf node that the search performed by the Delete arrives at, its parent, and its grandparent. If the key of l is k, then the child pointer of gp which was pointing to p is changed to point to the sibling of l. By performing the updates in this way, the properties of the tree are maintained.
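The sequential update rules above can be sketched as follows. This is a minimal single-threaded illustration, not the concurrent algorithm: the names `new_tree`, `search`, `replace_child` and `keys` are hypothetical, and using one infinite sentinel key for the initial root and leaves is an assumption made only so that every real leaf has a parent and a grandparent.

```python
import math
from dataclasses import dataclass
from typing import Union


@dataclass(eq=False)
class Leaf:
    key: float


@dataclass(eq=False)
class Internal:
    key: float          # routing key only; real keys live in the leaves
    left: "Node"
    right: "Node"


Node = Union[Leaf, Internal]


def new_tree() -> Internal:
    # Assumption: a single infinite sentinel stands in for the initial keys.
    return Internal(math.inf, Leaf(math.inf), Leaf(math.inf))


def search(root, k):
    """Return (grandparent, parent, leaf) on the search path for k."""
    gp, p, l = None, None, root
    while isinstance(l, Internal):
        gp, p = p, l
        l = l.left if k < l.key else l.right
    return gp, p, l


def replace_child(parent, old, new):
    if parent.left is old:
        parent.left = new
    else:
        parent.right = new


def insert(root, k):
    _, p, l = search(root, k)
    if l.key == k:
        return False                      # duplicate key
    small, big = sorted((Leaf(k), Leaf(l.key)), key=lambda n: n.key)
    # New internal node holds the bigger key; smaller key goes left.
    replace_child(p, l, Internal(big.key, small, big))
    return True


def delete(root, k):
    gp, p, l = search(root, k)
    if l.key != k:
        return False                      # key not present
    sibling = p.right if p.left is l else p.left
    replace_child(gp, p, sibling)         # grandparent now skips p and l
    return True


def keys(node):
    """In-order list of real (non-sentinel) keys."""
    if isinstance(node, Leaf):
        return [] if node.key == math.inf else [node.key]
    return keys(node.left) + keys(node.right)
```

Each update changes exactly one child pointer, which is what lets the concurrent algorithm apply it with a single child CAS.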
An implementation is linearizable if, in every execution α, each operation that completes in α (and some that do not) can be assigned a linearization point between the starting and finishing time of its execution so that the return values of those operations are the same in α as if the operations were executed sequentially in the order specified by their linearization points.
To ensure linearizability, NBBST applies a technique that flags and marks nodes. A node is flagged before any of its child pointers changes. A node is permanently marked before it is removed. To mark and flag nodes, NBBST uses CAS. CAS(O, u, v) changes the value of object O to v if its current value is equal to u; otherwise the CAS fails and no change is applied on O. In either case, the value that O had before the execution of CAS is returned.
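The CAS semantics described above (return the previous value whether or not the swap happened) can be modelled as follows. This is only a sketch: `AtomicCell` is a hypothetical name, and the lock merely stands in for the atomicity that hardware CAS provides.

```python
import threading


class AtomicCell:
    """Software model of an object that supports CAS."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()   # stand-in for hardware atomicity

    def read(self):
        return self._value

    def cas(self, old, new):
        """Store `new` only if the cell holds `old`; always return the prior value."""
        with self._lock:
            current = self._value
            if current == old:
                self._value = new
            return current
```

A caller learns whether its CAS succeeded by comparing the returned value with the `old` argument it passed.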
NBBST provides a routine, Search(k), to search the data structure for key k. Search returns pointers to the leaf node at which the Search arrives, to its parent, and to its grandparent. Find(k) executes Search(k) and checks whether the returned leaf contains the key k. Insert(k) executes Search(k) to get a leaf l and its parent p. It then performs a flag CAS, to flag p, then a child CAS to change the appropriate child pointer of p to point to the root of the newly created subtree of three nodes, and finally an unflag CAS to unflag p. If it fails to flag p, it restarts without executing the other two CAS steps. Similarly, a Delete(k) calls Search to get a leaf l, its parent p, and its grandparent gp. It first executes a flag CAS trying to flag gp. If this fails, it restarts. If the flagging succeeds, it executes a mark CAS to mark p. If this fails, it unflags gp and restarts. Otherwise, it executes a child CAS to change the appropriate child pointer of gp to point from p to the sibling of l, it unflags gp and returns. Both Insert and Delete operations execute the body of a while loop repeatedly until they succeed. The execution of an iteration of the while loop is called an attempt.
Processes may fail by crashing. An implementation is non-blocking if in every infinite execution, infinitely many operations are completed. NBBST is non-blocking: Each process p that flags or marks a node stores in it a pointer to an Info object, which contains information about the operation op it performs (see Figure 2). This information includes the old and new values that should be used by the CAS steps that p will perform to complete the execution of op. Other processes that apply operations on the same part of the data structure can help this operation complete and unflag the node. Once they do so, they are able to retry their own operations. Helping is necessary only if an update operation wants to flag or mark a node already flagged or marked by another process.
4 A Persistent Binary Search Tree Supporting Range Queries
We modify NBBST to get PNBBST, a BST implementation that supports RangeScan, in addition to Insert, Delete, and Find.
4.1 Overview
In a concurrent environment, care must be taken to synchronize RangeScans with updates since as a RangeScan traverses the tree, it may see an update by a process but it may miss an update that finishes before starts, and was applied on the part of the tree that has already been visited by the RangeScan (thus violating linearizability).
To avoid such situations, PNBBST implements a persistent version of the leaf-oriented tree, thus allowing a RangeScan to reconstruct previous versions of it. To achieve this, PNBBST stores in each node an additional pointer, called prev. Whenever the child pointer of a node changes from a node u to a node v, the prev pointer of v points to u. (Figure 1 illustrates an example.)
PNBBST maintains a shared integer, Counter, which is incremented each time a RangeScan takes place. Each operation has a sequence number associated with it. Each RangeScan starts its execution by reading Counter and uses the value read as its sequence number. Each other operation op reads Counter at the beginning of each of its attempts. The sequence number of op is the sequence number read in its last attempt. A successful update operation records its sequence number in the Info object it creates during its last attempt. Intuitively, each RangeScan initiates a new execution phase whenever it increments Counter. For each i ≥ 0, phase i is the period during which Counter has the value i. We say that all operations with sequence number i belong to phase i.
Each tree node has a sequence number (stored in its seq field), which is the sequence number of the operation that created it. In this way, a RangeScan may figure out which nodes have been inserted or deleted by updates that belong to later phases. For any Internal node v whose sequence number is at most i, we define the version-i left (or right) child of v to be the node that is reached by following the left (or right) child pointer of v and then following prev pointers (the additional pointer stored in each node) until reaching the first node whose seq field is less than or equal to i. (We prove that such a node exists.) For every configuration C, we define graph D_i(C) as follows. The nodes of D_i(C) are all the nodes existing in C and the edges go from nodes to their version-i children; T_i(C) is the subgraph of D_i(C) containing those nodes that are reachable from the root node in D_i(C). We prove that T_i(C) is a binary search tree.
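The walk that defines a version child can be sketched as below. The field names `seq` and `prev`, the `Node` layout and the `read_child` signature are assumptions chosen for illustration; the sketch is sequential and omits all helping and validation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Node:
    key: int
    seq: int                          # phase of the operation that created the node
    prev: Optional["Node"] = None     # node that this node replaced, if any
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def read_child(p: Node, go_left: bool, seq: int) -> Node:
    """Return the version child of p for phase `seq`."""
    child = p.left if go_left else p.right
    # Skip nodes created in later phases by walking back through prev.
    while child.seq > seq:
        child = child.prev
    return child
```

For example, if a leaf created in phase 0 is replaced in phase 2 by a new internal node whose `prev` points to that leaf, an operation with sequence number 1 still reaches the old leaf, while an operation with sequence number 2 reaches the new node.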
We linearize every RangeScan operation with sequence number i at the end of phase i, with ties broken in an arbitrary way. Moreover, we linearize all Insert, Delete and Find operations that belong to phase i during phase i. To ensure linearizability, PNBBST should guarantee that a RangeScan with sequence number i ignores all changes performed by successful update operations that belong to phases with sequence numbers bigger than i. To ensure this, each operation with sequence number i ignores those nodes of the tree that have sequence numbers bigger than i by moving from a node to its appropriate version-i child. Thus, each operation with sequence number i always operates on the version-i tree.
28  Initialization: 
29  shared counter := 0 
30  shared Info * := pointer to a new Info object whose field is Abort, and whose other fields are 
31  shared Internal * := pointer to new Internal node with field , field , 
field , field , and its and fields pointing to new Leaf nodes whose fields  
are , fields are , and keys and , respectively 
32  Search(, int ): {  
33  Precondition:  
34  Internal *, *  
35  Node *  
36  while points to an internal node {  
37  Remember parent of  
38  Remember parent of  
39  Go to appropriate version child of  
40  }  
41  return  
42  }  
43  ReadChild(): Node* {  
44  Precondition: is non and  
45  if then else  Move down to appropriate child 
46  while ()  
47  return ;  
48  }  
49  ValidateLink(): {  
50  Preconditions: and are non  
51  Update  
52  
53  if then {  
54  Help()  
55  return  
56  }  
57  if ( and ) or ( and ) then return  
58  else return  
59  }  
60  ValidateLeaf(Internal *, Internal *, Leaf *, Key ) : {  
61  Preconditions: and are non and if then is non  
62  Update  
63  Boolean  
64  
65  if and then  
66  
67  return  
68  }  
69  Find(): Leaf* {  
70  Internal *  
71  Leaf *  
72  Boolean  
73  while True {  
74  
75  
76  
77  if then {  
78  if then return  
79  else return  
80  }  
81  }  
82  }  
83  CASChild(Internal *, Node *, Node *) {  
Precondition: points to an Internal node and points to a Node (i.e., neither is ) and  
This routine tries to change one of the child fields of the node that points to from to .  
84  if then  
85  CAS  child CAS 
86  else  
87  CAS  child CAS 
88  } 
89  Frozen(Update ): Boolean {  
90  return (( and or  
and ))  
91  }  
92  Execute (Internal *[], Update [], Internal *[], Internal *,  
Node *, Node *, int ): Boolean {  
93  Preconditions: (a) Elements of are non, (b) is a subset of , (c) is an element of ,  
94  (d) and are distinct and non, (e) is an element of ,  
95  (f) , and (g) if then is infinite.  
96  for to length of {  
97  if then {  
98  if then Help()  
99  return False  
100  }  
101  }  
102  pointer to a new Info record containing  
103  if then  freeze CAS 
104  return Help()  
105  else return False  
106  }  
107  Help(Info *): boolean {  
108  Precondition: is non and does not point to the Dummy Info object  
109  int  
110  boolean  
111  if then  
112  CAS(, , Abort)  abort CAS 
113  else CAS(, , Try)  try CAS 
114  
115  while and length of do {  
116  if appears in then  
117  freeze CAS  
118  else  freeze CAS 
119  
120  
121  }  
122  if then {  
123  
124  commit write  
125  } else if then  
126  abort write  
127  return ()  
128  }  
129  RangeScan(int , int ): Set {  
130  
131  
132  return  
133  }  
134  ScanHelper(Node *, int , int , int ): Set {  
135  Precondition: points to a node with  
136  Info *  
137  if points to a leaf then return  
138  else {  
139  
140  if then  
141  if then return ScanHelper  
142  else if then return ScanHelper  
143  else return  
144  
145  }  
146  } 
147  Insert(): boolean {  
148  Internal * , *, *  
149  Leaf *, *  
150  Leaf *  
151  Update  
152  Info *  
153  Boolean  
154  while True {  
155  
156  
157  
158  if then {  
159  if then return False  Cannot insert duplicate key 
160  else {  
161  pointer to a new Leaf node whose field is , its field is equal to , and its field is  
162  pointer to a new Leaf whose key is ,  
its field is equal to and its field is equal to  
163  pointer to a new Internal node with field ,  
field , its field equal to and its field equal to ,  
and with two child fields equal to and  
(the one with the smaller key is the left child),  
164  if Execute() then return True  
165  }  
166  }  
167  }  
168  }  
169  Delete(): boolean {  
170  Internal *, *  
171  Leaf *  
172  Node *, *  
173  Update  
174  Info *  
175  Boolean  
176  while True {  
177  
178  
179  
180  if then {  
181  if then return False  Key is not in the tree 
182  := ReadChild()  
183  
184  if then {  
185  pointer to a new copy of sibling with its field set to and its pointer set to  
186  if is Internal then {  
187  
188  if then  
189  } else  
190  if and Execute(  
) then  
191  return True  
192  }  
193  }  
194  }  
195  } 
To ensure linearizability, PNBBST should also ensure that each RangeScan sees all the successful updates that belong to phases smaller than or equal to . To achieve this, PNBBST employs a handshaking mechanism between each scanner and the updaters. It also uses a helping mechanism which is more elaborate than that of NBBST.
To describe the handshaking mechanism in more detail, consider any update operation initiated by process . No process can be aware of before performs a successful flag CAS for . Assume that flags node for in an attempt with sequence number . To ensure that no RangeScan with sequence number will miss , checks whether still has the value after the flag CAS has occurred. We call this check the handshaking check of . If the handshaking check succeeds, it is guaranteed that no RangeScan has begun its traversal between the time that reads at the beginning of the execution of and the time the handshaking check of is executed. Note that any future RangeScan with sequence number that traverses while is still in progress, will see that is flagged and find out the required information to complete in its Info object. In PNBBST, the RangeScan helps complete before it continues its traversal.
However, if the handshaking check fails, does not know whether any RangeScan that incremented to a value greater than has already traversed the part of the tree that is trying to update, and has missed this update. At least one of these RangeScans will have sequence number equal to . Thus, if succeeds, linearizability could be violated. To avoid this problem, proactively aborts its attempt of if the handshaking check fails, and then it initiates a new attempt for (which will have a sequence number bigger than ). This abort mechanism is implemented as follows. The Info object has a field, called , which takes values from the set (initially ). Each attempt creates an Info object. To abort the execution of an attempt, changes the field of its Info object to Abort. Once an attempt is aborted, the value of the field of its Info object remains Abort forever. If the handshaking check succeeds, then changes the field of the Info object of to Try and tries to execute the remaining steps of this attempt. If completes successfully, it changes the field of the Info object to Commit. Info objects whose field is equal to or Try belong to update operations that are still in progress.
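The handshaking check and the resulting state change can be sketched as follows. This is a single-threaded illustration under stated assumptions: `UNDECIDED` is a hypothetical name for the initial state value, `cas_state` is a plain stand-in for an atomic CAS, and `handshake` receives the current counter value as an argument instead of reading shared memory.

```python
from enum import Enum


class State(Enum):
    UNDECIDED = 0   # hypothetical name for the initial value
    TRY = 1
    COMMIT = 2
    ABORT = 3


class Info:
    def __init__(self, seq):
        self.seq = seq                    # sequence number of the attempt
        self.state = State.UNDECIDED

    def cas_state(self, old, new):
        """Sequential stand-in for CAS on the state field."""
        seen = self.state
        if seen is old:
            self.state = new
        return seen


def handshake(info, counter_now):
    """Abort the attempt if a RangeScan started a new phase since it read the counter."""
    if counter_now != info.seq:
        info.cas_state(State.UNDECIDED, State.ABORT)
    else:
        info.cas_state(State.UNDECIDED, State.TRY)
    return info.state
```

Because both CAS steps expect the initial value, whichever of them is applied first decides the outcome, and an aborted attempt stays aborted even if a later helper runs the check again.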
We now describe the linearization points in more detail. If an attempt of an Insert or Delete ultimately succeeds in updating a child pointer of the tree to make the update take effect, we linearize the operation at the time that attempt first flags a node: this is when the update first becomes visible to other processes. (This scheme differs from the original NBBST, where updates are linearized at the time they actually change a child pointer in the tree.) Because of handshaking, this linearization point is guaranteed to be before the end of the phase to which the operation belongs.
When a Find operation completes a traversal of a branch of the tree to a leaf, it checks whether an update has already removed the leaf or is in progress and could later remove that leaf from the tree. If so, the Find helps the update complete and retries. Otherwise, the Find terminates and is linearized at the time when the leaf is in the tree and has no pending update that might remove it later. (As in the original NBBST, the traversal of the branch may pass through nodes that are no longer in the tree, but so long as it ends up at a leaf that is still present in the current tree we prove that it ends up at the correct leaf of the current tree.) An Insert() that finds key is already in the tree, and a Delete() that discovers that is not in the tree are linearized similarly to Find operations.
The helping mechanism employed by Find operations ensures that the Find will see an update that has been linearized (when it flags a node) before the Find but has not yet swung a child pointer to update the shape of the tree. But it is also crucial for synchronizing with RangeScan operations, for the following reason. Assume that a process initiates an Insert(1). It reads a value i in the shared counter and successfully performs its flag CAS. Then, a RangeScan is initiated by a second process and changes the value of the shared counter from i to i+1. Finally, a Find(1) is initiated by a third process and reads i+1 in the shared counter. Find(1) and Insert(1) will arrive at the same leaf node l (because Insert(1) has not performed its child CAS by the time Find reaches the leaf). If Find(1) ignores the flag that exists on the parent node of l and does not help Insert(1) to complete, it will return False. If Insert(1) now continues its execution, it will complete successfully, and given that it has sequence number i, it will be linearized before Find(1), which has sequence number i+1. That would violate linearizability.
4.2 Detailed Implementation
A RangeScan(a, b) first determines its sequence number (line 4) and then increments the shared counter to start a new phase (line 4). To traverse the appropriate part of the tree, it calls ScanHelper (line 4). ScanHelper starts from the root and, at each internal node, recursively calls itself on the node’s version left child if the whole range lies below the node’s key, on the node’s version right child if the whole range lies above the node’s key, or on both version children if the node’s key lies between a and b (lines 4–4). Whenever it visits a node where an update is in progress, it helps the update to complete (line 4). ReadChild is used to obtain the node’s appropriate version child.
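The traversal can be sketched sequentially as below. This is an illustration only: the `Node` layout, the field names `seq` and `prev`, the `VersionedTree` wrapper and the function signatures are assumptions, the counter increment is not atomic here, and the helping step is omitted. The branch conditions follow the search-tree property of Section 3 (left subtree keys are smaller than the node's key, right subtree keys are at least the node's key).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Node:
    key: int
    seq: int
    prev: Optional["Node"] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def read_child(p, go_left, seq):
    child = p.left if go_left else p.right
    while child.seq > seq:            # ignore nodes from later phases
        child = child.prev
    return child


def scan_helper(node, seq, a, b):
    if node.left is None:             # leaf: report its key if it is in range
        return {node.key} if a <= node.key <= b else set()
    if b < node.key:                  # whole range lies in the left subtree
        return scan_helper(read_child(node, True, seq), seq, a, b)
    if a >= node.key:                 # whole range lies in the right subtree
        return scan_helper(read_child(node, False, seq), seq, a, b)
    return (scan_helper(read_child(node, True, seq), seq, a, b)
            | scan_helper(read_child(node, False, seq), seq, a, b))


class VersionedTree:
    def __init__(self, root):
        self.root = root
        self.counter = 0

    def range_scan(self, a, b):
        seq = self.counter            # sequence number of this scan
        self.counter += 1             # start a new phase
        return scan_helper(self.root, seq, a, b)
```

Since the scan only follows version children for its own sequence number, nodes installed by updates of later phases never enter its result.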
Search() traverses a branch of from the root to a leaf node (lines 3–3). Find gets a sequence number (line 3) and calls Search(, ) (line 3) to traverse the BST to a leaf . Next, it calls ValidateLeaf to ensure that there is no update that has removed or has flagged ’s parent or grandparent for an update that could remove from the tree. If the validation succeeds, the Find is linearized at line 3. If it finds an update in progress, the Find helps complete it at line 3. If the validation is not successful, Find retries.
An Insert() performs repeated attempts. Each attempt first determines a sequence number , and calls Search(, ) (line 5) to traverse to the appropriate leaf in . It then calls ValidateLeaf, just as Find does. If the validation is successful and is not already in the tree (line 5), a subtree of three nodes is created (lines 5–5). Execute (line 5) performs the remaining actions of the Insert, in a way that is similar to the Insert of NBBST.
In a way similar to Insert(), a Delete() performs repeated attempts (line 5). Each attempt determines its sequence number (line 5) and calls Search(, ) (line 5) to get the leaf , its parent and grandparent . Next, it validates the leaf (as in Find). If successful, it finds the sibling of (lines 5–5) and calls Execute (line 5) to perform the remaining actions. We remark that, in contrast to what happens in NBBST which changes the appropriate child pointer of to point to the sibling of , PNBBST creates a new node where it copies the sibling of and changes the appropriate child pointer of to point to this new copy. This is necessary to avoid creating cycles consisting of and pointers, which could cause infinite loops during Search.
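The reason for copying the sibling can be illustrated with a small sequential sketch. All names here (`Node`, `seq`, `prev`, `read_child`, `delete_leaf`) are assumptions, and the sketch takes the sibling's children directly instead of reading and validating them as the concurrent algorithm would. If the grandparent pointed at the sibling itself, the sibling would need a back pointer to its own parent, closing a cycle of child and back pointers; a fresh copy avoids this.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Node:
    key: int
    seq: int
    prev: Optional["Node"] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def read_child(p, go_left, seq):
    child = p.left if go_left else p.right
    while child.seq > seq:
        child = child.prev
    return child


def delete_leaf(gp, p, leaf, seq):
    """Replace p (parent of `leaf`) under gp by a fresh copy of leaf's sibling."""
    sibling = p.right if p.left is leaf else p.left
    copy = Node(sibling.key, seq, prev=p,
                left=sibling.left, right=sibling.right)
    if gp.left is p:
        gp.left = copy
    else:
        gp.right = copy
    return copy
```

After the deletion, operations of earlier phases still reach the old parent through the copy's back pointer, so the old version of the subtree remains intact.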
Finally, we discuss Execute and Help. Execute checks whether there are operations in progress on the nodes that are to be flagged or marked and helps them if necessary (lines 4–4). If this is not the case, it creates a new Info object (line 4), performs the first flag CAS to make the Info object visible to other processes (line 4) and calls Help to perform the remaining actions (line 4). Help() first performs the handshaking (line 4–4). If does not abort (line 4), Help attempts to flag and mark the remaining nodes recorded in the Info object pointed to by (lines 4–4). If it succeeds (line 4), it executes a child CAS to apply the required change on the appropriate tree pointer (line 4). If the child CAS is successful, commits (line 4), otherwise it aborts (line 4).
5 Proof of Correctness
5.1 Proof Outline
We first prove each call to a subroutine satisfies its preconditions. This is proved together with some simple invariants, for instance, that ReadChild() returns a pointer to a node whose sequence number is at most . Next, we prove that fields of nodes are updated in an orderly way and we study properties of the child CAS steps. A node is frozen for an Info object if points to and a call to Frozen() would return True. A freeze CAS (i.e., a flag or mark CAS) belongs to an Info object if it occurs in an instance of Help whose parameter is a pointer to , or on line flagCAS1 with being the Info object created on line createinfo. We prove that only the first freeze CAS that belongs to an Info object on each of the nodes in can be successful. Only the first child CAS belonging to can succeed and this can only occur after all nodes in have been frozen. If a successful child CAS belongs to , the field of never has the value Abort. Specifically, this field is initially and changes to Try or Abort (depending on whether handshaking is performed successfully on lines helphandshakingtryCAS). If it changes to Try, then it may become Commit or Abort later (depending on whether all nodes in are successfully frozen for ). A node remains frozen for until changes to Commit or Abort. Once this occurs, the value of never changes again. Only then can the field of the node become frozen for a different Info object. Values stored in fields of nodes and in pointers are distinct (so no ABA problem may arise).
An ichild (dchild) CAS is a child CAS belonging to an Info object that was created by an Insert (Delete, respectively). Note that executing a successful freeze CAS (belonging to an Info object with sequence number ) on a node acts as a “lock” on set on behalf of the operation that created . A successful child CAS belonging to occurs only if the nodes that it will affect have been frozen. Every such node has sequence number less than or equal to . The ichild CAS replaces a leaf with sequence number with a subtree consisting of an internal node and two leaves (see Figure 1). All three nodes of this subtree have sequence number and have never been in the tree before. Moreover, the pointer of the internal node of this subtree points to (whereas those of the two leaves point to ). These changes imply that the execution of the ichild CAS does not affect any of the trees with . The part of the tree on which the ichild CAS is performed cannot change between the time all of the freeze CAS steps (for ) were performed and the time the ichild CAS is executed. So, the change that the ichild CAS performs is visible in every with just after this CAS has been executed. Similarly, a dchild CAS does not cause any change to any tree with . However, for each , it replaces a node in with a copy of the sibling of the node to be deleted (which is a leaf), thus removing three nodes from the tree (see Figure 1).
Characterizing the effects of child CAS steps in this way allows us to prove that no node in , , ever acquires a new ancestor after it is first inserted in the tree. Using this, we also prove that if a node is in the search path for key in at some time, then it remains in the search path for in at all later times. We also prove that for every node an instance of Search(, ) traverses, was in (and on the search path for in it) at some time during the Search. These facts allow us to prove that every , , is a BST at all times. Moreover, we prove that our validation scheme ensures that all successful update operations are applied on the latest version of the tree.
Fix an execution α. An update is imminent at some time during α if it has successfully executed its first freeze CAS before this time and it later executes a successful child CAS in α. We prove that at each time, no two imminent updates have the same key. For a configuration C, consider the set of keys stored in leaves of the current tree at C, plus the set of keys of imminent Insert operations at C, minus the set of keys of imminent Delete operations at C. Let the abstract set at C be the set that would result if all update operations with linearization points at or before C were performed atomically in the order of their linearization points. We prove the invariant that these two sets are equal. Once we know this, we can prove that each operation returns the same result as it would if the operations were executed sequentially in the order defined by their linearization points, to complete the linearizability argument.
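A small worked example may help with the invariant. The helper name `abstract_set` and the concrete keys are illustrative assumptions; the function simply evaluates "leaf keys plus imminent inserts minus imminent deletes".

```python
def abstract_set(leaf_keys, imminent_inserts, imminent_deletes):
    """Keys in leaves, plus keys of imminent Inserts, minus keys of imminent Deletes."""
    return (set(leaf_keys) | set(imminent_inserts)) - set(imminent_deletes)


# Before the child CAS steps: Insert(5) and Delete(8) have frozen their first
# node (so they are already linearized) but have not changed the tree yet.
before = abstract_set({3, 8}, imminent_inserts={5}, imminent_deletes={8})

# After both child CAS steps: the tree itself reflects the updates and no
# update is imminent any more.
after = abstract_set({3, 5}, imminent_inserts=set(), imminent_deletes=set())
```

Both evaluations give the set containing 3 and 5: the represented set changes when an update is linearized at its first freeze CAS, not when its child CAS later reshapes the tree.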
A RangeScan with sequence number is wait-free because it traverses , which can only be modified by updates that begin before the RangeScan’s increment of the (due to handshaking). To prove that the remaining operations are non-blocking, we show that an attempt of an update that freezes its first node can only be blocked by an update that freezes a lower node in the tree, so the update operating at a lowest node in the tree makes progress.
5.2 Formal Proof
We now provide the full proof of correctness. Specifically, we prove that the implementation is linearizable and satisfies progress properties. The early parts of the proof are similar to proofs in previous work [7, 14, 38], but are included here for completeness since the details differ. Most of the more novel aspects of the proof are in Sections 5.2.4 and 5.2.5.
5.2.1 Basic Invariants
We start by proving some simple invariants, and showing that there are no null-pointer exceptions in the code.
Observation 1
The , and fields of a Node never change. No field of an Info record, other than , ever changes. The pointer never changes.
Observation 2
If an Info object’s state field is Commit or Abort in some configuration, it can never be or Try in a subsequent configuration.

Proof. The state of an Info object can be changed only on lines abortCAS, tryCAS, commitWrite and abortWrite. None of these can change the value from Commit or Abort to or Try.
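Observation 2 can be checked exhaustively on a small model of the state field. The sketch below is an illustration under stated assumptions, not the paper's code: `INITIAL` is a hypothetical name for the state an Info object starts in, the two CAS steps are assumed to succeed only from that initial state, the two writes are assumed to store Commit and Abort unconditionally, and the CAS is simulated in a single thread.

```python
import itertools

INITIAL, TRY, COMMIT, ABORT = "Initial", "Try", "Commit", "Abort"


class Info:
    def __init__(self):
        self.state = INITIAL

    def _cas(self, expected, new):
        # Single-threaded stand-in for an atomic compare-and-swap.
        if self.state == expected:
            self.state = new
            return True
        return False

    def try_cas(self):          # assumed shape of the step on line tryCAS
        return self._cas(INITIAL, TRY)

    def abort_cas(self):        # assumed shape of the step on line abortCAS
        return self._cas(INITIAL, ABORT)

    def commit_write(self):     # assumed shape of the step on line commitWrite
        self.state = COMMIT

    def abort_write(self):      # assumed shape of the step on line abortWrite
        self.state = ABORT


STEPS = ("try_cas", "abort_cas", "commit_write", "abort_write")


def violates_observation(sequence):
    """True if the state leaves {Commit, Abort} after having entered it."""
    info = Info()
    decided = False
    for name in sequence:
        getattr(info, name)()
        if decided and info.state in (INITIAL, TRY):
            return True
        decided = decided or info.state in (COMMIT, ABORT)
    return False


# Try every sequence of four steps applied to a fresh Info object.
violations = [s for s in itertools.product(STEPS, repeat=4) if violates_observation(s)]
print(len(violations))   # 0
```

Under these assumptions no step can move a decided Info object back to an undecided state, which is the property the observation relies on.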
Observation 3
The value of is always nonnegative, and for every configuration and every node in configuration , .

Proof. The variable is initialized to 0 and never decreases. All nodes in the initial configuration have field 0. Whenever a node is created by an Insert or Delete, its field is assigned a value that the update operation read from earlier.
Invariant 4
The following statements hold.

1. Each call to a routine satisfies its preconditions.
2. Each Search that has executed line searchinitialize has local variables that satisfy the following: and .
3. Each Search that has executed line searchadvancep has local variables that satisfy the following: and .
4. Each Search that has executed line searchinitialize has local variables that satisfy the following: if is finite then and .
5. Each ReadChild that has executed line readchild has local variables that satisfy the following: and there is a chain of pointers from to a node whose field is at most .
6. Each ReadChild that terminates returns a pointer to a node whose sequence number is at most .
7. Each Find that has executed line findsearch has non values in its local variables and .
8. Each Insert that has executed line insertsearch has local variables that satisfy the following: and and .
9. Each Delete that has executed line deletesearch has local variables that satisfy the following: and and . Moreover, if , then and .
10. For each Internal node , ’s children pointers are non. Moreover, one can reach a node with sequence number at most by tracing pointers from either of ’s children.
11. For each Info object except , all elements of are non, is a subset of , is an element of , and are distinct and non, is an element of , and .
12. Each Update record has a non field.
13. For any Internal node , any node reachable from by following a chain of pointers has and any node reachable from by following a chain of pointers has .
14. For any Info object , if , then is infinite.
15. Any node that can be reached from by following a chain of pointers has an infinite key.
16. For any Internal node , any terminating call to ReadChild returns a node whose key is less than , and any terminating call to ReadChild returns a node whose key is greater than or equal to . Any call to ReadChild returns a node whose key is infinite.

Proof. We prove that all claims are satisfied in every finite execution by induction on the number of steps in the execution.
For the base case, consider an execution of 0 steps. Claims 1 to 9 are satisfied vacuously. The initialization ensures that claims 10 to 15 are true in the initial configuration.
Assume the claims hold for some finite execution . We show that the claims hold for , where is any step.

If is a call to Search at line findsearch, insertsearch or deletesearch, the value of was read from in a previous line. The value of is always nonnegative, so the precondition of the Search is satisfied.
If is a call to ReadChild on line searchadvancel, the preconditions are satisfied by induction hypothesis 3. If is a call to ReadChild on line readsibling, the preconditions are satisfied by induction hypothesis 9. If is a call to ReadChild on line scanhelperrecursive1 to scanhelperrecursive4, the preconditions are satisfied because ScanHelper’s preconditions were satisfied (by induction hypothesis 1).
If is a call to ValidateLink on line validateleafp or validategpp of ValidateLeaf, the preconditions follow from the preconditions of ValidateLeaf, which are satisfied by induction hypothesis 1. (In the latter case, we know from the test on line validategpp that .) If is a call to ValidateLink on line deletevalidatepsib, the preconditions are satisfied because the Search on line deletesearch returned a node with sequence number at most by induction hypothesis 3, and then ReadChild on line readsibling returned a node, by induction hypothesis 6. If is a call to ValidateLink on line validatesibnephew1 or validatesibnephew2, the preconditions are satisfied by induction hypothesis 6 applied to the preceding call to ReadChild on line readsibling.
If is a call to ValidateLeaf on line findvalidateleaf, insertvalidateleaf or deletevalidateleaf, then the preconditions follow from induction hypotheses 2, 3, 4 and 6 applied to the preceding call to Search on line findsearch, insertsearch or deletesearch, respectively.
If is a call to Execute on line insertexecute of Insert, preconditions (a)–(f) follow from induction hypothesis 8 and the fact that line createinternal creates after reading and sets to . It remains to prove precondition (g). Suppose . Since ValidateLeaf on line insertvalidateleaf returned True, the call to ValidateLink on line validateleafp also returned True. So, was the result of the ReadChild on line valreadchild of ValidateLink. By induction hypothesis 16, has an infinite key. Thus, the new Internal node created on line createinternal of the Insert has an infinite key, as required to satisfy precondition (g).
If is a call to Execute on line deleteexecute of Delete, preconditions (a)–(c) follow from induction hypothesis 9 and the fact that (since the Delete did not terminate on line deletefalse), and induction hypothesis 6 applied to the preceding call to ReadChild on line readsibling. Precondition (d) follows from the additional fact that is created on line copysibling after reading a pointer to , which as already argued is non. Precondition (e) is obviously satisfied. Precondition (f) follows from the fact that line copysibling sets to be . It remains to prove precondition (g). Suppose . Since ValidateLeaf on line deletevalidateleaf returned True, the call to ValidateLink on line validategpp also returned True. Then, was the result of the ReadChild on line valreadchild of ValidateLink. By induction hypothesis 16, has an infinite key. The ReadChild on line readsibling returns , which also has an infinite key by induction hypothesis 16. Thus, the node created at line copysibling has an infinite key, as required to satisfy precondition (g).
If is a call to Help on line valhelp, executehelpothers or scanhelperhelp, the argument is non, by induction hypothesis 12. Moreover, the preceding call to InProgress returned true, so the Info object had state or Try. By Observation 2, this Info object cannot be the Dummy object, which is initialized to have state Abort. If is a call to Help on line executehelpself, the precondition is satisfied, since the argument is created at line createinfo.
If is a call to CASChild on line helpCASchild, the Info object is not the Dummy, by the precondition to Help, which was satisfied when Help was called, by induction hypothesis 1. So, the preconditions of CASChild are satisfied by induction hypothesis 11.
If is a call to ScanHelper on line scanreturn, the precondition is satisfied since and the value of is always nonnegative. If is a call to ScanHelper on line scanhelperrecursive1 to scanhelperrecursive4, the precondition is satisfied by induction hypothesis 6.

By Observation 1, the field of a node does not change. So it suffices to prove that any update to in the Search routine preserves the invariant.

First, suppose is the first step of a Search that sets so that is finite. Then is not an execution of line searchinitialize, because never changes and has key , by Observation 1. Likewise, is not the assignment to that occurs in the first execution of line searchadvancel, since the ReadChild on that line (which terminates before ) would have returned a node with an infinite key, by induction hypothesis 16. Thus, occurs after the second execution of line searchadvancegp, which happens after the first execution of line searchadvancep. By induction hypothesis 3, the second execution of line searchadvancep assigns a non-null value to , and .
It remains to consider any step that assigns a new value to (at line searchadvancegp) after the first time is assigned a node with a finite value. As argued in the previous paragraph, this execution of line searchadvancegp will not occur in the first two iterations of the Search’s while loop. So the claim follows from induction hypothesis 3.

By Observation 1, fields are never changed. Thus, it suffices to show that any step that updates inside the ReadChild routine maintains this invariant.
If is a step that sets to a child of at line readchild, the claim follows from induction hypothesis 10 applied to the configuration just before .
If is an execution of line readprev, the claim is clearly preserved.

If is a step in which ReadChild terminates, the claim follows from induction hypothesis 5 applied to the configuration prior to .
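Claims 5 and 6 together describe a simple loop: start at the current child and step backwards through replaced nodes until the sequence number is small enough. The following sketch isolates that loop on a bare chain of nodes. The names `VersionedNode`, `seq`, `prev` and `read_version` are assumptions made for illustration, and the chain of three nodes is an invented example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VersionedNode:
    seq: int                                  # sequence number of this node
    prev: Optional["VersionedNode"] = None    # node that this one replaced


def read_version(newest: VersionedNode, version: int) -> VersionedNode:
    """Walk back from the newest node to one whose seq is at most `version`."""
    node = newest
    while node.seq > version:
        # The dereference is safe only because some node with a small enough
        # sequence number is reachable from `node`; this is what claim 5
        # guarantees, provided `version` is nonnegative.
        node = node.prev
    return node


# A child pointer whose target was replaced twice: seq 0, then 2, then 5.
v0 = VersionedNode(seq=0)
v2 = VersionedNode(seq=2, prev=v0)
v5 = VersionedNode(seq=5, prev=v2)

seqs = [read_version(v5, version).seq for version in range(7)]
print(seqs)   # [0, 0, 2, 2, 2, 5, 5]
```

Every returned node has a sequence number no larger than the requested version, as claim 6 requires, and the walk never falls off the end of the chain because the oldest node has sequence number 0.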

It suffices to consider the step in which the Search called at line findsearch terminates. That Search performed at least one iteration of its while loop (since is an Internal node). So, by induction hypotheses 2 and 3, it follows that the values that Search returns, which the Find stores in and , are not .

It suffices to consider the step in which the Search called at line insertsearch terminates. That Search performed at least one iteration of its while loop (since is an Internal node). So, by induction hypotheses 2 and 3, it follows that the values that Search returns, which the Insert stores in and , are not and have fields that are at most .

It suffices to consider the step in which the Search called at line deletesearch terminates. That Search performed at least one iteration of its while loop (since is an Internal node). So, by induction hypotheses 2 and 3, it follows that the values that Search returns, which the Delete stores in and , are not and have fields that are at most . If , it follows from induction hypothesis 4 that the value Search returns, which the Delete stores in , is not and that .

By Observation 1, pointers are never changed. Thus, it suffices to show that every step that changes a child pointer preserves this invariant. Consider a step that changes a child pointer by executing a successful child CAS (at line childCAS1 or childCAS2). By the precondition of CASChild, the new child pointer will be non and this new child’s pointer will point to the previous child. Since one could reach a node with field at most by following pointers from the old child (by induction hypothesis 10), this will likewise be true if one follows pointers from the new child.

By Observation 1, the and fields of an Info object never change. Thus it is sufficient to consider the case where the step is the creation of a new Info object at line createinfo of the Execute routine. Claim 11 for the new Info object follows from the fact that the preconditions of Execute were satisfied when it was invoked before .

We consider all steps that construct a new Update record. If is an execution of line flagCAS1, the field of the new Update record is
