Persistent Non-Blocking Binary Search Trees Supporting Wait-Free Range Queries

05/12/2018
by   Panagiota Fatourou, et al.

This paper presents the first implementation of a search tree data structure in an asynchronous shared-memory system that provides a wait-free algorithm for executing range queries on the tree, in addition to non-blocking algorithms for Insert, Delete and Find, using single-word Compare-and-Swap (CAS). The implementation is linearizable and tolerates any number of crash failures. Insert and Delete operations that operate on different parts of the tree run fully in parallel (without any interference with one another). We employ a lightweight helping mechanism, where each Insert, Delete and Find operation helps only update operations that affect the local neighbourhood of the leaf it arrives at. Similarly, a Scan helps only those updates taking place on nodes of the part of the tree it traverses, and therefore Scans operating on different parts of the tree do not interfere with one another. Our implementation works in a dynamic system where the number of processes may change over time. The implementation builds upon the non-blocking binary search tree implementation presented by Ellen et al. (in PODC 2010) by applying a simple mechanism to make the tree persistent.

1 Introduction

There has been much recent work on designing efficient concurrent implementations of set data structures [4, 5, 8, 10, 12, 13, 21, 29, 36, 38], which provide algorithms for Insert, Delete, and Find. There is increasing interest in providing additional operations for modern applications, including iterators [1, 32, 33, 35, 36, 37] or general range queries [6, 9]. These are required in many big-data applications [11, 26, 34], where shared in-memory tree-based data indices must be created for fast data retrieval and useful data analytics. Prevalent programming frameworks (e.g., Java [23], .NET [31], TBB [22]) that provide concurrent data structures have added operations to support (non-linearizable) iterators.

The Binary Search Tree (BST) is one of the most fundamental data structures. Ellen et al. [13] presented the first non-blocking implementation (which we will call NB-BST) of a BST from single-word CAS. NB-BST has several nice properties. Updates operating on different parts of the tree do not interfere with one another and Finds never interfere with any other operation. The code of NB-BST is modular and a detailed proof of correctness is provided in [14].

In this paper, we build upon NB-BST to get a persistent version of it, called PNB-BST. In a persistent data structure, old versions of the data structure are preserved when it is modified, so that one can access any old version. We achieve persistence on top of NB-BST by applying a relatively simple technique which fully respects the modularity and simplicity of NB-BST’s design.

In a concurrent setting, a major motivation for providing data structure persistence is that it facilitates the implementation, in a wait-free way [18], of advanced operations (such as range queries) on top of the data structure. We exploit persistence in PNB-BST to provide the first wait-free implementation of RangeScan on top of tree data structures, using single-word CAS. RangeScan(a, b) returns a set containing all keys in the implemented set that are between the given keys a and b. PNB-BST also provides non-blocking (also known as lock-free [18]) implementations of Insert, Delete, and Find.

PNB-BST is linearizable [20], uses single-word CAS, and tolerates any number of crash failures. As in NB-BST, updates in PNB-BST on different parts of the tree are executed in parallel without interfering with one another. A Find simply follows tree edges from the root to a leaf and it may have to help an update operation only if the update is taking place at the parent or grandparent of the leaf that the search arrives at. Thus, Find employs a lightweight helping mechanism. Similarly, RangeScan helps only those operations that are in progress on the nodes that it traverses. RangeScan may print keys (or perform some processing of the nodes, e.g., counting them) as it traverses the tree, thus avoiding any space overhead. PNB-BST does not require knowledge of the number of processes in the system, and therefore it works in a dynamic system where the set of participating processes changes.

The code of PNB-BST is as modular as that of NB-BST, making it fairly easy to understand. However, designing a linearizable implementation of RangeScan required solving several synchronization problems between RangeScans and concurrent update operations on the same part of the tree, so that a RangeScan sees all the successful update operations linearized before it but not those linearized after it. Specifically, we had to (a) apply a mechanism based on sequence numbers set by RangeScans, to split the execution into phases and assign each operation to a distinct phase, (b) design a scheme for linearizing operations that is completely different from that of NB-BST by taking into consideration the phase to which each operation belongs, (c) ensure some additional necessary synchronization between RangeScans and updates, and (d) use a more elaborate helping scheme. The proof of correctness borrows from that of NB-BST. However, due to these complications, many parts of it are more intricate. The proof that RangeScans work correctly is completely novel.

2 Related Work

Our implementation is based on NB-BST, the binary search tree implementation proposed in [13]. Brown et al. [7] generalized the techniques in [13] to get the primitives LLX, SCX and VLX which are generalizations of load-link, store-conditional and validate. These primitives can be used to simplify the non-blocking implementation of updates in every data structure based on a down tree (see [8, 17] for examples). Unfortunately, our technique for supporting range queries cannot directly be implemented using LLX and SCX: the functionality hidden inside LLX must be split in two parts between which some synchronization is necessary to coordinate RangeScans with updates. The work in [13] has also been generalized in [38] to get a non-blocking implementation of a Patricia trie. None of these implementations of non-blocking search trees supports range queries.

Prokopec et al. [36] presented a non-blocking implementation of a concurrent hash trie which supports a Scan operation that provides a consistent snapshot of the entire data structure. Their algorithm uses indirection nodes (i-nodes) [41] that double the height of the tree. To implement Scan, the algorithm provides a persistent implementation of the trie in which updates may have to copy the entire path of nodes they traverse to synchronize with concurrent Scans. Moreover, the algorithm causes a lot of contention on the root node. The algorithm could be adjusted to support RangeScan. However, every RangeScan would cause updates taking place anywhere in the tree to copy all the nodes they visit, even if they are not in the part of the tree being scanned.

Petrank and Timnat [35] gave a technique (based on [24]) to implement Scan on top of non-blocking set data structures such as linked lists and skip lists. Concurrent Scans share a snap collector object in which they record information about the nodes they traverse. To ensure that a Scan appropriately synchronizes with updates, processes executing updates or Finds must also record information about the operations they perform (or those executed by other processes they encounter) in the snap collector object. Although the snap collector object’s primitive operations are wait-free, the following example shows that the implementation of Scan using those primitives is non-blocking but not wait-free. Assume that the algorithm is applied on top of the non-blocking sorted linked list implementation presented by Harris [16]. A Scan must traverse the list, and this traversal may never complete if concurrent updates continue to add more elements to the end of the list faster than the Scan can traverse them. In this case, the lists maintained in the snap collector will grow infinitely long. In case is known, updates on different parts of the data structure do not interfere with one another and have been designed to be fast. However, Scan is rather costly in terms of both time and space. Chatterjee [9] generalizes the algorithm of Petrank and Timnat to get a non-blocking implementation of RangeScan using partial snapshots [2]. In a different direction, work in [1, 37] characterizes when implementing the technique of [35] on top of non-blocking data structures is actually possible.

Brown et al. [6] presented an implementation of a k-ary search tree supporting RangeScan in an obstruction-free way [19]. Avni et al. [3] presented a skip list implementation which supports RangeScan. It can be either lock-free or be built on top of a transactional memory system, so its progress guarantees are weaker than wait-freedom. Bronson et al. [5] presented a blocking implementation of a relaxed-balance AVL tree which provides support for Scan.

Some papers present wait-free implementations of Scan (or RangeScan) on data structures other than trees or in different settings. Nikolakopoulos et al. [32, 33] gave a set of consistency definitions for Scan and presented Scan algorithms for the lock-free concurrent queue in [28] that ensure different consistency and progress guarantees. Fatourou et al. [15] presented a wait-free implementation of Scan on top of the non-blocking deque implementation of [27]. Kanellou and Kallimanis [25] introduced a new graph model and provided a wait-free implementation of a node-static graph which supports partial traversals in addition to edge insertions, removals, and weight updates. Spiegelman et al. [39] presented two memory models and provided wait-free dynamic atomic snapshot algorithms for both.

3 Overview of the BST Implementation and Preliminaries

We provide a brief description of NB-BST (following the presentation in [13]) and some preliminaries.

NB-BST implements Binary Search Trees (BST) that are leaf-oriented, i.e., all keys are stored in the leaves of the tree. The tree is full and maintains the binary search tree property: for every node v in the tree, the key of v is larger than the key of every node in v’s left subtree and smaller than or equal to the key of every node in v’s right subtree. The keys of the Internal nodes are used solely for routing to the appropriate leaf during search. A leaf (internal) node is represented by an object of type Leaf (Internal, respectively); we say that Leaf and Internal nodes are of type Node (see Figure 2).

To insert a key k in a leaf-oriented tree, a search for k is first performed. Let ℓ and p be the leaf that this search arrives at and its parent. If ℓ does not contain k, then a subtree consisting of an internal node and two leaf nodes is created. The leaves contain k and the key of ℓ (with the smaller key in the left leaf). The internal node contains the bigger of these two keys. The child pointer of p which was pointing to ℓ is changed to point to the root of this subtree. Similarly, for a Delete(k), let ℓ, p and gp be the leaf node that the search performed by the Delete arrives at, its parent, and its grandparent. If the key of ℓ is k, then the child pointer of gp which was pointing to p is changed to point to the sibling of ℓ. By performing the updates in this way, the properties of the tree are maintained.
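The sequential insert and delete rules just described can be sketched as follows. This is a minimal single-threaded illustration, not the concurrent algorithm: the names `Leaf`, `Internal`, `new_tree`, `search`, `insert`, `delete` and `keys` are chosen here for illustration, and a single sentinel key `INF` is assumed so that every real leaf has both a parent and a grandparent.

```python
from dataclasses import dataclass
from typing import Union

INF = float("inf")  # assumed sentinel key, larger than every real key


@dataclass(eq=False)
class Leaf:
    key: float


@dataclass(eq=False)
class Internal:
    key: float
    left: "Node"
    right: "Node"


Node = Union[Leaf, Internal]


def new_tree() -> Internal:
    # Root with two sentinel leaves; real keys always end up in the left subtree.
    return Internal(INF, Leaf(INF), Leaf(INF))


def search(root: Internal, k: float):
    """Return (grandparent, parent, leaf) on the search path for k."""
    gp, p, node = None, None, root
    while isinstance(node, Internal):
        gp, p = p, node
        # Keys smaller than the routing key are in the left subtree.
        node = node.left if k < node.key else node.right
    return gp, p, node


def insert(root: Internal, k: float) -> bool:
    _, p, leaf = search(root, k)
    if leaf.key == k:
        return False  # duplicate key
    new_leaf = Leaf(k)
    small, big = (new_leaf, leaf) if k < leaf.key else (leaf, new_leaf)
    subtree = Internal(big.key, small, big)  # internal node holds the bigger key
    if p.left is leaf:
        p.left = subtree
    else:
        p.right = subtree
    return True


def delete(root: Internal, k: float) -> bool:
    gp, p, leaf = search(root, k)
    if leaf.key != k:
        return False  # key not present
    sibling = p.right if p.left is leaf else p.left
    # The grandparent's pointer to p is redirected to the sibling of the leaf.
    if gp.left is p:
        gp.left = sibling
    else:
        gp.right = sibling
    return True


def keys(node: Node) -> list:
    """In-order list of the real (non-sentinel) keys."""
    if isinstance(node, Leaf):
        return [] if node.key == INF else [node.key]
    return keys(node.left) + keys(node.right)
```

For example, inserting 5, 3 and 8 into `new_tree()` and then deleting 3 leaves the leaves 5 and 8 in key order.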

An implementation is linearizable if, in every execution α, each operation that completes in α (and some that do not) can be assigned a linearization point between the starting and finishing time of its execution so that the return values of those operations are the same in α as if the operations were executed sequentially in the order specified by their linearization points.

To ensure linearizability, NB-BST applies a technique that flags and marks nodes. A node is flagged before any of its child pointers changes. A node is permanently marked before it is removed. To mark and flag nodes, NB-BST uses CAS. CAS(O, u, v) changes the value of object O to v if its current value is equal to u, otherwise the CAS fails and no change is applied on O. In either case, the value that O had before the execution of CAS is returned.
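The CAS semantics described above (conditional update, old value always returned) can be emulated as in the following sketch. The class name `CASWord` is hypothetical, and a lock stands in for the atomicity that a hardware CAS instruction provides, so this only illustrates the return-value convention rather than a non-blocking primitive.

```python
import threading


class CASWord:
    """A single shared word with a CAS that returns the previous value."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()  # stands in for hardware atomicity

    def read(self):
        return self._value

    def cas(self, expected, new):
        with self._lock:
            old = self._value
            if old == expected:
                self._value = new  # CAS succeeds
            return old  # returned whether or not the CAS succeeded
```

A caller detects success by comparing the returned value with the expected one: `w.cas(u, v) == u` holds exactly when the word was changed.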

NB-BST provides a routine, Search(k), to search the data structure for key k. Search returns pointers to the leaf node at which the Search arrives, to its parent, and to its grandparent. Find(k) executes Search(k) and checks whether the returned leaf contains the key k. Insert(k) executes Search(k) to get a leaf ℓ and its parent p. It then performs a flag CAS, to flag p, then a child CAS to change the appropriate child pointer of p to point to the root of the newly created subtree of three nodes, and finally an unflag CAS to unflag p. If it fails to flag p, it restarts without executing the other two CAS steps. Similarly, a Delete(k) calls Search to get a leaf ℓ, its parent p, and its grandparent gp. It first executes a flag CAS trying to flag gp. If this fails, it restarts. If the flagging succeeds, it executes a mark CAS to mark p. If this fails, it unflags gp and restarts. Otherwise, it executes a child CAS to change the appropriate child pointer of gp to point from p to the sibling of ℓ, it unflags gp and returns. Both Insert and Delete operations execute the body of a while loop repeatedly until they succeed. The execution of an iteration of the while loop is called an attempt.

Processes may fail by crashing. An implementation is non-blocking if in every infinite execution, infinitely many operations are completed. NB-BST is non-blocking: Each process that flags or marks a node stores in it a pointer to an Info object, which contains information about the operation it performs (see Figure 2). This information includes the old and new values that should be used by the CAS steps that the process will perform to complete the execution of the operation. Other processes that apply operations on the same part of the data structure can help this operation complete and unflag the node. Once they do so, they are able to retry their own operations. Helping is necessary only if an update operation wants to flag or mark a node already flagged or marked by another process.

4 A Persistent Binary Search Tree Supporting Range Queries

Figure 1: Examples of Insert and Delete.

We modify NB-BST to get PNB-BST, a BST implementation that supports RangeScan, in addition to Insert, Delete, and Find.

4.1 Overview

In a concurrent environment, care must be taken to synchronize RangeScans with updates since as a RangeScan traverses the tree, it may see an update op by a process p but it may miss an update that finishes before op starts, and was applied on the part of the tree that has already been visited by the RangeScan (thus violating linearizability).

To avoid such situations, PNB-BST implements a persistent version of the leaf-oriented tree, thus allowing a RangeScan to reconstruct previous versions of it. To achieve this, PNB-BST stores in each node an additional pointer, called prev. Whenever the child pointer of a node v changes from a node u to a node u', the prev pointer of u' points to u. (Figure 1 illustrates an example.)

PNB-BST maintains a shared integer, Counter, which is incremented each time a RangeScan takes place. Each operation has a sequence number associated with it. Each RangeScan starts its execution by reading Counter and uses the value read as its sequence number. Each other operation reads Counter at the beginning of each of its attempts. The sequence number of such an operation is the sequence number read in its last attempt. A successful update operation records its sequence number in the Info object it creates during its last attempt. Intuitively, each RangeScan initiates a new execution phase whenever it increments Counter. For each i ≥ 0, phase i is the period during which Counter has the value i. We say that all operations with sequence number i belong to phase i.

Each tree node has a sequence number which is the sequence number of the operation that created it. In this way, a RangeScan may figure out which nodes have been inserted or deleted by updates that belong to later phases. For any Internal node v whose sequence number is at most i, we define the version-i left (or right) child of v to be the node that is reached by following the left (or right) child pointer of v and then following prev pointers until reaching the first node whose seq field is less than or equal to i. (We prove that such a node exists.) For every configuration C, we define the graph D_i(C) as follows. The nodes of D_i(C) are all the nodes existing in C and the edges go from nodes to their version-i children; T_i(C) is the subgraph of D_i(C) containing those nodes that are reachable from the root node in D_i(C). We prove that T_i(C) is a binary search tree.
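The definition of a version-i child can be made concrete with a small sketch. The class `VNode` and the function `version_child` are illustrative names, and the per-node fields `seq` and `prev` are assumed to play the roles of the node's sequence number and its pointer to the node it replaced. The example hand-builds one child slot that was rewritten in phases 2 and 5.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class VNode:
    name: str
    seq: int                          # phase of the operation that created the node
    prev: Optional["VNode"] = None    # node this one replaced, if any
    left: Optional["VNode"] = None
    right: Optional["VNode"] = None


def version_child(v: VNode, go_left: bool, i: int) -> VNode:
    """Follow one child pointer of v, then prev pointers until seq <= i."""
    child = v.left if go_left else v.right
    while child.seq > i:
        child = child.prev
    return child


# The left child slot of v held `old` (phase 0), then `mid` (phase 2), then `new` (phase 5).
old = VNode("old", seq=0)
mid = VNode("mid", seq=2, prev=old)
new = VNode("new", seq=5, prev=mid)
v = VNode("v", seq=0, left=new)
```

An operation with sequence number 3 reading the left child of `v` skips `new` and stops at `mid`, so it sees the tree as it was in phase 3.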

We linearize every RangeScan operation with sequence number i at the end of phase i, with ties broken in an arbitrary way. Moreover, we linearize all Insert, Delete and Find operations that belong to phase i during phase i. To ensure linearizability, PNB-BST should guarantee that a RangeScan with sequence number i ignores all changes performed by successful update operations that belong to phases with sequence numbers bigger than i. To ensure this, each operation with sequence number i ignores those nodes of the tree that have sequence numbers bigger than i by moving from a node to its appropriate version-i child. Thus, each operation with sequence number i always operates on T_i.

1 type Update {                  stored in one CAS word
2
3         Info *info
4         }
5 type Info {
6
7         Internal *[]            nodes to be frozen
8         Update []               old values for freeze CAS steps
9         Internal *[]            nodes to be marked
10         Internal *             node whose child will change
11         Node *                 old value for child CAS
12         Node *                 new value for the child CAS
13         int                    sequence number
14         }
15 type Internal {               subtype of Node
16
17         Update
18         Node *left, *right
19         Node *
20         int seq
21         }
22 type Leaf {                   subtype of Node
23
24         Update
25         Node *
26         int seq
27         }
28  Initialization:
29 shared counter := 0
30 shared Info * := pointer to a new Info object whose field is Abort, and whose other fields are
31 shared Internal * := pointer to new Internal node with field , field ,
         field , field , and its and fields pointing to new Leaf nodes whose fields
        are , fields are , and keys and , respectively
Figure 2: Type definitions and initialization.
32 Search(, int ): {
33          Precondition:
34         Internal *, *
35         Node *
36         while points to an internal node {
37                   Remember parent of
38                  Remember parent of
39                  Go to appropriate version- child of
40                }
41         return
42         }
43 ReadChild(): Node* {
44          Precondition: is non- and
45         if then else  Move down to appropriate child
46         while ()
47         return ;
48         }
49 ValidateLink(): {
50          Preconditions: and are non-
51         Update
52         
53         if then {
54                 Help()
55                return
56                }
57         if ( and ) or ( and ) then return
58         else return
59         }
60 ValidateLeaf(Internal *, Internal *, Leaf *, Key ) : {
61          Preconditions: and are non- and if then is non-
62         Update
63         Boolean
64         
65         if and then
66         
67         return
68         }
69 Find(): Leaf* {
70         Internal *
71         Leaf *
72         Boolean
73         while True {
74                 
75                
76                
77                if then {
78                        if then return
79                        else return
80                        }
81                }
82         }
83 CAS-Child(Internal *, Node *, Node *) {
         Precondition: points to an Internal node and points to a Node (i.e., neither is ) and
         This routine tries to change one of the child fields of the node that points to from to .
84         if then
85                 CAS  child CAS
86                else
87                 CAS  child CAS
88                }
Figure 3: Pseudocode for Search, Find and some helper routines.
89 Frozen(Update ): Boolean {
90         return (( and or
                         and ))
91                        }
92 Execute (Internal *[], Update [], Internal *[], Internal *,
                                        Node *, Node *, int ): Boolean {
93                                                Preconditions: (a) Elements of are non-, (b) is a subset of , (c) is an element of ,
94                          (d) and are distinct and non-, (e) is an element of ,
95                         (f) , and (g) if then is infinite.
96                        for to length of {
97                 if then {
98                        if then Help()
99                        return False
100                        }
101                }
102          pointer to a new Info record containing
103         if then  freeze CAS
104                 return Help()
105                else return False
106         }
107 Help(Info *): boolean {
108          Precondition: is non- and does not point to the Dummy Info object
109         int
110         boolean
111         if then
112                 CAS(, , Abort)  abort CAS
113                else CAS(, , Try)  try CAS
114         
115         while and length of do {
116                 if appears in then
117                          freeze CAS
118                        else   freeze CAS
119                
120                
121                }
122         if then {
123                 
124                  commit write
125                } else if then
126                   abort write
127                return ()
128         }
129 RangeScan(int , int ): Set {
130         
131         
132         return
133         }
134 ScanHelper(Node *, int , int , int ): Set {
135          Precondition: points to a node with
136         Info *
137         if points to a leaf then return
138         else {
139                 
140                if then
141                if then return ScanHelper
142                else if then return ScanHelper
143                else return
144                                
145                               }
146         }
Figure 4: Pseudocode for Execute, Help and Scan.
147 Insert(): boolean {
148         Internal * , *, *
149         Leaf *, *
150         Leaf *
151         Update
152         Info *
153         Boolean
154         while True {
155                 
156                
157                
158                if then {
159                        if then return False  Cannot insert duplicate key
160                        else {
161                                 pointer to a new Leaf node whose field is , its field is equal to , and its field is
162                                pointer to a new Leaf whose key is ,
                                      its field is equal to  and its field is equal to
163                                        pointer to a new Internal node with field ,
                                       field , its field equal to and its field equal to ,
                                      and with two child fields equal to and
                                      (the one with the smaller key is the left child),
164                                       if Execute() then return True
165                               }
166                        }
167                }
168         }
169 Delete(): boolean {
170         Internal *, *
171         Leaf *
172         Node *, *
173         Update
174         Info *
175         Boolean
176         while True {
177                 
178                
179                
180                if then {
181                        if then return False  Key is not in the tree
182                         := ReadChild()
183                        
184                        if then {
185                                 pointer to a new copy of sibling with its field set to and its pointer set to
186                               if is Internal then {
187                                       
188                                       if then
189                                       } else
190                               if and Execute(
                                                                        ) then
191                                       return True
192                                       }
193                        }
194                }
195         }
Figure 5: Pseudocode for Insert and Delete.

To ensure linearizability, PNB-BST should also ensure that each RangeScan with sequence number i sees all the successful updates that belong to phases smaller than or equal to i. To achieve this, PNB-BST employs a handshaking mechanism between each scanner and the updaters. It also uses a helping mechanism which is more elaborate than that of NB-BST.

To describe the handshaking mechanism in more detail, consider any update operation op initiated by process p. No process can be aware of op before p performs a successful flag CAS for op. Assume that p flags node v for op in an attempt att with sequence number i. To ensure that no RangeScan with sequence number i will miss op, p checks whether Counter still has the value i after the flag CAS has occurred. We call this check the handshaking check of att. If the handshaking check succeeds, it is guaranteed that no RangeScan has begun its traversal between the time that p reads Counter at the beginning of the execution of att and the time the handshaking check of att is executed. Note that any future RangeScan with sequence number i or greater that traverses v while op is still in progress will see that v is flagged and find the information required to complete op in its Info object. In PNB-BST, the RangeScan helps op complete before it continues its traversal.

However, if the handshaking check fails, p does not know whether any RangeScan that incremented Counter to a value greater than i has already traversed the part of the tree that op is trying to update, and has missed this update. At least one of these RangeScans will have sequence number equal to i. Thus, if op succeeds, linearizability could be violated. To avoid this problem, p pro-actively aborts its attempt of op if the handshaking check fails, and then it initiates a new attempt for op (which will have a sequence number bigger than i). This abort mechanism is implemented as follows. The Info object has a field, called state, which takes values from the set {⊥, Try, Commit, Abort} (initially ⊥). Each attempt creates an Info object. To abort the execution of an attempt, p changes the state field of its Info object to Abort. Once an attempt is aborted, the value of the state field of its Info object remains Abort forever. If the handshaking check succeeds, then p changes the state field of the Info object of the attempt to Try and tries to execute the remaining steps of this attempt. If the attempt completes successfully, p changes the state field of the Info object to Commit. Info objects whose state field is equal to ⊥ or Try belong to update operations that are still in progress.
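The handshaking check and the resulting state transitions can be sketched as below. This is a single-threaded illustration under stated assumptions: the names `Info`, `cas_state` and `handshake` are hypothetical, the string constants stand in for the four state values, and `cas_state` is not atomic here (a real implementation would use a hardware CAS on the state word).

```python
from dataclasses import dataclass

BOTTOM, TRY, COMMIT, ABORT = "bottom", "Try", "Commit", "Abort"


@dataclass
class Info:
    seq: int             # sequence number read at the start of the attempt
    state: str = BOTTOM  # initial state of every attempt


def cas_state(info: Info, expected: str, new: str) -> str:
    """Emulated CAS on the state field; returns the old value."""
    old = info.state
    if old == expected:
        info.state = new
    return old


def handshake(info: Info, counter_now: int) -> bool:
    """Run by the owner or by any helper after the first flag CAS.

    counter_now is the value of the shared counter read by the caller.
    """
    if counter_now != info.seq:
        cas_state(info, BOTTOM, ABORT)  # a new phase started: abort the attempt
    else:
        cas_state(info, BOTTOM, TRY)    # same phase: the attempt may proceed
    return info.state == TRY
```

Because both transitions start from the initial state only, the first process to perform the check decides the outcome: an attempt that reached Try cannot later be aborted by a slower helper that observes a larger counter, and an aborted attempt stays aborted.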

We now describe the linearization points in more detail. If an attempt of an Insert or Delete ultimately succeeds in updating a child pointer of the tree to make the update take effect, we linearize the operation at the time that attempt first flags a node: this is when the update first becomes visible to other processes. (This scheme differs from the original NB-BST, where updates are linearized at the time they actually change a child pointer in the tree.) Because of handshaking, this linearization point is guaranteed to be before the end of the phase to which the operation belongs.

When a Find operation completes a traversal of a branch of the tree to a leaf, it checks whether an update has already removed the leaf or is in progress and could later remove that leaf from the tree. If so, the Find helps the update complete and retries. Otherwise, the Find terminates and is linearized at the time when the leaf is in the tree and has no pending update that might remove it later. (As in the original NB-BST, the traversal of the branch may pass through nodes that are no longer in the tree, but so long as it ends up at a leaf that is still present in the current tree we prove that it ends up at the correct leaf of the current tree.) An Insert(k) that finds that key k is already in the tree, and a Delete(k) that discovers that k is not in the tree, are linearized similarly to Find operations.

The helping mechanism employed by Find operations ensures that the Find will see an update that has been linearized (when it flags a node) before the Find but has not yet swung a child pointer to update the shape of the tree. But it is also crucial for synchronizing with RangeScan operations, for the following reason. Assume that a process p initiates an Insert(1). It reads i in Counter and successfully performs its flag CAS. Then, a RangeScan is initiated by a process q and changes the value of Counter from i to i+1. Finally, a Find(1) is initiated by a process r and reads i+1 in Counter. Find(1) and Insert(1) will arrive at the same leaf node ℓ (because Insert(1) has not performed its child CAS by the time Find reaches the leaf). If Find(1) ignores the flag that exists on the parent node of ℓ and does not help Insert(1) to complete, it will return False. If Insert(1) now continues its execution, it will complete successfully, and given that it has sequence number i, it will be linearized before Find(1), which has sequence number i+1. That would violate linearizability.

4.2 Detailed Implementation

A RangeScan(a, b) (Figure 4) first determines its sequence number seq and then increments Counter to start a new phase. To traverse the appropriate part of the tree, it calls ScanHelper. ScanHelper starts from the root and recursively calls itself on the version-seq right child of the current node v if a is greater than v’s key, or on v’s version-seq left child if b is smaller than v’s key, or on both version-seq children if v’s key is between a and b. Whenever it visits a node where an update is in progress, it helps the update to complete. ReadChild is used to obtain v’s appropriate version-seq child.
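The traversal performed by RangeScan can be sketched sequentially as follows. This is only an illustration under simplifying assumptions: there is no concurrency and therefore no helping, a single sentinel key `INF` is used, and the names `Node`, `Tree`, `read_child`, `insert`, `scan_helper` and `range_scan` are chosen here rather than taken from the pseudocode. The `insert` shown mirrors the persistent update described earlier (new nodes carry the current sequence number and the new internal node's `prev` points to the replaced leaf).

```python
from dataclasses import dataclass
from typing import Optional

INF = float("inf")  # assumed sentinel key


@dataclass(eq=False)
class Node:
    key: float
    seq: int
    prev: Optional["Node"] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    @property
    def is_leaf(self) -> bool:
        return self.left is None


class Tree:
    def __init__(self):
        self.counter = 0  # shared phase counter
        self.root = Node(INF, 0, left=Node(INF, 0), right=Node(INF, 0))


def read_child(p: Node, go_left: bool, seq: int) -> Node:
    """Return the version-seq child of p."""
    child = p.left if go_left else p.right
    while child.seq > seq:
        child = child.prev
    return child


def insert(tree: Tree, k: float) -> bool:
    seq = tree.counter
    p, node = None, tree.root
    while not node.is_leaf:
        p = node
        node = read_child(node, k < node.key, seq)
    if node.key == k:
        return False
    small, big = sorted((k, node.key))
    sub = Node(big, seq, prev=node, left=Node(small, seq), right=Node(big, seq))
    if p.left is node:
        p.left = sub
    else:
        p.right = sub
    return True


def scan_helper(node: Node, seq: int, a: float, b: float) -> set:
    if node.is_leaf:
        return {node.key} if a <= node.key <= b else set()
    if a > node.key:   # whole range lies in the right subtree
        return scan_helper(read_child(node, False, seq), seq, a, b)
    if b < node.key:   # whole range lies in the left subtree
        return scan_helper(read_child(node, True, seq), seq, a, b)
    return (scan_helper(read_child(node, True, seq), seq, a, b)
            | scan_helper(read_child(node, False, seq), seq, a, b))


def range_scan(tree: Tree, a: float, b: float) -> set:
    seq = tree.counter   # sequence number of this scan
    tree.counter += 1    # start a new phase
    return scan_helper(tree.root, seq, a, b)
```

Calling `scan_helper` directly with an older sequence number shows the effect of persistence: keys inserted in later phases are invisible to it, while a fresh `range_scan` sees them.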

Search(k, seq) traverses a branch of T_seq from the root to a leaf node (Figure 3). Find gets a sequence number seq and calls Search(k, seq) to traverse the BST to a leaf ℓ. Next, it calls ValidateLeaf to ensure that there is no update that has removed ℓ or has flagged ℓ’s parent or grandparent for an update that could remove ℓ from the tree. If the validation succeeds, the Find is linearized at that point. If the validation finds an update in progress, the Find helps complete it. If the validation is not successful, Find retries.

An Insert() performs repeated attempts. Each attempt first determines a sequence number , and calls Search(, ) (line 5) to traverse to the appropriate leaf in . It then calls ValidateLeaf, just as Find does. If the validation is successful and is not already in the tree (line 5), a subtree of three nodes is created (lines 55). Execute (line 5) performs the remaining actions of the Insert, in a way that is similar to the Insert of NB-BST.

In a way similar to Insert(), a Delete() performs repeated attempts (line 5). Each attempt determines its sequence number (line 5) and calls Search(, ) (line 5) to get the leaf , its parent and grandparent . Next, it validates the leaf (as in Find). If successful, it finds the sibling of (lines 55) and calls Execute (line 5) to perform the remaining actions. We remark that, in contrast to what happens in NB-BST which changes the appropriate child pointer of to point to the sibling of , PNB-BST creates a new node where it copies the sibling of and changes the appropriate child pointer of to point to this new copy. This is necessary to avoid creating cycles consisting of and pointers, which could cause infinite loops during Search.

Finally, we discuss Execute and Help. Execute checks whether there are operations in progress on the nodes that are to be flagged or marked and helps them if necessary (lines 44). If this is not the case, it creates a new Info object (line 4), performs the first flag CAS to make the Info object visible to other processes (line 4) and calls Help to perform the remaining actions (line 4). Help() first performs the handshaking (line 44). If does not abort (line 4), Help attempts to flag and mark the remaining nodes recorded in the Info object pointed to by (lines 44). If it succeeds (line 4), it executes a child CAS to apply the required change on the appropriate tree pointer (line 4). If the child CAS is successful, commits (line 4), otherwise it aborts (line 4).

5 Proof of Correctness

5.1 Proof Outline

We first prove each call to a subroutine satisfies its preconditions. This is proved together with some simple invariants, for instance, that ReadChild() returns a pointer to a node whose sequence number is at most . Next, we prove that fields of nodes are updated in an orderly way and we study properties of the child CAS steps. A node is frozen for an Info object if points to and a call to Frozen() would return True. A freeze CAS (i.e., a flag or mark CAS) belongs to an Info object if it occurs in an instance of Help whose parameter is a pointer to , or on line flagCAS1 with being the Info object created on line create-info. We prove that only the first freeze CAS that belongs to an Info object on each of the nodes in can be successful. Only the first child CAS belonging to can succeed and this can only occur after all nodes in have been frozen. If a successful child CAS belongs to , the field of never has the value Abort. Specifically, this field is initially and changes to Try or Abort (depending on whether handshaking is performed successfully on lines help-handshaking-tryCAS). If it changes to Try, then it may become Commit or Abort later (depending on whether all nodes in are successfully frozen for ). A node remains frozen for until changes to Commit or Abort. Once this occurs, the value of never changes again. Only then can the field of the node become frozen for a different Info object. Values stored in fields of nodes and in pointers are distinct (so no ABA problem may arise).
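The life cycle of an Info object's state described above can be illustrated with a small sketch. Everything named here is hypothetical: `UNDECIDED` stands in for the initial state, `cas_state` emulates a single-word CAS with a lock, and the handshake test (comparing the shared counter with the sequence number recorded in the Info object) is an assumption about how the handshaking decides between Try and Abort, not a transcription of the paper's pseudocode.

```python
import threading
from enum import Enum


class State(Enum):
    UNDECIDED = 0   # placeholder name for the initial state
    TRY = 1
    COMMIT = 2
    ABORT = 3


class Info:
    def __init__(self, seq: int) -> None:
        self.seq = seq
        self.state = State.UNDECIDED
        self._lock = threading.Lock()   # emulates the atomicity of a CAS

    def cas_state(self, expected: State, new: State) -> bool:
        with self._lock:
            if self.state is expected:
                self.state = new
                return True
            return False


def handshake(info: Info, counter: int) -> State:
    """Move the state from UNDECIDED to TRY or ABORT; later calls have no effect."""
    # Assumption: the attempt aborts if a new phase has started since it read seq.
    target = State.TRY if counter == info.seq else State.ABORT
    info.cas_state(State.UNDECIDED, target)
    return info.state


def finish(info: Info, frozen_and_child_cas_ok: bool) -> State:
    """Decide an attempt that is in state TRY; decided attempts never change."""
    if info.state is State.TRY:
        info.state = State.COMMIT if frozen_and_child_cas_ok else State.ABORT
    return info.state


ok = Info(seq=0)
handshake(ok, counter=0)        # no phase change: the attempt may proceed
finish(ok, True)
late = Info(seq=0)
handshake(late, counter=1)      # a RangeScan started a new phase in between
print(ok.state, late.state)     # State.COMMIT State.ABORT
```

Once the state is Commit or Abort, neither `handshake` nor `finish` changes it again, which mirrors the claim that the field never changes after it is decided.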

An ichild (dchild) CAS is a child CAS belonging to an Info object that was created by an Insert (Delete, respectively). Note that executing a successful freeze CAS (belonging to an Info object with sequence number ) on a node acts as a “lock” on set on behalf of the operation that created . A successful child CAS belonging to occurs only if the nodes that it will affect have been frozen. Every such node has sequence number less than or equal to . The ichild CAS replaces a leaf with sequence number with a subtree consisting of an internal node and two leaves (see Figure 1). All three nodes of this subtree have sequence number and have never been in the tree before. Moreover, the pointer of the internal node of this subtree points to (whereas those of the two leaves point to ). These changes imply that the execution of the ichild CAS does not affect any of the trees with . The part of the tree on which the ichild CAS is performed cannot change between the time all of the freeze CAS steps (for ) were performed and the time the ichild CAS is executed. So, the change that the ichild CAS performs is visible in every with just after this CAS has been executed. Similarly, a dchild CAS does not cause any change to any tree with . However, for each , it replaces a node in with a copy of the sibling of the node to be deleted (which is a leaf), thus removing three nodes from the tree (see Figure 1).
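The effect of the two kinds of child CAS on old and new versions can be made concrete with a sequential sketch. It is a simplification under stated assumptions: flagging, marking and the CAS itself are replaced by plain assignments, sentinel nodes are omitted, and the names `Node`, `apply_insert`, `apply_delete`, `keys`, `seq` and `prev` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    key: int
    seq: int
    prev: Optional["Node"] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def read_child(parent: Node, go_left: bool, seq: int) -> Node:
    child = parent.left if go_left else parent.right
    while child.seq > seq:          # step back to the version-seq child
        child = child.prev
    return child


def keys(node: Node, seq: int) -> List[int]:
    """Keys stored in the leaves of version `seq`, in sorted order."""
    if node.left is None:
        return [node.key]
    return keys(read_child(node, True, seq), seq) + keys(read_child(node, False, seq), seq)


def set_child(parent: Node, go_left: bool, child: Node) -> None:
    if go_left:
        parent.left = child
    else:
        parent.right = child


def apply_insert(parent: Node, go_left: bool, key: int, seq: int) -> None:
    """Replace a leaf by a fresh internal node with two fresh leaves."""
    old = parent.left if go_left else parent.right
    small, large = sorted((Node(key, seq), Node(old.key, seq)), key=lambda n: n.key)
    # Only the new internal node records the replaced leaf in prev.
    set_child(parent, go_left, Node(large.key, seq, prev=old, left=small, right=large))


def apply_delete(gp: Node, p_is_left: bool, leaf_is_left: bool, seq: int) -> None:
    """Replace the parent of the deleted leaf by a fresh copy of the leaf's sibling."""
    p = gp.left if p_is_left else gp.right
    sibling = p.right if leaf_is_left else p.left
    copy = Node(sibling.key, seq, prev=p, left=sibling.left, right=sibling.right)
    set_child(gp, p_is_left, copy)


root = Node(10, seq=0, left=Node(5, seq=0), right=Node(10, seq=0))
apply_insert(root, go_left=True, key=7, seq=1)                    # phase 1: insert 7
apply_delete(root, p_is_left=True, leaf_is_left=True, seq=2)      # phase 2: delete 5

print(keys(root, 0))  # [5, 10]
print(keys(root, 1))  # [5, 7, 10]
print(keys(root, 2))  # [7, 10]
```

Because the deletion installs a copy of the sibling rather than the sibling itself, the copy can carry its own sequence number and `prev` pointer, and every earlier version remains reachable unchanged.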

Characterizing the effects of child CAS steps in this way allows us to prove that no node in , , ever acquires a new ancestor after it is first inserted in the tree. Using this, we also prove that if a node is in the search path for key in at some time, then it remains in the search path for in at all later times. We also prove that for every node an instance of Search(, ) traverses, was in (and on the search path for in it) at some time during the Search. These facts allow us to prove that every , , is a BST at all times. Moreover, we prove that our validation scheme ensures that all successful update operations are applied on the latest version of the tree.

Fix an execution . An update is imminent at some time during if it has successfully executed its first freeze CAS before this time and it later executes a successful child CAS in . We prove that at each time, no two imminent updates have the same key. For configuration , let be the set of keys stored in leaves of at plus the set of keys of imminent Insert operations at minus the set of keys of imminent Delete operations at . Let the abstract set be the set that would result if all update operations with linearization points at or before were performed atomically in the order of their linearization points. We prove the invariant that . Once we know this, we can prove that each operation returns the same result as it would if the operations were executed sequentially in the order defined by their linearization points, to complete the linearizability argument.
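As a worked illustration of this invariant, the two sets can be computed for a small hypothetical configuration. The function names and the example configuration are assumptions made for this sketch; only successful updates are listed among the linearized operations.

```python
from typing import Iterable, Set, Tuple


def represented_set(leaf_keys: Iterable[int],
                    imminent_inserts: Iterable[int],
                    imminent_deletes: Iterable[int]) -> Set[int]:
    """Leaf keys, plus keys of imminent Inserts, minus keys of imminent Deletes."""
    return (set(leaf_keys) | set(imminent_inserts)) - set(imminent_deletes)


def abstract_set(linearized_updates: Iterable[Tuple[str, int]]) -> Set[int]:
    """Result of applying the linearized updates atomically, in linearization order."""
    result: Set[int] = set()
    for op, key in linearized_updates:
        if op == "insert":
            result.add(key)
        elif op == "delete":
            result.discard(key)
    return result


# The leaves currently hold 5 and 10. Insert(7) and Delete(10) have performed
# their first freeze CAS (so they are linearized) but not yet their child CAS.
from_tree = represented_set(leaf_keys=[5, 10],
                            imminent_inserts=[7],
                            imminent_deletes=[10])
from_history = abstract_set([("insert", 5), ("insert", 10),
                             ("insert", 7), ("delete", 10)])
print(from_tree == from_history)  # True
```

The tree itself still shows the keys 5 and 10 at this point, so it is the correction by imminent updates that makes the two sets agree.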

A RangeScan with sequence number is wait-free because it traverses , which can only be modified by updates that begin before the RangeScan’s increment of the shared counter (due to handshaking). To prove that the remaining operations are non-blocking, we show that an attempt of an update that freezes its first node can only be blocked by an update that freezes a lower node in the tree, so the update operating at a lowest node in the tree makes progress.

5.2 Formal Proof

We now provide the full proof of correctness. Specifically, we prove that the implementation is linearizable and satisfies progress properties. The early parts of the proof are similar to proofs in previous work [7, 14, 38], but are included here for completeness since the details differ. Most of the more novel aspects of the proof are in Sections 5.2.4 and 5.2.5.

5.2.1 Basic Invariants

We start by proving some simple invariants, and showing that there are no null-pointer exceptions in the code.

Observation 1

The , and fields of a Node never change. No field of an Info record, other than , ever changes. The pointer never changes.

Observation 2

If an Info object’s state field is Commit or Abort in some configuration, it can never be or Try in a subsequent configuration.

  • The state of an Info object can be changed only on lines abortCAS, tryCAS, commitWrite and abortWrite. None of these can change the value from Commit or Abort to or Try.  

Observation 3

The value of is always non-negative, and for every configuration and every node in configuration , .

  • The variable is initialized to 0 and never decreases. All nodes in the initial configuration have field 0. Whenever a node is created by an Insert or Delete, its field is assigned a value that the update operation read from earlier.  

Invariant 4

The following statements hold.

  1. Each call to a routine satisfies its preconditions.

  2. Each Search that has executed line search-initialize has local variables that satisfy the following: and .

  3. Each Search that has executed line search-advance-p has local variables that satisfy the following: and .

  4. Each Search that has executed line search-initialize has local variables that satisfy the following: if is finite then and .

  5. Each ReadChild that has executed line read-child has local variables that satisfy the following: and there is a chain of pointers from to a node whose field is at most .

  6. Each ReadChild that terminates returns a pointer to a node whose sequence number is at most .

  7. Each Find that has executed line find-search has non- values in its local variables and .

  8. Each Insert that has executed line insert-search has local variables that satisfy the following: and and .

  9. Each Delete that has executed line delete-search has local variables that satisfy the following: and and . Moreover, if , then and .

  10. For each Internal node , ’s children pointers are non-. Moreover, one can reach a node with sequence number at most by tracing pointers from either of ’s children.

  11. For each Info object except , all elements of are non-, is a subset of , is an element of , and are distinct and non-, is an element of , and .

  12. Each Update record has a non-  field.

  13. For any Internal node , any node reachable from by following a chain of pointers has and any node reachable from by following a chain of pointers has .

  14. For any Info object , if , then is infinite.

  15. Any node that can be reached from by following a chain of pointers has an infinite key.

  16. For any Internal node , any terminating call to ReadChild returns a node whose key is less than , and any terminating call to ReadChild returns a node whose key is greater than or equal to . Any call to ReadChild returns a node whose key is infinite.

  • We prove that all claims are satisfied in every finite execution by induction on the number of steps in the execution.

    For the base case, consider an execution of 0 steps. Claims 1 to 9 are satisfied vacuously. The initialization ensures that claims 10 to 15 are true in the initial configuration.

    Assume the claims hold for some finite execution . We show that the claims hold for , where is any step.

    1. If is a call to Search at line find-search, insert-search or delete-search, the value of was read from in a previous line. The value of is always non-negative, so the precondition of the Search is satisfied.

      If is a call to ReadChild on line search-advance-l, the preconditions are satisfied by induction hypothesis 3. If is a call to ReadChild on line read-sibling, the preconditions are satisfied by induction hypothesis 9. If is a call to ReadChild on line scanhelper-recursive1 to scanhelper-recursive4, the preconditions are satisfied because ScanHelper’s preconditions were satisfied (by induction hypothesis 1).

      If is a call to ValidateLink on line validate-leaf-p or validate-gp-p of ValidateLeaf, the preconditions follow from the preconditions of ValidateLeaf, which are satisfied by induction hypothesis 1. (In the latter case, we know from the test on line validate-gp-p that .) If is a call to ValidateLink on line delete-validate-p-sib, the preconditions are satisfied because the Search on line delete-search returned a node with sequence number at most by induction hypothesis 3, and then ReadChild on line read-sibling returned a node, by induction hypothesis 6. If is a call to ValidateLink on line validate-sib-nephew1 or validate-sib-nephew2, the preconditions are satisfied by induction hypothesis 6 applied to the preceding call to ReadChild on line read-sibling.

      If is a call to ValidateLeaf on line find-validate-leaf, insert-validate-leaf or delete-validate-leaf, then the preconditions follow from induction hypotheses 2, 3, 4 and readchild-result applied to the preceding call to Search on line find-search, insert-search or delete-search, respectively.

      If is a call to Execute on line insert-execute of Insert, preconditions (a)–(f) follow from induction hypothesis 8 and the fact that line create-internal creates after reading and sets to . It remains to prove precondition (g). Suppose . Since ValidateLeaf on line insert-validate-leaf returned True, the call to ValidateLink on line validate-leaf-p also returned True. So, was the result of the ReadChild on line val-read-child of ValidateLink. By induction hypothesis 16, has an infinite key. Thus, the new Internal node created on line create-internal of the Insert has an infinite key, as required to satisfy precondition (g).

      If is a call to Execute on line delete-execute of Delete, preconditions (a)–(c) follow from induction hypothesis 9 and the fact that (since the Delete did not terminate on line delete-false), and induction hypothesis 6 applied to the preceding call to ReadChild on line read-sibling. Precondition (d) follows from the additional fact that is created on line copy-sibling after reading a pointer to , which as already argued is non-. Precondition (e) is obviously satisfied. Precondition (f) follows from the fact that line copy-sibling sets to be . It remains to prove precondition (g). Suppose . Since ValidateLeaf on line delete-validate-leaf returned True, the call to ValidateLink on line validate-gp-p also returned True. Then, was the result of the ReadChild on line val-read-child of ValidateLink. By induction hypothesis 16, has an infinite key. The ReadChild on line read-sibling returns , which also has an infinite key by induction hypothesis 16. Thus, the node created at line copy-sibling has an infinite key, as required to satisfy precondition (g).

      If is a call to Help on line val-help, execute-help-others or scanhelper-help, the argument is non-, by induction hypothesis 12. Moreover, the preceding call to InProgress returned true, so the Info object had state or Try. By Observation 2, this Info object cannot be the Dummy object, which is initialized to have state Abort. If is a call to Help on line execute-help-self, the precondition is satisfied, since the argument is created at line create-info.

      If is a call to CAS-Child on line help-CAS-child, the Info object is not the Dummy, by the precondition to Help, which was satisfied when Help was called, by induction hypothesis 1. So, the preconditions of CAS-Child are satisfied by induction hypothesis 11.

      If is a call to ScanHelper on line scan-return, the precondition is satisfied since and the value of is always non-negative. If is a call to ScanHelper on line scanhelper-recursive1 to scanhelper-recursive4, the precondition is satisfied by induction hypothesis 6.

    2. By Observation 1, the field of a node does not change. So it suffices to prove that any update to in the Search routine preserves the invariant.

      Line search-initialize sets to which has . By induction hypothesis 1, the Search has , so claim 2 is satisfied.

      Line search-advance-l sets to the result of a ReadChild, so claim 2 is satisfied by induction hypothesis 6.

    3. It suffices to prove that any update to in the Search routine preserves the invariant. Whenever is updated at line search-advance-p, it is set to the value stored in , so claim 3 follows from induction hypothesis 2.

    4. First, suppose is the first step of a Search that sets so that is finite. Then is not an execution of line search-initialize, because never changes and has key , by Observation 1. Likewise, is not the assignment to that occurs in the first execution of line search-advance-l, since the ReadChild on that line (which terminates before ) would have returned a node with an infinite key, by induction hypothesis 16. Thus, occurs after the second execution of line search-advance-gp, which happens after the first execution of line search-advance-p. By induction hypothesis 3, the second execution of line search-advance-p assigns a non-null value to , and .

      It remains to consider any step that assigns a new value to (at line search-advance-gp) after the first time is assigned a node with a finite value. As argued in the previous paragraph, this execution of line search-advance-gp will not occur in the first two iterations of the Search’s while loop. So the claim follows from induction hypothesis 3.

    5. By Observation 1, fields are never changed. Thus, it suffices to show that any step that updates inside the ReadChild routine maintains this invariant.

      If is a step that sets to a child of at line read-child, the claim follows from induction hypothesis 10 applied to the configuration just before .

      If is an execution of line read-prev, the claim is clearly preserved.

    6. If is a step in which ReadChild terminates, the claim follows from induction hypothesis 5 applied to the configuration prior to .

    7. It suffices to consider the step in which the Search called at line find-search terminates. That Search performed at least one iteration of its while loop (since is an Internal node). So, by induction hypotheses 2 and 3, it follows that the values that Search returns, which the Find stores in and , are not .

    8. It suffices to consider the step in which the Search called at line insert-search terminates. That Search performed at least one iteration of its while loop (since is an Internal node). So, by induction hypotheses 2 and 3, it follows that the values that Search returns, which the Insert stores in and , are not  and have fields that are at most .

    9. It suffices to consider the step in which the Search called at line delete-search terminates. That Search performed at least one iteration of its while loop (since is an Internal node). So, by induction hypotheses 2 and 3, it follows that the values that Search returns, which the Delete stores in and , are not  and have fields that are at most . If , it follows from induction hypothesis 4 that the value Search returns, which the Delete stores in , is not  and that .

    10. By Observation 1, pointers are never changed. Thus, it suffices to show that every step that changes a child pointer preserves this invariant. Consider a step that changes a child pointer by executing a successful child CAS (at line childCAS1 or childCAS2). By the precondition of CAS-Child, the new child pointer will be non- and this new child’s pointer will point to the previous child. Since one could reach a node with field at most by following pointers from the old child (by induction hypothesis 10), this will likewise be true if one follows pointers from the new child.

    11. By Observation 1, the and fields of an Info object never change. Thus it is sufficient to consider the case where the step is the creation of a new Info object at line create-info of the Execute routine. Claim 11 for the new Info object follows from the fact that the preconditions of Execute were satisfied when it was invoked before .

    12. We consider all steps that construct a new Update record. If is an execution of line flagCAS1, the field of the new Update record is