# An Efficient Universal Construction for Large Objects

This paper presents L-UC, a universal construction that efficiently implements dynamic objects of large state in a wait-free manner. The step complexity of L-UC is O(n+kw), where n is the number of processes, k is the interval contention (i.e., the maximum number of active processes during the execution interval of an operation), and w is the worst-case time complexity to perform an operation on the sequential implementation of the simulated object. L-UC efficiently implements objects whose size can change dynamically. It improves upon previous universal constructions either by efficiently handling objects whose state is large and can change dynamically, or by achieving better step complexity.


## 1 Introduction

### 1.1 Motivation and Contribution

Multi-core processors are nowadays found in all computing devices. Concurrent data structures are frequently used as the means through which processes communicate in multi-core contexts, thus it is important to have efficient and fault-tolerant implementations of them. A universal construction [11, 12] provides an automatic mechanism to get a concurrent implementation of any data structure (or object) from its sequential implementation.

In this paper, we present L-UC, an efficient, wait-free universal construction that deals with dynamic objects whose state is large. Wait-freedom [11] ensures that every process finishes the execution of each operation it initiates within a finite number of steps. The step complexity of L-UC is O(n + kw), where n is the number of processes in the system, k is the interval contention, i.e., the maximum number of processes that are active during the execution interval of an operation, and w is the worst-case time complexity to perform an operation on the sequential data structure. The step complexity of an algorithm is the maximum number of shared memory accesses performed by a thread for applying any operation on the simulated object in any execution.

A large number of the previously-presented universal constructions [1, 2, 5, 7, 8, 11, 12] work by copying the entire state of the simulated object locally, making the required updates on the local copy, and then trying to make the local copy shared by changing one (or a few) shared pointers to point to it. Copying the state of the object locally is, however, very inefficient for large objects. L-UC avoids copying the entire state of the simulated object locally; instead, it applies the required changes directly on the shared state of the object. To do so, processes need to synchronize when applying the changes. Previous universal constructions that apply changes directly to the shared data structure (e.g., [5]) synchronize on the basis of each operation. However, this results in high synchronization cost. To reduce this cost, L-UC applies a wait-free analog of the combining technique [9, 8]: each process simulates, in addition to its own operation, the operations of other active processes. So, in L-UC, processes pay the synchronization cost once for a batch of operations and not for each distinct operation.
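
The batching idea behind combining can be sketched as follows. This is a minimal, single-threaded illustration, not the L-UC algorithm itself: the object, the `announce` array, and the `combine` helper are all hypothetical, and the synchronization that a real combiner needs is omitted.

```c
#include <assert.h>

#define N 4  /* number of processes in this sketch (hypothetical) */

/* A trivial sequential object: a counter supporting "add amount". */
typedef struct { long value; } SeqObject;

/* Announce array: slot q holds q's pending increment; 0 means no request. */
static long announce[N];

/* Combining pass: one process applies every announced request in a single
   batch, so the synchronization cost (omitted here) would be paid once per
   batch rather than once per operation. Returns the number of requests
   applied. */
static int combine(SeqObject *obj) {
    int applied = 0;
    for (int q = 0; q < N; q++) {
        if (announce[q] != 0) {         /* process q has a pending request */
            obj->value += announce[q];  /* apply it on the shared state    */
            announce[q] = 0;            /* mark it served                  */
            applied++;
        }
    }
    return applied;
}
```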

Sim [8, 10] is a wait-free universal construction that implements the combining technique. In Sim, each process that wants to apply an operation first announces it in an array. Then, it reads all other announced operations, makes a local copy of the shared state, applies all the operations it is aware of on this copy, and tries to update a shared variable to point to this local copy. P-Sim, the practical version of Sim (presented also in [8]), is highly efficient for objects whose state is small. L-UC borrows some of the ideas presented in [8]. Specifically, like P-Sim, L-UC uses an array in which processes announce their operations, and employs bit vectors to figure out which processes have active operations at each point in time. However, the bit-vector mechanism of L-UC is more elaborate than that of P-Sim, because the active processes have to agree on the set of operations that must be applied on the shared data structure before they attempt to perform any changes. In contrast to Sim, L-UC avoids copying the object's state locally. This makes L-UC appropriate for simulating large objects.

L-UC also borrows some ideas from the universal construction presented in [5] that copes with large objects. As in the universal construction in [5], in L-UC, each process uses a directory to store copies of the shared variables (e.g., the shared nodes) it accesses while executing operations on the data structure. L-UC combines this idea with a wait-free analog of the combining technique. This way, L-UC achieves step complexity O(n + kw). In scenarios of low contention, this bound can be much smaller than that achieved by the universal construction in [5]. Moreover, the universal construction in [5] has processes synchronize on the basis of every single operation, whereas in L-UC, processes synchronize once to execute a whole batch of operations.

### 1.2 Related Work

In [11], Herlihy studied how shared objects can be simulated, in a wait-free manner, using read-write registers and consensus objects. In the proposed universal construction, the simulated object is represented by a list of records. Each record stores information about an operation (its type, its arguments, and its response) that has been performed on the simulated object. It also stores the state of the simulated object after all operations inserted in the list up to and including it have been applied, in the order in which they were inserted in the list. To agree on which record will be inserted in the list next, each record additionally stores an n-consensus object. To ensure wait-freedom, the algorithm also employs an announce array of n elements, where the threads running in the system announce their operations, and stores a (strictly increasing) sequence number in each record, which indicates the order in which the record was inserted in the list. Threads help the record of thread p_j to be inserted as the seq-th record in the list when seq mod n = j. Each record contains the entire state of the object and a sequence number that grows without bound, so the space overhead of the algorithm is high. Herlihy revisited wait-free simulation of objects in [12], where he presented a universal construction that uses LL/SC and CAS objects; its step complexity grows with both n and the total size of the simulated object, since every operation copies the entire state locally. These algorithms [11, 12] are inappropriate for large objects, as they work by copying the entire state of the object locally.

Afek, Dauber, and Touitou presented in [1] a universal construction that employs a tree structure to monitor which processes are active, i.e., which processes are performing an operation on the simulated object at a given time. This tree technique was combined with some of the techniques proposed in [11, 12] to obtain a universal construction for simulating large objects whose step complexity depends on the interval contention rather than on the total number of processes.

Anderson and Moir presented in [3] a wait-free universal construction for simulating large objects. In their algorithms, a contiguous array is used to represent the state of the object. Specifically, the object state is stored in a sequence of data blocks of fixed size. To restrict memory overhead, the algorithms operate under the following assumptions: each operation can modify at most a bounded number of blocks, and each thread can help at most a bounded number of other threads. The step complexity of the universal construction in [3] grows with the number of processes, the block size, and the number of blocks an operation may modify.

In [7], Fatourou and Kallimanis presented the family of RedBlue adaptive universal constructions. The F-RedBlue algorithm achieves step complexity O(min(k, log n)). However, F-RedBlue uses large LL/SC registers and is not able to simulate objects whose state is stored in more than one register. S-RedBlue uses small registers, but applying an operation requires copying the entire state of the simulated object, so it is inefficient for large objects. LS-RedBlue and BLS-RedBlue improve the step complexity of the algorithms presented by Anderson and Moir in [3] for large objects.

In [6], Felber et al. present CX, a wait-free universal construction suitable for simulating large objects. This universal construction keeps up to 2n instances of the object state. To perform an update on the shared object, a process first appends its request to a shared request queue and then attempts to obtain the lock of one of the object instances. We remark that each such object instance stores a pointer to a queue node. Subsequently, the process uses this pointer to produce a valid copy of the object by performing all operations contained in the shared queue starting from the pointed node. Notice that CX has space complexity O(nD), where n is the number of processes and D is the total size of the simulated object.

The rest of this paper is organized as follows. Our model is discussed in Section 2. L-UC is presented in Section 3. Section 3.1 provides an overview of how the algorithm works, together with its pseudocode. Section 3.2 presents a detailed description of L-UC. A discussion of its complexity is provided in Section 3.3, and a sketch of its correctness proof in Section 3.4.

## 2 Model

We consider an asynchronous system of n processes, p_1, ..., p_n, each of which may fail by crashing. Processes communicate by accessing (shared) base objects. Each base object stores a value and supports some primitives for accessing its state. An LL/SC object O supports the atomic primitives LL and SC. LL(O) returns the value that is stored into O. The execution of SC(O, v) by a process p must follow the execution of LL(O) by p, and changes the contents of O to v if O has not changed since the execution of p's latest LL on O. If SC changes the value of O to v, true is returned and we say that the SC is successful; otherwise, the value of O does not change, false is returned, and we say that the SC is unsuccessful (or that it failed). L-UC is presented using LL/SC objects (as is the case for Sim [8, 10]). However, a practical version of L-UC would be implemented using CAS objects (as is the case for P-Sim [8, 10]). A CAS object O supports, in addition to Read, the primitive CAS(O, u, v), which stores v into O if the current value of O is equal to u and returns true; otherwise, the contents of O remain unchanged and false is returned.
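
The CAS semantics above map directly onto C11 atomics. The following small sketch wraps `atomic_compare_exchange_strong` so that its interface matches the CAS(O, u, v) primitive as defined here; the wrapper name `cas` is our own.

```c
#include <assert.h>
#include <stdatomic.h>

/* CAS(O, u, v): if O currently holds u, store v into O and return true;
   otherwise leave O unchanged and return false.  C11 provides this as
   atomic_compare_exchange_strong, which overwrites its "expected"
   argument on failure, so we pass a local copy. */
static _Bool cas(atomic_long *obj, long expected, long desired) {
    long u = expected;
    return atomic_compare_exchange_strong(obj, &u, desired);
}
```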

A universal construction can be used to implement any shared object. It supports the operation ApplyOp(req, i), which applies the operation (or request) req to the simulated object and returns the return value of req to the calling process p_i. In this paper, the concepts of an operation and a request have the same meaning and are used interchangeably. A universal construction provides a routine, for each process, that implements ApplyOp.

An object is linearizable if, in every execution a, it is possible to assign to each completed operation op (and to some of the uncompleted operations) a point l_op, called the linearization point of op, such that: (1) l_op follows the invocation and precedes the response of op, and (2) the response returned by op is the same as the response op would return if all operations in a were executed sequentially in the order imposed by their linearization points.

A configuration is a vector that contains the values of the base objects and the states of the processes, and describes the system at some point in time. At the initial configuration, processes are in their initial state and the base objects contain initial values. A step is taken by some process whenever the process executes a primitive on a shared register; the step may also include some local computation that is performed before the execution of the primitive. An execution is a sequence of steps. The interval contention of an instance of some operation in an execution is the number of processes that are active during the execution of this instance. The step complexity of an operation is the maximum number of steps that any thread performs during the execution of any instance of the operation in any execution. Wait-freedom guarantees that every process finishes each operation it executes in a finite number of steps.

## 3 The L-UC Algorithm

This section presents L-UC, our wait-free universal construction for large objects.

### 3.1 Overview

```
struct NewVar {      // node of the list of newly allocated variables
    ItemSV *var;     // points to the ItemSV struct of the variable
    NewVar *next;    // points to the next element of the list
};

struct NewList {
    NewVar *first;   // first node of the list of newly allocated variables
};

struct State {
    boolean applied[1..n];
    boolean papplied[1..n];
    int seq;
    NewList *var_list;
    RetVal rvals[1..n];   // return values
};

struct DirectoryNode {
    Name name;       // variable name
    ItemSV *sv;      // data item for the variable
    Value val;       // value of the data item
};

struct ItemSV {      // data item for a variable
    Value val[0..1]; // old and new value of the data item
    int toggle;      // indicates which entry of val holds the current value
    int seq;         // phase in which the item was last updated
};

// Toggles is implemented as an integer of n bits;
// if n is big, more than one such integer can be used
shared Integer Toggles = <0, ..., 0>;
shared State S = <<F, ..., F>, <F, ..., F>, 0, <⊥>, <⊥, ..., ⊥>>;
shared OpType Announce[1..n] = {⊥, ..., ⊥};

// Private local variable of process p_i; initialized to -2^i so that
// the first Add below sets p_i's bit of Toggles
Integer toggle_i = -2^i;

RetVal ApplyOp(Request req) {   // pseudocode for process p_i
    Announce[i] = req;          // announce request req
    toggle_i = -toggle_i;
    Add(Toggles, toggle_i);     // toggle p_i's bit of Toggles
    Attempt(req);               // call Attempt twice to ensure that req is performed
    Attempt(req);
    return S.rvals[i];          // p_i finds its return value in S.rvals[i]
}
```
```
void Attempt(Request req) {        // pseudocode for process p_i
    ProcessIndex q, j;
    State ls, tmp;
    Set lact;
    Directory D;
    NewVar *pvar = new NewVar(), *ltop;
    ItemSV sv, *psv = new ItemSV();

    psv-><val, toggle, seq> = <<⊥, ⊥>, 0, 0>;
    pvar-><var, next> = <psv, null>;
    for (j = 1 to 2) {
        D = ∅;                     // initialize directory D
        ls = LL(S);                // create a local copy of S
        lact = Toggles;            // read which processes have announced requests
        ltop = ls.var_list->first; // read pointer to the list of newly allocated variables
        tmp.seq = ls.seq + 1;
        tmp.papplied[1..n] = ls.applied[1..n];
        tmp.applied[1..n] = lact[1..n];  // p_i will later attempt to install tmp
                                         // into S, so it sets its fields appropriately
        tmp.rvals[1..n] = ls.rvals[1..n];
        for (q = 1 to n) {
            if (ls.applied[q] ≠ ls.papplied[q]) {  // q's request is pending
                foreach access of a variable x while applying request Announce[q] {
                    if (x is a newly allocated variable) {
                        if (CAS(ltop->next, null, pvar)) {
                            psv = new ItemSV();
                            psv-><val, toggle, seq> = <<⊥, ⊥>, 0, 0>;
                            pvar = new NewVar();
                            pvar-><var, next> = <psv, null>;
                        }
                        // use the node pointed to by ltop->next as the new variable's metadata
                        ltop = ltop->next;
                        add <x, ltop->var, ltop->var.val[0]> to D;
                    } else if (the access is a read instruction) {
                        let svp be a pointer to the ItemSV struct of x;
                        if (x exists in D) read x from D;
                        else {
                            sv = LL(*svp);
                            if (sv.seq < tmp.seq)        // x not yet updated in this phase,
                                add <x, svp, sv.val[sv.toggle]> to D;      // read its current value
                            else if (sv.seq == tmp.seq)  // x already updated in this phase,
                                add <x, svp, sv.val[1 - sv.toggle]> to D;  // read its old value
                            else goto validate;          // values read from S by p_i are obsolete,
                                                         // so start from scratch
                        }
                    } else if (the access is a write instruction) update x in D;
                }
                store the return value of Announce[q] into tmp.rvals[q];
            }
        }
validate:
        if (!VL(S)) continue;      // value read from S by p_i is obsolete, so start from scratch
        foreach record <x, svp, v> in D {
            if (svp->seq > tmp.seq) break;          // all requests have been applied, so leave the loop
            else if (svp->seq == tmp.seq) continue; // this variable has already been modified
            else if (svp->toggle == 0) SC(*svp, <<svp->val[0], v>, 1, tmp.seq>);
            else SC(*svp, <<v, svp->val[1]>, 0, tmp.seq>);  // make the update visible
        }
        tmp.var_list = new NewList();
        tmp.var_list->first = null;
        SC(S, tmp);                // try to install tmp into S
    }
}
```

The pseudocode for L-UC is provided in the two listings above. The state of the simulated data structure in L-UC is shared, and it can be updated directly by any process. Each process that wants to apply a request first announces it in the Announce array. In addition to this array, L-UC uses a bit vector, Toggles, of n bits, one for each process. A process p_i toggles its bit, Toggles[i], after announcing a new request. Toggles implements a fast mechanism for informing other processes of those processes that have pending requests.
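
The toggling can be reproduced with a single fetch-and-add, as ApplyOp does. The sketch below assumes, as in our rendering of the pseudocode, that each process keeps toggle_i in {+2^i, -2^i} and that it is initialized to -2^i, so its first Add contributes +2^i; the struct and helper names are ours.

```c
#include <assert.h>
#include <stdatomic.h>

/* Toggles packs one bit per process into a machine word.  Process i keeps a
   local value toggle in {+2^i, -2^i}; negating it before every Add makes the
   alternating additions flip bit i of Toggles back and forth while leaving
   the bits of the other processes untouched. */
static atomic_long Toggles;

typedef struct { long toggle; } Proc;

static void init_proc(Proc *p, int i) { p->toggle = -(1L << i); }

static void announce_toggle(Proc *p) {
    p->toggle = -p->toggle;                 /* toggle_i = -toggle_i   */
    atomic_fetch_add(&Toggles, p->toggle);  /* Add(Toggles, toggle_i) */
}

static int bit_of(int i) { return (int)((atomic_load(&Toggles) >> i) & 1); }
```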

Each execution of L-UC can be partitioned into phases. In each phase, the set of requests that will be executed in the next phase is agreed upon by the processes that are active. Moreover, the requests that were agreed upon in the previous phase are indeed executed.

A process p_i that wants to execute a new request first announces it in Announce[i], and then toggles its bit in Toggles. Afterwards, it calls a function, called Attempt, twice. After the execution of the first instance of Attempt by p_i, it is ensured that the set of requests agreed upon in one of the phases that overlap the execution of this Attempt contains p_i's request. After the execution of the second instance of Attempt by p_i, it is ensured that p_i's request has been applied.

L-UC uses an LL/SC object S which stores appropriate fields to ensure the required synchronization between the processes in each phase. The first phase (phase 1) starts at the initial configuration and ends when the first successful SC is applied on S. Phase j > 1 starts when phase j-1 finishes and ends when the j-th successful SC is applied on S.

To decide which set of requests will be executed in each phase, S contains two bit vectors, called applied and papplied, of n bits each (one for each process). The current request initiated by a process p_q has not yet been applied if S.applied[q] ≠ S.papplied[q]. When this condition holds, we call the current request of process p_q pending.
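
If the two vectors are packed into machine words (a representation we assume here for illustration; the pseudocode uses boolean arrays), the pending set is computed by a single XOR:

```c
#include <assert.h>
#include <stdint.h>

/* With applied and papplied packed into words (bit q for process q), the
   set of requests agreed upon but not yet executed is exactly the set of
   positions where the two vectors differ. */
static uint64_t pending_set(uint64_t applied, uint64_t papplied) {
    return applied ^ papplied;  /* bit q set  <=>  q's request is pending */
}
```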

In each instance of Attempt, p_i copies the value of S into a local variable ls (via LL), records the changes it intends to make to its fields in another local variable tmp, and uses SC in an effort to update S to the value contained in tmp. Specifically, p_i reads S (by performing an LL) and then reads Toggles into lact. It then copies ls.applied into tmp.papplied and lact into tmp.applied. Recall that the applied and papplied fields of S encode the requests that are to be performed in each phase. So, if the SC that p_i performs on S succeeds, all processes that read the value this SC writes into S will attempt to perform the requests encoded by these fields.

Next, for each q, 1 ≤ q ≤ n, p_i checks whether ls.applied[q] ≠ ls.papplied[q], and if this is so, it applies the request recorded in Announce[q]. To execute the pending requests, p_i uses a caching mechanism as in [4, 5]: when p_i first accesses a shared variable (e.g., a variable of the simulated shared data structure), it maintains a copy of it in a directory D (which is local to p_i). For each pending request, the required updates are first performed by p_i on the local copies of the data items residing in the directory. Read accesses executed by p_i are also served using D. Only after it has finished the simulation of all pending requests does p_i apply the changes listed in the elements of its directory to the shared data structure.

For each data item x of the simulated object's state, L-UC maintains a record (struct) of type ItemSV. This struct stores the old and the current value of the data item in an array val of two elements, a toggle bit that identifies the position in val from which the current value of x should be read, and a sequence number seq that is used for synchronization.
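
A sequential sketch of this two-slot record follows; the real algorithm installs the new triple with SC and validates sequence numbers, which is omitted here, and the helper names are ours.

```c
#include <assert.h>

typedef long Value;

typedef struct {
    Value val[2]; /* old and current value of the data item   */
    int toggle;   /* index in val of the current value        */
    int seq;      /* phase in which the item was last updated */
} ItemSV;

/* Install value v for phase seq: write into the slot holding the stale
   value, then flip toggle, so the previous value stays readable. */
static void item_write(ItemSV *sv, Value v, int seq) {
    int other = 1 - sv->toggle;
    sv->val[other] = v;
    sv->toggle = other;
    sv->seq = seq;
}

static Value item_current(const ItemSV *sv) { return sv->val[sv->toggle]; }
static Value item_old(const ItemSV *sv)     { return sv->val[1 - sv->toggle]; }
```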

Note that S also contains a field seq that is incremented every time a successful SC on S is performed; this field identifies the current phase of the execution. Before performing an update on the shared data structure, p_i compares the sequence number it computed from S (stored in tmp.seq) with the sequence number stored in the ItemSV struct of the data item. Only if the item's sequence number is smaller than tmp.seq is the update performed; otherwise, the update is already obsolete, i.e., S.seq is already at least tmp.seq, and therefore the SC that p_i will perform on S will fail.

Both the old and the current value of a data item x must be stored in its ItemSV struct in order to avoid the following bad scenario. Consider two processes p and p' that simulate the same request r. Assume that p is ready to execute the LL on x's ItemSV struct, whereas p' has finished the simulation of r and has started updating the shared data structure. Then, p might read the updated version of x although it should have read the old version. For this reason, p' stores the old value (in addition to the new value) in one of the entries of the val array and appropriately updates the toggle bit to indicate which of the two values is the new one. If p discovers that it is too slow (i.e., that the sequence number of x's ItemSV struct already equals tmp.seq), it reads the old value of x stored in the other entry of the val array. Notice that, to ensure wait-freedom, p should continue executing (to cope with the case that p' fails before performing all the required updates to the shared data structure).

When a new data item x is allocated while executing a set of requests, additional synchronization between the processes that execute this set of requests is required to avoid situations where several processes each allocate a different record for x. We use a technique similar to that presented in [5] to ensure that all these processes use the same allocated ItemSV struct for x. Specifically, L-UC stores into S a pointer (called var_list) to a list of newly created data items shared by all processes that read this instance of S. Each time a process p needs to allocate the l-th such data item, it tries to add a struct of type NewVar as the l-th element of the list (using CAS). If it does not succeed, some other process has already done so, and p uses that struct (by advancing its ltop pointer to this element and by inserting it into its directory).
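
The agreement step can be sketched as a CAS-append where the loser adopts the winner's node. This is a simplified, hypothetical rendering (an `int` stands in for the ItemSV metadata, and only a single link is appended):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct NewVar {
    int id;                       /* stands in for the ItemSV metadata */
    struct NewVar *_Atomic next;
} NewVar;

/* Every helper tries to CAS its own node in as the successor of ltop.
   Whichever node wins, all helpers then adopt the node actually
   installed, so they agree on one record for the new data item. */
static NewVar *append_or_adopt(NewVar *ltop, NewVar *mine) {
    NewVar *expected = NULL;
    atomic_compare_exchange_strong(&ltop->next, &expected, mine);
    return atomic_load(&ltop->next);  /* the agreed-upon element */
}
```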

We remark that the fields of S must be updated atomically using SC. Done naively, this would require registers that store more than a single word, which is impractical. However, we can utilize single-word registers by using indirection: S stores a pointer to a State struct, and an SC on S swings this pointer. Indirection can also be used to implement the ItemSV structs using single-word registers.

### 3.2 Detailed Description of Attempt

In the following, we detail the implementation of function Attempt. When Attempt is executed by some process p_i, p_i performs two iterations of checking whether there are pending requests and of attempting to apply them, as follows. It initializes its local directory D, creates in ls a local copy of the state of the simulated object, and reads into lact the value of Toggles, thus obtaining a view of which processes have pending requests at the current point in time (i.e., calculating the set of pending requests). Furthermore, it stores locally into ltop a pointer to the current variable list of the simulated object. Recall that the state of the object is copied into the local variable ls using an LL primitive. In case this instance of Attempt is successful in applying the pending requests, it will update the shared state using an SC primitive. For this purpose, the local variable tmp is prepared to serve as the value that will be stored into the shared state in case of success.

After having read the state of the simulated object, as well as the state of the requests of the other processes, p_i can detect which requests are pending. For this purpose, it iterates over the (locally stored) state of each process and checks whether the values of applied and papplied differ for this process. If so, the request of this process was still pending when Attempt read the value of S, and therefore Attempt intends to apply it. Notice that the iteration through the applied and papplied values consists of local steps. Notice also that at most k of the n processes have active requests, so what the application of requests contributes to the step complexity depends on k rather than on n.

We remark that the request of a process is expressed as a piece of sequential code. Therefore, in order to apply the request of some process, an instance of Attempt has to run through the sequential code of this request and carry out the variable accesses that this request entails, i.e., Attempt has to apply the modifications that this request incurs on the simulated object's variables. We distinguish three cases, namely the case where an access creates a new variable, the case where an access reads a variable, and the case where an access modifies an already existing variable.

In the first case, where the access allocates a new variable, the process obtains the variable's ItemSV struct through the shared list of newly allocated variables, as described in Section 3.1, and inserts it into its local dictionary. In the second case, where the access reads a variable, the read is served from the local dictionary if the variable is present there; otherwise, the process performs an LL on the variable's ItemSV struct and caches the appropriate value in the dictionary. Finally, in the third case, the access is a write to an already existing variable. If the accessed variable already exists in the local dictionary, the update simply overwrites the variable's value stored there; otherwise, a new entry is created and the value of the variable is stored in it. Once the sequential code for the current request has been run through and all variable accesses for the request have been performed, the request computes a return value, which is stored by Attempt into tmp.rvals for the requesting process to access.

Recall that any update to a variable of the simulated object is first performed locally by Attempt. Therefore, once all active requests have been applied, Attempt has to write back the local updates to the shared variables of the simulated object. Notice that, once again, the sequence numbers of the local and shared copies are instrumental in detecting whether a variable has already been updated or not. More specifically, Attempt checks whether another process has already updated the value of the shared variable while trying to apply the same set of operations (the set calculated from the applied and papplied vectors). In case a process is very slow and that whole set of operations has already been applied, the comparison of sequence numbers reveals this, and the process breaks out of the write-back loop. Finally, once the updates have been performed, Attempt tries to update S with an SC, before performing any remaining iteration of its main loop.

### 3.3 Step Complexity

By inspection of the pseudocode of ApplyOp, it becomes apparent that its step complexity is determined by the step complexity of Attempt. In a practical version of L-UC where S is implemented using indirection, reading S and Toggles contributes O(n), since the size of the data records that are read is O(n). The body of the if statement that applies pending requests is executed at most k times, each time contributing O(w) (because of the foreach statement that simulates the accesses of a request). Note that searching for an element in the dictionary, adding an element to it, or removing an element from it does not cause any shared memory accesses, i.e., it involves only local computation. So, the cost of simulating the pending requests is O(kw). Note also that at most O(kw) elements are contained in each dictionary. Therefore, the write-back foreach loop contributes O(kw) to the total cost. The rest of the code lines access only local variables and thus do not contribute to the step complexity of the algorithm. We conclude that the step complexity of ApplyOp is O(n + kw).

### 3.4 Correctness Proof

This section provides a sketch of the correctness proof of L-UC. We start with some useful notation. Let be any execution of L-UC and assume that some thread , , executes requests in . Let be the argument of the -th call of L-UC by and let be the -th instance of Attempt executed by (Figure 1). Let be the initial configuration. Define as the configuration after the execution of the Add instruction of line LABEL:alg:lsimopt:first_add; let . We use , , to denote the -th bit of , and let be the value of ’s local variable at the end of .

In the following lemma, we argue that during the execution of each of the two iterations of the for loop of line LABEL:alg:lsimopt:attempt_loop of any instance of Attempt, at least one successful SC instruction is performed.

###### Lemma 3.1.

Consider any , . There are at least two successful SC instructions in the execution interval of .

We continue with two technical lemmas. The first argues that the value of ’s bit in the array is equal to after the execution of the -th Add instruction of line LABEL:alg:lsimopt:first_add by . It also shows that no process other than can change this bit.

###### Lemma 3.2.

For each , , it holds that (1) at , and (2) has the same value between and .

The next lemma studies the value of after the execution of the -th instance of Attempt by .

###### Lemma 3.3.

Consider any execution , , of function Attempt by some thread . is equal to just after the end of .

For each , let be the configuration resulting after the execution of the -th Add instruction in . At , is equal to false. Lemma 3.3 implies that just after , is equal to true. Let be the first configuration between and the end of at which is equal to true. Consider any request , . Lemma 3.3 implies that just after , is equal to , while just after , is equal to . Let be the first configuration between the end of and the end of such that is equal to . Figure 1 illustrates the above notation.

Since the value of can change only by the execution of SC instructions on , it follows that just before a successful SC on is executed. Let be this SC instruction and let be its matching LL instruction. Let be the read of that is executed between and by the same thread.
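The proofs that follow repeatedly use the semantics of LL/SC: an SC on a base object succeeds only if no other successful SC has been performed on that object since the caller's matching LL. A minimal Python model of these semantics (with names of my choosing, using a version counter rather than real hardware LL/SC):

```python
# Hedged model of the LL/SC semantics assumed by the proof: an SC succeeds
# only if no successful SC occurred on the object since the matching LL.

class LLSCObject:
    def __init__(self, value):
        self.value = value
        self.version = 0          # bumped by every successful SC

    def ll(self):
        """Load-Linked: return the value and a witness of the current version."""
        return self.value, self.version

    def sc(self, new_value, version_at_ll):
        """Store-Conditional: succeed only if no SC succeeded since our LL."""
        if self.version != version_at_ll:
            return False          # another successful SC intervened
        self.value = new_value
        self.version += 1
        return True

S = LLSCObject("s0")
_, v_p = S.ll()                   # thread p's LL
_, v_q = S.ll()                   # thread q's LL
assert S.sc("s1", v_p)            # p's SC succeeds
assert not S.sc("s2", v_q)        # q's SC fails: p succeeded since q's LL
```

This is exactly the pattern Lemma 3.1 exploits: between two successful SCs on the same object, every other SC whose LL precedes the first of them must fail.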

Lemma 3.4 states that is performed at the proper timing and returns the anticipated value.

###### Lemma 3.4.

Consider any , . It holds that is executed after and reads in .

###### Proof.

Assume, by way of contradiction, that is executed before . Let be the Attempt that executes .

Assume first that . Then, by its definition, (which is executed by after ) writes to a value equal to ; the code (lines LABEL:alg:lsimopt:read_toggles, LABEL:alg:lsimopt:s_applied) implies that, in this case, reads in . Lemma 3.2 implies that between and . Thus, could not read in , which is a contradiction.

Assume now that . By our assumption that is executed before , it follows that , which is executed before , precedes . If follows , Lemma 3.2 implies that reads in . By the pseudocode (lines LABEL:alg:lsimopt:read_toggles, LABEL:alg:lsimopt:s_applied and LABEL:alg:lsimopt:sc_on_s), it follows that writes the value into . By its definition, stores into , which is a contradiction. Thus, is executed before . By its definition, starts its execution after and finishes its execution before . Lemma 3.1 implies that at least two successful SC instructions are executed in the execution interval of . Recall that precedes and therefore also the beginning of , while by definition follows the end of . It follows that is not a successful SC instruction, which is a contradiction. ∎

We next argue that, between certain configurations (namely and ), the value of has the anticipated value and this value does not change in the execution interval defined by the two configurations.

###### Lemma 3.5.

Consider any , . At each configuration between and , it holds that .

###### Proof.

Assume, by way of contradiction, that there is at least one configuration between and such that is equal to some value . Let be the first of these configurations. Since only SC instructions of line LABEL:alg:lsimopt:sc_on_s write to base object , it follows that there is a successful SC instruction, let it be , executed just before that stores at . Let be the Attempt that executes and let be the read instruction that executes on line LABEL:alg:lsimopt:read_toggles of the pseudocode. The definitions of and imply that follows and precedes . Lemma 3.2 implies that in any configuration between and . Since writes into , the pseudocode (lines LABEL:alg:lsimopt:read_toggles and LABEL:alg:lsimopt:sc_on_s) implies that precedes . It follows that precedes , since precedes . Therefore precedes . This implies that there is a successful SC instruction, namely , between and . Thus, is a failed SC instruction, which is a contradiction. ∎

By Lemma 3.5 and the pseudocode (line LABEL:alg:lsimopt:s_papplied), it follows that at . Denote by the first configuration after such that a successful SC instruction is executed.

The next lemma studies properties of .

###### Lemma 3.6.

precedes and follows .

We next argue that the and arrays of indicate that does not have a pending request between and .

###### Lemma 3.7.

in any configuration between and ( is not included).

By Lemma 3.7, and by line LABEL:alg:lsimopt:s_papplied, it follows that at . This and the definition of imply:

###### Lemma 3.8.

in any configuration between and ( is not included).

We continue by defining what it means for a process to apply a request on the simulated object. We say that a request by some thread is applied on the simulated object if (1) the Read instruction on (line LABEL:alg:lsimopt:read_toggles), executed by some request (which might be or any other request), includes in the set of threads it returns, (2) procedure Attempt, executed by , reads in the request type written there by for and considers it as the new request type for , (3) Attempt by calls apply for (lines LABEL:alg:lsimopt:foreach_access - LABEL:alg:lsimopt:calculate_return_value), and the execution of the SC at line LABEL:alg:lsimopt:sc_on_s (let it be ) on succeeds. When these conditions are satisfied, we sometimes also say that applies on the simulated object or that applies on the simulated object.
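The three conditions above can be sketched with a toy helper loop. This is an illustrative model, not L-UC's pseudocode: the names (`collect_and_apply`, `announce`, `toggles`, `applied`) are mine, and a pending request is identified, as in the text, by a toggle bit that differs from the applied bit recorded in the simulated state.

```python
# Hedged sketch of the "apply" conditions: a helper compares each thread's
# toggle bit with the applied bit stored in the simulated state to find
# pending announced requests, applies them to a local copy of the state,
# and records their responses. Names are illustrative, not L-UC's.

def collect_and_apply(announce, toggles, applied, state, apply_op):
    """Apply every announced request whose toggle differs from the
    applied bit; return the new state, updated bits, and responses."""
    responses = {}
    new_applied = list(applied)
    for tid, request in enumerate(announce):
        if request is not None and toggles[tid] != applied[tid]:
            state, responses[tid] = apply_op(state, request)
            new_applied[tid] = toggles[tid]   # mark request as applied
    return state, new_applied, responses

# Toy simulated object: a counter whose requests are increments.
def apply_inc(state, amount):
    new_state = state + amount
    return new_state, new_state              # (state, response)

state, applied, resp = collect_and_apply(
    announce=[3, None, 5],
    toggles=[True, False, False],
    applied=[False, False, False],
    state=0,
    apply_op=apply_inc,
)
# Only thread 0's request is pending (toggle differs from applied bit);
# thread 2's announced request was already applied and is skipped.
```

In the real algorithm the updated state and applied bits become visible only if the subsequent SC on the central object succeeds, which is condition (3).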

###### Lemma 3.9.

is applied to the simulated object at configuration .

###### Proof.

Let be the Attempt that executes the successful SC instruction (let it be this SC instruction) just before . Let be the matching LL of . Since is a successful SC instruction, it follows that follows . Lemma 3.8 implies that reads for a value different from that stored in . Therefore, the if statement of line LABEL:alg:lsimopt:if_apply returns true. Thus, a request for thread is applied at . Let be this request and assume, by way of contradiction, that . Lemma 3.4 implies that executes its read on after . By the pseudocode (lines LABEL:alg:lsimopt:read_toggles, LABEL:alg:lsimopt:foreach_access), reads after , thus the read of by is executed between and . Since writes its request to before , the read of by returns . Thus, applies as the request of on the simulated object. ∎

We are now ready to assign linearization points. For each and , we place the linearization point of at ; ties are broken by the order imposed by thread identifiers.

It is not difficult to argue that the linearization point of each request is placed in the execution interval of the request.

###### Lemma 3.10.

Each request is linearized within its execution interval.

To prove consistency, denote by the -th successful SC instruction on base object . Let be any iteration of the for loop of line LABEL:alg:lsimopt:attempt_loop that is executed by a thread . Let be the sequence of base objects read by the LL instructions of line LABEL:alg:lsimopt:ll_dir in . Denote by the number of elements of .

For each , denote by the prefix of containing the first elements of , i.e. , where is the -th LL instruction performed by on any base object. Let be the empty sequence.

Let be the sequence of insertions in directory (lines LABEL:alg:lsimopt:add_dir1-LABEL:alg:lsimopt:add_dir2) by . Denote by the number of elements of . Obviously, it holds that . For each , denote by the prefix of containing the first elements of , i.e. , where is the -th value inserted to directory . Let be the empty sequence.

Let be the sequence of shared base objects accessed by while executing lines LABEL:alg:lsimopt:sc_dir1-LABEL:alg:lsimopt:sc_dir2 (we sometimes abuse notation and say that a code line is executed by to denote that the code line is executed by during the execution of ). Denote by the number of elements of . For each , denote by the prefix of that contains the last elements of , i.e. , where is the -th request (lines LABEL:alg:lsimopt:sc_dir1-LABEL:alg:lsimopt:sc_dir2) by . Let be the empty sequence.

Let be the sequence of shared base objects allocations during iteration (lines LABEL:alg:lsimopt:alloc_var-LABEL:alg:lsimopt:new_var). Denote by the number of elements of . For each , denote by the prefix of that contains the first elements of , i.e. , where is the -th base object allocation by .

Let be the sequence of allocations/reads/writes that performs on base objects in lines LABEL:alg:lsimopt:alloc_var-LABEL:alg:lsimopt:sc_dir3 of the pseudocode. Denote by the number of elements of . Obviously, it holds that . For each , denote by the prefix of that contains the first elements of sequence (i.e. ), where is the -th base-object allocation/read/write performed by .

The next lemma states that for any process that has a pending request, the -th element of the array stores the pending request of for an appropriate time interval.

###### Lemma 3.11.

Let be any integer such that at configuration . Let be the value of at . In any configuration between and , it holds that .

###### Lemma 3.12.

Let be any shared base object other than . For any , the following claims are true:

1. At most one successful SC instruction is executed on between and .

2. In case that a successful SC instruction is executed on , it holds that just before and just after .

3. Let be some iteration of the loop of line LABEL:alg:lsimopt:attempt_loop executed by a thread that executes at least one successful SC instruction on . If is the LL instruction of line LABEL:alg:lsimopt:ll_iteration executed by , then is executed after .

4. Let , be two iterations of the for loop of line LABEL:alg:lsimopt:attempt_loop executed by threads and respectively, such that both , execute their LL instructions of line LABEL:alg:lsimopt:ll_iteration somewhere between and , , and . If both , execute line LABEL:alg:lsimopt:flush_dir, just before it holds that .

###### Proof.

We prove the claims by induction on . Fix any and assume that the claims hold for . We prove that the claims hold for .

We first prove Claim 1. Let be the first successful SC instruction on between and . We prove that just after the execution of . Assume, by way of contradiction, that . Let be the iteration of line LABEL:alg:lsimopt:ll_iteration executed by some thread that executes . Let be the matching LL instruction of . Since successfully executes line LABEL:alg:lsimopt:sc_dir2 of the pseudocode, the pseudocode (lines LABEL:alg:lsimopt:vl and LABEL:alg:lsimopt:sc_dir2) implies that the VL instruction of line LABEL:alg:lsimopt:vl returns true. Since is executed by before this VL instruction, it follows that precedes . Thus, the VL instruction of line LABEL:alg:lsimopt:vl is executed before . Let be the iteration of the loop of line LABEL:alg:lsimopt:ll_iteration at which is executed and let be the thread that executes . Obviously, has been executed between and . Since is also executed between and , the induction hypothesis (Claim .ii) implies that . Thus, has also executed an SC instruction on . By lines LABEL:alg:lsimopt:ll_dir, LABEL:alg:lsimopt:flush_dir-LABEL:alg:lsimopt:sc_dir2 and LABEL:alg:lsimopt:sc_on_s of the pseudocode, it follows that there is a successful SC instruction on between and . Let be this instruction. By the induction hypothesis (Claim 1), it follows that just after the execution of . Since is a successful SC instruction, follows . By the pseudocode (lines LABEL:alg:lsimopt:sc_dir1-LABEL:alg:lsimopt:sc_dir2), it follows that is not executed, which is a contradiction. Therefore, just after the execution of . We now prove that there is no other successful SC instruction on between and . Assume, by way of contradiction, that at least one successful SC instruction takes place between and . Let be the first of these instructions. Since is a successful SC instruction, it follows that its matching LL instruction follows . By the pseudocode (lines LABEL:alg:lsimopt:sc_dir1-LABEL:alg:lsimopt:sc_dir2), it follows that is not executed since , which is a contradiction.

Claim 2 is proved using a similar argument as that above for Claim 1.

We now prove Claim 3. Assume, by way of contradiction, that is executed between and , . Let be the thread that executes in some iteration . By Claims 1 and 2, it follows that just before . Thus, is not executed, which is a contradiction. Hence, Claim 3 holds.

To prove Claim 4, it is enough to prove that , for any . We prove this claim by induction on the number of elements of (see appendix). ∎

Denote by the prefix of which ends at , and let be the first configuration following . Let be the empty execution. Denote by the linearization order of the requests in .

We are now ready to prove that is linearizable. This requires proving that the object state is consistent after the execution of each successful SC on .

###### Lemma 3.13.

For each , the following claims hold:

1. the object’s state is consistent at , and

2. is linearizable.

###### Proof.

We prove the claim by induction on . The claim holds trivially for the base case; we remark that is empty in this case. Fix any and assume that the claim holds for . We prove that the claim holds for .

By the induction hypothesis, it holds that: (1) the object’s state is consistent at , and (2) is consistent with linearization