1 Introduction
Consider a system of threads which share a set of distinct atomic counters. We wish to implement a scalable approximate counter, which we will call a MultiCounter, by distributing the contention among these distinct instances: to increment the global counter, a thread selects two atomic counters uniformly at random, reads their values, and (atomically) increments by one the value of the counter which has the lower value according to the values it read. To read the global counter, the thread returns the value of a randomly chosen counter, multiplied by the number of counters. (This multiplication serves to maintain the same magnitude as the total number of updates to the distributed counter up to a point in time.)
The astute reader will have noticed that this process is similar to the classic two-choice load balancing process [6], in which a sequence of balls are placed into initially empty bins, and, in each step, a new ball is placed into the less loaded of two randomly chosen bins. Here, the individual atomic counters are the bins, and each increment corresponds to a new ball being added. This sequential load balancing process is extremely well studied [26, 23]: a series of deep technical results established that the difference between the most loaded bin and the average is doubly logarithmic in the number of bins in expectation [6, 23], and that this difference remains stable as the process executes for increasingly many steps [9, 25]. We would therefore expect the above relaxed concurrent counter to have relatively low and stable skew among the outputs at consecutive operations, and to scale well, as contention is distributed among the counters.
However, there are several technical issues when attempting to analyze this natural process in a concurrent setting.

First, concurrency interacts with the classic two-choice load balancing process in nontrivial ways. The key property of the two-choice process which ensures good load balancing is that trials are biased towards less loaded bins; equivalently, operations are biased towards incrementing counters of lesser value. However, this property may break due to concurrency: at the time of the update, a thread may end up updating the counter of higher value among its two choices, if the counter of smaller value has been updated concurrently since the thread read it, thus surpassing the other counter.

Finally, assuming such a data structure can be analyzed and specified, it is not clear whether it would be in any way useful: many existing applications are built around data structures with deterministic guarantees, and it is not obvious how scalable, relaxed data structures can be leveraged in standard concurrent settings.
One may find it surprising that analyzing such a relatively simple concurrent process is so challenging. Beyond this specific instance, these difficulties reflect wider issues in this area: although such relaxed constructs are reasonably popular in practice due to their good scalability, e.g. [7, 24, 31, 27], their properties are nontrivial to pin down [3], and it is as yet unclear how they interact with the higher-order algorithmic applications they are part of [20].
Contribution.
In this work, we take a step towards addressing these challenges. Specifically:

We provide the first analysis of a two-choice load balancing process in an asynchronous setting, where operations may be interleaved, and the interleaving is decided by an adversary. We show that the resulting process is robust to concurrency, and continues to provide strong balancing guarantees in potentially infinite executions, as long as the ratio between the number of bins and the number of threads is above a large constant threshold.

We introduce a new correctness condition for randomized relaxed data structures, called distributional linearizability. Intuitively, a concurrent data structure is distributionally linearizable to a sequential random process, defined in terms of a sequential specification, a cost function measuring the deviation from the sequential specification, and a distribution on the values of the cost function, if every execution of the data structure can be mapped onto an execution of the relaxed sequential process, respecting the outputs and the costs incurred, as well as the order of non-overlapping operations.

We prove that the randomized MultiCounter data structure introduced above is distributionally linearizable to a (sequential) variant of the classic two-choice load balancing process. This allows us to formally define the properties of MultiCounters. Moreover, we show that this analytic framework also covers variants of MultiQueues [27], a popular family of concurrent data structures implementing relaxed concurrent priority queues. This yields the first analytical guarantees for MultiQueues in concurrent executions.

We implement the MultiCounters, and show that they can provide a highly scalable approximate timestamping mechanism, with relatively low skew. We build on this, and show that MultiCounters can be successfully applied to timestamp-based concurrency control mechanisms such as the TL2 software transactional memory protocol [13]. This usage scenario presents an unexpected trade-off: assuming low contention, the resulting TM protocol scales almost linearly, but may break correctness with very low probability. In particular, we show that there exist workloads and parameter settings for which this relaxed TM protocol scales almost linearly, improving the performance of the TL2 baseline by more than
, without breaking correctness.
Techniques.
Our main technical contribution is the concurrent analysis of the classic two-choice load balancing process, in an asynchronous setting, where the interleaving of low-level steps is decided by an oblivious adversary. The core of our analysis builds on the elegant potential method of Peres, Talwar and Wieder [25], which we render robust to asynchronous updates based on potentially stale information. To achieve this, we overcome two key technical challenges. The first is that, given an operation, as more and more other operations execute between the point where it reads and the point where it updates, its information becomes more stale, and so the probability that it makes the “right” choice at the time of update, inserting into the less loaded of its two random choices, decreases. Moreover, operations updating with stale information will “stampede” towards lower-weight bins, effectively skewing the distribution. The second technical issue we overcome is that long-running operations, which experience a lot of concurrency, may in fact be adversarially biased towards the wrong choice, inserting into the more loaded of their two choices with nontrivial probability. We discuss these issues in detail in Section 6.1.
In brief, our analysis circumvents these issues by showing that a variant of the two-choice process where up to a constant fraction of updates are corrupted, in the sense that they perform the “wrong” update, will still have similar balance properties to the original process. It is interesting to note that even the order in which corrupted updates occur can be controlled by the adversary through increased concurrency, which is not the case in standard analyses [25]. The critical property which we leverage in our analysis is that, while individual operations can be arbitrarily contended (and therefore biased), there is a bound on the average contention per operation, which in turn bounds the average amount of bias the adversary can induce over a period of time. Our argument formalizes this intuition, and phrases it in terms of the evolution of the potential function.
We show that this result has implications beyond “parallelizing” the classic two-choice process, as we can leverage it to obtain probabilistic bounds on the skew of the MultiCounter. Using the framework of [3], which connected two-choice load balancing with MultiQueue data structures in the sequential case, we can obtain guarantees for this popular data structure pattern in concurrent executions.
2 Related Work
Randomized Load Balancing.
The classic two-choice balanced allocation process was introduced in [6], where the authors show that, under two-choice insertion, the most loaded among $n$ bins is at most $O(\log \log n)$ above the average, both in expectation and with high probability. The literature studying analyses and extensions of this process is extremely vast, hence we direct the reader to [26, 23] for in-depth surveys of these techniques. Considerable effort has been dedicated to understanding guarantees in the “heavily-loaded” case, where the number of insertion steps is unbounded [9, 25], and in the “weighted” case, in which ball weights come from a probability distribution [30, 10]. A tour-de-force by Peres, Talwar, and Wieder [25] gave a potential argument characterizing a general form of the heavily-loaded, weighted process on graphs. Our analysis starts from their framework, and modifies it to analyze a concurrent, adversarial process. One significant change from their analysis is that, due to the adversary, changes in the potential are only partly stochastic: most steps might be slightly biased away from the better of the two choices, while a subset of choices might be almost deterministically biased towards the wrong choice. Further, the adversary can decide the order in which these different steps, with different biases, occur.
Lenzen and Wattenhofer [21] analyzed parallel balls-into-bins processes, in which $n$ balls need to be distributed among $n$ bins, under a communication model between the balls and the bins, showing that almost-perfect allocation can be achieved in $O(\log^* n)$ rounds of communication. This setting is quite different from the one we consider here. Similar delayed information models, where outdated information is given to the insertion process, were considered by Mitzenmacher [22] and by Berenbrink, Czumaj, Englert, Friedetzky, and Nagel [8]. The former reference proposes a bulletin board model with periodic updates, in which information about the load of the model is updated only periodically (every $T$ seconds), and various allocation mechanisms. The author provides an analysis of this process in the asymptotic case (as $n \to \infty$), supported by simulations. The latter reference [8] considers a similar model where balls arrive in batches, and must perform allocations collectively based solely on the information available at the beginning of the batch, without additional communication. The authors prove that the greedy multiple-choice process preserves its strong load balancing properties in this setting: in particular, the gap between min and max remains $O(\log n)$.
The key difference between these models and the one we consider is that our model is completely asynchronous, and in fact the interleavings are chosen adversarially. The technique we employ is fundamentally different from those of [22, 8]. In particular, we believe our techniques could be adapted to rederive the main result of [8], albeit with worse constants.
Recent work by a subset of the authors [3] analyzed the following producer-consumer process: a set of labelled balls are inserted sequentially at random into bins; in parallel, balls are removed from the bins by always picking the lower labelled (higher priority) of two uniform random choices. (Balls in each bin are sorted in increasing order of label, i.e. each bin corresponds to a sequential priority queue.) This process sequentially models a series of popular implementations of concurrent priority queue data structures, e.g. [27, 16]. This process provides the following guarantees: in each step $t$, the expected rank of the label removed among labels still present in the system is $O(n)$, and $O(n \log n)$ with high probability in $n$, where $n$ is the number of bins. That is, this sequential process provides a structured probabilistic relaxation of a standard priority queue.
Relaxed Data Structures.
The process considered in [3] is sequential, whereas the data structures implemented are concurrent. Thus, there was a significant gap between the theoretical guarantees and the practical implementation. Our current work extends to concurrent data structures, closing this gap. Under the oblivious adversary assumption and given our parametrization, we show for the first time that practical data structures such as [27, 16, 3] provide guarantees in real executions.
Designing efficient concurrent/parallel data structures with relaxed semantics was initiated by Karp and Zhang [19], with other significant early work by Deo and Prasad [11] and Sanders [28]. It has recently become an extremely active research area, see e.g. [29, 7, 31, 4, 16, 24, 27, 3] for recent examples. To the best of our knowledge, ours is the first analysis of randomized relaxed concurrent data structures which works under arbitrary oblivious schedulers: previous analyses such as [4, 27, 3] required strong assumptions on the set of allowable interleavings. Dice et al. [12] considered randomized data structures for scalable exact and approximate counting. They consider the efficient parallelization of sequential approximate counting methods, and therefore have a significantly different focus than our work.
3 System Model
Asynchronous Shared Memory.
We consider a standard asynchronous shared-memory model, e.g. [5], in which threads (or processes) communicate through shared memory, on which they perform atomic operations such as read, write, and fetch-and-increment. The fetch-and-increment operation takes no arguments, and returns the value of the register before the increment was performed, incrementing its value by one.
The Oblivious Adversarial Scheduler.
Threads follow an algorithm, composed of shared-memory steps and local computation, including random coin flips. The order of process steps is controlled by an adversarial entity we call the scheduler. Time is measured in terms of the number of shared-memory steps scheduled by the adversary. The adversary may choose to crash a set of at most $n - 1$ of the $n$ processes by not scheduling them for the rest of the execution. A process that is not crashed at a certain step is correct, and if it never crashes then it takes an infinite number of steps in the execution. In the following, we assume a standard oblivious adversarial scheduler, which decides on the interleaving of thread steps independently of the coin flips they produce during the execution.
Shared Objects.
The algorithms we consider are implementations of shared objects. A shared object is an abstraction providing a set of methods, each given by a sequential specification. In particular, an implementation of a method for an object is a set of algorithms, one for each executing process. When a thread invokes a method of an object, it follows the corresponding algorithm until it receives a response from the algorithm. Upon receiving the response, the process is immediately assigned another method invocation. In the following, we do not distinguish between a method and its implementation. A method invocation is pending at some point in the execution if it has been initiated but has not yet received a response. A pending method invocation is active if it is made by a correct process (note that the process may still crash in the future). For example, a concurrent counter could implement read and increment methods, with the same semantics as those of the sequential data structure.
Linearizability.
The standard correctness condition for concurrent implementations is linearizability [18]: roughly, a linearizable implementation ensures that each concurrent operation can be seen as executing at a single instant in time, called its linearization point. The mapping from method calls to linearization points induces a global order on the method calls, which is guaranteed to be consistent with a sequential execution in terms of the method outputs; moreover, each linearization point must occur between the start and end time of the corresponding method.
Recent work, e.g. [17], considers deterministic relaxed variants of linearizability, in which operations are allowed to deviate from the sequential specification by a relaxation factor. Such relaxations appear to be necessary in the case of data structures such as exact counters or priority queues in order to circumvent strong linear lower bounds on their concurrent complexity [2]. While specifying such data structures in the concurrent case is well-studied [17, 1, 15], less is known about how to specify structured randomized relaxations.
With High Probability.
We say that an event occurs with high probability in a parameter, e.g. $n$, if it occurs with probability at least $1 - 1/n^{c}$, for some constant $c \geq 1$.
4 The MultiCounter Algorithm
Description.
The algorithm implements an approximate counter by distributing updates among distinct counters, each of which supports atomic read and increment operations. Please see Algorithm 1 for pseudocode. To read the counter value, a thread simply picks one of the counters uniformly at random, and returns its value multiplied by the number of counters. To increment the counter value, the thread picks two counter indices uniformly at random, and reads their current values sequentially. It then proceeds to update (increment) the value of the counter which appeared to have a lower value given its two reads. (In case of a tie, or when the two choices are identical, the tie is broken arbitrarily.)
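The description above can be sketched in a few lines of Python. This is only an illustrative sketch, not the paper's Algorithm 1: the class name `MultiCounter`, the helper `run_demo`, and the use of one lock per cell to emulate atomic read and increment are all assumptions made for the example.

```python
import random
import threading


class MultiCounter:
    """Approximate counter spread over several cells (illustrative sketch)."""

    def __init__(self, num_counters):
        self.cells = [0] * num_counters
        # Assumption: one lock per cell emulates an atomic read / increment.
        self.locks = [threading.Lock() for _ in range(num_counters)]

    def _atomic_read(self, index):
        with self.locks[index]:
            return self.cells[index]

    def _atomic_increment(self, index):
        with self.locks[index]:
            self.cells[index] += 1

    def increment(self, rng):
        i = rng.randrange(len(self.cells))
        j = rng.randrange(len(self.cells))
        value_i = self._atomic_read(i)
        value_j = self._atomic_read(j)  # both reads may be stale by update time
        # Ties (including i == j) are broken in favour of the first index.
        self._atomic_increment(i if value_i <= value_j else j)

    def read(self, rng):
        index = rng.randrange(len(self.cells))
        # Scale one cell up to the magnitude of the total number of updates.
        return self._atomic_read(index) * len(self.cells)


def run_demo(num_counters=16, num_threads=4, increments_per_thread=1000):
    """Hypothetical driver: several threads increment one shared MultiCounter."""
    counter = MultiCounter(num_counters)

    def worker(seed):
        rng = random.Random(seed)  # private generator per thread
        for _ in range(increments_per_thread):
            counter.increment(rng)

    threads = [threading.Thread(target=worker, args=(s,)) for s in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter
```

The sum of all cells always equals the exact number of completed increments; only the value returned by `read` is approximate, since it extrapolates from a single randomly chosen cell.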
Relation to Load Balancing.
A sequential version of the above process, in which the counter is read or incremented atomically, is identical to the classic two-choice balanced allocation process [6], where each counter corresponds to a bin, and each increment corresponds to a new ball being inserted into the less loaded of two randomly chosen bins.
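To make the sequential process concrete, the following sketch simulates balls-into-bins with a configurable number of random choices per ball. The function name `simulate_gap` and its parameters are hypothetical, chosen for illustration only.

```python
import random


def simulate_gap(num_bins, num_balls, choices, seed=0):
    """Return (max load - average load) after sequential d-choice insertion."""
    rng = random.Random(seed)
    bins = [0] * num_bins
    for _ in range(num_balls):
        # Sample `choices` bins uniformly at random (with replacement).
        candidates = [rng.randrange(num_bins) for _ in range(choices)]
        # Insert into the least loaded candidate.
        target = min(candidates, key=lambda b: bins[b])
        bins[target] += 1
    return max(bins) - num_balls / num_bins


if __name__ == "__main__":
    print("one choice :", simulate_gap(64, 64 * 200, choices=1))
    print("two choices:", simulate_gap(64, 64 * 200, choices=2))
```

With `choices=1` the gap grows with the number of balls, whereas with `choices=2` it stays small, which is the behaviour the MultiCounter aims to inherit.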
In a concurrent setting, the critical departure from the sequential model is that the values read can be inconsistent with respect to a sequential execution: there may be no single point in time when the two counters had the values observed by the thread; moreover, these values may change between the point where they are read, and the point where the update is performed.
More technically, the sequential variant of the two-choice process has the crucial property that, at each increment step, it is “biased” towards incrementing the counter of lower value. This does not necessarily hold for the concurrent approximate counter: for an operation where a large number of updates occur between the read and the update points, the read information is stale, and therefore the thread’s increment choice may be no better than a perfectly random one; in fact, as we shall see in the analysis, it is actually possible for an adversary to engineer cases where the algorithm’s choice is biased towards incrementing the counter of higher value.
5 Distributional Linearizability
We generalize the classic linearizability correctness condition to cover randomized relaxed concurrent data structures, such as the MultiCounter. Intuitively, we will say that a concurrent data structure is distributionally linearizable to a corresponding relaxed sequential process, defined in terms of a sequential specification, a cost function measuring the deviation from the sequential specification, and a distribution on the cost function values, if every execution of the data structure can be mapped onto an execution of the relaxed sequential process, respecting the outputs and the incurred costs, as well as the order of non-overlapping operations. To formalize this definition, we introduce the following machinery, part of which is adopted from [17].
Data Structures and Labeled Transition Systems.
Let $\Sigma$ be a set of methods including input and output values. A sequential history $s$ is a sequence over $\Sigma$, i.e. an element in $\Sigma^*$. A (sequential) data structure is a sequential specification $S \subseteq \Sigma^*$ which is a prefix-closed set of sequential histories. For example, the sequential specification of a stack consists of all valid sequences for a stack, i.e. in which every push places elements on top of the stack, and every pop removes elements from the top of the stack.
Given a sequential specification $S$, two sequential histories $s, t \in S$ are equivalent, written $s \simeq_S t$, if they correspond to the same “state”: formally, for any sequence $u \in \Sigma^*$, $su \in S$ iff $tu \in S$. Let $[s]_S$ be the equivalence class of $s$.
Definition 5.1
Let $S$ be a sequential specification. Its corresponding labeled transition system (LTS) is an object $LTS(S) = (Q, \Sigma, \rightarrow, q_0)$, with states $Q = \{[s]_S \mid s \in S\}$, set of labels $\Sigma$, transition relation $\rightarrow \subseteq Q \times \Sigma \times Q$ given by $[s]_S \xrightarrow{m} [sm]_S$ iff $sm \in S$, and initial state $q_0 = [\varepsilon]_S$.
Notice that the sequential specification can be alternatively defined as the set of all traces of the initial state of $LTS(S)$: formally, for any $s \in \Sigma^*$, we have $s \in S$ iff $q_0 \xrightarrow{s}$.
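As an illustration of these notions for the stack example, the sketch below replays a sequential history and returns the state it reaches, or `None` if the history is not in the specification. The encoding of methods as `("push", x)` / `("pop", x)` pairs and the function names are assumptions made for this example.

```python
def stack_state(history):
    """Replay a history; return the reached state, or None if it is invalid."""
    stack = []
    for op, value in history:
        if op == "push":
            stack.append(value)
        elif op == "pop":
            # A pop is valid only if it returns the current top element.
            if not stack or stack[-1] != value:
                return None
            stack.pop()
        else:
            raise ValueError(f"unknown method: {op}")
    return tuple(stack)


def in_specification(history):
    """Membership test for the (prefix-closed) stack specification."""
    return stack_state(history) is not None


def equivalent(history_a, history_b):
    """Two valid histories are equivalent iff they reach the same stack state."""
    state_a, state_b = stack_state(history_a), stack_state(history_b)
    return state_a is not None and state_a == state_b
```

For instance, the history push(1), push(2), pop(2) is equivalent to the history push(1): both leave the stack holding only the element 1, so they admit exactly the same valid continuations. In the labeled transition system, the stack contents play the role of the state, and each valid method call is a transition.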
Randomized Quantitative Relaxations.
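Let $S$ be a data structure with $LTS(S) = (Q, \Sigma, \rightarrow, q_0)$. To obtain a randomized quantitative relaxation of $S$, we apply the following four steps. The first three steps are identical to deterministic quantitative relaxations [17], whereas the fourth defines the probability distribution on costs:

Completion: We start from $LTS(S)$, and construct a completed labeled transition system $LTS_c(S) = (Q, \Sigma, Q \times \Sigma \times Q, q_0)$, with transitions from any state to any other state by any method.

Cost function: We add a cost function $cost$, defined on the transitions $(q, m, q') \in Q \times \Sigma \times Q$, to the LTS. The transition cost will satisfy $cost(q, m, q') = 0$ if and only if $q \xrightarrow{m} q'$ in $LTS(S)$.
A quantitative path is a sequence $\kappa = q_1 \xrightarrow{m_1, k_1} q_2 \xrightarrow{m_2, k_2} \cdots \xrightarrow{m_n, k_n} q_{n+1}$, where $k_i = cost(q_i, m_i, q_{i+1})$.
We call the sequence of transitions and costs the quantitative trace of $\kappa$, denoted by $qtr(\kappa)$.

Path cost function: Given a quantitative path $\kappa$, its path cost $pcost(\kappa)$ is defined as a function of its quantitative trace $qtr(\kappa)$. Path costs are monotone with respect to prefix order: if $\kappa$ is a prefix of $\kappa'$, then $pcost(\kappa) \leq pcost(\kappa')$.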

Probability distribution: Given an arbitrary state $q$ in $LTS_c(S)$, we define a probability space $(\Omega, \mathcal{F}, P)$ on the set of possible transitions and their corresponding costs from this state, where the sample space $\Omega$ is the set of all transitions in $LTS_c(S)$, the $\sigma$-algebra $\mathcal{F}$ is defined in the straightforward way based on the set of elementary events, and $P$ is a probability measure $P : \mathcal{F} \to [0, 1]$.
Importantly, this allows us to define, for any path, the notion of probability for costs incurred at each step. This probability space is readily extended for arbitrary finite paths, where we assume that the cost probabilities at each step are independent of previous steps, i.e., historyless. This process induces a Markov chain, whose state at each step is given by the state of the corresponding LTS, and whose transitions are LTS transitions, with costs and probabilities as above.
Distributional Linearizability.
With this in place, we now define distributionally linearizable data structures:
Definition 5.2
Consider a randomized concurrent data structure, together with a randomized quantitative relaxation of a sequential specification with respect to a cost function and a probability distribution on costs. We say that the data structure is distributionally linearizable to the relaxation iff for every concurrent schedule, there exists a mapping of the completed operations of the data structure under this schedule to transitions in a quantitative path of the relaxation, preserving outputs, and respecting the order of non-overlapping operations. This mapping can be used to associate any schedule to a distribution of costs for the data structure under that schedule.
We now make a few important remarks on this definition.

The main difficulty when formally defining the “costs” incurred by the data structure in a concurrent execution is in dealing with the execution history, and with the impact of pending operations on these costs. The above definition allows us to define costs, given a schedule, only in terms of the sequential process, and bounds the incurred costs in terms of the probability distribution defined in the relaxation. This definition ensures that the probability distribution on costs incurred at each step only depends on the current state of the sequential process.

The second key question is how to use this definition. One subtle aspect of this definition is that the mapping to the sequential randomized quantitative relaxation is done per schedule: intuitively, this is because an adversary might change the schedule, and cause the distribution of costs of the data structure to change. Thus, it is often difficult to specify a precise cost distribution, which covers all possible schedules. However, for the data structures we analyze, we will be able to provide tail bounds on the cost distributions induced by all possible schedules.
The natural next question, which we answer in the following section, is whether nontrivial such data structures exist and can be analyzed.
6 Analysis of the MultiCounter
We will focus on proving the following result.
Theorem 6.1
Fix a large constant $C$. Given an oblivious adversary, $m$ distributed counters and $n$ threads with $m \geq C n$, for any fixed schedule, the MultiCounter algorithm is distributionally linearizable to a randomized relaxed sequential counter process, which, at any step $t$, returns a value that is at most $O(m \log m)$ away from the number of increments applied up to $t$, both in expectation and with high probability in $m$.
We emphasize that the relaxation guarantees are independent of the time at which the guarantee is examined, and that they would thus hold in infinite executions.
6.1 Modeling the Concurrent Process
In the following, we will focus on analyzing executions formed exclusively of increment operations, whose lower-level steps may be interleaved. (Adding read operations at any point during the execution will be immediate.) We model the process as follows. First, we assume a schedule that is fixed by the adversary. For each thread, we consider its sequence of operations, each of which is defined by its starting time, corresponding to the time when its first read step was scheduled, and its completion time, corresponding to the time when its update step is scheduled, such that each operation of a thread starts only after the previous operation of that thread has completed. (Recall that the scheduler defines a global order on individual steps.) The number of operations active at a given time is at most the number of threads, corresponding to the fact that each parallel thread executes one operation at a time.
For each operation, we record its contention as the number of distinct operations scheduled between its start and end time. (Alternatively, we could define this quantity as the number of operations which complete in the time interval between its start and its end.) Note that the number of distinct operations which can be concurrent with it at any given time is bounded by the number of threads, but the contention for a specific operation is potentially unbounded.
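As a small worked example of the alternative definition, the sketch below computes, for each operation in a fixed schedule, how many other operations complete strictly between its start and completion times. Representing operations as `(start, end)` pairs of distinct step indices is an assumption made for illustration.

```python
def contention(operations):
    """For each (start, end) operation, count the other operations that
    complete strictly inside its execution interval."""
    result = []
    for index, (start, end) in enumerate(operations):
        count = sum(
            1
            for other, (_, other_end) in enumerate(operations)
            if other != index and start < other_end < end
        )
        result.append(count)
    return result


# Hypothetical schedule: one long operation overlapping two short ones.
schedule = [(0, 10), (1, 2), (3, 4), (5, 12)]
print(contention(schedule))
```

Here the first operation spans steps 0 to 10 and sees two completions (at steps 2 and 4), the two short operations see none, and the last operation sees one (the completion at step 10).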
We can rephrase the original process as follows. For each operation, the adversary sets the times when it performs its first and its second read of counter values / bin weights, as well as its contention, by scheduling other operations concurrently. The only constraint on the adversary is that the number of operations active at the same time cannot exceed the number of threads.
Since the adversary is oblivious, we notice that the update process is equivalent to the following: at the time when the update is scheduled, the thread executing the operation generates two uniform random indices $i$ and $j$, and is given values $v_i$ and $v_j$ for the two corresponding counters / bin weights, read at previous (possibly different) points in time. We will stick to the bin weight formulation from now on, with the understanding that the two are equivalent.
The thread will then increment the weight of the bin with the smaller value read (among $v_i$ and $v_j$) by $1$. This formulation has the slight advantage that it makes the update process sequential, by moving the random choices to the time when the update is made, using the principle of deferred decisions. Critically, the bin weights on which the update decision is based are potentially stale. We will focus on this simplified variant of the process in the following.
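The simplified process can be simulated sequentially. The sketch below is a deliberately restricted instance: instead of an adversarial schedule, it assumes every update sees bin weights that lag behind by a fixed number of updates (`delay`). The function name and the fixed-delay model are assumptions for illustration, not the model analyzed in the text.

```python
import random
from collections import deque


def stale_two_choice(num_bins, num_steps, delay, seed=0):
    """Two-choice insertion where decisions use weights `delay` updates old."""
    rng = random.Random(seed)
    weights = [0] * num_bins   # true bin weights
    stale = [0] * num_bins     # weights as they were `delay` updates ago
    pending = deque()          # updates not yet reflected in `stale`
    for _ in range(num_steps):
        # Random choices are drawn at update time (deferred decisions).
        i = rng.randrange(num_bins)
        j = rng.randrange(num_bins)
        # The decision is based on the stale values only.
        target = i if stale[i] <= stale[j] else j
        weights[target] += 1
        pending.append(target)
        if len(pending) > delay:
            stale[pending.popleft()] += 1
    return weights


if __name__ == "__main__":
    for delay in (0, 8, 64):
        w = stale_two_choice(32, 32 * 100, delay)
        print(delay, max(w) - min(w))
```

With `delay=0` this is exactly the sequential two-choice process; larger delays let one explore how staleness affects the gap between the most and least loaded bins.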
Discussion.
The key difference between the above process and the classic power-of-two-choices process is the fact that the choice of bin / counter which the thread updates is based on stale, potentially invalid information. Recall that key to the strong balancing properties of the classic process is the fact that it is biased towards inserting in less loaded bins; the process which inserts into randomly chosen bins is known to diverge [25]. In particular, notice it is possible that, by the time when the thread performs the update, the order of the bins’ load may have changed, i.e. the thread in fact inserts into the more loaded bin among its two choices at the time of the update.
Since the oblivious adversary decides its schedule independently of the threads’ random choices, it cannot deterministically cause a specific update to insert into the more loaded bin. However, it can significantly bias an update towards inserting into the more loaded bin:
Assume for example an execution suffix where all threads read concurrently at some time $t$ and then proceed to perform updates, one after another. (Technically, since we count time in terms of shared-memory operations, these reads occur at consecutive times after $t$. However, all their read values are identical to the read value at $t$, and hence we choose to simplify notation in this way.) Pick an operation $op$ for which the gap between the two values read $v_i$ and $v_j$ (at the time of the read) is $1$, say $v_j = v_i + 1$. So $op$ will increment bin $i$. At the same time, notice that all the other operations which read concurrently with $op$ are biased towards inserting in $i$ rather than $j$, since its rank (in increasing order of weight) is lower than that of bin $j$. Hence, as the adversary schedules more and more operations between $t$ and $op$’s update time, it is increasingly likely to invert the relation between the weights of $i$ and $j$ by the time of $op$’s update, causing it to insert into the “wrong” bin.
The previous example suggests that the adversary is able to bias some subset of the operations towards picking the wrong bin at the time of the update. Another issue is that, for operations which experience high contention, for which there are many updates between the read point and the update point, the read values $v_i$ and $v_j$ become meaningless: for example, if the weights of bins $i$ and $j$ become equal at some time $t'$ between $t$ and $op$’s update, then from this point in time these two bins appear completely symmetrical to the algorithm, and $op$’s choice given the information that $v_i < v_j$ at $t$ may be no better than uniform random.
One issue which further complicates this last example is that, at $t'$, there may be a nonzero number of other operations which already made their reads (for instance, at $t$), but have not updated yet. Since these operations read at a point where $v_i < v_j$, they are in fact biased towards inserting in $i$. So, looking at the event that $op$ updates the less loaded of its two random choices at update time, we notice that its probability in this example is strictly worse than uniform random choice.
We summarize this somewhat lengthy discussion with two points, which will be useful in our analysis:

As they experience concurrent updates, operations may accrue bias towards inserting into the more loaded of their two random choices.

Longrunning operations may in fact have a higher probability of inserting into the more loaded bin than into the less loaded one, i.e. may become biased towards making the “wrong” choice at the time of the update.
6.2 Notation and Background
The Process.
In the following, it will be useful to consider the following sequential relaxation of the two-choice process, introduced by [25], called the $(1+\beta)$-choice process: We are given $n$ bins, initially empty. In each step $t$, we flip a biased coin: with probability $\beta$ we will insert a ball into the less loaded of two randomly chosen bins; otherwise, we insert the ball into a randomly chosen bin. This process is analyzed in [25], which shows that, at any time $t$ in its execution, the gap between the maximum and minimum value of a bin is $O(\log n / \beta)$, with high probability in $n$, irrespective of $t$.
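A minimal simulation of this relaxed process is sketched below. The function name `one_plus_beta_gap` and its parameters are hypothetical; the coin flip with bias `beta` decides between a two-choice step and a single uniformly random insertion.

```python
import random


def one_plus_beta_gap(num_bins, num_balls, beta, seed=0):
    """Simulate the relaxed process; return max load minus min load."""
    rng = random.Random(seed)
    bins = [0] * num_bins
    for _ in range(num_balls):
        target = rng.randrange(num_bins)
        if rng.random() < beta:
            # Two-choice step: compare against a second random bin.
            other = rng.randrange(num_bins)
            if bins[other] < bins[target]:
                target = other
        # Otherwise: plain single-choice insertion into `target`.
        bins[target] += 1
    return max(bins) - min(bins)


if __name__ == "__main__":
    for beta in (0.0, 0.25, 0.5, 1.0):
        print(beta, one_plus_beta_gap(64, 64 * 200, beta))
```

Setting `beta=1.0` recovers the two-choice process and `beta=0.0` the single-choice process, so intermediate values let one observe how the gap shrinks as the bias towards the less loaded bin grows.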
We now introduce some notation, which will be common between our analysis and that of [25]. For simplicity, we will assume that, at the beginning of each step in this sequential process, bins are always ranked in increasing order of their weight. If is the probability that we pick the th ranked bin for insertion, and is the twochoice probability, then it is easy to see that the process guarantees
Further, notice that, for any , we have, ignoring the negligible factor, that
For any bin and time , let be the weight of bin at time . Let be the average weight at time over the bins. Let , and let be a parameter to be fixed later. Define
Finally, define the potential function
The main technical result of [25] can be phrased as:
Theorem 6.2
Let , be parameters as given above, and let . Then there exists a constant such that, for any time , we have .
In turn, this implies that the maximum gap between the most loaded and the least loaded bin at a step is in expectation and with high probability in .
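To see why a bound on the potential translates into a bound on the gap, the following sketch may help. It assumes the exponential form of the potential used in [25], namely the sum over bins of exp(alpha * x) + exp(-alpha * x), where x is the bin's deviation from the average weight; the exact definitions are not legible in this copy of the text, so this form, and the names `potential`, `gap_bound_from_potential`, and `alpha`, are assumptions made for illustration.

```python
import math


def potential(weights, alpha):
    """Assumed potential: sum of exp(alpha*x_i) + exp(-alpha*x_i),
    where x_i is the deviation of bin i from the average weight."""
    mu = sum(weights) / len(weights)
    phi = sum(math.exp(alpha * (w - mu)) for w in weights)
    psi = sum(math.exp(-alpha * (w - mu)) for w in weights)
    return phi + psi


def gap_bound_from_potential(gamma, alpha):
    """Every term exp(alpha*|x_i|) is at most gamma, so |x_i| <= ln(gamma)/alpha.
    The max-min gap is therefore at most twice that quantity."""
    return 2.0 * math.log(gamma) / alpha


if __name__ == "__main__":
    weights = [1, 5, 2, 8]
    alpha = 0.5
    gamma = potential(weights, alpha)
    print(gamma, gap_bound_from_potential(gamma, alpha))
```

Under this assumed form, a potential that is linear in the number of bins yields a gap that is logarithmic in the number of bins, which is the shape of the implication stated above.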
6.3 Main Argument
Analysis Overview.
Throughout the analysis, we will fix a large constant such that . The analysis proceeds in the following technical steps.

We define an operation as good for if, with probability at least , the bin adds to is not accessed by another operation at any point during the execution of . We will identify a constant such that all operations with contention are good.

We lower bound the expected decrease in potential caused by a step that is good.

We upper bound the expected increase in potential caused by a step that is not good (a bad step).

We argue that, for any adversarial strategy, out of any group of consecutive operations, at least have to be good. We then upper bound the change in potential over any stretch of operations, showing that it has to stay in .
We will prove the following simple claim as starting point for the analysis:
Lemma 6.3
If for we have that its contention , then the step (operation) is good.
Proof. Let and be the two random bins chosen by ; w.l.o.g., we assume that is the bin chosen by to add to. Considering that and that, for any operation, the probability of accessing bin is at most , we get:
where we used the inequality .
Next we bound the expected decrease in potential if operation is good.
Lemma 6.4
If is good, then:
(1) 
Proof. First notice that if chooses to add to bin after looking at bins and bin is not accessed by any other operation during the execution of , then bin must have been less loaded than bin for the entire interval between the second read of and the write of .
Now, we will use Theorem 3.1 from [25], stated below. Assume that we are given a weight vector , in increasing order of weight, and two probability vectors and , where we assume that probability vectors are sorted in decreasing order. We say that majorizes if for any :
Let be the expected potential function if we choose a bin according to probability vector (that is, the th least loaded bin is chosen with probability ) and let be the expected potential function if we choose a bin according to probability vector (that is, the th least loaded bin is chosen with probability ). Then Theorem 3.1 from [25] implies that , because probability vector is more biased towards lesser loaded bins than probability vector .
What we need to show is that the probability vector of , which is good, majorizes the probability vector of the $(1+\beta)$-choice process for some . That is, for any :
We do exactly that in the following. Let be the probability vector for the bin choice of the fully sequential process. As we know, for any , . Recall that if bin is not accessed by another operation during the execution of , then must add to the bin which is the lesser loaded at the time of writing. Thus, if is the probability that adds to the lesser loaded bin, we have that .
Then:
On the other hand:
From the equations above, it is easy to see that for any and :
Theorem 2.9 from [25] gives us that :
(2) 
The fact that gives us the lemma.
If inserts into the lesser loaded bin with probability at most , we assume the worst scenario. That is, we assume that always inserts into the more loaded bin and we try to bound the expected potential increase for that case. For this, again let us assume that the weight vector is ordered such that and let , be a probability that bin is chosen. In this case, .
We now fix some constants: let and , so that for every we have . Also, to be consistent with the above lemma we fix . At this point, we also fix . This allows us to prove the following lemma:
Lemma 6.5
If is a bad operation, then:
(3) 
Proof. First we consider the expected change in . Let . We have two cases here. If bin is chosen, then the change is:
(4) 
where in we used the Taylor expansion of the exponential around and the fact that since , we have that . Using similar arguments we can prove that, when some other bin is chosen:
(5) 
Therefore, we get that:
(6) 
where we used that and . This gives us that:
(7) 
In a similar way, we can prove that:
(8) 
Since , we get that . Combining this with inequality (7) gives us the lemma.
Now we consider consecutive operations and prove that at most of them can be bad:
Lemma 6.6
For any , we have that .
Proof. We argue by contradiction. Let us assume that the number of bad operations is at least . By the pigeonhole principle, there exist bad operations and , , which are performed by the same thread. This means that since these operations are not concurrent, we have that . Thus, we get a contradiction: .
Endgame.
With all this machinery in place, we proceed to prove the following.
Lemma 6.7
Given any time , we have
Proof. We will proceed by induction on . We will first prove that, if , then .
We have two cases. The first is if there exists a time such that . Let us now focus on bounding the maximum expected value of in this case. First notice that the maximum expected increase of because of a good step is an additive factor. The expected value of after a bad step is upper bounded by a multiplicative factor. Hence, by Lemma 6.6 and using the fact that , the expected maximum value of at is at most
The second case is if there exists no such time in , meaning that Then, by Lemma 6.4, we have that, at each good step,
(9) 
Hence, we can expand the recursion to upper bound the change in between and as
(10) 
This last expression is upper bounded by if , which concludes the proof of our first claim above. Hence, at the end of each interval of additional operations, the expected potential cannot exceed . To complete the proof, we notice that the value that attains inside the interval of size occurs if bad steps occur in succession immediately after its start. However, the maximum value that can attain is upper bounded as
(11) 
This concludes the proof of the Lemma.
The Constant .
A sufficient setting for for the analysis to hold is , yielding .
The following claim completes the proof of Theorem 6.1.
Lemma 6.8
Fix a large constant . Given an oblivious adversary, distributed counters and threads with , for any time in the execution of the approximate counter algorithm the counter returns a value that is at most away from the number of increment operations which completed up to time , in expectation. Moreover, for any and all sufficiently large, we have
Proof. The proof is similar to [25] (the main difficulty was to reach asymptotically the same potential upper bound). We aim to bound , the maximum gap between the weight of two bins at a step.
By choosing sufficiently large, we have that in Lemma 6.7. We first prove the bound in expectation. Note that Lemma 6.7 implies that and for all . Let denote the maximum weight of any bin at time , and let be the minimum weight of any bin. Then, we have
where (a) follows from Jensen’s inequality, and (b) follows from the definition of . Similarly, we have . Since the true value of the counter at time is , these two statements imply that for all , we have , as desired.
We now prove the high probability bound. Observe that if , then we have . Hence,
Similarly,
Combining these two guarantees with a union bound immediately yields the desired guarantee.
7 Distributional Linearizability for Concurrent Relaxed Queues
We now extend the analysis in the previous section to imply distributional linearizability guarantees in concurrent executions for a variant of the MultiQueue process analyzed by [3]. This process is presented in Algorithm 2. We note that this process applies specifically to implement general concurrent queues, and will also apply to priority queues assuming that a sufficiently large buffer of elements always exists in the queues such that no insertion is ever performed on an element of higher priority than an element which has already been removed.
7.1 Application to Concurrent Relaxed Queues
Description.
We wish to implement a concurrent data structure with queue-like semantics, so that a dequeue always removes an element which is among the oldest elements in the queue, w.h.p. We assume we are given a set of linearizable priority queues such that each supports , , , where is the priority of the element, and returns the element with smallest priority in the priority queue, but does not remove it. We also assume that each processor has access to a clock which gives an absolute time, and that these clocks are consistent amongst all the processors, that is, if processor reads in the linearization before processor reads , then processor ’s value is smaller. Such an assumption is realistic; recent Intel processors support the RDTSC instruction, which provides this functionality for cores on the same socket.
The procedure, given formally in Algorithm 2, is similar to our approximate counter. To enqueue, a thread reads the wall clock, chooses a random priority queue, and adds the element to that priority queue with priority given by the time. To dequeue, we choose two random priority queues, find the one whose top element has higher priority (that is, the smaller timestamp), and delete from that priority queue. In case two processes enqueue to the same priority queue concurrently, their clock values will ensure a consistent ordering, handled by the internal implementation of the priority queues.
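The enqueue and dequeue rules just described can be sketched sequentially as follows. This is an illustrative sketch only, not a reproduction of Algorithm 2: the class name `MultiQueue`, its method names, and the use of a monotone counter as a stand-in for the consistent wall clock are all assumptions, and no concurrency control is modeled.

```python
import heapq
import itertools
import random


class MultiQueue:
    """Sequential sketch of a relaxed queue built from several priority queues."""

    def __init__(self, num_queues, seed=0):
        self._queues = [[] for _ in range(num_queues)]
        # Stand-in for the consistent wall clock: a strictly increasing counter.
        self._clock = itertools.count()
        self._rng = random.Random(seed)

    def enqueue(self, item):
        timestamp = next(self._clock)
        queue = self._rng.choice(self._queues)
        # The timestamp is the priority; timestamps are unique, so items
        # themselves are never compared by the heap.
        heapq.heappush(queue, (timestamp, item))

    def dequeue(self):
        """Pop from whichever of two random queues has the older top element.

        Returns None if both sampled queues are empty, even when other
        queues still hold elements.
        """
        first = self._rng.choice(self._queues)
        second = self._rng.choice(self._queues)
        candidates = [q for q in (first, second) if q]
        if not candidates:
            return None
        oldest = min(candidates, key=lambda q: q[0][0])
        return heapq.heappop(oldest)[1]


if __name__ == "__main__":
    mq = MultiQueue(num_queues=8)
    for i in range(20):
        mq.enqueue(i)
    print([mq.dequeue() for _ in range(10)])
```

Elements come out in approximately, but not exactly, insertion order; how far a dequeued element can be from the true oldest one is what the analysis below quantifies.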
Analysis.
It can be shown that this relaxed queue implementation ensures that the rank gap between the smallest timestamp head element of any queue and the largest timestamp head element of any queue is at most , for . Given Lemma 6.7, the argument will follow the same pattern as the analysis in [3]. The key difference is that [3] studies a sequential process, whereas we consider a concurrent one. The key nontrivial step in this derivation is a generalization of Lemma 6.5 for exponential weights of mean , as opposed to weights of value :
Theorem 7.1
Let and be parameters with , for a large constant . Assuming an oblivious adversary, the MultiQueue algorithm with parameter (Algorithm 2) is distributionally linearizable to a sequential randomized relaxed queue , which ensures that, at each step , the rank of a dequeued element is in expectation, and with high probability.
Proof. First we prove the same result as in Lemma 6.5, when the load of the chosen bin is increased by a random variable , such that , where is an exponentially distributed random variable with parameter . We have that . Let be the moment generating function for the distribution of . Observe that . This gives us that . We fix and , so that for every we have .
Notice that if bin is chosen:
By Taylor expansion, we get that for :
(12) 
If some other bin is chosen:
(13) 
Again using Taylor expansion we have that for :
(14) 
Observe that this gives us exactly the same bound as in the proof of Lemma 6.5
(15) 
Similarly, we can prove that:
(16) 
Summing the two inequalities above gives us:
(17) 
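For reference, the moment generating function of an exponential random variable has a simple closed form, which is the standard fact behind the Taylor-expansion steps in the proof above. The symbols $X$, $\mu$, and $s$ below are placeholder names introduced only for this reminder; they are not necessarily the paper's notation.

```latex
% X is exponentially distributed with mean \mu (rate 1/\mu); s is a free parameter.
\[
  M_X(s) \;=\; \mathbb{E}\!\left[e^{sX}\right]
        \;=\; \int_0^\infty e^{sx}\,\frac{1}{\mu}\,e^{-x/\mu}\,dx
        \;=\; \frac{1}{1-\mu s},
  \qquad s < \frac{1}{\mu},
\]
\[
  M_X(s) \;=\; 1 + \mu s + \mu^2 s^2 + O\!\left(s^3\right)
  \qquad \text{as } s \to 0 .
\]
```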
8 Experimental Results
Setup.
Our experiments were run on an Intel E7-4830 v3 with 12 cores per socket and 2 hyperthreads (HTs) per core, for a total of 24 threads, and 128GB of RAM. In all of our experiments, we pinned threads to avoid unnecessary context switches. Hyperthreading is only used with more than 12 threads. The machine runs Ubuntu 14.04 LTS. All code was compiled with the GNU C++ compiler (G++) 6.3.0 with compilation options -std=c++11 -mcx16 -O3.
Synthetic Benchmarks.
We implemented and benchmarked the MultiCounter algorithm on a multicore machine. To test the behavior under contention, threads continually increment the counter value using the two-choice process. We use no synchronization other than the atomic fetch-and-increment instruction for the update. Figure 1(a) shows the scalability results, while Figure 1(b) shows the “quality” guarantees of the implementation in terms of values returned by the counter over time, as well as maximum gap between bins over time. Quality is measured in a single-threaded execution, for counters. (Recording quality accurately in a concurrent execution appears complicated, as it is not clear how to order the concurrent read steps.)
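The counter being benchmarked can be sketched as follows. This is a hedged Python illustration rather than the authors' C++ implementation: the class and method names are hypothetical, and a per-counter lock stands in for the hardware fetch-and-increment instruction. As in the algorithm's description, the two reads are unsynchronized, so the choice may be stale by the time the increment is applied.

```python
import random
import threading


class MultiCounter:
    """Approximate counter spread over several individual counters."""

    def __init__(self, num_counters):
        self._values = [0] * num_counters
        # One lock per counter emulates an atomic fetch-and-increment.
        self._locks = [threading.Lock() for _ in range(num_counters)]

    def increment(self, rng=random):
        n = len(self._values)
        i, j = rng.randrange(n), rng.randrange(n)
        # Unsynchronized reads: the comparison may be outdated at update time.
        target = i if self._values[i] <= self._values[j] else j
        with self._locks[target]:
            self._values[target] += 1

    def read(self, rng=random):
        """Return a random counter's value scaled by the number of counters."""
        n = len(self._values)
        return self._values[rng.randrange(n)] * n

    def exact_total(self):
        """Exact number of increments; meaningful only when no thread is updating."""
        return sum(self._values)


if __name__ == "__main__":
    counter = MultiCounter(num_counters=16)

    def worker():
        local_rng = random.Random()
        for _ in range(5000):
            counter.increment(local_rng)

    threads = [threading.Thread(target=worker) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(counter.exact_total(), counter.read())
```

The error of `read` is at most the number of counters times the gap between the largest and smallest counter, which is why bounding that gap is the heart of the analysis.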
TL2 Benchmark.
Transactional Locking II (TL2) is a software implementation of transactional memory introduced by [13]. TL2 guarantees opacity by using fine-grained locking and a global clock . TL2 associates a version lock with each memory location. A version lock behaves like a traditional lock, except it additionally stores a version number that represents the value of when the memory location protected by the lock was last modified. At a high level, a transaction starts by reading , and uses the clock value it reads to determine whether it ever observes the effects of an uncommitted transaction. If so, the transaction will abort. Otherwise, after performing all of its reads, it locks the addresses in its write set (validating these locations to ensure that they have not been written recently), re-reads to obtain a new version , performs its writes, then releases its locks, updating their versions to .
TL2 with Relaxed Global Clocks.
In the standard implementation of TL2, is incremented using fetch-and-add (FAA). This quickly becomes a concurrency bottleneck as the number of threads increases, so the authors developed several improved implementations of . However, they too experience scaling problems at large thread counts. We replace this global clock counter with a MultiCounter implementation, and compare against a highly optimized baseline implementation.
Because the counter is relaxed, reasoning about the correctness of the resulting algorithm is no longer straightforward. In particular, a key property we need to enforce is that the timestamp which a thread writes to a set of objects as part of its transaction (generated when the thread is holding locks to commit and written to all objects in its write set) cannot be held by any other threads at the same time, since such threads might read those updates concurrently, and believe that they occurred in the past. For this reason, we modify the TL2 algorithm so that threads write “in the future,” by adding a quantity , which exceeds the maximum clock skew we expect to encounter in the MultiCounter over an execution, to the maximum timestamp they have encountered during their execution so far. Thus, each new write always increments an object’s timestamp by . We stress that the (approximate) global clock is implemented by the MultiCounter algorithm, and that it is disjoint from the object timestamps.
This protocol induces the following tradeoffs. First, the resulting transactional algorithm only ensures safety with high probability, since the bound might be broken at some point during the execution, and lead to a nonserializable transaction, with extremely low probability. Second, we note that, once an object is written with a timestamp that occurs in the future, transactions which immediately read this object may abort, since they see a timestamp that is larger than theirs. Hence, once an object is written, at least operations should occur without accessing this object, so that the system clock is incremented past the read point without causing readers to abort. Intuitively, this upper bounds the frequency at which objects should be written to for this approximate timestamping mechanism to be efficient. On the positive side, this mechanism allows us to break the scalability bottleneck caused by the global clock.
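The “write in the future” rule and the abort condition it induces can be illustrated with a small sketch. Everything here is hypothetical scaffolding: `VersionedObject`, `commit_version`, `read_is_valid`, and the parameter name `delta` (the added quantity exceeding the expected clock skew) are names introduced for illustration, and locking is omitted entirely.

```python
from dataclasses import dataclass


@dataclass
class VersionedObject:
    """A transactional object carrying the timestamp of its last write."""
    value: int = 0
    version: int = 0


def commit_version(max_seen_timestamp, delta):
    """Timestamp written at commit: the largest timestamp the thread has
    encountered so far, pushed into the future by delta."""
    return max_seen_timestamp + delta


def read_is_valid(obj, read_clock):
    """A reader whose clock value is read_clock must abort if the object
    carries a larger timestamp."""
    return obj.version <= read_clock


if __name__ == "__main__":
    obj = VersionedObject()
    obj.value += 1
    obj.version = commit_version(max_seen_timestamp=10, delta=5)
    # A reader whose clock is still below the written timestamp aborts;
    # one that starts after the clock has advanced past it succeeds.
    print(read_is_valid(obj, read_clock=12), read_is_valid(obj, read_clock=15))
```

This makes the tradeoff concrete: the larger `delta` is, the safer the scheme is against clock skew, but the longer an object stays unreadable after a write.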
We verify this intuition through implementation. See Figures 1(c)–1(e). We are given an array of transactional objects, with between 10K and 1M. Transactions pick 2 array locations uniformly at random, then start a transaction, increment both locations, and then commit the transaction. We record the average throughput out of ten one-second experiments. We verify correctness by checking that the array contents are consistent with the number of executed operations at the end of the run; none of these experiments have resulted in erroneous outputs. We record the rate at which transactions commit, as a function of the number of threads. We note that, for 1M and 100K objects, the average frequency at which each location is written is below the heavy abort threshold, and we obtain almost linear scaling with MultiCounters. At 10K objects we surpass this threshold, and see a considerable drop in performance, because of a large number of aborts.
9 Conclusions and Future Work
We have presented the first concurrent analysis of the two-choice load-balancing process, showing that this classic randomized algorithm is in fact robust to asynchrony under an oblivious adversary. Our analysis extends existing tools, namely [25], in nontrivial ways, in particular by showing that the potential analysis can withstand adversarially corrupted updates. Our results have nontrivial practical applications, as they show that a popular set of randomized concurrent data structures in fact provide strong probabilistic guarantees in arbitrary executions, which we express via a new correctness condition called distributional linearizability. This inspires a scalable approximate counting mechanism, trading off contention and exactness guarantees, which can be used to scale a transactional application.
An immediate direction of future work is to reduce the large constant gap between the number of bins and the number of threads . We did not specifically optimize for this gap in the current version. It is interesting to also ask whether the process will preserve its properties even under high contention, e.g. . The main reason for which we assume that is to withstand adversarial executions in which the adversary schedules whole “blocks” of updates, which effectively reset the distribution of bin loads. Threads acting with stale information after such a block perform choices which are effectively random (or worse than random). We conjecture that there may be values of and for which the process breaks down, in the sense that its max gap is no longer bounded by .
References
 [1] Yehuda Afek, Guy Korland, and Eitan Yanovsky. Quasi-linearizability: Relaxed consistency for improved concurrency. In International Conference on Principles of Distributed Systems, pages 395–410. Springer, 2010.
 [2] Dan Alistarh, James Aspnes, Keren Censor-Hillel, Seth Gilbert, and Rachid Guerraoui. Tight bounds for asynchronous renaming. J. ACM, 61(3):18:1–18:51, June 2014.
 [3] Dan Alistarh, Justin Kopinsky, Jerry Li, and Giorgi Nadiradze. The power of choice in priority scheduling. arXiv preprint arXiv:1706.04178, 2017. To appear in PODC 2017.
 [4] Dan Alistarh, Justin Kopinsky, Jerry Li, and Nir Shavit. The spraylist: A scalable relaxed priority queue. In 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2015, San Francisco, CA, USA, 2015. ACM.
 [5] Hagit Attiya and Jennifer Welch. Distributed computing: fundamentals, simulations, and advanced topics, volume 19. John Wiley & Sons, 2004.
 [6] Yossi Azar, Andrei Z Broder, Anna R Karlin, and Eli Upfal. Balanced allocations. SIAM journal on computing, 29(1):180–200, 1999.
 [7] Dmitry Basin, Rui Fan, Idit Keidar, Ofer Kiselov, and Dmitri Perelman. CAFÉ: Scalable task pools with adjustable fairness and contention. In Proceedings of the 25th International Conference on Distributed Computing, DISC’11, pages 475–488, Berlin, Heidelberg, 2011. Springer-Verlag.
 [8] Petra Berenbrink, Artur Czumaj, Matthias Englert, Tom Friedetzky, and Lars Nagel. Multiple-choice balanced allocation in (almost) parallel. In APPROX-RANDOM, pages 411–422. Springer, 2012.
 [9] Petra Berenbrink, Artur Czumaj, Angelika Steger, and Berthold Vöcking. Balanced allocations: The heavily loaded case. In Proceedings of the Thirty-second Annual ACM Symposium on Theory of Computing, STOC ’00, pages 745–754, New York, NY, USA, 2000. ACM.
 [10] Petra Berenbrink, Tom Friedetzky, Zengjian Hu, and Russell Martin. On weighted balls-into-bins games. Theor. Comput. Sci., 409(3):511–520, December 2008.
 [11] N. Deo and S. Prasad. Parallel heap: An optimal parallel priority queue. The Journal of Supercomputing, 6(1):87–98, March 1992.
 [12] Dave Dice, Yossi Lev, and Mark Moir. Scalable statistics counters. In 25th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’13, Montreal, QC, Canada, pages 43–52, 2013.
 [13] Dave Dice, Ori Shalev, and Nir Shavit. Transactional locking II. In International Symposium on Distributed Computing, pages 194–208. Springer, 2006.
 [14] Joseph E Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. Powergraph: distributed graph-parallel computation on natural graphs. In OSDI, volume 12, page 2, 2012.
 [15] Andreas Haas, Thomas A Henzinger, Andreas Holzer, Christoph M Kirsch, Michael Lippautz, Hannes Payer, Ali Sezgin, Ana Sokolova, and Helmut Veith. Local linearizability. arXiv preprint arXiv:1502.07118, 2015.
 [16] Andreas Haas, Michael Lippautz, Thomas A. Henzinger, Hannes Payer, Ana Sokolova, Christoph M. Kirsch, and Ali Sezgin. Distributed queues in shared memory: multicore performance and scalability through quantitative relaxation. In Computing Frontiers Conference, CF’13, Ischia, Italy, May 14–16, 2013, pages 17:1–17:9, 2013.
 [17] Thomas A. Henzinger, Christoph M. Kirsch, Hannes Payer, Ali Sezgin, and Ana Sokolova. Quantitative relaxation of concurrent data structures. SIGPLAN Not., 48(1):317–328, January 2013.
 [18] Maurice P. Herlihy and Jeannette M. Wing. Linearizability: A correctness condition for concurrent objects. ACM Trans. Program. Lang. Syst., 12(3):463–492, July 1990.
 [19] R. M. Karp and Y. Zhang. Parallel algorithms for backtrack search and branch-and-bound. Journal of the ACM, 40(3):765–789, 1993.
 [20] Andrew Lenharth, Donald Nguyen, and Keshav Pingali. Priority queues are not good concurrent priority schedulers. In European Conference on Parallel Processing, pages 209–221. Springer, 2015.
 [21] Christoph Lenzen and Roger Wattenhofer. Tight bounds for parallel randomized load balancing. Distrib. Comput., 29(2):127–142, April 2016.
 [22] Michael Mitzenmacher. How useful is old information? IEEE Transactions on Parallel and Distributed Systems, 11(1):6–20, 2000.
 [23] Michael David Mitzenmacher. The Power of Two Random Choices in Randomized Load Balancing. PhD thesis, Graduate Division of the University of California at Berkeley, 1996.
 [24] Donald Nguyen, Andrew Lenharth, and Keshav Pingali. A lightweight infrastructure for graph analytics. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP ’13, pages 456–471, New York, NY, USA, 2013. ACM.
 [25] Yuval Peres, Kunal Talwar, and Udi Wieder. Graphical balanced allocations and the (1 + β)-choice process. Random Struct. Algorithms, 47(4):760–775, December 2015.
 [26] Andrea W Richa, M Mitzenmacher, and R Sitaraman. The power of two random choices: A survey of techniques and results. Combinatorial Optimization, 9:255–304, 2001.
 [27] Hamza Rihani, Peter Sanders, and Roman Dementiev. Brief announcement: Multiqueues: Simple relaxed concurrent priority queues. In Proceedings of the 27th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’15, pages 80–82, New York, NY, USA, 2015. ACM.
 [28] P. Sanders. Randomized priority queues for fast parallel access. Journal Parallel and Distributed Computing, Special Issue on Parallel and Distributed Data Structures, 49:86–97, 1998.
 [29] Nir Shavit and Itay Lotan. Skiplist-based concurrent priority queues. In Parallel and Distributed Processing Symposium, 2000. IPDPS 2000. Proceedings. 14th International, pages 263–268. IEEE, 2000.
 [30] Kunal Talwar and Udi Wieder. Balanced allocations: The weighted case. In Proceedings of the Thirty-ninth Annual ACM Symposium on Theory of Computing, STOC ’07, pages 256–265, New York, NY, USA, 2007. ACM.
 [31] Martin Wimmer, Jakob Gruber, Jesper Larsson Träff, and Philippas Tsigas. The lock-free k-LSM relaxed priority queue. In ACM SIGPLAN Notices, volume 50, pages 277–278. ACM, 2015.