# Parallel Balanced Allocations: The Heavily Loaded Case

We study parallel algorithms for the classical balls-into-bins problem, in which $m$ balls acting in parallel as separate agents are placed into $n$ bins. Algorithms operate in synchronous rounds, in each of which balls and bins exchange messages once. The goal is to minimize the maximal load over all bins using a small number of rounds and few messages. While the case of $m=n$ balls has been extensively studied, little is known about the heavily loaded case. In this work, we consider parallel algorithms for this somewhat neglected regime of $m\gg n$. The naive solution of allocating each ball to a bin chosen uniformly and independently at random results in maximal load $m/n+\Theta(\sqrt{m/n\cdot\log n})$ (for $m\ge n\log n$) w.h.p. In contrast, for the sequential setting Berenbrink et al. (SIAM J. Comput. 2006) showed that letting each ball join the least loaded bin of two randomly selected bins reduces the maximal load to $m/n+O(\log\log m)$ w.h.p. To date, no parallel variant of such a result is known. We present a simple parallel threshold algorithm that obtains a maximal load of $m/n+O(1)$ w.h.p. within $O(\log\log(m/n)+\log^* n)$ rounds. The algorithm is symmetric (balls and bins all "look the same"), and balls send $O(1)$ messages in expectation per round. The additive term of $O(\log^* n)$ in the complexity is known to be tight for such algorithms (Lenzen and Wattenhofer, Distributed Computing 2016). We also prove that our analysis is tight, i.e., algorithms of the type we provide must run for $\Omega(\min\{\log\log(m/n),n\})$ rounds w.h.p. Finally, we give a simple asymmetric algorithm (i.e., balls are aware of a common labeling of the bins) that achieves a maximal load of $m/n+O(1)$ in a constant number of rounds w.h.p. Again, balls send only a single message per round, and bins receive $(1+o(1))m/n+O(\log n)$ messages w.h.p.


## 1 Introduction

We consider simple parallel algorithms for the heavily loaded regime of the well-known balls-into-bins problem. When $m$ balls are thrown randomly into $n$ bins, the maximal load can be bounded by $m/n+O(\sqrt{m/n\cdot\log n})$ with high probability (w.h.p.) for any $m\ge n\log n$ (e.g., by Chernoff's bound). (Footnote 1: Throughout this work, we say that an event happens with high probability if it succeeds with probability of at least $1-1/n^c$ for any constant $c$.) In the balanced case, i.e., for $m=n$, it was demonstrated that parallel communication between balls and bins can considerably improve this load using a small number of messages and rounds. In contrast, for the $m\gg n$ regime, to this point, there was no (communication efficient) parallel algorithm that outperforms the naïve random allocation.

In this paper, we ask how to leverage communication to improve the maximal load for this heavily loaded case. We are in particular intrigued by the number of communication rounds required to achieve the almost perfect maximal load of $m/n+O(1)$. We focus primarily on algorithms which are symmetric (bins are anonymous) and use few messages.

#### The Classical Setting of Balls into Bins.

Balls into bins and related problems have been studied thoroughly in a wide range of models. The high-level goal of any balls-into-bins algorithm is to allocate "efficiently" a set of items (e.g., jobs, balls) to a set of resources (machines, bins). The naïve single-choice algorithm places each ball into a bin chosen independently and uniformly at random. It is well-known that for $m=n$ this achieves a maximal load of $\Theta(\log n/\log\log n)$ with high probability. In a seminal work, Azar et al. introduced the multiple-choice paradigm, in which the balls are placed into bins sequentially one by one, and each ball is allocated to the least loaded among $d\ge 2$ randomly selected bins. They showed that this algorithm achieves, w.h.p., a maximal load of $\log\log n/\log d+O(1)$, an exponential improvement over the single-choice algorithm.
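To make the gap between the single-choice and multiple-choice processes concrete, the following is a minimal simulation sketch. It is not taken from the paper; the function name `max_load_d_choice` and the parameters (`seed`, sampling with replacement) are illustrative assumptions.

```python
import random


def max_load_d_choice(m, n, d, seed=0):
    """Sequentially place m balls into n bins.

    Each ball samples d bins uniformly at random (with replacement, an
    assumption of this sketch) and joins the least loaded one.
    d = 1 is the naive single-choice process.
    Returns (maximal load, list of all bin loads).
    """
    rng = random.Random(seed)
    loads = [0] * n
    for _ in range(m):
        candidates = [rng.randrange(n) for _ in range(d)]
        # Ties are broken in favor of the first sampled candidate.
        best = min(candidates, key=lambda b: loads[b])
        loads[best] += 1
    return max(loads), loads
```

Calling `max_load_d_choice(n, n, 1)` and `max_load_d_choice(n, n, 2)` for the same `n` lets one compare the two processes empirically; for moderately large `n` the two-choice load is typically visibly smaller.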

Adler et al. [ACMR98] introduced the parallel framework for the balls-into-bins problem, with the objective of parallelizing this sequential multiple-choice process. They restricted attention to simple and natural parallel algorithms that are both (i) symmetric: all balls and bins run the same algorithm, and bins are anonymous; and (ii) non-adaptive: each ball picks a set of bins uniformly and independently at random and communicates only with these bins throughout the protocol. They showed that such symmetric and non-adaptive algorithms can achieve a total load of with the same number of rounds.

Lenzen and Wattenhofer [LW16] relaxed the non-adaptivity constraint, and presented an adaptive and symmetric algorithm that obtains a bin load of $2$, w.h.p., within $\log^* n+O(1)$ rounds and using a total of $O(n)$ messages. Again, this is tight for this class of algorithms, and dropping any of the constraints the lower bound imposes leads to constant-round solutions.

#### The Heavily Loaded Case of Balls into Bins.

It has been noted in the literature that the regime $m\gg n$ of the balls-into-bins problem is fundamentally different from the case where $m=n$; this explains why attempts to extend the analysis of existing algorithms to the heavily loaded case mostly fail [BCSV06, TW14]. In a breakthrough result, Berenbrink et al. [BCSV06] provided an ingenious analysis for the multiple-choice process in the heavily loaded regime. They showed that when balls are allowed to pick the best among $d\ge 2$ random choices, the bin load becomes $m/n+O(\log\log n)$ with high probability. Thus the $d$-choice process super-exponentially improves the excess bin load compared to the single-choice random allocation and makes it independent of $m$.

To the best of our knowledge, there has been no work that parallelizes this sequential process in a similar manner as has been done by Adler et al. and others for the $m=n$ case. (Footnote 2: We note that Stemann [Ste96] considers the possibility that , but provides algorithms for load only; for almost the entire range of parameters, the naïve algorithm or using multiple instances of algorithms for $m=n$ yields better results.) As a result, no better parallel algorithm has been known for this regime other than placing balls randomly into bins.

#### Our Results.

We propose a very simple threshold algorithm (cf. [ACMR98]) that appears to be suitable for the heavily loaded regime. In every synchronous round $i$ of our algorithm, each unallocated ball sends a join-request to a bin chosen uniformly at random. Bins accept balls up to a load of $T_i$ (a threshold that increases with $i$). Thus, a bin with load $\ell$ at the beginning of round $i$ acknowledges up to $T_i-\ell$ requests (chosen arbitrarily among all received requests) and declines the rest. We show that such a simple algorithm achieves a maximal load of $m/n+O(1)$ within $O(\log\log(m/n)+\log^* n)$ rounds with high probability.

###### Theorem 1.

There exists a parallel symmetric and adaptive algorithm of $O(\log\log(m/n)+\log^* n)$ rounds that achieves a maximal load of $m/n+O(1)$ with high probability. The algorithm uses a total of $O(m)$ messages, w.h.p.

Note that, trivially, one can place all balls within rounds, by each ball approaching each bin once (and bins using thresholds of in all rounds). Thus the above time bound is of interest whenever .

The technically most challenging part is our lower bound argument. We consider a special class of threshold algorithms to which our algorithm belongs. This class consists of all threshold algorithms in which in every round, every (unallocated) ball contacts bins sampled uniformly and independently at random. This class generalizes our algorithm in two ways. First, it allows a ball to contact bins per round instead of only (as in the main phase of our algorithm). Second, it allows bins to have distinct threshold values, which can depend on the state of the entire system in an arbitrary way.

###### Theorem 2.

Any threshold algorithm in which in each round balls choose bins to contact uniformly and independently at random w.h.p. runs for rounds or has a maximal load of .

This theorem applies to the algorithm of Theorem 1, but not to the trivial -round algorithm mentioned above. We conjecture that any threshold algorithm runs for

rounds or incurs larger loads, but a proof seems challenging due to the obstacles imposed by balls using differing probability distributions for deciding which bins to contact.

#### Asymmetric Algorithms.

In the asymmetric setting, all bins are distinguished based on globally known IDs, which can be rephrased as all balls’ port numberings of bins being consistent. A perfect allocation can be obtained trivially in this setting, simply by letting all balls contact the first bin, which then can send to each ball the bin ID to which it should be assigned. To rule out such trivial solutions, one should restrict attention to algorithms in which no bin receives (significantly) more messages than necessary. Concretely, bins should receive no more than messages; as with constant probability some bin will receive messages even if each ball sends a single message, this is the best we can hope for.

###### Theorem 3.

There exists a parallel asymmetric algorithm that achieves a maximal load of $m/n+O(1)$ within $O(1)$ rounds w.h.p., where each bin receives a total of $(1+o(1))m/n+O(\log n)$ messages w.h.p.

This goes to show that, similar to the case of $m=n$, asymmetry allows for highly efficient solutions. In what follows, we give a high-level overview of the proofs of Theorems 1 and 2. The full proof of Theorem 3 is given in Section 5.

Following [ABKU99], multiple-choice algorithms have been studied extensively in the sequential setting. For instance, [Vöc03] considered a variant of this setting where the selections made by balls are allowed to be nonuniform and dependent. The works [SP02, MPS02] have studied the effect of memory when combined with the multiple choice paradigm and showed that a choice from memory is asymptotically better than a random choice. The analysis of the multiple choice process for the heavily loaded case was first provided by [BCSV06] and considerably simplified by [TW14]. See [Wie17] for a survey on sequential multiple-choice algorithms.

Turning to the distributed/parallel setting, [SS12] studied distributed load balancing protocols on general graph topologies. [BCE12] considers a semi-parallel framework for balls into bins, in which the balls arrive in batches rather than one by one as in the sequential setting. [ADRS14] consider a variant of the balls-into-bins problem, namely, the renaming problem and the setting of synchronous message passing with failure-prone servers. Finally, [BL14] introduced a general framework for parallel balls-into-bins algorithms and generalizes some of the algorithms analyzed in [LW16].

### 1.1 Our Approach in a Nutshell.

#### The Symmetric Algorithm.

To get some intuition on threshold algorithms, we start by considering the most naïve algorithm, in which each bin agrees to accept at most balls in total, without modifying its threshold over the course of the algorithm. That is, in every round each unallocated ball picks a bin uniformly and independently at random, each bin agrees to accept at most balls in total, and rejects the rest. Clearly, the final load of each bin is bounded by and hence it remains to consider the running time of such an algorithm. One can show that, w.h.p., after a single round a constant fraction of the bins are going to be full (i.e., contain balls). Hence, the probability of an unallocated ball to contact a full bin in the following rounds is constant. This immediately entails a running time lower bound of , even if the balls may contact a constant number of bins per round.

The key idea of our symmetric algorithm is to set the threshold lower than the allowed bin load (e.g., in the first round we set $T_0=m/n-(m/n)^{2/3}$). At first glance, this seems unintuitive, as a bin might reject balls despite the fact that it still has room. The key observation here is that setting the threshold a bit smaller than the allowed load keeps all bins equally loaded throughout the algorithm, yet permits placing all but a few of the remaining balls in each step. This prevents the situation where an unallocated ball blindly searches for a free bin in between many occupied bins. Crunching the numbers shows that this approach reduces the number of remaining balls to $O(n)$ in $O(\log\log(m/n))$ rounds, after which the established techniques for the case of $m=n$ can be applied.
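The doubly logarithmic round count can be checked numerically. The sketch below assumes the per-round recursion used in the analysis of Section 3, namely that the ratio of remaining balls to bins shrinks from $x$ to $x^{2/3}$ in each round; the function name and the stopping ratio of 2 are illustrative choices, not values from the paper.

```python
def rounds_until_linear(m, n, stop_ratio=2.0):
    """Count rounds of the recursion x -> x**(2/3), starting from x = m/n,
    until x <= stop_ratio (i.e., only O(n) balls remain).

    stop_ratio is a hypothetical cutoff chosen for illustration.
    """
    ratio = m / n
    rounds = 0
    while ratio > stop_ratio:
        ratio = ratio ** (2.0 / 3.0)
        rounds += 1
    return rounds
```

Since the exponent of $m/n$ is multiplied by $2/3$ per round, the count grows like $\log_{3/2}\log(m/n)$; for example, squaring $m/n$ adds only about two rounds.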

#### The Lower Bound.

Our lower bound approach considers a natural family of threshold algorithms, which in particular captures the above algorithm. Every algorithm in this family has the following structure. In each round , every unallocated ball picks bins independently and uniformly at random. Every bin accepts up to requests and rejects the rest. The value can be chosen non-deterministically by the bins.

This class is more general than our algorithm in several ways. Most significantly, it allows bins to have different thresholds. The decision on these can depend on the system state at the beginning of each round (excluding future random choices of balls). Moreover, we allow for algorithms that "collect" allocation requests from balls for several rounds before allocating them according to the chosen threshold. While this is not a good strategy for algorithms, it is useful in the simulation part of the proof, which is explained next.

The proof follows in two steps. First, we prove the lower bound for degree one algorithms (where balls contact a single bin in each iteration) in the family described above. The argument for this step is somewhat technical, and it is based on focusing on one class of bins that have roughly the same number of rejected balls in expectation. We show that one can find such a class of bins which captures a large fraction of the expected number of rejected balls. We then exploit the fact that all bins in this class are roughly the same, which allows us to provide concentration results for that class.

The second step is a simulation technique in which we show how to simulate an algorithm with higher degree by an algorithm from the above family. Roughly speaking, we simulate a degree $d$ algorithm by contacting a single bin in each of $d$ different rounds. Only after these $d$ rounds do the bins decide which balls to accept. Here we crucially rely on the fact that our lower bound for degree 1 algorithms includes such algorithms.

## 2 Preliminaries

###### Definition 1 (With high probability (w.h.p.)).

We say that the random variable $X$ attains values from the set $S$ with high probability if $\Pr[X\in S]\ge 1-1/n^c$ for an arbitrary, but fixed constant $c>0$. More simply, we say $S$ occurs w.h.p.

We use some theory on negatively associated random variables, which is given in [DR98].

###### Definition 2 (Negative Association).

A set of random variables $X_1,\ldots,X_n$ is said to be negatively associated if for any two disjoint index sets $I,J\subseteq[n]$ and two functions $f,g$ that are both monotone increasing or both monotone decreasing, it holds that

$$E[f(X_i : i\in I)\cdot g(X_j : j\in J)]\le E[f(X_i : i\in I)]\cdot E[g(X_j : j\in J)].$$

###### Lemma 1 (Chernoff Bound).

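Let $X_1,\ldots,X_n$ be independent or negatively associated random variables that take the value 1 with probability $p_i$ and 0 otherwise, $X=\sum_{i=1}^n X_i$, and $\mu=E[X]$. Then for any $0<\delta<1$,

$$\Pr[X<(1-\delta)\mu]\le e^{-\delta^2\mu/2}$$

and

$$\Pr[X>(1+\delta)\mu]\le e^{-\delta^2\mu/3}.$$

If $\mu>3\log m$, with $\delta=\sqrt{2\log m/\mu}$ and $\delta=\sqrt{3\log m/\mu}$, respectively, we get that

$$\Pr[X<\mu-\sqrt{2\mu\log m}]\le 1/m,$$

and

$$\Pr[X>\mu+\sqrt{3\mu\log m}]\le 1/m.$$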
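As a sanity check on the two tail bounds of Lemma 1, the following sketch compares them with exact binomial tails for independent, identically distributed indicators. The helper names are hypothetical and chosen for this illustration only.

```python
import math


def binomial_upper_tail(n, p, threshold):
    """Exact Pr[X > threshold] for X ~ Binomial(n, p)."""
    start = math.floor(threshold) + 1
    return sum(math.comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(start, n + 1))


def binomial_lower_tail(n, p, threshold):
    """Exact Pr[X < threshold] for X ~ Binomial(n, p)."""
    stop = math.ceil(threshold)
    return sum(math.comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(0, stop))


def chernoff_bounds(mu, delta):
    """Return the (lower-tail, upper-tail) bounds stated in Lemma 1."""
    lower = math.exp(-delta**2 * mu / 2)
    upper = math.exp(-delta**2 * mu / 3)
    return lower, upper
```

For instance, with `n = 200`, `p = 0.5` and `delta = 0.2`, both exact tails returned by the helpers lie well below the corresponding values from `chernoff_bounds(100, 0.2)`, which illustrates that the bounds are valid but not tight.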

###### Proposition 1 ([DR98], Proposition 7(2)).

Non-decreasing (or non-increasing) functions of disjoint subsets of negatively associated variables are also negatively associated.

Our lower bound proof makes use of the following Berry-Esseen inequality.

###### Theorem 4 (Berry-Esseen Inequality [Ber41, Ess42]).

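Let $X_i$, $i\in[M]$, be i.i.d. random variables with $E[X_i]=0$, $E[X_i^2]=\sigma^2>0$, and $E[|X_i|^3]=\rho<\infty$, and let $Y=\sum_{i=1}^{M}X_i$. Denote by $F$ the cumulative distribution function of $Y/(\sigma\sqrt{M})$ and by $\phi$ the cumulative distribution function of the standard normal distribution. Then

$$\sup_{s\in\mathbb{R}}\{|F(s)-\phi(s)|\}\le\frac{c\rho}{\sigma^3\sqrt{M}}\quad\text{for a constant } c.$$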
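The inequality can be illustrated numerically for centered Bernoulli variables, which is the case used in the lower bound proof. The sketch below computes the exact Kolmogorov distance between a normalized binomial sum and the standard normal. The theorem only asserts the existence of some constant $c$; the value `c=0.56` used here is an assumption of this sketch (a constant known from the literature to be admissible), and the function names are hypothetical.

```python
import math


def normal_cdf(x):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def kolmogorov_distance_binomial(M, p):
    """sup_s |F(s) - Phi(s)|, where F is the CDF of (S - M*p)/(sigma*sqrt(M))
    and S ~ Binomial(M, p), i.e., a sum of M centered Bernoulli(p) variables."""
    scale = math.sqrt(p * (1 - p) * M)
    cdf = 0.0
    worst = 0.0
    for k in range(M + 1):
        phi = normal_cdf((k - M * p) / scale)
        worst = max(worst, abs(cdf - phi))  # left limit of F at the jump
        cdf += math.comb(M, k) * p**k * (1 - p) ** (M - k)
        worst = max(worst, abs(cdf - phi))  # value of F at the jump
    return worst


def berry_esseen_bound(M, p, c=0.56):
    """c * rho / (sigma^3 * sqrt(M)) for centered Bernoulli(p) summands."""
    sigma = math.sqrt(p * (1 - p))
    rho = p * (1 - p) * ((1 - p) ** 2 + p**2)  # E|X_i|^3
    return c * rho / (sigma**3 * math.sqrt(M))
```

Because $F$ is a step function and the normal CDF is monotone, it suffices to check both sides of every jump, which is what the loop does.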

#### Symmetric Algorithm for m=n.

Our algorithm for the heavily loaded regime uses the algorithm of [LW16] for allocating balls into bins. We denote this algorithm by . Specifically, we use the following theorem.

###### Theorem 5.

[From [LW16]] There exists a symmetric algorithm for placing balls into bins with the following properties w.h.p.: The algorithm terminates after rounds with bin load at most . The total number of messages sent is , where in each round balls send and receive messages in expectation and many with high probability. Finally, in each round, bins send and receive messages in expectation and many with high probability.

## 3 The Parallel Symmetric Algorithm

In this section, we describe our symmetric algorithm for allocating balls into bins. We begin by describing the precise model in which the algorithm works.

#### The Model.

The system consists of $n$ bins and $m$ balls, and operates in the synchronous message passing model, where each round consists of the following steps.

1. Balls perform local computations and send messages to arbitrary bins.

2. Bins receive these messages, perform local computations and send messages to any balls they have been contacted by in this or earlier rounds.

3. Balls receive these messages and may commit to a bin (and terminate).

All algorithms may be randomized and have unbounded computational resources; however, we strive for using only very simple computations.

#### High-Level Description.

The algorithm consists of two phases. The first phase consists of $O(\log\log(m/n))$ rounds, at the end of which the number of unallocated balls is $O(n)$. The second phase consists of $O(\log^* n)$ rounds and completes the allocation by applying Theorem 5 [LW16].

For simplicity, we will assume that all values specified in the following are integers; as we aim for asymptotic bounds, rounding has no relevant impact on our results. In our algorithm, the threshold values of all bins are the same, but depend on the current round. In the first round, all bins set their threshold to $T_0=m/n-(m/n)^{2/3}$, each ball picks a single bin uniformly at random, and bins accept at most $T_0$ balls and reject the rest. Applying Chernoff's bound, we see that w.h.p. each bin is contacted by at least $T_0$ balls. Hence, each bin has exactly $T_0$ allocated balls after the first round. Accordingly, the number of unallocated balls after the first round is $m_1=m-nT_0=n(m/n)^{2/3}$. We continue the same way in the second round, handling an instance with $m_1$ balls and $n$ bins. It follows that the number of remaining balls after $i$ rounds is bounded by $n(m/n)^{(2/3)^i}$. When $m_i$ gets very close to $n$, i.e., $m_i/n$ is at most polylogarithmic in $n$, concentration is not sufficiently strong any more to guarantee that all bins receive the desired number of balls. However, one can show that w.h.p. this holds true for the vast majority of bins. Overall, we show that after $O(\log\log(m/n))$ rounds, $O(n)$ unallocated balls remain.

At this point, we employ the parallel algorithm of Lenzen and Wattenhofer [LW16], which takes $O(\log^* n)$ additional rounds. To this end, we let each bin act as $O(1)$ virtual bins. This way, at most $O(1)$ additional balls will be allocated to each bin, as the algorithm guarantees a maximum bin load of $2$. We next describe the algorithm and its analysis in detail.

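#### The Algorithm $A_{\mathrm{heavy}}$:

1. Set $\tilde m_0=m$ and $T_{-1}=0$.

2. For each round $i=0,1,\ldots$ of the first phase ($O(\log\log(m/n))$ rounds, cf. the high-level description) do:

    1. Each ball sends an allocation request to a uniformly sampled bin.

    2. Set $T_i=T_{i-1}+\tilde m_i/n-(\tilde m_i/n)^{2/3}$. Each bin accepts up to $T_i-\ell$ balls, where $\ell$ is the load of the bin at the beginning of the round.

    3. Set $\tilde m_{i+1}=n(\tilde m_i/n)^{2/3}$.

3. At this point at most $O(n)$ balls are unallocated (w.h.p.). Run the algorithm of Theorem 5 for the remaining balls, with each bin simulating $O(1)$ virtual bins.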
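The first phase (steps 1 and 2 above) can be sketched as a small simulation. This is an illustration under stated assumptions, not the authors' implementation: the threshold update follows the identity $T_i-T_{i-1}=\tilde m_i/n-(\tilde m_i/n)^{2/3}$ from the analysis, the loop stops once the estimate $\tilde m_i$ drops to `stop_ratio * n` (a hypothetical cutoff), thresholds are rounded down, and the second phase is omitted.

```python
import math
import random


def threshold_phase(m, n, stop_ratio=2.0, seed=0):
    """Simulate the threshold rounds for m balls and n bins.

    Returns (loads, unallocated, rounds, final_threshold).
    """
    rng = random.Random(seed)
    loads = [0] * n
    unallocated = m
    est = float(m)      # the bins' estimate of the number of remaining balls
    threshold = 0.0     # cumulative threshold, starting from T_{-1} = 0
    rounds = 0
    while est > stop_ratio * n and unallocated > 0:
        x = est / n
        threshold += x - x ** (2.0 / 3.0)   # T_i = T_{i-1} + x - x^(2/3)
        cap = math.floor(threshold)         # rounding down is an assumption
        # Balls are interchangeable, so counting requests per bin suffices.
        requests = [0] * n
        for _ in range(unallocated):
            requests[rng.randrange(n)] += 1
        for b in range(n):
            accepted = min(requests[b], max(cap - loads[b], 0))
            loads[b] += accepted
            unallocated -= accepted
        est = n * x ** (2.0 / 3.0)          # estimate for the next round
        rounds += 1
    return loads, unallocated, rounds, threshold
```

By construction no bin ever exceeds the current threshold, and the thresholds telescope to less than $m/n$, so the leftover balls are exactly those that still have to be handled by the second phase.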

###### Theorem 6.

Algorithm finishes after rounds with maximal load of , w.h.p., using in total messages (over all rounds). Each ball sends and receives messages in expectation and many w.h.p. Each bin sends and receives messages w.h.p.

###### Proof.

For any round of step (2), let be the number of unallocated balls at the beginning of the round, and notice that

is the bin’s estimate of

. Fix a round . Let be a random variable indicating the number of balls that choose bin in round (we suppress the round index for ease of notation) and set .

Observe that . Moreover, , as balls can be allocated by the end of round . We make frequent use of these observations in the following. We start by bounding the probability that a bin gets “underloaded” in a given round, i.e., despite the conservatively small chosen threshold, it does not receive sufficiently many requests to allocate balls in round .

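###### Claim 1.

For every bin $b$ and every round $i$ of step (2), $\Pr[X_b<T_i-T_{i-1}]\le e^{-(\tilde m_i/n)^{1/3}/2}$.

###### Proof.

For all $i$, it holds that

$$T_i-T_{i-1}=\frac{\tilde m_i}{n}-\frac{\tilde m_{i+1}}{n}=\frac{\tilde m_i}{n}-\left(\frac{\tilde m_i}{n}\right)^{2/3}.$$

As $m_i\ge\tilde m_i$, $E[X_b]=m_i/n\ge\tilde m_i/n$. Using a Chernoff bound with $\delta=(\tilde m_i/n)^{-1/3}$, we get that

$$\Pr[X_b<T_i-T_{i-1}]\le e^{-\delta^2\tilde m_i/(2n)}=e^{-(\tilde m_i/n)^{1/3}/2}.$$

∎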

Using this bound, we next show that each bin is allocated balls to match its threshold in each round, at least until only balls remain.

###### Claim 2.

Let be minimal with the property that for a sufficiently large constant . Then w.h.p.

###### Proof.

We apply Claim 1 to all bins and all . Using a union bound over all such events, the probability that in any such round for any bin is bounded by

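$$\sum_{i=0}^{i_0-1}n e^{-(\tilde m_i/n)^{1/3}/2}\in O\left(n\sum_{i=0}^{i_0-1}2^{-i}e^{-(\tilde m_{i_0-1}/n)^{1/3}/2}\right)\subseteq n e^{-\Omega(c\log n)}\subseteq n^{-\Omega(c)}.$$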

Thus, w.h.p. each bin has exactly balls allocated to it at the end of round . Therefore, w.h.p. ∎

It remains to consider the final iterations required to reduce to . As the number of balls is not large enough anymore to ensure sufficient concentration for individual bins, we consider the random variable counting the number of balls allocated to all bins together in round .

###### Claim 3.

Let be minimal with the property that . For each round and any , it holds that with probability at least , where .

###### Proof.

Denote by , , the indicator variables which are if bin receives fewer than allocation requests in round and else. By Claim 1 and linearity of expectation, we have for that

$$E[Z]\le e^{-(\tilde m_i/n)^{1/3}/2}\,n.$$

The random variables are negatively associated (according to Definition 2). To see this, observe that by [DR98, Theorem 13] we know that are negatively associated: the are monotone nonincreasing functions of disjoint subsets of the negatively associated variables (namely, is a function of the set ), so Proposition 1 applies. Therefore, we can apply a Chernoff bound (with ) to :

$$\Pr[Z>2E[Z]]\le e^{-E[Z]/3}.$$

If for a sufficiently large constant , this entails that w.h.p. Otherwise, we use a simple domination argument: each is replaced by an independent 0-1 variable that is with probability , so that for we have that

$$\Pr[Z>2E[Z']]\le\Pr[Z'>2E[Z']]\le e^{-c}.$$

Together, this entails that w.h.p. As (where ), we have that and . Hence for a suitable choice of . As decreases exponentially in , which itself decreases exponentially in , we also have that

$$2e^{-(\tilde m_i/n)^{1/3}/2}(T_i-T_{i-1})\,n<2e^{-(\tilde m_i/n)^{1/3}/2}\,\tilde m_i$$

if is sufficiently large. Noting that , the claim follows. ∎

###### Claim 4.

For any , with probability at least , where .

###### Proof.

The number of unallocated balls at the beginning of round is . By Claim 2, we have that w.h.p., i.e., for all w.h.p. For , by Claim 3 we have that w.h.p., where is the constant in the w.h.p. bound. Accordingly, by a union bound it holds that

$$m_{i_1}\le m-n\left(\sum_{i=0}^{i_1-1}(T_i-T_{i-1})+\sum_{i=i_0}^{i_1-1}f(c)\,2^{-(i_1-i)}\right)$$

with probability . As , w.h.p. for a suitable choice of . ∎

Thus, after iterations, at most balls remain unallocated w.h.p. We apply , where each of the bins simulates virtual bins. That is, any ball allocated in one of the virtual bins will be allocated in the real bin. Finally, by the properties of we have that each virtual bin will have at most balls and thus each real bin will add at most balls. Overall, the total load of any bin is .

#### Number of Messages.

We bound the number of messages sent by balls and bins. The number of messages sent in step 3 is specified in Theorem 5. Thus, we analyze the messages in step 2.

Each ball sends at most 1 message per round, thus a total of in round . Each round reduces the number of balls by at least a constant factor, cf. Claim 2 and Claim 3. Thus, the total number of messages sent is bounded by a geometric series, i.e., at most messages are sent w.h.p. Moreover, since all balls are identical we have that the expected number of message sent by a ball is . The probability that a single ball sends more than message is at most . Thus, with high probability, a ball sends at most messages. As all messages are sent to uniformly and independently random bins, a standard Chernoff bound yields that each bin receives messages w.h.p. ∎

#### A Note on Success Probability.

As described, Algorithm succeeds with high probability in . As may be a constant, this probability bound could be a constant as well. However, the case of can be covered by a trivial algorithm that deterministically guarantees a perfectly balanced allocation in rounds: balls try all bins one by one, in arbitrary order (which may be different for each ball). Bins use threshold in each round. If , we can apply this trivial algorithm within our round budget. Combining both algorithms, we achieve a success probability of for the entire parameter range.
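The trivial fallback algorithm can be sketched as follows. The sketch assumes that every bin uses the fixed threshold $\lceil m/n\rceil$ and that each ball tries the bins in a private random order; both the threshold value and the function name are assumptions made for illustration.

```python
import random


def trivial_allocation(m, n, seed=0):
    """Each ball tries all bins one by one in its own order; bins accept
    balls while their load is below ceil(m/n).

    Returns (loads, number of rounds used).
    """
    rng = random.Random(seed)
    cap = -(-m // n)  # ceil(m / n) in integer arithmetic
    orders = []
    for _ in range(m):
        order = list(range(n))
        rng.shuffle(order)  # arbitrary per-ball order of the bins
        orders.append(order)
    loads = [0] * n
    pending = list(range(m))
    rounds = 0
    while pending:
        rejected = []
        for ball in pending:
            b = orders[ball][rounds]
            if loads[b] < cap:
                loads[b] += 1
            else:
                rejected.append(ball)
        pending = rejected
        rounds += 1
    return loads, rounds
```

The loop never needs more than $n$ rounds: a ball rejected by all $n$ bins would imply that every bin is full, i.e., that at least $n\lceil m/n\rceil\ge m$ balls are already placed, which is impossible while that ball is still unallocated.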

## 4 Lower Bound for Threshold Algorithms

In this section, we present a lower bound for a special class of threshold algorithms. Roughly speaking, the only limitation that we pose here is that in each round unallocated balls pick the bins they contact independently and uniformly at random (as in our upper bound), and bins do not take decisions based on random choices of balls in future rounds.

This class is more general than our algorithm, as it allows bins to have different thresholds. The decision on these thresholds can be an arbitrary function of the system state at the beginning of the round (excluding future random choices of balls); this does not affect the lower bound result. Moreover, we allow for algorithms that “collect” allocation requests from balls for rounds before allocating them according to the chosen threshold. While this is not a good strategy for algorithms, it is useful for generalizing our lower bound to algorithms in which balls contact multiple bins in each round, as it allows for a straightforward simulation argument.

#### The Family of Uniform Threshold Algorithms.

The degree of an algorithm is the maximal number of bins that a ball contacts in a single phase. Formally, in this special threshold model a degree algorithm collecting for rounds works in phase as follows. Bins and balls have each an internal state . Decisions are a function of , which is updated after each operation, and (private) randomness. We remark, however, that the structure imposed by the algorithm actually entails that the state of a non-allocated ball is simply a function of its own randomness only, as it received no information beyond all its requests being rejected.

In contrast, bins may perform more complex internal operations. Denote by the load of bin at the beginning of phase , i.e., the number of balls it has sent accept messages to and which have not yet informed the bin that they are allocated to another bin.

1. Each bin determines its threshold for the current phase. The decision on these thresholds is oblivious to (i.e., stochastically independent from) the random choices of balls in this and future phases.

2. Based on its state, each ball chooses (at most) bins uniformly and independently at random to send allocation requests to. These requests are sent over rounds, i.e., at most per round.

3. Denote by the set of balls sending a request to bin in this phase. In the last round of the phase, bin responds with accept messages to a subset of of size . This set is chosen based on the bin's port numbers for the requesting balls and its internal randomness, subject to the constraint that each ball is accepted only once. (Footnote 3: For each bin, there is a bijection from to the balls. Requests from a ball are received on the respective port and responses are sent to the same port. Balls have a port numbering of the bins for the same purpose.)

4. Balls receive accept messages. They may decide on an accepting bin to be allocated to (provided they received at least one accept message so far) at the end of any phase (i.e., they do not need to commit immediately), where this phase is a function of the phase number in which they received the first accept message. (Footnote 4: This is not a good idea for algorithms, but we use it in our lower bound for a simulation argument.)

5. Balls that selected a bin inform all bins that sent accept messages to them about their decision at the end of the phase.

For technical reasons, we assume that the bins' port numbers are chosen adversarially, i.e., first the randomness of balls and bins is determined and then the port numbering is chosen. Algorithms must achieve their load guarantees despite this; note that our algorithms are capable of this.

The structure of this section is as follows. We first establish in Section 4.1 the lower bound for degree 1 algorithms, i.e., threshold algorithms in which each unallocated ball contacts one bin chosen independently and uniformly at random (our algorithm falls within this class). Then, in Section 4.2, we extend the argument to any degree algorithms for by providing a simulation result.

### 4.1 Lower Bound for Degree 1 Algorithms

Our lower bound shows that any algorithm in the threshold model, granted that balls choose bins uniformly at random, must use a large number of rounds.

###### Theorem 7.

Suppose balls each contact one of bins independently and uniformly at random, where for a sufficiently large constant . If bin accepts up to balls contacting it, where and does not depend on the balls’ randomness, with probability at least the number of balls that is not accepted is for .

###### Proof.

Denote by the expected number of messages received by bin . Fix a bin and denote by the random variable counting the number of messages it receives. Because each ball picks a bin uniformly and independently at random, we have that

$$X^{(i)}=\sum_{j=1}^{M}X_j,$$

where the are independent - variables attaining with probability (we omit for ease of notation). Our first goal is to provide a lower bound on the expected number of rejected balls. To do that, we first analyze a single bin and show the following:

###### Claim 5.

Any bin has load at least with probability .

###### Proof.

We apply the Berry-Esseen Inequality (see Theorem 4) to the random variables , . Thus, and , yielding that

$$\sup_{x\in\mathbb{R}}\{|F(x)-\phi(x)|\}\le\frac{c\,(1-2p(1-p))}{\sqrt{p(1-p)M}}\overset{p\le 1/2}{\le}\frac{c\,(1-p)}{\sqrt{p(1-p)M}}$$

in the terminology of the theorem, where , i.e., equals the deviation of the load of bin from its expectation. Thus, the theorem implies that for all , we have that

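$$\Pr\left[Y\ge x\sqrt{\frac{\mu}{2}}\right]\overset{p\le 1/2}{\ge}\Pr\left[Y\ge x\sqrt{(1-p)\mu}\right]=\Pr\left[Y\ge x\sigma\sqrt{M}\right]\ge 1-F(x)-\frac{c}{\sqrt{C}}.$$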

Choosing and using that is sufficiently large, it follows that

$$\Pr[X^{(i)}\ge\mu+2\sqrt{\mu}]\in\Omega(1).$$

∎

Thus, we have shown that any bin has load at least with probability , causing it to reject at least balls (provided that ).

###### Corollary 1.

At least balls are rejected in expectation for .

###### Proof.

By Claim 5, the expected number of rejected balls for bin is at least . Thus, by linearity of expectation the expected number of rejected balls is at least

$$p_0\sum_{i=1}^{n}\max\{\mu+2\sqrt{\mu}-L_i,0\}\ge p_0\left(M+2\sqrt{Mn}-\sum_{i=1}^{n}L_i\right)\ge p_0\sqrt{Mn},$$

where the final step exploits that with being sufficiently large. ∎

So far, we have shown that the expected number of rejected balls is sufficiently large. One of the major obstacles for providing a concentration result comes from the fact that the number of rejected balls might vary considerably between bins (e.g., due to different threshold values). To overcome this, our proof strategy is based on finding a sufficiently “heavy” subset of bins that have roughly the same number of rejected balls in expectation.

Towards that goal, for every bin , we look at the value and restrict attention to all bins satisfying that . These bins are now divided into classes where, for , bin iff . Let be the class of all bins with .

The selection of the class of bins for which we will show concentration is done in two steps. First, we find at most (plus 1) particular classes that together capture at least half of the expected number of rejected balls. Once we do that, we focus on the heaviest class among these classes, hence losing only a factor of in our bounds. Concretely, denoting by the largest value of such that , the following holds.

###### Claim 6.

Let . Then the expected number of rejected balls by bins is at least . In addition, .

###### Proof.

First, suppose that . Observe that the total contribution of all bins is at most , since . By the prerequisite that for a sufficiently large constant , we may assume that and get that . As by Corollary 1 at least balls are rejected in expectation, the classes capture at least half of this expectation.

Second, consider the case that . We claim that this entails that , as would yield for all that

$$\mu + 2\sqrt{\mu} - L_i \le \mu + 2\sqrt{\mu} = \frac{M}{n} + 2\sqrt{\frac{M}{n}} \le \frac{2M}{n} \le 2^{t},$$

implying that . Therefore, indeed and hence . It follows that

 ∑i∈I∗Si+∑k

Using the same expression for the expected number of rejected balls as in the proof of Corollary 1, we get that

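$$p_0 \sum_{k=k_{\min}}^{k_{\max}} \sum_{i \in I_k} S_i \ge \frac{p_0}{2} \left( \sum_{i \in I^*} S_i + \sum_{k \in \mathbb{Z}_0} \sum_{i \in I_k} S_i \right) = \frac{p_0}{2} \sum_{i=1}^{n} \max\{\mu + 2\sqrt{\mu} - L_i, 0\} \ge \frac{p_0 \sqrt{Mn}}{2}$$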

balls are rejected in expectation by bins in classes . As in the first case and in the second case , this completes the proof. ∎

By the pigeonhole principle and Claim 6, there must be a class satisfying that

$$p_0 \sum_{i \in I_k} S_i \ge \frac{p_0 \sqrt{Mn}}{2(t+1)}.$$
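The pigeonhole step can be illustrated by bucketing the per-bin values by powers of two and picking the class with the largest total. This is a sketch under an assumption: classes are taken to be $I_k = \{i : 2^{k-1} \le S_i < 2^k\}$, which is consistent with the factor $2^{k-1}$ used later but is not confirmed by the surviving text. The helper name `heaviest_class` is hypothetical.

```python
import math


def heaviest_class(S):
    """Group indices i by the power-of-two range containing S[i].

    Assumed class definition: class k holds all i with
    2**(k-1) <= S[i] < 2**k. Values below 1 are ignored.
    Returns (k, members) for the class with the largest total, or None.
    """
    classes = {}
    for i, s in enumerate(S):
        if s < 1:
            continue
        # frexp gives s = m * 2**e with 0.5 <= m < 1, i.e. 2**(e-1) <= s < 2**e.
        k = math.frexp(s)[1]
        classes.setdefault(k, []).append(i)
    if not classes:
        return None
    return max(classes.items(), key=lambda kv: sum(S[i] for i in kv[1]))


if __name__ == "__main__":
    print(heaviest_class([0.5, 1, 3, 3, 8]))
```

By the pigeonhole principle, the returned class carries at least a $1/c$ fraction of the total weight of the bucketed values, where $c$ is the number of non-empty classes.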

Denote by , , the indicator variables that are iff . By [DR98, Theorem 13] and Proposition 1 these variables are negatively associated. Setting , we have that , and by Chernoff’s bound (Lemma 1), it follows that

 Pr[Z

If , then we have that with probability , the number of rejected balls is at least

$$2^{k-1} p_0 |I_k| \ge \frac{p_0}{4} \sum_{i \in I_k} S_i \in \Omega\left(\frac{\sqrt{Mn}}{t}\right).$$

It remains to consider the case that . Because up to factor all bins in have the same value, it holds for each that

$$S_i = \mu + 2\sqrt{\mu} - L_i \in \Omega\left(\frac{\sqrt{Mn}}{t \cdot |I_k|}\right). \tag{1}$$

Let . By Inequality (1) and because for sufficiently large ,

$$L_i \le \mu + 2\sqrt{\mu} - 3\alpha \le \mu - \alpha.$$

As , this bound also implies that . As $X^{(i)}$ is the sum of independent 0-1 variables, we can thus apply Chernoff’s bound to $X^{(i)}$ to see that for sufficiently large ,

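$$\begin{aligned}
\Pr\left[X^{(i)} - L_i < \alpha/2\right] &\le \Pr\left[X^{(i)} \le \mu - \alpha/2\right] \\
&\le \Pr\left[X^{(i)} \le \mu\left(1 - \alpha/(2\mu)\right)\right] \\
&\in e^{-\Omega\left(n^2/(t^2 \cdot |I_k|^2)\right)} \in e^{-\Omega\left((n/t)^{2/3}\right)},
\end{aligned}$$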

where in the final step we use that . By a union bound over all bins in , we get that with probability , the number of rejected balls from this class is at least . ∎

### 4.2 Simulation for Higher Degree

In this subsection, we show that any algorithm with a higher degree (i.e., balls can contact more than one bin in a single round) can be simulated by an algorithm with degree 1 at the expense of more rounds. To this end, we simply increase the length of phases by factor . We then proceed to show that a degree 1 algorithm with phase length can be improved on by reducing the phase length. We then can apply Theorem 7 to the resulting degree 1 algorithm of phase length 1 to prove Theorem 2.

###### Lemma 2.

Let be a uniform threshold algorithm of degree that runs in rounds. Then there is a uniform threshold algorithm with degree 1 that achieves the same maximal load within rounds.

###### Proof.

simulates . It simply increases phase length by a factor of and lets the balls send their messages spread out over more rounds. This reduces the degree to . At the end of each phase, the bins can compute the internal state they would have in and act accordingly. Thus, bin loads will be identical to those in . ∎
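The idea behind Lemma 2 can be sketched from a single ball's point of view: the up to $d$ requests it would send in one phase are spread over $d$ consecutive rounds, one per round, and regrouping the rounds recovers exactly the original phases. The function names `stretch` and `regroup` and the use of `None` for an idle round are illustrative assumptions, not the paper's notation.

```python
def stretch(phases, d):
    """Spread each phase's requests (at most d) over d single-message rounds.

    phases is a list of lists of target bins, one inner list per phase.
    Idle rounds are marked with None so every phase occupies exactly d rounds.
    """
    rounds = []
    for targets in phases:
        if len(targets) > d:
            raise ValueError("a phase may contain at most d requests")
        rounds.extend(list(targets) + [None] * (d - len(targets)))
    return rounds


def regroup(rounds, d):
    """Recover the per-phase request lists from the stretched schedule."""
    return [
        [t for t in rounds[start:start + d] if t is not None]
        for start in range(0, len(rounds), d)
    ]


if __name__ == "__main__":
    phases = [[3, 7], [1], [4, 4]]
    schedule = stretch(phases, d=2)
    print(schedule)
    print(regroup(schedule, d=2) == phases)
```

Because a bin sees the same requests at the end of each stretched phase, it can reproduce the state it would have had in the original algorithm, at the cost of a factor $d$ in the number of rounds.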

###### Lemma 3.

There is a uniform threshold algorithm of degree and phase length achieving the same guarantees on bin loads in the same number of rounds.

###### Proof.

Assume that has phase length . We simulate by algorithm of phase length . Balls and bins keep maintaining a state according to , following these rules:

• If a ball receives its first accept message in round of , it determines the phase of this round belongs to. Then it determines the phase of in which it would inform bins about its decision. It will do so in in round (i.e., the same round this would happen in ).

• For each , at the beginning of round each bin computes the threshold it would use in in phase based on the state for it maintains. This threshold is used in phases of . The subset of balls it accepts in a given phase of is chosen arbitrarily.

• To update the internal state a bin maintains for from phase to phase , at the end of round it performs the following operation. Let be the set of ports it received requests on. It determines the subset of ports it would have responded to with accept messages in when receiving the requests it got in rounds . Let be the set of ports it sent accept messages to in rounds of . The bin now “rearranges” its port numbering by permuting such that is mapped to . Finally, it updates its state for in accordance with the modified port numbering and the requests received during rounds .

We claim that the third step maintains the invariant that the simulation is consistent with an execution of at the bin for the port numbering it computes. This holds true, because no bin ever sends two accept messages to the same ball, implying that the modification to the port numbering never conflicts with earlier such changes made. Thanks to this observation, a straightforward induction now establishes that simulates an execution of for the port numberings the bins have determined by the end of the simulation. Accordingly, achieves the same load distribution as with the modified port numbers.

Note that the choice of port numbers does not affect the guarantees on the load distribution makes, as we assumed an adversarial choice of bins’ port numbers. Thus, the claim follows. ∎

We are now ready to complete the lower bound proof.

###### Proof of Theorem 2.

First, we show that the claim holds for degree 1 algorithms with phase length 1 by repeatedly applying Theorem 7. The induction hypothesis is that after round $i$, at least $M_i = (m/n)^{3^{-i}} n^{1-3^{-i}}$ balls remain with probability $1 - i e^{-\Omega(n^{1/2})}$. By the induction hypothesis, we have that

$$\min\{\log n, \log(M_i/n)\} \le \log(M_i/n) \le \log\left((m/n^2)^{3^{-i}}\right) \in O\left((m/n)^{3^{-(i+1)}/2}\right).$$

As the total capacity of all bins is by assumption, the theorem (which applies because the bins’ thresholds are independent of the balls’ random choices of which bins to contact) and the induction hypothesis imply that, with probability

$$\left(1 - i e^{-\Omega(n^{1/2})}\right)\left(1 - e^{-\Omega((n/\log n)^{2/3})}\right) \ge 1 - (i+1) e^{-\Omega(n^{1/2})},$$

we have that

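$$M_{i+1} \in \Omega\left(\frac{\sqrt{M_i n}}{\min\{\log n, \log(M_i/n)\}}\right) \subseteq \Omega\left(\left(\frac{(m/n)^{3^{-i}}\, n^{2-3^{-i}}}{(m/n)^{3^{-(i+1)}}}\right)^{1/2}\right) \subseteq (m/n)^{3^{-(i+1)}}\, n^{1-3^{-(i+1)}},$$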

as claimed.

Note that in the induction step we applied Theorem 7, which necessitates that , which holds for sufficiently small . To ensure that the probability bound is sufficiently strong for a w.h.p. result, we need, e.g., that . Both are ensured by the assumptions of the theorem. Finally, by applying Lemma 2 and Lemma 3, we can extend the result to degree algorithms for any and arbitrary phase length . ∎
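The doubly logarithmic growth of the round count can be seen by iterating the recurrence from the induction step numerically. The sketch below is a simplification under stated assumptions: all hidden constants are set to 1, logarithms are base 2, and iteration stops once at most $Cn$ balls remain, with $C = 4$ chosen arbitrarily. It tracks the ratio $M_i/n$, for which the recurrence reads $M_{i+1}/n = \sqrt{M_i/n}\,/\min\{\log n, \log(M_i/n)\}$. The name `rounds_until_light` is hypothetical.

```python
import math


def rounds_until_light(m, n, C=4):
    """Iterate M_{i+1} = sqrt(M_i * n) / min(log n, log(M_i / n)).

    Works with the ratio r_i = M_i / n. Constants hidden in the
    Omega-notation are set to 1 (an assumption for illustration).
    Returns the number of iterations until r_i <= C.
    """
    ratio = m / n
    rounds = 0
    while ratio > C:
        t = min(math.log2(n), math.log2(ratio))
        ratio = math.sqrt(ratio) / t
        rounds += 1
    return rounds


if __name__ == "__main__":
    n = 2 ** 10
    for exponent in (10, 20, 40, 80):
        print(exponent, rounds_until_light(n * 2 ** exponent, n))
```

Each iteration roughly halves $\log(M_i/n)$, so squaring $m/n$ adds only about one more iteration, in line with a $\log\log(m/n)$ term in the lower bound.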

## 5 An Asymmetric Algorithm

In this section, we prove Theorem 3 by providing an asymmetric algorithm that achieves a maximal load of $m/n + O(1)$, w.h.p., within a constant number of rounds. In this algorithm, each bin receives messages in total. If , we apply a single round of the symmetric algorithm from Section 3 first to reduce the number of remaining balls to , so that each bin receives messages in the first round and messages in the subsequent application of the asymmetric algorithm.

Similarly to before, each active ball sends a single request in each round. The key idea of the algorithm is to operate on simulated “superbins.” Each superbin is controlled by a leader, where we make sure that the expected number of messages received by each superbin leader is roughly in each round (unless is very small). Denote by a value that is large enough so that the deviation from the expected number of messages a superbin receives is at most w.h.p. Then we can be sure that each superbin receives messages w.h.p., and its leader allocates the respective balls to its bins round-robin.
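The round-robin step performed by a superbin leader can be sketched as follows. This is only an illustration of the allocation rule, not of the full algorithm: the function name `allocate_round_robin` and the representation of balls and bins as plain identifiers are assumptions.

```python
def allocate_round_robin(balls, bins):
    """Assign the balls received by a superbin leader to its bins in turn.

    Returns a dict mapping each bin to the list of balls it receives.
    Bin loads differ by at most one, and are equal whenever the number
    of balls is a multiple of the number of bins.
    """
    if not bins:
        raise ValueError("a superbin must control at least one bin")
    assignment = {b: [] for b in bins}
    for index, ball in enumerate(balls):
        assignment[bins[index % len(bins)]].append(ball)
    return assignment


if __name__ == "__main__":
    result = allocate_round_robin(balls=list(range(10)), bins=["a", "b", "c"])
    print({b: len(assigned) for b, assigned in result.items()})
```

If the leader caps the number of balls it accepts at a multiple of the number of bins it controls, every bin in the superbin ends the round with exactly the same load.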

As a result, the algorithm w.h.p. allocates exactly the same number of balls to each bin, and it is straightforward to show that this process allocates all but balls in a constant number of rounds. It then completes by invoking an asymmetric algorithm for allocating balls with constant load in constant time, where each bin simulates virtual bins.

Concretely, the algorithm operates as follows.

1. Set