Unconstrained Submodular Maximization with Constant Adaptive Complexity

In this paper, we consider the unconstrained submodular maximization problem. We propose the first algorithm for this problem that achieves a tight (1/2-ε)-approximation guarantee using Õ(ε^-1) adaptive rounds and a linear number of function evaluations. No previously known algorithm for this problem achieves an approximation ratio better than 1/3 using less than Ω(n) rounds of adaptivity, where n is the size of the ground set. Moreover, our algorithm easily extends to the maximization of a non-negative continuous DR-submodular function subject to a box constraint and achieves a tight (1/2-ε)-approximation guarantee for this problem while keeping the same adaptive and query complexities.


1 Introduction

Faced with the massive data sets ubiquitous in many modern machine learning and data mining applications, there has been tremendous interest in developing parallel and scalable optimization algorithms. At the heart of designing such algorithms there is an inherent trade-off between the number of adaptive sequential rounds of parallel computations (also known as adaptive complexity), the total number of objective function evaluations (also known as query complexity) and the resulting solution quality.

In the context of submodular maximization, the above trade-off has recently received growing interest. We say that a set function f: 2^\cN → ℝ on a finite ground set \cN of size n is submodular if it satisfies

 f(S∪{e})−f(S)≥f(T∪{e})−f(T) for every S⊆T⊆\cN and e∈\cN∖T.

We also say that such a function is monotone if it satisfies f(S)≤f(T) for every two sets S⊆T⊆\cN. The definition of submodularity intuitively captures diminishing returns, which allows submodular functions to faithfully model diversity, cooperative costs and information gain, making them increasingly important in various machine learning and artificial intelligence applications [17]. Examples include viral marketing [33], data summarization [36, 47]

, neural network interpretation [18, 30, 31], sensor placement [34], dictionary learning [16], compressed sensing [19] and fMRI parcellation [45], to name a few. At the same time, submodular functions also enjoy tractability, as they can be minimized exactly and maximized approximately in polynomial time. In fact, there has been a surge of novel algorithms to solve submodular maximization problems at scale under various models of computation, including centralized [11, 13, 28, 43], streaming [2, 12, 14, 35], distributed [5, 6, 37, 39] and decentralized [41] frameworks. While the aforementioned works aim to obtain tight approximation guarantees, and some other works strove to achieve this goal with a minimal number of function evaluations [1, 10, 25, 26, 38], until recently almost all works on submodular maximization ignored one important aspect of optimization, namely, the adaptive complexity. More formally, the adaptive complexity of a submodular maximization procedure is the minimum number of sequential rounds required for implementing it, where in each round polynomially-many independent function evaluations can be executed in parallel [4]. All the previously mentioned works may require Ω(n) adaptive rounds in the worst case.
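To make the diminishing-returns inequality concrete, here is a small self-contained check on a toy coverage function; coverage functions are a classical source of submodular objectives, and the specific areas below are invented for the example:

```python
import itertools

# Toy coverage function f(S) = |union of the areas indexed by S|.
# The areas are made up for this illustration.
AREAS = {0: {1, 2, 3}, 1: {3, 4}, 2: {4, 5, 6}, 3: {1, 6}}

def f(S):
    return len(set().union(*(AREAS[i] for i in S))) if S else 0

def subsets(ground):
    for r in range(len(ground) + 1):
        yield from map(set, itertools.combinations(ground, r))

ground = set(AREAS)
# Check f(S ∪ {e}) − f(S) ≥ f(T ∪ {e}) − f(T) for all S ⊆ T ⊆ N, e ∉ T.
for T in subsets(ground):
    for S in subsets(T):
        for e in ground - T:
            assert f(S | {e}) - f(S) >= f(T | {e}) - f(T)
```

Since the ground set has only four elements, the exhaustive check above is instantaneous; it would of course be infeasible at the scales the paper targets.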

A year ago, Balkanski and Singer [4] showed, rather surprisingly, that one can achieve an approximation ratio of 1/3 for maximizing a non-negative monotone submodular function subject to a cardinality constraint using O(log n) adaptive rounds. They also proved that no constant factor approximation guarantee can be obtained for this problem in o(log n/log log n) adaptive rounds. The approximation guarantee of [4] was very quickly improved in several independent works [3, 20, 23] to 1−1/e−ε (using O(ε−2·log n) adaptive rounds), which almost matches an impossibility result by [42] showing that no polynomial time algorithm can achieve (1−1/e+ε)-approximation for the problem, regardless of the amount of adaptivity it uses. It should be noted also that [23] manages to achieve the above parameters while keeping the query complexity linear in n. An even more recent line of work studies algorithms with low adaptivity for more general submodular maximization problems, including problems with non-monotone objective functions and/or constraints beyond the cardinality constraint [15, 21, 22]. Since all these results achieve constant approximation for problems generalizing the maximization of a monotone submodular function subject to a cardinality constraint, they all inherit the impossibility result of [4], and thus, use at least Ω(log n/log log n) adaptive rounds.

In this paper, we study the Unconstrained Submodular Maximization (USM) problem, which asks to find an arbitrary set S⊆\cN maximizing a given non-negative submodular function f. This problem was studied by a long list of works [9, 11, 24, 27, 29], culminating with a linear time (1/2)-approximation algorithm [11], which was proved to be the best possible approximation for the problem by [24]. Since it does not impose any constraints on the solution, USM does not inherit the impossibility result of [4]. In fact, it is known that one can get approximation ratios of 1/4 and 1/3 for this problem using O(1) and Ω(n) adaptive rounds, respectively [24]. The results of [24] leave open the question of whether one can get an optimal approximation for USM while keeping the number of adaptive rounds independent of n. In this paper we answer this question in the affirmative. Specifically, we prove the following theorem, where the Õ notation hides a polylogarithmic dependence on ε−1.

Theorem 1.1.

For every constant ε > 0, there is an algorithm that achieves (1/2−ε)-approximation for USM using Õ(ε−1) adaptive rounds and a query complexity which is linear in n.

To better understand our result, one should consider the way in which the algorithm is allowed to access the objective function f. The most natural way to allow such access is via an oracle that, given a set S⊆\cN, returns f(S). Such an oracle is called a value oracle for f. A more powerful way to allow the algorithm access to f is through an oracle known as a value oracle for the multilinear extension of f. The multilinear extension of the set function f: 2^\cN → ℝ is a function F: [0,1]^\cN → ℝ defined as

 F(x) = \bE[f(R(x))]

for every vector x ∈ [0,1]^\cN, where R(x) is a random set that includes every element u ∈ \cN with probability x_u, independently. A value oracle for F is an oracle that, given a vector x ∈ [0,1]^\cN, returns F(x).

In Section 3 we describe and analyze an algorithm which satisfies all the requirements of Theorem 1.1 and assumes value oracle access to F. Since the multilinear extension can be approximated arbitrarily well using value oracle access to f via sampling (see, e.g., [13]), it is standard practice to convert algorithms that assume value oracle access to F into algorithms that assume such access to f. However, a straightforward conversion of this kind usually increases the query complexity of the algorithm by a factor polynomial in n, which is unacceptable in many applications. Thus, we describe and analyze in Appendix A an alternative algorithm which satisfies all the requirements of Theorem 1.1 and assumes value oracle access to f. While this algorithm is not directly related to the algorithm from Section 3, the two algorithms are based on the same ideas, and thus, we chose to place only the simpler of the two in the main part of the paper.
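A minimal sketch of such a sampling estimator for F (the estimator name and the toy objective are illustrative, not the paper's):

```python
import random

def estimate_F(f, x, samples=2000, seed=0):
    """Estimate the multilinear extension F(x) = E[f(R(x))] by sampling
    R(x), a random set containing each element u independently with
    probability x[u]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        R = {u for u, p in x.items() if rng.random() < p}
        total += f(R)
    return total / samples

# Toy submodular objective: f(S) = min(|S|, 2).
f = lambda S: min(len(S), 2)
x = {u: 0.5 for u in range(4)}
# For this f and x the exact value is 26/16 = 1.625; the estimate
# concentrates around it as the number of samples grows.
est = estimate_F(f, x)
```

Each of the `samples` evaluations of f is independent, which is why this conversion costs queries but not adaptivity; the multiplicative blow-up in query complexity is exactly what motivates the direct f-oracle algorithm of Appendix A.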

Before concluding this section, we would like to mention that the notion of diminishing returns can be extended to continuous domains as follows. A differentiable function G: X → ℝ, defined over a compact set X ⊆ ℝⁿ, is called DR-submodular [8] if for all vectors x, y ∈ X such that x ≤ y we have ∇G(x) ≥ ∇G(y), where the inequalities are interpreted coordinate-wise. A canonical example of a DR-submodular function is the multilinear extension of a submodular set function. It has been recently shown that non-negative DR-submodular functions can be (approximately) maximized over convex bodies using first-order methods [8, 32, 40]. Moreover, inspired by the double greedy algorithm of [11], it was shown that one can achieve a tight (1/2)-approximation guarantee for the maximization of such functions subject to a box constraint [7, 44]. The algorithm we describe in Section 3 can be easily extended to maximize arbitrary non-negative DR-submodular functions subject to a box constraint, as long as it is possible to evaluate both the objective function and its derivatives. The extended algorithm still achieves a tight (1/2−ε)-approximation guarantee, while keeping its original adaptive and query complexities. The details of the extension are given in Appendix B.
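For a concrete instance of this continuous notion, the following sketch checks the DR property on a quadratic G(x) = c·x − xᵀAx/2 with an entrywise non-negative matrix A, a standard family of DR-submodular functions; the specific numbers are arbitrary illustrations:

```python
# Quadratic G(x) = c·x − x^T A x / 2 with entrywise non-negative A.
# Then ∂G/∂x_i = c_i − Σ_j A_ij x_j, which can only decrease when any
# coordinate of x increases; this is exactly ∇G(x) ≥ ∇G(y) for x ≤ y.
A = [[0.5, 0.2, 0.1], [0.2, 0.4, 0.3], [0.1, 0.3, 0.6]]
c = [1.0, 0.8, 0.9]

def grad_G(x):
    return [ci - sum(aij * xj for aij, xj in zip(row, x))
            for ci, row in zip(c, A)]

x = [0.1, 0.2, 0.3]
y = [0.4, 0.2, 0.5]  # y ≥ x coordinate-wise
assert all(gx >= gy for gx, gy in zip(grad_G(x), grad_G(y)))
```

The check succeeds because ∇G(x) − ∇G(y) = A(y − x), which is coordinate-wise non-negative whenever A has non-negative entries and y ≥ x.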

1.1 Our Technique

All the known algorithms for maximizing a non-negative monotone submodular function subject to a cardinality constraint that use few adaptive rounds update their solutions in iterations. A typical such algorithm decides which elements to add to the solution in a given iteration by considering the set of elements with (roughly) the largest marginal, and then adding as many such elements as possible, as long as the improvement in the value of the solution is roughly linear in the number of added elements. This yields a bound on the number of iterations (and thus, adaptive rounds) through the following logic.

• The increase stops being roughly linear only when the marginal of a constant fraction of the elements considered has decreased significantly. Thus, the set of elements with the maximum marginal shrinks at an exponential rate, and after a logarithmic number of iterations no such element remains, which means that the maximum marginal itself decreases.

• After the maximum marginal decreases a few times, it becomes small enough that one can argue that there is no need to add additional elements to the solution.

A similar idea can be used to decrease the number of adaptive rounds used by standard algorithms for USM, such as the algorithm of [11]. However, this results in an algorithm whose adaptive complexity is still poly-logarithmic in n. Moreover, both parts of the logic presented above are responsible for this. First, the maximum marginal is only guaranteed to decrease after a logarithmic number of iterations. Second, the maximum marginal has to decrease all the way from its initial value down to a threshold that is negligible compared to f(OPT), where OPT is an arbitrary optimal solution, which requires a logarithmic number of decreases even when every decrease is by a constant factor.

Getting an adaptive complexity which is independent of n requires us to modify the above framework in two ways. The first modification is that rather than using the maximum marginal to measure the “advancement” we have made so far, we use an alternative potential function which is closely related to the gain one can expect from a single element in the next iteration. Since each update adds elements until the gain stops being linear in the number of elements added, we are guaranteed that the gain per element decreases significantly after every iteration, and so does the potential function.

Unfortunately, the potential function might originally be quite large, and the algorithm has to decrease it essentially all the way to zero, which means that the above modification alone cannot make the adaptive complexity independent of n. Thus, we also need a second modification, which is a pre-processing step designed to decrease the potential to O(f(OPT)) in a single iteration. The pre-processing is based on the observation that as long as the gain that can be obtained from a random element is large enough, this gain overwhelms any loss that can be incurred due to this element. Thus, one can evolve the solution in a random way until the potential exceeds its target value by only a constant factor.

2 Preliminaries

Given a set S⊆\cN, we denote by 1_S the characteristic vector of S, i.e., a vector that contains 1 in the coordinates corresponding to elements of S and 0 in the remaining coordinates. Additionally, given two vectors x, y ∈ [0,1]^\cN, we denote by x∨y and x∧y their coordinate-wise maximum and minimum, respectively. Similarly, we write x ≤ y and x ≥ y when these inequalities hold coordinate-wise.

Given an element u ∈ \cN and a vector x ∈ [0,1]^\cN, we denote by ∂uF(x) the partial derivative of the multilinear extension F with respect to the u-coordinate of x. One can note that, due to the multilinearity of F, ∂uF(x) obeys the equality

 ∂uF(x)=F(x∨1{u})−F(x∧1\cN∖{u}).

One consequence of this equality is that an algorithm with value oracle access to F also has access to F’s derivatives. As usual, we denote by ∇F(x) the gradient of F at the point x, i.e., ∇F(x) is a vector whose u-coordinate is ∂uF(x).
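For tiny ground sets the multilinear extension can be computed exactly by enumeration, which makes it easy to sanity-check the identity above; the helper names below are illustrative:

```python
import itertools

def exact_F(f, x):
    """Exact multilinear extension by enumerating all subsets; feasible
    only for tiny ground sets, but convenient for sanity checks."""
    N = list(x)
    total = 0.0
    for r in range(len(N) + 1):
        for S in itertools.combinations(N, r):
            p = 1.0
            for u in N:
                p *= x[u] if u in S else 1 - x[u]
            total += p * f(set(S))
    return total

def partial_u(f, x, u):
    """∂uF(x) via the identity ∂uF(x) = F(x ∨ 1_{u}) − F(x ∧ 1_{N∖{u}})."""
    hi = dict(x); hi[u] = 1.0  # x ∨ 1_{u}
    lo = dict(x); lo[u] = 0.0  # x ∧ 1_{N∖{u}}
    return exact_F(f, hi) - exact_F(f, lo)

f = lambda S: min(len(S), 2)  # toy submodular function
x = {0: 0.2, 1: 0.5, 2: 0.8}
# F is affine in each coordinate (multilinearity), so a finite difference
# along coordinate 0 must agree exactly with the identity.
h = 0.1
x_shift = dict(x); x_shift[0] = x[0] + h
fd = (exact_F(f, x_shift) - exact_F(f, x)) / h
assert abs(fd - partial_u(f, x, 0)) < 1e-9
```

The finite difference is exact (not just approximate) here precisely because F is affine in every single coordinate.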

The following is a well-known property that we often use in our proofs.

Observation 2.1.

Given the multilinear extension F of a submodular function f: 2^\cN → ℝ and two vectors x, y ∈ [0,1]^\cN obeying x ≤ y, it holds that ∇F(x) ≥ ∇F(y).

Proof.

Let λ be a vector of |\cN| independent uniformly random values from [0,1]. For the sake of the proof it is useful to assume that R(x) = {u ∈ \cN | λ_u ≤ x_u} and R(y) = {u ∈ \cN | λ_u ≤ y_u}. Notice that this assumption does not change the distributions of R(x) and R(y), and thus, we still have F(x) = \bE[f(R(x))] and F(y) = \bE[f(R(y))]. Furthermore, the assumption yields R(x) ⊆ R(y), which implies (by the submodularity of f) that for every element u ∈ \cN we have

 ∂uF(x)= F(x∨1{u})−F(x∧1\cN∖{u})=\bE[f(R(x)∪{u})−f(R(x)∖{u})] ≥ \bE[f(R(y)∪{u})−f(R(y)∖{u})]=F(y∨1{u})−F(y∧1\cN∖{u})=∂uF(y).\qed

3 Algorithm

Consider Theorem 3.1, and observe that Theorem 1.1 follows from it when we allow the algorithm value oracle access to the multilinear extension of the objective function. In this section, we prove Theorem 3.1 by describing and analyzing an algorithm that obeys all the properties guaranteed by this theorem.

Theorem 3.1.

For every constant ε > 0, there is an algorithm that assumes value oracle access to the multilinear extension F of the objective function and achieves (1/2−ε)-approximation for USM using O(ε−1) adaptive rounds and O(nε−2 log ε−1) value oracle queries to F.

Before presenting the promised algorithm, let us quickly recall the main structure of one of the algorithms used by [11] to get an optimal (1/2)-approximation for USM. This algorithm maintains two vectors x, y ∈ [0,1]^\cN whose original values are 1_∅ and 1_\cN, respectively. To update these vectors, the algorithm considers the elements of \cN one after the other in some arbitrary order. When considering an element u, the algorithm finds an appropriate value r_u ∈ [0,1], increases x_u from 0 to r_u and decreases y_u from 1 to r_u. One can observe that this update rule guarantees two things. First, that x ≤ y throughout the execution of the algorithm, and second, that both x and y become equal to the vector r (i.e., the vector whose u-coordinate is r_u for every u ∈ \cN) when the algorithm terminates.

The analysis of the algorithm of [11] depends on the particular choice of r_u used. Specifically, Buchbinder et al. [11] showed that their choice of r_u guarantees that the change in F(x)+F(y) following every individual update of x and y is at least twice the change in F(OPT(x,y)) following this update, where OPT(x,y) = (1OPT∨x)∧y. Since x and y start as 1_∅ and 1_\cN, respectively, and end up both as r, this yields

 2F(r)−[f(∅)+f(\cN)]≥2[F(OPT(1∅,1\cN))−F(OPT(r,r))]=2[f(OPT)−F(r)],

which implies the (1/2)-approximation ratio of the algorithm by rearrangement and the non-negativity of f.
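As a point of reference, here is a minimal sketch of the randomized double greedy of [11] in its discrete form, applied to a toy cut function; the graph and helper names are invented for the example, and the fractional variant described above maintains vectors x and y instead of the sets X and Y:

```python
import random

def cut(S, edges):
    """Cut function of an undirected graph: the number of edges with
    exactly one endpoint in S. Non-negative and submodular, but not
    monotone, so it is a natural USM instance."""
    return sum((u in S) != (v in S) for u, v in edges)

def double_greedy(f, ground, rng):
    """Randomized double greedy of [11]: maintain X ⊆ Y, starting from
    X = ∅ and Y = ground, and settle each element in turn; the final
    set is a (1/2)-approximation in expectation."""
    X, Y = set(), set(ground)
    for u in ground:
        a = max(f(X | {u}) - f(X), 0.0)  # gain of adding u to X
        b = max(f(Y - {u}) - f(Y), 0.0)  # gain of removing u from Y
        if a + b == 0 or rng.random() < a / (a + b):
            X.add(u)
        else:
            Y.remove(u)
    return X  # at this point X == Y

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
S = double_greedy(lambda T: cut(T, edges), [0, 1, 2, 3], random.Random(42))
```

Each element is settled with probability proportional to its (clipped) gain a or b, which is exactly the source of the factor-two relation between the gain of the algorithm and the loss of OPT(x,y) discussed above.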

The algorithm that we present in this section is similar to the algorithm of [11] in the sense that it also maintains two vectors x, y ∈ [0,1]^\cN and updates them in a way that guarantees two things. First, that the inequality x ≤ y holds throughout the execution of the algorithm, and second, that the change in F(x)+F(y) following every individual update of x and y is at least (roughly) twice the change in F(OPT(x,y)) following this update. More formally, the properties of our update procedure, which we term Update, are described by the following proposition. In this proposition, and throughout the section, we assume that n is larger than some fixed constant and that ε is smaller than some fixed constant.111If n is constant, then USM can be solved in constant time (and adaptivity) by enumerating all the possible solutions; and if ε is at least some constant, then Theorem 3.1 is trivial.

Proposition 3.2.

The input for Update consists of two vectors x, y ∈ [0,1]^\cN and two scalars Δ ∈ (0,1] and τ > 0. If this input obeys y − x = Δ⋅1\cN (i.e., every coordinate of y is larger than the corresponding coordinate of x by exactly Δ), then Update outputs two vectors x′, y′ ∈ [0,1]^\cN and a scalar Δ′ ∈ [0,Δ] obeying

1. x ≤ x′, y′ ≤ y and y′ − x′ = Δ′⋅1\cN,

2. either Δ′ = 0 or 1\cN⋅[∇F(x′)−∇F(y′)] ≤ 1\cN⋅[∇F(x)−∇F(y)] − 4ετ, and

3. F(x′)+F(y′)−F(x)−F(y) ≥ 2(1−3ε)⋅[F(OPT(x,y))−F(OPT(x′,y′))] − 4ετ(Δ−Δ′) − 2ε²⋅1\cN⋅[∇F(x)−∇F(y)].

Moreover, Update requires only a constant number of adaptive rounds and O(nε−1 ln ε−1) value oracle queries to F.

One can observe that, in addition to the guarantees discussed above, Proposition 3.2 also shows that Update significantly decreases the expression 1\cN⋅[∇F(x)−∇F(y)]. Intuitively, this decrease represents the “progress” made by every execution of Update, and it allows us to bound the number of iterations (and thus, adaptive rounds) used by our algorithm. Nevertheless, to make the number of iterations independent of n, we need to start with vectors x and y for which the expression 1\cN⋅[∇F(x)−∇F(y)] is already not too large. We use a procedure named Pre-Process to find such vectors. The formal properties of this procedure are given by the next proposition.

Proposition 3.3.

The input for Pre-Process consists of a single value τ. If τ ≤ f(OPT) ≤ 4τ, then Pre-Process outputs two vectors x0, y0 ∈ [0,1]^\cN and a scalar Δ0 ∈ [0,1] obeying

1. y0 − x0 = Δ0⋅1\cN,

2. either Δ0 = 0 or 1\cN⋅[∇F(x0)−∇F(y0)] ≤ 16τ, and

3. F(x0)+F(y0) ≥ 2⋅[f(OPT)−F(OPT(x0,y0))] − 4ε⋅f(OPT).

Moreover, Pre-Process requires only a constant number of adaptive rounds and O(n/ε) value oracle queries to F.

We defer the presentation of the procedures Update and Pre-Process and their analyses to Sections 3.1 and 3.2, respectively. However, using these procedures we are now ready to present the algorithm that we use to prove Theorem 3.1. This algorithm is given as Algorithm 1.
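The listing of Algorithm 1 is not reproduced in this version of the text, but the surrounding description pins down its control flow. The sketch below captures only that flow, with Pre-Process and Update as placeholder parameters; all names and the dummy stand-ins are illustrative assumptions, not the paper's code:

```python
def algorithm1(pre_process, update, tau):
    """Control-flow sketch of Algorithm 1. pre_process(tau) returns an
    initial triple (x, y, delta) with y - x = delta (times the all-ones
    vector in the real algorithm), and update(x, y, delta, tau) returns
    an updated triple; the loop ends once delta == 0, when x == y."""
    x, y, delta = pre_process(tau)
    while delta > 0:
        x, y, delta = update(x, y, delta, tau)
    return x  # the returned solution is the random set R(x)

# Dummy stand-ins, only to exercise the control flow: x and y are
# scalars here rather than vectors, and "update" simply shrinks the
# gap, snapping tiny gaps down to zero.
pre = lambda tau: (0.25, 0.75, 0.5)

def upd(x, y, d, tau):
    nd = 0.0 if d <= 0.1 else d / 2  # new gap
    shift = (d - nd) / 2             # keeps y - x == nd after the move
    return x + shift, y - shift, nd

result = algorithm1(pre, upd, tau=1.0)
```

In the real algorithm each loop iteration costs a constant number of adaptive rounds, so the bound ℓ ≤ 5ε−1 on the number of iterations directly yields the O(ε−1) adaptivity.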

Let us denote by ℓ the number of iterations made by Algorithm 1. We begin the analysis of the algorithm with the following lemma, which proves some basic properties of Algorithm 1.

Lemma 3.4.

It always holds that τ ≤ f(OPT) ≤ 4τ, and for every integer 0 ≤ i ≤ ℓ it holds that xi ≤ yi, yi − xi = Δi⋅1\cN and Δi ∈ [0,1].

Proof.

It was proved by Feige et al. [24] that a random subset of \cN that contains every element with probability 1/2 has an expected value of at least f(OPT)/4, and thus, f(OPT) ≤ 4τ. In contrast, since τ is the expected value of f over some distribution of feasible solutions, we get τ ≤ f(OPT). This completes the proof of the first part of the lemma.

We prove the rest of the lemma by induction. For i = 0 the lemma holds by the guarantee of Proposition 3.3. Assume now that the lemma holds for some i−1 ≥ 0, and let us prove it for i. By the induction hypothesis we have xi−1 ≤ yi−1, yi−1 − xi−1 = Δi−1⋅1\cN and Δi−1 ∈ [0,1]. Moreover, the fact that iteration i − 1 was not the last one implies that Δi−1 > 0. Hence, all the conditions of Proposition 3.2 on the input for Update hold with respect to the execution of this procedure in the i-th iteration of Algorithm 1, and thus, the proposition guarantees xi ≤ yi, yi − xi = Δi⋅1\cN and Δi ∈ [0,1], as required. ∎

The last lemma shows that the input for the procedure Pre-Process obeys the conditions of Proposition 3.3 and the input for the procedure Update obeys the conditions of Proposition 3.2 in all the iterations of Algorithm 1. The following lemma uses these facts to get an upper bound on the number of iterations performed by Algorithm 1 and a lower bound on the value of the output of this algorithm. We note that the proofs of this lemma and the corollary that follows it resemble the above discussed analysis of the algorithm of [11].

Lemma 3.5.

ℓ ≤ 5ε−1 and F(xℓ)+F(yℓ) ≥ 2(1−3ε)⋅[f(OPT)−F(OPT(xℓ,yℓ))]−168ε⋅f(OPT).

Proof.

Consider the potential function Φ(i) = 1\cN⋅[∇F(xi)−∇F(yi)], and let us study the change in this potential as a function of i. Since Δi > 0 for every 0 ≤ i ≤ ℓ−1 and the conditions of Proposition 3.2 are satisfied in all the iterations of Algorithm 1, this proposition guarantees Φ(i) ≤ Φ(i−1) − 4ετ for every 1 ≤ i ≤ ℓ−1. In other words, the potential function decreases by at least 4ετ every time that i increases by 1, and thus, Φ(ℓ−1) ≤ Φ(0) − 4ετ(ℓ−1). Next, we would like to bound Φ(0) and Φ(ℓ−1). Since the conditions of Proposition 3.3 are also satisfied, it guarantees that Φ(0) ≤ 16τ. In contrast, the submodularity of f and the inequality xℓ−1 ≤ yℓ−1 imply Φ(ℓ−1) ≥ 0. Combining all the above observations, we get

 Φ(0)≥Φ(ℓ−1)+4ετ(ℓ−1)⇒16τ≥4ετ(ℓ−1)⇒ℓ≤1+4ε−1≤5ε−1.

Let us now get to proving the second part of the lemma. By Proposition 3.3, we get

 F(x0)+F(y0)≥ 2[f(OPT)−F(OPT(x0,y0))]−4ε⋅f(OPT) (1) ≥ 2(1−3ε)⋅[f(OPT)−F(OPT(x0,y0))]−4ε⋅f(OPT),

where the second inequality holds since F(OPT(x0,y0)) is the expected value of f over some distribution of sets, and thus, is upper bounded by f(OPT). Additionally, Proposition 3.2 implies that for every 1 ≤ i ≤ ℓ we have

 [F(xi)+F(yi)]−[F(xi−1)+F (yi−1)]−2(1−3ε)⋅[F(OPT(xi−1,yi−1))−F(OPT(xi,yi))] ≥ −4ετ(Δi−1−Δi)−2ε2⋅1\cN[∇F(xi−1)−∇F(yi−1)] = −4ετ(Δi−1−Δi)−2ε2⋅Φ(i−1)≥−4ετ(Δi−1−Δi)−32ε2τ,

where the second inequality holds since we have already proved that Φ(0) ≤ 16τ and that Φ(i) is a decreasing function of i in the range 0 ≤ i ≤ ℓ−1. Adding up the last inequality for every 1 ≤ i ≤ ℓ, and adding Inequality (1) to the sum, we get

 F(xℓ)+F(yℓ)≥ 2(1−3ε)⋅[f(OPT)−F(OPT(xℓ,yℓ))]−4ετ(Δ0−Δℓ)−32ℓε2τ−4ε⋅f(OPT) ≥ 2(1−3ε)⋅[f(OPT)−F(OPT(xℓ,yℓ))]−4ετ−32ℓε2τ−4ε⋅f(OPT) ≥ 2(1−3ε)⋅[f(OPT)−F(OPT(xℓ,yℓ))]−(8ε+32ℓε2)⋅f(OPT) ≥ 2(1−3ε)⋅[f(OPT)−F(OPT(xℓ,yℓ))]−168ε⋅f(OPT),

where the second inequality holds since Δ0 − Δℓ ≤ 1, the third inequality holds since τ ≤ f(OPT) by Lemma 3.4 and the last inequality holds by plugging in the upper bound ℓ ≤ 5ε−1. ∎

Corollary 3.6.

F(xℓ) ≥ (1/2 − O(ε))⋅f(OPT). Hence, the approximation ratio of Algorithm 1 is at least 1/2 − O(ε).

Proof.

We first note that the second part of the corollary follows from the first one since the last line of Algorithm 1 returns a random set whose expected value, with respect to f, is F(xℓ). Thus, the rest of the proof is devoted to proving the first part of the corollary.

Observe that Δℓ = 0 because otherwise the algorithm would not have stopped after ℓ iterations. Thus, xℓ = yℓ and F(OPT(xℓ,yℓ)) = F((1OPT∨xℓ)∧xℓ) = F(xℓ). Plugging these observations into the guarantee of Lemma 3.5, we get

 2F(xℓ)≥2(1−3ε)⋅[f(OPT)−F(xℓ)]−168ε⋅f(OPT),

and the corollary now follows immediately by rearranging the last inequality and using the non-negativity of f. ∎

To complete the proof of Theorem 3.1 we still need to upper bound the adaptivity of Algorithm 1 and the number of value oracle queries that it uses, which is done by the next lemma.

Lemma 3.7.

The adaptivity of Algorithm 1 is O(ε−1), and it uses O(nε−2 log ε−1) value oracle queries to F.

Proof.

Except for the value oracle queries used by the procedures Update and Pre-Process, Algorithm 1 uses only a single value oracle query (for evaluating τ). Thus, the adaptivity of Algorithm 1 is at most

 1+(adaptive rounds used by Pre-Process)+ℓ⋅(adaptive rounds used by Update), (2)

and the number of oracle queries it uses is at most

 1+(value oracle queries used by Pre-Process)+ℓ⋅(value oracle queries used by Update). (3)

Proposition 3.2 guarantees that each execution of the procedure Update requires at most O(1) rounds of adaptivity and O(nε−1 ln ε−1) oracle queries, and Proposition 3.3 guarantees that the single execution of the procedure Pre-Process requires at most O(1) rounds of adaptivity and O(n/ε) oracle queries. Plugging these observations and the upper bound on ℓ given by Lemma 3.5 into (2) and (3), we get that the adaptivity of Algorithm 1 is at most

 1+O(1)+5ε−1⋅O(1)=O(ε−1),

and its query complexity is at most

 1+O(n/ε)+5ε−1⋅O(nε−1lnε−1)=O(nε−2logε−1).\qed

3.1 The Procedure Update

In this section we describe the promised procedure Update and prove that it indeed obeys all the properties guaranteed by Proposition 3.2. Let us begin by recalling Proposition 3.2.

Proposition 3.2.

The input for Update consists of two vectors x, y ∈ [0,1]^\cN and two scalars Δ ∈ (0,1] and τ > 0. If this input obeys y − x = Δ⋅1\cN (i.e., every coordinate of y is larger than the corresponding coordinate of x by exactly Δ), then Update outputs two vectors x′, y′ ∈ [0,1]^\cN and a scalar Δ′ ∈ [0,Δ] obeying

1. x ≤ x′, y′ ≤ y and y′ − x′ = Δ′⋅1\cN,

2. either Δ′ = 0 or 1\cN⋅[∇F(x′)−∇F(y′)] ≤ 1\cN⋅[∇F(x)−∇F(y)] − 4ετ, and

3. F(x′)+F(y′)−F(x)−F(y) ≥ 2(1−3ε)⋅[F(OPT(x,y))−F(OPT(x′,y′))] − 4ετ(Δ−Δ′) − 2ε²⋅1\cN⋅[∇F(x)−∇F(y)].

Moreover, Update requires only a constant number of adaptive rounds and O(nε−1 ln ε−1) value oracle queries to F.

The procedure Update itself appears as Algorithm 2 and consists of two main steps. In the first step the algorithm calculates for every element u ∈ \cN a basic rate ru whose intuitive meaning is that if an update of size δ is selected during the second step, then xu will be increased by δru and yu will be decreased by δ(1−ru). Thus, we are guaranteed that the difference yu − xu decreases by δ for all the elements of \cN. We also note that the formula for calculating the basic rate ru is closely based on the update rule used by the algorithm of [11].

To understand the second step of Update, observe that F(x+δr)+F(y−δ(1\cN−r)) can be viewed as a function of the selected δ, whose derivative with respect to δ is

 r⋅∇F(x+δr)−(1\cN−r)⋅∇F(y−δ(1\cN−r)).

For δ = 0 this derivative is a⋅r+b⋅(1\cN−r), and the algorithm looks for the minimum δ for which the derivative becomes significantly smaller than that. For efficiency purposes, the algorithm only checks possible values of δ out of an exponentially increasing series of values, rather than every possible value. The algorithm then makes an update of size δ. Since δ is (roughly) the first value for which the derivative decreased significantly compared to the original derivative, making a step of size δ using rates calculated based on the marginals at x and y makes sense. Moreover, since the derivative does decrease significantly after a step of size δ, the expression 1\cN⋅[∇F(x′)−∇F(y′)] should intuitively be significantly smaller than 1\cN⋅[∇F(x)−∇F(y)], which is one of the guarantees of Proposition 3.2.
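The one-round search for δ described above can be sketched as follows; the exact grid and threshold used by Update are fixed in Algorithm 2, so the ones below are illustrative assumptions:

```python
def choose_delta(g_prime, delta_max, eps, threshold):
    """Search for the step size delta: evaluate the derivative g'(delta)
    at an exponentially increasing grid of candidates (the evaluations
    are independent, so they fit in a single adaptive round) and return
    the smallest candidate where the derivative drops below `threshold`.
    The grid shape here is an illustrative assumption."""
    grid, d = [], eps * delta_max
    while d < delta_max:
        grid.append(d)
        d *= 1 + eps
    grid.append(delta_max)
    # In a parallel implementation all of these queries are issued at once.
    values = [g_prime(d) for d in grid]
    for d, v in zip(grid, values):
        if v < threshold:
            return d          # first (smallest) candidate below the threshold
    return delta_max          # derivative never dropped: take a full step
```

For instance, with the decreasing derivative g'(δ) = 1 − δ, delta_max = 1, eps = 0.5 and threshold = 0.4, the grid is [0.5, 0.75, 1.0] and the search returns 0.75, the first point where the derivative has fallen below the threshold.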

We begin the analysis of Update with the following observation, which states some useful properties of the vectors produced by Update and (in particular) implies part 1 of Proposition 3.2. In this observation, and in the rest of the section, we implicitly assume that the input of Update obeys all the requirements of Proposition 3.2.

Observation 3.8.

r ∈ [0,1]^\cN and δ ∈ [0,Δ], and thus, x′, y′ ∈ [0,1]^\cN, x ≤ x′ and y′ ≤ y. Moreover, y′ − x′ = Δ′⋅1\cN.

Proof.

To see why r ∈ [0,1]^\cN holds, consider an arbitrary coordinate ru of r. The only case in which this coordinate is not set to either 0 or 1 by Update is when both au and bu are positive, in which case

 ru=auau+bu∈(0,1).

We also observe that the definition of δ implies δ ∈ [0,Δ], and thus, we get x ≤ x′ = x+δr ≤ x+Δ⋅1\cN = y and y ≥ y′ = y−δ(1\cN−r) ≥ y−Δ⋅1\cN = x, which also yields x′, y′ ∈ [0,1]^\cN.

It remains to prove y′ − x′ = Δ′⋅1\cN, which follows since

 y′−x′=(y−δ(1\cN−r))−(x+δr)=(y−x)−δ⋅1\cN=Δ⋅1\cN−δ⋅1\cN=(Δ−δ)⋅1\cN=Δ′⋅1\cN.\qed

Our next objective is to prove part 2 of Proposition 3.2, which shows that Update makes significant progress when one measures progress in terms of the decrease in the value of the expression 1\cN⋅[∇F(x)−∇F(y)].

Lemma 3.9.

If Δ′ ≠ 0, then 1\cN⋅[∇F(x′)−∇F(y′)] ≤ 1\cN⋅[∇F(x)−∇F(y)] − γ.

Proof.

Note that Δ′ ≠ 0 implies δ < Δ, which only happens when

 r⋅∇F(x+δr)−(1\cN−r)⋅∇F(y−δ(1\cN−r))≤ar+b(1\cN−r)−γ.

Plugging in the definitions of a, b, x′ and y′, the last inequality becomes

 r⋅∇F(x′)−(1\cN−r)⋅∇F(y′)≤r⋅∇F(x)−(1\cN−r)⋅∇F(y)−γ.

To remove the vector r from this inequality, we add to it the two inequalities (1\cN−r)⋅∇F(x′) ≤ (1\cN−r)⋅∇F(x) and r⋅∇F(y) ≤ r⋅∇F(y′). Both these inequalities hold due to submodularity (Observation 2.1) since x ≤ x′ and y′ ≤ y. One can observe that the result of this addition is the inequality guaranteed by the lemma. ∎

We now get to proving part 3 of Proposition 3.2. Towards this goal, we need to find a way to relate the change in F(x)+F(y) to the difference F(OPT(x,y))−F(OPT(x′,y′)). The next lemma upper bounds the last difference. Recall that au = ∂uF(x) and bu = −∂uF(y) for every element u ∈ \cN.

Lemma 3.10.

F(OPT(x,y))−F(OPT(x′,y′)) ≤ (δ/2)⋅∑u∈\cN [auru+bu(1−ru)].

Proof.

Using the chain rule, we get

 F(OPT(x,y))−F(OPT(x′,y′)) =F((1OPT∨x)∧y)−F((1OPT∨x′)∧y′) = −∫δ0dF((1OPT∨(x+tr))∧(y−t(1\cN−r)))dtdt = ∫δ0{∑u∈OPT(1−ru)⋅∂uF((1OPT∨(x+tr))∧(y−t(1\cN−r))) −∑u∈\cN∖OPTru⋅∂uF((1OPT∨(x+tr))∧(y−t(1\cN−r)))⎫⎬⎭dt.

Using the submodularity of f and the fact that x ≤ (1OPT∨(x+tr))∧(y−t(1\cN−r)) ≤ y for every t ∈ [0,δ], the rightmost side of the last equation can be upper bounded as follows.

 F(OPT(x,y))−F(OPT(x′,y′)) ≤ ∫δ0{∑u∈OPT au(1−ru)+∑u∈\cN∖OPT buru}dt≤∫δ0∑u∈\cN max{buru,au(1−ru)}dt = δ⋅∑u∈\cN max{buru,au(1−ru)}.

To complete the proof of the lemma, it remains to observe that for every element u ∈ \cN it holds that max{buru, au(1−ru)} ≤ [auru+bu(1−ru)]/2. To see that this is the case, note that every such element must fall into one out of only two possible options. The first option is that both au and bu are positive, which implies ru = au/(au+bu), and thus, max{buru, au(1−ru)} = aubu/(au+bu) ≤ [(au)²+(bu)²]/[2(au+bu)] = [auru+bu(1−ru)]/2. The second option is that ru ∈ {0,1}, where Update sets ru = 1 only when bu ≤ 0 and ru = 0 only when au ≤ 0, which implies max{buru, au(1−ru)} ≤ 0 ≤ [auru+bu(1−ru)]/2, where the last inequality follows since the submodularity of f yields au+bu ≥ 0. ∎

Next, we would like to lower bound F(x′)+F(y′)−F(x)−F(y), which we do in Lemma 3.12. However, before we can state and prove this lemma, we need to prove the following technical observation.

Observation 3.11.

For every element u ∈ \cN, auru+bu(1−ru) ≥ max{au,bu}/2 ≥ 0.

Proof.

The submodularity of f implies

 au+bu=∂uF(x)−∂uF(y)≥∂uF(x)−∂uF(x)=0,

which yields the second inequality of the observation.

In the proof of the first inequality of the observation we assume for simplicity that au ≥ bu. The proof for the other case is analogous. There are now three cases to consider. If bu is positive, then so must be au by our assumption, which implies ru = au/(au+bu), and thus,

 auru+bu(1−ru)≥auru=(au)²/(au+bu)≥(au)²/(2au)=au/2=max{au,bu}/2.

The second case we need to consider is when au ≤ 0. Note that in this case ru = 0, which implies

 auru+bu(1−ru)=bu=max{au,bu}/2,

where the second equality holds since the inequality au+bu ≥ 0 proved above and our assumptions that au ≥ bu and au ≤ 0 yield together au = bu = 0. It remains to consider the case in which au > 0 and bu ≤ 0. Note that the inequality au+bu ≥ 0 proved above implies that in this case au = max{au,bu} ≥ 0, and thus, ru = 1 and

 auru+bu(1−ru)=au≥max{au,bu}/2.\qed

Intuitively, the following lemma holds because the way in which δ is chosen by Update guarantees that there exists a value δ̂ ≥ δ which is small, either in absolute terms or compared to Δ, such that the derivative of F(x+tr)+F(y−t(1\cN−r)) as a function of t remains close to its original value a⋅r+b⋅(1\cN−r) for every t ∈ [0, δ̂].