 # An Optimal Algorithm for Bandit and Zero-Order Convex Optimization with Two-Point Feedback

We consider the closely related problems of bandit convex optimization with two-point feedback, and zero-order stochastic convex optimization with two function evaluations per round. We provide a simple algorithm and analysis which is optimal for convex Lipschitz functions. This improves on Duchi et al. [4], which only provides an optimal result for smooth functions. Moreover, the algorithm and analysis are simpler, and readily extend to non-Euclidean problems. The algorithm is based on a small but surprisingly powerful modification of the gradient estimator.

## 1 Introduction

We consider the problem of bandit convex optimization with two-point feedback [1]. This problem can be defined as a repeated game between a learner and an adversary as follows: At each round $t$, the adversary picks a convex function $f_t$ on $\mathbb{R}^d$, which is not revealed to the learner. The learner then chooses a point $w_t$ from some known and closed convex set $\mathcal{W} \subseteq \mathbb{R}^d$, and suffers a loss $f_t(w_t)$. As feedback, the learner may choose two points $w'_t, w''_t \in \mathcal{W}$ and receive $f_t(w'_t), f_t(w''_t)$.¹ The learner's goal is to minimize average regret, defined as

$$\frac{1}{T}\sum_{t=1}^{T} f_t(w_t) - \min_{w \in \mathcal{W}} \frac{1}{T}\sum_{t=1}^{T} f_t(w).$$

¹ This is slightly different than the model of [1], where the learner only chooses $w'_t, w''_t$ and the loss is $\frac{1}{2}\bigl(f_t(w'_t) + f_t(w''_t)\bigr)$. However, our results and analysis can be easily translated to their setting, and the model we discuss translates more directly to the zero-order stochastic optimization considered later.

In this note, we focus on obtaining bounds on the expected average regret (with respect to the learner’s randomness).

A closely-related and easier setting is zero-order stochastic convex optimization. In this setting, our goal is to approximately solve $\min_{w \in \mathcal{W}} F(w) = \mathbb{E}_{\xi}[f(w;\xi)]$, given limited access to $\{f(\cdot;\xi_t)\}_{t=1}^{T}$ where $\xi_t$ are i.i.d. instantiations. Specifically, we assume that each $f(\cdot;\xi_t)$ is not directly observed, but rather can be queried at two points. This models situations where computing gradients directly is complicated or infeasible. It is well-known [3] that given an algorithm with expected average regret $R_T$ in the bandit optimization setting above, if we feed it with the functions $f_t(w) = f(w;\xi_t)$, then the average $\bar{w}_T = \frac{1}{T}\sum_{t=1}^{T} w_t$ of the points generated satisfies the following bound on the expected optimization error:

$$\mathbb{E}[F(\bar{w}_T)] - \min_{w \in \mathcal{W}} F(w) \leq R_T.$$

Thus, an algorithm for bandit optimization can be converted to an algorithm for zero-order stochastic optimization with similar guarantees.

The bandit optimization setting with two-point feedback was proposed and studied in [1]. Independently, [8] considered two-point methods for stochastic optimization. Both papers are based on randomized gradient estimates which are then fed into standard first-order algorithms (e.g. gradient descent, or more generally mirror descent). However, the regret/error guarantees in both papers were suboptimal in terms of the dependence on the dimension. Recently, [4] considered a similar approach for the stochastic optimization setting, attaining an optimal error guarantee when $f(\cdot;\xi)$ is a smooth function (differentiable and with Lipschitz-continuous gradients). Related results in the smooth case were also obtained by [6]. However, to tackle the general case, where $f(\cdot;\xi)$ may be non-smooth, [4] resorted to a non-trivial smoothing scheme and a significantly more involved analysis. The resulting bounds have additional factors (logarithmic in the dimension) compared to the guarantees in the smooth case. Moreover, an analysis is only provided for Euclidean problems (where the domain and Lipschitz parameter of $f_t$ scale with the $L_2$ norm).

In this note, we present and analyze a simple algorithm with the following properties:

• For Euclidean problems, it is optimal up to constants for both smooth and non-smooth functions. This closes the gap between the smooth and non-smooth Euclidean problems in this setting.

• The algorithm and analysis are readily applicable to non-Euclidean problems. We give an example for the 1-norm, with the resulting bound optimal up to a logarithmic factor.

• The algorithm and analysis are simpler than those proposed in [4]. They apply equally to the bandit and zero-order optimization setting, and can be readily extended using standard techniques (e.g. to strongly-convex functions, regret/error bounds holding with high probability rather than just in expectation, and improved bounds if allowed $k > 2$ observations per round instead of just two).

Like previous algorithms, our algorithm is based on a random gradient estimator, which given a function $f$ and point $w$, queries $f$ at two random locations close to $w$, and computes a random vector whose expectation is a gradient of a smoothed version of $f$. The papers [8, 4, 6] essentially use the estimator which queries at $w$ and $w + \delta u$ (where $u$ is a random unit vector and $\delta > 0$ is a small parameter), and returns

$$\frac{d}{\delta}\bigl(f(w + \delta u) - f(w)\bigr)u. \tag{1}$$

The intuition is readily seen in the one-dimensional ($d = 1$) case, where the expectation of this expression equals

$$\frac{1}{2\delta}\bigl(f(w + \delta) - f(w - \delta)\bigr), \tag{2}$$

which indeed approximates the derivative of $f$ (assuming $f$ is differentiable) at $w$, if $\delta$ is small enough.

In contrast, our algorithm uses a slightly different estimator (also used in [1]), which queries $f$ at $w + \delta u$ and $w - \delta u$, and returns

$$\frac{d}{2\delta}\bigl(f(w + \delta u) - f(w - \delta u)\bigr)u. \tag{3}$$

Again, the intuition is readily seen in the case $d = 1$, where the expectation of this expression also equals Eq. (2).

When $\delta$ is sufficiently small and $f$ is differentiable at $w$, both estimators compute a good approximation of the true gradient $\nabla f(w)$. However, when $f$ is not differentiable, the variance of the estimator in Eq. (1) can be quadratic in the dimension $d$, as pointed out by [4]: For example, for $f(w) = \|w\|_2$ and $w = 0$, the second moment equals

$$\mathbb{E}\left[\left\|\frac{d}{\delta}\bigl(f(\delta u) - f(0)\bigr)u\right\|^2\right] = \mathbb{E}\left[d^2\|u\|^2\right] = d^2.$$

Since the performance of the algorithm crucially depends on the second moment of the gradient estimate, this leads to a highly sub-optimal guarantee. In [4], this was handled by adding an additional random perturbation and using a more involved analysis. Surprisingly, it turns out that the slightly different estimator in Eq. (3) does not suffer from this problem, and its second moment is essentially linear in the dimension $d$.
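The gap between the two estimators can be illustrated numerically. The following is a minimal Monte Carlo sketch, not part of the original analysis: the test function $f(w) = \|w\|_2 + w_1$, the point $w = 0$, and all function names are assumptions chosen for illustration. For this choice a direct calculation gives a second moment of $d^2 + d$ for the estimator in Eq. (1) and $d$ for the estimator in Eq. (3).

```python
import numpy as np

def sample_unit_sphere(rng, n, d):
    """Draw n points uniformly from the Euclidean unit sphere in R^d."""
    x = rng.standard_normal((n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def one_sided_estimates(f, w, delta, u):
    """Eq. (1): (d/delta) * (f(w + delta*u) - f(w)) * u, one row per sample of u."""
    d = w.shape[0]
    diff = f(w + delta * u) - f(w[None, :])
    return (d / delta) * diff[:, None] * u

def two_sided_estimates(f, w, delta, u):
    """Eq. (3): (d/(2*delta)) * (f(w + delta*u) - f(w - delta*u)) * u."""
    d = w.shape[0]
    diff = f(w + delta * u) - f(w - delta * u)
    return (d / (2.0 * delta)) * diff[:, None] * u

def mean_squared_norm(g):
    """Empirical second moment E[||g||_2^2] over the rows of g."""
    return float(np.mean(np.sum(g ** 2, axis=1)))

def compare_second_moments(d=50, n=20000, delta=1e-3, seed=0):
    """Return the empirical second moments of both estimators at w = 0."""
    rng = np.random.default_rng(seed)
    a = np.zeros(d)
    a[0] = 1.0
    # Illustrative non-smooth convex function (an assumption, not from the paper).
    f = lambda x: np.linalg.norm(x, axis=-1) + x @ a
    w = np.zeros(d)
    u = sample_unit_sphere(rng, n, d)
    return (mean_squared_norm(one_sided_estimates(f, w, delta, u)),
            mean_squared_norm(two_sided_estimates(f, w, delta, u)))
```

Calling `compare_second_moments()` with the default `d=50` should give a first value close to $d^2 + d = 2550$ and a second value close to $d = 50$, up to sampling noise.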

## 2 Algorithm and Main Results

We consider the algorithm described in Figure 1, which performs standard mirror descent using a randomized gradient estimator of a (smoothed) version of $f_t$ at point $w_t$. We make the assumption that one can indeed query $f_t$ at any point $w_t \pm \delta_t u_t$ as specified in the algorithm.²

² This may require us to query at a distance $\delta_t$ outside $\mathcal{W}$. If we must query within $\mathcal{W}$, then one can simply run the algorithm on a slightly smaller set $(1-\epsilon)\mathcal{W}$, where $w \pm \delta_t u_t \in \mathcal{W}$ for all $w \in (1-\epsilon)\mathcal{W}$, ensuring that we always query at $\mathcal{W}$. Since the formal guarantee in Thm. 1 holds for arbitrarily small $\delta_t$, and each $f_t$ is Lipschitz, we can always take $\delta_t$ and $\epsilon$ small enough so that the additional regret/error incurred is negligible.

The analysis of the algorithm is presented in the following theorem:

###### Theorem 1.

Assume the following conditions hold:

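1. $r$ is 1-strongly convex with respect to a norm $\|\cdot\|$, and $\sup_{w \in \mathcal{W}} r(w) \leq R^2$ for some $R < \infty$.

2. $f_t$ is convex and $G_2$-Lipschitz with respect to the 2-norm $\|\cdot\|_2$.

3. The dual norm $\|\cdot\|_*$ of $\|\cdot\|$ is such that $\sqrt[4]{\mathbb{E}_{u_t}[\|u_t\|_*^4]} \leq p_*$ for some $p_* < \infty$.

If $\eta = \frac{R}{p_* G_2 \sqrt{dT}}$, and $\delta_t$ is chosen such that $\delta_t \leq p_* R \sqrt{d/T}$, then the sequence $w_1, \ldots, w_T$ generated by the algorithm satisfies the following for any $T$ and $w^* \in \mathcal{W}$:

$$\mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T} f_t(w_t) - \frac{1}{T}\sum_{t=1}^{T} f_t(w^*)\right] \leq c\, p_* G_2 R \sqrt{\frac{d}{T}},$$

where $c$ is some numerical constant.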

We note that condition 1 is standard in the analysis of the mirror-descent method (see the specific corollaries below), whereas conditions 2 and 3 are needed to ensure that the variance of our gradient estimator is controlled.

As mentioned earlier, the bound on the average regret which appears in Thm. 1 immediately implies a similar bound on the error in a stochastic optimization setting, for the average point $\bar{w}_T = \frac{1}{T}\sum_{t=1}^{T} w_t$. We note that the result is robust to the choice of $\eta$, and is the same up to constants as long as $\eta = \Theta\bigl(R / (p_* G_2 \sqrt{dT})\bigr)$. Also, the constant $c$, while always bounded above zero, shrinks as $d \to \infty$ (see the proof for details).

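As a first application, let us consider the case where $\|\cdot\|$ is the Euclidean norm $\|\cdot\|_2$. In this case, we can take $r(w) = \frac{1}{2}\|w\|_2^2$, and the algorithm reduces to a standard variant of online gradient descent, defined as $\theta_{t+1} = \theta_t - \eta \tilde{g}_t$ and $w_{t+1} = \arg\min_{w \in \mathcal{W}} \|w - \theta_{t+1}\|_2$. In this case, we get the following corollary: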

###### Corollary 1.

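Suppose $f_t$ for all $t$ is $G_2$-Lipschitz with respect to the Euclidean norm, and $\mathcal{W} \subseteq \{w : \|w\|_2 \leq R\}$. Then using $\|\cdot\| = \|\cdot\|_2$ and $r(w) = \frac{1}{2}\|w\|_2^2$, it holds for some constant $c$ and any $w^* \in \mathcal{W}$ that

$$\mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T} f_t(w_t) - \frac{1}{T}\sum_{t=1}^{T} f_t(w^*)\right] \leq c\, G_2 R \sqrt{\frac{d}{T}}.$$

The proof is immediately obtained from Thm. 1, noting that $p_* = 1$ in our case. This bound matches (up to constants) the lower bound in [4], hence closing the gap between upper and lower bounds in this setting.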
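The Euclidean case can be sketched in a few lines of code. This is an illustrative sketch under stated assumptions rather than a reproduction of Figure 1: it assumes the lazy-projection form of online gradient descent (a running vector `theta` that is projected onto the domain to obtain each prediction), takes the domain to be the Euclidean ball of radius `radius`, sets $\eta$ and $\delta_t$ to the values in Thm. 1 with $p_* = 1$, and uses hypothetical function names throughout. Query points may fall slightly outside the ball, as discussed in the footnote above.

```python
import numpy as np

def project_l2_ball(theta, radius):
    """Euclidean projection onto {w : ||w||_2 <= radius}."""
    norm = np.linalg.norm(theta)
    return theta if norm <= radius else theta * (radius / norm)

def two_point_gradient_descent(losses, d, radius, lipschitz, seed=0):
    """Run two-point bandit gradient descent on a list of loss callables.

    Returns the array of played points w_1, ..., w_T (one row per round).
    """
    T = len(losses)
    rng = np.random.default_rng(seed)
    eta = radius / (lipschitz * np.sqrt(d * T))   # step size from Thm. 1 with p_* = 1
    delta = radius * np.sqrt(d / T)               # largest exploration radius allowed by Thm. 1
    theta = np.zeros(d)
    iterates = []
    for f in losses:
        w = project_l2_ball(theta, radius)        # prediction for this round
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)                    # uniform direction on the unit sphere
        # Two-point estimator of Eq. (3).
        g = (d / (2.0 * delta)) * (f(w + delta * u) - f(w - delta * u)) * u
        theta = theta - eta * g
        iterates.append(w)
    return np.array(iterates)
```

For example, with a fixed non-smooth loss such as `f = lambda w: float(np.linalg.norm(w - target))` for some `target` inside the unit ball, passing `[f] * T` as `losses` should drive the average loss of the played points well below the loss of the starting point.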

As a second application, let us consider the case where $\|\cdot\|$ is the 1-norm, $\|\cdot\|_1$, the domain $\mathcal{W}$ is the simplex in $\mathbb{R}^d$, $d > 1$ (although our result easily extends to any subset of the 1-norm unit ball), and we use a standard entropic regularizer:

###### Corollary 2.

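Suppose $f_t$ for all $t$ is $G_1$-Lipschitz with respect to the 1-norm. Then using $\|\cdot\| = \|\cdot\|_1$ and $r(w) = \sum_{i=1}^{d} w_i \log(d\, w_i)$, it holds for some constant $c$ and any $w^* \in \mathcal{W}$ that

$$\mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T} f_t(w_t) - \frac{1}{T}\sum_{t=1}^{T} f_t(w^*)\right] \leq c\, G_1 \sqrt{\frac{d \log^2(d)}{T}}.$$

This bound matches (this time up to a logarithmic factor) the lower bound in [4] for this setting.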

###### Proof.

The function $r$ is 1-strongly convex with respect to the 1-norm (see for instance [9], Example 2.5), and has value at most $\log(d)$ on the simplex. Also, if $f_t$ is $G_1$-Lipschitz with respect to the 1-norm, then it must be $\sqrt{d}\,G_1$-Lipschitz with respect to the Euclidean norm. Finally, to satisfy condition 3 in Thm. 1, we upper bound $\sqrt[4]{\mathbb{E}[\|u_t\|_\infty^4]}$ using the following lemma, whose proof is given in the appendix:

###### Lemma 1.

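If $u$ is uniformly distributed on the unit sphere in $\mathbb{R}^d$, $d > 1$, then $\sqrt[4]{\mathbb{E}[\|u\|_\infty^4]} \leq c\sqrt{\frac{\log(d)}{d}}$ where $c$ is a positive numerical constant independent of $d$.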

Plugging these observations into Thm. 1 leads to the desired result. ∎
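The simplex case can be sketched in the same way. The code below is an illustrative sketch under several assumptions that are not confirmed by the text: that the prediction step maximizes $\langle \theta_t, w \rangle - r(w)$ over the simplex (which, for the entropic regularizer, has the closed-form solution $w \propto \exp(\theta_t)$), that the numerical constant in Lemma 1 is taken to be 1 when setting the step size and exploration radius, and that queries slightly outside the simplex are allowed. All function names are hypothetical.

```python
import numpy as np

def softmax(theta):
    """Maximizer of <theta, w> - sum_i w_i log(d w_i) over the probability simplex."""
    z = theta - np.max(theta)          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def two_point_exponentiated_gradient(losses, d, lipschitz_l1, seed=0):
    """Entropic mirror descent driven by the two-point estimator of Eq. (3).

    Returns the array of played points on the simplex (one row per round).
    """
    T = len(losses)
    rng = np.random.default_rng(seed)
    # Thm. 1 with R^2 = log(d), G_2 = sqrt(d) * G_1 and p_* = sqrt(log(d)/d)
    # (Lemma 1 constant assumed to be 1) simplifies to the values below.
    eta = 1.0 / (lipschitz_l1 * np.sqrt(d * T))
    delta = np.log(d) / np.sqrt(T)
    theta = np.zeros(d)
    iterates = []
    for f in losses:
        w = softmax(theta)
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)         # uniform on the Euclidean unit sphere
        g = (d / (2.0 * delta)) * (f(w + delta * u) - f(w - delta * u)) * u
        theta = theta - eta * g
        iterates.append(w)
    return np.array(iterates)
```

On a fixed linear loss $f(w) = \langle c, w \rangle$, the played points should gradually concentrate on the coordinate with the smallest cost.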

## 3 Proof of Theorem 1

As discussed in the introduction, the key to getting improved results compared to previous papers is the use of a slightly different random gradient estimator, which turns out to have significantly less variance. The formal proof relies on a few simple lemmas listed below. The key lemma is Lemma 5, which establishes the improved variance behavior.

###### Lemma 2.

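For any $w^* \in \mathcal{W}$, it holds that

$$\sum_{t=1}^{T} \langle \tilde{g}_t, w_t - w^* \rangle \leq \frac{1}{\eta}R^2 + \eta \sum_{t=1}^{T} \|\tilde{g}_t\|_*^2.$$

This lemma is the canonical result on the convergence of online mirror descent, and the proof is standard (see e.g. [9]).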

###### Lemma 3.

Define the function

$$\hat{f}_t(w) = \mathbb{E}_{u_t}\bigl[f_t(w + \delta_t u_t)\bigr],$$

over $\mathcal{W}$, where $u_t$ is a vector picked uniformly at random from the Euclidean unit sphere. Then the function is convex, Lipschitz with constant $G_2$, satisfies

$$\sup_{w \in \mathcal{W}} \bigl|\hat{f}_t(w) - f_t(w)\bigr| \leq \delta_t G_2,$$

and is differentiable with the following gradient:

$$\nabla \hat{f}_t(w) = \mathbb{E}_{u_t}\left[\frac{d}{\delta_t} f_t(w + \delta_t u_t)\, u_t\right].$$

###### Proof.

The fact that the function is convex and Lipschitz is immediate from its definition and the assumptions in the theorem. The inequality follows from $u_t$ being a unit vector and that $f_t$ is assumed to be $G_2$-Lipschitz with respect to the 2-norm. The differentiability property follows from Lemma 2.1 in [5]. ∎
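The approximation bound in Lemma 3 can be checked numerically on a simple case. The sketch below is illustrative only: the choice $f(w) = \|w\|_2$ (which is 1-Lipschitz, so $G_2 = 1$), the Monte Carlo sample size, and the function name are assumptions. At $w = 0$ the smoothed value equals $\delta$ exactly, so the bound $\delta G_2$ is attained there.

```python
import numpy as np

def smoothed_value(f, w, delta, n_samples=20000, seed=0):
    """Monte Carlo estimate of E_u[f(w + delta * u)] for u uniform on the unit sphere.

    f must accept an array of shape (n_samples, d) and return one value per row.
    """
    rng = np.random.default_rng(seed)
    u = rng.standard_normal((n_samples, w.shape[0]))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    return float(np.mean(f(w + delta * u)))

# Illustrative 1-Lipschitz convex function (an assumption, not from the paper).
euclidean_norm = lambda x: np.linalg.norm(x, axis=-1)
```

For any point `w`, the difference `smoothed_value(euclidean_norm, w, delta) - np.linalg.norm(w)` should lie between 0 (by convexity) and `delta` (by the Lipschitz property), up to sampling noise.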

###### Lemma 4.

For any function $g$ which is $L$-Lipschitz with respect to the 2-norm, it holds that if $u$ is uniformly distributed on the Euclidean unit sphere, then

$$\sqrt{\mathbb{E}\bigl[(g(u) - \mathbb{E}[g(u)])^4\bigr]} \leq c\,\frac{L^2}{d}$$

for some numerical constant $c$.

###### Proof.

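A standard result on the concentration of Lipschitz functions on the Euclidean unit sphere implies that

$$\Pr\bigl(|g(u) - \mathbb{E}[g(u)]| > t\bigr) \leq 2\exp\left(-c' d t^2 / L^2\right)$$

for some numerical constant $c' > 0$ (see the proof of Proposition 2.10 and Corollary 2.6 in [7]). Therefore,

$$\begin{aligned}
\sqrt{\mathbb{E}\bigl[(g(u) - \mathbb{E}[g(u)])^4\bigr]}
&= \sqrt{\int_{t=0}^{\infty} \Pr\bigl((g(u) - \mathbb{E}[g(u)])^4 > t\bigr)\,dt} \\
&= \sqrt{\int_{t=0}^{\infty} \Pr\bigl(|g(u) - \mathbb{E}[g(u)]| > \sqrt[4]{t}\bigr)\,dt} \\
&\leq \sqrt{\int_{t=0}^{\infty} 2\exp\left(-\frac{c' d \sqrt{t}}{L^2}\right)dt}
= \sqrt{\frac{4L^4}{(c' d)^2}},
\end{aligned}$$

where the last step uses $\int_{0}^{\infty} \exp(-a\sqrt{t})\,dt = 2/a^2$. This equals $cL^2/d$ for some numerical constant $c$. ∎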
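The $1/d$ scaling in Lemma 4 can be observed numerically for a simple Lipschitz function. The following sketch is illustrative: the coordinate function $g(u) = u_1$ (which is 1-Lipschitz), the sample size, and the function name are assumptions. For this $g$, $\mathbb{E}[u_1^4] = 3/(d(d+2))$, so $d\sqrt{\mathbb{E}[(g(u) - \mathbb{E}[g(u)])^4]}$ should stay close to $\sqrt{3}$ as $d$ grows.

```python
import numpy as np

def scaled_fourth_moment(d, n_samples=10000, seed=0):
    """Estimate d * sqrt(E[(g(u) - E[g(u)])^4]) for g(u) = u_1, u uniform on the sphere."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal((n_samples, d))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    g = u[:, 0]                        # a 1-Lipschitz function of u
    centered = g - g.mean()
    return float(d * np.sqrt(np.mean(centered ** 4)))
```

Evaluating `scaled_fourth_moment(d)` for several dimensions, e.g. `d = 10, 100, 400`, should give values that remain bounded rather than growing with `d`, consistent with the lemma.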

###### Lemma 5.

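It holds that $\mathbb{E}[\tilde{g}_t \mid w_t] = \nabla \hat{f}_t(w_t)$ (where $\hat{f}_t$ is as defined in Lemma 3), and $\mathbb{E}[\|\tilde{g}_t\|_*^2 \mid w_t] \leq c\, d\, p_*^2 G_2^2$ for some numerical constant $c$.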

###### Proof.

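For simplicity of notation, we drop the $t$ subscript. Since $u$ has a symmetric distribution around the origin,

$$\begin{aligned}
\mathbb{E}[\tilde{g} \mid w]
&= \mathbb{E}_u\left[\frac{d}{2\delta}\bigl(f(w + \delta u) - f(w - \delta u)\bigr)u\right] \\
&= \mathbb{E}_u\left[\frac{d}{2\delta} f(w + \delta u)\,u\right] + \mathbb{E}_u\left[\frac{d}{2\delta} f(w - \delta u)(-u)\right] \\
&= \mathbb{E}_u\left[\frac{d}{2\delta} f(w + \delta u)\,u\right] + \mathbb{E}_u\left[\frac{d}{2\delta} f(w + \delta u)\,u\right] \\
&= \mathbb{E}_u\left[\frac{d}{\delta} f(w + \delta u)\,u\right],
\end{aligned}$$

which equals $\nabla \hat{f}(w)$ by Lemma 3.

As to the second part of the lemma, we have the following, where $\alpha$ is an arbitrary parameter and where we use the elementary inequality $(a - b)^2 \leq 2(a^2 + b^2)$.

$$\begin{aligned}
\mathbb{E}[\|\tilde{g}\|_*^2 \mid w]
&= \mathbb{E}_u\left[\left\|\frac{d}{2\delta}\bigl(f(w + \delta u) - f(w - \delta u)\bigr)u\right\|_*^2\right] \\
&= \frac{d^2}{4\delta^2}\,\mathbb{E}_u\left[\|u\|_*^2 \bigl(f(w + \delta u) - f(w - \delta u)\bigr)^2\right] \\
&= \frac{d^2}{4\delta^2}\,\mathbb{E}_u\left[\|u\|_*^2 \bigl((f(w + \delta u) - \alpha) - (f(w - \delta u) - \alpha)\bigr)^2\right] \\
&\leq \frac{d^2}{2\delta^2}\,\mathbb{E}_u\left[\|u\|_*^2 \bigl((f(w + \delta u) - \alpha)^2 + (f(w - \delta u) - \alpha)^2\bigr)\right] \\
&= \frac{d^2}{2\delta^2}\left(\mathbb{E}_u\left[\|u\|_*^2 (f(w + \delta u) - \alpha)^2\right] + \mathbb{E}_u\left[\|u\|_*^2 (f(w - \delta u) - \alpha)^2\right]\right).
\end{aligned}$$

Again using the symmetrical distribution of $u$, this equals

$$\frac{d^2}{2\delta^2}\left(\mathbb{E}_u\left[\|u\|_*^2 (f(w + \delta u) - \alpha)^2\right] + \mathbb{E}_u\left[\|u\|_*^2 (f(w + \delta u) - \alpha)^2\right]\right) = \frac{d^2}{\delta^2}\,\mathbb{E}_u\left[\|u\|_*^2 (f(w + \delta u) - \alpha)^2\right].$$

Applying Cauchy-Schwarz and using the condition $\sqrt[4]{\mathbb{E}_u[\|u\|_*^4]} \leq p_*$ stated in the theorem, we get the upper bound

$$\frac{d^2}{\delta^2}\sqrt{\mathbb{E}_u\left[\|u\|_*^4\right]}\sqrt{\mathbb{E}_u\left[(f(w + \delta u) - \alpha)^4\right]} \leq \frac{p_*^2 d^2}{\delta^2}\sqrt{\mathbb{E}_u\left[(f(w + \delta u) - \alpha)^4\right]}.$$

In particular, taking $\alpha = \mathbb{E}_u[f(w + \delta u)]$ and using Lemma 4 (noting that $f(w + \delta u)$ is $G_2\delta$-Lipschitz w.r.t. $u$ in terms of the 2-norm), this is at most $c\, d\, p_*^2 G_2^2$ as required. ∎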
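Both parts of Lemma 5 can be sanity-checked on a case where everything is available in closed form. The sketch below is illustrative and rests on assumptions not in the text: a convex quadratic $f(w) = \frac{1}{2} w^\top A w + b^\top w$, for which the two-point estimator reduces exactly to $d\,\langle \nabla f(w), u \rangle u$, so its mean is $\nabla f(w) = Aw + b$ and its Euclidean second moment is $d\,\|\nabla f(w)\|_2^2$. The function name and sample size are hypothetical.

```python
import numpy as np

def two_point_statistics(A, b, w, delta=1e-2, n_samples=200000, seed=0):
    """Return (empirical mean, empirical second moment) of the estimator in Eq. (3)
    for the quadratic f(x) = 0.5 * x^T A x + b^T x at the point w."""
    rng = np.random.default_rng(seed)
    d = w.shape[0]
    u = rng.standard_normal((n_samples, d))
    u /= np.linalg.norm(u, axis=1, keepdims=True)

    def f(x):
        # Row-wise evaluation of the quadratic.
        return 0.5 * np.einsum("ni,ij,nj->n", x, A, x) + x @ b

    diff = f(w + delta * u) - f(w - delta * u)
    g = (d / (2.0 * delta)) * diff[:, None] * u
    return g.mean(axis=0), float(np.mean(np.sum(g ** 2, axis=1)))
```

For a positive semidefinite `A`, the returned mean should be close to `A @ w + b`, and the second moment should be close to `d * ||A @ w + b||^2`, i.e. linear in the dimension for a fixed gradient norm.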

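We are now ready to prove the theorem. Taking expectations on both sides of the inequality in Lemma 2, we have

$$\mathbb{E}\left[\sum_{t=1}^{T} \langle \tilde{g}_t, w_t - w^* \rangle\right] \leq \frac{1}{\eta}R^2 + \eta\sum_{t=1}^{T}\mathbb{E}\left[\|\tilde{g}_t\|_*^2\right] = \frac{1}{\eta}R^2 + \eta\sum_{t=1}^{T}\mathbb{E}\left[\mathbb{E}\left[\|\tilde{g}_t\|_*^2 \mid w_t\right]\right]. \tag{4}$$

Using Lemma 5, the right hand side is at most

$$\frac{1}{\eta}R^2 + \eta\, c\, d\, p_*^2 G_2^2\, T.$$

The left hand side of Eq. (4), by Lemma 5 and convexity of $\hat{f}_t$, satisfies

$$\mathbb{E}\left[\sum_{t=1}^{T} \langle \mathbb{E}[\tilde{g}_t \mid w_t], w_t - w^* \rangle\right] = \mathbb{E}\left[\sum_{t=1}^{T} \langle \nabla \hat{f}_t(w_t), w_t - w^* \rangle\right] \geq \mathbb{E}\left[\sum_{t=1}^{T} \bigl(\hat{f}_t(w_t) - \hat{f}_t(w^*)\bigr)\right].$$

By Lemma 3, this is at least

$$\mathbb{E}\left[\sum_{t=1}^{T} \bigl(f_t(w_t) - f_t(w^*)\bigr)\right] - G_2\sum_{t=1}^{T}\delta_t.$$

Combining these inequalities and plugging back into Eq. (4), we get

$$\mathbb{E}\left[\sum_{t=1}^{T} \bigl(f_t(w_t) - f_t(w^*)\bigr)\right] \leq G_2\sum_{t=1}^{T}\delta_t + \frac{1}{\eta}R^2 + c\, d\, p_*^2 G_2^2\, \eta\, T.$$

Choosing $\eta = \frac{R}{p_* G_2\sqrt{dT}}$, and any $\delta_t \leq p_* R\sqrt{d/T}$, we get

$$\mathbb{E}\left[\sum_{t=1}^{T} \bigl(f_t(w_t) - f_t(w^*)\bigr)\right] \leq (c + 2)\, p_* G_2 R \sqrt{dT}.$$

Dividing both sides by $T$, the result follows.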
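The arithmetic in the last step can be verified directly. The sketch below simply evaluates the three terms of the bound (smoothing error, regularization term, and variance term) at the stated choices $\eta = R/(p_* G_2\sqrt{dT})$ and $\delta_t = p_* R\sqrt{d/T}$, and confirms they sum to $(c + 2)\,p_* G_2 R\sqrt{dT}$. The function names and the numerical values used in the usage note are arbitrary illustrations.

```python
import math

def theorem1_parameters(R, p_star, G2, d, T):
    """Step size and largest admissible exploration radius from Theorem 1."""
    eta = R / (p_star * G2 * math.sqrt(d * T))
    delta_max = p_star * R * math.sqrt(d / T)
    return eta, delta_max

def regret_terms(R, p_star, G2, d, T, c):
    """The three terms bounding the expected cumulative regret in the proof."""
    eta, delta_max = theorem1_parameters(R, p_star, G2, d, T)
    smoothing = G2 * T * delta_max                      # G_2 * sum_t delta_t
    regularization = R ** 2 / eta                       # (1/eta) * R^2
    variance = c * d * p_star ** 2 * G2 ** 2 * eta * T  # c * d * p_*^2 * G_2^2 * eta * T
    return smoothing, regularization, variance
```

For instance, `sum(regret_terms(R=2.0, p_star=0.5, G2=3.0, d=10, T=1000, c=4.0))` should agree with `(4.0 + 2) * 0.5 * 3.0 * 2.0 * math.sqrt(10 * 1000)` up to floating-point error, with each of the first two terms equal to $p_* G_2 R \sqrt{dT}$.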

## References

• [1] A. Agarwal, O. Dekel, and L. Xiao. Optimal algorithms for online convex optimization with multi-point bandit feedback. In COLT, 2010.
• [2] A. Barvinok. Measure concentration lecture notes.
• [3] N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. Information Theory, IEEE Transactions on, 50(9):2050–2057, 2004.
• [4] J. Duchi, M. Jordan, M. Wainwright, and A. Wibisono. Optimal rates for zero-order optimization: the power of two function evaluations. Information Theory, IEEE Transactions on, 61(5):2788–2806, May 2015.
• [5] A. Flaxman, A. Kalai, and B. McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In SODA, 2005.
• [6] S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
• [7] M. Ledoux. The concentration of measure phenomenon, volume 89. American Mathematical Soc., 2005.
• [8] Y. Nesterov. Random gradient-free minimization of convex functions. Technical Report 2011/16, ECORE, 2011.
• [9] S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2), 2012.

## Appendix A Proof of Lemma 1

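We note that the distribution of $u$ is equivalent to that of $\frac{n}{\|n\|_2}$, where $n \sim \mathcal{N}(0, I_d)$ is a standard Gaussian random vector. Moreover, by a standard concentration bound on the norm of Gaussian random vectors (e.g. Corollary 2.3 in [2], with $\epsilon = 1/2$):

$$\max\left\{\Pr\left(\|n\|_2 \leq \sqrt{\frac{d}{2}}\right), \Pr\left(\|n\|_2 \geq \sqrt{2d}\right)\right\} \leq \exp\left(-\frac{d}{16}\right).$$

Finally, for any value of $n$, we always have $\frac{\|n\|_\infty}{\|n\|_2} \leq 1$, since the Euclidean norm is always larger than the infinity norm. Combining these observations, and using $\mathbf{1}_A$ for the indicator function of the event $A$, we have

$$\begin{aligned}
\mathbb{E}\left[\|u\|_\infty^4\right]
&= \mathbb{E}\left[\frac{\|n\|_\infty^4}{\|n\|_2^4}\right] \\
&= \Pr\left(\|n\|_2 \leq \sqrt{\tfrac{d}{2}}\right)\mathbb{E}\left[\frac{\|n\|_\infty^4}{\|n\|_2^4} \,\middle|\, \|n\|_2 \leq \sqrt{\tfrac{d}{2}}\right] + \Pr\left(\|n\|_2 > \sqrt{\tfrac{d}{2}}\right)\mathbb{E}\left[\frac{\|n\|_\infty^4}{\|n\|_2^4} \,\middle|\, \|n\|_2 > \sqrt{\tfrac{d}{2}}\right] \\
&\leq \exp\left(-\frac{d}{16}\right)\cdot 1 + \Pr\left(\|n\|_2 > \sqrt{\tfrac{d}{2}}\right)\mathbb{E}\left[\frac{\|n\|_\infty^4}{\bigl(\sqrt{d/2}\bigr)^4} \,\middle|\, \|n\|_2 > \sqrt{\tfrac{d}{2}}\right] \\
&= \exp\left(-\frac{d}{16}\right) + \left(\frac{2}{d}\right)^2 \mathbb{E}\left[\|n\|_\infty^4\, \mathbf{1}_{\|n\|_2 > \sqrt{d/2}}\right].
\end{aligned} \tag{5}$$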

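Thus, it remains to upper bound $\mathbb{E}[\|n\|_\infty^4]$ where $n$ is a standard Gaussian random vector. Letting $n = (n_1, \ldots, n_d)$, and noting that $n_1, \ldots, n_d$ are independent and identically distributed standard Gaussian random variables, we have for any scalar $z \geq 0$ that

$$\begin{aligned}
\Pr\bigl(\|n\|_\infty \leq z\bigr)
&= \prod_{i=1}^{d}\Pr\bigl(|n_i| \leq z\bigr) = \bigl(\Pr(|n_1| \leq z)\bigr)^d = \bigl(1 - \Pr(|n_1| > z)\bigr)^d \\
&\overset{(1)}{\geq} 1 - d\Pr\bigl(|n_1| > z\bigr) = 1 - 2d\Pr\bigl(n_1 > z\bigr) \overset{(2)}{\geq} 1 - d\exp\left(-z^2/2\right),
\end{aligned}$$

where $(1)$ is Bernoulli's inequality, and $(2)$ is using a standard tail bound for a Gaussian random variable. In particular, the above implies that

$$\Pr\bigl(\|n\|_\infty > z\bigr) \leq d\exp\left(-z^2/2\right).$$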

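Therefore, for an arbitrary positive scalar $r$,

$$\begin{aligned}
\mathbb{E}\left[\|n\|_\infty^4\right]
&= \int_{z=0}^{\infty}\Pr\bigl(\|n\|_\infty^4 > z\bigr)\,dz \leq \int_{z=0}^{r} 1\,dz + \int_{z=r}^{\infty}\Pr\bigl(\|n\|_\infty > \sqrt[4]{z}\bigr)\,dz \\
&\leq r + \int_{z=r}^{\infty} d\exp\left(-\frac{\sqrt{z}}{2}\right)dz = r + 4d\left(2 + \sqrt{r}\right)\exp\left(-\frac{\sqrt{r}}{2}\right).
\end{aligned}$$

In particular, plugging $r = 4\log^2(d)$ (which is larger than $0$, since we assume $d > 1$), we get $\mathbb{E}[\|n\|_\infty^4] \leq 4\log^2(d) + 8\log(d) + 8$. Plugging this back into Eq. (5), we get that

$$\mathbb{E}\left[\|u\|_\infty^4\right] \leq \exp\left(-\frac{d}{16}\right) + \left(\frac{2}{d}\right)^2\bigl(4\log^2(d) + 8\log(d) + 8\bigr),$$

which can be shown to be at most $c'\left(\frac{\log(d)}{d}\right)^2$ for all $d > 1$, where $c'$ is a numerical constant. In particular, this means that $\sqrt[4]{\mathbb{E}[\|u\|_\infty^4]} \leq \sqrt[4]{c'}\sqrt{\frac{\log(d)}{d}}$ as required.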
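The statement of Lemma 1 can also be checked empirically. The sketch below is illustrative only: it estimates $\sqrt[4]{\mathbb{E}[\|u\|_\infty^4]}$ by Monte Carlo and divides it by $\sqrt{\log(d)/d}$, so that a ratio which stays bounded as $d$ grows is consistent with the lemma. The function name and sample size are assumptions, and the check says nothing about the exact value of the constant.

```python
import numpy as np

def linf_fourth_root_ratio(d, n_samples=2000, seed=0):
    """Estimate (E[||u||_inf^4])^(1/4) / sqrt(log(d)/d) for u uniform on the unit sphere."""
    rng = np.random.default_rng(seed)
    n = rng.standard_normal((n_samples, d))
    u = n / np.linalg.norm(n, axis=1, keepdims=True)   # u has the law of n / ||n||_2
    fourth_root = np.mean(np.max(np.abs(u), axis=1) ** 4) ** 0.25
    return float(fourth_root / np.sqrt(np.log(d) / d))
```

Evaluating the ratio at, say, `d = 10, 100, 1000` should give values of similar magnitude, rather than values that grow with the dimension.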