A gradient estimator via L1-randomization for online zero-order optimization with two point feedback

05/27/2022
by   Arya Akhavan, et al.
CNRS
Istituto Italiano di Tecnologia

This work studies online zero-order optimization of convex and Lipschitz functions. We present a novel gradient estimator based on two function evaluations and randomization on the ℓ_1-sphere. Considering different geometries of feasible sets and Lipschitz assumptions, we analyse the online mirror descent algorithm with our estimator in place of the usual gradient. We consider two types of assumptions on the noise of the zero-order oracle: canceling noise and adversarial noise. We provide an anytime and completely data-driven algorithm, which is adaptive to all parameters of the problem. In the case of canceling noise, which was previously studied in the literature, our guarantees are either comparable to or better than the state-of-the-art bounds obtained by duchi2015 and Shamir17 for non-adaptive algorithms. Our analysis is based on deriving a new Poincaré type inequality for the uniform measure on the ℓ_1-sphere with explicit constants, which may be of independent interest.


1 Introduction

In this work we study the problem of convex online zero-order optimization with two-point feedback, in which an adversary fixes a sequence of convex functions and the goal of the learner is to minimize the cumulative regret with respect to the best action in a prescribed convex set. This problem has received significant attention in the context of continuous bandits and online optimization (see e.g., agarwal2010; flaxman2004; saha2011; bubeck2012regret; Shamir17; bubeck2017kernel; akhavan2020; Gasnikov; lattimore21a, and references therein).

We consider the following protocol: at each round the algorithm chooses (that can be queried outside of ) and the adversary reveals

where are the noise variables (random or not) to be specified. Based on the above information and the previous rounds, the learner outputs and suffers loss . The goal of the learner is to minimize the cumulative regret

At the core of our approach is a novel zero-order gradient estimator based on two function evaluations outlined in Algorithm 1. A key novelty of our estimator is that it employs a randomization step over the ℓ_1-sphere. This is in contrast to related prior work (see e.g., NY1983; PT90; flaxman2004; agarwal2010; Shamir13; duchi2015; BP2016; Gasnikov2017; akhavan2020; akhavan2021distributed) that employed or type randomizations to define . We use the proposed estimator within an online mirror descent procedure to tackle the zero-order online convex optimization problem, matching or improving the state-of-the-art results. duchi2015 and Shamir17 have studied instances of the above problem under the assumption that , which we will refer to as the canceling noise assumption. Specifically, duchi2015 considered the stochastic optimization framework where , , and obtained bounds on the optimization error rather than on cumulative regret, while Shamir17 analyzed the case . The results in duchi2015; Shamir17 are obtained for objective functions that are Lipschitz with respect to the -norm for and . The proposed method allows us to improve upon these results in several aspects.

Contributions. The contributions of the present paper can be summarized as follows. 1) We present a new randomized zero-order gradient estimator and study its statistical properties, both under canceling noise and under adversarial noise (see Lemma 1 and Lemma 4); 2) In the canceling noise case () in Theorem 1 we show that mirror descent based on our gradient estimator either improves or matches the state-of-the-art bounds duchi2015; Shamir17. We derive the results for Lipschitz functions with respect to all -norms, . In particular, when and is the probability simplex, our bound is better by a factor than that of duchi2015; Shamir17; 3) We propose a completely data-driven and anytime version of the algorithm, which is adaptive to all parameters of the problem. We show that it achieves performance analogous to that of the non-adaptive algorithm in the case of canceling noise and only slightly worse performance under adversarial noise. To the best of our knowledge, no adaptive algorithms were developed for zero-order online problems so far; 4) As a key element of our analysis, we derive in Lemma 3 a Poincaré type inequality with explicit constants for the uniform measure on the ℓ_1-sphere. This result may be of independent interest.

Notation.  Throughout the paper we use the following notation. We denote by the -norm in . For any we denote by the component-wise sign function (defined at as ). We let be the standard inner product in . For we introduce the open -ball and -sphere respectively as

For two , we denote by (resp. ) the minimum (resp. the maximum) between and . We denote by , the gamma function. In what follows, always stands for the natural logarithm and .

2 The algorithm

Let be a closed convex subset of and let be a convex function. The procedure that we propose in this paper is summarized in Algorithm 1.

Input: Convex function , step size , and parameters , for
Initialization: Generate independently vectors uniformly distributed on , set
for   do
    Output:
    and   // Query
    // ℓ_1-gradient estimate
    // Update
end for
Algorithm 1 Zero-Order ℓ_1-randomized online mirror descent
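As a rough illustration of how the steps of Algorithm 1 could fit together, the following Python sketch runs an entropic mirror descent (exponentiated-gradient) update on the probability simplex, driven by a two-point ℓ_1-randomized estimate. Several things here are assumptions made for the example and not a transcription of the algorithm: the estimator form (d / (2h)) (y' - y'') sign(ζ), the entropic potential, the constant step size `eta`, the fixed `h`, the noiseless losses, and all function names.

```python
import math
import random


def sample_l1_sphere(d, rng):
    """Uniform draw on the l1-sphere: i.i.d. Laplace vector divided by its l1-norm."""
    w = [rng.choice((-1.0, 1.0)) * rng.expovariate(1.0) for _ in range(d)]
    s = sum(abs(v) for v in w)
    return [v / s for v in w]


def sign(v):
    return 1.0 if v > 0 else (-1.0 if v < 0 else 0.0)


def two_point_estimate(f, x, h, zeta):
    """Assumed estimator: (d / (2h)) * (f(x + h*zeta) - f(x - h*zeta)) * sign(zeta)."""
    d = len(x)
    y_plus = f([xi + h * zi for xi, zi in zip(x, zeta)])    # query, possibly outside the set
    y_minus = f([xi - h * zi for xi, zi in zip(x, zeta)])
    scale = d * (y_plus - y_minus) / (2.0 * h)
    return [scale * sign(zi) for zi in zeta]


def entropic_omd(losses, d, eta, h, seed=0):
    """Play exponentiated-gradient actions on the simplex using zero-order feedback only."""
    rng = random.Random(seed)
    z = [0.0] * d                        # running sum of -eta * (gradient estimates)
    x = [1.0 / d] * d
    for f in losses:
        m = max(z)                       # subtract the max for numerical stability
        w = [math.exp(zi - m) for zi in z]
        total = sum(w)
        x = [wi / total for wi in w]     # action played in this round
        g = two_point_estimate(f, x, h, sample_l1_sphere(d, rng))
        z = [zi - eta * gi for zi, gi in zip(z, g)]
    return x                             # last action played
```

For instance, with a fixed linear loss `f = lambda x: sum(ci * xi for ci, xi in zip(c, x))`, calling `entropic_omd([f] * 3000, d=len(c), eta=0.02, h=0.01)` should move most of the mass to the coordinate with the smallest cost; the step size in this usage is picked by hand and is not the tuning from the theorems.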

Intuition behind the gradient estimate.  The form of gradient estimator in Algorithm 1 is explained by Stokes’ theorem (see Theorem 5 in the appendix and the discussion that follows). Stokes’ theorem provides a connection between the gradient of a function (first order information) and itself (zero order information). Under some regularity conditions, it establishes that

where is the boundary of , is the outward normal vector to , and denotes the surface measure. Introducing and distributed uniformly on and respectively, we can re-write the above identity as

where is the surface area of and is its volume. In what follows we consider the special case . For this choice of we have with leading to our gradient estimate for the two-point feedback setup.
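For concreteness, the chain of identities described in this paragraph can be written out as follows. The displayed formulas are not legible in this copy, so the symbols $D$, $n$, $S$, $U$, $\zeta$, $B_1^d$ and the explicit volume and surface constants below are supplied as standard facts and notational assumptions; the rescaling by $h$ and the symmetrization step are the usual ones and are offered as a sketch, not as the authors' exact derivation.

```latex
\begin{align*}
% Divergence (Stokes) identity on a domain D with outward unit normal n
\int_{D} \nabla f(x)\,\mathrm{d}x
  &= \int_{\partial D} f(x)\, n(x)\,\mathrm{d}S(x), \\
% Probabilistic form: U uniform on D, \zeta uniform on \partial D
\mathbf{E}\big[\nabla f(U)\big]
  &= \frac{S(\partial D)}{\mathrm{Vol}(D)}\,
     \mathbf{E}\big[f(\zeta)\, n(\zeta)\big], \\
% Special case D = B_1^d, the unit l1-ball (normal defined almost everywhere)
n(x) = \frac{\operatorname{sign}(x)}{\sqrt{d}}, \qquad
\mathrm{Vol}(B_1^d) &= \frac{2^d}{d!}, \qquad
S(\partial B_1^d) = \frac{2^d\sqrt{d}}{(d-1)!}, \qquad
\frac{S(\partial B_1^d)}{\mathrm{Vol}(B_1^d)} = d^{3/2}, \\
% Rescale by h > 0, then use that \zeta and -\zeta have the same law
\mathbf{E}\big[\nabla f(x + hU)\big]
  &= \frac{d}{h}\,\mathbf{E}\big[f(x + h\zeta)\operatorname{sign}(\zeta)\big]
   = \frac{d}{2h}\,\mathbf{E}\big[\big(f(x + h\zeta) - f(x - h\zeta)\big)
     \operatorname{sign}(\zeta)\big].
\end{align*}
```

The last expression suggests a two-point estimate of the form $\tfrac{d}{2h}(y' - y'')\operatorname{sign}(\zeta)$, where $y'$ and $y''$ denote the (possibly noisy) values observed at $x + h\zeta$ and $x - h\zeta$.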

Computational aspects. Let us highlight two appealing practical features of the ℓ_1-randomized gradient estimator in Algorithm 1. First, we can easily evaluate any -norm of . Indeed, it holds that , i.e., computing only requires elementary operations. Second, this gradient estimator is very economical in terms of the required memory. In order to store we only need d bits and one float. Neither of these properties is inherent to the popular alternatives based on randomization over the ℓ_2-sphere (see e.g., NY1983; flaxman2004; BP2016) or on Gaussian randomization (see e.g., Nesterov2011; NS17; Ghadimi2013).
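To make the two computational remarks concrete, here is a small Python sketch. It assumes, as the storage remark suggests, that the estimate is a scalar multiple of a sign vector, g = c · s with s in {-1, +1}^d (zero entries occur with probability zero). Under that assumption the q-norm of g equals |c| d^{1/q} for finite q and |c| for q = ∞. The helper names `lq_norm_scaled_sign`, `pack` and `unpack` are hypothetical and exist only for this example.

```python
import math


def lq_norm_scaled_sign(c, d, q):
    """l_q-norm of c * s for any s in {-1, +1}^d, computed in O(1)."""
    if math.isinf(q):
        return abs(c)
    return abs(c) * d ** (1.0 / q)


def pack(c, signs):
    """Store c * signs as one float plus d bits (bit i is set iff signs[i] == +1)."""
    bits = 0
    for i, s in enumerate(signs):
        if s > 0:
            bits |= 1 << i
    return c, bits


def unpack(c, bits, d):
    """Recover the dense vector from its compact representation."""
    return [c if (bits >> i) & 1 else -c for i in range(d)]
```

A dense ℓ_2-sphere or Gaussian direction would instead need d floats, and its q-norm would have to be computed coordinate by coordinate.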

In order to get one needs to generate distributed uniformly on . The most straightforward way to do this consists in first generating a d-dimensional vector of i.i.d. centered scaled Laplace random variables and then normalizing this vector by its ℓ_1-norm. The result is guaranteed to follow the uniform distribution on (see e.g., Schechtman_Zinn90, Lemma 1).
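The sampling recipe just described might look as follows in Python, using only the standard library. Drawing each Laplace variable as a random sign times a unit-rate exponential is one convenient choice assumed here, and the function name `sample_l1_sphere` is illustrative.

```python
import random


def sample_l1_sphere(d, rng=random):
    """Draw a point uniformly from the unit l1-sphere in R^d.

    Each coordinate is a standard Laplace variable (random sign times an
    Exp(1) magnitude); dividing by the l1-norm maps the vector onto the sphere.
    """
    w = [rng.choice((-1.0, 1.0)) * rng.expovariate(1.0) for _ in range(d)]
    norm1 = sum(abs(v) for v in w)
    return [v / norm1 for v in w]


if __name__ == "__main__":
    zeta = sample_l1_sphere(4, random.Random(0))
    print(zeta, sum(abs(v) for v in zeta))  # the l1-norm is 1 up to rounding
```

The cost is d exponential draws, d random signs and one normalization, so sampling is linear in the dimension.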

3 Assumptions

We say that the convex function is -strongly convex with respect to the -norm on if

for all and all , where is the subdifferential of at point .

Throughout the paper, we assume that , , and set such that (resp. for ) with the usual convention . We will use the following assumptions.

Assumption 1. The following conditions hold:

  1. The set is compact and convex.

  2. There exists , which is -strongly convex on w.r.t. the -norm and such that

    for some constant .

  3. Each function is convex on for all .

  4. For all , and all we have for some constant .

Assumption 1 is rather standard in the study of mirror descent-type algorithms and has been previously considered in the context of zero-order problems in duchi2015; Shamir17. Note that the constant is not necessarily dimension independent. Below we provide two classical examples of (see e.g., Shalev-Shwartz, Section 2).

Example 1.

Let be any convex subset of and . Then, is -strongly convex on w.r.t. the -norm.

Example 2.

Let . Then is -strongly convex on w.r.t. the -norm and .

Assumptions on the noise. We consider two different assumptions on the noises . The first noise assumption is common in the stochastic optimization context (see e.g., duchi2015; Shamir17; Ghadimi2013; Nesterov2011; NS17).

Assumption 2 (Canceling noise). For all , it holds that almost surely.

Formally, Assumption 2 permits noisy evaluations of function values. However, since we are allowed to query at two points, taking the difference of and in the estimator of the gradient effectively erases the noise. This results in a smaller variance of our gradient estimator. Importantly, Assumption 2 covers the case of no noise, that is, the classical online optimization setting as defined, e.g., in Shalev-Shwartz.
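A tiny Python sketch of why a common noise term drops out of a two-point difference. The scaled-sign form of the estimate, the quadratic test function `f`, the fixed direction `zeta` and the noise value `xi` are all assumptions chosen for illustration.

```python
def two_point_estimate(y_plus, y_minus, zeta, h):
    """Assumed form: (d / (2h)) * (y' - y'') * sign(zeta)."""
    d = len(zeta)
    scale = d * (y_plus - y_minus) / (2.0 * h)
    return [scale * (1.0 if z > 0 else -1.0 if z < 0 else 0.0) for z in zeta]


def f(x):
    """Simple smooth test function: squared Euclidean norm."""
    return sum(v * v for v in x)


x = [0.3, -0.1, 0.5]
zeta = [0.5, -0.25, 0.25]        # a fixed point on the l1-sphere
h = 0.125
x_plus = [a + h * b for a, b in zip(x, zeta)]
x_minus = [a - h * b for a, b in zip(x, zeta)]

xi = 7.0                          # the same noise value corrupts both queries
g_noisy = two_point_estimate(f(x_plus) + xi, f(x_minus) + xi, zeta, h)
g_clean = two_point_estimate(f(x_plus), f(x_minus), zeta, h)
```

Up to floating-point rounding, `g_noisy` and `g_clean` coincide. If the two queries were corrupted by different noise values, their difference would instead be multiplied by d / (2h), which is the source of the bias-variance trade-off under adversarial noise.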

Second, we consider an adversarial noise assumption, which is essentially equivalent to the assumptions used in akhavan2020; akhavan2021distributed.

Assumption 3 (Adversarial noise). For all , it holds that: (i) and ; (ii) and are independent of . (Informally, part (ii) of Assumption 3 is always satisfied. Indeed, ’s and ’s are coming from the environment and are unknown to the learner while ’s are artificially generated by the learner. We mention part (ii) only for formal mathematical rigor.)

Assumption 3 allows for stochastic and that are not necessarily zero-mean or independent over the trajectory. Furthermore, it permits bounded non-stochastic adversarial noises.

Note that, since the choice of function belongs to the learner and is given, it is always reasonable to assume that parameter is known. At the same time, parameters and may be either known or unknown. We will study both cases in the next sections.

4 Upper bounds on the regret

In this section, we present the main convergence results for Algorithm 1 when are known to the learner. The case when they are unknown is analyzed in Section 5, where we develop fully adaptive versions of Algorithm 1.

To state our results in a unified way, we introduce the following sequence that depends on the dimension and on the norm index :

The value will explicitly influence the choice of the step size and of the discretization parameter .

The first result of this section establishes the convergence guarantees under the canceling noise assumption. This case was previously considered by duchi2015 and Shamir17.

Theorem 1.

Let Assumptions 1 and 2 be satisfied. Then, Algorithm 1 with the parameters

where , satisfies, for any ,

Note that, as in other related works duchi2015; Shamir17; liang2014zeroth; Nesterov2011; NS17, under the canceling noise (or no noise) assumption the discretization parameter can be chosen arbitrarily small. This is due to the fact that, under the canceling noise assumption, the variance of the gradient estimate is bounded by a constant independent of . This is no longer the case under the adversarial noise assumption, as exhibited in the next theorem.

Theorem 2.

Let Assumptions 1 and 3 be satisfied. Then Algorithm 1 with the parameters

where , satisfies, for any ,

Comparison to state-of-the-art bounds. We provide two examples of and compare results for our new method to those of duchi2015; Shamir17 where only the canceling noise Assumption 2 and were considered.

Corollary 1.

Let and . Then under Assumption 2, Algorithm 1 with defined in Example 1, satisfies

In the setup of Corollary 1, duchi2015 obtain rate and Shamir17 exhibits , which is the optimal rate. Neither result specifies the leading absolute constants.

Corollary 2.

Let and . Then under Assumption 2, Algorithm 1 with defined in Example 2, satisfies

In the setup of Corollary 2, Shamir17 proves the rate for the method with ℓ_2-randomization. On the other hand, duchi2015 derived a lower bound . Thus, our algorithm further reduces the gap between the upper and the lower bounds.

5 Adaptive algorithms

Theorems 1 and 2 used the step size and the discretization parameter that depend on the potentially unknown quantities , , and the optimization horizon . In this section, we show that, under the canceling noise Assumption 2, adaptation to unknown comes with nearly no price. On the other hand, under the adversarial noise Assumption 3, our adaptive rate has a slightly worse dependence on and in the dominant term. The proof is based on combining the adaptive scheme for online mirror descent (see Section 6.9 in Orabona2019, for an overview) with our bias and variance evaluations, cf. Section 6 below.

Theorem 3.

Let Assumptions 1 and 2 be satisfied. Then, Algorithm 1 with the parameters (we adopt the convention that , if )

satisfies for any

The above result gives, up to an absolute constant, the same convergence rate as that of the non-adaptive Theorem 1. In other words, the price of adaptivity does not depend on the parameters of the problem. Finally, we derive an adaptive algorithm under Assumption 3.

Theorem 4.

Let Assumptions 1 and 3 be satisfied. Then, Algorithm 1 with the parameters

satisfies for any

Note that the bound of Theorem 4 has a less advantageous dependency on and compared to Theorem 2, where we had instead of . We remark that if is known but is unknown, one can recover the dependency by selecting depending on . We do not state this result that can be derived in a similar way and favor here only the fully adaptive version.

6 Elements of proofs

In this section, we outline the major ingredients for the proofs of Theorems 1–4. The full proofs can be found in Appendix C. Here, we only focus on novel elements without reproducing the general scheme of online mirror descent analysis (see e.g., Shalev-Shwartz; Orabona2019). Namely, we highlight two key facts, which are the smoothing lemma (Lemma 1) and the Poincaré type inequality for the uniform measure on (Lemma 3) used to control the variance.

6.1 Bias and smoothing lemma

First, as in prior work that used smoothing ideas (see e.g., NY1983; flaxman2004; Shamir17), we show that our gradient estimate is an unbiased estimator of a surrogate version of and establish its approximation properties.

Lemma 1 (Smoothing lemma).

Fix and . Let be an -Lipschitz function w.r.t. the -norm. Let be distributed uniformly on and be distributed uniformly on . Let for . Then is differentiable and

Furthermore, we have for all and all ,

(1)

Finally, if is convex and is convex in , then is convex in and for .

Proof.

There are three claims to prove. For the first one, we notice that has the same distribution as , hence,

and the first claim follows from Theorem 6 in the Appendix (a version of Stokes’, or divergence, theorem) applied to with the observation that where is the gradient defined almost everywhere and whose existence is ensured by the Rademacher theorem.

We now prove the approximation property (1). Assuming grants that . Since is -Lipschitz w.r.t. the -norm we get that, for any ,

(2)

If then (1) follows from Lemma 2. If then using again Lemma 2 we find

which together with (2) yields the desired bound.

Finally, if is convex in , then for all and we have

Thus is indeed convex on . Furthermore, again by convexity of , we deduce that for any

where . ∎

The proof of Lemma 1 relies on the control of the -norm of random vector established in the next result.

Lemma 2.

Let and let be distributed uniformly on . Then .

Proof.

Let be random variables having the Laplace distribution with mean 0 and scale parameter 1. Set . Then, following (Barthe_Guedon_Mendelson_Naor05, Theorem 1) we have

where denotes equality in distribution. Furthermore, (Schechtman_Zinn90, Lemma 1) states that the random variables

are independent. Hence, for any , it holds that

where follows from Jensen’s inequality and uses the fact that for all . ∎

6.2 Variance and Poincaré type inequality

We additionally need to control the squared -norm of each gradient estimator . This is where we get the main improvement of our procedure compared to previously proposed methods. To derive the result, we first establish the following lemma of independent interest, which allows us to control the variance of Lipschitz functions on . The proof of this lemma is given in the Appendix.

Lemma 3.

Let . Assume that is a continuously differentiable function and is distributed uniformly on . Then

Furthermore, if is an -Lipschitz function w.r.t. the -norm then

Remark 1.

Since for all , the last inequality of Lemma 3 implies that

(3)

We can now deduce the following bound on the squared -norm of .

Lemma 4.

Let and . Assume that is -Lipschitz w.r.t. the -norm. Then, for all ,

Proof.

Using the definition of we get

Let . First, observe that and under both Assumption 2 and Assumption 3(ii) it holds that . Using these remarks and the fact that under adversarial noise Assumption 3, , we find:

Furthermore, since is -Lipschitz, w.r.t. the -norm, the map is -Lipschitz w.r.t. the -norm. Applying inequality (3) to bound , yields the desired result. ∎

Note that under adversarial noise Assumption 3, the bound on the squared -norm of gets an additional term . In contrast to the case of canceling noise Assumption 2, this does not allow us to take arbitrarily small, hence inducing a bias-variance trade-off.

7 Numerical illustration

In this section, we provide a numerical comparison of our algorithm with the method based on ℓ_2-randomization from Shamir17 (see Appendix D for the definition). We consider the no noise model and , , with the function defined as

where such that for . We choose

As stated in Example 2, is -strongly convex on w.r.t. the -norm and . Moreover, is a Lipschitz function w.r.t. the -norm. We deploy the adaptive parameterization that appears in Theorem 3. In Figure 1 we present the optimization error of the algorithms, which is defined as

Figure 1: Optimization error vs. number of iterations for ℓ_2-randomization (as in Shamir17) and our method.

The results are averaged over trials and the standard deviation is reported in the shaded area. One can observe that the ℓ_1-randomization method performs significantly better than the ℓ_2-randomization algorithm. The theoretical bound for our method in this setup has a gain in the rate.

8 Discussion and comparison to prior work

We introduced and analyzed a novel estimator for the gradient based on randomization over the ℓ_1-sphere. We established guarantees for the online mirror descent algorithm with the gradient replaced by the proposed estimator. We provided an anytime and completely data-driven algorithm, which is adaptive to all parameters of the problem. Our analysis is based on deriving a Poincaré type inequality for the uniform measure on the ℓ_1-sphere that may be of independent interest.
Under the canceling noise assumption and , our setting is analogous to duchi2015; Shamir17. For the case and canceling noise, we show that the performance of our method is the same as in (Shamir17, Corollary 2) up to absolute constants that were not made explicit in Shamir17. For the case of and canceling noise, we improved the bound (Shamir17, Corollary 3) by a factor. For the case , , comparing with the lower bound in (duchi2015, Proposition 1), shows that the result of Theorem 1 is minimax optimal. For the case , (duchi2015, Proposition 2) shows that our result in Theorem 1 is optimal up to a factor.
Under the adversarial noise assumption, Theorem 2 provides the rate , that is, we get an additional factor compared to the canceling noise case. It remains unclear whether it is optimal under adversarial noise – this question deserves further investigation. Note that, under sub-Gaussian noise assumption and , one can achieve the rate with a relatively big  agarwal2011; belloni2015escaping; bubeck2017kernel; lattimore21a. In particular, with an ellipsoid type method lattimore21a obtains the rate for the cumulative regret. However, in a setup of optimizing nearly convex functions (that can be considered as related to the case of adversarial noise), the optimal rate for some class of polynomial time algorithms is  (see risteski2016algorithms, Theorem 3.1).

The work of E. Chzhen was supported by ANR PIA funding: ANR-20-IDEES-0002. The third author is partially supported by European Union’s Horizon 2020 research and innovation programme under ELISE grant agreement No 951847. The research of A.B. Tsybakov is supported by a grant of the French National Research Agency (ANR), “Investissements d’Avenir” (LabEx Ecodec/ANR-11-LABX-0047).

References

Appendix A Integration by parts

We first recall the following result that can be found in (zorich2016, Section 13.3.5, Exercise 14a).

Theorem 5 (Integration by parts in a multiple integral).

Let be an open connected subset of with a piecewise smooth boundary oriented by the outward unit normal . Let be a continuously differentiable function in . Then

Remark 2.

We refer to (zorich2016, Section 12.3.2, Definitions 4 and 5) for the definition of piecewise smooth surfaces and their orientations respectively.

The idea of using the instance of Theorem 5 (also called Stokes’ theorem) with to obtain ℓ_2-randomized estimators of the gradient belongs to NY1983. It was further used in several papers (flaxman2004; BP2016; Shalev-Shwartz; Shamir17), to mention just a few. Those papers referred to NY1983, but NY1983 did not provide an exact statement of the result (nor a reference) and only mentioned the idea in a discussion. However, the classical analysis formulation as presented in Theorem 5 does not apply to Lipschitz continuous functions that were considered in (flaxman2004; BP2016; Shalev-Shwartz; Shamir17). We are not aware of whether its extension to Lipschitz continuous functions, though rather standard, is proved in the literature.

In this paper, we apply Theorem 5 with the ℓ_1-ball . Our aim in this section is to provide a variant of Theorem 5 applicable to a Lipschitz continuous function , which is not necessarily continuously differentiable on . To this end, we will go through the argument of approximating by functions, where is an open bounded connected subset of such that . Let , where is the standard mollifier. Let be a function satisfying the Lipschitz condition w.r.t. the -norm: . Since is continuous in and, by construction , then using basic properties of mollification (see e.g., evans2018measure, Theorem 4.1 (ii)) we have

uniformly on (in particular, uniformly on ). Furthermore, let be the gradient of , which by Rademacher theorem (see e.g., evans2018measure, Theorem 3.2) is well defined almost everywhere w.r.t. the Lebesgue measure and

It follows that is absolutely integrable on for any . Furthermore, since

we can apply (evans2018measure, Theorem 4.1 (iii)) that yields

Combining the above remarks we obtain that the result of Theorem 5 is valid for functions that are Lipschitz continuous w.r.t. the -norm. Thus, it is also valid when the Lipschitz condition is imposed w.r.t. any -norm with . Specifying this conclusion for the particular case , we obtain the following theorem.

Theorem 6.

Let the function be Lipschitz continuous w.r.t. the -norm with . Then

where is defined up to a set of zero Lebesgue measure by the Rademacher theorem.

Appendix B Proof of Lemma 3

To prove Lemma 3, we first recall the Poincaré inequality for the univariate exponential measure (the Laplace distribution with mean 0 and scale parameter 1).

Lemma 5 (Lemma 2.1 in bobkov1997poincare).

Let be a Laplace random variable with mean 0 and scale parameter 1. Let be a continuous, almost everywhere differentiable function such that

then,

We are now in a position to prove Lemma 3. The proof is inspired by (barthe2009remarks, Lemma 2).

Proof of Lemma 3.

First we assume that is continuously differentiable. Let be a vector of i.i.d. Laplace random variables with mean 0 and scale parameter 1, and define . Introduce the notation

Lemma 1 in Schechtman_Zinn90 asserts that, for uniformly distributed on ,

(4)

In particular, . Using the Efron-Stein inequality (see e.g., boucheron2013concentration, Theorem 3.1) we obtain

where

with . Note that on the event (whose complement has zero measure), the function

satisfies the assumptions of Lemma 5. Thus,