1 Introduction
In this work we study the problem of convex online zero-order optimization with two-point feedback, in which an adversary fixes a sequence of convex functions and the goal of the learner is to minimize the cumulative regret with respect to the best action in a prescribed convex set. This problem has received significant attention in the context of continuous bandits and online optimization (see e.g., agarwal2010; flaxman2004; saha2011; bubeck2012regret; Shamir17; bubeck2017kernel; akhavan2020; Gasnikov; lattimore21a, and references therein).
We consider the following protocol: at each round the algorithm chooses query points (which may lie outside of ) and the adversary reveals
where are the noise variables (random or not) to be specified. Based on the above information and the previous rounds, the learner outputs and suffers loss . The goal of the learner is to minimize the cumulative regret
At the core of our approach is a novel zero-order gradient estimator based on two function evaluations, outlined in Algorithm 1. A key novelty of our estimator is that it employs a randomization step over the ℓ1 sphere. This is in contrast to related prior work (see e.g., NY1983; PT90; flaxman2004; agarwal2010; Shamir13; duchi2015; BP2016; Gasnikov2017; akhavan2020; akhavan2021distributed) that employed ℓ2 or Gaussian type randomizations to define . We use the proposed estimator within an online mirror descent procedure to tackle the zero-order online convex optimization problem, matching or improving the state-of-the-art results. duchi2015 and Shamir17 studied instances of the above problem under the assumption that , which we will further refer to as the canceling noise assumption. Specifically, duchi2015 considered the stochastic optimization framework where , , and obtained bounds on the optimization error rather than on the cumulative regret, while Shamir17 analyzed the case . The results in duchi2015; Shamir17 are obtained for objective functions that are Lipschitz with respect to the norm for and . The proposed method allows us to improve upon these results in several aspects.
Contributions. The contributions of the present paper can be summarized as follows. 1) We present a new randomized zero-order gradient estimator and study its statistical properties, both under canceling noise and under adversarial noise (see Lemma 1 and Lemma 4); 2) In the canceling noise case () in Theorem 1 we show that mirror descent based on our gradient estimator either improves or matches the state-of-the-art bounds of duchi2015; Shamir17. We derive the results for Lipschitz functions with respect to all norms, . In particular, when and
is the probability simplex, our bound is better by a
factor than that of duchi2015; Shamir17; 3) We propose a completely data-driven and anytime version of the algorithm, which is adaptive to all parameters of the problem. We show that it achieves performance analogous to the non-adaptive algorithm in the case of canceling noise and only slightly worse performance under adversarial noise. To the best of our knowledge, no adaptive algorithms have been developed for zero-order online problems so far; 4) As a key element of our analysis, we derive in Lemma 3 a Poincaré type inequality with explicit constants for the uniform measure on the ℓ1 sphere. This result may be of independent interest.

Notation. Throughout the paper we use the following notation. We denote by the norm in . For any we denote by the componentwise sign function (defined at as ). We let be the standard inner product in . For we introduce the open ball and sphere respectively as
For two , we denote by (resp. ) the minimum (resp. the maximum) between and . We denote by , the gamma function. In what follows, always stands for the natural logarithm and .
2 The algorithm
Let be a closed convex subset of and let be a convex function. The procedure that we propose in this paper is summarized in Algorithm 1.
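Since the display equations of Algorithm 1 are not reproduced here, the following minimal sketch is only an assumption-laden illustration: it plugs a two-point gradient estimate of the form g = (d/(2h))(f(x+hζ) − f(x−hζ)) sign(ζ), with ζ uniform on the ℓ1 sphere, into the Euclidean special case of mirror descent (projected gradient descent, cf. Example 1 below). The step size, the horizon, and the test objective are illustrative choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(1)
d, T, h = 10, 20000, 1e-5
x_star = np.full(d, 0.5 / np.sqrt(d))       # hypothetical target inside the unit ball
f = lambda x: np.abs(x - x_star).sum()      # a convex Lipschitz loss, fixed over rounds

def project_l2_ball(x):
    """Euclidean projection onto the unit l2 ball (the constraint set here)."""
    n = np.linalg.norm(x)
    return x if n <= 1.0 else x / n

eta = 1.0 / (d ** 1.5 * np.sqrt(T))         # a step size of a plausible order for this setup
x = np.zeros(d)
avg = np.zeros(d)
for t in range(T):
    y = rng.laplace(size=d)
    zeta = y / np.abs(y).sum()              # uniform on the l1 sphere
    # two-point zero-order gradient estimate (assumed form)
    g = d / (2 * h) * (f(x + h * zeta) - f(x - h * zeta)) * np.sign(zeta)
    x = project_l2_ball(x - eta * g)        # mirror step, Euclidean case
    avg += x / T

# the averaged iterate should improve substantially on the starting point
assert f(avg) < f(np.zeros(d))
```

Note that the queries x ± hζ may leave the constraint set, which the protocol above explicitly allows.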
Intuition behind the gradient estimate. The form of the gradient estimator in Algorithm 1 is explained by Stokes’ theorem (see Theorem 5 in the appendix and the discussion that follows). Stokes’ theorem provides a connection between the gradient of a function (first-order information) and the function itself (zero-order information). Under some regularity conditions, it establishes that
where is the boundary of , is the outward normal vector to , and denotes the surface measure. Introducing and distributed uniformly on and respectively, we can rewrite the above identity as
where is the surface area of and is its volume. In what follows we consider the special case . For this choice of we have with , leading to our gradient estimate for the two-point feedback setup.
Computational aspects. Let us highlight two appealing practical features of the randomized gradient estimator in Algorithm 1. First, we can easily evaluate any norm of . Indeed, it holds that , i.e., computing only requires elementary operations. Second, this gradient estimator is very economical in terms of the required memory: in order to store we only need bits and a float. Neither of these properties is shared by the popular alternatives based on randomization over the ℓ2 sphere (see e.g., NY1983; flaxman2004; BP2016) or on Gaussian randomization (see e.g., Nesterov2011; NS17; Ghadimi2013).
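The two properties above can be checked directly in code. The estimator form g = (d/(2h))(f(x+hζ) − f(x−hζ)) sign(ζ), with ζ uniform on the ℓ1 sphere, is an assumption inferred from the surrounding discussion (the displayed formulas of Algorithm 1 are not reproduced here): since g is a single scalar times a sign vector, every norm of g is elementary, and storing g takes d sign bits plus one float.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_l1_sphere(d, rng):
    """Uniform on the l1 unit sphere: normalize i.i.d. Laplace draws by their l1 norm."""
    y = rng.laplace(loc=0.0, scale=1.0, size=d)
    return y / np.abs(y).sum()

def grad_estimate(f, x, h, zeta, d):
    """Two-point zero-order gradient estimate: a single scalar times a sign vector."""
    scale = d / (2.0 * h) * (f(x + h * zeta) - f(x - h * zeta))
    return scale * np.sign(zeta)

d, h = 5, 1e-3
f = lambda x: np.abs(x).sum()              # a Lipschitz test function (illustrative)
zeta = sample_l1_sphere(d, rng)
g = grad_estimate(f, np.ones(d), h, zeta, d)

assert abs(np.abs(zeta).sum() - 1.0) < 1e-12   # zeta lies on the l1 sphere
c = np.abs(g).max()
assert np.allclose(np.abs(g), c)               # g is componentwise +/- one scalar
assert abs(np.linalg.norm(g, 1) - c * d) < 1e-8  # hence any norm is elementary
```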
In order to get one needs to generate distributed uniformly on . The most straightforward way to do this consists in first generating a vector of i.i.d. centered scaled Laplace random variables (one per coordinate) and then normalizing this vector by its ℓ1 norm. The result is guaranteed to follow the uniform distribution on the ℓ1 sphere (see e.g., Schechtman_Zinn90, Lemma 1).

3 Assumptions
We say that the convex function is strongly convex with respect to the norm on if
for all and all , where is the subdifferential of at point .
Throughout the paper, we assume that , , and set such that (resp. for ) with the usual agreement . We will use the following assumptions. The following conditions hold:

The set is compact and convex.

There exists , which is strongly convex on w.r.t. the norm and such that
for some constant .

Each function is convex on for all .

For all , and all we have for some constant .
Assumption 3 is rather standard in the study of mirror descent-type algorithms and has been previously considered in the context of zero-order problems in duchi2015; Shamir17. Note that the constant is not necessarily dimension-independent. Below we provide two classical examples of (see e.g., ShalevShwartz, Section 2).
Example 1.
Let be any convex subset of and . Then, is strongly convex on w.r.t. the norm.
Example 2.
Let . Then is strongly convex on w.r.t. the norm and .
Assumptions on the noise. We consider two different assumptions on the noises . The first noise assumption is common in the stochastic optimization context (see e.g., duchi2015; Shamir17; Ghadimi2013; Nesterov2011; NS17). [Canceling noise] For all , it holds that almost surely. Formally, Assumption 2 permits noisy evaluations of function values. However, since we are allowed to query at two points, taking the difference of and in the estimator of the gradient effectively erases the noise, resulting in a smaller variance of our gradient estimator. Importantly, Assumption 2 covers the case of no noise, that is, the classical online optimization setting as defined, e.g., in ShalevShwartz.

Second, we consider an adversarial noise assumption, which is essentially equivalent to the assumptions used in akhavan2020; akhavan2021distributed. [Adversarial noise] For all , it holds that: (i) and ; (ii) and are independent of . (Informally, part (ii) of Assumption 3 is always satisfied: the ’s and ’s come from the environment and are unknown to the learner, while the ’s are artificially generated by the learner; we mention part (ii) only for formal mathematical rigor.) Assumption 3 allows for stochastic and that are not necessarily zero-mean or independent over the trajectory. Furthermore, it permits bounded non-stochastic adversarial noises.
Note that, since the choice of the function belongs to the learner and is given, it is always reasonable to assume that the parameter is known. At the same time, the parameters and may be either known or unknown. We will study both cases in the next sections.
4 Upper bounds on the regret
In this section, we present the main convergence results for Algorithm 1 when are known to the learner. The case when they are unknown is analyzed in Section 5, where we develop fully adaptive versions of Algorithm 1.
To state our results in a unified way, we introduce the following sequence that depends on the dimension and on the norm index :
The value will explicitly influence the choice of the step size and of the discretization parameter .
The first result of this section establishes the convergence guarantees under the canceling noise assumption. This case was previously considered by duchi2015 and Shamir17.
Theorem 1.
Note that, as in other related works (duchi2015; Shamir17; liang2014zeroth; Nesterov2011; NS17), under the canceling noise (or no noise) assumption the discretization parameter can be chosen arbitrarily small. This is due to the fact that, under the canceling noise assumption, the variance of the gradient estimate is bounded by a constant independent of . This is no longer the case under the adversarial noise assumption, as exhibited in the next theorem.
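This contrast can be illustrated with a quick Monte Carlo sketch. The linear test function, the Gaussian evaluation noise, and the estimator form g = (d/(2h))(f(x+hζ) − f(x−hζ)) sign(ζ) with ζ uniform on the ℓ1 sphere are all illustrative assumptions: without noise the empirical second moment of the estimate does not depend on the discretization parameter h, while with noisy evaluations it blows up roughly like 1/h².

```python
import numpy as np

rng = np.random.default_rng(3)
d, N, sigma = 4, 100_000, 0.1
v = np.ones(d)
f = lambda X: X @ v                           # a linear, hence Lipschitz, test function

Y = rng.laplace(size=(N, d))
Z = Y / np.abs(Y).sum(axis=1, keepdims=True)  # N draws, uniform on the l1 sphere

def second_moment(h, noisy):
    """Empirical E||g||_2^2 of the two-point estimate at x = 0 for step h."""
    up, down = f(Z * h), f(-Z * h)
    if noisy:                                  # additive Gaussian evaluation noise
        up = up + rng.normal(0.0, sigma, N)
        down = down + rng.normal(0.0, sigma, N)
    c = d / (2 * h) * (up - down)
    G = c[:, None] * np.sign(Z)
    return (G ** 2).sum(axis=1).mean()

# Without noise, the second moment is insensitive to h ...
assert abs(second_moment(1e-2, False) / second_moment(1e-6, False) - 1) < 0.05
# ... while under noisy evaluations it explodes as h decreases.
assert second_moment(1e-6, True) > 100 * second_moment(1e-2, True)
```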
Theorem 2.
Comparison to state-of-the-art bounds. We provide two examples of and compare the results for our new method to those of duchi2015; Shamir17, where only the canceling noise Assumption 2 and were considered. Let and . Then, under Assumption 2, Algorithm 1 with defined in Example 1 satisfies
In the setup of Corollary 4, duchi2015 obtain the rate and Shamir17 exhibits , which is the optimal rate. Neither result specifies the leading absolute constants. Let and . Then, under Assumption 2, Algorithm 1 with defined in Example 2 satisfies
In the setup of Corollary 4, Shamir17 proves the rate for the method with ℓ2 randomization. On the other hand, duchi2015 derived a lower bound . Thus, our algorithm further reduces the gap between the upper and the lower bounds.
5 Adaptive algorithms
Theorems 1 and 2 used the step size and the discretization parameter that depend on the potentially unknown quantities , , and the optimization horizon . In this section, we show that, under the canceling noise Assumption 2, adaptation to unknown comes at nearly no price. On the other hand, under the adversarial noise Assumption 3, our adaptive rate has a slightly worse dependence on and in the dominant term. The proof is based on combining the adaptive scheme for online mirror descent (see Section 6.9 of Orabona2019 for an overview) with our bias and variance evaluations, cf. Section 6 below.
Theorem 3.
The above result gives, up to an absolute constant, the same convergence rate as that of the non-adaptive Theorem 1. In other words, the price for adaptivity does not depend on the parameters of the problem. Finally, we derive an adaptive algorithm under Assumption 3.
Theorem 4.
Note that the bound of Theorem 4 has a less advantageous dependency on and compared to Theorem 2, where we had instead of . We remark that if is known but is unknown, one can recover the dependency by selecting depending on . We do not state this result, which can be derived in a similar way, and favor here only the fully adaptive version.
6 Elements of proofs
In this section, we outline the major ingredients of the proofs of Theorems 1–4. The full proofs can be found in Appendix C. Here, we only focus on the novel elements without reproducing the general scheme of the online mirror descent analysis (see e.g., ShalevShwartz; Orabona2019). Namely, we highlight two key facts, which are the smoothing lemma (Lemma 1) and the Poincaré type inequality for the uniform measure on (Lemma 3) used to control the variance.
6.1 Bias and smoothing lemma
First, as in prior work that used smoothing ideas (see e.g., NY1983; flaxman2004; Shamir17), we show that our gradient estimate
is an unbiased estimator of a surrogate version of
and establish its approximation properties.

Lemma 1 (Smoothing lemma).
Fix and . Let be an Lipschitz function w.r.t. the norm. Let be distributed uniformly on and be distributed uniformly on . Let for . Then is differentiable and
Furthermore, we have for all and all ,
(1) 
Finally, if is convex and is convex in , then is convex in and for .
Proof.
There are three claims to prove. For the first one, we notice that has the same distribution as , hence,
and the first claim follows from Theorem 6 in the Appendix (a version of Stokes’, or divergence, theorem) applied to , with the observation that , where is the gradient defined almost everywhere, whose existence is ensured by the Rademacher theorem.
We now prove the approximation property (1). Assuming grants that . Since is Lipschitz w.r.t. the norm we get that, for any ,
(2) 
If then (1) follows from Lemma 2. If then, using Lemma 2 again, we find
which together with (2) yields the desired bound.
Finally, if is convex in , then for all and we have
Thus is indeed convex on . Furthermore, again by convexity of , we deduce that for any
where . ∎
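The unbiasedness claim of Lemma 1 can be sanity-checked numerically. The sketch below assumes the estimator form g = (d/(2h))(f(x+hζ) − f(x−hζ)) sign(ζ) with ζ uniform on the ℓ1 sphere (the displayed formulas are not reproduced above). For a quadratic test function the smoothed surrogate differs from the function by a constant, so its gradient coincides with the true gradient and the estimator should be nearly unbiased for it.

```python
import numpy as np

rng = np.random.default_rng(2)
d, h, N = 4, 0.1, 200_000
x = np.array([1.0, -1.0, 0.5, 0.0])
f = lambda X: (X ** 2).sum(axis=-1)           # quadratic test function, grad f(x) = 2x

Y = rng.laplace(size=(N, d))
Z = Y / np.abs(Y).sum(axis=1, keepdims=True)  # N draws, uniform on the l1 sphere
c = d / (2 * h) * (f(x + h * Z) - f(x - h * Z))
G = c[:, None] * np.sign(Z)                   # N independent gradient estimates

# The Monte Carlo mean should match grad f(x) = 2x up to sampling error.
assert np.max(np.abs(G.mean(axis=0) - 2 * x)) < 0.05
```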
The proof of Lemma 1 relies on the control of the norm of random vector established in the next result.
Lemma 2.
Let and let be distributed uniformly on . Then .
Proof.
Let be random variables having the Laplace distribution with mean and scale parameter . Set . Then, following (Barthe_Guedon_Mendelson_Naor05, Theorem 1) we have
where denotes equality in distribution. Furthermore, (Schechtman_Zinn90, Lemma 1) states that the random variables
are independent. Hence, for any , it holds that
where follows from Jensen’s inequality and uses the fact that for all . ∎
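Since the displayed bound of Lemma 2 is not reproduced above, the following rough Monte Carlo illustration only shows the qualitative phenomenon behind such bounds: for ζ uniform on the ℓ1 sphere, the expected sup-norm E||ζ||∞ behaves like (log d)/d, far below the trivial bound ||ζ||∞ ≤ 1. The constants in the sandwich check are loose, illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)

def linf_moment(d, N=50_000):
    """Monte Carlo estimate of E||zeta||_inf for zeta uniform on the l1 sphere."""
    Y = rng.laplace(size=(N, d))
    Z = np.abs(Y) / np.abs(Y).sum(axis=1, keepdims=True)
    return Z.max(axis=1).mean()

# E||zeta||_inf is of order (log d)/d rather than constant:
for d in (10, 100):
    m = linf_moment(d)
    assert np.log(d) / (2 * d) < m < 3 * np.log(d) / d
```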
6.2 Variance and Poincaré type inequality
We additionally need to control the squared norm of each gradient estimator . This is where we get the main improvement of our procedure compared to previously proposed methods. To derive the result, we first establish the following lemma of independent interest, which allows us to control the variance of Lipschitz functions on . The proof of this lemma is given in the Appendix.
Lemma 3.
Let . Assume that is a continuously differentiable function and is distributed uniformly on . Then
Furthermore, if is an Lipschitz function w.r.t. the norm then
Remark 1.
Since for all , the last inequality of Lemma 3 implies that
(3) 
We can now deduce the following bound on the squared norm of .
Lemma 4.
Let and . Assume that is Lipschitz w.r.t. the norm. Then, for all ,
Proof.
Using the definition of we get
Let . First, observe that and under both Assumption 2 and Assumption 3(ii) it holds that . Using these remarks and the fact that under adversarial noise Assumption 3, , we find:
Furthermore, since is Lipschitz w.r.t. the norm, the map is Lipschitz w.r.t. the norm. Applying inequality (3) to bound yields the desired result. ∎
7 Numerical illustration
In this section, we provide a numerical comparison of our algorithm with the method based on ℓ2 randomization from Shamir17 (see Appendix D for the definition). We consider the no-noise model and , , with the function defined as
where such that for . We choose
As stated in Example 2, is strongly convex on w.r.t. the norm and . Moreover, is a Lipschitz function w.r.t. the norm. We deploy the adaptive parameterization that appears in Theorem 3. In Figure 1 we present the optimization error of the algorithms, which is defined as
The results are averaged over trials and the standard deviation is reported in the shaded area. One can observe that the ℓ1 randomization method behaves significantly better than the ℓ2 randomization algorithm. The theoretical bound for our method in this setup has a gain in the rate.

8 Discussion and comparison to prior work
We introduced and analyzed a novel estimator of the gradient based on randomization over the ℓ1 sphere. We established guarantees for the online mirror descent algorithm with the gradient replaced by the proposed estimator. We provided an anytime and completely data-driven algorithm, which is adaptive to all parameters of the problem.
Our analysis is based on deriving a Poincaré type inequality for the uniform measure on the ℓ1 sphere, which may be of independent interest.
Under the canceling noise assumption and , our setting is analogous to that of duchi2015; Shamir17. For the case and canceling noise, we show that the performance of our method is the same as in (Shamir17, Corollary 2) up to absolute constants that were not made explicit in Shamir17.
For the case of and canceling noise, we improved the bound (Shamir17, Corollary 3) by a factor. For the case , , comparing with the lower bound in (duchi2015, Proposition 1), shows that the result of Theorem 1 is minimax optimal. For the case , (duchi2015, Proposition 2) shows that our result in Theorem 1 is optimal up to a factor.
Under the adversarial noise assumption, Theorem 2 provides the rate ; that is, we get an additional factor compared to the canceling noise case. It remains unclear whether this rate is optimal under adversarial noise; the question deserves further investigation. Note that, under a sub-Gaussian noise assumption and , one can achieve the rate with a relatively big (agarwal2011; belloni2015escaping; bubeck2017kernel; lattimore21a).
In particular, with an ellipsoid-type method, lattimore21a obtains the rate for the cumulative regret.
However, in the setup of optimizing nearly convex functions (which can be considered related to the case of adversarial noise), the optimal rate for some class of polynomial-time algorithms is (see risteski2016algorithms, Theorem 3.1).
The work of E. Chzhen was supported by ANR PIA funding: ANR-20-IDEES-0002. The third author is partially supported by the European Union’s Horizon 2020 research and innovation programme under ELISE grant agreement No 951847. The research of A.B. Tsybakov is supported by a grant of the French National Research Agency (ANR), “Investissements d’Avenir” (LabEx Ecodec/ANR-11-LABX-0047).
References
Appendix A Integration by parts
We first recall the following result that can be found in (zorich2016, Section 13.3.5, Exercise 14a).
Theorem 5 (Integration by parts in a multiple integral).
Let be an open connected subset of with a piecewise smooth boundary oriented by the outward unit normal . Let be a continuously differentiable function in . Then
Remark 2.
We refer to (zorich2016, Section 12.3.2, Definitions 4 and 5) for the definition of piecewise smooth surfaces and their orientations respectively.
The idea of using the instance of Theorem 5 (also called Stokes’ theorem) with to obtain randomized estimators of the gradient goes back to NY1983. It was further used in several papers (flaxman2004; BP2016; ShalevShwartz; Shamir17), to mention just a few. Those papers referred to NY1983, but NY1983 did not provide an exact statement of the result (nor a reference) and only sketched the idea in a discussion. However, the classical analysis formulation as presented in Theorem 5 does not apply to the Lipschitz continuous functions that were considered in (flaxman2004; BP2016; ShalevShwartz; Shamir17). We are not aware whether its extension to Lipschitz continuous functions, though rather standard, has been proved in the literature.
In this paper, we apply Theorem 5 with the ball . Our aim in this section is to provide a variant of Theorem 5 applicable to a Lipschitz continuous function , which is not necessarily continuously differentiable on . To this end, we will go through the argument of approximating by functions, where is an open bounded connected subset of such that . Let , where is the standard mollifier. Let be a function satisfying the Lipschitz condition w.r.t. the norm: . Since is continuous in and, by construction, , using basic properties of mollification (see e.g., evans2018measure, Theorem 4.1 (ii)) we have
uniformly on (in particular, uniformly on ). Furthermore, let be the gradient of , which by Rademacher theorem (see e.g., evans2018measure, Theorem 3.2) is well defined almost everywhere w.r.t. the Lebesgue measure and
It follows that is absolutely integrable on for any . Furthermore, since
we can apply (evans2018measure, Theorem 4.1 (iii)), which yields
Combining the above remarks we obtain that the result of Theorem 5 is valid for functions that are Lipschitz continuous w.r.t. the norm. Thus, it is also valid when the Lipschitz condition is imposed w.r.t. any norm with . Specifying this conclusion for the particular case , we obtain the following theorem.
Theorem 6.
Let the function be Lipschitz continuous w.r.t. the norm with . Then
where is defined up to a set of zero Lebesgue measure by the Rademacher theorem.
Appendix B Proof of Lemma 3
To prove Lemma 3, we first recall the Poincaré inequality for the univariate exponential measure (mean and scale parameter Laplace distribution).
Lemma 5 (Lemma 2.1 in bobkov1997poincare).
Let be a mean and scale parameter Laplace random variable. Let be a continuous, almost everywhere differentiable function such that
then,
We are now in a position to prove Lemma 3. The proof is inspired by (barthe2009remarks, Lemma 2).
Proof of Lemma 3.
First we assume that is continuously differentiable. Let be a vector of i.i.d. mean and scale parameter Laplace random variables and define . Introduce the notation
Lemma 1 in Schechtman_Zinn90 asserts that, for uniformly distributed on ,
(4) 
In particular, . Using the Efron-Stein inequality (see e.g., boucheron2013concentration, Theorem 3.1) we obtain
where
with . Note that on the event (whose complement has zero measure), the function
satisfies the assumptions of Lemma 5. Thus,