# Exploiting Higher Order Smoothness in Derivative-free Optimization and Continuous Bandits

We study the problem of zero-order optimization of a strongly convex function. The goal is to find the minimizer of the function by a sequential exploration of its values, under measurement noise. We study the impact of higher order smoothness properties of the function on the optimization error and on the cumulative regret. To solve this problem we consider a randomized approximation of the projected gradient descent algorithm. The gradient is estimated by a randomized procedure involving two function evaluations and a smoothing kernel. We derive upper bounds for this algorithm both in the constrained and unconstrained settings and prove minimax lower bounds for any sequential search method. Our results imply that the zero-order algorithm is nearly optimal in terms of sample complexity and the problem parameters. Based on this algorithm, we also propose an estimator of the minimum value of the function achieving almost sharp oracle behavior. We compare our results with the state-of-the-art, highlighting a number of key improvements.

## 1 Introduction

We study the problem of zero-order stochastic optimization, in which we aim to minimize an unknown strongly convex function via a sequential exploration of its function values, under measurement error, and a closely related problem of continuous (or continuum-armed) stochastic bandits. These problems have received significant attention in the literature, see [1, 2, 3, 4, 7, 9, 10, 13, 14, 16, 17, 18, 20, 21, 22, 27, 29, 30, 31, 32, 33, 35], and are fundamental for many applications in which the derivatives of the function are either too expensive or impossible to compute. A principal goal of this paper is to exploit higher order smoothness properties of the underlying function in order to improve the performance of search algorithms. We derive upper bounds on the estimation error for a class of projected gradient-like algorithms, as well as close matching lower bounds, that characterize the role played by the number of iterations, the strong convexity parameter, the smoothness parameter, the number of variables, and the noise level.

Let $f:\mathbb{R}^d\to\mathbb{R}$ be the function that we wish to minimize over a closed convex subset $\Theta$ of $\mathbb{R}^d$. Our approach, outlined in Algorithm 1, builds upon previous work in which a sequential algorithm queries at each iteration a pair of function values, under a general noise model. Specifically, at iteration $t$ the current guess $x_t$ for the minimizer of $f$ is used to build two perturbations $x_t+\delta_t$ and $x_t-\delta_t$, where the function values are queried subject to additive measurement errors $\xi_t$ and $\xi'_t$, respectively. The values $\delta_t$ can be chosen in different ways. In this paper, we set $\delta_t=h_t r_t\zeta_t$ (Line 1), where $h_t>0$ is a suitably chosen small parameter, $r_t$ is random and uniformly distributed on $[-1,1]$, and $\zeta_t$ is uniformly distributed on the unit sphere. The estimate for the gradient is then computed at Line 2 and used inside a projected gradient method scheme to compute the next exploration point. We introduce a suitably chosen kernel $K$ that allows us to take advantage of higher order smoothness of $f$.
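The randomized two-point gradient estimate described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the test function, and the kernel $K(r)=3r$ are assumptions, and the kernel is normalized so that $\mathbb{E}[K(r)]=0$ and $\mathbb{E}[rK(r)]=1$ for $r$ uniform on $[-1,1]$. On a quadratic function the symmetric difference is exact, so the Monte Carlo average of the estimates should be close to the true gradient.

```python
import numpy as np

def sample_unit_sphere(rng, d):
    """Draw one point uniformly from the unit sphere in R^d."""
    z = rng.standard_normal(d)
    return z / np.linalg.norm(z)

def kernel_low_order(r):
    """Assumed kernel K(r) = 3r: E[K(r)] = 0 and E[r K(r)] = 1 for r ~ U[-1, 1]."""
    return 3.0 * r

def gradient_estimate(f, x, h, rng, kernel=kernel_low_order, sigma=0.0):
    """One kernel-smoothed two-point gradient estimate at x (hypothetical helper)."""
    d = x.shape[0]
    r = rng.uniform(-1.0, 1.0)
    zeta = sample_unit_sphere(rng, d)
    # Two noisy function evaluations at symmetric perturbations of x.
    y_plus = f(x + h * r * zeta) + sigma * rng.standard_normal()
    y_minus = f(x - h * r * zeta) + sigma * rng.standard_normal()
    return d / (2.0 * h) * (y_plus - y_minus) * zeta * kernel(r)

# Illustrative check on a quadratic f(x) = 0.5 x^T A x, whose gradient is A x.
rng = np.random.default_rng(0)
A = np.diag([1.0, 2.0, 3.0])
f = lambda x: 0.5 * x @ A @ x
x0 = np.array([1.0, -0.5, 0.25])
estimates = np.array([gradient_estimate(f, x0, 0.1, rng) for _ in range(100_000)])
mc_gradient = estimates.mean(axis=0)
true_gradient = A @ x0
```

Individual estimates are very noisy (their second moment grows with $d$ and with $\sigma^2/h^2$); only their average tracks the gradient, which is why the step sizes in the algorithm must decay.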

The idea of using randomized procedures for derivative-free stochastic optimization can be traced back to Nemirovski and Yudin [25, Sec. 9.3], who suggested an algorithm with one query per step at point $x_t+h_t\zeta_t$, with $\zeta_t$ uniform on the unit sphere. Its versions with one, two or more queries were studied in several papers including [1, 3, 16, 33]. Using two queries per step leads to better performance bounds, as emphasized in [1, 3, 13, 16, 28, 33]. Randomizing sequences other than uniform on the sphere were also explored: $\zeta_t$ uniformly distributed on a cube [28], Gaussian $\zeta_t$ [26, 27], $\zeta_t$ uniformly distributed on the vertices of a cube [32] or satisfying some general assumptions [12, 13]. Except for [3, 12, 28], these works study settings with low smoothness of $f$ (2-smooth or less) and do not invoke kernels $K$ (i.e., $K(\cdot)\equiv 1$ and $r_t\equiv 1$ in Algorithm 1). The use of randomization with smoothing kernels was proposed by Polyak and Tsybakov [28] and further developed by Dippon [12] and by Bach and Perchet [3], to whom the current form of Algorithm 1 is due.

In this paper we consider higher order smooth functions $f$ satisfying the generalized Hölder condition with parameter $\beta\ge 2$, cf. inequality (1) below. For integer $\beta$, this parameter can be roughly interpreted as the number of bounded derivatives. Furthermore, we assume that $f$ is $\alpha$-strongly convex. For such functions, we address the following two main questions:

• (a) What is the performance of Algorithm 1 in terms of the cumulative regret and optimization error, namely what is the explicit dependency of the rate on the main parameters $d$, $T$, $\alpha$, $\beta$?

• (b) What are the fundamental limits of any sequential search procedure expressed in terms of minimax optimization error?

To handle task (a), we prove upper bounds for Algorithm 1, and to handle (b), we prove minimax lower bounds for any sequential search method.

Contributions. Our main contributions can be summarized as follows:

1. Under an adversarial noise assumption (cf. Assumption 2.1 below), we establish for all $\beta\ge 2$ upper bounds of the order $\frac{d^2}{\alpha}T^{-\frac{\beta-1}{\beta}}$ for the optimization risk and $\frac{d^2}{\alpha}T^{\frac{1}{\beta}}$ for the cumulative regret of Algorithm 1, both for its constrained and unconstrained versions;

2. In the case of independent noise satisfying some natural assumptions (including the Gaussian noise), we prove a minimax lower bound of the order $\frac{d}{\alpha}T^{-\frac{\beta-1}{\beta}}$ for the optimization risk when $\alpha$ is not very small. This shows that, to within the factor of $d$, the bound for Algorithm 1 cannot be improved for all $\beta\ge 2$;

3. We show that, when $\alpha$ is too small, below some specified threshold, higher order smoothness does not help to improve the convergence rate. We prove that in this regime the rate cannot be faster than $d/\sqrt{T}$, which is not better (to within the dependency on $d$) than for derivative-free minimization of simply convex functions [2, 18];

4. For $\beta=2$, we obtain a bracketing of the optimal rate between $O(d/\sqrt{\alpha T})$ and $\Omega(d/(\max(1,\alpha)\sqrt{T}))$. In a special case when $\alpha$ is a fixed numerical constant, this validates a conjecture in [32] (claimed there as proved fact) that the optimal rate for $\beta=2$ scales as $d/\sqrt{T}$;

5. We propose a simple algorithm of estimation of the value $\min_x f(x)$ requiring three queries per step and attaining the optimal rate $1/\sqrt{T}$ for all $\beta\ge 2$. The best previous work on this problem [6] suggested a method with exponential complexity and proved a bound of the order $c(d,\alpha)/\sqrt{T}$ for $\beta>2$, where $c(d,\alpha)$ is an unspecified constant.

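Notation. Throughout the paper we use the following notation. We let $\langle\cdot,\cdot\rangle$ and $\|\cdot\|$ be the standard inner product and Euclidean norm on $\mathbb{R}^d$, respectively. For every closed convex set $\Theta\subset\mathbb{R}^d$ and $x\in\mathbb{R}^d$ we denote by $\mathrm{Proj}_\Theta(x)=\arg\min\{\|z-x\|: z\in\Theta\}$ the Euclidean projection of $x$ to $\Theta$. We assume everywhere that $T\ge 2$. We denote by $\mathcal{F}_\beta(L)$ the class of functions with Hölder smoothness $\beta$ (inequality (1) below). Recall that $f$ is $\alpha$-strongly convex for some $\alpha>0$ if, for any $x,y\in\mathbb{R}^d$, it holds that $f(y)\ge f(x)+\langle\nabla f(x),y-x\rangle+\frac{\alpha}{2}\|x-y\|^2$. We further denote by $\mathcal{F}_{\alpha,\beta}(L)$ the class of all $\alpha$-strongly convex functions belonging to $\mathcal{F}_\beta(L)$.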

Organization. We start in Section 2 with some preliminary results on the gradient estimator. Section 3 presents our upper bounds for Algorithm 1, both in the constrained and unconstrained case. In Section 4 we observe that a slight modification of Algorithm 1 can be used to estimate the minimum value (rather than the minimizer) of $f$. Section 5 presents improved upper bounds in the case $\beta=2$. In Section 6 we establish minimax lower bounds. Finally, Section 7 contrasts our results with previous work in the literature and discusses future directions of research.

## 2 Preliminaries

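In this section, we give the definitions, assumptions and basic facts that will be used throughout the paper. For $\beta>0$, let $\ell$ be the greatest integer strictly less than $\beta$. We denote by $\mathcal{F}_\beta(L)$ the set of all functions $f:\mathbb{R}^d\to\mathbb{R}$ that are $\ell$ times differentiable and satisfy, for all $x,z\in\Theta$, the Hölder-type condition

$$\Big|f(z)-\sum_{0\le|m|\le\ell}\frac{1}{m!}D^mf(x)(z-x)^m\Big|\le L\|z-x\|^{\beta},\tag{1}$$

where $L>0$, the sum is over the multi-index $m=(m_1,\dots,m_d)\in\mathbb{N}^d$, we used the notation $m!=m_1!\cdots m_d!$, $|m|=m_1+\cdots+m_d$, and we defined

$$D^mf(x)\nu^m=\frac{\partial^{|m|}f(x)}{\partial^{m_1}x_1\cdots\partial^{m_d}x_d}\,\nu_1^{m_1}\cdots\nu_d^{m_d},\qquad\forall\nu=(\nu_1,\dots,\nu_d)\in\mathbb{R}^d.$$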

In this paper, we assume that the gradient estimator defined by Algorithm 1 uses a kernel function $K:[-1,1]\to\mathbb{R}$ satisfying

$$\int K(u)\,du=0,\qquad\int uK(u)\,du=1,\qquad\int u^jK(u)\,du=0,\ \ j=2,\dots,\ell,\qquad\int|u|^{\beta}|K(u)|\,du<\infty.\tag{2}$$

Examples of such kernels obtained as weighted sums of Legendre polynomials are given in [28] and further discussed in [3].
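As an illustration of the moment conditions, the sketch below checks two polynomial kernels exactly. Both kernels, $K(u)=3u$ and $K(u)=\frac{15u}{4}(5-7u^2)$, are stated here as assumed examples rather than quoted from this section, and the moments are computed as expectations under $r$ uniform on $[-1,1]$ (that is, with the factor $\tfrac12$ of the uniform density), which is the normalization under which the first-order term of the estimator reproduces the gradient. This reading of the normalization in (2) is an assumption.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Candidate kernels on [-1, 1]; coefficients are listed in increasing degree.
K_LOW = Polynomial([0.0, 3.0])                          # K(u) = 3u
K_HIGH = Polynomial([0.0, 75.0 / 4, 0.0, -105.0 / 4])   # K(u) = (15u/4)(5 - 7u^2)

def moment(kernel, j):
    """E[r^j K(r)] for r uniform on [-1, 1], computed exactly via the antiderivative."""
    integrand = Polynomial([0.0] * j + [1.0]) * kernel
    antiderivative = integrand.integ()
    return 0.5 * (antiderivative(1.0) - antiderivative(-1.0))

moments_low = [moment(K_LOW, j) for j in range(2)]    # orders j = 0, 1
moments_high = [moment(K_HIGH, j) for j in range(4)]  # orders j = 0, 1, 2, 3
```

Under this normalization the first kernel has zeroth moment 0 and first moment 1, and the second kernel additionally has vanishing second and third moments, so it can cancel Taylor terms of order two and three.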

###### Assumption 2.1.

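It holds, for all $t\in\{1,\dots,T\}$, that: (i) the random variables $\xi_t$ and $\xi'_t$ are independent from $\zeta_t$ and from $r_t$, and the random variables $\zeta_t$ and $r_t$ are independent; (ii) $\mathbb{E}[\xi_t^2]\le\sigma^2$ and $\mathbb{E}[(\xi'_t)^2]\le\sigma^2$, where $\sigma\ge 0$.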

Note that we do not assume $\xi_t$ and $\xi'_t$ to have zero mean. Moreover, they can be non-random and no independence between noises on different steps is required, so that the setting can be considered as adversarial. Having such a relaxed set of assumptions is possible because of randomization that, for example, allows the proofs to go through without assuming zero-mean noise.

We will also use the following assumption.

###### Assumption 2.2.

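Function $f:\mathbb{R}^d\to\mathbb{R}$ is 2-smooth, that is, differentiable on $\mathbb{R}^d$ and such that $\|\nabla f(x)-\nabla f(x')\|\le\bar L\|x-x'\|$ for all $x,x'\in\mathbb{R}^d$, where $\bar L>0$.

It is easy to see that this assumption implies that $f\in\mathcal{F}_2(\bar L/2)$. The following lemma gives a bound on the bias of the gradient estimator.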

###### Lemma 2.3.

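Let $f\in\mathcal{F}_\beta(L)$, with $\beta\ge 1$, and let Assumption 2.1(i) hold. Let $\hat g_t$ and $x_t$ be defined by Algorithm 1 and let $\kappa_\beta=\int|u|^{\beta}|K(u)|\,du$. Then

$$\big\|\mathbb{E}[\hat g_t\,|\,x_t]-\nabla f(x_t)\big\|\le\kappa_\beta L\,d\,h_t^{\beta-1}.\tag{3}$$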

The next lemma provides a bound on the stochastic variability of the estimated gradient by controlling its second moment.

###### Lemma 2.4.

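Let Assumption 2.1(i) hold, let $\hat g_t$ and $x_t$ be defined by Algorithm 1 and set $\kappa=\int K^2(u)\,du$. Then

• (i) If $\Theta=\mathbb{R}^d$, $\nabla f(x^*)=0$ and Assumption 2.2 holds,

$$\mathbb{E}\big[\|\hat g_t\|^2\,|\,x_t\big]\le 9\kappa\bar L^2\Big(d\|x_t-x^*\|^2+\frac{d^2h_t^2}{8}\Big)+\frac{3\kappa d^2\sigma^2}{2h_t^2},$$

• (ii) If $f\in\mathcal{F}_2(L)$ and $\Theta$ is a closed convex subset of $\mathbb{R}^d$ such that $\max_{x\in\Theta}\|\nabla f(x)\|\le G$, then

$$\mathbb{E}\big[\|\hat g_t\|^2\,|\,x_t\big]\le 9\kappa\Big(G^2d+\frac{L^2d^2h_t^2}{2}\Big)+\frac{3\kappa d^2\sigma^2}{2h_t^2}.$$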

## 3 Upper bounds

In this section, we provide upper bounds on the cumulative regret and on the optimization error of Algorithm 1. First we consider Algorithm 1 when the convex set $\Theta$ is bounded (constrained case).

###### Theorem 3.1.

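(Upper Bound, Constrained Case.) Let $f\in\mathcal{F}_{\alpha,\beta}(L)$ with $\alpha,L>0$ and $\beta\ge 2$. Let Assumptions 2.1 and 2.2 hold and let $\Theta$ be a convex compact subset of $\mathbb{R}^d$. Assume that $\max_{x\in\Theta}\|\nabla f(x)\|\le G$. If $\sigma>0$ then the cumulative regret of Algorithm 1 with

$$h_t=\Big(\frac{3\kappa\sigma^2}{2(\beta-1)(\kappa_\beta L)^2}\Big)^{\frac{1}{2\beta}}t^{-\frac{1}{2\beta}},\qquad\eta_t=\frac{2}{\alpha t},\qquad t=1,\dots,T$$

satisfies

$$\forall x\in\Theta:\quad\sum_{t=1}^{T}\mathbb{E}[f(x_t)-f(x)]\le\frac{1}{\alpha}\Big(d^2\big(A_1T^{1/\beta}+A_2\big)+A_3\,d\log T\Big),\tag{4}$$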

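where , with constant depending only on , and . The optimization error of the averaged estimator $\bar x_T=\frac{1}{T}\sum_{t=1}^{T}x_t$ satisfies

$$\mathbb{E}[f(\bar x_T)-f(x^*)]\le\frac{1}{\alpha}\Big(d^2\Big(\frac{A_1}{T^{\frac{\beta-1}{\beta}}}+\frac{A_2}{T}\Big)+A_3\,\frac{d\log T}{T}\Big),\tag{5}$$

where $x^*=\arg\min_{x\in\Theta}f(x)$. If $\sigma=0$, then the cumulative regret and the optimization error of Algorithm 1 with any $h_t$ chosen small enough and $\eta_t=\frac{2}{\alpha t}$ satisfy the bounds (4) and (5), respectively, with and .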

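Proof sketch. We use the definition of Algorithm 1 and strong convexity of $f$ to obtain an upper bound for $\sum_{t=1}^{T}\mathbb{E}[f(x_t)-f(x)]$, which depends on the bias term $\sum_{t=1}^{T}\|\mathbb{E}[\hat g_t\,|\,x_t]-\nabla f(x_t)\|$ and on the stochastic error term $\sum_{t=1}^{T}\mathbb{E}[\|\hat g_t\|^2]$. By substituting $h_t$ (that is derived from balancing the two terms) and $\eta_t$ in Lemmas 2.3 and 2.4 we obtain upper bounds for the bias and stochastic error terms that imply the desired upper bound for the cumulative regret due to a recursive argument in the spirit of [5]. ∎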
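The balancing step can be made concrete with a small numerical check. The sketch below encodes the schedule of Theorem 3.1 and compares $h_t$ with the grid minimizer of a per-step trade-off between squared bias and noise. The trade-off function is an assumed reading of "balancing the two terms" (squared bias from Lemma 2.3 divided by $\alpha$, plus $\eta_t/2$ times the noise part of Lemma 2.4), and all numerical parameter values are hypothetical.

```python
import numpy as np

def theorem31_schedule(t, alpha, beta, L, sigma, kappa, kappa_beta):
    """Parameters h_t and eta_t as stated in Theorem 3.1."""
    base = 3.0 * kappa * sigma**2 / (2.0 * (beta - 1.0) * (kappa_beta * L) ** 2)
    h_t = base ** (1.0 / (2.0 * beta)) * t ** (-1.0 / (2.0 * beta))
    eta_t = 2.0 / (alpha * t)
    return h_t, eta_t

def per_step_tradeoff(h, t, alpha, beta, L, sigma, kappa, kappa_beta, d):
    """Assumed proxy: (squared bias) / alpha + (eta_t / 2) * (noise part of second moment)."""
    bias_sq = (kappa_beta * L * d) ** 2 * h ** (2.0 * (beta - 1.0)) / alpha
    noise = (1.0 / (alpha * t)) * 3.0 * kappa * d**2 * sigma**2 / (2.0 * h**2)
    return bias_sq + noise

# Hypothetical parameter values, chosen only for illustration.
params = dict(alpha=1.0, beta=3.0, L=1.0, sigma=0.5, kappa=3.0, kappa_beta=1.0)
t = 100
h_star, eta_star = theorem31_schedule(t, **params)
grid = np.linspace(0.5 * h_star, 2.0 * h_star, 2001)
values = per_step_tradeoff(grid, t, d=4, **params)
h_grid_min = grid[np.argmin(values)]
grid_step = grid[1] - grid[0]
```

Setting the derivative of the proxy to zero gives $h^{2\beta}=\frac{3\kappa\sigma^2}{2(\beta-1)(\kappa_\beta L)^2t}$, so the grid minimizer should coincide with $h_t$ up to the grid resolution. The dimension $d$ cancels because both terms scale as $d^2$.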

In the non-noisy case ($\sigma=0$) we get the rate $\frac{d}{\alpha}\log T$ for the cumulative regret, and $\frac{d}{\alpha}\frac{\log T}{T}$ for the optimization error. As far as the optimization error is concerned, this rate is not optimal since one can achieve a much faster rate under strong convexity [27]. However, for the cumulative regret in our derivative-free setting it remains an open question whether the result of Theorem 3.1 can be improved. Previous papers on derivative-free online methods with no noise [1, 13, 16] provide slower rates than $\frac{d}{\alpha}\log T$. The best known so far is $\frac{d^2}{\alpha}\log T$, cf. [1, Corollary 5]. We may also notice that the cumulative regret bounds of Theorem 3.1 trivially extend to the case when we query functions $f_t$ depending on $t$ rather than a single $f$. Another immediate fact is that on the r.h.s. of inequalities (4) and (5) we can take the minimum with $GBT$ and $GB$, respectively, where $B$ is the Euclidean diameter of $\Theta$. Finally, the factor $\log T$ in the bounds for the optimization error can be eliminated by considering averaging from $\lfloor T/2\rfloor+1$ to $T$ rather than from 1 to $T$, in the spirit of [29]. We refer to Appendix D for the details and proofs of these facts.

We now study the performance of Algorithm 1 when $\Theta=\mathbb{R}^d$. In this case we make the following choice for the parameters $h_t$ and $\eta_t$ in Algorithm 1:

$$\begin{aligned}h_t&=T^{-\frac{1}{2\beta}},&\eta_t&=\frac{1}{\alpha T},&t&=1,\dots,T_0,\\ h_t&=t^{-\frac{1}{2\beta}},&\eta_t&=\frac{2}{\alpha t},&t&=T_0+1,\dots,T,\end{aligned}\tag{6}$$

where and is a positive constant depending only on the kernel $K(\cdot)$ (this is defined in the proof of Theorem 3.2 in Appendix B), and recall that $\bar L$ is the Lipschitz constant of the gradient $\nabla f$. (Footnote 1: If the algorithm does not use (6). Assumptions of Theorem 3.2 are such that condition holds.) Finally, define the estimator

$$\bar x_{T_0,T}=\frac{1}{T-T_0}\sum_{t=T_0+1}^{T}x_t.\tag{7}$$
###### Theorem 3.2.

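(Upper Bounds, Unconstrained Case.) Let $f\in\mathcal{F}_{\alpha,\beta}(L)$ with $\alpha,L>0$ and $\beta\ge 2$. Let Assumptions 2.1 and 2.2 hold. Assume also that , where . Let $x_t$'s be the updates of Algorithm 1 with $\Theta=\mathbb{R}^d$, $h_t$ and $\eta_t$ as in (6) and a non-random $x_1\in\mathbb{R}^d$. Then the estimator defined by (7) satisfies

$$\mathbb{E}[f(\bar x_{T_0,T})-f(x^*)]\le C\kappa\bar L^2\frac{d}{\alpha T}\|x_1-x^*\|^2+C\frac{d^2}{\alpha}\Big((\kappa_\beta L)^2+\kappa(\bar L^2+\sigma^2)\Big)T^{-\frac{\beta-1}{\beta}},\tag{8}$$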

where is a constant depending only on and .

Proof sketch. As in the proof of Theorem 3.1, we apply Lemmas 2.3 and 2.4. But we can only use Lemma 2.4(i) and not Lemma 2.4(ii), and thus the bound on the stochastic error now involves $\|x_t-x^*\|^2$. So, after taking expectations, we need to control an additional term containing $\mathbb{E}\|x_t-x^*\|^2$. However, the issue concerns only small $t$ ($t\le T_0$), since for bigger $t$ this term is compensated due to the strong convexity with parameter $\alpha$. This motivates the method where we use the first $T_0$ iterations to get a suitably good (but not rate optimal) bound on $\mathbb{E}\|x_{T_0+1}-x^*\|^2$ and then proceed analogously to Theorem 3.1 for iterations $t\ge T_0+1$. ∎
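The two-phase choice (6) and the tail average (7) can be sketched as below. Since the definition of $T_0$ is not reproduced here, it is passed in as a plain argument. The helper names and the toy inputs are hypothetical.

```python
import numpy as np

def two_phase_schedule(t, T, T0, alpha, beta):
    """Return (h_t, eta_t) following (6); T0 is supplied by the caller."""
    if t <= T0:
        # Burn-in phase: constant smoothing parameter and constant step size.
        return T ** (-1.0 / (2.0 * beta)), 1.0 / (alpha * T)
    # Main phase: decaying parameters, as in the constrained case.
    return t ** (-1.0 / (2.0 * beta)), 2.0 / (alpha * t)

def tail_average(iterates, T0):
    """Estimator (7): average of x_{T0+1}, ..., x_T, where iterates[k] holds x_{k+1}."""
    return np.mean(np.asarray(iterates)[T0:], axis=0)

# Toy usage with made-up iterates (5 steps in dimension 2).
toy_iterates = np.arange(10.0).reshape(5, 2)
burn_in_params = two_phase_schedule(1, T=100, T0=10, alpha=2.0, beta=2.0)
main_params = two_phase_schedule(11, T=100, T0=10, alpha=2.0, beta=2.0)
x_bar_tail = tail_average(toy_iterates, T0=3)
```

Discarding the first $T_0$ iterates from the average keeps the burn-in phase, which only aims at a rough bound on the distance to the minimizer, from contaminating the final estimate.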

## 4 Estimation of f(x∗)

In this section, we apply the above results to estimation of the minimum value $f(x^*)=\min_{x\in\Theta}f(x)$. The literature on this problem either assumes that $x_t$’s are distributed uniformly enough and (with no strong convexity) [19, 24] or and $x_t$’s are chosen sequentially [6, 23]. In the first case, $f(x^*)$ cannot be estimated better than at the slow rate (see, [19] for ). For the second case, which is our setting, the best result so far is obtained in [6]. The estimator of $f(x^*)$ in [6] is defined via a multi-stage procedure whose complexity increases exponentially with the dimension $d$, and it is shown to achieve (asymptotically, for $T$ greater than an exponent of $d$) the $c(d,\alpha)/\sqrt{T}$ rate for functions in $\mathcal{F}_{\alpha,\beta}(L)$ with $\beta>2$. Here, $c(d,\alpha)$ is some constant depending on $d$ and $\alpha$ in an unspecified way.

Observe that $f(\bar x_T)$ is not an estimator since it depends on the unknown $f$, so Theorem 3.1 does not provide a result about estimation of $f(x^*)$. In this section, we show that using the computationally simple Algorithm 1 and making one more query per step (that is, having three queries per step in total) allows us to achieve the $1/\sqrt{T}$ rate for all $\beta\ge 2$ with no dependency on the dimension in the main term. Note that the $1/\sqrt{T}$ rate cannot be improved. Indeed, one cannot estimate $f(x^*)$ with a better rate even using the ideal but non-realizable oracle that makes all queries at point $x^*$.

In order to construct our estimator, at any step $t$ of Algorithm 1 we make, along with $y_t$ and $y'_t$, the third query $y''_t=f(x_t)+\xi''_t$, where $\xi''_t$ is some noise and $x_t$ are the updates of Algorithm 1. We estimate $f(x^*)$ by $\hat M=\frac{1}{T}\sum_{t=1}^{T}y''_t$. The properties of the estimator $\hat M$ are summarized in the next theorem, which is an immediate corollary of Theorem 3.1.

###### Theorem 4.1.

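Let the assumptions of Theorem 3.1 be satisfied. Let $\sigma>0$ and assume that $(\xi''_t)_{t=1}^{T}$ are independent random variables with $\mathbb{E}[\xi''_t]=0$ and $\mathbb{E}[(\xi''_t)^2]\le\sigma^2$ for $t=1,\dots,T$. If $f$ attains its minimum at point $x^*\in\Theta$, then

$$\mathbb{E}|\hat M-f(x^*)|\le\frac{\sigma}{T^{1/2}}+\frac{1}{\alpha}\Big(d^2\Big(\frac{A_1}{T^{\frac{\beta-1}{\beta}}}+\frac{A_2}{T}\Big)+A_3\,\frac{d\log T}{T}\Big).\tag{9}$$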
###### Remark 4.2.

With three queries per step, the risk (error) of the oracle that makes all queries at point $x^*$ is of order $\sigma/\sqrt{3T}$. Thus, for $\beta>2$ the estimator $\hat M$ achieves asymptotically as $T\to\infty$ the oracle risk to within the factor $\sqrt{3}$. We do not obtain such a sharp property for $\beta=2$, in which case the remainder term in Theorem 4.1 accounting for the accuracy of Algorithm 1 is of the same order as the main term $\sigma/\sqrt{T}$.

Note that in Theorem 4.1 the noises $(\xi''_t)$ are assumed to be independent and zero mean random variables, which is essential to obtain the $1/\sqrt{T}$ rate. Nevertheless, we do not require independence between the noises $(\xi''_t)$ and the noises in the other two queries $(\xi_t)$ and $(\xi'_t)$. Another interesting point is that for $\beta=2$ the third query is not needed and $f(x^*)$ is estimated with the $1/\sqrt{T}$ rate either by $\hat M=\frac{1}{T}\sum_{t=1}^{T}y_t$ or by $\hat M=\frac{1}{T}\sum_{t=1}^{T}y'_t$. This is an easy consequence of the above argument, the property (19) (see Lemma A.3 in the appendix), which is specific for the case $\beta=2$, and the fact that the optimal choice of $h_t$ is of order $t^{-1/4}$ for $\beta=2$.
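A minimal sketch of the third-query estimator follows, under the assumption that $\hat M$ is the plain average of the third queries $y''_t=f(x_t)+\xi''_t$ with independent zero-mean Gaussian noise. To isolate the $\sigma/\sqrt{T}$ main term, the demo feeds it the idealized "oracle" iterates $x_t=x^*$; the test function and all names are hypothetical.

```python
import numpy as np

def estimate_min_value(f, iterates, sigma, rng):
    """Average of third queries y''_t = f(x_t) + xi''_t (independent zero-mean noise assumed)."""
    noise = sigma * rng.standard_normal(len(iterates))
    third_queries = np.array([f(x) for x in iterates]) + noise
    return float(third_queries.mean())

rng = np.random.default_rng(1)
f = lambda x: 1.0 + 0.5 * float(np.dot(x, x))  # minimum value 1.0 attained at x* = 0
T = 40_000
oracle_iterates = np.zeros((T, 3))             # idealized case: every x_t equals x*
M_hat = estimate_min_value(f, oracle_iterates, sigma=1.0, rng=rng)
oracle_error = abs(M_hat - 1.0)
```

With real iterates of Algorithm 1 the error picks up the additional term $\frac{1}{T}\sum_t (f(x_t)-f(x^*))$, which is what the second part of the bound (9) controls.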

## 5 Improved bounds for β=2

In this section, we consider the case $\beta=2$ and obtain improved bounds that scale as $d$ rather than $d^2$ with the dimension in the constrained optimization setting analogous to Theorem 3.1. First note that for $\beta=2$ we can simplify the algorithm. The use of the kernel $K$ is redundant when $\beta=2$, and therefore in this section we define the approximate gradient as

$$\hat g_t=\frac{d}{2h_t}(y_t-y'_t)\zeta_t,\tag{10}$$

where $y_t=f(x_t+h_t\zeta_t)+\xi_t$ and $y'_t=f(x_t-h_t\zeta_t)+\xi'_t$. A well-known observation that goes back to [25] consists in the fact that $\hat g_t$ defined in (10) is an unbiased estimator of the gradient of the surrogate function $\hat f_t$ defined by

$$\hat f_t(x)=\mathbb{E}f(x+h_t\tilde\zeta),\qquad\forall x\in\mathbb{R}^d,$$

where the expectation is taken with respect to the random vector $\tilde\zeta$ uniformly distributed on the unit ball $B_d=\{u\in\mathbb{R}^d:\|u\|\le 1\}$. The properties of the surrogate $\hat f_t$ are described in Lemmas A.2 and A.3 presented in the appendix.

The improvement in the rate that we get for $\beta=2$ is due to the fact that we can consider Algorithm 1 with $\hat g_t$ defined in (10) as the SGD for the surrogate function. Then the bias of approximating $f$ by $\hat f_t$ scales as $h_t^2$, which is smaller than the squared bias of approximating the gradient arising in the proof of Theorem 3.1 that scales as $d^2h_t^{2(\beta-1)}=d^2h_t^2$ when $\beta=2$. On the other hand, the stochastic variability terms are the same for both methods of proof. This explains the gain in dependency on $d$. However, this technique does not work for $\beta>2$ since then the error of approximating $f$ by $\hat f_t$, which is of the order $h_t^2$ (with $h_t$ small), becomes too large compared to the bias $d^2h_t^{2(\beta-1)}$ of Theorem 3.1.
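The kernel-free variant can be sketched end to end as a projected stochastic gradient loop with the estimator (10). This is an illustration under stated assumptions, not the tuned procedure of the paper: the constraint set is taken to be the unit ball, the schedule $h_t=t^{-1/4}$ and the step size $\eta_t=1/(\alpha t)$ are illustrative choices, and the quadratic test function is hypothetical.

```python
import numpy as np

def project_ball(x, radius=1.0):
    """Euclidean projection onto the ball of the given radius (assumed constraint set)."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def run_two_point_sgd(f, x1, T, alpha, sigma, rng, h_schedule=lambda t: t ** -0.25):
    """Projected SGD with the kernel-free two-point estimator; returns the averaged iterate."""
    d = x1.shape[0]
    x = x1.copy()
    running_sum = np.zeros(d)
    for t in range(1, T + 1):
        running_sum += x
        h = h_schedule(t)  # illustrative smoothing schedule, an assumption
        z = rng.standard_normal(d)
        zeta = z / np.linalg.norm(z)  # uniform on the unit sphere
        y_plus = f(x + h * zeta) + sigma * rng.standard_normal()
        y_minus = f(x - h * zeta) + sigma * rng.standard_normal()
        g = d / (2.0 * h) * (y_plus - y_minus) * zeta  # estimator (10)
        x = project_ball(x - g / (alpha * t))          # assumed step size 1 / (alpha * t)
    return running_sum / T

rng = np.random.default_rng(2)
c = np.full(5, 0.3)                          # minimizer, inside the unit ball
f = lambda x: 0.5 * float(np.sum((x - c) ** 2))
x_bar = run_two_point_sgd(f, np.zeros(5), T=5000, alpha=1.0, sigma=0.1, rng=rng)
final_gap = f(x_bar) - f(c)
```

Note that the query points $x_t\pm h_t\zeta_t$ may fall outside the constraint set; only the iterates $x_t$ are projected.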

###### Theorem 5.1.

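Let $f\in\mathcal{F}_{\alpha,2}(L)$ with $\alpha,L>0$. Let Assumption 2.1 hold and let $\Theta$ be a convex compact subset of $\mathbb{R}^d$. Assume that $\max_{x\in\Theta}\|\nabla f(x)\|\le G$. If $\sigma>0$ then for Algorithm 1 with $\hat g_t$ defined in (10) and parameters and we have

$$\forall x\in\Theta:\quad\mathbb{E}\sum_{t=1}^{T}\big(f(x_t)-f(x)\big)\le\min\Big(GBT,\ 2\sqrt{3L}\,\sigma\frac{d}{\sqrt{\alpha}}\sqrt{T}+A_4\frac{d^2}{\alpha}\log T\Big),\tag{11}$$

where $B$ is the Euclidean diameter of $\Theta$ and . Moreover, if $x^*=\arg\min_{x\in\Theta}f(x)$, the optimization error of the averaged estimator $\bar x_T=\frac{1}{T}\sum_{t=1}^{T}x_t$ is bounded as

$$\mathbb{E}[f(\bar x_T)-f(x^*)]\le\min\Big(GB,\ 2\sqrt{3L}\,\sigma\frac{d}{\sqrt{\alpha T}}+A_4\frac{d^2}{\alpha}\frac{\log T}{T}\Big).\tag{12}$$

Finally, if $\sigma=0$, then the cumulative regret of Algorithm 1 with any $h_t$ chosen small enough and and the optimization error of its averaged version are of the order $\frac{d^2}{\alpha}\log T$ and $\frac{d^2}{\alpha}\frac{\log T}{T}$, respectively.

Note that the terms $\frac{d^2}{\alpha}\log T$ and $\frac{d^2}{\alpha}\frac{\log T}{T}$ appearing in these bounds can be improved to $\frac{d}{\alpha}\log T$ and $\frac{d}{\alpha}\frac{\log T}{T}$ at the expense of assuming that the norm $\|\nabla f\|$ is uniformly bounded by $G$ not only on $\Theta$ but also on a large enough Euclidean neighborhood of $\Theta$. Moreover, the factor $\log T$ in the bounds for the optimization error can be eliminated by considering averaging from $\lfloor T/2\rfloor+1$ to $T$ rather than from 1 to $T$, in the spirit of [29]. We refer to Appendix D for the details and proofs of these facts. A major conclusion is that, when $\sigma>0$ and we consider the optimization error, those terms are negligible with respect to $d/\sqrt{\alpha T}$ and thus an attainable rate is $\min(1,d/\sqrt{\alpha T})$.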

We close this section by noting, in connection with the bandit setting, that the bound (11) extends straightforwardly (up to a change in numerical constants) to the cumulative regret of the form $\mathbb{E}\sum_{t=1}^{T}\big(f_t(x_t\pm h_t\zeta_t)-f_t(x)\big)$, where the losses are measured at the query points and $f_t$ depends on $t$. This fact follows immediately from the proof of Theorem 5.1 presented in the appendix and the property (19), see Lemma A.3 in the appendix.

## 6 Lower bound

In this section we prove a minimax lower bound on the optimization error over all sequential strategies that allow the query points to depend on the past. For $t=1,\dots,T$, we assume that $y_t=f(z_t)+\xi_t$ and we consider strategies of choosing the query points as $z_t=\Phi_t(z_1^{t-1},y_1^{t-1})$, where $\Phi_t$ are Borel functions and $z_1\in\mathbb{R}^d$ is any random variable. We denote by $\Pi_T$ the set of all such strategies. The noises $\xi_1,\dots,\xi_T$ are assumed in this section to be independent with cumulative distribution function $F$ satisfying the condition

$$\int\log\big(dF(u)/dF(u+v)\big)\,dF(u)\le I_0v^2,\qquad|v|<v_0,\tag{13}$$

for some $0<I_0<\infty$, $0<v_0\le\infty$. For example, for the Gaussian distribution $F$ this condition holds with $v_0=\infty$. Note that the class $\Pi_T$ includes the sequential strategy of Algorithm 1 that corresponds to taking $T$ as an even number, and choosing $z_t=x_t+\zeta_tr_th_t$ and $z_t=x_t-\zeta_tr_th_t$ for even $t$ and odd $t$, respectively. The presence of the randomizing sequences $\zeta_t,r_t$ is not crucial for the lower bound. Indeed, Theorem 6.1 below is valid conditionally on any randomization, and thus the lower bound remains valid when taking expectation over the randomizing distribution.
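For Gaussian noise the left-hand side of the noise condition is a Kullback-Leibler divergence between two shifted normal laws and can be evaluated in closed form. The sketch below checks numerically that for $N(0,\sigma^2)$ the integral equals $v^2/(2\sigma^2)$, so the condition would hold with $I_0=1/(2\sigma^2)$ for every shift $v$. This value of $I_0$ is our own computation under the Gaussian assumption, not a constant quoted from the text, and the function name and grid settings are hypothetical.

```python
import numpy as np

def gaussian_shift_divergence(v, sigma, half_width=12.0, n=200_001):
    """Riemann-sum approximation of the integral of log(p(u) / p(u + v)) p(u) du for N(0, sigma^2)."""
    u = np.linspace(-half_width * sigma, half_width * sigma, n)
    du = u[1] - u[0]
    density = np.exp(-u**2 / (2.0 * sigma**2)) / (sigma * np.sqrt(2.0 * np.pi))
    # log p(u) - log p(u + v) simplifies to ((u + v)^2 - u^2) / (2 sigma^2).
    log_ratio = ((u + v) ** 2 - u**2) / (2.0 * sigma**2)
    return float(np.sum(log_ratio * density) * du)

v, sigma = 0.7, 1.3
numeric_value = gaussian_shift_divergence(v, sigma)
closed_form = v**2 / (2.0 * sigma**2)
```

The quadratic dependence on the shift $v$ is exactly what the condition requires; heavier-tailed or non-smooth noise laws may satisfy it only for small $|v|$, which is the role of the threshold $v_0$.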

###### Theorem 6.1.

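Let $\Theta=\{x\in\mathbb{R}^d:\|x\|\le 1\}$. For $\alpha,L>0$, $\beta\ge 2$, let $\mathcal{F}'_{\alpha,\beta}$ denote the set of functions $f$ that attain their minimum over $\mathbb{R}^d$ in $\Theta$ and belong to $\mathcal{F}_{\alpha,\beta}(L)\cap\{f:\max_{x\in\Theta}\|\nabla f(x)\|\le G\}$, where . Then for any strategy in the class $\Pi_T$ we have

$$\sup_{f\in\mathcal{F}'_{\alpha,\beta}}\mathbb{E}\big[f(z_T)-\min_xf(x)\big]\ge C\min\Big(\max\big(\alpha,T^{-1/2+1/\beta}\big),\ \frac{d}{\sqrt{T}},\ \frac{d}{\alpha}T^{-\frac{\beta-1}{\beta}}\Big),\tag{14}$$

and

$$\sup_{f\in\mathcal{F}'_{\alpha,\beta}}\mathbb{E}\big[\|z_T-x^*(f)\|^2\big]\ge C\min\Big(1,\ \frac{d}{T^{\frac{1}{\beta}}},\ \frac{d}{\alpha^2}T^{-\frac{\beta-1}{\beta}}\Big),\tag{15}$$

where $C>0$ is a constant that does not depend on $T$, $d$, and $\alpha$, and $x^*(f)$ is the minimizer of $f$ on $\Theta$.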

The proof is given in Appendix B and relies on Assouad’s Lemma (see, e.g., [34]) with a careful choice of least favorable functions.

We stress that the condition in this theorem is necessary. It should always hold if the intersection $\mathcal{F}_{\alpha,\beta}(L)\cap\{f:\max_{x\in\Theta}\|\nabla f(x)\|\le G\}$ is not empty. Notice also that the threshold $T^{-1/2+1/\beta}$ on the strong convexity parameter $\alpha$ plays an important role in bounds (14) and (15). Indeed, for $\alpha$ below this threshold, the bounds start to be independent of $\alpha$. Moreover, in this regime, the rate of (14) becomes $\min(T^{1/\beta},d)/\sqrt{T}$, which is asymptotically $d/\sqrt{T}$ and thus not better as a function of $T$ than the rate attained for zero-order minimization of simply convex functions [2, 7]. Intuitively, it seems reasonable that $\alpha$-strong convexity should be of no added value for very small $\alpha$. Theorem 6.1 allows us to quantify exactly how small such $\alpha$ should be. Also, quite naturally, the threshold becomes smaller when the smoothness $\beta$ increases.
Finally, note that for $\beta=2$ the lower bounds (14) and (15) are, in the interesting regime of large enough $T$, of order $d/(\max(\alpha,1)\sqrt{T})$ and $d/(\max(\alpha^2,1)\sqrt{T})$, respectively. This highlights the near minimax optimal properties of Algorithm 1 in the setting of Theorem 5.1.
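The role of the threshold on the strong convexity parameter can be seen by evaluating the right-hand side of (14) without the constant $C$. The helper below is a direct transcription of that expression; the function name and the numerical values are chosen only for illustration.

```python
import math

def lower_bound_rate(alpha, beta, d, T):
    """Right-hand side of (14) without the constant C."""
    threshold = T ** (-0.5 + 1.0 / beta)
    return min(
        max(alpha, threshold),
        d / T ** 0.5,
        d / alpha * T ** (-(beta - 1.0) / beta),
    )

T, d, beta = 10**6, 10, 3.0
# Here the threshold T^(-1/2 + 1/beta) is approximately 0.1.
rate_small_alpha = lower_bound_rate(0.01, beta, d, T)  # alpha below the threshold
rate_mid_alpha = lower_bound_rate(0.05, beta, d, T)    # still below the threshold
rate_large_alpha = lower_bound_rate(1.0, beta, d, T)   # alpha above the threshold
```

For both values of $\alpha$ below the threshold the expression equals $d/\sqrt{T}$, so strong convexity brings no gain there, while for $\alpha=1$ the term $\frac{d}{\alpha}T^{-(\beta-1)/\beta}$ takes over.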

## 7 Discussion and related work

Zero-order feedback stochastic optimization and convex bandit problems have received a great deal of attention in the recent literature. Several settings are studied: (i) deterministic in the sense that the queries contain no random noise and we query functions $f_t$ depending on $t$ rather than $f$, where $f_t$ are Lipschitz or 2-smooth [1, 16, 26, 27, 30, 33]; (ii) stochastic with two-point feedback where the two noisy evaluations are obtained with the same noise and the noisy functions are Lipschitz or 2-smooth [13, 26, 27] (this setting does not differ much from (i) in terms of the analysis and the results); (iii) stochastic, where the noises are independent zero-mean random variables [2, 3, 4, 12, 15, 20, 21, 28, 32]. In this paper, we considered a setting that is more general than (iii) by allowing for adversarial noise (no independence or zero-mean assumption in contrast to (iii), no Lipschitz assumption in contrast to settings (i) and (ii)); settings (i) and (ii) are both covered by our results when the noise is set to zero.

One part of our results consists of bounds on the cumulative regret, cf. (4) and (11). We emphasize that they remain trivially valid if the queries are from $f_t$ depending on $t$ instead of $f$, and thus cover the setting (i). To the best of our knowledge, there were no such results in this setting previously, except for [3] that gives bounds with suboptimal dependency on $T$ in the case of classical (non-adversarial) noise. In the non-noisy case, we get bounds on the cumulative regret with faster rates than previously known for the setting (i). It remains an open question whether these bounds can be improved.

The second part of our results, dealing with the optimization error, is closely related to the work on derivative-free stochastic optimization under strong convexity and smoothness assumptions initiated in [15, 28] and more recently developed in [3, 12, 20, 32]. It was shown in [28] that the minimax optimal rate for $\mathcal{F}_{\alpha,\beta}(L)$ scales as $c(\alpha,d)T^{-(\beta-1)/\beta}$, where $c(\alpha,d)$ is an unspecified function of $\alpha$ and $d$ (for $d=1$ an upper bound of the same order was earlier established in [15]). The issue of establishing non-asymptotic fundamental limits as a function of the main parameters of the problem ($\alpha$, $d$ and $T$) was first addressed in [20], giving a lower bound $\Omega(\sqrt{d/T})$ for $\beta=2$. This was improved to $\Omega(d/\sqrt{T})$ when $\alpha\asymp 1$ by Shamir [32], who conjectured that the rate $d/\sqrt{T}$ is optimal for $\beta=2$, which indeed follows from our Theorem 5.1 (although [32] claims the optimality as proved fact by referring to results in [1], such results cannot be applied in setting (iii) because the noise cannot be considered as Lipschitz). A result similar to Theorem 5.1 is stated without proof in Bach and Perchet [3, Proposition 7] but not for the cumulative regret and with a suboptimal rate in the non-noisy case. For integer $\beta\ge 3$, Bach and Perchet [3] present explicit upper bounds as functions of $\alpha$, $d$ and $T$ with, however, suboptimal dependency on $T$ except for their Proposition 8, which is problematic (see Appendix C for the details). Finally, by slightly modifying the proof of Theorem 3.1 we get that the estimation risk $\mathbb{E}\|\bar x_T-x^*\|^2$ is $O\big(\frac{d^2}{\alpha^2}T^{-\frac{\beta-1}{\beta}}\big)$, which is to within a factor $d$ of the main term in the lower bound (15) (see Appendix D for details).

The lower bound in Theorem 6.1 is, to the best of our knowledge, the first result providing non-asymptotic fundamental limits under general configuration of $\alpha$, $d$ and $T$. The known lower bounds [20, 28, 32] either give no explicit dependency on $\alpha$ and $d$, or treat the special case $\beta=2$ and $\alpha\asymp 1$. Moreover, as an interesting consequence of our lower bound we find that, for small strong convexity parameter $\alpha$ (namely, below the $T^{-1/2+1/\beta}$ threshold), the best achievable rate cannot be substantially faster than for simply convex functions, at least for moderate dimensions. Indeed, for such small $\alpha$, our lower bound is asymptotically $d/\sqrt{T}$ independently of the smoothness index $\beta$ and of $\alpha$, while the achievable rate for convex functions is shown to be $d^{16}/\sqrt{T}$ in [2] and improved to $d^{3.75}/\sqrt{T}$ in [7] (both up to log-factors). The gap here is only in the dependency on the dimension. Our results imply that for $\alpha$ above the $T^{-1/2+1/\beta}$ threshold, the gap between upper and lower bounds is much smaller. Thus, our upper bounds in this regime scale as $\frac{d^2}{\alpha}T^{-\frac{\beta-1}{\beta}}$ while the lower bound of Theorem 6.1 is of the order $\frac{d}{\alpha}T^{-\frac{\beta-1}{\beta}}$; moreover, for $\beta=2$, upper and lower bounds match in the dependency on $d$.

We hope that our work will stimulate further study at the intersection of zero-order optimization and convex bandits in machine learning. An important open problem is to study novel algorithms which match our lower bound simultaneously in all main parameters. For example a class of algorithms worth exploring are those using memory of the gradient in the spirit of Nesterov accelerated method. Yet another important open problem is to study lower bounds for the regret in our setting. Finally, it would be valuable to study extensions of our work to locally strongly convex functions.

Acknowledgements. We would like to thank Francis Bach, Vianney Perchet, Saverio Salzo and Ohad Shamir for helpful discussions. The second author was partially supported by SAP SE; the third author acknowledges the funding from Investissements d’Avenir (ANR-11-IDEX-0003/Labex Ecodec/ANR-11-LABX-0047).

## References

• [1] A. Agarwal, O. Dekel, and L. Xiao. Optimal algorithms for online convex optimization with multi-point bandit feedback. In Proc. 23rd International Conference on Learning Theory, pages 28–40, 2010.
• [2] A. Agarwal, D. P. Foster, D. J. Hsu, S. M. Kakade, and A. Rakhlin. Stochastic convex optimization with bandit feedback. In Advances in Neural Information Processing Systems, volume 25, pages 1035–1043, 2011.
• [3] F. Bach and V. Perchet. Highly-smooth zero-th order online optimization. In Proc. 29th Annual Conference on Learning Theory, pages 1–27, 2016.
• [4] P. L. Bartlett, V. Gabillon, and M. Valko. A simple parameter-free and adaptive approach to optimization under a minimal local smoothness assumption. In Proc. 30th International Conference on Algorithmic Learning Theory, pages 184–206, 2019.
• [5] P. L. Bartlett, E. Hazan, and A. Rakhlin. Adaptive online gradient descent. In Advances in Neural Information Processing Systems 20, pages 65–72, 2008.
• [6] E. Belitser, S. Ghosal, and H. van Zanten. Optimal two-stage procedures for estimating location and size of the maximum of a multivariate regression function. Ann. Statist., 40(6):2850–2876, 2012.
• [7] A. Belloni, T. Liang, H. Narayanan, and A. Rakhlin. Escaping the local minima via simulated annealing: Optimization of approximately convex functions. In Proc. 28th Annual Conference on Learning Theory, pages 240–265, 2015.
• [8] S. Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–257, 2015.
• [9] S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
• [10] S. Bubeck, Y. T. Lee, and R. Eldan. Kernel-based methods for bandit convex optimization. In Proc. 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 72–85, 2017.
• [11] K. L. Chung. On a stochastic approximation method. The Annals of Mathematical Statistics, 25(3):463–483, 1954.
• [12] J. Dippon. Accelerated randomized stochastic optimization. Ann. Statist., 31(4):1260–1281, 2003.
• [13] J. C. Duchi, M. I. Jordan, M. J. Wainwright, and A. Wibisono. Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory, 61(5):2788–2806, 2015.
• [14] P. Dvurechensky, A. Gasnikov, and E. Gorbunov. An accelerated method for derivative-free smooth stochastic convex optimization. arXiv preprint arXiv:1802.09022, 2018.
• [15] V. Fabian. Stochastic approximation of minima with improved asymptotic speed. The Annals of Mathematical Statistics, 38(1):191–200, 1967.
• [16] A. D. Flaxman, A. T. Kalai, and H. B. McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proc. 16th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 385–394, 2005.
• [17] X. Hu, L. A. Prashanth, A. György, and C. Szepesvári. (Bandit) convex optimization with biased noisy gradient oracles. In Proc. 10th International Conference on Artificial Intelligence and Statistics, pages 819–828, 2016.
• [18] X. Hu, L. A. Prashanth, A. György, and C. Szepesvári. Escaping the local minima via simulated annealing: Optimization of approximately convex functions. In Proc. 10th International Conference on Artificial Intelligence and Statistics, pages 819–828, 2016.
• [19] I. A. Ibragimov and R. Z. Khas’minskii. Estimation of the maximum value of a signal in Gaussian white noise. Mat. Zametki, 32(4):746–750, 1982.
• [20] K. G. Jamieson, R. Nowak, and B. Recht. Query complexity of derivative-free optimization. In Advances in Neural Information Processing Systems, volume 26, pages 2672–2680, 2012.
• [21] A. Locatelli and A. Carpentier. Adaptivity to smoothness in x-armed bandits. In Proc. 31st Annual Conference on Learning Theory, pages 1–30, 2018.
• [22] C. Malherbe and N. Vayatis. Global optimization of Lipschitz functions. In Proc. 34th International Conference on Machine Learning, pages 2314–2323, 2017.
• [23] A. Mokkadem and M. Pelletier. A companion for the Kiefer–Wolfowitz–Blum stochastic approximation algorithm. Ann. Statist., 35:1749–1772, 2007.
• [24] H.-G. Müller. Kernel estimators of zeros and of location and size of extrema of regression functions. Scand. J. Stat., 12:221–232, 1985.
• [25] A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley & Sons, 1983.
• [26] Y. Nesterov. Random gradient-free minimization of convex functions. Technical Report 2011001, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain, 2011.
• [27] Y. Nesterov and V. Spokoiny. Random gradient-free minimization of convex functions. Found. Comput. Math., 17:527–566, 2017.
• [28] B. T. Polyak and A. B. Tsybakov. Optimal order of accuracy of search algorithms in stochastic optimization. Problems of Information Transmission, 26(2):45–53, 1990.
• [29] A. Rakhlin, O. Shamir, and K. Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In Proc. 29th Int. Conf. on Machine Learning, pages 1571–1578, 2012.
• [30] A. Saha and A. Tewari. Improved regret guarantees for online smooth convex optimization with bandit feedback. In Proc. 14th International Conference on Artificial Intelligence and Statistics, pages 636–642, 2011.
• [31] S. Shalev-Shwartz. Online learning and online convex optimisation. Foundations and Trends in Machine Learning, 4:107–194, 2011.
• [32] O. Shamir. On the complexity of bandit and derivative-free stochastic convex optimization. In Proc. 30th Annual Conference on Learning Theory, pages 1–22, 2013.
• [33] O. Shamir. An optimal algorithm for bandit and zero-order convex optimization with two-point feedback. Journal of Machine Learning Research, 18(1):1703–1713, 2017.
• [34] A. Tsybakov. Introduction to Nonparametric Estimation. Springer, New York, 2009.
• [35] Y. Wang, S. Du, S. Balakrishnan, and A. Singh. Stochastic zeroth-order optimization in high dimensions. In Proc. 21st International Conference on Artificial Intelligence and Statistics, pages 1356–1365, 2018.

## Supplementary material

The supplementary material is organized as follows. In Appendix A we provide some auxiliary results, including those stated in Section 2 above. In Appendix B we give proofs of the results which were only stated or whose proof was only sketched in the paper. For reader’s convenience all such results are restated below. Appendix C contains some comments on previous results in [3]. Finally, in Appendix D we present refined versions of Theorems 3.1 and 5.1.

## Appendix A Auxiliary results

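Proof of Lemma 2.3.

To lighten the presentation and without loss of generality we drop the subscript “$t$” in all quantities. Using the Taylor expansion we have

$$f(x+hr\zeta)=f(x)+\langle\nabla f(x),hr\zeta\rangle+\sum_{2\le|m|\le\ell}\frac{(rh)^{|m|}}{m!}D^{(m)}f(x)\zeta^m+R(hr\zeta),$$

where by assumption $|R(hr\zeta)|\le L\|hr\zeta\|^{\beta}=L|r|^{\beta}h^{\beta}$. Thus,

$$\mathbb{E}[\hat g\,|\,x]=\frac{d}{h}\,\mathbb{E}\Big[\Big(\langle\nabla f(x),hr\zeta\rangle+\sum_{2\le|m|\le\ell,\ |m|\ \mathrm{odd}}\frac{(rh)^{|m|}}{m!}D^{(m)}f(x)\zeta^m+\frac{R(hr\zeta)-R(-hr\zeta)}{2}\Big)\zeta K(r)\Big].$$

Since $\zeta$ is uniformly distributed on the unit sphere we have $\mathbb{E}[\zeta\zeta^{\top}]=\frac{1}{d}I_{d\times d}$, where $I_{d\times d}$ is the identity matrix. Therefore,

$$\mathbb{E}\Big[\frac{d}{h}\langle\nabla f(x),h\zeta\rangle\zeta\Big]=\nabla f(x).$$

As $\int r^{|m|}K(r)\,dr=0$ for $2\le|m|\le\ell$ and $\int rK(r)\,dr=1$, we conclude that

$$\big\|\mathbb{E}[\hat g\,|\,x]-\nabla f(x)\big\|\le\frac{d}{2h}\,\mathbb{E}\big[|R(hr\zeta)-R(-hr\zeta)|\,|K(r)|\big]\le\kappa_\beta L\,d\,h^{\beta-1}.$$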

Proof of Lemma 2.4.

We have

 ∥^g∥2 =d24h2∥∥(f(x+hrζ)−f(x−hrζ)+ξ−ξ′)ζK(r)∥∥2 =d24h2(f(x+hrζ)−f(x−hrζ