## 1 Introduction

Derivative-free optimization schemes have a long history in optimization; for instance, see the book by Spall [32] for an overview. Such procedures are desirable in settings in which explicit gradient calculations may be computationally infeasible, expensive, or impossible. Classical techniques in stochastic and non-stochastic optimization, including Kiefer-Wolfowitz-type procedures [e.g. 23], use function difference information to approximate gradients of the function to be minimized rather than calculating gradients. There has recently been renewed interest in optimization problems with only functional (zero-order) information available—rather than first-order gradient information—in optimization, machine learning, and statistics [17, 1, 29, 19, 3].

In machine learning and statistics, this interest has centered around bandit optimization [17, 6, 1], where a player and adversary compete, with the player choosing points $x$ in some domain $\mathcal{X}$ and an adversary choosing a point $s$, forcing the player to suffer a loss $F(x; s)$. The goal is to choose an optimal point $x \in \mathcal{X}$ based only on observations of function values $F(x; s)$. Applications of such bandit problems include online auctions and advertisement selection for search engines. Similarly, the field of simulation-based optimization provides many examples of problems in which optimization is performed based only on function values [32, 13, 29]. Additionally, in many problems in statistics—including graphical model inference [34] and structured prediction [33]—the objective is defined variationally (as the maximum of a family of functions), so explicit differentiation may be difficult.

Despite the long history and recent renewed interest in such procedures, a precise understanding of their convergence behavior remains elusive. In this paper, we study algorithms for solving stochastic convex optimization problems of the form

$$
\underset{x \in \mathcal{X}}{\text{minimize}} \;\; f(x) := \mathbb{E}_P\big[F(x; S)\big] = \int_{\mathcal{S}} F(x; s)\, dP(s), \tag{1}
$$

where $\mathcal{X} \subseteq \mathbb{R}^d$ is a compact convex set, $P$ is a distribution over the space $\mathcal{S}$, and for $P$-almost every $s \in \mathcal{S}$, the function $F(\cdot; s)$ is closed and convex. We focus on the convergence rates of algorithms observing only stochastic realizations $F(x; S)$ of the function values, though our algorithms naturally apply in the non-stochastic case as well.

One related body of work focuses on problems where, for a given value $x$ (or sample $S$), it is only possible to observe the function value at a single location. Nemirovski and Yudin [26, Chapter 9.3] develop a randomized sampling strategy that estimates the gradient via randomized evaluations of function values at points on the surface of an $\ell_2$-sphere. Flaxman et al. [17] build on this approach and establish some implications for bandit convex optimization problems. The convergence rates given in these early papers are sub-optimal, as more recent work shows [3]. For instance, Agarwal et al. [3] provide algorithms that achieve convergence rates of $\mathcal{O}(\mathrm{poly}(d)/\sqrt{T})$ after $T$ iterations; however, as the authors themselves note, the algorithms are quite complicated. Jamieson et al. [21] present simpler comparison-based algorithms for solving a subclass of such problems, and Shamir [31] gives optimal algorithms for quadratic objectives, as well as providing some lower bounds on optimization error when only single function values are available.

Some of the difficulties inherent in optimization using only a single function evaluation are alleviated when the function can be evaluated at *two* points, as noted independently by Agarwal et al. [1] and Nesterov [29]. Such multi-point settings prove useful for optimization problems in which observations $S_t$ are available, yet we only have black-box access to the objective values $F(x; S_t)$; examples of such problems include simulation-based optimization [29, 13] and variational approaches to graphical models and classification [33, 34]. The essential insight underlying multi-point schemes is as follows: for a small non-zero scalar $u$ and a vector $z$, the quantity $(F(x + u z; S) - F(x; S))/u$ approximates a directional derivative of $F(\cdot; S)$ in the direction $z$ that first-order schemes may exploit. Relative to schemes based on only a single function evaluation at each iteration, such two-sample-based gradient estimators exhibit faster convergence rates [1, 29, 19]. In the current paper, we take this line of work further, in particular by characterizing *optimal* rates of convergence over all procedures based on multiple noisy function evaluations. Moreover, adopting the two-point perspective, we present simple randomization-based algorithms that achieve these optimal rates.

More formally, we study algorithms that receive a vector of paired observations $Y_t \in \mathbb{R}^2$, where $x_t$ and $y_t$ are points selected by the algorithm. The $t$-th observation takes the form

$$
Y_t = \begin{bmatrix} F(x_t; S_t) \\ F(y_t; S_t) \end{bmatrix}, \tag{2}
$$

where $S_t$ is an independent sample drawn from the distribution $P$. After $T$ iterations, the algorithm returns a vector $\hat{x}(T) \in \mathcal{X}$. In this setting, we analyze stochastic gradient and mirror-descent procedures [38, 26, 7, 27] that construct gradient estimators using the two-point observations (as well as the natural extension to $m$-point observations). By a careful analysis of the dimension dependence of certain random perturbation schemes, we show that the convergence rate attained by our stochastic gradient methods is roughly a factor of $\sqrt{d}$ worse than that attained by stochastic methods that observe the full gradient $g(x; S) \in \partial F(x; S)$. Under appropriate conditions, our convergence rates are a factor of $\sqrt{d}$ better than those attained in past work [1, 29]. For smooth problems, Ghadimi and Lan [19] provide results sharper than those in the papers [1, 29], but they neither show optimality of their methods nor consider non-Euclidean problems. In addition, although we present our results in the framework of stochastic optimization, our analysis also applies to (multi-point) bandit online convex optimization problems [17, 6, 1], where our results are the sharpest provided to date. Our algorithms apply in both smooth and non-smooth cases as well as to non-stochastic problems [26, 29], where our procedures give the fastest known convergence guarantees for the non-smooth case. Finally, by using information-theoretic techniques for proving lower bounds in statistical estimation, we establish that our explicit achievable rates are sharp up to constant factors or (in some cases) factors at most logarithmic in the dimension.

The remainder of this paper is organized as follows: in the next section, we present our multi-point gradient estimators and their convergence rates, providing results in Sections 2.1 and 2.2 for smooth and non-smooth objectives $f$, respectively. In Section 3, we provide information-theoretic minimax lower bounds on the best possible convergence rates, uniformly over all schemes based on function evaluations. We devote Sections 4 and 5 to proofs of the achievable convergence rates and the lower bounds, respectively, deferring more technical arguments to appendices.

##### Notation

For sequences $a_n$ and $b_n$ indexed by $n$, the inequality $a_n \lesssim b_n$ indicates that there is a universal numerical constant $c < \infty$ such that $a_n \le c\, b_n$. For a convex function $f$, we let $\partial f(x)$ denote the subgradient set of $f$ at the point $x$. We say a function $f$ is $\lambda$-strongly convex with respect to the norm $\|\cdot\|$ if, for all $x, y$, we have $f(y) \ge f(x) + \langle g, y - x \rangle + \frac{\lambda}{2}\|y - x\|^2$ for all $g \in \partial f(x)$. Given a norm $\|\cdot\|$, we denote its dual norm by $\|\cdot\|_*$. We let $\mathsf{N}(0, I_{d \times d})$ denote the standard normal distribution on $\mathbb{R}^d$. We denote the $\ell_p$-ball in $\mathbb{R}^d$ with radius $r$ centered at $x$ by $\mathbb{B}_p(x, r)$, and $\mathbb{S}_p(x, r)$ denotes the $(d-1)$-dimensional $\ell_p$-sphere in $\mathbb{R}^d$ with radius $r$ centered at $x$. We also use the shorthands $\mathbb{B}_p = \mathbb{B}_p(0, 1)$ and $\mathbb{S}_p = \mathbb{S}_p(0, 1)$, and write $\mathbb{1}$ for the all-ones vector.

## 2 Algorithms

We begin by providing some background on the class of stochastic mirror descent methods for solving the problem (1). They are based on a *proximal function* $\psi$, meaning a differentiable and strongly convex function defined over $\mathcal{X}$. The proximal function defines a Bregman divergence via

$$
D_\psi(x, y) := \psi(x) - \psi(y) - \langle \nabla\psi(y),\, x - y \rangle \ge 0 .
$$

The mirror descent (MD) method generates a sequence of iterates $\{x_t\}$ contained in $\mathcal{X}$, using stochastic gradient information to perform the update from iterate to iterate. The algorithm is initialized at some point $x_1 \in \mathcal{X}$. At iterations $t = 1, 2, \ldots$, the MD method receives a (subgradient) vector $g_t \in \mathbb{R}^d$, which it uses to compute the next iterate via

$$
x_{t+1} = \underset{x \in \mathcal{X}}{\operatorname{argmin}} \left\{ \langle g_t, x \rangle + \frac{1}{\alpha_t}\, D_\psi(x, x_t) \right\}, \tag{3}
$$

where $\{\alpha_t\}$ is a non-increasing sequence of positive stepsizes.
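As a concrete illustration, when the proximal function is the Euclidean choice $\psi(x) = \frac{1}{2}\|x\|_2^2$, the divergence is $D_\psi(x, y) = \frac{1}{2}\|x - y\|_2^2$ and the update (3) reduces to a gradient step followed by Euclidean projection. The sketch below is our own illustration (not code from the paper); the ball-shaped domain and toy quadratic objective are assumptions for the example:

```python
import numpy as np

def md_step_euclidean(x, g, alpha, radius):
    """One mirror descent step (3) with psi(x) = 0.5 * ||x||_2^2, for which
    D_psi(x, y) = 0.5 * ||x - y||_2^2: the update reduces to a gradient step
    followed by Euclidean projection onto the ball B_2(0, radius)."""
    y = x - alpha * g  # unconstrained minimizer of <g, x> + D_psi(x, x_t)/alpha
    norm = np.linalg.norm(y)
    return y if norm <= radius else y * (radius / norm)  # project onto the domain

# Toy usage: minimize f(x) = 0.5 * ||x - x_star||_2^2 with exact gradients.
x_star = np.array([0.5, -0.3, 0.2])
x = np.zeros(3)
for t in range(1, 2001):
    g = x - x_star                      # gradient of f at the current iterate
    x = md_step_euclidean(x, g, alpha=1.0 / np.sqrt(t), radius=1.0)
print(np.round(x, 3))                   # converges to x_star
```

With a non-Euclidean proximal function (e.g. the negative entropy on the simplex), only the `md_step_euclidean` body changes; the driving loop is identical.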

Throughout the paper, we impose two assumptions that are standard in the analysis of mirror descent methods [26, 7, 27]. Letting $x^\star \in \mathcal{X}$ denote a minimizer of the problem (1), the first assumption concerns properties of the proximal function $\psi$ and the optimization domain $\mathcal{X}$.

**Assumption A.** The proximal function $\psi$ is $\lambda$-strongly convex with respect to the norm $\|\cdot\|$. The domain $\mathcal{X}$ is compact, and there exists $R < \infty$ such that $D_\psi(x^\star, x) \le \frac{1}{2} R^2$ for $x \in \mathcal{X}$.

Our second assumption is standard for almost all first-order stochastic gradient methods [27, 35, 29], and it holds whenever the functions $F(\cdot; s)$ are $L_0$-Lipschitz with respect to the norm $\|\cdot\|$. We use $\|\cdot\|_*$ to denote the dual norm to $\|\cdot\|$, and let $g$ denote a measurable subgradient selection for the functions $F$; that is, $g(x; s) \in \partial F(x; s)$ with $\mathbb{E}[g(x; S)] \in \partial f(x)$.

**Assumption B.** There is a constant $L_0 < \infty$ such that the (sub)gradient selection satisfies $\mathbb{E}[\|g(x; S)\|_*^2] \le L_0^2$ for $x \in \mathcal{X}$.

When Assumptions A and B hold, the convergence rates of stochastic mirror descent methods are well understood. In detail, suppose that the variables $S_t$ are sampled i.i.d. according to $P$. With the assignment $g_t = g(x_t; S_t)$, let the sequence $\{x_t\}$ be generated by the mirror descent iteration (3). Then for a stepsize $\alpha_t = R\sqrt{\lambda}/(L_0\sqrt{t})$, the average $\hat{x}(T) = \frac{1}{T}\sum_{t=1}^T x_t$ satisfies

$$
\mathbb{E}\big[f(\hat{x}(T))\big] - f(x^\star) \le \mathcal{O}(1)\, \frac{R L_0}{\sqrt{\lambda\, T}} . \tag{4}
$$

We refer to the papers [7, 27, Section 2.3] for results of this type.

For the remainder of this section, we explore the use of function difference information to obtain subgradient estimates that can be used in mirror descent methods to achieve statements similar to the convergence guarantee (4). We begin by analyzing the smooth case—when the instantaneous functions $F(\cdot; s)$ have Lipschitz continuous gradients—and proceed to the more general (non-smooth) case in the subsequent section.

### 2.1 Two-point gradient estimates and convergence rates: smooth case

Our first step is to show how to use two function values to construct nearly unbiased estimators of the gradient of the objective function $f$ under a smoothness condition. Using analytic methods different from those in past work [1, 29], we are able to obtain optimal dependence on the problem dimension $d$. In more detail, our procedure is based on a non-increasing sequence of positive smoothing parameters $\{u_t\}$ and a distribution $\mu$ on $\mathbb{R}^d$, to be specified, satisfying $\mathbb{E}_\mu[ZZ^\top] = I$. Given a smoothing constant $u > 0$, vector $z \in \mathbb{R}^d$, and observation $s$, we define the directional gradient estimate at the point $x$ as

$$
G(x; u, z, s) := \frac{F(x + u z; s) - F(x; s)}{u}\, z . \tag{5}
$$

Using the estimator (5), we then perform the following two steps. First, upon receiving the point $x_t$, we sample an independent vector $Z_t$ from $\mu$ and set

$$
g_t = G(x_t; u_t, Z_t, S_t) . \tag{6}
$$

In the second step, we apply the mirror descent update (3) to the quantity $g_t$ to obtain the next parameter $x_{t+1}$.

Intuition for the estimator (5) follows by considering directional derivatives. The directional derivative of the function $f$ at the point $x$ in the direction $z$ is

$$
f'(x; z) := \lim_{u \downarrow 0} \frac{f(x + u z) - f(x)}{u} .
$$

This limit always exists when $f$ is convex [20, Chapter VI], and if $f$ is differentiable at $x$, then $f'(x; z) = \langle \nabla f(x), z \rangle$. With this background, the estimate (5) is motivated by the following fact [29, equation (32)]: whenever $\nabla f(x)$ exists, we have

$$
\mathbb{E}_\mu\big[f'(x; Z)\, Z\big] = \mathbb{E}_\mu\big[\langle \nabla f(x), Z \rangle\, Z\big] = \mathbb{E}_\mu\big[Z Z^\top\big]\, \nabla f(x) = \nabla f(x),
$$

where the final equality uses our assumption that $\mathbb{E}_\mu[ZZ^\top] = I$. Consequently, given sufficiently small choices of $u_t$, the vector (6) should be a nearly unbiased estimator of the gradient $\nabla f(x_t)$.
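This unbiasedness fact is easy to verify numerically. The sketch below is our own illustration (the smooth test function is an arbitrary choice, not from the paper): it averages the two-point estimate over draws $Z \sim \mathsf{N}(0, I)$, which satisfies $\mathbb{E}[ZZ^\top] = I$, and compares the average to the true gradient:

```python
import numpy as np

# A smooth test function with a known gradient (illustrative choice).
def f(x):
    return np.log(1.0 + np.sum(x ** 2))

def grad_f(x):
    return 2.0 * x / (1.0 + np.sum(x ** 2))

rng = np.random.default_rng(1)
d, u = 5, 1e-4
x = rng.normal(size=d)

# Average the two-point estimate (f(x + u Z) - f(x)) / u * Z over many draws.
Z = rng.normal(size=(200_000, d))                 # Z ~ N(0, I), so E[Z Z^T] = I
diffs = (np.log(1.0 + np.sum((x + u * Z) ** 2, axis=1)) - f(x)) / u
err = np.max(np.abs((diffs[:, None] * Z).mean(axis=0) - grad_f(x)))
print(err)  # small: the estimator is nearly unbiased for grad f(x)
```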

In addition to the unbiasedness condition $\mathbb{E}_\mu[ZZ^\top] = I$, we require a few additional assumptions on $\mu$. The first ensures that the estimator (5) is well-defined: the domain of the functions $F(\cdot; s)$ and the support of $\mu$ must satisfy

$$
\mathcal{X} + u \operatorname{supp} \mu \subseteq \operatorname{dom} F(\cdot; s) \quad \text{for all } u \le u_1 \text{ and } P\text{-a.e. } s \in \mathcal{S} . \tag{7}
$$

If we apply smoothing with a Gaussian perturbation, the containment (7) implies $\operatorname{dom} F(\cdot; s) = \mathbb{R}^d$, though we still optimize over the compact set $\mathcal{X}$ in the update (3). We remark in passing that if the condition (7) fails, it is possible to optimize instead over the smaller domain $\{x \in \mathcal{X} : x + u \operatorname{supp}\mu \subseteq \mathcal{X}\}$, assuming w.l.o.g. that $\mathcal{X}$ has non-empty interior, so long as $\mu$ has compact support (cf. Agarwal et al. [1, Algorithm 2]). We also impose the following properties on the smoothing distribution: for $Z \sim \mu$, the quantity $M(\mu) := \mathbb{E}[\|Z\|^4 \|Z\|_*^2]^{1/2}$ is finite, and moreover, there is a function $s : \mathbb{N} \to \mathbb{R}_+$ such that

$$
\mathbb{E}_\mu\big[\|\langle g, Z \rangle\, Z\|_*^2\big] \le s(d)\, \|g\|_*^2 \quad \text{for any vector } g \in \mathbb{R}^d . \tag{8}
$$

Although the quantity $M(\mu)$ is required to be finite, its value does not appear explicitly in our theorem statements. On the other hand, the dimension-dependent quantity $s(d)$ from the condition (8) appears explicitly in our convergence rates. As an example of these two quantities, suppose that we take $\mu$ to be the distribution of the standard normal $Z \sim \mathsf{N}(0, I_{d \times d})$, and use the $\ell_2$-norm $\|\cdot\|_2$. In this case, a straightforward calculation shows that $M(\mu) = \sqrt{d(d+2)(d+4)} \le (d + 4)^{3/2}$ and $s(d) = d + 2$.
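For the Gaussian case, one can check by a direct moment calculation (our own computation, easily confirmed by Monte Carlo) that $\mathbb{E}[\|\langle g, Z\rangle Z\|_2^2] = (d + 2)\|g\|_2^2$ exactly for any fixed $g$, so the smallest admissible $s(d)$ grows linearly with the dimension:

```python
import numpy as np

# Monte Carlo check that, for Z ~ N(0, I_d) and any fixed vector g,
# E[ ||<g, Z> Z||_2^2 ] = (d + 2) * ||g||_2^2, i.e. s(d) = d + 2 suffices in (8).
rng = np.random.default_rng(2)
d = 10
g = rng.normal(size=d)
Z = rng.normal(size=(400_000, d))
lhs = np.mean((Z @ g) ** 2 * np.sum(Z ** 2, axis=1))  # E[<g,Z>^2 * ||Z||^2]
rhs = (d + 2) * np.sum(g ** 2)
print(lhs / rhs)  # close to 1
```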

Finally, as previously stated, the analysis of this section requires a smoothness assumption.

**Assumption C.** There is a function $L_1 : \mathcal{S} \to \mathbb{R}_+$ such that for $P$-almost every $s$, the function $F(\cdot; s)$ has $L_1(s)$-Lipschitz continuous gradient with respect to the norm $\|\cdot\|$, and moreover the quantity $L_1^2 := \mathbb{E}[L_1(S)^2]$ is finite.

Essential to stochastic gradient procedures—recall Assumption B and the result (4)—is that the gradient estimator be nearly unbiased and have small norm. Accordingly, the following lemma provides quantitative guarantees on the error associated with the gradient estimator (5).

Under the conditions (7) and (8) and Assumption C, the gradient estimate (5) has expectation

$$
\mathbb{E}\big[G(x; u, Z, S)\big] = \nabla f(x) + u\, v \tag{9}
$$

for a vector $v$ such that $\|v\|_* \le \frac{1}{2} L_1 M(\mu)$. Its expected squared norm has the bound

$$
\mathbb{E}\big[\|G(x; u, Z, S)\|_*^2\big] \le 2\, s(d)\, \mathbb{E}\big[\|g(x; S)\|_*^2\big] + u^2 L_1^2 M(\mu)^2 . \tag{10}
$$

See Section 4.2 for the proof. The bound (9) shows that the estimator (5) is unbiased for the gradient $\nabla f(x)$ up to a correction term of order $u$, while the second inequality (10) shows that the second moment is—up to an order $u^2$ correction—within a factor $2 s(d)$ of the standard second moment $\mathbb{E}[\|g(x; S)\|_*^2]$. We note in passing that the parameter $u$ in the lemma can be taken arbitrarily close to 0, which only makes $G(x; u, Z, S)$ a better estimate of $\nabla f(x)$. The intuition is straightforward: with two points, we can obtain arbitrarily accurate estimates of the directional derivative.

Our main result in this section is the following theorem on the convergence rate of the mirror descent method using the gradient estimator (6).

Under Assumptions A, B, and C and the conditions (7) and (8), consider a sequence $\{x_t\}$ generated according to the mirror descent update (3) using the gradient estimator (6), with step and perturbation sizes

$$
\alpha_t = \alpha\, \frac{R \sqrt{\lambda}}{L_0 \sqrt{s(d)\, t}}, \qquad u_t = \frac{L_0 \sqrt{s(d)}}{L_1 M(\mu)\, t} .
$$

Then for all $T \ge 1$,

$$
\mathbb{E}\big[f(\hat{x}(T))\big] - f(x^\star) \le \mathcal{O}(1) \max\{\alpha, \alpha^{-1}\}\, \frac{R L_0 \sqrt{s(d)}}{\sqrt{\lambda\, T}}, \tag{11}
$$

where $\hat{x}(T) = \frac{1}{T} \sum_{t=1}^T x_t$, and the expectation is taken with respect to the samples $S_t$ and $Z_t$.

The proof of Theorem 2.1 builds on
convergence proofs developed in the analysis of online and stochastic
convex optimization [38, 27, 1, 29], but requires additional technical care,
since we never truly receive unbiased gradients. We provide the proof
in Section 4.1.

Before continuing, we make a few remarks. First, the method is reasonably robust to the selection of the step-size multiplier $\alpha$; Nemirovski et al. [27] previously noted this robustness for gradient-based MD methods. As long as $\alpha_t \propto 1/\sqrt{s(d)\, t}$, mis-specifying the multiplier results in a scaling at worst linear in $\max\{\alpha, \alpha^{-1}\}$. We may also use multiple independent random samples $Z_{t,i}$, $i = 1, \ldots, m$, in the construction of the gradient estimator (6) to obtain more accurate estimates of the gradient via the average $g_t = \frac{1}{m} \sum_{i=1}^m G(x_t; u_t, Z_{t,i}, S_t)$. See Corollary 2.1.1 to follow for an example of this construction. In addition, the convergence rate of the method is independent of the Lipschitz continuity constant $L_1$ of the instantaneous gradients $\nabla F(\cdot; s)$, because, as noted following Lemma 2.1, we may take the perturbation size $u_t$ arbitrarily close to 0. This suggests that similar results may hold for non-differentiable functions; indeed, as we show in the next section, a slightly more complicated construction of the estimator (5) leads to analogous guarantees for general non-smooth functions.

Although we have provided bounds on the expected convergence rate, it is possible to give high-probability convergence guarantees [cf. 12, 27] under additional tail conditions on the estimator $G$—for example, under the boundedness condition $\|G(x; u, Z, S)\|_* \le C$ for some constant $C < \infty$—though obtaining sharp dimension dependence requires care. Additionally, while we have presented our results as convergence guarantees for stochastic optimization problems, an inspection of our analysis in Section 4.1 shows that we also obtain (expected) regret bounds for bandit online convex optimization problems [cf. 17, 6, 1].

#### 2.1.1 Examples and corollaries

We now provide examples of random sampling strategies that lead to concrete bounds for the mirror descent algorithm based on the subgradient estimator (6). For each corollary, we specify the norm $\|\cdot\|$, the proximal function $\psi$, and the distribution $\mu$. We then compute the values $s(d)$ and $M(\mu)$ that the distribution implies in the condition (8) and apply Theorem 2.1 to obtain a convergence rate.

We begin with a corollary that characterizes the convergence rate of our algorithm with the proximal function $\psi(x) = \frac{1}{2}\|x\|_2^2$ under a Lipschitz continuity condition:

Given an optimization domain $\mathcal{X} \subseteq \mathbb{B}_2(0, R)$, suppose that $\mu$ is uniform on the surface of the $\ell_2$-ball of radius $\sqrt{d}$, and that $\mathbb{E}[\|g(x; S)\|_2^2] \le L_0^2$ for $x \in \mathcal{X}$. Then

$$
\mathbb{E}\big[f(\hat{x}(T))\big] - f(x^\star) \le \mathcal{O}(1) \max\{\alpha, \alpha^{-1}\}\, \frac{R L_0 \sqrt{d}}{\sqrt{T}} .
$$

###### Proof.

For $\mu$ uniform on the sphere $\mathbb{S}_2(0, \sqrt{d})$, we have $\mathbb{E}[ZZ^\top] = I$ and $\|Z\|_2^2 = d$, so that $s(d) = d$ and $M(\mu) = d^{3/2}$. Moreover, the proximal function $\psi(x) = \frac{1}{2}\|x\|_2^2$ is 1-strongly convex with respect to $\|\cdot\|_2$, and $D_\psi(x^\star, x) \le 2R^2$ for $x, x^\star \in \mathbb{B}_2(0, R)$. Applying Theorem 2.1 with $\lambda = 1$ gives the result. ∎

The rate provided by Corollary 2.1.1 is the fastest derived to date for zero-order stochastic optimization using two function evaluations; both Agarwal et al. [1] and Nesterov [29] achieve rates of convergence of order $R L_0 d/\sqrt{T}$. In concurrent work, Ghadimi and Lan [19] provide a result (their Corollary 3.3) that achieves a similar rate to that above, but their primary focus is on non-convex problems. Moreover, we show in the sequel that this convergence rate is actually optimal.

Using multiple function evaluations yields faster convergence rates, as we obtain more accurate estimates of the instantaneous gradients $\nabla F(x_t; S_t)$. The following extension of Corollary 2.1.1 illustrates this effect: in addition to the conditions of Corollary 2.1.1, let $Z_{t,i}$, $i = 1, \ldots, m$, be sampled independently according to $\mu$, and at each iteration of mirror descent use the gradient estimate $g_t = \frac{1}{m} \sum_{i=1}^m G(x_t; u_t, Z_{t,i}, S_t)$ with suitably rescaled step and perturbation sizes. There exists a universal constant $c$ such that for all $T$,

$$
\mathbb{E}\big[f(\hat{x}(T))\big] - f(x^\star) \le c\, \frac{R L_0}{\sqrt{T}} \sqrt{1 + \frac{d}{m}} .
$$

Corollary 2.1.1 shows the intuitive result that, with a number of evaluations $m$ linear in the dimension $d$, it is possible to attain the standard (full-information) convergence rate $R L_0/\sqrt{T}$ (cf. [2]) using only function evaluations; we are (essentially) able to estimate the gradient $\nabla F(x_t; S_t)$. We provide a proof of Corollary 2.1.1 in Section 4.3.
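The variance-reduction effect is simple to see empirically. In the sketch below (our own illustration; the objective and all constants are arbitrary choices), averaging $m$ independent perturbations sharply reduces the second moment of the estimator, which is what drives the improved dependence on the number of evaluations:

```python
import numpy as np

def two_point(fx, x, u, z):
    """The two-point estimate (F(x + u z) - F(x)) / u * z for a fixed sample."""
    return (fx(x + u * z) - fx(x)) / u * z

rng = np.random.default_rng(3)
d, u, m, reps = 20, 1e-3, 10, 5_000
fx = lambda v: np.sum(np.abs(v) ** 1.5)   # toy objective, smooth away from 0
x = rng.normal(size=d)

single = np.array([two_point(fx, x, u, rng.normal(size=d)) for _ in range(reps)])
avg = np.array([np.mean([two_point(fx, x, u, rng.normal(size=d))
                         for _ in range(m)], axis=0) for _ in range(reps)])
m2_single = np.mean(np.sum(single ** 2, axis=1))
m2_avg = np.mean(np.sum(avg ** 2, axis=1))
print(m2_single, m2_avg)  # averaging m estimates shrinks the second moment
```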

In high-dimensional scenarios, appropriate choices for the proximal function $\psi$ yield better scaling on the norm of the gradients [26, 18, 27]. In the setting of online learning or stochastic optimization, suppose that one observes gradients $g(x; S)$. If the domain $\mathcal{X}$ is the simplex, then exponentiated gradient algorithms [22, 7] using the proximal function $\psi(x) = \sum_j x_j \log x_j$ obtain rates of convergence dependent on the $\ell_\infty$-norm of the gradients $\|g(x; S)\|_\infty$. This scaling is more palatable than bounds that depend on Euclidean norms applied to the gradient vectors, which may be a factor of $\sqrt{d}$ larger. Similar results apply using proximal functions based on $\ell_p$-norms [8, 7]. In our case, making the choices $\psi(x) = \frac{1}{2(p-1)}\|x\|_p^2$ with $p = 1 + 1/\log d$ and $\mu$ uniform on the hypercube $\{-1, 1\}^d$, we obtain the following corollary, which holds under the conditions of Theorem 2.1. Suppose that $\mathbb{E}[\|g(x; S)\|_\infty^2] \le L_0^2$, the optimization domain $\mathcal{X}$ is contained in the $\ell_1$-ball $\mathbb{B}_1(0, R)$, and $\mu$ is uniform on the hypercube $\{-1, 1\}^d$. There is a universal constant $c$ such that

$$
\mathbb{E}\big[f(\hat{x}(T))\big] - f(x^\star) \le c\, \frac{R L_0 \sqrt{d \log d}}{\sqrt{T}} .
$$

###### Proof.

The chosen proximal function $\psi(x) = \frac{1}{2(p-1)}\|x\|_p^2$ is strongly convex with respect to the norm $\|\cdot\|_p$ (see [26, Appendix 1]). In addition, letting $q = 1 + \log d$ denote the conjugate exponent to $p = 1 + 1/\log d$, the choice of $p$ implies $d^{1/q} \le e$, and $\|g\|_q \le d^{1/q}\|g\|_\infty \le e\,\|g\|_\infty$ for any $g \in \mathbb{R}^d$. Consequently, we have $\mathbb{E}[\|g(x; S)\|_q^2] \le e^2 L_0^2$, which allows us to apply Theorem 2.1 with the norm $\|\cdot\|_p$ and the dual norm $\|\cdot\|_q$.

We claim that the condition (8) is satisfied with $s(d) = e^2 d$. Since $Z$ is uniform on $\{-1, 1\}^d$, we have $\mathbb{E}[ZZ^\top] = I$ and $\|Z\|_q = d^{1/q} \le e$, whence

$$
\mathbb{E}\big[\|\langle g, Z \rangle\, Z\|_q^2\big] \le e^2\, \mathbb{E}\big[\langle g, Z \rangle^2\big] = e^2\, \|g\|_2^2 \le e^2 d\, \|g\|_\infty^2 \le e^2 d\, \|g\|_q^2 .
$$

Finally, we have $M(\mu) = \mathbb{E}[\|Z\|_p^4 \|Z\|_q^2]^{1/2} = d^{2/p + 1/q} \le e\, d^2$, which is finite as needed. By the inclusion of $\mathcal{X}$ in the $\ell_1$-ball of radius $R$ and our choice of proximal function, we have

$$
D_\psi(x^\star, x) \lesssim R^2 \log d \quad \text{for } x \in \mathcal{X} .
$$

(For instance, see Lemma 3 in the paper [18].) We thus find that $R^2/\lambda \lesssim R^2 \log d$, and using the step and perturbation size choices of Theorem 2.1 gives the result. ∎

Corollary 2.1.1 attains a convergence rate that scales with dimension as $\sqrt{d \log d}$, a much worse dependence on dimension than that of (stochastic) mirror descent using full gradient information [26, 27], which scales only as $\sqrt{\log d}$. As in the preceding corollaries, which exhibit similar additional factors of order $\sqrt{d}$, the additional dependence on $d$ suggests that while $\mathcal{O}(1/\epsilon^2)$ iterations are required to achieve $\epsilon$-optimization accuracy for mirror descent methods, the two-point method requires $\mathcal{O}(d/\epsilon^2)$ iterations to obtain the same accuracy. In Section 3 we show that this dependence is sharp: apart from logarithmic factors, no algorithm can attain better convergence rates, including in the problem-dependent constants $R$ and $L_0$.

### 2.2 Two-point gradient estimates and convergence rates: general case

We now turn to the general setting in which the function $F(\cdot; s)$, rather than having a Lipschitz continuous gradient, satisfies only the milder condition of Lipschitz continuity. The difficulty in this non-smooth case is that the simple gradient estimator (6) may have overly large norm. For instance, a naive calculation using only the Lipschitz continuity of the functions $F(\cdot; s)$ (with $L_0^2 = \mathbb{E}[L_0(S)^2]$) gives the bound

$$
\mathbb{E}\big[\|G(x; u, Z, S)\|_*^2\big] \le L_0^2\, \mathbb{E}\big[\|Z\|^2 \|Z\|_*^2\big] . \tag{12}
$$

This upper bound always scales at least quadratically in the dimension: in the $\ell_2$ case, we have the lower bound $\mathbb{E}[\|Z\|_2^4] \ge (\mathbb{E}[\|Z\|_2^2])^2 = d^2$, where the final equality uses the assumption $\mathbb{E}[ZZ^\top] = I$. This quadratic dependence on dimension leads to a sub-optimal convergence rate. Moreover, this scaling appears to be unavoidable using a single perturbing random vector: taking $F(x; s) = L_0\|x\|$ and setting $x = 0$ shows that the bound (12) may hold with equality.

Nevertheless, the convergence rate in Theorem 2.1 is independent of the smoothness constant $L_1$, which shows that *near* non-smoothness is effectively the same as smoothness. This suggests that if we can smooth the objective $f$ slightly, we may achieve a rate of convergence even in the non-smooth case that is roughly the same as that in Theorem 2.1.
The idea of smoothing the objective has been used to obtain faster
convergence rates in both deterministic and stochastic
optimization [28, 15]. In the stochastic
setting, Duchi et al. [15] leverage the well-known fact that convolution is a smoothing operation, and they consider minimization of a sequence of smoothed functions

$$
f_u(x) := \mathbb{E}\big[f(x + u\, Z)\big] = \int_{\mathbb{R}^d} f(x + u\, z)\, \mu(z)\, dz, \tag{13}
$$

where $Z \sim \mu$ and $\mu$ has a density with respect to Lebesgue measure. In this case, the function $f_u$ is always differentiable; moreover, if $f$ is Lipschitz, then $\nabla f_u$ is Lipschitz under mild conditions on $\mu$.
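As a one-dimensional illustration of the convolution (13) (our own example, with the closed form obtained by a standard Gaussian computation), smoothing the non-smooth function $f(x) = |x|$ with $Z \sim \mathsf{N}(0, 1)$ gives the differentiable function $f_u(x) = x\,\mathrm{erf}\!\big(x/(u\sqrt{2})\big) + u\sqrt{2/\pi}\, e^{-x^2/(2u^2)}$:

```python
import numpy as np
from math import erf, exp, pi, sqrt

def f_u_closed(x, u):
    """Closed form of f_u(x) = E[|x + u Z|] for Z ~ N(0, 1)."""
    return x * erf(x / (u * sqrt(2))) + u * sqrt(2 / pi) * exp(-x * x / (2 * u * u))

def f_u_quad(x, u, n=200_001, span=10.0):
    """The same smoothed value, by trapezoid-rule quadrature over the density."""
    z = np.linspace(-span, span, n)
    y = np.abs(x + u * z) * np.exp(-z ** 2 / 2) / np.sqrt(2 * np.pi)
    return float(np.sum((y[1:] + y[:-1]) / 2) * (z[1] - z[0]))

for x in (-0.7, 0.0, 0.3):
    print(f_u_closed(x, 0.5), f_u_quad(x, 0.5))  # the two columns agree
```

Note that $f_u$ is smooth at the kink $x = 0$ (with $f_u(0) = u\sqrt{2/\pi}$ and $f_u'(0) = 0$), while $f(x) = |x|$ is not differentiable there.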

The smoothed function (13) leads us to a *two-point* strategy: we use a random direction as in the smooth case (6) to estimate the gradient, but we introduce an extra step of randomization for the point at which we evaluate the function difference. Roughly speaking, this randomness has the effect of making it unlikely that the point at which we compute the function difference is near a point of non-smoothness, which allows us to apply results similar to those in the smooth case.

More precisely, our construction uses two non-increasing sequences of positive parameters $\{u_{1,t}\}$ and $\{u_{2,t}\}$ with $u_{2,t} \le u_{1,t}$, and two smoothing distributions $\mu_1$, $\mu_2$ on $\mathbb{R}^d$. Given smoothing constants $u_1, u_2$, vectors $z_1, z_2$, and observation $s$, we define the (non-smooth) directional gradient estimate at the point $x$ as

$$
G(x; u_1, u_2, z_1, z_2, s) := \frac{F(x + u_1 z_1 + u_2 z_2; s) - F(x + u_1 z_1; s)}{u_2}\, z_2 . \tag{14}
$$

Using $G$ we may define our gradient estimator, which follows the same intuition as our construction of the stochastic gradient (6) from the smooth estimator (5). Now, upon receiving the point $x_t$, we sample independent vectors $Z_{1,t} \sim \mu_1$ and $Z_{2,t} \sim \mu_2$, and set

$$
g_t = G(x_t; u_{1,t}, u_{2,t}, Z_{1,t}, Z_{2,t}, S_t) . \tag{15}
$$

We then proceed as in the preceding section, using this estimator in
the mirror descent method.
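The following sketch of the estimator (14)–(15) applies it to a concrete non-smooth objective (our own illustration: the $\ell_1$ objective, the Gaussian smoothing directions, and all constants are assumptions for the example). The outer perturbation $u_1 Z_1$ moves the evaluation point away from the kinks of the objective, and the averaged estimate tracks the gradient of the smoothed function $f_{u_1}$:

```python
import numpy as np
from math import erf, sqrt

def nonsmooth_two_point(F, x, u1, u2, rng):
    """One draw of the double-randomized estimate (14)-(15)."""
    z1 = rng.normal(size=x.size)   # outer smoothing direction, Z1 ~ N(0, I)
    z2 = rng.normal(size=x.size)   # inner difference direction, Z2 ~ N(0, I)
    base = x + u1 * z1             # randomized base point, rarely near a kink
    return (F(base + u2 * z2) - F(base)) / u2 * z2

rng = np.random.default_rng(4)
d, u1, u2 = 5, 0.1, 1e-3
s = rng.normal(size=d)
F = lambda v: np.sum(np.abs(v - s))         # non-smooth: f(x) = ||x - s||_1
x = rng.normal(size=d)

est = np.mean([nonsmooth_two_point(F, x, u1, u2, rng) for _ in range(50_000)],
              axis=0)
# Coordinate-wise gradient of the Gaussian-smoothed objective f_{u1}:
smooth_grad = np.array([erf(v / (u1 * sqrt(2))) for v in x - s])
print(np.max(np.abs(est - smooth_grad)))    # small: est tracks grad f_{u1}(x)
```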

To demonstrate the convergence of gradient-based schemes with the gradient estimator (15), we require a few additional assumptions. For simplicity, in this section we focus on results for the Euclidean norm $\|\cdot\|_2$. We impose the following condition on the Lipschitzian properties of $F$, which is a slight strengthening of Assumption B.

**Assumption D.** There is a function $L_0 : \mathcal{S} \to \mathbb{R}_+$ such that for $P$-a.e. $s \in \mathcal{S}$, the function $F(\cdot; s)$ is $L_0(s)$-Lipschitz with respect to the $\ell_2$-norm $\|\cdot\|_2$, and the quantity $L_0^2 := \mathbb{E}[L_0(S)^2]$ is finite.

We also impose the following assumption on the smoothing distributions $\mu_1$ and $\mu_2$.

**Assumption E.** The smoothing distributions are one of the following pairs: (1) both $\mu_1$ and $\mu_2$ are standard normal in $\mathbb{R}^d$ with identity covariance, (2) both $\mu_1$ and $\mu_2$ are uniform on the $\ell_2$-ball $\mathbb{B}_2(0, \sqrt{d+2})$ of radius $\sqrt{d+2}$, or (3) the distribution $\mu_1$ is uniform on the $\ell_2$-ball $\mathbb{B}_2(0, \sqrt{d+2})$ and the distribution $\mu_2$ is uniform on the $\ell_2$-sphere $\mathbb{S}_2(0, \sqrt{d})$. Additionally, we assume the containment

$$
\mathcal{X} + u_1 \operatorname{supp}\mu_1 + u_2 \operatorname{supp}\mu_2 \subseteq \operatorname{dom} F(\cdot; s) \quad \text{for } P\text{-a.e. } s \in \mathcal{S} .
$$

Under Assumptions D and E, the gradient estimator (14) has expectation

$$
\mathbb{E}\big[G(x; u_1, u_2, Z_1, Z_2, S)\big] = \nabla f_{u_1}(x) + v, \tag{16}
$$

where the error vector $v$ has bound $\|v\|_2 \le c\, L_0 \frac{u_2}{u_1} \sqrt{d}$. There exists a universal constant $c$ such that

$$
\mathbb{E}\big[\|G(x; u_1, u_2, Z_1, Z_2, S)\|_2^2\big] \le c\, L_0^2\, d \log(2d) + c\, L_0^2\, \frac{u_2^2}{u_1^2}\, d^2 . \tag{17}
$$
Comparing Lemma 2.2 to Lemma 2.1, both show that one can obtain nearly unbiased estimates of the gradient of the function $f$ using two function evaluations, but additionally, they show that the squared norm of the gradient estimator is *at most* $\mathcal{O}(d \log d)$ times larger than the expected squared norm of the subgradients $g(x; S)$, as captured by the quantity $L_0^2$ from Assumption B or D. In our approach, non-smoothness introduces an additional logarithmic penalty in the dimension; it may be possible to remove this factor, but we do not know how at this time. The key is that taking the second smoothing parameter $u_2$ to be small enough means that, aside from the dimension penalty $d \log(2d)$, the gradient estimator (15) is essentially unbiased for $\nabla f_{u_1}(x)$ and has squared norm at most $c\, L_0^2\, d \log(2d)$. This bound on size is essential for our main result, which we now state.
Under Assumptions A, D, and E, consider a sequence $\{x_t\}$ generated according to the mirror descent update (3) using the gradient estimator (15), with step sizes $\alpha_t \propto R\sqrt{\lambda}/(L_0 \sqrt{d \log(2d)\, t})$ and suitably small perturbation sizes $u_{1,t}$ and $u_{2,t} \le u_{1,t}$. Then there exists a universal (numerical) constant $c$ such that for all $T \ge 1$,

$$
\mathbb{E}\big[f(\hat{x}(T))\big] - f(x^\star) \le c\, \frac{R L_0 \sqrt{d \log(2d)}}{\sqrt{\lambda\, T}}, \tag{18}
$$

where $\hat{x}(T) = \frac{1}{T} \sum_{t=1}^T x_t$, and the expectation is taken with respect to the samples $S_t$ and $Z_{i,t}$.

The proof of Theorem 2.2 roughly follows that of Theorem 2.1, except that we prove that the sequence $\{x_t\}$ approximately minimizes the sequence of smoothed functions $f_{u_{1,t}}$ of the form (13) rather than $f$ itself. However, for small $u_{1,t}$, these two functions are quite close, which combined with the estimates from Lemma 2.2 gives the result. We give the full argument in Section 4.4.

Theorem 2.2 shows that the convergence rate of our two-point stochastic gradient algorithm for general non-smooth functions is (at worst) a factor of $\sqrt{\log(2d)}$ worse than the rate for smooth functions in Corollary 2.1.1. Notably, the rate of convergence here has substantially better dimension dependence than previously known results [1, 29, 19].

## 3 Lower bounds on zero-order optimization

Thus far, we have presented two main results
(Theorems 2.1
and 2.2) that provide achievable
rates for perturbation-based gradient procedures. It is natural to
wonder whether or not these rates are sharp. In this section, we show
that our results are—in general—unimprovable by more than a
constant factor (a logarithmic factor in dimension in the setting of
Corollary 2.1.1). These results show that *no*
algorithm can achieve a faster convergence rate than those
we have presented under the oracle
model (2).

We begin by describing the notion of minimax error. Let $\mathcal{F}$ be a collection of pairs $(F, P)$, each of which defines an objective function $f$ of the form (1). Let $\mathcal{A}_T$ denote the collection of all algorithms that observe a sequence of data points $(Y_1, \ldots, Y_T) \subset \mathbb{R}^2$ with $Y_t = [F(x_t; S_t), F(y_t; S_t)]$ and return an estimate $\hat{x}(T) \in \mathcal{X}$. Given an algorithm $\mathcal{A} \in \mathcal{A}_T$ and a pair $(F, P)$, we define the optimality gap

$$
\epsilon_T(\mathcal{A}, F, P, \mathcal{X}) := f\big(\hat{x}(T)\big) - \inf_{x \in \mathcal{X}} f(x), \qquad \text{where } f(x) = \mathbb{E}_P\big[F(x; S)\big]
$$

and $\hat{x}(T)$ is the output of algorithm $\mathcal{A}$ on the sequence of observed function values. The expectation of this random variable defines the *minimax error*

$$
\epsilon_T^*(\mathcal{F}, \mathcal{X}) := \inf_{\mathcal{A} \in \mathcal{A}_T}\; \sup_{(F, P) \in \mathcal{F}}\; \mathbb{E}\big[\epsilon_T(\mathcal{A}, F, P, \mathcal{X})\big], \tag{19}
$$

where the expectation is taken over the observations $(Y_1, \ldots, Y_T)$ and any additional randomness in the algorithm $\mathcal{A}$. This quantity measures the performance of the best algorithm in $\mathcal{A}_T$, where performance is required to be uniformly good over the class $\mathcal{F}$.

We now turn to the statement of our lower bounds, which are based on simple choices of the classes $\mathcal{F}$. For a given $\ell_q$-norm $\|\cdot\|_q$, we consider the class of linear functionals

$$
\mathcal{F}_q := \Big\{ (F, P) : F(x; s) = \langle x, s \rangle, \;\; \mathbb{E}_P\big[\|S\|_q^2\big] \le L_0^2 \Big\} .
$$

Each of these function classes satisfies Assumption D by construction, and moreover, each gradient $\nabla F(\cdot; s) \equiv s$ has Lipschitz constant 0. We state each of our lower bounds assuming that the domain $\mathcal{X}$ is equal to some $\ell_p$-ball of radius $R$, that is, $\mathcal{X} = \mathbb{B}_p(0, R)$. Our first result considers the case with domain an arbitrary $\ell_p$-ball with $p \le 2$, so we measure gradients in the $\ell_2$-norm. For the class $\mathcal{F}_2$ and $\mathcal{X} = \mathbb{B}_p(0, R)$ with $p \le 2$, we have

$$
\epsilon_T^*\big(\mathcal{F}_2, \mathbb{B}_p(0, R)\big) \ge c\, \frac{L_0 R \sqrt{d}}{\sqrt{T}} . \tag{20}
$$

Combining the lower bound (20) with our algorithmic schemes in Section 2 shows that they are optimal up to constant factors. More specifically, for $p \le 2$, the $\ell_2$-ball of radius $R$ contains the $\ell_p$-ball of radius $R$, so Corollary 2.1.1 provides an upper bound on the minimax rate of convergence of order $R L_0 \sqrt{d}/\sqrt{T}$ in the smooth case, while Proposition 3 provides the lower bound $c\, R L_0 \sqrt{d}/\sqrt{T}$. Theorem 2.2, providing a rate of $R L_0 \sqrt{d \log(2d)}/\sqrt{T}$ in the general (non-smooth) case, is also tight to within logarithmic factors. Consequently, the stochastic gradient descent algorithm (3) coupled with the sampling strategies (6) and (15) is optimal for stochastic problems with two-point feedback.

We can prove a parallel lower bound that applies when using multiple ($m$) function evaluations in each iteration, that is, in the context of Corollary 2.1.1. In this case, an inspection of the proof of Proposition 3 shows that we have the bound

$$
\epsilon_T^*\big(\mathcal{F}_2, \mathbb{B}_2(0, R)\big) \ge c\, \frac{L_0 R \sqrt{d}}{\sqrt{m\, T}} . \tag{21}
$$

We show this in the remarks following the proof of Proposition 3 in Section 5.1. In particular, we see that the minimax rate of convergence over the $\ell_2$-ball is $L_0 R \sqrt{d/(m T)}$, which approaches the full-information minimax rate of convergence, $L_0 R/\sqrt{T}$, as $m \to d$.

For our second lower bound, we investigate the minimax rates at which it is possible to solve stochastic convex optimization problems in which the objective is Lipschitz continuous with respect to the $\ell_1$-norm, or equivalently, in which the gradients are bounded in the $\ell_\infty$-norm. As noted earlier, such scenarios are suitable for high-dimensional problems [e.g. 27]. For the class $\mathcal{F}_\infty$ with $\mathcal{X} = \mathbb{B}_1(0, R)$, we have

$$
\epsilon_T^*\big(\mathcal{F}_\infty, \mathbb{B}_1(0, R)\big) \ge c\, \frac{L_0 R \sqrt{d}}{\sqrt{T}} .
$$

This result also demonstrates the optimality of our mirror descent algorithms up to logarithmic factors. Recalling Corollary 2.1.1, the MD algorithm (3) with $\psi(x) = \frac{1}{2(p-1)}\|x\|_p^2$, where $p = 1 + 1/\log d$, implies that $\mathbb{E}[f(\hat{x}(T))] - f(x^\star) \lesssim R L_0 \sqrt{d \log d}/\sqrt{T}$. On the other hand, Proposition 3 provides the lower bound $c\, R L_0 \sqrt{d}/\sqrt{T}$. These upper and lower bounds match up to logarithmic factors in dimension.

It is worth comparing these lower bounds to the achievable rates of convergence when full gradient information is available—that is, when one has access to the subgradient selection $g(x; S) \in \partial F(x; S)$—and when one has access to only a single function evaluation