
Asymptotic convergence rates for averaging strategies

08/10/2021
by Laurent Meunier, et al.
Facebook

Parallel black box optimization consists in estimating the optimum of a function f using λ parallel evaluations of f. Averaging the μ best individuals among the λ evaluations is known to provide better estimates of the optimum of a function than just picking up the best one. In continuous domains, this averaging is typically just based on (possibly weighted) arithmetic means. Previous theoretical results were based on quadratic objective functions. In this paper, we extend the results to a wide class of functions, containing three times continuously differentiable functions with a unique optimum. We prove formal rates of convergence and show that, asymptotically in λ, they are indeed better than those of pure random search. We validate our theoretical findings with experiments on some standard black box functions.


1. Introduction

Finding the minimum of a function from a set of points and their images is a standard task, used for instance in hyper-parameter tuning (Bergstra and Bengio, 2012) or in control problems. While the random search estimate of the optimum consists in returning the best of the λ evaluated points, in this paper we focus on the related strategy that consists in averaging the μ best samples, i.e. returning the mean of the μ points with the lowest objective values.
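As a minimal illustration (this is our own sketch, not code from the paper; the function names random_search_estimate and mu_best_average are ours), the two estimators can be written as follows:

import numpy as np

def random_search_estimate(points: np.ndarray, values: np.ndarray) -> np.ndarray:
    # Random search: keep only the evaluated point with the lowest objective value.
    return points[np.argmin(values)]

def mu_best_average(points: np.ndarray, values: np.ndarray, mu: int) -> np.ndarray:
    # Averaging strategy: unweighted mean of the mu points with the lowest values.
    best = np.argsort(values)[:mu]
    return points[best].mean(axis=0)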

These kinds of strategies are used in many evolutionary algorithms such as CMA-ES. Although experiments show that these methods perform well, it is still not fully understood why taking the average of the μ best points actually leads to a lower regret. In (Meunier et al., 2020a), it is proved in the case of quadratic functions that the regret is indeed lower for the averaging strategy than for pure random search. In the present paper, we extend this result by proving convergence rates for a wide class of functions, including three times continuously differentiable functions with a unique optimum.

1.1. Related work

1.1.1. Better than picking up the best

Given a finite number of samples equipped with their fitness values, we can simply pick up the best one, average the “best ones” (Beyer, 1995; Meunier et al., 2020a), or apply a surrogate model (Gupta et al., 2021; Sudret, 2012; Dushatskiy et al., 2021; Auger et al., 2005; Bossek et al., 2019; Rudi et al., 2020). Overall, picking up the best is quite robust, but the surrogate or the averaging usually provides better convergence rates. Surrogate modeling is fast when the dimension is moderate and the objective function is smooth (the simple regret decreases polynomially in the number of points, with an exponent that improves with the degree of differentiability and degrades with the dimension, leading to superlinear rates in evolutionary computation (Auger et al., 2005)). In this paper, we are interested in the rates obtained by averaging the μ best samples for a wide class of functions. We extend the results of (Meunier et al., 2020a), which only hold for the sphere function.

1.1.2. Weighted averaging

Among the various forms of averaging, it has been proposed in (Teytaud and Teytaud, 2009) to take into account the fact that the sampling is not uniform (evolutionary algorithms in continuous domains typically use Gaussian sampling); we here simplify the analysis by considering uniform sampling in a ball, though we acknowledge that this introduces the constraint that the optimum indeed lies in the ball. (Arnold et al., 2009; Auger et al., 2011) have proposed weights depending on the fitness values, though they acknowledge a moderate impact; we here consider equal weights for the μ best points.

1.1.3. Choosing the selection rate

The choice of the selection rate is quite debated in evolutionary computation: see for instance (Escalante and Reyes, 2013), (Beyer and Sendhoff, 2008), (Beyer and Schwefel, 2002), (Hansen and Ostermeier, 2003), (Teytaud, 2007; Fournier and Teytaud, 2010), and still others in (Beyer, 1995; Jebalia and Auger, 2010). In this paper, we focus on the selection rate μ/λ in the case of parallel optimization, when the number of samples λ is very large. We carefully analyze the asymptotic behaviour of this ratio and derive convergence rates for the corresponding selection rule.
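To make the discussion concrete, here is a small illustrative sweep (our own toy setup, not the paper's experiments; the objective, dimensions and ratios are arbitrary choices) estimating the simple regret of μ-best averaging for several selection ratios μ/λ, with points sampled uniformly in the unit ball and a smooth toy objective with unique optimum at 0:

import numpy as np

rng = np.random.default_rng(0)

def sample_ball(n, d):
    # Uniform samples in the d-dimensional unit ball.
    x = rng.normal(size=(n, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    return x * rng.uniform(size=(n, 1)) ** (1.0 / d)

def f(x):
    # Toy smooth objective, unique optimum at 0, f(0) = 0, with a non-symmetric cubic term.
    return np.sum(x ** 2, axis=1) + 0.1 * np.sum(x ** 3, axis=1)

d, lam, runs = 5, 4000, 50
for ratio in (0.001, 0.01, 0.1, 0.5):
    mu = max(1, int(ratio * lam))
    regrets = []
    for _ in range(runs):
        pts = sample_ball(lam, d)
        best = np.argsort(f(pts))[:mu]
        regrets.append(f(pts[best].mean(axis=0, keepdims=True))[0])
    print(f"mu/lambda = {ratio:5.3f}  mean simple regret = {np.mean(regrets):.2e}")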

1.1.4. Taking into account many basins

While averaging the μ best samples, the non-uniqueness of the optimum might lead to averaging points coming from different basins. We therefore first consider the case of a unique optimum, and hence a unique basin, and then aim to tackle the case where there are possibly several basins. Island models (Skolicki, 2007) have also been proposed for taking different basins into account. (Meunier et al., 2020a) proposed a tool for adapting μ depending on the (non-)quasi-convexity. In the present work, we extend the methodology proposed in (Meunier et al., 2020a).

1.2. Outline

In the present paper, we first introduce, in Section 2, the large class of functions we will study, and establish some useful properties of these functions in Section 3. Then, in Section 4, we prove upper and lower convergence rates of random search for these functions. In Section 5, we extend (Meunier et al., 2020a) by showing that, asymptotically in the number of samples λ, the averaging approach achieves a better convergence rate than random search on these functions. We then extend our results to wider classes of functions in Section 6. Finally, we experimentally validate our theoretical findings and compare them with other parallel optimization methods.

2. Beyond quadratic functions

In the present section, we introduce the assumptions under which we extend the results of (Meunier et al., 2020a) to the non-quadratic case. We will denote by B(c, r) the closed ball of center c and radius r in ℝ^d, which is endowed with its canonical Euclidean norm ‖·‖; the corresponding open ball is denoted similarly. All other balls appearing in what follows use the same notation. For any subset S of ℝ^d, we will denote by U(S) the uniform law on S.

Let f be a continuous function defined on a closed ball, for which we would like to find an optimum point x*. The existence of such an optimum point is guaranteed by continuity on a compact set. For the sake of simplicity, we assume that f(x*) = 0. We define the sublevel sets of f as follows.

Definition 2.1 ().

Let f be a continuous function. The closed sublevel set of f of level c is defined as:

L_c = {x : f(x) ≤ c}.

We now describe the assumptions we will make on the function f that we optimize.

Assumption 1 ().

f is a continuous function and admits a unique optimum point x* such that f(x*) = 0. Moreover, we assume that f can be written as:

f(x) = ‖x − x*‖²_A + h(x)‖x − x*‖^(2+δ)

for some bounded function h (there exists M > 0 such that |h(x)| ≤ M for all x), a symmetric positive definite matrix A and a real number δ > 0.

Note that A is uniquely defined by the previous relation. In the following, we will denote by λ_min(A) and λ_max(A) respectively the smallest and the largest eigenvalue of A. As A is positive definite, we have 0 < λ_min(A) ≤ λ_max(A). We will also set ‖x‖_A = √(xᵀAx), which is a norm (the A-norm) on ℝ^d as A is symmetric positive definite. We then have λ_min(A)‖x‖² ≤ ‖x‖²_A ≤ λ_max(A)‖x‖² for all x ∈ ℝ^d.

Remark 1 (Why a unique optimum?).

The uniqueness of the optimum is a hypothesis required to avoid the selected samples coming from two or more wells of f. In that case, the averaging strategy would lead to a poor estimate, because points from different wells would be averaged together. Nonetheless, multimodal functions can be tackled using our non-quasi-convexity trick (Section 6.2).
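The following toy script (our own construction, not taken from the paper) illustrates the point: on a bimodal objective, the μ best samples come from two separate wells and their average lands between the wells, far from either optimum.

import numpy as np

rng = np.random.default_rng(1)
d, lam, mu = 2, 2000, 200
w1 = np.array([0.5, 0.0])   # two wells, at +0.5 and -0.5 on the first coordinate
w2 = np.array([-0.5, 0.0])

def bimodal(x):
    # Distance squared to the nearest of the two wells.
    return np.minimum(np.sum((x - w1) ** 2, axis=1),
                      np.sum((x - w2) ** 2, axis=1))

x = rng.uniform(-1.0, 1.0, size=(lam, d))   # uniform in a box, for simplicity
vals = bimodal(x)
avg = x[np.argsort(vals)[:mu]].mean(axis=0)
print("average of the mu best points:", avg)           # near the origin, between the wells
print("regret of the average        :", bimodal(avg[None])[0])
print("regret of the best point     :", vals.min())    # much smaller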

Remark 2 (Which functions satisfy Assumption 1?).

One may wonder whether Assumption 1 is restrictive or not. We can remark that three times continuously differentiable functions satisfy the assumption, as long as the unique optimum satisfies a strict second-order stationarity condition. Also, we will see in Section 6.1 that the results remain valid for strictly increasing transformations of any f for which Assumption 1 holds, so that we indirectly include all piecewise linear functions as well, as long as they have a unique optimum. The class of functions is therefore very large, and in particular it allows non-symmetric functions to be treated, which might seem counter-intuitive at first.

The aim of this paper is to study the following parallel optimization problem. We sample λ points x_1, …, x_λ independently from the uniform distribution on the ball on which f is defined. Let x_(1), …, x_(λ) denote the ordered samples, where the order is given by the objective function: f(x_(1)) ≤ … ≤ f(x_(λ)). We then introduce the μ-best average x̄_μ = (1/μ) Σ_{i=1}^{μ} x_(i). In the rest of the paper, we will compare the standard random search algorithm (i.e. μ = 1) with the algorithm that returns the average of the μ best points. To this end, we will study the expected simple regret E[f(x̄_μ)] for functions satisfying Assumption 1 (recall that f(x*) = 0, so that f(x̄_μ) is exactly the simple regret).
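A Monte Carlo sketch of this protocol (with a toy sphere objective and an arbitrary rule μ = λ/100, both our own choices rather than the paper's experimental setup) compares the expected simple regret of random search (μ = 1) with that of the μ-best average as λ grows:

import numpy as np

rng = np.random.default_rng(2)

def sample_ball(n, d):
    # Uniform samples in the d-dimensional unit ball.
    x = rng.normal(size=(n, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    return x * rng.uniform(size=(n, 1)) ** (1.0 / d)

def f(x):
    # Sphere objective: unique optimum x* = 0 with f(x*) = 0.
    return np.sum(x ** 2, axis=1)

def expected_regret(lam, mu, d=3, runs=200):
    # Monte Carlo estimate of E[f(mean of the mu best points among lam samples)].
    total = 0.0
    for _ in range(runs):
        pts = sample_ball(lam, d)
        best = np.argsort(f(pts))[:mu]
        total += f(pts[best].mean(axis=0, keepdims=True))[0]
    return total / runs

for lam in (500, 2000, 8000):
    mu = max(1, lam // 100)   # arbitrary selection rule, for illustration only
    print(f"lambda = {lam:5d}  random search: {expected_regret(lam, 1):.2e}"
          f"  mu-best average: {expected_regret(lam, mu):.2e}")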

3. Technical lemmas

In this section, we prove two technical lemmas on f that will be useful to study the convergence of the algorithm. The first one shows that f can be upper and lower bounded by two spherical functions.

Lemma 3.1 ().

Under Assumption 1, there exist two real numbers c1 > 0 and c2 > 0 such that, for all x:

c1‖x − x*‖² ≤ f(x) ≤ c2‖x − x*‖²     (1)

Moreover, such c1 and c2 must satisfy c1 ≤ λ_min(A) and c2 ≥ λ_max(A).

Proof.

As A is symmetric positive definite, we have the following classical inequality for the A-norm:

λ_min(A)‖x‖² ≤ ‖x‖²_A ≤ λ_max(A)‖x‖²     (2)

Now set for

By the above inequalities, we have

Thus, as , we obtain . By assumption, the function is also bounded as .
We thus conclude that there exists such that, for all

Now notice that is a closed subset of the compact set hence it is also compact. Moreover, by assumption is continuous on and for all Hence is continuous and positive on this compact set. Thus it attains its minimum and maximum on this set and its minimum is positive. In particular, we can write, on this set, for some

We now set . Note that because and (as is positive definite). We also set which is also positive. These are global bounds for which gives the first part of the result.
For the second part, let u_min and u_max be normalized eigenvectors respectively associated to λ_min(A) and λ_max(A). Evaluating the two sides of (1) at x = x* + t·u_min and x = x* + t·u_max and taking the limit as t → 0, we get that, if c1 satisfies (1), then c1 ≤ λ_min(A); similarly, we can prove that c2 ≥ λ_max(A). ∎
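A small numerical sanity check of this sandwich property (purely illustrative: the exact statement of Assumption 1 is not fully reproduced above, so the toy function and the candidate constants c1, c2 below are our own choices, assuming a quadratic A-norm term plus a bounded higher-order remainder):

import numpy as np

rng = np.random.default_rng(5)
d = 4
A = np.diag(np.linspace(1.0, 3.0, d))   # eigenvalues between 1 (lambda_min) and 3 (lambda_max)

def f(x):
    # Quadratic A-norm term plus a bounded cubic remainder (illustrative instance).
    quad = np.einsum("ni,ij,nj->n", x, A, x)
    return quad + 0.1 * np.sum(x ** 3, axis=1)

# Random points in the unit ball.
x = rng.normal(size=(5000, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)
x *= rng.uniform(size=(5000, 1)) ** (1.0 / d)

r2 = np.sum(x ** 2, axis=1)
c1, c2 = 0.9, 3.1   # candidates: c1 below lambda_min, c2 above lambda_max
print("c1 * ||x||^2 <= f(x):", bool(np.all(f(x) >= c1 * r2)))
print("f(x) <= c2 * ||x||^2:", bool(np.all(f(x) <= c2 * r2)))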

Secondly, we frame the sublevel sets of f between two ellipsoids as the level goes to 0. This lemma is a consequence of the assumptions we make on f.

Lemma 3.2 ().

Under Assumption 1, there exists such that for , we have where:

with and two functions satisfying

when for some constants and which are respectively a (specific) lower and upper bound for .

Proof.

By assumption, we have:

Let . This is a continuous, strictly increasing function on . By a classical consequence of the intermediate value theorem, this implies that admits a continuous, strictly increasing inverse function. Note that hence . Thus we can write . We now denote by . As is non-decreasing, we get

Now observe that for sufficiently small

Indeed, if , we have by the triangle inequality and (2)

Recall that by assumption and let . As , for sufficiently small, we have hence for sufficiently small, which gives the inclusion .
For the asymptotics of , as we have by definition , and as we deduce that . Let us define . We have . We then compute:

This gives

As for , we obtain

which concludes for .

On the other side, we recall that f(x) > 0 for all x ≠ x*, as x* is the unique minimum of f on its domain. Write

Now observe that, as , we have for , by the triangle inequality, . Hence, by the classical inequality (2) for the A-norm, we get

So we have:

The function is differentiable. A study of the derivative shows that is continuous, strictly increasing on and continuous, strictly decreasing on where . Hence admits a continuous strictly increasing inverse and a continuous strictly decreasing inverse . We thus write

Hence

with . We now show that for sufficiently small

Indeed, note first that if , we obtain by (2)

where we have used that, as , the triangle inequality gives . Hence . We now show that . Indeed, at , are by definition, the two roots of

Hence . By continuity of at , we obtain that for sufficiently small. As , we thus obtain that, for sufficiently small, . Next, the same line of reasoning as the one for , using that and , shows that for sufficiently small.
Hence, for small enough we have

This gives .
Finally, similarly to , we can show that , which concludes the proof of this lemma. ∎

4. Bounds for random search

In this section, we provide upper and lower bounds for the random search algorithm for functions satisfying Assumption 1. These bounds will also be useful for analyzing the convergence of the μ-best approach.

4.1. Upper bound

First, we prove an upper bound for functions satisfying Assumption 1.

Lemma 4.1 (Upper bound for the random search algorithm).

Let f be a function satisfying Assumption 1. There exist a constant C > 0 and an integer λ0 such that for all integers λ ≥ λ0:

Proof.

Let us first recall the following classical property about the expectation of a positive valued random variable X:

E[X] = ∫_0^∞ P(X ≥ t) dt.

By independence of the samples, we have:

P(f(x_(1)) ≥ t) = P(f(x_1) ≥ t)^λ.

Then thanks to Lemma 3.1:

where the second equality follows because almost surely. Then, by definition of the uniform law as well as the non-increasing character of , we obtain

Note that . Thus the second term in the last equality satisfies . The first term has a closed form given in (Meunier et al., 2020a):

Finally, thanks to Stirling's approximation, we conclude:

where C is a constant independent of λ. ∎

This lemma proves that the strategy consisting in returning the best sample (i.e. random search) has an upper rate of convergence of order λ^(-2/d), which depends on the dimension d of the space. It is also worth noting that this kind of result is common in the literature (Rudi et al., 2020; Bergstra and Bengio, 2012).
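A quick empirical check of this scaling (our own toy experiment on the sphere function, not the paper's; on a log-log plot of regret against λ one expects a fitted slope close to -2/d):

import numpy as np

rng = np.random.default_rng(4)

def sample_ball(n, d):
    # Uniform samples in the d-dimensional unit ball.
    x = rng.normal(size=(n, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    return x * rng.uniform(size=(n, 1)) ** (1.0 / d)

def rs_regret(lam, d, runs=100):
    # Expected simple regret of random search on the sphere f(x) = ||x||^2.
    return np.mean([np.min(np.sum(sample_ball(lam, d) ** 2, axis=1)) for _ in range(runs)])

for d in (2, 5, 10):
    lams = np.array([250, 500, 1000, 2000, 4000])
    regrets = np.array([rs_regret(lam, d) for lam in lams])
    slope = np.polyfit(np.log(lams), np.log(regrets), 1)[0]
    print(f"d = {d:2d}  fitted slope of log-regret vs log-lambda: {slope:+.2f}  (-2/d = {-2.0/d:+.2f})")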

4.2. Lower bound

We now give a lower bound for the convergence of the random search algorithm. We also prove a conditional expectation bound that will be useful for the analysis of the μ-best averaging approach.

Lemma 4.2 (Lower bound for the random search algorithm).

Let f be a function satisfying Assumption 1. There exist a constant c > 0 and an integer λ0 such that for all integers λ ≥ λ0, we have the following lower bound for random search:

Moreover, let be a sequence of integers such that , and . Then, there exist a constant and such that for all and , we have the following lower bound when the sampling is conditioned:

Proof.

The proof is very similar to the previous one. Let us first show the unconditional inequality. We use the identity for the expectation of a positive random variable

Since the samples are independent, we have

Using Lemma 3.1, we get:

We can decompose the integral to obtain:

where the last inequality follows from Stirling's approximation applied to the first term, and because the second term is handled as in the previous proof.
This concludes the proof of the first part of the lemma. Let us now treat the case of the conditional inequality. Using the same first identity as above, we have

Remark 3 ().

Note that if we sample λ independent variables while conditioning on the value of the (μ+1)-th best sample, and keep only the μ best variables (i.e. those whose values do not exceed it), this is exactly equivalent to sampling μ independent points directly from the corresponding sublevel set. This result was justified and used in the proofs of (Meunier et al., 2020a).

Hence we obtain

Using Lemma 3.1, we get:

where the last inequality follows from the inclusion , which is also a consequence of Lemma 3.1. We then get

This lemma, along with Lemma 4.1, proves that for any function satisfying Assumption 1, the rate of convergence of random search is exponentially dependent on the dimension, being of order λ^(-2/d), where λ is the number of points sampled to estimate the optimum.
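The conditioning argument of Remark 3 above can also be illustrated numerically (toy sphere objective, our own script): conditionally on the value t of the (μ+1)-th best sample, the μ best points behave like μ i.i.d. uniform points in the sublevel set {f ≤ t}, which we mimic here by rejection sampling:

import numpy as np

rng = np.random.default_rng(3)
d, lam, mu = 2, 500, 50

def sample_ball(n):
    # Uniform samples in the d-dimensional unit ball.
    x = rng.normal(size=(n, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    return x * rng.uniform(size=(n, 1)) ** (1.0 / d)

def f(x):
    return np.sum(x ** 2, axis=1)

pts = sample_ball(lam)
vals = f(pts)
order = np.argsort(vals)
t = vals[order[mu]]            # value of the (mu+1)-th best sample
best = pts[order[:mu]]         # the mu best samples, all with f <= t

# Rejection sampling: mu i.i.d. uniform points in the sublevel set {f <= t}.
cond = []
while len(cond) < mu:
    y = sample_ball(1)
    if f(y)[0] <= t:
        cond.append(y[0])
cond = np.array(cond)

print("mean of the mu best samples      :", best.mean(axis=0))
print("mean of the sublevel-set samples :", cond.mean(axis=0))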

Remark 4 (Convergence of the distance to the optimum).

It is worth noting that, thanks to Lemma 3.1, the convergence rates are also valid for the squared distance to the optimum.

5. Convergence rates for the μ-best averaging approach

In this section, we focus on the case where we average the μ best samples among the λ samples. We first prove a lemma where the sampling is conditioned on the value of the (μ+1)-th best sample.

Lemma 5.1 ().

Let f be a function satisfying Assumption 1. There exists a constant C > 0 such that, for any two integers μ and λ with μ < λ, we have the following conditional upper bound:

Proof.

We first decompose the expectation as follows.

(3)
(4)

where we have used the same argument as in Remark 3 for the first equality. We will treat the terms (3) and (4) independently. We first look at (3). We have the following “bias-variance” decomposition.

We will use Lemma 3.2. We have . Hence for the variance term

where “∼” means “is equivalent to when λ → +∞”; in other words, u_λ ∼ v_λ iff u_λ/v_λ → 1 as λ → +∞. For the bias term, recall that