Finding the minimum of a function from a set of points and their images is a standard task, used for instance in hyperparameter tuning (Bergstra and Bengio, 2012) or in control problems. While the random search estimate of the optimum consists in returning the best sampled point $x_{(1)} = \arg\min_{1 \le i \le n} f(x_i)$, in this paper we focus on the related strategy that consists in averaging the $k$ best samples, i.e. returning $\bar{x}_n^k = \frac{1}{k} \sum_{i=1}^{k} x_{(i)}$, where $x_{(1)}, \dots, x_{(k)}$ are the samples with the $k$ smallest objective values.
Strategies of this kind are used in many evolutionary algorithms, such as CMA-ES. Although experiments show that these methods perform well, it is still not understood why taking the average of the best points actually leads to a lower regret. In (Meunier et al., 2020a), it is proved, in the case of quadratic functions, that the regret is indeed lower for the averaging strategy than for pure random search. In this paper, we extend that result by proving convergence rates for a wide class of functions, including three times continuously differentiable functions with a unique optimum.
1.1. Related work
1.1.1. Better than picking up the best
Given a finite number of samples equipped with their fitness values, we can simply pick the best one, average the “best ones” (Beyer, 1995; Meunier et al., 2020a), or apply a surrogate model (Gupta et al., 2021; Sudret, 2012; Dushatskiy et al., 2021; Auger et al., 2005; Bossek et al., 2019; Rudi et al., 2020). Overall, picking the best is quite robust, but the surrogate or the averaging usually provides better convergence rates. Surrogate modeling is fast when the dimension is moderate and the objective function is smooth (with simple regret rates that improve with the degree of differentiability, leading to superlinear rates in evolutionary computation (Auger et al., 2005)). In this paper, we are interested in the rates obtained by averaging the $k$ best samples for a wide class of functions. We extend the results of (Meunier et al., 2020a), which only hold for the sphere function.
1.1.2. Weighted averaging
Among the various forms of averaging, it has been proposed, in (Teytaud and Teytaud, 2009), to take into account the fact that the sampling is not uniform (evolutionary algorithms in continuous domains typically use Gaussian sampling). We simplify the analysis here by considering uniform sampling in a ball, though we acknowledge that this introduces the constraint that the optimum indeed lies in the ball. (Arnold et al., 2009; Auger et al., 2011) have proposed weights depending on the fitness value, though they acknowledge a moderate impact: we consider equal weights for the best points here.
1.1.3. Choosing the selection rate
The choice of the selection rate is much debated in evolutionary computation: see for instance (Escalante and Reyes, 2013), (Beyer and Sendhoff, 2008), (Beyer and Schwefel, 2002), (Hansen and Ostermeier, 2003), (Teytaud, 2007; Fournier and Teytaud, 2010), and still others in (Beyer, 1995; Jebalia and Auger, 2010). In this paper, we focus on the selection rate when the number of samples is very large, as in parallel optimization. In this case, the optimal selection ratio $k/n$ tends to $0$. We carefully analyze this ratio and derive convergence rates using it.
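To illustrate this regime numerically, the following sketch (a toy experiment under illustrative assumptions — a shifted sphere objective, dimension $5$, budget $n = 1000$ — not the paper's experimental protocol) sweeps the number $k$ of averaged points for a fixed budget and locates the empirically best selection size:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_ball(n, d, rng):
    """n i.i.d. points uniform in the unit ball of R^d."""
    x = rng.standard_normal((n, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)          # uniform on the sphere
    return x * (rng.random((n, 1)) ** (1.0 / d))           # radius density ~ r^(d-1)

d, n, trials = 5, 1000, 30
x_star = np.zeros(d)
x_star[0] = 0.5                                            # optimum away from the center
f = lambda pts: np.sum((pts - x_star) ** 2, axis=1)        # shifted sphere (toy choice)

ks = [1, 5, 20, 50, 100, 250, 500, 1000]
regret = {k: 0.0 for k in ks}
for _ in range(trials):
    pts = sample_ball(n, d, rng)
    order = np.argsort(f(pts))
    for k in ks:
        xbar = pts[order[:k]].mean(axis=0)                 # average of the k best
        regret[k] += f(xbar[None, :])[0] / trials

best_k = min(regret, key=regret.get)                       # empirically best selection size
```

On this toy problem the empirically best $k$ lies strictly between $1$ and $n$: averaging a small fraction of the samples beats both pure random search and averaging everything, consistent with a small optimal selection ratio.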
1.1.4. Taking into account many basins
While averaging the $k$ best samples, the non-uniqueness of the optimum might lead to averaging points coming from different basins. We therefore first consider the case of a unique optimum, and hence a unique basin. We then aim to tackle the case where there are possibly several basins. Island models (Skolicki, 2007) have also been proposed for taking different basins into account. (Meunier et al., 2020a) proposed a tool for adapting the strategy depending on (non-)quasiconvexity. In the present work, we extend the methodology proposed in (Meunier et al., 2020a).
In the present paper, we first introduce, in Section 2, the large class of functions we study, and establish some useful properties of these functions in Section 3. Then, in Section 4, we prove upper and lower convergence rates for random search on these functions. In Section 5, we extend (Meunier et al., 2020a) by showing that, asymptotically in the number of samples $n$, the functions we handle admit a better convergence rate than random search. We then extend our results to wider classes of functions in Section 6. Finally, we validate our theoretical findings experimentally and compare with other parallel optimization methods.
2. Beyond quadratic functions
In the present section, we present the assumptions under which we extend the results from (Meunier et al., 2020a) to the non-quadratic case. We will denote by $B(x, r)$ the closed ball centered at $x$ of radius $r$ in $\mathbb{R}^d$, endowed with its canonical Euclidean norm, denoted by $\|\cdot\|$. We will also denote by $\mathring{B}(x, r)$ the corresponding open ball. All other balls appearing in what follows use the same notation. For any subset $S \subset \mathbb{R}^d$, we will denote by $\mathcal{U}(S)$ the uniform law on $S$.
Let $f : B(0,1) \to \mathbb{R}$ be a continuous function for which we would like to find an optimum point $x^* \in \arg\min_{x \in B(0,1)} f(x)$. The existence of such an optimum point is guaranteed by the continuity of $f$ on a compact set. For the sake of simplicity, we assume that $f(x^*) = 0$. We define the sublevel sets of $f$ as follows.
Definition 2.1.
Let $f : B(0,1) \to \mathbb{R}$ be a continuous function. The closed sublevel set of $f$ of level $\tau$ is defined as:
$$L_{\tau} = \{x \in B(0,1) : f(x) \le \tau\}.$$
We now describe the assumptions we will make on the function that we optimize.
Assumption 1.
$f : B(0,1) \to \mathbb{R}$ is a continuous function that admits a unique optimum point $x^* \in \mathring{B}(0,1)$ such that $f(x^*) = 0$. Moreover, we assume that $f$ can be written:
$$f(x) = \|x - x^*\|_A^{\alpha} + \varepsilon(x)\, \|x - x^*\|^{\alpha + 1},$$
for some bounded function $\varepsilon$ (there exists $M > 0$ such that for all $x$, $|\varepsilon(x)| \le M$), $A$ a symmetric positive definite matrix and $\alpha > 0$ a real number.
Note that $A$ is uniquely defined by the previous relation. In the following, we will denote by $\lambda_{\min}$ and $\lambda_{\max}$ respectively the smallest and the largest eigenvalues of $A$. As $A$ is positive definite, we have $0 < \lambda_{\min} \le \lambda_{\max}$. We will also set $\|h\|_A = \sqrt{h^{\top} A h}$, which is a norm (the $A$-norm) on $\mathbb{R}^d$ as $A$ is symmetric positive definite.
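These quantities are easy to check numerically. The sketch below builds an arbitrary symmetric positive definite matrix (an illustrative choice, not one from the paper) and verifies the classical eigenvalue sandwich $\lambda_{\min}\|h\|^2 \le \|h\|_A^2 \le \lambda_{\max}\|h\|^2$ on random vectors:

```python
import numpy as np

rng = np.random.default_rng(1)

# An arbitrary symmetric positive definite matrix (illustrative only).
M = rng.standard_normal((4, 4))
A = M @ M.T + 4.0 * np.eye(4)                    # SPD by construction

lam = np.linalg.eigvalsh(A)
lam_min, lam_max = float(lam[0]), float(lam[-1])

def norm_A(h):
    """The A-norm ||h||_A = sqrt(h^T A h); a true norm because A is SPD."""
    return float(np.sqrt(h @ A @ h))

# Check lam_min * ||h||^2 <= ||h||_A^2 <= lam_max * ||h||^2 on random vectors.
ok = True
for _ in range(200):
    h = rng.standard_normal(4)
    n2 = float(h @ h)
    v = norm_A(h) ** 2
    ok &= (lam_min * n2 - 1e-9 <= v <= lam_max * n2 + 1e-9)
```

This inequality is the reason the $A$-norm and the Euclidean norm are interchangeable up to constants throughout the analysis.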
Remark 1 (Why a unique optimum?).
The uniqueness of the optimum is a hypothesis required to prevent the selected samples from coming from two or more wells of $f$. In that case, the averaging strategy would lead to a mistaken point, because points from different wells would be averaged together. Nonetheless, multimodal functions can be tackled using our non-quasiconvexity trick (Section 6.2).
Remark 2 (Which functions satisfy Assumption 1?).
One may wonder whether Assumption 1 is restrictive. We can remark that three times continuously differentiable functions satisfy the assumption with $\alpha = 2$, as long as the unique optimum satisfies a strict second-order stationarity condition. Also, we will see in Section 6.1 that the results are immediately valid for strictly increasing transformations of any $f$ for which Assumption 1 holds, so that we indirectly include all piecewise linear functions as well, as long as they have a unique optimum. The class of functions is therefore very large, and in particular allows non-symmetric functions to be treated, which might seem counterintuitive at first.
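To see concretely why such functions fit the assumption, consider the hypothetical non-symmetric $C^3$ function $f(x) = x_1^2 + 2x_2^2 + x_1^3$ near its optimum $x^* = 0$, so that $A = \mathrm{diag}(1, 2)$ and $\alpha = 2$ (an illustrative example, not one from the paper). The ratio $f(x)/\|x\|_A^2$ then tends to $1$ as $x \to 0$, with a deviation shrinking linearly in $\|x\|$:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical nonsymmetric C^3 function with unique local structure at x* = 0:
#   f(x) = x1^2 + 2*x2^2 + x1^3, i.e. ||x||_A^2 plus a bounded cubic remainder,
#   with A = diag(1, 2) and alpha = 2.
def f(x):
    return x[0] ** 2 + 2.0 * x[1] ** 2 + x[0] ** 3

A = np.diag([1.0, 2.0])

def norm_A_sq(x):
    return float(x @ A @ x)

# The worst relative deviation |f(x)/||x||_A^2 - 1| over fixed random directions
# should shrink linearly with the distance r to the optimum.
dirs = rng.standard_normal((20, 2))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
worst = []
for r in (0.1, 0.01, 0.001):
    devs = [abs(f(r * u) / norm_A_sq(r * u) - 1.0) for u in dirs]
    worst.append(max(devs))
```

The three recorded deviations decrease with $r$, matching the Taylor-remainder intuition behind Assumption 1.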
The aim of this paper is to study the following parallel optimization problem. We sample
$x_1, \dots, x_n$ independently from the uniform distribution on $B(0,1)$. Let
$x_{(1)}, \dots, x_{(n)}$ denote the ordered random variables, where the order is given by the objective function: $f(x_{(1)}) \le f(x_{(2)}) \le \dots \le f(x_{(n)})$.
We then introduce the $k$-best average:
$$\bar{x}_n^k = \frac{1}{k} \sum_{i=1}^{k} x_{(i)}.$$
In the rest of the paper, we will compare the standard random search algorithm (i.e. $k = 1$) with the algorithm that returns the average of the $k$ best points. To this end, we will study the expected simple regret $\mathbb{E}\big[f(\bar{x}_n^k)\big]$ for functions satisfying Assumption 1.
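The sampling and averaging pipeline just described can be sketched as follows, for an anisotropic quadratic satisfying Assumption 1 (the matrix $A$, optimum location, dimension and budgets below are illustrative assumptions, not choices from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_ball(n, d, rng):
    """n i.i.d. points from the uniform law on the closed unit ball of R^d."""
    x = rng.standard_normal((n, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    return x * (rng.random((n, 1)) ** (1.0 / d))

d = 5
A = np.diag([1.0, 2.0, 3.0, 4.0, 5.0])            # SPD matrix of Assumption 1
x_star = np.full(d, 0.1)                          # optimum inside the ball
f = lambda pts: np.einsum('ni,ij,nj->n', pts - x_star, A, pts - x_star)  # alpha = 2

def k_best_average(pts, k):
    """Average of the k samples with smallest objective value."""
    order = np.argsort(f(pts))
    return pts[order[:k]].mean(axis=0)

# Expected simple regret E[f(xbar)] (f(x*) = 0), estimated by Monte Carlo.
n, k, trials = 2000, 20, 50
rs, avg = 0.0, 0.0
for _ in range(trials):
    pts = sample_ball(n, d, rng)
    rs += f(k_best_average(pts, 1)[None, :])[0] / trials    # random search (k = 1)
    avg += f(k_best_average(pts, k)[None, :])[0] / trials   # k-best averaging
```

On this instance, averaging the $k$ best points yields a markedly smaller expected simple regret than returning the single best sample.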
3. Technical lemmas
In this section, we prove two technical lemmas on $f$ that will be useful for studying the convergence of the algorithm. The first one shows that $f$ can be upper and lower bounded by two spherical functions.
Lemma 3.1.
Under Assumption 1, there exist two real numbers $c, C > 0$ such that, for all $x \in B(0,1)$:
$$c\, \|x - x^*\|^{\alpha} \le f(x) \le C\, \|x - x^*\|^{\alpha}. \quad (1)$$
Moreover, such $c$ and $C$ must satisfy $c \le \lambda_{\min}^{\alpha/2} \le \lambda_{\max}^{\alpha/2} \le C$.
As $A$ is symmetric positive definite, we have the following classical inequality for the $A$-norm: for all $h \in \mathbb{R}^d$,
$$\lambda_{\min} \|h\|^2 \le \|h\|_A^2 \le \lambda_{\max} \|h\|^2. \quad (2)$$
Now set for
By the above inequalities, we have
Thus, as , we obtain .
By assumption, the function $\varepsilon$ is also bounded: $|\varepsilon(x)| \le M$ for all $x$.
We thus conclude that there exists such that, for all
Now notice that $B(0,1) \setminus \mathring{B}(x^*, r)$ is a closed subset of the compact set $B(0,1)$, hence it is also compact. Moreover, by assumption, $f$ is continuous on $B(0,1)$ and $f(x) > 0$ for all $x \neq x^*$. Hence $f$ is continuous and positive on this compact set. Thus it attains its minimum and maximum on this set, and its minimum is positive. In particular, we can write, on this set, $0 < m \le f(x) \le M'$ for some constants $m, M'$.
We now set . Note that
because and (as
is positive definite). We also set
which is also positive. These are global bounds for $f$, which gives the first part of the result.
For the second part, let $v_{\min}$ and $v_{\max}$ be normalized eigenvectors respectively associated to $\lambda_{\min}$ and $\lambda_{\max}$. Considering points of the form $x = x^* + t\, v_{\min}$ and taking the limit as $t \to 0$, we get that, if $c$ satisfies (1), then $c \le \lambda_{\min}^{\alpha/2}$. Similarly, using $v_{\max}$, we can prove that $C \ge \lambda_{\max}^{\alpha/2}$.∎
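As a numerical sanity check of this lemma, consider the hypothetical function $f(x) = \|x\|_A^2 + 0.1\|x\|^3$ with $A = \mathrm{diag}(1, 2, 5)$ and $x^* = 0$ (so $\alpha = 2$ and the bounded remainder has $M = 0.1$; an illustrative example, not one from the paper). On the unit ball, every ratio $f(x)/\|x\|^2$ must then lie within $[\lambda_{\min}, \lambda_{\max} + 0.1] = [1, 5.1]$:

```python
import numpy as np

rng = np.random.default_rng(4)

A = np.diag([1.0, 2.0, 5.0])                     # lambda_min = 1, lambda_max = 5

def f(x):
    """Hypothetical Assumption-1 function: ||x||_A^2 plus a bounded cubic term."""
    return float(x @ A @ x) + 0.1 * np.linalg.norm(x) ** 3

ratios = []
for _ in range(5000):
    x = rng.standard_normal(3)
    x *= rng.random() ** (1.0 / 3) / np.linalg.norm(x)   # uniform in the unit ball
    ratios.append(f(x) / float(x @ x))

c_hat, C_hat = min(ratios), max(ratios)
# Consistently with Lemma 3.1, any valid spherical bounds c, C must satisfy
# c <= lambda_min and lambda_max <= C; here every observed ratio is in [1, 5.1].
```

The extreme observed ratios bracket the eigenvalue range, as the lemma predicts.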
Secondly, we frame the sublevel sets of $f$ between two ellipsoids as the level tends to $0$. This lemma is a consequence of the assumptions we make on $f$.
Lemma 3.2.
Under Assumption 1, there exists such that for , we have where:
with and two functions satisfying
when for some constants and which are respectively a (specific) lower and upper bound for .
By assumption , hence we have:
Let . This is a continuous, strictly increasing function on . By a classical consequence of the intermediate value theorem, this implies that admits a continuous, strictly increasing inverse function. Note that hence . Thus we can write . We now denote by . As is non-decreasing, we get
Now observe that for sufficiently small
Indeed, if , we have by the triangle inequality and (2)
Recall that by assumption and let .
As , for sufficiently
small, we have hence
for sufficiently small, which gives the inclusion .
For the asymptotics of , as we have by definition , and as we deduce that . Let us define . We have . We then compute:
As for , we obtain
which concludes for .
On the other side, we recall that $f(x) > 0$ for all $x \neq x^*$, as $x^*$ is the unique minimum of $f$ on $B(0,1)$. Write
Now observe that, as , we have for , by the triangle inequality, . Hence, by the classical inequality for the -norm (2), we get
So we have:
The function is differentiable. A study of the derivative shows that is continuous, strictly increasing on and continuous, strictly decreasing on where . Hence admits a continuous strictly increasing inverse and a continuous strictly decreasing inverse . We thus write
with . We now show that for sufficiently small
Indeed, note first that if , we obtain by (2)
where we have used that, as , the triangle inequality gives . Hence . We now show that . Indeed, at , are, by definition, the two roots of
Hence . By continuity
of at , we obtain that
for sufficiently small. As ,
we thus obtain that, for sufficiently small, .
Next, the same line of reasoning as the one for , using
that and ,
shows that for sufficiently small.
Hence, for small enough we have
This gives .
Finally, similarly to , we can show that , which concludes the proof of this lemma. ∎
4. Bounds for random search
In this section, we provide upper and lower bounds for the random search algorithm on functions satisfying Assumption 1. These bounds will also be useful for analyzing the convergence of the $k$-best approach.
4.1. Upper bound
First, we prove an upper bound for functions satisfying Assumption 1.
Lemma 4.1 (Upper bound for the random search algorithm).
Let $f$ be a function satisfying Assumption 1. There exist a constant $C_1 > 0$ and an integer $n_0$ such that for all integers $n \ge n_0$:
$$\mathbb{E}\big[f(x_{(1)})\big] \le C_1\, n^{-\alpha/d}.$$
Let us first recall the following classical identity for the expectation of a nonnegative random variable $X$: $\mathbb{E}[X] = \int_0^{\infty} \mathbb{P}(X \ge t)\, \mathrm{d}t$.
By independence of the samples we have:
Then thanks to Lemma 3.1:
where the second equality follows because almost surely. Then, by definition of the uniform law as well as the non-increasing character of , we obtain
Note that . Thus the second term in the last equality satisfies . The first term has a closed form given in (Meunier et al., 2020a):
Finally, thanks to Stirling's approximation, we conclude:
where the constant is independent of $n$. ∎
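The resulting rate can be observed empirically. The sketch below (sphere function, $d = 4$, $\alpha = 2$; budgets and trial counts are illustrative choices, not the paper's experiments) estimates $\mathbb{E}[\min_i f(x_i)]$ by Monte Carlo and fits a log-log slope, which should be close to $-\alpha/d = -0.5$:

```python
import numpy as np

rng = np.random.default_rng(5)

def sample_ball(n, d, rng):
    """n i.i.d. points uniform in the unit ball of R^d."""
    x = rng.standard_normal((n, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    return x * (rng.random((n, 1)) ** (1.0 / d))

d, alpha = 4, 2
ns = [100, 400, 1600]
mean_min = []
for n in ns:
    vals = []
    for _ in range(200):                                   # Monte Carlo trials
        pts = sample_ball(n, d, rng)
        vals.append(np.min(np.sum(pts * pts, axis=1)))     # f(x) = ||x||^alpha
    mean_min.append(np.mean(vals))

slope = np.polyfit(np.log(ns), np.log(mean_min), 1)[0]     # should be near -alpha/d
```

The fitted slope is close to $-1/2$ here, matching the $n^{-\alpha/d}$ scaling of the lemma.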
4.2. Lower bound
We now give a lower bound on the convergence of the random search algorithm. We also prove a conditional expectation bound that will be useful for the analysis of the $k$-best averaging approach.
Lemma 4.2 (Lower bound for the random search algorithm).
Let be a function satisfying Assumption 1. There exist a constant and such that for all integers , we have the following lower bound for random search:
Moreover, let be a sequence of integers such that , and . Then, there exist a constant and such that for all and , we have the following lower bound when the sampling is conditioned:
The proof is very similar to the previous one. Let us first show the unconditional inequality. We use the identity for the expectation of a positive random variable
Since the samples are independent, we have
Using Lemma 3.1, we get:
We can decompose the integral to obtain:
where the last inequality follows by Stirling's approximation applied to the first term, and because the second term is handled as in the previous proof.
This concludes the proof of the first part of the lemma. Let us now treat the case of the conditional inequality. Using the same first identity as above we have
Remark 3.
Note that if we sample $n$ independent variables while conditioning on the value of the $(k+1)$-th best, and keep only the $k$ best variables, then this is exactly equivalent to sampling $k$ points directly from the corresponding sublevel set. This result was justified and used in the proofs of (Meunier et al., 2020a).
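This equivalence is a standard order-statistics fact, easy to check empirically in dimension one with $f(x) = x^2$ on $[-1, 1]$: conditionally on the $(k+1)$-th best sample, the $k$ best samples, rescaled by the conditioning radius, are i.i.d. uniform on $[-1, 1]$. A sketch with illustrative parameters:

```python
import numpy as np

rng = np.random.default_rng(6)

n, k, trials = 50, 5, 2000
rescaled = []
for _ in range(trials):
    x = rng.uniform(-1.0, 1.0, size=n)
    order = np.argsort(x ** 2)                   # order the samples by f(x) = x^2
    s = abs(x[order[k]])                         # radius of the (k+1)-th best sample
    rescaled.extend(x[order[:k]] / s)            # k best samples, mapped to [-1, 1]

rescaled = np.asarray(rescaled)
m1 = rescaled.mean()                             # ~ 0 for Uniform[-1, 1]
m2 = (rescaled ** 2).mean()                      # ~ 1/3 for Uniform[-1, 1]
```

The first two moments of the rescaled points match those of the uniform law on $[-1, 1]$ ($0$ and $1/3$), as predicted by the equivalence.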
This lemma, along with Lemma 4.1, proves that for any function satisfying Assumption 1, the rate of convergence of random search depends exponentially on the dimension $d$ and is of order $n^{-\alpha/d}$, where $n$ is the number of points sampled to estimate the optimum.
Remark 4 (Convergence of the distance to the optimum).
It is worth noting that, thanks to Lemma 3.1, the convergence rates are also valid for the squared distance to the optimum.
5. Convergence rates for the $k$-best averaging approach
In this section, we focus on the case where we average the $k$ best samples among the $n$ sampled points. We first prove a lemma in which the sampling is conditioned on the $(k+1)$-th best value.
Lemma 5.1.
Let $f$ be a function satisfying Assumption 1. There exists a constant such that for all , and for all integers $n$ and $k$ such that $k + 1 \le n$, we have the following conditional upper bound:
We first decompose the expectation as follows.
We have the following “bias-variance” decomposition.
We will use Lemma 3.2. We have . Hence for the variance term
where $\sim$ means “is equivalent to when $n \to \infty$”; in other words, $a_n \sim b_n$ if and only if $a_n / b_n \to 1$ as $n \to \infty$. For the bias term, recall that