Noisy Optimization: Convergence with a Fixed Number of Resamplings

04/09/2014
by   Marie-Liesse Cauwet, et al.

It is known that evolution strategies in continuous domains might not converge in the presence of noise. It is also known that, under mild assumptions, and using an increasing number of resamplings, one can mitigate the effect of additive noise and recover convergence. We show new sufficient conditions for the convergence of an evolutionary algorithm with a constant number of resamplings; in particular, we get fast rates (log-linear convergence) provided that the variance decreases around the optimum slightly faster than in the so-called multiplicative noise model. Keywords: Noisy optimization, evolutionary algorithm, theory.


1 Introduction

Given a domain $\mathcal{D} \subseteq \mathbb{R}^d$, with $d$ a positive integer, a noisy objective function is a stochastic process $f : (x, \omega) \mapsto f(x, \omega)$ with $x \in \mathcal{D}$ and $\omega$ a random variable independently sampled at each call to $f$. Noisy optimization is the search for $x$ such that $\mathbb{E}[f(x, \omega)]$ is approximately minimal. Throughout the paper, $x^*$ denotes the unknown exact optimum, supposed to be unique. For any positive integer $n$, $x_n$ denotes the search point used in the $n$-th function evaluation. We here consider black-box noisy optimization, i.e. we can access $f$ only through calls to a black box which, on request $x$, (i) randomly samples $\omega$ and (ii) returns $f(x, \omega)$.

Among zero-order methods proposed to solve noisy optimization problems, some of the most usual are evolution strategies; [1] has studied the performance of evolution strategies in the presence of noise, and investigated their robustness by tuning the population size of the offspring and the mutation strength. Another approach consists in using resamplings of each individual (averaging multiple resamplings reduces the noise), rather than increasing the population size. Resampling means that, when evaluating $f(x, \omega)$, several independent copies $\omega_1, \dots, \omega_r$ of $\omega$ are used (i.e. the black-box oracle is called several times with the same $x$) and we use $\frac{1}{r} \sum_{i=1}^{r} f(x, \omega_i)$ as an approximate fitness value in the optimization algorithm. The key point is how to choose $r$, the number of resamplings, for a given $x$.

Another crucial point is the model of noise. Different models of noise can be considered: additive noise (Eq. 3), multiplicative noise (Eq. 4) or a more general model (Eq. 5). Notice that, in Eq. 5 when $z > 0$, the noise decreases to zero near the optimum; this setting is not artificial, as we can observe this behavior in many real problems.
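The resampling idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `noisy_oracle` is a hypothetical one-dimensional black box with additive Gaussian noise, and the names `resampled_fitness`, `x_star` and `noise_std` are introduced here, not taken from the paper.

```python
import random
import statistics

def noisy_oracle(x, rng, x_star=0.0, p=2, noise_std=1.0):
    """Hypothetical black box: samples the noise internally and returns f(x, omega)."""
    return abs(x - x_star) ** p + noise_std * rng.gauss(0.0, 1.0)

def resampled_fitness(x, r, rng, oracle=noisy_oracle):
    """Average r independent oracle calls made at the same point x."""
    return sum(oracle(x, rng) for _ in range(r)) / r

rng = random.Random(0)
# Compare the spread of the fitness estimate at x = 1 with r = 1 and r = 16.
single = [resampled_fitness(1.0, 1, rng) for _ in range(2000)]
averaged = [resampled_fitness(1.0, 16, rng) for _ in range(2000)]
print("variance with r=1 :", statistics.pvariance(single))
print("variance with r=16:", statistics.pvariance(averaged))
```

Averaging $r$ independent evaluations divides the variance of the estimate by $r$, which is why the second printed variance is expected to be roughly sixteen times smaller than the first.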

Let us give an example in which the noise variance decreases to zero around the optimum. Consider a Direct Policy Search problem, i.e. the optimization of a parametric policy on simulations. Assume that we optimize the success rate of a policy, and that the optimum policy has a success rate of 100%. Then the variance is zero at the optimum.
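To make the example concrete, one can model each simulation as a Bernoulli trial. The success probability $s(x)$ of the policy with parameters $x$ is a symbol introduced here for illustration only:

```latex
% Assumption: one simulation of policy x succeeds with probability s(x).
\mathrm{Var}\bigl[\mathbf{1}_{\text{success}}\bigr] = s(x)\,\bigl(1 - s(x)\bigr),
\qquad
s(x^*) = 1 \;\Longrightarrow\; \mathrm{Var}\bigl[\mathbf{1}_{\text{success}}\bigr] = 0 .
```

The variance vanishes continuously as $s(x)$ approaches 1, so the noise shrinks as the search gets close to the optimal policy.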

1.1 Convergence Rates: log-linear convergence and log-log convergence

Depending on the specific class of optimization problems and on some internal properties of the algorithm considered, we obtain different uniform rates of convergence (where the convergence can be almost sure, in probability or in expectation, depending on the setting); a fast rate will be a log-linear convergence, as follows:

$$\limsup_{n} \frac{\log \|x_n - x^*\|}{n} = -A < 0 \tag{1}$$

In the noise-free case, evolution strategies typically converge linearly in log-linear scale, as shown in [5, 7, 8, 15, 18].
The algorithm presents a slower rate of convergence in case of log-log convergence, as follows:

$$\limsup_{n} \frac{\log \|x_n - x^*\|}{\log n} = -A < 0 \tag{2}$$

The log-log rates are typical rates in the noisy case (see [2, 4, 9, 10, 11, 16, 17]). Nevertheless, we will here show that, under specific assumptions on the noise (if the noise around the optimum decreases “quickly enough”, see Section 1.4), we can reach faster rates: log-linear convergence rates as in Eq. 1, by averaging a constant number of resamplings of $f(x, \omega)$.
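The difference between the two regimes can be illustrated numerically. The sketch below assumes idealized error sequences, $\|x_n - x^*\| = e^{-An}$ for log-linear convergence and $\|x_n - x^*\| = n^{-A}$ for log-log convergence, and recovers $-A$ as a least-squares slope in the matching scale. The helper `slope` and the chosen constants are illustrative, not from the paper.

```python
import math

def slope(xs, ys):
    """Least-squares slope of ys against xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

ns = list(range(1, 201))
loglinear_err = [math.exp(-0.3 * n) for n in ns]  # error decays like exp(-A n), A = 0.3
loglog_err = [n ** -0.5 for n in ns]              # error decays like n^(-A), A = 0.5

# Log-linear scale: log(error) against n.
rate_linear = slope(ns, [math.log(e) for e in loglinear_err])
# Log-log scale: log(error) against log(n).
rate_loglog = slope([math.log(n) for n in ns], [math.log(e) for e in loglog_err])
print(rate_linear, rate_loglog)
```

After 200 evaluations the first sequence has reached about $e^{-60}$ while the second is still near $0.07$, which is the practical meaning of "fast" versus "slow" rates here.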

1.2 Additive noise model

Additive noise refers to:

$$f(x, \omega) = \|x - x^*\|^p + N(\omega) \tag{3}$$

where $p$ is a positive integer and where $N(\omega)$ is sampled independently with a fixed given distribution. In this model, the noise has lower bounded variance, even in the neighborhood of the optimum. Convergence is then typically linear in log-log scale (cf. Eq. 2), as discussed in [2, 9, 10, 11, 16, 17]. This case, important in applications, has been studied in [9, 11, 12, 16], where tight bounds have been shown for stochastic gradient algorithms using finite differences. When using evolution strategies, [4] has shown mathematically that an exponential number of resamplings (number of resamplings scaling exponentially with the index of iterations) or an adaptive number of resamplings (scaling as a polynomial of the inverse step-size) can both lead to a log-log convergence rate.

1.3 Multiplicative noise model

Multiplicative noise, in the unimodal spherical case, refers to

$$f(x, \omega) = \|x - x^*\|^p + \|x - x^*\|^p \, N(\omega) \tag{4}$$

and some compositions (by increasing mappings) of this function, where $p$ is a positive integer and where $N(\omega)$ is sampled independently with a fixed given distribution. [14] has studied the convergence of evolution strategies in noisy environments with multiplicative noise, and essentially shows that the result depends on the noise distribution: if $N(\omega)$ is conveniently lower bounded, then some standard evolution strategy converges to the optimum; if arbitrarily negative values can be sampled with non-zero probability, then it does not converge.

1.4 A more general noise model

Eqs. 3 and 4 are particular cases of a more general noise model:

$$f(x, \omega) = \|x - x^*\|^p + \|x - x^*\|^{pz/2} \, N(\omega) \tag{5}$$

where $p$ is a positive integer, $z \geq 0$, and $N(\omega)$ is sampled independently with a fixed given distribution. Eq. 5 boils down to Eq. 3 when $z = 0$ and to Eq. 4 when $z = 2$. We will here obtain fast rates for some larger values of $z$. More precisely, we will show that when $z > 2$, we obtain log-linear rates, as in Eq. 1. Incidentally, this shows some tightness (with respect to $z$) of conditions for non-convergence in [14].
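The role of the exponent can be seen by comparing the noise amplitude with the expected fitness. The sketch below assumes that the general model has the form $\|x\|^p + \|x\|^{pz/2} N$ with $x^* = 0$ and standard Gaussian $N$; the function names `noisy_fitness` and `noise_to_signal` are hypothetical.

```python
import math
import random

def noisy_fitness(x, z, rng, p=2):
    """Assumed general model with x* = 0: ||x||^p + ||x||^(p z / 2) * N."""
    dist = math.sqrt(sum(v * v for v in x))
    return dist ** p + dist ** (p * z / 2) * rng.gauss(0.0, 1.0)

def noise_to_signal(dist, z, p=2):
    """Noise standard deviation divided by the expected fitness at distance dist."""
    return dist ** (p * z / 2) / dist ** p

# z = 0: additive noise, z = 2: multiplicative noise, z = 3: faster-decreasing noise.
for z in (0.0, 2.0, 3.0):
    ratios = [noise_to_signal(dist, z) for dist in (1.0, 0.1, 0.01)]
    print("z =", z, "noise/signal at distance 1, 0.1, 0.01:", ratios)
```

With $z = 0$ the ratio blows up near the optimum, with $z = 2$ it stays constant, and with $z > 2$ it goes to zero, which is the regime in which a constant number of resamplings is claimed to suffice.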

2 Theoretical analysis

Section 2.1 is devoted to some preliminaries. Section 2.2 presents results for constant numbers of resamplings on our generalized noise model (Eq. 5) when $z > 2$.

2.1 Preliminary: noise-free case

Typically, an evolution strategy at iteration $n$:

  • generates $\lambda$ individuals using the current estimate $x_n$ of the optimum $x^*$ and the so-called mutation strength (or step-size) $\sigma_n$,

  • provides a pair $(x_{n+1}, \sigma_{n+1})$ where $x_{n+1}$ is a new estimate of $x^*$ and $\sigma_{n+1}$ is a new mutation strength.

From now on, to simplify notation, we assume that $x^* = 0$.

For some evolution strategies and in the noise-free case, we know (see e.g. Theorem 4 in [5]) that there exists a constant $A$ such that, almost surely:

$$\frac{\log \|x_n\|}{n} \xrightarrow[n \to \infty]{} -A \tag{6}$$
$$\frac{\log \sigma_n}{n} \xrightarrow[n \to \infty]{} -A \tag{7}$$

This paper will discuss cases in which an algorithm verifying Eqs. 6, 7 in the noise-free case also verifies them in a noisy setting.

Remarks: In the general case of arbitrary evolution strategies (ES), we do not know whether $A$ is positive, but:

  • in the case of a $(1+1)$-ES with generalized one-fifth success rule, $A > 0$; see [6];

  • in the case of a self-adaptive $(1, \lambda)$-ES with Gaussian mutations, the estimate of $A$ by Monte-Carlo simulations is positive [5].

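Property 1

For some $\delta > 0$, for any $\alpha$, $\alpha'$ such that $\alpha < A$ and $\alpha' > A$, there exist $C > 0$, $C' > 0$, $V > 0$, $V' > 0$, such that with probability at least $1 - \delta$

$$\forall n \geq 1, \quad C' \exp(-\alpha' n) \leq \|x_n\| \leq C \exp(-\alpha n) \tag{8}$$
$$\forall n \geq 1, \quad V' \exp(-\alpha' n) \leq \sigma_n \leq V \exp(-\alpha n) \tag{9}$$

Proof

For any $\alpha < A$, almost surely, $\log \|x_n\| \leq -\alpha n$ for $n$ sufficiently large. So, almost surely, $\sup_{n \geq 1} \left( \log \|x_n\| + \alpha n \right)$ is finite. Consider $C$ the quantile $1 - \delta/4$ of $\exp\left( \sup_{n \geq 1} \left( \log \|x_n\| + \alpha n \right) \right)$. Then, with probability at least $1 - \delta/4$, $\|x_n\| \leq C \exp(-\alpha n)$ for all $n \geq 1$. We can apply the same trick for lower bounding $\|x_n\|$, and upper and lower bounding $\sigma_n$, all of them with probability $1 - \delta/4$, so that all bounds hold true simultaneously with probability at least $1 - \delta$.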

2.2 Noisy case

The purpose of this section is to show that if some evolution strategies perform well (linear convergence in the log-linear scale, as in Eqs. 6, 7), then, just by considering $r$ resamplings for each fitness evaluation as explained in Alg. 1, they will also be fast in the noisy case.

Our theorem holds for any evolution strategy satisfying the following constraints:

  • At each iteration $n$, a search point $x_n$ is defined and $\lambda$ search points are generated and have their fitness values evaluated.

  • The noisy fitness values are averaged over $r$ (a constant) resamplings.

  • The $j$-th individual evaluated at iteration $n$ is randomly drawn by $x_n + \sigma_n N_d$ with $N_d$ a $d$-dimensional standard Gaussian variable.

This framework is presented in Alg. 1.

  Initialize $x_0$ and $\sigma_0$.
  $n \leftarrow 0$
  while not finished do
     for $i = 1, \dots, \lambda$ do
        Define $x_{n,i} = x_n + \sigma_n N_d$.
        Define $y_{n,i} = \frac{1}{r} \sum_{k=1}^{r} f(x_{n,i}, \omega_k)$.
     end for
     Update: $(x_{n+1}, \sigma_{n+1}) = $ update$\left((x_{n,i})_{1 \leq i \leq \lambda}, (y_{n,i})_{1 \leq i \leq \lambda}\right)$.
     $n \leftarrow n + 1$
  end while
Algorithm 1 A general framework for evolution strategies. For simplicity, it does not cover all evolution strategies, e.g. mutations of step-sizes as in self-adaptive algorithms are not covered; yet, our proof can be extended to a more general case ($x_{n,i}$ distributed as $x_n + \sigma_n N$ for some noise $N$ with exponentially decreasing tail). The case $r = 1$ is the case without resampling. Our theorem basically shows that if such an algorithm converges linearly (in log-linear scale) in the noise-free case then the version with $r$ large enough converges linearly in the noisy case when $z > 2$.
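The framework of Alg. 1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the oracle `sphere_noisy` assumes the general noise model with $x^* = 0$, and `shrink_to_best` is a deliberately simple placeholder for the unspecified `update` step (it keeps the best-ranked point and multiplies the step-size by a fixed factor), chosen only so that the loop is runnable.

```python
import math
import random

def sphere_noisy(x, rng, p=2, z=3.0):
    """Hypothetical oracle: ||x||^p + ||x||^(p z / 2) * N, with x* = 0."""
    norm = math.sqrt(sum(v * v for v in x))
    return norm ** p + norm ** (p * z / 2) * rng.gauss(0.0, 1.0)

def shrink_to_best(points, values, sigma):
    """Placeholder update (an assumption): keep the best point, shrink sigma."""
    best = min(range(len(points)), key=values.__getitem__)
    return points[best], 0.9 * sigma

def evolve(oracle, update, x0, sigma0, lam, r, iterations, rng):
    """Generic loop: lam Gaussian offspring per iteration, r resamplings each."""
    x, sigma, calls = list(x0), sigma0, 0
    for _ in range(iterations):
        points, values = [], []
        for _ in range(lam):
            candidate = [v + sigma * rng.gauss(0.0, 1.0) for v in x]
            # Average r independent oracle calls at the same candidate.
            values.append(sum(oracle(candidate, rng) for _ in range(r)) / r)
            points.append(candidate)
            calls += r
        x, sigma = update(points, values, sigma)
    return x, sigma, calls

rng = random.Random(1)
x_final, sigma_final, calls = evolve(
    sphere_noisy, shrink_to_best, [1.0, 1.0], 1.0,
    lam=8, r=4, iterations=50, rng=rng)
print(x_final, sigma_final, calls)
```

Passing `update` as a parameter mirrors the way Alg. 1 leaves the update rule abstract; any ranking-based rule with the same signature can be substituted.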

We now state our theorem, under a log-linear convergence assumption (cf. assumption (ii) below).

Theorem 2.1

Consider the following assumptions:

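  (i) the fitness function $f$ satisfies $\mathbb{E}[f(x, \omega)] = \|x\|^p$ and has a limited variance:

    $$\mathrm{Var}(f(x, \omega)) \leq \left( \mathbb{E}[f(x, \omega)] \right)^z \text{ for some } z > 0 \tag{10}$$
  (ii) in the noise-free case, the ES with population size $\lambda$ under consideration is log-linearly converging, i.e. for any $\delta > 0$, for some $\alpha > 0$, $\alpha' > 0$, there exist $C > 0$, $C' > 0$, $V > 0$, $V' > 0$, such that with probability $1 - \delta$, Eqs. 8 and 9 hold;

  (iii) the number $r$ of resamplings per individual is constant.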

Then, if , for any , there is such that for any , Eqs. 8 and 9 also hold with probability at least in the noisy case.

Corollary 1

Under the same assumptions, with probability at least ,

Proof of Corollary 1 : Immediate consequence of Theorem 2.1, by applying Eq. 8 and using .

Remarks:

  • Interpretation: Informally speaking, our theorem shows that if an algorithm converges in the noise-free case, then it also converges in the noisy case with the resampling rule, at least if $z$ and $r$ are large enough.

  • Notice that we can choose constants and very close to each other. Then the assumption boils down to .

  • We show a log-linear convergence rate as in the noise-free case. This means that $\log \|x_n\|$ decreases linearly in the number of function evaluations. This is as in Eq. 1, and faster than Eq. 2, which is typical for noisy optimization with constant variance.

  • In the previous hypothesis, the new individuals are drawn following $x_n + \sigma_n N_d$ with $N_d$ a $d$-dimensional standard Gaussian variable, but we could substitute for $N_d$ any random variable with an exponentially decreasing tail.
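The remark on function evaluations can be spelled out with a short derivation. It assumes an upper bound of the form $\|x_n\| \leq C e^{-\alpha n}$ in the iteration index $n$, and writes $e_n$ (a symbol introduced here) for the number of evaluations spent after $n$ iterations with $\lambda$ offspring and $r$ resamplings each:

```latex
% e_n: evaluations used after n iterations (lambda offspring, r resamplings each)
e_n = n \lambda r,
\qquad
\|x_n\| \le C e^{-\alpha n}
\;\Longrightarrow\;
\frac{\log \|x_n\|}{e_n}
\le \frac{\log C}{n \lambda r} - \frac{\alpha}{\lambda r}
\xrightarrow[n \to \infty]{} -\frac{\alpha}{\lambda r} < 0 .
```

Because $\lambda$ and $r$ are constants, the rate stays log-linear in the number of evaluations; only the constant is divided by $\lambda r$.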

Proof of Theorem 2.1: Throughout the proof, $N_d$ denotes a standard normal random variable in dimension $d$.

Sketch of proof: Consider an arbitrary and for some and .
We compute in Lemma 2 the probability that at least two generated points and at iteration are “close”, i.e are such that ; then we calculate the probability that the noise of at least one of the evaluated individuals of iteration is bigger than in Lemma 3. Thus, we can conclude in Lemma 4 by estimating the probability that at least two individuals are misranked due to noise.
We first begin by showing a technical lemma.

Lemma 1

Let

be a unit vector and

a -dimensional standard normal random variable. Then for and , there exists a constant such that :

Proof

For any , we denote the set :

We first compute , the Lebesgue measure of :

with if is even, and otherwise. Hence, by Taylor expansion, , where , with .
If :

where

If ,

where . Hence the result follows by taking .

Lemma 2

Let us denote by the probability that, at iteration , there exist at least two points and such that . Then

for some and depending on , , , , , , , .

Proof

Let us first compute the probability that, at iteration , two given generated points and are such that . Let us denote by and two -dimensional standard independent random variables, a unit vector and .

Hence, by Lemma 1, there exists a such that , where is such that . Moreover by Assumption (ii). Thus , with and . In particular, is positive, provided that is sufficiently large.
By union bound, .

We now provide a bound on the probability that the fitness value of at least one search point generated at iteration has noise (i.e. deviation from expected value) bigger than in spite of the resamplings.

Lemma 3

for some and depending on , , , , , , , .

Proof

First, for one point , generated at iteration , we write the probability that when evaluating the fitness function at this point, we make a mistake bigger than .
by using Chebyshev’s inequality, where and . In particular, if ; hence, if , we get .
Then, by union bound.
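The Chebyshev step used in this proof relies on the fact that averaging divides the variance by the number of resamplings. As a reminder, for a single point $x$ evaluated $r$ times with independent noise, and a generic threshold $\epsilon > 0$ (the notation $\bar{y}$ is introduced here):

```latex
\bar{y} = \frac{1}{r} \sum_{k=1}^{r} f(x, \omega_k),
\qquad
\mathrm{Var}(\bar{y}) = \frac{\mathrm{Var}\bigl(f(x, \omega)\bigr)}{r},
\qquad
P\Bigl( \bigl| \bar{y} - \mathbb{E} f(x, \omega) \bigr| \ge \epsilon \Bigr)
\le \frac{\mathrm{Var}\bigl(f(x, \omega)\bigr)}{r \, \epsilon^{2}} .
```

The bound decreases like $1/r$ for fixed $\epsilon$, so a larger constant $r$ directly lowers the probability of a large evaluation error at each point.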

Lemma 4

Let us denote by the probability that in at least one iteration, there is at least one misranking of two individuals. Then, if and is large enough, .

This lemma implies that with probability at least , provided that has been chosen large enough, we get the same rankings of points as in the noise-free case. In the noise-free case, Eqs. 8 and 9 hold with probability at least ; this proves the convergence with probability at least , hence the expected result. The proof of the theorem is complete.

Proof

(of the lemma)

We consider the probability that two individuals and at iteration are misranked due to noise, so

(11)
(12)

Eqs. 11 and 12 occur simultaneously if either two points have very similar fitness (difference less than ) or the noise is big (larger than ). Therefore, .
is upper bounded by if and are positive and constants large enough. and can be chosen positive simultaneously if .
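The effect of resampling on misranking can be checked numerically in a simplified setting. The sketch below is not the lemma itself: it assumes two individuals whose true fitness values differ by a fixed `gap`, Gaussian noise with a fixed standard deviation, and compares a Monte Carlo estimate with the closed-form Gaussian probability. All names and parameter values are illustrative.

```python
import math
import random

def misranking_rate(gap, noise_std, r, trials, rng):
    """Fraction of trials in which averaged noisy values reverse the true order."""
    wrong = 0
    for _ in range(trials):
        better = sum(rng.gauss(0.0, noise_std) for _ in range(r)) / r
        worse = gap + sum(rng.gauss(0.0, noise_std) for _ in range(r)) / r
        wrong += better > worse
    return wrong / trials

def gaussian_misranking(gap, noise_std, r):
    """Exact value for Gaussian noise: P(N(0, 2 noise_std^2 / r) > gap)."""
    return 0.5 * math.erfc(gap * math.sqrt(r) / (2.0 * noise_std))

rng = random.Random(0)
rates = {r: misranking_rate(0.2, 1.0, r, 5000, rng) for r in (1, 10, 50)}
for r, rate in rates.items():
    print(r, rate, gaussian_misranking(0.2, 1.0, r))
```

The misranking probability drops as $r$ grows, and it also drops when the noise shrinks faster than the fitness gap, which is the mechanism the proof exploits when $z$ is large enough.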

3 Experiments: how to choose the right number of resamplings?

We consider in our experiments a version of multi-membered evolution strategies, the $(\mu, \lambda)$-ES, where $\mu$ denotes the number of parents and $\lambda$ the number of offspring ($\mu \leq \lambda$; see Alg. 2). We denote by $(x_{n,1}, \dots, x_{n,\mu})$ the parents at iteration $n$ and by $(\sigma_{n,1}, \dots, \sigma_{n,\mu})$ their corresponding step-sizes. At each iteration, a noisy $(\mu, \lambda)$-ES algorithm: (i) generates $\lambda$ offspring by mutation on the $\mu$ parents, using the corresponding mutated step-size, (ii) selects the $\mu$ best offspring by ranking the noisy fitness values of the individuals. Thus, the current approximation of the optimum $x^*$ at iteration $n$ is $x_{n,1}$; to be consistent with the previous notations, we denote $x_n = x_{n,1}$ and $\sigma_n = \sigma_{n,1}$.

  Parameters : , , a dimension .
  Input : initial points and initial step size .
  
  while (true) do
     Generate individuals independently using :
     , evaluate times. Let be the averaging over these evaluations.
     Define so that .
     Update : compute and for :
     
  end while
Algorithm 2 An evolution strategy, with constant number of resamplings. If we consider $r = 1$, we obtain the case without resampling. is a -dimensional standard normal random variable.
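A runnable sketch in the spirit of Alg. 2 is given below. Several details are assumptions rather than the paper's specification: the step-size mutation is the common log-normal rule with learning rate `tau = 1 / sqrt(2 d)`, offspring cycle through the parents in order, and the oracle `noisy_sphere` uses the general noise model with $x^* = 0$. The function names are hypothetical.

```python
import math
import random

def noisy_sphere(x, rng, p=2, z=3.0):
    """Assumed test function: ||x||^p + ||x||^(p z / 2) * N, with x* = 0."""
    norm = math.sqrt(sum(v * v for v in x))
    return norm ** p + norm ** (p * z / 2) * rng.gauss(0.0, 1.0)

def sa_es(oracle, x0, sigma0, mu, lam, r, iterations, rng):
    """(mu, lambda)-ES with log-normal step-size self-adaptation and r resamplings."""
    d = len(x0)
    tau = 1.0 / math.sqrt(2.0 * d)  # assumed learning rate, a common default
    parents = [(list(x0), sigma0) for _ in range(mu)]
    evaluations = 0
    for _ in range(iterations):
        offspring = []
        for j in range(lam):
            parent_x, parent_sigma = parents[j % mu]  # cycle through the parents
            sigma = parent_sigma * math.exp(tau * rng.gauss(0.0, 1.0))
            x = [v + sigma * rng.gauss(0.0, 1.0) for v in parent_x]
            # Averaged noisy fitness over r independent evaluations.
            y = sum(oracle(x, rng) for _ in range(r)) / r
            evaluations += r
            offspring.append((y, x, sigma))
        offspring.sort(key=lambda t: t[0])  # rank by averaged noisy fitness
        parents = [(x, s) for _, x, s in offspring[:mu]]
    best_x, best_sigma = parents[0]
    return best_x, best_sigma, evaluations

rng = random.Random(42)
x_best, sigma_best, used = sa_es(
    noisy_sphere, [1.0, 1.0], 1.0, mu=5, lam=20, r=10, iterations=100, rng=rng)
print(math.sqrt(sum(v * v for v in x_best)), sigma_best, used)
```

Setting `r=1` gives the variant without resampling; increasing `r` trades evaluations per iteration for more reliable rankings.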

Experiments are performed on the fitness function , with , , , , , and a standard gaussian random variable, using a budget of evaluations. The results presented here are the mean and the median over 50 runs. The positive results are proved, above, for a given quantile of the results. This explains the good performance in Fig. 1 (median result) as soon as the number of resamplings is enough. The median performance is optimal with just 12 resamplings. On the other hand, Fig. 2 shows the mean performance of Alg. 2 with various numbers of resamplings. We see that a limited number of runs diverge so that the mean results are bad even with 16 resamplings; results are optimal (on average) for 20 resamplings.

Results are safer with 20 resamplings (for the mean), but faster (for the median) with a smaller number of resamplings.
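The gap between mean and median results can be reproduced on synthetic numbers. The values below are made up for illustration and are not the paper's data: they mimic 50 runs of which a few diverge.

```python
import statistics

# Synthetic final errors for 50 hypothetical runs (illustrative, not measured):
# 47 runs converge to a tiny error, 3 runs diverge.
final_errors = [1e-12] * 47 + [1e3] * 3

median_error = statistics.median(final_errors)
mean_error = statistics.mean(final_errors)
print("median:", median_error)
print("mean  :", mean_error)
```

The median ignores the three diverging runs, while the mean is dominated by them. This is consistent with a guarantee that holds for a given quantile of the runs rather than in expectation.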

Figure 1: Convergence of Self-Adaptive Evolution Strategies: Median results.
Figure 2: Convergence of Self-Adaptive Evolution Strategies: Mean results.

4 Conclusion

We have shown that applying evolution strategies with a finite number of resamplings, when the noise in the function decreases quickly enough near the optimum, provides a convergence rate as fast as in the noise-free case. More specifically, if the noise decreases slightly faster than in the multiplicative model of noise, using a constant number of re-evaluations leads to a log-linear convergence of the algorithm. The limit case of a multiplicative noise has been analyzed in [14]; a fixed number of resamplings is not sufficient for convergence when the noise is unbounded.

Further work. We did not provide any hint for choosing the number of resamplings. Proofs based on Bernstein races [13] might be used for adaptively choosing the number of resamplings.

Acknowledgements

This paper was written during a stay in Ailab, Dong Hwa University, Hualien, Taiwan.

References

  • [1] D. Arnold and H.-G. Beyer. Investigation of the (μ, λ)-ES in the presence of noise. In Proc. of the IEEE Conference on Evolutionary Computation (CEC 2001), pages 332–339. IEEE, 2001.
  • [2] D. Arnold and H.-G. Beyer. Local performance of the (1 + 1)-es in a noisy environment. Evolutionary Computation, IEEE Transactions on, 6(1):30 –41, feb 2002.
  • [3] D. V. Arnold and H.-G. Beyer. A general noise model and its effects on evolution strategy performance. IEEE Transactions on Evolutionary Computation, 10(4):380–391, 2006.
  • [4] S. Astete-Morales, J. Liu, and O. Teytaud. log-log convergence for noisy optimization. In Proceedings of EA 2013, LLNCS, page accepted. Springer, 2013.
  • [5] A. Auger. Convergence results for (1,λ)-SA-ES using the theory of φ-irreducible Markov chains. Theoretical Computer Science, 334(1-3):35–69, 2005.
  • [6] A. Auger. Linear convergence on positively homogeneous functions of a comparison-based step-size adaptive randomized search: the (1+1)-es with generalized one-fifth success rule. submitted, 2013.
  • [7] A. Auger, M. Jebalia, and O. Teytaud. (x,sigma,eta) : quasi-random mutations for evolution strategies. In EA, page 12p., 2005.
  • [8] H.-G. Beyer. The Theory of Evolution Strategies. Natural Computing Series. Springer, Heidelberg, 2001.
  • [9] H. Chen. Lower rate of convergence for locating the maximum of a function. Annals of statistics, 16:1330–1334, Sept. 1988.
  • [10] R. Coulom. Clop: Confident local optimization for noisy black-box parameter tuning. In Advances in Computer Games, pages 146–157. Springer Berlin Heidelberg, 2012.
  • [11] V. Fabian. Stochastic Approximation of Minima with Improved Asymptotic Speed. Annals of Mathematical statistics, 38:191–200, 1967.
  • [12] V. Fabian. Stochastic Approximation. SLP. Department of Statistics and Probability, Michigan State University, 1971.
  • [13] V. Heidrich-Meisner and C. Igel. Hoeffding and Bernstein races for selecting policies in evolutionary direct policy search. In ICML ’09: Proceedings of the 26th Annual International Conference on Machine Learning, pages 401–408, New York, NY, USA, 2009. ACM.
  • [14] M. Jebalia, A. Auger, and N. Hansen. Log linear convergence and divergence of the scale-invariant (1+1)-ES in noisy environments. Algorithmica, 2010.
  • [15] I. Rechenberg. Evolutionsstrategie: Optimierung technischer Systeme nach Prinzipien der biologischen Evolution. Frommann-Holzboog Verlag, Stuttgart, 1973.
  • [16] O. Shamir. On the complexity of bandit and derivative-free stochastic convex optimization. CoRR, abs/1209.2388, 2012.
  • [17] O. Teytaud and J. Decock. Noisy Optimization Complexity. In FOGA - Foundations of Genetic Algorithms XII - 2013, Adelaide, Australia, Feb. 2013.
  • [18] O. Teytaud and H. Fournier. Lower bounds for evolution strategies using vc-dimension. In G. Rudolph, T. Jansen, S. M. Lucas, C. Poloni, and N. Beume, editors, PPSN, volume 5199 of Lecture Notes in Computer Science, pages 102–111. Springer, 2008.