Efficient Natural Evolution Strategies

09/26/2012 ∙ by Yi Sun, et al. ∙ IDSIA

Efficient Natural Evolution Strategies (eNES) is a novel alternative to conventional evolutionary algorithms, using the natural gradient to adapt the mutation distribution. Unlike previous methods based on natural gradients, eNES uses a fast algorithm to calculate the inverse of the exact Fisher information matrix, thus increasing both robustness and performance of its evolution gradient estimation, even in higher dimensions. Additional novel aspects of eNES include optimal fitness baselines and importance mixing (a procedure for updating the population with very few fitness evaluations). The algorithm yields competitive results on both unimodal and multimodal benchmarks.


1 Introduction

Evolutionary algorithms aim to optimize a ‘fitness’ function that is either unknown or too complex to model directly. They allow domain experts to search for good or near-optimal solutions to numerous difficult real-world problems in areas ranging from medicine and finance to control and robotics.

Typically, three objectives have to be kept in mind when developing evolutionary algorithms—we want (1) robust performance; (2) few (potentially costly) fitness evaluations; (3) scalability with problem dimensionality.

We recently introduced Natural Evolution Strategies (NES; [8]), a new class of evolutionary algorithms less ad-hoc than traditional evolutionary methods. Here we propose a novel algorithm within this framework. It retains the theoretically well-founded nature of the original NES while addressing its shortcomings w.r.t. the above objectives.

NES algorithms maintain and iteratively update a multinormal mutation distribution. Parameters are updated by estimating a natural evolution gradient, i.e. the natural gradient on the parameters of the mutation distribution, and following it towards better expected fitness. Well-known advantages of natural gradient methods include isotropic convergence on ill-shaped fitness landscapes [2]. This avoids drawbacks of ‘vanilla’ (regular) gradients which are prone to slow or premature convergence [4].

Our algorithm calculates the natural evolution gradient using the exact Fisher information matrix (FIM) and the Monte Carlo-estimated gradient. In conjunction with the techniques of optimal fitness baselines and fitness shaping this yields robust performance (objective 1).

To reduce the number of potentially costly evaluations (objective 2), we introduce importance mixing, a kind of steady-state enforcer which keeps the distribution of the new population conforming to the current mutation distribution.

To keep the computational cost manageable in higher problem dimensions (objective 3), we derive a novel, efficient algorithm for computing the inverse of the exact Fisher information matrix (previous methods were either inefficient or approximate).

The resulting algorithm, Efficient Natural Evolution Strategies (eNES), is elegant, requires no additional heuristics and has few parameters that need tuning. It performs consistently well on both unimodal and multimodal benchmarks.

2 Evolution Gradients

First let us introduce the algorithm framework and the concept of evolution gradients. The objective is to maximize a d-dimensional unknown fitness function f(z), while keeping the number of function evaluations – which are considered costly – as low as possible. The algorithm iteratively evaluates a population of n individuals generated from the mutation distribution p(z|θ). It then uses the fitness evaluations to adjust the parameters θ of the mutation distribution.

Let J(θ) be the expected fitness under mutation distribution p(z|θ), namely,

J(θ) = E[ f(z) | θ ] = ∫ f(z) p(z|θ) dz.

The core idea of our approach is to find, at each iteration, a small adjustment δθ, such that the expected fitness J(θ + δθ) is increased. The most straightforward approach is to set δθ ∝ ∇θJ(θ), where ∇θJ(θ) is the gradient of J(θ). Using the ‘log likelihood trick’, the gradient can be written as

∇θJ(θ) = ∫ f(z) p(z|θ) ∇θ ln p(z|θ) dz = E[ f(z) ∇θ ln p(z|θ) | θ ].

The last term can be approximated using Monte Carlo:

ĝ = (1/n) Σ_{i=1}^{n} f(z_i) ∇θ ln p(z_i|θ),

where ĝ denotes the estimated evolution gradient.
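As a concrete illustration, the estimator above can be sketched for an isotropic Gaussian mutation distribution (a minimal example with names of our own choosing; the full algorithm uses the richer parameterization introduced below):

```python
import numpy as np

def estimated_evolution_gradient(fitness, mean, sigma, n=500, rng=None):
    """Monte Carlo estimate of the evolution gradient w.r.t. the mean of an
    isotropic Gaussian N(mean, sigma^2 I):
        (1/n) * sum_i f(z_i) * grad_mean log p(z_i),
    where grad_mean log p(z) = (z - mean) / sigma^2."""
    rng = np.random.default_rng(rng)
    z = mean + sigma * rng.standard_normal((n, len(mean)))  # population
    f = np.apply_along_axis(fitness, 1, z)                  # fitness values
    score = (z - mean) / sigma**2                           # grad log-density
    return f @ score / n

# Maximizing f(z) = -||z||^2 from mean (1, 1): the estimated gradient should
# point roughly toward the origin, i.e. both components negative.
g = estimated_evolution_gradient(lambda z: -z @ z, np.array([1.0, 1.0]),
                                 sigma=0.1, n=2000, rng=0)
```

Note the estimator only ever queries f(z) at sampled points, which is what makes the scheme applicable to unknown or non-differentiable fitness functions.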

In our algorithm, we assume that p(z|θ) is a Gaussian distribution with parameters θ = ⟨x̄, A⟩, where x̄ represents the mean, and A represents the Cholesky decomposition of the covariance matrix C, such that A is an upper triangular matrix and C = AᵀA. (For any matrix Q, Q⁻¹ denotes its inverse and Qᵀ denotes its transpose.) The reason why we choose A instead of C as the primary parameter is twofold. First, A makes explicit the d(d + 1)/2 independent parameters determining the covariance matrix C. Second, the diagonal elements of A are the square roots of the eigenvalues of C, so C = AᵀA is always positive semidefinite. In the rest of the text, we assume θ is a column vector of dimension d_s = d + d(d + 1)/2, with the elements in ⟨x̄, A⟩ arranged as

θ = [ θ_0ᵀ, θ_1ᵀ, …, θ_dᵀ ]ᵀ.

Here θ_0 = x̄ and θ_k = [ a_{k,k}, …, a_{k,d} ]ᵀ for 1 ≤ k ≤ d, where a_{i,j} (i ≤ j) denotes the (i, j)-th element of A.

Now we compute g(z|θ) = ∇θ ln p(z|θ), where z is assumed to be a d-dimensional column vector. The gradient w.r.t. x̄ is simply

∇x̄ ln p(z|θ) = C⁻¹(z − x̄) =: q.

The gradient w.r.t. a_{i,j} (i ≤ j) is given by

∇a_{i,j} ln p(z|θ) = r_i q_j − δ_{i,j} a_{i,i}⁻¹,

where r_i is the i-th element of r = A⁻ᵀ(z − x̄), q_j is the j-th element of q, and δ_{i,j} is the Kronecker delta function. Stacking these derivatives in the order of θ yields g(z|θ), and hence the estimated evolution gradient ĝ. We update θ by δθ = η ĝ, where η is an empirically tuned step size.
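The two gradient formulas can be sketched as follows (a reconstruction consistent with the parameterization C = AᵀA; the helper names are ours), together with the log-density used to check them by finite differences:

```python
import numpy as np

def log_density(z, xbar, A):
    """log p(z | xbar, A) for a Gaussian with covariance C = A^T A,
    A upper triangular with positive diagonal."""
    r = np.linalg.solve(A.T, z - xbar)            # r = A^{-T} (z - xbar)
    return (-0.5 * len(z) * np.log(2 * np.pi)
            - np.log(np.diag(A)).sum() - 0.5 * r @ r)

def log_density_gradients(z, xbar, A):
    """Gradients of log p(z | xbar, A): C^{-1}(z - xbar) w.r.t. the mean,
    and r_i q_j - delta_ij / a_ii w.r.t. a_ij (i <= j); entries of the
    strictly lower triangle are left at zero."""
    d = len(z)
    r = np.linalg.solve(A.T, z - xbar)            # r = A^{-T} (z - xbar)
    q = np.linalg.solve(A, r)                     # q = C^{-1} (z - xbar)
    grad_A = np.triu(np.outer(r, q))              # r_i q_j, upper triangle
    grad_A[np.diag_indices(d)] -= 1.0 / np.diag(A)
    return q, grad_A
```

The triangular solves replace explicit inversion of A, mirroring the paper's emphasis on avoiding matrix inverses.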

3 Natural Gradient

Vanilla gradient methods have been shown to converge slowly in fitness landscapes with ridges and plateaus. Natural gradients [1] constitute a principled approach for dealing with such problems. The natural gradient, unlike the vanilla gradient, has the advantage of always pointing in the direction of the steepest ascent. Furthermore, since the natural gradient is invariant w.r.t. the particular parameterization of the mutation distribution, it can cope with ill-shaped fitness landscapes and provides isotropic convergence properties, which prevents premature convergence on plateaus and avoids overaggressive steps on ridges [1].

In this paper, we consider a special case of the natural gradient, defined as the δθ maximizing the increase in expected fitness, δθᵀ∇θJ(θ), subject to the constraint

KL( θ + δθ ‖ θ ) = ε,

where ε is an arbitrarily small constant and KL(θ′ ‖ θ) denotes the Kullback-Leibler divergence between the distributions p(z|θ′) and p(z|θ). The constraint imposes a geometry on θ which differs from the plain Euclidean one. With ε → 0, the natural gradient direction δθ satisfies the necessary condition

F δθ ∝ ∇θ J(θ),

with F being the Fisher information matrix:

F = E[ ∇θ ln p(z|θ) · (∇θ ln p(z|θ))ᵀ ].

If F is invertible, which may not always be the case, the natural gradient can be uniquely identified by F⁻¹∇θJ(θ), or estimated from data using F⁻¹ĝ, where ĝ is the Monte Carlo estimate of ∇θJ(θ). The adjustment δθ can then be computed by

δθ = η F⁻¹ ĝ.
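A minimal sketch of this last step: given a FIM and a vanilla gradient estimate, the natural gradient is obtained by solving a linear system rather than forming F⁻¹ explicitly. (The diagonal FIM below is the textbook one for a one-dimensional Gaussian in the parameters (μ, σ), not the full eNES matrix derived later.)

```python
import numpy as np

def natural_gradient(F, vanilla_grad):
    """Natural gradient: solve F x = vanilla_grad instead of forming F^{-1}."""
    return np.linalg.solve(F, vanilla_grad)

# For N(mu, sigma^2) the exact FIM is diag(1/sigma^2, 2/sigma^2) in the
# parameters (mu, sigma), so the natural gradient rescales the vanilla one
# by (sigma^2, sigma^2 / 2): steps shrink as the search distribution narrows.
sigma = 0.5
F = np.diag([1.0 / sigma**2, 2.0 / sigma**2])
nat = natural_gradient(F, np.array([1.0, 1.0]))
```

This automatic rescaling is what yields the isotropic convergence behavior described above.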

In the following sub-sections, we show that the FIM can in fact be computed exactly, that it is invertible, and that there exists an efficient algorithm to compute the inverse of the FIM. (Naively, the FIM would involve O(d⁴) elements, which is intractable for most practical problems.)

3.1 Derivation of the Exact FIM

In the original NES [8], we compute the natural evolution gradient using the empirical Fisher information matrix, which is estimated from the current population. This approach has three important disadvantages. First, the empirical FIM is not guaranteed to be invertible, which could result in unstable estimations. Second, a large population size would be required to approximate the exact FIM up to a reasonable precision. Third, it is highly inefficient to invert the empirical FIM, a matrix with O(d⁴) elements.

We circumvent these problems by computing the exact FIM directly from the mutation parameters θ, avoiding the potentially unstable and computationally costly method of estimating the empirical FIM from a population which in turn was generated from θ.

In eNES, the mutation distribution is the Gaussian defined by θ = ⟨x̄, A⟩, so the precise FIM F can be computed analytically. Namely, the (m, n)-th element of F is given by

(F)_{m,n} = E[ (∂ ln p(z|θ) / ∂θ_m) · (∂ ln p(z|θ) / ∂θ_n) ],

where θ_m, θ_n denote the m-th and n-th elements of θ. Let a_{i_m, j_m} be the element of A appearing at the m-th position in θ. First, notice that

E[ ∇x̄ ln p · (∇x̄ ln p)ᵀ ] = C⁻¹ E[ (z − x̄)(z − x̄)ᵀ ] C⁻¹ = C⁻¹,

and, since ∇a_{i,j} ln p is quadratic in z − x̄ while ∇x̄ ln p is linear in it, their product contains only odd-order moments, so

E[ ∇x̄ ln p · ∇a_{i,j} ln p ] = 0.

So the upper left corner of the FIM is C⁻¹, and F has the following shape

F = [ C⁻¹  0 ; 0  F_A ].

The next step is to compute F_A. Note that, with r = A⁻ᵀ(z − x̄) and q = C⁻¹(z − x̄),

∇a_{i,j} ln p(z|θ) = r_i q_j − δ_{i,j} a_{i,i}⁻¹.

Using the relation E[s_a s_b s_c s_d] = E[s_a s_b] E[s_c s_d] + E[s_a s_c] E[s_b s_d] + E[s_a s_d] E[s_b s_c] for jointly Gaussian zero-mean variables, together with E[r rᵀ] = I, E[q qᵀ] = C⁻¹ and E[r qᵀ] = A⁻ᵀ, we get

E[ ∇a_{i,j} ln p · ∇a_{k,l} ln p ] = (A⁻ᵀ)_{i,j} (A⁻ᵀ)_{k,l} + δ_{i,k} (C⁻¹)_{j,l} + (A⁻ᵀ)_{i,l} (A⁻ᵀ)_{k,j} − δ_{i,j} δ_{k,l} a_{i,i}⁻¹ a_{k,k}⁻¹.

Computing the first term gives us (A⁻ᵀ)_{i,j} (A⁻ᵀ)_{k,l} = δ_{i,j} δ_{k,l} a_{i,i}⁻¹ a_{k,k}⁻¹, which cancels against the last term. Note that since A is upper triangular, A⁻¹ is also upper triangular, so the third summand is non-zero iff j ≤ k ≤ l ≤ i ≤ j, i.e. iff all four indices coincide. In this case, (A⁻ᵀ)_{i,l} (A⁻ᵀ)_{k,j} = a_{i,i}⁻². Therefore, we have

E[ ∇a_{i,j} ln p · ∇a_{k,l} ln p ] = δ_{i,k} (C⁻¹)_{j,l} + δ_{i,j,k,l} a_{i,i}⁻².

Here δ_{i,j,k,l} is the generalized Kronecker delta function, i.e. δ_{i,j,k,l} = 1 iff all four indices are the same, and 0 otherwise. It follows that F_A itself is a block diagonal matrix with d blocks along the diagonal, one per row of A, with sizes ranging from d down to 1. Therefore, the precise FIM is given by

F = diag( F_0, F_1, …, F_d ),

with F_0 = C⁻¹ and block F_k (1 ≤ k ≤ d) given by

F_k = D_k + a_{k,k}⁻² e_1 e_1ᵀ,

where e_1 = [1, 0, …, 0]ᵀ. Here D_k is the lower-right square submatrix of C⁻¹ with dimension d + 1 − k, e.g. D_1 = C⁻¹, and D_d = (C⁻¹)_{d,d}.

We prove that the FIM given above is invertible if C is invertible. Each D_k being invertible follows from the fact that a submatrix on the main diagonal of a positive definite matrix must itself be positive definite, and adding the positive number a_{k,k}⁻² to the top-left diagonal element does not decrease any of the eigenvalues. Also note that F_0 = C⁻¹ is invertible, so F is invertible.

It is worth pointing out that the block diagonal structure of F partitions the parameters into d + 1 orthogonal groups θ_0, …, θ_d, which suggests that we could modify each group of parameters without affecting the others. We will need this intuition in the next section.
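The block structure can be assembled explicitly. The following is a sketch under our reading of the text (F_0 = C⁻¹, and each further block equal to a lower-right submatrix D_k of C⁻¹ plus a rank-one a_{k,k}⁻² correction):

```python
import numpy as np

def exact_fim_blocks(A):
    """Diagonal blocks of the exact FIM of N(xbar, C), C = A^T A with A
    upper triangular, in the parameter order of Section 2: F_0 = C^{-1}
    for the mean, then, for each row k of A, F_k = D_k + a_kk^{-2} e1 e1^T
    with D_k the lower-right submatrix of C^{-1} starting at (k, k)."""
    d = A.shape[0]
    Cinv = np.linalg.inv(A.T @ A)
    blocks = [Cinv]
    for k in range(d):                       # 0-based row index
        Fk = Cinv[k:, k:].copy()             # D_k
        Fk[0, 0] += 1.0 / A[k, k] ** 2       # rank-one a_kk^{-2} correction
        blocks.append(Fk)
    return blocks
```

For d = 1 this reproduces the classical Fisher information of N(μ, σ²): 1/σ² for the mean and 2/σ² for the scale parameter.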

3.2 Iterative Computation of FIM Inverse

The exact FIM is a block diagonal matrix with d + 1 blocks. Normally, inverting the FIM requires d + 1 matrix inversions. However, we can exploit the structure of each sub-block in order to make the inversion of F more efficient, both in terms of time and space complexity.

First, we realize that D_d = (C⁻¹)_{d,d} is simply a number, so its inversion is given by D_d⁻¹ = 1 / (C⁻¹)_{d,d}, and similarly F_d⁻¹ = 1 / [ (C⁻¹)_{d,d} + a_{d,d}⁻² ]. Now, letting k vary from d − 1 down to 1, we can compute D_k⁻¹ and F_k⁻¹ directly from D_{k+1}⁻¹. Writing

D_k = [ c_k  b_kᵀ ; b_k  D_{k+1} ],  c_k = (C⁻¹)_{k,k},

where b_k is the sub-vector in C⁻¹ at column k, rows k + 1 to d, block matrix inversion using the Schur complement s_k = c_k − b_kᵀ D_{k+1}⁻¹ b_k gives

D_k⁻¹ = [ 1/s_k  −b̃_kᵀ/s_k ; −b̃_k/s_k  D_{k+1}⁻¹ + b̃_k b̃_kᵀ/s_k ],  b̃_k = D_{k+1}⁻¹ b_k,

and the Woodbury identity, applied to the rank-one correction a_{k,k}⁻² e_1 e_1ᵀ (noting that in our case the correction involves only a single number), gives

F_k⁻¹ = D_k⁻¹ − [ a_{k,k}⁻² / (1 + a_{k,k}⁻² (D_k⁻¹)_{1,1}) ] · (D_k⁻¹ e_1)(D_k⁻¹ e_1)ᵀ.

This can be computed directly from C⁻¹ = A⁻¹ A⁻ᵀ. A single update from D_{k+1}⁻¹ to D_k⁻¹ and F_k⁻¹ requires O((d − k)²) floating point multiplications, so the overall complexity of computing all sub-blocks F_k⁻¹, 1 ≤ k ≤ d, is O(d³).

The algorithm is efficient both in time and storage in the sense that, on one hand, there is no explicit matrix inversion, while on the other hand, the space for storing D_{k+1}⁻¹ (including F_{k+1}⁻¹, if not needed later) can be reclaimed immediately after each iteration, which means that at most O(d²) storage is required. Note also that F_0⁻¹ = C = AᵀA can be used directly to compute the natural gradient w.r.t. x̄, using AᵀA ĝ_{x̄}, where ĝ_{x̄} is the sub-vector of the estimated evolution gradient ĝ w.r.t. the mutation gradient of x̄.

To conclude, the algorithm given above efficiently computes the inverse of the exact FIM, required for computing the natural mutation gradient.
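A compact sketch of such an iterative inversion, combining the block-inverse recursion with a Sherman-Morrison (rank-one Woodbury) update; the exact update order of the paper's pseudocode listing is not reproduced here:

```python
import numpy as np

def fim_block_inverses(A):
    """Inverses of the FIM blocks F_k = D_k + a_kk^{-2} e1 e1^T (k = 1..d)
    without inverting any F_k directly: grow D_{k+1}^{-1} into D_k^{-1}
    by block inversion (Schur complement), then apply a Sherman-Morrison
    rank-one update for the a_kk^{-2} term."""
    d = A.shape[0]
    Cinv = np.linalg.inv(A.T @ A)            # C^{-1}, available as A^{-1} A^{-T}
    inverses = [None] * d
    Dinv = np.array([[1.0 / Cinv[-1, -1]]])  # D_d is a scalar
    for k in range(d - 1, -1, -1):
        if k < d - 1:
            # D_k = [[c, b^T], [b, D_{k+1}]]; s is the Schur complement
            c, b = Cinv[k, k], Cinv[k + 1:, k]
            Db = Dinv @ b
            s = c - b @ Db
            m = Dinv.shape[0] + 1
            new = np.empty((m, m))
            new[0, 0] = 1.0 / s
            new[0, 1:] = new[1:, 0] = -Db / s
            new[1:, 1:] = Dinv + np.outer(Db, Db) / s
            Dinv = new
        u = 1.0 / A[k, k] ** 2               # rank-one weight a_kk^{-2}
        col = Dinv[:, 0]                     # D_k^{-1} e1
        inverses[k] = Dinv - np.outer(col, col) * (u / (1.0 + u * Dinv[0, 0]))
    return inverses
```

Only the current D_k⁻¹ is kept between iterations, matching the O(d²) storage argument above.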

4 Optimal Fitness Baselines

The concept of fitness baselines, first introduced in [8], constitutes an efficient variance reduction method for estimating δθ. However, baselines as found in [5] are suboptimal w.r.t. the variance of δθ, since this FIM may not be invertible. It is difficult to formulate the variance of δθ directly. However, since the exact FIM is invertible and can be computed efficiently, we can in fact compute an optimal baseline b for minimizing the variance of

δθ = η F⁻¹ ĝ,

where ĝ is the estimated evolution gradient, which is given by

ĝ = (1/n) Σ_{i=1}^{n} ( f(z_i) − b ) · ∇θ ln p(z_i|θ).

The scalar b is called the fitness baseline. Adding b does not affect the expectation of ĝ, since

E[ ∇θ ln p(z|θ) ] = 0.

However, the variance depends on the value of b, i.e.

Var(δθ) ∝ E[ (f − b·1)ᵀ Gᵀ F⁻ᵀ F⁻¹ G (f − b·1) ].

Here G = [ g(z_1|θ), …, g(z_n|θ) ] collects the per-sample derivatives g(z_i|θ) = ∇θ ln p(z_i|θ), f = [ f(z_1), …, f(z_n) ]ᵀ, and 1 denotes the n-by-1 vector filled with 1s. The optimal value of the baseline is

b = E[ fᵀ Gᵀ F⁻ᵀ F⁻¹ G 1 ] / E[ 1ᵀ Gᵀ F⁻ᵀ F⁻¹ G 1 ].

Assuming individuals are i.i.d., b can be approximated from data by

b ≈ Σ_{i} f(z_i) ‖F⁻¹ g(z_i|θ)‖² / Σ_{i} ‖F⁻¹ g(z_i|θ)‖².

In order to further reduce the estimation covariance, we can utilize a parameter-specific baseline for each parameter θ_j individually, which is given by

b_j ≈ Σ_{i} f(z_i) ( h_j g(z_i|θ) )² / Σ_{i} ( h_j g(z_i|θ) )².

Here h_j is the j-th row vector of F⁻¹.
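The variance-reduction effect of a scalar baseline can be demonstrated on a toy isotropic Gaussian. For simplicity this sketch takes F = I, so the F⁻¹-weighted norms of the text reduce to plain norms; the function and variable names are ours:

```python
import numpy as np

def baseline_gradient_samples(fitness, mean, sigma, n=4000, rng=0):
    """Per-sample gradient contributions with and without the optimal
    uniform baseline b = sum_i f_i ||g_i||^2 / sum_i ||g_i||^2 (the i.i.d.
    sample approximation of Section 4, with F = I for simplicity)."""
    rng = np.random.default_rng(rng)
    z = mean + sigma * rng.standard_normal((n, len(mean)))
    f = np.apply_along_axis(fitness, 1, z)
    g = (z - mean) / sigma**2                    # score vectors g_i
    w = (g * g).sum(axis=1)                      # ||g_i||^2
    b = (f * w).sum() / w.sum()                  # optimal uniform baseline
    return (f - b)[:, None] * g, f[:, None] * g  # with / without baseline

with_b, without_b = baseline_gradient_samples(
    lambda z: -z @ z, np.array([1.0, 1.0]), sigma=0.1)
```

Averaging either set of contributions gives (up to sampling noise) the same evolution gradient, but the baselined contributions fluctuate far less from sample to sample.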

However, parameter-specific baseline values might reduce variance too much, which harms the performance of the algorithm. Additionally, adopting different baseline values for correlated parameters may affect the underlying structure of the parameter space, rendering estimations unreliable. To address both of these problems, we follow the intuition that if the (m, n)-th element in the FIM is 0, then parameters θ_m and θ_n are orthogonal and only weakly correlated. Therefore, we propose using the block fitness baseline, i.e. a single baseline b_k for each group of parameters θ_k, 0 ≤ k ≤ d. Its formulation is given by

b_k ≈ Σ_{i} f(z_i) ‖F_k⁻¹ g_k(z_i|θ)‖² / Σ_{i} ‖F_k⁻¹ g_k(z_i|θ)‖².

Here F_k⁻¹ denotes the inverse of the k-th diagonal block of F, while g_k(z_i|θ) denotes the sub-vector of ∇θ ln p(z_i|θ) corresponding to differentiation w.r.t. θ_k.

In a companion paper [7], we empirically investigated the convergence properties when using the various types of baseline. We found block fitness baselines to be very robust, whereas uniform and parameter-specific baselines sometimes led to premature convergence.

The main complexity of computing the optimal fitness baseline pertains to the necessity of storing a potentially large gradient matrix G, of dimension d_s × n with d_s = d + d(d + 1)/2, since we have to multiply each g(z_i|θ) with F⁻¹. For large problem dimensions the storage requirement may not be acceptable, since n also scales with d. We solve this problem by introducing a time-space tradeoff which reduces the storage requirement to the order of a single gradient vector while keeping the time complexity unchanged: the per-sample scalars entering the baselines and the gradient are accumulated block by block, each g(z_i|θ) is recomputed on the fly when assembling δθ, and its storage is reclaimed immediately afterwards. The total cost of computing every element of δθ is therefore unaffected.

5 Importance mixing

At each generation, we evaluate n new individuals generated from the mutation distribution p(z|θ). However, since small updates ensure that the KL divergence between consecutive mutation distributions is generally small, most new individuals will fall in the high density area of the previous mutation distribution. This leads to redundant fitness evaluations in that same area.

Our solution to this problem is a new procedure called importance mixing, which aims to reuse fitness evaluations from the previous generation, while ensuring the updated population conforms to the new mutation distribution.

Importance mixing works in two steps. In the first step, rejection sampling is performed on the previous population (generated from the previous distribution p(z|θ′)), such that individual z is accepted with probability

min{ 1, (1 − α) p(z|θ) / p(z|θ′) }.

Here α ∈ [0, 1] is the minimal refresh rate. Let n_a be the number of individuals accepted in the first step. In the second step, reverse rejection sampling is performed as follows: generate individuals from the current distribution p(z|θ) and accept z with probability

max{ α, 1 − p(z|θ′) / p(z|θ) }

until n − n_a new individuals are accepted. The n_a individuals from the old generation and the n − n_a newly accepted individuals together constitute the new population. Note that only the fitnesses of the newly accepted individuals need to be evaluated. The advantage of using importance mixing is twofold: on the one hand, we reduce the number of fitness evaluations required in each generation; on the other hand, if we fix the number of newly evaluated fitnesses, then many more fitness evaluations can potentially be used to yield more reliable and accurate gradients.
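The two-step procedure can be sketched as follows (the function names and the one-dimensional toy densities in the usage example are ours):

```python
import numpy as np

def importance_mixing(old_pop, logp_old, logp_new, sample_new, n, alpha, rng=0):
    """Two-step importance mixing (Section 5). logp_old/logp_new evaluate
    log-densities of the previous / current mutation distribution, and
    sample_new draws one individual from the current one. Returns a new
    population of size n; only freshly drawn individuals would need to
    have their fitness evaluated."""
    rng = np.random.default_rng(rng)
    # Step 1: forward rejection sampling on the old population.
    keep = []
    for z in old_pop:
        p_accept = min(1.0, (1 - alpha) * np.exp(logp_new(z) - logp_old(z)))
        if rng.random() < p_accept:
            keep.append(z)
    # Step 2: reverse rejection sampling from the current distribution.
    fresh = []
    while len(keep) + len(fresh) < n:
        z = sample_new(rng)
        p_accept = max(alpha, 1.0 - np.exp(logp_old(z) - logp_new(z)))
        if rng.random() < p_accept:
            fresh.append(z)
    return np.array(keep + fresh)
```

In a 1-D toy setting with the previous distribution N(0, 1) and the current one N(0.5, 1), the mixed population is statistically indistinguishable from fresh samples of N(0.5, 1), while a large fraction of the old individuals (and their fitness values) is reused.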

The minimal refresh rate α lower bounds the expected proportion of newly evaluated individuals ρ, namely ρ ≥ α, with the equality holding iff the current parameters θ equal the previous parameters θ′. In particular, if α = 1, all individuals from the previous generation will be discarded, and if α = 0, ρ depends only on the discrepancy between p(·|θ) and p(·|θ′). Normally we set α to a small positive value, to avoid too low an acceptance probability at the second step when p(z|θ′)/p(z|θ) ≈ 1.

It can be proven that the updated population conforms to the current mutation distribution p(·|θ). In the region where (1 − α) p(z|θ) ≤ p(z|θ′), the probability that an individual from the previous generation (distributed according to p(·|θ′)) appears in the new population is

p(z|θ′) · min{ 1, (1 − α) p(z|θ) / p(z|θ′) } = (1 − α) p(z|θ).

The probability that an individual generated from the second step enters the population is proportional to α p(z|θ), since

max{ α, 1 − p(z|θ′) / p(z|θ) } = α

in that region. So the probability of an individual entering the population is just proportional to p(z|θ) in that region. The same result also holds for the region where (1 − α) p(z|θ) > p(z|θ′).

In a companion paper [7], we measured the usefulness of importance mixing, and found that it reduces the number of required fitness evaluations by a factor of 5. Additionally, it reduced the algorithm’s sensitivity to the population size.

The computational complexity of importance mixing can be analyzed as follows. For each generated individual z, the densities p(z|θ′) and p(z|θ) need to be evaluated, requiring O(d²) computations. The total number of individuals generated is bounded by roughly n/α in the worst case, and is close to n on average.

6 Fitness Shaping

For problems with wildly fluctuating fitnesses, the gradient is disproportionately distorted by extreme fitness values, which can lead to premature convergence or numerical instability. To overcome this problem, we use fitness shaping, an order-preserving nonlinear fitness transformation function [8]. The choice of (monotonically increasing) fitness shaping function is arbitrary, and should therefore be considered one of the tuning parameters of the algorithm. We have empirically found that ranking-based shaping functions work best for various problems. The shaping function used for all experiments in this paper was a fixed piecewise function of the relative rank of each individual in the population, scaled between 0 and 1.
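Since the paper's exact shaping formula is not reproduced here, the following is an illustrative rank-based shaping with the stated properties (order-preserving, depending on ranks only, scaled to [0, 1]); the specific choice of zeroing out the bottom half is a common scheme, not necessarily the paper's:

```python
import numpy as np

def rank_based_shaping(fitnesses):
    """Order-preserving fitness shaping: replace each raw fitness by a
    utility depending only on the relative rank, scaled to [0, 1], with
    the bottom half of the population receiving zero utility."""
    f = np.asarray(fitnesses, dtype=float)
    n = len(f)
    ranks = np.empty(n)
    ranks[np.argsort(f)] = np.arange(n)   # 0 = worst, n - 1 = best
    u = ranks / (n - 1)                   # relative rank in [0, 1]
    return np.where(u > 0.5, u, 0.0)      # zero out the bottom half
```

Because the utilities depend only on ranks, any monotone transformation of the raw fitness (e.g. exponentiation) leaves the shaped values, and hence the gradient estimate, unchanged.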

7 Efficient NES

Integrating all the algorithm elements introduced above, the Efficient Natural Evolution Strategy (with block fitness baselines) can be summarized as follows: at each generation, update the population by importance mixing, evaluate and shape the fitnesses of the newly accepted individuals, compute the block-wise natural evolution gradient with its optimal baselines, and update θ. Summing up the analysis from previous sections, the time complexity of processing a single generation is O(nd³), while the space complexity is just O(d² + nd), where O(nd) comes from the need of storing the population. Assuming that n scales linearly with d, our algorithm scales linearly in space and quadratically in time w.r.t. the number of parameters, which is of order d². This is a significant improvement over the original NES, whose complexity is O(d⁴) in space and O(d⁶) in time.

Implementations of eNES are available in both Python and Matlab: the Python code is part of the PyBrain machine learning library (www.pybrain.org) and the Matlab code is available at www.idsia.ch/~sun/enes.html.

8 Experiments

Figure 1: Results on the unimodal benchmark functions for dimension 5, 15 and 50 (from top to bottom).

The tunable parameters of Efficient Natural Evolution Strategies comprise the population size n, the learning rate η, the refresh rate α and the fitness shaping function. In addition, three kinds of fitness baselines can be used.

We empirically find that a single fixed choice of the learning rate η is good and robust across problems. On some (but not all) benchmarks the performance can be further improved by more aggressive updates. Therefore, the only parameter that needs tuning in practice is the population size, which depends on both the expected ruggedness of the fitness landscape and the problem dimensionality.

8.1 Benchmark Functions

We empirically validate our algorithm on 9 unimodal and 4 multimodal functions out of the set of standard benchmark functions from [6] and [3], that are typically used in the literature, for comparison purposes and for competitions. We randomly choose the initial guess at average distance 1 from the optimum. In order to prevent potentially biased results, we follow [6] and consistently transform (by a combined rotation and translation) the functions’ inputs, making the variables non-separable and avoiding trivial optima (e.g. at the origin). This immediately renders many other methods virtually useless, since they cannot cope with correlated mutation directions. eNES, however, is invariant under translation and rotation. In addition, the rank-based fitness shaping makes it invariant under order-preserving transformations of the fitness function.

8.2 Performance on Benchmark Functions

We ran eNES on the set of unimodal benchmark functions with dimensions 5, 15 and 50, with population sizes 50, 250 and 1000, respectively, using a fixed learning rate and target precision. Figure 1 shows the average performance over 20 runs (5 runs for dimension 50) for each benchmark function. We left out the Rosenbrock function, on which eNES is one order of magnitude slower than on the other functions (e.g. 150,000 evaluations on dimension 15). Presumably this is due to the fact that the principal mutation direction is updated too slowly on complex curvatures. Note that SharpR and ParabR are unbounded functions, which explains the abrupt drop-off.

Figure 2: Success percentages varying with initial distances for the multimodal test functions using population sizes 20 and 100.

For the experiments on the multimodal benchmark functions we varied the distance of the initial guess to the optimum between 0.1 and 1000. Those runs were performed on dimension 2 with a relaxed target precision, since here the focus was on avoiding local maxima. We compare the results for population sizes 20 and 100. Figure 2 shows, for all tested multimodal functions, the percentage of 100 runs where eNES found the global optimum (as opposed to it getting stuck in a local extremum) conditioned on the distance from the initial guess to the optimum.

Figure 3: Evolution path and mutation distributions for a typical run on Rastrigin. Ellipsoids correspond to 0.5 standard deviations of the mutation distributions in generations 1, 20, 40.

Note that for Ackley and Griewank the success probability drops off sharply at a certain distance. For Ackley this is due to the fitness landscape providing very little global structure to exploit, whereas for Griewank the reason is that the local optima are extremely large, which makes them virtually impossible to escape from. (A solution to this would be to start with a significantly larger initial mutation distribution.) Figure 3 shows the evolution path of a typical run on Rastrigin, and the ellipses corresponding to the mutation distribution at different generations, illustrating how eNES jumps over local optima to reach the global optimum.

For three of the four multimodal functions we find that eNES locates the global optimum reliably, even with a population size as small as 20. For the remaining one, Rastrigin, the global optimum is only reliably found when using a population size of 100.

9 Discussion

Unlike most evolutionary algorithms, eNES boasts a relatively clean derivation from first principles. Using a full multinormal mutation distribution and fitness shaping, the eNES algorithm is invariant under translation and rotation and under order-preserving transformations of the fitness function.

Comparing our empirical results to CMA-ES [3], considered by many to be the ‘industry standard’ of evolutionary computation, we find that eNES is competitive but slower, especially in higher dimensions. However, eNES is faster on DiffPow for all dimensions. On multimodal benchmarks eNES is competitive with CMA-ES as well, as compared to the results in [8]. Our results collectively show that eNES can compete with state-of-the-art evolutionary algorithms on standard benchmarks.

Future work will also address the problems of automatically determining good population sizes and dynamically adapting the learning rate. Moreover, we plan to investigate the possibility of combining our algorithm with other methods (e.g. Estimation of Distribution Algorithms) to accelerate the adaptation of covariance matrices, improving performance on fitness landscapes where directions of ridges and valleys change abruptly (e.g. the Rosenbrock benchmark).

10 Conclusion

Efficient NES is a novel alternative to conventional evolutionary algorithms, using a natural evolution gradient to adapt the mutation distribution. Unlike previous natural gradient methods, eNES quickly calculates the inverse of the exact Fisher information matrix. This increases robustness and accuracy of the evolution gradient estimation, even in higher-dimensional search spaces. Importance mixing prevents unnecessary redundancy embodied by individuals from earlier generations. eNES constitutes a competitive, theoretically well-founded and relatively simple method for artificial evolution. Good results on standard benchmarks affirm the promise of this research direction.

11 Acknowledgments

This research was funded by SNF grants 200020-116674/1, 200021-111968/1 and 200021-113364/1.

References

  • [1] S. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
  • [2] S. Amari and S. C. Douglas. Why natural gradient? In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume 2, pages 1213–1216, 1998.
  • [3] N. Hansen and A. Ostermeier. Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9(2):159–195, 2001.
  • [4] J. Peters and S. Schaal. Natural actor-critic. Neurocomputing, 71(7–9):1180–1190, 2008.
  • [5] J. R. Peters. Machine learning of motor skills for robotics. PhD thesis, University of Southern California, Los Angeles, CA, USA, 2007. Adviser: Stefan Schaal.
  • [6] P. N. Suganthan, N. Hansen, J. J. Liang, K. Deb, Y. P. Chen, A. Auger, and S. Tiwari. Problem definitions and evaluation criteria for the CEC 2005 special session on real-parameter optimization. Technical report, Nanyang Technological University, Singapore, 2005.
  • [7] Y. Sun, D. Wierstra, T. Schaul, and J. Schmidhuber. Stochastic search using the natural gradient. In Proceedings of the International Conference on Machine Learning (ICML 2009), 2009.
  • [8] D. Wierstra, T. Schaul, J. Peters, and J. Schmidhuber. Natural evolution strategies. In Proceedings of the Congress on Evolutionary Computation (CEC 2008), Hong Kong. IEEE Press, 2008.