Natural Evolution Strategies

06/22/2011 · Daan Wierstra, et al. · EPFL, IDSIA

This paper presents Natural Evolution Strategies (NES), a recent family of algorithms that constitute a more principled approach to black-box optimization than established evolutionary algorithms. NES maintains a parameterized distribution on the set of solution candidates, and the natural gradient is used to update the distribution's parameters in the direction of higher expected fitness. We introduce a collection of techniques that address issues of convergence, robustness, sample complexity, computational complexity and sensitivity to hyperparameters. This paper explores a number of implementations of the NES family, ranging from general-purpose multi-variate normal distributions to heavy-tailed and separable distributions tailored towards global optimization and search in high dimensional spaces, respectively. Experimental results show best published performance on various standard benchmarks, as well as competitive performance on others.


1 Introduction

Many real world optimization problems are too difficult or complex to model directly. Therefore, they might best be solved in a ‘black-box’ manner, requiring no additional information on the objective function (i.e., the ‘fitness’ or ‘cost’) to be optimized besides fitness evaluations at certain points in parameter space. Problems that fall within this category are numerous, ranging from applications in health and science (Winter et al., 2005; Shir and Bäck, 2007; Jebalia et al., 2007) to aeronautic design (Hasenjäger et al., 2005; Klockgether and Schwefel, 1970) and control (Hansen et al., 2009).

Numerous algorithms in this vein have been developed and applied in the past fifty years (see section 2 for an overview), in many cases providing good and even near-optimal solutions to hard tasks, which otherwise would have required domain experts to hand-craft solutions at substantial cost and often with worse results. The near-infeasibility of finding globally optimal solutions requires a fair amount of heuristics in black-box optimization algorithms, leading to a proliferation of sometimes highly performant yet unprincipled methods.

In this paper, we introduce Natural Evolution Strategies (NES), a novel black-box optimization framework which is derived from first principles and simultaneously provides state-of-the-art performance. The core idea is to maintain and iteratively update a search distribution from which search points are drawn and their fitness evaluated. The search distribution is then updated in the direction of higher expected fitness, using ascent along the natural gradient.

1.1 Continuous Black-Box Optimization

The problem of black-box optimization has spawned a wide variety of approaches. A first class of methods was inspired by classic optimization techniques, including simplex methods such as Nelder-Mead (Nelder and Mead, 1965), as well as members of the quasi-Newton family of algorithms. Simulated annealing (Kirkpatrick et al., 1983), a popular method introduced in 1983, was inspired by thermodynamics, and is in fact an adaptation of the Metropolis-Hastings algorithm. More heuristic methods, such as those inspired by evolution, have been developed from the early 1950s on. These include the broad class of genetic algorithms (Holland, 1992; Goldberg, 1989), differential evolution (Storn and Price, 1997), estimation of distribution algorithms (Larrañaga, 2002), particle swarm optimization (Kennedy and Eberhart, 2001), and the cross-entropy method (Rubinstein and Kroese, 2004).

Evolution strategies (ES), introduced by Ingo Rechenberg and Hans-Paul Schwefel in the 1960s and 1970s (Rechenberg and Eigen, 1973; Schwefel, 1977), were designed to cope with high-dimensional continuous-valued domains and have remained an active field of research for more than four decades (Beyer and Schwefel, 2002). ESs involve evaluating the fitness of real-valued genotypes in batches (‘generations’), after which the best genotypes are kept, while the others are discarded. Survivors then procreate (by slightly mutating all of their genes) in order to produce the next batch of offspring. Iterated over several generations, this process was shown to yield results ranging from reasonable to excellent for many difficult optimization problems. The algorithm framework has been developed extensively over the years to include self-adaptation of the search parameters, and the representation of correlated mutations by the use of a full covariance matrix. This allowed the framework to capture interrelated dependencies by exploiting the covariances while ‘mutating’ individuals for the next generation. The culminating algorithm, the covariance matrix adaptation evolution strategy (CMA-ES; Hansen and Ostermeier, 2001), has proven successful in numerous studies (e.g., Friedrichs and Igel, 2005; Muller et al., 2002; Shepherd et al., 2006).

While evolution strategies prove to be an effective framework for black-box optimization, their ad hoc procedures remain heuristic in nature. Thoroughly analyzing the actual dynamics of the procedure turns out to be difficult, the considerable efforts of various researchers notwithstanding (Beyer, 2001; Jebalia et al., 2010; Auger, 2005). In other words, ESs (including CMA-ES), while powerful, still lack a clear derivation from first principles.

1.2 The NES Family

Natural Evolution Strategies (NES) are well-grounded, evolution-strategy inspired black-box optimization algorithms, which instead of maintaining a population of search points, iteratively update a search distribution. Like CMA-ES, they can be cast into the framework of evolution strategies.

The general procedure is as follows: the parameterized search distribution is used to produce a batch of search points, and the fitness function is evaluated at each such point. The distribution’s parameters (which include strategy parameters) allow the algorithm to adaptively capture the (local) structure of the fitness function. For example, in the case of a Gaussian distribution, this comprises the mean and the covariance matrix. From the samples, NES estimates a search gradient on the parameters towards higher expected fitness. NES then performs a gradient ascent step along the natural gradient, a second-order method which, unlike the plain gradient, renormalizes the update w.r.t. uncertainty. This step is crucial, since it prevents oscillations, premature convergence, and undesired effects stemming from a given parameterization. The entire process reiterates until a stopping criterion is met.

All members of the ‘NES family’ operate based on the same principles. They differ in the type of distribution and the gradient approximation method used. Different search spaces require different search distributions; for example, in low dimensionality it can be highly beneficial to model the full covariance matrix. In high dimensions, on the other hand, a more scalable alternative is to limit the covariance to the diagonal only. In addition, highly multi-modal search spaces may benefit from more heavy-tailed distributions (such as Cauchy, as opposed to the Gaussian). A last distinction arises between distributions where we can analytically compute the natural gradient, and more general distributions where we need to estimate it from samples.

1.3 Paper Outline

This paper builds upon and extends our previous work on Natural Evolution Strategies (Wierstra et al., 2008; Sun et al., 2009a, b; Glasmachers et al., 2010a, b; Schaul et al., 2011), and introduces novel performance- and robustness-enhancing techniques (in sections 3.3 and 3.4), as well as an extension to rotationally symmetric distributions (section 5.2) and a plethora of new experimental results.

The paper is structured as follows. Section 2 presents the general idea of search gradients as introduced by Wierstra et al. (2008), outlining how to perform stochastic search using parameterized distributions while doing gradient ascent towards higher expected fitness. The limitations of the plain gradient are exposed in section 2.2, and subsequently addressed by the introduction of the natural gradient (section 2.3), resulting in the canonical NES algorithm.

Section 3 then regroups a collection of techniques that enhance NES’s performance and robustness: fitness shaping, designed to render the algorithm invariant w.r.t. order-preserving fitness transformations (Wierstra et al., 2008; section 3.1); importance mixing, designed to recycle samples so as to reduce the number of required fitness evaluations (Sun et al., 2009b; section 3.2); adaptation sampling, a novel technique for adjusting learning rates online (section 3.3); and finally restart strategies, designed to improve success rates on multi-modal problems (section 3.4).

In section 4, we look in more depth at multivariate Gaussian search distributions, constituting the most common case. We show how to constrain the covariances to positive-definite matrices using the exponential map (section 4.1), and how to use a change of coordinates to reduce the computational complexity from $O(d^6)$ to $O(d^3)$, with $d$ being the search space dimension, resulting in the xNES algorithm (Glasmachers et al., 2010b; section 4.2).

Next, in section 5, we develop the breadth of the framework, motivating its usefulness and deriving a number of NES variants with different search distributions. First, we show how a restriction to diagonal parameterization permits the approach to scale to very high dimensions due to its linear complexity (Schaul et al., 2011; section 5.1). Second, we provide a novel formulation of NES for the whole class of multi-variate versions of distributions with rotation symmetries (section 5.2), including heavy-tailed distributions with infinite variance, such as the Cauchy distribution (Schaul et al., 2011; section 5.3).

The ensuing experimental investigations show the competitiveness of the approach on a broad range of benchmarks (section 6). The paper ends with a discussion on the effectiveness of the different techniques and types of distributions and an outlook towards future developments (section 7).

2 Search Gradients

The core idea of Natural Evolution Strategies is to use search gradients to update the parameters of the search distribution. We define the search gradient as the sampled gradient of expected fitness. The search distribution can be taken to be a multinormal distribution, but could in principle be any distribution for which we can find derivatives of its log-density w.r.t. its parameters. For example, useful distributions include Gaussian mixture models and the Cauchy distribution with its heavy tail.

If we use $\theta$ to denote the parameters of distribution $\pi(z|\theta)$ and $f(z)$ to denote the fitness function for samples $z$, we can write the expected fitness under the search distribution as

$$J(\theta) = \mathbb{E}_\theta[f(z)] = \int f(z)\, \pi(z|\theta)\, dz. \qquad (1)$$

The so-called ‘log-likelihood trick’ enables us to write

$$\nabla_\theta J(\theta) = \nabla_\theta \int f(z)\, \pi(z|\theta)\, dz = \int f(z)\, \nabla_\theta \log \pi(z|\theta)\; \pi(z|\theta)\, dz = \mathbb{E}_\theta\!\left[f(z)\, \nabla_\theta \log \pi(z|\theta)\right].$$

From this form we obtain the Monte Carlo estimate of the search gradient from samples $z_1, \dots, z_\lambda$ as

$$\nabla_\theta J(\theta) \approx \frac{1}{\lambda} \sum_{k=1}^{\lambda} f(z_k)\, \nabla_\theta \log \pi(z_k|\theta), \qquad (2)$$

where $\lambda$ is the population size. This gradient on expected fitness provides a search direction in the space of search distributions. A straightforward gradient ascent scheme can thus iteratively update the search distribution

$$\theta \leftarrow \theta + \eta\, \nabla_\theta J(\theta),$$

where $\eta$ is a learning rate parameter. Algorithm 1 provides the pseudocode for this very general approach to black-box optimization by using a search gradient on search distributions.

input : $f$, $\theta_{\text{init}}$
repeat
       for $k = 1 \dots \lambda$ do
             draw sample $z_k \sim \pi(\cdot|\theta)$
             evaluate the fitness $f(z_k)$
             calculate log-derivatives $\nabla_\theta \log \pi(z_k|\theta)$
       end for
       $\nabla_\theta J \leftarrow \frac{1}{\lambda} \sum_{k=1}^{\lambda} \nabla_\theta \log \pi(z_k|\theta) \cdot f(z_k)$
       $\theta \leftarrow \theta + \eta \cdot \nabla_\theta J$
until stopping criterion is met
Algorithm 1 Canonical Search Gradient algorithm

Utilizing the search gradient in this framework is similar to evolution strategies in that it iteratively generates the fitnesses of batches of vector-valued samples – the ES’s so-called candidate solutions. It differs, however, in that it represents this ‘population’ as a parameterized distribution, and in that it uses a search gradient, computed from the fitnesses, to update the parameters of this distribution.
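To make the procedure concrete, the following minimal Python sketch implements Algorithm 1; it is our illustration, not code from the paper, and the function names and hyperparameter defaults (lam, eta) are placeholder choices:

    import numpy as np

    def search_gradient_step(theta, sample, log_deriv, fitness, lam=50, eta=0.01):
        # One iteration of Algorithm 1: draw lam samples, estimate the
        # search gradient via equation (2), take a plain gradient ascent step.
        zs = [sample(theta) for _ in range(lam)]
        grad = np.mean([fitness(z) * log_deriv(z, theta) for z in zs], axis=0)
        return theta + eta * grad

    # Example: one-dimensional Gaussian with theta = (mu, sigma), maximizing
    # f(z) = -z^2. (sigma is kept positive via abs() for sampling only; the
    # fragility of this plain update is exactly the topic of section 2.2.)
    theta = np.array([2.0, 1.0])
    for _ in range(100):
        theta = search_gradient_step(
            theta,
            sample=lambda t: np.random.normal(t[0], abs(t[1])),
            log_deriv=lambda z, t: np.array(
                [(z - t[0]) / t[1] ** 2,
                 ((z - t[0]) ** 2 - t[1] ** 2) / t[1] ** 3]),
            fitness=lambda z: -z ** 2)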

2.1 Search Gradient for Gaussian Distributions

In the case of the ‘default’ $d$-dimensional multi-variate normal distribution, we collect the parameters of the Gaussian, the mean $\mu$ (candidate solution center) and the covariance matrix $\Sigma$ (mutation matrix), in a single concatenated vector $\theta = \langle \mu, \Sigma \rangle$. However, to sample efficiently from this distribution we need a square root of the covariance matrix, that is, a matrix $A$ fulfilling $A^\top A = \Sigma$. (For any matrix $M$, $M^\top$ denotes its transpose.) Then $z = \mu + A^\top s$ transforms a standard normal vector $s \sim \mathcal{N}(0, I)$ into a sample $z \sim \mathcal{N}(\mu, \Sigma)$. Here, $I = I_d$ denotes the $d$-dimensional identity matrix. Let

$$\pi(z|\theta) = \frac{1}{(2\pi)^{d/2} \sqrt{\det \Sigma}} \exp\!\left(-\frac{1}{2}(z-\mu)^\top \Sigma^{-1} (z-\mu)\right)$$

denote the density of the multinormal search distribution $\mathcal{N}(\mu, \Sigma)$.

In order to calculate the derivatives of the log-likelihood with respect to individual elements of $\theta$ for this multinormal distribution, first note that

$$\log \pi(z|\theta) = -\frac{d}{2}\log(2\pi) - \frac{1}{2}\log\det\Sigma - \frac{1}{2}(z-\mu)^\top \Sigma^{-1}(z-\mu).$$

We will need its derivatives, that is, $\nabla_\mu \log \pi(z|\theta)$ and $\nabla_\Sigma \log \pi(z|\theta)$. The first is trivially

$$\nabla_\mu \log \pi(z|\theta) = \Sigma^{-1}(z-\mu), \qquad (3)$$

while the latter is

$$\nabla_\Sigma \log \pi(z|\theta) = \frac{1}{2}\Sigma^{-1}(z-\mu)(z-\mu)^\top \Sigma^{-1} - \frac{1}{2}\Sigma^{-1}. \qquad (4)$$

In order to preserve symmetry, to ensure non-negative variances and to keep the mutation matrix $\Sigma$ positive definite, $\Sigma$ needs to be constrained. One way to accomplish that is by representing $\Sigma$ as a product $\Sigma = A^\top A$ (for a more sophisticated solution to this issue, see section 4.1). Instead of using the log-derivatives on $\Sigma$ directly, we then compute the derivatives with respect to $A$ as

$$\nabla_A \log \pi(z|\theta) = A \left[\nabla_\Sigma \log \pi(z|\theta) + \nabla_\Sigma \log \pi(z|\theta)^\top\right].$$

Using these derivatives to calculate $\nabla_\theta J$, we can then update parameters as $\theta \leftarrow \theta + \eta \nabla_\theta J$ using learning rate $\eta$. This produces a new center $\mu$ for the search distribution, and simultaneously self-adapts its associated covariance matrix $\Sigma$. To summarize, we provide the pseudocode for following the search gradient in the case of a multinormal search distribution in Algorithm 2.

input : $f$, $\mu_{\text{init}}$, $A_{\text{init}}$
repeat
       for $k = 1 \dots \lambda$ do
             draw sample $z_k \sim \mathcal{N}(\mu, A^\top A)$
             evaluate the fitness $f(z_k)$
             calculate log-derivatives $\nabla_\mu \log \pi(z_k|\theta)$ and $\nabla_A \log \pi(z_k|\theta)$
       end for
       $\nabla_\mu J \leftarrow \frac{1}{\lambda}\sum_{k=1}^{\lambda} f(z_k) \cdot \nabla_\mu \log \pi(z_k|\theta)$
       $\nabla_A J \leftarrow \frac{1}{\lambda}\sum_{k=1}^{\lambda} f(z_k) \cdot \nabla_A \log \pi(z_k|\theta)$
       $\mu \leftarrow \mu + \eta \cdot \nabla_\mu J$
       $A \leftarrow A + \eta \cdot \nabla_A J$
until stopping criterion is met
Algorithm 2 Search Gradient algorithm: Multinormal distribution
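As an illustration, one iteration of Algorithm 2 can be transcribed into Python as follows (our sketch; equations (3) and (4) appear where marked):

    import numpy as np

    def multinormal_search_gradient_step(mu, A, fitness, lam=50, eta=0.01):
        # One iteration of Algorithm 2. Sigma = A^T A is represented by its
        # factor A, so that the updated covariance stays symmetric.
        d = mu.size
        Sigma_inv = np.linalg.inv(A.T @ A)
        g_mu, g_A = np.zeros(d), np.zeros((d, d))
        for _ in range(lam):
            s = np.random.randn(d)
            z = mu + A.T @ s                     # z ~ N(mu, Sigma)
            f = fitness(z)
            d_mu = Sigma_inv @ (z - mu)                              # equation (3)
            d_Sigma = 0.5 * np.outer(d_mu, d_mu) - 0.5 * Sigma_inv   # equation (4)
            g_mu += f * d_mu / lam
            g_A += f * (A @ (d_Sigma + d_Sigma.T)) / lam             # derivative w.r.t. A
        return mu + eta * g_mu, A + eta * g_A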

2.2 Limitations of Plain Search Gradients

As the attentive reader will have realized, there exists at least one major issue with applying the search gradient as-is in practice: It is impossible to precisely locate a (quadratic) optimum, even in the one-dimensional case. Let $d = 1$, $\theta = \langle \mu, \sigma \rangle$, and samples $z \sim \mathcal{N}(\mu, \sigma^2)$. Equations (3) and (4), the gradients on $\mu$ and $\sigma$, become

$$\nabla_\mu \log \pi(z|\theta) = \frac{z - \mu}{\sigma^2}, \qquad \nabla_\sigma \log \pi(z|\theta) = \frac{(z-\mu)^2 - \sigma^2}{\sigma^3},$$

and the updates, assuming simple hill-climbing (i.e., a population size $\lambda = 1$) read:

$$\mu \leftarrow \mu + \eta\, f(z)\, \frac{z - \mu}{\sigma^2}, \qquad \sigma \leftarrow \sigma + \eta\, f(z)\, \frac{(z-\mu)^2 - \sigma^2}{\sigma^3}.$$

For any objective function that requires locating an (approximately) quadratic optimum with some degree of precision, $\sigma$ must decrease, which in turn increases the variance of the updates, as $\nabla_\mu \log \pi \propto 1/\sigma$ and $\nabla_\sigma \log \pi \propto 1/\sigma$ for a typical sample $z \approx \mu \pm \sigma$. In fact, the updates become increasingly unstable, the smaller $\sigma$ becomes, an effect which a reduced learning rate or an increased population size can only delay but not avoid. Figure 1 illustrates this effect. Conversely, whenever $\sigma$ is large, the magnitude of a typical update is severely reduced.


Figure 1: Left: Schematic illustration of how the search distribution adapts in the one-dimensional case: from (1) to (2), $\mu$ is adjusted to make the distribution cover the optimum. From (2) to (3), $\sigma$ is reduced to allow for a precise localization of the optimum. The step from (3) to (1) then is the problematic case, where a small $\sigma$ induces a largely overshooting update, making the search start over again. Right: Progression of $\mu$ (black) and $\sigma$ (red, dashed) when following the search gradient towards minimizing $f(z) = z^2$, executing Algorithm 2. Plotted are median values over 1000 runs, with a small learning rate and a large population size, both of which mitigate the instability somewhat, but still show the failure to precisely locate the optimum (for which both $\mu$ and $\sigma$ need to approach 0).

Clearly, this update is not at all scale-invariant: Starting with $\sigma \gg 1$ makes all updates minuscule, whereas starting with $\sigma \ll 1$ makes the first update huge and therefore unstable.

We conjecture that this limitation constitutes one of the main reasons why search gradients have not been developed before: in isolation, the plain search gradient’s performance is both unstable and unsatisfying, and it is only the application of natural gradients (introduced in section 2.3) that turns search gradients into a viable optimization method.

2.3 Using the Natural Gradient

Instead of using the plain stochastic gradient for updates, NES follows the natural gradient. The natural gradient was first introduced by Amari in 1998, and has been shown to possess numerous advantages over the plain gradient (Amari, 1998; Amari and Douglas, 1998). In terms of mitigating the slow convergence of plain gradient ascent in optimization landscapes with ridges and plateaus, natural gradients are a more principled (and hyper-parameter-free) approach than, for example, the commonly used momentum heuristic.

The plain gradient simply follows the steepest ascent in the space of the actual parameters $\theta$ of the distribution. This means that for a given small step-size $\varepsilon$, following it will yield a new distribution with parameters chosen from the hypersphere of radius $\varepsilon$ and center $\theta$ that maximizes $J$. In other words, the Euclidean distance in parameter space between subsequent distributions is fixed. Clearly, this makes the update dependent on the particular parameterization of the distribution; therefore a change of parameterization leads to different gradients and different updates. See also Figure 2 for an illustration of how the natural gradient effectively renormalizes updates w.r.t. uncertainty.


Figure 2: Illustration of plain versus natural gradient in parameter space. Consider two parameters, e.g., $\theta = (\mu, \sigma)$, of the search distribution. In the plot on the left, the solid (black) arrows indicate the gradient samples $\nabla_\theta \log \pi(z|\theta)$, while the dotted (blue) arrows correspond to $f(z) \cdot \nabla_\theta \log \pi(z|\theta)$, that is, the same gradient estimates, but scaled with fitness. Combining these, the bold (green) arrow indicates the (sampled) fitness gradient $\nabla_\theta J$, while the bold dashed (red) arrow indicates the corresponding natural gradient $\tilde{\nabla}_\theta J$. Being random variables with expectation zero, the distribution of the black arrows is governed by their covariance, indicated by the gray ellipse. Notice that this covariance is a quantity in parameter space (where the $\theta$ reside), which is not to be confused with the covariance of the distribution in the search space (where the samples $z$ reside). In contrast, the solid (black) arrows on the right represent $\mathbf{F}^{-1} \cdot \nabla_\theta \log \pi(z|\theta)$, and the dotted (blue) arrows indicate the natural gradient samples $\mathbf{F}^{-1} \cdot f(z)\, \nabla_\theta \log \pi(z|\theta)$, resulting in the natural gradient (dashed red). The covariance of the solid arrows on the right hand side turns out to be the inverse of the covariance of the solid arrows on the left. This has the effect that when computing the natural gradient, directions with high variance (uncertainty) are penalized and thus shrunken, while components with low variance (high certainty) are boosted, since these components of the gradient samples deserve more trust. This makes the (dashed red) natural gradient a much more trustworthy update direction than the (green) plain gradient.

The key idea of the natural gradient is to remove this dependence on the parameterization by relying on a more ‘natural’ measure of distance $D(\theta' \,\|\, \theta)$ between probability distributions $\pi(z|\theta)$ and $\pi(z|\theta')$. One such natural distance measure between two probability distributions is the Kullback-Leibler divergence (Kullback and Leibler, 1951). The natural gradient can then be formalized as the solution to the constrained optimization problem

$$\max_{\delta\theta}\; J(\theta + \delta\theta) \approx J(\theta) + \delta\theta^\top \nabla_\theta J, \quad \text{s.t.}\quad D(\theta + \delta\theta \,\|\, \theta) = \varepsilon,$$

where $J(\theta)$ is the expected fitness of equation (1), and $\varepsilon$ is a small increment size. Now, we have for $\delta\theta \to 0$,

$$D(\theta + \delta\theta \,\|\, \theta) = \frac{1}{2}\, \delta\theta^\top \mathbf{F}\, \delta\theta,$$

where

$$\mathbf{F} = \int \pi(z|\theta)\, \nabla_\theta \log \pi(z|\theta)\; \nabla_\theta \log \pi(z|\theta)^\top\, dz$$

is the Fisher information matrix of the given parametric family of search distributions. The solution to this constrained problem can be found using a Lagrangian multiplier (Peters, 2007), yielding the necessary condition

$$\mathbf{F}\, \delta\theta = \beta\, \nabla_\theta J \qquad (5)$$

for some constant $\beta > 0$. The direction of the natural gradient $\tilde{\nabla}_\theta J$ is given by the $\delta\theta$ thus defined. If $\mathbf{F}$ is invertible (care has to be taken because the Fisher matrix estimate may not be numerically invertible even if the exact Fisher matrix is), the natural gradient amounts to

$$\tilde{\nabla}_\theta J = \mathbf{F}^{-1} \nabla_\theta J(\theta).$$

The Fisher matrix can be estimated from samples, reusing the log-derivatives $\nabla_\theta \log \pi(z_k|\theta)$ that we already computed for the gradient $\nabla_\theta J$. Then, updating the parameters by following the natural gradient instead of the steepest gradient leads us to the general formulation of NES, as shown in Algorithm 3.

input : $f$, $\theta_{\text{init}}$
repeat
       for $k = 1 \dots \lambda$ do
             draw sample $z_k \sim \pi(\cdot|\theta)$
             evaluate the fitness $f(z_k)$
             calculate log-derivatives $\nabla_\theta \log \pi(z_k|\theta)$
       end for
       $\nabla_\theta J \leftarrow \frac{1}{\lambda} \sum_{k=1}^{\lambda} \nabla_\theta \log \pi(z_k|\theta) \cdot f(z_k)$
       $\mathbf{F} \leftarrow \frac{1}{\lambda} \sum_{k=1}^{\lambda} \nabla_\theta \log \pi(z_k|\theta)\; \nabla_\theta \log \pi(z_k|\theta)^\top$
       $\theta \leftarrow \theta + \eta \cdot \mathbf{F}^{-1} \nabla_\theta J$
until stopping criterion is met
Algorithm 3 Canonical Natural Evolution Strategies
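The only code change relative to the plain search gradient is the multiplication by the inverse of the sampled Fisher matrix. A sketch in Python, where the small damping term is our addition to guard against the numerical invertibility issue noted above:

    import numpy as np

    def natural_gradient(log_derivs, fitnesses, damping=1e-8):
        # log_derivs: (lam, n_params) array of grad_theta log pi(z_k|theta);
        # fitnesses:  (lam,) array of f(z_k).
        lam = len(fitnesses)
        grad_J = log_derivs.T @ fitnesses / lam      # plain search gradient, eq. (2)
        F = log_derivs.T @ log_derivs / lam          # Monte Carlo Fisher estimate
        F += damping * np.eye(F.shape[0])            # keep the estimate invertible
        return np.linalg.solve(F, grad_J)            # F^{-1} grad_J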

3 Performance and Robustness Techniques

In the following we present crucial techniques to improve NES’s performance and robustness. Fitness shaping (Wierstra et al., 2008) is designed to make the algorithm invariant w.r.t. arbitrary yet order-preserving fitness transformations (section 3.1). Importance mixing (Sun et al., 2009b) is designed to recycle samples so as to reduce the number of required fitness evaluations, and is presented in section 3.2. Adaptation sampling, a novel technique for adjusting learning rates online, is introduced in section 3.3. Finally, restart strategies, designed to improve success rates on multimodal problems, are presented in section 3.4.

3.1 Fitness Shaping

NES utilizes rank-based fitness shaping in order to render the algorithm invariant under monotonically increasing (i.e., rank preserving) transformations of the fitness function. For this purpose, the fitness of the population is transformed into a set of utility values $u_1 \ge \dots \ge u_\lambda$. Let $z_k$ denote the $k$-th best individual (the $k$-th individual in the population, sorted by fitness, such that $z_1$ is the best and $z_\lambda$ the worst individual). Replacing fitness with utility, the gradient estimate of equation (2) becomes, with slight abuse of notation,

$$\nabla_\theta J(\theta) = \sum_{k=1}^{\lambda} u_k\, \nabla_\theta \log \pi(z_k|\theta). \qquad (6)$$

To avoid entangling the utility-weightings with the learning rate, we require that the utility values be normalized to a fixed overall scale.

The choice of utility function can in fact be seen as a free parameter of the algorithm. Throughout this paper we will use the following:

$$u_k = \frac{\max\!\left(0,\; \log(\frac{\lambda}{2}+1) - \log k\right)}{\sum_{j=1}^{\lambda} \max\!\left(0,\; \log(\frac{\lambda}{2}+1) - \log j\right)} - \frac{1}{\lambda},$$

which is directly related to the one employed by CMA-ES (Hansen and Ostermeier, 2001), for ease of comparison. In our experience, however, this choice has not been crucial to performance, as long as the utility function is monotone and based on ranks instead of raw fitness (e.g., a function which simply increases linearly with rank).
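In code, these utilities can be precomputed once per population size; the following is a direct Python transcription of the formula above (the function name is ours):

    import numpy as np

    def utilities(lam):
        # Rank-based utilities u_1 >= ... >= u_lam (index 0 = best individual).
        ranks = np.arange(1, lam + 1)
        u = np.maximum(0.0, np.log(lam / 2 + 1) - np.log(ranks))
        return u / u.sum() - 1.0 / lam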

In addition to robustness, these utility values provide us with an elegant formalism to describe the (1+1) hill-climber version of the algorithm within the same framework, by using different utility values, depending on success (see section 4.5 later in this paper).

3.2 Importance Mixing

In each batch, we evaluate $\lambda$ new samples generated from the search distribution $\pi(\cdot|\theta)$. However, since small updates ensure that the KL divergence between consecutive search distributions is generally small, most new samples will fall in the high density area of the previous search distribution $\pi(\cdot|\theta')$. This leads to redundant fitness evaluations in that same area. We improve the efficiency with a new procedure called importance mixing, which aims at reusing fitness evaluations from the previous batch, while ensuring that the updated batch conforms to the new search distribution.

Importance mixing works in two steps: In the first step, rejection sampling is performed on the previous batch, such that sample $z$ is accepted with probability

$$\min\left\{1,\; (1-\alpha)\,\frac{\pi(z|\theta)}{\pi(z|\theta')}\right\}. \qquad (7)$$

Here $\alpha \in [0, 1]$ is an algorithm hyperparameter called the minimal refresh rate. Let $\lambda_a$ be the number of samples accepted in the first step. In the second step, reverse rejection sampling is performed as follows: Generate samples from the current search distribution $\pi(\cdot|\theta)$ and accept $z$ with probability

$$\max\left\{\alpha,\; 1 - \frac{\pi(z|\theta')}{\pi(z|\theta)}\right\} \qquad (8)$$

until $\lambda - \lambda_a$ new samples are accepted. The $\lambda_a$ samples from the old batch and the $\lambda - \lambda_a$ newly accepted samples together constitute the new batch. Figure 3 illustrates the procedure. Note that only the fitnesses of the newly accepted samples need to be evaluated. The advantage of using importance mixing is twofold: On the one hand, we reduce the number of fitness evaluations required in each batch; on the other hand, if we fix the number of newly evaluated fitnesses, then many more fitness evaluations can potentially be used to yield more reliable and accurate gradients.


Figure 3: Illustration of importance mixing. Left: In the first step, old samples are eliminated (red triangles) according to (7), and the remaining samples (blue triangles) are reused. Right: In the second step, new samples (green circles) are generated according to (8).

The minimal refresh rate parameter $\alpha$ has two uses. First, it avoids too low an acceptance probability at the second step when $\pi(z|\theta')/\pi(z|\theta) \approx 1$. Second, it permits specifying a lower bound of $\alpha$ on the expected proportion of newly evaluated samples, with equality holding if and only if $\theta = \theta'$. In particular, if $\alpha = 1$, all samples from the previous batch will be discarded, and if $\alpha = 0$, the proportion depends only on the distance between $\pi(\cdot|\theta)$ and $\pi(\cdot|\theta')$. Normally we set $\alpha$ to a small positive value, which we keep fixed throughout this paper.
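A Python sketch of the two-step procedure follows; the callable-based interface and the default value of alpha are our illustrative choices:

    import numpy as np

    def importance_mixing(old_zs, old_fs, logp_new, logp_old, sample, fitness,
                          lam, alpha=0.01):
        zs, fs = [], []
        # Step 1: rejection sampling on the previous batch, acceptance rule (7).
        for z, f in zip(old_zs, old_fs):
            if np.random.rand() < min(1.0, (1 - alpha) * np.exp(logp_new(z) - logp_old(z))):
                zs.append(z); fs.append(f)
        zs, fs = zs[:lam], fs[:lam]
        # Step 2: reverse rejection sampling with fresh samples, acceptance rule (8).
        while len(zs) < lam:
            z = sample()
            if np.random.rand() < max(alpha, 1.0 - np.exp(logp_old(z) - logp_new(z))):
                zs.append(z); fs.append(fitness(z))   # only these cost evaluations
        return zs, fs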

It can be proven that the updated batch conforms to the search distribution $\pi(\cdot|\theta)$. In the region where

$$(1-\alpha)\,\frac{\pi(z|\theta)}{\pi(z|\theta')} \le 1,$$

the probability that a sample from previous batches appears in the new batch is

$$\pi(z|\theta') \cdot (1-\alpha)\,\frac{\pi(z|\theta)}{\pi(z|\theta')} = (1-\alpha)\,\pi(z|\theta).$$

The probability that a sample generated from the second step enters the batch is $\alpha\,\pi(z|\theta)$, since

$$\max\left\{\alpha,\; 1 - \frac{\pi(z|\theta')}{\pi(z|\theta)}\right\} = \alpha.$$

So the probability of a sample entering the batch is just $\pi(z|\theta)$ in that region. The same result holds also for the region where $(1-\alpha)\,\pi(z|\theta)/\pi(z|\theta') > 1$.

When using importance mixing in the context of NES, this reduces the sensitivity to the hyperparameters, particularly the population size $\lambda$, as importance mixing implicitly adapts it to the situation by reusing some or many previously evaluated sample points.

3.3 Adaptation Sampling

To reduce the burden on determining appropriate hyper-parameters such as the learning rate, we develop a new self-adaptation or meta-learning technique (Schaul and Schmidhuber, 2010a), called adaptation sampling, that can automatically adapt the settings in a principled and economical way.

We model this situation as follows: Let $g(\cdot|\psi)$ be a distribution with hyper-parameter $\psi$, and $q(x)$ a quality measure for each sample $x \sim g(\cdot|\psi)$. Our goal is to adapt $\psi$ so as to maximize the quality. A straightforward method to achieve this, henceforth dubbed adaptation sampling, is to evaluate the quality of the samples $x'$ drawn from $g(\cdot|\psi')$, where $\psi' \neq \psi$ is a slight variation of $\psi$, and then perform hill-climbing: Continue with the new $\psi'$ if the quality of its samples is significantly better (according, e.g., to a Mann-Whitney U-test), and revert to $\psi$ otherwise. Note that this procedure is similar to the NES algorithm itself, but applied at a meta-level to algorithm parameters instead of the search distribution. The goal of this adaptation is to maximize the pace of progress over time, which is slightly different from maximizing the fitness function itself.

Virtual adaptation sampling is a lightweight alternative to adaptation sampling that is particularly useful whenever evaluating the quality measure $q$ is expensive:

  • do importance sampling on the existing samples $x_k$, drawn from $g(\cdot|\psi)$, according to

$$w'_k = \frac{g(x_k|\psi')}{g(x_k|\psi)}$$

    (this is always well-defined, because $x_k \sim g(\cdot|\psi)$ implies $g(x_k|\psi) > 0$).

  • compare $\{q(x_k)\}$ with weights $\{w_k = 1\}$ and $\{q(x_k)\}$ with weights $\{w'_k\}$, using a weighted version of the Mann-Whitney test, as introduced in Appendix A.

Beyond determining whether $\psi$ or $\psi'$ is better, choosing a non-trivial confidence level $\rho$ allows us to avoid parameter drift, as $\psi$ is only updated if the improvement is significant enough.

There is one caveat, however: the rate of parameter change needs to be adjusted such that the two resulting distributions are not too similar (otherwise the difference won’t be statistically significant), but also not too different (otherwise the weights $w'_k$ will be too small and again the test will be inconclusive).

If, however, we explicitly desire large adaptation steps on $\psi$, we have the possibility of interpolating between adaptation sampling and virtual adaptation sampling by drawing a few new samples from the distribution $g(\cdot|\psi')$ (each assigned weight 1), where it overlaps least with $g(\cdot|\psi)$. The best way of achieving this is importance mixing, as introduced in section 3.2, used jointly with the reweighted existing samples.

For NES algorithms, the most important parameter to be adapted by adaptation sampling is the learning rate $\eta$, starting with a conservative guess. This is because half-way into the search, after a local attractor has been singled out, it may well pay off to increase the learning rate in order to more quickly converge to it.

In order to produce variations which can be judged using the above-mentioned U-test, we propose a procedure similar in spirit to Rprop-updates (Riedmiller and Braun, 1993; Igel and Hüsken, 2003), where the learning rates are either increased or decreased by a multiplicative constant whenever there is evidence that such a change will lead to better samples.

More concretely, when using adaptation sampling for NES we test for an improvement with the hypothetical distribution $\theta'$ that would have been generated with a larger learning rate. Each time the statistical test is successful with a confidence of at least some threshold (a value determined empirically), we increase the learning rate by a multiplicative factor, up to at most a maximal value; otherwise we let it decay back toward its initial value.
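The following Python sketch shows the shape of such an update rule. The constants (growth factor, decay, significance threshold) are illustrative placeholders rather than the paper's empirically determined values, and scipy's unweighted Mann-Whitney test stands in for the weighted variant of Appendix A:

    from scipy.stats import mannwhitneyu

    def adapt_learning_rate(eta, eta_init, quality_current, quality_virtual,
                            significance=0.05, grow=1.5, decay=0.9):
        # quality_virtual: sample qualities reweighted under the larger,
        # hypothetical learning rate; quality_current: under the current one.
        _, p = mannwhitneyu(quality_virtual, quality_current, alternative='greater')
        if p < significance:                         # virtual setting looks better
            return min(1.0, grow * eta)              # increase eta, capped
        return decay * eta + (1 - decay) * eta_init  # drift back toward eta_init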

Figure 4: Illustration of the effect of adaptation sampling. We show the increase in fitness during a NES run (above) and the corresponding learning rates (below) on two setups: 10-dimensional Sphere function (left), and 10-dimensional Rosenbrock function (right). Plotted are three variants of xNES (Algorithm 4): one with a fixed default learning rate (dashed, red), one with a fixed large learning rate (dotted, yellow), and one with an adaptive learning rate starting from the default value (green). We see that for the (simple) Sphere function, it is advantageous to use a large learning rate, and adaptation sampling automatically finds that one. However, using the overly greedy updates of a large learning rate fails on harder problems (right). Here adaptation sampling really shines: it boosts the learning rate in the initial phase (entering the Rosenbrock valley), then quickly reduces it while the search needs to carefully navigate the bottom of the valley, and boosts it again at the end when it has located the optimum and merely needs to zoom in precisely.

Figure 4 illustrates the effect of the virtual adaptation sampling strategy on two different 10-dimensional unimodal benchmark functions, the Sphere function and the Rosenbrock function (see section 6.2 for details). We find that, indeed, adaptation sampling boosts the learning rates to the appropriate high values when quick progress can be made (in the presence of an approximately quadratic optimum), but keeps them at carefully low values otherwise.

3.4 Restart Strategies

A simple but widespread method for mitigating the risk of finding only local optima in a strongly multi-modal scenario is to restart the optimization algorithm a number of times with different initializations, or just with a different random seed. This is even more useful in the presence of parallel processing resources, in which case multiple runs are executed simultaneously.

In practice, where the parallel capacity is limited, we need to decide when to stop or suspend an ongoing optimization run and start or resume another one. In this section we provide one such restart strategy that takes those decisions. Inspired by recent work on practical universal search (Schaul and Schmidhuber, 2010b), this results in an effective use of resources independently of the problem.

The strategy consists in reserving a fixed fraction $p$ of the total time for the first run, and then subdividing the remaining time in the same way, recursively (i.e., a fraction $p(1-p)^{i-1}$ of the total time for the $i$-th run). The recursive decomposition of the time budget stops when the assigned time-slice becomes smaller than the overhead of swapping out different runs. In this way, the number of runs with different seeds remains finite, but increases over time, as needed. Figure 5 illustrates the effect of the restart strategy, for different values of $p$, on the example of a multi-modal benchmark function (see section 6.2 for details), where most runs get caught in local optima. Whenever used in the rest of the paper, the fraction $p$ is kept at a single fixed value.
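As a sketch, the recursive subdivision can be turned into an explicit schedule of time slices; the parameter values here are illustrative:

    def restart_schedule(total_budget, p=0.2, overhead=1000):
        # Run i receives a fraction p*(1-p)**(i-1) of the total budget; stop
        # splitting once a slice would drop below the run-swapping overhead,
        # so the number of runs stays finite.
        slices, remaining = [], total_budget
        while p * remaining >= overhead:
            slices.append(p * remaining)
            remaining -= p * remaining
        return slices

    # restart_schedule(1e6) -> time budgets for the 1st, 2nd, ... run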

Let $p_s(t)$ be the success probability of the underlying search algorithm at time $t$. Here, time is measured by the number of generations or fitness evaluations. Accordingly, let $p_s^p(t)$ be the boosted success probability of the restart scheme with parameter $p$. Approximately, that is, assuming continuous time, the probabilities are connected by the formula

$$p_s^p(t) = 1 - \prod_{i=1}^{\infty}\left[1 - p_s\!\left(p(1-p)^{i-1}\, t\right)\right].$$

Two qualitative properties of the restart strategy can be deduced from this formula, even if in discrete time the product runs over only the finitely many runs that fit into the budget, and the times need to be rounded:

  • If there exists $t_0$ such that $p_s(t_0) = 1$, then $p_s^p(t) = 1$ for all $t \ge t_0/p$.

  • For sufficiently small $t$, we have $p_s^p(t) \le p_s(t)$.

This captures the expected effect that the restart strategy results in an initial slowdown, but eventually solves the problem reliably.

Figure 5: Illustrating the effect of different restart strategies. Plotted is the cumulative empirical success probability, as a function of the total number of evaluations used. Using no restarts, corresponding to $p = 1$ (green), is initially faster but unreliable, whereas $p = 0.1$ (dotted black) reliably finds the optimum within 300000 evaluations, but at the cost of slowing down success for all runs by a factor of 10. In between these extremes, an intermediate setting of $p$ (broken line, red) trades off slowdown and reliability.

4 Techniques for Multinormal Distributions

In this section we will describe two crucial techniques to enhance performance of the NES algorithm as applied to multinormal distributions. First, the method of exponential parameterization is introduced, guaranteeing that the covariance matrix stays positive-definite. Second, a novel method for changing the coordinate system into a “natural” one is laid out, making the algorithm computationally efficient.

4.1 Using Exponential Parameterization

Gradient steps on the covariance matrix $\Sigma$ result in a number of technical problems. When updating $\Sigma$ directly with the gradient step $\eta\, \nabla_\Sigma J$, we have to ensure that the result remains a valid, positive definite covariance matrix. This is not guaranteed a priori, because the (natural) gradient $\nabla_\Sigma J$ may be any symmetric matrix. If we instead update a factor $A$ of $\Sigma$, it is at least ensured that $A^\top A$ is symmetric and positive semi-definite. But when shrinking an eigenvalue of $A$, it may happen that the gradient step swaps the sign of the eigenvalue, resulting in undesired oscillations.

An elegant way to fix these problems is to represent the covariance matrix using the exponential map for symmetric matrices (see, e.g., Glasmachers and Igel, 2005 for a related approach). Let

$$\mathcal{S}_d := \{ M \in \mathbb{R}^{d \times d} \mid M^\top = M \}$$

and

$$\mathcal{P}_d := \{ M \in \mathcal{S}_d \mid M \text{ positive definite} \}$$

denote the vector space of symmetric and the (cone) manifold of symmetric positive definite matrices, respectively. Then the exponential map

$$\exp : \mathcal{S}_d \to \mathcal{P}_d, \qquad M \mapsto \sum_{n=0}^{\infty} \frac{M^n}{n!} \qquad (9)$$

is a diffeomorphism: The map is bijective, and both $\exp$ as well as its inverse map $\log : \mathcal{P}_d \to \mathcal{S}_d$ are smooth. The mapping can be computed in cubic time, for example by decomposing the matrix $M = U D U^\top$ into orthogonal $U$ (eigen-vectors) and diagonal $D$ (eigen-values), taking the exponential of $D$ (which amounts to taking the element-wise exponentials of the diagonal entries), and composing everything back as $\exp(M) = U \exp(D)\, U^\top$. (The same computation works for the logarithm, and thus also for powers $M^\alpha = \exp(\alpha \log M)$ for all $\alpha \in \mathbb{R}$, for example for the unique square root, $\alpha = \frac{1}{2}$.)
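In code, this eigendecomposition-based exponential is a few lines of Python (our sketch):

    import numpy as np

    def expm_sym(M):
        # exp(M) = U exp(D) U^T for symmetric M = U diag(D) U^T; cubic time.
        D, U = np.linalg.eigh(M)
        return (U * np.exp(D)) @ U.T

    # Any symmetric matrix maps to a valid positive definite covariance:
    M = np.random.randn(5, 5); M = 0.5 * (M + M.T)
    assert np.all(np.linalg.eigvalsh(expm_sym(M)) > 0)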


Figure 6: Left: updating a covariance matrix $\Sigma$ directly can end up outside the manifold of symmetric positive-definite matrices $\mathcal{P}_d$. Right: first performing the update on $M$ in $\mathcal{S}_d$ and then mapping the result back into the original space using the exponential map is both safe (guaranteed to stay in the manifold) and straight (the update follows a geodesic).

Thus, we can represent the covariance matrix as $\Sigma = \exp(M)$ with $M \in \mathcal{S}_d$. The resulting gradient update for $M$ has two important properties: First, because $\mathcal{S}_d$ is a vector space, any update automatically corresponds to a valid covariance matrix. (The tangent bundle of the manifold $\mathcal{P}_d$ is globally trivial, each tangent space being isomorphic to $\mathcal{S}_d$.) Thus, arbitrarily large gradient steps are meaningful in this representation. Second, the update of $M$ makes the gradient invariant w.r.t. linear transformations of the search space $\mathbb{R}^d$. This follows from an information geometric perspective, viewing $\mathcal{P}_d$ as the Riemannian parameter manifold equipped with the Fisher information metric. The invariance property is a direct consequence of the Cartan-Hadamard theorem (Cartan, 1928). See also Figure 6 for an illustration.

However, the exponential parameterization considerably complicates the computation of the Fisher information matrix $\mathbf{F}$, which now involves partial derivatives of the matrix exponential (9). Each such partial derivative can be computed in cubic time according to (Najfeld and Havel, 1994), but with $O(d^2)$ parameters this results in an unacceptably high complexity for the computation of the Fisher matrix.

4.2 Using Natural Coordinates

In this section we describe a technique that allows us to avoid the computation of the Fisher information matrix altogether, for some specific but common classes of distributions. The idea is to iteratively change the coordinate system in such a way that it becomes possible to follow the natural gradient without any costly inversion of the Fisher matrix (actually, without even constructing it explicitly). We introduce the technique for the simplest case of multinormal search distributions, and in section 5.2 we generalize it to the whole class of distributions to which it is applicable (namely, rotationally symmetric distributions).

Instead of using the ‘global’ coordinates $\Sigma = \exp(M)$ for the covariance matrix, we linearly transform the coordinate system in each iteration to a coordinate system in which the current search distribution is the standard normal distribution with zero mean and unit covariance. Let the current search distribution be given by $(\mu, A)$ with $A^\top A = \Sigma$. We use the tangent space of the parameter manifold, which is isomorphic to the vector space $\mathbb{R}^d \times \mathcal{S}_d$, to represent the updated search distribution as

$$(\delta, M) \mapsto (\mu_{\text{new}}, A_{\text{new}}) = \left(\mu + A^\top \delta,\; A \exp\!\left(\tfrac{1}{2}M\right)\right). \qquad (10)$$

This coordinate system is natural in the sense that the Fisher matrix w.r.t. an orthonormal basis of $\mathbb{R}^d \times \mathcal{S}_d$ is the identity matrix. The current search distribution is encoded as $(\delta, M) = (0, 0)$, where at each step we change the coordinates such that this property holds. In this case, it is guaranteed that for the variables $(\delta, M)$ the plain gradient and the natural gradient coincide ($\mathbf{F} = I$). Consequently the computation of the natural gradient costs $O(d^3)$ operations.

In the new coordinate system we produce standard normal samples $s \sim \mathcal{N}(0, I)$, which are then mapped back into the original coordinates $z = \mu + A^\top s$. The log-density becomes

$$\log \pi(z \,|\, \delta, M) = -\frac{1}{2}\operatorname{tr}(M) - \frac{1}{2}\left\| \exp\!\left(-\tfrac{1}{2}M\right)(s - \delta) \right\|^2 + \text{const}, \qquad s = A^{-\top}(z - \mu),$$

and thus the log-derivatives (at $\delta = 0$, $M = 0$) take the following, surprisingly simple forms:

$$\nabla_\delta \log \pi(z \,|\, \delta, M) = s \qquad (11)$$

$$\nabla_M \log \pi(z \,|\, \delta, M) = \frac{1}{2}\left(s s^\top - I\right) \qquad (12)$$

These results give us the updates in the natural coordinate system

$$\nabla_\delta J = \sum_{k=1}^{\lambda} u_k \cdot s_k \qquad (13)$$

$$\nabla_M J = \sum_{k=1}^{\lambda} u_k \cdot \frac{1}{2}\left(s_k s_k^\top - I\right) \qquad (14)$$

which are then mapped back onto $(\mu, A)$ using equation (10):

$$\mu_{\text{new}} = \mu + \eta_\delta \cdot A^\top \nabla_\delta J, \qquad A_{\text{new}} = A \exp\!\left(\tfrac{1}{2}\eta_M \nabla_M J\right).$$

4.3 Orthogonal Decomposition of Multinormal Parameter Space

We decompose the parameter vector space $\mathbb{R}^d \times \mathcal{S}_d$ into the product

$$\mathbb{R}^d \times \mathcal{S}_d = \mathbb{R}^d \times (\mathbb{R} \cdot I) \times \mathcal{S}_\perp \qquad (15)$$

of orthogonal subspaces. The one-dimensional space $\mathbb{R} \cdot I$ is spanned by the identity matrix $I$, and $\mathcal{S}_\perp = \{ M \in \mathcal{S}_d \mid \operatorname{tr}(M) = 0 \}$ denotes its orthogonal complement in $\mathcal{S}_d$. The different components have roles with clear interpretations: The $\delta$-component describes the update of the center of the search distribution, the $(\mathbb{R} \cdot I)$-component with value $c \cdot I$ for $c \in \mathbb{R}$ has the role of a step size update, which becomes clear from the identity $A \exp(\tfrac{1}{2} c\, I) = e^{c/2} A$, and the $\mathcal{S}_\perp$-component describes the update of the transformation matrix, normalized to unit determinant, which can thus be attributed to the shape of the search distribution. This decomposition is canonical in being the finest decomposition such that updates of its components result in invariance of the search algorithm under linear transformations of the search space.

On these subspaces we introduce independent learning rates $\eta_\delta$, $\eta_\sigma$, and $\eta_B$, respectively. For simplicity we also split the transformation matrix $A = \sigma \cdot B$ into the step size $\sigma = |\det(A)|^{1/d}$ and the normalized transformation matrix $B = A / \sigma$ with $\det(B) = 1$. Writing $\nabla_\sigma J = \operatorname{tr}(\nabla_M J)/d$ for the step-size component and $\nabla_B J = \nabla_M J - \nabla_\sigma J \cdot I$ for the traceless shape component of equation (14), the resulting update is

$$\mu_{\text{new}} = \mu + \eta_\delta \cdot \sigma B^\top \nabla_\delta J \qquad (16)$$

$$\sigma_{\text{new}} = \sigma \cdot \exp\!\left(\tfrac{1}{2}\eta_\sigma \cdot \nabla_\sigma J\right) \qquad (17)$$

$$B_{\text{new}} = B \cdot \exp\!\left(\tfrac{1}{2}\eta_B \cdot \nabla_B J\right) \qquad (18)$$

with $\nabla_M J$ from equation (14). In case of $\eta_\sigma = \eta_B$, in this case referred to as $\eta_A$, the updates (17) and (18) simplify to

$$A_{\text{new}} = A \cdot \exp\!\left(\tfrac{1}{2}\eta_A \cdot \nabla_M J\right). \qquad (19)$$

The resulting algorithm is called exponential NES (xNES), and shown in Algorithm 4. We also give the pseudocode for its hill-climber variant (see also section 4.5).

Updating the search distribution in the natural coordinate system is an alternative to the exponential parameterization (section 4.1) for making the algorithm invariant under linear transformations of the search space, which is then achieved in a direct and constructive way.

input : $f$, $\mu_{\text{init}}$, $A_{\text{init}}$
initialize $\sigma \leftarrow |\det(A)|^{1/d}$, $B \leftarrow A / \sigma$
repeat
       for $k = 1 \dots \lambda$ do
             draw sample $s_k \sim \mathcal{N}(0, I)$
             $z_k \leftarrow \mu + \sigma B^\top s_k$
             evaluate the fitness $f(z_k)$
       end for
       sort $\{(s_k, z_k)\}$ with respect to $f(z_k)$ and compute utilities $u_k$
       compute gradients $\nabla_\delta J \leftarrow \sum_k u_k \cdot s_k$, $\nabla_M J \leftarrow \sum_k u_k \cdot \frac{1}{2}(s_k s_k^\top - I)$
       $\nabla_\sigma J \leftarrow \operatorname{tr}(\nabla_M J)/d$, $\nabla_B J \leftarrow \nabla_M J - \nabla_\sigma J \cdot I$
       update parameters $\mu \leftarrow \mu + \eta_\delta \cdot \sigma B^\top \nabla_\delta J$
       $\sigma \leftarrow \sigma \cdot \exp(\frac{1}{2}\eta_\sigma \cdot \nabla_\sigma J)$
       $B \leftarrow B \cdot \exp(\frac{1}{2}\eta_B \cdot \nabla_B J)$
until stopping criterion is met
Algorithm 4 Exponential Natural Evolution Strategies (xNES), for multinormal distributions
input : $f$, $\mu_{\text{init}}$, $A_{\text{init}}$
evaluate the fitness $f(\mu)$
repeat
       draw sample $s \sim \mathcal{N}(0, I)$, $z \leftarrow \mu + A^\top s$
       evaluate the fitness $f(z)$
       calculate log-derivative $\nabla_M \log \pi(z|\theta) = \frac{1}{2}(s s^\top - I)$
       if $f(z) > f(\mu)$ then
             $\nabla_M J \leftarrow u_{\text{success}} \cdot \frac{1}{2}(s s^\top - I)$
             update mean $\mu \leftarrow z$
       else
             $\nabla_M J \leftarrow u_{\text{failure}} \cdot \frac{1}{2}(s s^\top - I)$
       end if
       $A \leftarrow A \cdot \exp(\frac{1}{2}\eta_A \cdot \nabla_M J)$
until stopping criterion is met
Algorithm 5 (1+1)-xNES, with success-dependent utility values $u_{\text{success}}$ and $u_{\text{failure}}$ (see section 4.5)
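Putting the pieces together, here is a compact Python sketch of the xNES main loop of Algorithm 4, reusing the utilities and expm_sym helpers from the sketches above; the learning-rate defaults are illustrative, whereas the paper derives dimension-dependent settings:

    import numpy as np

    def xnes(fitness, mu, A, lam=20, eta_delta=1.0, eta_sigma=0.1, eta_B=0.1,
             iterations=500):
        d = mu.size
        sigma = abs(np.linalg.det(A)) ** (1.0 / d)   # split A = sigma * B
        B = A / sigma
        u = utilities(lam)                           # rank-based utilities (section 3.1)
        for _ in range(iterations):
            S = np.random.randn(lam, d)              # natural-coordinate samples s_k
            Z = mu + sigma * (S @ B)                 # z_k = mu + sigma * B^T s_k
            order = np.argsort([-fitness(z) for z in Z])
            S = S[order]                             # best samples first
            G_delta = u @ S                                                  # eq. (13)
            G_M = 0.5 * (np.einsum('k,ki,kj->ij', u, S, S)
                         - np.sum(u) * np.eye(d))                            # eq. (14)
            G_sigma = np.trace(G_M) / d              # step-size component
            G_B = G_M - G_sigma * np.eye(d)          # traceless shape component
            mu = mu + eta_delta * sigma * (B.T @ G_delta)                    # eq. (16)
            sigma *= np.exp(0.5 * eta_sigma * G_sigma)                       # eq. (17)
            B = B @ expm_sym(0.5 * eta_B * G_B)                              # eq. (18)
        return mu

    # e.g. xnes(lambda z: -np.sum(z**2), mu=np.ones(5), A=np.eye(5))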

4.4 Connection to CMA-ES.

It has been noticed independently by Glasmachers et al. (2010a), and shortly afterwards by Akimoto et al. (2010), that the natural gradient updates of xNES and the strategy updates of the CMA-ES algorithm (Hansen and Ostermeier, 2001) are closely connected. However, since xNES does not feature evolution paths, this connection is restricted to the so-called rank-$\mu$-update (in the terminology of this study, rank-$\lambda$-update) of CMA-ES.

First of all, observe that xNES and CMA-ES share the same invariance properties. But more interestingly, although derived from different heuristics, their updates turn out to be nearly equivalent. A closer investigation of this equivalence promises synergies and new perspectives on the working principles of both algorithms. In particular, this insight shows that CMA-ES can be explained as a natural gradient algorithm, which may allow for a more thorough analysis of its updates, and xNES can profit from CMA-ES’s mature settings of algorithm parameters, such as search space dimension-dependent population sizes, learning rates and utility values.

Both xNES and CMA-ES parameterize the search distribution with three functionally different parameters for mean, scale, and shape of the distribution. xNES uses the parameters $\mu$, $\sigma$, and $B$, while the covariance matrix is represented as $\sigma^2 \cdot C$ in CMA-ES, where $C$ can be any positive definite symmetric matrix. Thus, the representation of the scale of the search distribution is shared between $\sigma$ and $C$ in CMA-ES, and the role of the additional parameter $\sigma$ is to allow for an adaptation of the step size on a faster time scale than the full covariance update. In contrast, the NES updates of the scale and shape parameters $\sigma$ and $B$ are properly decoupled.

Let us start with the updates of the center parameter $\mu$. The update (16) is very similar to the update of the center of the search distribution in CMA-ES, see (Hansen and Ostermeier, 2001). The utility values $u_k$ exactly take the role of the weights in CMA-ES, which assumes a fixed learning rate of one.

For the covariance matrix, the situation is more complicated. From equation (19) we deduce the update rule

$$\Sigma_{\text{new}} = A_{\text{new}}^\top A_{\text{new}} = \exp\!\left(\tfrac{1}{2}\eta_A \nabla_M J\right) \Sigma\, \exp\!\left(\tfrac{1}{2}\eta_A \nabla_M J\right)$$

for the covariance matrix, with learning rate $\eta_A$. This update is closely connected to the exponential parameterization of the natural coordinates in xNES, while CMA-ES is formulated in global linear coordinates. The connection of these updates can be shown either by applying the xNES update directly to the natural coordinates without the exponential parameterization, or by approximating the exponential map by its first order Taylor expansion. Akimoto et al. (2010) established the same connection directly in coordinates based on the Cholesky decomposition of $\Sigma$, see (Sun et al., 2009a, b). The arguably simplest derivation of the equivalence relies on the invariance of the natural gradient under coordinate transformations, which allows us to perform the computation, w.l.o.g., in natural coordinates (where $\Sigma = I$). We use the first order Taylor approximation of the matrix exponential to obtain

$$\exp\!\left(\eta_A \nabla_M J\right) \approx I + \eta_A \nabla_M J,$$

so the first order approximate update, mapped back into the original coordinates, yields

$$\Sigma_{\text{new}} = \left(1 - \frac{\eta_A}{2}\sum_{k=1}^{\lambda} u_k\right) \Sigma + \frac{\eta_A}{2}\sum_{k=1}^{\lambda} u_k\, (z_k - \mu)(z_k - \mu)^\top$$

with $z_k - \mu = A^\top s_k$, from which the connection to the CMA-ES rank-$\mu$-update is obvious (see Hansen and Ostermeier, 2001).

Finally, the updates of the global step size parameter $\sigma$ turn out to be identical in xNES and in CMA-ES without evolution paths.

Having highlighted the similarities, let us have a closer look at the differences between xNES and CMA-ES, of which there are mainly two. CMA-ES uses the well-established technique of evolution paths to smoothen out random effects over multiple generations. This technique is particularly valuable when working with minimal population sizes, which is the default for both algorithms. Thus, evolution paths are expected to improve stability; further interpretations have been provided by (Hansen and Ostermeier, 2001). However, the presence of evolution paths has the conceptual disadvantage that the state of the CMA-ES algorithm is not completely described by its search distribution. The other difference between xNES and CMA-ES is the exponential parameterization of the updates in xNES, which results in a multiplicative update equation for the covariance matrix, in contrast to the additive update of CMA-ES. We argue that, just like for the global step size $\sigma$, the multiplicative update of the covariance matrix is natural.

A valuable perspective offered by the natural gradient updates in xNES is the derivation of the updates of the center $\mu$, the step size $\sigma$, and the normalized transformation matrix $B$, all from the same principle of natural gradient ascent. In contrast, the updates applied in CMA-ES result from different heuristics for each parameter. Hence, it is even more surprising that the two algorithms are closely connected. This connection provides a post-hoc justification of the various heuristics employed by CMA-ES, and it highlights the consistency of the intuition behind these heuristics.

4.5 Elitism

The principle of the NES algorithm is to follow the natural gradient of expected fitness. This requires sampling the fitness gradient. Naturally, this amounts to what, within the realm of evolution strategies, is generally referred to as “comma-selection”, that is, updating the search distribution based solely on the current batch of “offspring” samples, disregarding older samples such as the “parent” population. This seems to exclude approaches that retain some of the best samples, like elitism, hill-climbing, or even steady-state selection (Goldberg, 1989). In this section we show that elitism (and thus a wide variety of selection schemes) is indeed compatible with NES. We exemplify this technique by deriving a NES algorithm with (1+1) selection, i.e., a hill-climber (see also Glasmachers et al., 2010a).

It is impossible to estimate any information about the fitness gradient from a single sample, since at least two samples are required to estimate even a finite difference. The (1+1) selection scheme indicates that this dilemma can be resolved by considering two distinguished samples, namely the elitist (or parent) and the offspring. Considering these two samples in the update is in principle sufficient for estimating the fitness gradient w.r.t. the parameters $\theta$.

Care needs to be taken when setting the algorithm parameters, such as learning rates and utility values. The extremely small population size of one indicates that learning rates should generally be small in order to ensure stability of the algorithm. Another guideline is the well-known observation (Rechenberg and Eigen, 1973) that a step size resulting in a success rate of roughly $1/5$ maximizes progress. This indicates that a self-adaptation strategy should increase the learning rate in case of too many successes, and decrease it when observing too few successes.

Let us consider the basic case of radial Gaussian search distributions

$$\pi(z \,|\, \mu, \sigma) = \frac{1}{(2\pi\sigma^2)^{d/2}} \exp\!\left(-\frac{\|z - \mu\|^2}{2\sigma^2}\right)$$

with parameters $\mu \in \mathbb{R}^d$ and $\sigma > 0$, corresponding to $A = \sigma \cdot I$. Let $s \sim \mathcal{N}(0, I)$ be a standard normally distributed vector; then we obtain the offspring as $z = \mu + \sigma s$, and the natural gradient components are