CoNES: Convex Natural Evolutionary Strategies

07/16/2020 ∙ by Sushant Veer, et al. ∙ Princeton University

We present a novel algorithm – convex natural evolutionary strategies (CoNES) – for optimizing high-dimensional blackbox functions by leveraging tools from convex optimization and information geometry. CoNES is formulated as an efficiently-solvable convex program that adapts the evolutionary strategies (ES) gradient estimate to promote rapid convergence. The resulting algorithm is invariant to the parameterization of the belief distribution. Our numerical results demonstrate that CoNES vastly outperforms conventional blackbox optimization methods on a suite of functions used for benchmarking blackbox optimizers. Furthermore, CoNES demonstrates the ability to converge faster than conventional blackbox methods on a selection of OpenAI's MuJoCo reinforcement learning tasks for locomotion.




1 Introduction

Policy optimization in reinforcement learning (RL) can be posed as a blackbox optimization problem: given access to a “blackbox” in the form of a simulator or robot hardware, find a setting of policy parameters that maximizes rewards. This perspective has led to significant recent interest from the RL community towards scaling blackbox optimization methods and has catapulted the use of blackbox optimizers from low-dimensional hyperparameter tuning [17, 24] to training deep neural networks (DNNs) with thousands of parameters [42, 12, 11, 30, 13, 32]. Despite these promising advances, the sample complexity of blackbox methods remains high and is the subject of ongoing research.

In this paper we study a class of blackbox optimization methods called evolutionary strategies (ES) [41, 42]. ES methods maintain a belief distribution on the domain of candidates. At each iteration, a batch of candidates is sampled from this distribution and their fitness is evaluated. These fitness scores are used to obtain a Monte-Carlo (MC) estimate of the loss function’s gradient with respect to the parameters of the belief distribution. In the domain of ES for RL, approaches that adapt the sampling rate from the belief distribution and reuse samples from previous iterations have been proposed to improve the sample complexity [12, 11]. However, standard ES methods are not invariant to re-parameterizations of the belief distribution. Hence, the choice of belief parameterization (e.g., encoding the covariance as a symmetric positive definite matrix vs. a Cholesky decomposition) can affect the rate of convergence and cause undesirable behavior (e.g., oscillations) [48]. In contrast, ES techniques based on the natural gradient [4, 44, 48] are parameterization invariant and can demonstrate improved sample efficiency. However, these methods have not been thoroughly exploited in RL due to the difficulties in computing the natural gradient for high-dimensional problems; in particular, computing the natural gradient requires the challenging estimation of the Fisher information matrix.

In this paper, we present a novel algorithm – convex natural evolutionary strategies (CoNES) – that leverages results on the natural gradient [4, 44, 48] from information geometry [5] and couples them with powerful tools from convex optimization (e.g., second-order cone programming [8] and geometric programming [7]) to promote rapid convergence. In particular, CoNES refines a crude gradient estimate by transforming it through a convex program that searches for the direction of steepest ascent in a KL-divergence ball around the current belief distribution. The relationship to natural evolutionary strategies (NES) [48] comes from the fact that the limiting solution of the KL-constrained optimization problem (as the “radius” of the KL-divergence ball shrinks to zero) corresponds to the natural gradient. However, in contrast to NES [48], CoNES circumvents the estimation of the Fisher information matrix by directly solving the convex KL-constrained optimization problem.

Figure 1: Illustration demonstrating the importance of accounting for the step length when choosing the update direction. At the current belief distribution, following the negative of the gradient direction (right) with the given step size increases the loss. Accounting for the step size while choosing the direction, we would instead go left and the loss would decrease.

Furthermore, tuning the radius of the KL-divergence ball facilitates better alignment of the update direction with the update step size, yielding faster convergence than NES (which provides the steepest ascent direction for infinitesimal step lengths); see Fig. 1 for an illustration that demonstrates the importance of accounting for the step length when choosing the update direction.

Our theoretical results establish that CoNES is invariant to the parameterization of the belief distribution (e.g., encoding the covariance as a symmetric positive definite matrix or a Cholesky decomposition does not affect the solution of the CoNES optimization problem). Parameterization invariance ensures that we are working with the intrinsic mathematical object (i.e., the probability distribution) and that the specific encoding of this object does not affect the outcome. Moreover, CoNES is agnostic to the method that generates the crude gradient estimate and can thus be potentially combined with various existing ES methods, such as [42, 12, 11]. Through our numerical results we demonstrate that CoNES vastly outperforms various conventional blackbox optimizers on a suite of 5000-dimensional benchmark functions for blackbox optimizers: Sphere, Rosenbrock, Rastrigin, and Lunacek. We also demonstrate the improved sample complexity achieved by CoNES on the following OpenAI MuJoCo RL tasks: HalfCheetah-v2, Walker2D-v2, Hopper-v2, and Swimmer-v2.

2 Related Work

Blackbox optimization. Various engineering problems require optimizing systems for which the governing mechanisms are not explicitly known; e.g., system identification of complex physical systems [3] and mechanism design [6]. Blackbox optimization techniques such as Nelder-Mead [36], evolutionary strategies (ES) [41], simulated annealing [27], genetic algorithms [23], the cross-entropy method [15], and covariance matrix adaptation (CMA) [22] were developed to address such problems. Recently, the growing potential of these methods for training control policies with reinforcement learning [42, 32, 12, 11, 30, 13, 10, 18] has reignited interest in blackbox optimizers. In this paper, we will primarily consider the class of blackbox optimizers that fall under the purview of ES.

Evolutionary strategies for reinforcement learning. In RL tasks, the advantages of ES – high parallelizability, better robustness, and richer exploration – were first demonstrated in [42]. Spurred by these findings, a plethora of recent developments aimed at improving ES for RL have emerged, some of which include: explicit novelty search regularization to avoid local minima [13], robustification of ES and efficient re-use of prior rollouts [12], and adaptive sampling for the ES gradient estimate [11]. We remark that all the above papers focus on improving the ES MC gradient estimator. In contrast, this paper presents a method that refines the ES gradient estimate – regardless of where that estimate comes from – by solving a convex program.

Natural gradient. Our method is directly motivated by the concept of the natural gradient [5]. The application of the natural gradient in learning was pioneered in [4] and was later demonstrated to be effective for RL [25], deep learning with backpropagation [39], and blackbox optimization with ES [44, 48]. However, the latent potential of the natural gradient has not been completely realized due to the difficulty in estimating the Fisher information matrix. Much of the prior work employing the natural gradient has focused on efficient estimation or computation of the Fisher information matrix [49, 44, 39]. In contrast, CoNES does not work directly with the Fisher information matrix. Instead, we approximate the update direction by solving a convex program that maximizes the loss while being constrained to a KL-divergence ball around the current belief distribution; as the radius of the KL-divergence ball goes to zero, the limiting solution of this convex program corresponds to the natural gradient (see Proposition 1).

Trust-regions for blackbox optimization. Recent work on trust region methods for blackbox optimizers [30, 33, 1] performs updates on the belief distribution by optimizing the loss on a KL-divergence ball. However, [1, 33] perform the constrained optimization on a discretization of the belief distribution. The approach in [30] computes the KL-divergence for each dimension individually and bounds their maximum; the resulting optimization problem is approximated via a clipped surrogate objective similar to proximal policy optimization (PPO) [43]. In contrast, we exactly solve a KL-constrained problem whose solution approximates the natural gradient (as outlined above and formally discussed in Section 4.1) using powerful tools from convex optimization (e.g., second-order cone programming and geometric programming).

3 Notation

We denote a blackbox loss function by $f: \mathcal{X} \to \mathbb{R}$ with $\mathcal{X}$ as its domain. Let $p$ be a distribution on the domain $\mathcal{X}$ that signifies our belief of where the optimal candidate for $f$ resides. We assume that $p$ belongs to the statistical manifold $\mathcal{M}$ [45], which is a Riemannian manifold [40] of probability distributions. Any point $p \in \mathcal{M}$ is expressed in the coordinates $\theta$. Rather than optimizing $f$ directly, we will work with the loss function $L(p) := \mathbb{E}_{x \sim p}[f(x)]$, which provides the expected loss under the belief distribution $p$. When referring to the manifold in a coordinate-free setting, we express the loss as $L(p)$, whereas, when we work with a particular coordinate system on $\mathcal{M}$, we express the loss as $L(\theta)$; the abuse of notation creates no confusion as it will always be clear from context.

The (Euclidean) gradient operator is denoted by $\nabla_\theta$; the natural gradient operator is denoted by $\tilde{\nabla}_\theta$; and the solution of CoNES is denoted by $\delta\theta^\star$. The KL-divergence between two distributions $p$ and $q$ is denoted by $D_{KL}(p \,\|\, q)$ and the Euclidean inner product between two vectors $u$ and $v$ is denoted by $\langle u, v \rangle$.


4 Background

4.1 Natural Gradient

It is a commonly-held belief that the steepest ascent direction for a loss function $L$ is given by its gradient $\nabla_\theta L$. However, this is only true if the domain is expressed in an orthonormal coordinate system in a Euclidean space. If the space admits a Riemannian manifold [40] structure, the steepest ascent direction is then given by the natural gradient $\tilde{\nabla}_\theta L$ instead [5, Section 12.1.2]. Besides providing the steepest ascent direction on $\mathcal{M}$, the natural gradient possesses various attractive properties: (a) the natural gradient is independent of the choice of coordinates on the statistical manifold $\mathcal{M}$; (b) the natural gradient avoids saturation due to sigmoidal activation functions [5, Theorem 12.2]; (c) online natural-gradient learning is asymptotically Fisher efficient, i.e., it asymptotically attains the Cramér-Rao bound with equality [4]. These qualities lay the foundation of our interest in leveraging the natural gradient in learning applications. In the rest of this section we will present two explicit characterizations of the natural gradient relevant to this paper.

Let $F_\theta$ be the Fisher information matrix for the Riemannian manifold $\mathcal{M}$ of distributions described in the coordinates $\theta$; e.g., Gaussian distributions can be expressed in the coordinates $\theta = [\mu^\top, \mathrm{vec}(\text{upper-triangle}(\Sigma))^\top]^\top$, where $\mu$ and $\Sigma$ denote the mean and the covariance, respectively. The natural gradient then satisfies the following relation with the Euclidean gradient:

$$\tilde{\nabla}_\theta L(\theta) = F_\theta^{-1} \nabla_\theta L(\theta). \tag{1}$$

For the second characterization of the natural gradient we will need the Fisher-Rao norm, defined as $\|v\|_{F_\theta} := \sqrt{\langle v, F_\theta v \rangle}$ [29, Definition 2]. Using this norm we can express the natural gradient as follows:

Proposition 1.

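[Adapted from [37, Proposition 1]] Let $\mathcal{M}$ be a statistical manifold, each point of which is a probability distribution $p_\theta$ parameterized by $\theta$. Let $L: \mathcal{M} \to \mathbb{R}$ be a loss function which maps a probability distribution to a scalar. Then, the natural gradient of the loss function computed at any $\theta$ satisfies:

$$\frac{\tilde{\nabla}_\theta L(\theta)}{\|\tilde{\nabla}_\theta L(\theta)\|_{F_\theta}} = \lim_{\epsilon \to 0} \frac{1}{\epsilon} \, \arg\max_{\delta\theta \,:\, D_{KL}(p_{\theta+\delta\theta} \| p_\theta) \le \epsilon^2/2} L(\theta + \delta\theta). \tag{2}$$

Proposition 1 states that the natural gradient is aligned with the direction which maximizes the loss function in an infinitesimal KL-divergence ball around the current distribution $p_\theta$. To avoid confusion, it is worth clarifying that the maximization in Proposition 1 computes the natural gradient, which can then be passed to a gradient-based optimizer to minimize the loss.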

Remark 1.

Proposition 1 also holds true for the linear approximation of the loss function at $\theta$. Intuitively, the reason for this is that the linear approximation locally converges to the loss function for arbitrarily small $\epsilon$.
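The claim in Remark 1 can be checked with a short calculation. The sketch below is not part of the original argument: it assumes the standard second-order expansion of the KL-divergence, $D_{KL}(p_{\theta+\delta\theta} \| p_\theta) = \tfrac{1}{2}\delta\theta^\top F_\theta \delta\theta + O(\|\delta\theta\|^3)$, writes $g := \nabla_\theta L(\theta)$, and introduces a Lagrange multiplier $\lambda$ only for this derivation.

```latex
\begin{align*}
&\max_{\delta\theta}\ \langle g, \delta\theta \rangle
  \quad \text{s.t.} \quad
  \tfrac{1}{2}\,\delta\theta^\top F_\theta\,\delta\theta \le \tfrac{\epsilon^2}{2} \\
&\text{Stationarity: } g = \lambda F_\theta\,\delta\theta
  \ \Rightarrow\ \delta\theta = \tfrac{1}{\lambda} F_\theta^{-1} g \\
&\text{Active constraint: } \delta\theta^\top F_\theta\,\delta\theta = \epsilon^2
  \ \Rightarrow\ \lambda = \tfrac{1}{\epsilon}\sqrt{g^\top F_\theta^{-1} g} \\
&\delta\theta
  = \epsilon\,\frac{F_\theta^{-1} g}{\sqrt{g^\top F_\theta^{-1} g}}
  = \epsilon\,\frac{\tilde{\nabla}_\theta L(\theta)}{\|\tilde{\nabla}_\theta L(\theta)\|_{F_\theta}}
\end{align*}
```

The last equality uses $\|F_\theta^{-1} g\|_{F_\theta}^2 = g^\top F_\theta^{-1} F_\theta F_\theta^{-1} g = g^\top F_\theta^{-1} g$. Dividing by $\epsilon$ recovers the normalized natural gradient, which is the limit stated in Proposition 1.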

4.2 Natural Evolutionary Strategies

The evolutionary strategies (ES) framework performs a Monte-Carlo estimate of the gradient of the loss with respect to the belief distribution [48, Section 2]:

$$\nabla_\theta L(\theta) = \mathbb{E}_{x \sim p_\theta}\left[ f(x) \nabla_\theta \log p_\theta(x) \right] \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i) \nabla_\theta \log p_\theta(x_i), \quad x_i \sim p_\theta. \tag{3}$$

This gradient estimate is then supplied to a gradient-based optimizer to update the belief distribution. Note that (3) provides an estimate of the Euclidean gradient. Instead of using the Euclidean gradient (3), Natural Evolutionary Strategies (NES) [48, 44] estimates the natural gradient by transforming the Euclidean gradient estimate (3) through (1).
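As a concrete illustration of the estimate (3) and of the NES transformation through (1), the following minimal sketch assumes a Gaussian belief with diagonal covariance, parameterized by its mean and per-dimension variances. In these coordinates the Fisher information matrix is known in closed form, so the natural gradient is an elementwise rescaling. The function names, the test loss, and the sample count are illustrative choices, not taken from the paper's code.

```python
import math
import random


def es_gradient(f, mu, var, num_samples, rng):
    """Score-function Monte-Carlo estimate of the Euclidean gradient of
    E_{x ~ N(mu, diag(var))}[f(x)] with respect to (mu, var)."""
    n = len(mu)
    g_mu = [0.0] * n
    g_var = [0.0] * n
    for _ in range(num_samples):
        x = [rng.gauss(m, math.sqrt(v)) for m, v in zip(mu, var)]
        fx = f(x)
        for i in range(n):
            d = x[i] - mu[i]
            # d/d mu_i  log N(x; mu, var) = (x_i - mu_i) / var_i
            g_mu[i] += fx * d / var[i]
            # d/d var_i log N(x; mu, var) = ((x_i - mu_i)^2 - var_i) / (2 var_i^2)
            g_var[i] += fx * (d * d - var[i]) / (2.0 * var[i] ** 2)
    return [g / num_samples for g in g_mu], [g / num_samples for g in g_var]


def natural_gradient(g_mu, g_var, var):
    """Apply F^{-1} for a diagonal Gaussian in (mu, var) coordinates,
    where F = diag(1 / var_i, 1 / (2 var_i^2))."""
    return ([v * g for v, g in zip(var, g_mu)],
            [2.0 * v * v * g for v, g in zip(var, g_var)])


def sphere(x):
    return sum(xi * xi for xi in x)


rng = random.Random(0)
mu, var = [1.0, -1.0], [0.25, 0.25]
g_mu, g_var = es_gradient(sphere, mu, var, 100_000, rng)
ng_mu, ng_var = natural_gradient(g_mu, g_var, var)
print(g_mu, g_var)
print(ng_mu, ng_var)
```

For the sphere loss the expected loss is $\sum_i (\mu_i^2 + \sigma_i^2)$, so the exact Euclidean gradient is $2\mu$ with respect to the mean and $1$ with respect to each variance; the Monte-Carlo estimate should land near these values, with noise that shrinks as the sample count grows.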

5 Convex Natural Evolutionary Strategies

Despite the various advantages offered by the natural gradient, the computationally expensive estimation of the Fisher information matrix and its inverse makes it difficult to scale to very high-dimensional problems. Proposition 1 offers an alternative way to compute the natural gradient while obviating the need to estimate $F_\theta$; however, (2) is a challenging non-convex optimization problem. To develop CoNES we “massage” (2) into an efficiently-solvable convex program.

We begin by relaxing the requirement $\epsilon \to 0$ and instead choosing a fixed $\epsilon > 0$, resulting in the following optimization problem (without loss of generality, we replace $\epsilon^2/2$ with $\epsilon$):

$$\max_{\delta\theta} \; L(\theta + \delta\theta) \quad \text{s.t.} \quad D_{KL}(p_{\theta+\delta\theta} \,\|\, p_\theta) \le \epsilon, \tag{4}$$

where $\epsilon$ is now a hyperparameter which can be as large as necessary. Using the solution of (4) as the update direction could yield faster convergence than the natural gradient $\tilde{\nabla}_\theta L(\theta)$. This may seem counter-intuitive because the natural gradient is the steepest ascent direction, as discussed in Section 4.1; however, it is worth noting that this holds true only for an infinitesimal step length. The flexibility of choosing $\epsilon$ permits us to align the search for the steepest ascent direction with the desired step length of the update, yielding rapid convergence; see Fig. 1 for an illustration.

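We are interested in settings where the landscape of the loss function is unknown and querying loss values of individual candidates is expensive. Even if the analytical form of $L$ were available to us, (4) may be a non-convex problem and hence challenging to solve. To make this problem more tractable, we perform a first-order Taylor expansion of the loss function and work with the following optimization problem:

$$\max_{\delta\theta} \; L(\theta) + \langle \nabla_\theta L(\theta), \delta\theta \rangle \quad \text{s.t.} \quad D_{KL}(p_{\theta+\delta\theta} \,\|\, p_\theta) \le \epsilon. \tag{5}$$

In (5), $L(\theta)$ is a constant offset which does not affect the choice of $\delta\theta$ and can hence be ignored. Further, we denote $g := \nabla_\theta L(\theta)$ and restate (5) as:

$$\max_{\delta\theta} \; \langle g, \delta\theta \rangle \quad \text{s.t.} \quad D_{KL}(p_{\theta+\delta\theta} \,\|\, p_\theta) \le \epsilon. \tag{6}$$

Despite these relaxations, the optimization problem (6) may still be intractable due to the lack of convexity of the feasible set. However, in the following theorem we establish for the Gaussian family of probability distributions that (6) is convex and can be solved in polynomial time.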

Theorem 1.

The optimization (6) is:

  • a semidefinite program (SDP) with an additional exponential cone constraint if $\mathcal{M}$ is the space of Gaussian distributions;

  • a second-order cone program (SOCP) with an additional exponential cone constraint if $\mathcal{M}$ is the space of Gaussian distributions with diagonal covariance.


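Proof. As the objective function of (6) is linear, we only need to verify the convexity of the feasible set. We will first consider the case when $\mathcal{M}$ is the space of Gaussian distributions on $\mathbb{R}^n$. Let $p_\theta = \mathcal{N}(\mu, \Sigma)$ and $p_{\theta+\delta\theta} = \mathcal{N}(\bar{\mu}, \bar{\Sigma})$. Then:

$$D_{KL}(p_{\theta+\delta\theta} \,\|\, p_\theta) = \frac{1}{2}\left[ \mathrm{tr}(\Sigma^{-1}\bar{\Sigma}) + (\bar{\mu}-\mu)^\top \Sigma^{-1} (\bar{\mu}-\mu) - n - \ln\det\bar{\Sigma} + \ln\det\Sigma \right], \tag{7}$$

which is convex in $(\bar{\mu}, \bar{\Sigma})$ because $\mathrm{tr}(\Sigma^{-1}\bar{\Sigma})$ is linear, $(\bar{\mu}-\mu)^\top \Sigma^{-1} (\bar{\mu}-\mu)$ is positive-definite quadratic, and $-\ln\det\bar{\Sigma}$ is convex. Finally, noting that $\ln\det$ constraints can be formulated as an SDP with an additional exponential cone constraint [35] completes the proof of this part.

Now we consider the family of Gaussian distributions $p_\theta$ and $p_{\theta+\delta\theta}$ with diagonal covariance. We denote the means as $\mu$ and $\bar{\mu}$. The diagonal elements of the covariances of $p_\theta$ and $p_{\theta+\delta\theta}$ are expressed as $\sigma_i^2$ and $\bar{\sigma}_i^2$, respectively. Then, the KL-divergence between two distributions in this family is:

$$D_{KL}(p_{\theta+\delta\theta} \,\|\, p_\theta) = \frac{1}{2} \sum_{i=1}^{n} \left[ \frac{\bar{\sigma}_i^2}{\sigma_i^2} + \frac{(\bar{\mu}_i - \mu_i)^2}{\sigma_i^2} - 1 - \ln\frac{\bar{\sigma}_i^2}{\sigma_i^2} \right]. \tag{8}$$

From (8), it follows that the problem (6) for this family of distributions is an SOCP with an additional exponential cone constraint (that arises from the $\ln$ terms), completing the proof. ∎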
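For the diagonal-covariance case, problem (6) is small enough to sketch without a conic solver. The code below is an illustrative alternative to a conic-solver formulation, not the authors' implementation: it assumes the belief is parameterized by its mean and per-dimension variances, uses the standard KL-divergence between diagonal Gaussians, solves the stationarity conditions of the Lagrangian in closed form for a multiplier `lam`, and bisects on `lam` until the KL constraint is active. The function names and the bisection scheme are assumptions made for this sketch.

```python
import math


def kl_diag_gauss(d_mu, new_var, var):
    """KL( N(mu + d_mu, diag(new_var)) || N(mu, diag(var)) )."""
    return 0.5 * sum(
        nv / v + dm * dm / v - 1.0 - math.log(nv / v)
        for dm, nv, v in zip(d_mu, new_var, var)
    )


def cones_step_diag(g_mu, g_var, var, eps, iters=100):
    """Maximize <g_mu, d_mu> + <g_var, new_var - var> subject to
    kl_diag_gauss(d_mu, new_var, var) <= eps. Returns (d_mu, new_var)."""
    if all(g == 0.0 for g in list(g_mu) + list(g_var)):
        return [0.0] * len(var), list(var)

    def candidate(lam):
        # Stationarity of the Lagrangian for a multiplier lam > 0:
        #   g_mu_i  = lam * d_mu_i / var_i
        #   g_var_i = (lam / 2) * (1 / var_i - 1 / new_var_i)
        d_mu = [v * g / lam for v, g in zip(var, g_mu)]
        new_var = [v / (1.0 - 2.0 * v * g / lam) for v, g in zip(var, g_var)]
        return d_mu, new_var

    # new_var stays positive only for lam > 2 * var_i * g_var_i for every i.
    lam_min = max(0.0, max(2.0 * v * g for v, g in zip(var, g_var)))
    lo, hi = lam_min, max(2.0 * lam_min, 1.0)
    # The KL value shrinks as lam grows, so first find a feasible multiplier.
    while kl_diag_gauss(*candidate(hi), var) > eps:
        hi *= 2.0
    # Bisect towards the multiplier at which the constraint is active.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid <= lo or mid >= hi:
            break
        if kl_diag_gauss(*candidate(mid), var) > eps:
            lo = mid
        else:
            hi = mid
    return candidate(hi)


d_mu, new_var = cones_step_diag([3.0, 4.0], [0.0, 0.0], [1.0, 1.0], eps=0.5)
print(d_mu, new_var)
```

In the usage line the gradient with respect to the variances is zero, so the problem reduces to maximizing a linear function over a Euclidean ball of squared radius $2\epsilon = 1$, and the step is the unit vector along the gradient, approximately `[0.6, 0.8]`. A full-covariance belief would need the log-determinant term of (7), for which a conic solver is the more natural tool.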

Figure 2: Geometric illustration of CoNES.

Restricting the class of belief distributions to those in Theorem 1 gives rise to CoNES: a family of convex programs that draws motivation from the concept of the natural gradient to transform the Euclidean gradient. To geometrically visualize CoNES, consider the illustration in Fig. 2. The orange surface is the loss landscape and the gray surface is the linearization of the loss at the point denoted by $\theta$; in differential geometric terms, the orange surface is more accurately characterized as the manifold given by the graph of the loss while the gray surface is the manifold’s tangent space at $\theta$. The green arrow represents the solution of CoNES for a KL-divergence ball (light green region) with a very small $\epsilon$, which can also be regarded as the natural gradient (modulo the norm) at $\theta$ by Remark 1. The red arrow is the solution of CoNES for a KL-divergence ball (light red region) with a larger $\epsilon$. Note that this figure is an illustration; the KL-divergence balls may not necessarily manifest in the depicted shapes. The NES gradient is the steepest ascent direction for an infinitesimal step size, but it may not be ideal for a larger step size. With CoNES, we can tune the scalar parameter $\epsilon$ to better align the update direction with the gradient-based optimizer’s step size (learning rate), yielding faster updates. Indeed, the choice of $\epsilon$ is important to the performance of CoNES, as demonstrated in our numerical results in Section 7.2. The mechanism for selecting (or adapting) the hyperparameter $\epsilon$ is beyond the scope of this paper and will be explored in our future work.

The pseudo-code for our implementation of CoNES as a blackbox optimizer is detailed in Algorithm 1. We use the ES gradient estimate (presented in Section 4.2) as the Gradient-Estimator in Line 5 of Algorithm 1; any estimator of the Euclidean gradient, such as [12, 11], can be used here. We use Adam [26] as our gradient-based optimizer in Line 7; any gradient-based optimizer can be used.

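1: Hyperparameters: radius $\epsilon$ of KL-divergence ball, number of candidates $N$ drawn at each iteration
2: Initialize: $\theta$, Optimizer
3: repeat
4:       Draw $N$ samples $x_1, \dots, x_N$ from the belief distribution $p_\theta$
5:       $g \leftarrow$ Gradient-Estimator($f$, $\theta$, $x_1, \dots, x_N$)
6:       $\delta\theta^\star \leftarrow$ CoNES($g$, $\theta$, $\epsilon$)    ▷ solve (6)
7:       $\theta \leftarrow$ Optimizer($\theta$, $\delta\theta^\star$)
8: until Termination conditions satisfied
Algorithm 1 CoNES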

6 Parameterization invariance of CoNES

An important property of the natural gradient is its independence of the parameterization of the belief distribution; e.g., for Gaussian distributions it does not matter whether we use the covariance matrix or its Cholesky decomposition. The natural gradient inherits this property by construction as the covariant gradient on the statistical manifold [5]. Parameterization invariance ensures that we are working with the intrinsic mathematical objects (probability distributions here) and that the specific encoding of these objects will not affect the outcome. From a practical perspective, we derive the benefit of fewer properties to “engineer”.
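The invariance of the natural gradient can be seen in a small numerical check. The sketch below is an illustration under assumptions not taken from the paper: it uses a one-dimensional Gaussian expressed either in (mean, variance) or in (mean, log-variance) coordinates, with the closed-form Fisher information for each, and made-up gradient values. A direction in variance coordinates is carried to log-variance coordinates by dividing its second component by the variance, since d(log var) = d(var) / var.

```python
import math


def natgrad_var(g_mu, g_var, var):
    """Natural gradient in (mu, var) coordinates: F = diag(1/var, 1/(2 var^2))."""
    return var * g_mu, 2.0 * var ** 2 * g_var


def natgrad_logvar(g_mu, g_logvar, var):
    """Natural gradient in (mu, log var) coordinates: F = diag(1/var, 1/2)."""
    return var * g_mu, 2.0 * g_logvar


var = 4.0
g_mu, g_var = 1.5, -0.2      # hypothetical Euclidean gradient in (mu, var)
g_logvar = var * g_var       # chain rule: dL/d(log var) = var * dL/d(var)

# Natural gradient computed in (mu, var), then expressed in (mu, log var).
nat_v = natgrad_var(g_mu, g_var, var)
nat_v_pushed = (nat_v[0], nat_v[1] / var)
# Natural gradient computed directly in (mu, log var).
nat_s = natgrad_logvar(g_mu, g_logvar, var)

# The same comparison for the Euclidean gradient.
euclid_pushed = (g_mu, g_var / var)
euclid_s = (g_mu, g_logvar)

natural_agrees = math.isclose(nat_v_pushed[1], nat_s[1])
euclidean_agrees = math.isclose(euclid_pushed[1], euclid_s[1])
print(natural_agrees, euclidean_agrees)
```

The two natural-gradient directions coincide, whereas the two Euclidean directions differ whenever the variance is not one, which is the dependence on parameterization that standard ES suffers from.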

A natural question to ask is whether CoNES (Problem (6)) exhibits the same property. Proposition 1 ensures that the CoNES optimization exhibits this property in the limit of $\epsilon$ tending to zero, as the update direction then coincides with the natural gradient. However, establishing this property for arbitrary $\epsilon$ is not immediately obvious. The rest of this section is dedicated to formally demonstrating that CoNES does indeed exhibit this property.

We will work with the loss function $L$ rather than its linearization with the understanding that if the parameterization invariance holds for an arbitrary function $L$, it will automatically hold for the linear function in (6). With a slight abuse of notation, we will express the loss function in the coordinates $\theta$ on the statistical manifold $\mathcal{M}$ as $L(\theta)$ instead of the coordinate-free notation $L(p)$. Now we are ready to present the main result of this section:

Theorem 2.

Consider the optimization problem:


Let be a smooth invertible mapping which performs a coordinate change from . Consider the following optimization problem in the new coordinates (from a geometric perspective, and are coordinates on the statistical manifold , either of which can be used to express a distribution ; the directions and lie in the tangent space of at ):


Then, there exists an invertible mapping such that , ensuring that .

Theorem 2 shows that expressing the belief distribution in different coordinates or provides the same optimal loss and the same set of possible outcomes (up to a bijective mapping). Of course, we cannot ensure that the outcome, i.e., the $\arg\max$ of the CoNES optimization, is the same due to the potential lack of uniqueness of the optima; e.g., consider the maximization of in initialized at – all directions from the initial point are equally good.

Intuitively, Theorem 2 holds because the KL-divergence is independent of the parameterization of the distribution [28, Corollary 4.1], i.e., for , , and as defined in Theorem 2, we have:


To formally prove Theorem 2, we will first establish two lemmas. The first lemma shows the existence of a bijective mapping between and .

Lemma 1.

Let , , and be as defined in Theorem 2. Then, there exists a bijective mapping , defined as


First we will check the injectivity of :


Next, to check the surjectivity of , let be arbitrary. Then there exists which satisfies . ∎

In the following remark, we express the result of Lemma 1 in a form that is more conducive to our forthcoming proof.

Remark 2.

Lemma 1 ensures that the following relation holds for any :

where the first equivalence relation holds by using the expression of (12) and the second equivalence relation holds from the bijectivity of .

Lemma 2.

Let and be the feasible sets of and , respectively. Let be defined as in Lemma 1. Then, .


Let , then there exists a such that . Therefore, Remark 2 ensures that , which further gives us:


where the last equality follows from (11) and the inequality follows from the fact that . From (14) we have that implying that .

Now, let . By the surjectivity of from Lemma 1, there exists a such that . With this, Remark 2 ensures that . Hence, using (11), followed by gives:


where the last inequality follows from the fact that . Therefore, by (15), we have that , which, on combining with the earlier assertion that implies that . Thereby, ensuring that and completing the proof. ∎

Proof of Theorem 2.

The proof follows from the following chain of arguments:


where (17) follows from Remark 2 (Lemma 1) and (18) follows from Lemma 2. Further, because from Remark 2, we get . ∎

7 Results

In this section, we use CoNES on two classes of problems: (a) a standard suite of high-dimensional loss functions used to benchmark blackbox optimizers, and (b) a selection of OpenAI Gym’s [9] MuJoCo [47] suite of RL tasks. We compare CoNES against existing methods including ES, natural evolutionary strategies (NES), and covariance matrix adaptation (CMA). We custom implemented ES, NES, and CoNES, while CMA is adapted directly from the open-source PyCMA package [19]; our code is accessible at:

The family of Gaussian belief distributions with diagonal covariance is used for ES, NES, and CoNES. This family of belief distributions permits the implementation of NES exactly (i.e., without having to numerically estimate the Fisher information matrix [44]) for high-dimensional problems, serving as a strong baseline to compare CoNES against. For CMA, PyCMA’s default family of belief distributions – Gaussian distributions with non-diagonal covariance – is used. For ES, NES, and CoNES we compute an estimate of the gradient direction and pass it to the Adam optimizer [26] to update the belief distribution. For each of these methods we perform antithetic sampling and rank-based fitness transformation [42]. Unlike [42], we also update the variance of the belief distribution; we circumvent the non-negativity constraint on the variance by updating the log of the variance with the Adam optimizer instead. The resulting convex optimization problems for CoNES are solved using the CVXPY package [16] and the MOSEK solver [34].
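The two variance-reduction devices mentioned above can be sketched in a few lines. The code below is a hedged illustration, not the implementation used for the experiments: it assumes perturbations are drawn as mirrored pairs of standard-normal vectors, and it uses a centered-rank transformation (ranks rescaled to the interval [-0.5, 0.5]) as one common form of rank-based fitness shaping. The function names are hypothetical.

```python
import random


def antithetic_perturbations(n, num_pairs, rng):
    """Return 2 * num_pairs standard-normal perturbations in mirrored pairs (+z, -z)."""
    out = []
    for _ in range(num_pairs):
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]
        out.append(z)
        out.append([-zi for zi in z])
    return out


def centered_ranks(fitness):
    """Replace raw fitness values by their ranks rescaled to [-0.5, 0.5].
    Requires at least two values."""
    order = sorted(range(len(fitness)), key=lambda i: fitness[i])
    ranks = [0.0] * len(fitness)
    for r, i in enumerate(order):
        ranks[i] = r / (len(fitness) - 1) - 0.5
    return ranks


rng = random.Random(0)
perturbations = antithetic_perturbations(n=3, num_pairs=4, rng=rng)
shaped = centered_ranks([10.0, -3.0, 250.0])
print(len(perturbations), shaped)
```

Mirrored pairs make the sampled perturbations sum to zero, which removes the contribution of any constant offset in the loss from the gradient estimate, and the rank transformation makes the estimate insensitive to the scale of the loss and to outliers such as the value 250 in the usage line.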

7.1 Benchmark Functions

We first test our approach on four 5000-dimensional functions: Sphere, Rosenbrock, Rastrigin, and Lunacek [21], which are provided in Appendix B. These functions are commonly-used benchmarks for blackbox optimization methods [20, 46]. Hyperparameters for ES, NES, and CoNES are shared across all problems (see Appendix A) while the hyperparameters of CMA are the default values chosen by PyCMA. Training for these benchmark functions was performed on a desktop with a 3.30 GHz Intel i9-7900X CPU with 10 cores and 32 GB RAM. Fig. 3 plots the average and standard deviation (shaded region) of the loss curves across 10 seeds. The rapid drop of the loss for CoNES demonstrates significant benefits in terms of sample complexity over the other methods. Fig. 4 shows that the step size for CoNES is smaller than that of ES and NES, which, coupled with its lower loss, implies that the update direction for CoNES is more accurate than that of ES and NES. The run-time for a single seed is 1 minute for ES and NES, 5 minutes for CoNES, and 35 minutes for CMA.

Figure 3: Average loss (solid curve) with standard deviation (shaded region) across 10 seeds for ES, NES, CMA, and CoNES on Sphere, Rosenbrock, Rastrigin, and Lunacek.
Figure 4: Average step size (solid) with standard deviation (shaded region) of the belief distribution’s mean across 10 seeds for ES, NES, CMA, and CoNES on Sphere, Rosenbrock, Rastrigin, and Lunacek.

7.2 Reinforcement Learning Tasks

Next, we benchmark our approach on the following environments from the OpenAI Gym suite of RL problems: HalfCheetah-v2, Walker2D-v2, Hopper-v2, and Swimmer-v2. We employ a fully-connected neural network policy with tanh activations possessing one hidden layer with 16 neurons for Swimmer-v2 and 50 neurons for all other environments. The input to the policy is the agent’s state, which is normalized using a method similar to the one adopted by [32], and the output is a vector in the agent’s action space. The training for these tasks was performed on a c5.24xlarge instance on Amazon Web Services (AWS). Fig. 5 presents the average and standard deviation of the rewards for each RL task across 10 seeds against the number of time-steps of interaction with the environment. Fig. 5 as well as Table 1 illustrate that CoNES performs well on all these tasks. For each environment we share the same hyperparameters (excluding $\epsilon$) between ES, NES, and CoNES; for CMA we use the default hyperparameters as chosen by PyCMA. It is worth pointing out that for RL tasks, CoNES demonstrates high sensitivity to the choice of $\epsilon$. The results for CoNES reported in Fig. 5 and Table 1 are for the best choice of $\epsilon$ from a fixed set of candidate values. Exact hyperparameters for the problems are provided in Appendix A. Each seed of HalfCheetah-v2, Walker2D-v2, and Hopper-v2 takes 4-5 hours with ES, NES, and CoNES and 10 hours with CMA. Each seed of Swimmer-v2 takes 2 hours with ES, NES, and CoNES and 4 hours with CMA.

Figure 5: Average reward (solid curve) with standard deviation (shaded region) across 10 seeds for ES, NES, CMA, and CoNES on HalfCheetah-v2, Walker2D-v2, Hopper-v2, and Swimmer-v2.

                                       # Timesteps to attain target average reward
Environments      Target Avg. Reward   ES      NES     CMA     CoNES
HalfCheetah-v2    3500
Walker2D-v2       2000
Hopper-v2         1400
Swimmer-v2        340

Table 1: Timesteps to attain a target average reward (over 10 seeds) for RL tasks. For each environment the timestep for the best performing blackbox method is displayed in bold. A hyphen ( – ) is used for a method that failed to achieve the target average reward within the timestep budget, which is set separately for Swimmer-v2 and for all other environments.

8 Conclusions and Future Work

We presented convex natural evolutionary strategies (CoNES) for optimizing high-dimensional blackbox functions. CoNES combines the notion of the natural gradient from information geometry with powerful techniques from convex optimization (e.g., second-order cone programming and geometric programming). In particular, CoNES refines a gradient estimate by solving a convex program that searches for the direction of steepest ascent in a KL-divergence ball around the current belief distribution. We formally established that CoNES is invariant under transformations of the belief parameterization. Our numerical results on benchmark functions and RL examples demonstrate the ability of CoNES to converge faster than conventional blackbox methods such as ES, NES, and CMA.

Future Work. This paper raises numerous exciting future directions to explore. The performance of CoNES is dependent on the choice of the radius of the KL-divergence ball. Furthermore, a suitable choice of $\epsilon$ in one region of the loss landscape may not be suitable for another. Hence, an adaptive scheme for choosing the radius of the KL-divergence ball could substantially enhance the performance of CoNES. Another potentially fruitful future direction arises from the observation that Proposition 1, which serves as the cornerstone of CoNES, holds for any $f$-divergence [14]; this is an outcome of the fact that the Hessian of all $f$-divergences is the Fisher information [31]. Hence, we can generalize CoNES to arbitrary $f$-divergences; this may afford greater flexibility in tuning it for the specific loss landscape and further improving performance. We can increase the flexibility afforded by CoNES even more by expanding beyond the family of Gaussian belief distributions. Finally, we are also exploring the empirical benefits of adaptively restricting the covariance matrix model [2, 11] in order to further enhance sample complexity.


The authors were supported by the Office of Naval Research [Award Number: N00014-18-1-2873], the Google Faculty Research Award, and the Amazon Research Award.


Appendix A Hyperparameters

The parameters for the Adam optimizer were chosen according to [26, Algorithm 1] for all results in Section 7.

Benchmark Functions. For all the results in Section 7.1 the initial belief distribution is chosen to be a normal distribution. The hyperparameters for ES, NES, and CoNES were chosen as follows: the number of function evaluations performed per iteration is 100 and the learning rate for the mean and the log of the variance is 0.1. Additionally, the KL-divergence radius $\epsilon$ is set to 100 for CoNES.

RL Tasks. The hyperparameters for ES, NES, and CoNES for the results in Section 7.2 are detailed in Table 2 below; some of these hyperparameters were borrowed from [38].

                  Initial Distribution   Learning Rate    # policies evaluated   # envs interacted
Environments      mean      std                           per itr (N)            per policy (m)
HalfCheetah-v2    0         0.02         0.01    0.01     40                     1
Walker2D-v2       0         0.02         0.01    0.01     40                     1
Hopper-v2         0         0.02         0.01    0.01     40                     1                  1
Swimmer-v2        0         1            0.5     0.1      40                     1                  10

Table 2: Hyperparameters for RL tasks.

Appendix B Benchmark functions

Let be expressed in its coordinates as .

  • Sphere:

  • Rosenbrock:

  • Rastrigin:

  • Lunacek: First define the constants

    Using these constants the function can be expressed as .
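For readers who want to experiment, the sketch below gives the textbook forms of these four functions in plain Python. These are assumptions drawn from the general blackbox-optimization literature and may differ from the exact variants used for the experiments (for example in shifts, scalings, or constants). In particular, the Lunacek constants `mu1 = 2.5` and `d = 1` follow a commonly used convention, and the formula for `s` requires at least two dimensions.

```python
import math


def sphere(x):
    return sum(xi * xi for xi in x)


def rosenbrock(x):
    return sum(
        100.0 * (x[i + 1] - x[i] ** 2) ** 2 + (1.0 - x[i]) ** 2
        for i in range(len(x) - 1)
    )


def rastrigin(x):
    return 10.0 * len(x) + sum(
        xi * xi - 10.0 * math.cos(2.0 * math.pi * xi) for xi in x
    )


def lunacek(x, mu1=2.5, d=1.0):
    """Double-funnel (bi-Rastrigin) function; assumes len(x) >= 2."""
    n = len(x)
    s = 1.0 - 1.0 / (2.0 * math.sqrt(n + 20.0) - 8.2)
    mu2 = -math.sqrt((mu1 ** 2 - d) / s)
    funnel1 = sum((xi - mu1) ** 2 for xi in x)
    funnel2 = d * n + s * sum((xi - mu2) ** 2 for xi in x)
    ripple = 10.0 * sum(
        1.0 - math.cos(2.0 * math.pi * (xi - mu1)) for xi in x
    )
    return min(funnel1, funnel2) + ripple


print(sphere([1.0, 2.0, 3.0]), rosenbrock([1.0] * 4),
      rastrigin([0.0] * 4), lunacek([2.5] * 4))
```

In these forms the global minimum value is zero in every case: at the origin for Sphere and Rastrigin, at the all-ones vector for Rosenbrock, and at the point with every coordinate equal to `mu1` for Lunacek.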


  • [1] A. Abdolmaleki, B. Price, N. Lau, L. P. Reis, and G. Neumann (2017) Deriving and improving CMA-ES with information geometric trust regions. In Proceedings of the Genetic and Evolutionary Computation Conference, pp. 657–664. Cited by: §2.
  • [2] Y. Akimoto and N. Hansen (2016) Projection-based restricted covariance matrix adaptation for high dimension. In Proceedings of the Genetic and Evolutionary Computation Conference 2016, pp. 197–204. Cited by: §8.
  • [3] S. Amaran, N. V. Sahinidis, B. Sharda, and S. J. Bury (2016) Simulation optimization: a review of algorithms and applications. Annals of Operations Research 240 (1), pp. 351–380. Cited by: §2.
  • [4] S. Amari (1998) Natural gradient works efficiently in learning. Neural computation 10 (2), pp. 251–276. Cited by: §1, §1, §2, §4.1.
  • [5] S. Amari (2016) Information Geometry And Its Applications. Vol. 194, Springer. Cited by: §1, §2, §4.1, §6.
  • [6] C. Audet and M. Kokkolaras (2016) Blackbox and derivative-free optimization: theory, algorithms and applications. Springer. Cited by: §2.
  • [7] S. Boyd, S. Kim, L. Vandenberghe, and A. Hassibi (2007) A tutorial on geometric programming. Optimization and Engineering 8 (1), pp. 67. Cited by: §1.
  • [8] S. Boyd and L. Vandenberghe (2004) Convex optimization. Cambridge University Press. Cited by: §1.
  • [9] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) OpenAI Gym. arXiv preprint arXiv:1606.01540. Cited by: §7.
  • [10] K. Chatzilygeroudis, R. Rama, R. Kaushik, D. Goepp, V. Vassiliades, and J. Mouret (2017) Black-box data-efficient policy search for robotics. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 51–58. Cited by: §2.
  • [11] K. M. Choromanski, A. Pacchiano, J. Parker-Holder, Y. Tang, and V. Sindhwani (2019) From complexity to simplicity: adaptive ES-active subspaces for blackbox optimization. In Advances in Neural Information Processing Systems, pp. 10299–10309. Cited by: §1, §1, §1, §2, §2, §5, §8.
  • [12] K. Choromanski, A. Pacchiano, J. Parker-Holder, Y. Tang, D. Jain, Y. Yang, A. Iscen, J. Hsu, and V. Sindhwani (2019) Provably robust blackbox optimization for reinforcement learning. arXiv:1903.02993. Cited by: §1, §1, §1, §2, §2, §5.
  • [13] E. Conti, V. Madhavan, F. P. Such, J. Lehman, K. Stanley, and J. Clune (2018) Improving exploration in evolution strategies for deep reinforcement learning via a population of novelty-seeking agents. In Advances in Neural Information Processing Systems, pp. 5027–5038. Cited by: §1, §2, §2.
  • [14] I. Csiszár and P. C. Shields (2004) Information Theory and Statistics: a Tutorial. Now Publishers Inc. Cited by: §8.
  • [15] P. De Boer, D. P. Kroese, S. Mannor, and R. Y. Rubinstein (2005) A tutorial on the cross-entropy method. Annals of Operations Research 134 (1), pp. 19–67. Cited by: §2.
  • [16] S. Diamond and S. Boyd (2016) CVXPY: a Python-embedded modeling language for convex optimization. Journal of Machine Learning Research 17 (83), pp. 1–5. Cited by: §7.
  • [17] D. Golovin, B. Solnik, S. Moitra, G. Kochanski, J. Karro, and D. Sculley (2017) Google vizier: a service for black-box optimization. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1487–1495. Cited by: §1.
  • [18] D. Ha (2019) Reinforcement learning for improving agent design. Artificial Life 25 (4), pp. 352–365. Cited by: §2.
  • [19] N. Hansen, Y. Akimoto, and P. Baudis (2019) CMA-ES/pycma on Github. Zenodo, DOI:10.5281/zenodo.2559634. Cited by: §7.
  • [20] N. Hansen, A. Auger, O. Mersmann, T. Tusar, and D. Brockhoff (2016) COCO: a platform for comparing continuous optimizers in a black-box setting. arXiv preprint arXiv:1603.08785. Cited by: §7.1.
  • [21] N. Hansen, S. Finck, R. Ros, and A. Auger (2009) Real-parameter black-box optimization benchmarking 2009: noiseless functions definitions. [Research Report] RR-6829, INRIA. Note: inria00362633v2 Cited by: §7.1.
  • [22] N. Hansen (2016) The CMA evolution strategy: a tutorial. arXiv preprint arXiv:1604.00772. Cited by: §2.
  • [23] J. H. Holland (1992) Adaptation in natural and artificial systems: an introductory analysis with applications to biology, control, and artificial intelligence. MIT Press. Cited by: §2.
  • [24] F. Hutter, L. Kotthoff, and J. Vanschoren (2019) Automated machine learning. Springer. Cited by: §1.
  • [25] S. M. Kakade (2002) A natural policy gradient. In Advances in Neural Information Processing Systems, pp. 1531–1538. Cited by: §2.
  • [26] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: Appendix A, §5, §7.
  • [27] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi (1983) Optimization by simulated annealing. Science 220 (4598), pp. 671–680. Cited by: §2.
  • [28] S. Kullback and R. A. Leibler (1951) On information and sufficiency. The Annals of Mathematical Statistics 22 (1), pp. 79–86. Cited by: §6.
  • [29] T. Liang, T. Poggio, A. Rakhlin, and J. Stokes (2017) Fisher-Rao metric, geometry, and complexity of neural networks. arXiv preprint arXiv:1711.01530. Cited by: §4.1.
  • [30] G. Liu, L. Zhao, F. Yang, J. Bian, T. Qin, N. Yu, and T. Liu (2019) Trust region evolution strategies. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4352–4359. Cited by: §1, §2, §2.
  • [31] A. Makur (2015) A study of local approximations in information theory. Ph.D. Thesis, Massachusetts Institute of Technology. Cited by: footnote 3.
  • [32] H. Mania, A. Guy, and B. Recht (2018) Simple random search provides a competitive approach to reinforcement learning. arXiv preprint arXiv:1803.07055. Cited by: §1, §2, §7.2.
  • [33] M. Miyashita, S. Yano, and T. Kondo (2018) Mirror descent search and its acceleration. Robotics and Autonomous Systems 106, pp. 107–116. Cited by: §2.
  • [34] MOSEK ApS (2019) MOSEK Fusion API for Python 9.0.84 (beta). Cited by: §7.
  • [35] MOSEK modeling cook-book: log-determinant. Cited by: §5.
  • [36] J. A. Nelder and R. Mead (1965) A simplex method for function minimization. The Computer Journal 7 (4), pp. 308–313. Cited by: §2.
  • [37] Y. Ollivier, L. Arnold, A. Auger, and N. Hansen (2017) Information-geometric optimization algorithms: a unifying picture via invariance principles. The Journal of Machine Learning Research 18 (1), pp. 564–628. Cited by: Proposition 1.
  • [38] P. Pagliuca, N. Milano, and S. Nolfi (2019) Efficacy of modern neuro-evolutionary strategies for continuous control optimization. arXiv preprint arXiv:1912.05239. Cited by: Appendix A.
  • [39] R. Pascanu and Y. Bengio (2013) Revisiting natural gradient for deep networks. arXiv preprint arXiv:1301.3584. Cited by: §2.
  • [40] P. Petersen, S. Axler, and K. Ribet (2006) Riemannian Geometry. Vol. 171, Springer. Cited by: §3, §4.1.
  • [41] I. Rechenberg and M. Eigen (1973) Evolutionsstrategie: optimierung technischer systeme nach prinzipien der biologischen evolution. Frommann-Holzboog Stuttgart. Cited by: §1, §2.
  • [42] T. Salimans, J. Ho, X. Chen, S. Sidor, and I. Sutskever (2017) Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864. Cited by: §1, §1, §1, §2, §2, §7.
  • [43] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §2.
  • [44] Y. Sun, D. Wierstra, T. Schaul, and J. Schmidhuber (2009) Efficient natural evolution strategies. In Proceedings of the Annual Conference on Genetic and evolutionary computation, pp. 539–546. Cited by: §1, §1, §2, §4.2, §7.
  • [45] M. Suzuki (2014) Information geometry and statistical manifold. arXiv preprint arXiv:1410.3369. Cited by: §3.
  • [46] O. Teytaud and J. Rapin (2018) Nevergrad: an open source tool for derivative-free optimization. Cited by: §7.1.
  • [47] E. Todorov, T. Erez, and Y. Tassa (2012) Mujoco: a physics engine for model-based control. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5026–5033. Cited by: §7.
  • [48] D. Wierstra, T. Schaul, T. Glasmachers, Y. Sun, J. Peters, and J. Schmidhuber (2014) Natural evolution strategies. The Journal of Machine Learning Research 15 (1), pp. 949–980. Cited by: §1, §1, §2, §4.2.
  • [49] Y. Wu, E. Mansimov, R. B. Grosse, S. Liao, and J. Ba (2017) Scalable trust-region method for deep reinforcement learning using kronecker-factored approximation. In Advances in Neural Information Processing Systems, pp. 5279–5288. Cited by: §2.