CoNES
We present a novel algorithm – convex natural evolutionary strategies (CoNES) – for optimizing high-dimensional blackbox functions by leveraging tools from convex optimization and information geometry. CoNES is formulated as an efficiently-solvable convex program that adapts the evolutionary strategies (ES) gradient estimate to promote rapid convergence. The resulting algorithm is invariant to the parameterization of the belief distribution. Our numerical results demonstrate that CoNES vastly outperforms conventional blackbox optimization methods on a suite of functions used for benchmarking blackbox optimizers. Furthermore, CoNES demonstrates the ability to converge faster than conventional blackbox methods on a selection of OpenAI's MuJoCo reinforcement learning tasks for locomotion.
Policy optimization in reinforcement learning (RL) can be posed as a blackbox optimization problem: given access to a “blackbox” in the form of a simulator or robot hardware, find a setting of policy parameters that maximizes rewards. This perspective has led to significant recent interest from the RL community towards scaling blackbox optimization methods and has catapulted the use of blackbox optimizers from low-dimensional hyperparameter tuning
[17, 24] to training deep neural networks (DNNs) with thousands of parameters [42, 12, 11, 30, 13, 32]. Despite these promising advances, the sample complexity of blackbox methods remains high and is the subject of ongoing research.

In this paper we study a class of blackbox optimization methods called evolutionary strategies (ES) [41, 42]. ES methods maintain a belief distribution on the domain of candidates. At each iteration, a batch of candidates is sampled from this distribution and their fitness is evaluated. These fitness scores are used to obtain a Monte-Carlo (MC) estimate of the loss function’s gradient with respect to the parameters of the belief distribution. In the domain of ES for RL, approaches that adapt the sampling rate from the belief distribution and reuse samples from previous iterations have been proposed to improve the sample complexity
[12, 11]. However, standard ES methods are not invariant to re-parameterizations of the belief distribution. Hence, the choice of belief parameterization (e.g., encoding the covariance as a symmetric positive definite matrix vs. a Cholesky decomposition) can affect the rate of convergence and cause undesirable behavior (e.g., oscillations) [48]. In contrast, ES techniques based on the natural gradient [4, 44, 48] are parameterization invariant and can demonstrate improved sample efficiency. However, these methods have not been thoroughly exploited in RL due to the difficulties in computing the natural gradient for high-dimensional problems; in particular, the challenging estimation of the Fisher information matrix is necessary for computing the natural gradient.

In this paper, we present a novel algorithm – convex natural evolutionary strategies (CoNES) – that leverages results on the natural gradient [4, 44, 48] from information geometry [5] and couples them with powerful tools from convex optimization (e.g., second-order cone programming [8] and geometric programming [7]) to promote rapid convergence. In particular, CoNES refines a crude gradient estimate by transforming it through a convex program that searches for the direction of steepest ascent in a KL-divergence ball around the current belief distribution. The relationship to natural evolutionary strategies (NES) [48] comes from the fact that the limiting solution of the KL-constrained optimization problem (as the “radius” of the KL-divergence ball shrinks to zero) corresponds to the natural gradient. However, in contrast to NES [48], CoNES circumvents the estimation of the Fisher information matrix by directly solving the convex KL-constrained optimization problem.
Furthermore, tuning the radius of the KL-divergence ball facilitates better alignment of the update direction with the update step size, yielding faster convergence than NES (which provides the steepest ascent direction for infinitesimal step lengths); see Fig. 1 for an illustration that demonstrates the importance of accounting for the step length when choosing the update direction.
Our theoretical results establish that CoNES is invariant to the parameterization of the belief distribution (e.g., encoding the covariance as a symmetric positive definite matrix or a Cholesky decomposition does not affect the solution of the CoNES optimization problem). Parameterization invariance ensures that we are working with the intrinsic mathematical object (i.e., the probability distribution) and that the specific encoding of these objects does not affect the outcome. Moreover, CoNES is agnostic to the method that generates the crude gradient estimate and can thus be potentially combined with various existing ES methods, such as [42, 12, 11]. Through our numerical results we demonstrate that CoNES vastly outperforms various conventional blackbox optimizers on a suite of 5000-dimensional benchmark functions for blackbox optimizers: Sphere, Rosenbrock, Rastrigin, and Lunacek. We also demonstrate the improved sample complexity achieved by CoNES on the following OpenAI MuJoCo RL tasks: HalfCheetah-v2, Walker2D-v2, Hopper-v2, and Swimmer-v2.

Blackbox optimization. Various engineering problems require optimizing systems for which the governing mechanisms are not explicitly known; e.g., system identification of complex physical systems [3] and mechanism design [6]. Blackbox optimization techniques such as Nelder-Mead [36], evolutionary strategies (ES) [41], simulated annealing [27], genetic algorithms [23], the cross-entropy method [15], and covariance matrix adaptation (CMA) [22] were developed to address such problems. Recently, the growing potential of these methods for training control policies with reinforcement learning [42, 32, 12, 11, 30, 13, 10, 18] has reignited interest in blackbox optimizers. In this paper, we will primarily consider the class of blackbox optimizers that fall under the purview of ES.

Evolutionary strategies for reinforcement learning. In RL tasks, the advantages of ES – high parallelizability, better robustness, and richer exploration – were first demonstrated in [42]. Spurred by these findings, a plethora of recent developments aimed at improving ES for RL have emerged, some of which include: explicit novelty search regularization to avoid local minima [13], robustification of ES and efficient re-use of prior rollouts [12], and adaptive sampling for the ES gradient estimate [11]. We remark that all the above papers focus on improving the ES MC gradient estimator. In contrast, this paper presents a method that refines the ES gradient estimate – regardless of where that estimate comes from – by solving a convex program.
Natural gradient. Our method is directly motivated by the concept of the natural gradient [5]. The application of the natural gradient in learning was initially pioneered in [4] and was later demonstrated to be effective for RL [25], deep learning with backpropagation [39], and blackbox optimization with ES [44, 48]. However, the latent potential of the natural gradient has not been completely realized due to the difficulty in estimating the Fisher information matrix. Much of the prior work employing the natural gradient has focused on efficient estimation or computation of the Fisher information matrix [49, 44, 39]. In contrast, CoNES does not work directly with the Fisher information matrix. Instead, we approximate the update direction by solving a convex program that maximizes the loss while being constrained to a KL-divergence ball around the current belief distribution; as the radius of the KL-divergence ball goes to zero, the limiting solution of this convex program corresponds to the natural gradient (see Proposition 1).

Trust-regions for blackbox optimization. Recent work on trust region methods for blackbox optimizers [30, 33, 1] performs updates on the belief distribution by optimizing the loss on a KL-divergence ball. However, [1, 33] perform the constrained optimization on a discretization of the belief distribution. The approach in [30] computes the KL-divergence for each dimension individually and bounds their maximum; the resulting optimization problem is approximated via a clipped surrogate objective similar to proximal policy optimization (PPO) [43]. In contrast, we exactly solve a KL-constrained problem whose solution approximates the natural gradient (as outlined above and formally discussed in Section 4.1) using powerful tools from convex optimization (e.g., second-order cone programming and geometric programming).
We denote a blackbox loss function by $f: \mathcal{X} \to \mathbb{R}$ with $\mathcal{X} \subseteq \mathbb{R}^n$ as its domain. Let $p$ be a distribution on the domain $\mathcal{X}$ that signifies our belief of where the optimal candidate for $f$ resides. We assume that $p$ belongs to the statistical manifold $\mathcal{M}$ [45], which is a Riemannian manifold [40] of probability distributions. Any point $p \in \mathcal{M}$ is expressed in the coordinates $\theta$ and written as $p_\theta$. Rather than optimizing $f$ directly, we will work with the loss function $F$, which provides the expected loss $\mathbb{E}_{x \sim p}[f(x)]$ under the belief distribution $p$. When referring to the manifold $\mathcal{M}$ in a coordinate-free setting, we express the loss as $F(p)$, whereas, when we work with a particular coordinate system on $\mathcal{M}$, we express the loss as $F(\theta)$; the abuse of notation creates no confusion as it will always be clear from context.
The (Euclidean) gradient operator is denoted by $\nabla$; the natural gradient operator is denoted by $\tilde{\nabla}$; and the solution of CoNES is denoted by $\delta\theta^\star$. The KL-divergence between two distributions $p$ and $q$ is denoted by $D_{\mathrm{KL}}(p \,\|\, q)$ and the Euclidean inner product between two vectors $u$ and $v$ is denoted by $\langle u, v \rangle$.

It is a commonly-held belief that the steepest ascent direction for a loss function $F$ is given by its gradient $\nabla F$. However, this is only true if the domain is expressed in an orthonormal coordinate system in a Euclidean space. If the space admits a Riemannian manifold [40] structure, the steepest ascent direction is then given by the natural gradient $\tilde{\nabla} F$ instead [5, Section 12.1.2]. Besides providing the steepest ascent direction on $\mathcal{M}$, the natural gradient possesses various attractive properties: (a) the natural gradient is independent of the choice of coordinates on the statistical manifold $\mathcal{M}$; (b) the natural gradient avoids saturation due to sigmoidal activation functions [5, Theorem 12.2]; (c) online natural-gradient learning is asymptotically Fisher efficient, i.e., it asymptotically attains the Cramér-Rao bound with equality [4]. These qualities lay the foundation of our interest in leveraging the natural gradient in learning applications. In the rest of this section we will present two explicit characterizations of the natural gradient relevant to this paper.

Let $\mathcal{I}_\theta$ be the Fisher information matrix for the Riemannian manifold of distributions $\mathcal{M}$ described in the coordinates $\theta$; e.g., Gaussian distributions can be expressed in the coordinates $\theta = (\mu, \mathrm{vec}(\text{upper-triangle}(\Sigma)))$, where $\mu$, $\Sigma$ denote the mean and the covariance, respectively. The natural gradient then satisfies the following relation with the Euclidean gradient:

$$\tilde{\nabla} F(\theta) = \mathcal{I}_\theta^{-1}\, \nabla F(\theta). \tag{1}$$
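As a small illustration of relation (1), the sketch below applies the inverse Fisher information matrix to a Euclidean gradient for a Gaussian with diagonal covariance. The coordinates `(mu, var)` with `var` the vector of variances, the function name, and the numbers are assumptions made for this example; in those coordinates the Fisher matrix is diagonal, so the inverse reduces to an elementwise rescaling.

```python
import numpy as np

def natural_gradient_diag_gauss(g_mu, g_var, var):
    """Apply relation (1) for a Gaussian with diagonal covariance.

    Assumed coordinates: theta = (mu, var), with var the vector of variances.
    In these coordinates the Fisher information matrix is diagonal, with
    entries 1 / var for mu and 1 / (2 * var**2) for var, so multiplying by
    its inverse is an elementwise scaling.
    """
    return var * g_mu, 2.0 * var**2 * g_var

# Hypothetical numbers, chosen only to illustrate the rescaling.
var = np.array([1.0, 4.0])
g_mu = np.array([1.0, 1.0])
g_var = np.array([1.0, 1.0])
nat_mu, nat_var = natural_gradient_diag_gauss(g_mu, g_var, var)
```

Coordinates with larger variance receive a proportionally larger step, which is the rescaling that the Euclidean gradient lacks.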
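For the second characterization of the natural gradient we will need the Fisher-Rao norm, defined for a vector $v$ as $\|v\|_{\mathrm{FR}} := \sqrt{v^\top \mathcal{I}_\theta\, v}$ [29, Definition 2]. Using this norm we can express the natural gradient as follows:

Proposition 1 [Adapted from [37, Proposition 1]]. Let $\mathcal{M}$ be a statistical manifold, each point of which is a probability distribution $p_\theta$ parameterized by $\theta$. Let $F: \mathcal{M} \to \mathbb{R}$ be a loss function which maps a probability distribution to a scalar. Then, the natural gradient of the loss function computed at any $p_\theta \in \mathcal{M}$ satisfies:

$$\frac{\tilde{\nabla} F(\theta)}{\|\tilde{\nabla} F(\theta)\|_{\mathrm{FR}}} = \lim_{\epsilon \to 0} \frac{1}{\epsilon} \arg\max_{\delta\theta} F(\theta + \delta\theta) \quad \text{s.t.} \quad D_{\mathrm{KL}}(p_{\theta + \delta\theta} \,\|\, p_\theta) \le \frac{\epsilon^2}{2}. \tag{2}$$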
Proposition 1 states that the natural gradient is aligned with the direction which maximizes the loss function in an infinitesimal KL-divergence ball around the current distribution $p_\theta$. To avoid confusion, it is worth clarifying that the maximization in Proposition 1 computes the natural gradient, which can then be passed to a gradient-based optimizer to minimize the loss.
Remark 1. Proposition 1 also holds true for the linear approximation of the loss function at $\theta$. Intuitively, the reason for this is that the linear approximation locally converges to the loss function for arbitrarily small $\epsilon$.
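The link between the two characterizations can be made concrete with a short worked derivation. It is a sketch that assumes the standard second-order expansion of the KL-divergence and uses the linearized loss; here $g$ stands for the Euclidean gradient of the loss at $\theta$, $\mathcal{I}_\theta$ for the Fisher information matrix, and $\lambda$ for a Lagrange multiplier introduced only for this example.

```latex
% Second-order expansion of the KL-divergence around \theta
D_{\mathrm{KL}}(p_{\theta+\delta\theta} \,\|\, p_\theta)
  \approx \tfrac{1}{2}\, \delta\theta^\top \mathcal{I}_\theta\, \delta\theta

% Linearized problem inside the infinitesimal KL ball
\max_{\delta\theta}\ \langle g, \delta\theta \rangle
\quad \text{s.t.} \quad
\tfrac{1}{2}\, \delta\theta^\top \mathcal{I}_\theta\, \delta\theta \le \tfrac{\epsilon^2}{2}

% Stationarity of the Lagrangian with multiplier \lambda > 0
g = \lambda\, \mathcal{I}_\theta\, \delta\theta
\ \Longrightarrow\
\delta\theta = \tfrac{1}{\lambda}\, \mathcal{I}_\theta^{-1} g

% The constraint is active, which fixes \lambda
\lambda = \tfrac{1}{\epsilon} \sqrt{g^\top \mathcal{I}_\theta^{-1} g}
\ \Longrightarrow\
\frac{\delta\theta^\star}{\epsilon}
  = \frac{\mathcal{I}_\theta^{-1} g}{\sqrt{g^\top \mathcal{I}_\theta^{-1} g}}
```

The denominator $\sqrt{g^\top \mathcal{I}_\theta^{-1} g}$ equals $\sqrt{v^\top \mathcal{I}_\theta v}$ for $v = \mathcal{I}_\theta^{-1} g$, so the right-hand side is the inverse-Fisher-scaled gradient normalized by its Fisher-weighted norm.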
The evolutionary strategies (ES) framework performs a Monte-Carlo estimate of the gradient of the loss with respect to the belief distribution [48, Section 2]:

$$\nabla F(\theta) = \mathbb{E}_{x \sim p_\theta}\big[ f(x)\, \nabla_\theta \ln p_\theta(x) \big] \approx \frac{1}{N} \sum_{k=1}^{N} f(x_k)\, \nabla_\theta \ln p_\theta(x_k), \qquad x_k \sim p_\theta. \tag{3}$$

This gradient estimate is then supplied to a gradient-based optimizer to update the belief distribution. Note that (3) provides an estimate of the Euclidean gradient. Instead of using the Euclidean gradient (3), Natural Evolutionary Strategies (NES) [48, 44] estimates the natural gradient by transforming the Euclidean gradient estimate (3) through (1).
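The Monte-Carlo estimate (3) can be sketched for a Gaussian belief with diagonal covariance as follows. This is a minimal illustration rather than the authors' implementation: the coordinates `(mu, var)`, the function name, and the linear test function are assumptions, and the antithetic sampling and rank-based fitness shaping used in the experiments are deliberately left out.

```python
import numpy as np

def es_gradient_diag_gauss(f, mu, var, num_samples, rng):
    """Score-function Monte-Carlo estimate of the gradient of E[f(x)].

    x ~ N(mu, diag(var)); returns the estimates with respect to mu and var.
    """
    x = mu + np.sqrt(var) * rng.standard_normal((num_samples, mu.size))
    fx = np.array([f(xi) for xi in x])                 # one blackbox query per sample
    diff = x - mu
    score_mu = diff / var                              # d log p / d mu
    score_var = 0.5 * (diff**2 / var**2 - 1.0 / var)   # d log p / d var
    g_mu = (fx[:, None] * score_mu).mean(axis=0)
    g_var = (fx[:, None] * score_var).mean(axis=0)
    return g_mu, g_var

# Hypothetical check on f(x) = sum(x), whose expected value is sum(mu).
rng = np.random.default_rng(0)
g_mu, g_var = es_gradient_diag_gauss(
    lambda x: x.sum(), np.zeros(2), np.ones(2), 100_000, rng
)
```

For this linear test function the exact gradient is a vector of ones with respect to `mu` and zeros with respect to `var`, so the estimates should land close to those values up to Monte-Carlo noise.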
Despite the various advantages offered by the natural gradient, the computationally expensive estimation of the Fisher information matrix and its inverse makes it difficult to scale to very high-dimensional problems. Proposition 1 offers an alternative to compute the natural gradient while obviating the need to estimate $\mathcal{I}_\theta$; however, (2) is a challenging non-convex optimization problem. To develop CoNES we “massage” (2) into an efficiently-solvable convex program.
We begin by relaxing the requirement $\epsilon \to 0$ and instead choosing a fixed $\epsilon > 0$, resulting in the following optimization problem (footnote 1: without loss of generality, we are replacing $\epsilon^2/2$ with $\epsilon$):

$$\delta\theta^\star := \arg\max_{\delta\theta} F(\theta + \delta\theta) \quad \text{s.t.} \quad D_{\mathrm{KL}}(p_{\theta + \delta\theta} \,\|\, p_\theta) \le \epsilon, \tag{4}$$

where $\epsilon$ is now a hyperparameter which can be as large as necessary. Using the solution $\delta\theta^\star$ of (4) as the update direction could yield faster convergence than the natural gradient $\tilde{\nabla} F(\theta)$. This may seem counter-intuitive because the natural gradient is the steepest ascent direction, as discussed in Section 4.1; however, it is worth noting that this holds true only for an infinitesimal step length. The flexibility of choosing an $\epsilon$ permits us to align the search for the steepest ascent direction with the desired step length of the update, yielding rapid convergence; see Fig. 1 for an illustration.
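We are interested in settings where the landscape of the loss function is unknown and querying loss values of individual candidates is expensive. Even if the analytical form of $F$ was available to us, (4) may be a non-convex problem and hence challenging to solve. To make this problem more tractable, we perform a first-order Taylor expansion of the loss function and work with the following optimization problem:

$$\max_{\delta\theta}\ F(\theta) + \langle \nabla F(\theta), \delta\theta \rangle \quad \text{s.t.} \quad D_{\mathrm{KL}}(p_{\theta + \delta\theta} \,\|\, p_\theta) \le \epsilon. \tag{5}$$

In (5), $F(\theta)$ is a constant offset which does not affect the choice of $\delta\theta$ and can hence be ignored. Further, we denote $g := \nabla F(\theta)$ and restate (5) as:

$$\max_{\delta\theta}\ \langle g, \delta\theta \rangle \quad \text{s.t.} \quad D_{\mathrm{KL}}(p_{\theta + \delta\theta} \,\|\, p_\theta) \le \epsilon. \tag{6}$$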
Despite these relaxations, the optimization problem (6) may still be intractable due to the lack of convexity of the feasible set. However, in the following theorem we establish for the Gaussian family of probability distributions that (6) is convex and can be solved in polynomial time.
Theorem 1. The optimization (6) is:
(a) a semidefinite program (SDP) with an additional exponential cone constraint if $\mathcal{M}$ is the space of Gaussian distributions;
(b) a second-order cone program (SOCP) with an additional exponential cone constraint if $\mathcal{M}$ is the space of Gaussian distributions with diagonal covariance.
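Proof. As the objective function of (6) is linear, we only need to verify the convexity of the feasible set. We will first consider the case when $\mathcal{M}$ is the space of Gaussian distributions. Let $p_\theta = \mathcal{N}(\mu, \Sigma)$ and $p_{\theta + \delta\theta} = \mathcal{N}(\mu', \Sigma')$. Then:

$$D_{\mathrm{KL}}(p_{\theta + \delta\theta} \,\|\, p_\theta) = \frac{1}{2} \Big[ \mathrm{tr}(\Sigma^{-1} \Sigma') + (\mu' - \mu)^\top \Sigma^{-1} (\mu' - \mu) - n - \ln\det\Sigma' + \ln\det\Sigma \Big], \tag{7}$$

which is convex in $(\mu', \Sigma')$ because $\mathrm{tr}(\Sigma^{-1} \Sigma')$ is linear, $(\mu' - \mu)^\top \Sigma^{-1} (\mu' - \mu)$ is positive-definite quadratic, and $-\ln\det\Sigma'$ is convex. Finally, noting that log-determinant constraints can be formulated as an SDP with an additional exponential cone constraint [35] completes the proof of this part.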
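Now we consider the family of Gaussian distributions $p_\theta$ and $p_{\theta + \delta\theta}$ with diagonal covariance. We denote the means as $\mu$ and $\mu'$. The diagonal elements of the covariances $\Sigma$ and $\Sigma'$ are expressed as $\sigma_i^2$ and $\sigma_i'^2$, respectively. Then, the KL-divergence between two distributions in this family is:

$$D_{\mathrm{KL}}(p_{\theta + \delta\theta} \,\|\, p_\theta) = \frac{1}{2} \sum_{i=1}^{n} \Big[ \frac{\sigma_i'^2}{\sigma_i^2} + \frac{(\mu_i' - \mu_i)^2}{\sigma_i^2} - 1 - \ln\frac{\sigma_i'^2}{\sigma_i^2} \Big]. \tag{8}$$

From (8), it follows that the problem (6) for this family of distributions is an SOCP with an additional exponential cone constraint (that arises from the $\ln$ terms), completing the proof. ∎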
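For the diagonal-covariance case, the linear-objective, KL-constrained problem can also be solved without a conic solver, which makes the structure easy to inspect. The sketch below is a swapped-in technique, not the CVXPY and MOSEK route used in the paper: it writes down the Lagrangian stationarity conditions in closed form and bisects on the multiplier until the KL constraint is active. The function names, the coordinates `(mu, var)`, and the example numbers are assumptions, and the gradient is assumed to be nonzero.

```python
import numpy as np

def kl_diag_gauss(mu_new, var_new, mu, var):
    """KL( N(mu_new, diag(var_new)) || N(mu, diag(var)) ) for diagonal Gaussians."""
    ratio = var_new / var
    return 0.5 * np.sum(ratio + (mu_new - mu) ** 2 / var - 1.0 - np.log(ratio))

def cones_direction_diag(g_mu, g_var, mu, var, eps, iters=100):
    """Maximize <g, delta> subject to KL <= eps for a diagonal Gaussian.

    Stationarity of the Lagrangian gives the step in closed form for a given
    multiplier lam; the KL value shrinks as lam grows, so bisection finds the
    multiplier at which the constraint is active.
    """
    def step(lam):
        d_mu = var * g_mu / lam
        var_new = var / (1.0 - 2.0 * var * g_var / lam)
        return d_mu, var_new - var

    def kl_at(lam):
        d_mu, d_var = step(lam)
        return kl_diag_gauss(mu + d_mu, var + d_var, mu, var)

    lam_min = max(0.0, 2.0 * np.max(var * g_var))  # lam above this keeps variances positive
    lo, hi = lam_min, lam_min + 1.0
    while kl_at(hi) > eps:                          # grow hi until the step is feasible
        hi = lam_min + 2.0 * (hi - lam_min)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl_at(mid) > eps:
            lo = mid
        else:
            hi = mid
    return step(hi)

# Hypothetical belief and gradient, used only to exercise the solver.
mu = np.zeros(3)
var = np.array([1.0, 2.0, 0.5])
g_mu = np.array([1.0, -2.0, 0.5])
g_var = np.array([0.3, -0.1, 0.2])
d_mu, d_var = cones_direction_diag(g_mu, g_var, mu, var, eps=1.0)
```

As the KL radius shrinks, the multiplier grows and the step approaches a scaled copy of `(var * g_mu, 2 * var**2 * g_var)`, the inverse-Fisher-scaled gradient for these coordinates; for larger radii the direction bends away from it.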
Restricting the class of belief distributions to those in Theorem 1 gives rise to CoNES: a family of convex programs that draws motivation from the concept of the natural gradient to transform the Euclidean gradient. To geometrically visualize CoNES, consider the illustration in Fig. 2. The orange surface is the loss landscape and the gray surface is the linearization of the loss at the current point $\theta$; in differential geometric terms, the orange surface is more accurately characterized as the manifold given by the graph of the loss while the gray surface is the manifold’s tangent space at $\theta$. The green arrow represents the solution of CoNES for a KL-divergence ball (light green region) with a very small $\epsilon$, which can also be regarded as the natural gradient (modulo the norm) at $\theta$ by Remark 1. The red arrow is the solution of CoNES for a KL-divergence ball (light red region) with a larger $\epsilon$. Note that this figure is an illustration; the KL-divergence balls may not necessarily manifest in the depicted shapes. The NES gradient is the steepest ascent direction for an infinitesimal step size, but it may not be ideal for a larger step size. With CoNES, we can tune the scalar parameter $\epsilon$ to better align the update direction with the gradient-based optimizer’s step size (learning rate), yielding faster updates. Indeed, the choice of $\epsilon$ is important to the performance of CoNES, as demonstrated in our numerical results in Section 7.2. The mechanism for selecting (or adapting) the hyperparameter $\epsilon$ is beyond the scope of this paper and will be explored in our future work.
The pseudo-code for our implementation of CoNES as a blackbox optimizer is detailed in Algorithm 1. We use the ES gradient estimate (presented in Section 4.2) as the Gradient-Estimator in Line 5 of Algorithm 1; any estimator of the Euclidean gradient, such as [12, 11], can be used here. We use Adam [26] as our gradient-based optimizer in Line 7; any gradient-based optimizer can be used.
An important property of the natural gradient is its independence of the parameterization of the belief distribution; e.g., for Gaussian distributions it does not matter whether we use the covariance matrix or its Cholesky decomposition. The natural gradient inherits this property by construction as the covariant gradient on the statistical manifold [5]. Parameterization invariance ensures that we are working with the intrinsic mathematical objects (probability distributions here) and that the specific encoding of these objects will not affect the outcome. From a practical perspective, we derive the benefit of fewer properties to “engineer”.
A natural question to ask is whether CoNES (Problem (6)) exhibits the same property. Proposition 1 ensures that the CoNES optimization exhibits this property in the limit of $\epsilon$ tending to zero, as the update direction then coincides with the natural gradient. However, establishing this property for arbitrary $\epsilon$ is not immediately obvious. The rest of this section is dedicated to formally demonstrating that CoNES does indeed exhibit this property.
We will work with the loss function $F$ rather than its linearization, with the understanding that if the parameterization invariance holds for an arbitrary function $F$, it will automatically hold for the linear function $\langle g, \delta\theta \rangle$ in (6). With a slight abuse of notation, we will express the loss function in the coordinates $\theta$ on the statistical manifold instead of the coordinate-free notation of $F(p)$. Now we are ready to present the main result of this section:
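Theorem 2. Consider the optimization problem:

$$\Theta^\star := \arg\max_{\delta\theta} F(\theta + \delta\theta) \quad \text{s.t.} \quad D_{\mathrm{KL}}(p_{\theta + \delta\theta} \,\|\, p_\theta) \le \epsilon. \tag{9}$$

Let $h$ be a smooth invertible mapping which performs a coordinate change from $\theta$ to $\phi = h(\theta)$. Consider the following optimization problem in the new coordinates (footnote 2: from a geometric perspective, $\theta$ and $\phi$ are coordinates on the statistical manifold $\mathcal{M}$, either of which can be used to express a distribution $p \in \mathcal{M}$; the directions $\delta\theta$ and $\delta\phi$ lie in the tangent space of $\mathcal{M}$ at $p$):

$$\Phi^\star := \arg\max_{\delta\phi} F(\phi + \delta\phi) \quad \text{s.t.} \quad D_{\mathrm{KL}}(p_{\phi + \delta\phi} \,\|\, p_\phi) \le \epsilon. \tag{10}$$

Then, there exists an invertible mapping $T$ such that $T(\Theta^\star) = \Phi^\star$, ensuring that the optimal values of (9) and (10) coincide.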
Theorem 2 shows that expressing the belief distribution in different coordinates $\theta$ or $\phi$ provides the same optimal loss and the same set of possible outcomes (up to a bijective mapping). Of course, we cannot ensure that the outcome, i.e., the $\arg\max$ of the CoNES optimization, is the same, due to the potential lack of uniqueness of the optima; e.g., for a loss that is radially symmetric about the initial point, all directions from the initial point are equally good.
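Intuitively, Theorem 2 holds because the KL-divergence is independent of the parameterization of the distribution [28, Corollary 4.1], i.e., for $h$, $\theta$, and $\phi$ as defined in Theorem 2, we have:

$$D_{\mathrm{KL}}(p_{\theta + \delta\theta} \,\|\, p_\theta) = D_{\mathrm{KL}}(p_{h(\theta + \delta\theta)} \,\|\, p_{h(\theta)}). \tag{11}$$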
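To formally prove Theorem 2, we will first establish two lemmas. The first lemma shows the existence of a bijective mapping between the directions $\delta\theta$ and $\delta\phi$.
Lemma 1. Let $h$, $\theta$, and $\phi$ be as defined in Theorem 2. Then, there exists a bijective mapping $T$ from directions $\delta\theta$ to directions $\delta\phi$, defined as

$$T(\delta\theta) := h(\theta + \delta\theta) - h(\theta). \tag{12}$$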
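Proof. First we will check the injectivity of $T$:

$$T(\delta\theta_1) = T(\delta\theta_2) \implies h(\theta + \delta\theta_1) = h(\theta + \delta\theta_2) \implies \delta\theta_1 = \delta\theta_2, \tag{13}$$

where the last implication uses the invertibility of $h$. Next, to check the surjectivity of $T$, let $\delta\phi$ be arbitrary. Then there exists $\delta\theta = h^{-1}(\phi + \delta\phi) - \theta$ which satisfies $T(\delta\theta) = \delta\phi$. ∎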
In the following remark, we express the result of Lemma 1 in a form that is more conducive to our forthcoming proof.
Remark 2. Let $S_\theta$ and $S_\phi$ be the feasible sets of (9) and (10), respectively. Let $T$ be defined as in Lemma 1. Then, $T(S_\theta) = S_\phi$.
Let , then there exists a such that . Therefore, Remark 2 ensures that , which further gives us:
(14) |
where the last equality follows from (11) and the inequality follows from the fact that . From (14) we have that implying that .
Now, let . By the surjectivity of from Lemma 1, there exists a such that . With this, Remark 2 ensures that . Hence, using (11), followed by gives:
(15) |
where the last inequality follows from the fact that . Therefore, by (15), we have that , which, on combining with the earlier assertion that implies that . Thereby, ensuring that and completing the proof. ∎
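The invariance argument above can be checked numerically on a small case. The sketch below assumes diagonal Gaussians, takes the coordinate change to be from variances to log-variances (a hypothetical choice of the smooth invertible map), and builds the step-to-step map in the spirit of Lemma 1; all names and numbers are illustrative.

```python
import numpy as np

def kl_diag_gauss(mu_new, var_new, mu, var):
    """KL( N(mu_new, diag(var_new)) || N(mu, diag(var)) ) for diagonal Gaussians."""
    ratio = var_new / var
    return 0.5 * np.sum(ratio + (mu_new - mu) ** 2 / var - 1.0 - np.log(ratio))

def h(mu, var):
    """Assumed coordinate change: (mu, var) -> (mu, log var)."""
    return mu, np.log(var)

def h_inv(mu, log_var):
    return mu, np.exp(log_var)

def T(mu, var, d_mu, d_var):
    """Step in the new coordinates matching a step in the old ones:
    h(theta + delta_theta) - h(theta)."""
    mu_phi, log_var = h(mu, var)
    mu_new, log_var_new = h(mu + d_mu, var + d_var)
    return mu_new - mu_phi, log_var_new - log_var

mu, var = np.array([0.0, 1.0]), np.array([1.0, 2.0])
d_mu, d_var = np.array([0.3, -0.2]), np.array([0.5, -0.4])

# KL of the step, computed in the original coordinates.
kl_theta = kl_diag_gauss(mu + d_mu, var + d_var, mu, var)

# Same step expressed in log-variance coordinates, then decoded back.
d_mu_phi, d_log_var = T(mu, var, d_mu, d_var)
mu_phi, log_var = h(mu, var)
mu_a, var_a = h_inv(mu_phi, log_var)
mu_b, var_b = h_inv(mu_phi + d_mu_phi, log_var + d_log_var)
kl_phi = kl_diag_gauss(mu_b, var_b, mu_a, var_a)
```

The two KL values agree even though the step itself looks different in the two coordinate systems, which is why a step that is feasible in one parameterization maps to a feasible step in the other.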
In this section, we use CoNES on two classes of problems: (a) a standard suite of high-dimensional loss functions used to benchmark blackbox optimizers, and (b) a selection of OpenAI Gym’s [9] MuJoCo [47] suite of RL tasks. We compare CoNES against existing methods including ES, natural evolutionary strategies (NES), and covariance matrix adaptation (CMA). We custom implemented ES, NES, and CoNES, while CMA is adapted directly from the open-source PyCMA package [19]; our code is accessible at: https://github.com/irom-lab/CoNES.

The family of Gaussian belief distributions with diagonal covariance is used for ES, NES, and CoNES. This family of belief distributions permits the implementation of NES exactly (i.e., without having to numerically estimate the Fisher information matrix [44]) for high-dimensional problems, serving as a strong baseline to compare CoNES against. For CMA, PyCMA’s default family of belief distributions – Gaussian distributions with non-diagonal covariance – is used. For ES, NES, and CoNES we compute an estimate of the gradient direction and pass it to the Adam optimizer [26] to update the belief distribution. For each of these methods we perform antithetic sampling and rank-based fitness transformation [42]. Unlike [42], we also update the variance of the belief distribution; we circumvent the non-negativity constraint on the variance by updating the log of the variance with the Adam optimizer instead. The resulting convex optimization problems for CoNES are solved using the CVXPY package [16] and the MOSEK solver [34].

We first test our approach on four 5000-dimensional functions: Sphere, Rosenbrock, Rastrigin, and Lunacek [21], which are provided in Appendix B. These functions are commonly-used benchmarks for blackbox optimization methods [20, 46]. Hyperparameters for ES, NES, and CoNES are shared across all problems (see Appendix A) while the hyperparameters of CMA are the default values chosen by PyCMA. Training for these benchmark functions was performed on a desktop with a 3.30 GHz Intel i9-7900X CPU with 10 cores and 32 GB RAM. Fig. 3 plots the average and standard deviation (shaded region) of the loss curves across 10 seeds. The rapid drop of the loss for CoNES demonstrates significant benefits in terms of the sample complexity over other methods. Fig. 4 shows that the step size for CoNES is smaller than that of ES and NES, which, coupled with its lower loss, implies that the update direction for CoNES is more accurate than that of ES and NES. The run-time for a single seed is 1 minute for ES and NES, 5 minutes for CoNES, and 35 minutes for CMA.

Next, we benchmark our approach on the following environments from the OpenAI Gym suite of RL problems: HalfCheetah-v2, Walker2D-v2, Hopper-v2, and Swimmer-v2. We employ a fully-connected neural network policy with tanh activations possessing one hidden layer with 16 neurons for Swimmer-v2 and 50 neurons for all other environments. The input to the policy is the agent’s state, which is normalized using a method similar to the one adopted by [32], and the output is a vector in the agent’s action space. The training for these tasks was performed on a c5.24xlarge instance on Amazon Web Services (AWS). Fig. 5 presents the average and standard deviation of the rewards for each RL task across 10 seeds against the number of time-steps interacted with the environment. Fig. 5 as well as Table 1 illustrate that CoNES performs well on all these tasks. For each environment we share the same hyperparameters (excluding $\epsilon$) between ES, NES, and CoNES; for CMA we use the default hyperparameters as chosen by PyCMA. It is worth pointing out that for RL tasks, CoNES demonstrates high sensitivity to the choice of $\epsilon$. The results for CoNES reported in Fig. 5 and Table 1 are for the best choice of $\epsilon$ among the candidate values tried. Exact hyperparameters for the problems are provided in Appendix A. Each seed of HalfCheetah-v2, Walker2D-v2, and Hopper-v2 takes 4-5 hours with ES, NES, and CoNES and 10 hours with CMA. Each seed of Swimmer-v2 takes 2 hours with ES, NES, and CoNES and 4 hours with CMA.

We presented convex natural evolutionary strategies (CoNES) for optimizing high-dimensional blackbox functions. CoNES combines the notion of the natural gradient from information geometry with powerful techniques from convex optimization (e.g., second-order cone programming and geometric programming). In particular, CoNES refines a gradient estimate by solving a convex program that searches for the direction of steepest ascent in a KL-divergence ball around the current belief distribution. We formally established that CoNES is invariant under transformations of the belief parameterization. Our numerical results on benchmark functions and RL examples demonstrate the ability of CoNES to converge faster than conventional blackbox methods such as ES, NES, and CMA.
Future Work. This paper raises numerous exciting future directions to explore. The performance of CoNES is dependent on the choice of the radius of the KL-divergence ball. Furthermore, a suitable choice of $\epsilon$ in one region of the loss landscape may not be suitable for another. Hence, an adaptive scheme for choosing the radius of the KL-divergence ball could substantially enhance the performance of CoNES. Another potentially fruitful future direction arises from the observation that Proposition 1 – which serves as the cornerstone of CoNES – holds for any $f$-divergence [14] (footnote 3: this is an outcome of the fact that the Hessian of all $f$-divergences is the Fisher information [31]). Hence, we can generalize CoNES to arbitrary $f$-divergences; this may afford greater flexibility in tuning it for the specific loss landscape and further improving performance. We can increase the flexibility afforded by CoNES even more by expanding beyond the family of Gaussian belief distributions. Finally, we are also exploring the empirical benefits of adaptively restricting the covariance matrix model [2, 11] in order to further enhance sample complexity.
The authors were supported by the Office of Naval Research [Award Number: N00014-18-1-2873], the Google Faculty Research Award, and the Amazon Research Award.
The parameters for the Adam optimizer were chosen according to [26, Algorithm 1] for all results in Section 7.
Benchmark Functions. For all the results in Section 7.1
the initial belief distribution is chosen to be the normal distribution
. The hyperparameters for ES, NES and CoNES were chosen as follows: the number of function evaluations performed per iteration is 100 and the learning rate for the mean and log of the variance is 0.1. Additionally, $\epsilon$ is set to 100 for CoNES.

Let $x \in \mathbb{R}^n$ be expressed in its coordinates as $x = (x_1, \dots, x_n)$.
Sphere:
Rosenbrock:
Rastrigin:
Lunacek: First define the constants
Using these constants the function can be expressed as .
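For readers who want to experiment, the first three benchmarks can be sketched in their common textbook forms. These definitions are assumptions: the exact variants, shifts, and scalings used in this appendix are not recoverable from the text above and may differ, and Lunacek is omitted because its constants are not given here.

```python
import numpy as np

def sphere(x):
    """Textbook Sphere function: sum of squares, minimum 0 at the origin."""
    return float(np.sum(x**2))

def rosenbrock(x):
    """Textbook Rosenbrock function, minimum 0 at the all-ones vector."""
    return float(np.sum(100.0 * (x[1:] - x[:-1] ** 2) ** 2 + (1.0 - x[:-1]) ** 2))

def rastrigin(x):
    """Textbook Rastrigin function, minimum 0 at the origin."""
    return float(10.0 * x.size + np.sum(x**2 - 10.0 * np.cos(2.0 * np.pi * x)))

# Example evaluation at a hypothetical 5000-dimensional starting point.
x0 = np.full(5000, 0.5)
values = {fn.__name__: fn(x0) for fn in (sphere, rosenbrock, rastrigin)}
```

Each function takes a one-dimensional NumPy array and returns a scalar, so it can be passed directly as the blackbox to a sampling-based optimizer.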