Evolution strategy (ES) is a class of blackbox optimization algorithms often used to optimize functions when gradient information is inaccessible. Let F : ℝ^d → ℝ be an objective function; we are interested in searching for a point with function value as small as possible, where the only available information is the function values evaluated at search points. In recent years, ES has been revived as a popular method in several machine learning problems, including optimizing neural networks (Cui et al., 2018) and reinforcement learning (Salimans et al., 2017; Fazel et al., 2018).
In this work, we are interested in a particular class of ES methods, which represent the population by multivariate Gaussian distributions. This type of ES can be viewed as optimizing a modified objective function with Gaussian smoothing (GS). The key step of GS-based evolution strategy is the computation of efficient search directions (most commonly, the gradient) based on the GS formula. In (Flaxman et al., 2005; Nesterov & Spokoiny, 2015), the authors studied a random optimization technique, where the search direction is randomly chosen and the updates are generated from an estimate of the directional derivative. It was shown in (Salimans et al., 2017) that the function evaluations and the estimation of directional derivatives can be performed in parallel to provide a gradient estimate at each search step. This scheme is simple and suitable for large-scale distributed optimization. There have been several attempts to improve the gradient estimate by carefully selecting the directions along which the gradient information is aggregated, for example, by using orthogonal instead of Monte Carlo directions (Choromanski et al., 2018), exploiting past data (Maheswaranathan et al., 2019; Meier et al., 2019), and employing active subspaces (Choromanski et al., 2019a).
One prominent challenge in blackbox optimization is that the landscape of the objective function is often complex and possesses plenty of local minima and saddle points. Any optimization algorithm, with or without access to full gradients, risks getting trapped in some of those points and stagnating. Gaussian smoothing, with its ability to smooth out the function and damp out small, insignificant fluctuations, is a strong candidate for this very task. Specifically, with a moderately large smoothing parameter (i.e., a strong smoothing effect), we can expect the gradient of the smoothed objective function to look past unimportant variation in the adjacent area and detect the general trend of the function from a distance, therefore providing an efficient nonlocal search direction. This potential of GS, however, has not been a focus of current works on ES, many of which concern improving the accuracy of local gradient estimators (Maheswaranathan et al., 2019; Meier et al., 2019; Berahas et al., 2019a; Choromanski et al., 2019b). Yet, the local gradient may be misleading (see Figure 1).
Another challenge in estimating the search direction arises from the high dimensionality of the objective function. Indeed, the common approach to compute the gradient of the Gaussian smoothed function is numerical integration, for which, roughly speaking, one needs function evaluations from a domain with diameter proportional to the smoothing parameter. By increasing this parameter, we also significantly enlarge the domain, especially in high-dimensional cases, and the number of function evaluations for each gradient estimate can become prohibitively large.
In this paper, we propose an enhancement to ES by introducing a directional Gaussian smoothing (DGS) gradient operator for determining a nonlocal search direction, which takes into account large-scale variation of the function and disregards local fluctuations. The DGS-gradient resembles the gradient of a Gaussian smoothed version of the objective function with a moderately large smoothing parameter. As computing this gradient involves a d-dimensional numerical integration and is very costly, the key idea of the DGS-gradient is to average the original objective function along d orthogonal directions to obtain the partial derivatives, and then aggregate the information of all the partial derivatives to assemble the full DGS-gradient. This strategy requires d one-dimensional numerical integrations, instead of one d-dimensional integration, thus significantly reducing the number of function evaluations. We call our method directional Gaussian smoothing ES (DGS-ES) in the rest of the paper.
To maintain high accuracy for each partial derivative estimate, we use the Gauss-Hermite quadrature rule in each direction. While the DGS-gradient can be seen as a perturbed version of the gradient of the Gaussian smoothed objective function, it is worth remarking that this operator may not form a gradient field of any function. We verify that in the local regime (i.e., small smoothing parameter), the DGS-gradient operator is consistent with the local gradient. Furthermore, we prove that our method has a theoretical convergence rate independent of dimension, superior to that of GS-based ES with Monte Carlo sampling. The main contributions of this work can be summarized as follows:
Development of the DGS-gradient operator and its Gauss-Hermite estimator, which introduces an effective nonlocal search direction generation approach into the family of evolution strategies.
Theoretical analysis verifying the scalability of the DGS-ES method, i.e., the number of iterations needed for convergence is independent of the dimension for strongly convex functions.
Demonstration of our DGS-ES method on both high-dimensional non-convex benchmark optimization problems and a real-world material design problem for rocket shell manufacture.
1.1 Related works
ES is an instance of derivative-free optimization that employs random search techniques to optimize an objective function (Rechenberg & Eigen, 1973; Schwefel, 1977). The search can be guided either via covariance matrix adaptation (Hansen & Ostermeier, 2001; Hansen, 2016), or via search gradients fed to first-order optimization schemes (Wierstra et al., 2014). Our work herein is in line with the second approach. Recent applications of ES can be found in reinforcement learning (Houthooft et al., 2018; Ha & Schmidhuber, 2018), meta-learning (Metz et al., 2018), optimizing neural network architectures (Real et al., 2017; Miikkulainen et al., 2017), and direct training of neural networks (Morse & Stanley, 2016). Gaussian smoothing emerged as a method for approximating the gradient as a search direction for ES (Flaxman et al., 2005; Nesterov & Spokoiny, 2015). This approach can be combined with trust region methods for optimization (Maggiar et al., 2018). A closely related approach is sphere smoothing (Flaxman et al., 2005). A comparison of different methods for gradient approximation, including finite differences, Gaussian smoothing and sphere smoothing, can be found in (Berahas et al., 2019b). Finally, for a general review of derivative-free optimization, we refer to (Rios & Sahinidis, 2009; Larson et al., 2019).
We are interested in solving the following blackbox optimization problem

min_{x ∈ ℝ^d} F(x),    (1)

where x ∈ ℝ^d consists of the tuning parameters, and F is a high-dimensional blackbox objective function. In this work, we assume that the gradient of F is not available, and that F is only accessible via function evaluations. Unless otherwise stated, we do not assume F is convex.
We briefly recall the class of ES methods, e.g., (Hansen & Ostermeier, 2001; Salimans et al., 2017), which use the multivariate Gaussian distribution to represent the population. In other words, these ES methods were developed based on the Gaussian smoothing (Flaxman et al., 2005; Nesterov & Spokoiny, 2015) of F in Eq. (1), i.e.,

F_σ(x) = E_{u ∼ N(0, I_d)} [ F(x + σ ∘ u) ],    (2)

where the original objective function F is smoothed via a convolution with a d-dimensional Gaussian distribution N(0, I_d), the notation ∘ represents the element-wise product, and the vector σ = (σ_1, …, σ_d) plays the role of a smoothing parameter. In general, the smoothed function F_σ has better properties than F: all the characteristics of F, e.g., convexity, the Lipschitz constant, are inherited by F_σ. Moreover, for any σ > 0, the function F_σ is always differentiable even if F is not. As the discrepancy between the gradients of F_σ and F can be bounded by the Lipschitz constant of F (see Lemma 3 of (Nesterov & Spokoiny, 2015) for details), the original optimization problem in Eq. (1) can be replaced by a smoothed version, i.e., min_x F_σ(x). The gradient of F_σ is given by

∇F_σ(x) = E_{u ∼ N(0, I_d)} [ F(x + σ ∘ u) (u ⊘ σ) ],    (3)

where ⊘ denotes the element-wise division.
The standard ES method refers to the algorithm that uses Monte Carlo sampling to estimate the gradient ∇F_σ(x) and update the state from iteration t to t+1 by

x_{t+1} = x_t − (λ / M) Σ_{m=1}^{M} F(x_t + σ ∘ u_m) (u_m ⊘ σ),

where λ is the learning rate and u_1, …, u_M are sampled from the Gaussian distribution N(0, I_d).
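For concreteness, the Monte Carlo gradient estimate and update above can be sketched as follows. This is a minimal illustration with a scalar smoothing parameter in place of the vector σ; the function and parameter names (`es_gradient`, `n_samples`) are ours, not from any ES library.

```python
import numpy as np

def es_gradient(F, x, sigma=0.1, n_samples=100, rng=None):
    """Monte Carlo estimate of the Gaussian-smoothed gradient:
    grad F_sigma(x) ~ (1/(sigma*M)) * sum_m F(x + sigma*u_m) * u_m,
    with u_m ~ N(0, I_d)."""
    rng = np.random.default_rng(rng)
    d = len(x)
    u = rng.standard_normal((n_samples, d))
    fvals = np.array([F(x + sigma * ui) for ui in u])
    return (fvals[:, None] * u).mean(axis=0) / sigma

# one ES step on a simple quadratic: x_{t+1} = x_t - lr * grad estimate
F = lambda x: np.sum(x**2)
x = np.ones(5)
x_next = x - 0.1 * es_gradient(F, x, sigma=0.1, n_samples=2000, rng=0)
```

In practice a baseline such as F(x_t) is often subtracted from the function values to reduce variance; we omit it here to stay close to the formula above.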
One major advantage of the ES methods is that they are well suited to scaling up to a large number of parallel workers. All the function evaluations within each iteration can be performed entirely in parallel, and each worker only needs to return a scalar to the master, so the communication cost among workers is minimal. Even though the existing ES methods and their variants have been successful in solving blackbox optimization problems in low to medium dimensions, their performance drops significantly as the dimension increases, especially in the case of non-convex objective functions. Thus, the motivation of this work is to advance the state of the art of ES-type methods in solving high-dimensional non-convex optimization problems.
3 The evolution strategy with directional Gaussian smoothing (DGS-ES)
We present our main contributions in this section. We start by introducing, in §3.1, a novel directional Gaussian smoothing approach, in which the objective function is smoothed along d orthogonal directions, as opposed to the isotropic smoothing in Eq. (2). The DGS-gradient operator is then introduced, followed by its approximation using the Gauss-Hermite quadrature rule. In §3.2, we describe the details of the proposed DGS-ES algorithm, and its convergence analysis is provided in §3.3.
3.1 The DGS-gradient operator
To proceed, we first define a one-dimensional function

G(y) = F(x + y ξ),    (4)

where x is the current state and ξ is a unit vector in ℝ^d. Note that x and ξ can be viewed as parameters of the function G. We define the Gaussian smoothing of G, denoted by G_σ, by

G_σ(y) = E_{v ∼ N(0, 1)} [ G(y + σ v) ],    (5)

which is the Gaussian smoothing of F along the direction ξ in the neighbourhood of x. It is easy to see that the derivative of G_σ at y = 0 is represented by

D[G_σ(0)] = (1/σ) E_{v ∼ N(0, 1)} [ G(σ v) v ],    (6)

where D denotes the differential operator. As opposed to directional derivatives of F_σ, the derivative in Eq. (6) only involves the directionally smoothed objective function given in Eq. (5).
Now it is natural to define a DGS-gradient operator for F by building the directional derivatives in Eq. (6) for d orthogonal directions, i.e.,

∇_{σ,Ξ}[F](x) = Σ_{i=1}^{d} D[G_{σ_i}(0; x, ξ_i)] ξ_i,    (7)

where Ξ := (ξ_1, …, ξ_d) represents the matrix consisting of d orthonormal vectors and σ = (σ_1, …, σ_d) collects the per-direction smoothing parameters. It is important to notice that the DGS-gradient in Eq. (7) is not equal to ∇F_σ(x) for a general smoothing parameter σ, because of the directional Gaussian smoothing used in Eq. (7). However, the two quantities are consistent as σ → 0, i.e.,

lim_{σ → 0} ∇_{σ,Ξ}[F](x) = lim_{σ → 0} ∇F_σ(x)    (8)

for fixed Ξ and x. In particular, if ∇F(x) exists, both quantities converge to ∇F(x) as σ → 0. Such consistency naturally leads to the idea of replacing ∇F_σ(x) with ∇_{σ,Ξ}[F](x) in the ES framework.
3.2 The DGS-ES algorithm
The key idea of the DGS-ES algorithm is to exploit the fact that each component of ∇_{σ,Ξ}[F](x) in Eq. (7) only involves a one-dimensional integral, such that Gauss quadrature rules (Quarteroni et al., 2007) can be used to approximate the integrals with spectral convergence. In this work, we utilize the Gauss-Hermite (GH) quadrature rule, which is designed for computing integrals of the form ∫ e^{−t²} h(t) dt. By a simple change of variable in Eq. (6), the GH rule can be directly used to obtain the following approximation of D[G_σ(0)], i.e.,

D̃^M[G_σ(0)] = (1/√π) Σ_{m=1}^{M} w_m F(x + √2 σ v_m ξ) (√2 v_m / σ),    (9)

where w_m, m = 1, …, M, are the GH quadrature weights defined by w_m = 2^{M−1} M! √π / (M² [H_{M−1}(v_m)]²), v_m, m = 1, …, M, are the roots of the Hermite polynomial of degree M, and M is the number of function evaluations needed to compute the quadrature in Eq. (9). The weights and the roots can be found in (Abramowitz & Stegun, 1972). The approximation error of the GH formula in Eq. (9) can be bounded by

| D̃^M[G_σ(0)] − D[G_σ(0)] | ≤ C M! √π / (2^M (2M)!),    (10)

where M! is the factorial of M, and the constant C is independent of M and σ.
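The GH approximation of a single smoothed directional derivative is straightforward to implement with standard quadrature routines. Below is a minimal sketch using NumPy's `hermgauss`, which returns the roots and weights for the weight function e^{−t²}; the name `dgs_partial` and its signature are ours, not the authors'.

```python
import numpy as np

def dgs_partial(F, x, xi, sigma, n_quad=7):
    """Gauss-Hermite estimate of the derivative at 0 of the 1-D
    Gaussian smoothing of G(y) = F(x + y*xi):
        D[G_sigma(0)] = (1/sigma) * E_{v~N(0,1)}[G(sigma*v) * v].
    The sqrt(2) change of variable maps the N(0,1) expectation onto
    the exp(-t^2) weight used by the GH rule."""
    v, w = np.polynomial.hermite.hermgauss(n_quad)
    gvals = np.array([F(x + np.sqrt(2.0) * sigma * vm * xi) for vm in v])
    return np.sqrt(2.0 / np.pi) * np.sum(w * v * gvals) / sigma

# sanity check: for F(z) = sum(z^2), the smoothed directional
# derivative at x along xi equals 2*<x, xi> for any sigma
x = np.array([1.0, -2.0, 0.5])
xi = np.array([1.0, 0.0, 0.0])
est = dgs_partial(lambda z: np.sum(z**2), x, xi, sigma=0.5)
```

Because GH quadrature with M nodes is exact for polynomials of degree up to 2M − 1, the estimate above is exact (to floating point) for this quadratic test function.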
Applying the GH quadrature rule to each component of ∇_{σ,Ξ}[F](x) in Eq. (7), we define the following estimator:

∇̃^M_{σ,Ξ}[F](x) = Σ_{i=1}^{d} D̃^M[G_{σ_i}(0; x, ξ_i)] ξ_i,    (11)

which requires a total of M × d parallelizable function evaluations. In the DGS-ES algorithm, we use ∇̃^M_{σ,Ξ}[F](x), instead of the estimators of ∇F_σ(x) in the literature, to update the state x. For example, a simple update of x would be

x_{t+1} = x_t − λ ∇̃^M_{σ,Ξ}[F](x_t),    (12)

where λ is the learning rate.
According to the consistency result in Eq. (8), the update in Eq. (12) converges to the standard ES update as σ → 0, which is also the gradient descent update if ∇F exists. As such, the DGS-ES method is expected to have performance comparable to the standard ES or gradient descent methods for small values of σ.
On the other hand, the spectral convergence of the GH quadrature guarantees that Eq. (11) is an accurate estimator of the DGS-gradient for a small M, regardless of the value of σ. This fact enables the use of relatively large values of σ in Eq. (11) to exploit the nonlocal features of the objective function in the optimization process. This nonlocal exploitation ability is demonstrated in §4 to be effective in skipping local minima of the objective function. An illustration of the nonlocal exploitation is given in Figure 1.
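Putting the pieces together, one DGS-ES iteration can be sketched as below. This is an illustrative sketch, not the authors' reference implementation: the name `dgs_gradient` is ours, σ is taken scalar and shared across directions, and Ξ is fixed to the standard basis.

```python
import numpy as np

def dgs_gradient(F, x, Xi, sigma, n_quad=7):
    """Assemble the DGS-gradient: one Gauss-Hermite 1-D derivative
    estimate per orthonormal direction (row of Xi), combined as
    grad ~ sum_i D_i * xi_i.  Uses d * n_quad function evaluations,
    all independent, hence embarrassingly parallel."""
    v, w = np.polynomial.hermite.hermgauss(n_quad)
    grad = np.zeros_like(x)
    for xi in Xi:  # each smoothing direction
        gvals = np.array([F(x + np.sqrt(2.0) * sigma * vm * xi) for vm in v])
        d_i = np.sqrt(2.0 / np.pi) * np.sum(w * v * gvals) / sigma
        grad += d_i * xi
    return grad

# plain gradient-descent-style updates with learning rate 0.1
F = lambda z: np.sum(z**2)
x = np.full(4, 3.0)
Xi = np.eye(4)  # standard basis as the orthonormal system
for _ in range(50):
    x = x - 0.1 * dgs_gradient(F, x, Xi, sigma=1.0)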
Random perturbation of Ξ and σ for exploration.
As the GH weights w_m and roots v_m, m = 1, …, M, are deterministic values, the estimator in Eq. (11) is deterministic for fixed Ξ and σ. This means the entire optimization algorithm will be deterministic if Eq. (12) is used to update x. Compared to the standard ES and its variants, our approach therefore lacks random exploration, and without a random exploration strategy the optimizer may be trapped in a local minimum when minimizing a non-convex function. To alleviate this issue, we add random perturbations to Ξ and σ. For the orthonormal matrix Ξ, we intend to add a small random perturbation ΔΞ, such that the orthonormal system defined by I + ΔΞ is a small random rotation of Ξ. Since the rotated system needs to remain orthonormal, we have

(I + ΔΞ)(I + ΔΞ)^T = I,  i.e.,  ΔΞ + ΔΞ^T + ΔΞ ΔΞ^T = 0.

By neglecting the high-order term ΔΞ ΔΞ^T, we conclude that the infinitesimal rotation matrix ΔΞ is a skew-symmetric matrix, i.e., ΔΞ^T = −ΔΞ. In practice, we construct ΔΞ by creating a random skew-symmetric matrix with small-value entries, adding it to the current Ξ, and pushing the result through the Gram-Schmidt process for orthonormalization. A scalar multiplier is used to control the magnitude of the entries of ΔΞ.

The perturbation of σ is conducted by drawing random samples from a uniform distribution between two hyper-parameter bounds. The frequency of random perturbation can be determined by various types of indicators, e.g., the magnitude of the DGS-gradient, or the number of iterations since the last perturbation.
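The two perturbation steps described above can be sketched as follows. This is an illustrative implementation under our own naming; we use QR factorization as a numerically convenient stand-in for Gram-Schmidt orthonormalization, with a column-wise sign correction so the new frame stays close to the old one.

```python
import numpy as np

def perturb_directions(Xi, scale=0.01, rng=None):
    """Small random rotation of the orthonormal system Xi: build a
    random skew-symmetric A (so I + A is an infinitesimal rotation),
    apply it, and re-orthonormalize."""
    rng = np.random.default_rng(rng)
    d = Xi.shape[0]
    B = rng.standard_normal((d, d))
    A = scale * (B - B.T)  # skew-symmetric: A^T = -A
    Q, R = np.linalg.qr((np.eye(d) + A) @ Xi)
    return Q * np.sign(np.diag(R))  # flip column signs to stay near Xi

def perturb_sigma(lo, hi, rng=None):
    """Redraw the smoothing radius uniformly between two bounds."""
    return np.random.default_rng(rng).uniform(lo, hi)

Xi_new = perturb_directions(np.eye(5), scale=0.01, rng=0)
s_new = perturb_sigma(0.5, 2.0, rng=1)
```

The `scale` argument plays the role of the scalar multiplier controlling the entry magnitude of the skew-symmetric perturbation.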
3.3 Theoretical results
In the following, we verify that the DGS-gradient operator converges to ∇F as σ → 0 and, based on that, establish the convergence rate of the method in the local regime. Even though our primary focus is not to employ the DGS-gradient operator as a conventional gradient estimator, we show that, when used for that purpose, this operator is also very competitive and can provide an accurate gradient estimate under a weak condition.
In this section, we assume F is convex and differentiable. We say F ∈ C^{1,1}(ℝ^d) if there exists a constant L > 0 such that ‖∇F(x) − ∇F(y)‖ ≤ L ‖x − y‖ for all x, y ∈ ℝ^d.
For ξ a unit vector in ℝ^d, we define the partial derivative of F at x in the direction ξ as the directional derivative ⟨∇F(x), ξ⟩.
We say F is a strongly convex function if there exists a positive number τ such that for any x, y ∈ ℝ^d,

F(y) ≥ F(x) + ⟨∇F(x), y − x⟩ + (τ/2) ‖y − x‖².

We will refer to τ as the convexity parameter of F.
First, adapting Lemma 3 of (Nesterov & Spokoiny, 2015) to one-dimensional Gaussian smoothing, we arrive at the following bounds on the partial derivative estimates.
Let ξ be a unit vector in ℝ^d and F be a function in C^{1,1}(ℝ^d). For σ > 0, define G_σ as in (5). Then the following bounds hold.
These results give us the following gradient estimate for the DGS-gradient operator.
Let (ξ_1, …, ξ_d) be a set of orthonormal vectors in ℝ^d and F be a function in C^{1,1}(ℝ^d). Then
Assume that we apply the same smoothing parameter σ for every direction. From (15), the condition on σ under which DGS-ES ensures an accurate gradient estimate is
Compared to other methods of gradient approximation (Berahas et al., 2019b), our condition on σ is comparable to those of forward finite differences and sphere smoothing (Flaxman et al., 2005) up to a constant, and superior to that of Gaussian smoothing (Salimans et al., 2017). The number of function evaluations required is slightly more demanding than those of forward and central finite differences, but in general better than those of Gaussian smoothing and sphere smoothing.
With the gradient estimate (15), we can establish error estimates for DGS-ES. We show here a convergence rate of our method for optimizing strongly convex functions.
Theorem 1 (DGS-ES for strongly convex function).
Let F be a strongly convex function in C^{1,1}(ℝ^d) and the sequence {x_t} be generated by Algorithm 1 with a suitable learning rate. Then, for any t ≥ 0, we have
where τ and σ are the convexity and smoothing parameters and M is the number of Gauss-Hermite samples.
Denote by x* the global minimizer of F. Then
We proceed to bound the right-hand side of (21). First, since F is strongly convex,
On the other hand,
Applying an estimate for convex C^{1,1} functions (see, e.g., Theorem 2.1.5 of (Nesterov, 2004)) gives
Since F is a strongly convex function, we have that
Combining these estimates, we arrive at the conclusion. ∎
It is worth remarking on a few aspects of the above theorem. First, we obtain a global linear rate of convergence with DGS-ES, which is expected for strongly convex functions. Second, the result allows certain random perturbations of Ξ and σ. In particular, Theorem 1 still holds with Ξ and σ changing after each iteration, as long as Ξ remains orthonormal and σ is fixed.
Let us conclude this section by comparing the conditions that guarantee convergence of our method with those of the random search GS approach in (Nesterov & Spokoiny, 2015) in this strongly convex setting. Assume again the same smoothing parameter σ for every direction. From the error estimate (19), we need to choose
The corresponding condition for random search GS is
The number of iterations in (Nesterov & Spokoiny, 2015) scales linearly with the dimension, which is undesirable for high-dimensional problems. That method, however, evaluates only one sample per iteration, so the total number of function evaluations remains low. Our method requires a slightly higher number of samples. However, it is possible to parallelize the function evaluations over thousands of workers at each step, so our number of iterations is completely independent of dimension. This makes DGS-ES very well suited for large distributed computing systems.
4 Experiments
We evaluate the proposed DGS-ES method on three sets of problems: benchmark functions in low and medium dimensions (2D to 10D), benchmark functions in very high-dimensional spaces (1000D and 2000D), and a real-world aerospace structure design problem. We compared the DGS-ES method with the following blackbox optimization methods: (a) Evolution Strategy (ES) in (Salimans et al., 2017); (b) Covariance matrix adaptation evolution strategy (CMA-ES) in (Hansen, 2006) (we used the implementation of CMA-ES in the pycma open-source code from https://github.com/CMA-ES/pycma); (c) Trust-Region Bayesian Optimization (TuRBO) in (Eriksson et al., 2019) (we used the code released by the authors at https://github.com/uber-research/TuRBO); and (d) the BFGS method.
4.1 Tests on functions in low to medium dimension
We tested our method and the baselines on nine benchmark problems (Surjanovic & Bingham), including 2D-Ackley (Ackley2), 5D-Ackley (Ackley5), 10D-Ackley (Ackley10), 2D-Branin (Branin), 10D-Levy (Levy10), 2D-Cross-In-Tray (CrossInTray), 10D-Sphere (Sphere10), 2D-Drop-Wave (Dropwave) and 10D-Rastrigin (Rastrigin10). In our tests, we used the same parameters and input domains as in (Surjanovic & Bingham).
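For reference, the d-dimensional Ackley function used in several of these benchmarks has the standard form below, with the usual parameters a = 20, b = 0.2, c = 2π from the cited test-function library; its global minimum is 0 at the origin.

```python
import numpy as np

def ackley(x, a=20.0, b=0.2, c=2.0 * np.pi):
    """Standard d-dimensional Ackley function: many shallow local
    minima superimposed on a single global funnel, minimum 0 at 0."""
    x = np.asarray(x, dtype=float)
    d = x.size
    return (-a * np.exp(-b * np.sqrt(np.sum(x**2) / d))
            - np.exp(np.sum(np.cos(c * x)) / d) + a + np.e)

val_at_origin = ackley(np.zeros(10))
```

The dense field of local minima away from the origin is what makes this function a natural stress test for the nonlocal search direction.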
For each benchmark function, we performed 20 repeated independent trials for each method to show the statistical performance. Three metrics are used to compare the performance of our method and the baselines:
Success rate, i.e., the percentage of successful convergence to the global minimum of each test function.
The average number of iterations needed for global convergence, counting only the successful trials.
The average minimal objective function value found in the unsuccessful trials.
Table 1 shows that both the DGS-ES method and the baselines perform perfectly on Sphere10 (a convex function) and Branin (a low-dimensional, nearly convex function). For non-convex functions, our method performs better than the baselines when the function has a global funnel structure, i.e., the Ackley functions, Levy10, Dropwave, and Rastrigin10. This is due to the nonlocal exploitation ability of the DGS-ES method when used with relatively large standard deviations. However, when a function does not have an obvious funnel shape, e.g., the CrossInTray function, the success rate of the DGS-ES method drops dramatically, regardless of the dimension.
Table 2 indicates the convergence speed of an optimization method in a successful trial. When a method has a 0% success rate in Table 1, a dash is placed in the corresponding spot in Table 2. We observed that the DGS-ES and ES methods need the most iterations to achieve global convergence, because no adaptive strategy, e.g., an adaptive learning rate, is used to accelerate the convergence. The BFGS method converges very fast when it can find the global minimum. Both the CMA-ES and TuRBO methods have some kind of adaptive strategy, e.g., adjusting the covariance matrix in CMA-ES or the trust region radius in TuRBO, to achieve faster convergence. However, if the adaptive adjustment is based on incorrect information, those methods may fail to converge, which could be the reason why their success rates are lower than the DGS-ES method's for most test functions.
Table 3 indicates what kind of local minima an algorithm is trapped in during an unsuccessful trial. We can see that DGS-ES performs better than the baselines in most cases, due to its nonlocal exploitation capability. For further illustration, we plotted in Figure 2 the DGS-gradient direction and the local gradient direction at 50 random locations on the surface of Ackley2. The standard deviation used in the directional Gaussian smoothing is set to 0.01 in Figure 2 (left) and 1.0 in Figure 2 (right). It shows that the DGS-gradient with the larger standard deviation mostly points to the global minimum at the origin.
4.2 Tests on high-dimensional functions
Here we investigate the DGS-ES performance on high-dimensional functions: the Sphere, Ackley (rescaled), Levy and Rastrigin functions, in 100D, 1000D and 2000D. The original Ackley function is extremely difficult in high dimensions, because its Lipschitz constant does not grow with the dimension and it is a highly concave function. In order to use the Ackley type of geometry to test our DGS-ES method, we enlarged its Lipschitz constant by multiplying the original Ackley function by 100 and reducing the search domain.
Figure 3 shows the scalability of the DGS-ES algorithm with the dimension of the objective function. We performed 20 repeated independent trials for each method to show the statistical performance. In §3.3, we proved in Theorem 1 that the DGS-ES method achieves a dimension-independent convergence rate when optimizing strongly convex functions. Our numerical results are consistent with Theorem 1. For example, for the Sphere function, the convergence rate does not change when we increase the dimension from 10 to 1000. Moreover, this property carries over to the non-convex Levy and Rastrigin functions. The reason is that both Levy and Rastrigin have a globally near-convex structure when their local minima are smoothed out. In contrast, the Ackley function is highly concave, as a large part of its surface is almost flat apart from the small local minima. In this case, it takes many iterations for DGS-ES to find the global minimum hidden in the middle of the flat surface.
Figure 4 shows the comparison between the DGS-ES and the baselines for optimizing the four benchmark functions in Figure 3 in 2000-dimensional space. For the sphere function, BFGS is a clear winner due to its optimal performance in handling convex functions. Among the three ES-type methods, the DGS-ES has much better performance than its competitors. For the other three functions, the DGS-ES method shows significant advantages over other methods.
4.3 Aerospace hierarchical stiffened shell design
Now we demonstrate the DGS-ES method on a real-world stiffened shell design problem. Due to its high strength and stiffness, the hierarchical stiffened shell has been widely used in aerospace engineering. However, it is challenging to fully exploit its optimal buckling load-carrying capacity. The goal of this task is to improve the load-carrying capacity by optimizing the representative unit cell of the hierarchical stiffened shell, where the inputs are 9 size variables (widths, heights and thicknesses) for the major and minor stiffeners, as shown in Figure 5(a), and the output is the load-carrying capacity factor, which is calculated by high-fidelity numerical simulation, e.g., the finite element method (FEM), see Figure 5(b). Although high-fidelity FEM simulation is time-consuming, many open-source codes and commercial software packages have improved the computational efficiency via scalable parallelism and GPU acceleration. It is thus feasible to apply the DGS-ES method by running parallel FEM simulations on supercomputers. Figure 5 presents the comparison between DGS-ES and the other blackbox optimization methods. We observed that DGS-ES outperforms all the other algorithms and achieves faster convergence, using only 100 iterations. DGS-ES is also more robust than the alternatives.
5 Discussion on the limitations of DGS-ES
Despite the successful demonstrations of the DGS-ES method in §4, we recognize several limitations of the current version of the DGS-ES algorithm, including: (a) A naive random perturbation strategy for exploration. As the estimator in Eq. (11) is deterministic, an effective random exploration strategy is critically important to the robustness of the algorithm. The current random rotation algorithm does not effectively exploit the degrees of freedom of the special orthogonal group SO(d), so important information may be lost after a perturbation. (b) The need for hyper-parameter tuning. The most sensitive hyper-parameter is the standard deviation σ. If σ is too small, the function is insufficiently smoothed and the optimizer may be trapped in a local minimum. In contrast, if σ is too big, the function is over-smoothed and convergence becomes much slower. An adaptive strategy, e.g., the covariance matrix adaptation in CMA-ES, would be needed to alleviate this issue.
This material was based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Applied Mathematics program under contract and award numbers ERKJ352, and by the Artificial Intelligence Initiative at the Oak Ridge National Laboratory (ORNL). ORNL is operated by UT-Battelle, LLC., for the U.S. Department of Energy under Contract DE-AC05-00OR22725.
- Abramowitz & Stegun (1972) Abramowitz, M. and Stegun, I. (eds.). Handbook of Mathematical Functions. Dover, New York, 1972.
- Berahas et al. (2019a) Berahas, A. S., Cao, L., Choromanski, K., and Scheinberg, K. Linear interpolation gives better gradients than gaussian smoothing in derivative-free optimization. arXiv preprint arXiv:1905.13043, 2019a.
- Berahas et al. (2019b) Berahas, A. S., Cao, L., Choromanski, K., and Scheinberg, K. A theoretical and empirical comparison of gradient approximations in derivative-free optimization. arXiv:1905.01332, 2019b.
- Choromanski et al. (2018) Choromanski, K., Rowland, M., Sindhwani, V., Turner, R. E., and Weller, A. Structured evolution with compact architectures for scalable policy optimization. International Conference on Machine Learning, pp. 969–977, 2018.
- Choromanski et al. (2019a) Choromanski, K., Pacchiano, A., Parker-Holder, J., and Tang, Y. From complexity to simplicity: Adaptive es-active subspaces for blackbox optimization. NeurIPS, 2019a.
- Choromanski et al. (2019b) Choromanski, K., Pacchiano, A., Parker-Holder, J., and Tang, Y. Provably robust blackbox optimization for reinforcement learning. arXiv:1903.02993, 2019b.
- Cui et al. (2018) Cui, X., Zhang, W., Tüske, Z., and Picheny, M. Evolutionary stochastic gradient descent for optimization of deep neural networks. NeurIPS, 2018.
- Eriksson et al. (2019) Eriksson, D., Pearce, M., Gardner, J., Turner, R. D., and Poloczek, M. Scalable global optimization via local bayesian optimization. In Advances in Neural Information Processing Systems, pp. 5497–5508, 2019.
- Fazel et al. (2018) Fazel, M., Ge, R., Kakade, S. M., and Mesbahi, M. Global convergence of policy gradient methods for the linear quadratic regulator. Proceedings of the 35th International Conference on Machine Learning, 80:1467–1476, 2018.
- Flaxman et al. (2005) Flaxman, A. D., Kalai, A. T., and McMahan, H. B. Online convex optimization in the bandit setting: gradient descent without a gradient. Proceedings of the 16th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 385–394, 2005.
- Ha & Schmidhuber (2018) Ha, D. and Schmidhuber, J. Recurrent world models facilitate policy evolution. Advances in Neural Information Processing Systems, pp. 2450–2462, 2018.
- Hansen (2006) Hansen, N. The cma evolution strategy: a comparing review. Towards a New Evolutionary Computation, pp. 75–102. Springer, 2006.
- Hansen (2016) Hansen, N. The cma evolution strategy: A tutorial. arXiv preprint arXiv:1604.00772, 2016.
- Hansen & Ostermeier (2001) Hansen, N. and Ostermeier, A. Completely derandomized self-adaptation in evolution strategies. Evolutionary computation, 9(2):159–195, 2001.
- Houthooft et al. (2018) Houthooft, R., Chen, Y., Isola, P., Stadie, B., Wolski, F., Ho, O. J., and Abbeel, P. Evolved policy gradients. Advances in Neural Information Processing Systems, pp. 5400–5409, 2018.
- Larson et al. (2019) Larson, J., Menickelly, M., and Wild, S. M. Derivative-free optimization methods. Acta Numerica, 28:287–404, 2019.
- Maggiar et al. (2018) Maggiar, A., Wachter, A., Dolinskaya, I. S., and Staum, J. A derivative-free trust-region algorithm for the optimization of functions smoothed via gaussian convolution using adaptive multiple importance sampling. SIAM Journal on Optimization, 28(2):1478–1507, 2018.
- Maheswaranathan et al. (2019) Maheswaranathan, N., Metz, L., Tucker, G., Choi, D., and Sohl-Dickstein, J. Guided evolutionary strategies: Augmenting random search with surrogate gradients. Proceedings of the 36th International Conference on Machine Learning, 2019.
- Meier et al. (2019) Meier, F., Mujika, A., Gauy, M. M., and Steger, A. Improving gradient estimation in evolutionary strategies with past descent directions. Optimization Foundations for Reinforcement Learning Workshop at NeurIPS 2019, 2019.
- Metz et al. (2018) Metz, L., Maheswaranathan, N., Nixon, J., Freeman, D., and Sohl-dickstein, J. Learned optimizers that outperform sgd on wall-clock and test loss. 2nd Workshop on Meta-Learning at NeurIPS, 2018.
- Miikkulainen et al. (2017) Miikkulainen, R., Liang, J., Meyerson, E., Rawal, A., Fink, D., Francon, O., Raju, B., Shahrzad, H., Navruzyan, A., Duffy, N., and Hodjat, B. Evolving deep neural networks. arXiv preprint arXiv:1703.00548, 2017.
- Morse & Stanley (2016) Morse, G. and Stanley, K. O. Simple evolutionary optimization can rival stochastic gradient descent in neural networks. Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2016), pp. 477–484, 2016.
- Nesterov (2004) Nesterov, Y. Introductory Lectures on Convex Optimization. Springer US, 2004.
- Nesterov & Spokoiny (2015) Nesterov, Y. and Spokoiny, V. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, pp. 1–40, 2015.
- Quarteroni et al. (2007) Quarteroni, A., Sacco, R., and Saleri, F. Numerical Mathematics, volume 37. Springer Science & Business Media, 2007.
- Real et al. (2017) Real, E., Moore, S., Selle, A., Saxena, S., Suematsu, Y. L., Tan, J., Le, Q. V., and Kurakin, A. Large-scale evolution of image classifiers. International Conference on Machine Learning (ICML), pp. 2902–2911, 2017.
- Rechenberg & Eigen (1973) Rechenberg, I. and Eigen, M. Evolutionsstrategie: Optimierung Technischer Systeme nach Prinzipien der Biologischen Evolution. Frommann-Holzboog Stuttgart, 1973.
- Rios & Sahinidis (2009) Rios, L. M. and Sahinidis, N. V. Derivative-free optimization: a review of algorithms and comparison of software implementations. J Glob Optim, 56:1247–1293, 2009.
- Salimans et al. (2017) Salimans, T., Ho, J., Chen, X., and Sutskever, I. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.
- Schwefel (1977) Schwefel, H.-P. Numerische optimierung von computer-modellen mittels der evolutionsstrategie. 1977.
- Surjanovic & Bingham. Surjanovic, S. and Bingham, D. Virtual library of simulation experiments: Test functions and datasets. Retrieved February 6, 2020, from http://www.sfu.ca/~ssurjano.
- Wierstra et al. (2014) Wierstra, D., Schaul, T., Glasmachers, T., Sun, Y., Peters, J., and Schmidhuber, J. Natural evolution strategies. Journal of Machine Learning Research, 15:949–980, 2014.