A Scalable Evolution Strategy with Directional Gaussian Smoothing for Blackbox Optimization

02/07/2020 · Jiaxin Zhang, et al.

We developed a new scalable evolution strategy with directional Gaussian smoothing (DGS-ES) for high-dimensional blackbox optimization. Standard ES methods have been shown to suffer from the curse of dimensionality, due to the random directional search and the low accuracy of Monte Carlo estimation. The key idea of this work is to develop a Gaussian smoothing approach that averages the original objective function only along d orthogonal directions. In this way, the partial derivatives of the smoothed function along those directions can be represented by one-dimensional integrals, instead of the d-dimensional integrals in standard ES methods. As such, the averaged partial derivatives can be approximated using the Gauss-Hermite quadrature rule, as opposed to Monte Carlo, which significantly improves the accuracy of the averaged gradients. Moreover, the smoothing technique reduces the barriers between local minima, such that the global minimum becomes easier to reach. We provide three sets of examples to demonstrate the performance of our method, including benchmark functions for global optimization and a rocket shell design problem.


1 Introduction

Evolution strategy (ES) is a class of blackbox optimization algorithms often used to optimize functions when gradient information is inaccessible. Let $F: \mathbb{R}^d \to \mathbb{R}$ be an objective function; we are interested in searching for $x \in \mathbb{R}^d$ with a function value as small as possible, where the only available information is the function values evaluated at search points. In recent years, ES has been revived as a popular method in several machine learning problems, including optimizing neural networks (Cui et al., 2018) and reinforcement learning (Salimans et al., 2017; Fazel et al., 2018).

In this work, we are interested in a particular class of ES methods, which represents the population by multivariate Gaussian distributions. This type of ES can be viewed as optimizing a modified objective function obtained by Gaussian smoothing (GS). The key step of GS-based evolution strategy is the computation of efficient search directions (most commonly, the gradient) for the smoothed objective based on the GS formula. In (Flaxman et al., 2005; Nesterov & Spokoiny, 2015), the authors studied a random optimization technique, where the search direction is randomly chosen and the updates are generated from an estimate of the directional derivative. It was shown in (Salimans et al., 2017) that the evaluations of $F$ and the estimation of directional derivatives can be performed in parallel to provide a gradient estimate at each search step. This scheme is simple and suitable for large-scale distributed optimization. There have been several attempts to improve the gradient estimate by carefully selecting the directions along which the gradient information is aggregated, for example, by using orthogonal instead of Monte Carlo directions (Choromanski et al., 2018), exploiting past data (Maheswaranathan et al., 2019; Meier et al., 2019), and employing active subspaces (Choromanski et al., 2019a).

One prominent challenge in blackbox optimization is that the landscape of $F$ is often complex and possesses an abundance of local minima and saddle points. Any optimization algorithm, with or without access to full gradients, risks getting trapped in some of those points and stagnating. Gaussian smoothing, with its ability to smooth out the function and damp small, insignificant fluctuations, is a strong candidate for this task. Specifically, with a moderately large smoothing parameter (i.e., a strong smoothing effect), we can expect the gradient of the smoothed objective function to look past unimportant variation in the adjacent area and detect the general trend of the function from a distance, thus providing an efficient nonlocal search direction. This potential of GS, however, has not been a focus of current works on ES, many of which concern improving the accuracy of local gradient estimators (Maheswaranathan et al., 2019; Meier et al., 2019; Berahas et al., 2019a; Choromanski et al., 2019b). Yet, the local gradient may be misleading (see Figure 1).

Another challenge in estimating the search direction arises from the high dimensionality of the objective function. Indeed, the common approach to computing the gradient of the Gaussian smoothed function is numerical integration, for which, roughly speaking, one needs function evaluations from a domain whose diameter is proportional to the smoothing parameter. By increasing this parameter, we also significantly enlarge the domain, especially in high-dimensional cases, and the number of function evaluations per gradient estimate can become prohibitively large.

In this paper, we propose an enhancement to ES by introducing a directional Gaussian smoothing (DGS) gradient operator for determining a nonlocal search direction, which takes into account large-scale variation of the function and disregards local fluctuations. This DGS-gradient resembles the gradient of a Gaussian smoothed version of the objective function with a moderately large smoothing parameter. As computing that gradient involves a $d$-dimensional numerical integration and is very costly, the key idea of the DGS-gradient is to average the original objective function along $d$ orthogonal directions to obtain the partial derivatives, and then aggregate the information of all partial derivatives to assemble the full DGS-gradient. This strategy requires $d$ one-dimensional numerical integrations, instead of one $d$-dimensional integration, and thus significantly reduces the number of function evaluations. We call our method directional Gaussian smoothing ES (DGS-ES) in the rest of the paper.

To maintain high accuracy for each partial derivative estimate, we use the Gauss-Hermite quadrature rule in each direction. While the DGS-gradient can be seen as a perturbed version of the gradient of the Gaussian smoothed objective function, it is worth remarking that this operator may not form the gradient field of any function. We verify that in the local regime (i.e., for small smoothing parameters), the DGS-gradient operator is consistent with the local gradient. Furthermore, we prove that our method has a theoretical convergence rate independent of the dimension, superior to that of GS-ES based on Monte Carlo sampling. The main contributions of this work can be summarized as follows:

  • Development of the DGS-gradient operator and its Gauss-Hermite estimator, which introduces an effective nonlocal search direction generation approach into the family of evolution strategies.

  • Theoretical analysis verifying the scalability of the DGS-ES method, i.e., the number of iterations needed for convergence is independent of the dimension for strongly convex functions.

  • Demonstration of our DGS-ES method on high-dimensional non-convex benchmark optimization problems, as well as on a real-world material design problem for rocket shell manufacture.

1.1 Related works

ES is an instance of derivative-free optimization that employs random search techniques to optimize an objective function (Rechenberg & Eigen, 1973; Schwefel, 1977). The search can be guided via either covariance matrix adaptation (Hansen & Ostermeier, 2001; Hansen, 2016) or search gradients to be fed to first-order optimization schemes (Wierstra et al., 2014). Our work herein is in line with the second approach. Recent applications of ES can be found in reinforcement learning (Houthooft et al., 2018; Ha & Schmidhuber, 2018), meta-learning (Metz et al., 2018), optimizing neural network architectures (Real et al., 2017; Miikkulainen et al., 2017), and direct training of neural networks (Morse & Stanley, 2016). Gaussian smoothing emerged as a method for approximating the gradient as a search direction for ES (Flaxman et al., 2005; Nesterov & Spokoiny, 2015). This approach can be combined with trust region methods for optimization (Maggiar et al., 2018). A closely related approach is sphere smoothing (Flaxman et al., 2005). A comparison of different methods for gradient approximation, including finite differences, Gaussian smoothing and sphere smoothing, can be found in (Berahas et al., 2019b). Finally, for a general review of derivative-free optimization, we refer to (Rios & Sahinidis, 2009; Larson et al., 2019).

2 Background

We are interested in solving the following blackbox optimization problem

(1)   $\min_{x \in \mathbb{R}^d} F(x),$

where $x = (x_1, \ldots, x_d)$ consists of $d$ tuning parameters, and $F$ is a high-dimensional blackbox objective function. In this work, we assume that the gradient $\nabla F$ is not available, and that $F$ is only accessible via function evaluations. Unless otherwise stated, we do not assume $F$ is convex.

We briefly recall the class of ES methods, e.g., (Hansen & Ostermeier, 2001; Salimans et al., 2017), which use the multivariate Gaussian distribution to represent the population. In other words, those ES methods were developed based on Gaussian smoothing (Flaxman et al., 2005; Nesterov & Spokoiny, 2015) of $F$ in Eq. (1), i.e.,

(2)   $F_\sigma(x) = \frac{1}{(2\pi)^{d/2}} \int_{\mathbb{R}^d} F(x + \sigma \circ u)\, e^{-\frac{\|u\|^2}{2}}\, du = \mathbb{E}_{u \sim \mathcal{N}(0, \mathbf{I}_d)}\big[F(x + \sigma \circ u)\big],$

where the original objective function $F$ is smoothed via a convolution with a $d$-dimensional Gaussian distribution $\mathcal{N}(0, \mathbf{I}_d)$, the notation $\circ$ represents the element-wise product, and the vector $\sigma = (\sigma_1, \ldots, \sigma_d)$ plays the role of a smoothing parameter. In general, the smoothed function $F_\sigma$ has better properties than $F$, as all the characteristics of $F$, e.g., convexity and the Lipschitz constant, are inherited by $F_\sigma$. Moreover, for any $\sigma > 0$, the function $F_\sigma$ is always differentiable even if $F$ is not. As the discrepancy between the gradients of $F_\sigma$ and $F$ can be bounded by the Lipschitz constant (see Lemma 3 of (Nesterov & Spokoiny, 2015) for details), the original optimization problem in Eq. (1) can be replaced by a smoothed version, i.e., $\min_{x \in \mathbb{R}^d} F_\sigma(x)$. The gradient of $F_\sigma$ is given by

(3)   $\nabla F_\sigma(x) = \mathbb{E}_{u \sim \mathcal{N}(0, \mathbf{I}_d)}\big[ F(x + \sigma \circ u)\, (u / \sigma) \big],$

where the division $u / \sigma$ is element-wise. The standard ES method refers to the algorithm that uses Monte Carlo sampling to estimate this gradient and update the state from iteration $t$ to $t+1$ by $x_{t+1} = x_t - \frac{\lambda}{M} \sum_{m=1}^{M} F(x_t + \sigma \circ u_m)\,(u_m / \sigma)$, where $\lambda$ is the learning rate and $u_1, \ldots, u_M$ are sampled from the Gaussian distribution $\mathcal{N}(0, \mathbf{I}_d)$.

One major advantage of ES methods is that they are well suited to scaling up to a large number of parallel workers. All function evaluations within each iteration can be performed fully in parallel, and each worker only needs to return a scalar to the master, so the communication cost among workers is minimal. Even though the existing ES methods and their variants have been successful in solving blackbox optimization problems in low to medium dimensions, their performance drops significantly as the dimension increases, especially for non-convex objective functions. The motivation of this work is thus to advance the state of the art of ES-type methods in solving high-dimensional non-convex optimization problems.
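To make the scheme concrete, the following is a minimal NumPy sketch (our illustration, not the authors' released code) of the Monte Carlo gradient estimator behind the standard ES update, assuming a scalar smoothing parameter `sigma`; in a distributed implementation, each of the `n_samples` evaluations of `F` would run on a separate worker and return a single scalar.

```python
import numpy as np

def es_gradient(F, x, sigma=0.1, n_samples=100, rng=None):
    """Monte Carlo estimate of grad F_sigma(x) ~ E[F(x + sigma*u) u] / sigma,
    the estimator used by the standard GS-ES update (cf. Salimans et al., 2017)."""
    rng = rng or np.random.default_rng()
    d = len(x)
    u = rng.standard_normal((n_samples, d))             # Gaussian search directions
    f_vals = np.array([F(x + sigma * ui) for ui in u])  # parallelizable evaluations
    return (f_vals[:, None] * u).mean(axis=0) / sigma
```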

3 The evolution strategy with directional Gaussian smoothing (DGS-ES)

We present our main contributions in this section. We start by introducing, in §3.1, a novel directional Gaussian smoothing approach, in which the objective function is smoothed along $d$ orthogonal directions, as opposed to the isotropic smoothing in Eq. (2). The DGS-gradient operator is then introduced, followed by its approximation using the Gauss-Hermite quadrature rule. In §3.2, we describe the details of the proposed DGS-ES algorithm, and its convergence analysis is provided in §3.3.

3.1 The DGS-gradient operator

To proceed, we first define a one-dimensional function

(4)   $G(y \mid x, \xi) := F(x + y\,\xi), \quad y \in \mathbb{R},$

where $x$ is the current state and $\xi$ is a unit vector in $\mathbb{R}^d$. Note that $x$ and $\xi$ can be viewed as parameters of the function $G$. We define the Gaussian smoothing of $G$, denoted by $G_\sigma$, by

(5)   $G_\sigma(y \mid x, \xi) := \frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} G(y + \sigma v \mid x, \xi)\, e^{-\frac{v^2}{2}}\, dv = \mathbb{E}_{v \sim \mathcal{N}(0,1)}\big[G(y + \sigma v \mid x, \xi)\big],$

which is the Gaussian smoothing of $F$ along the direction $\xi$ in the neighbourhood of $x$. It is easy to see that the derivative of $G_\sigma$ at $y = 0$ is represented by

(6)   $\mathscr{D}\big[G_\sigma(0 \mid x, \xi)\big] = \frac{1}{\sigma}\, \mathbb{E}_{v \sim \mathcal{N}(0,1)}\big[G(\sigma v \mid x, \xi)\, v\big],$

where $\mathscr{D}$ denotes the differential operator $d/dy$. As opposed to directional derivatives of $F_\sigma$, the derivative in Eq. (6) only involves the directionally smoothed objective function given in Eq. (5).
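For completeness, here is a short derivation of Eq. (6) from Eq. (5) (our notation, reconstructed from the definitions above): substituting $t = y + \sigma v$ and differentiating under the integral sign gives

$$G_\sigma(y \mid x, \xi) = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{\mathbb{R}} G(t \mid x, \xi)\, e^{-\frac{(t - y)^2}{2\sigma^2}}\, dt \;\Longrightarrow\; \frac{d}{dy} G_\sigma(y \mid x, \xi)\Big|_{y=0} = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{\mathbb{R}} G(t \mid x, \xi)\, \frac{t}{\sigma^2}\, e^{-\frac{t^2}{2\sigma^2}}\, dt = \frac{1}{\sigma}\, \mathbb{E}_{v \sim \mathcal{N}(0,1)}\big[G(\sigma v \mid x, \xi)\, v\big].$$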

Now it is natural to define a DGS-gradient operator for $F$ by assembling the directional derivatives in Eq. (6) for $d$ orthogonal directions, i.e.,

(7)   $\nabla_{\sigma, \Xi}[F](x) := \Big[ \mathscr{D}\big[G_{\sigma_1}(0 \mid x, \xi_1)\big], \ldots, \mathscr{D}\big[G_{\sigma_d}(0 \mid x, \xi_d)\big] \Big]\, \Xi,$

where $\Xi := (\xi_1, \ldots, \xi_d)^\top$ represents the matrix consisting of the $d$ orthonormal vectors. It is important to notice that the DGS-gradient in Eq. (7) is not equal to $\nabla F_\sigma(x)$ for a general smoothing factor $\sigma$, because of the directional Gaussian smoothing used in Eq. (7). However, there is consistency between the two quantities as $\sigma \to 0$, i.e.,

(8)   $\lim_{\sigma \to 0} \nabla_{\sigma, \Xi}[F](x) = \lim_{\sigma \to 0} \nabla F_\sigma(x)$

for fixed $\Xi$ and $x$. In particular, if $\nabla F(x)$ exists, both quantities converge to $\nabla F(x)$ as $\sigma \to 0$. Such consistency naturally leads to the idea of replacing $\nabla F_\sigma$ with $\nabla_{\sigma, \Xi}[F]$ in the ES framework.

3.2 The DGS-ES algorithm

The key idea of the DGS-ES algorithm is to exploit the fact that each component of $\nabla_{\sigma,\Xi}[F](x)$ in Eq. (7) only involves a one-dimensional integral, such that Gauss quadrature rules (Quarteroni et al., 2007) can be used to approximate the integrals with spectral convergence. In this work, we utilize the Gauss-Hermite (GH) quadrature rule, which was designed for computing integrals of the form $\int_{\mathbb{R}} f(v)\, e^{-v^2}\, dv$. By a simple change of variable in Eq. (6), the GH rule can be directly used to obtain the following approximation formula for $\mathscr{D}[G_\sigma(0 \mid x, \xi)]$, i.e.,

(9)   $\widetilde{\mathscr{D}}^M\big[G_\sigma(0 \mid x, \xi)\big] = \frac{1}{\sqrt{\pi}\,\sigma} \sum_{m=1}^{M} w_m\, G\big(\sqrt{2}\,\sigma v_m \mid x, \xi\big)\, \sqrt{2}\, v_m,$

where $w_m$, $m = 1, \ldots, M$, are the GH quadrature weights defined by $w_m = \frac{2^{M-1} M! \sqrt{\pi}}{M^2 [H_{M-1}(v_m)]^2}$, $v_m$, $m = 1, \ldots, M$, are the roots of the Hermite polynomial $H_M$ of degree $M$, and $M$ is the number of function evaluations needed to compute the quadrature in Eq. (9). The weights and the roots can be found in (Abramowitz & Stegun, 1972). The approximation error of the GH formula can be bounded by

(10)   $\Big| \widetilde{\mathscr{D}}^M\big[G_\sigma(0 \mid x, \xi)\big] - \mathscr{D}\big[G_\sigma(0 \mid x, \xi)\big] \Big| \le C\, \frac{M!\,\sqrt{\pi}}{2^M\,(2M)!},$

where $M!$ is the factorial of $M$, and the constant $C > 0$ is independent of $M$ and $\sigma$.
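As a quick numerical sanity check of Eq. (9) (a sketch under our own assumptions, using a hypothetical 1-D slice G rather than any function from the paper), one can compare the M-point GH estimate with a brute-force Monte Carlo estimate of the same derivative:

```python
import numpy as np

# Hypothetical 1-D slice G(y) = F(x + y*xi) for some fixed x and xi
G = lambda y: np.sin(3.0 * y) + 0.5 * y**2
sigma, M = 0.5, 7

# Eq. (9): Gauss-Hermite estimate of D[G_sigma(0)]
v, w = np.polynomial.hermite.hermgauss(M)    # nodes/weights for weight e^{-v^2}
gh = np.sum(w * G(np.sqrt(2.0) * sigma * v) * np.sqrt(2.0) * v) / (np.sqrt(np.pi) * sigma)

# Eq. (6): brute-force Monte Carlo estimate, D[G_sigma(0)] = E[G(sigma*v) v] / sigma
z = np.random.default_rng(0).standard_normal(2_000_000)
mc = np.mean(G(sigma * z) * z) / sigma

print(f"Gauss-Hermite (M={M}): {gh:.6f}  vs. Monte Carlo: {mc:.6f}")
```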

Applying the GH quadrature rule to each component of $\nabla_{\sigma,\Xi}[F](x)$ in Eq. (7), we define the following estimator:

(11)   $\widetilde{\nabla}^M_{\sigma, \Xi}[F](x) = \Big[ \widetilde{\mathscr{D}}^M\big[G_{\sigma_1}(0 \mid x, \xi_1)\big], \ldots, \widetilde{\mathscr{D}}^M\big[G_{\sigma_d}(0 \mid x, \xi_d)\big] \Big]\, \Xi,$

which requires a total of $M \times d$ parallelizable function evaluations. In the DGS-ES algorithm, we use $\widetilde{\nabla}^M_{\sigma,\Xi}[F]$, instead of the estimators of $\nabla F_\sigma$ in the literature, to update the state $x$. For example, a simple update of $x$ would be

(12)   $x_{t+1} = x_t - \lambda\, \widetilde{\nabla}^M_{\sigma, \Xi}[F](x_t),$

where $\lambda$ is the learning rate. According to the consistency result in Eq. (8), the update will converge to the standard ES update using $\nabla F_\sigma$ as $\sigma \to 0$, which is also the gradient descent update if $\nabla F$ exists. As such, the DGS-ES method is expected to have comparable performance to the standard ES or gradient descent methods for small values of $\sigma$.

On the other hand, the spectral convergence of the GH quadrature guarantees that $\widetilde{\nabla}^M_{\sigma,\Xi}[F]$ is an accurate estimator of the DGS-gradient even for a small $M$, regardless of the value of $\sigma$. This fact enables the use of relatively big values of $\sigma$ in Eq. (11) to exploit the nonlocal features of $F$ in the optimization process. The nonlocal exploitation ability of $\widetilde{\nabla}^M_{\sigma,\Xi}[F]$ is demonstrated in §4 to be effective in skipping local minima of the objective function $F$. An illustration of the nonlocal exploitation is given in Figure 1.

Figure 1: Illustration of the nonlocal exploitation ability of the DGS smoothing technique. The blue arrow points in the local gradient direction, and the red arrow points in the DGS-gradient direction, where the directionally smoothed functions along the two axes are the red curves in the top and right sub-figures. As the DGS smoothing captures the nonlocal features of $F$, the DGS-gradient points in a much better direction, i.e., closer to the global minimum, than the local gradient.
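The estimator in Eq. (11) is straightforward to implement. Below is a minimal NumPy sketch (ours, not the authors' code), assuming F maps a length-d array to a scalar, Xi is a d-by-d matrix whose rows are the orthonormal directions, and sigma is a length-d array of smoothing parameters; the inner function evaluations are embarrassingly parallel.

```python
import numpy as np

def dgs_gradient(F, x, Xi, sigma, M=7):
    """Assemble the DGS-gradient estimator of Eq. (11) from the
    per-direction Gauss-Hermite estimates of Eq. (9)."""
    d = len(x)
    v, w = np.polynomial.hermite.hermgauss(M)     # GH nodes/weights
    directional = np.zeros(d)
    for i in range(d):
        # Eq. (9) along direction xi_i with smoothing parameter sigma_i
        evals = np.array([F(x + np.sqrt(2.0) * sigma[i] * vm * Xi[i]) for vm in v])
        directional[i] = np.sum(w * evals * np.sqrt(2.0) * v) / (np.sqrt(np.pi) * sigma[i])
    # Eq. (11): rotate the d directional derivatives back to the standard basis
    return directional @ Xi
```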

Random perturbation of $\Xi$ and $\sigma$ for exploration.

As the GH weights $w_m$ and nodes $v_m$, $m = 1, \ldots, M$, are deterministic values, the estimator in Eq. (11) is deterministic for fixed $\sigma$ and $\Xi$. This means the entire optimization algorithm will be deterministic if Eq. (12) is used to update $x$. Compared to the standard ES and its variants, there is thus a lack of random exploration in our approach. Without a random exploration strategy, the optimizer may be trapped in a local minimum when minimizing a non-convex function. To alleviate this issue, we add random perturbations to $\Xi$ and $\sigma$. For the orthonormal matrix $\Xi$, we add a small random perturbation, denoted by $\Delta\Xi$, such that the orthonormal system defined by $\Xi + \Delta\Xi$ is a small random rotation away from $\Xi$. Since $\Xi + \Delta\Xi$ needs to be orthonormal, we have

$(\Xi + \Delta\Xi)(\Xi + \Delta\Xi)^\top = \mathbf{I}_d.$

By neglecting the high-order term $\Delta\Xi\,\Delta\Xi^\top$, we conclude that the infinitesimal rotation matrix $\Delta\Xi\,\Xi^\top$ is a skew-symmetric matrix, i.e., $\Delta\Xi\,\Xi^\top = -(\Delta\Xi\,\Xi^\top)^\top$. In practice, we construct $\Delta\Xi$ by creating a random skew-symmetric matrix with small-value entries, adding it to the current $\Xi$, and pushing the result through the Gram-Schmidt process for orthonormalization. We denote by $\alpha$ the multiplier that is used to control the magnitude of the entries of $\Delta\Xi$.

The perturbation of $\sigma$ is conducted by drawing random samples from a uniform distribution $\mathcal{U}(\beta - \gamma, \beta + \gamma)$ with two hyper-parameters $\beta$ (the mean) and $\gamma$ (the radius), with $\gamma < \beta$. The frequency of random perturbation can be determined by various types of indicators, e.g., the magnitude of the DGS-gradient, or the number of iterations since the last perturbation.
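A minimal sketch of this perturbation step (our reconstruction of the procedure described above; the QR factorization plays the role of the Gram-Schmidt process):

```python
import numpy as np

def perturb_directions(Xi, alpha=0.01, rng=None):
    """Randomly rotate the orthonormal direction matrix Xi by adding a
    small random skew-symmetric matrix and re-orthonormalizing."""
    rng = rng or np.random.default_rng()
    d = Xi.shape[0]
    A = rng.normal(scale=alpha, size=(d, d))
    skew = (A - A.T) / 2.0                 # random skew-symmetric perturbation
    Q, _ = np.linalg.qr(Xi + skew)         # Gram-Schmidt-style orthonormalization
    return Q
```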


Algorithm 1: The DGS-ES algorithm
1:  Hyper-parameters: $M$: the order of the GH quadrature rule; $\lambda$: the learning rate; $\alpha$: the scaling factor for controlling the norm of $\Delta\Xi$; $\beta, \gamma$: the mean and radius for sampling $\sigma$; $\theta$: the tolerance for triggering random perturbation
2:  Input: the initial state $x_0$
3:  Output: the final state $x_T$
4:  Set $\Xi = \mathbf{I}_d$ and $\sigma_i = \beta$ for $i = 1, \ldots, d$
5:  for $t = 0, \ldots, T-1$ do
6:     Evaluate, in parallel, the function values of $G$ in Eq. (4) at the GH quadrature points
7:     for $i = 1, \ldots, d$ do
8:        Compute $\widetilde{\mathscr{D}}^M[G_{\sigma_i}(0 \mid x_t, \xi_i)]$ in Eq. (9)
9:     end for
10:    Assemble $\widetilde{\nabla}^M_{\sigma,\Xi}[F](x_t)$ in Eq. (11)
11:    Set $x_{t+1} = x_t - \lambda\, \widetilde{\nabla}^M_{\sigma,\Xi}[F](x_t)$
12:    if $\|\widetilde{\nabla}^M_{\sigma,\Xi}[F](x_t)\| < \theta$ then
13:       Generate $\Delta\Xi$ and update $\Xi$
14:       Generate $\sigma_i$ from $\mathcal{U}(\beta - \gamma, \beta + \gamma)$ for $i = 1, \ldots, d$
15:    end if
16: end for
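Putting the pieces together, here is a minimal sketch of Algorithm 1 (reusing the hypothetical dgs_gradient and perturb_directions helpers sketched above; hyper-parameter names follow the list in line 1):

```python
import numpy as np

def dgs_es(F, x0, T=1000, M=7, lam=0.1, alpha=0.01,
           beta=1.0, gamma=0.5, theta=1e-3, rng=None):
    """Sketch of the DGS-ES loop: update x with the DGS-gradient and
    randomly perturb (Xi, sigma) when the gradient norm falls below theta."""
    rng = rng or np.random.default_rng()
    d = len(x0)
    x = np.asarray(x0, dtype=float)
    Xi = np.eye(d)                          # line 4: initial directions
    sigma = np.full(d, beta)                # line 4: initial smoothing
    for t in range(T):
        g = dgs_gradient(F, x, Xi, sigma, M)          # lines 6-10
        x = x - lam * g                               # line 11, Eq. (12)
        if np.linalg.norm(g) < theta:                 # line 12
            Xi = perturb_directions(Xi, alpha, rng)   # line 13
            sigma = rng.uniform(beta - gamma, beta + gamma, size=d)  # line 14
    return x
```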

3.3 Theoretical results

In the following, we verify that the DGS-gradient operator converges to $\nabla F$ as $\sigma \to 0$ and, based on that, establish the convergence rate of the method in the local regime. Even though our primary focus is not to employ the DGS-gradient operator as a conventional gradient estimator, we show that, when used for that purpose, this operator is also very competitive and can provide an accurate gradient estimate under a weak condition.

In this section, we assume $F$ is convex and differentiable. We say $F \in C^{1,1}(\mathbb{R}^d)$ if there exists $L > 0$ such that

$\|\nabla F(x) - \nabla F(y)\| \le L\, \|x - y\|, \quad \forall\, x, y \in \mathbb{R}^d.$

For $\xi$ being a unit vector in $\mathbb{R}^d$, we define the partial derivative of $F$ at $x$ in the direction $\xi$, i.e.,

$\mathscr{D}_\xi[F](x) := \xi \cdot \nabla F(x).$

We say $F$ is a strongly convex function if there exists a positive number $\tau$ such that for any $x, y \in \mathbb{R}^d$,

$F(y) \ge F(x) + \nabla F(x) \cdot (y - x) + \frac{\tau}{2}\, \|y - x\|^2.$

We will refer to $\tau$ as the convexity parameter of $F$.

First, adapting Lemma 3 of (Nesterov & Spokoiny, 2015) to one-dimensional Gaussian smoothing, we arrive at the following bounds on the partial derivative estimates.

Lemma 1.

Let $\xi$ be a unit vector in $\mathbb{R}^d$ and $F$ be a function in $C^{1,1}(\mathbb{R}^d)$. For $\sigma > 0$, define $G_\sigma$ as in (5). There hold

(13)
(14)

These results give us the following gradient estimates for the DGS-gradient operator.

Lemma 2.

Let $\{\xi_i\}_{i=1}^d$ be a set of orthonormal vectors in $\mathbb{R}^d$ and $F$ be a function in $C^{1,1}(\mathbb{R}^d)$. Then

(15)
(16)
Proof.

From (10) and (13), we have, for any $i \in \{1, \ldots, d\}$,

(17)

Summing (17) over $i = 1, \ldots, d$ gives (15). For (16), recall

(18)

Each term inside the above sum can be bounded as

Plugging this into (18), we obtain (16). ∎

Assume that we apply the same smoothing parameter $\sigma$ in every direction. From (15), one can then read off the condition on $\sigma$ under which DGS-ES guarantees a gradient estimate of prescribed accuracy.

Compared to other methods of gradient approximation (Berahas et al., 2019b), our condition on $\sigma$ is comparable to those of Forward Finite Difference and Sphere Smoothing (Flaxman et al., 2005) up to a constant, and superior to that of Gaussian Smoothing (Salimans et al., 2017). The number of function evaluations required is slightly more demanding than those of Forward and Central Finite Differences, but in general better than those of Gaussian Smoothing and Sphere Smoothing.

With the gradient estimate (15), we are able to establish error estimates for DGS-ES. We show here a convergence rate of our method for optimizing strongly convex functions.

Theorem 1 (DGS-ES for strongly convex function).

Let $F$ be a strongly convex function in $C^{1,1}(\mathbb{R}^d)$ and the sequence $\{x_t\}_{t \ge 0}$ be generated by Algorithm 1 with an appropriate learning rate $\lambda$. Then, for any $t \ge 0$, we have

(19)

Here,

(20)

where $\tau$ and $\sigma$ are the convexity and smoothing parameters, respectively, and $M$ is the number of Gauss-Hermite quadrature points.

Proof.

Denote $r_t = \|x_t - x^*\|$, where $x^*$ is the global minimizer of $F$. Then

(21)

We proceed to bound the right-hand side of (21). First, since $F$ is strongly convex,

(22)

On the other hand,

(23)

Applying an estimate for convex $C^{1,1}$-functions (see, e.g., Theorem 2.1.5 of (Nesterov, 2004)) gives

(24)

Combining (21)–(24), there holds

(25)

Since $F$ is a strongly convex function, we have that

(26)

Under the assumption on the learning rate, we derive from (25), (26) and (20) a recursive bound on the error, which yields the claimed estimate, and we arrive at the conclusion. ∎

It is worth remarking a few things on the above theorem. First, we obtain a global linear rate of convergence with DGS-ES, which is expected for strongly convex functions. Second, the result allows certain random perturbations of $\Xi$ and $\sigma$. In particular, Theorem 1 still holds with $\Xi$ and $\sigma$ changing after each iteration, as long as $\Xi$ remains orthonormal and $\sigma$ is fixed.

Let us conclude this section by comparing the conditions that guarantee convergence of our method with those of the random search GS approach of (Nesterov & Spokoiny, 2015) in this strongly convex setting. Assume again the same smoothing parameter $\sigma$ in every direction; the error estimate (19) dictates how the number of iterations must be chosen. The corresponding number of iterations for random search GS in (Nesterov & Spokoiny, 2015) scales linearly with the dimension, which is undesirable for high-dimensional problems. That method, however, evaluates only one sample per iteration, so its total number of function evaluations remains low. Our method requires a slightly higher number of samples per iteration. However, it is possible to parallelize the function evaluations over thousands of workers at each step, so our number of iterations is completely independent of the dimension. This makes DGS-ES very well suited for large distributed computing systems.

4 Experiments

We evaluate the proposed DGS-ES method on three sets of problems: benchmark functions in low to medium dimensions (2D to 10D), benchmark functions in very high-dimensional spaces (1000D and 2000D), and a real-world aerospace structure design problem. We compare the DGS-ES method with the following blackbox optimization methods: (a) Evolution Strategy (ES) in (Salimans et al., 2017); (b) Covariance Matrix Adaptation Evolution Strategy (CMA-ES) in (Hansen, 2006) (we used the implementation of CMA-ES in the pycma open-source code from https://github.com/CMA-ES/pycma); (c) Trust-Region Bayesian Optimization (TuRBO) in (Eriksson et al., 2019) (we used the code released by the authors at https://github.com/uber-research/TuRBO); and (d) the BFGS method.

4.1 Tests on functions in low to medium dimension

We tested our method and the baselines on nine benchmark problems (Surjanovic & Bingham), including 2D-Ackley (Ackley2), 5D-Ackley (Ackley5), 10D-Ackley (Ackley10), 2D-Branin (Branin), 10D-Levy (Levy10), 2D-Cross-In-Tray (CrossInTray), 10D-Sphere (Sphere10), 2D-Drop-Wave (Dropwave) and 10D-Rastrigin (Rastrigin10). In our tests, we used the same parameters and input domains as in (Surjanovic & Bingham).

For each benchmark function, we performed 20 repeated independent trials for each method to show the statistical performance. Three metrics are used to compare the performance of our method and the baselines:

  1. Success rate, i.e., the percentage of successful convergence to the global minimum of each test function.

  2. The average number of iterations needed for global convergence, counting only the successful trials.

  3. The average minimal objective function value found in the unsuccessful trials.

The comparison results for the three metrics are given in Table 1, Table 2, Table 3, respectively.

Table 1 shows that both the DGS-ES method and the baselines perform perfectly on Sphere10 (a convex function) and Branin (a low-dimensional, near-convex function). For non-convex functions, our method performs better than the other baselines when the function has a global funnel-type structure, i.e., the Ackley functions, Levy10, Dropwave, and Rastrigin10. This is due to the nonlocal exploitation ability of the DGS-ES method when used with relatively large standard deviations $\sigma$. However, when a function does not have an obvious funnel shape, e.g., the CrossInTray function, the success rate of the DGS-ES method drops dramatically, regardless of the dimension.

Table 2 indicates the convergence speed of an optimization method in a successful trial. When a method has a 0% success rate in Table 1, a "–" is placed in the corresponding spot in Table 2. We observed that the DGS-ES and ES methods need the most iterations to achieve global convergence, because no adaptive strategy, e.g., an adaptive learning rate, is used to accelerate the convergence. The BFGS method converges very fast when it can find the global minimum. Both the CMA-ES and TuRBO methods have some kind of adaptive strategy, e.g., adjusting the covariance matrix in CMA-ES or the trust region radius in TuRBO, to help achieve faster convergence. However, if the adaptive adjustment is based on incorrect information, those methods may fail to converge, which could be the reason why their success rates are lower than that of the DGS-ES method on most test functions.

Table 3 indicates what kind of local minima an algorithm gets trapped in during an unsuccessful trial. We can see that DGS-ES performs better than the baselines in most cases, due to its nonlocal exploitation capability. For further illustration, we plotted in Figure 2 the DGS-gradient direction and the local gradient direction at 50 random locations on the surface of Ackley2. The standard deviation used in the directional Gaussian smoothing is set to 0.01 in Figure 2 (Left) and to 1.0 in Figure 2 (Right). The figure shows that the DGS-gradient with the larger $\sigma$ mostly points toward the global minimum at $(0, 0)$; a small demo reproducing this behavior is sketched after Table 3.

DGS-ES ES CMA-ES TuRBO BFGS
Ackley2 95% 75% 65% 100% 0%
Ackley5 90% 85% 50% 55% 0%
Ackley10 90% 85% 45% 45% 0%
Branin 100% 100% 100% 100% 100%
Levy10 100% 50% 55% 15% 0%
CrossInTray 60% 5% 30% 65% 10%
Sphere10 100% 100% 100% 100% 100%
Dropwave 100% 100% 55% 100% 10%
Rastrigin10 100% 85% 0% 0% 0%
Table 1: Comparison between the DGS-ES method and the baselines on Metric (a) the success rate, i.e., the percentage of successful convergence to the global minimum.
DGS-ES ES CMA-ES TuRBO BFGS
Ackley2 774 829 38 513 –
Ackley5 1014 1894 59 1278 –
Ackley10 2603 4321 160 3980 –
Branin 205 160 17 152 4
Levy10 2215 16331 70 706 –
CrossInTray 2901 4265 38 490 3
Sphere10 206 996 47 30 3
Dropwave 720 1765 16 458 2
Rastrigin10 1406 7347 – – –
Table 2: Comparison between the DGS-ES method and the baselines on Metric (b) the average iterations needed for global convergence, which only takes into account the successful trials.
DGS-ES ES CMA-ES TuRBO BFGS
Ackley2 6.08 20.33 16.53 - 17.36
Ackley5 18.84 20.62 14.06 2.75 19.32
Ackley10 5.52 19.68 19.01 9.57 19.24
Branin - - - - -
Levy10 - 4.96 6.45 7.59 18.78
CrossInTray -0.995 1.05 2.31 3.49 1.93
Sphere10 - - - - -
Dropwave - - 1.13 - 2.46
Rastrigin10 - 7.98 16.85 21.39 86.31
Table 3: Comparison between the DGS-ES method and other methods on Metric (c), the average minimal objective function value found in the unsuccessful trials. The global minima of the test functions are: Ackley 0, Branin 0.398, Levy10 0, CrossInTray −2.063, Sphere10 0, Dropwave −1, and Rastrigin10 0.
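To reproduce the qualitative behavior in Figure 2, one can evaluate the DGS-gradient of the standard 2-D Ackley function at a test point for two smoothing levels (a hypothetical demo reusing the dgs_gradient sketch from §3.2; the specific point and values here are our own choices, not the paper's):

```python
import numpy as np

def ackley2(x, a=20.0, b=0.2, c=2.0 * np.pi):
    """Standard 2-D Ackley function with global minimum 0 at the origin."""
    d = len(x)
    return (-a * np.exp(-b * np.sqrt(np.sum(x**2) / d))
            - np.exp(np.sum(np.cos(c * x)) / d) + a + np.e)

x0 = np.array([2.3, -1.7])
Xi = np.eye(2)
for s in (0.01, 1.0):
    g = dgs_gradient(ackley2, x0, Xi, sigma=np.full(2, s), M=5)
    # With s = 0.01 the direction tracks the (possibly misleading) local
    # gradient; with s = 1.0 it should point roughly toward the origin.
    print(f"sigma = {s}: descent direction = {-g / np.linalg.norm(g)}")
```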

4.2 Tests on high-dimensional functions

Here we investigate the DGS-ES performance on high-dimensional functions: the Sphere, Ackley (rescaled), Levy and Rastrigin functions in 100D, 1000D and 2000D. The original Ackley function is extremely difficult in high dimensions, because its Lipschitz constant does not grow with the dimension and a large part of its landscape is nearly flat. In order to use the Ackley-type geometry to test our DGS-ES method, we enlarged its Lipschitz constant by multiplying the original Ackley function by 100 and reducing the search domain.

Figure 2: Illustration of the DGS-gradient direction and the local gradient direction at 50 random locations on the surface of Ackley2. (Left) The standard deviation is set to 0.01, such that the DGS-gradient aligns with the local gradient at most locations. (Right) The standard deviation is set to 1.0, such that most DGS-gradients point toward the global minimum at $(0, 0)$.

Figure 3 shows the scalability of the DGS-ES algorithm with the dimension of the objective function. We performed 20 repeated independent trials of each method to show the statistical performance.

Figure 3: Illustration of the dimension dependence of the convergence rate of the DGS-ES method. The convergence rate, i.e., the number of iterations to converge, is independent of the dimension for convex functions, e.g., the Sphere function, which is consistent with our theoretical analysis in §3.3.

In §3.3, we proved in Theorem 1 that the DGS-ES method can achieve a dimension-independent convergence rate when optimizing strongly convex functions. Our numerical results are consistent with Theorem 1. For example, for the Sphere function, the convergence rate does not change when we increase the dimension from 10 to 1000. Moreover, this property carries over to the non-convex Levy and Rastrigin functions. The reason is that both Levy and Rastrigin have a globally near-convex structure once their local minima are smoothed out. In contrast, a large part of the Ackley function's surface is almost flat, apart from the small local minima. In this case, it takes many iterations for DGS-ES to search for the global minimum hidden in the middle of the flat surface.

Figure 4 shows the comparison between DGS-ES and the baselines on the four benchmark functions from Figure 3 in a 2000-dimensional space. For the Sphere function, BFGS is the clear winner due to its optimal performance on convex functions; among the three ES-type methods, DGS-ES performs much better than its competitors. For the other three functions, the DGS-ES method shows significant advantages over the other methods.

Figure 4: Comparison of different blackbox optimization methods on four 2000-dimensional benchmark functions.

4.3 Aerospace hierarchical stiffened shell design

Now we demonstrate the DGS-ES method on a real-world stiffened shell design problem. Due to its high strength and stiffness, the hierarchical stiffened shell has been widely used in aerospace engineering. However, it is challenging to fully explore its optimal buckling load-carrying capacity. The goal of this work is to improve the load-carrying capacity by optimizing the representative unit cell of the hierarchical stiffened shell, where the inputs are 9 size variables (widths, heights and thicknesses) for the major and minor stiffeners, as shown in Figure 5(a), and the output is the load-carrying capability factor, which is calculated by high-fidelity numerical simulation, e.g., the finite element method (FEM); see Figure 5(b). Although high-fidelity FEM simulation is time-consuming, many open-source codes and commercial software packages have improved its computational efficiency via scalable parallelism and GPU acceleration. It is thus feasible to apply the DGS-ES method by running the FEM simulations in parallel on supercomputers. Figure 5 presents the comparison between DGS-ES and the other blackbox optimization methods. We observed that DGS-ES outperforms all the other algorithms, achieving faster convergence within only 100 iterations, and is also more robust than the alternatives.

Figure 5: Illustration and design optimization of hierarchical stiffened shell structures in aerospace engineering, e.g., rockets.

5 Discussion on the limitations of DGS-ES

Despite the successful demonstrations of the DGS-ES method in §4, we recognize several limitations of the current version of the DGS-ES algorithm, including: (a) A naive random perturbation strategy for exploration. As the estimator $\widetilde{\nabla}^M_{\sigma,\Xi}[F]$ is deterministic, an effective random exploration strategy is critically important to the robustness of the algorithm. The current random rotation algorithm does not effectively exploit the degrees of freedom of the special orthogonal group $SO(d)$, so important information may be lost after the perturbation. (b) The need for hyper-parameter tuning. The most sensitive hyper-parameter is the standard deviation $\sigma$. If $\sigma$ is too small, the function is insufficiently smoothed, and the optimizer may be trapped in a local minimum. In contrast, if $\sigma$ is too big, the function is over-smoothed, and the convergence becomes much slower. An adaptive strategy, e.g., the covariance matrix adaptation in CMA-ES, would be needed to alleviate this issue.

Acknowledgement

This material was based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Applied Mathematics program under contract and award number ERKJ352, and by the Artificial Intelligence Initiative at the Oak Ridge National Laboratory (ORNL). ORNL is operated by UT-Battelle, LLC, for the U.S. Department of Energy under Contract DE-AC05-00OR22725.

References

  • Abramowitz & Stegun (1972) Abramowitz, M. and Stegun, I. (eds.). Handbook of Mathematical Functions. Dover, New York, 1972.
  • Berahas et al. (2019a) Berahas, A. S., Cao, L., Choromanski, K., and Scheinberg, K. Linear interpolation gives better gradients than gaussian smoothing in derivative-free optimization. arXiv preprint arXiv:1905.13043, 2019a.
  • Berahas et al. (2019b) Berahas, A. S., Cao, L., Choromanski, K., and Scheinberg, K. A theoretical and empirical comparison of gradient approximations in derivative-free optimization. arXiv preprint arXiv:1905.01332, 2019b.
  • Choromanski et al. (2018) Choromanski, K., Rowland, M., Sindhwani, V., Turner, R. E., and Weller, A. Structured evolution with compact architectures for scalable policy optimization. International Conference on Machine Learning, pp. 969–977, 2018.
  • Choromanski et al. (2019a) Choromanski, K., Pacchiano, A., Parker-Holder, J., and Tang, Y. From complexity to simplicity: Adaptive ES-active subspaces for blackbox optimization. NeurIPS, 2019a.
  • Choromanski et al. (2019b) Choromanski, K., Pacchiano, A., Parker-Holder, J., and Tang, Y. Provably robust blackbox optimization for reinforcement learning. arXiv preprint arXiv:1903.02993, 2019b.
  • Cui et al. (2018) Cui, X., Zhang, W., Tüske, Z., and Picheny, M. Evolutionary stochastic gradient descent for optimization of deep neural networks. NeurIPS, 2018.
  • Eriksson et al. (2019) Eriksson, D., Pearce, M., Gardner, J., Turner, R. D., and Poloczek, M. Scalable global optimization via local bayesian optimization. In Advances in Neural Information Processing Systems, pp. 5497–5508, 2019.
  • Fazel et al. (2018) Fazel, M., Ge, R., Kakade, S. M., and Mesbahi, M. Global convergence of policy gradient methods for the linear quadratic regulator. Proceedings of the 35th International Conference on Machine Learning, 80:1467–1476, 2018.
  • Flaxman et al. (2005) Flaxman, A. D., Kalai, A. T., and McMahan, H. B. Online convex optimization in the bandit setting: gradient descent without a gradient. Proceedings of the 16th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 385–394, 2005.
  • Ha & Schmidhuber (2018) Ha, D. and Schmidhuber, J. Recurrent world models facilitate policy evolution. Advances in Neural Information Processing Systems, pp. 2450–2462, 2018.
  • Hansen (2006) Hansen, N. The cma evolution strategy: a comparing review. In Towards a New Evolutionary Computation, pp. 75–102. Springer, 2006.
  • Hansen (2016) Hansen, N. The cma evolution strategy: A tutorial. arXiv preprint arXiv:1604.00772, 2016.
  • Hansen & Ostermeier (2001) Hansen, N. and Ostermeier, A. Completely derandomized self-adaptation in evolution strategies. Evolutionary computation, 9(2):159–195, 2001.
  • Houthooft et al. (2018) Houthooft, R., Chen, Y., Isola, P., Stadie, B., Wolski, F., Ho, O. J., and Abbeel, P. Evolved policy gradients. Advances in Neural Information Processing Systems, pp. 5400–5409, 2018.
  • Larson et al. (2019) Larson, J., Menickelly, M., and Wild, S. M. Derivative-free optimization methods. Acta Numerica, 28:287–404, 2019.
  • Maggiar et al. (2018) Maggiar, A., Wachter, A., Dolinskaya, I. S., and Staum, J. A derivative-free trust-region algorithm for the optimization of functions smoothed via gaussian convolution using adaptive multiple importance sampling. SIAM Journal on Optimization, 28(2):1478–1507, 2018.
  • Maheswaranathan et al. (2019) Maheswaranathan, N., Metz, L., Tucker, G., Choi, D., and Sohl-Dickstein, J. Guided evolutionary strategies: Augmenting random search with surrogate gradients. Proceedings of the 36th International Conference on Machine Learning, 2019.
  • Meier et al. (2019) Meier, F., Mujika, A., Gauy, M. M., and Steger, A. Improving gradient estimation in evolutionary strategies with past descent directions. Optimization Foundations for Reinforcement Learning Workshop at NeurIPS 2019, 2019.
  • Metz et al. (2018) Metz, L., Maheswaranathan, N., Nixon, J., Freeman, D., and Sohl-dickstein, J. Learned optimizers that outperform sgd on wall-clock and test loss. 2nd Workshop on Meta-Learning at NeurIPS, 2018.
  • Miikkulainen et al. (2017) Miikkulainen, R., Liang, J., Meyerson, E., Rawal, A., Fink, D., Francon, O., Raju, B., Shahrzad, H., Navruzyan, A., Nuffy, N., and Hodjat, B. Evolving deep neural networks. arXiv preprint arXiv:1703.00548, 2017.
  • Morse & Stanley (2016) Morse, G. and Stanley, K. O. Simple evolutionary optimization can rival stochastic gradient descent in neural networks. Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2016), pp. 477–484, 2016.
  • Nesterov (2004) Nesterov, Y. Introductory Lectures on Convex Optimization. Springer US, 2004.
  • Nesterov & Spokoiny (2015) Nesterov, Y. and Spokoiny, V. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, pp. 1–40, 2015.
  • Quarteroni et al. (2007) Quarteroni, A., Sacco, R., and Saleri, F. Numerical Mathematics, volume 332. Springer Science & Business Media, 2007.
  • Real et al. (2017) Real, E., Moore, S., Selle, A., Saxena, S., Suematsu, Y. L., Tan, J., Le, Q. V., and Kurakin, A. Large-scale evolution of image classifiers. International Conference on Machine Learning (ICML), pp. 2902–2911, 2017.
  • Rechenberg & Eigen (1973) Rechenberg, I. and Eigen, M. Evolutionsstrategie: Optimierung Technischer Systeme nach Prinzipien der Biologischen Evolution. Frommann-Holzboog Stuttgart, 1973.
  • Rios & Sahinidis (2009) Rios, L. M. and Sahinidis, N. V. Derivative-free optimization: a review of algorithms and comparison of software implementations. J Glob Optim, 56:1247–1293, 2009.
  • Salimans et al. (2017) Salimans, T., Ho, J., Chen, X., and Sutskever, I. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.
  • Schwefel (1977) Schwefel, H.-P. Numerische optimierung von computer-modellen mittels der evolutionsstrategie. 1977.
  • (31) Surjanovic, S. and Bingham, D. Virtual library of simulation experiments: Test functions and datasets. Retrieved February 6, 2020, from http://www.sfu.ca/~ssurjano.
  • Wierstra et al. (2014) Wierstra, D., Schaul, T., Glasmachers, T., Sun, Y., Peters, J., and Schmidhuber, J. Natural evolution strategies. Journal of Machine Learning Research, 15:949–980, 2014.