Natural Evolution Strategies (NES) [22, 8, 21, 23] is a promising framework for black-box continuous optimization problems. Instead of directly seeking the optimal solution, NES optimizes the parameters of a probability distribution. The expectation of the objective function over the solution space is minimized by repeatedly updating the distribution parameters based on the estimated natural gradient. In this study, we focus on NES using a multivariate normal distribution as the probability distribution. The natural gradient plays an important role in evolution strategies and randomized algorithms: for example, the rank-μ update [10, 9] of CMA-ES can also be regarded as using an estimate of the natural gradient. Information Geometric Optimization, a generalized framework covering NES and the rank-μ update of CMA-ES, has been actively studied in recent years [1, 6, 19].
As with other evolution strategies, one of the critical hyperparameters in NES is the learning rate for the parameters of the probability distribution. If the learning rate is too high, the parameter update becomes unstable and performance deteriorates. On the other hand, if the learning rate is too low, progress toward the optimal solution is slow, again resulting in poor performance. Therefore, setting an appropriate learning rate is essential for maximizing the performance of NES.
There are a few studies on learning rate adaptation in NES. DX-NES, proposed by Fukushima et al., switches the learning rate based on the norm of an evolution path that accumulates the movement of the normalized mean vector. The effectiveness of this switching in DX-NES has been demonstrated empirically. In fact, the recently proposed DX-NES variants [17, 16], which also employ learning rate switching, show promising performance on unconstrained and implicitly constrained black-box optimization problems. Another learning rate adaptation method, based on maximum likelihood estimation, has been proposed in the CMA-ES literature.
In this paper, we propose a new learning rate adaptation mechanism from the viewpoint of the natural gradient method; our work is based on the principle that the learning rate of the natural gradient method should depend on its estimation accuracy. To measure the estimation accuracy of the natural gradient, we calculate the movement in Kullback-Leibler (KL) divergence, a quantity first introduced in the population size adaptation of the CMA-ES [14, 15]. We extend this notion to learning rate adaptation in NES.
The aim of this study is to understand the behavior of NES with the learning rate adaptation mechanism, rather than to develop a method that achieves state-of-the-art performance, as this is the first work on learning rate adaptation based on the estimation accuracy of the natural gradient in NES. To that end, we incorporate the learning rate adaptation into xNES, a simple and promising variant of NES.
The rest of this paper is organized as follows. In Section 2, we describe the xNES algorithm. In Section 3, we propose a learning rate adaptation mechanism based on the estimation accuracy of the natural gradient. In Section 4, we experiment on unimodal and multimodal benchmark problems to investigate the effect of the learning rate adaptation mechanism. Section 5 concludes with a summary and future directions of this work.
xNES uses a multivariate normal distribution as the probability distribution. Here, m is the mean vector, σ is the step-size, and B is the normalized transformation matrix, where det(B) = 1. The update of xNES is performed by using an estimated natural gradient in the parameter space of the multivariate normal distribution.
xNES first initializes the parameters m, σ, and B. Then, the following steps are repeated until a stopping criterion is met.
Step 1. Sampling and Sorting
For i = 1, …, λ, sample solutions as follows: generate d-dimensional standard normal vectors z_i and compute x_i = m + σ B z_i. Evaluate the generated solutions on the objective function and obtain their objective values. Then, sort the solutions according to their evaluation values.
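The sampling and sorting step can be sketched in Python as follows. This is a minimal sketch; the function name `sample_and_sort` and the argument layout are our own, not part of the original algorithm description:

```python
import numpy as np

def sample_and_sort(f, m, sigma, B, lam, rng):
    """One sampling step (sketch): draw lam standard normal vectors z_i,
    map them to candidate solutions x_i = m + sigma * B z_i, and sort
    both z_i and x_i by the objective value (ascending, for minimization)."""
    d = m.shape[0]
    Z = rng.standard_normal((lam, d))      # z_i ~ N(0, I)
    X = m + sigma * Z @ B.T                # x_i = m + sigma * B z_i
    order = np.argsort([f(x) for x in X])  # best (smallest) first
    return Z[order], X[order]
```

Sorting the latent samples `Z` alongside the solutions is what allows the natural gradient to be estimated directly in the normalized coordinate system in the next step.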
Step 2. Estimating Natural Gradient
Estimate the natural gradient based on the sorted solutions as follows:

G_δ = Σ_{i=1}^{λ} w_i z_i,   G_M = Σ_{i=1}^{λ} w_i (z_i z_i^T − I),

where w_i is the weight assigned to the i-th best solution by the weight function. The weights satisfy Σ_{i=1}^{λ} w_i = 0. Note that xNES uses the weight function instead of raw evaluation values. This technique is called fitness shaping; it improves the robustness of the algorithm through invariance to monotone transformations of the objective function and enables linear convergence.
Step 3. Updating Parameters
Based on the estimated natural gradient, update the parameters of the multivariate normal distribution as follows:

m ← m + η_m σ B G_δ,
σ ← σ · exp((η_σ / 2) · tr(G_M) / d),
B ← B · exp((η_B / 2) · (G_M − (tr(G_M) / d) I)),

where η_m, η_σ, and η_B are the learning rates for updating m, σ, and B, respectively, and exp(·) on the last line denotes the matrix exponential. The default values are η_m = 1 and η_σ = η_B = (9 + 3 ln d) / (5 d √d).
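Steps 2 and 3 can be sketched together in Python. This follows the standard xNES formulation (utility-based fitness shaping and a multiplicative update of B); the helper names and the eigendecomposition-based matrix exponential are implementation choices of this sketch, not prescribed by the paper:

```python
import numpy as np

def utilities(lam):
    """Standard xNES fitness-shaping weights: u_i proportional to
    max(0, ln(lam/2 + 1) - ln i), normalized, then shifted by -1/lam
    so that the weights sum to zero."""
    ranks = np.arange(1, lam + 1)
    u = np.maximum(0.0, np.log(lam / 2 + 1) - np.log(ranks))
    return u / u.sum() - 1.0 / lam

def xnes_update(m, sigma, B, Z_sorted, w, eta_m, eta_sigma, eta_B):
    """One xNES parameter update (sketch). Z_sorted holds the standard
    normal samples sorted by fitness; w are the shaping weights."""
    d = m.shape[0]
    G_delta = w @ Z_sorted                                   # mean component
    G_M = sum(wi * (np.outer(z, z) - np.eye(d)) for wi, z in zip(w, Z_sorted))
    G_s = np.trace(G_M) / d                                  # step-size component
    G_B = G_M - G_s * np.eye(d)                              # traceless shape component
    m = m + eta_m * sigma * B @ G_delta
    sigma = sigma * np.exp(0.5 * eta_sigma * G_s)
    # matrix exponential of the symmetric matrix (eta_B/2) * G_B via eigh
    vals, vecs = np.linalg.eigh(0.5 * eta_B * G_B)
    B = B @ (vecs * np.exp(vals)) @ vecs.T
    return m, sigma, B
```

Splitting G_M into a trace part (for σ) and a traceless part (for B) keeps det(B) = 1, so the scale of the distribution is carried by σ alone.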
3 Learning Rate Adaptation
While default values of the learning rates are presented in xNES, Fukushima et al. have pointed out that the default values are too conservative in certain situations and that there is much room for improvement. However, simply increasing the learning rate causes performance degradation on problems where the natural gradient is difficult to estimate. It is thus important to adapt the learning rate according to the search situation in order to maximize the performance of xNES.
In this work, we adapt the learning rates η_σ and η_B. That is, we focus only on the learning rates related to the covariance matrix. We fix η_m = 1, which is the default value presented for xNES and widely used in the CMA-ES literature [10, 12] as well.
To this end, we introduce a learning rate adaptation mechanism that dynamically adapts the learning rates based on the estimation accuracy of the natural gradient. To quantify this accuracy, we introduce an evolution path in the parameter space, which accumulates successive parameter movements. The length of the evolution path, described in detail in Section 3.2, is used to measure the accuracy of the natural gradient. If the length of the evolution path is larger than its expectation under a random function, we consider the accuracy high, as the tendency of the parameter update is being captured. Conversely, if the length of the evolution path is close to its expectation under a random function, the estimation is dominated by noise, and we consider the accuracy low.
In this study, we consider an evolution path in the parameter space of only the covariance matrix, not the mean vector, because the learning rate for the mean vector is fixed. This is different from existing studies that use the evolution path in the parameter space [14, 15]. We will investigate the behavior of the evolution path in Section 4.
3.2 Evolution Path for Covariance Matrix
In this work, we introduce an evolution path in the parameter space of the covariance matrix to quantify the estimation accuracy of the natural gradient. We use a modification of the evolution path proposed in prior work, which considers both the mean vector and the covariance matrix. The covariance movement matrix is defined to capture the movement of the covariance matrix from one iteration to the next, and is updated as
We then define the evolution path in the parameter space of the covariance matrix as follows:

where β is a cumulation factor of the evolution path and F is the Fisher information matrix of the covariance matrix of the multivariate normal distribution. The expectation is taken under a random function, in which each evaluation value is drawn independently from an identical distribution. We use an approximation of this expectation, which will be derived in Section 3.4.
Using the result from Eq. (21) and Appendix B in [15], we define the length of the evolution path, which represents the movement of the KL divergence in the parameter space of the covariance matrix, as follows:
Although we do not consider the movement of the KL divergence in the parameter space of the mean vector, we retain the notation of the parameter space of the full probability distribution.
Under a random function, the length of the evolution path approaches its expected value as the iteration count increases. Therefore, by comparing the length of the evolution path with the normalization factor, which is updated as

we can estimate the accuracy of the parameter update. The normalization factor is initialized to 0.
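The accumulation of the evolution path and its normalization factor can be sketched as follows, assuming a PSA-style recurrence with cumulation factor β (here `beta`) and a pre-normalized parameter movement `delta_tilde`; the exact normalization in the paper may differ:

```python
import numpy as np

def update_evolution_path(p, gamma, delta_tilde, beta):
    """Accumulate the pre-normalized parameter movement delta_tilde into
    the evolution path p, and update the normalization factor gamma.
    Under a random function E[||p||^2] approaches gamma, so a squared
    path length near gamma indicates noise-dominated (low-accuracy)
    natural gradient estimates."""
    p = (1.0 - beta) * p + np.sqrt(beta * (2.0 - beta)) * delta_tilde
    gamma = (1.0 - beta) ** 2 * gamma + beta * (2.0 - beta)
    return p, gamma
```

With this recurrence the fixed point of gamma is 1, consistent with the observation later in the paper that the path length stays close to 1 under a random function.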
3.3 Updating Learning Rate
In this section, we give a procedure for the learning rate adaptation. As described, we argue that the learning rate should depend on the estimation accuracy of the natural gradient. When the accuracy is high, the learning rate should be increased, and when the accuracy is low, the learning rate should be decreased.
The learning rate adaptation is performed as follows:
where the constants above are pre-defined hyperparameters. It is possible to set different hyperparameters for η_σ and η_B, respectively, if needed. In this study, we employ the same values for both, for easier interpretation.
We clip the learning rates to prevent them from being updated to unexpected ranges by the following equations:
where η_σ^max and η_σ^min are the maximum and minimum values of the learning rate for the step-size σ, respectively. Similarly, η_B^max and η_B^min are the maximum and minimum values of the learning rate for the normalized transformation matrix B, respectively. The clip function restricts its argument to the given interval.
To prevent extrapolation in the parameter update, we set the maximum value of the learning rates to 1, i.e., η_σ^max = η_B^max = 1. The minimum value of the learning rates is set to the default value of xNES, as it has been pointed out that the learning rate settings in xNES are often too conservative; that is, η_σ^min and η_B^min are set to the recommended default values.
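One plausible realization of the multiplicative learning-rate update with clipping is sketched below. The specific functional form (exponential in the ratio of the squared path length to its expectation, with damping `beta_eta` and threshold `alpha`) is an assumption made for illustration, not the paper's exact rule:

```python
import numpy as np

def adapt_learning_rate(eta, p_norm_sq, gamma, alpha, beta_eta, eta_min, eta_max):
    """Hypothetical multiplicative update: raise eta when the squared
    evolution-path length exceeds alpha * gamma (high estimation
    accuracy), lower it otherwise, then clip into [eta_min, eta_max]."""
    eta = eta * np.exp(beta_eta * (p_norm_sq / (alpha * gamma) - 1.0))
    return float(np.clip(eta, eta_min, eta_max))
```

The clipping implements the bounds above: the update can never push the learning rate past 1 (no extrapolation) or below the conservative xNES default.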
3.4 Approximation of the Expectation

In this section, we derive an approximation of the expectation appearing in the normalization, which represents the change of the KL divergence in terms of the covariance matrix under a random function.
To derive the approximation, we apply the Slepian-Bangs formula [20, 5] to express this change of the KL divergence in terms of moments of the parameter update, and then take its expectation. From the result provided by Nishida and Akimoto [15], part of the expectation is available in closed form; the remaining terms are approximated below.
Derivation of the step-size term:
The update equation of step-size in xNES can be rewritten as
Then, by the second-order Taylor expansion, the change due to the step-size update can be expanded in terms of the estimated natural gradient. We calculate the expectations appearing in this expansion using the properties of the weight function and the independence of the samples. By combining these results, we thus obtain
Derivation of the transformation matrix term:
The first-order Taylor expansion of the update of the normalized transformation matrix in xNES can be obtained as
Taking the expectation of this expansion,
Derivation of the cross terms:
Using the properties of the estimated natural gradient, these terms reduce to expectations of products of the weighted samples, which we evaluate using the independence of the samples and the normalization of the weight function.
By combining these results,
From the results above,
We recalculate this approximation at every iteration because it depends on the dynamically changing learning rates η_σ and η_B.
3.5 Overall Procedure
The evolution path is initialized to the zero matrix. The procedures in lines 3-14 are the same as in xNES. In line 15, the covariance movement matrix is updated. In line 16, the expectation of the length of the evolution path under a random function is approximated using Eq. (9). In line 17, the evolution path in the parameter space of the covariance matrix is updated. In line 18, the length of the evolution path is calculated. In line 19, the normalization factor for the evolution path is updated. In lines 20-21, the learning rates for the step-size and the normalized transformation matrix are updated with clipping.
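To make the line-by-line description concrete, the following is a minimal, self-contained xNES loop with fixed learning rates, corresponding to lines 3-14 of the overall procedure; the adaptation steps (lines 15-21) are omitted here, and the default-style hyperparameter formulas are common xNES settings assumed for this sketch:

```python
import numpy as np

def xnes_minimize(f, m0, sigma0, iters=1000, seed=1):
    """Minimal fixed-learning-rate xNES loop (sketch). Returns the final
    mean vector and the best objective value seen."""
    rng = np.random.default_rng(seed)
    d = m0.shape[0]
    lam = 4 + int(3 * np.log(d))                    # common population-size heuristic
    eta_m = 1.0
    eta_sigma = eta_B = (9 + 3 * np.log(d)) / (5 * d * np.sqrt(d))
    ranks = np.arange(1, lam + 1)
    u = np.maximum(0.0, np.log(lam / 2 + 1) - np.log(ranks))
    w = u / u.sum() - 1.0 / lam                     # fitness-shaping weights
    m, sigma, B = m0.astype(float), float(sigma0), np.eye(d)
    best = np.inf
    for _ in range(iters):
        Z = rng.standard_normal((lam, d))           # sample z_i ~ N(0, I)
        X = m + sigma * Z @ B.T                     # x_i = m + sigma * B z_i
        fs = np.array([f(x) for x in X])
        best = min(best, fs.min())
        Z = Z[np.argsort(fs)]                       # sort samples by fitness
        G_delta = w @ Z                             # natural gradient, mean part
        G_M = sum(wi * (np.outer(z, z) - np.eye(d)) for wi, z in zip(w, Z))
        G_s = np.trace(G_M) / d
        G_B = G_M - G_s * np.eye(d)
        m = m + eta_m * sigma * B @ G_delta
        sigma *= np.exp(0.5 * eta_sigma * G_s)
        vals, vecs = np.linalg.eigh(0.5 * eta_B * G_B)
        B = B @ (vecs * np.exp(vals)) @ vecs.T      # matrix exponential update
    return m, best
```

Adding the adaptation would amount to maintaining the covariance movement matrix, the path, and the normalization factor after the update, then rescaling `eta_sigma` and `eta_B` with clipping each iteration.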
In this section, we investigate the following research questions (RQs).
When the learning rate is fixed, how does the evolution path in Eq. (2) of xNES behave on unimodal and multimodal functions?
How is the learning rate adapted in xNES with the proposed learning rate adaptation mechanism?
Does xNES with the proposed learning rate adaptation mechanism achieve better performance than xNES with fixed learning rates?
We first describe the experimental setups in Section 4.1. In Section 4.2, we investigate the behavior of the evolution path in xNES with a fixed learning rate (RQ1). We then investigate the behavior of the evolution path and the learning rate in xNES with the adaptive learning rate mechanism (RQ2) in Section 4.3. Finally, we compare the performance of xNES with the proposed adaptive learning rate mechanism and that with fixed learning rates (RQ3) in Section 4.4. The code for running the proposed method is available at https://github.com/nomuramasahir0/xnes-adaptive-lr.
4.1 Experimental Setups
Table 1 shows the definitions of the benchmark problems used in the experiment. We employ two unimodal functions (Sphere and Ellipsoid) and two multimodal functions (Rastrigin and Bohachevsky). While the Rastrigin function has strong multimodality, the Bohachevsky function has relatively weak multimodality. In this experiment, we set the dimension to . The initial parameters are set to in the Sphere, Ellipsoid, and Rastrigin functions, and in the Bohachevsky function.
The hyperparameters for the proposed learning rate adaptation mechanism are set as described in Section 3.3. We set the remaining mechanism-specific values based on our preliminary experiments, and the other hyperparameters are set to their default values.
4.2 Evolution Path with Fixed Learning Rate
Figure 1 shows a typical behavior of the best evaluation value and the length of the evolution path of xNES with a fixed learning rate on the benchmark problems. We use the default learning rate and set the population size to obtain a reliable estimation of the evolution path.
In the results for the Sphere and Ellipsoid functions, where the best evaluation value improves quickly, the length of the evolution path becomes long. We believe this is because the estimation accuracy of the natural gradient should be high on relatively easy objective functions (e.g., unimodal functions).
On the other hand, on the multimodal functions, where the best evaluation value may not improve easily, behavior different from that on the unimodal functions appears. On the Bohachevsky function, which is relatively weakly multimodal, we can observe that the length of the evolution path temporarily decreases slightly. On the Rastrigin function, which has strong multimodality, such decreasing behavior is prominent in the beginning of the optimization. In fact, the length of the evolution path takes a value close to 1, the expected amount of change in KL divergence under a random function, in the period where the number of evaluations is between about and about .
4.3 Behavior of Learning Rate Adaptation
A typical behavior of xNES with the proposed learning rate adaptation mechanism is depicted in Figure 2. In addition to the learning rates η_σ and η_B, the corresponding objective function value and the length of the evolution path are also shown. We employ for the Sphere and Ellipsoid functions, for the Rastrigin function, and for the Bohachevsky function. For each function, it is observed that the learning rates increase when the length of the evolution path increases.
To investigate the effect of the population size setting, we conduct an experiment with , and on the -dimensional Sphere function. Figure 3 shows the result. For the smallest population size, the length of the evolution path does not increase, and the learning rates η_σ and η_B are consequently not changed at all. We think this is because the accuracy of the parameter update is low under a small population size. We observe that, as the population size is increased, the length of the evolution path increases, and, as a result, the learning rate also increases. This result suggests that the proposed mechanism can adapt the learning rate appropriately by measuring the estimation accuracy of the natural gradient. This dynamic learning rate adaptation depending on the population size is an advantage over DX-NES, which statically injects the population size into the setting of the learning rate.
4.4 Fixed Learning Rate vs. Adaptive Learning Rate
To check the effectiveness of the proposed mechanism, we compare the performance of xNES with the proposed learning rate adaptation mechanism and that of xNES with fixed learning rates (= the default value , and ). The performance metric is the average number of evaluations until the best evaluation value reaches a target function value over successful trials, divided by the success rate. The target function value is set to . A trial is successful if the target function value is reached. We set the maximum number of evaluations to . For the Sphere and Ellipsoid functions, we employ the population sizes , and . Note that the recommended value of the population size is included. For the Rastrigin function, we employ the population sizes , and . For the Bohachevsky function, we employ the population sizes , and . We perform trials to calculate the performance metric for the Sphere and Ellipsoid functions. We perform trials to calculate it for the Rastrigin and Bohachevsky functions.
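The performance metric described above (average evaluations over successful trials divided by the success rate, an SP1-style measure) can be computed as follows; the function name is our own:

```python
def performance_metric(evals_successful, n_trials):
    """Average number of evaluations over successful trials divided by
    the success rate. Penalizes both slow convergence and unreliability;
    returns infinity when no trial succeeded."""
    if not evals_successful:
        return float("inf")
    success_rate = len(evals_successful) / n_trials
    mean_evals = sum(evals_successful) / len(evals_successful)
    return mean_evals / success_rate
```

For example, two successful trials at 100 and 200 evaluations out of four total trials give (150) / (0.5) = 300, reflecting the expected cost including restarts after failures.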
Figure 4 shows the result of the experiment. We first compare the proposed mechanism (red) and xNES with the default learning rate (blue). On the Sphere and Ellipsoid functions, when the population size is small, the performance is almost the same, which is consistent with the observation in Section 4.3. As the population size increases, the proposed mechanism shows better performance than xNES with the default learning rate. This is because the estimation accuracy of the natural gradient should become high when the population size is large, which increases the learning rate and accelerates the search. On the Rastrigin and Bohachevsky functions, the proposed mechanism outperforms xNES with the default learning rate thanks to the adaptive learning rate.
Next, we compare the proposed mechanism (red) and xNES with other fixed learning rates. On all the benchmark problems, when the population size is large, the performance of the proposed mechanism is close to that of xNES with the fixed learning rate of the default value times (pink). However, xNES with that high fixed learning rate fails to find the optimum on the Sphere and Ellipsoid functions with small population sizes (, and ) because the learning rate is too high. In contrast, by measuring the estimation accuracy of the natural gradient, the proposed mechanism does not increase the learning rate as much when the population size is small, which enables a stable search.
From the results on the multimodal functions, we can observe that the proposed mechanism is competitive with xNES with high learning rates when the population size is large. In particular, on the Rastrigin function, the proposed mechanism and xNES with the higher fixed learning rates achieve almost the same performance in terms of the average number of evaluations of successful trials divided by the success rate. This means that the number of evaluations required to find the optimum is about the same if an appropriate restart is performed whenever a trial fails to find the optimum. Figure 5 shows the success rates of the proposed mechanism (red), xNES with the fixed learning rate of the default value times (pink), and xNES with the fixed learning rate of the default value times (cyan). While these methods are competitive when the population size is large, xNES with a fixed learning rate is more likely to fail when the population size is small. This result suggests that the proposed mechanism is more robust than xNES with a fixed learning rate. The higher success rate of the proposed mechanism is also practically beneficial, as it is often difficult to implement an appropriate restart strategy.
In this paper, we proposed a novel learning rate adaptation mechanism for NES. The proposed mechanism adapts the learning rate based on the estimation accuracy of the natural gradient, inspired by the population size adaptation mechanism of the CMA-ES [14, 15]. We introduced an evolution path in the parameter space of the covariance matrix and, based on its length, update the learning rates related to the covariance matrix. Numerical experiments using unimodal and multimodal benchmark functions demonstrated that the proposed mechanism can appropriately adapt the learning rates depending on the estimation accuracy of the natural gradient. Additionally, xNES with the proposed mechanism achieved performance comparable to that of xNES with an appropriate fixed learning rate, which cannot be obtained without a prior parameter survey.
This study focused on proposing a principled learning rate adaptation and did not conduct exhaustive experiments. Verifying the performance of the proposed mechanism under a wider range of experimental settings is thus an important future direction.
The authors thank anonymous reviewers for their helpful comments. This work was partially supported by JSPS KAKENHI Grant Number JP20K11986.
-  (2014) Comparison-Based Natural Gradient Optimization in High Dimension. In Proceedings of the 2014 Annual Conference on Genetic and Evolutionary Computation, pp. 373–380. Cited by: §1.
-  (2010) Bidirectional Relation between CMA Evolution Strategies and Natural Evolution Strategies. In International Conference on Parallel Problem Solving from Nature, pp. 154–163. Cited by: §1.
-  (1998) Why natural gradient?. In Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP’98 (Cat. No. 98CH36181), Vol. 2, pp. 1213–1216. Cited by: §1.
-  (2005) A Restart CMA Evolution Strategy With Increasing Population Size. In 2005 IEEE congress on evolutionary computation, Vol. 2, pp. 1769–1776. Cited by: §4.4.
-  (1971) Array Processing with Generalized Beam-Formers. Yale University. Cited by: §3.4.
-  (2014) Convergence Analysis of Evolutionary Algorithms That Are Based on the Paradigm of Information Geometry. Evolutionary Computation 22 (4), pp. 679–709. Cited by: §1, §2.
-  (2011) Proposal of distance-weighted exponential natural evolution strategies. In 2011 IEEE Congress of Evolutionary Computation (CEC), pp. 164–171. Cited by: §1, §3.1, §3.3, §4.3.
-  (2010) Exponential Natural Evolution Strategies. In Proceedings of the 12th annual conference on Genetic and evolutionary computation, pp. 393–400. Cited by: §1, §1, §2, §2, §3.1, §3.3, §3.5, §4.4.
-  (2004) Evaluating the CMA Evolution Strategy on Multimodal Test Functions. In International Conference on Parallel Problem Solving from Nature, pp. 282–291. Cited by: §1.
-  (2003) Reducing the Time Complexity of the Derandomized Evolution Strategy with Covariance Matrix Adaptation (CMA-ES). Evolutionary computation 11 (1), pp. 1–18. Cited by: §1, §3.1.
-  (2006) The CMA Evolution Strategy: A Comparing Review. Towards a new evolutionary computation, pp. 75–102. Cited by: §1.
-  (2016) The CMA Evolution Strategy: A Tutorial. arXiv preprint arXiv:1604.00772. Cited by: §3.1.
-  (2014) Maximum Likelihood-based Online Adaptation of Hyper-Parameters in CMA-ES. In International Conference on Parallel Problem Solving from Nature, pp. 70–79. Cited by: §1.
-  (2016) Population Size Adaptation for the CMA-ES based on the Estimation Accuracy of the Natural Gradient. In Proceedings of the Genetic and Evolutionary Computation Conference 2016, pp. 237–244. Cited by: §1, §3.1, §3.2, §5.
-  (2018) PSA-CMA-ES: CMA-ES with population size adaptation. In Proceedings of the Genetic and Evolutionary Computation Conference, pp. 865–872. Cited by: §1, §3.1, §3.1, §3.2, §3.4, §5.
-  (2021) Natural Evolution Strategy for Unconstrained and Implicitly Constrained Problems with Ridge Structure. In 2021 IEEE Symposium Series on Computational Intelligence (SSCI), pp. 1–7. Cited by: §1.
-  (2021) Distance-weighted Exponential Natural Evolution Strategy for Implicitly Constrained Black-Box Function Optimization. In 2021 IEEE Congress on Evolutionary Computation (CEC), pp. 1099–1106. Cited by: §1.
-  (2017) Information-Geometric Optimization Algorithms: A Unifying Picture via Invariance Principles. The Journal of Machine Learning Research 18 (1), pp. 564–628. Cited by: §1.
-  (2020) Information-geometric optimization with natural selection. Entropy 22 (9), pp. 967. Cited by: §1.
-  (1954) Estimation of signal parameters in the presence of noise. Transactions of the IRE Professional Group on Information Theory 3 (3), pp. 68–89. Cited by: §3.4.
-  (2009) Efficient Natural Evolution Strategies. In Proceedings of the 11th Annual conference on Genetic and evolutionary computation, pp. 539–546. Cited by: §1.
-  (2014) Natural Evolution Strategies. The Journal of Machine Learning Research 15 (1), pp. 949–980. Cited by: §1.
-  (2009) Stochastic Search Using the Natural Gradient. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 1161–1168. Cited by: §1.