1. Introduction
One of the key reasons for the success of metaheuristics is their general-purposeness. Indeed, Evolutionary Algorithms (EAs), Swarm Intelligence (SI) algorithms and the like can be applied, more or less straightforwardly, to a broad range of optimization problems. On the other hand, it is well-established that different algorithms can produce different results on a given problem, and in fact it is impossible to identify an algorithm that works better than any other algorithm on all possible problems (Wolpert and Macready, 1997).
Moreover, the performance of metaheuristics typically depends on their hyperparameters. However, optimal parameters are usually problem-dependent, and finding those parameters before performing an optimization process through trial-and-error or other empirical approaches is usually tedious, and obviously suboptimal. One possible alternative is given by hyperheuristics (Burke et al., 2013; Drake et al., 2020; Sánchez et al., 2020), i.e., algorithms that can either select the best metaheuristic for a given problem (Nareyek, 2003; Chakhlevitch and Cowling, 2008; Li et al., 2017), or simply optimize the parameters of a given metaheuristic. Several tools, e.g. irace (López-Ibáñez et al., 2016), exist for this purpose.
Another possibility is to endow the metaheuristic with a parameter adaptation strategy, i.e., a set of rules that changes the parameters dynamically during the optimization process. Several successful hand-crafted policies have been proposed over the years to address parameter adaptation (Cotta et al., 2008). However, finding an optimal adaptation policy is, in turn, challenging, as different policies may perform differently on different problems or during different stages of an optimization process. Moreover, exploring the space of such policies manually is infeasible. On the other hand, it is possible to cast the search for an adaptation policy as a reinforcement learning (RL) problem (Sutton and Barto, 2018), where the agent observes the state of the optimization process and decides how to change the parameters accordingly. However, only a few attempts have been made so far in this direction. This is mostly due to the fact that the observation space of an optimization process can be quite large, and finding relevant state metrics (i.e., inputs to the policy) and rewards can be difficult.
Here, we aim to make steps in this direction by introducing a general-purpose framework for performing parameter adaptation in continuous-domain metaheuristics based on state-of-the-art RL. One reason for building such a framework is to relieve algorithm designers and practitioners from the need to build hand-crafted adaptation strategies. Moreover, such a framework makes it possible to reuse pre-trained strategies and apply them to new optimization problems.
In the experimentation, we focus on two well-known continuous optimization algorithms (assuming, without loss of generality, minimization of the objective function/fitness), namely the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) (Hansen and Ostermeier, 2001) and Differential Evolution (DE) (Storn and Price, 1997), for which well-known successful hand-crafted adaptation policies exist. In the case of CMA-ES, we train an adaptation policy for the step-size σ. In the case of DE, we instead adapt the scale factor F and the crossover rate CR. We train these policies on a set of 46 benchmark functions at different dimensionalities, with various state metrics, in two settings: one policy per function, and one global policy for all functions. Compared, respectively, to the Cumulative Step-size Adaptation (CSA) policy (Chotard et al., 2012) and to two well-known adaptive DE variants (iDE (Elsayed et al., 2011) and jDE (Brest et al., 2006)), our policies are able to produce competitive results, especially in the case of DE.
2. Background
In the context of DE, several works have shown the effect of using an adaptation strategy to choose F and CR. These parameters are, in fact, known to affect both diversity and optimization results (Yaman et al., 2019). For instance, some authors proposed using pools of different parameters and mutation/crossover strategies, either as discrete sets of fixed values (Iacca et al., 2014), or as continuous ranges (Iacca et al., 2015). Others proposed using multiple mutation strategies (Yaman et al., 2018), where each strategy is represented as an agent whose measured performance is used to promote its activation within an ensemble of strategies. Recently, the authors of (Blank and Deb, 2022) introduced a polynomial mutation for DE with different approaches for controlling its parameter. The authors of (Ghosh et al., 2022) proposed instead an improvement on SHADE (Tanabe and Fukunaga, 2013), which uses proximity-based local information to control the parameter settings.
Rather than engineering the parameter adaptation strategy, some studies have tried to learn metaheuristics with RL. Some of these works are based on Q-learning: Li et al. (Li et al., 2019) considered each individual as an agent that learns the optimal strategy for solving a multi-objective problem with DE; in a similar way, Hu et al. (Hu et al., 2021) used a Q-table for each individual to choose how much to increase/decrease the parameters during a DE run to solve circuit design problems; Sallam et al. (Sallam et al., 2020) proposed an algorithm that evolves two populations, one with CMA-ES and one with a Q-table, in order to choose between different DE operators and enhance the EA with a local search.
Other approaches are based on deep RL: Sharma et al. (Sharma et al., 2019) proposed a method that uses deep RL to produce an adaptive DE strategy based on the observation of several state metrics; Sun et al. (Sun et al., 2021) trained a Long Short-Term Memory (LSTM) network with policy gradient to control the F and CR parameters in DE; Shala et al. (Shala et al., 2020) trained a neural network with Guided Policy Search (GPS) (Levine and Koltun, 2013) to control the step-size σ of CMA-ES, also sampling trajectories created by Cumulative Step-size Adaptation (CSA) (Chotard et al., 2012); Lacerda et al. (Lacerda, 2021) used distributed RL to train several metaheuristics with Twin Delayed Deep Deterministic Policy Gradients (Fujimoto et al., 2018).
3. Methods
The proposed framework uses deep RL to learn parameter adaptation strategies for EAs, i.e., to learn a policy that is able to set the parameters of an EA at each generation of the optimization process. In this respect, our framework is similar to the approach presented in (Shala et al., 2020). However, differently from (Shala et al., 2020), we do not use GPS as the RL algorithm and, most importantly, we do not partially sample the parameter adaptation trajectory from an existing adaptation strategy (in (Shala et al., 2020), CSA), but rather build the adaptation trajectory from scratch, i.e., entirely based on the trained policy. Another important aspect is that our framework can be configured with different EAs and RL algorithms, and can be easily extended in terms of state metrics, actions and rewards.
Next, we briefly describe the two EAs considered in our experimentation (Section 3.1), the RL setting (Section 3.2), the evaluation procedure (Section 3.3) and the computational setup (Section 3.4).
3.1. Evolutionary algorithms
We tested the framework using CMA-ES and DE, since these are two well-known EAs for which several studies on parameter adaptation exist. In our comparisons, we considered well-established adaptation strategies taken from the literature: for CMA-ES, Cumulative Step-size Adaptation (CSA) (Chotard et al., 2012); for DE, iDE (Elsayed et al., 2011) and jDE (Brest et al., 2006). More details on these adaptation strategies follow below.
3.1.1. Covariance Matrix Adaptation Evolution Strategies
CMA-ES (Hansen and Ostermeier, 2001) conducts the search by sampling adaptive mutations from a multivariate normal distribution N(m, σ²C). At each generation, the mean m is updated based on a weighted average over the population, while the covariance matrix C is updated by applying a process similar to that of Principal Component Analysis. The remaining parameter, σ, is the step-size, which is in turn adapted during the process. Usually, σ is self-adapted using CSA (Chotard et al., 2012). In our case, the adaptation policy is learned, and σ is computed based on an observation of the current state of the search.
3.1.2. Differential Evolution
DE (Storn and Price, 1997) is a very simple yet efficient EA. Starting from an initial random population, at each generation the algorithm applies to each parent solution a differential mutation operator to obtain a mutant, which is then crossed over with the parent. While there are different mutation and crossover strategies for DE, in this study we consider only the "best/1/bin" strategy. According to this strategy, the mutant is computed as x_mut = x_best + F (x_r1 − x_r2), where x_best is the best individual at the t-th generation, x_r1 and x_r2 are two mutually exclusive randomly selected individuals in the current population, and F is the scale factor. The binary crossover, on the other hand, swaps the genes of parent and mutant with probability given by the crossover rate CR. Without adaptation, F and CR are fixed. In our case, we make the policy learn how to adapt them using two different approaches: directly updating F and CR with the policy, or sampling F and CR from a uniform/normal distribution parametrized by the policy.
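For concreteness, one generation of the best/1/bin variation step can be sketched as follows (a minimal numpy sketch; the function name, the absence of bound handling, and the omission of the selection step are our simplifications):

```python
import numpy as np

def de_best_1_bin_step(pop, fitness, F, CR, rng):
    """One generation of DE trial-vector construction with best/1/bin.

    pop: (n, d) array of candidate solutions; fitness: (n,) array.
    F: scale factor; CR: crossover rate.
    """
    n, d = pop.shape
    best = pop[np.argmin(fitness)]          # minimization
    trials = pop.copy()
    for i in range(n):
        # two mutually exclusive random individuals, both different from i
        r1, r2 = rng.choice([j for j in range(n) if j != i], size=2, replace=False)
        mutant = best + F * (pop[r1] - pop[r2])
        # binomial crossover: each gene comes from the mutant with prob. CR,
        # and at least one gene is always taken from the mutant
        mask = rng.random(d) < CR
        mask[rng.integers(d)] = True
        trials[i] = np.where(mask, mutant, pop[i])
    return trials
```

In the full algorithm, each trial vector would then replace its parent only if it improves the fitness (one-to-one selection).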
3.2. Reinforcement learning setting
As for the RL setting, we chose the same model used in (Shala et al., 2020): 2 fully connected hidden layers of 50 neurons each (thus with 50 × 50 = 2500 connections between them) with ReLU activation function. The size of the input layer depends on the observation space, while the size of the output layer depends on the action space. In the following, we describe the other details of the learning setting.
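As a sketch of the policy network's shape only (the actual implementation relies on ray-rllib's built-in models; the function names and the weight initialization here are our own choices):

```python
import numpy as np

def init_policy(obs_dim, act_dim, rng, hidden=50):
    """Random weights for a 2-hidden-layer MLP (50 units each), matching
    the architecture described in the text."""
    sizes = [obs_dim, hidden, hidden, act_dim]
    return [(rng.standard_normal((a, b)) * np.sqrt(2.0 / a), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def policy_forward(params, obs):
    """Forward pass: ReLU on the hidden layers, linear output layer."""
    x = np.asarray(obs, dtype=float)
    for W, b in params[:-1]:
        x = np.maximum(x @ W + b, 0.0)      # ReLU
    W, b = params[-1]
    return x @ W + b
```

The input size is set by the chosen observation space and the output size by the chosen action space, as described below.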
3.2.1. Proximal Policy Optimization
We chose Proximal Policy Optimization (PPO) (Schulman et al., 2017) to optimize the policy due to its good performance in general-purpose RL tasks. For brevity, we do not go into the details of the algorithm (for which we refer to (Schulman et al., 2017)); in short, it works as shown in Algorithm 1.
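For intuition, the core of PPO's clipped surrogate objective (per (Schulman et al., 2017); this standalone numpy sketch omits the value-function and entropy terms, and the function name is ours) is:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO clipped surrogate: mean of min(r*A, clip(r, 1-eps, 1+eps)*A),
    where r is the new/old policy probability ratio and A the advantage."""
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped).mean()
```

The clipping removes the incentive to move the policy ratio outside [1 − ε, 1 + ε], which is what makes PPO updates conservative.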
In our setup, the PPO hyperparameters are set to the default values used in the ray-rllib library (see https://docs.ray.io/en/latest/rllib/rllib-algorithms.html#ppo). θ denotes the parameters of the policy (in our case, the weights of the neural network), L is the loss function (see Eq. 9 from (Schulman et al., 2017)) and Â_t is the advantage estimate at iteration t (see Eq. 11 from (Schulman et al., 2017)).
3.2.2. Observation spaces
We experimented with different observation spaces, each one defined as a set of state metrics. A state metric computes the state (or observation) given to the model, based on various combinations of fitness values, genotypes, and other parameters of the EA. More specifically, we used the following state metrics:
- Inter-generational Δf: For the last n generations, we take the best fitness in the population at each generation and compute the normalized difference with the best fitness at the previous generation:
(1) δ_t = f*_{t−1} − f*_t
(2) Δf_t = clip( δ_t / (|f*_{t−1}| + ε), −1, 1 )
where f*_t is the best fitness value in the population at the t-th generation. In this way, Δf_t ∈ [−1, 1], it is proportional to the improvement relative to the best fitness from the previous generation, and it saturates to ±1 for |δ_t| ≥ |f*_{t−1}| + ε. The constant ε is needed to avoid divisions by zero. The normalization of Δf is fundamental to obtain stable training.
- Intra-generational Δf: For the last n generations, we take the normalized difference between the maximum and minimum fitness of the current population at each generation:
(3) Δf_t^intra = ( f_t^max − f_t^min ) / ( |f_t^max| + ε )
- Inter-generational Δx: Similarly to the inter-generational Δf, the normalized difference between the best genotypes in two consecutive generations is taken for the last n generations. In this case, to maintain linearity, the normalization is done using the bounds of the search space:
(4) Δx_t = ( x*_{t−1} − x*_t ) / ( ub − lb )
where x*_t is the genotype associated to the best fitness at generation t and [lb, ub] is the vector containing, for each variable, the bounds of the search space, d being the problem size. Since the size of this observation would depend on the problem size, the policy would work only with problems of that fixed size. To solve this problem, we use as observation the minimum and maximum values of Δx_t:
(5) o_t = ( min_j Δx_{t,j}, max_j Δx_{t,j} ), j = 1, …, d
The inter-generational Δx observation is then defined as a history of the above metric over the last n generations:
(6) ( o_{t−n+1}, …, o_t )
- Intra-generational Δx: Given x_{i,j,t} as the j-th dimension of the i-th individual of the population at the t-th generation, the intra-generational Δx at the t-th generation is defined as:
(7) Δx_{j,t}^intra = ( max_i x_{i,j,t} − min_i x_{i,j,t} ) / ( ub_j − lb_j )
(8) Δx_t^intra = ( Δx_{1,t}^intra, …, Δx_{d,t}^intra )
Also in this case, we use as observation the minimum and maximum values of Δx_t^intra:
(9) o_t = ( min_j Δx_{j,t}^intra, max_j Δx_{j,t}^intra )
The intra-generational Δx observation is then defined as a history of the above metric over the last n generations:
(10) ( o_{t−n+1}, …, o_t )
In all the experiments, we also always include in the observation space the previous model output, i.e., the parameters given by the model in the previous generation.
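As an illustrative sketch (the function names, the placement of the ε constant, and the clipping details are our assumptions), a scale-independent, clipped improvement signal of the inter-generational Δf kind, together with its fixed-length history, could look like:

```python
import numpy as np

def inter_gen_delta_f(f_best_prev, f_best_curr, eps=1e-12):
    """Normalized best-fitness improvement, clipped to [-1, 1].

    Dividing by |f_best_prev| + eps makes the signal comparable across
    objective functions with very different value ranges."""
    raw = (f_best_prev - f_best_curr) / (abs(f_best_prev) + eps)
    return float(np.clip(raw, -1.0, 1.0))

def delta_f_history(best_fitnesses, n, eps=1e-12):
    """History of the metric over the last n generations, zero-padded at
    the start of a run so the observation size stays fixed."""
    hist = [inter_gen_delta_f(a, b, eps)
            for a, b in zip(best_fitnesses[:-1], best_fitnesses[1:])]
    hist = hist[-n:]
    return [0.0] * (n - len(hist)) + hist
```

The fixed, problem-size-independent length of such observations is what lets the same policy be applied across functions and dimensionalities.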
3.2.3. Action spaces
The action space of the policy depends on both the specific EA and the approach used to parametrize it. In our model, the action is taken at every generation, using the observation from the previous one. In our experiments, we considered the following action spaces:
- CMA-ES (step-size): the action is the step-size σ itself.
- DE (direct F and CR): the action is the pair (F, CR), used directly.
- DE (normal distribution): F and CR are sampled using two normal distributions parametrized with mean and standard deviation determined by the learned policy, i.e., respectively, N(μ_F, σ_F) and N(μ_CR, σ_CR). Thus, the action space is: μ_F, σ_F, μ_CR, σ_CR.
- DE (uniform distribution): F and CR are sampled using two uniform distributions parametrized with lower and upper bound determined by the learned policy, i.e., respectively, U(lb_F, ub_F) and U(lb_CR, ub_CR). Thus, the action space is: lb_F, ub_F, lb_CR, ub_CR.
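A hedged sketch (function name and bound-sorting behavior are our assumptions) of how a 4-dimensional action under the uniform-distribution scheme can be turned into per-generation F and CR values:

```python
import numpy as np

def sample_f_cr_uniform(action, rng):
    """Map a 4-dim policy action (lb_F, ub_F, lb_CR, ub_CR) to sampled F, CR.

    Bounds are sorted so the interval is always valid; degenerate
    intervals collapse to a single deterministic value."""
    lb_f, ub_f = sorted(action[:2])
    lb_cr, ub_cr = sorted(action[2:])
    F = rng.uniform(lb_f, ub_f) if ub_f > lb_f else lb_f
    CR = rng.uniform(lb_cr, ub_cr) if ub_cr > lb_cr else lb_cr
    return F, CR
```

A wide interval lets the policy inject parameter diversity across the population, while a collapsed interval reproduces the direct-update scheme.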
3.2.4. Reward
The reward is a scalar representing how well the policy performed during the training episodes (in our case, an episode is a full evolutionary run). It is computed at every generation using the inter-generational Δf without history, see Eq. (2). This reward brings some advantages: it reflects the progress of the optimization process, it is independent of the scale of the objective function, and it yields better numerical stability during the training process. All the experiments have been done using this reward function (except the one presented in Section 4.1.1).
3.2.5. Training procedure
We consider two policy configurations: single-function policy, and multi-function policy. In the first configuration, the model is trained separately on each function: in this way, the policy specializes on each single optimization problem. While the ideal policy should work well on as many functions as possible, a single-function policy is useful to get an idea of the top performance that a policy can reach on each function; moreover, single-function policies could be reused on similar functions. In the second configuration, the model is instead trained using evolutionary runs on multiple functions. Quite surprisingly, with this procedure we could obtain policies that work better than the adaptive approaches from the literature.
In all our experiments, we trained the models for a fixed number of evolutionary runs (i.e., episodes), each consisting of a fixed budget of function evaluations. Then, the trained policy is tested using the procedure defined in Section 3.3.4.
3.3. Evaluation
3.3.1. Benchmark functions
The experiments have been performed on 46 benchmark functions taken from the BBOB benchmark (Finck et al., 2010). For each function, we used the default instance, i.e., without random shift in the domain or codomain (as done in (Shala et al., 2020)). Future investigations will extend the analysis to instances with shift.
The 46 functions are selected as follows. The first 10 functions are BentCigar, Discus, Ellipsoid, Katsuura, Rastrigin, Rosenbrock, Schaffers, Schwefel, Sphere, and Weierstrass, all in 10 dimensions. The remaining 36 are obtained from 12 functions, namely AttractiveSector, BuecheRastrigin, CompositeGR, DifferentPowers, LinearSlope, SharpRidge, StepEllipsoidal, RosenbrockRotated, SchaffersIllConditioned, LunacekBiR, GG101me, and GG21hi, each one in 5, 10 and 20 dimensions.
3.3.2. Compared methods
We compared the learned policies with the following adaptive methods from the literature:
- Cumulative Step-size Adaptation (Chotard et al., 2012): CSA is considered the default step-size control method of CMA-ES. To compute σ, a cumulative path is defined as: p_{t+1} = (1 − c) p_t + sqrt(c (2 − c)) y*_{t+1}, where c ∈ (0, 1] (1/c represents the lifespan of the information contained in p) and y*_{t+1} is the step of the best child at the (t+1)-th generation. The step-size is then updated as: σ_{t+1} = σ_t exp( (c/d) ( ‖p_{t+1}‖ / E‖N(0, I)‖ − 1 ) ), where d is the damping parameter that determines how much the step-size can change (usually, d ≈ 1).
- iDE (Elsayed et al., 2011): The iDE adaptive method maintains a different F and CR for each individual and updates them with a rule that depends on the mutation/crossover strategy used. Since in our DE experiments we use the best/1/bin strategy, the considered iDE update rules (Eqs. (11)–(12)) combine, for each individual, the values F_best and CR_best corresponding to the best individual with randomly sampled values F_r (resp. CR_r) drawn from the best F (resp. CR) values found until the current generation (the index r is needed to select mutually exclusive values for each individual).

- jDE (Brest et al., 2006): jDE is a simple but effective adaptive DE variant. With probability τ, the method samples new values of F and CR from uniform distributions (U(0.1, 1.0) for F and U(0, 1) for CR). Otherwise, it uses the best values found until the current generation.
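A rough sketch of a jDE-style parameter refresh (the resampling ranges F ∈ [0.1, 1.0] and CR ∈ [0, 1] follow jDE's usual settings; keeping the incumbent value when no resampling occurs is our simplification of the rule described above):

```python
import numpy as np

def jde_update(F, CR, tau, rng):
    """jDE-style refresh: with probability tau, resample each parameter
    independently from its range; otherwise keep the current value."""
    if rng.random() < tau:
        F = 0.1 + 0.9 * rng.random()   # F in [0.1, 1.0]
    if rng.random() < tau:
        CR = rng.random()              # CR in [0, 1]
    return F, CR
```

In jDE this refresh is applied per individual before variation, so parameter values that produce surviving offspring tend to persist in the population.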
3.3.3. Evaluation metrics
In order to compare the different setups of algorithms and models, we consider two metrics (similar to (Shala et al., 2020)):
- Area Under the Curve (AUC): During each evolutionary run, the minimum fitness of the population at each generation is stored. The result is a monotonically non-increasing discrete function (assuming elitism). The area under this curve is then calculated using the composite trapezoidal rule. This metric is a good indication of how fast the optimization process is.
- Best of Run: The best fitness found during the entire optimization process.
As we assume minimization of the objective function, for both metrics the lower the value, the better the performance of a policy.
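The AUC metric can be sketched as follows (a minimal implementation of the composite trapezoidal rule with unit spacing between generations; the explicit monotonicity enforcement is our defensive addition):

```python
import numpy as np

def auc_metric(best_fitness_per_gen):
    """Area under the best-fitness-so-far curve (composite trapezoidal
    rule, unit spacing); lower values mean faster convergence."""
    y = np.minimum.accumulate(np.asarray(best_fitness_per_gen, dtype=float))
    # trapezoidal rule: sum of 0.5 * (y[i] + y[i+1]) over consecutive pairs
    return float(0.5 * (y[:-1] + y[1:]).sum())
```

Two runs reaching the same final fitness can still have very different AUC values, which is why this metric captures speed rather than just quality.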
3.3.4. Testing procedure
Given an RL-trained policy A and an adaptive policy B taken from the literature (e.g., CSA), both acting on the same EA (e.g., CMA-ES), the two policies are tested in the following way:
- We take policy A and execute 50 runs, each one for 50 generations, with a population of 10 individuals. Thus, every run has 500 function evaluations.
- We do the same for policy B.
- For each run of both policies, we compute the two metrics (AUC and Best of Run).
- For both metrics, we calculate the probability that A performs better than B as:
(13) P(A < B) = (1/50²) Σ_{i=1}^{50} Σ_{j=1}^{50} δ_{ij}
where δ_{ij} is 1 if the metric of A on the i-th run is less than the metric of B on the j-th run, and 0 otherwise.
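The pairwise comparison of Eq. (13) can be computed with a single numpy broadcast (a sketch; the function and array names are ours):

```python
import numpy as np

def win_probability(metric_a, metric_b):
    """P(A better than B): fraction of run pairs (i, j) where policy A's
    metric on run i is strictly lower than policy B's on run j."""
    a = np.asarray(metric_a, dtype=float)[:, None]   # shape (runs_a, 1)
    b = np.asarray(metric_b, dtype=float)[None, :]   # shape (1, runs_b)
    return float((a < b).mean())                     # mean over all pairs
```

A value above 0.5 indicates that policy A tends to beat policy B across run pairs.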
3.4. Computational setup
We ran our experiments on an Azure Virtual Machine with an 8-core 64-bit CPU (we noted that the CPU model would change over different sessions, but usually the machine used an Intel Xeon at 2 GHz with 30 MB cache) and 16 GB RAM, running Ubuntu 20.04. A complete policy training takes on the order of hours.
Our code is implemented in Python (v3.8) using ray-rllib (v1.7), gym (v0.19) and numpy (v1.19). We took the CSA implementation from the cma (v3.1.1) library, as well as the BBOB benchmark functions (Finck et al., 2010). The implementation of DE was done by slightly modifying scipy's implementation in order to make it compatible with F and CR set at the individual level. The implementations of iDE and jDE have been realized by porting them from the C++ implementations available in the pagmo (v2.18.0) library (which are based on the algorithm descriptions presented in (Elsayed et al., 2011) and (Brest et al., 2006)).
4. Results
We now present the results, separating the experiments with CMAES (Section 4.1) from those with DE (Section 4.2).
4.1. CMAES experiments
4.1.1. Comparison between PPO and GPS
The first experiment was done with CMA-ES and single-function training, trying to configure the model as similarly as possible to (Shala et al., 2020), in order to get a first comparative analysis. However, a direct comparison with the results reported in (Shala et al., 2020) was not possible: the authors of (Shala et al., 2020) used GPS as training algorithm, which is not implemented in the ray-rllib library. To avoid replicability issues, we decided to train our model using the available PPO implementation from ray-rllib. Furthermore, we did not use the sampling rate technique employed in (Shala et al., 2020), i.e., in our case the trajectories of the step-size are taken entirely from the trained policy.
The rest of the setup is the same used in (Shala et al., 2020). As mentioned earlier, we used 2 fully connected hidden layers of 50 neurons each with ReLU. The observation space is: the differences between successive fitness values from the 40 previous generations (not normalized), the step-size history from the 40 previous generations, and the current cumulative path length (Equation 2 from (Shala et al., 2020)). The reward is the negative of the fitness (not normalized). The action space is the step-size σ. Please note that these state metrics and reward are different from the ones described in Section 3.2, and have been used only in this preliminary experiment for comparison with the results from (Shala et al., 2020).
We performed this experiment only with the first 10 functions of the considered benchmark. The result was quite poor: the single-function trained policy obtained better testing results than CSA (with both the AUC and Best of Run metrics) only on 2 functions. We found that the main reason for this poor performance is the noisy reward. In fact, we observed that, depending on the function, the scale of the fitness differs across multiple runs, and PPO is sensitive to the reward scale. This seems to explain why the authors of (Shala et al., 2020) chose GPS, which is robust to different reward scales.
With this setup we also encountered numerical instability problems: on BentCigar, Rosenbrock and Schaffers we were not able to train the policy, because at a certain point of the training process the weights of the model became NaN. This is very likely caused by the noisy reward, which makes some gradient or loss value diverge. Indeed, this problem was almost entirely fixed by using a normalized reward.
4.1.2. Normalizing the reward
We tried to improve the previous setup by normalizing the reward and using a minimal observation space. The reward in this case is the one explained in Section 3.2.4. The observation space is the inter-generational Δf plus the step-size of the previous generation. Testing the policy on all the 46 functions, it did better than CSA on 30.4% (14/46) of the functions. Moreover, we did not have training stability issues. Overall, we found that CSA is a very good step-size adaptation strategy, and it is difficult to do better by means of RL.
4.2. DE experiments
A more extensive experimentation has been conducted with DE. We started with single-function policies, training one model per function. Then we experimented with multi-function training, applying small changes to the model in order to get close to the single-function results.
4.2.1. Singlefunction policy
In Figure 1 we report the results of the single-function policies using the three action spaces to parametrize DE, and compare them with iDE and jDE. For brevity, we report only the results of the Best of Run metric. Green (red) cells indicate that the trained policy works better (worse) than the corresponding adaptive DE variant (either iDE or jDE), with the policy trained separately and tested on each function. Darker green (red) indicates higher (lower) probabilities. Black cells indicate that the policy could not be trained due to numerical instability issues: in fact, DE, due to its random nature, is likely to produce different fitness trajectories across evolutionary runs. This causes a noisy reward that can lead to numerical instabilities during the training process.
Solving this problem properly would require designing a custom loss function for the training algorithm. However, this would mean using a variation of PPO, which falls outside the scope of this work, where we limit ourselves to the original PPO. A simple workaround was to run the training process multiple times: in most cases, one or two attempts were enough to train the policy without encountering this instability problem. Moreover, we observed that using the hyperbolic tangent as activation function (instead of ReLU) can help reduce the probability of encountering instabilities. However, we did not perform a deeper analysis of this.
The leftmost side of Figure 1 shows the percentage of functions on which the learned policy did better than iDE/jDE. It can be seen that the uniform distribution strategy gives the best results overall. However, there are a few functions where the adaptive strategies provided by iDE and jDE always do better.
4.2.2. Multifunction policy
Given the results of the normal and uniform distribution approaches in the single-function setting, we experimented with multi-function training, using one policy trained on all the functions, with the training episodes distributed across them. We trained and compared 9 versions of the model with different observation spaces, combining the state metrics defined in Section 3.2.2.
The results of this experiment are shown in Figure 2. All the policies have at least the inter-generational Δf and the values of the previous action as observation. The entries on the rightmost side of Figure 2 (starting with "w/") denote what else is included in the observation space. Moreover, we also tried to double the number of training episodes ("double training" labels) and to increase the size of the model ("bigger net" label).
One of the main results that can be noted from Figure 2 is the different performance of the normal and the uniform distribution approaches. The latter is visibly superior to the former, and it gets very close to the single-function training performance shown in Figure 1 by adding just one of the intra-generational metrics to the observation space. The normal distribution approach overcomes iDE and jDE only by both adding the intra-generational Δf and the intra-generational Δx to the observation space and increasing the model size and its training time. This suggests that this approach could work, but is more difficult to train. Another consideration may be that, in order to get a better balance between exploration and exploitation, F and CR must have high variance, especially at the end of the evolution.
Another important observation can be made by looking at Figure 3. Because the trajectories are very similar across the functions (note the small standard deviation in the actions), it is clear that the policy is not able to differentiate between functions, but rather learns to map actions to the index of the current generation. Figure 3 shows only the trajectories of one policy, but the same pattern is present in all the multi-function policies. This is an important limitation, because being able to make different choices depending on the function is crucial if we want true adaptation. One possible cause may be the small capacity of the model (in terms of number of layers/neurons) or the number of episodes. However, increasing both (at least to the values that we tested) did not bring any improvement. A possible solution may be to add a loss in an intermediate layer (as GoogLeNet (Szegedy et al., 2015) does) to classify the function in some manner (e.g., unimodal or multimodal). However, this would increase the computational cost.
Figure 3 also shows that the model has learned a general policy that works well for the majority of the functions, and that this policy is in line with the common strategy of exploration during the first phase and exploitation during the last phase. In fact, during the evolution, CR, which determines the effect of the crossover, is initially small and sampled from a narrow range (lower and upper bounds are small and similar), while at the end its variance increases. Instead, F, which determines the effect of the mutation, is initially high with low variance (both bounds are high), while at the end it has high variance.
5. Conclusions
In this study, we have proposed a Python framework for learning parameter adaptation policies in metaheuristics. The framework, based on a state-of-the-art RL algorithm (PPO), is of general applicability and can be easily extended to handle various optimizers and their parameters.
In the experimentation, we have applied the proposed framework to the learning of the step-size σ in CMA-ES, and of the scale factor F and crossover rate CR in DE. Our experiments demonstrate the efficacy of the learned adaptation policies, especially considering the Best of Run results in the case of DE, in comparison with well-known adaptation policies taken from the literature such as iDE and jDE.
The hybridization of metaheuristics and RL, to which this paper contributes, is a growing field of research, and offers the potential to create genuinely adaptive numerical optimization techniques, with the possibility to perform continual learning and incorporate previous knowledge. In this regard, this work can be extended in multiple ways. The most straightforward direction would be to test alternative RL models (different from PPO). Moreover, while in this study we focused on real-valued optimization, in principle the proposed system could be extended to handle parameter adaptation also for combinatorial problems. Furthermore, it will be important to test the proposed framework in real-world applications, and to include other state-of-the-art optimizers in the comparative analysis. It would also be interesting to investigate alternative observation spaces and reward functions. Another option would be to extend the framework to learn the choice of operators and algorithms (in an algorithm portfolio scenario), rather than their parameters.
Acknowledgements.
We thank Alessandro Cacco for a preliminary implementation of the framework used in the experiments reported in this study.
References
Brest et al. (2006). Self-adapting control parameters in differential evolution: a comparative study on numerical benchmark problems. IEEE Transactions on Evolutionary Computation 10(6), pp. 646–657.
Burke et al. (2013). Hyper-heuristics: a survey of the state of the art. Journal of the Operational Research Society 64(12), pp. 1695–1724.
Chakhlevitch and Cowling (2008). Hyperheuristics: recent developments. In Adaptive and Multilevel Metaheuristics, pp. 3–29.
Chotard et al. (2012). Cumulative step-size adaptation on linear functions. In International Conference on Parallel Problem Solving from Nature, Berlin, Heidelberg, pp. 72–81.
Cotta et al. (2008). Adaptive and Multilevel Metaheuristics. Vol. 136, Springer, Berlin, Heidelberg.
Drake et al. (2020). Recent advances in selection hyper-heuristics. European Journal of Operational Research 285(2), pp. 405–428.
Elsayed et al. (2011). Differential evolution with multiple strategies for solving CEC2011 real-world numerical optimization problems. In 2011 IEEE Congress of Evolutionary Computation (CEC), New York, NY, USA, pp. 1041–1048.
Finck et al. (2010). Real-parameter black-box optimization benchmarking 2009: presentation of the noiseless functions. Technical report, INRIA.
Fujimoto et al. (2018). Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, Stockholm, Sweden, pp. 1587–1596.
Ghosh et al. (2022). Using spatial neighborhoods for parameter adaptation: an improved success history based differential evolution. Swarm and Evolutionary Computation, in press, pp. 101057.
Hansen and Ostermeier (2001). Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation 9(2), pp. 159–195.
Hu et al. (2021). Reinforcement learning-based differential evolution for parameters extraction of photovoltaic models. Energy Reports 7, pp. 916–928.
Iacca et al. (2015). Continuous parameter pools in ensemble self-adaptive differential evolution. In IEEE Symposium Series on Computational Intelligence, New York, NY, USA, pp. 1529–1536.
Iacca et al. (2014). A differential evolution framework with ensemble of parameters and strategies and pool of local search algorithms. In European Conference on the Applications of Evolutionary Computation, Berlin, Heidelberg, pp. 615–626.
Blank and Deb (2022). Parameter tuning and control: a case study on differential evolution with polynomial mutation.
Lacerda (2021). Out-of-the-box parameter control for evolutionary and swarm-based algorithms with distributed reinforcement learning. Ph.D. Thesis, Universidade Federal de Pernambuco.
Levine and Koltun (2013). Guided policy search. In International Conference on Machine Learning, Atlanta, GA, USA, pp. 1–9.
Li et al. (2017). A learning automata-based multiobjective hyper-heuristic. IEEE Transactions on Evolutionary Computation 23(1), pp. 59–73.
Li et al. (2019). Differential evolution based on reinforcement learning with fitness ranking for solving multimodal multiobjective problems. Swarm and Evolutionary Computation 49, pp. 234–244.
López-Ibáñez et al. (2016). The irace package: iterated racing for automatic algorithm configuration. Operations Research Perspectives 3, pp. 43–58.
Nareyek (2003). Choosing search heuristics by non-stationary reinforcement learning. In Metaheuristics: Computer Decision-Making, pp. 523–544.
Sallam et al. (2020). Evolutionary framework with reinforcement learning-based mutation adaptation. IEEE Access 8, pp. 194045–194071.
Sánchez et al. (2020). A systematic review of hyper-heuristics on combinatorial optimization problems. IEEE Access 8, pp. 128068–128095.
Schulman et al. (2017). Proximal policy optimization algorithms. arXiv:1707.06347.
Shala et al. (2020). Learning step-size adaptation in CMA-ES. In International Conference on Parallel Problem Solving from Nature, Berlin, Heidelberg, pp. 691–706.
Sharma et al. (2019). Deep reinforcement learning based parameter control in differential evolution. In Genetic and Evolutionary Computation Conference, New York, NY, USA, pp. 709–717.
Storn and Price (1997). Differential evolution – a simple and efficient heuristic for global optimization over continuous spaces. Journal of Global Optimization 11(4), pp. 341–359.
Sun et al. (2021). Learning adaptive differential evolution algorithm from optimization experiences by policy gradient. IEEE Transactions on Evolutionary Computation 25(4), pp. 666–680.
Sutton and Barto (2018). Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, USA.
Szegedy et al. (2015). Going deeper with convolutions. In Conference on Computer Vision and Pattern Recognition, New York, NY, USA, pp. 1–9.
Tanabe and Fukunaga (2013). Success-history based parameter adaptation for differential evolution. In Congress on Evolutionary Computation, New York, NY, USA, pp. 71–78.
Wolpert and Macready (1997). No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation 1(1), pp. 67–82.
Yaman et al. (2019). A comparison of three differential evolution strategies in terms of early convergence with different population sizes. In AIP Conference Proceedings, Vol. 2070, Melville, NY, USA, pp. 020002.
Yaman et al. (2018). Multi-strategy differential evolution. In International Conference on the Applications of Evolutionary Computation, Berlin, Heidelberg, pp. 617–633.