1. Introduction
Evolutionary algorithms for numerical optimization come in many variants involving different operators, such as mutation strategies and types of crossover. In the case of differential evolution (DE) (Storn and Price, 1997), experimental analysis has shown that different mutation strategies perform better for specific optimization problems (Mezura-Montes et al., 2006) and that choosing the right mutation strategy at specific stages of an optimization process can further improve the performance of DE (Fialho et al., 2010a). As a result, there has been great interest in methods for controlling or selecting the value of discrete parameters while solving a problem, also called adaptive operator selection (AOS).
In the context of DE, there is a finite number of mutation strategies (operators) that can be applied at each generation to produce new solutions from existing (parent) solutions. An AOS method will decide, at each generation, which operator should be applied, measure the effect of this application and adapt future choices according to some reward function. An inherent difficulty is that we do not know which operator is the most useful at each generation to solve a previously unseen problem. Moreover, different operators may be useful at different stages of an algorithm’s run.
There are multiple AOS methods proposed in the literature (Karafotias et al., 2015b; Aleti and Moser, 2016; Gong et al., 2010) and several of them are based on reinforcement learning (RL) techniques, such as probability matching (Fialho et al., 2010b; Sharma et al., 2018), multi-armed bandits (Gong et al., 2010), Q-learning (Pettinger and Everson, 2002) and SARSA (Chen et al., 2005; Eiben et al., 2006; Sakurai et al., 2010), among others (Karafotias et al., 2014). These RL methods use one or a few features to capture the state of the algorithm at each generation, select an operator to be applied, and calculate a reward from this application. Typical state features are the fitness standard deviation, the fitness improvement from parent to offspring, the best fitness, and the mean fitness
(Eiben et al., 2006; Karafotias et al., 2014). Typical reward functions measure the improvement achieved over the previous generation (Karafotias et al., 2014). Other parameter control methods use an offline training phase to collect more data about the algorithm than what is available within a single run. For example, Kee et al. (2001) use two types of learning: table-based and rule-based. The learning is performed during an offline training phase that is followed by an online execution phase where the learned tables or rules are used for choosing parameter values. More recently, Karafotias et al. (2012) train offline a feed-forward neural network with no hidden layers to control the numerical parameter values of an evolution strategy. To the best of our knowledge, none of the AOS methods that use offline training are based on reinforcement learning.
In this paper, we adapt Double Deep Q-Network (DDQN) (van Hasselt et al., 2016), a deep reinforcement learning technique that uses a deep neural network as a prediction model, as an AOS method for DE. The main differences between DDQN and other RL methods are the possibility of training DDQN offline on large amounts of data and of using a larger number of features to define the current state. When applying it as an AOS method within DE, we first run the proposed DE-DDQN algorithm many times on training benchmark problems, collecting data on features such as the relative fitness of the current generation, the mean and standard deviation of the population fitness, the dimension of the problem, the number of function evaluations, stagnation, distances among solutions in decision space, etc. After this training phase, the DE-DDQN algorithm can be applied to unseen problems: it observes the run-time values of these features and predicts which mutation strategy should be used at each generation. DE-DDQN also requires the choice of a suitable reward definition to facilitate learning of a prediction model. Some RL-based AOS methods calculate rewards per individual (Pettinger and Everson, 2002; Chen et al., 2005), while others calculate them per generation (Sakurai et al., 2010). Moreover, reward functions can be designed in different ways depending on the problem at hand. For example, Karafotias et al. (2015a) define and compare four per-generation reward definitions for RL-based AOS methods. Here, we also find that the reward definition has a strong effect on the performance of DE-DDQN and, hence, we analyze three alternative reward definitions that assign a reward for each application of a mutation strategy.
As an experimental benchmark, we use functions from the CEC 2005 special session on real-parameter optimization (Suganthan et al., 2005). In particular, the proposed DE-DDQN method is first trained on 16 functions for both dimensions 10 and 30, i.e., a total of 32 training functions. Then, we run the trained DE-DDQN on a different set of 5 functions, also for dimensions 10 and 30, i.e., a total of 10 test functions. We also run on these 10 test functions the following algorithms for comparison: four DE variants, each using a single specific mutation strategy; DE with a random selection among mutation strategies at each generation; DE using various AOS methods (PM-AdapSS (Fialho et al., 2010b), F-AUC (Gong et al., 2010), and RecPM-AOS (Sharma et al., 2018)); and the two winners of the CEC 2005 competition (Suganthan et al., 2005), which are both variants of CMA-ES: LR-CMA-ES (LR) (Auger and Hansen, 2005a) and IPOP-CMA-ES (IPOP) (Auger and Hansen, 2005b).
Our experimental results show that the DE variants using AOS clearly outperform the DE variants using a fixed mutation strategy or a random selection. Although a non-parametric post-hoc test does not find the differences between the CMA-ES algorithms and the AOS-enabled DE algorithms (including DE-DDQN) to be statistically significant, DE-DDQN is the second-best approach, behind IPOP-CMA-ES, in terms of mean rank.
The paper is structured as follows. First, we give a brief introduction to DE, mutation strategies and deep reinforcement learning. In Sect. 3, we introduce our proposed DEDDQN algorithm, and explain its training and online (deployment) phases. Section 4 introduces the state features and reward functions used in the experiments, which are described in Sect. 5. We summarise our conclusions in Sect. 6.
2. Background
2.1. Differential Evolution
Differential Evolution (DE) (Price et al., 2005) is a population-based algorithm that uses a mutation strategy to create an offspring solution $u_i$. A mutation strategy is a linear combination of three or more parent solutions, where $i$ is the index of a solution in the current population. Some mutation strategies are good at exploration and others at exploitation, and it is well known that no single strategy performs best for all problems and for all stages of a single run. In this paper, we consider these frequently used mutation strategies:
“rand/1”: $u_i = x_{r_1} + F\,(x_{r_2} - x_{r_3})$

“rand/2”: $u_i = x_{r_1} + F\,(x_{r_2} - x_{r_3}) + F\,(x_{r_4} - x_{r_5})$

“rand-to-best/2”: $u_i = x_{r_1} + F\,(x_{\text{best}} - x_{r_1}) + F\,(x_{r_2} - x_{r_3}) + F\,(x_{r_4} - x_{r_5})$

“curr-to-rand/1”: $u_i = x_i + F\,(x_{r_1} - x_i) + F\,(x_{r_2} - x_{r_3})$
where $F$ is a scaling factor, $u_i$ and $x_i$ are the $i$-th offspring and parent solution vectors in the population, respectively, $x_{\text{best}}$ is the best parent in the population, and $r_1$, $r_2$, $r_3$, $r_4$ and $r_5$ are randomly generated indexes within $[1, \mathit{NP}]$, where $\mathit{NP}$ is the population size. An additional numerical parameter, the crossover rate ($\mathit{CR}$), determines whether the mutation strategy is applied to each dimension of $x_i$ to generate $u_i$. At least one dimension of each vector is mutated.

2.2. Deep Reinforcement Learning
In RL (Sutton and Barto, 1998), an agent takes actions in an environment that returns a reward and the next state. The goal is to maximize the cumulative reward. RL estimates the value of taking an action in a given state, called the Q-value, in order to learn a policy that returns an action given a state. A variety of techniques are used in RL to learn this policy, and some of them are applicable only when the set of actions is finite. When the features that define a state are continuous, or the set of states is very large, the policy becomes a function that implicitly maps state features to actions, as opposed to an explicit map in the form of a lookup table. In deep reinforcement learning, this function is approximated by a deep neural network and the weights of the network are optimized to maximize the cumulative reward.
Deep Q-network (DQN) (Mnih et al., 2015) is a deep RL technique that extends Q-learning to continuous features by approximating a nonlinear Q-value function of the state features using a neural network (NN). The classical DQN algorithm sometimes overestimates the Q-values of actions, which leads to poor policies. Double DQN (DDQN) (van Hasselt et al., 2016) was proposed as a way to overcome this limitation and enhance the stability of the Q-values. DDQN employs two neural networks: a primary network selects an action and a target network generates a target Q-value for that action. The target Q-values are used to compute the loss function for every action during training. The weights of the target network are kept fixed and are only periodically (or slowly) updated towards the primary network's values.
In this work, we integrate DDQN into DE as an AOS method that selects a mutation strategy at each generation.
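As a concrete reference, the four mutation strategies of Sect. 2.1 and binomial crossover can be sketched in NumPy as follows. These are the standard textbook forms; the function and variable names are ours, and details (e.g., the scaling used in curr-to-rand/1) may differ from the paper's exact implementation.

```python
import numpy as np

def mutate(pop, i, i_best, F, strategy, rng):
    """Produce a mutant vector for parent i using one of the four DE strategies."""
    # Five mutually distinct random indexes, all different from i.
    r1, r2, r3, r4, r5 = rng.choice(
        [j for j in range(len(pop)) if j != i], size=5, replace=False)
    x = pop
    if strategy == "rand/1":
        return x[r1] + F * (x[r2] - x[r3])
    if strategy == "rand/2":
        return x[r1] + F * (x[r2] - x[r3]) + F * (x[r4] - x[r5])
    if strategy == "rand-to-best/2":
        return (x[r1] + F * (x[i_best] - x[r1])
                + F * (x[r2] - x[r3]) + F * (x[r4] - x[r5]))
    if strategy == "curr-to-rand/1":
        return x[i] + F * (x[r1] - x[i]) + F * (x[r2] - x[r3])
    raise ValueError(strategy)

def binomial_crossover(parent, mutant, CR, rng):
    """Per-dimension mixing; at least one dimension always comes from the mutant."""
    mask = rng.random(parent.shape) < CR
    mask[rng.integers(parent.shape[0])] = True  # guarantee one mutated dimension
    return np.where(mask, mutant, parent)
```

In a full DE loop, the offspring produced by `binomial_crossover` would replace its parent only if it has better fitness.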
3. DE-DDQN
When integrated with DE as an AOS method, DDQN is adapted as follows. The environment of DDQN becomes the DE algorithm performing an optimization run for a maximum number of function evaluations. A state $s_t$ is a collection of features that measure static or run-time properties of the problem being solved or of DE at step $t$ (function evaluation or generation counter). The actions that DDQN may take are the set of mutation strategies available (Sect. 2.1), and $a_t$ is the strategy selected and applied at step $t$. Once a mutation strategy is applied, a reward function returns the estimated benefit (reward) $r_t$ of applying action $a_t$, and the DE run reaches a new state $s_{t+1}$. We refer to the tuple $(s_t, a_t, r_t, s_{t+1})$ as an observation.
Our proposed DE-DDQN algorithm operates in two phases. In the first, training phase, the two deep neural networks of DDQN are trained on observations by running the DE-DDQN algorithm multiple times on several benchmark functions. In a second, online (or deployment) phase, the trained DDQN is used to select which mutation strategy should be applied at each generation of DE when tackling unseen (or test) problems not considered during the training phase. We describe these two phases in detail next.
3.1. Training phase
In the training phase, DDQN uses two deep neural networks (NNs), namely the primary NN and the target NN. The primary NN predicts the Q-values $Q(s_t, a; \theta)$ that are used to select an action $a_t$ given state $s_t$ at step $t$, while the target NN estimates the target Q-values $Q(s_{t+1}, a; \theta')$ after the action has been applied, where $\theta$ and $\theta'$ are the weights of the primary and target NNs, respectively, $s$ is the state vector of DE, and $a$ is a mutation strategy.
The goal of the training phase is to train the primary NN of DDQN so that it learns to approximate the target function. The training data is a memory of observations that is collected by running DEDDQN several times on training benchmark functions. Training the primary NN involves finding its weights through gradient optimization.
The training process of DE-DDQN is shown in Algorithm 1. Training starts by running DE with random selection of the mutation strategy for a fixed number of steps (the warm-up size), which generates observations to populate a memory whose capacity can be different from the warm-up size (line 2). This memory stores a fixed number of recent observations; old ones are removed as new ones are added. Once the warm-up phase is over, DE is executed repeatedly, and each run is stopped after the maximum number of function evaluations or when the known optimum of the training problem is reached (line 7). For each solution in the population, the ε-greedy policy is used to select the mutation strategy, i.e., with probability ε a random mutation strategy is selected, otherwise the mutation strategy with the maximum Q-value is selected. Using the current DE state $s_t$, the primary NN generates a Q-value per possible mutation strategy (line 12). The use of an ε-greedy policy forces the primary NN to explore mutation strategies that are currently predicted to be less optimal. The selected mutation strategy is applied (line 13) and a new state is reached (line 14). A reward value is computed by measuring the performance progress made at this step.
To prevent the primary NN from only learning about the immediate state of the current DE run, we randomly draw mini-batches of observations (line 16) from the memory to perform a step of gradient optimization. Training the primary NN with these randomly drawn observations helps it to robustly learn to perform well in the task.
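The observation memory can be sketched as a fixed-capacity FIFO buffer with uniform sampling; the class and method names here are illustrative, not the paper's implementation:

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity store of (state, action, reward, next_state) observations.
    The oldest observations are discarded as new ones arrive."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def add(self, obs):
        self.buffer.append(obs)

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal correlation of a single DE run.
        return random.sample(self.buffer, batch_size)
```

Sampling uniformly from the whole memory is what lets each gradient step see observations from many different points of many different runs.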
The primary NN is used to predict the next mutation strategy (line 20) and its reward (line 21), without actually applying the mutation. A target value $y_t$ is used to train the primary NN, i.e., to find the weights that minimise the loss function (line 22). If the run terminates, i.e., if the budget assigned to the problem is exhausted, $y_t$ is the same as the reward $r_t$. Otherwise, $y_t$ is estimated (line 21) as a linear combination of the current reward and the predicted future reward, $y_t = r_t + \gamma\, Q(s_{t+1}, \arg\max_a Q(s_{t+1}, a; \theta); \theta')$, where $Q(\,\cdot\,; \theta')$ is the (predicted) target Q-value and $\gamma$ is the discount factor that makes the training focus more on immediate rewards than on future ones.
Finally, the primary and target NNs are synchronised periodically by copying the weights from the primary NN to the target NN every fixed number of training steps (line 23). That is, the target NN uses an older set of weights to compute the target Q-value, which keeps the target value from changing too quickly. At every step of training (line 22), the Q-values generated by the primary NN shift. If a constantly shifting set of values were used to calculate $y_t$ (line 21) and adjust the NN weights (line 22), the target value estimations could easily become unstable by falling into feedback loops between $y_t$ and the Q-values used to calculate it. To mitigate that risk, the target NN is used to generate the target Q-values from which $y_t$ is computed, and $y_t$ in turn enters the loss function for training the primary NN. While the primary NN is trained, the weights of the target NN are fixed.
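The target computation and the periodic synchronisation described above can be sketched as follows, with the two Q-networks abstracted as callables and the value of γ chosen purely for illustration:

```python
import numpy as np

def ddqn_targets(q_primary, q_target, rewards, next_states, done, gamma=0.99):
    """Double-DQN target: the primary NN selects the next action,
    the frozen target NN evaluates it; y_t = r_t for terminal steps."""
    a_next = np.argmax(q_primary(next_states), axis=1)              # selection
    q_next = q_target(next_states)[np.arange(len(a_next)), a_next]  # evaluation
    return rewards + gamma * q_next * (1.0 - done)

def sync_target(target_weights, primary_weights):
    """Periodic synchronisation: copy primary weights into the frozen target NN."""
    for k in target_weights:
        target_weights[k] = primary_weights[k].copy()
```

Decoupling action selection (primary) from action evaluation (target) is exactly what distinguishes DDQN from plain DQN, which uses the same network for both.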
3.2. Online phase
Once the learning is finished, the weights of the primary NN are frozen. In the testing phase, the mutation strategy is selected online during an optimization run on an unseen function. The online AOS with DE is shown in Algorithm 2. Since the weights of the NN are not updated in this phase, we do not maintain a memory of observations or compute rewards. As a new state is observed, the Q-values per mutation strategy are calculated and a mutation strategy is chosen according to the greedy policy (line 7).
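A minimal sketch of this online selection step, assuming the trained network is available as a callable returning one Q-value per strategy:

```python
import numpy as np

STRATEGIES = ["rand/1", "rand/2", "rand-to-best/2", "curr-to-rand/1"]

def select_strategy(q_network, state):
    """Online phase: the trained network is frozen and the action with the
    highest predicted Q-value is always chosen (pure greedy, no exploration)."""
    return STRATEGIES[int(np.argmax(q_network(state)))]
```

Unlike the ε-greedy policy of the training phase, no random exploration is performed here.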
4. State features and reward
In this section we describe the new state features and reward definitions explored for the proposed DE-DDQN method.
4.1. State representation
The state representation needs to provide sufficient information so that the NN can decide which action is more suitable at the current step. We propose a state vector consisting of various features capturing properties of the landscape and the history of operator performance. Each feature is normalised to the range $[0, 1]$ by design, in order to abstract away absolute values specific to particular problems and help generalisation. The features are summarised in Table 1.
Index  Feature  Notes

1  —  $x_i$ denotes the $i$-th solution of the population and $f(x_i)$ its fitness; the best-so-far and worst-so-far fitness values found up to this step within a single run are used for normalisation
2  —  $\mathit{NP}$ is the population size
3  —  the standard deviation of the population fitness, normalised by its maximum value, attained when half the solutions have the best-so-far fitness and the other half the worst-so-far fitness
4  —  the remaining number of function evaluations at step $t$, normalised by the maximum number of function evaluations per run
5  —  the dimension of the benchmark function being optimised, normalised by the maximum dimension among all training functions
6  —  stagcount is the stagnation counter, i.e., the number of function evaluations (steps) without improving
7–11  —  the Euclidean distance between two solutions, normalised by the maximum possible distance, calculated between the lower and upper bounds of the decision space; $r_1, \dots, r_5$ are random indexes
12  —  $x_{\text{best}}$ is the best parent in the current population
13–17  —  fitness differences between the current parent and the five randomly indexed parents
18  —  fitness difference between the current parent and $x_{\text{best}}$
19  —  distance between the current parent and the solution with the best-so-far fitness
20–35  —  for each op and metric, normalised over all operators; gen is the number of recent generations recorded; successful and total applications of op according to each metric are counted per generation
36–51  —  for each op and metric, normalised over all operators
52–67  —  for each op and metric, normalised over all operators; normalised by the maximum observed value
68–83  —  for each op and metric, normalised over all operators
84–99  —  for each op and metric, normalised over all operators; values are taken from the fixed-size window of improvements generated by op
Our state needs to encode information about how the current solutions in the population are distributed in the decision space and about their differences in fitness values. The fitness of the current parent is given to the NN as the first state feature. The next feature is the mean fitness of the current population. These first two features are normalised by the difference between the worst and best solutions seen so far. The third feature is the standard deviation of the population fitness values. Feature 4 measures the remaining budget of function evaluations. Feature 5 is the dimension of the function being solved. The training set includes benchmark functions with different dimensions in the hope that the NN is able to generalise to functions of any dimension within the training range. Feature 6, the stagnation count, is the number of function evaluations since the last improvement of the best fitness found in this run (normalised by the maximum budget).
The next set of feature values describes the relation between the current parent $x_i$ and the six solutions used by the various mutation strategies, i.e., the five randomly indexed parents $x_{r_1}, \dots, x_{r_5}$ and the best parent in the population, $x_{\text{best}}$. Features 7–12 measure the Euclidean distance in decision space between the current parent and these six solutions. These six Euclidean distances help the NN learn to select the strategy that best combines these solutions. Features 13–18 use the same six solutions to calculate the fitness difference w.r.t. $x_i$. Feature 19 measures the normalised Euclidean distance in decision space between $x_i$ and the best solution seen so far. We use distances instead of positions to make the state representation independent of the dimensionality of the solution space.
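As an illustration, features 7–18 could be computed as below; the exact normalisation constants (maximum possible distance, worst-minus-best fitness range) are our reading of the text:

```python
import numpy as np

def parent_relation_features(x_i, f_i, neighbours, f_neighbours, d_max, f_bsf, f_wsf):
    """Normalised distances (features 7-12) and fitness differences (13-18)
    between the current parent and the six solutions used by the strategies.

    d_max: maximum possible distance in the bounded decision space;
    f_bsf, f_wsf: best-so-far and worst-so-far fitness in this run.
    """
    dists = [np.linalg.norm(x_i - x) / d_max for x in neighbours]
    fdiffs = [abs(f_i - f) / (f_wsf - f_bsf) for f in f_neighbours]
    return dists + fdiffs
```

Both groups of values fall in [0, 1] as long as `d_max` and the fitness range really bound the observed values, which is the intent of the normalisation described above.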
Describing the current population is not sufficient for selecting the best strategy. Reinforcement learning requires the state to be Markov, i.e., to include all information necessary for selecting an action. To this end, we enhance the state with features about the run-time history. Using historical information has been shown to be useful in our previous work (Sharma et al., 2018). In addition to the remaining budget and the stagnation counter described above, we also store four metric values after each application of op at generation g:

– $m_1$: the fitness improvement of the offspring over its parent;

– $m_2$: the fitness improvement of the offspring over the best parent in the current population;

– $m_3$: the fitness improvement of the offspring over the best-so-far solution; and

– $m_4$: the fitness improvement of the offspring over the median fitness of the parent population.

For each metric, the total number of fitness improvements (successes) of op at generation g is recorded, together with a counter of the total number of applications of op at generation g. We store this historical information for the last gen generations.
With the information above, we compute the sum of success rates over the last gen generations, where each success rate is the number of successful applications of operator op (i.e., mutation strategy) in generation g that improve a given metric, divided by the total number of applications of op in the same generation. For each metric, the values of an operator are normalised by the sum of the values of all operators. A different success rate is calculated for each combination of metric (four metrics) and op (four mutation strategies), resulting in features 20–35.
We also compute, for each metric, the sum of fitness improvements divided by the total number of applications of op over the last gen generations (features 36–51). Features 52–67 are defined in terms of the best fitness improvement of a mutation strategy op according to a metric over a given generation g. In this case, we calculate the relative difference in best improvement of the last generation with respect to the previous one, divided by the difference in number of applications between the last two generations (gen and gen − 1). Any zero value in the denominator is ignored. The sums of best improvements seen for each combination of operator and metric are given as features 68–83.
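A sketch of how the success-rate features 20–35 could be computed from the stored counters; the data layout (`history[g][op]` as a list of per-metric (successes, applications) pairs) is our assumption:

```python
def success_rate_features(history, operators, n_metrics):
    """For each (metric, operator) pair, sum the per-generation success rates
    over the recorded generations, then normalise each metric's values
    across operators so they sum to 1."""
    raw = {(m, op): 0.0 for m in range(n_metrics) for op in operators}
    for g in history:                       # last `gen` generations
        for op in operators:
            for m, (succ, n_app) in enumerate(g[op]):
                if n_app > 0:               # skip generations where op was unused
                    raw[(m, op)] += succ / n_app
    feats = []
    for m in range(n_metrics):
        total = sum(raw[(m, op)] for op in operators) or 1.0  # avoid divide-by-zero
        feats.extend(raw[(m, op)] / total for op in operators)
    return feats
```

With four metrics and four operators this yields the 16 features of the 20–35 block; the improvement-sum blocks (36–51 and onwards) would follow the same pattern with different numerators.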
Features 84–99 are calculated by maintaining a fixed-size window in which each element is a tuple of the four metric values resulting from the application of a mutation strategy to a parent $x_i$ that generates offspring $u_i$. Initially, the window is filled as new improved offspring are produced. Once it is full, new elements replace existing ones generated by the same mutation strategy according to the First-In First-Out (FIFO) rule. If there is no element produced by that operator in the window, the element with the worst (highest) value is replaced. Each feature is the sum of the values within the window for each metric and each operator. The difference between the features extracted from recent generations (68–83) and those from the fixed-size window (84–99) is that the window captures the best solutions for each operator, and the number of solutions present per operator varies; in a sense, solutions compete to be part of the window. By contrast, when computing features from the last gen generations, all successful improvements per generation are captured and there is no competition among elements. As the most recent history is the most useful, we use small values for the number of recorded generations gen and the window size.

4.2. Reward definitions
While we only know the true reward of a sequence of actions after a full run of DE is completed, i.e., the best fitness found, such sparse rewards provide a very weak signal and can slow down training. Instead, we calculate a reward after every action, i.e., whenever a new offspring $u_i$ is produced from a parent $x_i$. In this paper, we explore three reward definitions, each using different information related to fitness improvement:
R1: $r_t = \max\{f(x_i) - f(u_i),\, 0\}$

R2: $r_t = 10$ if $f(u_i) < f_{\text{bsf}}$; $r_t = 1$ if $f_{\text{bsf}} \le f(u_i) < f(x_i)$; $r_t = 0$ otherwise

R3: $r_t = \max\left\{\frac{f(x_i) - f(u_i)}{f(u_i) - f_{\text{opt}}},\, 0\right\}$
R1 is the fitness difference between offspring and parent when an improvement is seen. This definition has been used commonly in the literature on parameter control (Pettinger and Everson, 2002; Chen et al., 2005; Sakurai et al., 2010). R2 assigns a higher reward to an improvement over the best-so-far solution than to an improvement over the parent. Finally, R3 is a variant of R1 relative to the difference between the offspring fitness and the optimal fitness, i.e., it maximises the fitness difference between parent and offspring and minimises the fitness difference between offspring and the optimal solution. This definition can only be used when the optimum values of the functions used for training are known in advance.
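Under the descriptions above, the three definitions can be sketched as follows (for minimisation). The exact forms are reconstructed from the prose: R2's fixed values 10/1/0 are inferred from the later discussion ("ten times more reward"), and the scaling in R3 is our reading of "relative to the difference between the offspring fitness and the optimal fitness".

```python
def reward_r1(f_parent, f_child):
    # Improvement of offspring over parent (minimisation); 0 if no improvement.
    return max(f_parent - f_child, 0.0)

def reward_r2(f_parent, f_child, f_bsf):
    # Fixed values: 10 for improving the best-so-far, 1 for improving the parent.
    if f_child < f_bsf:
        return 10.0
    if f_child < f_parent:
        return 1.0
    return 0.0

def reward_r3(f_parent, f_child, f_opt):
    # Improvement over the parent, scaled by the remaining gap to the known optimum.
    if f_child >= f_parent:
        return 0.0
    gap = f_child - f_opt
    return (f_parent - f_child) / gap if gap > 0 else f_parent - f_child
```

Note that `reward_r3` requires the optimum value `f_opt`, which is only available for the training functions, consistent with the restriction stated above.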
5. Experimental design
In our implementation of DE-DDQN, the primary and target NNs are multilayer perceptrons. We integrate the three reward definitions R1, R2 and R3 into DE-DDQN and denote the resulting methods DE-DDQN1, DE-DDQN2 and DE-DDQN3, respectively. For each of these methods, we trained four NNs using batch sizes 64 or 128 and 3 or 4 hidden layers, and we picked the best combination of batch size and number of hidden layers according to the total accumulated reward during the training phase. In all cases, the most successful configuration was batch size 64 with 4 hidden layers. Results of the other configurations are not shown in the paper.
The remaining parameters are not tuned but set to typical values. In the training phase, we applied an ε-greedy policy, with a fraction ε of the actions selected randomly and the rest according to the highest Q-value. In the warm-up phase during training, we set the capacity of the observation memory larger than the warm-up size so that 90% of the memory is filled with observations from random actions and the rest with actions selected by the NN. The gradient descent algorithm used to update the weights of the NN during training is Adam (Kingma and Ba, 2014). Table 2 shows all hyperparameter values.
Training and online parameters  Parameter value 

Scaling factor ()  
Crossover rate ()  
Population size ()  
per function  function evaluations 
Max. generations (gen)  
Window size ()  
Type of neural network  Multilayer perceptron 
Hidden layers  
Hidden nodes  per hidden layer 
Activation function  Rectified linear unit (ReLU) (Nair and Hinton, 2010) 
Batch size  
Training only parameters  Parameter value 
Training policy  ε-greedy 
Discount factor ()  
Target network synchronised ()  every steps 
Observation memory capacity  
Warmup size  
NN training algorithm  Adam (learning rate: ) 
Online phase parameters  Parameter value 
Online policy  Greedy 
We compared the three proposed DE-DDQN variants with ten baselines: random selection of mutation strategies (Random), four fixed-strategy DE variants (DE1–DE4), PM-AdapSS (AdapSS) (Fialho et al., 2010b), F-AUC (FAUC) (Gong et al., 2010), RecPM-AOS (RecPM) (Sharma et al., 2018) and the two winners of the CEC 2005 competition, which are both variants of CMA-ES: LR-CMA-ES (LR) (Auger and Hansen, 2005a) and IPOP-CMA-ES (IPOP) (Auger and Hansen, 2005b). Among these alternatives, AdapSS, FAUC and RecPM are AOS methods that were proposed to adaptively select mutation strategies. The parameters of these AOS methods were previously tuned with the offline configurator irace (Sharma et al., 2018), and the tuned hyperparameter values (parameters of the AOS, not of DE) are used in the experiments. The first eight baselines involve the DE algorithm with the population size (NP), scaling factor (F) and crossover rate (CR) given in Table 2. This choice of F has shown good results (Fialho, 2010), and CR has been chosen so as to see the full potential of the mutation strategies to evolve each dimension of each parent. The results of LR and IPOP are taken from their original papers from the CEC 2005 competition.
Table 3. Mean and standard deviation of the final error values over 25 runs of each algorithm (Random, DE1–DE4, AdapSS, FAUC, RecPM, LR, IPOP, DDQN1–DDQN3) on each of the 10 test problems (dimensions 10 and 30). (Numerical entries omitted.)
Table 4. Mean ranks of the 13 algorithms across the 10 test problems.

Algo  IPOP  DDQN2  DDQN3  RecPM  LR  AdapSS  DDQN1  FAUC  Random  DE3  DE2  DE4  DE1 
Rank  2.3  3.3  4.1  4.4  4.4  4.9  5.4  7.2  10.5  10.8  10.8  11.4  11.5 
5.1. Training and testing
In order to force the NN to learn a general policy, we train on different classes of functions. From the 25 functions of the CEC 2005 benchmark suite (Suganthan et al., 2005), we excluded the non-deterministic functions and the functions without search bounds. The remaining 21 functions can be divided into four classes: unimodal functions, basic multimodal functions, expanded multimodal functions, and hybrid composition functions. We split these 21 functions into a training set of 16 functions and a test set of 5 functions. According to the above classification, the training set contains at least two functions from each class, and the test set contains at least one function from each class except for the expanded multimodal functions, as both functions of this class are included in the training set. For each function, we consider both dimensions 10 and 30, giving a total of 32 problems for training and 10 problems for testing.
During training, we cycle through the 32 training problems multiple times and keep track of the mean reward achieved in each cycle. We overwrite the saved weights of the NN whenever the mean reward is better than the best observed in previous cycles. We found this measure of progress better than comparing rewards after individual runs, because the problems vary in difficulty, making rewards incomparable. After each cycle, the 32 problems are shuffled before being used again. The mean reward stopped improving after 1890 cycles (60,480 problems), which indicated the convergence of the learning process.
Although the computational cost of the training phase is significant compared to a single run of DE, this cost is incurred offline, i.e., one time on known benchmark functions before solving any unseen function, and it can be significantly reduced by means of parallelisation and GPUs. On the other hand, we conjecture that training on even more data from different classes of functions should allow the application of DEDDQN to a larger range of unknown functions.
After training, the NN weights were saved and used for the testing (online) phase.¹ For testing, each DE-DDQN variant was independently run 25 times on each test problem, and each run was stopped either when the absolute error with respect to the optimum fell below the target precision or when the budget of function evaluations was exhausted. The mean and standard deviation of the final error values achieved over the 25 runs are reported in Table 3.

¹The weights obtained after training are available on GitHub (Sharma et al., 2019) together with the source code, and can be used for testing on similar functions, including expanded multimodal ones. The code may be adapted to train or test using other benchmark suites, such as BBOB, with functions of larger dimension.
5.2. Discussion of results
The average rankings of each method across the 10 test problem instances are shown in Table 4. The differences among the 13 algorithms are significant according to the non-parametric Friedman test. We conducted a post-hoc analysis using the best-performing method among the newly proposed ones (DE-DDQN2) as the control method for pairwise comparisons with the other methods. The p-values adjusted for multiple comparisons (Li, 2008) are shown in Table 5. The differences between DE-DDQN2 and five baselines, random selection of operators and the single-strategy DEs (DE1–DE4), are significant, while the differences with the other methods are not. The analysis makes clear that the proposed method learns to adaptively select the strategy at different stages of a DE run.
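For reference, the mean ranks of Table 4 and the Friedman test statistic can be computed as in this self-contained sketch over an (n problems × k algorithms) error matrix; the post-hoc adjustment of Li (2008) is not reproduced here, and tie handling is omitted for brevity:

```python
import numpy as np

def average_ranks(errors):
    """Mean rank per algorithm (rank 1 = lowest error), as reported in Table 4."""
    # Double argsort turns each row of errors into 0-based ranks.
    ranks = np.argsort(np.argsort(errors, axis=1), axis=1) + 1
    return ranks.mean(axis=0)

def friedman_statistic(errors):
    """Friedman chi-square statistic over an (n problems x k algorithms) matrix."""
    n, k = errors.shape
    R = average_ranks(errors)
    return 12.0 * n / (k * (k + 1)) * np.sum((R - (k + 1) / 2.0) ** 2)
```

The statistic is then compared against a chi-square distribution with k − 1 degrees of freedom to decide significance.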
While the differences between the three reward definitions are not statistically significant, the rankings provide some evidence that R2 performs better than the other two definitions. R2, being a simple definition that assigns fixed reward values, is not affected by the function range, whereas R1 and R3, which involve raw function values, may mislead the NN when dealing with functions with different fitness ranges. R2 assigns ten times more reward when the offspring improves over the best-so-far solution than when it improves over its parent. Thus, DE-DDQN2 may learn to generate offspring that not only tend to improve over the parent but also improve the best fitness seen so far. By contrast, R1 considers only the improvement of the offspring over its parent and is less informative than R3, which considers the improvement over both the parent and the optimum value. The improvement can be small or large when function values with different ranges are considered; as a result, R1 and R3 become less informative about choosing the operators that will solve the problem within the given number of function evaluations. Although R3 scales the fitness improvement by the distance from the optimum, which partially mitigates the effect of different ranges among functions, inconsistent ranges remain problematic. The R2 definition encourages the generation of offspring better than the best-so-far candidate and is invariant to differences in function ranges. Comparison with the other methods proposed in the literature shows that DE variants with a suitable operator selection strategy can perform similarly to CMA-ES variants, which are known to be among the best-performing methods for this class of problems.
To further analyse the difference between DE-DDQN and the other AOS methods, we provide box plots of the results of 25 runs of DE-DDQN2, PM-AdapSS and RecPM-AOS on each function (Fig. 1). We observe that the overall minimum function value found across the 25 runs is lower for DE-DDQN2 on all but two of the problem instances. As seen in the box plots, for two functions with dimension 10, DE-DDQN2 often gets stuck in local optima, but manages to find a better overall solution than the other methods, which find solutions with high variance in these cases. At the same time, the median values of the solutions found are better for six out of ten problems. This observation suggests that incorporating restart strategies similar to those used by IPOP-CMA-ES could be particularly useful for DE-DDQN, and it gives us a direction for future work. DE-DDQN2 performs consistently well on the unimodal
functions with both 10 and 30 dimensions, while the other AOS methods find solutions with relatively higher error and high variance. We interpret this as an indication that DE-DDQN can identify this type of problem and apply a more suitable AOS strategy than RecPM and PM-AdapSS. On the other hand, for two multimodal functions with dimension 30, DE-DDQN2 exhibits higher variance of solutions, which suggests that higher-dimensional multimodal functions often confuse the NN, leading it to suboptimal behaviour.

Table 5: Post-hoc pairwise comparisons with DE-DDQN2 as the control method; p-values adjusted for multiple comparisons (Li, 2008).

Comparison          Statistic   p-value   Result
DDQN2 vs DE1        4.70819     0.00001   H0 is rejected
DDQN2 vs DE4        4.65077     0.00008   H0 is rejected
DDQN2 vs DE2        4.30627     0.00005   H0 is rejected
DDQN2 vs DE3        4.30627     0.00005   H0 is rejected
DDQN2 vs Random     4.13402     0.00010   H0 is rejected
DDQN2 vs FAUC       2.23926     0.06630   H0 is not rejected
DDQN2 vs DDQN1      1.20576     0.39166   H0 is not rejected
DDQN2 vs AdapSS     0.91867     0.50299   H0 is not rejected
DDQN2 vs RecPM      0.63159     0.59848   H0 is not rejected
DDQN2 vs LR         0.63159     0.59848   H0 is not rejected
DDQN2 vs IPOP       0.57417     0.61515   H0 is not rejected
DDQN2 vs DDQN3      0.45934     0.64599   H0 is not rejected
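The DE1–DE4 baselines in the table each apply a single fixed mutation strategy, while DE-DDQN selects among strategies online. Which four strategies the paper uses is not stated in this excerpt; the sketch below shows four classic DE mutation strategies (Storn and Price, 1997; Price et al., 2005) as an illustration of the kind of operators being selected.

```python
# Four classic DE mutation strategies producing a donor vector from the
# population. Strategy names follow the common DE/x/y notation; which of
# these the DE1-DE4 baselines correspond to is an assumption here.
import numpy as np

rng = np.random.default_rng(1)

def mutate(pop, best_idx, strategy, F=0.5):
    n = len(pop)
    r1, r2, r3, r4 = rng.choice(n, size=4, replace=False)
    i = rng.integers(n)  # index of the current (target) individual
    if strategy == "rand/1":
        return pop[r1] + F * (pop[r2] - pop[r3])
    if strategy == "best/1":
        return pop[best_idx] + F * (pop[r1] - pop[r2])
    if strategy == "current-to-best/1":
        return pop[i] + F * (pop[best_idx] - pop[i]) + F * (pop[r1] - pop[r2])
    if strategy == "best/2":
        return pop[best_idx] + F * (pop[r1] - pop[r2]) + F * (pop[r3] - pop[r4])
    raise ValueError(f"unknown strategy: {strategy}")

pop = rng.standard_normal((10, 3))             # 10 individuals, 3 dimensions
best = int(np.argmin((pop ** 2).sum(axis=1)))  # best on the sphere function
donor = mutate(pop, best, "rand/1")
print(donor.shape)  # prints (3,)
```

An AOS method such as DE-DDQN picks one of these strategy names at each generation; a single-strategy baseline keeps it fixed for the whole run.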
Figure 1: Box plots of the solutions found in 25 runs of DE-DDQN2, PM-AdapSS and RecPM-AOS on each test function, with dimensions 10 and 30.
6. Conclusion
We presented DE-DDQN, a deep-RL-based operator selection method that learns to select the mutation strategies of DE online. DE-DDQN has two phases: an offline training phase and an online evaluation phase. During training, we collected data from DE runs, using a reward metric to assess the performance of the selected mutation action and 99 features to describe the state of the DE algorithm. The features and reward values are used to optimise the weights of a neural network that learns the most rewarding mutation given the DE state. The weights learned during training are then used during the online phase to predict the mutation strategy to use when solving a new problem. Experiments were run using 21 functions from the CEC2005 benchmark suite, each evaluated with dimensions 10 and 30. Of the resulting problem instances, 32 were used for training, and we ran the online phase on a disjoint test set of 10 instances.
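The learning rule underlying DE-DDQN is the double deep Q-network (van Hasselt et al., 2016): the online network selects the greedy next action, while a separate target network evaluates it. A minimal sketch of that target computation, assuming the 99 state features and using a placeholder linear "network" with four mutation-strategy actions:

```python
# Minimal sketch of the Double DQN bootstrap target. The linear maps
# stand in for the actual neural network; 99 features follow the text,
# and 4 actions (mutation strategies) is an assumption here.
import numpy as np

rng = np.random.default_rng(0)
n_features, n_actions, gamma = 99, 4, 0.99

W_online = rng.standard_normal((n_features, n_actions)) * 0.01
W_target = W_online.copy()  # periodically synced with the online weights

def q_values(W, state):
    return state @ W  # one Q-value per mutation strategy

def double_dqn_target(reward, next_state, done):
    # action selection by the online net, evaluation by the target net
    a_star = int(np.argmax(q_values(W_online, next_state)))
    bootstrap = q_values(W_target, next_state)[a_star]
    return reward + (0.0 if done else gamma * bootstrap)

state = rng.standard_normal(n_features)
print(double_dqn_target(reward=1.0, next_state=state, done=False))
```

Decoupling selection from evaluation in this way reduces the overestimation of Q-values that a single network exhibits when it both picks and scores the maximising action.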
All three proposed methods outperform all the non-AOS baselines in terms of the mean error over 25 runs on the test functions. This shows that the proposed methods can learn to select the right strategy at different stages of the algorithm. Our statistical analysis suggests that the differences between the best proposed method and the AOS methods from the literature are not significant; nevertheless, the best-performing version of our model, DE-DDQN2, was ranked second overall, after IPOP-CMA-ES. The R2 reward function, which assigns fixed reward values when better solutions are found, is the most helpful for learning an AOS strategy.
For future work, we want to explore applications of deep RL for learning to control more parameters of evolutionary algorithms, including combinations of discrete and continuous parameters. We also expect that extensive tuning of the state features and hyperparameter values will further improve the performance of the method.
References
 Aleti and Moser (2016) A. Aleti and I. Moser. 2016. A systematic literature review of adaptive parameter control methods for evolutionary algorithms. Comput. Surveys 49, 3, Article 56 (Oct. 2016), 35.
 Auger and Hansen (2005a) A. Auger and N. Hansen. 2005a. Performance evaluation of an advanced local search evolutionary algorithm. In Proceedings of the 2005 Congress on Evolutionary Computation (CEC 2005). IEEE Press, Piscataway, NJ, 1777–1784.
 Auger and Hansen (2005b) A. Auger and N. Hansen. 2005b. A restart CMA evolution strategy with increasing population size. In Proceedings of the 2005 Congress on Evolutionary Computation (CEC 2005). IEEE Press, Piscataway, NJ, 1769–1776.
 Chen et al. (2005) F. Chen, Y. Gao, Z.-q. Chen, and S.-f. Chen. 2005. SCGA: Controlling genetic algorithms with Sarsa(0). In International Conference on Computational Intelligence for Modelling, Control and Automation and International Conference on Intelligent Agents, Web Technologies and Internet Commerce, Vol. 1. IEEE, 1177–1183.
 Eiben et al. (2006) A. E. Eiben, M. Horvath, W. Kowalczyk, and M. C. Schut. 2006. Reinforcement learning for online control of evolutionary algorithms. In International Workshop on Engineering Self-Organising Applications. Springer, 151–160.
 Fialho (2010) Á. Fialho. 2010. Adaptive operator selection for optimization. Ph.D. Dissertation. Université Paris-Sud (Paris XI).
 Fialho et al. (2010a) Á. Fialho, R. Ros, M. Schoenauer, and M. Sebag. 2010a. Comparisonbased adaptive strategy selection with bandits in differential evolution. In Parallel Problem Solving from Nature, PPSN XI, R. Schaefer et al. (Eds.). Lecture Notes in Computer Science, Vol. 6238. Springer, Heidelberg, Germany, 194–203.
 Fialho et al. (2010b) Á. Fialho, M. Schoenauer, and M. Sebag. 2010b. Toward comparison-based adaptive operator selection. In Proceedings of the Genetic and Evolutionary Computation Conference, GECCO 2010, M. Pelikan and J. Branke (Eds.). ACM Press, New York, NY, 767–774.
 Gong et al. (2010) W. Gong, Á. Fialho, and Z. Cai. 2010. Adaptive strategy selection in differential evolution. In Proceedings of the Genetic and Evolutionary Computation Conference, GECCO 2010, M. Pelikan and J. Branke (Eds.). ACM Press, New York, NY, 409–416.
 Karafotias et al. (2014) G. Karafotias, A. E. Eiben, and M. Hoogendoorn. 2014. Generic parameter control with reinforcement learning. In Proceedings of the Genetic and Evolutionary Computation Conference, GECCO 2014, C. Igel and D. V. Arnold (Eds.). ACM Press, New York, NY, 1319–1326.
 Karafotias et al. (2015a) G. Karafotias, M. Hoogendoorn, and A. E. Eiben. 2015a. Evaluating reward definitions for parameter control. In Applications of Evolutionary Computation, EvoApplications 2015, A. M. Mora and G. Squillero (Eds.). Lecture Notes in Computer Science, Vol. 9028. Springer, Heidelberg, Germany, 667–680.
 Karafotias et al. (2015b) G. Karafotias, M. Hoogendoorn, and A. E. Eiben. 2015b. Parameter Control in Evolutionary Algorithms: Trends and Challenges. IEEE Transactions on Evolutionary Computation 19, 2 (2015), 167–187.
 Karafotias et al. (2012) G. Karafotias, S. K. Smit, and A. E. Eiben. 2012. A generic approach to parameter control. In Applications of Evolutionary Computation, EvoApplications 2012, D. C. C. et al. (Eds.). Lecture Notes in Computer Science, Vol. 7248. Springer, Heidelberg, Germany, 366–375.
 Kee et al. (2001) E. Kee, S. Airey, and W. Cyre. 2001. An adaptive genetic algorithm. In Proceedings of the Genetic and Evolutionary Computation Conference, GECCO 2001, E. D. Goodman (Ed.). Morgan Kaufmann Publishers, San Francisco, CA, 391–397.
 Kingma and Ba (2014) D. P. Kingma and J. Ba. 2014. Adam: A method for stochastic optimization. Arxiv preprint arXiv:1412.6980 [cs.LG] (2014). https://arxiv.org/abs/1412.6980
 Li (2008) J. D. Li. 2008. A two-step rejection procedure for testing multiple hypotheses. Journal of Statistical Planning and Inference 138, 6 (2008), 1521–1527.
 Mezura-Montes et al. (2006) E. Mezura-Montes, J. Velázquez-Reyes, and C. A. Coello Coello. 2006. A comparative study of differential evolution variants for global optimization. In Proceedings of the Genetic and Evolutionary Computation Conference, GECCO 2006, M. Cattolico et al. (Eds.). ACM Press, New York, NY, 485–492.
 Mnih et al. (2015) V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. 2015. Human-level control through deep reinforcement learning. Nature 518, 7540 (2015), 529.
 Nair and Hinton (2010) V. Nair and G. E. Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10). ACM Press, New York, NY, 807–814.
 Pettinger and Everson (2002) J. E. Pettinger and R. M. Everson. 2002. Controlling genetic algorithms with reinforcement learning. In Proceedings of the Genetic and Evolutionary Computation Conference, GECCO 2002, W. B. Langdon et al. (Eds.). Morgan Kaufmann Publishers, San Francisco, CA, 692–692.
 Price et al. (2005) K. Price, R. M. Storn, and J. A. Lampinen. 2005. Differential Evolution: A Practical Approach to Global Optimization. Springer, New York, NY.
 Sakurai et al. (2010) Y. Sakurai, K. Takada, T. Kawabe, and S. Tsuruta. 2010. A method to control parameters of evolutionary algorithms by using reinforcement learning. In 2010 Sixth International Conference on Signal-Image Technology and Internet Based Systems. IEEE, 74–79.
 Sharma et al. (2018) M. Sharma, M. López-Ibáñez, and D. Kazakov. 2018. Performance assessment of recursive probability matching for adaptive operator selection in differential evolution. In Parallel Problem Solving from Nature, PPSN XV, A. Auger et al. (Eds.). Lecture Notes in Computer Science, Vol. 11102. Springer, Cham, 321–333.
 Sharma et al. (2019) M. Sharma, M. López-Ibáñez, and D. Kazakov. 2019. Deep reinforcement learning based parameter control in differential evolution: Supplementary material. https://github.com/mudita11/DEDDQN. (2019).
 Storn and Price (1997) R. Storn and K. Price. 1997. Differential Evolution – A simple and efficient heuristic for global optimization over continuous spaces. Journal of Global Optimization 11, 4 (1997), 341–359.
 Suganthan et al. (2005) P. N. Suganthan, N. Hansen, J. J. Liang, K. Deb, Y. P. Chen, A. Auger, and S. Tiwari. 2005. Problem definitions and evaluation criteria for the CEC 2005 special session on real-parameter optimization. Technical Report. Nanyang Technological University, Singapore.
 Sutton and Barto (1998) R. S. Sutton and A. G. Barto. 1998. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA.
 van Hasselt et al. (2016) H. van Hasselt, A. Guez, and D. Silver. 2016. Deep reinforcement learning with Double Q-Learning. In AAAI, D. Schuurmans and M. P. Wellman (Eds.). AAAI Press.