Evolving Neural Networks in Reinforcement Learning by means of UMDAc

by Mikel Malagon, et al.

Neural networks are gaining popularity in the reinforcement learning field due to the vast number of successfully solved complex benchmark problems. In fact, artificial intelligence algorithms are, in some cases, able to outperform human professionals. Usually, neural networks have more than a couple of hidden layers, and thus involve a large quantity of parameters that need to be optimized. Commonly, numerical approaches such as stochastic gradient descent are used to optimize the inner parameters of neural networks. However, these techniques tend to be computationally very expensive, and for some tasks, where efficiency is crucial, such high computational costs are not acceptable. Along these research lines, in this paper we propose to optimize the parameters of neural networks by means of estimation of distribution algorithms. More precisely, the univariate marginal distribution algorithm is used for evolving neural networks in various reinforcement learning tasks. For the sake of validating our idea, we run the proposed algorithm on four OpenAI Gym benchmark problems. In addition, the obtained results were compared with those of a standard genetic algorithm, revealing that optimizing with UMDAc provides better results than the genetic algorithm in most of the cases.








1 Introduction

Recent advances in artificial intelligence have permitted computers to effectively deal with some real-world problems, such as natural language processing, face recognition, speech recognition, machine vision and more [1]. Furthermore, in some of these tasks artificial intelligence has outperformed humans [2].

In artificial intelligence and, more precisely, in machine learning, Reinforcement Learning (RL) is a field of great relevance. This is partially motivated by the large number of benchmark problems that have been solved in this field in recent years. RL algorithms are capable of developing complex behaviors from high-dimensional input data, enabling them to solve difficult problems such as playing 3D games [3]. In RL, an agent takes actions in an environment and a reward value is given back. The agent has to learn an optimal behavior in order to maximize the cumulative reward. However, learning the optimal behavior is usually a hard optimization problem. In this sense, first the agent or the set of agents needs to be defined. There exist many ways of implementing the agents, for example as a vector of its own features or, as is the case of agents controlled by NNs, as vectors of the parameters of the NNs. In spite of the existence of several types of NNs, such as deep neural networks (DNNs) [1], recurrent (RNNs) [4] and convolutional (CNNs) [2], in this paper feed-forward NNs (FNNs) are used due to their simplicity (Figure 1).

In many cases, optimizing the agent involves the optimization of thousands of parameters, as is the case of NN-controlled agents. Based on that idea, a number of optimization algorithms have been proposed for RL, some examples being Q-learning [5], SARSA [6], Deep Q-learning [7] and Deep Deterministic Policy Gradient [8]. In order to optimize the parameters of NNs, numerical algorithms are one of the most common choices. However, they usually tend to be computationally very expensive. As an alternative, Evolutionary Algorithms (EAs) are used, and among these, Genetic Algorithms (GAs) are the most frequent [9, 10, 11, 12, 13, 14].

Evolutionary Algorithms are natural evolution inspired techniques. As occurs in nature, a set of solutions or individuals, also called a population, is evolved by means of selection, crossover and mutation operations and, in this way, it tries to find an optimal solution. Among the advantages of EAs, they are applicable in a wide range of problems, they do not make any assumptions about the search space and they can be easily adapted to any combinatorial or continuous problems.

In GAs, solutions or individuals of the population are usually encoded as vectors, for example, binary strings or real-coded vectors. Subsequently, every solution is evaluated according to a fitness function. Finally, some random operators are applied to them in order to create new solutions.

In this paper, we propose using another paradigm among EAs, an Estimation of Distribution Algorithm (EDA) [15], in order to optimize the inner parameters of NNs in RL tasks. EDAs, like GAs, start with a randomly generated population of solutions. Each solution is evaluated according to a fitness function, and the most promising ones are selected following a predefined procedure. Next, a probability distribution is learned from the set of selected individuals. Afterwards, the new generation is composed of the survivors and new individuals generated by sampling the learned probability distribution. The process is repeated until a maximum fitness value, a maximum number of generations or convergence is reached.
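The generational loop described above can be sketched in a few lines of NumPy. This is a generic continuous EDA, not the paper's exact implementation; the function name, parameter values and the minimization setting (RL maximizes reward, which is the same loop with the sign flipped) are all illustrative:

```python
import numpy as np

def eda_minimize(fitness, dim, pop_size=50, survivors=25, generations=100, seed=0):
    """Minimal sketch of a continuous EDA: evaluate, select, learn a
    distribution, sample a new generation. Names are illustrative."""
    rng = np.random.default_rng(seed)
    # Start with a randomly generated population of real-valued solutions.
    pop = rng.normal(0.0, 1.0, size=(pop_size, dim))
    for _ in range(generations):
        scores = np.array([fitness(x) for x in pop])
        best = pop[np.argsort(scores)[:survivors]]       # most promising solutions
        mu, sigma = best.mean(axis=0), best.std(axis=0) + 1e-6
        # New generation: survivors plus samples from the learned distribution.
        new = rng.normal(mu, sigma, size=(pop_size - survivors, dim))
        pop = np.vstack([best, new])
    scores = np.array([fitness(x) for x in pop])
    return pop[np.argmin(scores)]

# Minimizing the sphere function converges near the origin.
best = eda_minimize(lambda x: np.sum(x ** 2), dim=5)
```

The stopping criteria mentioned above (maximum fitness, convergence) would replace the fixed generation count in a fuller implementation.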

In other works [15], some EDAs are compared with the Stochastic Gradient Descent (SGD) technique, which is the usual choice to pair with the popular backpropagation method in various supervised learning problems. In the conducted experiments, EDAs proved to have a competitive performance despite being outperformed by SGD.

There are various types of EDAs depending on the kind of probabilistic model they learn. In this paper, the Univariate Marginal Distribution Algorithm for continuous domains (UMDAc, where the c stands for continuous) is used [15]. In order to prove the validity of using UMDAc for evolving NNs (NN-UMDA) in RL tasks, the OpenAI Gym [16] toolkit was adopted. This toolkit is designed to train and test RL algorithms on many different tasks, such as walking, balancing or playing Atari games. In particular, we compared the performance of NN-UMDA with that of NN-GA on four different problems from OpenAI Gym. Experimental results showed a superior performance when using UMDAc.

The remainder of the paper is organized as follows: in the next section, some background notes in NN and UMDA are given. Afterwards, in section 3, RL problems are described. Later, in section 4, optimization of NN agents by UMDA is explained. In section 5, experimental design, obtained measures and a statistical analysis on the results are presented. Finally, in section 6, conclusions and future challenges are given.

2 Background

In this section, we introduce the necessary background on NNs and UMDA, giving a detailed explanation on how they work. Later, the background from this section is used for explaining the algorithm proposed in this paper.

2.1 Neural Networks

Neural Networks are biologically inspired computational models. They were first proposed in 1943 by Warren McCullough and Walter Pitts [17]. However, it was not until the 80’s that huge advances in computer science finally enabled the computation of complex NNs. Hence, the capability of NNs to solve problems was boosted [18].

NNs are networks of simple processing units called perceptrons, simple computational models of biological neurons. A perceptron is given input data in the form of an n-sized vector. Each element of the vector is multiplied by the corresponding weight of the neuron, and all these values are summed together. Usually, a bias is added to the input of the perceptron in order to shift the perceptron's function. After multiplying the input vector by the weights and summing the results, an activation function is applied. The activation function gives non-linearity to the model and can also be used to normalize the output value to a certain range. To give an example, the sigmoid function, σ(x) = 1 / (1 + e^{-x}), is one of the most widely used activation functions. Also, the tanh activation function is known to normalize the output values to the range from -1 to 1. Therefore, the output of a perceptron can be described as

Figure 1: Feed-forward and single hidden layer NN.

y = φ(Σ_{i=1}^{n} w_i x_i + b)

where y is the output of the perceptron, x_i stands for the i-th element of the input vector, w_i is the corresponding weight, b is the bias of the perceptron and φ is the activation function.

When the output of a perceptron becomes the input of another, the perceptrons are connected. In NNs, perceptrons are connected to others in the form of layers. This gives NNs the capability to approximate any kind of function [19, 20]. Commonly, a NN is composed of an input layer, some intermediate layers, known as hidden layers, and a final output layer. Depending on how the layers are connected, different types of NNs are created. Due to their simplicity, densely connected feed-forward NNs (FNNs) are used in this paper. In densely connected FNNs, every perceptron of a layer is connected to every perceptron in the next layer, and there are no loops or cycles between perceptrons (Figure 1).
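A forward pass through a densely connected FNN with one hidden layer, as in Figure 1, can be sketched as below; the layer sizes, weight values and function names are illustrative, not taken from the paper:

```python
import numpy as np

def sigmoid(x):
    """Sigmoid activation: squashes values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, W1, b1, W2, b2):
    """Forward pass: each layer is a weighted sum plus bias, followed by
    an activation function."""
    h = np.tanh(W1 @ x + b1)        # hidden layer, outputs in (-1, 1)
    return sigmoid(W2 @ h + b2)     # output layer, outputs in (0, 1)

rng = np.random.default_rng(0)
x = rng.normal(size=4)                       # a 4-element input vector
W1, b1 = rng.normal(size=(6, 4)), np.zeros(6)
W2, b2 = rng.normal(size=(2, 6)), np.zeros(2)
y = forward(x, W1, b1, W2, b2)               # 2-element output vector
```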

Training NNs is the process whereby the inner parameters of NNs (weights) are optimized in order to minimize the error of the output with respect to the desired output or target. This error value is often called the loss, and is calculated with a loss function. One such option is the mean squared error (MSE), defined as follows:

MSE = (1/n) Σ_{i=1}^{n} (y_i − t_i)²

where n is the number of elements in the output vector, y_i is the i-th output of the NN and t_i is the expected or target value of y_i. The computed loss is then propagated backwards in a process called back-propagation, where the parameters of the NN are optimized. Despite the existence of several optimization algorithms to combine with back-propagation [21, 22, 23, 24], this step is very time consuming, and thus, we decided to substitute it with UMDA. The optimization algorithm employed to improve the performance of the model is explained in the following sections.
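The MSE formula above translates directly into code (a minimal sketch; the function name is ours):

```python
import numpy as np

def mse(y, t):
    """Mean squared error between network output y and target t:
    the average of the squared element-wise differences."""
    y, t = np.asarray(y, dtype=float), np.asarray(t, dtype=float)
    return float(np.mean((y - t) ** 2))

mse([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])  # → 0.333… (i.e. 1/3)
```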

2.2 Univariate Marginal Distribution Algorithm

Optimizing the weights of NNs can be seen as a continuous optimization problem, which can be approached by means of Estimation of Distribution Algorithms (EDAs). More precisely, the Univariate Marginal Distribution Algorithm for continuous domains, UMDAc, is adopted in this work. UMDAc was introduced by Larrañaga et al. in 1999 [15].

As with most EDAs, the starting point for UMDA is a randomly generated population of solutions codified as real-valued vectors. Next, a problem-specific evaluation is conducted, giving a fitness value to every solution in the population. According to the specified survivor rate, a set with the best solutions is selected (Figure 2). Later, a normal distribution is estimated from the set of survivor individuals for each of the positions in the real-valued vectors. In order to create a new set of solutions, random samples are taken from these normal distributions. Finally, the new solutions, along with the solutions from the survivor set, form a new generation or population. The described process is repeated until some convergence criterion is met.
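One UMDAc generation, as just described, can be sketched as follows. This is our illustrative reading of the scheme in Figure 2, not the authors' code, and all names are ours:

```python
import numpy as np

def umdac_generation(pop, fitness_fn, survivor_rate=0.5, rng=None):
    """One UMDAc generation: evaluate, keep the best, fit a normal
    distribution per vector position, sample replacements."""
    if rng is None:
        rng = np.random.default_rng()
    n_survive = int(len(pop) * survivor_rate)
    fitness = np.array([fitness_fn(ind) for ind in pop])
    # Keep the individuals with the highest fitness (maximization).
    survivors = pop[np.argsort(fitness)[::-1][:n_survive]]
    mu = survivors.mean(axis=0)                 # per-position mean
    sigma = survivors.std(axis=0) + 1e-8        # per-position std (floored)
    offspring = rng.normal(mu, sigma, size=(len(pop) - n_survive, pop.shape[1]))
    return np.vstack([survivors, offspring])
```

Repeating this step, e.g. while maximizing f(v) = −Σ v², drives the whole population towards the optimum as the per-position distributions concentrate.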

Figure 2: Scheme of the UMDA considered for learning the parameters of NNs.
Figure 3: Reinforcement Learning scheme.

3 Reinforcement learning

As mentioned in the introduction, in Reinforcement Learning (RL) problems the agent is given an observation of the current state of the environment. In the case of OpenAI's Gym environments [16], for every frame of the game, the environment returns a state vector s_t. In response, the agent takes an action a_t, and the environment returns a reward value r_t and a new state of the game s_{t+1}. The goal is to learn an optimal policy that maximizes the cumulative reward. This interaction between the agent and the environment is illustrated in Figure 3.

In this paper, agents are controlled by NNs, and state vectors are the input of the NNs. The output of the NN, o, is an m-sized vector, m being the number of actions the agent can take. In the particular case of the LunarLander-v2 environment [16] (in which the objective is to land between two flags), m equals 4, as there are four available actions: do nothing, fire left engine, fire main engine and fire right orientation engine. The index of the highest-valued element in o is chosen as the action a_t to take. As a result, the action is defined as

a_t = argmax_i o_i

After selecting the action for the current state s_t, the action is taken and the environment returns a new state s_{t+1} and a reward value r_t. The reward value represents how good or bad the taken action was for state s_t. This process is repeated until the agent dies or the game finishes. Every reward the agent is given is summed into the total reward R = Σ_t r_t, and after the game finishes, R is used to evaluate the behavior of the individual, often called its fitness.
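The evaluation loop above can be sketched as follows. The environment is assumed to follow the classic Gym reset/step interface; the tiny stand-in environment and all names are illustrative, not a real Gym task:

```python
import numpy as np

def run_episode(env, forward):
    """Play one full game: at each step the action is the index of the
    highest NN output, and the episode's fitness is the total reward."""
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        output = forward(state)              # NN output vector o
        action = int(np.argmax(output))      # a_t = argmax_i o_i
        state, reward, done, _ = env.step(action)
        total_reward += reward               # accumulate R = sum of r_t
    return total_reward

# Tiny stand-in environment to illustrate the loop: 5 steps, reward 1
# whenever action 1 is chosen.
class ToyEnv:
    def reset(self):
        self.t = 0
        return np.zeros(4)
    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0
        return np.zeros(4), reward, self.t >= 5, {}

run_episode(ToyEnv(), lambda s: np.array([0.0, 1.0]))  # → 5.0
```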

Figure 4: NN parameter treatment in order to apply UMDA.

4 Optimizing NN Agents with UMDA

In this section, the contribution of the paper, the optimization of NNs via UMDA, is presented; this approach is henceforth referred to as NN-UMDA.

Optimizing the parameters of NNs can be seen as a continuous optimization problem. For this reason, UMDA is used as a more appropriate approach than a GA to evolve NNs, as UMDA manages large search spaces more efficiently. Conversely to GAs, UMDA is not limited to crossover and mutation; new solutions are sampled from probability distributions, which makes it more suitable for very difficult search spaces, such as the optimization of the parameters of NNs [15].

As aforementioned, in these RL problems we propose controlling the agents with NNs. Thus, the population of solutions is composed of real-valued vectors representing the parameters of NNs. As a result, the population is a matrix whose rows are the parameter vectors of the NNs, each of dimension n, n being the number of inner parameters of the selected network architecture (see Figure 4). Since all the individuals of the population describe the parameters of a unique network architecture, every row of the population matrix has the same length. Each column in the matrix represents a single weight or bias of the selected NN architecture.
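The mapping between a network's parameters and one row of the population matrix, as depicted in Figure 4, amounts to flattening and concatenating the weight and bias arrays. A minimal sketch with illustrative helper names (the shapes match the CartPole architecture used later, where n = 10):

```python
import numpy as np

def flatten_params(W, b):
    """Concatenate a network's weight matrix and bias vector into one
    population row."""
    return np.concatenate([W.ravel(), b.ravel()])

def unflatten_params(vec, shape_W):
    """Recover the weight matrix and bias vector from a population row."""
    n_w = shape_W[0] * shape_W[1]
    return vec[:n_w].reshape(shape_W), vec[n_w:]

W = np.arange(8.0).reshape(2, 4)   # 2 outputs x 4 inputs
b = np.array([0.5, -0.5])          # 2 biases
row = flatten_params(W, b)         # one individual: length n = 10
W_r, b_r = unflatten_params(row, (2, 4))
```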

Once the population of solutions is evaluated, the ones with the highest fitness values are selected, thus creating a matrix of individuals called survivors, as shown in Figure 2. The next step is to create a set of new solutions or individuals. For every column in the survivors matrix, the mean and variance of the column are calculated in order to estimate a normal distribution for that specific parameter of the NN. Then, new values are sampled from the normal distribution, creating new weight values for that parameter.

After generating the new individuals, the individuals that are not in the survivor set are replaced with new ones, thus creating a new population of solutions (Figure 2). As this process is repeated, the random samples drawn from the normal distributions learned from the best individuals of each generation tend towards values that maximize the fitness. In GAs, a mutation rate is commonly used to prevent premature convergence to local optima and to add diversity to the population of solutions in order to explore other areas of the solution landscape. On the contrary, in UMDA there is no need for such techniques, as diversification implicitly happens when new solutions are randomly sampled from the normal distributions.

5 Experiments

In order to validate our proposal, NN-UMDA and NN-GA were evaluated on four RL tasks. More specifically, the CartPole-v0, LunarLander-v2, LunarLanderContinuous-v2 and BipedalWalker-v2 environments from the OpenAI Gym toolkit [16] were adopted for evaluation.

5.1 Experimental Setup

To define the scale of a problem, the number of trainable parameters in the selected NN architecture, n, was taken. For the problems proposed in these experiments, the simplest NN architecture for each task was chosen. Consequently, the proposed architectures contain an input layer with the size of the state vector the environment returns, and an output layer with the size of the number of available actions. Some parameters of both algorithms depend on n, as they depend on the scale of the task to be solved.¹ NN-UMDA and NN-GA share the same parameters in order to be evaluated under the same conditions, with the exception of the mutation rate in the GA.² In the presented experiments, the mutation rate has a constant value of 0.1. The rest of the parameters for both algorithms are:

¹ The parameters were set by tuning their values manually and carrying out non-exhaustive experiments.
² The GA used in this paper follows the same steps as UMDA, apart from the sampling of new solutions. Instead of sampling new solutions from normal distributions, the GA samples values uniformly from the survivor set of solutions, applying some random noise, defined by the mutation rate, to the samples taken.

  • Population size:

  • Number of generations:

  • Number of individual evaluations:

  • Survivor rate:

The population size depends on the dimension of the problem. For larger problems, in which the solution space is much larger and finding an optimum solution is harder, a higher number of solutions per generation is needed to solve the problem. The same occurs with the number of iterations or generations needed.

In order to encourage the model to generalize, the environments are stochastic. This prevents the model from memorizing a sequence of correct actions, as the optimal sequence of actions changes in every game it plays. Consequently, solutions that performed correctly in one trial may not perform equally well in other trials. Therefore, to have an overall understanding of how well an individual performs, the same solution is evaluated 3 times, and the average of the total rewards is taken as the fitness value of the given individual. The number of times each solution is evaluated is referred to as the number of individual evaluations. The proportion of solutions selected as survivors is defined by the survivor rate. In these experiments, the survivor rate is always 0.5, meaning that half of the population remains unchanged while the other half can be replaced with new solutions every generation.
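The fitness averaging described above is a one-liner; this sketch uses our own names, with `evaluate` standing in for playing one full game and returning its total reward:

```python
import numpy as np

def fitness(individual, evaluate, n_evals=3):
    """Average the total reward over several independent evaluations (the
    'number of individual evaluations') to smooth out the stochasticity
    of the environments."""
    return float(np.mean([evaluate(individual) for _ in range(n_evals)]))
```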

The presented experiments were carried out on a machine with the following characteristics: an 8-core Intel Xeon Skylake processor and 7.2 GB of RAM.

5.2 Performance evaluation

In order to collect data for later assessment, both algorithms were run 10 times on each benchmark problem. Furthermore, the best solution of each run was evaluated 100 times, and the average total reward was calculated. The results obtained are summarized in Table 1 and Figure 5. In Table 1, the averages of the obtained total rewards are given, while in Figure 5 the results are plotted as violin plots, representing the kernel density estimates of the average total rewards obtained when evaluating the best results of both algorithms.

Algorithm CartPole-v0 LunarLander-v2 LunarLanderContinuous-v2 BipedalWalker-v2
NN-UMDA 173.26 258.80 229.34 108.80
NN-GA 155.60 231.30 21.22 39.71
Table 1: Average total rewards of the algorithms in every environment.
Figure 5: The performance of both algorithms in the given environments.

5.2.1 CartPole-v0

Firstly, the algorithms were tested in the CartPole-v0 environment [16]. In this classic control task, a pole is joined to a cart by a non-actuated joint, and the objective is to keep the pole balanced. There are only two available actions: push the cart to the left or push it to the right. In CartPole-v0, the state is a four-element vector representing the cart position, cart velocity, pole angle and pole velocity at the tip. The game is considered solved after an average reward of 195.0 over 100 consecutive trials. In this task, according to the chosen NN architecture, the scale of the problem, n, is 10.

NN-UMDA and NN-GA were able to solve the game in almost every trial. The total rewards of NN-UMDA were almost always 200. However, NN-GA did not generalize so well, and in some cases the obtained total rewards dropped below 100.

5.2.2 LunarLander-v2

The next step was to evaluate both algorithms on a more difficult task. LunarLander's state vector size and number of available actions are twice those of CartPole-v0; consequently, the complexity increased considerably, resulting in a larger n for this game. As previously mentioned in the RL section, in the LunarLander-v2 environment the objective of the agent is to land between the two flags. The game is solved if an average reward of 200 or more is achieved over 100 runs.

As shown in Figure 5, both algorithms solved the proposed task and performed similarly. However, NN-UMDA was able to obtain slightly higher scores than NN-GA.

5.2.3 LunarLanderContinuous-v2

To test the algorithms on a continuous control task, the LunarLanderContinuous-v2 problem was chosen. Both LunarLander environments are equal, except that this environment has an action vector of two real values from -1 to +1. This real-valued vector controls the thrust of the main engine and of the two side engines.

Of the last three presented environments, LunarLanderContinuous-v2 shows the greatest difference in performance between the two algorithms. The results obtained (Figure 5) show that NN-UMDA solved the game in most of the trials, even above the score considered to solve the game (200), while NN-GA was not able to solve the game in most cases, resulting in a much poorer performance.

5.2.4 BipedalWalker-v2

The results of the last experiment motivated comparing the algorithms on a task with a larger search space. In order to test both algorithms on a more challenging benchmark, BipedalWalker-v2 was adopted. In this continuous control task, the bipedal agent has to move forward as fast as possible without falling in order to maximize its reward. The size of the state vector, and therefore n, increases considerably compared to the other presented environments. For this specific problem, due to computational limitations, the parameters that depend on n were tuned in order to reduce the training time. The parameters changed for BipedalWalker-v2 are:

  • Population size:

  • Number of generations:

As a result, neither NN-UMDA nor NN-GA were able to solve the game (rewards above 300). However, the goal was not to solve the game, but to get useful data to compare the performance and behavior of both algorithms. As expected, NN-UMDA obtained considerably higher rewards than NN-GA in most of the evaluations conducted.

5.3 Statistical analysis

In order to statistically assess the performance results obtained from the compared algorithms, we have followed the Bayesian approach presented in [25], as it can provide deeper insight into the results than classical null hypothesis significance tests. Particularly, we have used the Bayesian equivalent of Wilcoxon's test.³ Conversely to the usual null hypothesis statistical analysis, Bayesian methods permit explicitly answering questions such as: what is the probability that a given algorithm is the best one? Moreover, due to the nature of Bayesian statistics, this analysis naturally provides information about the uncertainty in the behavior of the algorithms once the data from the experimentation has been incorporated.

³ We have used the implementation available in the development version of the scmamp R package [26], available at https://github.com/b0rxa/scmamp.

In this particular case, the Bayesian analysis was conducted on the average reward obtained by each algorithm in the 10 repetitions. Under this analysis, the goal is to determine how likely NN-UMDA is to beat NN-GA on each of the benchmarks considered. To that end, the procedure used requires the definition of what is understood as 'practical equivalence' or 'rope' in [25], i.e., the performance difference below which both algorithms can be considered to have the same performance. In our case, we decided that both approaches have equivalent performance when the difference in reward is smaller than a very small threshold. In the particular case of this experimentation, the Bayesian analysis permits explicitly calculating the posterior probability distribution over the probability of each scenario being the most likely: NN-UMDA winning, NN-GA winning, or Rope. To that end, as in any Bayesian model, prior knowledge needs to be assumed about the probability of each of the algorithms being the best. In this case, we assume a uniform prior over all the possible probability distributions of the three options. Then, the results obtained from the experimentation are incorporated in order to update our belief accordingly, calculating in this way the posterior distribution.⁴

⁴ As this paper is not devoted to Bayesian analysis, we refer the interested reader to the papers [25, 26] for further details.

Results of the analysis are depicted in Fig. 6 as simplex plots. These plots are triangle diagrams whose vertices represent the three scenarios: NN-UMDA, NN-GA and Rope. The points in the plot represent a sampling of the posterior distribution over the win-lose-tie probabilities. The closer a point is to the NN-UMDA vertex of the triangle (or, equivalently, to the NN-GA or Rope vertices), the more probable it is for NN-UMDA to have better performance (or, equivalently, for NN-GA to be better, or for both algorithms to be equal). Therefore, the three areas delimited by the dashed lines show the dominance regions, i.e., the areas where the highest probability corresponds to the respective vertex.

(a) CartPole. UMDA: 0.497, GA: 0.372, Rope: 0.129
(b) LunarLander. UMDA: 0.952, GA: 0.040, Rope: 0.006
(c) LunarLanderContinuous. UMDA: 0.993, GA: 0.00, Rope: 0.006
(d) BipedalWalker. UMDA: 0.821, GA: 0.005, Rope: 0.172
Figure 6: Simplex plots of the results obtained in the four benchmark environments, and the expected posterior probabilities of each algorithm being the best algorithm.

Fig. 6 shows that, depending on the environment, very different scenarios can be observed. As regards LunarLander and LunarLanderContinuous, the probability mass of the posterior is close to the NN-UMDA vertex (see Figs. 6(b) and 6(c)). This is confirmed by the average posterior probabilities for each situation (NN-UMDA being better, equal or worse than NN-GA), which indicate that the expected probability of NN-UMDA being better is 0.952 and 0.993, respectively. We also see that the spread of the points in the two plots is very low, which indicates that the variance of the posterior distribution is small, i.e., there is almost no uncertainty about the result of the analysis: NN-UMDA performs much better than NN-GA.

With respect to the results in the BipedalWalker environment, we see that most of the points are in the region of NN-UMDA, indicating that this algorithm has a larger probability of showing better performance. This is corroborated by the expected probability of 0.821. Nevertheless, in this case, the posterior variance is larger than previously, suggesting that, occasionally, NN-GA is preferred over NN-UMDA.

In the case of CartPole, we see that the expected posterior probability of the EDA approach is only slightly higher (see Fig. 6(a)). However, the points in the simplex plot are very sparse. This shows that the variance of the posterior distribution is high (large uncertainty), and more executions are needed to draw solid conclusions. Moreover, the points are spread over the regions relative to UMDA and GA, with only a few in the Rope region. This shows that the uncertainty about which algorithm is better is high, i.e., with the executions at hand, we can only say that NN-UMDA and NN-GA appear to have similar performance, with the former being slightly better.

Finally, as the practical equivalence threshold was set very small compared to the obtained rewards, the two algorithms are almost never considered equivalent during the analysis. Therefore, the expected posterior probability of the Rope is very low in all the considered environments.

6 Conclusions and future challenges

In this paper, we introduced an alternative to GAs for adjusting the parameters of NNs in RL tasks. We showed that the proposed algorithm, NN-UMDA, was capable of solving every given task more efficiently than NN-GA, a widely used algorithm for evolving NNs. NN-UMDA was not only able to solve the given tasks more reliably than NN-GA, but in some cases the obtained score was greater than the threshold considered to solve the problem.

The experiments in this paper were conducted to compare the performance of NN-GA with respect to NN-UMDA, and to obtain empirical data for the evaluation of both algorithms. However, further research has to be done in order to test the presented algorithm in more complex and advanced scenarios. In the following, we enumerate four possible lines for future research.

  1. In the presented experiments, the NNs were always simple single-layer feed-forward NNs. As the proposed algorithm can be applied to many types of NNs with very little effort, more advanced NN architectures could be used in order to face harder problems: for example, CNNs in scenarios where the input data are images, or RNNs to deal with time-dependent actions.

  2. EAs are good at efficiently approaching near-optimal solutions. After evolving a NN, it could be used to pre-train another NN. In this process, the knowledge of the near-optimal solution would be transferred to the other NN. Finally, the pre-trained NN could be optimized with another algorithm (such as DQL [7]) to fine-tune it towards the global optimum.

  3. A special type of NN, autoencoders, could be used to learn a reduced-dimensionality representation of the observation space [27]. Autoencoders are trained beforehand to learn this simpler representation. Consequently, much simpler models can be trained to solve the required task. As regards the algorithm presented in this manuscript, populations of much simpler NNs could be trained: solutions could be fed with a much smaller state vector, thus reducing the training time and computational load considerably.

  4. To conclude, in this paper UMDA is only compared with a GA. To better understand the convenience of optimizing the parameters of NNs with UMDA, the proposed algorithm should be compared with more types of EAs.


Acknowledgments

This work has been partially supported by TIN2016-78365R (Spanish Ministry of Economy, Industry and Competitiveness). We would like to gratefully acknowledge the support of Unai Garciarena in the revision of the manuscript.


  • [1] Simon Haykin. Neural networks, volume 2. Prentice hall New York, 1994.
  • [2] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • [3] Guillaume Lample and Devendra Singh Chaplot. Playing fps games with deep reinforcement learning. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
  • [4] Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černockỳ, and Sanjeev Khudanpur. Recurrent neural network based language model. In Eleventh annual conference of the international speech communication association, 2010.
  • [5] Christopher JCH Watkins and Peter Dayan. Q-learning. Machine learning, 8(3-4):279–292, 1992.
  • [6] Gavin A Rummery and Mahesan Niranjan. On-line Q-learning using connectionist systems, volume 37. University of Cambridge, Department of Engineering Cambridge, England, 1994.
  • [7] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
  • [8] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
  • [9] Xin Yao. Evolving artificial neural networks. Proceedings of the IEEE, 87(9):1423–1447, 1999.
  • [10] Zhengjun Liu, Aixia Liu, Changyao Wang, and Zheng Niu. Evolving neural network using real coded genetic algorithm (ga) for multispectral image classification. Future Generation Computer Systems, 20(7):1119–1129, 2004.
  • [11] A Sedki, Driss Ouazar, and E El Mazoudi. Evolving neural network using real coded genetic algorithm for daily rainfall–runoff forecasting. Expert Systems with Applications, 36(3):4523–4527, 2009.
  • [12] Rasoul Irani and Reza Nasimi. Evolving neural network using real coded genetic algorithm for permeability estimation of the reservoir. Expert Systems with Applications, 38(8):9862–9866, 2011.
  • [13] Kenneth O Stanley and Risto Miikkulainen. Efficient reinforcement learning through evolving neural network topologies. In Proceedings of the 4th Annual Conference on Genetic and Evolutionary Computation, pages 569–577. Morgan Kaufmann Publishers Inc., 2002.
  • [14] Alexis P Wieland. Evolving neural network controllers for unstable systems. In Neural Networks, 1991., IJCNN-91-Seattle International Joint Conference on, volume 2, pages 667–673. IEEE, 1991.
  • [15] Pedro Larrañaga and Jose A Lozano. Estimation of distribution algorithms: A new tool for evolutionary computation, volume 2. Springer Science & Business Media, 2001.
  • [16] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.
  • [17] Warren S McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics, 5(4):115–133, 1943.
  • [18] Haohan Wang and Bhiksha Raj. On the origin of deep learning. arXiv preprint arXiv:1702.07800, 2017.
  • [19] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4):303–314, 1989.
  • [20] Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural networks, 4(2):251–257, 1991.
  • [21] Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pages 177–186. Springer, 2010.
  • [22] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012.
  • [23] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [24] Matthew D Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
  • [25] Alessio Benavoli, Giorgio Corani, Janez Demšar, and Marco Zaffalon. Time for a change: a tutorial for comparing multiple classifiers through Bayesian analysis. Journal of Machine Learning Research, 18(77):1–36, 2017.
  • [26] Borja Calvo and Guzman Santafe. scmamp: Statistical Comparison of Multiple Algorithms in Multiple Problems. The R Journal, 2016. R package version 2.0.
  • [27] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.