1. Introduction
In many areas of interest for machine learning, sample sets are small and noisy: for example, in robotics, where environments are subject to continuous change and novelty, or when machine learning models replace expensive sampling methods in optimization. These complex problems often require models with a high number of degrees of freedom, which tend to overfit when presented with only a small number of samples.
Neuroevolutionary (NE) methods represent a starkly different neural network training paradigm from backpropagation of error. Morse and Stanley (Morse and Stanley, 2016) point out that a number of advantages, such as the regularizing properties of topological evolution (Stanley and Miikkulainen, 2002), diversity maintenance and indirect encoding, allow NE to perform certain tasks better. Especially in regression, where models need to be as general as possible, NE training methods make it possible to find simple, minimal models, adhering to the law of parsimony. Smaller networks have been shown to be less prone to overfitting (Orr, 1993) and are easier to train, because fewer model parameters have to be optimized.
To understand how NE methods can be improved further, a number of questions still needs to be answered. Can we take advantage of the strengths of NE, for example by increasing the expressiveness of networks through a higher diversity of neural activation functions, which is harder for classical methods, to decrease network size and limit overfitting? Could NE, through topology minimization, allow training neural networks on sample sets that are too small for large networks, and decrease overfitting for problems that are hard to represent with small networks?
In this work, we address the first question and study the evolution of activation functions in a topology-evolving algorithm, together with its effect on network size, convergence speed and approximation error, on a number of benchmark regression problems.
1.1. Influence of Activation Function
Hornik showed that the Universal Approximation Theorem (based on Kolmogorov's superposition theorem (Kolmogorov, 1957)) is valid for neural networks using any continuous non-constant activation function: for these activation functions, multilayer feedforward neural networks are capable of arbitrarily accurate approximation of any real-valued continuous function, as long as outputs are bounded (Hornik et al., 1989).
Although the choice of an activation function does not influence the theoretical expressiveness of neural networks, it can affect the training behaviour of the network. The number of nodes needed to approximate a given function depends heavily on the nodes' activation function: to approximate a Gaussian function, two sigmoid nodes are required, while approximating a sigmoid function requires four Gaussian nodes (Figure 1).
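The two-sigmoids-per-Gaussian claim above can be checked numerically. The following sketch (our own illustration, not part of the paper's experiments) forms a bump from the difference of two shifted sigmoid nodes feeding one linear output, and uses a coarse grid search over slope and shift as a stand-in for weight training:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.linspace(-5, 5, 401)
target = np.exp(-x**2)  # Gaussian bump to be approximated

# Two sigmoid nodes feeding one linear output: their difference forms a
# bump. A coarse grid search over slope k and shift a (a stand-in for
# weight training) finds a close match; the peak is scaled to 1.
best_err, best_ka = np.inf, None
for k in np.linspace(0.5, 4.0, 36):
    for a in np.linspace(0.2, 2.0, 37):
        bump = sigmoid(k * (x + a)) - sigmoid(k * (x - a))
        bump = bump / bump.max()
        err = np.max(np.abs(bump - target))
        if err < best_err:
            best_err, best_ka = err, (k, a)

print(f"best max-abs error with two sigmoids: {best_err:.3f}")
```

The residual error stays small but non-zero, consistent with the approximation (rather than representation) argument made in the text.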
Although this example is contrived, it is easy to see that the number of nodes needed depends on the activation function used. As more nodes are required, the number of weights that need to be trained also grows; this expands the search space of the training algorithm and extends convergence times. Several studies have shown the significant influence the choice of activation function can have on the performance of neural networks. Kamruzzaman and Aziz (Kamruzzaman and Aziz, 2002) show that activation functions influence convergence speed, depending on the target function. In a large literature review comparing many activation functions, Laudani et al. (Laudani et al., 2015) find large differences in terms of convergence speed (in epochs) and resulting network errors. An extensive comparison was done by Efe (Efe, 2008) using 8 benchmark data sets, drawing the same conclusion: different activation functions result in different convergence behavior and accuracy, with respect to the resulting network errors.
1.2. Heterogeneous Networks
In contrast to backpropagation, the training method in NE is independent of function derivatives. Since the best nonlinearity is unknown beforehand, we can easily use NE training to create networks containing a heterogeneous set of activation functions. Figure 2 shows an example where such networks make sense: combining a Gaussian and a sigmoid-activated neuron leads to a good approximation, whereas a homogeneous network, which uses only one activation function, would need more neurons and weights, increasing the size of the topology-weight search space.
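A minimal sketch of such a heterogeneous network, in the spirit of Figure 2 (the weights here are hand-picked for illustration, not evolved): the Gaussian neuron captures a local bump while the sigmoid neuron captures a global, monotone trend, a combination a homogeneous network would need extra neurons to express.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gaussian(z):
    return np.exp(-z**2)

# Two-neuron heterogeneous hidden layer feeding one linear output neuron.
def heterogeneous_net(x):
    h_gauss = gaussian(2.0 * x)          # locally active unit
    h_sig = sigmoid(1.5 * x - 1.0)       # globally active, monotone unit
    return 1.0 * h_gauss + 0.8 * h_sig   # linear output neuron

x = np.linspace(-4.0, 4.0, 9)
print(np.round(heterogeneous_net(x), 3))
```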
The outputs in multi-target problems are often correlated. When we try to approximate the vibration frequency and temperature response of a car's motor as functions of the acceleration input, both targets are likely to be correlated through the response of the valves in the motor. These valves have a dampening effect that depends on the input (the gas pedal) and affects both targets. Models that approximate and predict these target values benefit when they are able to describe this global effect for all targets. Figure 3 shows an example of a heterogeneous neural network approximating two such correlated target functions. The sine function in the first neuron describes a part of both targets; the second layer of neurons can then specialize in learning the decorrelated signals. This modularity is something we expect to see even in small heterogeneous networks and is another reason why the use of heterogeneous networks can lead to more parsimonious models.
Not using a fixed activation function, and instead evolving it along with the network's topology, allows compact networks to be more expressive and decreases the size of the parameter space. Not only are smaller networks more easily trained, but training can be performed with fewer samples, and the resulting networks are less prone to overfitting the training data (Orr, 1993). Especially when samples are very noisy or the number of available samples is not sufficient to train large networks, this can be an advantage over classical training methods.
There is no analytical way to adapt the activation function in online backpropagation learning, and often model selection methods such as the Akaike Information Criterion (Angus, 1991) must be used to select an appropriate model. Neuroevolution offers methods that allow co-evolution of activation functions in an online manner. We extend the NEAT algorithm (Stanley and Miikkulainen, 2002) to evolve the activation function as well, using an idea similar to the one applied in HyperNEAT (Stanley et al., 2009), but producing directly encoded networks; we refer to this extension as Heterogeneous Activation NEAT (HANEAT). We then compare the performance of the original implementation, which produces homogeneous neural topologies, to HANEAT.
2. Related Work
2.1. Evolving Topologies
While some neuroevolutionary algorithms like SANE (Moriarty and Miikkulainen, 1996) or ESP (Gomez and Miikkulainen, 1998) evolve only fixed topologies, others like NEAT evolve the network's topology as well. The NEAT algorithm was introduced by Stanley and Miikkulainen in 2002 (Stanley and Miikkulainen, 2002). To keep the produced topologies as small as possible, NEAT follows a bottom-up approach, initialising minimal networks in which inputs are connected directly to outputs. Neurons and connections are then slowly added by mutation operators, gradually complexifying the networks. While most parts of a neural network are evolved, a fixed activation function is used for all hidden neurons. NEAT does not suffer from bloat, i.e. growing network size with little improvement in fitness, which would slow the evolutionary process, increase the algorithm's memory footprint and hamper breeding methods (Trujillo et al., 2014).
Stanley et al. extended NEAT to HyperNEAT seven years later (Stanley et al., 2009). HyperNEAT evolves Compositional Pattern Producing Networks (CPPNs) (Stanley, 2007) that compute the weights of a preconfigured neural network from the coordinates of each connection's source and target neurons in a predefined coordinate system, thus indirectly encoding the network's weight matrix. CPPNs are evolved as composite functions that include multiple activation functions. The algorithm is mostly used to solve control problems (Taras et al., 2014), where locality or symmetries in the neurons' locations can be defined in a meaningful way.
Khan et al. introduce CGPANN (Mahsal Khan et al., 2013), an extension to Cartesian Genetic Programming (CGP) (Miller and Thomson, 2000), in which neurons are defined as nodes in the Cartesian genotype. CGP allows neuron connections to "jump" layers, thus allowing topologies to change during evolution. The algorithm does not use minimal initialization like NEAT and does not emphasize creating parsimonious networks that generalize well.
2.2. Evolving Activation Functions
Neural networks usually utilise the same fixed nonlinear activation function for all neurons, yet the activation function has a significant influence on learning performance, topology and fitness (Kamruzzaman and Aziz, 2002; Efe, 2008; Laudani et al., 2015).
Mayer and Schwaiger (Mayer and Schwaiger, 2002) evolve the activation functions of generalized multilayer perceptrons within the netGEN framework, which uses a genetic algorithm to evolve neural network topologies (Huber et al., 1995). The activation functions are described in the genome by the coordinates of the control points of a cubic spline, which also allows non-monotonic activation functions. The authors show that by evolving the activation function they are able to learn the XOR problem with a single neuron, as well as increase the accuracy of the evolved networks on more complex target functions. The effect on the evolutionary process, i.e. on convergence times, is not investigated. Yao gives an overview of the evolution of neural activation functions (Yao, 1999) and shows that most work mixed sigmoidal and Gaussian neurons, selecting the transfer functions from a predefined set.
In recent work, Turner and Miller (Turner and Miller, 2014)
acknowledge that much work remains open in NE that includes evolving the transfer function. They show that the effectiveness of training homogeneous ANNs with either a conventional fixed-topology NE algorithm or Cartesian Genetic Programming (CGP) depends on the selected activation function. They evaluate their approach on a number of classification benchmarks, reinforcement learning benchmarks and a simple circuit regression task, and evolve variable activation functions by including a function scaling factor that is learned during training. The authors show that homogeneous networks reach significantly different fitness levels and convergence times, depending on the activation function and data set, with no single activation function working best for all problems. Heterogeneous networks are created as well, with the transfer function evolved either from a list of the aforementioned nonlinear functions or as a parameterized activation function. The authors find clear improvements using heterogeneous networks when comparing against the average homogeneous network. In most cases, however, there exists an activation function for which the homogeneous networks mostly perform better than the heterogeneous ones. They also find that the networks do not tend to prefer any particular activation function, especially in the case of the CGP algorithm, which evolves the topology as well as the weights.
2.3. Discussion
It is necessary to do more research on the effects of evolving the transfer functions in neuroevolved networks, especially in the context of regression and algorithms that construct a network’s topology.
As NEAT is a NE method that initializes networks in a minimal state, allowing us to exploit this regularizing effect, this work extends that approach. The algorithm is naturally topology-minimizing, and its internal representation is easily extendible to include co-evolution of activation functions. Especially in regression problems where multiple target outputs are correlated, the bottom-up construction of neural networks in NEAT could allow the more general sub-models or modules to evolve first, after which later layers or neurons are added to specialize on more specific behaviours in the underlying model.
The original implementation of NEAT does not evolve the activation functions of the network, leading to homogeneous networks. Even though there is research on evolving activation functions (Mayer and Schwaiger, 2002; Gauci and Stanley, 2010; Agostinelli et al., 2014), these methods have not yet been applied to the canonical NEAT. While HyperNEAT, a NEAT derivative, evolves activation functions, it has many other modifications, most notably the use of CPPNs instead of the directly encoded ANNs NEAT uses (Gauci and Stanley, 2010; Risi et al., 2010). While this approach increases the suitability for solving high-dimensional control and vision problems, apart from symbolic regression (Drchal and Snorek, 2012) it has not been widely applied to regression or classification problems. HyperNEAT and its derivatives were designed to indirectly train large networks, whereas we focus on training directly encoded parsimonious networks for regression.
3. Approach
In order to grow heterogeneous models, we enhance NEAT by allowing the evolutionary process to change the neurons' activation functions. This enhanced version of the algorithm, HANEAT, produces heterogeneous topologies, increasing the parsimony of NEAT, shortening the time to convergence, and increasing the accuracy of the produced models. The rest of this section only discusses the changes we made to the original implementation of NEAT; please refer to the original publication (Stanley and Miikkulainen, 2002) for implementation details.
To simplify the optimization of topology, and because of our focus on regression problems, we only allow feed-forward networks. Before a connection is added to the network, we check whether it would create a cycle; only connections that do not cause such a recurrence are added.
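The cycle check described above can be sketched as a depth-first search over the existing connections; the function name and edge representation here are our own, not NEAT's actual data structures:

```python
def creates_cycle(connections, src, dst):
    """Return True if adding src -> dst would create a cycle, i.e. if
    dst already reaches src through the existing directed connections."""
    if src == dst:
        return True
    # Depth-first search from dst along existing edges.
    stack, seen = [dst], set()
    while stack:
        node = stack.pop()
        if node == src:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(b for (a, b) in connections if a == node)
    return False

# Usage: only add the connection if it keeps the network acyclic.
edges = [(1, 3), (2, 3), (3, 4)]
print(creates_cycle(edges, 4, 1))  # True: 4 -> 1 would close the loop 1 -> 3 -> 4
print(creates_cycle(edges, 1, 4))  # False: 1 -> 4 stays feed-forward
```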
In order to evolve the activation function, the genome is extended with genes that represent the activation function of each neuron. This modified genome is illustrated in Figure 4. It is similar to the CPPNs used in HyperNEAT (Stanley, 2007), but instead of taking the coordinates of weights as input, we use a direct encoding: the data is presented directly to the inputs and outputs. The activation function gene is an integer representing an index into a list of activation functions: the discontinuous step function, the non-differentiable Rectified Linear Unit (Nair and Hinton, 2010), the smooth sigmoid and the locally active Gaussian kernel, shown in Figure 5. Instead of evolving a parameterized activation function as was done in (Mayer and Schwaiger, 2002), which would add continuous parameters to the search space, we manually select a small number of qualitatively different activation functions to reduce the search problem for NEAT.
Activation functions like the hyperbolic tangent are not used, because such functions require a different output normalization range; mixed, heterogeneous networks would then need to scale each neuron's output differently, depending on its output range.
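The index-into-a-list encoding can be sketched as follows; `ACTIVATIONS` mirrors the four functions of Figure 5, while the `NodeGene` class and its field names are assumptions made for illustration:

```python
import math

# Activation table matching the four functions in Figure 5; the gene
# stores an integer index into this list.
ACTIVATIONS = [
    lambda x: 1.0 if x >= 0 else 0.0,       # step
    lambda x: max(0.0, x),                  # ReLU
    lambda x: 1.0 / (1.0 + math.exp(-x)),   # sigmoid
    lambda x: math.exp(-x * x),             # Gaussian
]

class NodeGene:
    def __init__(self, node_id, activation_idx):
        self.node_id = node_id
        self.activation_idx = activation_idx  # the activation function gene
    def activate(self, x):
        return ACTIVATIONS[self.activation_idx](x)

node = NodeGene(node_id=7, activation_idx=3)
print(node.activate(0.0))  # Gaussian evaluated at 0
```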
During the initialization phase, we assign a standard linear activation function to the input and output nodes. A modified version of the add-node mutation operator is used: when a new neuron is constructed, a random activation function is selected for it.
A new mutation operator that changes the activation function of a node is added as well; such an operator is not used in the HyperNEAT algorithm. With the probability assigned to this mutation, a node is selected at random and an activation function is selected uniformly at random from the list of possible functions. Since the connection weights to this neuron were optimized for the old activation function, the network might perform worse than its parents. Highly nonlinear changes in network behaviour are therefore expected when the activation function is mutated; this often leads to an initial drop in fitness and decreases an individual's chance of survival.
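The mutate-activation operator can be sketched as below; the function name, genome layout and `N_ACTIVATIONS` constant are our own assumptions, not NEAT's actual data structures:

```python
import random

# With the operator's probability, one node is picked at random and its
# activation index is redrawn uniformly from the list of possible
# functions; at most one node per genome is changed per generation.
N_ACTIVATIONS = 4  # step, ReLU, sigmoid, Gaussian

def mutate_activation(hidden_nodes, p_mut=0.2, rng=random):
    if not hidden_nodes or rng.random() >= p_mut:
        return None  # no activation mutation this generation
    node = rng.choice(hidden_nodes)
    node["activation"] = rng.randrange(N_ACTIVATIONS)
    return node
```

Because the incoming weights were tuned for the old activation, a mutated individual often starts out with lower fitness, which is why the speciation mechanism described next matters.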
Speciation is used to protect this innovation. The mutate-activation operator changes the identifier of the node and the innovation numbers of all incoming and outgoing connections whenever the activation function is changed. This increases the individual's speciation distance considerably, as a single mutation can affect many connections' innovation numbers. This is a desirable effect, because changing the activation function in many neural pathways causes large qualitative changes in the neural network's output function.
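The bookkeeping described above can be sketched as follows; all names here (the counter class, the dict-based genome) are assumptions made for illustration. The mutated node receives a fresh id, and every incident connection receives a fresh innovation number, pushing the genome far away in speciation distance:

```python
class InnovationCounter:
    """Global counter handing out unused node ids / innovation numbers."""
    def __init__(self, start=0):
        self.n = start
    def next(self):
        self.n += 1
        return self.n

def renumber_after_activation_change(node, connections, counter):
    old_id = node["id"]
    node["id"] = counter.next()
    for conn in connections:
        if conn["src"] == old_id:       # outgoing connection
            conn["src"] = node["id"]
            conn["innovation"] = counter.next()
        elif conn["dst"] == old_id:     # incoming connection
            conn["dst"] = node["id"]
            conn["innovation"] = counter.next()
```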
Large changes to individuals are thus protected by speciation. To prevent too many simultaneous changes to an individual, we only allow changing one node per genome per generation. The effect of mutating the activation function of existing nodes, as opposed to fixing the activation when the node is created (as in HyperNEAT), is examined in more detail in Section 4, where we show that the mutation operator is beneficial to evolution.
The expected impact of the proposed extension can be summarized as follows. We expect smaller networks to be more expressive, as was shown in Section 1. We therefore expect an approximation accuracy that rivals homogeneous networks created with NEAT, especially for problems with multiple target functions. The decrease in the number of necessary nodes to approximate a certain target will lead to faster convergence, although this will be partially counteracted by the increased search space, which depends on the number of possible activation functions.
4. Evaluation
4.1. Data Sets
A number of regression and classification datasets is used to evaluate HANEAT. All inputs are normalized to a common interval. The datasets are separated into training and test sets using 5-fold cross validation, replicated 10 times, leading to a total of 50 replicates. We compare the median mean square error (MSE), and the 25% and 75% quantiles, of the regression predictions against the target values, which are also normalized.
Cholesterol Level Indicators
The first task is a regression problem in which three cholesterol level indicators (LDL, VLDL and HDL) are estimated from 21 spectral measurements of 264 blood samples. The data originates from work done by Purdie (Purdie et al., 1992) and is shown in Figure 6.
Engine Torque and Emissions
The second task, also a regression problem, is to estimate torque and nitrous oxide emissions for 1199 samples containing fuel rate and speed. The data is provided by Martin Hagan (Hagan, 2017).
Wisconsin Breast Cancer Diagnosis Problem
The final task is a classification problem from the UCI machine learning database (Mangasarian and Wolberg, 1990). The data set provides 699 samples, 16 of which are incomplete and therefore omitted in these experiments. Each sample has 9 dimensions, describing several known indicators for breast cancer, and one of two class labels: benign or malignant. Performance on this binary classification problem is measured by rounding the neural network's output to the closest label and computing the MSE on the labels; the classification threshold is thus 0.5.
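The evaluation protocol of Section 4.1 (5-fold cross validation replicated 10 times, giving 50 train/test splits) can be sketched as below; this is our own reading of the protocol, and the function name and seeding are assumptions:

```python
import numpy as np

# 5-fold cross validation, re-shuffled and repeated 10 times, yields 50
# train/test splits; median MSE and the 25%/75% quantiles of the test
# errors are then reported over these replicates.
def cv_splits(n_samples, n_folds=5, n_repeats=10, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(n_repeats):
        idx = rng.permutation(n_samples)
        folds = np.array_split(idx, n_folds)
        for k in range(n_folds):
            test = folds[k]
            train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            yield train, test

splits = list(cv_splits(264))  # 264 blood samples in the cholesterol set
print(len(splits))             # 50 replicates
```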
4.2. Algorithmic Setup
Table 1. HANEAT hyperparameters.

General
  population size          100
  max. generation          3000
Speciation
  target species           10
  compatibility threshold  20 (start value)
  excess gene weight       1
  disjoint gene weight     1
  weight distance weight   0.2
  drop-off age             15
Genome operations
  crossover                90% of population
  add node (*)             1% chance/genome
  add connection           30% chance/genome
  mutate activation (*)    20% chance/genome
Gene operations
  mutate weight            20% chance/gene
  weight                   2
  enable connection        0.02% chance/gene
  disable connection       0.2% chance/gene
HANEAT’s hyperparameters are described in Table 1 and were selected based on experiments done with a simple regression problem which was not part of the data sets used in these experiments. A custom Matlab implementation of NEAT was used. The data is normalized and used to train HANEAT and standard NEAT with a number of fixed activation functions.
4.3. Homogeneous Networks
As mentioned above, Kamruzzaman and Aziz (Kamruzzaman and Aziz, 2002) and Laudani et al. (Laudani et al., 2015) show that there is no single best activation function across regression problems. To confirm that our data sets indeed require different activation functions for optimal training, we compare the performance of homogeneous networks (networks evolved with NEAT that use a single activation function) on the cholesterol and engine regression problems as well as the cancer classification problem.
              HANEAT   Step     ReLU     Sigmoid  Gaussian
Cholesterol
MSE           0.0197   0.0199   0.0198   0.0163   0.0170
25% quantile  0.0173   0.0181   0.0167   0.0138   0.0149
75% quantile  0.0248   0.0237   0.0251   0.0197   0.0214
Engine
MSE           0.0048   0.0091   0.0047   0.0053   0.0048
25% quantile  0.0041   0.0083   0.0043   0.0047   0.0042
75% quantile  0.0055   0.0103   0.0054   0.0061   0.0058
Cancer
MSE           0.0366   0.0365   0.0403   0.0294   0.0438
25% quantile  0.0292   0.0292   0.0292   0.0219   0.0292
75% quantile  0.0438   0.0441   0.0511   0.0511   0.0511
4.4. Heterogeneous Networks
4.4.1. Mutation Operator
The mutation operator that changes the activation function of a node is an addition that is not used in the HyperNEAT algorithm. Figure 8 shows the difference between disabling the mutation (0% mutation rate) and various probabilities of mutating once per genome on the engine data set. A rate of 100% means that one node in an individual's network is changed in every generation. Experiments were repeated 50 times, but only with a small population of 50 individuals, which explains the discrepancy with respect to the results of the other experiments.
Although the algorithm converges more quickly in the beginning when the mutation operator is disabled (similar to HyperNEAT), after 500-750 generations, depending on the mutation rate, all instances of HANEAT surpass the accuracy of the mutation-free instance. Although the operator can be quite destructive, it increases the evolvability of our representation. The differences between the various mutation rates do not appear significant, but this needs to be analyzed in more depth.
4.4.2. Accuracy and Size
Figure 9 shows the development of median training error, test error and number of nodes and connections on the cancer dataset over 50 replicates. The heterogeneous networks perform as well as the best homogeneous networks, but with lower complexity.
4.4.3. Activation Functions
The median number of nodes used in the heterogeneous networks for the three data sets, 32/10/20, is much lower than the 51/16/29 nodes required by the homogeneous networks. Similarly, the heterogeneous networks use a median of 124/35/116 connections, whereas the homogeneous networks use 225/170/197. This shows that HANEAT creates more parsimonious networks than the original implementation of NEAT. Although the topological search space, in terms of distinct neurons, is increased by having 4 instead of 1 possible activation function, the algorithm's convergence rate is similar to that of the homogeneous versions.
The relative frequency of the different activation functions in the heterogeneous solutions is shown in Figure 10. For the cholesterol data, HANEAT does not seem to prefer a specific activation function. For the engine data, the step function is underrepresented, whereas for the cancer classification, step and sigmoid activations are preferred. The histograms show that HANEAT does not simply reproduce the best-performing homogeneous network, but finds qualitatively different solutions.
4.4.4. Parsimony
Figure 11 shows the test errors of heterogeneous and homogeneous networks in direct comparison. When trained on the cholesterol data set, NEAT is clearly more parsimonious when creating heterogeneous rather than homogeneous networks, showing that allowing the evolution of activation functions makes it easier to be parsimonious. As discussed for Figure 9, although the search space grows with the number of activation functions, the increased expressivity of the networks shrinks the search space required to reach a given accuracy. Similar results are observed for the engine data set; here, the variance in accuracy and size is much smaller for heterogeneous networks, which occupy a much smaller footprint in the plot. In the classification experiment, the heterogeneous networks are again much smaller than most homogeneous networks.
Although we did not focus on modularity in this work, Figure 12 shows one of the best networks produced by HANEAT on the engine data set. For this multi-target problem, we do see the kind of modularity described in Section 1.2: the two hidden neurons in the lower left of the network are used by both outputs, whereas the rest of the network is split into two modules that each serve only one output. The same behaviour was observed for HyperNEAT in (Huizinga et al., 2014).
In Section 2.2 we mentioned that Turner and Miller (Turner and Miller, 2014) found heterogeneous networks to perform better than the average homogeneous network. We show that, for regression and classification problems, heterogeneous networks perform not only as well as the average homogeneous model, but in most cases as well as the best homogeneous model. Most importantly, the heterogeneous networks are more parsimonious and do indeed contain a non-trivial mix of activation functions. The networks reach a higher accuracy per connection, resulting in a smaller search space.
5. Conclusion
An extension to the NEAT algorithm is introduced that allows it to produce heterogeneous networks consisting of neurons with different activation functions, a directly encoded counterpart of the CPPNs used in HyperNEAT. HANEAT is able to find solutions that perform as well as the best homogeneous networks, which use only one kind of activation function, while using fewer nodes and connections. Although the search space is increased by parameterizing the activation function of every node, the reduction in topological complexity is shown to lead to convergence speeds similar to those of the best-performing homogeneous networks. HANEAT automatically finds solutions that rival those of the original NEAT in regression problems, with the benefit of not having to pick the right activation function, while at the same time being more parsimonious than homogeneous networks.
A mutation operator that allows changing the activation function of nodes during evolution has been evaluated. This operator was not used in HyperNEAT, and we show that it improves convergence. The activation mutation rate did not have a significant impact, but the results suggest that more than one mutation per genome per generation may be beneficial. This will have to be analyzed further.
Although heterogeneous networks perform as well as the best homogeneous networks, more attention has to be given to the training of weights. Applying backpropagation to the evolved network topologies at certain intervals might allow fairer comparisons between individuals with different topologies sooner, which would reduce some of the need for innovation protection.
When training data is scarce and overfitting is to be expected, networks should be as parsimonious as possible while still retaining high accuracy. The use of heterogeneous networks has proven to be a promising approach to achieve exactly that in topology-constructing neuroevolutionary training methods.
References
 Agostinelli et al. (2014) Forest Agostinelli, Matthew Hoffman, Peter Sadowski, and Pierre Baldi. 2014. Learning activation functions to improve deep neural networks. arXiv preprint arXiv:1412.6830 5, 34 (2014).
 Angus (1991) John E. Angus. 1991. Criteria for Choosing the Best Neural Network: Part 1. Technical Report 9116. DTIC Document.
 Drchal and Snorek (2012) Jan Drchal and Miroslav Snorek. 2012. Distance measures for HyperGP with fitness sharing. In Proceedings of the 14th annual conference on Genetic and evolutionary computation. ACM, 545–552.
 Efe (2008) Mehmet Ö. Efe. 2008. Novel neuronal activation functions for feedforward neural networks. Neural Processing Letters 28, 2 (2008), 63–79.
 Gauci and Stanley (2010) Jason Gauci and Kenneth O. Stanley. 2010. Autonomous evolution of topographic regularities in artificial neural networks. Neural computation 22, 7 (2010), 1860–1898.
 Gomez and Miikkulainen (1998) Faustino Gomez and Risto Miikkulainen. 1998. 2D pole balancing with recurrent evolutionary networks. In ICANN 98. Springer, 425–430.
 Hagan (2017) Martin T. Hagan. 2017. Engine dataset. (2017). http://hagan.ecen.ceat.okstate.edu
 Hornik et al. (1989) Kurt Hornik, Maxwell Stinchcombe, and Halbert White. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2, 5 (1989), 359–366.
 Huber et al. (1995) Reinhold Huber, H.A. Mayer, and Roland Schwaiger. 1995. netGEN - A Parallel System Generating Problem-Adapted Topologies of Artificial Neural Networks by means of Genetic Algorithms. 7 (1995).
 Huizinga et al. (2014) Joost Huizinga, JeanBaptiste Mouret, and Jeff Clune. 2014. Evolving Neural Networks That Are Both Modular and Regular: HyperNeat Plus the Connection Cost Technique. GECCO ’14 Proceedings of the 16th annual Genetic and Evolutionary Computation Conference (2014), 697–704. DOI:http://dx.doi.org/10.1145/2576768.2598232
 Kamruzzaman and Aziz (2002) J. Kamruzzaman and S.M. Aziz. 2002. A note on activation function in multilayer feedforward learning. Proceedings of the 2002 International Joint Conference on Neural Networks. IJCNN’02 (Cat. No.02CH37290) 1 (2002).
 Kolmogorov (1957) Andrey N. Kolmogorov. 1957. On the representation of continuous functions of many variables by superpositions of continuous functions of one variable and addition. MR 22 (1957), 2669.

 Laudani et al. (2015) Antonino Laudani, Gabriele Maria Lozito, Francesco Riganti Fulginei, and Alessandro Salvini. 2015. On training efficiency and computational costs of a feed forward neural network: A review. Computational Intelligence and Neuroscience 2015 (2015).
 Mahsal Khan et al. (2013) Maryam Mahsal Khan, Arbab Masood Ahmad, Gul Muhammad Khan, and Julian F. Miller. 2013. Fast learning neural networks using Cartesian genetic programming. Neurocomputing 121 (2013), 274–289. DOI:http://dx.doi.org/10.1016/j.neucom.2013.04.005

 Mangasarian and Wolberg (1990) Olvi L. Mangasarian and William H. Wolberg. 1990. Cancer diagnosis via linear programming. University of Wisconsin-Madison, Computer Sciences Department (1990).
 Mayer and Schwaiger (2002) Helmut Mayer and Roland Schwaiger. 2002. Differentiation of Neuron Types by Evolving Activation Function Templates for Artificial Neural Networks. In 2002 International Joint Conference on Neural Networks, Vol. 2. 1773–1778.
 Miller and Thomson (2000) Julian F. Miller and Peter Thomson. 2000. Cartesian genetic programming. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 1802 (2000), 121–132. arXiv:arXiv:1011.1669v3
 Moriarty and Miikkulainen (1996) David E. Moriarty and Risto Miikkulainen. 1996. Efficient reinforcement learning through symbiotic evolution. Machine Learning 22, 13 (1996), 11–32.

 Morse and Stanley (2016) Gregory Morse and Kenneth O. Stanley. 2016. Simple Evolutionary Optimization Can Rival Stochastic Gradient Descent in Neural Networks. GECCO '16 Proceedings of the 18th annual Genetic and Evolutionary Computation Conference (2016). DOI:http://dx.doi.org/10.1145/2908812.2908916
 Nair and Hinton (2010) Vinod Nair and Geoffrey E. Hinton. 2010. Rectified Linear Units Improve Restricted Boltzmann Machines. Proceedings of the 27th International Conference on Machine Learning 3 (2010), 807–814.
 Orr (1993) Mark J.L. Orr. 1993. Regularised centre recruitment in radial basis function networks. In Centre for Cognitive Science, Edinburgh University. Citeseer.
 Purdie et al. (1992) N. Purdie, E.A. Lucas, and M.B. Talley. 1992. Direct measure of total cholesterol and its distribution among major serum lipoproteins. Clinical Chemistry 38, 9 (1992), 1645–1647.
 Risi et al. (2010) Sebastian Risi, Joel Lehman, and Kenneth O. Stanley. 2010. Evolving the Placement and Density of Neurons in the HyperNEAT Substrate. GECCO ’10 Proceedings of the 12th annual Genetic and Evolutionary Computation Conference (2010), 563–570.
 Stanley (2007) Kenneth O. Stanley. 2007. Compositional pattern producing networks: A novel abstraction of development. Genetic Programming and Evolvable Machines 8, 2 (2007), 131–162. DOI:http://dx.doi.org/10.1007/s10710-007-9028-8
 Stanley et al. (2009) Kenneth O. Stanley, David B. D’Ambrosio, and Jason Gauci. 2009. A hypercubebased encoding for evolving largescale neural networks. Artificial Life 15, 2 (2009), 185–212. DOI:http://dx.doi.org/10.1162/artl.2009.15.2.15202
 Stanley and Miikkulainen (2002) Kenneth O. Stanley and Risto Miikkulainen. 2002. Evolving Neural Networks through Augmenting Topologies. Evolutionary Computation 10, 2 (2002), 99–127.
 Taras et al. (2014) Kowaliw Taras, Bredeche Nicolas, and René Doursat. 2014. HyperNEAT: The First Five Years. In Studies in Computational Intelligence. Vol. 557. 159–185. DOI:http://dx.doi.org/10.1007/978-3-642-55337-0
 Trujillo et al. (2014) Leonardo Trujillo, Luis Munoz, Enrique Naredo, and Yuliana Martinez. 2014. NEAT, There's No Bloat. 17th European Conference on Genetic Programming 8599 (2014), 174–185. DOI:http://dx.doi.org/10.1007/978-3-662-44303-3_15
 Turner and Miller (2014) Andrew James Turner and Julian Francis Miller. 2014. NeuroEvolution: Evolving Heterogeneous Artificial Neural Networks. Evolutionary Intelligence 7, 3 (2014), 135–154. DOI:http://dx.doi.org/10.1007/s12065-014-0115-5
 Yao (1999) Xin Yao. 1999. Evolving artificial neural networks. Proc. IEEE 87, 9 (1999), 1423–1447. DOI:http://dx.doi.org/10.1109/5.784219 arXiv:1108.1530