Adaptive Genomic Evolution of Neural Network Topologies (AGENT) for State-to-Action Mapping in Autonomous Agents

by Amir Behjat, et al.
University at Buffalo

Neuroevolution is a process of training neural networks (NN) through an evolutionary algorithm, usually to serve as a state-to-action mapping model in control or reinforcement learning-type problems. This paper builds on the NeuroEvolution of Augmenting Topologies (NEAT) formalism, which allows designing NNs whose topology and weights both evolve. Fundamental advancements are made to the neuroevolution process to address premature stagnation and convergence issues, central among which is the incorporation of automated mechanisms to control the population diversity and average fitness improvement within the neuroevolution process. Insights into the performance and efficiency of the new algorithm are obtained by evaluating it on three benchmark problems from the OpenAI platform and an Unmanned Aerial Vehicle (UAV) collision avoidance problem.



I Introduction

Artificial neural networks (ANNs) are playing an emerging role as decision-support models in various intelligent autonomous systems [1]. This emergence is partly attributed to the capability of ANNs to serve as universal function approximators, allowing them to be used for mapping states to (discrete or continuous) actions in autonomous systems. A significant fraction of such applications fall in the category where optimum actions corresponding to various states (i.e., labeled data) are not known a priori. Outside of classical control and planning methods, reinforcement learning (RL) methods [3, 4] and their recent deep variants [5] constitute a dominant player in training state-to-action models in such scenarios. However, RL methods use gradient information for backpropagation, which is not easy to ascertain for some problems [6], and in many cases they do not scale well with the dimension of the problem [7]. Most RL variants (with recent exceptions [8, 10]) are also not conducive to application in the continuous-space domain. An alternative class of frameworks based on evolutionary algorithms [9], namely neuroevolution [13] and evolution strategies [14], seeks to mitigate these limitations.

While neuroevolution allows highly parallelized implementations and can be applied to problems with continuous/mixed state spaces, it is often plagued by poor convergence, premature stagnation, and topological inflexibility. In this paper, we present a novel neuroevolution approach that aims to address these key issues.

Neuroevolution is the process of designing ANNs through evolutionary optimization algorithms. Early approaches merely evolved the weights of the ANN without altering its topology [15], which is a typical continuous optimization problem; genetic algorithms (GA) were used for this purpose. In these approaches, the architecture or topology of the ANN was user-prescribed (as is common across most domains of ANN training and usage), which however leads to sub-optimal or overfitting prediction models [2] for state-to-action mapping. Later endeavors introduced the concept of topology and weight evolving ANNs (TWEANNs). Neuroevolution of augmenting topologies, or NEAT [16], is perhaps the most well-known implementation of the TWEANN concept. NEAT evolves ANNs via a GA that directly encodes the nodes, edges, and edge weights (the phenotype) in a specialized genotype. NEAT commences with a population of minimalist genomes, represented by feedforward ANNs (with no hidden nodes), whose input and output layers are sized according to the problem at hand. At every generation of NEAT, along with standard genetic operations (i.e., selection, crossover, and mutation), a specialized operation called "speciation" is performed on the population in order to preserve newly created complex topologies (with likely premature weights). Variations of NEAT, including HyperNEAT [17] (which provides an indirect genotype-phenotype encoding) and SUNA [25], have been used to control virtual agents in Atari games [18], evolve robot gaits [19], perform geological prediction [20], and analyze financial markets [21]. More recently, neuroevolution has also been employed to evolve deep neural networks [22].

A persistent concern in evolving neural network topologies (for highly non-linear state-action mapping) is that of premature stagnation. Neuroevolution methods are often unable to preserve genomic diversity, because (nascent) complex structures cannot stabilize their weights as fast as simple networks, which leads to local stagnation (fitness functions typically tend to be highly non-convex over the NN topological space). While tedious problem-specific heuristics can strive to address this concern, we hypothesize that situation-adaptive automated variation of selection pressure and reproduction operators is needed to offer generalized solutions. In addition, while the concept in NEAT of initializing the population with minimalist NN topologies [16] mitigates the possibility of overfitting (to sample scenarios used for fitness evaluation) and aids fast convergence for problems with low-dimensional input spaces, a converse effect is encountered in problems with larger state spaces or when modeling highly non-linear state-action mapping. In these cases, starting with a minimalist baseline could lead to wasted computational effort to reach network topologies of reasonable complexity. Problem size-adaptive initialization of the NN topologies, together with allowing topological complexity to both increase and decrease [23] during neuroevolution, is hypothesized to address this issue.

To investigate the above-stated hypotheses, this paper develops an adaptive neuroevolution approach that incorporates novel mechanisms for in situ control of the genomic diversity and average fitness improvement. Further performance and computational efficiency gains are accomplished by allowing flexible topology initialization and provisioning nodes with variable activation function and memory properties. The new algorithm is evaluated over both benchmark RL and control problems and a practical robotics problem, namely collision avoidance in unmanned aerial vehicles (UAV). The next section describes the basic components of the algorithm. Section III presents the salient features of the algorithm. Section IV and Section V respectively demonstrate the capabilities of the algorithm on benchmark problems and the UAV problem, and Section VI provides concluding remarks.

II AGENT (Neuroevolution) Algorithm

The new neuroevolution algorithm is called Adaptive Genomic Evolution of Neural Network Topologies or AGENT. In this section, we describe the key components of this algorithm, namely the intra-generational stages, the encoding approach, and the selection and reproduction operators.

II-A AGENT: Stages

Figure 1 illustrates the overall flowchart of the AGENT algorithm. AGENT uses a two-stage evolutionary approach. All species participate in the first stage of evolution in each generation, while only the best genomes from each species participate in the second stage of evolution.

Fig. 1: AGENT: Flowchart

II-B Encoding

Information-processing capacity in an ANN is encapsulated both in its nodes and edges. Therefore, similar to the original NEAT, AGENT uses a direct bi-structural encoding, where each genome comprises a node encoding and an edge encoding. Where AGENT differs from NEAT is in how nodes are encoded: in AGENT, the node encoding also defines the type of activation function and memory capacity of the node. To allow greater flexibility, one of three activation functions can be selected: modified sigmoid (the only option in original NEAT), saturated linear, and sigmoid functions. The memory is allowed to take one of three values: one designates using the current weighted input incoming into the node, and the other two respectively allow using the first and second temporal derivatives of the weighted input incoming into the node. The latter enables exhibiting temporal dynamic behavior, useful for neurocontroller type applications. Equation 1 explains how the derivatives of each node are used. In this equation, for a node connected to upstream nodes, is the net synaptic input of node- in time step , is the output of the (upstream) node-, is the time step used when implementing this NN as a controller, and .
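The bi-structural encoding described above can be pictured with a small data structure. The following is only a minimal sketch, not the authors' implementation: the class names `NodeGene` and `Genome`, the string labels for the activation functions, and the integer memory codes 0, 1, 2 (current input, first derivative, second derivative) are all assumptions introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# Hypothetical labels for the three activation choices named in the text.
ACTIVATIONS = ("modified_sigmoid", "saturated_linear", "sigmoid")
# Assumed integer codes: 0 = current weighted input,
# 1 = first temporal derivative, 2 = second temporal derivative.
MEMORY_LEVELS = (0, 1, 2)


@dataclass
class NodeGene:
    """One entry of the node encoding, including the nodal properties."""
    node_id: int
    kind: str  # "input", "hidden", or "output"
    activation: str = "modified_sigmoid"
    memory: int = 0

    def __post_init__(self) -> None:
        if self.activation not in ACTIVATIONS:
            raise ValueError(f"unknown activation: {self.activation}")
        if self.memory not in MEMORY_LEVELS:
            raise ValueError(f"unknown memory level: {self.memory}")


@dataclass
class Genome:
    """Direct encoding: a node table plus a weighted edge table."""
    nodes: Dict[int, NodeGene] = field(default_factory=dict)
    edges: Dict[Tuple[int, int], float] = field(default_factory=dict)

    def add_node(self, node: NodeGene) -> None:
        self.nodes[node.node_id] = node

    def add_edge(self, src: int, dst: int, weight: float) -> None:
        # Edges may only connect nodes that are already encoded.
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must exist in the node encoding")
        self.edges[(src, dst)] = weight


genome = Genome()
genome.add_node(NodeGene(0, "input"))
genome.add_node(NodeGene(1, "output", activation="saturated_linear", memory=1))
genome.add_edge(0, 1, 0.5)
```

Keeping the nodal properties inside the node table means the reproduction operators can treat them as ordinary categorical genes.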


II-C Initialization

Unlike NEAT, the initial population in AGENT is allowed to comprise a small number of hidden neurons, instead of a minimalist topology with no hidden neurons. To introduce diversity in the initial population, the number of hidden nodes in each genome is chosen from a distribution such that the expected value (over the population) of the number of hidden nodes is given by

; here and are respectively the number of input and output nodes.

II-D Speciation

Speciation is a crucial aspect of AGENT; it refers to the process in which the population is divided into several subgroups. Speciation is carried out for two reasons. First, it shelters newly generated genomes (containing yet-to-be-stabilized weights) from getting eliminated due to selection pressure. Second, it facilitates greater local search within each species. The speciation process adopted here is similar to that in SUNA [25]. The most distinct genomes are chosen to represent the different species. The remaining genomes are then assigned to these species based on their similarity to the representative genomes.
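As a rough illustration of the two steps just described (pick the most distinct genomes as representatives, then assign the rest by similarity), the sketch below uses a greedy farthest-point rule. This is an assumption made for illustration and is not claimed to be the exact procedure of AGENT or SUNA; genomes are stood in for by plain feature vectors, and `l1_distance` and `speciate` are hypothetical names.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def l1_distance(a: Vector, b: Vector) -> float:
    """Stand-in genome distance: sum of absolute feature differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def speciate(
    population: Sequence[Vector],
    n_species: int,
    distance: Callable[[Vector, Vector], float] = l1_distance,
) -> List[int]:
    """Return, for every genome, the index of the species it is assigned to."""
    if not population:
        return []
    indices = range(len(population))
    n_species = min(n_species, len(population))

    # First representative: the genome farthest (in total) from all others.
    totals = [sum(distance(g, h) for h in population) for g in population]
    reps = [max(indices, key=lambda i: totals[i])]

    # Further representatives: the genome farthest from its nearest representative.
    while len(reps) < n_species:
        candidates = [i for i in indices if i not in reps]
        reps.append(
            max(
                candidates,
                key=lambda i: min(distance(population[i], population[r]) for r in reps),
            )
        )

    # Every genome joins the species of its most similar representative.
    return [
        min(range(len(reps)), key=lambda s: distance(g, population[reps[s]]))
        for g in population
    ]


labels = speciate([(0, 0), (0, 1), (10, 10), (10, 11)], n_species=2)
```

In this toy population the two genomes near the origin end up in one species and the two near (10, 10) in the other.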

II-E Selection

Tournament selection is used in AGENT because it is invariant under order-preserving transformations of the fitness function. Moreover, compared to proportional selection, tournament selection has the added benefit that the selection pressure can be modified by varying the ratio between the number of genomes that participate in the tournament and the number of genomes that are allowed to win the tournament. This property is used later (Section III-B) for controlling the selection pressure.
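A minimal sketch of tournament selection with a configurable number of entrants and winners is given below. The function name `tournament_select` and its parameters are hypothetical; only the general mechanism (sample entrants, keep the best few) is taken from the text.

```python
import random
from typing import List, Sequence


def tournament_select(
    fitness: Sequence[float],
    n_enter: int,
    n_win: int,
    pool_size: int,
    rng: random.Random,
) -> List[int]:
    """Fill a mating pool with genome indices chosen by repeated tournaments."""
    if not 1 <= n_win <= n_enter <= len(fitness):
        raise ValueError("need 1 <= n_win <= n_enter <= population size")
    pool: List[int] = []
    while len(pool) < pool_size:
        # Draw distinct entrants, then keep the n_win fittest of them.
        entrants = rng.sample(range(len(fitness)), n_enter)
        entrants.sort(key=lambda i: fitness[i], reverse=True)
        pool.extend(entrants[:n_win])
    return pool[:pool_size]


scores = [1.0, 5.0, 3.0, 2.0]
pool_raw = tournament_select(scores, n_enter=2, n_win=1, pool_size=10, rng=random.Random(1))
# Cubing is order-preserving, so the same seed yields the same mating pool.
pool_cubed = tournament_select(
    [s ** 3 for s in scores], n_enter=2, n_win=1, pool_size=10, rng=random.Random(1)
)
```

Only the ranking of the entrants matters, which is why any order-preserving rescaling of the fitness leaves the pool unchanged; raising `n_win` relative to `n_enter` lets lower-ranked genomes through and lowers the selection pressure.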

II-F Crossover

AGENT expands on the crossover process used in NEAT by also transferring the special nodal properties (activation function and memory type). Figure 2 illustrates this procedure. Weights of common edges are inherited randomly (with equal probability) from one of the two parents, while all nodal properties and weights associated with unique edges are inherited from the parent with superior fitness value.
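The inheritance rule above can be sketched as follows. The dictionary layout of a genome and the name `crossover` are assumptions for illustration; the sketch also assumes, as in NEAT, that edges present only in the less fit parent are not passed on.

```python
import random
from typing import Dict, Tuple

Genome = Dict[str, dict]


def crossover(fitter: Genome, other: Genome, rng: random.Random) -> Genome:
    """Produce a single child from two parents, the first being the fitter one."""
    child_edges: Dict[Tuple[int, int], float] = {}
    for key, weight in fitter["edges"].items():
        if key in other["edges"] and rng.random() < 0.5:
            # Common edge: weight taken from either parent with equal probability.
            child_edges[key] = other["edges"][key]
        else:
            # Common edge (other coin side) or edge unique to the fitter parent.
            child_edges[key] = weight
    # Nodal properties (activation, memory) always come from the fitter parent.
    return {"nodes": dict(fitter["nodes"]), "edges": child_edges}


parent_a = {
    "nodes": {0: ("sigmoid", 0), 1: ("saturated_linear", 1), 2: ("sigmoid", 0)},
    "edges": {(0, 1): 0.5, (0, 2): 0.9},
}
parent_b = {
    "nodes": {0: ("sigmoid", 0), 1: ("modified_sigmoid", 2), 2: ("sigmoid", 0)},
    "edges": {(0, 1): -0.5, (1, 2): 0.3},
}
child = crossover(parent_a, parent_b, random.Random(0))
```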

Fig. 2: Crossover operation in AGENT

II-G Mutation

As illustrated in Fig. 3 and described below, there are two types of mutation in AGENT, mutation of edges/nodes and mutation of nodal properties, each with their own rate.

Mutation of edge weights: For existing edges between any two nodes and , real-valued Gaussian mutation [26] of weights is undertaken, as given by:


Addition of an edge: An edge can be added between any existing pair of nodes as long as duplicate edges and cycles are not produced. It must be noted that the addition of an edge/node increases the complexity of the network. The weight of the new edge is assigned randomly from a uniform distribution in the range


Removal of an edge: Removing edges and nodes assists in reducing network complexity, e.g., to mitigate overfitting. An edge between any existing pair of nodes can be removed as long as it does not result in a floating node. In this research, the mutation rate for removing an edge is kept at 80% of the mutation rate for adding an edge, thereby allowing a slightly greater probability of network complexification.

Addition of a node: A node can be added between any edge resulting in the splitting of the edge into two edges.

Removal of a node: Any hidden node can be removed. Upon removal of a node, new connections are made such that all incident downstream nodes (with respect to the removed node) are connected to all upstream nodes (to which the removed node was connected). The probability of removing a node is kept at of the probability of adding a node.

Mutation of nodal properties: This is done by probabilistic switching between the categorical values of these properties (i.e., different memory value or activation function).
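The two structural node mutations listed above can be sketched on a plain edge table. The function names are hypothetical, and the weights given to newly created edges are assumptions (the text does not state them): the split follows the common NEAT convention of an incoming weight of 1.0 with the old weight on the outgoing edge, and reconnection after a removal uses a caller-supplied `new_weight`.

```python
from typing import Dict, Tuple

Edges = Dict[Tuple[int, int], float]


def add_node(edges: Edges, edge: Tuple[int, int], new_id: int) -> Edges:
    """Split an existing edge into two edges that pass through a new node."""
    src, dst = edge
    out = dict(edges)
    old_weight = out.pop(edge)
    out[(src, new_id)] = 1.0          # assumed weight of the incoming half
    out[(new_id, dst)] = old_weight   # assumed: outgoing half keeps the old weight
    return out


def remove_node(edges: Edges, node: int, new_weight: float = 1.0) -> Edges:
    """Delete a hidden node and connect all of its upstream nodes to all downstream nodes."""
    upstream = [s for (s, d) in edges if d == node]
    downstream = [d for (s, d) in edges if s == node]
    out = {key: w for key, w in edges.items() if node not in key}
    for s in upstream:
        for d in downstream:
            # Keep an already existing direct edge instead of duplicating it.
            out.setdefault((s, d), new_weight)
    return out


original = {(0, 2): 0.5}
grown = add_node(original, (0, 2), new_id=3)
shrunk = remove_node(grown, 3)
```

Both functions return a new edge table, so a mutated child never alters its parent's genome.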

(a) Mutation of edges and nodes
(b) Mutation of nodal properties
Fig. 3: Mutation operations in AGENT

III Adaptation Mechanisms in AGENT

In this section, we outline the novel formulations proposed to adaptively control the diversity and (fitness) improvement rate of AGENT along with the requisite metrics and limits.

III-A Diversity Preservation: Measure of Diversity

Population diversity is paramount to effective neuroevolution. This calls for prudently controlling the population diversity – an abrupt decrease in diversity can lead to premature stagnation, but at the same time, a steady (low) rate of diversity reduction is needed for exploitation and eventual convergence. The first step in diversity preservation is robust measurement of diversity. To develop a diversity measure, an approach is needed to quantify the differences between any two neural networks in the design space. In neuroevolution, since genomes encode different topologies, their basic dimensionality varies across the population. Hence, a distance metric similar to the novelty metric described in [25] is used here. The distance between two candidate ANNs, and , is thus given by the weighted sum of the difference between their node types, as well as the difference between edges connecting different types of nodes.


In this equation, is the number of nodes of type in neural network A. is the number of edges from node type to node type in neural network A, and is the number of types of activation functions allowed in the NNs. The weights and are prescribed to be in this paper.

Now, to quantify the overall diversity in the population at any given generation , we construct a complete undirected graph out of the population of candidate NNs. The length of the edges connecting candidate NNs in this graph is given by the above defined distance metric (Eq. 3). Then, employing the concept of minimum spanning tree (MST), the total length of the MST is used as the diversity metric () at the generation, as given by:


where represents an edge connecting ANNs A and B in the (population) graph. Kruskal's algorithm [27] is used to determine the MST; it is computationally inexpensive ($O(E \log E)$, where $E$ is the number of edges in the graph), and thus can be called in every generation.
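The diversity measure can be sketched end to end: a pairwise distance built from node-type and edge-type counts, followed by Kruskal's algorithm over the complete population graph. This is a sketch under stated assumptions: the dictionary layout of a network summary, the use of absolute differences, and the unit weights `w_node` and `w_edge` are illustrative choices, since the paper's actual weight values are not recoverable from the text here.

```python
from itertools import combinations
from typing import Callable, Dict, List, Sequence

Summary = Dict[str, dict]  # {"nodes": {type: count}, "edges": {(type, type): count}}


def network_distance(a: Summary, b: Summary, w_node: float = 1.0, w_edge: float = 1.0) -> float:
    """Weighted sum of differences in node-type counts and edge-type counts."""
    node_types = set(a["nodes"]) | set(b["nodes"])
    edge_types = set(a["edges"]) | set(b["edges"])
    d_nodes = sum(abs(a["nodes"].get(t, 0) - b["nodes"].get(t, 0)) for t in node_types)
    d_edges = sum(abs(a["edges"].get(t, 0) - b["edges"].get(t, 0)) for t in edge_types)
    return w_node * d_nodes + w_edge * d_edges


def mst_diversity(
    population: Sequence[Summary],
    distance: Callable[[Summary, Summary], float] = network_distance,
) -> float:
    """Total edge length of the minimum spanning tree over the population graph."""
    n = len(population)
    parent: List[int] = list(range(n))

    def find(i: int) -> int:
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Kruskal: scan all pairwise edges from shortest to longest.
    graph = sorted(
        (distance(population[i], population[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    total = 0.0
    for length, i, j in graph:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # the edge joins two components, so it is in the MST
            parent[root_i] = root_j
            total += length
    return total


net_a = {"nodes": {"sigmoid": 1}, "edges": {}}
net_b = {"nodes": {"sigmoid": 2}, "edges": {}}
net_c = {"nodes": {"sigmoid": 4}, "edges": {}}
diversity = mst_diversity([net_a, net_b, net_c])
```

For the three toy networks the pairwise distances are 1, 2, and 3, so the MST keeps the two shortest edges and the diversity evaluates to 3.0. A population of identical networks gives zero.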

With this measure of diversity, we can define a desired value for diversity and also delineate an approach to maintain this desired value. The below proposed limit defines the desired diversity at any generation.


Here is the diversity in the initial population. The coefficient is used to increase the diversity. Here, is the maximum allowed generations. This formulation suggests a low steady linear decrease of diversity.

III-B Diversity Preservation: Controlling Diversity

The tournament size in the selection operator is used as the control input to regulate the diversity. The probability of selecting a specific genome, to be copied into the mating pool, decreases with the number of genomes participating in the tournament and increases with the number of the genomes chosen from the tournament.

The probability of the ranked genome to be selected into the mating pool () by resampling is given by:


where is the population size, and and are respectively the numbers of genomes that enter the tournament and win the tournament. Since crossover in AGENT produces a single child NN from two parent NNs, the numerator is multiplied by . Based on this formulation, it can be seen that the probability of choosing lower ranked genomes increases by increasing the ratio . Therefore this ratio can be used to decrease the selection pressure, thereby increasing the diversity, and thus serves as a suitable choice for a control input. For regulating diversity at any generation, this control input can be computed as:


Here represents the difference between the observed diversity and the desired diversity; is the diversity gain coefficient, which modifies the amount of change that must be applied to the tournament ratio. For the current study, we used .

III-C Improvement Adaptation: Metric of Improvement

The premise behind tracking and controlling fitness is its ability to reflect whether adequate search dynamics are present in the population. Since diversity is simultaneously being preserved (Section III-B), steady improvement in average fitness over generations (in comparison to the improvement in the best fitness in the population) would be reflective of a robust search process. With this premise, we first define an improvement metric that encapsulates the history of improvement, as given by:


where is the current generation, and respectively represent fitness function values at the and generations, and is a scaling coefficient. This metric is designed such that more recent improvements have a greater influence.

III-D Improvement Adaptation: Mutation Controller

If the improvement (over generations) in the average fitness of the population lags far behind the improvement in the best fitness value, it demonstrates a weakening search dynamics across the population. Now, in TWEANNs, mutation is the main driver of network innovation. So, too high a rate of mutation continues generating new niches of NNs that do not get time to stabilize their weights, and the algorithm starts acting as random search leading to the lagging average fitness improvement scenario mentioned above. This is where an adaptive reduction in mutation rate is needed. Conversely, when the improvement in the fitness of the population best starts lagging behind improvement in average fitness of the population, it is indicative of potential stagnation at local optima, and calls for increasing the mutation rate to facilitate discovery of new networks. With this perspective, we propose the following mutation rate () control strategy:


Here is the mutation rate in generation , and is the mutation controller gain coefficient. Similar to , the mutation controller gain can be prescribed to enable more or less aggressive search dynamics. For the current paper, is set at 0.1. In Eq. 9, and respectively represent the fitness improvement metrics for the population average and the population best; they are computed using Eq. 8.
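The direction of this adaptation can be illustrated with a simple proportional rule. The multiplicative form below, the clamping bounds, and the name `adapt_mutation_rate` are assumptions for illustration only and should not be read as the paper's Eq. 9; the default gain of 0.1 echoes the value quoted above.

```python
from typing import Tuple


def adapt_mutation_rate(
    rate: float,
    improvement_avg: float,
    improvement_best: float,
    gain: float = 0.1,
    bounds: Tuple[float, float] = (0.01, 1.0),
) -> float:
    """Lower the rate when average improvement lags the best, raise it in the opposite case."""
    # Assumed proportional update: the sign of the gap sets the direction of change.
    proposed = rate * (1.0 + gain * (improvement_avg - improvement_best))
    lower, upper = bounds  # assumed safety limits, not taken from the paper
    return max(lower, min(upper, proposed))


# Average improvement lags behind the best: mutation is reduced.
reduced = adapt_mutation_rate(0.2, improvement_avg=0.1, improvement_best=0.5)
# Best improvement lags behind the average: mutation is increased.
raised = adapt_mutation_rate(0.2, improvement_avg=0.5, improvement_best=0.1)
```

When the two improvement metrics coincide, the rate is left unchanged, which matches the stated goal of keeping the average and best trajectories aligned.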

IV Benchmark Testing of AGENT: OpenAI Gym

OpenAI Gym is an open-source platform [28] that has been growing in popularity for benchmarking and comparing RL algorithms [11], as well as other learning and optimization methods [22] that can solve RL-type control problems. In this paper, we showcase the performance of AGENT on three problems curated from the OpenAI Gym and compare with published results on state-of-the-art RL methods (summarized in Table I). These problems are very briefly described below; further information on these implementations, e.g., details of the state and action vectors, can be found at

IV-A Mountain Car

In the Mountain Car problem, a candidate NN agent must control an underpowered car so that it can successfully climb up a mountain. In this paper, the MountainCarContinuous-v0 environment taken directly from OpenAI Gym is used; for the sake of fair comparison, the same reward function as described in the source code is used. The initial position of the car is randomly generated at the commencement of each episode – this could lead to misleading results, as some solutions might accumulate a good reward due to a conducive starting position. To account for this factor, the number of episodes each genome encounters is controlled in a progressive manner. Each genome must accumulate reward thresholds before progressing on to the next episode. Mathematically, we express this as:


where is the reward/penalty the agent receives for each action taken; is the total number of actions taken in the -th scenario, and represents the genome's accumulated reward in that scenario; represents the net fitness function evaluated for a candidate genome; refers to the maximum number of scenarios available at training; and refers to the number of scenarios that the genome successfully passed.

For this test problem, the performance of AGENT is compared to that reported for the deep deterministic policy gradient method [30]. From the results in Table I, it can be seen that AGENT was able to find significantly better reward values, albeit at the expense of a greater total number of steps (where a step is defined as an executed state-action instance).

Parameter                  Mountain Car   Acrobot      Lunar Lander
Population Size            200            600          400
AGENT: Best Reward         99.1           -69.6        68.0
AGENT: Tot. Func. Eval.    13,500         49,800       57,024
AGENT: Total Episodes      8,059          71,826       285,120
AGENT: Total Steps         855,199        73,831,050   1,369,446
Published Best Reward      [30]           [31]         [32]
Published Total Episodes   -              1,000        10,000
Published Total Steps      100,000        -            -

Published best values (under maximization) are taken from [30, 31, 32].

TABLE I: OpenAI results: AGENT vs. Reference papers

IV-B Acrobot

The Acrobot-v1 environment in OpenAI Gym describes a two-joint, two-link robotic arm that initially hangs downwards. The goal in this problem is to produce a joint torque so as to swing the lower link up to a specified height. The same approach as outlined in Eq. 10 is taken to mitigate the effects of randomness in the environment. As can be seen from Table I, AGENT is able to achieve better reward values than the reference method, RL with adaptive memory replay [31], again at the expense of additional computational cost.

IV-C Lunar Lander

The LunarLander-v2 environment in OpenAI Gym presents a problem with a mixture of discrete and continuous variables. Here, the agent must safely land a spacecraft on a landing pad via control of its three engines. The reward function described in OpenAI Gym is used. The effects of uncertainties in this problem (attributed to noisy engine thrust) are mitigated using the approach outlined in Eq. 10.

The optimum results obtained by AGENT are compared to those of a recently reported experience replay based RL method [32]. As seen from Table I, AGENT did not perform as well as the RL method. This can be attributed to the reward function for this problem, which presents many local minima, demanding different behaviors. Note that the handcrafted design of the reward functions in such problems does not necessarily represent generic performance in physical terms, and is often more amenable to RL based learning.

Given this problem's complexity, it is used to analyze the performance of the special controllers in AGENT. Figure 4 shows the fitness improvement (not fitness value) of the population best and population average over generations, which are observed to follow similar trajectories, thus providing evidence of the effectiveness of the fitness improvement adaptation. A baseline case with the diversity/mutation controllers deactivated was also run; it yielded an inferior best reward of 47.4, further supporting the usefulness of AGENT's adaptation mechanisms.

Fig. 4: Lunar Lander: diversity and mutation control effects

V Optimal UAV Collision Avoidance

In this section, we present the performance of AGENT on a UAV collision avoidance application. UAV collision avoidance is a well studied problem with solutions existing for avoiding both static [33] and dynamic [34] obstacles.

The thesis [35] described an online cooperative collision avoidance approach for uniform quadcopter UAVs, where the UAVs undertake either a heading change or speed change maneuver, both in a reciprocal manner, e.g., if one UAV decides to veer to its left, the other UAV will also veer to its own left (both must get back to their original path). The prior approach used supervised learning, with optimization derived labels over sample collision scenarios, to train the maneuver models. Here, we use AGENT/neuroevolution to train the heading-change maneuver model. The outputs of the maneuver model are the time, , between the point of detection and maneuver initiation, and the effective change in angle . These are used to generate waypoints, then translated into a minimum snap trajectory, to be executed by a PID controller [35]. The inputs to the model include five UAV pose variables that completely define a collision scenario. Figure 5(a) illustrates the inputs (state vector) and outputs (action vector) for this problem, and Fig. 5(b) illustrates the online maneuver (given by the AGENT-trained model) for a representative collision scenario.

Fig. 5: UAV Collision Avoidance: (a) state-action mapping, (b) Heading change maneuver in a given collision scenario

In AGENT, each candidate genome is subjected to a set of collision scenarios, with the fitness function (to be maximized) given by the following aggregate performance:


Here is the net energy consumed by both UAVs to execute the maneuver, and is the total battery capacity. In general, , and is used as a scaling factor to give more prominence to safety (i.e., maintaining adequate inter-UAV separation), compared to energy efficiency. In exceptional cases, where (typically indicates a diverging maneuver), the term , otherwise is equal to the minimum separation distance experienced during the maneuver. The parameter represents the separation threshold, coming closer than which is considered as collision; is set at .

AGENT is run with a population size of 400 and allowed 60 generations. Figure 6 shows the structure of the optimum NN obtained, compared to an initial NN. This optimum NN avoids collisions in all training scenarios. It is also tested on an additional 200 unseen scenarios, chosen from the same distribution as the training scenarios. It successfully avoids collisions in 192/200 unseen scenarios.

Fig. 6: UAV problem: Evolution of NN via AGENT

For further validation, the performance of the AGENT-derived NN model is compared with that given by optimizing the maneuver individually for each test scenario. The PSO-based global optimization approach [36], used in the prior work for generating labels [35], is employed for the latter (whose computing cost makes it unsuitable for online application). Figure 7 shows the distribution of energy consumption and minimum separation distance (during maneuver) over the test scenarios, for both the AGENT-derived model and the PSO results. It can be observed from Fig. 7 that while the energy efficiency performance of AGENT’s NN model is slightly worse than the PSO results, the avoidance performance is quite comparable – 192/200 successes by AGENT’s NN model vs. 196/200 successes by PSO.

Fig. 7: UAV problem: Performance of AGENT compared to direct maneuver optimization over unseen test scenarios

VI Conclusion

In this paper, we developed a new neuroevolution method, called AGENT, by making important advancements to the NEAT formalism. The goal was to mitigate premature stagnation issues and improve the rate of convergence on complex RL/control problems. The key contributions included: 1) incorporating memory and activation function choice as nodal properties, 2) quantifying diversity using a minimum spanning tree and controlling diversity via adaptive tournament selection, 3) controlling average fitness improvement via mutation rate adaptation, and 4) allowing both growth and shrinkage of NN topologies during evolution.

The AGENT algorithm was tested on benchmark control problems adopted from the OpenAI Gym. Compared to state-of-the-art RL methods, it produced competitive final outcomes (except in one problem) while incurring a greater cost in time steps. However, it is important to point out that AGENT is significantly more amenable to parallel implementation than RL methods, and thus future comparisons of computational time might present a different (likely more promising) picture. AGENT was also tested on an original UAV collision avoidance problem, resulting in an online model that provided competitive performance with respect to offline optimization over test scenarios. Immediate future efforts will explore mechanisms to accelerate the evolutionary process via indirect genomic encoding and distributed implementations, in order to allow application to higher-dimensional learning problems in robotics.