1 Introduction
The goal of robot design is to find an optimal body structure and its means of locomotion to best achieve a given objective in an environment. Robot design often relies on careful human engineering and expert knowledge. The field of automatic robot design aims to search for these structures automatically. This has been a long-studied subject, but with limited success. There are two major challenges: 1) the search space of all possible designs is large and combinatorial, and 2) the evaluation of each design requires learning or testing a separate optimal controller that is often expensive to obtain.
In (Sims, 1994), the authors evolved creatures made of 3D blocks. More recently, soft robots were studied in (Joachimczak et al., 2014), where they were evolved by adding small cells connected to existing ones. In (Cheney et al., 2014), 3D voxels were treated as the minimum element of the robot. Most evolutionary robots (Duff et al., 2001; Neri, 2010) require heavy engineering of the initial structures, evolving rules and careful human guidance. Due to the combinatorial nature of the problem, evolutionary, genetic or random structure search have been the de facto algorithms of automatic robot design in the pioneering works (Sims, 1994; Steels, 1993; Mitchell & Forrest, 1994; Langton, 1997; Lee, 1998; Taylor, 2017; Calandra et al., 2016). In terms of the underlying algorithm, most of these works use a population-based optimization loop similar to the one in (Sims, 1994). None of these algorithms is able to evolve kinematically reasonable structures, as a result of the large search space and the inefficient evaluation of candidates.
Similar in vein to automatic robot design, automatic neural architecture search also faces a large combinatorial search space and difficulty in evaluation. There have been several approaches to tackle these problems. Bayesian optimization approaches (Snoek et al., 2012) primarily focus on fine-tuning the number of hidden units and layers from a predefined set. Reinforcement learning (Zoph & Le, 2016; Liu et al., 2017) has been studied as a way to evolve recurrent neural networks (RNNs) and convolutional neural networks (CNNs) from scratch in order to maximize the validation accuracy. These approaches are computationally expensive because a large number of candidate networks have to be trained from the ground up. (Pham et al., 2018) and (Stanley & Miikkulainen, 2002) propose weight sharing among all possible candidates in the search space to effectively amortize the inner-loop training time and thus speed up the architecture search. A typical neural architecture search on ImageNet (Krizhevsky et al., 2012) takes 1.5 days using 200 GPUs (Liu et al., 2017).
In this paper, we propose an efficient search method for automatic robot design, Neural Graph Evolution (NGE), that co-evolves both the robot design and the control policy. Unlike recent reinforcement learning work, where the control policies are learnt on specific robots carefully designed by human experts (Mnih et al., 2013; Bansal et al., 2017; Heess et al., 2017), NGE aims to adapt the robot design along with policy learning to maximize the agent’s performance. NGE formulates automatic robot design as a graph search problem. It uses a graph as the main backbone of a rich design representation and graph neural networks (GNNs) as the controller. This is key to evaluating candidate structures efficiently during evolutionary graph search. Similar to previous algorithms like (Sims, 1994), NGE iteratively evolves new graphs and removes graphs based on the performance guided by the learnt GNN controller. The specific contributions of this paper are as follows:

We formulate the automatic robot design as a graph search problem.

We utilize graph neural networks (GNNs) to share the weights between the controllers, which greatly reduces the computation time needed to evaluate each new robot design.

To balance exploration and exploitation during the search, we developed a mutation scheme that incorporates model uncertainty of the graphs.

We show that NGE automatically discovers robot designs that are comparable to the ones designed by human experts in MuJoCo (Todorov et al., 2012), while random graph search or naive evolutionary structure search (Sims, 1994) fail to discover meaningful results on these tasks.
2 Background
2.1 Reinforcement Learning
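In reinforcement learning (RL), the problem is usually formulated as a Markov Decision Process (MDP). The infinite-horizon discounted MDP consists of a tuple $(\mathcal{S}, \mathcal{A}, \gamma, P, R)$, respectively the state space, action space, discount factor, transition function, and reward function. The objective of the agent is to maximize the total expected reward $J(\theta) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$, where the state transition follows the distribution $P(s_{t+1} \mid s_t, a_t)$. Here, $s_t$ and $a_t$ denote the state and action at time step $t$, and $r(s_t, a_t)$ is the reward function. In this paper, to evaluate each robot structure, we use PPO to train RL agents (Schulman et al., 2017; Heess et al., 2017). PPO uses a neural network parameterized as $\pi_\theta(a_t \mid s_t)$ to represent the policy, and adds a penalty for the KL divergence between the new and old policy to prevent over-optimistic updates. PPO optimizes the following surrogate objective function instead:
$$J_{PPO}(\theta) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} A^t(s_t, a_t)\, \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\right] - \beta\, \mathrm{KL}\left[\pi_{\theta_{old}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t)\right] \qquad (1)$$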
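As a rough illustration of a KL-penalized surrogate of the kind described above, the following sketch estimates it from samples. It is not the training code used in this work: it assumes a small discrete action set (the controllers in this paper are continuous), and the names `kl_penalized_surrogate`, `samples` and `beta` are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def kl_penalized_surrogate(samples, beta):
    """Sample estimate of a KL-penalized policy-gradient surrogate.

    Each sample is (old_probs, new_probs, action, advantage), where the
    probability lists are the old and new policy over a discrete action
    set at one visited state. This is an illustrative assumption.
    """
    total = 0.0
    for old_probs, new_probs, action, advantage in samples:
        ratio = new_probs[action] / old_probs[action]  # importance weight
        total += ratio * advantage - beta * kl_divergence(old_probs, new_probs)
    return total / len(samples)

# Hypothetical two-action example.
samples = [
    ([0.5, 0.5], [0.6, 0.4], 0, 1.0),   # action 0 had positive advantage
    ([0.5, 0.5], [0.6, 0.4], 1, -1.0),  # action 1 had negative advantage
]
objective = kl_penalized_surrogate(samples, beta=0.1)
```

Shifting probability toward the action with positive advantage raises the first term, while the penalty grows as the new policy moves away from the old one.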
We denote the estimate of the expected total reward given the current state-action pair, the value and the advantage functions, as $Q^t(s_t, a_t)$, $V(s_t)$ and $A^t(s_t, a_t)$ respectively. PPO solves the problem by iteratively generating samples and optimizing the surrogate objective (Schulman et al., 2017).
2.2 Graph Neural Network
Graph Neural Networks (GNNs) are suitable for processing data in the form of graphs (Bruna et al., 2014; Defferrard et al., 2016; Li et al., 2015; Kipf & Welling, 2017; Duvenaud et al., 2015; Henaff et al., 2015). Recently, the use of GNNs in locomotion control has greatly increased the transferability of controllers (Wang et al., 2018). A GNN operates on a graph whose nodes and edges are denoted respectively as $u \in V$ and $e \in E$. We consider the following GNN, where at timestep $t$ each node receives an input feature and is supposed to produce a node-level output.
Input Model: The input feature for node $u$ is denoted as $x_u^t$. $x_u^t$ is a vector of size $d$, where $d$ is the size of features. In most cases, $x_u^t$ is produced by an embedding function used to encode information about $u$ into $d$-dimensional space.
Propagation Model: Within each timestep $t$, the GNN performs $T$ internal propagations, so that each node has global (neighbourhood) information. In each propagation, every node communicates with its neighbours, and updates its hidden state by absorbing the input feature and message. We denote the hidden state at internal propagation step $\tau$ ($\tau \le T$) as $h_u^{t,\tau}$. Note that $h_u^{t,0}$ is usually initialized as $h_u^{t-1,T}$, i.e., the final hidden state in the previous time step. $h_u^{0,0}$ is usually initialized to zeros. The message that $u$ sends to its neighbors is computed as
$$m_u^{t,\tau} = M(h_u^{t,\tau-1}) \qquad (2)$$
where $M$ is the message function. To compute the updated $h_u^{t,\tau}$, we use the following equations:
$$\bar{m}_u^{t,\tau} = \rho\left(\{m_v^{t,\tau} \mid v \in N_G(u)\}\right), \quad h_u^{t,\tau} = U\left(h_u^{t,\tau-1}, x_u^t, \bar{m}_u^{t,\tau}\right) \qquad (3)$$
where $\rho$ and $U$ are the message aggregation function and the update function respectively, and $N_G(u)$ denotes the neighbors of $u$.
Output Model: The output function $F$ takes as input the node’s hidden state after the last internal propagation. The node-level output for node $u$ is therefore defined as $\mu_u^t = F(h_u^{t,T})$.
Functions in GNNs can be trainable neural networks or linear functions. For details of GNN controllers, we refer readers to (Wang et al., 2018).
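The propagation model can be sketched in a few lines. This is a toy illustration rather than the controller used in this work: hidden states are scalars, the message function is a fixed scaling (`w_msg`, a hypothetical parameter), aggregation is a sum, the update is a `tanh`, and the output function is omitted.

```python
import math

def propagate(neighbors, inputs, hidden, num_steps, w_msg=0.5):
    """Run `num_steps` internal propagation steps on scalar hidden states.

    neighbors: dict node -> list of neighbouring nodes (undirected graph)
    inputs:    dict node -> scalar input feature
    hidden:    dict node -> scalar initial hidden state
    """
    for _ in range(num_steps):
        # Message function: a scaled copy of the sender's hidden state.
        messages = {u: w_msg * hidden[u] for u in neighbors}
        # Aggregation: sum of incoming messages; update: bounded nonlinearity.
        hidden = {
            u: math.tanh(hidden[u] + inputs[u] + sum(messages[v] for v in neighbors[u]))
            for u in neighbors
        }
    return hidden

# Path graph 0 - 1 - 2 where only node 0 has a non-zero input feature.
neighbors = {0: [1], 1: [0, 2], 2: [1]}
inputs = {0: 1.0, 1: 0.0, 2: 0.0}
zeros = {u: 0.0 for u in neighbors}

after_one = propagate(neighbors, inputs, zeros, num_steps=1)
after_three = propagate(neighbors, inputs, zeros, num_steps=3)
```

In this toy graph, node 2 is still at zero after one step and only becomes non-zero once enough propagation steps have passed for the signal from node 0 to travel along the path. This is why the number of internal propagations matters for global information.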
3 Neural Graph Evolution
In robot design, every component, including the robot's arms, fingers and feet, can be regarded as a node. The connections between the components can be represented as edges. In locomotion control, robotic simulators like MuJoCo (Todorov et al., 2012) use an XML file to record the graph of the robot. Robot design is thus naturally represented by a graph. To better illustrate Neural Graph Evolution (NGE), we first introduce the terminology and summarize the algorithm.
Graph and Species. We use an undirected graph $\mathcal{G} = (V, E, \alpha)$ to represent each robotic design. $V$ and $E$ are the collections of physical body nodes and edges in the graph, respectively. The mapping $\alpha: V \to \Lambda$ maps each node $u \in V$ to its structural attributes $\alpha(u) \in \Lambda$, where $\Lambda$ is the attribute space. For example, the fish in Figure 1 consists of a set of ellipsoid nodes, and the vector $\alpha(u)$ describes the configuration of each ellipsoid. The controller is a policy network parameterized by weights $\theta$. The tuple formed by the graph and the policy is defined as a species, denoted as $\Omega = (\mathcal{G}, \theta)$.
Generation and Policy Sharing. In the $j$-th iteration, NGE evaluates a pool of species called a generation, denoted as $\mathcal{P}^j = \{(\mathcal{G}_i^j, \theta_i^j),\ i = 1, \dots, N\}$, where $N$ is the size of the generation. In NGE, the search space includes not only the graph space, but also the weight or parameter space of the policy network. For better efficiency of NGE, we design a process called Policy Sharing (PS), where weights are reused from parent to child species. The details of PS are described in Section 3.4.
Our model can be summarized as follows. NGE performs population-based optimization by iterating among mutation, evaluation and selection. The objective and performance metric of NGE are introduced in Section 3.1. In NGE, we randomly initialize the generation with $N$ species. For each generation, NGE trains each species and evaluates its fitness separately, using the policy described in Section 3.2. During selection, we eliminate the species with the worst fitness. To mutate new species from the surviving species, we develop a novel mutation scheme called Graph Mutation with Uncertainty (GMUC), described in Section 3.3, and efficiently inherit policies from the parent species by Policy Sharing, described in Section 3.4. Our method is outlined in Algorithm 1.
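The loop summarized above can be sketched as follows. This is a toy outline under stated assumptions, not Algorithm 1 itself: `fitness` is a stub standing in for reinforcement-learning training, `mutate` is a stand-in for the mutation scheme of Section 3.3, and all function and parameter names are hypothetical.

```python
import copy
import random

def evolve(init_species, fitness, mutate, num_generations, num_eliminated, rng):
    """Toy population-based loop: evaluate, select, then mutate with weight reuse.

    init_species: list of (graph, weights) tuples
    fitness:      callable (graph, weights) -> float; stands in for RL training
    mutate:       callable (graph, rng) -> new graph
    """
    population = list(init_species)
    for _ in range(num_generations):
        # Evaluation: in NGE this step trains each controller; here it is a stub.
        ranked = sorted(population, key=lambda sp: fitness(*sp), reverse=True)
        # Selection: drop the worst-performing species.
        survivors = ranked[:len(ranked) - num_eliminated]
        # Mutation: each child copies its parent's weights (Policy Sharing).
        children = []
        for _ in range(num_eliminated):
            parent_graph, parent_weights = rng.choice(survivors)
            children.append((mutate(parent_graph, rng), copy.deepcopy(parent_weights)))
        population = survivors + children
    return max(population, key=lambda sp: fitness(*sp))

# Hypothetical stand-ins: a "graph" is just a node count, best at 5 nodes.
def toy_fitness(graph, weights):
    return -abs(graph - 5)

def toy_mutate(graph, rng):
    return max(1, graph + rng.choice([-1, 1]))

rng = random.Random(0)
start = [(1, {"w": 0.0}) for _ in range(8)]
best_graph, best_weights = evolve(start, toy_fitness, toy_mutate, 30, 4, rng)
```

Because the best survivors are always carried over, the best fitness in the population never decreases from one generation to the next in this sketch.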
3.1 Amortized Fitness and Objective Function
Fitness represents the performance of a given $\mathcal{G}$ using the optimal controller parameterized with $\theta^*(\mathcal{G})$. However, $\theta^*(\mathcal{G})$ is impractical or impossible to obtain for the following reasons. First, each design is computationally expensive to evaluate. To evaluate one graph, the controller needs to be trained and tested. Model-free (MF) algorithms could take more than one million in-game timesteps to train a simple 6-degree-of-freedom cheetah (Schulman et al., 2017), while model-based (MB) controllers usually require much more execution time, without the guarantee of having higher performance than MF controllers (Tassa et al., 2012; Nagabandi et al., 2017; Drews et al., 2017; Chua et al., 2018). Second, the search in robotic graph space can easily get stuck in local optima. In robotic design, local optima are difficult to detect, as it is hard to tell whether the controller has converged or has reached a temporary optimization plateau. Learning the controllers is a computation bottleneck in optimization.
In population-based robot graph search, spending more computation resources on evaluating each species means that fewer different species can be explored. In our work, we enable transferability between different topologies of NGE (described in Sections 3.2 and 3.4). This allows us to introduce amortized fitness (AF) as the objective function across generations for NGE. AF is defined as
$$\xi(\mathcal{G}, \theta) = \mathbb{E}_{\pi_\theta, \mathcal{G}}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right] \qquad (4)$$
In NGE, the mutated species continues the optimization by initializing its parameters with the parameters inherited from its parent species. In past work (Sims, 1994), species in one generation are trained separately for a fixed number of updates, which is biased and can leave them undertrained or overtrained. In subsequent generations, new species have to discard old controllers if the graph topology is different, which might waste valuable computation resources.
3.2 Policy Representation
Given a species with graph $\mathcal{G}$, we train the parameters $\theta$ of the policy network $\pi_\theta(a_t \mid s_t)$ using reinforcement learning. Similar to (Wang et al., 2018), we use a GNN as the policy network of the controller. A graphical representation of our model is shown in Figure 1. We follow the notation in Section 2.2.
For the input model, we parse the input state vector $s_t$ obtained from the environment into a graph, where each node $u \in V$ fetches the corresponding observation $o(u, t)$ from $s_t$, and extracts the feature with an embedding function $\Phi$. We also encode the attribute information $\alpha(u)$ into $x_u^t$ with an embedding function denoted as $\zeta$. The input feature $x_u^t$ is thus calculated as:
$$x_u^t = [\Phi(o(u, t));\ \zeta(\alpha(u))] \qquad (5)$$
where $[\cdot\,;\cdot]$ denotes concatenation. We use $\theta_\Phi$, $\theta_\zeta$ to denote the weights of the embedding functions.
The propagation model is described in Section 2.2. We recap it here briefly: the initial hidden state for node $u$ is denoted as $h_u^{t,0}$, which is initialized from the hidden state of the last timestep or simply zeros. $T$ internal propagation steps are performed for each timestep; during each step (denoted as $\tau \le T$), every node sends messages to its neighboring nodes and aggregates the received messages. $h_u^{t,\tau}$ is calculated by an update function that takes in $h_u^{t,\tau-1}$, the node input feature $x_u^t$ and the aggregated message $\bar{m}_u^{t,\tau}$. We use summation as the aggregation function and a GRU (Chung et al., 2014) as the update function.
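For the output model, we define the collection of controller nodes as $\mathcal{F}$, and define Gaussian distributions on each node’s controller as follows:
$$\forall u \in \mathcal{F}: \quad \mu_u^t = F_\mu(h_u^{t,T}) \qquad (6)$$
$$\sigma_u^t = F_\sigma(h_u^{t,T}) \qquad (7)$$
where $\mu_u^t$ and $\sigma_u^t$ are the mean and the standard deviation of the action distribution. The weights of the output function are denoted as $\theta_F$. By combining all the actions produced by each node controller, we have the policy distribution of the agent:
$$\pi_\theta(a_t \mid s_t) = \prod_{u \in \mathcal{F}} \pi_u(a_u^t \mid s_t) = \prod_{u \in \mathcal{F}} \frac{1}{\sqrt{2\pi (\sigma_u^t)^2}} \exp\left(-\frac{(a_u^t - \mu_u^t)^2}{2 (\sigma_u^t)^2}\right) \qquad (8)$$
We optimize $\pi_\theta(a_t \mid s_t)$ with PPO, the details of which are provided in Appendix A.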
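Since the agent's policy combines independent Gaussian distributions from the controller nodes, its log-probability is a sum of per-node Gaussian log-densities. The sketch below shows that computation; the node ids and values are hypothetical, and the per-node means and standard deviations are given directly instead of being produced by a GNN.

```python
import math

def gaussian_log_prob(action, mean, std):
    """Log-density of a scalar Gaussian N(mean, std^2) at `action`."""
    return -0.5 * math.log(2.0 * math.pi * std ** 2) - (action - mean) ** 2 / (2.0 * std ** 2)

def joint_log_prob(actions, means, stds):
    """Log-probability of the joint action as a sum over controller nodes.

    All arguments are dicts keyed by controller-node id. A product of
    independent per-node densities becomes a sum in log space.
    """
    return sum(gaussian_log_prob(actions[u], means[u], stds[u]) for u in actions)

# Hypothetical controller nodes 1 and 2 (e.g. two fins), unit variance.
means = {1: 0.0, 2: 0.5}
stds = {1: 1.0, 2: 1.0}
actions = {1: 0.0, 2: 0.5}  # each action sits exactly at its mean
log_pi = joint_log_prob(actions, means, stds)
```

Working in log space avoids numerical underflow when the number of controller nodes grows, and it is the quantity a policy-gradient method such as PPO needs for the probability ratio.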
3.3 Graph Mutation with Uncertainty
Between generations, the graphs evolve from parents to children. We allow the following basic operations as the mutation primitives on the parent’s graph $\mathcal{G}$:
$\mathcal{M}_1$, AddNode: In the $\mathcal{M}_1$ (AddNode) operation, a new body part is grown by sampling a node $v \in V$ from the parent and appending a new node $u$ to it. We randomly initialize $u$’s attributes from a uniform distribution over the attribute space.
$\mathcal{M}_2$, AddGraph: The $\mathcal{M}_2$ (AddGraph) operation allows for faster evolution by reusing subtrees in the graph that have good functionality. We sample a subgraph or leaf node $\mathcal{G}'$ from the current graph, and a placement node $u$ to which to append $\mathcal{G}'$. We randomly mirror the attributes of the root node in $\mathcal{G}'$ to incorporate a symmetry prior.
$\mathcal{M}_3$, DelGraph: The process of removing body parts is defined as the $\mathcal{M}_3$ (DelGraph) operation. In this operation, a subgraph $\mathcal{G}'$ is sampled from $\mathcal{G}$ and removed from $\mathcal{G}$.
$\mathcal{M}_4$, PertGraph: In the $\mathcal{M}_4$ (PertGraph) operation, we randomly sample a subgraph $\mathcal{G}'$ and recursively perturb the parameters of each node $u$ in $\mathcal{G}'$ by adding Gaussian noise to $\alpha(u)$.
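The four primitives above, together with a sampler that picks one of them at random, can be sketched on a simple tree stored as a dictionary. Everything here is an illustrative assumption: the `{"parent", "attr"}` node layout, attributes drawn from [-1, 1], mirroring implemented as sign negation, and the uniform operation probabilities.

```python
import random

def subtree(graph, root):
    """Return the ids of `root` and all of its descendants."""
    children = {}
    for node, data in graph.items():
        children.setdefault(data["parent"], []).append(node)
    ids = [root]
    for node in ids:  # the list grows while we iterate, giving a BFS
        ids.extend(children.get(node, []))
    return ids

def add_node(graph, rng, attr_dim=2):
    """AddNode: attach a fresh node with uniformly sampled attributes."""
    parent = rng.choice(sorted(graph))
    new_id = max(graph) + 1
    graph[new_id] = {"parent": parent, "attr": [rng.uniform(-1, 1) for _ in range(attr_dim)]}
    return new_id

def add_graph(graph, rng):
    """AddGraph: copy a sampled subtree onto a sampled placement node."""
    root = rng.choice(sorted(graph))
    place = rng.choice(sorted(graph))
    ids = subtree(graph, root)
    offset = max(graph) + 1
    remap = {old: offset + i for i, old in enumerate(ids)}
    mirror = rng.random() < 0.5  # symmetry prior on the copied root
    for old in ids:
        attr = list(graph[old]["attr"])
        if old == root:
            parent = place
            if mirror:
                attr = [-a for a in attr]
        else:
            parent = remap[graph[old]["parent"]]
        graph[remap[old]] = {"parent": parent, "attr": attr}
    return [remap[old] for old in ids]

def del_graph(graph, rng):
    """DelGraph: remove a sampled subtree, never the root of the body."""
    candidates = [n for n in sorted(graph) if graph[n]["parent"] is not None]
    if not candidates:
        return []
    removed = subtree(graph, rng.choice(candidates))
    for node in removed:
        del graph[node]
    return removed

def pert_graph(graph, rng, sigma=0.1):
    """PertGraph: add Gaussian noise to every attribute in a sampled subtree."""
    ids = subtree(graph, rng.choice(sorted(graph)))
    for node in ids:
        graph[node]["attr"] = [a + rng.gauss(0.0, sigma) for a in graph[node]["attr"]]
    return ids

def mutate(graph, rng, probs=(0.25, 0.25, 0.25, 0.25)):
    """Sample one primitive according to `probs` and apply it in place."""
    op = rng.choices([add_node, add_graph, del_graph, pert_graph], weights=probs)[0]
    op(graph, rng)
    return op.__name__

rng = random.Random(0)
fish = {0: {"parent": None, "attr": [0.0, 0.0]}}  # a lone torso node
add_node(fish, rng)
for _ in range(10):
    mutate(fish, rng)
```

Each primitive keeps the structure a valid tree: copied subtrees are re-rooted at the placement node, and deletions remove a node together with all of its descendants.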
We visualize a pair of example fish in Figure 1. The fish in the top-right is mutated from the fish in the top-left by applying $\mathcal{M}_1$. The new node (2) is colored magenta in the figure. To mutate each new candidate graph, we sample the operation $\mathcal{M}$ and apply it on $\mathcal{G}$ as
$$\mathcal{G}' = \mathcal{M}(\mathcal{G}), \quad \mathcal{M} \in \{\mathcal{M}_l,\ l = 1, \dots, 4\}, \quad P(\mathcal{M} = \mathcal{M}_l) = p_m^l \qquad (9)$$
$p_m^l$ is the probability of sampling each operation, with $\sum_l p_m^l = 1$.
To facilitate evolution, we want to avoid wasting computation resources on species with low expected fitness, while encouraging NGE to test species with high uncertainty. We again employ a GNN to predict the fitness of the graph $\mathcal{G}$, denoted as $\xi_P(\mathcal{G})$. The weights of this GNN are denoted as $\psi$. In particular, we predict the AF score with a propagation model similar to that of our policy network, but the observation feature is only $\zeta(\alpha(u))$, i.e., the embedding of the attributes. The output model is a graph-level output (as opposed to the node-level output used in our policy), regressing to the score $\xi$. After each generation, we train the regression model using the L2 loss.
However, pruning the species greedily may easily overfit the model to the existing species, since there is no modeling of uncertainty. We thus propose Graph Mutation with Uncertainty (GMUC), based on Thompson sampling, to balance exploration and exploitation. We denote the dataset of past species and their AF scores as $\mathcal{D} = \{(\mathcal{G}_k, \xi(\mathcal{G}_k, \theta_k))\}$. GMUC selects the best graph candidates by considering the posterior distribution of the surrogate $P(\text{Model} \mid \mathcal{D})$:
$$\mathcal{G}' = \arg\max_{\mathcal{G}}\ \mathbb{E}_{P(\text{Model} \mid \mathcal{D})}\left[\xi_P(\mathcal{G} \mid \text{Model})\right] \qquad (10)$$
Instead of sampling the full model with $\text{Model} \sim P(\text{Model} \mid \mathcal{D})$, we follow Gal & Ghahramani (2016) and perform dropout during inference, which can be viewed as approximate sampling from the model posterior. At the end of each generation, we randomly mutate a pool of new candidate species from the surviving species. We then sample a single dropout mask for the surrogate model and only keep the species with the highest predicted score $\xi_P$. The details of GMUC are given in Appendix F.
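The selection step can be illustrated with a deliberately simplified surrogate. The sketch below is not the GNN surrogate described above: it assumes a linear model over hand-made feature vectors, applies dropout to the input features, and uses hypothetical names throughout. It only shows the mechanism of drawing one dropout mask and ranking all candidates under that single sampled model.

```python
import random

def sample_dropout_mask(num_features, drop_prob, rng):
    """One Bernoulli keep/drop mask, scaled so the expected output is unchanged."""
    keep = 1.0 - drop_prob
    return [(1.0 / keep if rng.random() < keep else 0.0) for _ in range(num_features)]

def predict(weights, features, mask):
    """Linear surrogate with a fixed dropout mask applied to the features."""
    return sum(w * m * f for w, m, f in zip(weights, mask, features))

def select_candidates(candidates, weights, num_keep, drop_prob, rng):
    """Thompson-style selection: draw one mask, then keep the top scorers."""
    mask = sample_dropout_mask(len(weights), drop_prob, rng)  # one "sampled model"
    ranked = sorted(candidates, key=lambda feats: predict(weights, feats, mask), reverse=True)
    return ranked[:num_keep]

# Hypothetical surrogate weights and three candidate feature vectors.
weights = [1.0, 2.0, -1.0]
candidates = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
greedy = select_candidates(candidates, weights, 2, drop_prob=0.0, rng=random.Random(0))
explore = select_candidates(candidates, weights, 2, drop_prob=0.5, rng=random.Random(1))
```

With `drop_prob=0.0` the ranking is purely greedy. With a non-zero dropout rate, different masks can change which candidates rank highest, which is what lets candidates with uncertain predictions occasionally survive.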
3.4 Rapid Adaptation using Policy Sharing
To leverage the transferability of GNNs across different graphs, we propose Policy Sharing (PS) to reuse old weights from parent species. The weights of a species in NGE are as follows:
$$\theta_{\mathcal{G}} = (\theta_\Phi, \theta_\zeta, \theta_M, \theta_U, \theta_F) \qquad (11)$$
where $\theta_\Phi, \theta_\zeta, \theta_M, \theta_U, \theta_F$ are the weights for the models we defined earlier in Sections 3.2 and 2.2. Since our policy network is based on GNNs, as we can see from Figure 1, model weights of different graphs share the same cardinality (shape). A different graph will only alter the paths of message propagation. With PS, new species are provided with a strong weight initialization, and the evolution is less likely to be dominated by species that are more ancient in the genealogy tree.
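The point that a changed graph only alters the message paths, not the weight shapes, can be seen in a toy example. The sketch below is an assumption-laden simplification: the weights are four scalars with hypothetical names, and the policy returns only per-node means.

```python
import math

def gnn_policy_means(neighbors, observations, weights, num_steps=2):
    """Per-node action means computed from one shared set of scalar weights."""
    hidden = {u: 0.0 for u in neighbors}
    for _ in range(num_steps):
        incoming = {u: sum(weights["msg"] * hidden[v] for v in neighbors[u]) for u in neighbors}
        hidden = {
            u: math.tanh(weights["in"] * observations[u] + weights["rec"] * hidden[u] + incoming[u])
            for u in neighbors
        }
    return {u: weights["out"] * hidden[u] for u in neighbors}

parent_weights = {"in": 0.8, "msg": 0.5, "rec": 0.3, "out": 1.5}

# Parent: torso with two fins. Child: the same body plus one extra tail node.
parent_graph = {0: [1, 2], 1: [0], 2: [0]}
child_graph = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}

child_weights = dict(parent_weights)  # inherited unchanged, no reshaping needed
parent_means = gnn_policy_means(parent_graph, {u: 1.0 for u in parent_graph}, parent_weights)
child_means = gnn_policy_means(child_graph, {u: 1.0 for u in child_graph}, child_weights)
```

The same four weights drive a three-node parent and a four-node child. A flat fully-connected policy would instead need a differently sized input layer once the node count changes.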
Previous approaches including naive evolutionary structure search (ESS-Sims) (Sims, 1994) or random graph search (RGS) utilize a human-engineered one-layer neural network or a fully connected network, which cannot reuse controllers once the graph structure is changed, as the parameter space for $\theta$ might be different. And even when the parameters happen to be of the same shape, transfer learning with unstructured policy controllers is still hardly successful (Rajeswaran et al., 2017). Table 1 shows the parameter reuse between generations for an old species in one generation and its mutated species with a different topology, both in the baseline algorithms ESS-Sims and RGS and in NGE, together with the network initialization scheme used for fully-connected networks.
Algorithm  Mutation  Parameter Space  Policy Initialization
ESS-Sims, RGS
NGE
4 Experiments
In this section, we demonstrate the effectiveness of NGE on various evolution tasks. In particular, we evaluate the most challenging problem of searching for the optimal body structure from scratch in Section 4.1, and a simpler yet useful problem where we aim to optimize human-engineered species using NGE in Section 4.2. We also provide an ablation study on GMUC in Section 4.3, and an ablation study on computational cost or generation size in Section 4.4.
Our experiments are simulated with MuJoCo. We design the following environments to test the algorithms.
Fish Env: In the fish environment, the graph consists of ellipsoids. The reward is the swimming speed along a fixed direction. We use the human-engineered fish of (Tassa et al., 2018) as the reference graph.
Walker Env: We also define a 2D walker environment constructed from cylinders, where the goal is to move along a fixed direction as fast as possible. We use the human-engineered walker and cheetah of (Tassa et al., 2018) as reference graphs.
To validate the effectiveness of NGE, we compare against baselines that include previous approaches. We do a grid search on the hyperparameters as summarized in Appendix E, and show the averaged curve of each method. The baselines are introduced as follows:
ESS-Sims: This method was proposed in (Sims, 1994) and applied in (Cheney et al., 2014; Taylor, 2017), and has been the most classical and successful algorithm in automatic robotic design. In the original paper, the author uses an evolutionary strategy to train a human-engineered one-layer neural network, and randomly perturbs the graph after each generation. With the recent progress of robotics and reinforcement learning, we replace the network with a 3-layer multilayer perceptron and train it with PPO instead of an evolutionary strategy.
ESS-Sims-AF: In the original ESS-Sims, amortized fitness is not used. Although amortized fitness cannot be fully applied, it can be applied among species with the same topology. We name this variant ESS-Sims-AF.
ESS-GMUC: ESS-GMUC is a variant of ESS-Sims-AF that incorporates GMUC. The goal is to explore how GMUC affects the performance without the use of a structured model like a GNN.
ESS-BodyShare: We also want to answer the question of whether a GNN is indeed needed. We use both an unstructured model like an MLP, as well as a structured model obtained by removing the message propagation model.
RGS: In the Random Graph Search (RGS) baseline, a large number of graphs are generated randomly. RGS focuses on exploiting given structures, and does not utilize evolution to generate new graphs.
4.1 Evolution Topology Search
In this experiment, the task is to evolve the graph and the controller from scratch. For both fish and walker, species are initialized randomly. Computation cost is often a concern in structure search problems. In our comparison, for fairness, we allocate the same computation budget to all methods, which is approximately 12 hours on an EC2 m4.16xlarge cluster with 64 cores for one session. A grid search over the hyperparameters is performed (details in Appendix E). The averaged curves from different runs are shown in Figure 2. In both the fish and walker environments, NGE is the best model. We find that RGS is not able to search the graph space efficiently, even after evaluating a large number of different graphs. The performance of ESS-Sims grows faster in the earlier generations, but is significantly worse than our method in the end. The use of AF and GMUC on ESS-Sims improves the performance by a large margin, which indicates that the sub-modules in NGE are effective. Inspecting the generated species shows that ESS-Sims and its variants overfit to local species that dominate the remaining generations. The results of ESS-BodyShare indicate that the use of structured graph models without message passing might be insufficient in environments that require global features, for example, walker.
To better understand the evolution process, we visualize the genealogy tree of fish using our model in Figure 3. Our fish species gradually generates three fins with preferred attributes, with two side fins symmetrical about the fish torso, and one tail fin lying on the middle line. We obtain similar results for walker, as shown in Appendix C. To the best of our knowledge, our algorithm is the first to automatically discover kinematically plausible robotic graph structures.
4.2 Fine-tuning Species
Evolving every species from scratch is costly in practice. For many locomotion control tasks, we already have a decent human-engineered robot as a starting point. In the fine-tuning task, we verify the ability of NGE to improve upon the human-engineered design. We showcase both unconstrained experiments with NGE, where the graph is fine-tuned, and constrained fine-tuning experiments, where the topology of the graph is preserved and only the node attributes are fine-tuned. In the baseline models, the graph is fixed, and only the controllers are trained. We can see in Figure 4 that, given the same wall-clock time, it is better to co-evolve the attributes and controllers with NGE than to train only the controllers.
The figure shows that with NGE, the cheetah gradually transforms the forefoot into a claw, the 3D fish rotates the pose of the side fins and tail, and the 2D walker evolves bigger feet. In general, unconstrained fine-tuning with NGE leads to better performance, but does not necessarily preserve the initial structures.


4.3 Greedy Search vs. Exploration under Uncertainty
We also investigate the performance of NGE with and without Graph Mutation with Uncertainty, whose hyperparameters are summarized in Appendix E. In Figure 4(a), we apply GMUC to the evolution graph search task. With GMUC, the final performance exceeds that of the baseline on both the fish and walker environments. The proposed GMUC is able to better explore the graph space, showcasing its importance.
4.4 Computation Cost and Generation Size
We also investigate how the generation size affects the final performance of NGE. We note that as we increase the generation size and the computing resources, NGE achieves only marginal improvement on the simple Fish task. An NGE session on a 16-core m5.4xlarge ($0.768 per hour) AWS machine can achieve almost the same performance as one on a 64-core m4.16xlarge ($3.20 per hour) in the Fish environment in the same wall-clock time. However, we do notice that there is a trade-off between computational resources and performance for the more difficult task. In general, NGE is effective even when the computing resources are limited, and it significantly outperforms RGS and ES while using only a small generation size of 16.
5 Discussion
In this paper, we introduced NGE, an efficient graph search algorithm for automatic robot design that co-evolves the robot design graph and its controllers. NGE greatly reduces evaluation cost by transferring the learned GNN-based control policy from previous generations, and better explores the search space by incorporating model uncertainties. Our experiments show that the search over robotic body structures is challenging: both random graph search and evolutionary strategy fail to discover meaningful robot designs. NGE significantly outperforms the naive approaches in both the final performance and computation time by an order of magnitude, and is the first algorithm that can discover graphs similar to carefully hand-engineered designs. We believe this work is an important step towards automated robot design, and may prove useful for other graph search problems.
Acknowledgements Partially supported by Samsung and NSERC. We also thank NVIDIA for their donation of GPUs.
References
 Bansal et al. (2017) Trapit Bansal, Jakub Pachocki, Szymon Sidor, Ilya Sutskever, and Igor Mordatch. Emergent complexity via multi-agent competition. arXiv preprint arXiv:1710.03748, 2017.
 Bruna et al. (2014) Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. ICLR, 2014.
 Calandra et al. (2016) Roberto Calandra, André Seyfarth, Jan Peters, and Marc Peter Deisenroth. Bayesian optimization for learning gaits under uncertainty. Annals of Mathematics and Artificial Intelligence, 76(1-2):5–23, 2016.
 Cheney et al. (2014) Nick Cheney, Robert MacCurdy, Jeff Clune, and Hod Lipson. Unshackling evolution: Evolving soft robots with multiple materials and a powerful generative encoding. ACM SIGEVOlution, 7(1):11–23, 2014.
 Chua et al. (2018) Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. arXiv preprint arXiv:1805.12114, 2018.
 Chung et al. (2014) Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
 Defferrard et al. (2016) Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS, 2016.
 Drews et al. (2017) Paul Drews, Grady Williams, Brian Goldfain, Evangelos A Theodorou, and James M Rehg. Aggressive deep driving: Combining convolutional neural networks and model predictive control. In Conference on Robot Learning, pp. 133–142, 2017.
 Duff et al. (2001) David Duff, Mark Yim, and Kimon Roufas. Evolution of polybot: A modular reconfigurable robot. In Proc. of the Harmonic Drive Intl. Symposium, Nagano, Japan, 2001.
 Duvenaud et al. (2015) David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In NIPS, 2015.
 Gal & Ghahramani (2016) Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pp. 1050–1059, 2016.
 Heess et al. (2017) Nicolas Heess, Srinivasan Sriram, Jay Lemmon, Josh Merel, Greg Wayne, Yuval Tassa, Tom Erez, Ziyu Wang, Ali Eslami, Martin Riedmiller, et al. Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286, 2017.
 Henaff et al. (2015) Mikael Henaff, Joan Bruna, and Yann LeCun. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163, 2015.
 Joachimczak et al. (2014) Michał Joachimczak, Reiji Suzuki, and Takaya Arita. Fine grained artificial development for body-controller coevolution of soft-bodied animats. Artificial life, 14:239–246, 2014.
 Kipf & Welling (2017) Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. ICLR, 2017.
 Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.
 Langton (1997) Christopher G Langton. Artificial life: An overview. MIT Press, 1997.
 Lee (1998) Wei-Po Lee. An evolutionary system for automatic robot design. In Systems, Man, and Cybernetics, 1998. 1998 IEEE International Conference on, volume 4, pp. 3477–3482. IEEE, 1998.
 Li et al. (2015) Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015.
 Liu et al. (2017) Hanxiao Liu, Karen Simonyan, Oriol Vinyals, Chrisantha Fernando, and Koray Kavukcuoglu. Hierarchical representations for efficient architecture search. arXiv preprint arXiv:1711.00436, 2017.
 Mitchell & Forrest (1994) Melanie Mitchell and Stephanie Forrest. Genetic algorithms and artificial life. Artificial life, 1(3):267–289, 1994.
 Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
 Mnih et al. (2016) Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937, 2016.
 Nagabandi et al. (2017) Anusha Nagabandi, Gregory Kahn, Ronald S Fearing, and Sergey Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. arXiv preprint arXiv:1708.02596, 2017.
 Neri (2010) Ferrante Neri. Memetic compact differential evolution for cartesian robot control. IEEE Computational Intelligence Magazine, 5(2):54–65, 2010.
 Pham et al. (2018) Hieu Pham, Melody Y Guan, Barret Zoph, Quoc V Le, and Jeff Dean. Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268, 2018.
 Rajeswaran et al. (2017) Aravind Rajeswaran, Kendall Lowrey, Emanuel V Todorov, and Sham M Kakade. Towards generalization and simplicity in continuous control. In Advances in Neural Information Processing Systems, pp. 6553–6564, 2017.
 Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
 Sims (1994) Karl Sims. Evolving virtual creatures. In Proceedings of the 21st annual conference on Computer graphics and interactive techniques, pp. 15–22. ACM, 1994.
 Snoek et al. (2012) Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pp. 2951–2959, 2012.
 Stanley & Miikkulainen (2002) Kenneth O Stanley and Risto Miikkulainen. Evolving neural networks through augmenting topologies. Evolutionary computation, 10(2):99–127, 2002.
 Steels (1993) Luc Steels. The artificial life roots of artificial intelligence. Artificial life, 1(1-2):75–110, 1993.
 Tassa et al. (2012) Yuval Tassa, Tom Erez, and Emanuel Todorov. Synthesis and stabilization of complex behaviors through online trajectory optimization. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 4906–4913. IEEE, 2012.
 Tassa et al. (2018) Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. Deepmind control suite. arXiv preprint arXiv:1801.00690, 2018.
 Taylor (2017) Tim Taylor. Evolution in virtual worlds. arXiv preprint arXiv:1710.06055, 2017.
 Todorov et al. (2012) Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026–5033. IEEE, 2012.
 Wan et al. (2013) Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. Regularization of neural networks using dropconnect. In International Conference on Machine Learning, pp. 1058–1066, 2013.
 Wang et al. (2018) Tingwu Wang, Renjie Liao, Jimmy Ba, and Sanja Fidler. NerveNet: Learning structured policy with graph neural networks. In ICLR, 2018.
 Zoph & Le (2016) Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.
Appendix A Details of NerveNet++
Similar to NerveNet, we parse the agent into a graph, where each node corresponds to a physical body part of the agent. For example, the fish in Figure 1 can be parsed into a graph of five nodes, namely the torso (0), left-fin (1), right-fin (2), and tail-fin bodies (3, 4). Replacing the MLP policy with NerveNet gives a learned policy that is considerably more robust and transfers better. We propose minor but effective modifications to Wang et al. (2018), and refer to the resulting model as NerveNet++.
In the original NerveNet, several propagation steps must be performed at every timestep so that every node receives global information before producing the control signal. This is time- and memory-consuming, and the minimum number of propagation steps is constrained by the depth of the graph.
Since an episode usually lasts for several hundred timesteps, building the full backpropagation graph is computationally expensive and inefficient. Inspired by Mnih et al. (2016), we employ truncated graph backpropagation to optimize the policy. NerveNet++ is well suited to evolutionary search or population-based optimization, as it reduces both wall-clock time and memory usage.
Therefore, in NerveNet++ we propose a propagation model with a memory state, where each node updates its hidden state over time by absorbing the input feature and an incoming message. The number of propagation steps is no longer constrained by the depth of the graph, and during backpropagation we save memory and time by truncating the computation graph.
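The propagation scheme described above can be illustrated with a minimal NumPy sketch. It is not the NerveNet++ implementation: the edge list for the fish, the layer sizes, the tanh update rule, and all function names are assumptions made for illustration only. Each environment timestep performs a single propagation step, and the per-node hidden state carried across timesteps plays the role of the memory.

```python
import numpy as np

# Hypothetical fish graph: torso (0) linked to left-fin (1), right-fin (2)
# and the first tail-fin body (3); the second tail-fin body (4) hangs off (3).
EDGES = [(0, 1), (0, 2), (0, 3), (3, 4)]
NUM_NODES, OBS_DIM, HID_DIM = 5, 3, 8


def adjacency(num_nodes, edges):
    """Symmetric adjacency matrix of the body graph."""
    adj = np.zeros((num_nodes, num_nodes))
    for i, j in edges:
        adj[i, j] = adj[j, i] = 1.0
    return adj


def init_params(rng, scale=0.3):
    """Randomly initialised weights (illustrative sizes, shared by all nodes)."""
    return {
        "w_in": rng.normal(0.0, scale, (OBS_DIM, HID_DIM)),
        "w_msg": rng.normal(0.0, scale, (HID_DIM, HID_DIM)),
        "w_hid": rng.normal(0.0, scale, (HID_DIM, HID_DIM)),
        "w_out": rng.normal(0.0, scale, (HID_DIM, 1)),
    }


def step(params, adj, hidden, obs):
    """One environment timestep with a single propagation step."""
    messages = adj @ hidden  # sum of the neighbours' previous hidden states
    new_hidden = np.tanh(
        obs @ params["w_in"] + messages @ params["w_msg"] + hidden @ params["w_hid"]
    )
    action = np.tanh(new_hidden @ params["w_out"])[:, 0]  # one scalar per node
    return new_hidden, action


def rollout(params, adj, obs_seq):
    """Carry the hidden (memory) state across timesteps of one episode."""
    hidden = np.zeros((adj.shape[0], params["w_hid"].shape[0]))
    hiddens, actions = [], []
    for obs in obs_seq:
        hidden, action = step(params, adj, hidden, obs)
        hiddens.append(hidden)
        actions.append(action)
    return np.stack(hiddens), np.stack(actions)
```

In this sketch, an observation at the second tail-fin body influences the left fin only after four timesteps, because information travels one edge per timestep instead of being propagated through the whole graph at every timestep.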
The computational performance evaluation is provided in Appendix B. The NerveNet++ model is trained with the PPO algorithm (Schulman et al., 2017; Heess et al., 2017).
Appendix B Optimization with Truncated Backpropagation
During training, the agent generates the rollout data by sampling from the distribution and stores the training data of . To train the reinforcement learning agents with memory, the original training objective is
(12) 
where we denote the whole update model as and
(13) 
The memory state depends on the previous actions, observations, and states. The full backpropagation graph is therefore as long as the episode, which is very computationally intensive. The intuition of Mnih et al. (2016) is that, for RL agents, the dependency on timesteps that are far away from the current timestep is limited. Thus, truncating the backpropagation graph costs only a negligible loss in the accuracy of the gradient estimator. We define a backpropagation length , and optimize the following objective function instead:
(14)  
(15) 
Essentially this optimization means that we only backpropagate up to timesteps, namely at the places where , we treat the hidden state as input to the network and stop the gradient. To optimize the objective function, we follow the same optimization procedure as in Wang et al. (2018), a variant of PPO (Schulman et al., 2017) in which a surrogate loss is optimized. We refer the reader to these papers for algorithmic details.
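The effect of stopping the gradient at segment boundaries can be seen in a toy scalar example. The sketch below is an illustration under stated assumptions, not the paper's training code: the recurrence h_t = tanh(w h_{t-1} + x_t), the loss (the sum of all h_t), the name `kappa` for the backpropagation length, and the choice to cut the graph whenever t is a multiple of `kappa` are all hypothetical. For simplicity the gradient is accumulated in forward mode, which yields the same value that truncated backpropagation would return for this recurrence.

```python
import math


def total_loss(w, xs):
    """Sum of hidden states of the toy recurrence h_t = tanh(w * h_{t-1} + x_t)."""
    h, loss = 0.0, 0.0
    for x in xs:
        h = math.tanh(w * h + x)
        loss += h
    return loss


def truncated_gradient(w, xs, kappa):
    """d(loss)/dw when the graph is cut every `kappa` timesteps.

    `s` tracks dh_t/dw inside the current segment. Resetting it to zero at a
    segment boundary treats the incoming hidden state as a constant input,
    which is what stopping the gradient means.
    """
    h, s, grad = 0.0, 0.0, 0.0
    for t, x in enumerate(xs):
        if t % kappa == 0:
            s = 0.0  # stop gradient at the segment boundary
        h_new = math.tanh(w * h + x)
        # Chain rule: dh_t/dw = (1 - h_t^2) * (h_{t-1} + w * dh_{t-1}/dw)
        s = (1.0 - h_new ** 2) * (h + w * s)
        h = h_new
        grad += s
    return grad
```

With `kappa` equal to the episode length the function returns the full gradient, and smaller values of `kappa` drop the contributions that cross a segment boundary.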
Appendix C Full NGE Results
Similar to the fish genealogy tree, Fig. 8 shows the simple initial walking agent evolving into a cheetah-like structure that is able to run at high speed.
We also show the species generated by NGE, ESS-Sims (specifically ESS-Sims-AF, which performs best among all ESS-Sims variants), and RGS.
Appendix D Resetting Controller for Fair Competition
Although amortized fitness is a better estimate of the ground-truth fitness, it is still biased. Species that appear earlier in the experiment are trained for more updates if they survive. Intuitively, the same can happen in nature: species that appear earlier may dominate a generation by number, and new species may be eliminated even if they have better fitness. We therefore design an experiment in which we randomly reset the controller weights of all species, which forces the species to compete fairly. From Fig. 10, we notice that this method helps exploration and leads to a higher final reward. However, the algorithm usually takes longer to converge. For this reason, the results for the graph search task in Fig. 2 do not include controller resetting.
Appendix E Hyperparameters Searched
All methods are given an equal computation budget. More specifically, the total number of timesteps generated by all species over all generations is the same for all methods. For example, if we use
training epochs in one generation, each epoch with
sampled timesteps, then the computation budget allows NGE to evolve for 200 generations, where each generation has a species size of 64. For the NGE, RGS, and ESS-Sims-AF models in Fig. 11, we run a grid search over the hyperparameters recorded in Table 2 and Table 3, and plot the curve with the best results for each. Since the number of generations for the RGS baseline can be regarded as 1, its curve is plotted with the number of updates, normalized by the computation resource, as the x-axis.
Here we show detailed figures for six baselines: RGS-20, RGS-100, RGS-200, ESS-Sims-AF-20, ESS-Sims-AF-100, and ESS-Sims-AF-200. The number attached to each baseline name indicates the number of inner-loop policy training epochs. In the case of RGS-20, where more than 12800 different graphs are searched, the average reward is still very low. Increasing the number of inner-loop training epochs per species to 100 or 200 does not significantly improve the final performance.
To test the performance with and without GM-UC, we use 64-core clusters (generations of size 64). Here, each hyperparameter is set to the first value listed for it in Table 2 and Table 3.
Items  Values Tried 

Number of Iterations per Update  10, 20, 100, 200 
Number of Species per Generation  16, 32, 64, 100 
Elimination Rate  0.15, 0.20, 0.3 
Discrete Socket  Yes, True 
Timesteps per Update  2000, 4000, 6000 
Target KL  0.01 
Learning Rate Schedule  Adaptive 
Number of Maximum Generation  400 
Prob of AddNode, AddGraph  0.15 
Prob of PertGraph  0.15 
Prob of DelGraph  0.15 
Allow Mirroring Attrs in AddGraph  Yes, No 
Allow Resetting Controller  Yes, No 
Resetting Controller Freq  50, 100 
Items  Values Tried 

Allow GraphAdd  True, False 
Graph Mutation with Uncertainty  True, False 
Pruning Temperature  0.01, 0.1, 1 
Network Structure  NerveNet, NerveNet++ 
Number Candidates before Pruning  200, 400 
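As an illustration of the size of this search, the sketch below enumerates the values listed in the second table above. It assumes a full Cartesian grid, which the text does not state explicitly, and the dictionary keys are hypothetical names chosen for readability rather than identifiers from the authors' code.

```python
from itertools import product

# Values copied from the table above; the key names are hypothetical.
SEARCH_SPACE = {
    "allow_graph_add": [True, False],
    "graph_mutation_with_uncertainty": [True, False],
    "pruning_temperature": [0.01, 0.1, 1],
    "network_structure": ["NerveNet", "NerveNet++"],
    "num_candidates_before_pruning": [200, 400],
}


def enumerate_grid(space):
    """Return every combination of hyperparameter values as a list of dicts."""
    names = list(space)
    return [dict(zip(names, values)) for values in product(*(space[n] for n in names))]


GRID = enumerate_grid(SEARCH_SPACE)
# The first entry takes the first value listed for every hyperparameter.
DEFAULT_CONFIG = GRID[0]
```

Under this assumption the table yields 2 x 2 x 3 x 2 x 2 = 48 configurations, and `DEFAULT_CONFIG` corresponds to picking the first listed value of every item.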
Appendix F Model-based search using Thompson Sampling
Thompson Sampling is a simple heuristic search strategy that is typically applied to the multi-armed bandit problem. The main idea is to select each action with probability equal to the probability that it is optimal. When applied to the graph search problem, Thompson Sampling balances the trade-off between exploration and exploitation by maximizing the expected fitness under the posterior distribution of the surrogate model.
Formally, Thompson Sampling selects the best graph candidates at each round according to the expected estimated fitness using a surrogate model. The expectation is taken under the posterior distribution of the surrogate :
(16) 
F.1 Surrogate model on graphs
Here we consider a graph neural network (GNN) surrogate model to predict the average fitness of a graph as a Gaussian distribution, namely . We use a simple architecture that predicts the mean of the Gaussian from the last hidden layer activations, , of the GNN, where are the weights in the GNN up to the last hidden layer.
Greedy search.
We denote the size of the dataset as . The GNN weights are trained to predict the average fitness of the graph as a standard regression task:
(17) 
Thompson Sampling
In practice, Thompson Sampling is very similar to the previous greedy search algorithm. Instead of picking the top action according to the best model parameters, at each generation, it draws a sample of the model and takes a greedy action under the sampled model.
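The difference between the two selection rules can be sketched as follows. To keep the example self-contained, the GNN surrogate is swapped for a Bayesian linear regression over hypothetical fixed-length graph feature vectors, for which the posterior over weights is available in closed form. The function names, the feature representation, and the prior and noise variances are all assumptions for illustration.

```python
import numpy as np


def fit_posterior(features, fitness, noise_var=1.0, prior_var=1.0):
    """Gaussian posterior over the weights of a Bayesian linear surrogate.

    features: (N, d) array of hypothetical graph features.
    fitness:  (N,) array of observed average fitness values.
    """
    dim = features.shape[1]
    precision = features.T @ features / noise_var + np.eye(dim) / prior_var
    cov = np.linalg.inv(precision)
    mean = cov @ features.T @ fitness / noise_var
    return mean, cov


def greedy_select(mean, candidates, k):
    """Pick the top-k candidates under the single best (posterior mean) model."""
    scores = candidates @ mean
    return np.argsort(-scores)[:k]


def thompson_select(mean, cov, candidates, k, rng):
    """Draw one model from the posterior, then act greedily under that sample."""
    sampled_weights = rng.multivariate_normal(mean, cov)
    scores = candidates @ sampled_weights
    return np.argsort(-scores)[:k]
```

Calling `thompson_select` once per generation with a fresh draw gives different rankings when the posterior is wide (exploration) and converges to the greedy ranking as the posterior concentrates (exploitation).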
Approximating Thompson Sampling using Dropout
Performing dropout during inference can be viewed as approximately sampling from the model posterior. At each generation, we sample a single dropout mask for the surrogate model and rank all the proposed graphs accordingly.
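A minimal sketch of this ranking step is given below. It stands in a one-hidden-layer MLP over hypothetical graph feature vectors for the GNN surrogate, and the parameter names, keep probability, and inverted-dropout scaling are assumptions. The point it illustrates is that one mask is drawn per generation and shared by every candidate, so all candidates are scored by the same sampled model.

```python
import numpy as np


def sample_dropout_mask(hidden_dim, keep_prob, rng):
    """One Bernoulli mask, rescaled so the expected activation is unchanged."""
    return (rng.random(hidden_dim) < keep_prob) / keep_prob


def surrogate_scores(params, candidates, mask):
    """Predicted fitness of each candidate under a fixed dropout mask."""
    hidden = np.maximum(candidates @ params["w1"] + params["b1"], 0.0)  # ReLU
    return (hidden * mask) @ params["w2"]


def rank_candidates(params, candidates, keep_prob, rng):
    """Sample a single mask for this generation and rank all candidates with it."""
    mask = sample_dropout_mask(params["w1"].shape[1], keep_prob, rng)
    scores = surrogate_scores(params, candidates, mask)
    order = np.argsort(-scores)  # best candidate first
    return order, scores
```

Sharing the mask matters: drawing a new mask per candidate would add independent noise to each score instead of ranking all candidates under one sampled surrogate.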