Evolving Inborn Knowledge For Fast Adaptation in Dynamic POMDP Problems

04/27/2020 · Eseoghene Ben-Iwhiwhu et al. · Loughborough University and HRL Laboratories, LLC

Rapid online adaptation to changing tasks is an important problem in machine learning and, recently, a focus of meta-reinforcement learning. However, reinforcement learning (RL) algorithms struggle in POMDP environments because the state of the system, essential in an RL framework, is not always visible. Additionally, hand-designed meta-RL architectures may not include suitable computational structures for specific learning problems. The evolution of online learning mechanisms, on the contrary, has the ability to incorporate learning strategies into an agent that can (i) evolve memory when required and (ii) optimize adaptation speed to specific online learning problems. In this paper, we exploit the highly adaptive nature of neuromodulated neural networks to evolve a controller that uses the latent space of an autoencoder in a POMDP. The analysis of the evolved networks reveals the ability of the proposed algorithm to acquire inborn knowledge in a variety of aspects such as the detection of cues that reveal implicit rewards, and the ability to evolve location neurons that help with navigation. The integration of inborn knowledge and online plasticity enabled fast adaptation and better performance in comparison to some non-evolutionary meta-reinforcement learning algorithms. The algorithm also proved successful in the 3D gaming environment Malmo Minecraft.




1. Introduction

The field of deep reinforcement learning (RL) has showcased impressive results in recent times, solving tasks in robotic control (Duan et al., 2016a; Lillicrap et al., 2015), games (Mnih et al., 2015) and other complex environments. Despite such successes, deep RL algorithms are sample inefficient and sometimes unstable. Furthermore, they usually perform sub-optimally when dealing with sparse-reward and partially observable environments. A further limitation of deep RL arises when rapid adaptation to changing tasks (dynamic goals) is required: established methods only work well in fixed-task environments. In an attempt to solve this problem, deep meta-reinforcement learning (meta-RL) methods (Finn et al., 2017; Rothfuss et al., 2019; Zintgraf et al., 2019; Duan et al., 2016b; Wang et al., 2016) were specifically devised. However, these methods are largely evaluated on dense-reward, fully observable MDP environments, and perform sub-optimally in sparse-reward, partially observable environments.

One key aspect of achieving fast adaptation in dynamic partially observable environments is the presence of appropriate learning structures and memory units that fit the specific class of learning problems. Standard model-free RL algorithms do not perform well in dynamic environments because they are tabula-rasa systems: they hold no knowledge in their architectures to allow fast and targeted learning when a change in the environment occurs. Upon a task change, these algorithms will randomly explore the action space to relearn a different, new policy from scratch. On the other hand, model-based RL holds knowledge of the structure of the environment, which in turn allows for rapid adaptation to changes in the environment, but such knowledge needs to be built manually into the system.

In this paper, we investigate the use of neuroevolution to autonomously evolve inborn knowledge (Soltoggio et al., 2018) in the form of neural structures and plasticity rules with a specific focus on dynamic POMDPs that have posed challenges to current RL approaches. The neuroevolutionary approach that we propose is designed to solve rapid adaptation to changing tasks (Soltoggio et al., 2018) in complex high dimensional partially observable environments. The idea is to test the ability of evolution to build an unconstrained neuromodulated network architecture with problem-specific learning skills that can exploit the latent space provided by an autoencoder. Thus, in the proposed system, an autoencoder serves as a feature extractor that produces low dimensional latent features from high dimensional environment observations. A neuromodulated network (Soltoggio et al., 2008) receives the low dimensional latent features as input and produces the output of the system, effectively acting as high level controller. Evolved neuromodulated networks have shown computational advantages in various dynamic task scenarios (Soltoggio et al., 2008, 2018).

The proposed approach is similar to that of (Alvernaz and Togelius, 2017). One key novelty is that our approach seeks to evolve selective plasticity with the use of modulatory neurons, and therefore to evolve problem-specific neuromodulated adaptive systems. The relationship between image-pixel inputs and control actions in POMDPs is highly nonlinear and history dependent; therefore, an open question is whether neuroevolution can exploit latent features to evolve learning systems with inborn knowledge. Thus, we test the hypothesis that an evolved neuromodulated network can discover neural structures and their related plasticity rules to encode the required memory and fast adaptation mechanisms, and thereby compete with current deep meta-RL approaches.

We call the proposed system a Plastic Evolved Neuromodulated Network with Autoencoder (PENN-A), denoting the combination of the two neural components. We evaluate the proposed method in a POMDP environment, where we show better performance in comparison to some non-evolutionary deep meta-reinforcement learning methods. We also evaluate it in the Malmo Minecraft environment to test its general applicability.

Two interesting findings from our experiments are that (i) the networks acquire through evolution the ability to recognise reward cues (i.e. environment cues that are associated with survival even when reward signals are not given) and (ii) the networks can evolve location neurons that help solve the problem by detecting, and becoming active at, specific locations of the partially observable MDP. The evolved network topology allows for richer dynamics in comparison to fixed architectures such as hand-designed feed-forward or recurrent networks.

The next section reviews the related work. Following that, a formal task definition is presented. Next is the description of the proposed method employed in this work, followed by the evaluation of results. The PENN-A source code is made available at: https://github.com/dlpbc/penn-a.

2. Related Work

In reinforcement learning (RL) literature, meta-RL methods seek to develop agents that adapt to changing tasks in an environment or a set of related environments. Meta-RL (Schmidhuber et al., 1996; Schweighofer and Doya, 2003) is based on the general idea of meta-learning (Bengio et al., 1992; Thrun and Pratt, 1998; Hochreiter et al., 2001) applied to the RL domain.

Recently, deep meta-RL has been used to tackle the problem of rapid adaptation in dynamic environments. Methods such as (Finn et al., 2017; Duan et al., 2016b; Wang et al., 2016; Zintgraf et al., 2019; Mishra et al., 2018; Rothfuss et al., 2019; Rakelly et al., 2019) use deep RL methods to train a meta-learner agent that adapts to changing tasks. These methods are mostly evaluated in dense-reward, fully observable MDP environments. Furthermore, most methods are either memory-based (Duan et al., 2016b; Wang et al., 2016; Mishra et al., 2018) or optimization-based (Finn et al., 2017; Zintgraf et al., 2019). Optimization-based methods seek to find an optimal initial set of parameters (e.g. for an agent network) across tasks, which can be fine-tuned with a few gradient steps for each specific task presented to the agent. Therefore, a small amount of re-training is required to adapt to every change in task. Memory-based methods (implemented using a recurrent network or a temporal convolution attention network) do not necessarily require fine-tuning after initial training to enable adaptation, because memory-based agents learn to build a memory of the past sequence of tasks and interactions, enabling them to identify a change in task and adapt accordingly.

In the past, neuroevolution methods have been employed to solve RL tasks (Stanley and Miikkulainen, 2002; McHale and Husbands, 2004), including adapting to changing tasks (Soltoggio et al., 2008; Blynel and Floreano, 2002) in partially observable environments. These methods were evaluated in environments with high level feature observations. Recently, several approaches have been introduced that combine deep neural networks and neuroevolution to tackle high dimensional deep RL tasks (Alvernaz and Togelius, 2017; Poulsen et al., 2017; Ha and Schmidhuber, 2018; Salimans et al., 2017; Such et al., 2017). These approaches can be divided into two major categories. The first category uses neuroevolution to optimize the entire deep network end to end (Salimans et al., 2017; Such et al., 2017; Risi and Stanley, 2019a, b). The second category splits the network into parts (for example, a body and controller) where some part(s) (e.g. body) are optimized using gradient based methods and other part(s) (e.g. controller) are evolved using neuroevolution methods (Alvernaz and Togelius, 2017; Poulsen et al., 2017; Ha and Schmidhuber, 2018). Current deep neuroevolution methods are usually evaluated in fully observable MDP environments, where the task is fixed. Furthermore, after the training phase is completed, the weights of a trained network are fixed (the same is true for standard deep RL). The recent attention to neuroevolution for deep RL aims to present such approaches as a competitive alternative to standard gradient based deep RL methods for fixed task problems.

In the past, neural network based agents employing Hebbian-based local synaptic plasticity have been used to achieve behavioural adaptation with changing tasks (Floreano and Mondada, 1996; Blynel and Floreano, 2002; Soltoggio et al., 2008). Such methods use a neuroevolution algorithm to optimize the parameters of the network when producing a new generation of agents. As an agent interacts with an environment during its lifetime in training or testing, the weights are adjusted in an online fashion (via a local plasticity rule), enabling adaptation to changing tasks. In (Floreano and Mondada, 1996; Blynel and Floreano, 2002) this technique was employed, and further extended to include a mechanism of gating plasticity via neuromodulation in (Soltoggio et al., 2008). These methods were evaluated in environments with low dimensional observations (with high level features) and not compared with deep (meta-)RL algorithms.

3. Task Definition

A POMDP environment E, defined by the sextuple (S, A, T, R, Ω, O), is employed in this work. S defines the state set, A the action set, T the environment dynamics, R the reward function, Ω the observation set, and O the function that maps observations to states.

The environment contains a number of related tasks. A task T is sampled from a distribution of tasks D, which can be either discrete or continuous. A sampled task T is an instance of the partially observable environment E. The configuration of the environment (for example, the goal or reward function) varies across task instances. An optimal agent is required to adapt its behaviour to task changes in the environment (and maximize accumulated reward) from only a few interactions with the environment. When presented with a task T, an optimal agent should initially explore, and subsequently exploit once the task is understood. When the task is changed (a new task is sampled from D), the agent needs to re-explore the environment in a few shots, and then start exploiting again once the new task has been understood.

In each task, an episode is defined as the trajectory of an agent's interactions with the environment, terminating at a terminal state. A trial consists of two or more tasks sampled from D. The total number of episodes in a trial is kept fixed. A trial starts with an initial task that runs for a number of episodes, and then the task is changed to other tasks (one after another) at different points within the trial (see Figure 1). The points at which a task change occurs are stochastically generated, and the task is changed before the start of the next episode. For example, when the number of tasks is set to 2 (i.e., tasks T1 and T2), the trial starts with task T1, which runs for a number of episodes and is then replaced by task T2 for the remaining episodes in the trial. An agent is iteratively trained, with each iteration consisting of a fixed number of trials. The subsections below describe the two environments in which the proposed system is evaluated.
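The trial structure described above can be sketched as follows (a minimal illustration with hypothetical environment and agent interfaces; the exact setup is in the released code):

```python
import random

def run_trial(agent, env, task_distribution, num_episodes=100, num_tasks=2):
    """Run one trial: a fixed number of episodes over `num_tasks` tasks,
    with stochastically chosen task-change points."""
    # Pick the episode indices at which the task changes (e.g. between
    # episodes 35 and 65 for a two-task trial, as in Section 5.1).
    change_points = sorted(random.sample(range(35, 66), num_tasks - 1))
    task = task_distribution.sample()
    env.set_task(task)
    total_reward = 0.0
    for episode in range(num_episodes):
        if episode in change_points:          # task changes before episode start
            task = task_distribution.sample()
            env.set_task(task)
        obs, done = env.reset(), False
        while not done:                       # one episode, ends at terminal state
            action = agent.act(obs)
            obs, reward, done = env.step(action)
            total_reward += reward
    return total_reward
```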

Figure 1. Illustration of a dynamic environment and required behavior of a learning agent. An agent is required to learn to perform optimally and then exploit the learned policy until a change in the environment occurs, at which point the agent needs to learn again before exploiting.
Figure 2. Environments (note, during execution, the goal location is dynamic across episodes). (A) CT-graph instance, b = 2 and d = 2. (B) CT-graph instance, b = 2 and d = 3. (C) Malmo Minecraft instance (a double T-maze), bird's eye view on top, with some sample observations at the bottom. The maze end with the teal colour is the goal location.

3.1. The Configurable Tree Graph Environment

The configurable tree graph (CT-graph) environment is a graph abstraction of a decision-making process. The complexity of the environment is specified via configuration parameters: a branching factor b and a depth d, controlling the width and height of the graph. Additionally, it can be configured to be fully or partially observable. It contains the following types of state: start, wait, decision, end (a leaf node of the graph) and crash. Each observation is a grey-scale image. The total number of end states grows exponentially as the depth of the graph increases (see Figures 2A and B).

In the experiments in this study, partial observability is configured by mapping all wait states to the same observation, and all decision states to the same observation. Also, b is set to 2; therefore, each decision state has two choices, splitting into two sub-graphs. The action space is discrete, consisting of three actions: choice 1, choice 2 and wait. The wait action is the correct action in a wait state; in a decision state, the correct action is one of choice 1 or choice 2. All incorrect actions lead to the crash state and episode termination.

An agent starts an episode in the start state, and the episode is completed when the agent traverses the graph to an end state or takes a wrong action in a state. Once an agent transitions from one state to the next, it cannot go back. In a task instance, one of the end states is set as the goal location. An agent receives a positive reward when it traverses to the goal location, and reward of 0 at other non-goal states. The agent may receive a negative reward in a crash state.
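As a rough illustration of these dynamics, a minimal abstraction of a depth-2, b = 2 CT-graph transition function might look like the sketch below (illustrative state and action names; the real environment emits image observations and has configurable wait states, which are omitted here):

```python
def ct_step(state, action, goal):
    """Minimal depth-2, b=2 CT-graph sketch.
    States: 'start', 'd1' (first decision), ('d2', i) (second decision after
    choice i), end states ('end', i, j), and 'crash'.
    Actions: 0 = wait, 1/2 = choices. Returns (next_state, reward, done)."""
    if state == 'start':
        # the wait action is the only correct action outside decision states
        return ('d1', 0.0, False) if action == 0 else ('crash', 0.0, True)
    if state == 'd1':
        # a choice is required in a decision state; waiting leads to a crash
        if action in (1, 2):
            return (('d2', action), 0.0, False)
        return ('crash', 0.0, True)
    if isinstance(state, tuple) and state[0] == 'd2':
        if action in (1, 2):
            end = ('end', state[1], action)
            # positive reward only at the goal end state
            return (end, 1.0 if end == goal else 0.0, True)
        return ('crash', 0.0, True)
    raise ValueError('step called on a terminal state')
```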

3.2. Malmo Minecraft Environment

Malmo (Johnson et al., 2016) is an AI research platform built on top of the Minecraft video game. The platform is configurable, and it enables the construction of various worlds in which AI agents can be evaluated. In this work, a double T-maze was constructed, with a discrete action space of left turn, right turn and forward. A task is defined based on the maze ends, requiring the agent to navigate to a specific maze end (the goal location). The maze end that is set as the goal location varies across tasks. The agent only receives a positive reward when it navigates to the maze end that is the goal location; it receives a reward of 0 at every other time step. If the agent runs into a wall, the episode is terminated and the agent receives a negative reward. The agent receives a visual observation of its current view at each time step (hence it does not fully observe the entire environment). Each observation is an RGB image based on the first-person view of the agent.

4. Methods

Figure 3. System overview, showcasing the feature extractor and controller components. In the controller, white and blue nodes are standard and modulatory neurons respectively. Modulatory connections facilitate selective plasticity in the network.

We seek to develop an agent capable of continual adaptation throughout its lifetime (across episodes): exploring, exploiting, re-exploring when the task changes, and exploiting again. The system (specifically the controller, or decision maker) is evolved to acquire knowledge about both the invariant and variant aspects of an environment (e.g. changing tasks).

The agent is modelled using two neural components with separate parameters and objectives: a deep network (used as a feature extractor and parameterized by θ) and a neuromodulated network (serving as a controller and parameterized by φ). Both components make up the overall system model. See Figure 3 for a general system overview. The presented architectural style is similar to a standard deep RL setup. However, it differs on two fronts: (i) the controller is a neuromodulated network (described in Section 4.2) rather than a standard neural network; (ii) the training setup combines a gradient-based optimization method (Werbos, 1982; Rumelhart et al., 1988), a gradient-free optimization method (neuroevolution (Yao, 1999; Stanley et al., 2019)), and Hebbian-based synaptic plasticity. Using this setup, each neural component therefore has its own objective function. An autoencoder network was employed as the feature extractor, thus enabling the use of a Mean Squared Error (MSE) or Binary Cross Entropy (BCE) objective function, e.g. for MSE:

L(θ) = (1/N) Σ_{i=1..N} ||o_i − ô_i||²    (1)

where N is the number of training observations and ô_i is the output of the autoencoder for observation o_i (the reconstructed observation). Each agent in the population uses the same feature extractor. The fitness function of the evolutionary algorithm is given by:

F(φ) = Σ_{T_i ∼ D} Σ_{e=1..z_i} R(τ_e)

T_i represents a task sampled from the task distribution D, and a single trial consists of two tasks as defined in Section 3. Also, z_i is the number of episodes in which a task is kept fixed within a trial; it is stochastically generated and may differ between tasks in a trial within an interval. R(τ_e) is the accumulated reward of the trajectory τ_e of an episode e, defined as:

R(τ_e) = Σ_{t=0..H} r(s_t, a_t),    with actions a_t produced by the controller from the latent features E_θ(o_t)

where r is the reward function that takes state s_t and action a_t as arguments and produces a scalar reward value. E_θ is the same autoencoder feature extractor network described earlier, with the notation denoting that we only want the output from the encoder (the latent features). Also, t represents discrete time steps and H is the length of the trajectory of an episode.
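A minimal sketch of this fitness evaluation loop, with hypothetical stand-ins for the encoder and controller (in the actual system these are the autoencoder's encoder and the neuromodulated network):

```python
def evaluate(controller, encoder, env, episodes):
    """Accumulate reward over all episodes of a trial; the sum is the
    genome's fitness. The reward is used only here for selection - it is
    never fed to the network itself (see Section 4.2)."""
    total = 0.0
    for _ in range(episodes):
        obs, done = env.reset(), False
        while not done:
            latent = encoder(obs)        # encoder output only; decoder unused
            action = controller(latent)  # neuromodulated controller decides
            obs, reward, done = env.step(action)
            total += reward
    return total
```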

4.1. Feature Extractor

This neural component of the system is tasked with learning a good latent representation of the observations from the environment, which can be fed to the controller as input. In the CT-graph experiments, a fully connected autoencoder was employed (a two-layer encoder and decoder respectively). In the Malmo Minecraft experiments, a convolutional autoencoder was employed (a four-layer encoder and decoder respectively).

4.2. Control Network (Decision Maker)

This neural component takes the latent features of the feature extractor as its input, and produces the output that serves as the final output of the system (the action or behaviour of the system). It is a neuromodulated network (see Section 4.2.1) that reproduces the model introduced in (Soltoggio et al., 2008). The network can evolve two neuron types: standard and modulatory neurons. The output neuron(s) always belong to the standard type.

The control network is parameterized by φ. Unlike θ (which represents only the weights of the feature extractor network), φ consists of the weights, the architecture and the coefficients of the Hebbian-based plasticity rule (described in Section 4.2.2) of the network, and it is evolved. Therefore, evolution is tasked with finding the architecture and plasticity rules, including the selective plasticity enabled by modulatory neurons targeting specific neurons. The large search space granted to evolution allows for rich dynamics that include memory in the form of both recurrent connections and the temporary values of rapidly changing modulated weights.
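To make the search space concrete, the evolved parameter set φ might be represented roughly as follows (a hypothetical sketch; the actual encoding in the released code may differ):

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass
class Genome:
    """Illustrative contents of an evolved controller genome."""
    # connection weights, keyed by (pre_neuron_id, post_neuron_id)
    weights: Dict[Tuple[int, int], float] = field(default_factory=dict)
    # neuron type per id: 'standard' or 'modulatory' (outputs are standard)
    neuron_types: Dict[int, str] = field(default_factory=dict)
    # coefficients of the Hebbian plasticity rule and its learning rate
    plasticity_coeffs: Tuple[float, float, float, float] = (0.0, 0.0, 0.0, 0.0)
    learning_rate: float = 0.1
```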

The agent is never fed the reward signal explicitly. The reward signal is only used by the evolutionary process for fitness evaluation, which in turn drives the selection process. Therefore, the network is tasked with discovering reward cues implicitly from the visual observations in the environment.

4.2.1. Neuromodulated Network Dynamics

Though processing is distributed across neurons, a standard neural network usually contains one type of neuron, with dynamics that are homogeneous across the network. In a neuromodulated network there can be two types of neurons, each type having different dynamics - thus heterogeneous. The two types are standard neurons and modulatory neurons (Soltoggio et al., 2008). Standard neurons have the same dynamics as those in a standard neural network. Modulatory neurons are used to dynamically regulate plasticity in the network.

Each neuron i has one standard activation value a_i and one modulatory activation value m_i, representing the weighted amounts of standard and modulatory activity it receives from other neurons (Equations 2 and 3):

a_i = Σ_{j ∈ Std} w_{ji} · o_j    (2)
m_i = Σ_{j ∈ Mod} w_{ji} · o_j    (3)

where Std and Mod are the sets of standard and modulatory neurons connected to neuron i, and o_i = tanh(a_i / 2) is the output signal of neuron i that is propagated to other neurons via its outgoing connections (this is true for both standard and modulatory neurons). m_i is used internally by the neuron itself to regulate the Hebbian-based plasticity of the incoming connections from other standard neurons, as described in Section 4.2.2. The framework allows for selective plasticity in the network, as parts of the network may become plastic or not plastic depending on the change of the modulatory activation signals over time. In turn, the final action of the network is affected in the current and future time steps - thus enabling adaptation.
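The per-neuron computation described here can be sketched as follows (the tanh squashing with the factor of 1/2 is assumed from the model of (Soltoggio et al., 2008) that the controller reproduces; variable names are illustrative):

```python
import math

def neuron_step(incoming, outputs, neuron_types):
    """Compute the standard activation a, modulatory activation m and
    output signal o for one neuron.
    incoming: list of (pre_id, weight) pairs for this neuron's inputs.
    outputs: dict pre_id -> current output signal o_j of each pre-neuron.
    neuron_types: dict pre_id -> 'standard' or 'modulatory'."""
    a = sum(w * outputs[j] for j, w in incoming
            if neuron_types[j] == 'standard')    # standard activity sum
    m = sum(w * outputs[j] for j, w in incoming
            if neuron_types[j] == 'modulatory')  # modulatory activity sum
    o = math.tanh(a / 2.0)  # output propagated along outgoing connections
    return a, m, o
```

Only `o` travels to downstream neurons; `m` stays internal and gates the plasticity of the neuron's incoming connections.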


4.2.2. Neuromodulated Hebbian Plasticity

The Hebbian synaptic plasticity of the control network is governed by Equations 4, 5 and 6:

δ_{ji} = A · o_j · o_i + B · o_j + C · o_i + D    (4)
Δw_{ji} = tanh(m_i / 2) · η · δ_{ji}    (5)
w_{ji}(t + 1) = w_{ji}(t) + Δw_{ji}    (6)

A, B, C, D and the learning rate η are the coefficients of the plasticity rule. The update of a weight w_{ji} depends on the pre-synaptic and post-synaptic standard activations (via o_j and o_i), the plasticity coefficients, and the post-synaptic modulatory activation m_i. This is true for all weights in the neuromodulated network.
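A sketch of this modulated update, following the rule of (Soltoggio et al., 2008) that the control network reproduces (coefficient names A, B, C, D and learning rate eta are assumptions based on that reference):

```python
import math

def hebbian_update(w, o_pre, o_post, m_post, A, B, C, D, eta):
    """Neuromodulated Hebbian update of one weight.
    The raw Hebbian term depends on pre- and post-synaptic outputs;
    the post-synaptic modulatory activation m_post gates the update,
    so connections into unmodulated neurons stay effectively frozen."""
    delta = eta * (A * o_pre * o_post + B * o_pre + C * o_post + D)
    return w + math.tanh(m_post / 2.0) * delta
```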


5. Results and Analysis

Figures 4 and 6 show the results of the experiments in the CT-graph environment. In addition, Figure 7 shows the results obtained in the Malmo Minecraft environment, evaluating the general applicability of PENN-A.

5.1. Performance in CT-graph Environments

The proposed method (PENN-A) was evaluated on depth 2 and depth 3 CT-graph environments, with a branching factor of 2. The controller was evolved for 200 generations, with populations of 600 and 800 for the depth 2 and depth 3 experiments respectively. Tournament selection with a segment size of 5 was employed. Each controller was evaluated for 4 trials, with 100 episodes and 2 tasks per trial. The initial task is changed between episodes 35 and 65, determined stochastically for each trial. The depth 2 CT-graph experiment was employed as a baseline, and we compared PENN-A against some recent deep meta-RL methods (each with its own experimental setup). The depth 3 CT-graph experiment was employed to evaluate PENN-A in a more complex configuration of the environment.

In order to ensure comparability of the results presented across all methods, the number of evaluations (horizontal axis) was scaled to the approximate equivalent number of episodes. Additionally, the vertical axis is the average accumulated reward across all trials and episodes. In the depth 2 CT-graph results (Figure 4), we see that PENN-A performs optimally compared to the deep meta-RL methods: optimization-based (MAML (Finn et al., 2017) and CAVIA (Zintgraf et al., 2019)) and memory-based (RL² (Duan et al., 2016b) without extra input). Only the observations were fed as input to the neural network for all methods, including PENN-A. We hypothesize that the deep meta-RL methods perform sub-optimally due to the partial observability of the environment. When the extra inputs (the reward, the previous time step's action and the done state) are concatenated to the observation and fed to RL² (which is the vanilla RL² setup), it is able to perform optimally (see Figure 5). We hypothesize that RL² exploits the actions fed as input to the network, ignoring the observations and other parts of the input. This reduces the problem complexity in comparison to conditions where only the observations are fed as input.

Figure 6 presents results for a depth 3 CT-graph. We present results only for PENN-A in the depth 3 CT-graph (a more difficult problem than the depth 2 CT-graph), since the other methods already performed sub-optimally in the depth 2 CT-graph. We again observe PENN-A performing optimally in this more difficult setting.

Figure 4. Results for a CT-graph with depth 2. PENN-A is compared against non-evolutionary meta-RL methods.
Figure 5. RL² in the CT-graph with depth 2. The method is run with extra input to the network (reward, done state, and the previous time step's action concatenated with the current observation to form the input).
Figure 6. PENN-A performance in a CT-graph with depth 3.
Figure 7. Malmo Minecraft result.

5.1.1. Network Analysis

Figure 8. Absolute activation values distribution (across trials and episodes) per time step of a sample evolved controller. (A) This neuron is active specifically at decision states (steps 3 and 5), while it remains low at wait states. (B) This neuron clearly identifies wait states (steps 2, 4 and 6) and remains inactive otherwise.
Figure 9. Distributions of the activation values for each neuron in a sample network when the goal location (reward) is found and when it is not. The neurons highlighted with green bounding boxes react differently to the presence or absence of reward cues in the observation. (A) Heat maps of grey-scale CT-graph observations. The top image is the observation presented when the goal location is found, with a bright square reward cue. The bottom image is the observation when the goal location is not found, with the reward cue absent. (B) Neurons 11 and 13 show complementary firing patterns based on reward cues. (C) Neurons 1, 2, 5 and 6 are active when a reward cue is observed, and show little or no activity when it is not.

To better understand the evolved solution and how the network implements policies, we analyzed the best performing networks after evolution in a depth 2 CT-graph environment. While different evolutionary runs produced highly different networks, we observed interesting patterns in the neural activations. For one network of 11 neurons (including the output neuron), the absolute activation value distribution (across trials and episodes per time step) is plotted for each neuron in Figure 8. We see that the absolute activation distributions of some neurons are high at specific time steps, i.e., at specific points within the graph environment (see Figures 8A and B), and therefore these neurons function as location neurons. Such location neurons have previously been discovered in an evolutionary setting in (Floreano and Keller, 2010). In the current experiments, it is worth noting that the location neurons were designed by evolution to exploit latent features and possibly help action selection in a high-dimensional dynamic POMDP. In particular, the neuron in Figure 8A is active at decision states, while the neuron in Figure 8B is active at wait states.

One aspect of our experimental setting is that the reward signal is not fed to the network; instead, the environment provides reward cues embedded in the observations, as shown in Figure 9A, where a bright square represents a reward. The actual reward value is only accumulated in the fitness function, and is therefore not explicitly visible to the network. The surprising result that networks evolved to explore the environment and find the reward even though no reward signal was given suggests that the reward cue was recognised. In fact, in the example shown in Figure 9B, some neurons fire positively when a reward cue is observed and negatively when it is not, or vice versa. Other neurons fire when a reward cue is observed and show little or no firing when it is not (see Figure 9C). Not all evolved networks appeared to have reward neurons. Nevertheless, the examples that evolved such reward-cue detectors demonstrate that evolution is able to incorporate invariant knowledge of the environment to optimize the policy - in this case, reward-seeking behaviour and fast adaptation to changing tasks.

5.2. Performance in Malmo Minecraft

To further assess the validity of our method, it is important to use a different benchmark environment, with a larger input and RGB observations offering a different feature space: hence the Malmo Minecraft environment. The controller was evolved with a population size of 800 over 400 generations. The same selection strategy as in the CT-graph experiments was employed. Each controller was evaluated for 8 trials, with 50 episodes and 3 tasks per trial. The task is changed at two stochastically generated points within the trial. The result is presented in Figure 7, keeping the same axes format as the results presented for the CT-graph environment. Again, the proposed method was able to perform optimally with a high average reward score, demonstrating its capability to scale to other high dimensional, less abstract environments.

6. Conclusion

This paper introduced an evolutionary design method for fast adaptation in POMDP environments. The system combines a feature extractor network and an evolved neuromodulated network with the aim of acquiring specific inborn knowledge and structure via evolution. While the suitability of evolved neuromodulated networks for solving environments with changing tasks was known (Soltoggio et al., 2008, 2018), we demonstrated that such advantages scale to high dimensional input spaces, and can be exploited in combination with an autoencoder. The results showed performance that compares with or surpasses some deep meta-RL algorithms. Interestingly, the evolved networks were capable of learning to recognise implicit reward cues, and therefore could explore the environment in search of the goal location without an explicit reward signal. This ability, acquired by the networks through evolution, is an example of inborn knowledge that allows networks to be born knowing what the reward cues are. Subsequently, this information can be used to drive fast adaptation when the optimal policy changes (e.g., upon a task change). The networks also evolved location neurons that help the deployment of a policy by distinguishing different states in the underlying MDP. We speculate that this approach may be promising when a combination of inborn knowledge and online learning is required to perform optimally in rapidly changing environments.

Appendix A Experimental Settings

The PENN-A source code containing the experimental setup is made available at: https://github.com/dlpbc/penn-a.

a.1. Feature Extractor

Mean Squared Error (MSE) was employed as the loss function across all CT-graph experiments, with a vanilla SGD optimizer. Likewise, Binary Cross Entropy (BCE) was employed as the loss function in the Malmo Minecraft experiments, using an RMSprop optimizer. The network architectures for the CT-graph and Malmo Minecraft experiments are presented in Tables 1 and 2. The double horizontal line in both tables highlights the split between the encoder and decoder of each autoencoder (i.e. the specifications above the line are for the encoder, and those below for the decoder). In the CT-graph experiments, where a Fully Connected (FC) autoencoder was employed, each input observation is flattened into a vector before being fed to the network.

Layer Activation Units
Input N/A 144
FC ReLU 64
FC ReLU 16
FC ReLU 64
FC ReLU 144
Table 1. Network architecture for CT-graph experiments
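The Table 1 architecture can be sketched as a plain NumPy forward pass. This is an illustrative reconstruction from the table's layer sizes only; the weight initialisation, input values, and helper names below are assumptions, not the released PENN-A code.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(x, 0.0)

# Layer sizes from Table 1: 144 -> 64 -> 16 (latent) -> 64 -> 144
sizes = [144, 64, 16, 64, 144]
weights = [rng.normal(0.0, 0.1, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]

def forward(x):
    """Run a flattened 144-d observation through encoder and decoder."""
    h = x
    activations = []
    for W in weights:
        h = relu(h @ W)       # every layer in Table 1 uses ReLU
        activations.append(h)
    # activations[1] is the 16-d latent code, activations[-1] the reconstruction
    return activations[1], activations[-1]

x = rng.random(144)                 # a flattened observation (assumed values)
latent, recon = forward(x)
mse = np.mean((recon - x) ** 2)     # MSE loss, as used in the CT-graph experiments
```

In the full system the 16-d latent code, not the reconstruction, is what feeds the evolved control network.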
Layer Activation Kernel Stride Channels
Input N/A N/A N/A N/A
Conv2D ReLU 3x3 2 16
Conv2D ReLU 3x3 2 32
Conv2D ReLU 3x3 2 32
Conv2D ReLU 3x3 2 8
ConvTranspose2D ReLU 3x3 2 32
ConvTranspose2D ReLU 3x3 2 32
ConvTranspose2D ReLU 3x3 2 16
ConvTranspose2D Sigmoid 4x4 2 3
Table 2. Network architecture for Malmo Minecraft experiments
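The spatial dimensions through the Table 2 encoder can be checked with the standard convolution output-size formula. The input resolution (64x64) and padding (1) below are assumptions, since Table 2 lists the input row as N/A and does not state padding.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Spatial output size of a 2D convolution: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

# Chain the four stride-2 encoder convolutions from Table 2.
size = 64                 # assumed input resolution
for _ in range(4):
    size = conv_out(size)
# With these assumptions the encoder output is 8 channels at (size x size).
```

Under these assumptions the spatial size halves at each layer (64 → 32 → 16 → 8 → 4), giving an 8-channel 4x4 latent volume.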

A.2. Control Network

With the exception of the population size and the number of generations, the evolutionary parameters of (Soltoggio et al., 2008) were followed.

The latent features from the feature extractor network were scaled into the interval $(0, 1)$. To further restrict the latent features, a transformation was applied to the scaled features before they were fed to the control network:

$$\hat{z} = \sigma^{-1}(z)$$

where $\sigma^{-1}$ is an inverse sigmoid operation on the scaled features $z$, and $\hat{z}$ is the transformed feature space. The scaling and transformation operations were performed independently of the feature extractor optimization (i.e. they were applied to copies of the latent features), and were applied across all experiments.
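One way to read the operations above is the following sketch: min-max scale the latent vector into an open interval strictly inside (0, 1), where the inverse sigmoid (logit) is defined, then apply the logit. The exact scaling bounds and the `eps` margin are assumptions, not values from the released code.

```python
import numpy as np

def inverse_sigmoid(z):
    """Logit: the inverse of the sigmoid, defined for z in (0, 1)."""
    return np.log(z / (1.0 - z))

def transform_latent(features, eps=1e-3):
    """Scale latent features into (eps, 1 - eps), then apply the inverse sigmoid.

    The transform is applied to a copy, so the feature extractor's own
    optimisation is unaffected, as described in the text.
    """
    z = np.asarray(features, dtype=float).copy()
    z = (z - z.min()) / (z.max() - z.min())   # min-max scale to [0, 1]
    z = eps + (1.0 - 2.0 * eps) * z           # keep strictly inside (0, 1)
    return inverse_sigmoid(z)

latent = np.array([0.2, 1.5, -0.3, 0.9])
transformed = transform_latent(latent)
```

Both steps are strictly increasing, so the transform preserves the ordering of the latent features while reshaping their distribution.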

In this work, both evaluation environments were designed with a discrete action space (3 actions each). Therefore, a single output neuron was employed across all experiments, and its activation value was discretized to produce the agent's actions: an activation within the interval [-1.0, -0.33) mapped to one action, [-0.33, +0.33] to another, and (+0.33, 1.0] to the last action.
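The discretization of the output neuron can be written directly from the three intervals above; the integer action indices are illustrative, since the text does not state which environment action each bin denotes.

```python
def discretize_action(activation):
    """Map a single output neuron's activation in [-1, 1] to one of 3 actions."""
    if activation < -0.33:
        return 0  # activation in [-1.0, -0.33) -> first action
    elif activation <= 0.33:
        return 1  # activation in [-0.33, +0.33] -> second action
    else:
        return 2  # activation in (+0.33, 1.0]  -> third action
```

Note the boundary values -0.33 and +0.33 both fall in the middle (closed) interval.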

This material is based upon work supported by the United States Air Force Research Laboratory (AFRL) and Defense Advanced Research Projects Agency (DARPA) under Contract No. FA8750-18-C-0103. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the United States Air Force Research Laboratory (AFRL) and Defense Advanced Research Projects Agency (DARPA).

References


  • S. Alvernaz and J. Togelius (2017) Autoencoder-augmented neuroevolution for visual doom playing. In 2017 IEEE Conference on Computational Intelligence and Games (CIG), pp. 1–8. Cited by: §1, §2.
  • S. Bengio, Y. Bengio, J. Cloutier, and J. Gecsei (1992) On the optimization of a synaptic learning rule. In Preprints Conf. Optimality in Artificial and Biological Neural Networks, pp. 6–8. Cited by: §2.
  • J. Blynel and D. Floreano (2002) Levels of dynamics and adaptive behavior in evolutionary neural controllers. In Proceedings of the seventh international conference on simulation of adaptive behavior on From animals to animats, pp. 272–281. Cited by: §2, §2.
  • Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel (2016a) Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pp. 1329–1338. Cited by: §1.
  • Y. Duan, J. Schulman, X. Chen, P. L. Bartlett, I. Sutskever, and P. Abbeel (2016b) RL²: fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779. Cited by: §1, §2, §5.1.
  • C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1126–1135. Cited by: §1, §2, §5.1.
  • D. Floreano and L. Keller (2010) Evolution of adaptive behaviour in robots by means of darwinian selection. PLoS Biol 8 (1), pp. e1000292. Cited by: §5.1.1.
  • D. Floreano and F. Mondada (1996) Evolution of plastic neurocontrollers for situated agents. In Proc. of The Fourth International Conference on Simulation of Adaptive Behavior (SAB), From Animals to Animats, Cited by: §2.
  • D. Ha and J. Schmidhuber (2018) Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems, pp. 2450–2462. Cited by: §2.
  • S. Hochreiter, A. S. Younger, and P. R. Conwell (2001) Learning to learn using gradient descent. In International Conference on Artificial Neural Networks, pp. 87–94. Cited by: §2.
  • M. Johnson, K. Hofmann, T. Hutton, and D. Bignell (2016) The malmo platform for artificial intelligence experimentation. In IJCAI, pp. 4246–4247. Cited by: §3.2.
  • T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: §1.
  • G. McHale and P. Husbands (2004) Gasnets and other evolvable neural networks applied to bipedal locomotion. From Animals to Animats 8, pp. 163–172. Cited by: §2.
  • N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel (2018) A simple neural attentive meta-learner. In International Conference on Learning Representations, External Links: Link Cited by: §2.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529. Cited by: §1.
  • A. P. Poulsen, M. Thorhauge, M. H. Funch, and S. Risi (2017) DLNE: a hybridization of deep learning and neuroevolution for visual control. In 2017 IEEE Conference on Computational Intelligence and Games (CIG), pp. 256–263. Cited by: §2.
  • K. Rakelly, A. Zhou, C. Finn, S. Levine, and D. Quillen (2019) Efficient off-policy meta-reinforcement learning via probabilistic context variables. In International Conference on Machine Learning, pp. 5331–5340. Cited by: §2.
  • S. Risi and K. O. Stanley (2019a) Deep neuroevolution of recurrent and discrete world models. In Proceedings of the Genetic and Evolutionary Computation Conference, pp. 456–462. Cited by: §2.
  • S. Risi and K. O. Stanley (2019b) Improving deep neuroevolution via deep innovation protection. arXiv preprint arXiv:2001.01683. Cited by: §2.
  • J. Rothfuss, D. Lee, I. Clavera, T. Asfour, and P. Abbeel (2019) ProMP: proximal meta-policy search. In International Conference on Learning Representations, External Links: Link Cited by: §1, §2.
  • D. E. Rumelhart, G. E. Hinton, R. J. Williams, et al. (1988) Learning representations by back-propagating errors. Cognitive modeling 5 (3), pp. 1. Cited by: §4.
  • T. Salimans, J. Ho, X. Chen, S. Sidor, and I. Sutskever (2017) Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864. Cited by: §2.
  • J. Schmidhuber, J. Zhao, and M. Wiering (1996) Simple principles of metalearning. Technical report IDSIA 69, pp. 1–23. Cited by: §2.
  • N. Schweighofer and K. Doya (2003) Meta-learning in reinforcement learning. Neural Networks 16 (1), pp. 5–9. Cited by: §2.
  • A. Soltoggio, J. A. Bullinaria, C. Mattiussi, P. Dürr, and D. Floreano (2008) Evolutionary advantages of neuromodulated plasticity in dynamic, reward-based scenarios. In Proceedings of the 11th international conference on artificial life (Alife XI), pp. 569–576. Cited by: §A.2, §1, §2, §2, §4.2.1, §4.2, §6.
  • A. Soltoggio, K. O. Stanley, and S. Risi (2018) Born to learn: the inspiration, progress, and future of evolved plastic artificial neural networks. Neural Networks. Cited by: §1, §6.
  • K. O. Stanley, J. Clune, J. Lehman, and R. Miikkulainen (2019) Designing neural networks through neuroevolution. Nature Machine Intelligence 1 (1), pp. 24–35. Cited by: §4.
  • K. O. Stanley and R. Miikkulainen (2002) Evolving neural networks through augmenting topologies. Evolutionary computation 10 (2), pp. 99–127. Cited by: §2.
  • F. P. Such, V. Madhavan, E. Conti, J. Lehman, K. O. Stanley, and J. Clune (2017) Deep neuroevolution: genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning. arXiv preprint arXiv:1712.06567. Cited by: §2.
  • S. Thrun and L. Pratt (1998) Learning to learn: introduction and overview. In Learning to learn, pp. 3–17. Cited by: §2.
  • J. X. Wang, Z. Kurth-Nelson, D. Tirumala, H. Soyer, J. Z. Leibo, R. Munos, C. Blundell, D. Kumaran, and M. Botvinick (2016) Learning to reinforcement learn. arXiv preprint arXiv:1611.05763. Cited by: §1, §2.
  • P. J. Werbos (1982) Applications of advances in nonlinear sensitivity analysis. In System modeling and optimization, pp. 762–770. Cited by: §4.
  • X. Yao (1999) Evolving artificial neural networks. Proceedings of the IEEE 87 (9), pp. 1423–1447. Cited by: §4.
  • L. Zintgraf, K. Shiarli, V. Kurin, K. Hofmann, and S. Whiteson (2019) Fast context adaptation via meta-learning. In International Conference on Machine Learning, pp. 7693–7702. Cited by: §1, §2, §5.1.