Context Meta-Reinforcement Learning via Neuromodulation

Meta-reinforcement learning (meta-RL) algorithms enable agents to adapt quickly to tasks from few samples in dynamic environments. Such a feat is achieved through dynamic representations in an agent's policy network (obtained via reasoning about task context, model parameter updates, or both). However, obtaining rich dynamic representations for fast adaptation beyond simple benchmark problems is challenging because of the burden placed on the policy network to accommodate different policies. This paper addresses the challenge by introducing neuromodulation as a modular component that augments a standard policy network and regulates neuronal activities in order to produce efficient dynamic representations for task adaptation. The proposed extension to the policy network is evaluated across multiple discrete and continuous control environments of increasing complexity. To demonstrate the generality and benefits of the extension in meta-RL, the neuromodulated network was applied to two state-of-the-art meta-RL algorithms (CAVIA and PEARL). The results demonstrate that meta-RL augmented with neuromodulation produces significantly better results and richer dynamic representations than the baselines.


1 Introduction

Human intelligence, though specialized in some sense, is able to adapt generally to new tasks and solve problems from limited experience or few interactions. The field of meta-reinforcement learning (meta-RL) seeks to replicate such flexible intelligence by designing agents that are capable of rapidly adapting to tasks from few interactions in an environment. Recent progress in the field (Rakelly et al., 2019; Finn et al., 2017; Duan et al., 2016b; Wang et al., 2016; Zintgraf et al., 2019; Gupta et al., 2018) has showcased state-of-the-art results. Agents endowed with such adaptation capabilities are a promising avenue for developing the much desired and needed artificial intelligence systems and robots with lifelong learning dynamics.

When an agent's policy for a meta-RL problem is encoded by a neural network, neural representations are adjusted from a pre-trained base point to a configuration that is optimal for solving a specific task. Such dynamic representations are a key feature enabling an agent to rapidly adapt to different tasks. These representations can be derived from gradient-based approaches (Finn et al., 2017), context-based approaches such as memory (Mishra et al., 2018; Wang et al., 2016; Duan et al., 2016b) and probabilistic inference (Rakelly et al., 2019), or hybrid approaches (i.e., combinations of gradient and context methods) (Zintgraf et al., 2019). The hybrid approach obtains a task context via gradient updates and thus dynamically alters the representations of the network. Context approaches such as CAVIA (Zintgraf et al., 2019) and PEARL (Rakelly et al., 2019) are more interpretable as they disentangle the task context from the policy network; the task context is then used to achieve optimal policies for different tasks.

One limitation of such approaches is that they do not scale well as problem complexity increases, because of the demand placed on a single network to reach many diverse policies. In particular, it is possible that, as tasks grow in complexity, the similarity between tasks decreases and thus the network representations required to solve each task optimally become dissimilar. We hypothesize that standard policy networks are unlikely to produce diverse policies from a trained base representation because all neurons have a homogeneous role or function: significant changes in the policy therefore require widespread changes across the network. From this observation, we speculate that a network endowed with modulatory neurons (neuromodulators) has a significantly higher ability to modify its policy.

Our approach to overcome this limiting design factor in current meta-RL neural approaches is to introduce a neuromodulated policy network to increase its ability to encode rich and flexible dynamic representations. The rich representations are measured based on the dissimilarity of the representations across various tasks, and are useful when the optimal policy of an agent (input-to-action mapping) is less similar across tasks. When combined with the CAVIA and PEARL meta-learning frameworks, the proposed approach produced better dynamic representations for fast adaptation as the neuromodulators in each layer serve as a means of directly altering the representations of the layer in addition to the task context.

Several designs exist for neuromodulation (Doya, 2002), either to gate plasticity (Soltoggio et al., 2008; Miconi et al., 2020), gate neural activations (Beaulieu et al., 2020) or alter high level behaviour (Xing et al., 2020). The proposed mechanism in this work focuses on just one simple principle: modulatory signals alter the representations in each layer by gating the weighted sum of input of the standard neural component.

The primary contribution of this work is a neuromodulated policy network for meta-reinforcement learning, aimed at solving increasingly difficult problems. The modular design allows the proposed layer to be combined with other existing layers (such as standard fully connected layers, convolutional layers, and so on) when stacking them to form a deep network. The experimental evidence in this work demonstrates that neuromodulation is beneficial for adapting network representations with more flexibility than standard networks. Experimental evaluations were conducted across high dimensional discrete and continuous control environments of increasing complexity using the CAVIA and PEARL meta-RL algorithms. The results indicate that the neuromodulated networks show an increasing advantage as problem complexity increases, while they perform comparably on simpler problems. The increased diversity of the representations from the neuromodulated policy network is examined and discussed. The open source implementation of the code can be found at:

https://github.com/anon-6994/nm-metarl

2 Related Work

Meta-reinforcement learning. This work builds on the existing meta learning frameworks (Bengio et al., 1992; Schmidhuber et al., 1996; Thrun and Pratt, 1998; Schweighofer and Doya, 2003)

in the domain of reinforcement learning. Recent studies in meta-reinforcement learning (meta-RL) can be largely classified into optimization and context-based methods. Optimization methods

(Finn et al., 2017; Li et al., 2017; Stadie et al., 2018; Rothfuss et al., 2019) seek to learn good initial parameters of a model that can be adapted to a specific task with a few gradient steps. In contrast, context-based methods seek to adapt a model to a specific task based on few-shot experiences aggregated into context variables. The context can be derived via probabilistic methods (Rakelly et al., 2019; Liu et al., 2021), recurrent memory (Duan et al., 2016b; Wang et al., 2016), recursive networks (Mishra et al., 2018), or a combination of probabilistic and memory methods (Zintgraf et al., 2020; Humplik et al., 2019). Hybrid methods (Zintgraf et al., 2019; Gupta et al., 2018) combine optimization and context-based methods, whereby task-specific context parameters are obtained via gradient updates.

Neuromodulation. Neuromodulation in biological brains is a process whereby a neuron alters or regulates the properties of other neurons in the brain (Marder, 2012). The altered properties can be either the cellular activities or the synaptic weights of the neurons. Well known biological neuromodulators include dopamine (DA), serotonin (5-HT), acetylcholine (ACh), and noradrenaline (NA) (Bear et al., 2020; Avery and Krichmar, 2017). These neuromodulators were described by Doya (2002) within the reinforcement learning computational framework, with dopamine loosely mapped to the reward signal error (such as the TD error), serotonin to the discount factor, acetylcholine to the learning rate, and noradrenaline to the randomness in a policy's action distribution. Several studies have drawn inspiration from neuromodulation and applied it to gradient-based RL (Xing et al., 2020; Miconi et al., 2020) and neuroevolutionary RL (Soltoggio et al., 2007, 2008; Velez and Clune, 2017)

for dynamic task settings. In broader machine learning, neuromodulation has been applied to goal-driven perception

(Zou et al., 2020), and also in continual learning settings (Beaulieu et al., 2020), where it was combined with meta-learning to sequentially learn a number of classification tasks without catastrophic forgetting. The neuromodulators used in these studies have different designs or functions: plasticity gating (Soltoggio et al., 2008; Miconi et al., 2020), activation gating (Beaulieu et al., 2020), and direct action modification in a policy (Xing et al., 2020).

3 Background

3.1 Problem Formulation

In a meta-RL setting, there exists a task distribution $p(\mathcal{T})$ from which tasks are sampled. Each task $\mathcal{T}_i$ is a Markov Decision Process (MDP), which is a tuple $\mathcal{T}_i = \{\mathcal{S}, \mathcal{A}, P, R, \rho_0\}$ consisting of a state space $\mathcal{S}$, an action space $\mathcal{A}$, a state transition distribution $P(s_{t+1} \mid s_t, a_t)$, a reward function $R(s_t, a_t)$, and an initial state distribution $\rho_0$. When presented with a task $\mathcal{T}_i$, an agent (with a policy represented as $\pi_\theta$) is required to quickly adapt to the task from few interactions. Therefore, the goal of the agent for each task it is presented with is to maximize the expected reward in the shortest time possible:

$$J_{\mathcal{T}_i}(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t=0}^{H-1} \gamma^t R(s_t, a_t)\Big], \qquad (1)$$

where $H$ is a finite horizon and $\gamma$ is the discount factor.
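As a concrete illustration, the objective in Equation 1 can be estimated from sampled trajectories by a Monte Carlo average of discounted returns. The short sketch below is illustrative only and assumes each trajectory is given as a list of per-step rewards.

import statistics

def discounted_return(rewards, gamma):
    # Sum_{t=0}^{H-1} gamma^t * R(s_t, a_t) for a single trajectory.
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def estimate_objective(trajectories, gamma=0.99):
    # Monte Carlo estimate of Equation 1 over trajectories sampled with the policy.
    return statistics.mean([discounted_return(tau, gamma) for tau in trajectories])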

3.2 Context Adaptation via Meta-Learning (CAVIA)

The CAVIA meta-learning framework (Zintgraf et al., 2019) is an extension of the model-agnostic meta-learning algorithm (MAML) (Finn et al., 2017) that is interpretable and less prone to meta-overfitting. The key idea in CAVIA is the introduction of context parameters $\phi$ in the policy network. Therefore, the policy contains the standard network parameters $\theta$ and the context parameters $\phi$. During the adaptation phase for each task (the gradient updates in the inner loop), only the context parameters $\phi$ are updated, while the network parameters $\theta$ are updated during the outer loop gradient updates. There are different ways to provide the policy network with the context parameters. In Zintgraf et al. (2019), the context parameters were concatenated to the input.

In the meta-RL framework, an agent is trained for a number of iterations. For each iteration, $N$ tasks $\mathcal{T}_i$ are sampled from the task distribution $p(\mathcal{T})$. For each task $\mathcal{T}_i$, a batch of trajectories $\tau_i$ is obtained using the policy $\pi_{\theta, \phi_0}$ with the context parameters set to an initial condition $\phi_0$. The obtained trajectories for task $\mathcal{T}_i$ are used to perform a one step inner loop gradient update of the context parameters to new values $\phi_i$, shown in the equation below:

$$\phi_i = \phi_0 + \alpha \nabla_{\phi_0} J_{\mathcal{T}_i}(\theta, \phi_0), \qquad (2)$$

where $J_{\mathcal{T}_i}$ is the objective function for task $\mathcal{T}_i$ and $\alpha$ is the inner loop learning rate. After the one step gradient update of the policy, another batch of trajectories $\tau_i'$ is collected using the updated task specific policy $\pi_{\theta, \phi_i}$.

After completing the above procedure for all tasks sampled from $p(\mathcal{T})$, a meta gradient step (also referred to as the outer loop update) is performed, updating $\theta$ to maximize the average performance of the policy across the task batch:

$$\theta \leftarrow \theta + \beta \nabla_\theta \frac{1}{N} \sum_{i=1}^{N} J_{\mathcal{T}_i}(\theta, \phi_i), \qquad (3)$$

where $\beta$ is the outer loop learning rate.
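To make the inner loop concrete, the sketch below shows a schematic one-step update of the context parameters in the spirit of Equation 2, assuming the policy exposes its context parameters as a single tensor phi0 and that surrogate_objective is a differentiable estimate of the task objective (both names are illustrative assumptions; the experiments in this paper use vanilla policy gradient with GAE in the inner loop and TRPO in the outer loop, see Appendix A).

import torch

def adapt_context(policy, phi0, trajectories, alpha, surrogate_objective):
    # One inner-loop gradient step on the context parameters only (Equation 2);
    # the shared network parameters theta of `policy` are not updated here.
    objective = surrogate_objective(policy, phi0, trajectories)
    grad_phi, = torch.autograd.grad(objective, phi0, create_graph=True)
    # create_graph=True retains the dependence on theta so the outer-loop
    # meta-gradient of Equation 3 can be back-propagated through this update.
    return phi0 + alpha * grad_phi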

3.3 Probabilistic Embeddings for Actor-Critic Meta-RL (PEARL)

PEARL (Rakelly et al., 2019) is an off-policy meta-RL algorithm that is based on the soft actor-critic architecture (Haarnoja et al., 2018). The algorithm derives the context of the task to which an agent is exposed through probabilistic sampling. Given a task, the agent maintains a prior belief of the task, and as the agent interacts with the environment, it updates the posterior distribution with the goal of identifying the specific task context. The context variables $z$ are concatenated to the input of the actor and critic neural components of the setup. To estimate this posterior $q_\phi(z \mid c)$, where $c$ denotes the collected task experience (context), an additional neural component called an inference network is trained using the trajectories collected for tasks sampled from the task distribution $p(\mathcal{T})$. The objective functions for the actor, critic and inference neural components are described below:

$$\mathcal{L}_{actor} = \mathbb{E}_{s \sim \mathcal{B},\, a \sim \pi_\theta,\, z \sim q_\phi(z \mid c)}\Big[ D_{KL}\Big( \pi_\theta(a \mid s, \bar{z}) \,\Big\|\, \frac{\exp\big(Q_\theta(s, a, \bar{z})\big)}{Z_\theta(s)} \Big) \Big], \qquad (4)$$

$$\mathcal{L}_{critic} = \mathbb{E}_{(s, a, r, s') \sim \mathcal{B},\, z \sim q_\phi(z \mid c)}\Big[ Q_\theta(s, a, z) - \big(r + \bar{V}(s', \bar{z})\big) \Big]^2, \qquad (5)$$

$$\mathcal{L}_{inference} = \mathcal{L}_{critic} + \beta\, D_{KL}\big( q_\phi(z \mid c) \,\|\, p(z) \big), \qquad (6)$$

where $\bar{V}$ is a target network and $\bar{z}$ means that gradients are not computed through it, $Z_\theta(s)$ is a normalizing constant, $p(z)$ is a unit Gaussian prior over $z$, $\mathcal{B}$ is the replay buffer and $\beta$ is a weighting hyper-parameter.
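As a minimal sketch of how the probabilistic context is used at adaptation time, the snippet below samples z from a Gaussian posterior produced by the inference network and concatenates it to the actor input, as described above. The function and variable names (an inference_net returning a mean and log-variance, policy, context_batch) are illustrative assumptions rather than PEARL's exact interfaces.

import torch

def sample_context(inference_net, context_batch):
    # context_batch: transitions (s, a, r, s') collected so far for the current task.
    mu, log_var = inference_net(context_batch)
    # Reparameterized sample from the posterior q(z | c).
    return mu + torch.randn_like(mu) * (0.5 * log_var).exp()

def select_action(policy, state, z):
    # The latent task variable z is concatenated to the observation before
    # being fed to the actor (and, analogously, to the critic).
    return policy(torch.cat([state, z], dim=-1))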

4 Neuromodulated Network

(a) Neuromodulated network
(b) Neuromodulated fully connected layer
Figure 1: Overview of the proposed computational framework

This section introduces the extension of the policy network with neuromodulation. A graphical representation of the network is shown in Figure 1(a). The neuromodulated policy network is a stack of neuromodulated fully connected layers.

4.1 Computational Framework

A neuromodulated fully connected layer contains two neural components: standard neurons and neuromodulators (see Figure 1(b)). The standard neurons serve as the output of the layer (i.e., the layer's representations) and they are connected to the preceding layer via standard fully connected weights $W_{std}$. The neuromodulators serve as a means to alter the output of the standard neurons. They receive input from the preceding layer via standard fully connected weights $W_{nm}$ in order to generate their neural activity, which is then projected to the standard neurons via another set of fully connected weights $W_{proj}$. The function of the projected neuromodulatory activity defines the representation-altering mechanism. For example, it could gate the plasticity of $W_{std}$, gate the neural activation of the standard neurons, or do something else based on the designer's specification. While different types of neuromodulators can be used (Doya, 2002), in this particular work we employ an activity-gating neuromodulator. Such a neuromodulator multiplies the activity of the target (standard) neurons before a non-linearity is applied to the layer. Formally, the structure can be described with three parameter matrices: $W_{std}$ defines weights connecting the input to the standard neurons, $W_{nm}$ defines weights connecting the input to the neuromodulators, and $W_{proj}$ defines weights connecting the neuromodulators to the standard neurons. The step-wise computation of a forward pass through the neuromodulatory structure is given below:

$$h = W_{std}\, x, \qquad (7)$$
$$m = \mathrm{ReLU}(W_{nm}\, x), \qquad (8)$$
$$m_{proj} = \tanh(W_{proj}\, m), \qquad (9)$$
$$o = \sigma(h \odot m_{proj}), \qquad (10)$$

where $x$ is the layer's input, $h$ is the weighted sum of input of the standard neurons, $m$ is the activity of the neuromodulators derived from their weighted sum of input, $m_{proj}$ is the neuromodulatory activity projected onto the standard neurons, $\sigma$ is the layer's non-linearity, and $o$ is the output of the layer. The key modulating process takes place in the element-wise multiplication of $h$ and $m_{proj}$.

The $\tanh$ non-linearity in Equation 9 is employed to enable positive and negative neuromodulatory signals, and thus gives the network the ability to affect both the magnitude and the sign of the target activation values. When ReLU is used as the non-linearity $\sigma$ for the layer's output $o$, $m_{proj}$ has the intrinsic ability to dynamically turn on or off certain outputs in $o$.

A simpler version of the proposed model can be achieved by only considering the sign, and not the magnitude, of the neuromodulatory signal, using the following variation of Equation 10:

$$o = \sigma\big(h \odot \mathrm{sign}(m_{proj})\big). \qquad (11)$$

This variation is shown to be suited for discrete control problems.

A major caveat of the proposed design is that the inclusion of neuromodulatory structures in the network increases the number of parameters (and hence the memory footprint) and the time complexity of a forward pass through the network. However, when only a few neuromodulatory parameters are employed (on the order of hundreds or thousands rather than millions), the increase in computational time is negligible.
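To make the overhead concrete, the number of weights in a single hidden layer can be counted directly from the three matrices $W_{std}$, $W_{nm}$ and $W_{proj}$. The sketch below uses the hidden size (200) and neuromodulator size (32) listed in Table 1 for the half-cheetah and meta-world runs, and ignores biases and context parameters for simplicity.

def layer_weights(n_in, n_out, n_nm=0):
    # Standard path plus, if n_nm > 0, the input-to-neuromodulator and
    # neuromodulator-to-output projections (biases omitted).
    return n_in * n_out + (n_in * n_nm + n_nm * n_out if n_nm else 0)

standard = layer_weights(200, 200)           # 40,000 weights
neuromodulated = layer_weights(200, 200, 32) # 40,000 + 6,400 + 6,400 = 52,800 weights
# The per-layer overhead is therefore on the order of thousands of parameters.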

5 Results and Analysis

In this section, we present the results of evaluating the neuromodulated policy network across high dimensional discrete and continuous control environments with varying levels of complexity. The continuous control environments are the simple 2D navigation environment, the half-cheetah direction and velocity environments (Finn et al., 2017) based on MuJoCo (Todorov et al., 2012), and the meta-world ML1 and ML45 environments (Yu et al., 2020). The discrete action environment is a graph navigation environment with configurable levels of complexity called the CT-graph (Soltoggio et al., 2021; Ladosz et al., 2021; Ben-Iwhiwhu et al., 2020). The experimental setup focused on investigating the beneficial effect of the proposed neuromodulatory mechanism when augmenting existing meta-RL frameworks (i.e., neuromodulation as a complementary tool to meta-RL rather than a competing one). To this end, using the CAVIA meta-RL method (Zintgraf et al., 2019), a standard policy network (SPN) is compared against our neuromodulated policy network (NPN) across the aforementioned environments. Similarly, SPN is compared against NPN using the PEARL method (Rakelly et al., 2019), but only in the continuous control environments because the soft actor-critic architecture employed by PEARL is designed for continuous control. We present an analysis of the learned dynamic representations of a standard and a neuromodulated network in Section 5.2. Finally, the policy networks were evaluated in an RGB autonomous vehicle navigation domain in the CARLA driving simulator using CAVIA, and the results and discussion are presented in Appendix E.

5.1 Performance

The experimental setups for CAVIA and PEARL followed Zintgraf et al. (2019) and Rakelly et al. (2019), respectively. For PEARL, neuromodulation was applied only to the actor neural component. The details of the experimental setup and hyper-parameters are presented in Appendix A. The performance reported is the meta-testing result of the agents in the evaluation environments after meta-training has been completed. During meta-testing in CAVIA, the policy networks were fine-tuned for four inner loop gradient steps (see Table 1).

(a) 2D Navigation
(b) Half-Cheetah Velocity
(c) Half-Cheetah Direction
(d) Meta-World ML1
(e) Meta-World ML45
Figure 2: Adaptation performance of standard policy network (SPN) and neuromodulated policy network (NPN) in continuous control environment using CAVIA meta-RL framework.
(a) 2D Navigation
(b) Half-Cheetah Velocity
(c) Half-Cheetah Direction
(d) Meta-World ML1
(e) Meta-World ML45
Figure 3: Adaptation performance of standard policy network (SPN) and neuromodulated policy network (NPN) in continuous control environment using PEARL meta-RL framework.
(a) CAVIA, ML1
(b) CAVIA, ML45
(c) PEARL, ML1
(d) PEARL, ML45
Figure 4: Adaptation performance (based on success rate metric) of standard policy network (SPN) and neuromodulated policy network (NPN) in CAVIA and PEARL.

5.1.1 2D Navigation Environment

The first simulations are in the 2D point navigation experiment introduced in Finn et al. (2017). An agent is tasked with navigating to a randomly sampled goal position from a start position. A goal position is sampled from the interval [-0.5, 0.5]. The reward function is the negative squared distance between the agent's current position and the goal. An observation is the agent's current 2D position, while the actions are velocity commands clipped at [-0.1, 0.1]. The results of the meta-testing performance evaluation comparing the standard policy network and the neuromodulated policy network are presented in Figure 2(a) for CAVIA and Figure 3(a) for PEARL. The results show that both policy networks achieved relatively good performance. Such performance is expected from both policies as the environment is simple and the dynamic representations required for each task are not very distinct.
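For reference, the task structure described above can be summarised in a few lines. The sketch below is an illustrative re-implementation of the reward and action clipping, not the benchmark's actual code.

import numpy as np

class Nav2DTask:
    # Illustrative sketch of the 2D point navigation task described above.
    def __init__(self, rng):
        self.goal = rng.uniform(-0.5, 0.5, size=2)  # goal sampled per task
        self.pos = np.zeros(2)                      # start position

    def step(self, action):
        action = np.clip(action, -0.1, 0.1)         # velocity commands clipped at [-0.1, 0.1]
        self.pos = self.pos + action
        reward = -np.sum((self.pos - self.goal) ** 2)  # negative squared distance to the goal
        return self.pos.copy(), reward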

5.1.2 Half-Cheetah

The half-cheetah is an environment based on the MuJoCo simulator (Todorov et al., 2012) that requires an agent to learn continuous control locomotion. We employ two standard meta-RL benchmarks using this environment, as proposed in Finn et al. (2017): (i) the direction task, which requires the cheetah agent to run either forward or backward, and (ii) the velocity task, which requires the agent to run at a certain velocity sampled from a distribution of velocities. Although challenging (due to their high dimensional nature) in comparison to the 2D navigation task, these benchmarks are still relatively simple, as the direction benchmark contains only two unique tasks and the velocity benchmark samples from a small range of velocities. Therefore, the optimal policies across tasks in these benchmarks possess similar representations. The results of the experiments for both benchmarks are presented in Figures 2(c) and 2(b) for CAVIA, and Figures 3(c) and 3(b) for PEARL. Unsurprisingly, the results show a comparable level of performance between the standard policy network and the neuromodulated policy network across CAVIA and PEARL. These benchmarks are of medium complexity and the optimal policy for each task is similar to the others.

5.1.3 Meta-World

The neuromodulated policy network was evaluated in a complex, high-dimensional continuous control environment called meta-world (Yu et al., 2020). In meta-world, an agent is required to manipulate a robotic arm to solve a wide range of tasks (e.g., pushing an object, picking and placing objects, opening a door, and more). Two instances of the benchmark, ML1 and ML45, were employed. In the ML1 instance, the robot is required to solve a single task that contains several parametric variations (e.g., push an object to different goal locations). The parametric variations of the selected task are used as the meta-train and meta-test tasks. ML45 is a more complex instance that contains a wide variety of tasks (each task with parametric variations). It consists of 45 distinct meta-train tasks and 5 distinct meta-test tasks. The standard policy network and neuromodulated policy network were evaluated in the ML1 and ML45 instances using CAVIA and PEARL. The results (obtained using the updated Meta-World environment, i.e., v2, containing the updated reward function) are presented in Figures 2(d) and 2(e) for CAVIA, and Figures 3(d) and 3(e) for PEARL. In these complex benchmarks, the results show that the neuromodulated policy network outperforms the standard policy network in both CAVIA and PEARL, highlighting the advantage neuromodulation offers in complex problem settings. In addition to judging performance based on reward, results are also presented using the success rate metric (introduced in Yu et al. (2020) to judge whether or not an agent is able to solve a task) in Figure 4. The results again show that the neuromodulated policy network achieved a significantly higher average success rate than the standard policy network, both in CAVIA and PEARL.

5.1.4 Configurable Tree graph (CT-graph) Environment

The CT-graph is a sparse-reward, discrete control graph environment of configurable complexity, specified via parameters such as the branch b and depth d. An environment instance consists of a set of states including a start state and a number of end states. An agent is tasked with navigating to a randomly sampled end state from the start state. See Appendix B for more details about the CT-graph. The three CT-graph instances used in this work were set up with a varying depth parameter: with increasing depth, the sequence of actions grows linearly, but the search space for the policy network grows exponentially. The simplest instance has d set to 2 (CT-graph depth2), the next has d set to 3 (CT-graph depth3), and the most complex instance has d set to 4 (CT-graph depth4). The meta-testing results are presented in Figure 5. The results show a significant difference in performance between the standard and the neuromodulated policy network. The optimal adaptation performance of the neuromodulated policy network stems from the rich dynamic representations needed for adaptation, as discussed in Section 5.2.

(a) CT-graph (depth d = 2)
(b) CT-graph (depth d = 3)
(c) CT-graph (depth d = 4)
Figure 5: Adaptation performance of standard policy network (SPN) and neuromodulated policy network (NPN) in three discrete control environments using CAVIA meta-RL framework.

5.2 Analysis

(a) standard policy network, first hidden layer.
(b) neuromodulated policy network, first hidden layer.
Figure 6: Representation similarities between tasks in the 2D Navigation environment.
(a) standard policy network, first hidden layer.
(b) neuromodulated policy network, first hidden layer.
Figure 7: Representation similarities between tasks in the CT-graph depth2 environment.

In this section, we analyse the learnt representations of the standard and neuromodulated policy networks for tasks in the 2D Navigation and CT-graph environments. The policy networks trained using CAVIA were chosen for the analysis, as the single neural component in CAVIA (i.e., the policy network) makes it easier to analyse than PEARL, which contains multiple neural components.

To measure representation similarity across tasks, we employ the centered kernel alignment (CKA) similarity index (Kornblith et al., 2019), comparing per-layer representations of both standard and neuromodulated policy networks across different tasks. The representation similarities between tasks are plotted as heat maps in Figures 6 and 7. Each heatmap in a row (for example, 6(a)) depicts the similarity before or after a few steps of gradient updates to the layer. Before any gradient updates, the representations are similar between tasks. After gradient updates, some dissimilarities between tasks begin to emerge. Additional analysis plots are presented in Appendix C.
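For completeness, the linear variant of CKA used to compare layer activations can be written in a few lines. The sketch below assumes two activation matrices of shape (examples, features), collected for the same inputs under two different task contexts; it is a generic implementation of linear CKA rather than the exact analysis script used here.

import numpy as np

def linear_cka(X, Y):
    # X, Y: (n_examples, n_features) activation matrices for the same inputs.
    X = X - X.mean(axis=0)                       # column-centre the representations
    Y = Y - Y.mean(axis=0)
    cross = np.linalg.norm(Y.T @ X, 'fro') ** 2  # cross-covariance term
    norm_x = np.linalg.norm(X.T @ X, 'fro')
    norm_y = np.linalg.norm(Y.T @ Y, 'fro')
    return cross / (norm_x * norm_y)             # 1.0 indicates identical representations up to rotation and scaling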

2D Navigation. For the simple 2D Navigation environment, the plots for the first hidden layer of the standard policy network shown in Figure 6(a) depict good dissimilarity between tasks, highlighting the fact that the learnt representations are sufficient to produce distinct task behaviours. The same is true for the first hidden layer of the neuromodulated policy network (see Figure 6(b)). This further explains why both policies obtained roughly comparable performance in this environment. The simplicity of the problem enables task-distinct representations to be obtained easily. Appendix C.1 contains the plots of the representation similarity for the second hidden layer of both policy networks.

CT-graph. In Figures 7(a) and 7(b), we compare the representation similarity of the first hidden layer of the standard and neuromodulated policy networks in the CT-graph depth2 environment. We see that the representations of the neuromodulated policy are more dissimilar between tasks than those of the standard policy. Due to the complexity of the environment, the task-specific representations required to solve each task are distinct from one another. Therefore, adapting by fine-tuning the representations of a base network via a few gradient steps would require a significant jump in the solution space. The standard policy network struggles to make such a jump, whereas incorporating neuromodulators that dynamically alter the representations makes it possible. Appendix C.1 contains the plots of the representation similarity for the second hidden layer of both policy networks.

5.3 Control Experiments: with equal number of parameters

(a) CT-graph depth4
(b) ML45
Figure 8: Control experiments. Adaptation performance of standard policy network (SPN), a larger SPN variant and neuromodulated policy network (NPN) using CAVIA. Note, the number of parameters in SPN (larger) approximately matches that of the NPN in each environment.

Since the inclusion of neuromodulators increases the number of parameters in a neuromodulated policy network, a set of control experiments was conducted in which the number of parameters in a standard policy network was configured to approximately match that of a neuromodulated policy network. This was achieved by increasing the size of each hidden layer in the standard policy network. Using CAVIA, experiments were conducted in the CT-graph depth4 and the ML45 meta-world environments, comparing the standard policy network (i.e., the original size), its larger variant, and a neuromodulated policy network. The results are presented in Figure 8. We observe that the increase in the size of the standard policy network does not allow it to match the performance of the neuromodulated policy network.

6 Discussions

Neuromodulation and gated recurrent networks: The neuromodulatory gating mechanism introduced in this work is reminiscent of the gating in recurrent/memory networks (LSTMs (Hochreiter and Schmidhuber, 1997) and GRUs (Cho et al., 2014)). In this respect (given the improved performance observed as a consequence of neuromodulatory gating in this work), the noteworthy performance demonstrated by memory-based meta-RL approaches (Duan et al., 2016b; Wang et al., 2016) could also be a consequence of such gating mechanisms. (Although not the focus of this work, we ran an experiment using a memory-based meta-RL method in the ML45 environment and achieved an average meta-test success rate comparable to the results obtained using the neuromodulatory gating mechanism.) Nonetheless, the present study aims to highlight the advantage of a simpler form of gating (i.e., neuromodulatory gating) on an MLP feedforward network, and thus helps to pinpoint the advantage of such dynamics in isolation. Furthermore, the advantage of our approach over gated recurrent variants is somewhat similar to the advantage derived from decoupling the attention mechanism from recurrent models (where it was originally introduced) and applying it to MLP networks (i.e., Transformer models) (Vaswani et al., 2017). By decoupling the neuromodulatory (gating) mechanism from recurrent models and applying it to MLP models (as in our work), the advantages of faster training and better parallelization are achieved while maintaining the benefit of neuromodulatory gating. Therefore, our proposed approach is faster to train and more parallelizable than memory variants, while maintaining the advantages that neuromodulatory gating offers. Memory-based approaches will still be required for problems where memory is advantageous, such as sequential data processing and POMDPs.

Task similarity measure and robust benchmarks: Increasing task complexity was represented in this work by moving from the simple 2D point navigation environment to half-cheetah locomotion and then to the complex robotic arm setup of the Meta-World environment. Furthermore, exploiting the configurable parameters of the CT-graph environment, we were able to control the complexity of the environment. Overall, task complexity was viewed through the perspective of task similarity (i.e., environments with dissimilar tasks were viewed as more complex and vice versa). Despite these efforts, a precise measure of task complexity and similarity was not clearly outlined in this work, and this is widely the case in the meta-RL literature. There is a need for the development of precise metrics for measuring task similarity and complexity in the field. The CT-graph, with its configurable parameters, allows tasks to be mathematically defined, which is a first step towards alleviating this issue. However, a separate future research investigation would be necessary to develop explicit metrics that can be incorporated into meta-RL benchmarks.

We hypothesize that such a task similarity metric should be able to capture the precise change points of a task relative to other tasks. For example, a useful metric could be one that captures task change as a function of change in the reward, the state space, the transition function, or a combination of these factors. Most benchmarks in meta-RL have focused on task change as reward function change. However, a more robust benchmark could include the aforementioned change points in order to further control the complexity. The CT-graph, Meta-World, and the recently developed Alchemy (Wang et al., 2021) environment are examples of benchmarks with early-stage work in this direction, albeit implicitly. Therefore, the development of a precise measure of task similarity and complexity, as well as robust benchmarks with configurable change points (i.e., reward, state/input, and transition), would be highly beneficial to the meta-RL field.

7 Conclusion and Future Work

This paper introduced an architectural extension of standard meta-RL policy networks to include a neuromodulatory mechanism, investigating the beneficial effect of neuromodulation when augmenting existing meta-RL frameworks (i.e., neuromodulation as a complementary tool to meta-RL rather than a competing one). The aim is to implement richer dynamic representations and facilitate rapid task adaptation in increasingly complex problems. The effectiveness of the proposed approach was evaluated in the meta-RL setting using the CAVIA and PEARL algorithms. In the experimental setup across environments of increasing complexity, the neuromodulated policy network significantly outperformed the standard policy network on complex problems while showing comparable performance on simpler problems. The results highlight the usefulness of neuromodulators for enabling fast adaptation via rich dynamic representations in meta-RL problems. The architectural extension, although simple, presents a general framework for extending meta-RL policy networks with neuromodulators that expand their ability to encode different policies. The projected neuromodulatory activity can be designed to perform functions other than the one introduced in this work, e.g., gating the plasticity of weights, or including different neuromodulators in the same layer. The neuromodulatory extension could also be tested with a recurrent meta-RL policy, with the goal of enhancing the memory dynamics of the policy.

Acknowledgment

This material is based upon work supported by the United States Air Force Research Laboratory (AFRL) and Defense Advanced Research Projects Agency (DARPA) under Contract No. FA8750-18-C0103. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the United States Air Force Research Laboratory (AFRL) and Defense Advanced Research Projects Agency (DARPA). The authors would like to thank Jeffrey Krichmar for useful discussion. Also, the authors would like to thank Luisa Zintgraf and Kate Rakelly for open sourcing the CAVIA and PEARL code respectively. Lastly, the authors thank the anonymous reviewers for their valuable feedback which helped to improve the paper.

Supplementary Material

Appendix A Experimental Configurations

All experiments were conducted using machines containing Tesla K80 and GeForce RTX 2080 GPUs. Also note that across all experiments, the output layer in the neuromodulated policy network (in CAVIA and in PEARL) employed a regular fully connected linear layer while the preceding layers were neuromodulated fully connected layers.

A.1 CAVIA

Following the experimental setup of the original CAVIA paper (Zintgraf et al., 2019), the context variables were concatenated to the input of the policy network and were reset to zero at the beginning of each task across all experiments. Also, during each training iteration, the policy was adapted using one gradient update in the inner loop, as in Zintgraf et al. (2019); Finn et al. (2017). After training, the iteration with the best policy performance, or the final policy at the end of training, was used to conduct the meta-testing evaluations and produce the final results. During meta-testing, the policy was evaluated on a number of tasks sampled from the task distribution and was adapted (fine-tuned) for each task using four inner loop gradient updates (see Table 1). All policy networks employed the ReLU non-linearity across all experiments.

The CAVIA experimental configurations across all environments are presented in Table 1, with 2D Nav denoting the 2D navigation benchmark, Ch Dir and Ch Vel denoting the half-cheetah direction and velocity benchmarks, ML 1 and ML 45 denoting the meta-world ML1 and ML45 benchmarks, and CT d2, d3, d4 denoting the CT-graph depth2, depth3 and depth4 benchmarks, respectively.

Parameter | 2D Nav | Ch Dir | Ch Vel | ML 1 | ML 45 | CT d2 | CT d3 | CT d4
Number of iterations | 500 | 500 | 500 | 500 | 500 | 500 | 700 | 1500
Number of tasks per iteration (meta-batch size) | 20 | 40 | 40 | 40 | 45 | 20 | 25 | 20
Number of inner loop grad steps (meta-training) | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1
Number of trajectories per task (meta-training) | 20 | 20 | 20 | 20 | 10 | 20 | 25 | 60
Number of inner loop grad steps (meta-testing) | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4
Number of trajectories per task (meta-testing) | 20 | 40 | 40 | 40 | 20 | 20 | 40 | 100
Policy network specification |  |  |  |  |  |  |  |
Number of context parameters | 5 | 50 | 50 | 50 | 100 | 5 | 10 | 20
Number of hidden layers | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2
Hidden layer size | 100 | 200 | 200 | 200 | 200 | 200 | 300 | 600
Neuromodulator size (neuromodulated policy only) | 4 | 32 | 32 | 32 | 32 | 8 | 16 | 32
Table 1: CAVIA experimental configurations.

Across all experiments in CAVIA, Trust Region Policy Optimization (TRPO) (Schulman et al., 2015) was employed as the outer loop update algorithm. Vanilla policy gradient (Williams, 1992) with generalized advantage estimation (GAE) (Schulman et al., 2016) was employed as the inner loop update algorithm, with one learning rate used for the 2D navigation and CT-graph experiments and another for the half-cheetah and meta-world experiments. Both the inner and outer loop training employed the linear feature baseline introduced in Duan et al. (2016a). The hyperparameters for TRPO are presented in Table 2. Furthermore, finite differences were employed to compute the Hessian-vector product for TRPO in order to avoid computing third-order derivatives, as highlighted in Finn et al. (2017). During the sampling of data for each task, multiprocessing was employed with multiple parallel workers.

Name Value
maximum KL-divergence
number of conjugate gradient iterations
conjugate gradient damping
maximum number of line search iterations
backtrack ratio for line search iterations
Table 2: TRPO hyperparameters

A.2 PEARL

Similar to CAVIA, the original PEARL experimental configurations were followed for the half-cheetah benchmarks, and most of the configurations from PEARL's original meta-world experiments were also followed.

Across all PEARL experiments in this work, the learning rate across all neural components (policy, Q, value and context networks) was set to 3e-4, with KL penalty (KL lambda) set to 0.1. Furthermore, for experiments that involved the use of neuromodulation, the neuromodulator was employed only in the policy (actor) neural component. Table 3 highlights some of the PEARL configurations across the evaluation environments.

 
Parameter | 2D Nav | Ch Dir | Ch Vel | ML 1 | ML 45
Number of iterations | 500 | 500 | 500 | 1000 | 1000
Number of train tasks | 40 | 2 | 100 | 50 | 225
Number of test tasks | 40 | 2 | 30 | 50 | 25
Number of initial steps | 1000 | 2000 | 2000 | 4000 | 4000
Number of steps prior | 400 | 1000 | 400 | 750 | 750
Number of steps posterior | 0 | 0 | 0 | 0 | 0
Number of extra posterior steps | 600 | 1000 | 600 | 750 | 750
Reward scale | 5 | 5 | 5 | 10 | 5
Policy network specification |  |  |  |  |
Context vector size | 5 | 5 | 5 | 7 | 7
Network size (policy, Q and value networks) | 300 | 300 | 300 | 300 | 300
Inference (context) network size | 200 | 200 | 200 | 200 | 200
Number of hidden layers (policy, Q, value and context networks) | 3 | 3 | 3 | 3 | 3
Neuromodulator size (neuromodulated policy only) | 4 | 32 | 32 | 32 | 32
Table 3: PEARL experimental configurations.

Appendix B CT-graph

Figure 9: (A) CT-graph depth2 and (B) CT-graph depth3. The coloured legends represent the various state types in the environment.

Each environment instance in the CT-graph is composed of a start state, a crash state, a number of wait states, decision states, end states, and a goal state (one of the end states designated as the goal). A wait state is found between decision states (the tree graph splits at decision states). A wait state requires the agent to take the wait (forward) action, while a decision state requires the agent to take one of the decision (turn) actions. Any decision action at a wait state, or wait action at a decision state, leads to a crash, where the agent is punished with a negative reward of -0.01 and returned to the start. When the agent navigates to the correct end state (the goal location), it receives a positive reward of 1.0. Otherwise, the agent receives a reward of 0.0 at every time step. An episode terminates either at the crash state or when the agent navigates to any end state. The observations are 1D vectors (with full observability of each state) whose length depends on the environment instance configuration.
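The reward rules described above can be summarised as follows. This is an illustrative sketch of a single transition (the next_state attributes are hypothetical), not the environment's actual implementation, which is available in the open-source CT-graph repository.

def ct_graph_reward(next_state):
    # Illustrative reward and termination rules of the CT-graph.
    if next_state.is_crash:
        return -0.01, True                        # crash: small punishment, episode ends
    if next_state.is_end:
        return (1.0, True) if next_state.is_goal else (0.0, True)  # only the goal end state is rewarded
    return 0.0, False                             # all other transitions yield zero reward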

The environment's complexity is defined via a number of configuration parameters that are used to specify the graph size (using the branch b and depth d parameters), sequence length, reward function, and level of state observability. The three CT-graph instances used in this work were set up with a varying depth parameter. The simplest instance has d set to 2 (CT-graph depth2), the next has d set to 3 (CT-graph depth3), and the most complex instance has d set to 4 (CT-graph depth4). Figure 9 depicts a graphical view of CT-graph depth2 and depth3.

Appendix C Analysis Plots

This section presents additional analysis plots of the representation similarity across tasks for the standard and neuromodulated policy networks in the various evaluation environments employed in this work. The additional plots further highlight the usefulness of neuromodulation in facilitating efficient (distinct) representations across tasks in problems of increasing complexity, as shown earlier in Section 5.2.

C.1 Second hidden layer: 2D Navigation and CT-graph depth2 environments

(a) standard policy network, second hidden layer.
(b) neuromodulated policy network, second hidden layer.
Figure 10: Representation similarities between tasks in the 2D Navigation environment.
(a) standard policy network, second hidden layer.
(b) neuromodulated policy network, second hidden layer.
Figure 11: Representation similarities between tasks in the CT-graph-depth2 environment.

2D Navigation: Figure 10 presents the representation similarity between tasks, across inner loop gradient updates, for the second hidden layer of both policy networks in the 2D navigation environment. Similar patterns to those highlighted for the first hidden layer in Figure 6 emerge. Both networks are able to learn good (dissimilar) representations between tasks after a few steps of inner loop gradient updates. Also, both networks already have some level of representation dissimilarity between tasks before any gradient update. This further highlights the fact that the 2D Navigation environment has low complexity and requires very little adaptation of network parameters.

CT-graph depth2: Figure 11 presents the representation similarity between tasks, across inner loop gradient updates, for the second hidden layer of both policy networks in the CT-graph depth2 benchmark. With the increased problem complexity compared to the 2D navigation environment, only the neuromodulated policy network succeeds in learning distinct representations across tasks. The distinct representations thus allow the neuromodulated policy network to adapt optimally across tasks while the standard policy network struggles, as indicated by the performance plot in Figure 5(a).

C.2 CT-graph depth3 and depth4 Environments

(a) standard policy network, first hidden layer.
(b) standard policy network, second hidden layer.
(c) neuromodulated policy network, first hidden layer.
(d) neuromodulated policy network, second hidden layer.
Figure 12: Representation similarities between tasks in the CT-graph depth3 environment.
(a) standard policy network, first hidden layer.
(b) standard policy network, second hidden layer.
(c) neuromodulated policy network, first hidden layer.
(d) neuromodulated policy network, second hidden layer.
Figure 13: Representation similarities between tasks in the CT-graph depth4 environment.

The representation similarity plots (across inner loop gradient updates) between tasks for the hidden layers of both the standard and the neuromodulated policy networks in the CT-graph depth3 and depth4 environment instances are presented in Figures 12 and 13. Again, as observed in Section 5.2, the standard policy network struggles to learn distinct representations for each task, whereas the neuromodulated policy network is able to learn the required task-specific representations. Hence, the neuromodulated policy network is able to perform optimally across tasks. This explains the difference in performance between the policy networks (Figures 5(b) and 5(c)).

C.3 Half-Cheetah and Meta-World Environments

The analysis plots of the CAVIA policies for the half-cheetah and meta-world benchmarks are presented in this section.

(a) standard policy network, first hidden layer.
(b) standard policy network, second hidden layer.
(c) neuromodulated policy network, first hidden layer.
(d) neuromodulated policy network, second hidden layer.
Figure 14: Representation similarities between tasks in the Half-Cheetah Direction environment.
(a) standard policy network, first hidden layer.
(b) standard policy network, second hidden layer.
(c) neuromodulated policy network, first hidden layer.
(d) neuromodulated policy network, second hidden layer.
Figure 15: Representation similarities between tasks in the Half-Cheetah Velocity environment.
(a) standard policy network, first hidden layer.
(b) standard policy network, second hidden layer.
(c) neuromodulated policy network, first hidden layer.
(d) neuromodulated policy network, second hidden layer.
Figure 16: Representation similarities between tasks in the ML1 (meta-world) environment.
(a) standard policy network, first hidden layer.
(b) standard policy network, second hidden layer.
(c) neuromodulated policy network, first hidden layer.
(d) neuromodulated policy network, second hidden layer.
Figure 17: Representation similarities between tasks in the ML45 (meta-world) environment.

Half-Cheetah: Figures 14 and 15 show the representation similarity plots of the standard and neuromodulated policy networks for the half-cheetah direction and velocity environments, respectively. In this setting, where the problems are of simple to medium complexity, both policy networks are able to learn efficient (dissimilar) representations between tasks, further supporting the discussion in Section 5.2. In fact, we observe that the standard policy network learns more dissimilar representations across tasks for the half-cheetah direction environment, due to the simplicity of its tasks compared to the velocity environment.

Meta-World: Figures 16 and 17 depict the representation similarity plots for the ML1 and ML45 meta-world environments. Similar to the observations in Section 5.2, we again observe that as the problem complexity increases (i.e., from ML1 to ML45), the neuromodulated policy network produces better (more dissimilar) representations across the sampled tasks compared to the standard policy network.

Appendix D Implementation

A code snippet demonstrating the extension of the fully connected layer with neuromodulation is presented below using PyTorch code style.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn import Module


class NMLinear(Module):
    def __init__(self, in_features, out_features, nm_features, bias=True, gate=None):
        super(NMLinear, self).__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.nm_features = nm_features
        # Standard path: input -> standard neurons (W_std).
        self.std = nn.Linear(in_features, out_features, bias=bias)
        # Neuromodulatory path: input -> neuromodulators (W_nm).
        self.in_nm = nn.Linear(in_features, nm_features, bias=bias)
        # Projection of the neuromodulatory activity onto the standard neurons (W_proj).
        self.out_nm = nn.Linear(nm_features, out_features, bias=bias)
        self.in_nm_act = F.relu
        self.out_nm_act = torch.tanh
        self.gate = gate

    def forward(self, data, params=None):
        output = self.std(data)                                              # Equation 7
        mod_features = self.in_nm_act(self.in_nm(data))                      # Equation 8
        projected_mod_features = self.out_nm_act(self.out_nm(mod_features))  # Equation 9
        if self.gate == 'strict':
            # Sign-only variation (Equation 11); zero activities default to +1.
            projected_mod_features = torch.sign(projected_mod_features)
            projected_mod_features[projected_mod_features == 0.] = 1.
        output *= projected_mod_features                                     # gating in Equation 10
        return output
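Following the note in Appendix A that only the output layer is a regular linear layer, a neuromodulated policy body could be assembled from the layer above as sketched below (reusing the imports and the NMLinear class defined above). The wrapper class name is illustrative; the hidden and neuromodulator sizes follow the half-cheetah/meta-world settings in Table 1, and ReLU is used between layers as in all experiments.

class NMPolicy(nn.Module):
    # Two neuromodulated hidden layers followed by a regular fully connected output layer.
    def __init__(self, obs_dim, action_dim, hidden_size=200, nm_size=32):
        super().__init__()
        self.layer1 = NMLinear(obs_dim, hidden_size, nm_size)
        self.layer2 = NMLinear(hidden_size, hidden_size, nm_size)
        self.out = nn.Linear(hidden_size, action_dim)

    def forward(self, x):
        x = F.relu(self.layer1(x))  # ReLU non-linearity applied to the gated layer output
        x = F.relu(self.layer2(x))
        return self.out(x)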

The full implementation (including experimental setup and test scripts) is open sourced at https://github.com/anon-6994/nm-metarl. The codebase is an extension of the original CAVIA and PEARL open source (MIT license) implementations that can be found at https://github.com/lmzintgraf/cavia and https://github.com/katerakelly/oyster respectively.

Appendix E Additional Experiments

e.1 CARLA Environment

Additional experiments were conducted in an autonomous driving environment called CARLA (Dosovitskiy et al., 2017) to provide preliminary evidence on whether the method scales to complex RGB input distributions such as those in autonomous driving. Given the limited nature of these experiments and analysis, they are not included in the main paper, but they provide additional validation of the robustness of the proposed approach. CARLA (see Figure 18(a)) is an open source experimentation platform for autonomous driving research. It contains a host of configuration parameters that are used to specify an environment instance (for example, the weather). MACAD (Palanisamy, 2020), a wrapper on top of CARLA with an OpenAI Gym interface, was employed to run the experiments. In this work, the environment was configured to use RGB observations (images of size 64x64x3), discrete actions (coast, turn left, turn right, forward, brake, forward left, forward right, brake left, and brake right), and clear (sunny) noon weather.

(a) A snapshot of the CARLA environment. Note that the view of the environment shown here is for presentation purposes; the actual environment instance used in this work is configured with a first-person view and a smaller observation (image) size.
(b) Adaptation performance of standard policy network (SPN) and neuromodulated policy network (NPN) in the CARLA environment.
Figure 18: CARLA environment and results

The agent (vehicle) is given the goal of navigating from a start position to an end position. The start and end points are randomly set from a pre-defined list of coordinates. We set up two distinct tasks in the environment (drive aggressively and drive passively), defined by their reward functions, which can be sampled from a uniform task distribution. Although the tasks are quite similar, they are challenging due to the domain of the problem (learning to drive) and the RGB pixel observations from the environment. Therefore, it is a suitable environment for further scaling up meta-RL algorithms.

Each experiment processes the environment's observations through a variational autoencoder (VAE) (Kingma and Welling, 2013; Rezende et al., 2014) that was pre-trained using samples collected by taking random actions in the environment. Using CAVIA, the latent features from the VAE were concatenated with the context parameters and then passed as input to the policy network. Only the policy network was updated during meta-training and meta-testing, while the VAE was kept fixed.
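The observation pipeline described above can be sketched as follows. The vae.encode call and the variable names are assumptions for illustration, not the exact interfaces used in the experiments.

import torch

def policy_forward(vae, policy, rgb_obs, context_params):
    # Encode the 64x64x3 RGB observation with the frozen, pre-trained VAE,
    # then concatenate the latent features with the CAVIA context parameters.
    with torch.no_grad():  # the VAE is kept fixed during meta-training and meta-testing
        latent = vae.encode(rgb_obs)
    return policy(torch.cat([latent, context_params], dim=-1))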

Due to the computational load of the environment, both the standard and the neuromodulated policy network were evaluated with a reduced number of iterations, sampled tasks per iteration, and context parameters. For each task, episodes were collected before and after one step of inner loop gradient update. The results are presented in Figure 18(b), with the neuromodulated policy network showing an advantage over the standard policy network. In general, the results show promise for scaling meta-RL algorithms to even more challenging problem domains.

References

  • M. C. Avery and J. L. Krichmar (2017) Neuromodulatory systems and their interactions: a review of models, theories, and experiments. Frontiers in neural circuits 11, pp. 108. Cited by: §2.
  • M. Bear, B. Connors, and M. A. Paradiso (2020) Neuroscience: exploring the brain. Jones & Bartlett Learning, LLC. Cited by: §2.
  • S. Beaulieu, L. Frati, T. Miconi, J. Lehman, K. O. Stanley, J. Clune, and N. Cheney (2020) Learning to continually learn. arXiv preprint arXiv:2002.09571. Cited by: §1, §2.
  • E. Ben-Iwhiwhu, P. Ladosz, J. Dick, W. Chen, P. Pilly, and A. Soltoggio (2020) Evolving inborn knowledge for fast adaptation in dynamic POMDP problems. In Proceedings of the 2020 Genetic and Evolutionary Computation Conference, pp. 280–288. Cited by: §5.
  • S. Bengio, Y. Bengio, J. Cloutier, and J. Gecsei (1992) On the optimization of a synaptic learning rule. In Preprints Conf. Optimality in Artificial and Biological Neural Networks, pp. 6–8. Cited by: §2.
  • K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio (2014) On the properties of neural machine translation: encoder-decoder approaches. arXiv preprint arXiv:1409.1259. Cited by: §6.
  • A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun (2017) CARLA: An open urban driving simulator. In Proceedings of the 1st Annual Conference on Robot Learning, pp. 1–16. Cited by: §E.1.
  • K. Doya (2002) Metalearning and neuromodulation. Neural networks 15 (4-6), pp. 495–506. Cited by: §1, §2, §4.1.
  • Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel (2016a) Benchmarking deep reinforcement learning for continuous control. In International conference on machine learning, pp. 1329–1338. Cited by: §A.1.
  • Y. Duan, J. Schulman, X. Chen, P. L. Bartlett, I. Sutskever, and P. Abbeel (2016b) RL²: fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779. Cited by: §1, §1, §2, §6.
  • C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1126–1135. Cited by: §A.1, §A.1, §1, §1, §2, §3.2, §5.1.1, §5.1.2, §5.
  • A. Gupta, R. Mendonca, Y. Liu, P. Abbeel, and S. Levine (2018) Meta-reinforcement learning of structured exploration strategies. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 5307–5316. Cited by: §1, §2.
  • T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pp. 1861–1870. Cited by: §3.3.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §6.
  • J. Humplik, A. Galashov, L. Hasenclever, P. A. Ortega, Y. W. Teh, and N. Heess (2019) Meta reinforcement learning as task inference. arXiv preprint arXiv:1905.06424. Cited by: §2.
  • D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §E.1.
  • S. Kornblith, M. Norouzi, H. Lee, and G. Hinton (2019) Similarity of neural network representations revisited. In International Conference on Machine Learning, pp. 3519–3529. Cited by: §5.2.
  • P. Ladosz, E. Ben-Iwhiwhu, J. Dick, N. Ketz, S. Kolouri, J. L. Krichmar, P. K. Pilly, and A. Soltoggio (2021) Deep reinforcement learning with modulated hebbian plus q-network architecture. IEEE Transactions on Neural Networks and Learning Systems. Cited by: §5.
  • Z. Li, F. Zhou, F. Chen, and H. Li (2017) Meta-sgd: learning to learn quickly for few-shot learning. arXiv preprint arXiv:1707.09835. Cited by: §2.
  • E. Z. Liu, A. Raghunathan, P. Liang, and C. Finn (2021) Decoupling exploration and exploitation for meta-reinforcement learning without sacrifices. In International Conference on Machine Learning, pp. 6925–6935. Cited by: §2.
  • E. Marder (2012) Neuromodulation of neuronal circuits: back to the future. Neuron 76 (1), pp. 1–11. Cited by: §2.
  • T. Miconi, A. Rawal, J. Clune, and K. O. Stanley (2020) Backpropamine: training self-modifying neural networks with differentiable neuromodulated plasticity. arXiv preprint arXiv:2002.10585. Cited by: §1, §2.
  • N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel (2018) A simple neural attentive meta-learner. In International Conference on Learning Representations, External Links: Link Cited by: §1, §2.
  • P. Palanisamy (2020) Multi-agent connected autonomous driving using deep reinforcement learning. In 2020 International Joint Conference on Neural Networks (IJCNN), pp. 1–7. Cited by: §E.1.
  • K. Rakelly, A. Zhou, C. Finn, S. Levine, and D. Quillen (2019) Efficient off-policy meta-reinforcement learning via probabilistic context variables. In International Conference on Machine Learning, pp. 5331–5340. Cited by: §1, §1, §2, §3.3, §5.1, §5.
  • D. J. Rezende, S. Mohamed, and D. Wierstra (2014) Stochastic backpropagation and approximate inference in deep generative models. In International conference on machine learning, pp. 1278–1286. Cited by: §E.1.
  • J. Rothfuss, D. Lee, I. Clavera, T. Asfour, and P. Abbeel (2019) ProMP: proximal meta-policy search. In International Conference on Learning Representations, External Links: Link Cited by: §2.
  • J. Schmidhuber, J. Zhao, and M. Wiering (1996) Simple principles of metalearning. Technical report IDSIA 69, pp. 1–23. Cited by: §2.
  • J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel (2015) Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML), Cited by: §A.1.
  • J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel (2016) High-dimensional continuous control using generalized advantage estimation. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §A.1.
  • N. Schweighofer and K. Doya (2003) Meta-learning in reinforcement learning. Neural Networks 16 (1), pp. 5–9. Cited by: §2.
  • A. Soltoggio, J. A. Bullinaria, C. Mattiussi, P. Dürr, and D. Floreano (2008) Evolutionary advantages of neuromodulated plasticity in dynamic, reward-based scenarios. In Proceedings of the 11th international conference on artificial life (Alife XI), pp. 569–576. Cited by: §1, §2.
  • A. Soltoggio, P. Durr, C. Mattiussi, and D. Floreano (2007) Evolving neuromodulatory topologies for reinforcement learning-like problems. In 2007 IEEE Congress on Evolutionary Computation, pp. 2471–2478. Cited by: §2.
  • A. Soltoggio, P. Ladosz, E. Ben-Iwhiwhu, and J. Dick (2021) CT-graph environments - lifelong learning machines (l2m). GitHub. Note: https://github.com/soltoggio/CT-graph Cited by: §5.
  • B. C. Stadie, G. Yang, R. Houthooft, X. Chen, Y. Duan, Y. Wu, P. Abbeel, and I. Sutskever (2018) Some considerations on learning to explore via meta-reinforcement learning. arXiv preprint arXiv:1803.01118. Cited by: §2.
  • S. Thrun and L. Pratt (1998) Learning to learn: introduction and overview. In Learning to learn, pp. 3–17. Cited by: §2.
  • E. Todorov, T. Erez, and Y. Tassa (2012) Mujoco: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. Cited by: §5.1.2, §5.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §6.
  • R. Velez and J. Clune (2017) Diffusion-based neuromodulation can eliminate catastrophic forgetting in simple neural networks. PloS one 12 (11), pp. e0187736. Cited by: §2.
  • J. X. Wang, M. King, N. Porcel, Z. Kurth-Nelson, T. Zhu, C. Deck, P. Choy, M. Cassin, M. Reynolds, F. Song, et al. (2021) Alchemy: a structured task distribution for meta-reinforcement learning. arXiv preprint arXiv:2102.02926. Cited by: §6.
  • J. X. Wang, Z. Kurth-Nelson, D. Tirumala, H. Soyer, J. Z. Leibo, R. Munos, C. Blundell, D. Kumaran, and M. Botvinick (2016) Learning to reinforcement learn, 2016. arXiv preprint arXiv:1611.05763. Cited by: §1, §1, §2, §6.
  • R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8 (3-4), pp. 229–256. Cited by: §A.1.
  • J. Xing, X. Zou, and J. L. Krichmar (2020) Neuromodulated patience for robot and self-driving vehicle navigation. In 2020 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. Cited by: §1, §2.
  • T. Yu, D. Quillen, Z. He, R. Julian, K. Hausman, C. Finn, and S. Levine (2020) Meta-world: a benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on Robot Learning, pp. 1094–1100. Cited by: §5.1.3, §5.
  • L. Zintgraf, K. Shiarli, V. Kurin, K. Hofmann, and S. Whiteson (2019) Fast context adaptation via meta-learning. In International Conference on Machine Learning, pp. 7693–7702. Cited by: §A.1, §1, §1, §2, §3.2, §5.1, §5.
  • L. Zintgraf, K. Shiarlis, M. Igl, S. Schulze, Y. Gal, K. Hofmann, and S. Whiteson (2020) VariBAD: a very good method for bayes-adaptive deep rl via meta-learning. In International Conference on Learning Representations, Cited by: §2.
  • X. Zou, S. Kolouri, P. K. Pilly, and J. L. Krichmar (2020) Neuromodulated attention and goal-driven perception in uncertain domains. Neural Networks 125, pp. 56–69. Cited by: §2.