Do What Nature Did To Us: Evolving Plastic Recurrent Neural Networks For Task Generalization

by   Fan Wang, et al.
Baidu, Inc.

While artificial neural networks (ANNs) have been widely adopted in machine learning, researchers are increasingly concerned with the gaps between ANNs and biological neural networks (BNNs). In this paper, we propose a framework named Evolutionary Plastic Recurrent Neural Networks (EPRNN). Inspired by BNNs, EPRNN combines Evolution Strategies, Plasticity Rules, and Recursion-based Learning in one meta-learning framework for generalization to different tasks. More specifically, EPRNN uses nested loops for meta learning: an outer loop searches for optimal initial parameters of the neural network and learning rules; an inner loop adapts to specific tasks. In the inner loop of EPRNN, we attain both long-term memory and short-term memory by combining plasticity with recursion-based learning mechanisms, both of which are believed to be responsible for memristance in BNNs. The inner-loop setting closely simulates that of BNNs, which neither queries any gradient oracle for optimization nor requires the exact forms of learning objectives. To evaluate the performance of EPRNN, we carry out extensive experiments on two groups of tasks: Sequence Predicting and Wheeled Robot Navigating. The experiment results demonstrate the unique advantage of EPRNN compared to state-of-the-art methods based on plasticity and recursion, while yielding comparably good performance against deep-learning-based approaches in the tasks. The experiment results suggest the potential of EPRNN to generalize to a variety of tasks and encourage more efforts in plasticity- and recursion-based learning mechanisms.




1 Introduction

Artificial Neural Networks (ANNs) have achieved huge success in machine learning. Despite being initially inspired by Biological Neural Networks (BNNs), the optimization paradigm for ANNs has diverged from that of BNNs. BNNs are believed to learn through plastic rules (Gerstner et al., 1993), also known as Hebb’s rule (Hebb, 1949). In contrast, the optimization of ANNs is dominated by gradient descent methods, for which there is little evidence in BNNs. Although gradient descent methods are so far the most efficient optimizers for ANNs, their limitations are beginning to attract attention, including catastrophic forgetting, excessive consumption of data, and the need for manual effort in designing target functions. These challenges are becoming the main impediments to the further development of machine intelligence.

Recent studies show that, under proper meta-parameter optimization, both recursion (Santoro et al., 2016; Mishra et al., 2018) and plasticity (Soltoggio et al., 2008; Najarro and Risi, 2020) can be valid alternatives for training ANNs for task generalization. Unlike gradient descent, both learning mechanisms work in an unsupervised manner, without requiring exact forms of learning objectives, and are therefore believed to simulate BNNs more closely.

Recursion-based learning typically employs recurrent neural networks (RNN), LSTM (Hochreiter and Schmidhuber, 1997), and self-attention (Mishra et al., 2018; Chen et al., 2021) layers as learners. Learning takes place within the feed-forward pass: information is updated in the hidden states instead of the parameters. Recursion has been found to be extremely sample efficient in generalized supervised tasks (Santoro et al., 2016), zero-shot generalization in language (Brown et al., 2020), and reinforcement learning (Mishra et al., 2018; Chen et al., 2021) when compared with various types of gradient descent methods. Among those learners, self-attention based learners such as Transformers have state-of-the-art performance, but their computational cost for inference per step grows linearly with the sequence length, which restricts them to relatively short sequences. On the other hand, recurrent learners such as RNN and LSTM have constant per-step inference costs but suffer from poor asymptotic performance. That is, as sequences get longer, performance no longer improves or even deteriorates. This is partly due to the limitation of the memory space. For instance, a recurrent hidden layer of size $n$ has a memory of scale $O(n)$. In contrast, its parameters scale with $O(n^2)$, which makes parameter updating more powerful as a learning mechanism than recursion alone.

Figure 1: An illustration of natural evolution. Evolution takes place in the outer loop, where genomes are mutated and selected, and the population either thrives or becomes extinct based on its fitness. The lifetime of each individual composes the inner loop. At the beginning, the genomes decide the learning mechanisms and initial neural configurations of the brain in the newborn life. As the neural networks interact with the environment through actions and observations, their connections and hidden neuron states are further updated to better adapt to the environment. Plasticity is believed to be an important part of the learning mechanisms. The fitness depends on the learning and adapting capability of each individual.

Prior to our work, evolving plasticity (Soltoggio et al., 2008, 2018; Lindsey and Litwin-Kumar, 2020; Yaman et al., 2021) has been proposed to reproduce natural evolution and plasticity in simulation, as shown in Figure 1. Implementing plasticity is not straightforward: unlike gradient descent, plastic rules are not universal but have to be optimized beforehand, which is not possible without an additional outer-loop optimizer over the inner-loop learning. Evolutionary algorithms (Zhang et al., 2011) are typically applied in the outer loop to search for the meta-parameters shaping the learning rules, which can be regarded as information carried by genomes during evolution. The optimized plasticity rules are then applied in the inner loop to further tune the parameters of the neural network for better adaptation to the environment. Due to the complexity of training, plasticity is insufficiently investigated and its potential is not well understood. Existing works on evolving plasticity mainly focus on feed-forward layers. However, as recursion widely exists in BNNs, it is desirable to also investigate plasticity in recurrent connections.

Another motivation for combining recursion with plasticity is to simultaneously possess the merits of updating information in hidden states and in connection weights. Both recursion and plasticity mechanisms are believed to be relevant to memristance. While recursion, which updates hidden states only, is likely to be related to short-term memories, synaptic plasticity, which updates connection weights, is more related to long-term memories (Bliss and Lømo, 1973; Mel, 2002; Linares-Barranco and Serrano-Gotarredona, 2009). Based on this motivation, we propose a new learning framework named Evolutionary Plastic Recurrent Neural Networks (EPRNN). Specifically, we apply plastic rules in recurrent connections by utilizing the information of time-dependent hidden states. Then, following the natural evolution of intelligence (Fig. 1), in the outer loop we employ evolution strategies to search for proper initial parameters and plastic rules. To validate our proposal, we carry out experiments on two groups of tasks: Sequence Predicting and Wheeled Robot Navigating. We compare the proposed framework with other recursion-based and gradient-based learners. The experiment results show that EPRNN can surpass well-established models such as LSTM while using the most naive recurrent structure. Although the validated model structure is relatively simple, we believe that validating such prototypes is an important step toward more general cases.

2 Related Works

2.1 Meta Learning

Meta learning aims at building learning machines that gain experience using task-specific data over a distribution of tasks. Inspired by human and animal brains, which are born with both embedded skills and the capability of acquiring new skills, meta learning implements two nested learning loops. The outer learning loop optimizes the meta, which typically involves initial parameters (Finn et al., 2017; Song et al., 2019), learning rules (Zoph and Le, 2017; Najarro and Risi, 2020; Pedersen and Risi, 2021), and model structures (Soltoggio et al., 2008; Li and Malik, 2016), over a distribution of tasks. The inner learning loop adapts the model to a specific task by utilizing the meta. According to the inner-loop optimizer, we roughly classify the methods into model-based and parameter-updating methods. Model-based methods do not update the parameters in the inner loop; learning is accomplished by updating the hidden states (e.g., recursion). Parameter-updating methods update the parameters in the inner loop (e.g., gradient descent (MAML), plasticity). From this point of view, our method belongs to both groups. A brief review of typical meta learning paradigms is presented in Table 1.

Methods | Inner Loop | Outer Loop
Memory Augmented NN (Santoro et al., 2016) | Recursion | Gradient
MAML (Finn et al., 2017) | Gradient | Gradient
Reptile (Nichol et al., 2018) | Gradient | Average
Conditional Neural Processes (Garnelo et al., 2018) | Average | Gradient
SNAIL (Mishra et al., 2018) | Attention | Gradient
ES-MAML (Song et al., 2019) | Gradient, Hill Climbing | Evolution
EPMLP (Najarro and Risi, 2020) | Plasticity | Evolution
EPRNN (ours) | Plasticity & Recursion | Evolution
Table 1: A brief review of meta learning methods

2.2 Evolving Plasticity

The learning mechanism of BNNs was initially proposed as Hebb’s rule (Hebb, 1949), the most well-known part of which is “neurons that fire together wire together”. It was further refined by Spike Time-Dependent Plasticity (STDP) (Gerstner et al., 1993), which indicates that the learning signal depends on the temporal patterns of the pre-synaptic and post-synaptic spikes. Learning can also appear in inhibitory connections, known as anti-Hebbian learning (Barlow, 1989). Relationships between STDP and memristance have also been investigated (Linares-Barranco and Serrano-Gotarredona, 2009). Since many of those rules are related to spiking neural networks (Ghosh-Dastidar and Adeli, 2009), simplified rules have been proposed to apply them to ANNs (Soltoggio et al., 2008): given the pre-synaptic neuron state $x$ and post-synaptic neuron state $y$, the connection weight $w$ between $x$ and $y$ is updated by

$$\delta w = m \, (A \cdot x y + B \cdot x + C \cdot y + D) \qquad (1)$$

where $A$, $B$, $C$, and $D$ are the coefficients of the plastic rule and $m$ is the output from neuron modulators that adjust the learning rates of plasticity, which is also inspired by previous discoveries in BNNs.
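As a minimal sketch of such a modulated rule, assuming the common four-coefficient (ABCD) form with scalar neuron states, the update of a single connection could look as follows. The function name `abcd_update` and all numeric values are illustrative assumptions rather than part of any released implementation.

```python
def abcd_update(w, x, y, m, A, B, C, D):
    """Return the weight after one modulated ABCD (Hebbian-style) update.

    w: current connection weight
    x, y: pre-synaptic and post-synaptic neuron states
    m: modulator output, acting as a learning rate for plasticity
    A, B, C, D: coefficients of the plastic rule
    """
    delta = m * (A * x * y + B * x + C * y + D)
    return w + delta


# Pure Hebbian case (A=1, B=C=D=0): the weight changes only when both
# the pre-synaptic and the post-synaptic neuron are active.
w_active = abcd_update(w=0.0, x=1.0, y=0.5, m=0.1, A=1.0, B=0.0, C=0.0, D=0.0)
w_silent = abcd_update(w=0.0, x=0.0, y=0.5, m=0.1, A=1.0, B=0.0, C=0.0, D=0.0)
```

Setting the modulator to a constant recovers a rule without neuromodulation, which is the kind of ablation one would use to isolate the effect of the modulator.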

Since plasticity rules cannot be easily optimized with gradient-based methods (Miconi et al., 2018), evolutionary algorithms (Zhang et al., 2011) are naturally exploited in plasticity-based ANNs (Soltoggio et al., 2018). Some works also investigate the possibility of learning plastic rules without depending on the initial parameters (Najarro and Risi, 2020; Yaman et al., 2021). However, most previous works related to plasticity focus on feed-forward layers only, and few of them provide a thorough comparison with other learners in task generalization.

3 Algorithms

Problem Settings. We consider an agent (learner) that depends on meta-parameters. It is capable of adapting itself to a distribution of tasks by interacting with the environment through observations and actions. In K-shot learning, the agent is first allowed to observe $K$ sample sequences (this stage can be referred to as meta-training-training, see Beaulieu et al. (2020)); its fitness is then calculated in meta-training-testing rollouts. In Generalized Supervised Learning (GSL) tasks, the observations typically include features and labels in the meta-training-training stage, and the labels are left out for prediction in the meta-training-testing stage (Santoro et al., 2016; Garnelo et al., 2018). In Generalized Reinforcement Learning (GRL) tasks, the observations typically include states, actions, and feedback, although the feedback sometimes cannot be observed (Mishra et al., 2018). The goal of meta-training is to optimize the meta-parameters such that the agent achieves higher fitness in meta-training-testing. In meta-testing, similarly, the learned meta-parameters are given meta-testing-training and meta-testing-testing in order, and the performance in meta-testing-testing is evaluated.

Plastic Recurrent Neural Networks (PRNN). Given a sequence of observations, we first consider a recurrent neural network (RNN) that propagates forward and yields a sequence of outputs, carrying its hidden states from one step to the next. In PRNN, we keep the input connection weights stationary but set the recurrent connection weights to be plastic, adding a step subscript to distinguish the recurrent weights at different steps. Regarding the hidden states entering the recurrent connections as pre-synaptic neuron states and the hidden states they produce as post-synaptic neuron states, we update the recurrent weights by applying Equation 1 to every connection. The update uses both element-wise multiplication and matrix multiplication, with the pre-synaptic and post-synaptic states treated as column vectors. The plastic rules of Equation 1 are collected into matrices that have the same shape as the recurrent weights, and a neural modulator adjusts the learning rates of plasticity. We calculate the modulator by applying a neuron modulating layer.

Figure 2: A sketch of the information flow in plastic recurrent neural networks (PRNN)

A sketch of PRNN is presented in Figure 2. The main difference between PRNN and naive RNN is that PRNN updates both the hidden states and the connection weights during the forward pass.
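The following NumPy sketch illustrates one possible PRNN step under stated assumptions: a tanh activation, recurrent weights stored with rows indexing post-synaptic neurons, and a sigmoid dense layer on the new hidden state as the modulating layer. The name `prnn_step`, the argument layout, and the initialization scales are hypothetical choices made for illustration only.

```python
import numpy as np


def prnn_step(h, obs, W_h, W_i, b, rule, W_m, b_m):
    """One PRNN step: update the hidden state, then the plastic recurrent weights."""
    A, B, C, D = rule                      # each has the same shape as W_h
    x = h                                  # pre-synaptic states (previous hidden state)
    y = np.tanh(W_h @ x + W_i @ obs + b)   # post-synaptic states (new hidden state)
    # Assumed modulating layer: sigmoid dense layer, one value per post-synaptic neuron.
    m = 1.0 / (1.0 + np.exp(-(W_m @ y + b_m)))
    ones = np.ones_like(x)
    # Matrix form of the ABCD rule: entry (i, j) uses y[i] and x[j].
    delta = (A * np.outer(y, x) + B * np.outer(ones, x)
             + C * np.outer(y, ones) + D)
    W_h_next = W_h + m[:, None] * delta
    return y, W_h_next


rng = np.random.default_rng(0)
n, d = 4, 3                                # hidden size and observation size
W_h = 0.1 * rng.standard_normal((n, n))
W_i = 0.1 * rng.standard_normal((n, d))
b = np.zeros(n)
rule = tuple(0.01 * rng.standard_normal((n, n)) for _ in range(4))
W_m = 0.1 * rng.standard_normal((n, n))
b_m = np.zeros(n)

h = np.zeros(n)
W_h0 = W_h.copy()
for obs in rng.standard_normal((5, d)):    # a short sequence of observations
    h, W_h = prnn_step(h, obs, W_h, W_i, b, rule, W_m, b_m)
```

After the loop, both the hidden state and the recurrent weights differ from their initial values, which is the property that separates PRNN from a naive RNN in this sketch.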

Evolving PRNN. Given a task, by repeatedly applying Equations 2 to 7 over meta-training-training and meta-training-testing, the fitness eventually depends on the initial parameters, the learning rules, and the sampled task.

Following Evolution Strategies (ES) (Salimans et al., 2017), in each outer-loop iteration we sample different tasks from the task distribution and sample meta-parameters from the neighbourhood of the current meta-parameters. We evaluate the fitness of the sampled meta-parameters and use it to update the current meta-parameters.
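The outer-loop update can be sketched as follows, in the style of the ES estimator of Salimans et al. (2017). This is an illustration under assumptions: the function `es_step`, the toy fitness, the population size, the noise scale, and the learning rate are all hypothetical, and fitness is standardized here for simplicity instead of being rank-normalized.

```python
import numpy as np


def es_step(theta, fitness_fn, rng, pop_size=50, sigma=0.1, lr=0.01):
    """One ES iteration: perturb theta, score each perturbation, and step
    along the fitness-weighted average of the perturbations."""
    eps = rng.standard_normal((pop_size, theta.size))
    fitness = np.array([fitness_fn(theta + sigma * e) for e in eps])
    # Standardize fitness so the step size does not depend on its scale.
    adv = (fitness - fitness.mean()) / (fitness.std() + 1e-8)
    grad = eps.T @ adv / (pop_size * sigma)
    return theta + lr * grad


# Toy fitness standing in for an inner-loop rollout: negative squared
# distance to a target that the optimizer never sees directly.
target = np.array([1.0, -2.0])


def toy_fitness(theta):
    return -float(np.sum((theta - target) ** 2))


rng = np.random.default_rng(0)
theta = np.zeros(2)
for _ in range(200):
    theta = es_step(theta, toy_fitness, rng)
```

In the meta-learning setting, `fitness_fn` would run a full inner loop (meta-training-training followed by meta-training-testing) on sampled tasks, so no gradient of the inner loop is ever required.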


Why Recurrent Neural Networks? As stated in Equation 1, plasticity in feed-forward-only NNs allows the network to gain experience from single-frame observations only. In non-sequential GSL, plasticity has a chance to tune the connection weights to the specific task from a single frame of data, since that frame contains complete information about both the feature and the supervision. In general, however, learning cannot happen without summarizing sequences of observations. For instance, a human driver gets used to a new car through continuous interaction and observation. Moreover, in GRL there is a time lag between the observation of states and the feedback. Recursion helps to summarize historical observations in order to give the correct update for the connection weights.

Although there are clearly more sophisticated neural structures than the naive RNN, such as GRU and LSTM, we believe it is more desirable to start from the simplest recurrent structure to study the potential of combining recursion and plasticity.

4 Experiments

4.1 Tasks for Evaluation

In generalized tasks, each task depends on some configuration parameters that are hidden from the agent. Below we introduce the two groups of generalized tasks that we experiment on.

Sequence Predicting (Generalized Supervised Learning tasks). We randomly generate sequences of vectors whose generating parameters are drawn from uniform distributions and are hidden from the agent. The front part of each sequence is exposed to the agent, and the remaining part is to be predicted. The fitness is the negative mean square error (MSE) between the predicted sequence and the ground truth. We compare the methods in four groups of tasks: (l=1,K=10,N=20), (l=1,K=25,N=50), (l=3,K=10,N=20), and (l=3,K=25,N=50) (see Figure 3(a)(b)).
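The evaluation protocol for this task can be sketched as follows: the learner is fed the exposed front part of the sequence, then predicts the rest while consuming its own previous prediction, and the fitness is the negative MSE. The interface `step_fn(inp, state) -> (prediction, state)` and the `repeat_last` baseline are hypothetical and serve only to make the sketch runnable.

```python
import numpy as np


def rollout_fitness(step_fn, state, sequence, k):
    """Negative MSE of a learner on the part of `sequence` after the first k frames.

    step_fn(inp, state) returns (prediction of the next frame, new state).
    Requires 0 < k < len(sequence).
    """
    pred = np.zeros_like(sequence[0])
    for frame in sequence[:k]:            # exposed front part: ground truth as input
        pred, state = step_fn(frame, state)
    errors = []
    for truth in sequence[k:]:            # prediction phase: feed back own output
        errors.append(np.mean((pred - truth) ** 2))
        pred, state = step_fn(pred, state)
    return -float(np.mean(errors))


def repeat_last(inp, state):
    """Trivial stateless baseline: predict that the next frame equals the input."""
    return inp, state


ramp = np.arange(6, dtype=float).reshape(6, 1)   # frames 0, 1, ..., 5
fitness_ramp = rollout_fitness(repeat_last, None, ramp, k=3)
fitness_const = rollout_fitness(repeat_last, None, np.ones((6, 1)), k=3)
```

The baseline is perfect on a constant sequence and increasingly wrong on the ramp, since it keeps predicting the last observed value.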

Wheeled Robot Navigating (Generalized Reinforcement Learning tasks). The agent navigates a two-wheeled robot to a randomly generated goal in 2-D space. We assume that there is a signal transmitter on the goal and a receiver on the robot. The robot observes a signal intensity that depends on the current distance between the robot and the goal (inspired by the attenuation of electromagnetic waves, see Friis (1946)) and is corrupted by white noise. The environment-related configurations of the signal are hidden from the agent and are sampled anew for each task. The action is the rotation speed of the two wheels, which controls the orientation and velocity of the robot. A reward is given at each step, and an episode terminates when the robot approaches the goal or the number of steps reaches the maximum of 100. We also hide the robot's position and orientation from the agent, so that the agent must rely on recording its own actions and the signal strength to locate itself. We investigate three navigating circumstances with different levels of noise in the observed signal: Low Noise, Median Noise, and High Noise (see Figure 3(c)).

(a) Sequence Predicting (l=1,K=25,N=25)
(b) Sequence Predicting (l=3,K=25,N=25)
(c) Wheeled Robot Navigating
Figure 3: A sketch of the trajectories of three sampled tasks.

4.2 Experiment Settings

We include the following methods in the comparison. The methods share exactly the same outer loop and differ in the inner loop.

  • ES-MAML (Song et al., 2019): We use four gradient descent steps in the inner loop; the learning rate of each step is treated as a meta-parameter to be optimized by the outer loop. As MAML cannot utilize instant observations in the zero-shot case, we show results of both ES-MAML (zero-shot) and ES-MAML (one-rollout) in Wheeled Robot Navigating. Except for ES-MAML, the other methods are measured with the zero-shot meta-testing score only in Wheeled Robot Navigating.

  • ES-RNN: Vanilla RNN as the inner loop learner.

  • ES-LSTM: Long Short Term Memory (Hochreiter and Schmidhuber, 1997) as the inner loop learner.

  • EPMLP (Soltoggio et al., 2018): Multi-Layer Perceptrons (MLP) with plasticity rules implemented.

  • EPMLP (Random) (Najarro and Risi, 2020): The main difference from EPMLP is that the parameters of the plastic layers are set randomly at the beginning of each inner loop instead of using fixed trainable initial parameters.

  • EPRNN (w/o m): Removing the neuron modulator ($m$) from the plasticity rule in PRNN.

We add non-plastic fully connected layers before and after the plastic layers to increase the representation capability of the models. For fairness, we keep those layers identical for all compared methods. After every 100 outer loops in meta-training, we add an extra meta-testing epoch that evaluates the average fitness of the current meta-parameters on testing tasks. Each run includes 15000 outer loops and 150 meta-testing epochs. Each result is drawn from 3 independent runs. Our code relies on PARL for parallelization. We leave the detailed illustration of the model structures and hyper-parameters to the Appendices.

4.3 Results

We present the experiment results in Figure 4. In Table 2 and Table 3, we also list the summarized performance, obtained by averaging the Top-3 meta-testing scores in the latest 10 meta-testing epochs of each run over 3 independent runs. In general, PRNN performs substantially better than the naive RNN. In some cases it even produces better results than LSTM, despite its simpler model architecture. It is also interesting to notice that the gap between RNN and PRNN is smaller for shorter sequences or low-noise environments, but larger in more challenging tasks with longer sequences or higher noise (in Wheeled Robot Navigating tasks, higher noise pushes the agent to maintain a longer memory in order to filter the noise and figure out the way to the goal). This phenomenon reaffirms the lack of long-term memory in RNN and shows that PRNN significantly mitigates this drawback. Comparing EPRNN (w/o m) with RNN and EPRNN shows that the simple ABCD rule (without the neural modulator) already works to some extent, but introducing the neuron modulator further benefits the learner.
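For concreteness, the summarization described above can be sketched as follows, assuming the meta-testing scores are stored as an array with one row per run and one column per meta-testing epoch. The function name `summarize_runs` and the example scores are illustrative assumptions.

```python
import numpy as np


def summarize_runs(scores, last=10, top=3):
    """Average the top `top` scores among the last `last` epochs of each run.

    scores: array-like of shape (runs, epochs) holding meta-testing scores.
    """
    scores = np.asarray(scores, dtype=float)
    recent = scores[:, -last:]                  # latest meta-testing epochs
    best = np.sort(recent, axis=1)[:, -top:]    # highest scores per run
    return float(best.mean())                   # average over runs


# Two hypothetical runs whose scores rise from 0 to 11 over 12 epochs.
example = [list(range(12)), list(range(12))]
summary = summarize_runs(example)
```

Because every run contributes the same number of scores, averaging per run and then across runs equals averaging all selected scores at once.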

Among plasticity-based methods, we show that recurrent structures are more advantageous than MLP in the evaluated tasks. It is also worth noticing that EPMLP and EPMLP (Random) consistently outperform the gradient-based learner (ES-MAML). In ES-MAML, the gradient can only be calculated after an episode is completed, while EPMLP is able to perform sequential learning even though no feedback is available during its lifetime. This demonstrates the possibility of surpassing human-designed gradient-based learning rules with automatically learned unsupervised rules. Moreover, the comparison between EPMLP and EPMLP (Random) validates the proposal of Najarro and Risi (2020), implying the possibility of discovering global plastic learning rules instead of rules coupled with the initial parameters. Yet we see that optimizing the initial parameters is still advantageous, which is consistent with evidence in nature that newborns already have certain embedded skills (e.g., newborn human babies have suckling and grasping reflexes; foals can stand shortly after being born).

(a) Sequence Predicting (l=1,K=10,N=20)
(b) Sequence Predicting (l=3,K=10,N=20)
(c) Sequence Predicting (l=1,K=25,N=50)
(d) Sequence Predicting (l=3,K=25,N=50)
(e) Wheeled Robot Navigating (Low Noise)
(f) Wheeled Robot Navigating (Median Noise)
(g) Wheeled Robot Navigating (High Noise)
Figure 4: Plotting meta-testing scores against meta-training iterations.
Table 2: Summarized performances comparison in Sequence Predicting tasks (columns: l=1,K=10,N=20; l=1,K=25,N=50; l=3,K=10,N=20; l=3,K=25,N=50).
Table 3: Summarized performances comparison in Wheeled Robot Navigating tasks (columns: Low Noise; Median Noise; High Noise).

4.4 Analysis

(a)-(c) Varying one hidden task configuration at a time for task Sequence Predicting (l=1,K=25,N=50)
(d) Varying settings for task Wheeled Robot Navigating (Low Noise)
Figure 5: t-SNE visualization of the trajectories of the plastic connection weights.

To investigate whether plasticity rules update the connection weights as expected, we test the trained model on different tasks and record the updating trajectories of the plastic connection weights. As those weights are relatively high-dimensional, we run t-SNE to map them to 2-D space and show their trajectories in Figure 5. We found that the weights start from the same position and gradually move in different directions depending on the task. The final weights effectively capture the environmental configuration that was hidden from the agent. In particular, in Figure 5(d) for Wheeled Robot Navigating tasks, we can see that the weights capture only the signal transmission patterns but neglect the position of the goal. We conjecture that the signal transmission patterns are important stationary patterns that help the agent interpret the observed signal strength, while the absolute position of the goal is less important because its position relative to the robot changes continuously. This clearly demonstrates that plasticity makes meaningful, task-dependent updates to the connection weights.

5 Conclusions

In this paper we present a nature-inspired learning framework composed of Evolution Strategies, Plastic rules, and Recursion. Experiment results show that plasticity can be effectively combined with recursion to enhance the learning capability. By applying the simplest recurrent neural structure, the proposed framework achieves equivalent or even better performance compared with more sophisticated neural structures. Moreover, we also show that, under proper meta-parameters, plasticity has a chance to surpass gradient descent methods.

We believe the learning framework of Figure 1 can be extended to more sophisticated plastic rules and model structures, uncovering better learners in the future. It would also be interesting to validate such a framework in more complex environments such as natural language processing and vision-related tasks. Finally, we hope that this work can shed light on a new paradigm of building intelligent machines and inspire more efforts in this line of research.


References

  • H. B. Barlow (1989) Adaptation and decorrelation in the cortex. The computing neuron. Cited by: §2.2.
  • S. Beaulieu, L. Frati, T. Miconi, J. Lehman, K. O. Stanley, J. Clune, and N. Cheney (2020) Learning to continually learn. In ECAI 2020, pp. 992–1001. Cited by: §3.
  • T. V. Bliss and T. Lømo (1973) Long-lasting potentiation of synaptic transmission in the dentate area of the anaesthetized rabbit following stimulation of the perforant path. The Journal of physiology 232 (2), pp. 331–356. Cited by: §1.
  • T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. (2020) Language models are few-shot learners. arXiv preprint arXiv:2005.14165. Cited by: §1.
  • L. Chen, K. Lu, A. Rajeswaran, K. Lee, A. Grover, M. Laskin, P. Abbeel, A. Srinivas, and I. Mordatch (2021) Decision transformer: reinforcement learning via sequence modeling. arXiv preprint arXiv:2106.01345. Cited by: §1.
  • C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pp. 1126–1135. Cited by: §2.1, Table 1.
  • H. T. Friis (1946) A note on a simple transmission formula. Proceedings of the IRE 34 (5), pp. 254–256. Cited by: §4.1.
  • M. Garnelo, D. Rosenbaum, C. Maddison, T. Ramalho, D. Saxton, M. Shanahan, Y. W. Teh, D. Rezende, and S. A. Eslami (2018) Conditional neural processes. In International Conference on Machine Learning, pp. 1704–1713. Cited by: Table 1, §3.
  • W. Gerstner, R. Ritz, and J. L. Van Hemmen (1993) Why spikes? hebbian learning and retrieval of time-resolved excitation patterns. Biological cybernetics 69 (5), pp. 503–515. Cited by: §1, §2.2.
  • S. Ghosh-Dastidar and H. Adeli (2009) Spiking neural networks. International journal of neural systems 19 (04), pp. 295–308. Cited by: §2.2.
  • D. O. Hebb (1949) The organization of behavior; a neuropsychological theory. A Wiley Book in Clinical Psychology 62, pp. 78. Cited by: §1, §2.2.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §1, 3rd item.
  • K. Li and J. Malik (2016) Learning to optimize. arXiv preprint arXiv:1606.01885. Cited by: §2.1.
  • B. Linares-Barranco and T. Serrano-Gotarredona (2009) Memristance can explain spike-time-dependent-plasticity in neural synapses. Nature precedings, pp. 1–1. Cited by: §1, §2.2.
  • J. Lindsey and A. Litwin-Kumar (2020) Learning to learn with feedback and local plasticity. arXiv preprint arXiv:2006.09549. Cited by: §1.
  • B. W. Mel (2002) Have we been hebbing down the wrong path?. Neuron 34 (2), pp. 175–177. Cited by: §1.
  • T. Miconi, K. Stanley, and J. Clune (2018) Differentiable plasticity: training plastic neural networks with backpropagation. In International Conference on Machine Learning, pp. 3559–3568. Cited by: §2.2.
  • N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel (2018) A simple neural attentive meta-learner. In International Conference on Learning Representations, Cited by: §1, §1, Table 1, §3.
  • E. Najarro and S. Risi (2020) Meta-learning through hebbian plasticity in random networks. arXiv preprint arXiv:2007.02686. Cited by: §1, §2.1, §2.2, Table 1, 5th item, §4.3.
  • A. Nichol, J. Achiam, and J. Schulman (2018) On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999. Cited by: Table 1.
  • J. W. Pedersen and S. Risi (2021) Evolving and merging hebbian learning rules: increasing generalization by decreasing the number of rules. arXiv preprint arXiv:2104.07959. Cited by: §2.1.
  • T. Salimans, J. Ho, X. Chen, S. Sidor, and I. Sutskever (2017) Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864. Cited by: §A.1, §3.
  • A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap (2016) Meta-learning with memory-augmented neural networks. In International conference on machine learning, pp. 1842–1850. Cited by: §1, §1, Table 1, §3.
  • A. Soltoggio, J. A. Bullinaria, C. Mattiussi, P. Dürr, and D. Floreano (2008) Evolutionary advantages of neuromodulated plasticity in dynamic, reward-based scenarios. In Proceedings of the 11th international conference on artificial life (Alife XI), pp. 569–576. Cited by: §1, §1, §2.1, §2.2.
  • A. Soltoggio, K. O. Stanley, and S. Risi (2018) Born to learn: the inspiration, progress, and future of evolved plastic artificial neural networks. Neural Networks 108, pp. 48–67. Cited by: §1, §2.2, 4th item.
  • X. Song, W. Gao, Y. Yang, K. Choromanski, A. Pacchiano, and Y. Tang (2019) ES-maml: simple hessian-free meta learning. In International Conference on Learning Representations, Cited by: §2.1, Table 1, 1st item.
  • A. Yaman, G. Iacca, D. C. Mocanu, M. Coler, G. Fletcher, and M. Pechenizkiy (2021) Evolving plasticity for autonomous learning under changing environmental conditions. Evolutionary computation 29 (3), pp. 391–414. Cited by: §1, §2.2.
  • J. Zhang, Z. Zhan, Y. Lin, N. Chen, Y. Gong, J. Zhong, H. S. Chung, Y. Li, and Y. Shi (2011) Evolutionary computation meets machine learning: a survey. IEEE Computational Intelligence Magazine 6 (4), pp. 68–75. Cited by: §1, §2.2.
  • B. Zoph and Q. V. Le (2017) Neural architecture search with reinforcement learning. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, Cited by: §2.1.

Appendix A

A.1 Model Architecture and Training Details

To maintain fairness in comparison, the model architectures of the compared methods are similar; they differ in only one layer (Figure 6). We use 3 hidden layers for Sequence Predicting tasks and 4 hidden layers for Wheeled Robot Navigating tasks. For ES-MAML, we replace the PRNN layer with a fully connected layer; for ES-RNN and ES-LSTM, we replace it with an RNN and an LSTM layer respectively; for EPMLP, we replace it with a plastic fully connected layer, where the plasticity rule is given by Equation 1 and the neural modulator is calculated by an additional dense layer with sigmoid activation. The hidden sizes of all the hidden layers are (for LSTM it is hidden states and cell states).

(a) Sequence Predicting
(b) Wheeled Robot Navigating
Figure 6: A sketch of the model architectures for evaluated tasks

For Sequence Predicting tasks, the input observation and the output action share the dimension of the sequence vectors. In the meta-training-training and meta-testing-training stages, we use the ground truth as the input observation; in the meta-training-testing and meta-testing-testing stages, the agent's previous action is taken as the input. For Wheeled Robot Navigating tasks, the output action is the control command of the two wheels (a vector of length 2), and the input observation is the concatenation of the previous action and the currently observed signal intensity.

Figure 7: A sketch of the parallel training framework.

The training process is accelerated by the parallelization mechanism of PARL. We employ 400 workers (400 Intel(R) Xeon(R) CPU E5-2650) running inner loops for 400 offspring in each generation of the evolution, and 1 additional CPU to perform the evolutionary update (shown in Figure 7(b)). It takes to hours for each run depending on the length of the inner loop and model architectures. Following previous work (Salimans et al., 2017), we rank-normalize the fitness among the 400 workers. The learning rate is set to . The mutation is performed by adding independent Gaussian noise to each parameter. During meta-training, we sample different tasks and 400 meta-parameters. Each worker runs meta-training-training and meta-training-testing given the assigned meta-parameter, then the fitness is averaged. We set for Sequence Predicting tasks, and for Wheeled Robot Navigating tasks. For each meta-testing epoch, we evaluate the current meta-parameters on newly sampled tasks.
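Rank normalization of fitness can be sketched as below, assuming the centered-rank variant commonly paired with ES, in which fitness values are replaced by evenly spaced ranks in [-0.5, 0.5]. The function name `rank_normalize` and the example values are illustrative assumptions.

```python
import numpy as np


def rank_normalize(fitness):
    """Replace fitness values by centered ranks in [-0.5, 0.5].

    The worst individual maps to -0.5 and the best to 0.5, so the update
    depends only on the ordering of the offspring, not on the fitness scale.
    Requires at least two fitness values.
    """
    fitness = np.asarray(fitness, dtype=float)
    ranks = np.empty(len(fitness))
    ranks[np.argsort(fitness)] = np.arange(len(fitness))
    return ranks / (len(fitness) - 1) - 0.5


# An outlier such as 200.0 has no more influence than any other best score.
normalized = rank_normalize([10.0, -3.0, 200.0])
```

This makes the evolutionary update robust to outliers among the offspring, at the cost of discarding the magnitude of fitness differences.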