Agents controlled by neural networks and trained through reinforcement learning (RL) have proven to be capable of solving complex tasks Vinyals et al. (2019); Berner et al. (2019); Arulkumaran et al. (2017). However, once trained, the neural network weights of these agents are typically static, so their behaviour remains largely inflexible, showing limited adaptability to unseen conditions or information. These solutions, whether found by gradient-based methods or black-box optimization algorithms, are often immutable and overly specific to the problem they have been trained to solve Zhang et al. (2018); Justesen et al. (2018). When applied to a different task, these networks need to be retrained, requiring many extra iterations.
Unlike artificial neural networks, biological agents display remarkable levels of adaptive behavior and can learn rapidly Morgan (1882); Macphail (1982). Although the underlying mechanisms are not fully understood, it is well established that synaptic plasticity plays a fundamental role Liu et al. (2012); Martin et al. (2000). For example, many animals can walk shortly after being born, without any explicit supervision or reward signals, seamlessly adapting to their individual bodies. Different plasticity-regulating mechanisms have been suggested, which can be grouped into two main ideal-type families: end-to-end mechanisms, which involve top-down feedback propagating errors Sacramento et al. (2018), and local mechanisms, which rely solely on local activity to regulate the dynamics of the synaptic connections. The earliest proposed version of a purely local mechanism is known as Hebbian plasticity, which in its simplest form states that the synaptic strength between neurons changes proportionally to the correlation of activity between them Hebb (1949).
The rigidity of non-plastic networks and their inability to keep learning once trained can partially be attributed to them traditionally having both a fixed neural architecture and a static set of synaptic weights. In this work we are therefore interested in algorithms that search for plasticity mechanisms that allow agents to adapt during their lifetime Ben-Iwhiwhu et al. (2020); Soltoggio et al. (2018); Bengio et al. (1991); Miconi et al. (2018). While recent work in this area has focused on determining both the weights of the network and the plasticity parameters, we are particularly intrigued by the interesting properties of randomly-initialised networks in both machine learning Jaeger and Haas (2004); Schmidhuber et al. (2007); Ulyanov et al. (2018) and neuroscience Lindsay et al. (2017). Therefore, we propose to search for plasticity rules that work with randomly initialised networks purely through a process of self-organisation.
To accomplish this, we optimize for connection-specific Hebbian learning rules that allow an agent to find high-performing weights for non-trivial reinforcement learning tasks without any explicit reward during its lifetime. We demonstrate our approach on two continuous control tasks and show that such a network reaches a higher performance than a fixed-weight network in a vision-based RL task. In a 3-D locomotion task, the Hebbian network is able to adapt to damage to the morphology of a simulated quadrupedal robot, while a fixed-weight network fails to do so. In contrast to fixed-weight networks, the weights of the Hebbian networks vary continuously during the lifetime of the agent; the evolved plasticity rules give rise to the emergence of an attractor in weight phase-space, which results in the network quickly converging to a high-performing set of weights.
We hope that our demonstration of random Hebbian networks will inspire more work in neural plasticity that challenges current assumptions in reinforcement learning; instead of agents starting deployment with finely-tuned and frozen weights, we advocate for the use of more dynamical neural networks, which might display dynamics closer to their biological counterparts. Interestingly, we find that the discovered Hebbian networks are remarkably robust and can even recover from having a large part of their weights zeroed out.
While we focus here on exploring the potential of Hebbian plasticity to master reinforcement learning problems, local plasticity rules are capable of explaining neurobiological data Richards et al. (2019). Therefore, demonstrating that random networks, solely optimised through local rules, are capable of reaching competitive performance in complex tasks with different sensory modalities may hint at biologically plausible learning rules occurring in the brain and promote a reinforcement-learning framework to study how biological agents learn Pozzi et al. (2018).
2 Related work
Meta-learning. The aim in meta-learning or learning-to-learn Thrun and Pratt (1998); Schmidhuber (1992) is to create agents that can learn quickly from ongoing experience. A variety of different methods for meta-learning already exist Ravi and Larochelle (2017); Zoph and Le (2016); Schmidhuber (1993a); Wang et al. (2016); Finn et al. (2017); Fernando et al. (2018). For example, Wang et al. (2016) showed that a recurrent LSTM network Hochreiter and Schmidhuber (1997) can learn to reinforcement learn. In their work, the policy network connections stay fixed during the agent’s lifetime and learning is achieved through changes in the hidden state of the LSTM. While most approaches, such as the work by Wang et al. (2016), take the environment’s reward as input in the inner loop of the meta-learning algorithms (either as input to the neural network or to adjust the network’s weights), we do not give explicit rewards during the agent’s lifetime in the work presented here.
Typically, during meta-training, networks are trained on a number of different tasks and then tested on their ability to learn new tasks. A recent trend in meta-learning is to find good initial weights (e.g. through gradient descent Finn et al. (2017) or evolution Fernando et al. (2018)), from which adaptation can be performed in a few iterations. One such approach is Model-Agnostic Meta-Learning (MAML) Finn et al. (2017), which allows simulated robots to quickly adapt to different goal directions.
A less explored meta-learning approach is the evolution of plastic networks that undergo changes at various timescales, such as in their neural connectivity while experiencing sensory feedback. These evolving plastic networks are motivated by the promise of discovering principles of neural adaptation, learning, and memory Soltoggio et al. (2018). They enable agents to perform a type of meta-learning by adapting during their lifetime through evolving recurrent networks that can store activation patterns Beer and Gallagher (1992) or by evolving forms of local Hebbian learning rules that change the network’s weights based on the correlated activation of neurons (“what fires together wires together”). Instead of relying on Hebbian learning rules, early work Bengio et al. (1991) explored optimizing the parameters of a parameterised learning rule that is applied to all connections in the network. Most related to our approach is early work by Floreano and Urzelai (2000), who explored the idea of starting networks with random weights and then applying Hebbian learning. This approach demonstrated the promise of evolving Hebbian rules but was restricted to only four different types of Hebbian rules and small networks (12 neurons, 144 connections) applied to a simple robot navigation task.
Instead of training local learning rules through evolutionary optimization, recent work showed it is also possible to optimize the plasticity of individual synaptic connections through gradient descent Miconi et al. (2018). However, while the trainable parameters in their work only determine how plastic each connection is, the black-box optimization approach employed in this paper allows each connection to implement its own Hebbian learning rule.
Self-Organization. Self-organization plays a critical role in many natural systems Camazine et al. (2003) and is an active area of research in complex systems. It has also recently been gaining prominence in machine learning, with graph neural networks being a noteworthy example Wu et al. (2020). The recent work by Mordvintsev et al. (2020) on growing cellular automata through local rules encoded by a neural network has interesting parallels to the work we present here; in their work it is the growth of 2D images that relies on self-organization, while in ours it is the network’s weights. A benefit of self-organizing systems is that they are very robust and adaptive. The goal of our proposed approach is to take a step towards similar levels of robustness for neural network-based RL agents.
Neuroscience. In biological nervous systems, the weakening and strengthening of synapses through synaptic plasticity is assumed to be one of the key mechanisms for long-term learning Liu et al. (2012); Martin et al. (2000). Evolution shaped these learning mechanisms over long timescales, allowing efficient learning during our lives. What is clear is that the brain can rewire itself based on experiences we undergo during our lifetime Grossberg (2012). Additionally, animals are born with a highly structured brain connectivity that allows them to learn quickly from birth Zador (2019). However, the importance of random connectivity in biological brains is less well understood. For example, random connectivity seems to play a critical role in the prefrontal cortex Maass et al. (2002), allowing an increase in the dimensionality of neural representations. Interestingly, it was only recently shown that these theoretical models matched experimental data better when random networks were combined with simple Hebbian learning rules Lindsay et al. (2017). We take inspiration from this theoretical work, showing that random networks combined with Hebbian learning can also enable more robust meta-learning approaches.
3 Meta-learning through Evolved Local Learning Rules
The main steps of our approach can be summarized as follows: (1) An initial population of neural networks with random synapse-specific learning rules is created, (2) each network is initialised with random weights and evaluated on a task based on its accumulated episodic reward, with the network weights changing at each timestep following the discovered learning rules, and (3) a new population is created through an evolution strategy Salimans et al. (2017), moving the learning-rule parameters towards rules with higher cumulative rewards. The algorithm then starts again at (2), with the goal of progressively discovering more and more efficient learning rules that can work with arbitrarily initialised networks.
In more detail, the synapse-specific learning rules in this paper are inspired by biological Hebbian mechanisms. We use a generalized Hebbian ABCD model Soltoggio et al. (2007); Niv et al. (2001) to control the synaptic strength between the artificial neurons of relatively simple feedforward networks. Specifically, the weights of the agent are randomly initialized and updated during its lifetime at each timestep following:
$\Delta w_{ij} = \eta_w \cdot (A_w\, o_i\, o_j + B_w\, o_i + C_w\, o_j + D_w), \quad (1)$

where $w_{ij}$ is the weight between neurons $i$ and $j$, $\eta_w$ is the evolved learning rate, $A_w$ the evolved correlation term, $B_w$ the evolved presynaptic term, and $C_w$ the evolved postsynaptic term, with $o_i$ and $o_j$ being the presynaptic and postsynaptic activations respectively. While the coefficients $A_w$, $B_w$, and $C_w$ explicitly determine the local dynamics of the network weights, the evolved coefficient $D_w$ can be interpreted as an individual inhibitory/excitatory bias of each connection in the network. In contrast to previous work, our approach is not limited to uniform plasticity Ba et al. (2016); Schmidhuber (1993b) (i.e. each connection has the same amount of plasticity) or being restricted to only optimizing a connection-specific plasticity value Miconi et al. (2018). Instead, building on the ability of recent evolution strategy implementations to scale to a large number of parameters Salimans et al. (2017), our approach allows each connection in the network to have both a different learning rule and learning rate.
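The per-connection ABCD update of Equation 1 can be written in a few lines of numpy. The vectorised sketch below is an illustration of the rule under our notation, not the exact implementation used in our experiments; the function name and shapes are ours:

```python
import numpy as np

def hebbian_update(w, o_pre, o_post, eta, A, B, C, D):
    """One ABCD Hebbian step for a single fully connected layer.

    w            : (n_pre, n_post) weight matrix
    o_pre, o_post: pre- and postsynaptic activation vectors
    eta, A..D    : per-connection evolved coefficients, same shape as w
    """
    # Outer product gives the correlation term o_i * o_j for every connection.
    corr = np.outer(o_pre, o_post)
    dw = eta * (A * corr
                + B * o_pre[:, None]    # presynaptic term
                + C * o_post[None, :]   # postsynaptic term
                + D)                    # per-connection inhibitory/excitatory bias
    return w + dw
```

Because every connection carries its own (η, A, B, C, D), the number of evolved parameters is five times the number of plastic weights, which matches the coefficient counts reported in Section 4.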
We hypothesize that this Hebbian plasticity mechanism should give rise to the emergence of an attractor in weight phase-space, which leads the randomly-initialised weights of the policy network to quickly converge towards high-performing values, guided by sensory feedback from the environment.
3.1 Optimization details
The particular population-based optimization algorithm that we employ is evolution strategies (ES) Beyer and Schwefel (2002); Wierstra et al. (2008). ES have recently been shown to reach competitive performance compared to other deep reinforcement learning approaches across a variety of different tasks Salimans et al. (2017). These black-box optimization methods have the benefit of not requiring the backpropagation of gradients and can deal with both sparse and dense rewards. Here, we adapt the ES algorithm by Salimans et al. (2017) to not optimize the weights directly but instead to find the set of Hebbian coefficients that dynamically control the weights of the network during its lifetime based on input from the environment.
In order to evolve the optimal local learning rules, we randomly initialise both the policy network’s weights w and the Hebbian coefficients h by sampling from a uniform distribution, w ∼ U[−0.1, 0.1] and h ∼ U[0, 1] respectively. Subsequently we let the ES algorithm evolve h, which in turn determines the updates to the policy network’s weights at each timestep through Equation 1.
At each evolutionary step we compute the task-dependent fitness $F_i$ of the agent, populate a new set of $n$ candidate solutions by adding normal noise $\epsilon_i$ to the current best solution $h_t$ (where $\sigma$ modulates the amount of noise), and update the parameters of the solution based on the fitness evaluation of each of the individual solutions:

$h_{t+1} = h_t + \frac{\alpha}{n\sigma} \sum_{i=1}^{n} F_i\, \epsilon_i, \quad (2)$

where $\alpha$ modulates how much the parameters are updated at each generation. It is important to note that during its lifetime the agent does not have access to this reward.
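A single generation of this update can be sketched as follows. This is a minimal illustration of the ES estimator of Salimans et al. (2017) applied to the Hebbian coefficients; the fitness normalisation and function names are our own assumptions, and the fitness function is a stand-in for a full agent rollout:

```python
import numpy as np

def es_step(h, fitness_fn, n=50, sigma=0.1, alpha=0.2, rng=None):
    """One evolution-strategy update of the Hebbian coefficient vector h.

    Perturbs h with n Gaussian noise vectors, evaluates each candidate's
    fitness, and moves h along the fitness-weighted average of the noise.
    """
    if rng is None:
        rng = np.random.default_rng()
    eps = rng.standard_normal((n, h.size))           # one noise vector per candidate
    F = np.array([fitness_fn(h + sigma * e) for e in eps])
    F = (F - F.mean()) / (F.std() + 1e-8)            # standard fitness shaping
    return h + alpha / (n * sigma) * (F @ eps)       # Equation 2 update
```

As a sanity check, repeatedly applying `es_step` to maximise a simple quadratic fitness such as `-h**2` drives `h` towards zero without ever computing a gradient.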
We compare our Hebbian approach to a standard fixed-weight approach, using the same ES algorithm to optimise either the weights directly or the learning rule parameters, respectively. The code for the experiments in this paper will be made available shortly.
4 Experimental Setups
We demonstrate our approach on two continuous control environments with different sensory modalities (Fig. 2). The first is a challenging vision-based RL task, in which the goal is to drive a racing car through procedurally generated tracks as fast as possible. While not appearing too complicated, the task was only recently solved (achieving a score of more than 900 averaged over 100 random rollouts) Risi and Stanley (2019); Ha and Schmidhuber (2018); Tang et al. (2020). The second domain is a complex 3-D locomotion task in which a four-legged robot has to be controlled Schulman et al. (2015). Here the information about the environment is represented as a one-dimensional state vector.
Vision-based environment As a vision-based environment, we use the CarRacing-v0 domain Klimov (2016), built with the Box2D physics engine. The output state of the environment is resized and normalised, resulting in an observation space of 3 channels (RGB) of 84×84 pixels each. The policy network consists of two convolutional layers, activated by hyperbolic tangent and interposed with pooling layers, which feed a three-layer fully connected network with no bias. This network has 92,690 weight parameters, 1,362 corresponding to the convolutional layers and 91,328 to the fully connected ones. The three network outputs control three continuous actions (left/right, acceleration, brake). Under the ABCD mechanism this results in 456,640 Hebbian coefficients, including the lifetime learning rates.
In this environment, only the weights of the fully connected layers are controlled by the Hebbian plasticity mechanism, while the 1,362 parameters of the convolutional layers remain static during the lifetime of the agent. The reason is that there is no natural definition of what the presynaptic and postsynaptic activity of a convolutional filter would be, making the interpretation of Hebbian plasticity for convolutional layers challenging. Furthermore, previous research on the human visual cortex indicates that the representation of visual stimuli in the early regions of the ventral stream is compatible with the representations of convolutional layers trained for image recognition Yamins and DiCarlo (2016), suggesting that the variability of the parameters of convolutional layers should be limited. The evolutionary fitness is calculated as −0.1 every frame and +1000/N for every track tile visited, where N is the total number of tiles in the generated track.
3-D Locomotion Task For the quadruped, we use a three-layer feedforward network with no bias and hyperbolic tangent as activation function. This architectural choice leads to a network with 12,288 synapses. Under the ABCD plasticity mechanism, which has 5 coefficients per synapse, this translates to a set of 61,440 Hebbian coefficients, including the lifetime learning rates. For the state-vector environment we use the open-source Bullet physics engine and its pyBullet python wrapper Coumans and Bai (2016–2019), which includes the “Ant” robot, a quadruped with 13 rigid links, including four legs and a torso, along with 8 actuated joints Bib (2016). It is modeled after the ant robot in the MuJoCo simulator Todorov et al. (2012) and constitutes a common benchmark in RL Finn et al. (2017). The robot has an input size of 28, comprising the positional and velocity information of the agent, and an action space of 8 dimensions, controlling the motion of each of the 8 joints. The fitness function of the quadruped agent selects for distance travelled and low energy consumption during a period of 1,000 timesteps.
The parameters used for the ES algorithm to optimize both the Hebbian and static networks are the following: a population size of 200 for the CarRacing-v0 domain and of 500 for the quadruped, reflecting the higher complexity of that domain. Other parameters were the same for both domains and reflect typical ES settings (ES algorithms are typically more robust to different hyperparameters than other RL approaches Salimans et al. (2017)): a learning rate α = 0.2 with decay 0.995, and noise standard deviation σ = 0.1 with decay 0.999. These hyperparameters were found by trial-and-error and worked best in prior experiments.
For each of the two domains, we performed three independent evolutionary runs (with different random seeds) for both the static and Hebbian approach. We performed additional ablation studies on restricted forms of the generalised Hebbian rule, which can be found in the Appendix. Videos of the evolved behaviors can be found in the supplementary materials.
Vision-based Environment To test how well the evolved solutions generalize, we compare the cumulative rewards averaged over 100 rollouts for the highest-performing Hebbian-based approach and the traditional fixed-weight approach. The set of local learning rules found by the ES algorithm yields a reward of 870 ± 13, while the static-weights solution only reaches a performance of 711 ± 16. The numbers for the Hebbian network are slightly below the performance of the state-of-the-art approaches in this domain, which rely on additional neural attention mechanisms (914 ± 15 Tang et al. (2020)), but on par with deep RL approaches such as PPO (865 ± 159 Tang et al. (2020)). The competitive performance of the Hebbian learning agent is rather surprising, since it starts every one of the 100 rollouts with completely different random weights, yet through the tuned learning rules it is able to adapt quickly. While the Hebbian network takes slightly longer to reach a high training performance, likely because of the increased parameter space (see Appendix), the benefit is higher generality when tested on procedurally generated tracks not seen during training.
Table 1: Performance of the Hebbian and static-weight networks on the three quadruped morphologies (mean ± std).

| Quadruped Damage | Learning Rule | Distance travelled | Solved |
|---|---|---|---|
| No damage | Hebbian | 1278 ± 13 | True |
| No damage | static weights | 1496 ± 3 | True |
| Right front leg | Hebbian | 440 ± 3 | True |
| Right front leg | static weights | 2158 ± 2 | True |
| Left front leg | Hebbian | 860 ± 9 | True |
| Left front leg | static weights | 41 ± 1 | False |
3-D Locomotion Task For the locomotion task, we created three variations of a 4-legged robot to mimic the effect of partial damage to one of its legs (Fig. 2). The choice of these morphologies is intended to create a task that would be difficult to master for a neural network that is not able to adapt. During training, both the static-weights and the Hebbian plastic networks follow the same set-up: at each training step the policy is optimised following the ES algorithm described in Section 3.1, where the fitness function consists of the average cumulative reward over the three morphologies.
For the quadruped, we define solving the task as monotonically moving away from the initial position by at least 100 units of length. Out of the three evolutionary runs, the Hebbian network found solutions in all runs, while the static-weights network was incapable of finding a single solution that solves the task simultaneously for all three morphologies (Table 1).
Since the static-weights network cannot adapt to the environment, it focuses on two morphologies that can be controlled well with the same gait, ignoring the third morphology. On the other hand, the Hebbian network is capable of adapting to the new morphologies, leading to an efficient self-organization of the network’s synaptic weights (Fig. 1). Furthermore, we found that the initial random weights of the network can even be sampled from distributions other than the one used during the discovery of the Hebbian coefficients, and the agent still reaches a comparable performance.
Interestingly, even without the presence of any reward feedback during its lifetime, the Hebbian-based network is able to find well-performing weights for each of the three morphologies. The incoming activation patterns alone are enough for the network to adapt without explicitly knowing which morphology is currently being simulated. However, for the morphologies that the static-weight network did solve, it reached a higher reward than the Hebbian-based approach. Several reasons may explain this, including the need for extra time to learn or the larger size of the parameter space, which could require longer training times to find even more efficient plasticity rules.
In order to determine the minimum number of timesteps the weights need to converge from random to high-performing values during an agent’s lifetime, we investigated freezing the Hebbian update mechanism after a varying number of timesteps and examining the resulting episode’s cumulative reward. We observe that the weights only need between 30 and 80 timesteps (i.e. Hebbian updates) to converge to a set of optimal values (Fig. 3, left). Furthermore, we tested the resilience of the network to external perturbations by saturating all its outputs to 1.0 for 100 timesteps, effectively freezing the agent in place. Fig. 3 (right) shows that the evolved Hebbian rules allow the network to recover an optimal set of weights within a few timesteps. The Hebbian network is also able to recover from a partial loss of its connections, which we simulate by zeroing out a subset of the synaptic weights during one timestep (Fig. 4, left). We observe a brief disruption in the behavior of the agent; however, the network reconverges towards an optimal solution within a few timesteps (Fig. 4, upper-right).
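The freezing ablation amounts to a small change in the rollout loop: plasticity is simply switched off after a chosen timestep. The toy sketch below illustrates this protocol with a stand-in environment and the Hebbian rule abbreviated to its correlation term; all names and dynamics here are illustrative, not our actual task setup:

```python
import numpy as np

def rollout_with_freeze(n_steps, freeze_at, rng):
    """Toy rollout: Hebbian updates run only while t < freeze_at."""
    w = rng.uniform(-0.1, 0.1, size=(4, 2))        # random initial weights
    trace = []
    obs = rng.standard_normal(4)                    # stand-in observation
    for t in range(n_steps):
        act = np.tanh(obs @ w)
        if t < freeze_at:                           # plasticity frozen afterwards
            w = w + 0.01 * np.outer(obs, act)       # correlation (A) term only
        trace.append(w.copy())
        obs = rng.standard_normal(4)                # stand-in environment step
    return trace
```

After `freeze_at` the weight matrix stays constant for the remainder of the episode, so comparing cumulative reward across different freezing points reveals how many plastic updates the network actually needs.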
In order to get better insight into the effect of the discovered plasticity rules and the development of the weight patterns during Hebbian learning, we performed a dimensionality reduction through principal component analysis (PCA), projecting the high-dimensional weight vector at each timestep to a 3-dimensional representation that explains most of the variance (Fig. 5). For the car environment we observe a 2-dimensional manifold on which most of the weights live; this contrasts with the dynamics of a network in which we set the Hebbian coefficients (Eq. 1) to random values, whose weight trajectory lacks any structure and oscillates around zero. In the case of the three quadruped morphologies, the trajectories of the Hebbian network follow a 3-dimensional curve with an oscillatory signature; with random Hebbian coefficients the network does not give rise to any apparent structure in its weight trajectory.
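This projection can be sketched with a plain SVD-based PCA over the recorded weight trajectory. The function below is an illustrative implementation of the analysis (names and shapes are our own), where each row of `W` is the flattened weight vector at one timestep:

```python
import numpy as np

def project_weight_trajectory(W, k=3):
    """Project a (timesteps, n_weights) weight trajectory onto its top-k PCs.

    Returns per-timestep k-dimensional coordinates and the fraction of
    variance explained by each of the k components.
    """
    Wc = W - W.mean(axis=0)                      # centre each weight over time
    U, S, Vt = np.linalg.svd(Wc, full_matrices=False)
    coords = Wc @ Vt[:k].T                       # k-dimensional trajectory
    var_explained = S[:k] ** 2 / np.sum(S ** 2)  # per-component variance ratio
    return coords, var_explained
```

If the evolved rules drive the weights onto a low-dimensional attractor, the leading two or three components capture nearly all of the variance, whereas random Hebbian coefficients spread the variance across many components.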
5 Discussion and Future Work
In this work we introduced a novel approach that allows agents with random weights to adapt quickly to a task. It is interesting to note that lifetime adaptation happens without any explicitly provided reward signal, based only on the evolved Hebbian local learning rules. In contrast to typical static network approaches, in which the weights of the network do not change during the lifetime of the agent, the weights in the Hebbian-based networks self-organize and converge to an attractor in weight space during their lifetime. The ability to adapt weights quickly is shown to be important for tasks such as adapting to damaged robot morphologies, and could be useful for settings such as continual learning Parisi et al. (2019). The ability to converge to high-performing weights from initially random weights is surprisingly robust, and the best networks manage to do this for each of the 100 rollouts in the CarRacing domain. That the Hebbian networks are more general, at the cost of some performance on any particular task or robot morphology, is perhaps not surprising: learning generally takes time but can result in greater generalisation Simpson (1953). Interestingly, randomly initialised networks have recently shown particularly interesting properties in different domains Jaeger and Haas (2004); Schmidhuber et al. (2007); Ulyanov et al. (2018). We add to this recent trend by demonstrating that random weights are all you need to adapt quickly to some complex RL domains, given that they are paired with expressive neural plasticity mechanisms.
An interesting future work direction is to extend the approach with neuromodulated plasticity, which has been shown to improve the performance of evolving plastic neural networks Soltoggio et al. (2008) and of plastic networks trained through backpropagation Miconi et al. (2020). Among other properties, neuromodulation allows certain neurons to modulate the level of plasticity of the connections in the neural network. Additionally, complex systems of neuromodulation seem critical in animal brains for more elaborate forms of learning Frank et al. (2004). Such an ability could be particularly important when giving the network an additional reward signal as input for goal-based adaptation. The approach presented here opens up other interesting research areas, such as also evolving the agent’s neural architecture Gaier and Ha (2019) or encoding the learning rules through a more indirect genotype-to-phenotype mapping Risi and Stanley (2010); Zador (2019).
In the neuroscience community, the question of which parts of animal behaviors are already innate and which parts are acquired through learning is hotly debated Zador (2019). Interestingly, randomness in the connectivity of these biological networks potentially plays a more important part than previously recognized. For example, random feedback connections could allow biological brains to perform a type of backpropagation Lillicrap et al. (2014), and there is recent evidence suggesting that the prefrontal cortex might in effect employ a combination of random connectivity and Hebbian learning Lindsay et al. (2017). To the best of our knowledge, this is the first time the combination of random networks and Hebbian learning has been applied to a complex reinforcement learning problem, which we hope could inspire further cross-pollination of ideas between neuroscience and machine learning in the future Richards et al. (2019).
In contrast to current reinforcement learning algorithms that try to be as general as possible, evolution biased animal nervous systems to learn quickly by restricting their learning to what is important for survival Zador (2019). The results presented in this paper, in which the agent’s innate knowledge is the evolved learning rules, take a step in this direction. The presented approach opens up interesting future research directions that de-emphasize the role played by the network’s weights and focus more on the learning rules themselves. The results on two complex and different reinforcement learning tasks suggest that such an approach is worth exploring further.
This work was supported by a DFF-Danish ERC-programme grant and an Amazon Research Award.
The ethical and future societal consequences of this work are hard to predict but likely similar to those of other work dealing with more adaptive agents and robots. In particular, giving robots the ability to keep functioning when injured could make it easier for them to be deployed in areas that have both a positive and a negative impact on society. In the very long term, robots that can adapt could help in industrial automation or help to care for the elderly. On the other hand, more adaptive robots could also be more easily used for military applications. The approach presented in this paper is far from being deployed in these areas, but it is important to discuss its potential long-term consequences early on.
- Vinyals et al.  Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.
- Berner et al.  Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Dębiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019.
- Arulkumaran et al.  Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage, and Anil Anthony Bharath. A brief survey of deep reinforcement learning. arXiv preprint arXiv:1708.05866, 2017.
- Zhang et al.  Chiyuan Zhang, Oriol Vinyals, Remi Munos, and Samy Bengio. A study on overfitting in deep reinforcement learning. arXiv preprint arXiv:1804.06893, 2018.
- Justesen et al.  Niels Justesen, Ruben Rodriguez Torrado, Philip Bontrager, Ahmed Khalifa, Julian Togelius, and Sebastian Risi. Illuminating generalization in deep reinforcement learning through procedural level generation. NeurIPS 2018 Workshop on Deep Reinforcement Learning, 2018.
- Morgan  C. Lloyd Morgan. Animal Intelligence 1. Nature, 26(674):523–524, Sep 1882. ISSN 1476-4687. doi: 10.1038/026523b0.
- Macphail  Euan M Macphail. Brain and intelligence in vertebrates. Oxford University Press, USA, 1982.
- Liu et al.  Xu Liu, Steve Ramirez, Petti T Pang, Corey B Puryear, Arvind Govindarajan, Karl Deisseroth, and Susumu Tonegawa. Optogenetic stimulation of a hippocampal engram activates fear memory recall. Nature, 484(7394):381–385, 2012.
- Martin et al.  Stephen J Martin, Paul D Grimwood, and Richard GM Morris. Synaptic plasticity and memory: an evaluation of the hypothesis. Annual review of neuroscience, 23(1):649–711, 2000.
- Sacramento et al.  João Sacramento, Rui Ponte Costa, Yoshua Bengio, and Walter Senn. Dendritic cortical microcircuits approximate the backpropagation algorithm. ArXiv e-prints, Oct 2018. URL https://arxiv.org/abs/1810.11393.
- Hebb  D. O. Hebb. The organization of behavior; a neuropsychological theory. Wiley, 1949.
- Ben-Iwhiwhu et al.  Eseoghene Ben-Iwhiwhu, Pawel Ladosz, Jeffery Dick, Wen-Hua Chen, Praveen Pilly, and Andrea Soltoggio. Evolving inborn knowledge for fast adaptation in dynamic pomdp problems. arXiv preprint arXiv:2004.12846, 2020.
- Soltoggio et al.  Andrea Soltoggio, Kenneth O Stanley, and Sebastian Risi. Born to learn: the inspiration, progress, and future of evolved plastic artificial neural networks. Neural Networks, 108:48–67, 2018.
- Bengio et al.  Yoshua Bengio, Samy Bengio, and Jocelyn Cloutier. Learning a synaptic learning rule. Neural Networks, 1991.
- Miconi et al.  Thomas Miconi, Jeff Clune, and Kenneth O. Stanley. Differentiable plasticity: training plastic neural networks with backpropagation. ArXiv e-prints, Apr 2018. URL https://arxiv.org/abs/1804.02464.
- Jaeger and Haas  Herbert Jaeger and Harald Haas. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. science, 304(5667):78–80, 2004.
- Schmidhuber et al.  Jürgen Schmidhuber, Daan Wierstra, Matteo Gagliolo, and Faustino Gomez. Training recurrent networks by evolino. Neural computation, 19(3):757–779, 2007.
- Ulyanov et al.  Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Deep image prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9446–9454, 2018.
- Lindsay et al.  Grace W Lindsay, Mattia Rigotti, Melissa R Warden, Earl K Miller, and Stefano Fusi. Hebbian learning in a random network captures selectivity properties of the prefrontal cortex. Journal of Neuroscience, 37(45):11021–11036, 2017.
- Richards et al.  Blake A Richards, Timothy P Lillicrap, Philippe Beaudoin, Yoshua Bengio, Rafal Bogacz, Amelia Christensen, Claudia Clopath, Rui Ponte Costa, Archy de Berker, Surya Ganguli, et al. A deep learning framework for neuroscience. Nature neuroscience, 22(11):1761–1770, 2019.
- Pozzi et al.  Isabella Pozzi, Sander Bohté, and Pieter Roelfsema. A biologically plausible learning rule for deep learning in the brain. 2018.
- Thrun and Pratt  Sebastian Thrun and Lorien Pratt. Learning to learn: Introduction and overview. In Learning to learn, pages 3–17. Springer, 1998.
- Schmidhuber  Jürgen Schmidhuber. Learning to control fast-weight memories: an alternative to dynamic recurrent networks. Neural Computation, 4(1):131–139, 1992.
- Ravi and Larochelle  Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In International Conference on Learning Representations (ICLR 2018), 2017.
- Zoph and Le  Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.
- Schmidhuber [1993a] Jürgen Schmidhuber. A ‘self-referential’weight matrix. In International Conference on Artificial Neural Networks, pages 446–450. Springer, 1993a.
- Wang et al.  Jane X Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z Leibo, Remi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016.
- Finn et al.  Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1126–1135. JMLR. org, 2017.
- Fernando et al.  Chrisantha Thomas Fernando, Jakub Sygnowski, Simon Osindero, Jane Wang, Tom Schaul, Denis Teplyashin, Pablo Sprechmann, Alexander Pritzel, and Andrei A Rusu. Meta learning by the baldwin effect. arXiv preprint arXiv:1806.07917, 2018.
- Hochreiter and Schmidhuber  Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- Beer and Gallagher  Randall D Beer and John C Gallagher. Evolving dynamical neural networks for adaptive behavior. Adaptive behavior, 1(1):91–122, 1992.
- Floreano and Urzelai  Dario Floreano and Joseba Urzelai. Evolutionary robots with on-line self-organization and behavioral fitness. Neural Networks, 13(4-5):431–443, 2000.
- Camazine et al.  Scott Camazine, Jean-Louis Deneubourg, Nigel R Franks, James Sneyd, Eric Bonabeau, and Guy Theraula. Self-organization in biological systems, volume 7. Princeton university press, 2003.
- Wu et al.  Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S Yu Philip. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, 2020.
- Mordvintsev et al.  Alexander Mordvintsev, Ettore Randazzo, Eyvind Niklasson, and Michael Levin. Growing neural cellular automata. Distill, 2020. doi: 10.23915/distill.00023. https://distill.pub/2020/growing-ca.
- Grossberg  Stephen T Grossberg. Studies of mind and brain: Neural principles of learning, perception, development, cognition, and motor control, volume 70. Springer Science & Business Media, 2012.
- Zador  Anthony M Zador. A critique of pure learning and what artificial neural networks can learn from animal brains. Nature communications, 10(1):1–7, 2019.
- Maass et al.  Wolfgang Maass, Thomas Natschläger, and Henry Markram. Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural computation, 14(11):2531–2560, 2002.
- Salimans et al.  Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. Evolution Strategies as a Scalable Alternative to Reinforcement Learning. ArXiv e-prints, Mar 2017. URL https://arxiv.org/abs/1703.03864.
- Soltoggio et al.  Andrea Soltoggio, Peter Durr, Claudio Mattiussi, and Dario Floreano. Evolving neuromodulatory topologies for reinforcement learning-like scenarios. In 2007 IEEE Congress on Evolutionary Computation, pages 2471–2478. IEEE, 2007.
- Niv et al.  Yael Niv, Daphna Joel, Isaac Meilijson, and Eytan Ruppin. Evolution of reinforcement learning in uncertain environments: Emergence of risk-aversion and matching. In European Conference on Artificial Life, pages 252–261. Springer, 2001.
- Ba et al.  Jimmy Ba, Geoffrey E Hinton, Volodymyr Mnih, Joel Z Leibo, and Catalin Ionescu. Using fast weights to attend to the recent past. In Advances in Neural Information Processing Systems, pages 4331–4339, 2016.
- Schmidhuber [1993b] J Schmidhuber. Reducing the ratio between learning complexity and number of time varying variables in fully recurrent nets. In International Conference on Artificial Neural Networks, pages 460–463. Springer, 1993b.
- Beyer and Schwefel  Hans-Georg Beyer and Hans-Paul Schwefel. Evolution strategies–a comprehensive introduction. Natural computing, 1(1):3–52, 2002.
- Wierstra et al.  Daan Wierstra, Tom Schaul, Jan Peters, and Juergen Schmidhuber. Natural evolution strategies. In 2008 IEEE Congress on Evolutionary Computation (IEEE World Congress on Computational Intelligence), pages 3381–3387. IEEE, 2008.
- Klimov  Oleg Klimov. CarRacing-v0. 2016. URL https://gym.openai.com/envs/CarRacing-v0/.
- Risi and Stanley  Sebastian Risi and Kenneth O Stanley. Deep neuroevolution of recurrent and discrete world models. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 456–462, 2019.
- Ha and Schmidhuber  David Ha and Jürgen Schmidhuber. Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems, pages 2450–2462, 2018.
- Tang et al.  Yujin Tang, Duong Nguyen, and David Ha. Neuroevolution of self-interpretable agents. arXiv preprint arXiv:2003.08165, 2020.
- Schulman et al.  John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-Dimensional Continuous Control Using Generalized Advantage Estimation. ArXiv e-prints, Jun 2015. URL https://arxiv.org/abs/1506.02438.
- Yamins and DiCarlo  Daniel L. K. Yamins and James J. DiCarlo. Using goal-driven deep learning models to understand sensory cortex. Nature Neuroscience, 19(3):356–365, Feb 2016. ISSN 1546-1726. doi: 10.1038/nn.4244.
- Coumans and Bai [2016–2019] Erwin Coumans and Yunfei Bai. Pybullet, a python module for physics simulation for games, robotics and machine learning. http://pybullet.org, 2016–2019.
- Duan et al.  Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. arXiv preprint arXiv:1604.06778, 2016.
- Todorov et al.  Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012.
- Parisi et al.  German I Parisi, Ronald Kemker, Jose L Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review. Neural Networks, 2019.
- Simpson  George Gaylord Simpson. The baldwin effect. Evolution, 7(2):110–117, 1953.
- Soltoggio et al.  Andrea Soltoggio, John A Bullinaria, Claudio Mattiussi, Peter Dürr, and Dario Floreano. Evolutionary advantages of neuromodulated plasticity in dynamic, reward-based scenarios. In Proceedings of the 11th international conference on artificial life (Alife XI), number CONF, pages 569–576. MIT Press, 2008.
- Miconi et al.  Thomas Miconi, Aditya Rawal, Jeff Clune, and Kenneth O Stanley. Backpropamine: training self-modifying neural networks with differentiable neuromodulated plasticity. arXiv preprint arXiv:2002.10585, 2020.
- Frank et al.  Michael J Frank, Lauren C Seeberger, and Randall C O'Reilly. By carrot or by stick: cognitive reinforcement learning in parkinsonism. Science, 306(5703):1940–1943, 2004.
- Gaier and Ha  Adam Gaier and David Ha. Weight Agnostic Neural Networks. ArXiv e-prints, Jun 2019. URL https://arxiv.org/abs/1906.04358.
- Risi and Stanley  Sebastian Risi and Kenneth O. Stanley. Indirectly Encoding Neural Plasticity as a Pattern of Local Rules. SpringerLink, pages 533–543, Aug 2010. doi: 10.1007/978-3-642-15193-4_50.
- Lillicrap et al.  Timothy P Lillicrap, Daniel Cownden, Douglas B Tweed, and Colin J Akerman. Random feedback weights support learning in deep neural networks. arXiv preprint arXiv:1411.0247, 2014.
6.1 Network Weight Visualizations
Fig. 6 shows an example of how we visualize the weights of the network at a particular timestep. Each pixel represents the weight value of one synaptic connection. We display the weights of each of the three fully connected layers FC layer 1, FC layer 2, FC layer 3 separately: the quadruped's network has an input space of dimension 28 and three fully connected layers with 128, 64, and 8 neurons respectively. Hence the rectangle above FC layer 1 has a horizontal dimension of 28 and a vertical one of 128; FC layer 2 has a horizontal dimension of 64 and a vertical one of 128; and FC layer 3 measures 64 pixels vertically and 8 horizontally, which corresponds to the dimension of the action space. Darker pixels indicate negative values while whiter pixels indicate positive values. In the case of the CarRacing environment the weights are normalised to the interval [-1, +1], while the quadruped agents have unbounded weights.
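The visualization above can be sketched as follows; this is a minimal illustration, not the code used for Fig. 6, and the layer shapes simply mirror the quadruped dimensions described in the text:

```python
import numpy as np

# Layer shapes matching the quadruped network described above:
# input dimension 28, then fully connected layers of 128, 64 and 8 neurons.
shapes = {"FC layer 1": (128, 28), "FC layer 2": (128, 64), "FC layer 3": (64, 8)}

# Placeholder weights; in practice these are read from the agent at one timestep.
weights = {name: np.random.randn(*shape) for name, shape in shapes.items()}

def to_grayscale(w):
    """Map weights to [0, 255]: darker pixels = negative, lighter = positive."""
    w = np.clip(w, -1.0, 1.0)  # CarRacing weights are already normalised to [-1, 1]
    return ((w + 1.0) / 2.0 * 255).astype(np.uint8)

# One grayscale image per layer, ready to be rendered side by side.
images = {name: to_grayscale(w) for name, w in weights.items()}
```

For unbounded quadruped weights one would normalise by the maximum absolute value per layer before mapping to grayscale, rather than clipping.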
6.2 Training efficiency
We show the training performance over generations for both approaches and both domains in Fig. 7. Even though the Hebbian method has to optimize a significantly larger number of parameters, training performance increases at a similar rate for both approaches.
6.3 Hebbian rules
We analyze the different flavours of Hebbian rules derived from Eq. 2 in the car racing environment. For this experiment we do not evolve the parameters of the convolutional layers; instead, they are randomly fixed at initialisation and we solely evolve the Hebbian coefficients controlling the feedforward layers. The variants range from the simplest one, in which all but the A coefficients are zero, to the most general form, in which all four coefficients and the intra-life learning rate are present (Fig. 8). Only the static, ABCD+ and AD+ treatments can solve the task, highlighting the importance of the inhibitory/excitatory bias.
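A generalised ABCD update of the kind compared above can be sketched as follows. This is an illustrative sketch rather than the paper's implementation: the coefficient names A, B, C, D and the intra-life learning rate eta follow the notation of Eq. 2, while the array shapes and the per-connection (rather than scalar) coefficients are assumptions:

```python
import numpy as np

def hebbian_update(w, pre, post, eta, A, B, C, D):
    """One plasticity step for a fully connected layer.

    w:    (n_pre, n_post) weight matrix
    pre:  (n_pre,) presynaptic activations
    post: (n_post,) postsynaptic activations
    eta, A, B, C, D: scalars or arrays broadcastable to w's shape.
    Setting B = C = D = 0 recovers the simplest correlation-only "A" rule;
    the AD+ variant keeps only A and the inhibitory/excitatory bias D.
    """
    delta = (A * np.outer(pre, post)   # correlation term
             + B * pre[:, None]        # purely presynaptic term
             + C * post[None, :]       # purely postsynaptic term
             + D)                      # constant inhibitory/excitatory bias
    return w + eta * delta

# Example step with random activations and per-connection coefficients.
rng = np.random.default_rng(0)
n_pre, n_post = 28, 128
w = rng.standard_normal((n_pre, n_post))
eta, A, B, C, D = (rng.standard_normal((n_pre, n_post)) for _ in range(5))
w_new = hebbian_update(w, rng.standard_normal(n_pre), rng.standard_normal(n_post),
                       eta, A, B, C, D)
```

In the evolutionary setting, the eta, A, B, C, D arrays are the evolved parameters, while w changes only through this rule during the agent's lifetime.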
We also show the distribution of the coefficients of the most general ABCD+ version (Fig. 9), which is approximately normal. We hypothesise that such a distribution may be necessary to prevent the self-organising weights from growing to extreme values. Analysing the resulting weight distributions and evolved rules opens up many interesting future research directions.