Adaptive Reinforcement Learning through Evolving Self-Modifying Neural Networks

05/22/2020 ∙ by Samuel Schmidgall, et al. ∙ George Mason University

The adaptive learning capabilities seen in biological neural networks are largely a product of the self-modifying behavior emerging from online plastic changes in synaptic connectivity. Current methods in Reinforcement Learning (RL) only adjust to new interactions after reflection over a specified time interval, preventing the emergence of online adaptivity. Recent work addresses this by endowing artificial neural networks with neuromodulated plasticity, which has been shown to improve performance on simple RL tasks trained using backpropagation, but has yet to scale up to larger problems. Here we study the problem of meta-learning in a challenging quadruped domain, where each leg of the quadruped has a chance of becoming unusable, requiring the agent to adapt by continuing locomotion with the remaining limbs. Results demonstrate that agents evolved using self-modifying plastic networks are more capable of adapting to complex meta-learning tasks, even outperforming the same network updated using gradient-based algorithms while taking less time to train.




1. Introduction & Related Work

The brain’s active self-modifying behavior plays an important role in its effectiveness for continual adaptation and learning in dynamic environments. Furthermore, evolution has shaped both the underlying neural connectivity and the framework for directing neuromodulated plasticity, the structure from which short-term synaptic self-modification occurs. However, the most common methods by which current AI are trained contradict this way of learning. Consequently, modern training methods render AI incapable of online adaptation, performing well only on the tasks that they were trained on. Even slight deviations from the original simulated environment can be catastrophic for the agent’s performance.

To address this problem, recent literature in meta-learning aims to optimize toward an initial set of parameters that enables rapid learning over a specified set of tasks, such as Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017). Another set of methods utilizes fast and slow weights in neural networks through a non-trainable Hebbian learning-based associative memory (Rae et al., 2018). Building on this, differential neuromodulation (Miconi et al., 2019) proposes a way to augment traditional artificial neural networks with fast and slow weights, where the fast weights are modified through the addition of neuromodulated plasticity that is trainable using backpropagated gradients.

The work presented in this paper both demonstrates that self-modifying neural networks are capable of solving complex learning tasks in dynamic environments and poses Evolutionary Strategies as the natural choice for developing such networks. Previous work using neuromodulated plasticity (Miconi et al., 2019) only experimented on simple problems, and only considered optimization through backpropagated gradients. Here we show evidence for the applicability of evolved neuromodulated plasticity in a high-dimensional continuous control problem, Crippled-Ant, requiring both precise motor skills and adaptivity.

2. Methods

The approach presented in this work compares a traditional neural network architecture against one with self-modifying synaptic connectivity, where the changes in connectivity are modulated by a learned set of parameters. Performance comparisons are made between the policy gradient algorithm Proximal Policy Optimization (Schulman et al., 2017) and a simplified version of Natural Evolutionary Strategies (Salimans et al., 2017), which, for simplicity, will be referred to as OpenAI-ES for the duration of this paper.

2.1. Differential Neuromodulation

Within the differential neuromodulation framework, the weights along with the plasticity of each connection are optimized:

$$x_j(t) = \sigma\Big( \sum_i \big[ w_{i,j} + \alpha_{i,j}\,\mathrm{Hebb}_{i,j}(t) \big]\, x_i(t-1) \Big)$$

$$\mathrm{Hebb}_{i,j}(t+1) = \mathrm{Clip}\big( \mathrm{Hebb}_{i,j}(t) + M(t)\, x_i(t-1)\, x_j(t) \big)$$

where $x_j(t)$ is the output of a layer of neurons at time $t$, $\sigma$ is a nonlinear activation function, $w_{i,j}$ is the set of traditional non-plastic weights, and $\alpha_{i,j}$ is the plasticity coefficient that scales the magnitude of the plastic component of each connection. The plastic component at timestep $t$ is represented by $\mathrm{Hebb}_{i,j}(t)$, which accumulates the modulated product of pre- and post-synaptic activity between the respective layers. Here, plasticity is modulated through a network-learned neuromodulatory signal $M(t)$, which can be represented by a variety of functions, but in this work is represented by a single-layer feedforward neural network. $\mathrm{Hebb}_{i,j}$ is generally clipped between $-k$ and $k$, with $k = 1$ in this experiment.
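The update above can be sketched as a small NumPy layer. This is an illustrative reconstruction, not the authors' code: the weight shapes, the initialization scales, and the single-layer modulatory network `m_w` are assumptions made here for concreteness.

```python
import numpy as np

rng = np.random.default_rng(0)

class PlasticLayer:
    """Sketch of a neuromodulated plastic layer (hypothetical names).

    Each connection has a fixed weight w, a plasticity coefficient alpha,
    and a Hebbian trace that accumulates the modulated product of pre- and
    post-synaptic activity, clipped to [-k, k].
    """

    def __init__(self, n_in, n_out, k=1.0):
        self.w = rng.normal(0.0, 0.1, (n_in, n_out))      # slow (non-plastic) weights
        self.alpha = rng.normal(0.0, 0.1, (n_in, n_out))  # plasticity coefficients
        self.m_w = rng.normal(0.0, 0.1, (n_in, 1))        # single-layer modulatory net
        self.hebb = np.zeros((n_in, n_out))               # plastic component Hebb(t)
        self.k = k

    def forward(self, x_prev):
        # Effective weight = fixed part + scaled plastic part
        x_out = np.tanh(x_prev @ (self.w + self.alpha * self.hebb))
        # Neuromodulatory signal M(t) from a single-layer feedforward network
        m = np.tanh(x_prev @ self.m_w)
        # Hebbian trace update: modulated outer product of pre/post activity
        self.hebb = np.clip(self.hebb + m * np.outer(x_prev, x_out),
                            -self.k, self.k)
        return x_out
```

Calling `forward` repeatedly modifies `hebb` online, so the layer's effective weights change within a single lifetime even though `w`, `alpha`, and `m_w` stay fixed.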

Figure 1. Adaptive locomotion. In the Crippled-Ant environment, a limb is chosen at random to be disabled (in red), requiring the agent to adapt its gait using the remaining limbs.

2.2. OpenAI-ES

Starting with an initial zero-vector $\theta_0$, the OpenAI-ES algorithm generates $N$ population samples of random noise vectors $\epsilon_i \sim \mathcal{N}(0, I)$ and uses them to create population individuals $\theta_t + \sigma\epsilon_i$. The fitness $F_i$ of each individual is evaluated over the course of a lifetime through an environment-defined reward. Such rewards are often centered-rank transformed to prevent early convergence to local optima (Salimans et al., 2017). Using the corresponding rewards, parameters are updated with Stochastic Gradient Descent (SGD) as follows:

$$\theta_{t+1} = \theta_t + \eta\,\frac{1}{N\sigma} \sum_{i=1}^{N} F_i\,\epsilon_i$$
OpenAI-ES was chosen because it has been shown to be competitive with, and to exhibit better exploration behavior than, both DQN and A3C on difficult RL benchmarks (Salimans et al., 2017). While OpenAI-ES is less sample-efficient than these methods, it is better structured for distributed computing, allowing a shorter wall-clock training time. Additionally, because it does not require backpropagation of error gradients, wall-clock training time is reduced further when optimizing networks involving recurrence, such as the neuromodulated plasticity used in our experiments.
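The sampling, centered-rank transform, and SGD-style update described in this subsection can be sketched as a single update step. This is an illustrative, serial simplification (the function name `openai_es_step` is hypothetical), not the distributed implementation of Salimans et al.:

```python
import numpy as np

def openai_es_step(theta, fitness_fn, n=50, sigma=0.1, lr=0.05, rng=None):
    """One simplified OpenAI-ES update.

    Samples N noise vectors eps_i ~ N(0, I), evaluates the perturbed
    parameters theta + sigma * eps_i, centered-rank transforms the rewards,
    and applies theta <- theta + lr * (1 / (N * sigma)) * sum_i F_i * eps_i.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    eps = rng.standard_normal((n, theta.size))
    rewards = np.array([fitness_fn(theta + sigma * e) for e in eps])
    # Centered-rank transform: map rewards by rank into [-0.5, 0.5]
    ranks = rewards.argsort().argsort()
    f = ranks / (n - 1) - 0.5
    # Estimated gradient of expected fitness, then SGD-style step
    grad = (f[:, None] * eps).sum(axis=0) / (n * sigma)
    return theta + lr * grad
```

Iterating this step on a simple fitness function such as the negative squared distance to a target drives the parameters toward that target without any backpropagated gradients.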

2.3. Crippled-Ant Environment

The meta-learning capabilities of the neural networks in this paper are evaluated on a high-dimensional continuous control environment, Crippled-Ant (Clavera et al., 2018). Each episode begins with a 12-jointed quadruped aiming to attain the highest possible velocity in a limited amount of time (Figure 1). The environment takes a direct joint torque for each of the 12 joints as its action input. The state is represented as a 111-dimensional vector containing relative angles and velocities for each joint, as well as information about external forces acting on the quadruped. At the beginning of each session, a leg of the quadrupedal robot is randomly selected to be crippled, rendering it fully unusable. This environment was chosen because this modification significantly changes the dynamics of the task, requiring gait adaptation throughout the course of each run.
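A wrapper in the spirit of this environment can be sketched as below. This is a hypothetical illustration, not the actual Crippled-Ant implementation of Clavera et al. (which may disable a limb differently); it assumes 4 legs with 3 torque-controlled joints each, laid out contiguously in the action vector.

```python
import numpy as np

class CrippledLegWrapper:
    """Illustrative wrapper: at the start of each episode, one leg is chosen
    at random and its joint torques are zeroed, rendering it unusable."""

    def __init__(self, env, n_legs=4, joints_per_leg=3, rng=None):
        self.env = env
        self.n_legs = n_legs
        self.joints_per_leg = joints_per_leg
        self.rng = rng if rng is not None else np.random.default_rng()
        self.crippled_leg = None

    def reset(self):
        # Pick a new leg to disable for this episode
        self.crippled_leg = int(self.rng.integers(self.n_legs))
        return self.env.reset()

    def step(self, action):
        action = np.array(action, dtype=float).copy()
        lo = self.crippled_leg * self.joints_per_leg
        # Zero the torques of the crippled leg before passing to the env
        action[lo:lo + self.joints_per_leg] = 0.0
        return self.env.step(action)
```

Because the disabled leg changes between episodes, a policy trained under this wrapper must adapt its gait online rather than memorize one fixed compensation.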

3. Results & Discussion

Evaluation of performance is averaged over 100 episodes from 5 fully trained models for each algorithm during the testing phase to ensure accurate measurement. Each algorithm is trained using the default hyper-parameters from its respective paper. OpenAI-ES was compared against a policy gradient algorithm often used in continuous control problems, Proximal Policy Optimization (PPO). Both algorithms were also evaluated with fixed weights and with differential self-modifying ones. The experimental results demonstrate that self-modifying networks trained through Evolutionary Strategies consistently outperform networks without such augmentation trained using OpenAI-ES and PPO, as well as self-modifying networks trained using PPO. Total training time averaged around 214.8 minutes for the self-modifying OpenAI-ES and 968.8 minutes for the self-modifying PPO, running on a standard 6-core CPU. Future work involves experimenting with new types of neuromodulation, as well as understanding the full capabilities of such networks.

Figure 2. Performance comparison on the Crippled-Ant environment. Performance of each policy is measured for self-modifying (SM-) and traditional neural networks trained using Proximal Policy Optimization and OpenAI-ES.


  • I. Clavera, A. Nagabandi, R. S. Fearing, P. Abbeel, S. Levine, and C. Finn (2018) Learning to adapt: meta-learning for model-based control. CoRR abs/1803.11347. External Links: Link, 1803.11347 Cited by: §2.3.
  • C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. CoRR abs/1703.03400. External Links: Link, 1703.03400 Cited by: §1.
  • T. Miconi, A. Rawal, J. Clune, and K. O. Stanley (2019) Backpropamine: training self-modifying neural networks with differentiable neuromodulated plasticity. In ICLR, Cited by: §1, §1.
  • J. W. Rae, C. Dyer, P. Dayan, and T. P. Lillicrap (2018) Fast parametric learning with activation memorization. CoRR abs/1803.10049. External Links: Link, 1803.10049 Cited by: §1.
  • T. Salimans, J. Ho, X. Chen, S. Sidor, and I. Sutskever (2017) Evolution strategies as a scalable alternative to reinforcement learning. External Links: 1703.03864 Cited by: §2.2, §2.2, §2.
  • J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. CoRR abs/1707.06347. External Links: Link, 1707.06347 Cited by: §2.