1. Introduction & Related Work
The brain’s active, self-modifying behavior plays an important role in its effectiveness at continual adaptation and learning in dynamic environments. Furthermore, evolution has shaped both the underlying neural connectivity and the framework directing neuromodulated plasticity, the structure through which short-term synaptic self-modification occurs. However, the most common methods by which current AI systems are trained contradict this way of learning. Consequently, modern training methods render AI incapable of online adaptation: such systems perform well only on the tasks they were trained on, and even slight deviations from the original simulated environment can be catastrophic for an agent’s performance.
To address this problem, recent work in meta-learning aims to optimize toward an initial set of parameters that enables rapid learning over a specified set of tasks, as in Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017). Another family of methods uses fast and slow weights in neural networks through a non-trainable Hebbian-learning-based associative memory (Rae et al., 2018). Building on this, differential neuromodulation (Miconi et al., 2019) augments traditional artificial neural networks with fast and slow weights, where the fast weights are modified through neuromodulated plasticity that is trainable using backpropagated gradients.
The work presented in this paper both demonstrates that self-modifying neural networks are capable of solving complex learning tasks in dynamic environments and poses Evolution Strategies as the natural choice for developing such networks. Previous work using neuromodulated plasticity (Miconi et al., 2019) experimented only on simple problems and considered only optimization through backpropagated gradients. Here we show evidence for the applicability of evolved neuromodulated plasticity on a high-dimensional continuous control problem, Crippled-Ant, which requires both precise motor skills and adaptivity.
The approach presented in this work compares a traditional neural network architecture against one with self-modifying synaptic connectivity, where the changes in connectivity are modulated by a learned set of parameters. Performance comparisons are made between the policy gradient algorithm Proximal Policy Optimization (Schulman et al., 2017) and a simplified version of Natural Evolution Strategies (Salimans et al., 2017), which, for simplicity, will be referred to as OpenAI-ES for the remainder of this paper.
2.1. Differential Neuromodulation
Within the differential neuromodulation framework, the weights along with the plasticity of each connection are optimized:
$$x_j(t) = \sigma\Big( \sum_{i} \big[\, w_{i,j} + \alpha_{i,j}\,\mathrm{Hebb}_{i,j}(t) \,\big]\, x_i(t-1) \Big)$$

$$\mathrm{Hebb}_{i,j}(t+1) = \mathrm{Clip}\big( \mathrm{Hebb}_{i,j}(t) + M(t)\, x_i(t-1)\, x_j(t) \big)$$

Here $x_j(t)$ is the output of neuron $j$ at time $t$, $\sigma$ is a nonlinear activation function, $w_{i,j}$ is the traditional non-plastic weight, and $\alpha_{i,j}$ is the plasticity coefficient that scales the magnitude of the plastic component of each connection. The plastic component at timestep $t$ is represented by $\mathrm{Hebb}_{i,j}(t)$, which accumulates the modulated product of pre- and post-synaptic activity between the respective layers. Plasticity is modulated through a network-learned neuromodulatory signal $M(t)$, which can be represented by a variety of functions, but in this work is represented by a single-layer feedforward neural network. $\mathrm{Hebb}_{i,j}$ is generally clipped between $-k$ and $k$, with $k = 1$ in this experiment.
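The plastic update above can be sketched as a short NumPy loop. In the minimal sketch below, the layer sizes, the tanh activation, and the single-layer neuromodulatory network `m_w` are illustrative choices, not the exact configuration used in our experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

def plastic_forward(x_prev, w, alpha, hebb, m_w, k=1.0):
    """One timestep of a differentially neuromodulated plastic layer."""
    # Effective weight = slow weight + plasticity-scaled Hebbian trace.
    x = np.tanh(x_prev @ (w + alpha * hebb))
    # Scalar neuromodulatory signal M(t) from a one-layer network.
    m = np.tanh(x @ m_w)
    # Accumulate the modulated pre/post-synaptic product, clipped to [-k, k].
    hebb = np.clip(hebb + m * np.outer(x_prev, x), -k, k)
    return x, hebb

n_in, n_out = 4, 3
w = rng.normal(size=(n_in, n_out)) * 0.1       # slow weights
alpha = rng.normal(size=(n_in, n_out)) * 0.1   # plasticity coefficients
m_w = rng.normal(size=(n_out, 1)) * 0.1        # neuromodulatory network
hebb = np.zeros((n_in, n_out))                 # Hebb starts at zero
x_in = rng.normal(size=n_in)                   # constant input, for illustration
for _ in range(5):
    x_out, hebb = plastic_forward(x_in, w, alpha, hebb, m_w)
```

Note that `hebb` carries state across timesteps, which is why the network behaves like a recurrent model even though each forward pass is a single matrix product.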
2.2. OpenAI-ES
Starting from an initial parameter vector $\theta$ (here a zero vector), the OpenAI-ES algorithm generates $N$ random noise vectors $\epsilon_i$ and uses them to create population individuals $\theta + \sigma\epsilon_i$. The fitness of each individual is evaluated over the course of a lifetime through an environment-defined reward, $F_i$. Such rewards are often centered-rank transformed to prevent convergence to early local optima (Salimans et al., 2017). Using the corresponding rewards, the parameters are updated with Stochastic Gradient Descent (SGD) as follows:

$$\theta \leftarrow \theta + \alpha \frac{1}{N\sigma} \sum_{i=1}^{N} F_i\, \epsilon_i$$
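The sampling and update loop can be sketched as follows. The toy quadratic objective, population size, and learning rate below are illustrative choices, and refinements of the full algorithm such as antithetic sampling and weight decay (Salimans et al., 2017) are omitted for brevity:

```python
import numpy as np

def centered_ranks(fitness):
    """Map raw fitness values to centered ranks in [-0.5, 0.5]."""
    ranks = np.argsort(np.argsort(fitness))
    return ranks / (len(fitness) - 1) - 0.5

def es_step(theta, fitness_fn, n=50, sigma=0.1, lr=0.05, rng=None):
    """One simplified OpenAI-ES update on parameter vector theta."""
    if rng is None:
        rng = np.random.default_rng(0)
    eps = rng.normal(size=(n, theta.size))             # noise vectors
    fitness = np.array([fitness_fn(theta + sigma * e) for e in eps])
    f = centered_ranks(fitness)                        # rank-shaped rewards
    # SGD step along the estimated gradient of expected fitness.
    return theta + lr / (n * sigma) * f @ eps

# Toy objective: maximize -||theta - 3||^2 (optimum at theta = 3).
target = np.full(5, 3.0)
fit = lambda th: -np.sum((th - target) ** 2)

theta = np.zeros(5)
rng = np.random.default_rng(0)
for _ in range(300):
    theta = es_step(theta, fit, rng=rng)
```

Because each individual's fitness evaluation is independent, the inner loop over the population parallelizes trivially, which is the property that makes the method attractive for distributed computing.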
OpenAI-ES was chosen because it has been shown to be competitive with, and to exhibit better exploration behavior than, both DQN and A3C on difficult RL benchmarks (Salimans et al., 2017). While OpenAI-ES is less sample-efficient than these methods, it is better suited to distributed computing, which allows shorter wall-clock training times. Additionally, because it does not require back-propagating error gradients, OpenAI-ES further reduces wall-clock training time for networks involving recurrence, such as the neuromodulated plasticity used in our experiments.
2.3. Crippled-Ant Environment
The meta-learning capabilities of the neural network in this paper are evaluated on a high-dimensional continuous control environment, Crippled-Ant (Clavera et al., 2018). The environment features a 12-jointed quadruped aiming to attain the highest possible velocity in a limited amount of time (Figure 1). The environment takes a direct joint torque for each of the 12 joints as input, and the state is represented as a 111-dimensional vector containing relative angles and velocities for each joint, as well as information about external forces acting on the quadruped. At the beginning of each session, one of the quadruped’s legs is randomly selected to be crippled, rendering it fully unusable. This environment was chosen because this modification causes a significant change in the action dynamics, requiring gait adaptation throughout the course of each run.
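As a rough sketch of the crippling mechanic, the wrapper below zeroes the torques of one randomly chosen leg per session. The gym-style `reset`/`step` interface, the 3-joints-per-leg grouping, and the `_DummyAnt` stand-in are assumptions for illustration only and do not reproduce the actual MuJoCo environment of Clavera et al. (2018):

```python
import numpy as np

class CrippledLegWrapper:
    """Illustrative wrapper: disable one randomly chosen leg each session."""

    def __init__(self, env, n_legs=4, joints_per_leg=3, seed=0):
        self.env = env
        self.n_legs = n_legs
        self.joints_per_leg = joints_per_leg
        self.rng = np.random.default_rng(seed)
        self.crippled_leg = 0

    def reset(self):
        # A new leg is crippled at the start of every session.
        self.crippled_leg = int(self.rng.integers(self.n_legs))
        return self.env.reset()

    def step(self, action):
        action = np.asarray(action, dtype=float).copy()
        lo = self.crippled_leg * self.joints_per_leg
        action[lo:lo + self.joints_per_leg] = 0.0  # leg fully unusable
        return self.env.step(action)

# Hypothetical stand-in environment: reward is the total torque applied.
class _DummyAnt:
    def reset(self):
        return np.zeros(111)                       # 111-dim observation
    def step(self, action):
        return np.zeros(111), float(np.sum(np.abs(action))), False, {}

env = CrippledLegWrapper(_DummyAnt())
obs = env.reset()
obs, reward, done, info = env.step(np.ones(12))    # 3 of the 12 torques zeroed
```

Because the crippled leg changes between sessions, a fixed-weight policy cannot simply memorize one compensating gait; it must adapt within each lifetime.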
3. Results & Discussion
To ensure accurate measurement, performance is averaged over 100 episodes from 5 fully trained models per algorithm during the testing phase. Each algorithm is trained using the default hyperparameters from its respective paper. OpenAI-ES was compared against Proximal Policy Optimization (PPO), a policy gradient algorithm often used in continuous control problems, and each algorithm was evaluated with both fixed weights and differentially self-modifying ones. The experimental results demonstrate that self-modifying networks trained through Evolution Strategies consistently outperform networks without such augmentation trained using OpenAI-ES and PPO, as well as self-modifying networks trained using PPO. Total training time averaged around 214.8 minutes for the self-modifying OpenAI-ES and 968.8 minutes for the self-modifying PPO, both running on a standard 6-core CPU. Future work involves experimenting with new types of neuromodulation, as well as understanding the full capabilities of such networks.
- I. Clavera, A. Nagabandi, R. S. Fearing, P. Abbeel, S. Levine, and C. Finn. Learning to adapt: meta-learning for model-based control. CoRR abs/1803.11347, 2018.
- C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. CoRR abs/1703.03400, 2017.
- T. Miconi, A. Rawal, J. Clune, and K. O. Stanley. Backpropamine: training self-modifying neural networks with differentiable neuromodulated plasticity. In ICLR, 2019.
- J. W. Rae, C. Dyer, P. Dayan, and T. P. Lillicrap. Fast parametric learning with activation memorization. CoRR abs/1803.10049, 2018.
- T. Salimans, J. Ho, X. Chen, S. Sidor, and I. Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. CoRR abs/1703.03864, 2017.
- J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. CoRR abs/1707.06347, 2017.