Meta Reinforcement Learning with Distribution of Exploration Parameters Learned by Evolution Strategies

12/29/2018 ∙ by Yiming Shen, et al. ∙ 0

In this paper, we propose a novel meta-learning method in a reinforcement learning setting, based on evolution strategies (ES), exploration in parameter space and deterministic policy gradients. ES methods are easy to parallelize, which is desirable for modern training architectures; however, such methods typically require a huge number of samples for effective training. We use deterministic policy gradients during adaptation and other techniques to compensate for the sample-efficiency problem while maintaining the inherent scalability of ES methods. We demonstrate that our method achieves results comparable to or better than gradient-based meta-learning in high-dimensional control tasks in the MuJoCo simulator. In addition, because the meta-training phase is gradient-free and does not need gradient or policy information from adaptation training, we predict, and confirm experimentally, that our algorithm performs better on tasks that need multi-step adaptation.





Deep reinforcement learning, which combines deep learning and reinforcement learning, has achieved significant progress recently. The performance of state-of-the-art algorithms is close to or even better than human performance in Atari games [Mnih et al.2013], Go [Silver et al.2016, Silver et al.2017] and even multiplayer online games such as Dota. However, one inherent drawback of deep reinforcement learning is the tendency to overfit to the current environment setting, which makes agents unable to adapt quickly to slight variations in the environment. Researchers have proposed approaches combining deep reinforcement learning and meta-learning to address this problem, with the aim of improving the applicability of deep reinforcement learning to real-world problems. A store of prior knowledge, or “common sense”, is very important for humans learning to perform a new task quickly. Analogously, an agent can use experience from previous tasks and fast adaptation to perform a new, similar task by integrating this prior knowledge into the initial parameters of its neural networks. One common scheme for this approach is to learn a good initial parameter configuration integrating common knowledge of a distribution of tasks; the agent can then quickly find the appropriate parameters if a particular task is in the distribution. Recent research following this scheme, such as Model-Agnostic Meta-Learning (MAML) [Finn, Abbeel, and Levine2017], mainly focuses on gradient-based methods; these methods achieve state-of-the-art results in multi-task environments. However, gradient-based methods need higher-order gradients to train initial parameters; in reinforcement learning, these methods use Trust Region Policy Optimization (TRPO) to improve stability by limiting changes between the initial and adapted policies. This constraint is restrictive when many steps are needed to train an adapted policy.

To resolve these problems of gradient-based meta-learning, we consider methods whose meta-training does not need gradient information from adaptation training, such as evolution strategies (ES), which search over a distribution of neural network parameters. However, meta-learning with evolution strategies is less data-efficient than gradient-based meta-learning: it needs to sample more data during adaptation training than MAML does, and it is difficult to train on tasks that have continuous actions and continuous distributions. It is therefore important to improve data efficiency in adaptation training and to make use of parameter noise in meta-training. We use a deterministic policy gradient in adaptation training for a new task to improve sample efficiency, and use the noise from evolution strategies to explore varied strategies for a new task, because Gaussian noise in the parameters of a neural network can be used to try out new strategies in reinforcement learning [Plappert et al.2017]; using more noise in meta-training can therefore improve the training speed of evolution strategies.

We propose a novel approach, combining evolution strategies, parameter space noise and deterministic policy gradients to tackle the problem of meta-learning in a reinforcement learning setting. The key idea behind our approach is to enable the agent to learn the shared prior knowledge of a collection of tasks while exploring and sampling efficiently. The agent is represented by a meta-distribution of policies which, in fact, are Gaussian distributions over each parameter learned by evolution strategies. The mean values of parameters represent an overall good initial policy on the whole collection of tasks, while the standard deviations of parameters indicate how much such parameters should be tweaked to adapt to a specific new task. By using the different combinations of sampled policies according to the learned standard deviation, the sample-efficiency and training time of evolution strategies can be improved using our approach. The initial policy we are learning is a distribution instead of a deterministic policy, and we are learning initial policy and exploration strategies together, embedded in the meta-distribution, instead of learning them separately. One advantage of our approach is that it is easy to deploy in a parallel framework, since it does not need to compute gradients across parallel workers, and the performance grows almost linearly as the number of parallel workers is increased, without much cost due to increased communication between workers.

We apply our algorithm to several benchmark problems [Duan et al.2016] in MuJoCo environments, such as the half-cheetah and the ant with random target speeds and random goals. The results show that the performance of our method is close to or even better than that of MAML [Finn, Abbeel, and Levine2017].

Related Work

Evolution Strategies

The method of evolution strategies is inspired by the process of natural evolution [Back, Hoffmeister, and Schwefel1991]. The basic idea behind ES is as follows: a population, represented by policy parameters, is slightly perturbed at every generation to generate multiple new children. The performance of each child will be evaluated by a fitness function, which is an indicator of the benefit of the perturbation exerted on the population. The beneficial perturbations will be kept and reused in later generations. This iterative procedure will be repeated until a good solution for the objective is found. Current ES methods follow the above scheme and differ primarily in specific methods used for perturbation and selection. As a black-box optimization method, it has several desirable properties compared with the gradient-based methods more widely used in reinforcement learning today [Salimans et al.2017]:

  • No need to compute the gradients and back-propagate them.

  • Well adapted to environments whose rewards are sparsely-distributed.

  • Computation can be easily scaled to multiple parallel workers.

  • Indifferent to arbitrary length of horizon.

The specific ES method we use in our work belongs to the class of natural evolution strategies (NES) [Schaul et al.2008], which maintains a search distribution for perturbation and iteratively updates the distribution using the estimated gradients with respect to the fitness function. The general procedure of NES can be described as follows: In every iteration, the parameterized search distribution generates a batch of search points and a fitness function will be used to evaluate the performance of every point. Then, the gradients with respect to the fitness function will be computed to update the search distribution, in order to maximize the expected score on the current distribution. If we use $\theta$ to denote the parameters of the probability density $\pi(z \mid \theta)$ of the search distribution and $f(z)$ to denote the fitness function for sample $z$, the expected search gradient and the estimate of the search gradient from samples $z_1, \dots, z_\lambda$ can be written as

$$\nabla_\theta J(\theta) = \mathbb{E}_\theta\left[ f(z)\, \nabla_\theta \log \pi(z \mid \theta) \right] \approx \frac{1}{\lambda} \sum_{k=1}^{\lambda} f(z_k)\, \nabla_\theta \log \pi(z_k \mid \theta)$$
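The sample-based search gradient can be sketched in a few lines of Python. This is a minimal illustration, assuming a one-dimensional Gaussian search distribution with mean `mu` and standard deviation `sigma`; the function names and the quadratic `toy_fitness` are hypothetical and not taken from the paper.

```python
import random


def nes_gradient_estimate(fitness, mu, sigma, num_samples, rng):
    """Monte Carlo search gradient for a 1-D Gaussian N(mu, sigma^2)."""
    grad_mu, grad_sigma = 0.0, 0.0
    for _ in range(num_samples):
        z = rng.gauss(mu, sigma)  # one search point
        f = fitness(z)
        # Fitness-weighted score functions of the Gaussian log-density.
        grad_mu += f * (z - mu) / sigma ** 2
        grad_sigma += f * ((z - mu) ** 2 - sigma ** 2) / sigma ** 3
    return grad_mu / num_samples, grad_sigma / num_samples


def toy_fitness(z):
    # Hypothetical fitness with its maximum at z = 3.
    return -(z - 3.0) ** 2


rng = random.Random(0)
g_mu, g_sigma = nes_gradient_estimate(toy_fitness, 0.0, 1.0, 20000, rng)
```

With the search distribution centred at 0 and the optimum at 3, the estimated gradient with respect to the mean is positive, so a gradient ascent step moves the distribution toward the better region without ever differentiating the fitness function.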


Parameter Space Noise

Efficient and consistent exploration can prevent agents from converging prematurely on a local optimum and allow them to continue searching for a better one, which is crucial in gaining better performance. This is even more important in the meta-learning setting, since the agent needs to explore to understand the current environment. Various exploration methods have been proposed to address this problem. In deterministic methods, $\epsilon$-greedy exploration, softmax exploration and UCB exploration [Pecka and Svoboda2014] have been proposed, while in stochastic methods, the policy itself is a distribution over actions. However, most of the methods today focus on noise in action space, which might result in discarding all temporal structure and gradient information.

When Gaussian action noise is used, the action is sampled from a stochastic policy in which the stochasticity is independent of the current state. Therefore, even two identical states sampled in rollouts might result in the agent choosing completely different actions. If we denote the parameters of our model as $\theta$, the basic idea of parameter space noise [Plappert et al.2018] is to use a perturbed policy $\pi_{\tilde{\theta}}$ with $\tilde{\theta} = \theta + \mathcal{N}(0, \sigma^2 I)$ for more consistent exploration. The perturbed policy is sampled at the beginning of a rollout and kept fixed during the entire trajectory. For off-policy methods, the parameter space noise can be applied directly to the parameters, since such methods can learn from data collected by a different policy. In this case, the perturbed policy is used to collect samples while the non-perturbed policy is trained.
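The difference between the two kinds of noise can be illustrated with a small sketch. It assumes a hypothetical linear policy and arbitrary parameter values; only the pattern (perturb the parameters once per rollout versus perturb every action) reflects the text above.

```python
import random


def perturb(params, sigma, rng):
    """Return a copy of params with i.i.d. Gaussian noise added."""
    return [p + rng.gauss(0.0, sigma) for p in params]


def linear_policy(params, state):
    """Hypothetical deterministic policy: action = params . state."""
    return sum(w * s for w, s in zip(params, state))


rng = random.Random(0)
theta = [0.5, -0.2]
state = [1.0, 2.0]

# Parameter space noise: perturb once, then keep the policy fixed for the rollout.
theta_tilde = perturb(theta, sigma=0.1, rng=rng)
param_noise_actions = [linear_policy(theta_tilde, state) for _ in range(3)]

# Action space noise: fresh noise is added to the action at every step.
action_noise_actions = [
    linear_policy(theta, state) + rng.gauss(0.0, 0.1) for _ in range(3)
]
```

Visiting the same state three times gives three identical actions under parameter space noise, but three different actions under action space noise, which is the consistency property described above.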

Deterministic Policy Gradient

Policy gradient algorithms [Sutton1999] are widely used in reinforcement learning. The basic idea behind them is to represent the policy as a parametric probability distribution $\pi_\theta(a \mid s)$, from which the action actually taken is sampled. Training the policy amounts to moving the distribution in the direction of higher reward using its gradient. The counterpart of stochastic policy gradients is deterministic policy gradients [Mnih et al.2015], which represent the policy as a mapping $\mu_\theta(s)$ from the current state to an exact action instead of a distribution. In the stochastic policy gradient theorem, the policy gradient can be written as:

$$\nabla_\theta J(\pi_\theta) = \int_{\mathcal{S}} \rho^{\pi}(s) \int_{\mathcal{A}} \nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi}(s, a)\, da\, ds$$

The stochastic policy gradient needs to integrate over both action space and state space, so more samples are required, especially in high-dimensional action space. However, the deterministic policy gradient only integrates over the state space:

$$\nabla_\theta J(\mu_\theta) = \int_{\mathcal{S}} \rho^{\mu}(s)\, \nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu}(s, a) \big|_{a = \mu_\theta(s)}\, ds$$

This simpler form means that the deterministic policy gradient can be estimated much more efficiently than the usual stochastic version.

Meta Reinforcement Learning with Distribution of Exploration Parameters Learned by Evolution Strategies

Meta Reinforcement Learning Problem Setup

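In meta-learning for RL, each task $\mathcal{T}_i$ consists of an initial state distribution $q_i(x_1)$, a transition distribution $q_i(x_{t+1} \mid x_t, a_t)$ and a loss function $\mathcal{L}_{\mathcal{T}_i}$ corresponding to the (negative) reward function $R_i$. If we denote the length of the horizon in such a Markov decision process as $H$ and the model of the agent as $f_\theta$, then the loss for task $\mathcal{T}_i$ and model $f_\theta$ takes the form:

$$\mathcal{L}_{\mathcal{T}_i}(f_\theta) = -\mathbb{E}_{x_t, a_t \sim f_\theta, q_i}\left[ \sum_{t=1}^{H} R_i(x_t, a_t) \right]$$

If we denote $\mathcal{T}$ as the collection of tasks and $p(\mathcal{T})$ as the distribution of $\mathcal{T}_i$, the overall loss on all tasks can be written as follows:

$$\mathcal{L}(f_\theta) = \mathbb{E}_{\mathcal{T}_i \sim p(\mathcal{T})}\left[ \mathcal{L}_{\mathcal{T}_i}(f_\theta) \right]$$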


In $K$-shot learning, which we are focusing on, $K$ rollouts can be acquired from the current policy and model. Those rollouts will be used for adaptation training on the current task and the performance of the agent will be evaluated after adaptation training.

Outline of our Algorithms

The algorithm we propose can be viewed as an outer loop for meta-training and an inner loop for adaptation training. In the adaptation training phase, the agent is trained on the same task for a few iterations, while in the meta-training phase, the environment is switched between different tasks. At the beginning of adaptation training, $K$ sets of model parameters are sampled from the policy distribution, which is the distribution the agent learns in meta-training in order to explore new tasks. The average of these $K$ sampled parameter sets is used as the initial model parameters for adaptation training. Each sampled model, which can be regarded as the averaged model with noise in parameter space, collects rollouts in the current task for replay and training. This sampling process is repeated $M$ times in each meta-training iteration to obtain $M$ initial models. In meta-training, the performance of each initial model after adaptation training is passed to the fitness function together with its corresponding sampled parameters to update the meta-distribution. The outer meta-training loop and the inner adaptation loop continue until a good meta-distribution is found.

Fast Adaptation and Sampling

The basic idea of fast adaptation in adaptation training is to learn a new task with a small amount of experience, which means that the agent needs efficient exploration and high sample-efficiency. Off-policy methods usually have comparatively higher sample-efficiency than on-policy methods. Also, deterministic policy gradient methods have higher sample-efficiency than stochastic policy gradient methods. Both of these greatly improve the convergence of adaptation training. Parameter space noise is the key for consistent and efficient exploration in new tasks. In our algorithm, the initial parameters in adaptation training are the average of $K$ samples. In this situation, the samples follow the original distribution $\mathcal{N}(\mu, \sigma^2)$, while the average of the samples, which is the initial policy in new tasks, follows $\mathcal{N}(\mu, \sigma^2 / K)$. The samples have higher deviations for better exploration, while the initial policy with lower deviation makes the learning more stable.
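The variance reduction from averaging can be checked numerically. The sketch below assumes a single scalar parameter and illustrative values for the mean, standard deviation and number of samples; it compares the empirical spread of the averaged parameter with the prediction for the mean of independent Gaussian samples.

```python
import math
import random
import statistics

rng = random.Random(0)
mu, sigma, k = 0.0, 0.5, 20  # illustrative values, not from the paper


def averaged_sample(rng):
    """Average of k exploration samples drawn from N(mu, sigma^2)."""
    return sum(rng.gauss(mu, sigma) for _ in range(k)) / k


means = [averaged_sample(rng) for _ in range(5000)]
empirical_std = statistics.pstdev(means)
predicted_std = sigma / math.sqrt(k)  # standard deviation of the mean of k samples
```

The individual samples keep the full standard deviation for exploration, while the averaged initial parameter is concentrated around the mean by a factor of the square root of the number of samples.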

Learning Distribution of Exploration Parameters by Evolution Strategies

We use Deep Deterministic Policy Gradient (DDPG) to perform adaptive training, hence we need to co-evolve actors and critics. In order to minimize the squared error between the critics after adaptive training and the Q value, the critics’ fitness function is the negative of the squared error between the adapted critic and the test set.


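The goal of evolution strategies for actors is to find a Gaussian distribution for each parameter of the actor such that sampling from it maximizes the expected fitness after adaptation to new tasks. Since the parameters $\theta_j$ are independent, their joint distribution is the product of the per-parameter Gaussians, $\prod_j \mathcal{N}(\theta_j; \mu_j, \sigma_j^2)$. The goal of the whole algorithm becomes finding the means $\mu$ and standard deviations $\sigma$: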


The gradients of the actor and the critic in adaptation training are defined as below. Note that both gradients are independent of the fitness functions used in meta-training.


Since the meta-distribution is Gaussian, the gradients of its log-density with respect to the mean and standard deviation can be derived as:
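As a reference point, and assuming each parameter $\theta_j$ is drawn from an independent Gaussian $\mathcal{N}(\mu_j, \sigma_j^2)$ (the symbols are illustrative), the standard log-density gradients that a search-gradient update builds on are:

```latex
\log p(\theta_j \mid \mu_j, \sigma_j)
  = -\log \sigma_j - \tfrac{1}{2}\log(2\pi) - \frac{(\theta_j - \mu_j)^2}{2\sigma_j^2}

\nabla_{\mu_j} \log p(\theta_j \mid \mu_j, \sigma_j)
  = \frac{\theta_j - \mu_j}{\sigma_j^2},
\qquad
\nabla_{\sigma_j} \log p(\theta_j \mid \mu_j, \sigma_j)
  = \frac{(\theta_j - \mu_j)^2 - \sigma_j^2}{\sigma_j^3}
```

Weighting these terms by the fitness of each sample and averaging gives a search-gradient estimate for the mean and the standard deviation of every parameter.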


Meta-distribution Learning Algorithms


Algorithm 1
  Initialize the parameters of the actor and critic distributions
  while not done do
     Sample a mini-batch of tasks from the task distribution
     for i = 1 to M do
        sample K actor parameters from the actor distribution and a critic parameter from the critic distribution
        for each task in the mini-batch do
           get K trajectories based on the K actor parameters
           initialize the actor parameter as the average of the K actor parameters
           use the K trajectories and the critic to adaptively train the actor
           use the adapted parameters to sample returns in the task
        end for
     end for
     update the actor and critic distributions using Equation 9
  end while
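The structure of the outer and inner loops can be sketched on a toy problem. Everything specific here is an assumption made for illustration: the policy is a single scalar, the task is a hypothetical goal value, `adapt` is a simple advantage-weighted step standing in for the DDPG adaptation (there is no critic), and the update uses standardized fitness values and clipping rather than the paper's Equation 9.

```python
import random
import statistics


def task_return(theta, goal):
    # Hypothetical 1-D stand-in for a rollout return: best when theta equals the goal.
    return -(theta - goal) ** 2


def adapt(samples, goal, step_size=0.5):
    """Stand-in for adaptation training (the paper uses DDPG here)."""
    theta0 = sum(samples) / len(samples)  # initial policy: average of the K samples
    returns = [task_return(s, goal) for s in samples]  # "trajectories" of the K samples
    baseline = sum(returns) / len(returns)
    advantages = [r - baseline for r in returns]
    norm = sum(abs(a) for a in advantages) + 1e-8
    step = sum(a * (s - theta0) for a, s in zip(advantages, samples)) / norm
    return theta0 + step_size * step


def meta_train(iterations=300, workers=20, k=10, tasks_per_batch=5, seed=0):
    rng = random.Random(seed)
    mu, sigma = -1.0, 0.5  # meta-distribution over the single policy parameter
    for _ in range(iterations):
        goals = [rng.uniform(0.0, 2.0) for _ in range(tasks_per_batch)]
        all_samples, fitness = [], []
        for _ in range(workers):
            samples = [rng.gauss(mu, sigma) for _ in range(k)]
            scores = [task_return(adapt(samples, g), g) for g in goals]
            all_samples.append(samples)
            fitness.append(sum(scores) / len(scores))  # return after adaptation
        mean_f = sum(fitness) / workers
        std_f = statistics.pstdev(fitness) + 1e-8
        weights = [(f - mean_f) / std_f for f in fitness]
        # Fitness-weighted score functions, pre-multiplied by sigma^2 for scale.
        d_mu = sum(
            w * sum(s - mu for s in smp) / k for w, smp in zip(weights, all_samples)
        ) / workers
        d_sigma = sum(
            w * sum((s - mu) ** 2 - sigma ** 2 for s in smp) / (k * sigma)
            for w, smp in zip(weights, all_samples)
        ) / workers
        mu += 0.5 * d_mu
        sigma = min(2.0, max(0.2, sigma + 0.1 * d_sigma))  # keep sigma in a safe range
    return mu, sigma


mu, sigma = meta_train()
```

In this toy setting the goals are drawn uniformly from [0, 2], so the learned mean should drift from its starting value of -1 toward the middle of the goal range, while the standard deviation settles at a level that trades off exploration against noise in the averaged initial parameter.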


Meta-learning in reinforcement learning is analogous to few-shot learning in supervised learning. After training on a collection of tasks, the agent should be able to learn a new task with only a small amount of further training. That is to say we are given a distribution that encompasses both the collection of tasks in the training set and the new task in the test set. In our algorithm, the ideal agent has the optimal starting parameters from which to start exploring efficiently when learning a new task.

Broadly speaking, a new task might consist of achieving a new goal, operating under a new environment or a different transition distribution, yet all tasks must be from the same task distribution. In our experiments, we use the benchmark environment [Duan et al.2016], which is a simulated continuous control environment. More specifically, we use a modified version designed for Model-Agnostic Meta-Learning [Finn, Abbeel, and Levine2017].


The high-dimensional control tasks in the MuJoCo simulator [Todorov, Erez, and Tassa2012] fall into four categories: two simulated robots, each with two kinds of goals. The simulated robots in our tasks are a planar cheetah and a 3D quadruped (the ‘ant’). The goals are to make the robot run at a particular velocity or in a particular direction. The target value of the goal is not given as an input to the robot, so the robot must discover it from the experience sampled during adaptation training. In the velocity tasks, the reward is the sum of the negative absolute difference between the velocity of the agent and a goal velocity, which is sampled uniformly between 0.0 and 2.0 for the cheetah and between 0.0 and 3.0 for the ant. In the direction tasks, the reward is the sum of the magnitude of the velocity in the corresponding direction (forward or backward).
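The two reward definitions can be written down directly. The sketch below covers only the terms described above; the function names, the `+1.0`/`-1.0` encoding of the direction and the example velocities are assumptions made for illustration.

```python
import random


def sample_goal_velocity(robot, rng):
    """Goal velocity sampled uniformly from the range given for each robot."""
    upper = {"cheetah": 2.0, "ant": 3.0}[robot]
    return rng.uniform(0.0, upper)


def velocity_step_reward(velocity, goal_velocity):
    """Negative absolute difference between the current and the goal velocity."""
    return -abs(velocity - goal_velocity)


def direction_step_reward(velocity, direction):
    """Velocity along the goal direction (+1.0 forward, -1.0 backward; assumed encoding)."""
    return direction * velocity


def trajectory_return(step_rewards):
    """The reward of a trajectory is the sum of its per-step rewards."""
    return sum(step_rewards)


rng = random.Random(0)
goal = sample_goal_velocity("cheetah", rng)

velocities = [0.5, 1.0, 1.5]  # hypothetical forward velocities over three steps
velocity_return = trajectory_return(velocity_step_reward(v, 1.0) for v in velocities)
backward_return = trajectory_return(direction_step_reward(v, -1.0) for v in velocities)
```

Because the goal is used only inside the reward and never passed to the policy, the agent can infer it only from the returns it observes during adaptation training.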

Our environment setting is exactly the same as described in MAML. The horizon for each task is H = 200. In every iteration, the amount of experience used for adaptation training for a specific task for each worker in our algorithm is the same as in MAML, which is 20 trajectories per iteration for all problems except the ant forward/backward task, which uses 40 trajectories.

Implementation and Details

In all tasks, the specific model for each worker is an actor neural network and a critic neural network. The actor network’s input is an observation of the environment, and its output is an action. The network has two hidden layers of size 100 with ReLU nonlinearities, and its weights use Xavier random initialization. The critic network, which evaluates state-action values, has the same structure as the actor, except that the input of the second hidden layer is the output of the first hidden layer concatenated with the action. Besides these networks describing the mean of the Gaussian distribution (Eq. 6, Eq. 7), two extra sets of variables describing the corresponding standard deviations are also used to help training and exploration. Hence, we use four Stochastic Gradient Descent (SGD) optimizers to train the related variables. It is worth noting that the Adam optimizer [Kingma and Ba2014] does not help the training of evolution strategies. The specific meta-training steps are performed according to the formula in Eq. 9.
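The actor and critic shapes described above can be sketched with the standard library alone. Several details are assumptions: the uniform variant of Xavier initialization, zero biases, linear output layers, and the small observation and action sizes used in the example.

```python
import math
import random


def make_layer(fan_in, fan_out, rng):
    """Weights with Xavier (Glorot) uniform initialization and zero biases."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    weights = [[rng.uniform(-limit, limit) for _ in range(fan_out)] for _ in range(fan_in)]
    return weights, [0.0] * fan_out


def dense(x, weights, bias, relu=True):
    """Fully connected layer on a plain list of floats."""
    out = [sum(xi * row[j] for xi, row in zip(x, weights)) + bias[j] for j in range(len(bias))]
    return [max(0.0, v) for v in out] if relu else out


def make_actor(obs_dim, act_dim, rng, hidden=100):
    return [make_layer(obs_dim, hidden, rng), make_layer(hidden, hidden, rng),
            make_layer(hidden, act_dim, rng)]


def actor_forward(actor, obs):
    h = dense(obs, *actor[0])
    h = dense(h, *actor[1])
    return dense(h, *actor[2], relu=False)  # linear output (assumed)


def make_critic(obs_dim, act_dim, rng, hidden=100):
    # The second hidden layer also receives the action, so its input is wider.
    return [make_layer(obs_dim, hidden, rng), make_layer(hidden + act_dim, hidden, rng),
            make_layer(hidden, 1, rng)]


def critic_forward(critic, obs, action):
    h = dense(obs, *critic[0])
    h = dense(h + list(action), *critic[1])  # concatenate first-layer output with action
    return dense(h, *critic[2], relu=False)[0]


rng = random.Random(0)
obs_dim, act_dim = 4, 2  # hypothetical sizes
actor = make_actor(obs_dim, act_dim, rng)
critic = make_critic(obs_dim, act_dim, rng)
obs = [0.1, -0.2, 0.3, 0.0]
action = actor_forward(actor, obs)
q_value = critic_forward(critic, obs, action)
```

In the method described above, these weights play the role of the mean of the meta-distribution, and a second set of variables of the same shape would hold the per-parameter standard deviations.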

During every iteration, for each worker, we use some random seeds (20 or 40 depending on the experiment) to perturb the meta-network to produce corresponding exploration networks. These exploration networks will sample trajectories to build a temporal replay buffer for the central network, whose parameters are the average parameters of the exploration networks. The central network is trained and then performs a rollout once to calculate the fitness of this worker. After using all workers to evolve these seeds, a new meta network will be broadcast to each worker. The above procedure is repeated until an appropriate meta-policy has been achieved.
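One way to realize the seed-based perturbation is sketched below, assuming (in the style of Salimans et al.) that each exploration network is regenerated from an integer seed so that only seeds and fitness values need to be exchanged. The helper names and the tiny parameter vectors are hypothetical.

```python
import random


def noise_from_seed(seed, size):
    """Regenerate the same standard normal noise vector from an integer seed."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


def exploration_params(mu, sigma, seed):
    """Perturb the meta-network: one exploration network per seed."""
    eps = noise_from_seed(seed, len(mu))
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]


def central_params(mu, sigma, seeds):
    """The central network uses the average parameters of the exploration networks."""
    nets = [exploration_params(mu, sigma, seed) for seed in seeds]
    return [sum(column) / len(nets) for column in zip(*nets)]


mu = [0.0, 1.0, -1.0]     # hypothetical meta-network means
sigma = [0.1, 0.2, 0.3]   # hypothetical per-parameter standard deviations
seeds = list(range(20))   # 20 seeds, matching the 20-trajectory setting

central = central_params(mu, sigma, seeds)
same_again = exploration_params(mu, sigma, 7) == exploration_params(mu, sigma, 7)
```

Because the noise is a deterministic function of the seed, any worker can rebuild another worker's exploration networks from the seed alone, which keeps the communication between workers small.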

Experimental Results

Figure 1: Our experimental results for the MuJoCo velocity task.

The goal of these experiments is to verify the following points: (1) Given the same amount of computation for every worker, our algorithm can achieve superior performance compared to the second-order adaptation training method described in MAML [Finn, Abbeel, and Levine2017]; (2) As described in [Salimans et al.2017], highly parallel, evolution strategy-based optimization algorithms can achieve linear speedups even when using many workers, so we show that such speedups also apply to our algorithm; (3) Because adaptation uses an off-policy algorithm [Lillicrap et al.2015], our algorithm can achieve better results when performing several adaptation training steps repeatedly on the same experience of 20 rollouts (or 40 for the ant forward/backward task). We address these points separately below.

  1. Data Efficiency: As illustrated in the upper left and lower right of Figure 1, when using 64 workers, our algorithm can converge faster and achieve better results than MAML. We hypothesize that using more workers can achieve even better results. In addition, we have also tried the methods in [Fernando et al.2018], which combine A2C and ES optimization, but this method fails when applied to high dimensional environments.

    The standard deviation of the Gaussian distribution we are searching is helpful for workers’ exploration and training. Using a trainable standard deviation instead of a fixed value enables agents to adjust the extent of exploration according to the task in order that more effective experience can be sampled during adaptation training. The upper right of Figure 1 shows that by using a flexible standard deviation under identical hyper-parameters, a higher final score and more stable training can be achieved.

  2. Speedups: As mentioned in [Salimans et al.2017], an ES optimizer is particularly amenable to parallelization, because it only requires infrequent communication after complete episodes. Here, we compare the performance between 8 workers and 64 workers. The upper right of Figure 1 shows that, for the given task, a greater number of workers achieves better results more quickly.

  3. Effectiveness: Because of the many higher-order derivatives that MAML needs to consider during meta-training, we predict that MAML may fail to learn when adaptation training takes several gradient steps. ES, being a black-box optimization algorithm, does not need to consider these derivatives. We use the half-cheetah velocity experiment to compare the performance of MAML and our algorithm; we run MAML twice, once with 3 gradient updates in the adaptation training loop and once with only 1 gradient update per loop. The results in the lower left of Figure 1 show that MAML with 3 updates fails to learn, while MAML with 1 update does not learn as well as our algorithm.

Discussion and Future Work

We introduce a method based on evolution strategies, which is comparable to gradient-based meta-learning methods in reinforcement learning. Because our proposed method does not need higher-order gradients in multi-step task adaptation training, it has more potential to learn well on more complex tasks, like playing different levels from a single video game. The evolution strategy can learn a distribution of exploration parameters and obtain the initial parameters in a simple way. We use noise perturbation points to approximate a mean value, as opposed to the standard method of creating noise from a mean, so we can evolve our policy distributions without using higher-order gradients. In future research, we plan to use more flexible ways to obtain the initial parameters of new tasks and more types of noise distributions.