1 Introduction
Reinforcement learning (RL) approaches have seen many successes in recent years, from mastering the complex game of Go [silver2017mastering] to discovering molecules [olivecrona2017molecular]. However, a common limitation of these methods is their propensity to overfit to a single task and their inability to adapt to even slightly perturbed configurations [zhang2018study]. Humans, on the other hand, have an astonishing ability to learn new tasks in a matter of minutes by using their prior knowledge and understanding of the underlying task mechanics. Drawing inspiration from human behavior, researchers have proposed to incorporate multiple inductive biases and heuristics to help models learn quickly and generalize to unseen scenarios. However, despite considerable effort, it has been difficult to approach human levels of data efficiency and generalization.
Meta reinforcement learning (meta-RL) addresses these shortcomings by learning how to learn these inductive biases and heuristics from the data itself. It strives to learn an algorithm that allows an agent to succeed in a previously unseen task or environment when only limited experience is available. These inductive biases or heuristics can be induced in the model in various ways, such as through the optimization algorithm, policy, hyperparameters, network architecture, loss function, or exploration strategies [sharaf2019meta]. Recently, a class of parameter-initialization-based meta-learning approaches has gained attention, most notably Model-Agnostic Meta-Learning (MAML) [finn2017model]. MAML finds a good initialization for a model or a policy that can be adapted to a new task by fine-tuning with policy gradient updates from a few samples of that task.

Since the objective of meta-RL algorithms is to adapt to a new task from a few examples, efficient exploration strategies are crucial for quickly finding the optimal policy in a new environment. Some recent works have tried to address this problem by improving the credit assignment of the meta-learning objective to the pre-update trajectory distribution [stadie2018some, rothfuss2018promp]. However, this requires transitioning the base policy from exploration behavior to exploitation behavior using one or a few policy gradient updates. This limits the applicability of these methods to cases where the post-update (exploitation) policy is similar to the pre-update (exploration) policy and can be obtained with only a few updates. Additionally, in cases where the pre- and post-update policies are expected to exhibit different behaviors, large gradient updates may result in training instabilities and poor performance at convergence.
To address this problem, we propose to explicitly model a separate exploration policy for the task distribution. The exploration policy is trained to find trajectories that can lead to fast adaptation of the exploitation policy on the given task. This formulation provides much more flexibility in training the exploration policy. We also show that separating the exploration and exploitation policies helps us improve or match the performance of baseline approaches on meta-RL benchmark tasks. We further show that, in order to perform stable and fast adaptation to a new task, it is often more useful to use self-supervised or supervised learning approaches for the inner-loop (adaptation) updates where possible. This also yields more stable gradient update steps when training the exploitation policy with trajectories collected from a different exploration policy.
2 Related work
Meta-learning algorithms proposed in the RL community include approaches based on recurrent models [duan2016rl, finn2017meta], metric learning [snell2017prototypical, sung2018learning], and learning optimizers [alex20181storder]. Recently, [finn2017model] proposed Model-Agnostic Meta-Learning (MAML), which aims to learn a policy that can generalize to a distribution of tasks. Specifically, it aims to find a good initialization for a policy that can be adapted to any task sampled from the distribution by fine-tuning with policy gradient updates from a few samples of that task.
Efficient exploration strategies are crucial for finding trajectories that can lead to quick adaptation of the policy in a new environment. Recent works [gupta2018meta, rakelly2019efficient] have proposed structured exploration strategies using latent variables to perform efficient exploration across successive episodes; however, they did not explicitly incentivize exploration in the pre-update episodes. E-MAML [stadie2018some] made the first attempt at assigning credit for the final expected returns to the pre-update distribution in order to incentivize exploration in each of the pre-update episodes. [rothfuss2018promp] proposed Proximal Meta-Policy Search (ProMP), which incorporates the causal structure for more efficient credit assignment and uses a low-variance curvature surrogate objective to control the variance of the corresponding policy gradient update. However, these methods use a single base policy for both exploration and exploitation while relying on one or a few gradient updates to transition from exploration behavior to exploitation behavior. Over the next few sections, we illustrate that this approach is problematic and insufficient when the exploration and exploitation behaviors are quite different from each other.
A number of prior works have utilized self-supervised objectives [mahmoudieh2017self, florensa2019self, kahn2018self, pathak2017curiosity, wortsman2018learning] to ease learning, especially when the reward signal itself is insufficient to provide the required level of feedback. Drawing inspiration from these approaches, we modify the inner-loop update/adaptation step in MAML using a self-supervised objective to allow more stable and faster updates. Concurrent to our work, [yang2019norml] also decoupled the exploration and adaptation policies, where the latter is initialized as a learnable offset to the exploration policy.
3 Background
3.1 Meta-Reinforcement Learning
Unlike RL, which tries to find an optimal policy for a single task, meta-RL aims to find a policy that can generalize to a distribution of tasks. Each task $\mathcal{T}$ sampled from the distribution $\rho(\mathcal{T})$ corresponds to a different Markov Decision Process (MDP) defined by the tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma, H)$ with state space $\mathcal{S}$, action space $\mathcal{A}$, transition dynamics $P(s_{t+1}|s_t, a_t)$, reward function $r(s_t, a_t)$, discount factor $\gamma$, and time horizon $H$. These MDPs are assumed to have similar state and action spaces but may differ in the reward function $r$ or the environment dynamics $P$. The goal of meta-RL is to quickly adapt the policy to any task $\mathcal{T} \sim \rho(\mathcal{T})$ with the help of a few examples from that task.

3.2 Credit Assignment in Meta-RL
MAML is a gradient-based meta-RL algorithm that tries to find a good initialization for a policy, which can be adapted to any task $\mathcal{T} \sim \rho(\mathcal{T})$ by fine-tuning with one or more gradient updates using the sampled trajectories of that task. MAML maximizes the following objective function:

$$J(\theta) = \mathbb{E}_{\mathcal{T} \sim \rho(\mathcal{T})}\Big[\mathbb{E}_{\tau' \sim P_{\mathcal{T}}(\tau'|\theta')}\big[R(\tau')\big]\Big] \quad \text{with} \quad \theta' = U(\theta, \mathcal{T}) \quad (1)$$

where $U$ is the update function that performs one policy gradient ascent step to maximize the expected reward $R(\tau)$ obtained on the trajectories $\tau$ sampled from task $\mathcal{T}$.
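To make the mechanics of the update function $U(\theta, \mathcal{T})$ concrete, here is a minimal sketch: one gradient-ascent step on a task-specific objective. The quadratic surrogate below is purely illustrative (it stands in for the policy-gradient estimate of the expected return), and all names are hypothetical rather than the paper's implementation.

```python
import numpy as np

def inner_update(theta, task_return_grad, alpha=0.5):
    """U(theta, T): one gradient-ascent step on the task's expected return."""
    return theta + alpha * task_return_grad(theta)

# Illustrative task: the "return" is -0.5 * ||theta - goal||^2, whose gradient
# w.r.t. theta is (goal - theta); goal plays the role of the task parameter.
goal = np.array([1.0, -1.0])
return_grad = lambda th: goal - th

theta = np.zeros(2)                              # meta-learned initialization
theta_prime = inner_update(theta, return_grad)   # adapted parameters theta'
```

With step size 0.5, the adapted parameters move halfway from the initialization toward the task's optimum, which is exactly the "fine-tune from a shared initialization" picture of Eq. 1.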
[rothfuss2018promp] showed that the gradient of the objective function in Eq. 1 can be written as:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\mathcal{T} \sim \rho(\mathcal{T})}\Big[\mathbb{E}_{\tau \sim P_{\mathcal{T}}(\tau|\theta),\, \tau' \sim P_{\mathcal{T}}(\tau'|\theta')}\big[\nabla_\theta J_{\text{post}}(\tau, \tau') + \nabla_\theta J_{\text{pre}}(\tau, \tau')\big]\Big]$$

where,

$$\nabla_\theta J_{\text{post}}(\tau, \tau') = \big(\nabla_{\theta'} \log \pi_{\theta'}(\tau') R(\tau')\big)\big(I + \alpha R(\tau) \nabla^2_\theta \log \pi_\theta(\tau)\big) \quad (2)$$

$$\nabla_\theta J_{\text{pre}}(\tau, \tau') = \alpha \nabla_\theta \log \pi_\theta(\tau) \Big(\big(\nabla_\theta \log \pi_\theta(\tau) R(\tau)\big)^\top \big(\nabla_{\theta'} \log \pi_{\theta'}(\tau') R(\tau')\big)\Big) \quad (3)$$
The first term, $\nabla_\theta J_{\text{post}}$, corresponds to a policy gradient step on the post-update policy $\pi_{\theta'}$ w.r.t. the post-update parameters $\theta'$, which is then followed by a linear transformation from $\theta'$ to $\theta$ (the pre-update parameters). Note that $J_{\text{post}}$ optimizes $\theta$ to increase the likelihood of the trajectories $\tau'$ that lead to higher returns given some trajectories $\tau$. However, this term does not optimize $\theta$ to yield trajectories $\tau$ that lead to good adaptation steps. That is, in fact, done by the second term, $\nabla_\theta J_{\text{pre}}$: it optimizes for the pre-update trajectory distribution $P_{\mathcal{T}}(\tau|\theta)$, i.e., it increases the likelihood of trajectories $\tau$ that lead to good adaptation steps. During optimization, MAML only considers $\nabla_\theta J_{\text{post}}$ and ignores $\nabla_\theta J_{\text{pre}}$. Thus MAML finds a policy that adapts quickly to a task given relevant experiences; however, the policy is not optimized to gather useful experiences from the environment that can lead to fast adaptation.
[rothfuss2018promp] proposed ProMP, which analyzes this issue with MAML and incorporates the $\nabla_\theta J_{\text{pre}}$ term in the update as well. They used the Infinitely Differentiable Monte-Carlo Estimator (DiCE) [foerster2018DICE] to allow causal credit assignment on the pre-update trajectory distribution; however, the gradients computed by DiCE still have high variance. To remedy this, they proposed a low-variance (and slightly biased) approximation of the DiCE-based loss that leads to stable updates.

4 Method
The pre-update and post-update policies are often expected to exhibit very different behaviors, namely exploration and exploitation behaviors respectively. For instance, consider a 2D environment where a task corresponds to reaching a goal location sampled randomly from a semicircular region (example shown in the appendix). The agent receives a reward only if it lies in some vicinity of the goal location. The optimal pre-update (exploration) policy is to move around in the semicircular region, whereas the ideal post-update (exploitation) policy is to reach the goal state as fast as possible once the goal region is discovered. Clearly, the two policies are expected to behave very differently. In such cases, transitioning a single policy from a pure exploration phase to a pure exploitation phase via policy gradient updates requires multiple steps. Unfortunately, this significantly increases the computational and memory complexity of the algorithm. Furthermore, it may not even be possible to achieve this transition via a few gradient updates. This raises an important question: Do we really need to use the pre-update policy for exploration as well? Can we use a separate policy for exploration?
Using separate policies for pre-update and post-update sampling: The straightforward solution to the above problem is to use a separate exploration policy $\mu_\phi$ responsible for collecting trajectories for the inner-loop updates that produce $\theta'$. Following that, the post-update policy $\pi_{\theta'}$ can be used to collect trajectories for performing the outer-loop updates. Unfortunately, this is not as simple as it sounds. To understand this, let's look at the inner-loop update:

$$\theta' = \theta + \alpha \nabla_\theta \mathbb{E}_{\tau \sim P_{\mathcal{T}}(\tau|\theta)}\big[R(\tau)\big]$$

When the exploration policy $\mu_\phi$ is used for sampling trajectories, we need to perform importance sampling. The update would thus become:

$$\theta' = \theta + \alpha \nabla_\theta \mathbb{E}_{\tau \sim P_{\mathcal{T}}(\tau|\phi)}\bigg[\frac{P_{\mathcal{T}}(\tau|\theta)}{P_{\mathcal{T}}(\tau|\phi)} R(\tau)\bigg]$$

where $P_{\mathcal{T}}(\tau|\theta)$ and $P_{\mathcal{T}}(\tau|\phi)$ represent the trajectory distributions sampled by $\pi_\theta$ and $\mu_\phi$ respectively. Note that the above update is an off-policy update, which results in high-variance estimates when the two trajectory distributions are quite different from each other. This makes it infeasible to use the importance sampling update in its current form. In fact, this is a more general problem that arises even in the on-policy regime: the policy gradient updates in the inner loop result in both the $\nabla_\theta J_{\text{pre}}$ and $\nabla_\theta J_{\text{post}}$ terms being high variance. This stems from the misalignment of the outer gradients and the inner gradient and Hessian terms appearing in Eqs. 2 and 3. This motivates our second question: Do we really need the pre-update gradients to be policy gradients? Can we use a different objective in the inner loop to get more stable updates?
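A small numerical sketch (illustrative, not the paper's setup) of why the importance-sampled update becomes high variance: with a Gaussian behavior distribution standing in for $P_{\mathcal{T}}(\tau|\phi)$ and a shifted Gaussian for $P_{\mathcal{T}}(\tau|\theta)$, the variance of the importance weights grows rapidly as the two distributions separate.

```python
import numpy as np

rng = np.random.default_rng(0)

def importance_weight_variance(shift, n=100_000):
    """Empirical variance of w(x) = p_target(x) / p_behavior(x).

    Behavior: N(0, 1); target: N(shift, 1). For these Gaussians the weight
    has the closed form w(x) = exp(shift * x - shift**2 / 2).
    """
    x = rng.normal(0.0, 1.0, size=n)        # samples from the behavior dist.
    w = np.exp(shift * x - 0.5 * shift**2)
    return w.var()

var_near = importance_weight_variance(0.1)  # target close to behavior
var_far = importance_weight_variance(2.0)   # target far from behavior
```

Analytically, the weight variance here is $e^{\text{shift}^2} - 1$: about 0.01 for a shift of 0.1, but over 50 for a shift of 2 — exactly the regime where exploration and exploitation policies behave very differently.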
Using a self-supervised/supervised objective for the inner-loop update step: The instability in the inner-loop updates arises from the high-variance nature of the policy gradient update. Note that the objective of the inner-loop update is to provide task-specific information to the agent, with the help of which it can adapt its behavior to the new environment. We believe this can be achieved using some form of self-supervised or supervised learning objective in place of the policy gradient in the inner loop, ensuring that the updates are more stable. We propose to use a network for predicting some task- (or MDP-) specific property such as the reward function, expected return, or value. During the inner-loop update, the network updates its parameters by minimizing its prediction error on the given task. Unlike prior meta-RL works, where task adaptation in the inner loop is done via policy gradient updates, here we update a set of parameters shared with the exploitation policy using a supervised loss, which leads to stable updates during the adaptation phase. Note, however, that the variance and usefulness of the update depend heavily on the choice of the self-supervision/supervision objective. We delve into this in more detail in Section 4.1.1.
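As a sketch of this idea (with an illustrative linear predictor standing in for the self-supervision network, and hypothetical names throughout), the inner-loop update becomes ordinary gradient descent on a regression loss, whose gradient is deterministic given the sampled transitions:

```python
import numpy as np

rng = np.random.default_rng(0)

states = rng.normal(size=(64, 4))         # 64 sampled transitions, 4 features
true_w = np.array([1.0, -2.0, 0.5, 0.0])  # hypothetical task-specific mapping
targets = states @ true_w                 # task-specific targets (e.g. rewards)

def regression_loss(z):
    """Mean squared prediction error of the linear predictor with params z."""
    return float(((states @ z - targets) ** 2).mean())

def inner_update(z, alpha=0.05):
    """One gradient-descent step on the regression loss (analytic gradient)."""
    grad = 2.0 * states.T @ (states @ z - targets) / len(states)
    return z - alpha * grad

z0 = np.zeros(4)          # shared initialization
z1 = inner_update(z0)     # task-adapted parameters
```

Unlike a policy-gradient inner step, this update has no score-function sampling noise: given the same batch of transitions, the gradient is the same every time.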
4.1 Model
Our proposed model comprises three modules: the exploration policy $\mu_\phi$, the exploitation policy $\pi_{\theta,z}$, and the self-supervision network $M_{\nu,z}$. Note that $\pi$ and $M$ share the set of parameters $z$ while containing their own sets of parameters $\theta$ and $\nu$ respectively. We describe our proposed model in Fig. 1. Our model differs from E-MAML/ProMP in its use of a separate exploration policy, the separation of the task-specific parameters $z$ from the task-agnostic parameters $\theta$, $\phi$, and $\nu$, and the self-supervised inner-loop update, as shown in Fig. 1.
The agent first collects a set of trajectories $\{\tau_i\}$ using its exploration policy $\mu_\phi$ for each task $\mathcal{T}$. It then updates the shared parameters $z$ by minimizing the regression loss on the sampled trajectories:

$$z'_{\mathcal{T}} = z - \alpha \nabla_z \mathbb{E}_{\tau \sim P_{\mathcal{T}}(\tau|\phi)}\Big[\sum_{t=0}^{H-1} \big(M_{\nu,z}(s_t, a_t) - y_t\big)^2\Big] \quad (4)$$

where $y_t$ is the target, which can be any task-specific quantity such as the reward, return, value, or next state. After obtaining the updated parameters $z'_{\mathcal{T}}$ for each task $\mathcal{T}$, the agent samples the (validation) trajectories $\{\tau'_i\}$ using its updated exploitation policy $\pi_{\theta, z'_{\mathcal{T}}}$. Effectively, $z'_{\mathcal{T}}$ encodes the necessary information regarding the task that helps the agent adapt its behavior to maximize its expected return, whereas $\theta$, $\phi$, and $\nu$ remain task-agnostic. A similar approach was proposed by [zintgraf2018caml] to learn task-specific behavior using a context variable with MAML.
The collected trajectories $\{\tau'_i\}$ are then used to perform a policy gradient update to all the parameters $\theta$, $\phi$, and $\nu$ using the following objective:

$$\max_{\theta, \phi, \nu} \; \mathbb{E}_{\mathcal{T} \sim \rho(\mathcal{T})}\Big[\mathbb{E}_{\tau' \sim P_{\mathcal{T}}(\tau'|\theta, z'_{\mathcal{T}})}\big[R(\tau')\big]\Big] \quad (5)$$

In order to allow multiple outer-loop updates, we use the PPO [schulman2017proximal] objective instead of the vanilla policy gradient objective to maximize Eq. 5. Furthermore, we do not perform any outer-loop updates on $z$ and treat it as a shared latent variable with a fixed initial value, as proposed in [zintgraf2018caml]. The reason is that the bias terms in the layers connecting $z$ to the respective networks can learn to compensate for the initialization. We only update $z$ to $z'_{\mathcal{T}}$ in the inner loop to obtain a task-specific latent variable.
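For reference, the clipped PPO surrogate used for the outer-loop updates can be sketched as follows (a generic sketch of the standard PPO term, not our exact training code): clipping the likelihood ratio keeps each of the multiple outer-loop steps from moving the policy far from the one that sampled the trajectories.

```python
import numpy as np

def ppo_clip_term(ratio, advantage, eps=0.2):
    """Pessimistic (min) of the unclipped and clipped surrogate terms."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped)

# With a positive advantage, the objective stops rewarding ratio increases
# beyond 1 + eps, so repeated updates cannot push the policy arbitrarily far.
```

This is what makes several gradient steps on the same batch of post-update trajectories safe, in contrast to the single-step vanilla policy gradient.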
Note that in all prior meta reinforcement learning algorithms, both the inner-loop and the outer-loop updates are policy gradient updates. In contrast, in this work, the inner-loop update is a supervised gradient descent update, whereas the outer-loop update remains a policy gradient update.
The outer-loop gradients w.r.t. $\phi$ can be simplified by multiplying the DiCE [foerster2018DICE] operator inside the expectation in Eq. 4, as proposed by [rothfuss2018promp]. This allows the gradients w.r.t. $\phi$ to be computed in a straightforward manner with backpropagation and eliminates the need to apply the policy gradient trick to expand Eq. 4 for gradient computation. The inner-loop update then becomes:

$$z'_{\mathcal{T}} = z - \alpha \nabla_z \mathbb{E}_{\tau \sim P_{\mathcal{T}}(\tau|\phi)}\Big[\sum_{t=0}^{H-1} \Big(\prod_{t' \leq t} \frac{\mu_\phi(a_{t'}|s_{t'})}{\perp\big(\mu_\phi(a_{t'}|s_{t'})\big)}\Big)\big(M_{\nu,z}(s_t, a_t) - y_t\big)^2\Big]$$

where $\perp$ is the stop-gradient operator as introduced in [foerster2018DICE].
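The DiCE "magic-box" operator used above, $\exp(x - \perp(x))$, evaluates to 1 in the forward pass while its derivative equals its forward value, which is what re-injects the score-function term under ordinary backpropagation. A dependency-free numerical sketch, mimicking the stop-gradient by holding a copy fixed:

```python
import numpy as np

def magic_box(logp):
    """DiCE operator exp(logp - stop_grad(logp)): forward value is always 1."""
    stopped = logp  # stop_grad: same forward value, treated as a constant
    return np.exp(logp - stopped)

logp = -1.7
forward_value = magic_box(logp)

# Derivative check under the stop-gradient convention: hold the stopped copy
# fixed at c = logp and differentiate exp(x - c) at x = logp numerically.
c = logp
eps = 1e-6
numeric_grad = (np.exp(logp + eps - c) - np.exp(logp - eps - c)) / (2 * eps)
```

The derivative equals the forward value (here 1), i.e. exactly the $\nabla_\phi \log \mu_\phi$ factor a hand-derived score-function gradient would contribute.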
The pseudo-code of our algorithm is shown in the appendix (Algorithm LABEL:alg:multibot). However, we found that implementing Algorithm LABEL:alg:multibot as is, using DiCE, leads to high-variance gradients for $\phi$, resulting in instability during training and poor performance of the learned model. To understand this, let's look at the vanilla DiCE gradients for the exploration parameters $\phi$, which can be written as follows:

$$\nabla_\phi J = \mathbb{E}_{\tau \sim P_{\mathcal{T}}(\tau|\phi),\, \tau' \sim P_{\mathcal{T}}(\tau'|\theta, z'_{\mathcal{T}})}\Big[\big(\nabla_{z'} \log \pi_{\theta, z'_{\mathcal{T}}}(\tau') R(\tau')\big)^\top \nabla_\phi z'_{\mathcal{T}}\Big]$$

The above expression can be viewed as a policy gradient update:

$$\nabla_\phi J = \mathbb{E}_{\tau \sim P_{\mathcal{T}}(\tau|\phi)}\Big[\sum_{t=0}^{H-1} \nabla_\phi \log \mu_\phi(a_t|s_t)\, R_t\Big] \quad (6)$$

with returns

$$R_t = -\alpha \big(\nabla_{z'} \log \pi_{\theta, z'_{\mathcal{T}}}(\tau') R(\tau')\big)^\top \nabla_z \sum_{t' \geq t} \big(M_{\nu,z}(s_{t'}, a_{t'}) - y_{t'}\big)^2 \quad (7)$$

Note that the variance of these returns depends on the policy gradient terms computed in the outer loop and on the choice of self-supervision; we explain the latter in Sec. 4.1.1. However, irrespective of that choice, we can use value-function-based variance reduction [mnih2016asynchronous] by substituting the above returns with advantages, i.e., we replace the return $R_t$ in Eq. 7 with an advantage estimate and use a PPO [schulman2017proximal] objective to allow multiple outer-loop updates:

$$J_\phi = \mathbb{E}_{\tau \sim P_{\mathcal{T}}(\tau|\phi_{\text{old}})}\Big[\sum_{t=0}^{H-1} \min\big(r_t(\phi) A_t,\; \text{clip}(r_t(\phi), 1-\epsilon, 1+\epsilon) A_t\big)\Big]$$

where,

$$r_t(\phi) = \frac{\mu_\phi(a_t|s_t)}{\mu_{\phi_{\text{old}}}(a_t|s_t)} \quad (8)$$

and the baseline $b(s_t)$ is computed using a neural network or a linear feature baseline [duan2016benchmarking] fitted on the returns $R_t$, where the advantage $A_t$ is given by:

$$A_t = R_t - b(s_t) \quad (9)$$
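The effect of the baseline can be checked numerically: subtracting a state-dependent baseline from the returns removes the state-dependent component of their variance while leaving the expected gradient unchanged. A toy simulation (all quantities synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic per-timestep returns: a state-dependent part, which a fitted
# baseline b(s_t) can capture, plus irreducible noise.
baseline_values = rng.uniform(0.0, 10.0, size=1000)  # stands in for b(s_t)
noise = rng.normal(0.0, 1.0, size=1000)
returns = baseline_values + noise

advantages = returns - baseline_values               # A_t = R_t - b(s_t)

var_returns = returns.var()
var_advantages = advantages.var()
```

In this construction the advantages are exactly the noise term, so their variance is an order of magnitude smaller than that of the raw returns.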
4.1.1 Self-Supervised/Supervised Objective
It is important to note that the self-supervised/supervised learning objective not only guides the adaptation step but also influences the exploration policy update, as seen in Eqs. 6 and 7. We mentioned above that the self-supervised/supervised learning objective could be as simple as a value, reward, return, or next-state prediction for each state (or state-action pair). However, the exact choice of the objective can be critical to the final performance and stability. From the perspective of the adaptation step, the only criterion is that the self-supervised objective should contain enough task-specific information to allow a useful adaptation step. For example, it would not be a good idea to use reward self-supervision in sparse/noisy reward scenarios, or next-state prediction as self-supervision when the dynamics model does not change much across tasks, since the self-supervision updates in such cases would not carry enough task-specific information. From the perspective of the exploration policy updates, an additional requirement is that the returns computed in Eq. 7 be low variance and unbiased, which further translates to requiring that the prediction target itself be low variance and unbiased. For example, using the cumulative future return as self-supervision might lead to a very high-variance update in certain environments. Thus, finding a generalizable self-supervision/supervision objective that satisfies both of these properties across all scenarios is a challenging task.
In our experiments, we found that using N-step return prediction for supervision works reasonably well across all the environments. This acts as a trade-off between predicting the full return (which is high variance but carries more task-specific information) and the reward (which is lower variance but carries less task-specific information). Hence, the target $y_t$ becomes the N-step return $\sum_{k=0}^{N-1} \gamma^k r_{t+k}$. However, directly predicting this quantity from the $(s_t, a_t)$ pair would still lead to high variance in the exploration returns. Thus, we use truncated N-step successor representations [kulkarni2016deep] (similar to N-step returns) with a linear layer on top to compute the prediction. Using the successor representations can effectively be seen as using a more accurate/powerful baseline than directly predicting the N-step returns from the $(s_t, a_t)$ pair.
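The N-step return target described above can be sketched as a straightforward truncated discounted sum (the episode-boundary handling here is an assumption of this sketch, not a detail stated in the text):

```python
import numpy as np

def n_step_returns(rewards, gamma=0.99, n=15):
    """y_t = sum_{k=0}^{n-1} gamma^k * r_{t+k}, truncated at the episode end."""
    horizon = len(rewards)
    targets = np.zeros(horizon)
    for t in range(horizon):
        for k in range(min(n, horizon - t)):
            targets[t] += gamma**k * rewards[t + k]
    return targets

# With gamma = 1 and n = 2, each target is just the sum of (up to) two rewards.
```

Varying `n` interpolates between the two extremes discussed above: `n = 1` recovers reward prediction, while `n = H` recovers full-return prediction.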
5 Experiments
We evaluate our proposed model on a set of benchmark continuous control environments: AntFwdBwd, HalfCheetahFwdBwd, HalfCheetahVel, Walker2DFwdBwd, Walker2DRandParams, and HopperRandParams, used in [rothfuss2018promp]. We also compare our method with the baseline approaches MAML, E-MAML, and ProMP. Furthermore, we perform ablation experiments to analyze different components and design choices of our model on a toy 2D point environment proposed by [rothfuss2018promp].

The details of the network architecture and the hyperparameters used for learning are given in the appendix. Note that we have not performed much hyperparameter tuning due to computational constraints, and we expect the results of our method to improve with further tuning. Also, we restrict ourselves to a single adaptation step in all environments for the baselines as well as our method, but our method can easily be extended to multiple gradient steps by conditioning the exploration policy on the latent parameters $z$.

The results of the baselines for the benchmark environments have been borrowed directly from the official ProMP website (https://sites.google.com/view/promp/experiments). For the point environments, we have used their official implementation (https://github.com/jonasrothfuss/ProMP).
5.1 Meta-RL Benchmark Continuous Control Tasks
The continuous control tasks require adaptation either across reward functions (AntFwdBwd, HalfCheetahFwdBwd, HalfCheetahVel, Walker2DFwdBwd) or across dynamics (Walker2DRandParams and HopperRandParams). We set the horizon lengths in accordance with the practice in [rothfuss2018promp]. The performance plots for all the algorithms are shown in Fig. 2. In all the environments, our proposed method outperforms or matches the other methods in terms of asymptotic performance.

Our algorithm performs particularly well in the HalfCheetahFwdBwd, HalfCheetahVel, Walker2DFwdBwd, and AntFwdBwd environments, where the N-step returns are informative. In the AntFwdBwd and HalfCheetahFwdBwd environments, although we reach similar asymptotic performance to ProMP, convergence is slower in the initial stages of training. This is because training multiple networks together can slow learning early on, especially when the training signal is not yet strong. Note that in the Walker2DRandParams and HopperRandParams environments, although our method converges as well as the baselines, it does not do much better in terms of peak performance. This could be attributed to the choice of the self-supervision signal: a more appropriate signal for these environments would be the next-state or successor-state prediction, since the task distribution corresponds to variation in the model dynamics rather than just the reward function. This shows that the choice of the self-supervision signal plays an important role in the model's performance. To further understand these design choices, we perform ablations on a toy environment in Section 5.2.1.
5.2 2D Point Navigation.
We show the performance plots for ProMP, MAML-TRPO, MAML-VPG, and our algorithm on the sparse-reward 2DPointEnvCorner environment (proposed in [rothfuss2018promp]) in Fig. 2. Each task in this environment corresponds to reaching a randomly sampled goal location (one of the four corners) in a 2D environment. This is a sparse-reward task where the agent receives a reward only if it is sufficiently close to the goal location. In this environment, the agent needs to perform efficient exploration and use the sparse-reward trajectories to perform stable updates, both of which are salient aspects of our algorithm.

Our method achieves this, reaching peak performance while exhibiting stable behavior. ProMP also manages to reach similar peak performance, but shows more unstable behavior than in the dense-reward scenarios. The other baselines struggle in this environment since they do not explicitly incentivize exploration for the pre-update policy.
5.2.1 Ablation Study
We perform several ablation experiments to analyze the impact of different components of our algorithm on the 2D point-navigation task. Fig. 4 shows the performance plots for the following variants of our algorithm:

VPG inner loop: The self-supervised/supervised loss in the inner loop is replaced with the vanilla policy gradient loss, as in MAML, while still using the exploration policy to sample the pre-update trajectories. This variant illustrates our claim of unstable inner-loop updates when naively using an exploration policy. As expected, this model performs poorly due to the high-variance off-policy updates in the inner loop.

Reward self-supervision: A reward-based self-supervised objective is used instead of return-based self-supervision, i.e., the self-supervision network now predicts the reward instead of the N-step return at each time step. This variant is stable but struggles to reach peak performance since the task has sparse rewards. This shows that the choice of self-supervision objective is important and needs to be made carefully.

Vanilla DiCE: In this variant, we directly use the DiCE gradients to perform updates on $\phi$ instead of using the low-variance gradient estimator. This leads to higher-variance updates and unstable training, as can be seen from the plots. This shows that the low-variance gradient estimate contributes substantially to the stability of training.

E-MAML based: Here, we use an E-MAML [stadie2018some] type objective to compute the gradients w.r.t. $\phi$ instead of using DiCE, i.e., we directly apply policy gradient updates on $\mu_\phi$ with returns computed on the post-update trajectories. This variant ignores the causal credit assignment from outputs to inputs. Thus, the updates have higher variance, leading to more unstable training, although it manages to reach good performance.

Ours: The low-variance estimate of the DiCE gradients is used to compute updates for $\phi$, along with N-step-return-based self-supervision for the inner-loop updates. Our model reaches peak performance and exhibits stable training due to the low-variance updates.
6 Discussions and Conclusion
Unlike conventional meta-RL approaches, we propose to explicitly model a separate exploration policy for the task distribution. Having two different policies gives more flexibility in training the exploration policy and also makes adaptation to any specific task easier. The above experiments illustrate that our approach provides more stable updates and better asymptotic performance than ProMP when the pre-update and post-update policies are very different. Even when that is not the case, our approach matches or surpasses the baselines in terms of asymptotic performance. More importantly, this shows that in most of these tasks, separating the exploration and exploitation policies can yield better performance if done properly. Our ablation studies show that the self-supervised objective plays a large role in improving the stability of the updates, and that its choice can be critical in some cases (e.g., predicting reward vs. return). Further, the experiments show that the variance reduction techniques used in the exploration policy objective are important for achieving stable behavior. We would like to emphasize, however, that the idea of using separate exploration and exploitation policies is much more general and need not be restricted to MAML. Given the sample-efficiency requirements of the adaptation steps in the meta-learning setting, exploration is a crucial and vastly underexplored ingredient. Thus, we would like to explore the following extensions as future work:

- Explore other techniques of self-supervision that can be used more generally across environments and tasks.
- Decoupling the exploration and exploitation policies allows us to perform off-policy updates; we plan to test this as a natural extension of our approach.
- Explore the use of separate exploration and exploitation policies in other meta-learning approaches.
Acknowledgments
This work has been funded by AFOSR award FA9550-15-1-0442 and AFOSR/AFRL award FA9550-18-1-0251. We would like to thank Tristan Deleu, Maruan Al-Shedivat, Anirudh Goyal, and Lisa Lee for their insightful and fruitful discussions, and Tristan Deleu, Jonas Rothfuss, and Dennis Lee for open-sourcing the repositories and result files.
References
7 Appendix
7.1 Algorithm
7.1.1 Semi-Circle Environment
We perform additional experiments on another toy environment to illustrate the exploration behavior learned by our model and to demonstrate the benefits of using different exploration and exploitation policies. Fig. 8 shows an environment where the agent is initialized at the center of a semicircle. Each task corresponds to reaching a goal location (red dot) randomly sampled from the semicircle (green dots). This is also a sparse-reward task where the agent receives a reward only if it is sufficiently close to the goal location. However, unlike the previous environments, we only allow the agent to sample 2 pre-update trajectories per task in order to identify the goal location. Thus, the agent has to explore efficiently at each exploration step in order to perform reasonably on the task. Fig. 8 shows the trajectories taken by our exploration agent (orange and blue) and the exploitation/trained agent (green). Clearly, our agent has learned to explore the environment. However, we know that a policy going around the periphery of the semicircle would be an even more useful exploration policy. In this environment, such exploration behavior can be reached by simply maximizing the environment rewards collected by the exploration policy. Fig. 8 shows this experiment, where the exploration policy is trained using environment reward maximization while everything else is kept unchanged; we call this variant Ours-EnvReward. We also show the trajectories traversed by ProMP in Fig. 8; it clearly struggles to learn different exploration and exploitation behaviors. Fig. 8 shows the performance of our two variants along with the baselines. This experiment shows that decoupling the exploration and exploitation policies also gives us, the designers, more flexibility in training them, i.e., it allows us to add any domain knowledge we might have regarding the exploration or exploitation policies to further improve performance.
7.1.2 Varying number of adaptation trajectories collected
We additionally test the sensitivity of the algorithms to the number of trajectories collected in the inner loop. This is crucial because at test time, the algorithms would only be collecting trajectories for the inner-loop update, i.e., for adaptation. We test this on the HalfCheetahVel environment with varying numbers of inner-loop adaptation trajectories, namely 2, 5, 10, and 20. To keep the updates stable, we increase the meta-batch size (the number of tasks sampled for each update) proportionally to 400, 160, 80, and 40 respectively. Figure 11 shows the plots for these variants for ProMP and our model. We notice that the performance of our model stays roughly constant across varying numbers of adaptation trajectories, whereas ProMP degrades as the number of adaptation trajectories decreases. This shows that each of the trajectories we sample performs efficient exploration. Note that the last pair, (20, 40), corresponds to the standard hyperparameter settings which we (and prior papers) have used for the above experiments.
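For reference, the (trajectories, meta-batch size) pairs above were chosen so that the total number of adaptation trajectories per meta-update stays constant; a two-line check:

```python
# (adaptation trajectories per task, meta-batch size) pairs from the text
pairs = [(2, 400), (5, 160), (10, 80), (20, 40)]
totals = [k * b for k, b in pairs]  # total inner-loop trajectories per update
```

Each setting therefore consumes the same overall sample budget; only its split between per-task adaptation data and task diversity changes.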
7.2 Hyperparameters and Details
For all the experiments, we treat the shared parameters $z$ as a latent embedding with a fixed initial value. The exploitation policy and the self-supervision network concatenate $z$ with their respective inputs. All three networks ($\mu_\phi$, $\pi_{\theta,z}$, $M_{\nu,z}$) have the same architecture (except input and output sizes) as the policy network in [rothfuss2018promp] for all experiments. We also use the same values of hyperparameters such as the inner-loop learning rate, gamma, tau, and the number of outer-loop updates. We keep a constant embedding size of 32 and a constant N=15 (for computing the N-step returns) across all experiments and runs. We use the Adam [kingma2014adam] optimizer for all parameters. Also, we restrict ourselves to a single adaptation step in all environments, but the method can easily be extended to multiple gradient steps by conditioning the exploration policy on the latent parameters $z$.