Reinforcement learning (RL) approaches have seen many successes in recent years, from mastering the complex game of Go [silver2017mastering] to discovering molecules [olivecrona2017molecular]. However, a common limitation of these methods is their propensity to overfit to a single task and their inability to adapt to even slightly perturbed configurations [zhang2018study]. Humans, on the other hand, have an astonishing ability to learn new tasks in a matter of minutes by using their prior knowledge and understanding of the underlying task mechanics. Drawing inspiration from human behavior, researchers have proposed incorporating multiple inductive biases and heuristics to help models learn quickly and generalize to unseen scenarios. However, despite considerable effort, it has been difficult to approach human levels of data efficiency and generalization.
Meta reinforcement learning addresses these shortcomings by learning these inductive biases and heuristics from the data itself. It strives to learn an algorithm that allows an agent to succeed in a previously unseen task or environment when only limited experience is available. These inductive biases or heuristics can be induced in the model in various ways, such as through the optimization algorithm, policy, hyperparameters, network architecture, loss function, or exploration strategy [sharaf2019meta]. Recently, a class of parameter-initialization-based meta-learning approaches, such as Model Agnostic Meta-Learning (MAML) [finn2017model], has gained attention. MAML finds a good initialization for a model or a policy that can be adapted to a new task by fine-tuning with policy gradient updates from a few samples of that task.
Since the objective of meta-RL algorithms is to adapt to a new task from a few examples, efficient exploration strategies are crucial for quickly finding the optimal policy in a new environment. Some recent works have tried to address this problem by improving the credit assignment of the meta learning objective to the pre-update trajectory distribution [stadie2018some, rothfuss2018promp]. However, that requires transitioning the base policy from exploration behavior to exploitation behavior using one or few policy gradient updates. This limits the applicability of these methods to cases where the post-update (exploitation) policy is similar to the pre-update (exploration) policy and can be obtained with only a few updates. Additionally, for cases where pre- and post-update policies are expected to exhibit different behaviors, large gradient updates may result in training instabilities and poor performance at convergence.
To address this problem, we propose to explicitly model a separate exploration policy for the task distribution. The exploration policy is trained to find trajectories that can lead to fast adaptation of the exploitation policy on the given task. This formulation provides much more flexibility in training the exploration policy. We also show that separating the exploration and exploitation policies lets us match or improve on the performance of the baseline approaches on meta-RL benchmark tasks. We further show that, in order to perform stable and fast adaptation to a new task, it is often more useful to use self-supervised or supervised learning approaches to perform the inner loop/meta updates, where possible. This also helps obtain more stable gradient update steps when training the exploitation policy with trajectories collected by a different exploration policy.
2 Related work
Meta-learning algorithms proposed in the RL community include approaches based on recurrent models [duan2016rl, finn2017meta], metric learning [snell2017prototypical, sung2018learning], and learning optimizers [alex20181storder]. Recently, finn2017model proposed Model Agnostic Meta-Learning (MAML) which aims to learn a policy that can generalize to a distribution of tasks. Specifically, it aims to find a good initialization for a policy that can be adapted to any task sampled from the distribution by fine-tuning with policy gradient updates from a few samples of that task.
Efficient exploration strategies are crucial for finding trajectories that can lead to quick adaptation of the policy in a new environment. Recent works [gupta2018meta, rakelly2019efficient] have proposed structured exploration strategies using latent variables to perform efficient exploration across successive episodes; however, they did not explicitly incentivize exploration in pre-update episodes. E-MAML [stadie2018some] made the first attempt at assigning credit for the final expected returns to the pre-update distribution in order to incentivize exploration in each of the pre-update episodes. rothfuss2018promp proposed Proximal Meta-Policy Search (ProMP), where they incorporated the causal structure for more efficient credit assignment and proposed a low variance curvature surrogate objective to control the variance of the corresponding policy gradient update. However, these methods use a single base policy for both exploration and exploitation while relying on one or a few gradient updates to transition from the exploration behavior to the exploitation behavior. Over the next few sections, we illustrate that this approach is problematic and insufficient when the exploration and exploitation behaviors are quite different from each other.
A number of prior works have tried to utilize self-supervised objectives [mahmoudieh2017self, florensa2019self, kahn2018self, pathak2017curiosity, wortsman2018learning] to ease learning especially when the reward signal itself is insufficient to provide the required level of feedback. Drawing inspiration from these approaches, we modify the inner loop update/adaptation step in MAML using a self-supervised objective to allow more stable and faster updates. Concurrent to our work, yang2019norml also decoupled exploration and adaptation policies where the latter is initialized as a learnable offset to the exploration policy.
3.1 Meta-Reinforcement Learning
Unlike RL, which tries to find an optimal policy for a single task, meta-RL aims to find a policy that can generalize to a distribution of tasks. Each task $\mathcal{T}$ sampled from the distribution $\rho(\mathcal{T})$ corresponds to a different Markov Decision Process (MDP) defined by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, r, \gamma, H)$ with state space $\mathcal{S}$, action space $\mathcal{A}$, transition dynamics $\mathcal{P}$, reward function $r$, discount factor $\gamma$, and time horizon $H$. These MDPs are assumed to have similar state and action spaces but might differ in the reward function $r$ or the environment dynamics $\mathcal{P}$. The goal of meta-RL is to quickly adapt the policy to any task $\mathcal{T} \sim \rho(\mathcal{T})$ with the help of a few examples from that task.
3.2 Credit Assignment in Meta-RL
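MAML is a gradient-based meta-RL algorithm that tries to find a good initialization for a policy which can be adapted to any task $\mathcal{T} \sim \rho(\mathcal{T})$ by fine-tuning with one or more gradient updates using the sampled trajectories of that task. MAML maximizes the following objective function:
$$J(\theta) = \mathbb{E}_{\mathcal{T} \sim \rho(\mathcal{T})}\Big[\mathbb{E}_{\tau' \sim P_{\mathcal{T}}(\tau' \mid \theta')}\big[R(\tau')\big]\Big] \quad \text{with} \quad \theta' := U(\theta, \mathcal{T}) = \theta + \alpha \nabla_\theta \mathbb{E}_{\tau \sim P_{\mathcal{T}}(\tau \mid \theta)}\big[R(\tau)\big] \qquad (1)$$
where $U$ is the update function that performs one policy gradient ascent step, with step size $\alpha$, to maximize the expected reward obtained on the trajectories $\tau$ sampled from task $\mathcal{T}$.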
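To make the role of the update function concrete, here is a minimal sketch on a toy supervised analogue rather than an RL task. Each hypothetical task has loss $(\theta - c)^2$, the inner step is plain gradient descent (the RL case uses policy gradient ascent on expected reward), and the meta-gradient differentiates through that inner step. The names `inner_update`, `meta_gradient`, and `meta_train`, as well as the quadratic tasks and step sizes, are assumptions made purely for illustration.

```python
def inner_update(theta, c, alpha):
    """One gradient-descent step on the task loss (theta - c)**2."""
    grad = 2.0 * (theta - c)
    return theta - alpha * grad


def meta_gradient(theta, centers, alpha):
    """Average gradient of the post-update loss w.r.t. the initialization theta.

    Chain rule: d/dtheta (theta' - c)**2 = 2 (theta' - c) * dtheta'/dtheta,
    and dtheta'/dtheta = 1 - 2 * alpha for this quadratic loss.
    """
    total = 0.0
    for c in centers:
        theta_post = inner_update(theta, c, alpha)
        total += 2.0 * (theta_post - c) * (1.0 - 2.0 * alpha)
    return total / len(centers)


def meta_train(centers, alpha=0.1, meta_lr=0.1, steps=200):
    """Gradient descent on the meta-objective, starting from theta = 0."""
    theta = 0.0
    for _ in range(steps):
        theta -= meta_lr * meta_gradient(theta, centers, alpha)
    return theta
```

In this toy setting the factor `1 - 2 * alpha` plays the role of the derivative of the update function with respect to the initialization, which is the part of the meta-gradient that a first-order approximation would drop.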
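rothfuss2018promp showed that the gradient of the objective function in Eq. 1 can be written as:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\mathcal{T} \sim \rho(\mathcal{T})}\Big[\mathbb{E}_{\tau \sim P_{\mathcal{T}}(\tau \mid \theta),\ \tau' \sim P_{\mathcal{T}}(\tau' \mid \theta')}\big[\nabla_\theta J_{\text{post}}(\tau, \tau') + \nabla_\theta J_{\text{pre}}(\tau, \tau')\big]\Big]$$
$$\nabla_\theta J_{\text{post}}(\tau, \tau') = \nabla_{\theta'} \log \pi_{\theta'}(\tau') R(\tau') \big(I + \alpha R(\tau) \nabla_\theta^2 \log \pi_\theta(\tau)\big) \qquad (2)$$
$$\nabla_\theta J_{\text{pre}}(\tau, \tau') = \alpha \nabla_\theta \log \pi_\theta(\tau) \Big(\big(\nabla_\theta \log \pi_\theta(\tau) R(\tau)\big)^\top \big(\nabla_{\theta'} \log \pi_{\theta'}(\tau') R(\tau')\big)\Big) \qquad (3)$$
The first term corresponds to a policy gradient step on the post-update policy $\pi_{\theta'}$ w.r.t. the post-update parameters $\theta'$, which is then followed by a linear transformation from $\theta'$ to $\theta$ (pre-update parameters). Note that $\nabla_\theta J_{\text{post}}(\tau, \tau')$ optimizes $\theta$ to increase the likelihood of the trajectories $\tau'$ that lead to higher returns given some trajectories $\tau$. However, this term does not optimize $\theta$ to yield trajectories $\tau$ that lead to good adaptation steps. That is, in fact, done by the second term $\nabla_\theta J_{\text{pre}}(\tau, \tau')$. It optimizes for the pre-update trajectory distribution, $P_{\mathcal{T}}(\tau \mid \theta)$, i.e., it increases the likelihood of trajectories $\tau$ that lead to good adaptation steps.
During optimization, MAML only considers $\nabla_\theta J_{\text{post}}(\tau, \tau')$ and ignores $\nabla_\theta J_{\text{pre}}(\tau, \tau')$. Thus MAML finds a policy that adapts quickly to a task given relevant experiences; however, the policy is not optimized to gather useful experiences from the environment that can lead to fast adaptation.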
rothfuss2018promp proposed ProMP, where they analyze this issue with MAML and incorporate the $\nabla_\theta J_{\text{pre}}(\tau, \tau')$ term in the update as well. They used the Infinitely Differentiable Monte-Carlo Estimator (DICE) [foerster2018DICE] to allow causal credit assignment on the pre-update trajectory distribution; however, the gradients computed by DICE still have high variance. To remedy this, they proposed a low variance (and slightly biased) approximation of the DICE-based loss that leads to stable updates.
The pre-update and post-update policies are often expected to exhibit very different behaviors, i.e., exploration and exploitation behaviors respectively. For instance, consider a 2D environment where a task corresponds to reaching a goal location sampled randomly from a semi-circular region (an example is shown in the appendix). The agent receives a reward only if it lies within some vicinity of the goal location. The optimal pre-update or exploration policy is to move around in the semi-circular region, whereas the ideal post-update or exploitation policy is to reach the goal state as fast as possible once the goal region is discovered. Clearly, the two policies are expected to behave very differently. In such cases, transitioning a single policy from a pure exploration phase to a pure exploitation phase via policy gradient updates will require multiple steps. Unfortunately, this significantly increases the computational and memory complexity of the algorithm. Furthermore, it may not even be possible to achieve this transition via a few gradient updates. This raises an important question: Do we really need to use the pre-update policy for exploration as well? Can we use a separate policy for exploration?
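The sparse-reward setting described above can be sketched as a small environment class. This is only an illustrative stand-in, not the environment used in the experiments: the class name `SemiCircleGoalEnv`, the radius, the goal tolerance, the step clipping, and the choice to sample goals on the arc of the semi-circle are all assumptions.

```python
import math
import random


class SemiCircleGoalEnv:
    """Toy 2D point environment with a sparse reward (illustrative only)."""

    def __init__(self, radius=1.0, goal_tolerance=0.1, max_step=0.1, rng=None):
        self.radius = radius
        self.goal_tolerance = goal_tolerance
        self.max_step = max_step
        self.rng = rng or random.Random(0)
        self.goal = (radius, 0.0)
        self.pos = (0.0, 0.0)

    def sample_task(self):
        """Draw a goal uniformly on the upper semi-circle (assumed task distribution)."""
        angle = self.rng.uniform(0.0, math.pi)
        self.goal = (self.radius * math.cos(angle), self.radius * math.sin(angle))
        return self.goal

    def reset(self):
        """Place the agent back at the origin."""
        self.pos = (0.0, 0.0)
        return self.pos

    def step(self, action):
        """Move by a clipped 2D displacement; the reward is 1 only near the goal."""
        dx, dy = action
        norm = math.hypot(dx, dy)
        if norm > self.max_step:
            dx, dy = dx * self.max_step / norm, dy * self.max_step / norm
        self.pos = (self.pos[0] + dx, self.pos[1] + dy)
        dist = math.hypot(self.pos[0] - self.goal[0], self.pos[1] - self.goal[1])
        reward = 1.0 if dist <= self.goal_tolerance else 0.0
        return self.pos, reward
```

Because the reward is zero everywhere except near the goal, a trajectory that never passes close to the goal carries no task information, which is why the behavior that is useful before adaptation (sweeping the region) differs so much from the behavior that is useful afterwards (heading straight to the goal).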
Using separate policies for pre-update and post-update sampling: The straightforward solution to the above problem is to use a separate exploration policy $\mu_\phi$ responsible for collecting trajectories for the inner loop updates to get $\theta'$. Following that, the post-update policy $\pi_{\theta'}$ can be used to collect trajectories for performing the outer loop updates. Unfortunately, this is not as simple as it sounds. To understand this, let's look at the inner loop updates:
$$U(\theta, \mathcal{T}) = \theta + \alpha \nabla_\theta \mathbb{E}_{\tau \sim P_{\mathcal{T}}(\tau \mid \theta)}\big[R(\tau)\big]$$
When the exploration policy is used for sampling trajectories, we need to perform importance sampling. The update would thus become:
$$U(\theta, \mathcal{T}) = \theta + \alpha \nabla_\theta \mathbb{E}_{\tau \sim Q_{\mathcal{T}}(\tau \mid \phi)}\left[\frac{P_{\mathcal{T}}(\tau \mid \theta)}{Q_{\mathcal{T}}(\tau \mid \phi)} R(\tau)\right]$$
where $P_{\mathcal{T}}(\tau \mid \theta)$ and $Q_{\mathcal{T}}(\tau \mid \phi)$ represent the trajectory distributions sampled by $\pi_\theta$ and $\mu_\phi$ respectively. Note that the above update is an off-policy update, which results in high variance estimates when the two trajectory distributions are quite different from each other. This makes it infeasible to use the importance sampling update in its current form. In fact, this is a more general problem that arises even in the on-policy regime. The policy gradient updates in the inner loop result in both the $\nabla_\theta J_{\text{pre}}$ and $\nabla_\theta J_{\text{post}}$ terms being high variance. This stems from the misalignment of the outer gradients ($\nabla_{\theta'} \log \pi_{\theta'}(\tau') R(\tau')$) and the inner gradient and Hessian terms appearing in Eq. 2 and 3. This motivates our second question: Do we really need the pre-update gradients to be policy gradients? Can we use a different objective in the inner loop to get more stable updates?
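The variance problem with importance-weighted off-policy updates can be illustrated with a small numerical sketch. It uses one-dimensional Gaussians in place of trajectory distributions, which is an assumption made only to keep the example short, and it reports the Kish effective sample size as a rough proxy for how many samples effectively contribute to the estimate. The function names are hypothetical.

```python
import math
import random


def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def importance_weights(samples, target_mean, behavior_mean, std=1.0):
    """Weights p(x) / q(x) for samples drawn from the behavior density q."""
    return [
        gaussian_pdf(x, target_mean, std) / gaussian_pdf(x, behavior_mean, std)
        for x in samples
    ]


def effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / sum w^2."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # drawn from q = N(0, 1)

# Target equal to the behavior distribution: every weight is exactly 1.
ess_matched = effective_sample_size(importance_weights(samples, 0.0, 0.0))
# Target shifted away from the behavior distribution: a few weights dominate.
ess_shifted = effective_sample_size(importance_weights(samples, 3.0, 0.0))
```

When the target and behavior distributions coincide, all samples count equally; as they drift apart, a handful of samples carry almost all of the weight, which mirrors the situation where the exploration and exploitation policies behave very differently.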
Using a self-supervised/supervised objective for the inner loop update step: The instability in the inner loop updates arises due to the high variance nature of the policy gradient update. Note that the objective of inner loop update is to provide some task specific information to the agent with the help of which it can adapt its behavior in the new environment. We believe that this could be achieved using some form of self-supervised or supervised learning objective in place of policy gradient in the inner loop to ensure that the updates are more stable. We propose to use a network for predicting some task (or MDP) specific property like reward function, expected return or value. During the inner loop update, the network updates its parameters by minimizing its prediction error on the given task. Unlike prior meta-RL works where the task adaptation in the inner loop is done by policy gradient updates, here, we update some parameters shared with the exploitation policy using a supervised loss objective function which leads to stable updates during the adaptation phase. However, note that the variance and usefulness of the update depends heavily on the choice of the self-supervision/supervision objective. We delve into this in more detail in Section 4.1.1.
Our proposed model comprises three modules: the exploration policy $\mu_\phi$, the exploitation policy $\pi_{\theta,z}$, and the self-supervision network $M_{\beta,z}$. Note that $\pi_{\theta,z}$ and $M_{\beta,z}$ share a set of parameters $z$ while containing their own sets of parameters $\theta$ and $\beta$ respectively. We describe our proposed model in Fig. 1. Our model differs from E-MAML/ProMP because of the separate exploration policy, the separation of task-specific parameters $z$ and task-agnostic parameters $\theta$, and the self-supervised update as shown in Fig. 1.
The agent first collects a set of trajectories $\tau$ using its exploration policy $\mu_\phi$ for each task $\mathcal{T} \sim \rho(\mathcal{T})$. It then updates the shared parameter $z$ by minimizing the regression loss on the sampled trajectories $\tau$:
$$z' = U(z, \mathcal{T}) = z - \alpha \nabla_z \mathbb{E}_{\tau \sim Q_{\mathcal{T}}(\tau \mid \phi)}\left[\sum_{t=0}^{H-1} \big\| M_{\beta,z}(s_t, a_t) - \overline{M}(s_t, a_t) \big\|_2^2\right] \qquad (4)$$
where $\overline{M}(s_t, a_t)$ is the target, which can be any task-specific quantity such as the reward, return, value, or next state. After obtaining the updated parameters $z'$ for each task $\mathcal{T}$, the agent samples the (validation) trajectories $\tau'$ using its updated exploitation policy $\pi_{\theta,z'}$. Effectively, $z'$ encodes the necessary information regarding the task that helps an agent adapt its behavior to maximize its expected return, whereas $\theta$ remains task agnostic. A similar approach was proposed by zintgraf2018caml to learn task-specific behavior using context variables with MAML.
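The adaptation step described above, a single gradient step on a regression loss taken only with respect to the shared latent, can be sketched on a deliberately tiny model. Here the self-supervision network is replaced by a hypothetical linear predictor `weight * x + z` with a scalar latent `z`; in the paper the latent is an embedding concatenated with the network inputs, so this is an assumption made for brevity, and `adapt_latent` is an illustrative name.

```python
def adapt_latent(z, features, targets, weight, alpha):
    """One gradient step on z for the mean squared error of M(x) = weight * x + z."""
    n = len(features)
    # d/dz mean((weight * x + z - y)^2) = 2 * mean(weight * x + z - y)
    grad_z = 2.0 * sum(weight * x + z - y for x, y in zip(features, targets)) / n
    return z - alpha * grad_z


# Hypothetical task whose targets are offset by 3 from the shared predictor.
features = [0.0, 1.0, 2.0]
targets = [3.0, 5.0, 7.0]
z_adapted = adapt_latent(z=0.0, features=features, targets=targets, weight=2.0, alpha=0.5)
```

In this toy case the adapted latent absorbs the task-specific offset in one step while the shared weight stays untouched, which is the division of labor between task-specific and task-agnostic parameters that the method relies on.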
The collected trajectories are then used to perform a policy gradient update to the parameters $\theta$, $\phi$ and $\beta$ using the following objective:
$$J(z', \theta) = \mathbb{E}_{\mathcal{T} \sim \rho(\mathcal{T})}\Big[\mathbb{E}_{\tau' \sim P_{\mathcal{T}}(\tau' \mid \theta, z')}\big[R(\tau')\big]\Big] \qquad (5)$$
In order to allow multiple outer-loop updates, we use the PPO [schulman2017proximal] objective instead of the vanilla policy gradient objective to maximize Eq. 5. Furthermore, we do not perform any outer loop updates on $z$ and treat it as a shared latent variable with fixed initial values of $0$, as proposed in [zintgraf2018caml]. The reason is that the bias terms in the layers connecting $z$ to the respective networks would learn to compensate for the initialization. We only update $z$ to $z'$ in the inner loop to obtain a task-specific latent variable.
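For reference, the clipped surrogate from PPO [schulman2017proximal] that makes several outer-loop updates on the same batch possible can be sketched as follows. The function names and the default clipping range of 0.2 are assumptions for illustration, not values taken from this work.

```python
def ppo_clip_term(ratio, advantage, epsilon=0.2):
    """Per-sample clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


def ppo_clip_objective(ratios, advantages, epsilon=0.2):
    """Average clipped surrogate over a batch (a quantity to be maximized)."""
    terms = [ppo_clip_term(r, a, epsilon) for r, a in zip(ratios, advantages)]
    return sum(terms) / len(terms)
```

Here `ratio` stands for the probability of an action under the current policy divided by its probability under the policy that collected the data. Clipping removes the incentive to move that ratio far from 1, which is what keeps repeated updates on the same trajectories from diverging.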
Note that in all the prior meta reinforcement learning algorithms, both the inner-loop update and the outer-loop update are policy gradient updates. In contrast, in this work, the inner-loop update is a supervised learning gradient descent update whereas the outer-loop update remains a policy gradient update.
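The outer loop gradients w.r.t. $\phi$, $\nabla_\phi J(z', \theta)$, can be simplified by multiplying the DICE [foerster2018DICE] operator inside the expectation in Eq. 4 as proposed by rothfuss2018promp. This allows the gradients w.r.t. $\phi$ to be computed in a straightforward manner with back-propagation. This also eliminates the need to apply the policy gradient trick to expand Eq. 4 for gradient computation. The inner loop update then becomes:
$$z' = U(z, \mathcal{T}) = z - \alpha \nabla_z \mathbb{E}_{\tau \sim Q_{\mathcal{T}}(\tau \mid \phi)}\left[\sum_{t=0}^{H-1} \left(\prod_{t'=0}^{t} \frac{\mu_\phi(a_{t'} \mid s_{t'})}{\bot\big(\mu_\phi(a_{t'} \mid s_{t'})\big)}\right) \big\| M_{\beta,z}(s_t, a_t) - \overline{M}(s_t, a_t) \big\|_2^2\right]$$
where $\bot$ is the stop gradient operator as introduced in [foerster2018DICE].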
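As a reminder of why a ratio of a probability to its stop-gradient copy is useful, the following restates the defining properties of the DICE operator from [foerster2018DICE] in a short worked form. The symbols $x$ and $\Box$ follow that reference; writing $x$ as a sum of exploration-policy log-probabilities, with the exploration policy denoted $\mu_\phi$, is an assumption about how the operator is applied here.

```latex
% DICE ("magic box") operator, with \bot the stop-gradient operator
\Box(x) = \exp\big(x - \bot(x)\big)

% Forward pass: the operator evaluates to one, so the loss value is unchanged
\Box(x) \rightarrow 1

% Backward pass: it contributes the score-function term
\nabla_\phi \Box(x) = \Box(x)\, \nabla_\phi x

% With x = \sum_{t'=0}^{t} \log \mu_\phi(a_{t'} \mid s_{t'}), the operator equals
% the product of ratios \mu_\phi / \bot(\mu_\phi) over the first t+1 steps:
\Box\Big(\sum_{t'=0}^{t} \log \mu_\phi(a_{t'} \mid s_{t'})\Big)
  = \prod_{t'=0}^{t} \frac{\mu_\phi(a_{t'} \mid s_{t'})}{\bot\big(\mu_\phi(a_{t'} \mid s_{t'})\big)}
```

Restricting the product to steps up to $t$ is what encodes causality: the loss at time $t$ only receives credit from actions taken at or before $t$.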
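The pseudo-code of our algorithm is shown in the appendix. However, we found that implementing that algorithm as is, using DICE, leads to high variance gradients for $\phi$, resulting in instability during training and poor performance of the learned model. To understand this, let's look at the vanilla DICE gradients for the exploration parameters $\phi$. Writing $\ell_t = \| M_{\beta,z}(s_t, a_t) - \overline{M}(s_t, a_t) \|_2^2$ for the self-supervised loss at step $t$, they can be written as follows:
$$\nabla_\phi J(z', \theta) = \mathbb{E}_{\mathcal{T} \sim \rho(\mathcal{T})}\left[\mathbb{E}_{\tau \sim Q_{\mathcal{T}}(\tau \mid \phi),\ \tau' \sim P_{\mathcal{T}}(\tau' \mid \theta, z')}\left[-\alpha \sum_{t=0}^{H-1} \Big(\sum_{t'=0}^{t} \nabla_\phi \log \mu_\phi(a_{t'} \mid s_{t'})\Big) \big(\nabla_z \ell_t\big)^\top \big(\nabla_{z'} \log \pi_{\theta,z'}(\tau') R(\tau')\big)\right]\right] \qquad (6)$$
The above expression can be viewed as a policy gradient update:
$$\nabla_\phi J(z', \theta) = \mathbb{E}_{\mathcal{T} \sim \rho(\mathcal{T})}\left[\mathbb{E}_{\tau \sim Q_{\mathcal{T}}(\tau \mid \phi),\ \tau' \sim P_{\mathcal{T}}(\tau' \mid \theta, z')}\left[\alpha \sum_{t=0}^{H-1} \nabla_\phi \log \mu_\phi(a_t \mid s_t)\, R^\mu_t\right]\right], \qquad R^\mu_t = \sum_{t'=t}^{H-1} r^\mu_{t'} \qquad (7)$$
Note that the variance depends on the policy gradient terms computed in the outer loop and the choice of self-supervision. We explain the latter in Sec. 4.1.1. However, irrespective of the choice, we can use value function based variance reduction [mnih2016asynchronous] by substituting the above computed returns with advantages, i.e., we replace the return $R^\mu_t$ in Eq. 7 with an advantage estimate $A^\mu_t$ and use a PPO [schulman2017proximal] objective to allow multiple outer loop updates:
$$J^{\text{clip}}(\phi) = \mathbb{E}\left[\sum_{t=0}^{H-1} \min\left(\frac{\mu_\phi(a_t \mid s_t)}{\mu_{\phi_{\text{old}}}(a_t \mid s_t)} A^\mu_t,\ \text{clip}\Big(\frac{\mu_\phi(a_t \mid s_t)}{\mu_{\phi_{\text{old}}}(a_t \mid s_t)}, 1 - \epsilon, 1 + \epsilon\Big) A^\mu_t\right)\right]$$
$A^\mu_t$ is computed using a neural network or a linear feature baseline [duan2016benchmarking] fitted on the returns $R^\mu_t$, where $r^\mu_t$ is given by:
$$r^\mu_t = -\big(\nabla_z \ell_t\big)^\top \big(\nabla_{z'} \log \pi_{\theta,z'}(\tau') R(\tau')\big)$$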
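The step of replacing returns with advantages can be sketched in a few lines. This example swaps in a simple per-time-step mean baseline in place of the neural network or linear feature baseline mentioned above, and it assumes undiscounted returns and trajectories of equal length; both simplifications, and the function names, are assumptions made for illustration.

```python
def returns_to_go(rewards):
    """Undiscounted sum of the rewards from each time step onward."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running += r
        out.append(running)
    return out[::-1]


def advantages_with_mean_baseline(batch_rewards):
    """Subtract a per-time-step mean baseline from each trajectory's returns."""
    batch_returns = [returns_to_go(r) for r in batch_rewards]
    horizon = len(batch_returns[0])
    baseline = [
        sum(ret[t] for ret in batch_returns) / len(batch_returns)
        for t in range(horizon)
    ]
    return [[ret[t] - baseline[t] for t in range(horizon)] for ret in batch_returns]
```

Subtracting a baseline that does not depend on the action leaves the expected policy gradient unchanged while centering the signal, which is the usual reason it lowers variance.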
4.1.1 Self-Supervised/Supervised Objective
It is important to note that the self-supervised/supervised learning objective not only guides the adaptation step but also influences the exploration policy update, as seen in Eq. 6 and 7. We mentioned above that the self-supervised/supervised learning objective could be as simple as a value, reward, return, or next state prediction for each state (or state-action pair). However, the exact choice of the objective can be critical to the final performance and stability. From the perspective of the adaptation step, the only criterion is that the self-supervised objective should contain enough task-specific information to allow a useful adaptation step. For example, it would not be a good idea to use reward self-supervision in sparse or noisy reward scenarios, or next state prediction as self-supervision when the dynamics do not change much across tasks, since the self-supervision updates in such cases will not carry enough task-specific information. From the perspective of the exploration policy updates, an additional requirement is that the returns computed in Eq. 7 are low variance and unbiased, which further translates to saying that the self-supervised loss gradient $\nabla_z \| M_{\beta,z}(s_t, a_t) - \overline{M}(s_t, a_t) \|_2^2$ should ideally be low variance and unbiased. For example, using the cumulative future returns as self-supervision might lead to a very high variance update in certain environments. Thus, finding a generalizable self-supervision/supervision objective that satisfies both properties across all scenarios is a challenging task.
In our experiments, we found that using $N$-step return prediction for supervision works reasonably well across all the environments. This acts as a trade-off between predicting the full return (which has high variance but carries more task-specific information) and the reward (which has lower variance but carries less task-specific information). Hence, the target $\overline{M}(s_t, a_t)$ becomes the $N$-step return from $(s_t, a_t)$. However, using $(s_t, a_t)$ to directly predict the $N$-step return would still lead to high variance in the self-supervised loss gradient. Thus, we use the truncated $N$-step successor representations [kulkarni2016deep] (similar to $N$-step returns) and a linear layer on top of that to compute $M_{\beta,z}(s_t, a_t)$. Using the successor representations can effectively be seen as using a more accurate and powerful baseline than directly predicting the $N$-step returns from the $(s_t, a_t)$ pair.
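The $N$-step return target interpolates between the single-step reward ($N = 1$) and the full return ($N$ at least the horizon). A minimal sketch of computing such targets from one trajectory's rewards is given below. Whether the targets are discounted and how they are truncated near the end of an episode are assumptions here, as is the function name.

```python
def n_step_returns(rewards, n, gamma=1.0):
    """Discounted sum of the next n rewards from each step, truncated at episode end."""
    horizon = len(rewards)
    targets = []
    for t in range(horizon):
        total = 0.0
        for k in range(min(n, horizon - t)):
            total += (gamma ** k) * rewards[t + k]
        targets.append(total)
    return targets
```

With `n=1` the targets are just the rewards, and with `n` at least as large as the trajectory length they coincide with the full return from each step, so `n` directly controls the variance versus task-information trade-off discussed above.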
We evaluate our proposed model on a set of benchmark continuous control environments, Ant-Fwd-Bwd, Half-Cheetah-Fwd-Bwd, Half-Cheetah-Vel, Walker2D-Fwd-Bwd, Walker2D-Rand-Params and Hopper-Rand-Params, used in [rothfuss2018promp]. We compare our method with the baseline approaches MAML, E-MAML and ProMP. Furthermore, we perform ablation experiments to analyze different components and design choices of our model on a toy 2D point environment proposed by [rothfuss2018promp].
The details of the network architecture and the hyperparameters used for learning are given in the appendix. We would like to state that we have not performed much hyperparameter tuning due to computational constraints, and we expect the results of our method to improve with further tuning. Also, we restrict ourselves to a single adaptation step in all environments for the baselines as well as our method, but it can be easily extended to multiple gradient steps by conditioning the exploration policy on the latent parameters $z$.
The results of the baselines for the benchmark environments have been borrowed directly from the official ProMP website (https://sites.google.com/view/pro-mp/experiments). For the point environments, we have used their official implementation (https://github.com/jonasrothfuss/ProMP).
5.1 Meta RL Benchmark Continuous Control Tasks.
The continuous control tasks require adaptation either across reward functions (Ant-Fwd-Bwd, Half-Cheetah-Fwd-Bwd, Half-Cheetah-Vel, Walker2D-Fwd-Bwd) or across dynamics (Walker2D-Rand-Params and Hopper-Rand-Params). We use one horizon length for the Ant-Fwd-Bwd and Half-Cheetah-Fwd-Bwd environments and another for the remaining environments, in accordance with the practice in [rothfuss2018promp]. The performance plots for all the algorithms are shown in Fig. 2. In all the environments, our proposed method outperforms or achieves similar performance to the other methods in terms of asymptotic performance.
Our algorithm performs particularly well in the Half-Cheetah-Fwd-Bwd, Half-Cheetah-Vel, Walker2D-Fwd-Bwd and Ant-Fwd-Bwd environments, where the $N$-step returns are informative. In the Ant-Fwd-Bwd and Half-Cheetah-Fwd-Bwd environments, although we reach similar asymptotic performance to ProMP, convergence is slower in the initial stages of training. This is because training multiple networks together can slow down learning, especially early on when the training signal is not yet strong. Note that in the Walker2D-Rand-Params and Hopper-Rand-Params environments, although our method converges as well as the baselines, it does not do much better in terms of peak performance. This could be attributed to the selection of the self-supervision signal. A more appropriate self-supervision signal for these environments would be the next state or successor state prediction, since the task distribution in these environments corresponds to variation in the model dynamics and not just the reward function. This shows that the choice of the self-supervision signal plays an important role in the model's performance. To further understand these design choices, we perform some ablations on a toy environment in Section 5.2.1.
5.2 2D Point Navigation.
We show the performance plots for ProMP, MAML-TRPO, MAML-VPG and our algorithm in the sparse reward 2DPointEnvCorner environment (proposed in [rothfuss2018promp]) in Fig. 2. Each task in this environment corresponds to reaching a randomly sampled goal location (one of the four corners) in a 2D environment. This is a sparse reward task where the agent receives a reward only if it is sufficiently close to the goal location. In this environment, the agent needs to perform efficient exploration and use the sparse reward trajectories to perform stable updates, both of which are salient aspects of our algorithm.
Our method is able to achieve this and reaches peak performance while showing stable behavior. ProMP also reaches similar peak performance, but its training is less stable than in the dense reward scenarios. The other baselines struggle to make progress in this environment since they do not explicitly incentivize exploration for the pre-update policy.
5.2.1 Ablation Study
We perform several ablation experiments to analyze the impact of different components of our algorithm on the 2D point navigation task. Fig. 4 shows the performance plots for the following variants of our algorithm:
VPG-Inner loop: The self-supervised/supervised loss in the inner loop is replaced with the vanilla policy gradient loss as in MAML, while using the exploration policy to sample the pre-update trajectories. This variant illustrates our claim of unstable inner loop updates when naively using an exploration policy. As expected, this model performs poorly due to the high variance off-policy updates in the inner loop.
Reward Self-Supervision: A reward-based self-supervised objective is used instead of return-based self-supervision, i.e., the self-supervision network now predicts the reward instead of the $N$-step return at each time step. This variant is stable but struggles to reach peak performance since the task has sparse rewards. This shows that the self-supervision objective is also important and needs to be chosen carefully.
Vanilla DICE: In this variant, we directly use the DICE gradients to perform updates on $\phi$ instead of using the low variance gradient estimator. This leads to higher variance updates and unstable training, as can be seen from the plots. This shows that the low variance gradient estimate makes a major contribution to stability during training.
E-MAML Based: Here, we use an E-MAML [stadie2018some] type objective to compute the gradients w.r.t. $\phi$ instead of using DICE, i.e., we directly use policy gradient updates on $\mu_\phi$ but with returns computed on post-update trajectories. This variant ignores the causal credit assignment from outputs to inputs. Thus, the updates have higher variance, leading to more unstable training, although it manages to reach good performance.
Ours: The low variance estimate of the DICE gradients is used to compute updates for $\phi$, along with $N$-step return based self-supervision for the inner loop updates. Our model reaches peak performance and exhibits stable training due to low variance updates.
6 Discussions and Conclusion
Unlike conventional meta-RL approaches, we proposed to explicitly model a separate exploration policy for the task distribution. Having two different policies gives more flexibility in training the exploration policy and also makes adaptation to any specific task easier. The above experiments illustrate that our approach provides more stable updates and better asymptotic performance than ProMP when the pre-update and post-update policies are very different. Even when that is not the case, our approach matches or surpasses the baselines in terms of asymptotic performance. More importantly, this shows that in most of these tasks, separating the exploration and exploitation policies can yield better performance if done properly. Our ablation studies show that the self-supervised objective plays a large role in improving the stability of the updates and that the choice of the self-supervised objective can be critical in some cases (e.g., predicting reward vs. return). Further, the above experiments show that the variance reduction techniques used in the objective of the exploration policy are important for achieving stable behavior. However, we would like to emphasize that the idea of using separate exploration and exploitation policies is much more general and does not need to be restricted to MAML. Given the sample efficiency requirements of the adaptation steps in the meta-learning setting, exploration is a crucial ingredient that has been vastly under-explored. Thus, we would like to explore the following extensions as future work:
Explore other techniques of self-supervision that can be more generally used across environments and tasks.
Decoupling the exploration and exploitation policies allows us to perform off-policy updates. Thus, we plan to test it as a natural extension of our approach.
Explore the use of having separate exploration and exploitation policies in other meta-learning approaches.
This work has been funded by AFOSR award FA9550-15-1-0442 and AFOSR/AFRL award FA9550-18-1-0251. We would like to thank Tristan Deleu, Maruan Al-Shedivat, Anirudh Goyal and Lisa Lee for their insightful and fruitful discussions and Tristan Deleu, Jonas Rothfuss and Dennis Lee for open-sourcing the repositories and result files.
We perform some additional experiments on another toy environment to illustrate the exploration behavior shown by our model and demonstrate the benefits of using different exploration and exploitation policies. Fig. 8 shows an environment where the agent is initialized at the center of the semi-circle. Each task in this environment corresponds to reaching a goal location (red dot) randomly sampled from the semi-circle (green dots). This is also a sparse reward task where the agent receives a reward only if it is sufficiently close to the goal location. However, unlike the previous environments, we only allow the agent to sample 2 pre-update trajectories per task in order to identify the goal location. Thus the agent has to explore efficiently at each exploration step in order to perform reasonably at the task. Fig. 8 shows the trajectories taken by our exploration agent (orange and blue) and the exploitation/trained agent (green). Clearly, our agent has learnt to explore the environment. However, we know that a policy going around the periphery of the semi-circle would be a more useful exploration policy. In this environment, we know that this exploration behavior can be reached by simply maximizing the environment rewards collected by the exploration policy. Fig. 8 shows this experiment, where the exploration policy is trained using environment reward maximization while everything else is kept unchanged. We call this variant Ours-EnvReward. We also show the trajectories traversed by ProMP in Fig. 8. It is clear that it struggles to learn different exploration and exploitation behaviors. Fig. 8 shows the performance of our two variants along with the baselines. This experiment shows that decoupling the exploration and exploitation policies also gives us, the designers, more flexibility in training them, i.e., it allows us to add any domain knowledge we might have regarding the exploration or the exploitation policies to further improve performance.
7.1.2 Varying number of adaptation trajectories collected
We additionally wanted to test the sensitivity of the algorithms to the number of trajectories collected in the inner loop. This is crucial because at test time, the algorithms would only be collecting trajectories for the inner loop update, i.e., for adaptation. We test this on the HalfCheetah-Vel environment with varying numbers of inner loop adaptation trajectories, namely 2, 5, 10 and 20. However, to keep the updates stable, we scale the meta batch size (number of tasks sampled for each update) inversely, to 400, 160, 80 and 40 respectively, so that the product of the two stays at 800 for every setting. Figure 11 shows the plots for these variants for ProMP and our model. We notice that the performance of our model stays roughly constant across the different numbers of adaptation trajectories, whereas ProMP shows degradation in performance as the number of adaptation trajectories decreases. This shows that each of the trajectories we sample performs efficient exploration. Note that the last pair, (20, 40), corresponds to the standard hyper-parameter settings which we (and other papers before us) have used for the above experiments.
7.2 Hyper-parameters and Details
For all the experiments, we treat the shared parameter $z$ as a latent embedding with a fixed initial value of $0$. The exploitation policy $\pi_{\theta,z}$ and the self-supervision network $M_{\beta,z}$ concatenate $z$ with their respective inputs. All three networks ($\mu_\phi$, $\pi_{\theta,z}$, $M_{\beta,z}$) have the same architecture (except for input and output sizes) as the policy network in [rothfuss2018promp] for all experiments. We also stick to the same values of hyper-parameters such as inner loop learning rate, gamma, tau and number of outer loop updates. We keep a constant embedding size of 32 and a constant N=15 (for computing the N-step returns) across all experiments and runs. We use the Adam [kingma2014adam] optimizer with a single learning rate for all parameters. Also, we restrict ourselves to a single adaptation step in all environments, but it can be easily extended to multiple gradient steps by conditioning the exploration policy on the latent parameters $z$.