Offline Meta-Reinforcement Learning with Online Self-Supervision

07/08/2021 ∙ by Vitchyr H. Pong, et al. ∙ berkeley college

Meta-reinforcement learning (RL) can meta-train policies that adapt to new tasks with orders of magnitude less data than standard RL, but meta-training itself is costly and time-consuming. If we can meta-train on offline data, then we can reuse the same static dataset, labeled once with rewards for different tasks, to meta-train policies that adapt to a variety of new tasks at meta-test time. Although this capability would make meta-RL a practical tool for real-world use, offline meta-RL presents additional challenges beyond online meta-RL or standard offline RL settings. Meta-RL learns an exploration strategy that collects data for adapting, and also meta-trains a policy that quickly adapts to data from a new task. Since this policy was meta-trained on a fixed, offline dataset, it might behave unpredictably when adapting to data collected by the learned exploration strategy, which differs systematically from the offline data and thus induces distributional shift. We do not want to remove this distributional shift by simply adopting a conservative exploration strategy, because learning an exploration strategy enables an agent to collect better data for faster adaptation. Instead, we propose a hybrid offline meta-RL algorithm, which uses offline data with rewards to meta-train an adaptive policy, and then collects additional unsupervised online data, without any reward labels to bridge this distribution shift. By not requiring reward labels for online collection, this data can be much cheaper to collect. We compare our method to prior work on offline meta-RL on simulated robot locomotion and manipulation tasks and find that using additional unsupervised online data collection leads to a dramatic improvement in the adaptive capabilities of the meta-trained policies, matching the performance of fully online meta-RL on a range of challenging domains that require generalization to new tasks.




1 Introduction

Reinforcement learning (RL) agents are often described as learning from reward and punishment analogously to animals: in the same way that a person might train a dog by providing treats, we might train RL agents by providing rewards. However, in reality, modern deep RL agents require so many trials to learn a task that providing rewards by hand is often impractical. Meta-reinforcement learning in principle can mitigate this, by learning to learn using a set of meta-training tasks, and then acquiring new behaviors in just a few trials at meta-test time. Current meta-RL methods are so efficient that it is entirely practical for the meta-test time adaptation to use even human-provided rewards. However, the meta-training phase in these algorithms still requires a large number of online samples, often even more than standard RL, due to the multi-task nature of the meta-learning problem.

Offline reinforcement learning methods, which use only prior experience without active data collection, provide a potential solution to this issue, because a user must only annotate multi-task data with rewards once in the offline dataset, rather than doing so in the inner loop of RL training, and the same offline multi-task data can be reused repeatedly for many training runs. While a few recent works have proposed offline meta-RL algorithms (dorfman2020offline; mitchell2020offline), we identify a specific problem when an agent trained with offline meta-RL is tested on a new task: the distributional shift between the behavior policy and the meta-test time exploration policy means that adaptation procedures learned from offline data might not perform well on the (differently distributed) data collected by the exploration policy at meta-test time. In practice, we find that this issue can lead to a large degradation in performance on offline meta-RL tasks. This mismatch in training distribution occurs because the offline meta-RL agent is never trained on data generated by the exploration policy.

We propose to address this challenge by collecting additional online data without any reward supervision, leading to a semi-supervised offline meta-RL algorithm, as illustrated in Figure 1. Online data can be relatively cheap to collect when it does not require reward labels and enables the agent to observe new exploration trajectories. We would like to meta-train on this online data to mitigate the distribution shift that occurs at meta-test time, but meta-training requires reward labels. If an agent can generate its own reward labels for these new states and actions, it can meta-train on this labeled dataset that includes trajectories generated by the learned exploration policy.

Based on this principle, we develop a method called semi-supervised meta actor-critic (SMAC) that uses reward-labeled offline data to bootstrap a semi-supervised meta-reinforcement learning procedure, in which an offline meta-RL agent collects additional online experience without any reward labels. The agent uses the reward supervision from the offline dataset to learn to generate new reward functions, which it uses to autonomously annotate rewards in these otherwise rewardless interactions and meta-train on this new data. We evaluate our method and prior offline meta-RL methods on existing benchmarks (dorfman2020offline; mitchell2020offline) with fewer than 400 time steps of reward labels at meta-test time. We find that while standard meta-RL methods perform well at adapting to training tasks, they suffer from data-distribution shifts when they must adapt to new tasks that were not seen during meta-training. In contrast, we find that additional environment interaction greatly improves the meta-test time performance of SMAC, despite the lack of additional reward supervision.

Figure 1: (left) In offline meta-RL, an agent uses offline data from multiple tasks, each with reward labels that must only be provided once. (middle) In online meta-RL, new reward supervision must be provided with every environment interaction. (right) In semi-supervised meta-RL, an agent uses an offline dataset collected once to learn to generate its own reward labels for new, online interactions. Similar to offline meta-RL, reward labels must only be provided once for the offline training, and unlike online meta-RL, the additional environment interactions do not require external reward supervision.

2 Related Works

Many prior meta-RL algorithms assume that reward labels are provided with each episode of online interaction (duan2016rl; finn2017model; gupta2018meta; xu2018meta; hausman2018learning; rakelly2019efficient; humplik2019meta; kirsch2019improving; zintgraf2020varibad; xu2020meta; zhao2020meld; kamienny2020learning). In contrast to these prior methods, our method only requires offline prior data with rewards, and additional online interaction does not require any ground truth reward signal.

Prior works have also studied other formulations that combine unlabeled and labeled trials. For example, imitation and inverse reinforcement learning methods use offline demonstrations to either learn a reward function  (abbeel2004apprenticeship; finn2016guided; ho2016generative; fu2017learning) or to directly learn a policy (schaal1999imitation; ross2010efficient; ho2016generative; reddy2019sqil; peng2020learning). Semi-supervised and positive-unlabeled reward learning (xu2019positive; zolna2020offline; konyushkova2020semi) methods use reward labels provided for some interactions to train a reward function for RL. However, all of these methods have been studied in the context of a single task. While these methods focus on recovering a policy that maximizes a specific reward function, we focus on meta-learning an RL procedure that can adapt to new reward functions. In other words, we do not focus on recovering a single reward function, because there is no single test time reward or task. Instead, we focus on generating reward labels for meta-training that mitigate the distribution shift between the offline data and online data at test-time.

SMAC uses a context-based adaptation procedure similar to rakelly2019efficient, which is related to other work on contextual policies, such as goal-conditioned reinforcement learning  (kaelbling1993goals; schaul2015uva; andrychowicz2017her; pong2018tdm; colas2018gep; wardefarley2018discern; pere2018unsupervised; nair2018rig) or successor features (kulkarni2016deep; barreto2017successor; barreto2019transfer; grimm2019disentangled). In contrast to these latter works on contextual policies, our meta-learning procedure applies to any RL problem, does not assume that the reward is defined by a single goal state or fixed basis function, and uses offline data to learn to generate rewards.

Our method addresses a similar problem to prior offline meta-RL methods (mitchell2020offline; dorfman2020offline), but we show that these approaches generally underperform in low-data regimes, whereas our method addresses the distribution shift problem by using online interactions without requiring additional reward supervision. In our experiments, we found that SMAC greatly improves the performance on both training and held-out tasks. Lastly, SMAC is also related to unsupervised meta-learning methods (gupta2018unsupervised; jabri2019unsupervised), which annotate data with their own rewards. In contrast to these methods, we assume that there exists an offline dataset with reward labels that we can use to learn to generate similar rewards.

3 Preliminaries

Meta-reinforcement learning.

In meta-RL, we assume there is a distribution of tasks $p(\mathcal{T})$. A task $\mathcal{T}$ is a Markov decision process (MDP), defined by a tuple $\mathcal{T} = (\mathcal{S}, \mathcal{A}, r, \gamma, p_0, p_d)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $r$ is a reward function, $\gamma$ is a discount factor, $p_0$ is the initial state distribution, and $p_d$ is the environment dynamics distribution. A replay buffer $\mathcal{D}$ is a set of state, action, reward, next-state tuples, $\mathcal{D} = \{(s_i, a_i, r_i, s'_i)\}$, where all the rewards come from the same task. We will use the letter $h$ to denote a small replay buffer or “history” and the notation $h \sim \mathcal{D}$ to denote that a mini-batch $h$ is sampled from a replay buffer $\mathcal{D}$. We will use the letter $\tau$ to represent a trajectory without reward labels.

A meta-episode consists of sampling a task $\mathcal{T} \sim p(\mathcal{T})$, collecting trajectories with a policy $\pi_\theta$, adapting the policy to the task between each trajectory, and measuring the performance on the last trajectory. We write the policy’s adaptation procedure as $A_\phi$, parameterized by meta-parameters $\phi$. Between each trajectory, the adaptation procedure transforms the history of interactions $h$ within the meta-episode into a context $z = A_\phi(h)$ that summarizes the previous interactions. This context is then given to the policy $\pi_\theta(a \mid s, z)$. The exact representation of $\pi_\theta$, $A_\phi$, and $z$ depends on the specific meta-RL method used. For example, the context $z$ can be weights of a neural network output by a gradient update, hidden activations output by a recurrent neural network (duan2016rl), or latent variables output by a stochastic encoder (rakelly2019efficient). Using this notation, the objective in meta-RL is to learn the adaptation parameters $\phi$ and policy parameters $\theta$ to maximize performance on a meta-episode given a new task sampled from $p(\mathcal{T})$.


Since we require an off-policy meta-RL procedure for offline meta-training, we build on probabilistic embeddings for actor-critic RL (PEARL) (rakelly2019efficient), an online off-policy meta-RL algorithm. In PEARL, $z$ is a vector and the adaptation procedure $A_\phi$ that maps $h$ to $z$ consists of sampling $z$ from a distribution $q_\phi(z \mid h)$. The distribution $q_\phi$ is generated by an encoder network with parameters $\phi$. This encoder is a set-based network that processes all of the tuples in $h$ in a permutation-invariant manner to produce the mean and variance of a diagonal multivariate Gaussian. The policy is a contextual policy $\pi_\theta(a \mid s, z)$ conditioned on $z$ by concatenating $z$ to the state $s$.
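As a rough illustration of the permutation-invariant encoder described above, the sketch below combines one Gaussian factor per transition by multiplying the factors together, which is one common way to build such a posterior. The per-transition means and variances are taken as given inputs (in practice a network would produce them), and the function names `product_of_gaussians` and `sample_context` are hypothetical.

```python
import random


def product_of_gaussians(mus, variances):
    """Combine per-transition diagonal Gaussian factors into one diagonal Gaussian.

    mus, variances: lists (one entry per transition) of equal-length lists of floats.
    The result does not depend on the order of the transitions.
    """
    dim = len(mus[0])
    post_mu, post_var = [], []
    for d in range(dim):
        # Precisions (inverse variances) add when Gaussian densities are multiplied.
        precision = sum(1.0 / v[d] for v in variances)
        var = 1.0 / precision
        # The posterior mean is the precision-weighted average of the factor means.
        mu = var * sum(m[d] / v[d] for m, v in zip(mus, variances))
        post_mu.append(mu)
        post_var.append(var)
    return post_mu, post_var


def sample_context(post_mu, post_var, rng):
    """Draw a context vector z from the diagonal Gaussian posterior."""
    return [rng.gauss(m, v ** 0.5) for m, v in zip(post_mu, post_var)]


# Two transitions, one latent dimension.
mu, var = product_of_gaussians([[0.0], [2.0]], [[1.0], [1.0]])
z = sample_context(mu, var, random.Random(0))
```

Because the factors enter only through sums, shuffling the transitions in the history leaves the posterior unchanged, which is the property the set-based encoder needs.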

The policy parameters $\theta$ are trained using soft actor-critic (haarnoja2018soft), which involves learning a critic, or Q-function, $Q_w(s, a, z)$, with parameters $w$, that estimates the sum of future discounted rewards conditioned on the current state, action, and context. The encoder parameters $\phi$ are trained by back-propagating the critic loss into the encoder. The actor, critic, and encoder losses are minimized via gradient descent with mini-batches sampled from separate replay buffers for each task.

Offline reinforcement learning.

In offline reinforcement learning, we assume that we have access to a dataset $\mathcal{D}$ collected by some behavior policy $\pi_\beta$. An RL agent must train on this fixed dataset and cannot interact with the environment. One challenge that offline RL poses is that the states and actions an agent sees when deployed are generated by the agent itself, so their distribution will likely differ from that of the offline dataset. A number of recent methods have tackled this distribution shift issue (fujimoto2019off; fujimoto2019benchmarking; kumar2019stabilizing; wu2019behavior; nair2020accelerating; levine2020offline). Moreover, one can combine offline RL with meta-RL by training meta-RL on multiple datasets (dorfman2020offline; mitchell2020offline), but in the next section we describe some limitations of this combination.

4 The Problem with Naïve Offline Meta-Reinforcement Learning

Although offline meta-RL methods must address the usual offline RL distribution shift issues, they must also contend with a distribution shift that is specific to the meta-RL scenario: distribution shift in $z$-space. Distribution shift in $z$-space occurs because meta-learning requires learning an exploration policy $\pi_\theta$ that generates data for adaptation. However, offline meta-learning only trains the adaptation procedure $A_\phi$ using offline data generated by a previous behavior policy, which we denote as $\pi_\beta$. After offline training, there will be a mismatch between this learned exploration policy $\pi_\theta$ and the behavior policy $\pi_\beta$, leading to a difference in the data $h$ and, in turn, in the context variables $z = A_\phi(h)$. In other words, if we write $p(z \mid \pi)$ to denote the marginal distribution over $z$ given data generated by policy $\pi$, the differences between trajectories from $\pi_\beta$ and $\pi_\theta$ will result in differences between $p(z \mid \pi_\beta)$ during offline training and $p(z \mid \pi_\theta)$ at meta-test time.

Figure 2: Left: The distribution of the KL-divergence between the posterior $q_\phi(z \mid h)$ and a prior $p(z)$ over the course of training, when conditioned on data from the offline dataset (blue) or learned policy (orange), as measured by $D_{\mathrm{KL}}(q_\phi(z \mid h) \,\|\, p(z))$. We see that data from the learned policy results in posteriors that are substantially farther from the prior, suggesting a significant difference in distribution over $z$. Right: The performance of the policy post-adaptation when conditioned on data from the offline dataset (i.e., $z \sim p(z \mid \pi_\beta)$) and data generated by the learned policy (i.e., $z \sim p(z \mid \pi_\theta)$). During the offline training phase, we see that although the meta-RL policy learns when conditioned on $z$ generated by the offline data, the performance does not increase when $z$ is generated using the online data. Since the same policy is evaluated, the change in $z$-distribution is likely the cause for the drop in performance. In contrast, during the self-supervised training phase, the performance of the two is quite similar.

To illustrate the presence of this distribution shift at meta-test time, we empirically compare $p(z \mid \pi_\beta)$ and $p(z \mid \pi_\theta)$. While computing these distributions in closed form would require strong assumptions about how the policy, adaptation procedure, and environment interact, we approximate these distributions by using the PEARL-style encoder discussed in Section 3: $p(z \mid \pi) \approx q_\phi(z \mid h)$, where $h \sim \pi$. We use this approximation to measure the KL-divergence observed during offline training between the posterior $q_\phi(z \mid h)$ and a fixed prior $p(z)$. If these two distributions were the same, then we would expect the distribution of KL divergences to also be similar. However, we see in Figure 2 that these two distributions are markedly different when analyzing a training run of SMAC on the Ant Direction task (see Section 6 for details).
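For a diagonal Gaussian posterior, the KL-divergence used in this diagnostic has a closed form. The sketch below assumes the fixed prior is a unit Gaussian, which the text does not state explicitly, and the function name `kl_to_unit_gaussian` is hypothetical.

```python
import math


def kl_to_unit_gaussian(mu, var):
    """KL( N(mu, diag(var)) || N(0, I) ) for a diagonal Gaussian posterior.

    mu, var: equal-length lists of floats; every variance must be positive.
    """
    return 0.5 * sum(v + m * m - 1.0 - math.log(v) for m, v in zip(mu, var))


# Hypothetical posteriors: one close to the prior, one far from it.
kl_near = kl_to_unit_gaussian([0.1, -0.2], [0.9, 1.1])
kl_far = kl_to_unit_gaussian([2.0, -3.0], [0.1, 0.2])
```

Collecting this quantity for histories drawn from the offline data and for histories generated by the learned policy gives the two distributions of KL values that the comparison relies on.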

We also observe that this distribution shift negatively impacts the resulting policy. In Figure 2, we plot the performance of the learned policy when conditioned on $z$ sampled from $p(z \mid \pi_\beta)$ compared to $p(z \mid \pi_\theta)$. We see that the policy conditioned on $z$ generated from the behavior policy data improves, while the same policy conditioned on $z$ generated from the exploration policy slightly drops in performance. Since we evaluate the same policy and only change how $z$ is sampled, this degradation in performance suggests that the policy suffers from distributional shift between $p(z \mid \pi_\beta)$ and $p(z \mid \pi_\theta)$: after reading in these exploration trajectories, the encoder produces $z$ vectors that are too unfamiliar to the policy, which therefore attains better performance when conditioned on the trajectories from $\pi_\beta$.

We note that this issue arises in any method for training non-Markovian policies with offline data. For example, recurrent policies for partially observed MDPs (jaakkola1995reinforcement) depend both on the current observation and a history. When deployed, these policies must also contend with potential distributional shifts between the training and test-time history distributions, in addition to the change in observation distribution. This additional distribution shift may explain why many memory-based recurrent policies are often trained online (duan2016rl; heess2017emergence; espeholt2018impala) or have benefited from refreshing the memory states (kapturowski2018recurrent). In this paper, we focus on addressing this issue specifically in the offline meta-RL setting.

Offline meta-RL with self-supervised online training.

In complex environments where many behaviors are possible, the distribution shift in $z$-space will likely be inevitable, since the learned policy is likely to deviate from the behavior policy. To address this issue, we introduce an additional assumption: in addition to the offline dataset, we assume that the agent can autonomously interact with the environment without observing additional reward supervision. This problem statement is useful for scenarios where autonomously interacting with the world is relatively easy, but online reward supervision is more expensive to obtain. For instance, it may be cheap to label rewards in an offline dataset for robotics by reward sketching (cabi2019scaling), but expensive to have a labeler available online while the robot runs to provide rewards.

Formally, we assume that the agent can generate additional rollouts in an MDP without a reward function, $(\mathcal{S}, \mathcal{A}, \gamma, p_0, p_d)$. These additional interactions enable the agent to explore using the learned policy. These exploration trajectories are from the same distribution that will be observed at meta-test time, and therefore can be included in the meta-training process to mitigate the distributional shift issue described above. However, meta-training requires not just states and actions, but also rewards. In the next section, we describe a method for autonomously labeling these rollouts with synthetic reward labels to enable an agent to meta-train on this additional data.

Figure 3: (Left) In the offline phase, we sample a history $h$ to compute the posterior $q_\phi(z \mid h)$. We then use a sample $z$ from this encoder and another history $h'$ to train the networks. In purple, we update the encoder with both reconstruction and KL loss. (Right) During the self-supervised phase, we explore by sampling $z \sim p(z)$ and conditioning our policy on these observations. We label rewards using our learned reward decoder, and append the resulting data to the training data. The training procedure is equivalent to the offline phase, except that we do not train the reward decoder or encoder since no additional ground-truth rewards are observed.

5 Semi-Supervised Meta Actor-Critic

In this section, we present semi-supervised meta actor-critic (SMAC), a method to perform offline meta-training followed by self-supervised online meta-training. For the offline meta-training, we assume access to a set of replay buffers, $\{\mathcal{D}_i\}_{i=1}^{N}$, where each buffer corresponds to data for one task. For self-supervised online meta-training, we assume that we can sample MDPs without a reward function. The SMAC adaptation procedure consists of passing a history $h$ through the encoder described in Section 3, resulting in a posterior $q_\phi(z \mid h)$. SMAC then uses this posterior for both meta-RL training and for reward generation. Below, we describe both components of the algorithm.

5.1 Offline Meta-Training

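To learn from the user-provided offline data, we adapt the PEARL meta-learning method (rakelly2019efficient) to the offline setting. We use an actor-critic algorithm to train a contextual policy using a set-based encoder, and update the critic by minimizing the Bellman error:

$$\mathcal{L}_{\text{critic}}(w) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}_i,\; z \sim q_\phi(z \mid h),\; a' \sim \pi_\theta(a' \mid s', z)} \left[ \left( Q_w(s, a, z) - \left( r + \gamma\, Q_{\bar{w}}(s', a', z) \right) \right)^2 \right], \tag{1}$$

where $\bar{w}$ are target network weights (mnih2015human) updated with the soft update (lillicrap2015continuous) of $w$.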
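A minimal sketch of the two ingredients named above, the squared Bellman error for one transition and the soft target update, is given below. It assumes scalar Q-values that have already been computed, and the mixing coefficient `tau` and both function names are hypothetical rather than taken from the text.

```python
def bellman_error(q_value, reward, next_q_target, gamma):
    """Squared Bellman error for a single transition.

    q_value:        Q(s, a, z) from the current critic.
    next_q_target:  Q(s', a', z) from the target network, with a' drawn from the policy.
    """
    target = reward + gamma * next_q_target
    return (q_value - target) ** 2


def soft_update(target_weights, weights, tau):
    """Move each target weight a fraction tau of the way toward the current weight."""
    return [(1.0 - tau) * wt + tau * w for wt, w in zip(target_weights, weights)]


# One transition and one soft update step with made-up numbers.
loss = bellman_error(q_value=1.0, reward=0.5, next_q_target=2.0, gamma=0.5)
new_target = soft_update([0.0, 1.0], [1.0, 1.0], tau=0.1)
```

In practice the error is averaged over a mini-batch and differentiated with respect to the critic parameters only; the target network is treated as a constant.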

PEARL uses soft actor critic (SAC) (haarnoja2018soft) to train their policy and Q-function. SAC has been primarily applied in the online setting, in which a replay buffer is continuously expanded by adding data from the latest policy. However, when naïvely applied to the offline setting, actor-critic methods such as SAC suffer from off-policy bootstrapping error accumulation (fujimoto2019off; kumar2019stabilizing; wu2019behavior), which occurs when the target Q-function for bootstrapping is evaluated at actions outside of the training data.

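To avoid this error accumulation during offline training, we update our actor with a loss that implicitly constrains the policy to stay close to the actions observed in the replay buffer, following the approach in a previously proposed single-task offline RL algorithm called AWAC (nair2020accelerating). AWAC uses the following loss to approximate a constrained optimization problem, where the policy is constrained to stay close to the data observed in $\mathcal{D}_i$:

$$\mathcal{L}_{\text{actor}}(\theta) = -\mathbb{E}_{(s, a) \sim \mathcal{D}_i,\; z \sim q_\phi(z \mid h)} \left[ \log \pi_\theta(a \mid s, z) \, \exp\!\left( \frac{Q_w(s, a, z) - V(s, z)}{\lambda} \right) \right]. \tag{2}$$

We estimate the value function $V(s, z) = \mathbb{E}_{a \sim \pi_\theta(a \mid s, z)}\left[ Q_w(s, a, z) \right]$ with a single sample, and $\lambda$ is the resulting Lagrange multiplier for the optimization problem. See nair2020accelerating for a full derivation.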
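The advantage-weighted actor update can be sketched as a weighted log-likelihood, assuming the log-probabilities, Q-values, and single-sample value estimates have already been computed for a mini-batch. The function name `awac_actor_loss` and the argument name `lam` (standing in for the Lagrange multiplier) are hypothetical.

```python
import math


def awac_actor_loss(log_probs, q_values, v_values, lam):
    """Advantage-weighted negative log-likelihood over a mini-batch.

    log_probs: log pi(a | s, z) for the actions stored in the buffer.
    q_values:  Q(s, a, z) for those same actions.
    v_values:  value estimates V(s, z), e.g. Q at one action sampled from the policy.
    lam:       temperature; smaller values weight high-advantage actions more sharply.
    """
    weights = [math.exp((q - v) / lam) for q, v in zip(q_values, v_values)]
    weighted = [lp * w for lp, w in zip(log_probs, weights)]
    return -sum(weighted) / len(weighted)


# With zero advantage everywhere, the loss reduces to plain behavior cloning.
bc_like = awac_actor_loss([-1.0, -2.0], [1.0, 1.0], [1.0, 1.0], lam=1.0)
```

The weights are treated as constants when differentiating, so only actions that appear in the data are ever up-weighted, which is what keeps the policy close to the buffer.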

This modified actor update makes it possible to train the encoder, actor, and critic on the offline data without the overestimation issues that afflict conventional actor-critic algorithms (kumar2019stabilizing). However, it does not address the $z$-space distributional shift issue discussed in Section 4, because the exploration policy learned via this offline procedure will still deviate significantly from the behavior policy $\pi_\beta$. As discussed previously, we aim to address this issue by collecting additional online data without reward labels and learning to generate reward labels for self-supervised meta-training.

Learning to generate rewards.

To continue meta-training online without provided rewards, we propose to use the offline dataset to learn a generative model over meta-training task reward functions that we can use to label the transitions collected online. Recall that during offline learning, we learn an encoder $q_\phi$ which maps experience $h$ to a latent context $z$ that encodes the task. In the same way that we train our policy $\pi_\theta(a \mid s, z)$ that conditionally decodes $z$ into actions, as well as a Q-function $Q_w(s, a, z)$ that conditionally decodes $z$ into Q-values, we additionally train a reward decoder $r_\phi(s, a, z)$ that conditionally decodes $z$ into rewards (for simplicity, we write $\phi$ to represent the parameters of both the encoder and decoder). We train the reward decoder to reconstruct the observed reward in the offline dataset through a mean squared error loss.

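Because we use the latent space $z$ for reward-decoding, we back-propagate the reward decoder loss into $q_\phi$. As visualized in Figure 3, we also regularize the posteriors $q_\phi(z \mid h)$ against a prior $p(z)$ to provide an information bottleneck in that latent space and ensure that samples from $p(z)$ represent meaningful latent variables. We found it beneficial to not back-propagate the critic loss into the encoder, in contrast to prior work such as PEARL. To summarize, we train the reward encoder and decoder by minimizing the following loss:

$$\mathcal{L}_{\text{reward}}(\phi) = \mathbb{E}_{h, h' \sim \mathcal{D}_i,\; z \sim q_\phi(z \mid h)} \left[ \sum_{(s, a, r) \in h'} \left( r - r_\phi(s, a, z) \right)^2 \right] + D_{\mathrm{KL}}\!\left( q_\phi(z \mid h) \,\|\, p(z) \right). \tag{3}$$

In the next section, we describe how we use this reward decoder to generate new reward labels.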

5.2 Self-Supervised Meta-Training

We now describe the self-supervised online training procedure, during which we use the reward decoder to provide supervision. First, we collect a trajectory $\tau$ by rolling out our exploration policy $\pi_\theta$ conditioned on a context sampled from the prior $p(z)$. To emulate the offline meta-training supervision, we would like to label $\tau$ with rewards that are in the distribution of meta-training tasks. As such, we sample a replay buffer $\mathcal{D}_i$ uniformly from $\{\mathcal{D}_i\}_{i=1}^{N}$ to get a history $h \sim \mathcal{D}_i$ from the offline data. We then sample from the posterior, $z \sim q_\phi(z \mid h)$, and label the reward of a new state and action, $(s, a) \in \tau$, using the reward decoder:

$$r_{\text{generated}}(s, a) = r_\phi(s, a, z), \quad \text{where } z \sim q_\phi(z \mid h). \tag{4}$$

We then add the labeled trajectory to the buffer $\mathcal{D}_i$ and perform actor and critic updates as in offline meta-training. Lastly, since we do not observe additional ground-truth rewards, we do not update the reward decoder $r_\phi$ or encoder $q_\phi$, and instead only train the policy and Q-function during the self-supervised phase. We visualize this procedure in Figure 3.
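The labeling step described above can be sketched as follows. The encoder and reward decoder are passed in as plain functions, and `toy_encode`, `toy_reward_decoder`, and the `history_size` parameter are hypothetical stand-ins chosen only to make the sketch runnable; they are not the learned networks.

```python
import random


def label_trajectory(trajectory, offline_buffers, encode, reward_decoder, rng, history_size=4):
    """Attach generated rewards to a reward-free trajectory of (s, a, s_next) tuples."""
    buffer = rng.choice(offline_buffers)                           # pick one offline task buffer
    history = rng.sample(buffer, min(history_size, len(buffer)))   # history h from that buffer
    mu, var = encode(history)                                      # posterior over z given h
    z = [rng.gauss(m, v ** 0.5) for m, v in zip(mu, var)]          # sample a context z
    # Decode a reward for every reward-free transition using the same z.
    return [(s, a, reward_decoder(s, a, z), s_next) for (s, a, s_next) in trajectory]


def toy_encode(history):
    """Toy encoder: the context is the mean reward in the history, with zero variance."""
    mean_reward = sum(r for (_, _, r, _) in history) / len(history)
    return [mean_reward], [0.0]


def toy_reward_decoder(s, a, z):
    """Toy decoder: reward proportional to the scalar state, scaled by the context."""
    return z[0] * s


offline_buffers = [[(0.0, 0.0, 2.0, 0.0)] * 3]       # one task, (s, a, r, s_next) tuples
new_trajectory = [(1.0, 0.0, 1.5), (3.0, 0.0, 3.5)]   # (s, a, s_next) without rewards
labeled = label_trajectory(new_trajectory, offline_buffers, toy_encode,
                           toy_reward_decoder, random.Random(0))
```

The labeled tuples have the same layout as the offline data, so they can be appended to a replay buffer and used for the actor and critic updates without changing the training code.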

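Algorithm 1 Semi-Supervised Meta Actor-Critic
Input: datasets $\{\mathcal{D}_i\}_{i=1}^{N}$, policy $\pi_\theta$, Q-function $Q_w$, encoder $q_\phi$, and decoder $r_\phi$.
for each iteration do (offline phase)
     Sample buffer $\mathcal{D}_i$ and two histories $h, h' \sim \mathcal{D}_i$.
     Use the first history to infer the context: $z \sim q_\phi(z \mid h)$.
     Update $\pi_\theta$, $Q_w$, $q_\phi$, $r_\phi$ by minimizing $\mathcal{L}_{\text{actor}}$, $\mathcal{L}_{\text{critic}}$, $\mathcal{L}_{\text{reward}}$ with samples $z, h'$.
for each iteration do (self-supervised phase)
     Collect trajectory $\tau$ with $\pi_\theta(a \mid s, z)$, with $z \sim p(z)$.
     Label the rewards in $\tau$ using Equation 4 and add the resulting data to $\mathcal{D}_i$.
     Sample buffer $\mathcal{D}_i$ and two histories $h, h' \sim \mathcal{D}_i$.
     Encode the first history: $z \sim q_\phi(z \mid h)$.
     Update $\pi_\theta$, $Q_w$ by minimizing $\mathcal{L}_{\text{actor}}$, $\mathcal{L}_{\text{critic}}$ with samples $z, h'$.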

5.3 Algorithm Summary and Details

We call the overall algorithm semi-supervised meta actor-critic (SMAC) and visualize it in Figure 3. For offline training, we assume access to offline datasets $\{\mathcal{D}_i\}_{i=1}^{N}$, where each buffer corresponds to data generated for one task. Each iteration, we sample a buffer $\mathcal{D}_i$ and a history from this buffer, $h \sim \mathcal{D}_i$. We condition the stochastic encoder $q_\phi$ on this history to obtain a sample $z \sim q_\phi(z \mid h)$. We then use this sample and a second history sample $h' \sim \mathcal{D}_i$ to update the Q-function, the policy, and the encoder and decoder by minimizing Equation 1, Equation 2, and Equation 3, respectively. During the self-supervised phase, we found it beneficial to train the actor with a combination of the loss in Equation 2 and the original PEARL actor loss, weighted by a hyperparameter. We provide pseudo-code for SMAC in Algorithm 1 and a complete list of hyperparameters, such as the network architecture and RL hyperparameters, in Appendix B.

6 Experiments

We proposed a method that uses additional online data to mitigate the distribution shift in $z$-space that occurs in offline meta-RL. In this section, we evaluate whether the self-supervised phase of SMAC mitigates the resulting drop in performance. We also study whether SMAC can not only overcome the distribution shift problem, but also improve the ability to generalize to new tasks. We evaluate the adaptation procedure’s ability to generalize to new tasks by testing it on held-out reward functions. We compare to different methods across multiple simulated robot domains.

Meta-RL Tasks

We evaluate our method on multiple simulated MuJoCo (todorov2012mujoco) meta-learning tasks that have been used in past online and offline meta-RL papers (finn2017model; rakelly2019efficient; dorfman2020offline; mitchell2020offline). The first task, Cheetah Velocity, contains a two-legged “half cheetah” that can move forwards or backwards along the x-axis. Following prior work (rakelly2019efficient; dorfman2020offline), the reward function is the negative absolute difference between the agent’s x-velocity and a target velocity sampled uniformly from a fixed range. The second task, Ant Direction, contains a quadruped “ant” robot that can move in a plane. The reward function is the dot product between the agent’s velocity and a direction uniformly sampled from the unit circle. In both of these domains, a meta-episode consists of sampling a desired velocity. The agent must discover which velocity will maximize rewards within a fixed number of episodes, each of a fixed length.
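The two locomotion reward functions can be written in a few lines. This sketch assumes the Cheetah Velocity reward is the negative absolute difference to the target velocity and omits any control-cost or survival terms that the actual environments may add; both function names are hypothetical.

```python
import math


def cheetah_velocity_reward(x_velocity, target_velocity):
    """Reward is highest (zero) when the agent moves at exactly the target velocity."""
    return -abs(x_velocity - target_velocity)


def ant_direction_reward(velocity_xy, angle):
    """Dot product between the planar velocity and a unit vector at the given angle."""
    direction = (math.cos(angle), math.sin(angle))
    return velocity_xy[0] * direction[0] + velocity_xy[1] * direction[1]


# A task is defined by a single scalar: a target velocity or a direction angle.
r_cheetah = cheetah_velocity_reward(x_velocity=2.0, target_velocity=0.5)
r_ant = ant_direction_reward(velocity_xy=(1.0, 0.0), angle=0.0)
```

Because each task differs only in this one scalar, the same states and actions can be relabeled with rewards for any task, which is what makes a learned reward decoder conditioned on a task context a natural fit for these domains.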

Figure 4: Examples of our evaluation domains, each of which has a set of meta-train tasks (examples shown in blue) and held out test tasks (orange). The domains include (left) a half cheetah tasked with running at different speeds, (middle) a quadruped ant locomoting to different points on a circle, and (right) a simulated Sawyer arm performing various manipulation tasks.

We also evaluated SMAC on a significantly more diverse robot manipulation meta-learning task called Sawyer Manipulation, based on the goal-conditioned environment introduced in (khazatsky2021val). Sawyer Manipulation is a simulated PyBullet environment (coumans2021) which comprises a Sawyer robot arm that can manipulate drawers, pick and place objects, and push buttons. Sampling a task involves sampling both a new configuration of the environment and the desired behavior to achieve. The initial configuration of the objects can vary drastically, with the presence and location of objects randomized as shown in Figure 4, and the agent is tested on one of three possible desired behaviors: pushing a button, opening a drawer, or lifting an object. The observation is a 13-dimensional state vector; when an object is not present in the task, the corresponding dimension takes on a fixed value. The action space is 4-dimensional: 3 dimensions to control the end-effector in Euclidean space and one dimension to control the gripper. The reward is sparse, taking one value when the desired behavior is not achieved and another when it is achieved. The task is difficult due to the diversity of objects, sparse reward, and precise manipulation required.

On all of the environments, we test the meta-RL procedure’s ability to generalize to new tasks by evaluating the policies on held-out tasks sampled from the same distribution as in the offline datasets. We give a complete description of the possible tasks in Appendix B.

Offline data collection.

For the MuJoCo tasks, we generate data by following a procedure similar to that of (fujimoto2019off), in which we use the replay buffer from a single PEARL run that uses the ground-truth reward. We limit the amount of data collected per task and terminate the PEARL run early, forcing the meta-RL agent to learn from sub-optimal data. For Sawyer Manipulation, we collect data using a scripted policy that randomly performs potential tasks in the environment, without knowing what the desired behavior in a sampled task is. We used 50 training tasks and 50 trajectories of length 75 per task. The average reward is -0.54 in the offline data. See Appendix B for more details.

Comparisons and ablations.

As an upper bound, we include the performance of PEARL when online training uses oracle ground-truth rewards rather than self-generated rewards, which we label Online Oracle. To understand the impact of using the actor loss in Equation 2, we include an ablation in which we use the actor loss from PEARL but still employ our proposed unsupervised online phase, which we label SMAC (actor ablation). We also include a meta-imitation baseline, which infers the task like PEARL, but then imitates the task data in the dataset. In this baseline, we replace the actor update in Equation 2 with simply maximizing $\log \pi_\theta(a \mid s, z)$. We label this baseline meta behavior cloning. This baseline illustrates the gap between offline meta-RL and imitation, and helps us understand the gap between the (highly suboptimal) behavior policy and RL.

For comparisons to prior work, we include the two previously proposed offline meta-RL methods: meta-actor critic with advantage weighting (labelled MACAW) (mitchell2020offline) and Bayesian offline RL (labelled BOReL) (dorfman2020offline). Since these methods have no self-supervised online stage, we report their performance only after offline training. For both prior works, we used the code released by the authors. We trained these methods using the same offline dataset and matched hyperparameters when possible, such as batch size and network size.

Figure 5: Comparison on self-supervised meta-learning against baseline methods. We report the final return of meta-test adaptation on unseen test tasks, with varying amounts of online meta-training following offline meta-training. Our method SMAC, shown in red, consistently trains to a reasonable performance from offline meta-RL (shown at step 0) and then steadily improves with online self-supervised experience. The offline meta-RL methods, MACAW (mitchell2020offline) and BOReL, at best match the offline performance of SMAC but have no mechanism to improve via self-supervision. We also compare to SMAC (SAC ablation), which uses SAC instead of AWAC as the underlying RL algorithm. This ablation struggles to pretrain a value function offline, and so struggles to improve on more difficult tasks.

Comparison results.

We plot the mean post-adaptation returns and standard deviation across 4 seeds in Figure 5. We see that across all three environments, SMAC consistently improves during the self-supervised phase, and often achieves a similar performance to the oracle that uses ground-truth reward during the online phase of learning. SMAC also significantly improves over meta behavior cloning, which confirms that the data in the offline dataset is far from optimal.
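As an illustration of how such curves are typically aggregated, the sketch below computes a mean and standard deviation across seeds at each evaluation point. The return values are invented for the example, and whether the reported standard deviation uses the population or the sample convention is not stated in the text.

```python
import numpy as np

# Hypothetical post-adaptation returns: rows are seeds, columns are evaluation points.
returns = np.array([
    [10.0, 14.0, 18.0],
    [12.0, 15.0, 19.0],
    [9.0, 13.0, 20.0],
    [13.0, 18.0, 23.0],
])

mean_curve = returns.mean(axis=0)   # one value per evaluation point
std_curve = returns.std(axis=0)     # population std (ddof=0), an assumption
```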

We found that BOReL and MACAW performed comparatively poorly on all three tasks. A likely cause is that BOReL and MACAW were both developed assuming several orders of magnitude more data than the regime that we tested. For example, in the BOReL paper (dorfman2020offline), the Cheetah Velocity task was trained with an offline dataset of 400 million transitions and performs additional reward relabeling using ground-truth information about the transitions. In contrast, our offline dataset contains only 240 thousand transitions, roughly three orders of magnitude fewer. Similarly, MACAW uses 100M transitions for Cheetah Velocity, over 400 times more than used in our experiments. These prior methods also collect offline datasets by training task-specific policies, which converge to near-optimal policies within the first million time steps (haarnoja2018soft), meaning that they are evaluated on very high-quality data.
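The dataset-size comparison can be checked with a few lines of arithmetic. The sketch below uses only the transition counts quoted in the paragraph above; the variable names are ours.

```python
import math

ours = 240_000            # transitions in our offline dataset
borel = 400_000_000       # transitions quoted for BOReL on Cheetah Velocity
macaw = 100_000_000       # transitions quoted for MACAW on Cheetah Velocity

borel_ratio = borel / ours                 # how many times more data BOReL used
macaw_ratio = macaw / ours                 # how many times more data MACAW used
borel_orders = math.log10(borel_ratio)     # the same gap in orders of magnitude
```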

In contrast, our data collection protocol produces more realistic offline datasets that are highly suboptimal, as evidenced by the performance of the meta behavior cloning baseline, leaving plenty of room for improvement with offline RL. We also observed that our method improves over the performance of BOReL and MACAW even before the online phase (i.e., at zero new environment steps) on the Cheetah and Sawyer Manipulation tasks, and achieves a particularly large improvement on the Sawyer Manipulation environments, which are by far the most challenging and exhibit the most variability between tasks. In this domain, we also see the largest gains from the AWAC actor update, in contrast to the actor ablation (in blue), indicating that properly handling the offline phase is also important for good performance.

Visualizing the distribution shift.

We also investigate whether the self-supervised training helps specifically because it mitigates a distribution shift caused by the exploration policy. To do so, we visualize the trajectories of the learned policy both before and after the self-supervised phase for the Ant Direction task in Figure 6. For each plot, we show trajectories from the policy when the encoder is conditioned on histories from either the offline dataset or the learned exploration policy. Since the same policy is evaluated, differences between the resulting trajectories represent the distribution shift caused by using history from the learned exploration policy rather than from the offline dataset.

We see that before the self-supervised phase, there is a large difference between the two modes that can only be attributed to the difference in the history. When using history from the learned exploration policy, the post-adaptation policy only explores one mode, but when using history from the offline dataset, the policy moves in all directions. This qualitative difference explains the large performance gap observed in Figure 2 and highlights that the adaptation procedure is sensitive to the history used to adapt. In contrast, after the self-supervised phase, the policy moves in all directions regardless of where the history came from. In Appendix A, we also visualize the exploration trajectories and find that they are qualitatively similar both before and after the self-supervised phase. Together, these results illustrate that the SMAC policy learns to adapt to the exploration trajectories by using the self-supervised phase to mitigate the distribution shift that occurs with naïve offline meta RL.

Figure 6: We visualize the visited XY-coordinates of the learned policy on the Ant Direction task. Left: Trajectories from the post-adaptation policy when the history it is conditioned on is sampled from the offline dataset (blue) or the learned exploration policy (orange) immediately after offline training. When conditioned on offline data, the policy correctly moves in many different directions. However, when conditioned on data from the learned exploration policy, the post-adaptation policy only moves up and to the left, suggesting that the post-adaptation policy is sensitive to the data distribution used to collect the history. Right: After the self-supervised phase, we see that the post-adaptation policy learns to move in many different directions regardless of the data source. These visualizations demonstrate that the self-supervised phase mitigates the distribution shift between conditioning on offline and online data.

7 Conclusion

In this paper, we analyzed and addressed a specific problem in offline meta-RL: distribution shift in the context parameters $z$. This distribution shift occurs because the data collected by a meta-learned exploration policy will differ from the data in an offline dataset. This difference is then magnified by the non-Markovian adaptation procedure in meta-RL, since the learned policy depends on the entire history through the post-adaptation context parameters $z$. We provided evidence that this distribution shift both occurs and hurts the performance of offline meta-RL. To address this problem, we made an additional assumption that an agent can sample new trajectories without additional reward labels. We then presented SMAC, a method that uses these additional interactions together with state-of-the-art offline RL techniques to provide an effective meta-RL method. We showed experimentally that SMAC significantly improves over the performance of prior offline meta-RL methods, in many cases achieving better performance even after the offline phase alone, and significantly improving performance after the self-supervised online phase.

While our method significantly improves offline meta-RL performance, the most obvious limitation of this approach is the need to gather additional unlabeled online samples. This may be quite practical in domains where collecting reward labels is a major bottleneck, such as when rewards must be labeled by a human user, but may still pose a challenge in safety-critical settings where online interaction is impractical.

In our experiments, we found that SMAC is sample efficient, requiring only 3 trajectories to solve a new task at meta-test time and only a few thousand self-supervised trajectories to learn relatively complicated robot manipulation skills. Because SMAC is so sample-efficient, an exciting direction for future work is to deploy this method on a real-world robot and other practical real-world scenarios, where the reward labels could even be provided directly by a human user at meta-test time and labeled via crowd-sourcing in the offline dataset.


Appendix A Additional Experimental Results

Exploration and offline dataset visualization

In Figure 7, we visualize the post-adaptation trajectories generated when conditioning the encoder on the online exploration trajectories and on the offline trajectories. This is similar to Figure 6, but we also visualize the online and offline trajectories themselves. We see that the exploration trajectories and the offline trajectories are very different (green vs red, respectively), but the self-supervised phase mitigates the negative impact that this distribution shift has on offline meta RL. In particular, the post-adaptation trajectories conditioned on these two data sources (blue and orange) are similar after the self-supervised training, whereas before the self-supervised training, only the trajectories conditioned on the offline data (blue) move in multiple directions.

Figure 7: We duplicate Figure 6 but include the exploration trajectories (green) and example trajectories from the offline dataset (red). We see that the exploration policy both before and after self-supervised training primarily moves up and to the left, whereas the offline data moves in all directions. Before the self-supervised phase, we see that conditioning the encoder on online data (orange) rather than offline data (blue) results in very different policies, with the online data resulting in the post-adaptation policy only moving up and to the left. However, the self-supervised phase of SMAC mitigates the impact of this distribution shift and results in qualitatively similar post-adaptation trajectories, despite the large difference between the exploration trajectories and offline dataset trajectories.

Figure 8: Learning curves when performing self-supervised training on the test environments (red) or the meta-training environments (blue). We also compare to an oracle that trains on test environments in combination with ground-truth rewards (black). We see that interacting with the test environment without rewards allows for steady improvement in post-adaptation test performance and obtains a similar performance to meta-training on those environments with ground-truth rewards.

Addressing state-space distribution shift by self-supervised meta-training on test tasks.

Another source of distribution shift that can negatively impact a meta-policy is a distribution shift in state space. While this distribution shift occurs in standard offline RL, we expect this issue to be more prominent in meta RL, where there is a focus on generalizing to completely novel tasks. In many real-world scenarios, experiencing the state distribution of a novel task is possible, but it is the supervision (i.e., the reward signal) that is expensive to obtain. Can we mitigate state distribution shift by allowing the agent to meta-train in the test task environments, but without rewards?

In this experiment, we evaluate our method, SMAC, when training online on the test tasks instead of on the meta-training tasks as in the experiments in Section 6. Prior work has explored this idea of self-supervision with test tasks in supervised learning (sun2020testtimetrain) and goal-conditioned RL (khazatsky2021val). We use the Sawyer Manipulation environment to study how self-supervised training can mitigate state distribution shifts, as these environments contain significant variation between tasks. To further increase the complexity of the environment, we use a version of the environment which samples from a set of eight potential desired behaviors instead of three.

We compare self-supervised training on test tasks to self-supervised training on the set of meta-training tasks, which are also the tasks contained in the offline dataset. A large gap in performance indicates that interacting with the test tasks can mitigate the resulting distribution shift even when no reward labels are provided.

We show the results in Figure 8 and find that there is indeed a large performance gap between the two training modes, with self-supervision on test tasks improving post-adaptation returns while self-supervision on meta-training tasks does not improve post-adaptation returns. We also compare to an oracle method that performs online training with the test tasks and the ground-truth reward signal. We see that SMAC is competitive with the oracle, demonstrating that we do not need access to rewards in order to improve on test tasks. Instead, the entire performance gain comes from experiencing the new state distribution of test tasks. Overall, these results suggest that SMAC is effective for mitigating distribution shifts in both $z$-space and state space, even when the agent can only interact with the environment without reward supervision.

Appendix B Experimental Details

B.1 Environment Details

In this section, we describe the state and action space of each environment. We also describe how reward functions were generated and how the offline data was generated.

Ant Direction

The Ant Direction task consists of controlling a quadruped “ant” robot that can move in a plane. Following prior work (rakelly2019efficient; dorfman2020offline), the reward function is the dot product between the agent’s velocity and a direction uniformly sampled from the unit circle. The state comprises the orientation of the ant (as a quaternion) as well as the angle and angular velocity of all 8 joints. Each dimension of the action space corresponds to the torque applied to a respective joint.
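The reward described above can be sketched in a few lines. This is a simplified illustration that assumes the reward consists only of the dot-product term (any control or survival terms used by the actual environment are omitted); `sample_direction` and `direction_reward` are hypothetical names.

```python
import numpy as np

def sample_direction(rng):
    """Sample a unit vector uniformly from the unit circle."""
    angle = rng.uniform(0.0, 2.0 * np.pi)
    return np.array([np.cos(angle), np.sin(angle)])

def direction_reward(velocity_xy, direction):
    """Dot product between the agent's planar velocity and the task direction."""
    return float(np.dot(velocity_xy, direction))

rng = np.random.default_rng(0)
direction = sample_direction(rng)

# Moving along the target direction at speed 2 is rewarded; moving against it is penalized.
aligned = direction_reward(2.0 * direction, direction)
opposed = direction_reward(-2.0 * direction, direction)
```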

The offline data is collected by running PEARL (rakelly2019efficient) on this meta RL task with 100 pre-sampled target velocities. (To mitigate variance coming from this sampling procedure, we use the same sampled target velocities across all experiments and comparisons. We similarly use a pre-sampled set of tasks for the other environments.) We terminate PEARL after 100 iterations, with each iteration containing at least 1000 new transitions. In PEARL, there are two replay buffers saved for each task, one for sampling data for training the encoder and another for training the policy and Q-function. We will call the former replay buffer the encoder replay buffer and the latter the RL replay buffer. The encoder replay buffer contains data generated by only the exploration policy. The RL replay buffer contains all data generated, including both exploration and post-adaptation. To make the offline dataset, we load the last 1200 samples of the RL replay buffer and the last 400 transitions from the encoder replay buffer into corresponding RL and encoder replay buffers for SMAC. In the initial submission, we mistakenly stated that we used 1200 samples when in fact we used 1600 samples (1200 + 400) for each task. During the self-supervised phase, we add all new data to both replay buffers.
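The buffer-truncation step can be sketched as follows. This is a minimal illustration under the assumption that each PEARL buffer is an array ordered from oldest to newest transition; `build_offline_buffers` and the stand-in buffers are hypothetical and do not reflect the data structures of the released code.

```python
import numpy as np

def build_offline_buffers(rl_buffer, encoder_buffer, n_rl=1200, n_encoder=400):
    """Keep only the most recent transitions of each PEARL buffer for one task."""
    return rl_buffer[-n_rl:], encoder_buffer[-n_encoder:]

# Hypothetical stand-ins: each "transition" is represented by its insertion index.
pearl_rl_buffer = np.arange(100_000)
pearl_encoder_buffer = np.arange(20_000)

rl_data, encoder_data = build_offline_buffers(pearl_rl_buffer, pearl_encoder_buffer)
transitions_per_task = len(rl_data) + len(encoder_data)
```

With the default arguments this yields 1600 samples per task, matching the count given in the text.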

Cheetah Velocity

The Cheetah Velocity task consists of controlling a two-legged “half cheetah” that can move forwards or backwards along the x-axis. Following prior work (rakelly2019efficient; dorfman2020offline), the reward function is the negative absolute difference between the agent’s x-velocity and a uniformly sampled target velocity. The state comprises the z-position; the cheetah’s x- and z-velocity; the angle and angular velocity of each joint and the half-cheetah’s y-angle; and the XYZ position of the center of mass. Each dimension of the action space corresponds to the torque applied to a respective joint.
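For reference, a velocity-matching reward of this kind can be written as below. The sketch assumes the usual sign convention in which the reward is the negative of the absolute velocity error, so that matching the target is optimal, and it omits any control-cost term; `velocity_reward` is a hypothetical name.

```python
def velocity_reward(x_velocity, target_velocity):
    """Negative absolute error between the achieved and target x-velocity."""
    return -abs(x_velocity - target_velocity)

on_target = velocity_reward(1.5, 1.5)    # best possible reward
off_target = velocity_reward(0.5, 2.0)   # penalized by the size of the error
```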

The offline data is collected in the same way as in the Ant Direction task, using a run from PEARL with 100 pre-sampled target velocities. For the offline dataset, we use the first 1200 samples from the RL replay buffer and last 400 samples from the encoder replay buffer after 50 PEARL iterations, with each iteration containing at least 1000 new transitions. For only this environment, we found that it was beneficial to freeze the encoder buffer during the self-supervised phase.

Sawyer Manipulation

The state space, action space, and reward are described in Section 6. Tasks are generated by sampling the initial configuration, and then the desired behavior. There are five objects: a drawer opened by handle, a drawer opened by button, a button, a tray, and a graspable object. If an object is not present, it takes on position in the corresponding element of the state space. First, the presence or absence of each of the five objects is randomized. Next, the position of the drawers (from 2 sides), the initial position of the tray (from 4 positions), and the position of the object (from 4 positions) are randomized. Finally, the desired behavior is randomly chosen from the following list, but only including the ones that are possible in the scene: "move hand", "open top drawer with handle", or "open bottom drawer with button". The offline data is collected using a scripted controller that does not know the desired behavior and randomly performs potential tasks in the scene, choosing another task if it finishes one task before the trajectory ends. This data is loaded into a single replay buffer used for both the encoder and RL.
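The task-sampling procedure above can be sketched as follows. Several details here are assumptions made only for illustration: each object is present with probability 0.5, and each drawer-opening behavior is considered possible exactly when the corresponding drawer is present. The identifiers are hypothetical.

```python
import random

OBJECTS = ["drawer_with_handle", "drawer_with_button", "button", "tray", "graspable_object"]

# Assumed prerequisites for each behavior; the text only says impossible behaviors are excluded.
BEHAVIOR_REQUIREMENTS = {
    "move hand": [],
    "open top drawer with handle": ["drawer_with_handle"],
    "open bottom drawer with button": ["drawer_with_button"],
}

def sample_task(rng):
    """Sample an initial configuration, then a desired behavior feasible in that scene."""
    present = {name: rng.random() < 0.5 for name in OBJECTS}  # assumed presence probability
    config = {
        "present": present,
        "drawer_side": rng.randrange(2),       # drawers placed on one of 2 sides
        "tray_position": rng.randrange(4),     # one of 4 tray positions
        "object_position": rng.randrange(4),   # one of 4 object positions
    }
    feasible = [
        behavior for behavior, needs in BEHAVIOR_REQUIREMENTS.items()
        if all(present[name] for name in needs)
    ]
    config["desired_behavior"] = rng.choice(feasible)
    return config

rng = random.Random(0)
tasks = [sample_task(rng) for _ in range(50)]
```

Because "move hand" has no prerequisites in this sketch, the list of feasible behaviors is never empty.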

B.2 Hyperparameters

We list the hyperparameters for training the policy, encoder, decoder, and Q-network in Table 1. If hyperparameters were different across environments, they are listed in Table 2. For pretraining, we use the same hyperparameters and train for gradient steps. Below, we give details on non-standard hyperparameters and architectures.

Batch sizes.

The RL batch size is the batch size per task when sampling transition tuples to update the policy and Q-network. The encoder batch size is the size of the history per task used to condition the encoder. The meta batch size is how many task batches were sampled and concatenated for both the RL and encoder batches. In other words, for each gradient update, the policy and Q-network observe (RL batch size) × (meta batch size) transitions and the encoder observes (encoder batch size) × (meta batch size) transitions.
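The relationship between the three batch sizes can be illustrated with a small sampling routine. The numeric sizes below are placeholders chosen for the example, not the values used in our experiments, and a single buffer per task is used for simplicity; `sample_meta_batch` is a hypothetical helper.

```python
import numpy as np

def sample_meta_batch(task_buffers, task_ids, per_task_batch_size, rng):
    """Sample `per_task_batch_size` rows from each chosen task and concatenate them."""
    batches = []
    for task_id in task_ids:
        buffer = task_buffers[task_id]
        idx = rng.integers(0, len(buffer), size=per_task_batch_size)
        batches.append(buffer[idx])
    return np.concatenate(batches, axis=0)

rng = np.random.default_rng(0)
task_buffers = [rng.normal(size=(200, 10)) for _ in range(20)]  # toy per-task data

META_BATCH_SIZE = 4       # placeholder value
RL_BATCH_SIZE = 256       # placeholder value
ENCODER_BATCH_SIZE = 64   # placeholder value

task_ids = rng.choice(len(task_buffers), size=META_BATCH_SIZE, replace=False)
rl_batch = sample_meta_batch(task_buffers, task_ids, RL_BATCH_SIZE, rng)
encoder_batch = sample_meta_batch(task_buffers, task_ids, ENCODER_BATCH_SIZE, rng)
```

Here the policy and Q-network would see 4 × 256 = 1024 transitions per update, and the encoder would see 4 × 64 = 256.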

Hyperparameter | Value
RL batch size |
encoder batch size |
meta batch size |
Q-network hidden sizes |
policy network hidden sizes |
decoder network hidden sizes |
encoder network hidden sizes |
context parameter dimensionality |
hidden activation (all networks) | ReLU
Q-network, encoder, and decoder output activation | identity
policy output activation | tanh
discount factor |
target network soft target |
policy, Q-network, encoder, and decoder learning rate |
policy, Q-network, encoder, and decoder optimizer | Adam
# of gradient steps per environment transition | 4
Table 1: SMAC Hyperparameters for Self-Supervised Phase
Hyperparameter | Cheetah Velocity | Ant Direction | Sawyer Manipulation
horizon (max # of transitions per trajectory) | 200 | 200 | 50
AWR | 100 | 100 | 0.3
reward scale | 5 | 5 | 1
# of training tasks | 100 | 100 | 50
# of test tasks | 30 | 20 | 10
# of transitions per training task in offline dataset | 1600 | 1600 | 3750
 | 1 | 1 | 0
Table 2: Environment Specific SMAC Hyperparameters

Encoder architecture.

The encoder uses the same architecture as in (rakelly2019efficient). The posterior is given as the product of independent factors, one per transition in the history, where each factor is a multivariate Gaussian over $z$ with learned mean and diagonal variance. The mean and standard deviation are the output of a single MLP network whose output dimensionality is twice the dimensionality of $z$. The output of the MLP network is split into two halves. The first half is the mean and the second half is passed through the softplus activation to get the standard deviation.
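The description above can be made concrete with a short sketch of the two steps: splitting the MLP output into a mean and a softplus-transformed standard deviation, and combining the per-transition Gaussian factors into one Gaussian via a precision-weighted product. The MLP itself is replaced by random numbers, and the latent dimensionality and function names are hypothetical.

```python
import numpy as np

def softplus(x):
    """Numerically stable softplus, log(1 + exp(x))."""
    return np.logaddexp(0.0, x)

def split_mlp_output(mlp_output):
    """First half of the output is the mean; second half goes through softplus to give the std."""
    latent_dim = mlp_output.shape[-1] // 2
    mean = mlp_output[..., :latent_dim]
    std = softplus(mlp_output[..., latent_dim:])
    return mean, std

def product_of_gaussians(means, stds):
    """Combine independent diagonal Gaussian factors (axis 0) into a single Gaussian."""
    precisions = 1.0 / stds ** 2
    var = 1.0 / precisions.sum(axis=0)
    mean = var * (means * precisions).sum(axis=0)
    return mean, np.sqrt(var)

rng = np.random.default_rng(0)
mlp_output = rng.normal(size=(16, 10))   # stand-in: 16 transitions, latent dimensionality 5
factor_means, factor_stds = split_mlp_output(mlp_output)
posterior_mean, posterior_std = product_of_gaussians(factor_means, factor_stds)
```

A useful property of this construction is that the posterior standard deviation can only shrink as more transitions are added to the history, since precisions add.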

Self-supervised actor update.

A weighting parameter controls how much the actor loss from PEARL (rakelly2019efficient) contributes to the actor loss during the self-supervised phase. When the parameter is zero, the actor update is equivalent to the actor update in AWAC (nair2020accelerating).
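One reading consistent with this description is an additive combination, in which the PEARL actor loss is scaled by the parameter and added to an AWAC-style actor loss. The sketch below illustrates only that assumed form with scalar placeholder losses; it is not a reproduction of the exact equation, and all names are hypothetical.

```python
def self_supervised_actor_loss(awac_loss, pearl_loss, pearl_weight):
    """Assumed additive form: AWAC-style loss plus a weighted PEARL actor loss."""
    return awac_loss + pearl_weight * pearl_loss

# With a zero weight, only the AWAC-style term remains.
awac_only = self_supervised_actor_loss(awac_loss=0.8, pearl_loss=2.5, pearl_weight=0.0)
mixed = self_supervised_actor_loss(awac_loss=0.8, pearl_loss=2.5, pearl_weight=1.0)
```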