Learning models that generalize effectively to complex open-world settings, from image recognition (krizhevsky2012imagenet) to natural language processing (devlin2019bert), relies on large, high-capacity models and large, diverse, and representative datasets. Leveraging this recipe for reinforcement learning (RL) has the potential to yield powerful policies for real-world control applications such as robotics. However, while deep RL algorithms enable the use of large models, the use of large datasets for real-world RL is conceptually challenging. Most RL algorithms collect new data online every time a new policy is learned, which limits the size and diversity of the datasets for RL. In the same way that powerful models in computer vision and NLP are often pre-trained on large, general-purpose datasets and then fine-tuned on task-specific data, RL policies that generalize effectively to open-world settings will need to be able to incorporate large amounts of prior data effectively into the learning process, while still collecting additional data online for the task at hand.
For data-driven reinforcement learning, offline datasets consist of trajectories of states, actions and associated rewards. This data can potentially come from demonstrations for the desired task (schaal97lfd; atkeson1997lfd), suboptimal policies (gao2018imperfect), demonstrations for related tasks (zhou2019wtl), or even just random exploration in the environment. Depending on the quality of the data that is provided, useful knowledge can be extracted about the dynamics of the world, about the task being solved, or both. Effective data-driven methods for deep reinforcement learning should be able to use this data to pre-train offline while improving with online fine-tuning.
Since this prior data can come from a variety of sources, we require an algorithm that does not utilize different types of data in any privileged way. For example, prior methods that incorporate demonstrations into RL directly aim to mimic these demonstrations (nair2018demonstrations), which is desirable when the demonstrations are known to be optimal, but can cause undesirable bias when the prior data is not optimal. While prior methods for fully offline RL provide a mechanism for utilizing offline data (fujimoto19bcq; kumar19bear), as we will show in our experiments, such methods generally are not effective for fine-tuning with online data as they are often too conservative. In effect, prior methods require us to choose: Do we assume prior data is optimal or not? Do we use only offline data, or only online data? To make it feasible to learn policies for open-world settings, we need algorithms that contain all of the aforementioned qualities.
In this work, we study how to build RL algorithms that are effective for pre-training from a variety of off-policy datasets, but also well suited to continuous improvement with online data collection. We systematically analyze the challenges with using standard off-policy RL algorithms (haarnoja2018sac; kumar19bear; we2018mpo) for this problem, and introduce a simple actor-critic algorithm that elegantly bridges data-driven pre-training from offline data and improvement with online data collection. Our method, which uses dynamic programming to train a critic but a supervised update to train a constrained actor, combines the best of supervised learning and actor-critic algorithms. Dynamic programming can leverage off-policy data and enable sample-efficient learning. The simple supervised actor update implicitly enforces a constraint that mitigates the effects of out-of-distribution actions when learning from offline data (fujimoto19bcq; kumar19bear), while avoiding overly conservative updates. We evaluate our algorithm on a wide variety of robotic control and benchmark tasks across three simulated domains: dexterous manipulation, tabletop manipulation, and MuJoCo control tasks. We see that our algorithm, Advantage Weighted Actor Critic (AWAC), is able to quickly learn successful policies on difficult tasks with high action dimension and binary sparse rewards, significantly better than prior methods for off-policy and offline reinforcement learning. Moreover, we see that AWAC can utilize different types of prior data: demonstrations, suboptimal data, and random exploration data.
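We consider the standard reinforcement learning notation, with states $s$, actions $a$, policy $\pi(a|s)$, rewards $r(s,a)$, and dynamics $p(s'|s,a)$. The discounted return is defined as $R_t = \sum_{i=t}^{T} \gamma^{i-t} r(s_i, a_i)$, for a discount factor $\gamma$ and horizon $T$ which may be infinite. The objective of an RL agent is to maximize the expected discounted return $J(\pi) = \mathbb{E}_{p_\pi(\tau)}[R_0]$ under the distribution induced by the policy. The optimal policy can be learned by direct optimization of this objective using the policy gradient, estimating $\nabla J(\pi)$ (williams1992reinforce), but this is often ineffective due to high variance of the estimator. Many algorithms attempt to reduce this variance by making use of the value function $V^\pi(s) = \mathbb{E}_{p_\pi(\tau)}[R_t \mid s]$, action-value function $Q^\pi(s,a) = \mathbb{E}_{p_\pi(\tau)}[R_t \mid s, a]$, or advantage $A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$. The action-value function for a policy can be written recursively via the Bellman equation:

$$Q^\pi(s,a) = r(s,a) + \gamma\, \mathbb{E}_{p(s'|s,a)}\big[V^\pi(s')\big] = r(s,a) + \gamma\, \mathbb{E}_{p(s'|s,a)}\big[\mathbb{E}_{\pi(a'|s')}[Q^\pi(s',a')]\big]. \tag{1}$$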
Instead of estimating policy gradients directly, actor-critic algorithms maximize returns by alternating between two phases (konda2000actorcritic): policy evaluation and policy improvement. During the policy evaluation phase, the critic $Q^\pi(s,a)$ is estimated for the current policy $\pi$. This can be accomplished by repeatedly applying the Bellman operator $\mathcal{B}^\pi$, corresponding to the right-hand side of Equation 1, as defined below:

$$\mathcal{B}^\pi Q(s,a) = r(s,a) + \gamma\, \mathbb{E}_{p(s'|s,a)}\big[\mathbb{E}_{\pi(a'|s')}[Q(s',a')]\big]. \tag{2}$$
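By iterating according to $Q^{k+1} = \mathcal{B}^\pi Q^{k}$, $Q^{k}$ converges to $Q^\pi$ (sutton1998rl). With function approximation, we cannot apply the Bellman operator exactly, and instead minimize the Bellman error with respect to Q-function parameters $\phi_k$:

$$\phi_k = \arg\min_\phi \mathbb{E}_{\mathcal{D}}\big[(Q_\phi(s,a) - y)^2\big], \qquad y = r(s,a) + \gamma\, \mathbb{E}_{s',a'}\big[Q_{\phi_{k-1}}(s',a')\big]. \tag{3}$$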
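To make the policy evaluation step concrete, the following is a minimal tabular sketch of the Bellman error described above, assuming a small discrete state and action space. The function name `bellman_error`, the dictionary-based Q-tables, and the `(state, action, reward, next_state)` transition layout are illustrative assumptions and are not taken from the original implementation.

```python
def bellman_error(q, q_prev, policy, batch, gamma=0.99):
    """Mean squared Bellman error of `q` against targets built from `q_prev`.

    q, q_prev: dicts mapping (state, action) -> value (hypothetical tabular critics)
    policy:    dict mapping state -> list of action probabilities
    batch:     list of (state, action, reward, next_state) tuples
    """
    total = 0.0
    for s, a, r, s_next in batch:
        # Expected next-state value under the current policy:
        # E_{a' ~ pi(.|s')}[Q_prev(s', a')]
        v_next = sum(p * q_prev[(s_next, a2)]
                     for a2, p in enumerate(policy[s_next]))
        y = r + gamma * v_next          # bootstrapped target
        total += (q[(s, a)] - y) ** 2   # squared error for this transition
    return total / len(batch)
```

With function approximation, the same quantity would be minimized by gradient descent on the critic parameters, with `q_prev` playing the role of the previous (or target) critic.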
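During policy improvement, the actor $\pi$ is typically updated based on the current estimate of $Q^\pi$. A commonly used technique (lillicrap2015continuous; fujimoto2018td3; haarnoja2018sac) is to update the actor $\pi_{\theta_k}(a|s)$ via likelihood ratio or pathwise derivatives to optimize the following objective, such that the expected value of the Q-function $Q^\pi$ is maximized:

$$\theta_k = \arg\max_\theta \mathbb{E}_{s \sim \mathcal{D}}\big[\mathbb{E}_{\pi_\theta(a|s)}[Q_{\phi_k}(s,a)]\big]. \tag{4}$$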
Actor-critic algorithms are widely used in deep RL (mnih2016asynchronous; lillicrap2015continuous; haarnoja2018sac; fujimoto2018td3). With a Q-function estimator, they can in principle utilize off-policy data when used with a replay buffer for storing prior transition tuples, which we will denote $\beta$, to sample previous transitions, although we show that this by itself is insufficient for our problem setting.
3 Challenges in Offline RL with Online Fine-tuning
In this section, we aim to better understand the unique challenges that exist when pre-training using offline data, followed by fine-tuning with online data collection. We first describe the problem, and then analyze what makes this problem difficult for prior methods.
Problem definition. We assume that a static dataset of transitions, $\mathcal{D} = \{(s, a, s', r)_j\}$, is provided to the algorithm at the beginning of training. This dataset can be sampled from an arbitrary policy or mixture of policies, and may even be collected by a human expert. This definition is general and encompasses many scenarios, such as learning from demonstrations, learning from random data, learning from prior RL experiments, or even learning from multi-task data. Given the dataset $\mathcal{D}$, our goal is to leverage $\mathcal{D}$ for pre-training and use some online interaction to learn the optimal policy $\pi^*(a|s)$, with as few interactions with the environment as possible (depicted in Fig. 1). This setting is representative of many real-world RL settings, where prior data is available and the aim is to learn new skills efficiently. We first study existing algorithms empirically in this setting on the HalfCheetah-v2 Gym environment. The prior dataset consists of 15 demonstrations from an expert policy and 100 suboptimal trajectories sampled from a behavioral clone of these demonstrations. All methods for the remainder of this paper incorporate the prior dataset, unless explicitly labeled “scratch”.
3.1) Data efficiency.
One of the simplest ways to utilize prior data such as demonstrations for RL is to pre-train a policy with imitation learning, and fine-tune with on-policy RL (gupta2019relay; rajeswaran2018dextrous). This has two drawbacks: (1) prior data may not be optimal; (2) on-policy fine-tuning is data inefficient as it does not reuse the prior data in the RL stage. In our setting, data efficiency is vital. To this end, we require algorithms that are able to reuse arbitrary off-policy data during online RL for data-efficient fine-tuning. We find that algorithms that use on-policy fine-tuning (rajeswaran2018dextrous; gupta2019relay) or Monte-Carlo return estimation (peng2019awr; peters2007rwr) are generally much less efficient than off-policy actor-critic algorithms, which iterate between improving $\pi$ and estimating $Q^\pi$ via Bellman backups. This can be seen from the results in Figure 2 plot 1, where on-policy methods like DAPG (rajeswaran2018dextrous) and Monte-Carlo return methods like AWR (peng2019awr) are an order of magnitude slower than off-policy actor-critic methods. Actor-critic methods, shown in Figure 2 plot 2, can in principle use off-policy data. However, as we will discuss next, naïvely applying these algorithms to our problem does not perform well due to a different set of challenges.
3.2) Bootstrap Error in Offline Learning with Actor-Critic Methods. When standard off-policy actor-critic methods are applied to this problem setting, they perform poorly, as shown in the second plot in Figure 2: despite having a prior dataset in the replay buffer, these algorithms do not benefit significantly from offline training. We evaluate soft actor-critic (haarnoja2018sac), a state-of-the-art actor-critic algorithm for continuous control. Note that “SAC (scratch),” which does not receive the prior data, performs similarly to “SACfD (prior),” which does have access to the prior data, indicating that the off-policy RL algorithm is not actually able to make use of the off-policy data for pre-training. Moreover, even if the SAC policy is pre-trained by behavior cloning, labeled “SACfD (pretrain)”, we still observe an initial decrease in performance.
This challenge can be attributed to off-policy bootstrapping error accumulation, as observed in several prior works (sutton1998rl; kumar19bear; wu2019brac; levine2020offlinetutorial; fujimoto19bcq). In actor-critic algorithms, the target value $Q(s', a')$, with $a' \sim \pi$, is used to update $Q(s, a)$. When $a'$ is outside of the data distribution, $Q(s', a')$ will be inaccurate, leading to accumulation of error on static datasets.
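Prior offline RL algorithms (fujimoto19bcq; kumar19bear; wu2019brac) propose to address this issue by explicitly adding constraints on the policy improvement update (Equation 4) to avoid bootstrapping on out-of-distribution actions, leading to a policy update of this form:

$$\arg\max_\theta \mathbb{E}_{s \sim \mathcal{D}}\big[\mathbb{E}_{\pi_\theta(a|s)}[Q_{\phi_k}(s,a)]\big] \quad \text{s.t.} \quad D(\pi_\theta, \pi_\beta) \le \epsilon. \tag{5}$$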
Here, $\pi_\theta$ is the actor being updated and $\pi_\beta(a|s)$ represents the (potentially unknown) distribution from which all of the data seen so far (both offline data and online data) was generated. In the case of a replay buffer, $\pi_\beta$ corresponds to a mixture distribution over all past policies. Typically, $\pi_\beta$ is not known, especially for offline data, and must be estimated from the data itself. Many offline RL algorithms (kumar19bear; fujimoto19bcq; siegel2020abm) explicitly fit a parametric model to samples for the distribution $\pi_\beta$ via maximum likelihood estimation, where samples from $\pi_\beta$ are obtained simply by sampling uniformly from the data seen thus far: $\hat{\pi}_\beta = \arg\max_{\hat{\pi}_\beta} \mathbb{E}_{s,a \sim \pi_\beta}[\log \hat{\pi}_\beta(a|s)]$. After estimating $\hat{\pi}_\beta$, prior methods implement the constraint given in Equation 5 in various ways, including penalties on the policy update (kumar19bear; wu2019brac) or architecture choices for sampling actions for policy training (fujimoto19bcq; siegel2020abm). As we will see next, the requirement for accurate estimation of $\hat{\pi}_\beta$ makes these methods difficult to use with online fine-tuning.
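As a simple illustration of the explicit behavior modeling step used by these prior methods, the sketch below fits a behavior model by maximum likelihood in the discrete-action case, where the estimate reduces to empirical action frequencies per state. The function name `fit_behavior_model` and the tabular setting are assumptions made for illustration; the cited methods fit parametric density models on continuous actions.

```python
from collections import Counter, defaultdict

def fit_behavior_model(pairs, num_actions):
    """Maximum likelihood estimate of a discrete behavior policy.

    pairs: iterable of (state, action) samples drawn uniformly from the data
    Returns a dict mapping state -> list of estimated action probabilities.
    """
    counts = defaultdict(Counter)
    for s, a in pairs:
        counts[s][a] += 1
    model = {}
    for s, c in counts.items():
        n = sum(c.values())
        # For a categorical model, the likelihood is maximized by frequencies.
        model[s] = [c[a] / n for a in range(num_actions)]
    return model
```

Because the estimate depends on every sample seen so far, the model has to be refit or updated whenever new online data changes the data distribution.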
3.3) Excessively Conservative Online Learning. While offline RL algorithms with constraints (kumar19bear; fujimoto19bcq; wu2019brac) perform well offline, they struggle to improve with fine-tuning, as shown in the third plot in Figure 2. We see that the purely offline RL performance (at “0K” in Fig. 2) is much better than the standard off-policy methods shown in Section 3.2. However, with additional iterations of online fine-tuning, the performance increases very slowly (as seen from the slope of the BEAR curve in Fig. 2). What causes this phenomenon?
This can be attributed to challenges in fitting an accurate behavior model as data is collected online during fine-tuning. In the offline setting, behavior models must only be trained once via maximum likelihood, but in the online setting, the behavior model must be updated online to track incoming data. Training density models online (in the “streaming” setting) is a challenging research problem (ramapuram2017lifelonggm), made more difficult by a potentially complex multi-modal behavior distribution induced by the mixture of online and offline data. To understand this, we plot the log likelihood of learned behavior models on the dataset during online and offline training for the HalfCheetah task. As we can see in the plot, the accuracy of the behavior models ($\log \pi_\beta$ on the y-axis) reduces during online fine-tuning, indicating that it is not fitting the new data well during online training. When the behavior models are inaccurate or unable to model new data well, constrained optimization becomes too conservative, resulting in limited improvement with fine-tuning. This analysis suggests that, in order to address our problem setting, we require an off-policy RL algorithm that constrains the policy to prevent offline instability and error accumulation, but is not so conservative that it prevents online fine-tuning due to imperfect behavior modeling. Our proposed algorithm, which we discuss in the next section, accomplishes this by employing an implicit constraint, which does not require any explicit modeling of the behavior policy.
4 Advantage Weighted Actor Critic: A Simple Algorithm for Fine-tuning from Offline Datasets
In this section, we will describe the advantage weighted actor-critic (AWAC) algorithm, which trains an off-policy critic and an actor with an implicit policy constraint. We will show that AWAC mitigates the challenges outlined in Section 3. AWAC follows the standard paradigm for actor-critic algorithms as described in Section 2, with a policy evaluation step to learn $Q^\pi$ and a policy improvement step to update $\pi$. AWAC uses off-policy temporal-difference learning to estimate $Q^\pi$ in the policy evaluation step, and a unique policy improvement update that is able to obtain the benefits of offline RL algorithms when training from prior datasets, while avoiding the overly conservative behavior described in Section 3. We describe the policy improvement step in AWAC below, and summarize the entire algorithm thereafter.
Policy improvement for AWAC proceeds by learning a policy that maximizes the value of the critic learned in the policy evaluation step via TD bootstrapping. If done naively, this can lead to the issues described in Section 3, but we can avoid the challenges of bootstrap error accumulation by restricting the policy distribution to stay close to the data observed thus far during the actor update, while maximizing the value of the critic. At iteration $k$, AWAC therefore optimizes the policy to maximize the estimated Q-function $Q^{\pi_k}(s,a)$ at every state, while constraining it to stay close to the actions observed in the data, similar to prior offline RL methods, though this constraint will be enforced differently. Note from the definition of the advantage in Section 2 that optimizing $Q^{\pi_k}(s,a)$ is equivalent to optimizing $A^{\pi_k}(s,a)$. We can therefore write this optimization as:

$$\pi_{k+1} = \arg\max_{\pi \in \Pi} \mathbb{E}_{a \sim \pi(\cdot|s)}\big[A^{\pi_k}(s,a)\big] \quad \text{s.t.} \quad D_{KL}\big(\pi(\cdot|s) \,\|\, \pi_\beta(\cdot|s)\big) \le \epsilon. \tag{6}$$
As we saw in Section 3, enforcing the constraint by incorporating an explicit learned behavior model (kumar19bear; fujimoto19bcq; wu2019brac; siegel2020abm) leads to poor fine-tuning performance. Instead, we will enforce the constraint implicitly, without explicitly learning a behavior model. We first derive the solution to the constrained optimization in Equation 6 to obtain a non-parametric closed form for the actor. This solution is then projected onto the parametric policy class without any explicit behavior model. The analytic solution to Equation 6 can be obtained by enforcing the KKT conditions (peters2007rwr; peters2010reps; peng2019awr). The Lagrangian is:

$$\mathcal{L}(\pi, \lambda) = \mathbb{E}_{a \sim \pi(\cdot|s)}\big[A^{\pi_k}(s,a)\big] + \lambda\Big(\epsilon - D_{KL}\big(\pi(\cdot|s) \,\|\, \pi_\beta(\cdot|s)\big)\Big), \tag{7}$$
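and the closed form solution to this problem is

$$\pi^*(a|s) = \frac{1}{Z(s)}\, \pi_\beta(a|s) \exp\Big(\frac{1}{\lambda} A^{\pi_k}(s,a)\Big), \tag{8}$$

where $Z(s) = \int_a \pi_\beta(a|s) \exp\big(\frac{1}{\lambda} A^{\pi_k}(s,a)\big)\, da$ is the normalizing partition function. When using function approximators, such as deep neural networks as we do in our implementation, we need to project the non-parametric solution into our policy space. For a policy $\pi_\theta$ with parameters $\theta$, this can be done by minimizing the KL divergence of $\pi_\theta$ from the optimal non-parametric solution $\pi^*$ under the data distribution $\rho_{\pi_\beta}(s)$:

$$\arg\min_\theta \mathbb{E}_{\rho_{\pi_\beta}(s)}\Big[D_{KL}\big(\pi^*(\cdot|s) \,\|\, \pi_\theta(\cdot|s)\big)\Big] = \arg\min_\theta \mathbb{E}_{\rho_{\pi_\beta}(s)}\Big[\mathbb{E}_{\pi^*(\cdot|s)}\big[-\log \pi_\theta(\cdot|s)\big]\Big]. \tag{9}$$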
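As a numerical illustration of the closed-form solution, the sketch below computes the non-parametric optimum for a single state with discrete actions: the behavior probabilities are re-weighted by the exponentiated advantage and renormalized. The function name `tilted_policy` and the discrete-action setting are assumptions made for illustration only.

```python
import math

def tilted_policy(behavior_probs, advantages, lam):
    """Return pi*(a|s) proportional to pi_beta(a|s) * exp(A(s, a) / lam)."""
    # Subtracting the maximum exponent improves numerical stability;
    # the constant cancels in the normalization.
    m = max(a / lam for a in advantages)
    unnorm = [p * math.exp(a / lam - m)
              for p, a in zip(behavior_probs, advantages)]
    z = sum(unnorm)  # plays the role of the partition function
    return [u / z for u in unnorm]
```

For example, with uniform behavior probabilities over two actions, advantages `[0, log 2]`, and `lam = 1`, the second action receives twice the weight of the first, giving probabilities `[1/3, 2/3]`. As `lam` grows, the result approaches the behavior distribution, which corresponds to a tighter constraint.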
Note that the parametric policy could be projected with either direction of KL divergence. Choosing the reverse KL results in explicit penalty methods (wu2019brac) that rely on evaluating the density of a learned behavior model. Instead, by using forward KL, we can compute the policy update by sampling directly from $\beta$:

$$\theta_{k+1} = \arg\max_\theta \mathbb{E}_{s,a \sim \beta}\Big[\log \pi_\theta(a|s) \exp\Big(\frac{1}{\lambda} A^{\pi_k}(s,a)\Big)\Big]. \tag{10}$$
This actor update amounts to weighted maximum likelihood (i.e., supervised learning), where the targets are obtained by re-weighting the state-action pairs observed in the current dataset by the predicted advantages from the learned critic. No parametric behavior model is learned explicitly; the update simply samples from the replay buffer $\beta$. See Appendix A.2 for a more detailed derivation and Appendix A.3 for specific implementation details.
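The weighted maximum likelihood update can be sketched as a loss over a batch sampled from the replay buffer. This is a minimal illustration rather than the authors' implementation: the function name `awac_actor_loss` is hypothetical, and the inputs are assumed to be precomputed log-probabilities of the buffer actions under the current policy and advantage estimates from the critic.

```python
import math

def awac_actor_loss(log_probs, advantages, lam=1.0):
    """Negative advantage-weighted log-likelihood over a batch.

    log_probs:  log pi_theta(a|s) for each (s, a) sampled from the buffer
    advantages: critic estimates A(s, a) for the same pairs
    lam:        temperature; smaller values weight high-advantage actions more
    """
    weights = [math.exp(a / lam) for a in advantages]
    # Minimizing this loss maximizes E[log pi(a|s) * exp(A / lam)].
    return -sum(w * lp for w, lp in zip(weights, log_probs)) / len(log_probs)
```

In a gradient-based implementation the weights would be treated as constants, so that gradients flow only through the log-probabilities.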
Avoiding explicit behavior modeling. Note that the update in Equation 10 completely avoids any modeling of the previously observed data with a parametric model. By avoiding any explicit learning of the behavior model, AWAC is far less conservative than methods that fit a model explicitly, and better incorporates new data during online fine-tuning, as seen from our results in Section 6. This derivation is related to AWR (peng2019awr), with the main difference that AWAC uses an off-policy Q-function to estimate the advantage, which greatly improves efficiency and even final performance (see results in Section 6). The update also resembles ABM-MPO, but ABM-MPO does require modeling the behavior policy which, as discussed in Section 3, can lead to poor fine-tuning. In Section 6, AWAC outperforms ABM-MPO on a range of challenging tasks.
Policy evaluation. During policy evaluation, we estimate the action-value $Q^\pi(s,a)$ for the current policy $\pi$, as described in Section 2. We utilize a standard temporal difference learning scheme for policy evaluation (haarnoja2018sac; fujimoto2018td3), by minimizing the Bellman error as described in Equation 3. This enables us to learn very efficiently from off-policy data. This is particularly important in our problem setting to effectively use the offline dataset, and allows us to significantly outperform alternatives using Monte-Carlo evaluation or TD($\lambda$) to estimate returns (peng2019awr).
Algorithm summary. The full AWAC algorithm for offline RL with online fine-tuning is summarized in Algorithm 1. In a practical implementation, we can parameterize the actor and the critic by neural networks and perform SGD updates from Eqn. 10 and Eqn. 3. Specific details are provided in Appendix A.3. As we will show in our experiments, the specific design choices described above enable AWAC to overcome the challenges discussed in Section 3. AWAC ensures data efficiency with off-policy critic estimation via bootstrapping, and avoids offline bootstrap error with a constrained actor update. By avoiding explicit modeling of the behavior policy, AWAC avoids overly conservative updates.
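The overall loop can be sketched in a tabular setting as follows. This is an illustrative sketch under simplifying assumptions, not the authors' Algorithm 1: the function `awac_tabular`, the softmax-over-logits actor, the shared step size `lr`, and the optional `collect` callback for online data are all hypothetical choices, and the neural-network actor and critic are replaced by tables.

```python
import math

def awac_tabular(dataset, num_states, num_actions, gamma=0.9, lam=1.0,
                 iters=200, lr=0.5, offline_iters=None, collect=None):
    """Tabular sketch: TD critic update plus advantage-weighted actor update.

    dataset: list of (s, a, r, s_next, done) transitions (the prior data)
    collect: optional callable taking the policy function and returning new
             transitions; used only after `offline_iters` iterations
    """
    buffer = list(dataset)  # replay buffer, initialized with the offline data
    q = [[0.0] * num_actions for _ in range(num_states)]
    logits = [[0.0] * num_actions for _ in range(num_states)]

    def policy(s):
        m = max(logits[s])
        e = [math.exp(l - m) for l in logits[s]]
        z = sum(e)
        return [x / z for x in e]

    for it in range(iters):
        # Policy evaluation: move Q(s, a) toward r + gamma * E_{a'~pi}[Q(s', a')].
        for s, a, r, s_next, done in buffer:
            pi_next = policy(s_next)
            v_next = 0.0 if done else sum(
                p * qv for p, qv in zip(pi_next, q[s_next]))
            q[s][a] += lr * (r + gamma * v_next - q[s][a])
        # Policy improvement: ascend exp(A / lam) * log pi(a|s) on buffer samples.
        for s, a, r, s_next, done in buffer:
            pi_s = policy(s)
            v = sum(p * qv for p, qv in zip(pi_s, q[s]))
            w = math.exp((q[s][a] - v) / lam)
            for b in range(num_actions):
                # Gradient of log softmax(logits)[a] with respect to logit b.
                grad = (1.0 if b == a else 0.0) - pi_s[b]
                logits[s][b] += lr * w * grad
        # Online phase: append freshly collected transitions to the buffer.
        if collect is not None and offline_iters is not None and it >= offline_iters:
            buffer.extend(collect(policy))
    return q, [policy(s) for s in range(num_states)]
```

On a one-state problem with two terminal actions and rewards 0 and 1, each observed once, the critic converges to the observed rewards and the actor shifts probability toward the higher-reward action without ever fitting a behavior model.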
5 Related Work
Off-policy RL algorithms are designed to reuse off-policy data during training, and have been studied extensively (konda2000actorcritic; degris2012; mnih2016asynchronous; haarnoja2018sac; fujimoto2018td3; bhatnagar2009; peters2008; zhang2019; pawel2009; balduzzi2015). While standard off-policy methods are able to benefit from including data seen during a training run, as we show in Section 3 they struggle when training from previously collected offline data from other policies, due to error accumulation with distribution shift (fujimoto19bcq; kumar19bear). Offline RL methods aim to address this issue, often by constraining the actor updates to avoid excessive deviation from the data distribution (levine2020offlinetutorial; thomas2016; hallak2015offpolicy; hallak2016td; hallak2017onlineoffpolicy; kumar19bear; fujimoto19bcq; lange2012; siegel2020abm; nachum2019dualdice; zhang2020gendice). One class of these methods utilizes importance sampling (thomas2016; zhang2020gendice; nachum2019dualdice; degris2012; jiang2016doublyrobust; hallak2017onlineoffpolicy). Another class of methods performs offline reinforcement learning via dynamic programming, with an explicit constraint to prevent deviation from the data distribution (kumar19bear; fujimoto19bcq; lange2012; wu2019brac; jaques2019). While these algorithms perform well in the purely offline setting, we show in Section 3 that such methods tend to be overly conservative, and therefore may not learn efficiently when fine-tuning with online data collection. In contrast, our algorithm Advantage Weighted Actor Critic is comparable to these algorithms for offline pre-training, but learns much more efficiently during subsequent fine-tuning.
Prior work has also considered the special case of learning from demonstration data. One class of algorithms initializes the policy via behavioral cloning from demonstrations, and then fine-tunes with reinforcement learning (peters2008baseball; ijspeert2002attractor; Theodorou2010; kim2013apid; rajeswaran2018dextrous; gupta2019relay; zhu2019hands). Most such methods use on-policy fine-tuning, which is less sample-efficient than off-policy methods that perform value function estimation. Other prior works have incorporated demonstration data into the replay buffer using off-policy RL methods (vecerik17ddpgfd; nair2017icra). We show in Section 3 that these strategies can result in a large dip in performance during online fine-tuning, due to the inability to pre-train an effective value function from offline data. In contrast, our work shows that using supervised learning style policy updates can allow for better bootstrapping from demonstrations as compared to vecerik17ddpgfd and nair2017icra.
Our method builds on algorithms that implement a maximum likelihood objective for the actor, based on an expectation-maximization formulation of RL (peters2007rwr; neumann2008fqiawr; Theodorou2010; peters2010reps; peng2019awr; we2018mpo). Most closely related to our method in this respect are the algorithms proposed by peng2019awr (AWR) and siegel2020abm (ABM). Unlike AWR, which estimates the value function of the behavior policy, $V^{\pi_\beta}$, via Monte-Carlo estimation or TD($\lambda$), our algorithm estimates the Q-function of the current policy, $Q^\pi$, via bootstrapping, enabling much more efficient learning, as shown in our experiments. Unlike ABM, our method does not require learning a separate function approximator to model the behavior policy $\pi_\beta$, and instead directly samples the dataset. As we discussed in Section 3, modeling $\pi_\beta$ can be a major challenge for online fine-tuning. While these distinctions may seem somewhat subtle, they are important and we show in our experiments that they result in a large difference in algorithm performance. Finally, our work goes beyond the analysis in prior work, by studying the issues associated with pre-training and fine-tuning in Section 3.
6 Experimental Evaluation
In our experiments, we first compare our method against prior methods in the offline training and fine-tuning setting. We show that we can learn difficult, high-dimensional, sparse reward dexterous manipulation problems from human demonstrations and off-policy data. We then evaluate our method with suboptimal prior data generated by a random controller. Finally, we study why prior methods struggle in this setting by analyzing their performance on benchmark MuJoCo tasks, and conduct further experiments to understand where the difficulty lies (also shown in Section 3). Videos and further experimental details can also be found at awacrl.github.io.
6.1) Comparative Evaluation on Dexterous Manipulation Tasks. We aim to study tasks representative of the difficulties of real-world robot learning, where offline learning and online fine-tuning are most relevant. One such setting is the suite of dexterous manipulation tasks proposed by rajeswaran2018dextrous. These tasks involve complex manipulation skills using a 28-DoF five-fingered hand in the MuJoCo simulator (todorov12mujoco) shown in Figure 3: in-hand rotation of a pen, opening a door by unlatching the handle, and picking up a sphere and relocating it to a target location. These environments exhibit many challenges: high dimensional action spaces, complex manipulation physics with many intermittent contacts, and randomized hand and object positions. The reward functions in these environments are binary 0-1 rewards for task completion. (rajeswaran2018dextrous use a combination of task completion factors as the sparse reward; for instance, in the door task, their sparse reward combines several terms that depend on the door position. We only use the binary success measure, which is substantially more difficult.) rajeswaran2018dextrous provide 25 human demonstrations for each task, which are not fully optimal but do solve the task. Since this dataset is very small, we generated another 500 trajectories of interaction data by constructing a behavioral cloned policy, and then sampling from this policy.
First, we compare our method on the dexterous manipulation tasks described earlier against prior methods for off-policy learning, offline learning, and bootstrapping from demonstrations. Specific implementation details are discussed in Appendix A.4. The results are shown in Fig. 3. Our method is able to leverage the prior data to quickly attain good performance, and the efficient off-policy actor-critic component of our approach fine-tunes much more quickly than demonstration augmented policy gradient (DAPG), the method proposed by rajeswaran2018dextrous. For example, our method solves the pen task in 120K timesteps, the equivalent of just 20 minutes of online interaction. While the baseline comparisons and ablations are able to make some amount of progress on the pen task, alternative off-policy RL and offline RL algorithms are largely unable to solve the door and relocate tasks in the time-frame considered. We find that the design decision to use off-policy critic estimation allows AWAC to significantly outperform AWR (peng2019awr), while the implicit behavior modeling allows AWAC to significantly outperform ABM (siegel2020abm), although ABM does make some progress. rajeswaran2018dextrous show that DAPG can solve these tasks with more reward information, but this highlights the weakness of on-policy methods in sparse reward scenarios.
6.2) Fine-Tuning from Random Policy Data. An advantage of using off-policy RL for reinforcement learning is that we can also incorporate suboptimal data, rather than only demonstrations. In this experiment, we evaluate on a simulated tabletop pushing environment with a Sawyer robot (shown in Fig 3), described further in Appendix A.1. To study the potential to learn from suboptimal data, we use an off-policy dataset of 500 trajectories generated by a random process. The task is to push an object to a target location in a 40cm x 20cm goal space.
The results are shown in Figure 4. We see that while many methods begin at the same initial performance, AWAC learns the fastest online and is actually able to make use of the offline dataset effectively as opposed to some methods which are completely unable to learn.
6.3) Analysis on MuJoCo Benchmarks from Prior Data. Since the dexterous manipulation environments are challenging to solve, we provide a comparative evaluation on MuJoCo benchmark tasks for analysis. On these simpler problems, many prior methods are able to learn, which allows us to understand more precisely which design decisions are crucial. For each task, we collect 15 demonstration trajectories using a pre-trained expert on each task, and 100 trajectories of off-policy data by rolling out a behavioral cloned policy trained on the demonstrations. The same data is made available to all methods. The results are presented in Figure 5. AWAC is consistently the best-performing method, but several other methods show reasonable performance. We summarize the results according to the challenges in Section 3.
Data efficiency. The two methods that do not estimate $Q^\pi$ are DAPG (rajeswaran2018dextrous) and AWR (peng2019awr). Across all three tasks, we see that these methods are somewhat worse offline than the best performing offline methods, and exhibit steady but slow improvement.
Bootstrap error in offline off-policy learning. For SAC (haarnoja2018sac), across all three tasks, we see that the offline performance at epoch 0 is generally poor. Due to the data in the replay buffer, SAC with prior data does learn faster than from scratch, but AWAC is faster to solve the tasks in general. SAC with additional data in the replay buffer is similar to the approach proposed by vecerik17ddpgfd. SAC+BC reproduces nair2018demonstrations but uses SAC instead of DDPG (lillicrap2015continuous) as the underlying RL algorithm. We find that these algorithms exhibit a characteristic dip at the start of learning.
Conservative online learning. Finally, we consider conservative offline algorithms: ABM (siegel2020abm), BEAR (kumar19bear), and BRAC (wu2019brac). We found that BRAC performs similarly to SAC for working hyperparameters. BEAR trains well offline: on Ant and Walker2d, BEAR significantly outperforms prior methods before online experience. However, online improvement is slow for BEAR and the final performance across all three tasks is much lower than AWAC. The closest in performance to our method is ABM, which is comparable on Ant-v2, but much slower on other domains.
7 Discussion and Future Work
We have discussed the challenges existing RL methods face when fine-tuning from prior datasets, and proposed an algorithm, AWAC, that is effective in this setting. The key insight in AWAC is that enforcing a policy update constraint implicitly on actor-critic methods results in a stable learning algorithm amenable to off-policy learning. With an informative action-value estimate, the policy is weighted towards high-advantage actions in the data, resulting in policy improvement without conservative updates. A direction of future work we plan to pursue is applying AWAC to solve difficult robotic tasks in the real world. More than just speeding up individual runs, incorporating prior data into the learning process enables continuously accumulating data by saving environment interactions of the robot, for instance from runs of RL with varying hyperparameters. We hope that this enables a wider array of robotic applications than previously possible.
This research was supported by the Office of Naval Research, the National Science Foundation through IIS-1700696 and IIS-1651843, and ARL DCIST CRA W911NF-17-2-0181. We would like to thank Aviral Kumar, Ignasi Clavera, Karol Hausman, Oleh Rybkin, Michael Chang, Corey Lynch, Kamyar Ghasemipour, Alex Irpan, Vitchyr Pong, Graham Hughes, and many others at UC Berkeley RAIL Lab and Robotics at Google for their valuable feedback on the paper and insightful discussions.
Appendix A Appendix
A.1 Environment-Specific Details
We evaluate our method on three domains: dexterous manipulation environments, Sawyer manipulation environments, and MuJoCo benchmark environments. In the following sections we describe specific details.
A.1.1 Dexterous Manipulation Environments
These environments are modified from those proposed by rajeswaran2018dextrous, and available in this repository.
The task is to spin a pen into a given orientation. The action dimension is 24 and the observation dimension is 45. The reward function is a binary indicator of whether the position and orientation of the pen are sufficiently close to the desired position and orientation, minus 1. In rajeswaran2018dextrous, the episode was terminated when the pen fell out of the hand; we did not include this early termination condition.
The task is to open a door, which requires first twisting a latch. The action dimension is 28 and the observation dimension is 39. The reward function is a binary indicator of whether the angle of the door exceeds a success threshold, minus 1.
The task is to relocate an object to a goal location. The action dimension is 30 and the observation dimension is 39. The reward is a binary indicator of whether the object position is sufficiently close to the desired position, minus 1.
A.1.2 Sawyer Manipulation Environment
This environment is included in the Multiworld library. The task is to push a puck to a goal position in a 40cm x 20cm workspace, and the reward function is the negative distance between the puck and the goal position. When using this environment, we use hindsight experience replay for goal-conditioned reinforcement learning. The random dataset for prior data was collected by rolling out an Ornstein-Uhlenbeck process with and .
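As a rough illustration of how such a random dataset could be generated, the sketch below draws a temporally correlated action sequence from a discretized, zero-mean Ornstein-Uhlenbeck process. The function name, the clipping to an action range of [-1, 1], and the values of `theta` and `sigma` are assumptions made for this example, not the settings used in the experiments.

```python
import math
import random


def ou_action_sequence(num_steps, action_dim, theta, sigma, dt=1.0, seed=0):
    """Sample actions from a discretized zero-mean Ornstein-Uhlenbeck process.

    Update rule: x_{t+1} = x_t - theta * x_t * dt + sigma * sqrt(dt) * N(0, 1)
    """
    rng = random.Random(seed)
    x = [0.0] * action_dim
    actions = []
    for _ in range(num_steps):
        x = [
            xi - theta * xi * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
            for xi in x
        ]
        # Assumption: the environment expects actions in [-1, 1], so clip a copy.
        actions.append([max(-1.0, min(1.0, xi)) for xi in x])
    return actions


# Hypothetical parameter values, chosen only for illustration.
actions = ou_action_sequence(num_steps=100, action_dim=2, theta=0.15, sigma=0.3)
```

Each action in the returned list would be applied to the environment in turn, and the resulting transitions stored as prior data.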
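A.2 Algorithm Derivation Details
The full optimization problem we solve, given the previous off-policy advantage estimate $A^{\pi_k}$ and buffer distribution $\pi_\beta$, is:
$$\pi_{k+1} = \arg\max_{\pi \in \Pi} \; \mathbb{E}_{a \sim \pi(\cdot|s)}\left[A^{\pi_k}(s,a)\right] \quad \text{s.t.} \quad D_{KL}\left(\pi(\cdot|s) \,\|\, \pi_\beta(\cdot|s)\right) \le \epsilon, \quad \int_a \pi(a|s)\,da = 1.$$
Our derivation follows peters2010reps and peng2019awr. The analytic solution for the constrained optimization problem above can be obtained by enforcing the KKT conditions. The Lagrangian is:
$$\mathcal{L}(\pi, \lambda, \alpha) = \mathbb{E}_{a \sim \pi(\cdot|s)}\left[A^{\pi_k}(s,a)\right] + \lambda\left(\epsilon - D_{KL}\left(\pi(\cdot|s) \,\|\, \pi_\beta(\cdot|s)\right)\right) + \alpha\left(1 - \int_a \pi(a|s)\,da\right).$$
Differentiating with respect to $\pi(a|s)$ gives:
$$\frac{\partial \mathcal{L}}{\partial \pi} = A^{\pi_k}(s,a) + \lambda \log \pi_\beta(a|s) - \lambda \log \pi(a|s) - \lambda - \alpha.$$
Setting $\frac{\partial \mathcal{L}}{\partial \pi}$ to zero and solving for $\pi$ gives the closed form solution to this problem:
$$\pi^*(a|s) = \frac{1}{Z(s)} \, \pi_\beta(a|s) \exp\left(\frac{1}{\lambda} A^{\pi_k}(s,a)\right),$$
where $Z(s) = \int_a \pi_\beta(a|s) \exp\left(\frac{1}{\lambda} A^{\pi_k}(s,a)\right) da$ is the normalizing partition function.
Next, we project the solution into the space of parametric policies. For a policy $\pi_\theta$ with parameters $\theta$, this can be done by minimizing the KL divergence of $\pi_\theta$ from the optimal non-parametric solution $\pi^*$ under the data distribution $\rho_{\pi_\beta}(s)$:
$$\arg\min_\theta \; \mathbb{E}_{\rho_{\pi_\beta}(s)}\left[D_{KL}\left(\pi^*(\cdot|s) \,\|\, \pi_\theta(\cdot|s)\right)\right] = \arg\min_\theta \; \mathbb{E}_{\rho_{\pi_\beta}(s)}\left[\mathbb{E}_{\pi^*(\cdot|s)}\left[-\log \pi_\theta(\cdot|s)\right]\right].$$
Note that in the projection step, the parametric policy could be projected with either direction of KL divergence. However, choosing the reverse KL direction has a key advantage: it allows us to optimize $\theta$ as a maximum likelihood problem with an expectation over data $s, a \sim \beta$, rather than sampling actions from the policy that may be out of distribution for the Q function. In our experiments we show that this decision is vital for stable off-policy learning.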
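To make the closed-form solution concrete, the following sketch evaluates it for a single state with a small discrete action set: the buffer distribution is reweighted by the exponentiated advantage and renormalized. The function names, the example probabilities and advantages, and the multiplier values are all illustrative assumptions; the actual algorithm operates on continuous actions with learned function approximators.

```python
import math


def awac_nonparametric_policy(behavior_probs, advantages, lam):
    """Closed-form constrained optimum for one state with discrete actions:
    pi*(a|s) = pi_beta(a|s) * exp(A(s, a) / lam) / Z(s).
    """
    # Subtracting the max advantage avoids overflow; it cancels in the normalization.
    max_adv = max(advantages)
    unnorm = [p * math.exp((a - max_adv) / lam)
              for p, a in zip(behavior_probs, advantages)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def kl(p, q):
    """KL divergence D_KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical buffer distribution and advantage estimates for four actions.
pi_beta = [0.25, 0.25, 0.25, 0.25]
adv = [1.0, 0.0, -1.0, -2.0]

for lam in (0.3, 1.0, 10.0):  # illustrative multiplier values
    pi_star = awac_nonparametric_policy(pi_beta, adv, lam)
    print(lam, [round(p, 3) for p in pi_star], round(kl(pi_star, pi_beta), 3))
```

A smaller multiplier concentrates probability on high-advantage actions and moves further from the buffer distribution in KL divergence; a larger one stays closer to it.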
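Furthermore, assume discrete policies with a minimum probability density of $\pi_\theta(a|s) \ge \alpha_\theta$. Then the upper bound:
$$D_{KL}\left(\pi^* \,\|\, \pi_\theta\right) \le \frac{2}{\alpha_\theta} D_{TV}\left(\pi^*, \pi_\theta\right)^2 \le \frac{1}{\alpha_\theta} D_{KL}\left(\pi_\theta \,\|\, \pi^*\right)$$
holds by Pinsker's inequality, where $D_{TV}$ denotes the total variation distance between distributions. Thus minimizing the reverse KL also bounds the forward KL. Note that we can control the minimum $\alpha_\theta$ if desired by applying Laplace smoothing to the policy.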
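The chain of inequalities relating the two KL directions through the total variation distance can be checked numerically on a small example. The sketch below assumes discrete distributions, implements the smoothing step as a mixture with the uniform distribution (one simple stand-in for Laplace smoothing), and uses made-up probability vectors; none of these choices come from the experiments.

```python
import math


def kl(p, q):
    """KL divergence D_KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def total_variation(p, q):
    """Total variation distance: half the L1 distance between p and q."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def smooth(probs, eps):
    """Mix with the uniform distribution so every entry is at least eps / len(probs)."""
    n = len(probs)
    return [(1.0 - eps) * p + eps / n for p in probs]


# Hypothetical target and parametric policies over four actions.
pi_star = [0.6, 0.25, 0.1, 0.05]
pi_theta = smooth([0.9, 0.1, 0.0, 0.0], eps=0.05)
alpha_theta = min(pi_theta)  # minimum probability of the smoothed policy

lhs = kl(pi_star, pi_theta)
mid = 2.0 / alpha_theta * total_variation(pi_star, pi_theta) ** 2
rhs = kl(pi_theta, pi_star) / alpha_theta
print(lhs <= mid <= rhs)
```

The bound loosens as the minimum probability shrinks, which is why controlling that minimum through smoothing matters for the guarantee to be informative.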
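A.3 Implementation Details
We implement the algorithm building on top of twin delayed deep deterministic policy gradient (TD3) from fujimoto2018td3. The base hyperparameters are given in Table 1.
The policy update is replaced with:
$$\theta_{k+1} = \arg\max_\theta \; \mathbb{E}_{s, a \sim \beta}\left[\log \pi_\theta(a|s) \, \frac{1}{Z(s)} \exp\left(\frac{1}{\lambda} A^{\pi_k}(s,a)\right)\right],$$
where $Z(s)$ is the per-state normalizer of the exponentiated advantage weights. We found that explicitly computing $Z(s)$ results in worse performance, so we ignore the effect of $Z(s)$ and empirically find that this results in strong performance both offline and online.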
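A minimal sketch of the resulting policy objective, with the per-state normalizer dropped, is shown below for a batch of transitions sampled from the replay buffer. It is written in plain Python for clarity; the function name and arguments are assumptions for illustration, with the advantage taken as the difference between a Q-value and a state-value estimate.

```python
import math


def awac_policy_loss(log_probs, q_values, v_values, lam):
    """Negative advantage-weighted log-likelihood over a batch (normalizer omitted).

    log_probs: log pi_theta(a|s) for buffer actions
    q_values:  Q(s, a) estimates for the same transitions
    v_values:  V(s) estimates, so that the advantage is Q(s, a) - V(s)
    lam:       Lagrange multiplier, treated as a fixed temperature
    """
    weights = [math.exp((q - v) / lam) for q, v in zip(q_values, v_values)]
    # The weights act as constants: no gradient should flow through them.
    weighted = [w * lp for w, lp in zip(weights, log_probs)]
    return -sum(weighted) / len(weighted)


# Two hypothetical transitions: the first has a higher advantage, so its
# log-likelihood term receives a larger weight.
loss = awac_policy_loss(
    log_probs=[-1.0, -2.0], q_values=[1.0, 0.0], v_values=[0.0, 0.0], lam=1.0
)
```

In an automatic differentiation framework the weights would be detached from the computation graph, so that minimizing this loss is a weighted maximum likelihood update on buffer actions only.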
The Lagrange multiplier is a hyperparameter. In this work we use for the manipulation environments and for the MuJoCo benchmark environments. One could adaptively learn with a dual gradient descent procedure, but this would require access to .
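For intuition about the adaptive alternative mentioned here, the sketch below runs dual gradient descent on the multiplier for a single state with discrete actions, where the behavior distribution is assumed to be known exactly. This is a toy setting chosen for illustration: the function names, step size, and KL budget `epsilon` are hypothetical, and the procedure is not part of the implementation described in this section.

```python
import math


def tilted_policy(pi_beta, adv, lam):
    """pi(a) proportional to pi_beta(a) * exp(adv(a) / lam), for discrete actions."""
    m = max(adv)
    w = [p * math.exp((a - m) / lam) for p, a in zip(pi_beta, adv)]
    z = sum(w)
    return [x / z for x in w]


def kl(p, q):
    """KL divergence D_KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def fit_lambda(pi_beta, adv, epsilon, lam=1.0, lr=1.0, steps=1000, lam_min=1e-3):
    """Adjust the multiplier until the KL constraint is met with equality."""
    for _ in range(steps):
        pi = tilted_policy(pi_beta, adv, lam)
        # Dual step: raise lam when the KL exceeds the budget, lower it otherwise.
        lam = max(lam_min, lam + lr * (kl(pi, pi_beta) - epsilon))
    return lam


pi_beta = [0.25, 0.25, 0.25, 0.25]  # assumed known behavior distribution
adv = [1.0, 0.0, -1.0, -2.0]        # hypothetical advantage estimates
lam = fit_lambda(pi_beta, adv, epsilon=0.1)
print(lam, kl(tilted_policy(pi_beta, adv, lam), pi_beta))
```

The KL term in the update is the quantity that needs the behavior distribution, which is what makes this procedure hard to apply when only samples from the buffer are available.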
As rewards for the dexterous manipulation environments are non-positive, we clamp the Q-value for these experiments to be at most zero. We find this stabilizes training slightly.
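One way such a clamp could be applied is on the bootstrapped target used to train the critics, as sketched below. Whether the clamp is applied to the target or to the critic output is not specified here, so the placement, the function name, and the discount value are assumptions; the use of the minimum over two target critics follows the TD3 convention.

```python
def clamped_td_target(reward, next_q1, next_q2, done, discount=0.99, q_max=0.0):
    """TD3-style target from two target critics, clamped from above at q_max."""
    next_q = min(next_q1, next_q2)  # pessimistic estimate over the twin critics
    target = reward + (0.0 if done else discount * next_q)
    # With non-positive rewards the true return cannot exceed zero, so cap it there.
    return min(target, q_max)


# An overestimated next-state value is cut off at zero.
print(clamped_td_target(reward=-1.0, next_q1=5.0, next_q2=3.0, done=False))
```

The clamp only removes estimates that are impossible given the reward structure, so it should not bias targets that are already non-positive.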
Table 1: Base hyperparameters.
| Hyperparameter | Value |
|---|---|
| Training Batches Per Timestep | |
| Exploration Noise | None (stochastic policy) |
| RL Batch Size | |
| Replay Buffer Size | |
| Number of pretraining steps | |
| Policy Hidden Sizes | |
| Policy Hidden Activation | ReLU |
| Policy Weight Decay | |
| Policy Learning Rate | |
| Q Hidden Sizes | |
| Q Hidden Activation | ReLU |
| Q Weight Decay | |
| Q Learning Rate | |
A.4 Baseline Implementation Details
We used public implementations of prior methods (DAPG, AWR) when available. We implemented the remaining algorithms in our framework, which also allows us to understand the effects of changing individual components of the method. In this section, we describe the implementation details. The full overview of algorithms is given in Figure 6.
Behavior Cloning (BC). This method learns a policy with supervised learning on demonstration data.
Soft Actor Critic (SAC). Using the soft actor critic algorithm from haarnoja2018sac, we follow the exact same procedure as our method in order to incorporate prior data, initializing the policy with behavior cloning on demonstrations and adding all prior data to the replay buffer.
Behavior Regularized Actor Critic (BRAC). We implement BRAC as described in wu2019brac by adding policy regularization toward $\pi_\beta$, where $\pi_\beta$ is a behavior policy trained with supervised learning on the replay buffer. We add all prior data to the replay buffer before online training.
Advantage Weighted Regression (AWR). Using the advantage weighted regression algorithm from peng2019awr, we add all prior data to the replay buffer before online training. We use the implementation provided by peng2019awr, with the key difference from our method being that AWR uses TD($\lambda$) on the replay buffer for policy evaluation.
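For readers unfamiliar with this style of policy evaluation, the sketch below computes TD($\lambda$) return targets for one trajectory with a backward recursion. It is a generic textbook formulation, not the code from the AWR implementation, and the trace parameter `lam` here is unrelated to the Lagrange multiplier used in the AWAC policy update.

```python
def td_lambda_returns(rewards, values, bootstrap_value, discount, lam):
    """Lambda-returns via G_t = r_t + discount * ((1 - lam) * V(s_{t+1}) + lam * G_{t+1}).

    rewards:         r_0 ... r_{T-1}
    values:          V(s_0) ... V(s_{T-1})
    bootstrap_value: V(s_T), or 0.0 if the trajectory ended in a terminal state
    """
    next_values = list(values[1:]) + [bootstrap_value]
    returns = [0.0] * len(rewards)
    next_return = bootstrap_value
    for t in reversed(range(len(rewards))):
        next_return = rewards[t] + discount * (
            (1.0 - lam) * next_values[t] + lam * next_return
        )
        returns[t] = next_return
    return returns


# lam = 1 recovers Monte Carlo returns; lam = 0 recovers one-step TD targets.
mc = td_lambda_returns([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], 0.0, discount=1.0, lam=1.0)
td0 = td_lambda_returns([1.0, 1.0, 1.0], [10.0, 20.0, 30.0], 0.0, discount=1.0, lam=0.0)
```

Because these targets are built from returns along stored trajectories, they reflect the policies that collected the data, in contrast to the bootstrapped Q-function used for policy evaluation in our method.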
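Maximum a Posteriori Policy Optimization (MPO). We evaluate the MPO algorithm presented by we2018mpo. Because no public implementation is available, we modify our algorithm to be as close to MPO as possible. In particular, we change the policy update in Advantage Weighted Actor Critic to be:
$$\theta_{k+1} = \arg\max_\theta \; \mathbb{E}_{s \sim \beta,\; a \sim \pi_{\theta_k}(\cdot|s)}\left[\log \pi_\theta(a|s) \exp\left(\frac{1}{\lambda} Q^{\pi_k}(s,a)\right)\right].$$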
Note that in MPO, actions for the update are sampled from the policy and the Q-function is used instead of advantage for weights. We failed to see offline or online improvement with this implementation in most environments, so we omit this comparison in favor of ABM.
Advantage-Weighted Behavior Model (ABM). We evaluate ABM, the method developed in siegel2020abm. As with MPO, we modify our method to implement ABM, as there is no public implementation of the method. ABM first trains an advantage model :
where is an increasing non-negative function, chosen to be . In place of an advantage computed by empirical returns we use the advantage estimate computed per transition by the value . This is favorable for running ABM online, as computing is similar to AWR, which shows slow online improvement. We then use the policy update:
Additionally, for this method, actions for the update are sampled from a behavior policy trained to match the replay buffer and the value function is computed as s.t. .
Demonstration Augmented Policy Gradient (DAPG). We directly utilize the code provided in (rajeswaran2018dextrous) to compare against our method. Since DAPG is an on-policy method, we only provide the demonstration data to the DAPG code to bootstrap the initial policy from.
Bootstrapping Error Accumulation Reduction (BEAR). We utilize the implementation of BEAR provided in rlkit. We provide the demonstration and off-policy data to the method together. Since the original method only involved training offline, we modify the algorithm to include an online training phase. In general we found that the MMD constraint in the method was too conservative. As a result, in order to obtain the results displayed in our paper, we swept the MMD threshold value and chose the one with the best final performance after offline training followed by online fine-tuning.