1 Introduction
Over the last five years, advances in deep reinforcement learning (RL) have driven a number of impressive results in autonomous control, including the ability to solve video games from pixels mnih2015human , master the game of Go silver2017mastering , play large-scale multi-agent video games vinyals2019alphastar , and control robots OpenAI2019SolvingRC . Most advances in RL were achieved in simulated environments where data was cheap to collect and mistakes during policy training were harmless. However, two substantial problems stand in the way of deploying these approaches in real-world settings. First, since RL algorithms require millions and sometimes billions of environment interactions, learning policies with RL in the real world is costly in terms of time and resources. Second, since RL algorithms explore their environment stochastically, agents trained in the real world can be unsafe and may harm the environment, themselves, or other agents. How can we overcome the challenges of data efficiency and safety to enable RL algorithms that can be deployed in real-world settings?
Offline or Batch RL LangeGR12 ; levine2020offlinerlsurvey has recently been proposed as a promising paradigm to tackle these challenges. Offline RL agents learn from logged data previously collected by humans or other agents. Importantly, the offline data does not have to consist of expert demonstrations, as in the case of imitation learning pomerleau1988alvinn ; Abbeel2004ApprenticeshipLV ; Ziebart2008MaximumEI , but can be collected with policies that are suboptimal or noisy. Such policies may already be in deployment for a variety of applications like autonomous driving, warehouse automation, dialogue systems jaques2019way ; ZhouSRE17 , and recommendation systems CovingtonAS16 ; SwaminathanJ15 . By learning policies using only offline datasets, and perhaps fine-tuning the policy using a small dataset of subsequent interactions, offline RL has the potential to be highly sample efficient and safe. The primary challenge with extracting policies from offline data comes from the distribution mismatch between transitions seen during training and those encountered during evaluation. Conservatism or pessimism has emerged as a core principle in offline RL to deal with distribution mismatch. Conservatism encourages the offline RL agent to improve the policy while also staying close to the dataset distribution, thereby minimizing distribution shift between training and deployment. A number of algorithms, both model-free and model-based, have been proposed that incorporate conservatism in various forms like importance weights LiuSAB19 , value functions kumar2019bear ; kumar20cql ; fujimoto2018addressing ; Agarwal2020AnOP , and dynamics models KidambiMOReL20 ; yu20mopo ; ArgensonMBOP ; MatsushimaBREMEN .
Recently, model-based offline RL algorithms like MOReL KidambiMOReL20 and MOPO yu20mopo have demonstrated impressive results in benchmark tasks and also the ability to repurpose the learned dynamics model to solve downstream tasks that are different from those encountered in the offline dataset. They incorporate conservatism in the learning process by learning pessimistic dynamics models using uncertainty quantification. However, uncertainty quantification with deep neural networks can pose challenges in many domains, such as those with high-dimensional input-output spaces or multiple confounding factors Ovadia2019CanYT ; Begoli2019TheNF ; Jiang2018ToTO ; Abdar2020ARO ; Ribeiro2016LIME . Since offline RL views uncertainty quantification as a means to the end of incorporating conservatism, and since uncertainty quantification by itself can be a difficult exercise, we are motivated to develop offline RL algorithms that do not require uncertainty quantification. In this work, we develop an algorithm that achieves this goal. Our algorithm outperforms prior approaches in the widely studied D4RL benchmark fu2020d4rl as well as in tasks that require domain adaptation and generalization. Thus, our algorithm has potentially wider applicability, especially in settings where uncertainty estimation can be difficult.
Our Contribution
Our principal contribution in this work is the development of a new algorithm: offline model-based RL with Adaptive Behavioral Regularization (MABE). Using the offline dataset, MABE learns an approximate dynamics model, a reward function, and an adaptive behavioral prior. By adaptive behavioral prior, we mean a policy that approximates the behavior in the offline dataset while giving more importance to trajectories with high rewards. Using the learned dynamics model and reward function, MABE performs model-based RL with an objective that maximizes the rewards subject to a KL-divergence penalty that encourages the agent to stay close to the adaptive behavioral prior. This divergence penalty provides the conservatism needed to succeed in offline RL. Our major findings in this work are listed below.

Our algorithm, MABE, achieves the highest scores in 7 out of 9 D4RL fu2020d4rl benchmark tasks we study, as well as the highest average normalized score.

MABE is flexible and can benefit from uncertainty estimation if available or forgo it altogether. Our empirical ablations suggest that uncertainty estimation contributes only minor improvements compared to the other components of dynamics models and behavioral priors. Thus, MABE can be used in a wider set of application domains, especially those where uncertainty estimation is difficult.

We demonstrate that MABE generalizes favorably to new tasks by leveraging the learned dynamics model and transferring behavioral priors across datasets, a capability that is only possible when model-based learning and behavioral priors are combined.
2 Preliminaries
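We operate in the standard RL setting of an infinite-horizon discounted Markov Decision Process (MDP), defined as the tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, r, P, \rho_0, \gamma)$. The MDP tuple has states $s \in \mathcal{S}$, actions $a \in \mathcal{A}$, rewards $r(s, a)$, transition dynamics $P(s' \mid s, a)$, an initial state distribution $\rho_0$, and a discount factor $\gamma \in [0, 1)$. A policy defines a mapping from states to actions, typically in the form of a probability distribution: $\pi(a \mid s)$. The value and action-value functions describe the long-term reward behavior of policy $\pi$:
$$V^\pi(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\Big|\, s_0 = s\right], \qquad Q^\pi(s, a) = r(s, a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[V^\pi(s')\right],$$
where the first expectation denotes that actions are sampled according to $\pi$ and future states are sampled according to the MDP dynamics $P$. The goal in RL is to learn the optimal policy:
$$\pi^* \in \arg\max_\pi J(\pi, \mathcal{M}) := \mathbb{E}_{s \sim \rho_0}\left[V^\pi(s)\right] \qquad (1)$$
When the MDP (especially the dynamics $P$) is unknown, exploration is important to learn the optimal policy.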
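As a concrete illustration of the value function described above, the following sketch estimates the expected discounted return of a policy by Monte Carlo rollouts. The two-state environment, the `step` function, the two constant policies, and the truncation horizon are hypothetical choices made only for this example; they are not part of the experimental setup in this paper.

```python
import random

def discounted_return(rewards, gamma):
    """Compute sum_t gamma^t * r_t for a finite reward sequence."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total

def step(state, action, rng):
    """Hypothetical 2-state MDP: action 1 usually leads to state 1, which pays reward 1."""
    p_good = 0.9 if action == 1 else 0.1
    next_state = 1 if rng.random() < p_good else 0
    reward = 1.0 if next_state == 1 else 0.0
    return next_state, reward

def mc_value(policy, start_state, gamma, horizon, episodes, seed=0):
    """Monte Carlo estimate of the value of `start_state`, truncated at `horizon` steps."""
    rng = random.Random(seed)
    returns = []
    for _ in range(episodes):
        state, rewards = start_state, []
        for _ in range(horizon):
            action = policy(state, rng)
            state, reward = step(state, action, rng)
            rewards.append(reward)
        returns.append(discounted_return(rewards, gamma))
    return sum(returns) / len(returns)

def always_one(state, rng):
    return 1

def always_zero(state, rng):
    return 0

v_good = mc_value(always_one, start_state=0, gamma=0.9, horizon=50, episodes=2000)
v_bad = mc_value(always_zero, start_state=0, gamma=0.9, horizon=50, episodes=2000)
print(v_good, v_bad)
```

In this toy problem the policy that always picks action 1 obtains a much larger estimated value, which is the quantity the RL objective seeks to maximize over policies.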
Model-Based RL (MBRL)
is an approach to learning in MDPs that involves learning an approximate MDP $\hat{\mathcal{M}}$. The learned MDP has the same state and action spaces, but uses the learned approximate dynamics and reward models. Generating samples from $\hat{\mathcal{M}}$ is cheap and does not require environment interaction. As a result, various algorithms based on policy gradient and dynamic programming sutton1998introduction can be used to efficiently improve the policy, with intermittent data collection to improve model approximation quality. Recently, MBRL algorithms have demonstrated strong results in a variety of RL tasks RajeswaranGameMBRL ; janner2019mbpo ; hafner2019dream ; schrittwieser2019mastering , including offline RL KidambiMOReL20 ; yu20mopo ; ArgensonMBOP .
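To make the idea of sampling from a learned MDP concrete, the sketch below rolls a policy forward inside a model, starting from a set of given states, and collects synthetic transitions. The one-dimensional `learned_model`, `learned_reward`, and `policy` are hypothetical stand-ins; a real learned model would typically be a stochastic neural network rather than a closed-form function.

```python
def model_rollouts(start_states, policy, model, reward_fn, horizon):
    """Roll the policy forward inside the learned model from each start state."""
    transitions = []
    for s in start_states:
        for _ in range(horizon):
            a = policy(s)
            s_next = model(s, a)
            transitions.append((s, a, reward_fn(s, a), s_next))
            s = s_next
    return transitions

def learned_model(s, a):
    """Hypothetical learned dynamics: the action shifts the state by half its value."""
    return s + 0.5 * a

def learned_reward(s, a):
    """Hypothetical learned reward: staying near the origin is better."""
    return -abs(s)

def policy(s):
    """Simple policy that pushes the state toward the origin."""
    return -1.0 if s > 0 else 1.0

synthetic = model_rollouts([2.0, -1.0], policy, learned_model, learned_reward, horizon=3)
for transition in synthetic:
    print(transition)
```

No environment interaction is needed to produce these transitions, which is why policy improvement on model-generated data is cheap.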
Offline RL
is a setting in RL where we must learn a policy using a fixed dataset of environment interactions. Specifically, we are given a dataset $\mathcal{D}$ of environment interactions collected using one or more behavioral policies. If the behavioral policies do not induce sufficient exploration, it is not possible to learn an optimal policy for the underlying MDP even as $|\mathcal{D}| \to \infty$ Chen2019InformationTheoreticCI ; KidambiMOReL20 . Thus, the goal in offline RL is typically to learn the best possible policy using the provided dataset.
Model-Based Offline RL
algorithms like MOPO yu20mopo and MOReL KidambiMOReL20 leverage MBRL to learn in the offline RL setting. They learn an approximate MDP using the offline dataset. Simulation with the learned MDP allows the offline RL agent to ask counterfactual questions about actions that are unseen in the dataset by leveraging the generalization capabilities of the learned dynamics model. However, since the model cannot be iteratively refined or improved as in the case of online RL, the learned MDP is likely erroneous on out-of-distribution states. As a result, policy learning in the learned MDP may exploit the errors in the model to optimize rewards, leading to poor performance in the true MDP. To guard against this exploitation, MOPO and MOReL penalize the agent for visiting out-of-distribution states in the learned MDP, with uncertainty in the dynamics model being used to detect out-of-distribution states.
3 Model-Based Offline RL with Adaptive Behavioral Regularization
Given an offline dataset $\mathcal{D}$, our goal is to learn a parameterized policy $\pi_\theta$ that achieves high rewards, without any additional interaction with the environment. We assume $\mathcal{D}$ consists of tuples $(s, a, r, s')$, which we use to learn an approximate MDP $\hat{\mathcal{M}}$ along with a behavioral prior $\pi_\beta$. This dataset can be collected using one or more structured behavioral policies interacting with the test environment. We now present our algorithm MABE (Model-Based Offline RL with Adaptive Behavioral Regularization), which consists of three components described below.
Dynamics Model Learning
MABE is a model-based RL algorithm, and thus we use the offline dataset to learn a neural network dynamics model. This can be accomplished using maximum likelihood estimation or other generative modeling techniques such as variational models hafner2019dream . Let $\hat{P}_\phi(s' \mid s, a)$ represent the generative model for the conditional next-state distribution. Similar to prior offline MBRL works yu20mopo ; KidambiMOReL20 ; ArgensonMBOP ; MatsushimaBREMEN , we learn the generative dynamics model with maximum-likelihood learning as:
$$\max_\phi \; \mathbb{E}_{(s, a, s') \sim \mathcal{D}}\left[\log \hat{P}_\phi(s' \mid s, a)\right] \qquad (2)$$
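The maximum-likelihood criterion for a Gaussian dynamics model can be sketched as an average negative log-likelihood over dataset transitions. The example below is a minimal illustration, assuming a diagonal Gaussian over next states; the toy dataset and the two hand-written models (`good_model`, `bad_model`) are hypothetical and stand in for neural networks trained by gradient descent.

```python
import math

def gaussian_log_prob(x, mean, log_std):
    """Log density of a diagonal Gaussian, summed over dimensions."""
    log_prob = 0.0
    for xi, mi, lsi in zip(x, mean, log_std):
        var = math.exp(2.0 * lsi)
        log_prob += -0.5 * ((xi - mi) ** 2 / var + 2.0 * lsi + math.log(2.0 * math.pi))
    return log_prob

def average_nll(model, dataset):
    """Average negative log-likelihood of observed next states under the model."""
    total = 0.0
    for s, a, s_next in dataset:
        mean, log_std = model(s, a)
        total -= gaussian_log_prob(s_next, mean, log_std)
    return total / len(dataset)

# Toy 1-D transitions (s, a, s') generated by the hypothetical rule s' = s + a.
dataset = [([0.0], [1.0], [1.0]), ([1.0], [-0.5], [0.5]), ([2.0], [0.5], [2.5])]

def good_model(s, a):
    """Predicts the correct mean with a small standard deviation."""
    return [s[0] + a[0]], [math.log(0.1)]

def bad_model(s, a):
    """Ignores the action, so its mean predictions are off."""
    return [s[0]], [math.log(0.1)]

nll_good = average_nll(good_model, dataset)
nll_bad = average_nll(bad_model, dataset)
print(nll_good, nll_bad)
```

Minimizing this quantity over model parameters is equivalent to maximizing the dataset log-likelihood; the model whose predictions match the data attains the lower value.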
Learning Behavioral Priors
Our main insight is the use of adaptive behavioral priors as a form of regularization in offline MBRL. Building on prior work brac2019wu ; awr2019peng , we utilize behavioral regularization within the MBRL framework. Our experimental results suggest that combining MBRL with behavioral regularization can incorporate sufficient conservatism to succeed in offline RL. This is in contrast to prior offline MBRL works that rely crucially on uncertainty estimation which may prove difficult in various applications.
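We consider a parameterized generative model $\pi_\beta(a \mid s)$ that represents our behavioral prior. A straightforward option is to learn a behavior model that replicates the statistics in the dataset:
$$\max_\beta \; \mathbb{E}_{(s, a) \sim \mathcal{D}}\left[\log \pi_\beta(a \mid s)\right] \qquad (3)$$
Alternatively, we can consider an adaptive behavioral prior that is biased towards trajectories that achieve higher rewards. This can be particularly useful in diverse datasets collected with multiple policies, some of which perform better at the task while others may exhibit behaviors that hinder the task we want the offline RL agent to learn. Similar to Siegel et al. Siegel2020KeepDW , we seek a behavioral prior that is biased towards the high-reward trajectories in the dataset while also staying close to the average statistics in the dataset. We formulate this as:
$$\max_\pi \; \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi(\cdot \mid s)}\left[f(s, a)\right] \quad \text{s.t.} \quad \mathbb{E}_{s \sim \mathcal{D}}\left[D_{\mathrm{KL}}\left(\pi(\cdot \mid s) \,\|\, \pi_{\mathcal{D}}(\cdot \mid s)\right)\right] \le \epsilon \qquad (4)$$
where $\pi_{\mathcal{D}}$ denotes the empirical behavioral policy and $f$ is the weighting function. The nonparametric solution to the above optimization is given by:
$$\pi(a \mid s) \propto \pi_{\mathcal{D}}(a \mid s) \exp\left(f(s, a) / \tau\right)$$
where we have used $\propto$ to avoid specification of the normalization factor, and $\tau$ represents a temperature parameter that is related to the constraint level $\epsilon$. The above nonparametric policy can be projected into the space of parametric neural network policies as awr2019peng ; Siegel2020KeepDW :
$$\max_\beta \; \mathbb{E}_{(s, a) \sim \mathcal{D}}\left[\exp\left(f(s, a) / \tau\right) \log \pi_\beta(a \mid s)\right] \qquad (5)$$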
For the choice of the weighting function, we use
where is learned using TD-error minimization and is the maximum reward observed in the dataset. In this process, we treat the temperature $\tau$ as a hyperparameter. This implicitly defines the constraint threshold $\epsilon$ and makes the problem specification and optimization more straightforward.
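The effect of exponentiated weights on a behavioral prior can be seen in a small discrete case. For a single state with a categorical policy, weighted maximum likelihood has a closed form: each action's probability is its total weight divided by the sum of all weights, which amounts to reweighting the empirical action frequencies by the exponentiated weighting function. The sketch below assumes the weighting values are already given; the action labels, the values in `f_values`, and the function name `adaptive_prior` are hypothetical.

```python
import math
from collections import defaultdict

def adaptive_prior(actions, f_values, temperature):
    """Weighted maximum-likelihood categorical prior for a single state.

    Each logged action gets weight exp(f / temperature); the fitted probability
    of an action is its share of the total weight.
    """
    weights = defaultdict(float)
    for action, f in zip(actions, f_values):
        weights[action] += math.exp(f / temperature)
    normalizer = sum(weights.values())
    return {action: w / normalizer for action, w in weights.items()}

# Hypothetical logged data: "left" is common but weak, "right" is rare but better.
actions = ["left", "left", "left", "right"]
f_values = [0.0, 0.0, 0.0, 1.0]

sharp = adaptive_prior(actions, f_values, temperature=1.0)
flat = adaptive_prior(actions, f_values, temperature=1e6)
print(sharp, flat)
```

A small temperature shifts probability mass toward the higher-scoring action, while a very large temperature recovers the unweighted empirical frequencies (here, 0.25 for "right"), which corresponds to the plain behavior model.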
Behavior-Regularized Model-Based RL
Equipped with a dynamics model and an adaptive behavioral prior, our algorithm, MABE, performs model-based RL with a regularized objective given by:
(6) 
We use $d^{\pi}_{\hat{\mathcal{M}}}$ to denote the discounted state visitation distribution induced by executing the policy $\pi$ in the learned MDP model. This objective encourages the agent to increase the rewards along with entropy and behavioral regularization. We learn a policy to solve this optimization using SAC haarnoja2018soft , resulting in an algorithm that is similar to a behavior-regularized version of Dyna sutton1991dyna and MBPO janner2019mbpo . Algorithm 1 presents the full details of our learning approach.
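The behavioral regularization term can be illustrated with Gaussian policies, for which the KL divergence has a closed form. The sketch below shows one plausible per-state form of a regularized objective: a value estimate plus an entropy bonus minus a KL penalty toward the prior. The weights `entropy_weight` and `kl_weight`, the direction of the KL term, and the function names are assumptions made for illustration and may differ from the exact objective used by MABE.

```python
import math

def kl_diag_gaussians(mean_p, std_p, mean_q, std_q):
    """KL(p || q) between diagonal Gaussians, summed over dimensions."""
    kl = 0.0
    for mp, sp, mq, sq in zip(mean_p, std_p, mean_q, std_q):
        kl += math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2.0 * sq ** 2) - 0.5
    return kl

def gaussian_entropy(std):
    """Differential entropy of a diagonal Gaussian."""
    return sum(0.5 * math.log(2.0 * math.pi * math.e * s ** 2) for s in std)

def regularized_value(q_value, policy, prior, entropy_weight, kl_weight):
    """Hypothetical per-state objective: value + entropy bonus - KL penalty to the prior."""
    mean_pi, std_pi = policy
    mean_prior, std_prior = prior
    penalty = kl_diag_gaussians(mean_pi, std_pi, mean_prior, std_prior)
    return q_value + entropy_weight * gaussian_entropy(std_pi) - kl_weight * penalty

prior = ([0.0], [1.0])
near = regularized_value(1.0, ([0.1], [1.0]), prior, entropy_weight=0.0, kl_weight=1.0)
far = regularized_value(1.0, ([2.0], [1.0]), prior, entropy_weight=0.0, kl_weight=1.0)
print(near, far)
```

With equal value estimates, the policy that stays closer to the prior scores higher, which is how the penalty supplies conservatism without an uncertainty estimate.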
Optional use of uncertainty quantification
MABE is a flexible framework that can additionally incorporate uncertainty quantification if available, in addition to the behavioral prior regularization. Let be an estimate of the dynamics model uncertainty in state . Analogous to prior work like MOPO and MOReL, we can additionally incorporate uncertainty into the MABE objective given by Eq. 6 as:
We emphasize again that the additional reward penalty based on uncertainty is optional, and our experimental results suggest that it offers only marginal benefits compared to our other components.
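An optional uncertainty-based reward penalty of this kind can be sketched with an ensemble of dynamics models, where disagreement among the ensemble members serves as the uncertainty estimate. The example below uses the largest pairwise distance between predicted next states, in the spirit of ensemble-disagreement estimates from prior work; the specific estimator, the `penalty_weight` parameter, and the function names are assumptions and are not taken from this paper.

```python
import math

def ensemble_disagreement(predictions):
    """Largest Euclidean distance between any two ensemble next-state predictions."""
    worst = 0.0
    for i in range(len(predictions)):
        for j in range(i + 1, len(predictions)):
            squared = sum((x - y) ** 2 for x, y in zip(predictions[i], predictions[j]))
            worst = max(worst, math.sqrt(squared))
    return worst

def penalized_reward(reward, predictions, penalty_weight):
    """Subtract a penalty proportional to the ensemble disagreement from the reward."""
    return reward - penalty_weight * ensemble_disagreement(predictions)

# Hypothetical ensemble outputs for one familiar and one unfamiliar state-action pair.
in_distribution = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
out_of_distribution = [[0.0, 0.0], [3.0, 4.0], [0.0, 0.0]]

r_in = penalized_reward(1.0, in_distribution, penalty_weight=0.1)
r_out = penalized_reward(1.0, out_of_distribution, penalty_weight=0.1)
print(r_in, r_out)
```

Where the ensemble members agree, the reward is unchanged; where they disagree, the reward is reduced, which discourages the policy from exploiting model errors.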
4 Results
MABE design choices
We first outline the main design choices and implementation details used for our experiments. Our implementation of MABE is built on MOPO. We parameterize the policy, behavioral prior, and dynamics model as Gaussian distributions, with the mean parameterized by an MLP network and a learned covariance. For example, the dynamics model is represented as $\hat{P}_\phi(s' \mid s, a) = \mathcal{N}\left(\mu_\phi(s, a), \Sigma_\phi(s, a)\right)$. The reward and Q-function are modeled using deterministic MLP networks. We learn the policy and Q-function using MBPO janner2019mbpo (which itself uses SAC haarnoja2018soft internally), similar to MOPO. MBPO is a model-based RL algorithm that augments the training data with model-generated rollouts. Additional implementation details of MABE and hyperparameters are provided in the Appendix.
Experiments in D4RL offline RL benchmark tasks
Our first goal is to study the performance of MABE on the widely studied D4RL fu2020d4rl benchmark. We consider a total of nine domains involving three simulated locomotion tasks and three datasets per task: medium, medium-replay (or mixed), and medium-expert. The medium dataset is collected with a partially trained SAC agent, the mixed dataset is the entire replay buffer of a SAC agent throughout training, and the medium-expert dataset is a mix of trajectories from the medium dataset and an expert policy. These represent three distinct types of imperfect data: one imperfect policy, many changing policies, and a mixture of expert and suboptimal policies, respectively. We compare our method to leading published offline RL algorithms, which include: (a) MOReL KidambiMOReL20 and MOPO yu20mopo , model-based algorithms that rely on uncertainty quantification; (b) CQL kumar20cql , a model-free algorithm that learns a conservative Q-function; and (c) BRAC-v brac2019wu , which regularizes a model-free actor-critic algorithm with an unweighted (or equally weighted) behavioral prior. See the Appendix for more details.
Evaluation scores on D4RL are shown in Table 1. We find that MABE achieves the highest score on the majority of environments (7 out of 9) as well as the highest average score of 77.5. Crucially, MABE’s performance is robust across the three dataset types, achieving a leading score on at least 2 out of 3 environments for each dataset. Finally, we note that MABE substantially outperforms its two most directly competing baselines: MOPO, an uncertainty-based MBRL method; and BRAC-v, a model-free method with explicit behavioral prior regularization. This suggests that a combination of MBRL and behavioral priors can substantially benefit offline RL.
Dataset  Environment  BC  MABE (ours)  MOPO  MOReL  SAC  CQL  BRAC-v

medium  halfcheetah  36.1  46.8 ± 0.8  42.3 ± 1.6  42.1  -4.3  44.4  45.5
medium  hopper  29.0  94.1 ± 5.8  28.0 ± 12.4  95.4  0.8  58.0  32.3
medium  walker2d  6.6  65.7 ± 8.5  17.8 ± 19.3  77.8  0.9  79.2  81.3
med-replay  halfcheetah  38.4  53.5 ± 0.5  53.1 ± 2.0  40.2  -2.4  46.2  45.9
med-replay  hopper  11.8  71.7 ± 12.5  67.5 ± 24.7  93.6  1.9  48.6  0.9
med-replay  walker2d  11.3  51.0 ± 2.4  39.0 ± 9.6  49.8  3.5  26.7  0.8
med-expert  halfcheetah  35.8  100.6 ± 1.3  63.3 ± 38.0  53.3  1.8  62.4  45.3
med-expert  hopper  111.9  110.5 ± 0.8  23.7 ± 6.0  108.7  1.6  111.0  0.8
med-expert  walker2d  6.4  103.3 ± 1.3  44.6 ± 12.9  95.6  -0.1  98.7  66.6
Average  Average  31.7  77.5  42.1  72.9  0.4  63.9  35.5
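The Average row of the table can be re-derived from the per-task scores as a plain mean over the nine tasks. The short check below does this for four of the columns, using the mean scores exactly as listed in the table (standard deviations omitted); the dictionary layout and variable names are illustrative only.

```python
# Per-task mean scores copied from the table, in row order
# (medium, med-replay, med-expert) x (halfcheetah, hopper, walker2d).
scores = {
    "MABE": [46.8, 94.1, 65.7, 53.5, 71.7, 51.0, 100.6, 110.5, 103.3],
    "MOPO": [42.3, 28.0, 17.8, 53.1, 67.5, 39.0, 63.3, 23.7, 44.6],
    "MOReL": [42.1, 95.4, 77.8, 40.2, 93.6, 49.8, 53.3, 108.7, 95.6],
    "CQL": [44.4, 58.0, 79.2, 46.2, 48.6, 26.7, 62.4, 111.0, 98.7],
}

def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

averages = {name: round(mean(values), 1) for name, values in scores.items()}
print(averages)
```

The rounded means match the reported averages for these columns (77.5, 42.1, 72.9, and 63.9).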
In the remainder of this section, we investigate in detail why MABE performs well and what new capabilities are enabled by MABE.
Which components of MABE contribute most to performance?
The full MABE algorithm consists of three components that each play a part in the final agent: (a) adaptive behavioral prior regularization; (b) policy learning (improvement) using model-based RL; and (c) the optional use of uncertainty quantification through model ensembles chua18pets ; RajeswaranGameMBRL ; KidambiMOReL20 ; yu20mopo to incorporate additional conservatism. In this ablation study, we investigate the importance of each of these components by removing one while keeping all others fixed. The results, shown in Figure 3, indicate that RL and behavioral priors are the largest contributors to MABE’s performance, while the optional uncertainty penalty only incrementally improves the final policies. Removing the uncertainty penalty leads to an observable drop in performance in only 2 out of the 9 environments. In contrast, removing behavioral priors drops performance in 8 environments, and removing RL drops performance in 7. Aggregated across the datasets, we find that removing behavioral priors and removing RL result in a and drop in performance respectively. At the same time, removing uncertainty estimation only marginally degrades MABE performance by . This suggests that MABE has the potential to find wider applicability, especially in situations where uncertainty estimation can be difficult, but can also benefit from uncertainty estimation where available.
In Figure 3, no downstream RL refers to the direct use of the adaptive behavioral prior, without any fine-tuning with MBRL. This can be viewed as a baseline inspired by imitation learning. The no behavioral prior ablation corresponds to MOPO and incorporates conservatism through the use of uncertainty estimation. The no uncertainty estimation ablation uses adaptive behavioral prior regularization to incorporate conservatism when learning the policy using MBRL. It uses all the components of the full MABE algorithm except the optional uncertainty-based reward penalties. Finally, the full MABE algorithm uses all three aforementioned components: behavioral priors, policy learning with MBRL, and additional conservatism through uncertainty-penalized rewards.
Weighted vs Unweighted Behavioral Prior Regularization
Finally, we ablate the importance of adaptive or weighted behavioral priors as used in MABE. In particular, we compare MABE with the unweighted behavioral prior in Eq. 3 against the full MABE algorithm that uses the adaptive prior in Eq. 5. We show learning curves for MABE trained with the two priors in Figure 4 and find that adaptive priors help with training stability as well as asymptotic performance.
Cross-domain and cross-task generalization capability of MABE
A unique capability enabled by the use of behavioral priors is the possibility of transferring behaviors from one environment (or domain) to another. Prior work has explored the use of offline datasets and RL to acquire new behaviors in the same environment. For example, Yu et al. yu20mopo demonstrate that offline RL using a dataset that primarily consists of an agent walking forward can be used to learn a jumping behavior. In contrast, we seek for the agent to learn the same behavior but under different environmental conditions. This is particularly useful in robotics applications such as home robots that operate in kitchens. While the environmental scene and physical dynamics vary across kitchens depending on the types of cabinets, stoves, plates, floor, etc., we would often want the robot to exhibit similar behaviors in different kitchens, like loading plates in a dishwasher. By utilizing behavioral priors that can potentially capture the core concepts of manipulation, like force closure for grasping, robots can learn to become competent quickly in the home of a target user.
To test the generalization capabilities of MABE, we set up the following simple experiment. We use simulated locomotion agents (Hopper, Walker, HalfCheetah) and collect two datasets: the first containing medium-replay forward walking data in normal terrain, and the second containing expert backward walking data in low-friction terrain intended to simulate ice. In this behavior transfer test, we use these two datasets to train an agent to run backward on normal terrain. A schematic illustration of our setting can be found in Figure 5. In our experiments, we consider the following approaches: (i) task transfer only, where we use the forward walking dataset to learn a backward walking policy using offline MBRL; (ii) domain transfer only, where we train a policy in the source domain and directly deploy it in the target domain; (iii) task transfer with behavior initialization, where we initialize the task transfer approach with the adaptive behavioral prior; and (iv) task + domain transfer with MABE, where we run MABE using the dataset corresponding to the target dynamics (the first dataset) and the behavioral prior corresponding to the desired behavior (learned from the second dataset). We show the resulting expert-normalized scores in Figure 6 and find that MABE is the only algorithm that is able to successfully solve the target task through cross-domain behavior transfer. This suggests that dynamics models and behavioral priors are complementary and can be used to acquire a wide range of behaviors from offline data using domain and task transfer.
5 Related Work
Our method, MABE, is at the intersection of model-based reinforcement learning, offline reinforcement learning, and behavioral prior regularization. A number of related algorithms utilize dynamics models or behavioral priors in the context of offline RL yu20mopo ; brac2019wu ; MatsushimaBREMEN ; awr2019peng ; Siegel2020KeepDW ; we summarize them in Table 2. While MABE is similar to prior work, our primary contribution is identifying a unique mixture of components that enables robust offline RL on the D4RL benchmark. Concurrent work, COMBO yu2021combo , has also investigated an uncertainty-free approach to offline MBRL. The difference is that COMBO combines offline MBRL with conservative Q-functions, whereas MABE utilizes adaptive behavioral priors, which helps with cross-domain generalization as demonstrated in Section 4.
Model-based Reinforcement Learning: Reinforcement learning algorithms can be broadly classified into model-based and model-free categories. Model-based reinforcement learning (MBRL) algorithms build an explicit dynamics model of the environment for use with policy search. Model-based approaches can be further categorized into Dyna-style algorithms, policy search with temporal backpropagation, and shooting methods. In Dyna-style approaches sutton1990integrated ; sutton1991dyna ; sutton1991planning , interactions with the environment are used to update the dynamics model, and the RL policy is trained on synthetic rollouts from the dynamics model, often using a model-free RL algorithm like policy gradients or actor-critic. Some representative examples of Dyna-style algorithms include MBPO janner2019mbpo , ME-TRPO kurutach2018model , PAL/MAL RajeswaranGameMBRL , and Dreamer hafner2019dream . Policy search with temporal backpropagation and differential dynamic programming methods Rosenbrock1972DDP ; Deisenroth2011pilco ; Heess2015SVG ; tassa12ilqg ; Todorov2005iLQG utilize gradients through the model to help compute the policy gradient. Shooting methods chua18pets ; hafner2018learning ; Williams2017MPPI ; Nagabandi2019PDDM ; POLO extract an implicit policy from the learned model by performing real-time planning using the learned model. For simplicity and to build on prior work in the area of offline RL, we implemented MABE with MBPO, a Dyna-style algorithm. However, MABE can in principle be implemented with any MBRL algorithm.
Offline Reinforcement Learning: Offline RL levine2020offlinerlsurvey has recently received much attention due to its potential applicability in a wide range of applications, and consequently many algorithms have been developed. These include importance sampling based algorithms Liu2020ProvablyGB ; LiuSAB19 ; SwaminathanJ15 , dynamic programming and actor-critic based algorithms brac2019wu ; fujimoto2018addressing ; Agarwal2020AnOP ; Siegel2020KeepDW ; kumar20cql , and model-based algorithms KidambiMOReL20 ; yu20mopo ; MatsushimaBREMEN ; ArgensonMBOP . These algorithms are primarily evaluated using recently proposed benchmarks including D4RL fu2020d4rl , Atari Agarwal2020AnOP ; bellemare2013arcade , and RL Unplugged Gulcehre2020RLUA . We outline the contrasts between MABE and prior work in the remainder of the section.
Relationship to prior offline MBRL algorithms: In terms of policy learning, our work is closest to the prior offline MBRL algorithms MOPO yu20mopo and MOReL KidambiMOReL20 , which rely on uncertainty quantification to estimate model prediction error and thereby incorporate conservatism. In contrast, MABE can benefit from uncertainty estimation, but even in its absence demonstrates strong performance and thus has wider applicability. BREMEN MatsushimaBREMEN is another MBRL algorithm that was primarily developed for the different setting of deployment-efficient RL but can be repurposed for offline RL. Like MABE, it uses a behavioral prior instead of uncertainty-driven conservatism. However, it uses an unweighted behavioral prior and performs only a small number of policy updates with implicit KL regularization. As a result, it may not benefit from the full potential of policy learning for many iterations with an explicit KL regularization. Furthermore, in our experiments (Section 4), we find that the adaptive behavioral prior helps learning stability and improves asymptotic performance.
MABE (ours)  MOPO/MOReL  BRAC-v brac2019wu  BREMEN MatsushimaBREMEN  ABM Siegel2020KeepDW  AWR awr2019peng  

Model-Based  Yes  Yes  No  Yes  No  No 
Behavior Prior  Adaptive  None  Unweighted  Unweighted  Adaptive  Adaptive 
Policy Regularization  Explicit KL  None  Explicit KL  Implicit KL  Implicit KL  Implicit KL 
Policy Optimization  SAC  SAC/NPG  SAC  TRPO  MPO Abdolmaleki2018MPO  Imitation 
Uncertainty  Optional  Yes  No  No  No  No 
Relationship to prior work with behavioral priors: An alternate class of offline RL algorithms incorporates conservatism to prevent overfitting by regularizing policy learning towards a behavioral prior. Some representative algorithms are BRAC brac2019wu , ABM Siegel2020KeepDW , and AWR awr2019peng , which are all model-free algorithms. Among these, BRAC uses an unweighted behavioral prior and learns the policy using an actor-critic algorithm like SAC haarnoja2018soft . AWR was primarily developed for online RL but can be repurposed for offline RL. It is analogous to our learning of the adaptive behavioral prior, but without any RL-based fine-tuning. In our ablation experiments, we find that RL fine-tuning significantly improves the performance of MABE. ABM learns an adaptive behavior prior similar to MABE, but learns the policy using the model-free MPO algorithm. In contrast, model-based algorithms that augment training data with model-generated rollouts can unlock better generalization capabilities, including to new tasks. In an alternate line of work, behavioral priors have also been used for skill extraction to enable long-horizon tasks pertsch2020spirl or structured exploration strategies singh2021parrot . Finally, we also note that the concurrently developed Decision Transformer chen2021decisiontransformer learns a return-conditioned behavior model and generates actions by conditioning on a high desired reward. In contrast, MABE does not condition on returns and instead uses an RL algorithm for policy improvement.
In summary, MABE presents a novel combination of MBRL and adaptive behavioral priors for offline RL. Through this combination, MABE can serve as an attractive choice for uncertainty-free offline MBRL. MABE also achieves leading performance relative to prior model-based and model-free approaches on the D4RL benchmark and demonstrates a strong ability to transfer behaviors across datasets from different domains.
6 Broader Impacts and Limitations
Robust offline RL has the potential to make RL as widely applicable for decision-making problems as supervised learning is today for vision and language. Applications include domains where offline data is ample but exploration can be harmful, such as controlling autonomous vehicles, digital assistants, and recommender systems. A potential negative impact of MABE, and of RL algorithms more generally, is the lack of explainability. Since MABE simply optimizes a reward function while regularizing against a behavioral prior, it can learn policies with undesired consequences that exploit the reward function. Future work on explainability of RL policies as well as constrained policy optimization could help alleviate these concerns. While we extensively evaluate our method using D4RL benchmark tasks, and also study cross-domain transfer, our experimental evaluation is limited to continuous control tasks. Although continuous control is representative of many applications in robotics, offline RL is a broad and vibrant field with applications involving language jaques2019way ; ZhouSRE17 and visual modalities Agarwal2020AnOP ; Rafailov2020LOMPO ; hafner2019dream . We hope to extend MABE to different offline RL tasks and high-dimensional observation modalities in future work.
Acknowledgments
This work was supported by Berkeley Deep Drive. Part of this work was completed when Aravind Rajeswaran was at the University of Washington, where he was supported through the J.P. Morgan PhD Fellowship in AI (2020-21). The authors thank Kevin Lu and Justin Fu for help with setting up the D4RL benchmark tasks. All content represents the opinion of the authors, which is not necessarily shared or endorsed by their respective employers and/or sponsors.
References
(1) Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: datasets for deep data-driven reinforcement learning. CoRR, abs/2004.07219, 2020.
(2) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
(3) David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354–359, 2017.
(4) Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.
(5) OpenAI et al. Solving rubik’s cube with a robot hand. ArXiv, abs/1910.07113, 2019.
(6) Sascha Lange, Thomas Gabel, and Martin A. Riedmiller. Batch reinforcement learning. In Reinforcement Learning, volume 12. Springer, 2012.
(7) Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. CoRR, abs/2005.01643, 2020.
(8) Dean A Pomerleau. Alvinn: an autonomous land vehicle in a neural network. In NIPS, pages 305–313, 1988.

 (9) P. Abbeel and A. Ng. Apprenticeship learning via inverse reinforcement learning. Proceedings of the twenty-first international conference on Machine learning, 2004.
 (10) Brian D. Ziebart, Andrew L. Maas, J. Bagnell, and A. Dey. Maximum entropy inverse reinforcement learning. In AAAI, 2008.
 (11) Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, Craig Ferguson, Agata Lapedriza, Noah Jones, Shixiang Gu, and Rosalind Picard. Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv preprint arXiv:1907.00456, 2019.
 (12) Li Zhou, Kevin Small, Oleg Rokhlenko, and Charles Elkan. End-to-end offline goal-oriented dialog policy learning via policy gradient. CoRR, abs/1712.02838, 2017.
 (13) Paul Covington, Jay Adams, and Emre Sargin. Deep neural networks for youtube recommendations. In RecSys. ACM, 2016.
 (14) Adith Swaminathan and Thorsten Joachims. Batch learning from logged bandit feedback through counterfactual risk minimization. J. Mach. Learn. Res, 16:1731–1755, 2015.
 (15) Yao Liu, Adith Swaminathan, Alekh Agarwal, and Emma Brunskill. Off-policy policy gradient with state distribution correction. CoRR, abs/1904.08473, 2019.
 (16) Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, and Sergey Levine. Stabilizing off-policy q-learning via bootstrapping error reduction. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 11761–11771, 2019.
 (17) Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative q-learning for offline reinforcement learning. In NeurIPS, 2020.
 (18) Scott Fujimoto, Herke Van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, 2018.
 (19) Rishabh Agarwal, D. Schuurmans, and Mohammad Norouzi. An optimistic perspective on offline reinforcement learning. In ICML, 2020.
 (20) Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, and Thorsten Joachims. MOReL: Model-Based Offline Reinforcement Learning. In NeurIPS, 2020.
 (21) Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Y. Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma. MOPO: model-based offline policy optimization. In NeurIPS, 2020.
 (22) Arthur Argenson and Gabriel Dulac-Arnold. Model-based offline planning. ArXiv, abs/2008.05556, 2020.
 (23) T. Matsushima, H. Furuta, Y. Matsuo, Ofir Nachum, and Shixiang Gu. Deployment-efficient reinforcement learning via model-based offline optimization. ArXiv, abs/2006.03647, 2020.
 (24) Yaniv Ovadia, E. Fertig, J. Ren, Zachary Nado, D. Sculley, S. Nowozin, Joshua V. Dillon, Balaji Lakshminarayanan, and Jasper Snoek. Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift. In NeurIPS, 2019.
 (25) Edmon Begoli, Tanmoy Bhattacharya, and D. Kusnezov. The need for uncertainty quantification in machine-assisted medical decision making. Nature Machine Intelligence, 1:20–23, 2019.
 (26) Heinrich Jiang, Been Kim, and M. Gupta. To trust or not to trust a classifier. In NeurIPS, 2018.
 (27) M. Abdar, Farhad Pourpanah, Sadiq Hussain, D. Rezazadegan, Li Liu, M. Ghavamzadeh, P. Fieguth, Xiaochun Cao, A. Khosravi, U. Acharya, V. Makarenkov, and S. Nahavandi. A review of uncertainty quantification in deep learning: Techniques, applications and challenges. ArXiv, abs/2011.06225, 2020.
 (28) Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "why should i trust you?": Explaining the predictions of any classifier. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016.
 (29) Richard S Sutton et al. Introduction to reinforcement learning, volume 135. 1998.
 (30) Aravind Rajeswaran, Igor Mordatch, and Vikash Kumar. A game theoretic framework for model-based reinforcement learning. In ICML, 2020.
 (31) Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. In NeurIPS, 2019.
 (32) Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. In ICLR, 2020.
 (33) Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering atari, go, chess and shogi by planning with a learned model. arXiv:1911.08265, 2019.
 (34) Jinglin Chen and Nan Jiang. Information-theoretic considerations in batch reinforcement learning. In ICML, 2019.
 (35) Yifan Wu, George Tucker, and Ofir Nachum. Behavior regularized offline reinforcement learning. CoRR, abs/1911.11361, 2019.
 (36) Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. CoRR, abs/1910.00177, 2019.
 (37) Noah Siegel, Jost Tobias Springenberg, Felix Berkenkamp, Abbas Abdolmaleki, Michael Neunert, Thomas Lampe, Roland Hafner, and Martin A. Riedmiller. Keep doing what worked: Behavioral modelling priors for offline reinforcement learning. ArXiv, abs/2002.08396, 2020.
 (38) Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, 2018.
 (39) Richard S Sutton. Dyna, an integrated architecture for learning, planning, and reacting. ACM Sigart Bulletin, 2(4):160–163, 1991.
 (40) Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pages 4759–4770, 2018.
 (41) Tianhe Yu, Aviral Kumar, Rafael Rafailov, Aravind Rajeswaran, Sergey Levine, and Chelsea Finn. Combo: Conservative offline model-based policy optimization, 2021.
 (42) Richard S Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Machine learning proceedings 1990, pages 216–224. Elsevier, 1990.
 (43) Richard S Sutton. Planning by incremental dynamic programming. In Machine Learning Proceedings 1991, pages 353–357. Elsevier, 1991.
 (44) Thanard Kurutach, Ignasi Clavera, Yan Duan, Aviv Tamar, and Pieter Abbeel. Model-ensemble trust-region policy optimization. In ICLR, 2018.
 (45) H. Rosenbrock, D. Jacobson, and D. Mayne. Differential dynamic programming. The Mathematical Gazette, 56:78, 1972.
 (46) Marc Peter Deisenroth and Carl Edward Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011, pages 465–472. Omnipress, 2011.
 (47) N. Heess, Greg Wayne, D. Silver, T. Lillicrap, T. Erez, and Yuval Tassa. Learning continuous control policies by stochastic value gradients. In NIPS, 2015.
 (48) Yuval Tassa, Tom Erez, and Emanuel Todorov. Synthesis and stabilization of complex behaviors through online trajectory optimization. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2012, Vilamoura, Algarve, Portugal, October 7-12, 2012, pages 4906–4913. IEEE, 2012.
 (49) E. Todorov and W. Li. A generalized iterative lqg method for locally-optimal feedback control of constrained nonlinear stochastic systems. Proceedings of the 2005, American Control Conference, 2005., pages 300–306 vol. 1, 2005.
 (50) Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. In International Conference on Machine Learning, 2019.
 (51) Grady Williams, Nolan Wagener, Brian Goldfain, P. Drews, James M. Rehg, Byron Boots, and E. Theodorou. Information theoretic mpc for model-based reinforcement learning. 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 1714–1721, 2017.
 (52) Anusha Nagabandi, K. Konolige, Sergey Levine, and V. Kumar. Deep dynamics models for learning dexterous manipulation. ArXiv, abs/1909.11652, 2019.
 (53) Kendall Lowrey, Aravind Rajeswaran, Sham Kakade, Emanuel Todorov, and Igor Mordatch. Plan Online, Learn Offline: Efficient Learning and Exploration via Model-Based Control. In ICLR, 2019.
 (54) Yao Liu, A. Swaminathan, A. Agarwal, and Emma Brunskill. Provably good batch reinforcement learning without great exploration. ArXiv, abs/2007.08202, 2020.

 (55) Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
 (56) Caglar Gulcehre, Ziyu Wang, Alexander Novikov, T. Paine, Sergio Gomez Colmenarejo, Konrad Zolna, Rishabh Agarwal, J. Merel, Daniel J. Mankowitz, Cosmin Paduraru, Gabriel Dulac-Arnold, J. Li, Mohammad Norouzi, Matthew W. Hoffman, Ofir Nachum, G. Tucker, N. Heess, and N. D. Freitas. Rl unplugged: A suite of benchmarks for offline reinforcement learning. 2020.
 (57) Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, R. Munos, N. Heess, and Martin A. Riedmiller. Maximum a posteriori policy optimisation. ArXiv, abs/1806.06920, 2018.
 (58) Karl Pertsch, Youngwoon Lee, and Joseph J. Lim. Accelerating reinforcement learning with learned skill priors. In Conference on Robot Learning (CoRL), 2020.
 (59) Avi Singh, Huihan Liu, Gaoyue Zhou, Albert Yu, Nicholas Rhinehart, and Sergey Levine. Parrot: Data-driven behavioral priors for reinforcement learning. In ICLR, 2021.
 (60) Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling. arXiv preprint arXiv:2106.01345, 2021.
 (61) Rafael Rafailov, Tianhe Yu, Aravind Rajeswaran, and Chelsea Finn. Offline reinforcement learning from images with latent space models. In L4DC, 2021.
 (62) Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
 (63) Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033, 2012.
 (64) John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel. Trust region policy optimization. In International Conference on Machine Learning, 2015.
Appendix A Environments
In our experiments, we use offline datasets from D4RL [1] for environments from OpenAI gym’s [62] MuJoCo continuous control tasks [63]. We look at three locomotion agents shown in Figure 2: HalfCheetah, Hopper, and Walker2d, which are all tasked with moving forward as fast as possible. For each agent, we look at three types of datasets:

Medium: Approximately 1 million transitions collected from a partially trained SAC agent

Mixed: Approximately 100,000 transitions collected from the entire replay buffer of a SAC agent throughout training

Medium-expert: Approximately 2 million transitions consisting of half medium samples (collected from a partially trained SAC agent) and half expert samples, which are collected from a fully trained SAC agent.
We don’t evaluate on random datasets, which are collected with a random policy, for two reasons. First, the actions in these datasets are completely random, so behavioral priors are not expected to be helpful. Instead, we are more interested in evaluating performance on offline datasets with some, even if minimal, structure. Second, we argue that completely random data is a somewhat contrived benchmark. Datasets used to solve real-world problems in robotics, such as autonomous vehicle navigation, locomotion, and manipulation, are likely to have some sort of structure.
Appendix B Baselines
We compare against several leading model-based and model-free offline RL baselines on the D4RL dataset.

MOPO: MOPO [21] is an uncertainty-based offline MBRL algorithm. MOPO uses MBPO [31], an off-policy Dyna-style RL algorithm where a replay buffer is populated with synthetic samples from a learned dynamics model and used to train a Soft Actor-Critic (SAC) [38] agent. MOPO builds on MBPO by penalizing the reward experienced by an agent with a penalty proportional to the prediction uncertainty of the dynamics model. MABE is also built on top of MBPO, and thus MOPO is the most directly competing baseline.

MOReL: MOReL [20] is also an uncertainty-based offline MBRL algorithm. The primary difference between MOReL and MOPO is that MOReL uses an on-policy algorithm, TRPO [64], as its backbone. Otherwise, MOPO and MOReL are similar: both penalize the reward with a term proportional to the forward model uncertainty. The performance differences between MOPO and MOReL on D4RL are mainly due to the performance of the backbone algorithms, SAC and TRPO respectively. SAC outperforms TRPO on the MuJoCo Cheetah environment while TRPO outperforms SAC on the Hopper environment, and these differences are also evident in the offline RL results for MOPO and MOReL.

CQL: Conservative Q-Learning (CQL) [17] is a leading offline model-free baseline. CQL learns Q-functions so that the expected value of a policy under the learned Q-function is a lower bound on the true policy value. CQL augments the standard Bellman error with a term that minimizes the Q-function under the policy distribution while maximizing it under the offline data distribution. CQL does not leverage behavioral priors.

BRAC-v: BRAC-v is another leading model-free RL algorithm that utilizes behavioral priors to learn a conservative policy. BRAC-v is the model-free algorithm most similar to MABE. Like MABE, BRAC-v learns a behavioral prior by fitting a Gaussian distribution to the offline data and regularizing a Gaussian evaluation policy with respect to the behavioral data. Unlike MABE, BRAC-v does not weight the behavioral prior by the advantage and instead treats all data points equally regardless of the reward achieved.
Additionally, we include comparisons to naive behavior cloning and offline SAC.
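As a minimal sketch of the reward penalty described for MOPO above, the snippet below subtracts an uncertainty term from the model-predicted reward. The function name `penalized_reward`, the choice of uncertainty estimate (the largest norm of the predicted standard deviations across ensemble members), and the numeric values are assumptions made for illustration, not details taken from this appendix.

```python
import math


def penalized_reward(reward, ensemble_stds, penalty_coef):
    """Return reward - penalty_coef * uncertainty.

    ensemble_stds holds one predicted standard-deviation vector per
    ensemble member. The uncertainty used here (the largest Euclidean
    norm among those vectors) is an assumed estimator for illustration.
    """
    uncertainty = max(
        math.sqrt(sum(s * s for s in stds)) for stds in ensemble_stds
    )
    return reward - penalty_coef * uncertainty


# Two hypothetical ensemble members predicting 2-D next-state std vectors.
stds = [[0.3, 0.4], [0.6, 0.8]]
r_tilde = penalized_reward(reward=2.0, ensemble_stds=stds, penalty_coef=1.0)
```

A larger penalty coefficient makes the agent more pessimistic about state-action pairs where the ensemble is uncertain; setting it to zero recovers the unpenalized model reward.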
Appendix C Experiment Details
C.1 Advantage-Weighted Behavioral Prior
First, to learn the advantages for each datapoint in the dataset, we fit a Q-function to the offline dataset. We train until the loss no longer decreases, then use this Q-function to assign Q-values to each datapoint. We normalize these Q-values by dividing each value by the maximum Q-value assigned to any datapoint.
We train our behavioral prior using a negative log likelihood loss. We weight the loss from each datapoint by the exponentiated normalized Q-value obtained from our learned Q-function. During training, we use a 90-10 train-validation split and stop training when the validation loss stops decreasing.
One note is that for halfcheetah medium-expert, we found that a simpler weighting scheme led to better results. Rather than fitting a Q-function, we weighted datapoints by the final total reward of their trajectory. For all other environments, we found that weighting by the Q-function worked better or approximately the same.
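The Q-value weighting described in this subsection can be sketched as follows. This is an illustrative sketch rather than the training code: the helper names are hypothetical, the prior is assumed to be a diagonal Gaussian that outputs a mean and log standard deviation per action dimension, and the maximum Q-value is assumed to be positive so that dividing by it preserves the ordering of datapoints.

```python
import math


def prior_weights(q_values):
    """Normalize Q-values by their maximum, then exponentiate."""
    q_max = max(q_values)  # assumed positive in this sketch
    return [math.exp(q / q_max) for q in q_values]


def gaussian_nll(action, mean, log_std):
    """Negative log likelihood of `action` under a diagonal Gaussian."""
    nll = 0.0
    for a, m, ls in zip(action, mean, log_std):
        var = math.exp(2.0 * ls)
        nll += 0.5 * math.log(2.0 * math.pi) + ls + 0.5 * (a - m) ** 2 / var
    return nll


def weighted_nll_loss(actions, means, log_stds, q_values):
    """Mean of per-datapoint NLL terms, each scaled by its weight."""
    weights = prior_weights(q_values)
    terms = [
        w * gaussian_nll(a, m, ls)
        for w, a, m, ls in zip(weights, actions, means, log_stds)
    ]
    return sum(terms) / len(terms)


# Toy batch of three 1-D actions with made-up Q-values.
q_values = [1.0, 2.0, 4.0]
weights = prior_weights(q_values)
loss = weighted_nll_loss(
    actions=[[0.0], [0.5], [1.0]],
    means=[[0.0], [0.0], [0.0]],
    log_stds=[[0.0], [0.0], [0.0]],
    q_values=q_values,
)
```

Datapoints with higher Q-values contribute more to the loss, so the fitted prior is pulled toward the better actions in the dataset rather than toward all actions equally.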
C.2 Hyperparameters
Because we build on MOPO [21], we use the same values as MOPO for the MOPO-specific hyperparameters, namely the rollout length and the penalty coefficient. We refer the reader to the MOPO appendix for these values. We additionally use the MOPO architecture and training method for our dynamics model ensemble: we train an ensemble of 7 dynamics models and select the 5 best models based on their prediction error to use while training our offline SAC agent.
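A minimal sketch of the model-selection step, assuming each ensemble member has already been assigned a scalar prediction error. The function name `select_elites` and the error values are made up for illustration.

```python
def select_elites(prediction_errors, num_elites=5):
    """Return indices of the `num_elites` models with the lowest error."""
    ranked = sorted(
        range(len(prediction_errors)), key=lambda i: prediction_errors[i]
    )
    return ranked[:num_elites]


# Hypothetical prediction errors for an ensemble of 7 dynamics models.
errors = [0.12, 0.30, 0.08, 0.25, 0.10, 0.40, 0.15]
elite_indices = select_elites(errors)
```

Only the models at `elite_indices` would then be used to generate synthetic rollouts; the two models with the highest error are discarded.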
For our policy network, we use a Gaussian two-head network with 2 hidden layers of 256 hidden units each, followed by two separate linear output layers that output the mean and log standard deviation of the next action. For our Q networks, we use an architecture of 3 feedforward layers of 256 hidden units each. Our behavioral prior has the same architecture as our policy network.
The main hyperparameter of our method is the target KL divergence. For our hyperparameter search, we defaulted to a low target divergence for the medium-expert datasets, and we performed a grid search for the medium and medium-replay environments, because we found that the different agents required different target divergences based on their dataset composition. The full list of target divergences used can be found in Table 3.
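Table 3: Target divergence used for each dataset type and environment.

| Dataset Type  | Environment | Target Divergence |
|---------------|-------------|-------------------|
| medium        | halfcheetah | 100               |
| medium        | hopper      | 0.75              |
| medium        | walker2d    | 1                 |
| mixed         | halfcheetah | 40                |
| mixed         | hopper      | 5                 |
| mixed         | walker2d    | 20                |
| medium-replay | halfcheetah | 0.1               |
| medium-replay | hopper      | 0.1               |
| medium-replay | walker2d    | 0.1               |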
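Since both the policy and the behavioral prior are Gaussian, the divergence that is compared against the target can be computed in closed form. The sketch below assumes diagonal Gaussians parameterized by mean and log standard deviation and computes KL(policy || prior); the direction of the divergence and the function name `diag_gaussian_kl` are assumptions, not details stated in this appendix.

```python
import math


def diag_gaussian_kl(mean_p, log_std_p, mean_q, log_std_q):
    """KL(p || q) between two diagonal Gaussians, summed over dimensions."""
    kl = 0.0
    for mp, lsp, mq, lsq in zip(mean_p, log_std_p, mean_q, log_std_q):
        var_p = math.exp(2.0 * lsp)
        var_q = math.exp(2.0 * lsq)
        # Per-dimension closed form:
        # log(sigma_q / sigma_p) + (var_p + (mu_p - mu_q)^2) / (2 var_q) - 1/2
        kl += lsq - lsp + (var_p + (mp - mq) ** 2) / (2.0 * var_q) - 0.5
    return kl


# A policy that matches the prior has zero divergence.
kl_same = diag_gaussian_kl([0.0, 1.0], [0.0, 0.0], [0.0, 1.0], [0.0, 0.0])
# Shifting one mean by 1 with unit variance gives a divergence of 0.5.
kl_shifted = diag_gaussian_kl([1.0], [0.0], [0.0], [0.0])
```

Under this reading, a small target such as 0.1 keeps the policy close to the prior, while a large target such as 100 leaves it nearly unconstrained.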
Appendix D Compute Resources and Assets Used
Compute Resources
Experiments for our main suite of results were run on a machine with eight Quadro RTX 6000 GPUs. Since a single GPU can run four experiments concurrently, our main experiments used approximately 1080 GPU hours (including all seeds).
Assets Used
In this work we used the D4RL Offline RL Benchmark [1] for evaluation, which is distributed under the Apache License 2.0. We build our code on logic from MOPO [21], which is distributed under an MIT License. We built our final codebase on top of a PyTorch replication codebase of MBPO, into which we ported the MOPO logic. Additionally, we train our dynamics models in the official MOPO codebase for fair comparison against MOPO. For our baselines, we report results for MOPO, BC, SAC, and BRAC-v from [21], MOReL from [20], and CQL from [17].