1 Introduction
Reinforcement learning (RL) algorithms have recently demonstrated impressive success in learning behaviors for a variety of sequential decision-making tasks (Barth-Maron et al., 2018; Hessel et al., 2018; Nachum et al., 2019). Virtually all of these demonstrations have relied on highly frequent online access to the environment, with the RL algorithms often interleaving each policy update with additional experience collected by that policy acting in the environment. However, in many real-world applications of RL, such as health (Murphy et al., 2001), education (Mandel et al., 2014), dialog agents (Jaques et al., 2019), and robotics (Gu et al., 2017a; Kalashnikov et al., 2018), the deployment of a new data-collection policy may be associated with a number of costs and risks. If we can learn tasks with a small number of data-collection policies, we can substantially reduce these costs and risks.
Based on this idea, we propose a novel measure of RL algorithm performance, namely deployment efficiency, which counts the number of changes in the data-collection policy during learning, as illustrated in Figure 1. This concept stands in contrast to sample efficiency or data efficiency (Precup et al., 2001; Degris et al., 2012; Gu et al., 2017b; Haarnoja et al., 2018; Lillicrap et al., 2016; Nachum et al., 2018), which measures the number of environment interactions incurred during training, without regard to how many distinct policies were deployed to perform those interactions. Even when data efficiency is high, deployment efficiency can be low, since many on-policy and off-policy algorithms alternate data collection with each policy update (Schulman et al., 2015; Lillicrap et al., 2016; Gu et al., 2016; Haarnoja et al., 2018). This dependence on high-frequency policy deployments is best illustrated by recent work in offline RL (Fujimoto et al., 2019; Jaques et al., 2019; Kumar et al., 2019; Levine et al., 2020; Wu et al., 2019), where baseline off-policy algorithms exhibited poor performance when trained on a static dataset. These offline RL works, however, limit their study to a single deployment, which is enough to achieve high performance with data collected from a suboptimal behavior policy, but often not with data from a random policy. In contrast to those prior works, we aim to learn successful policies from scratch with minimal amounts of data and deployments.
Many existing model-free offline RL algorithms (Levine et al., 2020) are tuned and evaluated on large datasets (e.g., one million transitions). To develop an algorithm that is both sample-efficient and deployment-efficient, each iteration of the algorithm between successive deployments has to work effectively on much smaller datasets. We believe model-based RL is better suited to this setting because it has demonstrated higher sample efficiency than model-free RL (Kurutach et al., 2018; Nagabandi et al., 2018). Although combining model-based RL with offline or limited-deployment settings seems straightforward, we find that this naïve approach leads to poor performance. This problem can be attributed to extrapolation errors (Fujimoto et al., 2019) similar to those observed in model-free methods. Specifically, the learned policy may choose sequences of actions that lead it to regions of the state space where the dynamics model cannot predict accurately, due to poor coverage by the dataset. This can lead the policy to exploit approximation errors of the dynamics model, which can be disastrous for learning. In model-free settings, similar distribution-shift problems are typically remedied by explicitly regularizing policy updates with a divergence from the observed data distribution (Jaques et al., 2019; Kumar et al., 2019; Wu et al., 2019), which, however, can overly limit the policy's expressivity (Sohn et al., 2020).
To better address these problems in limited-deployment settings, we propose Behavior-Regularized Model-ENsemble (BREMEN), which learns an ensemble of dynamics models in conjunction with a policy using imaginary rollouts, while implicitly regularizing the learned policy via appropriate parameter initialization and conservative trust-region learning updates. We evaluate BREMEN on high-dimensional continuous control benchmarks and find that it achieves impressive deployment efficiency. BREMEN is able to learn successful policies with only 5-10 deployments, significantly outperforming existing off-policy and offline RL algorithms in this deployment-constrained setting. We further evaluate BREMEN on standard offline RL benchmarks, where only a single static dataset is used. In this fixed-batch setting, our experiments show that BREMEN can not only achieve performance competitive with the state of the art when using standard dataset sizes but also learn with 10-20 times smaller datasets, which previous methods are unable to do.
2 Preliminaries
We consider a Markov Decision Process (MDP) setting, characterized by the tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, p, r, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $p(s' \mid s, a)$ is the transition probability distribution or dynamics, $r(s)$ is the reward function, and $\gamma \in (0, 1)$ is the discount factor. A policy $\pi$ is a function that determines the agent's behavior, mapping from states to probability distributions over actions. The goal is to obtain the optimal policy $\pi^{*} = \arg\max_{\pi} \eta[\pi]$, where $\eta[\pi]$ is the expectation of the discounted sum of rewards under the policy $\pi$. The transition probability $p(s' \mid s, a)$ is usually unknown, and in model-based RL it is estimated with a parameterized dynamics model $\hat{f}_{\phi}$ (e.g., a neural network). For simplicity, we assume that the reward function $r(s)$ is known and can be computed for any arbitrary state, but the approach extends easily to the unknown-reward setting by predicting the reward with a parameterized function.
On-policy vs Off-policy, Online vs Offline
At a high level, most RL algorithms iterate many times between collecting a batch of transitions (deployments) and optimizing the policy (learning). If the algorithms discard data after each policy update, they are on-policy (Schulman et al., 2015, 2017), while if they accumulate data in a buffer $\mathcal{D}$, i.e., experience replay (Lin, 1992), they are off-policy (Mnih et al., 2015; Lillicrap et al., 2016; Gu et al., 2016, 2017b; Haarnoja et al., 2018; Fujimoto et al., 2019), because not all the data in the buffer comes from the current policy. However, we consider all of these algorithms to be online RL algorithms, since they involve many deployments during learning, ranging from hundreds to millions. In pure offline RL, on the other hand, one does not assume direct interaction and learns a policy from only a fixed dataset, which effectively corresponds to a single deployment allowed for learning. Classically, semi-batch RL algorithms (Lange et al., 2012; Singh et al., 1995) interpolated between these two extremes, improving the policy through repetitions of collecting a large batch of transitions and performing many or full policy updates. While these semi-batch RL algorithms also achieve good deployment efficiency, they have not been extensively studied with neural network function approximators or in off-policy settings with experience replay for scalable sample-efficient learning. In our work, we aim for both high deployment efficiency and high sample efficiency by developing an algorithm that can solve tasks with minimal policy deployments as well as minimal transition samples.
3 Deployment Efficiency
Deploying a new policy for data collection can be associated with a number of costs and risks in many real-world applications such as medicine or robotic control (Murphy et al., 2001; Mandel et al., 2014; Gu et al., 2017a; Kalashnikov et al., 2018; Nachum et al., 2019). While there is an abundance of work on safety for RL (Chow et al., 2015; Eysenbach et al., 2018; Chow et al., 2018; Ray et al., 2019; Chow et al., 2019), these methods often do not provide guarantees in practice when combined with neural networks and stochastic optimization. It is therefore necessary to validate each policy before deployment. Because of the cost associated with each deployment, it is desirable to minimize the number of distinct deployments needed during the learning process.
To focus research on these practical bottlenecks, we propose a novel measure of RL algorithms, namely deployment efficiency, which counts how many times the data-collection policy is changed while improving from a random policy to one that solves the task. For example, if an RL algorithm operates by using its learned policy to collect transitions from the environment $I$ times, each time collecting a batch of $B$ new transitions, then the number of deployments is $I$, while the total number of samples collected is $I \times B$. The lower $I$ is, the more deployment-efficient the algorithm is; in contrast, sample efficiency looks at $I \times B$. Online RL algorithms, whether on-policy or off-policy, typically update the policy and acquire new transitions by deploying the newly updated policy at every iteration. This corresponds to performing hundreds to millions of deployments during learning on standard benchmarks (Haarnoja et al., 2018), which is severely deployment-inefficient. On the other hand, the offline RL literature studies only the case of a single deployment. A deployment-efficient algorithm would sit between these two extremes and ideally learn a successful policy from scratch while deploying only a few distinct policies, as illustrated in Figure 1.
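As a small illustration of the distinction between the two measures, the sketch below tallies deployments and total samples for three training schedules. The `Schedule` class and the specific numbers are hypothetical and chosen only so that all three schedules consume the same sample budget; they are not figures from the experiments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Schedule:
    """A hypothetical training schedule (illustrative only)."""
    name: str
    deployments: int        # number of distinct data-collection policies
    batch_per_deploy: int   # transitions collected by each deployed policy

    @property
    def total_samples(self) -> int:
        # Sample efficiency looks at deployments * batch size.
        return self.deployments * self.batch_per_deploy


# Three schedules that all consume one million transitions.
schedules = [
    Schedule("online, one transition per deployment", 1_000_000, 1),
    Schedule("limited deployment", 5, 200_000),
    Schedule("pure offline", 1, 1_000_000),
]

# Same sample budget, very different deployment counts.
ranked = sorted(schedules, key=lambda s: s.deployments)
for s in ranked:
    print(f"{s.name}: deployments={s.deployments}, samples={s.total_samples}")
```

Ranking by `deployments` rather than `total_samples` is what separates these schedules: by sample count alone they are indistinguishable.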
Recent deep RL literature seldom emphasizes deployment efficiency, with few exceptions in specific applications (Kalashnikov et al., 2018) where such a learning procedure is necessary. Although current state-of-the-art algorithms for continuous control have substantially improved sample or data efficiency, they have not been optimized for deployment efficiency. For example, SAC (Haarnoja et al., 2018), an efficient model-free off-policy algorithm, performs half a million to one million policy deployments during learning on MuJoCo (Todorov et al., 2012) benchmarks. ME-TRPO (Kurutach et al., 2018), a model-based algorithm, performs a much lower 100-300 policy deployments, although this is still relatively high for practical settings. (We determined these numbers of deployments by checking the original implementations; note that the frequency of data collection is a tunable hyperparameter.) In our work, we demonstrate successful learning on standard benchmark environments with only 5-10 deployments.
4 Behavior-Regularized Model-Ensemble
To achieve high deployment efficiency, we propose Behavior-Regularized Model-ENsemble (BREMEN). BREMEN incorporates Dyna-style (Sutton, 1991) model-based RL, learning an ensemble of dynamics models in conjunction with a policy using imaginary rollouts from the ensemble, together with behavior regularization via conservative trust-region updates.
4.1 Imaginary Rollout from Model Ensemble
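As in recent Dyna-style model-based RL methods (Kurutach et al., 2018; Wang et al., 2019), BREMEN uses an ensemble of $K$ deterministic dynamics models $\hat{f}_{\phi} = \{\hat{f}_{\phi_1}, \dots, \hat{f}_{\phi_K}\}$ to alleviate the problem of model bias. Each model $\hat{f}_{\phi_i}$ is parameterized by $\phi_i$ and trained by the following objective, which minimizes the mean squared error between the predicted next state $\hat{f}_{\phi_i}(s_t, a_t)$ and the true next state $s_{t+1}$ over a dataset $\mathcal{D}$:
$$\min_{\phi_i} \frac{1}{|\mathcal{D}|} \sum_{(s_t, a_t, s_{t+1}) \in \mathcal{D}} \frac{1}{2} \left\| s_{t+1} - \hat{f}_{\phi_i}(s_t, a_t) \right\|_2^2 \tag{1}$$
During training of a policy $\pi_{\theta}$, imagined trajectories of states and actions are generated sequentially, using a dynamics model $\hat{f}_{\phi_i}$ that is randomly selected at each time step:
$$a_t \sim \pi_{\theta}(\cdot \mid \hat{s}_t), \qquad \hat{s}_{t+1} = \hat{f}_{\phi_i}(\hat{s}_t, a_t), \qquad i \sim \{1, \dots, K\} \tag{2}$$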
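The two steps above, fitting each ensemble member by squared next-state error and then rolling out with a randomly chosen member per step, can be sketched in a few lines. This is a toy illustration under stated assumptions, not the implementation used in the experiments: states and actions are scalars, each model has the hypothetical form s' = s + w * a with a closed-form least-squares fit, ensemble diversity comes from bootstrap resampling (an assumption), and `fit_model`, `imaginary_rollout`, and `policy` are illustrative names.

```python
import random


def fit_model(data):
    """Least-squares fit of w in s' = s + w * a, i.e. the minimizer of the
    mean squared next-state prediction error for this toy model class."""
    num = sum(a * (s_next - s) for s, a, s_next in data)
    den = sum(a * a for _, a, _ in data)
    w = num / den
    return lambda s, a: s + w * a


def imaginary_rollout(models, policy, s0, horizon, rng):
    """Roll the policy forward in imagination, drawing a model per step."""
    states, actions = [s0], []
    s = s0
    for _ in range(horizon):
        a = policy(s, rng)
        model = rng.choice(models)  # uniformly random ensemble member
        s = model(s, a)
        actions.append(a)
        states.append(s)
    return states, actions


rng = random.Random(0)

# Toy "real" data: s' = s + 0.5 * a + small noise (unknown to the learner).
dataset, s = [], 0.0
for _ in range(200):
    a = rng.uniform(-1.0, 1.0)
    s_next = s + 0.5 * a + rng.gauss(0.0, 0.05)
    dataset.append((s, a, s_next))
    s = s_next

# Five ensemble members, each fit on a bootstrap resample (an assumption).
models = [fit_model([rng.choice(dataset) for _ in dataset]) for _ in range(5)]


def policy(s, rng):
    """Hypothetical Gaussian policy: mean -0.5 * s, standard deviation 0.1."""
    return rng.gauss(-0.5 * s, 0.1)


states, actions = imaginary_rollout(models, policy, s0=1.0, horizon=10, rng=rng)
print(len(states), len(actions))
```

Drawing a fresh model at every step, rather than once per trajectory, means a single imagined trajectory mixes the predictions of several members, which makes it harder for the policy to exploit the errors of any one model.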
4.2 Policy Update with Behavior Regularization
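To manage the discrepancy between the true dynamics and the learned model caused by distribution shift in batch settings, we propose to use iterative policy updates with a trust-region constraint, re-initialized with a behavior-cloned policy after every deployment. Specifically, after each deployment, we are given an updated dataset of experience transitions $\mathcal{D}$. With this dataset, we approximate the true behavior policy $\pi_b$ through behavior cloning (BC), utilizing a neural network $\hat{\pi}_{\beta}$ parameterized by $\beta$, where we implicitly assume a fixed variance, a common practice in BC (Rajeswaran et al., 2017):
$$\min_{\beta} \frac{1}{|\mathcal{D}|} \sum_{(s_t, a_t) \in \mathcal{D}} \frac{1}{2} \left\| a_t - \hat{\pi}_{\beta}(s_t) \right\|_2^2 \tag{3}$$
After obtaining the estimated behavior policy, we initialize the target policy $\pi_{\theta}$ as a Gaussian policy with mean from $\hat{\pi}_{\beta}$ and standard deviation of $1$. This BC initialization, in conjunction with gradient-descent-based optimization, may be seen as implicitly biasing the optimized $\pi_{\theta}$ to be close to the data-collection policy (Nagarajan and Kolter, 2019), and thus works as a remedy for the distribution shift problem (Ross et al., 2011). To further bias the learned policy to be close to the data-collection policy, we opt to use KL-based trust-region optimization (Schulman et al., 2015). Therefore, the optimization of BREMEN becomes
$$\theta_{k+1} = \arg\max_{\theta} \; \mathbb{E}_{s, a \sim \pi_{\theta_k}, \hat{f}_{\phi_i}} \left[ \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_k}(a \mid s)} A^{\pi_{\theta_k}}(s, a) \right] \quad \text{s.t.} \quad \mathbb{E}_{s \sim \pi_{\theta_k}, \hat{f}_{\phi_i}} \left[ D_{KL}\big(\pi_{\theta}(\cdot \mid s) \,\|\, \pi_{\theta_k}(\cdot \mid s)\big) \right] \le \delta, \quad \pi_{\theta_0} = \mathcal{N}(\hat{\pi}_{\beta}, 1) \tag{4}$$
where $A^{\pi_{\theta_k}}$ is the advantage of $\pi_{\theta_k}$ computed using model-based rollouts in the learned dynamics model and $\delta$ is the maximum step size.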
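To make the interplay of BC initialization and the KL constraint concrete, the following toy sketch fits a one-parameter policy mean by least squares and then takes a single step that is shrunk until the average KL to the previous policy is within the step size. It is a deliberately simplified stand-in: actual trust-region policy optimization uses a natural-gradient direction computed from advantage estimates, whereas here the `direction` argument is an arbitrary hypothetical value, the policy is a scalar Gaussian with unit standard deviation, and all function names are illustrative.

```python
import random


def bc_fit(data):
    """Least-squares slope k for the policy mean a = k * s.
    With a fixed variance, this is the maximum-likelihood BC solution."""
    return sum(s * a for s, a in data) / sum(s * s for s, _ in data)


def mean_kl(k_new, k_old, states, std=1.0):
    """Average KL(N(k_new*s, std^2) || N(k_old*s, std^2)) over states."""
    sq = sum((k_new * s - k_old * s) ** 2 for s in states)
    return sq / (2.0 * std ** 2 * len(states))


def trust_region_step(k_old, direction, states, delta, max_backtracks=30):
    """Backtracking line search: halve the step until mean KL <= delta."""
    step = 1.0
    for _ in range(max_backtracks):
        k_new = k_old + step * direction
        if mean_kl(k_new, k_old, states) <= delta:
            return k_new
        step *= 0.5
    return k_old  # no acceptable step found; keep the old policy


rng = random.Random(0)
# Hypothetical behavior policy: a = -0.8 * s plus Gaussian noise.
data = []
for _ in range(200):
    s = rng.uniform(-1.0, 1.0)
    data.append((s, -0.8 * s + rng.gauss(0.0, 0.1)))
states = [s for s, _ in data]

k0 = bc_fit(data)  # BC initialization of the policy mean
k1 = trust_region_step(k0, direction=1.0, states=states, delta=0.01)
print(k0, k1, mean_kl(k1, k0, states))
```

Because every step is bounded in KL and the starting point is the cloned policy, a finite number of such steps can only move the policy a bounded distance from the data-collection policy, which is the sense in which the regularization is implicit.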
The combination of BC for initialization and finite iterative trust-region updates serves as an implicit KL regularization, as discussed in Section 4.3. This is in contrast to many previous offline RL algorithms that augment the value function with a penalty based on an explicit KL divergence (Siegel et al., 2020; Wu et al., 2019) or maximum mean discrepancy (Kumar et al., 2019). Empirically, we found that our regularization technique outperforms the explicit KL penalty (see Section 5.3).
By recursively performing this offline procedure, BREMEN can be used for deployment-efficient learning as shown in Algorithm 1: starting from a randomly initialized policy, it alternates between collecting experience data and performing offline policy updates.
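The recursive procedure can be outlined as a short loop. The sketch below is a structural illustration only, not a transcription of Algorithm 1: the four callables are hypothetical placeholders, and two details are assumptions made here, namely that dynamics models are fit on all data gathered so far while behavior cloning uses the most recent batch. The toy stand-ins at the bottom exist solely so the loop runs end to end.

```python
def deployment_efficient_loop(collect, fit_models, behavior_clone, improve,
                              init_policy, num_deployments, batch_size):
    """Alternate between one deployment and a purely offline update."""
    all_data = []
    policy = init_policy
    for _ in range(num_deployments):
        batch = collect(policy, batch_size)  # the only environment access
        all_data.extend(batch)
        models = fit_models(all_data)        # learn the dynamics ensemble
        policy = behavior_clone(batch)       # re-initialize from BC
        policy = improve(policy, models)     # offline trust-region updates
    return policy, all_data


# Toy stand-ins: a "policy" is just an integer version counter.
calls = {"deploy": 0}


def collect(policy, n):
    calls["deploy"] += 1
    return [(policy, i) for i in range(n)]


def fit_models(data):
    return len(data)  # placeholder for a trained ensemble


def behavior_clone(batch):
    return batch[0][0]  # recovers the version that collected the batch


def improve(policy, models):
    return policy + 1  # placeholder for policy improvement


final_policy, data = deployment_efficient_loop(
    collect, fit_models, behavior_clone, improve,
    init_policy=0, num_deployments=5, batch_size=4)
print(final_policy, len(data), calls["deploy"])
```

The point of the structure is that `collect` is called exactly once per outer iteration, so the number of deployments equals `num_deployments` regardless of how much offline computation `improve` performs.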
4.3 Implicit KL Control from a Mathematical Perspective
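Intuitively, behavior cloning initialization combined with trust-region updates acts as a regularizer against distributional shift, and this intuition can be supported by theory. Following the notation of Janner et al. (2019), we denote the generalization error of a dynamics model on the state distribution under the true behavior policy as $\epsilon_m = \max_t \mathbb{E}_{s \sim d_t^{\pi_b}} D_{TV}\big(p(s_{t+1} \mid s_t, a_t) \,\|\, p_{\phi}(s_{t+1} \mid s_t, a_t)\big)$, where $D_{TV}(p \,\|\, p_{\phi})$ represents the total variation distance between the true dynamics $p$ and the learned model $p_{\phi}$. We also denote the distribution shift on the target policy as $\max_s D_{TV}(\pi_b \,\|\, \pi_{\theta}) \le \epsilon_{\pi}$. A bound relating the true returns $\eta[\pi_{\theta}]$ and the model returns $\hat{\eta}[\pi_{\theta}]$ on the target policy is given in Janner et al. (2019) as
$$\eta[\pi_{\theta}] \ge \hat{\eta}[\pi_{\theta}] - \left[ \frac{2 \gamma r_{\max} (\epsilon_m + 2 \epsilon_{\pi})}{(1 - \gamma)^2} + \frac{4 r_{\max} \epsilon_{\pi}}{1 - \gamma} \right] \tag{5}$$
This bound guarantees improvement under the true returns as long as the model returns improve by more than the slack in the bound due to $\epsilon_m$ and $\epsilon_{\pi}$ (Janner et al., 2019; Levine et al., 2020).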
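To get a feel for how the slack scales, the helper below evaluates it numerically, assuming the bound has the form stated in Janner et al. (2019): the true return is at least the model return minus 2γ r_max (ε_m + 2ε_π) / (1 − γ)² + 4 r_max ε_π / (1 − γ). The function name and the example values are hypothetical and are not taken from the experiments.

```python
def return_bound_slack(r_max, gamma, eps_m, eps_pi):
    """Slack between model returns and the guaranteed true returns.

    r_max  : bound on the reward magnitude
    gamma  : discount factor, 0 <= gamma < 1
    eps_m  : model generalization error (total variation)
    eps_pi : policy distribution shift (total variation)
    """
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    model_and_policy_term = (
        2.0 * gamma * r_max * (eps_m + 2.0 * eps_pi) / (1.0 - gamma) ** 2
    )
    policy_term = 4.0 * r_max * eps_pi / (1.0 - gamma)
    return model_and_policy_term + policy_term


# Hypothetical values: the slack grows quickly as gamma approaches 1.
for gamma in (0.9, 0.99):
    slack = return_bound_slack(r_max=1.0, gamma=gamma, eps_m=0.1, eps_pi=0.05)
    print(f"gamma={gamma}: slack={slack:.1f}")
```

The policy shift enters both terms while the model error enters only the first, so for a fixed model error, keeping the policy close to the behavior policy is the more direct way to tighten the bound.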
We may relate this bound to the specific learning procedure employed by BREMEN, which includes dynamics model learning, behavior cloning policy initialization, and conservative KL-based trust-region policy updates. To do so, we consider an idealized version of BREMEN, where the expectations over states in equations 1, 3, and 4 are replaced with supremums and the dynamics model is set to have unit variance.
Proposition 1 (Policy and model error bound).
Suppose we apply the idealized BREMEN on a dataset $\mathcal{D}$, and define in terms of the behavior cloning and dynamics model losses as,
where $\mathcal{H}$ denotes the Shannon entropy. If one then applies $T$ KL-based trust-region steps of step size $\delta$ (equation 4) using stochastic dynamics models with mean $\hat{f}_{\phi}$ and standard deviation 1, then
Proof.
See Appendix A. ∎
5 Experiments
We evaluate BREMEN in both deployment-efficient settings, where the algorithm must learn a policy from scratch via a limited number of deployments, and offline RL settings, where the algorithm is given only a single static dataset. We use four standard continuous control benchmarks for offline RL (Kumar et al., 2019; Wu et al., 2019), namely Ant, HalfCheetah, Hopper, and Walker2d on the MuJoCo physics simulator (Todorov et al., 2012). See Appendices B and C for further details and results.
5.1 Evaluating Deployment Efficiency
We compare BREMEN to ME-TRPO, SAC, BCQ, and BRAC applied to limited-deployment settings. To adapt the offline methods (BCQ, BRAC) to this setting, we simply apply them recursively: at each deployment iteration, we collect a batch of data with the most recent policy and then run the offline update with this dataset. (Recursive BCQ and BRAC also perform behavior-cloning-based policy initialization after each deployment.) As for SAC, we simply change the replay buffer to update only at specific deployment intervals. For a fair comparison, we align the number of deployments and the amount of data collected at each deployment (either 100,000 or 200,000 transitions) across all methods.
Figure 2 shows the results with 200,000 (top) and 100,000 (bottom) batched transitions per deployment. Regardless of the environment and the batch size per deployment, BREMEN achieves remarkable performance, while existing online and offline RL methods struggle to make any progress in the limited-deployment settings. As a point of comparison, we also include results for online SAC and ME-TRPO without limits on the number of deployments but using the same number of transitions.
5.2 Evaluating Offline Learning
We also evaluate BREMEN on standard offline RL benchmarks following Wu et al. (2019). We first train online SAC to a certain cumulative reward threshold (4,000 in HalfCheetah; 1,000 in Ant, Hopper, and Walker2d) and collect offline datasets. We evaluate agents on an offline dataset of one million (1M) transitions, which is standard for BCQ and BRAC (Wu et al., 2019). We then evaluate them on much smaller datasets of 50k and 100k transitions, 5-10% of the size used in prior works.
Table 1 shows that BREMEN achieves performance competitive with state-of-the-art model-free offline RL algorithms when using the standard dataset size of 1M. Moreover, BREMEN can also learn appropriately from 10-20 times smaller datasets, where BCQ and BRAC are unable to exceed even the BC baseline. As a result, our recursive BREMEN algorithm is not only deployment-efficient but also sample-efficient, and significantly outperforms the baselines.



1,000,000 (1M) transitions  
Method  Ant  HalfCheetah  Hopper  Walker2d 
Dataset  1191  4126  1128  1376 
BC  1321141  428112  1341161  1421147 
BCQ Fujimoto et al. (2019)  202131  5783272  1130127  2153753 
BRAC Wu et al. (2019)  2072285  7192115  142290  22391124 
BRAC (max Q)  2369234  732091  1916343  24091210 
BREMEN (Ours)  3328275  8055103  2058852  2346230 
ME-TRPO (offline) Kurutach et al. (2018)  1258550  1804924  51891  211154 


100,000 (100K) transitions  
Method  Ant  HalfCheetah  Hopper  Walker2d 
Dataset  1191  4066  1128  1376 
BC  133081  426621  1322109  142647 
BCQ  1363199  3915411  1129238  2187196 
BRAC  157383  25052501  131070  21621109 
BRAC (max Q)  226387  23322422  1422101  21641114 
BREMEN (Ours)  1633127  6095370  2191455  2132301 
ME-TRPO (offline)  9744  2434  307170  1061 


50,000 (50K) transitions  
Method  Ant  HalfCheetah  Hopper  Walker2d 
Dataset  1191  4138  1128  1376 
BC  127065  423049  124961  1420194 
BCQ  132995  1319626  1178235  1841439 
BRAC  878244  59773  1277102  9761207 
BRAC (max Q)  843279  59056  1276225  9031137 
BREMEN (Ours)  1347283  5823146  1632796  2280647 
ME-TRPO (offline)  93832  7395  15213  176343 

5.3 Evaluating Effectiveness of Implicit KL Control
In this section, we present an experiment to better understand the effect of BREMEN's implicit regularization. Figure 3 shows the KL divergence of learned policies from the last deployed policy. We compare BREMEN to variants of BREMEN that use an explicit KL penalty on the value function instead of BC initialization (conservative KL trust-region updates are still used). We find that the variants with an explicit KL penalty but without behavior initialization learn policies that move farther away from the last deployed policy than behavior-initialized policies do. This suggests that the implicit behavior regularization employed by BREMEN is more effective as a conservative policy learning protocol.
6 Related Work
Deployment Efficiency and Offline RL
Although we are not aware of any previous work that explicitly proposed the concept of deployment efficiency, its necessity in many real-world applications has been generally known. One may consider previously proposed semi-batch RL algorithms (Ernst et al., 2005; Lange et al., 2012; Singh et al., 1994) as approaching this issue. More recently, a related but distinct problem known as offline RL has gained popularity (Levine et al., 2020; Wu et al., 2019). These offline RL works consider the extreme case of a single deployment, and typically collect the static batch with a partially trained policy rather than a random policy. While offline RL has shown promising results for a variety of real-world applications, such as robotics (Mandlekar et al., 2019), dialogue systems (Jaques et al., 2019), and medical treatments (Gottesman et al., 2018), these algorithms struggle when learning a policy from scratch or when the dataset is small. Nevertheless, two common themes of many offline RL algorithms served as inspiration for the proposed BREMEN algorithm: regularizing the learned policy toward the behavior policy (Fujimoto et al., 2019; Jaques et al., 2019; Kumar et al., 2019; Siegel et al., 2020; Wu et al., 2019) and utilizing ensembles to handle uncertainty (Kumar et al., 2019; Wu et al., 2019). A major difference between BREMEN and prior works is that the target policy is not explicitly forced to stay close to the estimated behavior policy during the policy update. Rather, BREMEN employs a more implicit regularization by initializing the learned policy with a behavior-cloned policy and then applying conservative trust-region updates. Another major difference is the application of model-based approaches to fully offline settings, which has not been extensively studied in prior works (Levine et al., 2020), except for the two concurrent works of Kidambi et al. (2020) and Yu et al. (2020), which study pessimistic or uncertainty-penalized MDPs with guarantees and are closely related to Liu et al. (2019). By contrast, our work shows that a simple technique can already enable model-based offline algorithms to significantly outperform prior model-free methods, and is, to the best of our knowledge, the first to define and extensively evaluate deployment efficiency with recursive experiments.
Model-Based RL
There are many types of model-based RL algorithms (Sutton, 1991; Deisenroth and Rasmussen, 2011; Heess et al., 2015). A simple algorithmic choice is Dyna-style (Sutton, 1991), which uses a parameterized dynamics model to estimate the true MDP transition function, stochastically mapping states and actions to next states. The dynamics model can then serve as a simulator of the environment during policy updates. Dyna-style algorithms often suffer from distributional shift, also known as model bias, which leads RL agents to exploit regions where the data is insufficient and causes significant performance degradation. A variety of remedies have been proposed to relieve the problem of model bias, such as the use of multiple dynamics models as an ensemble (Chua et al., 2018; Kurutach et al., 2018; Janner et al., 2019), meta-learning (Clavera et al., 2018), an energy-based model regularizer (Boney et al., 2019), and an explicit reward penalty for unknown states (Kidambi et al., 2020; Yu et al., 2020). Notably, we employ a subset of these remedies for BREMEN, namely model ensembles and trust-region updates (Kurutach et al., 2018). Compared to existing works, our work is notable for using BC initialization in conjunction with trust-region updates to alleviate the distribution shift of the learned policy from the dataset used to train the dynamics model.
7 Conclusion
In this work, we introduced deployment efficiency, a novel measure of RL performance that counts the number of changes in the data-collection policy during learning. To enhance deployment efficiency, we proposed Behavior-Regularized Model-ENsemble (BREMEN), a novel model-based offline algorithm with implicit KL regularization via appropriate policy initialization and trust-region updates. BREMEN shows impressive results in limited-deployment settings, obtaining successful policies from scratch in only 5-10 deployments, as it can improve policies offline even when the batch size is 10-20 times smaller than in prior works. Not only can this help alleviate costs and risks in real-world applications, but it can also reduce the amount of communication required during distributed learning and could form the basis for communication-efficient large-scale RL, in contrast to prior works (Nair et al., 2015; Espeholt et al., 2018, 2019). Most critically, we show that under deployment efficiency constraints, most prior algorithms, whether model-free or model-based, online or offline, fail to achieve successful learning. We hope our work encourages the research community to value deployment efficiency as an important criterion for RL algorithms, and to eventually achieve sample efficiency and asymptotic performance similar to state-of-the-art algorithms such as SAC (Haarnoja et al., 2018) while retaining the deployment efficiency needed for safe and practical real-world reinforcement learning.
References
 Distributed distributional deterministic policy gradients. In International Conference on Learning Representations, Cited by: §1.
 Regularizing modelbased planning with energybased models. In Conference on Robot Learning, Cited by: §6.
 A lyapunovbased approach to safe reinforcement learning. In Advances in neural information processing systems, Cited by: §3.
 Lyapunovbased safe policy optimization for continuous control. arXiv preprint arXiv:1901.10031. Cited by: §3.
 Risksensitive and robust decisionmaking: a cvar optimization approach. In Advances in Neural Information Processing Systems, Cited by: §3.
 Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, Cited by: §6.
 Modelbased reinforcement learning via metapolicy optimization. In Conference on Robot Learning, Cited by: §6.
 Offpolicy actorcritic. arXiv preprint arXiv:1205.4839. Cited by: §1.

PILCO: a modelbased and dataefficient approach to policy search.
In
International Conference on Machine Learning
, Cited by: §6.  Treebased batch mode reinforcement learning. Journal of Machine Learning Research. Cited by: §6.
 SEED RL: scalable and efficient deeprl with accelerated central inference. arXiv preprint arXiv:1910.06591. Cited by: §7.
 IMPALA: scalable distributed deeprl with importance weighted actorlearner architectures. In International Conference on Machine Learning, Cited by: §7.
 Leave no trace: learning to reset for safe and autonomous reinforcement learning. International Conference on Learning Representations. Cited by: §3.
 Offpolicy deep reinforcement learning without exploration. In International Conference on Machine Learning, Cited by: §1, §1, §2, Table 6, Table 1, §6.
 Evaluating reinforcement learning algorithms in observational health settings. arXiv preprint arXiv:1805.12298. Cited by: §6.
 Deep reinforcement learning for robotic manipulation with asynchronous offpolicy updates. In International Conference on Robotics and Automation, Cited by: §1, §3.
 QProp: sampleefficient policy gradient with an offpolicy critic. In International Conference on Learning Representations, Cited by: §1, §2.
 Continuous deep qlearning with modelbased acceleration. In International Conference on Machine Learning, Cited by: §1, §2.
 Soft actorcritic: offpolicy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, Cited by: Figure 1, §1, §2, §3, §3, §7.
 Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems, Cited by: §6.

Rainbow: combining improvements in deep reinforcement learning.
In
AAAI Conference on Artificial Intelligence
, Cited by: §1.  When to trust your model: modelbased policy optimization. In Advances in Neural Information Processing Systems, Cited by: §4.3, §6.
 Way offpolicy batch deep reinforcement learning of implicit human preferences in dialog. arXiv preprint arXiv:1907.00456. Cited by: §1, §1, §1, §6.
 QTOpt: scalable deep reinforcement learning for visionbased robotic manipulation. In Conference on Robot Learning, Cited by: §1, §3, §3.
 MOReL : modelbased offline reinforcement learning. arXiv preprint arXiv:2005.05951. Cited by: §C.1, §6, §6.
 Adam: a method for stochastic optimization. In International Conference on Learning Representations, Cited by: §B.1.
 Stabilizing offpolicy qlearning via bootstrapping error reduction. In Advances in Neural Information Processing Systems, Cited by: §1, §1, §4.2, §5, §6.
 ModelEnsemble TrustRegion Policy Optimization. In International Conference on Learning Representations, Cited by: §1, §B.1, §3, §4.1, Table 1, §6.
 Batch reinforcement learning. In Reinforcement learning, Cited by: §2, §6.
 Offline reinforcement learning: tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643. Cited by: §1, §1, §4.3, §6.
 Continuous control with deep reinforcement learning. In International Conference on Learning Representations, Cited by: §1, §2.
 Selfimproving reactive agents based on reinforcement learning, planning and teaching. Machine learning. Cited by: §2.
 Offpolicy policy gradient with state distribution correction. arXiv preprint arXiv:1904.08473. Cited by: §6.
 Offline policy evaluation across representations with applications to educational games.. In International Conference on Autonomous Agents and Multiagent Systems, Cited by: §1, §3.
 IRIS: implicit reinforcement without interaction at scale for learning control from offline robot manipulation data. arXiv preprint arXiv:1911.05321. Cited by: §6.
 Humanlevel control through deep reinforcement learning. Nature. Cited by: §2.
 Marginal mean models for dynamic regimes. Journal of the American Statistical Association. Cited by: §1, §3.
 Multiagent manipulation via locomotion using hierarchical sim2real. In Conference on Robot Learning, Cited by: §1, §3.
 Dataefficient hierarchical reinforcement learning. In Advances in Neural Information Processing Systems, Cited by: §1.
 Neural Network Dynamics for ModelBased Deep Reinforcement Learning with ModelFree FineTuning. International Conference on Robotics and Automation. Cited by: §1.
 Generalization in deep networks: the role of distance from initialization. arXiv preprint arXiv:1901.01672. Cited by: §4.2.
 Massively parallel methods for deep reinforcement learning. arXiv preprint arXiv:1507.04296. Cited by: §7.
 Offpolicy temporaldifference learning with function approximation. In International Conference on Machine Learning, Cited by: §1.
 Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. arXiv preprint arXiv:1709.10087. Cited by: §4.2.
 Benchmarking safe exploration in deep reinforcement learning. arXiv preprint arXiv:1910.01708. Cited by: §3.

A reduction of imitation learning and structured prediction to noregret online learning
. In International conference on artificial intelligence and statistics, Cited by: §4.2.  Trust region policy optimization. In International Conference on Machine Learning, Cited by: §1, §2, §4.2.
 Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §2.
 Keep doing what worked: behavioral modelling priors for offline reinforcement learning. In International Conference on Learning Representations, Cited by: §4.2, §6.
 Learning without stateestimation in partially observable markovian decision processes. In Machine Learning Proceedings, Cited by: §6.
 Reinforcement learning with soft state aggregation. In Advances in Neural Information Processing Systems, Cited by: §2.
 BRPO: batch residual policy optimization. arXiv preprint arXiv:2002.05522. Cited by: §1.
 Dyna, an integrated architecture for learning, planning, and reacting. ACM Sigart Bulletin. Cited by: §B.1, §4, §6.
 Mujoco: a physics engine for modelbased control. In International Conference on Intelligent Robots and Systems, Cited by: §3, §5.
 Benchmarking modelbased reinforcement learning. arXiv preprint arXiv:1907.02057. Cited by: §B.1, §B.1, §B.2.1, §B.2.1, Table 2, §4.1.
 Behavior Regularized Offline Reinforcement Learning. arXiv preprint arXiv:1911.11361. Cited by: §1, §1, §B.1, §B.1, §B.1, §B.2.1, §C.1, Table 6, §4.2, §5.2, Table 1, §5, §6.
 MOPO: model-based offline policy optimization. arXiv preprint arXiv:2005.13239. Cited by: §6, §6.
Appendix
A Proof of Proposition 1
We first consider . The behavior cloning objective in its supremum form is,
We apply Pinsker’s inequality to the true and estimated behavior policy to yield
By the same Pinsker’s inequality, we have,
Therefore, by triangle inequality, we have
as desired.
We perform similarly for . The model dynamics loss is
We apply Pinsker’s inequality to the true dynamics and learned model to yield
as desired.
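For reference, the two standard facts invoked in this proof can be stated as follows. The symbols $P$, $Q$, and $R$ are generic probability distributions introduced here for illustration only; they are not the paper's notation, and the block does not attempt to restate the proposition itself.

```latex
% Pinsker's inequality: total variation is bounded by the
% square root of half the KL divergence.
D_{\mathrm{TV}}(P, Q) \le \sqrt{\tfrac{1}{2}\, D_{\mathrm{KL}}(P \,\|\, Q)}

% Triangle inequality for the total variation distance.
D_{\mathrm{TV}}(P, R) \le D_{\mathrm{TV}}(P, Q) + D_{\mathrm{TV}}(Q, R)
```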
B Details of Experimental Settings
B.1 Implementation Details
For our baseline methods, we use the open-source implementations of SAC, BC, BCQ, and BRAC published in Wu et al. [2019]. SAC and BRAC have a (300, 300) Q-network and a (200, 200) policy network. BC has a (200, 200) policy network, and BCQ has a (300, 300) Q-network, a (300, 300) policy network, and a (750, 750) conditional VAE. For online ME-TRPO, we use the codebase of the model-based RL benchmark of Wang et al. [2019]. BREMEN and online ME-TRPO use a policy consisting of two hidden layers with 200 units. The dynamics model also consists of two hidden layers, with 1,024 units. We use Adam Kingma and Ba [2014] as the optimizer, with a learning rate of 0.001 for the dynamics model and 0.0005 for behavior cloning in BREMEN. In BREMEN and online ME-TRPO specifically, we adopt a linear feature value function to stabilize training.
To use neural networks as Dyna-style Sutton [1991] dynamics models, we modify the reward and termination functions so that they do not depend on the internal physics engine for calculation, following the model-based benchmark codebase Wang et al. [2019]; see Table 2. Note that the scores of the baselines (e.g., BCQ, BRAC) differ slightly from those in Wu et al. [2019] because of this modification of the reward function. We reran each algorithm in our environments and confirmed that it converged appropriately.
The maximum length of one episode is 1,000 steps without any termination in Ant and HalfCheetah, whereas the termination function is enabled in Hopper and Walker2d. The batch size of transitions for policy updates is 50,000 in BREMEN and ME-TRPO, following Kurutach et al. [2018]. The batch size is 256 for BC and BRAC and 100 for BCQ, following Wu et al. [2019].


Environment  Reward function  Termination in rollouts 
Ant  False  
HalfCheetah  False  
Hopper  True  
Walker2d  True  

B.2 Hyperparameters
In this section, we describe the hyperparameters in both the Deployment-Efficient RL and Offline RL settings. We run all of our experiments with five random seeds and average the results.
B.2.1 Deployment-Efficient RL
Table 3 shows the hyperparameters of BREMEN. The rollout length is searched over {250, 500, 1000}, and the max step size over {0.001, 0.01, 0.05, 0.1, 1.0}. For the discount factor and the GAE λ, we follow Wang et al. [2019].



Parameter  Ant  HalfCheetah  Hopper  Walker2d 
Iteration per batch  2,000  2,000  6,000  2,000 
Deployment  5  5  10  10 
Total iteration  10,000  10,000  60,000  20,000 
Rollout length  250  250  1,000  1,000 
Max step size  0.05  0.1  0.05  0.05 
Discount factor  0.99  0.99  0.99  0.99 
GAE λ  0.97  0.95  0.95  0.95 
Stationary noise  0.1  0.1  0.1  0.1 
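
The "Total iteration" row of Table 3 is the product of the iterations per batch and the number of deployments. A minimal sketch of that bookkeeping, with the values copied from Table 3 (the dictionary layout and the function name are illustrative, not taken from the authors' code):

```python
# Values copied from Table 3; the data layout is an assumption for illustration.
bremen_schedule = {
    "Ant": {"iters_per_batch": 2_000, "deployments": 5},
    "HalfCheetah": {"iters_per_batch": 2_000, "deployments": 5},
    "Hopper": {"iters_per_batch": 6_000, "deployments": 10},
    "Walker2d": {"iters_per_batch": 2_000, "deployments": 10},
}


def total_iterations(cfg):
    """Policy-optimization iterations summed over all deployments."""
    return cfg["iters_per_batch"] * cfg["deployments"]


totals = {env: total_iterations(cfg) for env, cfg in bremen_schedule.items()}
print(totals)
```

The resulting totals match the "Total iteration" row: 10,000 for Ant and HalfCheetah, 60,000 for Hopper, and 20,000 for Walker2d.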

Number of Iterations for Policy Optimization
To achieve high deployment efficiency, the number of policy optimization iterations between deployments is an important hyperparameter for fast convergence. For the existing methods (BCQ, BRAC, SAC), we search over three values, {10,000, 50,000, 100,000}, and choose 10,000 for BCQ and BRAC and 100,000 for SAC (Figure 5). For BREMEN, we also search over three values, {2,000, 4,000, 6,000}. Figure 6 shows the results of this search; we choose 2,000 for Ant, HalfCheetah, and Walker2d, and 6,000 for Hopper.
Stationary Noise in BREMEN
A stochastic Gaussian policy is a good choice for effective exploration. We found that adding stationary Gaussian noise to the policy, both in the imaginary trajectories and during data collection, led to a notable improvement. A stationary Gaussian policy is written as,
Another choice is a learned Gaussian policy, which parameterizes not only the mean but also the standard deviation. A learned Gaussian policy is written as,
We use zero-mean Gaussian noise and tune its scale in Figure 7 on HalfCheetah, comparing the stationary and learned strategies. From this experiment, we found that stationary noise with a scale of 0.1 consistently performs well, and we therefore used it for all our experiments.
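
The difference between the two noise strategies can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the log-standard-deviation parameterization, and the example numbers are all hypothetical, and any squashing or clipping of the action is deliberately left out.

```python
import math
import random


def stationary_gaussian_action(mean, sigma=0.1, rng=random):
    """Add zero-mean Gaussian noise with one fixed scale to every action dimension."""
    return [m + rng.gauss(0.0, sigma) for m in mean]


def learned_gaussian_action(mean, log_std, rng=random):
    """Add zero-mean Gaussian noise whose per-dimension scale is a policy output."""
    return [m + rng.gauss(0.0, math.exp(ls)) for m, ls in zip(mean, log_std)]


rng = random.Random(0)
mean = [0.2, -0.5, 0.0]        # hypothetical policy mean for one state
log_std = [-2.0, -1.0, -3.0]   # hypothetical learned log standard deviations

a_stationary = stationary_gaussian_action(mean, sigma=0.1, rng=rng)
a_learned = learned_gaussian_action(mean, log_std, rng=rng)
print(a_stationary, a_learned)
```

In the stationary variant the scale is a hyperparameter (0.1 in the experiments above), whereas in the learned variant it changes as the policy is optimized.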
Other Hyperparameters in the Existing Methods
For online ME-TRPO, we collect 3,000 steps through online interaction with the environment every 25 iterations and split these transitions into training and validation datasets at a 2-to-1 ratio for learning the dynamics models. In the settings with batch size 100,000, we collect 2,000 steps and split them at a 1-to-1 ratio. In total, we run 12,500 iterations of policy optimization, which corresponds to 500 deployments of the policy. Note that we carefully tuned the hyperparameters of online ME-TRPO, and its performance improved over that reported in Wang et al. [2019].
Table 4 and Table 5 show the tunable hyperparameters of BCQ and BRAC, respectively. We refer to Wu et al. [2019] to choose these values. In this work, BRAC applies a primal form of the KL value penalty, and BRAC (max Q) means sampling multiple actions and taking the maximum according to the learned Q function.
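
The deployment count for online ME-TRPO follows directly from the numbers in this paragraph. A small sketch of the arithmetic (variable names are illustrative):

```python
# Numbers taken from the paragraph above; names are assumptions for illustration.
total_policy_iterations = 12_500
iterations_per_collection = 25
steps_per_collection = 3_000
train_parts, val_parts = 2, 1   # 2-to-1 split of training and validation data

# One data collection (deployment) happens every 25 policy iterations.
deployments = total_policy_iterations // iterations_per_collection

# Split each collected batch of transitions by the 2-to-1 ratio.
train_steps = steps_per_collection * train_parts // (train_parts + val_parts)
val_steps = steps_per_collection - train_steps

print(deployments, train_steps, val_steps)
```

This gives 500 deployments, with each collected batch split into 2,000 training and 1,000 validation transitions.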



Parameter  Ant  HalfCheetah  Hopper  Walker2d 
Policy learning rate  3e-05  3e-04  3e-06  3e-05 
Perturbation range  0.15  0.5  0.15  0.15 




Parameter  Ant  HalfCheetah  Hopper  Walker2d 
Policy learning rate  1e-4  1e-3  3e-5  1e-5 
Divergence penalty  0.3  0.1  0.3  0.3 

B.2.2 Offline RL
In the offline experiments, we apply the same hyperparameters as in the deployment-efficient settings described above, except for the iterations per batch. Algorithm 2 shows pseudocode for BREMEN in offline RL settings, where policies are updated with only one fixed batch dataset. The number of iterations is set to 6,250 for BREMEN and 500,000 for BC, BCQ, and BRAC.
C Additional Experiment Results
C.1 Performance on Datasets with Different Noise
Following Wu et al. [2019] and Kidambi et al. [2020], we additionally compare BREMEN in offline settings to the other baselines (BC, BCQ, BRAC) on five datasets with different exploration noise. Each dataset also contains one million transitions.

eps1: 40% of the dataset is collected by the data-collection policy (a partially trained SAC policy), 40% is collected by an epsilon-greedy policy, which takes a random action with probability epsilon, and 20% is collected by a uniformly random policy.

eps3: Same composition as eps1 (40% from the data-collection policy, 40% from an epsilon-greedy policy, and 20% from a uniformly random policy), with a different value of epsilon.

gaussian1: 40% of the dataset is collected by the data-collection policy, 40% is collected by that policy with zero-mean Gaussian noise added to each action it samples, and 20% is collected by a uniformly random policy.

gaussian3: Same composition as gaussian1 (40% from the data-collection policy, 40% from that policy with zero-mean Gaussian noise, and 20% from a uniformly random policy), with a different noise scale.

random: The entire dataset is collected by a uniformly random policy.
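
The 40/40/20 composition of the mixed datasets above can be sketched as follows. This is an illustration only: the helper names, the action bounds of [-1, 1], and the idea of passing epsilon as a parameter are assumptions, and no specific value of epsilon is implied.

```python
import random


def mixture_sizes(total, fractions=(0.4, 0.4, 0.2)):
    """Number of transitions per component: policy, noisy policy, uniformly random."""
    sizes = [int(round(total * f)) for f in fractions]
    sizes[-1] = total - sum(sizes[:-1])  # make the parts sum exactly to the total
    return sizes


def epsilon_greedy_action(policy_action, action_dim, epsilon, rng):
    """With probability epsilon, replace the policy action by a uniformly random one."""
    if rng.random() < epsilon:
        # Action bounds of [-1, 1] are an assumption for this sketch.
        return [rng.uniform(-1.0, 1.0) for _ in range(action_dim)]
    return policy_action


sizes = mixture_sizes(1_000_000)
print(sizes)
```

For a dataset of one million transitions, the three components contain 400,000, 400,000, and 200,000 transitions.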
Table 6 shows that BREMEN can also achieve performance competitive with state-of-the-art model-free offline RL algorithms, even with noisy datasets. The training curves of each experiment are shown in Section C.4.



Noise: eps1, 1,000,000 (1M) transitions  
Method  Ant  HalfCheetah  Hopper  Walker2d 
Dataset  1077  2936  791  815 
BC  138171  3788740  266486  1185155 
BCQ  1937116  6046276  800659  479537 
BRAC  2693155  7003118  1243162  3204103 
BRAC (max Q)  290798  707081  1488386  3330147 
BREMEN (Ours)  3519129  7585425  281876  1177697 
METRPO (offline)  1514503  1009731  1301654  128153 


Noise: eps3, 1,000,000 (1M) transitions  
Method  Ant  HalfCheetah  Hopper  Walker2d 
Dataset  936  2408  662  648 
BC  1364121  2877797  519532  1066176 
BCQ  193821  5739188  1170446  10181231 
BRAC  271890  6434147  122471  2921101 
BRAC (max Q)  291387  6672136  2103746  3079110 
BREMEN (Ours)  3409218  7632104  280365  1161384 
METRPO (offline)  1843674  550467  1308756  354329 


Noise: gaussian1, 1,000,000 (1M) transitions  
Method  Ant  HalfCheetah  Hopper  Walker2d 
Dataset  1072  3150  882  1070 
BC  127980  4142189  3116  1137477 
BCQ  195876  5854498  475416  608416 
BRAC  290581  7026168  1456161  3030103 
BRAC (max Q)  2910157  7026168  157589  324297 
BREMEN (Ours)  2912165  7928313  1999617  1402290 
METRPO (offline)  1275656  1275656  909631  171119 


Noise: gaussian3, 1,000,000 (1M) transitions  
Method  Ant  HalfCheetah  Hopper  Walker2d 
Dataset  1058  2872  781  981 
BC  130034  419069  611467  1217361 
BCQ  198297  5781543  1137582  258286 
BRAC  3084180  39332740  1432499  3253118 
BRAC (max Q)  291699  39972761  1417267  3372153 
BREMEN (Ours)  3432185  8124145  1867354  2073245 
METRPO (offline)  1237310  2141872  973243  219145 


Noise: random, 1,000,000 (1M) transitions  
Method  Ant  HalfCheetah  Hopper  Walker2d 
Dataset  470  285  34  2 
BC  98910  21  10662  108110 
BCQ  1222114  2887242  2067  22812 
BRAC  105792  3449259  22730  2954 
BRAC (max Q)  68357  3418171  22437  2650 
BREMEN (Ours)  90511  3627193  27068  2546 
METRPO (offline)  2221665  2701120  32129  26213 

C.2 Comparison among Different Numbers of Ensembles
To deal with distribution shift during policy optimization, also known as model bias, we introduce an ensemble of dynamics models. We evaluate BREMEN with different numbers of dynamics models in the ensemble. Figure 8 and Figure 9 show the performance of BREMEN with different ensemble sizes in the deployment-efficient and offline settings. Ensembles with more dynamics models generally performed better, owing to the mitigation of distributional shift, with one exception among the sizes tested, and we chose the ensemble size accordingly.
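
One common way to use such an ensemble in model-based RL is to sample a different member at every step of an imaginary rollout, so that no single model's bias dominates a trajectory. The sketch below illustrates that idea with toy stand-ins; the per-step sampling scheme, the toy dynamics, and all names are assumptions for illustration rather than a description of the BREMEN code.

```python
import random


def make_toy_model(bias):
    """Hypothetical stand-in for one learned dynamics model: s' = s + a + bias."""
    def model(state, action):
        return [s + a + bias for s, a in zip(state, action)]
    return model


def imaginary_rollout(models, policy, init_state, horizon, rng):
    """Roll out a policy, sampling one ensemble member uniformly at every step."""
    state, trajectory = list(init_state), []
    for _ in range(horizon):
        action = policy(state)
        model = rng.choice(models)
        next_state = model(state, action)
        trajectory.append((state, action, next_state))
        state = next_state
    return trajectory


rng = random.Random(0)
ensemble = [make_toy_model(b) for b in (-0.01, 0.0, 0.01)]  # three toy members
policy = lambda s: [-0.1 * x for x in s]                     # toy proportional policy
rollout = imaginary_rollout(ensemble, policy, [1.0, -1.0], horizon=5, rng=rng)
print(len(rollout))
```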
C.3 Implicit KL Control in Offline Settings
Similar to Section 5.3, we present offline experiments to better understand the effect of implicit KL regularization. In contrast to the implicit KL regularization via Eq. 4, the optimization of BREMEN with an explicit KL penalty becomes
(6)  
where is the advantage of computed using model-based rollouts in the learned dynamics model and is the maximum step size. Note that BREMEN with an explicit KL penalty does not use behavior cloning initialization.
We empirically conclude that the explicit constraint is unnecessary and that a TRPO update with behavior initialization, acting as implicit regularization, is sufficient in the BREMEN algorithm. Figure 10 shows the KL divergence between the learned policies and the last deployed policies (top row) and the model errors, measured as the mean squared error between the predicted next state and the true next state (second row). We find that a behavior-initialized policy with conservative KL trust-region updates stays close to the last deployed policy during improvement, even without an explicit KL penalty. The policy initialized with behavior cloning also tends to suppress the increase in model error, which implies that behavior initialization alleviates the effect of distribution shift. In Walker2d, the model error of BREMEN is relatively large, which may be related to its poor performance with noisy datasets in Section C.1.
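
The two diagnostics reported in Figure 10 can be computed along the following lines. This is a hedged sketch: it assumes Gaussian policies with diagonal covariance (so the KL divergence has a closed form), picks one KL direction arbitrarily, and uses hypothetical function names; the authors' exact measurement procedure is not specified here.

```python
import math


def diag_gaussian_kl(mu_p, std_p, mu_q, std_q):
    """KL(p || q) between diagonal Gaussians, summed over action dimensions."""
    kl = 0.0
    for mp, sp, mq, sq in zip(mu_p, std_p, mu_q, std_q):
        kl += math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2.0 * sq ** 2) - 0.5
    return kl


def model_mse(predicted_next_states, true_next_states):
    """Mean squared error over all state dimensions of all transitions."""
    sq_errors = [
        (p - t) ** 2
        for pred, true in zip(predicted_next_states, true_next_states)
        for p, t in zip(pred, true)
    ]
    return sum(sq_errors) / len(sq_errors)


# Toy example: learned policy vs. last deployed policy at a single state.
kl = diag_gaussian_kl([0.1, 0.0], [0.1, 0.1], [0.0, 0.0], [0.1, 0.1])
err = model_mse([[1.0, 2.0]], [[1.0, 4.0]])
print(kl, err)
```

In practice both quantities would be averaged over states sampled from the dataset or from rollouts; that averaging is omitted here for brevity.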