1 Introduction
Reinforcement learning (RL) algorithms have recently demonstrated impressive success in learning behaviors for a variety of sequential decision-making tasks Barth-Maron et al. (2018); Hessel et al. (2018); Nachum et al. (2019). Virtually all of these demonstrations have relied on frequent online access to the environment, with the RL algorithms often interleaving each policy update with additional experience collection by that policy acting in the environment. However, in many real-world applications of RL, such as health Murphy et al. (2001), education Mandel et al. (2014), dialog agents (Jaques et al., 2019), and robotics Gu et al. (2017a); Kalashnikov et al. (2018), the deployment of a new data-collection policy may be associated with a number of costs and risks. If we can learn tasks with a small number of data-collection policies, we can substantially reduce these costs and risks.
Based on this idea, we propose a novel measure of RL algorithm performance, namely deployment efficiency, which counts the number of changes in the data-collection policy during learning, as illustrated in Figure 1. This concept may be seen in contrast to sample efficiency or data efficiency Precup et al. (2001); Degris et al. (2012); Gu et al. (2017b); Haarnoja et al. (2018); Lillicrap et al. (2016); Nachum et al. (2018), which measure the number of environment interactions incurred during training, without regard to how many distinct policies were deployed to perform those interactions. Even when data efficiency is high, deployment efficiency can be low, since many on-policy and off-policy algorithms alternate data collection with each policy update (Schulman et al., 2015; Lillicrap et al., 2016; Gu et al., 2016; Haarnoja et al., 2018). Such dependence on high-frequency policy deployments is best illustrated by recent work in offline RL Fujimoto et al. (2019); Jaques et al. (2019); Kumar et al. (2019); Levine et al. (2020); Wu et al. (2019), where baseline off-policy algorithms exhibited poor performance when trained on a static dataset. These offline RL works, however, limit their study to a single deployment, which is enough for achieving high performance with data collected from a sub-optimal behavior policy, but often not from a random policy. In contrast to those prior works, we aim to learn successful policies from scratch with minimal amounts of data and deployments.
Many existing model-free offline RL algorithms (Levine et al., 2020) are tuned and evaluated on large datasets (e.g., one million transitions). In order to develop an algorithm that is both sample-efficient and deployment-efficient, each iteration of the algorithm between successive deployments has to work effectively on much smaller dataset sizes. We believe model-based RL is better suited to this setting due to its higher demonstrated sample efficiency compared to model-free RL Kurutach et al. (2018); Nagabandi et al. (2018). Although combining model-based RL with offline or limited-deployment settings seems straightforward, we find that this naïve approach leads to poor performance. This problem can be attributed to extrapolation errors Fujimoto et al. (2019) similar to those observed in model-free methods. Specifically, the learned policy may choose sequences of actions which lead it to regions of the state space where the dynamics model cannot predict properly, due to poor coverage of the dataset. This can lead the policy to exploit approximation errors of the dynamics model, which is disastrous for learning. In model-free settings, similar data distribution shift problems are typically remedied by explicitly regularizing policy updates with a divergence from the observed data distribution Jaques et al. (2019); Kumar et al. (2019); Wu et al. (2019), which, however, can overly limit policies' expressivity Sohn et al. (2020).
In order to better approach these problems arising in limited deployment settings, we propose Behavior-Regularized Model-ENsemble (BREMEN), which learns an ensemble of dynamics models in conjunction with a policy using imaginary rollouts while implicitly regularizing the learned policy via appropriate parameter initialization and conservative trust-region learning updates. We evaluate BREMEN on high-dimensional continuous control benchmarks and find that it achieves impressive deployment efficiency. BREMEN is able to learn successful policies with only 5-10 deployments, significantly outperforming existing off-policy and offline RL algorithms in this deployment-constrained setting. We further evaluate BREMEN on standard offline RL benchmarks, where only a single static dataset is used. In this fixed-batch setting, our experiments show that BREMEN can not only achieve performance competitive with state-of-the-art when using standard dataset sizes but also learn with 10-20 times smaller datasets, which previous methods are unable to attain.

2 Preliminaries
We consider a Markov Decision Process (MDP) setting, characterized by the tuple $M = (\mathcal{S}, \mathcal{A}, p, r, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $p(s_{t+1} \mid s_t, a_t)$ is the transition probability distribution or dynamics, $r(s_t, a_t)$ is the reward function, and $\gamma \in (0, 1)$ is the discount factor. A policy $\pi$ is a function that determines the agent behavior, mapping from states to probability distributions over actions. The goal is to obtain the optimal policy $\pi^{*} = \arg\max_{\pi} \eta[\pi]$, where $\eta[\pi] = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t)\right]$ is the expectation of the discounted sum of rewards under the policy $\pi$. The transition probability $p$ is usually unknown and is estimated with a parameterized dynamics model $\hat{f}_{\phi}$ (e.g., a neural network) in model-based RL. For simplicity, we assume that the reward function $r$ is known and the reward can be computed for any arbitrary state, but we can easily extend to the unknown setting and predict it with a parameterized function.

On-policy vs Off-policy, Online vs Offline. At a high level, most RL algorithms iterate many times between collecting a batch of transitions (deployments) and optimizing the policy (learning). If the algorithms discard data after each policy update, they are on-policy (Schulman et al., 2015, 2017), while if they accumulate data in a buffer $\mathcal{D}$, i.e., experience replay (Lin, 1992), they are off-policy (Mnih et al., 2015; Lillicrap et al., 2016; Gu et al., 2016, 2017b; Haarnoja et al., 2018), because not all the data in the buffer come from the current policy. However, we consider all these algorithms to be online RL algorithms, since they involve many deployments during learning, ranging from hundreds to millions. On the other hand, in pure offline RL, one does not assume direct interaction and learns a policy from only a fixed dataset, which effectively corresponds to a single deployment allowed for learning. Classically, interpolating these two extremes were semi-batch RL algorithms Lange et al. (2012); Singh et al. (1995), which improve the policy through repetitions of collecting a large batch of transitions and performing many or full policy updates. While these semi-batch RL algorithms also realize good deployment efficiency, they have not been extensively studied with neural network function approximators or in off-policy settings with experience replay for scalable sample-efficient learning. In our work, we aim to have both high deployment efficiency and sample efficiency by developing an algorithm that can solve the tasks with minimal policy deployments as well as transition samples.

3 Deployment Efficiency
Deploying a new policy for data collection can be associated with a number of costs and risks for many real-world applications like medicine or robotic control Murphy et al. (2001); Mandel et al. (2014); Gu et al. (2017a); Kalashnikov et al. (2018); Nachum et al. (2019). While there is an abundance of works on safety for RL (Chow et al., 2015; Eysenbach et al., 2018; Chow et al., 2018; Ray et al., 2019; Chow et al., 2019), these methods often do not provide guarantees in practice when combined with neural networks and stochastic optimization. It is therefore necessary to validate each policy before deployment. Due to the cost associated with each deployment, it is desirable to minimize the number of distinct deployments needed during the learning process.
In order to focus research on these practical bottlenecks, we propose a novel measure of RL algorithms, namely, deployment efficiency, which counts how many times the data-collection policy has been changed during learning, from a random initial policy to a policy that solves the task. For example, if an RL algorithm operates by using its learned policy to collect transitions from the environment $I$ times, each time collecting a batch of $B$ new transitions, then the number of deployments is $I$, while the total number of samples collected is $I \times B$. The lower $I$ is, the more deployment-efficient the algorithm is; in contrast, sample efficiency looks at $I \times B$. Online RL algorithms, whether they are on-policy or off-policy, typically update the policy and acquire new transitions by deploying the newly updated policy at every iteration. This corresponds to performing hundreds to millions of deployments during learning on standard benchmarks (Haarnoja et al., 2018), which is severely deployment-inefficient. On the other hand, the offline RL literature only studies the case of a single deployment. A deployment-efficient algorithm would stand in the middle of these two extremes and ideally learn a successful policy from scratch while deploying only a few distinct policies, as illustrated in Figure 1.
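As a concrete illustration of how the number of deployments $I$ and the per-deployment batch size $B$ enter a training loop, here is a minimal sketch (not from the paper; `collect` and `update_policy` are hypothetical placeholders passed in by the caller):

```python
# Minimal sketch of a deployment-constrained training loop (illustrative only).
def deployment_constrained_training(env, policy, collect, update_policy,
                                    num_deployments=10,            # I
                                    batch_per_deployment=200_000):  # B
    dataset = []
    for _ in range(num_deployments):            # each outer iteration = one deployment
        dataset += collect(env, policy, batch_per_deployment)  # fixed policy gathers B samples
        policy = update_policy(policy, dataset)  # many offline updates, no new data collection
    return policy
# Deployment efficiency counts num_deployments (I);
# sample efficiency counts num_deployments * batch_per_deployment (I * B).
```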
Recent deep RL literature seldom emphasizes deployment efficiency, with few exceptions in specific applications Kalashnikov et al. (2018) where such a learning procedure is necessary. Although current state-of-the-art algorithms on continuous control have substantially improved sample or data efficiency, they have not been optimized for deployment efficiency. For example, SAC Haarnoja et al. (2018), an efficient model-free off-policy algorithm, performs half a million to one million policy deployments during learning on MuJoCo Todorov et al. (2012) benchmarks. ME-TRPO Kurutach et al. (2018), a model-based algorithm, performs far fewer policy deployments, around 100-300, although this is still relatively high for practical settings. (We determined the number of deployments by checking the original implementations; the frequency of data collection is a tunable hyper-parameter.) In our work, we demonstrate successful learning on standard benchmark environments with only 5-10 deployments.
4 Behavior-Regularized Model-Ensemble
To achieve high deployment efficiency, we propose Behavior-Regularized Model-ENsemble (BREMEN). BREMEN incorporates Dyna-style Sutton (1991) model-based RL, learning an ensemble of dynamics models in conjunction with a policy using imaginary rollouts from the ensemble and behavior regularization via conservative trust-region updates.
4.1 Imaginary Rollout from Model Ensemble
As in recent Dyna-style model-based RL methods Kurutach et al. (2018); Wang et al. (2019), BREMEN uses an ensemble of deterministic dynamics models $\{\hat{f}_{\phi_1}, \dots, \hat{f}_{\phi_K}\}$ to alleviate the problem of model bias. Each model $\hat{f}_{\phi_i}$ is parameterized by $\phi_i$ and trained with the following objective, which minimizes the mean squared error between the predicted next state $\hat{f}_{\phi_i}(s_t, a_t)$ and the true next state $s_{t+1}$ over a dataset $\mathcal{D}$:
$$\min_{\phi_i} \; \frac{1}{|\mathcal{D}|} \sum_{(s_t, a_t, s_{t+1}) \in \mathcal{D}} \frac{1}{2} \left\| s_{t+1} - \hat{f}_{\phi_i}(s_t, a_t) \right\|_2^2 \qquad (1)$$
During training of a policy $\pi_\theta$, imagined trajectories of states and actions are generated sequentially, using a dynamics model $\hat{f}_{\phi_i}$ that is randomly selected from the ensemble at each time step:
$$\hat{s}_{t+1} = \hat{f}_{\phi_i}(\hat{s}_t, a_t), \qquad a_t \sim \pi_\theta(\cdot \mid \hat{s}_t), \qquad i \sim \mathrm{Uniform}\{1, \dots, K\} \qquad (2)$$
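The following minimal sketch (an illustration under our own assumptions, not the authors' released code) shows how an ensemble trained with the squared-error objective of Equation 1 can generate imaginary rollouts with a randomly chosen member at every step, as in Equation 2; the network sizes, training schedule, and the `policy` callable are assumptions:

```python
# Illustrative sketch: deterministic dynamics ensemble (Eq. 1) and imaginary rollouts (Eq. 2).
import random
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))   # predicted next state

def train_ensemble(models, dataset, epochs=5, lr=1e-3):
    """dataset: tuple of tensors (states, actions, next_states)."""
    s, a, s_next = dataset
    for f in models:                                  # each member is fit independently
        opt = torch.optim.Adam(f.parameters(), lr=lr)
        for _ in range(epochs):
            loss = 0.5 * ((s_next - f(s, a)) ** 2).sum(dim=-1).mean()
            opt.zero_grad(); loss.backward(); opt.step()
    return models

def imaginary_rollout(models, policy, s0, horizon):
    """Generate an imagined trajectory, re-sampling the ensemble member each step."""
    traj, s = [], s0
    for _ in range(horizon):
        a = policy(s)                                 # e.g. a Gaussian policy sample
        f = random.choice(models)                     # random ensemble member (Eq. 2)
        s_next = f(s, a)
        traj.append((s, a, s_next))
        s = s_next
    return traj
```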
4.2 Policy Update with Behavior Regularization
In order to manage the discrepancy between the true dynamics and the learned model caused by the distribution shift in batch settings, we propose to use iterative policy updates via a trust-region constraint, re-initialized with a behavior-cloned policy after every deployment. Specifically, after each deployment, we are given an updated dataset of experience transitions $\mathcal{D}$. With this dataset, we approximate the true behavior policy $\pi_\beta$ through behavior cloning (BC), utilizing a neural network $\hat{\pi}_\beta$ parameterized by $\beta$, where we implicitly assume a fixed variance, a common practice in BC (Rajeswaran et al., 2017):

$$\min_{\beta} \; \frac{1}{|\mathcal{D}|} \sum_{(s_t, a_t) \in \mathcal{D}} \frac{1}{2} \left\| a_t - \hat{\pi}_\beta(s_t) \right\|_2^2 \qquad (3)$$
After obtaining the estimated behavior policy, we initialize the target policy $\pi_\theta$ as a Gaussian policy with mean from $\hat{\pi}_\beta$ and a fixed standard deviation. This BC initialization, in conjunction with gradient-descent-based optimization, may be seen as implicitly biasing the optimized $\pi_\theta$ to be close to the data-collection policy Nagarajan and Kolter (2019), and thus works as a remedy for the distribution shift problem (Ross et al., 2011). To further bias the learned policy to be close to the data-collection policy, we opt to use KL-based trust-region optimization Schulman et al. (2015). Therefore, the optimization of BREMEN becomes

$$\theta_{k+1} = \arg\max_{\theta} \; \mathbb{E}_{(s, a) \sim \pi_{\theta_k}, \hat{f}_{\phi_i}} \left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_k}(a \mid s)} A^{\pi_{\theta_k}}(s, a) \right]$$
$$\text{s.t.} \;\; \mathbb{E}_{s \sim \pi_{\theta_k}, \hat{f}_{\phi_i}} \left[ D_{\mathrm{KL}}\!\left( \pi_\theta(\cdot \mid s) \,\|\, \pi_{\theta_k}(\cdot \mid s) \right) \right] \le \delta, \qquad \pi_{\theta_0} = \hat{\pi}_\beta \qquad (4)$$

where $A^{\pi_{\theta_k}}(s, a)$ is the advantage of $\pi_{\theta_k}$ computed using model-based rollouts in the learned dynamics model and $\delta$ is the maximum step size.
The combination of BC for initialization and finite iterative trust-region updates serves as an implicit KL regularization, as discussed in Section 4.3. This is in contrast to many previous offline RL algorithms that augment the value function with a penalty of explicit KL divergence Siegel et al. (2020); Wu et al. (2019) or maximum mean discrepancy Kumar et al. (2019). Empirically, we found that our regularization technique outperforms the explicit KL penalty (see Section 5.3).
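To make the behavior-regularization recipe concrete, here is a hedged sketch (not the authors' implementation): behavior cloning fits a Gaussian mean to the data (Equation 3), the resulting network initializes the target policy, and subsequent updates are kept close to the previous policy. For brevity, the exact TRPO constrained step of Equation 4 is approximated by a KL-penalized Adam step; all sizes and hyper-parameters are illustrative:

```python
# Illustrative sketch: BC initialization (Eq. 3) plus KL-regularized policy steps (in the
# spirit of Eq. 4). A real TRPO line search would enforce the KL constraint exactly.
import copy
import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence

class GaussianPolicy(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=200, log_std=-2.3):
        super().__init__()
        self.mean = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, action_dim),
        )
        self.log_std = nn.Parameter(torch.full((action_dim,), log_std))

    def dist(self, s):
        return Normal(self.mean(s), self.log_std.exp())

def behavior_clone(policy, states, actions, steps=1000, lr=5e-4):
    """Fit the policy mean to the dataset actions (squared error, fixed variance)."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(steps):
        loss = 0.5 * ((actions - policy.mean(states)) ** 2).sum(-1).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return policy

def kl_regularized_step(policy, old_policy, states, actions, advantages,
                        delta=0.05, penalty=10.0, lr=1e-4):
    """One conservative update: surrogate objective with a penalty for exceeding delta."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    new_d = policy.dist(states)
    with torch.no_grad():
        old_d = old_policy.dist(states)
    ratio = (new_d.log_prob(actions).sum(-1) - old_d.log_prob(actions).sum(-1)).exp()
    kl = kl_divergence(old_d, new_d).sum(-1).mean()
    loss = -(ratio * advantages).mean() + penalty * torch.clamp(kl - delta, min=0.0)
    opt.zero_grad(); loss.backward(); opt.step()
    return policy

# Usage idea: pi = behavior_clone(GaussianPolicy(S, A), states, actions)  # implicit init
#             old_pi = copy.deepcopy(pi); kl_regularized_step(pi, old_pi, ...)
```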
By recursively performing this offline procedure, BREMEN can be used for deployment-efficient learning as shown in Algorithm 1, starting from a randomly initialized policy, collecting experience data, and performing offline policy updates.
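Putting the pieces together, the recursive procedure of Algorithm 1 can be sketched roughly as follows; every callable is an injected placeholder (the hypothetical `train_ensemble`, `behavior_clone`, `imaginary_rollout`, and policy-step routines sketched above), and the default values are only loose assumptions:

```python
# Compact sketch of the deployment loop in the spirit of Algorithm 1 (illustrative only).
import copy

def bremen_loop(env, collect, make_models, make_policy,
                train_ensemble, behavior_clone, imaginary_rollout, policy_step,
                num_deployments=5, batch_size=200_000, inner_iters=2_000):
    dataset = []
    policy = make_policy()                               # random initial policy
    for _ in range(num_deployments):
        dataset += collect(env, policy, batch_size)      # one deployment of the current policy
        models = train_ensemble(make_models(), dataset)  # re-fit dynamics ensemble (Eq. 1)
        policy = behavior_clone(make_policy(), dataset)  # implicit regularization (Eq. 3)
        for _ in range(inner_iters):                     # offline trust-region updates (Eq. 4)
            rollouts = imaginary_rollout(models, policy, dataset)
            policy = policy_step(policy, copy.deepcopy(policy), rollouts)
    return policy
```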
4.3 Implicit KL Control from a Mathematical Perspective
We can intuitively understand that behavior cloning initialization with trust-region updates works as a regularization of distributional shift, and this can be supported by theory. Following the notation of Janner et al. (2019), we denote the generalization error of a dynamics model on the state distribution under the true behavior policy $\pi_\beta$ as $\epsilon_m = \mathbb{E}_{(s,a) \sim \pi_\beta}\!\left[ D_{TV}\!\left( f(\cdot \mid s, a) \,\|\, \hat{f}_\phi(\cdot \mid s, a) \right) \right]$, where $D_{TV}$ represents the total variation distance between the true dynamics $f$ and the learned model $\hat{f}_\phi$. We also denote the distribution shift on the target policy $\pi_\theta$ as $\epsilon_\pi = \max_s D_{TV}\!\left( \pi_\beta(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s) \right)$. A bound relating the true returns $\eta[\pi_\theta]$ and the model returns $\hat{\eta}[\pi_\theta]$ on the target policy is given in Janner et al. (2019) as,
$$\eta[\pi_\theta] \;\ge\; \hat{\eta}[\pi_\theta] - \left[ \frac{2 \gamma r_{\max} (\epsilon_m + 2 \epsilon_\pi)}{(1 - \gamma)^2} + \frac{4 r_{\max} \epsilon_\pi}{1 - \gamma} \right] \qquad (5)$$

where $r_{\max}$ denotes the bound on the reward magnitude.
This bound guarantees improvement under the true returns as long as the improvement under the model returns increases by more than the slack in the bound due to $\epsilon_m$ and $\epsilon_\pi$ Janner et al. (2019); Levine et al. (2020).
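As a quick sanity check on how this slack scales, the snippet below evaluates the bracketed term of the bound as written above for illustrative error values (the numbers are arbitrary, not measurements from our experiments); note the $(1-\gamma)^{-2}$ amplification of model error and policy shift:

```python
# Slack of the return bound (Eq. 5) for example error levels; values are illustrative only.
def return_bound_slack(eps_m, eps_pi, gamma=0.99, r_max=1.0):
    return (2 * gamma * r_max * (eps_m + 2 * eps_pi)) / (1 - gamma) ** 2 \
        + (4 * r_max * eps_pi) / (1 - gamma)

print(return_bound_slack(eps_m=0.01, eps_pi=0.01))  # small policy shift
print(return_bound_slack(eps_m=0.01, eps_pi=0.10))  # larger policy shift dominates the slack
```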
We may relate this bound to the specific learning employed by BREMEN, which includes dynamics model learning, behavior cloning policy initialization, and conservative KL-based trust-region policy updates. To do so, we consider an idealized version of BREMEN, where the expectations over states in equations 1, 3, 4 are replaced with supremums and the dynamics model is set to have unit variance.
Proposition 1 (Policy and model error bound).
Suppose we apply the idealized BREMEN on a dataset $\mathcal{D}$, and define $\epsilon_\pi$ and $\epsilon_m$ in terms of the behavior cloning and dynamics model losses as,
where $\mathcal{H}$ denotes the Shannon entropy. If one then applies $k$ KL-based trust-region steps of step size $\delta$ (Equation 4) using stochastic dynamics models with mean $\hat{f}_{\phi_i}$ and standard deviation 1, then
Proof.
See Appendix A. ∎
5 Experiments
We evaluate BREMEN in both deployment-efficient settings, where the algorithm must learn a policy from scratch via a limited number of deployments, and offline RL, where the algorithm is given only a single static dataset. We use four standard continuous control benchmarks for offline RL Kumar et al. (2019); Wu et al. (2019), namely, Ant, HalfCheetah, Hopper, and Walker2d on the MuJoCo physics simulator Todorov et al. (2012). See Appendix B and C for further details and results.
5.1 Evaluating Deployment Efficiency
We compare BREMEN to ME-TRPO, SAC, BCQ, and BRAC applied to limited deployment settings. To adapt the offline methods (BCQ, BRAC) to this setting, we simply apply them in a recursive fashion: at each deployment iteration, we collect a batch of data with the most recent policy and then run the offline update with this dataset. (Recursive BCQ and BRAC also perform behavior-cloning-based policy initialization after each deployment.) As for SAC, we simply change the replay buffer to update only at specific deployment intervals. For the sake of comparison, we align the number of deployments and the amount of data collected at each deployment (either 100,000 or 200,000 transitions) for all methods.
Figure 2 shows the results with 200,000 (top) and 100,000 (bottom) batched transitions per deployment. Regardless of the environments and the batch size per update, BREMEN achieves remarkable performance while existing online and offline RL methods struggle to make any progress in the limited deployment settings. As a point of comparison, we also include results for online SAC and ME-TRPO without limits on the number of deployments but using the same number of transitions.

5.2 Evaluating Offline Learning
We also evaluate BREMEN on standard offline RL benchmarks following Wu et al. (2019). We first train online SAC to a certain cumulative reward threshold (4,000 in HalfCheetah; 1,000 in Ant, Hopper, and Walker2d) and collect offline datasets. We evaluate agents with an offline dataset of one million (1M) transitions, which is standard for BCQ and BRAC Wu et al. (2019). We then evaluate them on much smaller datasets of 50K and 100K transitions, 5-10% of the size used in prior works.
Table 1 shows that BREMEN can achieve performance competitive with state-of-the-art model-free offline RL algorithms when using the standard dataset size of 1M. Moreover, BREMEN can also learn appropriately with 10-20 times smaller datasets, where BCQ and BRAC are unable to exceed even the BC baseline. As a result, our recursive BREMEN algorithm is not only deployment-efficient but also sample-efficient, and significantly outperforms the baselines.
1,000,000 (1M) transitions

| Method | Ant | HalfCheetah | Hopper | Walker2d |
|---|---|---|---|---|
| Dataset | 1191 | 4126 | 1128 | 1376 |
| BC | 1321141 | 428112 | 1341161 | 1421147 |
| BCQ Fujimoto et al. (2019) | 202131 | 5783272 | 1130127 | 2153753 |
| BRAC Wu et al. (2019) | 2072285 | 7192115 | 142290 | 22391124 |
| BRAC (max Q) | 2369234 | 732091 | 1916343 | 24091210 |
| BREMEN (Ours) | 3328275 | 8055103 | 2058852 | 2346230 |
| ME-TRPO (offline) Kurutach et al. (2018) | 1258550 | 1804924 | 51891 | 211154 |

100,000 (100K) transitions

| Method | Ant | HalfCheetah | Hopper | Walker2d |
|---|---|---|---|---|
| Dataset | 1191 | 4066 | 1128 | 1376 |
| BC | 133081 | 426621 | 1322109 | 142647 |
| BCQ | 1363199 | 3915411 | 1129238 | 2187196 |
| BRAC | -157383 | 25052501 | 131070 | 21621109 |
| BRAC (max Q) | -226387 | 23322422 | 1422101 | 21641114 |
| BREMEN (Ours) | 1633127 | 6095370 | 2191455 | 2132301 |
| ME-TRPO (offline) | 9744 | 2434 | 307170 | 1061 |

50,000 (50K) transitions

| Method | Ant | HalfCheetah | Hopper | Walker2d |
|---|---|---|---|---|
| Dataset | 1191 | 4138 | 1128 | 1376 |
| BC | 127065 | 423049 | 124961 | 1420194 |
| BCQ | 132995 | 1319626 | 1178235 | 1841439 |
| BRAC | -878244 | -59773 | 1277102 | 9761207 |
| BRAC (max Q) | -843279 | -59056 | 1276225 | 9031137 |
| BREMEN (Ours) | 1347283 | 5823146 | 1632796 | 2280647 |
| ME-TRPO (offline) | 93832 | -7395 | 15213 | 176343 |
5.3 Evaluating Effectiveness of Implicit KL Control
In this section, we present an experiment to better understand the effect of BREMEN's implicit regularization. Figure 3 shows the KL divergence of learned policies from the last deployed policy. We compare BREMEN to variants that use an explicit KL penalty on the value instead of BC initialization (conservative KL trust-region updates are still used). We find that the explicit-KL variants without behavior initialization learn policies that move farther away from the last deployed policy than the behavior-initialized policies do. This suggests that the implicit behavior regularization employed by BREMEN is more effective as a conservative policy learning protocol.
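For reference, the quantity plotted in Figure 3 is a per-state KL divergence between the current policy and the last deployed policy; for diagonal Gaussian policies it has the closed form below (a generic helper under our own assumptions, not the exact evaluation code used here):

```python
# KL( N(mu_old, std_old^2) || N(mu_new, std_new^2) ) for diagonal Gaussian policies,
# summed over action dimensions. mu_*/std_* are arrays of matching shape.
import numpy as np

def gaussian_policy_kl(mu_old, std_old, mu_new, std_new):
    var_old, var_new = std_old ** 2, std_new ** 2
    kl = np.log(std_new / std_old) + (var_old + (mu_old - mu_new) ** 2) / (2 * var_new) - 0.5
    return kl.sum(axis=-1)

# Example: identical means, slightly different stds -> small divergence.
print(gaussian_policy_kl(np.zeros(6), 0.1 * np.ones(6), np.zeros(6), 0.12 * np.ones(6)))
```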

6 Related Work
Deployment Efficiency and Offline RL
Although we are not aware of any previous works which explicitly proposed the concept of deployment efficiency, its necessity in many real-world applications has been generally known. One may consider previously proposed semi-batch RL algorithms Ernst et al. (2005); Lange et al. (2012); Singh et al. (1994) as approaching this issue. More recently, a related but distinct problem known as offline RL has gained popularity Levine et al. (2020); Wu et al. (2019). These offline RL works consider an extreme version of 1 deployment, and typically collect the static batch with a partially trained policy rather than a random policy. While offline RL has shown promising results for a variety of real-world applications, such as robotics Mandlekar et al. (2019), dialogue systems Jaques et al. (2019), or medical treatments Gottesman et al. (2018), these algorithms struggle when learning a policy from scratch or when the dataset is small. Nevertheless, common themes of many offline RL algorithms – regularizing the learned policy to the behavior policy Fujimoto et al. (2019); Jaques et al. (2019); Kumar et al. (2019); Siegel et al. (2020); Wu et al. (2019) and utilizing ensembles to handle uncertainty Kumar et al. (2019); Wu et al. (2019) – served as inspirations for the proposed BREMEN algorithm. A major difference of BREMEN from prior works is that the target policy is not explicitly forced to stick close to the estimated behavior policy through the policy update. Rather, BREMEN employs a more implicit regularization by initializing the learned policy with a behavior cloned policy and then applying conservative trust-region updates. Another major difference is the application of model-based approaches to fully offline settings, which has not been extensively studied in prior works Levine et al. (2020), except the two concurrent works from Kidambi et al. (2020) and Yu et al. (2020) that study pessimistic or uncertainty penalized MDPs with guarantees – closely related to Liu et al. (2019). By contrast, our work shows that a simple technique can already enable model-based offline algorithms to significantly outperform the prior model-free methods, and is, to the best of our knowledge, the first to define and extensively evaluate deployment efficiency with recursive experiments.
Model-Based RL
There are many types of model-based RL algorithms (Sutton, 1991; Deisenroth and Rasmussen, 2011; Heess et al., 2015). A simple algorithmic choice is Dyna-style Sutton (1991), which uses a parameterized dynamics model to estimate the true MDP transition function, stochastically mapping states and actions to next states. The dynamics model can then serve as a simulator of the environment during policy updates. Dyna-style algorithms often suffer from distributional shift, also known as model bias, which leads RL agents to exploit regions where the data is insufficient and causes significant performance degradation. A variety of remedies have been proposed to relieve the problem of model bias, such as the use of multiple dynamics models as an ensemble Chua et al. (2018); Kurutach et al. (2018); Janner et al. (2019), meta-learning Clavera et al. (2018), energy-based model regularizers Boney et al. (2019), and explicit reward penalties for unknown states Kidambi et al. (2020); Yu et al. (2020). Notably, we employ a subset of these remedies – model ensembles and trust-region updates Kurutach et al. (2018) – in BREMEN. Compared to existing works, our work is notable for using BC initialization in conjunction with trust-region updates to alleviate the distribution shift of the learned policy from the dataset used to train the dynamics model.

7 Conclusion
In this work, we introduced deployment efficiency, a novel measure for RL performance that counts the number of changes in the data-collection policy during learning. To enhance deployment efficiency, we proposed Behavior-Regularized Model-ENsemble (BREMEN), a novel model-based offline algorithm with implicit KL regularization via appropriate policy initialization and trust-region updates. BREMEN shows impressive results in limited deployment settings, obtaining successful policies from scratch in only 5-10 deployments, and it can improve policies offline even when the batch size is 10-20 times smaller than in prior works. Not only can this help alleviate costs and risks in real-world applications, but it can also reduce the amount of communication required during distributed learning and could form the basis for communication-efficient large-scale RL, in contrast to prior works Nair et al. (2015); Espeholt et al. (2018, 2019). Most critically, we show that under deployment efficiency constraints, most prior algorithms – model-free or model-based, online or offline – fail to achieve successful learning. We hope our work steers the research community toward valuing deployment efficiency as an important criterion for RL algorithms, and toward eventually achieving the sample efficiency and asymptotic performance of state-of-the-art algorithms like SAC (Haarnoja et al., 2018) while maintaining the deployment efficiency needed for safe and practical real-world reinforcement learning.
References
- Distributed distributional deterministic policy gradients. In International Conference on Learning Representations.
- Regularizing model-based planning with energy-based models. In Conference on Robot Learning.
- A Lyapunov-based approach to safe reinforcement learning. In Advances in Neural Information Processing Systems.
- Lyapunov-based safe policy optimization for continuous control. arXiv preprint arXiv:1901.10031.
- Risk-sensitive and robust decision-making: a CVaR optimization approach. In Advances in Neural Information Processing Systems.
- Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems.
- Model-based reinforcement learning via meta-policy optimization. In Conference on Robot Learning.
- Off-policy actor-critic. arXiv preprint arXiv:1205.4839.
- PILCO: a model-based and data-efficient approach to policy search. In International Conference on Machine Learning.
- Tree-based batch mode reinforcement learning. Journal of Machine Learning Research.
- SEED RL: scalable and efficient deep-RL with accelerated central inference. arXiv preprint arXiv:1910.06591.
- IMPALA: scalable distributed deep-RL with importance weighted actor-learner architectures. In International Conference on Machine Learning.
- Leave no trace: learning to reset for safe and autonomous reinforcement learning. In International Conference on Learning Representations.
- Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning.
- Evaluating reinforcement learning algorithms in observational health settings. arXiv preprint arXiv:1805.12298.
- Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In International Conference on Robotics and Automation.
- Q-Prop: sample-efficient policy gradient with an off-policy critic. In International Conference on Learning Representations.
- Continuous deep Q-learning with model-based acceleration. In International Conference on Machine Learning.
- Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning.
- Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems.
- Rainbow: combining improvements in deep reinforcement learning. In AAAI Conference on Artificial Intelligence.
- When to trust your model: model-based policy optimization. In Advances in Neural Information Processing Systems.
- Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv preprint arXiv:1907.00456.
- QT-Opt: scalable deep reinforcement learning for vision-based robotic manipulation. In Conference on Robot Learning.
- MOReL: model-based offline reinforcement learning. arXiv preprint arXiv:2005.05951.
- Adam: a method for stochastic optimization. In International Conference on Learning Representations.
- Stabilizing off-policy Q-learning via bootstrapping error reduction. In Advances in Neural Information Processing Systems.
- Model-ensemble trust-region policy optimization. In International Conference on Learning Representations.
- Batch reinforcement learning. In Reinforcement Learning.
- Offline reinforcement learning: tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643.
- Continuous control with deep reinforcement learning. In International Conference on Learning Representations.
- Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning.
- Off-policy policy gradient with state distribution correction. arXiv preprint arXiv:1904.08473.
- Offline policy evaluation across representations with applications to educational games. In International Conference on Autonomous Agents and Multiagent Systems.
- IRIS: implicit reinforcement without interaction at scale for learning control from offline robot manipulation data. arXiv preprint arXiv:1911.05321.
- Human-level control through deep reinforcement learning. Nature.
- Marginal mean models for dynamic regimes. Journal of the American Statistical Association.
- Multi-agent manipulation via locomotion using hierarchical sim2real. In Conference on Robot Learning.
- Data-efficient hierarchical reinforcement learning. In Advances in Neural Information Processing Systems.
- Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In International Conference on Robotics and Automation.
- Generalization in deep networks: the role of distance from initialization. arXiv preprint arXiv:1901.01672.
- Massively parallel methods for deep reinforcement learning. arXiv preprint arXiv:1507.04296.
- Off-policy temporal-difference learning with function approximation. In International Conference on Machine Learning.
- Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. arXiv preprint arXiv:1709.10087.
- Benchmarking safe exploration in deep reinforcement learning. arXiv preprint arXiv:1910.01708.
- A reduction of imitation learning and structured prediction to no-regret online learning. In International Conference on Artificial Intelligence and Statistics.
- Trust region policy optimization. In International Conference on Machine Learning.
- Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- Keep doing what worked: behavioral modelling priors for offline reinforcement learning. In International Conference on Learning Representations.
- Learning without state-estimation in partially observable Markovian decision processes. In Machine Learning Proceedings.
- Reinforcement learning with soft state aggregation. In Advances in Neural Information Processing Systems.
- BRPO: batch residual policy optimization. arXiv preprint arXiv:2002.05522.
- Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bulletin.
- MuJoCo: a physics engine for model-based control. In International Conference on Intelligent Robots and Systems.
- Benchmarking model-based reinforcement learning. arXiv preprint arXiv:1907.02057.
- Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361.
- MOPO: model-based offline policy optimization. arXiv preprint arXiv:2005.13239.
Appendix
A Proof of Proposition 1
We first consider $\epsilon_\pi$. The behavior cloning objective in its supremum form is,
We apply Pinsker’s inequality to the true and estimated behavior policy to yield
By the same Pinsker’s inequality, we have,
Therefore, by triangle inequality, we have
as desired.
We perform similarly for $\epsilon_m$. The model dynamics loss is
We apply Pinsker’s inequality to the true dynamics and learned model to yield
as desired.
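For reference, the form of Pinsker's inequality invoked repeatedly in the argument above is the standard bound relating total variation distance and KL divergence:

```latex
% Pinsker's inequality for distributions P and Q on the same space:
D_{\mathrm{TV}}(P \,\|\, Q) \;\le\; \sqrt{\tfrac{1}{2}\, D_{\mathrm{KL}}(P \,\|\, Q)}
```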
B Details of Experimental Settings
B.1 Implementation Details
For our baseline methods, we use the open-source implementations of SAC, BC, BCQ, and BRAC published by Wu et al. [2019]. SAC and BRAC have a (300, 300) Q-network and a (200, 200) policy network. BC has a (200, 200) policy network, and BCQ has a (300, 300) Q-network, a (300, 300) policy network, and a (750, 750) conditional VAE. As for online ME-TRPO, we utilize the codebase of the model-based RL benchmark Wang et al. [2019]. BREMEN and online ME-TRPO use a policy consisting of two hidden layers with 200 units. The dynamics model also consists of two hidden layers, with 1,024 units each. We use Adam Kingma and Ba [2014] as the optimizer, with a learning rate of 0.001 for the dynamics model and 0.0005 for behavior cloning in BREMEN. In BREMEN and online ME-TRPO, we adopt a linear feature value function to stabilize training.

To leverage neural networks as Dyna-style Sutton [1991] dynamics models, we modify the reward and termination functions so that they do not depend on the internal physics engine, following the model-based benchmark codebase Wang et al. [2019]; see Table 2. Note that the scores of the baselines (e.g., BCQ, BRAC) differ slightly from Wu et al. [2019] due to this modification of the reward function. We re-ran each algorithm in our environments and confirmed appropriate convergence.
The maximum length of one episode is 1,000 steps without any termination in Ant and HalfCheetah; however, the termination function is enabled in Hopper and Walker2d. The batch size of transitions for policy updates is 50,000 in BREMEN and ME-TRPO, following Kurutach et al. [2018]. The batch size of BC and BRAC is 256, and that of BCQ is 100, also following Wu et al. [2019].
| Environment | Reward function | Termination in rollouts |
|---|---|---|
| Ant | | False |
| HalfCheetah | | False |
| Hopper | | True |
| Walker2d | | True |
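To illustrate what simulator-free termination in rollouts means in practice, here is a rough sketch for a Hopper-like environment; the observation indices and thresholds are illustrative assumptions, not the exact values used in our code or the benchmark codebase:

```python
# Illustrative termination check computed from observations only, so imaginary rollouts
# need no physics engine. Indices/thresholds assume a Hopper-like observation layout
# (height, torso angle, ...).
import numpy as np

def hopper_like_done(obs):
    height, angle = obs[0], obs[1]
    healthy = (
        np.isfinite(obs).all()
        and abs(angle) < 0.2    # torso roughly upright (illustrative threshold)
        and height > 0.7        # torso above the ground (illustrative threshold)
    )
    return not healthy

def halfcheetah_done(obs):
    return False                # no termination in rollouts (Table 2)
```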
B.2 Hyper-Parameters
In this section, we describe the hyper-parameters for both the deployment-efficient RL and offline RL settings. We run all of our experiments with five random seeds, and the results are averaged.
B.2.1 Deployment-Efficient RL
Table 3 shows the hyper-parameters of BREMEN. The rollout length is searched over {250, 500, 1000}, and the max step size $\delta$ is searched over {0.001, 0.01, 0.05, 0.1, 1.0}. As for the discount factor $\gamma$ and the GAE $\lambda$, we follow Wang et al. [2019].
| Parameter | Ant | HalfCheetah | Hopper | Walker2d |
|---|---|---|---|---|
| Iterations per batch | 2,000 | 2,000 | 6,000 | 2,000 |
| Deployments | 5 | 5 | 10 | 10 |
| Total iterations | 10,000 | 10,000 | 60,000 | 20,000 |
| Rollout length | 250 | 250 | 1,000 | 1,000 |
| Max step size | 0.05 | 0.1 | 0.05 | 0.05 |
| Discount factor | 0.99 | 0.99 | 0.99 | 0.99 |
| GAE lambda | 0.97 | 0.95 | 0.95 | 0.95 |
| Stationary noise | 0.1 | 0.1 | 0.1 | 0.1 |
Number of Iterations for Policy Optimization
To achieve high deployment efficiency, the number of policy optimization iterations between deployments is an important hyper-parameter for fast convergence. For the existing methods (BCQ, BRAC, SAC), we search over three values, {10,000, 50,000, 100,000}, and choose 10,000 for BCQ and BRAC and 100,000 for SAC (Figure 5). For BREMEN, we also search over three values, {2,000, 4,000, 6,000}. Figure 6 shows the results of this search; we choose 2,000 in Ant, HalfCheetah, and Walker2d, and 6,000 in Hopper.


Stationary Noise in BREMEN
To achieve effective exploration, a stochastic Gaussian policy is a good choice. We found that adding stationary Gaussian noise to the policy in the imaginary trajectories and in data collection led to a notable improvement. The stationary Gaussian policy is written as

$$a_t \sim \mathcal{N}\!\left(\mu_\theta(s_t), \, \sigma^2 I\right),$$

with a fixed noise scale $\sigma$. Another choice is a learned Gaussian policy, which parameterizes not only the mean $\mu_\theta$ but also the standard deviation $\sigma_\theta$:

$$a_t \sim \mathcal{N}\!\left(\mu_\theta(s_t), \, \sigma_\theta(s_t)^2 I\right).$$

We utilize zero-mean Gaussian noise and tune the scale $\sigma$ in Figure 7 on HalfCheetah, comparing the stationary and learned strategies. From this experiment, we found that stationary noise with a scale of 0.1 consistently performs well, and therefore we used it for all our experiments.
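A minimal sketch of the two exploration-noise choices compared here (function names and array handling are our own assumptions):

```python
# Stationary vs. learned Gaussian exploration noise around a deterministic mean action.
import numpy as np

def stationary_gaussian_action(mu, sigma=0.1):
    """Fixed noise scale (the 0.1 scale chosen in these experiments)."""
    return mu + sigma * np.random.standard_normal(np.shape(mu))

def learned_gaussian_action(mu, log_std):
    """State-dependent, learned standard deviation output by the policy network."""
    return mu + np.exp(log_std) * np.random.standard_normal(np.shape(mu))
```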

Other Hyper-parameters in the Existing Methods
As for online ME-TRPO, we collect 3,000 steps through online interaction with the environment per 25 iterations and split these transitions into a 2-to-1 ratio of training and validation data for learning the dynamics models. In the batch-size-100,000 setting, we collect 2,000 steps and split them with a 1-to-1 ratio. In total, we run 12,500 iterations of policy optimization, which corresponds to 500 deployments of the policy. Note that we carefully tuned the hyper-parameters of online ME-TRPO, and its performance improved over that reported by Wang et al. [2019].
Table 4 and Table 5 show the tunable hyper-parameters of BCQ and BRAC, respectively. We refer to Wu et al. [2019] to choose these values. In this work, BRAC applies a primal form of the KL value penalty, and BRAC (max Q) means sampling multiple actions and taking the maximum according to the learned Q function.
| Parameter | Ant | HalfCheetah | Hopper | Walker2d |
|---|---|---|---|---|
| Policy learning rate | 3e-05 | 3e-04 | 3e-06 | 3e-05 |
| Perturbation range | 0.15 | 0.5 | 0.15 | 0.15 |
| Parameter | Ant | HalfCheetah | Hopper | Walker2d |
|---|---|---|---|---|
| Policy learning rate | 1e-4 | 1e-3 | 3e-5 | 1e-5 |
| Divergence penalty | 0.3 | 0.1 | 0.3 | 0.3 |
B.2.2 Offline RL
In the offline experiments, we apply the same hyper-parameters as in the deployment-efficient settings described above, except for the number of iterations per batch. Algorithm 2 shows pseudocode for BREMEN in offline RL settings, where policies are updated with only one fixed batch dataset. The number of iterations is set to 6,250 in BREMEN and 500,000 in BC, BCQ, and BRAC.
C Additional Experiment Results
C.1 Performance on Datasets with Different Noise
Following Wu et al. [2019] and Kidambi et al. [2020], we additionally compare BREMEN in offline settings to the other baselines (BC, BCQ, BRAC) on five datasets collected with different exploration noise (a sketch of the mixing procedure follows the list below). Each dataset also contains one million transitions.
- eps1: 40% of the dataset is collected by the data-collection policy (a partially trained SAC policy) $\pi_b$, 40% is collected by an epsilon-greedy version of $\pi_b$ with probability $\epsilon = 0.1$ of taking a random action, and 20% is collected by a uniformly random policy.
- eps3: Same as eps1, except the epsilon-greedy policy uses $\epsilon = 0.3$: 40% of the dataset is collected by $\pi_b$, 40% by the epsilon-greedy policy, and 20% by a uniformly random policy.
- gaussian1: 40% of the dataset is collected by the data-collection policy $\pi_b$, 40% by $\pi_b$ with zero-mean Gaussian noise of standard deviation 0.1 added to each sampled action, and 20% by a uniformly random policy.
- gaussian3: Same as gaussian1, but with zero-mean Gaussian noise of standard deviation 0.3: 40% by $\pi_b$, 40% by the noisy policy, and 20% by a uniformly random policy.
- random: All of the dataset is collected by a uniformly random policy.
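The following sketch (not our data-collection script; `collect`, `behavior_policy`, and the gym-style `env` are hypothetical placeholders) illustrates how such a mixed-noise dataset can be assembled following the 40/40/20 split described above:

```python
# Illustrative construction of an eps1-style mixed-noise dataset.
import numpy as np

def epsilon_greedy(policy, action_space, eps):
    def act(s):
        return action_space.sample() if np.random.rand() < eps else policy(s)
    return act

def build_noisy_dataset(env, behavior_policy, collect, total=1_000_000, eps=0.1):
    """40% behavior policy, 40% epsilon-greedy version, 20% uniformly random."""
    random_policy = lambda s: env.action_space.sample()
    data  = collect(env, behavior_policy, int(0.4 * total))
    data += collect(env, epsilon_greedy(behavior_policy, env.action_space, eps),
                    int(0.4 * total))
    data += collect(env, random_policy, int(0.2 * total))
    return data
```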
Table 6 shows that BREMEN can also achieve performance competitive with state-of-the-art model-free offline RL algorithms even with noisy datasets. The training curves for each experiment are shown in Section C.4.
Noise: eps1, 1,000,000 (1M) transitions

| Method | Ant | HalfCheetah | Hopper | Walker2d |
|---|---|---|---|---|
| Dataset | 1077 | 2936 | 791 | 815 |
| BC | 138171 | 3788740 | 266486 | 1185155 |
| BCQ | 1937116 | 6046276 | 800659 | 479537 |
| BRAC | 2693155 | 7003118 | 1243162 | 3204103 |
| BRAC (max Q) | 290798 | 707081 | 1488386 | 3330147 |
| BREMEN (Ours) | 3519129 | 7585425 | 281876 | 1177697 |
| ME-TRPO (offline) | 1514503 | 1009731 | 1301654 | 128153 |

Noise: eps3, 1,000,000 (1M) transitions

| Method | Ant | HalfCheetah | Hopper | Walker2d |
|---|---|---|---|---|
| Dataset | 936 | 2408 | 662 | 648 |
| BC | 1364121 | 2877797 | 519532 | 1066176 |
| BCQ | 193821 | 5739188 | 1170446 | 10181231 |
| BRAC | 271890 | 6434147 | 122471 | 2921101 |
| BRAC (max Q) | 291387 | 6672136 | 2103746 | 3079110 |
| BREMEN (Ours) | 3409218 | 7632104 | 280365 | 1161384 |
| ME-TRPO (offline) | 1843674 | 550467 | 1308756 | 354329 |

Noise: gaussian1, 1,000,000 (1M) transitions

| Method | Ant | HalfCheetah | Hopper | Walker2d |
|---|---|---|---|---|
| Dataset | 1072 | 3150 | 882 | 1070 |
| BC | 127980 | 4142189 | 3116 | 1137477 |
| BCQ | 195876 | 5854498 | 475416 | 608416 |
| BRAC | 290581 | 7026168 | 1456161 | 3030103 |
| BRAC (max Q) | 2910157 | 7026168 | 157589 | 324297 |
| BREMEN (Ours) | 2912165 | 7928313 | 1999617 | 1402290 |
| ME-TRPO (offline) | 1275656 | 1275656 | 909631 | 171119 |

Noise: gaussian3, 1,000,000 (1M) transitions

| Method | Ant | HalfCheetah | Hopper | Walker2d |
|---|---|---|---|---|
| Dataset | 1058 | 2872 | 781 | 981 |
| BC | 130034 | 419069 | 611467 | 1217361 |
| BCQ | 198297 | 5781543 | 1137582 | 258286 |
| BRAC | 3084180 | 39332740 | 1432499 | 3253118 |
| BRAC (max Q) | 291699 | 39972761 | 1417267 | 3372153 |
| BREMEN (Ours) | 3432185 | 8124145 | 1867354 | 2073245 |
| ME-TRPO (offline) | 1237310 | 2141872 | 973243 | 219145 |

Noise: random, 1,000,000 (1M) transitions

| Method | Ant | HalfCheetah | Hopper | Walker2d |
|---|---|---|---|---|
| Dataset | 470 | -285 | 34 | 2 |
| BC | 98910 | -21 | 10662 | 108110 |
| BCQ | 1222114 | 2887242 | 2067 | 22812 |
| BRAC | 105792 | 3449259 | 22730 | 2954 |
| BRAC (max Q) | 68357 | 3418171 | 22437 | 2650 |
| BREMEN (Ours) | 90511 | 3627193 | 27068 | 2546 |
| ME-TRPO (offline) | 2221665 | 2701120 | 32129 | 26213 |
C.2 Comparison among Different Numbers of Ensembles
To deal with the distribution shift, also known as model bias, during policy optimization, we introduce dynamics model ensembles. We validate the performance of BREMEN with different numbers of dynamics models $K$. Figure 8 and Figure 9 show the performance of BREMEN with different numbers of ensemble members in the deployment-efficient and offline settings, respectively. Ensembles with more dynamics models resulted in better performance due to the mitigation of distributional shift, with one exception, and we chose the ensemble size used in our experiments accordingly.


C.3 Implicit KL Control in Offline Settings
Similar to Section 5.3, we present offline experiments to better understand the effect of implicit KL regularization. In contrast to the implicit KL regularization via Eq. 4, the optimization of BREMEN with explicit KL penalty becomes
$$\theta_{k+1} = \arg\max_{\theta} \; \mathbb{E}_{(s, a) \sim \pi_{\theta_k}, \hat{f}_{\phi_i}} \left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_k}(a \mid s)} A^{\pi_{\theta_k}}(s, a) - \alpha D_{\mathrm{KL}}\!\left( \pi_\theta(\cdot \mid s) \,\|\, \hat{\pi}_\beta(\cdot \mid s) \right) \right]$$
$$\text{s.t.} \;\; \mathbb{E}_{s \sim \pi_{\theta_k}, \hat{f}_{\phi_i}} \left[ D_{\mathrm{KL}}\!\left( \pi_\theta(\cdot \mid s) \,\|\, \pi_{\theta_k}(\cdot \mid s) \right) \right] \le \delta \qquad (6)$$

where $A^{\pi_{\theta_k}}(s, a)$ is the advantage of $\pi_{\theta_k}$ computed using model-based rollouts in the learned dynamics model, $\delta$ is the maximum step size, and $\alpha$ weights the explicit KL penalty. Note that BREMEN with the explicit KL penalty does not utilize behavior cloning initialization.
We empirically conclude that the explicit constraint is unnecessary, and that the TRPO update with behavior initialization as implicit regularization is sufficient in the BREMEN algorithm. Figure 10 shows the KL divergence between learned policies and the last deployed policies (top row) and model errors measured by the mean squared error of the predicted next state against the true state (second row). We find that the behavior-initialized policy with conservative KL trust-region updates stays close to the last deployed policy during improvement, even without an explicit KL penalty. The policy initialized with behavior cloning also tended to suppress the increase of model error, which implies that behavior initialization alleviates the effect of the distribution shift. In Walker2d, the model error of BREMEN is relatively large, which may relate to its poor performance with noisy datasets in Section C.1.

C.4 Training Curves for Offline RL with Different Noise





