Deployment-Efficient Reinforcement Learning via Model-Based Offline Optimization

06/05/2020 ∙ by Tatsuya Matsushima, et al. ∙ The University of Tokyo and Google

Most reinforcement learning (RL) algorithms assume online access to the environment, in which one may readily interleave updates to the policy with experience collection using that policy. However, in many real-world applications such as health, education, dialogue agents, and robotics, the cost or potential risk of deploying a new data-collection policy is high, to the point that it can become prohibitive to update the data-collection policy more than a few times during learning. With this view, we propose a novel concept of deployment efficiency, measuring the number of distinct data-collection policies that are used during policy learning. We observe that naïvely applying existing model-free offline RL algorithms recursively does not lead to a practical deployment-efficient and sample-efficient algorithm. We propose a novel model-based algorithm, Behavior-Regularized Model-ENsemble (BREMEN) that can effectively optimize a policy offline using 10-20 times fewer data than prior works. Furthermore, the recursive application of BREMEN is able to achieve impressive deployment efficiency while maintaining the same or better sample efficiency, learning successful policies from scratch on simulated robotic environments with only 5-10 deployments, compared to typical values of hundreds to millions in standard RL baselines.


1 Introduction

Reinforcement learning (RL) algorithms have recently demonstrated impressive success in learning behaviors for a variety of sequential decision-making tasks Barth-Maron et al. (2018); Hessel et al. (2018); Nachum et al. (2019). Virtually all of these demonstrations have relied on highly-frequent online access to the environment, with the RL algorithms often interleaving each update to the policy with additional experience collection of that policy acting in the environment. However, in many real-world applications of RL, such as health Murphy et al. (2001), education Mandel et al. (2014), dialog agents (Jaques et al., 2019), and robotics Gu et al. (2017a); Kalashnikov et al. (2018), the deployment of a new data-collection policy may be associated with a number of costs and risks. If we can learn tasks with a small number of data collection policies, we can substantially reduce these costs and risks.

Based on this idea, we propose a novel measure of RL algorithm performance, namely deployment efficiency, which counts the number of changes in the data-collection policy during learning, as illustrated in Figure 1. This concept may be seen in contrast to sample efficiency or data efficiency Precup et al. (2001); Degris et al. (2012); Gu et al. (2017b); Haarnoja et al. (2018); Lillicrap et al. (2016); Nachum et al. (2018), which measures the amount of environment interactions incurred during training, without regard to how many distinct policies were deployed to perform those interactions. Even when the data efficiency is high, the deployment efficiency could be low, since many on-policy and off-policy algorithms alternate data collection with each policy update (Schulman et al., 2015; Lillicrap et al., 2016; Gu et al., 2016; Haarnoja et al., 2018). Such dependence on high-frequency policy deployments is best illustrated in the recent works in offline RL Fujimoto et al. (2019); Jaques et al. (2019); Kumar et al. (2019); Levine et al. (2020); Wu et al. (2019), where baseline off-policy algorithms exhibited poor performance when trained on a static dataset. These offline RL works, however, limit their study to a single deployment, which is enough for achieving high performance with data collected from a sub-optimal behavior policy, but often not from a random policy. In contrast to those prior works, we aim to learn successful policies from scratch with minimal amounts of data and deployments.

Many existing model-free offline RL algorithms (Levine et al., 2020) are tuned and evaluated on large datasets (e.g., one million transitions). In order to develop an algorithm that is both sample-efficient and deployment-efficient, each iteration of the algorithm between successive deployments has to work effectively on much smaller dataset sizes. We believe model-based RL is better suited to this setting due to its higher demonstrated sample efficiency than model-free RL Kurutach et al. (2018); Nagabandi et al. (2018). Although the combination of model-based RL and offline or limited-deployment settings seems straight-forward, we find this naïve approach leads to poor performance. This problem can be attributed to extrapolation errors Fujimoto et al. (2019) similar to those observed in model-free methods. Specifically, the learned policy may choose sequences of actions which lead it to regions of the state space where the dynamics model cannot predict properly, due to poor coverage of the dataset. This can lead the policy to exploit approximation errors of the dynamics model and be disastrous for learning. In model-free settings, similar data distribution shift problems are typically remedied by regularizing policy updates explicitly with a divergence from the observed data distribution Jaques et al. (2019); Kumar et al. (2019); Wu et al. (2019), which, however, can overly limit policies’ expressivity Sohn et al. (2020).

In order to better approach these problems arising in limited deployment settings, we propose Behavior-Regularized Model-ENsemble (BREMEN), which learns an ensemble of dynamics models in conjunction with a policy using imaginary rollouts while implicitly regularizing the learned policy via appropriate parameter initialization and conservative trust-region learning updates. We evaluate BREMEN on high-dimensional continuous control benchmarks and find that it achieves impressive deployment efficiency. BREMEN is able to learn successful policies with only 5-10 deployments, significantly outperforming existing off-policy and offline RL algorithms in this deployment-constrained setting. We further evaluate BREMEN on standard offline RL benchmarks, where only a single static dataset is used. In this fixed-batch setting, our experiments show that BREMEN can not only achieve performance competitive with state-of-the-art when using standard dataset sizes but also learn with 10-20 times smaller datasets, which previous methods are unable to attain.

Figure 1: Deployment efficiency is defined as the number of changes in the data-collection policy, which is vital for managing the costs and risks of new policy deployment. Online RL algorithms typically require many iterations of policy deployment and data collection, which leads to extremely low deployment efficiency. In contrast, most pure offline algorithms consider updating a policy from a fixed dataset without additional deployment and often fail to learn from a randomly initialized data-collection policy. Interestingly, most state-of-the-art off-policy algorithms are still evaluated in heavily online settings. For example, SAC Haarnoja et al. (2018) collects one sample per policy update, amounting to 100,000 to 1 million deployments for learning standard benchmark domains.

2 Preliminaries

We consider a Markov Decision Process (MDP) setting, characterized by the tuple $(\mathcal{S}, \mathcal{A}, p, r, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $p(s_{t+1} \mid s_t, a_t)$ is the transition probability distribution or dynamics, $r(s_t, a_t)$ is the reward function, and $\gamma \in (0, 1)$ is the discount factor. A policy $\pi(a \mid s)$ is a function that determines the agent behavior, mapping from states to probability distributions over actions. The goal is to obtain the optimal policy $\pi^*$ as

$$\pi^* = \operatorname*{arg\,max}_{\pi} \; \eta(\pi), \qquad \eta(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t)\right],$$

where $\eta(\pi)$ is the expectation of the discounted sum of rewards under the policy $\pi$. The transition probability $p(s_{t+1} \mid s_t, a_t)$ is usually unknown, and in model-based RL it is estimated with a parameterized dynamics model $\hat{f}_{\phi}$ (e.g., a neural network). For simplicity, we assume that the reward function $r(s_t, a_t)$ is known and can be computed for any arbitrary state, but we can easily extend to the unknown setting and predict it with a parameterized function.
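As a concrete reference for this objective, the following is a minimal Monte Carlo sketch of estimating the discounted return $\eta(\pi)$; it assumes a gym-style environment exposing `reset`/`step` and a policy object with a hypothetical `sample_action` method.

import numpy as np

def estimate_return(env, policy, gamma=0.99, episodes=10, horizon=1000):
    """Monte Carlo estimate of the discounted return E[sum_t gamma^t r(s_t, a_t)] under `policy`."""
    returns = []
    for _ in range(episodes):
        s = env.reset()
        total, discount = 0.0, 1.0
        for _ in range(horizon):
            a = policy.sample_action(s)      # a_t ~ pi(. | s_t); `sample_action` is a hypothetical interface
            s, r, done, _ = env.step(a)      # environment transition and reward (gym-style API assumed)
            total += discount * r
            discount *= gamma
            if done:
                break
        returns.append(total)
    return float(np.mean(returns))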

On-policy vs Off-policy, Online vs Offline At a high level, most RL algorithms iterate many times between collecting a batch of transitions (deployments) and optimizing the policy (learning). If the algorithms discard data after each policy update, they are on-policy (Schulman et al., 2015, 2017), while if they accumulate data in a buffer $\mathcal{D}$, i.e. experience replay (Lin, 1992), they are off-policy (Mnih et al., 2015; Lillicrap et al., 2016; Gu et al., 2016, 2017b; Haarnoja et al., 2018; Fujimoto et al., 2019), because not all the data in the buffer comes from the current policy. However, we consider all of these algorithms to be online RL algorithms, since they involve many deployments during learning, ranging from hundreds to millions. On the other hand, in pure offline RL, one does not assume direct interaction and learns a policy from only a fixed dataset, which effectively corresponds to a single deployment allowed for learning. Classically, semi-batch RL algorithms Lange et al. (2012); Singh et al. (1995) interpolated between these two extremes, improving the policy through repetitions of collecting a large batch of transitions and performing many or full policy updates. While these semi-batch RL algorithms also achieve good deployment efficiency, they have not been extensively studied with neural network function approximators or in off-policy settings with experience replay for scalable sample-efficient learning. In our work, we aim to achieve both high deployment efficiency and sample efficiency by developing an algorithm that can solve tasks with minimal policy deployments as well as minimal transition samples.

3 Deployment Efficiency

Deploying a new policy for data collection can be associated with a number of costs and risks for many real-world applications like medicine or robotic control Murphy et al. (2001); Mandel et al. (2014); Gu et al. (2017a); Kalashnikov et al. (2018); Nachum et al. (2019). While there is an abundance of works on safety for RL (Chow et al., 2015; Eysenbach et al., 2018; Chow et al., 2018; Ray et al., 2019; Chow et al., 2019), these methods often do not provide guarantees in practice when combined with neural networks and stochastic optimization. It is therefore necessary to validate each policy before deployment. Due to the cost associated with each deployment, it is desirable to minimize the number of distinct deployments needed during the learning process.

In order to focus research on these practical bottlenecks, we propose a novel measure of RL algorithms, namely, deployment efficiency, which counts how many times the data-collection policy has been changed during improvement from a random policy to one that solves the task. For example, if an RL algorithm operates by using its learned policy to collect transitions from the environment $I$ times, each time collecting a batch of $B$ new transitions, then the number of deployments is $I$, while the total number of samples collected is $I \times B$. The lower $I$ is, the more deployment-efficient the algorithm is; in contrast, sample efficiency looks at $I \times B$. Online RL algorithms, whether they are on-policy or off-policy, typically update the policy and acquire new transitions by deploying the newly updated policy at every iteration. This corresponds to performing hundreds to millions of deployments during learning on standard benchmarks (Haarnoja et al., 2018), which is severely deployment-inefficient. On the other hand, the offline RL literature only studies the case of a single deployment. A deployment-efficient algorithm would stand in the middle of these two extremes and ideally learn a successful policy from scratch while deploying only a few distinct policies, as illustrated in Figure 1.
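To make the bookkeeping concrete, here is a minimal sketch of a deployment-constrained training loop; `collect_batch` and `update_policy_offline` are hypothetical placeholders for data collection with a fixed policy and for the offline learning step, respectively.

def train_with_limited_deployments(collect_batch, update_policy_offline, policy,
                                   num_deployments=5, batch_size=200_000):
    """Deployment efficiency counts num_deployments (I); sample efficiency counts I * B samples."""
    dataset = []
    for _ in range(num_deployments):                     # each loop iteration deploys exactly one policy
        dataset += collect_batch(policy, batch_size)     # B transitions collected with a *fixed* policy
        policy = update_policy_offline(policy, dataset)  # many offline updates, zero extra deployments
    return policy, num_deployments * batch_size          # (final policy, total samples I * B)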

Recent deep RL literature seldom emphasizes deployment efficiency, with few exceptions in specific applications Kalashnikov et al. (2018) where such a learning procedure is necessary. Although current state-of-the-art algorithms for continuous control have substantially improved sample or data efficiency, they have not been optimized for deployment efficiency. For example, SAC Haarnoja et al. (2018), an efficient model-free off-policy algorithm, performs half a million to one million policy deployments during learning on MuJoCo Todorov et al. (2012) benchmarks. ME-TRPO Kurutach et al. (2018), a model-based algorithm, performs far fewer, 100-300, policy deployments, although this is still relatively high for practical settings. (We examined the number of deployments by checking the original implementations; the frequency of data collection is a tunable hyper-parameter.) In our work, we demonstrate successful learning on standard benchmark environments with only 5-10 deployments.

4 Behavior-Regularized Model-Ensemble

To achieve high deployment efficiency, we propose Behavior-Regularized Model-ENsemble (BREMEN). BREMEN incorporates Dyna-style Sutton (1991) model-based RL, learning an ensemble of dynamics models in conjunction with a policy using imaginary rollouts from the ensemble and behavior regularization via conservative trust-region updates.

4.1 Imaginary Rollout from Model Ensemble

As in recent Dyna-style model-based RL methods Kurutach et al. (2018); Wang et al. (2019), BREMEN uses an ensemble of $K$ deterministic dynamics models $\{\hat{f}_{\phi_1}, \ldots, \hat{f}_{\phi_K}\}$ to alleviate the problem of model bias. Each model $\hat{f}_{\phi_i}$ is parameterized by $\phi_i$ and trained with the following objective, which minimizes the mean squared error between the predicted next state $\hat{f}_{\phi_i}(s_t, a_t)$ and the true next state $s_{t+1}$ over a dataset $\mathcal{D}$:

$$\min_{\phi_i} \; \frac{1}{|\mathcal{D}|} \sum_{(s_t, a_t, s_{t+1}) \in \mathcal{D}} \frac{1}{2} \left\| s_{t+1} - \hat{f}_{\phi_i}(s_t, a_t) \right\|_2^2 . \qquad (1)$$

During training of a policy $\pi_\theta$, imagined trajectories of states and actions are generated sequentially, using a dynamics model $\hat{f}_{\phi_i}$ that is randomly selected at each time step:

$$\hat{s}_{t+1} = \hat{f}_{\phi_i}(\hat{s}_t, a_t), \qquad a_t \sim \pi_\theta(\cdot \mid \hat{s}_t), \qquad i \sim \mathrm{Uniform}\{1, \ldots, K\}. \qquad (2)$$
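The following is a minimal PyTorch sketch of Eqs. 1 and 2 (not the authors' implementation; class and function names are illustrative). It assumes `dataset` yields mini-batches of (s, a, s') tensors and `policy` maps a state tensor to an action tensor.

import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Deterministic model f_phi(s, a) -> predicted next state (Eq. 1)."""
    def __init__(self, state_dim, action_dim, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def train_ensemble(models, dataset, epochs=5, lr=1e-3):
    """Fit each ensemble member to (s, a, s') tuples by mean squared error (Eq. 1)."""
    for f in models:
        opt = torch.optim.Adam(f.parameters(), lr=lr)
        for _ in range(epochs):
            for s, a, s_next in dataset:                 # dataset is assumed to yield tensor mini-batches
                loss = ((f(s, a) - s_next) ** 2).mean()
                opt.zero_grad(); loss.backward(); opt.step()

def imaginary_rollout(models, policy, s0, horizon=250):
    """Generate a model-based rollout, sampling a random ensemble member at every step (Eq. 2)."""
    s, traj = s0, []
    for _ in range(horizon):
        a = policy(s)                                        # a_t ~ pi_theta(. | s_t)
        f = models[torch.randint(len(models), (1,)).item()]  # pick a random model each step
        s_next = f(s, a)
        traj.append((s, a, s_next))
        s = s_next.detach()
    return traj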

4.2 Policy Update with Behavior Regularization

In order to manage the discrepancy between the true dynamics and the learned model caused by the distribution shift in batch settings, we propose to use iterative policy updates via a trust-region constraint, re-initialized with a behavior-cloned policy after every deployment. Specifically, after each deployment, we are given an updated dataset of experience transitions $\mathcal{D}$. With this dataset, we approximate the true behavior policy $\pi_b$ through behavior cloning (BC), utilizing a neural network $\hat{\pi}_\beta$ parameterized by $\beta$, where we implicitly assume a fixed variance, a common practice in BC (Rajeswaran et al., 2017):

$$\min_{\beta} \; \frac{1}{|\mathcal{D}|} \sum_{(s_t, a_t) \in \mathcal{D}} \frac{1}{2} \left\| a_t - \hat{\pi}_\beta(s_t) \right\|_2^2 . \qquad (3)$$

After obtaining the estimated behavior policy, we initialize the target policy $\pi_\theta$ as a Gaussian policy with mean from $\hat{\pi}_\beta$ and a fixed standard deviation. This BC initialization, in conjunction with gradient-descent-based optimization, may be seen as implicitly biasing the optimized policy to be close to the data-collection policy Nagarajan and Kolter (2019), and thus works as a remedy for the distribution shift problem (Ross et al., 2011). To further bias the learned policy to be close to the data-collection policy, we opt to use KL-based trust-region optimization Schulman et al. (2015). The optimization of BREMEN therefore becomes

$$\theta_{k+1} = \operatorname*{arg\,max}_{\theta} \; \mathbb{E}_{(s, a) \sim \pi_{\theta_k}, \hat{f}_{\phi_i}} \left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_k}(a \mid s)} \, A^{\pi_{\theta_k}}(s, a) \right] \quad \text{s.t.} \;\; \mathbb{E}_{s} \big[ D_{\mathrm{KL}}\!\left( \pi_\theta(\cdot \mid s) \,\|\, \pi_{\theta_k}(\cdot \mid s) \right) \big] \le \delta, \qquad \theta_0 = \beta, \qquad (4)$$

where $A^{\pi_{\theta_k}}(s, a)$ is the advantage of $\pi_{\theta_k}$ computed using model-based rollouts in the learned dynamics model and $\delta$ is the maximum step size.

The combination of BC for initialization and finite iterative trust-region updates serves as an implicit KL regularization, as discussed in Section 4.3. This is in contrast to many previous offline RL algorithms that augment the value function with a penalty of explicit KL divergence Siegel et al. (2020); Wu et al. (2019) or maximum mean discrepancy Kumar et al. (2019). Empirically, we found that our regularization technique outperforms the explicit KL penalty (see Section 5.3).
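A minimal sketch of the behavior cloning step (Eq. 3) and the subsequent initialization, again in PyTorch with illustrative names; the initial log standard deviation is an assumed value, and the trust-region step of Eq. 4 is left to a TRPO-style optimizer.

import copy
import torch
import torch.nn as nn

def behavior_clone(dataset, state_dim, action_dim, epochs=50, lr=5e-4):
    """Fit a deterministic mean network to the dataset's (s, a) pairs (Eq. 3)."""
    mean_net = nn.Sequential(
        nn.Linear(state_dim, 200), nn.Tanh(),
        nn.Linear(200, 200), nn.Tanh(),
        nn.Linear(200, action_dim))
    opt = torch.optim.Adam(mean_net.parameters(), lr=lr)
    for _ in range(epochs):
        for s, a in dataset:                        # mini-batches of (state, action) tensors assumed
            loss = ((mean_net(s) - a) ** 2).mean()  # MSE == Gaussian log-likelihood with fixed variance
            opt.zero_grad(); loss.backward(); opt.step()
    return mean_net

def init_target_policy(mean_net, log_std_init=-1.0):
    """Initialize pi_theta as a Gaussian whose mean network is the behavior-cloned network."""
    policy_mean = copy.deepcopy(mean_net)           # theta_0 <- beta (BC initialization)
    action_dim = policy_mean[-1].out_features
    log_std = torch.full((action_dim,), log_std_init, requires_grad=True)
    return policy_mean, log_std                     # later TRPO steps keep KL(pi_theta || pi_theta_k) <= delta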

By recursively performing this offline procedure, BREMEN can be used for deployment-efficient learning as shown in Algorithm 1, starting from a randomly initialized policy, collecting experience data, and performing offline policy updates.

0:  Empty dataset $\mathcal{D}$, initial parameters $\theta_0$, $\phi_1, \ldots, \phi_K$, number of policy optimization iterations $T$, number of deployments $I$.
1:  Randomly initialize the target policy $\pi_\theta$.
2:  for deployment $i = 1, \ldots, I$ do
3:     Collect $B$ transitions in the true environment using $\pi_\theta$ and add them to the dataset $\mathcal{D}$.
4:     Train $K$ dynamics models $\hat{f}_{\phi_1}, \ldots, \hat{f}_{\phi_K}$ on $\mathcal{D}$ via Eq. 1.
5:     Train the estimated behavior policy $\hat{\pi}_\beta$ on $\mathcal{D}$ by behavior cloning via Eq. 3.
6:     Re-initialize the target policy $\pi_\theta$ as a Gaussian with mean $\hat{\pi}_\beta$ and fixed standard deviation.
7:     for policy optimization step $t = 1, \ldots, T$ do
8:         Generate imaginary rollouts via Eq. 2.
9:         Optimize the target policy $\pi_\theta$ subject to Eq. 4 with the rollouts.
Algorithm 1 BREMEN for Deployment-Efficient RL
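A compact sketch of the outer loop of Algorithm 1, reusing the hypothetical helpers sketched above (`DynamicsModel`, `train_ensemble`, `behavior_clone`, `init_target_policy`, `imaginary_rollout`); `collect_batch`, `trust_region_step`, and `sample_start` are placeholders for data collection, the Eq. 4 update, and drawing a start state from the dataset, and exploration noise and batching details are glossed over.

def bremen(collect_batch, trust_region_step, sample_start, state_dim, action_dim,
           num_deployments=5, batch_size=200_000, num_policy_updates=2_000, ensemble_size=5):
    """Deployment-efficient BREMEN loop (Algorithm 1), written against the sketches above."""
    dataset, policy = [], None                                 # policy=None -> placeholder collects randomly
    models = [DynamicsModel(state_dim, action_dim) for _ in range(ensemble_size)]
    for _ in range(num_deployments):
        dataset += collect_batch(policy, batch_size)           # one deployment of the current policy
        train_ensemble(models, dataset)                        # Eq. 1
        mean_net = behavior_clone(dataset, state_dim, action_dim)  # Eq. 3
        policy = init_target_policy(mean_net)                  # re-initialize pi_theta from the BC policy
        for _ in range(num_policy_updates):
            rollout = imaginary_rollout(models, policy[0], sample_start(dataset))  # Eq. 2 (mean action only)
            policy = trust_region_step(policy, rollout)        # Eq. 4
    return policy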

4.3 Implicit KL Control from a Mathematical Perspective

We can intuitively understand that behavior cloning initialization with trust-region updates works as a regularization of distributional shift, and this can be supported by theory. Following the notation of Janner et al. (2019), we denote the generalization error of the dynamics model on the state distribution under the true behavior policy $\pi_b$ as $\epsilon_m = \max_t \mathbb{E}_{(s, a) \sim \pi_{b, t}}\big[ D_{\mathrm{TV}}\!\left( p(\cdot \mid s, a) \,\|\, \hat{p}(\cdot \mid s, a) \right) \big]$, where $D_{\mathrm{TV}}$ represents the total variation distance between the true dynamics $p$ and the learned model $\hat{p}$. We also denote the distribution shift on the target policy $\pi$ as $\epsilon_\pi = \max_s D_{\mathrm{TV}}\!\left( \pi_b(\cdot \mid s) \,\|\, \pi(\cdot \mid s) \right)$. A bound relating the true returns $\eta[\pi]$ and the model returns $\hat{\eta}[\pi]$ of the target policy is given in Janner et al. (2019) as

$$\eta[\pi] \;\ge\; \hat{\eta}[\pi] - \left[ \frac{2 \gamma r_{\max} (\epsilon_m + 2 \epsilon_\pi)}{(1 - \gamma)^2} + \frac{4 r_{\max} \epsilon_\pi}{1 - \gamma} \right]. \qquad (5)$$

This bound guarantees improvement under the true returns as long as the improvement under the model returns increases by more than the slack in the bound due to $\epsilon_m$ and $\epsilon_\pi$ Janner et al. (2019); Levine et al. (2020).
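As a supporting step (not the paper's full proposition), Pinsker's inequality is the standard tool that converts the KL quantities controlled by BREMEN into the total-variation quantities appearing in this bound:

$$D_{\mathrm{TV}}(p \,\|\, q) \;\le\; \sqrt{\tfrac{1}{2} D_{\mathrm{KL}}(p \,\|\, q)}.$$

In the idealized (supremum) setting, a single trust-region step with KL budget $\delta$ therefore moves the policy by at most $\sqrt{\delta / 2}$ in total variation at every state, and after $k$ such steps the triangle inequality bounds the drift from the behavior-cloned initialization by $k \sqrt{\delta / 2}$ on top of the behavior cloning error, which is how $\epsilon_\pi$ remains controlled.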

We may relate this bound to the specific learning procedure employed by BREMEN, which includes dynamics model learning, behavior cloning policy initialization, and conservative KL-based trust-region policy updates. To do so, we consider an idealized version of BREMEN, where the expectations over states in equations 1, 3, and 4 are replaced with supremums and the dynamics model is set to have unit variance.

Proposition 1 (Policy and model error bound).

Suppose we apply the idealized BREMEN to a dataset $\mathcal{D}$, and define $\epsilon_\pi$ and $\epsilon_m$ in terms of the behavior cloning and dynamics model losses as,

where $\mathcal{H}$ denotes the Shannon entropy. If one then applies KL-based trust-region steps of step size $\delta$ (Eq. 4) using stochastic dynamics models with mean $\hat{f}_{\phi_i}$ and standard deviation 1, then

Proof.

See Appendix A. ∎

5 Experiments

We evaluate BREMEN in both deployment-efficient settings, where the algorithm must learn a policy from scratch via a limited number of deployments, and offline RL, where the algorithm is given only a single static dataset. We use four standard continuous control benchmarks for offline RL Kumar et al. (2019); Wu et al. (2019), namely, Ant, HalfCheetah, Hopper, and Walker2d on the MuJoCo physics simulator Todorov et al. (2012). See Appendix B and C for further details and results.

5.1 Evaluating Deployment Efficiency

We compare BREMEN to ME-TRPO, SAC, BCQ, and BRAC applied to limited deployment settings. To adapt the offline methods (BCQ, BRAC) to this setting, we simply apply them in a recursive fashion (recursive BCQ and BRAC also perform behavior-cloning-based policy initialization after each deployment): at each deployment iteration, we collect a batch of data with the most recent policy and then run the offline update on this dataset. As for SAC, we simply change the replay buffer to update only at the specified deployment intervals. For the sake of comparison, we align the number of deployments and the amount of data collected at each deployment (either 100,000 or 200,000 transitions) for all methods.

Figure 2 shows the results with 200,000 (top) and 100,000 (bottom) batched transitions per deployment. Regardless of the environments and the batch size per update, BREMEN achieves remarkable performance while existing online and offline RL methods struggle to make any progress in the limited deployment settings. As a point of comparison, we also include results for online SAC and ME-TRPO without limits on the number of deployments but using the same number of transitions.

Figure 2: Evaluation of BREMEN against existing methods (ME-TRPO, SAC, BCQ, BRAC) under deployment constraints (5-10 deployments with batch sizes of 200K and 100K). The average cumulative rewards and their standard deviations over 5 random seeds are shown. Vertical dotted lines indicate where each policy deployment and data collection happens. BREMEN is able to learn successful policies with only 5-10 deployments, while the state-of-the-art off-policy (SAC), model-based (ME-TRPO), and recursively applied offline RL algorithms (BCQ, BRAC) often struggle to make any progress. For completeness, we show ME-TRPO (online) and SAC (online), which are the original learning curves without deployment constraints, plotted against samples normalized by the batch size. While SAC (online) substantially outperforms BREMEN in sample efficiency, it uses one deployment per sample, leading to 100K-500K deployments for learning. Interestingly, BREMEN achieves even better performance than the original ME-TRPO (online), suggesting the effectiveness of implicit behavior regularization. For SAC and ME-TRPO under the deployment-constrained evaluation, the batch size between policy deployments differs substantially from their standard settings, and we therefore performed an extensive hyper-parameter search on the relevant parameters, such as the number of policy updates between deployments, as discussed in Appendix B.2.1.

5.2 Evaluating Offline Learning

We also evaluate BREMEN on standard offline RL benchmarks following Wu et al. (2019). We first train online SAC up to a certain cumulative reward threshold (4,000 in HalfCheetah; 1,000 in Ant, Hopper, and Walker2d) and collect the offline datasets. We evaluate agents on offline datasets of one million (1M) transitions, which is standard for BCQ and BRAC Wu et al. (2019). We then evaluate them on much smaller datasets of 50K and 100K transitions, i.e., 5-10% of the dataset sizes used in prior works.

Table 1 shows that BREMEN achieves performance competitive with state-of-the-art model-free offline RL algorithms when using the standard dataset size of 1M. Moreover, BREMEN can also learn appropriately with 10-20 times smaller datasets, where BCQ and BRAC are unable to exceed even the BC baseline. As a result, our recursive BREMEN algorithm is not only deployment-efficient but also sample-efficient, and significantly outperforms the baselines.

 

1,000,000 (1M) transitions
Method Ant HalfCheetah Hopper Walker2d
Dataset 1191 4126 1128 1376
BC 1321141 428112 1341161 1421147
BCQ Fujimoto et al. (2019) 202131 5783272 1130127 2153753
BRAC Wu et al. (2019) 2072285 7192115 142290 22391124
BRAC (max Q) 2369234 732091 1916343 24091210
BREMEN (Ours) 3328275 8055103 2058852 2346230
ME-TRPO (offline) Kurutach et al. (2018) 1258550 1804924 51891 211154

 

100,000 (100K) transitions
Method Ant HalfCheetah Hopper Walker2d
Dataset 1191 4066 1128 1376
BC 133081 426621 1322109 142647
BCQ 1363199 3915411 1129238 2187196
BRAC -157383 25052501 131070 21621109
BRAC (max Q) -226387 23322422 1422101 21641114
BREMEN (Ours) 1633127 6095370 2191455 2132301
ME-TRPO (offline) 9744 2434 307170 1061

 

50,000 (50K) transitions
Method Ant HalfCheetah Hopper Walker2d
Dataset 1191 4138 1128 1376
BC 127065 423049 124961 1420194
BCQ 132995 1319626 1178235 1841439
BRAC -878244 -59773 1277102 9761207
BRAC (max Q) -843279 -59056 1276225 9031137
BREMEN (Ours) 1347283 5823146 1632796 2280647
ME-TRPO (offline) 93832 -7395 15213 176343

 

Table 1: Comparison of BREMEN to existing offline methods on static datasets. Each cell shows the average cumulative reward and its standard deviation for dataset sizes of 1M, 100K, and 50K, respectively. The maximum number of steps per episode is 1,000. BRAC applies a primal form of the KL value penalty, and BRAC (max Q) denotes the variant that samples multiple actions and takes the maximum according to the learned Q function.

5.3 Evaluating Effectiveness of Implicit KL Control

In this section, we present an experiment to better understand the effect of BREMEN's implicit regularization. Figure 3 shows the KL divergence of learned policies from the last deployed policy. We compare BREMEN to variants that use an explicit KL penalty on the value instead of BC initialization (conservative KL trust-region updates are still used). We find that the variants with an explicit KL penalty but without behavior initialization learn policies that move farther away from the last deployed policy than the behavior-initialized policies do. This suggests that the implicit behavior regularization employed by BREMEN is more effective as a conservative policy learning protocol.

Figure 3: We examine average cumulative rewards (top) and the corresponding KL divergence between the last deployed policy and the target policy (bottom) with batch size 200K in limited deployment settings. The behavior-initialized policy remains close to the last deployed policy during improvement without an explicit value penalty. The explicit penalty is controlled by a coefficient.
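For reference, the monitored quantity has a closed form for diagonal Gaussian policies; the following sketch (illustrative, and not necessarily the exact evaluation protocol used for Figure 3) computes it from the means and log standard deviations of the two policies evaluated on a batch of states.

import numpy as np

def gaussian_policy_kl(mu_p, log_std_p, mu_q, log_std_q):
    """KL( N(mu_p, std_p^2) || N(mu_q, std_q^2) ) for diagonal Gaussians, summed over action dimensions.

    Here p is the current target policy and q the last deployed policy, both evaluated
    on the same batch of states (all arguments are NumPy arrays of shape [batch, action_dim]).
    """
    std_p, std_q = np.exp(log_std_p), np.exp(log_std_q)
    kl = (log_std_q - log_std_p
          + (std_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * std_q ** 2)
          - 0.5)
    return kl.sum(axis=-1).mean()   # sum over action dims, average over the state batch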

6 Related Work

Deployment Efficiency and Offline RL

Although we are not aware of any previous work that explicitly proposed the concept of deployment efficiency, its necessity in many real-world applications has been generally known. One may view previously proposed semi-batch RL algorithms Ernst et al. (2005); Lange et al. (2012); Singh et al. (1994) as approaching this issue. More recently, a related but distinct problem known as offline RL has gained popularity Levine et al. (2020); Wu et al. (2019). These offline RL works consider the extreme case of a single deployment and typically collect the static batch with a partially trained policy rather than a random policy. While offline RL has shown promising results for a variety of real-world applications, such as robotics Mandlekar et al. (2019), dialogue systems Jaques et al. (2019), or medical treatments Gottesman et al. (2018), these algorithms struggle when learning a policy from scratch or when the dataset is small. Nevertheless, common themes of many offline RL algorithms, namely regularizing the learned policy toward the behavior policy Fujimoto et al. (2019); Jaques et al. (2019); Kumar et al. (2019); Siegel et al. (2020); Wu et al. (2019) and utilizing ensembles to handle uncertainty Kumar et al. (2019); Wu et al. (2019), served as inspirations for the proposed BREMEN algorithm. A major difference of BREMEN from prior works is that the target policy is not explicitly forced to stay close to the estimated behavior policy through the policy update. Rather, BREMEN employs a more implicit regularization by initializing the learned policy with a behavior-cloned policy and then applying conservative trust-region updates. Another major difference is the application of model-based approaches to fully offline settings, which has not been extensively studied in prior works Levine et al. (2020), except for the two concurrent works of Kidambi et al. (2020) and Yu et al. (2020) that study pessimistic or uncertainty-penalized MDPs with guarantees, closely related to Liu et al. (2019). By contrast, our work shows that a simple technique can already enable model-based offline algorithms to significantly outperform prior model-free methods, and is, to the best of our knowledge, the first to define and extensively evaluate deployment efficiency with recursive experiments.

Model-Based RL

There are many types of model-based RL algorithms (Sutton, 1991; Deisenroth and Rasmussen, 2011; Heess et al., 2015). A simple algorithmic choice is Dyna-style Sutton (1991), which uses a parameterized dynamics model to estimate the true MDP transition function, stochastically mapping states and actions to next states. The dynamics model can then serve as a simulator of the environment during policy updates. Dyna-style algorithms often suffer from distributional shift, also known as model bias, which leads RL agents to exploit regions where the data is insufficient and can cause significant performance degradation. A variety of remedies have been proposed to relieve the problem of model bias, such as the use of multiple dynamics models as an ensemble Chua et al. (2018); Kurutach et al. (2018); Janner et al. (2019), meta-learning Clavera et al. (2018), energy-based model regularizers Boney et al. (2019), and explicit reward penalties for unknown states Kidambi et al. (2020); Yu et al. (2020). Notably, we employ a subset of these remedies, namely model ensembles and trust-region updates Kurutach et al. (2018), in BREMEN. Compared to existing works, our work is notable for using BC initialization in conjunction with trust-region updates to alleviate the distribution shift of the learned policy from the dataset used to train the dynamics model.

7 Conclusion

In this work, we introduced deployment efficiency, a novel measure of RL performance that counts the number of changes in the data-collection policy during learning. To enhance deployment efficiency, we proposed Behavior-Regularized Model-ENsemble (BREMEN), a novel model-based offline algorithm with implicit KL regularization via appropriate policy initialization and trust-region updates. BREMEN shows impressive results in limited deployment settings, obtaining successful policies from scratch in only 5-10 deployments, as it can improve policies offline even when the batch size is 10-20 times smaller than in prior works. Not only can this help alleviate costs and risks in real-world applications, but it can also reduce the amount of communication required during distributed learning and could form the basis for communication-efficient large-scale RL, in contrast to prior works Nair et al. (2015); Espeholt et al. (2018, 2019). Most critically, we show that under deployment efficiency constraints, most prior algorithms, whether model-free or model-based, online or offline, fail to achieve successful learning. We hope our work encourages the research community to value deployment efficiency as an important criterion for RL algorithms, and to eventually achieve sample efficiency and asymptotic performance similar to state-of-the-art algorithms like SAC (Haarnoja et al., 2018) while maintaining deployment efficiency well-suited to safe and practical real-world reinforcement learning.

References

  • G. Barth-Maron, M. Hoffman, D. Budden, W. Dabney, D. Horgan, D. TB, A. Muldal, N. Heess, and T. Lillicrap (2018) Distributed distributional deterministic policy gradients. In International Conference on Learning Representations, Cited by: §1.
  • R. Boney, J. Kannala, and A. Ilin (2019) Regularizing model-based planning with energy-based models. In Conference on Robot Learning, Cited by: §6.
  • Y. Chow, O. Nachum, E. Duenez-Guzman, and M. Ghavamzadeh (2018) A lyapunov-based approach to safe reinforcement learning. In Advances in neural information processing systems, Cited by: §3.
  • Y. Chow, O. Nachum, A. Faust, E. Duenez-Guzman, and M. Ghavamzadeh (2019) Lyapunov-based safe policy optimization for continuous control. arXiv preprint arXiv:1901.10031. Cited by: §3.
  • Y. Chow, A. Tamar, S. Mannor, and M. Pavone (2015) Risk-sensitive and robust decision-making: a cvar optimization approach. In Advances in Neural Information Processing Systems, Cited by: §3.
  • K. Chua, R. Calandra, R. McAllister, and S. Levine (2018) Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, Cited by: §6.
  • I. Clavera, J. Rothfuss, J. Schulman, Y. Fujita, T. Asfour, and P. Abbeel (2018) Model-based reinforcement learning via meta-policy optimization. In Conference on Robot Learning, Cited by: §6.
  • T. Degris, M. White, and R. S. Sutton (2012) Off-policy actor-critic. arXiv preprint arXiv:1205.4839. Cited by: §1.
  • M. Deisenroth and C. E. Rasmussen (2011) PILCO: a model-based and data-efficient approach to policy search. In International Conference on Machine Learning, Cited by: §6.
  • D. Ernst, P. Geurts, and L. Wehenkel (2005) Tree-based batch mode reinforcement learning. Journal of Machine Learning Research. Cited by: §6.
  • L. Espeholt, R. Marinier, P. Stanczyk, K. Wang, and M. Michalski (2019) SEED RL: scalable and efficient deep-rl with accelerated central inference. arXiv preprint arXiv:1910.06591. Cited by: §7.
  • L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, et al. (2018) IMPALA: scalable distributed deep-rl with importance weighted actor-learner architectures. In International Conference on Machine Learning, Cited by: §7.
  • B. Eysenbach, S. Gu, J. Ibarz, and S. Levine (2018) Leave no trace: learning to reset for safe and autonomous reinforcement learning. International Conference on Learning Representations. Cited by: §3.
  • S. Fujimoto, D. Meger, and D. Precup (2019) Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, Cited by: §1, §1, §2, Table 6, Table 1, §6.
  • O. Gottesman, F. Johansson, J. Meier, J. Dent, D. Lee, S. Srinivasan, L. Zhang, Y. Ding, D. Wihl, X. Peng, J. Yao, I. Lage, C. Mosch, L. H. Lehman, M. Komorowski, M. Komorowski, A. Faisal, L. A. Celi, D. Sontag, and F. Doshi-Velez (2018) Evaluating reinforcement learning algorithms in observational health settings. arXiv preprint arXiv:1805.12298. Cited by: §6.
  • S. Gu, E. Holly, T. Lillicrap, and S. Levine (2017a) Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In International Conference on Robotics and Automation, Cited by: §1, §3.
  • S. Gu, T. Lillicrap, Z. Ghahramani, R. E. Turner, and S. Levine (2017b) Q-Prop: sample-efficient policy gradient with an off-policy critic. In International Conference on Learning Representations, Cited by: §1, §2.
  • S. Gu, T. Lillicrap, I. Sutskever, and S. Levine (2016) Continuous deep q-learning with model-based acceleration. In International Conference on Machine Learning, Cited by: §1, §2.
  • T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, Cited by: Figure 1, §1, §2, §3, §3, §7.
  • N. Heess, G. Wayne, D. Silver, T. Lillicrap, T. Erez, and Y. Tassa (2015) Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems, Cited by: §6.
  • M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver (2018) Rainbow: combining improvements in deep reinforcement learning. In AAAI Conference on Artificial Intelligence, Cited by: §1.
  • M. Janner, J. Fu, M. Zhang, and S. Levine (2019) When to trust your model: model-based policy optimization. In Advances in Neural Information Processing Systems, Cited by: §4.3, §6.
  • N. Jaques, A. Ghandeharioun, J. H. Shen, C. Ferguson, A. Lapedriza, N. Jones, S. Gu, and R. Picard (2019) Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv preprint arXiv:1907.00456. Cited by: §1, §1, §1, §6.
  • D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, and S. Levine (2018) QT-Opt: scalable deep reinforcement learning for vision-based robotic manipulation. In Conference on Robot Learning, Cited by: §1, §3, §3.
  • R. Kidambi, A. Rajeswaran, P. Netrapalli, and T. Joachims (2020) MOReL : model-based offline reinforcement learning. arXiv preprint arXiv:2005.05951. Cited by: §C.1, §6, §6.
  • D. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. In International Conference on Learning Representations, Cited by: §B.1.
  • A. Kumar, J. Fu, M. Soh, G. Tucker, and S. Levine (2019) Stabilizing off-policy q-learning via bootstrapping error reduction. In Advances in Neural Information Processing Systems, Cited by: §1, §1, §4.2, §5, §6.
  • T. Kurutach, I. Clavera, Y. Duan, A. Tamar, and P. Abbeel (2018) Model-Ensemble Trust-Region Policy Optimization. In International Conference on Learning Representations, Cited by: §1, §B.1, §3, §4.1, Table 1, §6.
  • S. Lange, T. Gabel, and M. Riedmiller (2012) Batch reinforcement learning. In Reinforcement learning, Cited by: §2, §6.
  • S. Levine, A. Kumar, G. Tucker, and J. Fu (2020) Offline reinforcement learning: tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643. Cited by: §1, §1, §4.3, §6.
  • T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2016) Continuous control with deep reinforcement learning. In International Conference on Learning Representations, Cited by: §1, §2.
  • L. Lin (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning. Cited by: §2.
  • Y. Liu, A. Swaminathan, A. Agarwal, and E. Brunskill (2019) Off-policy policy gradient with state distribution correction. arXiv preprint arXiv:1904.08473. Cited by: §6.
  • T. Mandel, Y. Liu, S. Levine, E. Brunskill, and Z. Popovic (2014) Offline policy evaluation across representations with applications to educational games.. In International Conference on Autonomous Agents and Multiagent Systems, Cited by: §1, §3.
  • A. Mandlekar, F. Ramos, B. Boots, L. Fei-Fei, A. Garg, and D. Fox (2019) IRIS: implicit reinforcement without interaction at scale for learning control from offline robot manipulation data. arXiv preprint arXiv:1911.05321. Cited by: §6.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. Nature. Cited by: §2.
  • S. A. Murphy, M. J. van der Laan, J. M. Robins, and C. P. P. R. Group (2001) Marginal mean models for dynamic regimes. Journal of the American Statistical Association. Cited by: §1, §3.
  • O. Nachum, M. Ahn, H. Ponte, S. Gu, and V. Kumar (2019) Multi-agent manipulation via locomotion using hierarchical sim2real. In Conference on Robot Learning, Cited by: §1, §3.
  • O. Nachum, S. S. Gu, H. Lee, and S. Levine (2018) Data-efficient hierarchical reinforcement learning. In Advances in Neural Information Processing Systems, Cited by: §1.
  • A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine (2018) Neural Network Dynamics for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning. International Conference on Robotics and Automation. Cited by: §1.
  • V. Nagarajan and J. Z. Kolter (2019) Generalization in deep networks: the role of distance from initialization. arXiv preprint arXiv:1901.01672. Cited by: §4.2.
  • A. Nair, P. Srinivasan, S. Blackwell, C. Alcicek, R. Fearon, A. De Maria, V. Panneershelvam, M. Suleyman, C. Beattie, S. Petersen, et al. (2015) Massively parallel methods for deep reinforcement learning. arXiv preprint arXiv:1507.04296. Cited by: §7.
  • D. Precup, R. S. Sutton, and S. Dasgupta (2001) Off-policy temporal-difference learning with function approximation. In International Conference on Machine Learning, Cited by: §1.
  • A. Rajeswaran, V. Kumar, A. Gupta, G. Vezzani, J. Schulman, E. Todorov, and S. Levine (2017) Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. arXiv preprint arXiv:1709.10087. Cited by: §4.2.
  • A. Ray, J. Achiam, and D. Amodei (2019) Benchmarking safe exploration in deep reinforcement learning. arXiv preprint arXiv:1910.01708. Cited by: §3.
  • S. Ross, G. Gordon, and D. Bagnell (2011) A reduction of imitation learning and structured prediction to no-regret online learning. In International Conference on Artificial Intelligence and Statistics, Cited by: §4.2.
  • J. Schulman, S. Levine, P. Moritz, M. Jordan, and P. Abbeel (2015) Trust region policy optimization. In International Conference on Machine Learning, Cited by: §1, §2, §4.2.
  • J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §2.
  • N. Y. Siegel, J. T. Springenberg, F. Berkenkamp, A. Abdolmaleki, M. Neunert, T. Lampe, R. Hafner, and M. A. Riedmiller (2020) Keep doing what worked: behavioral modelling priors for offline reinforcement learning. In International Conference on Learning Representations, Cited by: §4.2, §6.
  • S. P. Singh, T. Jaakkola, and M. I. Jordan (1994) Learning without state-estimation in partially observable markovian decision processes. In Machine Learning Proceedings, Cited by: §6.
  • S. P. Singh, T. Jaakkola, and M. I. Jordan (1995) Reinforcement learning with soft state aggregation. In Advances in Neural Information Processing Systems, Cited by: §2.
  • S. Sohn, Y. Chow, J. Ooi, O. Nachum, H. Lee, E. Chi, and C. Boutilier (2020) BRPO: batch residual policy optimization. arXiv preprint arXiv:2002.05522. Cited by: §1.
  • R. S. Sutton (1991) Dyna, an integrated architecture for learning, planning, and reacting. ACM Sigart Bulletin. Cited by: §B.1, §4, §6.
  • E. Todorov, T. Erez, and Y. Tassa (2012) Mujoco: a physics engine for model-based control. In International Conference on Intelligent Robots and Systems, Cited by: §3, §5.
  • T. Wang, X. Bao, I. Clavera, J. Hoang, Y. Wen, E. Langlois, S. Zhang, G. Zhang, P. Abbeel, and J. Ba (2019) Benchmarking model-based reinforcement learning. arXiv preprint arXiv:1907.02057. Cited by: §B.1, §B.1, §B.2.1, §B.2.1, Table 2, §4.1.
  • Y. Wu, G. Tucker, and O. Nachum (2019) Behavior Regularized Offline Reinforcement Learning. arXiv preprint arXiv:1911.11361. Cited by: §1, §1, §B.1, §B.1, §B.1, §B.2.1, §C.1, Table 6, §4.2, §5.2, Table 1, §5, §6.
  • T. Yu, G. Thomas, L. Yu, S. Ermon, J. Zou, S. Levine, C. Finn, and T. Ma (2020) MOPO: model-based offline policy optimization. arXiv preprint arXiv:2005.13239. Cited by: §6, §6.

Appendix

A Proof of Proposition 1

We first consider $\epsilon_\pi$. The behavior cloning objective in its supremum form is,

We apply Pinsker’s inequality to the true and estimated behavior policy to yield

By the same Pinsker’s inequality, we have,

Therefore, by triangle inequality, we have

as desired.

We proceed similarly for $\epsilon_m$. The model dynamics loss is

We apply Pinsker’s inequality to the true dynamics and learned model to yield

as desired.

B Details of Experimental Settings

b.1 Implementation Details

For our baseline methods, we use the open-source implementations of SAC, BC, BCQ, and BRAC published by Wu et al. [2019]. SAC and BRAC have a (300, 300) Q-network and a (200, 200) policy network. BC has a (200, 200) policy network, and BCQ has a (300, 300) Q-network, a (300, 300) policy network, and a (750, 750) conditional VAE. For online ME-TRPO, we use the codebase of the model-based RL benchmark Wang et al. [2019]. BREMEN and online ME-TRPO use a policy consisting of two hidden layers with 200 units each; the dynamics model consists of two hidden layers with 1,024 units each. We use Adam Kingma and Ba [2014] as the optimizer, with a learning rate of 0.001 for the dynamics model and 0.0005 for behavior cloning in BREMEN. In BREMEN and online ME-TRPO, we adopt a linear-feature value function to stabilize training.
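For convenience, the architecture and optimizer settings stated above can be summarized in a single (hypothetical) configuration dictionary; this is only a restatement of the numbers in this subsection, not the authors' configuration file.

# Hypothetical configuration collecting the settings stated above.
BREMEN_IMPLEMENTATION = {
    "dynamics_model":   {"hidden": (1024, 1024), "optimizer": "Adam", "lr": 1e-3},
    "policy":           {"hidden": (200, 200)},
    "behavior_cloning": {"optimizer": "Adam", "lr": 5e-4},
    "value_function":   "linear features",   # used by BREMEN and online ME-TRPO for stability
    "baselines": {
        "SAC/BRAC": {"q_network": (300, 300), "policy": (200, 200)},
        "BC":       {"policy": (200, 200)},
        "BCQ":      {"q_network": (300, 300), "policy": (300, 300), "conditional_vae": (750, 750)},
    },
}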

To leverage neural networks as Dyna-style Sutton [1991] dynamics models, we modify the reward and termination functions so that they do not depend on the internal physics engine, following the model-based benchmark codebase Wang et al. [2019]; see Table 2. Note that the scores of the baselines (e.g., BCQ, BRAC) differ slightly from Wu et al. [2019] due to this modification of the reward function. We re-ran each algorithm in our environments and confirmed appropriate convergence.

The maximum length of one episode is 1,000 steps, without any termination in Ant and HalfCheetah; the termination function is enabled in Hopper and Walker2d. The batch size of transitions for policy updates is 50,000 in BREMEN and ME-TRPO, following Kurutach et al. [2018]. The batch size is 256 for BC and BRAC and 100 for BCQ, also following Wu et al. [2019].

(a) Ant
(b) HalfCheetah
(c) Hopper
(d) Walker2d
Figure 4: Four standard MuJoCo benchmark environments used in our experiments.

 

Environment Reward function Termination in rollouts
Ant False
HalfCheetah False
Hopper True
Walker2d True

 

Table 2: Reward functions and termination in rollouts in the experiments. We remove all contact information from the observation of Ant, following Wang et al. [2019].

b.2 Hyper Parameters

In this section, we describe the hyper-parameters used in both the deployment-efficient RL and offline RL settings. We run all of our experiments with five random seeds, and the results are averaged.

b.2.1 Deployment-Efficient RL

Table 3 shows the hyper-parameters of BREMEN. The rollout length is searched over {250, 500, 1000}, and the maximum step size $\delta$ is searched over {0.001, 0.01, 0.05, 0.1, 1.0}. For the discount factor $\gamma$ and the GAE parameter $\lambda$, we follow Wang et al. [2019].

 

Parameter Ant HalfCheetah Hopper Walker2d
Iteration per batch 2,000 2,000 6,000 2,000
Deployment 5 5 10 10
Total iteration 10,000 10,000 60,000 20,000
Rollouts length 250 250 1,000 1,000
Max step size 0.05 0.1 0.05 0.05
Discount factor  0.99 0.99 0.99 0.99
GAE 0.97 0.95 0.95 0.95
Stationary noise 0.1 0.1 0.1 0.1

 

Table 3: Hyper-parameters of BREMEN in deployment-efficient settings.
Number of Iterations for Policy Optimization

To achieve high deployment efficiency, the number of iterations of policy optimization between deployments is one of the important hyper-parameters for fast convergence. For the existing methods (BCQ, BRAC, SAC), we search over three values, {10,000, 50,000, 100,000}, and choose 10,000 for BCQ and BRAC and 100,000 for SAC (Figure 5). For BREMEN, we also search over three values, {2,000, 4,000, 6,000}. Figure 6 shows the results of this search; we choose 2,000 for Ant, HalfCheetah, and Walker2d, and 6,000 for Hopper.

Figure 5: Search over the number of iterations of SAC policy optimization between deployments. The number of transitions per data collection is 200K.
Figure 6: Search over the number of iterations of BREMEN policy optimization between deployments. The number of transitions per data collection is 200K.
Stationary Noise in BREMEN

To achieve effective exploration, a stochastic Gaussian policy is a good choice. We found that adding stationary Gaussian noise to the policy in the imaginary trajectories and in data collection led to a notable improvement. The stationary Gaussian policy is written as

$$a = \pi_\theta(s) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2),$$

with a fixed standard deviation $\sigma$. Another choice is a learned Gaussian policy, which parameterizes not only the mean but also the standard deviation:

$$a \sim \mathcal{N}\!\left(\mu_\theta(s), \sigma_\theta(s)^2\right).$$

We utilize zero-mean Gaussian noise and tune $\sigma$ in Figure 7 on HalfCheetah, comparing the stationary and learned strategies. From this experiment, we found that stationary noise with a scale of 0.1 consistently performs well, and we therefore used it in all our experiments.
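A minimal PyTorch sketch of the two action-noise strategies compared here (illustrative names; `mean_net` is a network mapping states to mean actions):

import torch

def stationary_gaussian_action(mean_net, state, sigma=0.1):
    """Stationary Gaussian policy: a = mu_theta(s) + eps with eps ~ N(0, sigma^2 I), sigma fixed."""
    mu = mean_net(state)
    return mu + sigma * torch.randn_like(mu)

def learned_gaussian_action(mean_net, log_std, state):
    """Learned Gaussian policy: both the mean mu_theta(s) and log_std are trainable parameters."""
    mu = mean_net(state)
    return mu + torch.exp(log_std) * torch.randn_like(mu)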

Figure 7: Search over the Gaussian noise scale $\sigma$ in HalfCheetah. The number of transitions per data collection is 200K.
Other Hyper-parameters in the Existing Methods

As for online ME-TRPO, we collect 3,000 steps of online interaction with the environment per 25 iterations and split these transitions into a 2-to-1 ratio of training and validation data for learning the dynamics models. In the batch-size-100,000 setting, we collect 2,000 steps and split with a 1-to-1 ratio. In total, we run 12,500 iterations of policy optimization, corresponding to 500 deployments of the policy. Note that we carefully tuned the hyper-parameters of online ME-TRPO, improving its performance over that reported in Wang et al. [2019].

Table 4 and Table 5 show the tunable hyper-parameters of BCQ and BRAC, respectively. We follow Wu et al. [2019] in choosing these values. In this work, BRAC applies a primal form of the KL value penalty, and BRAC (max Q) denotes the variant that samples multiple actions and takes the maximum according to the learned Q function.

 

Parameter Ant HalfCheetah Hopper Walker2d
Policy learning rate 3e-05 3e-04 3e-06 3e-05
Perturbation range 0.15 0.5 0.15 0.15

 

Table 4: Hyper-parameters of BCQ.

 

Parameter Ant HalfCheetah Hopper Walker2d
Policy learning rate 1e-4 1e-3 3e-5 1e-5
Divergence penalty 0.3 0.1 0.3 0.3

 

Table 5: Hyper-parameters of BRAC.

b.2.2 Offline RL

In the offline experiments, we apply the same hyper-parameters as in the deployment-efficient settings described above, except for the number of iterations per batch. Algorithm 2 shows pseudocode for BREMEN in offline RL settings, where policies are updated with only one fixed batch dataset. The number of iterations is set to 6,250 for BREMEN and 500,000 for BC, BCQ, and BRAC.

0:  Offline dataset $\mathcal{D}$, initial parameters $\theta_0$, $\phi_1, \ldots, \phi_K$, number of policy optimization iterations $T$.
1:  Train $K$ dynamics models $\hat{f}_{\phi_1}, \ldots, \hat{f}_{\phi_K}$ on $\mathcal{D}$ via Eq. 1.
2:  Train the estimated behavior policy $\hat{\pi}_\beta$ on $\mathcal{D}$ by behavior cloning via Eq. 3.
3:  Initialize the target policy $\pi_\theta$ as a Gaussian with mean $\hat{\pi}_\beta$ and fixed standard deviation.
4:  for policy optimization step $t = 1, \ldots, T$ do
5:     Generate imaginary rollouts via Eq. 2.
6:     Optimize the target policy $\pi_\theta$ subject to Eq. 4 with the rollouts.
Algorithm 2 BREMEN for Offline RL

C Additional Experiment Results

c.1 Performance on the Dataset with Different Noise

Following Wu et al. [2019] and Kidambi et al. [2020], we additionally compare BREMEN in offline settings to the other baselines (BC, BCQ, BRAC) on five datasets collected with different exploration noise. Each dataset also contains one million transitions; a sketch of the mixture construction is given after the list below.

  • eps1: 40% of the dataset is collected by the data-collection policy $\pi_b$ (a partially trained SAC policy), 40% is collected by an epsilon-greedy policy that takes a random action with probability $\epsilon = 0.1$, and 20% is collected by a uniformly random policy.

  • eps3: Same as eps1, except that the epsilon-greedy policy takes a random action with probability $\epsilon = 0.3$.

  • gaussian1: 40% of the dataset is collected by the data-collection policy $\pi_b$, 40% is collected by adding zero-mean Gaussian noise with standard deviation 0.1 to each action sampled from $\pi_b$, and 20% is collected by a uniformly random policy.

  • gaussian3: Same as gaussian1, except that the Gaussian noise has standard deviation 0.3.

  • random: All of the dataset is collected by a uniformly random policy.
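A minimal sketch of how such a 40% / 40% / 20% mixture can be assembled (hypothetical helper `collect(policy, n)` returns n transitions gathered with the given policy):

import random

def mixed_dataset(collect, pi_b, noisy_pi, random_pi, total=1_000_000):
    """Build the behavior / noisy / random mixture used for the eps* and gaussian* datasets."""
    data = []
    data += collect(pi_b, int(0.4 * total))        # partially trained SAC policy
    data += collect(noisy_pi, int(0.4 * total))    # eps-greedy or Gaussian-perturbed version of pi_b
    data += collect(random_pi, total - len(data))  # uniformly random policy (remaining 20%)
    random.shuffle(data)
    return data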

Table 6 shows that BREMEN can also achieve performance competitive with state-of-the-art model-free offline RL algorithms even with noisy datasets. The training curves for each experiment are shown in Section C.4.

 

Noise: eps1, 1,000,000 (1M) transitions
Method Ant HalfCheetah Hopper Walker2d
Dataset 1077 2936 791 815
BC 138171 3788740 266486 1185155
BCQ 1937116 6046276 800659 479537
BRAC 2693155 7003118 1243162 3204103
BRAC (max Q) 290798 707081 1488386 3330147
BREMEN (Ours) 3519129 7585425 281876 1177697
ME-TRPO (offline) 1514503 1009731 1301654 128153

 

Noise: eps3, 1,000,000 (1M) transitions
Method Ant HalfCheetah Hopper Walker2d
Dataset 936 2408 662 648
BC 1364121 2877797 519532 1066176
BCQ 193821 5739188 1170446 10181231
BRAC 271890 6434147 122471 2921101
BRAC (max Q) 291387 6672136 2103746 3079110
BREMEN (Ours) 3409218 7632104 280365 1161384
ME-TRPO (offline) 1843674 550467 1308756 354329

 

Noise: gaussian1, 1,000,000 (1M) transitions
Method Ant HalfCheetah Hopper Walker2d
Dataset 1072 3150 882 1070
BC 127980 4142189 3116 1137477
BCQ 195876 5854498 475416 608416
BRAC 290581 7026168 1456161 3030103
BRAC (max Q) 2910157 7026168 157589 324297
BREMEN (Ours) 2912165 7928313 1999617 1402290
ME-TRPO (offline) 1275656 1275656 909631 171119

 

Noise: gaussian3, 1,000,000 (1M) transitions
Method Ant HalfCheetah Hopper Walker2d
Dataset 1058 2872 781 981
BC 130034 419069 611467 1217361
BCQ 198297 5781543 1137582 258286
BRAC 3084180 39332740 1432499 3253118
BRAC (max Q) 291699 39972761 1417267 3372153
BREMEN (Ours) 3432185 8124145 1867354 2073245
ME-TRPO (offline) 1237310 2141872 973243 219145

 

Noise: random, 1,000,000 (1M) transitions
Method Ant HalfCheetah Hopper Walker2d
Dataset 470 -285 34 2
BC 98910 -21 10662 108110
BCQ 1222114 2887242 2067 22812
BRAC 105792 3449259 22730 2954
BRAC (max Q) 68357 3418171 22437 2650
BREMEN (Ours) 90511 3627193 27068 2546
ME-TRPO (offline) 2221665 2701120 32129 26213

 

Table 6: Comparison of BREMEN to existing offline methods in offline settings, namely BC, BCQ Fujimoto et al. [2019], and BRAC Wu et al. [2019]. Each cell shows the average cumulative reward and its standard deviation over 5 seeds. The maximum number of steps per episode is 1,000. Five different types of exploration noise are introduced during data collection: eps1, eps3, gaussian1, gaussian3, and random. BRAC applies a primal form of the KL value penalty, and BRAC (max Q) denotes the variant that samples multiple actions and takes the maximum according to the learned Q function.

c.2 Comparison among Different Number of Ensembles

To deal with distribution shift, also known as model bias, during policy optimization, we introduce dynamics model ensembles. We validate the performance of BREMEN with different numbers of dynamics models. Figure 8 and Figure 9 show the performance of BREMEN with different ensemble sizes in the deployment-efficient and offline settings, respectively. Ensembles with more dynamics models generally resulted in better performance due to the mitigation of distributional shift, with one exception, and we chose the ensemble size used in our main experiments accordingly.

Figure 8: Comparison of the number of dynamics models in deployment-efficient settings.
Figure 9: Comparison of the number of dynamics models in offline settings.

c.3 Implicit KL Control in Offline Settings

Similar to Section 5.3, we present offline experiments to better understand the effect of implicit KL regularization. In contrast to the implicit KL regularization via Eq. 4, the optimization of BREMEN with an explicit KL penalty becomes

$$\theta_{k+1} = \operatorname*{arg\,max}_{\theta} \; \mathbb{E}_{(s, a) \sim \pi_{\theta_k}, \hat{f}_{\phi_i}} \left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_k}(a \mid s)} \Big( A^{\pi_{\theta_k}}(s, a) - \alpha \, D_{\mathrm{KL}}\big( \pi_\theta(\cdot \mid s) \,\|\, \hat{\pi}_\beta(\cdot \mid s) \big) \Big) \right] \quad \text{s.t.} \;\; \mathbb{E}_{s}\big[ D_{\mathrm{KL}}\big( \pi_\theta(\cdot \mid s) \,\|\, \pi_{\theta_k}(\cdot \mid s) \big) \big] \le \delta, \qquad (6)$$

where $A^{\pi_{\theta_k}}(s, a)$ is the advantage of $\pi_{\theta_k}$ computed using model-based rollouts in the learned dynamics model, $\alpha$ is the penalty coefficient, and $\delta$ is the maximum step size. Note that BREMEN with the explicit KL penalty does not utilize behavior cloning initialization.

We empirically conclude that the explicit constraint is unnecessary and that the TRPO update with behavior initialization as implicit regularization is sufficient in the BREMEN algorithm. Figure 10 shows the KL divergence between the learned policies and the last deployed policies (second row) and the model errors, measured by the mean squared error between the predicted next state and the true next state (bottom row). We find that the behavior-initialized policy with conservative KL trust-region updates stays close to the last deployed policy during improvement without an explicit KL penalty. The policy initialized with behavior cloning also tends to suppress the increase in model error, which implies that behavior initialization alleviates the effect of the distribution shift. In Walker2d, the model error of BREMEN is relatively large, which may relate to the poor performance with noisy datasets in Section C.1.

Figure 10: Average cumulative rewards (top row), the corresponding KL divergence of learned policies from the last deployed policy (second row), and model errors (bottom row) in offline settings with the 1M dataset (no noise). The behavior-initialized policy (purple) suppresses the policy drift and model error during training better than no initialization (red) or an explicit KL penalty (green).

c.4 Training Curves for Offline RL with Different Noises

Figure 11: Performance in the offline RL experiments (Table 1) with dataset sizes of 1M (top row), 100K (second row), and 50K (bottom row). Note that the x-axis is the number of policy optimization iterations on a log scale.
Figure 12: Performance in the offline RL experiments with ε-greedy dataset noise (eps1). Dataset size is 1M.
Figure 13: Performance in the offline RL experiments with ε-greedy dataset noise (eps3). Dataset size is 1M.
Figure 14: Performance in the offline RL experiments with Gaussian dataset noise (gaussian1). Dataset size is 1M.
Figure 15: Performance in the offline RL experiments with Gaussian dataset noise (gaussian3). Dataset size is 1M.
Figure 16: Performance in Offline RL experiments with completely random behaviors. Dataset size is 1M.