Dealing with the Unknown: Pessimistic Offline Reinforcement Learning

by   Jinning Li, et al.
berkeley college

Reinforcement Learning (RL) has been shown to be effective in domains where the agent can learn policies by actively interacting with its operating environment. However, if we change the RL scheme to the offline setting, where the agent can only update its policy via static datasets, one of the major issues in offline reinforcement learning emerges, i.e., distributional shift. We propose a Pessimistic Offline Reinforcement Learning (PessORL) algorithm to actively lead the agent back to the area where it is familiar by manipulating the value function. We focus on problems caused by out-of-distribution (OOD) states, and deliberately penalize high values at states that are absent from the training dataset, so that the learned pessimistic value function lower bounds the true value anywhere within the state space. We evaluate the PessORL algorithm on various benchmark tasks, where we show that our method gains better performance by explicitly handling OOD states, when compared to those methods merely considering OOD actions.




1 Introduction

Reinforcement learning (RL), especially with high-capacity models such as deep nets, has shown its power in many domains, e.g., gaming, healthcare, and robotics. However, typical training schemes of RL algorithms rely on active interaction with the environment. This limits their application in domains where active data collection is expensive or dangerous (e.g., autonomous driving). Recently, offline reinforcement learning (offline RL) has emerged as a promising candidate to overcome this barrier. Different from traditional RL methods, offline RL learns the policy from a static offline dataset collected without iterative interaction with the environment. Recent works have shown its ability to solve various policy learning tasks [13, 15, 14]. However, offline RL methods suffer from several major problems. One of them is distributional shift. Unlike in online RL, the state and action distributions differ between training and testing. As a result, RL agents may fail dramatically after being deployed online. For example, in safety-critical applications such as autonomous driving, overconfident and catastrophic extrapolations may occur in out-of-distribution (OOD) scenes [5].

Many prior works [18, 17, 22, 4, 1, 21] try to mitigate this problem by handling OOD actions. They discourage the policy from taking OOD actions by designing conservative value functions or by estimating the uncertainty of Q-functions. Although constraining the policy can implicitly mitigate the problem of state distributional shift, few works have adopted measures to explicitly handle OOD states during the training stage. In this work, we propose the Pessimistic Offline Reinforcement Learning (PessORL) framework to explicitly limit the policy from visiting both unseen states and unseen actions. We refer to the states or the actions that are not included in the training data as the unseen states or the unseen actions.

Our PessORL framework is inspired by the concept of the pessimistic MDP in [16], where the reward is very low for unseen state-action pairs. We aim to limit the magnitude of the value function at unseen states, so that the agent can avoid or recover from unseen states. It is then crucial to precisely detect OOD states and shape the value function at those states. Since prior methods on OOD actions are derived from a similar concept, we can adapt their approaches to handle OOD states. There are mainly two approaches in the literature. One is to estimate the epistemic uncertainty of the Q-function and subtract it from the original Q-function to get a conservative Q-function [17, 22, 4, 1, 21]. The other is to regularize the Q-function during the learning process [18]. The first approach is highly sensitive to the trade-off between the uncertainty estimation and the original Q-function [31, 33] and to the quality of the uncertainty estimation [19].

Therefore, we follow the second approach, and add a conservative regularization term to the policy evaluation step of PessORL to shape the value function. We prove that PessORL learns a pessimistic value function that lower bounds the true value function, and forces the policy to avoid or recover from out-of-distribution states and actions. We evaluate the PessORL algorithm on various benchmark tasks. The performance of our method matches the state-of-the-art offline RL methods. In particular, we show that, by explicitly handling OOD states, we can further improve the policy performance compared to those methods merely considering OOD actions.

2 Related Works

A big challenge for offline reinforcement learning methods is to deal with the problems caused by unvisited states or actions in the training data, which is also known as distributional shift. In model-free offline reinforcement learning, some works use importance sampling to fill the gap between the learned policy and the behavior policy in the training dataset [25, 9, 11, 23, 2]. Many works also constrain the learned policy to be similar to the behavior policy in the training dataset through explicit constraints [17, 8, 27, 24, 20], so that the agent can avoid out-of-distribution actions at test time. The work in [33] proposes a latent space to constrain the policy and keep it from deviating from the training data support. A further step to make the agent avoid actions that may cause it to deviate from the training data support is to learn a conservative value function and thus a conservative policy. The works in [22, 4, 17, 1, 21] estimate the uncertainty of the learned Q-function, and then directly subtract it from the Q-function to get a conservative Q-function. Another way to get a conservative Q-function is to regularize the Q-function in the optimization problem during the learning process [18]. In model-based reinforcement learning (MBRL), there are also many algorithms that constrain exploitation in the environment with effective uncertainty estimation methods [28, 30, 32, 10, 12, 16]. Detecting OOD actions and states with methods from MBRL is considered mature and reliable. Most of the aforementioned methods focus on OOD actions but have no explicit mechanism to deal with OOD states. In this paper, we focus on OOD states and propose a method to learn a pessimistic value function by adding regularization terms when updating Q-functions, and we follow the works in the MBRL domain to build the module that detects OOD states in our algorithm.

3 Background

3.1 Offline Reinforcement Learning

Given a Markov decision process (MDP), an RL agent aims to maximize the expectation of cumulative rewards. The MDP is represented by a tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P$ is the transition function, $r$ is the reward function, and $\gamma$ is the discount factor. Typical RL algorithms optimize the policy using experience collected when interacting with the environment. Unlike those online learning paradigms, offline RL algorithms rely solely on a static offline dataset, denoted by $\mathcal{D}$.

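In this work, we focus on dynamic-programming-based RL algorithms under the offline setting, where we extract a policy from a learned value function for the underlying MDP in the training data. The standard Q-learning method estimates an approximate Q-function parametrized by $\theta$, i.e. $Q_\theta(s, a)$. In each iteration, the Q-function is updated as follows:

$$\hat{Q}^{k+1} \leftarrow \arg\min_{Q} \; \frac{1}{2}\, \mathbb{E}_{s, a, s' \sim \mathcal{D}} \Big[ \big( Q(s, a) - \hat{\mathcal{B}}^{\pi} \hat{Q}^{k}(s, a) \big)^2 \Big] \tag{1}$$

where $\hat{\mathcal{B}}^{\pi}$ is the empirical Bellman update operator defined as:

$$\hat{\mathcal{B}}^{\pi} \hat{Q}^{k}(s, a) = r(s, a) + \gamma\, \mathbb{E}_{a' \sim \pi(a' \mid s')} \big[ \hat{Q}^{k}(s', a') \big] \tag{2}$$

For a discrete action space, we define $\pi$ as the optimal policy induced by the learned Q-function, i.e. $\pi(a \mid s) = \mathbb{1}[a = \arg\max_{a'} \hat{Q}^{k}(s, a')]$. In this case, $\hat{\mathcal{B}}^{\pi}$ reduces to the Bellman optimality operator. When the action space is continuous, we follow actor-critic algorithms to approximate the optimal policy by executing a policy improvement step after policy evaluation in each iteration:

$$\hat{\pi}^{k+1} \leftarrow \arg\max_{\pi} \; \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi(a \mid s)} \big[ \hat{Q}^{k+1}(s, a) \big] \tag{3}$$

In the rest of the paper, we denote the objective in Eqn. 1 by $\mathcal{E}(Q, \hat{\mathcal{B}}^{\pi} \hat{Q}^{k})$, the Bellman update error, for simplicity.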
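As a minimal sketch of policy evaluation on a static dataset, the following tabular Python example applies the empirical Bellman backup only to the state-action pairs that appear in the data. The function name, the toy transitions, and the per-pair averaging (which is the exact minimizer of the squared error for a lookup table) are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict

def empirical_bellman_backup(q, dataset, policy, gamma=0.99):
    """One sweep of the empirical Bellman backup over a static dataset.

    q:       dict mapping (state, action) -> current Q estimate
    dataset: list of (s, a, r, s_next, done) transitions
    policy:  function s -> dict {action: probability}
    Returns a new dict; targets are averaged per (s, a) pair seen in the data.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for s, a, r, s_next, done in dataset:
        # Expected next value under the policy; zero after a terminal step.
        next_v = 0.0 if done else sum(
            p * q.get((s_next, a2), 0.0) for a2, p in policy(s_next).items()
        )
        sums[(s, a)] += r + gamma * next_v
        counts[(s, a)] += 1
    new_q = dict(q)  # pairs absent from the dataset are never touched
    for key, total in sums.items():
        new_q[key] = total / counts[key]
    return new_q

# Toy chain 0 -> 1 -> 2 (terminal); only the action "right" was ever logged.
dataset = [(0, "right", 0.0, 1, False), (1, "right", 1.0, 2, True)]
always_right = lambda s: {"right": 1.0}

q = {}
for _ in range(3):
    q = empirical_bellman_backup(q, dataset, always_right, gamma=0.9)
```

In this toy run the pair `(0, "left")` never receives a value, because no transition in the dataset contains it.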

3.2 Uncertainty-Based Methods and Pessimistic Value Functions

Eqn. 1 shows that the Q-function is never evaluated or updated at states or actions that never appear in the dataset. The agent may behave unexpectedly or unpredictably at those unseen states or actions at test time. For dynamic-programming-based approaches, one way to address the issue of unseen actions is to estimate the epistemic uncertainty of the Q-function and subtract it from the original Q-function [17, 22, 4, 1, 21]. The uncertainty is estimated based on an ensemble of learned Q-functions, and the final conservative Q-function becomes $\hat{Q}_c(s, a) = \mathbb{E}_{Q \sim P_{\mathcal{D}}(Q)}[Q(s, a)] - \lambda\, \mathrm{Unc}(P_{\mathcal{D}}(Q))$, where Unc is defined to be some uncertainty estimation metric, and $P_{\mathcal{D}}(Q)$ is the distribution over possible Q-functions. Because the uncertainty metric is directly subtracted, uncertainty-based methods are highly sensitive to the quality of the uncertainty estimation. Meanwhile, it is difficult to find an ideal $\lambda$ to balance the original Q-function and Unc.
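The uncertainty-subtraction idea can be sketched as follows. This is only an illustration: the population standard deviation across ensemble members stands in for the unspecified metric Unc, and `penalty_weight` is a hypothetical name for the coefficient that balances the two terms.

```python
import statistics

def conservative_q(ensemble_values, penalty_weight):
    """Ensemble mean minus a weighted disagreement penalty.

    ensemble_values: Q estimates for one (state, action) pair, one per member
    penalty_weight:  hypothetical trade-off coefficient (an assumption)
    """
    mean_q = statistics.fmean(ensemble_values)
    unc = statistics.pstdev(ensemble_values)  # stand-in for Unc
    return mean_q - penalty_weight * unc

# Both pairs share the same mean estimate, but the members disagree
# far more on the second one, as they might for an unseen action.
in_distribution = [1.0, 1.1, 0.9, 1.0]
out_of_distribution = [3.0, -1.0, 2.0, 0.0]

q_in = conservative_q(in_distribution, penalty_weight=1.0)
q_ood = conservative_q(out_of_distribution, penalty_weight=1.0)
```

With `penalty_weight=0.0` the two pairs would be tied, which illustrates why the result depends so strongly on the choice of that coefficient.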

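Another way is to regularize the Q-function at the step of policy evaluation. A representative example is Conservative Q-Learning (CQL) [18]. Assuming that the dataset $\mathcal{D}$ is collected with a behavior policy $\pi_\beta$, and $\pi^k$ is the learned policy at iteration $k$, the policy evaluation step in CQL becomes:

$$\hat{Q}^{k+1} \leftarrow \arg\min_{Q} \; \alpha \Big( \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi^{k}(a \mid s)}[Q(s, a)] - \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi_\beta(a \mid s)}[Q(s, a)] \Big) + \mathcal{E}(Q, \hat{\mathcal{B}}^{\pi^k} \hat{Q}^{k}) \tag{4}$$

In the rest of the paper, we denote $\mathcal{C}_{\mathrm{CQL}}(Q)$ as the cost term adopted from CQL, i.e., $\mathcal{C}_{\mathrm{CQL}}(Q) = \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi^{k}(a \mid s)}[Q(s, a)] - \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi_\beta(a \mid s)}[Q(s, a)]$. It is worth noting that the aforementioned methods all focus on OOD actions, but they do not have an explicit mechanism to deal with OOD states, which motivates us to develop the PessORL framework in this work.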
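A small tabular sketch of the CQL cost term may help: it is the gap between the expected Q-value under the learned policy and under the behavior policy, averaged over dataset states. The function and variable names are hypothetical, and the Bellman error term of the full objective is left out.

```python
def cql_penalty(q, states, policy, behavior):
    """Average over states of E_{a~policy}[Q(s,a)] - E_{a~behavior}[Q(s,a)].

    q:        dict mapping (state, action) -> Q estimate
    states:   states sampled from the dataset
    policy:   function s -> {action: probability} for the learned policy
    behavior: function s -> {action: probability} for the behavior policy
    """
    total = 0.0
    for s in states:
        v_policy = sum(p * q[(s, a)] for a, p in policy(s).items())
        v_behavior = sum(p * q[(s, a)] for a, p in behavior(s).items())
        total += v_policy - v_behavior
    return total / len(states)

# The learned policy prefers "right", an action the behavior policy never took.
q = {(0, "left"): 1.0, (0, "right"): 5.0}
learned = lambda s: {"right": 1.0}
behavior = lambda s: {"left": 1.0}

penalty = cql_penalty(q, [0], learned, behavior)

# Lowering the Q-value of the unseen action shrinks the penalty.
q_lowered = {(0, "left"): 1.0, (0, "right"): 0.5}
penalty_lowered = cql_penalty(q_lowered, [0], learned, behavior)
```

Minimizing this quantity pushes down Q-values at actions favored by the learned policy but unsupported by the data, and pushes up Q-values at actions the behavior policy actually took.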

4 Pessimistic Offline Reinforcement Learning Framework

In this section, we introduce the PessORL framework to mitigate the issue of state distributional shift. In particular, we propose a novel conservative regularization term in the policy evaluation step. It can then be integrated into Q-learning or actor-critic algorithm, which will be described in Sec. 5.

4.1 How To Deal With OOD States

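Assuming the dataset $\mathcal{D}$ is collected with a behavior policy $\pi_\beta$, and the states are distributed according to a distribution $d^{\pi_\beta}(s)$ in the dataset, we propose to solve the problem caused by state distributional shift by augmenting the policy evaluation step in CQL [18] with a regularization term scaled by a trade-off factor $\epsilon$:

$$\hat{Q}^{k+1} \leftarrow \arg\min_{Q} \; \epsilon \Big( \mathbb{E}_{s \sim d_\phi(s),\, a \sim \pi^{k}(a \mid s)}[Q(s, a)] - \mathbb{E}_{s \sim d^{\pi_\beta}(s),\, a \sim \pi^{k}(a \mid s)}[Q(s, a)] \Big) + \alpha\, \mathcal{C}_{\mathrm{CQL}}(Q) + \mathcal{E}(Q, \hat{\mathcal{B}}^{\pi^k} \hat{Q}^{k}) \tag{5}$$

where $d_\phi(s)$ is a particular state distribution of our choice, $\mathcal{C}_{\mathrm{CQL}}$ is the CQL cost term of Sec. 3.2, and $\mathcal{E}$ is the Bellman update error of Sec. 3.1.

The idea is to use the minimization term to penalize high values at states unseen in the dataset, and the maximization term to cancel the penalization at in-distribution states. The regularized Q-function could then push the agent towards regions close to the states from the dataset, where the values are higher. To achieve this, we need to find a distribution $d_\phi$ that assigns high probabilities to states far away from the dataset, and low probabilities to states near the training dataset. We will instantiate a practical design of $d_\phi$ in Sec. 5. For now, we just assume $d_\phi$ assigns high probabilities to OOD states.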
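To make the state regularizer concrete, here is a minimal sketch over a discrete set of states. It computes the trade-off factor times the difference between the expected value under the chosen OOD-heavy state distribution and the expected value under the dataset state distribution. All names and numbers are illustrative assumptions.

```python
def expected_value(values, dist):
    """E_{s~dist}[values[s]] for a discrete distribution given as a dict."""
    return sum(prob * values[s] for s, prob in dist.items())

def state_regularizer(values, ood_dist, data_dist, trade_off):
    """trade_off * (E_{s~ood_dist}[V(s)] - E_{s~data_dist}[V(s)])."""
    return trade_off * (
        expected_value(values, ood_dist) - expected_value(values, data_dist)
    )

# "far" is rarely covered by the data, yet it starts with a high value.
values = {"near": 2.0, "far": 3.0}
data_dist = {"near": 1.0, "far": 0.0}
ood_dist = {"near": 0.1, "far": 0.9}

reg_before = state_regularizer(values, ood_dist, data_dist, trade_off=0.5)

# Lowering the value at the far state makes the regularizer negative.
values_after = {"near": 2.0, "far": 1.0}
reg_after = state_regularizer(values_after, ood_dist, data_dist, trade_off=0.5)
```

Minimizing this term therefore lowers values where the OOD-heavy distribution puts its mass and raises them where the data lies.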

4.2 Theoretical Analysis

In this section, we analyze the theoretical properties of the proposed policy evaluation step. The proof and more details can be found in Appendix A.

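We define $k$ as the iteration of policy evaluation, i.e. $\hat{Q}^{k}$ denotes the optimized Q-function in the $k$th iteration obtained by PessORL. $Q^{\pi}$ is defined to be the true Q-function under a policy $\pi$ in the underlying MDP without any regularization. The true Q-function can be written in a recursive form via the exact Bellman operator, $\mathcal{B}^{\pi}$, as $Q^{\pi} = \mathcal{B}^{\pi} Q^{\pi}$. We define $\hat{V}^{\pi}(s)$ as the value function under a policy $\pi$, $\hat{V}^{\pi}(s) = \mathbb{E}_{a \sim \pi(a \mid s)}[\hat{Q}^{\pi}(s, a)]$. For the true value function in the underlying MDP, we also have $V^{\pi}(s) = \mathbb{E}_{a \sim \pi(a \mid s)}[Q^{\pi}(s, a)]$.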

We first introduce the theorem that the learned value function is a lower bound of the true one without considering the sampling error defined in the Lemma A.1.

Theorem 4.1

Assume we can obtain the exact reward function and the transition function of the underlying MDP. Let . Then , the learned value function via Eqn. 5 is a lower bound of the true one, i.e., , if the ratio satisfies

It is worth noting that the learned value function still lower bounds the true value function for any state and action in the training datasets, even when we consider the sampling error defined in Lemma A.1. Further details are shown in Corollary A.1. We have no reward or transition pair collected at unseen states or actions outside the training dataset, so it is impossible to bound the error outside the training dataset when considering the sampling error introduced by the reward function and the transition function.

We can now step further and show that the values at OOD states are lower than those at in-distribution states based on the learned value function. The proof can be found in Appendix A.3.

Theorem 4.2

For any state , if is sufficiently large, then the expectation of the learned value function via Eqn. 5 under the state marginal in the training data is higher than that under , i.e., .

During training, we can at least evaluate Q-values of OOD actions based on in-distribution states. However, there is no information about immediate rewards at OOD states, and thus no information about Q-values. Intuitively, under offline settings, the best we can do to mitigate the problem of OOD states is to suppress values at these OOD states and raise values at in-distribution states, so that the agent is attracted to the familiar area near the training data. Thm. 4.2 indeed tells us that PessORL models a value function that assigns smaller values to OOD states than to in-distribution states. Optimizing a policy under such a value function is similar to forcing the policy to avoid unknown states and actions.

In summary, PessORL can learn a pessimistic value function that lower bounds the true value function. Furthermore, this value function assigns smaller values to OOD states compared to those at in-distribution states, which helps the agent avoid or even recover from OOD states.

5 Implementing the Algorithm

In this section, we introduce a practical PessORL algorithm based on Eqn. 5. This algorithm simply modifies the policy evaluation step of Deep Q-Learning or Soft Actor-Critic algorithms, which is easy to implement.

5.1 Detecting OOD states

Prior to designing the algorithm, we need to choose a proper state distribution for the regularizer of Sec. 4.1, which requires a tool for OOD state detection. Following [16, 19, 3], we use bootstrapping to detect OOD states. In particular, we train a bag of Gaussian dynamics models [3]. For each model, one function outputs the mean difference between the next state and the current state, and another models the standard deviation. OOD states are detected by estimating the uncertainty of the bootstrap models at a given state. Concretely, we define an uncertainty score for each state from the outputs of the individual models and the mean of the outputs of all models, with the actions drawn from a policy distribution. A high score indicates that the state is more likely to be an unseen state. Given a set of sampled states, we can define a discrete distribution over it using the uncertainty scores, which assigns high probabilities to OOD states. In the following section, we will use it to construct the state distribution of the regularizer.
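The following sketch shows one way ensemble disagreement could be turned into a discrete distribution over sampled states. The exact uncertainty formula is not recoverable from the text above, so the score used here (largest Euclidean distance between any member's predicted mean and the ensemble mean) and the plain normalization are assumptions made for illustration.

```python
import math

def disagreement(predictions):
    """Largest Euclidean distance of any member's prediction from the mean.

    predictions: one predicted state-difference vector per ensemble member
    (an assumed stand-in for the uncertainty score).
    """
    dim = len(predictions[0])
    mean = [sum(p[i] for p in predictions) / len(predictions) for i in range(dim)]
    return max(math.dist(p, mean) for p in predictions)

def uncertainty_distribution(scores):
    """Normalize non-negative scores into probabilities (uniform if all zero)."""
    total = sum(scores)
    if total == 0:
        return [1.0 / len(scores)] * len(scores)
    return [u / total for u in scores]

# Hypothetical predictions of three ensemble members at two sampled states.
near_data = [[0.5, 0.0], [0.5, 0.1], [0.5, -0.1]]   # members roughly agree
far_from_data = [[1.0, 0.0], [-1.0, 0.0], [0.0, 0.0]]  # members disagree

scores = [disagreement(near_data), disagreement(far_from_data)]
probs = uncertainty_distribution(scores)
```

The state where the members disagree most receives most of the probability mass, which is the property the regularizer relies on.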

5.2 Practical Implementation of PessORL

Initialize: a Q network, a target network, a policy network, and a bag of dynamics models to detect OOD states
// Dynamics model training (the models are used to detect OOD states in the policy evaluation step)
for each dynamics training step do
      Train the dynamics models on the transitions in the dataset, so that we can later obtain an uncertainty estimate in the policy evaluation step
end for
// Policy evaluation and improvement
for each training step do
      Update the Q network with a gradient step on the objective in Eqn. 7, using its learning rate and the trade-off factor
      Update the policy network with a gradient step on the soft actor-critic style objective, using its learning rate
      if step mod target_update == 0 then
            Soft update the target network
      end if
end for
Algorithm 1 Pessimistic Offline Reinforcement Learning (PessORL)
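The soft update of the target network in Algorithm 1 is commonly implemented as Polyak averaging. A minimal sketch on flat parameter lists follows; the mixing rate `tau` and its default value are assumptions, since the text does not state them.

```python
def soft_update(target_params, online_params, tau=0.005):
    """Move each target parameter a fraction tau towards the online parameter."""
    return [
        (1.0 - tau) * t + tau * o
        for t, o in zip(target_params, online_params)
    ]

# Toy parameters: the target slowly tracks the online network.
target = [0.0, 2.0]
online = [1.0, 0.0]
target = soft_update(target, online, tau=0.1)
```

A small `tau` keeps the regression targets in the Bellman update error stable between gradient steps.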

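We now introduce a practical PessORL algorithm. In practice, to obtain a well-defined distribution $d_\phi$, we add an additional optimization problem over $d_\phi$ into the original optimization problem. Writing $V(s) = \mathbb{E}_{a \sim \pi^{k}(a \mid s)}[Q(s, a)]$, the resulting optimization problem for the policy evaluation step is:

$$\min_{Q} \max_{d_\phi} \; \epsilon \Big( \mathbb{E}_{s \sim d_\phi(s)}[V(s)] - \mathbb{E}_{s \sim d^{\pi_\beta}(s)}[V(s)] + \mathcal{R}(d_\phi) \Big) + \alpha\, \mathcal{C}_{\mathrm{CQL}}(Q) + \mathcal{E}(Q, \hat{\mathcal{B}}^{\pi^k} \hat{Q}^{k}) \tag{6}$$

where $\mathcal{R}(d_\phi)$ is a regularization term inspired by [18] in order to stabilize the training. If we choose $\mathcal{R}(d_\phi) = -D_{\mathrm{KL}}(d_\phi \,\|\, \rho)$, where $\rho$ is the distribution we obtained from uncertainty estimations, then $d_\phi(s) \propto \rho(s) \exp(V(s))$, where $V(s)$ is defined as above. The resulting $d_\phi$ is intuitively reasonable, because it assigns high probabilities to OOD states with high uncertainty estimations. In particular, $d_\phi$ assigns higher probabilities to states with high values, because we expect to penalize them harder than those whose values are already low. With this choice of $\mathcal{R}$ in Eqn. 6, we obtain the following PessORL policy evaluation step:

$$\hat{Q}^{k+1} \leftarrow \arg\min_{Q} \; \epsilon \Big( \log \sum_{s} \rho(s) \exp(V(s)) - \mathbb{E}_{s \sim d^{\pi_\beta}(s)}[V(s)] \Big) + \alpha\, \mathcal{C}_{\mathrm{CQL}}(Q) + \mathcal{E}(Q, \hat{\mathcal{B}}^{\pi^k} \hat{Q}^{k}) \tag{7}$$
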
The first term in Eqn. 7 is very similar to a weighted softmax of the values over the state space. It penalizes the softmax value over the state space, but also takes into account the distances between sample points and the training data. The two terms following the trade-off factor try to decrease the discrepancy between the softmax value over OOD states and the average value over in-distribution states. Intuitively, this should force the learned value function to output higher values at in-distribution states, and lower values at out-of-distribution states. The logsumexp term in Eqn. 7 also relaxes the requirement for an accurate uncertainty estimation over the entire state space, since only those states with high values contribute to the regularization.
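The weighted logsumexp can be checked numerically. The sketch below assumes, as in CQL-style derivations, that the regularizer on the state distribution is a negative KL divergence to the uncertainty-based weights; under that assumption the maximizing distribution is proportional to weight times exp(value), and the maximum equals the weighted logsumexp. The function names and toy numbers are illustrative.

```python
import math

def weighted_logsumexp(values, weights):
    """Numerically stable log(sum_i w_i * exp(v_i)) for positive weights."""
    m = max(values)
    return m + math.log(sum(w * math.exp(v - m) for v, w in zip(values, weights)))

def maximizing_distribution(values, weights):
    """d_i proportional to w_i * exp(v_i): maximizer of E_d[v] - KL(d || w)."""
    m = max(values)
    unnorm = [w * math.exp(v - m) for v, w in zip(values, weights)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def regularized_objective(dist, values, weights):
    """E_dist[v] - KL(dist || weights) for strictly positive entries."""
    expected = sum(d * v for d, v in zip(dist, values))
    kl = sum(d * math.log(d / w) for d, w in zip(dist, weights))
    return expected - kl

values = [0.0, 1.0, 2.0]    # V(s) at three sampled states
weights = [0.2, 0.3, 0.5]   # uncertainty-based probabilities

lse = weighted_logsumexp(values, weights)
best = maximizing_distribution(values, weights)
at_best = regularized_objective(best, values, weights)
at_weights = regularized_objective(weights, values, weights)
```

Here `at_best` matches `lse`, while any other distribution, such as the weights themselves, gives a smaller objective.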

The complete algorithm is shown in Algorithm 1. We include the version for continuous action spaces, which requires a policy network, and note that if the action space is discrete, we no longer need a policy network but just an implicit policy based on the learned Q-function. We implement PessORL on top of CQL [18], with its default hyperparameters. We also apply Lagrangian dual gradient descent to automatically adjust the trade-off factor $\epsilon$. During the training process of offline reinforcement learning algorithms such as CQL and PessORL, we only have access to the dataset $\mathcal{D}$ instead of the state distribution $d^{\pi_\beta}(s)$ and the behavior policy $\pi_\beta(a \mid s)$. Therefore, we follow the convention in the reinforcement learning community and approximate all expectations in Eqn. 7 by Monte Carlo estimation.
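One common form of Lagrangian dual gradient descent treats the trade-off factor as a non-negative multiplier on a constraint that the value discrepancy stays below a threshold. The sketch below shows a single dual update under that assumption; the threshold, learning rate, and function name are hypothetical and are not given in the text.

```python
def dual_ascent_step(trade_off, discrepancy, threshold, lr):
    """One projected gradient-ascent step on the multiplier.

    The multiplier grows while the discrepancy exceeds the threshold and
    shrinks otherwise; it is clipped at zero to stay non-negative.
    """
    return max(0.0, trade_off + lr * (discrepancy - threshold))

# Discrepancy above the threshold: regularize harder.
raised = dual_ascent_step(1.0, discrepancy=0.5, threshold=0.0, lr=0.1)
# Discrepancy below the threshold: relax the regularizer.
lowered = dual_ascent_step(1.0, discrepancy=-0.5, threshold=0.0, lr=0.1)
# The multiplier never becomes negative.
clipped = dual_ascent_step(0.01, discrepancy=-1.0, threshold=0.0, lr=0.1)
```

In a full training loop this step would alternate with the gradient step on the Q network, using a Monte Carlo estimate of the discrepancy.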

6 Experiments

We compare our algorithm to prior offline algorithms: two state-of-the-art offline RL algorithms, BEAR [17] and CQL [18]; two baselines adapted directly from online algorithms, the actor-critic algorithm TD3 [7] and DDQN [29]; and behavior cloning (BC). The TD3 baseline is applied when the action space is continuous, whereas DDQN is trained when the action space is discrete. We evaluate each algorithm on a wide range of task domains, including tasks with both continuous and discrete state and action spaces. All baselines are run with the default code and hyperparameters from the original repositories. In particular, we are interested in the comparison between our algorithm and CQL, because we essentially add a state regularization term to the original CQL framework.

Figure 1: (a) The whole map of the environment; (b) The state density in the training dataset; (c) The visualization of the uncertainty estimation; (d) The learning curves. The top row (1) and the bottom row (2) correspond to PointmassHard-v0 and PointmassSuperHard-v0, respectively. Almost all trajectories in the training datasets are located around the optimal trajectory from the start to the goal in the yellow areas in (b), indicating they are collected by a near-optimal policy.

6.1 Performance on Various Environments

Pointmass Mazes. The task for the agent in this domain is to learn from expert demonstrations to navigate from a random start to a fixed goal. The expert dataset, which contains around 1000 trajectories all from the same start point to the same goal, is collected by an RL policy trained online. At test time, we reset the start to a random point in the state space and the goal to the same fixed point as in the dataset. In this way, the performance of the agent at unseen states is evaluated.

Before showing the performance, we first check whether the OOD state detection is accurate, and hence whether we can successfully penalize high values at states unseen in the training datasets. We evaluate the effectiveness of the OOD state detection method based on the accuracy of the uncertainty estimation in the Pointmass environment. Figure 1(b) and (c) visualize the training datasets and the estimated uncertainty, both in the same coordinate system as the map (Figure 1(a)). We use different colors in Figure 1(b) and (c) to represent different values at each point in the map. The uncertainty estimations tend to be high (yellow areas in Fig. 1(c)) in areas where the state density is low (blue areas in Fig. 1(b)), and vice versa. This trend empirically shows that our uncertainty estimations are reasonable. We can trust them to detect OOD states when training offline RL algorithms.

We include the learning curves in figure 1(d), in which we evaluate each algorithm based on 3 random seeds, and report the average return. The shaded area represents the standard deviation of each evaluation. As we can see in the figures, PessORL outperforms other baselines in both hard and super hard environments. PessORL benefits from the augmented policy evaluation step in Eqn. 7. The learned value function produces high values at areas that have low uncertainty estimations, and low values at highly uncertain areas (OOD states). Therefore, the agent can be “attracted” to the high value areas from low value and unfamiliar areas.

Columns: Domain, Task, BC, TD3, BEAR, CQL, PessORL
Gym tasks: hopper-medium, walker2d-medium, halfcheetah-medium, ant-medium, hopper-medium-expert, walker2d-medium-expert, halfcheetah-medium-expert, ant-medium-expert, hopper-random, walker2d-random, halfcheetah-random, ant-random, hopper-expert, walker2d-expert, halfcheetah-expert, ant-expert
Adroit tasks: pen-human, door-human, hammer-human, relocate-human, pen-cloned, door-cloned, hammer-cloned, relocate-cloned

Table 1: Performance on Gym and Adroit Domains

Gym Tasks. In this domain, we focus on the locomotion environments from MuJoCo, including Walker2d-v2, Hopper-v2, Halfcheetah-v2, and Ant-v2. Unlike in the Pointmass environment, we directly adopt the d4rl datasets [6] as our training data in the gym domain. We include four different types of datasets in our experiments, namely, “medium”, “medium-expert”, “random”, and “expert”. The “medium”, “random”, and “expert” datasets are each collected by a single policy: a policy trained with early stopping, a randomly initialized policy, and a fully trained expert policy, respectively. The “medium-expert” dataset is generated by mixing mediocre and expert quality data. We show the normalized scores averaged over 4 random seeds for all methods on the gym domain in Table 1. We ran all baselines directly from their original repositories with their default parameters, and we only report the average scores we actually obtained. As shown in the table, PessORL outperforms all other offline RL methods on a majority of tasks in the gym domain. PessORL works especially well with mediocre quality datasets. This is one of the advantages of offline RL methods over behavior cloning on medium quality datasets, because offline RL methods take advantage of the information both from the reward and from the underlying state and action distributions in the training datasets, instead of simply imitating the behavior policy as behavior cloning does. Medium quality datasets are also considered to be similar to real-world datasets. Therefore, it is important for an offline RL method to perform well on medium quality datasets. We also note that PessORL shares some good properties with CQL, such as satisfactory performance on mixed quality datasets. PessORL and CQL both outperform other offline methods on medium-expert datasets, with PessORL the better of the two. The reason is that offline RL methods can “stitch” [6] different trajectories from different policies together according to the information from the reward.
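For reference, the d4rl normalized score rescales a raw return so that a random policy maps to 0 and an expert policy maps to 100. A minimal sketch follows; the reference returns used below are made-up placeholders, not the benchmark's actual constants.

```python
def normalized_score(raw_return, random_return, expert_return):
    """100 * (raw - random) / (expert - random), the d4rl normalization."""
    return 100.0 * (raw_return - random_return) / (expert_return - random_return)

# Placeholder reference returns for a hypothetical task.
RANDOM_RETURN = -20.0
EXPERT_RETURN = 3200.0

score = normalized_score(1590.0, RANDOM_RETURN, EXPERT_RETURN)
```

Averaging such scores over seeds gives entries comparable across tasks with very different reward scales.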

Adroit Tasks. The adroit domain [26] provides more challenging tasks than the Pointmass environment and the gym domain. The tasks include controlling a 24-DoF simulated Shadow Hand robot to twirl a pen, open a door, hammer a nail, and relocate a ball. As in the gym domain, we directly use the d4rl datasets as the training datasets in our experiments. The performance of PessORL and all baselines is shown in Table 1. The normalized scores of all methods are average returns over 5 random seeds. We note that PessORL performs better than the other baselines on the adroit domain. It is a great advantage for PessORL to learn useful skills from human demonstrations on these high-dimensional and highly realistic robotic simulations.

6.2 Discussions and Limitations

Figure 2: (a) The learning curves in hopper-medium-v0. (b) The discrepancy as a function of gradient steps for PessORL, CQL, and BEAR.

The main contribution of this work is to explicitly limit the values at OOD states, so that the learned policy can act conservatively at OOD states and drive the agent back to the familiar areas near the training data. We are interested to see if our framework can indeed induce a different behavior on OOD states. As a metric, we use the discrepancy between the softmax value over OOD states and the average value over in-distribution states, evaluated at each iteration. If the discrepancy is close to zero or negative, then intuitively the values at OOD states are no higher than those at in-distribution states. In Fig. 2, we plot the discrepancy at each iteration in hopper-medium-v0. As shown in the figure, PessORL successfully limits the discrepancy to be non-positive, which meets our goal in this work and aligns with the statement of Theorem 4.2.

On the gym domain, we notice that the performance of PessORL and CQL on datasets containing expert trajectories is unsatisfactory, often not as good as BC. We believe this is because of overly conservative value estimation. In fact, it is widely believed that conservative methods suffer from underestimation [19]. The conservative objective function in Eqn. 7 sometimes assigns values that are too low to OOD states and actions. Besides, the uncertainty estimation method cannot be guaranteed to be precise in high-dimensional spaces. Solving the underestimation and uncertainty estimation problems in conservative methods is a possible direction for future work.

7 Conclusion

We propose a Pessimistic Offline Reinforcement Learning framework to deal with out-of-distribution states. In particular, we add a regularization term in policy evaluation step to shape value function, so that we can improve its extrapolation to OOD states. We also provide theoretical guarantees that the learned pessimistic value function lower bounds the true one and assigns smaller values to OOD states compared to those at in-distribution states. We evaluate the PessORL algorithm on various benchmark tasks, where we show that our method gains better performance by explicitly handling OOD states compared to those methods merely considering OOD actions.


  • [1] R. Agarwal, D. Schuurmans, and M. Norouzi (2020) An optimistic perspective on offline reinforcement learning. In International Conference on Machine Learning, pp. 104–114.
  • [2] C. Cheng, X. Yan, and B. Boots (2020) Trajectory-wise control variates for variance reduction in policy gradient methods. In Conference on Robot Learning, pp. 1379–1394.
  • [3] K. Chua, R. Calandra, R. McAllister, and S. Levine (2018) Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), Vol. 31.
  • [4] B. Eysenbach, S. Gu, J. Ibarz, and S. Levine (2017) Leave no trace: learning to reset for safe and autonomous reinforcement learning. arXiv preprint arXiv:1711.06782.
  • [5] A. Filos, P. Tigkas, R. McAllister, N. Rhinehart, S. Levine, and Y. Gal (2020) Can autonomous vehicles identify, recover from, and adapt to distribution shifts?. In International Conference on Machine Learning, pp. 3145–3153.
  • [6] J. Fu, A. Kumar, O. Nachum, G. Tucker, and S. Levine (2020) D4RL: datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219.
  • [7] S. Fujimoto, H. Hoof, and D. Meger (2018) Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, pp. 1587–1596.
  • [8] S. Fujimoto, D. Meger, and D. Precup (2019) Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pp. 2052–2062.
  • [9] S. Gu, E. Holly, T. Lillicrap, and S. Levine (2017) Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 3389–3396.
  • [10] D. Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson (2019) Learning latent dynamics for planning from pixels. In International Conference on Machine Learning, pp. 2555–2565.
  • [11] J. Huang and N. Jiang (2020) From importance sampling to doubly robust policy gradient. In International Conference on Machine Learning, pp. 4434–4443.
  • [12] M. Janner, J. Fu, M. Zhang, and S. Levine (2019) When to trust your model: model-based policy optimization. arXiv preprint arXiv:1906.08253.
  • [13] N. Jaques, A. Ghandeharioun, J. H. Shen, C. Ferguson, A. Lapedriza, N. Jones, S. Gu, and R. Picard (2019) Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv preprint arXiv:1907.00456.
  • [14] G. Kahn, P. Abbeel, and S. Levine (2021) Badgr: an autonomous self-supervised learning-based navigation system. IEEE Robotics and Automation Letters 6 (2), pp. 1312–1319.
  • [15] D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, et al. (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In Conference on Robot Learning, pp. 651–673.
  • [16] R. Kidambi, A. Rajeswaran, P. Netrapalli, and T. Joachims (2020) Morel: model-based offline reinforcement learning. arXiv preprint arXiv:2005.05951.
  • [17] A. Kumar, J. Fu, G. Tucker, and S. Levine (2019) Stabilizing off-policy q-learning via bootstrapping error reduction. arXiv preprint arXiv:1906.00949.
  • [18] A. Kumar, A. Zhou, G. Tucker, and S. Levine (2020) Conservative q-learning for offline reinforcement learning. arXiv preprint arXiv:2006.04779.
  • [19] S. Levine, A. Kumar, G. Tucker, and J. Fu (2020) Offline reinforcement learning: tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643.
  • [20] A. Nair, M. Dalal, A. Gupta, and S. Levine (2020) Accelerating online reinforcement learning with offline datasets. arXiv preprint arXiv:2006.09359.
  • [21] B. O’Donoghue, I. Osband, R. Munos, and V. Mnih (2018) The uncertainty bellman equation and exploration. In International Conference on Machine Learning, pp. 3836–3845.
  • [22] I. Osband, C. Blundell, A. Pritzel, and B. Van Roy (2016) Deep exploration via bootstrapped dqn. arXiv preprint arXiv:1602.04621.
  • [23] S. Pankov (2018) Reward-estimation variance elimination in sequential decision processes. arXiv preprint arXiv:1811.06225.
  • [24] X. B. Peng, A. Kumar, G. Zhang, and S. Levine (2019) Advantage-weighted regression: simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177.
  • [25] D. Precup (2000) Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, pp. 80.
  • [26] A. Rajeswaran, V. Kumar, A. Gupta, G. Vezzani, J. Schulman, E. Todorov, and S. Levine (2017) Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. arXiv preprint arXiv:1709.10087.
  • [27] N. Y. Siegel, J. T. Springenberg, F. Berkenkamp, A. Abdolmaleki, M. Neunert, T. Lampe, R. Hafner, N. Heess, and M. Riedmiller (2020) Keep doing what worked: behavioral modelling priors for offline reinforcement learning. arXiv preprint arXiv:2002.08396.
  • [28] R. S. Sutton (1991) Dyna, an integrated architecture for learning, planning, and reacting. ACM Sigart Bulletin 2 (4), pp. 160–163.
  • [29] H. Van Hasselt, A. Guez, and D. Silver (2016) Deep reinforcement learning with double q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 30. Cited by: §6.
  • [30] M. Watter, J. T. Springenberg, J. Boedecker, and M. Riedmiller (2015) Embed to control: a locally linear latent dynamics model for control from raw images. arXiv preprint arXiv:1506.07365. Cited by: §2.
  • [31] Y. Wu, G. Tucker, and O. Nachum (2019) Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361. Cited by: §1.
  • [32] M. Zhang, S. Vikram, L. Smith, P. Abbeel, M. Johnson, and S. Levine (2019) Solar: deep structured representations for model-based reinforcement learning. In International Conference on Machine Learning, pp. 7444–7453. Cited by: §2.
  • [33] W. Zhou, S. Bajracharya, and D. Held (2020) PLAS: latent action space for offline reinforcement learning. arXiv preprint arXiv:2011.07213. Cited by: §1, §2.

Appendix A Proof of Theorem

In this section, we provide the proofs of the theorems in this paper.

Remark. We provide the proofs under a tabular setting. Most continuous spaces can be approximately discretized into a tabular form, although the cardinality of the resulting table may be large. We define as the tabular transition probabilities under the policy .
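The discretization mentioned in the remark can be sketched as follows. This is only an illustration under assumed choices, not part of the paper's method: the uniform grid, the helper names `make_bins` and `discretize`, and the two-dimensional state bounds are all hypothetical.

```python
from typing import List, Tuple


def make_bins(low: float, high: float, n_bins: int) -> List[float]:
    """Return the n_bins - 1 interior edges of a uniform grid on [low, high]."""
    width = (high - low) / n_bins
    return [low + i * width for i in range(1, n_bins)]


def discretize(state: Tuple[float, ...], edges_per_dim: List[List[float]]) -> int:
    """Map a continuous state to a single tabular index (row-major order)."""
    index = 0
    for value, edges in zip(state, edges_per_dim):
        # Bin id in 0 .. len(edges): number of interior edges at or below value.
        bin_id = sum(value >= e for e in edges)
        index = index * (len(edges) + 1) + bin_id
    return index


# Hypothetical 2-D state space: position in [-1, 1], velocity in [-2, 2].
edges = [make_bins(-1.0, 1.0, 10), make_bins(-2.0, 2.0, 10)]

# Cardinality of the resulting table: product of the bin counts per dimension.
n_states = 1
for e in edges:
    n_states *= len(e) + 1
```

With k bins per dimension and d dimensions the table has k^d entries, which is why the cardinality may be large even for moderate d.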

A.1 Proof of Theorem 4.1

The Q-function update in Eqn. 5 can be computed in a tabular setting by setting the derivative of the augmented objective in Eqn. 5 with respect to to zero,

Therefore, we can obtain in terms of by rearranging the terms,


for all , and . For the state-action pair such that and , the sum of the last two terms of Eqn. 8 is positive, so we cannot simply lower bound the true Q-function by the estimated one point-wise. However, we can prove that the value function, which is the expectation of the Q-function, can be lower bounded. Taking the expectation of both sides of Eqn. 8 under the distribution , we have


The first goal is to prove that which implies that each iteration introduces some underestimation, and could eventually converge to a fixed point. Therefore, we need to prove that the sum of the last two terms on the right-hand side of Eqn. 9 is negative. We denote to be the negation of the last two terms on the right-hand side of Eqn. 9, then


From the proof in [1], we know that the second term in Eqn. 10 is non-negative when , that is


Hence, when , it is obvious that for all . When , if satisfies


then we have .

In summary, we have when Eqn. 12 holds. Since the exact Bellman update operator is a contraction [4], we have

which implies that each value-function update is a contraction. According to the contraction mapping theorem, the recursive update in Eqn. 9 will always lead the value function to converge to a fixed point . Now that for the true value functions, by subtracting them from both sides of Eqn. 9 and substituting and with the fixed point , we have


when satisfies


In Eqn. 13, we stretch all notations to be vectors, which means , , and are all vectors containing values for all states. Here, denotes the vector in which the entries are all . The expectations and the operations inside are all computed in a point-wise manner.

Therefore, we can conclude from Eqn. 13 that the estimated value function is a lower bound of the true value-function without considering any sampling error. Thus, we finish the proof of Thm. 4.1.
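The structure of this argument can be checked numerically on a toy problem. The sketch below is not the PessORL update in Eqn. 5: it assumes a hypothetical three-state Markov chain induced by a fixed policy and replaces the last two terms of Eqn. 9 with an arbitrary non-negative per-state penalty. It only illustrates that a Bellman backup with a non-negative penalty subtracted is still a contraction, and that its fixed point lies below the unpenalized value function.

```python
from typing import List


def evaluate(P: List[List[float]], r: List[float], penalty: List[float],
             gamma: float, iters: int = 2000) -> List[float]:
    """Iterate V <- r - penalty + gamma * P V, starting from V = 0."""
    n = len(r)
    V = [0.0] * n
    for _ in range(iters):
        V = [r[s] - penalty[s] + gamma * sum(P[s][t] * V[t] for t in range(n))
             for s in range(n)]
    return V


# Hypothetical 3-state chain under a fixed policy (each row sums to 1).
P = [[0.9, 0.1, 0.0],
     [0.0, 0.5, 0.5],
     [0.2, 0.0, 0.8]]
r = [1.0, 0.0, 2.0]
gamma = 0.9

# Unpenalized policy evaluation: the "true" value function of this toy chain.
V_true = evaluate(P, r, [0.0, 0.0, 0.0], gamma)

# Assumed non-negative penalty, larger at state 1 (imagined as rarely visited).
V_pess = evaluate(P, r, [0.1, 0.5, 0.0], gamma)

# Largest gap between the pessimistic and the true values (negative here).
max_gap = max(vp - vt for vp, vt in zip(V_pess, V_true))
```

Because the discounted transition matrix has non-negative entries, any non-negative penalty propagates into a non-negative reduction at every state, which matches the role the penalty terms play in the fixed-point argument above.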

A.2 Value Lower Bound in Existence of Sampling Errors

We now take sampling error into account. First, we introduce a lemma from [1]:

Lemma A.1

If with high probability the reward function and the transition function can be estimated with bounded error, then the sampling error of the empirical Bellman operator is also bounded:


where is a constant related to , , and , and is the maximum possible reward in the environment.

Note that the bound of the error in Lemma A.1 only holds for states and actions in the training datasets, i.e. . We have no reward or transition pair collected at unseen states or actions outside the training dataset, so it is impossible to bound the error outside the training dataset when considering the sampling error introduced by the reward function and the transition function. Therefore, we can lower bound the true value function by the learned value function at states and actions in the training datasets as in the following corollary.

Corollary A.1

When the sampling error is as defined in Lemma A.1, for any state and any action in the training dataset, , the learned value function via Eqn. 5 is a lower bound of the true one, i.e., , if the trade-off factor and satisfy the constraints


We now show the proof of Corollary A.1. From Eqn. 13, we can directly bound the estimated value function for any by


when and satisfy the constraints in Eqn. 16. Note that in Eqn. 17, we use vector notations similar to those in Eqn. 13.

Therefore, when we consider sampling error introduced by the reward function and the transition pair, the learned value function by PessORL still lower bounds the true one for any states and actions in the training dataset. Thus, we finish the proof of Corollary A.1.
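The restriction to in-dataset state-action pairs can be made concrete with a standard concentration bound. The sketch below does not use the constant from Lemma A.1. It assumes rewards bounded in [-r_max, r_max] and uses Hoeffding's inequality for the empirical mean reward, with a hypothetical dataset and a hypothetical helper `hoeffding_radius`. It shows the error radius shrinking as the visit count grows and becoming vacuous for a pair that never appears in the dataset.

```python
import math
from collections import Counter


def hoeffding_radius(n: int, r_max: float, delta: float) -> float:
    """Radius t with |empirical mean - true mean| <= t w.p. at least 1 - delta,
    for n i.i.d. rewards in [-r_max, r_max]. No finite bound exists for n == 0."""
    if n == 0:
        return math.inf
    return r_max * math.sqrt(2.0 * math.log(2.0 / delta) / n)


# Hypothetical dataset of (state, action) pairs.
dataset = [("s0", "a0")] * 100 + [("s0", "a1")] * 4
counts = Counter(dataset)  # missing keys count as 0

radius_seen = hoeffding_radius(counts[("s0", "a0")], r_max=1.0, delta=0.05)
radius_rare = hoeffding_radius(counts[("s0", "a1")], r_max=1.0, delta=0.05)
radius_unseen = hoeffding_radius(counts[("s1", "a0")], r_max=1.0, delta=0.05)
```

The radius scales with the inverse square root of the visit count, so the pair seen 4 times has a radius 5 times larger than the pair seen 100 times, and the unseen pair has no finite bound at all.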

A.3 Proof of Theorem 4.2

We begin the proof from Eqn. 9. We first take the expectation of both sides of Eqn. 9 under the distribution , then


Similarly, we take the expectation of both sides of Eqn. 9 under the distribution , then we have


If we subtract Eqn. 19 from Eqn. 18, we get


Therefore, we have , if satisfies


A.4 Existence of Feasible Trade-off Factor

Note that both Eqn. 21 and Eqn. 16 put constraints on the trade-off factor . We show that we can choose an appropriate value of to ensure that a feasible that satisfies both constraints exists. Formally, we denote


for simplicity. From Eqn. 16 and 21, we have

Hence, there exists a feasible when and