Conservative Q-Learning for Offline Reinforcement Learning

06/08/2020 · Aviral Kumar et al. · UC Berkeley

Effectively leveraging large, previously collected datasets in reinforcement learning (RL) is a key challenge for large-scale real-world applications. Offline RL algorithms promise to learn effective policies from previously-collected, static datasets without further interaction. However, in practice, offline RL presents a major challenge, and standard off-policy RL methods can fail due to overestimation of values induced by the distributional shift between the dataset and the learned policy, especially when training on complex and multi-modal data distributions. In this paper, we propose conservative Q-learning (CQL), which aims to address these limitations by learning a conservative Q-function such that the expected value of a policy under this Q-function lower-bounds its true value. We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a principled policy improvement procedure. In practice, CQL augments the standard Bellman error objective with a simple Q-value regularizer which is straightforward to implement on top of existing deep Q-learning and actor-critic implementations. On both discrete and continuous control domains, we show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return, especially when learning from complex and multi-modal data distributions.


1 Introduction

Recent advances in reinforcement learning (RL), especially when combined with expressive deep network function approximators, have produced promising results in domains ranging from robotics (kalashnikov2018qtopt) to strategy games (alphastar) and recommendation systems (li2010contextual). However, applying RL to real-world problems consistently poses practical challenges: in contrast to the kinds of data-driven methods that have been successful in supervised learning (resnet; bert), RL is classically regarded as an active learning process, where each training run requires active interaction with the environment. Interaction with the real world can be costly and dangerous, and the quantities of data that can be gathered online are substantially lower than the offline datasets used in supervised learning (imagenet), which only need to be collected once. Offline RL, also known as batch RL, offers an appealing alternative (ernst2005tree; fujimoto2018off; kumar2019stabilizing; agarwal2019optimistic; jaques2019way; siegel2020keep; levine2020offline). Offline RL algorithms learn from large, previously collected datasets, without further interaction. In principle this makes it possible to leverage such datasets, but in practice fully offline RL methods pose major technical difficulties, stemming from the distributional shift between the policy that collected the data and the learned policy. As a result, current results fall short of the full promise of such methods.

Directly utilizing existing value-based off-policy RL algorithms in an offline setting generally results in poor performance, due to issues with bootstrapping from out-of-distribution actions (kumar2019stabilizing; fujimoto2018off) and overfitting (fu2019diagnosing; kumar2019stabilizing; agarwal2019optimistic). This typically manifests as erroneously optimistic value function estimates. If we could instead learn a conservative estimate of the value function, which provides a lower bound on the true values, this overestimation problem could be addressed. In fact, because policy evaluation and improvement typically use only the value of the policy, we can learn a less conservative lower-bound Q-function, such that only the expected value of the Q-function under the policy is lower-bounded, rather than requiring a point-wise lower bound. We propose a novel method for learning such conservative Q-functions via a simple modification to standard value-based RL algorithms. The key idea behind our method is to minimize Q-values under an appropriately chosen distribution over state-action tuples, and then further tighten this bound by also incorporating a maximization term over the data distribution.

Our primary contribution is an algorithmic framework, which we call conservative Q-learning (CQL), for learning conservative, lower-bound estimates of the value function by regularizing the Q-values during training. Our theoretical analysis of CQL shows that only the expected value of this Q-function under the policy lower-bounds the true policy value, preventing the extra under-estimation that can arise with point-wise lower-bounded Q-functions, which have typically been explored in the opposite context in the exploration literature (osband2017posterior; jaksch2010near). We also empirically demonstrate the robustness of our approach to Q-function estimation error. Our practical algorithm uses these conservative estimates for policy evaluation and offline RL. CQL can be implemented with less than 20 lines of code on top of a number of standard, online RL algorithms (haarnoja; dabney2018distributional), simply by adding the CQL regularization terms to the Q-function update. In our experiments, we demonstrate the efficacy of CQL for offline RL, both in domains with complex dataset compositions, where prior methods are typically known to perform poorly (d4rl), and in domains with high-dimensional visual inputs (bellemare2013arcade; agarwal2019optimistic). CQL outperforms prior methods by as much as 2-5x on many benchmark tasks, and is the only method that can outperform simple behavioral cloning on a number of realistic datasets collected from human interaction.

2 Preliminaries

The goal in reinforcement learning is to learn a policy that maximizes the expected cumulative discounted reward in a Markov decision process (MDP), which is defined by a tuple $(\mathcal{S}, \mathcal{A}, T, r, \gamma)$. $\mathcal{S}$ and $\mathcal{A}$ represent state and action spaces, $T(s'|s,a)$ and $r(s,a)$ represent the dynamics and reward function, and $\gamma \in (0,1)$ represents the discount factor. $\pi_\beta(a|s)$ represents the action-conditional distribution of the behavior policy in the dataset, $\mathcal{D}$, and $d^{\pi_\beta}(s)$ is the discounted marginal state-distribution of $\pi_\beta(a|s)$. The dataset $\mathcal{D}$ is sampled from $d^{\pi_\beta}(s)\,\pi_\beta(a|s)$. On all states $s \in \mathcal{D}$, let $\hat{\pi}_\beta(a|s)$ denote the empirical behavior policy at that state.

Off-policy RL algorithms based on dynamic programming maintain a parametric Q-function, $Q_\theta(s,a)$, and, optionally, a parametric policy, $\pi_\phi(a|s)$. Q-learning methods train the Q-function by iteratively applying the Bellman optimality operator, $\mathcal{B}^* Q(s,a) = r(s,a) + \gamma\,\mathbb{E}_{s' \sim T(\cdot|s,a)}[\max_{a'} Q(s',a')]$, and use an exact or approximate maximization scheme, such as CEM (kalashnikov2018qtopt), to recover the greedy policy. In an actor-critic algorithm, a separate policy is trained to maximize the Q-value. Actor-critic methods alternate between computing $Q^\pi$ via (partial) policy evaluation, by iterating the Bellman operator $\mathcal{B}^\pi Q = r + \gamma P^\pi Q$, where $P^\pi$ is the transition matrix coupled with the policy, $P^\pi Q(s,a) = \mathbb{E}_{s' \sim T(\cdot|s,a),\, a' \sim \pi(\cdot|s')}[Q(s',a')]$, and improving the policy by updating it towards actions that maximize the expected Q-value. Since $\mathcal{D}$ typically does not contain all possible transitions $(s,a,s')$, the policy evaluation step actually uses an empirical Bellman operator that only backs up a single sample; we denote this operator $\hat{\mathcal{B}}^\pi$. Given a dataset $\mathcal{D} = \{(s,a,r,s')\}$ of tuples from trajectories collected under a behavior policy $\pi_\beta$, the iterative updates are:

$\hat{Q}^{k+1} \leftarrow \arg\min_{Q}\; \mathbb{E}_{s,a,s' \sim \mathcal{D}}\big[\big(r(s,a) + \gamma\,\mathbb{E}_{a' \sim \hat{\pi}^k(a'|s')}[\hat{Q}^k(s',a')] - Q(s,a)\big)^2\big]$   (policy evaluation)

$\hat{\pi}^{k+1} \leftarrow \arg\max_{\pi}\; \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi^k(a|s)}\big[\hat{Q}^{k+1}(s,a)\big]$   (policy improvement)

Offline RL algorithms based on this basic recipe suffer from action distribution shift (kumar2019stabilizing; wu2019behavior; jaques2019way; levine2020offline) during training, because the target values for Bellman backups in policy evaluation use actions sampled from the learned policy, $\pi^k$, but the Q-function is trained only on actions sampled from the behavior policy, $\pi_\beta$, that produced the dataset $\mathcal{D}$. We also remark that Q-function training in offline RL does not suffer from state distribution shift, because the Q-function is never queried on out-of-distribution states. Since the policy is trained to maximize Q-values, it may be biased towards out-of-distribution (OOD) actions with erroneously high Q-values. In standard RL, such errors can be corrected by attempting an action in the environment and observing its actual value. However, the inability to interact with the environment makes it challenging to deal with Q-values for OOD actions in offline RL. Typical offline RL methods (kumar2019stabilizing; jaques2019way; wu2019behavior; siegel2020keep) mitigate this problem by constraining the learned policy (levine2020offline) away from OOD actions.
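To make the source of this action distribution shift concrete, the fragment below is a minimal illustrative PyTorch sketch (not code from the paper; q_net, target_q_net, policy, and batch are hypothetical placeholders) of the standard sampled backup: the target queries the Q-function at actions drawn from the learned policy, which may be out-of-distribution, while the regression itself is fit only on dataset actions.

```python
import torch

def bellman_backup_loss(q_net, target_q_net, policy, batch, gamma=0.99):
    """Standard sampled backup: target = r + gamma * Q_target(s', a'), a' ~ pi(.|s').

    If pi places probability mass on actions that never appear in the dataset,
    Q_target is queried at out-of-distribution actions whose errors are never
    corrected, because no further environment interaction is possible offline.
    """
    with torch.no_grad():
        next_actions = policy.sample(batch["next_states"])           # a' ~ pi(.|s')
        target = batch["rewards"] + gamma * (1.0 - batch["dones"]) * \
            target_q_net(batch["next_states"], next_actions)         # possibly OOD query

    # The regression only ever sees (s, a) pairs from the dataset.
    q_pred = q_net(batch["states"], batch["actions"])
    return ((q_pred - target) ** 2).mean()
```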

3 The Conservative Q-Learning (CQL) Framework

In this section, we develop a conservative Q-learning (CQL) algorithm, such that the expected value of a policy under the learned Q-function lower-bounds its true value. A lower bound on the Q-value prevents the over-estimation that is common in offline RL settings due to OOD actions and function approximation error (levine2020offline; kumar2019stabilizing). We use the term CQL to refer broadly to both Q-learning methods and actor-critic methods, though the latter also use an explicit policy. We start by focusing on the policy evaluation step in CQL, which can be used by itself as an off-policy evaluation procedure, or integrated into a complete offline RL algorithm, as we will discuss in Section 3.2.

3.1 Conservative Off-Policy Evaluation

We aim to estimate the value $V^\pi(s)$ of a target policy $\pi$ given access to a dataset, $\mathcal{D}$, generated by following a behavior policy $\pi_\beta(a|s)$. Because we are interested in preventing overestimation of the policy value, we learn a conservative, lower-bound Q-function by additionally minimizing Q-values alongside a standard Bellman error objective. Our choice of penalty is to minimize the expected Q-value under a particular distribution of state-action pairs, $\mu(s,a)$. Since standard Q-function training does not query the Q-function value at unobserved states, but does query the Q-function at unseen actions, we restrict $\mu$ to match the state-marginal in the dataset, such that $\mu(s,a) = d^{\pi_\beta}(s)\,\mu(a|s)$. This gives rise to the following iterative update for training the Q-function, as a function of a tradeoff factor $\alpha$:

$\hat{Q}^{k+1} \leftarrow \arg\min_{Q}\;\; \alpha\,\mathbb{E}_{s \sim \mathcal{D},\, a \sim \mu(a|s)}\big[Q(s,a)\big] \;+\; \frac{1}{2}\,\mathbb{E}_{s,a \sim \mathcal{D}}\big[\big(Q(s,a) - \hat{\mathcal{B}}^\pi \hat{Q}^k(s,a)\big)^2\big]$   (1)

In Theorem 3.1, we show that the resulting Q-function, $\hat{Q}^\pi := \lim_{k \to \infty} \hat{Q}^k$, lower-bounds $Q^\pi$ at all state-action pairs. However, we can substantially tighten this bound if we are only interested in estimating $V^\pi(s)$. If we only require that the expected value of $\hat{Q}^\pi$ under $\pi(a|s)$ lower-bound $V^\pi(s)$, we can improve the bound by introducing an additional Q-value maximization term under the data distribution, $\hat{\pi}_\beta(a|s)$, resulting in the iterative update (the change from Equation 1 is the added maximization term):

$\hat{Q}^{k+1} \leftarrow \arg\min_{Q}\;\; \alpha\big(\mathbb{E}_{s \sim \mathcal{D},\, a \sim \mu(a|s)}\big[Q(s,a)\big] - \mathbb{E}_{s \sim \mathcal{D},\, a \sim \hat{\pi}_\beta(a|s)}\big[Q(s,a)\big]\big) \;+\; \frac{1}{2}\,\mathbb{E}_{s,a \sim \mathcal{D}}\big[\big(Q(s,a) - \hat{\mathcal{B}}^\pi \hat{Q}^k(s,a)\big)^2\big]$   (2)

In Theorem 3.2, we show that, while the resulting Q-function, $\hat{Q}^\pi$, may not be a point-wise lower bound, we have $\mathbb{E}_{\pi(a|s)}[\hat{Q}^\pi(s,a)] \le V^\pi(s)$ when $\mu(a|s) = \pi(a|s)$. Intuitively, since Equation 2 maximizes Q-values under the behavior policy $\hat{\pi}_\beta$, Q-values for actions that are likely under $\hat{\pi}_\beta$ might be overestimated, and hence $\hat{Q}^\pi$ may not lower-bound $Q^\pi$ pointwise. While in principle the maximization term can utilize other distributions besides $\hat{\pi}_\beta(a|s)$, we prove in Appendix D.2 that the resulting value is not guaranteed to be a lower bound for distributions other than $\hat{\pi}_\beta(a|s)$.
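The fragment below is a minimal sketch of the Equation 2 objective (an illustration under the assumption of a generic q_net callable and pre-computed Bellman targets; it is not the authors' implementation): Q-values are pushed down under the chosen distribution $\mu$, approximated here by sampled actions, and pushed up at the dataset's own actions, with the penalty scaled by the tradeoff factor $\alpha$.

```python
import torch

def conservative_penalty(q_net, states, dataset_actions, mu_actions):
    """Equation 2's penalty (sketch): E_mu[Q] - E_data[Q].

    mu_actions:      actions sampled from the chosen distribution mu(.|s),
                     e.g. the current policy; their Q-values are minimized.
    dataset_actions: behavior-policy actions from the dataset; their Q-values
                     are maximized, so the bound only holds in expectation.
    """
    return q_net(states, mu_actions).mean() - q_net(states, dataset_actions).mean()

def cql_evaluation_loss(q_net, states, dataset_actions, mu_actions,
                        bellman_targets, alpha=1.0):
    q_pred = q_net(states, dataset_actions)
    bellman_error = ((q_pred - bellman_targets) ** 2).mean()
    penalty = conservative_penalty(q_net, states, dataset_actions, mu_actions)
    return alpha * penalty + 0.5 * bellman_error
```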

Theoretical analysis. We first note that Equations 1 and 2 use the empirical Bellman operator, $\hat{\mathcal{B}}^\pi$, instead of the actual Bellman operator, $\mathcal{B}^\pi$. Following (osband2016deep; jaksch2010near; o2018variational), we use concentration properties of $\hat{\mathcal{B}}^\pi$ to control this error. Formally, for all $s, a \in \mathcal{D}$, with probability $\ge 1 - \delta$, $|\hat{\mathcal{B}}^\pi - \mathcal{B}^\pi|(s,a) \le \frac{C_{r,T,\delta}\, R_{\max}}{(1-\gamma)\sqrt{|\mathcal{D}(s,a)|}}$, where $C_{r,T,\delta}$ is a constant dependent on the concentration properties (variance) of $r(s,a)$ and $T(s'|s,a)$, and $\delta \in (0,1)$ (see Appendix D.3 for more details). Now, we show that the conservative Q-function learned by iterating Equation 1 lower-bounds the true Q-function. Proofs can be found in Appendix C.

Theorem 3.1.

For any $\mu(a|s)$ with $\operatorname{supp}\mu \subset \operatorname{supp}\hat{\pi}_\beta$, with probability $\ge 1 - \delta$, $\hat{Q}^\pi$ (the Q-function obtained by iterating Equation 1) satisfies:

$\forall s \in \mathcal{D},\, a:\quad \hat{Q}^\pi(s,a) \;\le\; Q^\pi(s,a) \;-\; \alpha\Big[(I - \gamma P^\pi)^{-1}\,\frac{\mu}{\hat{\pi}_\beta}\Big](s,a) \;+\; \Big[(I - \gamma P^\pi)^{-1}\,\frac{C_{r,T,\delta}\, R_{\max}}{(1-\gamma)\sqrt{|\mathcal{D}|}}\Big](s,a).$

Thus, if $\alpha$ is large enough to dominate the sampling-error term, then $\hat{Q}^\pi(s,a) \le Q^\pi(s,a)$ for all $s \in \mathcal{D}$ and $a$. When $\hat{\mathcal{B}}^\pi = \mathcal{B}^\pi$ (no sampling error), any $\alpha > 0$ guarantees $\hat{Q}^\pi(s,a) \le Q^\pi(s,a)$.

Next, we show that Equation 2 lower-bounds the expected value under the policy $\pi$ when $\mu = \pi$. We also show that Equation 2 does not lower-bound the Q-value estimates pointwise.

Theorem 3.2 (Equation 2 results in a tighter lower bound).

The value of the policy under the Q-function from Equation 2, $\hat{V}^\pi(s) = \mathbb{E}_{\pi(a|s)}[\hat{Q}^\pi(s,a)]$, lower-bounds the true value of the policy obtained via exact policy evaluation, $V^\pi(s)$, when $\mu = \pi$, according to:

$\forall s \in \mathcal{D}:\quad \hat{V}^\pi(s) \;\le\; V^\pi(s) \;-\; \alpha\Big[(I - \gamma P^\pi)^{-1}\,\mathbb{E}_\pi\Big[\frac{\pi}{\hat{\pi}_\beta} - 1\Big]\Big](s) \;+\; \Big[(I - \gamma P^\pi)^{-1}\,\frac{C_{r,T,\delta}\, R_{\max}}{(1-\gamma)\sqrt{|\mathcal{D}|}}\Big](s).$

Thus, if $\alpha$ is large enough to dominate the sampling-error term, then $\hat{V}^\pi(s) \le V^\pi(s)$ with probability $\ge 1 - \delta$. When $\hat{\mathcal{B}}^\pi = \mathcal{B}^\pi$, any $\alpha > 0$ guarantees $\hat{V}^\pi(s) \le V^\pi(s)$.

The analysis presented above assumes that no function approximation is used in the Q-function, meaning that each iterate can be represented exactly. We can further generalize the result in Theorem 3.2 to the case of both linear function approximators and non-linear neural network function approximators, where the latter builds on the neural tangent kernel (NTK) framework (ntk). Due to space constraints, we present these results in Theorem D.1 and Theorem D.2 in Appendix D.1.

In summary, we showed that the basic CQL evaluation in Equation 1 learns a Q-function that lower-bounds the true Q-function $Q^\pi$, and the evaluation in Equation 2 provides a tighter lower bound on the expected Q-value of the policy $\pi$. For suitable $\alpha$, both bounds hold under sampling error and function approximation. Next, we will extend this result into a complete RL algorithm.

3.2 Conservative Q-Learning for Offline RL

We now present a general approach for offline policy learning, which we refer to as conservative Q-learning (CQL). As discussed in Section 3.1, we can obtain Q-values that lower-bound the value of a policy $\pi$ by solving Equation 2 with $\mu = \pi$. How should we utilize this for policy optimization? We could alternate between performing full off-policy evaluation for each policy iterate, $\hat{\pi}^k$, and one step of policy improvement. However, this can be computationally expensive. Alternatively, since the policy $\hat{\pi}^k$ is typically derived from the Q-function, we could instead choose $\mu(a|s)$ to approximate the policy that would maximize the current Q-function iterate, thus giving rise to an online algorithm.

We can formally capture such online algorithms by defining a family of optimization problems over $\mu(a|s)$, presented below; relative to Equation 2, these problems additionally optimize over $\mu$ with a regularizer $\mathcal{R}(\mu)$. An instance of this family is denoted by CQL($\mathcal{R}$) and is characterized by a particular choice of $\mathcal{R}(\mu)$:

$\min_{Q}\,\max_{\mu}\;\; \alpha\big(\mathbb{E}_{s \sim \mathcal{D},\, a \sim \mu(a|s)}\big[Q(s,a)\big] - \mathbb{E}_{s \sim \mathcal{D},\, a \sim \hat{\pi}_\beta(a|s)}\big[Q(s,a)\big]\big) \;+\; \frac{1}{2}\,\mathbb{E}_{s,a,s' \sim \mathcal{D}}\big[\big(Q(s,a) - \hat{\mathcal{B}}^{\pi_k} \hat{Q}^k(s,a)\big)^2\big] \;+\; \mathcal{R}(\mu)$   (CQL($\mathcal{R}$))   (3)

Variants of CQL. To demonstrate the generality of the CQL family of optimization problems, we discuss two specific instances within this family that are of special interest, and we evaluate them empirically in Section 6. If we choose $\mathcal{R}(\mu)$ to be the KL-divergence against a prior distribution, $\rho(a|s)$, i.e., $\mathcal{R}(\mu) = -D_{\mathrm{KL}}(\mu \,\|\, \rho)$, then we get $\mu(a|s) \propto \rho(a|s)\exp(Q(s,a))$ (for a derivation, see Appendix A). First, if $\rho = \mathrm{Unif}(a)$, then the first term in Equation 3 corresponds to a soft-maximum of the Q-values at any state $s$ and gives rise to the following variant of Equation 3, called CQL($\mathcal{H}$):

$\min_{Q}\;\; \alpha\,\mathbb{E}_{s \sim \mathcal{D}}\Big[\log \sum_{a} \exp\big(Q(s,a)\big) - \mathbb{E}_{a \sim \hat{\pi}_\beta(a|s)}\big[Q(s,a)\big]\Big] \;+\; \frac{1}{2}\,\mathbb{E}_{s,a,s' \sim \mathcal{D}}\big[\big(Q(s,a) - \hat{\mathcal{B}}^{\pi_k} \hat{Q}^k(s,a)\big)^2\big]$   (4)

Second, if $\rho(a|s)$ is chosen to be the previous policy $\hat{\pi}^{k-1}$, the first term in Equation 4 is replaced by an exponentially weighted average of Q-values of actions sampled from $\hat{\pi}^{k-1}(a|s)$. Empirically, we find that this variant, CQL($\rho$), can be more stable in high-dimensional action spaces (e.g., Table 2), where it is challenging to estimate $\log\sum_a \exp(Q(s,a))$ via sampling due to high variance. In Appendix A, we discuss an additional variant of CQL, drawing connections to distributionally robust optimization (namkoong2017variance). We will discuss a practical instantiation of a CQL deep RL algorithm in Section 4. CQL can be instantiated as either a Q-learning algorithm (with $\mathcal{B}^*$ instead of $\mathcal{B}^\pi$ in Equations 3 and 4) or as an actor-critic algorithm.
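For discrete action spaces, the log-sum-exp term in CQL($\mathcal{H}$) can be computed exactly over the action dimension of the Q-network's output; for continuous actions it must be approximated by sampling, which is what makes CQL($\rho$) attractive in high-dimensional action spaces. The sketch below is an illustration of the discrete-action case (not the authors' implementation), assuming a Q-network that maps states to a vector of per-action values and pre-computed TD targets.

```python
import torch
import torch.nn.functional as F

def cql_h_loss_discrete(q_net, states, actions, td_targets, alpha=1.0):
    """CQL(H) for discrete actions (sketch).

    q_net(states) -> tensor of shape [batch, num_actions].
    actions       -> dataset actions of shape [batch], used both for the Bellman
                     regression and for the maximization term.
    """
    q_all = q_net(states)                                       # [B, A]
    q_data = q_all.gather(1, actions.unsqueeze(1)).squeeze(1)   # Q(s, a_data)

    # Conservative term: soft-maximum over all actions minus dataset Q-values.
    logsumexp_q = torch.logsumexp(q_all, dim=1)
    cql_term = (logsumexp_q - q_data).mean()

    bellman_error = F.mse_loss(q_data, td_targets)
    return alpha * cql_term + 0.5 * bellman_error
```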

Theoretical analysis of CQL. Next, we theoretically analyze CQL to show that the policy updates derived in this way are indeed "conservative", in the sense that each successive policy iterate is optimized against a lower bound on its value. For clarity, we state the results in this section in the absence of finite-sample error, but sampling error can be incorporated in the same way as in Theorems 3.1 and 3.2, as we discuss in Appendix C. Theorem 3.3 shows that any variant of the CQL family learns Q-value estimates that lower-bound the actual Q-function under the action distribution defined by the policy, $\pi_k(a|s)$, under mild regularity conditions (slow updates on the policy).

Theorem 3.3 (CQL learns lower-bounded Q-values).

Let $\pi_{\hat{Q}^k}(a|s) \propto \exp(\hat{Q}^k(s,a))$ and assume that $D_{\mathrm{TV}}(\hat{\pi}^{k+1}, \pi_{\hat{Q}^k}) \le \varepsilon$ (i.e., $\hat{\pi}^{k+1}$ changes slowly w.r.t. $\hat{Q}^k$). Then, the policy value under $\hat{Q}^k$ lower-bounds the actual policy value, $\hat{V}^{k+1}(s) \le V^{k+1}(s)\;\forall s$, if

$\mathbb{E}_{\pi_{\hat{Q}^k}(a|s)}\Big[\frac{\pi_{\hat{Q}^k}(a|s)}{\hat{\pi}_\beta(a|s)} - 1\Big] \;\ge\; \max_{a:\,\hat{\pi}_\beta(a|s) > 0}\Big(\frac{\pi_{\hat{Q}^k}(a|s)}{\hat{\pi}_\beta(a|s)}\Big)\cdot \varepsilon.$

The LHS of this inequality is equal to the amount of conservatism induced in the value in iteration $k+1$ of the CQL update if the learned policy were equal to the soft-optimal policy for $\hat{Q}^k$, i.e., when $\hat{\pi}^{k+1} = \pi_{\hat{Q}^k}$. However, as the actual policy, $\hat{\pi}^{k+1}$, may be different, the RHS is the maximal amount of potential overestimation due to this difference. To obtain a lower bound, we require the amount of underestimation to be higher, which is obtained when $\varepsilon$ is small, i.e., when the policy changes slowly.

Our final result shows that the CQL Q-function update is "gap-expanding", by which we mean that the difference in Q-values at in-distribution actions and over-optimistically erroneous out-of-distribution actions is higher than the corresponding difference under the actual Q-function. This implies that the policy $\pi_k(a|s)$ is constrained to be closer to the dataset distribution, $\hat{\pi}_\beta(a|s)$; thus the CQL update implicitly prevents the detrimental effects of OOD actions and distribution shift, which have been a major concern in offline RL settings (kumar2019stabilizing; levine2020offline; fujimoto2018off).

Theorem 3.4 (CQL is gap-expanding).

At any iteration $k$, CQL expands the difference in expected Q-values under the behavior policy $\pi_\beta(a|s)$ and $\mu_k(a|s)$, such that for large enough values of $\alpha_k$, we have that

$\mathbb{E}_{\pi_\beta(a|s)}\big[\hat{Q}^k(s,a)\big] - \mathbb{E}_{\mu_k(a|s)}\big[\hat{Q}^k(s,a)\big] \;>\; \mathbb{E}_{\pi_\beta(a|s)}\big[Q^k(s,a)\big] - \mathbb{E}_{\mu_k(a|s)}\big[Q^k(s,a)\big].$

When function approximation or sampling error makes OOD actions have higher learned Q-values, CQL backups are expected to be more robust, in that the policy is updated using Q-values that prefer in-distribution actions. As we will empirically show in Appendix B, prior offline RL methods that do not explicitly constrain or regularize the Q-function may not enjoy such robustness properties.

To summarize, we showed that the CQL RL algorithm learns lower-bound Q-values with large enough $\alpha$, meaning that the final policy attains at least its estimated value. We also showed that the Q-function is gap-expanding, meaning that it should only ever over-estimate the gap between in-distribution and out-of-distribution actions, preventing the policy from selecting OOD actions.

4 Practical Algorithm and Implementation Details

1:  Initialize Q-function, $Q_\theta$, and optionally a policy, $\pi_\phi$.
2:  for step $t$ in {1, …, N} do
3:     Train the Q-function using gradient steps on the objective from Equation 4 (use the backup operator $\mathcal{B}^*$ for Q-learning and $\mathcal{B}^{\pi_\phi}$ for actor-critic).
4:     (only with actor-critic) Improve the policy $\pi_\phi$ via gradient steps with SAC-style entropy regularization, maximizing $\mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi_\phi(\cdot|s)}[Q_\theta(s,a) - \log \pi_\phi(a|s)]$.
5:  end for
Algorithm 1 Conservative Q-Learning (both variants)

We now describe two practical offline deep reinforcement learning methods based on CQL: an actor-critic variant and a Q-learning variant. Pseudocode is shown in Algorithm 1, with the differences from conventional actor-critic algorithms (e.g., SAC (haarnoja)) and deep Q-learning algorithms (e.g., DQN (mnih2013playing)) being the CQL-specific terms. Our algorithm uses the CQL($\mathcal{H}$) (or, more generally, CQL($\mathcal{R}$)) objective from the CQL framework for training the Q-function $Q_\theta$, which is parameterized by a neural network with parameters $\theta$. For the actor-critic algorithm, a policy $\pi_\phi$ is trained as well. Our algorithm modifies the Q-function objective, swapping the standard Bellman error for the CQL($\mathcal{H}$) or CQL($\rho$) objective in a standard actor-critic or Q-learning setting, as shown in Line 3. As discussed in Section 3.2, due to the explicit penalty on the Q-function, CQL methods do not use a policy constraint, unlike prior offline RL methods (kumar2019stabilizing; wu2019behavior; siegel2020keep; levine2020offline). Hence, we do not require fitting an additional behavior policy estimator, which simplifies our method.

Implementation details. Our algorithm requires the addition of only about 20 lines of code on top of standard implementations of soft actor-critic (SAC) (haarnoja) for the continuous control experiments and of QR-DQN (dabney2018distributional) for the discrete control experiments. The tradeoff factor $\alpha$ is automatically tuned via Lagrangian dual gradient descent for continuous control, and is fixed at the constant values described in Appendix F for discrete control. We use default hyperparameters from SAC, except that the learning rate for the policy is chosen to be 3e-5 (vs. 3e-4 or 1e-4 for the Q-function), as dictated by Theorem 3.3. Further details are provided in Appendix F.
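The Lagrangian treatment of the tradeoff factor can be implemented with a standard dual gradient step: $\alpha$ grows while the conservative penalty exceeds a chosen budget and shrinks otherwise. The fragment below is an illustrative sketch of this scheme; the threshold tau and the log-alpha parameterization are assumed implementation choices, not values prescribed by the paper.

```python
import torch

log_alpha = torch.zeros(1, requires_grad=True)       # alpha = exp(log_alpha) >= 0
alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)

def update_alpha(cql_penalty_value, tau=5.0):
    """Dual gradient step on alpha: grow alpha while penalty > tau, shrink otherwise.

    cql_penalty_value: tensor holding the current conservative penalty,
                       e.g. E_mu[Q] - E_data[Q] averaged over the batch.
    """
    alpha = log_alpha.exp()
    # Maximizing alpha * (penalty - tau) over alpha is implemented by minimizing
    # its negation; detach the penalty so only alpha is updated in this step.
    alpha_loss = -alpha * (cql_penalty_value.detach() - tau)
    alpha_optimizer.zero_grad()
    alpha_loss.backward()
    alpha_optimizer.step()
    return log_alpha.exp().detach()
```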

5 Related Work

We now briefly discuss prior work in offline RL and off-policy evaluation, comparing and contrasting these works with our approach. More technical discussion of related work is provided in Appendix E.

Off-policy evaluation (OPE). Several different paradigms have been used to perform off-policy evaluation. Earlier works (precup2000eligibility; peshkin2002learning; precup2001off) used per-action importance sampling on Monte-Carlo returns to obtain an OPE return estimator. Recent approaches (liu2018breaking; gelada2019off; nachum2019dualdice; Zhang2020GenDICE:) use marginalized importance sampling by directly estimating the state-distribution importance ratios via some form of dynamic programming (levine2020offline) and typically exhibit less variance than per-action importance sampling at the cost of bias. Because these methods use dynamic programming, they can suffer from OOD actions  (levine2020offline; gelada2019off; hallak2017consistent; nachum2019dualdice). In contrast, the regularizer in CQL explicitly addresses the impact of OOD actions due to its gap-expanding behavior, and obtains conservative value estimates.

Offline RL. As discussed in Section 2, offline Q-learning methods suffer from issues pertaining to OOD actions. Prior works have attempted to solve this problem by constraining the learned policy to be "close" to the behavior policy, for example as measured by KL-divergence (jaques2019way; wu2019behavior; peng2019awr; siegel2020keep), Wasserstein distance (wu2019behavior), or MMD (kumar2019stabilizing), and then only using actions sampled from this constrained policy in the Bellman backup, or by applying a value penalty. Most of these methods require a separately estimated model of the behavior policy, $\hat{\pi}_\beta(a|s)$ (fujimoto2018off; kumar2019stabilizing; wu2019behavior; jaques2019way; siegel2020keep), and are thus limited by their ability to accurately estimate the unknown behavior policy, which might be especially complex in settings where the data is collected from multiple sources (levine2020offline). In contrast, CQL does not require estimating the behavior policy. Prior work has explored some forms of Q-function penalties (hester2018deep; vecerik2017leveraging), but only in the standard online RL setting augmented with demonstrations. luo2019learning learn a conservatively-extrapolated value function by enforcing a linear extrapolation property over the state space, and use it alongside a learned dynamics model to obtain policies for goal-reaching tasks. Alternate prior approaches to offline RL estimate some sort of uncertainty to determine the trustworthiness of a Q-value prediction (kumar2019stabilizing; agarwal2019optimistic; levine2020offline), typically using uncertainty estimation techniques from exploration in online RL (osband2016deep; jaksch2010near; osband2017posterior; burda2018exploration). These methods have not generally been performant in offline RL (fujimoto2018off; kumar2019stabilizing; levine2020offline) due to the high-fidelity requirements of uncertainty estimates in offline RL (levine2020offline). The gap-expanding property of CQL backups, shown in Theorem 3.4, is related to how gap-increasing Bellman backup operators (bellemare2016increasing; lu2018general) are more robust to estimation error in online RL.

6 Experimental Evaluation

We compare CQL to prior offline RL methods on a range of domains and dataset compositions, including continuous and discrete action spaces, state observations of varying dimensionality, and high-dimensional image inputs. We first evaluate actor-critic CQL, using CQL($\mathcal{H}$) from Algorithm 1, on continuous control datasets from the D4RL benchmark (d4rl). We compare to: prior offline RL methods that use a policy constraint – BEAR (kumar2019stabilizing) and BRAC (wu2019behavior); SAC (haarnoja), an off-policy actor-critic method that we adapt to the offline setting; and behavioral cloning (BC).

Task Name SAC BC BEAR BRAC-p BRAC-v CQL($\mathcal{H}$)
halfcheetah-random 30.5 2.1 25.5 23.5 28.1 35.4
hopper-random 11.3 9.8 9.5 11.1 12.0 10.8
walker2d-random 4.1 1.6 6.7 0.8 0.5 7.0
halfcheetah-medium -4.3 36.1 38.6 44.0 45.5 44.4
walker2d-medium 0.9 6.6 33.2 72.7 81.3 79.2
hopper-medium 0.8 29.0 47.6 31.2 32.3 58.0
halfcheetah-expert -1.9 107.0 108.2 3.8 -1.1 104.8
hopper-expert 0.7 109.0 110.3 6.6 3.7 109.9
walker2d-expert -0.3 125.7 106.1 -0.2 -0.0 153.9
halfcheetah-medium-expert 1.8 35.8 51.7 43.8 45.3 62.4
walker2d-medium-expert 1.9 11.3 10.8 -0.3 0.9 98.7
hopper-medium-expert 1.6 111.9 4.0 1.1 0.8 111.0
halfcheetah-random-expert 53.0 1.3 24.6 30.2 2.2 92.5
walker2d-random-expert 0.8 0.7 1.9 0.2 2.7 91.1
hopper-random-expert 5.6 10.1 10.1 5.8 11.1 110.5
halfcheetah-mixed -2.4 38.4 36.2 45.6 45.9 46.2
hopper-mixed 3.5 11.8 25.3 0.7 0.8 48.6
walker2d-mixed 1.9 11.3 10.8 -0.3 0.9 26.7
Table 1: Performance of CQL($\mathcal{H}$) and prior methods on gym domains from D4RL, on the normalized return metric, averaged over 4 seeds. Note that CQL performs similarly to or better than the best prior method on simple datasets, and greatly outperforms prior methods on complex data distributions ("–mixed", "–random-expert", "–medium-expert").

Gym domains. Results for the gym domains are shown in Table 1. The results for BEAR, BRAC, SAC, and BC are based on numbers reported by d4rl. On the datasets generated from a single policy, marked as “-random”, “-expert” and “-medium”, CQL roughly matches or exceeds the best prior methods, but by a small margin. However, on datasets that combine multiple policies (“-mixed”, “-medium-expert” and “-random-expert”), that are more likely to be common in practical datasets, CQL outperforms prior methods by large margins, sometimes as much as 2-3x.

Adroit tasks. The more complex Adroit (rajeswaran2018dapg) tasks in D4RL require controlling a 24-DoF robotic hand using limited data from human demonstrations. These tasks are substantially more difficult than the gym tasks in terms of both dataset composition and dimensionality. Prior offline RL methods generally struggle to learn meaningful behaviors on these tasks, and the strongest baseline is BC. As shown in Table 2, CQL variants are the only methods that improve over BC, attaining scores that are 2-9x those of the next best offline RL method. CQL($\rho$) with $\rho = \hat{\pi}^{k-1}$ (the previous policy) outperforms CQL($\mathcal{H}$) on a number of these tasks, because the higher action dimensionality results in higher variance for the CQL($\mathcal{H}$) importance weights. Both variants outperform prior methods.

Domain Task Name BC SAC BEAR BRAC-p BRAC-v CQL($\mathcal{H}$) CQL($\rho$)
AntMaze antmaze-umaze 65.0 0.0 73.0 50.0 70.0 74.0 73.5
antmaze-umaze-diverse 55.0 0.0 61.0 40.0 70.0 84.0 61.0
antmaze-medium-play 0.0 0.0 0.0 0.0 0.0 61.2 4.6
antmaze-medium-diverse 0.0 0.0 8.0 0.0 0.0 53.7 5.1
antmaze-large-play 0.0 0.0 0.0 0.0 0.0 15.8 3.2
antmaze-large-diverse 0.0 0.0 0.0 0.0 0.0 14.9 2.3
Adroit pen-human 34.4 6.3 -1.0 8.1 0.6 37.5 55.8
hammer-human 1.5 0.5 0.3 0.3 0.2 4.4 2.1
door-human 0.5 3.9 -0.3 -0.3 -0.3 9.9 9.1
relocate-human 0.0 0.0 -0.3 -0.3 -0.3 0.20 0.35
pen-cloned 56.9 23.5 26.5 1.6 -2.5 39.2 40.3
hammer-cloned 0.8 0.2 0.3 0.3 0.3 2.1 5.7
door-cloned -0.1 0.0 -0.1 -0.1 -0.1 0.4 3.5
relocate-cloned -0.1 -0.2 -0.3 -0.3 -0.3 -0.1 -0.1
Kitchen kitchen-complete 33.8 15.0 0.0 0.0 0.0 43.8 31.3
kitchen-partial 33.8 0.0 13.1 0.0 0.0 49.8 50.1
kitchen-undirected 47.5 2.5 47.2 0.0 0.0 51.0 52.4
Table 2: Normalized scores of all methods on AntMaze, Adroit, and kitchen domains from D4RL, averaged across 4 seeds. On the harder mazes, CQL is the only method that attains non-zero returns, and it is the only method to outperform simple behavioral cloning on the Adroit tasks with human demonstrations. We observed that the CQL($\rho$) variant, which avoids importance weights, trains more stably on the higher-dimensional Adroit tasks, with no sudden fluctuations in policy performance over the course of training.

AntMaze. These D4RL tasks require composing parts of suboptimal trajectories to form more optimal policies for reaching goals on a MuJoCo Ant robot. Prior methods make some progress on the simpler U-maze, but only CQL is able to make meaningful progress on the much harder medium and large mazes, outperforming prior methods by a very wide margin.

Kitchen tasks. Next, we evaluate CQL on the Franka kitchen domain (gupta2019relay) from D4RL (d4rl_repo). The goal is to control a 9-DoF robot to manipulate multiple objects (microwave, kettle, etc.) sequentially, in a single episode to reach a desired configuration, with only sparse 0-1 completion reward for every object that attains the target configuration. These tasks are especially challenging, since they require composing parts of trajectories, precise long-horizon manipulation, and handling human-provided teleoperation data. As shown in Table 2, CQL outperforms prior methods in this setting, and is the only method that outperforms behavioral cloning, attaining over 40% success rate on all tasks.

Offline RL on Atari games. Lastly, we evaluate a discrete-action Q-learning variant of CQL (Algorithm 1) on offline, image-based Atari games (bellemare2013arcade). We compare CQL to REM (agarwal2019optimistic) and QR-DQN (dabney2018distributional) on the five Atari tasks (Pong, Breakout, Qbert, Seaquest and Asterix) that are evaluated in detail by agarwal2019optimistic, using the dataset released by the authors.

Figure 1: Performance of CQL, QR-DQN and REM as a function of training steps (x-axis) in setting (1) when provided with only the first 20% of the samples of an online DQN run. Note that CQL is able to learn stably on 3 out of 4 games, and its performance does not degrade as steeply as QR-DQN on Seaquest.

Following the evaluation protocol of agarwal2019optimistic, we evaluated on two types of datasets, both of which were generated from the DQN-replay dataset released by (agarwal2019optimistic): (1) a dataset consisting of the first 20% of the samples observed by an online DQN agent, and (2) datasets consisting of only 1% and 10% of all samples observed by an online DQN agent (Figures 6 and 7 in (agarwal2019optimistic)). In setting (1), shown in Figure 1, CQL generally achieves similar or better performance than QR-DQN and REM throughout training. In setting (2), with only 1% or 10% of the data (Table 3), CQL substantially outperforms REM and QR-DQN, especially in the harder 1% condition, achieving 36x and 6x the return of the best prior method on Q*bert and Breakout, respectively.

Task Name QR-DQN REM CQL($\mathcal{H}$)
Pong (1%) -13.8 -6.9 19.3
Breakout 7.9 11.0 61.1
Q*bert 383.6 343.4 14012.0
Seaquest 672.9 499.8 779.4
Asterix 166.3 386.5 592.4
Pong (10%) 15.1 8.9 18.5
Breakout 151.2 86.7 269.3
Q*bert 7091.3 8624.3 13855.6
Seaquest 2984.8 3936.6 3674.1
Asterix 189.2 75.1 156.3
Table 3: CQL, REM and QR-DQN in setting (2) with 1% data (top) and 10% data (bottom). CQL drastically outperforms prior methods with 1% data, and usually attains better performance with 10% data.


Analysis of CQL. Finally, we perform an empirical evaluation to verify that CQL indeed lower-bounds the value function, thus empirically verifying Theorem 3.2 and the results of Appendix D.1. To this end, we estimate the average value of the learned policy predicted by CQL and report the difference against the actual discounted return of the policy in Table 4. We also estimate these values for baselines, including the minimum predicted Q-value under an ensemble (haarnoja; fujimoto2018addressing) of Q-functions of varying ensemble sizes, which is a standard technique to prevent overestimated Q-values (fujimoto2018addressing; haarnoja; hasselt2010double), and BEAR (kumar2019stabilizing), a policy constraint method. The results show that CQL learns a lower bound for all three tasks, whereas the baselines are prone to overestimation. We also evaluate a variant of CQL that uses Equation 1, and observe that the resulting values are lower (that is, they underestimate the true values more) as compared to CQL($\mathcal{H}$). This provides empirical evidence that CQL($\mathcal{H}$) attains a tighter lower bound than the point-wise bound in Equation 1, as per Theorem 3.2.

Task Name CQL($\mathcal{H}$) CQL (Eqn. 1) Ensemble(2) Ens.(4) Ens.(10) Ens.(20) BEAR
hopper-medium-expert -43.20 -151.36 3.71e6 2.93e6 0.32e6 24.05e3 65.93
hopper-mixed -10.93 -22.87 15.00e6 59.93e3 8.92e3 2.47e3 1399.46
hopper-medium -7.48 -156.70 26.03e12 437.57e6 1.12e12 885e3 4.32
Table 4: Difference between the policy value predicted by each algorithm and the true policy value, for CQL($\mathcal{H}$), a variant of CQL that uses Equation 1, the minimum of an ensemble of varying sizes, and BEAR (kumar2019stabilizing), on three D4RL datasets. CQL is the only method that lower-bounds the actual return (i.e., has negative differences), and CQL($\mathcal{H}$) is much less conservative than CQL (Eqn. 1).
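The gap reported in Table 4 can be estimated with a simple Monte Carlo procedure: compare the learned Q-function's prediction at initial state-action pairs against discounted returns of rollouts of the learned policy. The sketch below is an illustration under an assumed generic env/policy/q_fn interface, not the evaluation code used for the paper.

```python
import numpy as np

def estimate_value_gap(env, policy, q_fn, num_episodes=10, gamma=0.99):
    """Estimate E[Q_hat(s0, a0)] minus the average discounted Monte Carlo return.

    A negative result indicates that the learned Q-function lower-bounds the
    policy's actual value, as in Table 4.  Assumed minimal interfaces:
    env.reset() -> state, env.step(action) -> (state, reward, done),
    policy.act(state) -> action, q_fn(state, action) -> float.
    """
    predicted, actual = [], []
    for _ in range(num_episodes):
        state = env.reset()
        action = policy.act(state)
        predicted.append(q_fn(state, action))   # learned estimate at the start
        ret, discount, done = 0.0, 1.0, False
        while not done:
            state, reward, done = env.step(action)
            ret += discount * reward
            discount *= gamma
            action = policy.act(state)
        actual.append(ret)
    return float(np.mean(predicted) - np.mean(actual))
```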

In Appendix B, we also present an empirical analysis showing that Theorem 3.4 – that CQL is gap-expanding – holds in practice, and in Appendix G we present an ablation study over various design choices used in CQL.

7 Discussion

We proposed conservative Q-learning (CQL), an algorithmic framework for offline RL that learns a lower bound on the policy value. Empirically, we demonstrated that CQL outperforms prior offline RL methods on a wide range of offline RL benchmark tasks, including complex control tasks and tasks with raw image observations. In many cases, the performance of CQL is substantially better than that of the best-performing prior methods, exceeding their final returns by 2-5x. The simplicity and efficacy of CQL make it a promising choice for a wide range of real-world offline RL problems. However, a number of challenges remain. While we prove that CQL learns lower bounds on the Q-function in the tabular, linear, and a subset of non-linear function approximation cases, a rigorous theoretical analysis of CQL with deep neural nets is left for future work. Additionally, offline RL methods are liable to suffer from overfitting in the same way as standard supervised methods, so another important challenge for future work is to devise simple and effective early stopping methods, analogous to validation error in supervised learning.

Acknowledgements

We thank Mohammad Norouzi, Oleh Rybkin, Anton Raichuk, Vitchyr Pong and anonymous reviewers from the Robotic AI and Learning Lab at UC Berkeley for their feedback on an earlier version of this paper. We thank Rishabh Agarwal for help with the Atari QR-DQN/REM codebase and for sharing baseline results. This research was funded by the DARPA Assured Autonomy program, and compute support from Google, Amazon, and NVIDIA.

References

Appendix A Discussion of CQL Variants

We derive several variants of CQL in Section 3.2. Here, we discuss these variants in more detail and describe their specific properties. We first derive the variants CQL($\mathcal{H}$) and CQL($\rho$), and then present another variant of CQL, which we call CQL(var). This third variant has strong connections to distributionally robust optimization [namkoong2017variance].

CQL($\mathcal{H}$). In order to derive CQL($\mathcal{H}$), we substitute $\mathcal{R}(\mu) = \mathcal{H}(\mu)$, the entropy of $\mu$, and solve the optimization over $\mu$ in closed form for a given Q-function. For an optimization problem of the form:

$\max_{\mu}\; \mathbb{E}_{x \sim \mu(x)}[f(x)] + \mathcal{H}(\mu) \quad \text{s.t.} \quad \sum_x \mu(x) = 1,\;\; \mu(x) \ge 0\;\; \forall x,$

the optimal solution is equal to $\mu^*(x) = \frac{1}{Z}\exp(f(x))$, where $Z$ is a normalizing factor. Plugging this into Equation 3, we exactly obtain Equation 4.
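This closed form can be checked numerically: with an entropy regularizer, the optimal $\mu$ is the softmax of the Q-values, and substituting it back makes the penalty term equal to the log-sum-exp of the Q-values, as in Equation 4. Below is a small self-contained check of that identity (an illustrative sketch) for a random Q-vector.

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.normal(size=5)                      # Q-values at one state, 5 actions

mu = np.exp(q) / np.exp(q).sum()            # optimal mu proportional to exp(Q(s, .))
entropy = -(mu * np.log(mu)).sum()
objective = (mu * q).sum() + entropy        # E_mu[Q] + H(mu) at the optimum

soft_max = np.log(np.exp(q).sum())          # log-sum-exp of the Q-values
assert np.isclose(objective, soft_max)      # the two agree, as in Equation 4
```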

CQL($\rho$). In order to derive CQL($\rho$), we follow the above derivation, but the regularizer is a KL-divergence against a prior distribution $\rho(a|s)$ instead of the entropy. The optimal solution is then given by $\mu^*(x) = \frac{1}{Z}\,\rho(x)\exp(f(x))$, where $Z$ is a normalizing factor. Plugging this back into the CQL family (Equation 3), we obtain the following objective for training the Q-function (modulo some normalization terms):

$\min_{Q}\;\; \alpha\,\mathbb{E}_{s \sim \mathcal{D}}\Big[\mathbb{E}_{a \sim \rho(a|s)}\Big[Q(s,a)\,\frac{\exp(Q(s,a))}{Z}\Big] - \mathbb{E}_{a \sim \hat{\pi}_\beta(a|s)}\big[Q(s,a)\big]\Big] \;+\; \frac{1}{2}\,\mathbb{E}_{s,a,s' \sim \mathcal{D}}\big[\big(Q(s,a) - \hat{\mathcal{B}}^{\pi_k} \hat{Q}^k(s,a)\big)^2\big]$   (5)

CQL(var). Finally, we derive a CQL variant that is inspired by the perspective of distributionally robust optimization (DRO) [namkoong2017variance]. This version penalizes the variance in the Q-function across actions at all states $s \in \mathcal{D}$, under some action-conditional distribution of our choice. In order to derive a canonical form of this variant, we invoke an identity from namkoong2017variance, which helps us simplify Equation 3. To start, we define the notion of a "robust expectation": for any function $f(x)$ and any empirical distribution $\hat{P}(x)$ over a dataset of $N$ elements, the "robust" expectation of $f$ under a divergence ball around $\hat{P}$

can be approximated using the following upper-bound:

(6)

where the gap between the two sides of the inequality decays inversely w.r.t. the dataset size, $N$. By using Equation 6 to simplify Equation 3, we obtain an objective for training the Q-function that penalizes the variance of Q-function predictions under the distribution $\rho(a|s)$.

(7)

The only remaining decision is the choice of $\rho(a|s)$, which can be chosen to be the inverse of the empirical action distribution in the dataset, or even uniform over actions, $\mathrm{Unif}(a)$, to obtain this variant of variance-regularized CQL.

Appendix B Discussion of Gap-Expanding Behavior of CQL Backups

In this section, we discuss in detail the consequences of the gap-expanding behavior of CQL backups relative to prior methods based on policy constraints, which, as we show in this section, may not exhibit such gap-expanding behavior in practice. To recap, Theorem 3.4 shows that the CQL backup operator increases the difference between the expected Q-value at in-distribution actions ($a \sim \hat{\pi}_\beta(a|s)$) and at out-of-distribution actions ($a \sim \mu(a|s)$ with low density under $\hat{\pi}_\beta$). We refer to this property as the gap-expanding property of the CQL update operator.

Function approximation may give rise to erroneous Q-values at OOD actions. We start by discussing the behavior of prior methods based on policy constraints [kumar2019stabilizing, fujimoto2018off, jaques2019way, wu2019behavior] in the presence of function approximation. To recap, because computing the target value requires querying the Q-function at actions drawn from the learned policy $\pi$, constraining $\pi$ to be close to $\pi_\beta$ will avoid evaluating the Q-function on OOD actions. These methods typically do not impose any further form of regularization on the learned Q-function. Even with policy constraints, because function approximation is used to represent the Q-function, learned Q-values at two distinct state-action pairs are coupled together. As prior work has argued and shown [achiam2019towards, fu2019diagnosing, kumar2020discor], the "generalization" or coupling effects of the function approximator may be heavily influenced by the properties of the data distribution [fu2019diagnosing, kumar2020discor]. For instance, fu2019diagnosing empirically shows that when the dataset distribution is narrow (i.e., the state-action marginal entropy is low [fu2019diagnosing]), the coupling effects of the Q-function approximator can give rise to incorrect Q-values at different states, though this behavior is absent without function approximation and is not as severe with high-entropy (e.g., uniform) state-action marginal distributions.

In offline RL, we will shortly present empirical evidence on high-dimensional MuJoCo tasks showing that certain dataset distributions, $\mathcal{D}$, may cause the learned Q-value at an OOD action $a$ at a state $s$ to in fact take on higher values than the Q-values at in-distribution actions at intermediate iterations of learning. This problem persists even when a large number of samples is provided for training, and the agent cannot correct these errors because there is no active data collection.

Since actor-critic methods, including those with policy constraints, use the learned Q-function to train the policy in an iterative online policy evaluation and policy improvement cycle, as discussed in Section 2, the erroneous Q-function may push the policy towards OOD actions, especially when no policy constraints are used. Of course, policy constraints should prevent the policy from choosing OOD actions; however, as we will show, in certain cases policy constraint methods may also fail to prevent the effects of incorrectly high Q-values at OOD actions on the policy.

How can CQL address this problem? As we show in Theorem 3.4, the difference between expected Q-values at in-distribution actions and out-of-distribution actions is expanded by the CQL update. This property is a direct consequence of the specific nature of the CQL regularizer, which maximizes Q-values under the dataset distribution and minimizes them otherwise. This difference depends upon the choice of $\alpha$, which can be controlled directly, since it is a free parameter. Thus, by choosing $\alpha$ appropriately, CQL can push down the learned Q-values at out-of-distribution actions as much as desired, correcting for the erroneous overestimation in the process.

Empirical evidence on high-dimensional benchmarks with neural networks. We next empirically demonstrate the existence of such Q-function estimation error on high-dimensional MuJoCo domains when deep neural network function approximators are used with stochastic optimization techniques. In order to measure this error, we plot the difference between the maximum Q-value over actions sampled from a uniformly random policy, $\mathrm{Unif}(a)$, and the expected Q-value under actions sampled from the behavior distribution, $\hat{\pi}_\beta(a|s)$. That is, we plot the quantity

$\hat{\Delta}^k \;=\; \mathbb{E}_{s \sim \mathcal{D}}\big[\max_{i} \hat{Q}^k(s, a_i)\big] \;-\; \mathbb{E}_{s,a \sim \mathcal{D}}\big[\hat{Q}^k(s,a)\big], \qquad a_i \sim \mathrm{Unif}(a)$   (8)

over the iterations of training, indexed by $k$. This quantity, intuitively, represents an estimate of the "advantage" of the (approximately) optimal action under the Q-function relative to the actions observed in the dataset. Since we cannot perform exact maximization over the learned Q-function in a continuous action space, we estimate the maximum via sampling, as described in Equation 8.
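A sampling-based estimate of the quantity in Equation 8 can be computed as in the sketch below (an illustration with placeholder q_net and batch tensors, not the analysis code used for the paper; the uniform action samples stand in for the maximization over the action space).

```python
import torch

def estimate_delta(q_net, states, dataset_actions, action_low, action_high,
                   num_uniform_samples=32):
    """Sketch of the Equation 8 estimate: E_s[max_i Q(s, a_i)] - E_{s,a~D}[Q(s, a)],
    with a_i drawn uniformly from the action space. Negative values mean that
    in-distribution actions receive higher Q-values (gap-expanding behavior)."""
    with torch.no_grad():
        q_data = q_net(states, dataset_actions).mean()

        batch_size, act_dim = dataset_actions.shape
        uniform = torch.rand(batch_size, num_uniform_samples, act_dim)
        uniform = action_low + (action_high - action_low) * uniform
        states_rep = states.unsqueeze(1).expand(-1, num_uniform_samples, -1)
        q_rand = q_net(states_rep.reshape(-1, states.shape[-1]),
                       uniform.reshape(-1, act_dim))
        q_rand_max = q_rand.reshape(batch_size, num_uniform_samples).max(dim=1).values

    return (q_rand_max.mean() - q_data).item()
```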

We present these plots in Figure 2 on two datasets: hopper-expert and hopper-medium. The expert dataset is generated from a near-deterministic expert policy, exhibits narrow coverage of the state-action space, and is limited to only a few directed trajectories. On this dataset, we find that $\hat{\Delta}^k$ is always positive for the policy-constraint method (Figure 2(a)) and increases during training – note the continuous rise in $\hat{\Delta}^k$ values for the policy-constraint method in Figure 2(a). This means that even if the dataset is generated from an expert policy, and policy constraints correct target values for OOD actions, incorrect Q-function generalization may make an out-of-distribution action appear promising. For the more stochastic hopper-medium dataset, which consists of a more diverse set of trajectories, shown in Figure 2(b), we still observe that $\hat{\Delta}^k > 0$ for the policy-constraint method; however, the relative magnitude is smaller than on hopper-expert.

In contrast, Q-functions learned by CQL generally satisfy $\hat{\Delta}^k < 0$, and these values are clearly smaller than those for the policy-constraint method. This provides some empirical evidence for Theorem 3.4, in that the maximum Q-value at a randomly chosen action from the uniform distribution over the action space is smaller than the Q-value at in-distribution actions.

On the hopper-expert task, as we show in Figure 2(a) (right), we eventually observe an "unlearning" effect in the policy-constraint method, where the policy performance deteriorates after additional training iterations. This "unlearning" effect is similar to what has been observed when standard off-policy Q-learning algorithms without any policy constraint are used in the offline regime [kumar2019stabilizing, levine2020offline]; it is absent in the case of CQL, even after equally many training steps. The performance on the more stochastic hopper-medium dataset fluctuates, but does not deteriorate suddenly.

To summarize this discussion, we concretely observed the following points via empirical evidence:

  • CQL backups are gap-expanding in practice, as indicated by the negative $\hat{\Delta}^k$ values in Figure 2.

  • Policy constraint methods that do not impose any regularization on the Q-function may exhibit highly positive $\hat{\Delta}^k$ values during training, especially with narrow data distributions, indicating that gap-expansion may be absent.

  • When $\hat{\Delta}^k$ values grow continuously during training, the policy may eventually suffer from an unlearning effect [levine2020offline], as shown in Figure 2(a).

(a) hopper-expert-v0
(b) hopper-medium-v0
Figure 2: $\hat{\Delta}^k$ as a function of training iterations for the hopper-expert and hopper-medium datasets. Note that CQL (left) generally has negative values of $\hat{\Delta}^k$, whereas BEAR (right) generally has positive values, which also increase over the course of training.

Appendix C Theorem Proofs

In this section, we provide proofs of the theorems in Sections 3.1 and 3.2. We first redefine notation for clarity and then provide the proofs of the results in the main paper.

Notation. Let $k$ denote an iteration of policy evaluation (in Section 3.1) or Q-iteration (in Section 3.2). In iteration $k$, the objective – Equation 2 or Equation 3 – is optimized using the previous iterate (i.e., $\hat{Q}^{k-1}$) as the target value in the backup. $Q^k$ denotes the true, tabular Q-function iterate in the MDP, without any correction. In an iteration, say $k$, the current tabular Q-function iterate, $Q^k$, is related to the previous tabular Q-function iterate as $Q^k = \mathcal{B}^\pi Q^{k-1}$ (for policy evaluation) or $Q^k = \mathcal{B}^* Q^{k-1}$ (for policy learning). Let $\hat{Q}^k$ denote the $k$-th Q-function iterate obtained from CQL. Let $\hat{V}^k$ denote the value function, $\hat{V}^k(s) := \mathbb{E}_{a \sim \pi(a|s)}[\hat{Q}^k(s,a)]$.

A note on the value of $\alpha$. Before proving the theorems, we remark that the statements of Theorems 3.1, 3.2, and D.1 (discussed in Appendix D) show that CQL produces lower bounds if $\alpha$ is larger than some threshold, chosen so as to overcome either sampling error (Theorems 3.1 and 3.2) or function approximation error (Theorem D.1). While the optimal $\alpha$ in some of these cases depends on the current Q-value, $\hat{Q}^k$, we can always choose a worst-case value of $\alpha$ by using a uniform upper bound on $\hat{Q}^k$, still guaranteeing a lower bound.

We first prove Theorem 3.1, which shows that policy evaluation using a simplified version of CQL (Equation 1) results in a point-wise lower-bound on the Q-function.

Proof of Theorem 3.1. To start, we first note the form of the resulting Q-function iterate, $\hat{Q}^{k+1}$, in the setting without function approximation. By setting the derivative of Equation 1 to 0, we obtain the following expression for $\hat{Q}^{k+1}$ in terms of $\hat{Q}^k$:

$\forall\, s, a \in \mathcal{D},\; k: \quad \hat{Q}^{k+1}(s,a) \;=\; \hat{\mathcal{B}}^\pi \hat{Q}^k(s,a) \;-\; \alpha\,\frac{\mu(a|s)}{\hat{\pi}_\beta(a|s)}$   (9)

Now, since $\alpha > 0$, $\mu(a|s) > 0$, and $\hat{\pi}_\beta(a|s) > 0$, we observe that at each iteration we underestimate the next Q-value iterate, i.e., $\hat{Q}^{k+1}(s,a) \le \hat{\mathcal{B}}^\pi \hat{Q}^k(s,a)$.

Accounting for sampling error. Note that so far we have only shown that the Q-values are upper-bounded by the "empirical Bellman targets" given by $\hat{\mathcal{B}}^\pi \hat{Q}^k(s,a)$. In order to relate $\hat{Q}^{k+1}$ to the true Q-value iterate, we need to relate the empirical Bellman operator, $\hat{\mathcal{B}}^\pi$, to the actual Bellman operator, $\mathcal{B}^\pi$. In Appendix D.3, we show that if the reward function $r(s,a)$ and the transition function $T(s'|s,a)$ satisfy "concentration" properties, meaning that the difference between the observed reward sample (and, analogously, the observed transition) and the actual reward function (and transition matrix) is bounded with high probability, then the overestimation due to the empirical backup operator is bounded. Formally, with high probability (w.h.p.) $\ge 1 - \delta$, for all $s, a \in \mathcal{D}$: $|\hat{\mathcal{B}}^\pi \hat{Q}^k(s,a) - \mathcal{B}^\pi \hat{Q}^k(s,a)| \le \frac{C_{r,T,\delta}\, R_{\max}}{(1-\gamma)\sqrt{|\mathcal{D}(s,a)|}}$.

Hence, the following can be obtained, w.h.p.:

$\hat{Q}^{k+1}(s,a) \;\le\; \mathcal{B}^\pi \hat{Q}^k(s,a) \;-\; \alpha\,\frac{\mu(a|s)}{\hat{\pi}_\beta(a|s)} \;+\; \frac{C_{r,T,\delta}\, R_{\max}}{(1-\gamma)\sqrt{|\mathcal{D}(s,a)|}}$   (10)

Now we need to reason about the fixed point of the update procedure in Equation 9. The fixed point of Equation 9, accounting for the overestimation above, is given by:

$\hat{Q}^\pi(s,a) \;=\; Q^\pi(s,a) \;-\; \alpha\Big[(I - \gamma P^\pi)^{-1}\,\frac{\mu}{\hat{\pi}_\beta}\Big](s,a) \;+\; \Big[(I - \gamma P^\pi)^{-1}\,\frac{C_{r,T,\delta}\, R_{\max}}{(1-\gamma)\sqrt{|\mathcal{D}|}}\Big](s,a),$

thus proving the relationship in Theorem 3.1.

In order to guarantee a lower bound, $\alpha$ can be chosen to cancel any potential overestimation incurred by $\hat{\mathcal{B}}^\pi$. Note that this choice works, since $(I - \gamma P^\pi)^{-1}$ is a matrix with all non-negative entries. The choice of $\alpha$ that guarantees a lower bound is then obtained by requiring the $\alpha$-scaled term, $\alpha\,\frac{\mu(a|s)}{\hat{\pi}_\beta(a|s)}$, to dominate the sampling-error term at every state-action pair in $\mathcal{D}$.

Of course, for this to hold we need $\mu(a|s) > 0$ whenever $\hat{\pi}_\beta(a|s) > 0$, and this is assumed in the theorem statement. Note that since the sampling error vanishes when $\hat{\mathcal{B}}^\pi = \mathcal{B}^\pi$, any $\alpha \ge 0$ guarantees a lower bound in that case.

Next, we prove Theorem 3.2, which shows that the additional term that maximizes the expected Q-value under the dataset distribution, $\hat{\pi}_\beta(a|s)$ (or $\pi_\beta(a|s)$ in the absence of sampling error), results in a lower bound on only the expected value of the policy at a state, and not a pointwise lower bound on Q-values at all actions.

Proof of Theorem 3.2. We first prove this theorem in the absence of sampling error, and then incorporate sampling error at the end, using a technique similar to the previous proof. In the tabular setting, we can set the derivative of the modified objective in Equation 2 to zero, and compute the Q-function update induced in the exact, tabular setting (this assumes $\hat{\mathcal{B}}^\pi = \mathcal{B}^\pi$ and $\hat{\pi}_\beta = \pi_\beta$).

$\hat{Q}^{k+1}(s,a) \;=\; \mathcal{B}^\pi \hat{Q}^k(s,a) \;-\; \alpha\Big[\frac{\mu(a|s)}{\pi_\beta(a|s)} - 1\Big]$   (11)

Note that for state-action pairs $(s,a)$ such that $\mu(a|s) < \pi_\beta(a|s)$, we are in fact adding a positive quantity, $1 - \frac{\mu(a|s)}{\pi_\beta(a|s)}$, to the Q-function obtained, and thus we cannot guarantee a point-wise lower bound, i.e., $\hat{Q}^{k+1}(s,a) \le Q^{k+1}(s,a)$ need not hold at every action. To formally prove this, we can construct a counterexample three-state, two-action MDP and choose a specific behavior policy $\pi_\beta(a|s)$ such that this is indeed the case.

The value of the policy, on the other hand, is underestimated, since:

$\hat{V}^{k+1}(s) \;:=\; \mathbb{E}_{a \sim \pi(a|s)}\big[\hat{Q}^{k+1}(s,a)\big] \;=\; \mathcal{B}^\pi \hat{V}^k(s) \;-\; \alpha\,\mathbb{E}_{a \sim \pi(a|s)}\Big[\frac{\mu(a|s)}{\pi_\beta(a|s)} - 1\Big]$   (12)

and we can show that $D_{\mathrm{CQL}}(s) := \mathbb{E}_{a \sim \pi(a|s)}\big[\frac{\mu(a|s)}{\pi_\beta(a|s)} - 1\big]$ is always non-negative when $\mu(a|s) = \pi(a|s)$, and strictly positive whenever $\pi \ne \pi_\beta$. To see this, we present the following derivation:

$D_{\mathrm{CQL}}(s) \;=\; \sum_a \pi(a|s)\Big[\frac{\pi(a|s)}{\pi_\beta(a|s)} - 1\Big] \;=\; \sum_a \frac{\pi(a|s)^2}{\pi_\beta(a|s)} - 1 \;=\; \sum_a \frac{\big(\pi(a|s) - \pi_\beta(a|s)\big)^2}{\pi_\beta(a|s)} \;\ge\; 0.$

Each term in the final sum is non-negative, since both the numerator and the denominator are non-negative, and this implies that $D_{\mathrm{CQL}}(s) \ge 0$. Also, note that $D_{\mathrm{CQL}}(s) = 0$ iff $\pi(a|s) = \pi_\beta(a|s)$. This implies that each value iterate incurs some underestimation, $\hat{V}^{k+1}(s) \le \mathcal{B}^\pi \hat{V}^k(s)$.

Now, we can compute the fixed point of the recursion in Equation 12, and this gives us the following estimated policy value:

$\hat{V}^\pi(s) \;=\; V^\pi(s) \;-\; \alpha\Big[(I - \gamma P^\pi)^{-1}\,\mathbb{E}_\pi\Big[\frac{\pi}{\pi_\beta} - 1\Big]\Big](s) \;\le\; V^\pi(s),$

thus showing that, in the absence of sampling error, Theorem 3.2 gives a lower bound. It is straightforward to note that this expression is tighter than the value implied by the point-wise bound of Theorem 3.1, since there we explicitly subtract $\alpha\,\frac{\mu(a|s)}{\hat{\pi}_\beta(a|s)}$ in the expression for the Q-values (in the exact case).

Incorporating sampling error. To extend this result to the setting with sampling error, similar to the previous proof, the maximal overestimation at each iteration $k$ is bounded by $\frac{C_{r,T,\delta}\, R_{\max}}{(1-\gamma)\sqrt{|\mathcal{D}|}}$. The resulting value function satisfies (w.h.p.):