1 Introduction
Recent advances in reinforcement learning (RL), especially when combined with expressive deep network function approximators, have produced promising results in domains ranging from robotics (kalashnikov2018qtopt) to strategy games (alphastar) and recommendation systems (li2010contextual). However, applying RL to real-world problems consistently poses practical challenges: in contrast to the kinds of data-driven methods that have been successful in supervised learning (resnet; bert), RL is classically regarded as an active learning process, where each training run requires active interaction with the environment. Interaction with the real world can be costly and dangerous, and the quantities of data that can be gathered online are substantially lower than the offline datasets that are used in supervised learning (imagenet), which only need to be collected once. Offline RL, also known as batch RL, offers an appealing alternative (ernst2005tree; fujimoto2018off; kumar2019stabilizing; agarwal2019optimistic; jaques2019way; siegel2020keep; levine2020offline). Offline RL algorithms learn from large, previously collected datasets, without further interaction. In principle, this can make it possible to leverage large datasets, but in practice fully offline RL methods pose major technical difficulties, stemming from the distributional shift between the policy that collected the data and the learned policy. As a result, current results fall short of the full promise of such methods.

Directly utilizing existing value-based off-policy RL algorithms in an offline setting generally results in poor performance, due to issues with bootstrapping from out-of-distribution actions (kumar2019stabilizing; fujimoto2018off) and overfitting (fu2019diagnosing; kumar2019stabilizing; agarwal2019optimistic).
This typically manifests as erroneously optimistic value function estimates. If we can instead learn a conservative estimate of the value function, which provides a lower bound on the true values, this overestimation problem could be addressed. In fact, because policy evaluation and improvement typically use only the value of the policy, we can learn a less conservative lower-bound Q-function, such that only the expected value of the Q-function under the policy is lower-bounded, as opposed to a pointwise lower bound. We propose a novel method for learning such conservative Q-functions via a simple modification to standard value-based RL algorithms. The key idea behind our method is to minimize values under an appropriately chosen distribution over state-action tuples, and then further tighten this bound by also incorporating a maximization term over the data distribution.

Our primary contribution is an algorithmic framework, which we call conservative Q-learning (CQL), for learning conservative, lower-bound estimates of the value function by regularizing the Q-values during training. Our theoretical analysis of CQL shows that only the expected value of this Q-function under the policy lower-bounds the true policy value, preventing the extra underestimation that can arise with pointwise lower-bounded Q-functions, which have typically been explored in the opposite context in the exploration literature (osband2017posterior; jaksch2010near). We also empirically demonstrate the robustness of our approach to Q-function estimation error. Our practical algorithm uses these conservative estimates for policy evaluation and offline RL. CQL can be implemented with less than 20 lines of code on top of a number of standard, online RL algorithms (haarnoja; dabney2018distributional), simply by adding the CQL regularization terms to the Q-function update.
In our experiments, we demonstrate the efficacy of CQL for offline RL, both in domains with complex dataset compositions, where prior methods are typically known to perform poorly (d4rl), and in domains with high-dimensional visual inputs (bellemare2013arcade; agarwal2019optimistic). CQL outperforms prior methods by as much as 2-5x on many benchmark tasks, and is the only method that can outperform simple behavioral cloning on a number of realistic datasets collected from human interaction.
2 Preliminaries
The goal in reinforcement learning is to learn a policy that maximizes the expected cumulative discounted reward in a Markov decision process (MDP), which is defined by a tuple $(\mathcal{S}, \mathcal{A}, T, r, \gamma)$. $\mathcal{S}, \mathcal{A}$ represent state and action spaces, $T(s'|s,a)$ and $r(s,a)$ represent the dynamics and reward function, and $\gamma \in (0,1)$ represents the discount factor. $\pi_\beta(a|s)$ represents the action-conditional in the dataset $\mathcal{D}$, and $d^{\pi_\beta}(s)$ is the discounted marginal state-distribution of $\pi_\beta(a|s)$. The dataset $\mathcal{D}$ is sampled from $d^{\pi_\beta}(s)\pi_\beta(a|s)$. On all states $s \in \mathcal{D}$, let $\hat{\pi}_\beta(a|s)$ denote the empirical behavior policy at that state.

Off-policy RL algorithms based on dynamic programming maintain a parametric Q-function, $Q_\theta(s,a)$, and, optionally, a parametric policy, $\pi_\phi(a|s)$. Q-learning methods train the Q-function by iteratively applying the Bellman optimality operator, $\mathcal{B}^*Q(s,a) = r(s,a) + \gamma \mathbb{E}_{s' \sim T(s'|s,a)}[\max_{a'} Q(s',a')]$, and use an exact or approximate maximization scheme, such as CEM (kalashnikov2018qtopt), to recover the greedy policy. In an actor-critic algorithm, a separate policy is trained to maximize the Q-value. Actor-critic methods alternate between computing $Q^\pi$ via (partial) policy evaluation, by iterating the Bellman operator, $\mathcal{B}^\pi Q = r + \gamma P^\pi Q$, where $P^\pi$ is the transition matrix coupled with the policy, $P^\pi Q(s,a) = \mathbb{E}_{s' \sim T(s'|s,a),\, a' \sim \pi(a'|s')}[Q(s',a')]$, and improving the policy $\pi_\phi(a|s)$ by updating it towards actions that maximize the expected Q-value. Since $\mathcal{D}$ typically does not contain all possible transitions $(s, a, s')$, the policy evaluation step actually uses an empirical Bellman operator that only backs up a single sample. We denote this operator $\hat{\mathcal{B}}^\pi$. Given a dataset $\mathcal{D} = \{(s, a, r, s')\}$ of tuples from trajectories collected under a behavior policy $\pi_\beta$:
$$\hat{Q}^{k+1} \leftarrow \arg\min_{Q}\; \mathbb{E}_{s,a,s' \sim \mathcal{D}}\Big[\big(r(s,a) + \gamma\, \mathbb{E}_{a' \sim \hat{\pi}^k(a'|s')}[\hat{Q}^k(s',a')] - Q(s,a)\big)^2\Big] \quad \text{(policy evaluation)}$$
$$\hat{\pi}^{k+1} \leftarrow \arg\max_{\pi}\; \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi^k(a|s)}\big[\hat{Q}^{k+1}(s,a)\big] \quad \text{(policy improvement)}$$
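The Bellman evaluation operator described above can be made concrete in a small tabular sketch. The MDP below (two states, two actions, and all transition probabilities and rewards) is purely illustrative, not taken from the paper; iterating the exact operator converges to the fixed point $Q^\pi$.

```python
import numpy as np

# Minimal tabular sketch of (exact) policy evaluation via the Bellman
# operator B^pi Q = r + gamma * P^pi Q, on an illustrative 2-state,
# 2-action MDP. All quantities (P, r, pi) are made up for this example.
gamma = 0.9
P = np.zeros((2, 2, 2))                  # P[s, a, s'] = T(s' | s, a)
P[0, 0] = [0.8, 0.2]; P[0, 1] = [0.1, 0.9]
P[1, 0] = [0.5, 0.5]; P[1, 1] = [0.9, 0.1]
r = np.array([[1.0, 0.0], [0.0, 2.0]])   # r[s, a]
pi = np.array([[0.5, 0.5], [0.2, 0.8]])  # pi[s, a] = pi(a | s)

def bellman_eval(Q):
    """Apply B^pi once: (B^pi Q)(s,a) = r(s,a) + gamma * E_{s',a'}[Q(s',a')]."""
    V = (pi * Q).sum(axis=1)             # V(s') = E_{a'~pi}[Q(s', a')]
    return r + gamma * P @ V             # contracts last axis of P with V

Q = np.zeros_like(r)
for _ in range(500):                     # iterate to the fixed point Q^pi
    Q = bellman_eval(Q)
```

Because the operator is a $\gamma$-contraction, the loop converges geometrically, and the resulting `Q` satisfies `bellman_eval(Q) == Q` up to numerical precision.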
Offline RL algorithms based on this basic recipe suffer from action distribution shift (kumar2019stabilizing; wu2019behavior; jaques2019way; levine2020offline) during training, because the target values for Bellman backups in policy evaluation use actions sampled from the learned policy, $\pi^k$, while the Q-function is trained only on actions sampled from the behavior policy that produced the dataset $\mathcal{D}$, $\pi_\beta$. We also remark that Q-function training in offline RL does not suffer from state distribution shift, because the Q-function is never queried on out-of-distribution states. However, since the policy is trained to maximize Q-values, it may be biased towards out-of-distribution (OOD) actions with erroneously high Q-values. In standard RL, such errors can be corrected by attempting an action in the environment and observing its actual value. In offline RL, the inability to interact with the environment makes it challenging to deal with Q-values for OOD actions. Typical offline RL methods (kumar2019stabilizing; jaques2019way; wu2019behavior; siegel2020keep) mitigate this problem by constraining the learned policy (levine2020offline) away from OOD actions.
3 The Conservative Q-Learning (CQL) Framework
In this section, we develop a conservative Q-learning (CQL) algorithm, such that the expected value of a policy under the learned Q-function lower-bounds its true value. A lower bound on the Q-value prevents the over-estimation that is common in offline RL settings due to OOD actions and function approximation error (levine2020offline; kumar2019stabilizing). We use the term CQL to refer broadly to both Q-learning methods and actor-critic methods, though the latter also use an explicit policy. We start by focusing on the policy evaluation step in CQL, which can be used by itself as an off-policy evaluation procedure, or integrated into a complete offline RL algorithm, as we will discuss in Section 3.2.
3.1 Conservative Off-Policy Evaluation
We aim to estimate the value $V^\pi(s)$ of a target policy $\pi$ given access to a dataset, $\mathcal{D}$, generated by following a behavior policy $\pi_\beta(a|s)$. Because we are interested in preventing overestimation of the policy value, we learn a conservative, lower-bound Q-function by additionally minimizing Q-values alongside a standard Bellman error objective. Our choice of penalty is to minimize the expected Q-value under a particular distribution of state-action pairs, $\mu(s,a)$. Since standard Q-function training does not query the Q-function value at unobserved states, but does query the Q-function at unseen actions, we restrict $\mu$ to match the state-marginal in the dataset, such that $\mu(s,a) = d^{\pi_\beta}(s)\mu(a|s)$. This gives rise to the following iterative update for training the Q-function, as a function of a tradeoff factor $\alpha$:
$$\hat{Q}^{k+1} \leftarrow \arg\min_{Q}\; \alpha\, \mathbb{E}_{s \sim \mathcal{D},\, a \sim \mu(a|s)}\big[Q(s,a)\big] + \frac{1}{2}\, \mathbb{E}_{s,a \sim \mathcal{D}}\Big[\big(Q(s,a) - \hat{\mathcal{B}}^\pi \hat{Q}^k(s,a)\big)^2\Big] \tag{1}$$
In Theorem 3.1, we show that the resulting Q-function, $\hat{Q}^\pi := \lim_{k \to \infty} \hat{Q}^k$, lower-bounds $Q^\pi$ at all $(s,a)$. However, we can substantially tighten this bound if we are only interested in estimating $V^\pi(s)$. If we only require that the expected value of the $\hat{Q}^\pi$ under $\pi(a|s)$ lower-bound $V^\pi$, we can improve the bound by introducing an additional Q-value maximization term under the data distribution, $\pi_\beta(a|s)$, resulting in the iterative update (changes from Equation 1 in red):
$$\hat{Q}^{k+1} \leftarrow \arg\min_{Q}\; \alpha\Big(\mathbb{E}_{s \sim \mathcal{D},\, a \sim \mu(a|s)}\big[Q(s,a)\big] - \mathbb{E}_{s \sim \mathcal{D},\, a \sim \hat{\pi}_\beta(a|s)}\big[Q(s,a)\big]\Big) + \frac{1}{2}\, \mathbb{E}_{s,a \sim \mathcal{D}}\Big[\big(Q(s,a) - \hat{\mathcal{B}}^\pi \hat{Q}^k(s,a)\big)^2\Big] \tag{2}$$
In Theorem 3.2, we show that, while the resulting Q-value $\hat{Q}^\pi$ may not be a pointwise lower bound, we have $\mathbb{E}_{\pi(a|s)}[\hat{Q}^\pi(s,a)] \leq V^\pi(s)$ when $\mu(a|s) = \pi(a|s)$. Intuitively, since Equation 2 maximizes Q-values under the behavior policy $\hat{\pi}_\beta$, Q-values for actions that are likely under $\hat{\pi}_\beta$ might be overestimated, and hence $\hat{Q}^\pi$ may not lower-bound $Q^\pi$ pointwise. While in principle the maximization term can utilize other distributions besides $\hat{\pi}_\beta(a|s)$, we prove in Appendix D.2 that the resulting value is not guaranteed to be a lower bound for distributions other than $\hat{\pi}_\beta(a|s)$.
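The behavior of the Equation 2 update can be checked numerically in a small tabular sketch. In the exact tabular case, setting the derivative of the penalized regression objective to zero gives the closed-form backup $\hat{Q}^{k+1} = \mathcal{B}^\pi \hat{Q}^k - \alpha\,(\mu/\hat{\pi}_\beta - 1)$, so actions likely under $\mu$ are pushed down while actions likely under the behavior policy are pushed back up. The MDP below is illustrative, not from the paper.

```python
import numpy as np

# Tabular sketch of the Eq. 2 update with mu = pi. In the exact tabular
# case the update is Q^{k+1} = B^pi Q^k - alpha * (mu/pi_beta - 1);
# iterating it and comparing with exact policy evaluation illustrates the
# expected-value lower bound of Theorem 3.2. All numbers are illustrative.
gamma, alpha = 0.9, 1.0
P = np.zeros((2, 2, 2))                       # P[s, a, s'] = T(s' | s, a)
P[0, 0] = [0.8, 0.2]; P[0, 1] = [0.1, 0.9]
P[1, 0] = [0.5, 0.5]; P[1, 1] = [0.9, 0.1]
r = np.array([[1.0, 0.0], [0.0, 2.0]])        # r[s, a]
pi = np.array([[0.5, 0.5], [0.2, 0.8]])       # target policy pi(a|s)
pi_beta = np.array([[0.9, 0.1], [0.7, 0.3]])  # behavior policy
mu = pi                                        # mu = pi, as in Theorem 3.2

def bellman(Q):
    V = (pi * Q).sum(axis=1)
    return r + gamma * P @ V

Q_true = np.zeros_like(r)
for _ in range(500):
    Q_true = bellman(Q_true)                   # exact Q^pi

Q_cql = np.zeros_like(r)
for _ in range(500):
    Q_cql = bellman(Q_cql) - alpha * (mu / pi_beta - 1.0)

# Expected value under pi is lower-bounded at every state, even though
# Q_cql need not lower-bound Q_true pointwise.
V_true = (pi * Q_true).sum(axis=1)
V_cql = (pi * Q_cql).sum(axis=1)
assert np.all(V_cql <= V_true)
```

The lower bound holds here because $\mathbb{E}_\pi[\mu/\hat{\pi}_\beta - 1] \geq 0$ for $\mu = \pi$ (it is a chi-square-like divergence), matching the sign argument used in Theorem 3.2.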
Theoretical analysis. We first note that Equations 1 and 2 use the empirical Bellman operator, $\hat{\mathcal{B}}^\pi$, instead of the actual Bellman operator, $\mathcal{B}^\pi$. Following (osband2016deep; jaksch2010near; o2018variational), we use concentration properties of $\hat{\mathcal{B}}^\pi$ to control this error. Formally, for all $s, a \in \mathcal{D}$, with probability $\geq 1 - \delta$, $|\hat{\mathcal{B}}^\pi Q - \mathcal{B}^\pi Q|(s,a) \leq \frac{C_{r,T,\delta}}{\sqrt{|\mathcal{D}(s,a)|}}$, where $C_{r,T,\delta}$ is a constant dependent on the concentration properties (variance) of $r(s,a)$ and $T(s'|s,a)$, and $\delta \in (0,1)$ (see Appendix D.3 for more details). Now, we show that the conservative Q-function learned by iterating Equation 1 lower-bounds the true Q-function. Proofs can be found in Appendix C.

Theorem 3.1.
For any $\mu(a|s)$ with $\mathrm{supp}\,\mu \subseteq \mathrm{supp}\,\hat{\pi}_\beta$, with probability $\geq 1 - \delta$, $\hat{Q}^\pi$ (the Q-function obtained by iterating Equation 1) satisfies:
$$\forall s \in \mathcal{D},\, a: \quad \hat{Q}^\pi(s,a) \leq Q^\pi(s,a) - \alpha\Big[(I - \gamma P^\pi)^{-1} \frac{\mu}{\hat{\pi}_\beta}\Big](s,a) + \Big[(I - \gamma P^\pi)^{-1} \frac{C_{r,T,\delta} R_{\max}}{(1-\gamma)\sqrt{|\mathcal{D}|}}\Big](s,a).$$
Thus, if $\alpha$ is sufficiently large, then $\hat{Q}^\pi(s,a) \leq Q^\pi(s,a)$ for all $s \in \mathcal{D}$ and all $a$. When $\hat{\mathcal{B}}^\pi = \mathcal{B}^\pi$, any $\alpha > 0$ guarantees $\hat{Q}^\pi(s,a) \leq Q^\pi(s,a)$.
Next, we show that Equation 2 lower-bounds the expected value under the policy $\pi$ when $\mu = \pi$. We also show that Equation 2 does not lower-bound the Q-value estimates pointwise.
Theorem 3.2 (Equation 2 results in a tighter lower bound).
The value of the policy under the Q-function from Equation 2, $\hat{V}^\pi(s) := \mathbb{E}_{\pi(a|s)}[\hat{Q}^\pi(s,a)]$, lower-bounds the true value of the policy obtained via exact policy evaluation, $V^\pi(s) := \mathbb{E}_{\pi(a|s)}[Q^\pi(s,a)]$, when $\mu = \pi$, according to:
$$\forall s \in \mathcal{D}: \quad \hat{V}^\pi(s) \leq V^\pi(s) - \alpha\Big[(I - \gamma P^\pi)^{-1}\, \mathbb{E}_{\pi}\Big[\frac{\pi}{\hat{\pi}_\beta} - 1\Big]\Big](s) + \Big[(I - \gamma P^\pi)^{-1} \frac{C_{r,T,\delta} R_{\max}}{(1-\gamma)\sqrt{|\mathcal{D}|}}\Big](s).$$
Thus, if $\alpha$ is sufficiently large, then $\hat{V}^\pi(s) \leq V^\pi(s)$ for all $s \in \mathcal{D}$, with probability $\geq 1 - \delta$. When $\hat{\mathcal{B}}^\pi = \mathcal{B}^\pi$, then any $\alpha > 0$ guarantees $\hat{V}^\pi(s) \leq V^\pi(s)$.
The analysis presented above assumes that no function approximation is used in the Q-function, meaning that each iterate can be represented exactly. We can further generalize the result in Theorem 3.2 to the case of both linear function approximators and nonlinear neural network function approximators, where the latter builds on the neural tangent kernel (NTK) framework (ntk). Due to space constraints, we present these results in Theorem D.1 and Theorem D.2 in Appendix D.1.

In summary, we showed that the basic CQL evaluation in Equation 1 learns a Q-function that lower-bounds the true Q-function $Q^\pi$, and the evaluation in Equation 2 provides a tighter lower bound on the expected Q-value of the policy $\pi$. For suitable $\alpha$, both bounds hold under sampling error and function approximation. Next, we extend this result into a complete RL algorithm.
3.2 Conservative Q-Learning for Offline RL
We now present a general approach for offline policy learning, which we refer to as conservative Q-learning (CQL). As discussed in Section 3.1, we can obtain Q-values that lower-bound the value of a policy $\pi$ by solving Equation 2 with $\mu = \pi$. How should we utilize this for policy optimization? We could alternate between performing full off-policy evaluation for each policy iterate, $\hat{\pi}^k$, and one step of policy improvement. However, this can be computationally expensive. Alternatively, since the policy $\hat{\pi}^k$ is typically derived from the Q-function, we could instead choose $\mu(a|s)$ to approximate the policy that would maximize the current Q-function iterate, thus giving rise to an online algorithm.
We can formally capture such online algorithms by defining a family of optimization problems over $\mu(a|s)$, presented below, with modifications from Equation 2 marked in red. An instance of this family is denoted by CQL($\mathcal{R}$) and is characterized by a particular choice of regularizer $\mathcal{R}(\mu)$:
$$\min_{Q}\, \max_{\mu}\; \alpha\Big(\mathbb{E}_{s \sim \mathcal{D},\, a \sim \mu(a|s)}\big[Q(s,a)\big] - \mathbb{E}_{s \sim \mathcal{D},\, a \sim \hat{\pi}_\beta(a|s)}\big[Q(s,a)\big]\Big) + \frac{1}{2}\, \mathbb{E}_{s,a \sim \mathcal{D}}\Big[\big(Q(s,a) - \hat{\mathcal{B}}^{\pi_k} \hat{Q}^k(s,a)\big)^2\Big] + \mathcal{R}(\mu) \tag{3}$$
Variants of CQL. To demonstrate the generality of the CQL family of optimization problems, we discuss two specific instances within this family that are of special interest, and we evaluate them empirically in Section 6. If we choose $\mathcal{R}(\mu)$ to be the KL-divergence against a prior distribution, $\rho(a|s)$, i.e., $\mathcal{R}(\mu) = -D_{\mathrm{KL}}(\mu \,\|\, \rho)$, then we get $\mu(a|s) \propto \rho(a|s) \exp(Q(s,a))$ (for a derivation, see Appendix A). First, if $\rho = \mathrm{Unif}(a)$, then the first term in Equation 3 corresponds to a soft maximum of the Q-values at any state and gives rise to the following variant of Equation 3, called CQL($\mathcal{H}$):
$$\min_{Q}\; \alpha\, \mathbb{E}_{s \sim \mathcal{D}}\Big[\log \sum_{a} \exp\big(Q(s,a)\big) - \mathbb{E}_{a \sim \hat{\pi}_\beta(a|s)}\big[Q(s,a)\big]\Big] + \frac{1}{2}\, \mathbb{E}_{s,a \sim \mathcal{D}}\Big[\big(Q(s,a) - \hat{\mathcal{B}}^{\pi_k} \hat{Q}^k(s,a)\big)^2\Big] \tag{4}$$
Second, if $\rho(a|s)$ is chosen to be the previous policy $\hat{\pi}^{k-1}$, the first term in Equation 4 is replaced by an exponentially weighted average of Q-values of actions from the chosen $\hat{\pi}^{k-1}(a|s)$. Empirically, we find that this variant can be more stable with high-dimensional action spaces (e.g., Table 2) where it is challenging to estimate $\log \sum_a \exp(Q(s,a))$ via sampling due to high variance. In Appendix A, we discuss an additional variant of CQL, drawing connections to distributionally robust optimization (namkoong2017variance). We will discuss a practical instantiation of a CQL deep RL algorithm in Section 4. CQL can be instantiated as either a Q-learning algorithm (with $\mathcal{B}^*$ instead of $\mathcal{B}^\pi$ in Equations 3, 4) or as an actor-critic algorithm.
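For a discrete action space, the CQL($\mathcal{H}$) penalty in Equation 4 can be written in a few lines. The sketch below (hypothetical names and shapes) computes the per-state log-sum-exp term minus the expected Q-value under the empirical behavior policy:

```python
import numpy as np

# Sketch of the CQL(H) penalty from Eq. 4 for a discrete action space:
# per state, logsumexp_a Q(s,a) - E_{a ~ pi_beta}[Q(s,a)], scaled by alpha
# and averaged over the batch. Names and shapes are illustrative.
def cql_h_penalty(q_values, behavior_probs, alpha=1.0):
    """q_values: (batch, n_actions); behavior_probs: (batch, n_actions)."""
    m = q_values.max(axis=1, keepdims=True)            # stabilize logsumexp
    logsumexp = (m + np.log(np.exp(q_values - m).sum(axis=1, keepdims=True))).squeeze(1)
    data_q = (behavior_probs * q_values).sum(axis=1)   # E_{pi_beta}[Q]
    return alpha * (logsumexp - data_q).mean()
```

Note that the penalty is always nonnegative, since the log-sum-exp upper-bounds the maximum Q-value, which in turn upper-bounds any expectation of Q over actions; for all-zero Q-values over $n$ actions it equals $\log n$.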
Theoretical analysis of CQL. Next, we theoretically analyze CQL to show that the policy updates derived in this way are indeed "conservative", in the sense that each successive policy iterate is optimized against a lower bound on its value. For clarity, we state the results in this section in the absence of finite-sample error, but sampling error can be incorporated in the same way as in Theorems 3.1 and 3.2, as we discuss in Appendix C. Theorem 3.3 shows that any variant of the CQL family learns Q-value estimates that lower-bound the actual Q-function under the action distribution defined by the policy, $\pi^k$, under mild regularity conditions (slow updates on the policy).
Theorem 3.3 (CQL learns lower-bounded Q-values).
Let $\hat{\pi}^{k+1}(a|s) \propto \exp(\hat{Q}^{k+1}(s,a))$ and assume that $D_{TV}(\hat{\pi}^{k+1}, \pi^{k+1}) \leq \varepsilon$ (i.e., $\pi^{k+1}$ changes slowly w.r.t. $\hat{Q}^{k+1}$). Then, the policy value under $\hat{Q}^{k+1}$ lower-bounds the actual policy value, $\hat{V}^{k+1}(s) \leq V^{k+1}(s)\; \forall s$, if
$$\mathbb{E}_{\hat{\pi}^{k+1}(a|s)}\Big[\frac{\hat{\pi}^{k+1}(a|s)}{\hat{\pi}_\beta(a|s)} - 1\Big] \geq \max_{a:\, \hat{\pi}_\beta(a|s) > 0} \frac{\pi^{k+1}(a|s)}{\hat{\pi}_\beta(a|s)} \cdot \varepsilon.$$
The LHS of this inequality is equal to the amount of conservatism induced in the value, $\hat{V}^{k+1}$, in iteration $k+1$ of the CQL update, if the learned policy were equal to the soft-optimal policy for $\hat{Q}^{k+1}$, i.e., when $\pi^{k+1} = \hat{\pi}^{k+1}$. However, as the actual policy, $\pi^{k+1}$, may be different, the RHS is the maximal amount of potential overestimation due to this difference. To obtain a lower bound, we require the amount of underestimation to be higher, which is obtained when $\varepsilon$ is small, i.e., when the policy changes slowly.
Our final result shows that the CQL Q-function update is "gap-expanding", by which we mean that the difference in Q-values at in-distribution actions and over-optimistically erroneous out-of-distribution actions is higher than the corresponding difference under the actual Q-function. This implies that the policy $\pi^k(a|s)$ is constrained to be closer to the dataset distribution, $\hat{\pi}_\beta(a|s)$; thus the CQL update implicitly prevents the detrimental effects of OOD actions and distribution shift, which have been a major concern in offline RL settings (kumar2019stabilizing; levine2020offline; fujimoto2018off).
Theorem 3.4 (CQL is gap-expanding).
At any iteration $k$, CQL expands the difference in expected Q-values under the behavior policy $\pi_\beta(a|s)$ and $\mu_k$, such that for large enough values of $\alpha_k$, we have that $\mathbb{E}_{\pi_\beta(a|s)}[\hat{Q}^{k+1}(s,a)] - \mathbb{E}_{\mu_k(a|s)}[\hat{Q}^{k+1}(s,a)] > \mathbb{E}_{\pi_\beta(a|s)}[Q^{k+1}(s,a)] - \mathbb{E}_{\mu_k(a|s)}[Q^{k+1}(s,a)]$.
When function approximation or sampling error causes OOD actions to have higher learned Q-values, CQL backups are expected to be more robust, in that the policy is updated using Q-values that prefer in-distribution actions. As we empirically show in Appendix B, prior offline RL methods that do not explicitly constrain or regularize the Q-function may not enjoy such robustness properties.
To summarize, we showed that the CQL RL algorithm learns lower-bounded Q-values with large enough $\alpha$, meaning that the final policy attains at least its estimated value. We also showed that the Q-function is gap-expanding, meaning that it should only ever over-estimate the gap between in-distribution and out-of-distribution actions, preventing the harmful effects of OOD actions.
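The gap-expanding effect of Theorem 3.4 can be illustrated with a single tabular CQL backup. In the exact tabular case, one penalized backup is $Q' = \mathcal{B}^\pi Q - \alpha(\mu/\hat{\pi}_\beta - 1)$; comparing it against the plain backup shows that the gap between expected Q-values under $\hat{\pi}_\beta$ and under $\mu$ grows with $\alpha$. All numbers below are illustrative.

```python
import numpy as np

# Sketch of the gap-expanding property after one tabular CQL backup at a
# single state. q_backup stands in for (B^pi Q)(s, .); pi_beta puts mass on
# in-distribution actions, mu on (potentially OOD) actions. Illustrative only.
rng = np.random.default_rng(0)
alpha = 5.0
q_backup = rng.normal(size=4)                      # plain backup values
pi_beta = np.array([0.6, 0.25, 0.1, 0.05])         # in-distribution actions
mu = np.array([0.05, 0.1, 0.25, 0.6])              # mass on rare/OOD actions

q_cql = q_backup - alpha * (mu / pi_beta - 1.0)    # penalized CQL backup

gap_plain = pi_beta @ q_backup - mu @ q_backup
gap_cql = pi_beta @ q_cql - mu @ q_cql
assert gap_cql > gap_plain                          # the gap expanded
```

Algebraically, the gap grows by $\alpha(\sum_a \mu(a)^2/\hat{\pi}_\beta(a) - 1)$, which is nonnegative by a chi-square divergence argument and strictly positive whenever $\mu \neq \hat{\pi}_\beta$, regardless of the backup values themselves.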
4 Practical Algorithm and Implementation Details
We now describe two practical offline deep reinforcement learning methods based on CQL: an actor-critic variant and a Q-learning variant. Pseudocode is shown in Algorithm 1, with differences from conventional actor-critic algorithms (e.g., SAC (haarnoja)) and deep Q-learning algorithms (e.g., DQN (mnih2013playing)) in red. Our algorithm uses the CQL($\mathcal{H}$) (or, more generally, CQL($\mathcal{R}$)) objective from the CQL framework for training the Q-function $Q_\theta$, which is parameterized by a neural network with parameters $\theta$. For the actor-critic algorithm, a policy $\pi_\phi$ is trained as well. Our algorithm modifies the objective for the Q-function (swapping out the Bellman error for the CQL($\mathcal{H}$) or CQL($\mathcal{R}$) objective) in a standard actor-critic or Q-learning setting, as shown in Line 3. As discussed in Section 3.2, due to the explicit penalty on the Q-function, CQL methods do not use a policy constraint, unlike prior offline RL methods (kumar2019stabilizing; wu2019behavior; siegel2020keep; levine2020offline). Hence, we do not require fitting an additional behavior policy estimator, simplifying our method.
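To make the modified Q-objective concrete, the sketch below combines a standard TD error with the CQL($\mathcal{H}$) term for the discrete (Q-learning) case, estimating the data term from dataset actions as practical implementations commonly do. All names and shapes are hypothetical; a real implementation would compute this loss on minibatches under automatic differentiation rather than numpy.

```python
import numpy as np

# Sketch of a CQL(H)-style Q-loss for discrete actions: the usual TD error
# plus alpha * (logsumexp_a Q(s,a) - Q(s, a_data)), averaged over a batch.
# Hypothetical names/shapes; illustrative, not the paper's exact code.
def cql_q_loss(q, bellman_targets, dataset_actions, alpha=1.0):
    """q: (batch, n_actions) current Q-values; bellman_targets: (batch,)
    backup values; dataset_actions: (batch,) actions from the dataset."""
    idx = np.arange(q.shape[0])
    q_data = q[idx, dataset_actions]                  # Q at dataset actions
    td_error = 0.5 * np.mean((q_data - bellman_targets) ** 2)
    m = q.max(axis=1)                                  # stabilized logsumexp
    logsumexp = m + np.log(np.exp(q - m[:, None]).sum(axis=1))
    cql_term = alpha * np.mean(logsumexp - q_data)     # conservative penalty
    return td_error + cql_term
```

Setting `alpha=0` recovers the plain TD objective, which makes the "less than 20 lines of code" claim plausible: the only change to a standard Q-learning loss is the added penalty term.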
Implementation details. Our algorithm requires the addition of only 20 lines of code on top of standard implementations of soft actor-critic (SAC) (haarnoja) for the continuous control experiments, and on top of QR-DQN (dabney2018distributional) for the discrete control experiments. The tradeoff factor, $\alpha$, is automatically tuned via Lagrangian dual gradient descent for continuous control, and is fixed at the constant values described in Appendix F for discrete control. We use default hyperparameters from SAC, except that the learning rate for the policy is chosen to be 3e-5 (vs. 3e-4 or 1e-4 for the Q-function), as dictated by Theorem 3.3. Further details are provided in Appendix F.

5 Related Work
We now briefly discuss prior work in offline RL and offpolicy evaluation, comparing and contrasting these works with our approach. More technical discussion of related work is provided in Appendix E.
Off-policy evaluation (OPE). Several different paradigms have been used to perform off-policy evaluation. Earlier works (precup2000eligibility; peshkin2002learning; precup2001off) used per-action importance sampling on Monte Carlo returns to obtain an OPE return estimator. Recent approaches (liu2018breaking; gelada2019off; nachum2019dualdice; Zhang2020GenDICE:) use marginalized importance sampling by directly estimating the state-distribution importance ratios via some form of dynamic programming (levine2020offline), and typically exhibit less variance than per-action importance sampling at the cost of bias. Because these methods use dynamic programming, they can suffer from OOD actions (levine2020offline; gelada2019off; hallak2017consistent; nachum2019dualdice). In contrast, the regularizer in CQL explicitly addresses the impact of OOD actions due to its gap-expanding behavior, and obtains conservative value estimates.
Offline RL. As discussed in Section 2, offline Q-learning methods suffer from issues pertaining to OOD actions. Prior works have attempted to solve this problem by constraining the learned policy to be "close" to the behavior policy, for example as measured by KL-divergence (jaques2019way; wu2019behavior; peng2019awr; siegel2020keep), Wasserstein distance (wu2019behavior), or MMD (kumar2019stabilizing), and then only using actions sampled from this constrained policy in the Bellman backup, or by applying a value penalty. Most of these methods require a separately estimated model of the behavior policy (fujimoto2018off; kumar2019stabilizing; wu2019behavior; jaques2019way; siegel2020keep), and are thus limited by their ability to accurately estimate the unknown behavior policy, which might be especially complex in settings where the data is collected from multiple sources (levine2020offline). In contrast, CQL does not require estimating the behavior policy. Prior work has explored some forms of Q-function penalties (hester2018deep; vecerik2017leveraging), but only in the standard online RL setting augmented with demonstrations. luo2019learning learn a conservatively-extrapolated value function, by enforcing a linear extrapolation property over the state space, and use it alongside a learned dynamics model to obtain policies for goal-reaching tasks. Alternative prior approaches to offline RL estimate some form of uncertainty to determine the trustworthiness of a Q-value prediction (kumar2019stabilizing; agarwal2019optimistic; levine2020offline), typically using uncertainty estimation techniques from exploration in online RL (osband2016deep; jaksch2010near; osband2017posterior; burda2018exploration). These methods have not generally been performant in offline RL (fujimoto2018off; kumar2019stabilizing; levine2020offline) due to the high-fidelity requirements of uncertainty estimates in offline RL (levine2020offline).
The gap-expanding property of CQL backups, shown in Theorem 3.4, is related to how gap-increasing Bellman backup operators (bellemare2016increasing; lu2018general) are more robust to estimation error in online RL.
6 Experimental Evaluation
We compare CQL to prior offline RL methods on a range of domains and dataset compositions, including continuous and discrete action spaces, state observations of varying dimensionality, and high-dimensional image inputs. We first evaluate the actor-critic variant of CQL, using CQL($\mathcal{H}$) from Algorithm 1, on continuous control datasets from the D4RL benchmark (d4rl). We compare to: prior offline RL methods that use a policy constraint, namely BEAR (kumar2019stabilizing) and BRAC (wu2019behavior); SAC (haarnoja), an off-policy actor-critic method that we adapt to the offline setting; and behavioral cloning (BC).
Table 1: Performance of CQL($\mathcal{H}$) and prior methods on gym domains from D4RL.

| Task Name | SAC | BC | BEAR | BRAC-p | BRAC-v | CQL($\mathcal{H}$) |
| --- | --- | --- | --- | --- | --- | --- |
| halfcheetah-random | 30.5 | 2.1 | 25.5 | 23.5 | 28.1 | 35.4 |
| hopper-random | 11.3 | 9.8 | 9.5 | 11.1 | 12.0 | 10.8 |
| walker2d-random | 4.1 | 1.6 | 6.7 | 0.8 | 0.5 | 7.0 |
| halfcheetah-medium | 4.3 | 36.1 | 38.6 | 44.0 | 45.5 | 44.4 |
| walker2d-medium | 0.9 | 6.6 | 33.2 | 72.7 | 81.3 | 79.2 |
| hopper-medium | 0.8 | 29.0 | 47.6 | 31.2 | 32.3 | 58.0 |
| halfcheetah-expert | 1.9 | 107.0 | 108.2 | 3.8 | 1.1 | 104.8 |
| hopper-expert | 0.7 | 109.0 | 110.3 | 6.6 | 3.7 | 109.9 |
| walker2d-expert | 0.3 | 125.7 | 106.1 | 0.2 | 0.0 | 153.9 |
| halfcheetah-medium-expert | 1.8 | 35.8 | 51.7 | 43.8 | 45.3 | 62.4 |
| walker2d-medium-expert | 1.9 | 11.3 | 10.8 | 0.3 | 0.9 | 98.7 |
| hopper-medium-expert | 1.6 | 111.9 | 4.0 | 1.1 | 0.8 | 111.0 |
| halfcheetah-random-expert | 53.0 | 1.3 | 24.6 | 30.2 | 2.2 | 92.5 |
| walker2d-random-expert | 0.8 | 0.7 | 1.9 | 0.2 | 2.7 | 91.1 |
| hopper-random-expert | 5.6 | 10.1 | 10.1 | 5.8 | 11.1 | 110.5 |
| halfcheetah-mixed | 2.4 | 38.4 | 36.2 | 45.6 | 45.9 | 46.2 |
| hopper-mixed | 3.5 | 11.8 | 25.3 | 0.7 | 0.8 | 48.6 |
| walker2d-mixed | 1.9 | 11.3 | 10.8 | 0.3 | 0.9 | 26.7 |
Gym domains. Results for the gym domains are shown in Table 1. The results for BEAR, BRAC, SAC, and BC are based on numbers reported by d4rl. On the datasets generated from a single policy, marked as "random", "expert" and "medium", CQL roughly matches or exceeds the best prior methods, but by a small margin. However, on datasets that combine multiple policies ("mixed", "medium-expert" and "random-expert"), which are more likely to be representative of practical datasets, CQL outperforms prior methods by large margins, sometimes as much as 2-3x.
Adroit tasks. The more complex Adroit (rajeswaran2018dapg) tasks in D4RL require controlling a 24-DoF robotic hand, using limited data from human demonstrations. These tasks are substantially more difficult than the gym tasks in terms of both the dataset composition and high dimensionality. Prior offline RL methods generally struggle to learn meaningful behaviors on these tasks, and the strongest baseline is BC. As shown in Table 2, CQL variants are the only methods that improve over BC, attaining scores that are 2-9x those of the next best offline RL method. CQL($\rho$) with $\rho$ set to the previous policy outperforms CQL($\mathcal{H}$) on a number of these tasks, due to the higher action dimensionality resulting in higher variance for the CQL($\mathcal{H}$) importance weights. Both variants outperform prior methods.
Table 2: Performance of CQL($\mathcal{H}$) and CQL($\rho$) on AntMaze, Adroit, and kitchen domains from D4RL.

| Domain | Task Name | BC | SAC | BEAR | BRAC-p | BRAC-v | CQL($\mathcal{H}$) | CQL($\rho$) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AntMaze | antmaze-umaze | 65.0 | 0.0 | 73.0 | 50.0 | 70.0 | 74.0 | 73.5 |
| | antmaze-umaze-diverse | 55.0 | 0.0 | 61.0 | 40.0 | 70.0 | 84.0 | 61.0 |
| | antmaze-medium-play | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 61.2 | 4.6 |
| | antmaze-medium-diverse | 0.0 | 0.0 | 8.0 | 0.0 | 0.0 | 53.7 | 5.1 |
| | antmaze-large-play | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 15.8 | 3.2 |
| | antmaze-large-diverse | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 14.9 | 2.3 |
| Adroit | pen-human | 34.4 | 6.3 | 1.0 | 8.1 | 0.6 | 37.5 | 55.8 |
| | hammer-human | 1.5 | 0.5 | 0.3 | 0.3 | 0.2 | 4.4 | 2.1 |
| | door-human | 0.5 | 3.9 | 0.3 | 0.3 | 0.3 | 9.9 | 9.1 |
| | relocate-human | 0.0 | 0.0 | 0.3 | 0.3 | 0.3 | 0.20 | 0.35 |
| | pen-cloned | 56.9 | 23.5 | 26.5 | 1.6 | 2.5 | 39.2 | 40.3 |
| | hammer-cloned | 0.8 | 0.2 | 0.3 | 0.3 | 0.3 | 2.1 | 5.7 |
| | door-cloned | 0.1 | 0.0 | 0.1 | 0.1 | 0.1 | 0.4 | 3.5 |
| | relocate-cloned | 0.1 | 0.2 | 0.3 | 0.3 | 0.3 | 0.1 | 0.1 |
| Kitchen | kitchen-complete | 33.8 | 15.0 | 0.0 | 0.0 | 0.0 | 43.8 | 31.3 |
| | kitchen-partial | 33.8 | 0.0 | 13.1 | 0.0 | 0.0 | 49.8 | 50.1 |
| | kitchen-undirected | 47.5 | 2.5 | 47.2 | 0.0 | 0.0 | 51.0 | 52.4 |
AntMaze. These D4RL tasks require composing parts of suboptimal trajectories to form more optimal policies for reaching goals on a MuJoCo Ant robot. Prior methods make some progress on the simpler U-maze, but only CQL is able to make meaningful progress on the much harder medium and large mazes, outperforming prior methods by a very wide margin.
Kitchen tasks. Next, we evaluate CQL on the Franka kitchen domain (gupta2019relay) from D4RL (d4rl_repo). The goal is to control a 9-DoF robot to manipulate multiple objects (microwave, kettle, etc.) sequentially in a single episode to reach a desired configuration, with only a sparse 0-1 completion reward for each object that attains the target configuration. These tasks are especially challenging, since they require composing parts of trajectories, precise long-horizon manipulation, and handling human-provided teleoperation data. As shown in Table 2, CQL outperforms prior methods in this setting, and is the only method that outperforms behavioral cloning, attaining over 40% success rate on all tasks.
Offline RL on Atari games. Lastly, we evaluate a discrete-action Q-learning variant of CQL (Algorithm 1) on offline, image-based Atari games (bellemare2013arcade). We compare CQL to REM (agarwal2019optimistic) and QR-DQN (dabney2018distributional) on the five Atari tasks (Pong, Breakout, Q*bert, Seaquest and Asterix) that are evaluated in detail by agarwal2019optimistic, using the dataset released by the authors.
Following the evaluation protocol of agarwal2019optimistic, we evaluated on two types of datasets, both of which were generated from the DQN-replay dataset released by agarwal2019optimistic: (1) a dataset consisting of the first 20% of the samples observed by an online DQN agent, and (2) datasets consisting of only 1% and 10% of all samples observed by an online DQN agent (Figures 6 and 7 in agarwal2019optimistic). In setting (1), shown in Figure 1, CQL generally achieves similar or better performance than QR-DQN and REM throughout training. When using only 1% or 10% of the data, in setting (2) (Table 3), CQL
Table 3: Performance on Atari games with 1% and 10% of the DQN-replay dataset.

| Task Name | QR-DQN | REM | CQL($\mathcal{H}$) |
| --- | --- | --- | --- |
| Pong (1%) | 13.8 | 6.9 | 19.3 |
| Breakout | 7.9 | 11.0 | 61.1 |
| Q*bert | 383.6 | 343.4 | 14012.0 |
| Seaquest | 672.9 | 499.8 | 779.4 |
| Asterix | 166.3 | 386.5 | 592.4 |
| Pong (10%) | 15.1 | 8.9 | 18.5 |
| Breakout | 151.2 | 86.7 | 269.3 |
| Q*bert | 7091.3 | 8624.3 | 13855.6 |
| Seaquest | 2984.8 | 3936.6 | 3674.1 |
| Asterix | 189.2 | 75.1 | 156.3 |
substantially outperforms REM and QR-DQN, especially in the harder 1% condition, achieving 36x and 6x the return of the best prior method on Q*bert and Breakout, respectively.
Analysis of CQL. Finally, we perform an empirical evaluation to verify that CQL indeed lower-bounds the value function, thus verifying Theorem 3.2 and the analysis in Appendix D.1 empirically. To this end, we estimate the average value of the learned policy predicted by CQL, and report the difference against the actual discounted return of the policy in Table 4. We also estimate these values for baselines, including the minimum predicted Q-value under an ensemble (haarnoja; fujimoto2018addressing) of Q-functions with varying ensemble sizes, which is a standard technique to prevent overestimated Q-values (fujimoto2018addressing; haarnoja; hasselt2010double), and BEAR (kumar2019stabilizing), a policy constraint method. The results show that CQL learns a lower bound for all three tasks, whereas the baselines are prone to overestimation. We also evaluate a variant of CQL that uses Equation 1, and observe that the resulting values are lower (that is, they underestimate the true values more) compared to CQL($\mathcal{H}$). This provides empirical evidence that CQL($\mathcal{H}$) attains a tighter lower bound than the pointwise bound in Equation 1, as per Theorem 3.2.
Table 4: Difference between the policy value predicted by each method and the actual policy return (negative values indicate a lower bound).

| Task Name | CQL($\mathcal{H}$) | CQL (Eqn. 1) | Ensemble(2) | Ens.(4) | Ens.(10) | Ens.(20) | BEAR |
| --- | --- | --- | --- | --- | --- | --- | --- |
| hopper-medium-expert | -43.20 | -151.36 | 3.71e6 | 2.93e6 | 0.32e6 | 24.05e3 | 65.93 |
| hopper-mixed | -10.93 | -22.87 | 15.00e6 | 59.93e3 | 8.92e3 | 2.47e3 | 1399.46 |
| hopper-medium | -7.48 | -156.70 | 26.03e12 | 437.57e6 | 1.12e12 | 885e3 | 4.32 |
7 Discussion
We proposed conservative Q-learning (CQL), an algorithmic framework for offline RL that learns a lower bound on the policy value. Empirically, we demonstrated that CQL outperforms prior offline RL methods on a wide range of offline RL benchmark tasks, including complex control tasks and tasks with raw image observations. In many cases, the performance of CQL is substantially better than that of the best-performing prior methods, exceeding their final returns by 2-5x. The simplicity and efficacy of CQL make it a promising choice for a wide range of real-world offline RL problems. However, a number of challenges remain. While we prove that CQL learns lower bounds on the Q-function in the tabular, linear, and a subset of nonlinear function approximation cases, a rigorous theoretical analysis of CQL with deep neural nets is left for future work. Additionally, offline RL methods are liable to suffer from overfitting in the same way as standard supervised methods, so another important challenge for future work is to devise simple and effective early stopping methods, analogous to validation error in supervised learning.
Acknowledgements
We thank Mohammad Norouzi, Oleh Rybkin, Anton Raichuk, Vitchyr Pong and anonymous reviewers from the Robotic AI and Learning Lab at UC Berkeley for their feedback on an earlier version of this paper. We thank Rishabh Agarwal for help with the Atari QR-DQN/REM codebase and for sharing baseline results. This research was funded by the DARPA Assured Autonomy program, and compute support from Google, Amazon, and NVIDIA.
References
Appendix A Discussion of CQL Variants
We derive several variants of CQL in Section 3.2. Here, we discuss these variants in more detail and describe their specific properties. We first derive the variants CQL($\mathcal{H}$) and CQL($\rho$), and then present another variant of CQL, which we call CQL(var). This third variant has strong connections to distributionally robust optimization [namkoong2017variance].
CQL($\mathcal{H}$). In order to derive CQL($\mathcal{H}$), we substitute $\mathcal{R}(\mu) = \mathcal{H}(\mu)$, the entropy of $\mu$, and solve the optimization over $\mu$ in closed form for a given Q-function. For an optimization problem of the form
$$\max_{\mu}\; \mathbb{E}_{x \sim \mu(x)}[f(x)] + \mathcal{H}(\mu) \quad \text{s.t.} \quad \sum_x \mu(x) = 1,\; \mu(x) \geq 0\; \forall x,$$
the optimal solution is equal to $\mu^*(x) = \frac{1}{Z}\exp(f(x))$, where $Z$ is a normalizing factor. Plugging this into Equation 3, we exactly obtain Equation 4.
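The closed-form solution above can be sanity-checked numerically: the softmax distribution should achieve a higher value of the entropy-regularized objective than any other point on the probability simplex. The Q-values below are arbitrary illustrative numbers.

```python
import numpy as np

# Numerical check that mu*(a) = softmax(Q)(a) maximizes E_mu[Q] + H(mu)
# over the simplex, by comparing against random Dirichlet samples.
rng = np.random.default_rng(0)
q = np.array([1.0, -0.5, 2.0, 0.3])       # illustrative Q-values at a state

def objective(mu):
    ent = -(mu * np.log(mu + 1e-12)).sum()  # entropy (eps guards log(0))
    return mu @ q + ent

mu_star = np.exp(q) / np.exp(q).sum()      # softmax of the Q-values
best = objective(mu_star)
for _ in range(1000):                      # random points on the simplex
    mu = rng.dirichlet(np.ones_like(q))
    assert objective(mu) <= best + 1e-9    # softmax is never beaten
```

This is exactly the mechanism that turns the inner maximization over $\mu$ in Equation 3 into the log-sum-exp term of Equation 4.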
CQL($\rho$). In order to derive CQL($\rho$), we follow the above derivation, but the regularizer is a KL-divergence against a prior distribution $\rho(a|s)$ instead of entropy:
$$\max_{\mu}\; \mathbb{E}_{x \sim \mu(x)}[f(x)] - D_{\mathrm{KL}}(\mu \,\|\, \rho) \quad \text{s.t.} \quad \sum_x \mu(x) = 1,\; \mu(x) \geq 0\; \forall x.$$
The optimal solution is given by $\mu^*(x) = \frac{1}{Z}\rho(x)\exp(f(x))$, where $Z$ is a normalizing factor. Plugging this back into the CQL family (Equation 3), we obtain the following objective for training the Q-function (modulo some normalization terms):
$$\min_{Q}\; \alpha\, \mathbb{E}_{s \sim \mathcal{D}}\Big[\mathbb{E}_{a \sim \rho(a|s)}\Big[Q(s,a)\, \frac{\exp(Q(s,a))}{Z}\Big] - \mathbb{E}_{a \sim \hat{\pi}_\beta(a|s)}\big[Q(s,a)\big]\Big] + \frac{1}{2}\, \mathbb{E}_{s,a \sim \mathcal{D}}\Big[\big(Q(s,a) - \hat{\mathcal{B}}^{\pi_k} \hat{Q}^k(s,a)\big)^2\Big] \tag{5}$$
CQL(var). Finally, we derive a CQL variant that is inspired by the perspective of distributionally robust optimization (DRO) [namkoong2017variance]. This version penalizes the variance in the Q-function across actions at all states $s$, under some action-conditional distribution $\rho(a|s)$ of our choice. In order to derive a canonical form of this variant, we invoke an identity from namkoong2017variance, which helps us simplify Equation 3. To start, we define the notion of “robust expectation”: for any function $f(x)$, and any empirical distribution $\hat{P}_n(x)$ over a dataset of $n$ elements, the “robust” expectation defined by:
$$\mathcal{R}_n(f) := \max_{\mu \,:\, D_{\chi^2}(\mu \,\|\, \hat{P}_n) \leq \frac{\varepsilon}{n}} \; \mathbb{E}_{x \sim \mu}[f(x)]$$
can be approximated using the following upper bound:
$$\mathcal{R}_n(f) \leq \mathbb{E}_{x \sim \hat{P}_n}[f(x)] + \sqrt{\frac{\varepsilon\, \mathrm{Var}_{\hat{P}_n}(f(x))}{n}} \qquad (6)$$
where the gap between the two sides of the inequality decays inversely w.r.t. the dataset size, $n$. By using Equation 6 to simplify Equation 3, we obtain an objective for training the Q-function that penalizes the variance of Q-function predictions under the distribution $\rho(a|s)$:
$$\min_{Q} \; \alpha\, \mathbb{E}_{s \sim \mathcal{D}}\!\left[\mathbb{E}_{a \sim \rho(a|s)}\left[Q(s,a)\right] + \sqrt{\frac{\varepsilon\, \mathrm{Var}_{\rho(a|s)}\!\left(Q(s,a)\right)}{n}} - \mathbb{E}_{a \sim \hat{\pi}_\beta(a|s)}\left[Q(s,a)\right]\right] + \frac{1}{2}\,\mathbb{E}_{s,a,s' \sim \mathcal{D}}\!\left[\left(Q - \hat{\mathcal{B}}^{\pi_k}\hat{Q}^{k}\right)^2\right] \qquad (7)$$
The only remaining decision is the choice of $\rho(a|s)$, which can be chosen to be the inverse of the empirical action distribution in the dataset, $\rho(a|s) \propto \frac{1}{\hat{\pi}_\beta(a|s)}$, or even uniform over actions, $\rho(a|s) = \mathrm{Unif}(a)$, to obtain this variant of variance-regularized CQL.
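As a sketch of how this penalty can be computed on a tabular Q-function (hypothetical values throughout; `eps_over_n` stands in for the DRO radius-to-dataset-size ratio $\varepsilon/n$ in Equation 6, and $\rho$ is chosen uniform over actions as suggested above):

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.normal(size=(5, 4))  # hypothetical Q(s, a) table: 5 states, 4 actions

# rho(a|s) chosen uniform over actions.
rho = np.full(4, 0.25)

# Per-state mean and variance of Q-function predictions under rho(a|s).
mean_q = q @ rho
var_q = ((q - mean_q[:, None]) ** 2) @ rho

# CQL(var)-style penalty: the scaled standard deviation, averaged over states.
eps_over_n = 0.1  # hypothetical value of epsilon / n
penalty = np.sqrt(eps_over_n * var_q).mean()
```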
Appendix B Discussion of GapExpanding Behavior of CQL Backups
In this section, we discuss in detail the consequences of the gap-expanding behavior of CQL backups over prior methods based on policy constraints, which, as we show in this section, may not exhibit such gap-expanding behavior in practice. To recap, Theorem 3.4 shows that the CQL backup operator increases the difference between the expected Q-value at in-distribution (i.e., high $\hat{\pi}_\beta(a|s)$) and out-of-distribution (i.e., low $\hat{\pi}_\beta(a|s)$) actions. We refer to this property as the gap-expanding property of the CQL update operator.
Function approximation may give rise to erroneous Q-values at OOD actions. We start by discussing the behavior of prior methods based on policy constraints [kumar2019stabilizing, fujimoto2018off, jaques2019way, wu2019behavior] in the presence of function approximation. To recap, because computing the target value requires querying the Q-function at actions drawn from the learned policy $\pi$, constraining $\pi$ to be close to $\hat{\pi}_\beta$ will avoid evaluating the Q-function on OOD actions. These methods typically do not impose any further form of regularization on the learned Q-function. Even with policy constraints, because function approximation is used to represent the Q-function, learned Q-values at two distinct state-action pairs are coupled together. As prior work has argued and shown [achiam2019towards, fu2019diagnosing, kumar2020discor], the “generalization” or coupling effects of the function approximator may be heavily influenced by the properties of the data distribution [fu2019diagnosing, kumar2020discor]. For instance, fu2019diagnosing empirically shows that when the dataset distribution is narrow (i.e. when the entropy of the state-action marginal is low [fu2019diagnosing]), the coupling effects of the Q-function approximator can give rise to incorrect Q-values at different states, though this behavior is absent without function approximation, and is not as severe with high-entropy (e.g. uniform) state-action marginal distributions.
In offline RL, we will shortly present empirical evidence on high-dimensional MuJoCo tasks showing that certain dataset distributions may cause the learned Q-value at an OOD action $a$ at a state $s$ to in fact take on higher values than the Q-values at in-distribution actions at intermediate iterations of learning. This problem persists even when a large number of transition samples are provided for training, and the agent cannot correct these errors because no active data collection is allowed.
Since actor-critic methods, including those with policy constraints, use the learned Q-function to train the policy in an iterative policy evaluation and policy improvement cycle, as discussed in Section 2, the erroneous Q-function may push the policy towards OOD actions, especially when no policy constraints are used. Of course, policy constraints should prevent the policy from choosing OOD actions; however, as we will show, in certain cases policy constraint methods may also fail to shield the policy from the effects of incorrectly high Q-values at OOD actions.
How can CQL address this problem? As we show in Theorem 3.4, the difference between expected Q-values at in-distribution actions and out-of-distribution actions is expanded by the CQL update. This property is a direct consequence of the specific nature of the CQL regularizer, which maximizes Q-values under the dataset distribution and minimizes them otherwise. This difference depends upon the choice of $\alpha$, which can directly be controlled, since it is a free parameter. Thus, by appropriately controlling $\alpha$, CQL can push down the learned Q-value at out-of-distribution actions as much as desired, correcting for erroneous overestimation in the process.
Empirical evidence on high-dimensional benchmarks with neural networks. We next empirically demonstrate the existence of such Q-function estimation error on high-dimensional MuJoCo domains when deep neural network function approximators are used with stochastic optimization techniques. In order to measure this error, we plot the difference between the expected Q-value under actions sampled from the behavior distribution, $a \sim \hat{\pi}_\beta(a|s)$, and the maximum Q-value over actions sampled from a uniformly random policy, $a \sim \mathrm{Unif}(a)$. That is, we plot the quantity
$$\hat{\Delta}^k = \mathbb{E}_{s,a \sim \mathcal{D}}\!\left[\hat{Q}^k(s,a)\right] - \mathbb{E}_{s \sim \mathcal{D}}\!\left[\max_{a_i \sim \mathrm{Unif}(a)} \hat{Q}^k(s,a_i)\right] \qquad (8)$$
over the iterations of training, indexed by $k$. This quantity, intuitively, represents an estimate of the “advantage” of behavior actions under the Q-function, with respect to the (approximately) optimal action. Since we cannot perform exact maximization over the learned Q-function in a continuous action space, we estimate the maximum via the sampling procedure described in Equation 8.
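A minimal sketch of this sampling-based estimator follows; the function name `estimate_gap`, the $[-1, 1]$ action range, and the toy Q-function are our own illustrative assumptions, not from the paper's code:

```python
import numpy as np

def estimate_gap(q_func, states, dataset_actions, action_dim, n_uniform=32, seed=0):
    """Sampling-based estimate of the quantity in Equation 8: the expected
    Q-value at dataset (behavior) actions minus the max Q-value over uniformly
    sampled actions, averaged over dataset states. Negative values indicate
    that no uniformly sampled action looks better than the behavior actions."""
    rng = np.random.default_rng(seed)
    q_data = q_func(states, dataset_actions)  # shape (N,)
    # Approximate max_a Q(s, a) by the max over n_uniform actions drawn
    # uniformly from [-1, 1]^action_dim (assumed normalized action space).
    u = rng.uniform(-1.0, 1.0, size=(len(states), n_uniform, action_dim))
    q_rand = np.stack([q_func(states, u[:, i]) for i in range(n_uniform)], axis=1)
    return (q_data - q_rand.max(axis=1)).mean()

# Toy check with a Q-function that prefers small actions: behavior actions at
# zero are optimal, so the estimated gap is non-negative.
toy_q = lambda s, a: -(a ** 2).sum(axis=-1)
states = np.zeros((10, 3))
gap = estimate_gap(toy_q, states, np.zeros((10, 2)), action_dim=2)
```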
We present these plots in Figure 2 on two datasets: hopper-expert and hopper-medium. The expert dataset is generated from a near-deterministic expert policy, exhibits narrow coverage of the state-action space, and is limited to only a few directed trajectories. On this dataset, we find that $\hat{\Delta}^k$ is always positive for the policy-constraint method (Figure 2(a)) and increases during training – note the continuous rise in $\hat{\Delta}^k$ values for the policy-constraint method in Figure 2(a). This means that even if the dataset is generated from an expert policy, and policy constraints correct target values for OOD actions, incorrect Q-function generalization may make an out-of-distribution action appear promising. For the more stochastic hopper-medium dataset, which consists of a more diverse set of trajectories, shown in Figure 2(b), we still observe $\hat{\Delta}^k > 0$ for the policy-constraint method; however, the relative magnitude is smaller than on hopper-expert.
In contrast, Q-functions learned by CQL generally satisfy $\hat{\Delta}^k < 0$, and these values are clearly smaller than those for the policy-constraint method. This provides empirical support for Theorem 3.4, in that the maximum Q-value at a randomly chosen action from the uniform distribution over the action space is smaller than the Q-value at in-distribution actions.
On the hopper-expert task, as we show in Figure 2(a) (right), we eventually observe an “unlearning” effect in the policy-constraint method, where the policy performance deteriorates after additional training iterations. This “unlearning” effect is similar to what has been observed when standard off-policy Q-learning algorithms without any policy constraint are used in the offline regime [kumar2019stabilizing, levine2020offline]; this effect is absent in the case of CQL, even after equally many training steps. The performance on the more stochastic hopper-medium dataset fluctuates, but does not deteriorate suddenly.
To summarize this discussion, we concretely observed the following points via empirical evidence:

CQL backups are gap-expanding in practice, as indicated by the negative $\hat{\Delta}^k$ values in Figure 2.

Policy constraint methods, which do not impose any regularization on the Q-function, may exhibit highly positive $\hat{\Delta}^k$ values during training, especially with narrow data distributions, indicating that gap-expansion may be absent.

When $\hat{\Delta}^k$ values grow continuously during training, the policy might eventually suffer from an unlearning effect [levine2020offline], as shown in Figure 2(a).


Appendix C Theorem Proofs
In this section, we provide proofs of the theorems in Sections 3.1 and 3.2. We first redefine notation for clarity and then provide the proofs of the results in the main paper.
Notation. Let $k$ denote an iteration of policy evaluation (in Section 3.1) or Q-iteration (in Section 3.2). In iteration $k$, the objective – Equation 2 or Equation 3 – is optimized using the previous iterate (i.e. $\hat{Q}^{k-1}$) as the target value in the backup. $Q^k$ denotes the true, tabular Q-function iterate in the MDP, without any correction. In an iteration, say $k+1$, the current tabular Q-function iterate, $Q^{k+1}$, is related to the previous tabular Q-function iterate $Q^k$ as: $Q^{k+1} = \mathcal{B}^\pi Q^k$ (for policy evaluation) or $Q^{k+1} = \mathcal{B}^* Q^k$ (for policy learning). Let $\hat{Q}^k$ denote the $k$-th Q-function iterate obtained from CQL. Let $\hat{V}^k$ denote the value function, $\hat{V}^k(s) := \mathbb{E}_{a \sim \pi(a|s)}\left[\hat{Q}^k(s,a)\right]$.
A note on the value of $\alpha$. Before proving the theorems, we remark that the statements of Theorems 3.1, 3.2 and D.1 (we discuss this in Appendix D) show that CQL produces lower bounds if $\alpha$ is larger than some threshold, chosen so as to overcome either sampling error (Theorems 3.1 and 3.2) or function approximation error (Theorem D.1). While the optimal $\alpha$ in some of these cases depends on the current Q-value iterate, $\hat{Q}^k$, we can always choose a worst-case value of $\alpha$ by upper-bounding the relevant error terms, still guaranteeing a lower bound.
We first prove Theorem 3.1, which shows that policy evaluation using a simplified version of CQL (Equation 1) results in a pointwise lower bound on the Q-function.
Proof of Theorem 3.1. To start, we first note the form of the resulting Q-function iterate, $\hat{Q}^{k+1}$, in the setting without function approximation. By setting the derivative of Equation 1 to 0, we obtain the following expression for $\hat{Q}^{k+1}$ in terms of $\hat{Q}^k$:
$$\forall\, s, a \in \mathcal{D},\; k: \quad \hat{Q}^{k+1}(s,a) = \hat{\mathcal{B}}^\pi \hat{Q}^k(s,a) - \alpha\, \frac{\mu(a|s)}{\hat{\pi}_\beta(a|s)} \qquad (9)$$
Now, since $\alpha > 0$, $\mu(a|s) > 0$ and $\hat{\pi}_\beta(a|s) > 0$, we observe that at each iteration we underestimate the next Q-value iterate, i.e. $\hat{Q}^{k+1}(s,a) \leq \hat{\mathcal{B}}^\pi \hat{Q}^k(s,a)$.
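The pointwise underestimation in Equation 9 is easy to verify numerically on a small randomly generated tabular MDP (a sketch; all quantities here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, alpha = 4, 3, 0.9, 0.5

# Hypothetical tabular MDP quantities for one CQL policy-evaluation step.
reward = rng.uniform(size=(n_states, n_actions))
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P(s'|s,a)
pi = rng.dirichlet(np.ones(n_actions), size=n_states)             # pi(a|s)
pi_beta = rng.dirichlet(np.ones(n_actions), size=n_states)        # behavior policy
mu = pi                                                           # mu chosen as pi

q = np.zeros((n_states, n_actions))
# Bellman backup: (B^pi Q)(s,a) = r(s,a) + gamma * E_{s'}[ E_{a'~pi} Q(s',a') ].
v = (pi * q).sum(axis=1)
backup = reward + gamma * P @ v
# CQL iterate from Equation 9: subtract alpha * mu / pi_beta pointwise.
q_next = backup - alpha * mu / pi_beta

# Since alpha, mu and pi_beta are positive, the iterate underestimates the backup.
assert (q_next <= backup).all()
```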
Accounting for sampling error. Note that so far we have only shown that the Q-values are upper-bounded by the “empirical Bellman targets” given by $\hat{\mathcal{B}}^\pi \hat{Q}^k$. In order to relate $\hat{Q}^{k+1}$ to the true Q-value iterate, we need to relate the empirical Bellman operator, $\hat{\mathcal{B}}^\pi$, to the actual Bellman operator, $\mathcal{B}^\pi$. In Appendix D.3, we show that if the reward function and the transition function satisfy “concentration” properties, meaning that the difference between the observed reward sample and the actual reward function (and analogously for the transition matrix) is bounded with high probability, then the overestimation due to the empirical backup operator is also bounded. Formally, with high probability (w.h.p.), $\forall\, Q,\; s,a \in \mathcal{D}$:
$$\left|\hat{\mathcal{B}}^\pi Q(s,a) - \mathcal{B}^\pi Q(s,a)\right| \leq \frac{C_{r,T,\delta}\, R_{\max}}{(1-\gamma)\sqrt{|\mathcal{D}(s,a)|}}.$$
Hence, the following can be obtained, w.h.p.:
$$\hat{Q}^{k+1}(s,a) \leq \mathcal{B}^\pi \hat{Q}^k(s,a) - \alpha\, \frac{\mu(a|s)}{\hat{\pi}_\beta(a|s)} + \frac{C_{r,T,\delta}\, R_{\max}}{(1-\gamma)\sqrt{|\mathcal{D}(s,a)|}} \qquad (10)$$
Now we need to reason about the fixed point of the update procedure in Equation 9. The fixed point of Equation 9 (with the sampling error bound above) is given by:
$$\hat{Q}^\pi(s,a) \leq Q^\pi(s,a) - \alpha \left[\left(I - \gamma P^\pi\right)^{-1} \frac{\mu}{\hat{\pi}_\beta}\right](s,a) + \left[\left(I - \gamma P^\pi\right)^{-1} \frac{C_{r,T,\delta}\, R_{\max}}{(1-\gamma)\sqrt{|\mathcal{D}|}}\right](s,a),$$
thus proving the relationship in Theorem 3.1.
In order to guarantee a lower bound, $\alpha$ can be chosen to cancel any potential overestimation incurred by the sampling error term. Note that this choice works, since $(I - \gamma P^\pi)^{-1}$ is a matrix with all non-negative entries. The choice of $\alpha$ that guarantees a lower bound is then given by:
$$\alpha \geq \max_{s,a \in \mathcal{D}} \frac{C_{r,T,\delta}\, R_{\max}}{(1-\gamma)\sqrt{|\mathcal{D}(s,a)|}} \cdot \left[\min_{s,a \in \mathcal{D}} \frac{\mu(a|s)}{\hat{\pi}_\beta(a|s)}\right]^{-1}.$$
Of course, we need $\mu(a|s) > 0$ whenever $\hat{\pi}_\beta(a|s) > 0$ for this to hold, and this is assumed in the theorem statement. Note that since the sampling error vanishes as $|\mathcal{D}(s,a)| \rightarrow \infty$, in the absence of sampling error (i.e. when $\hat{\mathcal{B}}^\pi = \mathcal{B}^\pi$), any $\alpha \geq 0$ guarantees a lower bound.
Next, we prove Theorem 3.2, which shows that the additional term that maximizes the expected Q-value under the dataset distribution, $\hat{\pi}_\beta(a|s)$ (or $\pi_\beta(a|s)$, in the absence of sampling error), results in a lower bound on only the expected value of the policy at a state, and not a pointwise lower bound on the Q-values at all actions.
Proof of Theorem 3.2. We first prove this theorem in the absence of sampling error, and then incorporate sampling error at the end, using a technique similar to the previous proof. In the tabular setting, we can set the derivative of the modified objective in Equation 2 to zero, and compute the Q-function update induced in the exact, tabular setting (this assumes $\hat{\mathcal{B}}^\pi = \mathcal{B}^\pi$ and $\hat{\pi}_\beta = \pi_\beta$):
$$\forall\, s, a,\; k: \quad \hat{Q}^{k+1}(s,a) = \mathcal{B}^\pi \hat{Q}^k(s,a) - \alpha \left[\frac{\mu(a|s)}{\pi_\beta(a|s)} - 1\right] \qquad (11)$$
Note that for state-action pairs $(s,a)$ such that $\mu(a|s) < \pi_\beta(a|s)$, we are in fact adding a positive quantity, $\alpha\left[1 - \frac{\mu(a|s)}{\pi_\beta(a|s)}\right]$, to the Q-function obtained, and thus we cannot guarantee a pointwise lower bound, i.e. $\hat{Q}^{k+1}(s,a) \leq \mathcal{B}^\pi \hat{Q}^k(s,a)$ need not hold at all state-action pairs. To formally prove this, we can construct a counterexample three-state, two-action MDP, and choose a specific behavior policy $\pi_\beta(a|s)$, such that this is indeed the case.
The value of the policy, on the other hand, is underestimated, since (setting $\mu = \pi$):
$$\hat{V}^{k+1}(s) := \mathbb{E}_{a \sim \pi(a|s)}\!\left[\hat{Q}^{k+1}(s,a)\right] = \mathcal{B}^\pi \hat{V}^k(s) - \alpha\, \mathbb{E}_{a \sim \pi(a|s)}\!\left[\frac{\pi(a|s)}{\pi_\beta(a|s)} - 1\right] \qquad (12)$$
and we can show that $D_{\mathrm{CQL}}(s) := \mathbb{E}_{a \sim \pi(a|s)}\!\left[\frac{\pi(a|s)}{\pi_\beta(a|s)} - 1\right]$ is always positive when $\pi(a|s) \neq \pi_\beta(a|s)$. To note this, we present the following derivation:
$$D_{\mathrm{CQL}}(s) = \sum_a \pi(a|s)\left[\frac{\pi(a|s)}{\pi_\beta(a|s)} - 1\right] = \sum_a \left(\pi(a|s) - \pi_\beta(a|s)\right)\left[\frac{\pi(a|s)}{\pi_\beta(a|s)} - 1\right] + \sum_a \pi_\beta(a|s)\left[\frac{\pi(a|s)}{\pi_\beta(a|s)} - 1\right] = \sum_a \underbrace{\frac{\left(\pi(a|s) - \pi_\beta(a|s)\right)^2}{\pi_\beta(a|s)}}_{\geq 0} + 0.$$
Note that the marked term is non-negative since both the numerator and denominator are non-negative, and this implies that $D_{\mathrm{CQL}}(s) \geq 0$. Also, note that $D_{\mathrm{CQL}}(s) = 0$ iff $\pi(a|s) = \pi_\beta(a|s)$. This implies that each value iterate incurs some underestimation, i.e. $\hat{V}^{k+1}(s) \leq \mathcal{B}^\pi \hat{V}^k(s)$.
Now, we can compute the fixed point of the recursion in Equation 12, and this gives us the following estimated policy value:
$$\hat{V}^\pi(s) = V^\pi(s) - \alpha \left[\left(I - \gamma P^\pi\right)^{-1} \mathbb{E}_{\pi}\!\left[\frac{\pi}{\pi_\beta} - 1\right]\right](s),$$
thus showing that, in the absence of sampling error, Theorem 3.2 gives a lower bound. It is straightforward to note that this expression is tighter than the corresponding expression for the policy value in the previous proof, since here we subtract $\alpha\left[\frac{\mu(a|s)}{\pi_\beta(a|s)} - 1\right]$ from the Q-values (in the exact case), rather than the larger quantity $\alpha\,\frac{\mu(a|s)}{\hat{\pi}_\beta(a|s)}$.
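The two halves of this argument – that the shift in Equation 11 can be positive at individual actions, while its expectation under $\pi$ is always non-positive – can be checked numerically (a NumPy sketch with randomly drawn policies, setting $\mu = \pi$):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, alpha = 4, 3, 0.5

pi = rng.dirichlet(np.ones(n_actions), size=n_states)       # policy pi(a|s)
pi_beta = rng.dirichlet(np.ones(n_actions), size=n_states)  # behavior policy

# Pointwise shift applied by Equation 11: -alpha * (pi/pi_beta - 1).
shift = -alpha * (pi / pi_beta - 1.0)
# Wherever pi(a|s) < pi_beta(a|s) the shift is positive: no pointwise bound.
assert (shift > 0).any()

# But the expected shift under pi is -alpha * D_CQL(s), which is non-positive,
# since D_CQL(s) = sum_a pi*(pi/pi_beta - 1) = chi^2(pi || pi_beta) >= 0.
d_cql = (pi * (pi / pi_beta - 1.0)).sum(axis=1)
chi_sq = ((pi - pi_beta) ** 2 / pi_beta).sum(axis=1)
assert np.allclose(d_cql, chi_sq)
assert (d_cql >= -1e-12).all()
```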
Incorporating sampling error. To extend this result to the setting with sampling error, similarly to the previous result, the maximal overestimation at each iteration $k$ is bounded by $\frac{C_{r,T,\delta}\, R_{\max}}{(1-\gamma)\sqrt{|\mathcal{D}(s,a)|}}$. The resulting value function satisfies (w.h.p.):
$$\hat{V}^\pi(s) \leq V^\pi(s) - \alpha \left[\left(I - \gamma P^\pi\right)^{-1} \mathbb{E}_{\pi}\!\left[\frac{\pi}{\hat{\pi}_\beta} - 1\right]\right](s) + \left[\left(I - \gamma P^\pi\right)^{-1} \frac{C_{r,T,\delta}\, R_{\max}}{(1-\gamma)\sqrt{|\mathcal{D}|}}\right](s).$$