## 1 Introduction

In offline reinforcement learning (RL), agents learn from a static dataset without any interaction with the environment. Although off-policy RL algorithms are intuitively applicable to this setting, they often perform poorly in practice [fujimotoOffpolicyDeepReinforcement2019, fuD4rlDatasetsDeep2020]. Many research works attribute this problem to distributional shift [fujimotoOffpolicyDeepReinforcement2019, wuBehaviorRegularizedOffline2019b, levineOfflineReinforcementLearning2020d], especially *action* distributional shift. The out-of-distribution (OOD) actions used in Bellman backups introduce extrapolation errors in the value function and the agent fails to correct such errors since no online interaction is allowed. However, in-distribution actions also raise significant challenges. When the dataset has insufficient information about the underlying Markov Decision Process (MDP), suboptimal actions with high uncertainty in knowledge may appear to be good and thus bias the agent towards making bad decisions. In other words, epistemic uncertainty spuriously correlates with decision-making.

In this paper, we posit that an effective mechanism to deal with spurious correlations is the key ingredient missing in existing methods. Recently, some theoretical studies found that pessimism in the face of uncertainty eliminates spurious correlations in offline learning [jinPessimismProvablyEfficient2021a, xieBellmanconsistentPessimismOffline2021]. Furthermore, the pessimism principle is provably efficient and even achieves minimax optimality in linear MDPs [jinPessimismProvablyEfficient2021a]. However, it is empirically shown to fail when combined with function approximators, e.g., neural networks, to solve general MDPs. The two major difficulties come from quantifying uncertainty [levineOfflineReinforcementLearning2020d, yuCOMBOConservativeOffline2021b] and constraining the action space [fujimotoOffpolicyDeepReinforcement2019].

To address these problems, we design a practical algorithm termed Spurious COrrelation REduction (SCORE), which adds an uncertainty penalty into value estimators, i.e., the higher the uncertainty, the more the action value is penalized. In this way, the spurious correlation between epistemic uncertainty and decision-making is alleviated. Implementation-wise, we use bootstrapped ensemble Q networks to quantify the uncertainty. Meanwhile, a gradually decaying behavior cloning (BC) regularizer is added into the policy objective to constrain the action space. Accordingly, the proposed method reduces to a pure uncertainty-based method when the regularization coefficient decreases to zero, avoiding the dependence on the behavioral policy. We further show that this method is theoretically guaranteed and achieves a sublinear rate of convergence under linear function approximation. Some previous papers constrain the action space by enforcing strong constraints between the learned policy and the behavioral policy [fujimotoOffpolicyDeepReinforcement2019, wuBehaviorRegularizedOffline2019b, kumarStabilizingOffpolicyQlearning2019] or regularizing the action value [kumarConservativeQlearningOffline2020]. While these methods show good empirical results for particular data distributions, the performance is closely related to the behavioral policy. In contrast, our approach is theoretically adaptive to the data distribution, and the performance depends only on how well the dataset covers the state-action distribution of the optimal policy, rather than the entire state-action space [jinPessimismProvablyEfficient2021a].

Our main contributions are as follows: (1) We demonstrate the detrimental effect of spurious correlations in offline RL and show that pessimism in the face of uncertainty can eliminate them, recovering the optimal policy. (2) We propose a practical algorithm that reduces spurious correlations with an uncertainty penalty estimated by bootstrapped ensemble Q networks. We prove that this is in line with the pessimism principle from the Bayesian perspective. (3) We also show that the proposed method converges to the optimal policy with a sublinear rate under linear function approximation. (4) Through extensive experiments on the D4RL benchmark, we show that SCORE is robust across multiple data settings, which indicates that the pessimism principle in offline RL is not only theoretically sound but also strongly supported by empirical results.

## 2 Preliminaries

We consider an MDP , where and represent the state space and the action space respectively. is the Markov transition function, is the reward function, is the discount factor, and is the initial distribution of states.

In offline RL, the agent is given a static dataset collected by the behavioral policy . Suppose that denotes the discounted state-action distribution of a policy , we have and . Then the goal of offline RL is to search for a policy that maximizes the expected total reward given a static dataset . The expectation is taken with respect to , , and . With a slight abuse of notation, we refer to as the dataset distribution.

### 2.1 Suboptimality Decomposition

In offline RL, the samples are drawn from a fixed distribution instead of the environment. Therefore, the true Bellman optimality operator gets replaced by its empirical counterpart, in which transition probabilities and rewards are estimated by sample averages over the dataset. Since the dataset only covers partial information of the environment, the agent would be learning with bias. In this paper, we formalize such bias for any action-value function as follows:

(1) |

Since characterizes the error arising from insufficient information about the environment in knowledge and gradually converges to zero as we learn more about the state-action pair (including state transitions and rewards), we refer to it as the *epistemic error*. In the ideal case, the dataset accurately mirrors the environment, i.e., , resulting in zero epistemic error. The agent can learn the optimal policy offline just like in the online setting. However, this is almost impossible in real-world domains. In general, the dataset contains limited information and the epistemic error persists throughout the learning process.

We decompose the suboptimality of a policy , i.e., the performance gap between and the optimal policy , into the following three components [jinPessimismProvablyEfficient2021a]:

(2) | ||||

where is an estimated Q function, is the state-value of a state , and measures the expected return of a policy at the initial state . It is straightforward to see that the suboptimality of the optimal policy is zero and a lower suboptimality indicates a better policy. In linear MDPs, term () in equation 2 arises from the information-theoretic lower bound and thus is impossible to eliminate. Meanwhile, term () is non-positive as long as the policy is greedy with respect to the estimated action-value function . Therefore, controlling term () is the key to reducing suboptimality in offline RL. We accomplish this by introducing pessimism in the following sections.

### 2.2 Pessimism

Let represent an arbitrary estimated -value function. We first define an uncertainty quantifier with confidence as follows.

###### Definition 2.1 (-Uncertainty Quantifier).

is a -uncertainty quantifier with respect to the dataset distribution if the event

(3) |

satisfies .

In Definition 2.1, measures the uncertainty arising from approximating with , where is the true Bellman optimality operator while is the empirical Bellman operator. We remark that can be constructed implicitly by treating as a whole. When and differ by a large amount, should be large, while when the two quantities are sufficiently close, can be very small or even zero. We then construct a pessimistic Bellman operator as follows:

(4) |

According to Definition 2.1, holds for all state-action pairs with a high probability, i.e., the Q-value obtained by equation 4 lower bounds the true value. In other words, equation 4 provides a pessimistic estimation of the Q function. Replacing the empirical Bellman operator in equation 1 with the pessimistic Bellman operator, it holds that:

(5) |

Since the epistemic error is non-negative as shown in equation 5, term () in equation 2 only reduces the suboptimality. As a result, pessimism eliminates spurious correlations. Meanwhile, the suboptimality is now upper-bounded by , so what remains is to find a sufficiently small -uncertainty quantifier to tighten this upper bound.
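The two-sided bound referred to as equation 5 can be motivated by a short argument. The following is only a sketch under assumed notation, introduced here for illustration and inferred from the surrounding prose: $\mathcal{T}$ is the true Bellman optimality operator, $\hat{\mathcal{T}}$ its empirical counterpart, $\Gamma$ the uncertainty quantifier, and the pessimistic operator is taken to be $\tilde{\mathcal{T}}Q = \hat{\mathcal{T}}Q - \Gamma$. On the event of Definition 2.1, read as $|\hat{\mathcal{T}}Q - \mathcal{T}Q| \le \Gamma$ at every state-action pair:

```latex
\begin{align*}
\mathcal{T}Q(s,a) - \tilde{\mathcal{T}}Q(s,a)
  &= \big(\mathcal{T}Q(s,a) - \hat{\mathcal{T}}Q(s,a)\big) + \Gamma(s,a), \\
-\Gamma(s,a) \le \mathcal{T}Q(s,a) - \hat{\mathcal{T}}Q(s,a) \le \Gamma(s,a)
  &\;\Longrightarrow\;
  0 \le \mathcal{T}Q(s,a) - \tilde{\mathcal{T}}Q(s,a) \le 2\,\Gamma(s,a).
\end{align*}
```

Under this reading, the lower bound is what makes the epistemic error non-negative, and the upper bound is why a smaller valid quantifier yields a tighter guarantee.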

## 3 Spurious COrrelation REduction for Offline RL

In this section, we elaborate on the method we use to reduce the impact of spurious correlations on the offline RL problem. We first demonstrate the spurious correlation phenomenon through a simple example in Section 3.1 and verify the effectiveness of the pessimistic Bellman operator. We then present a general algorithm named SCORE in Section 3.2. In Section 3.3, we further analyze the convergence of the proposed algorithm.

### 3.1 An Example of The Spurious Correlation Phenomenon

We consider an episodic MDP with two states and two actions . We assume for any current state that , , , and . In , the reward is always positive regardless of the action performed, while in the bad state , the agent can only get punished. As a result, it is optimal to always perform to stay in/move to . To demonstrate the effect of spurious correlations, we generate an expert dataset with the optimal policy and modify it by adding a trajectory starting from performing the bad action and transiting into the good state.

Figure 1 shows the empirical transition probability distributions of the two datasets. Since the optimal policy always prefers , the empirical probabilities for are all zero. But in the modified dataset, appears once and transits into , so the corresponding probability becomes one (the blue bar in Figure 1). In this case, no OOD actions exist (both and are included in the dataset), but carries high uncertainty in knowledge. We run offline Q-learning and its pessimistic variant (equation 4) on the modified dataset. Figure 1 also shows how the Q values evolve during the training process. Since epistemic uncertainty spuriously correlates with decision-making, offline Q-learning overestimates and converges to a suboptimal policy favoring . By contrast, the pessimistic variant penalizes by high uncertainty, recovering the optimal policy, i.e., always prefers . While most existing works focus on defending against OOD actions, in this simple example, we show that even in-distribution data can cause serious problems. Therefore, reducing spurious correlations is of significance in offline RL. We refer to Appendix D for more details about the example.
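The phenomenon can be reproduced in a few lines of tabular code. The sketch below is not the setup from Appendix D: the state and action names, the reward values, the dataset sizes, and the count-based penalty `beta / sqrt(n)` are all assumptions chosen for illustration. It runs Q-iteration on the empirical model of a small dataset in which the bad action appears exactly once with a lucky outcome, first without and then with the penalty.

```python
import math
from collections import defaultdict

GAMMA = 0.9  # assumed discount factor


def build_dataset():
    """Hypothetical data: expert transitions plus one lucky 'a_bad' sample."""
    data = [("good", "a_good", 1.0, "good")] * 20
    data += [("bad", "a_good", -1.0, "good")] * 5
    # The single added transition: the bad action happens to pay off once.
    data += [("good", "a_bad", 1.5, "good")]
    return data


def state_value(q, state):
    """Max Q-value over the actions that appear in the dataset at `state`."""
    vals = [v for (s, _), v in q.items() if s == state]
    return max(vals) if vals else 0.0


def fit_q(dataset, beta, sweeps=200):
    """Q-iteration on the empirical model, penalized by beta / sqrt(count)."""
    n = defaultdict(int)
    r_sum = defaultdict(float)
    nxt = defaultdict(lambda: defaultdict(int))
    for s, a, r, s2 in dataset:
        n[(s, a)] += 1
        r_sum[(s, a)] += r
        nxt[(s, a)][s2] += 1
    q = {sa: 0.0 for sa in n}
    for _ in range(sweeps):
        new_q = {}
        for (s, a), count in n.items():
            # Empirical Bellman backup from sample averages.
            backup = r_sum[(s, a)] / count + GAMMA * sum(
                c / count * state_value(q, s2) for s2, c in nxt[(s, a)].items()
            )
            # Rarely observed pairs receive a larger penalty.
            new_q[(s, a)] = backup - beta / math.sqrt(count)
        q = new_q
    return q


def greedy_action(q, state):
    return max((a for (s, a) in q if s == state), key=lambda a: q[(state, a)])


dataset = build_dataset()
plain = fit_q(dataset, beta=0.0)
pessimistic = fit_q(dataset, beta=1.0)
print(greedy_action(plain, "good"))        # a_bad: the lucky sample wins
print(greedy_action(pessimistic, "good"))  # a_good: the penalty removes the bias
```

Both actions are in the dataset, so nothing here is out of distribution; the difference comes entirely from how few times the bad action was observed.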

### 3.2 Practical Algorithms

As shown in Section 2.2 and the above example, pessimism can eliminate spurious correlations in offline RL. What remains is to design a proper uncertainty quantifier. From Definition 2.1 and equation 5, we can see that achieves the tightest bound, i.e., the uncertainty quantifier accurately measures the epistemic error . In other words, to eliminate spurious correlations, we need a method to provide reliable estimations of epistemic uncertainty. Since the state and the action space are huge in real-world domains, function approximation (e.g., using deep neural networks) is indispensable to provide sufficient expressiveness. In this case, we can neither directly estimate uncertainty by counting states and actions, nor derive an analytical form of the epistemic uncertainty as in linear MDPs [jinPessimismProvablyEfficient2021a].

Estimating uncertainty is an important research topic. One of the most popular approaches is to use the bootstrapped ensemble method [osbandDeepExplorationBootstrapped2016c, lakshminarayananSimpleScalablePredictive2017]. Each ensemble member is trained on a different version of data generated by a bootstrap sampling procedure. This approach provides a general and non-parametric way to approximate the Bayesian posterior distribution, so the standard deviation of multiple Q estimations can be regarded as a reasonable estimation of the epistemic uncertainty. We remark that previous works mainly use uncertainty as a bonus in online RL to promote efficient exploration. In this paper, we utilize uncertainty as a penalty to reduce spurious correlations. For the equivalence of the uncertainty obtained by this method to the one studied in theory [jinPessimismProvablyEfficient2021a], we refer to Appendix C for more details.

Policy Evaluation. In the policy evaluation step, we maintain independent critics and their corresponding target networks . The learning objective of each critic is as follows,

(6) | ||||

where is the standard deviation of the predictions of the input state-action pair, and is a hyper-parameter that controls the strength of the uncertainty penalty. At first glance, there are two penalty terms in equation 6, one for the state-action pair and the other for , which is different from equation 4 used in PEVI [jinPessimismProvablyEfficient2021a]. The major reason is that PEVI is an algorithm for episodic MDPs, which calculates the Q values in one pass in an episodic backward manner starting from the terminal state. The target value it uses at step has already been penalized at step , so subtracting at step is unnecessary. Conversely, we study non-episodic MDPs and the Q networks are optimized with stochastic gradient descent. Each sample is used multiple times in varying order, so it is more appropriate to penalize both and with a small quantity (controlled by ) each time. Empirically, the penalty term is more effective than since it also serves to defend against OOD actions. While the majority of existing approaches use the smallest Q-value as the target value to avoid overestimation, equation 6 updates each critic towards its corresponding target network . By doing so, the temporal consistency is guaranteed and the uncertainty can be passed over time [osbandDeepExplorationBootstrapped2016c, osbandRandomizedPriorFunctions2018b].

Policy Improvement. The objective function of the policy is defined as follows,

(7) |

The behavior cloning loss serves as a regularization term, which frees the algorithm from explicitly modeling the behavioral policy [fujimotoOffpolicyDeepReinforcement2019, kumarStabilizingOffpolicyQlearning2019, wuUncertaintyWeightedActorCritic2021]. In particular, we gradually decrease the regularization coefficients during the training process. At the early stage, the ensemble networks are not accurate enough to measure epistemic uncertainty. The behavior cloning regularization helps to provide a good initialization and prevents the policy from deviating far away from the dataset distribution . In the later stage, the regularization effect becomes weaker and weaker, and the pessimistic Q-values gradually dominate the policy objective. In this way, SCORE returns to a pure uncertainty-based method without relying on the behavioral policy that generates the dataset. Alternatively, we can understand this design choice from the optimization perspective [guoBatchReinforcementLearning2020]. Directly maximizing the uncertainty-penalized value function is a difficult task. Using behavior cloning lowers the difficulty of the optimization problem at the early stage. As the training process proceeds, the regularization effect decreases, and the objective function gradually returns to the original problem, i.e., maximizing the pessimistic action-value function. The complete algorithm is summarized in Algorithm 1.
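To make the pieces of this section concrete, here is a minimal scalar sketch of how they might fit together. It is not the authors' implementation and does not reproduce equations 6 and 7 exactly: the function names, the use of the population standard deviation, the placement of both penalties inside the critic target, the squared-error form of the behavior cloning term, and the step-wise decay schedule are all assumptions made for illustration.

```python
import statistics


def ensemble_uncertainty(q_values):
    """Standard deviation of the ensemble's Q predictions at one state-action pair."""
    return statistics.pstdev(q_values)


def critic_targets(reward, q_sa, target_q_next, beta, gamma=0.99):
    """Per-critic regression targets with penalties on both (s, a) and (s', a').

    q_sa:          current critics' predictions at (s, a)
    target_q_next: target networks' predictions at (s', a'), with a' from the policy
    beta:          strength of the uncertainty penalty
    """
    u_sa = ensemble_uncertainty(q_sa)
    u_next = ensemble_uncertainty(target_q_next)
    # Each critic bootstraps from its own target network, not the ensemble minimum.
    return [
        reward - beta * u_sa + gamma * (q_next - beta * u_next)
        for q_next in target_q_next
    ]


def bc_weight(step, initial=1.0, decay=0.5, every=1000):
    """Hypothetical schedule: scale the BC coefficient by `decay` every `every` steps."""
    return initial * decay ** (step // every)


def policy_loss(pessimistic_q, policy_action, data_action, step):
    """Negative pessimistic Q-value plus a decaying squared-error BC term."""
    bc = sum((p - d) ** 2 for p, d in zip(policy_action, data_action))
    return -pessimistic_q + bc_weight(step) * bc
```

With this schedule the behavior cloning term dominates early in training and fades geometrically, so late updates are driven almost entirely by the penalized Q-value.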

### 3.3 Convergence Analysis

In this section, we first introduce offline soft-DPG, which is the theoretical counterpart of SCORE. Then we show the equivalence between offline soft-DPG and offline proximal policy optimization (PPO) [schulman2015trust, schulman2017proximal]. Finally, by analyzing the convergence of offline PPO, we show that offline soft-DPG achieves a sublinear rate of convergence.

Regularized MDP. For any behavior policy , based on the definition of the MDP , we introduce its regularized counterpart , where is the regularization parameter. Specifically, for any policy in , the regularized state-value function and the regularized action-value function are defined as

respectively. We remark that such a regularization term in the definition of serves as a behavior cloning term. Throughout the learning process, we anneal the regularization parameter so that the impact of the behavior cloning term decreases. Formally, for a collection of regularized MDPs , we aim to minimize the suboptimality gap defined as follows,

(8) |

Here we denote by and for notational convenience, where is the optimal policy for . We remark that the suboptimality defined in equation 8 measures the suboptimality gap between the best policy and the corresponding optimal policy under the regularized MDP , where .

Pessimistic Offline Soft-DPG. For the simplicity of presentation, we consider a theoretical counterpart of the proposed algorithm. Formally, we introduce pessimistic offline soft-DPG as follows. At the -th iteration of pessimistic offline soft-DPG, with estimated pessimistic Q-function and policy , we define the offline soft-DPG objective for the regularized MDP as follows,

(9) |

where is the static dataset and the KL divergence is a behavior cloning term. In policy improvement, we employ deterministic policy gradient [silver2014deterministic] to maximize equation 9. We remark that the objective function in equation 9 is equivalent to equation 7 under Gaussian policies. While in policy evaluation, we assume that there exists an oracle that uses the -uncertainty quantifier defined in Definition 2.1 to construct a pessimistic estimator of the Q-function. Such an oracle for pessimistic evaluation can be practically achieved by equation 6 as shown in Section 3.2. Thus, our pessimistic offline soft-DPG is indeed equivalent to its practical counterpart in Algorithm 1.
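The remark that the KL term acts as behavior cloning under Gaussian policies can be motivated by a standard identity. As a sketch, assume (this is not stated in the text) that the two policies are Gaussians with means $\mu_1(s)$ and $\mu_2(s)$ and a shared, fixed covariance $\sigma^2 I$:

```latex
\begin{equation*}
\mathrm{KL}\big(\mathcal{N}(\mu_1(s), \sigma^2 I)\,\big\|\,\mathcal{N}(\mu_2(s), \sigma^2 I)\big)
  = \frac{\lVert \mu_1(s) - \mu_2(s) \rVert_2^2}{2\sigma^2}.
\end{equation*}
```

Under that assumption the KL divergence is a scaled squared distance between the two mean actions, which is the form a squared-error behavior cloning loss takes.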

Equivalence between Soft-DPG and PPO. We show that the update to maximize equation 9 is equivalent to solving the pessimistic proximal policy optimization (PPO) [schulman2015trust, schulman2017proximal] objective. Formally, we consider the linear function parameterization in the -th iteration as follows,

(10) |

where and are feature vectors, and is the energy function. We denote by and for notational convenience. With pessimistic Q-function and current policy in the -th iteration, we define the offline PPO objective for the regularized MDP as follows,

(11) |

where is the behavior policy and is the regularization parameter. Under the parameterization in equation 10, we show in the following lemma that maximizing equation 11 is equivalent to a gradient update of equation 9. To introduce the lemma, we define , where .

###### Lemma 3.1 (Equivalence between Soft-DPG and PPO).

The stationary point of satisfies

where is the deterministic policy associated with .

###### Proof.

See Section B.1 for a detailed proof. ∎

By Lemma 3.1, we see that maximizing the offline PPO objective is equivalent to an implicit natural gradient step corresponding to the maximization of the pessimistic offline soft-DPG objective. Thus, to analyze the convergence of pessimistic offline soft-DPG, it suffices to analyze pessimistic offline PPO.

Convergence Analysis. For simplicity of presentation, we take the regularization parameter , where quantifies the speed of annealing. Recall that we employ pessimism to construct estimated Q-functions at each iteration , which ensures that there exists a -uncertainty quantifier defined in Definition 2.1. Formally, we impose the following assumption on the estimated Q-functions, which can be achieved by a bootstrapped ensemble method as shown in Section 3.2.

###### Assumption 3.2 (Pessimistic Q-Functions).

For any , is a -uncertainty quantifier for the estimated Q-function , i.e., the event

holds with probability at least .

We further define the pessimistic error as follows,

(12) |

Such a pessimistic error in equation 12 quantifies the irremovable intrinsic uncertainty. Now, we introduce our main theoretical result as follows.

###### Theorem 3.3.

###### Proof.

See §A for a detailed proof. ∎

Theorem 3.3 states that the sequence of policies generated by pessimistic offline PPO converges sublinearly to an optimal policy in the regularized MDP with an additional pessimistic error term . We remark that such an error term is irremovable, as it arises from the information-theoretic lower bound [jinPessimismProvablyEfficient2021a]. Moreover, given the equivalence between offline PPO and offline soft-DPG as in Lemma 3.1, we know that offline soft-DPG also converges to an optimal policy under a sublinear rate.

## 4 Experiments

In this section, we conduct extensive experiments on the widely adopted benchmark D4RL to verify the effectiveness of the proposed algorithm. We first present the results of comparison experiments in Section 4.1. Then we visualize and analyze the uncertainty learned by our method in Section 4.2. Lastly, we perform ablation studies in Section 4.3.

Table 1: Results of SCORE and baseline methods on the D4RL-MuJoCo datasets.

| Task | SCORE | MOPO | MOReL | BCQ | BEAR | UWAC | CQL | TD3-BC |
|---|---|---|---|---|---|---|---|---|
| halfcheetah-random | 29.1 ± 2.6 | 35.9 ± 2.9 | 30.3 ± 5.9 | 2.2 ± 0.0 | 2.3 ± 0.0 | 2.3 ± 0.0 | 21.7 ± 0.6 | 10.6 ± 1.7 |
| hopper-random | 31.3 ± 0.3 | 16.7 ± 12.2 | 44.8 ± 4.8 | 8.1 ± 0.5 | 3.9 ± 2.3 | 2.7 ± 0.3 | 8.1 ± 1.4 | 8.6 ± 0.4 |
| walker2d-random | 3.7 ± 7.0 | 4.2 ± 5.7 | 17.3 ± 8.2 | 4.6 ± 0.7 | 12.8 ± 10.2 | 2.0 ± 0.4 | 0.5 ± 1.3 | 1.5 ± 1.4 |
| halfcheetah-medium-replay | 48.0 ± 0.7 | 69.2 ± 1.1 | 31.9 ± 6.0 | 40.9 ± 1.1 | 36.3 ± 3.1 | 35.9 ± 3.7 | 47.2 ± 0.4 | 44.8 ± 0.5 |
| hopper-medium-replay | 79.9 ± 24.6 | 32.7 ± 9.4 | 54.2 ± 32.0 | 40.9 ± 16.7 | 52.2 ± 19.3 | 25.7 ± 1.9 | 95.6 ± 2.4 | 57.8 ± 17.3 |
| walker2d-medium-replay | 84.8 ± 1.1 | 73.7 ± 2.4 | 13.7 ± 8.0 | 42.5 ± 13.7 | 7.0 ± 7.8 | 23.6 ± 6.9 | 85.3 ± 2.7 | 81.9 ± 2.7 |
| halfcheetah-medium | 55.2 ± 0.4 | 73.1 ± 2.4 | 20.4 ± 13.8 | 45.4 ± 1.7 | 43.0 ± 0.2 | 42.1 ± 0.5 | 49.2 ± 0.3 | 47.8 ± 0.4 |
| hopper-medium | 99.6 ± 2.8 | 38.3 ± 34.9 | 53.2 ± 32.1 | 54.0 ± 3.7 | 51.8 ± 4.0 | 50.9 ± 4.4 | 62.7 ± 3.7 | 69.1 ± 4.5 |
| walker2d-medium | 89.2 ± 1.2 | 41.2 ± 30.8 | 10.3 ± 8.9 | 74.5 ± 3.7 | -0.2 ± 0.1 | 75.4 ± 3.0 | 83.3 ± 0.8 | 81.3 ± 3.0 |
| halfcheetah-medium-expert | 92.6 ± 3.5 | 70.3 ± 21.9 | 35.9 ± 19.2 | 94.0 ± 1.2 | 46.0 ± 4.7 | 42.7 ± 0.3 | 70.6 ± 13.6 | 88.9 ± 5.3 |
| hopper-medium-expert | 100.3 ± 6.9 | 60.6 ± 32.5 | 52.1 ± 27.7 | 108.6 ± 6.0 | 50.6 ± 25.3 | 44.9 ± 8.1 | 111.0 ± 1.2 | 102.0 ± 10.1 |
| walker2d-medium-expert | 109.3 ± 0.5 | 77.4 ± 27.9 | 3.9 ± 2.8 | 109.7 ± 0.6 | 22.1 ± 44.5 | 96.5 ± 9.1 | 109.7 ± 0.3 | 110.5 ± 0.3 |
| halfcheetah-expert | 96.4 ± 0.6 | 81.3 ± 21.8 | 2.2 ± 5.4 | 92.7 ± 2.5 | 92.7 ± 0.6 | 92.9 ± 0.6 | 97.5 ± 1.8 | 96.3 ± 0.9 |
| hopper-expert | 112.0 ± 0.3 | 62.5 ± 29.0 | 26.2 ± 14.0 | 105.3 ± 8.1 | 54.6 ± 21.0 | 110.5 ± 0.5 | 105.4 ± 5.9 | 109.5 ± 4.1 |
| walker2d-expert | 109.4 ± 0.6 | 62.4 ± 3.2 | -0.3 ± 0.3 | 109.0 ± 0.4 | 106.8 ± 6.8 | 108.4 ± 0.4 | 109.0 ± 0.4 | 110.3 ± 0.4 |
| Overall | 76.1 ± 3.5 | 53.3 ± 16.3 | 26.4 ± 12.6 | 62.6 ± 4.0 | 38.8 ± 10.0 | 50.4 ± 32.7 | 70.5 ± 2.5 | 68.1 ± 3.5 |

### 4.1 Comparison Experiments

We first compare SCORE and other baselines on the D4RL-MuJoCo datasets. The experimental results in Table 1 show that SCORE obtains promising results for nearly all dataset settings. For random datasets of the lowest quality, SCORE is the only model-free algorithm that matches the performance of model-based algorithms. We can see that SCORE also works well on the medium-quality datasets. When learning from data generated by a medium-level policy, SCORE's performance is comparable to that of the expert policy. These results demonstrate the superiority of the pessimism principle in offline RL. For high-quality datasets, e.g., the medium-expert and expert datasets, SCORE outperforms model-based methods and is on par with the state-of-the-art model-free methods. Besides, we find that SCORE’s performance is in line with the theory, i.e., *it improves along with the quality of the dataset* (how well the dataset covers the trajectory induced by the optimal policy). The overall performance shows a considerable improvement over the state-of-the-art algorithms (CQL [kumarConservativeQlearningOffline2020] and TD3-BC [fujimotoMinimalistApproachOffline2021]).
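The "Overall" row of Table 1 appears to be the unweighted mean of the fifteen per-task scores. The short check below assumes that reading, and assumes that each table entry is a mean followed by a spread; it uses only the mean values for SCORE, CQL, and TD3-BC, listed in the table's row order.

```python
# Per-task mean scores in the row order of Table 1 (random, medium-replay,
# medium, medium-expert, expert; three environments each).
TASK_SCORES = {
    "SCORE": [29.1, 31.3, 3.7, 48.0, 79.9, 84.8, 55.2, 99.6, 89.2,
              92.6, 100.3, 109.3, 96.4, 112.0, 109.4],
    "CQL": [21.7, 8.1, 0.5, 47.2, 95.6, 85.3, 49.2, 62.7, 83.3,
            70.6, 111.0, 109.7, 97.5, 105.4, 109.0],
    "TD3-BC": [10.6, 8.6, 1.5, 44.8, 57.8, 81.9, 47.8, 69.1, 81.3,
               88.9, 102.0, 110.5, 96.3, 109.5, 110.3],
}


def overall(scores):
    """Unweighted mean over tasks, rounded to one decimal like the table."""
    return round(sum(scores) / len(scores), 1)


for name, scores in TASK_SCORES.items():
    print(name, overall(scores))
# SCORE 76.1
# CQL 70.5
# TD3-BC 68.1
```

The recomputed means match the reported "Overall" entries for these three methods, which gives a gap of 5.6 points between SCORE and CQL and 8.0 points between SCORE and TD3-BC.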

We also conduct experiments on a more challenging task suite, D4RL-Adroit, where the datasets have very narrow distributions and the data quality is highly unstable. Though such issues pose significant difficulties for stable uncertainty estimation, SCORE still performs well compared with other methods. We refer to Appendix E.1 for more details of the experiments and Appendix F.1 for the experimental results on D4RL-Adroit datasets.

### 4.2 Visualization and Analysis of Uncertainty

To gain further insight into the uncertainty estimated by SCORE, we visualize the uncertainty. Specifically, we apply the Q functions trained on the medium-replay dataset to quantify uncertainty for different samples. We draw the in-distribution samples from the medium-replay dataset, and the OOD samples come from the expert dataset. For visualization purposes, we reduce the features of these samples to two dimensions using t-distributed stochastic neighbor embedding (t-SNE). Figure 2 shows the contour plot of the uncertainty on the two-dimensional feature space, in which the white dots denote in-distribution samples and the red dots correspond to OOD samples.

Although there are some overlaps between the two types of samples, the in-distribution samples (white) are more concentrated in regions with low uncertainty (the dark regions). On the other hand, the OOD samples (red) loosely distribute in regions with higher uncertainty (the bright regions). We can also see that the in-distribution and OOD samples are more easily distinguishable on halfcheetah, while the opposite holds for hopper and walker2d. We point out that this correlates with the performance observed in the comparison experiments (see Table 1). On halfcheetah, the performance on the medium-replay dataset is substantially lower than on the expert dataset, while it is much closer on hopper and walker2d. We suggest that this phenomenon reflects the property of the dataset, where the medium-replay datasets of hopper and walker2d have better coverage of the state-action pairs induced by the expert policy. Thus, algorithms are more likely to learn high-level policies from these medium-quality datasets.

### 4.3 Ablation Studies

Pessimism. Pessimism is the core component of the proposed method. Figure 3 shows the difference between SCORE and SCORE without pessimism (all other hyper-parameters remain unchanged; only the uncertainty penalty is removed) over a longer training period. We observe that removing pessimism may cause training instability or even severe degradation. This phenomenon is related to the Q-value, and the agent’s performance is greatly affected when the Q-value jitters or explodes. The experimental results indicate that even with a good initialization (via behavior cloning), spurious correlations in offline RL can still be problematic. In contrast, pessimism is able to reduce spurious correlations and guarantees the strong empirical performance of SCORE.

Behavior Cloning. As pointed out in previous studies [fujimotoOffpolicyDeepReinforcement2019, levineOfflineReinforcementLearning2020d], uncertainty-based methods cannot effectively avoid action distributional shifts. Meanwhile, estimating calibrated uncertainty for neural networks is a challenging task. Figure 4 and Figure 5 show the difference between SCORE and SCORE without BC. On datasets collected by a single policy (e.g., the medium datasets), the importance of BC goes without saying. These datasets poorly cover the state-action space, so action distributional shift is more likely to appear. By removing the regularizer, the agent fails to stay in well-supported regions. What’s worse, this further affects the uncertainty estimation and makes it difficult for the agent to learn effectively. On the other hand, on datasets collected by different levels of policies (e.g., the medium-replay datasets), SCORE and SCORE without BC have similar final performance. In this case, behavior cloning serves to provide a good initialization and stabilizes the training process.

## 5 Related Work

Most existing works in offline RL focus on defending against OOD actions, but as shown in Section 3.1, in-distribution samples can also cause detrimental effects. The *spurious correlation* arising from insufficient information of the underlying model is the main reason. To deal with this problem, most theoretical works impose various assumptions on the sufficient coverage of the dataset, e.g., the ratio between the visitation measure of the target policy and that of the behavior policy to be upper bounded uniformly over the state-action space [xie2019towards, nachum2019dualdice, jiang2020minimax, duan2020minimax, zhang2020gendice, yin2021near], or the concentrability coefficient to be upper bounded [scherrer2015approximate, chen2019information, liao2020batch, xie2021batch]. Recently, without assuming sufficient coverage of the dataset, jinPessimismProvablyEfficient2021a established a data-dependent upper bound on the suboptimality using pessimism. Our work adds to recent works by extending jinPessimismProvablyEfficient2021a to regularized MDPs.

The majority of offline RL algorithms fall into two categories, i.e., policy-constrained methods and value-penalized methods. Policy-constrained methods avoid OOD actions by restricting the hypothesis space of the policy. For example, fujimotoOffpolicyDeepReinforcement2019 and ghasemipourEmaqExpectedmaxQlearning2021 only consider actions proposed by the estimated behavioral policy. Alternatively, some methods [wuBehaviorRegularizedOffline2019b, kumarStabilizingOffpolicyQlearning2019, kostrikovOfflineReinforcementLearning2021a, nairAcceleratingOnlineReinforcement2020] reformulate the policy optimization problem as a constrained optimization problem to keep the learned policy close to the behavioral policy. More recently, fujimotoMinimalistApproachOffline2021 provides a simple yet effective solution by directly using the behavioral cloning loss.

On the other hand, value-penalized methods penalize the value of OOD actions to steer the policy towards well-supported regions. kumarConservativeQlearningOffline2020 penalizes the actions generated by the learned policy via a value regularization term. yuMopoModelbasedOffline2020 and kidambiMorelModelbasedOffline2020 learn the environmental model and then use the uncertainty in model predictions to penalize the action-values. However, in a subsequent paper [yuCOMBOConservativeOffline2021b], the authors claim that estimating uncertainty for complex models is too difficult and revert to the regularization method.

Most of the current uncertainty-based approaches belong to model-based methods. Since the model is learned in a supervised manner, it provides much more stable uncertainty estimation. However, as shown in the comparison experiments in Section 4, their performance relies heavily on sufficient coverage of the state-action space, which in many cases is impractical. In contrast, SCORE utilizes bootstrapped ensembles to estimate uncertainty, avoiding model learning while still providing reliable uncertainty estimations. A recent work, UWAC [wuUncertaintyWeightedActorCritic2021], proposes to use Monte Carlo dropout (MC dropout) to estimate uncertainty and perform weighted updates to the critics and the policy. While this method is also model-free, it relies on a strong policy-constrained method. More importantly, *the dropout method does not converge with increasing data* [osbandRandomizedPriorFunctions2018b]. In contrast, our method reduces to a pure uncertainty-based method when the regularization decays to zero, and the uncertainty decreases to zero with more data, enjoying a solid theoretical foundation.

## 6 Conclusion

In this work, we emphasize that spurious correlations stemming from insufficient information about the environment are a core problem in offline RL. We propose a simple and principled algorithm named SCORE to address this problem. The effectiveness of SCORE is verified by both theoretical analyses and empirical studies.

Our work complements recent theoretical studies in offline RL. It suggests that pessimism is not only provably efficient but also helps to improve performance in practice. We remark that spurious correlations are not always detrimental. How to avoid over-pessimism and how to further improve the policy in the real environment with a small number of online interactions are the main focuses of our future research.

## References

## Appendix A Proof of Theorem 3.3

###### Proof.

Before we prove the theorem, we first introduce the following useful lemmas.

###### Lemma A.1 (Suboptimality Decomposition).

###### Proof.

See proof of Lemma 4.2 in cai2020provably for a detailed proof. ∎

###### Lemma A.2 (Policy Improvement).

It holds for any that

###### Proof.

See Section B.2 for a detailed proof. ∎

###### Lemma A.3 (Pessimism).

###### Proof.

See proof of Lemma 5.1 in jinPessimismProvablyEfficient2021a for a detailed proof. ∎

Now we prove the theorem. By Lemma A.1, we have

(14) |

where the last inequality comes from Lemma A.2. Further, by telescoping the sum of on the right-hand side of equation A and the non-negativity of the KL divergence, it holds with probability at least that

(15) |

where the last inequality comes from Lemma A.3. Now, by taking , where

combining equation A, with probability at least , we have

Here is the intrinsic uncertainty defined in equation 12. By the fact that , we conclude the proof. ∎

## Appendix B Proof of Lemmas

### B.1 Proof of Lemma 3.1

###### Proof.

By plugging the definition of and the linear parameterization into equation 11, we have

(16) |

It holds for any that

(17) |

Similarly, we have

(18) |

for any . Thus, by combining equation 16, equation B.1, and equation 18, the stationary point of satisfies

(19) |

Now, by equation B.1, we have

which concludes the proof. ∎
