1 Introduction
Offline RL has gained great interest since it enables RL algorithms to scale to many real-world applications, e.g., recommender systems [18, 17], autonomous driving [25], and healthcare [6], while avoiding costly online trial and error. In the offline setting, Model-Based Reinforcement Learning (MBRL) is an important direction that already delivers significant offline learning performance [26, 3]. Moreover, learning models is also useful for training transferable policies [13, 3]. Therefore, there are a growing number of studies on learning world-models, from supervised prediction methods [7] to adversarial learning methods [24]. However, in various tasks there commonly exists an underlying causal structure among the states and actions. This causal structure can support learning a policy with better generalization ability. For example, in driving a car where the speed depends on the gas and brake pedals but not the wiper, a plain world-model trained on rainy days always predicts deceleration when the wiper is turned on and thus cannot generalize to other weather situations. By contrast, a causal world-model can avoid the spurious dependence between the wiper and deceleration (both caused by the rain) and hence generalize well in unseen weather. In fact, empirical evidence also indicates that inducing the causal structure is important for improving the generalization of RL [5, 21, 2, 4, 28], but little attention has been paid to causal world-model learning. In this paper, we first provide theoretical support for the statement above: we show that a causal world-model can outperform a plain world-model for offline RL. From the causal perspective, we divide the variables in states and actions into two categories, namely, causal variables and spurious variables, and then formalize the procedure that learns a world-model from raw measured variables.
Based on this formalization, we quantify the influence of spurious dependences on the generalization error bound and thus prove that incorporating the causal structure can help reduce this bound. We then propose a practical offline MBRL algorithm with causal structure, FOCUS, to illustrate the feasibility of learning causal structure in offline RL. An essential step of FOCUS is to learn the causal structure from data and then use it properly. Learning causal structure from data is known as causal discovery [19]. There are challenges in utilizing causal discovery methods in RL, but also specific properties of the data that causal discovery may benefit from. Specifically, we extend the PC algorithm, which infers causal structure from conditional independence relations, to incorporate the constraint that the future cannot cause the past. Consequently, we can reduce the number of conditional independence tests and determine the causal direction. We further adopt kernel-based independence testing [27], which can be applied to continuous variables without assuming a functional form between the variables or a particular data distribution. In summary, this paper makes the following key contributions:


It shows theoretically that a causal world-model outperforms a plain world-model in offline RL, in terms of the generalization error bound.

It proposes a practical algorithm, FOCUS, and illustrates the feasibility of learning and using a causal world-model for offline MBRL.

Our experimental results verify the theoretical claims, showing that FOCUS outperforms baseline models and other online causal MBRL algorithms in the offline setting.
2 Related Work
Causal Structure Learning in Online MBRL. Although some methods have been proposed for learning causal structure in online RL, they all depend on interactions and have no mechanism for transfer to the offline setting. The core of online causal structure learning is to evaluate the performance or other metrics of a structure through interactions and choose the best one as the causal structure. [4] parameterizes the causal structure in the model and learns a policy for each possible causal structure by minimizing the log-likelihood of the dynamics. Given the learned policies, it regresses the policy returns on the causal structures and then chooses the structure corresponding to the highest policy return. [9]
(LNCM) samples causal structures from a Multivariate Bernoulli distribution and scores those structures according to the log-likelihood on interventional data. Based on the scores, it calculates gradients for the parameters of the Multivariate Bernoulli distribution and updates the parameters iteratively.
[2] utilizes the speed of adaptation to learn the causal direction, but does not provide a complete algorithm for learning causal structure. By contrast, FOCUS utilizes a causal discovery method for causal structure learning, which relies only on the collected data to obtain the causal structure. Causal Discovery Methods. The data in RL is more complex than that in traditional causal discovery, where the data is often discrete and the causal relationship is under a simple linear assumption. In recent years, practical methods have been proposed for causal discovery with continuous variables, which is the case we are concerned with in this paper. [20] is based on explicit estimation of the conditional densities or their variants, exploiting the difference between the characteristic functions of these conditional densities. Estimating the conditional densities or related quantities is difficult, which deteriorates the testing performance, especially when the conditioning set is not small.
[14] discretizes the conditioning set into a set of bins and transforms conditional independence (CI) into unconditional independence in each bin. Inevitably, due to the curse of dimensionality, as the conditioning set becomes larger, the required sample size increases dramatically. By contrast, the KCI test
[27] is a popular and widely-used conditional independence test, in which the test statistic can be easily calculated from the kernel matrices and its null distribution can be estimated conveniently.
3 Preliminaries
Markov Decision Process (MDP). We describe the RL environment as an MDP given by a five-tuple (S, A, P, r, γ) [1], where S is a finite set of states; A is a finite set of actions; P is the transition function, with P(s'|s, a) denoting the next-state distribution after taking action a in state s; r is a reward function, with r(s, a) denoting the expected immediate reward gained by taking action a in state s; and γ ∈ [0, 1) is a discount factor. An agent chooses actions a according to a policy π, which updates the system state, yielding a reward. The agent's goal is to maximize the expected cumulative return by learning a good policy. The state-action value Q^π(s, a) of a policy π is the expected discounted reward of executing action a from state s and subsequently following policy π. Offline Model-based Reinforcement Learning. Model-based reinforcement learning typically involves learning a dynamics model of the environment by fitting it with a maximum-likelihood estimate on the trajectory data collected by running some exploratory policy [23, 12]. In the offline RL setting, where we only have access to data collected by multiple policies, recent techniques build on the idea of pessimism (regularizing the original problem based on how confident the agent is about the learned model) and have achieved better sample complexity than model-free methods on benchmark domains [10, 26].
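To make these definitions concrete, the state-action value of a policy can be computed for a tiny tabular MDP by iterating the Bellman expectation equation (a minimal sketch; the MDP's numbers are ours and purely illustrative):

```python
import numpy as np

# A tiny tabular MDP (2 states, 2 actions); the numbers are illustrative.
# P[s, a, s'] is the transition probability, R[s, a] the expected reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9
pi = np.array([[1.0, 0.0],   # deterministic policy: action 0 in state 0,
               [0.0, 1.0]])  #                       action 1 in state 1

# Iterate the Bellman expectation equation until convergence:
# Q(s, a) = r(s, a) + gamma * sum_{s'} P(s'|s, a) * sum_{a'} pi(a'|s') Q(s', a')
Q = np.zeros((2, 2))
for _ in range(1000):
    V = (pi * Q).sum(axis=1)   # state value V(s') under pi
    Q = R + gamma * (P @ V)
print(Q)
```

The same fixed-point iteration underlies policy evaluation in the error-bound analysis of Section 4.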
4 Theory
In this section, we provide theoretical evidence for the advantages of a causal world-model over a plain world-model, showing that utilizing a good causal structure can reduce the generalization error bounds in offline RL. Specifically, we incorporate the causal structure into the generalization error bounds, which include the model prediction error bound and the policy evaluation error bound. The full proofs can be found in Appendix A. In this paper, we focus on the linear case.
4.1 Model Prediction Error Bound
In this subsection, we formalize the procedure that learns a plain world-model and then connect the model prediction error bound with the number of spurious variables in the plain world-model. Specifically, we point out that the spurious variables turn model learning into an ill-posed problem with multiple optimal solutions, which consequently increases the model prediction error bound. Since model learning can be viewed as a supervised learning problem, we derive the model prediction error bound in a supervised learning framework.
Preliminary. Consider a data distribution from which we have samples (X, y). The goal is to learn a linear function to predict y given X. From the causal perspective, y is generated from only its causal parent variables rather than all the variables in X. Therefore we can split the variables in X into two categories:

X_c represents the causal parent variables of y, that is, y = w* · X_c + ε, where w* is the ground truth and ε is a zero-mean noise variable.

X_s represents the spurious variables, which do not cause y but have strong relatedness with X_c in some biased data sets. In other words, X_c can be predicted from X_s with small error, i.e., X_c = β · X_s + ε_s, where ε_s is the regression error with zero mean and small variance.
For clarity of representation, we use m_c ⊙ X (where ⊙ represents element-wise multiplication) to replace X_c, with the binary mask m_c recording the indices of X_c in X. Correspondingly, we also use m_s ⊙ X to replace X_s. According to the definition of X_c, we have y = w* · (m_c ⊙ X) + ε, where w* is the global optimal solution of the optimization problem
(1) 
The above problem is easy if the data is uniformly sampled. However, in the offline setting we only have biased data sampled by a given policy, for which the optimization objective is
(2) 
Problem (2) has multiple optimal solutions due to the strong linear relatedness of X_c and X_s in the biased data, which is proved in Lemma 4.1.
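This ill-posedness can be reproduced with a few lines of synthetic data (a minimal sketch under our own assumptions: one causal variable `x_c` with true weight 2, and one spurious variable `x_s` that closely tracks it in the biased data; all numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x_c = rng.normal(size=n)                   # causal parent of y
x_s = x_c + 1e-3 * rng.normal(size=n)      # spurious variable: tracks x_c in biased data
y = 2.0 * x_c + 0.1 * rng.normal(size=n)   # y is generated from x_c only

X = np.column_stack([x_c, x_s])

# Near-collinear design: ordinary least squares is ill-conditioned, and the
# total weight can be split between x_c and x_s almost arbitrarily.
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Ridge regularization selects one particular solution among the near-optima,
# assigning substantial weight to the spurious variable.
lam = 1.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
print("OLS:", w_ols, "ridge:", w_ridge)
```

In this sketch the ridge solution fits the biased data well but leans on `x_s`; once the sampling policy changes and `x_s` no longer tracks `x_c`, its predictions degrade, which is the gap that Theorem 4.4 bounds.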
Lemma 4.1.
The most popular method for solving such an ill-posed problem is to add a regularization term on the parameters [15]:
(3) 
where λ is a coefficient. Problem (3) takes the form of ridge regression, which provides a closed-form solution via the Hoerl-Kennard formula [8]. In the following, we first present the solution under Problem (3) in Lemma 4.2, and then present the model prediction error bound in Theorem 4.4. For ease of understanding, we provide a simple version where the dimensions of X_c and X_s are both one. Lemma 4.2.
Based on Lemma 4.2, we can see that the smaller the variance of the regression error (meaning that X_c and X_s have stronger relatedness in the training dataset), the larger the weight assigned to the spurious variable. We also have its bound:
Proposition 4.3.
Given the solution as in Formula 4, its bound is:
Theorem 4.4 (Spurious Theorem).
Theorem 4.4 shows that


The upper bound of the model prediction error increases for each spurious variable included in the model.

When X_c and X_s have stronger relatedness, the increase of the model prediction error bound caused by the spurious variables is larger.
4.2 Policy Evaluation Error Bound
Although in most cases an accurate model ensures good performance in MBRL, the model error bound is still an indirect evaluation compared to the policy evaluation error bound for MBRL. In this subsection, we apply the spurious theorem (Theorem 4.4) to offline MBRL and provide the policy evaluation error bound in terms of the number of spurious variables. Supposing that the state value and the reward are bounded, with maxima denoted respectively, we have the policy evaluation error bound in Theorem 4.5.
Theorem 4.5 (RL Spurious Theorem).
Given an MDP with state dimension d_s and action dimension d_a, and a data-collecting policy, denote the true transition model and the learned model, where the learned model predicts each dimension from its spurious variable set and causal variables. Denote the policy value of a policy under the learned model and, correspondingly, under the true model. For an arbitrary policy with bounded divergence from the data-collecting policy, we have the policy evaluation error bound:
where the spurious variable density is defined as the ratio of spurious variables among all input variables.
Theorem 4.5 shows the relation between the policy evaluation error bound and the spurious variable density, which indicates that:


When we use a non-causal model in which all the spurious variables are included as inputs, the spurious variable density reaches its maximum value. By contrast, under the optimal causal structure, it reaches its minimum value.

The density of spurious variables and the correlation strength of the spurious variables both influence the policy evaluation error bound. However, if we exclude all the spurious variables, the correlation strength of the spurious variables has no effect.
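In code, excluding spurious variables amounts to masking the inputs of each output dimension down to its causal parents. The following is a minimal sketch of this idea (our own construction: a linear per-dimension regressor stands in for whatever model class the MBRL algorithm actually uses, and `G` is a binary matrix marking causal parents):

```python
import numpy as np

class CausalMaskedModel:
    """Per-dimension dynamics predictor in which output dimension j is fit
    only on its causal parents G[:, j]; spurious inputs are masked out.
    Sketch only: a linear regressor replaces the real model class."""

    def __init__(self, G):
        self.G = np.asarray(G).astype(bool)   # (d_in, d_out) causal structure
        self.W = None

    def fit(self, X, Y):
        d_in, d_out = self.G.shape
        self.W = np.zeros((d_in, d_out))
        for j in range(d_out):
            parents = self.G[:, j]
            # least squares on causal parents only
            w, *_ = np.linalg.lstsq(X[:, parents], Y[:, j], rcond=None)
            self.W[parents, j] = w
        return self

    def predict(self, X):
        return X @ self.W
```

Because non-parent columns of `W` are pinned to zero, the correlation strength of spurious variables in the training data cannot leak into predictions, mirroring the case where the density of spurious variables is zero.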
5 Algorithm
In the theory section, we provided theoretical evidence for the advantages of a causal world-model over a plain world-model. Besides lower prediction errors, a causal world-model also matters for better decision-making in RL. Even when spurious variables do not increase prediction errors (e.g., when spurious variables only disturb unreachable states), a wrong causal relation still leads to terrible decision-making. For example, rooster crowing can predict the rise of the sun, and forcing a rooster to crow to obtain a sunny day would seem a natural decision under the wrong causal belief that rooster crowing causes the sun to rise. In this example, predicting the rise of the sun from rooster crowing is a zero-error world-model, since rooster crowing on a rainy day is an unreachable state, but such a world-model leads to terrible decision-making. Having demonstrated the necessity of a causal world-model in offline RL, in this section we propose a practical offline MBRL algorithm, FOCUS, to illustrate the feasibility of learning causal structure in offline RL. The main idea of FOCUS is to take advantage of causal discovery methods and extend them to offline MBRL. Compared to previous online causal structure learning methods, a causal discovery method brings the following advantages in the offline setting:


Robust: the learned structure is not influenced by performance on test data, which is artificially selected and biased.

Efficient: the causal discovery method directly returns the structure by independence testing, without any network training procedure, and thus saves the samples otherwise needed for network training.
5.1 Preliminary
Conditional Independence Test. Independence and conditional independence (CI) play a central role in causal discovery [16, 11]. Generally speaking, a CI relationship allows us to drop the conditioning variable when constructing a probabilistic model for the target. There are multiple CI testing methods for various conditions, each providing the correct conclusion only under its corresponding assumptions. The Kernel-based Conditional Independence test (KCI test) [27] was proposed for continuous variables, without assuming a functional form between the variables or a particular data distribution, which is the case we are concerned with in this paper. Generally, the hypothesis that variables are conditionally independent is rejected when the p-value is smaller than the preassigned significance level, say, 0.05. In practice, we can tune the significance level rather than fixing it. Conditional Variables. Besides the specific CI test method, the conclusion of conditional independence testing also depends on the conditional variable; that is, different conditional variables can lead to different conclusions. Taking a triple (X, Z, Y) as an example, there are three typical structures, namely Chain, Fork, and Collider, as shown in Fig 2, where whether or not we condition on Z significantly influences the testing conclusion.


Chain (X → Z → Y): causation exists between X and Y, but conditioning on Z leads to independence.

Fork (X ← Z → Y): no causation exists between X and Y, but not conditioning on Z leads to non-independence.

Collider (X → Z ← Y): no causation exists between X and Y, but conditioning on Z leads to non-independence.
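These three structures can be checked numerically; for instance, in a Collider, two independent variables become dependent once we condition on their common child (a synthetic sketch of ours, using partial correlation as a stand-in for a full CI test):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
y = rng.normal(size=n)          # x and y are independent by construction
z = x + y                       # z is their common child (Collider)

# Unconditional correlation: near zero, as expected.
r_xy = np.corrcoef(x, y)[0, 1]

# "Conditioning" on z via regression residuals: given z, x and y become
# strongly (negatively) correlated.
rx = x - z * (np.cov(x, z)[0, 1] / np.var(z))
ry = y - z * (np.cov(y, z)[0, 1] / np.var(z))
r_xy_given_z = np.corrcoef(rx, ry)[0, 1]
print(r_xy, r_xy_given_z)
```

This is exactly why the conditioning set must exclude common children: conditioning on z would make the test report a spurious dependence between x and y.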
5.2 Building the Causal Structure from the KCI test
Applying the Independence Test in RL. Based on the preliminaries, given two target variables and a condition variable, the KCI test returns a p-value, which measures the plausibility that the two variables are conditionally independent given the condition. In other words, a small p-value implies that the two variables have causation given the condition. To transform this implicit probability value into an explicit conclusion on whether causation exists, we design a threshold on the p-value: a value above the threshold represents independence, and a value below it represents that causation exists. Details of choosing the threshold can be found in Appendix B.1. In model learning for RL, the variables are the states and actions of the current and next timesteps, and the causal structure refers to whether a variable at timestep t (e.g., the i-th dimension of the state) causes another variable at timestep t+1 (e.g., the j-th dimension). With the KCI test, we get the causal relation for each variable pair and then form the causal structure matrix G, where G_{ij} is the element in row i and column j of G.
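The construction of G can be sketched as follows (our own sketch: for a self-contained illustration we substitute a linear partial-correlation test for the KCI test used by the paper, and we condition on the other variables at timestep t, following the principle justified later in this subsection):

```python
import math
import numpy as np

def ci_pvalue(x, y, z):
    """Approximate p-value for X independent of Y given Z, via partial
    correlation and the Fisher z-transform. A linear stand-in for KCI."""
    def residual(v):
        beta, *_ = np.linalg.lstsq(z, v, rcond=None)
        return v - z @ beta
    rx, ry = residual(x), residual(y)
    r = float(np.clip(np.corrcoef(rx, ry)[0, 1], -0.9999, 0.9999))
    stat = math.sqrt(len(x) - z.shape[1] - 3) * math.atanh(r)
    return math.erfc(abs(stat) / math.sqrt(2))   # two-sided normal p-value

def causal_matrix(s_t, s_next, alpha=0.05):
    """G[i, j] = 1 iff dimension i at time t is judged a cause of
    dimension j at time t+1. Conditioning set: the other variables at
    time t (never the variables at time t+1)."""
    d_in, d_out = s_t.shape[1], s_next.shape[1]
    G = np.zeros((d_in, d_out), dtype=int)
    for i in range(d_in):
        for j in range(d_out):
            z = np.delete(s_t, i, axis=1)    # other t-variables only
            p = ci_pvalue(s_t[:, i], s_next[:, j], z)
            G[i, j] = int(p < alpha)         # small p => dependence => edge
    return G
```

A real implementation could call an off-the-shelf KCI test in place of `ci_pvalue`; the thresholding and matrix construction are unchanged.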
Choosing the Conditional Variable in RL. As noted in the preliminaries, improper conditional variables can reverse the conclusion of independence testing. Therefore we have to carefully design the conditional variable set, which should include the intermediate variable of a Chain and the common parent variable of a Fork, but not the common child variable of a Collider. Traditionally, an independence test has to traverse all possibilities of the conditional variable set before giving a conclusion, which is too time-consuming. However, in RL we can reduce the number of conditional independence tests by incorporating the constraint that the future cannot cause the past. This constraint limits the possible conditional variable sets to a small number, so we can discuss each possible conditional variable set case by case. Before the discussion, we exclude two kinds of situations for simplicity:


Impossible situations. We exclude some impossible situations, as in Fig 3 (i), by the temporal property of data in RL. Specifically, causation cannot point from a variable at timestep t+1 to a variable at timestep t, because an effect cannot happen before its cause.

Compound situations. We only discuss the basic situations and exclude the compound situations, e.g., Fig 3 (j), which is a compound of (a) and (c). In such compound situations, the target variables have direct causation (otherwise it would not be a compound situation), and the independence testing can only misjudge independence as non-independence, not non-independence as independence.
We list all possible situations of the target variables and the condition variable in the world-model, as shown in Fig 3. Based on the causal discovery knowledge in the preliminaries, we analyze the basic situations in the following:


Top Line: In (a) and (b), whether or not we condition on the third variable does not influence the conclusion about causation. In (c), although the middle variable acts as an intermediate variable in a Chain, and conditioning on it leads to a conclusion of independence between the two target variables, the intermediate variable is itself included in the causal parent set when its own causal relations are tested, which offsets the influence of excluding the direct edge. In (d), conditioning on the third variable is necessary for reaching the correct conclusion, since it is the common causal parent in a Fork structure.

Bottom Line: In (e) and (f), whether or not we condition on the third variable does not influence the conclusion about causation. In (g), not conditioning on the third variable is necessary for reaching the correct conclusion, since it is the common child in a Collider structure. In (h), although the third variable acts as an intermediate variable in a Chain and not conditioning on it leads to a conclusion of non-independence between the target variables, including the upstream variable in the causal parent set induces no problem, since it does indirectly cause the downstream variable.
Based on the above case-by-case discussion, we conclude the following principle for choosing conditional variables in RL: (1) condition on the other variables at timestep t; (2) do not condition on the variables at timestep t+1.
5.3 Combining Learned Causal Structure with An Offline MBRL Algorithm
We combine the learned causal structure with an offline MBRL algorithm, MOPO [26], to form a causal offline MBRL algorithm, as in Fig 1. The complete learning procedure is shown in Algorithm 1, and Algorithm 2 can be found in Appendix B.2. Notice that our causal model learning method could in principle be combined with any offline MBRL algorithm. More implementation details and hyperparameter values are summarized in Appendix
B.1.
6 Experiments
To demonstrate that (1) learning a causal world-model is feasible in offline RL and (2) a causal world-model can outperform a plain world-model and related online methods in offline RL, we evaluate (1) causal structure learning and (2) policy learning on the Toy Car Driving and MuJoCo benchmarks. Toy Car Driving is a simple and typical environment that is convenient for evaluating the accuracy of the learned causal structure, because we can design the causation between its variables. MuJoCo is the most common benchmark for investigating performance in continuous control, where each dimension of the state has a specific meaning and is highly abstract. We evaluate FOCUS on the following criteria: (1) the accuracy, efficiency, and robustness of causal structure learning; (2) the policy return and generalization ability in offline MBRL. Baselines. We compare FOCUS with the state-of-the-art offline MBRL algorithm, MOPO, and with online RL algorithms that also learn causal structure, in two respects: causal structure learning and policy learning. (1) MOPO [26] is a well-known and widely-used offline MBRL algorithm, which outperforms standard model-based RL algorithms and prior state-of-the-art model-free offline RL algorithms on existing offline RL benchmarks. The main idea of MOPO is to artificially penalize rewards by the uncertainty of the dynamics, which avoids the distributional-shift issue. MOPO can be seen as the control condition with a plain world-model. (2) Learning Neural Causal Models from Unknown Interventions (LNCM) [9] is an online MBRL algorithm whose causal structure learning method can be transferred to the offline setting with a simple adjustment. We take LNCM as an example to show that an online method cannot be directly transferred to offline RL. Environment. Toy Car Driving. Toy Car Driving is a typical RL environment where the agent controls its direction and velocity to accomplish various tasks, including avoiding obstacles and navigating.
The information of the car, e.g., position, velocity, direction, and acceleration, forms the state and action of an MDP. In this paper, we use a 2D Toy Car Driving environment where the task of the car is to arrive at a destination (the visualization of states and a detailed description can be found in Appendix C.1). The state includes the direction, the velocity (a scalar), the velocity on the x axis, the velocity on the y axis, and the position. The action is the steering angle. The visualization of the causal graph can be found in Appendix C.1. This causal structure is designed to demonstrate how a variable becomes spurious for others and to highlight the influence of such variables on model learning. For example, when the velocity remains stationary due to an imperfect sampling policy, the velocity components have strong relatedness and one can represent the other; a variable then becomes a spurious input for predicting another despite not being its causal parent. By contrast, when the data is uniformly sampled with various velocities, this spuriousness does not exist. MuJoCo. MuJoCo [22] is a general-purpose physics engine and a well-known RL environment. MuJoCo features multi-joint dynamics with contact, where the state variables represent the positions, angles, and velocities of the agent. The state dimension ranges from 3 to several dozen. The limited dimensionality and the clear meaning of each variable make causal structure learning convenient. Offline Data. We prepare three offline data sets, Random, Medium, and Medium-Replay, for Car Driving and MuJoCo. Random means the data is collected by random policies. Medium means the data is collected by a fixed but not well-trained policy, and is the least diverse. Medium-Replay is the collection of data sampled during training of the Medium policy, and is the most diverse. A heat map of the data diversity is shown in Appendix C.1.
Index  FOCUS  LNCM  FOCUS(kci)  FOCUS(condition)
Accuracy  0.993  0.52  0.62  0.65
Robustness  0.001  0.025  0.173  0.212
Efficiency (samples)
6.1 Causal Structure Learning
We compare FOCUS with the baselines on causal structure learning, using accuracy, efficiency, and robustness as criteria. Accuracy is evaluated by viewing structure learning as a classification problem, where causation is the positive class and independence the negative class. Efficiency is evaluated by the number of samples needed to obtain a stable structure. Robustness is evaluated by the variance over multiple experiments. The results in Table 1 show that FOCUS surpasses LNCM in accuracy, robustness, and efficiency of causal structure learning. Note that LNCM also has a low variance, but only because it predicts roughly the same probability of causation for every variable pair, which makes its apparent robustness meaningless.
Env  Car Driving  MuJoCo (Inverted Pendulum)
  Random  Medium  Replay  Random  Medium  Replay
FOCUS
MOPO
LNCM
6.2 Policy Learning
Policy Return. We evaluate the performance of FOCUS and the baselines on the two benchmarks with three different and typical offline data sets. The results in Table 2 show that FOCUS outperforms the baselines by a significant margin on most data sets. On Random, FOCUS has the most significant performance gains over the baselines in both benchmarks, owing to the accuracy of causal structure learning in FOCUS. By contrast, on Medium-Replay, the performance gains of FOCUS are smallest, since the high data diversity in Medium-Replay leads to weak relatedness of the spurious variables, which verifies our theory. On Medium, the results in the two benchmarks differ. In Car Driving, the relatively high score of LNCM does not mean that LNCM is the best; rather, all three fail. The failure indicates that extremely biased data makes even the causal model fail to generalize. However, the success of FOCUS on the Inverted Pendulum indicates that causal world-models depend less on data diversity, since FOCUS can still reach high scores on such a biased dataset where the baselines fail. Here we only provide results for the Inverted Pendulum and not all MuJoCo environments, due to characteristics of the robot control, specifically the frequency of observations, for which we present a detailed description in Appendix C.1.
Generalization Ability. We compare performance on offline data sets produced by mixing Medium-Replay and Medium at different ratios; the value on the x-axis is the fraction of Medium data, with the remainder drawn from Medium-Replay. The results in Fig 5 show that FOCUS still performs well with a small ratio of Medium-Replay data, while the baseline performs well only with a large ratio, which indicates that FOCUS is less dependent on the diversity of data.
6.3 Ablation Study
To evaluate the contribution of each component, we perform an ablation study for FOCUS. The results in Table 1 show that the KCI test and our principle for choosing conditional variables contribute to both the accuracy and the robustness of causal structure learning.
7 Conclusion
In this paper, we provide theoretical evidence for the advantages of using a causal world-model in offline RL. We present error bounds for model prediction and policy evaluation in offline MBRL with causal and plain world-models. We also propose a practical algorithm, FOCUS, to address the problem of learning causal structure in offline RL. The main idea of FOCUS is to utilize causal discovery methods for offline causal structure learning. We design a general mechanism to solve the problems of extending causal discovery methods to RL, including the choice of conditional variables. Extensive experiments on typical benchmarks demonstrate that FOCUS achieves accurate and robust causal structure learning and thus significantly surpasses the baselines in offline RL. The limitation of FOCUS is that our theory assumes the true causal structure is given, whereas in practice one needs to learn it from data and then use it. Quantifying the uncertainty in a causal structure learned from data is known to be a hard problem, and as one line of future research we will derive the generalization error bound under a causal structure learned from data.
References
 [1] (1957) A markovian decision process. Journal of Mathematics and Mechanics 6 (5), pp. 679–684. Cited by: §3.
 [2] (2020) A meta-transfer objective for learning to disentangle causal mechanisms. In Proceedings of the 8th International Conference on Learning Representations (ICLR’20), Cited by: §1, §2.
 [3] (2021) Offline modelbased adaptable policy learning. In Advances in Neural Information Processing Systems 34 (NeurIPS’21), Virtual Conference. Cited by: §1.

 [4] (2019) Causal confusion in imitation learning. In Advances in Neural Information Processing Systems 32 (NeurIPS’19), pp. 11693–11704. Cited by: §1, §2.
 [5] (2018) Human causal transfer: challenges for deep reinforcement learning. In Proceedings of the 40th Annual Meeting of the Cognitive Science Society (CogSci’18), Cited by: §1.
 [6] (2019) Guidelines for reinforcement learning in healthcare. Nature Medicine 25 (1), pp. 16–18. Cited by: §1.
 [7] (2018) Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems 31 (NeurIPS’18), Montréal, Canada, pp. 2455–2467. Cited by: §1.
 [8] (2000) Ridge regression: biased estimation for nonorthogonal problems. Technometrics 42 (1), pp. 80–86. External Links: ISSN 00401706 Cited by: §4.1.
 [9] (2019) Learning neural causal models from unknown interventions. CoRR 1910.01075. External Links: 1910.01075 Cited by: §2, §6.
 [10] (2020) MOReL: model-based offline reinforcement learning. In Advances in Neural Information Processing Systems 33 (NeurIPS’20), Cited by: §3.
 [11] (2009) Probabilistic graphical models  principles and techniques. MIT Press. External Links: ISBN 9780262013192 Cited by: §5.1.
 [12] (2018) Model-ensemble trust-region policy optimization. In Proceedings of the 6th International Conference on Learning Representations (ICLR’18), Cited by: §3.
 [13] (2022) Adapting environment sudden changes by learning context sensitive policy. In Proceedings of the 36th AAAI Conference on Artificial Intelligence (AAAI’22), Virtual Conference. Cited by: §1.

 [14] (2005) Distribution-free learning of Bayesian network structure in continuous domains. In Proceedings of the 20th AAAI Conference on Artificial Intelligence (AAAI’05), pp. 825–830. Cited by: §2.
 [15] (2019) Solving Rubik’s cube with a robot hand. CoRR 1910.07113. External Links: 1910.07113 Cited by: §4.1.
 [16] (2000) Causality: models, reasoning and inference. Cambridge University Press. Cited by: §5.1.
 [17] (2021) Partially observable environment estimation with uplift inference for reinforcement learning based recommendation. Machine Learning (9), pp. 2603–2640. Cited by: §1.
 [18] (2019) VirtualTaobao: virtualizing real-world online retail environment for reinforcement learning. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence (AAAI’19), Honolulu, HI. Cited by: §1.
 [19] (2000) Causation, prediction, and search. MIT Press. Cited by: §1.
 [20] (2007) A kernel-based causal learning algorithm. In Proceedings of the 24th International Conference on Machine Learning (ICML’07), pp. 855–862. External Links: Document Cited by: §2.
 [21] (2018) Building machines that learn and think like people. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS’18), pp. 5. Cited by: §1.
 [22] (2012) MuJoCo: a physics engine for model-based control. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5026–5033. Cited by: §6.
 [23] (2017) Information theoretic MPC for model-based reinforcement learning. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA’17), pp. 1714–1721. External Links: Document Cited by: §3.
 [24] (2020) Error bounds of imitating policies and environments. In Advances in Neural Information Processing Systems 33 (NeurIPS’20), pp. 15737–15749. Cited by: Appendix A, §1.
 [25] (2018) BDD100K: A diverse driving video database with scalable annotation tooling. CoRR 1805.04687. External Links: 1805.04687 Cited by: §1.
 [26] (2020) MOPO: model-based offline policy optimization. In Advances in Neural Information Processing Systems 33 (NeurIPS’20), Cited by: §1, §3, §5.3, §6.
 [27] (2011) Kernelbased conditional independence test and application in causal discovery. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence (UAI’11), pp. 804–813. Cited by: §1, §2, §5.1.
 [28] (2022) Invariant action effect model for reinforcement learning. In Proceedings of the 36th AAAI Conference on Artificial Intelligence (AAAI’22), Virtual Conference. Cited by: §1.
Appendix
Appendix A Theory
Definition A.1 (Optimization objective in the data distribution).
(6) 
Definition A.2 (Optimization objective in the offline data).
(7) 
Definition A.3 (Optimization objective in the offline data with regularization).
(8) 
Lemma A.4.
Proof.
∎
Lemma A.5.
Proof.
Since the solution of the ridge regression is
we substitute it into this solution and get:
(10) 
Since the coefficient is chosen by the Hoerl-Kennard formula, we have:
∎
Proposition A.6.
Given the solution as in Formula 4, we have
Proof.
So we have : . ∎
Theorem A.7 (Spurious Theorem).
Proof.
Let denote , we have
∎
Theorem A.8 (RL Spurious Theorem).
Given an MDP with state dimension d_s and action dimension d_a, and a data-collecting policy, denote the true transition model and the learned model, where the learned model predicts each dimension from its spurious variable set and causal variables. Denote the policy value of a policy under the learned model and, correspondingly, under the true model. For any policy with bounded divergence from the data-collecting policy, we have the policy evaluation error bound:
(12) 
where , which represents the spurious variable density, that is, the ratio of spurious variables in all input variables .
Proof.
Before proving, we first introduce three lemmas: