1 Introduction
Due to cost, feasibility, or safety concerns, practitioners often need to evaluate a sequential decision-making strategy using only previously collected observational data. In reinforcement learning (RL), this problem is called off-policy policy evaluation (OPE). When the policy used to collect the data is unknown, there might exist unobserved variables correlated with both the policy and the outcomes. In this case, the causal effect of future interventions is unidentified and naive estimates for a new policy will be biased.
What kinds of so-called unobserved confounders arise in Markov decision processes (MDPs)? Unobserved variables of interest in a medical setting are almost always highly persistent. For example, consider electronic medical records that do not document socioeconomic status. A patient’s socioeconomic status is unlikely to change between visits to the hospital. In macroeconomics, on the other hand, unobserved shocks are often assumed to be drawn iid every period. Consider the Federal Reserve Board adjusting monetary policy in response to oil price shocks. Events like earthquakes in oil fields might reasonably be assumed to occur independently across quarters.
Recent work develops OPE methods that are robust to unobserved confounding (Namkoong et al., 2020; Kallus and Zhou, 2020). Given an observational data set and a hypothetical confounder, these methods adapt importance sampling approaches to calculate worst-case estimates for the value of a new policy. A practitioner can assess the sensitivity of their results to unobserved variables by increasing the strength of confounding and computing how quickly the worst-case bounds degrade.
However, the existing literature arrives at radically different conclusions. Kallus and Zhou (2020), henceforth KZ, find that it is possible to efficiently construct non-conservative bounds in the infinite horizon setting. On the other hand, Namkoong et al. (2020), henceforth NKYB, only find nontrivial bounds when confounding is restricted to a single time step. Furthermore, both approaches find the finite horizon case with confounding at each step to be computationally intractable.
The natural questions are: (1) what is responsible for the substantial gap in conservativeness between the existing bounds, and (2) how can we compute tractable lower bounds for the finite horizon case?
Summary of our Results:
We identify a key assumption under which it is possible to obtain sharp lower bounds on the expected value in a confounded MDP, even as the horizon grows. When the unobserved confounding variables are drawn iid each period, the marginal dynamics over the observed state themselves form an MDP. In this case, OPE methods can be applied to the marginal MDP after appropriate adjustments for confounding. Such an assumption is made in KZ.
But if the unobserved state might be persistent over time, the problem is a genuine partially observed MDP (POMDP). Marginal transition probabilities for the observed state will not be Markovian in general. Medical applications, which frequently feature persistent unobserved variables, fall under this category. As a result, existing bounds that target this setting, such as NKYB, are more conservative. In this paper, we focus on the case where the marginal problem is an MDP and demonstrate enormous performance differences compared to the setting with persistent unobservables.
We derive an expression for the bias of common estimands under confounding in the marginal MDP setting. We show how to express OPE “direct methods” in this form. Then we demonstrate how to adapt direct methods to give worst-case bounds in the finite horizon case. Our method is sufficiently generic that any approach which regresses a function against states and actions can be plugged into our framework to get bounds.
Finally, we show that model-based OPE methods provide sharper lower bounds on the value function. We can compute these bounds in a computationally efficient way by combining techniques from the robust MDP literature with sensitivity models from causal inference. A model-based approach provides a natural way for domain experts to provide guidance on reasonable limits for the strength of confounding on outcomes. We evaluate our methods with existing OPE benchmarks.
2 Related Work
Off-policy evaluation
There are several classes of popular OPE algorithms. Voloshin et al. (2019) provide a summary and empirically compare their performance. These classes include: importance sampling (IS) (Precup, 2000; Hanna et al., 2019), model-free direct methods like Fitted Q-Evaluation (FQE) (Le et al., 2019), model-based methods (Paduraru, 2012; Gottesman et al., 2019), and hybrid methods (Thomas and Brunskill, 2016; Jiang and Li, 2016; Kallus and Uehara, 2020). Voloshin et al. (2019) show that, typically, either simple methods like FQE or hybrid methods have the best performance in practice.
Recently, a variety of marginalized importance sampling (MIS) methods (Liu et al., 2018; Uehara et al., 2020; Nachum and Dai, 2020) have been developed, which have the potential to solve the poor empirical performance of standard IS. This approach is adopted by KZ.
Causal inference and sensitivity analysis
Estimating the causal effect of a treatment on some outcome is the object of study in causal inference (Hernán and Robins, 2010; Imbens and Rubin, 2015; Pearl and others, 2009). The line of work on dynamic treatment regimes (Murphy, 2003; Laber et al., 2014) is the most relevant to RL. Work in this area frequently assumes an unconfoundedness condition, which guarantees that the causal effect of a treatment is identified. For example, unconfoundedness will hold if the data come from a randomized control trial.
If unconfoundedness might be violated, then a researcher can assess the robustness of their causal estimates via sensitivity analysis (Rosenbaum, 2002; Franks et al., 2019). In recent work, Yadlowsky et al. (2018) and Kallus et al. (2019) give bounds for treatment effects subject to a sensitivity model. Other work develops bounds for the effectiveness of a single-step policy in the presence of unobserved confounders (Kallus and Zhou, 2018; Jung et al., 2018).
Off-policy evaluation with unobserved confounders
Besides NKYB and KZ, most work in RL with unobserved confounders assumes that the causal effects are identified, i.e. assumptions are made about latent structure such that the true effect of interest can be recovered (Bennett and Kallus, 2019; Oberst and Sontag, 2019). For POMDPs, (Tennenholtz et al., 2020) analyze the bias for importance sampling in the presence of confounders, and give some conditions under which this bias can be corrected.
3 Problem Setting and Notation
3.1 Markov Decision Processes
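Let $M = (\mathcal{S}, \mathcal{A}, P, R, \chi, \gamma)$ be a Markov decision process (MDP), where $\mathcal{S}$ is the set of states and $\mathcal{A}$ is the set of actions, which we assume are finite. Let $\Delta(\mathcal{Z})$ denote all probability distributions on a set $\mathcal{Z}$. $P: \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ is the transition function, $R$ is the reward function, $\chi \in \Delta(\mathcal{S})$ is the initial state distribution, and $\gamma$ is the discount factor. A (stationary) policy $\pi: \mathcal{S} \to \Delta(\mathcal{A})$ assigns probabilities to each action given a state. We are interested in the expected value of policy $\pi$, that is, the expected discounted sum of rewards when the initial state is drawn from $\chi$, actions are drawn from $\pi$, and states evolve according to $P$.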
3.2 Confounded Offpolicy Evaluation
In this paper, we consider MDPs with unobserved confounding variables. Specifically, we assume the state space is partitioned into observed state and unobserved state . The full-information MDP is .
In the confounded off-policy evaluation problem, we have access to a dataset , collected according to a stationary behavior policy, . Each denotes an observed trajectory where , , , and . Note that while is only a function of the observed state, it can still rely on via .
Our goal is to estimate the expected return for a stationary evaluation policy, , which does not depend on the unobserved state.
4 Two Types of Unobserved State
We begin by making a distinction between unobserved states that are dependent over time, and unobserved states that are drawn iid each time step.
Assumption 1 (IID Confounders). The unobserved state is drawn iid for all and therefore the transition dynamics can be factored as:
This corresponds to the “memoryless” unobserved confounding assumption in KZ. Under Assumption 1, the marginal observed state transition probabilities are Markovian:
and the value of evaluation policy in the true MDP is equal to the value of in the marginal MDP, , where we abuse notation slightly to let and denote the corresponding marginal quantities over the observed state.
While we have reduced the problem to finding the value of in the marginal MDP, this value is not identified given the dataset because the unobserved state affects both the choice of action and the transitions . For example, any dataset collected under policy will be consistent with a set of many possible marginal transition probabilities . However, standard OPE algorithms for MDPs can be adapted to this setting via some strategy to control for confounding.
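As a hedged illustration of Assumption 1, the factorization and the resulting marginal dynamics can be written out explicitly. The notation here is an assumption made for this sketch and may not match the original symbols: $x$ is the observed state, $u$ the unobserved state, $a$ the action, and $P_u$ the distribution from which $u$ is drawn each period.

```latex
\begin{align*}
P(x', u' \mid x, u, a) &= P_u(u') \, P(x' \mid x, u, a), \\
P(x' \mid x, a) &= \sum_{u} P_u(u) \, P(x' \mid x, u, a).
\end{align*}
```

The second line depends only on $(x, a)$, which is why the observed state evolves as an MDP under a policy that ignores $u$. The data, however, are generated with actions that are correlated with $u$, so this marginal quantity cannot be read off directly from empirical frequencies.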
If the unobserved state is persistent, then the problem is no longer a marginal MDP plus causal uncertainty. Consider the simplest such scenario where is drawn from some initial distribution and . In this setting, is nonstationary in general and is not Markovian due to the dependence via induced by conditioning on . Therefore, the problem is a partially observed MDP (POMDP).
For the POMDP case, even when (as in a randomized trial), many OPE algorithms are biased because the observed state and actions do not themselves constitute an MDP. A notable exception is IS methods. When , the problem satisfies Assumption 1 in (Tennenholtz et al., 2020) for POMDPs.
When the behavior policy varies over , the value is not identified and one must further adapt IS methods as in NKYB. However, as we will demonstrate, without Assumption 1 these bounds are too conservative for practical use, even when confounding is limited to a single time step. Therefore, in this paper, we develop lower bounds on the value of a policy given Assumption 1, and show that the bounds are far less sensitive to confounding. It is crucial to remember that Assumption 1 is not reasonable in some settings, especially medical ones. Given the substantial gap in performance, we suspect that new algorithms or sensitivity models need to be developed to make the persistent confounder case work in practice.
5 Estimation with Unobserved Confounders
5.1 Bias due to Spurious Correlation
Under Assumption 1, we can explicitly quantify the bias due to unobserved confounding. For comparison, we begin with a quantity that is identified under confounding: the behavior policy conditional on the observed state. Consider the naive empirical estimate, , for given .
On the other hand, consider estimating the expectation of a function of and , conditional on , i.e. . Define the corresponding naive estimator, as above.
Proposition 1.
Under Assumption 1 and given a function ,
Proof sketch.
Conditional on and , the distribution of in is and
by Bayes rule. Then reweight accordingly. ∎
As an immediate corollary of Proposition 1, is not, in general, an unbiased estimator of . For a relevant example, let for some . Then , the marginal probability of transitioning to state . Unless or , the naive estimator of the transition probabilities is biased. Furthermore, since is unobserved, the observed data is consistent with many possible .
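The bias described above can be checked numerically on a tiny example. The sketch below is illustrative only: the numbers, the function names, and the binary unobserved state are assumptions made here, not values from the paper. For one fixed observed state and action, it computes (i) the large-sample limit of the naive conditional frequency, which averages over the posterior of the unobserved state given the action, (ii) the true marginal transition probability, which averages over its prior, and (iii) the naive quantity reweighted by the ratio of the marginal to the confounded behavior policy, following the Bayes-rule argument in the proof sketch.

```python
def posterior_over_u(p_u, pi_b_given_u):
    """P(u | x, a) in the logged data, by Bayes rule, for one fixed (x, a)."""
    pi_b_marginal = sum(p * pi for p, pi in zip(p_u, pi_b_given_u))
    return [p * pi / pi_b_marginal for p, pi in zip(p_u, pi_b_given_u)]


def naive_estimand(p_u, pi_b_given_u, f_given_u):
    """Limit of the naive sample average of f among records with this (x, a)."""
    post = posterior_over_u(p_u, pi_b_given_u)
    return sum(q * f for q, f in zip(post, f_given_u))


def interventional_value(p_u, f_given_u):
    """Target quantity: f averaged over the prior of u (action set by intervention)."""
    return sum(p * f for p, f in zip(p_u, f_given_u))


def reweighted_estimand(p_u, pi_b_given_u, f_given_u):
    """Naive average with each record weighted by pi_b(a|x) / pi_b(a|x,u)."""
    pi_b_marginal = sum(p * pi for p, pi in zip(p_u, pi_b_given_u))
    post = posterior_over_u(p_u, pi_b_given_u)
    return sum(q * (pi_b_marginal / pi) * f
               for q, pi, f in zip(post, pi_b_given_u, f_given_u))


# Hypothetical numbers for a binary unobserved state u in {0, 1}.
P_U = [0.5, 0.5]             # u is drawn iid each period
PI_B_GIVEN_U = [0.2, 0.8]    # behavior policy picks action a more often when u = 1
P_NEXT_GIVEN_U = [0.3, 0.9]  # chance of reaching a given next state, by u

naive = naive_estimand(P_U, PI_B_GIVEN_U, P_NEXT_GIVEN_U)
truth = interventional_value(P_U, P_NEXT_GIVEN_U)
fixed = reweighted_estimand(P_U, PI_B_GIVEN_U, P_NEXT_GIVEN_U)
print(naive, truth, fixed)
```

With these numbers the naive quantity is about 0.78 while the true marginal transition probability is 0.6. The reweighted version recovers 0.6, but it needs the confounded behavior policy, which is exactly what is unobserved in practice.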
5.2 Sensitivity Model
While estimands like are not point-identified under Assumption 1, it is possible to give upper and lower bounds that are consistent with the observed data. However, without further assumptions these bounds are typically vacuous. Therefore, we follow the sensitivity analysis approach and specify limits on the impact of the unobserved state. The idea is that we will construct a worst-case estimate given a fixed level of confounding and study how the estimate changes as the degree of confounding is increased.
We control the dependence of the behavior policy on the unobserved state via a parameter . This is a popular technique in the causal inference literature, described in (Rosenbaum, 2002). In particular, we follow (Tan, 2006) and let this parameter bound the odds ratio between the unobserved behavior policy and the observed marginal behavior policy:
Assumption 2 (Policy Confounding Bound). Given , for all , , and :
Note that Assumption 2 implies the bounds:
where
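To make the odds-ratio restriction concrete, the following sketch computes the interval of confounded action probabilities that a marginal action probability allows. It assumes the usual form of the marginal sensitivity model, in which the odds of the confounded policy lie within a factor `gamma` of the odds of the marginal policy; the function names and the symbol `gamma` are introduced here for illustration.

```python
def policy_bounds(pi_marginal, gamma):
    """Interval for pi_b(a | x, u) implied by an odds-ratio bound gamma >= 1.

    Assumes: 1/gamma <= odds(pi_b(a|x,u)) / odds(pi_b(a|x)) <= gamma.
    """
    if not 0.0 < pi_marginal < 1.0:
        raise ValueError("pi_marginal must lie strictly between 0 and 1")
    if gamma < 1.0:
        raise ValueError("gamma must be at least 1")
    odds = pi_marginal / (1.0 - pi_marginal)
    low_odds, high_odds = odds / gamma, odds * gamma
    # Convert odds back to probabilities: p = odds / (1 + odds).
    return low_odds / (1.0 + low_odds), high_odds / (1.0 + high_odds)


def inverse_weight_bounds(pi_marginal, gamma):
    """Interval for 1 / pi_b(a | x, u) under the same assumption."""
    low, high = policy_bounds(pi_marginal, gamma)
    return 1.0 / high, 1.0 / low


low, high = policy_bounds(0.5, 3.0)
w_low, w_high = inverse_weight_bounds(0.5, 3.0)
print(low, high, w_low, w_high)
```

Setting `gamma` to 1 collapses the interval to the marginal probability, which corresponds to no confounding. Larger values widen the interval toward (0, 1).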
6 Policy Evaluation with Confounders
In this section, we will show how to compute worst-case value estimates. As long as Assumption 1 holds, by Proposition 1 we have an unbiased expression for regressing any observed quantity against and . This expression depends on the unknown probabilities which we can bound using Assumption 2. By choosing different functions , we can adapt most OPE direct methods as described in (Voloshin et al., 2019). We illustrate this procedure for Fitted Q-Evaluation (FQE).
We begin with some notational details. We denote the state and state-action value functions for a policy and horizon as:
respectively. Throughout the rest of the paper, we will use the shorthand . Denote the Bellman evaluation operator for a policy as , defined as:
where is any function on . The state-action value function can be computed by applying to , T times (Puterman, 2014). Furthermore, , and the expected value is simply the average of the value function over the initial state distribution. Therefore, we can easily compute estimates of the expected value using .
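The iteration described above can be sketched for a small tabular problem with known dynamics. This is a minimal illustration, not the paper's implementation: it assumes rewards indexed by state and action, starts from the zero function, and uses hypothetical names (`bellman_eval`, `expected_value`) and a made-up two-state example.

```python
def bellman_eval(q, P, R, pi, gamma):
    """One application of the Bellman evaluation operator for policy pi.

    q[s][a]: current state-action values; P[s][a][s2]: transition probabilities;
    R[s][a]: rewards; pi[s][a]: action probabilities; gamma: discount factor.
    """
    n_states, n_actions = len(R), len(R[0])
    # State values under pi implied by the current q.
    v = [sum(pi[s][a] * q[s][a] for a in range(n_actions)) for s in range(n_states)]
    return [[R[s][a] + gamma * sum(P[s][a][s2] * v[s2] for s2 in range(n_states))
             for a in range(n_actions)]
            for s in range(n_states)]


def expected_value(P, R, pi, chi, gamma, horizon):
    """Apply the operator `horizon` times to the zero function, then average over chi."""
    n_states, n_actions = len(R), len(R[0])
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(horizon):
        q = bellman_eval(q, P, R, pi, gamma)
    v = [sum(pi[s][a] * q[s][a] for a in range(n_actions)) for s in range(n_states)]
    return sum(chi[s] * v[s] for s in range(n_states))


# Hypothetical example: state 0 pays 1 and stays put under action 0;
# action 1 moves to the absorbing, zero-reward state 1.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [0.0, 1.0]]]
R = [[1.0, 0.0], [0.0, 0.0]]
pi = [[1.0, 0.0], [1.0, 0.0]]   # always take action 0
chi = [1.0, 0.0]                # always start in state 0

value = expected_value(P, R, pi, chi, gamma=0.5, horizon=3)
print(value)
```

With a horizon of 3 and a discount of 0.5, the policy collects 1 + 0.5 + 0.25, so the computed value is 1.75.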
6.1 Confounded FQE
FQE iteratively applies an empirical approximation of to compute . Let and let be some function class. Given a dataset and an evaluation policy , FQE computes
Essentially, regression with the class approximates the conditional expectation of the function
and . With unobserved confounding, regression using the data no longer gives an unbiased estimate of . Instead, we can apply Proposition 1 with the function defined above to get:
We can then use Assumption 2 to bound the unobserved . For example, we immediately get the following naive bound.
Proposition 2.
Let . Under Assumptions 1 and 2, For all and
This naive bound is too conservative to use in practice, especially as the horizon grows. To get a better bound, we can solve an optimization problem over all possible values of which are consistent with the observed data. Fix and . Let and be the nominal behavior policy and nominal transition probabilities respectively. The basic unknown quantities are , , and . We have the following observable implications:
Lemma 2.
Under Assumption 1, ,
For a fixed and , let be the set of possible such that Lemma 2 and Assumption 2 hold. Then:
Unfortunately, when computing a regression in practice, this requires introducing a new optimization variable for the unknown values of for every data point. Instead, we use a reparameterization that KZ introduced for MIS to remove the dependence on .
6.2 Reparameterization
Define
and the corresponding set
The idea is that is equal to convolved with an unknown density. Since both and are unknown, optimizing over is equivalent to optimizing over where we replace with . We have the following constraints:
Lemma 3.
Under Assumptions 1 and 2,
,
and
Now we are ready to state our confounded FQE bound:
Theorem 1.
Under Assumptions 1 and 2,
,
For a given dataset , this bound can be computed with a simple linear program. Fix and , and for shorthand, denote the naive estimates of the nominal behavior policy and nominal transition probabilities as and respectively. The bound in Theorem 1 can be estimated by the following LP:
such that
where is the sample average of conditional on and . Note that and are all observables estimated from the data, and is given. Only the vector is unknown.
Remark 1. Theorem 1 gives a lower bound for a single application of . We get a lower bound on by applying k times and then averaging over the initial state distribution.
Remark 2. The reparameterized optimization problem in Theorem 1 can in principle be used when regressing a wide variety of functions against and . This provides a blueprint for adapting other OPE methods that solve a regression problem.
7 Sharper Bounds with Robust MDPs
Unobserved variables create bias when they are correlated with both the behavior policy and the state transitions. The sensitivity model in Assumption 2 limits the correlation with the behavior policy. However, in the reparameterization strategy above, we combine our unknowns, and . Therefore, we cannot leverage any additional information that limits the correlation between and the transitions. Consider the extreme case, where . In this case, naive OPE estimates will be unbiased even if in Assumption 2 is large. While in observational studies, it is not possible to rule out all correlation between unobservables and the dynamics, we might be able to use domain knowledge on causal mechanisms to restrict the feasible transitions.
One branch of the sensitivity analysis literature, exemplified by (Rosenbaum and Rubin, 1983), suggests using three sensitivity parameters: first, a bound on the correlation between the unobserved confounder and the treatment; second, a bound on the correlation between the unobserved confounder and the outcome; and third, a parameter representing the distribution of the unobserved confounder. (Rosenbaum and Rubin, 1983) presents the case where is a binary variable. However, (Ding and VanderWeele, 2016) show that for worst-case bounds, this is without loss of generality. Therefore, we assume that .
Assumption 2 bounds the impact of on . Following (Rosenbaum and Rubin, 1983), we now introduce two additional parameters:
Assumption 3 (Transition Confounding Bound). Given , for all , , , and :
Assumption 4. Given a fixed .
For any tuple of sensitivity parameters, , we will give worst-case bounds on the value function using a model-based approach. Each has a corresponding set of possible transition probabilities under Assumptions 2, 3, and 4, such that the observable implications in Lemma 2 hold. Finding the worst-case value given an uncertainty set for the dynamics has been extensively explored in the robust MDP literature (Nilim and El Ghaoui, 2005). The standard approach is to separate the uncertainty over the state-action pairs, assuming that the uncertainty sets across pairs are not linked. In our problem, this assumption is violated because of the requirement that is a probability distribution. In the language of robust MDPs, our problem is “s-rectangular” instead of “(s,a)-rectangular”.
Fortunately, s-rectangular MDPs can also be solved efficiently (Wiesemann et al., 2013). Let denote the set of feasible transition probabilities for a fixed . Let be the matrix whose rows are for each . Instead of the state-action value function, we iteratively solve for worst-case estimates of the value function:
where . When optimizing over the unknown quantities and for all , , and , this problem has a linear objective with linear and bilinear equality constraints, so it can be easily solved. We estimate by letting , then solving the above minimization problem times. As we will show in our evaluation, for all values of the parameters , the s-rectangular robust MDP formulation provides sharper bounds than the linear program corresponding to Theorem 1.
8 Evaluation
We use the benchmarks from OPE-Tools (Voloshin et al., 2019) for evaluation. In particular, we adapt their three discrete environments, Graph, Discrete MC, and Gridworld, together with a small toy problem. Note that the data generating processes do not strictly need to be confounded. Our methods bound the worst possible confounded MDP that could have generated the data. Therefore, the two relevant, observable reference points are the value of the behavior policy and the nominal value of the evaluation policy. Nonetheless, for completeness we augment the environments with unobserved confounding variables. Our approach takes an existing behavior policy and transition matrix, and adds an additional state variable which induces a correlation between the policy and transitions based on either the rewards or the optimal value function.
For each environment, we choose a behavior policy and evaluation policy such that the value of without confounding is greater than the value of . This way, it is possible to find which level of confounding makes it impossible to guarantee that is superior to . Furthermore, the impact of confounding can be compared relative to the difference in values between the two policies. See Table 1 for a summary of the four test environments and the Appendix for full details.
Environment  Horizon  States  Actions  Behavior policy value  Evaluation policy value  Sparse Rewards?
toy  5  3  2  0.3397  0.4990  No
opegraph  4  8  2  0.1786  0.7174  No
opemc  20  22  2  -18.1890  -15.7381  Yes
opegridworld  8  16  4  -0.4994  -0.3569  No
8.1 Lower Bounds with Confounding
For our first experiment, we collect trajectories from each of the four environments using their respective behavior policies. For each environment, we collect 30,000/horizon trajectories, keeping the number of data points the same across environments. Then, we compute our confounded FQE and robust MDP lower bounds for values of and ranging between 1.1 (barely confounded) and 10 (highly confounded). For the robust MDP bounds, we fix the parameter , i.e. each period the unobserved state is equally likely to be or . The robust MDP bounds are not very sensitive to this parameter and this choice does not impact the qualitative results; corroborating results are in the Appendix. Our lower bounds for the four environments are plotted in Figure 1.
The confounded FQE bounds are the black curve at the bottom of each plot. Without any additional restrictions on the transition dynamics, these bounds degrade the quickest as increases. This curve intersects the value of at for opegraph, and for the remaining environments. Qualitatively, this means strong restrictions on confounding are required for the FQE bounds to guarantee that the evaluation policy is better than the behavior policy. Compare this, for example, to the other curves in opegraph and opemc which are greater than for all values of .
The curves above the confounded FQE curve correspond to our robust MDP bounds. In all cases as grows, the corresponding lower bounds get worse. As mentioned previously, for opegraph and opemc, any value of guarantees that . For toy and opegridworld, consider the curve, which is third from the top. For the toy environment, assuming substantially increases the at which the curve crosses the dotted line compared to the FQE curve. For opegridworld, the curve lies above for all . These examples highlight the qualitative and quantitative importance of limiting the degree of confounding on the transition probabilities.
8.2 Tightness
Our confounded FQE and robust MDP methods provide lower bounds on the expected value subject to their respective sensitivity models. A natural question is: how far are these bounds from the infimum over all full-information MDPs consistent with the observed data, subject to the given sensitivity model? We split our analysis of tightness into two parts, the single-step case and the multi-step case.
A single iteration of our bounds requires solving a minimization problem. The tightest possible bound on is the minimum over all valid full-information MDPs. But our robust MDP solution produces candidate transition probabilities and behavior policy corresponding to some valid full-information MDP. Therefore, since it is a lower bound, it must achieve the true minimum and so a single iteration of the robust MDP approach is tight.
On the other hand, our confounded FQE bound solves a minimization problem separately for each state-action pair without enforcing that be a density across actions. We quantify the impact on performance by comparing the FQE bound to our robust MDP bound as goes to infinity. We present results for the opegraph and opemc environments in Figure 2. The qualitative findings for the other environments are similar so we defer these results to the Appendix.
For the opegraph environment, the gap between the FQE bound (the black line at the bottom) and the robust MDP bounds for large is negligible until , at which point the gap grows. For the opemc environment, the gap starts out substantial and grows slightly larger as grows. For this particular environment, the robust MDP lower bounds always guarantee that the evaluation policy is at least as good as the behavior policy. However, the FQE lower bound can only provide this same guarantee for . Therefore, it appears that enforcing the density constraint across actions can matter in practice, so for cases where we do not wish to make any assumptions on the transitions, we prefer our robust MDP bounds with very large values of .
When confounding occurs in more than one time step, our robust MDP bound is computed iteratively with different minimization problems solved at each time step. The candidate transitions and behavior policy that correspond to each minimum may differ, so the lower bounds are potentially loose. Theoretically, the looseness of our bound is characterized by Theorem 4 of (Nilim and El Ghaoui, 2005). In particular, as the horizon goes to infinity, our lower bound converges to the best possible lower bound; the rate of convergence can be found in the proof of the theorem.
To test this empirically, we use the full-information transitions and behavior policy from the final iteration of our robust MDP method as a candidate. Because the candidate MDP is consistent with the observed data subject to the sensitivity model, if the value of this MDP matches our lower bound, then our lower bound must be tight. For the toy, opegraph, and opemc environments, we use the same experimental setup as we did for the results in Figure 1. The gap between the candidate MDP value and our lower bounds is reported in Table 2. For these environments, the value of the candidate MDP differs by less than from our lower bound. For the opegridworld environment, we find our lower bound is not tight at small horizons, so we ran experiments with a short, medium, and long horizon. As predicted by the theory, the bound improves for large as value iteration approaches its fixed point.
env  both sensitivity parameters = 2  both sensitivity parameters = 10
toy  1e-8  1e-8
opegraph  1e-8  1e-8
opemc  1e-8  1e-8
opegridworld T=28  2.03e-3  3.06e-2
opegridworld T=208  4.75e-3  2.87e-2
opegridworld T=508  2.65e-5  2.97e-4
8.3 Assumption 1 and Comparison with NKYB
Assumption 1, that the unobserved state is drawn iid each period, is crucial to the quality of the bounds above. We demonstrate this by comparing our bounds to those in NKYB, which do not assume iid confounders. In order to compare to NKYB, we have to alter the experimental setup above in two ways. First, NKYB only supports confounding that occurs in a single time step. The initial time step is confounded, but for the remainder of the horizon, the behavior policy only uses the observed state. We compute the analogue for our robust MDP algorithm by computing iterations of unconfounded value iteration followed by a single iteration of our lower bound.
Second, NKYB uses a similar but more restrictive sensitivity model. Our sensitivity parameter restricts the odds ratio between the confounded policy for a given value of and the policy averaged over all . Their sensitivity parameter restricts the odds ratio for the confounded policy between any values of , which grows roughly like the square of ours. For this comparison, we can calculate the true sensitivity parameters for each confounded environment under the different sensitivity models. We provide a performance comparison using the true sensitivity parameters for each environment in Table 3. Even with confounding restricted to a single time step, the NKYB bounds, which do not assume iid confounders, are enormously conservative.
This is a key result. Even for a single time step, policy evaluation is highly sensitive to persistent unobserved variables. The ability of our robust MDP bounds to guarantee improvement over the behavior policy in Figure 1, even over longer horizons, depends crucially on our Assumption 1. In turn, this highlights the fact that off-policy evaluation with confounding in settings where Assumption 1 fails is far more difficult and requires a different algorithmic approach. As mentioned in the introduction, while iid confounders are feasible in certain settings, like unobserved oil supply shocks for macroeconomic policy, Assumption 1 is not reasonable for many applications, especially in medicine.
The results in Table 3 might hinge on the different sensitivity models, so we perform a robustness check which uses identical values of and which should therefore be very favorable for the NKYB bounds. The toy and opegraph NKYB bounds improve, but the opemc and opegridworld bounds remain unusable; see the Appendix for details.
env  Nominal  NKYB  Ours
toy  0.5189  0.0436  0.25372
opegraph  0.7008  0.0280  0.3994
opemc  -15.6941  -64.5040  -15.9647
opegridworld  -0.3588  -2.3914  -0.4112
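The statement in Section 8.3 that the NKYB sensitivity parameter grows roughly like the square of ours can be seen with a short calculation. The notation is an assumption made for this sketch: $\mathrm{odds}(p) = p / (1 - p)$, $\pi_b(a \mid x, u)$ is the confounded behavior policy, $\pi_b(a \mid x)$ is its average over $u$, and $\Gamma \ge 1$ is the bound on the odds ratio relative to the marginal policy.

```latex
\begin{align*}
\Gamma^{-1} \le \frac{\mathrm{odds}(\pi_b(a \mid x, u))}{\mathrm{odds}(\pi_b(a \mid x))} \le \Gamma
\quad \text{for all } u
\;\;\Longrightarrow\;\;
\frac{\mathrm{odds}(\pi_b(a \mid x, u))}{\mathrm{odds}(\pi_b(a \mid x, u'))}
= \frac{\mathrm{odds}(\pi_b(a \mid x, u)) \,/\, \mathrm{odds}(\pi_b(a \mid x))}
       {\mathrm{odds}(\pi_b(a \mid x, u')) \,/\, \mathrm{odds}(\pi_b(a \mid x))}
\in \left[ \Gamma^{-2}, \Gamma^{2} \right].
\end{align*}
```

A bound of $\Gamma$ relative to the marginal policy therefore implies a pairwise bound of at most $\Gamma^{2}$ between any two values of the unobserved state. This is an implication only: the pairwise ratio need not reach $\Gamma^{2}$, which is consistent with the "roughly" in the text.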
8.4 Horizon and Comparison with KZ
Many of the details above depend on the horizon. For example, our robust MDP bounds become tight as the horizon increases and NKYB restricts confounding to a single timestep. Therefore, in this section we assess how our lower bounds change as the horizon increases. This also provides a convenient setting to compare with the infinite horizon bounds in KZ.
Comparing with KZ requires modification of our initial experimental setup. We use the opegraph and opegridworld environments. In order to generate a nontrivial steady-state distribution, we remove the terminating states and alter the transition probabilities accordingly. Furthermore, to match KZ’s approach, we modify the rewards to only depend on the current state. We then calculate our bounds for 1 to 200 time steps. For both environments, is long enough to spend a majority of the time close to steady state. We also adopt a discount rate of so that is well beyond the effective horizon. We produce bounds for and using our robust MDP method with set to 1,000,000.
Since we use the same marginal sensitivity model, we can use KZ’s method to calculate infinite horizon bounds for the same values of . Their method computes bounds on the long-run average value, i.e. the expectation of the rewards with respect to the steady-state distribution, instead of the discounted value. Therefore we use the discounted sum of rewards as the per-state reward for KZ’s method. The results are plotted in Figure 3. The dotted black curve at the top is the value of without confounding at each horizon. The curves below are the lower bounds for and respectively. The dots on the far right are the corresponding KZ infinite horizon bounds. In all cases, the gap between our bounds and the unconfounded value grows as the horizon increases. This is not because our bounds are loose (as value iteration reaches its fixed point, our bounds are provably tight, as mentioned) but because confounding over many time periods is a more difficult problem. This phenomenon is especially pronounced for : at long horizons, a smaller value of the sensitivity parameters becomes much more valuable.
The infinite horizon bounds follow roughly the same qualitative behavior as ours but are much looser. This is presumably due to the fact that the long-run average of discounted rewards is a different estimand than the average discounted sum of rewards. With no confounding, the difference is small (compare the uppermost line and uppermost dot). But as the level of confounding increases, the long-run average becomes more sensitive. The magnitude of the difference is surprising and perhaps worth studying in future work.
9 Conclusion
To summarize: our first key contribution is to develop a method for computing finite horizon lower bounds for policy evaluation with unobserved confounders that are drawn iid each period. We find that our model-based robust MDP approach can give substantially sharper bounds by leveraging assumptions about the transition probabilities. To be clear on this point: the argument is not that a plug-in estimator using a model of the dynamics is inherently more efficient. When using observational data to estimate a dynamic causal effect, understanding the dynamics of the system and the causal mechanisms is critically important. Quantitatively, we illustrate this by showing that sharp partial identification of the value of a policy requires restricting the set of possible transition probabilities. In practice, such an approach relies on domain expertise. Practitioners must have enough mechanistic understanding of the dynamics that they are able to specify bounds, , on potential confounding in order to get a reasonable estimate of the expected value.
Our second key contribution is to demonstrate that policy evaluation is far more challenging when there are persistent unobserved confounders. This persistence is responsible for the substantial performance gap between our bounds and those in NKYB. These results are especially relevant for medical applications, where unobserved variables are likely to be persistent. Examples include any patient variable that may go unrecorded but does not change between treatment choices, such as socioeconomic status or an undocumented chronic illness. Work published after this paper was completed (Kwon et al., 2021) has taken an initial step to tackle this setting without confounding. An important next step will be to achieve similar results in the observational causal setting.
References
 Policy evaluation with latent confounders via optimal balance. Advances in neural information processing systems 32. Cited by: §2.
 Sensitivity analysis without assumptions. Epidemiology (Cambridge, Mass.) 27 (3), pp. 368. Cited by: §7.
 Flexible sensitivity analysis for observational studies without observable implications. Journal of the American Statistical Association. Cited by: §2.

 Combining parametric and nonparametric models for off-policy evaluation. In International Conference on Machine Learning, pp. 2366–2375. Cited by: §2.
 Importance sampling policy evaluation with an estimated behavior policy. In International Conference on Machine Learning, pp. 2605–2613. Cited by: §2.
 Causal inference. CRC, Boca Raton, FL. Cited by: §2.
 Causal inference in statistics, social, and biomedical sciences. Cambridge University Press. Cited by: §2.
 Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, pp. 652–661. Cited by: §2.
 Algorithmic decision making in the presence of unmeasured confounding. arXiv preprint arXiv:1805.01868. Cited by: §2.

 Interval estimation of individual-level causal effects under unobserved confounding. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 2281–2290. Cited by: §2.
 Double reinforcement learning for efficient off-policy evaluation in markov decision processes. Journal of Machine Learning Research 21 (167), pp. 1–63. Cited by: §2.
 Confounding-robust policy improvement. arXiv preprint arXiv:1805.08593. Cited by: §2.
 Confounding-robust policy evaluation in infinite-horizon reinforcement learning. Advances in neural information processing systems 33. Cited by: §1, §1.
 RL for latent mdps: regret guarantees and a lower bound. arXiv preprint arXiv:2102.04939. Cited by: §9.
 Dynamic treatment regimes: technical challenges and applications. Electronic journal of statistics 8 (1), pp. 1225. Cited by: §2.
 Batch policy learning under constraints. In International Conference on Machine Learning, pp. 3703–3712. Cited by: §2.
 Breaking the curse of horizon: infinite-horizon off-policy estimation. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 5361–5371. Cited by: §2.
 Optimal dynamic treatment regimes. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 65 (2), pp. 331–355. Cited by: §2.
 Reinforcement learning via fenchel-rockafellar duality. arXiv preprint arXiv:2001.01866. Cited by: §2.
 Off-policy policy evaluation for sequential decisions under unobserved confounding. Advances in neural information processing systems 33. Cited by: §1, §1.
 Robust control of markov decision processes with uncertain transition matrices. Operations Research 53 (5), pp. 780–798. Cited by: §7, §8.2.
 Counterfactual off-policy evaluation with gumbel-max structural causal models. In International Conference on Machine Learning, pp. 4881–4890. Cited by: §2.
 Off-policy evaluation in markov decision processes. Ph.D. Thesis, McGill University. Cited by: §2.
 Causal inference in statistics: an overview. Statistics surveys 3, pp. 96–146. Cited by: §2.
 Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, pp. 80. Cited by: §2.
 Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons. Cited by: §6.
 Assessing sensitivity to an unobserved binary covariate in an observational study with binary outcome. Journal of the Royal Statistical Society: Series B (Methodological) 45 (2), pp. 212–218. Cited by: §7, §7.
 Overt bias in observational studies. In Observational studies, pp. 71–104. Cited by: §2, §5.2.
 A distributional approach for causal inference using propensity scores. Journal of the American Statistical Association 101 (476), pp. 1619–1637. Cited by: §5.2.
 Off-policy evaluation in partially observable environments. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 10276–10283. Cited by: §2, §4.
 Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pp. 2139–2148. Cited by: §2.
 Minimax weight and q-function learning for off-policy evaluation. In International Conference on Machine Learning, pp. 9659–9668. Cited by: §2.
 Empirical study of off-policy policy evaluation for reinforcement learning. arXiv preprint arXiv:1911.06854. Cited by: §2, §6, §8.
 Robust markov decision processes. Mathematics of Operations Research 38 (1), pp. 153–183. Cited by: §7.
 Bounds on the conditional and average treatment effect with unobserved confounding factors. arXiv preprint arXiv:1808.09521. Cited by: §2.