1 Introduction
Understanding the dynamic aspect of data-driven decision making is a major open area of active research, whose foundations we have yet to fully grasp. Most applications of interest involve multiple repeated decisions over time with an evolving state. The topic has been studied by many communities and under multiple regimes and formulations; examples include the field of reinforcement learning
(Sutton & Barto, 1998), longitudinal analysis in biostatistics (Bang & Robins, 2005), and dynamic treatment regimes in causal inference and adaptive clinical trials (Lei et al., 2012). The goal of this work is to address policy evaluation and learning from observational data with dynamic treatments, continuous or discrete, and a high-dimensional state. Underlying policy evaluation is one fundamental estimation problem that arises in dynamic decision making: what is the effect of a choice or action today on an outcome a few periods ahead? How can we estimate this effect from observational data where an unknown policy was deployed and multiple treatments were applied to the same individual/unit?
These questions are applicable to a multitude of domains: i) in the medical domain, where the topic has been studied under the name of longitudinal data analysis, patients under standard clinical care typically receive multiple treatments over time. How can we disentangle the effects of each of these treatments on the final outcomes? ii) in the business domain, enterprises typically make multiple investments/engagements/offers to their customers over a period of time. How can we disentangle the effects of these investments on the final revenue lift and correctly calculate the average return of each investment? iii) in the digital marketing domain, the same person is typically shown multiple versions of an ad before they convert and make a purchase. How can we correctly attribute the increase in conversion to the different ad impressions?
We formulate a semiparametric Markovian model that allows for a flexible high-dimensional state space, and propose an algorithm that estimates the dynamic effects of interest from observational data at parametric root-$n$ rates. Our work extends results in the field of semiparametric inference (Neyman, 1979; Robinson, 1988; Ai & Chen, 2003; Chernozhukov et al., 2016, 2018) and provides estimation algorithms that produce an asymptotically normal estimate of the dynamic treatment effects, even when the state is high-dimensional and machine learning algorithms are used to control for indirect effects of the state, as opposed to effects stemming from the treatments. Our formulation can be viewed as a dynamic version of Robinson's classic partially linear model in semiparametric inference (Robinson, 1988), where the controls evolve over time and are affected by prior treatments. Our estimation algorithm builds upon and generalizes the recent work of (Chernozhukov et al., 2017, 2018) to estimate not only contemporaneous effects, but also dynamic effects over time. In particular, we propose a sequential residualization approach, where the effects at every period are estimated in a Neyman-orthogonal manner and then peeled off from the outcome, so as to define a new "calibrated" outcome, which is used to estimate the effect of the treatment in the previous period.
Our work is also closely related to the work on doubly robust estimation in longitudinal data analysis and in offline policy evaluation in reinforcement learning (Nie et al., 2019; Thomas & Brunskill, 2016; Petersen et al., 2014; Kallus & Uehara, 2019a, b). However, one crucial point of departure from all of these works is that we formulate minimal parametric assumptions that allow us to avoid nonparametric rates or high-variance estimation. Typical fully nonparametric approaches to off-policy evaluation in dynamic settings require the repeated estimation of inverse propensity weights at each step, which leads to a dependence on quantities that can be very ill-posed in practice, such as the product of the density ratios of states under the observational and the target policy. Moreover, these approaches are typically restricted to discrete states and actions. Our goal is to capture settings where the treatment can also be continuous (investment level, price discount level, drug dosage), and the state contains many continuous variables and is potentially high-dimensional. Inverse propensity weighting approaches, though more robust in terms of the assumptions that they make on the model, can be quite prohibitive in these settings even in moderately large data sets.
2 Model and Preliminaries
We consider a partially linear state-space Markov decision process $(s_t, \tau_t, y_t)_{t=1}^{m}$, where $s_t \in \mathbb{R}^d$ is the state at time $t$, $\tau_t \in \mathbb{R}^p$ is the action or treatment at time $t$, and $y_t \in \mathbb{R}$ is an observed outcome of interest at time $t$. We assume that these variables are related via a linear Markovian process:
$$s_{t+1} = A\,s_t + B\,\tau_t + \eta_{t+1}, \qquad y_t = \theta^\top \tau_t + \mu^\top s_t + \epsilon_t,$$
where $\eta_t$ and $\epsilon_t$ are exogenous mean-zero random shocks, which for simplicity we assume are each drawn i.i.d. across time. In Section 8 we show that our results generalize to more complex Markovian models, but for simplicity of exposition we focus on this simplified version, as it captures all the complexities needed to highlight our main contributions.
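To fix ideas, the process above can be simulated in a few lines. The sketch below additionally assumes a linear observational policy $\tau_t = \Gamma s_t + \zeta_t$ (an illustrative choice, not part of the model); all function and variable names are ours:

```python
import numpy as np

def simulate_series(A, B, theta, mu, Gamma, m, n, seed=0):
    """Simulate n independent series of length m from the partially linear
    Markovian process.  Gamma is an assumed linear observational policy,
    used here only for illustration."""
    rng = np.random.default_rng(seed)
    d, p = B.shape
    S = np.zeros((n, m, d))   # states s_t
    T = np.zeros((n, m, p))   # treatments tau_t
    Y = np.zeros((n, m))      # outcomes y_t
    s = rng.normal(size=(n, d))
    for t in range(m):
        S[:, t] = s
        tau = s @ Gamma.T + rng.normal(size=(n, p))          # observational policy
        T[:, t] = tau
        Y[:, t] = tau @ theta + s @ mu + rng.normal(size=n)  # outcome equation
        s = s @ A.T + tau @ B.T + rng.normal(size=(n, d))    # state transition
    return S, T, Y
```
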
Our goal is to estimate the effect of a change in the treatment policy on the final outcome $y_m$ (in Section 7 we also discuss variants of the problem where we are interested in the discounted sum of observed outcomes $\sum_t \gamma^t y_t$). More concretely, suppose that we were to make an intervention and set each of the treatments to some sequence of values $\tau_1, \ldots, \tau_m$: what would be the expected difference in the final outcome as compared to some baseline policy? For simplicity, and without loss of generality, we consider as baseline the policy that sets all treatments to zero. We denote this quantity by $V(\tau)$. Equivalently, we can express the quantity of interest in do-calculus: if we denote $Y(\tau) = y_m \mid \mathrm{do}(\tau_1, \ldots, \tau_m)$, then $V(\tau) = \mathbb{E}[Y(\tau)] - \mathbb{E}[Y(0)]$. In Section 6, we will also analyze the estimation of the effect of adaptive counterfactual treatment policies, where the treatment at each step can be a function of the state.
2.1 Characterization via Dynamic Effects
Our first observation is that we can decompose the quantity $V(\tau)$ into the estimation of the dynamic treatment effects $\psi_j$: if we were to make an intervention and increase the treatment at period $t$ by one unit, what would be the change in the outcome $y_{t+j}$, for $j \geq 0$, assuming that we set all subsequent treatments to zero (or, equivalently, to some constant value)? This quantity is the part of the effect on the final outcome that does not go through changes in the subsequent treatments induced by our observational Markovian treatment policy, but only the part of the effect that goes through changes in the state, which is not part of our decision process. This effect can also be expressed in terms of the constants of our Markov process as:
$$\psi_0 = \theta, \qquad \psi_j = B^\top \big(A^\top\big)^{j-1} \mu \quad \text{for } j \geq 1. \tag{1}$$
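The recursion behind the dynamic effects is straightforward to code; a minimal sketch under the notation above (the function name is ours):

```python
import numpy as np

def dynamic_effects(A, B, theta, mu, m):
    """Dynamic effects of equation (1): psi_0 = theta and
    psi_j = B' (A')^{j-1} mu for j >= 1, i.e. the effect of a unit change
    in tau_t on y_{t+j} holding subsequent treatments fixed."""
    psis = [np.asarray(theta, dtype=float)]
    v = np.asarray(mu, dtype=float)
    for _ in range(1, m):
        psis.append(B.T @ v)  # B' (A')^{j-1} mu
        v = A.T @ v           # advance (A')^{j-1} -> (A')^{j}
    return psis
```
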
Then we have the following lemma:
Lemma 1.
The counterfactual value function $V(\tau)$ can be expressed in terms of the dynamic treatment effects as:
$$V(\tau) = \sum_{t=1}^{m} \psi_{m-t}^\top \tau_t. \tag{2}$$
Proof.
Observe that by repeatedly expanding the state in terms of the previous state and action pair, we can write that, under any non-state-dependent sequence of interventions $\tau_1, \ldots, \tau_m$, the counterfactual value of the state at time $t$ can be written as:
$$s_t = A^{t-1} s_1 + \sum_{k=1}^{t-1} A^{t-1-k} B\,\tau_k + \xi_t,$$
where $\xi_t$ is a linear function of the mean-zero exogenous random shocks at all periods. Thus under intervention $\tau_1, \ldots, \tau_m$, the final outcome is distributed according to the random variable:
$$Y(\tau) = \theta^\top \tau_m + \mu^\top \Big(A^{m-1} s_1 + \sum_{k=1}^{m-1} A^{m-1-k} B\,\tau_k\Big) + \mu^\top \xi_m + \epsilon_m = \sum_{t=1}^{m} \psi_{m-t}^\top \tau_t + \mu^\top A^{m-1} s_1 + \mu^\top \xi_m + \epsilon_m.$$
Since $V(\tau)$ is the expected difference of this counterfactual final outcome and the counterfactual final outcome when $\tau = 0$, and since $\mu^\top \xi_m + \epsilon_m$ is mean-zero conditional on any exogenous sequence of interventions that does not depend on the state or prior actions, we get the desired result. ∎
Thus to estimate the function $V(\tau)$, it suffices to estimate the dynamic treatment effects $\psi_0, \ldots, \psi_{m-1}$.
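As a worked example, Lemma 1 reduces counterfactual evaluation to a weighted sum of the planned treatments; a minimal sketch under the notation above:

```python
import numpy as np

def counterfactual_value(psis, taus):
    """Lemma 1 (eq. 2), as a sketch: V(tau) = sum_t psi_{m-t}' tau_t
    for a fixed (non-adaptive) treatment sequence taus = [tau_1, ..., tau_m],
    given the dynamic effects psis = [psi_0, psi_1, ...]."""
    m = len(taus)
    # treatment at period t (0-indexed) acts on the final outcome via psi_{m-1-t}
    return sum(float(psis[m - 1 - t] @ taus[t]) for t in range(m))
```
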
3 Identification of Dynamic Effects
We first start by showing that the parameters $\psi_j$ are identifiable from the observational data. Observe that even though we can write the final outcome as a linear function of all the treatments and the initial state, i.e. even though:
$$y_m = \sum_{t=1}^{m} \psi_{m-t}^\top \tau_t + \mu^\top A^{m-1} s_1 + \tilde{\epsilon}, \tag{3}$$
this does not lead to a direct identification approach from the observational data: the shock $\tilde{\epsilon}$ contains components that are heavily correlated with the treatments, since the shocks at period $t$ affect the state at period $t+1$, which in turn affects the observed treatment at period $t+1$. Thus we cannot write that $\mathbb{E}[\tilde{\epsilon} \mid \tau_1, \ldots, \tau_m, s_1] = 0$, which would have led to an identification strategy for the dynamic effects from the data. In other words, if we view the problem as a simultaneous-treatment problem, where the final outcome is the outcome of interest, then we essentially have a problem of unmeasured confounding (implicitly, because we ignored the intermediate states).
However, we show that these dynamic effects can be identified from the observational data via a sequential peeling process, which, as we show in the next section, leads to an estimation strategy that achieves parametric rates. This is one of the main contributions of the paper.
Theorem 2.
The dynamic treatment effects $\psi_0, \ldots, \psi_{m-1}$ are identified from the observational data via the following set of conditional moment restrictions: for each $t = 1, \ldots, m$,
$$\mathbb{E}\big[\bar{y}_t - \psi_{m-t}^\top \tau_t - \mu^\top A^{m-t} s_t \,\big|\, s_t, \tau_t\big] = 0,$$
where:
$$\bar{y}_t = y_m - \sum_{j=t+1}^{m} \psi_{m-j}^\top \tau_j.$$
Proof.
By repeatedly unrolling the state $s_m$ back $m - t$ times we have that:
$$y_m = \theta^\top \tau_m + \mu^\top \Big(A^{m-t} s_t + \sum_{k=t}^{m-1} A^{m-1-k} B\,\tau_k\Big) + \tilde{\epsilon}_{>t} = \sum_{k=t}^{m} \psi_{m-k}^\top \tau_k + \mu^\top A^{m-t} s_t + \tilde{\epsilon}_{>t}.$$
Thus by rearranging the equation, we have:
$$\bar{y}_t - \psi_{m-t}^\top \tau_t - \mu^\top A^{m-t} s_t = \tilde{\epsilon}_{>t}.$$
Since the components of $\tilde{\epsilon}_{>t}$ are all random shocks occurring after the variables $s_t, \tau_t$ are realized, they are mean-zero conditional on $s_t, \tau_t$. Thus:
$$\mathbb{E}\big[\bar{y}_t - \psi_{m-t}^\top \tau_t - \mu^\top A^{m-t} s_t \,\big|\, s_t, \tau_t\big] = 0,$$
which concludes the proof of the theorem. ∎
4 Sequential DML Estimation
We now address the estimation problem. We assume that we are given access to $n$ i.i.d. samples from the Markovian process, i.e. we are given $n$ independent time series, and we denote sample $i$ by $\{s_t^{(i)}, \tau_t^{(i)}, y_t^{(i)}\}_{t=1}^{m}$ (in Section 7 we extend our estimator to the case where all the data are part of a single long time-series process and hence we need to handle correlations across samples).
Our goal is to develop an estimator of the function $V(\tau)$, or equivalently of the parameter vector $\psi = (\psi_0, \ldots, \psi_{m-1})$. We will consider the case of a high-dimensional state space, but a low-dimensional treatment space and a small number of periods, i.e. $p$ and $m$ are constants independent of $n$. Our goal is to estimate the parameters at root-$n$ rates and in a way that our estimator is asymptotically normal, so that we can construct asymptotically valid confidence intervals around our dynamic treatment effects and our estimate of the function $V(\tau)$. The latter is a non-trivial task due to the high dimensionality of the state space. For instance, it would be statistically impossible if we were to take the direct route of estimating the whole Markov process (i.e. the high-dimensional quantities $A$, $B$ and $\mu$). If these quantities have a number of non-zero coefficients that grows with $n$ at any polynomial rate, then known results on sparse linear regression preclude their estimation at root-$n$ rates (see e.g. (Wainwright, 2015)). However, we are not really interested in these low-level parameters of the dynamic process, but solely in the low-dimensional parameter vector $\psi$. We will treat this problem as a semiparametric inference problem and develop a Neyman-orthogonal estimator for the parameter vector (Neyman, 1979; Robinson, 1988; Ai & Chen, 2003; Chernozhukov et al., 2016, 2018).

In particular, we consider a sequential version of the double machine learning algorithm proposed in (Chernozhukov et al., 2018). In the case of a single time period, i.e. $m = 1$, (Chernozhukov et al., 2018) recommends the following estimator for $\theta$: using half of the data, fit a model $q(s) \approx \mathbb{E}[y \mid s]$ that predicts the outcome from the controls, and a model $f(s) \approx \mathbb{E}[\tau \mid s]$ for the treatment. Then estimate $\theta$ on the other half of the data, based on the estimating equation:
$$\mathbb{E}\big[\big(\tilde{y} - \theta^\top \tilde{\tau}\big)\,\tilde{\tau}\big] = 0, \tag{4}$$
where $\tilde{y} = y - q(s)$ and $\tilde{\tau} = \tau - f(s)$ are the residual outcome and treatment.
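The single-period residual-on-residual step can be sketched as follows, with plain least squares standing in for the (in practice high-dimensional, cross-fitted ML) nuisance regressions; the function name and the OLS stand-ins are ours:

```python
import numpy as np

def residual_on_residual(Y, T, X):
    """Single-period DML sketch: residualize Y and T on the controls X
    (OLS here stands in for arbitrary cross-fitted ML models), then solve
    the estimating equation E[(Y_res - theta' T_res) T_res] = 0."""
    Xb = np.column_stack([np.ones(len(X)), X])
    Y_res = Y - Xb @ np.linalg.lstsq(Xb, Y, rcond=None)[0]   # outcome residual
    T_res = T - Xb @ np.linalg.lstsq(Xb, T, rcond=None)[0]   # treatment residual
    theta_hat, *_ = np.linalg.lstsq(T_res, Y_res, rcond=None)
    return theta_hat
```
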
We propose a sequential version of this process that we call sequential DML, for sequential double/debiased machine learning. Intuitively, our algorithm proceeds as follows:

We first construct an estimate $\hat{\psi}_0$ of the contemporaneous effect in a robust (Neyman-orthogonal) manner, by applying the approach of (Chernozhukov et al., 2018) to the final step of the process, i.e. to time step $m$.

Subsequently, we remove the effect of the observed final-step treatment from the observed final-step outcome, i.e. we redefine the random variable $\bar{y}_{m-1} = y_m - \hat{\psi}_0^\top \tau_m$; doing so, we have removed any effect on $y_m$ caused by the final treatment $\tau_m$.

We can then estimate the one-step dynamic effect $\psi_1$ by performing the residual-on-residual estimation approach with target outcome the "calibrated" outcome $\bar{y}_{m-1}$, treatment $\tau_{m-1}$ and controls $s_{m-1}$; Theorem 2 tells us that the conditional exogeneity moment required to apply the residualization process is valid for these random variables. We can continue in a similar manner, removing the estimated effect of $\tau_{m-1}$ from $\bar{y}_{m-1}$ and repeating the above process.
We provide a formal statement of the sequential DML process in Algorithm 1, which also describes more formally the sample-splitting and cross-fitting approach that we follow in order to estimate the nuisance models required for calculating the estimated residuals.
$$\mathbb{E}\big[\big(\tilde{\bar{y}}_t - \psi_{m-t}^\top \tilde{\tau}_t\big)\,\tilde{\tau}_t\big] = 0, \qquad t = m, m-1, \ldots, 1, \tag{5}$$
where $\tilde{\bar{y}}_t$ and $\tilde{\tau}_t$ denote the residuals of the calibrated outcome $\bar{y}_t$ and of the treatment $\tau_t$ after projecting out the state $s_t$.
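The backward peeling loop can be sketched compactly. This is a simplified illustration in which plain OLS stands in for the cross-fitted ML nuisance models and the sample splitting prescribed by Algorithm 1 is omitted; all names are ours:

```python
import numpy as np

def sequential_dml(S, T, y_final):
    """Sequential DML sketch: estimate psi_0 at the last period, peel its
    estimated contribution off the outcome, and recurse backwards.  OLS
    stands in for the cross-fitted ML nuisance models, and the
    sample-splitting/cross-fitting steps are omitted for brevity."""
    n, m, p = T.shape
    psis = []
    y_bar = y_final.astype(float).copy()
    for t in range(m - 1, -1, -1):
        Xb = np.column_stack([np.ones(n), S[:, t]])
        y_res = y_bar - Xb @ np.linalg.lstsq(Xb, y_bar, rcond=None)[0]
        t_res = T[:, t] - Xb @ np.linalg.lstsq(Xb, T[:, t], rcond=None)[0]
        psi, *_ = np.linalg.lstsq(t_res, y_res, rcond=None)
        psis.append(psi)
        y_bar = y_bar - T[:, t] @ psi   # peel off the estimated effect
    return psis  # [psi_0, psi_1, ..., psi_{m-1}]
```
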
5 Estimation Rates and Normality
Our main theorems show that, provided the first-stage models of the conditional expectations achieve a small (but relatively slow) estimation error, the recovered parameters are root-$n$-consistent and asymptotically normal. Our asymptotic normality proof relies on showing that one can reinterpret our Sequential DML estimator as an estimator based on a set of moments that satisfy the property of Neyman orthogonality.
We also provide finite-sample estimation error guarantees for the estimate. To achieve this, we utilize the fact that each parameter estimate $\hat{\psi}_j$ can be viewed as the minimizer of a strongly convex loss function, conditional on the nuisance parameters and the estimates of $\hat{\psi}_0, \ldots, \hat{\psi}_{j-1}$.

Theorem 3 (Asymptotic Normality).
Suppose that the nuisance models satisfy:
$$\mathbb{E}\big[\big(\hat{q}_t(s_t) - q_t(s_t)\big)^2\big]^{1/2} = o\big(n^{-1/4}\big), \qquad \mathbb{E}\big[\big(\hat{f}_t(s_t) - f_t(s_t)\big)^2\big]^{1/2} = o\big(n^{-1/4}\big), \tag{6}$$
where $q_t(s_t) = \mathbb{E}[\bar{y}_t \mid s_t]$ and $f_t(s_t) = \mathbb{E}[\tau_t \mid s_t]$ denote the outcome and treatment nuisance models.
Moreover, suppose that all random variables are bounded and that $\mathbb{E}[\tilde{\tau}_t \tilde{\tau}_t^\top] \succ 0$ for all $t$. Then:
$$\sqrt{n}\,\big(\hat{\psi} - \psi\big) \to_d N\big(0,\; J^{-1}\,\Sigma\,\big(J^{-1}\big)^\top\big), \tag{7}$$
where $\tilde{\tau}_t = \tau_t - \mathbb{E}[\tau_t \mid s_t]$, $J$ is a block lower triangular matrix consisting of blocks of size $p \times p$, whose off-diagonal block entries are of the form $\mathbb{E}\big[\tilde{\tau}_t\,\tau_{t'}^\top\big]$
and whose diagonal block entries are $\mathbb{E}\big[\tilde{\tau}_t\,\tilde{\tau}_t^\top\big]$, and $\Sigma$ is a block diagonal matrix whose diagonal block entries take the form $\mathbb{E}\big[e_t^2\,\tilde{\tau}_t\,\tilde{\tau}_t^\top\big]$, with $e_t$ the error of the $t$-th moment equation.
Proof.
Let $g$ denote the vector of all nuisance functions and $g_0$ their true values. Moreover, let $\psi$ denote the vector of dynamic effect parameters and $\psi^{*}$ its true value.
Observe that our estimator can be equivalently viewed as the solution to an empirical version of the following vector of moment conditions:
where is a random vector denoting the variables of a single sample series and:
and
If we denote by $\hat{g}$ the estimates of the nuisance models, then our estimate $\hat{\psi}$ is equivalent to the solution of the system of equations:
We now show that the moment vector satisfies the property of Neyman orthogonality with respect to all the nuisance models. For this it suffices to show that, for each nuisance component:
(8) 
where is the derivative of with respect to the output of the nuisance models , evaluated at .
For any , the derivative with respect to for is equal to:
(9) 
while the derivative with respect to is:
Similarly, the derivative with respect to is:
(10) 
Moreover, the Jacobian of the moment vector at the true values is a block lower triangular matrix whose block values are of the form:
Thus its diagonal block values take the form $\mathbb{E}\big[\tilde{\tau}_t \tilde{\tau}_t^\top\big]$. Hence, the minimum eigenvalue of the Jacobian is bounded away from zero. Thus our setting and our estimator fall exactly into the framework of orthogonal moment estimation of (Chernozhukov et al., 2018), and we can directly apply Theorem 3.1 of that work to get the result (the exact form of the matrix $\Sigma$ follows from the observation that the moment errors are conditionally mean-zero and the fact that the moments are linear in the parameters). ∎
Theorem 4 (Finite Sample).
Suppose that each coordinate of each nuisance model satisfies, w.p. $1 - \delta$:
where the expectation is with respect to the corresponding input of each model. Moreover, suppose that:¹ Then w.p. $1 - \delta$: ¹Where by absolute value we denote the coordinate-wise absolute value.
where . If each coordinate of each nuisance model satisfies:
(11) 
Then:
Concrete Rates for Lasso Nuisance Estimates.
Suppose that the observational policy is also linear, i.e. $\tau_t = \Gamma s_t + \zeta_t$ with $\zeta_t$ a mean-zero policy shock. Then all the nuisance models are high-dimensional linear functions of their input arguments. If these linear functions satisfy a sparsity constraint, then under standard regularity assumptions, if we use Lasso regression to estimate each of these functions, we can guarantee that, w.p. $1 - \delta$, the estimation error of all nuisance models is $O\big(\sqrt{s \log(d/\delta)/n}\big)$, where $s$ is an upper bound on the number of non-zero coefficients. One sufficient regularity condition is that the expected covariance matrix of every period's state has full rank, i.e. $\mathbb{E}[s_t s_t^\top] \succ 0$. Thus the requirements of our main theorems of this section are satisfied as long as the sparsity grows as $o(\sqrt{n}/\log d)$. These sparsity conditions are satisfied if, for instance, only $s$ coordinates of the high-dimensional state have any effect on the final outcome (i.e. are outcome-relevant) and only $s$ coordinates of the high-dimensional state enter the observational policy.

6 Evaluation of Adaptive Linear Policies
In all prior sections we primarily discussed the case where we want to evaluate counterfactual policies that are non-adaptive, i.e. we wanted to understand the effect of a pre-specified sequence of treatments. In this section we extend our results to adaptive counterfactual policies, where the treatment at each period can be some mapping from the state to an action. However, to achieve parametric estimation rates, we restrict attention to the case of linear policies over a low-dimensional part of the state.
We first give a generic characterization of the counterfactual value function for any adaptive policy. This generalizes our Lemma 1 to account for the fact that the target policy depends on past states and hence is correlated with past random shocks.
Lemma 5.
Let $\Pi$ be a space of adaptive policies, i.e. of vectors $\pi = (\pi_1, \ldots, \pi_m)$ of mappings, where $\pi_t$ maps the state at time $t$ to an action at time $t$. Let $V(\pi)$ be the counterfactual value function that, for each policy $\pi$, calculates the difference between the expected counterfactual final outcome under policy $\pi$ and under an always-zero treatment policy. Then we can write:
$$V(\pi) = \sum_{t=1}^{m} \mathbb{E}\big[\psi_{m-t}^\top\, \tau_t^{\pi}\big], \tag{12}$$
where $\tau_t^{\pi}$ denotes the random variable of the action at time $t$ chosen by the policy $\pi$.
We now argue that, using our Sequential DML estimation process, we can achieve parametric estimation rates for the counterfactual value function $V(\pi)$ whenever $\pi$ is a linear policy over a low-dimensional part of the state. Let $\pi^{\mathrm{obs}}$ denote the observational policy, and suppose that the target treatment policy is a linear policy of some low-dimensional part of the state. Then we can evaluate this policy as follows:
Observe that for each entry of , the quantity:
corresponds to the effect of policy $\pi$ as compared to the observed policy when we have one fewer period and the final outcome is the state. Thus we can apply our characterizations and express this difference as:
where we can estimate the coefficients at a root-$n$ rate and in an asymptotically normal manner using our Sequential DML estimation process.
Thus we can now rewrite:
Continuing this way and peeling off the last term at each step, we can write $V(\pi)$ as a function of a small number of parameters. Each of these parameters is root-$n$-estimable using our Sequential DML algorithm, and the remaining quantities are directly estimable from the data, as they correspond to averages of the state and treatment under the observational policy.
We conclude that we can estimate $V(\pi)$ at a root-$n$ rate for any linear policy that depends on a low-dimensional (constant) number of endogenous states.
7 Dependent Single TimeSeries Samples
Thus far we have assumed that we are working with $n$ independent time series, each of duration $m$. Though this is applicable to many settings where we have panel data with many units over time, in other settings it is unreasonable to assume that we have many units; rather, we observe the same unit over a long period. In this case, we would want to do asymptotics as the number of periods grows. Our goal is still to estimate the dynamic treatment effects (i.e. the effect of a treatment at period $t$ on an outcome in period $t + j$, for $j \leq m$) for some fixed look-ahead horizon $m$.
These quantities allow us to evaluate the effect of counterfactual treatment policies on the discounted sum of the outcomes, i.e. $\sum_{t} \gamma^t y_t$ for a discount factor $\gamma \in (0, 1)$. We can write the counterfactual value function for any non-adaptive policy as:
$$V(\tau) = \sum_{t} \gamma^t \sum_{j \leq t} \psi_{t-j}^\top \tau_j. \tag{13}$$
Assuming outcomes are bounded, the effect on any period can be at most some constant $C$. Thus taking the look-ahead horizon $m$ to be roughly $\log_{1/\gamma}(C/\epsilon)$ suffices to achieve a good approximation of the effect function $V$, since the discounted effects vanish after that many periods: if we let $V_m$ denote the truncated version of $V$ that ignores effects more than $m$ periods ahead, then $|V(\tau) - V_m(\tau)| \leq C\,\gamma^{m}/(1 - \gamma)$. Thus after $m = O\big(\log(1/\epsilon)/\log(1/\gamma)\big)$ periods, the approximation error is smaller than $\epsilon$. Thus it suffices to learn the dynamic treatment effect parameters for a small number of steps. To account for this logarithmic growth, we will make the dependence on $m$ explicit in our theorems below.
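To make the truncation argument concrete, a small helper can compute the smallest look-ahead horizon whose discarded tail is below $\epsilon$, under the stated assumption that per-period effects are bounded by a constant $C$:

```python
import math

def truncation_horizon(gamma, eps, C=1.0):
    """Smallest m such that the discarded tail satisfies
    sum_{t > m} gamma^t * C = C * gamma^{m+1} / (1 - gamma) <= eps,
    i.e. the truncation error of V_m is below eps."""
    m = math.ceil(math.log(eps * (1.0 - gamma) / C) / math.log(gamma)) - 1
    return max(m, 0)
```
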
We will estimate these parameters by splitting the time series into sequential blocks of size $m$. Then we will treat each of these blocks roughly as independent observations. The crux of our proofs is to handle the fact that these blocks are not independent but serially correlated. However, we can still apply techniques such as martingale Bernstein concentration inequalities and martingale central limit theorems to achieve the desired estimation rates.
The other important change that we need to make is in the way that we fit our nuisance estimates. To avoid using future samples to train models that will be used in prior samples (which would ruin the martingale structure), we instead propose a progressive nuisance estimation fitting approach, where at every period, all prior blocks are used to train the nuisance models and then they are evaluated on the next block.
(14) 
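The progressive (walk-forward) fitting scheme can be sketched as follows; `fit` and `predict` are placeholders for any supervised learner, and all names are ours:

```python
import numpy as np

def progressive_residuals(X, y, block_size, fit, predict):
    """Walk-forward nuisance fitting sketch for a single long series: the
    model used on block k is trained only on blocks 1..k-1, so no future
    information leaks backwards (preserving the martingale structure)."""
    n = len(y)
    out = []
    for start in range(block_size, n, block_size):
        model = fit(X[:start], y[:start])        # train on all prior blocks
        end = min(start + block_size, n)
        out.append(y[start:end] - predict(model, X[start:end]))
    return np.concatenate(out)
```
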
Theorem 6.
Let $\mathcal{F}_k$ denote the filtration up until block $k$. Suppose that each coordinate of each nuisance model satisfies, w.p. $1 - \delta$:
where expectation is with respect to the corresponding input of each model from block . Moreover, suppose that: . Then w.p. :
where . If each coordinate of each nuisance model satisfies:
(15) 
Then:
Theorem 7.
Concrete rates under sparsity.
Again, we note that the requirements on the nuisance estimation models are satisfied under sparsity constraints when the Lasso is used for their estimation. One complication is that in this setup the data used to fit these models are autoregressive, since the treatment has an effect on subsequent states. However, recent work shows that the Lasso achieves the same type of error rates as in the independent-sample case for such autoregressive problems (see (Melnyk & Banerjee, 2016)). Moreover, for more general statistical learning algorithms, one can use the results of (Agarwal & Duchi, 2013) to get generalization bounds, and hence MSE bounds, for the types of martingale nuisance mean-squared-error quantities that we are interested in.
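For illustration, the Lasso nuisance fits referenced above can be sketched with a minimal coordinate-descent solver (soft-thresholding updates); in practice one would use a tuned off-the-shelf implementation:

```python
import numpy as np

def lasso_cd(X, y, lam, iters=200):
    """Minimal coordinate-descent Lasso sketch for the sparse
    high-dimensional nuisance regressions.  Minimizes
    (1/2n)||y - X b||^2 + lam * ||b||_1 via soft-thresholding updates."""
    n, d = X.shape
    beta = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0)
    r = y - X @ beta
    for _ in range(iters):
        for j in range(d):
            r = r + X[:, j] * beta[j]   # add back coordinate j's contribution
            rho = X[:, j] @ r
            beta[j] = np.sign(rho) * max(abs(rho) - lam * n, 0.0) / col_sq[j]
            r = r - X[:, j] * beta[j]   # update the residual
    return beta
```
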
8 Generalizations and Extensions
More Complex Non-Linear Markov Processes
Our approach easily extends to allow for non-linear components in the Markov process. The sole requirement for our method to work is that the dynamic effect of a treatment on any subsequent outcome be constant (not heterogeneous in the observable variables) and linear. This would, for instance, be the case for the following more complex Markovian process:
In other words, we can have an exogenous part of the state that is not affected by the treatments but which enters the treatment decision. This exogenous component could affect the endogenous state, the treatment decision, and the outcome in a non-linear manner. Furthermore, the outcome could also affect the next-period state in a linear manner. Moreover, the treatment policy in the observational data could be arbitrarily non-linear. The only real requirement is that the endogenous part of the state that is affected by the treatments evolve in a linear manner and have a linear effect on the outcomes. Assuming that the nuisance models in our algorithm are estimable at an $o(n^{-1/4})$ rate, our two main theorems continue to hold.
Heterogeneous Effects
We can also account for heterogeneity of the effects in observable characteristics . For instance, suppose that we modified the structural equations as:
In this case, we can also easily adapt our algorithm to estimate heterogeneous effect models over particular hypothesis spaces, by replacing the final-stage estimation with a repeated square-loss minimization:
We defer the complete analysis of this setting to a full version of the work.
9 Experimental Results
We consider the following DGP:
with the exogenous shocks drawn as standard normal r.v.'s and the remaining DGP parameters set to fixed constants.
We ran Monte Carlo experiments in which each experiment drew samples from the above DGP and then estimated the same-period and lagged effects using our Sequential DML algorithm. In Figure 1 we show the distribution of these estimates, based on our Sequential DML estimation algorithm with cross-fitting.
Moreover, we used the asymptotic normal approximation described in the previous section to create confidence intervals for our estimates. The coverage of the intervals across the Monte Carlo runs was a good approximation of the nominal value for both the same-period and the lagged effect, especially considering that this is a very high-dimensional setup with few samples relative to the number of variables. As the sample size grows relative to the dimension, the coverage values improve further for both same-period and lag effects.
References
 Agarwal & Duchi (2013) Agarwal, A. and Duchi, J. C. The generalization ability of online algorithms for dependent data. IEEE Transactions on Information Theory, 59(1):573–587, Jan 2013. ISSN 1557-9654. doi: 10.1109/TIT.2012.2212414.
 Ai & Chen (2003) Ai, C. and Chen, X. Efficient estimation of models with conditional moment restrictions containing unknown functions. Econometrica, 71(6):1795–1843, 2003.
 Bang & Robins (2005) Bang, H. and Robins, J. M. Doubly robust estimation in missing data and causal inference models. Biometrics, 61(4):962–973, 2005. doi: 10.1111/j.1541-0420.2005.00377.x. URL https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1541-0420.2005.00377.x.
 Chernozhukov et al. (2016) Chernozhukov, V., Escanciano, J. C., Ichimura, H., Newey, W. K., and Robins, J. M. Locally Robust Semiparametric Estimation. arXiv eprints, art. arXiv:1608.00033, July 2016.
 Chernozhukov et al. (2017) Chernozhukov, V., Goldman, M., Semenova, V., and Taddy, M. Orthogonal machine learning for demand estimation: High dimensional causal inference in dynamic panels. arXiv preprint arXiv:1712.09988, 2017.
 Chernozhukov et al. (2018) Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68, 2018.

 Freedman (1975) Freedman, D. A. On tail probabilities for martingales. Ann. Probab., 3(1):100–118, 1975. doi: 10.1214/aop/1176996452. URL https://doi.org/10.1214/aop/1176996452.
 Hall & Heyde (1980) Hall, P. and Heyde, C. 3 - The central limit theorem. In Hall, P. and Heyde, C. (eds.), Martingale Limit Theory and its Application, Probability and Mathematical Statistics: A Series of Monographs and Textbooks, pp. 51–96. Academic Press, 1980. doi: https://doi.org/10.1016/B978-0-12-319350-6.50009-8. URL http://www.sciencedirect.com/science/article/pii/B9780123193506500098.
 Kallus & Uehara (2019a) Kallus, N. and Uehara, M. Double reinforcement learning for efficient off-policy evaluation in Markov decision processes, 2019a.
 Kallus & Uehara (2019b) Kallus, N. and Uehara, M. Efficiently breaking the curse of horizon in off-policy evaluation with double reinforcement learning, 2019b.
 Lei et al. (2012) Lei, H., Nahum-Shani, I., Lynch, K., Oslin, D., and Murphy, S. A. A "SMART" design for building individualized treatment sequences. Annual Review of Clinical Psychology, 8:21–48, 2012.

 Melnyk & Banerjee (2016) Melnyk, I. and Banerjee, A. Estimating structured vector autoregressive models. In Proceedings of the 33rd International Conference on Machine Learning, ICML'16, pp. 830–839. JMLR.org, 2016.
 Neyman (1979) Neyman, J. C(α) tests and their use. Sankhya, pp. 1–21, 1979.
 Nie et al. (2019) Nie, X., Brunskill, E., and Wager, S. Learning when-to-treat policies, 2019.
 Peel et al. (2013) Peel, T., Anthoine, S., and Ralaivola, L. Empirical bernstein inequality for martingales: Application to online learning. 2013.
 Petersen et al. (2014) Petersen, M., Schwab, J., Gruber, S., Blaser, N., Schomaker, M., and van der Laan, M. Targeted maximum likelihood estimation for dynamic and static longitudinal marginal structural working models. Journal of causal inference, 2(2):147–185, 2014.
 Robinson (1988) Robinson, P. M. Root-N-consistent semiparametric regression. Econometrica: Journal of the Econometric Society, pp. 931–954, 1988.
 Sutton & Barto (1998) Sutton, R. S. and Barto, A. G. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998. ISBN 0262193981.
 Thomas & Brunskill (2016) Thomas, P. S. and Brunskill, E. Data-efficient off-policy policy evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, ICML'16, pp. 2139–2148. JMLR.org, 2016.
 Wainwright (2015) Wainwright, M. J. High-dimensional statistics: A non-asymptotic viewpoint. In preparation, University of California, Berkeley, 2015.
Appendix A Proof of Theorem 4
Let denote the random variables associated with a single sample series. Observe that if we let:
Then we have that: (where are as defined in the proof of Theorem 3).
Moreover, observe that by consistency of the nuisance functions we have that for any :