A note on strict functional covariate overlap in causal inference problems with high-dimensional covariates

Debashis Ghosh et al., University of Colorado Denver. 01/09/2018.

A powerful tool for the analysis of nonrandomized observational studies has been the potential outcomes model. Utilization of this framework allows analysts to estimate average treatment effects. This article considers the situation in which high-dimensional covariates are present and revisits the standard assumptions made in causal inference. We show that, within a flexible Gaussian process framework, the assumption of strict overlap leads to very restrictive conditions on the distribution of the covariates; these can be characterized using classical results on Gaussian random measures as well as reproducing kernel Hilbert space theory. These findings reveal the stringency that accompanies the use of the treatment positivity assumption in high-dimensional settings.


1 Introduction

The availability of high-dimensional covariates in administrative databases and electronic health records has led to increasing scientific focus on evaluating and developing methods for causal inference with these data structures. There has been a concomitant focus in the statistics and econometrics literature on the use of machine learning-based methods for performing causal inference with high-dimensional data (e.g., van der Laan and Gruber, 2010; van der Laan and Rose, 2011; Athey and Imbens, 2016; Kallus, 2016; Athey, Imbens and Wager, 2017; Chernozhukov et al., 2017).

In light of this work, it is worth revisiting the standard assumptions necessary for performing causal inference in the potential outcomes framework of Rubin (1974) and Holland (1986). A key assumption that is needed for proper definition of a causal estimand is the unconfoundedness assumption, which states that the treatment is independent of the potential outcomes conditional on confounders. Part of the interest in observational studies with high-dimensional covariates is the belief that sufficiently rich sets of covariates will render the unconfoundedness assumption more plausible.

Another assumption, which is the focus of the current paper, is the treatment positivity assumption, which states that the probability of treatment given covariates is strictly between zero and one for all values of the covariate vector. This is related to the notion of covariate/confounder overlap between the treatment groups. It is taken virtually as a given in most causal analyses, but the emergence of clinical decision support systems, deterministic treatment rules and high-dimensional covariates raises the possibility of this assumption being violated. A simple example of such a violation is a medical setting in which treatment assignment is made deterministically based on the patient presenting with certain risk factors and comorbidities. In such a case, the treatment positivity assumption would be violated. More generally, Robins and Ritov (1997) have shown that for estimators of the average causal effect to potentially be semiparametrically efficient, the treatment positivity assumption has to be strengthened to the propensity score being uniformly bounded away from zero and one.

Methods for performing causal inference with violations of treatment positivity have been limited in the literature. Crump et al. (2009) characterized its effects in a setting with limited numbers of covariates and developed a simple rule to exclude subjects based on the propensity score. The remaining subjects would be those for whom there is sufficient covariate overlap, so that one could make valid causal inferences. Traskin and Small (2011) used classification and regression trees (CART) to model group labels derived from the Crump et al. (2009) definition of a study population with sufficient overlap, in order to identify factors that define a study population about which one can make causal inferences. Two more recent proposals, from Ratkovic (2014) and Ghosh (2017), suggest defining study populations based on the margin from machine learning algorithms. Another practical approach is to delete observations with extreme propensities close to zero or one, which is known as propensity score trimming. Some practical guidelines on the use of propensity score trimming have been given by Lee et al. (2011). Yang and Ding (2017) proposed a perturbation-based approach to inference using propensity score trimmed estimators for causal inference. Note that these approaches all estimate a causal parameter that is effectively data-dependent; see Ghosh (2017) for further discussion.
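To make the trimming idea concrete, the following is a minimal sketch in Python, assuming simulated data, a logistic propensity model, and the commonly used 0.1/0.9 cutoffs; the data-generating process and variable names are ours, chosen purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 1000, 5
Z = rng.normal(size=(n, p))                    # confounders
logit = Z @ rng.normal(size=p)                 # true treatment model
T = rng.binomial(1, 1 / (1 + np.exp(-logit)))  # treatment indicator

# Estimate the propensity score e(Z) = P(T = 1 | Z).
ehat = LogisticRegression().fit(Z, T).predict_proba(Z)[:, 1]

# Propensity score trimming: drop observations with extreme
# estimated propensities before estimating causal effects.
keep = (ehat > 0.1) & (ehat < 0.9)
print(f"retained {keep.sum()} of {n} observations")
```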

In recent work, D’Amour et al. (2017) studied the notion of overlap in the high-dimensional case and described a concept termed strict covariate overlap. This represents one way of extending the treatment positivity assumption to the high-dimensional setting. For this situation, they showed that their covariate overlap assumption implied a bound on the discrepancy between the joint distributions of confounders in the treatment and control groups. The assumption thus places an immediate bound on the difference in joint distributions between the treatment and control populations. To avoid such a restrictive assumption, D’Amour et al. (2017) suggested that one could make sparsity assumptions of several types. First, one could assume sparsity at the level of the propensity score. Second, one could assume the existence of a latent variable that renders treatment independent of the potential outcomes, much as in the standard unconfoundedness assumption. Finally, one could assume the existence of a low-dimensional subspace such that the confounders are independent of the potential outcomes conditional on the subspace.

Central to the notion of strict overlap is that of a bounded likelihood ratio, results for which have been developed by Rukhin (1993, 1997) and exploited by D’Amour et al. (2017). In their work, D’Amour et al. (2017) consider the situation of Gaussian covariates and suggest that this automatically implies a bounded likelihood ratio. However, the framework they were dealing with assumed that the number of covariates was fixed. In the current paper, we wish to consider the situation in which the covariate is infinite-dimensional. While this never truly occurs in practice, this type of framework is very much in line with subfields of statistics such as functional data analysis (Ramsay and Silverman, 2005). In particular, we will use the theory of random probability measures, and in particular random Gaussian measures (Neveu, 1968; Janson, 1997), in order to develop a new characterization of overlap in the infinite-dimensional sense; we term this strict functional overlap. The line of research we use enjoys a long history in statistical theory, dating back to the results on random measures based on Wiener processes developed by Cameron and Martin (1944, 1945). First, we present a new characterization of covariate overlap that overcomes some technical limitations of the bounded likelihood ratio result presented in D’Amour et al. (2017). Second, we provide asymptotic characterizations of strict functional overlap based on equivalence and orthogonality of Gaussian measures. These results have been applied to a variety of problems in statistics, including spatial statistics (Stein, 1999) and, more recently, classification with functional data (Delaigle and Hall, 2012; Berrendero et al., 2017). Their application to the causal inference setting is new. For a specific Gaussian process model, we show a phase transition relative to the overlap assumption in terms of the component eigenvalues and eigenfunctions, and we summarize results from Delaigle and Hall (2012) and Berrendero et al. (2017).

2 Background

2.1 Preliminaries and causal inference assumptions

In this paper, we will employ the potential outcomes framework (Neyman, 1923; Rubin, 1974), which has been widely used in causal modelling. Let $Y$ denote the response of interest and $\mathbf{Z}$ be a $p$-dimensional vector of confounders. Let $T$ be a binary indicator of treatment exposure that takes values in $\{0, 1\}$, where $T = 1$ if treated and $T = 0$ if control. Let the observed data be represented as $(Y_i, T_i, \mathbf{Z}_i)$, $i = 1, \ldots, n$, a random sample from $(Y, T, \mathbf{Z})$. Note that in this section, we will assume that the confounders are simply a finite-dimensional vector.

Let $\{Y_i(0), Y_i(1)\}$ be the potential outcomes for subject $i$, where $Y_i(0)$ refers to the potential outcome under control and $Y_i(1)$ that under treatment. What we observe is $Y_i = T_i Y_i(1) + (1 - T_i) Y_i(0)$, which is commonly referred to as the consistency assumption in the causal inference literature.

In an observational study, the vector of covariates $\mathbf{Z}$ could be related to both the outcome and the treatment assignment. Since both $T$ and the potential outcomes $\{Y(0), Y(1)\}$ are affected by $\mathbf{Z}$, the independence $T \perp \{Y(0), Y(1)\}$ will not hold. To enable causal inference in this scenario, we make the following further assumptions.

  1. Strongly Ignorable Treatment Assignment (SITA): $T$ is independent of $\{Y(0), Y(1)\}$ given $\mathbf{Z}$.

  2. Stable Unit Treatment Value Assumption (SUTVA): the potential outcomes for subject $i$ are statistically independent of the potential outcomes for all subjects $j$, $j \neq i$.

  3. Treatment Positivity Assumption (TP): $0 < P(T = 1 \mid \mathbf{Z} = \mathbf{z}) < 1$ for all values of $\mathbf{z}$.

Taking these assumptions in order, SITA means that by conditioning on $\mathbf{Z}$, the observed outcomes can be treated as if they come from a randomized complete block design. Rosenbaum and Rubin (1983) show that if SITA holds, then the treatment is independent of the potential outcomes given the propensity score, defined as $e(\mathbf{Z}) = P(T = 1 \mid \mathbf{Z})$. Robins (1998) uses the terminology ‘no unmeasured confounders’ in lieu of SITA. The SUTVA assumption is routinely made in causal inference studies, although a violation of this assumption, referred to as interference, has been studied by several authors (e.g., Rosenbaum, 2007; Hudgens and Halloran, 2008). The TP assumption was described in the Introduction and will be considered further in the next section.

In passing, we mention that the typical parameter of focus in causal analyses is the average causal effect (ACE), defined as

$$\mathrm{ACE} = E\{Y(1)\} - E\{Y(0)\}. \qquad (1)$$

The use of propensity score modelling (Rosenbaum and Rubin, 1983) in conjunction with outcome regression modelling leads to a variety of approaches to average causal effect estimation; a comprehensive overview of the topic is given by Imbens and Rubin (2015).
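As one standard illustration of how (1) is estimated once the propensity score is in hand, here is a hedged sketch of the inverse probability weighted (IPW) estimator; this is a generic textbook estimator, not a method specific to this paper:

```python
import numpy as np

def ipw_ace(Y, T, ehat):
    """IPW estimate of ACE = E{Y(1)} - E{Y(0)}.

    Valid under SITA, SUTVA and treatment positivity; `ehat` holds the
    estimated propensity scores, assumed bounded away from 0 and 1."""
    return np.mean(T * Y / ehat) - np.mean((1 - T) * Y / (1 - ehat))
```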

2.2 Review of previous work regarding treatment positivity and covariate overlap

We focus on the treatment positivity assumption and describe previous work in this area. Crump et al. (2009) noted the possibility that treatment positivity could be violated and instead defined a subpopulation causal effect using the propensity score. Let $1\{A\}$ denote the indicator function for the event $A$. Define the region $B = \{\mathbf{z} : \alpha \leq e(\mathbf{z}) \leq 1 - \alpha\}$ for some $\alpha \in (0, 1/2)$. Crump et al. (2009) define the subpopulation average causal effect as

$$\tau(B) = E\{Y(1) - Y(0) \mid \mathbf{Z} \in B\}.$$

Note that $\tau(B)$ depends on the region $B$ of the propensity scores. In practice, $B$ must be estimated from the data, and this is done by estimating the propensity score to obtain $\hat{e}(\mathbf{Z}_i)$, $i = 1, \ldots, n$. These are then plugged into the definition of $B$ to yield an estimator $\hat{B}$, followed by substitution of the estimated propensity scores and $\hat{B}$ into $\tau(B)$ to yield $\hat{\tau}(\hat{B})$. Based on the variability of the estimated subpopulation average causal effect, Crump et al. (2009) proposed an optimization criterion for determining an optimal cutoff value $\alpha$ in the definition of $B$ and demonstrated, under some mild assumptions, that an optimal $\alpha$ exists. The optimal value depends only on the marginal distribution of the propensity scores.
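A minimal sketch of the plug-in construction of $\hat{\tau}(\hat{B})$ described above, assuming a fixed cutoff $\alpha$ and IPW estimation within the estimated overlap region; the variance-based criterion of Crump et al. (2009) for choosing $\alpha$ is deliberately omitted:

```python
import numpy as np

def crump_subpopulation_ace(Y, T, ehat, alpha=0.1):
    """Plug-in estimate of the subpopulation ACE of Crump et al. (2009).

    B-hat collects subjects with alpha <= ehat <= 1 - alpha; the causal
    effect is then estimated by IPW within B-hat."""
    inB = (ehat >= alpha) & (ehat <= 1 - alpha)
    Yb, Tb, eb = Y[inB], T[inB], ehat[inB]
    return np.mean(Tb * Yb / eb) - np.mean((1 - Tb) * Yb / (1 - eb))
```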

Traskin and Small (2011) developed a meta-approach for characterizing treatment positivity using classification and regression trees (Breiman et al., 1984). In general, a tree classifier works as follows: beginning with a training dataset $(\mathbf{Z}_i, L_i)$, $i = 1, \ldots, n$, a tree classifier repeatedly splits nodes based on one of the covariates in $\mathbf{Z}$ until splitting stops by some criterion (for example, the terminal node only contains training data from one class). Each terminal node is then assigned a class label by the majority of subjects that fall in that terminal node. Once a testing data point with covariate vector $\mathbf{z}$ is introduced, the data point is run from the top of the tree until it reaches one of the terminal nodes. The prediction is then made by the class label of that terminal node. Compared to parametric algorithms, tree-based algorithms have several advantages. There is no need to assume any parametric model for a tree; the algorithm for its construction only requires a criterion for splitting a node and a criterion for when to stop splitting (Breiman et al., 1984). Traskin and Small (2011) propose a modelling approach in which one develops a class label for subject $i$ depending on whether or not $\hat{e}(\mathbf{Z}_i) \in \hat{B}$, $i = 1, \ldots, n$. Define $L_i = 1\{\hat{e}(\mathbf{Z}_i) \in \hat{B}\}$, $i = 1, \ldots, n$. Traskin and Small (2011) then propose fitting a tree model for $L$ on $\mathbf{Z}$ to determine the covariates that explain being in the overlap set of Crump et al. (2009); a sketch is given below. We term this a ‘meta-approach’ because the modelling is based on a response variable that is derived using the estimated propensity score.
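Here is a minimal sketch of the meta-approach, assuming scikit-learn's CART implementation; the function and variable names are ours:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

def overlap_tree(Z, ehat, alpha=0.1, max_depth=3):
    """Fit a CART model explaining membership in the overlap set.

    L_i = 1{alpha <= ehat_i <= 1 - alpha} is the derived class label;
    the fitted tree's splits indicate which covariates drive overlap."""
    L = ((ehat >= alpha) & (ehat <= 1 - alpha)).astype(int)
    tree = DecisionTreeClassifier(max_depth=max_depth).fit(Z, L)
    print(export_text(tree))   # human-readable description of the splits
    return tree
```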

On the theoretical front, Khan and Tamer (2010) demonstrated that if the TP assumption is violated, then irregularities regarding identification and inference about causal effects can occur. This has echoes in the work of Robins and Ritov (1997), who show that in order to have regular semiparametric estimators for average causal effects in the high-dimensional case, the model classes for the propensity score and outcome models have to be well-behaved. Thus, violations of standard overlap assumptions lead to irregularities in estimation and inference. This was noted by Luo et al. (2017), who found that by assuming a weaker covariate overlap assumption, one could derive an estimator that exhibited super-efficiency (i.e., having an information bound that is smaller than the classical semiparametric information bound for regular estimators). This phenomenon also occurs in the collaborative targeted maximum likelihood estimator of van der Laan and Gruber (2010). The problem with superefficient estimators, as noted by D’Amour et al. (2017), is that for model directions where the relaxed assumptions do not hold, the estimators can have unbounded loss functions.

D’Amour et al. (2017) consider the problem of covariate overlap in one high-dimensional setting. They show by using Bayes’ rule that the strict overlap assumption, denoted as

$$\eta \leq P(T = 1 \mid \mathbf{Z}) \leq 1 - \eta \qquad (2)$$

for some $\eta \in (0, 1/2)$, is equivalent to the following assumption based on the likelihood ratio: almost surely,

$$\frac{\eta}{1 - \eta} \cdot \frac{1 - \pi}{\pi} \; \leq \; \frac{dP_1}{dP_0}(\mathbf{Z}) \; \leq \; \frac{1 - \eta}{\eta} \cdot \frac{1 - \pi}{\pi}, \qquad (3)$$

where $\pi = P(T = 1)$ is the marginal probability of receiving treatment, the bound holds almost surely with respect to the sigma-algebra generated by the vector $\mathbf{Z}$, and $dP_1/dP_0$ is the likelihood ratio of the treatment to the control populations. We note that assumption (3) is a bounded likelihood ratio assumption. This allows D’Amour et al. (2017) to exploit results from information theory (Rukhin, 1993, 1997) to show that (3) imposes limits on the rate of growth of discriminatory information between the joint distributions of confounders in the treatment and control populations. Interestingly, these bounds are independent of $p$, the number of confounders. From an intuitive point of view, this makes sense, for as the number of covariates increases, one would expect that the probability that random classifiers perfectly separate the data between the treatment and control populations would increase; a small simulation below illustrates the phenomenon. This proves problematic for the causal inference problem, as it leads to a violation of the TP assumption. Extending the arguments of D’Amour et al. (2017) to the current setting proves problematic because the likelihood ratio in (3) is not well-defined in infinite-dimensional space. This is because linear and quadratic functionals in general are not guaranteed to be well-behaved unless certain topological restrictions are imposed (e.g., compact linear operators).
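The following small simulation, under a hypothetical generative model of our own choosing (independent Gaussian covariates with a fixed per-coordinate mean shift $\delta$ and $\pi = 1/2$), illustrates this intuition: as $p$ grows, the propensity scores implied by Bayes’ rule concentrate near zero and one, so any fixed $\eta$ in (2) is eventually violated:

```python
import numpy as np

rng = np.random.default_rng(1)
pi, delta = 0.5, 0.15          # marginal P(T = 1) and per-coordinate shift
for p in (1, 10, 100, 1000):
    n = 2000
    T = rng.binomial(1, pi, size=n)
    Z = rng.normal(size=(n, p)) + delta * T[:, None]
    # log likelihood ratio of N(delta * 1, I_p) to N(0, I_p) at Z
    llr = delta * Z.sum(axis=1) - p * delta ** 2 / 2
    # P(T = 1 | Z) via Bayes' rule, written in an overflow-safe form
    prop = 1.0 / (1.0 + ((1 - pi) / pi) * np.exp(-llr))
    print(f"p={p:5d}  min propensity={prop.min():.4f}  "
          f"max propensity={prop.max():.4f}")
```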

2.3 Review of Gaussian Processes

We now review some results from the theory of Gaussian processes. We let $Z = \{Z(t) : t \in \mathcal{T}\}$ denote a stochastic process, where $\mathcal{T}$ is an index set. If $Z$ is Gaussian, then for any finite set $\{t_1, \ldots, t_d\} \subset \mathcal{T}$, $(Z(t_1), \ldots, Z(t_d))$ will have a $d$-dimensional multivariate normal distribution. For simplicity, we assume here that $E\{Z(t)\} = 0$ for all $t \in \mathcal{T}$, although this assumption will be relaxed later in the paper. Comprehensive reviews of Gaussian processes can be found in Neveu (1968) and Janson (1997).

Gaussian processes can be characterized by their covariance function, given by

$$K(s, t) = \mathrm{Cov}\{Z(s), Z(t)\}, \qquad s, t \in \mathcal{T}.$$

For a nondegenerate Gaussian process, the covariance function will be a symmetric and positive definite function on $\mathcal{T} \times \mathcal{T}$. Every Gaussian process determines a space of functions on the index set $\mathcal{T}$, which is known as the Cameron-Martin space. Let $H(Z)$ denote the Gaussian Hilbert space spanned by $\{Z(t) : t \in \mathcal{T}\}$. For each $\xi \in H(Z)$, we can define a function $f_\xi$ on $\mathcal{T}$ by

$$f_\xi(t) = E\{\xi Z(t)\}, \qquad t \in \mathcal{T}.$$

The Cameron-Martin space is given by $H = \{f_\xi : \xi \in H(Z)\}$ and represents a space of real-valued functions on $\mathcal{T}$.

A useful tool for our investigations will be reproducing kernel Hilbert spaces (RKHS) (Wahba, 1990; Berlinet and Thomas-Agnan, 2004). An RKHS is a function space that satisfies the property that for any function in it, pointwise evaluation is a continuous linear functional. As shown in Aronszajn (1950), there exists a one-to-one correspondence between an RKHS and a so-called kernel function that is a bounded, symmetric, positive definite function. The reproducing property of an RKHS $\mathcal{H}$ with kernel $K$ states that for any $f \in \mathcal{H}$ and $t \in \mathcal{T}$,

$$f(t) = \langle f, K(\cdot, t) \rangle_{\mathcal{H}}.$$

The Cameron-Martin space in fact defines an RKHS with reproducing kernel $K(s, t) = E\{Z(s) Z(t)\}$ and inner product given by

$$\langle f_\xi, f_\zeta \rangle = E(\xi \zeta), \qquad \xi, \zeta \in H(Z).$$

Note that this inner product is well-defined since the map $\xi \mapsto f_\xi$ is injective. Thus, there is a one-to-one correspondence between the covariance function of a Gaussian process and an RKHS.
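As a small numerical illustration of the reproducing property, consider a function in the span of kernel sections; the Gaussian kernel is an arbitrary choice of bounded, symmetric, positive definite kernel, and nothing here is specific to this paper:

```python
import numpy as np

K = lambda s, t: np.exp(-(s - t) ** 2 / 2.0)   # Gaussian kernel

x = np.array([0.0, 0.5, 1.2])    # centers
a = np.array([1.0, -2.0, 0.7])   # coefficients of f = sum_i a_i K(., x_i)

def f(t):
    return float(np.sum(a * K(x, t)))

# By the reproducing property, <f, K(., t)> = sum_i a_i K(x_i, t) = f(t),
# and the squared RKHS norm of f is a' G a with G the Gram matrix.
t = 0.3
print(f(t), float(np.sum(a * K(x, t))))        # identical values
G = K(x[:, None], x[None, :])
print("||f||^2 =", float(a @ G @ a))
```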

We represent the random measures for $Z \mid T = 0$ and $Z \mid T = 1$ by $P_0$ and $P_1$, respectively. Let $(\Omega, \mathcal{F})$ denote a sample space and associated sigma-algebra. Recall the following definitions:

Definition. Two measures $P_0$ and $P_1$ defined on a probability space $(\Omega, \mathcal{F})$ are said to be mutually singular (or orthogonal), denoted by $P_0 \perp P_1$, if there exists a set $A \in \mathcal{F}$ such that $P_0(A) = 0$ and $P_1(A) = 1$. $P_0$ and $P_1$ are said to be mutually absolutely continuous (or equivalent), denoted by $P_0 \equiv P_1$, if for all $A \in \mathcal{F}$, $P_0(A) = 0$ if and only if $P_1(A) = 0$. It should be pointed out that in most nondegenerate cases, orthogonality and equivalence are mutually exclusive.

It is well known from probability theory that two Gaussian measures defined on the same measure space will either be orthogonal or equivalent (Hajek, 1958; Feldman, 1958). Assume $\mu$ to be a measure that dominates both $P_0$ and $P_1$ (e.g., $\mu = (P_0 + P_1)/2$). Define $p_0 = dP_0/d\mu$ and $p_1 = dP_1/d\mu$ to be the Radon-Nikodym derivatives associated with $P_0$ and $P_1$, respectively. Furthermore, define

$$H(P_0, P_1) = \int (p_0 p_1)^{1/2} \, d\mu$$

and

$$D(P_0, P_1) = \int p_1 \log(p_1 / p_0) \, d\mu.$$

Note that $H(P_0, P_1)$ is the Hellinger affinity between the two probability measures (a monotone transformation of the Hellinger distance) and $D(P_0, P_1)$ is the relative entropy. Several authors have developed characterization results for orthogonality and equivalence of Gaussian measures in terms of $H$ and $D$ (Hajek, 1958; Rao and Varadarajan, 1963; Shepp, 1966). We summarize them here using Theorem 1 from Shepp (1966).

Theorem 1. (Theorem 1 from Shepp, 1966):

  • (a) $P_0 \perp P_1$ if and only if $H(P_0, P_1) = 0$ or $D(P_0, P_1) = \infty$.

  • (b) $P_0 \equiv P_1$ if and only if $H(P_0, P_1) > 0$ or $D(P_0, P_1) < \infty$.

If we define $P_0$ and $P_1$ as the limits of sequences of measures $P_{0d}$ and $P_{1d}$, $d = 1, 2, \ldots$, then Theorem 1 suggests that there are two asymptotic scenarios for Gaussian measures as $d$ approaches infinity. They are either asymptotically orthogonal (part (a)) or equivalent (part (b)). The result is an extension of Kakutani’s result for product measures (Kakutani, 1948) as well as an application of zero-one laws in probability (Feller, 1961); the computation below illustrates the dichotomy in Kakutani’s setting.
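As a simple numerical illustration of this dichotomy in Kakutani’s product-measure setting (our own example), take $P_{0d} = N(0, I_d)$ and $P_{1d} = N(\mu_{1d}, I_d)$ with $\mu_{1d} = (\mu_1, \ldots, \mu_d)$; a direct calculation gives $H(P_{0d}, P_{1d}) = \exp(-\|\mu_{1d}\|^2 / 8)$, so the limiting measures are equivalent when $\sum_j \mu_j^2 < \infty$ and orthogonal otherwise:

```python
import numpy as np

j = np.arange(1, 10001)
for label, mu in (("mu_j = 1/j     (sum mu_j^2 < inf, equivalent)", 1.0 / j),
                  ("mu_j = 1/j^0.5 (sum mu_j^2 = inf, orthogonal)", j ** -0.5)):
    # Hellinger affinity H_d = exp(-||mu_(1:d)||^2 / 8) for each d
    H = np.exp(-np.cumsum(mu ** 2) / 8)
    print(label, " H_d at d = 10, 100, 10000:", np.round(H[[9, 99, 9999]], 4))
```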

While the Gaussian process assumption may seem restrictive at first glance, we note that Gaussian processes in fact represent a very flexible class of probability models for data. They have been widely used in spatial statistics (Haran, 2011), the design of computer experiments (Kennedy and O’Hagan, 2001) and machine learning (Rasmussen and Williams, 2006).

3 Proposed framework

3.1 Functional versions of causal inference assumptions

We now reconsider the assumptions of causal inference using a functional covariate $Z = \{Z(t) : t \geq 0\}$ in place of the finite-dimensional vector $\mathbf{Z}$. The functional version of unconfoundedness can be given by

$$T \perp \{Y(0), Y(1)\} \mid \mathcal{F}_Z,$$

where $\mathcal{F}_Z$ denotes the filtration generated by $Z$.

Remark 1. Note that for confounders in the form of functional data, the indexing set of $Z$ is the positive real line, which is a totally ordered set. While this could be generalized to more general forms of index sets, our results will make use of Theorem 3.2 of Rao and Varadarajan (1963), which requires an ordering of the associated sigma-algebras for sequences of probability spaces. Unfortunately, such an ordering does not exist for higher-dimensional index sets.

The assumption of SUTVA remains the same as described in §2. Finally, the treatment positivity assumption can be written as

$$0 < P(T = 1 \mid \mathcal{F}_Z) < 1 \quad \text{a.s.},$$

where a.s. denotes almost surely.

We define strict functional overlap to be

$$\eta \leq P(T = 1 \mid \mathcal{F}_Z) \leq 1 - \eta \quad \text{a.s.} \qquad (4)$$

for some $\eta \in (0, 1/2)$. Note that (4) represents a generalization of the bounded likelihood ratio assumption of D’Amour et al. (2017). To make the equivalence with their assumption, we would need the following assumption to hold:

$$\frac{\eta}{1 - \eta} \cdot \frac{1 - \pi}{\pi} \; \leq \; \frac{dP_1}{dP_0}(Z) \; \leq \; \frac{1 - \eta}{\eta} \cdot \frac{1 - \pi}{\pi} \quad \text{a.s.} \qquad (5)$$

for some $\eta \in (0, 1/2)$, where $\pi = P(T = 1)$. However, in general, the likelihood ratio $dP_1/dP_0$ is not well-defined in the infinite-dimensional case. In the next section, we provide some results from RKHS theory that allow for (5) to hold.

Assume we have a sequence of Gaussian measures $P_{0d}$ and $P_{1d}$ that converge weakly to $P_0$ and $P_1$, respectively. Let the means be denoted as $\mu_{0d}$ and $\mu_{1d}$, with associated limits $\mu_0$ and $\mu_1$. We have the following result:

Theorem 2: Assume we have a sequence of Gaussian measures $P_{0d}$ and $P_{1d}$ that converge weakly to $P_0$ and $P_1$, respectively. Assumption (4) implies that $P_0 \equiv P_1$.

Proof: Following the calculations in Rao and Varadarajan (1963),

$$H(P_{0d}, P_{1d}) = \left\{ \frac{|\Sigma_{0d}|^{1/2} |\Sigma_{1d}|^{1/2}}{|(\Sigma_{0d} + \Sigma_{1d})/2|} \right\}^{1/2} \exp(-\Delta_d^2 / 8),$$

where

$$\Delta_d^2 = (\mu_{1d} - \mu_{0d})^{\top} \{(\Sigma_{0d} + \Sigma_{1d})/2\}^{-1} (\mu_{1d} - \mu_{0d}),$$

with $\Sigma_{gd}$ denoting the covariance operator in $d$-dimensional space for group $g$, $g = 0, 1$. The term $\Delta_d^2$ represents the Mahalanobis distance in $d$-dimensional space. By Theorem 3.2 in Rao and Varadarajan (1963), we have that

$$H(P_0, P_1) = \lim_{d \to \infty} H(P_{0d}, P_{1d}).$$

Assumption (4) implies that the likelihood ratios $dP_{1d}/dP_{0d}$ and $dP_{0d}/dP_{1d}$ are uniformly bounded away from zero for all $d$, so that the Hellinger affinities $H(P_{0d}, P_{1d})$ are as well. Thus their limit will also be greater than zero, so that $H(P_0, P_1) > 0$. We can thus use Theorem 1(b) to conclude equivalence. This concludes the proof of Theorem 2.

Remark 2. D’Amour et al. (2017) consider the situation of Gaussian covariates in a finite-dimensional case. They show that for two multivariate normal distributions that differ in either the mean vectors or the variance-covariance matrices, the likelihood ratio will diverge as the number of covariates approaches infinity, thus violating their definition of strict covariate overlap. The results presented here can be viewed as a Gaussian measure-based version of the same phenomenon.

3.2 Reinterpretation using RKHS theory and phase transition

We showed in §2.3 how a Gaussian stochastic process can be used to define an RKHS. Suppose we assume that

$$Z(t) = \mu_T(t) + e(t), \qquad t \geq 0,$$

where $\mu_0$ and $\mu_1$ are continuous functions and $e$ is a zero-mean Gaussian process with continuous covariance function $K$. One can view $\mu_0$ and $\mu_1$ as “mean” functions for the control and treatment groups, and $e$ as the noise term. Then by Theorem 7.1 of Parzen (1961), we have that $P_0 \equiv P_1$ if and only if $\mu_1 - \mu_0 \in \mathcal{H}(K)$, where $\mathcal{H}(K)$ is the RKHS that corresponds to $K$, and the form of the Radon-Nikodym derivative is given by

$$\frac{dP_1}{dP_0}(Z) = \exp\left\{ \langle Z - \mu_0, \mu_1 - \mu_0 \rangle_K - \tfrac{1}{2} \|\mu_1 - \mu_0\|_K^2 \right\},$$

where $\langle \cdot, \cdot \rangle_K$ denotes the inner product in $\mathcal{H}(K)$, interpreted through the usual congruence since the sample paths of $Z$ need not lie in $\mathcal{H}(K)$. Note also that $P_0 \perp P_1$ if and only if $\mu_1 - \mu_0 \notin \mathcal{H}(K)$. As discussed in §2.3, with Gaussian processes, equivalence and orthogonality represent the only two situations of interest, and for an RKHS, it is well known that the trajectories of $Z$ will be in $\mathcal{H}(K)$ with probability zero or one (Lukić and Beder, 2001).
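To make Parzen’s condition concrete, here is a small numerical check for standard Brownian motion on $[0, 1]$, with $K(s, t) = \min(s, t)$ and Mercer eigenpairs $\lambda_j = \{(j - 1/2)\pi\}^{-2}$, $\phi_j(t) = \sqrt{2} \sin\{(j - 1/2)\pi t\}$; membership of a mean difference $\mu$ in $\mathcal{H}(K)$ is equivalent to $\sum_j \langle \mu, \phi_j \rangle^2 / \lambda_j < \infty$. The two example mean functions are ours: $\mu(t) = t$ lies in $\mathcal{H}(K)$ (the series converges to $\int_0^1 \mu'(t)^2 \, dt = 1$), while $\mu(t) = \sqrt{t}$ does not:

```python
import numpy as np

t = np.linspace(0.0, 1.0, 20001)
dt = t[1] - t[0]

def trapz(y):                       # trapezoidal rule on the fixed grid
    return (0.5 * (y[0] + y[-1]) + y[1:-1].sum()) * dt

def partial_sums(mu_vals, J=500):
    """Partial sums of sum_j <mu, phi_j>^2 / lambda_j for K(s,t) = min(s,t)."""
    out, total = [], 0.0
    for jj in range(1, J + 1):
        w = (jj - 0.5) * np.pi
        c = trapz(mu_vals * np.sqrt(2.0) * np.sin(w * t))
        total += (c * w) ** 2       # c^2 / lambda_j, since lambda_j = w^-2
        out.append(total)
    return out

for name, mu_vals in (("mu(t) = t      ", t), ("mu(t) = sqrt(t)", np.sqrt(t))):
    s = partial_sums(mu_vals)
    print(name, "partial sums at J = 10, 100, 500:",
          [round(s[k], 3) for k in (9, 99, 499)])
```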

Next, we consider a simplification of the model from the previous paragraph in which $\mu_0 = 0$ and $K$ admits an expansion of the form

$$K(s, t) = \sum_{j=1}^{\infty} \lambda_j \phi_j(s) \phi_j(t),$$

where, for $j = 1, 2, \ldots$, $\lambda_j$ denotes the sequence of eigenvalues of the integral operator associated with $K$, and the $\phi_j$ represent the corresponding eigenfunctions. Recently, Delaigle and Hall (2012) and Berrendero et al. (2017) have studied and developed this as a model for classification with functional data. These authors referred to the problem as one of ‘near-perfect’ classification of functional data. Of relevance is a quote from Delaigle and Hall (2012): ‘these (functional classification) problems have unusual, and fascinating properties that set them apart from their finite-dimensional counterparts. In particular, we show that in many quite standard settings, the performance of simple (linear) classifiers constructed from training samples becomes perfect as the sizes of the samples diverge… That property never holds for finite-dimensional data, except in pathological cases.’

Delaigle and Hall (2012) and Berrendero et al. (2017) developed technical characterizations of the ‘near-perfect’ classification phenomenon. In the context of our discussion, the work of Berrendero et al. (2017) is perhaps the most relevant. We can restate the results of the previous section in the special case of RKHS as follows:

Theorem 3. (a) Strict functional overlap holds if and only if $S < \infty$, where

$$S = \sum_{j=1}^{\infty} \lambda_j^{-1} \sup_t \phi_j(t)^2.$$

(b) Strict functional overlap does not hold if and only if $S = \infty$.

Theorem 3 captures a certain type of phase transition in the functional covariate overlap behavior based on the properties of the kernel operator $K$. In particular, the summation of the suprema of the squared eigenfunctions, normalized by the inverses of the eigenvalues, characterizes strict functional overlap. If the summation converges (Theorem 3a), then strict functional overlap holds, while divergence of the series (Theorem 3b) is equivalent to the condition being violated.

4 Deep Learning

Recently, the use of deep learning has permeated a variety of disciplines, including ophthalmology (Gulshan et al., 2016), cancer research (Esteva et al., 2017) and image analysis (Krizhevsky et al., 2012). Much of the excitement about these techniques stems from their ability to ‘learn’ representations that generalize to new datasets. Deep learning algorithms are descendants of neural network algorithms, which were intensely studied in the 1980s and 1990s (Hornik et al., 1989; Cybenko, 1989). They work by formulating a multi-layer hierarchical architecture in which one combines information from previous layers through a combination of weights and activation functions. A diagram of a deep learning architecture is given in Figure 1.

Figure 1: Deep learning architecture for ImageNet (Figure 1 from Krizhevsky et al., 2012).

On the far left, we have the input, which is the image data. We then apply a sequence of convolution and max pooling filters, in conjunction with the multi-layer architecture. Eventually, we arrive at representations on the far right-hand side of the figure that can be used for building a classifier. There have been many empirical studies in the literature demonstrating the ability of deep learning techniques to lead to powerful predictive models (Goodfellow et al., 2015).

The use of deep learning techniques for learning representations in causal inference has also been proposed (Johansson et al., 2016). However, many of the results in Section 3 can be argued to hold here as well. Dunlop et al. (2017) have studied ‘deep’ Gaussian processes, which consist of a recursive definition of a Gaussian process based on a Markov chain. While developing calculations of the form in Section 3 for deep Gaussian processes is beyond the scope of the current manuscript, each iteration consists of a Gaussian process. Thus, one could argue heuristically that, conditional on the iteration, we would have results of the form of Theorem 2. If one wishes to model confounders given treatment using a deep Gaussian process, then in order to have any hope of doing causal inference, one is effectively assuming that the induced measures are equivalent.

To see where this impacts the work of Johansson et al. (2016), we note that they consider discrepancy measures comparing the factual and counterfactual distributions in their optimization problems. Implicitly, to have valid inference, they are assuming that these distributions have common support, which can only occur if the induced measures for confounders given treatment group are equivalent. Conversely, if a deep learning algorithm is used to learn the representation and there is, roughly speaking, ‘separability’ between the two distributions, then in the asymptotic limit this leads to the measures being orthogonal. Thus, one can view this as effectively requiring equivalence of the induced measures for valid causal inference to proceed.

5 Discussion

In this article, we have introduced a strict functional covariate overlap criterion for causal inference problems and connected it to classical results from the literature on Gaussian random measures and functional data analysis. This criterion extends the one developed by D’Amour et al. (2017). An implication of strict functional overlap is that in high-dimensional settings, the treatment positivity assumption is not as innocuous as it seems, which is similar to the conclusion reached by D’Amour et al. (2017).

We also point out that virtually all available proposals for causal effect estimation assume a covariate overlap assumption or, equivalently, that the propensity score is bounded away from zero and one uniformly in the confounders. Letting $n$ denote the sample size and $p$ the dimension of the confounders, this could be in a ‘large $n$, small $p$’ case (e.g., Chan et al., 2015) or in a ‘large $p$, small $n$’ setup (e.g., Athey et al., 2017). Our results suggest that in the latter case, the covariate overlap assumption leads to restrictive assumptions on the distributions of confounders between the treatment groups, namely that any discrepancy between them vanishes asymptotically. If such an assumption is not plausible, then this effectively rules out the possibility of average causal effect estimators achieving the usual $n^{1/2}$ rate of convergence (Robins and Ritov, 1997). In their words, relaxing the assumption of overlap in the high-dimensional case allows for ‘pathological’ distributions to be considered as part of the model class, and these distributions have the possibility of leading to causal effect estimators with irregular behavior (e.g., superefficiency). The estimators in Luo et al. (2017) and van der Laan and Gruber (2010) fall into this class.

This article synthesized several results on Gaussian measures that have appeared in the literature over the last 70 years. One limitation of this article is that the covariates treated here as functional are indexed by a one-dimensional set (e.g., time). An example of biomedical data that could be treated as functional is longitudinal data in electronic medical records, but many covariates in practice might not fit this structure. It then might be possible to attempt to combine the criterion in this paper with that considered by D’Amour et al. (2017).

The paper addresses the issue of equivalence versus orthogonality with Gaussian measures for causal inference problems. In practice, it is difficult to verify if the strict overlap criteria in this article or from D’Amour et al. (2017) actually hold. This suggests that approaches that can accommodate violations should be considered. Petersen et al. (2012) discussed possible violations of the treatment positivity assumption along with potential causes and tools to evaluate their effects in terms of biases of causal effect estimates. Suggestions there include restricting the space of treatments, redefining the causal estimand, and using alternative projection functions. Another approach would be to generalize the notion of treatment by exploiting methods from the literature on dynamic treatment regimes (Murphy, 2003; Chakraborty and Moodie, 2013). Furthering work in these areas will be necessary in order to advance the use of these analytic approaches for causal effect estimation with high-dimensional confounders.

Acknowledgement

This research is supported by a pilot grant from the Data Science to Patient Value (D2V) initiative from the University of Colorado.

References

Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society 68, 337 – 404.

Athey, S. and Imbens, G. (2016). Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences 113, 7353 – 7360.

Athey, S., Imbens, G. and Wager, S. (2017). Approximate Residual Balancing: De-Biased Inference of Average Treatment Effects in High Dimensions.
Available at https://arxiv.org/abs/1604.07125.

Berlinet, A. and Thomas-Agnan, C. (2004). Reproducing Kernel Hilbert Spaces in Probability and Statistics. London: Kluwer Academic Publishers.

Berrendero, J. R., Cuevas, A., and Torrecilla, J. L. (2017). On the use of reproducing kernel Hilbert spaces in functional classification. Available at https://arxiv.org/abs/1507.04398.

Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and regression trees. Monterey, CA: Wadsworth & Brooks/Cole Advanced Books & Software.

Cameron, R. H. and Martin, W. T. (1944). Transformations of Wiener integrals under translations. Annals of Mathematics 45, 386-396.

Cameron, R. H. and Martin, W. T. (1945). Transformations of Wiener integrals under a general class of linear transformations. Transactions of the American Mathematical Society 58, 184 – 219.

Chakraborty, B. and Moodie, E. E. M. (2013). Statistical Methods for Dynamic Treatment Regimes: Reinforcement Learning, Causal Inference, and Personalized Medicine. New York: Springer.

Chan, K. C. G., Yam, S. C. P. and Zhang, Z. (2015). Globally efficient nonparametric inference of average treatment effects by empirical balancing calibration weighting. Journal of the Royal Statistical Society, Series B 76, 243 – 266.

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W. and Robins, J. (2017). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal. Accepted Author Manuscript. doi:10.1111/ectj.12097.

Crump, R. K., Hotz, V. J., Imbens, G. W. and Mitnik, O. A. (2009). Dealing with limited overlap in estimation of average treatment effects. Biometrika 96, 187 – 199.

Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems 2, 303 – 314.

D’Amour, A., Ding, P., Feller, A., Lei, L. and Sekhon, J. (2017). Overlap in Observational Studies with High-Dimensional Covariates. Available at https://arxiv.org/abs/1711.02582.

Delaigle, A. and Hall, P. (2012). Achieving near perfect classification for functional data. Journal of the Royal Statistical Society, Series B 74, 267 – 286.

Diamond, A. and Sekhon, J. S. (2013). Genetic matching for estimating causal effects: A general multivariate matching method for achieving balance in observational studies. Review of Economics and Statistics, 95, 932–945.

Dunlop, M. M., Girolami, M. A., Stuart, A. M. and Teckentrup, A. L. (2017) How deep are deep Gaussian Processes? Available at https://arxiv.org/abs/1711.11280.

Esteva, A., Kuprel, B., Novoa, R. A., Ko, J., Swetter, S. M., Blau, H. M. and Thrun, S. (2017). Dermatologist-level classification of skin cancer with deep neural networks. Nature 542, 115–118.

Feldman, J. (1958). Equivalence and perpendicularity of Gaussian processes. Pacific Journal of Mathematics 8, 699 – 708.

Feller, W. (1961). An Introduction to Probability Theory and Its Applications. New York: John Wiley and Sons.

Ghosh, D. (2017). Relaxed covariate overlap and margin-based causal effect estimation. Available at https://arxiv.org/abs/1801.00816. To appear in Statistics in Medicine.

Ghosh, D., Zhu, Y. and Coffman, D. L. (2015). Penalized regression procedures for variable selection in the potential outcomes framework. Statistics in Medicine 34, 1645 – 1658.

Goodfellow, I., Bengio, Y. and Courville, A. (2015). Deep Learning. Cambridge, MA: MIT Press.

Gulshan, V., Peng, L., Coram, M., Stumpe, M. C., Wu, D., Narayanaswamy, A., Venugopalan, S., Widner, K., Madams, T., Cuadros, J., Kim, R., Raman, R., Nelson, P. C., Mega, J. L. and Webster, D. R. (2016). Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs. JAMA 316, 2402 – 2410.

Hajek, J. (1958). A property of J-divergences of marginal probability distributions. Czechoslovak Mathematical Journal 8, 460 – 462.

Haran, M. (2011). Gaussian random field models for spatial data. In Handbook of Markov Chain Monte Carlo (Editors: Brooks, S. P., Gelman, A., Jones, G. L. and Meng, X.-L.). Springer-Verlag.

Holland, P. (1986). Statistics and causal inference (with discussion). Journal of the American Statistical Association 81, 945 – 970.

Hornik, K., Stinchcombe, M. and White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks 2, 359 – 366.

Hudgens, M. G. and Halloran, M. E. (2008). Toward causal inference with interference. Journal of the American Statistical Association 103, 832 – 842.

Janson, S. (1997). Gaussian Hilbert Spaces. Cambridge Tracts in Mathematics. Cambridge: Cambridge University Press.

Johansson, F., Shalit, U. and Sontag, D. (2016). Learning representations for counterfactual inference. In ICML. 3020 – 3029.

Kakutani, S. (1948). On equivalence of infinite product measures. Annals of Mathematics 49, 214 – 224.

Kallus, N. (2016) Generalized optimal matching methods for causal inference. Available at https://arxiv.org/abs/1612.08321.

Kennedy, M. C. and O’Hagan, A. (2001). Bayesian calibration of computer models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63, 425 – 464.

Khan, S. and Tamer, E. (2010). Irregular identification, support conditions and inverse weight estimation. Econometrica 78, 2021 – 2042.

Krizhevsky, A., Sutskever, I. and Hinton, G. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, 1090 – 1098.

Lee, B. K., Lessler, J. and Stuart, E. A. (2011). Weight trimming and propensity score weighting. PLoS ONE 6: e18174. https://doi.org/10.1371/journal.pone.0018174

Lukić, M. N. and Beder, J. H. (2001). Stochastic processes with sample paths in reproducing kernel Hilbert spaces. Transactions of the American Mathematical Society 353, 3945 – 3969.

Luo, W., Zhu, Y. and Ghosh, D. (2017). On estimating regression causal effects using sufficient dimension reduction. Biometrika 104, 51 – 65.

Murphy, S. A. (2003). Optimal dynamic treatment regimes. Journal of the Royal Statistical Society, Series B 65, 331 – 355.

Neveu, J. (1968). Processus aléatoires gaussiens. Montréal: Les Presses de l’Université de Montréal.

Neyman, J. (1923). Sur les applications de la théorie des probabilités aux expériences agricoles: essai des principes. Excerpts reprinted in English in Statistical Science 5, 463 – 472 (D. M. Dabrowska and T. P. Speed, translators).

Parzen, E. (1961). An approach to time series analysis. Annals of Mathematical Statistics 32, 951 – 989.

Petersen, M. L., Porter, K. E., Gruber, S., Wang, Y. and van der Laan, M. J. (2012). Diagnosing and responding to violations in the positivity assumption. Statistical Methods in Medical Research 21, 31 – 54.

Ramsay, J. O. and Silverman, B. W. (2005). Functional Data Analysis, 2nd ed. New York: Springer.

Rao, C. R. and Varadarajan, V. S. (1963). Discrimination of Gaussian processes. Sankhya, Series A 25, 303 – 320.

Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. Cambridge, MA: MIT Press.

Ratkovic, M. (2014). Balancing within the margin: causal effect estimation with support vector machines. Technical Report, Department of Politics, Princeton University.

Robins, J. M. (1998). Marginal structural models. In Proceedings of the American Statistical Association, Section on Bayesian Statistics, pp. 1 – 10.

Robins, J. M. and Ritov, Y. (1997). Toward a curse of dimensionality appropriate (CODA) asymptotic theory for semi-parametric models. Statistics in Medicine 16, 285 – 319.

Rosenbaum, P. (2007). Interference between units in randomized experiments. Journal of the American Statistical Association 102, 191 – 200.

Rosenbaum, P. R. and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika 70, 41 – 55.

Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology 66, 688 – 701.

Rukhin, A. L. (1993). Lower bound on the error probability for families with bounded likelihood ratios. Proceedings of the American Mathematical Society 119, 1307 – 1314.

Rukhin, A. L. (1997). Information-type divergence when the likelihood ratios are bounded. Applicationes Mathematicae 24, 415 – 423.

Shepp, L. A. (1966). Gaussian measures in function space. Pacific Journal of Mathematics 17, 167 – 176.

Stein, M. L. (1999). Interpolation of Spatial Data. New York: Springer-Verlag.

Traskin, M. and Small, D. (2011). Defining the study population for an observational study to ensure sufficient overlap: a tree approach. Statistics in Biosciences 3, 94-118.

van der Laan, M. J. and Gruber, S. (2010). Collaborative double robust targeted maximum likelihood estimation. The International Journal of Biostatistics 6, Article 17.

van der Laan, M. J. and Rose, S. (2011). Targeted Learning: Causal Inference for Observational and Experimental Data. New York: Springer.

Wahba, G. (1990). Spline Models for Observational Data. Philadelphia: SIAM.

Yang, S. and Ding, P. (2017). Asymptotic causal inference with observational studies trimmed by the estimated propensity scores. Available at https://arxiv.org/abs/1704.00666.