A nonparametric projection-based estimator for the probability of causation, with application to water sanitation in Kenya

Current estimation methods for the probability of causation (PC) make strong parametric assumptions or are inefficient. We derive a nonparametric influence-function-based estimator for a projection of PC, which allows for simple interpretation and valid inference by making weak structural assumptions. We apply our estimator to real data from an experiment in Kenya, which found, by estimating the average treatment effect, that protecting water springs reduces childhood disease. However, before scaling up this intervention, it is important to determine whether it was the exposure, and not something else, that caused the outcome. Indeed, we find that some children, who were exposed to a high concentration of bacteria in drinking water and had a diarrheal disease, would likely have contracted the disease absent the exposure since the estimated PC for an average child in this study is 0.12 with a 95 researchers a way to estimate PC, which is essential if one wishes to determine not only the average treatment effect, but also whether an exposure likely caused the observed outcome.


Average Adjusted Association: Efficient Estimation with High Dimensional Confounders

The log odds ratio is a common parameter to measure association between ...

Controlling for Unmeasured Confounding in the Presence of Time: Instrumental Variable for Trend

Unmeasured confounding is a key threat to reliable causal inference base...

On IPW-based estimation of conditional average treatment effect

The research in this paper gives a systematic investigation on the asymp...

Design and Analysis of Bipartite Experiments under a Linear Exposure-Response Model

The bipartite experimental framework is a recently proposed causal setti...

Estimation of Local Average Treatment Effect by Data Combination

It is important to estimate the local average treatment effect (LATE) wh...

Bounding Causes of Effects with Mediators

Suppose X and Y are binary exposure and outcome variables, and we have f...

1 Introduction

The probability of causation (PC) is the probability that an outcome was caused by a specific exposure, and not by something else. Researchers have suggested estimating PC to answer questions of causality in which the exposure and outcome have already been observed for a group of individuals. Current methods for estimating PC and its confidence intervals use plugin estimators, which are inefficient, and parametric assumptions, which are often difficult to make correctly. In this article, we provide a novel estimator that allows for nonparametric estimation and derivation of valid confidence intervals under weak structural assumptions. We illustrate our method in an application to determine whether, for children in Western Kenya who were exposed to high concentrations of bacteria in their drinking water, it was the bacteria or something else that caused their diarrheal disease.

PC, sometimes called the probability of necessity, has been of interest for some time (Lagakos and Mosteller (1986), Tian and Pearl (2000), Pearl (2009), Pearl (2014), Dawid et al. (2013), Dawid et al. (2016)) in the law and in epidemiology because it is especially useful in questions of “but for” causation, that is, in determining whether a specific outcome would not have occurred in the absence of that exposure. In other words, it is the probability that a certain outcome can be attributed to a certain exposure.

PC is especially useful whenever there is a harmful exposure and a negative outcome. For instance, suppose a man has been exposed to a harmful chemical at his work, and then he develops cancer. How can an expert witnesses testifying in court determine whether his cancer was caused by the chemical exposure? PC represents the probability that it was the exposure, and not something else, that caused the cancer. Formally, it is defined as


where is the potential outcome when the exposure is set to zero, is the observed outcome, is the exposure, and are the observed covariates. In the cancer example, this is the probability that the man would have failed to develop the cancer had he not been exposed to the chemical, given that he had cancer when he was exposed. Note that this is not equivalent to the average treatment effect on the treated because it conditions on the treated and on those for whom a specific outcome has been observed. In a lawsuit setting, Lagakos and Mosteller (1986) proposed that PC should correspond to the percentage out of the full compensation that would be awarded to the person bringing the suit. Dawid et al. (2013) and Dawid et al. (2016) suggest instead that PC should be interpreted as my degree of belief about the attribution, a Bayesian interpretation.

PC can also be used in public health, as we do in our application. Children in Western Kenya were exposed to a high bacterial concentration in their drinking water, and then developed a diarrheal disease. What is the probability that it was the bacterial exposure, and not something else, that caused the disease? PC answers this question. Knowing whether the exposure causes the outcome on average, which is the usual question asked by randomized trials, is important. Knowing whether the exposure actually caused the outcome (in those children who were exposed and had the outcome) is also important, especially if a policy is to have a desired impact on a target population.

In this article we seek to answer whether, for those children who were exposed to a high concentration of bacteria in their drinking water and became ill, the exposure is what caused their illness, and not something else. But estimating PC is no trivial endeavor. First, the reader might have noticed that requires knowing the outcome under exposure and no exposure, which is something that cannot be observed directly. Thus identification assumptions are required to be able to estimate PC. Then, an estimator must be used to obtain estimates and a measure of uncertainty for these estimates, such as confidence intervals. With an estimation problem that could have serious consequences on an individual’s future, as it would if it is used in a legal trial or to make health policy decisions, it is essential to estimate PC in a way that gets as close to the truth as possible, and has a valid measure of uncertainty.

To estimate PC while avoiding untestable assumptions, we derive an estimator for the probability of causation that does not rely on parametric assumptions and provides a valid measure of uncertainty. The main methodological contribution of this article is the derivation of a nonparametric influence-function-based (IFB) estimator, which targets a projection of the true PC function onto a parametric model. Our IFB estimator does not require any parametric assumptions for the nuisance functions and it can be used to derive valid confidence intervals because it is asymptotically normal, as long as the nuisance functions can be estimated at a relatively slow rate of convergence of order

. In addition, the estimator does not require knowledge of propensity scores, and it avoids the limitations of nonparametric plugin estimators, which yield slow convergence rates compared to IFB estimators. Although obtaining valid confidence intervals without making parametric assumptions might sound somewhat magical, note that the “catch” is that our estimator relies on the nonparametric methods having a certain rate of convergence, and that We target a projection of the true function onto a parametric model.

Figure 1: Procedure we use to obtain estimates of the probability of causation, the causal parameter shown in Equation 1.

The procedure we follow to derive estimates and confidence intervals for PC is shown in Figure 1

. We define the causal parameter, make identification assumptions to obtain an observable quantity, and derive the estimator. For the estimator, we define a projection onto a parametric model, derive its influence function, derive the influence-function-based estimator, and derive its confidence intervals. Finally, we can use the estimator to obtain an estimate and confidence interval for PC. This final estimate will depend on which parametric model is selected in the projection. Our application uses logistic regression, so the final estimates and confidence intervals will have estimates of PC and estimates for the coefficients of each covariate.

The remainder of this article is organized as follows. Section 2 provides the background for our application. Section 3 shows the derivation of the estimator for the probability of causation. This includes the identification procedure, the projection approach, the proposed influence-function-based estimator, and the estimator’s asymptotic properties. The proof that the influence-function-based estimator is asymptotically normal is left in the appendix. Section 4 compares the performance of the influence-function-based estimator to the performance of plugin estimators by simulation. Section 5 provides the application with real data from a randomized controlled trial in Kenya. Section 6 offers a discussion of our method and our results.

2 Background

Diarrheal diseases are a leading cause of disease and mortality in the developing world, and for children under 5, these diseases account for 20 percent of deaths (Kremer et al. (2011)). These diseases are often transmitted when a water supply is contaminated with fecal matter. In rural Kenya, 43 percent of the population gets their drinking water from nearby springs (as shown in Figure 2, left), usually after transporting it in 10- to 20-liter jerry-cans. In springs, where water seeps out of the ground, the water is vulnerable to contamination when people dip their cans to scoop out water and when runoff introduces human or animal waste into the area.

Figure 2: Images from the Busia district of Western Kenya, where 43% of individuals access water for drinking and cooking from unprotected springs (left). International Child Support, a local NGO, builds concrete structures (right) for residents to access cleaner water. From https://www.povertyactionlab.org/evaluation/cleaning-springs-kenya.

To prevent the outcome of diarrheal diseases in children in the Busia district of Western Kenya, a local NGO called International Child Support built protective cement structures around a randomly selected group of springs (as shown in Figure 2, right). The structures forced water to flow through a pipe rather than seeping from the ground and thus helping prevent the transfer of bacteria. Kremer et al. (2011), researchers from Abdul Latif Jameel Poverty Action Lab (JPAL), worked with the NGO to deploy the experiment as a randomized controlled trial, and JPAL evaluated the project to estimate the causal effect of building the cement structures on the bacterial concentration in the water and in diarrheal diseases in the community.

In the experiment, 400 springs were randomized, which resulted in 13,036 households drinking from protected springs and 32,514 from unprotected springs. Household characteristics such as income, education and health, were approximately equal among the two groups at the start of the program, suggesting that there were no systematic differences between communities that had their springs protected and those that did not. Throughout the program, the researchers collected measures on the level of water contamination and diarrheal disease in all communities. Water quality was measured at all sample springs and households using protocols based on those used at the United States Environmental Protection Agency (EPA). The water quality measure we use is contamination with E. coli, an indicator bacterium that is correlated with the presence of fecal matter, as measured by the natural log of the most probable number (MPN) of colony-forming bacteria per 100 ml of water.

The JPAL researchers sought to estimate the average treatment effect. They found that spring protection significantly reduces diarrhea for children under age three (at baseline or born since the baseline survey) by 25%. They also found that spring protection reduces fecal contamination at the spring by a much higher rate of 66%. However, some of these beneficial effects on the child’s health are lost because household water quality improves less than the quality at the source, due to re-contamination. Interestingly, diarrhea reduction was disproportionately concentrated among girls, suggesting that cleaning water springs could be an effective tool for the improvement of female child survival.

The researchers designed the experiment such that the following assumptions were likely to be satisfied:

  1. Positivity: The researchers selected 400 water springs that were all suitable for treatment.

  2. Consistency: The researchers tested for effects of interference by measuring the bacterial levels in treated and untreated springs that were near each other (within 3 km), and found “little evidence of externalities in water quality.” (Kremer et al. (2011)) They concluded that it was unlikely that treating one spring had an effect on the bacterial concentration of an untreated spring. No interference, along with a well-defined binary treatment, means consistency is also satisfied.

  3. Monotonicity: The researchers did not expect that protecting a spring would increase the bacterial concentration and thus increase the incidence of diarrhea in children.

  4. No unobserved confounders: The treatment was given as a random assignment. The researchers checked that households were balanced in terms of demographic characteristics (such as income, education and health), and that the individuals did not use additional forms of bacterial reduction (such as home water chlorination, boiling or hygiene practices).

It is not easy to find data from a randomized trial where the treatment was harmful and the outcome was negative. Even if it happens in a medical trial, companies restrict access to these data. Instead, we take the JPAL data and redefine it. In the experiment, the treatment was that the water sources were encased in concrete (denoted by ) and the control was that they were not (denoted by ). Their direct outcome was bacterial concentration from encasing the spring and the indirect outcome was diarrheal diseases in the past week.

For the purposes of our project, we take the JPAL direct outcome of bacterial concentration as our exposure , and their indirect outcome of diarrheal diseases as our outcome . For us, the exposure is to a higher amount of bacteria due to having no cement structure (denoted by ), and the control is exposure to a low amount of bacteria due to having a cement structure (denoted by ). Note that the binary exposure is to low concentration and high concentration of the harmful bacteria, and it does not refer to the absence and presence of the bacteria. This allows us to see the experiment as a situation in which there was a harmful exposure (to high concentrations of bacteria) and a negative outcome (diarrheal disease).

The question that must be answered in order to allocate resources to address the problem effectively for specific sub-populations is, “How likely is it that the negative outcome was caused by the exposure, and not by something else?” To reach this estimate, we need a method for estimating PC, which we address in the next section.

3 Deriving an estimator for the probability of causation

Throughout this paper we assume access to an iid sample , where for

a vector of covariates,

an indicator of binary exposure, and an outcome indicator. We also let denote the potential outcome that would have been observed under exposure level . For notational simplicity we let


denote the regression and propensity score functions, respectively.

3.1 Identification

For identification, we make four causal assumptions, which are similar to the ones made by Dawid et al. (2016) and Tian and Pearl (2000):

  1. Positivity: for all such that .

  2. Consistency: .

  3. Monotonicity: .

  4. No unobserved confounders: .111Tian and Pearl (2000) show that if you relax this assumption, you arrive at a different identified parameter.

For a continuous treatment, our identification approach would be very similar.

Using these assumptions, PC equals one minus the risk ratio222Note that sometimes the risk ratio is written as , but we define it as its inverse for convenience., a quantity we call ,


Note that does not involve potential outcomes, and thus it can be estimated with observed data. In other words, is equivalent to the probability of causation under the identifiability assumptions of positivity, consistency, no unobserved confounders, and monotonicity. We seek to estimate , and thus our method is also relevant for the risk ratio.

Another way to write the risk ratio is in terms of the nuisance functions and (sometimes called “outcome regressions”),


The simplest (and most common) method for estimating RR is with a plugin estimator (see, for example, Rothman et al. (2012)), which works by estimating the nuisance functions via generic regression methods, e.g. logistic regressions, and then “plugging in” the estimated regression function into the functional. Thus the plugin estimator of RR is


A common problem with is the variation dependence between RR and the nuisance functions (Richardson et al. (2017)). In other words, it is likely that reasonable parametric models for the right-hand-side of the equation would not match a reasonable model for the left-hand side. For example, if one posited logistic regression models for , , then the ratio would be a complicated non-logistic function. This is sometimes called “non-congeniality”. To avoid this problem, one could use a nonparametric algorithm to find point estimates of

, such as kernel regression, a support vector machine, or a random forest. However, such an approach would typically result in slow convergence rates and a lack of available confidence intervals.

In more recent publications, to estimate RR Richardson et al. (2017)

propose the conditional log odds-product as a preferred nuisance model. Their approach is to develop an unconstrained nuisance model that is variation independent of RR (or RD, the risk difference). They propose doubly-robust estimators of models for (monotone transformations of) RR and RD that are consistent and asymptotically normal even when the nuisance model is misspecified, provided that they have correctly specified the model for the propensity score,

. Doubly-robust estimators combine a form of outcome regression with a model for the exposure (i.e., the propensity score) to estimate the causal effect of an exposure on an outcome. However, the approach in Richardson et al. (2017) still relies on parametric assumptions, both for the nuisance functions and for the RR itself.

Our proposed IFB estimator (for ) does not require any such parametric assumptions. Furthermore, it can be used to derive valid confidence intervals because it is asymptotically normal, as long as the nuisance functions can be estimated at relatively slow nonparametric convergence rates of the order . It also avoids the limitations of plugin estimators, which yield slow convergence rates and no feasible inferential procedures compared to IFB estimators.

3.2 Projection

We do not want to assume that the true function follows a known parametric form, i.e. that is completely known up to some finite-dimensional parameter . Others, such as Richardson et al. (2017) and Tchetgen Tchetgen (2013), tend to assume that the true follows a parametric model because parametric models are easier to work with; however, a disadvantage is that the assumed parametric model will very likely be incorrect.

Furthermore, we want inference by deriving -consistent estimators with influence functions, which yield valid confidence intervals. However, does not have an influence function in a nonparametric model because it is not pathwise differentiable; thus it is similar to a density or regression function, which cannot be estimated at rates in a nonparametric model.

Thus, we target a projection of the true function onto a parametric model that does not make any assumptions about the data distribution. In other words we redefine our parameter of interest as the best-fitting approximation of the true function by a parametric model , where is a vector of real parameters. For example one could use a best fitting linear approximation with , based on least squares error; more details and examples are given shortly. Then, we estimate the parameters of the projection with an influence-function-based (IFB) approach. The IFB approach requires estimation of nuisance functions, which can be accomplished with parametric or nonparametric methods. We propose using nonparametric methods, the more flexible option; it turns out we are nonetheless still able to attain fast parametric rates for estimating .

To clarify, we target a projection of onto (rather than assuming ) because we want to be honest about the fact that we do not know a correct parametric model for . Cox said that “all models are wrong, but some are useful.” Indeed, in this article, we are using a parametric model simply as a tool that is useful for summarizing the data, which is generated according to some true possibly complex form for .



Note is a function of the distribution since it minimizes the error between the model and the true function. We use an

loss-based projection because of its simplicity, but other loss functions could easily be used instead. The function

is a user-specified weight function, and can be defined based on subject matter concerns; for example, if certain parts of the covariate space are more important to approximate well, this can be incorporated into the weight. If all are equally important, one could simply choose a uniform weight . In the case that the model is correct, different choices of the weight give estimators of different efficiencies; otherwise the weight simply defines the projection. One could also consider a data-dependent projection, where the weight equals a kernel function that collapses to a point as sample size grows; this is similar to the approach taken in Kennedy et al. (2017).

Under standard regularity conditions (Tsiatis (2006)), the defined in (10

) satisfies the moment condition


We write for short. This moment condition implicitly defines our parameter , i.e., the coefficients in our best approximation to based on the model ). For this article, we assume the solution is unique.

Our approach for estimating the projection parameter is as follows. We construct an estimator of the moment condition , which we can evaluate at any given value, and then we solve this for the value that sets the estimated moment condition equal to zero. In other words, we define our estimator as the solution to


for an estimated version of the moment condition, to be defined shortly. Therefore the problem essentially reduces to estimating the moment condition at a given fixed value. This differs from classical M-estimation problems (van der Vaart (2000)) in that the moment condition is a complex functional, more complicated than a simple expectation since it involves unknown complex nuisance functions .

A simple way to estimate would be with a plugin estimator

that replaces the expectation with an empirical sample average, and (more problematically) replaces with its estimated version . But this has the usual problems of a plugin (e.g. slower convergence rates if is estimated nonparametrically, unreliable parametric assumptions otherwise).

In contrast, we derive an influence function-based estimator of , which allows for nonparametric estimation of the complex nuisance functions, while still yielding -rates, asymptotic normality, and valid confidence intervals (for both the moment condition itself and the value that sets it to zero).

3.3 Proposed estimator

Here we give the influence function for the moment condition parameter defined based on (10

), which has two crucial consequences: first its variance provides an asymptotic efficiency bound, indicating that no regular asymptotically linear estimator can have smaller MSE without adding assumptions, and second it tells us how to construct efficient estimators that attain such a bound under weak nonparametric conditions.

Influence functions were originally introduced in the robustness literature (Hampel et al. (2011); Huber (1981)). To first order, is the influence of the -th observation on an estimator . Their importance in diverse semi- and nonparametric functional estimation problems was established in ground-breaking work starting in the 1980s and 1990s (Bickel et al. (1993); van der Laan and Robins (2003); van der Vaart (2000); Tsiatis (2006); Robins et al. (2008)). For this reason influence functions have also played a fundamental role in recent causal inference literature, as well as in economics (Chernozhukov et al. (2017)

) and machine learning (

Kandasamy et al. (2015)). Refer to Robins et al. (2000); Ogburn et al. (2015); Toth and van der Laan (2016); Kennedy (2016) for just a few interesting examples.

One can speak of an influence function for an estimator or for a parameter. When we say an estimator has a particular influence function, this means the estimator is asymptotically equivalent to a sample average of the influence function, i.e.,


where the influence function has mean zero and finite variance , and denotes a sample average

. Note the above formulation immediately allows for constructing Wald-style confidence intervals based on the central limit theorem. When we say a parameter has a particular influence function, the meaning is a bit more subtle: this indicates that that function acts as a pathwise derivative of the parameter, where paths refer to parametric submodels (

van der Vaart (2000)). This essentially means the influence function behaves as the derivative term in a von Mises-style distributional Taylor expansion of the parameter. Practically, though, influence functions are crucial because their variance gives an efficiency benchmark, and because they can be used to construct estimators that are nonparametrically efficient with nice properties such as double-robustness. In particular a standard way to construct such estimators is to solve an estimating equation based on the (estimated) influence function.

Our first result gives the efficient influence function of the moment condition , which we then use to construct an estimator .

Theorem 3.1.

Under a nonparametric model, the (uncentered) efficient influence function for the moment condition at any fixed is given by


where we define and .

A proof of the above result is given in the Appendix.

Now we can solve an estimating equation based on the above influence function to construct an appropriate estimator with desirable efficiency and robustness properties. Since the influence function is linear in the parameter this amounts to just computing the sample average of an estimate of (14). And thus solving for the that sets the estimated moment condition equal to zero amounts to just solving the estimating equation.

Therefore our proposed estimator of the minimizer in (10) is defined as the assumed unique value that satisfies




where . Of course from this one can also construct estimates of predicted values based on the best-fitting approximation. In the next section we analyze the asymptotic properties of our proposed estimator.

3.4 Asymptotic properties

Next, we analyze the asymptotic behavior of our influence-function-based (IFB) estimators and , showing that they can be -consistent and asymptotically normal even when the nuisance functions are estimated data-adaptively with flexible nonparametric regression tools. The proof of the result follows similar logic as in Theorem 5.31 from van der Vaart (2000), which analyzes general M-estimators in the presence of complex nuisance functions. Throughout, for a possibly data-dependent function we let denote the squared norm.

Theorem 3.2.

Assume that:

  1. The sequence of functions and its limit are contained in a Donsker class with .

  2. The map is differentiable at the true uniformly in , with invertible derivative matrix (evaluated at the true and ).

Then as long as the proposed estimator is consistent with rate of convergence

Suppose further that:

  1. .

Then the proposed estimator attains the nonparametric efficiency bound, and is asymptotically normal with

and similarly for any fixed we have


Condition 1 of Theorem 3.2 requires that the influence function not be too complex (i.e., it must lie in a Donsker class). This is therefore an implicit complexity restriction on the nuisance functions as well. Donsker classes are described in detail in van der Vaart (2000) and Kennedy (2016), but include for example smooth functions with bounded partial derivatives. However, the Donsker part of Condition 1 can be avoided entirely with sample splitting, i.e., estimating the nuisance functions on one part of the sample and evaluating the influence function on a separate independent part. This was recently termed cross-fitting by Chernozhukov et al. (2017). We use this sample-splitting procedure in our simulations and application. Condition 2 is a weak differentiability condition that essentially just requires a smooth enough projection model .

Theorem 3.2 thus tells us that the rate of convergence of our proposed estimator only has a weak second-order dependence on the errors of the nuisance estimators. In particular, this means will be -consistent and asymptotically normal even if the nuisance estimators only converge at rates. These rates can be attained in nonparametric models under sufficient smoothness, sparsity, or structural (e.g., generalized additive model) conditions.

Let be shorthand for the asymptotic variance of given in Theorem 3.2, so that under the given conditions we have


Then a simple Wald-style 95% confidence interval for is given by


where is any consistent estimator of the asymptotic variance, e.g.,

where is an estimate of the derivative matrix.

Importantly, if we had not used an influence function-based estimator for , the corresponding rates of convergence would not in general involve products of the nuisance errors, so one would expect slower-than- rates for estimating and no available confidence intervals.

3.5 Example: projecting onto a logistic regression

When using the proposed estimator for a dataset, suppose we select logit(). The reader might find it puzzling that we now discuss PC as a logistic model, but recall that it is only the model onto which we are projecting the true function. The consequence of selecting a logit is that we obtain estimates of the outcome, which correspond to PC, and estimated coefficients, just like one does when performing logistic regression, which correspond to the relationship between the covariates and PC.

Specifically, how should these results be interpreted? A logistic regression of on estimates parameter values for via maximum likelihood estimation of the following equation:


In our example, is the probability of causation.

For a binary covariate , the coefficient is the log-odds ratio between the group where and the group where . As is usual, to translate to odds one can just exponentiate the log-odds. In our case, the odds refers to the “odds of causation.” The odds of causation are


If the odds of causation are equal to three, for example, then we can say that it is three times more likely that the outcome was caused by exposure than not. The odds ratio corresponding to covariate , for example, is


If the odds ratio is equal to 0.17, then, for a female the odds that the outcome was caused by the exposure are 0.17 times greater than for a male, .

4 Simulation: IFB-estimator vs. plugin estimator

In this section, we compare the finite sample performance of our nonparametric influence function approach to that of the currently used plugin estimators (both the parametric and nonparametric ones), by simulation333Note: The code for the simulation can be found online at https://github.com/mariacuellar/probabilityofcausation/blob/master/Simulation.R.. We generate our covariates by following the example from Kang and Schafer (2007). We define a vector of four covariates

as normal random variables with mean 0 and variance 1. To see how the estimators performed in a more complicated setting that did not follow simple normal distributions, we used the transformations in

Kang and Schafer (2007) to define a new vector of transformed covariates,


Kang and Schafer (2007) originally defined these transformations for demonstrating that parametric assumptions led to biased results, despite having diagnostics plots that seemed to point to no assumption violations. The are simply a useful set of transformations of normal for our simulation.

We let where . Consider the factorization


where we have


by exchangeability , and we have by monotonicity . Then,


where the first equality follows by exchangeability and consistency, the second by Bayes rule, and the third by definition according to our factorization. Similarly using the fact that ,


Now, for example, we take (so is a logistic regression) and (so ) we get that the functional forms of our outcome regressions are


Note that does not follow a logistic model. Let , as Kang and Schafer (2007) did. To estimate the variance of the influence-function-based estimator, we use the variance from (17). As an alternative to the sandwich variance, we could use the bootstrap, since our estimator is asymptotically normal, although the bootstrap is computationally much slower.

Figure 3: Root-mean-squared-error (RMSE) across 100 simulations. The RMSE is higher for the plugin than for the proposed IFB estimator, and the misspecified model (or in the nonparametric case) has a slower decrease of error over sample size. See Table 1 for full results.

Note that our simulation assumes , and thus we are not testing how projecting onto different forms of affects the estimates. Instead, we chose to focus on demonstrating the superiority of the IFB estimator over the plugin. With simulations, we assessed the performance of the estimators by using the root-mean-squared error (RMSE), absolute bias, and coverage:

Coverage: Proportion of simulations in which the confidence interval covered the true value. (33)

Since our estimates for depend on , we chose to evaluate the bias an RMSE for each of 2,000 simulated covariates, and then take the average of those to obtain one bias and one RMSE per simulation. To compare the performance of the estimators across sample sizes, we tested our method at three sample sizes: 100, 1,000, and 10,000. We used random forests to fit the nuisance parameters, but this could also be done with other nonparametric estimators.

Sample Method Parametric Parametric Nonparametric Nonparametric
size correctly specified, misspecified, with with
100 Plug-in 0.36 (0.35) 0.25 (0.33) 0.09 (0.25) 0.37 (0.38)
Proposed 0.01 (0.07) 0.04 (0.17) 0.01 (0.05) 0.02 (0.1)
1000 Plug-in 0.19 (0.26) 0.22 (0.36) 0.05 (0.19) 0.34 (0.39)
Proposed 0.00 (0.04) 0.04 (0.16) 0.00 (0.03) 0.01 (0.08)
10000 Plug-in 0.07 (0.20) 0.16 (0.32) 0.06 (0.18) 0.14 (0.31)
Proposed 0.00 (0.02) 0.03 (0.14) 0.00 (0.03) 0.01 (0.08)
Table 1: Bias (RMSE) across 100 simulations. GLM and random forests were used for parametric and nonparametric estimation of the nuisance parameters.

Figure 3 (with more complete values in Table 1) shows the motivation for using our nonparametric IFB approach. If the model is correctly specified, the IFB estimates have a greater variance in terms of a difference in a constant than the plugin, but they still have root-n rate and inference under weak conditions. But more common are misspecified parametric models for the nuisance functions. In this case, the error of the plugin estimator does not decrease with sample size. Instead, there is an irreducible amount of bias the prevents the estimates from reaching the true value. With correctly specified parametric models for the nuisance functions, we expect our proposed IFB estimator to perform about as well as the correctly specified parametric plugin model, although in this case it performs better. With misspecified models for the nuisance functions, the IFB estimator performs better than the parametric plugin under misspecified models. Furthermore, if the model is misspecified, the IFB estimator estimates a well-defined quantity.

With nonparametric nuisance functions, the error of the plugin estimates decreases with increasing sample size, even when using the transformed covariates . Since one can never guarantee one knows the true parametric model, it makes sense from this argument that one should want to use nonparametric models for the nuisance functions. However, nonparametric methods do not usually allow for the derivation of valid confidence intervals. The proposed IFB estimator has faster convergence rates than the nonparametric plugin, and it allows for the derivation of confidence intervals under weak conditions.

The purpose of the simulation was to test the convergence rates of the proposed estimator versus those of the plugin. Thus, we have set , whereas in reality the IFB method estimates a projection of the true function onto a parametric model (they are not equal, but instead is projected onto ).

Finally, the coverage of the confidence intervals for the IFB estimates are shown in Table 2. The IFB estimator has close to 95% coverage in the correctly specified parametric setting (which we would not expect to have in most cases) and in the nonparametric cases, although it has higher coverage with the simpler covariates. In the misspecified parametric setting it does not have high coverage, but this is to be expected. We suggest using the nonparametric approach instead.

Sample Parametric Parametric Nonparametric Nonparametric
size correctly specified, misspecified, with with
100 92.10 80.82 94.22 90.01
1,000 94.42 75.23 94.50 92.21
10,000 95.19 69.35 95.23 93.08
Table 2: Coverage across 100 simulations using nonparametric estimation (random forests) for nuisance parameters.

5 Application

5.1 Data

We used the dataset that was carefully gathered by Kremer et al. (2011), as part of a project from the Abdul Latif Jameel Poverty Action Lab (JPAL). 444The dataset, documentation, and article about this study are available online at https://www.povertyactionlab.org/evaluation/cleaning-springs-kenya. The materials are kept in the Harvard Dataverse and are currently maintained by Professor Edward Miguel. Code for the application can be found online at https://github.com/mariacuellar/probabilityofcausation/blob/master/Application. We applied our method to estimate the already mentioned probability of causation,


where is the exposure to high bacterial concentration (unprotected) springs, is the exposure to low bacterial concentration (protected) springs, is the outcome that the household had a child with diarrhea within the past week, is the outcome that no child had diarrhea within the past week, and is the vector of covariates for the households. These include: the child’s gender (0=male, 1=female), the child’s age at baseline, the mother’s years of education at baseline, the water quality at baseline in terms of E.coli MPN (most probable number), whether the home as an iron roof (which could prevent some water cross-contamination at the home), the mother’s hygiene knowledge at baseline (determined from a survey about prevention of diarrheal diseases), the latrine density near the house at baseline (how many latrines are near the house per unit area, which could affect the fecal matter concentration near springs), and the number of children in the household at baseline.

For our analysis, we restricted the sample in the same way as the Kremer et al. (2011)

researchers. We removed all individuals older than age three, all the children of reported users of multiple springs, and all the children flagged as having a problem weight, problem BMI or being a severe height outlier (as defined by the researchers). This reduced the sample size from 45,565 to 22,620. Furthermore, we removed all observations containing any missing values, like

Kremer et al. (2011) do, which reduced the sample size to 2,933. The drop in sample size from 45,565 to 2.933 is concerning, and a more careful view of the data would have to take the missing observations into account. For now, we assume that the values were missing at random, and exploring this assumption further remains as future work.

Figure 4: Distributions of variables for the households in the selected sample for testing the IFB method.

After the data reduction, the sample had 1,446 children who drank from unprotected springs () and 1,487 children who drank from protected springs (). There were 688 children with diarrheal diseases in the past week (), and 2,245 () children without the diseases in the past week. There were 1,457 boys and 1,476 girls. 849 of the households did not have an iron roof, and 2,084 did. Figure 4 shows the histograms of the remaining six variables.

5.2 Results

5.2.1 Result 1: Estimated PC and confidence intervals for a single individual

Probability of causation
Estimate Confidence interval
Parametric plug-in (logistic regressions) 0.69 (0.65, 0.73)
Nonparametric plug-in (random forests) 0.24 (undefined)
Nonparametric proposed (random forests) 0.12 (0.11, 0.13)
Table 3: Estimates of the probability of causation and the corresponding 95% confidence intervals, for a “median” child with covariates given in 5.2.1.

Table 3 shows our results from estimated and its confidence intervals for an individual with specific covariates as an example, by using the currently-used methods and our proposed method. Recall we are interested in the probability of causation: the probability that it was the exposure to high-bacterial springs, and not something else (such as transfer at school or a stomach virus), that caused the child diarrhea in the community of interest, as collected in the data by Kremer et al. (2011).

The covariates selected for Table 3

are the mode (for binary variables) and median (for continuous) variables of the data—we call this child the “median” child: A girl, aged six, whose mother has six years of education, with poor quality water at baseline (level 4), living in a house with an iron roof, baseline hygiene knowledge of 3, latrine density around nearest spring of 0.4 out of 0.6, with five children in the household. The covariates used here were selected for being the most common values for each variable as an example, but PC could be estimated for other covariates as well. We indeed calculated PC for a variety of covariates. Some of the variability of PC can be see in Result 3.

We used random forests to estimate the nuisance functions . According to our estimator, the probability that the child diarrheal disease was caused by the exposure to dirty unprotected springs is 0.12 with a 95% confidence interval of (0.11, 0.13). The parametric plugin method, which is the most commonly used, yielded a dramatically higher estimate of 0.69 with a confidence interval (0.65, 0.73), and the nonparametric plugin method yielded an estimate of 0.24.

The plugin estimates were dramatically different from the IFB estimates, indicating that the parametric plugin model might be highly misspecified. Although the currently used method suggests that it is likely that the diarrheal disease was caused by the exposure to bacteria, our method suggests the opposite: that it is not likely that the diarrheal disease was caused by the exposure to bacteria.

5.2.2 Result 2: The odds of causation

Estimate Robust std. error z value p value
Gender (m:0, f:1) 0.17 0.09 1.99 0.05
Age (in months) -0.21 0.04 -4.93 0.00 ***
Mother’s years of educ. -0.03 0.01 -2.07 0.04 *
Water quality at spring -0.06 0.02 -2.81 0.00 ***
Iron roof indicator -0.05 0.09 -0.57 0.57
Mother’s hygiene knowledge -0.08 0.02 -4.01 0.00 ***
Latrine density near household -0.28 0.24 -1.15 0.25 *
Diarrhea prevention score -0.02 0.02 -1.15 0.25 *
Table 4: Regression results. Estimates are estimated coefficients of model shown in (31), which are the log-odds of causation shown in (32).

Table 4 shows the estimated odds ratios under the column header “Estimate”. For example, the coefficient on latrine density near household is -0.28 (with significance at the 0.01 level). This means that for a one-unit increase in the density, the expected change in log-odds of causation is 0.28 and the change in odds is . So we can say for a one-unit increase in the density, we expect to see about 32% increase in the odds of causation, i.e., the probability that the child’s diarrhea was caused by exposure to the bacteria increases as latrine density increases.

The factors that most affect whether the child’s diarrheal disease was caused by his or her exposure to bacteria in the water springs are: child’s age and latrine density near household. The mother’s years of education, water quality at spring, mother’s hygiene knowledge, and diarrhea prevention score all had a smaller effect, but they were significant at least at the 0.05 level.

This information could be used to learn for which groups of people the risk of harm from being exposed to the bacteria in the water springs is high. The individuals with high odds of causation are the ones who are most harmed by the exposure. Policy makers might want to target these specific subgroups specifically, especially if there is a limited budget.

Selecting a model different than a logistic regression could yield different estimates. In our approach, is very general and is only required to be a parametric model. However, in this application we could arrive at different results.

5.2.3 Result 3: Variation in PC for a given continuous covariate

In the following two examples PC is estimated for an individual with the same covariates as in Result 1, except one covariate is varied to see how the estimated parameter changes. Figure LABEL:fig:pcvsage shows the variation of PC with age. As the child’s age increases, the probability that the diarrheal disease was caused by a bacterial exposure decreases non-linearly. Figure LABEL:fig:pcvsmothereduc shows the variation of PC with the mother’s years of education. We see that the more educated the mother, the less likely it is that the child’s diarrheal disease was caused by the bacterial exposure.

This exercise could be repeated for whatever variables might be of interest. Thus, it is an exercise in exploring the heterogeneity in estimates of PC across the space of covariates. Finding out which groups of individuals have higher estimates of PC could be very useful for researchers when deciding on which sub-populations to implement costly interventions.

6 Discussion

We have derived a novel estimator for the probability of causation. This influence-function-based estimator is for a projection of the identified probability of causation, which does not require making parametric assumptions. This approach has applications in diverse settings, including epidemiology and the law.

The main contribution of our work was providing an estimator for the probability of causation without requiring strong parametric modeling assumptions by using influence functions. We provided an expression for the efficient influence function that involves only derivatives based on the known (and generally simpler) model and weight functions, rather than unknown complex and high-dimensional nuisance functions. Our formulation yields efficient estimators that are easier to construct in practice. We also described the asymptotic properties of our method under weak empirical process conditions and provided simulations to compare it to parametric and nonparametric plugin methods.

Regarding the projection, what is the difference between assuming the data-generating process actually follows an incorrect parametric model () and using a projection approach? If the model happens to be correct, then both the projection approach and a model-based approach will in general yield -consistent estimators with valid confidence intervals, under roughly the same assumptions (though the projection approach may have a larger variance, but only by constant factors). On the other hand, if the model is not correct, then the projection approach is honest about this possibility and still validly defined as a best-fitting approximation, while the parametric model-based approach is no longer formally applicable.

Regarding the choice of parametric model onto which we project the true data, what is the proper way to choose ? After all, selecting different models could yield different estimates. How sensitive the results are to the choice of model depends on how complex the true function is. Kennedy et al. (2017), section 3.4, wrote about how to do model selection using cross-validation in a projection setting for a continuous instrumental variable. They noted that in standard cross-validation it is possible to estimate the risk without bias since the risk does not require the estimation of nuisance parameters. However, in their setting, as in ours, the risk parameter depends on complex nuisance functions via the true parameter of interest and the parametric model being used. They derive an efficient influence function for the risk and then use this as a doubly robust loss function for cross-validation-based model selection. The results in this paper pave the way for a similar data-driven model selection criterion, but we leave the cross-validation type approach for selecting as future work and simply use a logit model as our projection for illustration.

In our application with real-world data, we used the proposed method to study whether exposure to high-bacteria water, as opposed to low-bacteria water, is what caused diarrheal disease in children from a population in Kenya, or if the cause was something else. We estimated the probability of causation by using a dataset from an experiment performed by JPAL. We found that the plugin estimates were dramatically different from the IFB estimates, indicating that the parametric plugin model might be highly misspecified. Our method suggests the opposite: Because PC is low, it is unlikely that the diarrheal disease was caused by the exposure to the high concentration of bacteria in the drinking water. We also found that different populations had different values of PC, that is, there was heterogeneity between groups of individuals. This is an interesting result insofar as it suggests that although some individuals developed the outcome, it is not necessarily the case that it was caused by the exposure, and thus that there might be other causes of this childhood disease. It would be valuable to perform a further medical evaluation to evaluate which groups had higher PC, since future interventions might be more effective for those groups.

The factors that most affect whether the child’s diarrheal disease was caused by his or her exposure to bacteria in the water springs are: age, the mother’s years of education, water quality at spring, mother’s hygiene knowledge, latrine density near household, and diarrhea prevention score. This information could be used to learn for which groups of people the risk of harm from being exposed to the bacteria in the water springs is high. The individuals with high odds of causation are the ones who are most harmed by the exposure. Policy makers might want to target these specific subgroups, especially if there is a limited budget.

Apart from providing useful information about this specific population in Western Kenya, this application served as an example of how one might estimate and interpret the probability of causation, as well as the estimated coefficients of a logistic regression representing PC, the odds of causation. The method derived in this article provides a novel way of analyzing data from randomized trials and other datasets following the assumptions we listed), and thus it could be used to find new insights about causation and public policy. Further work needs to be done to relax the assumption requiring “no unobserved confounders,” which is satisfied by randomized trials, but is an unrealistic assumption for some applications. Nevertheless, estimating PC is appropriate whenever researchers are interested in finding out whether a specific intervention had the effect intended by the policymaker.


The authors are grateful to Greg Ridgeway for his helpful comments.

Conflict of interest

The authors have no conflict of interest to report.


  • Lagakos and Mosteller (1986) Lagakos SW, Mosteller F. Assigned Shares in Compensation for Radiation Related Cancers. Risk Analysis 1986;6(3):345–357.
  • Tian and Pearl (2000) Tian J, Pearl J. Probabilities of causation: Bounds and identification.

    Annals of Mathematics and Artificial Intelligence 2000;28(1-4):287–313.

  • Pearl (2009) Pearl J. Causality: Models, Reasoning and Inference. Cambridge University Press; 2009.
  • Pearl (2014) Pearl J. Causes of Effects and Effects of Causes. Sociological Methods and Research 2014;44(1):149–164.
  • Dawid et al. (2013) Dawid AP, Faigman DL, Fienberg SE. Fitting Science Into Legal Contexts: Assessing Effects of Causes or Causes of Effects? (with Discussion). Sociological Methods and Research 2013;43:359–390.
  • Dawid et al. (2016) Dawid AP, Musio M, Fienberg SE. From Statistical Evidence to Evidence of Causality. Bayesian Analysis 2016;11(3):725–752.
  • Kremer et al. (2011) Kremer M, Leino J, Miguel E, Zwane AP. Spring cleaning: Rural water impacts, valuation, and property rights institutions. The Quarterly Journal of Economics 2011;126:145–205.
  • Rothman et al. (2012) Rothman KJ, Lash TL, Greenland S. Modern Epidemiology. Mid-cycle revision, third ed. LWW; 2012.
  • Richardson et al. (2017) Richardson TS, Robins JM, Wang L. On Modeling and Estimation for the Relative Risk and Risk Difference. Journal of the American Statistical Association 2017;p. 1–10.
  • Tchetgen Tchetgen (2013) Tchetgen Tchetgen E. Estimation of risk ratios in cohort studies with a common outcome: A simple and efficient two-stage approach. The international journal of biostatistics 2013;9(2):251–264.
  • Kennedy et al. (2017) Kennedy EH, Ma Z, McHugh MD, Small DS. Non-parametric methods for doubly robust estimation of continuous treatment effects. Journal of the Royal Statistical Society Series B 2017;79(4):1229–1245.
  • Tsiatis (2006) Tsiatis AA. Semiparametric Theory and Missing Data. Springer Series in Statistics, Springer; 2006.
  • van der Vaart (2000) van der Vaart AW. Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge University Press; 2000.
  • Hampel et al. (2011) Hampel FR, Ronchetti EM, Rousseeuw PJ. Robust statistics: The approach based on influence functions., vol. 114. John Wiley and Sons; 2011.
  • Huber (1981) Huber PJ. Robust Statistics. John Wiley and Sons, Inc.; 1981.
  • Bickel et al. (1993) Bickel PJ, Klaassen CAJ, Ritov Y, Wellner JA. Efficient and Adaptive Estimation for Semiparametric Models. Springer, New York; 1993.
  • van der Laan and Robins (2003) van der Laan MJ, Robins JM. Unified Methods for Censored Longitudinal Data and Causality. New York: Springer; 2003.
  • Robins et al. (2008) Robins JM, Li L, Tchetgen Tchetgen EJ, van der Vaart AW. Higher order influence functions and minimax estimation of nonlinear functionals. Institute of Mathematical Statistics; 2008.
  • Chernozhukov et al. (2017) Chernozhukov V, Chetverikov D, Demirer M, Duflo E, Hansen C, Newey W. Double/Debiased/Neyman Machine Learning of Treatment Effects. arXiv:170108687v1 2017;.
  • Kandasamy et al. (2015) Kandasamy K, Krishnamurthy A, Poczos B, Wasserman L, Robins JM. Influence Functions for Machine Learning: Nonparametric Estimators for Entropies, Divergences and Mutual Informations. arXiv:14114342v3 2015;.
  • Robins et al. (2000) Robins JM, Hernan MA, Brumback B. Marginal Structural Models and Causal Inference in Epidemiology. Epidemiology 2000 September;11(5):550–560.
  • Ogburn et al. (2015) Ogburn EL, Rotnitzky A, Robins JM. Doubly robust estimation of the local average treatment effect curve. Journal of the Royal Statistical Society Series B 2015;.
  • Toth and van der Laan (2016) Toth B, van der Laan MJ. TMLE for Marginal Structural Models Based on an Instrument; 2016, u.C. Berkeley Division of Biostatistics Working Paper Series.
  • Kennedy (2016) Kennedy EH. Semiparametric Theory and Empirical Processes in Causal Inference. In: Statistical causal inferences and their applications in public health research, vol. Part III of ICSA Book Series in Statistics Springer; 2016. p. 141–167.
  • Kang and Schafer (2007) Kang JDY, Schafer JL. Demystifying Double Robustness: A Comparison of Alternative Strategies for Estimating a Population Mean from Incomplete Data. Statistical Science 2007;22(4):523–539.
  • Kennedy et al. (2017) Kennedy EH, Lorch SA, Small DS. Robust causal inference with continuous instruments using the local instrumental variable curve. arXiv:160702566v2 2017;.

7 Appendix

7.1 Proof of Theorem 1

The first step is deriving the influence functions of the nuisance parameters in the simple discrete case. For this task, we cite a useful result.

Lemma 7.1.

When is discrete, the efficient influence function (EIF) for the parameter is given by


where .


A simple proof of the above follows from the heuristic that influence functions are derivatives, and so usual derivative rules apply. In particular since

then by the quotient rule, the EIF of the ratio is


which simplifies to , the desired result. ∎

Continuing our heuristic of working in the simple discrete case, we seek to derive the influence function of the moment condition parameter at a given


where . When we derive the EIF in the discrete case we will be able to see the continuous generalization.

We use the notation to denote the influence function of a quantity . The influence functions of , , and are


Recall the heuristic that influence functions are derivatives. Thus, we employ the chain rule on (

37) to get


To find , we use the quotient rule,


We plug into (42),


we substitute in and , and account for the fact that the sum combined with the indicator “picks out” only the cases with . Finally, we get that the efficient influence function (EIF) of the moment condition at a fixed is


as desired.

The result can be proved more formally by checking that is a pathwise derivative in the sense that

where is any smooth one-dimensional parametric submodel with (Bickel et al. (1993); van der Laan and Robins (2003); van der Vaart (2000)), and is the moment condition evaluated at such a submodel. Then the result follows since under a nonparametric model there is only one influence function and it is thus necessarily the efficiency influence function.

7.2 Proof of Theorem 2

Note that by the definition of our estimator and by the definition of our parameter. Also, note that and by the definition of influence functions.

Thus, we can do the usual von Mises expansion on

into four terms (where we have omitted the arguments on the second line for brevity),


I) Term I can converge to zero for two reasons. If we assume the true functions and their estimators are in Donsker classes, and the functions in the class containing are uniformly bounded away from zero and one, Term I is . Briefly, for the reader unfamiliar with empirical processes, classical empirical process theory deals with the empirical distribution function based on i.i.d. random variables. If are i.i.d. real-valued random variables with distribution function . Donsker classes are important in empirical process theory. They are sets of functions with the property that empirical processes indexed by these classes converge weakly to a certain Gaussian process. This property allows us to say that Term I is . See Kennedy (2016), Section 4.2, and van der Vaart (2000) Chapter 19, for a more in-depth explanation of the Donsker property and how it implies the convergence in distribution to zero of these terms. Alternatively, we can avoid Donsker classes altogether if we use data splitting, and (Kennedy (2016)), which is a weak assumption. In this case, Term I will also be .

II) Term II converges to a normal distribution by the classical central limit theorem, as long as its mean and variance exist. This is because


since by definition of our parameter. is a sample average.

III) Next, we want to show that Term III is . We will need iterated expectations, conditioning on , conditioning on , and using the fact that Our goal here is to show that is at most a second-order term:

This final inequality follows from Cauchy–Schwartz . Thus, we showed that Term III is a second-order error, meaning that it is a sum of a product of errors rather than a single error. This shows that if and are estimated nonparametrically, then for Term IV to be it is required that