# On Multi-Cause Causal Inference with Unobserved Confounding: Counterexamples, Impossibility, and Alternatives

Unobserved confounding is a central barrier to drawing causal inferences from observational data. Several authors have recently proposed that this barrier can be overcome in the case where one attempts to infer the effects of several variables simultaneously. In this paper, we present two simple, analytical counterexamples that challenge the general claims that are central to these approaches. In addition, we show that nonparametric identification is impossible in this setting. We discuss practical implications, and suggest alternatives to the methods that have been proposed so far in this line of work: using proxy variables and shifting focus to sensitivity analysis.

## 1 Introduction

Estimating causal effects in the presence of unobserved confounding is one of the fundamental challenges of causal inference from observational data, and is known to be infeasible in general (Pearl_Causality_2009). This is because, in the presence of unobserved confounding, the observed data distribution is compatible with many potentially contradictory causal explanations, leaving the investigator with no way to distinguish between them on the basis of data. When this is the case, we say that the causal quantity of interest, or estimand, is not identified. Conversely, when the causal estimand can be written entirely in terms of observable probability distributions, we say the query is identified.

A recent string of work has suggested that progress can be made with unobserved confounding in the special case where one is estimating the effects of multiple interventions (causes) simultaneously, and these causes are conditionally independent in the observational data given the latent confounder (wang2018blessings; tran2017implicit; ranganath2018multiple). The structure of this solution is compelling because it admits model checking, is compatible with modern machine learning methods, models real-world settings where the space of potential interventions is high-dimensional, and leverages this dimensionality to extract causal conclusions. Unfortunately, this work does not establish general sufficient conditions for identification.

In this paper, we explore some of these gaps, making use of two simple counterexamples. We focus on the central question of how much information about the unobserved confounder can be recovered from the observed data alone, considering settings where progressively more information is available. In each setting, we show that the information gained about the unobserved confounder is insufficient to pinpoint a single causal conclusion from the observed data. In the end, we show that parametric assumptions are necessary to identify causal quantities of interest in this setting. This suggests caution when drawing causal inferences in this setting, whether one is using flexible modeling and machine learning methods or parametric models.

Despite these negative results, we discuss how it is still possible to make progress in this setting under minor modifications to either the data collection or the estimation objective. We highlight two alternatives. First, we discuss estimation with proxy variables, which can be used to identify causal estimands without parametric assumptions by adding a small number of variables to the multi-cause setting (Miao_Identifying_2016; louizos2017causal). Second, we discuss sensitivity analysis, which gives a principled approach to exploring the set of causal conclusions that are compatible with the distribution of observed data.

## 2 Related Work

This paper primarily engages with the young literature on multi-cause causal inference whose primary audience has been the machine learning community. This line of work is motivated by several applications, including genome-wide association studies (GWAS) (tran2017implicit), recommender systems (wang2018deconfounded), and medicine (ranganath2018multiple). wang2018blessings include a thorough review of this line of work and application areas.

These papers can be seen as an extension of factor models to causal settings. Identification in factor models is an old topic. The foundational results in this area are due to kruskal1989rank and were extended to a wide variety of settings by allman2009identifiability. For more elementary results similar to those in our first counterexample, see bollen1989structural or similar introductory texts on factor analysis.

The approach taken in this paper is an example of sensitivity analysis, which is a central technique for assessing the robustness of conclusions in causal inference. One prominent approach, due to rosenbaum1983assessing, posits the existence of a latent confounder and maps out the causal conclusions that result when unidentified parameters in this model are assumed to take certain values. Our second counterexample takes inspiration from the model suggested in that paper.

## 3 Notation and Preliminaries

Consider a problem where one hopes to learn how multiple inputs affect an outcome. Let $A = (A^{(1)}, \ldots, A^{(m)})$ be a vector of $m$ variables (causes) whose causal effects we wish to infer, and let $Y$ be the scalar outcome variable of interest. We write the supports of $A$ and $Y$ as $\mathcal{A}$ and $\mathcal{Y}$, respectively. For example, suppose that $A$ corresponds to a set of genes that a scientist could, in principle, knock out by gene editing, where $A^{(j)} = 1$ if the gene remains active and $A^{(j)} = 0$ if it is knocked out. In this case, the scientist may be interested in predicting a measure of cell growth $Y$ if various interventions were applied. Formally, we represent this quantity of interest using the $do$-operator:

$$P(Y \mid do(A)),$$

which represents the family of distributions of the outcome $Y$ when the causes $A$ are set to arbitrary values in $\mathcal{A}$ (Pearl_Causality_2009).

In general, it is difficult to infer $P(Y \mid do(A))$ from observational, or non-experimental, data because there may be background factors, or confounders, that drive both the outcome $Y$ and the observed causes $A$. We represent these confounders with the variable $U$, with support $\mathcal{U}$. In the presence of such confounders, the conditional distribution $P(Y \mid A)$ may be different from the intervention distribution $P(Y \mid do(A))$.

If $U$ is observed, the following assumptions are sufficient to identify the intervention distribution:

• Unconfoundedness: $U$ blocks all backdoor paths between $A$ and $Y$, and

• Positivity: $P(A = a \mid U) > 0$ almost surely for each $a \in \mathcal{A}$.

Importantly, under these conditions, no parametric assumptions are necessary to identify the intervention distribution. We say that the intervention distribution is nonparametrically identified under these conditions.

The unconfoundedness assumption ensures that the following relation holds between the unobservable intervention distribution and the observable distributions:

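$$
\begin{aligned}
P(Y \mid do(A = a)) &= E[P(Y \mid do(A = a), U)] \\
&= E[P(Y \mid A = a, U)] \\
&= \int_{\mathcal{U}} P(Y \mid A = a, U = u)\, P(U = u)\, du.
\end{aligned}
\tag{1}
$$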

We will call $P(Y \mid A = a, U = u)$ the conditional outcome distribution, and $P(U)$ the integrating measure. Meanwhile, the positivity assumption ensures that all pairs $(a, u)$ with $u$ in the support of $U$ are observable, so that the integrand in (1) can be evaluated along each point on the path of the integral. The intervention distribution is identified under these conditions because (1) can be written completely in terms of observable distributions.
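To make the distinction between the two averages concrete, the following is a minimal numerical sketch of (1) in a discrete toy problem. It assumes a single binary cause rather than a vector, and all probabilities and function names (`backdoor`, `conditional`) are hypothetical values chosen for illustration, not taken from the paper.

```python
import numpy as np

# Hypothetical toy model: binary confounder U, one binary cause A, binary outcome Y.
p_u = np.array([0.5, 0.5])              # P(U = u) for u = 0, 1
p_a1_given_u = np.array([0.2, 0.8])     # P(A = 1 | U = u)
p_y1_given_au = np.array([[0.1, 0.5],   # P(Y = 1 | A = a, U = u); rows a, columns u
                          [0.3, 0.7]])

def backdoor(a):
    """P(Y = 1 | do(A = a)) as in (1): average over the marginal P(U)."""
    return float(p_y1_given_au[a] @ p_u)

def conditional(a):
    """P(Y = 1 | A = a): average over the posterior P(U | A = a) instead."""
    p_a_given_u = p_a1_given_u if a == 1 else 1.0 - p_a1_given_u
    p_u_given_a = p_a_given_u * p_u / (p_a_given_u @ p_u)
    return float(p_y1_given_au[a] @ p_u_given_a)

for a in (0, 1):
    print(a, backdoor(a), conditional(a))
```

In this toy setup the two quantities differ for both values of `a`, because $P(U \mid A = a)$ differs from $P(U)$; the adjustment in (1) is only computable here because $U$ is treated as observed.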

## 4 Unobserved Confounding and Multiple Causes

When the confounder $U$ is unobserved, the unconfoundedness and positivity assumptions are no longer sufficient for $P(Y \mid do(A))$ to be identified. In this case, additional assumptions are necessary because (1) is no longer a function of observable distributions.

The multi-cause approach attempts to infer (1) from the observable data $(A, Y)$ alone under assumptions about the conditional independence structure of this distribution. Specifically, this approach incorporates the assumption that the observed distribution of causes $P(A)$ admits a factorization by the unobserved confounder $U$. We group these central assumptions in Assumption 1 below, and illustrate them in Figure 1.

###### Definition 1.

We say a variable $U$ factorizes the distribution of a set of variables $A = (A^{(1)}, \ldots, A^{(m)})$ iff

$$P(A) = \int_{\mathcal{U}} \left[\prod_{j=1}^{m} P(A^{(j)} \mid U = u)\right] P(U = u)\, du. \tag{2}$$

###### Assumption 1.

There exists an unobserved variable $U$ such that (i) $U$ blocks all backdoor paths from $A$ to $Y$ and (ii) $U$ factorizes the distribution of $A$.

Under this assumption, the most general form of multi-cause causal inference rests on the following identification claim.

###### Claim 1.

Under Assumption 1, for any variable $V$ that factorizes $A$, the following relation holds:

$$
\int_{\mathcal{V}} P(Y \mid A = a, V = v)\, P(V = v)\, dv \overset{(*)}{=} \int_{\mathcal{U}} P(Y \mid A = a, U = u)\, P(U = u)\, du = P(Y \mid do(A = a)).
\tag{3}
$$

If this claim were true, one could obtain an arbitrary factor model for the observed causes satisfying (2) and calculate $P(Y \mid do(A))$.

In Section 5, we present a simple counterexample that shows that this claim does not hold in general. The crux of the counterexample is that factorizations of $P(A)$ are not always unique, and differences between these factorizations can induce different values for (3).

In light of this counterexample, it is natural to ask whether identification by (1) is feasible in the special case that the factorization of $P(A)$ is unique. In this case, we say the factorization is identified. Depending on the specification, a factor model may be identified under fairly weak conditions, especially when the latent factor $U$ is categorical; allman2009identifiability present a broad set of sufficient conditions.

###### Claim 2.

Under Assumption 1, if the factorization of $P(A)$ is identified, then the intervention distribution $P(Y \mid do(A))$ is identified by (1).

ranganath2018multiple and tran2017implicit make a variation of this claim by supposing that $U$ can be consistently estimated as a function of $A$. In this case, the factorization is identified in the limit where the number of causes $m$ grows large.

###### Claim 3.

Under Assumption 1, if there exists an estimator of $U$ that is a function of $A$, $\hat{U}(A)$, such that

$$\hat{U}(A) \overset{a.s.}{\longrightarrow} U,$$

then the intervention distribution is identified by

$$P(Y \mid do(A = a)) = \int_{\mathcal{U}} P(Y \mid A = a, \hat{U}(A) = u)\, P(\hat{U}(A) = u)\, du. \tag{4}$$

In Section 6, we give a counterexample and a theorem showing that Claim 2 is false except in the trivial case that the observational and intervention distributions coincide; that is, $P(Y \mid do(A)) = P(Y \mid A)$. In a supporting proposition for this result, we show specifically that Claim 3 is false because the consistency premise implies that the positivity assumption is violated.

## 5 Factorization Existence Is Insufficient

### 5.1 Setup

In this section, we show that Claim 1 is false by a thorough exploration of a counterexample. Specifically, we show that, even under Assumption 1, the observed data can be compatible with many distinct intervention distributions $P(Y \mid do(A))$.

Consider a simple setting where all variables are linearly related, and all independent errors are Gaussian. Letting $\epsilon_w \sim N(0, \sigma^2_w)$ for each $w \in \{U, A, Y\}$, the structural equations for this setting are

$$
\begin{aligned}
U &:= \epsilon_U \\
A &:= \alpha U + \epsilon_A \\
Y &:= \beta^\top A + \gamma U + \epsilon_Y
\end{aligned}
$$

Here, $\alpha$ and $\beta$ are $m \times 1$ column vectors, and $\gamma$ is a scalar; $\epsilon_A$ is an $m \times 1$ random column vector with independent components, and $\epsilon_U$ and $\epsilon_Y$ are random scalars. This data-generating process satisfies Assumption 1.

Under this model, the intervention distribution has the following form:

$$P(Y \mid do(A = a)) = N(\beta^\top a,\; \gamma^2 \sigma^2_U + \sigma^2_Y).$$

We will focus specifically on estimating the conditional mean $E[Y \mid do(A = a)] = \beta^\top a$, which is fully parameterized by $\beta$. Thus, our goal is to recover $\beta$ from the distribution of observed data.

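The covariance matrix of $(U, A, Y)$ can be written as

$$
\Sigma_{AYU} =
\begin{pmatrix}
\Sigma_{UU} & \Sigma_{UA} & \Sigma_{UY} \\
\Sigma_{AU} & \Sigma_{AA} & \Sigma_{AY} \\
\Sigma_{YU} & \Sigma_{YA} & \Sigma_{YY}
\end{pmatrix}
$$

where $\Sigma_{AA}$ is $m \times m$, $\Sigma_{AY}$ is $m \times 1$, and $\Sigma_{YY}$ is $1 \times 1$.

The marginal covariance matrix of the observable variables $(A, Y)$ is the bottom-right sub-matrix of this matrix. Its entries are defined by:

$$
\begin{aligned}
\Sigma_{AA} &= \alpha \alpha^\top \sigma^2_U + \mathrm{diag}(\sigma^2_A) \\
\Sigma_{AY} &= \Sigma_{AA} \beta + \gamma \sigma^2_U \alpha \\
\Sigma_{YY} &= (\beta^\top \alpha + \gamma)^2 \sigma^2_U + \beta^\top \mathrm{diag}(\sigma^2_A) \beta + \sigma^2_Y
\end{aligned}
$$

In these equations, the quantities on the left-hand side are observable, while the structural parameters on the right-hand side are unobservable. The goal is to invert these equations to obtain a unique value for $\beta$.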
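The three covariance expressions above can be evaluated directly. The following is a minimal sketch in NumPy; the function name `observable_cov` and the parameter values are hypothetical, chosen only for illustration.

```python
import numpy as np

def observable_cov(alpha, beta, gamma, s2_U, s2_A, s2_Y):
    """Return (Sigma_AA, Sigma_AY, Sigma_YY) implied by the linear-Gaussian model.

    alpha, beta, s2_A are length-m arrays; gamma, s2_U, s2_Y are scalars.
    """
    alpha = np.asarray(alpha, dtype=float)
    beta = np.asarray(beta, dtype=float)
    s2_A = np.asarray(s2_A, dtype=float)
    Sigma_AA = s2_U * np.outer(alpha, alpha) + np.diag(s2_A)
    Sigma_AY = Sigma_AA @ beta + gamma * s2_U * alpha
    Sigma_YY = (beta @ alpha + gamma) ** 2 * s2_U + beta @ (s2_A * beta) + s2_Y
    return Sigma_AA, Sigma_AY, Sigma_YY

# Hypothetical structural parameters with m = 3 causes.
alpha = np.array([1.0, 0.5, -0.3])
beta = np.array([0.2, -0.4, 0.7])
gamma, s2_U, s2_Y = 0.8, 1.5, 0.9
s2_A = np.array([1.0, 2.0, 0.5])

Sigma_AA, Sigma_AY, Sigma_YY = observable_cov(alpha, beta, gamma, s2_U, s2_A, s2_Y)
print(Sigma_AA, Sigma_AY, Sigma_YY, sep="\n")
```

Only these three outputs are available to the analyst; the inputs to `observable_cov` are what one would like to recover.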

### 5.2 Equivalence Class Construction

When $m$ is sufficiently large, the number of equations in this model exceeds the number of unknowns, but there still exists an equivalence class of structural equations with parameters

$$(\alpha_1, \beta_1, \gamma_1, \sigma^2_{U,1}, \sigma^2_{A,1}, \sigma^2_{Y,1}) \neq (\alpha, \beta, \gamma, \sigma^2_U, \sigma^2_A, \sigma^2_Y)$$

that induce the same observable covariance matrix, and for which $\beta_1 \neq \beta$. These parameterizations cannot be distinguished by observed data. In this section, we show how to construct such a class.

The key to this argument is that the scale of $U$ is not identified given $\Sigma_{AA}$, regardless of the number of causes $m$. This is a well-known non-identification result in confirmatory factor analysis (e.g., bollen1989structural, Chapter 7). In our example, the expression for $\Sigma_{AA}$ does not change when $\alpha$ and $\sigma^2_U$ are replaced with the structural parameters $\alpha_1$ and $\sigma^2_{U,1}$:

$$\alpha_1 := c \cdot \alpha, \qquad \sigma^2_{U,1} := \sigma^2_U / c^2.$$
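As a quick check of this invariance, one can substitute the rescaled parameters into the expression for the covariance of the causes. The label $\Sigma_{AA,1}$ for the covariance implied by the rescaled parameters is introduced here only for this check, and $c \neq 0$ is assumed:

```latex
\begin{aligned}
\Sigma_{AA,1}
  &= \alpha_1 \alpha_1^\top \sigma^2_{U,1} + \mathrm{diag}(\sigma^2_A) \\
  &= (c\,\alpha)(c\,\alpha)^\top \frac{\sigma^2_U}{c^2} + \mathrm{diag}(\sigma^2_A) \\
  &= \alpha \alpha^\top \sigma^2_U + \mathrm{diag}(\sigma^2_A)
   = \Sigma_{AA}.
\end{aligned}
```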

In the following proposition, we state how the remaining structural variables can be adjusted to maintain the same observable covariance matrix when the scale is changed.

###### Proposition 1.

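For any fixed vector of parameters $(\alpha, \beta, \gamma, \sigma^2_U, \sigma^2_A, \sigma^2_Y)$ and a valid scaling factor $c$ (defined below), there exists a vector of parameters that induces the same observable data distribution:

$$
\begin{aligned}
\alpha_1(c) &= c \cdot \alpha \\
\beta_1(c) &= \beta + \Sigma_{AA}^{-1} \alpha \cdot \gamma \sigma^2_U \left(1 - \frac{1}{c}\right) \\
\gamma_1(c) &= \gamma \\
\sigma^2_{U,1}(c) &= \sigma^2_U / c^2 \\
\sigma^2_{A,1}(c) &= \sigma^2_A \\
\sigma^2_{Y,1}(c) &= \Sigma_{YY} - (\beta_1^\top \alpha_1 + \gamma_1)^2 \sigma^2_{U,1} - \beta_1^\top \mathrm{diag}(\sigma^2_{A,1}) \beta_1
\end{aligned}
\tag{5}
$$

The factor $c$ is valid if it implies positive $\sigma^2_{Y,1}(c)$.

We call the set of all parameter vectors that correspond to valid values of $c$ the ignorance region in the parameter space. Parameters in the ignorance region cannot be distinguished on the basis of observed data because they all imply the same observed data distribution.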
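The construction in (5) can be checked numerically. The sketch below applies the map for one scaling factor and confirms that the observable covariance is unchanged while the effect vector moves. The helper names (`observable_cov`, `rescaled_params`) and all parameter values are hypothetical, chosen for illustration.

```python
import numpy as np

def observable_cov(alpha, beta, gamma, s2_U, s2_A, s2_Y):
    """Covariance of the observables (A, Y) implied by the structural parameters."""
    Sigma_AA = s2_U * np.outer(alpha, alpha) + np.diag(s2_A)
    Sigma_AY = Sigma_AA @ beta + gamma * s2_U * alpha
    Sigma_YY = (beta @ alpha + gamma) ** 2 * s2_U + beta @ (s2_A * beta) + s2_Y
    return Sigma_AA, Sigma_AY, Sigma_YY

def rescaled_params(c, alpha, beta, gamma, s2_U, s2_A, s2_Y):
    """Apply the map in (5); raise ValueError if c is not a valid scaling factor."""
    Sigma_AA, _, Sigma_YY = observable_cov(alpha, beta, gamma, s2_U, s2_A, s2_Y)
    alpha1 = c * alpha
    beta1 = beta + np.linalg.solve(Sigma_AA, alpha) * gamma * s2_U * (1.0 - 1.0 / c)
    s2_U1 = s2_U / c ** 2
    s2_Y1 = Sigma_YY - (beta1 @ alpha1 + gamma) ** 2 * s2_U1 - beta1 @ (s2_A * beta1)
    if s2_Y1 <= 0:
        raise ValueError("scaling factor c implies a non-positive outcome variance")
    return alpha1, beta1, gamma, s2_U1, s2_A, s2_Y1

# Hypothetical parameters: m = 3, all entries equal to 1.
params = (np.ones(3), np.ones(3), 1.0, 1.0, np.ones(3), 1.0)
params1 = rescaled_params(2.0, *params)

same = all(np.allclose(x, y)
           for x, y in zip(observable_cov(*params), observable_cov(*params1)))
print("same observable covariance:", same)
print("beta:", params[1], "beta_1:", params1[1])
```

Sweeping `c` over a grid and keeping the values for which `rescaled_params` does not raise gives a numerical picture of the ignorance region for this parameter setting.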

We plot an illustration of the ignorance region for $\beta$ from a numerical example in Figure 2. In this example, we set $\alpha$ and $\beta$ to be constant vectors, each a scalar multiple of $1_{m \times 1}$. In this case $\beta_1(c)$ is a simple scaling of $\beta$, so the ignorance region can be represented by the value of this scalar. In this example, the data cannot distinguish between effect vectors that have the opposite sign of the true effect vector $\beta$, and those that overstate the effect of $A$ by nearly a factor of 2.

### 5.3 Large-m Asymptotics

The ignorance region does not in general disappear in the large treatment number (large-$m$) limit. Here, we extend our example to an asymptotic frame where the ignorance region maintains the same (multiplicative) size even as $m$ goes to infinity. Consider a sequence of problems where the number of treatments analyzed in each problem is increasing in the sequence. Each problem has its own data generating process, with some structural parameters indexed by $m$: $(\alpha_m, \beta_m, \sigma^2_{A,m})$. We keep the scalar parameters not indexed by $m$ fixed.

We consider the marginal variance of each $A^{(k)}$ to be fixed, so for some fixed scalar $s_0^2$, for each problem $m$,

$$\sigma^2_{A,m} = 1_{m \times 1} \cdot s_0^2.$$

Likewise, we expect the marginal variance of $Y$ to be relatively stable, no matter how many treatments we choose to analyze. Given our setup, this means that if the number of treatments is large, the effect of each individual treatment on average needs to become smaller as $m$ grows large, or else the variance of $Y$ would increase in $m$ (this is clear from the specification of $\Sigma_{YY}$). To handle this, we fix some constant scalars $a_0$ and $b_0$ and assume that, for problem $m$,

$$\alpha_m = 1_{m \times 1} \cdot a_0 / \sqrt{m}; \qquad \beta_m = 1_{m \times 1} \cdot b_0 / \sqrt{m}.$$

Thus, as $m \to \infty$, the norms of $\alpha_m$ and $\beta_m$, as well as their inner product $\alpha_m^\top \beta_m$, which appears in the expression for $\Sigma_{YY}$, remain fixed.

Footnote 1: The asymptotic frame in this section is not the only way to maintain stable variance in $Y$ as $m$ increases. In particular, one could specify the sequence of problems so that they are projective, and simulate an investigator incrementally adding causes to a fixed analysis. One could then define a sequence of coefficients for each cause added to the analysis, putting some conditions on the growth of the inner product $\alpha_m^\top \beta_m$ and norm of $\beta_m$, such as a sparsity constraint on $\beta_m$. Our setup here is simpler.

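Under this setup, the interval of valid values for the latent scaling factor $c$ remains fixed for any value of $m$. For a fixed $c$ in this interval, we examine how the corresponding shift vector $\Delta_{\beta,m}(c)$ behaves as $m$ grows large. The components of the shift scale as $m^{-1/2}$. Specifically, applying the Sherman-Morrison formula,

$$
\begin{aligned}
\Delta_{\beta,m}(c) &= \Sigma_{AA}^{-1} \alpha_m \cdot \gamma \sigma^2_U \left(1 - \frac{1}{c}\right) \\
&= m^{-1/2} \cdot 1_{m \times 1} \cdot \frac{a_0}{s_0^2 + \sigma^2_U a_0^2} \cdot \gamma \sigma^2_U \left(1 - \frac{1}{c}\right).
\end{aligned}
$$

Thus, for each $k$, the ratio of the $k$th component of the shift vector relative to the $k$th component of the true parameters remains fixed in $m$:

$$\frac{\Delta^{(k)}_{\beta,m}(c)}{\beta^{(k)}_m} = \frac{a_0}{b_0 (s_0^2 + \sigma^2_U a_0^2)} \cdot \gamma \sigma^2_U \left(1 - \frac{1}{c}\right).$$

Thus, even asymptotically as $m \to \infty$, there is no identification.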
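The closed-form ratio can be compared against a direct linear solve for several values of $m$. This is a minimal sketch under the asymptotic frame described above; the function names and default parameter values are hypothetical.

```python
import numpy as np

def shift_ratio(m, c, a0=1.0, b0=1.0, s0_sq=1.0, s2_U=1.0, gamma=1.0):
    """Componentwise ratio of the shift vector to beta_m, via a direct linear solve."""
    alpha = np.full(m, a0 / np.sqrt(m))
    beta = np.full(m, b0 / np.sqrt(m))
    Sigma_AA = s2_U * np.outer(alpha, alpha) + s0_sq * np.eye(m)
    delta = np.linalg.solve(Sigma_AA, alpha) * gamma * s2_U * (1.0 - 1.0 / c)
    return delta / beta

def closed_form_ratio(c, a0=1.0, b0=1.0, s0_sq=1.0, s2_U=1.0, gamma=1.0):
    """Ratio obtained from the Sherman-Morrison expression; does not depend on m."""
    return a0 / (b0 * (s0_sq + s2_U * a0 ** 2)) * gamma * s2_U * (1.0 - 1.0 / c)

for m in (2, 10, 200):
    print(m, shift_ratio(m, c=2.0)[0], closed_form_ratio(c=2.0))
```

With these illustrative defaults, the numerical ratio agrees with the closed form at every $m$ tried, which is the sense in which the relative size of the ignorance region does not shrink as causes are added.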

## 6 Factorization Uniqueness Is Insufficient

### 6.1 Impossibility of Nonparametric Identification

In this section, we consider identification in the multi-cause setting in the special case where the factorization of $P(A)$ by $U$ is unique; that is, we consider the case where $P(A)$ can be decomposed uniquely into a mixing measure $P(U)$ and a conditional treatment distribution $P(A \mid U)$. In this setting, Claims 2 and 3 assert that $P(Y \mid do(A))$ is identified by (1). This claim arises as a natural response to the counterexample in the last section, where the non-uniqueness of the factorization contributes to non-identification.

As in the last section, we show via counterexample that the conditions in these claims are insufficient for identification of $P(Y \mid do(A))$. In addition, we show that, in general, parametric assumptions about the conditional outcome distribution $P(Y \mid A, U)$ are necessary to identify $P(Y \mid do(A))$, except in the case where there is no confounding, i.e., the intervention distribution $P(Y \mid do(A))$ and the observed conditional distribution $P(Y \mid A)$ are equal almost everywhere. We summarize this statement in the following theorem.

###### Theorem 1.

Suppose that Assumption 1 holds, that $P(U, A)$ is identified, and that the model for $P(Y \mid U, A)$ is not subject to parametric restrictions.

Then either $P(Y \mid do(A)) = P(Y \mid A)$ almost everywhere, or $P(Y \mid do(A))$ is not identified.

Put another way, our theorem states that nonparametric identification is impossible in this setting, except in the trivial case. In this section, we prove two supporting propositions for this theorem and demonstrate them in the context of our running example. The proof of the theorem, which follows almost immediately from these propositions, appears at the end of the section.

### 6.2 Counterexample Setup

Let $U$ be a binary latent variable, and $A = (A^{(1)}, \ldots, A^{(m)})$ a vector of $m$ binary causes. In the structural model, we assume that the individual causes $A^{(k)}$ are generated independently and identically as a function of $U$. Let the outcome $Y$ be binary, and generated as a function of $U$ and $A$.

$$
\begin{aligned}
U &:= \mathrm{Bern}(\pi_U) \\
A^{(k)} &:= \mathrm{Bern}(p_A(U)), \quad k = 1, \ldots, m \\
Y &:= \mathrm{Bern}(p_Y(U, A))
\end{aligned}
$$

In addition to this structural model, we assume that $m \geq 3$ and that $p_A(U)$ is a non-trivial function of $U$. These assumptions are sufficient for the factorization of $P(A)$ by $U$ to be unique (kruskal1989rank; allman2009identifiability). Thus, this example satisfies the premise of Claim 2.

Our goal is to estimate the intervention distribution $P(Y \mid do(A = a))$ for each $a \in \mathcal{A}$ using the identity in (1). Here, the intervention distribution can be summarized by the following causal parameter:

$$\pi_{Y \mid do(a)} := P(Y = 1 \mid do(A = a)) = (1 - \pi_U)\, p_Y(0, a) + \pi_U\, p_Y(1, a). \tag{6}$$

Because the factorization of $P(A)$ is identified, $\pi_U$ and $p_A(U)$ are identifiable. Thus, to calculate (6), it remains to recover $p_Y(u, a) = P(Y = 1 \mid U = u, A = a)$.

We will show that this conditional probability cannot be recovered from the observed data. Our approach will be to analyze the residual distribution $P(U, Y \mid A)$. For each value $A = a$, we can characterize $P(U, Y \mid A = a)$ by 4 probabilities in a $2 \times 2$ table (see the accompanying $2 \times 2$ table figure). We use the shorthand notation $p_{uy \mid a} := P(U = u, Y = y \mid A = a)$ to denote the value in each cell of this table. The values in this table are unobservable, but they are subject to several constraints. The entries of this table are constrained to be positive and sum to 1. In addition, because $P(U, A)$ and $P(Y, A)$ are identified, the margins of the table are constrained to be equal to probabilities given by $P(U \mid A = a)$ and $P(Y \mid A = a)$. We refer to these probabilities with the shorthand $\pi_{U \mid a} := P(U = 1 \mid A = a)$ and $\pi_{Y \mid a} := P(Y = 1 \mid A = a)$. Using this notation, we can rewrite (1) as:

$$\pi_{Y \mid do(a)} = (1 - \pi_U) \frac{p_{01 \mid a}}{1 - \pi_{U \mid a}} + \pi_U \frac{p_{11 \mid a}}{\pi_{U \mid a}}. \tag{7}$$
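To illustrate how the margin constraints leave (7) undetermined, the sketch below treats the cell probability for $U = 1, Y = 1$ as the single free entry of the table for one fixed $a$, lets it range over the values allowed by the margins, and evaluates (7) at the two extremes. This is an illustration for a single table under hypothetical margin values, not the argument used in the proofs; the function names are also hypothetical.

```python
def do_probability(p11, pi_U, pi_U_a, pi_Y_a):
    """Evaluate (7) given the free cell p11 = P(U=1, Y=1 | A=a) and the margins."""
    p01 = pi_Y_a - p11  # P(U=0, Y=1 | A=a), fixed by the Y margin
    return (1.0 - pi_U) * p01 / (1.0 - pi_U_a) + pi_U * p11 / pi_U_a

def do_probability_range(pi_U, pi_U_a, pi_Y_a):
    """Range of (7) over tables with non-negative cells and the given margins.

    (7) is linear in p11, so its extremes occur at the endpoints of the
    interval of p11 values that keep all four cells non-negative.
    """
    lo = max(0.0, pi_U_a + pi_Y_a - 1.0)
    hi = min(pi_U_a, pi_Y_a)
    ends = [do_probability(p, pi_U, pi_U_a, pi_Y_a) for p in (lo, hi)]
    return min(ends), max(ends)

# Hypothetical margins: observing A = a shifts beliefs about U from 0.5 to 0.8.
print(do_probability_range(pi_U=0.5, pi_U_a=0.8, pi_Y_a=0.6))

# If A = a carries no information about U, the range collapses to P(Y=1 | A=a).
print(do_probability_range(pi_U=0.5, pi_U_a=0.5, pi_Y_a=0.6))
```

The endpoints correspond to tables with a zero cell, so they are limits of the strictly positive tables considered in the text rather than attainable values.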

We consider identification in two separate cases, depending on the amount of information about $U$ that is contained in the event $A = a$. We first consider the case where $P(U \mid A = a)$ is non-degenerate, so that there is residual uncertainty about $U$ after $A$ is observed. We then consider the degenerate case, where $U$ can be deterministically reconstructed as a function of $A$ (the premise of Claim 3).