DeepAI

# Doubly Robust Criterion for Causal Inference

The semiparametric estimation approach, which includes inverse-probability-weighted and doubly robust estimation using propensity scores, is a standard tool for marginal structural models basically used in causal inference, and is rapidly being extended and generalized in various directions. On the other hand, although model selection is indispensable in statistical analysis, information criterion for selecting an appropriate marginal structure has just started to be developed. In this paper, based on the original idea of the information criterion, we derive an AIC-type criterion. We define a risk function based on the Kullback-Leibler divergence as the cornerstone of the information criterion, and treat a general causal inference model that is not necessarily of the type represented as a linear model. The causal effects to be estimated are those in the general population, such as the average treatment effect on the treated or the average treatment effect on the untreated. In light of the fact that doubly robust estimation, which allows either the model of the assignment variable or the model of the outcome variable to be wrong, is attached importance in this field, we will make the information criterion itself doubly robust, so that either one of the two can be wrong and still be a mathematically valid criterion.

• 1 publication
• 6 publications
03/29/2022

### Information criteria for sparse methods in causal inference

For propensity score analysis and sparse estimation, we develop an infor...
10/23/2019

### Doubly robust treatment effect estimation with missing attributes

Missing attributes are ubiquitous in causal inference, as they are in mo...
08/31/2018

### The Causal Effect of Answer Changing on Multiple-Choice Items

The causal effect of changing initial answers on final scores is a long-...
06/22/2021

### Finding Valid Adjustments under Non-ignorability with Minimal DAG Knowledge

Treatment effect estimation from observational data is a fundamental pro...
11/01/2022

### Robust Direct Learning for Causal Data Fusion

In the era of big data, the explosive growth of multi-source heterogeneo...
09/08/2021

### Parameterizing and Simulating from Causal Models

Many statistical problems in causal inference involve a probability dist...
08/13/2019

### Optimal Estimation of Generalized Average Treatment Effects using Kernel Optimal Matching

In causal inference, a variety of causal effect estimands have been stud...

## 1 Introduction

Marginal structural model (Robins 1997, Robins et al. 2000) is basic in causal inference. It assumes that there are as many potential outcome variables as there are treatments, but that only the outcome variable corresponding to the assigned treatment is observed. There is confounding between the outcome and the assignments, and if this is not taken into account and the estimation is done naively with a marginal likelihood, the estimates will have a bias. The bias can be removed if the correlation is modeled correctly, but a semiparametric approach such as inverse probability weighted estimation (Robins et al. 1994) or doubly robust estimation (Scharfstein et al. 1999, Bang and Robins 2005) is often taken without such difficult modeling.

For example, let be a potential outcome variable when the sample is observed at time , be an assignment variable that takes when is observed and when is not observed, be a confounding variable, we consider one of the simplest marginal structural logistic models whose marginal probability function for the outcome variable is , where , and whose conditional probability function for the assignment variable is (see, for example, Hernán and Robins 2020). Here, is the propensity score introduced by Rosenbaum and Rubin (1983), and is the parameter relating to the causal effect. In this model, when also depends on , although and are correlated, if we obtain the estimator of by maximizing , it will have a bias. The inverse-probability-weighted estimation is a method that uses propensity scores to reproduce the pseudo-complete data and gives an estimator by maximizing .

In this statistical problem, it is natural to treat the model selection of whether the structure relating to the causal effect is -dimensional or -dimensional, rather than -dimensional . However, for this basic selection, there is no theoretically guaranteed information criteria. To be more specific, Platt et al. (2013) pioneered the information criterion for such marginal structural models, proposing the use of inverse-probability-weighting for the goodness-of-fit term. While it is quite reasonable, it uses the number of parameters of the marginal structure as the penalty term, which will lead a considerable underestimation in terms of bias correction. Baba et al. (2017) asymptotically evaluate the bias and correct the penalty term easily and significantly. However, this is a C criterion type (Mallows 1973) that can only handle linear models such as , and cannot deal with the problem described above. In addition, it does not deal with the average treatment effect on the treated or the average treatment effect on the untreated, which is often treated in causal inference, i.e., the causal effect when the target population is not necessarily the whole. In this paper, we derive an information criterion of the AIC-type (Akaike 1973) that overcomes these problems.

When the propensity score is unknown, we assume some parametric function for , obtain an estimator from and use instead of , that is a simple method. On the other hand, if it is supposed that this modeling may be misspecified, we assume some parametric function for the conditional distribution of the outcome variable given the confounding variable. Then, an estimator is obtained from , and is used to provide the doubly robust estimation, that is also standard now. If the modeling of either or is correct, then becomes a consistent estimator. The information criterion given in Baba et al. (2017) covers doubly robust estimation, but it is also inadequate in that it is only derived when both modelings are correct. In this paper, we develop an information criterion that is mathematically valid as long as either modeling is correct, i.e., the information criterion itself is doubly robust. As one can imagine from Konishi and Kitagawa (1996) and Baba et al. (2017), the penalty term of the information criterion for such a general M-estimation depends on the true values of the parameters, which are usually replaced by their consistent estimators. For the simple problem described above, it depends on the true values of and . Although one of their consistent estimators will not be given in the setting of doubly robust estimation, the problem is well avoided here.

The organization of this paper is as follows. First, in Section 2, we describe the model and assumptions, define general causal effects in a setting where the target population is not necessarily the whole, and introduce the inverse-probability-weighted estimator and the doubly robust estimator. In Section 3, we give a risk function based on the Kullback-Leibler divergence, which is naturally defined when considering such estimations, and define an AIC-type criterion, following the usual method for deriving AIC. Then, we asymptotically evaluate the penalty term of the information criterion for inverse-probability-weighted estimation. In Section 4, we derive the information criterion for doubly robust estimation, keeping in mind that one of the models of the assignment variable and the outcome variable may be misspecified. In order to explore the possibility of extending the derived information criterion, Section 5

tries to generalize divergence and weight functions, in order to treat, for example, a loss function robust to outliers and a covariate balancing propensity score. Finally, Section

6 summarizes the conclusions.

## 2 Preparation

### 2.1 Model and assumption

In this paper, we treat marginal structural models

 y=H∑h=1t(h)y(h),y(h)∼f(⋅∣x(h);θ)

where is an assignment variable that becomes when the -th treatment is assigned (), is a potential outcome variable when the -th treatment is assigned, is an explanatory variable for , is the probability function for regression of on , and is the parameter used there (). Note that on the left-hand side is an observed outcome variable. Also, may contain some of the confounding variables, but for simplicity, the others are assumed to be non-random.

In this model, the potential outcome variables ’s with are regarded to be missing. In addition, since in general, naively estimating from the observed values alone will result in a bias. In this paper, we suppose that the confounding variables for and are observed such that this bias can be removed. Moreover, we assume the strongly ignorable treatment assignment condition

 {y(1),y(2),…,y(H)}⊥⊥{t(1),t(2),…,t(H)}∣z,

which is intended to allow for the removal of this bias (Imbens 2000). We also assume the positive value condition . There are samples following this model, and the variables in the -th samples are denoted by subscript . Let be the variables, and we assume that the samples are independent, i.e.

 ui⊥⊥uj(i≠j; i,j∈{1,2,…,N}),

which naturally implies that .

The parameter to be estimated is in the population where the -th treatment group is times larger than actual. If and or , then this is the parameter for the average treatment effect on the treated or the untreated. This satisfies

 H∑h,k=1E{d(k)t(k)∂∂θlogf(y(h)∣x(h);θ)}=0p. (1)

### 2.2 Semiparametric estimation

If the relationship between the potential outcome variable and the confounding variable can be modeled correctly, then the causal effect can be estimated consistently from the maximum likelihood method under the ignorable treatment assignment condition, but in general this modeling is difficult. In recent years, a semi-parametric approach using the propensity score , which does not necessarily require this correct modeling, is often used. Here, is a parameter that characterizes the function of the propensity score. In this paper, we will discuss two typical estimation methods in this approach.

The first type is the inverse-probability-weighted estimation (Robins et al. 1994). In this estimation method, the missing values are pseudo-recovered by multiplying the observed values by the inverse of the propensity score as weights, and then the usual estimation is implemented. Specifically, considering that the causal effect depends on , we define a weighted loss function using the weights , and the inverse-probability-weighted estimator is given by solving

 1NN∑i=1H∑h=1t(h)iw(h)(zi;α)∂∂θlogf(y(h)i∣x(h)i;θ)=0p. (2)

Conditional on under the strongly ignorable treatment assignment condition, is independent of and has the expectation , and the left-hand side of (2) converges to the left-hand side of (1). This indicates that the inverse-probability-weighted estimator is consistent, i.e., . Moreover, by substituting into (2) and expanding it, we obtain

 ^θIPW−θ=1NA(θ,α)−1N∑i=1H∑h=1t(h)iw(h)(zi;α)∂∂θlogf(y(h)i∣x(h)i;θ){1+oP(1)}, (3)

where

 (4)

The above assumes that the propensity score is known, i.e., is known, but in general it is often unknown, in which case some estimator is substituted into . For , for example, we only have to construct the likelihood function for as the multinomial distribution with the probability and use the maximum likelihood estimator.

Although is correlated with in the marginal structural model, the inverse-probability-weighted estimation does not directly use the information in to estimate the expectation of . The doubly robust estimation (Scharfstein et al. 1999; Bang and Robins 2005) improves the inverse-probability-weighted estimation by doing it. Denoting the conditional distribution of given by with a further parameter , we take the expectation of with and write the conditional expectation as . In a practical sense, for , some consistent estimator is substituted. Specifically, in the setting of unknown propensity scores, we add

 1NN∑i=1H∑h=1{H∑k=1d(k)t(k)i−t(h)iw(h)(zi;α)}∂∂θg(h)(x(h)i,zi;θ,β) (5)

to the left-hand side of (2), substitute estimators into and , and then the doubly robust estimator is given by solving with respect to that it equals to . This estimator is known to not only improve the inverse-probability-weighted estimator but also be semiparametric locally efficient (Robins and Rotnitzky 1995). In addition, if either the propensity score or the conditional expectation is specified correctly, it is consistent, and hence is said to be doubly robust.

## 3 Inverse-probability-weighted criterion

### 3.1 Risk function for causal inference

Before defining the risk function to derive the information criterion for causal inference, we will explain QIC proposed by Platt et al. (2013). This is a criterion for missing data such as potential outcome variables, where the first item is times the loss function used in the inverse-probability-weighted estimation, and the second item is times the number of parameters. In the setting of Section 2.2, when is unknown, it is defined as

 QICw=−2N∑i=1H∑h=1t(h)iw(h)(zi;^α)logf(y(h)i∣x(h)i;^θIPW)+2p. (6)

When there are no weights in QIC

, it can be confirmed that this is an asymptotic unbiased estimator of the risk function based on the Kullback-Leibler divergence. On the other hand, when weights are present, the variance of the first term increases, and therefore we should make the second term be larger.

Similarly to usual AIC-type information criteria, let be a copy of

, i.e., a random vector that independently and identically distributed according to the distribution of

, let be an appropriate consistent estimator of with square of root convergence, and it is assumed that we can write using a random vector which is and whose expectation is . Then, we define a risk function

 −2N∑i=1H∑h=1E{t(h)†iw(h)(z†i)logf(y(h)†i∣x(h)†i;^θ)} (7)

such that the first term of (6) is naturally used as its naive estimator. Here, is a limit of , and itself when the propensity score is known. This risk function can be regarded to be based on the Kullback-Leibler divergence of the true and estimated distributions in a population where the -th treatment group is times larger than the actual. Also, as can be seen from (2), the estimation aims to minimize this risk function, and in that sense, it is a natural quantity. The first term in (6) is an evaluation of this risk function using the same data as those for the estimator, so it tends to be smaller than the true value. Therefore, denoting the bias as

 =−2E[N∑i=1H∑h=1t(h)iw(h)(zi){logf(y(h)i∣x(h)i;^θ)−logf(y(h)i∣x(h)i;θ)} =−2E[−N∑i=1H∑h=1t(h)†iw(h)(z†i){logf(y(h)†i∣x(h)†i;^θ)−logf(y(h)†i∣x(h)†i;θ)}], (8)

we define as the weak limit of the quantity in the expectation in (8), and use as the asymptotic bias for the correction. If (8) is expanded with respect to around , we get

 blimit=−2ξ′N∑i=1H∑h=1t(h)iw(h)(zi)∂∂θlogf(y(h)i∣x(h)i;θ) blimit=+2ξ′N∑i=1H∑h=1t(h)†iw(h)(z†i)∂∂θlogf(y(h)†i∣x(h)†i;θ).

The expectation of the second term is divided into the expectation of and the expectation of the other terms, and the former is , so we obtain

 (9)

When this asymptotic bias contains unknown parameters, they are replaced by their consistent estimators, as is done in deriving the usual AIC-type information criterion, and it is added to the first term of (6) to construct the information criterion.

### 3.2 Case of known propensity score

Since the error of the inverse-probability-weighted estimator can be written as (3), we hereafter evaluate

 E(blimit)=−2NN∑i,j=1H∑h,k=1E[ {t(h)iw(h)(zi;α)∂∂θ′logf(y(h)i∣x(h)i;θ)} A(θ,α)−1{t(k)jw(k)(zj;α)∂∂θlogf(y(k)j∣x(k)j;θ)}]. (10)

When , this expectation is divided into the expectation with respect to and the expectation with respect to because of the independency, and the former one is summed by to get

 E{H∑h=1t(h)iw(h)(zi;α)∂∂θlogf(y(h)i∣x(h)i;θ)} =Ezi[H∑h=1E{H∑k=1d(k)e(k)(zi;α)}E{∂∂θlogf(y(h)i∣x(h)i;θ) ∣∣∣ zi}] =E{H∑k,h=1d(k)e(k)(zi;α)∂∂θlogf(y(h)i∣x(h)i;θ)} =0p. (11)

The last equality is derived from (1). Therefore, the expectation of (10) becomes

When , this expectation is because . When , this expectation becomes

 E{t(h)iw(h)(zi;α)2∂∂θlogf(y(h)i∣x(h)i;θ)∂∂θ′logf(y(h)i∣x(h)i;θ)} =Ezi[E(t(h)i∣zi)w(h)(zi;α)2E{∂∂θlogf(y(h)i∣x(h)i;θ)∂∂θ′logf(y(h)i∣x(h)i;θ) ∣∣∣ zi}] =E[{H∑k=1d(k)e(k)(zi;α)}21e(h)(zi;α)∂∂θlogf(y(h)i∣x(h)i;θ)∂∂θ′logf(y(h)i∣x(h)i;θ)]. (12)

Therefore, by defining

 (13)

the following theorem can be obtained.

###### Theorem 1.

Let us define and by (4) and (13), respectively. Then, the asymptotic bias of the information criterion for the inverse-probability-weighted estimation is given by

 E(blimit)=−2tr{A(θ,α)−1B(θ,α)}

when the propensity score is known.

Based on this result and the fact that is a consistent estimator of , we propose

 IPWIC1≡−2N∑i=1H∑h=1t(h)iw(h)(zi;α)logf(y(h)i∣x(h)i;^θIPW)+2tr{A(^θIPW,α)−1B(^θIPW,α)}

as an information criterion for the inverse-probability-weighted estimation when the propensity score is known.

### 3.3 Case of unknown propensity score

If the parameter in the propensity score is unknown, then we only have to maximize the log-likelihood to give . From this log-likelihood, the score function is , Fisher information matrix is

 I1(α)≡H∑h=1E{1e(h)(z;α)∂∂αe(h)(z;α)∂∂α′e(h)(z;α)}, (14)

and then the error of is expressed as

 ^α−α=1NI1(α)−1H∑h=1N∑i=1t(h)i∂∂αloge(h)(zi;α){1+oP(1)}. (15)

From this and (3), the error of the inverse-probability-weighted estimator is

 ^θIPW−θ =1NA(θ,^α)−1 =N∑i=1H∑h=1t(h)i{w(h)(zi;α)+∂∂α′w(h)(zi;α)(^α−α)}∂∂θlogf(y(h)i∣x(h)i;θ){1+oP(1)} =1NA(θ,α)−1N∑i=1H∑h=1{t(h)iw(h)(zi;α)∂∂θlogf(y(h)i∣x(h)i;θ) =A(θ,α)−11NN∑i=1H∑h=1{−Λ1(θ,α)′I1(α)−1t(h)i∂∂αloge(h)(zi;α)}{1+oP(1)}, (16)

where

 Λ1(θ,α)≡H∑h=1E{−e(h)(z;α)∂∂αw(h)(z;α)∂∂θ′logf(y(h)∣x(h);θ)}. (17)

Since the error can be written like this, based on (9), we hereafter evaluate

 E(blimit) =−2NN∑i,j=1H∑h,k=1tr(A(θ,α)−1E[t(h)iw(h)(zi;α)∂∂θlogf(y(h)i∣x(h)i;θ) ={t(k)jw(k)(zj;α)∂∂θ′logf(y(k)j∣x(k)j;θ)−t(k)j∂∂α′loge(k)(zj;α)I1(α)−1Λ1(θ,α)}]).

When , this expectation is divided into the expectation with respect to and the expectation with respect to because of the independency, and the sum of the expectations of the terms with by is (11), which is . Therefore, we will evaluate the terms with . When , the expectations are still because . On the other hand, when , the first term is the same as (12). The next term is

 E{t(h)iw(h)(zi;α)∂∂θlogf(y(h)i∣x(h)i;θ)∂∂α′loge(h)(zi;α)I1(α)−1Λ1(θ,α)} =E{w(h)(zi;α)∂∂θlogf(y(h)i∣x(h)i;θ)∂∂α′e(h)(zi;α)I1(α)−1Λ1(θ,α)}.

By defining

 Λ2(θ,α)≡H∑h=1E{w(h)(z;α)∂∂θlogf(y(h)∣x(h);θ)∂∂α′e(h)(z;α)}, (18)

the following theorem can be obtained.

###### Theorem 2.

Let us define , , , and by (4), (13), (14), (17) and (18), respectively. Then, the asymptotic bias of the information criterion for the inverse-probability-weighted estimation is given by

 E(blimit)=−2tr{A(θ,α)−1B(θ,α)−Λ2(θ,α)I1(α)−1Λ1(θ,α)}

when the propensity score is unknown.

Based on this result and the fact that is a consistent estimator of , we propose

 IPWIC2≡−2N∑i=1H∑h=1t(h)iw(h)(zi,^α)logf(y(h)i∣x(h)i;^θIPW) IPWIC2≡+2tr{A(^θIPW,^α)−1B(^θIPW,^α)−Λ2(^θIPW,^α)I1(^α)−1Λ1(^θIPW,^α)}

as an information criterion for the inverse-probability-weighted estimation when the propensity score is unknown.

## 4 Doubly robust criterion

The doubly robust estimator is an estimator that is consistent even if either the “model for the assignment variable conditional on the confounding variable” or the “model for the outcome variable conditional on the confounding variable” is misspecified. In the narrow setting of a linear model and a basic average treatment effect, Baba et al. (2017) proposed the C criterion for the doubly robust estimation, but it is derived assuming that both models are correct, so it has no validity when only one of them is correct. In this section, we aim to derive a doubly robust criterion that has the validity. Note that in the setting of doubly robust estimation, the propensity score is unknown, i.e., is unknown.

Letting , we define

 m(u;θ,α,β)≡ H∑h=1∂∂θ[t(h)w(h)(z;α)logf(y(h)∣x(h);θ)

Then, the doubly robust estimating equation is written as

 N∑i=1⎛⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜⎝m(ui;θ,α,β)H∑h=1∂∂αt(h)iloge(h)(zi;α)H∑h=1∂∂βt(h)ilogp(h)(y(h)i∣zi;β)⎞⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟⎠=0p+q+r.

Let us denote the limit of the solution of this estimating equation by . If one of the models is correct, and we write from now on, but not necessarily or . We use Taylor expansion of the estimating equation with the estimator substituted into it, and develop a statistical asymptotic theory similar to the conventional one. The derivative of the left-hand side of the estimated equation is

 N∑i=1⎛⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜⎝∂∂θ′m(ui;θ,α∗,β∗)∂∂α′m(ui;θ,α∗,β∗)∂∂β′m(ui;θ,α∗,β∗)0H∑h=1∂2∂α∂α′t(h)iloge(h)(zi;α∗)000H∑h=1∂2∂β∂β′t(h)ilogp(h)(y(h)i∣zi;β∗)⎞⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟⎠

which is obtained from differentiating the left-hand side by the parameter once more, and if we divide it by , it converges in probability to

 ⎛⎜ ⎜ ⎜ ⎜⎝E{−∂∂θ′m(u;θ,α∗,β∗)}E{−∂∂α′m(u;θ,α∗,β∗)}E{−∂∂β′m(u;θ,α∗,β∗)}0I1(α∗)000I2(β∗)⎞⎟ ⎟ ⎟ ⎟⎠ (19)

from the law of large numbers. Here,

and are Fisher information matrices for and , respectively. The terms in the (1,1) block of this are

 H∑h=1Ez[E{t(h)w(h)(z;α∗)∣z}E{−∂2∂θ∂θ′logf(y(h)∣x(h);θ) ∣∣∣ z}

and if either model is correct, then it becomes based on the definition of (4). It is because if the model for the assignment variable is correct, then , and if the model for the outcome variable is correct, then