## 1 Introduction

Noncompliance is common in studies where some study subjects may self-select into a different study condition than the one to which they were assigned. Noncompliance may be related to unmeasured factors, so without further assumptions, the presence of noncompliance can complicate the analysis of even a randomized experiment. A popular approach to circumvent the noncompliance issue is the intention-to-treat (ITT) analysis (Roland and Torgerson, 1998; Hollis and Campbell, 1999; Heckman and Vytlacil, 2001). An ITT analysis aims to estimate a “real-world” or “diluted” effect of a treatment, ignoring the noncompliance in the study sample and assuming that the level of noncompliance in the study sample reflects the actual situation if the treatment were to be implemented elsewhere (Ten Have et al., 2008). The estimand in an ITT analysis is usually referred to as the effectiveness of a treatment. An ITT analysis is straightforward in the presence of noncompliance, where the (often random) assignments are used as the factor of interest, while the actual treatment status subject to noncompliance is ignored. In randomized studies, since the randomized assignment is orthogonal to all confounders, a simple two-sample comparison can be unbiased for estimating the effectiveness of a treatment (Angrist and Krueger, 1999; Little et al., 2009).

In contrast, the efficacy of a treatment refers to the effect of a treatment when it is actually taken. One measure of efficacy is the complier average causal effect (CACE), or the mean causal effect among those who will comply with treatment assignment, i.e., the principal compliers (Angrist et al., 1996; Little et al., 2009). A favored approach to estimating the efficacy of a treatment is a structural equation model, where the assignment is considered as an instrumental variable (IV) for the actual treatment status received (Greene, 2003). Under certain assumptions, the IV approach is unbiased for estimating the efficacy of a treatment. See Imbens (2014) for a detailed survey of methods and developments in this space. Despite its popularity, the IV approach can suffer from a high estimation variance, particularly if the sample proportion of principal compliers is low (Little et al., 2009; Antonelli et al., 2017).

Two different approaches, the per-protocol (PP) and as-treated (AT) analyses, have also received many applications in assessing the efficacy of a treatment (Higgins and Green, 2008; McNamee, 2009). The PP analysis subsets the sample to those whose actual treatment status is the same as their randomized assignment, i.e., observed compliers. However, observed compliers are usually different from principal compliers. For example, those who would always refuse a treatment, i.e., never-takers, cannot be differentiated from principal compliers when assigned to the control condition. The difference between the observed and principal compliers results in biased estimation of the CACE in the PP analysis. By contrast, the AT analysis ignores the initial assignment and uses the actual treatment to estimate a treatment effect. Due to various sample selection biases, the AT analysis can also be biased for the CACE. Despite their biases, both the PP and AT analyses usually have a smaller estimation variance than the IV approach. If we measure the efficiency of an estimator by its mean squared error (MSE), i.e., the sum of its squared bias and sampling variance, among the IV, PP, and AT estimators, none can always outperform the others. See, e.g., Little et al. (2009) and Antonelli et al. (2017) for discussions of the scenarios where each estimator can outperform the others. The suitable scenario for each estimator depends on non-estimable properties of some unobserved data, e.g., whether there is a mean difference in the outcome between never-takers and principal compliers when both are assigned to the control condition.

In this paper, we consider a synthetic estimator as a convex combination of a set of candidate estimators of the CACE, including but not limited to the IV, AT, and PP estimators. Our approach is rooted in the theory of model averaging, which conventionally focuses on estimating coefficients in least squares regression models (Buckland et al., 1997; Hjort and Claeskens, 2003; Judge and Mittelhammer, 2004; Mittelhammer and Judge, 2005; Longford, 2006; Hansen, 2007) and predicting random effects in mixed-effect models (Robinson et al., 1991; Ghosh et al., 1994; Searle, 1997; Longford, 2006). The theoretical framework of model averaging is much wider than linear regression and mixed-effect models. In the recent literature, Lavancier and Rochet (2016) outlined synthetic estimators for a very general class of problems and provided some asymptotic results. Antonelli et al. (2017) proposed a synthetic estimator in the presence of noncompliance in the setting of a randomized controlled trial. In the same spirit, we use the term synthetic estimation instead of model averaging because our candidate estimators do not belong to a common class of models, although they target the same estimand of interest, i.e., the CACE. One purpose of synthetic estimation is to combine several available candidate estimators to form a single and unambiguous estimator. By striking a balance between biases and variances, a combination of candidate estimators can have a smaller MSE than some or even all candidate estimators, in particular, any a priori favored candidate estimator.

We propose a class of synthetic compliance estimators (SCEs) which aim to optimally combine candidate estimators of the CACE by minimizing the estimated MSE of the resulting estimator. The proposed SCE is appropriate for observational studies, randomized trials, and other experimental and non-experimental studies where estimation must be adjusted for covariates. The SCE displays a robustness property: without sacrificing too much on the estimation bias, it borrows information from other biased but precise candidate estimators to improve the MSE.

The rest of the paper is organized as follows. In Section 2, we outline specifics about the principal compliance framework and lay out the details of the candidate estimators in this setting. In Section 3, we present the SCE and its practical implementation. The asymptotic properties of the SCE are discussed in Section 4. Section 5 includes simulations demonstrating the operating characteristics and robustness of the SCE. Section 6 contains concluding remarks.

## 2 Framework

### 2.1 Principal stratification and CACE

Suppose interest lies in the causal effect of an assigned treatment $Z_i$ on an outcome $Y_i$, for subjects indexed by $i = 1, \dots, n$. The actual treatment status, denoted by $D_i$, is a random variable which may not equal the assignment $Z_i$, possibly due to subject self-selection into a different treatment status. Further, suppose $X_i$ is a set of covariates collected on these subjects. We assume throughout that $Z_i$ and $D_i$ are binary, so that they take values in $\{0, 1\}$. Let $D_i(z)$ be the potential treatment status that would have been observed if $Z_i = z$. Based on the configuration of $D_i(0)$ and $D_i(1)$, there are potentially four types of individuals in the population: always-takers who always take the treatment regardless of assignment, i.e., $D_i(0) = D_i(1) = 1$; never-takers who always take the non-treatment condition regardless of assignment, i.e., $D_i(0) = D_i(1) = 0$; principal compliers who always conform to the assignment, i.e., $D_i(z) = z$; and principal defiers who always do the opposite of the assignment, i.e., $D_i(z) = 1 - z$. These four groups of individuals are known as the principal strata. The actual treatment status is not influenced by the assignment for always-takers and never-takers. By contrast, principal compliers' and principal defiers' actual treatment status is determined by their assignment. Under the principal stratification framework, the group of observed compliers consists of principal compliers, never-takers assigned to the control condition, and always-takers assigned to the treatment condition. In this paper we focus on the principal compliers. Hereafter, the principal compliers are referred to as the "compliers" for brevity in presentation. Let the indicator of compliance be $C_i = I\{D_i(0) = 0, D_i(1) = 1\}$, and let $\pi_c = P(C_i = 1)$ be the corresponding probability. Further, let $n_z$ be the sample size in the group assigned to treatment $z$, and let $n_{zd}$ be the sample size of the group assigned to treatment $z$ and with treatment status $d$.

Define the potential outcome $Y_i(z, d)$ to be the outcome that one would observe if, possibly contrary to fact, $Z_i = z$ and $D_i = d$. For these counterfactuals to make sense, we assume the no-interference or Stable Unit Treatment Value Assumption (SUTVA) holds (Rubin, 1978) for both the potential outcomes $Y_i(z, d)$ and the potential treatments $D_i(z)$. That is, the potential outcomes and potential treatment values for individual $i$ do not depend on any other individual, and $D_i = D_i(Z_i)$ and $Y_i = Y_i(Z_i, D_i)$.

Further, let the counterfactual based only on the treatment status be

By definition, a causal effect is the difference between a pair of distinct potential outcomes, where one or more causal factors differ. The efficacy of a treatment or the complier average causal effect (CACE) refers to the mean causal effect of treatment among compliers (Angrist et al., 1996; Little et al., 2009). Using the notation above and suppressing the subscripts, it is

$$\Delta = E\{Y(1, 1) - Y(0, 0) \mid C = 1\}, \tag{1}$$

where $Y(z, d)$ is the potential outcome under assignment $z$ and treatment status $d$, and $C = 1$ indicates a principal complier.
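The distinction between principal strata and observed compliance cells described above can be made concrete with a small sketch. The function names below are hypothetical and not part of the original text; the code simply enumerates which strata are compatible with each observed pair of assignment and treatment status, with an optional flag that rules out defiers.

```python
from itertools import product

# Map potential treatments (D(0), D(1)) to the name of the principal stratum.
STRATA = {
    (0, 0): "never-taker",
    (1, 1): "always-taker",
    (0, 1): "complier",
    (1, 0): "defier",
}


def principal_stratum(d0, d1):
    """Return the stratum name for potential treatments D(0)=d0 and D(1)=d1."""
    return STRATA[(d0, d1)]


def strata_consistent_with(z, d, no_defiers=False):
    """List the strata that could produce observed assignment z and treatment d.

    no_defiers=True excludes defiers (an illustrative switch, hypothetical name).
    """
    compatible = []
    for d0, d1 in product((0, 1), repeat=2):
        stratum = principal_stratum(d0, d1)
        if no_defiers and stratum == "defier":
            continue
        realized = d1 if z == 1 else d0  # treatment actually taken under z
        if realized == d:
            compatible.append(stratum)
    return sorted(compatible)


# Observed compliers in the control arm mix compliers and never-takers.
print(strata_consistent_with(z=0, d=0, no_defiers=True))
```

This is why a per-protocol subset (units with treatment status equal to assignment) is not the same population as the principal compliers.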

### 2.2 Identifying assumptions

In this section, we outline a series of assumptions that are typically used to identify the CACE. Some estimators may require many of the ensuing assumptions, while some may require only a few.

###### Assumption 1 (Pseudo-randomization).

$Z_i \perp \{D_i(z), Y_i(z, d) : z, d \in \{0, 1\}\} \mid X_i$.

Assumption 1 states that treatment assignment is, if not randomized, as good as randomized (Imbens, 2014) within strata defined by the covariates $X_i$. This assumption is similar in spirit to the no-unmeasured-confounders assumption commonly invoked for treatment effect estimation.

Next, we will make two monotonicity assumptions.

###### Assumption 2 (Monotonicity).

$D_i(1) \geq D_i(0)$ for all $i$.

The first is required for identifiability and states that there are no defiers in the population and the treatment assignment does not make anyone less likely to take treatment.

###### Assumption 3 (Strong monotonicity).

$D_i(0) = 0$ for all $i$.

This second, stronger version of monotonicity ensures that there are no always-takers (units with $D_i(0) = D_i(1) = 1$) and only serves to simplify notation. While strong monotonicity may not be a reasonable assumption in all cases, any additional complexity arising from its violation (by inclusion of always-takers) could easily be incorporated in the framework we lay out here.

###### Assumption 4 (Exclusion restriction).

$Y_i(z, d) = Y_i(z', d)$ for all $z, z', d \in \{0, 1\}$.

Assumption 4 encodes the so-called exclusion restriction (ER), which states that there is no direct effect of the treatment assignment on the outcome in the population. In situations where treatment assignment is determined by double-blind randomization, the ER is very plausible. In other cases, it must be justified based on substantive expertise. Under the ER, the potential outcome depends only on the treatment status, so we can equivalently write the CACE in terms of the counterfactual based only on the treatment status, $E\{Y(1) - Y(0) \mid C = 1\}$.

We might further require an assumption on the effect of compliance:

###### Assumption 5 (No compliance effect).

Assumption 5 is not needed for traditional IV or two-stage estimators and states that compliers ($C_i = 1$) have the same outcomes as certain non-compliers and thus there is no compliance effect (NCE). When strong monotonicity holds, we only require the no compliance effect to hold in the control condition. A version of this assumption was termed General Principal Ignorability in Section 6.1 of Ding and Lu (2017).

### 2.3 Candidate CACE estimators

Our approach relies on a collection of candidate estimators, and it leverages the information in all of the candidates to produce an efficient estimator with low bias. While many estimators of the CACE have been proposed previously, for clarity and ease of presentation, we restrict ourselves to a set of well-known ones. The SCE could easily be adapted to include more, fewer, or other candidate estimators.
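As a rough orientation before the individual candidates, the following sketch computes unadjusted versions of three of them: the IV estimator in its standard Wald ratio form, the per-protocol contrast, and the as-treated contrast. It is only an illustration under simplifying assumptions (no covariates, binary assignment `z` and treatment status `d`); the function names and the toy data are hypothetical and are not taken from the original text.

```python
from statistics import mean


def _group_mean(values, keep):
    """Mean of values over the units flagged True in keep."""
    return mean(v for v, k in zip(values, keep) if k)


def itt(values, z):
    """Difference in means of values between assignment arms z=1 and z=0."""
    return _group_mean(values, [zi == 1 for zi in z]) - _group_mean(
        values, [zi == 0 for zi in z]
    )


def wald_iv(y, z, d):
    """Wald ratio: ITT effect on the outcome over ITT effect on treatment."""
    return itt(y, z) / itt(d, z)


def per_protocol(y, z, d):
    """Compare units that took treatment as assigned (observed compliers)."""
    treated = [zi == 1 and di == 1 for zi, di in zip(z, d)]
    control = [zi == 0 and di == 0 for zi, di in zip(z, d)]
    return _group_mean(y, treated) - _group_mean(y, control)


def as_treated(y, d):
    """Compare units by treatment actually received, ignoring assignment."""
    return _group_mean(y, [di == 1 for di in d]) - _group_mean(
        y, [di == 0 for di in d]
    )


# Toy data (hypothetical): half of the treatment arm does not take treatment.
z = [1, 1, 1, 1, 0, 0, 0, 0]
d = [1, 1, 0, 0, 0, 0, 0, 0]
y = [3, 5, 1, 1, 2, 2, 1, 1]
print(wald_iv(y, z, d), per_protocol(y, z, d), as_treated(y, d))
```

On the toy data the three contrasts disagree, which is the situation the synthetic estimator is designed to arbitrate.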

#### IV estimator.

#### Two-stage least squares.

The two-stage least squares (TSLS) estimator of Angrist and Imbens (1995) fits a least squares model for the outcome and an additional model for treatment status. The CACE is estimated as $\hat{\Delta}_{\text{TSLS}}$ from the model

$$Y_i = \beta_0 + \Delta_{\text{TSLS}} \hat{D}_i + \beta_X^\top X_i + \epsilon_i, \tag{3}$$

where $\hat{D}_i$ is the predicted value of a regression of $D_i$ onto $Z_i$ and $X_i$. Traditionally, the model for $D_i$ is a least squares model, but in practice one could use logistic regression or any other binary regression model. Under Assumptions 1, 2, and 4, and assuming the models for $Y_i$ and $D_i$ are correct, $\hat{\Delta}_{\text{TSLS}}$ is consistent for the CACE.

#### Per-protocol.

#### As-treated.

#### Principal-score weighting estimators.

Weighting methods using so-called principal scores are laid out in Ding and Lu (2017). The principal score for compliers or compliance score is the conditional probability of being a complier. We consider two principal-score estimators. The first is purely a weighting estimator and can be written:

Under Assumptions 1, 5, 3, and consistency of , is consistent for the CACE.

The second estimator is a model-assisted version of the same estimator:

where is the sample size in group and is estimated from the model

and is estimated from the model

Under Assumptions 1, 5, 3, and consistency of – no additional assumptions from – is consistent for the CACE and has a lower asymptotic variance.

#### Principal-score stratified estimators.

Finally, we estimated versions of the IV, AT, and PP estimators that were stratified by the principal score. We first computed each estimator within quintiles of the principal score and then averaged across quintiles.
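The stratify-then-average step can be sketched as follows. This is a minimal illustration, assuming that the quintiles are formed from the sorted estimated principal scores and that the stratum-specific estimates are averaged with equal weights; the function name `stratified_estimate` and its arguments are hypothetical.

```python
def stratified_estimate(scores, rows, estimator, n_strata=5):
    """Average an estimator over equal-sized strata of a score.

    scores    : one estimated principal score per unit
    rows      : one data record per unit (any structure the estimator accepts)
    estimator : callable mapping a list of rows to a single number
    n_strata  : number of strata (5 gives quintiles)
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n = len(order)
    estimates = []
    for s in range(n_strata):
        # Units whose score rank falls in the s-th slice of the ordering.
        idx = order[s * n // n_strata:(s + 1) * n // n_strata]
        estimates.append(estimator([rows[i] for i in idx]))
    # Equal-weight average across strata (an assumption of this sketch).
    return sum(estimates) / n_strata


# Usage with a trivial estimator (the stratum mean of a single column).
scores = [i / 10 for i in range(10)]
rows = [float(i) for i in range(10)]
print(stratified_estimate(scores, rows, lambda r: sum(r) / len(r)))
```

In practice, an estimator such as the IV contrast can fail within a stratum that contains no treated units, so strata may need to be checked before averaging.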

### 2.4 Bias-variance tradeoff in candidate estimators

The estimators in the previous section present bias-variance trade-offs for the analyst. To illustrate this point, consider the properties of a few of the estimators under the model

(6) |

where Assumptions 3 (strong monotonicity) and 4 (exclusion restriction) hold and . Because the exclusion restriction holds, . However, due to violation of the NCE assumption, the principal-score and as-treated estimators (for example) will be biased.

First, consider a comparison of and . It is straightforward to show that . The degree of bias for the CACE incurred by the principal-score estimator depends on the compliance proportion and the compliance effect . While the principal-score estimator incurs this bias, it is more efficient than the IV estimator. Following the argument in Feller et al. (2017), one can show that the variance of is

(7) | ||||

(8) |

On the other hand, the variance of can be shown to be

(9) | ||||

(10) |

Noting that so and , it is clear that

Similarly, using a classic omitted-variable result, because the as-treated model is missing only the variable for compliance , where is the coefficient for in the least squares regression of onto and . The quantity will also be if . Comparing to again shows the bias-variance tradeoff in the candidate estimators. The TSLS estimator and the as-treated estimator arise from similar models, but uses the estimated in place of . Because of using the estimated quantity, the TSLS estimator will incur additional variability and

(cf. the result in Murphy and Topel (2002)).

Our approach seeks to exploit this bias-variance tradeoff, borrowing information from possibly biased estimators (such as the principal-score and as-treated estimators) to improve the efficiency of the SCE.

## 3 Synthetic estimation

### 3.1 The estimator

In this section, we propose a class of SCEs which leverages the information in all candidate estimators. Let the set of candidate estimators be denoted as $\hat{\boldsymbol{\Delta}} = (\hat{\Delta}_0, \hat{\boldsymbol{\Delta}}_1^\top)^\top$, where $\hat{\Delta}_0$ is an estimator that can be presumed to be unbiased, and $\hat{\boldsymbol{\Delta}}_1$, a vector of length $K$, collects all other candidates. Because the exclusion restriction (Assumption 4) can often be plausibly assumed to hold, we typically consider either the IV estimator or the TSLS estimator to be $\hat{\Delta}_0$. When Assumption 4 is deemed unlikely, another estimator may be considered as $\hat{\Delta}_0$. We consider synthetic estimators as a convex combination of the candidates

$$\hat{\Delta}(\hat{\boldsymbol{\Delta}}, \mathbf{w}) = \mathbf{w}^\top \hat{\boldsymbol{\Delta}}, \tag{11}$$

where all entries of $\mathbf{w}$ are between 0 and 1, and $\mathbf{1}^\top \mathbf{w} = 1$. The synthetic estimator in (11) is written as a function of $\hat{\boldsymbol{\Delta}}$ and $\mathbf{w}$ to illustrate that different synthetic estimators are possible with different candidate estimators $\hat{\boldsymbol{\Delta}}$ and different weight vectors $\mathbf{w}$.

The synthetic estimator aims to lower the MSE of the supposedly unbiased candidate by including the possibly biased candidates, in the hope of attaining lower variance without incurring too much estimation bias. To achieve this goal, the convex combination would ideally be chosen to directly minimize the MSE. We adopt the same rationale as Robinson et al. (1991) and Longford (2006) to derive the SCE. Specifically, let the sampling variance of the candidate estimators be

where the small in the subscript denotes the finite sample size. Let the biases of candidate estimators be denoted as

Given , the MSE of (11) is

(12) |

where

Let the minimizer of (12) be written as a function of the bias and sampling variance of the candidate estimators, to remind readers that it depends on both. Suppose the minimizer lies in the interior of the convex constraint. Then,

(13) |

This solution assumes that the sampling variance and the biases of the candidate estimators are known. In practice, they are unknown and can be replaced with estimators. The minimizer of (12) with plug-in estimates for the sampling moments is

(14) |

If the plug-in minimizer falls outside the convex constraint, then (14) needs to be projected onto the boundary of the constraint. The general form of our proposed synthetic estimator plugs the estimated optimal weight from (14) into (11)

(15) |

which is written as a function of the candidate estimators, the estimated sampling variance, and the estimated bias to illustrate that different synthetic estimators are possible with different candidate estimators, different estimates of the sampling variance of the candidate estimators, and different estimates of the bias of the candidate estimators.
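The weight computation described in this section can be sketched in code. The following is an illustration under stated assumptions, not the authors' implementation: the weight vector covers all candidates and sums to one, the MSE of the combination is written as $\mathbf{w}^\top (V + \mathbf{b}\mathbf{b}^\top) \mathbf{w}$ for a covariance matrix `variance` and a bias vector `bias` (with a zero entry for the presumed-unbiased candidate), and a Euclidean projection onto the simplex is used when the unconstrained solution leaves the constraint. The source does not specify the projection, and all function names here are hypothetical.

```python
def solve(a, rhs):
    """Solve the linear system a x = rhs by Gauss-Jordan elimination."""
    n = len(rhs)
    aug = [[float(v) for v in row] + [float(r)] for row, r in zip(a, rhs)]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [x - f * p for x, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def project_to_simplex(w):
    """Euclidean projection of w onto {w : w >= 0, sum(w) = 1}."""
    cumsum, theta = 0.0, 0.0
    for j, uj in enumerate(sorted(w, reverse=True), start=1):
        cumsum += uj
        t = (cumsum - 1.0) / j
        if uj - t > 0:
            theta = t
    return [max(wi - theta, 0.0) for wi in w]


def synthetic_weights(variance, bias):
    """Weights minimizing w'(V + b b')w subject to sum(w) = 1, then projected."""
    k = len(bias)
    m = [[variance[i][j] + bias[i] * bias[j] for j in range(k)] for i in range(k)]
    x = solve(m, [1.0] * k)
    total = sum(x)
    w = [xi / total for xi in x]
    return project_to_simplex(w) if min(w) < 0 else w


def synthetic_estimate(candidates, weights):
    """Convex combination of the candidate estimates."""
    return sum(w * c for w, c in zip(weights, candidates))


# Two candidates: unbiased but noisy (variance 4), biased but precise (variance 1).
weights = synthetic_weights([[4.0, 0.0], [0.0, 1.0]], [0.0, 1.0])
print(weights, synthetic_estimate([2.0, 2.6], weights))
```

With two uncorrelated candidates, the weight on the biased candidate reduces to $V_0 / (V_0 + V_1 + B^2)$, which equals 2/3 in the example and shrinks toward zero as the bias grows.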

### 3.2 Bias estimation

It is typically straightforward to estimate the sampling variance of the candidate estimators by using a resampling-based method, e.g., the nonparametric bootstrap, jackknife, or random grouping (see, e.g., Kovar et al., 1988). In what follows, the variance estimate can be any of these. By contrast, it is much more difficult to estimate the finite-sample biases of candidate estimators. We propose three ways of estimating the bias.
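For the nonparametric bootstrap mentioned above, a minimal sketch of the joint covariance estimate is given below. It assumes independent units resampled with replacement and candidate estimators supplied as callables on a list of records; the function name, arguments, and toy estimators are hypothetical.

```python
import random
from statistics import mean


def bootstrap_covariance(rows, estimators, n_boot=200, seed=0):
    """Bootstrap covariance matrix of several estimators on the same data.

    rows       : list of unit-level records
    estimators : list of callables, each mapping a list of rows to a number
    n_boot     : number of bootstrap resamples
    seed       : seed for the private random number generator
    """
    rng = random.Random(seed)
    n = len(rows)
    draws = []
    for _ in range(n_boot):
        # Resample units with replacement and re-evaluate every candidate
        # on the same resample so that covariances are captured.
        sample = [rows[rng.randrange(n)] for _ in range(n)]
        draws.append([est(sample) for est in estimators])
    k = len(estimators)
    centers = [sum(dr[j] for dr in draws) / n_boot for j in range(k)]
    return [
        [
            sum((dr[i] - centers[i]) * (dr[j] - centers[j]) for dr in draws)
            / (n_boot - 1)
            for j in range(k)
        ]
        for i in range(k)
    ]


# Toy usage: the second "estimator" is twice the first, so its variance is 4x.
data = [float(i) for i in range(20)]
cov = bootstrap_covariance(data, [mean, lambda s: 2 * mean(s)])
print(cov)
```

Evaluating all candidates on each resample is what allows the off-diagonal entries, and therefore the variance of differences between candidates, to be estimated.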

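#### Raw differences between candidate estimators and the presumed-unbiased estimator.

Because we assume that the first candidate, $\hat{\Delta}_0$, is unbiased, we can compute a crude estimate of the bias in each of the other candidate estimators $\hat{\Delta}_k$, $k = 1, \dots, K$, by taking their raw differences with $\hat{\Delta}_0$:

$$\hat{b}_k = \hat{\Delta}_k - \hat{\Delta}_0, \quad k = 1, \dots, K.$$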

Denote the SCE using the raw difference as .

#### Shrinking raw differences.

Unless two candidate estimators are highly correlated, their raw differences can be highly variable. To ameliorate the variability, we may want to regularize the bias estimates by down-weighting the most highly variable ones:

where denotes element-wise multiplication and with

.

The expression for the weight is derived by choosing a shrinkage value that minimizes the mean squared error of the bias estimate, i.e. . If the bias is large relative to the variance , then the weight will be close to 1, meaning the bias is not shrunk very much. On the other hand, if the bias is small compared to the variance, the weight will be close to 0 and the bias will be shrunk toward 0. This results in the estimator .
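The two bias estimates described so far can be sketched together. The shrinkage weight used below, $b^2 / (b^2 + v)$ with the raw difference plugged in for the bias $b$ and $v$ the estimated variance of that difference, is what minimizing the MSE of a scaled unbiased bias estimate yields; it is offered as an assumption consistent with the description above rather than as the exact expression of the original. Function names are hypothetical.

```python
def raw_bias(candidates):
    """Raw differences between each candidate and the first one.

    candidates[0] is the estimator presumed to be unbiased.
    """
    return [c - candidates[0] for c in candidates[1:]]


def shrunken_bias(raw, var_of_raw):
    """Shrink each raw bias estimate toward zero.

    raw        : raw differences, one per biased candidate
    var_of_raw : estimated variance of each raw difference
    """
    shrunk = []
    for b, v in zip(raw, var_of_raw):
        denom = b * b + v
        # Weight near 1 when the bias dominates, near 0 when noise dominates.
        lam = b * b / denom if denom > 0 else 0.0
        shrunk.append(lam * b)
    return shrunk


raw = raw_bias([2.0, 2.5, 1.0])
print(raw, shrunken_bias(raw, [0.25, 0.0]))
```

The variance of each raw difference can be read off a joint covariance estimate of the candidates as $V_{kk} + V_{00} - 2V_{0k}$.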

#### Sample-splitting approach.

We also consider an approach which estimates the bias on an independent subset of data. Consider splitting the available data into two equally sized datasets, and estimating the candidate estimators on each. Let be the set of candidates estimated on one half, and be estimated on the other half. Similarly define and . We can estimate the optimal weights from (14) in the two subsets of the data, but apply the weights computed on one half to the candidate estimators from the other half and average the two:

Note that in the sample-splitting estimator, a single estimate of the sampling variance is used in estimating the weights on both halves of the data. This single estimate is computed using the full sample. We take this approach for its computational simplicity and because the variability in estimating the sampling variance is typically of a lower order than the variability in estimating the bias.
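The cross-application of weights in the sample-splitting approach can be illustrated with a short sketch. It takes the candidate estimates and weight vectors from each half as given (how the weights are obtained is described above) and only shows the swap-and-average step; the function and argument names are hypothetical.

```python
def cross_fit_estimate(cand_a, weights_a, cand_b, weights_b):
    """Combine two half-sample analyses by swapping weights across halves.

    cand_a, cand_b       : candidate estimates computed on halves A and B
    weights_a, weights_b : weight vectors estimated on halves A and B
    """
    # Weights learned on B are applied to the candidates from A, and vice versa.
    est_a = sum(w * c for w, c in zip(weights_b, cand_a))
    est_b = sum(w * c for w, c in zip(weights_a, cand_b))
    return 0.5 * (est_a + est_b)


print(cross_fit_estimate([1.0, 2.0], [0.5, 0.5], [3.0, 4.0], [1.0, 0.0]))
```

Swapping the halves keeps the weights independent of the estimates they multiply, at the cost of estimating each candidate on half of the data.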

## 4 Asymptotic behavior

### 4.1 Asymptotic distribution of synthetic estimator

In this section, we will establish the asymptotic distribution of the proposed estimator using the raw differences as the bias estimate. We make the following assumptions on the asymptotic behavior of candidate estimators as well as their sampling variance and the corresponding estimators.

###### Assumption 6.

As $n \to \infty$,

###### Assumption 7.

As $n \to \infty$,

###### Assumption 8.

Assumption 6 states that all candidate estimators have the usual root-$n$ convergence rate. Assumption 7 further assumes that candidate estimators are essentially uniformly integrable so that the sampling moments also converge. Under these two assumptions, the candidate presumed to be unbiased is asymptotically unbiased, but the other candidate estimators may not be. When a candidate estimator is inconsistent, i.e., the corresponding entry of the asymptotic bias vector is infinite, its estimated weight converges to zero in probability. Here, we only consider candidate estimators with a finite asymptotic bias, i.e., all entries of the asymptotic bias vector are finite. Assumption 8 is the standard rate of the sampling variance estimator in parametric models and of the usual bootstrap variance estimator.

Let and . Let , where is a random variable with a support of . Further define two scalar constants as

(16) |

Lemma 1. Some useful facts of the random variable and constants :

The following theorem gives the asymptotic behavior of the synthetic estimator.

where

(17) |

###### Proof.

To derive , we use the relationship . By a basic theorem in linear algebra,

Also note We then have

Combine these terms together and by Lemma 1,

The second moment represents the asymptotic efficiency measure of the synthetic estimator. Compared with the unbiased estimator , the synthetic estimator has a better asymptotic efficiency if and only if

(19) |

Since , a sufficient condition for efficiency gain by is

(20) |

When all candidate estimators have zero asymptotic bias, that is , the synthetic estimator is guaranteed to be asymptotically more efficient than .

### 4.2 Inference

We may use a plug-in version of the quantity defined in (17) to compute confidence intervals. Because the SCE is based on estimators that may be biased, it is reasonable to anticipate at least a small amount of bias in the SCE. Therefore, using the mean squared error may be preferred to that quantity for creating confidence intervals because it incorporates the bias and produces wider intervals.

A $100(1-\alpha)\%$ confidence interval can be constructed as

where $z_{1-\alpha/2}$ is the $(1-\alpha/2)$th quantile of the standard normal distribution and

where , and . In practice, simulation may be used to generate a large collection of s from which the moments of may be empirically estimated. Let be such a large collection of s drawn from . Then, one may compute

and the confidence intervals follow.
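A normal-approximation interval of this kind can be sketched as follows. The sketch assumes that an estimate of the MSE of the SCE is already available on the scale of the estimator itself (any root-$n$ scaling has been applied beforehand); the function name and arguments are hypothetical, and the estimation of the MSE is not reproduced here.

```python
from statistics import NormalDist


def mse_confidence_interval(estimate, mse_hat, alpha=0.05):
    """Interval estimate +/- z_{1 - alpha/2} * sqrt(mse_hat).

    estimate : point estimate from the synthetic estimator
    mse_hat  : estimated mean squared error of that estimate
    alpha    : one minus the nominal coverage level
    """
    if mse_hat < 0:
        raise ValueError("mse_hat must be non-negative")
    z = NormalDist().inv_cdf(1 - alpha / 2)  # standard normal quantile
    half_width = z * mse_hat ** 0.5
    return estimate - half_width, estimate + half_width


low, high = mse_confidence_interval(1.0, 0.04)
print(low, high)
```

Because the MSE is at least as large as the variance, an interval built this way is never narrower than the variance-based one.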

## 5 Simulation

We performed Monte Carlo experiments to demonstrate the finite-sample performance of the proposed synthetic compliance estimators. These simulations demonstrate how compliance effects (in violation of Assumption 5) and sample size impact the performance of the proposed SCEs.

We consider two data-generating mechanisms. First, we adapt the simulation set-up in Ding and Lu (2017). In this setting, the data-generating model takes the following form

(21) | |||

(22) |

where is an indicator of being a complier, , and and . However, when the data are analyzed, is omitted from the models, and therefore corresponds to a measure of the violation of Assumption 5. When is large in magnitude, we expect more bias in the estimators that rely on the assumption of no compliance effect, such as , and . We tested three different sample sizes , and we let vary between -2 and 2.

The second data-generating mechanism is adapted from Stuart and Jo (2015). Here, the model has the following form:

(23) |

where , with , and . The CACE is identified by the parameter . The parameters and control the degree to which Assumption 5 – no compliance effect – is violated. If , then compliers naturally have higher means than never-takers, and if , then the effect of is different between compliers and never-takers. Both models ensure the exclusion restriction (Assumption 4) and strong monotonicity (Assumption 3) hold.

Similarly to the first data-generating process, we generated data at three sample sizes, , and we varied the level of violation of Assumption 5 by letting take values in (0, 0.1, 0.2, 0.3, 0.4, 0.5). We also allowed , and to be 0 or 1. We compared the performance of a range of synthetic estimators to the performance of the candidate estimators. In all simulations, was estimated using 200 bootstrap samples.

### 5.1 Overall performance of the SCE

Figure 1 demonstrates the behavior of the SCE with compared to