## 1 Introduction

Estimating the long-term effects of treatments is of interest in many fields, ranging from medicine (e.g., the effects of drugs on mortality rates) to economics (e.g., the effects of childhood interventions on earnings) to marketing (e.g., the effects of incentives on long-term purchasing behavior). A common challenge in estimating such treatment effects is that long-term outcomes are typically either unobserved in the time frame needed to make policy decisions, or observed only for a small group of experimental subjects. One approach to overcome this missing data problem is to analyze treatment effects on an intermediate outcome, termed a “statistical surrogate” (Prentice 1989). The formal requirement for a variable to be a statistical surrogate, sometimes called the Prentice criterion (Begg and Leung 2000, Frangakis and Rubin 2002), is independence of the treatment and the primary outcome conditional on the statistical surrogate. For example, in studies of the effect of cancer therapies on mortality, tumor size serves as a statistical surrogate for mortality rates if mortality rates are independent of the treatment conditional on tumor size. Under this assumption, the treatment effect on mortality rates can be identified from the relation between the treatment and tumor size, together with the relation between tumor size and mortality rates, which can be estimated from a separate data set.

Although the use of surrogates has become widespread, the validity of the surrogacy condition is often controversial. Freedman et al. (1992) argued that the surrogate may not mediate all the effect of the treatment and developed a measure of the proportion of the treatment effect on the long-term outcome explained by the surrogate. Others have noted that unmeasured confounding between the surrogate and long-term outcome would invalidate the statistical surrogacy assumption, even if the treatment had no direct effect on the long-term outcome (Rosenbaum 1984, Frangakis and Rubin 2002, Joffe and Greene 2009, VanderWeele 2015).

In this paper, we approach this debate from a different perspective. Rather than attempting to determine whether the surrogacy condition holds for a given single intermediate outcome, we exploit the fact that in modern datasets, constructed from large-scale electronic databases, researchers often observe a large number, possibly hundreds or thousands, of intermediate outcomes thought to lie on or close to the causal chain between the treatment and the long-term outcome of interest. These intermediate outcomes might be thought of as proxies for an unobserved latent true statistical surrogate. It may be that no individual candidate surrogate satisfies the Prentice surrogacy criterion by itself, but that collectively these variables do satisfy the statistical surrogacy condition.

We focus primarily on a setting with two samples, an “experimental sample” and an “observational sample.” The experimental sample contains data about the treatment indicator and the surrogates but not the long-term outcome of interest, the “primary outcome.” The observational sample contains information about the surrogates and the primary outcome, but not the treatment indicator. Both samples may also contain pre-treatment variables. Note that, in contrast to the study of mediation in causal problems, or the study of principal stratification, the surrogates are not of intrinsic interest in our analysis: their role is solely to aid in the identification and estimation of the average treatment effect of the treatment on the primary outcome.

As an example, consider evaluating the effects of early-childhood educational interventions, such as reductions in class size or improvements in teacher quality, on long-term outcomes, such as college attendance or earnings. Chetty et al. (2011) estimated the effect of class size on earnings by linking data from the Tennessee Project STAR experiment, which randomized class size from kindergarten through third grade in the 1980s, to information on earnings decades later. The goal of our paper is to develop methods that will enable researchers to draw similar conclusions from educational experiments without waiting decades to observe the long-term outcome. In our framework, the experimental sample in this application would include data about class size (the treatment), student characteristics, and various intermediate outcomes (surrogates/proxies). The surrogates could include a variety of student outcomes in the few years following the treatment (e.g., grades and test scores across subject areas, as well as attendance). The observational sample would be a large panel dataset that would include the same student characteristics and surrogates as well as longer-term outcomes such as earnings.^{1}

^{1} In another example, an internet company may be interested in the causal effect of a change in the user experience on long-term engagement with the website, e.g., overall time spent on the website. Surrogates in that case could include detailed measures of medium-term engagement, including which of many webpages were visited and how long a user spent on each page.

We consider three questions in this setting. First, how can the average treatment effect (ATE) be identified and estimated with a high-dimensional vector of surrogates that collectively satisfy the surrogacy assumption? Second, what is the bias from violations of the surrogacy assumption? Third, if the primary outcome is also observed in the experimental sample, is there still information to be gained from using surrogates?

To answer the first question, we introduce two new statistical concepts: the surrogate score, the probability of having received the treatment conditional on covariates and surrogates, and the surrogate index, defined as the expectation of the outcome of interest conditional on the surrogates. Under linearity, the surrogate index is a weighted average of each of the intermediate outcomes, with the weights determined by their ability to predict the primary outcome in the observational sample. We show that the ATE on the primary outcome can be identified by estimating the effect of the treatment on the surrogate index in the experimental sample under a set of assumptions. The key assumption is that the long-term outcome is independent of the treatment, conditional on the surrogates. In the class size application discussed above, the key requirement for identification of the ATE using surrogates is that (i) the test scores of the students in early grades capture all of the effects of the class size intervention and (ii) there are no unobserved confounders that affect both test scores and earnings. The ATE can also be estimated by averaging the outcomes in the observational sample using weights that depend on the surrogate score. Thus the surrogate index and surrogate score provide a simple way to collapse a high-dimensional vector of intermediate outcomes into a single index that can be used to estimate treatment effects, analogous to propensity scores (Rosenbaum and Rubin, 1983) in the causal inference literature. Also analogous to the propensity score literature, where different estimation methods may work better under different circumstances, whether methods based on the surrogate index or methods based on the surrogate score perform better depends on the empirical setting.

Next, we evaluate the degree of bias from the use of surrogates when the surrogacy condition fails. In this case, we show that our approach estimates an average causal effect on a function of the surrogate outcomes, where the function is the conditional expectation of the primary outcome given the surrogate outcomes in the observational sample. We then characterize the difference between this functional and the average treatment effect on the primary outcome itself. This characterization provides a method of assessing the potential degree of bias from violations of the surrogacy condition under alternative assumptions about how the treatment affects the primary outcome conditional on the intermediate outcomes. The formula for bias demonstrates why using many intermediate outcomes generally reduces the degree of bias. Intuitively, the degree of bias is determined by the extent to which the intermediates span the causal pathways from the treatment to the primary outcomes. With a large and diverse set of intermediates, one is more likely to span all, or at least most of, these causal pathways. In the class size application, bias is likely to be smaller if there are many measures of student outcomes in the early grades, as well as a wide range of student characteristics that capture confounders that affect both surrogate outcomes and long-term outcomes. For example, the mapping from test scores to earnings may depend on parent income, in which case controlling for parent income would be valuable. In the limiting case where the intermediate outcomes perfectly predict either the primary outcome or the treatment, the bias vanishes.

Finally, we consider the case where the researcher observes the primary outcome in the experimental sample itself, so that one can directly identify the average treatment effect on the primary outcome without making use of surrogates. However, there remains information content in the surrogates: using the surrogate index, one can generally estimate the average effect of interest more precisely. Building on the literature on semi-parametric estimation (e.g., Bickel, Klaassen, Ritov and Wellner, 1993), we establish the efficiency gain from the use of the surrogate index. The efficiency results show the conditions under which surrogates are most valuable for inference. They also clarify, for the two-sample case, how costly the lack of observations on the primary outcome in the experimental sample is. The use of surrogate indices is likely to be most useful in applications where the final outcome is a rare event or where substantial noise is introduced after intermediate outcomes are measured. In such settings, which include medical trials as well as experimentation (A/B testing) in other fields, using surrogate indices constructed from a battery of intermediate outcomes can yield substantial gains by increasing precision.^{2}

^{2} As an example, Athey and Stern (2002) study the impact of Enhanced 911 adoption on cardiac patient outcomes, including mortality. Their data included a suite of surrogate patient health outcomes measured in the ambulance in addition to data about hospital outcomes including mortality (which occurred for only 3.5% of patients). They constructed a “health index” by projecting mortality on the surrogate health measures. Using the health index as a dependent variable rather than directly using mortality yielded gains in precision. Our efficiency results provide a formal justification for their approach and findings.

## 2 Set Up

As discussed in the introduction, this paper analyzes two distinct designs (single-sample and two-sample). In both cases the surrogacy assumption is valuable, although in different ways.

### 2.1 The Two Sample Design

Here we consider a setting with two samples, which we refer to as the two sample design (TSD). Motivated by the examples discussed in the Introduction, we refer to the first sample as the experimental sample and the second one as the observational sample. However, these are just labels, and we will make explicit any assumptions we make regarding the assignment and sampling in both samples.

The experimental and observational samples contain observations on $N_{\mathrm{E}}$ and $N_{\mathrm{O}}$ units, respectively. At times it will be convenient to view the data as consisting of a single sample of size $N = N_{\mathrm{E}} + N_{\mathrm{O}}$, with $P_i \in \{\mathrm{E}, \mathrm{O}\}$ a binary indicator for the group that unit $i$ belongs to. For the $N_{\mathrm{E}}$ individuals in the experimental group there is a single binary treatment of interest, $W_i \in \{0, 1\}$, and we are interested in the treatment’s effect on a primary, often long-term, outcome, denoted by $Y_i$. To be precise in this two sample setting we index these variables by the sample, $\mathrm{E}$ or $\mathrm{O}$, to which they belong. The outcome is not observed in the experimental sample. However, we do measure intermediate outcomes, which we refer to as surrogates (to be defined precisely in Section 3.2), denoted as $S_i$. Typically, the surrogate outcomes are vector-valued, and often the number of components will be substantial, in order to make the properties we propose feasible. Finally, we measure pre-treatment covariates $X_i$ for each individual. These variables are known not to be affected by the treatment.

Following the potential outcomes framework or Rubin Causal Model set up (Rubin, 2006; Holland, 1986; Imbens and Rubin, 2015), individuals in this group have two pairs of potential outcomes, $(Y_i(0), Y_i(1))$ for the primary outcome and $(S_i(0), S_i(1))$ for the surrogates. We are interested in the causal effects on the outcome, $Y_i(1) - Y_i(0)$, typically an average of this over the population of interest. The realized outcomes are related to their respective potential outcomes as follows: $Y_i = Y_i(W_i)$ and $S_i = S_i(W_i)$, where $W_i$ is the binary treatment indicator.

Overall, all the units in the population that the first sample is drawn from are characterized by the values of the sextuple $(Y_i(0), Y_i(1), S_i(0), S_i(1), X_i, W_i)$. For units in this sample we do not observe the full sextuple. Rather, we observe only the triple $(X_i, S_i, W_i)$ with support $\mathbb{X}$, $\mathbb{S}$, and $\{0, 1\}$ respectively.

In the observational sample we do not know which treatment the individuals were exposed to, and in fact, they need not be exposed to either treatment. For example, suppose we are interested in the average causal effect of surgery versus a drug on a particular medical condition, with the experimental sample consisting of individuals exposed to either of those treatments. The observational sample may consist of individuals who neither took the drug, nor were exposed to surgery, possibly because the sample consists of observations from a time period when neither treatment existed. We observe a pretreatment variable , the surrogate outcome and the primary outcome, , with support , , and respectively. We denote these variables in this sample using different labels from those for the corresponding variables in the experimental group because formally they need not measure the exact same object.

This set up with two samples, where the sets of variables that are observed in the two samples differ, is implicit in much of the surrogacy literature. It is explicit in some studies on combining data sets, e.g., Ridder and Moffitt (2007) and Chen, Hong, and Tarozzi (2008). Rassler (2002, 2004) refers to it as a data fusion setting. Graham, Campos de Xavier Pinto, and Egel (2016) discuss efficient estimation for a particular set of models defined by moment conditions in such a setting, where they allow

to be a general random variable, rather than a binary indicator as in our set up.

### 2.2 The Single Sample Design

In the second setup we consider, there is a single population that is identical to the first population in the two-sample setup. All units in the population are characterized by the sextuple $(Y_i(0), Y_i(1), S_i(0), S_i(1), X_i, W_i)$ of potential outcomes, potential surrogates, pretreatment variables, and treatment. For units in the sample we observe the quadruple $(X_i, S_i, W_i, Y_i)$, now including the realized outcome $Y_i$. We refer to this setup as the single sample design (SSD).

Under the unconfoundedness assumption we discuss below, it is well known that the ATE is identified without further assumptions, and so statistical surrogacy does not play a role in identification. Nevertheless, the assumption can play an important role because it can make estimation and inference more precise.

### 2.3 The Estimand

We are interested in the average effect of the treatment on the outcome in the experimental group,

$$\tau = \mathbb{E}_{\mathrm{E}}\left[ Y_i(1) - Y_i(0) \right],$$

where to be explicit we index the expectation by the population the expectation is taken over, here the experimental population. The fundamental problem for estimating $\tau$ in the experimental group is that the outcomes are missing for all units in the experimental sample. We need to exploit the observational sample and its link to the experimental sample through the presence of the surrogate outcomes. The surrogates, like the pretreatment variables, are not of intrinsic interest; they matter only insofar as they aid in the estimation of $\tau$.

## 3 Surrogacy and the Surrogate Score

In this section we discuss the surrogacy assumption and related concepts. To maintain the flow of the section we focus primarily on the two sample setting. The corresponding assumptions for the single sample setting are in most cases immediately clear. Whenever there are additional subtleties, we will point them out explicitly.

### 3.1 The Propensity Score and Unconfoundedness

Before we introduce the surrogacy assumption, we define some common quantities and assumptions in causal inference in observational studies (e.g., Rosenbaum, 2000; Imbens and Rubin, 2015). Specifically, for the individuals in the experimental group, we define the propensity score as the conditional probability of receiving the treatment given the pre-treatment covariates (Rosenbaum and Rubin, 1983):

$$e(x) = \mathrm{pr}(W_i = 1 \mid X_i = x, P_i = \mathrm{E}).$$

An assumption that is often invoked in observational studies is that the treatment assignment is unconfounded or ignorable conditional on the pre-treatment covariates and that there is overlap. Specifically, for individuals in the experimental group, we have:

###### Assumption 1.

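(Ignorable Treatment Assignment, Rosenbaum and Rubin, 1983) (i) $W_i \perp \big(Y_i(0), Y_i(1), S_i(0), S_i(1)\big) \mid X_i, P_i = \mathrm{E}$, and (ii) $0 < e(x) < 1$ for all $x$.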

This assumption implies that in the experimental group, we could estimate the average causal effect of the treatment on the outcome by adjusting for pretreatment variables, if the outcomes were measured. There are many methods for implementing this. The original Rosenbaum and Rubin (1983) paper suggests matching or subclassification on the propensity score. Abadie and Imbens (2006) derive asymptotic properties for matching estimators. Hirano, Imbens and Ridder (2003) show that Horvitz-Thompson weighting estimators are efficient. Robins, Rotnitzky and Zhao (1995) develop what they call doubly robust estimators. See Rosenbaum (1995, 2002), Rubin (2006), Morgan and Winship (2007), and Imbens and Rubin (2015) for textbook discussions and reviews of this literature.

### 3.2 Statistical Surrogacy

Because the primary outcome is not measured in the experimental group, we need to exploit the presence of the surrogates. The defining property of these surrogates is what Begg and Leung (2000) call the Prentice criterion, and what Frangakis and Rubin (2002) call statistical surrogacy, and which we simply refer to as surrogacy:

###### Assumption 2.

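(Surrogacy) Within the experimental sample, the treatment and the primary outcome are independent conditional on the surrogates and the pretreatment variables: $W_i \perp Y_i \mid S_i, X_i, P_i = \mathrm{E}$.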

The literature following Prentice (1989) has been concerned with the plausibility of the statistical surrogacy assumption and its relation to mediation (VanderWeele, 2015; Van Der Laan and Pedersen, 2004). Freedman et al. (1992) argued that the surrogate may not mediate all the effect of the treatment and provided a quantity to measure the proportion of the effect on the primary outcome explained by the surrogate. Also, many noted that unmeasured confounding between the surrogate and the primary outcome that is not captured by the pretreatment variables would invalidate the statistical surrogacy assumption, even if the treatment had no direct effect on the primary outcome (Rosenbaum 1984, Frangakis and Rubin 2002, Joffe and Greene 2009, VanderWeele 2015). Frangakis and Rubin (2002) developed a concept they labelled principal stratification to address questions related to mediation and surrogacy. Their starting point is a candidate surrogate variable that is of substantive interest, in contrast to our setting where the surrogate is simply a means to an end. They develop a framework where adjusting for this candidate surrogate variable leads to causal effects of the treatment on the primary outcome. These are questions more closely aligned with those addressed in the mediation literature. See also Mealli and Mattei (2012) and Ding and Lu (2015).

We take a somewhat different perspective on the question of the validity of the surrogacy assumption. We view it as similar in spirit to the unconfoundedness assumption. It is unlikely to be satisfied exactly in any particular application, but, especially in cases with a large number of intermediate variables as well as pretreatment variables, it may be a reasonable approximation, as we will formalize in Section 4.2. Moreover, there is often no reasonable alternative. From our perspective it is useful to view the problem of identifying and estimating the average treatment effect as a missing data problem. The primary outcome is missing for all units in the experimental sample, and any estimator of the treatment effect ultimately relies on imputing these missing outcomes. As we will formalize in Section 3.4, the surrogacy assumption is, from that missing data perspective, in essence an untestable missing-at-random assumption, conditional on the surrogates and the pretreatment variables. Any alternative assumption that is sufficiently strong to identify the average treatment effect must therefore violate the missing-at-random assumption even though there is no evidence against that assumption.

To exploit the notion of statistical surrogacy in settings with possibly many surrogates, we introduce a new concept, which we label the “surrogate score.” It is the conditional probability of having received the treatment given the value for the surrogate outcome and the covariates.

###### Definition 1.

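(Surrogate Score) $r(s, x) = \mathrm{pr}(W_i = 1 \mid S_i = s, X_i = x, P_i = \mathrm{E})$, the conditional probability of having received the treatment given the surrogates and the covariates in the experimental sample.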

In contrast to the definition of the propensity score we write here the probability of “having received the treatment” rather than “receiving the treatment” because the surrogate score is conditional on a post-treatment outcome, whereas the propensity score conditions solely on pre-treatment variables. An important property the surrogate score shares with the propensity score is that it allows for statistical procedures that adjust only for scalar differences in other variables, irrespective of the dimension of the statistical surrogates. We state the next result without proof.

###### Proposition 1.

(Surrogacy Score) Under surrogacy (Assumption 2) we have
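Because the displayed statement is not reproduced here, the following is only a sketch of one natural property of this kind, under the assumption that the result asserts independence of the primary outcome and the treatment given the scalar surrogate score within the experimental sample. The symbols $W_i$, $Y_i$, $S_i$, $X_i$ for the treatment, primary outcome, surrogates and covariates, and $r_i$ for the surrogate score evaluated at $(S_i, X_i)$, are notation assumed for this sketch.

```latex
% All probabilities are taken within the experimental sample.
% r_i denotes the surrogate score evaluated at (S_i, X_i).
\begin{align*}
\mathrm{pr}(W_i = 1 \mid Y_i, r_i)
  &= \mathbb{E}\big[\, \mathrm{pr}(W_i = 1 \mid Y_i, S_i, X_i) \,\big|\, Y_i, r_i \big]
     && \text{(iterated expectations)} \\
  &= \mathbb{E}\big[\, \mathrm{pr}(W_i = 1 \mid S_i, X_i) \,\big|\, Y_i, r_i \big]
     && \text{(surrogacy)} \\
  &= \mathbb{E}\big[\, r_i \,\big|\, Y_i, r_i \big] \;=\; r_i .
\end{align*}
% The final expression does not depend on Y_i, so W_i and Y_i are
% independent conditional on the scalar r_i.
```

The argument mirrors the standard one for the propensity score: conditioning on the full vector of surrogates and covariates can be replaced by conditioning on a single scalar.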

### 3.3 Comparability of The Two Samples

This section discusses how we can use the information from the observational sample to help us estimate the average treatment effect, specifically how to infer the missing outcome values in the experimental sample from the observed values in the observational sample. Surrogacy is not sufficient for that, because by itself it places no restrictions on the observational sample. The key assumption is that the conditional distribution of the primary outcome given the surrogates and the pretreatment variables is the same in the observational sample as in the experimental sample. Formally,

###### Assumption 3.

(Comparability of Samples)

and , and .

There are two immediate consequences of making the comparability assumption, both of which allow us to share information between the two groups. To discuss these, we define the surrogate index:

###### Definition 2.

(The Surrogate Index) The surrogate index is the conditional expectation of the outcome given the surrogate outcomes and the pretreatment variables in the observational sample:

$$h_{\mathrm{O}}(s, x) = \mathbb{E}\left[ Y_i \mid S_i = s, X_i = x, P_i = \mathrm{O} \right].$$

We can define the corresponding conditional expectation in the experimental sample:

$$h_{\mathrm{E}}(s, x) = \mathbb{E}\left[ Y_i \mid S_i = s, X_i = x, P_i = \mathrm{E} \right].$$

In contrast to $h_{\mathrm{O}}(s, x)$, $h_{\mathrm{E}}(s, x)$ is not estimable because we do not observe the outcome in the experimental sample. These conditional means are related to what Hansen (2008) calls the prognostic score, although in the setting Hansen considers there is no surrogate variable, and the conditional expectation is only a function of the pretreatment variables. Define also the conditional expectation given treatment, pre-treatment variables and the surrogate:

$$\mu_{\mathrm{E}}(s, x, w) = \mathbb{E}\left[ Y_i \mid S_i = s, X_i = x, W_i = w, P_i = \mathrm{E} \right]. \tag{3.1}$$

We state the next result without proof.

###### Proposition 2.

Next, let be the sampling weight of being in the experimental sample and be the sampling weight of being in the observational sample. Suppose we define the propensity to be in the experimental sample as follows

###### Definition 3.

(Sampling Score)

We also make the assumption

###### Assumption 4.

Overlap in Sampling Score

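We can also write the sampling score as the conditional probability of being in the experimental sample given the surrogates and the pretreatment variables, $t(s, x) = \mathrm{pr}(P_i = \mathrm{E} \mid S_i = s, X_i = x)$, with a slight abuse of notation in defining a probability measure over the sample indicator $P_i$, which in our two sample design is not stochastic.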

### 3.4 A Missing Data Approach

To get an intuition for the surrogacy and comparability assumptions, one can also frame them as a missing data assumption, close to the missingness at random (MAR) assumption common in the missing data literature (Rubin, 1976; Little and Rubin, 1988), and specifically the literature on combining samples with different sets of variables, (Gelman, King and Liu, 1998; Rassler, 2002; Rassler 2004; Graham, Campos de Xavier Pinto, and Egel, 2012, 2016). To see this, let indicating that the outcome was measured and otherwise, and define

The complete data are . We view the sample as randomly drawn from a large population, so that we can view as stochastic. For the units in the sample we observe the incomplete data . We can now rephrase the critical assumptions.

###### Assumption 5.

(Missing Data Assumption)

Conditional on , the three variables , and are jointly independent, or, with some abuse of the Dawid conditional independence notation,

We state the following result without proof.

###### Proposition 3.

Comparability corresponds to being independent of given , and surrogacy corresponds to being independent of given and given . Assumption 5 is in fact stronger than the combination of these two, because it also assumes that conditional on , is independent of , and it assumes that is independent of . Neither are required for our main results, but because we do not need the in the observational sample and because these restrictions do not imply testable restrictions there is no loss of generality.

## 4 The Two Sample Design: Identification

### 4.1 Identification

Here we present two representations of the average treatment effect that suggest two different estimation strategies. Just as in the unconfoundedness setting the corresponding estimation strategies differ in terms of the conditional expectations that need to be estimated. The full set of conditional expectations includes the propensity score $e(x)$, the surrogate score $r(s, x)$, the sampling score $t(s, x)$, and the surrogate index $h_{\mathrm{O}}(s, x)$.

The motivation for developing the different representations is that estimators corresponding to those different representations may have substantially different properties. Just as in the case of estimating average treatment effects under unconfoundedness, the lack of smoothness in the various scores or conditional expectations may affect the properties of estimators that rely on estimating these conditional expectations.

Define

(4.1) |

and

(4.2) |

where the superscript on the indicates the population the expectation is taken over.

The first representation, , shows how can be written as the expected value of the propensity-score-adjusted difference between treated and controls of the surrogate index. This will lead to an estimation strategy where in the experimental sample the missing are imputed by . In contrast, the second representation, , shows how can be written as the expected value of the difference in two weighted averages of the outcome, with the weights a function of the surrogate score and the sampling score. This will lead to an estimation strategy where in the observational sample the are weighted proportional to the estimated surrogate score to estimate , and weighted proportional to one minus the estimated surrogate score to estimate . There are additional representations, for example replacing in (4.1) by , or replacing in (4.1) by . Estimators based on those representations do not appear to have attractive properties, either in theory or in our simulations.
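To make the link between the two representations concrete, here is a sketch of the argument for the treated arm. The notation is an assumption of this sketch rather than a transcription of the displayed equations: $e(x)$, $r(s,x)$, $t(s,x)$ and $h_{\mathrm{O}}(s,x)$ denote the propensity score, surrogate score, sampling score and surrogate index, $P_i \in \{\mathrm{E}, \mathrm{O}\}$ is the sample indicator, and $q = \mathrm{pr}(P_i = \mathrm{E})$. Unconfoundedness, surrogacy, comparability and overlap are all taken to hold.

```latex
\begin{align*}
\mathbb{E}[Y_i(1) \mid P_i = \mathrm{E}]
  &= \mathbb{E}\!\left[ \frac{W_i\, Y_i}{e(X_i)} \,\middle|\, P_i = \mathrm{E} \right]
     && \text{(unconfoundedness)} \\
  &= \mathbb{E}\!\left[ \frac{W_i\, h_{\mathrm{O}}(S_i, X_i)}{e(X_i)} \,\middle|\, P_i = \mathrm{E} \right]
     && \text{(surrogacy and comparability)} \\
  &= \mathbb{E}\!\left[ \frac{r(S_i, X_i)\, h_{\mathrm{O}}(S_i, X_i)}{e(X_i)} \,\middle|\, P_i = \mathrm{E} \right]
     && \text{(iterating over } W_i \text{ given } S_i, X_i \text{)} \\
  &= \mathbb{E}\!\left[ Y_i \, \frac{r(S_i, X_i)}{e(X_i)} \,
       \frac{t(S_i, X_i)}{1 - t(S_i, X_i)} \, \frac{1 - q}{q} \,\middle|\, P_i = \mathrm{O} \right]
     && \text{(change of sample)} .
\end{align*}
% The change of sample uses the density ratio
%   f(s, x | P = E) / f(s, x | P = O) = [ t(s,x) / (1 - t(s,x)) ] * [ (1 - q) / q ],
% followed by iterated expectations with h_O(s, x) = E[ Y | S = s, X = x, P = O ].
```

The control arm follows the same steps with $1 - W_i$, $1 - e(X_i)$ and $1 - r(S_i, X_i)$ in place of $W_i$, $e(X_i)$ and $r(S_i, X_i)$. The second line is an average over the experimental sample of the imputed index, while the last line is a weighted average of observed outcomes over the observational sample.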

### 4.2 The Consequences of Violations of Surrogacy and Comparability

In most applications the surrogacy assumption is at best a reasonable approximation. Instead the researcher may be confident that the association between the primary outcome and the treatment conditional on the proposed surrogate variables is limited, or just that there is a substantial association between the surrogates and the primary outcome. In this section we interpret the probability limit of estimators based on either of the two characterizations of the estimand in Theorem 1 in case either or both of the surrogacy and comparability assumptions are violated. Throughout the section we maintain unconfoundedness.

Without surrogacy and comparability there are two things we can say.

###### Theorem 2.

First,

and , under unconfoundedness we have

The first term captures the bias arising from violations of surrogacy, and the second term captures the bias arising from violations of comparability.

The first result shows that in general we estimate a valid causal effect as long as unconfoundedness holds. It is the average effect on a function of the surrogate, rather than the average effect on the primary outcome. This result also shows that which strategy we follow, using the surrogate score or the surrogate index to build an estimator, does not matter for the interpretation. The second result shows how lack of surrogacy and lack of comparability affect the difference between what is being estimated and the average treatment effect on the outcome of interest.

Consider the bias from violations of surrogacy, the first term in the bias. It consists of two factors. The first factor is small if the surrogates explain much of the variation in and therefore and are close. The second factor is small if the surrogate explains much of the variation in , so that the surrogate score is close to zero or one and therefore is close to zero.

Let us consider a special case where the assignment is completely random, so the propensity score is constant, , and where we have a substantial number of intermediate outcomes. These intermediate outcomes may be qualitatively very different, some continuous, some discrete or binary, and with very different substantive interpretations. The surrogate approach suggests a systematic way of combining the causal effects on the surrogates. Moreover, suppose we approximate by a linear function, . Let be the average causal effect on the surrogates. Then can be estimated by

The linear model for leads to a set of weights on the potentially large set of intermediate outcomes. Note the role of the pretreatment variables here. We do not simply regress the primary outcome on the surrogate outcomes. Instead we include the pretreatment variables in that regression, even if the data come from a randomized experiment, in order to improve the explanatory power of the surrogate index and the surrogate score.
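The linear case can be illustrated with a small simulation. This is a minimal sketch under assumptions of our own: the data-generating process, the coefficient vectors `beta_s` and `shift`, and the helper `draw` are hypothetical and chosen so that surrogacy and comparability hold by construction, with treatment assigned completely at random.

```python
import numpy as np

rng = np.random.default_rng(0)
beta_s = np.array([1.0, 2.0, 0.5])   # hypothetical weights of the surrogates in the index
shift = np.array([0.5, 0.2, -0.1])   # hypothetical treatment effects on the surrogates

def draw(n, randomize):
    """Draw one sample; the outcome depends on (s, x) only, never directly on w."""
    x = rng.normal(size=(n, 1))                                   # pretreatment covariate
    w = rng.integers(0, 2, size=n) if randomize else np.zeros(n, dtype=int)
    s = x + np.outer(w, shift) + rng.normal(size=(n, 3))          # three surrogates
    y = s @ beta_s + 0.3 * x[:, 0] + rng.normal(size=n)           # primary outcome
    return x, w, s, y

x_o, _, s_o, y_o = draw(20_000, randomize=False)   # observational sample: (x, s, y)
x_e, w_e, s_e, _ = draw(20_000, randomize=True)    # experimental sample: (x, s, w)

# Regress the outcome on the surrogates AND the pretreatment variable.
design = np.column_stack([np.ones(len(y_o)), s_o, x_o])
coef, *_ = np.linalg.lstsq(design, y_o, rcond=None)
beta_hat = coef[1:4]                                # estimated weights on the surrogates

# Average causal effect on each surrogate, from the randomized experiment.
tau_s_hat = s_e[w_e == 1].mean(axis=0) - s_e[w_e == 0].mean(axis=0)

tau_hat = float(beta_hat @ tau_s_hat)               # weighted combination of surrogate effects
true_tau = float(beta_s @ shift)                    # effect implied by the simulation design
```

In this design the estimated weights recover how strongly each surrogate predicts the primary outcome, and the weighted sum of the surrogate effects approximates the effect on the primary outcome even though that outcome is never used from the experimental sample.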

It is also interesting to relate this discussion to the use of indices in health research. Consider the Body Mass Index (BMI), defined as a person’s weight in kilograms divided by the square of their height in meters, $\mathrm{BMI} = \text{weight} / \text{height}^2$ (McGee et al., 2004; Adams et al., 2006). This index is predictive of future health outcomes, although it is obviously not a conditional expectation. Nevertheless we can interpret estimates of the causal effect of treatments on the BMI through this approach.

## 5 The Two Sample Design: Estimation

In this section we discuss a number of estimation strategies. We take some of the insights from the literature on estimating average treatment effects under unconfoundedness to suggest strategies that appear to be promising. The key difference with the unconfoundedness setting is that there are in the current setting two adjustments to be done.

### 5.1 An Estimator Based on the Surrogate Index

Suppose we estimate the surrogate index as . We can then average this in the experimental sample for the treated and controls, after adjusting for the propensity score. A natural estimator, corresponding to (4.1), is the following difference of two averages over the experimental sample:

(5.1) |

We refer to this as the surrogate index estimator. Note that compared to the representation in the theorem we normalize the weights so that the weights sum up to one. This tends to improve the finite sample properties of the estimators substantially. In the case where the estimator for was based on a linear specification, is linear, this leads to

where is an estimator for In the case without pretreatment variables where the experimental sample came from a completely randomized experiment, this would further simplify to

where and are the average values for the surrogate outcome in treated and control samples respectively. However, we emphasize that in general, there may be interactions between the surrogates and pre-treatment variables.
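As a minimal sketch of the surrogate index estimator described above, the following Python code assumes a single surrogate, no pretreatment variables, and a linear index fitted by least squares in the observational sample. The function names (`fit_linear_index`, `surrogate_index_estimate`) and the toy data are hypothetical and chosen for illustration only. The second function forms the difference of two weighted averages of the fitted index over the experimental sample, with inverse propensity weights normalized to sum to one within each treatment arm.

```python
from statistics import mean


def fit_linear_index(s_obs, y_obs):
    """OLS of the outcome on a single surrogate in the observational sample.

    Returns (intercept, slope) of the fitted linear surrogate index.
    """
    s_bar, y_bar = mean(s_obs), mean(y_obs)
    sxx = sum((s - s_bar) ** 2 for s in s_obs)
    sxy = sum((s - s_bar) * (y - y_bar) for s, y in zip(s_obs, y_obs))
    slope = sxy / sxx
    return y_bar - slope * s_bar, slope


def surrogate_index_estimate(index_values, treated, propensity):
    """Difference of two weighted averages of the fitted index over the
    experimental sample; weights are normalized to sum to one in each arm."""
    w1 = [w / e for w, e in zip(treated, propensity)]
    w0 = [(1 - w) / (1 - e) for w, e in zip(treated, propensity)]
    avg1 = sum(a * h for a, h in zip(w1, index_values)) / sum(w1)
    avg0 = sum(a * h for a, h in zip(w0, index_values)) / sum(w0)
    return avg1 - avg0


# Hypothetical toy data, made up for illustration.
s_obs = [0.0, 1.0, 2.0, 3.0]
y_obs = [1.0, 3.0, 5.0, 7.0]            # exactly y = 1 + 2 * s
s_exp = [1.0, 2.0, 3.0, 0.0, 1.0, 2.0]
w_exp = [1, 1, 1, 0, 0, 0]
e_exp = [0.5] * 6                        # completely randomized experiment

intercept, slope = fit_linear_index(s_obs, y_obs)
index_exp = [intercept + slope * s for s in s_exp]
tau_hat = surrogate_index_estimate(index_exp, w_exp, e_exp)
```

With a constant propensity score, the estimate equals the fitted slope times the difference in average surrogate values between treated and control units, which is the simplification noted above for a completely randomized experiment.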

When the number of pre-treatment variables or surrogates (and their interactions) is large, using logistic regression may not be feasible, and one may wish to consider regularization methods such as LASSO (Tibshirani, 1996; Belloni, Chernozhukov and Hansen, 2014), ridge regression, tree or forest based methods (Breiman, Friedman, Olshen, and Stone, 1984; Wager and Athey, 2015), or super learners (van der Laan and Rose, 2011) to estimate the various scores and conditional expectations.

### 5.2 An Estimator Based on the Surrogate Score

In this section we use the second representation of the treatment effect in the main theorem. Suppose we have estimators for the propensity score, the surrogate score, and the sampling score. These may be nonparametric estimators, or simply estimators based on generalized linear models. For example, we could specify

and

estimated by maximum likelihood or method of moments. Note that we have assumed the most typical models for the propensity score, the surrogate score, and the sampling score, and the resulting estimate of the treatment effect could be sensitive to misspecification of these models, especially if there is limited overlap. However, we feel this provides a starting point for estimating the treatment effect in our setting. Again, in settings with a large number of surrogates or pretreatment variables, one may wish to use regularization methods. Once we have estimates of the three scores, we plug them into the sample analogs of the expected values in the main theorem.

What we refer to as the surrogate score estimator is based on averaging over the observational sample:

(5.2) |

where for the weights are
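The plug-in step for the surrogate score estimator can be sketched in Python as follows. The weights used here are an assumption: they are obtained by replacing the treatment indicator in inverse propensity weighting with the surrogate score, and by reweighting observational units with the odds of the sampling score so that they resemble the experimental sample. This is one form consistent with the description in this section, not a transcription of (5.2). The function name, argument names, and toy numbers are hypothetical.

```python
def surrogate_score_estimate(y_obs, surrogate_score, propensity,
                             sampling_score, share_experimental):
    """Weighted difference of outcome averages over the observational sample.

    All score arguments are per-unit estimates evaluated at the observational
    units; share_experimental is the overall fraction of experimental units.
    Weights are normalized to sum to one in each of the two terms.
    """
    q = share_experimental
    w1, w0 = [], []
    for r, e, t in zip(surrogate_score, propensity, sampling_score):
        # Odds that reweight observational units toward the experimental sample.
        odds = t * (1 - q) / ((1 - t) * q)
        w1.append(r / e * odds)
        w0.append((1 - r) / (1 - e) * odds)
    avg1 = sum(w * y for w, y in zip(w1, y_obs)) / sum(w1)
    avg0 = sum(w * y for w, y in zip(w0, y_obs)) / sum(w0)
    return avg1 - avg0


# Hypothetical toy inputs: two observational units, constant propensity
# score and constant sampling score.
tau_hat_score = surrogate_score_estimate(
    y_obs=[1.0, 3.0],
    surrogate_score=[0.25, 0.75],
    propensity=[0.5, 0.5],
    sampling_score=[0.5, 0.5],
    share_experimental=0.5,
)
```

When the propensity score and the sampling score are constant, the odds factor cancels after normalization, and the estimator reduces to a comparison of outcome averages weighted by the surrogate score and by one minus the surrogate score.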

### 5.3 Matching Estimators

Although matching estimators are generally not efficient in settings with unconfoundedness (Rubin, 2006; Abadie and Imbens, 2006, 2016), they have a lot of intuitive appeal, and it is instructive to see how a matching strategy could be implemented in this case. Consider a treated unit in the experimental sample. We need to find three matches for this unit. First, we need to find a unit with the opposite treatment in the same (experimental) sample. Specifically, we need to find the closest unit, in terms of pretreatment variables, among the control units in the experimental sample. As a result of the matching, the pretreatment variables of this matched control unit should be close to those of the treated unit, but its surrogates could potentially be quite different. Next, we need to find a match in the observational sample for each of these two experimental units. First, find the unit in the observational sample closest to the treated unit in terms of both pretreatment variables and surrogates, and record its outcome; as a result of the matching, both its pretreatment variables and its surrogates should be close to those of the treated unit. Finally, find the unit in the observational sample closest to the matched control unit, again in terms of both pretreatment variables and surrogates, and record its outcome; its pretreatment variables and surrogates should be close to those of the matched control unit.

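Then we combine these matches to estimate the causal effect for the treated unit as the difference between the outcomes of the two matches from the observational sample: the outcome of the observational match for the treated unit minus the outcome of the observational match for the matched control unit.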

The matching estimator would then be the average of this difference over the experimental sample.
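The three-match procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions: single nearest-neighbor matching with Euclidean distance, and control units in the experimental sample handled symmetrically to the treated unit described above (with the sign flipped so that each difference is treated minus control). The function names and the toy data are hypothetical.

```python
import math


def nearest(target, candidates):
    """Index of the candidate closest to target in Euclidean distance."""
    return min(range(len(candidates)),
               key=lambda j: math.dist(target, candidates[j]))


def matching_estimate(x_exp, s_exp, w_exp, x_obs, s_obs, y_obs):
    """Average of unit-level matched differences over the experimental sample."""
    # Observational units are matched on pretreatment variables and surrogates jointly.
    xs_obs = [x + s for x, s in zip(x_obs, s_obs)]
    effects = []
    for i in range(len(w_exp)):
        # Match 1: closest experimental unit with the opposite treatment, on X only.
        others = [k for k in range(len(w_exp)) if w_exp[k] != w_exp[i]]
        j = others[nearest(x_exp[i], [x_exp[k] for k in others])]
        # Matches 2 and 3: observational neighbors of unit i and of its match j.
        y_i = y_obs[nearest(x_exp[i] + s_exp[i], xs_obs)]
        y_j = y_obs[nearest(x_exp[j] + s_exp[j], xs_obs)]
        # Orient the difference as treated minus control.
        effects.append(y_i - y_j if w_exp[i] == 1 else y_j - y_i)
    return sum(effects) / len(effects)


# Hypothetical toy data: one pretreatment variable and one surrogate per unit.
x_exp = [[0.0], [0.1], [1.0], [1.1]]
s_exp = [[2.0], [1.0], [3.0], [2.0]]
w_exp = [1, 0, 1, 0]
x_obs = [[0.0], [0.0], [1.0], [1.0]]
s_obs = [[1.0], [2.0], [2.0], [3.0]]
y_obs = [2.0, 4.0, 5.0, 7.0]             # y = x + 2 * s

tau_match = matching_estimate(x_exp, s_exp, w_exp, x_obs, s_obs, y_obs)
```

In this toy example each treated unit has a surrogate value one unit higher than its matched control, and the outcome rises by two per unit of the surrogate, so every unit-level difference equals two.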

In settings with high-dimensional pre-treatment variables or surrogates, it is unlikely that such a matching strategy would be effective, and methods relying on regularized estimation of the surrogate index or surrogate score would be more attractive.

## 6 Simulation

### 6.1 Setup

We conduct a small simulation study to assess the performance of different estimation methods when the identifying assumptions are met. To focus on the role of the surrogate variables, we restrict the study to a randomized experimental design without pre-treatment covariates, so that the propensity score is constant, and to a constant sampling score. Within this design we focus on the role of the surrogate index and the surrogate score. Specifically, we estimate the surrogate index by an ordinary least squares regression of the outcome on the surrogates, and the surrogate score by a logistic regression of the treatment on the surrogates. We study two estimators, simplified versions of (5.1)-(5.2) for the case with a constant propensity score and a constant sampling score.

Subsequent sections study the behavior of these two estimators under different data generating processes. In particular, we study (i) the properties of the surrogate score and the surrogate index as the number of surrogates increases, (ii) the consequences of misspecifying the surrogate score and the surrogate index, (iii) the role of different sample sizes in different samples, and (iv) the role of the explanatory power of the surrogates in the surrogate score and the surrogate index. In all simulation settings, we study the bias and variance of the two estimators, evaluated from 1000 simulated data sets.

### 6.2 Dimension of Surrogates

In this section, we consider the effect of increasing the dimension of the surrogates on estimating . Each data set has individuals with from the experimental sample and from the observational sample. Suppose we have surrogates, where takes on values from to . The surrogates follow a multivariate standard Normal with mean zero and identity covariance under both the observational and the experimental sample. We generate data based on the following model.

where are fixed parameters chosen from a standard Normal with mean 0 and variance and . We also generate under the same model as . For the experimental sample, we only use and in the observational sample, we only use . Note that all of the identifying assumptions are satisfied by the simulation design.

Figure 1 shows the result of the simulation.

We see that regardless of the dimension of the surrogates, both estimators have similar performance with respect to bias and variance, although has a slightly higher variance when the dimension of the surrogates is quite large. Also, as expected, the bias and variance of both estimators increase as the dimension of the surrogates grows, because the sample size remains fixed at and . In short, the simulation demonstrates that the estimation methods can handle a large number of surrogates, at the expected cost in bias and variance.
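A Monte Carlo exercise of this kind can be sketched in Python as follows. The data generating process below is an assumption made for illustration and is not the design used in this section: there is a single surrogate, the treatment is a fair coin flip, the surrogate equals `delta` times the treatment plus standard Normal noise, and the outcome equals `beta` times the surrogate plus standard Normal noise, so the true effect is `beta * delta` and both the linear index and the logistic surrogate score are correctly specified. The function names and the default of 100 replications (rather than 1000) are likewise hypothetical.

```python
import math
import random
from statistics import mean, stdev


def draw_sample(n, beta, delta, rng):
    """Assumed design: W ~ Bernoulli(1/2), S = delta*W + noise, Y = beta*S + noise."""
    w = [1 if rng.random() < 0.5 else 0 for _ in range(n)]
    s = [delta * wi + rng.gauss(0.0, 1.0) for wi in w]
    y = [beta * si + rng.gauss(0.0, 1.0) for si in s]
    return s, w, y


def ols_slope(s, y):
    """Slope from a least squares regression of y on (1, s)."""
    s_bar, y_bar = mean(s), mean(y)
    return (sum((a - s_bar) * (b - y_bar) for a, b in zip(s, y))
            / sum((a - s_bar) ** 2 for a in s))


def logit(s, w, steps=10):
    """Newton-Raphson for a logistic regression of w on (1, s)."""
    a = b = 0.0
    for _ in range(steps):
        p = [1.0 / (1.0 + math.exp(-(a + b * si))) for si in s]
        g0 = sum(wi - pi for wi, pi in zip(w, p))
        g1 = sum((wi - pi) * si for wi, pi, si in zip(w, p, s))
        v = [pi * (1.0 - pi) for pi in p]
        h00 = sum(v)
        h01 = sum(vi * si for vi, si in zip(v, s))
        h11 = sum(vi * si * si for vi, si in zip(v, s))
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


def one_replication(n_exp, n_obs, beta, delta, rng):
    s_e, w_e, _ = draw_sample(n_exp, beta, delta, rng)   # outcome not used
    s_o, _, y_o = draw_sample(n_obs, beta, delta, rng)   # treatment not used
    # Surrogate index estimator: fitted slope times the treated-control gap in S.
    s1 = mean(si for si, wi in zip(s_e, w_e) if wi == 1)
    s0 = mean(si for si, wi in zip(s_e, w_e) if wi == 0)
    tau_index = ols_slope(s_o, y_o) * (s1 - s0)
    # Surrogate score estimator: outcome averages weighted by r and 1 - r.
    a, b = logit(s_e, w_e)
    r = [1.0 / (1.0 + math.exp(-(a + b * si))) for si in s_o]
    tau_score = (sum(ri * yi for ri, yi in zip(r, y_o)) / sum(r)
                 - sum((1 - ri) * yi for ri, yi in zip(r, y_o))
                 / sum(1 - ri for ri in r))
    return tau_index, tau_score


def monte_carlo(reps=100, n_exp=300, n_obs=300, beta=1.0, delta=1.0, seed=0):
    """Return {estimator name: (bias, standard deviation)} over replications."""
    rng = random.Random(seed)
    draws = [one_replication(n_exp, n_obs, beta, delta, rng) for _ in range(reps)]
    truth = beta * delta
    return {name: (mean(vals) - truth, stdev(vals))
            for name, vals in zip(("index", "score"), zip(*draws))}


summary = monte_carlo()
```

Varying `n_exp` and `n_obs` in this sketch, or extending it to several surrogates, gives a rough analogue of the comparisons reported in the simulation sections.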

### 6.3 Misspecification

In this section, we consider the effect of using an inadequate number of surrogates. In our set up there are 250 surrogates that collectively satisfy the surrogacy assumption. We then compare the two estimators, using only the first surrogates, for . The sample size remains fixed at 1,000, with and . The coefficients on the surrogate variables are , so that the initial surrogates are the most important ones.

Figure 2 shows the result of the simulation.

We see that initially increasing the number of surrogates improves the bias of both estimators. As the number of surrogates increases, at some point the remaining surrogates contribute too little information to improve the bias, and small sample issues start dominating. At that point the bias starts increasing with the number of covariates, just as in the earlier simulations where the set of surrogates used was always sufficient.

### 6.4 Different Sample Sizes

In this section, we consider the effect of having different sample sizes from different samples in estimation. The simulation setup is identical to Section 6.2 except we fix , set and so that and the treatment effect is equal to , and vary , the relative proportion of the experimental sample. A implies that there are more units in the observational data than the experimental data while a implies that there are more units in the experimental data than the observational data. At , the sample sizes between the experimental and the observational samples are identical. We vary from to and study the estimation properties of and under this setting.

Proportion experimental | Surrogate score estimator: Bias | Surrogate score estimator: Standard Deviation | Surrogate index estimator: Bias | Surrogate index estimator: Standard Deviation |
---|---|---|---|---|
0.05 | 2.011 | 6.357 | 0.023 | 7.490 |
0.25 | 0.001 | 3.018 | 0.060 | 3.508 |
0.5 | 0.013 | 2.801 | 0.012 | 2.850 |
0.75 | 0.067 | 3.482 | 0.012 | 3.004 |
0.95 | 0.423 | 7.420 | 2.747 | 6.434 |

Table 1 summarizes the results. When the sample sizes are roughly equal in the observational and the experimental samples, we achieve the lowest variance for both estimators, and the variance of both estimators forms a bowl shape as we vary the proportion of experimental units. However, the bias fluctuates depending on this proportion and on the estimator. For example, the bias of the surrogate score estimator is highest when the proportion is 0.05, perhaps because the surrogate score is poorly estimated due to the small size of the experimental sample, even though there are many units in the observational sample. Similarly, the bias of the surrogate index estimator is highest when the proportion is 0.95, most likely because the surrogate index is poorly estimated from the small observational sample. However, for the surrogate score estimator, even when the proportion is 0.95 and we have a better estimate of the surrogate score, there is still more bias than at intermediate proportions, since there are not enough units in the observational data. A similar phenomenon can be observed for the surrogate index estimator when the proportion is 0.05 and we have a good estimate of the surrogate index, although its bias there is less pronounced than that of the surrogate score estimator at 0.95. Indeed, when it comes to bias, the simulation suggests a complex non-linear trade-off between obtaining good estimates of the surrogate score or index and having enough units in the other sample to make use of these estimates.

### 6.5 Explanatory Power

In this section, we characterize the behavior of the two estimators when we increase the explanatory power of the surrogate score and the index. The simulation setup is identical to Section 6.2 except we fix and we set and based on the following distributions laid out in Table 2.

Design | Bias | Standard Deviation | Bias | Standard Deviation |
---|---|---|---|---|
, | 0.030 | 2.191 | 0.022 | 2.214 |
, | 0.235 | 3.407 | 0.137 | 3.448 |
, | 0.169 | 3.089 | 0.093 | 3.162 |
, | 0.222 | 3.566 | 0.111 | 3.581 |

As expected, we see that as the variance of and increase, the variance of both estimators increases, although obviously if the surrogates have very little explanatory power the variance must increase. The story for bias is a bit more complex. Bias tends to be the lowest when the variance of and is small, with the exception of the estimator , which has lower bias than its counterpart . Note that the bias of is affected by the variance increase in any one of the parameters and .

### 6.6 Summary

In summary, the simulation study reveals the following trends. First, while fixing the sample size, if one increases the dimensions of the surrogates, outperforms in terms of variance. Second, the sensitivity to misspecification is similar. Third, when the sample sizes of the two data sets differ, there is an interesting trade-off between bias and variance for both estimators. For example, variance tends to be minimized when the two data sets have equal sample sizes, and bias tends to be minimized at non-extreme, but not necessarily equal, sample sizes. The modelling assumptions, when correct, are more valuable for the smaller of the two samples, so that if the experimental sample is smaller than the observational sample, outperforms . Fourth, the explanatory power simulation suggests that when and are drawn from distributions with higher variance the bias tends to be small for compared to . The simulation study, especially the part concerning unequal sample sizes, hints at the complexity of estimation and of the finite-sample performance of these estimators, and we leave it as an area of future research to precisely characterize the properties of the estimators.

## 7 The Single Sample Design: Efficiency

In this section we consider the single sample design, and analyze the potential for efficiency gains that might arise by exploiting the surrogacy assumption. We use our findings to further quantify the efficiency losses that arise due to the failure to observe the long-term outcome in the two-sample setting. Focusing on the information content from the surrogacy assumption, our semiparametric efficiency bound analysis follows in the spirit of Bickel, Klaassen, Ritov and Wellner (1993).

### 7.1 Efficiency Bounds: The Value of Surrogacy

In the single sample case, in the absence of covariates and without further assumptions, it is well known that an efficient estimator for the effect of a treatment on the outcome is the difference between the sample mean of the treated outcomes and the sample mean of the control outcomes. Thus, it might seem that incorporating surrogate variables in estimation (for example, by replacing the outcome by the surrogate index, as in the surrogate index estimator) would hurt efficiency. However, in this section we show that the opposite is true once we incorporate the surrogacy assumption. The intuition is that the surrogacy assumption allows us to pool all data, including data for both treated and control units, when estimating the relationship between the outcome and the surrogates, since the surrogacy assumption requires that this relationship does not vary with the treatment.

Let , , and . Then, we have the following efficiency result.

###### Theorem 3.

The efficiency bound without assuming surrogacy, but when surrogacy holds is