## 1 Introduction

The gold standard for inferring the causal effect of a treatment (such as smoking, a tax policy, or a fertilizer) on an outcome (such as blood pressure, stock prices, or crop yield) is the randomized experiment: the analyst assigns the treatment to each of her units uniformly at random. Unfortunately, this manipulation is impossible or unethical for some treatments of practical interest, leading to the need to infer causal relations from observational studies.
In many observational studies, it is common for *instrumental variables (IV)* to be available. These variables are (a) allocated randomly across units, (b) correlated with the treatment, and (c) affect the dependent variable only through their effect on the treatment. The fact that instrumental variables allow for consistent estimation of causal effect with non-randomized treatments is a hallmark of the causal inference literature, and has led to the use of IV methods across many different applied settings (e.g., joskow1987contract; gerber2000; acemoglu2001; autor2013).

The most popular existing method that uses instrumental variables to conduct causal inference is Two-Stage Least Squares regression (2SLS) (angrist1991does; card1993using; wooldridge2010econometric). The 2SLS methodology makes strong parametric assumptions about the underlying outcome model (linearity), which do not generalize well to complex problems. Non-parametric approaches to IV-based causal estimation generalize 2SLS to more complex models (newey2003instrumental; frolich2007), but lack interpretability: it is difficult to troubleshoot or trust black-box models. Matching methods that allow for nonparametric inference on average treatment effects without requiring functional estimation have recently been introduced for the IV problem in kang2016full. The full-matching algorithm presented in their work relaxes some of the strong assumptions of 2SLS; however, it does not scale well to massive datasets, and it imposes a fixed metric on covariates. It also does not take into account that covariates have different levels of importance for matching.

The approach for instrumental variable analysis presented in this paper aims to handle the problems faced by existing methods: it is non-parametric, scalable, and preserves the interpretability of having high-quality matched groups. We create an Almost-Matching Exactly framework (roy2017flame; dieng2018collapsing) for the purpose of instrumental variable analysis. Our methodology estimates the causal effects in a non-parametric way and hence performs better than 2SLS or other parametric models. It improves over existing matching methods for instrumental variables when covariates are discrete, leveraging an adaptive distance metric. This adaptive distance metric is capable of systematically accounting for nuisance variables, discounting their importance for matching. The algorithm scales easily to large datasets (millions of observations) and can be implemented within most common database systems for optimal performance.

In what follows, first we introduce the problem of instrumental variable estimation for observational inference, and describe the role of matching within it. Second, we outline the Almost-Matching Exactly with Instrumental Variables (AME-IV) framework for creating matched groups. Third, we describe estimators with good statistical properties that can be used on the matched data. Finally, we present results from applying our methodology to both simulated and real-world data: we show that the method performs well in most settings and outperforms existing approaches in several scenarios.

## 2 Related Work

Widely used results on definition and identification of IVs are given in imbens1997bayesian; angrist1996identification, and generalized in brito2002generalized; chen2016incorporating. Methods for discovery of IVs are developed in silva2017.

The most popular method for IV estimation in the presence of observed confounders is two-stage least squares (2SLS) (card1993using). 2SLS estimators are consistent and efficient under linear single-variable structural equation models with a constant treatment effect (wooldridge2010econometric). One drawback of 2SLS is its sensitivity to misspecification of the model. Matching, on the other hand, allows for correct inference without the need to specify an outcome model.

Recent work on matching for IV estimation includes matching methods that match directly on covariates, rather than on summary statistics like propensity score (ichimura2001propensity). These matching methods can be very powerful nonparametric estimators; full matching (kang2013causal) is one such approach, but has a limitation in that its distance metric between covariates is fixed, whereas ours is learned. roy2017flame provides an in-depth discussion of other matching methods including near-far and full-matching, in the context of AME.

Other IV methods in the presence of measured covariates include Bayesian methods (imbens1997bayesian), semiparametric methods (abadie2003semiparametric; tan2006regression; ogburn2015doubly), nonparametric methods (frolich2007) and deep learning methods (hartford2017deep), but these methods do not enjoy the benefits of interpretability that matching provides.

## 3 Methodology

We consider the problem of instrumental variable estimation for a set of units indexed by $i = 1, \dots, n$. Each unit is randomly assigned to a binary instrument level $z \in \{0, 1\}$. Units respond to being assigned different levels of this instrument by either taking up the treatment or not: we denote with $t_i(z) \in \{0, 1\}$ the treatment level taken up by unit $i$ after being exposed to value $z$ of the instrument. Subsequently, units respond to a treatment/instrument regime by exhibiting different values of the outcome variable of interest, which we denote by $y_i(t, z)$. Note that this response depends both on the value of the instrument assigned (2nd argument) and on the treatment value that units take up in response to that instrument value (1st argument). All quantities introduced so far are fixed for a given unit but not always observed. In practice, we have a random variable $Z_i$ for each unit denoting the level of instrument that it was assigned, and observed realizations of $Z_i$ are denoted with $z_i$. Whether a unit receives treatment is now a random variable ($T_i$), and the outcome is random ($Y_i$), and they take the form:

$$T_i = t_i(Z_i), \qquad Y_i = y_i(t_i(Z_i), Z_i).$$

Note that the only randomness in the observed variables comes from the instrument; all other quantities are fixed. We use $t_i$ and $y_i$ to denote observed realizations of $T_i$ and $Y_i$ respectively. We also observe a fixed vector of $p$ covariates for each unit, $\mathbf{x}_i \in \mathcal{X}$, where $\mathcal{X}$ is a space with $p$ dimensions. In this paper we are interested in the case in which $\mathcal{X} = \{0, 1\}^p$, corresponding to categorical variables, where exact matching is well-defined.
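To make the notation concrete, here is a minimal simulated example (all model choices here are hypothetical illustrations, not the paper's data-generating process) in which the potential treatments $t_i(z)$ and outcomes are fixed per unit, the outcome function ignores $z$ (exclusion built in), and the only randomness in the observed data comes from the instrument:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Fixed potential treatments t_i(z): in this toy model a unit takes up the
# treatment only when encouraged (z = 1) and its latent propensity exceeds 0.5,
# so monotonicity t_i(1) >= t_i(0) holds by construction.
u = rng.uniform(size=n)

def t(i, z):
    return int(z == 1 and u[i] > 0.5)

# Fixed potential outcomes y_i(t): the instrument z does not appear as an
# argument, so the exclusion restriction holds by construction.
tau = 10.0  # hypothetical constant treatment effect
base = rng.normal(size=n)

def y(i, t_val):
    return base[i] + tau * t_val

# The only randomness in the observed data comes from the instrument Z_i:
Z = rng.integers(0, 2, size=n)
T = np.array([t(i, Z[i]) for i in range(n)])
Y = np.array([y(i, T[i]) for i in range(n)])
```

Rerunning with a new draw of `Z` changes `T` and `Y`, while `u`, `base`, and the functions `t` and `y` stay fixed, mirroring the distinction between potential and observed quantities above.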

Throughout we make the SUTVA assumption, that is (i) outcome and treatment assignment for each individual are unrelated to the instrument exposure of other individuals, and (ii) the outcome for each individual is unrelated to the treatment assignment of other individuals (angrist1996identification). However, ignorability of treatment assignment is not required. We make use of the instrumental variable to estimate the causal effect of treatment on outcome. In order for a variable to be a valid instrument it must satisfy the following standard assumptions (see, e.g., imbens1994identification; angrist1996identification; imbens2015):

(A1) Relevance: $\mathbb{E}[t_i(1) - t_i(0)] \neq 0$, that is, the variable does indeed have a non-zero causal effect on treatment assignment, on average.

(A2) Exclusion: If $z \neq z'$ then $y_i(t, z) = y_i(t, z')$ for each unit $i$. This assumption states that unit $i$'s potential outcomes are only affected by the treatment it is exposed to, and not by the value of the instrument. Therefore, the potential outcome can be denoted by $y_i(t)$.

(A3) Ignorability: $\Pr(Z_i = 1 \mid \mathbf{x}_i = \mathbf{x}) = e(\mathbf{x})$ for all units $i$, and some non-random function $e$. This assumption states that the instrument is assigned with the same probability to all units that have covariate value $\mathbf{x}$. It implies that if two units $i$ and $k$ have $\mathbf{x}_i = \mathbf{x}_k$, then $\Pr(Z_i = 1) = \Pr(Z_k = 1)$.

(A4) Strong Monotonicity: $t_i(1) \geq t_i(0)$ for each unit $i$. This assumption states that the instrument is seen as an encouragement to take up the treatment: this encouragement will only make it more likely that units take up the treatment, and never less likely.

Figure 1 gives a graphical summary of the identification assumptions. An instrumental variable satisfying (A1, A2, A3 and A4) allows us to estimate the treatment effect for the subgroup of units that respond positively to exposure to the instrument (imbens1994identification): units that would have undertaken the treatment only after administration of the instrument, and never without (angrist1996identification). Note that we cannot identify these units in our sample, given what we observe, but we can estimate the treatment effect on them (imbens2015). We also note that these are not the only criteria for the use of instrumental variables; for example, brito2002generalized introduces a graphical criterion for identification with instrumental variables. The treatment effect on this subgroup is known as the Local Average Treatment Effect (LATE) and takes the following form (imbens1994identification; angrist1996identification):

$$\lambda = \frac{\sum_{\mathbf{x}} w_{\mathbf{x}} \, \lambda_y(\mathbf{x})}{\sum_{\mathbf{x}} w_{\mathbf{x}} \, \lambda_t(\mathbf{x})} \qquad (1)$$

where $n$ is the total number of units, $w_{\mathbf{x}} = n_{\mathbf{x}}/n$ is the weight associated with each value of $\mathbf{x}$, $n_{\mathbf{x}}$ is the number of units where $\mathbf{x}_i = \mathbf{x}$, and:

$$\lambda_y(\mathbf{x}) = \mathbb{E}[Y_i \mid Z_i = 1, \mathbf{x}_i = \mathbf{x}] - \mathbb{E}[Y_i \mid Z_i = 0, \mathbf{x}_i = \mathbf{x}],$$
$$\lambda_t(\mathbf{x}) = \mathbb{E}[T_i \mid Z_i = 1, \mathbf{x}_i = \mathbf{x}] - \mathbb{E}[T_i \mid Z_i = 0, \mathbf{x}_i = \mathbf{x}].$$

The quantities above are also known as the Intent-To-Treat effects: they represent the causal effects of the instrument on the outcome and the treatment, respectively. Intuitively, these effects can be estimated in an unbiased and consistent way due to ignorability of instrument assignment (A3) conditional on units having the same value of $\mathbf{x}$.
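The plug-in version of this identification result replaces each conditional expectation with a within-cell sample mean. A sketch on synthetic data (the column names and compliance model below are our own hypothetical choices):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 5000
df = pd.DataFrame({
    "x": rng.integers(0, 4, size=n),  # one discrete covariate with 4 values
    "z": rng.integers(0, 2, size=n),  # randomly assigned binary instrument
})
# Compliers take up the treatment only when encouraged (hypothetical model).
complier = rng.uniform(size=n) < 0.6
df["t"] = ((df["z"] == 1) & complier).astype(int)
df["y"] = 10.0 * df["t"] + rng.normal(size=n)  # true effect of t on y is 10

# Within-cell intent-to-treat effects of z on the outcome and on the treatment:
z1 = df[df.z == 1].groupby("x")
z0 = df[df.z == 0].groupby("x")
lam_y = z1["y"].mean() - z0["y"].mean()
lam_t = z1["t"].mean() - z0["t"].mean()

# Weighted ratio over covariate cells, as in Eq. (1):
w = df.groupby("x").size() / len(df)  # weights n_x / n
late = (w * lam_y).sum() / (w * lam_t).sum()
```

With a strong instrument and many units per cell, `late` lands close to the true effect of 10; the estimator degrades as cells become sparse, which is exactly the motivation for approximate matching below.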

Approximate matching comes into this framework because in practice we almost never have enough treated and control units with the same exact values of $\mathbf{x}$ in our observed data to accurately estimate the quantities above. With approximate matching, we want to construct matched groups from the observed $\mathbf{x}_1, \dots, \mathbf{x}_n$ such that A3 holds approximately within each group. This means that a good approximate matching algorithm is one that produces groups where, if $i$ and $k$ are grouped together, then $e(\mathbf{x}_i) \approx e(\mathbf{x}_k)$. In the next section, we propose the Almost-Matching Exactly with Instrumental Variables (AME-IV) framework to build good approximately matched groups from binary covariates.

### 3.1 Almost-Matching Exactly with Instrumental Variables (AME-IV Problem)

The AME-IV framework has the goal of matching each instrumented (i.e., $z_i = 1$) unit to at least one non-instrumented unit (i.e., $z_k = 0$) as exactly as possible. (The entire set of calculations is symmetric when we match each non-instrumented unit, thus w.l.o.g. we consider only instrumented units.) When units are matched on all covariates, this is an exact match. When units can be matched on the most important covariates (but not necessarily all covariates), this is an almost-exact match. The importance of covariate $j$ for matching is represented by a fixed nonnegative weight $w_j$. Thus, we consider the following problem for each instrumented unit $i$, which is to maximize the weighted sum of covariates on which we can create a valid matched group for $i$:

$$\boldsymbol{\theta}_i^* \in \arg\max_{\boldsymbol{\theta} \in \{0,1\}^p} \boldsymbol{\theta}^\top \mathbf{w} \quad \text{such that} \quad \exists\, k \text{ with } z_k = 0 \text{ and } \mathbf{x}_k \circ \boldsymbol{\theta} = \mathbf{x}_i \circ \boldsymbol{\theta},$$

where $\circ$ denotes the Hadamard product, $\boldsymbol{\theta} \in \{0,1\}^p$ is a binary vector to represent whether or not each covariate is used for matching, and $\mathbf{w}$ is a nonnegative vector with a reward value associated with matching on each covariate. The constraint in our optimization problem definition guarantees that the main matched group of each instrumented unit contains at least one non-instrumented unit. The solution $\boldsymbol{\theta}_i^*$ to this optimization problem is a binary indicator of the optimal set of covariates that unit $i$ can be matched on. Note that, if all entries of $\boldsymbol{\theta}_i^*$ happen to be one, then the units in unit $i$'s main matched group will be exact matches for $i$.

We define $i$'s main matched group in terms of $\boldsymbol{\theta}_i^*$ as:

$$\mathtt{MG}(\boldsymbol{\theta}_i^*, i) = \left\{ k : \mathbf{x}_k \circ \boldsymbol{\theta}_i^* = \mathbf{x}_i \circ \boldsymbol{\theta}_i^* \right\}.$$
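For small $p$, the optimization above can be solved for a single instrumented unit by brute-force enumeration of the binary vectors (exponential in $p$; the actual FLAME-IV algorithm of Section 3.3 avoids this enumeration). A sketch with a hypothetical toy dataset:

```python
import numpy as np
from itertools import product

def ame_iv_theta(x_i, X_noninst, w):
    """Brute-force theta*: maximize w . theta subject to at least one
    non-instrumented unit matching x_i exactly on theta's covariates."""
    best, best_val = None, -1.0
    for bits in product([0, 1], repeat=len(x_i)):
        theta = np.array(bits)
        # feasible iff some z=0 unit agrees with x_i wherever theta_j = 1
        feasible = np.any(np.all((X_noninst == x_i) | (theta == 0), axis=1))
        if feasible and w @ theta > best_val:
            best, best_val = theta, w @ theta
    return best

def main_matched_group(x_i, X, theta):
    """Indices of units agreeing with x_i on every covariate with theta_j = 1."""
    return np.where(np.all((X == x_i) | (theta == 0), axis=1))[0]

# toy example: matching on covariates {0, 1} (reward 5) beats {1, 2} (reward 3)
w = np.array([3.0, 2.0, 1.0])
x_i = np.array([1, 0, 1])
X_noninst = np.array([[1, 0, 0],   # agrees with x_i on covariates 0 and 1
                      [0, 0, 1]])  # agrees with x_i on covariates 1 and 2
theta_star = ame_iv_theta(x_i, X_noninst, w)
```

Here no non-instrumented unit matches `x_i` exactly, so the solver settles for the highest-reward subset of covariates on which a match exists.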

We now theoretically connect Assumption A3 with solving the AME-IV problem, and show how approximate matches can lead to the assumption being approximately satisfied within each matched group. This makes IV estimation possible even when it is not possible to exactly match each unit. To do so, we introduce the notation $\mathbf{1}_{\mathbf{x}_i = \mathbf{x}_k}$ to denote a vector of length $p$ whose $j$-th entry is one if $x_{ij} = x_{kj}$ and zero otherwise.

###### Lemma 3.1.

For any unit $i$ with $z_i = 1$ and $\boldsymbol{\theta}_i^*$ as defined in the AME-IV problem, and for any unit $k$ with $z_k = 0$, if $k \in \mathtt{MG}(\boldsymbol{\theta}_i^*, i)$, i.e., $\mathbf{x}_k \circ \boldsymbol{\theta}_i^* = \mathbf{x}_i \circ \boldsymbol{\theta}_i^*$, we have:

$$\mathbf{w}^\top \left( \mathbf{1} - \mathbf{1}_{\mathbf{x}_i = \mathbf{x}_k} \right) \leq \mathbf{w}^\top \left( \mathbf{1} - \boldsymbol{\theta}_i^* \right). \qquad (2)$$

In particular, if $\boldsymbol{\theta}_i^*$ has all entries equal to one and $k \in \mathtt{MG}(\boldsymbol{\theta}_i^*, i)$, then $\mathbf{x}_k = \mathbf{x}_i$.

The detailed derivation of this lemma is in the supplement. This statement clarifies that, by solving the AME-IV problem, we minimize the weighted Hamming distance between each unit $i$ and the units with the opposite assignment of the instrument that belong to $i$'s main matched group. We now introduce a smoothness assumption under which we can formally link the matched groups created by AME-IV with the necessary conditions for causal estimation using instrumental variables.

(A5) Smoothness: For any two $\mathbf{x}_i, \mathbf{x}_k \in \mathcal{X}$ and weights $\mathbf{w}$, we have: $|e(\mathbf{x}_i) - e(\mathbf{x}_k)| \leq g\!\left(\mathbf{w}^\top(\mathbf{1} - \mathbf{1}_{\mathbf{x}_i = \mathbf{x}_k})\right)$, where $g$ is an increasing function of the weighted Hamming distance such that $g(0) = 0$.

Note that this is a variant of a standard assumption made in most matching frameworks (see, e.g., rosenbaum2010design). The following proposition follows immediately from applying Lemma 3.1 to A5. If $k \in \mathtt{MG}(\boldsymbol{\theta}_i^*, i)$ with $z_k \neq z_i$, and A5 holds, then

$$|e(\mathbf{x}_i) - e(\mathbf{x}_k)| \leq g\!\left(\mathbf{w}^\top(\mathbf{1} - \boldsymbol{\theta}_i^*)\right).$$

In particular, if $\boldsymbol{\theta}_i^*$ is one in all entries, then $e(\mathbf{x}_i) = e(\mathbf{x}_k)$.

With this observation, we know that units matched together will have similar probabilities of being instrumented (in fact, as similar as finite data permits). This will allow us to produce reliable estimates of $\lambda$ using our matched groups, provided that the data actually contain matches of sufficiently high quality.

### 3.2 Full AME-IV Problem

In the full version of the AME-IV problem, the weights are chosen so that the variables used for each matched group have a useful quality: these variables together can create a high-quality predictive model for the outcomes. The weights become variable importance measures for each of the variables.

In order to determine the importance of each variable $j$, we use variable importance techniques to analyze machine learning models trained on a separate holdout set. Specifically, the units are divided into a training set and a holdout set; the first is used to create matched groups and estimate causal quantities, and the second to learn the importance of each of the variables for match quality. Formally, define the empirical predictive error on the holdout set $\{(\mathbf{x}_i^h, y_i^h)\}_{i=1}^{n_h}$, for the set of variables indicated by $\boldsymbol{\theta}$, as:

$$\mathtt{PE}_{\mathcal{F}}(\boldsymbol{\theta}) = \min_{f \in \mathcal{F}} \frac{1}{n_h} \sum_{i=1}^{n_h} \left( f(\boldsymbol{\theta} \circ \mathbf{x}_i^h) - y_i^h \right)^2,$$

where $\mathcal{F}$ is some class of prediction functions. The empirical predictive error measures the usefulness of a set of variables. (The set of variables being evaluated are the ones highlighted by the indicator vector $\boldsymbol{\theta}$.)

We ensure that we always match using sets of variables that together have a low error . In fact, for each unit, if we cannot match on all the variables, we will aim to match on the set of variables for which the lowest possible prediction error is attained. Because of this, all matched groups are matched on a set of variables that together can predict outcomes sufficiently well.
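The prediction error of a candidate covariate set can be estimated with any supervised learner on held-out data. A minimal sketch using ordinary least squares as the function class (an illustrative choice of ours, not the paper's prescribed learner):

```python
import numpy as np

def predictive_error(theta, X_hold, y_hold):
    """Empirical PE sketch: fit least squares on half of the holdout sample
    using only the covariates selected by binary vector theta, then score
    mean squared error on the other half."""
    cols = np.flatnonzero(theta)
    n = len(y_hold)
    A = np.column_stack([X_hold[:, cols], np.ones(n)])  # selection + intercept
    half = n // 2
    beta, *_ = np.linalg.lstsq(A[:half], y_hold[:half], rcond=None)
    resid = y_hold[half:] - A[half:] @ beta
    return float(np.mean(resid ** 2))

# toy check: covariate 0 drives the outcome, covariate 1 is pure noise
rng = np.random.default_rng(2)
X = rng.integers(0, 2, size=(500, 2)).astype(float)
y = 5.0 * X[:, 0] + rng.normal(scale=0.1, size=500)
pe_good = predictive_error(np.array([1, 0]), X, y)  # keeps the useful covariate
pe_bad = predictive_error(np.array([0, 1]), X, y)   # keeps only the noise covariate
```

As intended, the covariate set containing the predictive variable attains a much lower error, so a matching procedure guided by this criterion will prefer to retain it.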

The Full-AME-IV problem can thus be stated as: for all instrumented units $i$,

$$\boldsymbol{\theta}_i^* \in \arg\min_{\boldsymbol{\theta} \in \{0,1\}^p} \mathtt{PE}_{\mathcal{F}}(\boldsymbol{\theta}) \quad \text{such that} \quad \exists\, k \text{ with } z_k = 0 \text{ and } \mathbf{x}_k \circ \boldsymbol{\theta} = \mathbf{x}_i \circ \boldsymbol{\theta}.$$

When the prediction error is a linear function of the indicator of the covariates used for matching, solving the problem above is equivalent to solving the general AME-IV problem with an appropriate choice of weights. An analogous result holds without IVs for the AME problem (roy2017flame).

In the standard Full-AME problem, there is no instrument, and each matched group must contain both treatment and control units; in the Full-AME-IV case, the key is instead to match instrumented units with non-instrumented units regardless of treatment. Intuitively, this makes sense because treatment uptake is itself an outcome of instrumentation in the IV framework: a group with very large or very small numbers of treated or control units would simply indicate that units with certain values of $\mathbf{x}$ are highly likely or highly unlikely to respond to the instrument by taking up the treatment.

### 3.3 FLAME-IV: An Approximate Algorithm for the Full-AME-IV Problem

We extend ideas from the Fast Large-scale Almost Matching Exactly (FLAME) algorithm introduced by roy2017flame to approximately solve the Full-AME-IV problem. Our algorithm – FLAME-IV – uses instrumental variables to create matched groups that have at least one instrumented and one non-instrumented unit within them. The procedure starts with an exact matching step that finds all exact main matched groups. Then, at each iteration, FLAME-IV chooses one covariate to drop, and creates matched groups on the remaining covariates. To decide which covariate to drop at each iteration, FLAME-IV loops through the possibilities: it temporarily drops one covariate and computes the match quality $\mathtt{MQ}$ after dropping this covariate. Then FLAME-IV selects the covariate for which $\mathtt{MQ}$ was maximized during this loop. Match quality is defined as a trade-off between the prediction error $\mathtt{PE}_{\mathcal{F}}$ (which is defined in Section 3.2) and a balancing factor $\mathtt{BF}$:

$$\mathtt{MQ} = C \cdot \mathtt{BF} - \mathtt{PE}_{\mathcal{F}}, \qquad \mathtt{BF} = \frac{\#\,\text{matched instrumented units}}{\#\,\text{available instrumented units}} + \frac{\#\,\text{matched non-instrumented units}}{\#\,\text{available non-instrumented units}},$$

where $C \geq 0$ is a trade-off hyperparameter. $\mathtt{PE}_{\mathcal{F}}$ is computed on the holdout dataset. In practice, the balancing factor improves the quality of matches by preventing FLAME-IV from leaving too many units stranded without matched groups. That is, it could prevent all treated units from being matched to the same few control units when more balanced matched groups were possible. More details about the FLAME-IV algorithm are in the supplement.
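One elimination step of this backward procedure can be sketched as follows. This is our simplified illustration, not the reference implementation: `pe_fn` stands in for any holdout prediction-error routine, matching is exact on the retained covariates, and the score has the form $C \cdot \mathtt{BF} - \mathtt{PE}$:

```python
import numpy as np

def exact_match_counts(X, z, cols):
    """Match exactly on covariates `cols`: a unit is matched iff its cell
    contains at least one unit from each instrument arm."""
    keys = [tuple(row) for row in X[:, cols]]
    arms = {}
    for k, zi in zip(keys, z):
        arms.setdefault(k, set()).add(int(zi))
    matched = np.array([len(arms[k]) == 2 for k in keys])
    return matched[z == 1].sum(), matched[z == 0].sum()

def balancing_factor(X, z, cols):
    m1, m0 = exact_match_counts(X, z, cols)
    return m1 / max((z == 1).sum(), 1) + m0 / max((z == 0).sum(), 1)

def choose_drop(X, z, active, pe_fn, C=0.1):
    """Temporarily drop each active covariate; return the drop that
    maximizes match quality MQ = C * BF - PE."""
    best_j, best_mq = None, -np.inf
    for j in active:
        cols = [c for c in active if c != j]
        mq = C * balancing_factor(X, z, cols) - pe_fn(cols)
        if mq > best_mq:
            best_j, best_mq = j, mq
    return best_j

# toy check: covariate 0 is needed for prediction (dropping it raises PE),
# covariate 1 is noise, so the elimination step should drop covariate 1
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(100, 2))
z_toy = rng.integers(0, 2, size=100)
pe_toy = lambda cols: 0.0 if 0 in cols else 1.0  # stand-in PE routine
dropped = choose_drop(X_toy, z_toy, [0, 1], pe_toy)
```

Iterating `choose_drop` over the shrinking active set, and recording the groups formed after each drop, yields the overall backward-elimination loop described above.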

It is recommended to early-stop the algorithm before the match quality $\mathtt{MQ}$ drops by 5% or more (roy2017flame). This way, the set of variables defining each matched group is sufficient to predict outcomes well. The details about early-stopping are in the supplement.

## 4 Estimation

Assuming that (A1) through (A5) and SUTVA hold, the LATE, $\lambda$, can be estimated in a consistent way (imbens1994identification; angrist1996identification); in this section we adapt common estimators for $\lambda$ to our matching framework. Consider a collection of matched groups, $\mathtt{MG}_1, \dots, \mathtt{MG}_M$, each associated with a different value of $\mathbf{x}$. We estimate the average causal effect of the instrument on the treatment, $\lambda_t(m)$, and on the outcome, $\lambda_y(m)$, within each matched group $m$, and then take the ratio of their weighted sums over all groups to estimate $\lambda$.

We start with the canonical estimator for $\lambda_y(m)$:

$$\hat{\lambda}_y(m) = \frac{1}{n_{1m}} \sum_{i \in \mathtt{MG}_m} Z_i Y_i - \frac{1}{n_{0m}} \sum_{i \in \mathtt{MG}_m} (1 - Z_i) Y_i, \qquad (3)$$

where $n_{1m}$ and $n_{0m}$ denote the numbers of instrumented and non-instrumented units in group $m$. Similarly, the estimator for the causal effect of the instrument on the treatment, $\lambda_t(m)$, can be written as:

$$\hat{\lambda}_t(m) = \frac{1}{n_{1m}} \sum_{i \in \mathtt{MG}_m} Z_i T_i - \frac{1}{n_{0m}} \sum_{i \in \mathtt{MG}_m} (1 - Z_i) T_i. \qquad (4)$$

From the form of $\lambda$ in Equation (1) it is easy to see that, if the estimators in (3) and (4) are unbiased for $\lambda_y(m)$ and $\lambda_t(m)$ respectively (which is true, for instance, when matches are made exactly for all units), then the ratio of their weighted averages across all matched groups is a consistent estimator for $\lambda$:

$$\hat{\lambda} = \frac{\sum_{m=1}^{M} n_m \hat{\lambda}_y(m)}{\sum_{m=1}^{M} n_m \hat{\lambda}_t(m)}, \qquad (5)$$

where $n_m$ denotes the number of units in matched group $m$. A natural extension of this framework allows us to estimate the LATE within matched group $m$, defined as:

$$\lambda(m) = \frac{\lambda_y(m)}{\lambda_t(m)}. \qquad (6)$$

This can be accomplished with the following estimator:

$$\hat{\lambda}(m) = \frac{\hat{\lambda}_y(m)}{\hat{\lambda}_t(m)}. \qquad (7)$$
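Given matched-group labels, these estimators reduce to within-group differences of means and a weighted ratio. A sketch with hypothetical column names:

```python
import numpy as np
import pandas as pd

def group_itt(g, col):
    """Difference in means of `col` between instrument arms within one group."""
    return g.loc[g.z == 1, col].mean() - g.loc[g.z == 0, col].mean()

def late_estimates(df):
    """df columns: y, t, z, mg (matched-group label).
    Returns the overall weighted-ratio estimator and per-group ratios."""
    rows = [(len(g), group_itt(g, "y"), group_itt(g, "t"))
            for _, g in df.groupby("mg")]
    n_m, ly, lt = map(np.array, zip(*rows))
    overall = (n_m * ly).sum() / (n_m * lt).sum()  # ratio of weighted sums
    per_group = ly / lt                            # group-level LATE estimates
    return overall, per_group

# toy data: two matched groups with effects 10 and 12, perfect compliance
df_toy = pd.DataFrame({
    "mg": [0, 0, 1, 1],
    "z":  [1, 0, 1, 0],
    "t":  [1, 0, 1, 0],
    "y":  [10.0, 0.0, 12.0, 0.0],
})
overall, per_group = late_estimates(df_toy)
```

In the toy data both groups have an instrument-on-treatment effect of 1, so the overall estimate is the equally weighted average of the two group effects.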

We quantify uncertainty around our estimates with asymptotic Confidence Intervals (CIs). To compute CIs for these estimators we adapt the approach laid out in imbens2015. Details on variance estimators and computations are given in the supplement.

In the following section, we present simulations that employ these estimators in conjunction with the algorithms presented in the previous section to estimate $\lambda$ and $\lambda(m)$. The performance of our methodology is shown to surpass that of other existing approaches.

## 5 Simulations

We evaluate the performance of our method using simulated data. We compare our approach to several other methods including two-stage least squares (angrist1991does; card1993using; wooldridge2010econometric), and two other state-of-the-art nonparametric methods for instrumental variables, full matching (kang2016full) and nearfar matching (baiocchi2010building). Full matching and nearfar matching find units that differ on the instrument while being close in covariate space according to a predefined distance metric. Both algorithms rely on a sample-rank Mahalanobis distance with an instrument propensity score caliper.

We implement FLAME-IV using bit-vector calculations. More details about the implementation are in the supplementary materials.
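To illustrate the flavor of the bit-vector idea (a simplified sketch of ours, not the actual implementation), each unit's retained covariate values can be packed into a single integer, so that exact matching reduces to grouping equal codes and checking that both instrument arms appear in a cell:

```python
import numpy as np

def match_bitvector(X, z, cols):
    """Exactly match on binary covariates `cols` by packing them into one
    integer per unit; a unit is matched iff its cell contains both arms."""
    code = np.zeros(len(X), dtype=np.int64)
    for c in cols:  # one bit per binary covariate
        code = (code << 1) | X[:, c]
    # append the instrument bit so each cell's arms can be counted
    code_z = (code << 1) | z
    cells, _ = np.unique(code_z, return_counts=True)
    arms = {}
    for cz in cells:
        arms.setdefault(cz >> 1, set()).add(int(cz & 1))
    return np.array([len(arms[c]) == 2 for c in code])

# toy example: units 0 and 1 share a cell across arms; unit 2 is alone
X_demo = np.array([[1, 0], [1, 0], [1, 1]])
z_demo = np.array([1, 0, 1])
matched = match_bitvector(X_demo, z_demo, [0, 1])
```

Because the per-unit codes are plain integers, the grouping step can be expressed as a single group-by/aggregation, which is also why the procedure ports naturally to database systems.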

In the first set of experiments, we compare the performance of the different methods on the estimation of local average treatment effects. In Experiment 5.2 we demonstrate the power of FLAME-IV for estimating individualized local average treatment effects. Experiment 5.3 describes the scalability of the approach in terms of the number of covariates and number of units.

Throughout, we generate instruments, covariates and continuous exposures based on the following structural equation model (wooldridge2010econometric):

(8)

where , and . For important covariates, . For unimportant covariates, in the control group, and in the treatment group. We discretize the exposure values by defining:

### 5.1 Estimation of $\lambda$

In this experiment, outcomes are generated based on one of two homogeneous treatment effect models: a linear and a nonlinear model, respectively defined as:

(9)

(10)

Under both generation models, the true treatment effect is 10 for all individuals. There are 10 confounding covariates, 8 of which are important and 2 of which are unimportant. The importance of the variables decays exponentially with the covariate index.

We measure performance using the absolute bias of the median, i.e., the absolute value of the bias of the median estimate across 500 simulations, and the median absolute deviation, i.e., the median of the absolute deviations of the estimates from the true effect. We present simulation results at varying levels of strength of the instrumental variable, as measured by the concentration parameter in Eq. (8): the influence that the instrument has on treatment take-up. Usually a concentration parameter below 10 suggests that instruments are weak (stock2002survey).
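Both performance metrics are simple functions of the vector of estimates across simulation runs; a sketch on a small hypothetical vector of estimates:

```python
import numpy as np

def abs_bias_of_median(estimates, true_effect):
    """Absolute value of the bias of the median estimate across runs."""
    return abs(np.median(estimates) - true_effect)

def median_abs_deviation(estimates, true_effect):
    """Median of the absolute deviations of the estimates from the true effect."""
    return float(np.median(np.abs(np.asarray(estimates) - true_effect)))

est = np.array([9.0, 10.5, 10.0, 11.0, 9.5])  # hypothetical estimates over 5 runs
b = abs_bias_of_median(est, 10.0)    # median estimate is 10.0, so bias is 0
mad = median_abs_deviation(est, 10.0)  # |deviations| = 1, .5, 0, 1, .5
```

Note the second metric is a deviation from the *true* effect rather than from the sample median, so it captures both spread and systematic error.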

We also assess the performance of our methods by varying the size of training and holdout data. We generate two training and holdout datasets of different sizes: one with 1000 instrumented units and 1000 non-instrumented units, and one with 50 instrumented units and 50 non-instrumented units. For each case, we run each experiment 500 times for each of the algorithms.

Figures 2 and 3 show the results of this experiment. All algorithms achieve better estimation accuracy when the instrument is stronger (i.e., more instrumented units take up the treatment). Figure 2 shows results for the linear generation model, and Figure 3 shows results for the nonlinear generation model. As both figures show, FLAME-IV with and without early stopping generally outperforms all other algorithms in terms of bias and deviation. This is likely because our methodology does not rely on a parametric outcome model and uses a learned distance metric on discrete covariates. The only exceptions are the upper-left plots of Figures 2 and 3, which show the bias results on small datasets (50 instrumented & 50 non-instrumented units): 2SLS has an advantage here because the amount of data is too small for powerful nonparametric methods like FLAME-IV to fit reliably, so FLAME-IV's matching estimates have slightly larger bias than 2SLS.

In the supplementary materials, we report results of similar experiments but with the additional inclusion of observed confounders of instrument assignment. We see no degradation in the performance. Result patterns with confounded instruments mimic those in Figures 2 and 3.

Next, we compare 95% confidence intervals for each algorithm; the results are reported in Table 1. FLAME-IV performs well on the nonlinear generation model, yielding the narrowest 95% CI of all the methods. For the linear generation model, the 95% CI for FLAME-IV is narrower than the equivalent CIs for full matching and nearfar matching, but wider than that of 2SLS. Again, this is expected, because 2SLS is correctly specified under the linear generation model. More details about the confidence intervals are available in the supplement.

Table 1: Point estimates and 95% confidence intervals for each method under the linear and nonlinear generation models (true effect = 10).

| | FLAME-IV | 2SLS | Full-Matching | Nearfar Matching |
|---|---|---|---|---|
| Linear Model | 10.15 | 10.16 | 10.96 | 11.23 |
| (95% CI) | (9.72, 10.58) | (9.92, 10.40) | (10.14, 12.68) | (10.23, 12.89) |
| Nonlinear Model | 9.95 | 10.11 | 18.97 | 21.67 |
| (95% CI) | (9.47, 10.43) | (6.96, 13.25) | (11.35, 41.44) | (12.96, 45.71) |

### 5.2 Estimation of $\lambda(m)$

One advantage of the AME-IV methodology is that it allows us to estimate LATEs on compliers (units for whom $t_i(1) > t_i(0)$) within each matched group. This results in more nuanced estimates of the LATE and in overall better descriptions of the estimated causal effects. We evaluate the performance of FLAME-IV in estimating matched-group-level effects in a simulation study, with the estimators described in Section 4.

To study how well FLAME-IV estimates individual causal effects, we generate data with heterogeneous treatment effects. The new generation models, (11) and (12) below, are unlike the generation models in (9) and (10), in that different individuals have different treatment effects. The two heterogeneous treatment effect data generation models are:

(11)

(12)

Here with , . We generate 1000 treatment and 1000 control units from both models. We increased the value of the concentration parameter in Eq. (8) so that the instrument has a strong effect on treatment take-up for the whole dataset. This is done to ensure appropriate treatment take-up within each group. Even with this adjustment, a few groups did not have any units take up treatment in the simulation; results for these groups were not computed and are not reported in Figure 4. We estimate the LATE within each matched group using the estimator in Eq. (7). Note that in groups where the instrument is very strong, the LATE will approximately equal the average treatment effect on the treated.

Experimental results for both data generation models are shown in Figure 4. As we can see, our estimated effects align closely with the true treatment effects and exhibit relatively small estimation error for both the linear and nonlinear generation models. Our algorithm performs slightly better when the generation model is linear.

### 5.3 Running Time Evaluation

For the synthetic data generated as in Section 5.2, Figure 5 compares the runtime of our algorithm against full matching. We computed the runtime by varying the number of units (Figure 5, left panel) and the number of covariates (Figure 5, right panel). Each reported runtime is the average of five runs. The plot suggests that our algorithm scales well with both the number of units and the number of covariates. Full matching depends on a Mahalanobis distance metric, which is costly to compute, and FLAME-IV's advantage grows as the dataset gets larger in either dimension. Experimental results on larger datasets are in the supplement. We note that the maximum number of units and covariates that full matching can handle is also limited by the maximum size of vectors in R. Experiments were run on an Intel Core i7-4790 @ 3.6 GHz with 8 GB RAM and Ubuntu 16.04.01.

## 6 Will a Five-Minute Discussion Change Your Mind?

In this section, we demonstrate the practical utility of our method by applying it to a real-world dataset. Since we do not observe the ground truth, we cannot evaluate performance in terms of predictions; instead, we determine whether we can replicate the results of a previously published study. Specifically, we examine how door-to-door canvassing affects actual electoral outcomes, using data generated by a country-wide experiment during the 2012 French general election (pons2018). The original study estimates the effects of a door-to-door campaign in favor of François Hollande's Parti Socialiste (PS) on two outcomes: voter turnout and share of votes for PS. The two outcomes are measured twice: once for each of the two rounds of voting that took place during the 2012 election. The units of analysis are geographically defined electoral precincts, often, but not always, comprised of different municipalities.

The instrument in this case is pre-selection into campaign precincts: the 3,260 electoral precincts were clustered into strata of 5, among which 4 were randomly chosen and made available to conduct a campaign. The treatment is actually running a campaign in a precinct, as not all of the four instrumented precincts in each stratum were chosen for door-to-door campaigns. The decision was based on the proportion of PS votes at the previous election within each precinct and the target number of registered citizens for each territory. These deciding factors evidently confound the causal relationship between treatment and outcomes. This setup provides an ideal setting for an instrumental variable design, where random pre-selection into campaign precincts can be used to estimate the LATE of actual door-to-door campaigns on both turnout and PS vote share.

We replicate the original study's results by running our algorithm on the data without explicitly accounting for the strata defined by the original experiment. Since some of the covariates used for matching are continuous, we coarsen them into 5 ordinal categories. We coarsen turnout at the previous election and PS vote share at the previous election into 10 categories instead, as these variables are particularly important for matching and we would like to make more granular matches on them. Results from applying our methods to the data from the study are presented in Table 2. Columns 2 and 3 show results for PS vote share as an outcome, and the last two columns for voter turnout as an outcome. Results are presented disaggregated by each round of the election.
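This kind of coarsening can be done with quantile binning; a sketch with hypothetical column names (we do not claim the original analysis used quantile cut-points specifically):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
precincts = pd.DataFrame({
    "median_income": rng.lognormal(10, 0.5, size=1000),   # hypothetical covariate
    "prev_ps_share": rng.uniform(0, 1, size=1000),        # hypothetical covariate
})
# 5 ordinal categories for a generic continuous covariate...
precincts["income_bin"] = pd.qcut(precincts["median_income"], q=5, labels=False)
# ...10 for the covariates most important for matching, to allow finer matches
precincts["ps_share_bin"] = pd.qcut(precincts["prev_ps_share"], q=10, labels=False)
```

The binned columns are small integers, so they drop directly into the discrete-covariate matching framework of Section 3.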

Panel A provides LATE estimates from FLAME-IV. Unlike the earlier study (pons2018), our estimates are independent of the strong parametric assumptions of 2SLS. We reach conclusions similar to those of the original paper, finding no positive effect of canvassing on voter turnout and a positive, statistically significant effect on vote share for PS. In general, our standard error estimates are similar to those obtained with 2SLS, though more conservative due to the non-parametric nature of the estimators we employ. Interestingly, our estimate of the effect of canvassing on vote share has a greater magnitude than the original analysis, while our estimate for the effect of canvassing on voter turnout is nearly the same as the original paper's.

Our methodology also allows an improvement on the original analysis by estimating effects of door-to-door campaigns on the two outcomes for particular subgroups of interest. LATE estimates for income and gender subgroups are reported in Panels B and C of Table 2. The income subgroups are defined by median income, whereas gender subgroups are defined by the share of female population in each precinct. We find that canvassing was more effective in increasing the vote share for PS, in the first round of the election, in precincts where the male population is in the majority. We also find that canvassing had a negative effect on voter turnout in low-income precincts, but a positive effect on vote share for PS. The combination of these results shows that canvassing was successful in convincing voters to switch their votes in favour of François Hollande.

In the supplement we show two example matched groups output by FLAME-IV. In this case the algorithm was successful in separating localities with low support for PS from localities in which support for PS was greater. These examples highlight how the algorithm can produce meaningful and interpretable groups, while reducing potential for confounding by observed covariates.

In conclusion, the results of our analysis of the voter turnout data clearly show that our method produces novel and interesting results when applied to real-world scenarios, independently of strong parametric assumptions, and with a simple interpretable framework.

## 7 Conclusion

Matching methods can be extremely powerful: they are both highly nonparametric and interpretable, allowing users to trust and troubleshoot their models more easily. Our approach to matching for instrumental variables addresses the limitations faced by existing methods. We improve on 2SLS by using a powerful, highly non-parametric modeling approach. Unlike black-box machine learning approaches, we retain interpretability by using matching. We improve on existing matching methods by learning an interpretable distance metric on a training set. Our methodology also provides a systematic way to account for nuisance variables, and to achieve consistently high-quality matching outcomes. The algorithm can be implemented easily within most common database systems and scales well to large datasets. It achieves a balance between interpretability, scalability, trustworthiness, and modeling power that is unsurpassed by any other method for IV analysis. Code is publicly available at: https://github.com/almost-matching-exactly

#### Acknowledgements

This work was supported in part by NIH award 1R01EB025021-01, NSF awards IIS-1552538 and IIS1703431, a DARPA award under the L2M program, and a Duke University Energy Initiative Energy Research Seed Fund (ERSF) grant.

## References

## 8 Supplement

### 8.1 Proof of Lemma 3.1

Since the result is exactly symmetric when non-instrumented units are matched, we prove it only for the case in which instrumented units are matched. Assume . For a given unit with = 1, suppose we could find a as defined in the AME-IV problem. Let us define another unit with = 0, and ; by the definition of it must be that . So , where is a vector of length with all entries equal to 1.

Assume there is another unit with = 0, and .

If , then . So

If , let us define , obviously . Since , we have:

Therefore,

This concludes the proof.

### 8.2 Asymptotic Variance and Confidence Intervals for LATE Estimates

To construct estimators for the variance of we use an asymptotic approximation; that is, we estimate the asymptotic variance of rather than its small-sample variance. Our strategy is the same as in imbens2015, with the difference that our data are grouped: we adapt their estimators to grouped data using canonical methods for stratified sampling. In order to define asymptotic quantities for our estimators, we must slightly extend the definitions of potential outcomes introduced in our paper. While our framework has so far been presented under the assumption that the potential outcomes and treatments are fixed, we now relax that assumption and instead treat as realizations of random variables , drawn from some unknown distribution . In this case the SUTVA assumption requires that each set of potential outcomes and treatments be drawn independently from the same distribution for all units. As usual, lowercase versions of the symbols above denote observed realizations of the respective random variables.

The asymptotic behaviour of our method is straightforward. Since the covariates we consider are discrete (say binary, for convenience), there are only finitely many covariate combinations one can observe. If the sample size increases and every combination of covariates has positive probability of being observed, then asymptotically all possible combinations will be observed. In fact, most units will be matched exactly when . This means that our matched groups will contain only exactly matched units, and the design is therefore equivalent to a stratified fully randomized experiment in which the strata are the matched groups, by Assumption 3 of our paper. By this principle, asymptotic results for IV estimation in stratified experiments, such as those in imbens2015, apply.
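The argument above can be illustrated with a small simulation. This is a sketch under assumptions of our own choosing (binary covariates, a fair-coin instrument, and the hypothetical function name `frac_exactly_matched`): as the sample grows, essentially every unit shares its exact covariate profile with some unit holding the opposite instrument value.

```python
import numpy as np

def frac_exactly_matched(n, p=4, seed=0):
    """Fraction of units that can be matched exactly to at least one
    unit with the opposite instrument value, given p binary covariates."""
    rng = np.random.default_rng(seed)
    X = rng.integers(0, 2, size=(n, p))   # binary covariates
    Z = rng.integers(0, 2, size=n)        # instrument, assigned at random
    # record which instrument values appear in each covariate cell
    cells = {}
    for x, z in zip(map(tuple, X), Z):
        cells.setdefault(x, set()).add(int(z))
    # a unit is exactly matchable iff its cell contains both values
    matchable = sum(1 for x in map(tuple, X) if cells[x] == {0, 1})
    return matchable / n
```

With p = 4 there are only 16 cells, so for samples in the thousands the matchable fraction is essentially 1, consistent with the claim that matched groups asymptotically contain only exact matches.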

Recall as well that in this scenario we have a set of matched groups indexed by , such that each unit belongs to only one matched group. We denote by the number of units in matched group that have , and by the number of units in matched group with . Finally, the total number of units in matched group is .

We make all the assumptions listed in Section 3, except that we replace (A3) with the following variant:

(A3’) .

That is, if two units are in the same matched group, then they have the same probability of receiving the instrument. This probability equals the ratio of units with instrument value 1 to all units in the matched group, because we hold these quantities fixed. Note that this more stringent assumption holds when matches are made exactly, and it is common in variance computations for matching estimators (see, for example, kang2016full).

To keep our exposition concise, we do not give explicit definitions of our variance estimands; these are all standard and can be found in imbens2015.

We start by estimating the variances of observed potential outcomes and treatments within each matched group, using the canonical approach:

where is an estimator for the variance of potential responses for the units with instrument value 0 in matched group , for the variance of potential responses for the units with instrument value 1 in matched group , for the variance of potential treatments for the units with instrument value 0 in matched group , and for the variance of potential treatments for the units with instrument value 1 in matched group . The fact that follows from Assumption A4.

We now turn to variance estimation for the two s. Conservatively biased estimators for these quantities are given in imbens2015. These estimators are simple to compute and commonly used in practice, which is why they are often preferred to unbiased but more complex alternatives. We repeat them below:

To estimate the asymptotic variance of we also need estimators for the covariance of the two s, both within each matched group and in the whole sample. Starting with the former, we can use the standard sample covariance estimator for :

The reasoning for using only units with instrument value 1 to estimate this covariance is given in imbens2015, and follows from A4. We can use standard techniques for covariance estimation in grouped data to combine the estimators above into an overall estimator for , as follows:

Once all these estimators are defined, we can use them to estimate the asymptotic variance of . This quantity is obtained in imbens2015 by applying the delta method to the convergence of the two s. The final estimator for the asymptotic variance of , which we denote by , is given by:

Using this variance, asymptotic confidence intervals at any desired level can be computed in the standard way.
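Under assumption (A3') the matched groups behave like strata of a randomized experiment, so the grouped variance calculation can be sketched as follows. This is a minimal illustration in the spirit of the stratified estimators in imbens2015, not the paper's exact formulas: the function name `late_with_ci`, the data layout, and the 1.96 critical value (for 95% intervals) are our own choices, and each group is assumed to contain at least two units per instrument arm.

```python
import numpy as np

def late_with_ci(groups, z_crit=1.96):
    """LATE estimate with a delta-method variance from matched groups.

    groups: list of dicts with numpy arrays 'y' (outcomes), 'w'
    (treatments), and 'z' (binary instrument) for one matched group each.
    Returns (late, variance, (ci_lo, ci_hi)).
    """
    N = sum(len(g['z']) for g in groups)
    itt_y = itt_w = 0.0
    var_y = var_w = cov_yw = 0.0
    for g in groups:
        y, w, z = g['y'], g['w'], g['z']
        n1, n0 = int((z == 1).sum()), int((z == 0).sum())
        wgt = len(z) / N                      # stratum weight
        itt_y += wgt * (y[z == 1].mean() - y[z == 0].mean())
        itt_w += wgt * (w[z == 1].mean() - w[z == 0].mean())
        # within-group sampling variances (assumes n1, n0 >= 2)
        var_y += wgt**2 * (y[z == 1].var(ddof=1) / n1
                           + y[z == 0].var(ddof=1) / n0)
        var_w += wgt**2 * (w[z == 1].var(ddof=1) / n1
                           + w[z == 0].var(ddof=1) / n0)
        # covariance estimated from instrument-1 units only (cf. A4)
        cov_yw += wgt**2 * np.cov(y[z == 1], w[z == 1])[0, 1] / n1
    late = itt_y / itt_w
    # delta method applied to the ratio of the two ITT estimators
    v = (var_y + late**2 * var_w - 2 * late * cov_yw) / itt_w**2
    half = z_crit * np.sqrt(max(v, 0.0))
    return late, v, (late - half, late + half)
```

The weighting by squared stratum shares is the standard combination rule for stratified sampling, and the ratio-variance formula is the usual delta-method expansion for a Wald-type estimator.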

### 8.3 The FLAME-IV Algorithm

We adapt the algorithm described in roy2017flame to the IV setting. The algorithm is run as described in that paper, except that the instrument indicator is used in place of the treatment indicator as input. Here we give a short summary of how the algorithm works and refer to roy2017flame for an in-depth description.

FLAME-IV takes as input a training dataset consisting of covariates, instrument indicator, treatment indicator, and outcome for every unit that we wish to match, as well as a holdout set consisting of the same variables for a different set of units, which are used not for matching but to evaluate prediction error and match quality (MQ). The algorithm first checks whether any units can be matched exactly with at least one unit with the opposite instrument indicator. If so, all units that match exactly are put into their own matched groups and removed from the pool of units to be matched. After this initial check, the algorithm iterates through the matching covariates: at each iteration, MQ is evaluated on the holdout set after removing each covariate from the set of matching covariates. The covariate whose removal leads to the smallest reduction in MQ is discarded, and the algorithm proceeds to look for exact matches on all remaining covariates. Units that can be matched exactly on the remaining covariates are put into matched groups and removed from the set of units to be matched. Note that MQ is recomputed for each remaining covariate at every iteration because the subset of covariates on which it is evaluated shrinks after each iteration (it no longer includes the covariates removed previously). The algorithm proceeds in this way, removing covariates one by one, until: (a) MQ falls below a pre-defined threshold, (b) all remaining units are matched, or (c) all covariates are removed. Experimental evidence presented in roy2017flame suggests a threshold of 5% of the prediction error with all of the covariates. The matched groups produced by the algorithm can then be used with the estimators described in the paper to estimate the desired treatment effects. Units left unmatched after the algorithm stops are not used for estimation.
The algorithm ensures that at least one instrumented and one non-instrumented unit are present in each matched group, but it gives no such guarantee for treated and control units: matched groups that do not contain both treated and control units are not used for estimation.
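The iterative procedure above can be sketched schematically. This is our own simplified rendition, not the authors' implementation (see roy2017flame for that): the function name `flame_iv` and its interface are hypothetical, `match_quality` is a user-supplied callable standing in for the holdout MQ computation, and the stopping threshold is passed in directly rather than derived from the full-covariate prediction error.

```python
import pandas as pd

def flame_iv(data, holdout, covariates, mq_threshold, match_quality):
    """Greedy FLAME-IV sketch (hypothetical interface).

    data / holdout: DataFrames with the columns in `covariates` plus a
    binary instrument column 'Z'. match_quality(holdout, covs) -> float,
    higher is better. Returns (matched groups, unmatched units).
    """
    covs = list(covariates)
    groups = []

    def match_exact(pool, covs):
        matched_idx = []
        if covs:
            for _, g in pool.groupby(covs):
                # keep a group only if both instrument values are present
                if g['Z'].nunique() == 2:
                    groups.append(g)
                    matched_idx.extend(g.index)
        return pool.drop(index=matched_idx)

    pool = match_exact(data, covs)              # initial exact matching
    while covs and len(pool) > 0:
        # drop the covariate whose removal hurts match quality least
        scores = {c: match_quality(holdout, [k for k in covs if k != c])
                  for c in covs}
        best = max(scores, key=scores.get)
        if scores[best] < mq_threshold:         # stop when MQ degrades too much
            break
        covs.remove(best)
        pool = match_exact(pool, covs)
    return groups, pool
```

A downstream step would then discard any matched group lacking treated or control units before estimation, as described above.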

One of the strengths of FLAME-IV is that it can be implemented in several ways that guarantee performance on large datasets. An implementation leveraging bit vectors, which optimizes speed when datasets are not too large, is described in roy2017flame. A native implementation of the algorithm on any database management system that uses SQL as a query language is also given in the same paper: this implementation is ideal for large relational databases, as it does not require data to be exported from the database for matching.

While FLAME-IV is a greedy solution to the AME-IV problem, an optimal solution could be obtained by adapting the DAME (Dynamic Almost Matching Exactly) procedure described in dieng2018collapsing to the IV setting, using instrument indicators in place of treatment indicators as input to the algorithm. Resulting matched groups with no treated or control units should be discarded, as we do here. The estimators we employ in this paper can also be used, with the same properties, for matched groups constructed with this methodology.

### 8.4 More Running Time Results on Large Dataset

Figure 6 shows running-time results for FLAME-IV on a larger dataset. The running time remains very short (on the order of 2 minutes) even on this large dataset. Full matching cannot handle a dataset of this size.

### 8.5 Additional Simulations with Confounded Instrument Assignment

Here we present results from simulations similar to those in Section 5.1, but in which, in addition to treatment assignment, instrument assignment is also confounded. The instrument is assigned as follows:

(13)

(14)

(15)

where contains the last two covariates for unit , and .

Results for a linear outcome model, as in Equation (9), are displayed in Figure 8, and results for a nonlinear outcome model, as in Equation (10), are displayed in Figure 8. Results are largely similar to those obtained when instrument assignment is unconfounded, suggesting that our method performs equally well when instrument assignment is confounded.

### 8.6 Sample Matched Groups

Sample matched groups are given in Table 3. These groups were produced by FLAME-IV on the data from pons2018, introduced in Section 6. The algorithm was run on all of the covariates collected in the original study except for territory; here we report a selection of covariates for each group. The first group comprises electoral districts in which turnout at the previous election was relatively high but the PS vote share was low. This suggests that existing partisan splits are taken into account by FLAME-IV for matching. Municipalities in the second group had slightly lower turnout at the previous election but a much larger vote share for PS. Note also that treatment adoption is very high in the second group but low in the first: this suggests that the instrument is weak in Group 1 and strong in Group 2.

Table 3: Selected covariates for two sample matched groups produced by FLAME-IV.

| Territory | Last Election PS Vote Share | Last Election Turnout | Population (in thousands) | Share Male | Share Unemployed | Treated | Instrumented |
|---|---|---|---|---|---|---|---|
| **Matched Group 1** | | | | | | | |
| Plouguenast et environs | (0.01, 0.05] | (0.77, 0.88] | (0, 450] | (0.47, 0.57] | (0, 0.1] | 0 | 1 |
| Lorrez-le-Bocage-Préaux et environs | (0.01, 0.05] | (0.77, 0.88] | (0, 450] | (0.47, 0.57] | (0, 0.1] | 0 | 1 |
| La Ferté-Macé et environs | (0.01, 0.05] | (0.77, 0.88] | (0, 450] | (0.47, 0.57] | (0, 0.1] | 0 | 1 |
| Mundolsheim et environs | (0.01, 0.05] | (0.77, 0.88] | (0, 450] | (0.47, 0.57] | (0, 0.1] | 1 | 1 |
| Paris, 7e arrondissement | (0.01, 0.05] | (0.77, 0.88] | (1,800, 2,250] | (0.47, 0.57] | (0.1, 0.2] | 0 | 1 |
| Sainte-Geneviève et environs | (0.01, 0.05] | (0.77, 0.88] | (0, 450] | (0.47, 0.57] | (0, 0.1] | 0 | 0 |
| Cranves-Sales et environs | (0.01, 0.05] | (0.77, 0.88] | (0, 450] | (0.47, 0.57] | (0, 0.1] | 0 | 0 |
| Hem et environs | (0.01, 0.05] | (0.77, 0.88] | (0, 450] | (0.47, 0.57] | (0, 0.1] | 0 | 1 |
| Legé et environs | (0.01, 0.05] | (0.77, 0.88] | (0, 450] | (0.47, 0.57] | (0, 0.1] | 0 | 1 |
| Moûtiers et environs | (0.01, 0.05] | (0.77, 0.88] | (0, 450] | (0.47, 0.57] | (0, 0.1] | 0 | 0 |
| Paris, 7e arrondissement | (0.01, 0.05] | (0.77, 0.88] | (1,800, 2,250] | (0.47, 0.57] | (0.1, 0.2] | 0 | 1 |
| Craponne-sur-Arzon et environs | (0.01, 0.05] | (0.77, 0.88] | (0, 450] | (0.47, 0.57] | (0, 0.1] | 0 | 0 |
| **Matched Group 2** | | | | | | | |
| Nantes | (0.19, 0.22] | (0.66, 0.77] | (0, 450] | (0.47, 0.57] | (0.1, 0.2] | 1 | 1 |
| Alès | (0.19, 0.22] | (0.66, 0.77] | (0, 450] | (0.37, 0.47] | (0.2, 0.3] | 1 | 1 |
| Sin-le-Noble | (0.19, 0.22] | (0.66, 0.77] | (0, 450] | (0.47, 0.57] | (0.2, 0.3] | 1 | 1 |
| Grand-Couronne et environs | (0.19, 0.22] | (0.66, 0.77] | (0, 450] | (0.47, 0.57] | (0.1, 0.2] | 1 | 1 |
| Dreux | (0.19, 0.22] | (0.66, 0.77] | (0, 450] | (0.47, 0.57] | (0.2, 0.3] | 1 | 1 |
| Vosges | (0.19, 0.22] | (0.77, 0.88] | (0, 450] | (0.47, 0.57] | (0.1, 0.2] | 0 | 0 |
| Arras et environs | (0.19, 0.22] | (0.66, 0.77] | (0, 450] | (0.37, 0.47] | (0.1, 0.2] | 1 | 1 |
| Montargis et environs | (0.19, 0.22] | (0.66, 0.77] | (0, 450] | (0.37, 0.47] | (0.2, 0.3] | 1 | 1 |
| Marseille, 3e arrondissement | (0.19, 0.22] | (0.66, 0.77] | (450, 900] | (0.47, 0.57] | (0.1, 0.2] | 1 | 1 |
| Nantes | (0.19, 0.22] | (0.66, 0.77] | (0, 450] | (0.47, 0.57] | (0.1, 0.2] | 1 | 1 |
| Mâcon et environs | (0.19, 0.22] | (0.66, 0.77] | (0, 450] | (0.37, 0.47] | (0.1, 0.2] | 1 | 1 |