Collapsing-Fast-Large-Almost-Matching-Exactly: A Matching Method for Causal Inference

06/18/2018 ∙ by Awa Dieng, et al. ∙ Duke University 0

We aim to create the highest possible quality of treatment-control matches for categorical data in the potential outcomes framework. Matching methods are heavily used in the social sciences due to their interpretability, but most matching methods in the past do not pass basic sanity checks in that they fail when irrelevant variables are introduced. Also, past methods tend to be either computationally slow or produce poor matches. The method proposed in this work aims to match units on a weighted Hamming distance, taking into account the relative importance of the covariates; the algorithm aims to match units on as many relevant variables as possible. To do this, the algorithm creates a hierarchy of covariate combinations on which to match (similar to downward closure), in the process solving an optimization problem for each unit in order to construct the optimal matches. The algorithm uses a single dynamic program to solve all of optimization problems simultaneously. Notable advantages of our method over existing matching procedures are its high-quality matches, versatility in handling different data distributions that may have irrelevant variables, and ability to handle missing data by matching on as many available covariates as possible



There are no comments yet.


page 12

page 15

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

In observational causal inference where the scientist does not control the randomization of individuals into treatment, an ideal approach matches each treatment unit to a control unit with identical covariates. However, in high dimensions, few such “identical twins” exist, since it becomes unlikely that any two units have identical covariates in high dimensions. In that case, how might we construct a match assignment that would lead to accurate estimates of conditional average treatment effects (CATEs)?

For categorical variables, we might choose a Hamming distance to measure similarity between covariates. Then, the goal is to find control units that are similar to the treatment units on as many covariates as possible. This leads to two challenges, the first being computational (how does one compute optimal matches on Hamming distance?), and the second being that not all covariates are equally relevant, which can impact the Hamming distance.

This second issue, that not all covariates are equally important, has serious implications for CATE estimation. Matching methods generally suffer in the presence of many irrelevant covariates (covariates that are not related to either treatment or outcome): the irrelevant variables would dominate the Hamming distance calculation, so that the treatment units would mainly be matched to the control units on the irrelevant variables. This means that matching methods do not always pass an important sanity check in that irrelevant variables should be irrelevant. As our experiments show, the participation of irrelevant variables can overwhelm many state of the art matching methods, including Propensity Score Matching [13], Nearest Neighbor matching [14], as well as methods not explicitly designed for matching such as [21] Causal Forest.

We propose a method in this work that aims to mitigate the problems listed above. The algorithm uses a hold-out training set to determine which variables are the most important to match on. It then determines matches by optimizing a weighted Hamming distance, where the weights on each variable are determined from the analysis of the hold-out set. Finding a matched group for a given unit then becomes a constrained discrete optimization problem, and we solve all of these optimization problems efficiently with a single dynamic program. This algorithm has the same basic monotonicity property (downwards closure) and structure as that of the apriori algorithm [1]

used in data mining for finding frequent itemsets. However, frequency of itemsets is irrelevant here, instead the goal is to find a largest (weighted) number of covariates that both a treatment and control unit have in common. The algorithm fully optimizes the weighted Hamming distance of each treatment unit to the nearest control unit (and vice versa). It is efficient, owing to the use of database programming and bit-vector computations, and does not require an integer programming solver. The method adaptively chooses subsets of features to consider, based on the other subsets of features that have already been considered. The method is named

D-AEMR : Dynamic Almost-Exact Matching with Replacement.

2 Related work

Randomized experiments are a gold standard for causal inference [18]; however, much of the available data for causal inference is observational, where the experimenter does not control the randomization mechanism. Observational datasets can be much larger and easier to obtain than randomized trial data, and matching techniques provide ways to approximate the covariate balance that would have come from a randomized trial.

Exact matching is not possible in high dimensions, as “identical twins” in treatment and control samples are not likely to exist. Early on, this led to techniques that reduce dimension using propensity score matching [15, 14, 16, 5], which extend to penalized regression approaches [19, 11, 4, 6]. Propensity score matching methods project the entire dataset to one dimension and thus cannot be used for estimating CATE (conditional average treatment effect), since the matched groups often do not agree on important covariates. In “optimal matching,” [12], an optimization problem is formed to choose matches according to a pre-defined distance measure, with balance constraints, though as discussed above, this distance measure can be dominated by irrelevant covariates, often leading to poor matched groups and biased estimates of treatment effect. Coarsened exact matching [9, 10] has the same problem.

A recent alternative framework is that of almost-exact matching, where each matched group contains units that are close on covariates that are important for predicting outcomes. For example, Coarsened Exact Matching [9, 10] is almost-exact, if one were to use an oracle (should one ever become available) that bins covariates according to importance for estimating causal effects. The FLAME algorithm [3]

is an almost-exact matching method that adapts the distance metric to the data using machine learning. It starts by matching “identical twins,” and proceeds by eliminating less important covariates one by one, attempting to match individuals on the largest set of covariates available. FLAME has a balance constraint to ensure that it does not remove too many treatment or control units as it eliminates covariates. It uses a hold-out training set to determine importance of covariates. The combination of balance and importance of covariates is called “match quality” (MQ). FLAME can handle datasets too large to fit in memory, and scales well with the number of covariates, but removing covariates in exactly one order (rather than all possible orders as in the present work) means that many high-quality matches will be missed.

D-AEMR ’s matched groups tend to match on more covariates than FLAME; the distances between matched units in D-AEMR are smaller within each matched group, thus its matches are distinctly higher quality. This has implications for missing data, where D-AEMR can find matched groups that FLAME cannot.

3 Almost-Exact Matching Framework

Consider a dataframe where , , respectively denote the covariates for all units, the outcome vector and the treatment indicator ( for treated, for control). Unit ’s th covariate is denoted . Notation indicates covariates for the th unit, and is an indicator for whether or not unit is treated.

Throughout we make SUTVA and ignorability assumptions [17]. The goal is to match treatment and control units on as many relevant covariates as possible. Relevance of covariate is denoted by and it is determined using a hold-out training set. Either ’s can be fixed beforehand or adjusted dynamically using a for-loop inside the algorithm.

Assuming that we have a fixed weight for each covariate , we would like to find a match for each treatment unit that matches at least one control unit on as many relevant covariates as possible. Thus we consider the following matching problem:

Almost-Exact Matching with Replacement (AEMR): For each treatment unit ,


where denotes Hadamard product.

The solution to the AEMR problem is an indicator of the optimal set of covariates for treatment unit ’s matched group. The constraint says that the optimal matched group contains at least one control unit.

The solution of the AEMR problem can be the same for multiple treatment units, in which case they would form a single matched group. Formally, we define the matched group for treatment unit .

The main matched group corresponding to treatment unit contains all units such that

That is, the main matched group contains all treatment and control units that have identical covariate values to on the covariates for which is 1.

The formulation of the AEMR and main matched group formulation is symmetric for control units.

3.1 Straightforward AEMR solutions do not suffice

There are two straightforward (but inefficient) approaches to solving the AEMR problem for all units.

AEMR Solution 1 (quadratic in , linear in ): Brute force pairwise comparison of treatment points to control points. For all treatment units , we (i) iterate over all control units , (ii) find the vector with value 1 if there is a match on the values of the corresponding covariates, and 0 otherwise, (iii) find the control unit(s) with the highest value of , and (iv) return them as the main matched group for the treatment unit (and compute the auxiliary group). Whenever a previously matched unit is matched to a previously unmatched unit , record the ’s main matched group as an auxiliary group for the previously matched unit . When all units are ‘done’ (all units are either matched already or cannot be matched) then stop, and compute the CATE for each treatment and control unit using its main matched group. If a unit belongs to auxiliary matched groups then its outcome is used for computing both its own CATE (in its own main matched group) and the CATEs of units for whom it is in an auxiliary group (e.g., will be used to compute ’s estimated CATE).

This algorithm is polynomial in both and , however, the quadratic time complexity in also makes this approach impractical for large datasets (for instance, when we have more than a million units with half being treatment units).

AEMR Solution 2 (order , exponential in :) Brute force iteration over all subsets of the covariates. This approach solves the AEMR problem simultaneously for all treatment and control units for a fixed weight vector . First, (i) enumerate every (which serves as an indicator for a subset of covariates), (ii) order the ’s according to , (iii) form all valid main matched groups having at least one treated and one control unit (iv) the first time each unit is matched, mark that unit with a ‘done’ flag, and record its corresponding main matched group and, to facilitate matching with replacement, (v) whenever a previously matched unit is matched to a previously unmatched unit, record this main matched group as an auxiliary group. When all units are ‘done’ (all units are either matched already or cannot be matched) then stop, and compute the CATE for each treatment and control unit using its main matched group. Each unit’s outcome will be used to estimate CATEs for every auxiliary group that it is a member of, as before. Although this approach exploits the efficient ‘group by’ function (e.g., provided in database (SQL) queries), which can be implemented in time by sorting the units, iterating over all possible vectors makes this approach unsuitable for practical purposes (exponential in ).

3.2 Key to D-Aemr : Downward closure will help solve AEMR more efficiently

If is in the millions, the first solution, or any simple variation of it, is practically infeasible. A straightforward implementation of the second solution is also inefficient. However, a monotonicity property (downward closure) stated below will allow us to prune the search space so that the second solution can be modified to be completely practical, as we will demonstrate in the D-AEMR algorithm in the next section, which does not enumerate all v’s, it uses the monotonicity to reduce the number of v’s it considers.

(Monotonicity of in AEMR solutions) Fix treatment unit . Consider feasible , meaning . Then,

  • [leftmargin=*]

  • Any feasible such that elementwise will have .

  • Consequently, consider feasible vectors and . Define as the elementwise . Then , and .

These follow from the fact that the elements of are binary and the elements of are non-negative. The first property means that if we have found a feasible , we need not consider any with fewer 1’s as a possible solution of the AEMR for unit . Thus, D-AEMR starts from being all 1’s (consider all covariates). It systematically drops one element of to zero at a time, then two, then three, ordered according to values of . The second property implies that we must evaluate both and as possible AEMR solutions before evaluating . Conversely, a new subset of variables defined by cannot be considered unless all of its supersets have been considered.

These two properties form the basis of the D-AEMR algorithm.

4 The Dynamic AEMR (D-Aemr) Algorithm

AEMR has two computations: the weights and the actual matching mechanism. The weights correspond to the importance of the covariates and are computed by running regression on a hold-out training set prior to running the algorithm, or are computed adaptively as the algorithm executes (the computation of the weights is discussed in the supplement). The matching mechanism is designed to solve the AEMR.

We call a covariate-set any set of covariates. We denote by the original set of all covariates from the input dataset, where . When we drop a set of covariates , it means we will match on . For any covariate-set , we associate an indicator-vector defined as follows:


i.e. the value is 1 if the covariate is not in implying that it is being used for matching.

Algorithm 2 gives the pseudocode of the D-AEMR algorithm. Instead of looping over all possible vectors to solve the AEMR, it considers a covariate-set for dropping only if satisfies the monotonicity property of Proposition 3.2. For example, if has been considered for dropping to form matched groups, it would not process next because the monotonicity property requires , and to have been considered previously for dropping.

The D-AEMR algorithm uses the GroupedMR (Grouped Matching with Replacement) subroutine given in Algorithm 1 to form all valid main matched groups having at least one treated and one control unit. GroupedMR takes a given subset of covariates and finds all subsets of treatment and control units that have identical values of those covariates. We use an efficient implementation of the group-by operation used in the algorithm using bit-vectors, which is discussed in the supplement. To keep track of main and auxiliary matched groups, GroupedMR takes the entire set of units as well as the set of unmatched units from the previous iteration as input along with the covariate-set to match on in this iteration. Instead of matching only the unmatched units in using the group-by operator, it matches all units in to allow for matching with replacement as in the AEMR objective. It keeps track of the main matched groups for the unmatched units that are matched for the first time, and auxiliary groups (see previous section) for the other matched units.

Input : Unmatched Data , subset of indexes of covariates .
Output : Newly matched units using covariates indexed by where groups have at least one treatment and one control unit, the remaining data as and the matched groups
= group-by (form groups by exact matching on )
= prune() (remove groups without at least one treatment and control unit)
= Get subset of where the covariates match with (recover newly matched units)
return . (matched units, unmatched units, and matched groups)
Algorithm 1 Procedure GroupedMR

D-AEMR keeps track of two sets of covariate-sets: (1) The set of processed sets contains the covariate-sets whose main matched groups (if any exist) have already been formed. That is, contains if matches have been constructed on by calling the GroupedMR procedure. (2) The set of active sets contains the covariate-sets that are eligible to be dropped according to Proposition 3.2. For any iteration , , i.e., the sets are disjoint, where denote the states of at the end of iteration . Due to the monotonicity property Proposition 3.2, if , then each proper subset belonged to in an earlier iteration . Once an active set is chosen as the optimal subset to drop in iteration , is excluded from (it is no longer active) and is included in as a processed set. In that sense, the active sets are generated and included in in a hierarchical manner similar to the apriori algorithm. A set is included in only if all of its proper subsets of one less size , , have been processed.

Figure 1 shows an example of the earlier steps of the D-AEMR algorithm.

Figure 1: Example of the first few steps of D-AEMR . The red number in an oval is a count of how many times the covariate in the left of the same oval has been within a dropped set of the size indicated on the left of the oval’s row. Step1: Find the optimal set to drop from the set of active sets and form the matched groups. Step2: Add the dropped set to the corresponding set of processed sets. Step3: If new sets become active, add them to the set of active sets. Otherwise, return to Step1. Repeat until no more active sets or units to match.
Input :  Data , precomputed weight vector for all covariates (see supplement)
Output :  all matched units and all the matched groups from all iterations
Notation: : iterations, () = unmatched (matched) units at the end of iteration , = matched groups at the end of iteration , = set of active covariate-sets at the end of iteration that are eligible be dropped to form matched groups, = set of covariate-sets at the end of iteration that have been processed (considered for dropping and formulation of matched groups). Initialize: while there is at least one treatment unit to match in  do
        -- find the ‘best’ covariate-set to drop from the set of active covariate-sets Let ( denotes the indicator-vector of as in (2)) --find matched units and main groups -- generate new active covariate-sets -- remove from the set of active sets -- update the set of active sets -- update the set of already processed covariate-sets -- remove matches
Algorithm 2 The D-AEMR algorithm

The procedure GenerateNewActiveSets gives an efficient implementation of generation of new active sets in each iteration of D-AEMR, and takes the currently processed sets and a newly processed set as input. Let . In this procedure, denotes the set of all processed covariate-sets in of size , and also includes . Inclusion of in may lead to generation of new active sets of size if all of its subsets of size (one less) have been already processed. The new active sets triggered by inclusion of in would be supersets of of size if all subsets of size belong to . To generate such candidate superset , we can append with all covariates appearing in some covariate-set in except those in . However, this naive approach would iterate over many superfluous candidates for active sets. Instead, GenerateNewActiveSets safely prunes some such candidates that cannot be valid active sets using support of each covariate in , which is the number of sets in containing . Indeed, for any covariate that is not frequent enough in , the monotonicity property ensures that any covariate-set that contains that covariate cannot be active. The following proposition shows that this pruning step does not eliminate any valid active set (proof is in the supplement):

If for a superset of a newly processed set where and , all subsets of of size have been processed (i.e. is eligible to be active after is processed), then is included in the set returned by GenerateNewActiveSets.

The explicit verification step of whether all possible subsets of of one less size belongs to is necessary, i.e., the above optimization only prunes some candidate sets that are guaranteed not to be active. For instance, consider , , and . For the superset of , all of have support of in , but this cannot become active yet, since the subset of does not belong to .

Finally, the following theorem states the correctness of the D-AEMR algorithm (proof is in the supplement).

(Correctness) The algorithm D-AEMR solves the AEMR problem.

Input :  a newly dropped set of size ,
the set of previously processed sets
Initialize: new active sets = – compute all subsets of of size and include = Notation: = support of covariate in – get all the covariates contained in sets in = and –get the covariates that have enough support minus = and – if all covariates in have enough support in if   then
        – generate new active set for  all  do
               if all subsets , , belong to  then
                      add newly active set to
return , ,
= = = = (subsets of of size 2 are return
Algorithm 3 Procedure GenerateNewActiveSets Example

5 Simulations

We present results under a variety of data generating processes. We show that D-AEMR produces higher quality matches than popular matching methods such as 1-PNNSM and Nearest Neighbor matching with Mahalanobis distance, and can produce better treatment effect estimates than black box machine learning methods such as Causal Forest (which is not a matching method, and is not interpretable). The ‘MatchIt’ R-package ([8]) was used to perform 1-Propensity Score Nearest Neighbor matching (1-PSNNM) and Nearest Neighbor with Mahalanobis distance (Mahalanobis). For Causal Forest, we used the ‘grf’ R-package ([2]). D-AEMR also improves over the original FLAME [3] with regards to the quality of matches in high dimensions. Other matching methods (optmatch, cardinality match) do not scale to large problems and thus needed to be omitted. We use ”FLAME” to represent original FLAME algorithm. Throughout this section, the outcome is generated with

where is the binary treatment indicator. We vary the distribution of the covariates, coefficients (’s, ’s, and ), as well as the fraction of treated units.

5.1 Presence of irrelevant covariates

A basic sanity check for matching algorithms is how sensitive to irrelevant covariates they are. To that end, we run experiments with a majority of the covariates being irrelevant to the outcome. This simulation generates 15000 control units, 15000 treatment units, 5 important covariates and 10 irrelevent covariates. For important covariates let with , , . For unimportant covariates , in the control group and in the treatment group.

Figure 2: D-AEMR perfectly estimates the CATTs when stopped early i.e. before dropping important covariates. D-AEMR and FLAME find all the good matches early on and should not be run all the way to the end. If run to the end, poor matches are introduced. All the other methods are sensitive to irrelevant covariates and give poor estimates.

Results: As Figure 2 shows, D-AEMR and FLAME achieve the optimal result before dropping any important covariates. They respectively match and of units on all important covariates. When imposing that the FLAME algorithms find all the possible matches even if important covariates are dropped (not recommended), poor matches are introduced. However, their worst case scenario is still substantially better than the comparative methods, all of which perform poorly in presence of irrelevant covariates. Causal Forest is especially ill suited for this case.

5.2 Exponentially decaying covariate importance

A notable advantage of D-AEMR over FLAME is that it should produce more high quality matches before resorting to lower quality matches. To test that theory, in this experiment we consider covariates of decaying importance (letting the parameters decrease exponentially, ).

We evaluate the performance of the algorithms when and of the units are matched.
Results: As Figure 3 shows, in all instances (each threshhold in each column) D-AEMR matches on more covariates, yielding better estimates than FLAME.

Figure 3: D-AEMR makes higher quality matches early on. Furthermore, on all thresholds (each column), D-AEMR consistently matches on more covariates than FLAME and minimizes the Hamming distance between matched units.

5.3 Imbalanced Data

Imbalance is a common situation in observational studies: there are often substantially more control than treatment units. The data considered in this experiment has covariates with decreasing importance and does not include unimportant covariates. A fixed batch of 2000 treatment units was generated as well as a batch of 40000 control units. We sample from the batch of controls to construct different imbalance ratios: 40000 in the most imbalanced case (ratio1), then 20000 (ratio2) and, finally, 10000 (ratio3). This experiment was run reusing control units to amplify the high dimensional matching of D-AEMR and to find the best possible match for each treatment unit. Results: A check reveals that FLAME and D-AEMR outperform the neareast neighbor matching methods, which produce poor estimates regardless of the degree of imbalance. A second check highlights that D-AEMR is distinctively better than FLAME when the data are imbalanced. D-AEMR yields better results when the data is extremely imbalanced. Additionally, D-AEMR has an average of 4 covariates not matched on, with of units matched on all but 2 covariates. On the other hand, FLAME averages 7 covariates not matched on and only units matched on all but 2 covariates.

Figure 4: (1) Estimated CATT vs True CATT from less imbalance (bottom) to more imbalance (top). In all cases, D-AEMR outperforms 1-PSNNM and Mahalanobis, which produce poor quality estimates.(2) D-AEMR estimation quality increases with the imbalance. The best results are produced when the data is extremely imbalanced. Within each ratio of the imbalance, D-AEMR outperforms FLAME.
Figure 5: For all the different imbalance ratios, D-AEMR consistently matches on more covariates than FLAME.

5.4 Running Time Evaluation

In this section, we compare the running time of our dynamic approach to solving the AEMR problem with a brute force solution (AEMR solution 1 described in Section 3). We also compare with FLAME which does not solve the AEMR problem. All experiments were run on a MacBook Pro with Intel Core i5 Processor (Cores: 2, Speed: 2,4 GHz), 8 GB RAM.

Results: As shown in Figure 6, FLAME provides the best running time performance because it utilizes a bit vector implementation and incrementally reduces the number of covariates. D-AEMR resets the pool of covariate sets to consider at each round of the algorithm, which takes more time. However, as shown in the previous simulations, D-AEMR produces high quality matches that the other methods do not. Our proposed method substantially reduces the running time to solve the AEMR as compared to brute force. The running time for D-AEMR can be further optimized through simple parallelization of the checking of active sets. Additional results are available in Section E in the supplement.

Figure 6: Runtime comparison

5.5 Effect of Noise

In Section D in the supplement, we study how D-AEMR performs in the presence of noise. We show that D-AEMR

 outperforms Causal Forest on all noise levels constructed. While Causal Forest with small amount of noise (low noise coefficient and low noise variance) performs better than Causal Forest without noise, it still produces worse estimates than

D-AEMR . The relative ordering of the methods remains unchanged.

5.6 Missing Data

We consider the case when data are missing at random. We compared performance of FLAME and D-AEMR

 both with and without multiple imputation.

To generate covariates we first sample , where

is not the identity matrix (provided in the supplement) and then let


. This creates correlated binary variables so methods that impute the missing values are expected to perform well. We generate 15000 control and 5000 treated units; 20% of the data are missing at random.

To allow for missing values in D-AEMR and FLAME we construct an matrix where if covariate is unobserved for unit . Treating “missing” as just another category for each variable we proceed with the algorithm by adding a condition for a matched group to be valid: if covariates are being matched on then for a unit in the group. If the sum is greater than then the group matched on the level “missing” for at least one covariate, making it invalid. We compare this approach to multiply imputing 10 datasets using the Multiple Imputation Chained Equations algorithm in the ’mice’ R package and averaging the estimates over the datasets.

The correlation matrix that defines the relationship among the covariates is given in Figure 7.

Figure 7: Correlation matrix that defines the relationship among the covariates .

We generate a dataset with 20% missing values, and Figures 9 and  8 contain the results with and without imputation. D-AEMR outperforms FLAME in terms of CATE estimation. D-AEMR with imputation yields a MSE of 1.11 vs. MSE without imputation of 5.08 compared to FLAME which has 2.77 (MSE with imputation) vs. 195.63 (MSE without imputation). This suggests that for datasets that are too large to undergo multiple imputation, D-AEMR still produces reasonable causal estimates. The result is more interpretable when there is no imputation, because it matches only on covariates that the observation actually possesses rather than imputed covariates.

Figure 8: Without Imputation: Comparison of true CATE and CATE estimates for D-AEMR (left) and FLAME (right) without imputation on the missing data experiment. 20 missing values.
Figure 9: With Imputation: Comparison of true CATE and CATE estimates for D-AEMR (left) and FLAME (right) with imputation on the missing data experiment. 20 missing values.

6 Breaking the cycle of drugs and crime in the United States

Breaking The Cycle (BTC) [7] is a social program conducted in several U.S. states designed to reduce criminal involvement and substance abuse, and improve health and employment among current offenders. A survey [7] was conducted in Alabama, Florida, and Washington regarding the program’s effectiveness, with high quality data for over 380 individuals. These data (and this type of data generally) can be a powerful tool in the war against opioids, and our ability to draw interpretable, trustworthy conclusions from it depends on our ability to construct high-quality matches. For the survey, participants were chosen to receive screening shortly after arrest and participate in a drug intervention under supervision. Similar defendants before the start of the BTC program were selected as the control group. Covariates used for matching are detailed in Table 1.

6.1 Covariate Importance

The features used for the BTC data are listed in Table 1

. Each feature has a variable importance score associated with it, the ratio of the loss of a ridge regression model trained with and without randomly permuting that feature. To obtain the values within the table, we randomly shuffled the values of that feature 100 times, retraining ridge regression, and each time compared with the unshuffled loss of ridge regression:

In Table 1, the mean variable importance values are reported. These calculations were performed on a holdout training set, not on the test data.

As Table 1 shows, all covariates have similar values of the importance score, which suggests that all covariates are almost equally important.

Covariate Importance Score
Live with anyone with an alcohol problem 0.958458
Have trouble understanding in life 0.983272
Live with anyone using nonprescribed drugs 0.984558
Have problem getting along with father in life 0.989829
Have a automobile 0.990346
Have driver license 0.990732
Have serious depression or anxiety in past 30 days 0.992095
Have serious anxiety in life 0.994400
SSI benefit Last 6 Months 1.001295
Have serious depression in life 1.003302
Table 1: Features for BTC data, and importance score of each feature, learned from a holdout training set.

6.2 Order of Dropping Covariates

The order in which D-AEMR and FLAME process covariates is different. Table 2 shows the order in which the covariates were processed for both algorithms. We used the formulations of FLAME and D-AEMR where variable importance is recomputed for only the variables we are considering at that iteration, since variable importance values change in the absence of various covariates. Covariates in FLAME are not eliminated on variable importance alone, there is a balancing factor that prefers not to eliminate too many units from either treated or control at once. Because D-AEMR eliminates one variable at a time, it tends to process the least relevant covariate in the earlier rounds. For example, at the first round, both algorithms drop the covariate “Live with anyone with an alcohol problem” which has the lowest importance score in Table1. At the second round, D-AEMR process the covariate “Have trouble understanding in life” because this covariate has the lowest importance score aside from “Live with anyone with an alcohol problem.” On the other hand, at that same second round, FLAME processes “Have serious anxiety in life” which now is dropped along with “Live with anyone with an alcohol problem.” Once “Live with anyone with an alcohol problem” was removed, “Have serious anxiety in life” has a lower importance score than “Have trouble understanding in life,” which is why it was dropped.

1st Live with anyone with an alcohol problem Live with anyone with an alcohol problem
2nd Have serious anxiety in life Have trouble understanding in life
3rd Live with anyone using nonprescribed drugs Have serious depression or anxiety in past 30 days
4th Have trouble understanding in life Live with anyone using nonprescribed drugs
5th Have problem getting along with father in life Have serious anxiety in life
6th Have serious depression in life Have driver license
7th Have driver license Have problem getting along with father in life
8th Have a automobile Have serious anxiety in life
9th Have serious depression or anxiety in past 30 days Have a automobile
Table 2: Order in which features were processed for FLAME and D-AEMR .

6.3 Analysis of the matches

We compare the quality of matches between FLAME and D-AEMR  in terms of the number of covariates used to match within the groups. Many of the units matched exactly on all covariates and thus were matched by both algorithms at the first round. For the remaining units, D-AEMR matches on more covariates than FLAME. Figure 10 shows the results. In particular, D-AEMR matched many more units on 9 variables, whereas FLAME matched more units on only one variable.

Figure 10: Number Matched: Number of units matched per covariates for the BTC data

6.4 Comparison of D-Aemr with SVM based method Minimax Surrogate Loss

We wanted to use D-AEMR

 as a tool to double check performance of a black box machine learning approach. There is no ground truth, so there will not be formal accuracy evaluation (see simulations for accuracy comparison). We chose a recent method that predicts whether treatment effects are positive, negative, or neutral, using a support vector machine formulation


We ran D-AEMR on the BTC dataset and saved the CATE for each treatment and control unit that were matched. Units with a positive CATE have a negative treatment effect and vice versa. We also implemented the SVM approach and recorded a prediction of positive, negative, or neutral treatment effect for each unit. Figure 11 in the main manuscript shows that D-AEMR and the SVM approach agree on most of the matched units.

Figure 11: Matched Units Analysis: CATE estimates for the BTC dataset. The colors represents the labels from the SVM method.

To see this, the units for which D-AEMR predicted approximately zero treatment effect all have a “neutral” treatment effect label from the SVM approach.

Most positive CATE’s corresponded to negative treatment effects from the SVM. Only one matched group seems to have a different prediction than the SVM: a negative CATE with a negative treatment effect prediction. The easiest way to explain the discrepancy between the two methods is that D-AEMR is a matching method, not a statistical model. Usually if one wanted to estimate CATEs, one would smooth the treatment effect estimates with a regression model after matching, however, we did not do this. Thus, if the negative SVM group happened to be closer in proximity to the matched groups with positive CATEs, it would completely resolve the discrepancy between the two models: the SVM simply smoothed out the treatment effect estimates so that there was a negative predicted treatment effect, even though the CATE may have been negative for that matched group.

In order to determine whether this was true, we computed the Hamming distance between the units in that unusual group to units in other groups to investigate. As it turned out, the units within that matched group were much closer to the units with negative treatment effect than other units, and as such, the difference between the results of the two methods is because there was no smoothing in the matching method. We similarly investigated the blue group at x=0.5 CATE estimate in Figure 11 and again, the covariates of its units were closer in Hamming distance to other blue groups than to other points.

Smoothing the matched groups is beyond the scope of this paper, but can be done with any regression method once the groups are formed if desired. The combination of matched groups with regression techniques for treatment effect can easily handle data with highly nonlinear baseline effects and smooth treatment effects; matching methods allow us to create such models without having to model the nonlinear baseline at all since the matching accounts for it.

Further, in the BTC dataset, we observed a decline in the match quality after about 30 out of the 43 covariates were dropped. We suggest stopping when the match quality MQ (which considers both prediction error and balancing factors in groups) becomes low, since in that case, either prediction error is bad (drop of 5%) or the treatment and control samples have become too imbalanced ( 10% ratio difference). This all changes with highly imbalanced data, in which case, running D-AEMR to the end will generally find the best match for each treatment unit and there is no need to stop early (see Section F).

7 Conclusion

The algorithm presented here produces better matches than any other published algorithm for matching. Not only are the matches high quality, the predictions of treatment effect from the matches are as good as, if not better than, the (black box) machine learning methods we have tried on the problems we have encountered. This algorithm stands to have an impact in the social sciences, as it produces much higher quality results than the methods used for studies within the realms of law, economics, sociology, psychology, and criminology, management, as well as in many areas of healthcare where observational studies are conducted. Unlike matches from other methods, which match individuals whose covariates can be nothing like each other, the matches from D-AEMR are meaningful, as they are almost exact. Code is publicly available at



  • [1] Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association rules in large databases. In Proceedings of the 20th International Conference on Very Large Data Bases, VLDB ’94, pages 487–499, San Francisco, CA, USA, 1994. Morgan Kaufmann Publishers Inc.
  • [2] S. Athey, J. Tibshirani, and S. Wager.

    Generalized Random Forests.

    ArXiv e-prints, October 2016.
  • [3] Anonymous Authors. FLAME: A Fast Large-scale Almost Matching Exactly Approach to Causal Inference. Attached as supplementary material, 2018.
  • [4] Alexandre Belloni, Victor Chernozhukov, and Christian Hansen. Inference on treatment effects after selection among high-dimensional controls. The Review of Economic Studies, 81(2):608–650, 2014.
  • [5] William G Cochran and Donald B Rubin. Controlling bias in observational studies: A review. Sankhyā: The Indian Journal of Statistics, Series A, pages 417–446, 1973.
  • [6] Max H Farrell. Robust inference on average treatment effects with possibly more covariates than observations. Journal of Econometrics, 189(1):1–23, 2015.
  • [7] Adele V. Harrell, Douglas Marlowe, and Jeffrey Merrill. Breaking the cycle of drugs and crime in birmingham, alabama, jacksonville, florida, and tacoma, washington, 1997-2001. 2006.
  • [8] Daniel Ho, Kosuke Imai, Gary King, and Elizabeth Stuart. Matchit: Nonparametric preprocessing for parametric causal inference. Journal of Statistical Software, Articles, 42(8):1–28, 2011.
  • [9] Stefano M Iacus, Gary King, and Giuseppe Porro. Causal inference without balance checking: Coarsened exact matching. Political analysis, page mpr013, 2011.
  • [10] Stefano M Iacus, Gary King, and Giuseppe Porro. Multivariate matching methods that are monotonic imbalance bounding. Journal of the American Statistical Association, 106(493):345–361, 2011.
  • [11] Jeremy A Rassen and Sebastian Schneeweiss. Using high-dimensional propensity scores to automate confounding control in a distributed medical product safety surveillance system. Pharmacoepidemiology and drug safety, 21(S1):41–49, 2012.
  • [12] Paul R Rosenbaum.

    Imposing minimax and quantile constraints on optimal matching in observational studies.

    Journal of Computational and Graphical Statistics, (just-accepted), 2016.
  • [13] Paul R Rosenbaum and Donald B Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.
  • [14] Donald B Rubin. Matching to remove bias in observational studies. Biometrics, pages 159–183, 1973.
  • [15] Donald B Rubin. The use of matched sampling and regression adjustment to remove bias in observational studies. Biometrics, pages 185–203, 1973.
  • [16] Donald B Rubin. Multivariate matching methods that are equal percent bias reducing, i: Some examples. Biometrics, pages 109–120, 1976.
  • [17] Donald B. Rubin. Randomization analysis of experimental data: The fisher randomization test comment. Journal of the American Statistical Association, 75(371):591–593, 1980.
  • [18] Donald B Rubin. For objective causal inference, design trumps analysis. The Annals of Applied Statistics, pages 808–840, 2008.
  • [19] Sebastian Schneeweiss, Jeremy A Rassen, Robert J Glynn, Jerry Avorn, Helen Mogun, and M Alan Brookhart. High-dimensional propensity score adjustment in studies of treatment effects using health care claims data. Epidemiology (Cambridge, Mass.), 20(4):512, 2009.
  • [20] S. Thye Goh and C. Rudin. A Minimax Surrogate Loss Approach to Conditional Difference Estimation. ArXiv e-prints, March 2018.
  • [21] Stefan Wager and Susan Athey. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, (just-accepted), 2017.

Appendix A Group-by Implementation using Bit-Vectors

We implement the ”Group-by” operation by bit-vector manipulation[3]. Assume there are p covariates in the dataset. If the -th covariate is -ary (), we at first rearrange the covariates such that for all and get reordered covariates values of unit to be . Then for each unit , we compute two values.


Together with the treatment indicator value :


Since the covariates are rearranged so that for all , two units and have the same covariate values if and only if . For each unit u with values and , we compute how many times these two values occurs for all the units in the dataset(using unique() function of Numpy package in Python) and denote them as and respectively. To guarantee each matched group has at least one treatment and one control, we mark a unit as matched if its and differs. More rigorous proof can be found in [3].

Appendix B Adjustment of weights

One noticable feature of D-AEMR resides in its ability to dynamically adjust the weight of the covariates througout the run of the algorithm. This property is important as covariates become more or less relevant in presence of other covariates. To recompute the weight ie the importance score of each covariate, D-AEMR utilizes a hold-out training set whose covariate space is restricted to the subset of the covariates being considered (). Using the hold-out set that includes both treatment and control units, we obtain the vectors representing the weights and

by linear regressions on control and treatment units. The readjusted weights are used to compute a

prediction error measure to determine which subset of covariates leads to the best estimation quality. Specifically, is computed for each active set considered by


Notation: is the holdout dataset, represents a selector function on the rows and columns of the input dataset for any given matched unit . In other words, consists of a subset of rows and covariates that match with (maximizing gives the main matched group for any unit ), and are respectively the treatment indicator and the outcome for unit in the hold-out dataset.

Other criteria for optimization in addition to optimizing over is giving preference to subsets of attributes that allow better balancing of treatment and control units in the matched group. That balancing factor depends only on the number of units matched at each iteration by dropping and the remaining unmatched units.

A matched quality quantity consisting of a trade off between these two measures ensures that D-AEMR only processes sets that maintain good prediction quality while the matched groups contain enough units to yield good estimate of treatment effects.

Appendix C Omitted Proofs

c.1 Proof of Proposition 4

Proposition 4 If for a superset of a newly processed set where and , all subsets of of size have been processed (i.e. is eligible to be active after is processed), then is included in the set returned by GenerateNewActiveSets.


Suppose all subsets of of size are already processed and belong to . Let be the covariate in . Clearly, would appear in , since at least one subset of of size would contain , and . Further all covariates in , including and those in will have support at least in . To see this, note that there are subsets of of size , and each covariate in appears in exactly of them. Hence , the ‘if’ condition to check minimum support for all covariates in is satisfied, and the final ‘if’ condition to eliminate false positives is satisfied too. Therefore will be included in returned by the procedure. ∎

c.2 Proof of Theorem 4

Theorem 4(Correctness) The algorithm D-AEMR solves the AEMR problem.


Consider any treatment unit . Let be the set of covariates in its main matched group returned in D-AEMR (the while loop in D-AEMR runs as long as there is a treated unit, and the GroupedMR returns main matched group for every unit when it is matched for the first time). Let be the indicator vector of (see (2)). Since the GroupedMR procedure returns a main matched group only if it is a valid matched group containing at least one treated and one control unit (see Algorithm 1), and since all units in the matched group on have the same value of covariates in , there exists a unit with and .

Hence it remains to show that the covariate set in the main matched group for corresponds to the maximum weight . Assume the contradiction that there exists another covariate-set such that , there exists a unit with and , and gives the maximum weight over all such .

(i) cannot be a (strict) subset of , since D-AEMR ensures that all subsets are processed before a superset is processed to satisfy the downward closure property in Proposition 3.2.

(ii) cannot be a (strict) superset of , since it would violate the assumption that for non-negative weights.

(iii) Hence assume that and are incomparable (there exist covariates in both and ). Suppose the active set was chosen in iteration . If was processed in an earlier iteration , since forms a valid matched group for , it would give the main matched group for violating the assumption. We argue that in this case must be active at the start of iteration , and will be chosen as the best covariate set in iteration , leading to a contradiction.

Note that we start with all singleton sets as active sets in in the D-AEMR algorithm. Consider any singleton subset (comprising a single covariate in ). Due to the downward closure property in Proposition 3.2, , hence . Hence all of these singleton subsets of will be processed in earlier iterations , and will belong to the set of processed covariate sets .

Repeating the above argument, consider any subset . It holds that . Hence all subsets of will be processed in earlier iterations starting with the singleton subsets of . In particular, all subsets of size will belong to . As soon as the last of those subsets is processed, the procedure GenerateNewActiveSets will include in the set of active sets in a previous iteration . Hence if is not processed in an earlier iteration, it must be active at the start of iteration , leading to a contradiction.

Hence for all treatment units , the covariate-set giving the maximum value of will be used to form the main matched group of , showing the correctness of the D-AEMR algorithm. ∎

Appendix D Effect of Noise

We use 15000 treated units, 15000 control units, 5 important covariates and 8 unimportant covariates. We change our generative process in the following way to add noise:


where the noise coefficient and denotes noise chosen as either , , below:

First, we run D-AEMR and Causal Forest on all noise models to study how noise affect each method in terms of CATE estimations. Then, we compare the CATE estimates of D-AEMR and Causal Forest on the noise model in which Causal Forest achieves its best results.

Results: As Figure 12 shows, D-AEMR outperforms Causal Forest on all noise levels constructed. While Causal Forest with small amount of noise (low noise coefficient and low noise variance) performs better than Causal Forest without noise (see Figure 15), it still produces worse estimates than D-AEMR . The relative ordering of the methods remains unchanged. Additionally, Figure 13 show that D-AEMR is robust to small amount of noise but a high noise coefficient () and/or high variance (sd ) can lead to poorer results. To add smoothing of the estimates for D-AEMR , one can either: (i) smooth after matching by doing regression afterwards, or (ii) increase the minimum number of units in each match (which is a parameter in our code) to induce smoothing.

Figure 12: comparison of D-AEMR and Causal Forest for different noise levels. Noise coefficient

. Each row represents a different standard deviation (from top to bottom: sd = 1, sd = 2.5, sd = 5)

Figure 13: effect of noise on CATE estimation for D-AEMR when stopped early (before dropping important covariates. Each column represents a different noise coefficient (from left to right : .(nonoise), ). Each row represents a different standard deviation (from top to bottom: sd = 1, sd = 2.5, sd = 5)
Figure 14: effect of noise on CATE estimation for D-AEMR when run until no more matches. Each column represents a different noise coefficient (from left to right : .(nonoise), ). Each row represents a different standard deviation (from top to bottom: sd = 1, sd = 2.5, sd = 5)
Figure 15: effect of noise on CATE estimation for Causal Forest. Each column represents a different noise coefficient (from left to right : .(nonoise), ). Each row represents a different standard deviation (from top to bottom: sd = 1, sd = 2.5, sd = 5)

Appendix E More runtime evaluation of D-Aemr 

The running time of D-AEMR varying the number of units is given in Figure 16.

Figure 16: Run time of D-AEMR with variations on the number of units.

Appendix F Evolution of the match quality (MQ) in the BTC dataset

Figure 17 shows how the prediction error (PE) and balancing factor (BF) change over iterations for the BTC dataset with 43 covariates when we run D-AEMR. The prediction error suddenly drops after 30 iterations. This leads to a drop in the matching quality and can serve as a stopping criteria in practice.

(a) Evolution of the Prediction Error
(b) Evolution of the Balancing Factor
Figure 17: Evolution of Match Quality (MQ = PE + C*BF) where C is a trade off parameter