In observational causal inference where the scientist does not control the randomization of individuals into treatment, an ideal approach matches each treatment unit to a control unit with identical covariates. However, in high dimensions, few such “identical twins” exist, since it becomes unlikely that any two units have identical covariates in high dimensions. In that case, how might we construct a match assignment that would lead to accurate estimates of conditional average treatment effects (CATEs)?
For categorical variables, we might choose a Hamming distance to measure similarity between covariates. Then, the goal is to find control units that are similar to the treatment units on as many covariates as possible. This leads to two challenges, the first being computational (how does one compute optimal matches on Hamming distance?), and the second being that not all covariates are equally relevant, which can impact the Hamming distance.
This second issue, that not all covariates are equally important, has serious implications for CATE estimation. Matching methods generally suffer in the presence of many irrelevant covariates (covariates that are not related to either treatment or outcome): the irrelevant variables would dominate the Hamming distance calculation, so that the treatment units would mainly be matched to the control units on the irrelevant variables. This means that matching methods do not always pass an important sanity check in that irrelevant variables should be irrelevant. As our experiments show, the participation of irrelevant variables can overwhelm many state of the art matching methods, including Propensity Score Matching , Nearest Neighbor matching , as well as methods not explicitly designed for matching such as  Causal Forest.
We propose a method in this work that aims to mitigate the problems listed above. The algorithm uses a hold-out training set to determine which variables are the most important to match on. It then determines matches by optimizing a weighted Hamming distance, where the weights on each variable are determined from the analysis of the hold-out set. Finding a matched group for a given unit then becomes a constrained discrete optimization problem, and we solve all of these optimization problems efficiently with a single dynamic program. This algorithm has the same basic monotonicity property (downwards closure) and structure as that of the apriori algorithm 
used in data mining for finding frequent itemsets. However, frequency of itemsets is irrelevant here, instead the goal is to find a largest (weighted) number of covariates that both a treatment and control unit have in common. The algorithm fully optimizes the weighted Hamming distance of each treatment unit to the nearest control unit (and vice versa). It is efficient, owing to the use of database programming and bit-vector computations, and does not require an integer programming solver. The method adaptively chooses subsets of features to consider, based on the other subsets of features that have already been considered. The method is namedD-AEMR : Dynamic Almost-Exact Matching with Replacement.
2 Related work
Randomized experiments are a gold standard for causal inference ;
however, much of the available data for causal inference is observational, where the experimenter does not control the randomization mechanism. Observational datasets can be much larger and easier to obtain than randomized trial data, and matching techniques provide ways to approximate the covariate balance that would have come from a randomized trial.
Exact matching is not possible in high dimensions, as “identical twins” in treatment and control samples are not likely to exist. Early on, this led to techniques that reduce dimension using propensity score matching [15, 14, 16, 5], which extend to penalized regression approaches [19, 11, 4, 6]. Propensity score matching methods project the entire dataset to one dimension and thus cannot be used for estimating CATE (conditional average treatment effect), since the matched groups often do not agree on important covariates. In “optimal matching,” , an optimization problem is formed to choose matches according to a pre-defined distance measure, with balance constraints, though as discussed above, this distance measure can be dominated by irrelevant covariates, often leading to poor matched groups and biased estimates of treatment effect. Coarsened exact matching [9, 10] has the same problem.
A recent alternative framework is that of almost-exact matching, where each matched group contains units that are close on covariates that are important for predicting outcomes. For example, Coarsened Exact Matching [9, 10] is almost-exact, if one were to use an oracle (should one ever become available) that bins covariates according to importance for estimating causal effects. The FLAME algorithm 
is an almost-exact matching method that adapts the distance metric to the data using machine learning. It starts by matching “identical twins,” and proceeds by eliminating less important covariates one by one, attempting to match individuals on the largest set of covariates available. FLAME has a balance constraint to ensure that it does not remove too many treatment or control units as it eliminates covariates. It uses a hold-out training set to determine importance of covariates. The combination of balance and importance of covariates is called “match quality” (MQ). FLAME can handle datasets too large to fit in memory, and scales well with the number of covariates, but removing covariates in exactly one order (rather than all possible orders as in the present work) means that many high-quality matches will be missed.
D-AEMR ’s matched groups tend to match on more covariates than FLAME; the distances between matched units in D-AEMR are smaller within each matched group, thus its matches are distinctly higher quality. This has implications for missing data, where D-AEMR can find matched groups that FLAME cannot.
3 Almost-Exact Matching Framework
Consider a dataframe where , , respectively denote the covariates for all units, the outcome vector and the treatment indicator ( for treated, for control). Unit ’s th covariate is denoted . Notation indicates covariates for the th unit, and is an indicator for whether or not unit is treated.
Throughout we make SUTVA and ignorability assumptions . The goal is to match treatment and control units on as many relevant covariates as possible. Relevance of covariate is denoted by and it is determined using a hold-out training set. Either ’s can be fixed beforehand or adjusted dynamically using a for-loop inside the algorithm.
Assuming that we have a fixed weight for each covariate , we would like to find a match for each treatment unit that matches at least one control unit on as many relevant covariates as possible. Thus we consider the following matching problem:
Almost-Exact Matching with Replacement (AEMR): For each treatment unit ,
where denotes Hadamard product.
The solution to the AEMR problem is an indicator of the optimal set of covariates for treatment unit ’s matched group. The constraint says that the optimal matched group contains at least one control unit.
The solution of the AEMR problem can be the same for multiple treatment units, in which case they would form a single matched group. Formally, we define the matched group for treatment unit .
The main matched group corresponding to treatment unit contains all units such that
That is, the main matched group contains all treatment and control units that have identical covariate values to on the covariates for which is 1.
The formulation of the AEMR and main matched group formulation is symmetric for control units.
3.1 Straightforward AEMR solutions do not suffice
There are two straightforward (but inefficient) approaches to solving the AEMR problem for all units.
AEMR Solution 1 (quadratic in , linear in ): Brute force pairwise comparison of treatment points to control points. For all treatment units , we (i) iterate over all control units , (ii) find the vector with value 1 if there is a match on the values of the corresponding covariates, and 0 otherwise, (iii) find the control unit(s) with the highest value of , and (iv) return them as the main matched group for the treatment unit (and compute the auxiliary group). Whenever a previously matched unit is matched to a previously unmatched unit , record the ’s main matched group as an auxiliary group for the previously matched unit . When all units are ‘done’ (all units are either matched already or cannot be matched) then stop, and compute the CATE for each treatment and control unit using its main matched group. If a unit belongs to auxiliary matched groups then its outcome is used for computing both its own CATE (in its own main matched group) and the CATEs of units for whom it is in an auxiliary group (e.g., will be used to compute ’s estimated CATE).
This algorithm is polynomial in both and , however, the quadratic time complexity in also makes this approach impractical for large datasets (for instance, when we have more than a million units with half being treatment units).
AEMR Solution 2 (order , exponential in :) Brute force iteration over all subsets of the covariates. This approach solves the AEMR problem simultaneously for all treatment and control units for a fixed weight vector . First, (i) enumerate every (which serves as an indicator for a subset of covariates), (ii) order the ’s according to , (iii) form all valid main matched groups having at least one treated and one control unit (iv) the first time each unit is matched, mark that unit with a ‘done’ flag, and record its corresponding main matched group and, to facilitate matching with replacement, (v) whenever a previously matched unit is matched to a previously unmatched unit, record this main matched group as an auxiliary group. When all units are ‘done’ (all units are either matched already or cannot be matched) then stop, and compute the CATE for each treatment and control unit using its main matched group. Each unit’s outcome will be used to estimate CATEs for every auxiliary group that it is a member of, as before. Although this approach exploits the efficient ‘group by’ function (e.g., provided in database (SQL) queries), which can be implemented in time by sorting the units, iterating over all possible vectors makes this approach unsuitable for practical purposes (exponential in ).
3.2 Key to D-Aemr : Downward closure will help solve AEMR more efficiently
If is in the millions, the first solution, or any simple variation of it, is practically infeasible. A straightforward implementation of the second solution is also inefficient. However, a monotonicity property (downward closure) stated below will allow us to prune the search space so that the second solution can be modified to be completely practical, as we will demonstrate in the D-AEMR algorithm in the next section, which does not enumerate all v’s, it uses the monotonicity to reduce the number of v’s it considers.
(Monotonicity of in AEMR solutions) Fix treatment unit . Consider feasible , meaning . Then,
Any feasible such that elementwise will have .
Consequently, consider feasible vectors and . Define as the elementwise . Then , and .
These follow from the fact that the elements of are binary and the elements of are non-negative. The first property means that if we have found a feasible , we need not consider any with fewer 1’s as a possible solution of the AEMR for unit . Thus, D-AEMR starts from being all 1’s (consider all covariates). It systematically drops one element of to zero at a time, then two, then three, ordered according to values of . The second property implies that we must evaluate both and as possible AEMR solutions before evaluating . Conversely, a new subset of variables defined by cannot be considered unless all of its supersets have been considered.
These two properties form the basis of the D-AEMR algorithm.
4 The Dynamic AEMR (D-Aemr) Algorithm
AEMR has two computations: the weights and the actual matching mechanism. The weights correspond to the importance of the covariates and are computed by running regression on a hold-out training set prior to running the algorithm, or are computed adaptively as the algorithm executes (the computation of the weights
is discussed in the supplement).
The matching mechanism is designed to solve the AEMR.
We call a covariate-set any set of covariates. We denote by the original set of all covariates from the input dataset, where . When we drop a set of covariates , it means we will match on . For any covariate-set , we associate an indicator-vector defined as follows:
i.e. the value is 1 if the covariate is not in implying that it is being used for matching.
Algorithm 2 gives the pseudocode of the D-AEMR algorithm. Instead of looping over all possible vectors to solve the AEMR, it considers a covariate-set for dropping only if satisfies the monotonicity property of Proposition 3.2. For example, if has been considered for dropping to form matched groups, it would not process next because the monotonicity property requires , and to have been considered previously for dropping.
The D-AEMR algorithm uses the GroupedMR (Grouped Matching with Replacement) subroutine given in Algorithm 1 to form all valid main matched groups having at least one treated and one control unit. GroupedMR takes a given subset of covariates and finds all subsets of treatment and control units that have identical values of those covariates. We use an efficient implementation of the group-by operation used in the algorithm using bit-vectors, which is discussed in the supplement. To keep track of main and auxiliary matched groups, GroupedMR takes the entire set of units as well as the set of unmatched units from the previous iteration as input along with the covariate-set to match on in this iteration. Instead of matching only the unmatched units in using the group-by operator, it matches all units in to allow for matching with replacement as in the AEMR objective. It keeps track of the main matched groups for the unmatched units that are matched for the first time, and auxiliary groups (see previous section) for the other matched units.
D-AEMR keeps track of two sets of covariate-sets: (1) The set of processed sets contains the covariate-sets whose main matched groups (if any exist) have already been formed. That is, contains if matches have been constructed on by calling the GroupedMR procedure. (2) The set of active sets contains the covariate-sets that are eligible to be dropped according to Proposition 3.2. For any iteration , , i.e., the sets are disjoint, where denote the states of at the end of iteration . Due to the monotonicity property Proposition 3.2, if , then each proper subset belonged to in an earlier iteration . Once an active set is chosen as the optimal subset to drop in iteration , is excluded from (it is no longer active) and is included in as a processed set. In that sense, the active sets are generated and included in in a hierarchical manner similar to the apriori algorithm. A set is included in only if all of its proper subsets of one less size , , have been processed.
Figure 1 shows an example of the earlier steps of the D-AEMR algorithm.
The procedure GenerateNewActiveSets gives an efficient implementation of generation of new active sets in each iteration of D-AEMR, and takes the currently processed sets and a newly processed set as input. Let . In this procedure, denotes the set of all processed covariate-sets in of size , and also includes . Inclusion of in may lead to generation of new active sets of size if all of its subsets of size (one less) have been already processed. The new active sets triggered by inclusion of in would be supersets of of size if all subsets of size belong to . To generate such candidate superset , we can append with all covariates appearing in some covariate-set in except those in . However, this naive approach would iterate over many superfluous candidates for active sets. Instead, GenerateNewActiveSets safely prunes some such candidates that cannot be valid active sets using support of each covariate in , which is the number of sets in containing . Indeed, for any covariate that is not frequent enough in , the monotonicity property ensures that any covariate-set that contains that covariate cannot be active. The following proposition shows that this pruning step does not eliminate any valid active set (proof is in the supplement):
If for a superset of a newly processed set where and , all subsets of of size have been processed (i.e. is eligible to be active after is processed), then is included in the set returned by GenerateNewActiveSets.
The explicit verification step of whether all possible subsets of of one less size belongs to is necessary, i.e., the above optimization only prunes some candidate sets that are guaranteed not to be active. For instance, consider , , and . For the superset of , all of have support of in , but this cannot become active yet, since the subset of does not belong to .
Finally, the following theorem states the correctness of the D-AEMR algorithm (proof is in the supplement).
(Correctness) The algorithm D-AEMR solves the AEMR problem.
We present results under a variety of data generating processes. We show that D-AEMR produces higher quality matches than popular matching methods such as 1-PNNSM and Nearest Neighbor matching with Mahalanobis distance, and can produce better treatment effect estimates than black box machine learning methods such as Causal Forest (which is not a matching method, and is not interpretable). The ‘MatchIt’ R-package () was used to perform 1-Propensity Score Nearest Neighbor matching (1-PSNNM) and Nearest Neighbor with Mahalanobis distance (Mahalanobis). For Causal Forest, we used the ‘grf’ R-package (). D-AEMR also improves over the original FLAME  with regards to the quality of matches in high dimensions. Other matching methods (optmatch, cardinality match) do not scale to large problems and thus needed to be omitted. We use ”FLAME” to represent original FLAME algorithm. Throughout this section, the outcome is generated with
where is the binary treatment indicator. We vary the distribution of the covariates, coefficients (’s, ’s, and ), as well as the fraction of treated units.
5.1 Presence of irrelevant covariates
A basic sanity check for matching algorithms is how sensitive to irrelevant covariates they are. To that end, we run experiments with a majority of the covariates being irrelevant to the outcome. This simulation generates 15000 control units, 15000 treatment units, 5 important covariates and 10 irrelevent covariates.
For important covariates let with , ,
For unimportant covariates , in the control group and in the treatment group.
Results: As Figure 2 shows, D-AEMR and FLAME achieve the optimal result before dropping any important covariates. They respectively match and of units on all important covariates. When imposing that the FLAME algorithms find all the possible matches even if important covariates are dropped (not recommended), poor matches are introduced. However, their worst case scenario is still substantially better than the comparative methods, all of which perform poorly in presence of irrelevant covariates. Causal Forest is especially ill suited for this case.
5.2 Exponentially decaying covariate importance
A notable advantage of D-AEMR over FLAME is that it should produce more high quality matches before resorting to lower quality matches. To test that theory, in this experiment we consider covariates of decaying importance (letting the parameters decrease exponentially, ).
We evaluate the performance of the algorithms when and of the units are matched.
Results: As Figure 3 shows, in all instances (each threshhold in each column) D-AEMR matches on more covariates, yielding better estimates than FLAME.
5.3 Imbalanced Data
Imbalance is a common situation in observational studies: there are often substantially more control than treatment units. The data considered in this experiment has covariates with decreasing importance and does not include unimportant covariates. A fixed batch of 2000 treatment units was generated as well as a batch of 40000 control units. We sample from the batch of controls to construct different imbalance ratios: 40000 in the most imbalanced case (ratio1), then 20000 (ratio2) and, finally, 10000 (ratio3). This experiment was run reusing control units to amplify the high dimensional matching of D-AEMR and to find the best possible match for each treatment unit. Results: A check reveals that FLAME and D-AEMR outperform the neareast neighbor matching methods, which produce poor estimates regardless of the degree of imbalance. A second check highlights that D-AEMR is distinctively better than FLAME when the data are imbalanced. D-AEMR yields better results when the data is extremely imbalanced. Additionally, D-AEMR has an average of 4 covariates not matched on, with of units matched on all but 2 covariates. On the other hand, FLAME averages 7 covariates not matched on and only units matched on all but 2 covariates.
5.4 Running Time Evaluation
In this section, we compare the running time of our dynamic approach to solving the AEMR problem with a brute force solution (AEMR solution 1 described in Section 3). We also compare with FLAME which does not solve the AEMR problem. All experiments were run on a MacBook Pro with Intel Core i5 Processor (Cores: 2, Speed: 2,4 GHz), 8 GB RAM.
Results: As shown in Figure 6, FLAME provides the best running time performance because it utilizes a bit vector implementation and incrementally reduces the number of covariates. D-AEMR resets the pool of covariate sets to consider at each round of the algorithm, which takes more time. However, as shown in the previous simulations, D-AEMR produces high quality matches that the other methods do not. Our proposed method substantially reduces the running time to solve the AEMR as compared to brute force. The running time for D-AEMR can be further optimized through simple parallelization of the checking of active sets. Additional results are available in Section E in the supplement.
5.5 Effect of Noise
In Section D in the supplement, we study how D-AEMR performs in the presence of noise. We show that D-AEMR
outperforms Causal Forest on all noise levels constructed. While Causal Forest with small amount of noise (low noise coefficient and low noise variance) performs better than Causal Forest without noise, it still produces worse estimates thanD-AEMR . The relative ordering of the methods remains unchanged.
5.6 Missing Data
We consider the case when data are missing at random. We compared performance of FLAME and D-AEMR
both with and without multiple imputation.
To generate covariates we first sample , where
is not the identity matrix (provided in the supplement) and then letwhere
. This creates correlated binary variables so methods that impute the missing values are expected to perform well. We generate 15000 control and 5000 treated units; 20% of the data are missing at random.
To allow for missing values in D-AEMR and FLAME we construct an matrix where if covariate is unobserved for unit . Treating “missing” as just another category for each variable we proceed with the algorithm by adding a condition for a matched group to be valid: if covariates are being matched on then for a unit in the group. If the sum is greater than then the group matched on the level “missing” for at least one covariate, making it invalid. We compare this approach to multiply imputing 10 datasets using the Multiple Imputation Chained Equations algorithm in the ’mice’ R package and averaging the estimates over the datasets.
The correlation matrix that defines the relationship among the covariates is given in Figure 7.
We generate a dataset with 20% missing values, and Figures 9 and 8 contain the results with and without imputation. D-AEMR outperforms FLAME in terms of CATE estimation. D-AEMR with imputation yields a MSE of 1.11 vs. MSE without imputation of 5.08 compared to FLAME which has 2.77 (MSE with imputation) vs. 195.63 (MSE without imputation). This suggests that for datasets that are too large to undergo multiple imputation, D-AEMR still produces reasonable causal estimates. The result is more interpretable when there is no imputation, because it matches only on covariates that the observation actually possesses rather than imputed covariates.
6 Breaking the cycle of drugs and crime in the United States
Breaking The Cycle (BTC)  is a social program conducted in several U.S. states designed to reduce criminal involvement and substance abuse, and improve health and employment among current offenders. A survey  was conducted in Alabama, Florida, and Washington regarding the program’s effectiveness, with high quality data for over 380 individuals. These data (and this type of data generally) can be a powerful tool in the war against opioids, and our ability to draw interpretable, trustworthy conclusions from it depends on our ability to construct high-quality matches. For the survey, participants were chosen to receive screening shortly after arrest and participate in a drug intervention under supervision. Similar defendants before the start of the BTC program were selected as the control group. Covariates used for matching are detailed in Table 1.
6.1 Covariate Importance
The features used for the BTC data are listed in Table 1
. Each feature has a variable importance score associated with it, the ratio of the loss of a ridge regression model trained with and without randomly permuting that feature. To obtain the values within the table, we randomly shuffled the values of that feature 100 times, retraining ridge regression, and each time compared with the unshuffled loss of ridge regression:
In Table 1, the mean variable importance values are reported. These calculations were performed on a holdout training set, not on the test data.
As Table 1 shows, all covariates have similar values of the importance score, which suggests that all covariates are almost equally important.
|Live with anyone with an alcohol problem||0.958458|
|Have trouble understanding in life||0.983272|
|Live with anyone using nonprescribed drugs||0.984558|
|Have problem getting along with father in life||0.989829|
|Have a automobile||0.990346|
|Have driver license||0.990732|
|Have serious depression or anxiety in past 30 days||0.992095|
|Have serious anxiety in life||0.994400|
|SSI benefit Last 6 Months||1.001295|
|Have serious depression in life||1.003302|
6.2 Order of Dropping Covariates
The order in which D-AEMR and FLAME process covariates is different. Table 2 shows the order in which the covariates were processed for both algorithms. We used the formulations of FLAME and D-AEMR where variable importance is recomputed for only the variables we are considering at that iteration, since variable importance values change in the absence of various covariates. Covariates in FLAME are not eliminated on variable importance alone, there is a balancing factor that prefers not to eliminate too many units from either treated or control at once.
Because D-AEMR eliminates one variable at a time, it tends to process the least relevant covariate in the earlier rounds. For example, at the first round, both algorithms drop the covariate “Live with anyone with an alcohol problem” which has the lowest importance score in Table1. At the second round, D-AEMR process the covariate “Have trouble understanding in life” because this covariate has the lowest importance score aside from “Live with anyone with an alcohol problem.” On the other hand, at that same second round, FLAME processes “Have serious anxiety in life” which now is dropped along with “Live with anyone with an alcohol problem.” Once “Live with anyone with an alcohol problem” was removed, “Have serious anxiety in life” has a lower importance score than “Have trouble understanding in life,” which is why it was dropped.
|1st||Live with anyone with an alcohol problem||Live with anyone with an alcohol problem|
|2nd||Have serious anxiety in life||Have trouble understanding in life|
|3rd||Live with anyone using nonprescribed drugs||Have serious depression or anxiety in past 30 days|
|4th||Have trouble understanding in life||Live with anyone using nonprescribed drugs|
|5th||Have problem getting along with father in life||Have serious anxiety in life|
|6th||Have serious depression in life||Have driver license|
|7th||Have driver license||Have problem getting along with father in life|
|8th||Have a automobile||Have serious anxiety in life|
|9th||Have serious depression or anxiety in past 30 days||Have a automobile|
6.3 Analysis of the matches
We compare the quality of matches between FLAME and D-AEMR in terms of the number of covariates used to match within the groups. Many of the units matched exactly on all covariates and thus were matched by both algorithms at the first round. For the remaining units, D-AEMR matches on more covariates than FLAME. Figure 10 shows the results. In particular, D-AEMR matched many more units on 9 variables, whereas FLAME matched more units on only one variable.
6.4 Comparison of D-Aemr with SVM based method Minimax Surrogate Loss
We wanted to use D-AEMR
as a tool to double check performance of a black box machine learning approach. There is no ground truth, so there will not be formal accuracy evaluation (see simulations for accuracy comparison). We chose a recent method that predicts whether treatment effects are positive, negative, or neutral, using a support vector machine formulation.
We ran D-AEMR on the BTC dataset and saved the CATE for each treatment and control unit that were matched. Units with a positive CATE have a negative treatment effect and vice versa. We also implemented the SVM approach and recorded a prediction of positive, negative, or neutral treatment effect for each unit. Figure 11 in the main manuscript shows that D-AEMR and the SVM approach agree on most of the matched units.
To see this, the units for which D-AEMR predicted approximately zero treatment effect all have a “neutral” treatment effect label from the SVM approach.
Most positive CATE’s corresponded to negative treatment effects from the SVM. Only one matched group seems to have a different prediction than the SVM: a negative CATE with a negative treatment effect prediction. The easiest way to explain the discrepancy between the two methods is that D-AEMR is a matching method, not a statistical model. Usually if one wanted to estimate CATEs, one would smooth the treatment effect estimates with a regression model after matching, however, we did not do this. Thus, if the negative SVM group happened to be closer in proximity to the matched groups with positive CATEs, it would completely resolve the discrepancy between the two models: the SVM simply smoothed out the treatment effect estimates so that there was a negative predicted treatment effect, even though the CATE may have been negative for that matched group.
In order to determine whether this was true, we computed the Hamming distance between the units in that unusual group to units in other groups to investigate. As it turned out, the units within that matched group were much closer to the units with negative treatment effect than other units, and as such, the difference between the results of the two methods is because there was no smoothing in the matching method. We similarly investigated the blue group at x=0.5 CATE estimate in Figure 11 and again, the covariates of its units were closer in Hamming distance to other blue groups than to other points.
Smoothing the matched groups is beyond the scope of this paper, but can be done with any regression method once the groups are formed if desired. The combination of matched groups with regression techniques for treatment effect can easily handle data with highly nonlinear baseline effects and smooth treatment effects; matching methods allow us to create such models without having to model the nonlinear baseline at all since the matching accounts for it.
Further, in the BTC dataset, we observed a decline in the match quality after about 30 out of the 43 covariates were dropped. We suggest stopping when the match quality MQ (which considers both prediction error and balancing factors in groups) becomes low, since in that case, either prediction error is bad (drop of 5%) or the treatment and control samples have become too imbalanced ( 10% ratio difference). This all changes with highly imbalanced data, in which case, running D-AEMR to the end will generally find the best match for each treatment unit and there is no need to stop early (see Section F).
The algorithm presented here produces better matches than any other published algorithm for matching. Not only are the matches high quality, the predictions of treatment effect from the matches are as good as, if not better than, the (black box) machine learning methods we have tried on the problems we have encountered. This algorithm stands to have an impact in the social sciences, as it produces much higher quality results than the methods used for studies within the realms of law, economics, sociology, psychology, and criminology, management, as well as in many areas of healthcare where observational studies are conducted. Unlike matches from other methods, which match individuals whose covariates can be nothing like each other, the matches from D-AEMR are meaningful, as they are almost exact. Code is publicly available at https://github.com/almostexactmatch/daemr.
-  Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association rules in large databases. In Proceedings of the 20th International Conference on Very Large Data Bases, VLDB ’94, pages 487–499, San Francisco, CA, USA, 1994. Morgan Kaufmann Publishers Inc.
S. Athey, J. Tibshirani, and S. Wager.
Generalized Random Forests.ArXiv e-prints, October 2016.
-  Anonymous Authors. FLAME: A Fast Large-scale Almost Matching Exactly Approach to Causal Inference. Attached as supplementary material, 2018.
-  Alexandre Belloni, Victor Chernozhukov, and Christian Hansen. Inference on treatment effects after selection among high-dimensional controls. The Review of Economic Studies, 81(2):608–650, 2014.
-  William G Cochran and Donald B Rubin. Controlling bias in observational studies: A review. Sankhyā: The Indian Journal of Statistics, Series A, pages 417–446, 1973.
-  Max H Farrell. Robust inference on average treatment effects with possibly more covariates than observations. Journal of Econometrics, 189(1):1–23, 2015.
-  Adele V. Harrell, Douglas Marlowe, and Jeffrey Merrill. Breaking the cycle of drugs and crime in birmingham, alabama, jacksonville, florida, and tacoma, washington, 1997-2001. 2006.
-  Daniel Ho, Kosuke Imai, Gary King, and Elizabeth Stuart. Matchit: Nonparametric preprocessing for parametric causal inference. Journal of Statistical Software, Articles, 42(8):1–28, 2011.
-  Stefano M Iacus, Gary King, and Giuseppe Porro. Causal inference without balance checking: Coarsened exact matching. Political analysis, page mpr013, 2011.
-  Stefano M Iacus, Gary King, and Giuseppe Porro. Multivariate matching methods that are monotonic imbalance bounding. Journal of the American Statistical Association, 106(493):345–361, 2011.
-  Jeremy A Rassen and Sebastian Schneeweiss. Using high-dimensional propensity scores to automate confounding control in a distributed medical product safety surveillance system. Pharmacoepidemiology and drug safety, 21(S1):41–49, 2012.
Paul R Rosenbaum.
Imposing minimax and quantile constraints on optimal matching in observational studies.Journal of Computational and Graphical Statistics, (just-accepted), 2016.
-  Paul R Rosenbaum and Donald B Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.
-  Donald B Rubin. Matching to remove bias in observational studies. Biometrics, pages 159–183, 1973.
-  Donald B Rubin. The use of matched sampling and regression adjustment to remove bias in observational studies. Biometrics, pages 185–203, 1973.
-  Donald B Rubin. Multivariate matching methods that are equal percent bias reducing, i: Some examples. Biometrics, pages 109–120, 1976.
-  Donald B. Rubin. Randomization analysis of experimental data: The fisher randomization test comment. Journal of the American Statistical Association, 75(371):591–593, 1980.
-  Donald B Rubin. For objective causal inference, design trumps analysis. The Annals of Applied Statistics, pages 808–840, 2008.
-  Sebastian Schneeweiss, Jeremy A Rassen, Robert J Glynn, Jerry Avorn, Helen Mogun, and M Alan Brookhart. High-dimensional propensity score adjustment in studies of treatment effects using health care claims data. Epidemiology (Cambridge, Mass.), 20(4):512, 2009.
-  S. Thye Goh and C. Rudin. A Minimax Surrogate Loss Approach to Conditional Difference Estimation. ArXiv e-prints, March 2018.
-  Stefan Wager and Susan Athey. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, (just-accepted), 2017.
Appendix A Group-by Implementation using Bit-Vectors
We implement the ”Group-by” operation by bit-vector manipulation. Assume there are p covariates in the dataset. If the -th covariate is -ary (), we at first rearrange the covariates such that for all and get reordered covariates values of unit to be . Then for each unit , we compute two values.
Together with the treatment indicator value :
Since the covariates are rearranged so that for all , two units and have the same covariate values if and only if . For each unit u with values and , we compute how many times these two values occurs for all the units in the dataset(using unique() function of Numpy package in Python) and denote them as and respectively. To guarantee each matched group has at least one treatment and one control, we mark a unit as matched if its and differs. More rigorous proof can be found in .
Appendix B Adjustment of weights
One noticable feature of D-AEMR resides in its ability to dynamically adjust the weight of the covariates througout the run of the algorithm. This property is important as covariates become more or less relevant in presence of other covariates. To recompute the weight ie the importance score of each covariate, D-AEMR utilizes a hold-out training set whose covariate space is restricted to the subset of the covariates being considered (). Using the hold-out set that includes both treatment and control units, we obtain the vectors representing the weights and
by linear regressions on control and treatment units. The readjusted weights are used to compute aprediction error measure to determine which subset of covariates leads to the best estimation quality. Specifically, is computed for each active set considered by
Notation: is the holdout dataset, represents a selector function on the rows and columns of the input dataset for any given matched unit . In other words, consists of a subset of rows and covariates that match with (maximizing gives the main matched group for any unit ), and are respectively the treatment indicator and the outcome for unit in the hold-out dataset.
Other criteria for optimization in addition to optimizing over is giving preference to subsets of attributes that allow better balancing of treatment and control units in the matched group. That balancing factor depends only on the number of units matched at each iteration by dropping and the remaining unmatched units.
A matched quality quantity consisting of a trade off between these two measures ensures that D-AEMR only processes sets that maintain good prediction quality while the matched groups contain enough units to yield good estimate of treatment effects.
Appendix C Omitted Proofs
c.1 Proof of Proposition 4
Proposition 4 If for a superset of a newly processed set where and , all subsets of of size have been processed (i.e. is eligible to be active after is processed), then is included in the set returned by GenerateNewActiveSets.
Suppose all subsets of of size are already processed and belong to . Let be the covariate in . Clearly, would appear in , since at least one subset of of size would contain , and . Further all covariates in , including and those in will have support at least in . To see this, note that there are subsets of of size , and each covariate in appears in exactly of them. Hence , the ‘if’ condition to check minimum support for all covariates in is satisfied, and the final ‘if’ condition to eliminate false positives is satisfied too. Therefore will be included in returned by the procedure. ∎
c.2 Proof of Theorem 4
Theorem 4(Correctness) The algorithm D-AEMR solves the AEMR problem.
Consider any treatment unit . Let be the set of covariates in its main matched group returned in D-AEMR (the while loop in D-AEMR runs as long as there is a treated unit, and the GroupedMR returns main matched group for every unit when it is matched for the first time). Let be the indicator vector of (see (2)). Since the GroupedMR procedure returns a main matched group only if it is a valid matched group containing at least one treated and one control unit (see Algorithm 1), and since all units in the matched group on have the same value of covariates in , there exists a unit with and .
Hence it remains to show that the covariate set in the main matched group for corresponds to the maximum weight . Assume the contradiction that there exists another covariate-set such that , there exists a unit with and , and gives the maximum weight over all such .
(i) cannot be a (strict) subset of , since D-AEMR ensures that all subsets are processed before a superset is processed to satisfy the downward closure property in Proposition 3.2.
(ii) cannot be a (strict) superset of , since it would violate the assumption that for non-negative weights.
(iii) Hence assume that and are incomparable (there exist covariates in both and ).
Suppose the active set was chosen in iteration . If was processed in an earlier iteration , since forms a valid matched group for ,
it would give the main matched group for violating the assumption. We argue that in this case must be active at the start of iteration , and will be chosen as the best covariate set in iteration , leading to a contradiction.
Note that we start with all singleton sets as active sets in in the D-AEMR algorithm. Consider any singleton subset (comprising a single covariate in ). Due to the downward closure property in Proposition 3.2, , hence . Hence all of these singleton subsets of will be processed in earlier iterations , and will belong to the set of processed covariate sets .
Repeating the above argument, consider any subset . It holds that . Hence all subsets of will be processed in earlier iterations starting with the singleton subsets of . In particular, all subsets of size will belong to . As soon as the last of those subsets is processed, the procedure GenerateNewActiveSets will include in the set of active sets in a previous iteration . Hence if is not processed in an earlier iteration, it must be active at the start of iteration , leading to a contradiction.
Hence for all treatment units , the covariate-set giving the maximum value of will be used to form the main matched group of , showing the correctness of the D-AEMR algorithm. ∎
Appendix D Effect of Noise
We use 15000 treated units, 15000 control units, 5 important covariates and 8 unimportant covariates. We change our generative process in the following way to add noise:
where the noise coefficient and denotes noise chosen as either , , below:
First, we run D-AEMR and Causal Forest on all noise models to study how noise affect each method in terms of CATE estimations. Then, we compare the CATE estimates of D-AEMR and Causal Forest on the noise model in which Causal Forest achieves its best results.
Results: As Figure 12 shows, D-AEMR outperforms Causal Forest on all noise levels constructed. While Causal Forest with small amount of noise (low noise coefficient and low noise variance) performs better than Causal Forest without noise (see Figure 15), it still produces worse estimates than D-AEMR . The relative ordering of the methods remains unchanged. Additionally, Figure 13 show that D-AEMR is robust to small amount of noise but a high noise coefficient () and/or high variance (sd ) can lead to poorer results. To add smoothing of the estimates for D-AEMR , one can either: (i) smooth after matching by doing regression afterwards, or (ii) increase the minimum number of units in each match (which is a parameter in our code) to induce smoothing.
Appendix E More runtime evaluation of D-Aemr
The running time of D-AEMR varying the number of units is given in Figure 16.
Appendix F Evolution of the match quality (MQ) in the BTC dataset
Figure 17 shows how the prediction error (PE) and balancing factor (BF) change over iterations for the BTC dataset with 43 covariates when we run D-AEMR. The prediction error suddenly drops after 30 iterations. This leads to a drop in the matching quality and can serve as a stopping criteria in practice.