1 Introduction
Estimating causal effects in an observational study is complicated by the lack of randomization in treatment assignment, which may lead to confounding bias. The standard approach is to weight observations such that the empirical distribution of observed covariates is similar between the treatment and control groups (see e.g., Lunceford and Davidian, 2004; Rubin, 2006; Ho et al., 2007; Stuart, 2010). Researchers then estimate the causal effects using the weighted sample while assuming the absence of unobserved confounders. Recently, a large number of weighting methods have been proposed to directly optimize covariate balance for causal effect estimation (e.g., Hainmueller, 2012; Imai and Ratkovic, 2014; Zubizarreta, 2015; Chan et al., 2016; Athey et al., 2018; Li et al., 2018; Wong and Chan, 2018; Zhao, 2019; Hazlett, 2020; Kallus, 2020; Ning et al., 2020; Tan, 2020).
This paper provides a new insight into this fast-growing literature on covariate balancing by demonstrating that the support vector machine (SVM), one of the most popular classification algorithms in the machine learning literature (Cortes and Vapnik, 1995; Schölkopf et al., 2002), can be used to balance covariates and estimate the average treatment effect under the standard unconfoundedness assumption. Specifically, we adapt the SVM classifier as a kernel-based weighting procedure that minimizes the maximum mean discrepancy between the treatment and control groups (Gretton et al., 2007a) while simultaneously maximizing the effective sample size. The resulting weights are bounded, leading to stable causal effect estimation. Importantly, because SVM has been extensively studied and widely used, we can exploit its well-known theoretical properties and highly optimized implementations.
All matching and weighting methods face the same tradeoff between effective sample size and covariate balance, with better balance typically coming at the cost of a smaller sample size. We show that SVM directly addresses this fundamental tradeoff. Specifically, the dual optimization problem for SVM computes a set of balancing weights as the dual coefficients while yielding the support vectors that comprise the largest balanced subset. In addition, the regularization parameter of SVM controls the tradeoff between sample size and covariate balance. This implies that the existing path algorithms (Hastie et al., 2004; Sentelle et al., 2016) can efficiently characterize the balance–sample size frontier (King et al., 2017). Since both sample size and covariate balance affect the statistical properties of causal estimates, we analyze how this tradeoff affects causal effect estimation.
In the causal inference literature, we are not the first to recognize the connection between SVM and covariate balancing. In an unpublished working paper, Ratkovic (2014) notes that the hinge-loss function of SVM has a first-order condition that leads to balanced covariate sums amongst the support vectors. Instead, we show that the dual form of the SVM optimization problem leads to covariate mean balance. In addition, Ghosh (2018) notes the relationship between the SVM margin and the region of covariate overlap. The author argues that the support vectors correspond to observations lying in the intersection of the convex hulls of the treated and control samples (King and Zeng, 2006). In contrast, we show that SVM can be used to obtain weights for causal effect estimation. Furthermore, neither of these two previous works studies the relationship between the regularization parameter of SVM and the fundamental tradeoff between covariate balance and effective sample size.

The proposed methodology is also related to several other covariate balancing methods. First, we establish that SVM can be seen as a continuous relaxation of the quadratic integer program for computing the largest balanced subset. Indeed, SVM approximates an optimization problem closely related to cardinality matching (Zubizarreta et al., 2014). Second, SVM is a kernel-based covariate balancing method. Several researchers have recently developed weighting methods that balance functions in a reproducing kernel Hilbert space (RKHS) (Wong and Chan, 2018; Hazlett, 2020; Kallus, 2020). SVM shares the advantage of these methods in that it can balance a general class of functions and easily accommodate nonlinearity and nonadditivity in the conditional expectation functions for the outcomes. In particular, we show that SVM fits into the kernel optimal matching framework (Kallus, 2020). Unlike these covariate balancing methods, however, we can exploit the existing path algorithms of SVM to compute the set of solutions over the entire regularization path with complexity comparable to computing a single solution (Hastie et al., 2004; Sentelle et al., 2016). This allows us to efficiently characterize the tradeoff between covariate balance and effective sample size.
The rest of the paper is structured as follows. In Section 2, we present our methodological results. In Section 3, we conduct simulation studies to compare the performance of SVM with that of the aforementioned related covariate balancing methods. Lastly, in Section 4, we apply SVM to the data from the right heart catheterization observational study (Connors et al., 1996).
2 Methodology
In this section, we establish several properties of SVM as a covariate balancing method. We first show that the SVM dual can be viewed as a regularized optimization problem that minimizes the maximum mean discrepancy (MMD). We then compare SVM to cardinality matching and show how the regularization path algorithm for SVM can be viewed as a balance–sample size frontier. Lastly, we discuss how to use SVM for causal effect estimation and compare SVM to existing kernel balancing methods.
2.1 Setup and Assumptions
Suppose that we observe a simple random sample of $n$ units from a superpopulation of interest. Denote the observed data by $\{(X_i, T_i, Y_i)\}_{i=1}^n$, where $X_i$ represents a $p$-dimensional vector of covariates, $Y_i$ is the outcome variable, and $T_i$ is a binary treatment assignment variable that is equal to 1 if unit $i$ is treated and 0 otherwise. We define the index sets for the treatment and control groups as $\mathcal{T} = \{i : T_i = 1\}$ and $\mathcal{C} = \{i : T_i = 0\}$ with the group sizes equal to $n_1 = |\mathcal{T}|$ and $n_0 = |\mathcal{C}|$, respectively. Finally, we define the observed outcome as $Y_i = T_i Y_i(1) + (1 - T_i) Y_i(0)$, where $Y_i(1)$ and $Y_i(0)$ are the potential outcomes under the treatment and control conditions, respectively. This notation implies the Stable Unit Treatment Value Assumption (SUTVA) — no interference between units and the same version of the treatment (Rubin, 1990). Furthermore, we make the following standard identification assumptions, which are maintained throughout this paper.
Assumption 1 (Unconfoundedness)
The potential outcomes are independent of the treatment assignment conditional on the covariates $X_i$. That is, for all $i$, we have $\{Y_i(1), Y_i(0)\} \perp\!\!\!\perp T_i \mid X_i$.
Assumption 2 (Overlap)
For all $x$, the propensity score $\pi(x) = \Pr(T_i = 1 \mid X_i = x)$ is bounded away from 0 and 1, i.e., $0 < \pi(x) < 1$.
To consider causal inference with SVM, it is convenient to define the following transformed treatment variable, which is equal to either $-1$ or $1$, following the convention of classification methods,

(1) $Z_i \;=\; 2T_i - 1.$
In addition, for $t \in \{0, 1\}$, we define the conditional expectation functions, disturbances, and conditional variance functions as $\mu_t(x) = \mathbb{E}[Y_i(t) \mid X_i = x]$, $\epsilon_i(t) = Y_i(t) - \mu_t(X_i)$, and $\sigma_t^2(x) = \mathbb{V}[Y_i(t) \mid X_i = x]$, respectively. Note that by construction, we have $\mathbb{E}[\epsilon_i(t) \mid X_i] = 0$ for $t \in \{0, 1\}$. Lastly, let $\mathcal{H}$ denote a reproducing kernel Hilbert space (RKHS), with norm $\|\cdot\|_{\mathcal{H}}$ and kernel $k(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{H}}$, where $\phi$ is a feature mapping of the covariates to the RKHS.
2.2 Support Vector Machines
Support vector machines (SVMs) are a widely-used methodology for two-class classification problems (Cortes and Vapnik, 1995; Schölkopf et al., 2002). SVM aims to compute a separating hyperplane of the form

(2) $\{x : \beta^\top \phi(x) + \beta_0 = 0\},$

where $\beta$ is the normal vector of the hyperplane and $\beta_0$ is the offset. In this paper, we use SVM for the classification of treatment status. In the case of non-separable data, $\beta$ and $\beta_0$ are computed according to the soft-margin SVM problem, which is formulated as
(3) $\min_{\beta, \beta_0, \xi} \; \dfrac{\lambda}{2} \|\beta\|_{\mathcal{H}}^2 + \sum_{i=1}^n \xi_i$
s.t. $Z_i \{\beta^\top \phi(X_i) + \beta_0\} \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \ldots, n,$
where $\xi_i$ are the so-called slack variables, and $\lambda$ represents a regularization parameter controlling the tradeoff between the margin width and margin violations of the hyperplane. Note that $\lambda$ is related to the traditional SVM cost parameter $C$ via the equality $\lambda = 1/C$.
Defining the matrix $Q$ with elements $Q_{ij} = Z_i Z_j k(X_i, X_j)$ and the vector $z$ with elements $Z_i$, this problem has a corresponding dual form given by

(4) $\min_{\alpha} \; \dfrac{1}{2} \alpha^\top Q \alpha - \lambda \mathbf{1}^\top \alpha$
s.t. $z^\top \alpha = 0, \quad \mathbf{0} \leq \alpha \leq \mathbf{1},$
where $\mathbf{1}$ represents a vector of ones and $\leq$ denotes an elementwise inequality.
We begin by providing an intuitive explanation of how SVM can be viewed as a covariate balancing procedure. First, note that the quadratic term in the above dual objective function can be written as a weighted measure of covariate discrepancy between the treatment and control groups,
(5) $\alpha^\top Q \alpha \;=\; \Big\| \sum_{i \in \mathcal{T}} \alpha_i \phi(X_i) - \sum_{i \in \mathcal{C}} \alpha_i \phi(X_i) \Big\|_{\mathcal{H}}^2,$

while the constraint $z^\top \alpha = 0$ ensures that the sum of weights is identical between the treatment and control groups, $\sum_{i \in \mathcal{T}} \alpha_i = \sum_{i \in \mathcal{C}} \alpha_i$.
Lastly, the second term in the objective, $-\lambda \mathbf{1}^\top \alpha$, is proportional to the sum of weights for each treatment group, since the above constraint implies $\mathbf{1}^\top \alpha = 2\sum_{i \in \mathcal{T}} \alpha_i = 2\sum_{i \in \mathcal{C}} \alpha_i$. Thus, SVM simultaneously minimizes the covariate discrepancy and maximizes the effective sample size, which in turn leads to the minimization of the weighted difference-in-means in the transformed covariate space. It is also important to note that, unlike some other balancing methods, the weights are bounded, as represented by the constraint $0 \leq \alpha_i \leq 1$ for all $i$, leading to stable causal effect estimation (Tan, 2010).
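The weights described above can be extracted from any off-the-shelf SVM solver. As a minimal illustration (not the paper's implementation), the sketch below fits scikit-learn's `SVC`, which uses the traditional cost parameterization rather than the regularization parameter used here, and which stores $Z_i \alpha_i$ for the support vectors in `dual_coef_`; the synthetic data and all variable names are hypothetical.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical observational data: 3 covariates, confounded treatment.
n = 200
X = rng.normal(size=(n, 3))
T = (rng.random(n) < 1 / (1 + np.exp(-X[:, 0]))).astype(int)
Z = 2 * T - 1  # transformed treatment labels in {-1, +1}

# Linear soft-margin SVM; sklearn's cost parameter C plays the role of
# the reciprocal of the regularization parameter.
svm = SVC(kernel="linear", C=1.0).fit(X, Z)

# dual_coef_ stores Z_i * alpha_i for the support vectors; recover alpha_i >= 0.
alpha = np.zeros(n)
alpha[svm.support_] = np.abs(svm.dual_coef_.ravel())

# The dual constraint forces the treated and control weight sums to agree,
# so alpha acts as a set of balancing weights.
print(alpha[T == 1].sum(), alpha[T == 0].sum())
```

With `C=1.0`, the recovered weights lie between 0 and `C`, matching the boundedness property discussed above up to rescaling.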
The choice of the kernel function and its corresponding feature map determines the type of covariate balance enforced by SVM, as shown in equation (5). In this paper, we focus on the linear, polynomial, and radial basis function (RBF) kernels. The linear kernel $k(x, x') = x^\top x'$ corresponds to the feature map $\phi(x) = x$, and hence the quadratic term measures the discrepancy in the original covariates. The general form of the degree-$d$ polynomial kernel with scale parameter $c$ is $k(x, x') = (c + x^\top x')^d$. For example, when $d = 2$, this kernel has a corresponding feature map consisting of a constant, the (rescaled) original covariates, their squares, and all pairwise interactions. Hence, the quadratic kernel leads to a discrepancy measure of the original covariates, their squares, and all pairwise interactions. In general, the degree-$d$ polynomial kernel leads to a feature map consisting of all powers of the original covariates and all interactions up to degree $d$. The final kernel considered in this paper is the RBF kernel with scale parameter $\gamma$: $k(x, x') = \exp(-\gamma \|x - x'\|^2)$. This kernel can be viewed as a generalization of the polynomial kernel in the limit $d \to \infty$.
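To make the feature-map discussion concrete, the following sketch verifies numerically that a degree-2 polynomial kernel with scale parameter $c$ agrees with the inner product of an explicit feature vector containing a constant, the rescaled covariates, their squares, and all pairwise interactions. The helper names are ours, not from any library.

```python
import numpy as np

# Degree-2 polynomial kernel with scale parameter c.
def poly2_kernel(x, y, c=1.0):
    return (c + x @ y) ** 2

# Explicit feature map: constant, sqrt(2c)*x_j, x_j^2, sqrt(2)*x_j*x_k (j<k).
def poly2_features(x, c=1.0):
    d = len(x)
    feats = [c]
    feats += list(np.sqrt(2 * c) * x)
    feats += list(x ** 2)
    feats += [np.sqrt(2) * x[j] * x[k] for j in range(d) for k in range(j + 1, d)]
    return np.array(feats)

x = np.array([1.0, 2.0, -0.5])
y = np.array([0.3, -1.0, 2.0])
print(poly2_kernel(x, y), poly2_features(x) @ poly2_features(y))  # equal values
```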
In addition, SVM sets the weights to zero for the units whose treatment status is easy to classify. To see this, let $f(x) = \beta^\top \phi(x) + \beta_0$ and note that the Karush–Kuhn–Tucker (KKT) conditions for soft-margin SVM lead to the following useful characterization of a solution:

(6) $Z_i f(X_i) > 1 \;\Longrightarrow\; \alpha_i = 0; \qquad Z_i f(X_i) = 1 \;\Longrightarrow\; \alpha_i \in [0, 1]; \qquad Z_i f(X_i) < 1 \;\Longrightarrow\; \alpha_i = 1.$

The set of units that satisfy $Z_i f(X_i) > 1$ represents those that are easy to classify and receive zero weight. The set of units that satisfy $Z_i f(X_i) = 1$ are referred to as marginal support vectors, whereas the units with $Z_i f(X_i) < 1$ are the non-marginal support vectors. Collectively, these last two sets correspond to the units that the optimal hyperplane has the most difficulty classifying.
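This KKT characterization can be checked empirically with a fitted classifier. The sketch below uses hypothetical noisy data and scikit-learn's cost parameterization (so the dual coefficients live in $[0, C]$ rather than $[0, 1]$): zero-weight units lie at or beyond the margin, while units at the upper bound lie at or inside it, up to the solver's numerical tolerance.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(6)
n, C = 200, 1.0

# Hypothetical noisy classification data for the treatment indicator.
X = rng.normal(size=(n, 2))
Z = np.where(X[:, 0] + rng.normal(0, 1.5, size=n) > 0, 1, -1)

svm = SVC(kernel="linear", C=C).fit(X, Z)
alpha = np.zeros(n)
alpha[svm.support_] = np.abs(svm.dual_coef_.ravel())
margin = Z * svm.decision_function(X)   # Z_i * f(X_i)

easy = margin[alpha < 1e-8]             # zero weight: easy to classify
nonmarginal = margin[alpha > C - 1e-8]  # weight at the upper bound
print(easy.min(), nonmarginal.max())    # approximately >= 1 and <= 1
```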
In sum, the SVM dual problem finds the bounded weights that minimize the covariate discrepancy between the treatment and control groups while simultaneously maximizing the effective sample size. The regularization parameter $\lambda$ controls which of these two components receives more emphasis, and SVM chooses the optimal weights such that easy-to-classify units receive zero weight. Our goal in the remainder of this section is to extend the above intuition and establish a more rigorous connection between SVM, covariate balancing, and causal effect estimation.
2.3 SVM as a Maximum Mean Discrepancy Minimizer
We now show that SVM minimizes the maximum mean discrepancy (MMD) of the covariate distributions between the treatment and control groups. The MMD is a commonly used measure of distance between probability distributions (Gretton et al., 2007a) that was recently proposed as a metric for balance assessment in causal inference (Zhu et al., 2018). Specifically, we show that the SVM dual problem given in equation (4) can be viewed as a regularized optimization problem for computing weights that minimize the MMD.

The MMD, which is also called the kernel distance, is a measure of distance between two probability distributions based on the difference in mean function values for functions in the unit ball of an RKHS (Gretton et al., 2007a). The MMD has found use in several statistical applications, such as hypothesis testing (Gretton et al., 2007b, 2012) and density estimation (Sriperumbudur, 2011). Given the unit-ball RKHS $\{f \in \mathcal{H} : \|f\|_{\mathcal{H}} \leq 1\}$ and two probability measures $P$ and $Q$, the MMD is defined as

(7) $\mathrm{MMD}(P, Q) \;=\; \sup_{\|f\|_{\mathcal{H}} \leq 1} \Big( \mathbb{E}_{X \sim P}[f(X)] - \mathbb{E}_{Y \sim Q}[f(Y)] \Big).$
An important property of the MMD is that when $k$ is a characteristic kernel (e.g., the Gaussian radial basis function kernel and the Laplace kernel), $\mathrm{MMD}(P, Q) = 0$ if and only if $P = Q$ (Sriperumbudur et al., 2010).
The computation of $\mathrm{MMD}(P, Q)$ requires knowledge of both $P$ and $Q$, which is typically unavailable. In practice, an estimate of the MMD using the empirical distributions $\hat{P}$ and $\hat{Q}$ can be computed as

(8) $\mathrm{MMD}(\hat{P}, \hat{Q}) \;=\; \Big\| \frac{1}{m} \sum_{i=1}^{m} \phi(X_i) - \frac{1}{m'} \sum_{j=1}^{m'} \phi(Y_j) \Big\|_{\mathcal{H}},$

where $m$ and $m'$ are the sizes of the samples drawn from $P$ and $Q$, respectively. The properties of this statistic are well studied (see, e.g., Sriperumbudur et al., 2012). In causal inference, the empirical MMD can be used to assess balance between the treated and control samples (Zhu et al., 2018). This is done by setting $\hat{P}$ and $\hat{Q}$ to the empirical covariate distributions of the treatment and control groups, respectively. Then, the quantity $\mathrm{MMD}(\hat{P}, \hat{Q})$ gives a measure of independence between the treatment assignment $T_i$ and the observed pretreatment covariates $X_i$.
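The empirical MMD is convenient to compute via the kernel trick, since its square expands into three Gram-matrix sums. Below is a minimal sketch with an RBF kernel; the bandwidth, sample sizes, and distributions are arbitrary choices for illustration, and the function returns the squared MMD rather than the MMD itself.

```python
import numpy as np

def rbf(A, B, gamma=0.5):
    # RBF Gram matrix between two samples.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2(X1, X0, kernel=rbf):
    # Squared empirical MMD via three Gram-matrix sums (kernel trick).
    m, n = len(X1), len(X0)
    return (kernel(X1, X1).sum() / m**2
            - 2 * kernel(X1, X0).sum() / (m * n)
            + kernel(X0, X0).sum() / n**2)

rng = np.random.default_rng(1)
same = mmd2(rng.normal(size=(300, 2)), rng.normal(size=(300, 2)))
diff = mmd2(rng.normal(size=(300, 2)), rng.normal(2.0, 1.0, size=(300, 2)))
print(same, diff)  # the shifted pair yields the larger discrepancy
```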
Equation (8) naturally suggests a weighting procedure that balances the covariate distributions between the treatment and control groups by minimizing the empirical MMD. We define a weighted variant of the empirical MMD as

(9) $\mathrm{MMD}_w(\hat{P}_w, \hat{Q}_w) \;=\; \Big\| \sum_{i \in \mathcal{T}} w_i \phi(X_i) - \sum_{i \in \mathcal{C}} w_i \phi(X_i) \Big\|_{\mathcal{H}},$

where $\hat{P}_w$ and $\hat{Q}_w$ denote the reweighted empirical distributions under weights $w$. The weights are restricted to the simplex set,

(10) $\mathcal{W} \;=\; \Big\{ w \in \mathbb{R}^n : \sum_{i \in \mathcal{T}} w_i = 1, \; \sum_{i \in \mathcal{C}} w_i = 1, \; w_i \geq 0 \Big\}.$
The optimization problem for finding the MMD-minimizing weights is therefore formulated as

(11) $\min_{w} \; \mathrm{MMD}_w(\hat{P}_w, \hat{Q}_w)^2$
s.t. $w \in \mathcal{W}.$
Note that computing weights according to this problem is generally not preferable due to the lack of regularization, which leads to overfitting and sparse weights, resulting in many discarded samples. The following theorem, which we prove in Appendix A.1, establishes that the SVM dual problem can be viewed as a regularized version of the optimization problem in equation (11).
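For small problems, the simplex-constrained problem in equation (11) can be attacked directly with a general-purpose optimizer. The sketch below writes the squared weighted MMD as a quadratic form $u^\top K u$ with $u_i = Z_i w_i$ and uses SciPy's SLSQP with the two equality constraints of the simplex; this is an illustrative stand-in on hypothetical data, not the SVM-based procedure the paper advocates.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)

# Hypothetical imbalanced sample: treated units shifted by +0.5.
n = 60
X = rng.normal(size=(n, 2))
T = np.repeat([1, 0], n // 2)
X[T == 1] += 0.5
Z = 2 * T - 1

# RBF Gram matrix; the squared weighted MMD equals u' K u with u_i = Z_i w_i.
K = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))

def wmmd2(w):
    u = Z * w
    return u @ K @ u

# Simplex constraints: each group's weights sum to one, all nonnegative.
cons = [{"type": "eq", "fun": lambda w, t=t: w[T == t].sum() - 1.0} for t in (0, 1)]
w0 = np.where(T == 1, 1.0 / (T == 1).sum(), 1.0 / (T == 0).sum())
res = minimize(wmmd2, w0, bounds=[(0, 1)] * n, constraints=cons)
print(wmmd2(w0), res.fun)  # optimized weights reduce the imbalance
```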
Theorem 1 (SVM Dual Problem as Regularized MMD Minimization)
Theorem 1 shows that SVM minimizes the MMD, with the regularization parameter $\lambda$ controlling the tradeoff between the covariate imbalance, measured by the MMD, and the effective sample size, measured by the sum of the support vector weights $\mathbf{1}^\top \alpha$. Thus, a larger support vector set may come with worse covariate balance between the treatment and control groups within that set.
2.4 SVM as a Relaxation of the Largest Balanced Subset Selection
SVM can also be seen as a continuous relaxation of the quadratic integer program (QIP) for computing the largest balanced subset. Consider the modified version of the optimization problem in equation (4), in which we replace the continuous constraint $\mathbf{0} \leq \alpha \leq \mathbf{1}$ with the binary integer constraint $\alpha \in \{0, 1\}^n$. Since the second term in the objective can be rewritten as a norm, this problem, which we refer to as SVM-QIP, is given by:

(12) $\min_{\alpha} \; \dfrac{1}{2} \alpha^\top Q \alpha - \lambda \|\alpha\|_1$
s.t. $z^\top \alpha = 0, \quad \alpha \in \{0, 1\}^n.$
Interpreting the variables $\alpha_i$ as indicators of whether or not unit $i$ is selected into the optimal subset, we see that the objective trades off subset balance in the projected features (first term) against subset size (second term). Here, balance is measured by a difference in sums. However, the constraint $z^\top \alpha = 0$ requires the optimal subset to contain an equal number of treated and control units, so balancing the feature sums also implies balancing the feature means.
Thus, the SVM dual in equation (4) can be viewed as a continuous relaxation of the largest balanced subset problem represented by SVM-QIP in equation (12), with the set of support vectors comprising an approximation to the largest balanced subset, as these are the units for which $\alpha_i > 0$. The quality of the approximation is difficult to characterize in general since the differences between the two computed subsets are influenced by a number of factors, including the separability of the data, the value of $\lambda$, and the choice of kernel. In our own experiments, we observe significant overlap between the two solutions, suggesting that SVM uses non-integer weights to augment the SVM-QIP solution without compromising balance in the selected subset. We compare the differences between the two methods in Section 3.2.
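On a toy dataset, the integer program in equation (12) can be solved exactly by enumeration, which makes the relaxation argument easy to probe. The brute-force sketch below (linear kernel, arbitrary regularization value; hypothetical data, not a practical solver) searches all binary weight vectors satisfying the equal-group-size constraint.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)

# Tiny hypothetical dataset: 5 treated and 5 control units.
n = 10
X = rng.normal(size=(n, 2))
T = np.repeat([1, 0], n // 2)
Z = 2 * T - 1
Q = np.outer(Z, Z) * (X @ X.T)  # linear kernel: Q_ij = Z_i Z_j <X_i, X_j>
lam = 0.5                       # arbitrary regularization value

# Enumerate alpha in {0,1}^n with equal treated and control counts and
# minimize (1/2) alpha' Q alpha - lam * ||alpha||_1, as in SVM-QIP.
best_val, best_alpha = np.inf, None
for bits in product([0, 1], repeat=n):
    a = np.array(bits, dtype=float)
    if Z @ a != 0:
        continue
    val = 0.5 * a @ Q @ a - lam * a.sum()
    if val < best_val:
        best_val, best_alpha = val, a
print(best_val, best_alpha)
```

Since the empty subset is always feasible, the optimal objective is never positive; the selected units form the largest subset whose balance is worth the penalty.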
2.5 Relation to Cardinality Matching
Closely related to the SVM-QIP formulation in equation (12) is cardinality matching (Zubizarreta et al., 2014), which maximizes the number of matches subject to a set of covariate balance constraints. The optimization problem for cardinality matching is given by,

(13) $\max_{m} \; \sum_{i \in \mathcal{T}} \sum_{j \in \mathcal{C}} m_{ij}$
s.t. $\Big| \sum_{i \in \mathcal{T}} \sum_{j \in \mathcal{C}} m_{ij} \{ B_k(X_i) - B_k(X_j) \} \Big| \leq \delta_k \sum_{i \in \mathcal{T}} \sum_{j \in \mathcal{C}} m_{ij}, \quad k = 1, \ldots, K,$
$\sum_{j \in \mathcal{C}} m_{ij} \leq 1, \quad \sum_{i \in \mathcal{T}} m_{ij} \leq 1, \quad m_{ij} \in \{0, 1\},$

where $m_{ij}$ are selection variables indicating whether treated unit $i$ is matched to control unit $j$, $B_k(\cdot)$ is an arbitrary function of the covariates specifying each of the $K$ balance conditions, and $\delta_k$ is a tolerance selected by the researcher. Common choices for $B_k$ are the first- and second-order moments, and $\delta_k$ is typically set to a scalar multiple of the corresponding standardized difference-in-means.

To establish the connection between SVM-QIP and cardinality matching, we first note that cardinality matching need not be formulated as a matched-pair optimization problem. In fact, the balance constraints between pairs, as formulated in equation (13), are equivalent to those between the treatment and control groups in the selected subsample. Similarly, the one-to-one matching constraints are equivalent to constraining the number of treated and control units in the selected subsample to be equal. Therefore, defining the indicator variable $\alpha_i \in \{0, 1\}$ for selection into the optimal subset, we can rewrite the objective of cardinality matching as,
(14) $\max_{\alpha} \; \mathbf{1}^\top \alpha$
s.t. $\Big| \sum_{i \in \mathcal{T}} \alpha_i B_k(X_i) - \sum_{i \in \mathcal{C}} \alpha_i B_k(X_i) \Big| \leq \delta_k', \quad k = 1, \ldots, K, \qquad z^\top \alpha = 0, \quad \alpha \in \{0, 1\}^n.$
The difference between cardinality matching and SVM-QIP lies in the way balance is enforced in the optimal subset. Cardinality matching imposes covariate-specific balance by bounding each dimension's difference-in-means, while SVM-QIP imposes aggregate balance by penalizing the normed difference-in-means. The preference between these two measures of balance depends on the dimensionality of the covariates and a priori knowledge about the confounding mechanisms. If we suspect certain covariates to be confounders, then bounding those specific dimensions is more reasonable. However, if no such information is available and the covariate space is high-dimensional, then restricting the overall balance may be preferable. We empirically examine the relative performance of these two methods in Section 4.
We emphasize that using the RKHS norm to measure balance is also computationally attractive since it avoids the direct calculation of the balance conditions through the so-called “kernel trick.” Furthermore, as we discuss below, SVM can be used to approximate solutions to SVM-QIP with high accuracy at a much lower computational cost, which allows us to approximate the regularization path for SVM-QIP faster than a single solution for cardinality matching can be computed.
2.6 Regularization Path as a Balance–Sample Size Frontier
Another important advantage of using SVM to perform covariate balancing is the existence of path algorithms, which can efficiently compute the set of solutions to equation (4) over different values of $\lambda$. Since Theorem 1 established that $\lambda$ controls the tradeoff between the MMD and a heuristic measure of subset size, the path algorithm for SVM can be viewed as the weighting analog of the balance–sample size frontier (King et al., 2017). Below, we briefly discuss the algorithm for computing the SVM regularization path and describe how the path can be interpreted as a balance–sample size frontier.

The algorithm.
Path algorithms for SVM were first proposed by Hastie et al. (2004), who showed that the weights and the scaled intercept are piecewise linear in $\lambda$ and presented an algorithm for computing the entire path of solutions at a computational cost comparable to finding a single solution. However, their algorithm is prone to numerical problems and can fail in the presence of singular submatrices of $Q$. Recent work on SVM path algorithms has focused on resolving these issues with singularities. In our analysis, we use the algorithm presented in Sentelle et al. (2016), which is briefly described in Appendix A.2.
Initial solution.
The initial solution of the SVM regularization path corresponds to the solution at $\lambda_{\max}$ such that for any $\lambda \geq \lambda_{\max}$, the minimizing weight vector does not change. We assume without loss of generality that $n_1 \leq n_0$. Then initially, $\alpha_i = 1$ for all $i \in \mathcal{T}$, and the remaining weights are computed according to

(15) $\min_{\{\alpha_i\}_{i \in \mathcal{C}}} \; \Big\| \sum_{i \in \mathcal{T}} \phi(X_i) - \sum_{i \in \mathcal{C}} \alpha_i \phi(X_i) \Big\|_{\mathcal{H}}^2 \quad \text{s.t.} \quad \sum_{i \in \mathcal{C}} \alpha_i = n_1, \; 0 \leq \alpha_i \leq 1.$

Thus, the initial solution computes control weights such that the weighted empirical MMD is minimized while fixing the renormalized weights for the treated units. Note that this solution also corresponds to the largest subset, as measured by $\mathbf{1}^\top \alpha$, amongst all solutions on the regularization path.
Terminal solution.
The regularization path completes when the resulting solution has no non-marginal support vectors or, in the case of non-separable data, when $\lambda$ reaches zero. In practice, however, path algorithms run into numerical issues when $\lambda$ is small, so we terminate the path at a small positive value of $\lambda$, which appears to work well in our experiments. This value is often greater than the value of $\lambda$ corresponding to the MMD-minimizing solution defined in Theorem 1. In practice, however, we find the differences in balance between these two solutions to be negligible.
Summarizing the regularization path, we see that the initial solution at $\lambda_{\max}$ has the largest weight sum and can be viewed as the largest balanced subset retaining all observations in the minority class. As we move along the path, the SVM dual problem imposes greater restrictions on balance in the subset, which leads to smaller subsets, until we reach the terminal value of $\lambda$, at which the weighted empirical MMD is smallest on the path.
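Absent a dedicated path algorithm, the qualitative behavior of the path can be mimicked by sweeping the cost parameter of an off-the-shelf SVM (here $C$, the reciprocal of $\lambda$) and tracking the rescaled weight sum, the heuristic effective sample size from Theorem 1. This sketch uses scikit-learn's `SVC` on hypothetical two-group data and recomputes each solution from scratch rather than following the piecewise-linear path.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)

# Two overlapping groups; Z in {-1, +1} plays the role of treatment labels.
n = 100
X = np.vstack([rng.normal(-1.0, 1.0, size=(n // 2, 2)),
               rng.normal(1.0, 1.0, size=(n // 2, 2))])
Z = np.repeat([-1, 1], n // 2)

# Sweep C = 1/lambda from small to large (lambda decreasing along the path)
# and record the rescaled weight sum, a heuristic effective sample size.
sums = []
for C in [0.01, 0.1, 1.0, 10.0, 100.0]:
    svm = SVC(kernel="linear", C=C).fit(X, Z)
    alpha = np.abs(svm.dual_coef_.ravel()) / C  # rescale weights to [0, 1]
    sums.append(alpha.sum())
print(sums)  # the weight sum shrinks as balance is bought with sample size
```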
2.7 Causal Effect Estimation
Theorem 1 establishes that the SVM dual problem can be viewed as a regularized optimization problem for computing balancing weights that minimize the MMD. Under the unconfoundedness assumption, therefore, the resulting weighted support vector set constitutes a subsample of the data that approximates randomization of the treatment assignment. However, achieving a high degree of balance often requires significant pruning of the original sample, especially in scenarios where the covariate distributions of the treated and control groups have limited overlap. In this section, we characterize this tradeoff between subset size and subset balance and discuss its impact on the bias of causal effect estimates.
Recent work by Kallus (2020) established that many existing matching and weighting methods in causal inference minimize the dual norm of the bias of a weighted estimator, a property called error dual norm minimizing. The author also proposes a new method, kernel optimal matching (KOM), which minimizes the dual norm of the bias when the conditional expectation functions are embedded in an RKHS. Below, we show that SVM also fits into the KOM framework.
We restrict our attention to the following weighted difference-in-means estimator,

(16) $\hat{\tau}_w \;=\; \sum_{i \in \mathcal{T}} w_i Y_i - \sum_{i \in \mathcal{C}} w_i Y_i,$

where $w \in \mathcal{W}$ is computed via the application of SVM to the data. Below, we derive the form of the conditional bias with respect to two estimands, the sample average treatment effect (SATE), $\frac{1}{n} \sum_{i=1}^n \{Y_i(1) - Y_i(0)\}$, and the sample average treatment effect for the treated (SATT), $\frac{1}{n_1} \sum_{i \in \mathcal{T}} \{Y_i(1) - Y_i(0)\}$. We then discuss how to compute this bias when the conditional expectation functions $\mu_1$ and $\mu_0$ are unknown.
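The estimator in equation (16) is straightforward to implement once the weights are in hand. A minimal sketch, using our own hypothetical helper with each group's weights renormalized to the simplex:

```python
import numpy as np

def weighted_dim(Y, T, w):
    # Weighted difference-in-means with each group's weights renormalized to one.
    w1 = w[T == 1] / w[T == 1].sum()
    w0 = w[T == 0] / w[T == 0].sum()
    return w1 @ Y[T == 1] - w0 @ Y[T == 0]

# With uniform weights this reduces to the raw difference-in-means.
Y = np.array([3.0, 5.0, 1.0, 2.0])
T = np.array([1, 1, 0, 0])
print(weighted_dim(Y, T, np.ones(4)))  # (3+5)/2 - (1+2)/2 = 2.5
```

In practice, the weight vector `w` would come from the SVM dual coefficients described above.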
As we show in Appendix A.3, under Assumptions 1 and 2, the conditional bias of the estimator above with respect to the SATE and the SATT is given by

(17) $\Big\{ \sum_{i \in \mathcal{T}} w_i \mu_0(X_i) - \sum_{i \in \mathcal{C}} w_i \mu_0(X_i) \Big\} + \Big\{ \sum_{i \in \mathcal{T}} w_i \tau(X_i) - \tau_S \Big\},$

where

(18) $\tau(x) \;=\; \mu_1(x) - \mu_0(x)$

is the conditional average treatment effect (CATE), and $\tau_S$ equals $\frac{1}{n} \sum_{i=1}^n \tau(X_i)$ for the SATE and $\frac{1}{n_1} \sum_{i \in \mathcal{T}} \tau(X_i)$ for the SATT. Fan et al. (2016) use a similar bias decomposition.
The first term in equation (17) represents the bias due to imbalance in the prognostic score $\mu_0$ (Hansen, 2008). Note that the estimation of this quantity is difficult since $\mu_0$ is typically unknown. Instead, we embed $\mu_0$ in a unit-ball RKHS, $\{\mu_0 \in \mathcal{H} : \|\mu_0\|_{\mathcal{H}} \leq 1\}$, and consider the $\mu_0$ that maximizes this bias term. As we show in Appendix A.4, minimizing this worst-case quantity leads to the following optimization problem, which is of the same form as the problem given in equation (11),

(19) $\min_{w} \; \sup_{\|\mu_0\|_{\mathcal{H}} \leq 1} \Big\{ \sum_{i \in \mathcal{T}} w_i \mu_0(X_i) - \sum_{i \in \mathcal{C}} w_i \mu_0(X_i) \Big\}$
s.t. $w \in \mathcal{W}.$
Thus, SVM can also be viewed as a regularization method for minimizing prognostic imbalance.
However, SVM does not address the second term of the conditional bias in equation (17). This term corresponds to the bias due to extrapolation from the weighted treatment group to the population of interest. If the SATT is the target quantity, this term represents the bias due to the difference between the weighted and unweighted CATE for the treatment group. Thus, the prognostic balance achieved by SVM may come at the expense of this CATE bias, which reflects the discrepancy between the weighted covariate distribution for the treatment group and the unweighted covariate distribution for the sample of interest. This is a direct consequence of the tradeoff between balance and effective sample size, which is controlled by the regularization parameter $\lambda$ as shown earlier.
To illustrate this point more clearly, suppose that the target estimand is the SATT and $n_1 \leq n_0$, a scenario often encountered in practice. In this case, the initial solution on the regularization path for SVM is given by $\alpha_i = 1$ for $i \in \mathcal{T}$. This implies that the renormalized simplex weights for the treated observations are given by $w_i = 1/n_1$ for $i \in \mathcal{T}$, and hence the second bias term in equation (17) vanishes. The remaining weights are then chosen such that the imbalance in the prognostic score is minimized. Since the treated group is unmodified in this setting, the imbalance between the unweighted treated and reweighted control covariate distributions will be at its largest value on the path, since balancing is more difficult in the unpruned data. As the penalty for imbalance increases, the SVM solution trims both treated and control units that are difficult to balance. This pruning improves balance but may increase the CATE bias.
2.8 Relation to Kernel Balancing Methods
A closely related set of methods recently proposed in the causal inference literature is kernel balancing (Hazlett, 2020; Kallus et al., 2018; Wong and Chan, 2018). We now discuss the relationship between SVM and existing kernel balancing methods. Consider the following alternative decomposition of the conditional bias,

(20) $\Big\{ \sum_{i \in \mathcal{T}} w_i \mu_1(X_i) - S_1 \Big\} - \Big\{ \sum_{i \in \mathcal{C}} w_i \mu_0(X_i) - S_0 \Big\},$

where $S_1$ and $S_0$ denote the target sample means of $\mu_1(X_i)$ and $\mu_0(X_i)$, respectively, for the population of interest. To minimize this bias, kernel balancing methods restrict $\mu_1$ and $\mu_0$ to an RKHS and minimize the largest bias over the pair $(\mu_1, \mu_0)$. This problem is given by,

(21) $\min_{w} \; \sup_{\mu_1, \mu_0 \in \mathcal{H}} \Big| \Big\{ \sum_{i \in \mathcal{T}} w_i \mu_1(X_i) - S_1 \Big\} - \Big\{ \sum_{i \in \mathcal{C}} w_i \mu_0(X_i) - S_0 \Big\} \Big|$
s.t. $w \in \mathcal{W}',$
where $\mathcal{W}'$ denotes the constraints on the weights. For example, Kallus et al. (2018) and Wong and Chan (2018) restrict the weights to slightly different constraint sets, each close to that given above.
As shown in Kallus et al. (2018), minimizing this worst-case conditional bias amounts to computing weights that balance the treatment and control covariate distributions with respect to the empirical distribution for the population of interest. Let $\hat{F}_w^1$ and $\hat{F}_w^0$ denote the weighted empirical covariate distributions for the treatment and control groups, respectively, and let $\hat{F}^*$ represent the empirical covariate distribution corresponding to the population of interest. Then, if each conditional expectation function is also restricted to the unit-ball RKHS (fixing the size of the ball is necessary since the bias scales linearly with $\|\mu_1\|_{\mathcal{H}}$ and $\|\mu_0\|_{\mathcal{H}}$), the optimization problem in equation (21) can be written in terms of the minimization of the empirical MMD statistics:

(22) $\min_{w} \; \mathrm{MMD}(\hat{F}_w^1, \hat{F}^*) + \mathrm{MMD}(\hat{F}_w^0, \hat{F}^*)$
s.t. $w \in \mathcal{W}'.$
The objective in equation (22) does not contain a measure of the distance between the weighted covariate distributions of the treatment and control groups. Instead, balance between these two distributions is indirectly encouraged by balancing each one individually with respect to the target distribution. This is in contrast with SVM, which directly balances the covariate distributions between the treatment and control groups.
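The contrast with equation (22) can be illustrated directly: instead of balancing the two groups against each other, each group is weighted toward the full-sample covariate distribution. The sketch below minimizes the sum of two squared MMD terms with SciPy under simplex constraints; the data, kernel bandwidth, and uniform target distribution are illustrative assumptions rather than any particular method's specification.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)

# Hypothetical sample with the treated group shifted away from the target.
n = 40
X = rng.normal(size=(n, 2))
T = np.repeat([1, 0], n // 2)
X[T == 1] += 0.7
K = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))

target = np.full(n, 1.0 / n)  # uniform weights on the full sample

def objective(w):
    # Sum of squared MMDs from each weighted group to the target sample,
    # each written as u' K u for a signed measure difference u.
    total = 0.0
    for t in (0, 1):
        u = np.where(T == t, w, 0.0) - target
        total += u @ K @ u
    return total

cons = [{"type": "eq", "fun": lambda w, t=t: w[T == t].sum() - 1.0} for t in (0, 1)]
w0 = np.where(T == 1, 1.0 / (T == 1).sum(), 1.0 / (T == 0).sum())
res = minimize(objective, w0, bounds=[(0, 1)] * n, constraints=cons)
print(objective(w0), res.fun)  # each group is pulled toward the target
```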
3 Simulations
In this section, we examine the performance of SVM in ATE estimation under two different simulation settings. We also examine the connection between SVM and the QIP for the largest balanced subset.
3.1 Setup
We consider two simulation setups used in previous studies. Simulation A comes from Lee et al. (2010), who use a slightly modified version of the simulations presented in Setoguchi et al. (2008). Specifically, we adopt the exact setup corresponding to their “scenario G,” which is briefly summarized here; we refer readers to the original article for the exact specification. For each simulated dataset, we generate 10 covariates from the standard normal distribution, with correlation introduced between four pairs of variables. Treatment assignment is generated from a propensity score model with a moderate amount of nonlinearity and nonadditivity in the covariates. The outcome model is specified to be linear in the observed covariates with a constant, additive treatment effect.

Simulation B comes from Wong and Chan (2018) and represents a more difficult scenario in which both the propensity score and outcome regression models are nonlinear functions of the observed covariates. For each simulated dataset, we generate a ten-dimensional random vector from the standard normal distribution, and the observed covariates are defined as nonlinear functions of these variables. Treatment assignment follows Model 1 of Wong and Chan (2018), and the outcome is specified as a nonlinear function of the covariates with additive noise. Note that the true PATE is known under this model.
3.2 Comparison between SVM and SVM-QIP
We begin by examining the connection between SVM and SVM-QIP by comparing solutions obtained using one simulated dataset of $n$ units generated according to Simulation A. Specifically, we first compute the SVM path using the path algorithm described in Section 2.6, obtaining a set of regularization parameter breakpoints. Next, we compute the SVM-QIP solution at each of these breakpoints using the Gurobi optimization software (Gurobi Optimization, 2020). We limit the solver to 5 minutes of runtime for each problem; finding the exact integer-valued solution for a given $\lambda$ requires a significant amount of time, but a good approximation can typically be found in a few seconds.
For both methods, we compute the objective function value at each of the breakpoints as well as the coverage of the SVM-QIP solution by the SVM solution. The latter represents the proportion of units in the largest balanced subset identified by SVM-QIP that receive nonzero SVM weights. Formally, the coverage is defined as the ratio $|\{i : \alpha_i^{\mathrm{QIP}} = 1 \text{ and } \alpha_i^{\mathrm{SVM}} > 0\}| \, / \, |\{i : \alpha_i^{\mathrm{QIP}} = 1\}|$.
In order to examine the effects of separability on the quality of the approximation, we perform the above analysis using three different types of features. Specifically, we use a linear kernel with the untransformed covariates (linear), a linear kernel with the degree-2 polynomial features formed by concatenating the original covariates with all two-way interactions and squared terms (polynomial), and the Gaussian RBF kernel with the scale parameter chosen according to the median heuristic (RBF). In all cases, we scale the input feature matrix such that the columns have mean zero and standard deviation one before performing the kernel computation.

[Figure 1: (a) Linear; (b) Polynomial; (c) RBF]
Figure 1 shows that the objective values for the SVM and SVM-QIP solutions are close when the penalty on balance is small, with divergence between the two methods occurring towards the end of the regularization path. In the linear case, the paths for the two methods are nearly identical, suggesting that the solutions of the two problems are essentially the same. The divergence in the polynomial and RBF settings is more pronounced due to the greater separability of the transformed covariate space, which is more difficult to balance without non-integer weights. When $\lambda$ is very small, we also find that SVM-QIP returns the trivial all-zero solution, indicating that the penalty on balance is too great. Lastly, the effects of approximating the SVM-QIP solution are reflected in the RBF setting, where upon close inspection the objective value appears to be somewhat noisy and non-monotonic.
[Figure 2: (a) Linear (b) Polynomial (c) RBF]
Interestingly, the coverage plots in Figure 2 show that even when the objective values of the two methods diverge, the SVM solution still predominantly covers the SVM-QIP solution. The regions with zero coverage in the polynomial and RBF settings correspond to instances where the balance penalty is so severe that no nontrivial solution can be found for SVM-QIP. This result illustrates that SVM approximates one-to-one matching by augmenting a well-balanced matched subsample with some noninteger weights, increasing the subset size while preserving the overall balance within the subsample.
3.3 Performance of SVM
Next, we evaluate the performance of SVM in estimating the ATE for Simulations A and B. For each scenario, we generate 1,000 simulated datasets. For each simulated dataset, we compute the ATE estimate over a fixed grid of regularization values chosen based on the simulation scenario and input feature. As described in Section 3.2, we use the linear, polynomial, and RBF-induced features, standardizing the covariate matrix before passing it to the kernel in all cases.
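The estimator evaluated at each grid point is the weighted difference-in-means; a minimal sketch (names are ours):

```python
import numpy as np

def weighted_dim(Y, T, w):
    """Weighted difference-in-means ATE estimate; the weights are
    normalized to sum to one within each treatment group."""
    Y, T, w = map(np.asarray, (Y, T, w))
    w1 = w[T == 1] / w[T == 1].sum()
    w0 = w[T == 0] / w[T == 0].sum()
    return float(w1 @ Y[T == 1] - w0 @ Y[T == 0])

Y = np.array([3.0, 1.0, 2.0, 0.0])
T = np.array([1, 1, 0, 0])
print(weighted_dim(Y, T, np.ones(4)))  # uniform weights: plain difference in means → 1.0
```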
[Figure 3: (a) Linear (b) Polynomial (c) RBF]
In Figure 3, we plot the distribution of ATE estimates over the Monte Carlo simulations against the regularization parameter. The results for Simulation A (top panel) show that the bias approaches zero as the penalty on balance increases (i.e., as the regularization parameter decreases). This behavior arises because, under the outcome model for Simulation A, the conditional bias of the estimate comes entirely from prognostic score imbalance. This imbalance is smallest when the MMD between the weighted groups, which is controlled by the regularization parameter in the SVM dual objective, is minimized under the linear setting. This quantity is also small under both the polynomial and RBF input features.
We also find that, under all three settings, there is relatively little change in the variance of the estimates along most of the path, suggesting that the variance added by trimming the sample is offset by the variance removed by correcting for heteroscedasticity. The exceptions occur at the beginning of the path in the linear case, where the reduction in bias also reduces the variance, and at the end of the RBF path, where the amount of trimming is so substantial relative to the balance gained that the variance increases.
For Simulation B (bottom panel), we also observe that the bias decreases as the penalty on balance increases. However, due to misspecification, nonlinearity, and the presence of treatment effect heterogeneity in the outcome model, the bias never decays to zero as shown in Section 2.7. We also find that the SVM with linear kernel can reduce bias as well as the other kernels, suggesting that SVM is robust to misspecification and nonlinearity in the outcome model. Similar to Simulation A, we also observe relatively small changes in the variance as the constraint on balance increases, except at the end of the RBF path where there is substantial sample pruning.
3.4 Comparison with Other Methods
Next, we compare the performance of SVM with that of other methods. Our results below show that the performance of SVM is comparable to that of related state-of-the-art covariate balancing methods available in the literature. In particular, we consider kernel optimal matching (KOM; Kallus et al., 2018), kernel covariate balancing (KCB; Wong and Chan, 2018), cardinality matching (CARD; Zubizarreta et al., 2014), and inverse propensity score weighting (IPW) based on logistic regression (GLM) and random forest (RFRST), both of which were used in the original simulation study by Lee et al. (2010). For SVM, we compute solutions at regularization values selected separately for Simulations A and B under the linear, polynomial, and RBF settings. These values are taken from the grid used in the simulation, chosen by visual inspection of the path plots in Figure 3 around where the estimate curve flattens out.

For KOM, we compute weights under the linear, polynomial, and RBF settings described earlier with the default settings for the provided code. For KCB, we compute weights using the RBF kernel and its default settings. While KCB allows for other kernel functions, it was originally designed for the RBF and Sobolev kernels, and we find its results to be poor with the linear and polynomial features. For CARD, we use thresholds proportional to the standardized difference-in-means for the linear and polynomial-induced features, respectively, and we set the search time for the algorithm to 3 minutes. For GLM and RFRST, we use the linear-induced features and the default algorithm settings described in Lee et al. (2010).
[Figure 4: (a) Simulation A (b) Simulation B]
Figure 4 plots the distributions of the effect estimates over 1,000 simulated datasets for both scenarios. Simulation A (left panel) shows comparable performance across all methods, with SVM and KOM having the best performance in terms of both bias and variance. In particular, SVM achieves near zero bias under all three input features. The results for KCB show that it performs slightly worse in comparison to the other kernel methods, with greater bias and variance under the RBF setting.
The results for CARD show nearly identical performance with SVM under the linear setting; however, results under the polynomial setting are notably worse. The reason comes from the choice of balance threshold, which was set to 0.1 times the standardized difference-in-means of the input feature matrix. Although decreasing the scalar below 0.1 would lead to a more balanced matching, we found that the algorithm was unable to consistently find a solution for all datasets with scalar multiples smaller than 0.1. This result highlights the main issue with defining balance dimension-by-dimension, which makes it difficult to enforce small overall balance without information on the underlying geometry of the data. Lastly, the propensity score methods show the worst performance. This is somewhat expected, as the true propensity score model is more complicated than the true outcome model under this simulation setting.
We note that further reduction in the variance of the SVM solution while preserving its bias is likely possible with a more principled method of choosing the regularization value for each simulated dataset. In general, a value that works well for one dataset may not work well for another, and a better approach would examine the estimates over the path and the balance-sample size curve for each dataset individually. Nevertheless, our heuristic procedure for selecting a solution produced high-quality results, demonstrating the strength of SVM as a balancing method.
The results for Simulation B (right panel) show somewhat more variable performance across methods. Amongst the kernel methods, we find that KOM has the best performance under the polynomial and RBF settings, achieving near zero bias in these scenarios, while SVM has the best performance under the linear setting. The discrepancy under the linear setting is due to misspecification, which leads to a poor regularization parameter choice and consequently poor balance and bias under the KOM procedure.
We also find that SVM is unable to drive the bias to zero, which is due to the treatment effect heterogeneity in the outcome model. As discussed in Section 2.7, SVM ignores the second term in the conditional bias decomposition (17), which is zero under a constant additive treatment effect in Simulation A but is nonzero in Simulation B. In contrast, KOM targets both bias terms in its formulation, which leads to greater bias reduction.
In comparison to the other kernel methods, we find that KCB has comparable bias to SVM but greater variance. For CARD, we observe results comparable to SVM under the linear setting, but again worse performance under the polynomial setting for the reasons mentioned above. Lastly, we find mixed results between the two propensity score methods. Logistic regression (GLM) has the worst performance, while random forest (RFRST) exhibits the second-best performance amongst all methods. This result is likely due to the simple structure of the true propensity score model, whose nonlinearity can only be accurately modeled by RFRST.
4 Empirical Application: Right Heart Catheterization Study
In this section, we apply the proposed methodology to the right heart catheterization (RHC) data set originally analyzed in Connors et al. (1996). This observational data set was used to study the effectiveness of right heart catheterization, a diagnostic procedure, for critically ill patients. The key result from the study was that after adjusting for a large number of pretreatment covariates, right heart catheterization appeared to reduce survival rates. This finding contradicts the existing medical perception that the procedure is beneficial.
4.1 Data and Methods
The data set consists of 5,735 patients, with 2,184 of them assigned to the treatment group and 3,551 assigned to the control group. For each patient, we observe the treatment status, which indicates whether or not he/she received catheterization within 24 hours of hospital admission. The outcome variable represents death within 30 days. Finally, the dataset contains a total of 72 pretreatment covariates that are thought to be related to the decision to perform right heart catheterization. These variables include background information about the patient, such as age, sex, and race, indicator variables for primary/secondary diseases and comorbidities, and various measurements from medical test results.
We compute the full SVM regularization paths under the linear, polynomial, and RBF settings described in Section 3.1. In forming the polynomial features, we exclude all trivial interactions (e.g., interactions between categories of the same categorical variable) and squares of binary-valued covariates. For comparison, we also compute the KOM weights under all three settings, the KCB weights under the RBF setting, and the CARD weights under the linear and polynomial settings with a threshold fixed to 0.1 times the standardized difference-in-means.
4.2 Results
[Figure 5: (a) Linear (b) Polynomial (c) RBF. ATE estimates for the RHC data over the SVM regularization path. The horizontal axis represents the normed difference-in-means in covariates within the weighted subset. The solid blue line denotes the average estimate, and the solid gray background denotes the pointwise 95% confidence intervals.]
Figure 5 plots the ATE estimates over the SVM regularization paths with pointwise 95% confidence intervals based on the weighted Neyman variance estimator (Imbens and Rubin, 2015, Chapter 19). The horizontal axis represents the normed difference-in-means within the weighted subset as a covariate balance measure. For all three settings, we find that the estimated ATE slightly increases as the weighted subset becomes more balanced, supporting the result originally reported in Connors et al. (1996) that right heart catheterization decreased survival rates.
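For reference, a weighted analogue of the Neyman variance estimator can be sketched as follows. Whether this matches the exact form used in the paper (Imbens and Rubin, 2015, Chapter 19) is an assumption; here the within-group weighted residual variances are combined via squared normalized weights:

```python
import numpy as np

def weighted_neyman_variance(Y, T, w):
    """Weighted analogue of the Neyman variance: within-group weighted
    residual variances combined via squared normalized weights."""
    Y, T, w = map(np.asarray, (Y, T, w))
    var = 0.0
    for t in (0, 1):
        wt = w[T == t] / w[T == t].sum()
        mu = wt @ Y[T == t]
        var += ((wt ** 2) * (Y[T == t] - mu) ** 2).sum()
    return float(var)

Y = np.array([3.0, 1.0, 2.0, 0.0, 4.0, 2.0])
T = np.array([1, 1, 1, 0, 0, 0])
se = np.sqrt(weighted_neyman_variance(Y, T, np.ones(6)))
print(round(se, 3))  # → 1.054
```

With uniform weights this reduces to the usual Neyman variance with population (denominator-n) within-group variances.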
[Figure 6: (a) Linear (b) Polynomial (c) RBF]
Figure 6 illustrates the tradeoff between the balance measure (the normed difference-in-means in covariates within the weighted subset) and the effective subset size, i.e., the balance-sample size frontier. Such graphs can help researchers select a solution along the regularization path for estimating the ATE. Across all cases, most of the balance improvement is achieved once the data set is pruned to about 3,500 units, around which point the tradeoff between subset size and balance becomes less favorable.
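The balance measure on the horizontal axis of the frontier can be computed as the Euclidean norm of the gap between the weighted group means; a sketch (our own implementation):

```python
import numpy as np

def normed_dim(X, T, w):
    """Normed difference-in-means: Euclidean norm of the gap between
    the weighted covariate means of the treated and control groups."""
    w1 = w[T == 1] / w[T == 1].sum()
    w0 = w[T == 0] / w[T == 0].sum()
    return float(np.linalg.norm(w1 @ X[T == 1] - w0 @ X[T == 0]))

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
T = np.array([1, 1, 0, 0])
print(normed_dim(X, T, np.ones(4)))  # both group means are (0.5, 0.5) → 0.0
```

Evaluating this together with the effective sample size at each solution on the path traces out the frontier.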
We also examine differences in dimension-by-dimension balance between SVM and CARD and between SVM and KOM under the linear and polynomial settings. We do not conduct such a comparison for RBF, whose feature space is infinite dimensional. Here, we consider four different SVM solutions: the largest subset size solution whose standardized difference-in-means in covariates is below 0.1 for all dimensions, the solution whose effective sample size is nearest the subset size of the other method, the solution whose normed difference-in-means in covariates is closest to that of the other method, and the solution occurring at the kneedle estimate for the elbow of the balance-weight sum curve. We take the minimum-balance solution when no elbow exists, as in the linear case. The effective sample size is computed according to Kish's formula:
$$ n_{\mathrm{eff}} = \frac{\left( \sum_{i=1}^{n} w_i \right)^2}{\sum_{i=1}^{n} w_i^2}. \tag{23} $$
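Kish's effective sample size is a one-liner in code; for example:

```python
import numpy as np

def kish_ess(w):
    """Kish's effective sample size: (sum of weights)^2 / (sum of squared weights)."""
    w = np.asarray(w, dtype=float)
    return float(w.sum() ** 2 / (w ** 2).sum())

print(kish_ess(np.ones(100)))          # uniform weights → 100.0
print(kish_ess([1.0, 1.0, 0.0, 0.0]))  # half the units get zero weight → 2.0
```

Note that the measure is invariant to rescaling all weights by a constant.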
[Figure 7: (a) Small difference-in-means (b) Closest effective sample size (c) Closest normed difference-in-means (d) Elbow]
Figure 7 presents the covariate balance comparisons between SVM and CARD for both the linear and polynomial settings. Comparing against the small difference-in-means solution (leftmost column), for which the standardized difference-in-means of every covariate is below 0.1, CARD retains more observations in its selected subset, but SVM achieves better covariate balance than CARD in most dimensions, although some large imbalances remain. This is expected because SVM minimizes the overall covariate imbalance without a per-dimension constraint as in CARD. We observe a similar result when comparing CARD with the SVM solutions based on the closest effective sample size (left-middle) and the closest normed difference-in-means (right-middle). It is notable that the latter generally achieves better covariate balance while retaining more observations than CARD. Finally, the results for the elbow SVM solution (rightmost column) show that tight balance is attainable with a moderate amount of sample pruning, with near exact balance in the linear setting. This level of covariate balance is difficult to achieve with CARD due to the infeasibility of the optimization, particularly in high-dimensional settings.
[Figure 8: (a) Linear (b) Polynomial]
Figure 8 shows the dimension-by-dimension balance comparisons between the KOM and SVM solutions. Under the linear setting, KOM retains significantly more units than SVM while attaining the same balance. This is due to the equality constraint in the SVM dual, which encourages the selected subset to have roughly equal effective numbers of treated and control units, while the KOM solution allows the resulting subset to be more imbalanced between groups. Under the polynomial setting, however, SVM retains significantly more units than KOM while achieving a similar degree of covariate balance, which is likely due to a poor regularization parameter choice by the KOM algorithm.
Feature     Method               Estimate   Standard error   Effective sample size
Linear      SVM (most balanced)  0.0615     0.0148           3442
            SVM (elbow)          —          —                —
            SVM (initial)        0.0479     0.0132           4367
            CARD                 0.0335     0.0138           4174
            KOM                  0.0656     0.0147           4159
Polynomial  SVM (most balanced)  0.0634     0.0279           1087
            SVM (elbow)          0.0588     0.0154           3325
            SVM (initial)        0.0541     0.0135           4375
            CARD                 0.0313     0.0139           4084
            KOM                  0.0452     0.0251           1623
RBF         SVM (most balanced)  0.0518     0.0289           1123
            SVM (elbow)          0.0527     0.0148           3444
            SVM (initial)        0.0474     0.0132           4378
            KCB                  0.0337     0.0173           3306
            KOM                  0.0582     0.0166           4185
Lastly, we compare the point estimates of the ATE, the weighted Neyman standard error, and the effective sample size for SVM, CARD (with the linear and polynomial features), KOM, and KCB (with RBF) in Table 1. We consider three different solutions from the SVM path: the initial solution, for which the balance constraint is most relaxed; the most regularized solution, which has the best covariate balance on the path; and the solution occurring at the elbow of the balance-weight sum curves shown in Figure 6.
The results show that SVM leads to a positive estimate in all cases, which agrees with the original finding reported in Connors et al. (1996). We also find that the three SVM solutions differ most in their standard errors, which increase as the constraint on balance becomes stronger and the subset is more heavily pruned, as shown in the effective sample size column. In particular, the heavily balanced solution under the RBF setting leads to a 95% confidence interval that overlaps with zero. This is in contrast with the less balanced solution, which has both a larger effect estimate and a narrower confidence interval. This result demonstrates the value of computing the regularization path so that researchers may avoid low-quality solutions due to poor parameter choice.
Comparing against other methods, we observe that KOM yields greater estimates of positive effects with comparable standard errors in the linear and RBF settings. However, under the polynomial setting, the standard error is much larger and the sample is significantly more pruned than the modestly balanced solution. Both CARD and KCB produce smaller positive effect estimates, with the standard error for KCB leading to a 95% confidence interval which overlaps with zero.
5 Concluding Remarks
In this paper, we show how support vector machines (SVMs) can be used to compute covariate balancing weights and estimate causal effects. We establish a number of interpretations of SVM as a covariate balancing procedure. First, the SVM dual problem computes weights that minimize a tradeoff between the MMD and a measure of subset size while simultaneously maximizing the effective sample size. Second, the SVM dual problem can be viewed as a continuous relaxation of the largest balanced subset problem, which is closely related to cardinality matching. Lastly, similar to existing kernel balancing methods, SVM weights minimize the worst-case bias due to prognostic score imbalance. Additionally, path algorithms can be used to compute the entire set of SVM solutions as the regularization parameter varies, which constitutes a balance-sample size frontier. These methods provide researchers with a characterization of the balance-sample size tradeoff and allow for visualization of how causal effect estimates vary as the constraint on balance changes.
Our work in this paper suggests several possible directions for future research. On the algorithmic side, a disadvantage of the proposed methodology is that it encourages roughly equal effective numbers of treated and control units in the optimal subset, which can lead to unnecessary sample pruning. One could use weighted SVM (Lin and Wang, 2002) to address this problem, but existing path algorithms are applicable only to unweighted SVM. On the theoretical side, the results in this paper suggest a fundamental connection between the support vectors and the set of overlap, i.e., the region of the covariate space where the propensity score is strictly between zero and one. Steinwart (2004) shows that the fraction of support vectors for a variant of the SVM discussed here asymptotically approaches the probability measure of this overlap set, suggesting that SVM may be used to develop a statistical test for the overlap assumption.
References
 Athey et al. (2018) Athey, S., Imbens, G. W., and Wager, S. (2018). Approximate residual balancing: debiased inference of average treatment effects in high dimensions. Journal of the Royal Statistical Society, Series B, Methodological, 80(4), 597–623.
 Chan et al. (2016) Chan, K. C. G., Yam, S. C. P., and Zhang, Z. (2016). Globally efficient nonparametric inference of average treatment effects by empirical balancing calibration weighting. Journal of the Royal Statistical Society, Series B, Methodological, 78, 673–700.
 Connors et al. (1996) Connors, A. F., Speroff, T., Dawson, N. V., Thomas, C., Harrell, F. E., Wagner, D., Desbiens, N., Goldman, L., Wu, A. W., Califf, R. M., et al. (1996). The effectiveness of right heart catheterization in the initial care of critically ill patients. JAMA, 276(11), 889–897.
 Cortes and Vapnik (1995) Cortes, C. and Vapnik, V. (1995). Supportvector networks. Machine learning, 20(3), 273–297.
 Dinkelbach (1967) Dinkelbach, W. (1967). On nonlinear fractional programming. Management science, 13(7), 492–498.
 Fan et al. (2016) Fan, J., Imai, K., Liu, H., Ning, Y., and Yang, X. (2016). Improving covariate balancing propensity score: A doubly robust and efficient approach. Technical report, Princeton University.
 Ghosh (2018) Ghosh, D. (2018). Relaxed covariate overlap and marginbased causal effect estimation. Statistics in Medicine, 37(28), 4252–4265.
 Gretton et al. (2007a) Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., and Smola, A. J. (2007a). A kernel method for the twosampleproblem. In Advances in neural information processing systems, pages 513–520.
 Gretton et al. (2007b) Gretton, A., Fukumizu, K., Teo, C., Song, L., Schölkopf, B., and Smola, A. (2007b). A kernel statistical test of independence. Advances in neural information processing systems, 20, 585–592.
 Gretton et al. (2012) Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. (2012). A kernel twosample test. The Journal of Machine Learning Research, 13(1), 723–773.
 Gurobi Optimization (2020) Gurobi Optimization, L. (2020). Gurobi optimizer reference manual.
 Hainmueller (2012) Hainmueller, J. (2012). Entropy balancing for causal effects: A multivariate reweighting method to produce balanced samples in observational studies. Political analysis, pages 25–46.
 Hansen (2008) Hansen, B. B. (2008). The prognostic analogue of the propensity score. Biometrika, 95(2), 481–488.
 Hastie et al. (2004) Hastie, T., Rosset, S., Tibshirani, R., and Zhu, J. (2004). The entire regularization path for the support vector machine. Journal of Machine Learning Research, 5(Oct), 1391–1415.
 Hazlett (2020) Hazlett, C. (2020). Kernel balancing: A flexible nonparametric weighting procedure for estimating causal effects. Statistica Sinica, 30(3), 1155–1189.
 Ho et al. (2007) Ho, D. E., Imai, K., King, G., and Stuart, E. A. (2007). Matching as nonparametric preprocessing for reducing model dependence in parametric causal inference. Political Analysis, 15(3), 199–236.
 Imai and Ratkovic (2014) Imai, K. and Ratkovic, M. (2014). Covariate balancing propensity score. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 76(1), 243–263.
 Imbens and Rubin (2015) Imbens, G. W. and Rubin, D. B. (2015). Causal inference in statistics, social, and biomedical sciences. Cambridge University Press.
 Kallus (2020) Kallus, N. (2020). Generalized optimal matching methods for causal inference. Journal of Machine Learning Research, 21(62), 1–54.
 Kallus et al. (2018) Kallus, N., Pennicooke, B., and Santacatterina, M. (2018). More robust estimation of sample average treatment effects using kernel optimal matching in an observational study of spine surgical interventions. arXiv preprint arXiv:1811.04274.
 King and Zeng (2006) King, G. and Zeng, L. (2006). The dangers of extreme counterfactuals. Political Analysis, 14(2), 131–159.
 King et al. (2017) King, G., Lucas, C., and Nielsen, R. A. (2017). The balancesample size frontier in matching methods for causal inference. American Journal of Political Science, 61(2), 473–489.
 Lee et al. (2010) Lee, B. K., Lessler, J., and Stuart, E. A. (2010). Improving propensity score weighting using machine learning. Statistics in medicine, 29(3), 337–346.
 Li et al. (2018) Li, F., Morgan, K. L., and Zaslavsky, A. M. (2018). Balancing covariates via propensity score weighting. Journal of the American Statistical Association, 113(521), 390–400.

 Lin and Wang (2002) Lin, C.-F. and Wang, S.-D. (2002). Fuzzy support vector machines. IEEE Transactions on Neural Networks, 13(2), 464–471.
 Lunceford and Davidian (2004) Lunceford, J. K. and Davidian, M. (2004). Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Statistics in Medicine, 23(19), 2937–2960.
 Ning et al. (2020) Ning, Y., Peng, S., and Imai, K. (2020). Robust estimation of causal effects via high-dimensional covariate balancing propensity score. Biometrika, 107(3), 533–554.
 Ratkovic (2014) Ratkovic, M. (2014). Balancing within the margin: Causal effect estimation with support vector machines. Department of Politics, Princeton University, Princeton, NJ, page available at https://www.princeton.edu/~ratkovic/public/BinMatchSVM.pdf.

 Rubin (1990) Rubin, D. B. (1990). Comments on “On the application of probability theory to agricultural experiments. Essay on principles. Section 9” by J. Splawa-Neyman, translated from the Polish and edited by D. M. Dabrowska and T. P. Speed. Statistical Science, 5, 472–480.
 Rubin (2006) Rubin, D. B. (2006). Matched Sampling for Causal Effects. Cambridge University Press, Cambridge.
 Schaible (1976) Schaible, S. (1976). Fractional programming. II. On Dinkelbach's algorithm. Management Science, 22(8), 868–873.
 Schölkopf et al. (2002) Schölkopf, B., Smola, A. J., Bach, F., et al. (2002). Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press.
 Sentelle et al. (2016) Sentelle, C., Anagnostopoulos, G., and Georgiopoulos, M. (2016). A simple method for solving the SVM regularization path for semidefinite kernels. IEEE Transactions on Neural Networks and Learning Systems, 27(4), 709.
 Setoguchi et al. (2008) Setoguchi, S., Schneeweiss, S., Brookhart, M. A., Glynn, R. J., and Cook, E. F. (2008). Evaluating uses of data mining techniques in propensity score estimation: a simulation study. Pharmacoepidemiology and drug safety, 17(6), 546–555.
 Sriperumbudur (2011) Sriperumbudur, B. K. (2011). Mixture density estimation via hilbert space embedding of measures. In 2011 IEEE International Symposium on Information Theory Proceedings, pages 1027–1030. IEEE.
 Sriperumbudur et al. (2010) Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Schölkopf, B., and Lanckriet, G. R. (2010). Hilbert space embeddings and metrics on probability measures. The Journal of Machine Learning Research, 11, 1517–1561.
 Sriperumbudur et al. (2012) Sriperumbudur, B. K., Fukumizu, K., Gretton, A., Schölkopf, B., Lanckriet, G. R., et al. (2012). On the empirical estimation of integral probability metrics. Electronic Journal of Statistics, 6, 1550–1599.
 Steinwart (2004) Steinwart, I. (2004). Sparseness of support vector machines—some asymptotically sharp bounds. In Advances in Neural Information Processing Systems, pages 1069–1076.
 Stuart (2010) Stuart, E. A. (2010). Matching methods for causal inference: A review and a look forward. Statistical Science, 25(1), 1–21.
 Tan (2010) Tan, Z. (2010). Bounded, efficient and doubly robust estimation with inverse weighting. Biometrika, 97(3), 661–682.

 Tan (2020) Tan, Z. (2020). Regularized calibrated estimation of propensity scores with model misspecification and high-dimensional data. Biometrika, 107(1), 137–158.
 Wong and Chan (2018) Wong, R. K. and Chan, K. C. G. (2018). Kernel-based covariate functional balancing for observational studies. Biometrika, 105(1), 199–213.
 Zhao (2019) Zhao, Q. (2019). Covariate balancing propensity score by tailored loss functions. Annals of Statistics, 47(2), 965–993.
 Zhu et al. (2018) Zhu, Y., Savage, J. S., and Ghosh, D. (2018). A kernelbased metric for balance assessment. Journal of causal inference, 6(2).
 Zubizarreta (2015) Zubizarreta, J. R. (2015). Stable weights that balance covariates for estimation with incomplete outcome data. Journal of the American Statistical Association, 110(511), 910–922.
 Zubizarreta et al. (2014) Zubizarreta, J. R., Paredes, R. D., Rosenbaum, P. R., et al. (2014). Matching for balance, pairing for heterogeneity in an observational study of the effectiveness of forprofit and notforprofit high schools in chile. The Annals of Applied Statistics, 8(1), 204–231.
Appendix A Supplementary Appendix
a.1 Proof of Theorem 1
We prove this theorem by establishing several equivalent reformulations of the SVM dual problem. By equivalence, we mean that these problems differ from one another only in the scaling of the regularization parameter, implying that their regularization paths consist of the same set of solutions. More formally, given two optimization problems P1 and P2, we say that P1 and P2 are equivalent if a solution for P1 can be used to construct a solution for P2.
Denote the SVM weight set
and consider a rescaled version of the SVM dual given in equation (4), which we label as P1:
(P1)  
s.t. 
Note that a solution to the original problem defined in equation (4) is, after an appropriate rescaling of the regularization parameter, also a solution to the rescaled problem (P1), and vice versa. This establishes the equivalence between these two problems.
We begin by proving the following lemma, which allows us to replace the squared seminorm term with the seminorm itself to obtain the problem
(P2)  
s.t. 
Proof.
This result follows from the strong duality of the SVM dual problem, which allows us to replace the penalized term with a hard constraint at some threshold:
(P3)  
s.t.  
The solution to (P3) is unchanged if the squared seminorm in the constraint is replaced by the seminorm itself (with the threshold adjusted accordingly), so the problem
s.t.  
is identical to (P3). By strong duality, we can again replace the hard constraint with a penalized term with a new regularization parameter, which establishes the equivalence between (P1) and (P2). ∎
Next, we consider the fractional program
(P4)  
s.t. 
The following lemma connects (P4) to the reformulated SVM problem (P2) through Dinkelbach’s method (Dinkelbach, 1967; Schaible, 1976):
Lemma 2
Thus, for a suitable choice of the regularization parameter, the solution to the rescaled SVM dual problem (P2) minimizes the fractional program (P4). Finally, we consider the MMD minimization problem defined in equation (11),
(P5)  
s.t. 
The following lemma establishes equivalence between (P4) and (P5) under the proper renormalization of the fractional program solution.
Proof.
Let solutions to problems (P4) and (P5) be given, and consider the vector-valued map that normalizes the weights within the treated and control groups to each sum to one. First note that the (P5) solution is feasible for (P4). Then, by the optimality of the (P4) solution, we have
Next, note that the normalized (P4) solution is feasible for (P5). Then, by the optimality of the (P5) solution, we have
In order for both of these inequalities to be true, we must have
Note that the nontriviality assumption is only a formality since, by Lemma 2, the trivial solution can occur only for regularization values outside the portion of the regularization path that we consider. ∎
We are now ready to prove Theorem 1. Part (i): Lemma 1 establishes the equivalence between the regularization paths of the rescaled SVM dual (P1) and (P2). In addition, Lemma 2 establishes the existence of a regularization parameter value at which the solution to (P2) is also a solution to (P4). It follows that there exists a regularization parameter value at which the solution to the rescaled SVM dual problem minimizes (P4). Finally, recall that Lemma 3 establishes that the minimizing solution to (P4) is also a solution to the weighted MMD minimization problem. Therefore, there exists a regularization parameter value at which the solution to the SVM dual minimizes the weighted MMD. Part (ii): The proof follows from Schaible (1976, Lemma 3).
a.2 The Path Algorithm of Sentelle et al. (2016)
We briefly describe the path algorithm of Sentelle et al. (2016) used in this paper. The regularization path for SVM is characterized by a sequence of breakpoints, namely the values of the regularization parameter at which either a support vector on the margin exits the margin or a nonmarginal observation reaches the margin. Between these breakpoints, the coefficients of the marginal support vectors change linearly in the regularization parameter, while the coefficients of all other observations stay fixed. Since the KKT conditions must be met for any solution on the path, we can use a linear system of equations to compute how each dual coefficient and the intercept change with respect to the regularization parameter.
Based on this idea, beginning with an initial solution corresponding to some large initial value of the regularization parameter, the path algorithm first computes how the coefficients of the current marginal support vectors change with respect to the regularization parameter. Given this quantity, the next breakpoint in the path is found by decreasing the regularization parameter until a marginal support vector exits the margin, i.e., its coefficient reaches zero or its upper bound, or until a nonmarginal observation enters the margin. At this point, the marginal support vector set is updated, and the changes in the coefficients and intercept, as well as the next breakpoint, are computed. This procedure repeats until the terminal value of the regularization parameter is reached.
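Where an implementation of the exact path algorithm is unavailable, a crude approximation is to re-solve the SVM over a decreasing grid of regularization values and track the support and margin sets. The following sketch (not the Sentelle et al. algorithm; the toy data and names are ours) illustrates the idea with scikit-learn:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 2))
T = (X[:, 0] + rng.normal(size=150) > 0).astype(int)

# Re-solve the SVM over a decreasing grid of C values and record the
# support set and the marginal support vectors (0 < alpha_i < C).
path = []
for C in np.geomspace(10.0, 0.01, num=20):
    svm = SVC(kernel="linear", C=C).fit(X, T)
    alpha = np.abs(svm.dual_coef_.ravel())
    n_margin = int(((alpha > 1e-8) & (alpha < C - 1e-8)).sum())
    path.append((C, len(svm.support_), n_margin))

# Each grid point gives (C, #support vectors, #marginal support vectors);
# the exact algorithm instead locates the breakpoints where these sets change.
print(len(path))  # → 20
```

Unlike the exact algorithm, this grid approach can miss breakpoints that fall between grid values.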
a.3 Conditional Bias with Respect to SATE and SATT
In this section, we derive the conditional bias for the weighted difference-in-means estimator. Note that our derivation follows the one given in Kallus et al. (2018). Consider the problem of estimating the SATE and SATT, defined as
$$ \tau_{\mathrm{SATE}} = \frac{1}{n} \sum_{i=1}^{n} \{Y_i(1) - Y_i(0)\}, \qquad \tau_{\mathrm{SATT}} = \frac{1}{n_1} \sum_{i: T_i = 1} \{Y_i(1) - Y_i(0)\}, \tag{24} $$
respectively, where $n_1$ denotes the number of treated units. We denote the weighted estimator by $\hat{\tau}(w)$, which has the form
$$ \hat{\tau}(w) = \sum_{i: T_i = 1} w_i Y_i - \sum_{i: T_i = 0} w_i Y_i, \tag{25} $$
where the weights $w_i \ge 0$ sum to one within each treatment group. The conditional bias with respect to the SATE is given by
$$ \mathbb{E}\{\hat{\tau}(w) - \tau_{\mathrm{SATE}} \mid \mathbf{X}, \mathbf{T}\} = \left\{ \sum_{i: T_i = 1} w_i \mu_0(X_i) - \sum_{i: T_i = 0} w_i \mu_0(X_i) \right\} + \left\{ \sum_{i: T_i = 1} w_i \tau(X_i) - \frac{1}{n} \sum_{i=1}^{n} \tau(X_i) \right\}, $$
where $\mu_0(x) = \mathbb{E}\{Y(0) \mid X = x\}$ is the prognostic score and $\tau(x) = \mathbb{E}\{Y(1) - Y(0) \mid X = x\}$.
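The decomposition of the conditional bias into a prognostic-score imbalance term and a treatment-effect-heterogeneity term (cf. Section 2.7) can be verified numerically; the toy model and all names below are our own:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=8)
T = np.array([1, 1, 1, 1, 0, 0, 0, 0])

mu0 = lambda x: 2 * x          # prognostic score E[Y(0) | X = x]
tau = lambda x: 1 + 0.5 * x    # heterogeneous treatment effect
mu1 = lambda x: mu0(x) + tau(x)

w = np.full(8, 0.25)  # uniform weights, summing to one in each group

# Conditional bias of the weighted difference-in-means w.r.t. the SATE.
est = (w * mu1(X))[T == 1].sum() - (w * mu0(X))[T == 0].sum()
sate = (mu1(X) - mu0(X)).mean()
bias = est - sate

# Prognostic-score imbalance term plus heterogeneity term.
prog = (w * mu0(X))[T == 1].sum() - (w * mu0(X))[T == 0].sum()
het = (w * tau(X))[T == 1].sum() - tau(X).mean()
print(np.isclose(bias, prog + het))  # → True
```

The identity holds for any weights that sum to one within the treated group, and the heterogeneity term vanishes when the treatment effect is a constant.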