In this work, we reframe the problem of balanced treatment assignment as optimization of a two-sample test between treatment and control units. Using this lens we provide an assignment algorithm that is optimal with respect to the minimum spanning tree test of Friedman and Rafsky (1979). This assignment to treatment groups may be performed exactly in polynomial time. We provide a probabilistic interpretation of this process in terms of the most probable element of designs drawn from a determinantal point process. We provide a novel formulation of estimation as transductive inference and show how the tree structures used in design can also be used in an adjustment estimator. We conclude with a simulation study demonstrating the improved efficacy of our method.
Decision-making often requires engaging with counterfactual questions. For instance, determining whether to give a patient a medication depends on what their health outcomes would have been absent the medication. One of the most successful tools for answering these types of counterfactual questions has been experimentation. For a sample of patients, randomly give half of them the medication and half of them a placebo, and measure the average health outcomes for each of the two groups. This provides unbiased estimates of the typical response in the sample: the average treatment effect (ATE) (Imbens and Rubin, 2015). This does not address a doctor’s most fundamental concern, however: how would this patient respond to treatment, relative to their counterfactual health outcomes under placebo? To answer this question, it is necessary to consider individual treatment effects (ITE) (Shalit et al., 2017). While the literature has provided many improvements around the design of experiments to measure the former quantity (the ATE), in this paper, we analyze the problem of experimental design for estimation of the ITE.
This work is concerned with extending the capabilities of experimental design along two axes:
Experimental Design for Individual Treatment Effects. To our knowledge, this is the first work focused on design-based solutions to the estimation of ITEs.
Our primary contributions are:
Motivate the estimation of ITEs around transductive learning.
Show that the problem of good experimental design is closely related to a ubiquitous graph-cutting problem through a bias-variance decomposition of the design problem.
Reorient the problem of balance around a two-sample test between treatment and control covariate profiles.
Provide an efficient approximation to this problem based on maximum spanning trees, which optimizes a ubiquitous graph-based two-sample test and provides highly accurate estimates of ITEs.
The structure of this paper is as follows. Section 2 describes the problem of experimental design, and estimation of ITEs given a design. Section 3 provides an overview of existing work on experimental design. Section 4 presents the problem of ITE-optimizing experimental design, connects it to graph cutting and discusses existing approaches through this lens. Section 5 presents our proposed design, which optimizes a test of balance based on the minimum spanning tree. Section 6 presents simulation evidence demonstrating the strength of our proposed design.
We first give some background and notation before introducing the task of this work. Throughout we will consider three sets of variables: covariates $X$, treatment $A$, and outcome $Y$. We assume that $X$ is pretreatment, i.e. its values are not caused by $A$ or $Y$. We will also assume that $Y$ is given as some function of $X$, $A$, and mean-zero noise. Given a set of realizations of $A$, the potential outcomes (Rubin, 2011), $y(1)$ and $y(0)$, are the values of $Y$ that would have been observed had treatment been set to $1$ or $0$, respectively. For some mathematical statements it is more convenient to annotate treatment as being in $\{-1, 1\}$, and we will indicate this by the use of a vector $z$, wherein $-1$ notates control and $1$ indicates treatment. Causal effects are then, in turn, derived as contrasts between potential outcomes. In this paper, we will consider two causal estimands:
The Individual Treatment Effect (ITE) is the conditional effect of treatment, .
The Average Treatment Effect (ATE) is an estimate of the marginal effect of treatment from a finite sample. The ATE is easily expressed as an expectation of the ITE, .
Optimal experimental design, the central task of this paper, considers the following problem. Given the set of pre-treatment covariates $X$, how should treatment be assigned to each individual in order to obtain an unbiased estimate of a causal estimand with minimal variance? Optimal experimental design for estimating the ATE has been studied for decades (cf. Fisher (1935); Morgan et al. (2012); Hall et al. (1995); Kallus (2018); Higgins et al. (2016)). In the general setting, Kallus (2018) showed that complete randomization is minimax optimal. However, with additional assumptions placed on the potential outcomes, improvement can be made through careful allocation. One such assumption, which we will employ throughout the remainder of the paper, is that the potential outcomes are smooth functions with additive noise. More precisely, we introduce the following assumptions:
The pre-treatment covariates, $X$, belong to a metric space, with the corresponding metric denoted $d(\cdot, \cdot)$, and are drawn from some (possibly unknown) distribution with finite variance.
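Each of the potential outcomes is drawn from the following generative process,
$$ y_i(a) = f_a(x_i) + \epsilon_i, \qquad a \in \{0, 1\}, $$
where the noise $\epsilon_i$ is mean zero.
Each potential outcome function $f_a$, $a \in \{0, 1\}$, is Lipschitz continuous with a finite Lipschitz constant.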
There have been a plethora of design procedures that attempt to explicitly improve balance. These approaches fall into three primary camps:
Optimization (Kallus, 2018). An optimization procedure is used to find the best vector of assignments to treatment in order to minimize some measure of imbalance. This assignment may be deterministic.
These approaches can be difficult to scale to the necessary sample sizes for the online environment, as finding optimally balanced treatment assignments is an NP-hard problem.
The optimization objective most commonly employed for optimal experimental design is mean balance (c.f. Morgan et al. (2012); Kallus (2018)), i.e., minimizing the distance in means between the instances of that are allocated to treatment and control, respectively. This measure can be extended to incorporate higher order and non-linear dependencies by applying a feature transformation, , to the covariates. The resulting optimization problem is then given by
where the imbalance is measured by a distance function. Popular choices of distance are Euclidean (Hansen and Bowers, 2008) and Mahalanobis (Morgan et al., 2012) distance. Of particular interest to this work is the balance measure used by Kallus (2018), which considers the mean difference between treatment and control after projecting the covariates into a reproducing kernel Hilbert space (RKHS). The optimal experimental design under this measure corresponds to solving the following binary quadratic program, termed the pure strategy optimal design (PSOD) by its author,
where $K$ is the Gram matrix of the covariates with respect to an RKHS. Under smoothness assumptions on the potential outcome function, the solution to PSOD was shown to be Bayes optimal, with variance guarantees comparable to those provided by post-hoc regression adjustment.
While mean balance has intuitive and theoretical appeal, it also comes with significant computational disadvantages. Kallus (2018) shows that PSOD, which accommodates a large number of mean balance measures, is equivalent to solving the balanced number partition problem, which is known to be NP-hard. The implemented solution requires solving a semi-definite program, which makes the method impractical for even moderately large samples (hundreds to thousands of units).
Design for the average treatment effect has received considerable attention in the literature. Less studied, however, is design specifically targeting individual treatment effects. Recently, this quantity has gained substantial attention due to Athey and Imbens (2016) and Wager and Athey (2018), and the broader literature around individual treatment effect estimation (Shalit et al., 2017; Shi et al., 2019). (In the statistics and econometrics literature, this task is often referred to as “heterogeneous treatment effect” estimation.) The ITE is given by the difference in potential outcomes conditioned on the covariates. The central task considered within this paper is allocating treatment to estimate the ITE well (we will make this statement more formal shortly).
To motivate our design task, we begin with an estimator for the individual treatment effects. We will restrict ourselves to distance based regression functions,
Special cases of this general formulation are k-nearest neighbors regression as well as Nadaraya-Watson kernel regression. These estimators are non-parametric and fairly flexible. We focus on this estimator due to its analytical tractability in combination with its generally reasonable performance as a non-parametric estimator. As we will show in section 6, a design which is effective for this estimator will typically also be effective for other ITE estimators. Under the assumed model, the empirical estimate of the ITE can be written as
Equation 3 can be interpreted as two independent regressions inferring the two potential outcomes, where the predictions for observed potential outcomes are constrained to be equal to the observed outcome. In the individual treatment effect estimation literature, training an outcome model for each potential outcome surface is often referred to as a “T-learner” (Künzel et al., 2019), but due to our restriction that observed potential outcomes take their observed values, our approach is more similar to the “X-learner” of Künzel et al. (2019). This restriction is also often employed in the transductive learning setting, for example, by Zhu et al. (2003). Framed in terms of transductive inference, our task of ITE estimation is to impute the counterfactual for each unit, and this imputation of counterfactuals is the only way that error is introduced into our estimation problem.
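The transductive view described above can be sketched in a few lines. This is a minimal illustration rather than the paper's estimator: it assumes a Gaussian kernel as the similarity (a Nadaraya-Watson style regression), and the names `gaussian_kernel`, `impute_ite`, and `bandwidth` are hypothetical. Each unit keeps its observed outcome, and only its counterfactual is imputed from units in the opposite arm.

```python
import math

def gaussian_kernel(x, y, bandwidth=1.0):
    # Assumed similarity function; any distance-based weight could be used.
    return math.exp(-math.dist(x, y) ** 2 / (2 * bandwidth ** 2))

def impute_ite(points, treat, outcomes, bandwidth=1.0):
    """Impute each unit's counterfactual from the opposite arm and
    return the implied individual treatment effects."""
    ites = []
    for i, x in enumerate(points):
        # Only units that received the other treatment can inform the counterfactual.
        others = [j for j in range(len(points)) if treat[j] != treat[i]]
        weights = [gaussian_kernel(x, points[j], bandwidth) for j in others]
        total = sum(weights)
        counterfactual = sum(w * outcomes[j] for w, j in zip(weights, others)) / total
        if treat[i] == 1:
            ites.append(outcomes[i] - counterfactual)   # observed y(1) minus imputed y(0)
        else:
            ites.append(counterfactual - outcomes[i])   # imputed y(1) minus observed y(0)
    return ites

# Toy data with a constant treatment effect of 2 and no noise.
points = [(0.0,), (0.1,), (0.2,), (0.3,)]
treat = [1, 0, 1, 0]
outcomes = [1.0 + 2.0 * a for a in treat]
ites = impute_ite(points, treat, outcomes, bandwidth=0.5)
```

Because observed outcomes are never re-predicted, all estimation error in this sketch comes from the imputed counterfactuals, which is the property the design is meant to control.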
To our knowledge, this paper is the first to examine designing an experiment explicitly for the estimation of ITEs. We do so by viewing experimental design as an optimal graph cut problem. We discuss the details of the connection between graph cutting and experimental design for ITE estimation next.
A natural interpretation of the assignment problem is to view the observations of covariates as nodes in a graph, with treatment being a missingness indicator. Through this lens we see that the task of treatment assignment can be interpreted as minimizing the risk of two interrelated regression problems: predicting the control counterfactual for treated units using only control units, and predicting the treated counterfactual for control units using only the treated units. The resulting optimization problem is then given as
where we refer to similarity between points as and replace with a more explicit expression. Note that the choice of similarity function, as before, is a design choice made by the practitioner. As with most causal inference applications, the outcomes are unobserved which can make reasoning over design choices difficult a priori. However, after leveraging the Lipschitz assumption (assumption 3) the following proposition allows for a bound on the bias and variance of the regression function.
A proof is provided in the supplement for completeness. Proposition 1 provides an expression for the error which relies only on observable quantities, namely the distance between treatment and controls and the regression weights, and an assumption on the magnitude of noise. The optimization problem in equation 4 can then be recast as
with as in equation 4. This lens makes explicit the tradeoffs between bias and variance in the design. It should come as no surprise that the optimal design will be heavily reliant on the distribution of and the magnitude of the noise term, i.e. the size of . For example, on one extreme when is close to zero, then the best choice will be to concentrate all of the weight on the first nearest neighbor. As we discuss in section 4.3, this corresponds to a greedy design which two-colors a one-nearest neighbor graph. More generally, it is necessary to reason over trade-offs that are occurring with respect to the experimental design.
In this work, we propose to view these choices by recasting the problem of experimental design in terms of graph cutting. Specifically, we consider a graph, where the edge weights, are the similarity between and . After remapping treatment to via , the problem of treatment assignment can be recast as choosing an assignment. This view is quite natural, since the set of cut edges, i.e., edges where are those which are used to infer the counterfactuals in the nearest neighbor regression. The following proposition makes this more formal by relating the risk of the regression estimator to the Maxcut problem
Where , and .
The proof is provided in the supplement. The first term is an irreducible component which corresponds to the estimation error due to non-smoothness of the potential outcome function. It shows the error under an oracle scenario in which the unobserved potential outcome for a unit is estimated based on the potential outcomes for all other units. It further presumes this estimation is performed for every unit, which is not possible due to the fundamental problem of causal inference. This represents the Bayes risk of the estimation problem: the lower bound of the error incurred for this estimation. The second term is more interesting, as it describes the error due to the assignment process we choose. While and do not depend on the assignment (and therefore optimal design need not incorporate them), the remaining piece does. This term is the negative of the objective of the Maxcut graph-cutting problem, which we now describe in greater detail.
First, informally: maxcut divides the nodes of a graph into two disjoint and exhaustive subsets by removing (“cutting”) edges with the maximum edge-weights.
A common way to write this is through the use of the graph Laplacian:
where $z \in \{-1, 1\}^N$ records which of the two sets each node belongs to. The graph Laplacian is a matrix which represents the structure of the network, formed as the diagonal matrix of node degrees minus the weighted adjacency matrix, $L = D - W$. Maxcut is a canonical NP-hard problem and does not admit a polynomial-time approximation scheme unless P = NP; moreover, if the unique games conjecture is true, the approximation ratio of Goemans and Williamson (1995) cannot be improved upon in polynomial time (Khot et al., 2007). Common approximation algorithms include semidefinite programming (Goemans and Williamson, 1995) and spectral methods (Trevisan, 2012). The best known approximation ratio for this problem, in general, is obtained through semidefinite programming, with a ratio of approximately $0.878$. Given the difficulty of this problem, the bound in Proposition 2 cannot in general be minimized exactly in polynomial time (unless P = NP), so we will focus only on efficient algorithms for the computation of a design. Note that the kernel allocation procedure proposed by Kallus (2018) is isomorphic to the Maxcut problem. We provide a proof of this correspondence in the supplement.
Certain special cases, however, allow for efficient solutions to Maxcut. Among these are particular bipartite graphs such as forests and trees. These graphs, for instance, admit solutions to Maxcut in linear time.
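The linear-time special case can be made concrete with a small sketch. The code below is illustrative only (the function names are hypothetical): it two-colors a forest by breadth-first search so that every edge is cut, and then checks the standard identity that, for $z \in \{-1, 1\}^N$ and $L = D - W$, the cut weight equals $\frac{1}{4} z^\top L z$.

```python
from collections import deque

def two_color_forest(n_nodes, edges):
    """Assign +1/-1 to the nodes of a forest so that every edge is cut."""
    adj = [[] for _ in range(n_nodes)]
    for i, j, _weight in edges:
        adj[i].append(j)
        adj[j].append(i)
    z = [0] * n_nodes  # 0 marks "not yet visited"
    for root in range(n_nodes):
        if z[root] != 0:
            continue
        z[root] = 1
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if z[v] == 0:
                    z[v] = -z[u]  # neighbors always receive the opposite sign
                    queue.append(v)
    return z

def cut_weight(edges, z):
    """Total weight of edges whose endpoints lie in different sets."""
    return sum(w for i, j, w in edges if z[i] != z[j])

def laplacian_quadratic_form(n_nodes, edges, z):
    """Compute z^T L z with L = D - W built explicitly."""
    lap = [[0.0] * n_nodes for _ in range(n_nodes)]
    for i, j, w in edges:
        lap[i][i] += w
        lap[j][j] += w
        lap[i][j] -= w
        lap[j][i] -= w
    return sum(z[i] * lap[i][j] * z[j]
               for i in range(n_nodes) for j in range(n_nodes))

# A small weighted tree given as (node, node, weight) triples.
tree = [(0, 1, 2.0), (1, 2, 1.0), (1, 3, 0.5)]
z = two_color_forest(4, tree)
cut = cut_weight(tree, z)
quad = laplacian_quadratic_form(4, tree, z)
```

On a tree every edge can be cut simultaneously, so the cut weight equals the total edge weight and no search over assignments is needed.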
The results above can help shed light on common experimental designs, such as the matched pair design (Imai, 2008).
Kallus (2018) demonstrates that, when outcomes are Lipschitz, implementing this design by finding the max weight matching is optimal. This can be efficiently implemented using, e.g., Edmonds’ algorithm (Edmonds, 1967). This optimality result, however, restricts the set of designs to those which may be defined as a matching on the graph. In a graph matching, of course, no two edges may share an end-point. Our result demonstrates that a wider class of designs may be considered, opening the door to stronger assignment mechanisms.
In the observational literature on matching methods, there is a distinction drawn between greedy and so-called “optimal” matching (Stuart, 2010; Hansen and Klopfer, 2006). The distinction is that a greedy matching algorithm can “double dip,” using the same unit as the matched control for multiple treated units. The experimental design based on optimal matching is the Kallus (2018) matched pair design, but we can similarly form a greedy design by two-coloring the one-nearest neighbor graph. The decomposition of Proposition 1 gives us a ground on which to compare these designs. The greedy design ensures minimal pointwise bias for the ITE by minimizing the distance to a match. While it provides the minimum pointwise bias, its variance properties are not so clear-cut. Depending on specific properties of the data, either the greedy design or the matched-pair design could have lower variance.
We now turn to the question: can we construct a feasible “optimal” design? To provide a specific example from the previous section, how should practitioners decide between greedy designs (which may imply higher leverage for certain observations) and non-greedy designs (which may imply higher bias for the imputation of ITEs for some units)? This question does not have easy answers. Indeed, a simple example can illustrate this conundrum. Consider a graph with one point in the center in two dimensions with points surrounding it in a circle. Further suppose that each of these points is units away from the center point, but units away from the next closest point on the exterior. Assume that each unit has a residual, (as in Assumption 2), for the center point and for exterior points. The greedy design would ensure that the center and the exterior points received different treatments. The matched pair design would pair the center with one random exterior point, and then match all other exterior points with a neighboring exterior point. Then we can write the expectation of the bound in equation 2 for the center point in the matched-pair design as . For one point in the exterior, that quantity is , while for all others it is . For the greedy design, this quantity would instead be for the center point. For all exterior points, the bound would be .
Depending on the relative values of versus and the distance versus , either the greedy design or the matched-pair design could minimize the bound. That is, if is very large, then the greedy design will tend to exhibit variance properties that overwhelm its low bias. Similarly, if tends to be larger (or ), then the matched-pair design will tend to have unacceptably large biases that will overwhelm its variance properties. Of course, in the asymptotic regime, only bias matters and thus the greedy design will minimize this bound. In finite samples, this thoroughly unsatisfying bias-variance tradeoff demonstrates that the optimal design depends crucially on properties of the data which are unknowable a priori. In short, we do not seek an optimal design, but instead simply designs that make a reasonable tradeoff between bias and variance for many applied situations.
Practitioners who understand more about their data, such as the extent of heteroskedasticity and the smoothness of the conditional expectation function can therefore make better decisions about design than any overarching theoretical statement that we can provide here.
We begin by limiting our space of algorithms to scenarios in which Maxcut can be efficiently solved. Since trees and forests admit linear-time solutions to Maxcut, we focus on them.
From Proposition 2, it is clear that integrated absolute bias is minimized when, for each unit, the similarity to the units with positive weights (i.e. those used to impute its counterfactual) is maximized, as discussed in Section 4.3. The easiest way to ensure this is to match each unit with its closest neighbor in the graph and ensure that each unit in this newly sparsified graph receives a different treatment than its neighbors. This solution is where we begin; we call this design “GreedyNeighbors”, because the design is realized by solving Maxcut exactly on the one-nearest-neighbor graph. The nearest neighbor graph can be computed efficiently in $O(N \log N)$ time by using a $k$-d tree. The one-nearest-neighbor graph is a forest, so solving Maxcut is trivially accomplished in $O(N)$ time by greedily walking the forest, alternating treatment assignment. Thus, this design is realizable in $O(N \log N)$ aggregate time. An important thing to note about this design in contrast to typical matched-pair designs is that a unit may be “matched” to more than one unit. Note that there are many realizable assignments with the GreedyNeighbors design, as each disconnected subgraph of the nearest-neighbor graph is assigned independently. This implies that there are $2^K$ possible assignments, where $K$ is the number of disconnected subgraphs of the nearest-neighbor graph.
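The GreedyNeighbors construction can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: it uses Euclidean distance and a brute-force $O(N^2)$ neighbor search for readability (a $k$-d tree would replace `nearest_neighbor_edges` in practice), and the function names are hypothetical.

```python
import math
import random

def nearest_neighbor_edges(points):
    """Undirected edges linking each point to its nearest neighbor (brute force)."""
    edges = set()
    for i, p in enumerate(points):
        j = min((k for k in range(len(points)) if k != i),
                key=lambda k: math.dist(p, points[k]))
        edges.add((min(i, j), max(i, j)))  # mutual neighbors collapse to one edge
    return sorted(edges)

def greedy_neighbors(points, rng):
    """Two-color the one-nearest-neighbor forest; returns 0/1 treatments."""
    n = len(points)
    adj = [[] for _ in range(n)]
    for i, j in nearest_neighbor_edges(points):
        adj[i].append(j)
        adj[j].append(i)
    treat = [None] * n
    for root in range(n):
        if treat[root] is not None:
            continue
        # Each connected component gets its own independent fair coin flip.
        treat[root] = rng.randrange(2)
        stack = [root]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if treat[v] is None:
                    treat[v] = 1 - treat[u]  # alternate along every edge
                    stack.append(v)
    return treat

points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0), (11.0, 1.5)]
nn_edges = nearest_neighbor_edges(points)
treat = greedy_neighbors(points, random.Random(0))
```

In this toy example the neighbor graph has two components, so four distinct assignments are possible, and unit 3 serves as the match for both unit 2 and unit 4.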
This design, however, despite minimizing bias on the ITE estimates, has needlessly high variance. Each added edge will stabilize the variance component in the decomposition in Proposition 1. Thus, adding edges will reduce the variance of the ultimate solution (at the expense of some additional possibility for bias). We propose a design which manages this tradeoff in a computationally tractable way based on the maximal spanning tree of the original similarity graph. Algorithm 1 summarizes this design. In short, the maximal spanning tree (MST) is the spanning tree (a connected subgraph that contains every node and no cycles) with the largest total edge weight. The maximal spanning tree always contains the nearest neighbor graph (as in, all edges of the nearest neighbor graph are also within the maximal spanning tree). Since the MST is a tree, Maxcut on it can also be solved trivially in $O(N)$ time. The MST itself can be computed in $O(N \log N)$ time. Thus, the full procedure requires, again, only $O(N \log N)$ time. Adding any additional edge to the MST which fails to preserve the bipartiteness of the graph will make it no longer amenable to a greedy solution to Maxcut. This makes it the largest graph (in terms of total edge weight) for which Maxcut is guaranteed to be efficiently solvable in this way. We refer to this algorithm as “SoftBlock”, since it softens the idea of a blocked design by allowing for substantial correlations between any two units (rather than simply units which lie within the same block).
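A minimal sketch of the SoftBlock construction is given below. It is not the paper's Algorithm 1: it assumes the similarity is strictly decreasing in Euclidean distance, so that the maximal spanning tree on similarities coincides with the minimum spanning tree on distances, and it uses a simple $O(N^2)$ version of Prim's algorithm instead of a faster geometric method. The function names are hypothetical.

```python
import math
import random

def minimum_spanning_tree(points):
    """Prim's algorithm on Euclidean distances.

    Returns (parent, child) edges in the order nodes join the tree."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known connection of each node to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
                    parent[v] = u
    return edges

def soft_block(points, rng):
    """Alternate treatment along every edge of the spanning tree."""
    edges = minimum_spanning_tree(points)
    treat = [None] * len(points)
    treat[0] = rng.randrange(2)  # a single coin flip orients the whole tree
    # Prim emits each parent before its children, so one pass suffices.
    for p, c in edges:
        treat[c] = 1 - treat[p]
    return treat, edges

points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0), (11.0, 1.5)]
treat, tree_edges = soft_block(points, random.Random(0))
```

Because the tree is connected, only two assignments are possible in this sketch (the coloring and its mirror image), in contrast to the $2^K$ assignments of the nearest-neighbor forest.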
We now provide a probabilistic interpretation of the proposed design. We start with the observation that the set of all random spanning trees defines a determinantal point process (DPP) where the probability of a spanning tree $T$ is proportional to the product of its edge weights (Lyons and Peres, 2017), i.e. $P(T) \propto \prod_{(i,j) \in T} w_{ij}$. This can be trivially modified to represent a distribution where each tree’s probability is given by its respective balance by first considering an exponentiation of the weights, i.e.,
$$ P(T) \propto \prod_{(i,j) \in T} \exp(w_{ij}) = \exp\Big(\sum_{(i,j) \in T} w_{ij}\Big). $$
It’s easily observed that when the sum of the weights is maximized (that is, the MST), the probability of the tree is also maximized. Thus, SoftBlock, the design based on the MST, is the MAP estimate from this DPP.
All designs we have considered correspond to a particular test of balance between treated and control units. For example, rerandomization using the Mahalanobis distance minimizes a -test, and as we detail in the appendix, problem 1 corresponds to minimizing an uncentered version of maximum mean discrepancy (Gretton et al., 2012).
As it turns out, SoftBlock shares an interesting connection to the minimum spanning tree test of Friedman and Rafsky (1979). Specifically, the graph-based test addresses the problem of detecting differences between two distributions by viewing the problem in terms of a cut on a minimum spanning tree. The procedure is as follows. The two samples are pooled and a similarity graph is constructed according to an analyst-specified similarity metric. The minimum spanning tree of this graph is then found. The test statistic is defined as the number of edges in the minimum spanning tree that connect units from different samples. The statistic is minimized if the two samples share only one edge in the spanning tree, and maximized when edges connect units from different samples as much as possible. This procedure was shown to be asymptotically normal and consistent by Henze et al. (1999). The SoftBlock assignment mechanism directly maximizes this statistic, since every edge of the tree connects a treated unit to a control unit, and thereby minimizes the evidence of imbalance under the Friedman-Rafsky test. By optimizing a consistent test of balance, our procedure asymptotically guarantees balance on covariates between groups. Given that this is a consistent test, we can be sure that even though we aren’t directly optimizing linear balance, we will converge to linear balance in the limit. In finite samples, this procedure may sacrifice some degree of linear balance relative to traditional blocking procedures. Essentially, linear balance implies a computationally intractable (i.e. NP-hard) exact solution, while the use of a different metric of balance provides a simple polynomial time algorithm (with equivalence to the linear problem in the limit).
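The statistic described above is simple to compute once a spanning tree is available. The sketch below assumes the tree is supplied as a list of `(i, j)` edges over the pooled sample; the function name `cross_edge_count` is hypothetical. The toy example uses a path-shaped tree to show the two extremes.

```python
def cross_edge_count(tree_edges, labels):
    """Number of spanning-tree edges joining units from different samples."""
    return sum(1 for i, j in tree_edges if labels[i] != labels[j])

# A path-shaped spanning tree over six pooled units: 0 - 1 - 2 - 3 - 4 - 5.
path_tree = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]

# Samples that separate cleanly along the tree share a single edge.
separated = cross_edge_count(path_tree, [0, 0, 0, 1, 1, 1])

# Alternating labels along the tree (as an alternating assignment produces)
# make every edge a cross-sample edge.
alternating = cross_edge_count(path_tree, [0, 1, 0, 1, 0, 1])
```

An assignment that alternates along every tree edge therefore attains the largest value the statistic can take, $N - 1$ for $N$ pooled units.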
In this section, we present experiments demonstrating the effectiveness of SoftBlock. We begin by describing the methods we benchmark against:
Bernoulli randomization. This method flips a fair coin for each unit. This method is minimax optimal for the ATE as per Kallus (2018).
Rerandomization. The method of Morgan et al. (2012) randomizes, checks balance (by Mahalanobis distance) and, if the imbalance is too high, repeats. In our implementation, we use the heuristic of Kallus (2018), which accepts a randomization with only 1% probability. Thus, it ensures that the chosen design is among the 1% most balanced designs (in terms of Mahalanobis distance).
Kallus’ PSOD and Heuristic Designs. These designs of Kallus (2018) optimize assignments to minimize mean imbalance in an RKHS.
, consisting of four uniform random variables multiplied together. We additionally provide a simulation with a linear outcome, one based on a sinusoid, and one with covariates distributed along two circumscribed circles.
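As an illustration of the rerandomization baseline described above, the following sketch draws a batch of balanced assignments, scores each by the squared Mahalanobis distance between arm means, and picks uniformly among the most balanced 1%. This batch version is an assumption made for simplicity (a sequential accept/reject loop against a threshold is the more usual formulation), it handles only two covariates so the covariance matrix can be inverted by hand, and all names are hypothetical.

```python
import random

def mean(values):
    return sum(values) / len(values)

def mahalanobis_imbalance(xs, treat):
    """Squared Mahalanobis distance between arm means for 2-D covariates.

    The usual sample-size scaling constant is omitted; it does not change
    the ranking of candidate assignments."""
    n = len(xs)
    mx, my = mean([p[0] for p in xs]), mean([p[1] for p in xs])
    sxx = sum((p[0] - mx) ** 2 for p in xs) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in xs) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in xs) / (n - 1)
    det = sxx * syy - sxy ** 2
    treated = [p for p, a in zip(xs, treat) if a == 1]
    control = [p for p, a in zip(xs, treat) if a == 0]
    dx = mean([p[0] for p in treated]) - mean([p[0] for p in control])
    dy = mean([p[1] for p in treated]) - mean([p[1] for p in control])
    # d^T S^{-1} d using the closed-form inverse of a 2x2 matrix.
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det

def rerandomize(xs, rng, n_draws=1000, accept_fraction=0.01):
    """Pick uniformly among the most balanced fraction of random draws."""
    n = len(xs)
    base = [1] * (n // 2) + [0] * (n - n // 2)
    draws = []
    for _ in range(n_draws):
        candidate = base[:]
        rng.shuffle(candidate)
        draws.append((mahalanobis_imbalance(xs, candidate), candidate))
    draws.sort(key=lambda pair: pair[0])
    keep = max(1, int(n_draws * accept_fraction))
    return rng.choice(draws[:keep])[1]

rng = random.Random(0)
xs = [(rng.random(), rng.random()) for _ in range(20)]
assignment = rerandomize(xs, rng, n_draws=500)
```

The cost of this baseline grows with the number of candidate draws, which is one reason the runtime comparisons treat it separately from the tree-based designs.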
Figure 1 shows the runtimes of these various methods on the TwoCircles data generating process. At very low sample sizes, Kallus’s (2018) PSOD method is the fastest way to design an experiment and estimate effects, but by moderate sample sizes is outpaced by QuickBlock, SoftBlock and the GreedyNeighbors methods. SoftBlock is faster than QuickBlock at nearly all sample sizes, but the two approaches increase computational time at a similar rate.
Figure 2 shows the performance of the methods on the simulation setups in Table 1. Note that values in this chart are normalized for sample size (errors are rescaled by a factor depending on the sample size to allow for easier comparison across a wide array of sample sizes). In the LinearDGP, the methods which estimate the ATE with Lin’s (2013) regression-adjustment method are, in fact, correctly specified parametric models. As such, they (Rerandomization, GreedyNeighbors and Bernoulli randomization) have much lower error than competitor methods. SoftBlock, however, converges to nearly the same error as the sample size grows. In the LinearDGP, SoftBlock and GreedyNeighbors are substantially more effective at estimating the ITEs than competitor methods, with SoftBlock outperforming GreedyNeighbors. Similar patterns hold in terms of the ITE on all DGPs, with QuickBlock performing the closest to SoftBlock, particularly at higher sample sizes. For estimating the ATE on the non-linear DGPs, SoftBlock is nearly always the most effective method, often substantially so, for example in moderate sample sizes on the QuickBlockDGP. The comparison between the GreedyNeighbors design and SoftBlock is informative, since the MST always contains the nearest neighbor graph. SoftBlock has two main advantages over this design. First, it reduces variance by using more than just the closest neighbor (for instance, sometimes the two nearest neighbors are both very close, so it would be wise to use both of them). Second, by being a single connected graph, it ensures that the assignments across different pairs of nearest neighbors are “lined up”. That is, it avoids certain bad randomizations, in which, for example, two nearby edges are oriented in the same direction, so that the unit with the larger covariate value is assigned treatment in both pairs. The cut on the MST, on the other hand, is more likely to insulate against this eventuality by connecting these subgraphs and ensuring that the orientations of treatments do not match.
Figure 3 shows the performance of the design-based estimators for ITEs. In contrast to the previous figure, which estimates ITEs with a random forest T-learner, this figure shows the ITEs estimated by only the specific estimator implied by the design. This means that, for a blocking estimator, a difference-in-means estimator is used within each block to impute conditional effects (which are assumed constant within blocks). For SoftBlock, the ITE estimator is the difference between the observed outcome of the ego unit and its synthetic counterfactual, constructed as the weighted average of its neighbors in the minimum spanning tree, as analyzed in Section 4.1. In this comparison, SoftBlock performs substantially and consistently better than other designs. The comparison to blocking methods in this experiment demonstrates why SoftBlock is able to do better at estimating the ITE than other methods: it is optimized to ensure good interpolation across the entire space. In particular, we can once again see as informative its comparative stability relative to the matched-pair designs (note that the Kallus (2018) matched-pair design is infeasibly slow to display above sample sizes of 100 in this simulation).
Figure 4 shows the performance of various methods on the IHDP simulation study, as introduced by Hill (2011). We compare using setting “B”, in which the outcome model is nonlinear and the treatment effect is not constant. In this data, SoftBlock provides the lowest error estimates of the ATE, and all methods tend to perform well for estimating the ITE with a random forest T-learner.
In this paper, we have provided a framework through which to think about designs for individual treatment effect estimation and provided a formulation of the problem as graph cutting. Through this framework we presented two novel experimental designs which are well-suited to estimating ITEs and compared them to prior work. Simulations demonstrate that these designs provide an improvement in terms of both computational tractability and efficiency.
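Uri Shalit, Fredrik D. Johansson, and David Sontag. Estimating individual treatment effect: generalization bounds and algorithms. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 3076–3085. JMLR.org, 2017.
Claudia Shi, David Blei, and Victor Veitch. Adapting neural networks for the estimation of treatment effects. In Advances in Neural Information Processing Systems, pages 2507–2517, 2019.
Luca Trevisan. Max cut and the smallest eigenvalue. SIAM Journal on Computing, 41(6):1769–1786, 2012.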
Where is the Lipschitz constant as in assumption 3, and is a distance measure. By placing an additional assumption that the noise term can be bounded by a constant, , the variance term can be further bounded with probability using an application of Hoeffding’s inequality [Anava and Levy, 2016],
By triangle inequality, we have for every
Let us also denote and
The first inequality follows by assuming that and noting that So the first summand is maximized by setting for and second one is maximized by setting for The second inequality follows by . Finally, by summing over all , we have
This is minimized by the max-cut. ∎
Given the graph, an equivalent formulation of Equation 1 is finding the weighted maximum cut on the graph (find a subset of the vertices such that the total weight of edges connecting nodes of the two different subsets is maximized). The equivalence is made plain by considering the binary quadratic program formulation of Maxcut given in Equation 5,
where we have defined to be the combinatorial graph Laplacian, where is a diagonal matrix where is the degree of vertex and is the weighted adjacency matrix. Comparing this to problem 1,
we can see that problem 7, given by Maxcut, is isomorphic to problem 1, given by the PSOD strategy of Kallus (2018). Therefore, improved approximations to Maxcut will additionally be improved approximations to the PSOD of Kallus (2018).
The kernel objective of Kallus (2018) is defined as
where $K$ is the Gram matrix for some reproducing kernel. By using the cyclic properties of the trace, we can rewrite the objective in equation 8 as
a biased estimator of the Hilbert-Schmidt independence criterion with respect to and the kernel given by can be written as [Gretton et al., 2008]
|QuickBlockDGP||1 + y(0)|
For the estimation of ATEs for Bernoulli and rerandomization, we use regression-adjusted estimators as used in Lin (2013): a linear regression with covariates mean-centered and interacted with treatment. QuickBlock uses a blocking estimator as the authors propose, and the Kallus (2018) designs use a difference-in-means estimator as proposed. The matched-pairs design takes the average of the within-pair differences in outcomes, which leads to a more efficient estimator than difference-in-means [Imai, 2008]. When examining ITE estimators, we use random forest based T-learners unless otherwise noted [Athey and Imbens, 2016, Künzel et al., 2019]. All methods use the same hyperparameters, with the number of trees set as a function of the sample size, so that model complexity grows with sample size, and with maximum tree depth set at 8.
Figure 5 shows the sensitivity of the Kallus (2018) methods and SoftBlock to hyperparameters. Since both methods are based on similarities defined by a kernel matrix, we plot the performance of these methods on the TwoCircles problem as the bandwidth of the Gaussian kernel changes. SoftBlock is largely insensitive to this hyperparameter, performing well at all values, while the Kallus (2018) methods perform well only when the hyperparameters are set well. In essence, these methods perform covariate adjustment a priori, but this means that they implicitly specify an outcome model before data is observed. As such, it is very difficult to set these values effectively in practice, as it amounts to tuning a non-parametric model without data for cross-validation or other model selection techniques.