Wasserstein Random Forests and Applications in Heterogeneous Treatment Effects

by   Qiming Du, et al.
Sorbonne Université

We present new insights into causal inference in the context of Heterogeneous Treatment Effects by proposing natural variants of Random Forests to estimate the key conditional distributions. To achieve this, we recast Breiman's original splitting criterion in terms of Wasserstein distances between empirical measures. This reformulation indicates that Random Forests are well adapted to estimate conditional distributions and provides a natural extension of the algorithm to multivariate outputs. Following the philosophy of Breiman's construction, we propose some variants of the splitting rule that are well-suited to the conditional distribution estimation problem. Some preliminary theoretical connections are established along with various numerical experiments, which show how our approach may help to conduct more transparent causal inference in complex situations.



There are no comments yet.


page 1

page 2

page 3

page 4


A Groupwise Approach for Inferring Heterogeneous Treatment Effects in Causal Inference

There is a growing literature in nonparametric estimation of the conditi...

Estimating Treatment Effects with Causal Forests: An Application

We apply causal forests to a dataset derived from the National Study of ...

Estimation and Inference of Heterogeneous Treatment Effects using Random Forests

Many scientific and engineering challenges -- ranging from personalized ...

Distributional Random Forests: Heterogeneity Adjustment and Multivariate Distributional Regression

We propose an adaptation of the Random Forest algorithm to estimate the ...

Ranking of average treatment effects with generalized random forests for time-to-event outcomes

In this paper we present a data-adaptive estimation procedure for estima...

Estimating Heterogeneous Causal Effects in the Presence of Irregular Assignment Mechanisms

This paper provides a link between causal inference and machine learning...

Modified Causal Forests for Estimating Heterogeneous Causal Effects

Uncovering the heterogeneity of causal effects of policies and business ...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

One of the primary objectives of supervised learning is to provide an estimation of the conditional expectation

for some underlying 1-dimensional objective and a multidimensional covariate given the dataset . However, in many real-world applications, it is also important to extract the additional information encoded in the conditional distribution . This is particularly the case in the field of Heterogeneous Treatment Effects (HTE) estimation problems, which represent the primary motivation of this work.

1.1 Motivation

In HTE problems, the traditional object of interest is the Conditional Average Treatment Effect (CATE) function, defined by


where (resp. ) denotes the potential outcome (e.g., Rubin (1974) and Imbens and Rubin (2015)) of the treatment (resp. no treatment). The data are usually of form , where denotes the treatment assignment indicator. Recently, many approaches based on modern statistical learning techniques have been investigated to estimate the CATE function (e.g., Künzel et al. (2019); Athey and Wager (2019); Nie and Wager (2017)). Typically, assuming unconfoundedness, that is


one is able to estimate and respectively with


The classical approach in the HTE context is to design the causal inference procedure around the estimation of the CATE function defined in (1) using , and to test whether there is a notable difference between and for each newcoming individual . It is important to note that this is already a difficult task for certain datasets due to the unbalance between treatment and control groups or other practical reasons. For instance, the X-learner Künzel et al. (2019) is proposed to deal with the unbalanced design by making efficient use of the structural information about the CATE function, and the R-learner Nie and Wager (2017) is introduced to improve accuracy and robustness of the CATE function estimation by formulating it into a standard loss-minimization problem. However, a simple inference based on the CATE function, or other key features composed by the average treatment effects (for example, the sorted Group Average Treatment Effects (GATES) proposed in Chernozhukov et al. (2018)), may be hazardous in some situations because of the lack of information on the fluctuations, or multimodality, of both conditional laws and . This phenomenon, in practice, can arise when and/or depend on some additional unconfounding factors that are not collected in the study, which, however, greatly affect the behaviors of the potential outcomes. From another point of view, being aware of the existence of such problems may also help to fix the flaw of data collection procedure or the lack of subgroup analysis for future study.

Ideally, one is interested in estimating the joint conditional distribution Unfortunately, a major difficulty of HTE estimation lies in the fact that it is in general impossible to collect and at the same time for the point . Unlike the difference in the linear conditional expectation , the dependence between and given is much more complex and difficult to track. Hence, due to the lack of information of the collectable dataset, the estimation of the conditional covariance between and

is usually unavailable, let alone the conditional joint distribution. A possible route to address this shortcoming is to concentrate on a weaker problem: instead of estimating

, we are interested in the estimation of conditional marginal distributions and . By considering the two subgroups (3) of the dataset , the problem thus enters into a more classical supervised learning context, similar as the design of T-learners Künzel et al. (2019), while the objective is replaced by the estimation of conditional distributions. In some scenarios, even a simple raw visualization of the marginal conditional distributions, as a complement of CATE function estimation, may greatly help the decision making process for practitioners.

Another motivation comes from the need to set-up statistically sound decision procedures for multivariate objectives in the context of HTE. For example, a treatment is often related to a cost, which is also collectable and sometimes essential to the final treatment decisions. In this context, a simple extension of the CATE function will clearly not be able to capture the dependencies between the treatment effects and the cost. Thus, a statistical tool that allows conditional distribution estimation with multivariate objective will therefore be useful for more complex inferences involving both treatment effects and costs at the same time. In general, the traditional nonparametric methods for conditional distribution inference (e.g. Hall et al. (1999); Hall and Yao (2005)) are less effective when it comes to flexibility of implementation, parallelization, and the ability to handle high-dimensional noisy data. So, we try to take advantage of the available modern machine/statistical learning tools.

1.2 Random Forests for conditional distribution estimation

In order to address the issues described in the subsection above, our idea is to propose an adaptation of the Random Forests (RF) algorithm Breiman (2001), so that it can be applied to the conditional distribution estimation problems in the HTE context. RF have proven to be successful in many real-world applications—the reader is referred to Biau and Scornet (2015) and the references therein for a general introduction. If we look at the final prediction at each point provided by the RF algorithm, it can be regarded as a weighted average of

, where the random weights depend upon the training dataset and the stochastic mechanism of the forests. Therefore, a very natural idea is to use this weighted empirical measure to approximate the target conditional distribution. This is also the driving force in the construction of Quantile Regression Forests

Meinshausen (2006); Athey et al. (2019)

and other approaches that combine kernel density estimations and Random Forests (e.g.

Pospisil and Lee (2019)).

In the present article, instead of studying the quantile or density function of the target conditional distribution, we focus directly on the (weighted) empirical measures output by the forests and the associated Wasserstein distances. This also makes further inferences based on Monte-Carlo methods or smoothing more convenient and straightforward. To make it clearer, let us denote by

the probability measure associated with the conditional distribution

. Heuristically speaking, if the Wasserstein distance between

and is dominated, in some sense, by the distance between and , then the data points that fall into a “neighborhood” of are expected to be capable to provide reliable approximation of the conditional measure . In the RF context, the role of each tree in the ensemble is to build a wisely created partition of the domain, so that the “neighborhood” mentioned above can be defined accordingly. As such, the random weights come from the averaging procedure of multiple trees.

As Breiman’s original RF are primarily designed for conditional expectation estimations, we first provide in Section 2

a reformulation that gives new insights into Breiman’s original splitting criterion, by exploiting a simple relation between empirical variance and Wasserstein distance between empirical measures. This reformulation allows a new interpretation of the RF algorithm in the context of conditional distribution estimation, which, in turn, can be used to handle multivariate objectives with a computational cost that grows linearly with the dimension of the output. We also investigate in this section several dedicated modifications of Breiman’s splitting rule and present some preliminary theoretical connections between their constructions. With a slight abuse of language, all these RF variants aiming at conditional distribution estimation are referred to as Wasserstein Random Forests (WRF) in this article. Finally, we return in Section

3 to the HTE problem and illustrate through various numerical experiments how WRF may help to design more transparent causal inferences in this context.

2 Wasserstein Random Forests

In order to simplify the introduction of WRF, we temporarily limit the discussion to the classical supervised learning setting. Let and

be, respectively, the canonical random variables of covariate and the objective. Our goal is to estimate the conditional measure

associated with using the dataset .

2.1 Mechanism of Random Forests

A Random Forest is an ensemble method that aggregates a collection of randomized decision trees. Denote by

the number of trees and, for , let be the canonical random variable that captures randomness of the -th tree. Each decision tree is trained on a randomly selected dataset with the same cardinal , sampled uniformly in with or without replacement. More concretely, for each tree, a sequence of axis-aligned splits is made recursively by maximizing some fixed splitting rule. At each iteration, directions are explored and the splits are always performed in the middle of two consecutive data points, in order to remove the possible ties. The splitting stops when the current cell contains fewer points than a threshold , or when all the data points are identical. In this way, a binary hierarchical partition of is constructed. For any , we denote by the cell in the -th tree that contains and by the number of data points in that fall into .

The core of our approach relies on the fact that the prediction of the conditional distribution at point given by the -th tree is simply the empirical measure associated with the observations that fall into the same cell as , that is

where is the Dirac measure at . Let . As such, the final estimation provided by the forest is but the average of the , , over the trees, i.e.,



is the random weight associated with . It is readily checked that for any . Thus, the final prediction is a weighted empirical measure with random weights naturally given by the tree aggregation mechanism. Our notation is compatible with Biau and Scornet (2015), where a more detailed introduction to RF is provided. It should be stressed again that we are interested in learning the conditional distribution , not in inferring the conditional expectation as in traditional forests. This is of course a more complex task, insofar as the expectation is just a feature, albeit essential, of the distribution.

Figure 1: Illustration of the a single decision tree. Note that and are not sampled in the sub-dataset used for the tree’s construction.

As an illustration, consider in Figure 1 the partition provided by a decision tree trained on a bidimensional sub-dataset of size 8. The estimation of the conditional distribution at the point is simply the empirical measure associated with the cell to which it belongs. Mutatis mutandis, suppose that there is another decision tree that gives the measure as the estimation at . Then the final estimation of the conditional distribution output by the forest that contains these two trees is the empirical distribution . On the other hand, the classical RF outputs the average as the scalar estimation of the conditional expectation.

2.2 Splitting criteria and Wasserstein distances

Now, let us take a closer look at the splitting criteria that are maximized at each cell in the construction of trees. For a cell that consists in a subset of data points , an axis-aligned cut along the -th coordinate at position defines a partition of . More precisely,

With a slight abuse of notation, we write when . Recall that Breiman’s original splitting criterion Breiman (2001) takes the following form:


where (resp. , ) is the average of the that fall into (resp. , ), and (resp. , ) is the cardinal of (resp. , ). This criterion is maximized at each node of each tree over and the mtry randomly chosen coordinates (see, e.g., Section 2.2 of Biau and Scornet (2015)). The quantity can be interpreted as the difference between the total variance and the intra-class variance within the subgroups divided by the split, which, thanks to the total variance decomposition, turns out to be the associated inter-class variance, i.e.,


Therefore, when seeking possible extensions, Breiman’s splitting criterion should be understood in a twofold manner: minimization of the intra-class variance or maximization of the inter-class variance, induced by the split. Regardless of the choice of interpretation, since there is only a finite number of cuts to be evaluated at each iteration, a decision tree can therefore be built in a greedy manner. Without loss of generality, when bootstrap is involved (i.e., is sampled with replacement), one may consider multisets/bags in order to deal with duplicate data for formal definitions discussed above.

Before proceeding further, we recall some basic properties of Wasserstein distances between empirical measures. If not mentioned otherwise, and denote respectively the dimension of the covariate and the dimension of the objective . For two empirical measures of the same size on , say and , by considering the original form of Monge problem (see, e.g., Section 2.2 of Peyré and Cuturi (2018)), the -Wasserstein distance is defined by

where denotes a permutation in , the collection of all permutations on and is the Euclidean norm on . When and are probability measures on (i.e., ), it is easily checked that


where (resp. ) denotes the -th order statistic of the sample (resp. ).

Intra-class interpretation

We focus on the representation of given in (4). The key observation is encapsulated in the following proposition, which expresses Breiman’s splitting criterion in terms of Wasserstein distances between empirical measures.

Proposition 2.1.

Assume that , i.e., . Denote by (resp. , ) the empirical measure associated with (resp. , ). Then


First, in the same spirit of Kontorovich’s relaxation (see, e.g., Ambrosio and Gigli (2013); Peyré and Cuturi (2018)), we provide a permutation-based representation of the Wasserstein distance between empirical measures of potentially different sizes and . Denote by the ceiling function. Observe that, by exploiting the following alternative forms


one has

When , one obtains


In particular, when and is univariate, we have



Plugging (9) into (4) leads to the desired conclusion. ∎

Proposition 2.1 heuristically indicates that RF are well-adapted to estimate conditional distributions, by offering a new interpretation of Breiman’s splitting criterion in terms of intra-class quadratic Wasserstein distances between empirical measures. An important consequence of this result is that it allows a natural generalization of Breiman’s criterion to outputs with a dimension greater than 1. Indeed, the extension of RF to multivariate outputs is not straightforward, even in the context of conditional expectation estimation (e.g., Segal and Xiao (2011); Miller et al. (2014)). The dependence between the different coordinates of the objective is usually dealt with using additional tuning or supplementary prior knowledge. Such a modeling is not necessary in our approach since dependencies in the

-vector features are captured by the Wasserstein distances. (Note however that some appropriate normalization should be considered when there are noticeable differences between the coordinates of the objective.) Besides, this extension is also computationally efficient, as the complexity of the evaluation at each cell increases linearly w.r.t. the dimension

of the objective . To see this, assume that and just note that, according to (8), standard calculations give


In the sequel, to increase the clarity of our presentation, we use the notation instead of for the criterion defined in Proposition 2.1, keeping in mind that this quantity is valid for any dimension of . It is interesting to note that can be naturally extended to orders not necessarily equal to , by putting

However, finding an efficient implementation of is not trivial in terms of computational complexity. This requires a further analysis, which is beyond the scope of the present article.

Inter-class interpretation

Let us now give another look at the representation of in the form of (5). Unlike the intra-class interpretation provided above, there is no free -representation in this situation. However, by replacing the -error with the -distance between empirical measures according to the goal of conditional distribution estimation, it is natural to consider the following splitting criterion:

In the univariate case (i.e., ), thanks to (6) and the same technique involving the alternative form (7) used in the proof of Proposition 2.1, it is easily checked that can be computed with complexity at each cell that contains the data points

. This rate can be achieved by considering a Quicksort algorithm in order to deal with the order statistics in (

6). The implementation is tractable, although slightly worse than , the complexity of . However, the computation of the Wasserstein distance is not trivial when , where an exact computation is of order (cf. Section 2 of Peyré and Cuturi (2018)). A possible relaxation is to consider an entropic regularized approximation such as Sinkhorn distance (cf. Cuturi (2013); Genevay et al. (2019); Lin et al. (2019)), where the associated complexity is of order with tolerance . Nevertheless, since the amount of evaluations of

is enormous during the construction of RF, we only recommend using this variant of splitting criterion for univariate problems at the moment. The details of efficient implementations and possible relaxations for multivariate cases will be left for future research.

Since the principle of minimizing (resp. maximizing) the intra-class (resp. inter-class) variance is a universal strategy beyond the implementation in the RF context, we hope that these Wasserstein-based interpretations may inspire a wider range of future applications. It is however now time to put our splitting analysis to good use and return to the HTE conditional distribution estimation problem.

3 Applications

Our primary interest is the improvement that WRF can bring into the causal inference under the potential outcomes framework. To be clearer, let us recall that is a random variable on and denotes the treatment assignment indicator, i.e., a Bernoulli random variable that may depends on . As for now, the potential outcomes and are assumed to be univariate random variables—extension to the multivariate case will be discussed a little later. During the observational study, the i.i.d. dataset is collected with an abbreviation of . Under the unconfoundedness assumption (2

), our goal is to estimate the probability distribution

associated with the conditional marginal distribution for , based on the dataset . To this aim, we take advantage of the WRF framework presented in Section 2 and discuss some possible ways to exploit the additional information that can be collected by the estimation of and .

3.1 Wasserstein Random Forests for HTE

Before discussing the usage of these conditional marginal distribution estimations, we would like to stress again that we have no intention to “replace” the Average Treatment Effect-based causal inference strategy. On the contrary, our primary motivation is to provide a complementary tool so that a more transparent inference can be conducted, by maximizing the usage of available data. More precisely, we train WRF respectively on the treatment and control groups (3), to estimate respectively the conditional measures and . These estimations are denoted by and .

First, since potential outcomes are assumed to be univariate, a raw visualization of and is always accessible and informative. In this way, causality can therefore be visualized by the change of the shape of the marginal distributions. Next, following a philosophy similar to the CATE function, we propose to assess the changes in the conditional distribution in terms of Wasserstein distance using the criterion

Intuitively speaking, is more sensitive than the CATE function , in the sense that is capable of capturing certain causal effects that are less noticeable in terms of . An estimation can be obtained as a by-product of the estimation of and . Finally, regarding the multivariate output case, we would like to mention that when the cost of the treatment, say , is also collected in the dataset, WRF can then be used—as we have seen in Subsection 2.2, without further effort—to provide an estimation of the joint multivariate distribution in order to conduct more complex inferences involving the costs and the treatment effects at the same time. The same idea also applies to the case where the treatment effects themselves are also multivariate.

3.2 Experimental results

Since the conditional distribution is in general inaccessible from the real-world datasets, we present here a brief simulation study based on synthetic data to illustrate the performance of WRF in the context of HTE.

A Python package is accessible at the following github repository:

We also provide in Appendices a more complete numerical analysis on the tuning suggestions and influence of propensity score function (i.e., the probability of treatment), along with a discussion on the possible extensions. All the detailed algorithms can be found in the end.

We consider the following model, where and the symbol

stands for the Gaussian distribution:

  • with ;

  • ;

  • ;

with , , and . To summarize, the conditional measure is unimodal, while is bimodal, composed of two independent Gaussians. The mixture parameters in can be interpreted as an unconfounding factor that is not collected in the study. The four functions , , , and have been designed to implement complex dependence between the covariate and the potential outcomes. We note however that the CATE function takes the simple form and is therefore equal to zero if and only if .

(a) estimated by -WRF
(b) estimated by -WRF
estimated by -WRF
(d) estimated by -WRF
(e) estimated by -WRF
Figure 2: An illustration of estimated conditional distributions provided by different variants of WRF with the same parameters: (with repetition), , , . In the legend, pred and ref denote respectively the prediction given by WRF and reference values sampled directly from the true conditional distribution with sample size fixed to be 5000. The acronyms kde-pred and kde-ref stand for the outputs of the kdeplot function of seaborn package Waskom et al. (2020), which provides a standard kernel smoothing. Finally, kde-Y denotes the kdeplot of the -population, i.e., all the or in the training dataset according to the treatment/control group.

We trained the models based on a simulated dataset of size , which is reasonably small considering the complexity of the conditional distribution estimation problem. The treatment and control groups are balanced due to the symmetrical form of , and the size of training data for each group is therefore around 2500. An illustration for an individual with (so, ) can be found in Figure 2 ((a)-(c) for -WRF; (d)-(e) for -WRF). This visualization highlights the good quality of conditional inference performed by our WRF methods. More importantly, it stresses the pertinence of studying conditional distributions in the HTE context, since the CATE function, as is the case here, is not always capable to provide insights regarding causality. For example, according to the trained -WRF model, we have (reference value ), which is much more noticeable as an indicator of causality than the CATE function in this situation. We also note that we are able to approximate the joint conditional distribution of and given , which illustrates the potential of our approach in the multivariate case (Figure 2 (c)). The results for the two considered settings and are pretty similar. A finer comparison based on average Wasserstein distance is shown in Table 1, where ( and ) denotes the average -distance between and (approximated by an empirical measure of size 10 000) tested on points randomly sampled in .

-WRF 0.6325 0.7831 1.1511 1.7993
-WRF 0.6618 0.8317 1.1462 1.7752
-WRF 0.6493 0.7906 1.0757 1.6445
Table 1: Comparison based on average Wasserstein distance


We have proposed a new approach based on WRF that can help HTE inference through estimating some key conditional distributions. From a theoretical perspective, the challenge is to prove consistency of WRF when the sample size tends to infinity, in the spirit of works such as Scornet et al. (2015); Wager (2014). For example, a first goal would be to show that, under appropriate assumptions, as , where denotes the output of WRF and the expectation is taken w.r.t. both the distribution of and the sample. A deeper understanding of the relations between inter/intra interpretations is also an interesting topic for future research.

Broader impact

With the introduction of Wasserstein Random Forests and their applications in the Heterogeneous Treatment Effects context, we hope the present paper will be useful in a twofold manner. First, for the causal inference community, it provides complementary tools that allow to design more transparent inference strategies. Second, to the Random Forests community, it opens an interesting perspective that connects forests, Wasserstein distances, and conditional distribution estimation problems. We do not expect any specific negative outcomes from the broader impact of this article.


This work is funded by the Agence Nationale de la Recherche, under grant agreement no. ANR-18-CE36-0010-01. François Petit acknowledges the support of the Idex “Université de Paris 2019”.

Appendix A On the parameter tuning of WRF

We discuss in this section the influence of the choice of parameters (i.e., mtry, , and nodesize) of the WRF and try to provide some suggestions on the algorithm tuning. We stick to the model provided in Section 3.2 of the main text and compare the respectively for and to illustrate the performance of our method in unimodal and multimodal situations. Unlike the conditional expectation estimation, the cross validation-based tuning strategy is not straightforward to implement for conditional distribution estimation. Indeed, we have only a single sample at each point , and it does not provide enough information for the conditional distribution. Therefore, we also track the performance of the associated conditional expectation estimations in terms of Mean Squared Error (MSE). The conditional expectation functions given of and are denoted respectively by and . Our goal is to illustrate whether the tuning for the conditional expectation can be exploited to guide the tuning for the conditional distribution estimation problem. We also note that since each tree is constructed using only part of the data, the out-of-bag errors for the forest can thus be obtained by averaging the empirical error of each tree on the unused sub-dataset (see, e.g., Section 2.4 of Biau and Scornet (2015)) in the case where an independent test dataset is not available.

First, it is well-known that in the classical RF context the number of trees should be taken as large as possible, according to the available computing budget, in order to reduce the variance of the forest. Although the goal in the WRF framework is changed to the conditional distribution estimation, it is still suggested to use a large if possible.

Second, let us investigate the number of directions to be explored at each cell mtry. The result is illustrated in Figure 3 ((a)-(d) for average Wasserstein loss and (e)-(f) for MSE of conditional expectation estimation). Roughly speaking, the value of mtry reflects the strength of greedy optimization at each cell during the construction of decision trees. A conservative approach is to choose mtry as large as possible according to the available computing resources.

(a) Comparison of -.
(b) Comparison of -.
(c) Comparison of -.
(d) Comparison of -.
(e) Comparison of the estimation of .
(f) Comparison of the estimation of .
Figure 3: An illustration of the performance of different variants of WRF (namely, -WRF and -WRF) with mtry varying in , (with repetition), and .

Then, let us see the influence brought by the change of nodesize. The illustration can be found in Figure 4 ((a)-(d) for average Wasserstein loss and (e)-(f) for MSE of conditional expectation estimation). In the classical RF context, the motivation of the choice can be interpreted as introducing some local averaging procedure at each cell in order to deal with the variance or noise of the sample. Here, as discussed in the main text, we are interested in the conditional distribution estimation in the HTE context, where the variance or other fluctuation of the conditional distribution is part of the information to be estimated. Hence, the interpretation of the choice should be adapted accordingly, as the minimum sample size that is used to describe the conditional distribution at each cell. This interpretation is better suited when it comes to the estimation of multimodal conditional distributions. As shown in Figure 4 (a)-(d), there are some optimal choices of nodesize between and . In the simple cases, such as the estimation of (unimodal), the MSE of the associated conditional expectation (Figure 4 (e)) can be used, accordingly, to tune the algorithm for conditional distribution estimation. However, in the more complex case such as the estimation of (bi-modal), the MSE of the conditional expectation estimation is no as stable (Figure 4 (f)). Nevertheless, it is also recommended to use small nodesize in this situation as a conservative choice.

(a) Comparison of -.
(b) Comparison of -.
(c) Comparison of -.
(d) Comparison of -.
(e) Comparison of the estimation of .
(f) Comparison of the estimation of .
Figure 4: An illustration of the performance of different variants of WRF (namely, -WRF and -WRF) with nodesize varying in , (with repetition), and .
(a) Comparison of -.
(b) Comparison of -.
(c) Comparison of -.
(d) Comparison of -.
(e) Comparison of the estimation of .
(f) Comparison of the estimation of .
Figure 5: An illustration of the performance of different variants of WRF (namely, -WRF and -WRF) with varies in (with repetition), , and .

Finally, we discuss the size of the sub-dataset used to construct each decision tree. Note that the choice of is still not well-understood even in the classical RF context (see, e.g., Biau and Scornet (2015); Scornet et al. (2015)). When the computing budget allows to implement (with replacement, which corresponds to the classical Bootstrap), we recommend to use this choice. Otherwise, we recommend to fix the from one fifth to one third of the whole data size in order to maintain a reasonably good performance without heavy computations.

Suggestions on the parameter tuning

The take-home message for the parameter tuning of WRF is simple: We recommend to use large and mtry according to the available computing resources. The parameter nodesize can be tuned via a cross validation-based strategy using the MSE of the associated conditional expectation estimation. In addition, we suggest to choose smaller nodesize when there is abnormal fluctuation of the MSE score. It is also proposed to use classical bootstrap (i.e., with replacement) when possible. Otherwise, we suggest to fix a smaller according to the computing budget. Finally, although there is no theoretical guarantee, we advocate to use -WRF for univariate objective, since it has a better overall accuracy with a reasonable additional computational cost.

Appendix B On the propensity score function

The propensity score function measures the probability that the treatment is assigned to a certain individual, which basically determines the distribution of the available dataset for the estimation of and in the population. More precisely, imagine that is an individual such that in the neighbourhood of , the value of is close to 0. Then, it is expected that only very few training data for the estimation of can be collected during the observational study. As a consequence, it is expected that the estimation at such point is of reasonably bad quality. For example, returning to the model provided in Section 3.2 of the main text, the propensity score function is

Denote by an individual such that , , and . It is readily checked that . As shown in Figure 6, the estimation of is very accurate (see Figure 6 (a)-(b)), while the estimation of is of poor quality (see Figure 6 (c)-(d)).

(a) estimated by -WRF
(b) estimated by -WRF
(c) estimated by -WRF
(d) estimated by -WRF
Figure 6: An illustration of estimated conditional distributions provided by different variants of WRF with the same parameters: (with repetition), , , . In the legend, pred and ref denote respectively the prediction given by WRF and reference values sampled directly from the true conditional distribution with sample size fixed to be 5000. The acronyms kde-pred and kde-ref stand for the outputs of the kdeplot function of seaborn package Waskom et al. (2020), which provides a standard kernel smoothing. Finally, kde-Y denotes the kdeplot of the -population, i.e., all the or in the training dataset according to the treatment/control group.

From a theoretical perspective, one may suppose that the propensity score function is bounded away from 0 and 1 uniformly for all (see, e.g., Künzel et al. (2019); Nie and Wager (2017)). However, it is, unfortunately, not possible to control the propensity score during an observational study. As a consequence, it is usually very difficult to verify such an assumption in practice.

Therefore, a more meaningful question can be how to detect if our estimation is reliable or not for a certain individual. A straightforward strategy is to estimate the propensity score function independently, as done for example in Athey and Wager (2019), and to test whether the value of this score is close to 0 and 1. Another approach is to exploit the information encoded in the splits/weights of the forest to detect whether enough data is collected for the prediction at target individual. The details are left for future research.

Finally, let us mention that if the goal is to estimate the function defined in Section 3.1 of the main text, we expect that more dedicated variants of WRF can be constructed, in the same spirit of Causal Forests introduced in Athey and Wager (2019).

Appendix C Possible extensions

In this section, we discuss two natural extensions of WRF that we did not investigate in details.

First, inspired by the Random Rotation Ensembles introduced in Blaser and Fryzlewicz (2016)

, it is natural to consider the implementation of oblique splits, i.e., the splits are not necessarily axis-aligned. More precisely, for each tree, by sampling a uniformly distributed rotation matrix (e.g. Section 3 of

Blaser and Fryzlewicz (2016)), we are able to construct the decision tree by using the rotated sub-dataset (or equivalently, one can also implement randomly rotated cuts in the tree’s construction). Intuitively speaking, the rotation variants of WRF will be more consistent when it comes to performance, while the additional computing resources are required for both training and prediction.

Another direction is to replace the Dirac mass in the empirical measures by some kernel , as proposed in Pospisil and Lee (2019). For instance, the -WRF can be modified by using the following splitting criteria:

where the kernel is chosen according to prior knowledge of the problem. At the same time, the final prediction will be replaced by

where remains the same as defined in Section 2.1 of the main text. When the associated -distance is easy to compute, we expect that this extension will be more accurate for small datasets. Nevertheless, the performances of these natural extensions are still not clear. The details are therefore left for future research.

Appendix D Algorithms

The detailed algorithms for the WRF used in the main text are provided below.

Require: Training dataset , number of trees , subsample size