1 Introduction
In recent years the ability of machines to solve increasingly more intricate tasks has grown exponentially. At the core of this revolution (Sejnowski, 2018), there is the staggering predictive power of machine learning algorithms. However, prediction does not imply causation (Lechner, 2019). In social and health sciences the largest part of scientific research questions deals with inferring a causal relationship (e.g., evaluating the impact of a policy, the effects of drug, the returns from a marketing or business strategy and so on). Moreover, following the growing availability of large datasets, the necessity to deal with problems connected with potentially heterogeneous treatment effects is stronger than what was observed in the past. The availability of large datasets makes it possible to investigate and, in turn, customize the causal effect estimates for population subsets as well as for individuals (Athey, 2018). In this scenario, machine learning techniques are increasingly used to address causal inference tasks and, in particular, to estimate heterogeneous causal effects (Foster et al., 2011; Hill, 2011; Su et al., 2012; Green and Kern, 2012; Athey and Imbens, 2016; Hahn et al., 2017; Wager and Athey, 2018; Lee et al., 2018; Lechner, 2019). A growing literature seeks to apply supervised machine learning techniques to the problem of estimating heterogeneous treatment effects. In their seminal contributions, Hill (2011) and Foster et al. (2011) propose to directly apply machine learning algorithms to estimate the unit level causal effect as a function of the units’ attributes. In other papers, machine learning algorithms are adapted to estimate the heterogeneous causal effects (Athey and Imbens, 2016; Athey et al., 2016; Hahn et al., 2017; Wager and Athey, 2018; Lechner, 2019). However, most of these techniques are tailored for causal inference in settings where the treatment is randomly assigned to the units and do not address imperfect compliance issues. Nevertheless, in the real world, the implementation of policies or interventions often results in imperfect compliance, which makes the policy evaluation complicated. Imperfect compliance may arise in observational studies where the assignment to the treatment can be different from the receipt of the treatment (e.g., individuals are randomly assigned to a treatment, but not all the units that are assigned to it actually receive it). Recently, some algorithms have been proposed to deal with imperfect compliance (Athey et al., 2016; Hartford et al., 2016; Wang et al., 2018; Bargagli Stoffi and Gnecco, 2019). However, these methods exhibit three principal limitations: (i) random forestbased algorithms for causal inference require large samples to converge to a good asymptotic behaviour for the estimation of causal effects, as shown in Hahn et al. (2018b) and Wendling et al. (2018)
; (ii) deep learningbased algorithms lack interpretability of the machine learning blackbox which can expose them to critiques, especially in the context of social sciences; (iii) the algorithms proposed by
Wang et al. (2018) and Bargagli Stoffi and Gnecco (2018, 2019) are based on single learning algorithms that perform worse as compared to multiple learning algorithms (i.e., ensemble methods)^{1}^{1}1Ensemble methods have extensively been shown to outperform single learning algorithms in prediction tasks (Van der Laan et al., 2007).Moreover, in machine learning applications, inference and uncertainty quantification are of secondary importance after predictive performance. However, in policy decision settings it is crucial to know the credibility and variance of the counterfactual predictions
(Athey et al., 2016).To address and accommodate these shortcomings this paper innovates the literature in both a methodological and an empirical perspective. First, we develop a machine learning algorithm tailored to draw causal inference in situations where the assignment mechanism is irregular, namely the assignment depends on the observed and unobserved potential outcomes (Imbens and Rubin, 2015). This methodology contributes to the increasing use of machine learning techniques to draw causal inference (Athey and Imbens, 2017). In particular, we propose to modify a machine learning technique, namely, Bayesian Causal Forests, developed for causal inference goals (Hahn et al., 2017) to fit an instrumental variable setting (Angrist et al., 1996). The proposed method, Bayesian Instrumental Variable Causal Forest (BCFIV), is an ensemble semiparametric Bayesian regression model that directly builds on the Bayesian Additive Regression Trees (BART) algorithm (Chipman et al., 2010). BART is an ensembleoftrees approach to nonparametric regression (Starling et al., 2018), which is in turn a refined version of the random forest algorithm (Breiman, 2001): BART obtains more precise estimates both in noncausal inference scenarios (Chipman et al., 2010; Murray, 2017; Linero and Yang, 2018; Linero, 2018; Hernández et al., 2018; Starling et al., 2018) and in causal inference settings (Hill, 2011; Hahn et al., 2017, 2018b; Logan et al., 2019) by employing a full set of prior distributions on depth of the trees, on the noise and on outcome in their nodes.
Second, we evaluate the fit of the proposed algorithm by comparing it with two alternative machine learning methods explicitly developed to draw causal inference in the presence of irregular assignment mechanisms: namely, the Generalized Random Forests (GRF) algorithm (Athey et al., 2016) and the Honest Causal Trees with Instrumental Variables (HCTIV) algorithm (Bargagli Stoffi and Gnecco, 2019). Using Monte Carlo simulations, we evaluate each algorithm with respect to three dimensions: (i) the choice of the correct source of heterogeneity (i.e., the choice of the right splitting variable); (ii) the choice of the correct cutoff, in the case of a continuous splitting variable, and (iii) the estimation of the heterogeneous causal effects. These dimensions are consistent with recent evaluations of various machine learning methods for causal inference that highlight the excellence of Bayesian algorithms for causal inference (Hahn et al., 2018b; Wendling et al., 2018). We show that for each dimension, BCFIV outperforms both GRF and HCTIV in small samples and converges to an optimal asymptotic behaviour.
Third, we show how the proposed algorithm can be used for targeted policies, which are increasingly relevant as the call for personalized interventions has unfurled in all social sciences, especially in economics and management sciences (Athey and Imbens, 2017). The main objective of targeted policies studies is to inform policymakers about the best allocation of treatments to individuals or subpopulations (Kitagawa and Tetenov, 2018). The idea behind these policies is to target those observations that benefit the most from a certain intervention in order to get either of two possible welfare gains: (i) reducing the costs of an intervention with constant effect sizes, or (ii) increasing the intervention effects for given costs (Kleinberg et al., 2017).
Fourth, in an empirical application, BCFIV is used for the evaluation of an educational policy. The evaluation of educational policies is a promising field for the discovery of heterogeneous causal effects and, in turn, targeted policies. This is due to at least two factors: (i) in the education context, there is a clear source of heterogeneity given by the disparate profiles of schools and students; and (ii) it is possible to gather large (administrative) datasets. In a similar framework, machine learning provides a tailored, datadriven tool for the evaluation of the heterogeneity in the causal effects, and, consequently, the implementation of targeted policies. In particular, we evaluate the effects of additional resources for disadvantaged students on students’ performance in a fuzzy Regression Discontinuity Design (RDD) scenario (Hahn et al., 2001). The fuzzy RDD that we implement in this paper is described in detail in Section 4. By using a unique administrative dataset, we employ BCFIV to evaluate the heterogeneity in the effects of the ‘Equal Educational Opportunities Program’ promoted by the Flemish Ministry of Education starting from 2002. The program is aimed at providing additional funding for secondary schools with high share of disadvantaged students (De Witte et al., 2018). We focus on the effects of additional funding on two outcomes: (i) students’ performance, namely if a student gets the most favorable outcome (Acertificate); and (ii) students’ progresses to the following year without retention in their grades. The Flemish Ministry of Education provided us with data on the universe of pupils in the first stage of secondary education in the school year 2010/2011 with a total of 135,682 students. We obtained data on student level and school level characteristics. Moreover, this setting provides us with a quasiexperimental identification strategy since the additional funding is provided to schools based on being above or below an exogenously set threshold regarding the proportion of disadvantaged students. There is also a second, exogenously set, eligibility criterion stating that schools have to generate a minimum number of teaching hours. This provides us with an imperfect compliance setting, as not all the schools fulfill both the criteria, in which we are able to exploit a fuzzy regression discontinuity design to draw causal effects.
The results of our empirical application suggest that, although the effects of additional funding on the overall population of students are found to be not statistically significant^{2}^{2}2This is in line with further research of additional funding on school level outcomes (De Witte et al., 2018)., there is appreciable heterogeneity in the causal effects: the effects on students’ performance are positive and significant if we focus on the subpopulation of students in schools with younger teachers (namely, teachers younger than 40 years old) and less senior principals (namely, principals with less than 26 years of experience)^{3}^{3}3As we show in Section 4.3, these results are in line with the literature of aging on teachers’ performance.. These results can advise policymakers in multiple ways: the heterogeneous drivers could, on the one hand, help them enhancing the policy effectiveness by targeting just the schools with the highest shares of pupils that benefit the most from additional funding. On the other hand, policymakers could investigate more in depth the reason why some schools do not benefit from the policy and ultimately provide additional tools to these schools to enhance the policy outcomes.
The methodology proposed in this paper can be more widely applied to evaluations of the heterogeneous impact of an intervention in the presence of an irregular assignment mechanism in social and biomedical sciences^{4}^{4}4An R function for BCFIV is available upon request to the corresponding author..
The remainder of this paper is organized as follows: in Section 2 we provide a general overview on causal inference and the applied machine learning frameworks and we introduce our algorithm. In Section 3 we compare the performance of our algorithm with the performance of other methods already established in the literature. In Section 4 we depict the usage of our algorithm in an educational scenario to evaluate the heterogeneous causal effects of additional funding to schools. Section 5 discusses the results and highlights the further applications of heterogeneous causal effects discovery and targeted policies in education as well as in social and biomedical sciences.
2 Bayesian Instrumental Variable Causal Forest
2.1 Notation
This paper contributes to the literature by establishing a novel machine learning approach for the estimation of conditional causal effects in the presence of an irregular assignment mechanism.
We follow the standard notation of the Rubin’s causal model (Rubin, 1974, 1978; Imbens and Rubin, 2015). Given a set of units, indexed by , we denote with a generic outcome variable, with a binary treatment indicator and, with a matrix of control variables. Given the Stable Unit Treatment Value Assumption (SUTVA), that excludes interference between the treatment assigned to one unit and the potential outcomes of another (Imbens and Rubin, 2015), we can postulate the existence of a pair of potential outcomes: . Specifically, the potential outcome for a unit if assigned to the treatment is , and the potential outcome if assigned to the control is . We cannot observe for the same unit both the potential outcomes at the same time. However, we observe the potential outcome that corresponds to the assigned treatment: .
In order to draw proper causal inference in observational studies researchers need to assume strong ignorability to hold. This assumption states that:
(1) 
and
(2) 
where is the features space. The first assumption (unconfoundedness) rules out the presence of unmeasured confounders while the second condition needs to be invoked to be able to estimate the unbiased treatment effect on all the support of the covariates space. If these two conditions hold, we are in the presence of the socalled regular assignment mechanism. In such a scenario the Average Treatment Effect (ATE) can be expressed as:
(3) 
and one can define, following Athey and Imbens (2016), the Conditional Average Treatment Effect (CATE) simply as:
(4) 
CATE is central for targeted policies as it enables the researcher to investigate the heterogeneity in causal effects. For instance, we may be interested in assessing how the effects of an intervention vary within different subpopulations.
In observational studies, the assignment to the treatment may be different from the reception of the treatment. In these scenarios, where one allows for noncompliance between the treatment assigned and the treatment received, one can assume that the assignment is unconfounded, wherein the receipt is confounded (Angrist et al., 1996). In such cases, one can rely on an instrumental variable (IV), , to draw proper causal inference^{5}^{5}5
Throughout this Section and throughout the paper we assume the instrumental variable to be binary but, one could, in principle, relax this assumption. However, at the moment there are only a few studies that develop machine learning algorithms for the estimation of heterogeneous causal effects with continuous treatment variable: we leave the application of such algorithms to further research.
. can be thought as a randomized assignment to the treatment, that affects the receipt of the treatment , without directly affecting the outcome (exclusion restriction). Thus, one can then express the treatment received as a function of the treatment assigned: .If the classical four IV assumptions^{6}^{6}6See Appendix A for a detailed discussion of the four assumptions and how they are assumed to hold in our application reported in Section 4. (Angrist et al., 1996) hold, one can get the causal effect of the treatment on the subpopulation of compliers, the socalled Complier Average Causal Effect (CACE), that is:
(5) 
CACE is also sometimes referred as LATE (Local Average Treatment Effects) and represents the estimate of causal effect of the assignment to treatment on the principal outcome, , for the subpopulation of compliers (Imbens and Rubin, 2015). In this paper we consider the following conditional version of CATE. The conditional CACE,
, can be thought as the CACE for a subpopulation of observations defined by a vector of characteristics
:(6) 
2.2 Estimating Conditional Causal Effects with Machine Learning
In recent years, various algorithms have been proposed to estimate conditional causal effects (i.e, CATE and ). Most algorithms focus on the estimation of CATE (Hill, 2011; Su et al., 2012; Green and Kern, 2012; Athey and Imbens, 2016; Hahn et al., 2017; Wager and Athey, 2018; Lee et al., 2018; Lechner, 2019) while just a few (Athey et al., 2016; Hartford et al., 2016; Wang et al., 2018; Bargagli Stoffi and Gnecco, 2019) focus on the estimation of . In this paper, we propose an algorithm for the estimation of CATE in an irregular assignment mechanism scenario. In particular, we adapt the Bayesian Causal Forest (BCF) algorithm (Hahn et al., 2017) for such a task. BCF was originally proposed for regular assignment mechanisms. This algorithm builds on the Bayesian Additive Regression Trees (BART) algorithm (Chipman et al., 2010) which in turn is a Bayesian version of an ensemble of Classification and Regression Trees (CART) (Breiman et al., 1984)^{7}^{7}7Chipman et al. (2010) highlight how their algorithm is different from other ensemble methods such as the Random Forest algorithm (Breiman, 2001)..
CART is a widely used algorithm for the construction of binary trees (namely, trees where each node is splitted into only two branches). A binary tree is constructed by splitting a node into two child nodes repeatedly, beginning with the root node that contains the whole learning sample and proceeding with the splits to the final nodes (leaves). Figure 3 illustrates how the binary partitioning works in practice in a simple case with just two regressors and .
(Right) The corresponding partition of the sample space.
Binary trees are named classification trees when the outcome variable can take a discrete set of values, and regression trees when the outcome variable takes continuous values. The CART algorithm associates, for every individual belonging to a partition of the feature space, a conditional prediction for the outcome variable. The task of a binary tree is to estimate the conditional expectation of the observed outcome, on the basis of the information on features and outcomes for units in the training sample, and to compare the resulting estimates on a test sample to tune the complexity of the tree, in order to minimize the “error”^{8}^{8}8There are various “error” measures used to optimize binary trees. The most widely used are the meansquarederror for regression trees and the entropy or the Gini index for classification trees. between the true and estimated values of within each partition.
The accuracy of the predictions of binary trees, , can be dramatically improved by iteratively constructing the trees. A Random Forest (RF) consists in an ensemble of trees, where each tree is constructed by randomly sampling the observations and randomly drawing the covariates (predictors, in the machine learning literature) that are used to build each tree (Breiman, 2001). One of the main problems of RFs is that they tend to “overfit” the data on which they are trained. “Overfitting” leads to a scarce generalizability of the predictions on samples different from the training set. In order to avoid this, Bayesian Additive Regression Trees were proposed by Chipman et al. (2010).
BART, as well as BCF, are “refined versions” of the RF algorithm. BART is, as the RF, a sumoftrees ensemble algorithm, but its estimation approach used to obtain the values of
relies on a fully Bayesian probability model
(Kapelner and Bleich, 2013). In particular, the BART model can be expressed as:(7) 
where the distinct binary trees are denoted by ^{9}^{9}9 represents the entire tree: its structure, its nodes and its leaves (terminal nodes). .
The Bayesian component of the algorithm is incorporated in a set of three different priors on: (i) the structure of the trees (this prior is aimed at limiting the complexity of any single tree
and works as a regularization device); (ii) the probability distribution of data in the nodes (this prior is aimed at shrinking the node predictions towards the center of the distribution of the response variable
); (iii) the error variance (which bounds away from very small values that would lead the algorithm to overfit the training data)^{10}^{10}10The choice of the priors, and the derivation of the posterior distributions, is discussed in depth by Chipman et al. (2010) and Kapelner and Bleich (2013). Namely, (i) the prior on the probability that a node will split at depth is where (these hyperparameters are generally chosen to be and); (ii) the prior on the probability distribution in the nodes is a normal distribution with zero mean:
where and can be used to calibrate the plausible range of the regression function; (iii) the prior on the error variance is where is determined from the data in a way that the BART will improve 90% of the times the RMSE of an OLS model.. The aim of these priors is to “regularize” the algorithm, preventing single trees to dominate the overall fit of the model (Kapelner and Bleich, 2013). Moreover, BART allows the researcher to tune the variables’ importance by departing from the original formulation of the Random Forest algorithm where each variable is equally likely to be chosen from a discrete uniform distribution (with probability
) to build a single tree learner. These Bayesian tools give researchers the possibility to mitigate the “overfitting” problem of RFs and to tune the algorithm with prior knowledge. Give these characteristics BART has shown particular flexibility and an excellent performance in both prediction tasks (Murray, 2017; Linero and Yang, 2018; Linero, 2018; Hernández et al., 2018; Starling et al., 2018) and in causal inference tasks (Hill, 2011; Hahn et al., 2017; Logan et al., 2019).Thus far, the algorithms that we discussed were tailored to find heterogeneity in the response variable , but are not developed to estimate the heterogeneity in the causal effects. The BCF algorithm proposed by Hahn et al. (2017) is a semiparametric Bayesian regression model that directly builds on BART. It however introduces some significant changes in order to estimate heterogeneous treatment effects in regular assignment mechanisms (even in the presence of strong confounding). The principal novelties of this model are the expression of the conditional mean of the response variable as a sum of two functions and the introduction, in the BART model specification for causal inference, of an estimate of the propensity score, , in order to improve the estimation of heterogeneous treatment effects^{11}^{11}11It is important to highlight that the propensity score is not used to estimate the causal effects but to moderate the distortive effects in treatment heterogeneity discovery due to strong confounding. Moreover, since BCF includes the entire predictors’ vector, , even if the propensity score is misspecified or poorly estimated, the model allows for the possibility that the response remains correctly specified (Hahn et al., 2017). In Appendix B, we show that even if the estimate of the propensity score is incorrectly specified the results are still widely robust.. As depicted from the results of the Atlantic Causal Inference Conference (ACIC) competition in 2016 and 2017, reported by Hahn et al. (2018b), it was observed that BCF performs dramatically better than other machine learning algorithms for causal inference in the presence of randomized and regular assignment mechanisms.
2.3 Extending BCF to an IV Scenario: Bayesian Instrumental Variable Causal Forest
Bargagli Stoffi and Gnecco (2019) show that a naïve application of methods developed for the estimation of heterogeneous causal effects in randomized or regular assignment mechanisms would introduce a large bias in the estimation of the heterogeneous causal effects in imperfect compliance settings. Moreover, the authors show that this would lead to very imprecise heterogeneous causal effects estimators. This reason drives the need for a new algorithm, the Bayesian Instrumental Variable Causal Forest (BCFIV), tailored for causal inference on heterogeneous effects in the presence of irregular mechanisms.
The BCFIV algorithm is constructed in two steps:

Discovering heterogeneity for the conditional intentiontotreat ();

Estimation of the conditional CACE () within the subpopulations defined in the first step.
We will discuss these two steps in detail in the next Section.
2.3.1 Heterogeneity in the Conditional ITT
The BCFIV algorithm starts from modifying (7) to adapt it for the estimation of the intentiontotreat, by including the instrumental variable :
(8) 
where, for simplicity, we assume the error to be a mean zero additive noise as in Hill (2011); Hahn et al. (2017); Logan et al. (2019). The conditional expected value can be expressed as:
(9) 
and in turn the conditional intentiontotreat, , is:
(10) 
Then, adapting to an irregular assignment mechanism the model proposed by Hahn et al. (2017), we adopt the following functional form for (9):
(11) 
where is the estimated propensity score for the instrumental variable:
(12) 
The expression of as a sum of two functions is central: the first component of the sum, , directly models the impact of the control variables on the conditional mean of the response (the component that is independent from the treatment effects) while the second component models directly the intentiontotreat effect as a nonlinear function of the observed characteristics (this second components captures the heterogeneity in the intentiontotreat). Both the functions and are given independent priors. These priors are chosen in line with Hahn et al. (2017) to be for the first component the same priors of Chipman et al. (2010) (see Section 2.2). However, for the second component the priors are changed in a way that allows for less deep, hence simpler trees^{12}^{12}12The depth penalty parameters are set to be and (instead of and )..
The expression of as a sum of two functions has a double effect: (i) on the one hand, it allows the algorithm to learn which component in the heterogeneity of the conditional mean of the outcome is driven by a direct effect of the control variables and which component is the true heterogeneity in the effects of the assignment to the treatment on ; (ii) on the other hand, it allows the predictions of the treatment effect driven by the BART to be modelled directly and separately with respect to the impact of the control variables (Hahn et al., 2017).
The estimated propensity score, in the BCF model, is not used for the estimation of the effects but is included, as an additional covariate, in the first component of (11) to mitigate possible problems connected to regularization induced confounding (RIC)^{13}^{13}13RIC is analyzed in depth in Hahn et al. (2018a). RIC issues rise when the ML algorithm used for regularizing the coefficient does not shrink to zero some coefficients due to a nonzero correlation between and resulting in an additional degree of bias that is not under the researcher’s control. and targeted selection^{14}^{14}14Targeted selection refers to settings where the treatment (or in an IV scenario the assignment to the treatment) is assigned based on an exante prediction of the outcome conditional on some characteristics . We refer to Hahn et al. (2017) for a discussion of targeted selection problems.. In our application in Section 4, none of these problems are present since the specification of the propensity score for is known by the researcher (namely, ). However, in scenarios where the instrumental variable is not randomized exante, the inclusion of leads to an improvement in the discovery of the heterogeneity in the causal effect (Hahn et al., 2018b). Furthermore, it is important to highlight that choosing a misspecified definition of does not impact in a significant way the quality of the results as shown in Appendix B. This is due to the fact that this first step of our algorithm is not about directly estimating the conditional CACE but is tailored to discover the heterogeneity in .
Once one estimated with the BCFIV the unitlevel intentiontotreat, one can build a simple binary tree, using a CART model Breiman (1984), on the fitted values () to discover the drivers of the heterogeneity.
2.3.2 Estimation of Conditional CACE
Once the heterogeneous patterns in the intentiontotreat are learned from the algorithm, one can estimate the conditional CACE, . To do so, one can simply use the method of moments estimator in (6) within all the different subpopulations that were detected in the previous step.
The conditional CACE can be estimated in a generic subsample (i.e., for each , where is a generic node of the tree, like a nonterminal node or a leaf) as:
(13) 
where is estimated as:
(14) 
and as:
(15) 
where (where is the number of observations with in the subsample of observations with ^{15}^{15}15It is worth highlighting that, since the supervised machine learning technique is used just in the discovery phase and not in the estimation phase, the estimators that are proposed here could be used in a more “traditional way”, in settings where the subgroups are defined exante by the researcher (e.g. grade retention, teacher seniority and principal seniority). .
To show in detail how this second step works let us use a toy example. Let’s imagine a simple heterogeneity structure for where and is a single regressor. This is namely, the case where the average intentiontotreat for those individuals with positive values of is greater than for individuals with nonpositive values. Then, the conditional CACE can be estimated in the two different subpopulations defined with respect to as^{16}^{16}16Alternatively, one can perform a Two Stage Least Squares (TSLS) regression within the different subpopulations. This is our preferred estimation strategy and is the one used both for the simulations and the real application.:
(16) 
2.4 Properties of the Conditional CACE Estimator
In the case of a binary instrument () and a binary treatment variable (), Angrist et al. (1996) and Imbens and Rubin (2015) revealed that the population versions of (13)(15) correspond to a Two Stage Least Squares (henceforth referred as TSLS) estimator of , in the cases where the four IV assumptions can be assumed to hold. Hence, since this case is analogous to our setting, one can apply the TSLS method in every node of the tree for the estimation of the effect on the compliers population, as it is presented by Imbens and Rubin (1997).
The two simultaneous equations of the TSLS estimator are, in the population,
(17)  
(18) 
where , and ^{17}^{17}17The latter comes from the fact that (18) is assumed to represent the linear projection of onto .. In the econometric terminology, the explanatory variable is , while the IV variable is .
We can express the TSLS equations, conditional on a subpopulation of a node , as
(19)  
(20) 
where , and .
Moreover, the following reduced equation (obtained plugging (20) into (19)) holds:
(21) 
In the case of a single instrument, the logic of IV regression is that one can estimate the respective parameters and of the regressions (20) and (2.4) above by least squares, when the observations in each node are independent and identically distributed, then obtaining an estimate of the parameter in (19). In particular, for every element of a node , one can estimate through TSLS, as the following ratio (Imbens and Rubin, 2015):
(22) 
The TSLS estimator associated with (19)(2.4) satisfies the next properties. They can be proved likewise in the application of TSLS to the population case (see, e.g., Imbens and Rubin (2015)).
Theorem 1: Consistency of the Conditional TSLS Estimator.
Let (Assumption 1), (Assumption 2) and (Assumption 3) hold. Then
(23) 
where denotes convergence in probability, and is the number of observations within the node .
Theorem 2: Asymptotic Normality of the Conditional TSLS Estimator.
Let Assumptions 1, 2, and 3 hold. Let also be finite (Assumption 4). Then
(24) 
where denotes convergence in distribution, stands for normal distribution, and is the asymptotic variance of the TSLS estimator.
The proofs of the two Theorems above directly follow from their unconditional versions^{18}^{18}18For further details on these proofs we refer to (Wooldridge, 2002, Section 5.2).. In this case, we want to stress that in order for the convergence of our estimator to and its normality to hold approximately we need to have a sufficient number of observations within every node. Hence, we suggest to perform our algorithm on sufficiently large datasets and to trim those nodes where the number of observations is not large enough.
3 Monte Carlo Simulations
To evaluate the performance of the BCFIV algorithm we compare it, using Monte Carlo Simulations, with two methods that are directly tailored for drawing causal inference in irregular assignment mechanism scenarios: the Honest Causal Trees with Instrumental Variable (HCTIV) algorithm (Bargagli Stoffi and Gnecco, 2019) and the Generalized Random Forests (GRF) algorithm (Athey et al., 2016). Both the latter algorithms outperform other machine learning algorithms which are not tailored for irregular assignment mechanisms (Bargagli Stoffi and Gnecco, 2019), so we focus, in this context, just on a comparison within these three algorithms.
Since the foremost focus of this paper is on discovery of the heterogeneity in the causal effects, we compare the algorithms on three dimensions: (i) the correct choice of the variable that drives the heterogeneity (heterogeneity driving variable [HDV]), (ii) the correct choice of the threshold value used to perform the binary split of the data given the right identification of HDV, and (iii) the meansquared error for the heterogeneous causal effects given the correct choice of HDV.
For Monte Carlo Simulations we build two different designs. The functional forms of the designs are built following the simulation designs in Wang et al. (2018). The first design takes the form of where , and . The interaction term between the regressor and the treatment indicator is functional to heterogenise the treatment effects, while the nuisance parameter is an unobserved variable that affects both and the response variable . The second design has the same functional form but . In both the designs we set the correlations between , the instrument and the nuisance parameter to be: and , while assumes values 5 and 10 and the sample sizes are 500, 1,000, 5,000. For both designs the results are aggregated over 30 rounds of simulations.
The results from the simulations are shown in Table 1
. As shown in Panel A, the correct identification of the HDV is very similar for BCFIV and GRF in the designs with 500 and 5,000 units. GRF is asymptotically faster in identifying the right HDV, as it outperforms both BCFIV and HCTIV when the sample size is 1,000. Panel B depicts the results in terms of mean squared error between the true and the predicted threshold used to perform the binary split of the data. The threshold is not available in Design 2 where the regressors are binary variables. BCFIV outperforms both GRF and HCTIV with all the sample sizes and with both 5 and 10 features (with the exception of Design 1 in the sample of 5,000 units with 5 features). Panel C depicts the mean squared error of prediction for the causal effects given the correct identification of the HDV. Another clear advantage of using BCFIV is given by the correct identification of the treatment effects. Indeed, BCFIV outperforms, in terms of lower meansquarederror of prediction for the treatment effects, the other algorithms in both the designs with 5 and 10 features (with the exception of Design 1 in the samples of 5,000 units with 5 features and 500 units with 10 features). Hence, in a scenario with binary regressors, BCFIV is preferable irrespective of the sample size. However, in a scenario with continuous regressors
^{19}^{19}19For instance, when the regressors are distributed according to a standardized normal distribution., there is a tradeoff between the capacity of getting the right HDV and the capacity of correctly estimating the causal effect. In designs with samples sizes of 1,000, GRF outperforms BCFIV in correctly identifying the HDV but fails to precisely estimate the causal effect. As the sample size increases, particularly in the scenario with 5 features, both the algorithms get to the same asymptotic results in terms of correct HDV identification and the meansquarederror of prediction.We argue that, in small samples, BCFIV would be preferable to GRF because, while the proportion of correctly identified HDVs is very similar, the gains obtained both in terms of meansquarederror between the true and predicted threshold and the true and predicted causal effects are much larger. In fact, the relative gap^{20}^{20}20The formula for the relative gap is, for the MSE of prediction, the following (Wang et al., 2018):
Sample Size  
#Features  Approach  500  1,000  5,000  500  1,000  5,000 
Design 1  Design 2  
HDV  
5  BCFIV  0.57  0.57  1.00  0.90  0.96  1.00 
GRF  0.63  0.83  1.00  0.93  1.00  1.00  
HCTIV  0.43  0.40  0.70  0.53  0.56  0.86  
10  BCFIV  0.30  0.33  0.73  0.53  0.93  1.00 
GRF  0.23  0.43  1.00  0.50  0.96  1.00  
HCTIV  0.17  0.30  0.77  0.20  0.63  0.83 
Sample Size  
#Features  Approach  500  1,000  5,000  500  1,000  5,000 
Design 1  Design 2  
Threshold  
5  BCFIV  0.062  0.037  0.006       
GRF  0.063  0.069  0.002        
HCTIV  0.185  0.188  0.045        
10  BCFIV  0.046  0.014  0.017       
GRF  0.190  0.125  0.040        
HCTIV  0.096  0.023  0.167       
Sample Size  
#Features  Approach  500  1,000  5,000  500  1,000  5,000 
Design 1  Design 2  
Causal Effects  
5  BCFIV  0.047  0.026  0.002  0.048  0.005  0.010 
GRF  0.330  0.030  0.001  0.055  0.006  0.011  
HCTIV  0.067  0.051  0.012  0.048  0.005  0.010  
10  BCFIV  0.017  0.021  0.020  0.036  0.005  0.002 
GRF  0.230  0.197  0.128  0.190  0.105  0.012  
HCTIV  0.013  0.039  0.046  0.036  0.005  0.002 
In Appendix B, we provide a number of robustness checks of the Monte Carlo simulation. In particular, we focus on what happens to the fit of the three algorithms when one: (i) changes the correlation between and (possible weakinstrument problems)^{21}^{21}21
It is important to highlight that in order to avoid weakinstrument problems within a node our algorithm performs a weakinstrument test in every subsample (namely, an Ftest on the first stage regression) and discards the nodes where the null hypothesis of weak instrument is not rejected.
; (ii) introduces a violation in the exclusion restriction; (iii) changes the specification of the propensity score for the BCFIV; (iv) introduces multiple heterogeneity variables; (v) changes the error distribution. The results that we highlighted before hold true also in the robustness checks: BCFIV converges slowly to an optimal identification of the HDVs but largely outperforms GRF with respect to the meansquarederror of prediction for the causal effects^{22}^{22}22However, when we introduce a partial violation of the exclusion restriction assumption (design 2) we see exactly the opposite: BCFIV outperforms GRF with respect to the identification of the correct HDV while GRF outperforms BCFIV in precisely estimating the causal effects.. Moreover, the performance of BCFIV does not seem to widely deteriorate, as compared to the baseline models in Table 1, in any of the robustness designs.4 Heterogeneous Causal Effects of Education Funding
There is a wide consensus that education positively influences labor market outcomes (see the review by Psacharopoulos and Patrinos (2018)). Students’ performance can be driven by multiple factors connected with students’ characteristics and environmental characteristics. However, to the best of our knowledge, this is the first paper to study the impact of additional school funding on students’ performance using machine learning techniques tailored for causal inference. In this Section we apply the BCFIV algorithm to evaluate the impact and estimate the heterogeneity in the effects of additional funding to schools with disadvantaged students on students’ performance. First, we describe the data used for this application. Next, we depict the identification strategy. Finally, we describe the results obtained and their relevance in the economics of education literature.
4.1 Data
Starting from the year 2002, the Flemish Ministry of Education promoted the “Equal Educational Opportunities” program (henceforth referred as EEO) to ensure equal educational opportunities to all the students (OECD, 2017). The EEO program provides additional funding for secondary schools with a significant share of disadvantaged students. Owing to the funding schools can hire additional teachers and increase the number of teaching hours. Pupils are considered to be disadvantaged on the basis of five different indicators: (i) the pupil lives outside the family; (ii) the pupil does not speak Dutch as a native language; (iii) the mother of the pupil does not have a secondary education degree; (iv) the pupil receives educational grant guaranteed for low income families; and (v) one of the parents is part of the travelling population. In order for a school to be eligible for the EEO funding, it needs to satisfy two conditions: the first condition is that the share of students with at least one of the five characteristics has to exceed an exogenously set threshold; the second condition is that the school has to generate at least six teaching hours. The exogenous cutoff is, for students in the first two years of secondary education (first stage students), a minimum share of 10% disadvantaged students.
The Flemish Ministry of Education provided us with data on the universe of pupils in the first stage of education in the school year 2010/2011 (135,682 students). In particular, we have data on student level characteristics and school level characteristics. The student level characteristics cover the gender of the pupil (gender), the grade retention in primary school (retention) and the inclusion of the pupil in the special needs student population in primary school (which serves as a proxy of student’s low cognitive skills). The school level characteristics include both the teacher characteristics, such as the teachers’ age, seniority and education, in addition to principal characteristics, such as the principals’ age and seniority. Teacher and principal seniority measures the level of experience of the teachers and principals, respectively. These variables assume values in the range of 1 to 7, where the teachers (and principals) with a seniority level of 1 are the least experienced (05 years of experience) and teachers (and principals) with a seniority level of 7 are the most experienced (more than 30 years of experience)^{23}^{23}23Teachers and principals’ seniority classes are the following: class 1: between 0 and 5 years of experience; class 2: between 6 and 10; class 3: between 11 and 15; class 4: between 16 and 20; class 5: between 20 and 25; class 6: between 26 and 30; class 7: more than 30.
. Similarly, the ages of teacher and principal are reported as categorical variables that range from 1 to 8, where teachers/principals in the first category are the youngest (less than 30 years old) and teachers/principals in the last category are the oldest (more than 60 years old)
^{24}^{24}24Teachers and principals’ age classes are the following: class 1: less than 30 years old; class 2: between 30 and 34; class 3: between 35 and 39; class 4: between 40 and 44; class 5: between 45 and 49; class 6: between 50 and 54; class 7: between 55 and 60; class 8: more than 60.. Teachers’ education records whether or not the teacher holds a pedagogical training (in the following we will refer to it as “teacher training”). All these variables are aggregated at school level in the form of averages (for age and seniority) and shares (for teachers’ education) and assigned to each student with respect to the school where he/she is enrolled.The outcome variables are two dummy variables defined as follows: the variable
progress school assumes value 1 if the student progresses to the following year without any grade retention and 0 if not (this variable is a complement of school retention); the variable Acertificate assumes value 1 if the student gets an “Acertificate” at the end of the school year (which is the most favorable outcome) and 0 if not. Since we do not have data on standardized test scores for Flemish students, Acertificate is a good, available proxy of student performance. Every year, each student performs a final test and gets a ranking from “A” to “C”. Students that get an “A” can progress school without any restriction, while the students that get either “B” or “C” can progress school but only in specific programs or have some grade retention. Both these outcome variables are proxies for different levels of students’ performance: a positive Acertificate proxies for a higher level of performance than a positive progress school. In principle, the target of a policymaker could be to have the highest possible share of students getting “Acertificates” and the lowest share of students not progressing through school.4.2 Identification Strategy
To evaluate the impact of the policy on students’ performance we construct a fuzzy regression discontinuity design (Hahn et al., 2001). Regression Discontinuity Design (RDD) is a method that aims at evaluating the causal effects of interventions in settings where the assignment to the treatment is determined (at least partly) by the values of an observed covariate lying on either sides of a threshold point. The idea is that subjects just above and below this threshold are very similar and one can assume a quasirandomization around the threshold (Mealli and Rampichini, 2012). RDDs are categorized in sharp RDDs and fuzzy RDDs. In sharp RDDs, the central assumption is that, around the threshold, there is a sharp discontinuity (from 0 to 1) in the probability of being treated. This is due to the fact that in sharp RDDs there is no room for imperfect compliance. In the real world, however, thresholds are not strictly implemented, as in the case of our application. To deal with these situations, one can use fuzzy RDDs, which are applicable when around the threshold the probability of being actually treated changes sharply, but not discontinuously from 0 to 1. In our application of the fuzzy RDD technique, we exploit two cutoffs around the 10% share of disadvantaged students in the first stage of secondary education. The students in schools just below the threshold are assigned to the control group (), while the students in schools just above the threshold are assigned to the treated group (). The bandwidth around the threshold (from which one obtains the two cutoffs) is determined using the “rdrobust package” in R (Calonico et al., 2015). The optimal, biascorrected bandwidths around the threshold are 3.5% and 3.7%, respectively for the outcome variables Acertificate and progress school. Accordingly to these two bandwidths, we obtain two refined samples where the sample with the 3.5% bandwidth is the smallest and the sample with the 3.7% bandwidth the largest. Moreover, to guarantee an equal representation to all the schools, and to avoid biases related to the overrepresentation of biggest schools’ students, we sample 50 pupils from each school. In turn, this leads to a higher balance among the averages between the observations assigned to the treatment and the observations assigned to the control, as shown in panel (a) of Figure 11. In Appendix C, we run a series of tests to show that the RDD (Regression Discontinuity Design) is valid for this application. Moreover, as a robustness check we sample a higher number of students according to the size of the smallest school (62 pupils) from every school. In Appendix D, we show the balance in the samples of units assigned to the treatment and to the control in the second scenario.
4.3 Results
This Section assesses the effects of the additional funding on students’ performances and highlights the main drivers of the heterogeneity in causal effects. These analyses are made for both the outcome variables: Acertificate and progress school^{25}^{25}25It is important to highlight that the results for both the outcomes, considered separately, in terms of effects and heterogeneity drivers, remain roughly the same when we widen the sample of units included in the analysis (results are reported in the Appendix D)..
4.3.1 ACertificate
Proceeding from the seminal contributions of Coleman (1966) and Hanushek (2003) to recent contributions by Jackson et al. (2015) and Jackson (2018), the question on whether or not school spending affects students’ performances has been central in the economic literature.
In our study, the variable Acertificate serves as a proxy for positive performance. In our sample, the students that got an “Acertificate” are the 91.73% of the total population. In Figure 7, the heterogeneous Complier Average Causal Effects (CACE) estimated using the proposed model are depicted^{26}^{26}26The nodes for whom (i) it was not possible to compute the CACE or, (ii) the weakIV test was not rejected were excluded from the plot.^{27}^{27}27In Figures 7, 8, 14, 15, the socalled summarizing trees (Hahn et al., 2017) are depicted. A summarizing tree is a classification or regression tree that is built using the fitted values estimated from the BCFIV. These summarizing trees are used to provide a visualization of the heterogeneity in the causal effects.. The darker the shade of blue in the node the higher the causal effect.
Although positive, the overall effect of the additional funding is not significant. This finding is in line with the recent literature on school spending and students’ performance in a crosscountry scenario (Hanushek et al., 2016; Hanushek and Woessmann, 2017) and in the Flanders, in particular (De Witte et al., 2017, 2018). Nevertheless, rather than focusing on the overall average effect it is more interesting to explore the heterogeneous effects.
The first driver in the heterogeneity of the effects is the dummy variable primary retention: for students who have experienced retention of grade during primary school, the effects of funding are larger. These effects, even if not significant, show that the effect is slightly higher for students that had a weaker performance in the past. The second driver of heterogeneity is the age of the teacher: students in schools with younger teachers (namely, when teaching age assumes values lower or equal to 4 on a scale from 1 to 8, referring to teachers younger than 40 years old) have a significant increase in their performance if they did not experience any retention in primary school. This effect is positive and significant (although slightly lower: 0.06, meaning that being treated leads to an increase of 6% in the probability of getting the best grade) even if we rule out the conditioning on the students with no primary retention. This heterogeneity driver, namely, the age of the teacher, is particularly appreciable as there are evidences in the education literature that connect teachers’ age to their teaching performance (Young and Place, 1988; Kinney and Smith, 1992) and in turn teaching performance to students’ positive achievements (Kosgei et al., 2013). In the current scenario, it seems that the additional funding has a positive significant effect when the additional teaching hours are granted to schools with younger teachers.
Further heterogeneous effects come from the interaction between teacher’s age and principal’s seniority. The effect for students without primary retention in schools with younger teachers and principals with lower levels of seniority (namely, lower or equal than 5 on a 1 to 7 scale, referring to principals with less than 26 years of experience) is statistically significantly higher than the effects on the overall population. Again, this holds true even if we rule out the conditioning on the primary retention variable (the effect in this case is 0.18**, meaning that being treated leads to an increase of 18% in the probability of getting the best grade). This subpopulation of students that accounts for the 32% and 34% of the overall sample (respectively, when conditioning, or not, on the primary retention variable) shows effects that are between 3.2 and 3.5 times larger than the overall effect^{28}^{28}283.2 is the ratio between the conditional treatment effect for the subpopulation of students that did not experience primary retention, and that are in schools with younger teachers and principals (0.1651), corresponding to the bluer leaf in Figure 7, and the average treatment effect for the overall population (0.0511). 3.5 is the ratio between the conditional treatment effect for the subpopulation of students that are in schools with younger teachers and principals (0.1769) and the average treatment effect on the overall population (0.0511).. This evidence can be interpreted in the following way: the additional funding has a positive, but not statistically significant, effect in boosting the performance of students in the overall population, but it increases its effect in a notable way for those students in schools with younger teachers and less senior principals. These results, are in line with the evidence that additional school funding does not boost the performance of the overall population of students (Hanushek et al., 2016; Hanushek and Woessmann, 2017; De Witte et al., 2017, 2018) and with the literature that connects students’ achievements with teaching performance (Young and Place, 1988; Kinney and Smith, 1992) and, in turn, teaching performance with students’ performance (Kosgei et al., 2013). A novel evidence from the current research reveals that the causal effects are lower when the principal have more than 25 years of seniority. This finding opens up new fields for further investigation, in line with the newly established role of machine learning in the economic literature as a “theorydriving/theorytesting” tool (Mullainathan and Spiess, 2017).
These results are relevant to the policy as they furnish the instruments to policymakers to enhance the effects of additional funding on students’ performance. Indeed, on one side policymakers could target just students in school with positive, statistically significant effects reducing the overall costs of the policy and using the savings to experiment more effective policies in the other schools. On the other side, policymakers could analyze the reason of lack of the effectiveness of funding in schools with certain characteristics and implement policies to boost the effects of future funding.
4.3.2 Progress in School
The second outcome variable, progress school, assumes value 1 if the student progresses to the following year without any grade retention and 0 if not: roughly 98% of the students in the sample manage to progress in school in the first two years of secondary education. For the students unable to progress in school, this variable is used as a proxy of negative achievements. Therefore, it is quite interesting to understand if additional funding were effective in driving students away from negative performance. Figure 8 depicts the heterogeneous conditional CACEs: the darker the shade of green in the node, the higher the causal effect.
The additional funding has a slightly positive impact on the chance of progress school for the overall students in the sample^{29}^{29}29However, in the robustness checks this effect is slightly negative. In any case, also in the robustness checks the overall effect is not statistically significant.. This effect, as well as the heterogeneous causal effects are not significant (again, this is in line with what was found by De Witte et al. (2018) at school level). However, it is compelling to observe the main drivers of the heterogeneity in the causal effect. The first driver is the gender of the student: the effect seems to be positive for male students and negative for female students. This is particularly absorbing given the fact that 63% of the students that do not progress in school are males. The second driver of the heterogeneity in CACE is the principal seniority. As in the case of the previous outcome, students in schools with more senior principals (more than 25 years of experience) show lower causal effects. This holds true even when we do not condition on the gender of the students.
Furthermore, the added value of our algorithm is that it could enable policymakers to target just those units that benefit the most from the treatment and it provides an insight on possible inefficiencies in the allocation and/or usage of funding. From our analysis it seems that there is room for policies that support less senior principals since students in their schools show higher returns in terms of performance from additional funding^{30}^{30}30It is worth highlighting that the heterogeneity drivers are slightly different for the two outcomes that we observe. This could be due to the fact that Acertificate is a proxy of “positive” achievements, while progress school is a proxy of “less positive” achievements. Hence, the subpopulations that are affected by the funding are different for the two different outputs. Let us focus on positive treatment effects. On one hand, the funding positively affect those students that would not be “Astudents” in their absence. On the other hand, they have a positive effect on those students that would not progress school in absence of the funding. These two subpopulations are clearly different and in turn the heterogeneity drivers are different..
5 Conclusion and Discussion
In this paper, we developed a novel Bayesian machine learning technique, BCFIV, to draw causal inference in scenarios with imperfect compliance. By investigating the heterogeneity in the causal effects, the technique expedites targeted policies. We manifested that the BCFIV technique outperforms other machine learning techniques tailored for causal inference in precisely estimating the causal effects and converges to an optimal asymptotic performance in identifying the heterogeneity driving variables (HDVs). Moreover, we show that the competitive advantages of using BCFIV, as compared to GRF or HCTIV, are substantial. Peculiarly, the performance of BCFIV in precisely estimating the heterogeneous causal effects shadows its slower convergence to an optimal identification of HDVs as compared to GRF. This is especially true if we look at the relative gaps between the BCFIV and the other techniques.
BCFIV can assist the researchers to shed light on the heterogeneity of causal effects in IV scenarios in order to provide to policymakers a relevant knowledge for targeted policies. In our application, we evaluated the effects of additional funding on students’ performances. While the overall effects are positive but not significant, there are significant differences among different subpopulations of students. Indeed, for students in schools with younger teachers and younger principals (with respect to the average age and seniority, respectively) the effects of the policy are between 3.2 and 3.5 times bigger than the effects on the overall population (in the most conservative scenario) and significant for the Acertificate output.
On one hand, as an underlying mechanism, the need for additional funds can be higher in schools with younger teachers and principals, who are more often observed in the most disadvantaged schools. This phenomenon arises as senior teachers and principals select themselves out of the most disadvantaged schools and more into advantaged schools, thereby creating relatively more vacancies in disadvantaged schools. Therefore, on average, younger teachers lack a real choice but to start teaching in the most disadvantaged schools. Moreover, owing to the additional funds, schools could use the funds to reduce class sizes, which might be more effective for younger (and less senior) teachers. On the other hand, we can think of the motivation for both teachers and principals to decrease as they grow older and this, in turn, have an impact on their performance. Favouring this hypothesis Ololube (2006) finds that motivation enhances productivity and has an impact on teachers’ performances. However, Arifin (2015) claims that teachers’ motivation positively and significantly affects teacher’s job satisfaction, but it does not affect their performance. The investigation of the true causal channel is beyond the goals of this paper and is left to further investigation where more granular teachers’ and principals’ characteristics are available.
It is worth highlighting in this context that policymakers could use the results from this study to target just those students and schools that really benefit from the additional funding in order to increase the welfare effects of the policy and to enhance targeted policies to boost the effects of additional funding for those students that seem to avail lesser benefits.
The extension of these methods to other fields of economic investigation and the development of novel machine learning algorithms for targeted policies and welfare maximization can form the future scope of further research. In particular, the development of an algorithm that could deal with welfare maximization in the context of multiple outcomes is of interest. Moreover, further research should focus on connecting BCFIV and GRF into a single ensemble algorithm, following (Van der Laan et al., 2007), to obtain a novel algorithm that combines the small and large sample properties of both BCFIV and GRF to obtain possible gains in imperfect compliance scenarios.
References
 Angrist et al. (1996) Angrist, J.D., Imbens, G.W., Rubin, D.B., 1996. Identification of causal effects using instrumental variables. Journal of the American Statistical Association 91, 444–455.
 Arifin (2015) Arifin, H.M., 2015. The influence of competence, motivation, and organisational culture to high school teacher job satisfaction and performance. International Education Studies 8, 38–45.

Athey (2018)
Athey, S., 2018.
The impact of machine learning on economics, in: The Economics of Artificial Intelligence: An Agenda. University of Chicago Press.
 Athey and Imbens (2016) Athey, S., Imbens, G., 2016. Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences 113, 7353–7360.
 Athey and Imbens (2017) Athey, S., Imbens, G.W., 2017. The state of applied econometrics: Causality and policy evaluation. Journal of Economic Perspectives 31, 3–32.
 Athey et al. (2016) Athey, S., Tibshirani, J., Wager, S., 2016. Generalized random forests. arXiv preprint, arXiv:1610.01271 .

Bargagli Stoffi and Gnecco (2018)
Bargagli Stoffi, F.J., Gnecco, G.,
2018.
Estimating heterogeneous causal effects in the
presence of irregular assignment mechanisms.
In Proceedings of the 5th IEEE Conference in Data Science and Advanced Analytics , 1–10.
 Bargagli Stoffi and Gnecco (2019) Bargagli Stoffi, F.J., Gnecco, G., 2019. Causal tree with instrumental variable: An extension of the causal tree framework to irregular assignment mechanisms. International Journal of Data Science and Analytics DOI: https://doi.org/10.1007/s4106.
 Breiman (1984) Breiman, L., 1984. Classification and regression trees. Routledge, New York, New York.
 Breiman (2001) Breiman, L., 2001. Random forests. Machine Learning 45, 5–32.
 Breiman et al. (1984) Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J., 1984. Classification and regression trees. Belmont, CA: Wadsworth & Brooks .
 Calonico et al. (2015) Calonico, S., Cattaneo, M.D., Titiunik, R., 2015. rdrobust: An r package for robust nonparametric inference in regressiondiscontinuity designs. R Journal 7, 38–51.
 Chipman et al. (2010) Chipman, H.A., George, E.I., McCulloch, R.E., et al., 2010. Bart: Bayesian additive regression trees. The Annals of Applied Statistics 4, 266–298.
 Coleman (1966) Coleman, J.S., 1966. Equality of educational opportunity. Washington DC: US Government Printing Office , 1–32.
 De Witte et al. (2018) De Witte, K., Smet, M., D’Inverno, G., 2018. The effect of additional resources for schools with disadvantaged students: Evidence from a conditional efficiency model. Steunpunt Sono Research Report .
 De Witte et al. (2017) De Witte, K., Smet, M., Van Assche, R., 2017. The impact of additional funds for schools with disadvantaged students: a regression discontinuity design. Steunpunt Sono Research Report .
 Foster et al. (2011) Foster, J.C., Taylor, J.M., Ruberg, S.J., 2011. Subgroup identification from randomized clinical trial data. Statistics in Medicine 30, 2867–2880.
 Green and Kern (2012) Green, D.P., Kern, H.L., 2012. Modeling heterogeneous treatment effects in survey experiments with bayesian additive regression trees. Public Opinion Quarterly 76, 491–511.
 Hahn et al. (2001) Hahn, J., Todd, P., Van der Klaauw, W., 2001. Identification and estimation of treatment effects with a regressiondiscontinuity design. Econometrica 69, 201–209.

Hahn et al. (2018a)
Hahn, P.R., Carvalho, C.M.,
Puelz, D., He, J., et al.,
2018a.
Regularization and confounding in linear regression for treatment effect estimation.
Bayesian Analysis 13, 163–182.  Hahn et al. (2018b) Hahn, P.R., Dorie, V., Murray, J.S., 2018b. Atlantic causal inference conference (acic) data analysis challenge 2017.
 Hahn et al. (2017) Hahn, P.R., Murray, J.S., Carvalho, C.M., 2017. Bayesian regression tree models for causal inference: Regularization, confounding, and heterogeneous effects. arXiv preprint, arXiv:1706.09523 .
 Hanushek (2003) Hanushek, E.A., 2003. The failure of inputbased schooling policies. The Economic Journal 113, F64–F98.
 Hanushek et al. (2016) Hanushek, E.A., Machin, S.J., Woessmann, L., 2016. Handbook of the Economics of Education. Elsevier.
 Hanushek and Woessmann (2017) Hanushek, E.A., Woessmann, L., 2017. School resources and student achievement: A review of crosscountry economic research, in: Cognitive abilities and educational outcomes. Springer, pp. 149–171.
 Hartford et al. (2016) Hartford, J., Lewis, G., LeytonBrown, K., Taddy, M., 2016. Counterfactual prediction with deep instrumental variables networks. arXiv preprint, arXiv:1612.09596 .
 Hernández et al. (2018) Hernández, B., Raftery, A.E., Pennington, S.R., Parnell, A.C., 2018. Bayesian additive regression trees using bayesian model averaging. Statistics and Computing 28, 869–890.
 Hill (2011) Hill, J.L., 2011. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics 20, 217–240.
 Imbens and Rubin (1997) Imbens, G.W., Rubin, D.B., 1997. Estimating outcome distributions for compliers in instrumental variables models. The Review of Economic Studies 64, 555–574.
 Imbens and Rubin (2015) Imbens, G.W., Rubin, D.B., 2015. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press.
 Jackson (2018) Jackson, C.K., 2018. Does School Spending Matter? The New Literature on an Old Question. Technical Report. National Bureau of Economic Research.
 Jackson et al. (2015) Jackson, C.K., Johnson, R.C., Persico, C., 2015. The effects of school spending on educational and economic outcomes: Evidence from school finance reforms. The Quarterly Journal of Economics 131, 157–218.
 Kapelner and Bleich (2013) Kapelner, A., Bleich, J., 2013. Bartmachine: Machine learning with bayesian additive regression trees. arXiv preprint, arXiv:1312.2171 .
 Kinney and Smith (1992) Kinney, D.P., Smith, S.P., 1992. Age and teaching performance. The Journal of Higher Education 63, 282–302.
 Kitagawa and Tetenov (2018) Kitagawa, T., Tetenov, A., 2018. Who should be treated? Empirical welfare maximization methods for treatment choice. Econometrica 86, 591–616.
 Kleinberg et al. (2017) Kleinberg, J., Lakkaraju, H., Leskovec, J., Ludwig, J., Mullainathan, S., 2017. Human decisions and machine predictions. The Quarterly Journal of Economics 133, 237–293.
 Kosgei et al. (2013) Kosgei, A., Mise, J.K., Odera, O., Ayugi, M.E., 2013. Influence of teacher characteristics on students’ academic achievement among secondary schools. Journal of Education and Practice 4, 76–82.
 Van der Laan et al. (2007) Van der Laan, M.J., Polley, E.C., Hubbard, A.E., 2007. Super learner. Statistical Applications in Genetics and Molecular Biology 6.
 Lechner (2019) Lechner, M., 2019. Modified causal forests for estimating heterogeneous causal effects. CEPR Discussion Paper No. DP13430 .
 Lee and Lemieux (2010) Lee, D.S., Lemieux, T., 2010. Regression discontinuity designs in economics. Journal of Economic Literature 48, 281–355.
 Lee et al. (2018) Lee, K., Small, D.S., Dominici, F., 2018. Discovering effect modification and randomization inference in air pollution studies. arXiv preprint, arXiv:1802.06710 .
 Linero (2018) Linero, A.R., 2018. Bayesian regression trees for highdimensional prediction and variable selection. Journal of the American Statistical Association 113, 626–636.
 Linero and Yang (2018) Linero, A.R., Yang, Y., 2018. Bayesian regression tree ensembles that adapt to smoothness and sparsity. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 80, 1087–1110.
 Logan et al. (2019) Logan, B.R., Sparapani, R., McCulloch, R.E., Laud, P.W., 2019. Decision making and uncertainty quantification for individualized treatments using bayesian additive regression trees. Statistical Methods in Medical Research 28, 1079–1093.
 McCrary (2008) McCrary, J., 2008. Manipulation of the running variable in the regression discontinuity design: A density test. Journal of Econometrics 142, 698–714.
 Mealli and Rampichini (2012) Mealli, F., Rampichini, C., 2012. Evaluating the effects of university grants by using regression discontinuity designs. Journal of the Royal Statistical Society: Series A (Statistics in Society) 175, 775–798.
 Mullainathan and Spiess (2017) Mullainathan, S., Spiess, J., 2017. Machine learning: an applied econometric approach. Journal of Economic Perspectives 31, 87–106.
 Murray (2017) Murray, J.S., 2017. Loglinear bayesian additive regression trees for categorical and count responses. arXiv preprint, arXiv:1701.01503 .
 OECD (2017) OECD, 2017. Educational opportunity for all: Overcoming inequality throughout the life course.
 Ololube (2006) Ololube, N.P., 2006. Teachers job satisfaction and motivation for school effectiveness: An assessment. Essays in Education 18, 1–10.
 Psacharopoulos and Patrinos (2018) Psacharopoulos, G., Patrinos, H.A., 2018. Returns to investment in education: a decennial review of the global literature. Education Economics 26, 445–458.
 Rubin (1974) Rubin, D.B., 1974. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology 66, 688.
 Rubin (1978) Rubin, D.B., 1978. Bayesian inference for causal effects: The role of randomization. The Annals of Statistics , 34–58.
 Sejnowski (2018) Sejnowski, T.J., 2018. The deep learning revolution. MIT Press.
 Starling et al. (2018) Starling, J.E., Murray, J.S., Carvalho, C.M., Bukowski, R., Scott, J.G., 2018. Functional response regression with funbart: an analysis of patientspecific stillbirth risk. arXiv preprint, arXiv:1805.07656 .
 Su et al. (2012) Su, X., Kang, J., Fan, J., Levine, R.A., Yan, X., 2012. Facilitating score and causal inference trees for large observational studies. Journal of Machine Learning Research 13, 2955–2994.
 Wager and Athey (2018) Wager, S., Athey, S., 2018. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association 113, 1228–1242.
 Wang et al. (2018) Wang, G., Li, J., Hopp, W.J., 2018. An instrumental variable tree approach for detecting heterogeneous treatment effects in observational studies. Available at SSRN 3045327 .
 Wendling et al. (2018) Wendling, T., Jung, K., Callahan, A., Schuler, A., Shah, N., Gallego, B., 2018. Comparing methods for estimation of heterogeneous treatment effects using observational data from health care databases. Statistics in Medicine 37, 3309–3324.
 Wooldridge (2002) Wooldridge, J.M., 2002. Econometric analysis of cross section and panel data. MIT Press, Cambridge, Massachusetts.
 Young and Place (1988) Young, I.P., Place, A.W., 1988. The relationship between age and teaching performance. Journal of Personnel Evaluation in Education 2, 43–52.
Appendix A Discussion on the Instrumental Variables Assumptions
In a typical IV scenario one can express the treatment received as a function of the treatment assigned: . This leads to distinguish four subpopulations of units () (Angrist et al., 1996; Imbens and Rubin, 2015): (i) those that comply with the assignment (compliers: and ); (ii) those that never comply with the assignment (defiers: and ); (iii) those that even if not assigned to the treatment always take it (alwaystakers: ); (iv) those that even if assigned to the treatment never take it (nevertakers: ). In such a scenario what “one directly gets from the data” is the socalled IntentionToTreat ():
(25) 
which is defined as the effect of the intention to treat a unit on the outcome of the same unit. (25) can be written as the weighted average of the intentiontotreat effects across the four subpopulations of compliers, defiers, alwaystakers and nevertakers:
(26) 
where is the effect of the treatment assignment on units of type and is the proportion of units of type .
does not represent the effect of the treatment itself but just the effect of the assignment to the treatment. If we want to draw proper causal inference in such a scenario we need to invoke the four classical IV assumptions (Angrist et al., 1996):

exclusion restriction: where, for each subpopulation and , the shortened notation is used to denote

monotonicity: ;

existence of compliers: ;

unconfoundedness of the instrument:
.
In our application, these four assumptions are assumed to hold. Let us look at them in detail. The exclusion restriction is assumed to hold since we can reasonably rule out a direct effect of being eligible (around the threshold) on the performance of students. The effect, in this case, can be reasonably assumed to go through the actual reception of additional funding. Monotonicity holds by design: since we are in a onesided noncompliance scenario there is no possibility for those who are not assigned to the treatment to defy and get the treatment. The same can be said about the existence of compliers. Since the subpopulation of alwaystakers can be ruled out by design, this leads to the fact that units receiving the treatment are compliers. Unconfoundedness of the instrument can also reasonably be assumed to hold since observations around the exogenous threshold are as good as if they were randomized to the assignedtothetreatment group and the assignedtothe control group. This holds true especially since we do not observe any manipulation around the threshold and sorting of the units into the treated group.
Appendix B Robustness Checks in Monte Carlo Simulations
We introduce some changes in the synthetic models used to test the fit of the BCFIV (as compared to GRF and HCTIV). The model from which we start is the simplest model introduced in Section 3: where , and . We introduce 5 different variations in this model (each one corresponds to a different design in Table 5):

we change the correlation between and in order to introduce possible weakinstrument problems: we decrease the correlation to and we do so by introducing in half of the population a very weak instrument ;

we introduce a partial violation in the exclusion restriction;

we introduce multiple heterogeneity driving variables (HDVs):
(27) where this variation is introduced to test if the HDVs are correctly selected even when they are multiple;

we change the error distribution, , to test if the algorithm is robust to changes in the noise parameter;

we manipulate the propensity score function for the BCFIV, to test if this model is robust to a misspecification of .
The results from these five different designs are reported in 1. In the presence of a weakinstrument (design 1), the fit of all the three algorithms deteriorates. As we saw in the Monte Carlo simulations in Section 3, GRF is better in identifying the correct HDV but BCFIV outperforms both GRF and HCTIV in picking the right threshold and in precisely estimating the causal effect. As we introduce a partial violation of the exclusion restriction (design 2), BCFIV outperforms the other algorithms with respect to all the dimensions both in the cases with small sample and large sample sizes. When we introduce multiple heterogeneity driving variables, the capacity of correctly estimating the causal effects for GRF deteriorates while BCFIV outperforms the other algorithms. In the last two designs (design 4 and 5), we again see a tradeoff, for the designs with 500 and 1,000 units, between the capacity of correctly identifying the HDV (GRF outperforms the other techniques) and precisely estimating the causal effects (BCFIV outperforms the other algorithms). In both the latter designs, BCFIV and GRF get to fairly similar asymptotic results.
Sample Size  

#Design  Approach  500  1000  5000  500  1000  5000  500  1000  5000 
HDV  Threshold  MSE given HDV  
1  BCFIV  0.37  0.63  0.83  0.065  0.031  0.007  0.078  0.117  0.017 
GRF  0.56  0.73  1.00  0.157  0.107  0.007  0.230  0.122  0.011  
HTCIV  0.23  0.36  0.73  0.289  0.090  0.065  0.110  0.140  0.022  
2  BCFIV  0.63  0.40  0.77  0.022  0.061  0.002  0.005  0.107  0.043 
GRF  0.43  0.57  0.23  0.052  0.067  0.003  0.171  0.082  0.035  
HTCIV  0.43  0.37  0.53  0.189  0.170  0.064  0.018  0.102  0.056  
3  BCFIV  0.60  0.76  1.00  0.352  0.242  0.021  0.183  0.077  0.004 
GRF  0.77  1.00  1.00  0.169  0.230  0.004  0.776  0.365  0.275  
HTCIV  0.53  0.63  0.73  0.323  0.289  0.180  0.312  0.116  0.031  
4  BCFIV  0.63  0.80  1.00  0.043  0.065  0.002  0.006  0.009  0.002 
GRF  0.80  0.97  1.00  0.068  0.014  0.001  0.211  0.031  0.001  
HTCIV  0.47  0.40  0.70  0.207  0.103  0.087  0.029  0.014  0.017  
5  BCFIV  0.53  0.63  1.00  0.112  0.020  0.016  0.018  0.011  0.007 
GRF  0.63  0.83  1.00  0.062  0.069  0.002  0.330  0.030  0.001  
HTCIV  0.43  0.40  0.70  0.185  0.188  0.045  0.067  0.051  0.012 
Appendix C RDD Checks
In order to check whether or not the RDD (Regression Discontinuity Design) setting is valid, we implement the following checks (Lee and Lemieux, 2010)^{31}^{31}31The checks depicted in this Subsection are made on the sample of 50 students introduced in Subsection 4.2.: (i) we check the balance in the sample of units assigned to the treatment just above and below the threshold (this is done to check if the randomization holds); (ii) we examine if there are manipulations in the distribution of schools with respect to the share of disadvantaged students around the threshold, (iii) we employ a formal manipulation test, the McCrary test (McCrary, 2008), to discover potential sorting around the threshold; (iv) we check if there is a discontinuity in the probability of being assigned to the treatment around the threshold. Table 2 shows that the averages of the control variables are not statistically different for the group of units assigned to the treatment and assigned to the control around the cutoff, with the exception of teacher seniority. Thus, there is evidence that more senior teachers selfselect in schools with lower disadvantaged students. However, as we will show in Section 4.3, this variable does not surface in any model as a heterogeneity driver. We argue that including this variable in our model results in more robust findings. This is due to the fact that our model is robust to spurious heterogeneity coming from unbalances in the samples, as shown by Hahn et al. (2017) in randomized and regular assignment mechanisms’ scenarios. Moreover, panel (b) of Figure 11
shows the standardized difference in the means for these two groups with the relative standardized confidence intervals. The McCrary manipulation test implemented in
Calonico et al. (2015) through a LocalPolynomial Density Estimation leads to the rejection of the null hypothesis of the threshold manipulation^{32}^{32}32The McCrary test leads to a Tvalue of 0.7497 corresponding to a pvalue of 0.4534. The test is performed aggregating the student data at school level.. Both these results and the plot of the distribution of schools with respect to the share of disadvantaged students around the threshold in Figure 12 indicate that there is no evidence of manipulation. Finally, Figure 13 shows a clear discontinuity in the probability of being assigned to the treatment around the threshold.However, as we pointed out in Section 4, schools that are assigned to the treatment actually receive the treatment if they satisfy an additional condition of a minimum of six teaching hours. This leads to a fuzzyregression discontinuity design where the jump in the probability of being assigned to the treatment around the cutoff is not sharp. This scenario is characterized by imperfect compliance.
Students can be sorted, with respect to their compliance status, into two types: (i) students in schools above the cutoff with more than six teaching hours or students in schools below the cutoff (compliers: or ); (ii) students in schools above the cutoff but with less than six teaching hours (nevertakers: )^{33}^{33}33This a socalled case of onesidednoncompliance, in which we do not observe any alwaystakers since for those that are sorted out of the assignment to the treatment () there is no possibility to access the treatment..
The assignment to the treatment variable (i.e., studying in a school just below or above the cutoff) is a relevant instrumental variable in our scenario (namely, the correlation between and is roughly 0.62). Moreover, we can reasonably assume both the exogeneity condition and the exclusion restriction to hold in this situation. On one side, since the randomization of the instrument holds there is no reason not to assume conditional independence between the instrument and the unobservables. On the other side, the exclusion restriction seems to hold as well since we can believe that being just below or above the threshold does not affect the performance of students in any way other than through the additional funding. In this imperfect compliance setting, the causal effect of the additional funding on the students’ performance can be assessed through the Complier Average Causal Effect in (5). Moreover, using our novel BCFIV algorithm we can estimate the Conditional Complier Average Causal Effect, (6), to assess the heterogeneity in the causal effects.
Above Cutoff  Below Cutoff  Full Sample  pvalue  
Retention  0.036  (0.187)  0.037  (0.189)  0.037  (0.188)  0.913 
Gender  0.492  (0.500)  0.471  (0.499)  0.482  (0.500)  0.155 
Special Needs  0.000  (0.000)  0.002  (0.044)  0.001  (0.030)  0.045 
Teacher Age  4.022  (0.333)  4.024  (0.269)  4.023  (0.304)  0.814 
Teacher Seniority  3.867  (0.452)  3.927  (0.342)  3.895  (0.404)  0.000 
Teacher Training  0.982  (0.025)  0.981  (0.026)  0.982  (0.026)  0.169 
Principal Age  6.022  (1.308)  5.951  (1.229)  5.988  (1.271)  0.067 
Principal Seniority  5.778  (1.228)  5.829  (0.935)  5.802  (1.098)  0.120 
Observations  2250  2050  4300 
Results for 3.5% discontinuity sample with bootstrapped samples of size 50. Standard deviations are in parentheses and the pvalue corresponds to a ttest for the difference between the means in the group above and below the threshold.
Appendix D Robustness Checks for Policy Evaluation
In this Section of the Appendix, we test the robustness of our models to sampling variations. The sampling variations introduced are of two sources: (i) a wider bandwidth around the threshold (from 3.5% to 3.7%) and (ii) an expansion in the number of sampled units (from 50 up to the lowest number of students per school which is 62). To understand if the balance and the results are robust we manifest the balance in the averages in the samples of units assigned to the treatment and assigned to the control (Tables 3, 4, 5) and the results of the causal effects when we increase the number of units sampled (Figures 14, 15).
In all the different samples the school level characteristics remain widely balanced (with the exception of teacher seniority^{34}^{34}34This could be due to the fact that less senior principals select themselves in schools with a lower percentage of disadvantaged students.). Primary retention and Gender seem to be slightly unbalanced when we widen the bandwidth, this however holds true just in the case where we sample through bootstrap 50 units (Gender in this case gets back to a good balance).
Above Cutoff  Below Cutoff  Full Sample  pvalue  
Gender  0.471  (0.499)  0.493  (0.500)  0.482  (0.499)  0.110 
Retention  0.039  (0.194)  0.035  (0.184)  0.037  (0.189)  0.418 
Special Needs  0.001  (0.039)  0.000  (0.000)  0.001  (0.027)  0.045 
Teacher Age  4.024  (0.269)  4.002  (0.333)  4.023  (0.304)  0.793 
Teacher Seniority  3.926  (0.341)  3.867  (0.452)  3.895  (0.404)  0.000 
Teacher Training  0.982  (0.025)  0.981  (0.026)  0.982  (0.026)  0.126 
Principal Age  5.951  (1.228)  6.002  (1.308)  5.988  (1.271)  0.041 
Principal Seniority  5.829  (0.934)  5.777  (1.227)  5.802  (1.097)  0.083 
Observations  2790  2542  5332 
Above Cutoff  Below Cutoff  Full Sample  pvalue  
Retention  0.030  (0.170)  0.042  (0.201)  0.036  (0.186)  0.025 
Gender  0.497  (0.500)  0.461  (0.499)  0.479  (0.500)  0.015 
Special Needs  0.000  (0.021)  0.001  (0.037)  0.001  (0.030)  0.309 
Teacher Age  4.022  (0.333)  4.023  (0.260)  4.022  (0.299)  0.955 
Teacher Seniority  3.867  (0.452)  3.932  (0.330)  3.899  (0.398)  0.000 
Teacher Training  0.982  (0.025)  0.983  (0.026)  0.983  (0.026)  0.805 
Principal Age  6.022  (1.308)  6.000  (1.206)  6.011  (1.259)  0.556 
Principal Seniority  5.778  (1.228)  5.818  (0.912)  5.798  (1.083)  0.212 
Observations  2250  2200  4450 
Above Cutoff  Below Cutoff  Full Sample  pvalue  
Retention  0.029  (0.168)  0.040  (0.196)  0.034  (0.182)  0.026 
Gender  0.490  (0.500)  0.464  (0.499)  0.477  (0.500)  0.058 
Special Needs  0.000  (0.019)  0.001  (0.038)  0.001  (0.030)  0.174 
Teacher Age  4.022  (0.333)  4.023  (0.260)  4.022  (0.299)  0.950 
Teacher Seniority  3.867  (0.452)  3.932  (0.330)  3.899  (0.398)  0.000 
Teacher Training  0.982  (0.025)  0.983  (0.026)  0.983  (0.026)  0.784 
Principal Age  6.022  (1.308)  6.000  (1.206)  6.011  (1.259)  0.512 
Principal Seniority  5.778  (1.227)  5.818  (0.912)  5.798  (1.083)  0.165 
Observations  2790  2728  5518 
With respect to the results of the BCFIV algorithm when we increase the number of sampled units they remain widely the same. There are two slight differences: (i) in the case of the summarizing tree for Acertificate, in Figure 14, for units with Primary retention equal to one the effect is now negative but still not significant (it is important to notice that these observations account just for the 4% of the overall units); (ii) in the case of the summarizing tree for Progress School the overall effect is now negative but still not significant. An important observation is that there are no differences in the heterogeneity drivers and in the direction of the effects (i.e., for more senior principals the effects for both the outcomes are still lower).
Comments
There are no comments yet.