Supervised machine learning plays an important role in many prediction problems. Based on a learning sample consisting of $N$ pairs of target value $y$ and predictors $x$, one learns a rule $r$ that predicts the status of some unseen $Y$ via $r(x)$ when only information about $x$ is available. Both the machine learning and statistics communities differentiate between “classification problems”, where the target $Y$ is a class label, and “regression problems” with conceptually continuous target observations $y$. In binary classification problems with $Y \in \{0, 1\}$ the focus is on rules $r$ for the conditional probability of $Y$ being $1$ given $x$, more formally $P(Y = 1 \mid X = x) = r(x)$. Such a classification rule is probabilistic in the sense that one can not only predict the most probable class label but also assess the corresponding probability. This additional information is extremely valuable because it allows an assessment of the rule’s uncertainty about its prediction. It is much harder to obtain such an assessment of uncertainty from most contemporary regression models, because the rule (or “regression function”) $r$ typically describes the conditional expectation $E(Y \mid X = x)$ but not the full predictive distribution of $Y$ given $x$. Thus, the prediction $r(x)$ only contributes information about the mean of some unseen target $Y$ but tells us nothing about other characteristics of its distribution. Without making additional restrictive assumptions, for example constant variances in normal distributions, the derivation of probabilistic statements from the regression function $r$ alone is impossible.
Contemporary random forest-type algorithms also strongly rely on the notion of regression functions describing the conditional mean only (for example Biau et al., 2008; Biau, 2012; Scornet et al., 2015), although the first random forest-type algorithm for the estimation of conditional distribution functions was published more than a decade ago (“bagging survival trees”, Hothorn et al., 2004). A similar approach was later developed independently by Meinshausen (2006) in his “quantile regression forests”. In contrast to a mean aggregation of cumulative hazard functions (Ishwaran et al., 2008) or densities (Criminisi et al., 2012), bagging survival trees and quantile regression forests are based on “nearest neighbour weights”. We borrow this term from Lin and Jeon (2006), where these weights were theoretically studied for the estimation of conditional means. The core idea is to obtain a “distance” measure based on the number of times a pair of observations is assigned to the same terminal node in the different trees of the forest. Similar observations have a high probability of ending up in the same terminal node whereas this probability is low for quite different observations. Then, the prediction for predictor values $x$ (either new or observed) is simply obtained as a weighted empirical distribution function (or Kaplan-Meier estimator in the context of right-censored target values) where those observations from the learning sample similar (or dissimilar) to $x$
in the forest receive high (or low/zero) weights, respectively. Although this aggregation procedure in the aforementioned algorithms is suitable for estimating predictive distributions, the underlying trees are not. The reason is that the ANOVA- or log-rank-type split procedures commonly applied are not able to deal with distributions in a general sense. Consequently, the splits favour the detection of changes in the mean, or have power against proportional hazards alternatives in survival trees. However, in general, they have very low power for detecting other patterns of heterogeneity (e.g., changes in variance) even if these can be explained by the predictor variables. A simple toy example illustrating this problem is given in Figure 1. Here, the variance of the target’s conditional normal distribution changes abruptly at a single split point of a uniform predictor. We fitted a quantile regression forest (Meinshausen, 2006, 2017) to the observations depicted in the figure along with ten additional independent uniformly distributed non-informative predictors (using trees without random variable selection; see Appendix “Computational Details”). The true conditional quantiles are not approximated very well by the quantile regression forest. In particular, the variance split does not play an important role in this model. Thus, although such an abrupt change in the distribution can be represented by a binary tree, the traditional ANOVA split criterion employed here was not able to detect this split.
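The weakness of a mean-based split criterion under a pure variance change can be checked numerically. The following is a minimal sketch, not the setup behind Figure 1: the sample size, the split point 0.5, and the two standard deviations are assumptions chosen for illustration, and `split_gains` is a hypothetical helper. It compares the relative reduction in the sum of squares (the ANOVA criterion) with the gain in a normal log-likelihood that also estimates the variance in each daughter node.

```python
import math
import random

def sse(values):
    """Sum of squared deviations from the mean (ANOVA impurity)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def gaussian_loglik(values):
    """Maximised normal log-likelihood with free mean and variance."""
    n = len(values)
    var = max(sse(values) / n, 1e-12)
    return -0.5 * n * (math.log(2.0 * math.pi * var) + 1.0)

def split_gains(x, y, cut):
    """Relative ANOVA gain and log-likelihood gain of splitting at `cut`."""
    left = [yi for xi, yi in zip(x, y) if xi <= cut]
    right = [yi for xi, yi in zip(x, y) if xi > cut]
    anova_gain = 1.0 - (sse(left) + sse(right)) / sse(y)
    loglik_gain = gaussian_loglik(left) + gaussian_loglik(right) - gaussian_loglik(y)
    return anova_gain, loglik_gain

random.seed(1)
n = 2000
x = [random.random() for _ in range(n)]
# Assumed toy setting: constant mean, standard deviation jumps from 1 to 3 at x = 0.5.
y = [random.gauss(0.0, 1.0 if xi <= 0.5 else 3.0) for xi in x]

anova_gain, loglik_gain = split_gains(x, y, 0.5)
print(f"relative SSE reduction: {anova_gain:.4f}")
print(f"log-likelihood gain:    {loglik_gain:.1f}")
```

Under these assumptions the sum-of-squares reduction at the true split point is close to zero, while the likelihood-based gain is large, which is the behaviour the toy example describes.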
To improve upon quantile regression forests and similar procedures in situations where changes in moments beyond the mean are important, we propose “transformation forests” for the estimation and prediction of conditional distributions for $Y$ given predictor variables $x$ and proceed in three steps. We first suggest understanding forests as adaptive local likelihood estimators (see Bloniarz et al., 2016, for a discussion of the special case of local linear regression). Second, we recap the most important features of the flexible and computationally attractive “transformation family” of distributions (Hothorn et al., 2014, 2017) which includes a variety of distribution families. Finally, we adapt the core ideas of “model-based recursive partitioning” (Zeileis et al., 2008, who also provide a review of earlier developments in this field) to this transformation family and introduce novel algorithms for “transformation trees” and “transformation forests” for the estimation of conditional distribution functions which potentially vary in the mean and also in higher moments as a function of predictor variables $x$. In our small example in Figure 1, these novel transformation trees and forests were able to recover the true conditional distributions much more precisely than quantile regression forests.
Owing to the fully parametric nature of the predictive distributions that can be obtained from these novel methods, model inference procedures, such as variable importances, independence tests or model-based resampling, can be formulated in a very general and straightforward way (Section 5). Some remarks on asymptotic properties are given in Section 6. The performance of transformation trees and forests is evaluated empirically on four artificial data generating processes and on survey data for body mass indices from Switzerland in Section 7. Details of the variable and split selection procedure in transformation trees as well as the corresponding theoretical complexity and empirical timings are discussed in Section 8.
2 Adaptive Local Likelihood Trees and Forests
We first deal with the unconditional distribution $P_Y$ of a target random variable $Y \in \mathcal{Y}$ and we restrict our attention to a specific probability model defined by the parametric family of distributions
$$P_{Y,\Theta} = \{P_{Y,\vartheta} \mid \vartheta \in \Theta\}$$
with parameters $\vartheta$ and parameter space $\Theta \subseteq \mathbb{R}^P$. With predictors $X \in \mathcal{X}$ from some predictor sample space $\mathcal{X}$, our main interest is in the conditional distribution $P_{Y \mid X = x}$ and we assume that this conditional distribution is a member of the family of distributions introduced above, i.e., we assume that a parameter $\vartheta(x) \in \Theta$ exists such that $P_{Y \mid X = x} = P_{Y,\vartheta(x)}$. We call $\vartheta: \mathcal{X} \to \Theta$ the “conditional parameter function” and the task of estimating the conditional distributions $P_{Y \mid X = x}$ for all $x$ reduces to the problem of estimating this conditional parameter function.
From the probability model $P_{Y,\Theta}$ we can derive the log-likelihood contribution $\ell_i: \Theta \to \mathbb{R}$ for each of $N$ independent observations $(y_i, x_i)$ from the learning sample for $i = 1, \dots, N$. We propose and study a novel random forest-type estimator $\hat{\vartheta}^N_{\text{Forest}}$ of the conditional parameter function $\vartheta$ in the class of adaptive local likelihood estimators of the form
$$\hat{\vartheta}^N(x) := \arg\max_{\vartheta \in \Theta} \sum_{i=1}^N w_i^N(x)\, \ell_i(\vartheta), \qquad w_i^N: \mathcal{X} \to \mathbb{R}^+, \tag{1}$$
where $w_i^N$ is the “conditional weight function” for observation $i$ given a specific configuration $x$ of the predictor variables (which may correspond to an observation from the learning sample or to new data). This weight measures the similarity of the two distributions $P_{Y \mid X = x_i}$ and $P_{Y \mid X = x}$ under the probability model $P_{Y,\Theta}$. The main idea is to obtain a large weight for observations $i$ which are “close” to $x$ in light of the model and essentially zero in the opposite case. The superscript $N$ indicates that the weight function may depend on the learning sample, and in fact the choice of the weight function is crucial in what follows.
Local likelihood estimation goes back to Brillinger (1977) in a comment to Stone (1977) and was the topic of Robert Tibshirani’s PhD thesis, published in Tibshirani and Hastie (1987). Early regression models in this class were based on the idea of fitting polynomial models locally within a fixed smoothing window. Adaptivity of the weights refers to an $x$-dependent, non-constant smoothing window, i.e., different weighting schemes are applied in different parts of the predictor sample space $\mathcal{X}$. An overview of local likelihood procedures was published by Loader (1999). Subsequently, we illustrate how classical maximum likelihood estimators, model-based trees, and model-based forests can be embedded in this general framework by choosing suitable conditional weight functions and plugging these into (1).
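The unconditional maximum likelihood estimator is based on unit weights $w^N_{\text{ML},i} :\equiv 1$ not depending on $x$, i.e., all observations in the learning sample are considered to be equally “close”; thus
$$\hat{\vartheta}^N_{\text{ML}} := \arg\max_{\vartheta \in \Theta} \sum_{i=1}^N \ell_i(\vartheta).$$
In contrast, model-based trees can adapt to the learning sample by employing rectangular splits to define a partition $\mathcal{X} = \bigcup_{b=1}^{B} \mathcal{B}_b$ of the predictor sample space. Each of the $B$ cells then contains a different local unconditional model. More precisely, the conditional weight function $w^N_{\text{Tree},i}$ is simply an indicator for $x_i$ and $x$ being elements of the same terminal node so that only observations in the same terminal node are considered to be “close”. The weight and parameter functions are
$$\begin{aligned} w^N_{\text{Tree},i}(x) &:= \sum_{b=1}^{B} I(x \in \mathcal{B}_b \wedge x_i \in \mathcal{B}_b), \\ \hat{\vartheta}^N_{\text{Tree}}(x) &:= \arg\max_{\vartheta \in \Theta} \sum_{i=1}^N w^N_{\text{Tree},i}(x)\, \ell_i(\vartheta). \end{aligned} \tag{2}$$
Thus, this essentially just picks the parameter estimate $\hat{\vartheta}^N_b$ from the $b$-th terminal node, which is associated with the cell $\mathcal{B}_b$ containing $x$, along with the corresponding conditional distribution $P_{Y, \hat{\vartheta}^N_b}$. Model-based recursive partitioning (MOB, Zeileis et al., 2008) is one representative of such a tree-structured approach.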
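For the normal family the weighted likelihood problem has a closed-form solution, which makes the role of the weights easy to see. The sketch below is an illustration only: the data, the single split at 0.5, and the helper names `weighted_normal_mle` and `tree_weights` are assumptions, not part of the method described here. Unit weights give the unconditional maximum likelihood estimate, and indicator weights give the estimate of the terminal node that contains the query point.

```python
import math

def weighted_normal_mle(y, w):
    """Maximise sum_i w_i * log N(y_i | mu, sigma^2); closed form for the normal family."""
    sw = sum(w)
    mu = sum(wi * yi for wi, yi in zip(w, y)) / sw
    var = sum(wi * (yi - mu) ** 2 for wi, yi in zip(w, y)) / sw
    return mu, math.sqrt(var)

def tree_weights(x0, x, cut=0.5):
    """Indicator of sharing a cell with x0 in a hypothetical one-split partition."""
    return [1.0 if (xi <= cut) == (x0 <= cut) else 0.0 for xi in x]

y = [1.0, 2.0, 3.0, 10.0, 12.0, 14.0]
x = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]

# Unit weights: the unconditional maximum likelihood estimator.
mu_ml, sd_ml = weighted_normal_mle(y, [1.0] * len(y))

# Tree weights: only observations in the same cell as the query point count.
mu_left, sd_left = weighted_normal_mle(y, tree_weights(0.25, x))
mu_right, sd_right = weighted_normal_mle(y, tree_weights(0.75, x))

print(mu_ml, mu_left, mu_right)
```

The same function accepts any non-negative weight vector, so forest-type weights can be plugged in without changing the estimator.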
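A forest of $T$ trees is associated with partitions $\mathcal{X} = \bigcup_{b=1}^{B_t} \mathcal{B}_{tb}$ for $t = 1, \dots, T$. The $b$-th terminal node of the $t$-th tree contains the parameter estimate $\hat{\vartheta}^N_{tb}$ and the $t$-th tree defines the conditional parameter function $\hat{\vartheta}^N_{\text{Tree},t}(x)$. We define the forest conditional parameter function via “nearest neighbour” forest weights
$$\begin{aligned} w^N_{\text{Forest},i}(x) &:= \sum_{t=1}^{T} \sum_{b=1}^{B_t} I(x \in \mathcal{B}_{tb} \wedge x_i \in \mathcal{B}_{tb}), \\ \hat{\vartheta}^N_{\text{Forest}}(x) &:= \arg\max_{\vartheta \in \Theta} \sum_{i=1}^N w^N_{\text{Forest},i}(x)\, \ell_i(\vartheta). \end{aligned} \tag{3}$$
The conditional weight function $w^N_{\text{Forest},i}(x)$ counts how many times $x_i$ and $x$ are elements of the same terminal node in each of the $T$ trees, i.e., captures how “close” the observations are on average across the trees in the forest. Hothorn et al. (2004) first suggested these weights for the aggregation of survival trees. The same weights have later been used by Lin and Jeon (2006) for estimating conditional means, by Meinshausen (2006) for estimating conditional quantiles and by Bloniarz et al. (2016) for estimating local linear models. An “out-of-bag” version only counts the contribution of the $t$-th tree for observation $i$ when $i$ was not used for fitting the $t$-th tree.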
Forests relying on the aggregation scheme (3) model the conditional distribution for some configuration $x$ of the predictors as $P_{Y \mid X = x} = P_{Y, \hat{\vartheta}^N_{\text{Forest}}(x)}$. In this sense, such a forest is a fully specified parametric model with (in-bag or out-of-bag) log-likelihood
$$\sum_{i=1}^N \ell_i\left(\hat{\vartheta}^N_{\text{Forest}}(x_i)\right)$$
allowing a broad range of model inference procedures to be directly applied as discussed in Section 5. Although this core idea seems straightforward to implement, we unfortunately cannot pick our favourite tree-growing algorithm and mix it with some parametric model, as two critical problems remain to be addressed in this paper. First, most of the standard tree-growing algorithms are not ready to be used for finding the underlying partitions because their variable and split selection procedures have poor power for detecting distributional changes which are not linked to changes in the mean, as was illustrated by the simple toy example presented in the introduction. Therefore, a tailored tree-growing algorithm inspired by model-based recursive partitioning, which is also able to detect changes in higher moments, is introduced in Section 4. The second problem is associated with the parametric families $P_{Y,\Theta}$. Although, in principle, all classical probability models are suitable in this general framework, different parameterizations render unified presentation and especially implementation burdensome. We address this second problem by restricting our implementation to a novel transformation family of distributions. Theoretically, this family contains all univariate probability distributions $P_Y$ and practically close approximations thereof. We highlight important aspects of this family and the corresponding likelihood function in the next section.
3 Transformation Models
A transformation model $P(Y \le y) = F_Y(y) = F_Z(h(y))$ describes the distribution function of $Y$ by an unknown monotone increasing transformation function $h$ and some a priori chosen continuous distribution function $F_Z$. We use this framework because simple, e.g., linear, transformation functions implement many of the classical parametric models whereas more complex transformation functions provide flexibility similar to that of models from the non-parametric world. In addition, discrete and continuous targets, also under all forms of random censoring and truncation, are handled in a unified way. As a consequence, our corresponding “transformation forests” will be applicable to a wide range of targets (discrete, continuous with or without censoring and truncation, counts, survival times) with the option to gradually move from simple to very flexible models for the conditional distribution functions $P(Y \le y \mid X = x)$.
In more detail, let $Z \sim P_Z$ denote an absolutely continuous random variable with density, distribution, and quantile functions $f_Z$, $F_Z$, and $F_Z^{-1}$, respectively. We furthermore assume $0 < f_Z(z) < \infty$ for all $z \in \mathbb{R}$, $F_Z(-\infty) = 0$ and $F_Z(\infty) = 1$ for a log-concave density $f_Z$ as well as the existence of the first two derivatives of the density $f_Z(z)$ with respect to $z$; both derivatives shall be bounded. We do not allow any unknown parameters for this distribution. Possible choices include the standard normal, the standard logistic and the standard minimum extreme value distribution with distribution functions $F_Z(z) = \Phi(z)$, $F_Z(z) = (1 + \exp(-z))^{-1}$, and $F_Z(z) = 1 - \exp(-\exp(z))$, respectively.
Let $\mathcal{H} = \{h: \mathcal{Y} \to \mathbb{R} \mid h(y_1) < h(y_2)\ \forall\, y_1 < y_2 \in \mathcal{Y}\}$ denote the space of all strictly monotone transformation functions. With the transformation function $h$ we can write $F_Y$ as $F_Y(y \mid h) = F_Z(h(y))$ with density $f_Y(y \mid h)$ and there exists a unique transformation function $h = F_Z^{-1} \circ F_Y$ for all distribution functions $F_Y$ (Hothorn et al., 2017). A convenient feature of characterising the distribution of $Y$ by means of the transformation function $h$ is that the likelihood for arbitrary measurements can be written and implemented in an extremely compact form.
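For a given transformation function $h$, the likelihood contribution of an observation $y \in \mathbb{R}$ is given by the corresponding density
$$\mathcal{L}(h \mid Y = y) = f_Z(h(y))\, h'(y).$$
The likelihood for intervals $(\underline{y}, \bar{y}] \subset \mathcal{Y}$ is, unlike in the above “exact continuous” case, defined in terms of the distribution function (Lindsey, 1996), where one can differentiate between three special cases:
$$\mathcal{L}(h \mid Y \in (\underline{y}, \bar{y}]) = \begin{cases} F_Z(h(\bar{y})) - F_Z(h(\underline{y})) & \text{interval-censored,} \\ 1 - F_Z(h(\underline{y})) & \text{right-censored,} \\ F_Z(h(\bar{y})) & \text{left-censored.} \end{cases}$$
For truncated observations in the interval $(y_l, y_r] \subset \mathcal{Y}$, the above likelihood contribution has to be multiplied by the factor $\left(F_Z(h(y_r)) - F_Z(h(y_l))\right)^{-1}$ when $y_l < \underline{y} < \bar{y} \le y_r$. A more detailed discussion of likelihood contributions to transformation models can be found in Hothorn et al. (2017).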
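The compactness of these likelihood contributions can be illustrated with a small sketch. It assumes a standard normal $F_Z$ and a linear transformation function with two coefficients; the function names and the example values are ours and only serve as an illustration.

```python
import math

def pnorm(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def dnorm(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def h(y, theta):
    """Assumed linear transformation h(y) = theta[0] + theta[1] * y, theta[1] > 0."""
    return theta[0] + theta[1] * y

def loglik_exact(y, theta):
    """Exact continuous observation: log f_Z(h(y)) + log h'(y)."""
    return math.log(dnorm(h(y, theta))) + math.log(theta[1])

def loglik_interval(lower, upper, theta):
    """Interval (lower, upper]; lower = -inf is left-censoring,
    upper = +inf is right-censoring."""
    p_upper = 1.0 if upper == math.inf else pnorm(h(upper, theta))
    p_lower = 0.0 if lower == -math.inf else pnorm(h(lower, theta))
    return math.log(p_upper - p_lower)

theta = (-2.0, 0.5)  # h(y) = (y - 4) / 2, i.e. Y ~ N(4, 2^2)
ll_exact = loglik_exact(4.0, theta)
ll_interval = loglik_interval(2.0, 6.0, theta)
ll_right = loglik_interval(4.0, math.inf, theta)
print(ll_exact, ll_interval, ll_right)
```

All three cases use the same transformation function, so switching between exact and censored observations does not require a different model formulation.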
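We parameterise the transformation function $h(y)$ as a linear function of its basis-transformed argument $y$ using a basis function $a: \mathcal{Y} \to \mathbb{R}^P$ such that $h(y) = a(y)^\top \vartheta$, $\vartheta \in \mathbb{R}^P$. In the following, we will write $h = a^\top \vartheta$ and assume that the true unknown transformation function is of this form. For continuous targets $Y$ the parameterisation $h(y) = a(y)^\top \vartheta$ needs to be smooth in $y$, so any polynomial or spline basis is a suitable choice for $a$. For the empirical experiments in Section 7 we employed Bernstein polynomials (for an overview see Farouki, 2012) of order $M$ ($P = M + 1$) defined on the interval $[\underline{\iota}, \bar{\iota}]$ with
$$a_{\text{Bs},M}(y) = (M + 1)^{-1} \left(f_{\text{Be}(1, M+1)}(\tilde{y}), \dots, f_{\text{Be}(m, M-m+1)}(\tilde{y}), \dots, f_{\text{Be}(M+1, 1)}(\tilde{y})\right)^\top \in \mathbb{R}^{M+1},$$
where $\tilde{y} = (y - \underline{\iota}) / (\bar{\iota} - \underline{\iota}) \in [0, 1]$ and $f_{\text{Be}(m, M)}$ is the density of the Beta distribution with parameters $m$ and $M$. This choice is computationally attractive because strict monotonicity can be formulated as a set of $M$ linear constraints on the parameters, $\vartheta_m < \vartheta_{m+1}$ for all $m = 0, \dots, M - 1$ (Curtis and Ghosh, 2011).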
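A Bernstein basis of this kind needs only a few lines of code. The sketch below uses the identity that the Beta$(m + 1, M - m + 1)$ density divided by $M + 1$ equals the Bernstein polynomial $\binom{M}{m} \tilde{y}^m (1 - \tilde{y})^{M - m}$; the coefficient values and the support $[15, 35]$ are assumptions chosen for illustration.

```python
import math

def bernstein_basis(y, order, lower, upper):
    """Bernstein basis of the given order on [lower, upper].

    Element m equals the Beta(m + 1, order - m + 1) density at the
    rescaled argument divided by (order + 1)."""
    t = (y - lower) / (upper - lower)
    return [math.comb(order, m) * t ** m * (1.0 - t) ** (order - m)
            for m in range(order + 1)]

def transformation(y, theta, lower, upper):
    """h(y) = a(y)^T theta for a Bernstein basis a of order len(theta) - 1."""
    basis = bernstein_basis(y, len(theta) - 1, lower, upper)
    return sum(a * th for a, th in zip(basis, theta))

# Increasing coefficients (order 5) give a strictly increasing h.
theta = [-3.0, -1.0, -0.5, 0.2, 1.0, 4.0]
grid = [15.0 + 0.25 * k for k in range(81)]  # hypothetical support [15, 35]
values = [transformation(y, theta, 15.0, 35.0) for y in grid]
print(values[0], values[-1])
```

Because the basis functions are non-negative and sum to one, the transformation function runs from the first coefficient at the lower bound to the last coefficient at the upper bound, and ordering the coefficients is enough to make it monotone.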
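The distribution family that transformation forests are based upon is called the transformation family of distributions $P_{Y,\Theta} = \{F_Z \circ a^\top \vartheta \mid \vartheta \in \Theta\}$ with parameter space $\Theta = \{\vartheta \in \mathbb{R}^P \mid a^\top \vartheta \in \mathcal{H}\}$ and transformation functions $a^\top \vartheta \in \mathcal{H}$. This family encompasses a wide variety of densities capturing different locations and shapes (including scale and skewness); see Figure 6 for an illustration of different body mass index distributions. The log-likelihood contribution for an observation $y_i \in \mathbb{R}$ is now the log-density of the transformation model, $\ell_i(\vartheta) = \log\left(f_Z(a(y_i)^\top \vartheta)\right) + \log\left(a'(y_i)^\top \vartheta\right)$.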
4 Transformation Trees and Forests
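Conceptually, the model-based recursive partitioning algorithm (Zeileis et al., 2008) for tree induction starts with the maximum likelihood estimator $\hat{\vartheta}^N_{\text{ML}}$. Deviations from such a given model that can be explained by parameter instabilities due to one or more of the predictors are investigated based on the score contributions. The novel “transformation trees” suggested here rely on the transformation family $P_{Y,\Theta}$ whose score contributions have relatively simple and generic forms. The score contribution of an “exact continuous” observation $y \in \mathbb{R}$ from an absolutely continuous distribution is given by the gradient of the log-density with respect to $\vartheta$
$$s(\vartheta \mid Y = y) = a(y)\, \frac{f_Z'(a(y)^\top \vartheta)}{f_Z(a(y)^\top \vartheta)} + \frac{a'(y)}{a'(y)^\top \vartheta}.$$
For an interval-censored observation $(\underline{y}, \bar{y}]$ the score contribution is
$$s(\vartheta \mid Y \in (\underline{y}, \bar{y}]) = \frac{f_Z(a(\bar{y})^\top \vartheta)\, a(\bar{y}) - f_Z(a(\underline{y})^\top \vartheta)\, a(\underline{y})}{F_Z(a(\bar{y})^\top \vartheta) - F_Z(a(\underline{y})^\top \vartheta)}.$$
Under truncation to the interval $(y_l, y_r]$, one needs to subtract the term $s(\vartheta \mid Y \in (y_l, y_r])$ from the score function.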
With the transformation model and thus the likelihood and score function being available, we start our tree induction with the global model $P_{Y, \hat{\vartheta}^N_{\text{ML}}}$. The hypothesis of all observations coming from this model can be written as the independence of the $P$-dimensional score contributions and all predictors, i.e.,
$$H_0: s(\hat{\vartheta}^N_{\text{ML}} \mid Y) \perp X.$$
This hypothesis can be tested either using asymptotic M-fluctuation tests (Zeileis et al., 2008) or permutation tests (Hothorn et al., 2006b; Zeileis and Hothorn, 2013) with appropriate multiplicity adjustment depending on the number of predictors. Rejection of $H_0$ leads to the implementation of a binary split in the predictor variable with the most significant association to the score matrix; algorithmic details are discussed in Section 8. Unbiasedness of a model-based tree with respect to variable selection is a consequence of splitting in the variable of highest association to the scores, where association is measured by the marginal multiplicity-adjusted $p$-value (for details see Hothorn et al., 2006b; Zeileis et al., 2008, and Section 8). The procedure is recursively iterated until $H_0$ cannot be rejected. The result is a partition of the sample space $\mathcal{X}$.
Based on the “transformation trees” introduced here, we construct a corresponding random forest-type algorithm as follows. A “transformation forest” is an ensemble of transformation trees fitted to subsamples of the learning sample and, optionally, a random selection of candidate predictors available for splitting in each node of the tree. The result is a set of partitions of the predictor sample space. The transformation forest conditional parameter function is defined by its nearest neighbour forest weights (3).
The question arises how the order $M$ of the parameterisation of the transformation function via Bernstein polynomials affects the conditional distribution functions estimated by transformation trees and forests. On the one hand, the basis with $M = 1$ only allows linear transformation functions of a standard normal and thus our models for $Y$ are restricted to the normal family. Both mean and variance may nevertheless depend on $x$, because the split criterion in transformation trees is sensitive to changes in both mean and variance. This simplest parameterisation leads to transformation trees and forests from which both the conditional mean and the conditional variance can be inferred. Using a higher order $M$ also allows modelling non-normal distributions. In the extreme case with $M = N - 1$ the unconditional distribution function interpolates the unconditional empirical cumulative distribution function of the target. With $M > 1$, the split criterion introduced in this section is able to detect changes beyond the second moment and, consequently, also higher moments of the conditional distributions may vary with $x$. An empirical comparison of transformation trees and forests with linear ($M = 1$) and nonlinear ($M = 5$) transformation function can be found in Section 7. Additional empirical properties of transformation models with larger values of $M$ are discussed in Hothorn (2018a).
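For the linear case, the link between the transformation parameters and the normal family can be written out explicitly. Assuming a standard normal $F_Z = \Phi$ and writing the linear transformation function as $h(y) = \vartheta_0 + \vartheta_1 y$ with $\vartheta_1 > 0$ (the indexing is chosen here for illustration), one obtains

$$P(Y \le y) = \Phi(\vartheta_0 + \vartheta_1 y) = \Phi\left(\frac{y - \mu}{\sigma}\right) \quad \text{with} \quad \mu = -\frac{\vartheta_0}{\vartheta_1}, \quad \sigma = \frac{1}{\vartheta_1},$$

so that $Y \sim N(\mu, \sigma^2)$. A split that changes $\vartheta_1$ therefore changes the variance, and a split that changes only $\vartheta_0$ shifts the mean while leaving the variance untouched.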
5 Transformation Forest Inference
In contrast to other random forest regression models, a transformation forest is a fully-specified parametric model. Thus, we can derive all interesting model inference procedures from well-defined probability models and do not need to fall back to heuristics. Predictions from transformation models are distributions $P_{Y, \hat{\vartheta}^N(x)}$ and we can describe these on the scale of the distribution, quantile, density, hazard, cumulative hazard, expectile, and any other characterising functions. Without aiming to be comprehensive, we introduce prediction intervals, a unified definition of permutation variable importance, the model-based bootstrap and a test for global independence in this section.
5.1 Prediction Intervals and Outlier Detection
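For some yet unobserved target $Y$ under predictors $x$, a two-sided $(1 - \alpha)$ prediction interval for $Y$ and some $\alpha \in (0, 1)$ can be obtained by numerical inversion of the conditional distribution $P_{Y, \hat{\vartheta}^N(x)}$, for example via
$$\text{PI}_\alpha(x \mid \hat{\vartheta}^N) = \left\{ y \in \mathcal{Y} \,\middle|\, \alpha / 2 < P(Y \le y \mid X = x, \hat{\vartheta}^N(x)) \le 1 - \alpha / 2 \right\}$$
with the property
$$P(Y \in \text{PI}_\alpha(x \mid \vartheta) \mid X = x) = 1 - \alpha.$$
The empirical level $P(Y \in \text{PI}_\alpha(x \mid \hat{\vartheta}^N) \mid X = x)$ depends on how well the parameters $\vartheta(x)$ are approximated by the forest estimate $\hat{\vartheta}^N_{\text{Forest}}(x)$. If for some observation $(y_i, x_i)$ the corresponding prediction interval excludes $y_i$, one can (at level $\alpha$) suspect this observed target of being an outlier.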
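Numerical inversion of a conditional distribution function can be done with simple bisection. The following sketch assumes that the fitted conditional distribution at some $x$ is available as a callable `cdf`; a normal distribution with mean 4 and standard deviation 2 stands in for a fitted model, and all names are hypothetical.

```python
import math

def pnorm(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def invert_cdf(cdf, p, lower, upper, tol=1e-10):
    """Bisection for the y in [lower, upper] with cdf(y) = p (cdf increasing)."""
    while upper - lower > tol:
        mid = 0.5 * (lower + upper)
        if cdf(mid) < p:
            lower = mid
        else:
            upper = mid
    return 0.5 * (lower + upper)

def prediction_interval(cdf, alpha, lower, upper):
    """Two-sided (1 - alpha) prediction interval from equal tail probabilities."""
    return (invert_cdf(cdf, alpha / 2.0, lower, upper),
            invert_cdf(cdf, 1.0 - alpha / 2.0, lower, upper))

# Stand-in for a fitted conditional distribution at some x: N(4, 2^2) (assumed).
def cdf(y):
    return pnorm((y - 4.0) / 2.0)

pi_lower, pi_upper = prediction_interval(cdf, 0.1, -50.0, 50.0)
print(pi_lower, pi_upper)

def is_suspected_outlier(y_obs):
    """Flag an observed target that falls outside the prediction interval."""
    return not (pi_lower < y_obs <= pi_upper)
```

The search bounds `lower` and `upper` must bracket the requested quantiles; for bounded targets the support of the transformation model is a natural choice.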
5.2 Permutation Variable Importance
The importance of a variable is defined as the amount of change in the risk function when the association between one predictor variable and the target is artificially broken. Permutation variable importances permute one of the predictors at a time (and thus also break the association to the remaining predictors, see Strobl et al., 2008). The risk function for transformation forests is the negative log-likelihood, thus a universally applicable formulation of variable importance for all types of target distributions in transformation forests is
where the $j$-th variable was permuted in $x_i^{(j)}$ for $i = 1, \dots, N$.
5.3 Model-Based Bootstrap
We suggest the model-based, or “parametric”, bootstrap to assess the variability of the estimated forest conditional parameter function $\hat{\vartheta}^N_{\text{Forest}}$ as follows. First, we fit a transformation forest and sample new target values $\tilde{y}_i \sim P_{Y, \hat{\vartheta}^N_{\text{Forest}}(x_i)}$ for each observation $x_i$ from this transformation forest. For these $N$ pairs of artificial targets $\tilde{y}_i$ and original predictors $x_i$, we refit the transformation forest. This procedure of sampling and refitting is repeated $B$ times. The resulting $B$ conditional parameter functions are a bootstrap sample from the distribution of conditional parameter functions assuming the initial $\hat{\vartheta}^N_{\text{Forest}}$ was the true conditional parameter function. The bootstrap distribution of $\hat{\vartheta}^N_{\text{Forest}}(x)$ or functionals thereof can be used to study their variability or to derive bootstrap confidence intervals (Efron and Tibshirani, 1993) for parameters $\vartheta(x)$ or other quantities, such as conditional quantiles.
5.4 Independence Likelihood-Ratio Test
The first question many researchers have is “Is there any signal in my data?”, or, in other words, is the target $Y$ independent of all predictors $X$? Classical tests, such as the $F$-test in a linear model or multiplicity-adjusted univariate tests, have very low power against complex alternatives, i.e., in situations where the impact of the predictors is neither linear nor marginally visible. Because transformation forests can potentially detect such structures, we propose a likelihood-ratio test for the null $H_0: Y \perp X$. This null hypothesis is identical to $P_Y = P_{Y \mid X = x}$ for all $x$ and reads $P_{Y,\vartheta} = P_{Y,\vartheta(x)}$, or even simpler, $\vartheta(x) \equiv \vartheta$ for the class of models we are studying. Under the null hypothesis, the unconditional maximum likelihood estimator $\hat{\vartheta}^N_{\text{ML}}$ would be optimal. It therefore makes sense to compare the log-likelihood of the unconditional model with the log-likelihood of the transformation forest using the log-likelihood ratio statistic
$$\text{logLR} = \sum_{i=1}^N \ell_i\left(\hat{\vartheta}^N_{\text{Forest}}(x_i)\right) - \sum_{i=1}^N \ell_i\left(\hat{\vartheta}^N_{\text{ML}}\right).$$
Under $H_0$ we expect small differences and under the alternative we expect to see larger log-likelihoods of the transformation forest. The null distribution of such likelihood-ratio statistics is hard to assess analytically but can be easily approximated by the model-based bootstrap (early references include McLachlan, 1987; Beran, 1988). We first estimate the unconditional model and, in a second step, draw $B$ samples of size $N$ from this model, i.e., we sample from the unconditional model, in this sense treating $\hat{\vartheta}^N_{\text{ML}}$ as the “true” parameter. In the $b$-th sample the predictors are identical to those in the learning sample and only the target values are replaced. For each of these $B$ samples we refit the transformation forest and obtain $\hat{\vartheta}^N_{\text{Forest},b}$. Based on this model we compute the log-likelihood ratio statistic
$$\text{logLR}_b = \sum_{i=1}^N \ell_{i,b}\left(\hat{\vartheta}^N_{\text{Forest},b}(x_i)\right) - \sum_{i=1}^N \ell_{i,b}\left(\hat{\vartheta}^N_{\text{ML},b}\right)$$
where $\ell_{i,b}$ is the log-likelihood contribution by the $i$-th observation from the $b$-th bootstrap sample. The $p$-value for $H_0$ is now $B^{-1} \sum_{b} I(\text{logLR}_b > \text{logLR})$. The size of this test in finite samples depends on the performance of transformation forests under $H_0$ and its power on the ability of transformation forests to detect non-constant conditional parameter functions $\vartheta(x)$. Empirical evidence for a moderate overfitting behaviour and a high power for detecting distributional changes are reported in Section 7.
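The mechanics of this bootstrap test can be sketched with a deliberately simple stand-in for the forest. In the code below the “conditional model” is a normal model fitted separately on both sides of one fixed split, which is an assumption made to keep the example short; it is not a transformation forest. The $p$-value is taken as the proportion of bootstrap statistics exceeding the observed one, and the data generating values are invented.

```python
import math
import random

def normal_loglik(values):
    """Maximised normal log-likelihood (mean and variance estimated)."""
    n = len(values)
    m = sum(values) / n
    var = max(sum((v - m) ** 2 for v in values) / n, 1e-12)
    return -0.5 * n * (math.log(2.0 * math.pi * var) + 1.0)

def log_lr(x, y, cut=0.5):
    """logLR of a stand-in conditional model (one fixed split) versus the
    unconditional normal model."""
    left = [yi for xi, yi in zip(x, y) if xi <= cut]
    right = [yi for xi, yi in zip(x, y) if xi > cut]
    return normal_loglik(left) + normal_loglik(right) - normal_loglik(y)

def bootstrap_p_value(x, y, n_boot, rng):
    observed = log_lr(x, y)
    n = len(y)
    mu = sum(y) / n
    sd = math.sqrt(sum((v - mu) ** 2 for v in y) / n)
    exceed = 0
    for _ in range(n_boot):
        # Predictors stay fixed; only targets are redrawn from the unconditional fit.
        y_star = [rng.gauss(mu, sd) for _ in range(n)]
        if log_lr(x, y_star) > observed:
            exceed += 1
    return exceed / n_boot

rng = random.Random(42)
x = [rng.random() for _ in range(400)]
# Assumed alternative: the variance, not the mean, depends on x.
y = [rng.gauss(0.0, 1.0 if xi <= 0.5 else 3.0) for xi in x]
p_value = bootstrap_p_value(x, y, n_boot=200, rng=rng)
print(p_value)
```

Replacing `log_lr` by a function that refits a forest on each bootstrap sample would give the procedure described above, at a correspondingly higher computational cost.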
6 Theoretical Evaluation
The theoretical properties of random forest-type algorithms are a contemporary research problem and we refer to Biau and Scornet (2016) for an overview. In this section we discuss how these developments relate to the asymptotic behaviour of transformation trees and transformation forests.
For $P_Y \in P_{Y,\Theta}$ the maximum likelihood estimator (unit weights $w_i^N \equiv 1$) is consistent and asymptotically normal (Hothorn et al., 2017). In the non-parametric setup, i.e., for arbitrary distributions $P_Y$, Hothorn et al. (2014) provide consistency results in the class of conditional transformation models. Based on these results, consistency and normality of the local likelihood estimator for an a priori known partition $\mathcal{X} = \bigcup_{b} \mathcal{B}_b$ is guaranteed as long as the sample size tends to infinity in all cells $\mathcal{B}_b$.
If the partition (transformation trees) or the nearest neighbour weights (transformation forests) are estimated from the data, established theoretical results on random forests (Breiman, 2004; Lin and Jeon, 2006; Meinshausen, 2006; Biau et al., 2008; Biau, 2012; Scornet et al., 2015) provide a basis for the analysis of transformation forests. Lin and Jeon (2006) first analysed random forests for estimating conditional means with adaptive nearest neighbour weights, where estimators for the conditional mean of the form
$$\hat{E}^N(Y \mid X = x) = \sum_{i=1}^N w_i^N(x)\, Y_i,$$
with weights normalised to sum to one, were shown to be consistent in non-adaptive random forests
$$\hat{E}^N(Y \mid X = x) \xrightarrow{P} E(Y \mid X = x)$$
as $N \to \infty$. Meinshausen (2006) showed a Glivenko-Cantelli-type result for conditional distribution functions
$$\sup_{y} \left| \sum_{i=1}^N w_i^N(x)\, I(Y_i \le y) - P(Y \le y \mid X = x) \right| \xrightarrow{P} 0$$
where the weights are obtained from Breiman and Cutler’s original random forest implementation (Breiman, 2001).
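In order to understand the applicability of these results to transformation forests, we define the expected conditional log-likelihood given $x$ for a fixed set of parameters $\vartheta$ as
$$E(\ell(\vartheta) \mid X = x) := E_{Y \mid X = x}\, \ell(\vartheta \mid Y)$$
where $\ell(\vartheta \mid Y)$ is the log-likelihood contribution by some observation $Y$. By definition, the true unknown parameter $\vartheta(x)$ has minimal expected risk and thus maximises the expected log-likelihood, i.e.,
$$\vartheta(x) = \arg\max_{\vartheta \in \Theta} E(\ell(\vartheta) \mid X = x).$$
Our random forest-type estimator of the expected conditional log-likelihood given $x$ for a fixed set of parameters $\vartheta$ is now
$$\hat{E}^N(\ell(\vartheta) \mid X = x) := \sum_{i=1}^N w^N_{\text{Forest},i}(x)\, \ell_i(\vartheta).$$
Under the respective conditions on the distribution of $X$ and the joint distribution of $(Y, X)$ given by Lin and Jeon (2006), Biau and Devroye (2010), or Biau (2012), this estimator is consistent for all $\vartheta \in \Theta$ (the result being derived for non-adaptive random forests). This result gives us consistency of the conditional log-likelihood function
$$\hat{E}^N(\ell(\vartheta) \mid X = x) \xrightarrow{P} E(\ell(\vartheta) \mid X = x).$$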
The forest conditional parameter function is consistent when
as $N \to \infty$ for all $\vartheta$ in a neighbourhood of $\vartheta(x)$. The result can be shown under the assumptions regarding $\ell$ given by Hothorn et al. (2017), especially continuity in $\vartheta$. Because the conditional log-likelihood is a conditional mean-type estimator of a transformed target $\ell(\vartheta \mid Y)$, future theoretical developments in the asymptotic analysis of more realistic random forest-type algorithms based on nearest neighbour weights will directly carry over to transformation forests.
It is worth noting that some authors studied properties of random forests in regression models of the form $Y = r(x) + \varepsilon$ where the conditional variance $V(Y \mid X = x)$ does not depend on $x$. This is in line with the ANOVA split criterion implemented in Breiman and Cutler’s random forests (Breiman, 2001). The split procedure applied in transformation trees is, as will be illustrated in the next section, able to detect changes in higher moments. Thus, transformation forests might be a way to relax the assumption of additivity of signal and noise in the future.
7 Empirical Evaluation
Transformation forests were evaluated empirically, comparing this novel member of the random forest family to established procedures using artificial data generating processes. The data scenarios controlled the variation of several properties of interest: type of conditional parameter function, types of effect, and model complexity in low and high dimensions. The corresponding hypotheses to be assessed are:
H1: Type of Conditional Parameter Function.
H1a: Tree-Structured Conditional Parameter Function. Transformation trees and forests are able to identify subgroups associated with different transformation models, i.e., subgroups formed by a recursive partition (or tree) in predictor variables $x$ corresponding to different parameters $\vartheta(x)$ and thus different conditional distributions $P_{Y \mid X = x}$.
H1b: Non-Linear Conditional Parameter Function. Transformation forests are able to identify conditional distributions whose parameters $\vartheta(x)$ depend on predictor variables $x$ in a smooth non-linear way.
H2: Type of Effect.
H2a: No Effect. In a non-informative scenario with $\vartheta(x) \equiv \vartheta$ (i.e., mean and all higher moments constant) transformation trees perform as well as the unconditional maximum likelihood estimator. Thus, there is no (pronounced) overfitting.
H2b: Location Only. Transformation trees and forests perform as well as classical regression trees and forests when higher moments of the conditional distribution are constant.
H2c: Unlinked Location and Scale. Transformation trees and forests outperform classical regression trees and forests when higher moments of the conditional distribution are varying in a way that is not linked to variations in the mean.
H2d: Linked Location and Scale. Transformation trees and forests perform as well as classical regression trees and forests when higher moments of the conditional distribution are varying but in a way that is linked to the mean.
H3a: Transformation trees and forests with linear transformation function, i.e., with $P = 2$ parameters, perform best for conditionally normal target variables. Transformation trees and forests with non-linear transformation function perform slightly worse in this situation.
H3b: Transformation trees and forests with non-linear transformation function, i.e., with $P = 6$ parameters of a Bernstein polynomial of order five, outperform transformation trees and forests with linear transformation function for conditionally non-normal target variables.
H4: Dimensionality. Transformation forests stabilise transformation trees in the presence of high-dimensional non-informative predictor variables.
7.1 Data Generating Processes
Two data generating processes corresponding to H1a and H1b were studied. The first problem implements simple binary splits in the conditional mean and/or conditional variance of a normal target allowing a direct comparison of the split criteria employed by classical and transformation trees. The second problem is inspired by the “Friedman 1” benchmark problem (Friedman, 1991), and implements smooth non-linear conditional mean and variance functions for normal targets, in order to provide a more complex and more realistic scenario.
Tree-Structured Conditional Parameter Function (H1a)
The conditional normal target
$$Y \mid X = x \sim N\left(\mu(x), \sigma(x)^2\right) \tag{5}$$
depends on tree-structured conditional mean and variance functions $\mu(x)$ and $\sigma(x)^2$ according to four different setups (corresponding to hypotheses H2a–c):
All predictors are independently uniform on in the low-dimensional case (two informative and five noise variables) and in the high-dimensional case (two informative and noise variables, H4).
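For the evaluation of hypothesis H2d we studied the same setup as above but for conditionally log-normal targets with
$$\log(Y) \mid X = x \sim N\left(\mu(x), \sigma(x)^2\right). \tag{6}$$
Here, the conditional mean of the target variable depends both on the underlying conditional mean $\mu(x)$ of $\log(Y)$ and the corresponding conditional variance $\sigma(x)^2$:
$$E(Y \mid X = x) = \exp\left(\mu(x) + \sigma(x)^2 / 2\right).$$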
It is important to note that the true transformation function in model (6) is a scaled and shifted log-transformation. Unlike the true linear transformation function in model (5), which can be exactly fitted by the linear and Bernstein parameterisations of the transformation function in transformation trees and forests, the true log-transformation cannot be approximated well by these basis functions. Therefore, no competitor in this simulation experiment is able to exactly recover the true data generating process.
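To make the tree-structured setup concrete, the following minimal sketch simulates data of this type. It is an illustration only: the split point 0.5, the parameter levels, the unit interval for the predictors, and all function names are assumptions made here, not the values or code used in the study. It also evaluates the well-known mean of a log-normal variable, which shows why a change in the underlying variance always moves the conditional mean of the log-normal target.

```python
import math
import random


def tree_parameters(x1, x2, mean_effect=True, var_effect=True):
    """Piecewise-constant (tree-structured) mean and variance.

    The split point 0.5 and the levels 0/1 (mean) and 1/4 (variance)
    are hypothetical placeholders, not the values used in the study.
    """
    mu = 1.0 if (mean_effect and x1 > 0.5) else 0.0
    sigma2 = 4.0 if (var_effect and x2 > 0.5) else 1.0
    return mu, sigma2


def simulate(n, n_noise=5, lognormal=False, seed=1, **effects):
    """Draw n pairs (x, y) with two informative and n_noise noise predictors."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        # Predictors are assumed to be independent uniforms on [0, 1].
        x = [rng.random() for _ in range(2 + n_noise)]
        mu, sigma2 = tree_parameters(x[0], x[1], **effects)
        z = rng.gauss(mu, math.sqrt(sigma2))
        # The log-normal target is the exponential of the normal draw.
        y = math.exp(z) if lognormal else z
        data.append((x, y))
    return data


def lognormal_mean(mu, sigma2):
    """Mean of exp(Z) for Z ~ N(mu, sigma2): exp(mu + sigma2 / 2)."""
    return math.exp(mu + sigma2 / 2.0)


if __name__ == "__main__":
    sample = simulate(5, lognormal=True, mean_effect=False)
    print(sample[0])
    # A pure variance change still shifts the log-normal mean:
    print(lognormal_mean(0.0, 1.0), lognormal_mean(0.0, 4.0))
```

Switching `mean_effect` and `var_effect` on and off gives the four combinations of constant or varying mean and variance; `n_noise` controls the number of non-informative predictors.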
Non-Linear Conditional Parameter Function (H1b)
The data generating process
with all predictors from independent uniform distributions on in the low-dimensional case (ten informative and five noise variables) and in the high-dimensional case (ten informative and noise variables, H4) is inspired by the “Friedman 1” benchmarking problem (Friedman, 1991). This original benchmark problem is conditional normal with a conditional mean function depending on five uniform predictor variables
and constant variance. For our experiments, we scaled the output of Friedman1 to the interval and denote this scaled function as . Model (7) is conditionally normal with potentially non-constant conditional mean function
and potentially non-constant conditional variance function
The latter function is based on an additional set of five uniformly distributed predictor variables and thus the conditional mean and variance function are not linked (H2c).
Again, we considered all setups corresponding to H2a–c and H4, including the non-informative case with constant mean and variance:
Hypothesis H2d for non-linear conditional parameter functions (H1b) was studied in the log-normal model
and the remarks to model (6) stated above also apply here.
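The non-linear setup can be sketched in a similar way. The function `friedman1` below is the standard "Friedman 1" mean function; everything else is an assumption made for illustration: the target interval of the rescaling (here -1 to 1), the use of an exponential to keep the variance positive, and the helper names are not taken from the study.

```python
import math


def friedman1(x):
    """Friedman (1991) benchmark mean function of five predictors in [0, 1]."""
    x1, x2, x3, x4, x5 = x[:5]
    return (10.0 * math.sin(math.pi * x1 * x2)
            + 20.0 * (x3 - 0.5) ** 2
            + 10.0 * x4
            + 5.0 * x5)


# On the unit cube the function ranges from 0 to 10 + 5 + 10 + 5 = 30.
F1_MIN, F1_MAX = 0.0, 30.0


def scaled_friedman1(x, lower=-1.0, upper=1.0):
    """Affinely map friedman1 from [0, 30] to [lower, upper].

    The default bounds are hypothetical; the interval used in the study
    is not reproduced here.
    """
    u = (friedman1(x) - F1_MIN) / (F1_MAX - F1_MIN)
    return lower + (upper - lower) * u


def conditional_parameters(x, mean_effect=True, var_effect=True):
    """Mean from predictors 1-5, variance from predictors 6-10 (not linked).

    Using exp() of the scaled function as standard deviation is an
    assumption that merely guarantees a positive variance.
    """
    mu = scaled_friedman1(x[0:5]) if mean_effect else 0.0
    log_sigma = scaled_friedman1(x[5:10]) if var_effect else 0.0
    return mu, math.exp(log_sigma) ** 2


if __name__ == "__main__":
    x = [0.2, 0.7, 0.5, 0.1, 0.9, 0.3, 0.3, 0.8, 0.6, 0.4]
    print(conditional_parameters(x))
```

Because mean and variance are driven by disjoint sets of predictors, a method that splits only on changes in the mean has no direct way to detect the variance signal in this sketch.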
For testing the hypotheses H1–H4, we compared the performance of transformation trees and forests with linear and non-linear transformation functions to the performance of conditional inference trees (Hothorn et al., 2006b) and conditional inference forests (Strobl et al., 2007) as representatives of unbiased recursive partitioning and to Breiman and Cutler’s random forests (Breiman, 2001) as a representative of exhaustive search procedures. In more detail, we compared the performance of the following methods:
Conditional inference trees with internal stopping by default parameters.
Transformation trees, either with linear ( parameters) or non-linear ( parameters of a Bernstein polynomial) transformation functions. Tree-growing parameters are identical to those from CTree.
Conditional inference forests with mtry equal to one third of the number of predictor variables. Trees were grown without internal stopping until sample size constraints were met.
Breiman and Cutler’s random forests with tree-growing parameters analogous to CForest (i.e., same mtry and stopping based on sample size constraints).
Transformation forests, either with linear () or non-linear () transformation functions, and tree-growing parameters analogous to CForest and RForest.
See Table 1 for a schematic overview of all competitors and Appendix “Computational Details” for the exact tree-growing parameter specifications.
In order to allow a fair comparison on the same scale, trees and forests obtained from the classical methods, i.e., conditional inference trees and forests and Breiman and Cutler’s random forests, were used to estimate conditional parameter functions (2) and (3) in the same way as for transformation trees and forests: we first fitted trees and forests using the reference implementations of the corresponding methods, then computed the corresponding conditional weight functions and, in a third step, used these weights to estimate the conditional parameter functions. It should be noted that the combination of Breiman and Cutler’s random forests with transformation models in our RForest variant is conceptually very similar to quantile regression forests. Meinshausen (2006, 2017) uses Breiman and Cutler’s random forest to build the trees. The only difference from our RForest variant is that aggregation in quantile regression forests takes place via the weighted empirical conditional cumulative distribution function based on these weights, see Formula (4), instead of the application of a smooth conditional distribution function corresponding to a transformation model.
|Tree Growing||Model Complexity|
|Variable||Split||Split||Linear ()||Non-Linear ()|
|exhaustive||exhaustive RSS||MSE||RForest ()||RForest ()|
|unbiased (inference based)||maximally selected score test||location||
|exhaustive likelihood||location/scale||TTree (, exh)||TTree (, exh)|
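The three-step procedure used for the classical competitors (fit a forest, compute weights, estimate parameters) can be illustrated with a minimal sketch. It assumes that terminal node identifiers are already available for every learning observation and every tree, for example from a fitted forest; the function names and the toy data are hypothetical. The weighted normal fit stands in for the simplest case, since a linear transformation function combined with a normal distribution function corresponds to a normal model with two parameters. The weighted empirical distribution function shows the alternative aggregation used by quantile regression forests.

```python
def forest_weights(train_nodes, new_nodes):
    """Count how often each learning observation shares a terminal node
    with the new observation.

    train_nodes[i][t]: terminal node id of learning observation i in tree t.
    new_nodes[t]:      terminal node id of the new observation in tree t.
    """
    return [sum(1 for a, b in zip(row, new_nodes) if a == b)
            for row in train_nodes]


def weighted_normal_fit(y, w):
    """Weighted maximum likelihood estimates of mean and variance."""
    total = sum(w)
    mu = sum(wi * yi for wi, yi in zip(w, y)) / total
    var = sum(wi * (yi - mu) ** 2 for wi, yi in zip(w, y)) / total
    return mu, var


def weighted_ecdf(y, w, q):
    """Weighted empirical distribution function evaluated at q."""
    total = sum(w)
    return sum(wi for wi, yi in zip(w, y) if yi <= q) / total


if __name__ == "__main__":
    # Toy example: four learning observations, three trees (made-up node ids).
    train_nodes = [[1, 2, 1], [1, 3, 1], [4, 3, 5], [4, 2, 5]]
    new_nodes = [1, 3, 1]
    y = [0.0, 2.0, 10.0, 12.0]
    w = forest_weights(train_nodes, new_nodes)
    print(w)
    print(weighted_normal_fit(y, w))
    print(weighted_ecdf(y, w, 2.0))
```

Both aggregations use exactly the same weights; they differ only in whether a smooth parametric distribution or a step function is fitted to the weighted sample.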
7.3 Performance Measures
The primary performance measure is the out-of-sample log-likelihood because it assesses the whole predicted distribution in a “proper” way (Gneiting and Raftery, 2007). To adjust for sampling variation, the log-likelihood of the true data generating process is employed as the reference measure. More precisely, the negative log-likelihood difference, that is the negative log-likelihood of a competitor minus the negative log-likelihood of the true data generating process, was evaluated for the observations of the validation sample. Conditional medians and prediction intervals are of additional interest and we also compared their performance by the out-of-sample check risk corresponding to the , (absolute error) and quantiles in reference to the true data generating process. A direct comparison of coverage and lengths of prediction intervals is not considered as it would only be valid or useful for a given configuration of the predictor variables. This is termed “conditional coverage” vs. “sample coverage” in Mayr et al. (2012) or considered as maximising forecast sharpness only subject to calibration in the proper scoring rules literature (Gneiting et al., 2007).
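As a rough illustration of these two performance measures, the sketch below computes a negative log-likelihood difference and a check risk for normal predictive distributions. The function names, the restriction to normal predictions, and the choice to sum the likelihood contributions and average the check losses over the validation sample are assumptions made for this example.

```python
import math


def normal_nll(y, mu, sigma):
    """Negative log-density of N(mu, sigma^2) at y."""
    return (0.5 * math.log(2.0 * math.pi * sigma ** 2)
            + (y - mu) ** 2 / (2.0 * sigma ** 2))


def nll_difference(y, pred, truth):
    """Competitor NLL minus true-process NLL, summed over the validation sample.

    pred and truth are lists of (mu, sigma) pairs, one per observation.
    """
    return sum(normal_nll(yi, *p) - normal_nll(yi, *t)
               for yi, p, t in zip(y, pred, truth))


def check_loss(y, q, tau):
    """Check (pinball) loss of quantile prediction q at probability tau."""
    u = y - q
    return u * (tau - (1.0 if u < 0 else 0.0))


def check_risk(y, q_pred, tau):
    """Average check loss over the validation sample."""
    return sum(check_loss(yi, qi, tau) for yi, qi in zip(y, q_pred)) / len(y)


if __name__ == "__main__":
    y = [0.3, -1.2, 2.5]
    truth = [(0.0, 1.0), (0.0, 1.0), (1.0, 2.0)]
    pred = [(0.2, 1.1), (-0.1, 0.9), (0.5, 1.5)]
    print(nll_difference(y, pred, truth))
    # For tau = 0.5 the check loss is half the absolute error.
    print(check_risk(y, [mu for mu, _ in pred], 0.5))
```

A difference of zero means the competitor matches the true data generating process on the validation sample; for a check risk reported relative to the truth, the same subtraction can be applied to the check risk of the true quantiles.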
7.4 Results: Tree-Structured Conditional Parameter Function (H1a)
Given the type of conditional parameter function (here: tree, H1a), all other properties of the data generating process were varied and assessed, and the results are summarised with parallel coordinate displays and superimposed boxplots of the negative log-likelihood differences (see Figure 2). These were obtained from pairs of learning samples (size ) and validation samples, using a normal dependent variable in the first step. This makes it possible to assess the type of effect (mean and/or higher moments) in the rows of the panels (H2a–c), the dimensionality (H4) in the columns of the panels, and the complexity (H3, vs. ) along the -axes.
In the situation where all predictor variables were non-informative (H2a, top row of Figure 2), CTree () and TTree () were most resistant to overfitting; this effect is due to the test-based internal stopping of the unbiased tree methods compared here. TTree () with non-linear transformation function had slightly larger negative log-likelihood differences due to the increased model complexity (H3). Moreover, if model complexity is further increased by considering forests instead of trees, all random forest variants exhibit some more pronounced overfitting behaviour.
Under the simple change in the mean (H2b, second row in Figure 2), CTree () and TTree () were able to detect this split best. TTree () and all random forest variants performed less well in this situation. A variance change (H2c, third row in Figure 2) led to the smallest negative log-likelihood differences and thus superior performance for all transformation trees and forests as compared to the trees and forests splitting only based on the mean. TTree () performed best while none of the classical procedures seemed to be able to properly pick up this variance signal. The aggregation of multiple transformation trees led to decreased performance; this effect was also visible in Figure 1 (which was based on the same data generating process (5)).
When changes in both mean and variance were present (H2c, fourth row in Figure 2), transformation forests with linear transformation function TForest () performed as well as the corresponding TTree in the low-dimensional setup but better than all other procedures in the high-dimensional setup with non-informative variables (H4). This effect might be due to an overly restrictive inference-based early stopping in TTree. TTree () showed some extreme outliers (H3, visible in the parallel coordinates in Figure 2) which were due to convergence problems. The corresponding transformation forests TForest (), however, did not experience such problems and thus seemed to stabilise the trees.
In summary, the results with respect to our hypotheses were:
Transformation trees reliably recover tree-structured conditional parameter functions in both mean and variance.
Transformation trees are rather robust to overfitting when there is no effect while transformation forests (like all other random forests) exhibit some overfitting.
Transformation trees and forests perform comparably to their classical counterparts.
Transformation trees and forests outperform their classical counterparts if there are only variance effects or variance effects that are not linked to the mean.
For normal responses transformation trees and forests with linear transformation function () consistently perform better than the more complex Bernstein polynomials ().
Transformation forests stabilise the transformation trees in high-dimensional settings.
As a next step, the same simulation experiments were considered using a log-normal target variable instead of the normal variable employed above. Figure 3 depicts the negative log-likelihood differences for this setup, based on learning samples of size . Using this highly skewed distribution affects the results regarding the following two hypotheses:
All models with complexity are clearly not appropriate anymore as they cannot capture the skewness. Consequently, all models based on the more flexible Bernstein polynomials with outperform all other methods.
The classic RForest (), i.e., the combination of Breiman and Cutler’s random forests with a subsequent flexible transformation model, performs almost on par with transformation trees and forests even when there are changes in the variance only. The reason is that any changes in the variance are always also linked to changes in the mean due to the skewness of the distribution.
Qualitatively the same conclusions can be drawn when assessing the competing methods based on predictions of the conditional quantiles (Figures 10 and 13 for normal and log-normal targets, respectively), quantiles (Figures 11 and 14), and quantiles (Figures 12 and 15). However, the differences are less pronounced for the quantiles (medians, corresponding to the absolute errors). Note also that combining predictions of and quantiles amounts to prediction intervals.
By and large, all empirical results in this section conformed with our hypotheses H1–4, suggesting a stable behaviour of transformation trees and forests, especially with appropriate linear transformation function for normal targets, in these very simple situations. The next section proceeds to a less idealised scenario with non-linear conditional parameter functions defining mean and/or variance.
7.5 Results: Non-Linear Conditional Parameter Function (H1b)
The same hypotheses were assessed as in the previous section but for non-linear Friedman1-type conditional parameter functions instead of the tree-structured functions considered previously. More specifically, Figures 4 and 8 depict the negative log-likelihood differences based on learning samples with normally distributed targets and log-normally distributed targets, respectively. We summarise the results as follows.
When a signal was present (rows 2–4), all random forest variants outperformed single trees under normality. Under non-normality, this still holds for the random forest variants combined with flexible models ().
When there was no effect (top rows), CTree () and TTree () showed the best resistance to overfitting under normality. Under non-normality, TTree () still showed this behaviour but the corresponding forests also performed similarly well.
All forest variants performed similarly well when predictor variables only had an effect on the mean (second rows).
Under normality, transformation forests performed best when some of the predictor variables also affected the variance (rows 3–4), where the classical procedures were not able to capture these changes appropriately.
Under non-normality, transformation forests (with ) still perform best (rows 3–4). However, the classical RForest also performs well, albeit with a much larger variance than TForest.
Under non-normality, all trees and forests combined with flexible Bernstein polynomials () clearly outperform all other methods. Under normality, the flexible models with were sometimes slightly worse than the models but often also a little bit better.
In many situations the picture in low-dimensional settings (left column) is quite similar to that in high-dimensional scenarios (right column). However, sometimes it can be seen that transformation forests stabilise transformation trees in the presence of high-dimensional non-informative predictor variables.
As before, qualitatively the same patterns could be observed for the corresponding , , and check risks (Figures 16–18 and Figures 19–21, respectively) and thus prediction intervals. In summary, our hypotheses H1–4 were found to describe the behaviour of transformation trees and forests in this more complex setup well. The loss from using an overly complex model, such as a transformation model with , was tolerable in the simple normal setups but the gains, especially when parameters of a skewed target depend on the predictor variables, were found to be quite substantial.
7.6 Illustration: Swiss Body Mass Indices
Finally, to conclude this section, we illustrate the applicability of transformation trees and forests in a realistic situation by modelling the conditional body mass index (BMI = weight (in kg) / height² (in m²)) distribution for Switzerland, based on individuals aged between and years from the 2012 Swiss Health Survey (Bundesamt für Statistik, 2013). The predictor variables included smoking, sex, age, and a number of “lifestyle variables”: fruit and vegetable consumption, physical activity, alcohol intake, level of education, nationality and place of residence. Smoking status was categorised into never, former, light, moderate, and heavy smokers. A more detailed description of this data set can be found in Lohse et al. (2017) and extended transformation models for body mass indices are discussed by Hothorn (2018a).
The conditional transformation model underlying transformation trees and transformation forests