Most measurements carried out in the real world, e.g., by sensors embedded in different machines or by analyses of samples, are uncertain, if not erroneous in some cases. This uncertainty may be due to the process generating the samples being measured or to the intrinsic limitations of any measurement process. Treating such measurements, which constitute many of the data sets used in data science applications in both industry and academia, as certain is thus at best a naive position. Our experiments illustrate this point inasmuch as the method we propose here to handle uncertainty outperforms standard approaches on several real benchmark data sets.
However, while several methods have been developed to obtain uncertainty estimates from data sets, very few studies have been devoted to designing data science methods that can deal with such uncertainties. We address this problem here in the context of regression trees, a popular machine learning method at the basis of state-of-the-art ensemble methods such as Random Forests and Gradient Boosted Trees.
In this context, recent studies tackling the problem of uncertainty have focused on the uncertainty of the output variable by providing conditional quantiles (as opposed to conditional means). The work on quantile regression forests developed by Meinshausen [16] is a good illustration of this. We take here a different approach and directly model the uncertainty of the input variables in the regression trees we consider. This leads to a regression tree in which observations no longer belong to a single leaf. Instead, each observation has a non-null probability of being assigned to any leaf of the tree. The construction process associated with such trees is slightly more complex than that of standard trees (it involves in particular the inversion of an matrix, where is the size of the training data), but the improvements obtained on the prediction justify this additional complexity, as shown in Section 4 on both benchmark and modified (with additional noise) data sets.
The idea of including information on the uncertainty of the input data in a regression method is not new and is related to uncertainty propagation and sensitivity analysis. Several authors have indeed proposed to integrate uncertainties on the inputs in different regression methods, such as multivariate linear models [17] or neural networks [11]. To the best of our knowledge, our approach is the first one to address this problem in the context of regression trees and ensemble methods based on such trees, such as Random Forests.
The remainder of the paper is organized as follows: Section 2 presents the general model we rely on and its main properties; Section 3 then describes the algorithms for constructing the regression tree and the associated prediction rule, while Section 4 presents the experiments conducted and the numerical results obtained. Finally, Section 5 positions our work with respect to previous studies, while Section 6 concludes the paper.
2 Regression trees with uncertain input
Let $Y \in \mathbb{R}$ be an output random variable and $X = (X^1, \dots, X^p)$ some input random variables. A classical question is to find some relationship between $X$ and $Y$, estimating the so-called link function $f$ involved in the model $Y = f(X) + \varepsilon$.
Tree-based ensemble methods, such as Random Forests or Gradient Boosted Trees, are popular machine learning methods developed to address the above regression problem. In these methods, a set of regressors is constructed and aggregated in a convenient way. The building blocks are regression trees, which are defined from a partition of the space of input variables into $K$ regions $(R_k)_{1 \le k \le K}$ obtained by dyadic splits minimizing a risk function. A weight $\gamma_k$ is associated with each region $R_k$, leading to a piecewise constant predictor, for a new input $x$, of the form:

$$f_\Theta(x) = \sum_{k=1}^{K} \gamma_k \, \mathbb{1}_{\{x \in R_k\}}, \qquad (1)$$
where $\Theta = \left((R_k)_{1 \le k \le K}, (\gamma_k)_{1 \le k \le K}\right)$ is the set of parameters, learned from a training data set, defining the tree. Both categorical and quantitative inputs can in theory be considered. For the sake of simplicity, however, we focus in this study on quantitative inputs, thus considering that $X \in \mathbb{R}^p$.
To deal with uncertainty in the input, we introduce an auxiliary latent random vector $U \in \mathbb{R}^p$ representing the true value of the data and consider the general regression function that relates $Y$ to $X$ through $U$. The standard regression model is obtained from this general model by considering that the distribution of $U$ given $X = x$ is a Dirac at $x$ (or equivalently is Gaussian with mean $x$ and variance $0$). We further assume here that the variables are independent of each other and that the true measure given the observation, on each variable, is Gaussian, leading to the following complete model:
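As a sketch only, one way to write a complete model consistent with the assumptions just stated is given below. The additive centered noise $\varepsilon$, the tree-shaped link function applied to the latent vector, and the symbols $U$, $\gamma_k$, $R_k$ and $\sigma_j^2$ are notational assumptions made for this illustration.

```latex
\begin{aligned}
Y &= \sum_{k=1}^{K} \gamma_k \, \mathbb{1}_{\{U \in R_k\}} + \varepsilon, \\
U^j \mid X^j = x^j &\sim \mathcal{N}\!\left(x^j, \sigma_j^2\right),
\quad 1 \le j \le p, \ \text{independently}.
\end{aligned}
```

Under this form, and assuming $\varepsilon$ is centered given $X$, taking the conditional expectation of $Y$ given $X = x$ yields a prediction of the type $\sum_k \gamma_k P(U \in R_k \mid X = x)$, and letting every $\sigma_j^2$ tend to zero recovers the hard assignment of standard trees.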
The Gaussian distribution is widespread and easy to manipulate, hence its use here. Other distributions can nevertheless be considered for the measurement error, but this is beyond the scope of this study. In the remainder, we will denote by $\sigma^2 = (\sigma_1^2, \dots, \sigma_p^2)$ the vector of variances of the Gaussian distributions. In practical situations, these variances may be given (for example when the data correspond to measurements from machines for which the uncertainty is known) or may be directly learned from the data.
When one is dealing with uncertain input, and we want to stress again that this is the general situation, one can no longer assume that observations are hard-assigned to regions. Instead, each observation has a probability of being associated with each region, leading to the following prediction rule that directly generalizes Eq. (1):

$$f_\Theta(x) = \sum_{k=1}^{K} \gamma_k \, P(U \in R_k \mid X = x). \qquad (3)$$
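Note that the set of parameters $\Theta$ now includes the variances $\sigma^2$. In addition, because of the independence assumption at the basis of the model retained, one has:

$$P(U \in R_k \mid X = x) = \prod_{j=1}^{p} P(U^j \in R_k^j \mid X^j = x^j),$$

with $R_k^j$ the side of the hyper-rectangle $R_k$ along the $j$-th variable.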
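Under the independence and Gaussian assumptions, the probability that the latent vector falls in a hyper-rectangular region is a product of univariate Gaussian interval probabilities. The following minimal Python sketch computes this probability and the resulting soft prediction. The function names, the representation of a region as a list of per-variable intervals, and the use of the error function for the Gaussian cumulative distribution are illustrative assumptions, not the implementation described in this paper.

```python
import math


def gaussian_cdf(z):
    """Standard normal cumulative distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def region_probability(x, region, sigma):
    """P(U in region | X = x) for independent U^j ~ N(x^j, sigma_j^2).

    region: list of (a_j, b_j) intervals, one per variable; bounds may be
            -math.inf or math.inf for unbounded sides.
    sigma:  list of standard deviations, one per variable (all > 0).
    """
    prob = 1.0
    for x_j, (a_j, b_j), s_j in zip(x, region, sigma):
        upper = gaussian_cdf((b_j - x_j) / s_j)
        lower = gaussian_cdf((a_j - x_j) / s_j)
        prob *= upper - lower
    return prob


def soft_predict(x, regions, weights, sigma):
    """Soft prediction: sum over regions of weight * membership probability."""
    return sum(
        g * region_probability(x, r, sigma) for g, r in zip(weights, regions)
    )
```

For a one-dimensional tree split at 0 with weights 1 and 3, an observation located exactly at the split receives probability 0.5 for each leaf, so that the soft prediction is the midpoint 2, whereas a standard tree would return one of the two weights.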
We now turn to the procedure for learning the parameters of the model.
2.1 Estimation procedure
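The estimation procedure of the parameters is based, in this study, on the minimization of the empirical quadratic risk, which is the standard risk considered in regression trees. More precisely, for the learning set defined by:

$$\mathcal{D}_n = \{(X_i, Y_i), \ 1 \le i \le n\},$$

with $(x_i, y_i)_{1 \le i \le n}$ the observed sample, we define:

$$\mathcal{R}_n(\Theta) = \sum_{i=1}^{n} \left(y_i - f_\Theta(x_i)\right)^2, \qquad (5)$$

where $f_\Theta$ has been introduced in Eq. (3). This criterion has to be minimized on the training set with respect to the parameters denoted by $\Theta$: regions of the tree, associated weights, and the variances of the latent vector $U$. To do so, we introduce the matrix $P \in \mathbb{R}^{n \times K}$ defined, for $1 \le i \le n$ and for $1 \le k \le K$, by:

$$P_{ik} = P(U_i \in R_k \mid X_i = x_i) = \prod_{j=1}^{p} P(U_i^j \in R_k^j \mid X_i^j = x_i^j),$$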
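where the interval $R_k^j$ corresponds to the region $R_k$ projected onto the $j$-th variable. It is easy to see that, when fixing the regions and the vector of variances, minimizing Eq. (5) with respect to $\gamma = (\gamma_1, \dots, \gamma_K)^\top$ corresponds to a weighted average of the outputs $Y$, in a way similar to the linear regression model if $P^\top P$ is not singular:

$$\hat{\gamma} = (P^\top P)^{-1} P^\top Y, \qquad (6)$$

where $Y$ denotes the vector of univariate outputs $(y_1, \dots, y_n)^\top$. Indeed, by definition:

$$\mathcal{R}_n(\Theta) = \sum_{i=1}^{n} \Big(y_i - \sum_{k=1}^{K} \gamma_k P_{ik}\Big)^2 = \| Y - P\gamma \|_2^2 .$$

Differentiating with respect to $\gamma_k$, for all $1 \le k \le K$, one gets:

$$\frac{\partial \mathcal{R}_n(\Theta)}{\partial \gamma_k} = -2 \sum_{i=1}^{n} P_{ik} \Big(y_i - \sum_{l=1}^{K} \gamma_l P_{il}\Big), \quad \text{i.e.} \quad \nabla_\gamma \mathcal{R}_n(\Theta) = -2 P^\top (Y - P\gamma).$$

So that, setting the gradient to zero, if $P^\top P$ is not singular, it leads to Eq. (6):

$$P^\top P \hat{\gamma} = P^\top Y \iff \hat{\gamma} = (P^\top P)^{-1} P^\top Y,$$

which is a minimum since the Hessian $2 P^\top P$ is positive semi-definite. In Section 2.3, we derive assumptions under which $P^\top P$ is indeed not singular. In practice, one can always use the pseudo-inverse of $P^\top P$, which we will do in our experiments. Lastly, note that $\hat{\gamma}$ depends on the regions through the matrix $P$.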
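The weight estimation step is an ordinary least-squares problem in which the design matrix holds the membership probabilities of the training points. The sketch below solves the corresponding normal equations in pure Python. The helper names are hypothetical, and the sketch raises an error on a numerically singular system instead of falling back on a pseudo-inverse, which is a simplification with respect to the experiments reported here.

```python
def solve_linear_system(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is numerically singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - tail) / m[r][r]
    return z


def estimate_weights(p, y):
    """Least-squares weights solving (P^T P) gamma = P^T y.

    p: n x K matrix (list of rows) of membership probabilities.
    y: list of n observed outputs.
    """
    k = len(p[0])
    gram = [[sum(row[a] * row[b] for row in p) for b in range(k)]
            for a in range(k)]
    rhs = [sum(row[a] * y_i for row, y_i in zip(p, y)) for a in range(k)]
    return solve_linear_system(gram, rhs)
```

When every row of the matrix is a hard indicator (a single entry equal to 1), the solution reduces to the per-leaf mean of the outputs, which is the weight used in standard regression trees.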
As the regression tree is constructed, the number of regions is not fixed and increases step by step, meaning that the size of the matrix $P$ also changes during the iterative process. Let us assume that $K$ regions have already been identified, meaning that the current tree has $K$ leaves, each leaf corresponding to a region (i.e., a hyper-rectangle). As in standard regression trees, each region $R_k$, $1 \le k \le K$, can be decomposed into two sub-regions with respect to a variable $j$ and a splitting point $s$:

$$R_k^{L}(j, s) = \{x \in R_k : x^j \le s\}, \qquad R_k^{R}(j, s) = \{x \in R_k : x^j > s\}.$$

This decomposition adds a new region, so that $K$ becomes $K + 1$, for which the associated elements of $P$, and thus $\hat{\gamma}$, can readily be computed. For each region $R_k$, one is looking for the best split, i.e., the best variable $j$ and the best splitting point $s$ that minimize:

$$\sum_{i=1}^{n} \left(y_i - f_\Theta(x_i)\right)^2. \qquad (7)$$

The sum includes all observations, as each observation has a non-null probability of belonging to any region; both $j$ and $s$ are hidden in $\Theta$. Using (6), the above problem can be reformulated as:

$$\underset{1 \le j \le p, \ s \in S_k^j}{\operatorname{argmin}} \ \left\| Y - P (P^\top P)^{-1} P^\top Y \right\|_2^2,$$

where $P$ depends on $(j, s)$ through the new sub-regions and $S_k^j$ denotes the set of splitting points corresponding to the middle points of the $j$-th coordinates of the observations sorted according to variable $j$.
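The set of candidate splitting points described above can be sketched as follows. The function name is hypothetical, and removing duplicated coordinates before taking midpoints is an assumption made here so that no candidate coincides with an observed value.

```python
def candidate_splits(values):
    """Midpoints between consecutive distinct sorted coordinates.

    values: coordinates, along one variable, of the observations that
            belong to the region being split.
    """
    ordered = sorted(set(values))
    return [(lo + hi) / 2.0 for lo, hi in zip(ordered, ordered[1:])]
```

For instance, the coordinates 3, 1, 2, 2 give the two candidates 1.5 and 2.5. Each candidate then defines a left and a right sub-region, for which the probability matrix and the weights are recomputed before evaluating the empirical risk.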
The decomposition corresponding to the best split is then used to build the child nodes of $R_k$. In this process, which is repeated until a stopping criterion is met (any standard stopping criterion can be used here), the number of regions, the matrix $P$ and the weights $\gamma$ are gradually updated. Section 3 provides the algorithm corresponding to this construction.
Lastly, the vector of variances, $\sigma^2$, can either be set according to some high-level principles, or can be learned through a grid search on some validation sets. The latter is more demanding in terms of computational resources, but is likely to lead to better results. However, in our experiments, we use the former strategy, with the aim of showing that our approach is robust in the sense that it yields good results even when $\sigma^2$ is set a priori.
2.2 Final prediction
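Assuming the final prediction is the plug-in version of Eq. (3), with the regions and weights replaced by their estimates, the rule referred to as Eq. (8) in the following sections would read as below. The hatted symbols, the interval bounds $\hat{a}_k^j$ and $\hat{b}_k^j$, and the standard Gaussian cumulative distribution function $\Phi$ are notation introduced for this sketch.

```latex
\hat{f}(x)
= \sum_{k=1}^{\hat{K}} \hat{\gamma}_k \, P\!\left(U \in \hat{R}_k \mid X = x\right)
= \sum_{k=1}^{\hat{K}} \hat{\gamma}_k \prod_{j=1}^{p}
  \left[ \Phi\!\left(\frac{\hat{b}_k^j - x^j}{\sigma_j}\right)
       - \Phi\!\left(\frac{\hat{a}_k^j - x^j}{\sigma_j}\right) \right]
```

The second equality uses the independence and Gaussian assumptions of the model, with $\hat{R}_k = \prod_{j} [\hat{a}_k^j, \hat{b}_k^j]$.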
2.3 A sufficient condition on the invertibility of $P^\top P$
The formula used for the construction of the tree relies on the inverse of the matrix $P^\top P$. Even if numerically the Moore-Penrose pseudo-inverse might be used to approximate this inverse, we provide here a sufficient condition on the invertibility of $P^\top P$. Without loss of generality, the regions involved in the definition of the regression tree are of the form for , where . As usual, we denote , the quantile of the standard Gaussian distribution. The general invertibility result is stated in Theorem 2.3. With the model defined in (2), if the following assumption is satisfied:
then the matrix $P^\top P$ is invertible. Roughly speaking, the matrix $P^\top P$ is invertible provided that the standard deviation $\sigma$ is sufficiently small. The smaller the regions and/or the larger the number of input variables, the lower the uncertainty on the measurement has to be to ensure the invertibility of the matrix $P^\top P$.
To prove Theorem 2.3, we prove that $P$ is of full rank. To do so, we first prove the following result:
Let us fix and consider such that . Assume that
Observe that a sufficient condition to have is
We now search a sufficient condition to have the inequality just above. To get this inequality the following condition is sufficient: for all ,
Note that this last condition is satisfied if we have, for all ,
which is equivalent to
Note that . A sufficient condition is then, for all ,
Since , a condition even more conservative is the following:
This concludes the proof.
We assume in the following that Assumption (9) is satisfied. Then, the set is non-empty. Let us consider, for all , a representative of this set and let us introduce the matrix defined, for all , by:
Assume that Assumption (9) holds. Then the matrix is invertible. We first show that is a strictly dominant diagonal matrix, i.e.:
Indeed, by Proposition 2.3, we know that and is the only region where it is true:
According to Hadamard's Lemma, we know that a strictly dominant diagonal matrix is invertible, which concludes the proof. As this matrix is invertible, $P$ is of full rank, leading to the fact that $P^\top P$ is invertible, which concludes the proof of Theorem 2.3.
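The dominance property used in this argument is easy to check numerically on a given square matrix. The sketch below tests strict diagonal dominance row by row; the function name is hypothetical and the row-wise convention is an assumption, since dominance may equivalently be stated column-wise.

```python
def is_strictly_diagonally_dominant(matrix):
    """True if, on every row, |diagonal entry| > sum of |other entries|."""
    for k, row in enumerate(matrix):
        off_diagonal = sum(abs(v) for col, v in enumerate(row) if col != k)
        if abs(row[k]) <= off_diagonal:
            return False
    return True
```

Applied to a matrix whose row $k$ holds the membership probabilities of a representative observation of region $k$, the check succeeds when each representative is assigned to its own region with probability larger than one half, which is the situation obtained when the uncertainty is small enough.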
2.4 Extension to Random Forests
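It is straightforward to use the uncertain regression trees introduced above in Random Forests, leading to Uncertain Random Forests. Indeed, each tree of the forest is now an uncertain regression tree that can be constructed as outlined before. Assuming a forest of $T$ uncertain trees and denoting by $\hat{\gamma}^{(t)}$ the weights estimated for each tree and by $P(U \in \hat{R}_k^{(t)} \mid X = x)$ the probability distribution of a new observation $x$ over the regions $\hat{R}_k^{(t)}$ ($1 \le k \le K_t$) of the tree $t$ ($1 \le t \le T$) of the forest, the prediction rule of the uncertain random forest takes the form:

$$\hat{f}_{RF}(x) = \frac{1}{T} \sum_{t=1}^{T} \sum_{k=1}^{K_t} \hat{\gamma}_k^{(t)} \, P(U \in \hat{R}_k^{(t)} \mid X = x).$$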
As one can note, the above prediction rule is a direct extension of (8).
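Assuming, as is usual in Random Forests, that the forest averages the predictions of its trees, the forest prediction can be sketched as follows. The function name and the input layout (one list of weights and one list of membership probabilities per tree, the latter already computed for the new observation) are illustrative assumptions.

```python
def forest_predict(tree_weights, tree_probs):
    """Average over trees of the soft per-tree predictions.

    tree_weights: one list of leaf weights per tree.
    tree_probs:   one list per tree giving, for the new observation, the
                  probability of each leaf region (same order as weights).
    """
    per_tree = [
        sum(g * p for g, p in zip(gammas, probs))
        for gammas, probs in zip(tree_weights, tree_probs)
    ]
    return sum(per_tree) / len(per_tree)
```

With two trees of weights (1, 3) and (2, 4) and leaf probabilities (0.5, 0.5) and (0.25, 0.75), the per-tree predictions are 2 and 3.5, and the forest returns their mean.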
3 Associated algorithms
Algorithm 3 describes the construction of uncertain regression trees. This construction parallels that of standard regression trees, except that we consider a matrix $P$ encoding the probability distribution of training data points over regions, which is built dynamically (as are the regions and the weights) by adding a new column when a given region is split into two sub-regions.
Each time a region, corresponding to a current leaf of the tree being constructed, is considered (through the pop function in Algorithm 3), one identifies the best split that minimizes the empirical risk in (7) among all possible splitting points in of each variable . The set is defined by , with corresponding to the observations of the learning set belonging to the region , sorted such that .
The $k$-th column of $P$ corresponding to the current region $R_k$ is finally replaced by the probability distribution of the training data points over its left sub-region (denoted $R_k^L$) and an additional column is added to $P$ for the right sub-region (denoted $R_k^R$). The weights $\gamma$ are easily deduced from $P$ at each step through (6).
The algorithm finally outputs the set of regions corresponding to the leaves of the tree and the weights $\gamma$.
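Algorithm 3: Uncertain regression trees
Input: training set, vector of variances
Initialization: K = 1, R_1 = whole input space, P = column of ones, S = [(1, R_1)]
while stopping criterion not met
    (k, R_k) = S.pop()
    (j, s) = best split of R_k  #defined in (7)
    replace column k of P by the probabilities for R_k^L, append a column for R_k^R
    K = K+1
    update the weights through (6)
    S.append((k, R_k^L), (K, R_k^R))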
Note that in this version the tree is constructed in a depth-first manner, and that the usual stopping criteria for regression trees can be used (such as imposing a minimum number of observations in each leaf or a maximal depth for the tree).
4 Experimental validation
The goal of our experiments is to assess to what extent the proposed uncertain regression trees are:
- useful compared to standard regression trees,
- useful in Random Forests.
By standard regression trees we mean here regression trees based on the quadratic risk and the prediction rule given in (1). In addition, we consider a trade-off between standard and uncertain regression trees, based on standard trees (thus avoiding the complex construction process outlined before) but using the prediction rule given in (8). The matrix $P$ and the weights $\gamma$ are thus computed only once, when the standard trees have been built. The rationale for this method is to rely on the strengths of the two approaches: a simple tree construction process and a robust prediction rule. As one can conjecture, this method yields results in between the two approaches.
We make use in our experiments of four benchmark data sets commonly used in regression tasks and described in Section 4.1. We first use these data sets without any modification, to illustrate the fact that real data sets contain uncertain inputs. The uncertain regression trees proposed here indeed outperform standard trees and Random Forests on these data sets, as shown in Section 4.2. We then modify these data sets by adding a uniform noise bounded by a quantity proportional to the empirical variance of the data. This additional perturbation aims at assessing the robustness of the different methods (standard and uncertain trees) in situations where the input data is highly uncertain. Once again, the results obtained show the good behaviour of the uncertain trees and Random Forests (Section 4.3). In all our experiments, the results are evaluated through the Root Mean Squared Error (RMSE), which is a standard evaluation measure in regression tasks. To ensure that all the available data is used for both training and testing, we further rely on 5-fold cross-validation and report the mean RMSE and its standard deviation over the 5 folds.
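The evaluation protocol can be sketched as follows. The function names, the contiguous (unshuffled) folds and the use of the population standard deviation over folds are assumptions made for brevity; `fit` stands for any training routine returning a prediction function.

```python
import math
import statistics


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def k_fold_indices(n, k=5):
    """Split range(n) into k contiguous test folds (no shuffling)."""
    bounds = [round(i * n / k) for i in range(k + 1)]
    return [list(range(bounds[i], bounds[i + 1])) for i in range(k)]


def cross_validated_rmse(xs, ys, fit, k=5):
    """Mean and standard deviation of the RMSE over k folds.

    fit(train_x, train_y) must return a function mapping x to a prediction.
    """
    scores = []
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        train = [i for i in range(len(xs)) if i not in held_out]
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(rmse([ys[i] for i in test_idx],
                           [predict(xs[i]) for i in test_idx]))
    return statistics.mean(scores), statistics.pstdev(scores)
```

Every observation is used exactly once for testing and k - 1 times for training, which matches the stated aim of using all available data for both purposes.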
The stopping criterion for the trees, both standard and uncertain, is based on the fact that all leaves should contain at least ten percent of the training data. For Random Forests, three features are randomly selected for constructing a tree, which roughly corresponds to one third of the features on the data sets considered, a standard proportion in Random Forests.
As mentioned before, the vector of variances $\sigma^2$ is fixed according to some high-level principle. In particular, when no additional noise is introduced on the data, the variance $\sigma_j^2$ for a particular variable $X^j$ is set to the empirical variance of $X^j$ (in other words, we assume that the variance of the true values is of the same order as the variance of the observed values). When some noise is added to the data, the variance $\sigma_j^2$ is set to one half of the variance of the observed, noisy data (in this case, the variance of the true values should be lower than the empirical variance of the observed values; we arbitrarily chose one half in this study).
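This a priori choice of the variances can be sketched in a few lines. The function name is hypothetical, and using the population variance (rather than the unbiased sample variance) is an assumption, since the estimator is not specified here.

```python
from statistics import pvariance


def variance_vector(rows, noisy=False):
    """Per-variable variances used as the a priori uncertainty vector.

    rows:  list of observations, each a list of p quantitative inputs.
    noisy: if True, use one half of the empirical variance, as done
           when artificial noise has been added to the data.
    """
    scale = 0.5 if noisy else 1.0
    return [scale * pvariance(column) for column in zip(*rows)]
```

For two observations (1, 10) and (3, 30), the empirical variances are 1 and 100, and the vector used on the noisy version of the data would be (0.5, 50).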
Lastly, our algorithms have been implemented in scikit-learn [21], using Cython [1], a compiled programming language that extends Python with static typing, for better performance. Development proceeds on GitHub (https://github.com/dtrckd/scikit-learn/tree/probtree), a platform which greatly facilitates collaboration.
4.1 Data Sets
Experiments are conducted on 4 publicly available data sets, popular in the regression tree literature. The main characteristics of these data sets are summarized in Table 1. As we focus in this paper on quantitative variables, we simply deleted the categorical variables from the data sets. Several applications are considered, among which environment (ozone concentration over a day for the data set Ozone introduced in ), health (data set Diabete, introduced in  and used in ), economy (data set BigMac about the price of sandwiches, available in the R package alr3 and used in ) or biology (data set Abalone introduced in  and used in  among others).
| Data set | Sample size | Number of variables |
4.2 Results on benchmark data sets
As mentioned in the introduction, data sets are by nature uncertain, so we illustrate our method on the 4 data sets introduced in Section 4.1, used without modification. The noise reflects the uncertainty, so we propose to use the empirical standard deviation vector as the input parameter $\sigma$.
Performances are computed by a 5-fold cross validation for each data set, and we present mean and standard deviation of RMSE from each fold in Table 2.
We compare the standard tree, the standard Random Forests (with 100 trees), our approach and the standard tree with uncertain prediction.
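Table 2: Mean RMSE (standard deviation) over the 5 folds on the benchmark data sets.
| Method | BigMac | Ozone | Diabete | Abalone |
|---|---|---|---|---|
| Standard tree | 25.09 (12.3) | 17.82 (4.3) | 60.29 (4.3) | 2.70 (0.3) |
| Standard RF (100 trees) | 19.78 (11.0) | 15.86 (4.1) | 58.18 (4.3) | 2.65 (0.4) |
| Standard tree with uncertain prediction | 21.49 (8.7) | 16.79 (4.1) | 57.05 (3.7) | 2.41 (0.2) |
| Uncertain tree | 18.74 (8.9) | 15.39 (3.7) | 56.56 (3.3) | 2.33 (0.3) |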
For all data sets, the best performance is achieved by our method. As expected, the standard Random Forest performs better than a single standard tree, thanks to the smoothing of the predictions. Our approach is similar in spirit, in that it smooths the prediction, but it considers only one tree. However, Table 2 shows that our method performs better than the standard RF. The standard tree with uncertain prediction performs better than the standard tree, but not always better than the standard RF.
Finally, we remark that the standard deviation of our method is smaller than or equal to those of the standard tree and of the standard RF on all data sets, meaning that results are more stable, due to the smoothing.
4.3 Results on artificial uncertain data sets
Now, we consider even more uncertain data sets, obtained by adding some artificial noise that could be related to some measurement error. The variability of the data comes from two sources, namely the variance in the latent variables (related to the variance of with the notations of Section 2) and the variance of the uncertainty (related to the variance of ), both leading to the variance of the observations. We then assume here that is smaller than the standard deviation of the observations, which can be estimated from the data set. Basically, it means that the main part of the variance is due to the uncertainty. In our experiments, we introduce some noise in the observations so that .
To construct artificial uncertain data sets satisfying this condition, we consider in this section the BigMac, Ozone, Diabete and Abalone data sets introduced in Section 4.1. Noise is added to each observation using the following process. For each observation from an input variable , , we add a noise generated from the product of a Rademacher variable and a uniform variable drawn from the interval . The uniform distribution is used so as not to overfit model (2), which assumes a Gaussian distribution, and we satisfy the condition .
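The noise-generation process can be sketched as follows. Since the bounds of the uniform interval are not recoverable from the text above, the sketch assumes an interval of the form [0, b_j] with a user-supplied bound b_j per variable; this interval, the function name and the seeding are hypothetical choices made for illustration.

```python
import random


def add_bounded_noise(rows, bounds, seed=0):
    """Add Rademacher * Uniform(0, bound_j) noise to each coordinate.

    rows:   list of observations, each a list of p quantitative inputs.
    bounds: list of p non-negative upper bounds (assumed interval [0, b_j]).
    seed:   seed of the pseudo-random generator, for reproducibility.
    """
    rng = random.Random(seed)
    noisy = []
    for row in rows:
        noisy.append([
            # Random sign (Rademacher) times a uniform magnitude.
            x + rng.choice((-1.0, 1.0)) * rng.uniform(0.0, b)
            for x, b in zip(row, bounds)
        ])
    return noisy
```

The resulting perturbation is symmetric around zero and bounded in absolute value by b_j, so it is deliberately not Gaussian, in line with the motivation given above.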
We consider here, as before, the standard tree, the standard RF with trees, the standard tree with uncertain prediction and the uncertain tree, but we also consider the uncertain RF with trees.
Results are reported in Table 3.
Table 3: Mean RMSE (standard deviation) over the 5 folds on the artificial uncertain data sets.
| Uncertain data sets | BigMac | Ozone | Diabete | Abalone |
|---|---|---|---|---|
| Standard tree | 22.28 (8.7) | 19.37 (4.1) | 60.47 (2.81) | 2.54 (0.17) |
| Standard tree with uncertain prediction | 21.68 (9.8) | 17.35 (4.9) | 57.92 (3.43) | 2.40 (0.15) |
| Uncertain tree | 19.28 (13.4) | 16.87 (6.0) | 58.56 (3.45) | 2.34 (0.20) |
| Standard RF | 19.25 (7.8) | 15.72 (3.0) | 59.55 (4.39) | 2.64 (0.18) |
| Uncertain RF | 18.06 (9.3) | 15.49 (3.7) | 55.66 (4.31) | 1.98 (0.12) |
Our method again outperforms the standard approach when considering single trees. The RMSE of the uncertain tree is a bit higher than in Table 2, because there is more variability. Note however that the standard RF performs better than the uncertain tree on BigMac and Ozone in that case. We also evaluate the uncertain RF, which outperforms the uncertain tree on all four data sets, meaning that the smoothing is done in another way, improving the method. Compared with the uncertain tree, the uncertain RF also reduces the standard deviation on all data sets but Diabete.
5 Related work
Regression trees were introduced in the 1980s through the popular CART algorithm [3], notably allowing one to deal with both categorical and numerical input variables. They constitute the basic building block of state-of-the-art ensemble methods based on the combination of random trees, notably Random Forests, introduced by Breiman in [4] to circumvent the instability of regression trees [12]. Since Random Forests are particularly well suited for Big Data analysis (see [7]), many applications have been addressed in various domains with Random Forests, for example in ecology or genomics. [22] provides a review of the use of Random Forests for data mining purposes. An interesting feature of Random Forests is the possibility to quantify the importance of input variables in the model obtained (see [6] for more details on that point). Another interesting feature, which was empirically established, is their robustness to noise. They are thus, to a certain extent, robust to uncertain inputs (even though no mechanism was specifically designed for that). This said, explicitly modelling the uncertainty as done in the uncertain regression trees proposed here allows one to outperform the standard version of Random Forests, as illustrated in our experiments.
Another way to take into account uncertainty in the data is to use quantile regression. Several adaptations of ensemble methods for quantile regression have been proposed, such as quantile Random Forests or quantile boosting trees [9, 14, 15, 16, 23]. These studies however focus on the uncertainty in the output variable (by producing conditional quantiles rather than a conditional mean) and not on the uncertainties in the inputs, as done here. It is of course possible to combine both approaches, which we intend to do in the future.
Lastly, the idea of including information on the uncertainty of the input data in a regression method is not new and is related to uncertainty propagation and sensitivity analysis. Several authors have indeed proposed to integrate uncertainties on the inputs in different regression methods, such as multivariate linear models [17] or neural networks [11]. In each case, the methods have been improved, showing the benefits of explicitly modelling uncertainties in the input data. To the best of our knowledge, our approach is the first one to address this problem in the context of regression trees (and ensemble methods based on such trees). Our conclusion on the benefits of this approach parallels those of the above-mentioned studies.
6 Conclusion
We have introduced in this study uncertain regression trees, which can deal with uncertain inputs. In such trees, observations no longer belong to a single region, but rather have a non-null probability of being assigned to any region of the tree. This extra flexibility nevertheless leads to a construction process that is more complex than the one underlying standard regression trees and that necessitates the inversion of a square matrix (for which we have theoretically provided a sufficient condition). In practice, we rely on the pseudo-inverse of this matrix. The experimental results support the approach we have proposed and show that uncertain regression trees improve the results of standard regression trees and Random Forests on several benchmark data sets. A similar conclusion is drawn on artificial uncertain data sets obtained from the standard ones by introducing additional uncertainty in the form of a uniform noise.
The methodology developed in this study can also be adapted to the case where some input data are categorical. We plan to work on such an adaptation in the future, by considering different types of uncertainties.
Furthermore, as mentioned before, the vector of variances for modelling uncertainties can easily be learned by grid search on validation sets (typically using cross-validation). One can expect that doing so would further improve the results. We have set this vector in our experiments to values that we believe are reasonable, so as to show that our approach is robust in the sense that it yields good results even when $\sigma^2$ is set a priori. We nevertheless plan to run additional experiments to learn this vector. As a grid search can be easily parallelized, this learning should not substantially impact the running time of the algorithm.
, is another promising research direction. We also intend to explore the use of other empirical loss functions, such as the quantile loss used in the definition of quantile Random Forests or quantile boosting trees [9, 14, 15, 16].
References
- [1] S. Behnel, R. Bradshaw, C. Citro, L. Dalcin, D. S. Seljebotn and K. Smith, Cython: the best of both worlds, CiSE 13(2) (2011), pp. 31-39.
- [2] D. A. Belsley, E. Kuh and R. E. Welsch, Regression diagnostics: Identifying Influential Data and Sources of Collinearity, Wiley (1980), pp. 244-261.
- [3] L. Breiman, J. Friedman, C. J. Stone and R. A. Olshen, Classification and regression trees, CRC Press (1984).
- [4] L. Breiman, Random forests, Machine Learning 45(1) (2001), pp. 5–32.
- [5] P. A. Cornillon, A. Guyader, F. Husson, N. Jegou, J. Josse, M. Kloareg, E. Matzner-Lober and L. Rouvière, R for Statistics, CRC Press (2012).
- [6] R. Genuer, J.-M. Poggi and C. Tuleau-Malot, Variable selection using random forests, Pattern Recognition Letters 31(14) (2010), pp. 2225-2236.
- [7] R. Genuer, J.-M. Poggi, C. Tuleau-Malot and N. Villa-Vialaneix, Random Forests for Big Data, Big Data Research 9 (2017), pp. 28-46.
- [8] B. Efron, T. Hastie, I. Johnstone and R. Tibshirani, Least angle regression, Ann. Statist. 32(2) (2004), pp. 407–499.
- [9] N. Fenske, T. Kneib and T. Hothorn, Identifying risk factors for severe childhood malnutrition by boosting additive quantile regression, Journal of the American Statistical Association 106(494) (2011), pp. 494–510.
- [10] Y. Freund, R. Schapire and N. Abe, A short introduction to boosting, Journal-Japanese Society For Artificial Intelligence 14(5) (1999), pp. 771–780.
- [11] Y. Gal and Z. Ghahramani, Dropout as a Bayesian approximation: Representing model uncertainty in deep learning, In International Conference on Machine Learning (2016), pp. 1050-1059.
- [12] S. Gey and J.-M. Poggi, Boosting and instability for regression trees, Computational Statistics & Data Analysis 50(2) (2006), pp. 533–550.
- [13] T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning, Springer Series in Statistics (2009).
- [14] B. Kriegler and R. Berk, Boosting the quantile distribution: A cost-sensitive statistical learning procedure, Preprint (2007).
- [15] B. Kriegler and R. Berk, Small area estimation of the homeless in Los Angeles: An application of cost-sensitive stochastic gradient boosting, The Annals of Applied Statistics 4(3) (2010), pp. 1234–1255.
- [16] N. Meinshausen, Quantile regression forests, Journal of Machine Learning Research 7 (2006), pp. 983–999.
- [17] M. S. Reis and P. M. Saraiva, Integration of data uncertainty in linear regression and process optimization, AIChE Journal 51(11) (2005), pp. 3007-3019.
- [18] W. J. Nash, T. L. Sellers, S. R. Talbot, A. J. Cawthorn and B. B. Ford, The Population Biology of Abalone (_Haliotis_ species) in Tasmania. I. Blacklip Abalone (_H. rubra_) from the North Coast and Islands of Bass Strait, Sea Fisheries Division, Technical Report No. 48 (ISSN 1034-3288) (1994).
- [19] R. Quinlan, Combining Instance-Based and Model-Based Learning, In Proceedings of the Tenth International Conference on Machine Learning (1993), pp. 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
- [20] G. Ridgeway, Generalized Boosted Models: A guide to the gbm package (2007).
- [21] G. Varoquaux, L. Buitinck, G. Louppe, O. Grisel, F. Pedregosa and A. Mueller, Scikit-learn: Machine Learning Without Learning the Machinery, GetMobile: Mobile Comp. and Comm. 19(1) (2015), pp. 29-33.
- [22] A. Verikas, A. Gelzinis and M. Bacauskiene, Mining data with random forests: A survey and results of new tests, Pattern Recognition 44(2) (2011), pp. 330–349.
- [23] S. Zheng, QBoost: Predicting quantiles with boosting for regression and binary classification, Expert Systems with Applications 39(2) (2012), pp. 1687-1697.