Fréchet random forests

06/04/2019 ∙ by Louis Capitaine, et al. ∙ Université de Bordeaux

Random forests are a statistical learning method widely used in many areas of scientific research, essentially for their ability to learn complex relationships between input and output variables and for their capacity to handle high-dimensional data. However, data are increasingly complex, with repeated measures of omics data and images leading to shapes, curves, and so on, and the random forests method is not specifically tailored for them. In this paper, we introduce Fréchet trees and Fréchet random forests, which handle data whose input and output variables take values in general metric spaces (which can be unordered). To this end, a new way of splitting the nodes of trees is introduced and the prediction procedures of trees and forests are generalized. The random forests out-of-bag error and variable importance score are then naturally adapted. Finally, the method is studied in the special case of regression on curve shapes, both within a simulation study and on a real dataset from an HIV vaccine trial.


1 Introduction

Random forests [2] are a state-of-the-art machine learning method. They owe their success to very good predictive performance coupled with very few parameters to tune. Moreover, as a tree-based method, they handle regression and classification (2-class or multi-class) problems in a consistent manner and deal with quantitative or qualitative input variables. Finally, their non-parametric nature makes it possible to process high-dimensional data (where the number of input variables is very large compared to the number of statistical units).

The general principle of a tree predictor is to recursively partition the input space. Starting from the root node (which contains the whole learning sample), each node is repeatedly split (into two or more child nodes) until a stopping rule is reached. Let us focus, for the sake of clarity, on the case where all input variables are quantitative. For most tree predictors, splits are binary and consist of a pair made of an input variable $X^{(j)}$ and a threshold $c$, leading to two child nodes containing the observations that verify $X^{(j)} \le c$ and $X^{(j)} > c$ respectively [3]. The splitting variable as well as the threshold are most of the time sought to minimize a heterogeneity criterion on the child nodes (the main idea being to partition the input space into more and more homogeneous regions in terms of the output variable).
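To make this classical criterion concrete, here is a minimal sketch (plain NumPy, with hypothetical function and variable names, not the paper's code) of an exhaustive threshold search minimizing the within-child sum of squares for one quantitative splitting variable:

```python
import numpy as np

def best_threshold(x, y):
    """Exhaustively search the threshold on a quantitative variable x
    minimizing the summed within-node variance of the two child nodes."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_c, best_crit = None, np.inf
    # Candidate thresholds: midpoints between consecutive distinct values.
    for k in range(1, len(x_sorted)):
        if x_sorted[k] == x_sorted[k - 1]:
            continue
        c = 0.5 * (x_sorted[k - 1] + x_sorted[k])
        left, right = y_sorted[:k], y_sorted[k:]
        # Heterogeneity criterion: within-child sums of squared deviations.
        crit = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if crit < best_crit:
            best_c, best_crit = c, crit
    return best_c, best_crit
```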

One limitation of the previously described splitting strategy is that all input variables must live in an ordered space (the method must decide whether an observation of the splitting variable is smaller or larger than the threshold). Yet, with complex data structures, inputs can belong to unordered spaces. For instance, suppose that we have repeated measurements of the input variables as well as of the output variable, and that the objective is to predict the output trajectory given the input trajectories. This framework actually motivates our work, as will be shown with the application. If the problem is tackled at the trajectory level (or curve level), the notion of order is lost. However, ignoring the fact that measurements are repeated, which would bring us back to the classical case of quantitative input variables, can lead to an important loss of information. Thus, one way of analyzing this kind of data is to generalize the notion of split to unordered metric spaces. Recently, random forests have been adapted to the general metric space framework, but in the special case where neither a representation of the data nor the distances between data points are available [8]. In this paper, in contrast, the distance between any two items of each space is assumed to be computable. Note that the approach proposed in this article is very general and can also be applied when inputs are of different natures, such as images, shapes, curves, etc.

Hence, we consider the following framework: suppose that we have a learning sample $\mathcal{L}_n = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ made of i.i.d. observations of a generic couple $(X, Y)$, where $X = (X^{(1)}, \ldots, X^{(p)})$ takes values in a product $\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_p$ of metric spaces $(\mathcal{X}_j, d_j)$ (which can be unordered), and where $Y$ takes values in a metric space $(\mathcal{Y}, d_{\mathcal{Y}})$. The core idea of this work is to generalize the notion of split with a split function that only uses the distance of one metric space. Furthermore, as the notion of mean in the output space is also needed to assign predictions to the terminal nodes of a tree, the Fréchet mean (which generalizes the mean to general metric spaces [5]) is used. This justifies the names Fréchet trees and Fréchet random forests hereafter. Once the notion of split is defined, the building of a maximal tree and the pruning of that tree to obtain an optimal tree are extended to the framework we consider. Finally, with this generalization of CART trees, Fréchet random forests are derived in a rather standard way: a forest predictor is an aggregation of a collection of randomized trees. Note that the aggregation step consists of taking the Fréchet mean of the individual tree predictions.

Recently, many methods have been developed with the Fréchet mean as the central concept. This notion has made it possible to perform PCA for longitudinal data on Riemannian manifolds [4] and to analyze ensembles of complex objects through their shape, such as ECG curves [1] or phylogenetic trees [14]. More recently, new regression methods have emerged to explain a metric space valued output variable with Euclidean predictors [15]. The two methods proposed in this paper allow regression between predictors lying in different metric spaces and a metric space valued output.

In the following sections, we first present the Fréchet tree predictor in detail (Section 2) before introducing Fréchet random forests (Section 3). Section 4 is dedicated to the particular problem of regression on curve shapes, while simulations within this framework are presented in Section 5. An application of the Fréchet random forests method to real data is presented in Section 6. Finally, the interest and potential limitations of the proposed method are discussed in Section 7.

2 Fréchet Trees

2.1 Split function

One key ingredient in the building of a decision tree is the way its nodes are split [3]. Splitting a node of a tree according to some variable amounts to finding a way of grouping the observations of this node into two subsets (its child nodes). This grouping is usually performed to maximize the differences between the two resulting child nodes in terms of the output variable. The key remark is that if a variable $X^{(j)}$ is strongly related to the output variable $Y$, then for two observations with "close" values in $\mathcal{X}_j$, the associated outputs are expected to be "close" in $\mathcal{Y}$. From this idea, we introduce split functions in general metric spaces.

Let $(\mathcal{X}, d)$ be a metric space. A split function is defined as a measurable application $g : \mathcal{X} \to \{1, 2\}$ such that both cells are non-empty and the partition $\{g^{-1}(\{1\}), g^{-1}(\{2\})\}$ of $\mathcal{X}$ associated with $g$ is a Voronoï partition.

Thus, a split function $g$ is entirely defined by an alphabet $c = (c_1, c_2) \in \mathcal{X}^2$, and the associated partition $\{A_1, A_2\}$ of $\mathcal{X}$ is such that for every $x \in \mathcal{X}$, there exists a unique $k \in \{1, 2\}$ such that $g(x) = k$, i.e. $x \in A_k$. We recall that the partition is a Voronoï partition if $d(x, c_k) \le d(x, c_{k'})$ for all $k' \neq k$ and for all $x \in A_k$. Hence, for any fixed alphabet $c$, the split function $g_c$ associated with $c$ is a minimizer of the empirical distortion [7]:

$$\frac{1}{|\mathcal{X}_n|} \sum_{x \in \mathcal{X}_n} d\big(x, c_{g(x)}\big)^2,$$

over the functions $g : \mathcal{X} \to \{1, 2\}$, for any finite subset $\mathcal{X}_n$ of $\mathcal{X}$.
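To fix ideas, here is a small sketch of such a split function when only pairwise distances are available. To keep the example self-contained, the alphabet is restricted to pairs of observed points (a 2-medoids simplification of ours, not necessarily the optimizer used in the paper):

```python
import numpy as np
from itertools import combinations

def best_split_function(D):
    """Search an alphabet (c1, c2) among the observations that minimizes
    the empirical distortion, given the pairwise distance matrix D of one
    metric space; returns the center indices and the Voronoi assignment."""
    n = D.shape[0]
    best_c1, best_c2, best_crit = None, None, np.inf
    for c1, c2 in combinations(range(n), 2):
        # Empirical distortion of the Voronoi partition induced by (c1, c2).
        crit = (np.minimum(D[:, c1], D[:, c2]) ** 2).mean()
        if crit < best_crit:
            best_c1, best_c2, best_crit = c1, c2, crit
    # Each point is assigned to its closest center (ties go to c1).
    assign = np.where(D[:, best_c1] <= D[:, best_c2], 1, 2)
    return best_c1, best_c2, assign
```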

2.2 Splitting rule

Let $N$ be a subset of the input space $\mathcal{X}$ and, for any $j \in \{1, \ldots, p\}$, let $\mathcal{X}_j^N$ denote the restriction of the metric space $(\mathcal{X}_j, d_j)$ to the observations of $N$. Let $g_j$ be a split function on $\mathcal{X}_j^N$. The left and right child nodes associated with the split function $g_j$ are:

$$N_L^j = \big\{ i \in N : g_j\big(x_i^{(j)}\big) = 1 \big\} \quad \text{and} \quad N_R^j = \big\{ i \in N : g_j\big(x_i^{(j)}\big) = 2 \big\}.$$

The quality of the obtained split is then defined by the following measure of Fréchet variance decrease:

$$\Delta(N, j) = \frac{1}{|N|} \left[ \sum_{i \in N} d_{\mathcal{Y}}\big(y_i, \bar{y}_N\big)^2 - \sum_{i \in N_L^j} d_{\mathcal{Y}}\big(y_i, \bar{y}_{N_L^j}\big)^2 - \sum_{i \in N_R^j} d_{\mathcal{Y}}\big(y_i, \bar{y}_{N_R^j}\big)^2 \right],$$

where $|N|$ is the number of observations of $\mathcal{L}_n$ in $N$, and $\bar{y}_N$, $\bar{y}_{N_L^j}$ and $\bar{y}_{N_R^j}$ are the empirical Fréchet means of the output observations belonging to the nodes $N$, $N_L^j$ and $N_R^j$, e.g.:

$$\bar{y}_N \in \operatorname*{arg\,min}_{y \in \mathcal{Y}} \, \sum_{i \in N} d_{\mathcal{Y}}(y_i, y)^2.$$

The Fréchet mean is a natural generalization of the usual mean in Euclidean spaces to any metric space. It is worth noting that the decrease in Fréchet variance of each possible split is measured with the metric of the output space, which makes it possible to compare splits made on input variables of different natures. At last, the split variable $j^\star$ chosen for splitting the node corresponding to $N$ is the one that maximizes $\Delta(N, j)$:

$$j^\star \in \operatorname*{arg\,max}_{j \in \{1, \ldots, p\}} \Delta(N, j).$$

It is easy to show that $\Delta(N, j) \ge 0$ for all $j$ thanks to the use of the Fréchet mean (each child's Fréchet mean achieves a within-node sum of squared distances no larger than the parent's mean does), which means that each split results in a decrease of the total Fréchet variance.
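The decrease can be computed from pairwise output distances alone; below is a hedged sketch in which the empirical Fréchet mean is approximated by the medoid (our simplification, which makes the non-negativity of the decrease only approximate):

```python
import numpy as np

def frechet_variance(DY, idx):
    """Fréchet variance of the outputs indexed by idx, given the pairwise
    output distance matrix DY; the Fréchet mean is approximated by the
    medoid, i.e. the observation minimizing the sum of squared distances."""
    sq = DY[np.ix_(idx, idx)] ** 2
    return sq.sum(axis=1).min() / len(idx)

def variance_decrease(DY, node_idx, left_idx, right_idx):
    """Fréchet variance decrease of the split node_idx -> (left_idx, right_idx)."""
    n = len(node_idx)
    total = frechet_variance(DY, node_idx) * n
    within = (frechet_variance(DY, left_idx) * len(left_idx)
              + frechet_variance(DY, right_idx) * len(right_idx))
    return (total - within) / n
```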

2.3 Tree pruning

Starting from the root node (which contains the whole input space $\mathcal{X}$), nodes are recursively split in order to obtain a partition of $\mathcal{X}$. A binary tree is thus naturally a recursive partitioning. A node $N$ of the tree is not split if it is pure, that is, if the Fréchet variance of this node, $V(N)$, is null, where:

$$V(N) = \frac{1}{|N|} \sum_{i \in N} d_{\mathcal{Y}}\big(y_i, \bar{y}_N\big)^2.$$

As a first step of the building process, the tree is developed until all nodes are pure, leading to the so-called maximal tree. Then, the (standard) pruning algorithm of CART [3] is applied. The only difference is the use of the Fréchet variance instead of the standard empirical variance. At the end of this step, a sequence of nested sub-trees of the maximal tree is obtained. Next, the sub-tree associated with the lowest prediction error (estimated by cross-validation) is selected as the final tree predictor. The way a Fréchet tree predicts new inputs is detailed in the next section.

As a matter of fact, the pruning step provides both a sequence of nested partitions of the input space and a sequence of nested partitions of the output observations. Thus, another criterion, related to the partitioning of the output observations (hence different from the cross-validated prediction error), can be used to select the final tree. For example, Hubert's statistic [9] can be computed on all these partitions to determine which partition of the output observations is the best (i.e., with clusters as homogeneous as possible and as distant from each other as possible). This leads to a clustering of the output observations into an optimal number of clusters (for this statistic). Note that those clusters come from the recursive splitting process associated with the tree, and hence the successive decisions are made on input variables.

2.4 Prediction

Let $\hat{T}$ be an optimized Fréchet tree. We denote by $\mathcal{L}(\hat{T})$ the set of leaves (i.e., terminal nodes) of $\hat{T}$. With each leaf $L \in \mathcal{L}(\hat{T})$ is associated the empirical Fréchet mean $\bar{y}_L$ of the output observations belonging to $L$. Then, the prediction of the output variable associated with any $x \in \mathcal{X}$ is given by:

$$\hat{T}(x) = \bar{y}_{L(x)},$$

where $L(x)$ denotes the leaf into which $x$ falls.

In order to determine to which leaf an observation $x$ belongs, it is dropped down the tree as follows. Starting from the root node, the associated split variable $j^\star$ is considered, together with the two child nodes $N_L$ and $N_R$, as well as the corresponding Fréchet means $c_L$ and $c_R$ of the $j^\star$-th coordinate in each child node. To decide into which child node $x$ must fall, its $d_{j^\star}$-distances to $c_L$ and $c_R$ are computed, and $x$ goes to $N_L$ if $d_{j^\star}(x^{(j^\star)}, c_L) \le d_{j^\star}(x^{(j^\star)}, c_R)$ and to $N_R$ otherwise. This process is repeated until $x$ falls into a leaf. The error made by $\hat{T}$ on a sample $\mathcal{S}$ is defined as:

$$\mathrm{err}\big(\hat{T}, \mathcal{S}\big) = \frac{1}{|\mathcal{S}|} \sum_{(x_i, y_i) \in \mathcal{S}} d_{\mathcal{Y}}\big(y_i, \hat{T}(x_i)\big)^2.$$
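A possible implementation of this drop-down routine, with a hypothetical dictionary-based tree structure (not the paper's code), is sketched below:

```python
def predict_with_tree(node, x, dists):
    """Drop an observation x (a list with one item per metric space) down a
    fitted Fréchet tree. A leaf is a dict holding 'leaf_pred' (its Fréchet
    mean); an internal node holds 'var' (split variable index), 'c_left' and
    'c_right' (representatives of the children for that variable), and the
    child nodes 'left' and 'right'. dists[j] is the distance of space j."""
    while 'leaf_pred' not in node:
        j = node['var']
        d = dists[j]
        # Send x to the child whose representative is closest in X_j.
        if d(x[j], node['c_left']) <= d(x[j], node['c_right']):
            node = node['left']
        else:
            node = node['right']
    return node['leaf_pred']
```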

3 Fréchet Random Forests

A Fréchet random forest is derived as standard random forests [2]: it consists of an aggregation of a collection of randomized Fréchet trees. Here, the same random perturbations as in standard random forests [2] are used. Let $Q$ be the number of trees and consider the $q$-th tree. First, it is built on a bootstrap sample $\mathcal{L}_n^q$ of the learning sample $\mathcal{L}_n$ ($n$ observations drawn with replacement among $\mathcal{L}_n$), and second, the search for the optimal split at each node is restricted to a subset of $\mathrm{mtry}$ variables randomly drawn among the $p$ input variables. Hence, the $q$-th tree is denoted $\hat{T}_q$ and can be viewed as a doubly-randomized Fréchet tree. Once all the randomized trees are built, the Fréchet mean is again used to aggregate them. Thus, for any $x \in \mathcal{X}$, the prediction made by the Fréchet random forest is:

$$\hat{F}(x) \in \operatorname*{arg\,min}_{y \in \mathcal{Y}} \, \sum_{q=1}^{Q} d_{\mathcal{Y}}\big(\hat{T}_q(x), y\big)^2.$$
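Since a closed-form Fréchet mean is rarely available, a common device is to search the minimizer among the tree predictions themselves (a medoid approximation, our simplification, not necessarily the paper's aggregation routine). A sketch:

```python
def forest_predict(trees, x, d_out, predict_fn):
    """Aggregate individual tree predictions with an approximate Fréchet
    mean: the minimizer of the summed squared output distances is searched
    among the tree predictions themselves (medoid approximation)."""
    preds = [predict_fn(t, x) for t in trees]
    costs = [sum(d_out(p, q) ** 2 for q in preds) for p in preds]
    return preds[costs.index(min(costs))]
```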

Furthermore, Fréchet forests inherit the usual quantities of standard random forests: the OOB (Out-Of-Bag) error and variable importance scores. The OOB error provides a direct estimation of the prediction error of the method and proceeds as follows. The predicted output value $\hat{y}_i^{\mathrm{OOB}}$ of the $i$-th observation $(x_i, y_i)$ is obtained by aggregating only the trees built on bootstrap samples that do not contain $(x_i, y_i)$. The OOB error is then computed as the average distance between those predictions and the true outputs:

$$\mathrm{errOOB} = \frac{1}{n} \sum_{i=1}^{n} d_{\mathcal{Y}}\big(y_i, \hat{y}_i^{\mathrm{OOB}}\big)^2.$$
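A sketch of the OOB computation, under the same medoid approximation of the Fréchet mean and with hypothetical helper names:

```python
def oob_error(trees, bootstrap_sets, X, Y, d_out, predict_fn):
    """OOB error: each observation is predicted by aggregating only the
    trees whose bootstrap sample (a set of indices) does not contain it."""
    errs = []
    for i, (x, y) in enumerate(zip(X, Y)):
        oob_trees = [t for t, s in zip(trees, bootstrap_sets) if i not in s]
        if not oob_trees:
            continue  # observation appears in every bootstrap sample
        preds = [predict_fn(t, x) for t in oob_trees]
        costs = [sum(d_out(p, q) ** 2 for q in preds) for p in preds]
        errs.append(d_out(y, preds[costs.index(min(costs))]) ** 2)
    return sum(errs) / len(errs)
```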

Variable importance (VI) provides information on the use of the input variables in the learning task and can be used, e.g., to perform variable selection. For $j \in \{1, \ldots, p\}$, the variable importance of the input variable $X^{(j)}$, $VI(X^{(j)})$, is computed as follows. For the $q$-th bootstrap sample $\mathcal{L}_n^q$, let $\mathrm{OOB}_q$ denote the associated out-of-bag sample, made of all the observations that were not picked in $\mathcal{L}_n^q$. First, $\mathrm{err}(\hat{T}_q, \mathrm{OOB}_q)$, the error made by the tree $\hat{T}_q$ on $\mathrm{OOB}_q$, is computed. Then, the values of $X^{(j)}$ in the sample $\mathrm{OOB}_q$ are randomly permuted, to get a disturbed sample $\widetilde{\mathrm{OOB}}_q^j$, and the error $\mathrm{err}(\hat{T}_q, \widetilde{\mathrm{OOB}}_q^j)$ made by $\hat{T}_q$ on $\widetilde{\mathrm{OOB}}_q^j$ is calculated. Finally, the VI of $X^{(j)}$ is defined as:

$$VI\big(X^{(j)}\big) = \frac{1}{Q} \sum_{q=1}^{Q} \Big( \mathrm{err}\big(\hat{T}_q, \widetilde{\mathrm{OOB}}_q^j\big) - \mathrm{err}\big(\hat{T}_q, \mathrm{OOB}_q\big) \Big).$$
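The permutation scheme translates directly; a minimal sketch (assuming each x is stored as a list of coordinates, one per metric space):

```python
import random

def variable_importance(j, trees, oob_samples, d_out, predict_fn):
    """Permutation importance of variable j: mean increase, over the trees,
    of the OOB error after randomly permuting the j-th input coordinate.
    oob_samples[q] lists the (x, y) pairs out-of-bag for the q-th tree."""
    def err(tree, sample):
        return sum(d_out(y, predict_fn(tree, x)) ** 2 for x, y in sample) / len(sample)

    scores = []
    for tree, oob in zip(trees, oob_samples):
        base = err(tree, oob)
        permuted = random.sample([x[j] for x, _ in oob], k=len(oob))
        disturbed = [(x[:j] + [v] + x[j + 1:], y)
                     for (x, y), v in zip(oob, permuted)]
        scores.append(err(tree, disturbed) - base)
    return sum(scores) / len(scores)
```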

4 Regression on curve shapes for longitudinal data analysis

Let us focus on data made of repeated measurements over time (i.e., longitudinal data) of quantitative variables. The evolution of a variable over time can thus be represented by a curve. Hence, it is assumed that every input variable as well as the output variable are curves: the $j$-th coordinate $x_i^{(j)}$ of the $i$-th observation is a real-valued curve, and so is the output $y_i$. The different curve spaces are equipped with the Fréchet distance, defined for two real-valued curves $f$ and $g$ by:

$$d_F(f, g) = \inf_{\alpha, \beta} \, \max_{t \in [0, 1]} \big| f(\alpha(t)) - g(\beta(t)) \big|,$$

where $\alpha$ and $\beta$ are continuous non-decreasing re-parameterizations mapping $[0, 1]$ onto the supports of $f$ and $g$.

An intuitive idea of this distance is the following: imagine a man and his dog, each walking along a curve; the Fréchet distance between their respective trajectories is the minimum length of the leash that allows the dog and its owner to walk along their respective curves, from one end to the other, without going backwards. The Fréchet distance is a natural measure of similarity between the shapes of curves and has been widely used in various applications such as signature authentication [17], path classification [6] and speech recognition [10].
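In practice the curves are observed at finitely many time points, and one typically computes the discrete Fréchet distance via the classical dynamic program of Eiter and Mannila; a standard implementation (not the paper's code) for one-dimensional curves:

```python
import numpy as np

def discrete_frechet(P, Q):
    """Discrete Fréchet distance between two sampled real-valued curves P
    and Q (1-D arrays), via the classical O(len(P)*len(Q)) dynamic program."""
    n, m = len(P), len(Q)
    ca = np.full((n, m), np.inf)
    ca[0, 0] = abs(P[0] - Q[0])
    for i in range(1, n):  # first column: only forward moves along P
        ca[i, 0] = max(ca[i - 1, 0], abs(P[i] - Q[0]))
    for j in range(1, m):  # first row: only forward moves along Q
        ca[0, j] = max(ca[0, j - 1], abs(P[0] - Q[j]))
    for i in range(1, n):
        for j in range(1, m):
            ca[i, j] = max(min(ca[i - 1, j], ca[i - 1, j - 1], ca[i, j - 1]),
                           abs(P[i] - Q[j]))
    return ca[n - 1, m - 1]
```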

Finally, the 2-means ($k$-means with $k = 2$) for longitudinal data using the Fréchet distance [6] is chosen as the split function. This split function is an adaptation of the $k$-means method tailored to one-dimensional curves. It finds groups of trajectories based on their shapes (groups which are usually not found by conventional methods, e.g. those based on the Euclidean distance).
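Such a shape-based 2-means can be sketched as a 2-medoids alternation on a precomputed Fréchet distance matrix (a simplification in the spirit of kmlShape [6], not its actual code):

```python
import numpy as np

def two_means_frechet(D, n_iter=20, seed=0):
    """Split curves into two shape-based clusters, given their pairwise
    (discrete) Fréchet distance matrix D, by alternating Voronoi
    assignment and medoid updates (a 2-medoids approximation)."""
    rng = np.random.default_rng(seed)
    centers = rng.choice(D.shape[0], size=2, replace=False)
    for _ in range(n_iter):
        assign = np.where(D[:, centers[0]] <= D[:, centers[1]], 0, 1)
        new_centers = centers.copy()
        for k in (0, 1):
            members = np.flatnonzero(assign == k)
            if members.size:
                # New center: the member minimizing within-cluster distortion.
                within = (D[np.ix_(members, members)] ** 2).sum(axis=1)
                new_centers[k] = members[np.argmin(within)]
        if (new_centers == centers).all():
            break
        centers = new_centers
    return centers, assign
```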

The next section illustrates the behavior of Fréchet trees and forests through a simulation study in this context.

5 Simulation study

5.1 Data simulation

In a first scenario, the observations of the input variables are simulated according to the following model, for any observation $i$ and any input variable $j$:

(1)

where the first coefficient randomly assigns one of the typical temporal behaviors to each observation, the second is a dilatation term applied to that behavior, and the last term is a Gaussian white noise with fixed standard deviation corresponding to an additive noise.

The output variable is simulated in a similar way: the pair of coefficients drawn for the inputs is used to determine a typical trajectory for the output variable:

(2)

where the coefficients are the same as those used in (1), the additive term is a Gaussian white noise with its own standard deviation, and the typical output behaviors are given by:

(3)

Figure 1 illustrates the observations simulated using the previous simulation model.

Figure 1: The simulated dataset.

In a second scenario, additional noise variables are simulated as independent paths of a standard Brownian motion on the same time interval, the number of observations being unchanged. Hence, these variables have neither a group structure nor any link to the output variable. This scenario helps to study the behavior of Fréchet random forests and their associated variable importance scores in the case of high-dimensional data.

5.2 Results

All computations in this article were performed on the same server (without concurrent access) with one Intel Core i7-9700K processor @ 5 GHz with 8 cores and 32 GB of RAM, running the Windows 10 operating system.
Let us start with the first scenario, with only the input variables defined in Eq. (1). First, a Fréchet maximal tree was built on the simulated dataset and pruned so as to maximize Hubert's statistic. The resulting tree (Figure 2) has one leaf for each typical behavior characterizing the outputs, and the predicted curves are very close in shape to the typical behaviors given by Eq. (3). In conclusion, the Fréchet tree method managed to retrieve the group structure of the output curves by separating the different behaviors of the input curves.

Figure 2: On the left, the Fréchet tree built on the simulated dataset. The splitting variable of a node is indicated below it. The blue curves above the edges connecting a parent node to its child nodes represent the centers obtained by the split function applied to the parent node for the splitting variable. On the right, the predicted trajectories for each leaf (solid lines) as well as the typical behavior functions characterizing the output (dotted lines) given in Eq. (3).

Next, a Fréchet random forest was applied to the same dataset. The number of randomly drawn variables at each node, $\mathrm{mtry}$, was set to 1 (the only possible choice to introduce randomness, since only two input variables are available), and the number of trees, $Q$, was set to a value justified by the fact that, in this experiment, the OOB error stabilizes once enough trees are included in the forest. In order to compare the predictive performances of trees and forests, the prediction error of both methods was estimated using random cuts of $\mathcal{L}_n$ into a training set and a test set (made of the remaining observations). The average errors (and standard deviations) over these cuts show that, as for standard RF compared to CART trees, forests lead to a significant improvement in predictive performance. Note that the OOB error computed on a single Fréchet forest is very close to the prediction error estimated on the random cuts of $\mathcal{L}_n$.

Finally, the second scenario is considered, so the simulated dataset now contains a large number of additional noise variables. A Fréchet random forest is built with the parameter $\mathrm{mtry}$ fixed to the usual default value in the standard regression framework ($p/3$). The addition of a very large number of noise variables leads to a significant increase in the forest's OOB prediction error. Variable importance is a quantity of interest in this high-dimensional case ($n$ much smaller than $p$), especially to detect and remove the noise variables from the input data and thus considerably improve the prediction error. As a result, the two informative variables stand out with importance scores well above those of the noise variables, which stay very close to zero. Thus, in addition to their good predictive capacity, Fréchet random forests manage to distinguish informative variables (which are curves in this example) from useless ones in a sparse setting.

6 Application to the DALIA vaccine trial

DALIA is a therapeutic vaccine trial including 17 HIV-infected patients who received an HIV vaccine candidate before stopping their antiretroviral treatment. For a full description of the DALIA vaccine trial we refer to [12]. At each harvest time before treatment interruption, 5398 gene transcripts were measured by microarray technology. The plasma HIV viral load (which was log-transformed) of every patient was measured at each harvest time after the antiretroviral interruption. In this application, the measurement times of the inputs (gene transcripts) differ from those of the output (HIV viral load). The objective is to predict the HIV viral load dynamics after antiretroviral treatment interruption for a patient, given the evolution of his/her gene expression during the vaccination phase [16]. Figure 3 illustrates the design of the DALIA vaccine trial and the dynamics of the viral replication after antiretroviral treatment interruption, which shows a large between-individual variability. The analysis with Fréchet random forests was performed on the 17 patients. The parameter $\mathrm{mtry}$ and the number of trees, $Q$, were fixed, and the OOB error of the Fréchet random forest converged and stabilized well before $Q$ trees were included in the forest. Figure 4 illustrates both the OOB predictions and the predictions on the learning sample (fits) of the evolution of the viral load after the HAART interruption for 4 patients of the vaccine trial.

Figure 3: On the left, the DALIA vaccine trial design. On the right, dynamics of the plasma HIV viral load (one curve per patient) after antiretroviral treatment interruption in the DALIA vaccine trial.
Figure 4: Plot of the evolution of viral load after interruption of treatment for four patients of DALIA vaccine trial, and both OOB predictions and fits (predictions on learning samples) obtained by Fréchet random forests.

The predictions of the Fréchet forest on the learning sample were close to the observed viral load curves. Moreover, despite the very small number of individuals, the OOB predictions obtained with this forest are very close in shape to the true curves.
Among the 100 variables selected, many belong to groups of genes (modules) that were selected in a previous work because (i) their dynamics were influenced by the vaccine and (ii) their abundance after vaccination was associated with the peak of viral load. For instance, 5 genes from the inflammation module 3.2 and 3 genes from the T cell module 4.1 were selected with the current approach.
Thus, the Fréchet random forests method applied to the complex example of the DALIA vaccine trial is extremely effective, both for its capacity to predict the output variable and for its ability to find relevant genes explaining the evolution of the viral load after treatment interruption. It should be noted that standard CART trees and random forests cannot be used in such an application: indeed, both the numbers and the observation times of the input and output variables differ.

7 Discussion

Two new tree-based methods, Fréchet trees and Fréchet random forests, were introduced for data taking values in general metric spaces. Let us emphasize that the proposed methods are very versatile: input variables can all be of different kinds, each one having its own metric, and the output variable can be of yet another kind.

The example of learning curve shapes was presented to illustrate the capacity of the methods to learn from data in unordered metric spaces. A simulation study in this framework demonstrated the ability of Fréchet trees and forests to recover the data structure. Let us stress that in the simulation schemes presented in this work, input and output variables are observed repeatedly over time, with the following characteristic: the numbers of measurements (constituting the trajectories) can differ between the input variables and the output variable. This can be a problem for most traditional parametric methods for longitudinal data analysis, such as mixed models [13]. So, one interesting aspect of the proposed method is its ability to deal with this type of data quite effectively, since it is only based on the shapes of the trajectories (thanks to the Fréchet distance). Moreover, Fréchet trees and forests are also able to analyze data for which input and output curves are not observed at the same measurement times (they are thus robust to missing data for some curves), or even over the same time windows. In this paper, we conducted an analysis on an example in the field of vaccinology, with repeated measures of the transcriptome correlated to the later specific immune response. It was possible to analyze those data completely thanks to the fact that Fréchet random forests are flexible enough to allow measurements over different time windows, due to their curve shape-based learning approach.

Finally, as a by-product, a new way of finding the optimal subtree of a maximal tree was introduced, which helps to find an interesting clustering of the outputs. In the case of curves, the set of typical behaviors retrieved can help with the interpretation of the results. We also showed that the variable importance computed by Fréchet forests makes it possible to efficiently retrieve the input variables most related to the output variable, even in a sparse high-dimensional context.

However, there are two main limitations to Fréchet trees and forests. The first is that the Fréchet mean has to exist in the output space [11] and has to be reasonably well approximated. The second concerns the computation time: even though the proposed approaches have been fully implemented for the trajectory case, Fréchet random forests can still be computationally intensive. This problem can be alleviated by the fact that, like all forest methods, they are easily parallelized (the different trees can be built in parallel).

One direction for future work is to develop these methods for image and shape data and to include the possibility of dealing with mixed data in the implementation. Finally, we are working on a proof of consistency of the estimators, building on the recent work around the Fréchet mean and Fréchet regression in [15].

References

  • [1] Jérémie Bigot. Fréchet means of curves for signal averaging and application to ECG data analysis. The Annals of Applied Statistics, 7(4):2384–2401, 2013.
  • [2] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
  • [3] Leo Breiman, Jerome Friedman, Richard Olshen, and Charles J. Stone. Classification and Regression Trees. Chapman & Hall, New York, 1984.
  • [4] Xiongtao Dai, Zhenhua Lin, and Hans-Georg Müller. Modeling longitudinal data on Riemannian manifolds. arXiv preprint arXiv:1812.04774, 2018.
  • [5] Maurice Fréchet. Sur quelques points du calcul fonctionnel. Rendiconti del Circolo Matematico di Palermo (1884-1940), 22(1):1–72, 1906.
  • [6] Christophe Genolini, René Ecochard, Mamoun Benghezal, Tarak Driss, Sandrine Andrieu, and Fabien Subtil. kmlShape: an efficient method to cluster longitudinal data (time-series) according to their shapes. PLOS ONE, 11(6):1–24, 2016.
  • [7] Siegfried Graf and Harald Luschgy. Foundations of Quantization for Probability Distributions. Springer, 2007.
  • [8] Siavash Haghiri, Damien Garreau, and Ulrike von Luxburg. Comparison-based random forests. In International Conference on Machine Learning, pages 1866–1875, 2018.
  • [9] Lawrence J. Hubert and Joel R. Levin. A general statistical framework for assessing categorical clustering in free recall. Psychological Bulletin, 83(6):1072, 1976.
  • [10] Sam Kwong, Q. H. He, Kim-Fung Man, K. S. Tang, and C. W. Chau. Parallel genetic-based hybrid pattern matching algorithm for isolated word recognition. International Journal of Pattern Recognition and Artificial Intelligence, 12(05):573–594, 1998.
  • [11] Thibaut Le Gouic and Jean-Michel Loubes. Existence and consistency of Wasserstein barycenters. Probability Theory and Related Fields, August 2017.
  • [12] Yves Lévy, Rodolphe Thiébaut, Monica Montes, Christine Lacabaratz, Louis Sloan, Bryan King, Sophie Pérusat, Carson Harrod, Amanda Cobb, Lee K. Roberts, et al. Dendritic cell-based therapeutic vaccine elicits polyfunctional HIV-specific T-cell immunity associated with control of viral load. European Journal of Immunology, 44(9):2802–2810, 2014.
  • [13] Mary J. Lindstrom and Douglas M. Bates. Nonlinear mixed effects models for repeated measures data. Biometrics, 46(3):673–687, 1990.
  • [14] Tom M. W. Nye, Xiaoxian Tang, Grady Weyenberg, and Ruriko Yoshida. Principal component analysis and the locus of the Fréchet mean in the space of phylogenetic trees. Biometrika, 104(4):901–922, 2017.
  • [15] Alexander Petersen and Hans-Georg Müller. Fréchet regression for random objects with Euclidean predictors. The Annals of Statistics, 47(2):691–719, 2019.
  • [16] Rodolphe Thiébaut, Boris P. Hejblum, Hakim Hocini, Henri Bonnabau, Jason Skinner, Monica Montes, Christine Lacabaratz, Laura Richert, Karolina Palucka, Jacques Banchereau, and Yves Lévy. Gene expression signatures associated with immune and virological responses to therapeutic vaccination with dendritic cells in HIV-infected individuals. Frontiers in Immunology, 10:874, 2019.
  • [17] Jianbin Zheng, Xiaolei Gao, Enqi Zhan, and Zhangcan Huang. Algorithm of on-line handwriting signature verification based on discrete Fréchet distance. In Lishan Kang, Zhihua Cai, Xuesong Yan, and Yong Liu, editors, Advances in Computation and Intelligence, pages 461–469, Berlin, Heidelberg, 2008. Springer Berlin Heidelberg.