A Framework for an Assessment of the Kernel-target Alignment in Tree Ensemble Kernel Learning

by   Dai Feng, et al.

Kernels ensuing from tree ensembles such as random forest (RF) or gradient boosted trees (GBT), when used for kernel learning, have been shown to be competitive with their respective tree ensembles (particularly in higher dimensional scenarios). On the other hand, it has also been shown that the performance of kernel algorithms depends on the degree of kernel-target alignment. However, kernel-target alignment for kernel learning based on tree ensembles has not been investigated, and filling this gap is the main goal of our work. Using eigenanalysis of the kernel matrix, we demonstrate that for continuous targets good performance of tree-based kernel learning is associated with strong kernel-target alignment. Moreover, we show that well performing tree ensemble based kernels are characterized by strong target aligned components, expressed through scalar products between the eigenvectors of the kernel matrix and the target. This suggests that when tree ensemble based kernel learning is successful, the relevant information for the supervised problem is concentrated near a lower dimensional manifold spanned by the target aligned components. Persistence of the strong target aligned components in tree ensemble based kernels is further supported by a sensitivity analysis via landmark learning. In addition to a comprehensive simulation study, we also provide experimental results from several real life data sets that are in line with the simulations.




1 Introduction

Tree-based ensembles such as random forest (RF) and gradient boosted trees (GBT) are a mainstay in machine learning and, in particular, are considered dominant algorithms for tabular data Feng et al. (2018). Tree-based ensembles can also be interpreted as kernel generators, and this interpretation has been expounded theoretically to investigate their asymptotic properties (Scornet (2016) and Chen and Shah (2018)). On the other hand, the RF and GBT kernels plugged into kernel learning have been shown to perform well, even outperforming their respective ensembles in comprehensive simulations and real life data sets Davies and Ghahramani (2014), Feng and Baumgartner (2021).

The kernel-target alignment was first proposed in Cristianini et al. (2001), Cristianini et al. (2006) for classification and later extended in Braun et al. (2008) for regression as a means of characterizing the relevant information in supervised learning. It has been shown that the kernel-target alignment can be quantified using the eigenvectors of the kernel matrix and the target variable. The kernel-target alignment enables assessment of the match of a given kernel to the learning problem represented by a particular data set. In Refs. Cristianini et al. (2001), Cristianini et al. (2006), the analysis of kernel-target alignment was applied to classification; in Braun et al. (2008) it was used for regression; and in Montavon et al. (2011) it was applied in the analysis of hidden layers of deep neural networks.

Landmark (prototype) or (dis)similarity based learning Pekalska et al. (2001), Bien and Tibshirani (2011) can be used as an alternative way to develop prediction models in nonlinear feature spaces via the kernel matrix Kar and Jain (2011), Balcan et al. (2008). In similarity/dissimilarity learning the kernel entries are explicitly interpreted as pairwise similarities/dissimilarities between the points (samples). Recasting the problem this way, prediction models can be built accordingly. Tree ensemble generated kernels can therefore be readily used within this paradigm.

Kernel-target alignment has not previously been evaluated for tree-based ensembles, although eigenanalysis of the RF kernel matrix was suggested as potentially useful in Breiman (2000). The focus of our investigation is the kernel-target alignment of tree ensemble based kernels and its relationship with the performance of the related kernel learning. Furthermore, we propose a sensitivity analysis through landmark learning. The remainder of the manuscript is organized as follows: Section 2 introduces the theoretical framework of tree ensemble learning and formalizes the notions of kernel-target alignment and landmark learning, Section 3 details a simulation study that systematically evaluates the kernel-target alignment and performance of the ensemble based kernels, Section 4 summarizes the experiments on real life data sets, and Section 5 provides discussion, conclusions and future research directions.

2 Methods

2.1 Terminology

Following Breiman (2000) and Refs. Ishwaran and Lu (2019), Chen and Shah (2018) and Scornet (2016), we consider a supervised learning problem where a training set of pairs $(x_i, y_i)$, $i = 1, \ldots, n$, with $x_i \in \mathbb{R}^p$, is provided. In our case $y$ is a continuous target. Let the target vector be $Y = (y_1, \ldots, y_n)^T$.

2.2 Tree Ensemble Based Kernel Learning

Kernel methods in the machine learning literature are a class of methods that are formulated in terms of a similarity (Gram) matrix $K$, whose entry $K_{ij} = k(x_i, x_j)$ represents the similarity between two points $x_i$ and $x_j$. Kernel methods have been well developed and there is a large body of references covering their different aspects Herbrich (2001), Schoelkopf and Smola (2001), Friedman et al. (2009).

In our work we used a common kernel algorithm, namely kernel ridge regression (KRR). KRR is a kernelized version of the traditional linear ridge regression with the L2-norm penalty. Given the kernel matrix $K$ estimated from the training set, first the coefficients of the (linear) KRR predictor in the non-linear feature space induced by the kernel are obtained:

$$\hat{\alpha} = (K + \lambda I)^{-1} Y,$$

where $\lambda$ is the regularization parameter.

The KRR predictor is given as:

$$\hat{y}(x) = K(x)^T \hat{\alpha},$$

where $K(x) = (k(x, x_1), \ldots, k(x, x_n))^T$.
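As an illustration (not the paper's code, which was written in R), the two KRR steps can be sketched in a few lines of numpy; the Gaussian kernel here is only a stand-in for any positive semi-definite kernel, such as a tree ensemble kernel, and all variable names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training data: n points, one feature
n = 50
X = rng.uniform(-1, 1, size=(n, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=n)

def rbf_kernel(A, B, gamma=5.0):
    """Gaussian kernel, a stand-in for any PSD kernel (e.g. a tree ensemble kernel)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

K = rbf_kernel(X, X)
lam = 1e-3  # regularization parameter

# Step 1: KRR coefficients, alpha = (K + lam * I)^{-1} y
alpha = np.linalg.solve(K + lam * np.eye(n), y)

# Step 2: KRR prediction at a new point, yhat(x) = sum_i alpha_i k(x, x_i)
X_new = np.array([[0.2]])
y_hat = rbf_kernel(X_new, X) @ alpha
```

In the paper's setting, the Gram matrix `K` would instead be the RF or XGB kernel defined in the next subsection.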

Tree-based ensembles are aggregated from a set of regression trees $\{T_b\}_{b=1}^{B}$, with $T_b$ representing a single tree. Each single tree partitions the feature space into disjoint regions, and each region in the feature partition corresponds to a unique decision path from the tree root to a terminal node. As a byproduct, a tree-based kernel is naturally generated via the regression trees and their respective feature partitions Breiman (2000), Chen and Shah (2018). Thus, the kernel that corresponds to a tree-based ensemble is obtained as the probability that $x_i$ and $x_j$ are in the same terminal node ($R$) Chen and Shah (2018), estimated empirically as:

$$K(x_i, x_j) = \frac{1}{B} \sum_{b=1}^{B} \mathbb{1}\left[x_i, x_j \in R_b\right],$$

where $R_b$ denotes a terminal node of tree $b$ and $\mathbb{1}$ the indicator function. The ensemble based kernels can be obtained by various feature space partitioning mechanisms Fan et al. (2020). We used the RF and extreme gradient boosting (XGB) as an implementation of the GBT in our work; their algorithmic details are provided for completeness in the Appendix. We will use GBT and XGB interchangeably in the remainder of the manuscript.
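The probability above is estimated as the fraction of trees in which two points share a terminal node. Given a matrix of leaf indices (rows are samples, columns are trees, as returned e.g. by `RandomForestRegressor.apply` in scikit-learn or `pred_leaf=True` in xgboost), the kernel matrix can be computed as in this sketch; the tiny leaf matrix is made up for illustration:

```python
import numpy as np

def tree_ensemble_kernel(leaves):
    """Empirical tree-ensemble kernel.

    leaves: (n_samples, n_trees) integer array; leaves[i, b] is the index of the
    terminal node that sample i falls into in tree b.
    Returns K with K[i, j] = fraction of trees where i and j share a terminal node.
    """
    n, B = leaves.shape
    K = np.zeros((n, n))
    for b in range(B):
        # Pairwise indicator: do samples i and j share a leaf in tree b?
        same = leaves[:, b][:, None] == leaves[:, b][None, :]
        K += same
    return K / B

# Tiny made-up example: 3 samples, 2 trees
leaves = np.array([[0, 1],
                   [0, 2],
                   [1, 1]])
K = tree_ensemble_kernel(leaves)
# Samples 0 and 1 share a leaf only in tree 0, so K[0, 1] = 0.5
```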

2.3 Kernel-target Alignment

In Ref. Cristianini et al. (2001), the sample ordering according to the leading eigenvector of the kernel matrix was shown to correspond to the class delineation and was considered a proxy of the kernel-target alignment. This idea was further developed in Braun et al. (2008) via eigenanalysis of the kernel matrix.

We define the spectral (here also the singular value) decomposition of the kernel matrix $K$:

$$K = U \Lambda U^T,$$

where $U$ is an orthonormal matrix with columns $u_i$ (the eigenvectors of $K$) and $\Lambda$ is an $n$ by $n$ diagonal matrix with the eigenvalues $\lambda_i$ on the diagonal. We assume that the $\lambda_i$-s are ordered according to their magnitude.

The kernel-target alignment components can be obtained as the absolute value of the scalar product $|u_i^T Y|$. The fundamental result of Braun et al. (2008) is concerned with the rate of decay of the kernel-target alignment components. In particular, it was proved that under mild conditions the kernel-target aligned components decay at the same rate as the eigenvalues of the kernel matrix. As a consequence of this decay, the relevant information pertaining to the supervised problem is concentrated in the leading eigenvectors of the kernel matrix that are strongly aligned with the target Braun et al. (2008), Montavon et al. (2011), if such strongly aligned components exist.

In our investigation, we used normalized kernel-target alignment components given by the absolute value of the Pearson correlation coefficients between the $u_i$-s and $Y$, with $K$ being the kernel matrix obtained from the tree-based ensembles.
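These normalized alignment components can be computed directly from an eigendecomposition. A minimal sketch in numpy follows; the kernel and target here are synthetic, constructed so that the leading eigenvector is strongly aligned with the target:

```python
import numpy as np

rng = np.random.default_rng(1)

def alignment_spectrum(K, Y):
    """Absolute Pearson correlation between each eigenvector of K and the target Y,
    with eigenvectors ordered by decreasing eigenvalue magnitude."""
    eigvals, eigvecs = np.linalg.eigh(K)       # eigh: ascending order for symmetric K
    order = np.argsort(-np.abs(eigvals))       # reorder by |eigenvalue|, descending
    U = eigvecs[:, order]
    return np.array([abs(np.corrcoef(U[:, i], Y)[0, 1]) for i in range(U.shape[1])])

# Synthetic kernel strongly aligned with the target: K = Y Y^T plus symmetric noise
n = 40
Y = rng.normal(size=n)
noise = rng.normal(size=(n, n)) * 0.05
K = np.outer(Y, Y) + (noise + noise.T) / 2

spec = alignment_spectrum(K, Y)
# By construction the leading component should dominate the spectrum
```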

2.4 Landmark Learning in Nonlinear Kernel Feature Spaces

In landmark learning, the empirical similarity map (data driven embedding) Balcan et al. (2008), Kar and Jain (2011) is generated by selecting a subset of data points (landmarks), also referred to as a reference set Pekalska et al. (2001). A point in the feature space is represented by its similarities to the landmarks. This approach is akin to a dimensionality reduction of the original kernel problem to a lower dimensional problem Balcan et al. (2008). A linear model is subsequently developed on the landmark features and the landmark predictor is obtained as:

$$\hat{Y} = K_L \hat{\beta},$$

where $K_L$ is an $n \times L$ matrix and $L$ is the number of landmarks. The entry $(K_L)_{il}$ is the similarity of the point $x_i$ to the landmark $x_l$.

Consider now the singular value decomposition (SVD) of $K_L$:

$$K_L = U S V^T,$$

where in this case $U$ is an orthonormal matrix with columns $u_i$ (the left singular vectors of $K_L$) and $S$ is an $n$ by $L$ diagonal matrix with the singular values $s_i$ on the diagonal. We assume that the $s_i$-s are ordered according to their magnitude. Accordingly, the $s_i^2$-s and $u_i$-s are the eigenvalues and the eigenvectors of the landmark kernel matrix $K_L K_L^T$, respectively. Following the previous considerations for $K$ in Section 2.3, the kernel-target alignment components in landmark learning can be obtained as the absolute value of the scalar product $|u_i^T Y|$. Again, we use here the absolute value of the Pearson correlation coefficient between the $u_i$-s and $Y$, where the $u_i$-s are obtained from the SVD of $K_L$.

We used landmark learning in a sensitivity analysis, where an increasing number of landmarks was randomly selected from the training set.
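A sketch of this landmark construction, in our own minimal numpy version with a ridge-regularized least squares fit on the landmark features (the data, kernel and names are illustrative, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(2)

def landmark_features(K, landmark_idx):
    """Restrict the full kernel matrix to landmark columns: an n x L similarity map."""
    return K[:, landmark_idx]

def fit_landmark_predictor(K_L, Y, lam=1e-3):
    """Linear model on landmark features:
    beta = argmin ||Y - K_L beta||^2 + lam * ||beta||^2."""
    L = K_L.shape[1]
    return np.linalg.solve(K_L.T @ K_L + lam * np.eye(L), K_L.T @ Y)

# Synthetic kernel and target
n = 60
X = rng.uniform(size=(n, 2))
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-3 * d2)
Y = np.sin(4 * X[:, 0]) + X[:, 1]

# Sensitivity analysis: an increasing number of randomly selected landmarks
for n_proto in (10, 20, 40):
    idx = rng.choice(n, size=n_proto, replace=False)
    K_L = landmark_features(K, idx)
    beta = fit_landmark_predictor(K_L, Y)
    fitted = K_L @ beta
```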

The code for the simulation and real life data analysis was developed in the R programming language R Core Team (2017). The ranger Wright and Ziegler (2017) implementation of RF and the xgboost Chen et al. (2020) implementation of XGB were used, respectively. All algorithms were applied with their default parameters. The regularization parameter for kernel ridge regression was chosen as the minimum value such that the matrix $K + \lambda I$ was invertible.
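One simple way to pick such a minimal regularization value is to scan a grid of candidate values and keep the smallest one for which a Cholesky factorization of $K + \lambda I$ succeeds. This is our own heuristic sketch, not the paper's R code:

```python
import numpy as np

def minimal_ridge(K, grid=None):
    """Smallest lambda from a grid such that K + lambda * I is numerically
    invertible (positive definite), checked via a Cholesky factorization."""
    if grid is None:
        grid = [0.0] + [10.0 ** e for e in range(-12, 1)]
    n = K.shape[0]
    for lam in grid:
        try:
            np.linalg.cholesky(K + lam * np.eye(n))
            return lam
        except np.linalg.LinAlgError:
            continue
    raise ValueError("no lambda in grid makes K + lambda*I positive definite")

# Rank-deficient PSD kernel: Cholesky fails at lam = 0, succeeds for small lam > 0
v = np.arange(5, dtype=float)
K = np.outer(v, v)          # rank-1, singular
lam = minimal_ridge(K)
```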

3 Simulation

Simulation scenarios for kernel target alignment evaluation of RF/GBT kernels were set up according to previously reported simulation benchmarks. They included Friedman Friedman (1991), Meier 1, Meier 2 Meier et al. (2009), van der Laan Van der Laan et al. (2007) and Checkerboard Zhu et al. (2015).

3.1 Simulation Setup

For each simulation scenario, the predictors were simulated from Uniform (Friedman, Meier 1, Meier 2, van der Laan) or Normal (Checkerboard) distributions, respectively.

Continuous targets were generated as $y = m(x) + \varepsilon$, with $\varepsilon$ a Gaussian noise term. For the definition of $m(x)$ in each simulation case see below.

The five functional relationships between the predictors and target for different simulation settings are specified as follows.

1. Friedman. The setup for Friedman was as described in Friedman (1991).

2. Checkerboard. In addition to Friedman, we simulated data from a Checkerboard-like model with strong correlation among the predictors, following the covariance specification of Scenario 3 of Zhu et al. (2015).

3. van der Laan. The setup was studied in van der Laan et al. Van der Laan et al. (2007).


4. Meier 1. This setup was investigated in Meier et al. Meier et al. (2009).


5. Meier 2. This setup was investigated in Meier et al. Meier et al. (2009) as well.


For each functional relationship (Friedman, Checkerboard, Meier 1, Meier 2, and van der Laan), we simulated data from four scenarios with different sample sizes (n = 800 and n = 1600) and numbers of features (p = 20 and p = 40). Within each scenario, we simulated 200 data sets, and for each data set we randomly chose 75% of the samples as training data and the remaining 25% as test data. We repeated the analysis 200 times to evaluate the kernel-target alignment of the RF and XGB kernels on the training sets and its relationship with the respective kernel algorithm performance on the independent test sets.
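For concreteness, a scenario of this kind can be sketched with the standard Friedman #1 benchmark function, in which only the first five of the p features are informative. The noise level below is our assumption, since the exact value is not restated here:

```python
import numpy as np

rng = np.random.default_rng(3)

def friedman(n, p, noise_sd=1.0):
    """Standard Friedman #1 benchmark: m(x) depends only on the first 5 features."""
    X = rng.uniform(size=(n, p))
    m = (10 * np.sin(np.pi * X[:, 0] * X[:, 1])
         + 20 * (X[:, 2] - 0.5) ** 2
         + 10 * X[:, 3]
         + 5 * X[:, 4])
    y = m + noise_sd * rng.normal(size=n)
    return X, y

# One of the four scenarios: n = 800, p = 20, with a 75/25 train/test split
X, y = friedman(800, 20)
n_train = int(0.75 * len(y))
X_train, X_test = X[:n_train], X[n_train:]
y_train, y_test = y[:n_train], y[n_train:]
```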

3.2 Simulation Results

The kernel-target alignment spectra of the Friedman, Checkerboard, Meier 1, Meier 2 and van der Laan simulation settings are shown in Figs. 1, 2, 3, 4 and 5, respectively. The left and right panels in the figures correspond to the results obtained from the RF and XGB, respectively. On the x-axis, the first thirty components ordered according to the decreasing eigenvalues (singular values) of the kernel matrix are shown. These components are shown for the ensemble derived kernel matrix $K$ and the landmark matrices $K_L$-s. The landmark matrices are built from a varying number of landmarks or prototypes (nProto): 100, 200 and 300. The y-axis corresponds to the absolute value of the correlation coefficient between the corresponding eigenvectors of $K$ (or left singular vectors of $K_L$) and the target $Y$.

The RF alignment spectrum for Friedman shows strong peaks corresponding to the leading eigenvectors (singular vectors) across the simulation settings (Figs. 1(a,c,e,g)). These peaks show the same pattern for the prototypes and their magnitude increases monotonically with the number of prototypes (nProto = 100, 200 and 300). The XGB alignment spectra for Friedman are flatter than those of the RF (Figs. 1(b,d,f,h)) and they overlap with respect to the increasing number of prototypes.

The RF alignment spectrum for Checkerboard is similar to that of Friedman across all simulation settings. A strong peak is associated with the second leading eigenvector (left singular vector) (Figs. 2(a,c,e,g)). In contrast to Friedman, the XGB alignment spectrum also shows strong components across all simulation settings (Figs. 2(b,d,f,h)). Interestingly, for the Checkerboard, there is little difference in the alignment spectra with respect to the number of prototypes.

The Meier 1 and Meier 2 data sets behave similarly to the Friedman data set. For the RF alignment spectrum, Meier 1 and Meier 2 show strong peaks, the magnitude of which increases with the increasing number of prototypes (Figs. 3(a,c,e,g) and 4(a,c,e,g) for Meier 1 and Meier 2, respectively). The XGB kernel alignment spectra are flatter, monotonically decreasing and overlapping with respect to the number of prototypes (Figs. 3(b,d,f,h) and 4(b,d,f,h) for Meier 1 and Meier 2, respectively).

For the van der Laan data set, the RF kernel alignment spectrum shows strong peaks for the leading eigenvectors (singular vectors), as for the other simulated data sets (Figs. 5(a,c,e,g)). However, the XGB alignment spectrum for van der Laan is flat across all scenarios, indicating weak performance of the XGB kernel for this data set.

The kernel-target alignment vs. performance of the RF kernel and XGB kernel for the different data generating mechanisms and simulation settings is summarized in Figs. 6(a-l). The performance on the test set was evaluated by the absolute value of the Pearson correlation coefficient between the predicted values and the target (Ytest) (the y-axis in Figs. 6(a-l)).

Three summaries of the kernel-target alignment on the training set have been evaluated (the x-axis in Figs. 6(a-l)). On the left (Figs. 6(a,d,g,j)), the mean of the absolute value of the correlation coefficient between Ytrain and the first eigenvector, corresponding to the largest eigenvalue (singular value), is shown (Cristianini et al. (2001), Cristianini et al. (2006)). In the middle (Figs. 6(b,e,h,k)), the mean absolute correlation coefficient between Ytrain and the eigenvector most correlated with it is shown. On the right, the mean of the absolute value of the correlation coefficient between Ytrain and the eigenvectors (singular vectors) of $K$ from the best 5 (those with the highest correlation coefficient) among the leading 10 eigenvectors is displayed. Overall, the higher the kernel-target alignment obtained on the training set, the better the performance of the RF/XGB kernels across the data generating mechanisms and simulation settings. The eigenvectors corresponding to the largest eigenvalue (singular value) are not necessarily best aligned with the target, as demonstrated for the Checkerboard data for both (RF and XGB) kernels and for the van der Laan data set for the XGB kernel, respectively (Figs. 6(a,b,d,e,g,h,j,k)).



Figure 1: Kernel-target alignment spectra and relevant dimensionality—Friedman

4 Application Using Real Life Data Sets

4.1 Experimental Setup

We assessed the kernel-target alignment in tree-based ensemble kernels using real life data sets, a summary of which is given in Table 1.

For the larger data sets (California and Protein) we randomly selected 2000 samples and split them into training and test set, with 1500 and 500 samples, respectively. We repeated the analysis 200 times to evaluate the kernel-target alignment of RF and XGB kernels, respectively. For the other data sets we split the data into training and test set in the ratio 3 to 1, respectively, and repeated the analysis 200 times. Similarly to the simulation, we evaluated the kernel-target alignment on the training sets and related it to the performance on the test sets.

4.2 Real Life Data Sets Results

The alignment spectra are given in Figs. 7, 8, 9, 10 and 11 for the California, Boston, Protein, Concrete and CSM data sets, respectively. The results from the RF and XGB kernels are provided in sub-figures (a) and (b), respectively. The California, Boston and Protein data sets (Figs. 7(a,b), 8(a,b) and 9(a,b)) are characterized by strong peaks in the alignment spectra for both the RF and XGB kernels. The Concrete data set has multiple strong peaks for the XGB kernel (Fig. 10(b)), in contrast to the RF kernel. On the other hand, the CSM data set has strong peaks for the RF kernel (Fig. 11(a)), whereas the alignment spectrum obtained from the XGB kernel is flat (Fig. 11(b)). The performance of the RF and XGB kernels, in terms of the correlation of the predictions with the target on the test set, is provided in Table 2. In addition, the average of the correlation coefficients between the top five eigenvector components and the target obtained from the training sets is provided as a summary measure of the kernel-target alignment (see Table 2). Using this metric, for three data sets (Boston, Protein and CSM) the RF kernel shows higher alignment with the target than the XGB kernel and, in turn, better performance. On the other hand, for the Concrete data set, the kernel-target alignment of the XGB kernel is higher than that of the RF kernel and, as a consequence, the XGB kernel exhibits better performance. For the California data set, the RF and XGB show comparable kernel-target alignment, with RF slightly outperforming XGB. Of note, the overall alignment spectra of the California data set for the RF and the XGB exhibit the same pattern (see Figs. 7(a),(b)).

For completeness, the results of the prediction performance in terms of the mean squared error (MSE) are given in Table 3.

Dataset n p
California Housing Pace and Barry (1997) 20640 9
Boston Housing Harrison Jr and Rubinfeld (1978),Dua and Graff (2017) 506 13
Protein Tertiary Structure Dua and Graff (2017) 45730 9
Concrete Compressive Strength Yeh (1998), Dua and Graff (2017) 1030 9
Conventional and Social Movie (CSM) Ahmed et al. (2015), Dua and Graff (2017) 187 12
Table 1: Summary of the real life datasets

Dataset RFk XGBk RFk5 XGBk5
California Housing 0.856 (0.016) 0.234 (0.023)
Boston Housing 0.919 (0.022) 0.311 (0.022)
Protein Tertiary Structure 0.575 (0.033) 0.120 (0.016)
Concrete Compressive Strength 0.965 (0.007) 0.215 (0.021)
CSM 0.434 (0.104) 0.144 (0.032)

Table 2:

Performance of the RF and XGB kernels on the test set and summary measures of the kernel-target alignment obtained from the training set. RFk and XGBk refer to the mean correlation coefficients between the target and the predictions obtained from the test sets for the RF and XGB kernels, respectively. RFk5 and XGBk5 refer to the average correlation coefficient between the target and the top 5 components (ordered according to the eigenvalues of the kernel matrix) from the training set for the RF and XGB kernels, respectively. The metrics are provided as means and standard deviations.

5 Discussion and Future Work

In this paper, we have shown that for regression the performance of the tree ensemble (RF/XGB) based kernels is associated with the degree of kernel-target alignment. In a comprehensive simulation study and on real life data sets, we demonstrated that strong target aligned components of the kernel matrix translate into high performance of the tree ensemble based kernels. The strongly target aligned components correspond to (typically a small number of) eigenvectors of the kernel matrix with larger eigenvalues (i.e. they are on the left side of the eigenvalue spectrum). However, they do not necessarily follow exactly the magnitude based ordering of the eigenvalues. This suggests that the target aligned components span a low dimensional manifold that is implicitly represented by the tree based kernel. Moreover, the strongly aligned components (peaks) are persistent, as shown in the sensitivity analysis with landmark tree based kernel learning and an increasing number of landmarks.

The kernel-target alignment can be applied to other tree ensemble based kernels, e.g. the recent RF and XGB variants. These include oblique, rotation and mixup forests (Menze et al. (2011), Rodriguez et al. (2006) and Rodriguez et al. (2020), respectively). Of interest is also the kernel-target alignment of kernels obtained from Bayesian approaches such as Mondrian forests Balog et al. (2016), Bayesian nonparametric partitions, and Bayesian additive regression trees (BART) (Fan et al. (2020) and Linero (2017)).

Furthermore, the concept of the tree ensemble based kernel is inherent to other prediction targets, such as binomial and time-to-event targets that represent classification and survival, respectively. For example, the proximity matrix, i.e. the ensemble kernel, for the survival forest is readily available Ishwaran and Lu (2019). For the estimation of the kernel-target alignment with a survival target, the additional challenge of incomplete information about the target due to censoring needs to be addressed; this is an interesting topic for future research.

We used kernel ridge regression as the kernel learning algorithm in our contribution. There have also been recent advancements in the development of response guided principal components Lang and Zou (2020), Tay et al. (2021). These have focused on principal component regression and the incorporation of sparsity constraints through the LASSO penalty. As our results support the notion of relevant dimensionality expressed by the target aligned components for tree ensemble based kernels, we plan to explore sparse, response (target) guided nonlinear principal component regression for tree ensemble kernel learning in the future.

6 Appendix

6.1 RF and the RF Kernel

Random Forest (RF) is defined as an ensemble of tree predictors grown on bootstrapped samples of a training set Breiman (2000). Consider an ensemble of tree predictors $T(x; \theta_b)$, $b = 1, \ldots, B$, with $T(x; \theta_b)$ representing a single tree. The $\theta_b$-s are iid random variables that encode the randomization necessary for the tree construction Scornet (2016), Ishwaran and Lu (2019).

The RF predictor is obtained as:

$$\hat{y}_{RF}(x) = \frac{1}{B} \sum_{b=1}^{B} T(x; \theta_b).$$

The RF kernel ensuing from the RF is defined as the probability that $x_i$ and $x_j$ are in the same terminal node $R(\theta)$ Breiman (2000), Scornet (2016):

$$K_{RF}(x_i, x_j) = P_{\theta}\left[x_i, x_j \in R(\theta)\right], \qquad \hat{K}_{RF}(x_i, x_j) = \frac{1}{B} \sum_{b=1}^{B} \mathbb{1}\left[x_i, x_j \in R(\theta_b)\right],$$

where $\mathbb{1}$ denotes the indicator function.

6.2 Gradient Boosted Trees (GBT) and the GBT Kernel

The GBT are (similarly to the RF) an ensemble of tree predictors. In contrast to the RF, the GBT ensemble predictor is obtained as a sum of weighted individual tree predictors through iterative optimization of an objective (cost) function Friedman (2001), Chen and Guestrin (2016):

$$\hat{y}_{GBT}(x) = \sum_{b=1}^{B} f_b(x),$$

where $f_b$ denotes an individual regression tree. The objective function of the GBT comprises a loss function; for the extreme gradient boosting a regularization term is added to control the model complexity. In our work we used the extreme gradient boosting (XGB) implementation of the GBT Chen and Guestrin (2016).

The objective function of the XGB algorithm is given as follows:

$$Obj = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{b=1}^{B} \Omega(f_b), \qquad \Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2,$$

where:

$l$ is the loss function; the loss function used in xgboost is squared error for regression and logistic loss for classification
$\Omega$ denotes the regularization penalty
$T$ is the number of tree terminal nodes and $\gamma$ a corresponding regularization parameter
$\lambda$ is a regularization parameter controlling the L2 norm of the individual tree weights $w$

As for the RF kernel, the XGB kernel is defined as the probability that $x_i$ and $x_j$ are in the same terminal node ($R$) Chen and Shah (2018).



Figure 2: Kernel-target alignment spectra and relevant dimensionality—Checkerboard


Figure 3: Kernel-target alignment spectra and relevant dimensionality—Meier 1


Figure 4: Kernel-target alignment spectra and relevant dimensionality—Meier 2


Figure 5: Kernel-target alignment spectra and relevant dimensionality—van der Laan




Figure 6: Mean of the absolute value of the correlation coefficient (cc) between the eigenvector and Ytrain for the first component (i.e. largest eigenvalue) vs. cc between prediction and Ytest (left). Mean of the absolute value of the cc between the eigenvector and Ytrain for the best component (i.e. largest absolute correlation coefficient) vs. cc between prediction and Ytest (middle). Mean of the absolute value of the cc between the eigenvector and Ytrain for the best 5 among the first 10 components vs. cc between prediction and Ytest (right)
Figure 7: Kernel-target alignment spectra and relevant dimensionality—California housing
Figure 8: Kernel-target alignment spectra and Relevant Dimensionality—Boston housing
Figure 9: Kernel-target alignment spectra and relevant dimensionality—Protein Tertiary Structure
Figure 10: Kernel-target alignment spectra and relevant dimensionality—Concrete Compressive Strength
Figure 11: Kernel-target alignment spectra and relevant dimensionality—Conventional and Social Movie

Dataset RF XGB RFk XGBk
California housing 3.64 5.46 3.64
Boston housing 12.512 (4.570) 19.172 (5.697) 13.349 (3.544)
Protein Tertiary Structure 21.248 (1.560) 35.121 (2.771) 27.385 (2.253)
Concrete Compressive Strength 31.083 (4.878) 38.049 (7.849) 19.851 (4.12)
CSM 0.727 (0.158) 1.122 (0.266) 0.906 (0.183)

Table 3: Performance (measured by the MSE) for the RF/XGB and the RF/XGB kernels


  • D. Ahmed, A. Jahnagir, M. A, et al. (2015) Using crowd source based features from social media and conventional features to predict the movies popularity. In IEEE International Conference on Smart City/SocialCom/SustainCom (SmartCity), pp. 273–278. Cited by: Table 1.
  • M. Balcan, A. Blum, and S. N (2008) A theory of learning with similarity functions. Machine Learning 72, pp. 89–112. Cited by: §1, §2.4.
  • M. Balog, B. Lakshminarayanan, Z. Ghahramani, D. M. Roy, and Y. W. Teh (2016) The Mondrian kernel. In Proceedings of the 32nd Conference on Uncertainty in Artificial Intelligence, pp. 32–41. Cited by: §5.
  • J. Bien and R. Tibshirani (2011) Prototype selection for interpretable classification. Annals of Applied Statistics 5 (4), pp. 2403–2424. Cited by: §1.
  • M. Braun, J. Buhmann, and K. Muller (2008) On relevant dimensions in kernel feature spaces. Journal of Machine Learning Research 9, pp. 1875–1908. Cited by: §1, §2.3, §2.3.
  • L. Breiman (2000) Some infinity theory for predictor ensembles. Technical report Technical Report 579, Statistics Dept. UCB. Cited by: §1, §2.1, §2.2, §6.1, §6.1.
  • G. Chen and D. Shah (2018) Explaining the success of nearest neighbor methods in prediction. Cited by: §1, §2.1, §2.2, §6.2.
  • T. Chen and C. Guestrin (2016) XGBoost: a scalable tree boosting system. In Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, pp. 785–794. Cited by: §6.2, §6.2.
  • T. Chen, T. He, M. Benesty, V. Khotilovich, Y. Tang, H. Cho, K. Chen, R. Mitchell, I. Cano, T. Zhou, M. Li, J. Xie, M. Lin, Y. Geng, and Y. Li (2020) Xgboost: extreme gradient boosting. Note: R package version External Links: Link Cited by: §2.4.
  • N. Cristianini, J. Shawe-Taylor, A. Elisseeff, et al. (2001) On kernel-target alignment. In Proceedings of the Neural Information Processing Systems, Cited by: §1, §2.3, §3.2.
  • N. Cristianini, J. Shawe-Taylor, A. Elisseeff, et al. (2006) Innovations in machine learning. Cited by: §1, §3.2.
  • A. Davies and Z. Ghahramani (2014) The random forest kernel and other kernels for big data from random partitions.. arXiv preprint arXiv:1402.4293. Cited by: §1.
  • D. Dua and C. Graff (2017) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. External Links: Link Cited by: Table 1.
  • X. Fan, B. Li, L. Luo, et al. (2020) Bayesian nonparametric space partitions: a survey.. arXiv preprint arXiv:2002.11394. Cited by: §2.2, §5.
  • D. Feng and R. Baumgartner (2021) (Decision and regression) tree ensemble based kernels for regression and classification.. arXiv preprint arXiv:2012.10737. Cited by: §1.
  • J. Feng, Y. Yu, and Z. Zhou (2018) Multi-layered gradient boosting decision trees. In Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), Vol. 31. External Links: Link Cited by: §1.
  • J. Friedman (2001) Greedy function approximation: a gradient boosting machine.. Annals of Statistics 29 (5), pp. 1189–1232. Cited by: §6.2.
  • J. H. Friedman (1991) Multivariate adaptive regression splines. The annals of statistics, pp. 1–67. Cited by: §3.1, §3.
  • J. Friedman, T. Hastie, and R. Tibshirani (2009) The elements of statistical learning. Springer. Cited by: §2.2.
  • D. Harrison Jr and D. Rubinfeld (1978) Hedonic housing prices and the demand for clean air.. Journal of environmental economics and management 5, pp. 81–102. Cited by: Table 1.
  • R. Herbrich (2001) Learning kernel classifiers. MIT Press. Cited by: §2.2.
  • H. Ishwaran and M. Lu (2019) Standard errors and confidence intervals for variable importance in random forest regression, classification, and survival. Statistics in Medicine, pp. 558–582. Cited by: §2.1, §5, §6.1.
  • P. Kar and P. Jain (2011) Similarity-based learning via data driven embeddings. Proceedings of the Advances in Neural Information Processing Systems,, pp. 1998–2006. Cited by: §1, §2.4.
  • W. Lang and H. Zou (2020) A simple method to improve principal components regression.. Stat, pp. e288. Cited by: §5.
  • A. R. Linero (2017) A review of tree-based bayesian methods. Communications for Statistical Applications and Methods 24 (6), pp. 543–559. Cited by: §5.
  • L. Meier, S. Van de Geer, P. Bühlmann, et al. (2009) High-dimensional additive modeling. The Annals of Statistics 37 (6B), pp. 3779–3821. Cited by: §3.1, §3.1, §3.
  • D. Menze, B. Kelm, U. Koethe, et al. (2011) On oblique random forests.. In Proceedings ECML/PKDD, Lecture Notes in Computer Science, 6911, pp. 453–469. Cited by: §5.
  • G. Montavon, M. Braun, and K. Muller (2011) Kernel analysis of deep networks. Journal of Machine Learning Research 12, pp. 2563–2581. Cited by: §1, §2.3.
  • K. Pace and R. Barry (1997) Sparse spatial autoregressions. Statistics & Probability Letters 33 (3), pp. 291–297. Cited by: Table 1.
  • E. Pekalska, P. Paclik, and R. Duin (2001) A generalized kernel approach to dissimilarity-based classification. Journal of Machine Learning Research 2, pp. 175–211. Cited by: §1, §2.4.
  • R Core Team (2017) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. External Links: Link Cited by: §2.4.
  • Rodriguez,JJ, M. Juez-Gil, A. Arnaiz-González, et al. (2020) An experimental evaluation of mixup regression forests. Expert Systems and Applications 151, pp. 113376. Cited by: §5.
  • Rodriguez,JJ, L. Kuncheva, and C. Alonso (2006) Rotation forest: a new classifier ensemble method. IEEE Trans Pattern Anal Mach Intell 28 (10), pp. 1619–30. Cited by: §5.
  • B. Schoelkopf and A. Smola (2001) Learning with kernels. MIT Press. Cited by: §2.2.
  • E. Scornet (2016) Random forests and kernel methods. IEEE Transactions on Information Theory 62 (3), pp. 1485 – 1500. Cited by: §1, §2.1, §6.1, §6.1.
  • J. Tay, J. Friedman, and R. Tibshirani (2021) Principal component-guided sparse regression.. The Canadian Journal of Statistics. Cited by: §5.
  • M. J. Van der Laan, E. C. Polley, and A. E. Hubbard (2007) Super learner. Statistical applications in genetics and molecular biology 6 (1). Cited by: §3.1, §3.
  • M. N. Wright and A. Ziegler (2017) ranger: a fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software 77 (1), pp. 1–17. External Links: Document Cited by: §2.4.
  • J. Yeh (1998) Modeling of strength of high-performance concrete using artificial neural networks.. Cement and Concrete research 28, pp. 1797–1808. Cited by: Table 1.
  • R. Zhu, D. Zeng, and M. R. Kosorok (2015) Reinforcement learning trees. Journal of the American Statistical Association 110 (512), pp. 1770–1784. Cited by: §3.1, §3.