1 Introduction
Tree-based ensembles such as random forests (RF) and gradient boosted trees (GBT) are a mainstay of machine learning and, in particular, are considered dominant algorithms for tabular data (Feng et al. (2018)). Tree-based ensembles can also be interpreted as kernel generators, and this interpretation has been expounded theoretically to investigate their asymptotic properties (Scornet (2016); Chen and Shah (2018)). On the other hand, the RF and GBT kernels, when plugged into kernel learning, have been shown to perform well, even outperforming their respective ensembles in comprehensive simulations and on real-life data sets (Davies and Ghahramani (2014); Feng and Baumgartner (2021)). Kernel-target alignment was first proposed in Cristianini et al. (2001) and Cristianini et al. (2006) for classification and later extended to regression in Braun et al. (2008) as a means of characterizing the relevant information in supervised learning. It has been shown that the kernel-target alignment can be quantified using the eigenvectors of the kernel matrix and the target variable. Kernel-target alignment enables assessing how well a given kernel matches the learning problem represented by a particular data set. The analysis of kernel-target alignment was applied to classification in Cristianini et al. (2001) and Cristianini et al. (2006), to regression in Braun et al. (2008), and to the analysis of hidden layers of deep neural networks in Montavon et al. (2011).
Landmark (prototype) or (dis)similarity based learning (Pekalska et al. (2001); Bien and Tibshirani (2011)) can be used as an alternative way to develop prediction models in nonlinear feature spaces via the kernel matrix (Kar and Jain (2011); Balcan et al. (2008)). In similarity/dissimilarity learning the kernel entries are explicitly interpreted as pairwise similarities/dissimilarities between the points (samples). Recasting the problem this way, prediction models can be built accordingly. Tree ensemble generated kernels can therefore be readily used within this paradigm.
Kernel-target alignment has never been evaluated for tree-based ensembles, although eigenanalysis of the RF kernel matrix was suggested as potentially useful in Breiman (2000). The focus of our investigation is the kernel-target alignment of tree ensemble based kernels and its relationship with the performance of the associated kernel learning. Furthermore, we propose a sensitivity analysis through landmark learning. The remainder of the manuscript is organized as follows: Section 2 introduces the theoretical framework of tree ensemble learning and formalizes the notions of kernel-target alignment and landmark learning, Section 3 details a simulation study that systematically evaluates the kernel-target alignment and performance of the ensemble based kernels, Section 4 summarizes the experiments on real-life data sets and Section 5 provides discussion, conclusions and future research directions.
2 Methods
2.1 Terminology
2.2 Tree Ensemble Based Kernel Learning
Kernel methods in the machine learning literature are a class of methods that are formulated in terms of a similarity (Gram) matrix K. The entry K(x_i, x_j) of the similarity matrix represents the similarity between two points x_i and x_j. Kernel methods have been well developed and there is a large body of references covering their different aspects (Herbich (2001); Schoelkopf and Smola (2001); Friedman et al. (2009)). In our work we used a common kernel algorithm, namely kernel ridge regression (KRR). KRR is a kernelized version of the traditional linear ridge regression with the L2-norm penalty. Given the kernel matrix K estimated from the training set, first the coefficients α of the (linear) KRR predictor in the nonlinear feature space induced by the kernel are obtained:

α = (K + λI)^(-1) Y    (1)

where λ is the regularization parameter and I is the identity matrix.
The KRR predictor is given as:

ŷ(x) = k(x)^T α    (2)

where k(x) = (K(x, x_1), ..., K(x, x_n))^T is the vector of similarities of x to the training points.
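As an illustration, the two KRR equations above can be sketched in a few lines of Python (the analysis in the paper itself was carried out in R; the function names here are ours):

```python
import numpy as np

def krr_fit(K, y, lam=1e-3):
    """Eq. 1: solve (K + lam*I) alpha = y for the KRR coefficients."""
    n = K.shape[0]
    return np.linalg.solve(K + lam * np.eye(n), y)

def krr_predict(K_test_train, alpha):
    """Eq. 2: prediction is the test-to-train kernel matrix times alpha."""
    return K_test_train @ alpha
```

Any positive semidefinite kernel matrix can be plugged in, including the tree ensemble kernels introduced next.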
Tree-based ensembles are aggregated from a set of regression trees {T_m, m = 1, ..., M}, with T_m representing a single tree. Each single tree partitions the feature space into disjoint regions. Moreover, for a single tree, each region in the feature partition is given by a unique decision path from the tree root to its terminal node. As a byproduct, a tree-based kernel is naturally generated via the regression trees and their respective feature partitions (Breiman (2000); Chen and Shah (2018)). Thus, the kernel that corresponds to a tree-based ensemble is obtained as the probability that x_i and x_j fall in the same terminal node (Chen and Shah (2018)):

K(x_i, x_j) = (1/M) Σ_{m=1}^{M} I[x_i and x_j fall in the same terminal node of T_m]    (3)
The ensemble based kernels can be obtained by various feature space partitioning mechanisms (Fan et al. (2020)). We used RF and extreme gradient boosting (XGB) as the implementation of GBT in our work; their algorithmic details are provided for completeness in the Appendix. We use the terms GBT and XGB interchangeably in the remainder of the manuscript.
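A minimal sketch of Equation 3 in Python, using scikit-learn's RandomForestRegressor as a stand-in RF implementation (the paper itself used the R package ranger; `rf_kernel` is our own helper name):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf_kernel(forest, X):
    """Eq. 3: fraction of trees in which two points share a terminal node."""
    leaves = forest.apply(X)            # leaf index per sample, per tree: (n, M)
    n, M = leaves.shape
    K = np.zeros((n, n))
    for m in range(M):
        # boolean co-membership matrix for tree m, accumulated into K
        K += leaves[:, m][:, None] == leaves[:, m][None, :]
    return K / M
```

By construction K is symmetric, has unit diagonal and entries in [0, 1].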
2.3 Kerneltarget Alignment
In Cristianini et al. (2001), the sample ordering according to the leading eigenvector of the kernel matrix was shown to correspond to the class delineation and was considered a proxy of the kernel-target alignment. This idea was further developed in Braun et al. (2008) via eigenanalysis of the kernel matrix.
We define the spectral (here it is also a singular value) decomposition of the kernel matrix K as:

K = U Λ U^T    (4)

where U is an orthonormal matrix with the columns u_1, ..., u_n (eigenvectors of K) and Λ is an n by n diagonal matrix with the eigenvalues λ_1, ..., λ_n on the diagonal. We assume that the λ_i are ordered according to their magnitude. The kernel-target alignment components can be obtained as the absolute values of the scalar products |u_i^T Y|. The fundamental result of Braun et al. (2008) is concerned with the rate of decay of the kernel-target alignment components. In particular, it was proved that under mild conditions the kernel-target alignment components decay at the same rate as the eigenvalues of the kernel matrix. As a consequence of this decay, the relevant information pertaining to the supervised problem is concentrated in the leading eigenvectors of the kernel matrix that are strongly aligned with the target (Braun et al. (2008); Montavon et al. (2011)), if such strongly aligned components exist.
In our investigation, we used normalized kernel-target alignment components given by the absolute value of the Pearson correlation coefficient between the u_i and Y, with K being the kernel matrix obtained from the tree-based ensembles.
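A sketch of how the normalized alignment components can be computed (Python for illustration, with our own helper name; the paper's implementation was in R):

```python
import numpy as np

def alignment_spectrum(K, y, n_components=30):
    """|Pearson corr(u_i, y)| for eigenvectors u_i of K,
    ordered by decreasing eigenvalue magnitude."""
    eigvals, eigvecs = np.linalg.eigh(K)        # ascending eigenvalues
    order = np.argsort(np.abs(eigvals))[::-1]   # largest magnitude first
    comps = [abs(np.corrcoef(eigvecs[:, i], y)[0, 1])
             for i in order[:n_components]]
    return np.array(comps)
```

Plotting this spectrum against the component index yields the kind of alignment plots discussed in the simulation section.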
2.4 Landmark Learning in Nonlinear Kernel Feature Spaces
In landmark learning, the empirical similarity map (data driven embedding) (Balcan et al. (2008); Kar and Jain (2011)) is generated by selecting a subset of data points (landmarks), also referred to as a reference set (Pekalska et al. (2001)). A point in the feature space is represented by its similarities to the landmarks. This approach is akin to a dimensionality reduction of the original kernel problem to a lower dimensional problem (Balcan et al. (2008)). A linear model is subsequently developed on the landmark features and a landmark predictor is obtained as:

ŷ = Z β    (5)

where Z is an n by L matrix, L is the number of landmarks and β are the coefficients of the linear model. The entry Z_il is the similarity of the point x_i to the landmark x_l. Consider now the singular value decomposition (SVD) of Z:

Z = U S V^T    (6)

where in this case the matrix U is an orthonormal matrix with the columns u_1, ..., u_n (left singular vectors of Z) and S is an n by L diagonal matrix with the singular values s_1, ..., s_L on the diagonal. We assume that the s_i are ordered according to their magnitude. Accordingly, the s_i^2 and the u_i are the eigenvalues and the eigenvectors of the landmark kernel matrix Z Z^T, respectively. Following the previous considerations for K in Section 2.3, the kernel-target alignment components in landmark learning can be obtained as the absolute values of the scalar products |u_i^T Y|. Again, we use here the absolute value of the Pearson correlation coefficient between the u_i and Y, where the u_i are obtained from Equation 6.
We used landmark learning in a sensitivity analysis, in which an increasing number of landmarks was randomly selected from the training set.
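A sketch of the landmark construction and its alignment components (Python, illustrative names; here the landmarks are simply columns of a precomputed kernel matrix):

```python
import numpy as np

def landmark_map(K, landmark_idx):
    """Empirical similarity map Z: similarities of all points to the landmarks."""
    return K[:, landmark_idx]                   # shape (n, L)

def landmark_alignment(Z, y, n_components=30):
    """|Pearson corr| between left singular vectors of Z (Eq. 6) and target y."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    r = min(n_components, U.shape[1])
    return np.array([abs(np.corrcoef(U[:, i], y)[0, 1]) for i in range(r)])
```

Increasing the size of `landmark_idx` (e.g. 100, 200, 300 randomly chosen training points) gives the sensitivity analysis described above.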
The code for the simulation and real-life data analysis was developed in the R programming language (R Core Team (2017)). The ranger (Wright and Ziegler (2017)) implementation of RF and the xgboost (Chen et al. (2020)) implementation of XGB were used, respectively. All algorithms were applied using their default parameters. The regularization parameter λ for kernel ridge regression was chosen as the minimum value such that the matrix K + λI was invertible.

3 Simulation
Simulation scenarios for the kernel-target alignment evaluation of the RF/GBT kernels were set up according to previously reported simulation benchmarks. They included Friedman (Friedman (1991)), Meier 1 and Meier 2 (Meier et al. (2009)), van der Laan (Van der Laan et al. (2007)) and Checkerboard (Zhu et al. (2015)).
3.1 Simulation Setup
For each simulation scenario, the predictors were simulated from Uniform (Friedman, Meier 1, Meier 2, van der Laan) or Normal (Checkerboard) distributions, respectively.
Continuous targets were generated as Y = f(X) + ε, with ε denoting an additive noise term. The definitions of f for each simulation case are given below.
The five functional relationships between the predictors and target for different simulation settings are specified as follows.
1. Friedman. The setup for Friedman was as described in Friedman (1991).
2. Checkerboard. In addition to Friedman, we simulated data from a Checkerboard-like model with strongly correlated predictors, as in Scenario 3 of Zhu et al. (2015), where the (j, k) component of the predictor covariance matrix is specified.
3. van der Laan. This setup was studied in Van der Laan et al. (2007).
(7) 
4. Meier 1. This setup was investigated in Meier et al. (2009).
(8) 
5. Meier 2. This setup was investigated in Meier et al. (2009) as well.
(9) 
For each functional relationship (Friedman, Checkerboard, Meier 1, Meier 2, and van der Laan), we simulated data from four scenarios with different sample sizes (n = 800 and n = 1600) and numbers of features (p = 20 and p = 40). Within each scenario, we simulated 200 data sets, and for each data set we randomly chose 75% of the samples as training data and the remaining 25% as test data. We repeated the analysis 200 times to evaluate the kernel-target alignment of the RF and XGB kernels on the training sets and its relationship with the performance of the respective kernel algorithm on the independent test sets.
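One replicate of such a scenario can be sketched as follows, using scikit-learn's make_friedman1 for the Friedman setting (the paper's own simulations were run in R, and the noise level here is an assumption):

```python
from sklearn.datasets import make_friedman1
from sklearn.model_selection import train_test_split

def one_replicate(n=800, p=20, noise=1.0, seed=0):
    """One simulated Friedman data set with a 75/25 train/test split."""
    X, y = make_friedman1(n_samples=n, n_features=p, noise=noise,
                          random_state=seed)
    return train_test_split(X, y, test_size=0.25, random_state=seed)
```

Repeating this 200 times with different seeds yields the replicates used to summarize alignment and test-set performance.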
3.2 Simulation Results
The kernel-target alignment spectra for the Friedman, Checkerboard, Meier 1, Meier 2 and van der Laan simulation settings are shown in Figs. 1, 2, 3, 4 and 5, respectively. The left and right panels in the figures correspond to the results obtained from RF and XGB, respectively. On the x-axis, the first thirty components, ordered according to the decreasing eigenvalues (singular values) of the kernel matrix, are shown. These components are shown for the ensemble derived kernel matrix K and for the landmark matrices Z. The landmark matrices are built from a varying number of landmarks or prototypes (nProto): 100, 200 and 300. The y-axis corresponds to the absolute value of the correlation coefficient between the corresponding eigenvectors of K (or left singular vectors of Z) and the target Y.
The RF alignment spectrum for Friedman shows strong peaks corresponding to the leading eigenvectors (singular vectors) across the simulation settings (Figs. 1(a,c,e,g)). These peaks show the same pattern for the prototypes, and their magnitude increases monotonically with the number of prototypes (nProto = 100, 200 and 300). The XGB alignment spectra for Friedman are flatter than those for RF (Figs. 1(b,d,f,h)) and they overlap with respect to the increasing number of prototypes.
The RF alignment spectrum for Checkerboard is similar to that of Friedman across all simulation settings. A strong peak is associated with the second leading eigenvector (left singular vector) (Figs. 2(a,c,e,g)). In contrast to Friedman, the XGB alignment spectrum also shows strong components across all simulation settings (Figs. 2(b,d,f,h)). Interestingly, for the Checkerboard there is little difference in the alignment spectra with respect to the number of prototypes.
The Meier 1 and Meier 2 data sets behave similarly to the Friedman data set. In the RF alignment spectrum, Meier 1 and Meier 2 show strong peaks, the magnitude of which increases with the number of prototypes (Figs. 3(a,c,e,g) and 4(a,c,e,g) for Meier 1 and Meier 2, respectively). The XGB kernel alignment spectra are flatter, monotonically decreasing and overlapping with respect to the number of prototypes (Figs. 3(b,d,f,h) and 4(b,d,f,h) for Meier 1 and Meier 2, respectively).
For the van der Laan data set, the RF kernel alignment spectrum shows strong peaks for the leading eigenvectors (singular vectors), as for the other simulated data sets (Figs. 5(a,c,e,g)). However, the XGB alignment spectrum for van der Laan is flat across all scenarios, indicating weak performance of XGB for this data set.
The kernel-target alignment vs. the performance of the RF kernel and the XGB kernel for the different data generating mechanisms and simulation settings is summarized in Figs. 6(a-l). The performance on the test set was evaluated by the absolute value of the Pearson correlation coefficient between the predicted values and the target (Ytest) (the y-axis in Figs. 6(a-l)).
Three summaries of the kernel-target alignment on the training set were evaluated (the x-axis in Figs. 6(a-l)). On the left (Figs. 6(a,d,g,j)), the mean of the absolute value of the correlation coefficient between the first eigenvector, corresponding to the largest eigenvalue (singular value), and Ytrain is shown (Cristianini et al. (2001); Cristianini et al. (2006)). In the middle (Figs. 6(b,e,h,k)), the mean correlation coefficient of the eigenvector with maximum correlation with Ytrain is shown. On the right, the mean of the absolute value of the correlation coefficient between the eigenvectors (singular vectors) of K and Ytrain, averaged over the best 5 (with the highest correlation coefficients) among the leading 10 eigenvectors, is displayed. Overall, the higher the kernel-target alignment obtained on the training set, the better the performance of the RF/XGB kernels across the data generating mechanisms and simulation settings. The eigenvectors corresponding to the largest eigenvalue (singular value) are not necessarily best aligned with the target, which is demonstrated by the Checkerboard data for both (RF and XGB) kernels and by the van der Laan data set for the XGB kernel, respectively (Figs. 6(a,b,d,e,g,h,j,k)).
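Given the alignment components of a training-set kernel (ordered by decreasing eigenvalue magnitude), the three summaries can be computed as in this sketch (our own helper name; Python for illustration):

```python
import numpy as np

def alignment_summaries(comps):
    """comps: |corr(u_i, y)| ordered by decreasing eigenvalue magnitude.
    Returns (leading component, best component, mean of best 5 of leading 10)."""
    first = comps[0]                              # leading eigenvector
    best = np.max(comps)                          # best aligned eigenvector
    top5_of_10 = np.sort(comps[:10])[::-1][:5].mean()
    return first, best, top5_of_10
```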
4 Application Using Real Life Data Sets
4.1 Experimental Setup
We assessed the kernel-target alignment of tree-based ensemble kernels using real-life data sets, a summary of which is given in Table 1.
For the larger data sets (California and Protein) we randomly selected 2000 samples and split them into a training and a test set, with 1500 and 500 samples, respectively. We repeated the analysis 200 times to evaluate the kernel-target alignment of the RF and XGB kernels, respectively. For the other data sets we split the data into training and test sets in a ratio of 3 to 1 and repeated the analysis 200 times. Similarly to the simulation, we evaluated the kernel-target alignment on the training sets and related it to the performance on the test sets.
4.2 Real Life Data Sets Results
The alignment spectra are given in Figs. 7, 8, 9, 10 and 11 for the California, Boston, Protein, Concrete and CSM data sets, respectively. The results from the RF and XGB kernels are provided in subfigures (a) and (b), respectively. The California, Boston and Protein data sets (Figs. 7(a,b), 8(a,b) and 9(a,b)) are characterized by strong peaks in the alignment spectra for both the RF and XGB kernels. The Concrete data set has multiple strong peaks for the XGB kernel (Fig. 10(b)), in contrast to the RF kernel. On the other hand, the CSM data set has strong peaks for the RF kernel (Fig. 11(a)), whereas the alignment spectrum obtained from the XGB kernel is flat (Fig. 11(b)). The performance of the RF and XGB kernels, in terms of the correlation of the predictions with the target on the test set, is provided in Table 2. In addition, the average of the correlation coefficients between the top five eigenvector components and the target, obtained from the training sets, is provided as a summary measure of the kernel-target alignment (see Table 2). Using this metric, for three data sets (Boston, Protein and CSM) the RF kernel shows higher alignment with the target than the XGB kernel and, in turn, better performance. On the other hand, for the Concrete data set the kernel-target alignment of the XGB kernel is higher than that of the RF kernel and, as a consequence, the XGB kernel exhibits better performance. For the California data set, RF and XGB show comparable kernel-target alignment, with RF slightly outperforming XGB. Of note, the overall alignment spectra of the California data set for RF and XGB exhibit the same pattern (see Figs. 7(a,b)).
For completeness, the prediction performance in terms of the mean squared error (MSE) is given in Table 3.
Table 1: Real-life data sets (n: number of samples, p: number of features).

Dataset | n | p
California Housing (Pace and Barry (1997)) | 20640 | 9
Boston Housing (Harrison Jr and Rubinfeld (1978); Dua and Graff (2017)) | 506 | 13
Protein Tertiary Structure (Dua and Graff (2017)) | 45730 | 9
Concrete Compressive Strength (Yeh (1998); Dua and Graff (2017)) | 1030 | 9
Conventional and Social Movie (CSM) (Ahmed et al. (2015); Dua and Graff (2017)) | 187 | 12
5 Discussion and Future Work
In this paper, we have shown that for regression, the performance of the tree ensemble (RF/XGB) based kernels is associated with the degree of kernel-target alignment. In a comprehensive simulation study and on real-life data sets we demonstrated that strongly target-aligned components of the kernel matrix translate into high performance of the tree ensemble based kernels. The strongly target-aligned components correspond to (typically a small number of) eigenvectors of the kernel matrix with larger eigenvalues (i.e. they lie on the left side of the eigenvalue spectrum). However, they do not necessarily exactly follow the magnitude based ordering of the eigenvalues. This suggests that the target-aligned components span a low dimensional manifold that is implicitly represented by the tree based kernel. Moreover, the strongly aligned components (peaks) are persistent, as shown in the sensitivity analysis with landmark tree based kernel learning and an increasing number of landmarks.
The kernel-target alignment can be applied to other tree ensemble based kernels, e.g. the recent RF and XGB variants. These include oblique, rotation and mixup forests (Menze et al. (2011), Rodriguez et al. (2006) and Rodriguez et al. (2020), respectively). Of interest is also the kernel-target alignment of kernels obtained from Bayesian approaches such as Mondrian forests (Balog et al. (2016)) or Bayesian nonparametric partitions and Bayesian additive regression trees (BART) (Fan et al. (2020) and Linero (2017)).
Furthermore, the concept of the tree ensemble based kernel extends to other prediction targets, such as binomial and time-to-event targets, which represent classification and survival, respectively. For example, the proximity matrix, i.e. the ensemble kernel, for the survival forest is readily available (Ishwaran and Lu (2019)). For the estimation of the kernel-target alignment for a survival target, the additional challenge of incomplete information about the target due to censoring needs to be addressed; this is an interesting topic for future research.
We used kernel ridge regression as the kernel learning algorithm in our contribution. There have also been recent advancements in the development of response guided principal components (Lang and Zou (2020); Tay et al. (2021)). These have focused on principal component regression and the incorporation of sparsity constraints through the LASSO penalty. As our results support the notion of a relevant dimensionality expressed by the target-aligned components for tree ensemble based kernels, we plan to explore sparse, response (target) guided nonlinear principal component regression for tree ensemble kernel learning in the future.
6 Appendix
6.1 RF and the RF Kernel
Random forest (RF) is defined as an ensemble of tree predictors grown on bootstrapped samples of a training set (Breiman (2000)). Consider an ensemble of tree predictors {T(x, Θ_m), m = 1, ..., M}, with T(x, Θ_m) representing a single tree. The Θ_m are i.i.d. random variables that encode the randomization necessary for the tree construction (Scornet (2016); Ishwaran and Lu (2019)).

6.2 Gradient Boosted Trees (GBT) and the GBT Kernel
The GBT are (similarly to RF) an ensemble of tree predictors. In contrast to RF, the GBT ensemble predictor is obtained as a sum of weighted individual tree predictors through iterative optimization of an objective (cost) function (Friedman (2001); Chen and Guestrin (2016)):

ŷ_i = Σ_{m=1}^{M} f_m(x_i)    (12)

where f_m denotes the m-th regression tree. The objective function of GBT comprises a loss function; for extreme gradient boosting a regularization term is added to control the model complexity. In our work we used the extreme gradient boosting (XGB) implementation of the GBTs (Chen and Guestrin (2016)). The objective function of the XGB algorithm is given as follows:

Obj = Σ_i l(y_i, ŷ_i) + Σ_m Ω(f_m)    (13)

Ω(f) = γT + (1/2) λ ||w||^2    (14)

where
- l is the loss function; the loss functions used in xgboost are the squared error for regression and the logistic loss for classification, respectively
- Ω denotes the regularization penalty
- T is the number of tree terminal nodes and γ a corresponding regularization parameter, respectively
- λ is a regularization parameter controlling the L2 norm of the individual tree (leaf) weights w
As for the RF kernel, the XGB kernel is defined as the probability that x_i and x_j are in the same terminal node (Chen and Shah (2018)):

K(x_i, x_j) = (1/M) Σ_{m=1}^{M} I[x_i and x_j fall in the same terminal node of f_m]    (15)
References
Ahmed et al. (2015). Using crowd source based features from social media and conventional features to predict the movies popularity. In IEEE International Conference on Smart City/SocialCom/SustainCom (SmartCity), pp. 273-278.
Balcan et al. (2008). A theory of learning with similarity functions. Machine Learning 72, pp. 89-112.
Balog et al. (2016). The Mondrian kernel. In Proceedings of the 32nd Conference on Uncertainty in Artificial Intelligence, pp. 32-41.
Bien and Tibshirani (2011). Prototype selection for interpretable classification. Annals of Applied Statistics 5 (4), pp. 2403-2424.
Braun et al. (2008). On relevant dimensions in kernel feature spaces. Journal of Machine Learning Research 9, pp. 1875-1908.
Breiman (2000). Some infinity theory for predictor ensembles. Technical Report 579, Statistics Dept., UC Berkeley.
Chen and Shah (2018). Explaining the success of nearest neighbor methods in prediction.
Chen and Guestrin (2016). XGBoost: a scalable tree boosting system. In Proc. 22nd ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pp. 785-794.
Chen et al. (2020). xgboost: extreme gradient boosting. R package version 1.2.0.1.
Cristianini et al. (2001). On kernel target alignment. In Proceedings of the Neural Information Processing Systems.
Cristianini et al. (2006). Innovations in machine learning.
Davies and Ghahramani (2014). The random forest kernel and other kernels for big data from random partitions. arXiv preprint arXiv:1402.4293.
Dua and Graff (2017). UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences.
Fan et al. (2020). Bayesian nonparametric space partitions: a survey. arXiv preprint arXiv:2002.11394.
Feng and Baumgartner (2021). (Decision and regression) tree ensemble based kernels for regression and classification. arXiv preprint arXiv:2012.10737.
Feng et al. (2018). Multi-layered gradient boosting decision trees. In Advances in Neural Information Processing Systems, Vol. 31.
Friedman (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics 29 (5), pp. 1189-1232.
Friedman (1991). Multivariate adaptive regression splines. The Annals of Statistics 19 (1), pp. 1-67.
Friedman et al. (2009). The elements of statistical learning. Springer.
Harrison Jr and Rubinfeld (1978). Hedonic housing prices and the demand for clean air. Journal of Environmental Economics and Management 5, pp. 81-102.
Herbich (2001). Learning kernel classifiers. MIT Press.
Ishwaran and Lu (2019). Standard errors and confidence intervals for variable importance in random forest regression, classification, and survival. Statistics in Medicine, pp. 558-582.
Kar and Jain (2011). Similarity-based learning via data driven embeddings. In Proceedings of the Advances in Neural Information Processing Systems, pp. 1998-2006.
Lang and Zou (2020). A simple method to improve principal components regression. Stat, e288.
Linero (2017). A review of tree-based Bayesian methods. Communications for Statistical Applications and Methods 24 (6), pp. 543-559.
Meier et al. (2009). High-dimensional additive modeling. The Annals of Statistics 37 (6B), pp. 3779-3821.
Menze et al. (2011). On oblique random forests. In Proceedings ECML/PKDD, Lecture Notes in Computer Science 6911, pp. 453-469.
Montavon et al. (2011). Kernel analysis of deep networks. Journal of Machine Learning Research 12, pp. 2563-2581.
Pace and Barry (1997). Sparse spatial autoregressions. Statistics & Probability Letters 33 (3), pp. 291-297.
Pekalska et al. (2001). A generalized kernel approach to dissimilarity-based classification. Journal of Machine Learning Research 2, pp. 175-211.
R Core Team (2017). R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
Rodriguez et al. (2020). An experimental evaluation of mixup regression forests. Expert Systems with Applications 151, 113376.
Rodriguez et al. (2006). Rotation forest: a new classifier ensemble method. IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (10), pp. 1619-1630.
Schoelkopf and Smola (2001). Learning with kernels. MIT Press.
Scornet (2016). Random forests and kernel methods. IEEE Transactions on Information Theory 62 (3), pp. 1485-1500.
Tay et al. (2021). Principal component-guided sparse regression. The Canadian Journal of Statistics.
Van der Laan et al. (2007). Super learner. Statistical Applications in Genetics and Molecular Biology 6 (1).
Wright and Ziegler (2017). ranger: a fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software 77 (1), pp. 1-17.
Yeh (1998). Modeling of strength of high-performance concrete using artificial neural networks. Cement and Concrete Research 28, pp. 1797-1808.
Zhu et al. (2015). Reinforcement learning trees. Journal of the American Statistical Association 110 (512), pp. 1770-1784.