Gradient Boosting With Piece-Wise Linear Regression Trees

02/15/2018 · Yu Shi, et al. · Tsinghua University

Gradient boosting using decision trees as base learners, so-called Gradient Boosted Decision Trees (GBDT), is a very successful ensemble learning algorithm widely used across a variety of applications. Recently, various GBDT construction algorithms and implementations have been designed and heavily optimized in very popular open-source toolkits such as XGBoost and LightGBM. In this paper, we show that both the accuracy and efficiency of GBDT can be further enhanced by using more complex base learners. Specifically, we extend gradient boosting to use piecewise linear regression trees (PL Trees) instead of piecewise constant regression trees. We show that PL Trees can accelerate the convergence of GBDT. Moreover, our new algorithm fits better to modern computer architectures with powerful Single Instruction Multiple Data (SIMD) parallelism. We propose optimization techniques to speed up our algorithm. The experimental results show that GBDT with PL Trees can provide very competitive testing accuracy with comparable or less training time. Our algorithm also produces much more concise tree ensembles and can thus often reduce testing time costs.


1 Introduction

Gradient Boosted Decision Trees (GBDT) [9] has shown excellent performance in many real-world applications and data science competitions. GBDT trains a series of regression trees [1]. At each step, a new regression tree is trained to minimize a certain loss function, and the overall predicted values are updated by adding the predicted values of the new regression tree. The process is repeated until a maximum number of trees has been created. Taking the loss as a function of the current predicted values for all training samples, we can view each decision tree as an approximation of one step of gradient descent on the loss function.

Traditional regression trees in gradient boosting assign a single predicted value to all sample points on each leaf. In other words, all the training samples partitioned into the same leaf are given the same predicted value. Such trees are called piecewise constant trees, since each defines a piecewise constant function in the input space. Training piecewise constant trees is cheap, and they are used in tree ensemble learning algorithms including gradient boosting and random forests. These algorithms have been implemented as efficient toolkits with good usability, such as XGBoost [3] and LightGBM [10].

However, it is very likely that with more complex decision tree models, we can enhance the power of gradient boosting algorithms [5, 15]. Decision trees used in gradient boosting are often piecewise constant trees, so the model given by gradient boosting is a sum of piecewise constant functions, and this function is expected to approximate the underlying complex nonlinear relationship in the data. The intuition is that, with more complex trees, we may fit nonlinear relationships better and produce higher prediction accuracy [5]. The key to achieving this goal is extending GBDT to use piecewise linear regression trees (PL Trees). PL Trees have better fitting ability than traditional piecewise constant regression trees. We provide both theoretical analysis and experimental results to show that with PL Trees, GBDT can achieve better accuracy using fewer iterations.

Another motivation for using more complex base learners is to design algorithms that maximally exploit the capability of modern computer architectures. An important trend in computer architecture is to leverage the power of Single Instruction Multiple Data (SIMD) processing units. Most processors used today support operations on 256-bit registers with a single instruction, and the latest Intel Xeon Phi [4] coprocessors support 512-bit SIMD. It is important to design machine learning algorithms that utilize these features to achieve better efficiency. Our algorithm is designed to better utilize this growing power of computer architectures in the GBDT construction algorithm.

Though PL Trees were proposed a long time ago [8, 2, 12], they are not widely adopted as base learners for gradient boosting in practice. [15] is the first to adapt PL Trees as base learners of gradient boosting. Instead of using both gradients and hessians as XGBoost [3] does, [15] uses only first-order gradients. According to [13], using hessians is key to a faster convergence rate. Recently, LinXGBoost [5] extended XGBoost to use PL Trees. LinXGBoost focuses on low-dimensional data and tries to fit linear models over all features. To the best of our knowledge, there is no implementation of gradient boosting with PL Trees that can surpass XGBoost and LightGBM on large-scale data in terms of both efficiency and accuracy. One important reason is that the computation cost of PL Trees is higher and training an ensemble of them can be quite expensive, since each leaf requires a linear model. In this paper, we reduce the cost from both the algorithmic and the system aspects. At the algorithmic level, we adopt an incremental feature selection strategy during the growth of a tree to constrain the features used in the linear regressions to small subsets. The histogram technique (see, e.g., [14, 3, 10]) is adapted to further reduce the training cost when finding the best split of a tree node. At the system level, SIMD techniques are used to exploit fine-grained parallelism in the algorithm. We arrange data structures carefully to reduce cache misses and provide enough memory bandwidth for the SIMD operations. Intuitively, the extra training cost of PL Trees fills the spare SIMD units that are difficult for traditional GBDT to utilize. All these techniques together make our algorithm more efficient than the ordinary GBDT.

The main contributions of our work include:

  • We extend the traditional GBDT algorithm to use more complex PL regression trees as base learners. Our experiments demonstrate that our algorithm can improve the convergence rate of GBDT. With far fewer trees, we can produce comparable and sometimes better prediction results and reduce the testing time cost, which is desirable for deploying the algorithm on energy-efficient devices.

  • We design an incremental feature selection strategy during the growth of a decision tree to keep the linear models small in size. This strategy avoids the prohibitive computational cost of repeatedly fitting large linear models when training a tree.

  • We provide a highly vectorized implementation to exploit the power of SIMD units, which is difficult for traditional GBDT to utilize.

  • We also show that a PL Tree in gradient boosting implicitly defines a nontrivial positive semi-definite kernel over the input space. The contribution of a sample $x_j$ in a leaf to the predicted value of another sample $x_i$ in the same leaf is determined by $K(x_i, x_j)$, where $K$ is a certain implicit kernel function. Note that our algorithm does not require calculating all the pairwise values $K(x_i, x_j)$ explicitly.

  • We evaluate our algorithm on both synthetic and public datasets, and compare it with state-of-the-art toolkits including XGBoost and LightGBM. The experimental results show that our algorithm can often achieve better accuracy with (sometimes significantly) less training time.

2 Review of Gradient Boosting

In this section, we provide a brief review of GBDT. Specifically, we review one of the most popular variants, XGBoost, which uses the second-order gradient of the loss function [3]. We are given a dataset $D = \{(x_i, y_i)\}_{i=1}^{N}$ of $N$ samples with $d$ features, where $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$. Gradient boosting trains a sequence of decision trees $f_1, \ldots, f_T$. The final output of gradient boosting is the summation of these trees, $\hat{y}_i = \sum_{t=1}^{T} f_t(x_i)$. The loss function is usually augmented by regularization terms $\Omega(f_t)$ to prevent overfitting, where $\Omega(f_t)$ reflects the complexity of tree $f_t$. Let $l(y_i, \hat{y}_i)$ be the loss function for a single sample. Then the total loss is $L = \sum_{i=1}^{N} l(y_i, \hat{y}_i) + \sum_{t=1}^{T} \Omega(f_t)$. Let $\hat{y}_i^{(t-1)}$ be the predicted value of $x_i$ after iteration $t-1$. At iteration $t$, a new tree $f_t$ is trained to minimize the following loss

$$L^{(t)} = \sum_{i=1}^{N} l\big(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t).$$

With a second-order approximation [3], we can approximate the loss above by

$$L^{(t)} \approx \sum_{i=1}^{N} \Big[ l\big(y_i, \hat{y}_i^{(t-1)}\big) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \Big] + \Omega(f_t).$$

Here $l\big(y_i, \hat{y}_i^{(t-1)}\big)$ is a constant value independent of $f_t$, and $g_i = \partial_{\hat{y}^{(t-1)}} l\big(y_i, \hat{y}_i^{(t-1)}\big)$, $h_i = \partial^2_{\hat{y}^{(t-1)}} l\big(y_i, \hat{y}_i^{(t-1)}\big)$. Leaving out the constant, the objective of iteration $t$ becomes

$$\tilde{L}^{(t)} = \sum_{i=1}^{N} \Big[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \Big] + \Omega(f_t). \qquad (1)$$

The specific form of the regularizer $\Omega(f_t)$ varies with the type of base learner.
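
To make the per-iteration statistics concrete, the following is a minimal NumPy sketch (our own illustration, not code from the paper's toolkit) of how the per-sample statistics $g_i$ and $h_i$ in (1) can be computed, assuming a binary logistic loss:

import numpy as np

def logistic_grad_hess(y, y_pred_raw):
    # Per-sample gradient g_i and hessian h_i of the logistic loss
    # at the current raw predictions, as used in Eq. (1).
    p = 1.0 / (1.0 + np.exp(-y_pred_raw))  # predicted probability
    g = p - y                              # dl/dy_hat
    h = p * (1.0 - p)                      # d^2 l/dy_hat^2
    return g, h

# Example: current raw scores for 4 samples with binary labels
y = np.array([1.0, 0.0, 1.0, 0.0])
y_pred_raw = np.array([0.3, -0.1, 1.2, 0.4])
g, h = logistic_grad_hess(y, y_pred_raw)

Each boosting iteration only needs these per-sample pairs $(g_i, h_i)$; the base learner is then trained to minimize (1) given them.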

3 Gradient Boosting With PL Trees

PL Trees used in our algorithm are full binary decision trees with linear functions in each leaf. Formally, there are two basic components of these PL Trees,

  • Splits: A split associated with an internal node is a condition used to partition the samples in the node into its two child nodes. If a sample point meets the condition, it goes to the left child, otherwise to the right. PL Trees use univariate splits of the form $x_{ik} \le c$, where $x_{ik}$ is the $k$-th feature value of sample $x_i$. The feature $k$ is called the split feature.

  • Linear Functions: On each leaf $l$, there is a linear function $f_l$ for the prediction of the samples in $l$. The function has the form $f_l(x) = b_l + \sum_{k \in S_l} a_{l,k} x_k$, where $S_l$ is a subset of $\{1, \ldots, d\}$. We call the features in $S_l$ the regressors for leaf $l$. The selection of the regressors is described in Section 4.

Starting from a single root node, a PL Tree is trained by greedily splitting nodes into children until the number of leaves in the tree reaches a preset maximum value $J$. To give a clear framework for the training of PL Trees in gradient boosting, we first define two oracles:

  1. FitNode. $\mathrm{FitNode}(l)$ fits a linear function on the samples in leaf $l$. The parameters of the function are calculated analytically to minimize (1).

  2. SplitEval. For a leaf $l$ in tree $f_t$ of (1), a feature $k$ and a real value $c$, $\mathrm{SplitEval}(l, k, c)$ returns the reduction of the value of (1) when splitting leaf $l$ with split condition $x_k \le c$ and fitting the sample points in both child nodes using FitNode.

Now the framework for training a PL Tree is summarized in Algorithm 1. We spell out the details of FitNode and SplitEval in Sections 3.1 and 3.2, respectively.

1: initialize the tree with a single root node $r$
2: put all the sample points in $r$ and call $\mathrm{FitNode}(r)$
3: while number of leaves $< J$ do
4:     for each leaf $l$ do
5:         find the best split $(k_l, c_l)$ of leaf $l$ and its gain $\Delta_l = \mathrm{SplitEval}(l, k_l, c_l)$
6:     end for
7:     $l^* \leftarrow \arg\max_l \Delta_l$
8:     split $l^*$ with condition $x_{k_{l^*}} \le c_{l^*}$ into $l_1$ and $l_2$
9:     $\mathrm{FitNode}(l_1)$, $\mathrm{FitNode}(l_2)$
10: end while
Algorithm 1 Train a PL Tree
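
As an illustration of this leaf-wise growth loop, here is a compact Python sketch (our own, with hypothetical helpers: split_eval and fit_node stand for the SplitEval and FitNode oracles, candidate_splits enumerates split conditions, and leaf.split is assumed to partition the leaf's samples):

def train_pl_tree(root, max_leaves, split_eval, fit_node, candidate_splits):
    # Leaf-wise greedy growth as in Algorithm 1 (sketch).
    fit_node(root)
    leaves = [root]
    while len(leaves) < max_leaves:
        best = None  # (gain, leaf, feature, threshold)
        for leaf in leaves:
            for k, c in candidate_splits(leaf):
                gain = split_eval(leaf, k, c)
                if best is None or gain > best[0]:
                    best = (gain, leaf, k, c)
        if best is None or best[0] <= 0:
            break  # no split reduces the loss any further
        _, leaf, k, c = best
        left, right = leaf.split(k, c)   # partition samples by x_k <= c
        fit_node(left)
        fit_node(right)                  # fully corrective refit of both children
        leaves.remove(leaf)
        leaves.extend([left, right])
    return root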

3.1 Minimize Loss in a Single Leaf

Let $I_l$ be the set of samples in leaf $l$ of tree $f_t$ in (1). We can rewrite (1) as

$$\tilde{L}^{(t)} = \sum_{l} \sum_{i \in I_l} \Big[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \Big] + \Omega(f_t). \qquad (2)$$

Let $f_l$ be the linear function fitted in leaf $l$. The regularization function used here is $\Omega(f_t) = \tfrac{\lambda}{2} \sum_{l} \|\theta_l\|_2^2$. Here $\|\theta_l\|_2$ is the $L_2$ norm of the parameters of the linear model in leaf $l$. Adding $\Omega$ prevents the linear models in the leaves from being too steep. Leaving out the iteration index $t$ and focusing on the loss of a single leaf $l$,

$$L_l = \sum_{i \in I_l} \Big[ g_i f_l(x_i) + \tfrac{1}{2} h_i f_l^2(x_i) \Big] + \tfrac{\lambda}{2} \|\theta_l\|_2^2. \qquad (3)$$

We first focus on fitting an optimal linear model for a single leaf $l$ with a given regressor set $S_l$. In other words, we need to find the optimal parameters of $f_l$. The choice of regressors is left to Section 4.

Let $\theta_l = (a_{l,k_1}, \ldots, a_{l,k_s}, b_l)^\top$ collect the coefficients and intercept of $f_l$. Substituting $f_l$ into (3), we get the loss in terms of the parameters of the linear model

$$L_l(\theta_l) = \sum_{i \in I_l} \Big[ g_i\, \theta_l^\top \tilde{x}_i + \tfrac{1}{2} h_i (\theta_l^\top \tilde{x}_i)^2 \Big] + \tfrac{\lambda}{2} \|\theta_l\|_2^2, \qquad (4)$$

where $\tilde{x}_i$ is the sub-vector of $x_i$ containing the regressors in $S_l$, augmented by a constant 1. Let $g_l$ and $h_l$ be the vectors of gradients and hessians of the samples in $I_l$, and let $H_l = \mathrm{diag}(h_l)$. Let $X_l$ be the matrix of samples in $I_l$ with the features in $S_l$, augmented by a column of 1's. We can write the loss in a concise form:

$$L_l(\theta_l) = g_l^\top X_l \theta_l + \tfrac{1}{2}\, \theta_l^\top \big( X_l^\top H_l X_l + \lambda I \big) \theta_l. \qquad (5)$$

Thus the optimal value of $\theta_l$ can be calculated analytically,

$$\theta_l^* = -\big( X_l^\top H_l X_l + \lambda I \big)^{-1} X_l^\top g_l. \qquad (6)$$

Calculating Equation (6) is exactly what $\mathrm{FitNode}(l)$ does. Hence, the minimum loss of leaf $l$ is

$$L_l^* = -\tfrac{1}{2}\, g_l^\top X_l \big( X_l^\top H_l X_l + \lambda I \big)^{-1} X_l^\top g_l. \qquad (7)$$

The result in (7) is essentially the same as in LinXGBoost [5]. The key difference is that our algorithm specifies the set of regressors of each leaf as described in Section 4.2, while LinXGBoost uses the full set of features as regressors. This makes our algorithm more efficient and feasible for high-dimensional data.
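
The closed form in (6)-(7) is just a hessian-weighted ridge regression on the leaf's gradients. As a minimal NumPy sketch (our own reference code, not the optimized GBDT-PL implementation):

import numpy as np

def fit_node(X_leaf, g, h, lam):
    # Closed-form leaf fit, Eq. (6)-(7).
    # X_leaf: (n, s+1) regressor matrix with a trailing column of 1's;
    # g, h: per-sample gradients and hessians; lam: L2 coefficient.
    A = X_leaf.T @ (h[:, None] * X_leaf) + lam * np.eye(X_leaf.shape[1])  # X^T H X + lam*I
    b = X_leaf.T @ g                                                      # X^T g
    Ainv_b = np.linalg.solve(A, b)
    theta = -Ainv_b           # optimal linear model parameters, Eq. (6)
    loss = -0.5 * b @ Ainv_b  # minimal leaf loss, Eq. (7)
    return theta, loss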

3.2 Find the Optimal Split

In this section, we describe the algorithm for finding the optimal split. Specifically, we describe the details of the oracle SplitEval and the calculation of the best split in line 5 of Algorithm 1. Note that we still assume the regressor set $S_l$ is fixed here.

When splitting a leaf $l$ into children $l_1$ and $l_2$ with condition $x_k \le c$, we split the matrix $X_l$ into sub-matrices $X_{l_1}$ and $X_{l_2}$ accordingly. That is, $X_{l_1}$ contains the values of the regressors of the sample points in the left leaf $l_1$, and $X_{l_2}$ those in the right leaf $l_2$. Similarly we define $g_{l_1}$, $g_{l_2}$, $H_{l_1}$ and $H_{l_2}$. With these notations and the definition in (6), the results of $\mathrm{FitNode}(l_1)$ and $\mathrm{FitNode}(l_2)$ are $\theta_{l_1}^*$ and $\theta_{l_2}^*$. Similarly we define $L_{l_1}^*$ and $L_{l_2}^*$ as in (7).

Then the reduction of loss incurred by splitting $l$ into $l_1$ and $l_2$ is

$$\Delta_{l,k,c} = L_l^* - \big( L_{l_1}^* + L_{l_2}^* \big). \qquad (8)$$

This is what $\mathrm{SplitEval}(l, k, c)$ returns. For the calculation of the best split, we adopt a greedy search over all possible split conditions. For every feature $k$ and every possible split point $c$ of $k$, we calculate (8) and choose the optimal $k^*$ and $c^*$. The algorithm is summarized below.

1: $\Delta^* \leftarrow 0$
2: for each feature $k$ do
3:     for each split point $c$ of feature $k$ do
4:         if $\mathrm{SplitEval}(l, k, c) > \Delta^*$ then
5:             $\Delta^* \leftarrow \mathrm{SplitEval}(l, k, c)$
6:             $(k^*, c^*) \leftarrow (k, c)$
7:         end if
8:     end for
9: end for
Algorithm 2 Find the Optimal Split for Leaf $l$
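
A straightforward, unoptimized Python reference for SplitEval and Algorithm 2 follows, reusing the fit_node sketch from Section 3.1; the names and the brute-force enumeration of thresholds are our own illustration, and Sections 4 and 5 describe how the actual implementation avoids this cost:

import numpy as np

def split_eval(X_leaf, x_feat, g, h, c, lam, parent_loss):
    # Loss reduction of the candidate split x_feat <= c, Eq. (8):
    # refit both children with fit_node and compare against the parent's loss.
    left = x_feat <= c
    _, loss_left = fit_node(X_leaf[left], g[left], h[left], lam)
    _, loss_right = fit_node(X_leaf[~left], g[~left], h[~left], lam)
    return parent_loss - (loss_left + loss_right)

def best_split(X_leaf, X_raw, g, h, lam, parent_loss):
    # Greedy search over all features and candidate thresholds (Algorithm 2).
    best_gain, best_k, best_c = 0.0, None, None
    for k in range(X_raw.shape[1]):
        for c in np.unique(X_raw[:, k])[:-1]:   # all split points of feature k
            gain = split_eval(X_leaf, X_raw[:, k], g, h, c, lam, parent_loss)
            if gain > best_gain:
                best_gain, best_k, best_c = gain, k, c
    return best_gain, best_k, best_c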

4 Algorithmic Optimization

In Algorithm 2, SplitEval and FitNode are executed repeatedly to calculate the loss. For each split point, we need to calculate Equation (7) twice, once for each child node, which is expensive when the number of regressors is large. In this section, we introduce algorithmic optimizations to reduce the cost of Algorithm 2.

4.1 Histograms for GBDT With PL Tree

Histograms are used in several GBDT implementations [14, 3, 10] to reduce the number of potential split points. For each feature, a histogram of all its values over all data points is constructed. The boundaries of the bins in the histogram are chosen to distribute the training data evenly over the bins. Each bin accumulates the statistics needed to calculate the loss reduction. When finding the optimal split point, we only consider the bin boundaries instead of all unique feature values. After the histogram is constructed, we only need to record the bin number of each feature for each data point. Fewer than 256 bins per histogram are enough to achieve good accuracy [16], so a bin number can be stored in a single byte. We can discard the original feature values and only store the bin numbers during the boosting process. Thus using histograms produces a small memory footprint.

It seems natural to directly apply the histogram technique in our algorithm. However, the feature values are needed when fitting linear models in the leaves; we would still need to access the feature values constantly, which incurs a large memory footprint. To overcome this problem, for each feature $k$ and each bin $b$ of $k$, we record the average feature value in bin $b$, denoted $\bar{v}_{k,b}$. When fitting linear models, we use $\bar{v}_{k,b}$ to replace the original feature value $x_{ik}$, where $x_{ik}$ is the value of feature $k$ of data point $i$ and falls in bin $b$. In this way, we can still discard the original feature values after preprocessing. Thus we adapt the histogram technique to PL Trees and preserve the small memory footprint.
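
A minimal sketch of this preprocessing step for a single feature (our own illustration; 63 bins is one of the settings used in Section 7): quantile bin boundaries, one-byte bin indices, and the per-bin mean values that replace the raw feature during linear fitting.

import numpy as np

def build_feature_bins(values, n_bins=63):
    # Quantile binning: boundaries spread the data evenly over the bins;
    # each bin also stores the mean feature value used as the regressor value.
    qs = np.linspace(0, 100, n_bins + 1)[1:-1]
    edges = np.unique(np.percentile(values, qs))
    bin_idx = np.searchsorted(edges, values).astype(np.uint8)  # one byte per value
    bin_means = np.array([values[bin_idx == b].mean() if np.any(bin_idx == b) else 0.0
                          for b in range(edges.size + 1)])
    return bin_idx, edges, bin_means

After this step the raw values can be discarded: split finding uses bin_idx, and linear fitting uses bin_means[bin_idx].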

4.2 Incremental Feature Selection

It is unaffordable to use all features as regressors when calculating the linear model parameters in Equation (6). For each leaf, we need to select a small subset of the features as regressors.

In fact, the regressor selection can be done automatically as the tree grows. Consider splitting leaf $l$ into $l_1$ and $l_2$ with condition $x_k \le c$: it is very natural to add the split feature $k$ to the regressor sets of $l_1$ and $l_2$. The intuition is that, if splitting with $x_k \le c$ results in a significant reduction of the loss function, then the value of $x_k$ contains relatively important information for fitting the linear models in $l_1$ and $l_2$. Thus, for each leaf, we choose the split features of its ancestor nodes as regressors. Formally, suppose the regressor set of leaf $l$ is $S_l$. When we split $l$ into $l_1$ and $l_2$ with condition $x_k \le c$, the regressor set of $l_1$ and $l_2$ becomes $S_l \cup \{k\}$. In other words, the matrices $X_{l_1}$ and $X_{l_2}$ include one more column, feature $k$, than the matrix $X_l$ when computing (6). We call this incremental feature selection, since the features (regressors) used in the linear models are incrementally added as the tree grows. To further speed up our algorithm, we propose a half-additive fitting approach inspired by [8], described in Section 4.3.

A similar idea was used in [8] for growing a single piecewise linear tree. The difference is that [8] does not recalculate the parameters inherited from the parent node when fitting the linear models in the child leaves: only the coefficient of the newly added regressor and the intercept are computed for each child leaf, and the other parameters are kept the same as in the parent node. Here we name the strategy of [8] additive fitting. In our algorithm, however, all the parameters of the child leaves are recalculated using (6), i.e., in a fully corrective way. With fully corrective parameter updating, for the linear models in all leaves, we always obtain the analytically optimal solutions of (6).

To prevent the number of regressors from becoming too large as the tree grows deep, we restrict it to a maximum value. When the number of regressors of a leaf $l$ reaches this maximum, no more regressors are added to the children of $l$. Experiments show that a maximum of 5 regressors is a decent choice.
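
The bookkeeping for this strategy is tiny; as a sketch (our own helper, with the cap of 5 regressors used in our experiments as the default):

def child_regressors(parent_regressors, split_feature, max_regressors=5):
    # Incremental feature selection: children inherit the parent's regressors
    # and add the split feature, up to a cap on the number of regressors.
    if split_feature in parent_regressors or len(parent_regressors) >= max_regressors:
        return list(parent_regressors)
    return list(parent_regressors) + [split_feature]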

4.3 Half-Additive Fitting

To further speedup our algorithm, we propose a half-additive fitting approach to update the parameters in child nodes when splitting a leaf. It is a combination of fully corrective and additive fitting in Section 4.2.

Given the regressors of leaf $l$, let $f_l(x)$ be its linear model. Suppose leaf $l$ is split into leaves $l_1$ and $l_2$ with condition $x_k \le c$. We take $f_l(x)$ itself as a single regressor for $l_1$ and $l_2$. In other words, the linear model of $l_1$ and $l_2$ has the form $\alpha f_l(x) + \beta x_k + \gamma$, where $\alpha$, $\beta$ and $\gamma$ are the parameters to calculate for the child nodes. When the regressors are orthogonal and zero-mean in $l_1$ and $l_2$, it is easy to check that the optimal value of $\alpha$ is 1 and the parent's coefficients remain optimal for the regressors in $l_1$ and $l_2$; in this case we get the same effect as fully corrective fitting. In the general case, when the number of data points is much larger than the number of regressors, the regressors are usually nearly orthogonal. With half-additive fitting, we only need to solve for these few parameters, which is more efficient yet achieves accuracy comparable to fully corrective fitting. We provide both approaches as options in our toolkit.
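
As a sketch (our own code; alpha, beta and gamma follow the description above), the half-additive child fit is the same closed form as (6), but over only three columns: the parent's prediction, the split feature, and a constant.

import numpy as np

def half_additive_fit(parent_pred, x_split, g, h, lam):
    # Solve for (alpha, beta, gamma) in alpha * f_parent(x) + beta * x_k + gamma
    # using the same hessian-weighted closed form as Eq. (6).
    Z = np.column_stack([parent_pred, x_split, np.ones_like(x_split)])
    A = Z.T @ (h[:, None] * Z) + lam * np.eye(3)
    b = Z.T @ g
    return -np.linalg.solve(A, b)   # (alpha, beta, gamma)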

5 System Optimization

In this section, we introduce techniques to further speed up our algorithm on modern CPUs. As mentioned in Section 4.1, the statistics needed to calculate the loss reduction are accumulated in each bin of the histograms. For traditional GBDT, this information is the sum of the gradients and hessians of the data points falling in the bin (e.g., [3]). For our algorithm, however, much more information is needed in each bin.

Considering Equation (7) (leaving out the subscript for leaf $l$), we can write $X^\top H X$ and $X^\top g$ as summations of statistics over all data points in the leaf, as follows:

$$X^\top H X = \sum_{i} h_i \tilde{x}_i \tilde{x}_i^\top, \qquad X^\top g = \sum_{i} g_i \tilde{x}_i.$$

For simplicity, here we use $\tilde{x}_i$ to represent the column vector of regressor values of data point $i$, augmented by a constant 1. With a slight abuse of notation, we use $n$ to denote the number of data points in the leaf, and $N$ to denote the number of data points in the whole training set. In the training set, each data point has a unique ID, ranging from 1 to $N$. For each leaf $l$, an array $\mathrm{ids}_l$ of length $n$ is maintained to record these IDs for all data points in the leaf. For each feature $k$, we maintain an array $\mathrm{bin}_k$ of length $N$. For a data point with unique ID $id$, $\mathrm{bin}_k[id]$ records which bin the data point falls into in the histogram of feature $k$.

With these notations, the histogram construction process for leaf $l$ and feature $k$ is as follows: for each data point in $l$, we first look up its unique ID in $\mathrm{ids}_l$, then use the ID to fetch the bin index from $\mathrm{bin}_k$, and finally add the statistics of the data point to the corresponding bin. Algorithm 3 summarizes this process. Figure 1 shows the detail of lines 7 and 8 in Algorithm 3. This approach is adopted in LightGBM [10] for histogram construction of dense features.

1: Input: number of data points in leaf $l$, $n$
2: Input: gradients and hessians, $g$, $h$
3: Input: array of bin indices of feature $k$, $\mathrm{bin}_k$
4: Input: ID array for leaf $l$, $\mathrm{ids}_l$
5: Output: histogram $\mathrm{hist}_{l,k}$
6: for i = 1 to n do
7:     id = $\mathrm{ids}_l$[i]
8:     bin = $\mathrm{bin}_k$[id]
9:     $\mathrm{hist}_{l,k}$[bin] += statistics of data point id
10: end for
Algorithm 3 Histogram Construction for Feature $k$ on Leaf $l$
Figure 1: Memory access pattern of lines 7 and 8 in Algorithm 3

With $s$ regressors, the size of the matrix $X^\top H X$ is $(s+1) \times (s+1)$. Since it is symmetric, it has $(s+1)(s+2)/2$ unique values. Traditional GBDT needs only two values per bin, $\sum_i g_i$ and $\sum_i h_i$. Thus histogram construction is much more expensive in our algorithm. In the following two subsections, we introduce two important techniques to speed up this process.
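
The following NumPy sketch (our own; array names are hypothetical) shows the richer per-bin statistics our histograms hold: for each bin, the accumulated $\sum h_i \tilde{x}_i \tilde{x}_i^\top$ and $\sum g_i \tilde{x}_i$ of the data points falling in that bin.

import numpy as np

def build_pl_histogram(bin_idx, X_reg, g, h, n_bins):
    # Per-bin sufficient statistics for PL Trees.
    # X_reg: (n, s+1) regressor matrix (with the constant column);
    # bin_idx: histogram bin of the candidate split feature for each data point.
    s1 = X_reg.shape[1]
    hist_xhx = np.zeros((n_bins, s1, s1))   # per-bin sum of h_i * x_i x_i^T
    hist_xg = np.zeros((n_bins, s1))        # per-bin sum of g_i * x_i
    for b in range(n_bins):
        mask = bin_idx == b
        Xb = X_reg[mask]
        hist_xhx[b] = Xb.T @ (h[mask, None] * Xb)
        hist_xg[b] = Xb.T @ g[mask]
    return hist_xhx, hist_xg

Summing the bins to the left of a candidate threshold gives $X^\top H X$ and $X^\top g$ for the left child, and subtracting them from the parent's totals gives the right child, so every candidate split of the feature can be evaluated from the histogram alone.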

5.1 SIMD parallelism

Single Instruction Multiple Data (SIMD) parallelism in modern CPUs, a.k.a. vectorization, supports operations on multiple data items with a single instruction. It is obvious that vectorization can be used to speed up line 9 of Algorithm 3, which is a simultaneous addition of multiple data items. With vectorization, however, more operands are needed each clock cycle, so the speedup of vectorization is often bounded by memory bandwidth [6]. In Algorithm 3, when accessing $\mathrm{bin}_k$, we have to skip the data points not in leaf $l$ (purple blocks in Figure 1). Thus the access to the array $\mathrm{bin}_k$ is discontinuous, causing frequent cache misses. To address this problem in Algorithm 3, we introduce a technique to reduce cache misses and increase memory bandwidth (a very similar idea is used in LightGBM [10] for histogram construction of sparse features).

Suppose leaf $l$ has $n$ data points. For each feature $k$, we maintain an array $\mathrm{leafbin}_{l,k}$ of length $n$ to record the bin indices of feature $k$ for the data points in leaf $l$. In other words, with the notations in Algorithm 3, for $1 \le i \le n$, $\mathrm{leafbin}_{l,k}[i] = \mathrm{bin}_k[\mathrm{ids}_l[i]]$. Since each bin index is stored in a byte and the access to $\mathrm{leafbin}_{l,k}$ is contiguous, we keep the memory footprint very small and reduce cache misses. Also, with $\mathrm{leafbin}_{l,k}$, we can avoid accessing the unique ID array $\mathrm{ids}_l$. Histogram construction with $\mathrm{leafbin}_{l,k}$ using vectorization is summarized in Algorithm 4.

For the root node $r$, $\mathrm{leafbin}_{r,k}$ is exactly $\mathrm{bin}_k$. When leaf $l$ is split into $l_1$ and $l_2$, $\mathrm{leafbin}_{l,k}$ is split into $\mathrm{leafbin}_{l_1,k}$ and $\mathrm{leafbin}_{l_2,k}$ accordingly. This split operation has to be done for every feature $k$. In Section 5.2, we show how to reduce the cost of splitting using the Bit Manipulation Instruction Set, so that it is also affordable for dense features.

1: Input: number of data points in leaf $l$, $n$
2: Input: gradients and hessians of data points in leaf $l$, $g_l$, $h_l$
3: Input: leaf-local array of bin indices of feature $k$, $\mathrm{leafbin}_{l,k}$
4: Output: histogram $\mathrm{hist}_{l,k}$
5: for i = 1 to n do
6:     bin = $\mathrm{leafbin}_{l,k}$[i]
7:     $\mathrm{hist}_{l,k}$[bin] += statistics of data point i  // vectorized add
8: end for
Algorithm 4 Vectorized Histogram Construction
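
A Python counterpart of Algorithm 4 (a conceptual sketch: NumPy's scatter-add stands in for the hand-written SIMD loop, and the array names are ours):

import numpy as np

def make_leaf_bins(global_bins, leaf_ids):
    # Gather the bin indices of one feature for a leaf's data points into a
    # contiguous one-byte array (done once per leaf and feature).
    return global_bins[leaf_ids].astype(np.uint8)

def histogram_from_leaf_bins(leaf_bins, stats, n_bins):
    # Vectorized accumulation over the contiguous leaf-local array.
    # stats: (n, m) per-point statistics to accumulate per bin.
    hist = np.zeros((n_bins, stats.shape[1]))
    np.add.at(hist, leaf_bins, stats)   # scatter-add, one row of stats per point
    return hist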

Note that it is hard for traditional GBDT to fully exploit SIMD, since it only adds two values to a histogram bin each time. Most CPUs today provide 256-bit registers for SIMD. At least 4 double-precision values or 8 single-precision values are required to fully occupy such a register. Since our algorithm requires matrix inversions, we use double-precision values for the sake of numerical stability. With only 2 regressors in the linear models, the statistics in each bin can already fill a 256-bit register.

5.2 Using Bit Manipulation Instructions

As mentioned in Section 5.1, splitting a leaf $l$ requires extracting bin indices from $\mathrm{leafbin}_{l,k}$ and storing them into $\mathrm{leafbin}_{l_1,k}$ and $\mathrm{leafbin}_{l_2,k}$. To do this, we need to know, for each data point in $l$, whether it goes to $l_1$ or $l_2$. This information is recorded in a bit vector $\mathrm{bv}_l$: if $l$ is split with condition $x_k \le c$, then $\mathrm{bv}_l[i]$ indicates whether $x_{ik} \le c$, where $x_{ik}$ is the value of feature $k$ of the $i$-th data point in the leaf. Creating $\mathrm{bv}_l$ only requires a single sweep over the data points in $l$.

Then for each feature $k$, the bin indices in $\mathrm{leafbin}_{l,k}$ are extracted according to $\mathrm{bv}_l$. This is very suitable for Bit Manipulation Instructions (BMI) to handle. BMI is an extension of the x86 instruction set to speed up bit operations. We use the Parallel Bits Extract (PEXT) instruction to extract the bin indices. PEXT takes two 64-bit registers, a source and a mask, as operands. For each bit in the mask whose value is 1, the corresponding bit in the source is extracted and packed into the output register. Each PEXT instruction handles 64 bits simultaneously, so we can process 8 one-byte bin indices at a time. The workflow of using PEXT is shown in Figure 2. We first broadcast each bit of $\mathrm{bv}_l$ into a byte, so that 1 becomes 0xff and 0 becomes 0x00. Then, with a PEXT instruction, we extract the bin indices going to $l_1$. We then negate the bits of the mask and extract the bin indices going to $l_2$ with another PEXT instruction.

Figure 2: Parallel Bits Extract to Split
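
PEXT itself is an x86 instruction, so the sketch below only mirrors its effect in Python for clarity (our own illustration): a boolean vector decides, per data point, whether its one-byte bin index is packed into the left or the right child's contiguous array; the native implementation does this packing 8 bytes at a time with PEXT on a byte-broadcast mask.

import numpy as np

def split_leaf_bins(leaf_bins, goes_left):
    # Pack the bin indices of points going left into one contiguous array
    # and those going right into another (conceptual counterpart of Figure 2).
    goes_left = np.asarray(goes_left, dtype=bool)
    return leaf_bins[goes_left], leaf_bins[~goes_left]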

6 A Kernel View of PL Tree

We provide some theoretical insights into our algorithm. We show that when the loss function of gradient boosting is convex, each PL Tree in the gradient boosting implicitly defines a non-trivial symmetric positive semi-definite (SPSD) kernel. With the notations defined above, the predicted values for the samples that fall in leaf $l$ are

$$X_l \theta_l^* = -X_l \big( X_l^\top H_l X_l + \lambda I \big)^{-1} X_l^\top g_l.$$

Let $M_l = X_l \big( X_l^\top H_l X_l + \lambda I \big)^{-1} X_l^\top$. It is easy to show that $M_l$ is a symmetric positive semi-definite matrix: when the loss function used in gradient boosting is convex (which is the common case), the hessians satisfy $h_i \ge 0$, so $H_l$ is a diagonal matrix with nonnegative elements. Then $X_l^\top H_l X_l + \lambda I$ is positive definite, and so is its inverse; hence $M_l$ is positive semi-definite.

Let $\phi_l$ be the function mapping a sample $x$ to the sub-vector of $x$ containing only the values of the regressors of leaf $l$ (augmented by a constant 1). Then we can define a kernel function

$$K_l(x, x') = \phi_l(x)^\top \big( X_l^\top H_l X_l + \lambda I \big)^{-1} \phi_l(x').$$

Thus $M_l$ is the Gram matrix of kernel $K_l$ for the samples in leaf $l$, which indicates that $K_l$ is a symmetric positive semi-definite (SPSD) kernel.

Let $I_l$ be the set of indices of the samples in leaf $l$. Now we can express the predicted value of a sample $x_i$ in leaf $l$ in terms of $K_l$ and the gradients $g_j$,

$$f_t(x_i) = -\sum_{j \in I_l} K_l(x_i, x_j)\, g_j. \qquad (9)$$

The predicted value in (9) is a weighted sum of the gradients $-g_j$. The closer a sample $x_j$ is to $x_i$ in the feature space of kernel $K_l$, the bigger the contribution the gradient of $x_j$ makes to the predicted value of $x_i$.
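
The kernel view can be checked numerically with a few lines of NumPy (our own sketch, reusing the notation of Section 3.1): the Gram matrix of the implicit leaf kernel reproduces the closed-form leaf predictions of (6) exactly as the kernel-weighted gradient sum in (9).

import numpy as np

def leaf_kernel_matrix(X_leaf, h, lam):
    # Gram matrix M = X (X^T H X + lam*I)^{-1} X^T of the implicit leaf kernel.
    A = X_leaf.T @ (h[:, None] * X_leaf) + lam * np.eye(X_leaf.shape[1])
    return X_leaf @ np.linalg.solve(A, X_leaf.T)

# Check Eq. (9): the leaf predictions X theta* equal -M g.
rng = np.random.default_rng(0)
X_leaf = np.column_stack([rng.normal(size=(50, 3)), np.ones(50)])  # 3 regressors + constant
g = rng.normal(size=50)
h = rng.uniform(0.1, 1.0, size=50)   # nonnegative hessians (convex loss)
lam = 0.01
A = X_leaf.T @ (h[:, None] * X_leaf) + lam * np.eye(4)
theta = -np.linalg.solve(A, X_leaf.T @ g)            # Eq. (6)
M = leaf_kernel_matrix(X_leaf, h, lam)
assert np.allclose(X_leaf @ theta, -M @ g)           # Eq. (9)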

With piecewise constant trees, the output value of ordinary GBDT for all samples in a leaf $l$ is

$$-\frac{\sum_{j \in I_l} g_j}{\sum_{j \in I_l} h_j + \lambda}. \qquad (10)$$

Thus, piecewise constant trees give identical predicted values to all samples in a leaf, while with the kernel weights, PL Trees give similar outputs to similar samples in the leaf. This is why PL Trees can be more accurate as base learners for gradient boosting.

In fact, the whole PL Tree also defines a kernel $K$, with $K(x, x') = K_l(x, x')$ if $x$ and $x'$ fall in the same leaf $l$, and $K(x, x') = 0$ otherwise. The Gram matrix of $K$ is simply a big block-diagonal matrix with the Gram matrices $M_l$ of the leaves on the diagonal, which is SPSD since each $M_l$ is SPSD.

7 Experiments

We evaluate our algorithm on 4 public datasets and compare the results with LightGBM and XGBoost. We name our algorithm GBDT-PL. In addition, we create 2 synthetic datasets. Table 1 lists the datasets we used. Cubic and Poly are synthetic; the other four are from UCI datasets. Our source code, along with more information about the datasets, is available at the anonymized GitHub page:
https://github.com/GBDT-PL/GBDT-PL.git

We use identical GBDT hyper-parameters for all algorithms. Table 2 shows our experiment environment.

name # training # testing # features task
HIGGS 10000000 500000 28 classification
HEPMASS 7000000 3500000 28 classification
CASP 30000 15731 9 regression
Epsilon 400000 100000 2000 classification
Cubic 10000000 1000000 10 regression
Poly 2000000 1000000 200 regression
Table 1: Datasets Description
OS CPU Memory
CentOS Linux 7 2x Xeon E5-2690 v3 DDR4 2400 MHz, 128GB
Table 2: Experiment Platform

We compare convergence rate, accuracy, training time, and testing time. Recently LightGBM added a sampling strategy to speed up training: it samples data points according to the magnitudes of their gradients before each iteration. The strategy is called Gradient-based One-Side Sampling (GOSS) [10]. It can speed up training with little sacrifice in accuracy when the training set is large. It is straightforward to apply GOSS to our method GBDT-PL as well, so we test GOSS for both LightGBM and GBDT-PL.

Table 3 shows the common parameter settings for GBDT-PL, LightGBM and XGBoost. By default, we use the histogram versions of XGBoost and LightGBM. Each tree in our experiments has at most 255 leaves. The learning rate is 0.1. We test both 63 bins and 255 bins for the histograms. $\lambda$ is the coefficient of the regularization term. The min sum of hessians is the minimum sum of $h_i$'s allowed in each leaf; we set it to 100 to prevent the trees from growing too deep. For GBDT-PL, we use at most 5 regressors per leaf.

max leaves  learning rate  bins  $\lambda$  min sum of hessians
255  0.1  63/255  0.01  100
Table 3: Parameter Setting

7.1 Effects of Optimization Techniques

To evaluate the effects of the various techniques in Sections 4 and 5, we start from a baseline version and add the optimization techniques incrementally. We record the training time of 500 iterations using 63 bins on the HIGGS and Epsilon datasets. Figure 3 shows the training time when adding each optimization technique, from top to bottom. The first bar at the top is the baseline version (Algorithm 3). The second bar adds SIMD for histogram construction in Algorithm 3. On top of SIMD, the third bar uses the leaf-local bin index arrays (Algorithm 4). The fourth bar uses Bit Manipulation Instructions (Section 5.2). The bottom bar adds the half-additive fitting technique (Section 4.3).

Figure 3: Effects of Optimization Techniques

7.2 Convergence Rate and Accuracy

We run 500 iterations and plot the testing accuracy per iteration. Figure 4 shows the results; we use lgb for LightGBM and xgb for XGBoost in the figure legends. Table 4 shows the testing accuracy after 500 iterations. For HIGGS and HEPMASS, GBDT-PL uses fewer trees to reach the same accuracy, as shown in (a) and (b) of Figure 4. On the other datasets we get slightly better convergence rates. Note that the curves of XGBoost and LightGBM overlap in most figures.

Figure 4: Convergence Rate and Accuracy. Panels: (a) HIGGS AUC, (b) HEPMASS AUC, (c) CASP RMSE, (d) Epsilon AUC, (e) Cubic RMSE, (f) Poly RMSE.
Algorithm HIGGS HEPMASS CASP Epsilon Cubic Poly
GBDT-PL, 63 bins 0.8539 0.9563 3.6497 0.9537 0.5073 879.7
LightGBM, 63 bins 0.8455 0.9550 3.6217 0.9498 0.9001 923.2
XGBoost, 63 bins 0.8449 0.9549 3.6252 0.9498 0.9010 921.4
GBDT-PL, 255 bins 0.8545 0.9563 3.6160 0.9537 0.2686 673.4
LightGBM, 255 bins 0.8453 0.9550 3.6206 0.9499 0.5743 701.3
XGBoost, 255 bins 0.8456 0.9550 3.6177 0.9500 0.5716 701.1
Table 4: Testing Accuracy

7.3 Training Time

We plot the accuracy on the testing sets w.r.t. training time in Figure 5. To leave out the effect of evaluation time, for each dataset we perform two separate runs: in the first run we record the training time per iteration only, without doing evaluation; in the second run we evaluate the accuracy at every iteration. For the HIGGS dataset, GBDT-PL reaches AUC 0.845 in around 150 seconds, while LightGBM reaches the same accuracy in about 250 seconds. Slighter improvements can be seen on the other datasets.

Figure 5: Training Time Comparison. Panels: (a) HIGGS AUC, (b) HEPMASS AUC, (c) CASP RMSE, (d) Epsilon AUC, (e) Cubic RMSE, (f) Poly RMSE.

7.4 Testing Time

To measure the testing time accurately, we expand most testing sets by replicating the original ones several times. Due to limited space, the details of the experimental setup are provided on our GitHub page. We plot the testing accuracy w.r.t. the testing time in Figure 6. Though GBDT-PL reaches better accuracy with fewer trees compared with XGBoost and LightGBM, a linear function must be evaluated to give a prediction for each data point, so evaluating each tree in GBDT-PL is more expensive. For Cubic and CASP, GBDT-PL achieves better testing time performance. For Poly, XGBoost and LightGBM are faster at deployment time due to cheaper tree evaluation.

Figure 6: Testing Time Comparison. Panels: (a) HIGGS AUC, (b) HEPMASS AUC, (c) CASP RMSE, (d) Epsilon AUC, (e) Cubic RMSE, (f) Poly RMSE.

7.5 GOSS

We implement Gradient-based One-Side Sampling (GOSS) [10]. At each iteration, we keep the data points whose gradient magnitudes are within the top fraction, and randomly sample a fixed fraction of the rest of the training set, using the default sampling ratios of LightGBM. Figure 7 shows that GOSS also works for GBDT-PL. Compared with Figure 5, GOSS speeds up the training process of both GBDT-PL and LightGBM.
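
As a sketch of the sampling step (our own code; the ratios top_rate and other_rate are illustrative parameters, and the reweighting of the small-gradient samples follows the GOSS description in [10]):

import numpy as np

def goss_sample(g, top_rate=0.2, other_rate=0.1, rng=None):
    # Keep the top_rate fraction with the largest |g|, randomly sample an
    # other_rate fraction of the rest, and up-weight the sampled small-gradient
    # points by (1 - top_rate) / other_rate to keep the statistics roughly unbiased.
    rng = rng or np.random.default_rng()
    n = len(g)
    n_top, n_other = int(top_rate * n), int(other_rate * n)
    order = np.argsort(-np.abs(g))
    top_idx, rest_idx = order[:n_top], order[n_top:]
    other_idx = rng.choice(rest_idx, size=n_other, replace=False)
    idx = np.concatenate([top_idx, other_idx])
    weights = np.ones(len(idx))
    weights[n_top:] = (1.0 - top_rate) / other_rate   # compensate the sampling
    return idx, weights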

Figure 7: Training Time Comparison with GOSS. Panels: (a) HIGGS AUC, (b) HEPMASS AUC, (c) Epsilon AUC, (d) Poly RMSE.

8 Related Work and Discussions

Gradient boosting has many off-the-shelf toolkits. XGBoost [3] and LightGBM [10] are the most popular ones, with careful optimizations of time and memory cost. Both of them use piecewise constant trees. These tools approximate the loss function with a second-order Taylor approximation at each step and then train a decision tree to minimize the second-order approximation, which is analogous to Newton's method. Our work is also based on the second-order approximation technique, which can optimize the loss function faster [7].

Recently, many variants have been proposed to improve the accuracy of tree-based ensemble methods, including [11] and [17]. These approaches all use piecewise constant base learners, but use more complex split conditions.

Regression trees other than piecewise constant ones appear in the early statistical literature, and various training algorithms for a single such tree have been proposed. SUPPORT [2] first fits models with all numerical regressors for each node, then chooses the feature to split the node on by comparing correlations with the current residuals using statistical tests; it uses the sample mean of the chosen feature as the split point. GUIDE [12] is a similar model with more advanced statistical techniques to choose the split feature. [8] trains the linear models in the leaves additively: the linear model in each node extends the model of its parent by adding one more feature; the coefficients of the features in the parent model are not changed, and only the coefficient of the newly added feature is calculated. Compared with the regression tree models above, the PL Tree used in our work is designed to greedily reduce the loss at each iteration of gradient boosting.

[15] adapts gradient boosting to use PL Trees and applies the algorithm to product demand prediction. Their algorithm is based on traditional gradient boosting with only first-order gradients. [5] extends XGBoost to use PL Trees and focuses on low-dimensional data. LinXGBoost is implemented in Python, which makes it difficult to apply system-level optimizations such as those in GBDT-PL to speed up the algorithm. Both [5] and [15] lack a feature selection strategy to keep the linear models on the leaves small.

References

  • [1] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen. Classification and regression trees. CRC press, 1984.
  • [2] P. Chaudhuri, M.-C. Huang, W.-Y. Loh, and R. Yao. Piecewise-polynomial regression trees. Statistica Sinica, pages 143–167, 1994.
  • [3] T. Chen and C. Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794. ACM, 2016.
  • [4] Chrysos. Intel Xeon Phi coprocessor: the architecture. Intel Whitepaper, page 176, 2014.
  • [5] L. de Vito. LinXGBoost: Extension of XGBoost to generalized local linear models. arXiv preprint arXiv:1710.03634, 2017.
  • [6] R. Espasa, M. Valero, and J. E. Smith. Vector architectures: past, present and future. In Proceedings of the 12th international conference on Supercomputing, pages 425–432. ACM, 1998.
  • [7] J. Friedman, T. Hastie, R. Tibshirani, et al. Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). The Annals of Statistics, 28(2):337–407, 2000.
  • [8] J. H. Friedman. A tree-structured approach to nonparametric multiple regression. Smoothing Techniques for Curve Estimation, 757:5–22, 1979.
  • [9] J. H. Friedman. Greedy function approximation: a gradient boosting machine. Annals of statistics, pages 1189–1232, 2001.
  • [10] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, pages 3149–3157, 2017.
  • [11] P. Kontschieder, M. Fiterau, A. Criminisi, and S. Rota Bulo. Deep neural decision forests. In Proceedings of the IEEE International Conference on Computer Vision, pages 1467–1475, 2015.
  • [12] W.-Y. Loh. Regression trees with unbiased variable selection and interaction detection. Statistica Sinica, pages 361–386, 2002.
  • [13] P. Sun, T. Zhang, and J. Zhou. A convergence rate analysis for LogitBoost, MART and their variant. In ICML, pages 1251–1259, 2014.
  • [14] S. Tyree, K. Q. Weinberger, K. Agrawal, and J. Paykin. Parallel boosted regression trees for web search ranking. In Proceedings of the 20th international conference on World wide web, pages 387–396. ACM, 2011.
  • [15] J. C. Wang and T. Hastie. Boosted varying-coefficient regression models for product demand prediction. Journal of Computational and Graphical Statistics, 23(2):361–382, 2014.
  • [16] H. Zhang, S. Si, and C.-J. Hsieh. GPU-acceleration for large-scale tree boosting. arXiv preprint arXiv:1706.08359, 2017.
  • [17] J. Zhu, Y. Shan, J. Mao, D. Yu, H. Rahmanian, and Y. Zhang. Deep embedding forest: Forest-based serving with deep embedding features. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2017.