Over the last decade, boosting techniques, which combine weak learners into a strong learner, have been widely developed and employed across the machine learning and computational learning communities. AdaBoost (Freund and Schapire, 1997) and gradient boosting decision trees (GBDT) (Friedman, 2001) are among the most popular learning algorithms used in practice. There are several highly optimized implementations of boosting, among which XGBoost (Chen and Guestrin, 2016) and LightGBM (Ke et al., 2017) are broadly applied to increase scalability and reduce complexity. These implementations can train models with hundreds of trees on millions of training examples in a matter of minutes. NGBoost (Duan et al., 2019) generalizes the natural gradient as the direction of steepest ascent in Riemannian space and applies it to boosting to enable probabilistic prediction for regression tasks. Natural gradient boosting shows promising performance improvements on small datasets due to better training dynamics, but it suffers from slow training, especially on large datasets. To reduce the training time, we adopt best-first decision tree learning (Shi, 2007) for the weak learners, remove the maximum-depth restriction on the base learners, and carefully tune three hyperparameters: the learning rate, the number of estimators, and the maximum number of leaves. Our best setting achieves up to a 4.85x speedup, significantly improves on the original NGBoost performance, and beats the state-of-the-art performance on the Energy, Power and Protein datasets from the UCI Machine Learning Repository.
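Table 1: Comparison of NGBoost and RoNGBa on the UCI regression benchmark datasets. N is the number of examples and ATT is the Average Training Time.
| Dataset | N | RMSE (NGBoost) | RMSE (RoNGBa) | NLL (NGBoost) | NLL (RoNGBa) | ATT (NGBoost) | ATT (RoNGBa) |
|---|---|---|---|---|---|---|---|
| Boston | 506 | 2.96 ± 0.42 | 3.01 ± 0.57 | 2.47 ± 0.12 | 2.48 ± 0.16 | 26.81s | 10.04s |
| Concrete | 1030 | 5.49 | 4.71 ± 0.61 | 3.08 ± 0.12 | 2.94 ± 0.18 | 29.96s | 9.28s |
| Energy | 768 | 0.51 | 0.35 ± 0.07 | 0.76 | 0.37 ± 0.28 | 30.24s | 6.24s |
| Kin8nm | 8192 | 0.18 | 0.14 ± 0.00 | -0.40 | -0.60 ± 0.03 | 189.28s | 82.14s |
| Naval | 11934 | 0.00 ± 0.00 | 0.00 ± 0.00 | -4.88 | -5.49 ± 0.04 | 317.85s | 207.01s |
| Power | 9568 | 3.92 | 3.47 ± 0.19 | 2.80 | 2.65 ± 0.08 | 120.31s | 48.09s |
| Protein | 45730 | 4.59 | 4.21 ± 0.06 | 2.86 | 2.76 ± 0.03 | 1191.02s | 502.34s |
| Wine | 1588 | 0.64 | 0.62 ± 0.05 | 0.94 ± 0.07 | 0.91 ± 0.08 | 42.44s | 16.86s |
| Yacht | 308 | 0.63 ± 0.19 | 0.90 ± 0.35 | 0.46 ± 0.28 | 1.03 ± 0.44 | 22.52s | 5.11s |
| Year MSD | 515345 | 9.18 ± NA | 9.14 ± NA | 3.47 ± NA | 3.46 ± NA | 14.00h | 5.15h |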
2 Robustly Optimized Natural Gradient Boosting
When the maximum number of leaves is fixed, leaf-wise (best-first) tree growth algorithms tend to achieve lower loss than level-wise algorithms (Shi, 2007; Ke et al., 2017). We therefore remove the maximum-depth restriction and instead use a restriction on the maximum number of leaves as the regularizer to prevent overfitting. Apart from the performance gains, this change also leads to a speedup of around 30%. The reason is that, under a maximum-number-of-leaves restriction, a decision tree can often achieve lower loss by growing deeper with fewer splits, whereas a decision tree bounded by maximum depth will often keep making less effective splits at the shallow levels.
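The difference between the two growth strategies can be illustrated with a minimal sketch on a single feature. This is not the implementation used in our experiments: the function names, the squared-error criterion, and the toy targets are assumptions made only for illustration. With a budget of four leaves, best-first growth spends its splits wherever the loss reduction is largest, while depth-2 level-wise growth must split both children of the root.

```python
def sse(ys):
    """Sum of squared errors around the mean (the loss of one leaf)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def best_split(ys):
    """Best threshold split of a leaf whose targets are ordered by one feature.

    Returns (gain, left, right), or None if the leaf has fewer than two points.
    """
    best = None
    for i in range(1, len(ys)):
        gain = sse(ys) - sse(ys[:i]) - sse(ys[i:])
        if best is None or gain > best[0]:
            best = (gain, ys[:i], ys[i:])
    return best


def grow_best_first(ys, max_leaves):
    """Leaf-wise growth: always split the leaf with the largest loss reduction."""
    leaves = [ys]
    while len(leaves) < max_leaves:
        options = [(best_split(leaf), k) for k, leaf in enumerate(leaves)]
        options = [(s, k) for s, k in options if s is not None and s[0] > 0]
        if not options:
            break
        (_, left, right), k = max(options, key=lambda o: o[0][0])
        leaves[k:k + 1] = [left, right]
    return leaves


def grow_level_wise(ys, max_depth):
    """Level-wise growth: split every leaf of the current level, depth by depth."""
    leaves = [ys]
    for _ in range(max_depth):
        next_level = []
        for leaf in leaves:
            split = best_split(leaf)
            if split is not None and split[0] > 0:
                next_level += [split[1], split[2]]
            else:
                next_level.append(leaf)
        leaves = next_level
    return leaves


# Toy targets (hypothetical), already ordered by the feature value.
toy = [0.0] * 8 + [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0]
best_first_leaves = grow_best_first(toy, max_leaves=4)
level_wise_leaves = grow_level_wise(toy, max_depth=2)
best_first_loss = sum(sse(leaf) for leaf in best_first_leaves)
level_wise_loss = sum(sse(leaf) for leaf in level_wise_leaves)
```

For this four-leaf budget the best-first loss can never exceed the level-wise loss, because the third best-first split is chosen from a superset of the candidates available to the level-wise tree.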
For hyperparameter tuning, our insight is that we can counter the performance drop caused by decreasing the number of weak estimators by increasing the model complexity of each base learner. In this way, the training time is reduced linearly, because fewer weak learners need to be trained. Since reducing the number of weak learners decreases the number of parameters in the system, we increase the learning rate accordingly for robust training dynamics. Based on this insight, we gradually decrease the number of estimators while increasing the maximum number of leaves and the learning rate to find the setting with the best performance. We first search for the best setting on the Energy dataset from the UCI Machine Learning Repository, and then report the performance on all datasets with the setting discovered. Generally, we use the following hyperparameters throughout our experiments: learning rate, , number of estimators, , maximum number of leaves, .
Our experiments use datasets from the UCI Machine Learning Repository and follow the same protocol as NGBoost (Hernández-Lobato and Adams, 2015; Duan et al., 2019). For all datasets, we hold out a random 10% of the examples as a test set. From the other 90%, we initially hold out 20% as a validation set to select the number of boosting stages that gives the best log-likelihood, and then re-fit on the entire 90% using the chosen number of stages. The refit model is then used to predict on the held-out 10% test set. This entire process is repeated 20 times for all datasets except Protein and Year MSD, for which it is repeated 5 times and 1 time respectively. For the Average Training Time (ATT) measurement, we average the training times measured over the repeated training processes. Unlike the original implementation, we use a learning rate of 0.04 for all datasets.
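The splitting and model-selection steps of this protocol can be sketched as follows. This is a minimal illustration, not the code used in our experiments: the helper names, the rounding of split sizes, the seed, and the validation scores are all hypothetical.

```python
import random


def split_indices(n, seed):
    """Shuffle 0..n-1 and split into train / validation / test index lists.

    10% of all examples form the test set; 20% of the remaining 90% form
    the validation set. Sizes are rounded down (an assumption).
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = n // 10
    test, rest = idx[:n_test], idx[n_test:]
    n_val = len(rest) // 5
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test


def select_num_stages(val_nll_per_stage):
    """Return the 1-based number of boosting stages with the lowest validation NLL."""
    best = min(range(len(val_nll_per_stage)), key=val_nll_per_stage.__getitem__)
    return best + 1


# 506 is the size of the Boston dataset in Table 1.
train, val, test = split_indices(506, seed=0)

# Hypothetical validation NLL after 1, 2, 3, 4 and 5 boosting stages.
chosen_stages = select_num_stages([3.1, 2.8, 2.6, 2.7, 2.9])
```

After the number of stages is chosen, the model is re-fit on `train + val` and evaluated once on `test`; repeating the whole procedure with different seeds gives the repeated runs described above.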
We also re-run the official NGBoost code with the same hyperparameters as reported in the original paper for a fair comparison of performance and training time. All experiments are conducted on a single Intel(R) Xeon(R) E5-2630 v4 2.20GHz CPU.
Table 1 compares the performance of our approach with that of the original NGBoost on the UCI regression benchmark datasets. RoNGBa achieves significantly better results on most of the datasets, apart from the extremely small ones (Yacht, Boston), which need extra hyperparameter tuning for better performance. Specifically, our approach significantly beats the state-of-the-art performance on the Energy, Power and Protein datasets as reported by Gal and Ghahramani (2016) and Lakshminarayanan et al. (2017). Our approach also achieves a speedup ranging from 1.53x to 4.85x across the datasets, which empirically confirms our insight that reducing the overall number of learners saves much more computation time than is added by increasing each base learner's model complexity.
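The reported speedup range can be reproduced from the training times in Table 1. The short sketch below assumes that, for each dataset, the first time is the NGBoost ATT and the second is the RoNGBa ATT, which is consistent with the speedups quoted above; the variable names are illustrative.

```python
# Average training times in seconds: (NGBoost, RoNGBa), copied from Table 1.
att_seconds = {
    "Boston": (26.81, 10.04),
    "Concrete": (29.96, 9.28),
    "Energy": (30.24, 6.24),
    "Kin8nm": (189.28, 82.14),
    "Naval": (317.85, 207.01),
    "Power": (120.31, 48.09),
    "Protein": (1191.02, 502.34),
    "Wine": (42.44, 16.86),
    "Yacht": (22.52, 5.11),
    "Year MSD": (14.00 * 3600, 5.15 * 3600),  # reported in hours
}

# Speedup = NGBoost time divided by RoNGBa time.
speedup = {name: ngb / ron for name, (ngb, ron) in att_seconds.items()}
smallest = min(speedup, key=speedup.get)
largest = max(speedup, key=speedup.get)

for name in sorted(speedup, key=speedup.get):
    print(f"{name:10s} {speedup[name]:.2f}x")
```

The largest ratio is obtained on Energy (30.24 / 6.24, about 4.85) and the smallest on Naval (317.85 / 207.01, about 1.535, which matches the reported 1.53x when truncated to two decimals).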
5 Related Work
AdaBoost (Freund and Schapire, 1997) changes the input distribution seen by each subsequent weak learner based on the answers of the former weak learners. At each training step, it puts higher weights on misclassified examples, and finally composes a strong classifier as a weighted sum of all the weak hypotheses.
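One reweighting round can be sketched as follows. This uses the common exponential-weight formulation of discrete AdaBoost for binary labels, which is an assumption on our part and is stated differently in the original paper; the function name and the toy weights are illustrative.

```python
import math


def adaboost_reweight(weights, correct):
    """One AdaBoost round: return (alpha, new_weights).

    weights: current example weights, summing to 1.
    correct: booleans, True where the weak learner classified the example correctly.
    Assumes the weighted error is strictly between 0 and 1.
    """
    err = sum(w for w, ok in zip(weights, correct) if not ok)
    alpha = 0.5 * math.log((1.0 - err) / err)  # vote of this weak learner
    # Misclassified examples are scaled up, correctly classified ones scaled down.
    scaled = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(scaled)
    return alpha, [w / total for w in scaled]


# Five equally weighted examples; the weak learner gets only the last one wrong.
weights = [0.2] * 5
alpha, new_weights = adaboost_reweight(weights, [True, True, True, True, False])
```

After the update, the misclassified example carries half of the total weight, so the next weak learner is forced to focus on it. The final classifier is the sign of the sum of the weak hypotheses weighted by their `alpha` values.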
Gradient Boosting Decision Tree (GBDT) (Friedman, 2001) is adapted from AdaBoost in order to handle a variety of loss functions. GBDT first expresses the loss minimization problem as an additive model, and performs numerical optimization directly in function space using a greedy forward stage-wise algorithm. Most importantly, GBDT uses the data-based analogue of the unconstrained negative gradient of the loss function at the current model as the approximate value of the residual in the boosting tree, which gives the steepest-descent step direction in the N-dimensional data space.
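The stage-wise procedure can be sketched with regression stumps on a single feature. This is a simplified illustration under stated assumptions, not a full GBDT: the helper names and toy data are hypothetical, leaf values are plain means of the pseudo-residuals (no per-leaf line search), and the feature is assumed to take at least two distinct values.

```python
def fit_stump(xs, targets):
    """Least-squares regression stump on one feature.

    Returns (threshold, left_value, right_value); points with x <= threshold go left.
    """
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best = None
    for cut in range(1, len(order)):
        lo, hi = xs[order[cut - 1]], xs[order[cut]]
        if lo == hi:
            continue  # cannot separate equal feature values
        left = [targets[i] for i in order[:cut]]
        right = [targets[i] for i in order[cut:]]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = sum((t - lv) ** 2 for t in left) + sum((t - rv) ** 2 for t in right)
        if best is None or err < best[0]:
            best = (err, (lo + hi) / 2, lv, rv)
    return best[1:]


def squared_error_neg_gradient(y, pred):
    """Negative gradient of 0.5 * (y - pred)^2: the ordinary residual."""
    return y - pred


def absolute_error_neg_gradient(y, pred):
    """Negative gradient of |y - pred|: the sign of the residual."""
    return (y > pred) - (y < pred)


def boost(xs, ys, n_stages, learning_rate, neg_gradient):
    """Forward stage-wise boosting; returns training predictions and squared loss per stage."""
    pred = [sum(ys) / len(ys)] * len(ys)  # constant initial model
    squared_losses = []
    for _ in range(n_stages):
        # Pseudo-residuals: negative gradient of the loss at the current model.
        pseudo_residuals = [neg_gradient(y, p) for y, p in zip(ys, pred)]
        thr, lv, rv = fit_stump(xs, pseudo_residuals)
        pred = [p + learning_rate * (lv if x <= thr else rv) for x, p in zip(xs, pred)]
        squared_losses.append(sum((y - p) ** 2 for y, p in zip(ys, pred)))
    return pred, squared_losses


xs = [float(x) for x in range(10)]
ys = [x * x for x in xs]
pred, squared_losses = boost(xs, ys, n_stages=20, learning_rate=0.5,
                             neg_gradient=squared_error_neg_gradient)
```

Passing `absolute_error_neg_gradient` instead changes only the pseudo-residuals that each stump is fitted to, which is what lets the same loop serve different loss functions.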
Compared with AdaBoost, GBDT constructs multiple decision trees serially to predict the data. It treats the decision tree model as the parameter, and each iteration fits a tree to the negative gradient of the loss function. AdaBoost, in contrast, treats each data point as a parameter and improves by adjusting the weights of the misclassified points. Therefore, by choosing different types of loss functions, such as squared error and absolute error in regression or negative binomial log-likelihood in classification, GBDT can be applied to broader and more diverse learning problems than AdaBoost, such as multi-class classification, click prediction, and learning to rank.
XGBoost (Chen and Guestrin, 2016) improves GBDT with better scalability. It is suitable for large-scale data and limited computing resources, offering high speed at equivalent accuracy. To achieve this scalability, XGBoost mainly uses three techniques: 1) it approximates the best split of a decision tree with a weighted quantile sketch instead of greedily enumerating all possible splits; 2) it handles sparse data with a sparsity-aware algorithm that only visits non-missing entries and learns a default tree direction for missing values; 3) it stores data in a cache-aware block structure that supports out-of-core computation.
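The idea behind the first technique can be sketched as follows. The code computes candidate split points at weighted quantiles exactly by sorting, rather than with the streaming sketch data structure of XGBoost, so it illustrates the principle and not the library's implementation. In XGBoost the weights are the second-order gradient statistics of the instances; the function name and the toy inputs here are hypothetical.

```python
def weighted_quantile_candidates(values, weights, num_bins):
    """Candidate split points that divide the total weight into roughly equal bins.

    values: feature values of the instances.
    weights: non-negative instance weights.
    num_bins: number of bins; at most num_bins - 1 candidates are returned.
    """
    pairs = sorted(zip(values, weights))
    total = sum(w for _, w in pairs)
    candidates = []
    cumulative = 0.0
    next_rank = 1
    for value, weight in pairs:
        cumulative += weight
        # Emit a candidate each time the cumulative weight crosses a quantile.
        while next_rank < num_bins and cumulative >= total * next_rank / num_bins:
            if not candidates or candidates[-1] != value:
                candidates.append(value)
            next_rank += 1
    return candidates


uniform = weighted_quantile_candidates(list(range(1, 9)), [1.0] * 8, num_bins=4)
skewed = weighted_quantile_candidates(list(range(1, 9)), [1.0] * 7 + [9.0], num_bins=4)
```

With uniform weights the candidates are evenly spaced over the sorted values, while a heavily weighted instance pulls candidates toward its feature value, so only a handful of thresholds need to be evaluated instead of every possible split.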
LightGBM (Ke et al., 2017) further improves system scalability for large, high-dimensional data. It applies two methods to GBDT, Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB), to increase efficiency without hurting accuracy.
GOSS samples the training data by keeping all the instances with large gradients and randomly sampling the instances with small gradients, since instances with small gradients are already well trained. Then, to preserve the data distribution, it amplifies the sampled small-gradient instances by a constant factor when computing the information gain. Whereas XGBoost only filters out zero-valued entries from the training data, LightGBM subsamples the training set in a more informed way.
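The sampling step can be sketched as follows, following the description in Ke et al. (2017): keep the top fraction a of instances by absolute gradient, sample a fraction b of all instances from the rest, and amplify the sampled instances by (1 - a) / b. The function name, parameter names, seed, and toy gradients are assumptions made for illustration, and this is not LightGBM's code.

```python
import random


def goss_sample(gradients, top_rate, other_rate, seed=0):
    """Return {instance index: weight} for the instances kept by one-side sampling."""
    n = len(gradients)
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = round(top_rate * n)
    n_other = round(other_rate * n)
    top, rest = order[:n_top], order[n_top:]
    sampled = random.Random(seed).sample(rest, n_other)
    # Small-gradient instances are amplified so that the overall data
    # distribution is approximately preserved when computing the gain.
    amplify = (1.0 - top_rate) / other_rate
    weights = {i: 1.0 for i in top}
    weights.update({i: amplify for i in sampled})
    return weights


# Toy gradients whose magnitude grows with the index.
gradients = [(-1) ** i * i / 100 for i in range(100)]
kept = goss_sample(gradients, top_rate=0.2, other_rate=0.1)
```

In this example 30 of the 100 instances are kept, yet their weights still sum to 100, which is why the amplification constant leaves the estimated information gain on roughly the same scale as on the full data.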
In practice, many features are mutually exclusive, so the data can be very sparse. To reduce the number of features, EFB bundles exclusive features into a single feature. First, it takes features as vertices and adds an edge between every pair of features that are not mutually exclusive, weighting each edge by the total conflicts between the two features. Then, it sorts the features by their degrees in the graph. Finally, it adds each feature in the sorted list to an existing bundle or to a newly created one, depending on how the resulting conflicts compare to a threshold. After the feature histograms are constructed, LightGBM finds the best split points with a histogram-based algorithm, whereas XGBoost approximates the best split points with a weighted quantile sketch.
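The greedy bundling steps above can be sketched as follows. Features are represented by the sets of rows on which they are nonzero, and a conflict is a row on which two features are both nonzero. The function name, the use of the weighted degree for sorting, and the toy features are assumptions made for illustration; this is not LightGBM's implementation.

```python
def greedy_bundle(nonzero_rows, max_conflicts):
    """Greedily group features into bundles.

    nonzero_rows: one set per feature with the row indices where it is nonzero.
    max_conflicts: largest total number of conflicting rows allowed per bundle.
    Returns a list of bundles, each a list of feature indices.
    """
    n = len(nonzero_rows)
    # Weighted degree: total conflicts of a feature with all other features.
    degree = [
        sum(len(nonzero_rows[i] & nonzero_rows[j]) for j in range(n) if j != i)
        for i in range(n)
    ]
    order = sorted(range(n), key=lambda i: degree[i], reverse=True)

    bundles, bundle_rows, bundle_conflicts = [], [], []
    for f in order:
        for b in range(len(bundles)):
            extra = len(bundle_rows[b] & nonzero_rows[f])
            if bundle_conflicts[b] + extra <= max_conflicts:
                bundles[b].append(f)
                bundle_rows[b] |= nonzero_rows[f]
                bundle_conflicts[b] += extra
                break
        else:
            # No existing bundle can take the feature: open a new one.
            bundles.append([f])
            bundle_rows.append(set(nonzero_rows[f]))
            bundle_conflicts.append(0)
    return bundles


# Four toy features: 0 and 1 are exclusive, 2 conflicts with both, 3 with none.
features = [{0, 1}, {2, 3}, {0, 2}, {4}]
strict = greedy_bundle(features, max_conflicts=0)
relaxed = greedy_bundle(features, max_conflicts=1)
```

With no conflicts allowed, the four features collapse into two bundles; allowing one conflicting row per bundle lets three of them share a bundle, trading a small amount of accuracy for fewer histograms to build.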
Even though LightGBM does not adopt the system-level techniques of XGBoost, such as cache-aware blocks and out-of-core computing, it still outperforms XGBoost thanks to its more efficient algorithms.
In this work, we proposed RoNGBa, a Robustly optimized NGBoost approach. RoNGBa applies leaf number clipping to the base learners and finds the best hyperparameters based on a simple yet effective insight into the computation-accuracy trade-off. Our approach significantly beats the state-of-the-art performance on several UCI datasets while still achieving up to a 4.85x speedup over the original NGBoost.
Our future work is to apply the techniques of Gradient-based One-Side Sampling and Exclusive Feature Bundling from LightGBM for more efficient natural gradient boosting on large-scale higher-dimensional datasets.
We want to thank Michal Moshkovitz and Joseph Geumlek for the early discussions of the project.
- Chen and Guestrin (2016). XGBoost: a scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794.
- Duan et al. (2019). NGBoost: natural gradient boosting for probabilistic prediction. arXiv preprint arXiv:1910.03225.
- Freund and Schapire (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55 (1), pp. 119–139.
- Friedman (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics, pp. 1189–1232.
- Gal and Ghahramani (2016). Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In International Conference on Machine Learning, pp. 1050–1059.
- Hernández-Lobato and Adams (2015). In International Conference on Machine Learning, pp. 1861–1869.
- Ke et al. (2017). LightGBM: a highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, pp. 3146–3154.
- Lakshminarayanan et al. (2017). Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pp. 6402–6413.
- Shi (2007). Best-first decision tree learning. Ph.D. Thesis, The University of Waikato.