1 Introduction
In the last decade, boosting techniques, which combine weak learners into a strong learner, have been widely developed and employed across the machine learning and computational learning communities. AdaBoost (Freund and Schapire, 1997) and gradient boosting decision trees (GBDT) (Friedman, 2001) are among the most popular learning algorithms used in practice. There are several highly optimized implementations of boosting, among which XGBoost (Chen and Guestrin, 2016) and LightGBM (Ke et al., 2017) are broadly applied to increase scalability and decrease complexity. These implementations can train models with hundreds of trees on millions of training examples in a matter of minutes. NGBoost (Duan et al., 2019) generalized the natural gradient as the direction of steepest ascent in Riemannian space and applied it to boosting to enable probabilistic prediction for regression tasks. Natural gradient boosting shows promising performance improvements on small datasets due to better training dynamics, but it suffers from slow training, especially on large datasets. To reduce the training time, we adopt the best-first decision tree learning setting (Shi, 2007) for the weak learners, remove the maximum-depth restriction on base learners, and carefully tune the following three hyperparameters: the learning rate, the number of estimators, and the maximum number of leaves. Our best setting achieves up to a 4.85x speedup, significantly improves the original NGBoost performance, and beats the state-of-the-art performance on the Energy, Power, and Protein datasets from the UCI Machine Learning Repository.
Table 1: RMSE, NLL, and average training time (ATT) of NGBoost and RoNGBa on UCI regression benchmark datasets.

Dataset    N        RMSE (NGBoost)  RMSE (RoNGBa)  NLL (NGBoost)  NLL (RoNGBa)  ATT (NGBoost)  ATT (RoNGBa)
Boston     506      2.96 ± 0.42     3.01 ± 0.57    2.47 ± 0.12    2.48 ± 0.16   26.81s         10.04s
Concrete   1030     5.49            4.71 ± 0.61    3.08 ± 0.12    2.94 ± 0.18   29.96s         9.28s
Energy     768      0.51            0.35 ± 0.07    0.76           0.37 ± 0.28   30.24s         6.24s
Kin8nm     8192     0.18            0.14 ± 0.00    0.40           0.60 ± 0.03   189.28s        82.14s
Naval      11934    0.00 ± 0.00     0.00 ± 0.00    4.88           5.49 ± 0.04   317.85s        207.01s
Power      9568     3.92            3.47 ± 0.19    2.80           2.65 ± 0.08   120.31s        48.09s
Protein    45730    4.59            4.21 ± 0.06    2.86           2.76 ± 0.03   1191.02s       502.34s
Wine       1588     0.64            0.62 ± 0.05    0.94 ± 0.07    0.91 ± 0.08   42.44s         16.86s
Yacht      308      0.63 ± 0.19     0.90 ± 0.35    0.46 ± 0.28    1.03 ± 0.44   22.52s         5.11s
Year MSD   515345   9.18            9.14           3.47           3.46          14.00h         5.15h
2 Robustly Optimized Natural Gradient Boosting
When the maximum number of leaves is fixed, leaf-wise (best-first) tree growth algorithms tend to achieve lower loss than level-wise algorithms (Shi, 2007; Ke et al., 2017). We therefore remove the maximum-depth restriction and instead restrict the maximum number of leaves as the regularization to prevent overfitting. Apart from the performance gains, this change also yields roughly a 30% speedup: under a maximum-leaves restriction, decision trees can often achieve lower loss by growing deeper with fewer splits, whereas trees bounded by a maximum depth often keep making less effective splits at the shallow levels.
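As an illustration, NGBoost's default base learner is scikit-learn's DecisionTreeRegressor, which switches to best-first growth whenever max_leaf_nodes is set. A minimal sketch contrasting the two regularization styles follows; the dataset and parameter values are illustrative only:

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=8, noise=1.0, random_state=0)

# Level-wise style: bound tree size by depth (at most 2**3 = 8 leaves).
depth_tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

# Leaf-wise (best-first) style: bound tree size by leaf count instead;
# scikit-learn grows the tree best-first whenever max_leaf_nodes is set,
# so the tree may go deeper where the splits reduce loss the most.
leaf_tree = DecisionTreeRegressor(max_leaf_nodes=8, random_state=0).fit(X, y)

print(depth_tree.get_n_leaves(), depth_tree.get_depth())
print(leaf_tree.get_n_leaves(), leaf_tree.get_depth())
```

Both trees have at most eight leaves, but the leaf-wise tree is free to spend its leaf budget on deep, high-gain branches.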
For hyperparameter tuning, our insight is that we can counter the performance drop from decreasing the number of weak estimators by increasing the model complexity of each base learner. In this way, the training time is reduced roughly linearly because fewer weak learners need to be trained. Since reducing the number of weak learners decreases the number of parameters in the system, we increase the learning rate accordingly to keep the training dynamics robust. Based on this insight, we gradually decrease the number of estimators while simultaneously increasing the maximum number of leaves and the learning rate to find the best-performing setting. We first search for the best setting on the Energy dataset from the UCI Machine Learning Repository, and then report the performance on all datasets with the setting discovered. Throughout our experiments we tune the same three hyperparameters: the learning rate, the number of estimators, and the maximum number of leaves.
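The coupled search schedule can be sketched as follows, assuming we hold the total split budget (number of estimators times leaves per tree) roughly constant while scaling the learning rate with tree complexity; the concrete grid values here are illustrative, not the ones used in our experiments:

```python
def candidate_settings(base_estimators=2000, base_leaves=8, base_lr=0.01, steps=4):
    """Halve the number of estimators at each step while doubling the
    per-tree leaf budget, and raise the learning rate to compensate for
    the smaller ensemble. Returns a list of (n_estimators, max_leaves, lr)."""
    settings = []
    for i in range(steps):
        settings.append((base_estimators // (2 ** i),
                         base_leaves * (2 ** i),
                         base_lr * (2 ** i)))
    return settings

for n_est, leaves, lr in candidate_settings():
    print(f"n_estimators={n_est:5d}  max_leaves={leaves:3d}  lr={lr:.2f}")
```

Each candidate is then trained and scored on a validation split, and the best-scoring triple is kept.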
3 Experiments
Our experiments use datasets from the UCI Machine Learning Repository and follow the same protocol as NGBoost (Hernández-Lobato and Adams, 2015; Duan et al., 2019). For all datasets, we hold out a random 10% of the examples as a test set. From the other 90% we initially hold out 20% as a validation set to select the number of boosting stages that gives the best log-likelihood, and we then refit on the entire 90% using the chosen value. The refit model then predicts on the held-out 10% test set. This entire process is repeated 20 times for all datasets except Protein and Year MSD, for which it is repeated 5 times and 1 time, respectively. For the average training time (ATT) measurement, we average the training times measured over the repeated training runs. Unlike the original implementation, we use a learning rate of 0.04 across all datasets.
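The splitting scheme above can be sketched in a few lines; the function name and the example dataset size are illustrative:

```python
import random

def split_protocol(n_examples, seed=0):
    """Return (train_idx, val_idx, test_idx) following the protocol:
    10% held out for testing, then 20% of the remaining 90% held out
    for validation (used to pick the number of boosting stages)."""
    rng = random.Random(seed)
    idx = list(range(n_examples))
    rng.shuffle(idx)
    n_test = n_examples // 10
    test_idx = idx[:n_test]
    rest = idx[n_test:]
    n_val = len(rest) // 5          # 20% of the remaining 90%
    val_idx, train_idx = rest[:n_val], rest[n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = split_protocol(506)   # e.g. the Boston dataset
print(len(train_idx), len(val_idx), len(test_idx))
```

After model selection on the validation indices, the final model is refit on the union of the train and validation indices before scoring on the test indices.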
We also rerun the official NGBoost code with the same hyperparameters as claimed in the original paper for a fair comparison of performance and training time. All experiments are conducted on a single Intel(R) Xeon(R) E5-2630 v4 2.20GHz CPU.
4 Results
Table 1 compares the performance of our approach with the original NGBoost on the UCI regression benchmarks. RoNGBa achieves significantly better results on most of the datasets, apart from the extremely small ones (Yacht, Boston), which need extra hyperparameter tuning for better performance. In particular, our approach significantly beats the state-of-the-art performance on the Energy, Power, and Protein datasets as reported by Gal and Ghahramani (2016) and Lakshminarayanan et al. (2017). We also observe a speedup ranging from 1.53x to 4.85x across datasets, which empirically confirms our insight that reducing the overall number of learners saves far more computation time than is added by increasing each base learner's model complexity.
5 Related Work
AdaBoost (Freund and Schapire, 1997) adapts the input distribution based on the answers of the preceding weak learners. At each training step it puts higher weights on misclassified examples, and it finally composes a strong classifier as a weighted sum of all the weak hypotheses.
Gradient Boosting Decision Tree (GBDT) (Friedman, 2001) generalizes AdaBoost to handle a variety of loss functions. GBDT first expresses the loss-minimization problem as an additive model, then performs numerical optimization directly in function space using a greedy forward stagewise algorithm. Most importantly, GBDT uses the data-based analogue of the unconstrained negative gradient of the loss at the current model as an approximation of the residual in the boosting tree, which gives the best steepest-descent step direction in the N-dimensional data space.
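For squared-error loss the negative gradient is simply the residual y minus F(x), so the functional gradient descent described above can be sketched as a bare-bones boosting loop (a toy illustration on synthetic data, not a production implementation):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

def fit_gbdt(X, y, n_estimators=50, learning_rate=0.1, max_depth=2):
    """Gradient boosting for squared error: each tree is fitted to the
    negative gradient of the loss at the current model, i.e. the residual."""
    F = np.full(len(y), y.mean())          # constant initial model
    trees = []
    for _ in range(n_estimators):
        residual = y - F                   # negative gradient of (1/2)(y - F)^2
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        F += learning_rate * tree.predict(X)
        trees.append(tree)
    return y.mean(), trees

base, trees = fit_gbdt(X, y)
pred = base + 0.1 * sum(t.predict(X) for t in trees)
print("training MSE:", np.mean((y - pred) ** 2))
```

Swapping in a different loss only changes the gradient formula in the residual line, which is exactly why GBDT handles such a broad family of problems.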
Compared with AdaBoost, GBDT constructs multiple decision trees serially to predict the data: it treats the decision tree model as the parameter, and each iteration fits a tree to the negative gradient of the loss function. AdaBoost, by contrast, treats each data point as the parameter and adjusts the weights of the misclassified points. Therefore, by choosing different loss functions, such as squared error or absolute error in regression and negative binomial log-likelihood in classification, GBDT can be applied to broader and more diverse learning problems than AdaBoost, such as multi-class classification, click prediction, and learning to rank.
XGBoost (Chen and Guestrin, 2016) improves GBDT with better scalability. XGBoost suits large-scale data and limited computing resources, offering high speed with comparable accuracy. To achieve this scalability, XGBoost relies mainly on three techniques: 1) it approximates the best splits of decision trees with a weighted quantile sketch instead of greedily evaluating all possible splits; 2) it handles sparse data with a sparsity-aware algorithm that trains only on non-missing entries and learns a default tree direction for missing values; 3) it stores data in a cache-aware block structure that supports out-of-core computation.
LightGBM (Ke et al., 2017) further improves system scalability for high-dimensional, large-scale data. It applies two methods to GBDT, Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB), to increase efficiency without hurting accuracy.
GOSS subsamples the training data by keeping all instances with large gradients and randomly sampling among the instances with small gradients, since the latter are already well trained. Then, to preserve the data distribution, it amplifies the sampled small-gradient instances by a constant factor when computing the information gain. Instead of simply filtering out zero-valued entries as XGBoost does, LightGBM thus samples the training dataset more wisely.
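The sampling step can be sketched in a few lines; the sampling ratios a and b and the amplification factor (1 - a)/b follow the description above, while the function name and data are illustrative:

```python
import random

def goss_sample(gradients, a=0.2, b=0.1, seed=0):
    """Gradient-based One-Side Sampling (toy sketch): keep the top a*n
    instances by |gradient|, randomly sample b*n of the rest, and amplify
    the sampled small-gradient instances by (1 - a)/b so the information
    gain estimate stays approximately unbiased.
    Returns a list of (index, weight) pairs."""
    rng = random.Random(seed)
    n = len(gradients)
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top_k, rand_k = int(a * n), int(b * n)
    large = order[:top_k]                       # always kept, weight 1
    small = rng.sample(order[top_k:], rand_k)   # sampled, amplified weight
    amplify = (1 - a) / b
    return [(i, 1.0) for i in large] + [(i, amplify) for i in small]

grads = [0.01 * i for i in range(100)]
sampled = goss_sample(grads)
print(len(sampled))   # 30 = 20 large-gradient + 10 sampled small-gradient
```

Note that the total weight of the sample equals the original dataset size, which is what keeps the gain statistics comparable to training on the full data.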
In practice, many features are mutually exclusive, so the data can be very sparse. To reduce the number of features, EFB bundles exclusive features into a single feature. First, it takes features as vertices and adds edges between features that are not mutually exclusive, weighting each edge by the total number of conflicts between the two features. Then it sorts the features by their degrees in this graph. Finally, for each feature in the sorted list, it either places the feature into an existing bundle or creates a new one, depending on whether the resulting conflicts stay within a threshold. After the feature histograms are constructed, LightGBM finds the best split points with a histogram-based algorithm, whereas XGBoost approximates the best split points with a weighted quantile sketch.
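A greatly simplified sketch of the bundling step follows; conflict counting is reduced to pairwise nonzero overlaps on dense columns, and the threshold and data are illustrative:

```python
def conflicts(col_a, col_b):
    """Number of rows where both features are nonzero (a 'conflict')."""
    return sum(1 for x, y in zip(col_a, col_b) if x != 0 and y != 0)

def bundle_features(columns, max_conflicts=0):
    """Greedy Exclusive Feature Bundling sketch: visit features in order of
    total conflict count and place each into the first bundle where the
    added conflicts stay within the threshold, else open a new bundle."""
    n = len(columns)
    degree = [sum(conflicts(columns[i], columns[j]) for j in range(n) if j != i)
              for i in range(n)]
    order = sorted(range(n), key=lambda i: degree[i], reverse=True)
    bundles = []
    for i in order:
        for bundle in bundles:
            if sum(conflicts(columns[i], columns[j]) for j in bundle) <= max_conflicts:
                bundle.append(i)
                break
        else:
            bundles.append([i])
    return bundles

# Features 0 and 1 are mutually exclusive; feature 2 conflicts with both.
cols = [[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 0]]
print(bundle_features(cols))   # [[2], [0, 1]]
```

Bundled features are later re-encoded into disjoint value ranges so a single histogram can represent the whole bundle.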
Even though LightGBM does not adopt XGBoost's system-level techniques, such as cache-aware blocks and out-of-core computation, it still outperforms XGBoost thanks to its more efficient algorithms.
6 Conclusion
In this work, we proposed RoNGBa, a robustly optimized NGBoost approach. RoNGBa applies leaf number clipping to the base learners and finds the best hyperparameters based on a simple yet effective insight into the computation-accuracy trade-off. Our approach significantly beats the state-of-the-art performance on various UCI datasets while achieving up to a 4.85x speedup compared with the original NGBoost.
Our future work is to apply the techniques of Gradient-based One-Side Sampling and Exclusive Feature Bundling from LightGBM for more efficient natural gradient boosting on large-scale, higher-dimensional datasets.
Acknowledgments
We want to thank Michal Moshkovitz and Joseph Geumlek for the early discussions of the project.
References
T. Chen and C. Guestrin (2016). XGBoost: a scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794.

T. Duan, A. Avati, D. Y. Ding, K. K. Thai, S. Basu, A. Y. Ng, and A. Schuler (2019). NGBoost: natural gradient boosting for probabilistic prediction. arXiv preprint arXiv:1910.03225.

Y. Freund and R. E. Schapire (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55(1), pp. 119–139.

J. H. Friedman (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics, pp. 1189–1232.

Y. Gal and Z. Ghahramani (2016). Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In International Conference on Machine Learning, pp. 1050–1059.

J. M. Hernández-Lobato and R. P. Adams (2015). Probabilistic backpropagation for scalable learning of Bayesian neural networks. In International Conference on Machine Learning, pp. 1861–1869.

G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu (2017). LightGBM: a highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, pp. 3146–3154.

B. Lakshminarayanan, A. Pritzel, and C. Blundell (2017). Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pp. 6402–6413.

H. Shi (2007). Best-first decision tree learning. Thesis, The University of Waikato.