Tree-based boosting algorithms have become very successful in machine learning. In this paper, we provide an empirical evaluation of four tree-based boosting algorithms for multi-class classification: mart, abc-mart, robust logitboost, and abc-logitboost, on a wide range of datasets.
Abc-boost, where “abc” stands for adaptive base class, is a recent idea for improving multi-class classification. Both abc-mart and abc-logitboost are specific implementations of abc-boost. Although the experiments in [11, 12] were reasonable, we believe a more thorough study is necessary. Most datasets used in [11, 12] are (very) small. While those datasets (e.g., pendigits, zipcode) are still popular in machine learning research papers, they may be too small to be practically meaningful. Nowadays, applications with millions of training samples are not uncommon, for example, in search engines.
It would also be interesting to compare these four tree-based boosting algorithms with other popular learning methods such as support vector machines (SVM) and deep learning. A recent study (http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/DeepVsShallowComparisonICML2007) conducted a thorough empirical comparison of many learning algorithms, including SVM, neural nets, and deep learning. Its authors maintain a Web site from which one can download the datasets and compare the test mis-classification errors.
In this paper, we provide extensive experimental results using mart, abc-mart, robust logitboost, and abc-logitboost on the datasets used in that study, plus other publicly available datasets. One interesting dataset is the UCI Poker dataset. By private communications with C.J. Lin (the author of LibSVM), we learned that SVM achieved only a relatively low classification accuracy on this dataset. Interestingly, all four boosting algorithms can easily achieve substantially higher accuracies.
We try to make this paper self-contained by providing a detailed introduction to abc-mart, robust logitboost, and abc-logitboost in the next section.
2 LogitBoost, Mart, Abc-mart, Robust LogitBoost, and Abc-LogitBoost
We denote a training dataset by $\{y_i, \mathbf{x}_i\}_{i=1}^{N}$, where $N$ is the number of feature vectors (samples), $\mathbf{x}_i$ is the $i$th feature vector, and $y_i \in \{0, 1, 2, ..., K-1\}$ is the $i$th class label, where $K \geq 3$ in multi-class classification.
The class probabilities are modeled by the multinomial logit

$$p_{i,k} = \Pr\left(y_i = k \mid \mathbf{x}_i\right) = \frac{e^{F_{i,k}(\mathbf{x}_i)}}{\sum_{s=0}^{K-1} e^{F_{i,s}(\mathbf{x}_i)}}. \qquad (1)$$

While traditional logistic regression assumes $F_{i,k}(\mathbf{x}_i) = \boldsymbol{\beta}^{\mathsf{T}} \mathbf{x}_i$, logitboost and mart adopt the flexible “additive model,” which is a function of $M$ terms:

$$F^{(M)}(\mathbf{x}) = \sum_{m=1}^{M} \rho_m h(\mathbf{x}; \mathbf{a}_m), \qquad (2)$$
where $h(\mathbf{x}; \mathbf{a}_m)$, the base learner, is typically a regression tree. The parameters, $\rho_m$ and $\mathbf{a}_m$, are learned from the data by maximum likelihood, which is equivalent to minimizing the negative log-likelihood loss

$$L = \sum_{i=1}^{N} L_i, \qquad L_i = -\sum_{k=0}^{K-1} r_{i,k} \log p_{i,k}, \qquad (3)$$
where $r_{i,k} = 1$ if $y_i = k$ and $r_{i,k} = 0$ otherwise.
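To fix notation, the following minimal sketch (in Python with numpy; the function names are ours, not from any reference implementation) computes the class probabilities (1) and evaluates the loss (3) from the function values $F_{i,k}$:

```python
import numpy as np

def class_probabilities(F):
    """p[i, k] = exp(F[i, k]) / sum_s exp(F[i, s]), Eq. (1).

    F: (N, K) array of function values F_{i,k}. Subtracting the
    row-wise maximum keeps the exponentials numerically stable.
    """
    F = F - F.max(axis=1, keepdims=True)
    expF = np.exp(F)
    return expF / expF.sum(axis=1, keepdims=True)

def negative_log_likelihood(F, y):
    """Loss (3): L = sum_i L_i with L_i = -log p_{i, y_i},
    since r_{i,k} = 1 only for k = y_i."""
    p = class_probabilities(F)
    return float(-np.log(p[np.arange(len(y)), y]).sum())
```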
The additive model (2) is built by a greedy stage-wise procedure, using a second-order (diagonal) approximation, which requires knowing the first two derivatives of the loss function (3) with respect to the function values $F_{i,k}$. One obtains:

$$\frac{\partial L_i}{\partial F_{i,k}} = -\left(r_{i,k} - p_{i,k}\right), \qquad \frac{\partial^2 L_i}{\partial F_{i,k}^2} = p_{i,k}\left(1 - p_{i,k}\right). \qquad (4)$$
Those derivatives can be derived by assuming no relations among $F_{i,k}$, $k = 0$ to $K-1$. However, the logitboost paper used the “sum-to-zero” constraint $\sum_{k=0}^{K-1} F_{i,k} = 0$ throughout and provided an alternative explanation. The abc-boost work showed (4) by conditioning on a “base class” and noticed that the resultant derivatives are independent of the choice of the base.
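Given the probabilities, the derivatives (4) are immediate; a short sketch continuing the snippet above:

```python
def loss_derivatives(p, r):
    """First and second derivatives of L_i w.r.t. F_{i,k}, Eq. (4).

    p: (N, K) probabilities; r: (N, K) 0/1 indicators with r[i, y_i] = 1.
    """
    grad = -(r - p)        # dL_i / dF_{i,k}
    hess = p * (1.0 - p)   # d^2 L_i / dF_{i,k}^2 (diagonal approximation)
    return grad, hess
```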
At each stage, logitboost fits an individual regression function separately for each class. This is analogous to the popular individualized regression approach in multinomial logistic regression, which is known [3, 1] to result in loss of statistical efficiency, compared to the full (conditional) maximum likelihood approach.
On the other hand, in order to use trees as the base learner, the diagonal approximation appears to be a must, at least from a practical perspective.
2.2 Adaptive Base Class Boost (ABC-Boost)
Due to the sum-to-zero constraint, the loss (3) can be formulated by treating one class as the “base class.” The base class must be identified at each boosting iteration during training. An exhaustive procedure was suggested to adaptively find the best base class, i.e., the one that minimizes the training loss (3) at each iteration.
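As a concrete illustration, the exhaustive search can be sketched as follows (a minimal sketch in Python; `train_one_iteration` is a hypothetical routine standing in for one boosting step over the $K-1$ non-base classes, not a function from any released code):

```python
def select_base_class(K, state, train_one_iteration):
    """Exhaustive base-class search: try every class as the base and
    keep the candidate that minimizes the training loss (3).

    train_one_iteration(state, base) is assumed (hypothetically) to
    perform one boosting step for the K-1 non-base classes and return
    (new_state, training_loss).
    """
    best = (float("inf"), None, None)  # (loss, base, state)
    for base in range(K):
        new_state, loss = train_one_iteration(state, base)
        if loss < best[0]:
            best = (loss, base, new_state)
    return best[1], best[2]  # chosen base class and updated model state
```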
2.3 Robust LogitBoost
The mart paper and a recent (2008) discussion paper commented that logitboost (Alg. 1) can be numerically unstable. In fact, the logitboost paper suggested some “crucial implementation protections” (on page 17 of that paper):
In Line 5 of Alg. 1, compute the response by $\frac{1}{p_{i,k}}$ (if $r_{i,k} = 1$) or $\frac{-1}{1 - p_{i,k}}$ (if $r_{i,k} = 0$).
Bound the response by $z_{max}$. The particular value of $z_{max}$ is not sensitive as long as $z_{max} \in [2, 4]$.
Note that the above operations are applied to each individual sample. The goal is to ensure that the response is not too large. On the other hand, one would hope to use larger responses to better capture the data variation. Therefore, this thresholding operation occurs very frequently, and it is expected that part of the useful information is lost.
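To make the protections concrete, here is a minimal sketch (ours, in Python/numpy; the default `z_max` of 4.0 is simply the upper end of the quoted range):

```python
import numpy as np

def protected_response(p, r, z_max=4.0):
    """Per-sample logitboost response with the classical protections.

    p: probabilities p_{i,k}; r: 0/1 indicators r_{i,k}.
    The unprotected response (r - p) / (p (1 - p)) equals 1/p when
    r = 1 and -1/(1 - p) when r = 0; it is then bounded by z_max.
    """
    z = np.where(r == 1, 1.0 / p, -1.0 / (1.0 - p))
    return np.clip(z, -z_max, z_max)
```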
The next subsection explains that, if implemented carefully, logitboost is almost identical to mart. The only difference is the tree-splitting criterion.
2.4 Tree-Splitting Criterion Using Second-Order Information
Consider $N$ weights $w_i$ and $N$ response values $z_i$, $i = 1$ to $N$, which are assumed to be ordered according to the sorted order of the corresponding feature values. The tree-splitting procedure is to find the index $s$, $1 \leq s < N$, such that the weighted mean square error (MSE) is reduced the most if split at $s$. That is, we seek the $s$ to maximize

$$Gain(s) = \sum_{i=1}^{N}\left(z_i - \bar{z}\right)^2 w_i - \left[\sum_{i=1}^{s}\left(z_i - \bar{z}_L\right)^2 w_i + \sum_{i=s+1}^{N}\left(z_i - \bar{z}_R\right)^2 w_i\right],$$
where $\bar{z} = \frac{\sum_{i=1}^{N} z_i w_i}{\sum_{i=1}^{N} w_i}$, $\bar{z}_L = \frac{\sum_{i=1}^{s} z_i w_i}{\sum_{i=1}^{s} w_i}$, and $\bar{z}_R = \frac{\sum_{i=s+1}^{N} z_i w_i}{\sum_{i=s+1}^{N} w_i}$. After simplification, one can obtain

$$Gain(s) = \frac{\left[\sum_{i=1}^{s} z_i w_i\right]^2}{\sum_{i=1}^{s} w_i} + \frac{\left[\sum_{i=s+1}^{N} z_i w_i\right]^2}{\sum_{i=s+1}^{N} w_i} - \frac{\left[\sum_{i=1}^{N} z_i w_i\right]^2}{\sum_{i=1}^{N} w_i}.$$
Plugging in $w_i = p_{i,k}\left(1 - p_{i,k}\right)$ and $z_i = \frac{r_{i,k} - p_{i,k}}{p_{i,k}\left(1 - p_{i,k}\right)}$ yields

$$Gain(s) = \frac{\left[\sum_{i=1}^{s}\left(r_{i,k} - p_{i,k}\right)\right]^2}{\sum_{i=1}^{s} p_{i,k}\left(1 - p_{i,k}\right)} + \frac{\left[\sum_{i=s+1}^{N}\left(r_{i,k} - p_{i,k}\right)\right]^2}{\sum_{i=s+1}^{N} p_{i,k}\left(1 - p_{i,k}\right)} - \frac{\left[\sum_{i=1}^{N}\left(r_{i,k} - p_{i,k}\right)\right]^2}{\sum_{i=1}^{N} p_{i,k}\left(1 - p_{i,k}\right)}.$$
Because the computations involve $\sum p_{i,k}\left(1 - p_{i,k}\right)$ as a group, this procedure is actually numerically stable.
In comparison, mart only uses the first-order information to construct the trees, i.e.,

$$MartGain(s) = \frac{1}{s}\left[\sum_{i=1}^{s}\left(r_{i,k} - p_{i,k}\right)\right]^2 + \frac{1}{N-s}\left[\sum_{i=s+1}^{N}\left(r_{i,k} - p_{i,k}\right)\right]^2 - \frac{1}{N}\left[\sum_{i=1}^{N}\left(r_{i,k} - p_{i,k}\right)\right]^2.$$
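For concreteness, the simplified gain can be computed with cumulative sums in a single pass over the sorted samples; the sketch below is our own implementation of the formulas above, not taken from any released code:

```python
import numpy as np

def best_split_second_order(z, w):
    """Maximize Gain(s) over all splits, with z and w already sorted by
    the feature values and all weights assumed strictly positive.

    Returns (s, gain) where the left node holds samples 1..s (1-based).
    """
    zw = np.cumsum(z * w)          # prefix sums of z_i w_i
    ww = np.cumsum(w)              # prefix sums of w_i
    total_zw, total_w = zw[-1], ww[-1]
    left = zw[:-1] ** 2 / ww[:-1]
    right = (total_zw - zw[:-1]) ** 2 / (total_w - ww[:-1])
    gain = left + right - total_zw ** 2 / total_w
    s = int(np.argmax(gain))
    return s + 1, float(gain[s])
```

With $w_i = p_{i,k}\left(1 - p_{i,k}\right)$ and $z_i w_i = r_{i,k} - p_{i,k}$, this computes the (robust logitboost) second-order criterion; setting $w_i \equiv 1$ and $z_i = r_{i,k} - p_{i,k}$ recovers mart's first-order criterion.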
2.5 Adaptive Base Class Logitboost (ABC-LogitBoost)
The abc-boost algorithm consists of two key components:
1. Using the widely-adopted sum-to-zero constraint $\sum_{k=0}^{K-1} F_{i,k} = 0$ on the loss function, one can formulate boosting algorithms for only $K-1$ classes, by treating one class as the base class.
2. At each boosting iteration, adaptively select the base class according to the training loss. An exhaustive search strategy was suggested.
2.6 Main Parameters
Alg. 2 and Alg. 3 have three parameters ($J$, $\nu$, and $M$), to which the performance is in general not very sensitive, as long as they fall in some reasonable range. This is a significant advantage in practice.
The number of terminal nodes, $J$, determines the capacity of the base learner. $J = 6$ was suggested in the original logitboost work, and [7, 21] commented that $J > 10$ is unlikely to be needed. In our experience, for large datasets (or moderate datasets in high dimensions), $J = 20$ is often a reasonable choice; see also Sec. 5 for more examples.
The shrinkage, $\nu$, should be large enough to make sufficient progress at each step and small enough to avoid over-fitting. The mart work suggested $\nu \leq 0.1$. Normally, $\nu = 0.1$ is used.
The number of boosting iterations, $M$, is largely determined by the affordable computing time. A commonly-regarded merit of boosting is that, on many datasets, over-fitting can be largely avoided for reasonable $J$, $\nu$, and $M$.
[Table 1: summary of the datasets (dataset name, # training samples, # test samples, # features).]
The original UCI Covertype dataset is fairly large, with 581,012 samples. To generate Covertype290k, we randomly split the original data into halves, one half for training and the other half for testing. For Covertype145k, we randomly select one half from the training set of Covertype290k and keep the same test set.
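The random half-split can be reproduced along these lines (a sketch; the seed and the array representation are our own choices):

```python
import numpy as np

def random_half_split(X, y, seed=0):
    """Randomly split a dataset into two halves (e.g., Covertype290k).

    The seed is arbitrary; any fixed value makes the split reproducible.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    half = len(y) // 2
    a, b = idx[:half], idx[half:]
    return (X[a], y[a]), (X[b], y[b])
```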
The UCI Poker dataset originally used only 25,010 samples for training and 1,000,000 samples for testing. Since the test set is very large, we randomly divide it equally into two parts (I and II). Poker25kT1 uses the original training set for training and Part I of the original test set for testing; Poker25kT2 uses the same training set and Part II for testing. This way, Poker25kT1 can use the test set of Poker25kT2 for validation, and Poker25kT2 can use the test set of Poker25kT1 for validation. As the two test sets are still very large, this treatment provides reliable results.
Since the original training set (about 25k samples) is too small compared to the size of the test set, we enlarge the training set to form Poker525k, Poker275k, Poker150k, and Poker100k. All four enlarged training datasets use the same test set as Poker25kT2 (i.e., Part II of the original test set). The training set of Poker525k contains the original (25k) training set plus all of Part I of the original test set. Similarly, the training sets of Poker275k / Poker150k / Poker100k contain the original training set plus 250k / 125k / 75k samples from Part I of the original test set.
The original Poker dataset provides 10 features: 5 “suit” features and 5 “rank” features. While the “ranks” are naturally ordinal, it appears reasonable to treat the “suits” as nominal features. By private communications, R. Cattral, the donor of the Poker data, suggested that we treat the “suits” as nominal. C.J. Lin also kindly told us that the performance of SVM was not affected whether the “suits” were treated as nominal or ordinal. In our experiments, we choose to treat the “suits” as nominal features; hence the total number of features becomes 25 after expanding each “suit” feature into 4 binary features.
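The nominal expansion amounts to replacing each “suit” value in {1, 2, 3, 4} with 4 binary indicators; a sketch follows (the column positions of the suit features are an assumption about the UCI layout and should be checked against the data description):

```python
import numpy as np

def expand_suits(X, suit_cols=(0, 2, 4, 6, 8)):
    """Expand each 'suit' column (values 1..4) into 4 binary features.

    suit_cols is an assumption about the UCI Poker column layout;
    the remaining columns are kept as ordinal 'rank' features.
    The result has 5 ranks + 5 suits x 4 indicators = 25 features.
    """
    rank_cols = [c for c in range(X.shape[1]) if c not in suit_cols]
    indicators = [(X[:, [c]] == v).astype(float)
                  for c in suit_cols for v in (1, 2, 3, 4)]
    return np.hstack([X[:, rank_cols]] + indicators)
```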
While the original Mnist dataset is extremely popular, this dataset is known to be too easy. Originally, Mnist used 60000 samples for training and 10000 samples for testing.
Mnist10k uses the original (10000) test set for training and the original (60000) training set for testing. This creates a more challenging task.
3.4 Mnist with Many Variations
The study at www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/DeepVsShallowComparisonICML2007 created a variety of much more difficult datasets by adding various background (correlated) noise, background images, rotations, etc., to the original Mnist dataset. We shorten the names of the generated datasets to M-Basic, M-Rotate, M-Image, M-Rand, M-RotImg, and M-Noise1 to M-Noise6.
By private communications with D. Erhan, one of the authors of that study, we learned that the sizes of the training sets actually vary depending on the learning algorithm. For some methods such as SVM, the algorithms were retrained using all 12,000 training samples after choosing the best parameters; for other methods, 10,000 samples were used for training. In our experiments, we use 12,000 training samples for M-Basic, M-Rotate, M-Image, M-Rand, and M-RotImg, and 10,000 training samples for M-Noise1 to M-Noise6.
Note that the datasets M-Noise1 to M-Noise6 have merely 2,000 test samples each. By private communications with D. Erhan, we understand this was because the original study did not intend to compare the statistical significance of the test errors on those six datasets.
The UCI Letter dataset has 20,000 samples in total. In our experiments, Letter4k (Letter2k) uses the last 4,000 (2,000) samples for training and the rest for testing. The purpose is to demonstrate the performance of the algorithms using only small training sets.
We also include Letter15k, which is one of the standard partitions of the Letter dataset, by using 15000 samples for training and 5000 samples for testing.
4 Summary of Experiment Results
We simply use logitboost (or even logit in the plots) to denote robust logitboost.
Table 2 summarizes the test mis-classification errors. For all datasets except Poker25kT1 and Poker25kT2, we report the test errors with tree size $J = 20$ and shrinkage $\nu = 0.1$. For Poker25kT1 and Poker25kT2, we use a different combination of $J$ and $\nu$ (see Sec. 5). We report more detailed experiment results in Sec. 5.
For Covertype290k, Poker525k, Poker275k, Poker150k, and Poker100k, as they are fairly large, we train only a limited number of boosting iterations. For all other datasets, we train the full number of iterations or terminate when the training loss (3) is close to the machine accuracy. Since we do not notice obvious over-fitting on those datasets, we simply report the test errors at the last iteration.
Table 3 summarizes the following four types of $p$-values:
$P_1$: for testing whether abc-mart has significantly lower error rates than mart.
$P_2$: for testing whether (robust) logitboost has significantly lower error rates than mart.
$P_3$: for testing whether abc-logitboost has significantly lower error rates than abc-mart.
$P_4$: for testing whether abc-logitboost has significantly lower error rates than (robust) logitboost.
If one method has $f_1$ test mis-classifications and another has $f_2$ on the same test set of size $n$, then, modeling the error counts as binomial, the probability parameter $p$ can be estimated by $\hat{p} = f_1/n$, and the variance of the error count can be estimated by $n\hat{p}\left(1 - \hat{p}\right)$. The $p$-values can then be computed using the normal approximation of binomial distributions.
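For concreteness, a sketch of this computation (ours; estimating $p$ from the larger error count $f_1$ and using scipy's normal CDF for the approximation are our reading of the procedure):

```python
from scipy.stats import norm

def p_value(f1, f2, n):
    """Normal-approximation p-value for 'f2 is significantly lower
    than f1' on a test set of size n, treating the error count as
    Binomial(n, p) with p estimated from the baseline count f1.
    """
    p_hat = f1 / n
    mean = n * p_hat
    std = (n * p_hat * (1.0 - p_hat)) ** 0.5
    return float(norm.cdf((f2 - mean) / std))
```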
Note that the test sets for M-Noise1 to M-Noise6 are very small because the original study did not intend to compare statistical significance on those six datasets. We compute their $p$-values anyway.
The results demonstrate that abc-logitboost and abc-mart considerably outperform logitboost and mart, respectively. In addition, except for Poker25kT1 and Poker25kT2, we observe that abc-logitboost outperforms abc-mart, and logitboost outperforms mart.
4.2 Comparisons with SVM and Deep Learning
For UCI Poker, we know that SVM could only achieve a relatively high error rate (by private communications with C.J. Lin). In comparison, all four algorithms, mart, abc-mart, (robust) logitboost, and abc-logitboost, could achieve much smaller error rates on Poker25kT1 and Poker25kT2.
4.3 Performance vs. Boosting Iterations
5 More Detailed Experiment Results
Ideally, we would like to demonstrate that, with any reasonable choice of the parameters $J$ and $\nu$, abc-mart and abc-logitboost will always improve over mart and logitboost, respectively. This is indeed the case on the datasets with which we have experimented. In this section, we provide the detailed experiment results on Mnist10k, Poker25kT1, Poker25kT2, Letter4k, and Letter2k.
5.1 Detailed Experiment Results on Mnist10k
For this dataset, we experiment with every combination of $J \in \{4, 6, 8, 10, 12, 14, 16, 18, 20, 24, 30, 40, 50\}$ and $\nu \in \{0.04, 0.06, 0.08, 0.1\}$. We train the four boosting algorithms until the training loss (3) is close to the machine accuracy, to exhaust the capacity of the learner so that we can provide a reliable comparison, up to a large maximum number of iterations.
Table 5: Test mis-classification errors on Mnist10k; each cell reports mart (left) and abc-mart (right).

$J$ | $\nu = 0.04$ | $\nu = 0.06$ | $\nu = 0.08$ | $\nu = 0.1$
4 | 3356 3060 | 3329 3019 | 3318 2855 | 3326 2794
6 | 3185 2760 | 3093 2626 | 3129 2656 | 3217 2590
8 | 3049 2558 | 3054 2555 | 3054 2534 | 3035 2577
10 | 3020 2547 | 2973 2521 | 2990 2520 | 2978 2506
12 | 2927 2498 | 2917 2457 | 2945 2488 | 2907 2490
14 | 2925 2487 | 2901 2471 | 2877 2470 | 2884 2454
16 | 2899 2478 | 2893 2452 | 2873 2465 | 2860 2451
18 | 2857 2469 | 2880 2460 | 2870 2437 | 2855 2454
20 | 2833 2441 | 2834 2448 | 2834 2444 | 2815 2440
24 | 2840 2447 | 2827 2431 | 2801 2427 | 2784 2455
30 | 2826 2457 | 2822 2443 | 2828 2470 | 2807 2450
40 | 2837 2482 | 2809 2440 | 2836 2447 | 2782 2506
50 | 2813 2502 | 2826 2459 | 2824 2469 | 2786 2499
Table 6: Test mis-classification errors on Mnist10k; each cell reports (robust) logitboost (left) and abc-logitboost (right).

$J$ | $\nu = 0.04$ | $\nu = 0.06$ | $\nu = 0.08$ | $\nu = 0.1$
4 | 2936 2630 | 2970 2600 | 2980 2535 | 3017 2522
6 | 2710 2263 | 2693 2252 | 2710 2226 | 2711 2223
8 | 2599 2159 | 2619 2138 | 2589 2120 | 2597 2143
10 | 2553 2122 | 2527 2118 | 2516 2091 | 2500 2097
12 | 2472 2084 | 2468 2090 | 2468 2090 | 2464 2095
14 | 2451 2083 | 2420 2094 | 2432 2063 | 2419 2050
16 | 2424 2111 | 2437 2114 | 2393 2097 | 2395 2082
18 | 2399 2088 | 2402 2087 | 2389 2088 | 2380 2097
20 | 2388 2128 | 2414 2112 | 2411 2095 | 2381 2102
24 | 2442 2174 | 2415 2147 | 2417 2129 | 2419 2138
30 | 2468 2235 | 2434 2237 | 2423 2221 | 2449 2177
40 | 2551 2310 | 2509 2284 | 2518 2257 | 2531 2260
50 | 2612 2353 | 2622 2359 | 2579 2332 | 2570 2341
The experiment results illustrate that the performance of all four algorithms is stable over a wide range of tree sizes $J$ (e.g., $10 \leq J \leq 50$). The shrinkage parameter $\nu$ does not much affect the test performance, although smaller values result in more boosting iterations (before the training loss reaches the machine accuracy).
We further randomly divide the test set of Mnist10k (60,000 test samples) equally into two parts (I and II). We then test the algorithms on Part I (using the same training results). We name this “new” dataset Mnist10kT1. The purpose of this experiment is to further demonstrate the stability of the algorithms.
Table 7 presents the test mis-classification errors on Mnist10kT1. Compared to Table 5, the mis-classification errors on Mnist10kT1 are roughly half of the mis-classification errors on Mnist10k, for all $J$ and $\nu$. This helps establish that our experiment results on Mnist10k provide a very reliable comparison.
Table 7 (part 1): Test mis-classification errors on Mnist10kT1; each cell reports mart (left) and abc-mart (right).

$J$ | $\nu = 0.04$ | $\nu = 0.06$ | $\nu = 0.08$ | $\nu = 0.1$
4 | 1682 1514 | 1668 1505 | 1666 1416 | 1663 1380
6 | 1573 1382 | 1523 1320 | 1533 1329 | 1582 1288
8 | 1501 1263 | 1515 1257 | 1523 1250 | 1491 1279
10 | 1492 1270 | 1457 1248 | 1470 1239 | 1459 1236
12 | 1432 1244 | 1427 1234 | 1444 1228 | 1436 1227
14 | 1424 1237 | 1420 1231 | 1407 1223 | 1419 1212
16 | 1430 1226 | 1426 1224 | 1411 1223 | 1418 1204
18 | 1400 1222 | 1413 1218 | 1390 1210 | 1404 1211
20 | 1398 1213 | 1381 1205 | 1388 1213 | 1382 1198
24 | 1402 1221 | 1366 1201 | 1372 1199 | 1346 1205
30 | 1384 1211 | 1374 1208 | 1368 1224 | 1366 1205
40 | 1397 1244 | 1375 1220 | 1397 1222 | 1365 1246
50 | 1371 1239 | 1380 1221 | 1382 1223 | 1362 1242
Table 7 (part 2): Test mis-classification errors on Mnist10kT1; each cell reports (robust) logitboost (left) and abc-logitboost (right).

$J$ | $\nu = 0.04$ | $\nu = 0.06$ | $\nu = 0.08$ | $\nu = 0.1$
4 | 1419 1299 | 1449 1281 | 1446 1251 | 1460 1244
6 | 1313 1111 | 1313 1114 | 1326 1101 | 1317 1097
8 | 1278 1058 | 1287 1050 | 1270 1036 | 1262 1058
10 | 1252 1061 | 1244 1057 | 1237 1040 | 1229 1041
12 | 1224 1020 | 1219 1049 | 1217 1053 | 1224 1047
14 | 1213 1038 | 1207 1050 | 1201 1039 | 1198 1026
16 | 1185 1050 | 1205 1058 | 1189 1044 | 1178 1041
18 | 1186 1048 | 1184 1038 | 1184 1046 | 1167 1056
20 | 1185 1077 | 1199 1063 | 1183 1042 | 1184 1045
24 | 1208 1095 | 1196 1083 | 1191 1064 | 1194 1068
30 | 1225 1113 | 1201 1117 | 1190 1113 | 1211 1087
40 | 1254 1159 | 1247 1145 | 1248 1127 | 1249 1127
50 | 1292 1177 | 1284 1174 | 1275 1161 | 1276 1176
5.2 Detailed Experiment Results on Poker25kT1 and Poker25kT2
Recall that the original UCI Poker dataset used 25,010 samples for training and 1,000,000 samples for testing. To provide a reliable comparison (and validation), we form two datasets, Poker25kT1 and Poker25kT2, by equally dividing the original test set into two parts (I and II). Both use the same training set. Poker25kT1 uses Part I of the original test set for testing and Poker25kT2 uses Part II.
Table 8 and Table 9 present the test mis-classification errors for a range of $J$ and $\nu$. Comparing the two tables, we can see that the corresponding entries are very close to each other, which again verifies that the four boosting algorithms provide reliable results on this dataset.
For most combinations of $J$ and $\nu$, all four algorithms achieve low error rates. For both Poker25kT1 and Poker25kT2, the lowest test errors are attained at the same combination of $J$ and $\nu$. Unlike Mnist10k, the test errors, especially those of mart and logitboost, are slightly sensitive to the parameters.
Note that when $J$ is small (and $\nu$ is small), the allowed number of training steps is not sufficient in this case.
Table 8: Test mis-classification errors on Poker25kT1; each cell reports mart (left) and abc-mart (right).

$J$ | $\nu = 0.04$ | $\nu = 0.06$ | $\nu = 0.08$ | $\nu = 0.1$
4 | 145880 90323 | 132526 67417 | 124283 49403 | 113985 42126
6 | 71628 38017 | 59046 36839 | 48064 35467 | 43573 34879
8 | 64090 39220 | 53400 37112 | 47360 36407 | 44131 35777
10 | 60456 39661 | 52464 38547 | 47203 36990 | 46351 36647