An Empirical Evaluation of Four Algorithms for Multi-Class Classification: Mart, ABC-Mart, Robust LogitBoost, and ABC-LogitBoost

01/07/2010 ∙ Ping Li et al. ∙ Cornell University

This empirical study is mainly devoted to comparing four tree-based boosting algorithms: mart, abc-mart, robust logitboost, and abc-logitboost, for multi-class classification on a variety of publicly available datasets. Some of those datasets have been thoroughly tested in prior studies using a broad range of classification algorithms including SVM, neural nets, and deep learning. In terms of the empirical classification errors, our experiment results demonstrate: 1. Abc-mart considerably improves mart. 2. Abc-logitboost considerably improves (robust) logitboost. 3. (Robust) logitboost considerably improves mart on most datasets. 4. Abc-logitboost considerably improves abc-mart on most datasets. 5. These four boosting algorithms (especially abc-logitboost) outperform SVM on many datasets. 6. Compared to the best deep learning methods, these four boosting algorithms (especially abc-logitboost) are competitive.


1 Introduction

Boosting algorithms [16, 4, 5, 2, 17, 7, 15, 6] have become very successful in machine learning. In this paper, we provide an empirical evaluation of four tree-based boosting algorithms for multi-class classification: mart [6], abc-mart [11], robust logitboost [13], and abc-logitboost [12], on a wide range of datasets.

Abc-boost [11], where "abc" stands for adaptive base class, is a recent idea for improving multi-class classification. Both abc-mart [11] and abc-logitboost [12] are specific implementations of abc-boost. Although the experiments in [11, 12] were reasonable, we consider a more thorough study necessary. Most datasets used in [11, 12] are (very) small. While those datasets (e.g., pendigits, zipcode) are still popular in machine learning research papers, they may be too small to be of much practical significance. Nowadays, applications with millions of training samples are not uncommon, for example, in search engines [14].

It would also be interesting to compare these four tree-based boosting algorithms with other popular learning methods such as support vector machines (SVM) and deep learning. A recent study [9] (http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/DeepVsShallowComparisonICML2007) conducted a thorough empirical comparison of many learning algorithms including SVM, neural nets, and deep learning. The authors of [9] maintain a Web site from which one can download the datasets and compare the test mis-classification errors.

In this paper, we provide extensive experiment results using mart, abc-mart, robust logitboost, and abc-logitboost on the datasets used in [9], plus other publicly available datasets. One interesting dataset is the UCI Poker. By private communications with C.J. Lin (the author of LibSVM), we learned that SVM achieved a considerably lower classification accuracy on this dataset. Interestingly, all four boosting algorithms can easily achieve much higher accuracies.

We try to make this paper self-contained by providing a detailed introduction to abc-mart, robust logitboost, and abc-logitboost in the next section.

2 LogitBoost, Mart, Abc-mart, Robust LogitBoost, and Abc-LogitBoost

We denote a training dataset by $\{y_i, \mathbf{x}_i\}_{i=1}^{N}$, where $N$ is the number of feature vectors (samples), $\mathbf{x}_i$ is the $i$th feature vector, and $y_i \in \{0, 1, 2, \ldots, K-1\}$ is the $i$th class label, with $K \ge 3$ in multi-class classification.

Both logitboost [7] and mart (multiple additive regression trees) [6] can be viewed as generalizations of logistic regression, which assumes class probabilities $p_{i,k}$ of the form

$$p_{i,k} = \Pr(y_i = k \mid \mathbf{x}_i) = \frac{e^{F_{i,k}(\mathbf{x}_i)}}{\sum_{s=0}^{K-1} e^{F_{i,s}(\mathbf{x}_i)}}. \qquad (1)$$

While traditional logistic regression assumes $F_{i,k} = \beta^{\mathsf{T}} \mathbf{x}_i$, logitboost and mart adopt the flexible "additive model," which is a function of $M$ terms:

$$F^{(M)}(\mathbf{x}) = \sum_{m=1}^{M} \rho_m h(\mathbf{x}; \mathbf{a}_m), \qquad (2)$$

where $h(\mathbf{x}; \mathbf{a}_m)$, the base learner, is typically a regression tree. The parameters, $\rho_m$ and $\mathbf{a}_m$, are learned from the data by maximum likelihood, which is equivalent to minimizing the negative log-likelihood loss

$$L = \sum_{i=1}^{N} L_i = -\sum_{i=1}^{N} \sum_{k=0}^{K-1} r_{i,k} \log p_{i,k}, \qquad (3)$$

where $r_{i,k} = 1$ if $y_i = k$ and $r_{i,k} = 0$ otherwise.

For identifiability, the constraint $\sum_{k=0}^{K-1} F_{i,k} = 0$, i.e., the sum-to-zero constraint, is routinely adopted [7, 6, 19, 10, 18, 21, 20].
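To make the notation concrete, here is a minimal numerical sketch (our illustration, not code from any of the papers) of Eq. (1) and Eq. (3): it computes the class probabilities from a hypothetical score matrix F whose rows satisfy the sum-to-zero constraint, and then evaluates the negative log-likelihood loss.

```python
import numpy as np

def class_probabilities(F):
    """Eq. (1): p_{i,k} = exp(F_{i,k}) / sum_s exp(F_{i,s}), row-wise."""
    F = F - F.max(axis=1, keepdims=True)   # stabilize the exponentials
    expF = np.exp(F)
    return expF / expF.sum(axis=1, keepdims=True)

def neg_log_likelihood(F, y):
    """Eq. (3): L = -sum_i log p_{i, y_i}, since r_{i,k} = 1{y_i = k}."""
    p = class_probabilities(F)
    N = F.shape[0]
    return -np.log(p[np.arange(N), y]).sum()

# Toy example: N = 3 samples, K = 4 classes; each row of F sums to zero.
F = np.array([[ 1.0, -0.5, -0.3, -0.2],
              [-0.2,  0.9, -0.4, -0.3],
              [-0.1, -0.1, -0.1,  0.3]])
y = np.array([0, 1, 3])
print(class_probabilities(F).round(3))
print(neg_log_likelihood(F, y))
```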

2.1 Logitboost

As described in Alg. 1, [7] builds the additive model (2) by a greedy stage-wise procedure, using a second-order (diagonal) approximation, which requires knowing the first two derivatives of the loss function (3) with respect to the function values $F_{i,k}$. [7] obtained

$$\frac{\partial L_i}{\partial F_{i,k}} = -\left(r_{i,k} - p_{i,k}\right), \qquad \frac{\partial^2 L_i}{\partial F_{i,k}^2} = p_{i,k}\left(1 - p_{i,k}\right). \qquad (4)$$

Those derivatives can be derived by assuming no relations among $F_{i,k}$, $k = 0$ to $K-1$. However, [7] used the "sum-to-zero" constraint throughout the paper and provided an alternative explanation: [7] showed (4) by conditioning on a "base class" and noticed that the resultant derivatives are independent of the choice of the base.

0: $r_{i,k} = 1$ if $y_i = k$, $r_{i,k} = 0$ otherwise.
1: $F_{i,k} = 0$, $p_{i,k} = \frac{1}{K}$, $k = 0$ to $K-1$, $i = 1$ to $N$
2: For $m = 1$ to $M$ Do
3:       For $k = 0$ to $K-1$, Do
4:            Compute $w_{i,k} = p_{i,k}(1 - p_{i,k})$.
5:            Compute $z_{i,k} = \frac{r_{i,k} - p_{i,k}}{p_{i,k}(1 - p_{i,k})}$.
6:            Fit the function $f_{i,k}$ by a weighted least-square of $z_{i,k}$ to $\mathbf{x}_i$ with weights $w_{i,k}$.
7:            $F_{i,k} = F_{i,k} + \nu \frac{K-1}{K}\left(f_{i,k} - \frac{1}{K}\sum_{s=0}^{K-1} f_{i,s}\right)$
8:       End
9:      $p_{i,k} = \exp(F_{i,k}) / \sum_{s=0}^{K-1} \exp(F_{i,s})$
10: End

Algorithm 1 LogitBoost [7, Alg. 6]. $\nu$ is the shrinkage.

At each stage, logitboost fits an individual regression function separately for each class. This is analogous to the popular individualized regression approach in multinomial logistic regression, which is known [3, 1] to result in loss of statistical efficiency, compared to the full (conditional) maximum likelihood approach.

On the other hand, in order to use trees as base learner, the diagonal approximation appears to be a must, at least from the practical perspective.

2.2 Adaptive Base Class Boost (ABC-Boost)

[11] derived the derivatives of the loss function (3) under the sum-to-zero constraint. Without loss of generality, we can assume that class 0 is the base class. For any $k \neq 0$,

$$\frac{\partial L_i}{\partial F_{i,k}} = \left(r_{i,0} - p_{i,0}\right) - \left(r_{i,k} - p_{i,k}\right), \qquad \frac{\partial^2 L_i}{\partial F_{i,k}^2} = p_{i,0}\left(1 - p_{i,0}\right) + p_{i,k}\left(1 - p_{i,k}\right) + 2 p_{i,0} p_{i,k}. \qquad (5)$$

The base class must be identified at each boosting iteration during training. [11] suggested an exhaustive procedure to adaptively find the best base class to minimize the training loss (3) at each iteration.

[11] combined the idea of abc-boost with mart. The algorithm, named abc-mart, achieved good performance in multi-class classification on the datasets used in [11].
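As a quick numerical illustration (our own sketch, not code from [11]), the snippet below evaluates the unconstrained derivatives in Eq. (4) and the base-class derivatives in Eq. (5) for a single sample with hypothetical probabilities, treating class 0 as the base.

```python
import numpy as np

p = np.array([0.2, 0.5, 0.3])   # p_{i,0}, p_{i,1}, p_{i,2} (hypothetical)
r = np.array([0.0, 1.0, 0.0])   # one-hot labels for y_i = 1

# Eq. (4): derivatives assuming no relation among the F_{i,k}.
g_plain = -(r - p)
h_plain = p * (1.0 - p)

# Eq. (5): derivatives w.r.t. F_{i,k}, k != 0, with class 0 as the base.
k = np.arange(1, 3)
g_abc = (r[0] - p[0]) - (r[k] - p[k])
h_abc = p[0] * (1 - p[0]) + p[k] * (1 - p[k]) + 2 * p[0] * p[k]

print(g_plain, h_plain)
print(g_abc, h_abc)
```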

2.3 Robust LogitBoost

The mart paper[6] and a recent (2008) discussion paper [8] commented that logitboost (Alg. 1) can be numerically unstable. In fact, the logitboost paper[7] suggested some “crucial implementation protections” on page 17 of [7]:

  • In Line 5 of Alg. 1, compute the response $z_{i,k}$ by $\frac{1}{p_{i,k}}$ (if $r_{i,k} = 1$) or $\frac{-1}{1 - p_{i,k}}$ (if $r_{i,k} = 0$).

  • Bound the response $|z_{i,k}|$ by $z_{max}$. The particular value of $z_{max}$ is not sensitive as long as it falls in $[2, 4]$.

Note that the above operations are applied to each individual sample. The goal is to ensure that the response $|z_{i,k}|$ is never too large. On the other hand, one would hope to use a larger $z_{max}$ to better capture the data variation. Consequently, this thresholding operation occurs very frequently, and it is expected that part of the useful information is lost.

The next subsection explains that, if implemented carefully, logitboost is almost identical to mart. The only difference is the tree-splitting criterion.

2.4 Tree-Splitting Criterion Using Second-Order Information

Consider $N$ weights $w_i$ and $N$ response values $z_i$, $i = 1$ to $N$, which are assumed to be ordered according to the sorted order of the corresponding feature values. The tree-splitting procedure is to find the index $t$, $1 \le t < N$, such that the weighted mean square error (MSE) is reduced the most if the node is split at $t$. That is, we seek the $t$ to maximize

$$Gain(t) = \sum_{i=1}^{N} (z_i - \bar{z})^2 w_i - \left[\sum_{i=1}^{t} (z_i - \bar{z}_L)^2 w_i + \sum_{i=t+1}^{N} (z_i - \bar{z}_R)^2 w_i\right],$$

where $\bar{z} = \frac{\sum_{i=1}^{N} z_i w_i}{\sum_{i=1}^{N} w_i}$, $\bar{z}_L = \frac{\sum_{i=1}^{t} z_i w_i}{\sum_{i=1}^{t} w_i}$, and $\bar{z}_R = \frac{\sum_{i=t+1}^{N} z_i w_i}{\sum_{i=t+1}^{N} w_i}$. After simplification, one can obtain

$$Gain(t) = \frac{\left[\sum_{i=1}^{t} z_i w_i\right]^2}{\sum_{i=1}^{t} w_i} + \frac{\left[\sum_{i=t+1}^{N} z_i w_i\right]^2}{\sum_{i=t+1}^{N} w_i} - \frac{\left[\sum_{i=1}^{N} z_i w_i\right]^2}{\sum_{i=1}^{N} w_i}.$$

Plugging in $w_i = p_i(1 - p_i)$ and $z_i = \frac{r_i - p_i}{p_i(1 - p_i)}$ yields

$$Gain(t) = \frac{\left[\sum_{i=1}^{t} (r_i - p_i)\right]^2}{\sum_{i=1}^{t} p_i(1 - p_i)} + \frac{\left[\sum_{i=t+1}^{N} (r_i - p_i)\right]^2}{\sum_{i=t+1}^{N} p_i(1 - p_i)} - \frac{\left[\sum_{i=1}^{N} (r_i - p_i)\right]^2}{\sum_{i=1}^{N} p_i(1 - p_i)}.$$

Because the computations involve $\sum (r_i - p_i)$ as a group, this procedure is numerically stable.

In comparison, mart [6] only uses the first-order information to construct the trees, i.e.,

$$Gain_{mart}(t) = \frac{\left[\sum_{i=1}^{t} (r_i - p_i)\right]^2}{t} + \frac{\left[\sum_{i=t+1}^{N} (r_i - p_i)\right]^2}{N - t} - \frac{\left[\sum_{i=1}^{N} (r_i - p_i)\right]^2}{N}.$$
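For concreteness, here is a small numerical sketch (ours, with hypothetical probabilities and labels) that evaluates the second-order gain of robust logitboost and the first-order gain of mart at each candidate split of a toy node.

```python
import numpy as np

def gain_logitboost(res, w, t):
    """Second-order gain at split index t (1 <= t < N):
       [sum_L res]^2/sum_L w + [sum_R res]^2/sum_R w - [sum res]^2/sum w."""
    L, R, wL, wR = res[:t], res[t:], w[:t], w[t:]
    return L.sum()**2 / wL.sum() + R.sum()**2 / wR.sum() - res.sum()**2 / w.sum()

def gain_mart(res, t):
    """First-order gain: same form, but with plain counts as denominators."""
    N = len(res)
    return res[:t].sum()**2 / t + res[t:].sum()**2 / (N - t) - res.sum()**2 / N

# Hypothetical node, already sorted by the feature value:
p = np.array([0.9, 0.8, 0.6, 0.4, 0.3, 0.1])   # p_i
r = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])   # r_i
res, w = r - p, p * (1 - p)                    # residuals and weights

best_t = max(range(1, len(res)), key=lambda t: gain_logitboost(res, w, t))
print(best_t, gain_logitboost(res, w, best_t), gain_mart(res, best_t))
```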

1: $F_{i,k} = 0$, $p_{i,k} = \frac{1}{K}$, $k = 0$ to $K-1$, $i = 1$ to $N$
2: For $m = 1$ to $M$ Do
3:     For $k = 0$ to $K-1$ Do
4:       $\{R_{j,k,m}\}_{j=1}^{J}$ = $J$-terminal node regression tree from $\{r_{i,k} - p_{i,k}, \mathbf{x}_i\}_{i=1}^{N}$, with weights $p_{i,k}(1 - p_{i,k})$, as in Sec. 2.4.
5:      $\beta_{j,k,m} = \frac{K-1}{K} \frac{\sum_{\mathbf{x}_i \in R_{j,k,m}} (r_{i,k} - p_{i,k})}{\sum_{\mathbf{x}_i \in R_{j,k,m}} p_{i,k}(1 - p_{i,k})}$
6:      $F_{i,k} = F_{i,k} + \nu \sum_{j=1}^{J} \beta_{j,k,m} 1_{\mathbf{x}_i \in R_{j,k,m}}$
7:     End
8:   $p_{i,k} = \exp(F_{i,k}) / \sum_{s=0}^{K-1} \exp(F_{i,s})$
9: End

Algorithm 2 Robust logitboost, which is very similar to mart, except for Line 4.

Alg. 2 describes robust logitboost using the tree-splitting criterion in Sec. 2.4. Note that, after the trees are constructed, the values of the terminal nodes are computed by

$$\beta_{j,k,m} = \frac{K-1}{K} \frac{\sum_{\mathbf{x}_i \in R_{j,k,m}} (r_{i,k} - p_{i,k})}{\sum_{\mathbf{x}_i \in R_{j,k,m}} p_{i,k}(1 - p_{i,k})},$$

which explains Line 5 of Alg. 2.
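The terminal-node value is straightforward to compute; the following sketch (ours, with hypothetical node data) evaluates it for a single node and a single class k.

```python
import numpy as np

def terminal_value(r_node, p_node, K):
    """(K-1)/K * sum(r_i - p_i) / sum(p_i (1 - p_i)) over one terminal node."""
    num = np.sum(r_node - p_node)
    den = np.sum(p_node * (1.0 - p_node))
    return (K - 1.0) / K * num / den

# Hypothetical node with 4 samples, K = 3 classes, for one class k:
r_node = np.array([1.0, 1.0, 0.0, 1.0])   # r_{i,k}
p_node = np.array([0.6, 0.7, 0.2, 0.5])   # p_{i,k}
print(terminal_value(r_node, p_node, K=3))
```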

2.5 Adaptive Base Class Logitboost (ABC-LogitBoost)

The abc-boost [11] algorithm consists of two key components:

  1. Using the sum-to-zero constraint [7, 6, 19, 10, 18, 21, 20] on the loss function, one can formulate a boosting algorithm for only $K-1$ classes, by treating one class as the base class.

  2. At each boosting iteration, adaptively select the base class according to the training loss. [11] suggested an exhaustive search strategy.

[11] combined abc-boost with mart to develop abc-mart. More recently, [12] developed abc-logitboost, the combination of abc-boost with (robust) logitboost.

1: $F_{i,k} = 0$, $p_{i,k} = \frac{1}{K}$, $k = 0$ to $K-1$, $i = 1$ to $N$
2: For $m = 1$ to $M$ Do
3:     For $b = 0$ to $K-1$, Do
4:       For $k = 0$ to $K-1$, $k \neq b$, Do
5:          $\{R_{j,k,m}\}_{j=1}^{J}$ = $J$-terminal node regression tree from $\{-(r_{i,b} - p_{i,b}) + (r_{i,k} - p_{i,k}), \mathbf{x}_i\}_{i=1}^{N}$, with weights $p_{i,b}(1 - p_{i,b}) + p_{i,k}(1 - p_{i,k}) + 2 p_{i,b} p_{i,k}$, as in Sec. 2.4.
6:         $\beta_{j,k,m} = \frac{\sum_{\mathbf{x}_i \in R_{j,k,m}} \left[-(r_{i,b} - p_{i,b}) + (r_{i,k} - p_{i,k})\right]}{\sum_{\mathbf{x}_i \in R_{j,k,m}} \left[p_{i,b}(1 - p_{i,b}) + p_{i,k}(1 - p_{i,k}) + 2 p_{i,b} p_{i,k}\right]}$
7:         $G_{i,k,b} = F_{i,k} + \nu \sum_{j=1}^{J} \beta_{j,k,m} 1_{\mathbf{x}_i \in R_{j,k,m}}$
8:       End
9:      $G_{i,b,b} = -\sum_{k \neq b} G_{i,k,b}$
10:      $q_{i,k} = \exp(G_{i,k,b}) / \sum_{s=0}^{K-1} \exp(G_{i,s,b})$
11:      $L^{(b)} = -\sum_{i=1}^{N} \sum_{k=0}^{K-1} r_{i,k} \log q_{i,k}$
12:     End
13:    $B(m) = \mathrm{argmin}_b \, L^{(b)}$
14:    $F_{i,k} = G_{i,k,B(m)}$
15:   $p_{i,k} = \exp(F_{i,k}) / \sum_{s=0}^{K-1} \exp(F_{i,s})$
16: End

Algorithm 3 Abc-logitboost using the exhaustive search strategy for the base class, as suggested in [11]. The vector $B$ stores the base class numbers.

Alg. 3 presents abc-logitboost, using the derivatives in (5) and the same exhaustive search strategy as in abc-mart. Again, abc-logitboost differs from abc-mart only in the tree-splitting procedure (Line 5).
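To summarize how the exhaustive base-class search fits together, here is a compact sketch (our own simplification, not the authors' implementation) of one abc-logitboost iteration. It uses sklearn's DecisionTreeRegressor with sample weights as a stand-in for the J-terminal-node weighted regression tree of Sec. 2.4, and it omits bookkeeping such as storing the fitted trees for prediction on test data.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def softmax(F):
    F = F - F.max(axis=1, keepdims=True)
    e = np.exp(F)
    return e / e.sum(axis=1, keepdims=True)

def abc_logitboost_iteration(X, y, F, J=20, nu=0.1):
    """One boosting step in the spirit of Alg. 3: exhaustive search over the
    base class b, Newton-style updates for all k != b, sum-to-zero for b."""
    N, K = F.shape
    r = np.eye(K)[y]                      # r_{i,k} = 1 if y_i = k
    p = softmax(F)
    best_loss, best_G, best_b = np.inf, None, None
    for b in range(K):                    # exhaustive base-class search
        G = F.copy()
        for k in range(K):
            if k == b:
                continue
            # Eq. (5): negative first derivative and second derivative.
            g = -(r[:, b] - p[:, b]) + (r[:, k] - p[:, k])
            h = (p[:, b] * (1 - p[:, b]) + p[:, k] * (1 - p[:, k])
                 + 2.0 * p[:, b] * p[:, k])
            # Weighted regression tree; node values approximate sum(g)/sum(h).
            tree = DecisionTreeRegressor(max_leaf_nodes=J)
            tree.fit(X, g / h, sample_weight=h)
            G[:, k] = F[:, k] + nu * tree.predict(X)
        G[:, b] = -np.delete(G, b, axis=1).sum(axis=1)   # sum-to-zero constraint
        q = softmax(G)
        loss = -np.log(q[np.arange(N), y]).sum()         # training loss (3)
        if loss < best_loss:
            best_loss, best_G, best_b = loss, G, b
    return best_G, best_b, best_loss

# Tiny synthetic example: N = 200 samples, K = 3 classes, 2 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int) + (X[:, 0] > 1).astype(int)
F = np.zeros((200, 3))
for m in range(5):
    F, b, loss = abc_logitboost_iteration(X, y, F, J=6, nu=0.1)
    print(m, b, round(loss, 2))
```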

2.6 Main Parameters

Alg. 2 and Alg. 3 have three parameters ($J$, $\nu$, and $M$), to which the performance is in general not very sensitive, as long as they fall in a reasonable range. This is a significant advantage in practice.

The number of terminal nodes, $J$, determines the capacity of the base learner. [6] suggested $J = 6$. [7, 21] commented that $J > 10$ is unlikely to be necessary. In our experience, for large datasets (or moderate datasets in high dimensions), $J = 20$ is often a reasonable choice; also see [14] for more examples.

The shrinkage, $\nu$, should be large enough to make sufficient progress at each step and small enough to avoid over-fitting. [6] suggested $\nu \le 0.1$. Normally, $\nu = 0.1$ is used.

The number of boosting iterations, $M$, is largely determined by the affordable computing time. A commonly-regarded merit of boosting is that, on many datasets, over-fitting can be largely avoided for reasonable $J$, $\nu$, and $M$.

3 Datasets

Table 1 lists the datasets used in our study. [11, 12] provided experiments on several other (small) datasets.

dataset K # training # test # features
Covertype290k 7 290506 290506 54
Covertype145k 7 145253 290506 54
Poker525k 10 525010 500000 25
Poker275k 10 275010 500000 25
Poker150k 10 150010 500000 25
Poker100k 10 100010 500000 25
Poker25kT1 10 25010 500000 25
Poker25kT2 10 25010 500000 25
Mnist10k 10 10000 60000 784
M-Basic 10 12000 50000 784
M-Rotate 10 12000 50000 784
M-Image 10 12000 50000 784
M-Rand 10 12000 50000 784
M-RotImg 10 12000 50000 784
M-Noise1 10 10000 2000 784
M-Noise2 10 10000 2000 784
M-Noise3 10 10000 2000 784
M-Noise4 10 10000 2000 784
M-Noise5 10 10000 2000 784
M-Noise6 10 10000 2000 784
Letter15k 26 15000 5000 16
Letter4k 26 4000 16000 16
Letter2k 26 2000 18000 16
Table 1: Datasets

3.1 Covertype

The original UCI Covertype dataset is fairly large, with 581012 samples. To generate Covertype290k, we randomly split the original data into halves, one half for training and the other half for testing. For Covertype145k, we randomly select one half of the training set of Covertype290k and keep the same test set.

3.2 Poker

The UCI Poker dataset originally used only 25010 samples for training and 1000000 samples for testing. Since the test set is very large, we randomly divide it equally into two parts (I and II). Poker25kT1 uses the original training set for training and Part I of the original test set for testing; Poker25kT2 uses the original training set for training and Part II for testing. This way, Poker25kT1 can use the test set of Poker25kT2 for validation, and Poker25kT2 can use the test set of Poker25kT1 for validation. As the two test sets are still very large, this treatment provides reliable results.

Since the original training set (about 25k samples) is too small compared to the size of the test set, we enlarge the training set to form Poker525k, Poker275k, Poker150k, and Poker100k. All four enlarged training sets use the same test set as Poker25kT2 (i.e., Part II of the original test set). The training set of Poker525k contains the original (25k) training set plus all of Part I of the original test set. Similarly, the training set of Poker275k / Poker150k / Poker100k contains the original training set plus 250k / 125k / 75k samples from Part I of the original test set.

The original Poker dataset provides 10 features, 5 "suit" features and 5 "rank" features. While the "ranks" are naturally ordinal, it appears reasonable to treat "suits" as nominal features. By private communications, R. Cattral, the donor of the Poker data, suggested that we treat the "suits" as nominal. C.J. Lin also kindly told us that the performance of SVM was not affected by whether the "suits" are treated as nominal or ordinal. In our experiments, we choose to treat the "suits" as nominal features; hence the total number of features becomes 25 after expanding each "suit" feature into 4 binary features.
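As an illustration of this encoding (our own sketch; the exact column layout is an assumption, not taken from the paper), the snippet below expands the 5 nominal suit features, each coded 1 to 4 in the UCI data, into binary indicators and appends the 5 ordinal rank features, giving 5*4 + 5 = 25 features.

```python
import numpy as np

def encode_poker(X_suits, X_ranks):
    """X_suits: (N, 5) integers in {1,...,4}; X_ranks: (N, 5) ordinal ranks."""
    N = X_suits.shape[0]
    onehot = np.zeros((N, 5 * 4))
    for j in range(5):                    # one block of 4 indicators per suit
        for s in range(4):
            onehot[:, 4 * j + s] = (X_suits[:, j] == s + 1).astype(float)
    return np.hstack([onehot, X_ranks])   # shape (N, 25)

X_suits = np.array([[1, 2, 3, 4, 1]])
X_ranks = np.array([[10, 11, 12, 13, 1]])
print(encode_poker(X_suits, X_ranks).shape)   # (1, 25)
```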

3.3 Mnist

While the original Mnist dataset is extremely popular, this dataset is known to be too easy[9]. Originally, Mnist used 60000 samples for training and 10000 samples for testing.

Mnist10k uses the original (10000) test set for training and the original (60000) training set for testing. This creates a more challenging task.

3.4 Mnist with Many Variations

[9] (www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/DeepVsShallowComparisonICML2007) created a variety of much more difficult datasets by adding various types of background (correlated) noise, background images, rotations, etc., to the original Mnist dataset. We shorten the names of the generated datasets to M-Basic, M-Rotate, M-Image, M-Rand, M-RotImg, and M-Noise1 to M-Noise6.

By private communications with D. Erhan, one of the authors of [9], we learned that the sizes of the training sets actually vary depending on the learning algorithm. For some methods such as SVM, the authors retrained the algorithms using all 12000 training samples after choosing the best parameters; for other methods, they used 10000 samples for training. In our experiments, we use 12000 training samples for M-Basic, M-Rotate, M-Image, M-Rand, and M-RotImg; and we use 10000 training samples for M-Noise1 to M-Noise6.

Note that the datasets M-Noise1 to M-Noise6 have merely 2000 test samples each. By private communications with D. Erhan, we understand this was because [9] did not mean to compare the statistical significance of the test errors for those six datasets.

3.5 Letter

The UCI Letter dataset has 20000 samples in total. In our experiments, Letter4k (Letter2k) uses the last 4000 (2000) samples for training and the rest for testing. The purpose is to demonstrate the performance of the algorithms when only small training sets are available.

We also include Letter15k, which is one of the standard partitions of the Letter dataset, by using 15000 samples for training and 5000 samples for testing.

4 Summary of Experiment Results

We simply use logitboost (or logit in the plots) to denote robust logitboost.

Table 2 summarizes the test mis-classification errors. For all datasets except Poker25kT1 and Poker25kT2, we report the test errors with tree size $J = 20$ and shrinkage $\nu = 0.1$. For Poker25kT1 and Poker25kT2, we use a different choice of $J$ and $\nu$ (see Sec. 5.2). We report more detailed experiment results in Sec. 5.

For Covertype290k, Poker525k, Poker275k, Poker150k, and Poker100k, which are fairly large, we only train a reduced number of boosting iterations. For all other datasets, we train the full number of iterations or terminate when the training loss (3) is close to the machine accuracy. Since we do not notice obvious over-fitting on those datasets, we simply report the test errors at the last iteration.

Dataset mart abc-mart logitboost abc-logitboost # test
Covertype290k 11350 10454 10765 9727 290506
Covertype145k 15767 14665 14928 13986 290506
Poker525k 7061 2424 2704 1736 500000
Poker275k 15404 3679 6533 2727 500000
Poker150k 22289 12340 16163 5104 500000
Poker100k 27871 21293 25715 13707 500000
Poker25kT1 43575 34879 46789 37345 500000
Poker25kT2 42935 34326 46600 36731 500000
Mnist10k 2815 2440 2381 2102 60000
M-Basic 2058 1843 1723 1602 50000
M-Rotate 7674 6634 6813 5959 50000
M-Image 5821 4727 4703 4268 50000
M-Rand 6577 5300 5020 4725 50000
M-RotImg 24912 23072 22962 22343 50000
M-Noise1 305 245 267 234 2000
M-Noise2 325 262 270 237 2000
M-Noise3 310 264 277 238 2000
M-Noise4 308 243 256 238 2000
M-Noise5 294 244 242 227 2000
M-Noise6 279 224 226 201 2000
Letter15k 155 125 139 109 5000
Letter4k 1370 1149 1252 1055 16000
Letter2k 2482 2220 2309 2034 18000
Table 2: Summary of test mis-classification errors.

4.1 P-Values

Table 3 summarizes the following four types of P-values:

  • P1: for testing if abc-mart has significantly lower error rates than mart.

  • P2: for testing if (robust) logitboost has significantly lower error rates than mart.

  • P3: for testing if abc-logitboost has significantly lower error rates than abc-mart.

  • P4: for testing if abc-logitboost has significantly lower error rates than (robust) logitboost.

The P-values are computed using binomial distributions and their normal approximations. Recall that if a random variable $z \sim \mathrm{Binomial}(N, p)$, then the probability parameter $p$ can be estimated by $\hat{p} = z / N$, and the variance of $\hat{p}$ can be estimated by $\hat{p}(1 - \hat{p}) / N$. The P-values can then be computed using the normal approximation of the binomial distribution.
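The following sketch (our reading of the procedure, not the authors' script) computes such a one-sided P-value from two error counts and the test-set size, using the Mnist10k entries for mart and abc-mart from Table 2 as an example.

```python
# One-sided P-value for "method 2 has a lower error rate than method 1",
# based on the normal approximation to the binomial distribution.
from math import sqrt
from scipy.stats import norm

def one_sided_p_value(err1, err2, n_test):
    p1, p2 = err1 / n_test, err2 / n_test           # estimated error rates
    var = (p1 * (1 - p1) + p2 * (1 - p2)) / n_test  # variance of the difference
    z = (p1 - p2) / sqrt(var)
    return norm.sf(z)                               # upper-tail probability

# Mnist10k, Table 2: mart makes 2815 errors, abc-mart 2440, on 60000 test samples.
print(one_sided_p_value(2815, 2440, 60000))
```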

Note that the test sets for M-Noise1 to M-Noise6 are very small because [9] originally did not intend to compare statistical significance on those six datasets. We compute their P-values anyway.

Dataset P1 P2 P3 P4
Covertype290k
Covertype145k
Poker525k 0 0 0 0
Poker275k 0 0 0 0
Poker150k 0 0 0 0
Poker100k 0 0 0 0
Poker25kT1 0 —- —- 0
Poker25kT2 0 —- —- 0
Mnist10k
M-Basic 0.0164
M-Rotate
M-Image
M-Rand
M-RotImg
M-Noise1 0.0574
M-Noise2 0.0024 0.0072 0.1158 0.0583
M-Noise3 0.0190 0.0701 0.1073 0.0327
M-Noise4 0.0014 0.0090 0.4040 0.1935
M-Noise5 0.0102 0.0079 0.2021 0.2305
M-Noise6 0.0043 0.0058 0.1189 0.1002
Letter15k 0.0345 0.1718 0.1449 0.0268
Letter4k 0.019
Letter2k 0.001

Table 3: Summary of test P-values.

The results demonstrate that abc-logitboost and abc-mart considerably outperform logitboost and mart, respectively. In addition, except for Poker25kT1 and Poker25kT2, we observe that abc-logitboost outperforms abc-mart, and logitboost outperforms mart.

4.2 Comparisons with SVM and Deep Learning

For UCI Poker, we know (by private communications with C.J. Lin) that SVM could only achieve a considerably higher error rate. In comparison, all four algorithms, mart, abc-mart, (robust) logitboost, and abc-logitboost, achieve much smaller error rates on Poker25kT1 and Poker25kT2.

Figure 1 provides the comparisons on the six (correlated) noise datasets: M-Noise1 to M-Noise6. Table 4 compares the error rates on M-Basic, M-Rotate, M-Image, M-Rand, and M-RotImg.

Figure 1: Six datasets: M-Noise1 to M-Noise6. Left panel: Error rates of SVM and deep learning [9]. Middle and right panels: Error rates of the four boosting algorithms. X-axis: degree of correlation from high to low; the values 1 to 6 correspond to the datasets M-Noise1 to M-Noise6.
M-Basic M-Rotate M-Image M-Rand M-RotImg
SVM-RBF
SVM-POLY
NNET
DBN-3
SAA-3
DBN-1
mart
abc-mart
logitboost
abc-logitboost 8.54%
Table 4: Summary of error rates of various algorithms on the modified Mnist dataset[9].

4.3 Performance vs. Boosting Iterations

Figure 2 presents the training loss, i.e., Eq. (3), on Covertype290k and Poker525k, for all boosting iterations. Figures 3 and 4 provide the test mis-classification errors on Covertype, Poker, Mnist10k, and Letter.


Figure 2: Training loss, Eq. (3), on Covertype290k and Poker525k.


Figure 3: Test mis-classification errors on Mnist10k, Letter15k, Letter4k, and Letter2k.


Figure 4: Test mis-classification errors on Covertype and Poker.

5 More Detailed Experiment Results

Ideally, we would like to demonstrate that, for any reasonable choice of the parameters $J$ and $\nu$, abc-mart and abc-logitboost will always improve mart and logitboost, respectively. This is indeed the case on the datasets with which we have experimented. In this section, we provide detailed experiment results on Mnist10k, Poker25kT1, Poker25kT2, Letter4k, and Letter2k.

5.1 Detailed Experiment Results on Mnist10k

For this dataset, we experiment with every combination of the tree size $J \in \{4, 6, 8, 10, 12, 14, 16, 18, 20, 24, 30, 40, 50\}$ and four values of the shrinkage $\nu$. We train the four boosting algorithms until the training loss (3) is close to the machine accuracy, to exhaust the capacity of the learner so that we can provide a reliable comparison, up to a fixed maximum number of iterations.

Table 5 presents the test mis-classification errors and Table 6 presents the -values. Figures 5, 6, and 7 provide the test mis-classification errors for all boosting iterations.



mart abc-mart

3356 3060 3329 3019 3318 2855 3326 2794
3185 2760 3093 2626 3129 2656 3217 2590
3049 2558 3054 2555 3054 2534 3035 2577
3020 2547 2973 2521 2990 2520 2978 2506
2927 2498 2917 2457 2945 2488 2907 2490
2925 2487 2901 2471 2877 2470 2884 2454
2899 2478 2893 2452 2873 2465 2860 2451
2857 2469 2880 2460 2870 2437 2855 2454
2833 2441 2834 2448 2834 2444 2815 2440
2840 2447 2827 2431 2801 2427 2784 2455
2826 2457 2822 2443 2828 2470 2807 2450
2837 2482 2809 2440 2836 2447 2782 2506
2813 2502 2826 2459 2824 2469 2786 2499
logitboost abc-logit

2936 2630 2970 2600 2980 2535 3017 2522
2710 2263 2693 2252 2710 2226 2711 2223
2599 2159 2619 2138 2589 2120 2597 2143
2553 2122 2527 2118 2516 2091 2500 2097
2472 2084 2468 2090 2468 2090 2464 2095
2451 2083 2420 2094 2432 2063 2419 2050
2424 2111 2437 2114 2393 2097 2395 2082
2399 2088 2402 2087 2389 2088 2380 2097
2388 2128 2414 2112 2411 2095 2381 2102
2442 2174 2415 2147 2417 2129 2419 2138
2468 2235 2434 2237 2423 2221 2449 2177
2551 2310 2509 2284 2518 2257 2531 2260
2612 2353 2622 2359 2579 2332 2570 2341
Table 5: Mnist10k. Upper table: The test mis-classification errors of mart and abc-mart (bold numbers). Bottom table: The test mis-classification errors of logitboost and abc-logitboost (bold numbers)
P1
0
P2
P3
P4
Table 6: Mnist10k: -values. See Sec. 4.1 for the definitions of P1, P2, P3, and P4.

Figure 5: Mnist10k. Test mis-classification errors of the four algorithms, for $J = 4$, 6, 8, 10.

Figure 6: Mnist10k. Test mis-classification errors of the four algorithms, for $J = 12$, 14, 16, 18.

Figure 7: Mnist10k. Test mis-classification errors of the four algorithms, for $J = 20$, 24, 30, 40, 50.

The experiment results illustrate that the performances of all four algorithms are stable over a wide range of tree sizes $J$. The shrinkage parameter $\nu$ does not affect the test performance much, although smaller values of $\nu$ result in more boosting iterations (before the training losses reach the machine accuracy).

We further randomly divide the test set of Mnist10k (60000 test samples) equally into two parts (I and II) and then test the algorithms on Part I, using the same training results. We name this "new" dataset Mnist10kT1. The purpose of this experiment is to further demonstrate the stability of the algorithms.

Table 7 presents the test mis-classification errors of Mnist10kT1. Compared to Table 5, the mis-classification errors on Mnist10kT1 are roughly one half of the mis-classification errors on Mnist10k, for all $J$ and $\nu$. This helps establish that our experiment results on Mnist10k provide a very reliable comparison.




mart abc-mart

1682 1514 1668 1505 1666 1416 1663 1380
1573 1382 1523 1320 1533 1329 1582 1288
1501 1263 1515 1257 1523 1250 1491 1279
1492 1270 1457 1248 1470 1239 1459 1236
1432 1244 1427 1234 1444 1228 1436 1227
1424 1237 1420 1231 1407 1223 1419 1212
1430 1226 1426 1224 1411 1223 1418 1204
1400 1222 1413 1218 1390 1210 1404 1211
1398 1213 1381 1205 1388 1213 1382 1198
1402 1221 1366 1201 1372 1199 1346 1205
1384 1211 1374 1208 1368 1224 1366 1205
1397 1244 1375 1220 1397 1222 1365 1246
1371 1239 1380 1221 1382 1223 1362 1242
logitboost abc-logit

1419 1299 1449 1281 1446 1251 1460 1244
1313 1111 1313 1114 1326 1101 1317 1097
1278 1058 1287 1050 1270 1036 1262 1058
1252 1061 1244 1057 1237 1040 1229 1041
1224 1020 1219 1049 1217 1053 1224 1047
1213 1038 1207 1050 1201 1039 1198 1026
1185 1050 1205 1058 1189 1044 1178 1041
1186 1048 1184 1038 1184 1046 1167 1056
1185 1077 1199 1063 1183 1042 1184 1045
1208 1095 1196 1083 1191 1064 1194 1068
1225 1113 1201 1117 1190 1113 1211 1087
1254 1159 1247 1145 1248 1127 1249 1127
1292 1177 1284 1174 1275 1161 1276 1176
Table 7: Mnist10kT1. Upper table: The test mis-classification errors of mart and abc-mart (bold numbers). Bottom table: The test mis-classification errors of logitboost and abc-logitboost (bold numbers). Mnist10kT1 only uses a half of the test data of Mnist10k.

5.2 Detailed Experiment Results on Poker25kT1 and Poker25kT2

Recall the original UCI Poker dataset used 25010 samples for training and 1000000 samples for testing. To provide a reliable comparison (and validation), we form two datasets Poker25kT1 and Poker25kT2 by equally dividing the original test set into two parts (I and II). Both use the same training set. Poker25kT1 uses Part I of the original test set for testing and Poker25kT2 uses Part II for testing.

Table 8 and Table 9 present the test mis-classification errors for various combinations of $J$ and $\nu$. Comparing these two tables, we can see that the corresponding entries are very close to each other, which again verifies that the four boosting algorithms provide reliable results on this dataset.

For most combinations of $J$ and $\nu$, all four algorithms achieve low error rates. For both Poker25kT1 and Poker25kT2, the lowest test errors are attained at the same choice of $J$ and $\nu$. Unlike Mnist10k, the test errors, especially those of mart and logitboost, are slightly sensitive to the parameters.

Note that when $J$ and $\nu$ are both small, the number of training iterations we used is not sufficient on this dataset.

mart abc-mart


145880 90323 132526 67417 124283 49403 113985 42126
71628 38017 59046 36839 48064 35467 43573 34879
64090 39220 53400 37112 47360 36407 44131 35777
60456 39661 52464 38547 47203 36990 46351 36647