# Fast Meta-Learning for Adaptive Hierarchical Classifier Design

We propose a new splitting criterion for a meta-learning approach to multiclass classifier design that adaptively merges the classes into a tree-structured hierarchy of increasingly difficult binary classification problems. The classification tree is constructed from empirical estimates of the Henze-Penrose bounds on the pairwise Bayes misclassification rates that rank the binary subproblems in terms of difficulty of classification. The proposed empirical estimates of the Bayes error rate are computed from the minimal spanning tree (MST) of the samples from each pair of classes. Moreover, a meta-learning technique is presented for quantifying the one-vs-rest Bayes error rate for each individual class from a single MST on the entire dataset. Extensive simulations on benchmark datasets show that the proposed hierarchical method can often be learned much faster than competing methods, while achieving competitive accuracy.

## Code Repositories

### SmartSVM

Python package for "Fast Meta-Learning and Adaptive Hierarchical Classifier Design" by G.J.J. van den Burg and A.O. Hero

## 1 Introduction

The Bayes error rate (BER) is a central concept in the statistical theory of classification. It represents the error rate of the Bayes classifier, which assigns a label to an object corresponding to the class with the highest posterior probability. By definition, the Bayes error represents the smallest possible average error rate that can be achieved by any decision rule (Wald, 1947). Because of these properties, the BER is of great interest both for benchmarking classification algorithms and for the practical design of classification algorithms. For example, an accurate approximation of the BER can be used for classifier parameter selection, data dimensionality reduction, or variable selection. However, accurate BER approximation is difficult, especially in high dimension, and thus much attention has focused on tight and tractable BER bounds. This paper proposes a model-free approach to designing multiclass classifiers using a bias-corrected BER bound estimated directly from the multiclass data.

There exist several useful bounds on the BER that are functions of the class-dependent feature distributions. These include information theoretic divergence measures such as the Chernoff $\alpha$-divergence (Chernoff, 1952), the Bhattacharyya divergence (Kailath, 1967), or the Jensen-Shannon divergence (Lin, 1991). Alternatively, arbitrarily tight bounds on performance can be constructed using sinusoidal or hyperbolic approximations (Hashlamoun et al., 1994, Avi-Itzhak and Diep, 1996). These bounds are functions of the unknown class-dependent feature distributions.

Recently, Berisha et al. (2016) introduced a divergence measure belonging to the family of $f$-divergences which tightly bounds the Bayes error rate in the binary classification problem. The bounds on the BER obtained with this measure are tighter than bounds derived from the Bhattacharyya or Chernoff bounds. Moreover, this divergence measure can be estimated nonparametrically from the data without resorting to density estimates of the distribution functions. Inspired by the Friedman-Rafsky multivariate runs test (Friedman and Rafsky, 1979), estimation is based on computing the Euclidean minimal spanning tree (MST) of the data, which can be done in approximately $O(n \log n)$ time. In this paper we propose improvements to this estimator for problems with unequal class priors and apply the improved estimator to the adaptive design of a hierarchical multiclass classifier. Furthermore, a fast method is proposed for bounding the Bayes error rate of individual classes which only requires computing a single minimal spanning tree over the entire set of samples. Thus our proposed method is faster than competing methods that use density plug-in estimation of divergence or observed misclassification rates of algorithms, such as SVM or logistic regression, which involve expensive parameter tuning.

Quantifying the complexity of a classification problem has been of significant interest (Ho and Basu, 2002) and it is clear that a fast and accurate estimate of this complexity has many practical applications. For instance, an accurate complexity estimator allows the researcher to assess a priori whether a given classification problem is difficult to classify or not. In a multiclass problem, a pair of classes which are difficult to disambiguate could potentially be merged or could be designated for additional data collection. Moreover, an accurate estimate of the BER could be used for variable selection, an application that was explored previously in Berisha et al. (2016). In Section 3 further applications of the BER estimates to multiclass classification are presented and evaluated.

There are many methods available for the design of multiclass classification algorithms, including: logistic regression (Cox, 1958); support vector machines (Cortes and Vapnik, 1995); and neural networks (McCulloch and Pitts, 1943). It is often the case that classifier performance will be better for some classes than for others, for instance due to sample imbalance in the training set. Often classifier designs apply weights to the different classes in order to reduce the effect of such imbalances on average classifier accuracy (Lu et al., 1998, Qiao and Liu, 2009). We take a different and more general approach that incorporates an empirical determination of the relative difficulties of classifying between different classes. Accurate empirical estimates of the BER are used for this purpose. A multiclass classifier is presented in Section 4 that uses MST-based BER estimates to create a hierarchy of binary subproblems that increase in difficulty as the algorithm progresses. This way, the classifier initially works on easily decidable subproblems before moving on to more difficult multiclass classification problems.

The paper is organized as follows. The theory of the nonparametric Bayes error estimator of Berisha et al. (2016) will be reviewed in Section 2. We will introduce a bias correction for this estimator, motivate the use of the estimator for multiclass classification, and discuss computational complexity. Section 3 will introduce applications of the estimator to meta-learning in multiclass classification. A novel hierarchical classification method will be introduced and evaluated in Section 4. Section 5 provides concluding remarks.

## 2 An Improved BER Estimator

Here the motivation and theory for the estimator of the Bayes error rate is reviewed, as introduced by Berisha et al. (2016). An improvement on this estimator is proposed for the case where class prior probabilities are unequal. Next, the application of the estimator in multiclass classification problems is considered. Finally, computational considerations and robustness analyses are presented.

### 2.1 Estimating the Bayes Error Rate

Consider the binary classification problem on a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{1, 2\}$. Denote the multivariate density functions for the two classes by $f_1(x)$ and $f_2(x)$ and the prior probabilities by $p_1$ and $p_2 = 1 - p_1$, respectively. The Bayes error rate of this binary classification problem can be expressed as (Fukunaga, 1990)

$$P_e(f_1, f_2) = \int \min\{p_1 f_1(x),\, p_2 f_2(x)\}\,dx. \tag{1}$$

Recently, Berisha et al. (2016) derived a tight bound on the BER which can be estimated directly from the data without a parametric model for the density or density estimation. This bound is based on a divergence measure introduced by Berisha and Hero (2015), defined as

$$D_{HP}(f_1, f_2) = \frac{1}{4 p_1 p_2}\left[\int \frac{\left(p_1 f_1(x) - p_2 f_2(x)\right)^2}{p_1 f_1(x) + p_2 f_2(x)}\,dx - (p_1 - p_2)^2\right], \tag{2}$$

and called the Henze-Penrose divergence, as it is motivated by an affinity measure defined by Henze and Penrose (1999). In Berisha and Hero (2015) it was shown that (2) is a proper $f$-divergence as defined by Csiszár (1975).

Estimation of the Henze-Penrose (HP) divergence is based on the multivariate runs test proposed by Friedman and Rafsky (1979) and convergence of this test was studied by Henze and Penrose (1999). Let $X_k = \{x_i : y_i = k\}$ with $k \in \{1, 2\}$ denote multidimensional features from two classes. Define the class sample sizes $n_k = |X_k|$, $k \in \{1, 2\}$. Let the combined sample be denoted by $X = X_1 \cup X_2$, where $n = n_1 + n_2$ is the total number of samples from both classes. Define the complete graph $G$ over $X$ as the graph connecting all $n$ nodes with edge weights equal to Euclidean distances. The Euclidean minimal spanning tree that spans $X$, denoted by $T$, is defined as the subgraph of $G$ that is connected and whose sum of edge weights is the smallest possible. The Friedman-Rafsky test statistic equals the number of edges in $T$ that connect an observation from class 1 to an observation from class 2 and is denoted by $R_{1,2} = R(X_1, X_2)$.

Building on the work by Henze and Penrose (1999), Berisha et al. (2016) show that if $n_1 \to \infty$ and $n_2 \to \infty$ in a linked manner such that $n_1 / (n_1 + n_2) \to p_1$, then

$$1 - \frac{n R_{1,2}}{2 n_1 n_2} \to D_{HP}(f_1, f_2) \quad \text{a.s.} \tag{3}$$

Thus, the estimated divergence decreases linearly in the number of cross connections between the classes in the Euclidean MST: the more cross connections, the smaller the divergence between the respective probability density functions of these classes.

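Finally, the HP-divergence can be used to bound the Bayes error rate, $P_e(f_1, f_2)$, following a theorem of Berisha et al. (2016):

$$\tfrac{1}{2} - \tfrac{1}{2}\sqrt{u_{HP}(f_1, f_2)} \le P_e(f_1, f_2) \le \tfrac{1}{2} - \tfrac{1}{2} u_{HP}(f_1, f_2), \tag{4}$$

where

$$u_{HP}(f_1, f_2) = 4 p_1 p_2 D_{HP}(f_1, f_2) + (p_1 - p_2)^2. \tag{5}$$

Averaging these bounds yields an estimate of the BER given by

$$\hat{P}_e(f_1, f_2) = \tfrac{1}{2} - \tfrac{1}{4}\sqrt{u_{HP}(f_1, f_2)} - \tfrac{1}{4} u_{HP}(f_1, f_2). \tag{6}$$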

In the following, this estimator will be referred to as the HP-estimator of the BER.
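The chain from MST to BER estimate in equations (3), (5) and (6) can be illustrated with a minimal sketch. It assumes a plain $O(n^2)$ Prim's algorithm in place of a fast Euclidean MST routine, estimates the priors by the class proportions, and clips the divergence estimate at zero so that the square root is always defined; the function names are hypothetical and are not taken from the smartsvm package.

```python
import math


def euclidean_mst(points):
    """Return the n-1 edges (i, j) of a Euclidean MST using Prim's algorithm.

    Runs in O(n^2) time, which is fine for a small illustration.
    """
    n = len(points)
    in_tree = [False] * n
    best_dist = [math.inf] * n   # cheapest known connection to the tree
    best_from = [-1] * n         # tree vertex realising that connection
    best_dist[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the closest vertex that is not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]),
                key=best_dist.__getitem__)
        in_tree[u] = True
        if best_from[u] >= 0:
            edges.append((best_from[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v], best_from[v] = d, u
    return edges


def friedman_rafsky(edges, labels):
    """Count MST edges whose endpoints carry different labels (R_{1,2})."""
    return sum(1 for i, j in edges if labels[i] != labels[j])


def hp_ber_estimate(r12, n1, n2):
    """HP-estimate of the BER, following equations (3), (5) and (6)."""
    n = n1 + n2
    p1, p2 = n1 / n, n2 / n                      # priors from class sizes
    d_hp = max(0.0, 1.0 - n * r12 / (2.0 * n1 * n2))  # clipping: assumption
    u_hp = 4.0 * p1 * p2 * d_hp + (p1 - p2) ** 2
    return 0.5 - 0.25 * math.sqrt(u_hp) - 0.25 * u_hp


# Toy data: two well-separated groups on a line.
points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,)]
labels = [1, 1, 1, 2, 2]
edges = euclidean_mst(points)
r12 = friedman_rafsky(edges, labels)
print(r12)  # 1: only the edge between 2.0 and 10.0 crosses the classes
print(hp_ber_estimate(r12, n1=3, n2=2))
```

With no cross-class edges in a balanced problem the estimate is 0, and it reaches 0.5 once the divergence estimate drops to zero.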

### 2.2 A modified HP-estimator for unequal class priors

To illustrate the performance of the HP-estimator, and to motivate the proposed modification, consider a binary classification problem where the samples are drawn from two independent bivariate Gaussian distributions with equal covariance matrices. For this example the BER and associated bounds can be computed exactly (Fukunaga, 1990). In Figure 1(a) we compare the BER, the HP-estimator of the BER (6), and the popular Bhattacharyya bound on the BER (Bhattacharyya, 1946, Kailath, 1967). Figure 1(a) shows that the HP-estimator is closer to the true BER than the Bhattacharyya bound. This result was illustrated by Berisha et al. (2016) for the case where $p_1 = p_2$ and is confirmed here for the case where $p_1 \neq p_2$.

However, Figure 1(a) also shows that a significant bias occurs in the HP-estimate of the BER when the distance between classes is small. Considering that $P_e(f_1, f_2) = \min\{p_1, p_2\}$ if $f_1 = f_2$, the solution of the equation $\hat{P}_e(f_1, f_2) = \min\{\hat{p}_1, \hat{p}_2\}$ for $R_{1,2}$ suggests a bias-corrected version of the HP-estimate. This correction can readily be derived by using the fact that $1 - u_{HP}(f_1, f_2) = 2 R_{1,2} / n$, which holds under the same conditions as for (3). The corrected statistic is

$$R'_{1,2} = \min\{\gamma, R_{1,2}\}, \tag{7}$$

with

$$\gamma = 2 n \min\{\hat{p}_1, \hat{p}_2\} - \tfrac{3}{4} n + \tfrac{1}{4} n \sqrt{9 - 16 \min\{\hat{p}_1, \hat{p}_2\}}, \tag{8}$$

where $\hat{p}_1$ and $\hat{p}_2$ are estimates of the true prior probabilities. Figure 1(b) shows the effect of this bias correction on the accuracy of the HP-estimator. As can be seen, the bias correction significantly improves the accuracy of the HP-estimator when the class distributions are not well separated.
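As a numerical check of the correction in (7) and (8), the sketch below evaluates the HP-estimate (6) at a cross count equal to $\gamma$ and confirms that it equals the smaller prior. It assumes the priors are estimated by the class proportions, in which case $u_{HP} = 1 - 2R_{1,2}/n$ holds for the plug-in estimate; the function names are illustrative only.

```python
import math


def hp_ber_from_r(r12, n):
    """HP-estimate (6) written in terms of R_{1,2}, using u_HP = 1 - 2R/n."""
    u_hp = max(0.0, 1.0 - 2.0 * r12 / n)
    return 0.5 - 0.25 * math.sqrt(u_hp) - 0.25 * u_hp


def gamma_threshold(n, p1_hat, p2_hat):
    """Threshold gamma from equation (8)."""
    m = min(p1_hat, p2_hat)
    return 2.0 * n * m - 0.75 * n + 0.25 * n * math.sqrt(9.0 - 16.0 * m)


def corrected_cross_count(r12, n, p1_hat, p2_hat):
    """Bias-corrected Friedman-Rafsky statistic R'_{1,2} from equation (7)."""
    return min(gamma_threshold(n, p1_hat, p2_hat), r12)


n, p1_hat, p2_hat = 100, 0.3, 0.7
gamma = gamma_threshold(n, p1_hat, p2_hat)

# At R = gamma the estimate equals the smaller prior, the largest possible BER.
print(hp_ber_from_r(gamma, n))

# A raw count above gamma is capped, so the estimate cannot exceed min prior.
capped = corrected_cross_count(45, n, p1_hat, p2_hat)
print(capped == gamma)  # True
```

Cross counts below $\gamma$ pass through unchanged, so the correction only affects poorly separated classes.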

### 2.3 Multiclass classification

Here we apply the HP-estimate to multiclass classification problems by extending the bias corrected HP-estimator to a multiclass Bayes error rate. The original multiclass HP-estimator has been defined by Wisler et al. (2016) and we show how the framework can be applied to hierarchical multiclassifier design.

Consider a multiclass problem with $K$ classes with $x_i \in \mathbb{R}^d$ and $y_i \in \{1, \ldots, K\}$, with prior probabilities $p_k$ and density functions $f_k(x)$ for $k = 1, \ldots, K$ such that $\sum_{k} p_k = 1$. Then, the BER can be estimated for each pair of classes with the bias-corrected HP-estimator using (7). The binary classification problem with the largest BER estimate is defined as most difficult.

Recall that the largest BER that can be achieved in a binary classification problem with unequal class priors is equal to the value of the smallest prior probability. This makes it difficult to compare empirical estimates of the BER when class sizes are imbalanced. To correct for this, the HP-estimator for pairwise classification BERs can be normalized for class sizes using

$$\hat{P}'_e(f_k, f_l) = \frac{\hat{P}_e(f_k, f_l)}{\min\{\hat{p}_k, \hat{p}_l\}}. \tag{9}$$

This normalization places the HP-estimate in the interval $[0, 1]$ and makes it possible to more accurately compare the BER estimates of different binary problems.
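A short worked example shows why the normalization matters. The sketch assumes the priors within each binary problem are estimated from the sizes of the two classes involved; the numbers are made up for illustration.

```python
def normalized_ber(ber_hat, n_k, n_l):
    """Normalised pairwise BER estimate from equation (9).

    The priors of the binary problem are taken to be the pairwise class
    proportions (an assumption of this sketch).
    """
    total = n_k + n_l
    p_k, p_l = n_k / total, n_l / total
    return ber_hat / min(p_k, p_l)


# A balanced pair with a raw estimate of 0.10 ...
balanced = normalized_ber(0.10, 500, 500)
# ... versus an imbalanced pair with a smaller raw estimate of 0.08.
imbalanced = normalized_ber(0.08, 100, 900)

# After normalisation the imbalanced pair is seen to be the harder problem.
print(balanced, imbalanced)
```

The raw estimates would rank the balanced pair as harder, whereas the normalized values are roughly 0.2 and 0.8, reversing the ranking.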

In practice it can also be of interest to understand how difficult it is to discriminate each individual class. By reducing the multiclass problem to a One-vs-Rest classification problem, it is straightforward to define a confusion rate for a given class $k$. This represents the fraction of instances that are erroneously assigned to class $k$ and the fraction of instances which are truly from class $k$ that are assigned to a different class. Formally, define the confusion rate for class $k$ as

$$C_k(y, \hat{y}) = \frac{|\{i : \hat{y}_i = k,\, y_i \neq k\}| + |\{i : \hat{y}_i \neq k,\, y_i = k\}|}{n}, \tag{10}$$

with $\hat{y}_i$ the predicted class for instance $i$. Recall that the Bayes error rate is the error rate of the Bayes classifier, which assigns an instance $x$ to class $\arg\max_{l} p_l f_l(x)$. Hence, the BER for a single class $k$ equals the error of assigning to a class $l \neq k$ when the true class is $k$ and the total error of assigning to class $k$ when the true class is $c \neq k$, thus

$$P_{e,k} = \int_{\max_{l \neq k}\{p_l f_l(x)\} \ge p_k f_k(x)} p_k f_k(x)\,dx + \sum_{c \neq k} \int_{\max_{l \neq k}\{p_l f_l(x)\} < p_k f_k(x)} p_c f_c(x)\,dx. \tag{11}$$

We make two observations about this One-vs-Rest Bayes error rate (OvR-BER). First, the OvR-BER for class $k$ is smaller than the sum of the binary BERs for the problems involving class $k$ (see Appendix A). Second, the OvR-BER can be estimated using the Henze-Penrose divergence with the Friedman-Rafsky statistic between class $k$ and the remaining classes, $R(X_k, \bigcup_{l \neq k} X_l)$, which yields the estimate $\hat{P}_{e,k}$. A computational advantage of using the OvR-BER in multiclass problems is that the MST has to be computed only once on the set $X$, since the union of $X_k$ and $\bigcup_{l \neq k} X_l$ is equal to $X$. Therefore, $R(X_k, \bigcup_{l \neq k} X_l)$ can be computed for all $k$ from the single MST on $X$ by keeping track of the labels of each instance.
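The single-MST bookkeeping can be sketched as follows. The code assumes the MST is already available as a list of index pairs, that there are at least two classes, and that the uncorrected estimate (6) is used with priors taken from the class proportions; names are hypothetical.

```python
import math
from collections import Counter


def ovr_cross_counts(mst_edges, labels):
    """For every class k, count MST edges linking class k to any other class.

    A cross-class edge (i, j) adds one to the count of both endpoint classes,
    so one pass over a single MST serves all classes.
    """
    counts = {k: 0 for k in set(labels)}
    for i, j in mst_edges:
        if labels[i] != labels[j]:
            counts[labels[i]] += 1
            counts[labels[j]] += 1
    return counts


def ovr_ber_estimates(mst_edges, labels):
    """One-vs-Rest HP-estimates for every class from one MST."""
    n = len(labels)
    sizes = Counter(labels)
    estimates = {}
    for k, r_k in ovr_cross_counts(mst_edges, labels).items():
        n_k, n_rest = sizes[k], n - sizes[k]
        p_k, p_rest = n_k / n, n_rest / n
        d_hp = max(0.0, 1.0 - n * r_k / (2.0 * n_k * n_rest))
        u_hp = 4.0 * p_k * p_rest * d_hp + (p_k - p_rest) ** 2
        estimates[k] = 0.5 - 0.25 * math.sqrt(u_hp) - 0.25 * u_hp
    return estimates


# A path-shaped MST over six samples from three classes.
mst_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
labels = ["a", "a", "b", "b", "c", "c"]
counts = ovr_cross_counts(mst_edges, labels)
print(counts["a"], counts["b"], counts["c"])  # 1 2 1
print(ovr_ber_estimates(mst_edges, labels))
```

Class `b` sits between the other two classes in this toy tree and therefore receives the largest cross count.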

### 2.4 Computational Considerations

The construction of the minimal spanning tree lies at the heart of the HP-estimator of the BER, so it is important to use a fast algorithm for the MST construction. Since the HP-estimator is based on the Euclidean MST, the dual-tree algorithm by March et al. (2010) can be applied. This algorithm is based on the construction of Borůvka (1926) and computes the Euclidean MST in approximately $O(n \log n)$ time. For larger datasets it can be beneficial to partition the space into hypercubes and construct the MST in each partition.

A simple way to improve the robustness of the HP-estimator is to use multiple orthogonal MSTs and average the number of cross-connections (Friedman and Rafsky, 1979). Computing orthogonal MSTs is not straightforward in the dual-tree algorithm of March et al. (2010), but is easy to implement in MST algorithms that use a pairwise distance matrix such as that of Whitney (1972). Figure 2 shows the empirical variance of the HP-estimator for different numbers of orthogonal MSTs as a function of the separation between the classes. As expected, the variance decreases as the number of orthogonal MSTs increases, although the marginal benefit of each additional MST diminishes. Therefore, a small number of orthogonal MSTs is typically used in practice (3 in the experiments of Section 4).
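One way to obtain orthogonal MSTs is to rerun an MST algorithm while forbidding every edge used by the earlier trees. The sketch below does this with an $O(n^2)$ Prim's algorithm; it is an illustration under that assumption and not the implementation of Whitney (1972), and the function names are hypothetical.

```python
import math
import random


def mst_excluding(points, forbidden):
    """Prim's algorithm on the complete Euclidean graph minus `forbidden`."""
    n = len(points)
    in_tree = [False] * n
    best_dist = [math.inf] * n
    best_from = [-1] * n
    best_dist[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]),
                key=best_dist.__getitem__)
        if math.isinf(best_dist[u]):
            raise ValueError("remaining graph is disconnected")
        in_tree[u] = True
        if best_from[u] >= 0:
            # Store edges as (small, large) so set lookups are consistent.
            edges.append((min(u, best_from[u]), max(u, best_from[u])))
        for v in range(n):
            if not in_tree[v] and (min(u, v), max(u, v)) not in forbidden:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v], best_from[v] = d, u
    return edges


def orthogonal_msts(points, num_trees):
    """Edge-disjoint spanning trees, each minimal given the earlier ones."""
    used, trees = set(), []
    for _ in range(num_trees):
        tree = mst_excluding(points, used)
        trees.append(tree)
        used.update(tree)
    return trees


def average_cross_count(trees, labels):
    """Average the Friedman-Rafsky statistic over the orthogonal MSTs."""
    counts = [sum(labels[i] != labels[j] for i, j in tree) for tree in trees]
    return sum(counts) / len(counts)


rng = random.Random(1)
points = [(rng.random(), rng.random()) for _ in range(30)]
labels = [0] * 15 + [1] * 15
trees = orthogonal_msts(points, 3)
print(average_cross_count(trees, labels))
```

The averaged count can then be used in place of the single-tree statistic when computing the HP-estimate.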

## 3 Meta-Learning of optimal classifier accuracy

Applying the HP-estimator to meta-learning problems creates a number of opportunities to assess the difficulty of a classification problem before training a classifier. For example, given a multiclass classification problem it may be useful to know which classes are difficult to distinguish from each other and which classes are easy to distinguish. Figure 3(a) shows an illustration of this for handwritten digits in the well-known MNIST dataset (LeCun et al., 1998). This figure shows a heat map where each square corresponds to an estimate of the BER for a binary problem in the training set. From this figure it can be seen that the digits 4 and 9 are difficult to distinguish, as well as the digits 3 and 5. This information can be useful for the design of a classifier, to ensure for instance that higher weights are placed on misclassifications of the more difficult number pairs if correct classification of these pairs is of importance to the end-task. In Figure 3(b) a similar heat map is shown based on misclassified instances of LeNet-5 (LeCun et al., 1998) on the test set. This figure shows the symmetric confusion matrix based on the 82 misclassified instances. As can be seen, this figure closely corresponds to the heat map on the training data, which confirms the predictive accuracy of the HP-estimator for real data.

Another example of the accuracy of the BER estimates for multiclass classification problems is given in Figure 4. In this figure, OvR-BER estimates and class accuracy scores are shown for the Chess dataset obtained from the UCI repository (Bache and Lichman, 2013). This dataset was split into a training dataset (70%) and a test dataset (30%) and the OvR-BER estimates were computed on the training dataset. These estimates are compared with the class error rates obtained from out-of-sample predictions of the test dataset using GenSVM (Van den Burg and Groenen, 2016). This figure shows that the OvR-BER estimates are accurate predictors of classification performance. The classes that are relatively difficult to classify may benefit from increasing misclassification weights.

The BER estimates can also be applied to feature selection and, in particular, to the identification of useful feature transformations of the data. A feature selection strategy based on forward selection was outlined in Berisha et al. (2016). At each feature selection stage, this algorithm adds the feature which gives the smallest increase in the BER estimate. Berisha et al. (2016) show that this feature selection strategy quickly yields a subset of useful features for the classification problem.
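The greedy loop behind such a forward selection can be sketched generically. Here `ber_estimate` is a caller-supplied function, hypothetical in this sketch, that returns a BER estimate for the data restricted to a tuple of feature indices; at each stage the candidate whose addition yields the lowest estimate is kept, which is the same as the smallest increase. The toy scoring function exists only to make the example runnable and is not an HP-estimate.

```python
import math


def forward_selection(num_features, ber_estimate, max_features):
    """Greedy forward selection driven by a BER estimate."""
    selected = []
    remaining = list(range(num_features))
    while remaining and len(selected) < max_features:
        # Evaluate each candidate together with the features chosen so far.
        best = min(remaining,
                   key=lambda f: ber_estimate(tuple(selected + [f])))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy stand-in for a BER estimate: each feature shrinks the error by a
# fixed factor (purely illustrative numbers).
usefulness = [0.1, 0.6, 0.3]


def toy_ber(features):
    return 0.5 * math.prod(1.0 - usefulness[f] for f in features)


print(forward_selection(3, toy_ber, max_features=2))  # [1, 2]
```

In practice the scoring function would compute the HP-estimate on the selected columns, so each stage costs one MST per candidate feature.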

Because the BER estimate is a fast and asymptotically consistent estimate of a bound on classification performance, it is easy to try a number of potential feature transformations and use the one with the smallest BER estimate in the classifier. This can be useful both for traditional feature transformations such as PCA (Pearson, 1901) and Laplacian Eigenmaps (Belkin and Niyogi, 2003), but also for commonly used kernel transformations in SVMs. For a researcher this can significantly reduce the time needed to train a classifier on different transformations of the data. In a multiclass setting where the One-vs-One strategy is used, one can even consider a different feature transformation for each binary subproblem. When using a unified classification method one can consider feature transformations which reduce the average BER estimate or the worst-case BER estimate.

Note that a feature transformation which reduces the dimensionality of the dataset without increasing the BER estimate can be considered beneficial, as many classification methods are faster for low-dimensional datasets. For instance, applying PCA to the Chess dataset only slightly increases the BER estimates for two classes, while the estimates for the other classes remain the same. Thus, a classifier will likely achieve comparable accuracy with this transformed dataset, but will be much faster to train because of the reduced dimensionality.

## 4 Hierarchical Multiclass Classification

In this section a novel hierarchical multiclass SVM classifier is introduced which is based on uncertainty clustering. The BER estimate can be considered a measure of the irreducible uncertainty of a classification problem, as a high BER indicates an intrinsically difficult problem. This can be used to construct a tree of binary classification problems that increase in difficulty along the depth of the tree. By fitting a binary classifier (such as an SVM) at each internal node of the tree, a classification method is obtained which proceeds from the easier binary subproblems to the more difficult binary problems.

Similar divide-and-conquer algorithms have been proposed (Schwenker and Palm, 2001, Takahashi and Abe, 2002, Frank and Kramer, 2004, Vural and Dy, 2004, Tibshirani and Hastie, 2007, among others). See Lorena et al. (2008) for a review. These approaches often apply a clustering method to create a grouping of the dataset into two clusters, repeating this process recursively to form a binary tree of classification problems. In Lorena and De Carvalho (2010) several empirical distance measures are used as indicators of separation difficulty between classes, which are applied in a bottom-up procedure to construct a classification tree. Finally, in El-Yaniv and Etzion-Rosenberg (2010) the Jensen-Shannon divergence is used to bound the BER with inequalities from Lin (1991) and a classification tree is constructed using a randomized heuristic procedure. Unfortunately, the Jensen-Shannon divergence implementation requires parametric estimation of distribution functions. Moreover, for the equiprobable case the upper bound on the BER obtained with the Jensen-Shannon divergence can be shown to be less tight than that obtained with the HP-divergence (see Appendix B). Because of this, these estimates of the BER may be less accurate than those obtained with the proposed HP-estimator.

To construct the hierarchical classification tree a complete weighted graph $G = (V, E, w)$ is created where the vertices correspond to the classes and the weight of the edges equals the HP-estimate for that binary problem. Formally, let $V = \{1, \ldots, K\}$, and define the edge weight $w(k, l)$ for $k, l \in V$ with $k \neq l$ as $w(k, l) = \hat{P}'_e(f_k, f_l)$. In the HP-estimator the bias correction (7) and the normalization (9) are used. By recursively applying min-cuts to this graph a tree of binary classification problems is obtained which increase in difficulty along the depth of the tree. Min-cuts on this weighted graph can be computed using for instance the method of Stoer and Wagner (1997). Figure 5 illustrates this process for a multiclass classification problem.

The tree construction can be described formally as follows. Starting with the complete weighted graph with vertices $V_0 = V$, apply a min-cut algorithm to obtain the disjoint vertex sets $V_1$ and $V_2$ such that $V_1 \cup V_2 = V_0$. This pair of vertex sets then forms a binary classification problem with datasets $X_1 = \{x_i : y_i \in V_1\}$ and $X_2 = \{x_i : y_i \in V_2\}$. Recursively applying this procedure to the sets $V_1$ and $V_2$ until no further splits are possible yields a tree of binary classification problems, as illustrated in Figure 6.
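The recursive splitting can be sketched in a few lines. To keep the example short and self-contained, the min-cut is found by brute force over all bipartitions, which is exponential in the number of classes and only suitable for small problems; it is a stand-in for the Stoer-Wagner algorithm mentioned above. The weights are invented for illustration, and class labels are assumed to be sortable.

```python
from itertools import combinations


def min_cut(vertices, weight):
    """Brute-force global min-cut: try every bipartition of `vertices`."""
    vertices = sorted(vertices)
    anchor, rest = vertices[0], vertices[1:]
    best = None
    # Keep `anchor` on the left side so each bipartition is visited once;
    # `size` stops short of len(rest) so the right side is never empty.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            left = {anchor, *extra}
            right = set(vertices) - left
            cost = sum(weight[frozenset((a, b))]
                       for a in left for b in right)
            if best is None or cost < best[0]:
                best = (cost, left, right)
    return best[1], best[2]


def build_tree(vertices, weight):
    """Recursively split the class set; leaves are single class labels."""
    vertices = set(vertices)
    if len(vertices) == 1:
        return next(iter(vertices))
    left, right = min_cut(vertices, weight)
    return (build_tree(left, weight), build_tree(right, weight))


# Pairwise normalised BER estimates for four classes (made-up values):
# classes 1 and 2 are hard to separate, as are classes 3 and 4.
weight = {
    frozenset((1, 2)): 0.40, frozenset((3, 4)): 0.30,
    frozenset((1, 3)): 0.01, frozenset((1, 4)): 0.02,
    frozenset((2, 3)): 0.03, frozenset((2, 4)): 0.04,
}
tree = build_tree({1, 2, 3, 4}, weight)
print(tree)  # ((1, 2), (3, 4))
```

The root separates {1, 2} from {3, 4}, the cheapest cut, and the two confusable pairs are only split at the next level, matching the easy-to-hard ordering described in the text.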

In the remainder of this section the results of an extensive simulation study are presented, which aims to evaluate the performance of this hierarchical classifier on multiclass classification problems. The classifier that will be used in each binary problem in the tree will be a linear support vector machine, but in practice any binary classifier could be used in the algorithm. The implementation of the hierarchical classifier based on the linear binary SVM will be called SmartSVM. (The SmartSVM classifier and the meta-learning and BER estimation techniques presented in the previous sections have been implemented in the smartsvm Python package, available at: https://github.com/HeroResearchGroup/SmartSVM.)

The experimental setup is comparable to that used in Van den Burg and Groenen (2016), where a nested cross-validation (CV) approach is used to reduce bias in classifier performance (Stone, 1974). From each original dataset 5 independent training and test datasets are generated. Subsequently, each classification method is trained using 10-fold CV on each of the training datasets. Finally, the model is retrained on the entire training dataset using the optimal hyperparameters and this model is used to predict the test set. In the experiments 16 datasets are used of varying dimensions from which 80 independent test sets are constructed. The train and test datasets were generated using a stratified split, such that the proportions of the classes correspond to those in the full dataset. Table 1 shows the descriptive statistics of each of the datasets used. Datasets are collected from the UCI repository (Bache and Lichman, 2013) and the KEEL repository (Alcalá et al., 2010).

SmartSVM will be compared to five other linear multiclass SVMs in these experiments. Three of these alternatives are heuristic methods which use the binary SVM as underlying classifier, while two others are single-machine multiclass SVMs. One of the most commonly used heuristic approaches to multiclass SVMs is the One vs. One (OvO) method (Kreßel, 1999) which solves a binary SVM for each of the pairs of classes. An alternative is the One vs. Rest (OvR) method (Vapnik, 1998) in which a binary SVM is solved for each of the binary problems obtained by separating a single class from the others. The directed acyclic graph (DAG) SVM was proposed by Platt et al. (2000) as an extension of the OvO approach. It has a similar training procedure as OvO, but uses a different prediction strategy. In the OvO method a voting scheme is used where the class with the most votes from each binary classifier becomes the predicted label. In contrast, the DAGSVM method uses a voting scheme where the least likely class is voted away until only one remains. Finally, two single-machine multiclass SVMs are also compared: the method by Crammer and Singer (2002) and GenSVM (Van den Burg and Groenen, 2016).
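The difference between the OvO and DAGSVM prediction strategies can be made concrete with a small sketch. The `pairwise_winner` callable is hypothetical and stands in for the trained binary SVM of a class pair; the elimination order in `predict_dag` (compare the first and last remaining class) is one common formulation and is an assumption here, as is the tie-breaking of the vote count.

```python
from collections import Counter
from itertools import combinations


def predict_ovo(classes, pairwise_winner, x):
    """One-vs-One: every pair votes; the class with the most votes wins.

    Ties are broken in favour of the class that received a vote first.
    """
    votes = Counter(pairwise_winner(a, b, x)
                    for a, b in combinations(classes, 2))
    return votes.most_common(1)[0][0]


def predict_dag(classes, pairwise_winner, x):
    """DAG-style prediction: repeatedly compare the first and last remaining
    classes and drop the loser, using K - 1 binary evaluations."""
    remaining = list(classes)
    while len(remaining) > 1:
        first, last = remaining[0], remaining[-1]
        if pairwise_winner(first, last, x) == first:
            remaining.pop()       # `last` is voted away
        else:
            remaining.pop(0)      # `first` is voted away
    return remaining[0]


# Toy binary "classifiers": the class whose centre is closer to x wins.
centres = {0: 0.0, 1: 5.0, 2: 10.0}


def nearest_centre(a, b, x):
    return a if abs(x - centres[a]) <= abs(x - centres[b]) else b


print(predict_ovo([0, 1, 2], nearest_centre, 4.2))  # 1
print(predict_dag([0, 1, 2], nearest_centre, 4.2))  # 1
```

Both strategies share the same $K(K-1)/2$ trained classifiers; OvO evaluates all of them at prediction time, while the elimination scheme evaluates only $K - 1$.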

All methods are implemented in either C or C++, to ensure that speed of the methods can be accurately compared. The methods that use a binary SVM internally are implemented with LibLinear (Fan et al., 2008). LibLinear also implements a fast solver for the method by Crammer and Singer (2002) using the algorithm proposed by Keerthi et al. (2008). For SmartSVM the Bayes error rates and the corresponding classification tree were calculated once for each training dataset as a preprocessing step. For most datasets the BERs were computed based on 3 orthogonal MSTs using the algorithm of Whitney (1972). For the two largest datasets (fars and shuttle) the BER was computed based on a single MST using the algorithm of March et al. (2010). Computing these MSTs was done in parallel using at most 10 cores. In the results on training time presented below the training time of SmartSVM is augmented with the preprocessing time.

The binary SVM has a cost parameter for the regularization term, which is optimized using cross validation. The range considered for this parameter is . The GenSVM method has additional hyperparameters which were varied in the same way as in the experiments of Van den Burg and Groenen (2016). All experiments were performed on the Dutch National LISA Compute Cluster using the abed utility.

The experiments are compared on training time and out-of-sample predictive performance. Table 2 shows the results for training time, averaged over the 5 nested cross validation folds for each dataset. As can be seen SmartSVM is the fastest method on 10 out of 16 datasets. This can be attributed to the smaller number of binary problems that SmartSVM needs to solve compared to OvO and the fact that the binary problems are smaller than those solved by OvR. The OvO method is the fastest classification method on the remaining 6 datasets. The single-machine multiclass SVMs by Crammer and Singer (2002) and Van den Burg and Groenen (2016) both have larger computation times than the heuristic methods. Since GenSVM has a larger number of hyperparameters, it is interesting to look at the average time per hyperparameter configuration as well. In this case, GenSVM is on average faster than Crammer and Singer (2002) due to the use of warm starts (see Appendix C for additional simulation results).

Classification performance of the methods is reported using the adjusted Rand index (ARI) which corrects for chance (Hubert and Arabie, 1985). Use of this index as a classification metric has been proposed previously by Santos and Embrechts (2009). Table 3 shows the predictive performance as measured with the ARI. As can be seen, SmartSVM obtains the maximum performance on two of the sixteen datasets. However, SmartSVM outperforms One vs. One on 3 datasets and outperforms One vs. Rest on 10 out of 16 datasets. The OvO and OvR methods are often used as default heuristic approaches for multiclass SVMs and are respectively the default strategies in the popular LibSVM (Chang and Lin, 2011) and LibLinear (Fan et al., 2008) libraries. Since SmartSVM is often faster than these methods, our results indicate a clear practical benefit to using SmartSVM for multiclass classification.
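For reference, the adjusted Rand index can be computed from the contingency counts of true and predicted labels using the formula of Hubert and Arabie (1985). The sketch below assumes at least two instances and returns 1.0 in the degenerate case where the denominator vanishes, which is a convention chosen for this example.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(y_true, y_pred):
    """Adjusted Rand index between true and predicted labels."""
    n = len(y_true)
    # Pairs of instances that share both the true and the predicted label.
    pairs_joint = sum(comb(c, 2)
                      for c in Counter(zip(y_true, y_pred)).values())
    pairs_true = sum(comb(c, 2) for c in Counter(y_true).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(y_pred).values())
    expected = pairs_true * pairs_pred / comb(n, 2)
    max_index = 0.5 * (pairs_true + pairs_pred)
    if max_index == expected:      # degenerate case, e.g. a single class
        return 1.0
    return (pairs_joint - expected) / (max_index - expected)


print(adjusted_rand_index([0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 2, 2]))  # 1.0
print(adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 0]))              # 0.0
```

The second call illustrates the chance correction: three of the four labels are correct, yet the index is 0 because that level of agreement is what would be expected by chance for these label counts.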

## 5 Discussion

In this work the practical applicability of nonparametric Bayes error estimates to meta-learning and hierarchical classifier design has been investigated. For the BER estimate introduced by Berisha et al. (2016) a bias correction was derived which improves the accuracy of the estimator for classification problems with unequal class priors. Furthermore, a normalization term was proposed which makes the BER estimates comparable in multiclass problems. An expression of the OvR-BER was given which represents the exact Bayes error for a single class in the multiclass problem and it was shown that this error can be efficiently estimated using the HP-estimator as well. A robustness analysis of the HP-estimator was performed which showed the benefit of using orthogonal MSTs in the estimator.

There are many potential applications of the BER estimates to meta-learning problems. Above, several possibilities were explored including the prediction of which pairs of classes are most difficult to distinguish and which individual classes will yield the highest error rate. Preliminary experiments with feature transformations were also performed, which showed that the BER estimates can be a useful tool in determining beneficial transformations before a classifier is trained.

Based on the weighted graph of pairwise BER estimates, a hierarchical multiclass classification method was proposed. The classifier uses a top-down splitting approach to create a tree of binary classification problems which increase in difficulty along the depth of the tree. By using a linear SVM for each classification problem, a hierarchical multiclass SVM was obtained which was named SmartSVM. Extensive simulation studies showed that SmartSVM is often faster than existing approaches and yields competitive predictive performance on several datasets.

Note that the SmartSVM classifier is only one example of how the BER estimates can be used to construct better classification methods. As discussed in Section 3, BER estimates could also be used to define class weights in a multiclass classifier. Moreover, the min-cut strategy used for SmartSVM may not be the optimal way to construct the classification tree. Evaluating different approaches to constructing classification hierarchies and other applications of the BER estimates to multiclass classification problems are topics for further research.

## Acknowledgements

The computational experiments of this work were performed on the Dutch National LISA Compute Cluster, and supported by the Dutch National Science Foundation (NWO). The authors thank SURFsara (www.surfsara.nl) for the support in using the LISA cluster. This research was partially supported by the US Army Research Office, grant W911NF-15-1-0479 and US Dept. of Energy grant DE-NA0002534.

## Appendix A One vs. Rest Bayes Error Rate

In this section bounds are derived for the One vs. Rest Bayes error rate (OvR-BER), which measures the error of the Bayes classifier in correctly identifying an individual class.

###### Definition 1 (OvR-BER).

Let $f_1, \ldots, f_K$ and $p_1, \ldots, p_K$ denote density functions and prior probabilities for the classes 1 through $K$ respectively, with $\sum_{c=1}^{K} p_c = 1$. Then, the Bayes error rate between a class $k$ and the remaining classes is given by

$$P_{e,k} = \int\limits_{\max_{l \neq k}\{p_l f_l(x)\} \geq p_k f_k(x)} p_k f_k(x)\,dx + \sum_{c \neq k} \; \int\limits_{\max_{l \neq k}\{p_l f_l(x)\} < p_k f_k(x)} p_c f_c(x)\,dx. \tag{12}$$

Below it will be shown that the OvR-BER can be bounded using the Friedman-Rafsky statistic in the One-vs-Rest setting. Let the mixture distribution of the classes $l \neq k$ be given by

$$g_k(x) = \frac{\sum_{l \neq k} p_l f_l(x)}{\sum_{l \neq k} p_l}, \tag{13}$$

with prior probability $p_g = \sum_{l \neq k} p_l$. Then the samples from the classes $l \neq k$ can be seen as draws from this mixture distribution. By a theorem of Berisha et al. (2016) it holds that

 (14)

The following theorem relates this error to the OvR-BER defined above.

###### Theorem 2.

The error rate between class $k$ and the mixture distribution without class $k$ is bounded from above by the OvR-BER,

$$Q_{e,k} = \int \min\{p_k f_k(x), p_g g_k(x)\}\,dx \leq P_{e,k}. \tag{15}$$

###### Proof.

Note that

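$$Q_{e,k} = \int\limits_{p_k f_k(x) \leq p_g g_k(x)} p_k f_k(x)\,dx + \int\limits_{p_k f_k(x) > p_g g_k(x)} p_g g_k(x)\,dx. \tag{16}$$

To simplify the notation, introduce the sets

$$\begin{align}
T &= \{x \in \mathbb{R}^d : p_k f_k(x) \leq p_g g_k(x)\} \tag{17} \\
S &= \{x \in \mathbb{R}^d : p_k f_k(x) \leq \max_{l \neq k}\{p_l f_l(x)\}\} \tag{18}
\end{align}$$

and denote their respective complements by $T'$ and $S'$. Then,

$$\begin{align}
Q_{e,k} &= \int_{T} p_k f_k(x)\,dx + \int_{T'} p_g g_k(x)\,dx \tag{19} \\
P_{e,k} &= \int_{S} p_k f_k(x)\,dx + \int_{S'} p_g g_k(x)\,dx. \tag{20}
\end{align}$$

Since $p_g g_k(x) = \sum_{l \neq k} p_l f_l(x) \geq \max_{l \neq k}\{p_l f_l(x)\}$ it holds that $S \subseteq T$ and $T' \subseteq S'$. Hence,

$$\begin{align}
\int_{T} p_k f_k(x)\,dx &= \int_{S} p_k f_k(x)\,dx + \int_{T \setminus S} p_k f_k(x)\,dx \tag{21} \\
\int_{S'} p_g g_k(x)\,dx &= \int_{T'} p_g g_k(x)\,dx + \int_{S' \setminus T'} p_g g_k(x)\,dx. \tag{22}
\end{align}$$

However, the sets $T \setminus S$ and $S' \setminus T'$ both equal

$$U = \{x \in \mathbb{R}^d : \max_{l \neq k}\{p_l f_l(x)\} < p_k f_k(x) \leq p_g g_k(x)\}, \tag{23}$$

so it follows that

$$Q_{e,k} = P_{e,k} + \int_{U} p_k f_k(x)\,dx - \int_{U} p_g g_k(x)\,dx \leq P_{e,k} \tag{24}$$

by definition of the set $U$, on which $p_k f_k(x) \leq p_g g_k(x)$. ∎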

This provides a lower bound for $P_{e,k}$ in terms of $Q_{e,k}$. What remains to be shown is that $P_{e,k}$ has an upper bound in terms of $Q_{e,k}$. No such bound has yet been found. However, the following result can be presented which does bound $P_{e,k}$ from above.

###### Theorem 3.

For a single class $k$ the OvR-BER is smaller than or equal to the sum of the pairwise BER estimates involving class $k$.

###### Proof.

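Recall that the OvR-BER for class $k$ is given by

$$P_{e,k} = \int\limits_{p_k f_k(x) \leq \max_{l \neq k}\{p_l f_l(x)\}} p_k f_k(x)\,dx + \sum_{c \neq k} \; \int\limits_{p_k f_k(x) > \max_{l \neq k}\{p_l f_l(x)\}} p_c f_c(x)\,dx, \tag{25}$$

and denote the sum of the pairwise BERs involving $k$ by $F_{e,k}$, given by

$$\begin{align}
F_{e,k} &= \sum_{c \neq k} \int \min\{p_k f_k(x), p_c f_c(x)\}\,dx \tag{26} \\
&= \sum_{c \neq k} \; \int\limits_{p_k f_k(x) \leq p_c f_c(x)} p_k f_k(x)\,dx + \sum_{c \neq k} \; \int\limits_{p_k f_k(x) > p_c f_c(x)} p_c f_c(x)\,dx. \tag{27}
\end{align}$$

Then comparing the first term of $F_{e,k}$ with that of $P_{e,k}$ shows

$$\sum_{c \neq k} \; \int\limits_{p_k f_k(x) \leq p_c f_c(x)} p_k f_k(x)\,dx \geq \int\limits_{p_k f_k(x) \leq \max_{l \neq k}\{p_l f_l(x)\}} p_k f_k(x)\,dx, \tag{28}$$

since the regions of integration on the left jointly cover the region on the right and the integrand is nonnegative. Similarly,

$$\sum_{c \neq k} \; \int\limits_{p_k f_k(x) > p_c f_c(x)} p_c f_c(x)\,dx \geq \sum_{c \neq k} \; \int\limits_{p_k f_k(x) > \max_{l \neq k}\{p_l f_l(x)\}} p_c f_c(x)\,dx \tag{29}$$

for the same reason. This completes the proof. ∎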
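Taken together, the two theorems give the chain $Q_{e,k} \leq P_{e,k} \leq F_{e,k}$. The sketch below is a quick numerical sanity check of this chain, not part of the proofs. It assumes univariate Gaussian class densities and approximates the integrals with a Riemann sum on a fixed grid; the function names, the grid limits and the example priors and means are all illustrative choices.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a univariate normal distribution."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def ovr_quantities(priors, means, stds, k, lo=-15.0, hi=15.0, n=6001):
    """Approximate Q_{e,k}, P_{e,k} and F_{e,k} with a Riemann sum on [lo, hi]."""
    dx = (hi - lo) / (n - 1)
    q_err = p_err = f_err = 0.0
    for i in range(n):
        x = lo + i * dx
        weighted = [p * normal_pdf(x, m, s) for p, m, s in zip(priors, means, stds)]
        own = weighted[k]                                  # p_k f_k(x)
        others = [w for c, w in enumerate(weighted) if c != k]
        rest = sum(others)                                 # p_g g_k(x)
        # Class k vs. the mixture of the remaining classes.
        q_err += min(own, rest) * dx
        # OvR-BER: p_k f_k where class k loses, the sum of the others where it wins.
        p_err += (own if own <= max(others) else rest) * dx
        # Sum of the pairwise Bayes errors involving class k.
        f_err += sum(min(own, w) for w in others) * dx
    return q_err, p_err, f_err


# Illustrative three-class problem with unequal priors.
priors = [0.5, 0.3, 0.2]
means = [0.0, 2.0, 4.0]
stds = [1.0, 1.0, 1.0]
for k in range(len(priors)):
    q, p, f = ovr_quantities(priors, means, stds, k)
    print(f"class {k}: Q = {q:.4f}  P = {p:.4f}  F = {f:.4f}")
```

Because the three integrands are ordered pointwise, the ordering of the sums does not depend on the grid resolution; only the numerical values do.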

## Appendix B Jensen-Shannon Bound Inequality

In this section a proof is given for the statement that the Henze-Penrose upper bound on the Bayes error rate is tighter than the Jensen-Shannon upper bound derived by Lin (1991). Before presenting the proof, the following lemma is presented.

###### Lemma 4.

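For $x, y > 0$ it holds that

$$x \log\left(1 + \frac{y}{x}\right) + y \log\left(1 + \frac{x}{y}\right) \geq 4 \log(2) \frac{xy}{x + y}. \tag{30}$$

###### Proof.

Let $t = y/x$ and multiply both sides by $(x + y)/(xy)$, then the inequality reduces to

$$\left(\frac{1}{t} + 1\right) \log(1 + t) + (1 + t) \log\left(1 + \frac{1}{t}\right) \geq 4 \log(2). \tag{31}$$

Denote the left hand side by $f(t)$. The proof will now proceed by showing that $f(t) \geq f(1) = 4 \log(2)$ for all $t > 0$. The derivatives of $f(t)$ are given by

$$f'(t) = \log\left(1 + \frac{1}{t}\right) - \frac{\log(1 + t)}{t^2} \quad \text{and} \quad f''(t) = \frac{2 \log(1 + t) - t}{t^3}. \tag{32}$$

Write the numerator of $f''(t)$ as $g(t)$ such that

$$g(t) = 2 \log(1 + t) - t, \quad \text{and} \quad g'(t) = \frac{1 - t}{1 + t}. \tag{33}$$

Then it is clear that $g'(t) > 0$ for $0 < t < 1$ and $g'(t) < 0$ for $t > 1$. Furthermore $g(0) = 0$ and $\lim_{t \to \infty} g(t) = -\infty$. Thus, it follows that $g(t)$ increases on $0 < t < 1$ and decreases for $t > 1$. Let $t_0 > 1$ be such that $g(t_0) = 0$, then $g(t) > 0$ for $0 < t < t_0$ and $g(t) < 0$ for $t > t_0$.

From this it follows that $f''(t) > 0$ for $0 < t < t_0$ and $f''(t) < 0$ for $t > t_0$. Hence, $f'(t)$ is increasing on $0 < t < t_0$ and decreasing for $t > t_0$. Moreover, $\lim_{t \to 0^+} f'(t) = -\infty$ and $\lim_{t \to \infty} f'(t) = 0$. Together with $f'(1) = 0$ it follows that $f'(t)$ is negative on $0 < t < 1$, positive for $t > 1$, and attains a maximum at $t_0$ after which it decreases to $0$. Since $f'(t)$ changes sign only at $t = 1$ it follows that $f(t)$ is decreasing on $0 < t < 1$ and increasing for $t > 1$, so that $f(t) \geq f(1) = 4 \log(2)$. ∎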
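The lemma can also be checked numerically. The sketch below evaluates the gap between both sides of the inequality on a logarithmic grid and locates the minimizer of the reduced left hand side on the same grid; the grid and the helper names `lemma4_gap` and `reduced_lhs` are illustrative, and a grid check is of course no substitute for the proof.

```python
import math


def lemma4_gap(x, y):
    """Left-hand side minus right-hand side of the inequality, natural logarithm."""
    lhs = x * math.log1p(y / x) + y * math.log1p(x / y)
    rhs = 4.0 * math.log(2.0) * x * y / (x + y)
    return lhs - rhs


def reduced_lhs(t):
    """Left-hand side of the reduced one-variable inequality, with t = y / x."""
    return (1.0 / t + 1.0) * math.log1p(t) + (1.0 + t) * math.log1p(1.0 / t)


# Logarithmic grid from 1e-4 to 1e4 that contains the point 1.0 exactly.
grid = [10.0 ** (e / 4.0) for e in range(-16, 17)]

worst_gap = min(lemma4_gap(x, y) for x in grid for y in grid)
t_min = min(grid, key=reduced_lhs)

print("smallest gap on the grid:", worst_gap)
print("grid minimizer of the reduced left-hand side:", t_min)
print("value at the minimizer:", reduced_lhs(t_min), "vs 4 log 2 =", 4.0 * math.log(2.0))
```

The gap vanishes (up to rounding) only on the diagonal $x = y$, which corresponds to $t = 1$ in the reduced inequality.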

###### Definition 5 (Kullback-Leibler Divergence).

For probability density functions $f_1(x)$ and $f_2(x)$ the Kullback-Leibler divergence is given by

$$D_{KL}(f_1 \| f_2) = \int f_1(x) \log_2 \frac{f_1(x)}{f_2(x)}\,dx, \tag{34}$$

see Kullback and Leibler (1951).

###### Definition 6 (Jensen-Shannon Divergence).

According to El-Yaniv et al. (1998) the Jensen-Shannon divergence for two probability density functions $f_1(x)$ and $f_2(x)$ with prior probabilities $p_1$ and $p_2$, can be stated in terms of the Kullback-Leibler divergence as

$$JS(f_1, f_2) = p_1 D_{KL}(f_1 \| M) + p_2 D_{KL}(f_2 \| M) \tag{35}$$

with $M(x) = p_1 f_1(x) + p_2 f_2(x)$ the mixture distribution and $p_2 = 1 - p_1$.

###### Theorem 7.

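For $p_1 = p_2 = \frac{1}{2}$ the Henze-Penrose upper bound on the BER is tighter than the Jensen-Shannon upper bound of Lin (1991),

$$P_e(f_1, f_2) \leq \tfrac{1}{2} - \tfrac{1}{2} u_{\mathrm{HP}} \leq \tfrac{1}{2} J, \tag{36}$$

where $J = H(p_1) - JS(f_1, f_2)$ with $H(p_1)$ the binary entropy and $JS(f_1, f_2)$ the Jensen-Shannon divergence.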

###### Proof.

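First, note that with $p_1 = p_2 = \frac{1}{2}$ the binary entropy $H(p_1) = 1$. Second, for the equiprobable case it holds that

$$\tfrac{1}{2} - \tfrac{1}{2} u_{\mathrm{HP}}(f_1, f_2) = \int \frac{f_1(x) f_2(x)}{f_1(x) + f_2(x)}\,dx. \tag{37}$$

The Jensen-Shannon upper bound can be written as

$$\begin{align}
\tfrac{1}{2} J &= \tfrac{1}{2} - \tfrac{1}{4} \int f_1(x) \log_2 \frac{2 f_1(x)}{f_1(x) + f_2(x)}\,dx - \tfrac{1}{4} \int f_2(x) \log_2 \frac{2 f_2(x)}{f_1(x) + f_2(x)}\,dx \tag{38} \\
&= \tfrac{1}{4} \int f_1(x) + f_2(x)\,dx - \tfrac{1}{4} \int f_1(x) \log_2 \frac{2 f_1(x)}{f_1(x) + f_2(x)}\,dx \tag{39} \\
&\quad - \tfrac{1}{4} \int f_2(x) \log_2 \frac{2 f_2(x)}{f_1(x) + f_2(x)}\,dx \notag \\
&= \tfrac{1}{4} \int f_1(x) \left[1 - \log_2 \frac{2 f_1(x)}{f_1(x) + f_2(x)}\right] dx \tag{40} \\
&\quad + \tfrac{1}{4} \int f_2(x) \left[1 - \log_2 \frac{2 f_2(x)}{f_1(x) + f_2(x)}\right] dx \tag{41} \\
&= \tfrac{1}{4} \int \left[ f_1(x) \log_2\left(1 + \frac{f_2(x)}{f_1(x)}\right) + f_2(x) \log_2\left(1 + \frac{f_1(x)}{f_2(x)}\right) \right] dx. \tag{42}
\end{align}$$

By Lemma 4 it follows that

$$f_1(x) \log_2\left(1 + \frac{f_2(x)}{f_1(x)}\right) + f_2(x) \log_2\left(1 + \frac{f_1(x)}{f_2(x)}\right) \geq \frac{4 f_1(x) f_2(x)}{f_1(x) + f_2(x)}, \tag{43}$$

and therefore

$$\tfrac{1}{2} J \geq \tfrac{1}{4} \int \frac{4 f_1(x) f_2(x)}{f_1(x) + f_2(x)}\,dx = \tfrac{1}{2} - \tfrac{1}{2} u_{\mathrm{HP}}(f_1, f_2). \tag{44}$$

∎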
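To give a feeling for how far apart the two upper bounds are, the following sketch evaluates the equiprobable Bayes error, the Henze-Penrose upper bound and the Jensen-Shannon upper bound for two univariate Gaussians. The Gaussian densities, the integration grid and the function names are assumptions made for this illustration only; the integrals are approximated with a plain Riemann sum.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a univariate normal distribution."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def equiprobable_bounds(mean1, mean2, std=1.0, lo=-12.0, hi=12.0, n=8001):
    """Return (Bayes error, HP upper bound, JS upper bound) for p1 = p2 = 1/2."""
    dx = (hi - lo) / (n - 1)
    bayes = hp = js = 0.0
    for i in range(n):
        x = lo + i * dx
        f1 = normal_pdf(x, mean1, std)
        f2 = normal_pdf(x, mean2, std)
        if f1 == 0.0 or f2 == 0.0:
            continue  # guard against underflow far out in the tails
        bayes += 0.5 * min(f1, f2) * dx
        # Integrand of the Henze-Penrose upper bound in the equiprobable case.
        hp += f1 * f2 / (f1 + f2) * dx
        # Integrand of the rewritten Jensen-Shannon upper bound.
        js += 0.25 * (f1 * math.log2(1.0 + f2 / f1)
                      + f2 * math.log2(1.0 + f1 / f2)) * dx
    return bayes, hp, js


bayes, hp, js = equiprobable_bounds(0.0, 2.0)
print(f"Bayes error = {bayes:.4f}, HP bound = {hp:.4f}, JS bound = {js:.4f}")
```

For two unit-variance Gaussians whose means are two standard deviations apart, the Bayes error equals $\Phi(-1)$, which gives an independent check on the first returned value.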

## Appendix C Additional Simulation Results

In this section some additional simulation results are presented for the SmartSVM experiments presented in Section 4. Table 4 shows the average time per hyperparameter configuration for each of the methods. This is especially useful for comparing GenSVM (Van den Burg and Groenen, 2016) with the other methods, as it has a larger set of hyperparameters to consider.

Rank plots are a commonly used tool to summarize the results of simulation experiments (Demšar, 2006). For each dataset the methods are ranked, with the best method receiving rank 1 and the worst method receiving a rank equal to the number of methods in the experiment. In case of ties fractional ranks are used. By averaging the ranks over all datasets, a visual summary of the results can be obtained. Panels (a), (b) and (c) of the rank plot figure show these average ranks for predictive performance, total training time, and average training time respectively.

The ordering of OvO and SmartSVM in the rank plots for training time may seem counterintuitive, considering that SmartSVM is more often the fastest method. This can be explained by the fact that in the cases where SmartSVM is slower than OvO it is usually also slower than DAG. In contrast, where SmartSVM is the fastest method OvO is usually the second fastest method. Because of this, SmartSVM obtains a slightly higher average rank than OvO.
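The averaging of fractional ranks behind such rank plots can be sketched as follows. The helper names and the timing values are hypothetical; the numbers are not taken from the experiments and are chosen only to reproduce the effect described above, where the method that is most often the fastest still ends up with the worse average rank.

```python
def fractional_ranks(values):
    """Rank values in ascending order (1 = smallest); ties share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0  # mean of the 1-based positions
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def average_ranks(scores, lower_is_better=True):
    """scores: {dataset: {method: score}} -> {method: mean rank over datasets}."""
    methods = sorted(next(iter(scores.values())))
    totals = {m: 0.0 for m in methods}
    for per_method in scores.values():
        vals = [per_method[m] if lower_is_better else -per_method[m] for m in methods]
        for m, r in zip(methods, fractional_ranks(vals)):
            totals[m] += r
    return {m: totals[m] / len(scores) for m in methods}


# Hypothetical training times in seconds for three methods on five datasets.
times = {
    "d1": {"A": 1.0, "B": 2.0, "C": 3.0},
    "d2": {"A": 1.5, "B": 2.5, "C": 4.0},
    "d3": {"A": 0.5, "B": 0.8, "C": 0.9},
    "d4": {"A": 9.0, "B": 4.0, "C": 5.0},
    "d5": {"A": 8.0, "B": 3.0, "C": 3.5},
}
avg = average_ranks(times)
print(avg)
```

Here method `A` is the fastest on three of the five datasets but the slowest on the other two, while `B` is never worse than second, so `B` obtains the lower average rank.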

## References

• Alcalá et al. (2010) Alcalá, J., Fernández, A., Luengo, J., Derrac, J., García, S., Sánchez, L., and Herrera, F. KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. Journal of Multiple-Valued Logic and Soft Computing, 17(2-3):255–287, 2010.
• Avi-Itzhak and Diep (1996) Avi-Itzhak, H. and Diep, T. Arbitrarily tight upper and lower bounds on the Bayesian probability of error. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(1):89–91, 1996.
• Bache and Lichman (2013) Bache, K. and Lichman, M. UCI machine learning repository. 2013.
• Belkin and Niyogi (2003) Belkin, M. and Niyogi, P. Laplacian eigenmaps for dimensionality reduction and data representation. Neural computation, 15(6):1373–1396, 2003.
• Berisha and Hero (2015) Berisha, V. and Hero, A. Empirical non-parametric estimation of the Fisher information. IEEE Signal Processing Letters, 22(7):988–992, 2015.
• Berisha et al. (2016) Berisha, V., Wisler, A., Hero, A., and Spanias, A. Empirically estimable classification bounds based on a nonparametric divergence measure. IEEE Transactions on Signal Processing, 64(3):580–591, 2016.
• Bhattacharyya (1946) Bhattacharyya, A. On a measure of divergence between two multinomial populations. Sankhyā: The Indian Journal of Statistics, 7(4):401–406, 1946.
• Borůvka (1926) Borůvka, O. O jistém problému minimálním. Práce Moravské Přírodovědecké Společnosti, 3:37–58, 1926.
• Chang and Lin (2011) Chang, C. and Lin, C. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):27:1–27:27, 2011.
• Chernoff (1952) Chernoff, H. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. The Annals of Mathematical Statistics, 23(4):493–507, 1952.
• Cortes and Vapnik (1995) Cortes, C. and Vapnik, V. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
• Cox (1958) Cox, D. The regression analysis of binary sequences. Journal of the Royal Statistical Society. Series B (Methodological), 20(2):215–242, 1958.
• Crammer and Singer (2002) Crammer, K. and Singer, Y. On the algorithmic implementation of multiclass kernel-based vector machines. The Journal of Machine Learning Research, 2(Dec):265–292, 2002.
• Csiszár (1975) Csiszár, I. I-divergence geometry of probability distributions and minimization problems. The Annals of Probability, 3(1):146–158, 1975.
• Demšar (2006) Demšar, J. Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research, 7(Jan):1–30, 2006.
• El-Yaniv and Etzion-Rosenberg (2010) El-Yaniv, R. and Etzion-Rosenberg, N. Hierarchical multiclass decompositions with application to authorship determination. arXiv preprint arXiv:1010.2102, 2010.
• El-Yaniv et al. (1998) El-Yaniv, R., Fine, S., and Tishby, N. Agnostic classification of Markovian sequences. In: Advances in Neural Information Processing Systems 10, pp. 465–471. MIT Press, 1998.
• Fan et al. (2008) Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., and Lin, C.J. LIBLINEAR: A library for large linear classification. The Journal of Machine Learning Research, 9(Aug):1871–1874, 2008.
• Frank and Kramer (2004) Frank, E. and Kramer, S. Ensembles of nested dichotomies for multi-class problems. In: Proceedings of the 21st International Conference on Machine Learning, pp. 39–46. ACM, 2004.
• Friedman and Rafsky (1979) Friedman, J. and Rafsky, L. Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests. The Annals of Statistics, 7(4):697–717, 1979.
• Fukunaga (1990) Fukunaga, K. Introduction to Statistical Pattern Recognition.
• Hashlamoun et al. (1994) Hashlamoun, W., Varshney, P., and Samarasooriya, V. A tight upper bound on the Bayesian probability of error. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(2):220–224, 1994.
• Henze and Penrose (1999) Henze, N. and Penrose, M. On the multivariate runs test. The Annals of Statistics, 27(1):290–298, 1999.
• Ho and Basu (2002) Ho, T. and Basu, M. Complexity measures of supervised classification problems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3):289–300, 2002.
• Hubert and Arabie (1985) Hubert, L. and Arabie, P. Comparing partitions. Journal of Classification, 2(1):193–218, 1985.
• Kailath (1967) Kailath, T. The divergence and Bhattacharyya distance measures in signal selection. IEEE Transactions on Communication Technology, 15(1):52–60, 1967.
• Keerthi et al. (2008) Keerthi, S., Sundararajan, S., Chang, K.W., Hsieh, C.J., and Lin, C.J. A sequential dual method for large scale multi-class linear SVMs. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 408–416. 2008.
• Kreßel (1999) Kreßel, U. Pairwise classification and support vector machines. In: B. Schölkopf, C.J.C. Burges, and A.J. Smola, editors, Advances in Kernel Methods, pp. 255–268. MIT Press, 1999.
• Kullback and Leibler (1951) Kullback, S. and Leibler, R. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86, 1951.
• LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
• Lin (1991) Lin, J. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, 1991.
• Lorena and De Carvalho (2010) Lorena, A. and De Carvalho, A. Building binary-tree-based multiclass classifiers using separability measures. Neurocomputing, 73(16):2837–2845, 2010.
• Lorena et al. (2008) Lorena, A., De Carvalho, A., and Gama, J. A review on the combination of binary classifiers in multiclass problems. Artificial Intelligence Review, 30(1):19–37, 2008.
• Lu et al. (1998) Lu, Y., Guo, H., and Feldkamp, L. Robust neural learning from unbalanced data samples. In: Proceedings of the IEEE International Joint Conference on Neural Networks, volume 3, pp. 1816–1821. IEEE, 1998.
• March et al. (2010) March, W., Ram, P., and Gray, A. Fast Euclidean minimum spanning tree: algorithm, analysis, and applications. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 603–612. ACM, 2010.
• McCulloch and Pitts (1943) McCulloch, W. and Pitts, W. A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5(4):115–133, 1943.
• Pearson (1901) Pearson, K. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559–572, 1901.
• Platt et al. (2000) Platt, J., Cristianini, N., and Shawe-Taylor, J. Large margin DAGs for multiclass classification. In: S.A. Solla, T.K. Leen, and K. Müller, editors, Advances in Neural Information Processing Systems 12, pp. 547–553. MIT Press, 2000.
• Qiao and Liu (2009) Qiao, X. and Liu, Y. Adaptive weighted learning for unbalanced multicategory classification. Biometrics, 65(1):159–168, 2009.
• Santos and Embrechts (2009) Santos, J. and Embrechts, M. On the use of the adjusted Rand index as a metric for evaluating supervised classification. In: Proceedings of the 19th International Conference on Artificial Neural Networks: Part II, pp. 175–184. Springer-Verlag, 2009.
• Schwenker and Palm (2001) Schwenker, F. and Palm, G. Tree-structured support vector machines for multi-class pattern recognition. In: International Workshop on Multiple Classifier Systems, pp. 409–417. Springer, 2001.
• Stoer and Wagner (1997) Stoer, M. and Wagner, F. A simple min-cut algorithm. Journal of the ACM, 44(4):585–591, 1997.
• Stone (1974) Stone, M. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society. Series B (Methodological), 36(2):111–147, 1974.
• Takahashi and Abe (2002) Takahashi, F. and Abe, S. Decision-tree-based multiclass support vector machines. In: Neural Information Processing, 2002. ICONIP’02. Proceedings of the 9th International Conference on, volume 3, pp. 1418–1422. IEEE, 2002.
• Tibshirani and Hastie (2007) Tibshirani, R. and Hastie, T. Margin trees for high-dimensional classification. Journal of Machine Learning Research, 8(Mar):637–652, 2007.
• Van den Burg and Groenen (2016) Van den Burg, G. and Groenen, P. GenSVM: A generalized multiclass support vector machine. Journal of Machine Learning Research, 17(225):1–42, 2016.
• Vapnik (1998) Vapnik, V. Wiley, New York, 1998.
• Vural and Dy (2004) Vural, V. and Dy, J. A hierarchical method for multi-class support vector machines. In: Proceedings of the 21st International Conference on Machine Learning, pp. 105–112. ACM, 2004.
• Wald (1947) Wald, A. Foundations of a general theory of sequential decision functions. Econometrica, 15(4):279–313, 1947.
• Whitney (1972) Whitney, V. Algorithm 422: minimal spanning tree. Communications of the ACM, 15(4):273–274, 1972.
• Wisler et al. (2016) Wisler, A., Berisha, V., Wei, D., Ramamurthy, K., and Spanias, A. Empirically-estimable multi-class classification bounds. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2594–2598. IEEE, 2016.