Solving for multi-class: a survey and synthesis

09/16/2018 ∙ by Peter Mills, et al. ∙ 0

We review common methods of solving for multi-class from binary and generalize them to a common framework. Since conditional probabilities are useful both for quantifying the accuracy of an estimate and for calibration purposes, these are a required part of the solution. There is some indication that the best solution for multi-class classification is dependent on the particular dataset. As such, we are particularly interested in data-driven solution design, whether based on a priori considerations or empirical examination of the data. Numerical results indicate that while a one-size-fits-all solution consisting of one-versus-one is appropriate for most datasets, a minority will benefit from a more customized approach. The techniques discussed in this paper allow for a large variety of multi-class configurations and solution methods to be explored so as to optimize classification accuracy, accuracy of conditional probabilities and speed.





multi-class classification, probability estimation, constrained linear least squares, decision trees, error correcting codes, support vector machines

1 Introduction

Many statistical classifiers can only discriminate between two classes. Common examples include linear classifiers such as perceptrons and logistic regression classifiers (Michie et al., 1994) as well as extensions of these methods such as support vector machines (SVM) (Müller et al., 2001) and piecewise linear classifiers (Bagirov, 2005; Mills, 2018). There are many possible ways of extending a binary classifier to deal with multi-class classification and the options increase exponentially with the number of class labels. Moreover, the best method may well depend on the type of problem (Dietterich and Bakiri, 1995; Allwein et al., 2000).

The goal of this paper is not only to provide a summary of the best methods of solving for multi-class, but to synthesize these ideas into a comprehensive framework whereby a multi-class problem can be solved using a broad array of configurations for the binary classifiers. In addition, we require that any algorithm solve for the multi-class conditional probabilities. These are useful both for gauging the accuracy of a result and for various forms of recalibration (Jolliffe and Stephenson, 2003; Fawcett, 2006; Mills, 2009, 2011).

1.1 Definition of the problem

In a statistical classification problem we are given a set of ordered pairs, $\{\vec{x}_i : y_i\}$, of training data, where the vector, $\vec{x}_i$, is the location of the sample in the feature space, $y_i \in [1..n_c]$ is the class of the sample, $n_c$ is the number of classes, and the classes are distributed according to an unknown conditional distribution, $P(c|\vec{x})$, with $c \in [1..n_c]$ the class label and $\vec{x}$ the location in feature space.

Given an arbitrary test point, $\vec{x}$, we wish to estimate $P(c|\vec{x})$. However, we only have the means to estimate some binary component of it: we have a set of binary classifiers, each returning a decision function, $r_j(\vec{x})$. In this paper we assume that the decision function returns estimates of the difference in conditional probabilities:

$$r_j(\vec{x}) \approx P_j(+1|\vec{x}) - P_j(-1|\vec{x})$$

where $P_j(c|\vec{x})$ is the conditional probability of the $j$th binary classifier. The decision function is trained using the same type of ordered pairs as above except that the classes can only take on one of two values, which for convenience are chosen as either $-1$ or $+1$, that is, $y_{ij} \in \{-1, +1\}$.

The problem under consideration in this review is, first, how to partition the classes, $y_i$, for each binary classifier. That is, we want to create a mapping of the form:

$$y_{ij} = \begin{cases} -1, & y_i \in C_j^- \\ +1, & y_i \in C_j^+ \end{cases} \tag{1}$$

where $y_{ij}$ is the class value of the $i$th sample of the transformed data for training the $j$th binary classifier and $C_j^+ \subset [1..n_c]$ is the set of class labels from the original set, $[1..n_c]$, that map to $+1$ while $C_j^- \subset [1..n_c]$ is the set of classes that map to $-1$.

And second, once we have partitioned the classes and trained the binary classifiers, how do we solve for the multi-class conditional probabilities, $P(c|\vec{x})$? The class of the test point may then be estimated through maximum likelihood:

$$c(\vec{x}) = \arg\max_i P(i|\vec{x}) \tag{2}$$

2 Nonhierarchical multi-class classification

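In non-hierarchical multi-class classification, we solve for the classes or probabilities of the multi-class problem all at once: all the binary classifiers are used in the solution and the result of one binary classifier does not determine the use of any of the others. Using the notation provided in Section 1.1, we can write a system of equations relating the multi-class conditional probabilities to the decision functions:

$$r_j(\vec{x}) = \frac{\sum_{i=1}^{n_j^+} P(c_{ji}^+|\vec{x}) - \sum_{i=1}^{n_j^-} P(c_{ji}^-|\vec{x})}{\sum_{i=1}^{n_j^+} P(c_{ji}^+|\vec{x}) + \sum_{i=1}^{n_j^-} P(c_{ji}^-|\vec{x})} \tag{3}$$

where $c_{ji}^- \in C_j^-$ and $c_{ji}^+ \in C_j^+$ are the class labels that the $j$th partition maps to $-1$ and $+1$ respectively, $n_j^-$ is the number of class labels on the negative side of the $j$th partition, and $n_j^+$ is the number of class labels on the positive side of the $j$th partition.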

It’s more natural (and considerably simpler) to describe the problem using a coding matrix, $A$, whose elements are $a_{ji} \in \{-1, 0, +1\}$, where $j$ enumerates the binary classifier and $i$ enumerates the class of the multi-class problem. If $a_{ji}$ is $-1$/$+1$, we would assign each of the $i$th class labels in the training data a value of $-1$/$+1$ when training the $j$th binary classifier. If the value is $0$, the $i$th class label is excluded (Dietterich and Bakiri, 1995; Windeatt and Ghaderi, 2002).

The non-zero elements of $A$ are:

$$a_{j c_{ji}^-} = -1, \qquad a_{j c_{ji}^+} = +1$$

We can rewrite Equation (3) using the coding matrix as follows:

$$r_j = \frac{\sum_{i=1}^{n_c} a_{ji} p_i}{\sum_{i=1}^{n_c} |a_{ji}| p_i} \tag{4}$$

where $p_i = P(i|\vec{x})$, $\vec{p} = \{p_i\}$ is a vector of multi-class conditional probabilities, $i \in [1..n_c]$, $n_c$ is the number of classes, $\vec{r} = \{r_j\}$ is the vector of decision functions, $j \in [1..n_p]$, and $n_p$ is the number of partitions. Note that the coding matrix used here is transposed relative to the usual convention in the literature since this is the more natural layout when solving for the probabilities.

Some rearrangement, together with the normalization $\sum_i p_i = 1$, shows that we can solve for the probabilities, $\vec{p}$, via matrix inversion:

$$Q \vec{p} = \vec{r} \tag{5}$$

$$q_{ji} = a_{ji} + (1 - |a_{ji}|) r_j \tag{6}$$

Note that $Q$ reduces to $A$ if $A$ contains no zeroes (Kong and Dietterich, 1997). The case of a coding matrix that contains no zeroes, that is, one in which all the partitions divide up all the classes rather than a subset, will be called the strict case. From a computational perspective, in the non-strict case $Q$ depends on $\vec{r}$ and must be regenerated for every new test point or value of $\vec{r}$, whereas in the strict case $Q = A$ can be inverted or decomposed once and then applied to every subsequent value of $\vec{r}$.
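As a minimal sketch in Python, the following builds the matrix with elements $q_{ji} = a_{ji} + (1 - |a_{ji}|) r_j$ from a coding matrix and checks that error-free decision functions satisfy $Q\vec{p} = \vec{r}$. The function names `decision_values` and `build_q`, the three-class one-versus-one matrix and the probabilities are illustrative assumptions, not part of the original.

```python
def decision_values(A, p):
    """Ideal decision functions: r_j = sum_i a_ji p_i / sum_i |a_ji| p_i."""
    return [sum(a * pi for a, pi in zip(row, p)) /
            sum(abs(a) * pi for a, pi in zip(row, p)) for row in A]


def build_q(A, r):
    """Matrix Q with elements q_ji = a_ji + (1 - |a_ji|) r_j."""
    return [[a + (1 - abs(a)) * rj for a in row] for row, rj in zip(A, r)]


# Hypothetical example: rows are binary classifiers, columns are classes.
A = [[-1, 1, 0],
     [-1, 0, 1],
     [0, -1, 1]]
p = [0.5, 0.3, 0.2]          # assumed "true" conditional probabilities
r = decision_values(A, p)    # decision functions with no estimation error
Q = build_q(A, r)

# Each row of Q applied to p should reproduce the matching element of r.
Qp = [sum(q * pi for q, pi in zip(row, p)) for row in Q]
```

Because this coding matrix contains zeroes, `Q` changes with `r`; for a strict matrix `build_q` simply returns a copy of `A`.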

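Because the decision functions, $\vec{r}$, are not estimated perfectly, the final probabilities may need to be constrained and the inverse problem solved via minimization:

$$\min_{\vec{p}} |Q \vec{p} - \vec{r}| \tag{7}$$

subject to:

$$\sum_{i=1}^{n_c} p_i = 1 \tag{8}$$

$$\vec{p} \geq \vec{0} \tag{9}$$

where straight brackets, $|\cdot|$, denote a vector norm, which in this case is the Euclidean or $L^2$ norm, and $\vec{0}$ is a vector of all zeroes.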

2.1 Basic inverse solution

Equation (7) can be solved via the normal equation:

$$Q^T Q \vec{p} = Q^T \vec{r} \tag{10}$$

This also takes care of the over-determined case, $n_p > n_c$. Because the binary probability estimates in $\vec{r}$ are rarely perfect, however, in many cases the constraints in (8) and (9) will be violated. Therefore, for most applications, either the results will need to be adjusted, likely reducing accuracy, or the problem constrained.

It is straightforward to incorporate the normalization constraint in (8) into the problem. There are several ways of doing this. The most obvious is to write one probability in terms of the others:

$$p_k = 1 - \sum_{i | i \neq k} p_i \tag{11}$$

where $k$ is an index between $1$ and $n_c$, and solve the following, reduced-dimensional linear system:

$$\sum_{i | i \neq k} (q_{ji} - q_{jk}) p_i = r_j - q_{jk} \tag{12}$$

A more symmetric method is the Lagrange multiplier which will be derived in Section 3.2. Lawson and Hanson (1995) discuss at least two other methods of enforcing equality constraints on linear least squares problems. Since they are inequality constraints, those in (9) are harder to enforce and details will be left to a later section.

2.2 Voting solution

In many other texts (Allwein et al., 2000; Hsu and Lin, 2002; Dietterich and Bakiri, 1995), the class of the test point is determined by how close $\vec{r}$ is to each of the columns in $A$:

$$c(\vec{x}) = \arg\min_i |\vec{a}^{(i)} - \vec{r}|$$

where $\vec{a}^{(i)}$ is the $i$th column of $A$. For the norm, $|\cdot|$, Hamming distance is frequently used, which is the number of bits that must be changed in a binary number in order for it to match another binary number. This assumes that each decision function returns only one of two values: $r_j \in \{-1, +1\}$. If the coding matrix is strict, then:

$$|\vec{a}^{(i)} - \vec{r}| = \sum_j \left(1 - \delta_{a_{ji} r_j}\right)$$

where $\delta$ is the Kronecker delta. Allwein et al. (2000) tailor the metric on the basis of the binary classifier used, each of which will return a different type of continuous decision function (that doesn’t represent the difference in conditional probabilities).

Here we are assuming that the decision functions return an approximation of the difference in conditional probabilities of the binary classifier. In this case a more natural choice of metric is the Euclidean. Expanding:

$$c(\vec{x}) = \arg\min_i |\vec{a}^{(i)} - \vec{r}|^2 = \arg\min_i \left( |\vec{a}^{(i)}|^2 - 2 \vec{a}^{(i)} \cdot \vec{r} + |\vec{r}|^2 \right)$$

The length of $\vec{r}$ is independent of $i$, hence it can be eliminated from the expression. For the strict case, the length of each column will also be constant at $\sqrt{n_p}$. Even for the non-strict case, we would expect the column lengths to be close for typical coding matrices; for instance, the column lengths are equal in the one-versus-one case. Eliminating these two terms produces a voting solution:

$$c(\vec{x}) = \arg\max_i \vec{a}^{(i)} \cdot \vec{r} = \arg\max_i \sum_{j=1}^{n_p} a_{ji} r_j$$

That is, if the sign of $r_j$ matches the $j$th element of the column, then a vote is cast in proportion to the size of $r_j$ for the class label corresponding to the column number.

A voting solution can be used for any coding matrix and is especially appropriate if each $r_j$ returns only $-1$ or $+1$. The LIBSVM library, for instance, uses a one-versus-one arrangement with a voting solution if probabilities are not required (Chang and Lin, 2011). The disadvantage of a voting solution is that, except in special circumstances such as an orthogonal coding matrix (see Section 3.5, below), it does not return calibrated estimates of the probabilities.
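The voting rule can be sketched in a few lines of Python: each class receives the sum, over binary classifiers, of its coding-matrix entry times the decision value, and the class with the largest total wins. The function name `vote` and the example values are assumptions made for illustration.

```python
def vote(A, r):
    """Return (winning class index, per-class scores) for a voting solution.

    A is the coding matrix (rows = binary classifiers, columns = classes)
    and r holds one decision value per binary classifier.
    """
    n_classes = len(A[0])
    scores = [sum(row[i] * rj for row, rj in zip(A, r))
              for i in range(n_classes)]
    winner = max(range(n_classes), key=scores.__getitem__)
    return winner, scores


# Hypothetical one-versus-one coding matrix for three classes.
A = [[-1, 1, 0],
     [-1, 0, 1],
     [0, -1, 1]]
winner, scores = vote(A, [-0.25, -0.43, -0.2])
```

The scores are useful for ranking classes but, as noted above, they are not in general calibrated probabilities.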

3 Common coding matrices

There are a number of standard, symmetric coding matrices that are commonly used to solve for multi-class. These include “one-versus-the-rest”, “one-versus-one”, as well as error-correcting coding matrices such as orthogonal and random. We discuss each of these in turn and use them to demonstrate how to solve for the conditional probabilities while enforcing the constraints, expanding out to the general solution for “error-correcting-codes.”

3.1 One-versus-the-rest

Common coding matrices include “one-versus-the-rest” in which we take each class and train it against the rest of the classes. For $n_c = 4$ it works out to:

$$A = \begin{bmatrix} 1 & -1 & -1 & -1 \\ -1 & 1 & -1 & -1 \\ -1 & -1 & 1 & -1 \\ -1 & -1 & -1 & 1 \end{bmatrix}$$

or in the general case:

$$a_{ij} = 2\delta_{ij} - 1$$

Probabilities for the one-versus-the-rest can be solved for directly by simply writing out one side of the equation:

$$p_i = \frac{r_i + 1}{2}$$

The normalization constraint, (8), can be enforced through the use of a Lagrange multiplier; see the next section.
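A short Python sketch of the one-versus-the-rest case follows. It assumes that the “one” class is coded $+1$, so that each decision value gives $p_i = (r_i + 1)/2$ directly. The final rescaling to unit sum is a simple stand-in chosen for brevity, not the Lagrange-multiplier approach mentioned in the text, and the function names are hypothetical.

```python
def one_vs_rest(n_classes):
    """Coding matrix with +1 on the diagonal and -1 elsewhere."""
    return [[1 if i == j else -1 for i in range(n_classes)]
            for j in range(n_classes)]


def ovr_probabilities(r):
    """Estimate class probabilities from one-versus-the-rest decision values.

    Each r_i approximates p_i - (1 - p_i), hence p_i = (r_i + 1) / 2.
    The estimates are then rescaled to sum to one (an assumption made
    here for simplicity).
    """
    raw = [(ri + 1) / 2 for ri in r]
    total = sum(raw)
    return [x / total for x in raw]


A = one_vs_rest(4)
p = ovr_probabilities([0.0, -0.4, -0.6])
```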

3.2 One-versus-one

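In a “one-versus-one” solution, we train each class against every other class. For $n_c = 4$:

$$A = \begin{bmatrix} -1 & 1 & 0 & 0 \\ -1 & 0 & 1 & 0 \\ -1 & 0 & 0 & 1 \\ 0 & -1 & 1 & 0 \\ 0 & -1 & 0 & 1 \\ 0 & 0 & -1 & 1 \end{bmatrix}$$

The one-versus-one solution is used in LIBSVM (Chang and Lin, 2011).

Consider the following variation of (4):

$$\sum_{i=1}^{n_c} \left(a_{ji} - |a_{ji}| r_j\right) p_i = 0$$

We can include the normalization constraint, (8), via a Lagrange multiplier:

$$\min_{\vec{p}, \lambda} \left\{ |\tilde{Q} \vec{p}|^2 + 2 \lambda \left(\vec{1} \cdot \vec{p} - 1\right) \right\}, \qquad \tilde{q}_{ji} = a_{ji} - |a_{ji}| r_j$$

which produces the following linear system:

$$\begin{bmatrix} \tilde{Q}^T \tilde{Q} & \vec{1} \\ \vec{1}^T & 0 \end{bmatrix} \begin{bmatrix} \vec{p} \\ \lambda \end{bmatrix} = \begin{bmatrix} \vec{0} \\ 1 \end{bmatrix}$$

where $\vec{1}$ is a vector of all ones. It can be shown that with this solution for a 1-vs-1 coding matrix, inequality constraints in (9) are always satisfied (Wu et al., 2004).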

Hsu and Lin (2002) find that the one-vs.-one method is more accurate for support vector machines (SVM) than either error-correcting codes or one-vs.-the-rest.
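The following Python sketch illustrates one way to solve a one-versus-one problem with the normalization constraint enforced through a Lagrange multiplier. It assumes the homogeneous form $\sum_i (a_{ji} - |a_{ji}| r_j) p_i = 0$ of the decision-function equation and solves the resulting bordered linear system by Gaussian elimination. The helper names `solve`, `one_vs_one` and `ovo_probabilities` are illustrative, and the sign convention (first class of each pair coded $-1$) is an assumption.

```python
def solve(M, b):
    """Solve M x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [bi] for row, bi in zip(M, b)]
    for k in range(n):
        piv = max(range(k, n), key=lambda i: abs(aug[i][k]))
        aug[k], aug[piv] = aug[piv], aug[k]
        for i in range(k + 1, n):
            f = aug[i][k] / aug[k][k]
            for j in range(k, n + 1):
                aug[i][j] -= f * aug[k][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def one_vs_one(n_classes):
    """One row per pair of classes: first class -1, second class +1."""
    rows = []
    for i in range(n_classes):
        for k in range(i + 1, n_classes):
            row = [0] * n_classes
            row[i], row[k] = -1, 1
            rows.append(row)
    return rows


def ovo_probabilities(A, r):
    """Least-squares probabilities subject to sum(p) = 1."""
    n = len(A[0])
    # Rows of the homogeneous system: (a_ji - |a_ji| r_j) p_i summed to zero.
    Qt = [[a - abs(a) * rj for a in row] for row, rj in zip(A, r)]
    # Bordered system [[Qt^T Qt, 1], [1^T, 0]] [p, lambda] = [0, 1].
    M = [[sum(q[i] * q[k] for q in Qt) for k in range(n)] + [1.0]
         for i in range(n)]
    M.append([1.0] * n + [0.0])
    return solve(M, [0.0] * n + [1.0])[:n]
```

With error-free decision values the routine recovers the probabilities that generated them; with noisy values it returns the normalized least-squares compromise.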

3.3 Exhaustive codes

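An exhaustive coding matrix is a strict coding matrix in which every possible partition is listed. Again for $n_c = 4$:

$$A = \begin{bmatrix} -1 & -1 & -1 & 1 \\ -1 & -1 & 1 & -1 \\ -1 & -1 & 1 & 1 \\ -1 & 1 & -1 & -1 \\ -1 & 1 & -1 & 1 \\ -1 & 1 & 1 & -1 \\ -1 & 1 & 1 & 1 \end{bmatrix}$$

This is like counting in binary except zero is omitted and we only count half way so as to eliminate degenerate partitions. A disadvantage of exhaustive codes is that the number of partitions, $2^{n_c - 1} - 1$, grows exponentially with the number of classes, making them slow for moderate numbers of classes and intractable for very large ones.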
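The “counting in binary” description translates directly into code. The sketch below, with the hypothetical name `exhaustive_codes`, counts from one up to half of the available range and maps the bits 0 and 1 to $-1$ and $+1$; this bit-to-sign mapping is an assumption.

```python
def exhaustive_codes(n_classes):
    """Exhaustive strict coding matrix.

    Rows are the binary expansions of 1 .. 2**(n_classes - 1) - 1,
    with bit 0 mapped to -1 and bit 1 mapped to +1. Stopping half way
    avoids listing each partition twice with the signs flipped.
    """
    rows = []
    for k in range(1, 2 ** (n_classes - 1)):
        bits = format(k, "0{}b".format(n_classes))
        rows.append([1 if b == "1" else -1 for b in bits])
    return rows


A = exhaustive_codes(4)   # 7 partitions of four classes
```

The row count doubles (plus one) with each added class, which is the exponential growth noted above.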

3.4 Error correcting codes

Another common coding matrix is an arbitrary one: this is commonly known as an “error-correcting” code (Dietterich and Bakiri, 1995). It can be random, but may also be carefully designed (Crammer and Singer, 2002; Zhou et al., 2008). In principle this case covers all the earlier ones; in practice, however, the term can also refer more specifically to a random coding matrix. We cover the solution of the general case, which includes any random matrix, in Section 4, below.

3.5 Orthogonal codes

To maximize the accuracy of an error-correcting coding matrix, Allwein et al. (2000) and Windeatt and Ghaderi (2002) show that the distance between each of the columns, $|\vec{a}^{(i)} - \vec{a}^{(j)}|$, should be as large as possible, where $\vec{a}^{(i)}$ is the $i$th column of the matrix $A$ and $i \neq j$. If we take the upright brackets once again to be a Euclidean metric and assume that $A$ is “strict” then this reduces to minimizing the absolute value of the dot product, $|\vec{a}^{(i)} \cdot \vec{a}^{(j)}|$. The absolute value is used because a pair of columns that are the same except for a factor of -1 are degenerate.

In other words, the optimal coding matrix will be orthogonal, $A^T A = n_p I$, where $I$ is the $n_c \times n_c$ identity matrix and $n_p$ is the number of partitions (rows of $A$). Orthogonal coding matrices are not hard to construct for certain class sizes, for instance:

For an orthogonal coding matrix, the voting solution will be equivalent to the unconstrained least-squares solution. Using this property, Mills (2017) provides a fast, simple and elegant iterative solution for solving for conditional probabilities when using a “strict” orthogonal coding matrix.

4 Solving for all the probabilities in the general case

We are interested here in a general method of solving for all the probabilities in multi-class classification based on error-correcting codes. Ideally, this would be an exact solution to the constrained least-squares problem in Equations (7)-(9) but might also minimize some other cost function. Approximate solutions might be useful as well.

4.1 General comments

Figure 1: Solving for the multi-class conditional probabilities with three classes.

Once the normalization constraint in (8) has been applied, the remaining inequality constraints for the minimization problem in (7) to (9) form a triangular hyper-pyramid in a space of dimension $n_c - 1$. See Figure 1. Stating this as a more general constrained optimization problem, we have:


subject to:


where is a vector of length and is an matrix.

Suppose we transform this into a new problem by interpolating between the vertices of the hyper-pyramid. First, we find the vertices by solving the following linear system:

where is the th vertex of the hyper-pyramid.

We can locate any point inside the hyper-pyramid as follows:

where has the same properties as a probability:

In other words, any minimization problem of the form of (13) and (14) can be transformed into a problem of the same form as (7) to (9) and vice versa.

The first line of attack in solving constrained minimization problems of the type we are discussing here is the Karush-Kuhn-Tucker (KKT) conditions, which generalize Lagrange multipliers to inequality constraints (Lawson and Hanson, 1995; Boyd and Vandenberghe, 2004). For the minimization problem in (7)-(9), the KKT conditions translate to:



or more succinctly:

Another important property of the problem is that it is completely convex. A convex function, $f$, has the following property:

$$f(\theta \vec{x}_1 + (1 - \theta) \vec{x}_2) \leq \theta f(\vec{x}_1) + (1 - \theta) f(\vec{x}_2)$$

where $\theta \in [0, 1]$ is a coefficient. Meanwhile, in a convex set, $S$:

$$\theta \vec{x}_1 + (1 - \theta) \vec{x}_2 \in S \quad \text{for all } \vec{x}_1, \vec{x}_2 \in S$$

(Boyd and Vandenberghe, 2004). The convexity property means that any local minimum is also a global minimum; moreover, simple gradient descent algorithms should always eventually reach it. Both the convexity property and the KKT conditions are used in the Lawson and Hanson (1995) solution to inequality constrained least squares problems. See Section 4.3, below.

4.2 Zadrozny solution

Zadrozny (2001) describes the following, iterative method of solving for the probabilities using an arbitrary coding matrix, starting with a guess for the approximated probabilities, :

  • Set

  • For each : Set

  • Set ; Set

  • Repeat until convergence.

where is the number of training samples in the th class. The technique minimizes the weighted Kullback-Leibler distance between actual and calculated binary probabilities:

as opposed to the usual Euclidean distance. The method supplies probability estimates roughly as accurate as the others described here, however our tests indicate that convergence is too slow to be useful.

4.3 Lawson and Hanson solution

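Lawson and Hanson (1995) describe an iterative solution to the following inequality constrained least-squares problem:

$$\min_{\vec{x}} |E \vec{x} - \vec{f}|$$

subject to:

$$\vec{x} \geq \vec{0}$$

where $E$ is an $m \times n$ matrix.

The solution is divided into a set $\mathcal{P}$ containing the indices of all the non-zero values and a set $\mathcal{Z}$ containing the indices of the zero values. The algorithm is as follows:

  1. Set $\mathcal{P} = \emptyset$; $\mathcal{Z} = \{1, 2, ..., n\}$; $\vec{x} = \vec{0}$.

  2. Compute the $n$-vector $\vec{w} = E^T (\vec{f} - E \vec{x})$.

  3. If the set $\mathcal{Z}$ is empty or if $w_j \leq 0$ for all $j \in \mathcal{Z}$ go to Step 12.

  4. Find an index $t \in \mathcal{Z}$ such that $w_t = \max \{w_j | j \in \mathcal{Z}\}$.

  5. Move the index $t$ from set $\mathcal{Z}$ to set $\mathcal{P}$.

  6. Solve the least squares problem $E_{\mathcal{P}} \vec{z} = \vec{f}$ where $E_{\mathcal{P}}$ = (columns of $E$ whose corresponding indices are in $\mathcal{P}$), with $z_j = 0$ for $j \in \mathcal{Z}$.

  7. If $z_j > 0$ for all $j \in \mathcal{P}$, set $\vec{x} = \vec{z}$ and go to Step 2.

  8. Find an index $q \in \mathcal{P}$ such that $x_q / (x_q - z_q) = \min \{x_j / (x_j - z_j) | z_j \leq 0, j \in \mathcal{P}\}$.

  9. Set $\alpha = x_q / (x_q - z_q)$.

  10. Set $\vec{x} = \vec{x} + \alpha (\vec{z} - \vec{x})$.

  11. Move from set $\mathcal{P}$ to set $\mathcal{Z}$ all indices $j$ for which $x_j = 0$. Go to step 6.

  12. The computation is completed.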

There are two loops to the algorithm. In the first loop, a new index is added to the non-zero set in each iteration. So long as the solution doesn’t go out-of-bounds, this continues until all the indices are added to the set . In the second loop, if it’s found that one of the variables has gone out-of-bounds, then the solution is adjusted to the nearest point in-bounds between the old solution and the new.
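The two loops can be sketched in plain Python as follows. This is an illustrative active-set implementation of non-negative least squares, not the authors' code: the names `lstsq_subset` and `nnls`, the tolerance `tol` and the iteration cap are assumptions, and the inner least-squares step uses the normal equations for brevity where a production code would use a more stable factorization.

```python
def lstsq_subset(E, f, cols):
    """Least-squares solution of E[:, cols] z = f via the normal equations."""
    k = len(cols)
    M = [[sum(row[a] * row[b] for row in E) for b in cols] for a in cols]
    rhs = [sum(row[a] * fi for row, fi in zip(E, f)) for a in cols]
    for c in range(k):  # Gaussian elimination with partial pivoting
        piv = max(range(c, k), key=lambda i: abs(M[i][c]))
        M[c], M[piv] = M[piv], M[c]
        rhs[c], rhs[piv] = rhs[piv], rhs[c]
        for i in range(c + 1, k):
            g = M[i][c] / M[c][c]
            for j in range(c, k):
                M[i][j] -= g * M[c][j]
            rhs[i] -= g * rhs[c]
    z = [0.0] * k
    for i in reversed(range(k)):  # back substitution
        z[i] = (rhs[i] - sum(M[i][j] * z[j] for j in range(i + 1, k))) / M[i][i]
    return z


def nnls(E, f, tol=1e-12, max_iter=1000):
    """Minimize |E x - f| subject to x >= 0 with an active-set iteration."""
    n = len(E[0])
    P, Z = [], list(range(n))        # non-zero and zero index sets
    x = [0.0] * n
    for _ in range(max_iter):
        # Gradient-like vector w = E^T (f - E x).
        resid = [fi - sum(e * xi for e, xi in zip(row, x))
                 for row, fi in zip(E, f)]
        w = [sum(row[j] * ri for row, ri in zip(E, resid)) for j in range(n)]
        if not Z or all(w[j] <= tol for j in Z):
            break                    # no remaining index can reduce the cost
        t = max(Z, key=lambda j: w[j])
        Z.remove(t)
        P.append(t)
        while True:                  # second loop: keep the iterate in bounds
            z = [0.0] * n
            for j, zj in zip(P, lstsq_subset(E, f, P)):
                z[j] = zj
            if all(z[j] > 0 for j in P):
                x = z
                break
            # Step back towards x until the first variable reaches zero.
            alpha = min(x[j] / (x[j] - z[j]) for j in P if z[j] <= 0)
            x = [xi + alpha * (zi - xi) for xi, zi in zip(x, z)]
            for j in [j for j in P if x[j] <= tol]:
                P.remove(j)
                Z.append(j)
                x[j] = 0.0
    return x
```

For example, `nnls([[1, 1], [1, -1]], [0, 2])` has an unconstrained solution with a negative second component, so the constrained solution keeps that component at zero.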

Lawson and Hanson (1995) describe several methods of combining equality constraints such as the normalization constraint in (8) with the above constrained least squares problem. The most obvious is to repeat the variable substitution described in Equations (11) to (12) until the excluded probability is less-than-or-equal to .

5 Decision-trees

The most obvious method of dividing up a multi-class problem into binary classifiers is hierarchically using a decision tree (Cheon et al., 2004; Lee and Oh, 2003). In this method, the classes are first divided into two partitions, then those partitions are each divided into two partitions and so on until only one class remains. The classification scheme is hierarchical, with all the losing classes being excluded from consideration at each step. Only the conditional probability of the winning class is calculated as the product of all the returned conditional probabilities of the binary classifiers.

Decision trees have the advantage that they are fast since on average they require only $\log_2 n_c$ classifications and there is no need to solve a constrained matrix inverse. On the other hand, because there is less information being taken into consideration, they may be less accurate.
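A recursive descent of such a tree can be sketched as below. The sketch assumes each node holds a binary decision function returning the difference in conditional probabilities, so the probability of the chosen branch is $(1 + |r|)/2$, and the probability of the winning class is the product along the path. The dictionary layout, the name `classify_tree` and the constant-valued classifiers are all hypothetical.

```python
def classify_tree(node, x):
    """Descend a binary tree of classifiers; return (class, probability)."""
    if not isinstance(node, dict):
        return node, 1.0                      # leaf: a class label
    r = node["model"](x)                      # r ~ P(+1|x) - P(-1|x)
    if r >= 0:
        branch, prob = node["pos"], (1 + r) / 2
    else:
        branch, prob = node["neg"], (1 - r) / 2
    label, p_below = classify_tree(branch, x)
    return label, prob * p_below              # product along the path


# Hypothetical tree with stand-in classifiers that ignore their input.
tree = {
    "model": lambda x: 0.6,                   # field (-1) versus forest (+1)
    "neg": {"model": lambda x: 0.2, "neg": "corn", "pos": "wheat"},
    "pos": {"model": lambda x: -0.5, "neg": "evergreen", "pos": "deciduous"},
}
label, prob = classify_tree(tree, x=None)
```

Only the classifiers along one path are evaluated, which is where the speed advantage comes from.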

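Interestingly, the same partitions created for a decision tree can also be used in a non-hierarchical scheme to solve for all of the conditional probabilities. Consider the following coding matrix for instance:

$$A = \begin{bmatrix} -1 & -1 & -1 & -1 & 1 & 1 & 1 & 1 \\ -1 & -1 & 1 & 1 & 0 & 0 & 0 & 0 \\ -1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & -1 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & -1 & -1 & 1 & 1 \\ 0 & 0 & 0 & 0 & -1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & -1 & 1 \end{bmatrix}$$

While there are only seven rows for eight classes, once we add in the constraint in (8) the system becomes fully determined.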
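The translation from a tree to a coding matrix can be sketched as follows: every internal node contributes one row, with the classes under its first branch coded $-1$, those under its second branch coded $+1$, and all others coded $0$. Representing the tree as nested tuples of integer class labels, and the function names, are assumptions made for this illustration.

```python
def leaves(node):
    """List the class labels under a node of a nested-tuple tree."""
    if isinstance(node, int):
        return [node]
    return leaves(node[0]) + leaves(node[1])


def tree_to_coding_matrix(tree, n_classes):
    """One row per internal node: left subtree -1, right subtree +1."""
    rows = []

    def visit(node):
        if isinstance(node, int):
            return
        row = [0] * n_classes
        for c in leaves(node[0]):
            row[c] = -1
        for c in leaves(node[1]):
            row[c] = 1
        rows.append(row)
        visit(node[0])
        visit(node[1])

    visit(tree)
    return rows


# Balanced tree over eight classes labelled 0..7.
tree = (((0, 1), (2, 3)), ((4, 5), (6, 7)))
A = tree_to_coding_matrix(tree, 8)
```

A binary tree over $n_c$ leaves always has $n_c - 1$ internal nodes, hence seven rows here.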

5.1 Variations

There are many variations on the method. Ramanan et al. (2007) train a one-versus-the-rest model at each level of the tree so that if the “one” class is returned, the lower levels of the tree are short circuited and this class is selected for the final result. Otherwise, the one class is left out of subsequent analysis. This is less a new method than simply a means of shaping the tree appropriate to datasets with unbalanced numbers of classes, for instance the “Shuttle” dataset (King et al., 1995).

In a decision directed acyclic graph (DDAG), rather than testing one group against another, each node of the tree tests one class against another (Platt et al., 2000). The losing class is excluded from subsequent analysis. Whereas the previous paragraph describes the “tree” version of one-versus-the-rest, the DDAG is the tree version of one-versus-one. In a DDAG, there are multiple paths to the same node.

5.2 Empirically designed trees

Figure 2: Diagram of hierarchical or decision tree multi-class classification using a hypothetical surface-classification problem.

Consider the following land-classification problem: you have remote-sensing measurements of four surface types: corn field, wheat field, evergreen forest and deciduous forest. How do you divide up the tree to best classify the measurements into one of these four surface types? A priori, it would make the most sense to first divide them by their more general grouping: field versus forest and then, once you have field, classify by type of field, or if you have forest, classify by the type of forest. This is illustrated in Figure 2.

In this case we have prior knowledge of how the classes are related to one another. On the other hand the classes may be too abstract to have any knowledge without examining the actual training data. Many different methods of empirically designing both decision trees and coding matrices have been shown in the literature. Cheon et al. (2004), for instance, use a self-organizing-map (SOM) (Kohonen, 2000) to visualize the relationship between the classes while Lee and Oh (2003) use a genetic algorithm to optimize the decision tree.

Benabdeslem and Bennani (2006) design the tree by measuring the distance between the classes and building a dendrogram. This seems the most straightforward approach and is interesting in that it reduces a very large problem involving probabilities into a much smaller one. Consider the problem above: it stands to reason that the field and forest classes would be much more strongly separated than either of the sub-classes within. That is the interclass distance between field and forest is larger.

How does one measure the interclass distance? Fundamentally this is a distance measure between two sets and there are many methods of determining it. We could notate this as follows:

where is a distance operator between two distributions and is a distance operator between two sets.

Consider the square of the distance between the means of the two classes divided by their standard deviations. Let:

be the mean of the th class distribution where is the number of instances of that class, while:

is the standard deviation. Then let the distance between the two classes be:

That is, the closer the centers of the two classes, the shorter the distance, while the wider each class is, the farther the distance.

This would work well if each of the classes is quite distinct and clustered around a strong center. But for more diffuse classes, especially those with multiple centers, it would make more sense to use a metric designed specifically for sets rather than this somewhat crude adaptation of a vector metric. In this regard, the Hausdorff metric seems tailor-made for this application.

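For training samples from a pair of classes, which form two finite sets, the Hausdorff distance works out to (Ott, 1993; Gulick, 1992):

$$D_{Hij} = \max \left\{ \max_{k | y_k = i} \min_{l | y_l = j} |\vec{x}_k - \vec{x}_l|, \; \max_{l | y_l = j} \min_{k | y_k = i} |\vec{x}_k - \vec{x}_l| \right\}$$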
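For two finite sets of points, the Hausdorff distance is the larger of the two directed distances, each being the greatest distance from a point in one set to its nearest neighbour in the other. A brute-force Python sketch, with the hypothetical name `hausdorff` and Euclidean distance assumed as the underlying metric, is:

```python
import math


def hausdorff(set_a, set_b):
    """Hausdorff distance between two finite sets of points.

    Uses a brute-force search over all pairs, so the cost grows with
    the product of the two set sizes.
    """
    def directed(u, v):
        # Greatest distance from a point in u to its nearest point in v.
        return max(min(math.dist(a, b) for b in v) for a in u)

    return max(directed(set_a, set_b), directed(set_b, set_a))


# Hypothetical samples from two classes in a two-dimensional feature space.
class_i = [(0.0, 0.0), (1.0, 0.0)]
class_j = [(0.0, 0.0), (4.0, 0.0)]
d = hausdorff(class_i, class_j)
```

Computing this for every pair of classes gives the distance matrix from which a dendrogram can be built.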

6 Unifying framework

Since there are many ways of solving the multi-class classification problem, we present here a descriptive control language that unifies many of the ideas presented in the previous sections. This is not a “one-size-fits-all” solution, but rather a means of specifying a particular partitioning that best suits the problem at hand. This partitioning could be arrived at either through prior knowledge, or empirically, by measuring the distance between the classes for instance, or by simply exhaustively testing different configurations.

In Backus-Naur form (BNF) the control language looks like this:

branch ::= model “{” branch-list “}” | CLASS
model ::= TWOCLASS | partition-list
branch-list ::= branch | branch-list branch
partition-list ::= partition | partition-list partition
partition ::= TWOCLASS class-list “ / ” class-list “;”
class-list ::= CLASS | class-list “ ” CLASS


where CLASS is a class value between 0 and $n_c - 1$. It is used in two senses. It may be one of the class values in a partition in a non-hierarchical model. In this case its value is relative, that is, local to the non-hierarchical model. It may also be the class value returned from a top level partition in the hierarchy, in which case its value is absolute. TWOCLASS is a binary classification model. This could either be the name of a model that has already been trained or it could be a list of options or specifications used to train said model.

For example, a one-versus-one specification for four classes would look like this:

  model01 0 / 1;
  model02 0 / 2;
  model03 0 / 3;
  model12 1 / 2;
  model13 1 / 3;
  model23 2 / 3;
  {0 1 2 3}

while a one-versus-the-rest specification, also for four classes, would look like this:

  model0 1 2 3 / 0;
  model1 0 2 3 / 1;
  model2 0 1 3 / 2;
  model3 0 1 2 / 3;
  {0 1 2 3}
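To make the correspondence between a non-hierarchical specification and its coding matrix concrete, here is a small Python sketch that parses partition statements of the form `NAME class-list / class-list;`. It is not the libAGF parser: the function name `parse_partitions` is hypothetical, and the sketch assumes that classes to the left of the slash are coded $-1$ and those to the right $+1$.

```python
def parse_partitions(text):
    """Parse 'NAME neg-classes / pos-classes;' statements.

    Returns (model names, coding matrix). Anything without a slash,
    such as the trailing '{...}' class list, is ignored.
    """
    names, parts = [], []
    for stmt in text.split(";"):
        tokens = stmt.split()
        if "/" not in tokens:
            continue
        slash = tokens.index("/")
        names.append(tokens[0])
        neg = [int(t) for t in tokens[1:slash]]
        pos = [int(t) for t in tokens[slash + 1:]]
        parts.append((neg, pos))
    n_classes = 1 + max(c for neg, pos in parts for c in neg + pos)
    A = []
    for neg, pos in parts:
        row = [0] * n_classes
        for c in neg:
            row[c] = -1
        for c in pos:
            row[c] = 1
        A.append(row)
    return names, A


spec = """
  model01 0 / 1;
  model02 0 / 2;
  model03 0 / 3;
  model12 1 / 2;
  model13 1 / 3;
  model23 2 / 3;
  {0 1 2 3}
"""
names, A = parse_partitions(spec)
```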

A hierarchical specification might look like this:

  TreeVsField {
    EvergreenVsDeciduous {0 1}
    CornVsWheat {2 3}
  }

The framework allows the two methods, hierarchical and non-hierarchical, to be combined as in the following, nine-class example:

  TREESvsFIELD 0 / 1;
  TREESvsWATER 0 / 2;
  FIELDvsWATER3 1 / 2;
    {1 2 3}
    CORNvsWHEAT 0 / 1;
    CORNvsLEGUME 0 / 2;
    WHEATvsLEGUME 1 / 2;
    {4 5 6}
    FRESHvsSALT 0 / 1;
    FRESHvsMARSH 0 / 2;
    SALTvsMARSH 1 / 2;
    {7 8 9}

The above demonstrates how the feature might be useful on a hypothetical surface-classification problem with the key as follows:

0 Deciduous forest
1 Evergreen forest
2 Shrubs
3 Corn field
4 Wheat field
5 Legume field
6 Freshwater
7 Saltwater
8 Marsh

7 Numerical trials

We wish to test a synthesis of the ideas contained in this review on some real datasets. To this end, we will test eight different datasets using six configurations solved using four different methods. The configurations are: one-vs.-one, one-vs.-rest, orthogonal partitioning, “adjacent” partitioning (see below), an arbitrary tree, and a tree generated from a bottom-up dendrogram using the Hausdorff metric. The solution methods are: constrained least squares as described in Section 4.3, matrix inverse which is specific to one-vs.-one, the iterative method designed for orthogonal partitioning (see Section 3.5), and recursive ascent, which is appropriate only for hierarchical or tree-based configurations.

The control language allows us to represent any type of multi-class configuration relatively succinctly, including different parameters used for the binary classifiers. To illustrate the operation of the empirical partitioning, here are control files for the shuttle dataset. The arbitrary, balanced tree is as follows:

shuttle_hier {
  shuttle_hier.00 {
    shuttle_hier.00.01 {
  shuttle_hier.01 {
    shuttle_hier.01.00 {
    shuttle_hier.01.01 {

While the empirically-designed tree is:

shuttle_emp {
  shuttle_emp.00  {
    shuttle_emp.00.00 {
      shuttle_emp.00.00.00 {
        shuttle_emp. {
          shuttle_emp. {
class number
0 45586
1 50
2 171
3 8903
4 3267
5 10
6 13
Table 1: Class distribution in the shuttle dataset.
Figure 3: Multiclass decision trees for the shuttle dataset. In (a) we build the tree in a rigid pattern whereas in (b) the tree is a dendrogram based on the Hausdorff distance between each class.

Note that the shuttle dataset is very unbalanced, as listed in Table 1, hence the empirically-designed tree looks more like a chain as illustrated in Figure 3. To solve the hierarchical models using least-squares, hierarchical models were first translated to non-hierarchical, as discussed in Section 5. The above, for instance, becomes:

shuttle_emp 0 1 2 3 4 5 / 6;
shuttle_emp.00 0 1 2 3 4 / 5;
shuttle_emp.00.00 0 1 2 3 / 4;
shuttle_emp.00.00.00 0 1 2 / 3;
shuttle_emp. 0 1 / 2;
shuttle_emp. 0 / 1;
{ 2 1 6 5 3 4 0}

The “adjacent” partitioning is as follows:

shuttle_adj-00 0 / 1 2 3 4 5 6;
shuttle_adj-00 0 1 / 2 3 4 5 6;
shuttle_adj-00 0 1 2 / 3 4 5 6;
shuttle_adj-00 0 1 2 3 / 4 5 6;
shuttle_adj-00 0 1 2 3 4 / 5 6;
shuttle_adj-00 0 1 2 3 4 5 / 6;
{0 1 2 3 4 5 6}

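with corresponding coding matrix:

$$A = \begin{bmatrix} -1 & 1 & 1 & 1 & 1 & 1 & 1 \\ -1 & -1 & 1 & 1 & 1 & 1 & 1 \\ -1 & -1 & -1 & 1 & 1 & 1 & 1 \\ -1 & -1 & -1 & -1 & 1 & 1 & 1 \\ -1 & -1 & -1 & -1 & -1 & 1 & 1 \\ -1 & -1 & -1 & -1 & -1 & -1 & 1 \end{bmatrix}$$

The rationale for using it will be explained in Section 8, below.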

7.1 Data and software

Name       Features  Type     Classes  Samples  Reference
letter     16        integer  26       20000    (Frey and Slate, 1991)
pendigits  16        integer  10       10992    (Alimoglu, 1996)
usps       256       float    10       9292     (Hull, 1994)
segment    19        float    7        2310     (King et al., 1995)
sat        36        float    6        6435     (King et al., 1995)
urban      147       float    9        675      (Johnson, 2013)
shuttle    9         float    7        58000    (King et al., 1995)
humidity   7         float    8        8600     (Mills, 2009)

Humidity dataset has been sub-sampled to keep training times reasonable.

Table 2: Summary of datasets used in the analysis

The datasets tested are as follows: “pendigits” and “usps” are both digit recognition problems (Alimoglu, 1996; Hull, 1994); the “letter” dataset is another text-recognition problem that classifies letters rather than numbers (Frey and Slate, 1991); the “segment” dataset is a pattern-based image-classification problem; the “sat” dataset is a satellite land-classification problem; the “shuttle” dataset predicts different flight configurations on the space shuttle (Michie et al., 1994; King et al., 1995); the “urban” dataset is another pattern-recognition dataset for urban land cover (Johnson, 2013); and the “humidity” dataset classifies humidity values based on satellite radiometry (Mills, 2009). The characteristics of each dataset are summarized in Table 2.

The base binary classifier used to test the ideas in this paper is a support vector machine (SVM) (Müller et al., 2001). We use LIBSVM (Chang and Lin, 2011) to perform the training using the svm-train command. LIBSVM is a simple yet powerful library for SVM that implements multiple kernel types and includes two different regularization methods. It was developed by Chih-Chung Chang and Chih-Jen Lin of the National Taiwan University in Taipei and can be downloaded at:

Everything else was done using the libAGF library (Mills, 2011, 2018) which includes extensive codes for generalized multi-class classification. These codes interface seamlessly with LIBSVM and provide for automatic generation of multiple types of control file using the print_control command. Control files are used to train the binary classifiers and then to make predictions using the multi_borders and classify_m commands, respectively. Before making predictions, the binary classifiers were unified to eliminate duplicate support vectors using the mbh2mbm command, thus improving efficiency. LibAGF may be downloaded at:

To evaluate the conditional probabilities we use the Brier score (Brier, 1950; Jolliffe and Stephenson, 2003):

where $n$ is the number of test samples. Meanwhile, we use the uncertainty coefficient to evaluate classification skill. This is a measure based on Shannon’s channel capacity (Shannon and Weaver, 1963) and has a number of advantages over simple fraction correct or “accuracy” (Press et al., 1992; Mills, 2011). If we treat the classifier as a noisy channel, with each classification a single symbol, the true class entering at the transmitter and the estimated class coming out at the receiver, then the uncertainty coefficient is the channel capacity divided by the entropy per symbol.
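One common way to compute an uncertainty coefficient from a confusion matrix is to divide the mutual information between true and estimated classes by the entropy of the true classes. The Python sketch below follows that reading of the description above; the function name and the layout of the confusion matrix (rows are true classes, columns are estimated classes) are assumptions.

```python
import math


def uncertainty_coefficient(confusion):
    """Mutual information of (true, estimated) divided by entropy of true.

    confusion[i][j] counts the test samples with true class i and
    estimated class j. The result is 1 for a perfect classifier and 0
    when the estimates carry no information about the true class.
    """
    total = sum(sum(row) for row in confusion)
    p_true = [sum(row) / total for row in confusion]
    p_est = [sum(col) / total for col in zip(*confusion)]
    h_true = -sum(p * math.log(p) for p in p_true if p > 0)
    mutual = 0.0
    for i, row in enumerate(confusion):
        for j, count in enumerate(row):
            if count > 0:
                p_ij = count / total
                mutual += p_ij * math.log(p_ij / (p_true[i] * p_est[j]))
    return mutual / h_true


perfect = uncertainty_coefficient([[5, 0], [0, 5]])
uninformative = uncertainty_coefficient([[5, 5], [5, 5]])
```

Unlike fraction correct, this measure is unchanged if the estimated labels are consistently permuted, and it is not inflated by a dominant class.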

| term | meaning | cross-ref. |
|---|---|---|
| config. | configuration of multi-class partitioning | Equation (1) |
| method | solution method for computing probabilities | Equation (2) |
| 1 vs. 1 | one-versus-one partitioning | Section 3.2 |
| 1 vs. rest | one-versus-the-rest partitioning | Section 3.1 |
| ortho. | orthogonal coding | Section 3.5 |
| adj. | adjacent partitioning | Equation (7) |
| hier. | “hierarchical” or decision tree partitioning | Section 5 |
| emp. | empirically-designed decision tree | Section 5.2 |
| lsq. | Lawson and Hanson constrained least-squares | Section 4.3 |
| inv. | matrix inverse solution | Section 3.2 |
| iter. | iterative solution for orthogonal codes | Section 3.5 |
| rec. | recursive ascent of decision tree | Section 5 |

Table 3: Key for Tables 4 through 13.

8 Results and discussion

Table 4: Training times in seconds for the first four datasets (letter, pendigits, usps, segment) for six different multi-class configurations. Configurations listed: 1 vs. 1, 1 vs. rest.

Table 5: Training times in seconds for the last four datasets (sat, urban, shuttle, humidity) for six different multi-class configurations. Configurations listed: 1 vs. 1, 1 vs. rest.
Table 6: Classification times in seconds for the first four datasets (letter, pendigits, usps, segment) for eight different multi-class configurations and solution methods. Configurations (methods) listed: 1 vs. 1 (inv.), 1 vs. rest (lsq.), ortho. (iter.), adj. (lsq.), hier. (rec.), emp. (rec.).

Table 7: Classification times in seconds for the last four datasets (sat, urban, shuttle, humidity) for eight different multi-class configurations and solution methods. Configurations (methods) listed: 1 vs. 1 (inv.), 1 vs. rest (lsq.), ortho. (iter.), adj. (lsq.), hier. (rec.), emp. (rec.).
Table 8: Uncertainty coefficients for the first four datasets (letter, pendigits, usps, segment) for eight different multi-class configurations and solution methods. Configurations (methods) listed: 1 vs. 1 (inv.), 1 vs. rest (lsq.), ortho. (iter.), adj. (lsq.), hier. (rec.), emp. (rec.). Table note: A random coding matrix was used since building an orthogonal matrix would take too long using current methods.

Table 9: Uncertainty coefficients for the last four datasets (sat, urban, shuttle, humidity) for eight different multi-class configurations and solution methods. Configurations (methods) listed: 1 vs. 1 (inv.), 1 vs. rest (lsq.), ortho. (iter.), adj. (lsq.), hier. (rec.), emp. (rec.).
Table 10: Brier scores for the first four datasets (letter, pendigits, usps, segment) for six different multi-class configurations. Configurations listed: 1 vs. 1, 1 vs. rest. Table note: A random coding matrix was used.

Table 11: Brier scores for the last four datasets (sat, urban, shuttle, humidity) for six different multi-class configurations. Configurations listed: 1 vs. 1, 1 vs. rest.
Table 12: Brier scores for the first four datasets (letter, pendigits, usps, segment) for eight different multi-class configurations and solution methods. Winning classes only. Configurations (methods) listed: 1 vs. 1 (inv.), 1 vs. rest (lsq.), ortho. (iter.), adj. (lsq.), hier. (rec.), emp. (rec.). Table note: A random coding matrix was used.

Table 13: Brier scores for the last four datasets (sat, urban, shuttle, humidity) for eight different multi-class configurations and solution methods. Winning classes only. Configurations (methods) listed: 1 vs. 1 (inv.), 1 vs. rest (lsq.), ortho. (iter.), adj. (lsq.), hier. (rec.), emp. (rec.).
Figure 4: Control file for a multi-class decision tree designed empirically for the segment dataset. Image type is used for the class labels.

Figure 5: Control file for a multi-class decision tree designed empirically for the pendigits dataset.

Figure 6: Control file for a multi-class decision tree designed empirically for the sat dataset. Surface-type is used for the class labels (for example, DAMP GREY SOIL and RED SOIL).

Results are shown in Tables 4 through 13, with the key given in Table 3. For each result, ten trials were performed with individually randomized test data comprising 30% of the total. Error bars are the standard deviations.

If we take the results for these eight datasets as being representative, several conclusions can be drawn. The first is that, despite the seeming complexity of the problem, a “one-size-fits-all” approach seems perfectly adequate for most datasets. Moreover, this approach is the one-versus-one method, which we should note is used exclusively in LIBSVM (Chang and Lin, 2011). One-vs.-one has other advantages such as the simplicity of solution: a standard linear solver such as Gaussian elimination, QR decomposition or SVD is sufficient, as opposed to a complex, iterative scheme.
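To illustrate why a plain linear solver suffices for one-versus-one, the sketch below recovers the class probabilities from pairwise outputs with an SVD-based least-squares call. It rests on an assumption that is not restated in this section: each binary classifier for the pair (i, j) is taken to return r = (p_i - p_j) / (p_i + p_j), the difference in conditional probabilities restricted to the two classes. Each pair then contributes one linear equation, (1 - r) p_i - (1 + r) p_j = 0, and a final row enforces that the probabilities sum to one. The function name is hypothetical, and the formulation may differ in detail from that of Section 3.2.

```python
import itertools

import numpy as np


def solve_one_vs_one(r, n_classes):
    """Recover class probabilities from pairwise difference estimates.

    r : sequence of length n_classes*(n_classes-1)/2, ordered as
        itertools.combinations(range(n_classes), 2); entry k is the
        assumed output (p_i - p_j) / (p_i + p_j) for pair k = (i, j).
    """
    pairs = list(itertools.combinations(range(n_classes), 2))
    a = np.zeros((len(pairs) + 1, n_classes))
    b = np.zeros(len(pairs) + 1)
    for k, (i, j) in enumerate(pairs):
        # (1 - r) * p_i - (1 + r) * p_j = 0
        a[k, i] = 1.0 - r[k]
        a[k, j] = -(1.0 + r[k])
    # Normalization row: probabilities sum to one.
    a[-1, :] = 1.0
    b[-1] = 1.0
    # numpy.linalg.lstsq uses an SVD-based LAPACK routine.
    p, *_ = np.linalg.lstsq(a, b, rcond=None)
    return p
```

With exact pairwise outputs the system is consistent and the true probabilities are recovered. With noisy outputs the result is a least-squares compromise in which the normalization is only approximately satisfied, and entries can fall slightly outside [0, 1] since no inequality constraints are imposed.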

Further, the partitioning used does not even appear all that critical in most cases. Even a sub-optimal method, such as the “adjacent” partitioning, which makes little sense for datasets in which the classes have no ordering, gives up relatively little accuracy to more sensible methods on most datasets. For the urban dataset it is actually superior, suggesting that there is some kind of ordering to the classes, which are: trees, grass, soil, concrete, asphalt, buildings, cars, pools, shadows. If classification speed is critical, a hierarchical approach will trade off accuracy to save some compute cycles with performance (for SVM) instead of . Accuracy lost is again quite dependent on the dataset.
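The partitionings compared here can be written down as small coding matrices. The sketch below assumes the common convention in which each row is one binary classifier, each column one class, and entries of -1 and +1 assign a class to either side of the partition while 0 leaves it out. The “adjacent” matrix is assumed to split the ordered classes at each successive boundary; the sign and row order of Equation (7) may differ. Function names are hypothetical.

```python
import itertools

import numpy as np


def one_vs_rest(n_classes):
    """Row i separates class i (+1) from all other classes (-1)."""
    return 2 * np.eye(n_classes, dtype=int) - 1


def one_vs_one(n_classes):
    """One row per pair (i, j): class i is -1, class j is +1, rest 0."""
    pairs = list(itertools.combinations(range(n_classes), 2))
    a = np.zeros((len(pairs), n_classes), dtype=int)
    for k, (i, j) in enumerate(pairs):
        a[k, i] = -1
        a[k, j] = 1
    return a


def adjacent(n_classes):
    """Row k separates classes 0..k (-1) from classes k+1.. (+1)."""
    a = np.ones((n_classes - 1, n_classes), dtype=int)
    for k in range(n_classes - 1):
        a[k, : k + 1] = -1
    return a
```

For four classes, `adjacent(4)` has three rows, `one_vs_rest(4)` has four and `one_vs_one(4)` has six. In the adjacent matrix, neighbouring classes fall on opposite sides in exactly one row, which is why this layout only makes sense when the class labels carry an ordering.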

Not only can sub-optimal partitionings produce reasonable results, but approximate solution methods can also be used without much penalty. For instance, the iterative method for orthogonal coding matrices outlined in Mills (2017), when applied as a general method, actually gives up very little in accuracy, even though it is not optimal in the least-squares sense. (These results are not shown.)

A data-dependent decision tree design can provide a small but significant increase in accuracy over a more arbitrary tree. An interesting side-effect is that it can also improve both training and classification speed. Strangely, the technique worked better for the character recognition datasets than the image classification datasets. The groupings found did not always correspond with what might be expected from intuition. In a pattern-recognition dataset, it might not always be clear how different image types should be related anyway, as in the segment dataset, Figure 4. For another example, the control file for the pendigits dataset is shown in Figure 5. We can see how 8 and 5 might be related, but it is harder to understand how 3 and 1 are related to 7 and 2. On the other hand, the arrangement might turn out very much as expected as in the sat dataset, Figure 6. This is also the only pattern-recognition dataset for which the method worked as intended.

Unfortunately, the approach used here is not able to match the one-versus-one configuration in accuracy, but this does not preclude cleverer schemes producing larger gains. Using similar techniques, Benabdeslem and Bennani (2006) and Zhou et al. (2008) are both able to beat one-vs.-one, but only by narrow margins.

Interestingly, solving hierarchical configurations via least-squares by converting them to non-hierarchical error-correcting codes is marginally more accurate than solving them recursively, both for the classes and the probabilities. Presumably, the greater information content contained in all partitions versus only , on average, accounts for this. There is a speed penalty, of course, with roughly performance for the least-squares solution.
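The two ways of treating a hierarchical configuration can be made concrete with a small sketch. A binary tree is represented here as nested 2-tuples with integer class labels at the leaves, which is a hypothetical encoding and not the libAGF control-file format. The first function flattens the tree into a coding matrix with one row per internal node, using -1 for classes in the left subtree, +1 for the right subtree and 0 for classes outside the node; such a matrix can then be passed to a general least-squares solver. The second function computes class probabilities recursively as the product of branch probabilities along the path to each leaf, assuming each node supplies the conditional probability of its right branch.

```python
import numpy as np


def leaves(tree):
    """List the class labels under a node (ints are leaves)."""
    if isinstance(tree, int):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


def tree_to_coding_matrix(tree, n_classes):
    """One row per internal node, visited in pre-order."""
    rows = []

    def visit(node):
        if isinstance(node, int):
            return
        left, right = node
        row = np.zeros(n_classes, dtype=int)
        row[leaves(left)] = -1
        row[leaves(right)] = 1
        rows.append(row)
        visit(left)
        visit(right)

    visit(tree)
    return np.array(rows)


def recursive_probabilities(tree, p_right, n_classes):
    """Multiply branch probabilities down to each leaf.

    p_right[k] is the assumed probability of the right branch at the
    k-th internal node, in the same pre-order as the coding matrix.
    """
    probs = np.zeros(n_classes)
    counter = iter(range(len(p_right)))

    def visit(node, weight):
        if isinstance(node, int):
            probs[node] = weight
            return
        k = next(counter)
        left, right = node
        visit(left, weight * (1.0 - p_right[k]))
        visit(right, weight * p_right[k])

    visit(tree, 1.0)
    return probs
```

For the balanced tree `((0, 1), (2, 3))` the coding matrix has three rows, and the two lower rows each contain zeros. In the recursive scheme a class estimate depends only on the nodes along its own path, whereas a least-squares solution over the full matrix draws on every row at once.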

Finally, some datasets may have special characteristics that can be exploited to produce more accurate results through the multi-class configuration. While this was the original thesis behind this paper, only one example was found in this group of eight datasets, namely the humidity dataset. Because the classes are a discretized continuous variable, they have an ordering. As such, it is detrimental to split up consecutive classes more than absolutely necessary and the “adjacent” partitioning is the most accurate. Meanwhile, the one-vs.-rest configuration performs abysmally while the one-vs.-one performs well enough, but worse than all the other methods save one. The excellent performance of the adjacent configuration for the urban dataset suggests that searching for an ordering to the classes might be a useful strategy for improving accuracy. This could be done using inter-set distance in a manner similar to the empirical hierarchical method.

Since the results present a somewhat mixed bag, it would be useful to have a framework and toolset with which to explore different methods of building up multi-class classification models so as to optimize classification accuracy, accuracy of conditional probabilities and speed. This is what we have laid out in this paper.


Acknowledgements

Thanks to Chih-Chung Chang and Chih-Jen Lin of the National Taiwan University for data from the LIBSVM archive and also to David Aha and the curators of the UCI Machine Learning Repository for statistical classification datasets.


References

  • Alimoglu (1996) Alimoglu, F. (1996). Combining Multiple Classifiers for Pen-Based Handwritten Digit Recognition. Master’s thesis, Bogazici University.
  • Allwein et al. (2000) Allwein, E. L., Schapire, R. E., and Singer, Y. (2000). Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers. Journal of Machine Learning Research, 1:113–141.
  • Bagirov (2005) Bagirov, A. M. (2005). Max-min separability. Optimization Methods and Software, 20(2-3):277–296.
  • Benabdeslem and Bennani (2006) Benabdeslem, K. and Bennani, Y. (2006). Dendrogram-based SVM for Multi-Class Classification. Journal of Computing and Information Technology, 14(4):283–289.
  • Boyd and Vandenberghe (2004) Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press, New York, NY, USA.
  • Brier (1950) Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1):1–3.
  • Chang and Lin (2011) Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):27:1–27:27.
  • Cheon et al. (2004) Cheon, S., Oh, S. H., and Lee, S.-Y. (2004). Support Vector Machine with Binary Tree Architecture for Multi-Class Classification. Neural Information Processing, 2(3):47–51.
  • Crammer and Singer (2002) Crammer, K. and Singer, Y. (2002). On the Learnability and Design of Output Codes for Multiclass Problems. Machine Learning, 47(2-3):201–233.
  • Dietterich and Bakiri (1995) Dietterich, T. G. and Bakiri, G. (1995). Solving Multiclass Learning Problems via Error-Correcting Output Codes. Journal of Artificial Intelligence Research, 2:263–286.
  • Fawcett (2006) Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27:861–874.
  • Frey and Slate (1991) Frey, P. and Slate, D. (1991). Letter recognition using holland-style adaptive classifiers. Machine Learning, 6(2):161–182.
  • Gulick (1992) Gulick, D. (1992). Encounters with Chaos. McGraw-Hill.
  • Hsu and Lin (2002) Hsu, C.-W. and Lin, C.-J. (2002). A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks, 13(2):415–425.
  • Hull (1994) Hull, J. J. (1994). A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5):550–554.
  • Johnson (2013) Johnson, B. (2013). High resolution urban land cover classification using a competitive multi-scale object-based approach. Remote Sensing Letters, 4(2):131–140.
  • Jolliffe and Stephenson (2003) Jolliffe, I. T. and Stephenson, D. B. (2003). Forecast Verification: A Practitioner’s Guide in Atmospheric Science. Wiley.
  • King et al. (1995) King, R. D., Feng, C., and Sutherland, A. (1995). Statlog: Comparison of Classification Algorithms on Large Real-World Problems. Applied Artificial Intelligence, 9(3):289–333.
  • Kohonen (2000) Kohonen, T. (2000). Self-Organizing Maps. Springer-Verlag, 3rd edition.
  • Kong and Dietterich (1997) Kong, E. B. and Dietterich, T. G. (1997). Probability estimation via error-correcting output coding. In International Conference on Artificial Intelligence and Soft Computing.
  • Lawson and Hanson (1995) Lawson, C. L. and Hanson, R. J. (1995). Solving Least Squares Problems, volume 15 of Classics in Applied Mathematics. Society for Industrial and Applied Mathematics.
  • Lee and Oh (2003) Lee, J.-S. and Oh, I.-S. (2003). Binary Classification Trees for Multi-class Classification Problems. In Proceedings of the Seventh International Conference on Document Analysis and Recognition, volume 2, pages 770–774. IEEE Computer Society.
  • Michie et al. (1994) Michie, D., Spiegelhalter, D. J., and Taylor, C. C., editors (1994). Machine Learning, Neural and Statistical Classification. Ellis Horwood Series in Artificial Intelligence. Prentice Hall, Upper Saddle River, NJ. Available online at:
  • Mills (2009) Mills, P. (2009). Isoline retrieval: An optimal method for validation of advected contours. Computers & Geosciences, 35(11):2020–2031.
  • Mills (2011) Mills, P. (2011). Efficient statistical classification of satellite measurements. International Journal of Remote Sensing, 32(21):6109–6132.
  • Mills (2017) Mills, P. (2017). Solving for multiclass using orthogonal coding matrices. Submitted to Pattern Analysis and Applications.
  • Mills (2018) Mills, P. (2018). Accelerating kernel classifiers through borders mapping. Journal of Real-Time Image Processing. doi:10.1007/s11554-018-0769-9.
  • Müller et al. (2001) Müller, K.-R., Mika, S., Rätsch, G., Tsuda, K., and Schölkopf, B. (2001). An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks, 12(2):181–201.
  • Ott (1993) Ott, E. (1993). Chaos in Dynamical Systems. Cambridge University Press.
  • Platt et al. (2000) Platt, J. C., Cristianini, N., and Shawe-Taylor, J. (2000). Large Margin DAGs for Multiclass Classification. In Solla, S., Leen, T., and Müller, K.-R., editors, Advances in Neural Information Processing Systems, number 12, pages 547–553. MIT Press.
  • Press et al. (1992) Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P. (1992). Numerical Recipes in C. Cambridge University Press, 2nd edition.
  • Ramanan et al. (2007) Ramanan, A., Suppharangsan, S., and Niranjan, M. (2007). Unbalanced decision trees for multi-class classification. In International Conference on Industrial Information Systems. IEEE.
  • Shannon and Weaver (1963) Shannon, C. E. and Weaver, W. (1963). The Mathematical Theory of Communication. University of Illinois Press.
  • Windeatt and Ghaderi (2002) Windeatt, T. and Ghaderi, R. (2002). Coding and decoding strategies for multi-class learning problems. Information Fusion, 4(1):11–21.
  • Wu et al. (2004) Wu, T.-F., Lin, C.-J., and Weng, R. C. (2004). Probability Estimates for Multi-class Classification by Pairwise Coupling. Journal of Machine Learning Research, 5:975–1005.
  • Zadrozny (2001) Zadrozny, B. (2001). Reducing multiclass to binary by coupling probability estimates. In NIPS’01 Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic, pages 1041–1048.
  • Zhou et al. (2008) Zhou, J., Peng, H., and Suen, C. Y. (2008). Data-driven decomposition for multi-class classification. Pattern Recognition, 41:67–76.