Abstract
We review common methods of solving for multiclass from binary and generalize them to a common framework. Since conditional probabilities are useful both for quantifying the accuracy of an estimate and for calibration purposes, these are a required part of the solution. There is some indication that the best solution for multiclass classification depends on the particular dataset. As such, we are especially interested in data-driven solution design, whether based on a priori considerations or on empirical examination of the data. Numerical results indicate that while a one-size-fits-all solution consisting of one-versus-one is appropriate for most datasets, a minority will benefit from a more customized approach. The techniques discussed in this paper allow a large variety of multiclass configurations and solution methods to be explored so as to optimize classification accuracy, accuracy of conditional probabilities, and speed.
Keywords
multiclass classification, probability estimation, constrained linear least squares, decision trees, error correcting codes, support vector machines
1 Introduction
Many statistical classifiers can only discriminate between two classes. Common examples include linear classifiers such as perceptrons and logistic regression classifiers
(Michie et al., 1994) as well as extensions of these methods such as support vector machines (SVM) (Müller et al., 2001) and piecewise linear classifiers (Bagirov, 2005; Mills, 2018). There are many possible ways of extending a binary classifier to deal with multiclass classification and the options increase exponentially with the number of class labels. Moreover, the best method may well depend on the type of problem (Dietterich and Bakiri, 1995; Allwein et al., 2000). The goal of this paper is not only to provide a summary of the best methods of solving for multiclass, but to synthesize these ideas into a comprehensive framework whereby a multiclass problem can be solved using a broad array of configurations for the binary classifiers. In addition, we require that any algorithm solve for the multiclass conditional probabilities. These are useful both for gauging the accuracy of a result and for various forms of recalibration (Jolliffe and Stephenson, 2003; Fawcett, 2006; Mills, 2009, 2011).
1.1 Definition of the problem
In a statistical classification problem we are given a set of ordered pairs, $\{(\vec x_i, c_i) \mid i = 1 \ldots n\}$, of training data, where the vector, $\vec x_i$, is the location of the sample in the feature space, $c_i \in \{1, 2, \ldots, n_c\}$ is the class of the sample, $n_c$ is the number of classes, and the classes are distributed according to an unknown conditional distribution, $P(c|\vec x)$, with $c$ the class label and $\vec x$ the location in feature space. Given an arbitrary test point, $\vec x$, we wish to estimate $P(c|\vec x)$; however, we only have the means to estimate some binary component of it. That is, we have a set of binary classifiers, each returning a decision function, $r_j(\vec x)$. In this paper we assume that the decision function returns an estimate of the difference in conditional probabilities:

$$ r_j(\vec x) \approx P_j(+1|\vec x) - P_j(-1|\vec x) $$

where $P_j$ is the conditional probability of the $j$th binary classifier. The decision function is trained using the same type of ordered pairs as above, except that the classes can take on only one of two values, which for convenience are chosen as either $-1$ or $+1$; that is, $b_i \in \{-1, +1\}$.
The problem under consideration in this review is, first, how to partition the classes, $\{1, \ldots, n_c\}$, for each binary classifier. That is, we want to create a mapping of the form:

$$ b_i^{(j)} = \begin{cases} -1 & c_i \in C_j^- \\ +1 & c_i \in C_j^+ \end{cases} \qquad (1) $$

where $b_i^{(j)}$ is the class value of the $i$th sample of the transformed data for training the $j$th binary classifier, $C_j^-$ is the set of class labels from the original set, $\{1, \ldots, n_c\}$, that map to $-1$, and $C_j^+$ is the set of classes that map to $+1$.
And second, once we have partitioned the classes and trained the binary classifiers, how do we solve for the multiclass conditional probabilities, $P(c|\vec x)$? The class of the test point may then be estimated through maximum likelihood:

$$ c = \arg\max_i P(i|\vec x) \qquad (2) $$
2 Non-hierarchical multiclass classification
In non-hierarchical multiclass classification, we solve for the classes or probabilities of the multiclass problem all at once: all the binary classifiers are used in the solution and the result of one binary classifier does not determine the use of any of the others. Using the notation provided in Section 1.1, we can write a system of equations relating the multiclass conditional probabilities to the decision functions:

$$ r_j(\vec x) \approx \frac{\sum_{i \in C_j^+} P(i|\vec x) - \sum_{i \in C_j^-} P(i|\vec x)}{\sum_{i \in C_j^+} P(i|\vec x) + \sum_{i \in C_j^-} P(i|\vec x)} \qquad (3) $$

where $C_j^+ \cap C_j^- = \emptyset$, $n_j^- = |C_j^-|$ is the number of class labels on the negative side of the $j$th partition, and $n_j^+ = |C_j^+|$ is the number of class labels on the positive side of the $j$th partition.
It is more natural (and considerably simpler) to describe the problem using a coding matrix, $A$, which is structured such that each element, $a_{ji} \in \{-1, 0, +1\}$, where $j$ enumerates the binary classifier and $i$ enumerates the class of the multiclass problem. If $a_{ji}$ is $-1$/$+1$, we assign each of the $i$th class labels in the training data a value of $-1$/$+1$ when training the $j$th binary classifier. If the value is $0$, the $i$th class label is excluded (Dietterich and Bakiri, 1995; Windeatt and Ghaderi, 2002).
The nonzero elements of $A$ are:

$$ a_{ji} = \begin{cases} -1 & i \in C_j^- \\ +1 & i \in C_j^+ \end{cases} $$
We can rewrite Equation (3) using the coding matrix as follows:

$$ r_j \approx \frac{\sum_i a_{ji} P(i|\vec x)}{\sum_i |a_{ji}| P(i|\vec x)} \qquad (4) $$

where $\vec p = [P(1|\vec x), P(2|\vec x), \ldots, P(n_c|\vec x)]$ is the vector of multiclass conditional probabilities, $n_c$ is the number of classes, $\vec r = [r_1, r_2, \ldots, r_m]$ is the vector of decision functions, and $m$ is the number of partitions. Note that the coding matrix used here is transposed relative to the usual convention in the literature since this is the more natural layout when solving for the probabilities.
Some rearrangement, along with the normalization condition $\sum_i P(i|\vec x) = 1$, shows that we can solve for the probabilities, $\vec p$, via matrix inversion:

$$ Q \vec p = \vec r \qquad (5) $$

$$ q_{ji} = a_{ji} + r_j \left( 1 - |a_{ji}| \right) \qquad (6) $$

Note that $Q$ reduces to $A$ if $A$ contains no zeroes (Kong and Dietterich, 1997). The case of a coding matrix that contains no zeroes, that is, all the partitions divide up all the classes rather than a subset, will be called the strict case. From a computational perspective, in the non-strict case, $Q$ must be regenerated for every new test point or value of $\vec r$, whereas in the strict case, $A$ can be inverted or decomposed once and then applied to every subsequent value of $\vec r$.
Because the decision functions, $r_j$, are not estimated perfectly, the final probabilities may need to be constrained and the inverse problem solved via minimization:

$$ \min_{\vec p} \left| A \vec p - \vec r \right| \qquad (7) $$

subject to:

$$ \sum_i p_i = 1 \qquad (8) $$

$$ \vec p \ge \vec 0 \qquad (9) $$

where the straight brackets, $| \cdot |$, denote a vector norm, which in this case is the Euclidean or $L_2$ norm, and $\vec 0$ is a vector of all zeroes.
2.1 Basic inverse solution
Equation (7) can be solved via the normal equation:

$$ A^T A \vec p = A^T \vec r \qquad (10) $$

This also takes care of the over-determined case, $m > n_c$. Because the binary probability estimates in $\vec r$ are rarely perfect, however, in many cases the constraints in (8) and (9) will be violated. Therefore, for most applications, either the results will need to be adjusted, likely reducing accuracy, or the problem constrained.
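As a concrete sketch of the normal-equation solution (a hypothetical numpy example; the one-versus-the-rest coding matrix used here is introduced in Section 3.1):

```python
import numpy as np

# A strict (no zeroes) one-versus-the-rest coding matrix for n_c = 4:
A = np.array([[ 1, -1, -1, -1],
              [-1,  1, -1, -1],
              [-1, -1,  1, -1],
              [-1, -1, -1,  1]], dtype=float)

# A noisy vector of decision values, r; each element estimates 2*p_i - 1.
r = np.array([0.1, -0.5, -0.7, -0.8])

# Normal equation, Eq. (10): A^T A p = A^T r.
p = np.linalg.solve(A.T @ A, A.T @ r)

# With imperfect r the constraints are typically violated (here the
# probabilities do not sum to one), so a crude fix is to clip any
# negatives and renormalize, at some cost in accuracy.
p_adj = np.clip(p, 0.0, None)
p_adj /= p_adj.sum()
```
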
It is straightforward to incorporate the normalization constraint in (8) into the problem. There are several ways of doing this. The most obvious is to write one probability in terms of the others:

$$ p_k = 1 - \sum_{i \ne k} p_i \qquad (11) $$

where $k$ is an index between $1$ and $n_c$, and solve the following, reduced-dimensional linear system:

$$ \sum_{i \ne k} \left( a_{ji} - a_{jk} \right) p_i = r_j - a_{jk} \qquad (12) $$
A more symmetric method is the Lagrange multiplier, which will be derived in Section 3.2. Lawson and Hanson (1995) discuss at least two other methods of enforcing equality constraints on linear least-squares problems. Since they are inequality constraints, those in (9) are harder to enforce, and the details will be left to a later section.
2.2 Voting solution
In many other texts (Allwein et al., 2000; Hsu and Lin, 2002; Dietterich and Bakiri, 1995), the class of the test point is determined by how close $\vec r$ is to each of the columns in $A$:

$$ c = \arg\min_i \left| \vec r - \vec a_i \right| $$

where $\vec a_i$ is the $i$th column of $A$. For the norm, $| \cdot |$, the Hamming distance is frequently used, which is the number of bits that must be changed in a binary number in order for it to match another binary number. This assumes that each decision function returns only one of two values: $r_j \in \{-1, +1\}$. If the coding matrix is strict, then:

$$ \left| \vec r - \vec a_i \right|_H = \sum_j \left( 1 - \delta_{r_j, a_{ji}} \right) $$

where $\delta$ is the Kronecker delta. Allwein et al. (2000) tailor the metric on the basis of the binary classifier used, each of which will return a different type of continuous decision function (one that does not represent the difference in conditional probabilities).
Here we are assuming that the decision functions return an approximation of the difference in conditional probabilities of the binary classifier. In this case a more natural choice of metric is the Euclidean. Expanding:

$$ \left| \vec r - \vec a_i \right|^2 = \left| \vec r \right|^2 - 2 \vec r \cdot \vec a_i + \left| \vec a_i \right|^2 $$

The length of $\vec r$ is independent of $i$, hence it can be eliminated from the expression. For the strict case, the length of each column will also be constant at $\sqrt{m}$. Even for the non-strict case, we would expect the column lengths to be close for typical coding matrices; for instance, the column lengths are equal in the one-versus-one case. Eliminating these two terms produces a voting solution:

$$ c = \arg\max_i \sum_j a_{ji} r_j $$

That is, if the sign of $r_j$ matches the $j$th element of the $i$th column, then a vote is cast in proportion to the magnitude of $r_j$ for the class label corresponding to the column number.
A voting solution can be used for any coding matrix and is especially appropriate if each $r_j$ returns only $-1$ or $+1$. The LIBSVM library, for instance, uses a one-versus-one arrangement with a voting solution if probabilities are not required (Chang and Lin, 2011). The disadvantage of a voting solution is that, except in special circumstances such as an orthogonal coding matrix (see Section 3.5, below), it does not return calibrated estimates of the probabilities.
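A minimal numpy sketch of the voting solution (the three-class one-versus-one matrix here is a hypothetical example; the continuous and discrete variants differ only in whether the decision values are thresholded to their signs):

```python
import numpy as np

# One-versus-one coding matrix for n_c = 3: rows are the partitions
# (0 vs 1), (0 vs 2) and (1 vs 2), with -1 marking the first class
# of each pair and +1 the second.
A = np.array([[-1,  1,  0],
              [-1,  0,  1],
              [ 0, -1,  1]], dtype=float)

def vote(A, r):
    """Continuous voting: each r_j casts a vote of size |r_j| for the
    class on the side matching its sign."""
    return int(np.argmax(A.T @ r))

def vote_discrete(A, r):
    """Discrete voting, appropriate when each classifier returns only +/-1."""
    return int(np.argmax(A.T @ np.sign(r)))
```
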
3 Common coding matrices
There are a number of standard, symmetric coding matrices that are commonly used to solve for multiclass. These include "one-versus-the-rest" and "one-versus-one", as well as error-correcting coding matrices such as orthogonal and random. We discuss each of these in turn and use them to demonstrate how to solve for the conditional probabilities while enforcing the constraints, expanding out to the general solution for error-correcting codes.
3.1 One-versus-the-rest
Common coding matrices include "one-versus-the-rest", in which we take each class and train it against the rest of the classes. For $n_c = 4$ it works out to:

$$ A = \begin{bmatrix} 1 & -1 & -1 & -1 \\ -1 & 1 & -1 & -1 \\ -1 & -1 & 1 & -1 \\ -1 & -1 & -1 & 1 \end{bmatrix} $$

or in the general case:

$$ a_{ji} = 2 \delta_{ij} - 1 $$

Probabilities for the one-versus-the-rest case can be solved for directly by simply writing out one side of the equation:

$$ P(i|\vec x) = \frac{1 + r_i}{2} $$
The normalization constraint, (8), can be enforced through the use of a Lagrange multiplier; see the next section.
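For one-versus-the-rest the whole procedure fits in a few lines. In this sketch, the direct solution, $p_i = (1 + r_i)/2$, is followed by a uniform shift; for this particular coding matrix, the Lagrange-multiplier treatment of the normalization constraint works out to exactly such a shift:

```python
import numpy as np

def one_vs_rest_probs(r):
    """Multiclass probabilities from one-vs.-the-rest decision values.
    Each r_i estimates P(i|x) - P(rest|x) = 2 P(i|x) - 1, so the direct
    solution is p_i = (1 + r_i)/2; the uniform shift then enforces
    sum(p) = 1 (the least-squares answer under that constraint alone)."""
    p = (1.0 + np.asarray(r, dtype=float)) / 2.0
    return p + (1.0 - p.sum()) / len(p)
```

When the decision values are exact, the direct solution already sums to one and the shift vanishes.
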
3.2 One-versus-one
In a "one-versus-one" solution, we train each class against every other class. For $n_c = 4$:

$$ A = \begin{bmatrix} -1 & 1 & 0 & 0 \\ -1 & 0 & 1 & 0 \\ -1 & 0 & 0 & 1 \\ 0 & -1 & 1 & 0 \\ 0 & -1 & 0 & 1 \\ 0 & 0 & -1 & 1 \end{bmatrix} $$
The oneversusone solution is used in LIBSVM (Chang and Lin, 2011).
Consider the following variation of (4):

$$ \sum_i \left( a_{ji} - r_j |a_{ji}| \right) P(i|\vec x) = 0 $$

or $Q \vec p = \vec 0$, with $q_{ji} = a_{ji} - r_j |a_{ji}|$. We can include the normalization constraint, (8), via a Lagrange multiplier:

$$ \min_{\vec p} \left[ \left| Q \vec p \right|^2 + 2 \lambda \left( \sum_i p_i - 1 \right) \right] $$

which produces the following linear system:

$$ \begin{bmatrix} Q^T Q & \vec 1 \\ \vec 1^T & 0 \end{bmatrix} \begin{bmatrix} \vec p \\ \lambda \end{bmatrix} = \begin{bmatrix} \vec 0 \\ 1 \end{bmatrix} $$

where $\vec 1$ is a vector of all ones. It can be shown that with this solution for a one-vs.-one coding matrix, the inequality constraints in (9) are always satisfied (Wu et al., 2004).
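A numpy sketch of this solution (the bordered system below is one reading of the Lagrange-multiplier equations; the decision values are generated exactly from a known probability vector so that the recovered solution can be checked):

```python
import numpy as np

# One-vs-one partitions for n_c = 3: (0 vs 1), (0 vs 2), (1 vs 2).
A = np.array([[-1,  1,  0],
              [-1,  0,  1],
              [ 0, -1,  1]], dtype=float)
p_true = np.array([0.5, 0.3, 0.2])

# Exact decision values from Eq. (3): for each pair, the difference of
# the two probabilities divided by their sum.
r = (A @ p_true) / (np.abs(A) @ p_true)

# Q p = 0 for consistent r, with q_ji = a_ji - r_j |a_ji|.
Q = A - r[:, None] * np.abs(A)

# Bordered (Lagrange-multiplier) system for p and lambda.
n = A.shape[1]
K = np.zeros((n + 1, n + 1))
K[:n, :n] = Q.T @ Q
K[:n, n] = 1.0
K[n, :n] = 1.0
rhs = np.zeros(n + 1)
rhs[n] = 1.0
sol = np.linalg.solve(K, rhs)
p, lam = sol[:n], sol[n]
```

With consistent decision values, the true probabilities are recovered and the multiplier is zero.
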
Hsu and Lin (2002) find that the one-vs.-one method is more accurate for support vector machines (SVM) than either error-correcting codes or one-vs.-the-rest.
3.3 Exhaustive codes
An exhaustive coding matrix is a strict coding matrix in which every possible permutation is listed. Again for $n_c = 4$:

$$ A = \begin{bmatrix} -1 & -1 & -1 & 1 \\ -1 & -1 & 1 & -1 \\ -1 & -1 & 1 & 1 \\ -1 & 1 & -1 & -1 \\ -1 & 1 & -1 & 1 \\ -1 & 1 & 1 & -1 \\ -1 & 1 & 1 & 1 \end{bmatrix} $$
This is like counting in binary, except zero is omitted and we only count halfway so as to eliminate degenerate partitions. A disadvantage of exhaustive codes is that they grow exponentially with the number of classes, making them slow for moderate numbers of classes and intractable for very large ones.
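The counting scheme just described can be written down directly; this sketch enumerates sign patterns over the classes, skipping zero (a one-sided, hence degenerate, partition) and stopping halfway (a pattern and its negation define the same partition):

```python
import numpy as np

def exhaustive_code(nc):
    """Exhaustive strict coding matrix: 2**(nc-1) - 1 distinct partitions."""
    rows = []
    for k in range(1, 2 ** (nc - 1)):
        # Map a set bit of k to +1 and an unset bit to -1.
        rows.append([1 if (k >> (nc - 1 - i)) & 1 else -1 for i in range(nc)])
    return np.array(rows)
```

For nc = 4 this yields seven partitions; the row count roughly doubles with each added class, which is the intractability referred to above.
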
3.4 Error correcting codes
Another common coding matrix is an arbitrary one: this is commonly known as an "error-correcting" code (Dietterich and Bakiri, 1995). It can be random, but may also be carefully designed (Crammer and Singer, 2002; Zhou et al., 2008). In principle this case covers all the earlier ones; however, in practice the term can also refer more specifically to a random coding matrix. We cover the solution of the general case, which includes any random matrix, in Section 4, below.
3.5 Orthogonal codes
To maximize the accuracy of an error-correcting coding matrix, Allwein et al. (2000) and Windeatt and Ghaderi (2002) show that the distance between each pair of columns, $|\vec a_i - \vec a_k|$, $i \ne k$, should be as large as possible, where $\vec a_i$ is the $i$th column of the matrix. If we take the upright brackets once again to be a Euclidean metric and assume that $A$ is "strict", then this reduces to minimizing the absolute value of the dot product, $|\vec a_i \cdot \vec a_k|$. The absolute value is used because a pair of columns that are the same except for a factor of $-1$ are degenerate.
In other words, the optimal coding matrix will be orthogonal, $A^T A = m I$, where $I$ is the identity matrix and $m$ is the number of rows. Orthogonal coding matrices are not hard to construct for certain class sizes, for instance:

$$ A = \begin{bmatrix} 1 & 1 & 1 & -1 \\ 1 & -1 & 1 & 1 \\ 1 & 1 & -1 & 1 \\ 1 & -1 & -1 & -1 \end{bmatrix} $$
For an orthogonal coding matrix, the voting solution will be equivalent to the unconstrained least-squares solution. Using this property, Mills (2017) provides a fast, simple, and elegant iterative method for solving for the conditional probabilities when using a "strict" orthogonal coding matrix.
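Both the orthogonality property and the voting/least-squares equivalence are easy to check numerically. The matrix below is one hypothetical example for four classes (a 4x4 Hadamard matrix with one column negated so that no row is a degenerate, one-sided partition):

```python
import numpy as np

A = np.array([[ 1,  1,  1, -1],
              [ 1, -1,  1,  1],
              [ 1,  1, -1,  1],
              [ 1, -1, -1, -1]], dtype=float)

# Columns are mutually orthogonal: A^T A = m I with m = 4 rows.
assert np.allclose(A.T @ A, 4.0 * np.eye(4))

# Since A^T A = m I, the unconstrained least-squares solution is
# p = A^T r / m, so voting (argmax of A^T r) picks the same class.
rng = np.random.default_rng(0)
r = rng.uniform(-1.0, 1.0, size=4)
p_lsq = np.linalg.lstsq(A, r, rcond=None)[0]
assert np.allclose(p_lsq, (A.T @ r) / 4.0)
assert np.argmax(A.T @ r) == np.argmax(p_lsq)
```
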
4 Solving for all the probabilities in the general case
We are interested here in a general method of solving for all the probabilities in multiclass classification based on error-correcting codes. Ideally, this would be an exact solution to the constrained least-squares problem in Equations (7) to (9), but it might also minimize some other cost function. Approximate solutions might be useful as well.
4.1 General comments
Once the normalization constraint in (8) has been applied, the remaining inequality constraints for the minimization problem in (7) to (9) form a triangular hyperpyramid in a space of dimension $n_c - 1$. See Figure 1. Stating this as a more general constrained optimization problem, we have:

$$ \min_{\vec v} \left| R \vec v - \vec s \right| \qquad (13) $$

subject to:

$$ G \vec v \ge \vec h \qquad (14) $$

where $\vec h$ is a vector of length $n$ and $G$ is an $n \times n$ matrix.
Suppose we transform this into a new problem by interpolating between the vertices of the hyperpyramid. First, we find the vertices by solving, for each $k$, the linear system formed by setting all of the constraints except the $k$th to equality:

$$ G^{(k)} \vec v_k = \vec h^{(k)} $$

where $\vec v_k$ is the $k$th vertex of the hyperpyramid and $G^{(k)}$ and $\vec h^{(k)}$ omit the $k$th row.
We can locate any point inside the hyperpyramid as follows:

$$ \vec v = \sum_k p_k \vec v_k $$

where $\vec p$ has the same properties as a probability:

$$ \sum_k p_k = 1; \qquad p_k \ge 0 $$
In other words, any minimization problem of the form of (13) and (14) can be transformed into a problem of the same form as (7) to (9) and vice versa.
The first line of attack in solving constrained minimization problems of the type we are discussing here is the Karush-Kuhn-Tucker (KKT) conditions, which generalize Lagrange multipliers to inequality constraints (Lawson and Hanson, 1995; Boyd and Vandenberghe, 2004). For the minimization problem in (7) to (9), the KKT conditions translate to:

$$ 2 A^T \left( A \vec p - \vec r \right) + \lambda \vec 1 - \vec \mu = \vec 0 $$

where:

$$ \mu_i \ge 0 $$

and:

$$ \mu_i p_i = 0 $$

or more succinctly:

$$ \vec \mu \cdot \vec p = 0 $$
Another important property of the problem is that it is completely convex. A convex function, $f$, has the following property:

$$ f\left( \alpha \vec x_1 + (1 - \alpha) \vec x_2 \right) \le \alpha f(\vec x_1) + (1 - \alpha) f(\vec x_2) $$

where $0 \le \alpha \le 1$ is a coefficient. Meanwhile, in a convex set, $V$:

$$ \alpha \vec v_1 + (1 - \alpha) \vec v_2 \in V \qquad \forall\, \vec v_1, \vec v_2 \in V $$

(Boyd and Vandenberghe, 2004). The convexity property means that any local minimum is also a global minimum; moreover, simple gradient-descent algorithms should always eventually reach it. Both the convexity property and the KKT conditions are used in the Lawson and Hanson (1995) solution to inequality-constrained least-squares problems. See Section 4.3, below.
4.2 Zadrozny solution
Zadrozny (2001) describes the following iterative method of solving for the probabilities using an arbitrary coding matrix, starting with a guess for the approximated probabilities, $\hat p_i$:

1. Set $\hat p_i = 1/n_c$ (or some other initial guess).
2. For each $j$: set $\tilde r_j = \left( \sum_i a_{ji} \hat p_i \right) / \left( \sum_i |a_{ji}| \hat p_i \right)$, the decision value implied by the current estimates.
3. For each $i$: set $\hat p_i \leftarrow \hat p_i \dfrac{\sum_j n_j |a_{ji}| \left( 1 + a_{ji} r_j \right)}{\sum_j n_j |a_{ji}| \left( 1 + a_{ji} \tilde r_j \right)}$; set $\hat p_i \leftarrow \hat p_i / \sum_k \hat p_k$.
4. Repeat steps 2 and 3 until convergence.

where $n_j$ is the number of training samples used by the $j$th binary classifier. The technique minimizes the weighted Kullback-Leibler distance between the actual and calculated binary probabilities:

$$ \sum_j n_j \left[ \bar r_j \ln \frac{\bar r_j}{\hat r_j} + \left( 1 - \bar r_j \right) \ln \frac{1 - \bar r_j}{1 - \hat r_j} \right]; \qquad \bar r_j = \frac{1 + r_j}{2}, \quad \hat r_j = \frac{1 + \tilde r_j}{2} $$

as opposed to the usual Euclidean distance. The method supplies probability estimates roughly as accurate as the others described here; however, our tests indicate that convergence is too slow to be useful.
4.3 Lawson and Hanson solution
Lawson and Hanson (1995) describe an iterative solution to the following inequality-constrained least-squares problem:

$$ \min_{\vec x} \left| E \vec x - \vec f \right| $$

subject to:

$$ \vec x \ge \vec 0 $$

where $E$ is an $m \times n$ matrix.
The solution is divided into a set, $P$, containing the indices of all the nonzero values, and a set, $Z$, containing the indices of the zero values. The algorithm is as follows:

1. Set $P = \emptyset$; $Z = \{1, 2, \ldots, n\}$; $\vec x = \vec 0$.
2. Compute the vector $\vec w = E^T \left( \vec f - E \vec x \right)$.
3. If the set $Z$ is empty or if $w_j \le 0$ for all $j \in Z$, go to Step 12.
4. Find an index $t \in Z$ such that $w_t = \max \{ w_j : j \in Z \}$.
5. Move the index $t$ from set $Z$ to set $P$.
6. Solve the least-squares problem $E_P \vec z \approx \vec f$, where $E_P$ = (columns of $E$ whose corresponding indices are in $P$); set $z_j = 0$ for $j \in Z$.
7. If $z_j > 0$ for all $j \in P$, set $\vec x = \vec z$ and go to Step 2.
8. Find an index $q \in P$ such that $x_q / (x_q - z_q) = \min \{ x_j / (x_j - z_j) : z_j \le 0,\ j \in P \}$.
9. Set $\alpha = x_q / (x_q - z_q)$.
10. Set $\vec x = \vec x + \alpha \left( \vec z - \vec x \right)$.
11. Move from set $P$ to set $Z$ all indices $j \in P$ for which $x_j = 0$. Go to Step 6.
12. The computation is completed.
There are two loops to the algorithm. In the first loop, a new index is added to the nonzero set, $P$, in each iteration. So long as the solution does not go out of bounds, this continues until all the indices are added to the set $P$. In the second loop, if it is found that one of the variables has gone out of bounds, then the solution is adjusted to the nearest in-bounds point between the old solution and the new.
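The twelve steps above translate almost line-for-line into numpy. This is a bare-bones sketch for illustration (no iteration limits or tolerance tuning, which a production implementation would need):

```python
import numpy as np

def nnls(E, f, tol=1e-10):
    """Lawson-Hanson active-set algorithm: min |E x - f| subject to x >= 0."""
    m, n = E.shape
    P, Z = [], list(range(n))                    # step 1
    x = np.zeros(n)
    while True:
        w = E.T @ (f - E @ x)                    # step 2
        if not Z or max(w[j] for j in Z) <= tol:
            return x                             # steps 3 and 12
        t = max(Z, key=lambda j: w[j])           # step 4
        Z.remove(t)                              # step 5
        P.append(t)
        while True:
            z = np.zeros(n)                      # step 6: solve on P only
            z[P] = np.linalg.lstsq(E[:, P], f, rcond=None)[0]
            if min(z[j] for j in P) > 0.0:       # step 7
                x = z
                break
            # steps 8-10: back-track to the nearest in-bounds point
            alpha = min(x[j] / (x[j] - z[j]) for j in P if z[j] <= 0.0)
            x = x + alpha * (z - x)
            for j in [j for j in P if abs(x[j]) <= tol]:
                P.remove(j)                      # step 11
                Z.append(j)
                x[j] = 0.0

# The unconstrained solution of this small problem is x = (2, -1);
# NNLS clamps the negative component to zero.
E = np.array([[1.0, 1.0],
              [0.0, 1.0]])
f = np.array([1.0, -1.0])
```
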
Lawson and Hanson (1995) describe several methods of combining equality constraints, such as the normalization constraint in (8), with the above constrained least-squares problem. The most obvious is to repeat the variable substitution described in Equations (11) to (12) until the excluded probability satisfies the constraint in (9).
5 Decision trees
The most obvious method of dividing up a multiclass problem into binary classifiers is hierarchically, using a decision tree (Cheon et al., 2004; Lee and Oh, 2003). In this method, the classes are first divided into two partitions, then those partitions are each divided into two partitions, and so on until only one class remains. The classification scheme is hierarchical, with all the losing classes being excluded from consideration at each step. Only the conditional probability of the winning class is calculated, as the product of all the returned conditional probabilities of the binary classifiers.
Decision trees have the advantage that they are fast, since on average they require only about $\log_2 n_c$ binary classifications, and there is no need to solve a constrained matrix inverse. On the other hand, because less information is being taken into consideration, they may be less accurate.
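A sketch of the recursive descent (using a hypothetical structure: a node is either an integer class label or a tuple of a decision function and its two subtrees; only the winning branch's probability is accumulated, as described above):

```python
# A node is either an int (a class label) or a tuple
# (decision_fn, negative_subtree, positive_subtree), where decision_fn(x)
# returns r in [-1, 1], the estimated difference in conditional
# probabilities at that node.
def classify_tree(node, x, prob=1.0):
    """Descend the tree, multiplying in the winning side's conditional
    probability, (1 -/+ r)/2, at each step."""
    if isinstance(node, int):
        return node, prob
    fn, neg, pos = node
    r = fn(x)
    if r < 0.0:
        return classify_tree(neg, x, prob * (1.0 - r) / 2.0)
    return classify_tree(pos, x, prob * (1.0 + r) / 2.0)
```
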
Interestingly, the same partitions created for a decision tree can also be used in a non-hierarchical scheme to solve for all of the conditional probabilities. Consider the following coding matrix, for instance:

$$ A = \begin{bmatrix} 1 & 1 & 1 & 1 & -1 & -1 & -1 & -1 \\ 1 & 1 & -1 & -1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 1 & -1 & -1 \\ 1 & -1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & -1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & -1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & -1 \end{bmatrix} $$

While there are only seven rows for eight classes, once we add in the constraint in (8) the system becomes fully determined.
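This is easy to verify numerically. The matrix below is a hypothetical balanced tree over eight classes (first split 0-3 versus 4-7, then each half, then pairs); appending the normalization row of ones to the seven partition rows gives a square, non-singular system:

```python
import numpy as np

A = np.array([[ 1,  1,  1,  1, -1, -1, -1, -1],
              [ 1,  1, -1, -1,  0,  0,  0,  0],
              [ 0,  0,  0,  0,  1,  1, -1, -1],
              [ 1, -1,  0,  0,  0,  0,  0,  0],
              [ 0,  0,  1, -1,  0,  0,  0,  0],
              [ 0,  0,  0,  0,  1, -1,  0,  0],
              [ 0,  0,  0,  0,  0,  0,  1, -1]], dtype=float)

# Seven rows alone cannot determine eight probabilities, but with the
# normalization constraint sum(p) = 1 appended the system is square and
# non-singular (the rows are in fact mutually orthogonal).
M = np.vstack([A, np.ones(8)])
assert np.linalg.matrix_rank(M) == 8
```
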
5.1 Variations
There are many variations on the method. Ramanan et al. (2007) train a one-versus-the-rest model at each level of the tree so that if the "one" class is returned, the lower levels of the tree are short-circuited and this class is selected for the final result. Otherwise, the "one" class is left out of subsequent analysis. This is less a new method than simply a means of shaping the tree appropriate to datasets with unbalanced numbers of classes, for instance the "Shuttle" dataset (King et al., 1995).
In a decision directed acyclic graph (DDAG), rather than testing one group against another, each node of the tree tests one class against another (Platt et al., 2000). The losing class is excluded from subsequent analysis. Whereas the previous paragraph describes the "tree" version of one-versus-the-rest, this is the tree version of one-versus-one. In a DDAG, there are multiple paths to the same node.
5.2 Empirically designed trees
Consider the following land-classification problem: you have remote-sensing measurements of four surface types: corn field, wheat field, evergreen forest, and deciduous forest. How do you divide up the tree to best classify the measurements into one of these four surface types? A priori, it would make the most sense to first divide them by their more general grouping, field versus forest, and then, once you have field, classify by the type of field, or, if you have forest, classify by the type of forest. This is illustrated in Figure 2.
In this case we have prior knowledge of how the classes are related to one another. On the other hand, the classes may be too abstract to have any such knowledge without examining the actual training data. Many different methods of empirically designing both decision trees and coding matrices have been shown in the literature. Cheon et al. (2004), for instance, use a self-organizing map (SOM) (Kohonen, 2000) to visualize the relationships between the classes, while Lee and Oh (2003) use a genetic algorithm to optimize the decision tree.
Benabdeslem and Bennani (2006) design the tree by measuring the distance between the classes and building a dendrogram. This seems the most straightforward approach and is interesting in that it reduces a very large problem involving probabilities to a much smaller one. Consider the problem above: it stands to reason that the field and forest classes would be much more strongly separated than the sub-classes within each. That is, the inter-class distance between field and forest is larger.
How does one measure the inter-class distance? Fundamentally this is a distance measure between two sets, and there are many methods of determining it. We could notate this as follows:

$$ d_{kl} = D\left[ P(\vec x | k),\ P(\vec x | l) \right] \qquad \mathrm{or} \qquad d_{kl} = S\left[ \{ \vec x_i : c_i = k \},\ \{ \vec x_i : c_i = l \} \right] $$

where $D$ is a distance operator between two distributions and $S$ is a distance operator between two sets.
Consider the square of the distance between the means of the two classes divided by their standard deviations. Let:

$$ \vec \mu_k = \frac{1}{n_k} \sum_{i : c_i = k} \vec x_i $$

be the mean of the $k$th class distribution, where $n_k$ is the number of instances of that class, while:

$$ \sigma_k = \sqrt{ \frac{1}{n_k} \sum_{i : c_i = k} \left| \vec x_i - \vec \mu_k \right|^2 } $$

is the standard deviation. Then let the distance between the two classes be:

$$ d_{kl} = \frac{ \left| \vec \mu_k - \vec \mu_l \right|^2 }{ \sigma_k \sigma_l } $$
That is, the farther apart the centers of the two classes, the greater the distance, while the wider each class is, the shorter the distance, since the class widths appear in the denominator.
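A sketch of this measure (the product of the two standard deviations in the denominator is one plausible reading of the definition above), demonstrated on synthetic clusters:

```python
import numpy as np

def class_distance(x1, x2):
    """Squared distance between class means over the product of the
    classes' standard deviations (taken about each class mean)."""
    mu1, mu2 = x1.mean(axis=0), x2.mean(axis=0)
    s1 = np.sqrt(((x1 - mu1) ** 2).sum(axis=1).mean())
    s2 = np.sqrt(((x2 - mu2) ** 2).sum(axis=1).mean())
    return ((mu1 - mu2) ** 2).sum() / (s1 * s2)

rng = np.random.default_rng(1)
corn   = rng.normal([0.0, 0.0], 1.0, size=(200, 2))
wheat  = rng.normal([1.5, 0.0], 1.0, size=(200, 2))
forest = rng.normal([8.0, 0.0], 1.0, size=(200, 2))
```

In a dendrogram-building scheme, the two nearest classes (here corn and wheat) would be merged first.
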
This would work well if each of the classes is quite distinct and clustered around a strong center. But for more diffuse classes, especially those with multiple centers, it would make more sense to use a metric designed specifically for sets rather than this somewhat crude adaptation of a vector metric. In this regard, the Hausdorff metric seems tailor-made for this application.
6 Unifying framework
Since there are many ways of solving the multiclass classification problem, we present here a descriptive control language that unifies many of the ideas presented in the previous sections. This is not a "one-size-fits-all" solution, but rather a means of specifying a particular partitioning that best suits the problem at hand. This partitioning could be arrived at either through prior knowledge, or empirically, by measuring the distance between the classes, for instance, or by simply exhaustively testing different configurations.
In Backus-Naur form (BNF), the control language looks like this:

    branch        ::= model "{" branchlist "}" | CLASS
    model         ::= TWOCLASS | partitionlist
    branchlist    ::= branch | branchlist branch
    partitionlist ::= partition | partitionlist partition
    partition     ::= TWOCLASS classlist " / " classlist ";"
    classlist     ::= CLASS | classlist " " CLASS
where CLASS is a class value between 0 and $n_c - 1$. It is used in two senses. It may be one of the class values in a partition in a non-hierarchical model, in which case its value is relative, that is, local to the non-hierarchical model. It may also be the class value returned from a top-level partition in the hierarchy, in which case its value is absolute. TWOCLASS is a binary classification model. This could either be the name of a model that has already been trained, or it could be a list of options or specifications used to train said model.
For example, a oneversusone specification for four classes would look like this:
model01 0 / 1; model02 0 / 2; model03 0 / 3; model12 1 / 2; model13 1 / 3; model23 2 / 3; {0 1 2 3}
while a one-versus-the-rest specification, also for four classes, would look like this:
model0 1 2 3 / 0; model1 0 2 3 / 1; model2 0 1 3 / 2; model3 0 1 2 / 3; {0 1 2 3}
A hierarchical specification might look like this:
TreeVsField { EvergreenVsDeciduous {0 1} CornVsWheat {2 3} }
The framework allows the two methods, hierarchical and non-hierarchical, to be combined, as in the following nine-class example:
TREESvsFIELD 0 / 1; TREESvsWATER 0 / 2; FIELDvsWATER 1 / 2; { DECIDUOUSvsEVERGREEN 0 / 1; DECIDUOUSvsSHRUB 0 / 2; EVERGREENvsSHRUB 1 / 2; {0 1 2} CORNvsWHEAT 0 / 1; CORNvsLEGUME 0 / 2; WHEATvsLEGUME 1 / 2; {3 4 5} FRESHvsSALT 0 / 1; FRESHvsMARSH 0 / 2; SALTvsMARSH 1 / 2; {6 7 8} }
The above demonstrates how this feature might be useful on a hypothetical surface-classification problem, with the key as follows:
0: Deciduous forest
1: Evergreen forest
2: Shrubs
3: Corn field
4: Wheat field
5: Legume field
6: Freshwater
7: Saltwater
8: Marsh
7 Numerical trials
We wish to test a synthesis of the ideas contained in this review on some real datasets. To this end, we test eight different datasets using six configurations solved using four different methods. The configurations are: one-vs.-one, one-vs.-rest, orthogonal partitioning, "adjacent" partitioning (see below), an arbitrary tree, and a tree generated from a bottom-up dendrogram using the Hausdorff metric. The solution methods are: constrained least squares, as described in Section 4.3; matrix inverse, which is specific to one-vs.-one; the iterative method designed for orthogonal partitioning (see Section 3.5); and recursive ascent, which is appropriate only for hierarchical or tree-based configurations.
The control language allows us to represent any type of multiclass configuration relatively succinctly, including different parameters used for the binary classifiers. To illustrate the operation of the empirical partitioning, here are control files for the shuttle dataset. The arbitrary, balanced tree is as follows:
shuttle_hier { shuttle_hier.00 { 0 shuttle_hier.00.01 { 1 2 } } shuttle_hier.01 { shuttle_hier.01.00 { 3 4 } shuttle_hier.01.01 { 5 6 } } }
While the empiricallydesigned tree is:
shuttle_emp { shuttle_emp.00 { shuttle_emp.00.00 { shuttle_emp.00.00.00 { shuttle_emp.00.00.00.00 { shuttle_emp.00.00.00.00.00 { 2 1 } 5 } 6 } 3 } 4 } 0 }
class  number 

0  45586 
1  50 
2  171 
3  8903 
4  3267 
5  10 
6  13 
Note that the shuttle dataset is very unbalanced, as listed in Table 1; hence the empirically-designed tree looks more like a chain, as illustrated in Figure 3. To solve the hierarchical models using least squares, they were first translated to non-hierarchical form, as discussed in Section 5. The above, for instance, becomes:
shuttle_emp 0 1 2 3 4 5 / 6; shuttle_emp.00 0 1 2 3 4 / 5; shuttle_emp.00.00 0 1 2 3 / 4; shuttle_emp.00.00.00 0 1 2 / 3; shuttle_emp.00.00.00.00 0 1 / 2; shuttle_emp.00.00.00.00.00 0 / 1; { 2 1 6 5 3 4 0}
The “adjacent” partitioning is as follows:
shuttle_adj00 0 / 1 2 3 4 5 6; shuttle_adj01 0 1 / 2 3 4 5 6; shuttle_adj02 0 1 2 / 3 4 5 6; shuttle_adj03 0 1 2 3 / 4 5 6; shuttle_adj04 0 1 2 3 4 / 5 6; shuttle_adj05 0 1 2 3 4 5 / 6; {0 1 2 3 4 5 6}
with the corresponding coding matrix:

$$ A = \begin{bmatrix} -1 & 1 & 1 & 1 & 1 & 1 & 1 \\ -1 & -1 & 1 & 1 & 1 & 1 & 1 \\ -1 & -1 & -1 & 1 & 1 & 1 & 1 \\ -1 & -1 & -1 & -1 & 1 & 1 & 1 \\ -1 & -1 & -1 & -1 & -1 & 1 & 1 \\ -1 & -1 & -1 & -1 & -1 & -1 & 1 \end{bmatrix} $$
The rationale for using it will be explained in Section 8, below.
7.1 Data and software
Name  D  Type  n_c  n  Reference

letter  16  integer  26  20000  (Frey and Slate, 1991) 
pendigits  16  integer  10  10992  (Alimoglu, 1996) 
usps  256  float  10  9292  (Hull, 1994) 
segment  19  float  7  2310  (King et al., 1995) 
sat  36  float  6  6435  (King et al., 1995) 
urban  147  float  9  675  (Johnson, 2013) 
shuttle  9  float  7  58000  (King et al., 1995) 
humidity  7  float  8  8600  (Mills, 2009) 
The humidity dataset has been subsampled to keep training times reasonable.
The datasets tested are as follows: "pendigits" and "usps" are both digit-recognition problems (Alimoglu, 1996; Hull, 1994); the "letter" dataset is another text-recognition problem that classifies letters rather than numbers (Frey and Slate, 1991); the "segment" dataset is a pattern-based image-classification problem; the "sat" dataset is a satellite land-classification problem; the "shuttle" dataset predicts different flight configurations on the space shuttle (Michie et al., 1994; King et al., 1995); the "urban" dataset is another pattern-recognition dataset for urban land cover (Johnson, 2013); and the "humidity" dataset classifies humidity values based on satellite radiometry (Mills, 2009). The characteristics of each dataset are summarized in Table 2.
The base binary classifier used to test the ideas in this paper is a support vector machine (SVM) (Müller et al., 2001).
We use LIBSVM (Chang and Lin, 2011) to perform the training
using the svmtrain
command.
LIBSVM is a simple yet powerful library for SVM that implements multiple
kernel types and includes two different regularization methods.
It was developed by Chih-Chung Chang and Chih-Jen Lin of the National
Taiwan University in Taipei
and can be downloaded at: https://www.csie.ntu.edu.tw/~cjlin/libsvm.
Everything else was done using the libAGF library (Mills, 2011, 2018)
which includes extensive codes for generalized multiclass classification.
These codes interface seamlessly with LIBSVM and provide for automatic
generation of multiple types of control file using the print_control
command.
Control files are used to train the binary classifiers and then
to make predictions using the multi_borders
and classify_m
commands, respectively.
Before making predictions, the binary classifiers were unified to eliminate
duplicate support vectors using the mbh2mbm
command, thus improving
efficiency.
LibAGF may be downloaded at: https://github.com/peteysoft/libmsci.
To evaluate the conditional probabilities we use the Brier score (Brier, 1950; Jolliffe and Stephenson, 2003):

$$ B = \frac{1}{n_t} \sum_{i=1}^{n_t} \sum_{c=1}^{n_c} \left[ P(c | \vec x_i) - \delta_{c, c_i} \right]^2 $$

where $n_t$ is the number of test samples. Meanwhile, we use the uncertainty coefficient to evaluate classification skill. This is a measure based on Shannon's channel capacity (Shannon and Weaver, 1963) and has a number of advantages over simple fraction correct or "accuracy" (Press et al., 1992; Mills, 2011). If we treat the classifier as a noisy channel, with each classification a single symbol, the true class entering at the transmitter and the estimated class coming out at the receiver, then the uncertainty coefficient is the channel capacity divided by the entropy per symbol.
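Both scores are straightforward to compute. The sketch below uses the multiclass Brier form and a mutual-information definition of the uncertainty coefficient (conventions vary, e.g. by a constant normalization factor, so treat the exact scaling as an assumption):

```python
import numpy as np

def brier_score(p, truth):
    """Multiclass Brier score: mean squared difference between the
    predicted probability vectors and the one-hot true classes."""
    onehot = np.zeros_like(p)
    onehot[np.arange(len(truth)), truth] = 1.0
    return ((p - onehot) ** 2).sum(axis=1).mean()

def uncertainty_coefficient(truth, pred, nc):
    """Mutual information between true and estimated classes divided by
    the entropy of the true classes (1 = perfect, 0 = no skill)."""
    joint = np.zeros((nc, nc))
    for t, e in zip(truth, pred):
        joint[t, e] += 1.0
    joint /= joint.sum()
    pt, pe = joint.sum(axis=1), joint.sum(axis=0)
    mi = sum(joint[i, j] * np.log(joint[i, j] / (pt[i] * pe[j]))
             for i in range(nc) for j in range(nc) if joint[i, j] > 0)
    ht = -sum(q * np.log(q) for q in pt if q > 0)
    return mi / ht
```

A perfect classifier scores 0 on the Brier score and 1 on the uncertainty coefficient; a constant classifier scores 0 on the uncertainty coefficient.
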
term  meaning  cross-ref.

config.  configuration of multiclass partitioning  Equation (1)
method  solution method for computing probabilities  Equation (2)
1 vs. 1  one-versus-one partitioning  Section 3.2
1 vs. rest  one-versus-the-rest partitioning  Section 3.1
ortho.  orthogonal coding  Section 3.5
adj.  adjacent partitioning  Equation (7)
hier.  "hierarchical" or decision-tree partitioning  Section 5
emp.  empirically-designed decision tree  Section 5.2
lsq.  Lawson and Hanson constrained least squares  Section 4.3
inv.  matrix-inverse solution  Section 3.2
iter.  iterative solution for orthogonal codes  Section 3.5
rec.  recursive ascent of decision tree  Section 5
8 Results and discussion
config.  letter  pendigits  usps  segment 

1 vs. 1  
1 vs. rest  
ortho.  
adj.  
hier.  
emp. 
config.  sat  urban  shuttle  humidity 

1 vs. 1  
1 vs. rest  
ortho.  
adj.  
hier.  
emp. 
config.  method  letter  pendigits  usps  segment 

1 vs. 1  inv.  
1 vs. rest  lsq.  
ortho.  iter.  
adj.  lsq.  
hier.  rec.  
lsq.  
emp.  rec.  
lsq. 
config.  method  sat  urban  shuttle  humidity 

1 vs. 1  inv.  
1 vs. rest  lsq.  
ortho.  iter.  
adj.  lsq.  
hier.  rec.  
lsq.  
emp.  rec.  
lsq. 
config.  method  letter  pendigits  usps  segment 

1 vs. 1  inv.  
1 vs. rest  lsq.  
ortho.  iter.  
adj.  lsq.  
hier  rec.  
lsq.  
emp.  rec.  
lsq. 
A random coding matrix was used since building an orthogonal matrix would take too long using current methods.
config.  method  sat  urban  shuttle  humidity 

1 vs. 1  inv.  
1 vs. rest  lsq.  
ortho.  iter.  
adj.  lsq.  
hier  rec.  
lsq.  
emp.  rec.  
lsq. 
config.  letter  pendigits  usps  segment 

1 vs. 1  
1 vs. rest  
ortho.  
adj.  
hier.  
emp. 
A random coding matrix was used.
config.  sat  urban  shuttle  humidity 

1 vs. 1  
1 vs. rest  
ortho.  
adj.  
hier.  
emp. 
[Table 12: results for each configuration and solution method on the letter, pendigits, usps and segment datasets; numerical values not preserved in this version.] A random coding matrix was used.

[Table 13: results for each configuration and solution method on the sat, urban, shuttle and humidity datasets; numerical values not preserved in this version.]
Results are shown in Tables 4 through 13, with the key given in Table 3. For each result, ten trials were performed with individually randomized test data comprising 30% of the total. Error bars are the standard deviations.
If we take the results for these eight datasets as representative, several conclusions can be drawn. The first is that, despite the seeming complexity of the problem, a “one-size-fits-all” approach seems perfectly adequate for most datasets. Moreover, this approach is the one-versus-one method, which we should note is used exclusively in LIBSVM (Chang and Lin, 2011). One-vs.-one has other advantages, such as the simplicity of the solution: a standard linear solver such as Gaussian elimination, QR decomposition or SVD is sufficient, as opposed to a complex, iterative scheme.
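To make the probability-estimation step concrete, here is a minimal sketch of the constrained least-squares (“lsq.”) approach, using SciPy’s implementation of the Lawson–Hanson non-negative least-squares algorithm. The function name, the coding-matrix convention (one row per binary partition, entries in {-1, 0, +1}), and the post-hoc renormalization are assumptions of this illustration, not the paper’s exact formulation, which may enforce the normalization constraint inside the solve.

```python
import numpy as np
from scipy.optimize import nnls

def multiclass_probs(A, r):
    """Estimate multiclass conditional probabilities p from binary
    classifier outputs r, given a coding matrix A (one row per binary
    partition, one column per class).  Solves min ||A p - r|| subject
    to p >= 0 (Lawson-Hanson NNLS), then renormalizes so sum(p) = 1."""
    p, _ = nnls(A, r)
    s = p.sum()
    return p / s if s > 0 else np.full(len(p), 1.0 / len(p))

# Hypothetical example: one-vs.-rest coding for three classes.
A = np.array([[ 1., -1., -1.],
              [-1.,  1., -1.],
              [-1., -1.,  1.]])
r = np.array([0.4, -0.6, -0.8])   # binary classifier outputs in [-1, 1]
p = multiclass_probs(A, r)        # -> approximately [0.7, 0.2, 0.1]
```

When the binary outputs are mutually consistent, as above, the non-negativity constraints are inactive and the solution coincides with the ordinary least-squares one; the constraints matter only for noisy or contradictory binary outputs.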
Further, the partitioning used does not even appear all that critical in most cases. Even a suboptimal method, such as the “adjacent” partitioning, which makes little sense for datasets in which the classes have no ordering, gives up relatively little accuracy to more sensible methods on most datasets. For the urban dataset it is actually superior, suggesting that there is some kind of ordering to the classes, which are: trees, grass, soil, concrete, asphalt, buildings, cars, pools, shadows. If classification speed is critical, a hierarchical approach will trade off some accuracy to save compute cycles: for SVM, classifying a test point requires only on the order of log2(nc) binary decisions rather than the nc(nc - 1)/2 required by one-versus-one. Accuracy lost is again quite dependent on the dataset.
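The “adjacent” partitioning mentioned above splits an ordered set of classes at each of the nc - 1 boundaries between consecutive classes. A sketch of the corresponding coding matrix (the sign convention is an assumption of this illustration; Equation (7) gives the paper’s exact form):

```python
import numpy as np

def adjacent_coding(nc):
    """Coding matrix for the 'adjacent' partitioning of nc ordered
    classes: partition i places classes 0..i on the -1 side and
    classes i+1..nc-1 on the +1 side, giving nc - 1 partitions."""
    A = np.empty((nc - 1, nc))
    for i in range(nc - 1):
        A[i, :i + 1] = -1.0
        A[i, i + 1:] = 1.0
    return A
```

For four classes this produces three binary problems: {0} vs. {1,2,3}, {0,1} vs. {2,3}, and {0,1,2} vs. {3}, so no pair of consecutive classes is ever split more often than necessary.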
Not only can suboptimal partitionings produce reasonable results, but approximate solution methods can also be used without much penalty. For instance, the iterative method for orthogonal coding matrices outlined in Mills (2017), when applied as a general method, actually gives up very little in accuracy, even though it is not optimal in the least-squares sense. (These results are not shown.)
A data-dependent decision-tree design can provide a small but significant increase in accuracy over a more arbitrary tree. An interesting side-effect is that it can also improve both training and classification speed. Strangely, the technique worked better for the character-recognition datasets than for the image-classification datasets. The groupings found did not always correspond with what might be expected from intuition. In a pattern-recognition dataset, it might not always be clear how different image types should be related anyway, as in the segment dataset, Figure 4. For another example, the control file for the pendigits dataset is shown in Figure 5. We can see how 8 and 5 might be related, but it is harder to understand how 3 and 1 are related to 7 and 2. On the other hand, the arrangement might turn out very much as expected, as in the sat dataset, Figure 6. This is also the only pattern-recognition dataset for which the method worked as intended.
Unfortunately the approach used here is not able to match the one-versus-one configuration in accuracy, but this does not preclude cleverer schemes producing larger gains. Using similar techniques, Benabdeslem and Bennani (2006) and Zhou et al. (2008) are both able to beat one-vs.-one, but only by narrow margins.
Interestingly, solving hierarchical configurations via least-squares, by converting them to non-hierarchical error-correcting codes, is marginally more accurate than solving them recursively, both for the classes and for the probabilities. Presumably, the greater information content contained in all nc - 1 partitions, versus only those along a single path through the tree (roughly log2(nc) on average), accounts for this. There is a speed penalty, of course, since the least-squares solution must evaluate every partition.
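The conversion from a hierarchical configuration to a flat error-correcting code can be sketched as follows. The nested-tuple tree representation and the {-1, 0, +1} sign convention (0 marking classes not involved in a given node’s partition) are assumptions of this illustration, not the paper’s exact encoding.

```python
import numpy as np

def tree_to_coding(tree, nc):
    """Flatten a hierarchical (decision-tree) partitioning into an
    error-correcting coding matrix.  `tree` is either a class index
    (leaf) or a pair (left, right) of subtrees.  Each internal node
    becomes one row: -1 for classes in the left subtree, +1 for
    classes in the right subtree, 0 for classes elsewhere."""
    rows = []

    def leaves(t):
        return [t] if isinstance(t, int) else leaves(t[0]) + leaves(t[1])

    def walk(t):
        if isinstance(t, int):
            return
        row = np.zeros(nc)
        row[leaves(t[0])] = -1.0
        row[leaves(t[1])] = 1.0
        rows.append(row)
        walk(t[0])
        walk(t[1])

    walk(tree)
    return np.array(rows)
```

A balanced tree over four classes, ((0, 1), (2, 3)), yields three rows: the root partition {0,1} vs. {2,3} plus one row per subtree, and the resulting matrix can be handed to any of the flat (e.g. least-squares) solvers.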
Finally, some datasets may have special characteristics that can be exploited to produce more accurate results through the multiclass configuration. While this was the original thesis behind this paper, only one example was found in this group of eight datasets, namely the humidity dataset. Because the classes are a discretized continuous variable, they have an ordering. As such, it is detrimental to split up consecutive classes more than absolutely necessary, and the “adjacent” partitioning is the most accurate. Meanwhile, the one-vs.-rest configuration performs abysmally, while the one-vs.-one performs well enough, but worse than all the other methods save one. The excellent performance of the adjacent configuration for the urban dataset suggests that searching for an ordering to the classes might be a useful strategy for improving accuracy. This could be done using interset distance in a manner similar to the empirical hierarchical method.
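As a hypothetical illustration of such a search (not an algorithm from the paper), one could chain the classes greedily by proximity of their centroids, with centroid distance standing in as a crude proxy for interset distance:

```python
import numpy as np

def order_classes(X, y):
    """Heuristically order class labels by proximity of their
    centroids: start from the first class and repeatedly append the
    nearest unvisited centroid.  X is the feature matrix, y the
    class labels.  Returns the labels in the discovered order."""
    labels = np.unique(y)
    cent = np.array([X[y == c].mean(axis=0) for c in labels])
    order = [0]
    remaining = set(range(1, len(labels)))
    while remaining:
        last = cent[order[-1]]
        nxt = min(remaining, key=lambda j: np.linalg.norm(cent[j] - last))
        order.append(nxt)
        remaining.remove(nxt)
    return labels[order]
```

The resulting ordering could then be fed directly to an adjacent partitioning, mirroring the way the empirical method uses interset distances to design the decision tree.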
Since the results present a somewhat mixed bag, it would be useful to have a framework and toolset with which to explore different methods of building up multiclass classification models so as to optimize classification accuracy, accuracy of conditional probabilities, and speed. This is what we have laid out in this paper.
Acknowledgements
Thanks to Chih-Chung Chang and Chih-Jen Lin of the National Taiwan University for data from the LIBSVM archive, and also to David Aha and the curators of the UCI Machine Learning Repository for statistical classification datasets.
References
 Alimoglu (1996) Alimoglu, F. (1996). Combining Multiple Classifiers for Pen-Based Handwritten Digit Recognition. Master’s thesis, Bogazici University.
 Allwein et al. (2000) Allwein, E. L., Schapire, R. E., and Singer, Y. (2000). Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers. Journal of Machine Learning Research, 1:113–141.
 Bagirov (2005) Bagirov, A. M. (2005). Max-min separability. Optimization Methods and Software, 20(2-3):277–296.
 Benabdeslem and Bennani (2006) Benabdeslem, K. and Bennani, Y. (2006). Dendrogram-based SVM for Multi-Class Classification. Journal of Computing and Information Technology, 14(4):283–289.
 Boyd and Vandenberghe (2004) Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press, New York, NY, USA.
 Brier (1950) Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1):1–3.
 Chang and Lin (2011) Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):27:1–27:27.
 Cheon et al. (2004) Cheon, S., Oh, S. H., and Lee, S.-Y. (2004). Support Vector Machine with Binary Tree Architecture for Multi-Class Classification. Neural Information Processing, 2(3):47–51.
 Crammer and Singer (2002) Crammer, K. and Singer, Y. (2002). On the Learnability and Design of Output Codes for Multiclass Problems. Machine Learning, 47(2-3):201–233.

 Dietterich and Bakiri (1995) Dietterich, T. G. and Bakiri, G. (1995). Solving Multiclass Learning Problems via Error-Correcting Output Codes. Journal of Artificial Intelligence Research, 2:263–286.
 Fawcett (2006) Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27:861–874.
 Frey and Slate (1991) Frey, P. and Slate, D. (1991). Letter recognition using Holland-style adaptive classifiers. Machine Learning, 6(2):161–182.
 Gulick (1992) Gulick, D. (1992). Encounters with Chaos. McGraw-Hill.

 Hsu and Lin (2002) Hsu, C.-W. and Lin, C.-J. (2002). A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks, 13(2):415–425.
 Hull (1994) Hull, J. J. (1994). A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5):550–554.
 Johnson (2013) Johnson, B. (2013). High resolution urban land cover classification using a competitive multi-scale object-based approach. Remote Sensing Letters, 4(2):131–140.
 Jolliffe and Stephenson (2003) Jolliffe, I. T. and Stephenson, D. B. (2003). Forecast Verification: A Practitioner’s Guide in Atmospheric Science. Wiley.
 King et al. (1995) King, R. D., Feng, C., and Sutherland, A. (1995). Statlog: Comparison of Classification Algorithms on Large Real-World Problems. Applied Artificial Intelligence, 9(3):289–333.
 Kohonen (2000) Kohonen, T. (2000). Self-Organizing Maps. Springer-Verlag, 3rd edition.
 Kong and Dietterich (1997) Kong, E. B. and Dietterich, T. G. (1997). Probability estimation via error-correcting output coding. In International Conference on Artificial Intelligence and Soft Computing.
 Lawson and Hanson (1995) Lawson, C. L. and Hanson, R. J. (1995). Solving Least Squares Problems, volume 15 of Classics in Applied Mathematics. Society for Industrial and Applied Mathematics.
 Lee and Oh (2003) Lee, J.-S. and Oh, I.-S. (2003). Binary Classification Trees for Multi-class Classification Problems. In Proceedings of the Seventh International Conference on Document Analysis and Recognition, volume 2, pages 770–774. IEEE Computer Society.
 Michie et al. (1994) Michie, D., Spiegelhalter, D. J., and Tayler, C. C., editors (1994). Machine Learning, Neural and Statistical Classification. Ellis Horwood Series in Artificial Intelligence. Prentice Hall, Upper Saddle River, NJ. Available online at: http://www.amsta.leeds.ac.uk/~charles/statlog/.
 Mills (2009) Mills, P. (2009). Isoline retrieval: An optimal method for validation of advected contours. Computers & Geosciences, 35(11):2020–2031.
 Mills (2011) Mills, P. (2011). Efficient statistical classification of satellite measurements. International Journal of Remote Sensing, 32(21):6109–6132.
 Mills (2017) Mills, P. (2017). Solving for multiclass using orthogonal coding matrices. Submitted to Pattern Analysis and Applications.
 Mills (2018) Mills, P. (2018). Accelerating kernel classifiers through borders mapping. Journal of Real-Time Image Processing. doi:10.1007/s11554-018-0769-9.
 Müller et al. (2001) Müller, K.-R., Mika, S., Rätsch, G., Tsuda, K., and Schölkopf, B. (2001). An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks, 12(2):181–201.
 Ott (1993) Ott, E. (1993). Chaos in Dynamical Systems. Cambridge University Press.
 Platt et al. (2000) Platt, J. C., Cristianini, N., and Shawe-Taylor, J. (2000). Large Margin DAGs for Multiclass Classification. In Solla, S., Leen, T., and Müller, K.-R., editors, Advances in Neural Information Processing Systems, number 12, pages 547–553. MIT Press.
 Press et al. (1992) Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P. (1992). Numerical Recipes in C. Cambridge University Press, 2nd edition.
 Ramanan et al. (2007) Ramanan, A., Suppharangsan, S., and Niranjan, M. (2007). Unbalanced decision trees for multiclass classification. In International Conference on Industrial Information Systems. IEEE.
 Shannon and Weaver (1963) Shannon, C. E. and Weaver, W. (1963). The Mathematical Theory of Communication. University of Illinois Press.
 Windeatt and Ghaderi (2002) Windeatt, T. and Ghaderi, R. (2002). Coding and decoding strategies for multiclass learning problems. Information Fusion, 4(1):11–21.
 Wu et al. (2004) Wu, T.-F., Lin, C.-J., and Weng, R. C. (2004). Probability Estimates for Multi-class Classification by Pairwise Coupling. Journal of Machine Learning Research, 5:975–1005.
 Zadrozny (2001) Zadrozny, B. (2001). Reducing multiclass to binary by coupling probability estimates. In NIPS’01 Proceedings of the 14th International Conference on Information Processing Systems: Natural and Synthetic, pages 1041–1048.
 Zhou et al. (2008) Zhou, J., Peng, H., and Suen, C. Y. (2008). Datadriven decomposition for multiclass classification. Pattern Recognition, 41:67–76.