Solving for multi-class using orthogonal coding matrices

27 January 2018 · Peter Mills, et al.


Abstract

We describe a method of solving for the conditional probabilities in multi-class classification using orthogonal error-correcting codes. The method is tested on six different datasets using support vector machines as the binary classifiers, and both the classification results and the probability estimates are found to be more accurate than those obtained with random coding matrices, as predicted by recent literature. Probability estimates are desirable in statistical classification both for gauging the accuracy of a classification result and for calibration. Probability estimation using orthogonal codes is simple and elegant, making it faster than the more general constrained optimization solutions required for arbitrary codes.

Keywords

multi-class classifier, conditional probabilities, SVM, error-correcting codes, constrained linear least squares, quadratic optimization

List of symbols

symbol                description                                              first used
$c$                   class                                                    (1)
$\vec{x}$             test point                                               (1)
$A$                   coding matrix                                            (1)
$\vec{r}$             vector of binary decision functions                      (4)
$P(k \mid j, \vec{x})$  conditional probability of the $j$th binary classifier   (4)
$p_i$                 multi-class conditional probability                      (5)
$n_c$                 number of classes                                        (5)
$n$                   number of binary classifiers                             (5)
$I$                   identity matrix                                          (10)
$m$                   whole number for enumerating cases                       Section 3

1 Introduction

Many methods of statistical classification can only discriminate between two classes. Examples include linear classifiers such as perceptrons and logistic regression (Michie et al., 1994), piecewise linear classifiers (Herman and Yeung, 1992; Mills, 2011), as well as support vector machines (Müller et al., 2001). There are many ways of generalizing binary classification to multi-class. Three of the most common are one versus one, one versus the rest, and error-correcting coding matrices (Hsu and Lin, 2002). Here we are interested in error-correcting coding matrices (Dietterich and Bakiri, 1995; Windeatt and Ghaderi, 2002); rather than use a random coding matrix, we are interested in one that is more carefully designed.

In error-correcting coding, there is a coding matrix, $A$, that specifies how the set of multiple classes is partitioned. Typically, the class of the test point is determined by the distance between a row in the matrix and a vector of binary decision functions:

$c = \arg\min_i \left| \vec{a}_i - \vec{r}(\vec{x}) \right|$   (1)

where $\vec{a}_i$ is the $i$th row of the coding matrix and $\vec{r}$ is the vector of decision functions at the test point, $\vec{x}$. If we take the upright brackets as a Euclidean distance, and assume that each partition partitions all of the classes, that is, there are no zeroes in $A$, then this reduces to a voting solution:

$c = \arg\max_i \, \vec{a}_i \cdot \vec{r}(\vec{x})$   (2)

Both Allwein et al. (2000) and Windeatt and Ghaderi (2002) show that to maximize the accuracy of an error-correcting coding matrix, the distance between each pair of rows, $\left| \vec{a}_i - \vec{a}_j \right|$, should be as large as possible. Using the same assumptions, this reduces to:

$\vec{a}_i \cdot \vec{a}_j = 0; \qquad i \neq j$   (3)

In other words, the coding matrix, $A$, should have mutually orthogonal rows. This approach to the multi-class problem will be described in detail in this note.
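To make the two decoding rules concrete, here is a minimal Python sketch (not taken from the paper; NumPy usage and function names are illustrative) of the distance decoding in (1) and the voting rule in (2), assuming a coding matrix with one row per class and entries of plus or minus one:

```python
import numpy as np

def decode_distance(A, r):
    """Class by minimum Euclidean distance between each row of the coding
    matrix A (one row per class, entries +/-1) and the vector of binary
    decision functions r, as in (1)."""
    d = np.linalg.norm(A - r, axis=1)
    return int(np.argmin(d))

def decode_voting(A, r):
    """Equivalent voting rule when A contains no zeroes, as in (2): pick
    the class whose row has the largest dot product with r."""
    return int(np.argmax(A @ r))
```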

2 Algorithm

We wish to design a set of binary classifiers, each of which returns a decision function:

$r_j(\vec{x}) = P(+1 \mid j, \vec{x}) - P(-1 \mid j, \vec{x})$   (4)

where $P(k \mid j, \vec{x})$ is the conditional probability of the $k$th class ($k = \pm 1$) of the $j$th binary classifier. Each binary classifier partitions the set of classes such that, for a given test point, $\vec{x}$:

$r_j(\vec{x}) = \sum_{i=1}^{n_c} a_{ij} \, p_i$   (5)

where $A = [a_{ij}]$ is the coding matrix, with $n_c$ the number of classes and $n$ the number of binary classifiers, and $p_i = P(i \mid \vec{x})$ is the conditional probability of the $i$th class. In vector notation:

$A^{\mathrm{T}} \vec{p} = \vec{r}$   (6)

The more general case in which a class can be excluded from a partition, that is, the coding may include zeroes, $a_{ij} \in \{-1,\, 0,\, +1\}$, will not be addressed here.

Note that this assumes that the binary decision functions, $r_j$, estimate the conditional probabilities perfectly. In practice there is a set of constraints that must be enforced, because $\vec{p}$ is only allowed to take on certain values. Thus, we wish to solve the following minimization problem:

$\min_{\vec{p}} \left| A^{\mathrm{T}} \vec{p} - \vec{r} \right|$   (7)
subject to:
$\sum_{i=1}^{n_c} p_i = 1$   (8)
$p_i \geq 0 \qquad \forall i$   (9)

If $A$ is orthogonal,

$A A^{\mathrm{T}} = n I$   (10)

where $I$ is the identity matrix, then the unconstrained minimization problem is easy to solve. Note that the voting solution in (2) is now equivalent to the inverse solution in (6). This allows us to determine the class easily, but we also wish to solve for the probabilities, $\vec{p}$, so that none of the constraints in (8) or (9) are violated. Probabilities are useful for gauging the accuracy of a classification result when its true value is unknown and for recalibrating an image derived from statistical classification (Fawcett, 2006; Mills, 2009, 2011).

The orthogonality property allows us to reduce the minimization problem in (7) to something much simpler:

$\min_{\vec{p}} \left| \vec{p} - \vec{q} \right|$   (11)

where $\vec{q} = \frac{1}{n} A \vec{r}$, with the constraints in (8) and (9) remaining the same. Because the system has only been rotated and expanded, the non-negativity constraints in (9) remain orthogonal, meaning they are independent: enforcing one of them by setting one of the probabilities to zero, for example, shouldn't otherwise affect the solution. This still leaves the normalization constraint in (8): the problem, now strictly geometrical, consists of finding the point nearest $\vec{q}$ on the diagonal hyper-surface that bisects the unit hyper-cube.

Briefly, we can summarize the algorithm as follows: 1. move to the nearest point that satisfies the normalization constraint, (8); 2. if one or more of the probabilities is negative, move to the nearest point that satisfies both the normalization constraint and the non-negativity constraints, (9), for the negative probabilities; 3. repeat step 2. More formally, let $\vec{1}$ be a vector of all $1$'s and let $k$ be the number of probabilities still remaining in the problem (initially $k = n_c$):

  • $\vec{p} \leftarrow \vec{q} + \frac{1 - \vec{1} \cdot \vec{q}}{k} \vec{1}$;

  • while any of the remaining $p_i$ are negative:

    • let $G$ be the set of $i$ such that $p_i < 0$;

    • for each $i$ in $G$:

      • set $p_i = 0$ and remove it from the problem, reducing $k$ by one;

    • over the remaining probabilities, $\vec{p} \leftarrow \vec{p} + \frac{1 - \vec{1} \cdot \vec{p}}{k} \vec{1}$.
Note that the resultant direction vectors for each step form an orthogonal set. For instance, suppose $n_c = 3$ and, after enforcing the normalization constraint, the first probability is less than zero, $p_1 < 0$; then the direction vectors for the two motions are:

$\vec{v}_1 = \frac{1}{\sqrt{3}} \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}; \qquad \vec{v}_2 = \frac{1}{\sqrt{6}} \begin{bmatrix} -2 \\ 1 \\ 1 \end{bmatrix}$   (12)

More generally, consider the following sequence of vectors:

$\vec{v}_k = \frac{1}{\sqrt{(n_c - k)(n_c - k + 1)}} \big[\, \underbrace{0,\, \dots,\, 0}_{k-1},\; -(n_c - k),\; \underbrace{1,\, \dots,\, 1}_{n_c - k} \,\big]^{\mathrm{T}}$   (13)

where $k = 1,\, 2,\, \dots,\, n_c - 1$ and $\vec{v}_0 = \vec{1} / \sqrt{n_c}$; each vector in the sequence is orthogonal to all of the others (Boyd and Vandenberghe, 2004).
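As a concrete illustration of the procedure above, the following Python sketch (my own rendering of Section 2, not the author's code; the function name and structure are assumptions) computes the unconstrained solution $\vec{q} = A\vec{r}/n$ and then applies the iterative projection:

```python
import numpy as np

def solve_probabilities(A, r):
    """Estimate multi-class conditional probabilities from an orthogonal
    coding matrix A (one row per class, entries +/-1, n columns) and the
    vector of binary decision functions r."""
    A = np.asarray(A, dtype=float)
    r = np.asarray(r, dtype=float)
    nc, n = A.shape
    # With orthogonal rows, A A^T = n I, so the least-squares solution of
    # A^T p = r is simply p = A r / n (the voting solution).
    p = A @ r / n
    active = np.ones(nc, dtype=bool)        # probabilities not yet pinned to zero
    while True:
        # Move to the nearest point satisfying the normalization constraint
        # over the still-active probabilities (pinned ones stay at zero).
        k = active.sum()
        p[active] += (1.0 - p[active].sum()) / k
        # Pin any probabilities that have gone negative and repeat.
        neg = active & (p < 0.0)
        if not neg.any():
            break
        p[neg] = 0.0
        active &= ~neg
    return p
```

Each pass removes at least one probability from the problem, so the loop terminates after at most $n_c - 1$ iterations.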

3 Constructing the coding matrix

Finding an $A$ such that $A A^{\mathrm{T}} = n I$ and $a_{ij} \in \{-1, +1\}$ is quite a difficult combinatorial problem. Work in signal processing may be of limited applicability because coding matrices there are typically comprised of $1$'s and $0$'s rather than $+1$'s and $-1$'s (Hedayat et al., 1999; Panse et al., 2014). A further restriction is that each column must have both positive and negative elements, or:

$-n_c < \sum_{i=1}^{n_c} a_{ij} < n_c \qquad \forall j$   (14)

A simple method of designing an orthogonal $A$ is to use a harmonic series. Consider the following matrix for six classes ($n_c = 6$) and eight binary classifiers ($n = 8$):

$A = \begin{bmatrix} 1 & 1 & 1 & 1 & -1 & -1 & -1 & -1 \\ 1 & 1 & -1 & -1 & 1 & 1 & -1 & -1 \\ 1 & -1 & 1 & -1 & 1 & -1 & 1 & -1 \\ -1 & 1 & 1 & -1 & 1 & -1 & -1 & 1 \\ 1 & 1 & -1 & -1 & -1 & -1 & 1 & 1 \\ 1 & -1 & 1 & -1 & -1 & 1 & -1 & 1 \end{bmatrix}$   (15)

This will limit the size of $n_c$ relative to $n$; more precisely, $n_c \leq n$. Moreover, only certain values of $n$ will be admitted: $n = 2^m$, where $m$ is a whole number.

The first three rows in (15) comprise a Walsh-Hadamard code (Arora and Barak, 2009): all possible permutations of signs are listed in the columns. A square ($n_c = n$) orthogonal coding matrix is called a Hadamard matrix (Sylvester, 1867). It can be shown that, besides $n = 1$ and $n = 2$, Hadamard matrices can exist only for sizes that are multiples of four, $n = 4m$, and it is still unproven that examples exist for all values of $m$ (Hedayat and Wallis, 1978). A very simple, recursive method exists to generate matrices of size $n = 2^m$ (Hedayat and Wallis, 1978), but without the property in (14). Such a matrix will include a "harmonic series" of the same type as in (15).
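The recursive (Sylvester) construction mentioned above can be sketched in a few lines of Python (the function name is illustrative):

```python
import numpy as np

def sylvester_hadamard(m):
    """Recursively build the 2^m x 2^m Hadamard matrix of Sylvester (1867):
    H_{2k} = [[H_k, H_k], [H_k, -H_k]].  Rows are mutually orthogonal, but
    the first row and column are all +1, so (14) is not satisfied."""
    H = np.array([[1]])
    for _ in range(m):
        H = np.block([[H, H], [H, -H]])
    return H
```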

To compute the results in this note, orthogonal coding matrices were generated using a "greedy" algorithm. We choose $n$ to be the smallest multiple of four equal to or larger than $n_c$ and start with an empty matrix. Candidate row vectors containing both positive and negative elements are chosen at random but never repeated. If a candidate is orthogonal to all existing rows, it is added to the matrix. New candidates are tested until the matrix is filled or we run out of permutations. A full matrix is almost always returned, especially if $n$ is strictly larger than $n_c$. The matrix is then checked to ensure that each column contains both positive and negative elements. Note that the whole process can be repeated as many times as necessary.
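A hedged sketch of this greedy procedure follows; the acceptance tests mirror the description above, but details such as the retry limit and failure handling are assumptions:

```python
import numpy as np

def greedy_orthogonal_coding(nc, n, max_tries=100000, seed=None):
    """Randomly grow an nc x n orthogonal coding matrix with entries +/-1.
    Candidate rows must contain both signs, must not repeat, and are kept
    only if orthogonal to every row accepted so far.  Returns None if the
    matrix cannot be completed or some column ends up single-signed."""
    rng = np.random.default_rng(seed)
    rows, seen = [], set()
    for _ in range(max_tries):
        if len(rows) == nc:
            break
        cand = rng.choice([-1, 1], size=n)
        key = tuple(cand)
        if key in seen or abs(cand.sum()) == n:   # repeated or single-signed row
            continue
        seen.add(key)
        if all(np.dot(cand, row) == 0 for row in rows):
            rows.append(cand)
    if len(rows) < nc:
        return None
    A = np.array(rows)
    # Every column must contain both positive and negative elements, as in (14).
    if np.any(np.abs(A.sum(axis=0)) == nc):
        return None
    return A
```

If the routine returns None, the whole process is simply repeated with a fresh random seed.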

More work will need to be done to find efficient methods of generating these matrices if they are to be applied to problems with a large number of classes.

4 Results

Dataset Method time (s) U.C. Brier score
pendigits ECC
Ortho.
sat ECC
Ortho.
segment ECC
Ortho.
shuttle ECC
Ortho.
usps ECC
Ortho.
vehicle ECC
Ortho.
Table 1: Solution time, uncertainty coefficient and Brier score for six different datasets using random and orthogonal error-correcting codes.

Orthogonal error-correcting codes were tested on six different datasets: two for digit recognition–“pendigits” (Alimoglu, 1996) and “usps” (Hull, 1994); the space shuttle control dataset–“shuttle” (King et al., 1995); a satellite land recognition dataset–“sat”; a similar dataset for image recognition–“segment”; and a dataset for vehicle recognition–“vehicle” (Siebert, 1987). The last four are borrowed from the “statlog” project (King et al., 1995; Michie et al., 1994).

The method was compared with random error-correcting codes using the same number of codes (binary classifiers), $n$. The random codes were solved using a constrained linear least-squares method based on the Karush-Kuhn-Tucker conditions (Lawson and Hanson, 1995). Both techniques were applied to support vector machines (SVMs) trained using LIBSVM (Chang and Lin, 2011). The binary partitions were trained separately and then combined by finding the union of the sets of support vectors for each partition. By indexing into the combined list of support vectors, the algorithms are optimized in both space and time (Chang and Lin, 2011).
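For orientation, the sketch below shows one way the binary partitions could be trained, using scikit-learn's SVC (which wraps LIBSVM) as a stand-in; this is not the authors' pipeline, and the raw SVM decision values would still need to be calibrated to probability differences in $[-1, 1]$, a step the sketch omits:

```python
import numpy as np
from sklearn.svm import SVC   # scikit-learn's wrapper around LIBSVM

def train_partitions(A, X, y):
    """Train one binary SVM per column (binary partition) of the coding
    matrix A.  y holds class indices 0..n_c-1; column j of A assigns each
    class to the +1 or -1 side of the j-th partition."""
    A = np.asarray(A)
    y = np.asarray(y)
    models = []
    for j in range(A.shape[1]):
        targets = A[y, j]                      # relabel every sample as +/-1
        models.append(SVC(kernel="rbf").fit(X, targets))
    return models

def decision_vector(models, x):
    """Evaluate the vector of binary decision values r(x) for one test point."""
    x = np.asarray(x).reshape(1, -1)
    return np.array([m.decision_function(x)[0] for m in models])
```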

Results are shown in Table 1. Confidence limits represent standard deviations over 20 trials using different, randomly chosen coding matrices. For each trial, the datasets were randomly separated into 70% training and 30% test data. "U.C." stands for uncertainty coefficient, a skill score based on Shannon's channel capacity (Shannon and Weaver, 1963; Press et al., 1992; Mills, 2011) that has many advantages over the simple fraction of correct guesses, or "accuracy". Probabilities are validated with the Brier score, which is the root-mean-square error measured against the truth of the class expressed as a 0 or 1 value (Brier, 1950; Jolliffe and Stephenson, 2003).
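As used here, the Brier score can be computed as in the following minimal sketch (written to the paper's root-mean-square definition; names are illustrative):

```python
import numpy as np

def brier_score(P, labels):
    """Root-mean-square error between estimated class probabilities P
    (n_samples x n_classes) and the 0/1 (one-hot) truth of the labels."""
    P = np.asarray(P, dtype=float)
    labels = np.asarray(labels)
    truth = np.zeros_like(P)
    truth[np.arange(len(labels)), labels] = 1.0
    return float(np.sqrt(np.mean((P - truth) ** 2)))
```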

For most of the datasets, orthogonal coding matrices provide a small but significant improvement over random coding matrices, both in classification accuracy and in the accuracy of the conditional probabilities. This is in line with the literature (Dietterich and Bakiri, 1995; Windeatt and Ghaderi, 2002). Results are also more consistent for the orthogonal codes, as shown by the calculated error bars.

Also as expected, solution times are considerably faster for the orthogonal coding matrices. Depending on the problem and classification method, this may or may not be significant. Since SVM is a relatively slow classifier, the probability solution is a minor portion of the total time; for fast classifiers such as a linear classifier or perceptron, however, solving the constrained optimization problem for the probabilities could easily comprise the bulk of the total classification time. Note that for both methods under examination, the time to perform the binary classifications should be roughly equivalent.

As predicted, solving for multi-class using orthogonal coding matrices produces more accurate results than the equivalent solution using random coding matrices. While the difference is small, the method is simple and elegant and may suggest new directions in the search for more efficient and accurate multi-class classification algorithms.

References

  • Alimoglu (1996) Alimoglu, F. (1996). Combining Multiple Classifiers for Pen-Based Handwritten Digit Recognition. Master’s thesis, Bogazici University.
  • Allwein et al. (2000) Allwein, E. L., Schapire, R. E., and Singer, Y. (2000). Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers. Journal of Machine Learning Research, 1:113–141.
  • Arora and Barak (2009) Arora, S. and Barak, B. (2009). Computational Complexity: A Modern Approach. Cambridge University Press, Cambridge.
  • Boyd and Vandenberghe (2004) Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press, New York, NY, USA.
  • Brier (1950) Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1):1–3.
  • Chang and Lin (2011) Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):27:1–27:27.
  • Dietterich and Bakiri (1995) Dietterich, T. G. and Bakiri, G. (1995). Solving Multiclass Learning Problems via Error-Correcting Output Codes. Journal of Artificial Intelligence Research, 2:263–286.
  • Fawcett (2006) Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27:861–874.
  • Hedayat and Wallis (1978) Hedayat, A. and Wallis, W. (1978). Hadamard matrices and their applications. Annals of Statistics, 6(6):1184–1238.
  • Hedayat et al. (1999) Hedayat, A. S., Sloane, N. J. A., and Stufken, J. (1999). Orthogonal Arrays and Error-Correcting Codes. In Orthogonal Arrays: Theory and Applications, Springer Series in Statistics, chapter 4, pages 61–68. Springer, New York.
  • Herman and Yeung (1992) Herman, G. T. and Yeung, K. T. D. (1992). On piecewise-linear classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(7):782–786.
  • Hsu and Lin (2002) Hsu, C.-W. and Lin, C.-J. (2002). A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks, 13(2):415–425.
  • Hull (1994) Hull, J. J. (1994). A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5):550–554.
  • Jolliffe and Stephenson (2003) Jolliffe, I. T. and Stephenson, D. B. (2003). Forecast Verification: A Practitioner’s Guide in Atmospheric Science. Wiley.
  • King et al. (1995) King, R. D., Feng, C., and Sutherland, A. (1995). Statlog: Comparison of Classification Algorithms on Large Real-World Problems. Applied Artificial Intelligence, 9(3):289–333.
  • Lawson and Hanson (1995) Lawson, C. L. and Hanson, R. J. (1995). Solving Least Squares Problems, volume 15 of Classics in Applied Mathematics. Society for Industrial and Applied Mathematics.
  • Michie et al. (1994) Michie, D., Spiegelhalter, D. J., and Tayler, C. C., editors (1994). Machine Learning, Neural and Statistical Classification. Ellis Horwood Series in Artificial Intelligence. Prentice Hall, Upper Saddle River, NJ. Available online at: http://www.amsta.leeds.ac.uk/~charles/statlog/.
  • Mills (2009) Mills, P. (2009). Isoline retrieval: An optimal method for validation of advected contours. Computers & Geosciences, 35(11):2020–2031.
  • Mills (2011) Mills, P. (2011). Efficient statistical classification of satellite measurements. International Journal of Remote Sensing, 32(21):6109–6132.
  • Müller et al. (2001) Müller, K.-R., Mika, S., Rätsch, G., Tsuda, K., and Schölkopf, B. (2001). An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks, 12(2):181–201.
  • Panse et al. (2014) Panse, M. S., Mesham, S., Chaware, D., and Raut, A. (2014). Error Detection Using Orthogonal Code. IOSR Journal of Engineering, 4(3):2278–8719.
  • Press et al. (1992) Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P. (1992). Numerical Recipes in C. Cambridge University Press, 2nd edition.
  • Shannon and Weaver (1963) Shannon, C. E. and Weaver, W. (1963). The Mathematical Theory of Communication. University of Illinois Press.
  • Siebert (1987) Siebert, J. (1987). Vehicle Recognition Using Rule-Based Methods. TIRM. Turing Institute, Glasgow.
  • Sylvester (1867) Sylvester, J. J. (1867). Thoughts on inverse orthogonal matrices, simultaneous sign successions, and tesselated pavements in two or more colours, with applications to newton’s rule, ornamental tile-work, and the theory of numbers. Philosophical Magazine, 34:461–475.
  • Windeatt and Ghaderi (2002) Windeatt, T. and Ghaderi, R. (2002). Coding and decoding strategies for multi-class learning problems. Information Fusion, 4(1):11–21.