DCSVM: Fast Multi-class Classification using Support Vector Machines

10/23/2018, by Duleep Rathgamage Don, et al., Georgia Southern University

We present DCSVM, an efficient algorithm for multi-class classification using Support Vector Machines. DCSVM is a divide-and-conquer algorithm that relies on data sparsity in high dimensional space and performs a smart partitioning of the whole training data set into disjoint subsets that are easily separable. A single prediction performed between two partitions eliminates at once one or more classes in one partition, leaving only a reduced number of candidate classes for subsequent steps. The algorithm continues recursively, reducing the number of classes at each step, until a final binary decision is made between the last two classes left in the competition. In the best case scenario, our algorithm makes a final decision between k classes in O(log k) decision steps; in the worst case scenario, DCSVM makes a final decision in k-1 steps, which is no worse than the existing techniques.


1. Introduction

The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces (often with hundreds or thousands of dimensions) and that do not occur in low-dimensional settings such as the three-dimensional physical space of everyday experience. The expression was coined by Richard E. Bellman in a highly acclaimed article considering problems in dynamic optimization [1, 2]. In essence, as dimensionality increases, the volume of the space increases rapidly and the available data become sparser and sparser. In general, this sparsity is problematic for any method that requires statistical significance. In order to obtain a statistically sound and reliable result, the amount of data needed to support the result often grows exponentially with the dimensionality, which would prevent common data processing techniques from being efficient.

Figure 1. Binary SVM classifier for classes 1 and 2 out of a data set of six classes

Since its introduction, the Support Vector Machine (SVM) [6] has quickly become a popular tool for classification and has attracted a lot of interest in the machine learning community. However, SVM is primarily a binary classification tool, and multi-class classification with SVM is still an ongoing research problem (see, for example, [3, 22, 17, 18] for some recent work). We present an SVM-based multi-class classification method that exploits the curse of dimensionality to efficiently perform classification of high-dimensional data.

The idea behind the Divide and Conquer SVM (DCSVM) algorithm is based on the following simple observation, best described using the example in Figure 1. The figure shows 6 classes (1 – red, 2 – blue, 3 – green, 4 – black, 5 – orange, and 6 – maroon) of two-dimensional points and a linear SVM separation of classes 1 and 2 (the line that separates the points in these classes). It happens that the SVM model for classifying classes 1 and 2 completely separates the points in classes 4 (which falls on class 2's side) and 6 (which falls on class 1's side). Moreover, the classifier does a relatively good job classifying most points of class 5 as class 2 (with relatively few points classified as 1) and a poor job classifying the points of class 3 (about half of the points in this class are classified as 1 and the other half as 2). With DCSVM we use the SVM classifier for classes 1 and 2 on a data item of unknown class: if the classifier predicts 1, then we next decide between classes 1, 6, 3, and 5; if the classifier predicts 2, then we next decide between classes 2, 4, 3, and 5. Notice that in either case one or more classes are eliminated and we are left to predict among fewer classes, that is, a multi-class classification problem of a smaller size (fewer classes). The algorithm then proceeds recursively on the smaller problem. In the best case scenario, half of the classes are eliminated at each step and the algorithm finishes in O(log k) steps, where k is the number of classes. Notice that, in the above scenario, classes 2 and 4 are completely separated from classes 1 and 6, whereas classes 3 and 5 are not clearly on one side or the other of the separation line. For this reason, classes 3 and 5 are part of the next decision step, regardless of the prediction of the first classifier.

However, there is a significant difference between classes 3 and 5. While class 3 is almost split in half by the separation line, class 5 can be predicted as "2" with a relatively small error. In DCSVM we use a threshold value to indicate the maximum classification error accepted in order to consider a class to be on one side or the other of a separation line. For instance, let us consider that only 2% of the points of class 5 are on the same side as class 1. With the threshold value set to 0.02, DCSVM will separate classes 1, 3, and 6 (when 1 is predicted) from classes 2, 3, 4, and 5 (when 2 is predicted). A higher threshold value produces a better separation of classes (less overlapping) and fewer classes to process in the subsequent steps. This comes at the price of possibly sacrificing the accuracy of the final prediction.
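As a small illustration of this idea, the following sketch (in R with the e1071 package; the function split_by_threshold and its defaults are ours, not part of the paper's implementation) trains the binary SVM for two classes and then checks, for every other class, whether it can be eliminated on one side given a threshold:

    library(e1071)

    # Train svm_{i,j} on classes i and j only, then place every other class on the
    # side of the separation where (almost) all of its points fall; classes whose
    # error exceeds the accepted threshold stay on both sides ("undecided").
    split_by_threshold <- function(X, y, i, j, theta = 0.98) {
      idx   <- y %in% c(i, j)
      model <- svm(X[idx, , drop = FALSE], factor(y[idx]), kernel = "linear", cost = 1)
      left  <- i                               # candidate classes if the SVM predicts i
      right <- j                               # candidate classes if the SVM predicts j
      for (cls in setdiff(unique(y), c(i, j))) {
        pred   <- predict(model, X[y == cls, , drop = FALSE])
        frac_i <- mean(as.character(pred) == as.character(i))
        if (frac_i >= theta) {
          left <- c(left, cls)                 # cls is eliminated whenever j is predicted
        } else if (1 - frac_i >= theta) {
          right <- c(right, cls)               # cls is eliminated whenever i is predicted
        } else {
          left <- c(left, cls); right <- c(right, cls)   # undecided: kept on both sides
        }
      }
      list(left = left, right = right)
    }

On data such as that of Figure 1, split_by_threshold(X, y, 1, 2, 0.98) would place classes 1, 3, and 6 on one side and classes 2, 3, 4, and 5 on the other, matching the discussion above.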

Clearly, the method presented in the example above is suitable, in general, for multi-class classification using any binary classifier. Our choice of SVM is based on the SVM algorithm's remarkable power in producing accurate binary classifications.

The content of this article is organized as follows. We give a brief description of binary classification with SVM and related work on using SVM for multi-class classification in Section 2. DCSVM is described in detail in Section 3 and experimental results (including performance comparisons with one-versus-one approach) are given in Section 4. We conclude in Section 5.

2. Preliminaries and related work

Support Vector Machines (SVM) [6] were primarily developed as a tool for the binary classification problem: finding a separating hyperplane for the two classes in a feature space. If such a hyperplane cannot be found, the "separating hyperplane" requirement is softened and a maximal margin separation is produced instead. Formally, the problem of finding a maximal margin separation can be stated as a quadratic optimization problem. Given a set of training vectors $x_i \in \mathbb{R}^d$, $i = 1, \dots, n$, with labels $y_i \in \{-1, +1\}$, and a feature space projection $\phi$, the SVM method consists in finding the solution of the following:

$$\min_{w,\, b,\, \xi} \quad \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i$$

subject to

$$y_i\left(w^{T}\phi(x_i) + b\right) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1, \dots, n$$

where $w$ is the weights vector and $C$ is a cost regularization constant. The corresponding decision function is:

$$f(x) = \operatorname{sign}\left(w^{T}\phi(x) + b\right) = \operatorname{sign}\left(\sum_{i=1}^{n} \alpha_i y_i\, \phi(x_i)^{T}\phi(x) + b\right)$$

where the $\alpha_i$ are the coefficients of the support vector expansion of $w$. An interesting property of the method is that the dot product $\phi(x_i)^{T}\phi(x_j)$ can be represented by a kernel function:

$$K(x_i, x_j) = \phi(x_i)^{T}\phi(x_j)$$

which is computationally much less expensive than actually projecting $x_i$ and $x_j$ into the feature space given by $\phi$.
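As a concrete illustration (not from the paper), a binary SVM with an RBF kernel can be fit in R with the e1071 package used later in the experiments; gamma and cost correspond to the kernel parameter and the regularization constant C above:

    library(e1071)

    # Minimal binary SVM fit with a radial (RBF) kernel on two of the iris classes.
    data(iris)
    binary <- droplevels(subset(iris, Species != "virginica"))
    fit <- svm(Species ~ ., data = binary, kernel = "radial", gamma = 0.5, cost = 1)
    table(predicted = predict(fit, binary), actual = binary$Species)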

In the case of multiple classes, the problem formulation becomes more complicated and inherently more difficult to address. Given a set of training vectors $x_i$, $i = 1, \dots, n$, with labels $y_i \in \{1, \dots, k\}$, one must find a way to distinguish between the $k$ classes.

Several approaches were proposed, which can be grouped into direct methods (a single optimization problem formulation for multi-class classification) and indirect methods (using multiple binary classifiers to produce multi-class classification). Many of the indirect methods were introduced, in fact, as methods for multi-class classification using binary classifiers, in general. They are not limited to the SVM method.

A comparison [12] of these methods of multi-class classification using binary SVM classifiers shows that the one-versus-one method and its DAG improvement are more suitable for practical use.

2.1. Direct formulation of multi-class classification

Direct formulations that distinguish between the $k$ classes in a single optimization problem were given in [20, 5, 21, 7] or, more recently, in [11, 22]. Each of these formulations uses a single objective function for training all $k$ binary SVMs simultaneously and maximizes the margins from each class to the remaining ones. The decision function then chooses the "best classified" class.

For instance, Crammer and Singer [7] solve the following optimization problem for $k$ classes:

$$\min_{w_1, \dots, w_k,\, \xi} \quad \frac{1}{2} \sum_{m=1}^{k} \|w_m\|^2 + C \sum_{i=1}^{n} \xi_i$$

subject to

$$w_{y_i}^{T}\phi(x_i) - w_m^{T}\phi(x_i) \ge 1 - \delta_{y_i,m} - \xi_i, \qquad i = 1, \dots, n, \quad m = 1, \dots, k$$

where $\delta_{y_i,m}$ is the Kronecker delta function. The corresponding decision function is:

$$f(x) = \arg\max_{m = 1, \dots, k} \; w_m^{T}\phi(x)$$

The original formulation addresses the classification without taking into account the bias terms (one for each of the $k$ classes). These can be easily included in the formulation using additional constraints (see, for instance, [12]). Crammer and Singer's formulation is among the most compact optimization problem formulations for the multi-class classification problem.

A common issue of the single-optimization-problem formulations for multi-class classification is the large number of variables involved. For instance, the formulation above, although compact, involves on the order of $n \cdot k$ variables (not taking into account the bias terms, if included), which yields a large computational complexity. In [11], Crammer and Singer's formulation is extended by relaxing its constraints and subsequently solving a single, smaller quadratic programming problem for multi-class classification.

2.2. One-versus-rest approach

The one-versus-rest approach [20, 4, 19] is an indirect method relying on $k$ binary classifiers, as follows. For each class $i$, a binary classifier $svm_i$ is created with the data in class $i$ as positive examples in the training set and the data in all the other classes as negative examples. The corresponding decision function is then:

$$f(x) = \arg\max_{i = 1, \dots, k} \; svm_i(x)$$

That is, the class label is determined by the binary classifier that gives the maximum output value (the winner among all $k$ classifiers). A well-known shortcoming of the one-versus-rest approach is the highly imbalanced training set for each binary classifier (the more classes, the bigger the imbalance). Assuming an equal number of training examples for all classes, the ratio of positive to negative examples for each binary classifier is $1 : (k-1)$. The symmetry of the original problem is lost and the classification results may be dramatically affected (especially for sparse classes).
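A rough sketch of this scheme (in R with the e1071 package; the helper names ovr_train and ovr_predict are illustrative, and class probabilities are used as the output score to sidestep libsvm's decision-value sign conventions):

    library(e1071)

    # One-versus-rest sketch: k binary SVMs, each class against all the others.
    # The predicted label is the class whose "class vs rest" SVM is most confident.
    ovr_train <- function(X, y) {
      y <- factor(y)
      setNames(lapply(levels(y), function(cl) {
        yy <- factor(ifelse(y == cl, cl, "rest"), levels = c(cl, "rest"))
        svm(X, yy, kernel = "linear", cost = 1, probability = TRUE)
      }), levels(y))
    }

    ovr_predict <- function(models, Xnew) {
      scores <- sapply(names(models), function(cl) {
        p <- predict(models[[cl]], Xnew, probability = TRUE)
        attr(p, "probabilities")[, cl]          # confidence that the item belongs to class cl
      })
      scores <- matrix(scores, nrow = nrow(Xnew), dimnames = list(NULL, names(models)))
      names(models)[max.col(scores)]            # winner-takes-all over the k classifiers
    }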

2.3. One-versus-one approach

The one-versus-one approach ([13, 10, 14, 15], or the improvement by Platt et al. [16]) aims to remove the imbalance problem of the one-versus-rest method by training binary classifiers strictly with the data of the two classes involved. For each pair of classes $(i, j)$, $1 \le i < j \le k$, a binary classifier $svm_{i,j}$ is created. This classifier is trained using all data in class $i$ as positive examples and all data in class $j$ as negative examples, hence all binary classifiers are balanced. Each binary classifier is the result of a smaller optimization problem, at the cost of producing $k(k-1)/2$ classifiers. The corresponding decision function is based on majority voting: all classifiers are used on an input data item $x$, and each class appears in exactly k-1 classifiers, hence an opportunity for up to k-1 votes out of the $k(k-1)/2$ binary classification rounds. The class with the majority of votes is the winner.
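A corresponding sketch of this scheme (again in R with e1071; ovo_train and ovo_predict are our illustrative names, and e1071's own multi-class svm() already applies this pairwise voting internally):

    library(e1071)

    # One-versus-one sketch: k(k-1)/2 pairwise SVMs and majority voting.
    ovo_train <- function(X, y) {
      y <- factor(y)
      cls <- levels(y)
      pairs <- combn(cls, 2, simplify = FALSE)
      models <- lapply(pairs, function(p) {
        idx <- y %in% p                                       # only the data of the two classes
        svm(X[idx, , drop = FALSE], droplevels(y[idx]), kernel = "linear", cost = 1)
      })
      list(models = models, pairs = pairs, classes = cls)
    }

    ovo_predict <- function(ovo, Xnew) {
      votes <- sapply(ovo$models, function(m) as.character(predict(m, Xnew)))
      votes <- matrix(votes, nrow = nrow(Xnew))               # one row per item, one column per pair
      apply(votes, 1, function(v) names(which.max(table(v)))) # majority vote per item
    }

The pairwise models and pairs returned here are reused in the sketches of Section 3.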

An improvement on the number of voting rounds was proposed by Platt et al. in [16]. Their method, called Directed Acyclic Graph SVM (DAGSVM), forms a decision-graph structure for the testing phase and takes exactly k-1 individual voting rounds to decide the label of a data item $x$. In a nutshell, DAGSVM uses one binary classifier at a time and subsequently removes the losing class from all subsequent classifications. There is no particular criterion on the order in which the binary classifiers are used in this process.

3. Divide and Conquer SVM (DCSVM)

As noted in the introduction and illustrated in Figure 1, the key idea is that any binary classifier may, in practice, separate more than two classes. This raises a natural question: which classes are separated (and with what accuracy) by each binary classifier? DCSVM combines the one-versus-one method's simplicity of producing balanced, fast binary classifiers with the classification speed of DAGSVM's decision graph. The essential difference consists of producing the most efficient decision tree, capable of delivering a decision in at most k-1 steps in the worst case scenario, or $\lceil \log_2 k \rceil$ steps in the best case scenario.

3.1. DCSVM training

Let us introduce some notation before proceeding to the formal description of the algorithm. Given a data set $X$ of $k$ classes (labels), where each data item $x \in X$ has been assigned a label $l(x) \in \{1, \dots, k\}$, we want to construct a decision function $dcsvm(\cdot)$ so that $dcsvm(x) = l(x)$, where $l(x)$ is the corresponding label of $x \in X$. As usual, by considering a split of the data set $X$ into two disjoint sets $T$ (the training set) and $V$ (the test set), we will be using the data in $T$ to construct our decision function and then the data in $V$ to measure its accuracy. Furthermore, we consider $T$ as a union of disjoint sets $T_i$, where each $x \in T_i$ has label $i$, $i = 1, \dots, k$. (Similarly, we consider $V$ as a union of disjoint sets $V_i$, where each $x \in V_i$ has label $i$, $i = 1, \dots, k$.)

Let $svm_{i,j}$, $1 \le i < j \le k$, be the SVM binary classifier created using the training sets $T_i$ and $T_j$. There are $k(k-1)/2$ such one-versus-one binary classifiers. We must clearly specify that the decision function we consider here is not the ideal one, but the practical one, likely affected by misclassification errors. That is, for some $x \in T_i \cup T_j$, we may have that $svm_{i,j}(x) \ne l(x)$.

Our goal is to create a decision function $dcsvm(\cdot)$ that uses a minimal number of binary decisions for $k$-class classification, while not sacrificing the classification accuracy. We define next a few measures we use in the process of identifying the shortest path to a multi-class classification decision.

Definition 1 (Class Prediction Likelihoods).

The class prediction likelihoods of an SVM binary classifier $svm_{i,j}$ for a label $c$, denoted respectively as $P_i(svm_{i,j}, c)$ and $P_j(svm_{i,j}, c)$, are:

$$P_i(svm_{i,j}, c) = \frac{|\{x \in T_c : svm_{i,j}(x) = i\}|}{|T_c|}, \qquad P_j(svm_{i,j}, c) = \frac{|\{x \in T_c : svm_{i,j}(x) = j\}|}{|T_c|}$$

Each class prediction likelihood represents the expected outcome likelihood for $i$ or $j$ when the binary classifier $svm_{i,j}$ is used for prediction on all data items in $T_c$. These likelihoods are computed for each binary classifier and each class in the training data set.

All pairs of prediction likelihoods, for every binary classifier $svm_{i,j}$ and every class $c$, are stored in a table, as follows.

Definition 2 (All-Predictions Table).

We arrange all class prediction likelihoods in $k(k-1)/2$ rows (one for each binary classifier $svm_{i,j}$) and $k$ columns (one for each class $c$) to form a table $A$ where each entry is given by a pair of prediction likelihoods as follows:

$$A[svm_{i,j}, c] = \big(P_i(svm_{i,j}, c),\ P_j(svm_{i,j}, c)\big)$$

Figure 2 shows the All-Predictions Table computed for the glass data set in [8]. The data set contains 6 classes, labeled as 1, 2, 3, 5, 6, and 7. Each row corresponds to a binary classifier $svm_{i,j}$ and the columns correspond to the class labels. Each table cell contains a pair of likelihood predictions (as percentages) for the row classifier and the column class. For instance, the cell in row $svm_{1,2}$ and column 3 contains the percentages of class 3 training items predicted as class 1 and as class 2, respectively, by $svm_{1,2}$.

Figure 2. The All-Predictions table for the glass data set in [8]
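A sketch of how such a table can be computed (R; it reuses the illustrative ovo_train output from Section 2.3 and stores only $P_i$, since $P_j = 1 - P_i$):

    # Build the all-predictions table A of Definition 2: one row per pairwise
    # classifier, one column per class; entry A[m, c] = P_i(svm_{i,j}, c).
    all_predictions <- function(ovo, X, y) {
      y <- as.character(y)
      A <- matrix(NA_real_, nrow = length(ovo$models), ncol = length(ovo$classes),
                  dimnames = list(sapply(ovo$pairs, paste, collapse = " vs "), ovo$classes))
      for (m in seq_along(ovo$models)) {
        i <- ovo$pairs[[m]][1]                            # the "i" class of this pair
        for (cl in ovo$classes) {
          pred     <- predict(ovo$models[[m]], X[y == cl, , drop = FALSE])
          A[m, cl] <- mean(as.character(pred) == i)       # fraction of class cl predicted as i
        }
      }
      A
    }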

We define next two measures for the quality of the classification produced by each $svm_{i,j}$. The purity index measures how good the binary classifier $svm_{i,j}$ is at classifying all classes as $i$ or $j$, for a given precision threshold $\theta$. In a nutshell, a class $c$ is classified as "definitely $i$" by $svm_{i,j}$ if $P_i(svm_{i,j}, c) \ge \theta$; as "definitely $j$" if $P_j(svm_{i,j}, c) \ge \theta$; otherwise, it is classified as "undecided". The purity index counts how many "undecided" decisions a binary classifier produces. The lower the index, the better the separation. The balance index measures how "balanced" a separation is in terms of the number of classes predicted as $i$ and as $j$. The larger the index, the better.

Definition 3 (SVM Purity and Balance Indexes).

For an accuracy threshold $\theta$ of an SVM classifier $svm_{i,j}$, we define:
- the purity index, denoted as $\pi_\theta(svm_{i,j})$, as:

$$\pi_\theta(svm_{i,j}) = \sum_{c=1}^{k} \big(1 - \sigma(P_i(svm_{i,j}, c) - \theta)\big)\big(1 - \sigma(P_j(svm_{i,j}, c) - \theta)\big)$$

- the balance index, denoted as $\beta_\theta(svm_{i,j})$, as:

$$\beta_\theta(svm_{i,j}) = \min\left(\sum_{c=1}^{k} \sigma(P_i(svm_{i,j}, c) - \theta),\ \sum_{c=1}^{k} \sigma(P_j(svm_{i,j}, c) - \theta)\right)$$

where $\sigma$ is the step function:

$$\sigma(t) = \begin{cases} 1, & t \ge 0 \\ 0, & t < 0 \end{cases}$$
For instance, for one of the rows in Figure 2 and the chosen threshold $\theta$, the purity index equals 3, which indicates that 3 of the classes (namely 2, 5, and 7) are undecided when the required precision is at least $\theta$.

The balance index of a row in Figure 2, for a given accuracy threshold $\theta$, is obtained analogously: count the classes decided as $i$, count the classes decided as $j$, and take the smaller of the two counts.

The SVM score, defined next, is a measure of the precision of the binary classifier $svm_{i,j}$ in classifying its own two classes, $i$ and $j$. The higher the score, the better the classifier's precision.

Definition 4 (SVM Score).

The score of an SVM classifier $svm_{i,j}$, denoted as $score(svm_{i,j})$, is the average of its prediction likelihoods on its own two classes:

$$score(svm_{i,j}) = \frac{P_i(svm_{i,j}, i) + P_j(svm_{i,j}, j)}{2}$$

For instance, the table in Figure 2 shows that every binary classifier of the glass data set has a score of 100%, i.e., each $svm_{i,j}$ classifies the training data of its own two classes perfectly (see also the last column of Table 1).
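From one row of this table, the three measures can be computed as below (an R sketch following the formulas as reconstructed above; p_i is a named row of the table, e.g. A["1 vs 2", ], and i, j are the two class labels of that row):

    # Purity, balance, and score of one pairwise classifier (Definitions 3 and 4).
    row_measures <- function(p_i, i, j, theta) {
      p_j    <- 1 - p_i                        # fraction of each class predicted as j
      side_i <- p_i >= theta                   # classes decided as i
      side_j <- p_j >= theta                   # classes decided as j
      list(purity  = sum(!side_i & !side_j),            # undecided classes (lower is better)
           balance = min(sum(side_i), sum(side_j)),     # how balanced the split is (higher is better)
           score   = (p_i[[i]] + p_j[[j]]) / 2)         # precision on the classifier's own two classes
    }

For example, row_measures(A["1 vs 2", ], "1", "2", theta = 1) reproduces the first row of Table 1 for the glass data set.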

1: procedure TrainDCSVM(X, θ)    ▹ Creates the DCSVM classifier
2: Input: X: data set, θ: accuracy threshold
3: Output: dcsvm, the decision tree classifier
4:     svm_{i,j} ← SVM trained on T_i, T_j, for all 1 ≤ i < j ≤ k
5:     A[svm_{i,j}, c] ← (P_i(svm_{i,j}, c), P_j(svm_{i,j}, c)), for all 1 ≤ i < j ≤ k and c = 1, …, k
6:     // Recursively construct a binary decision tree
7:     // with each node associated with a binary classifier
8:     dcsvm ← empty binary tree
9:     root ← new tree node
10:     DCSVM-subtree(root, A, θ)
11:     return dcsvm    ▹ Returns the decision tree
12: end procedure

1: procedure DCSVM-subtree(node, A, θ)    ▹ Creates the subtree rooted at node
2: Input: node: current parent node, A: current predictions table, θ: accuracy threshold
3: Output: recursively constructs the sub-tree rooted at node
4:     svm_{i,j} ← optimal classifier in A, for the given θ
5:     node.svm ← svm_{i,j}
6:     L ← classes labeled as i or undecided by svm_{i,j}
7:     R ← classes labeled as j or undecided by svm_{i,j}
8:     if |L| = 1 then    ▹ reached a leaf
9:         node.left ← tree-node(the label in L)
10:     else
11:         A_L ← A minus the svm_{c,·} or svm_{·,c} rows and the columns of all classes c not in L
12:         node.left ← new tree node
13:         DCSVM-subtree(node.left, A_L, θ)
14:     end if
15:     if |R| = 1 then    ▹ reached a leaf
16:         node.right ← tree-node(the label in R)
17:     else
18:         A_R ← A minus the svm_{c,·} or svm_{·,c} rows and the columns of all classes c not in R
19:         node.right ← new tree node
20:         DCSVM-subtree(node.right, A_R, θ)
21:     end if
22: end procedure
Algorithm 1 DCSVM training

Algorithm 1 describes the DCSVM training and proceeds as follows. In the main procedure, TrainDCSVM, the SVM binary classifiers for all class pairs are trained (line 4) and their prediction likelihoods are stored in the all-predictions table (line 5). The decision function is created as an empty tree (line 8) and then recursively populated in the DCSVM-subtree procedure (line 10). The recursive procedure creates a left and/or a right node at each step (lines 12 and 19, respectively) or may stop by creating a left and/or a right leaf label (lines 9 and 16, respectively). Each new internal node is associated with the binary classifier $svm_{i,j}$ that is the decider at that node (line 5 of DCSVM-subtree), and each leaf node with a class label (lines 9 and 16).

An important part of the DCSVM-subtree procedure is choosing the "optimal" classifier $svm_{i,j}$ from the current prediction likelihoods table (line 4). For this purpose, we use the SVM purity index, balance index, and score from Definitions 3 and 4. The order in which these measures are applied may influence the shape and precision of the decision tree. If the score is used first (with the purity and balance indexes used to break ties), the resulting tree favors accuracy over the speed of decisions (and may yield bushier trees). If the purity and balance indexes are used first (with the score used to break ties), the resulting tree tends to be more balanced: decision speed is favored, while possibly sacrificing some accuracy.
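One possible realization of this selection step is sketched below (R; it consumes a list of per-row measures such as those returned by row_measures above, and the ranking order shown favors purity and balance, as in the second strategy):

    # Pick the row whose classifier leaves the fewest classes undecided, breaking
    # ties by the more balanced split and then by the higher score.
    choose_classifier <- function(measures) {
      o <- order( sapply(measures, `[[`, "purity"),
                 -sapply(measures, `[[`, "balance"),
                 -sapply(measures, `[[`, "score"))
      o[1]                                     # index of the selected svm_{i,j} row
    }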

A decision tree for the glass data set is shown in Figure 3. Clearly, the algorithm may produce highly unbalanced decision trees (when some classes are decided faster than others) or very balanced decision trees (when most class labels are leaves situated at about the same depth). Regardless of the outcome, the following result is almost immediate.

Proposition 1.

The decision tree constructed in Algorithm 1 has depth at most k-1.

Proof.

The lists L and R of candidate classes (lines 6, 7 in the DCSVM-subtree procedure) each contain at least one decided label: $i$ or $j$, respectively. Once a class column is removed from the table at some tree node $t$, it will not appear again in a node or leaf of the subtree rooted at $t$. Hence with each recursion the number of classes decreases by at least one (lines 11, 18), from k down to 2, ending the recursion with a left or a right label node in lines 9 or 16, respectively. ∎

Notice that a scenario in which every leaf of the decision tree has depth k-1 is possible in practice: when no binary classifier $svm_{i,j}$ is a good separator for classes other than $i$ and $j$ (and therefore at each node only classes $i$ and $j$ are separated, while all the other classes are undecided and appear in both the left and the right branches). We call this the worst case scenario, for obvious reasons. The opposite scenario is also possible in practice: each $svm_{i,j}$ separates all the remaining classes into two disjoint lists of about the same length. The decision tree is also very balanced in this case, but much smaller.

Proposition 2.

The decision tree constructed in Algorithm 1, when each $svm_{i,j}$ produces a balanced, disjoint separation of all the remaining classes, has depth at most $\lceil \log_2 k \rceil$.

Proof.

Clearly, this is the scenario where at each recursion step a node is created such that half of the remaining classes are assigned to the left subtree and the other half to the right subtree. This produces a balanced binary tree with k leaves, hence of depth at most $\lceil \log_2 k \rceil$. ∎

1: procedure DCSVMclassify(dcsvm, x)    ▹ Produces the DCSVM classification
2: Input: dcsvm: the decision tree; x: data item
3: Output: label of data item x
4:     node ← root of dcsvm
5:     while node is not a leaf do    ▹ Visits the decision tree nodes towards a leaf
6:         svm_{i,j} ← node.svm    ▹ Retrieves the classifier associated to the current node
7:         p ← svm_{i,j}(x)
8:         if p = i then
9:             node ← node.left
10:         else
11:             node ← node.right
12:         end if
13:     end while
14:     return label of node    ▹ Returns the leaf label
15: end procedure
Algorithm 2 DCSVM classifier

The DCSVM classifier in Algorithm 2 relies on the decision tree produced by Algorithm 1 to take any data item $x$ and predict its label. The algorithm starts at the root node of the decision tree (line 4); then each node's associated binary classifier $svm_{i,j}$ predicts the path to follow (lines 6–12), until a leaf node is reached. The label of the leaf node is DCSVM's prediction (line 14) for the input data item $x$. An example of a prediction path in the tree is illustrated in Figure 3 (b).
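A compact R sketch of this traversal, assuming a hypothetical node representation in which internal nodes are lists with fields svm, i, left, and right, and leaves carry only a label field:

    # Algorithm 2 as a loop: each node's binary SVM chooses the branch to follow.
    dcsvm_classify <- function(node, x) {
      while (is.null(node$label)) {                  # internal node: has an attached SVM
        p    <- as.character(predict(node$svm, x))   # binary prediction for x
        node <- if (p == node$i) node$left else node$right
      }
      node$label                                     # leaf: the predicted class
    }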

Propositions 1 and 2 directly justify the following result.

Theorem 1.

Algorithm 2 performs multi-class classification of any data item $x$ in at most k-1 binary decision steps (in the worst case scenario) and at most $\lceil \log_2 k \rceil$ binary decision steps (in the best case scenario).

We illustrate next how the decision tree is created and how a prediction is computed using a working example.

3.2. A working example

We use the glass data set [8] to illustrate DCSVM at work. This data set contains 6 classes, labeled 1, 2, 3, 5, 6, and 7 (notice there is no label 4). Consequently, $6 \cdot 5 / 2 = 15$ binary classifiers are created and then the all-prediction-likelihoods table is computed (Figure 2). Let us choose the accuracy threshold $\theta = 1$ (that is, 100%), for simplicity. That is, a class $c$ is classified as $i$ by an $svm_{i,j}$ only if $svm_{i,j}$ predicts that all data items in $T_c$ have class $i$; $c$ is classified as $j$ only if $svm_{i,j}$ predicts that all data items in $T_c$ have class $j$; otherwise, $c$ is undecided and will appear on both sides of the decision tree node associated with $svm_{i,j}$.

Figure 3. DCSVM decision tree (a) with a decision process example (b), for the glass data set

We follow next the DCSVM-subtree procedure in Algorithm 1 and construct the decision tree. Notice that $score(svm_{i,j}) = 100\%$ for all $i, j$, so the score does not matter for choosing the optimal classifier in line 4. The choice is based solely on the purity and balance indexes. Table 1 shows the values of these measures for the initial prediction likelihoods table.

      Classifier    Purity  Balance  Score
 1.   svm_{1,2}        0       1     100%
 2.   svm_{1,3}        4       1     100%
 3.   svm_{1,5}        3       1     100%
 4.   svm_{1,6}        3       1     100%
 5.   svm_{1,7}        3       1     100%
 6.   svm_{2,3}        3       1     100%
 7.   svm_{2,5}        1       2     100%
 8.   svm_{2,6}        3       1     100%
 9.   svm_{2,7}        2       1     100%
10.   svm_{3,5}        2       1     100%
11.   svm_{3,6}        1       1     100%
12.   svm_{3,7}        3       1     100%
13.   svm_{5,6}        0       2     100%
14.   svm_{5,7}        0       2     100%
15.   svm_{6,7}        4       1     100%
Table 1. Optimality measures of each $svm_{i,j}$ for the glass data set and the initial All-Predictions table (threshold $\theta = 1$)

The table shows rows 1, 13, and 14 as the candidates with minimum purity index. Among these, rows 13 and 14 are tied for the larger balance index. Row 13 comes first and hence $svm_{5,6}$ is selected as the root node. Figure 3 (a) shows the full decision tree, with $svm_{5,6}$ as the root node. Subsequently, $svm_{5,6}$ labels classes 1, 2, 3, and 5 as "5" (left) and classes 6 and 7 as "6" (right). The algorithm then continues recursively with classes 1, 2, 3, and 5 to the left, and classes 6 and 7 to the right. The right branch is completed immediately with one more tree node (for $svm_{6,7}$) and two corresponding leaf nodes (for labels 6 and 7).

For the left branch the algorithm proceeds with a reduced All-Predictions table: rows 4, 5, 8, 9, 11, 12, 13, 14, and 15, as well as the columns for classes 6 and 7, are removed. The optimality measures are then computed for all remaining classifiers $svm_{i,j}$ and the classes still in competition (1, 2, 3, and 5) in the left branch. The corresponding measures are given in Table 2 (for easier identification, the indexes in the first column are kept the same as the original indexes in the All-Predictions table in Figure 2).

      Classifier    Purity  Balance  Score
 1.   svm_{1,2}        0       1     100%
 2.   svm_{1,3}        2       1     100%
 3.   svm_{1,5}        1       1     100%
 6.   svm_{2,3}        1       1     100%
 7.   svm_{2,5}        0       1     100%
10.   svm_{3,5}        2       1     100%
Table 2. Optimality measures in the second step of creating the decision tree in Figure 3 (a)

There is a tie between $svm_{1,2}$ and $svm_{2,5}$, and $svm_{1,2}$ is used first. A node for $svm_{1,2}$ is consequently created, with a leaf as its left child. The rest of the tree is subsequently created in the same manner.

4. Experimental results

No Dataset Classes No Dataset Classes
1. artificial 6 8. covertype 7
2. iris 3 9. svmguide4 6
3. segmentation 7 10. vowel 11
4. heart 5 11. usps 10
5. wine 3 12. letter 26
6. wine-quality 6 13. poker 10
7. glass 6 14. sensorless 11
Table 3. Data sets
Figure 4. Multi-class prediction accuracy comparison: built-in SVM, one-versus-one, and DCSVM

We implemented DCSVM in R v3.4.3 using the e1071 library [9], running on Windows 10, on a 64-bit Intel Core i7 CPU @3.40GHz with 16GB RAM. For testing, we used 14 data sets from the UCI repository [8] (listed in Table 3). We performed three sets of experiments: (i) multi-class prediction accuracy comparisons, (ii) prediction performance in terms of speed (time and number of binary decisions) and resources (number of support vectors), and (iii) DCSVM performance comparisons for different data sets and accuracy threshold parameter values. For the first set of experiments we compared three multi-class predictors: the built-in multi-class SVM (from the e1071 library), our R implementation of one-versus-one, and the R implementation of DCSVM. For a fair comparison, in the second set of experiments we compared only the R implementations of one-versus-one and DCSVM; the built-in multi-class SVM would benefit from the inherent speed of the native code it relies on. Finally, the third set of experiments focused on the performance and fine tuning of DCSVM's R implementation.

4.1. Accuracies comparison: built-in multi-class SVM, one-versus-one, and DCSVM

The main goal of DCSVM is to improve multi-class prediction performance without sacrificing prediction accuracy. The first experimental results compare the multi-class prediction accuracy of: (i) the built-in SVM multi-class prediction (in the e1071 package), (ii) the one-versus-one implementation in R, and (iii) the DCSVM implementation in R. For the experiment, we used cross-validation with 80% of the data for training and 20% for testing, for each data set. We ran 10 trials and averaged the results. The results are displayed in Figure 4 and show no significant differences between the three methods.
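For reference, a minimal sketch of this protocol in R (using the built-in iris data set as a stand-in; these are not the paper's actual evaluation scripts):

    library(e1071)

    # 80/20 split repeated 10 times; accuracy of the built-in multi-class svm().
    set.seed(1)
    accs <- replicate(10, {
      idx <- sample(nrow(iris), 0.8 * nrow(iris))              # 80% for training
      fit <- svm(Species ~ ., data = iris[idx, ])              # built-in multi-class SVM
      mean(predict(fit, iris[-idx, ]) == iris$Species[-idx])   # accuracy on the held-out 20%
    })
    mean(accs)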

4.2. Prediction performance comparison

Figure 5. Average number of Support Vectors for multi-class predictions
Figure 6. Average number of binary decisions for multi-class predictions
Figure 7. Average prediction times for multi-class predictions

For this purpose, we compared the R implementations of the one-versus-one method and DCSVM. We analyzed prediction performance in three respects: the average number of support vectors, the average number of binary decisions, and the prediction time. The number of support vectors used was computed by summing the support vectors of every binary decider over all binary decision steps, until the multi-class prediction was reached. The number of such support vectors is clearly proportional not only to the number of decision steps (which is illustrated separately), but also to the configuration of the data separated by each binary classifier. The corresponding performance results are presented in Figure 5, Figure 6, and Figure 7, respectively. Due to large variations in size between the data sets we used, we split the data sets into two size-balanced groups and display each graph side-by-side for the two groups. DCSVM significantly outperforms one-versus-one, being much less computationally intensive (fewer support vectors per prediction) and faster (fewer binary decisions and shorter prediction times).

From the first two sets of experimental results we can conclude already that DCSVM achieved the goal of being a faster multi-class predictor without sacrificing prediction accuracy.

4.3. DCSVM performance fine tuning

In this set of experiments we analyze in closer detail DCSVM's performance as a function of the accuracy threshold parameter.

Figure 8. Accuracy and the average number of prediction steps for different thresholds

Figure 8 shows the trade-off between accuracy (left) and the average number of prediction steps (right) for various threshold values. Clearly, the accuracy threshold parameter permits a trade-off between accuracy and speed. However, the effect is largely data dependent: the more separable the data, the less influence the threshold has on speed. For less separable data (such as the letter data set), fine adjustment of the threshold permits a trade-off between prediction accuracy and prediction speed. This is not the case for the vowel data set, which is highly separable: changes in the threshold influence neither the prediction accuracy nor the average number of prediction steps.

Figure 9. Accuracy for predicting “letter” with each method, for different split thresholds

Figure 9 shows how DCSVM's accuracy compares to the other multi-class classification methods (BI = built-in, OvsO = one-versus-one) for various threshold values. For less separable data (such as letter), DCSVM's accuracy drops sharply with the threshold (starting at some small threshold value) when compared to the accuracy of the one-versus-one method, which we found to perform better than the built-in method. The built-in and one-versus-one methods do not depend on the threshold value, of course; they are shown on the same graph for comparison purposes. It is interesting to notice, however, that as the threshold increases, the prediction accuracy of DCSVM on the letter data set decreases from a value comparable to the one-versus-one method's accuracy (the best on this data set) towards the accuracy of the built-in method. For the threshold values reported in Table 4, the prediction accuracy of DCSVM on the letter data set stays above the accuracy of the built-in method.

                                     DCSVM (split thresholds θ1, ..., θ4)
No  Dataset        BI     OvsO      θ1      θ2      θ3      θ4
 1  artificial1   98.85   98.76    98.70   98.70   98.76   98.76
 2  iris          96.97   96.97    96.97   96.97   96.97   96.97
 3  segmentation  27.71   27.71    29.44   29.44   29.44   29.44
 4  heart         58.82   58.82    59.12   59.12   59.12   59.12
 5  wine          96.97   96.97    96.97   96.97   96.97   96.97
 6  wine-quality  62.33   62.33    62.39   62.39   62.39   62.39
 7  glass         87.55   91.70    91.29   91.29   91.29   91.29
 8  covertype     49.95   49.95    49.95   49.95   49.95   49.95
 9  svmguide4     57.88   71.21    72.73   72.73   72.73   72.73
10  vowel         94.34   97.26    97.26   97.26   97.26   97.26
11  usps          94.17   93.94    93.89   94.08   94.17   94.17
12  letter        95.25   96.43    95.52   96.00   96.41   96.41
13  poker         55.94   55.96    55.56   55.83   55.96   55.96
14  sensorless    97.46   98.87    98.32   98.60   98.86   98.87
Table 4. Prediction accuracies (%) for different split thresholds θ1, ..., θ4
No  Dataset         θ1      θ2      θ3      θ4
 1  artificial1    3.76    3.75    3.67    3.67
 2  iris           1.71    1.71    1.71    1.71
 3  segmentation   5.63    5.63    5.63    5.63
 4  heart          4.00    4.00    4.00    4.00
 5  wine           1.69    1.69    1.69    1.69
 6  wine-quality   4.85    4.86    4.87    4.87
 7  glass          4.09    4.12    4.12    4.12
 8  covertype      5.93    5.93    5.93    5.93
 9  svmguide4      4.63    4.88    4.88    4.88
10  vowel          5.41    5.41    5.41    5.41
11  usps           7.16    7.29    7.80    7.80
12  letter        17.63   19.56   22.29   22.29
13  poker          8.36    8.40    8.43    8.43
14  sensorless     5.44    5.49    6.28    6.93
Table 5. DCSVM: Average number of binary decision steps per prediction, for different split thresholds θ1, ..., θ4
No  Dataset            θ1         θ2         θ3         θ4
 1  artificial1      115.17     117.29     127.22     127.22
 2  iris              32.47      32.47      32.47      32.47
 3  segmentation     305.49     305.49     305.49     305.49
 4  heart            270.18     270.18     270.18     270.18
 5  wine              66.41      66.41      66.41      66.41
 6  wine-quality    1154.49    1155.55    1157.13    1157.13
 7  glass            112.14     114.37     114.37     114.37
 8  covertype       4528.47    4528.47    4528.47    4528.47
 9  svmguide4        236.58     245.71     245.71     245.71
10  vowel            218.36     218.36     218.36     218.36
11  usps             785.21     798.41     846.84     846.84
12  letter          2822.54    3110.42    3307.19    3307.19
13  poker          22166.49   22735.92   22803.01   22807.62
14  sensorless      2977.89    3057.77    3404.91    3735.30
Table 6. DCSVM: Average number of support vectors per prediction, for different split thresholds θ1, ..., θ4

Table 4 shows side-by-side accuracies of multi-class classification using (i) the built-in method (BI), (ii) one-versus-one (OvsO), and (iii) DCSVM, for four split threshold values θ1, ..., θ4. DCSVM performs very well in terms of accuracy (compared to the other methods) on all data sets; in general, accuracy improves as the elimination criterion becomes stricter, while a more permissive threshold may increase the prediction speed (Table 5) and reduce the computational effort (Table 6). It is interesting to notice that in all cases displayed in Table 5 the average number of decision steps is less than k, where k is the number of classes in the respective data set. DCSVM therefore outperforms (even for a very permissive threshold) one-versus-one and its improvement DAGSVM [16], which reaches a multi-class prediction after k-1 steps.

Figure 10. Number of separated classes for different thresholds

The All-Predictions table in Figure 2 collects all the information used by DCSVM to construct its multi-class prediction strategy (the DCSVM decision tree in Algorithm 1). The same information can be used to predict how much separation can be achieved for different threshold values. For instance, for the glass data set All-Predictions table in Figure 2 and a given threshold value, there are 58 entries in the table where one class or the other is predicted with an error of at most that threshold (out of a total of $15 \times 6 = 90$ entries in the table). This percentage of separated entries is a good indicator of purity for DCSVM with the given threshold: the higher the percentage, the more separation is produced at each step and hence the shallower the decision tree. Figure 10 shows the class separation percentages for a range of threshold values and four data sets (letter, vowel, usps, and sensorless). Intuitively, as the threshold increases so does the separation percentage. The letter and usps data sets display an almost linear increase of separation with the threshold. sensorless displays a sharp increase for small threshold values, after which the curve tends to flatten, that is, there is not much gain in separation for a significant increase in threshold (and hence a possible loss of accuracy). Lastly, vowel displays a step-like behavior: not much gain in separation for small threshold values, then a steep increase over a narrow range of thresholds, after which not much happens again. One can use these indicators to decide the trade-off between prediction speed and accuracy.
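The separation percentage for a given split threshold can be computed directly from the all-predictions table; a short R sketch (reusing the table A from the earlier sketch, and treating the split threshold tol as the maximum accepted error):

    # Percentage of table entries decided at split threshold `tol`: an entry is
    # decided if one class or the other is predicted with error at most `tol`.
    separation_pct <- function(A, tol) {
      decided <- (A >= 1 - tol) | (A <= tol)
      100 * mean(decided)
    }
    # e.g. sapply(c(0, 0.01, 0.02, 0.05), function(t) separation_pct(A, t))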

5. Conclusion

In this paper we present DCSVM, a fast algorithm for multi-class classification using Support Vector Machines. Our method relies on dividing the whole training data set into two partitions that are easily separable by a single binary classifier. A single prediction between the two training set partitions then eliminates one or more classes at a time. The algorithm continues recursively until a final binary decision is made between the last two classes left in the competition. Our algorithm performs consistently better than the existing methods on average. In the best case scenario, our algorithm makes a final decision between k classes in $\lceil \log_2 k \rceil$ decision steps between different partitions of the training data set. In the worst case scenario, DCSVM makes a final decision in k-1 steps, which is no worse than the existing techniques.

The SVM divide and conquer technique we present for multi-class classification can be easily used with any binary classifier. It is rather a consequence of increasing data sparsity with the dimensionality of the space, which can be exploited, in general, in favor of producing fast multi-class classification using binary classifiers. Our experimental results on a few popular data sets show the applicability of the method.

References

  • [1] R. Bellman and Rand Corporation. Dynamic Programming. Rand Corporation research study. Princeton University Press, 1957.
  • [2] R.E. Bellman. Dynamic Programming. Dover Books on Computer Science Series. Dover Publications, 2003.
  • [3] Arit Kumar Bishwas, Ashish Mani, and Vasile Palade. An all-pair approach for big data multiclass classification with quantum SVM. CoRR, abs/1704.07664, 2017.
  • [4] Leon Bottou, Corinna Cortes, J. S. Denker, Harris Drucker, I. Guyon, L. D. Jackel, Yann LeCun, U. A. Muller, Eduard Sackinger, Patrice Simard, and V. Vapnik. Comparison of classifier methods: A case study in handwritten digit recognition. In Proceedings of the International Conference on Pattern Recognition, Jerusalem, October 1994, volume II, pages 77–82. IEEE, 1994.
  • [5] Erin J. Bredensteiner and Kristin P. Bennett. Multicategory classification by support vector machines. Comput. Optim. Appl., 12(1-3):53–79, January 1999.
  • [6] Corinna Cortes and Vladimir Vapnik. Support-vector networks. In Machine Learning, pages 273–297, 1995.
  • [7] Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. J. Mach. Learn. Res., 2:265–292, March 2002.
  • [8] Dua Dheeru and Efi Karra Taniskidou. UCI machine learning repository, 2017.
  • [9] Evgenia Dimitriadou, Kurt Hornik, Friedrich Leisch, David Meyer, and Andreas Weingessel. e1071: Misc Functions of the Department of Statistics (e1071), TU Wien. R package version 1.7-0., July 2018.
  • [10] Jerome H. Friedman. Another approach to polychotomous classification. Technical report, Department of Statistics, Stanford University, 1996.
  • [11] Xisheng He, Zhe Wang, Cheng Jin, Yingbin Zheng, and Xiangyang Xue. A simplified multi-class support vector machine with reduced dual optimization. 33:71–82, 01 2012.
  • [12] Chih-Wei Hsu and Chih-Jen Lin. A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks, 13(2):415–425, March 2002.
  • [13] Stefan Knerr, Léon Personnaz, and Gérard Dreyfus. Single-layer learning revisited: A stepwise procedure for building and training a neural network. In F. Fogelman Soulié and J. Hérault, editors, Neurocomputing: Algorithms, Architectures and Applications, volume F68 of NATO ASI Series, pages 41–50. Springer-Verlag, 1990.
  • [14] Ulrich H.-G. Kreßel. Advances in kernel methods. chapter Pairwise Classification and Support Vector Machines, pages 255–268. MIT Press, Cambridge, MA, USA, 1999.
  • [15] Sang-Hyeun Park and Johannes Fürnkranz. Efficient pairwise classification. In Joost N. Kok, Jacek Koronacki, Ramon Lopez de Mantaras, Stan Matwin, Dunja Mladenić, and Andrzej Skowron, editors, Machine Learning: ECML 2007, pages 658–665, Berlin, Heidelberg, 2007. Springer Berlin Heidelberg.
  • [16] John C. Platt, Nello Cristianini, and John Shawe-Taylor. Large margin dags for multiclass classification. In Proceedings of the 12th International Conference on Neural Information Processing Systems, NIPS’99, pages 547–553, Cambridge, MA, USA, 1999. MIT Press.
  • [17] A. Rosales-Perez, S. Garcia, H. Terashima-Marin, C. A. Coello Coello, and F. Herrera. Mc2esvm: Multiclass classification based on cooperative evolution of support vector machines. IEEE Computational Intelligence Magazine, 13(2):18–29, May 2018.
  • [18] Daniel Silva-Palacios, Cèsar Ferri, and María José Ramírez-Quintana. Probabilistic class hierarchies for multiclass classification. Journal of Computational Science, 26:254 – 263, 2018.
  • [19] Sandor Szedmak, John Shawe-Taylor, Craig Saunders, and David Hardoon. Multiclass classification by L1 norm support vector machine. In Pattern Recognition and Machine Learning in Computer Vision Workshop, May 2004.
  • [20] Vladimir Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
  • [21] J. Weston and C. Watkins. Multi-class support vector machines. In Proceedings of the European Symposium on Artificial Neural Networks (ESANN 1999), pages 219–224, 1999.
  • [22] Jie Xu, Xianglong Liu, Zhouyuan Huo, Cheng Deng, Feiping Nie, and Heng Huang. Multi-class support vector machine via maximizing multi-class margins. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 3154–3160. AAAI Press, 2017.