Multiclass classification is the problem of classifying data instances into one of three or more classes. In a typical learning setting, we are given a fixed set of classes and a set of training instances, each of which belongs to exactly one class, and the goal is to construct a function which, given a new data instance, correctly predicts the class to which it belongs. Multiclass classification problems are very common in the real world and arise in a variety of scenarios, such as image classification [ciregan2012multi], text classification [nigam2000text], e-commerce product classification [schulten2001commerce], and medical diagnosis [panca2017application]. Currently, one of the most widely used solutions for multiclass classification is the family of decomposition methods (other efforts try to solve the multiclass problem directly, e.g. [bredensteiner1999multicategory, choromanska2015logarithmic, mroueh2012multiclass, weston1998multi, hsu2009multi, prabhu2014fastxml, si2017gradient, yen2016pd]; however, they are not as popular as decomposition methods and are thus outside the scope of this paper). A decomposition method splits a multiclass problem, or polychotomy, into a series of independent two-class problems (dichotomies) and then recombines the outputs of the dichotomies to reconstruct the original polychotomy. In practice, the widespread use of decomposition methods is mainly due to their simplicity and easy adaptation to existing popular learners, e.g. support vector machines, neural networks, and gradient boosting trees.
There are several concrete realizations of decomposition methods, including One-Versus-All (OVA) [nilsson1965learning], One-Versus-One (OVO) [hastie1998classification], and Error-Correcting-Output-Code (ECOC) [dietterich1995solving]. In particular, OVA trains one base learner per class, where the positive examples of each learner are all the instances in its class and the negative examples are all the remaining instances; OVO trains one base learner for each pair of classes to distinguish between the two. While OVA and OVO are simple to implement and widely used in practice, they have some obvious disadvantages. First, both OVA and OVO assume that all classes are orthogonal and that the corresponding base learners are independent of each other, which neglects the latent correlation between classes in real-world applications. For example, in an image classification task, the instances of the ‘Cat’ class are more strongly correlated with those of the ‘Kitty’ class than with those of the ‘Dog’ class. Moreover, training OVA and OVO is inefficient because of their high computational complexity when the number of classes is large, leading to extremely high training cost on large-scale classification datasets.
ECOC-based methods, on the other hand, are theoretically preferable to both OVA and OVO since they can, to some extent, alleviate these disadvantages. More concretely, ECOC-based methods rely on a coding matrix, which defines a new transformation of instance labeling, to decompose the multiclass problem into dichotomies, and then recombine them in a way that decorrelates the dichotomies and corrects errors. Generating different distances for different pairs of classes enables ECOC-based methods to bring the correlations among classes into the whole learning process. For example, if the coding matrix assigns similar codewords to ‘Cat’ and ‘Kitty’ and a more distant codeword to ‘Dog’, the learned model can ensure a closer distance between instance pairs across ‘Cat’ and ‘Kitty’ than between those across ‘Cat’ and ‘Dog’. Moreover, since the length of the code, which is also the number of base learners, can be much smaller than the number of classes, ECOC-based methods can significantly reduce the computational complexity compared with OVA and OVO, especially when the original number of classes is very large.
Given the delicate design of class coding, the performance of ECOC-based methods depends heavily on the design of the coding matrix and the corresponding decoding strategy. The most straightforward way is to create a random coding matrix for class transformation and use the Hamming decoding strategy. The accuracy of this simple approach can be highly volatile due to its randomness. To address this problem, many efforts have focused on optimizing the coding matrix. However, it is almost impossible to find an optimal coding matrix due to the complexity of the problem, and even finding a sub-optimal coding matrix is likely to be quite time-consuming. Such uncertainty and inefficiency in finding a sub-optimal coding matrix prevent the broader use of ECOC-based methods in real-world scenarios.
To address this challenge, we propose a new dynamic ECOC-based decomposition approach, named LightMC. Instead of using a fixed coding matrix and decoding strategy, LightMC dynamically optimizes the coding matrix and decoding strategy, toward more accurate multiclass classification, jointly with the training of base learners in an iterative way. To achieve this, LightMC takes advantage of a differentiable decoding strategy, which allows it to perform the optimization by gradient descent so that the training loss can be further reduced. In addition to improving the final classification accuracy and obtaining a coding matrix and decoding strategy that are more beneficial to classification performance, LightMC can significantly boost efficiency since it saves much of the time spent searching for a sub-optimal coding matrix. As LightMC optimizes the coding matrix together with the model training process, it is not necessary to spend much time tuning an initial coding matrix, and, as shown by further empirical studies, even a random coding matrix can yield satisfactory performance.
To validate the effectiveness and efficiency of LightMC, we conduct experimental analysis on several public large-scale datasets. The results illustrate that LightMC can outperform OVA and existing ECOC-based solutions in both training speed and accuracy.
This paper makes the following major contributions:
We propose a new dynamic decomposition algorithm, named LightMC, that can outperform traditional ECOC-based methods in terms of both accuracy and efficiency.
We define a differentiable decoding strategy and derive an effective algorithm to dynamically refine the coding matrix by extending the well-known back propagation algorithm.
We conduct extensive experimental analysis on multiple public large-scale datasets to demonstrate both the effectiveness and the efficiency of the proposed decomposition algorithm.
The rest of the paper is organized as follows. Section 2 introduces ECOC decomposition approaches and related work. Section 3 presents the details of LightMC. Section 4 shows experimental results that validate our proposition on large-scale publicly available multiclass classification datasets. Finally, we conclude the paper in Section 5.
2.1 Error Correcting Output Code (ECOC)
ECOC was first introduced to decompose multiclass classification problems by Dietterich and Bakiri [dietterich1995solving]. In this method, each class is assigned a codeword, each entry of which gives the label of the data from that class when training the corresponding base learner. All codewords can be stacked to form a coding matrix with one row per class and one column per base learner, so that the length of a codeword equals the number of base learners. Given the outputs of the base learners, the final multiclass classification result is obtained through a decoding strategy (Eq. 1), which applies the sign function to each base-learner output, mapping it to 1 or −1, and predicts the class whose codeword disagrees with the resulting signs in the fewest positions. This decoding strategy is also called Hamming decoding, as it makes the prediction by choosing the class with the lowest Hamming distance. Under this decoding strategy, the coding matrix is capable of correcting a certain number of errors made by the base learners [dietterich1995solving].
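As an illustration of Hamming decoding, the following is a minimal NumPy sketch rather than the authors' implementation. The function name `hamming_decode`, the toy coding matrix `M`, and the convention that the sign of zero is +1 are all assumptions made for this example.

```python
import numpy as np

def hamming_decode(coding_matrix, outputs):
    """Return the index of the class whose codeword has the lowest Hamming distance.

    coding_matrix: (num_classes, code_length) array with entries in {-1, +1}.
    outputs: (code_length,) array of real-valued base-learner outputs.
    """
    # Sign function; sgn(0) is taken to be +1 here (an assumption of this sketch).
    signs = np.where(outputs >= 0, 1, -1)
    # Hamming distance between the sign pattern and every codeword (row).
    distances = np.sum(coding_matrix != signs, axis=1)
    return int(np.argmin(distances))

# Hypothetical 4-class code of length 5; any two rows differ in at least 3 positions,
# so a single wrong base learner can still be corrected.
M = np.array([
    [ 1,  1,  1,  1,  1],
    [ 1,  1, -1, -1, -1],
    [-1, -1,  1, -1, -1],
    [-1, -1, -1,  1,  1],
])

# Outputs for an instance of class 2 where the first base learner is wrong.
noisy_outputs = np.array([0.3, -0.8, 0.9, -0.6, -0.7])
predicted = hamming_decode(M, noisy_outputs)
```

In this toy setting the first base learner disagrees with the codeword of class 2, yet the decoder still returns class 2 because every other codeword is further away.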
ECOC-based methods have several advantages over traditional decomposition approaches. First, the introduction of the coding matrix, which can indicate different distances between different class pairs, enables us to integrate the correlation among classes into the classification model so as to further improve classification accuracy. Moreover, since the code length, i.e., the number of base learners, can be much smaller than the number of classes, ECOC-based methods can be more efficient than OVA and OVO, especially when the number of classes is very large.
The classification performance of ECOC-based methods depends heavily on the design of the coding matrix. Nevertheless, finding the best coding matrix is NP-complete, as stated in [crammer2002learnability]. Thus, it is almost impossible to find an optimal coding matrix, and even finding a sub-optimal coding matrix is likely to be quite time-consuming. Such uncertainty and inefficiency in finding a sub-optimal coding matrix prevent the broader use of ECOC-based methods in real-world applications.
2.2 Related work
Recent years have witnessed many efforts to improve ECOC-based decomposition methods. In particular, many existing studies have focused on discovering a more appropriate coding matrix. For example, some efforts used a hierarchical partition of the class space to generate the corresponding code [baro2009traffic, pujol2006discriminant]; other studies explored genetic algorithms to produce coding matrices with good properties [garcia2008evolving, bautista2012minimal, bagheri2013genetic, bautista2014design]; moreover, a couple of efforts have demonstrated significant improvement on ECOC-based methods by using spectral decomposition to find a good coding matrix [zhang2009spectral] or by relaxing the integer constraint on the coding matrix elements so as to adopt a continuous-valued coding matrix [zhao2013sparse]. In the meantime, some previous studies turned to optimizing the decoding strategy by employing bagging and boosting approaches [hatami2012thinned, rocha2014multiclass] or by assigning deliberate weights to base learners for further aggregation [escalera2006ecoc].
While these previous studies can improve ECOC-based methods to some extent, they still suffer from two main challenges: 1) Efficiency: In order to increase multiclass classification accuracy, many previous works such as [baro2009traffic, pujol2006discriminant] designed the coding matrix with a long code length, which leads to almost as many base learners as the models needed in OVA and OVO. This limitation makes existing ECOC-based methods very inefficient on large-scale classification problems. 2) Scalability: Most previous ECOC-based methods were studied on small-scale classification data, which usually consists of, for example, tens of classes and thousands of samples [zhao2013sparse]. To the best of our knowledge, there is no existing in-depth verification of the performance of ECOC-based methods on large-scale classification data. Meanwhile, such an investigation is difficult in practice, since most of these methods cannot scale up to large-scale data due to the long code length and the expected high pre-processing cost.
Because of these major shortcomings, it is quite challenging to apply existing ECOC-based methods to real-world applications, especially large-scale multiclass classification problems.
To address the major shortcomings of ECOC-based methods stated in Sec. 2, we propose a new multiclass decomposition algorithm, named LightMC. Instead of determining the coding matrix and decoding strategy before training, LightMC dynamically refines the ECOC decomposition by directly optimizing the global objective function, jointly with the training of base learners. More specifically, LightMC introduces a new differentiable decoding strategy, which enables it to optimize the coding matrix and decoding strategy directly via gradient descent during the training of base learners. As a result, LightMC has two advantages: 1) Effectiveness: rather than separating the design of the coding matrix and decoding strategy from the training of base learners, LightMC can further enhance ECOC-based methods in terms of classification accuracy by jointly optimizing the coding matrix, decoding strategy, and base learners; 2) Efficiency: since the coding matrix is automatically optimized in the subsequent training, LightMC can significantly reduce the time cost of finding a good coding matrix before training.
In this section, we first introduce the overall training algorithm. Then, we present our new decoding model and derive the optimization algorithms for the decoding strategy and coding matrix based on it. Finally, we further discuss the performance and efficiency of LightMC.
3.1 Overall Algorithm
The general learning procedure of LightMC is summarized in Fig. 1. More specifically, before LightMC starts training, a coding matrix is first initialized by an existing ECOC-based solution. Then, to make full use of the training information from base learners, LightMC employs an alternating optimization algorithm, which alternates the learning of base learners with the coding and decoding optimization: when training base learners, the coding and decoding strategy is fixed, and vice versa. This joint learning procedure runs repeatedly until the whole training converges.
Note that, instead of determining coding matrix before training, LightMC develops an end-to-end solution to jointly train base learners and the decomposition models in an iterative way. The details of the LightMC algorithm can be found in Alg. 1. Within this algorithm, there are two essential steps: TrainDecoding is used to optimize the decoding strategy, the details of which will be revealed in Sec. 3.2; and, TrainCodingMatrix aims at optimizing the coding matrix, the details of which will be introduced in Sec. 3.3.
3.2 New Differentiable Decoding Strategy: Softmax Decoding
To find the optimal coding and decoding strategies, it is necessary to optimize the global objective function directly. However, most existing decoding strategies are not differentiable, which prevents us from optimizing the global objective function directly with the widely used back propagation method. To remove this obstacle, it is critical to design a decoding strategy that is differentiable while preserving the error-correcting property.
A closer look at the decoding strategy, i.e., Eq. 1, reveals two non-differentiable functions: the sign function and the arg min operator. As introduced in [escalera2010decoding], the sign function can be removed directly, since the resulting distance function becomes the Manhattan (L1) distance, which still preserves the error-correcting property. Meanwhile, the arg min operator can be replaced by the widely used softmax function, which approximates arg max while producing continuous probabilities and is thus differentiable. More specifically, we first replace arg min with arg max by reversing the sign of the distance at the same time. In this way, when the outputs of the classifiers equal the codeword of a class, the negated distance for that class attains the maximum value instead of the minimum. After that, we replace arg max with softmax directly, and the whole decoding strategy becomes Eq. 2, in which each class is scored by the similarity between the classifier outputs and the code of that class. Although the L1 distance is applied in the algorithm, the L2 distance or other distance functions mentioned in [escalera2010decoding] are also applicable and should produce similar results. Note that, after all the transformations mentioned above, the decoding strategy assigns the highest score to the class closest to the output vector, which is exactly the error-correcting property [escalera2010decoding].
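The transformation described above can be sketched as follows. This is an illustrative NumPy version under assumed notation, not the paper's exact Eq. 2: `softmax_decode`, the toy coding matrix `M`, and the example outputs are hypothetical.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a 1-D array."""
    shifted = scores - np.max(scores)
    exps = np.exp(shifted)
    return exps / exps.sum()

def softmax_decode(coding_matrix, outputs):
    """Class probabilities from negated L1 distances to each codeword.

    The sign function of Hamming decoding is dropped (giving an L1 distance),
    the distance is negated (arg min becomes arg max), and arg max is relaxed
    to softmax so that the decoder is differentiable.
    """
    l1_distances = np.abs(coding_matrix - outputs).sum(axis=1)
    return softmax(-l1_distances)

# Hypothetical 4-class code of length 5 with entries in {-1, +1}.
M = np.array([
    [ 1,  1,  1,  1,  1],
    [ 1,  1, -1, -1, -1],
    [-1, -1,  1, -1, -1],
    [-1, -1, -1,  1,  1],
], dtype=float)

# Outputs for an instance of class 2 where the first base learner is wrong.
outputs = np.array([0.3, -0.8, 0.9, -0.6, -0.7])
probs = softmax_decode(M, outputs)
```

The class with the smallest L1 distance receives the largest probability, so taking the arg max of `probs` keeps the nearest-codeword behaviour while the scores themselves remain differentiable in both `M` and `outputs` almost everywhere.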
Having such a differentiable error-correcting decoding strategy enables us to employ the widely used gradient descent algorithm to optimize the decoding strategy directly. Before doing this, we notice that the new decoding function can be rewritten in the form of a single-layer softmax regression. Because the distance function in Eq. 2 reduces to a linear function of the base-learner outputs, the decoding strategy can be rewritten in exactly the same form as a single-layer linear model with a softmax activation. As a result, we can use gradient descent to train the softmax's parameters, which are initialized from the coding matrix, in order to reduce the overall loss. For convenience of derivative computation, we choose multiclass cross entropy, which is commonly used together with the softmax function, as our loss function. The overall loss on a single data point is the cross entropy between the softmax output and a one-hot vector transformed from the original label, and the parameters are updated along the negative gradient of this loss, scaled by a learning rate. This optimization process is called TrainDecoding in Alg. 2. As in ordinary gradient descent, the data are partitioned into mini-batches, each of which is used to calculate the current gradients for a single round of updates. We can also apply L1/L2 regularization here to improve generalization. Note that, with a suitable learning rate, gradient descent decreases the overall loss through the iterations, which makes this algorithm a valid method to refine the decoding strategy.
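A minimal sketch of one decoding update, assuming ±1 codeword entries and base-learner outputs clipped to [−1, 1], is given below. Under these assumptions the L1 distance between an entry m and an output o equals 1 − m·o, so softmax over negated L1 distances coincides with softmax over a linear map. The names `train_decoding_step`, `theta`, and the synthetic data are hypothetical and do not reproduce Alg. 2.

```python
import numpy as np

def softmax_rows(logits):
    """Row-wise numerically stable softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=1, keepdims=True)

def cross_entropy(theta, outputs, labels):
    """Mean multiclass cross entropy of the linear softmax decoder."""
    probs = softmax_rows(outputs @ theta.T)
    return float(-np.mean(np.log(probs[np.arange(len(labels)), labels])))

def train_decoding_step(theta, outputs, labels, lr=0.1):
    """One mini-batch gradient descent step on the softmax parameters.

    theta: (num_classes, code_length), outputs: (batch, code_length),
    labels: (batch,) integer class indices.
    """
    probs = softmax_rows(outputs @ theta.T)            # (batch, num_classes)
    onehot = np.eye(theta.shape[0])[labels]            # one-hot targets
    grad = (probs - onehot).T @ outputs / len(labels)  # d(loss)/d(theta)
    return theta - lr * grad

# Hypothetical coding matrix and synthetic, noisy base-learner outputs.
M = np.array([
    [ 1,  1,  1,  1,  1],
    [ 1,  1, -1, -1, -1],
    [-1, -1,  1, -1, -1],
    [-1, -1, -1,  1,  1],
], dtype=float)
rng = np.random.default_rng(0)
labels = rng.integers(0, 4, size=64)
outputs = np.clip(M[labels] + rng.normal(0.0, 0.5, size=(64, 5)), -1.0, 1.0)

theta = M.copy()  # parameters start from the coding matrix
loss_before = cross_entropy(theta, outputs, labels)
theta = train_decoding_step(theta, outputs, labels)
loss_after = cross_entropy(theta, outputs, labels)
```

Starting `theta` at the coding matrix means the first forward pass reproduces the softmax decoding of the fixed code, and subsequent steps only move away from it when doing so lowers the cross entropy.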
3.3 Coding Matrix Optimization
Besides decoding optimization, it is also beneficial to optimize the coding matrix during the iterative training. We notice that, if the input of softmax decoding, i.e., the outputs of the base learners, could also be updated via back propagation, we would be able to further lower the overall training loss. The corresponding update would move these outputs along the negative gradient of the loss, scaled by a learning rate. However, the outputs cannot be updated directly since they are produced by the base learners. Fortunately, optimizing the coding matrix enables us to update them indirectly so as to further reduce the overall training loss.
As stated in Sec. 2.1, each entry of the coding matrix determines the label of the data belonging to one class when they are used to train one base learner. If we assume that base learners are able to fit the given learning target perfectly, then the output of any classifier for any data belonging to a given class always equals the corresponding entry of the coding matrix. Thus, changes to the coding matrix will affect the targets of the base learners, and the outputs of the base learners will change subsequently. Moreover, since in this situation the gradient with respect to a coding matrix entry is equal to the gradient with respect to the corresponding base-learner output, we can optimize the coding matrix by gradient descent.
However, there is no perfect base learner in practice. As a result, we cannot use the above solution to optimize the coding matrix directly, since the outputs no longer equal the coding matrix entries. Nevertheless, many data samples are available for a single class; that is, each coding matrix entry corresponds to many base-learner outputs, one for every sample of that class. So instead of using the unstable gradient at a single point, we can use the average gradient over each class to obtain a more stable estimate of the gradient with respect to the coding matrix.
This estimate can then be used to update the coding matrix. The optimization algorithm is described in Alg. 3, which is almost the same as normal back propagation except that it uses the whole batch of data to calculate average gradients before performing updates. This method is also shown empirically to be effective in our experiments in the next section, which indicates that, by optimizing the global objective function, the coding matrix can be refined to reduce the loss as well as to enhance generalization.
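The class-averaged update described above can be sketched as follows. This is an illustration under assumed notation, not a reproduction of Alg. 3: the function `train_coding_matrix_step`, the variable names, and the toy data are hypothetical, and the decoder is taken to be the linear softmax form with parameters `theta`.

```python
import numpy as np

def softmax_rows(logits):
    """Row-wise numerically stable softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=1, keepdims=True)

def train_coding_matrix_step(coding_matrix, theta, outputs, labels, lr=0.1):
    """Update each codeword with the average output gradient of its class.

    The gradient of the cross entropy with respect to the base-learner outputs
    is back-propagated through the softmax decoder, averaged over the samples
    of each class, and applied to that class's row of the coding matrix.
    """
    probs = softmax_rows(outputs @ theta.T)          # (batch, num_classes)
    onehot = np.eye(theta.shape[0])[labels]
    grad_outputs = (probs - onehot) @ theta          # (batch, code_length)
    updated = coding_matrix.astype(float).copy()
    for c in np.unique(labels):                      # classes absent from the batch stay unchanged
        updated[c] -= lr * grad_outputs[labels == c].mean(axis=0)
    return updated

# Hypothetical coding matrix, decoder parameters, and a tiny batch.
M = np.array([
    [ 1,  1,  1,  1,  1],
    [ 1,  1, -1, -1, -1],
    [-1, -1,  1, -1, -1],
    [-1, -1, -1,  1,  1],
], dtype=float)
theta = M.copy()
labels = np.array([0, 0, 1])
outputs = np.array([
    [0.9, 0.8,  1.0,  0.7,  0.9],
    [1.0, 0.2,  0.6,  0.9, -0.1],
    [0.8, 0.9, -0.7, -1.0, -0.6],
])
new_M = train_coding_matrix_step(M, theta, outputs, labels)
```

The updated rows become real-valued targets for the base learners in the next training round; the same per-class sums and counts could equally be accumulated over mini-batches.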
In the rest of this section, we further discuss the efficiency and performance of LightMC.
Efficiency: Compared with existing ECOC-based methods, LightMC is more efficient as it needs much less time to find a coding matrix before training. Meanwhile, it can still achieve comparable performance since the coding matrix is dynamically refined in the subsequent training. Moreover, LightMC requires only a small additional optimization cost, which is the same as the cost of a single-layer linear model and much smaller than the cost of powerful base learners such as neural networks and GBDT. The experimental results in the following section further demonstrate the efficiency of LightMC.
Mini-Batch Coding Optimization Method: One shortcoming of Alg. 3 is its inefficient memory usage, as it uses the full batch for each update. It is quite natural to switch to mini-batch updates, since the average gradients can be calculated over mini-batches as well.
Distributed Coding: Binary coding is used in most existing ECOC-based methods. LightMC, on the other hand, employs distributed coding in order to perform continuous optimization. Distributed coding, also called embedding, contains more information than binary coding [mikolov2013distributed, zhao2013sparse], which enables LightMC to leverage more information about the correlations among classes.
Alternating Training with Base Learners: As shown in Alg. 1, when the base learner is not a boosting learner, for example a neural network, LightMC can be called at each iteration (epoch). For boosting learners, LightMC starts from a specified round and is then called once every fixed number of rounds. This is because boosting learners use a learning rate that shrinks the output of the model at each iteration. As a result, boosting learners need more iterations to fit the new training targets. Therefore, waiting for a number of initial rounds and then calling LightMC only periodically improves efficiency, since calling LightMC at every iteration is not necessary.
Compared with Softmax Layer in Neural Networks: The form of softmax decoding is similar to the softmax layer in neural networks. However, they differ in two respects: 1) the softmax layer is effectively the same as OVA decomposition, and it does not use a coding matrix to encode the correlations among classes; 2) they use different optimization schemes: the loss per sample is reduced in the optimization of the softmax layer, while softmax decoding optimizes the loss per class (see Eq. 4). It is hard to say which one is better in practice for neural networks; some recent work even found that the accuracy is almost the same when using a fixed softmax layer [hoffer2018fix]. This topic, however, is outside the scope of this paper.
4.1 Experiment Setting
|AmazonCat-14K [mcauley2015inferring, mcauley2015image] (http://manikvarma.org/downloads/XC/XMLRepository.html)||14,588 (3,344 after converting to multi-class format)||597,940||5,497,775|
In this section, we report the experimental results for our proposed LightMC algorithm. We conduct experiments on five public datasets, as listed in Table 1. The datasets cover a wide range of sizes; the largest has millions of samples and on the order of ten thousand classes, and can be used to validate the scalability of LightMC. Among them, AmazonCat-14K is originally a multilabel dataset; we convert it to a multiclass one by randomly sampling one label per instance. As stated in Sec. 2.2, to the best of our knowledge, this is the first time ECOC-based methods have been examined on such large-scale datasets.
For the baselines, we use OVA and the evolutionary ECOC proposed in [bautista2012minimal]. OVO is excluded from the baselines due to the extreme inefficiency of the number of base learners it requires. For example, on the LSHTC1 data, OVO needs 72 million base learners and is estimated to take about 84 days to run an experiment even when the cost of one base learner is 0.1 second. The initial coding matrix of LightMC is set to be exactly the same as that of the ECOC baseline to make them comparable. Besides, to examine the efficiency of LightMC, we add another LightMC variant that starts from a random coding matrix, called LightMC(R). As for the code length, we start from the length suggested in [allwein2000reducing] and adjust it to account for our more powerful base learner.
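The OVO estimate above is simple arithmetic, which the following sketch reproduces. The helper name `ovo_learner_count` is hypothetical, and the figure of roughly 12,000 classes is an assumption chosen only because it yields about 72 million class pairs; the exact class count of LSHTC1 is not stated in this section.

```python
def ovo_learner_count(num_classes):
    """Number of pairwise base learners needed by One-Versus-One."""
    return num_classes * (num_classes - 1) // 2

# Assumed round figure for the number of classes (not taken from the paper).
approx_classes = 12_000
approx_learners = ovo_learner_count(approx_classes)

# Figures quoted in the text: 72 million learners at 0.1 second each.
quoted_learners = 72_000_000
seconds_per_learner = 0.1
total_days = quoted_learners * seconds_per_learner / 86_400  # seconds per day
```

With these inputs `total_days` is a little over 83, which is consistent with the "about 84 days" quoted above.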
For all decomposition methods we use LightGBM [ke2017lightgbm] to train the base learners. In all experiments we fix learning_rate, num_leaves (the maximum number of leaves in a single tree) and early_stopping (the number of early stopping rounds, set to 20). For AmazonCat-14K, we override num_leaves to 300 and early_stopping to 10, as otherwise a single experiment would take several weeks. Other parameters are left at their defaults. Our experimental environment is a Windows server with two E5-2670 v2 CPUs (20 cores in total) and 256 GB of memory. All experiments run with multi-threading, and the number of threads is fixed to 20.
Regarding parameters used by LightMC, the starting round is set to , to and to . And softmax’s parameters are trained for one epoch each time the optimization method is called.
4.2 Experiment Result Analysis
|News20||18.66%||20.82% ± 0.33%||20.63% ± 0.57%||18.63% ± 0.37%|
|Aloi||11.44%||10.72% ± 0.12%||10.75% ± 0.23%||9.75% ± 0.12%|
|Dmoz||N/A||55.87% ± 0.34%||55.55% ± 0.44%||53.95% ± 0.25%|
|LSHTC1||N/A||76.04% ± 0.59%||76.17% ± 0.73%||75.63% ± 0.33%|
|AmazonCat-14K||N/A||27.05% ± 0.11%||26.98% ± 0.21%||25.54% ± 0.10%|
The experimental results are reported in Tables 2 and 3. The OVA error results on the Dmoz, LSHTC1 and AmazonCat-14K datasets are not reported since the time costs are too high. However, we estimate their convergence time by using a subset of the original data.
From these two tables, we find that LightMC outperforms all the others in terms of both accuracy and convergence time. In particular, both ECOC and LightMC converge faster than OVA when the number of classes is larger. Furthermore, compared with ECOC, LightMC increases the accuracy by about 3% (relatively), and by 5.88% in the best case on the LSHTC1 dataset. As for speed, LightMC also takes less time than ECOC to converge. These results indicate that LightMC can further reduce the overall loss by dynamically refining the coding and decoding strategy, as expected.
We can also see that, when starting from a random coding matrix, the accuracy of LightMC(R) is comparable with that of ECOC. Despite the slower convergence of LightMC(R), its total time cost is still much lower than that of ECOC, since ECOC spends a large amount of additional time finding a good coding matrix before training. This result further supports the efficiency of LightMC: it can provide comparable accuracy without searching for a sub-optimal coding matrix before training.
To demonstrate more learning details, we plot the curves of the test error regarding the training time on Aloi and LSHTC1 datasets, as shown in Fig. 1(a) and 1(b), respectively. From Fig. 1(a), we can see clearly that the curve of LightMC always stays below the curves of the other two methods and converges earliest at the lowest point. Fig. 1(b) shows a slightly different pattern: LightMC and ECOC have similar accuracy and take comparable time to converge. However, LightMC still always stays below ECOC and converges 1,405 seconds earlier than ECOC, which also indicates that LightMC succeeds in enhancing existing ECOC methods.
In addition, to illustrate the effect of LightMC in optimizing the coding matrix, we calculate the distances between some class pairs on News20 under the optimized coding matrix. As shown in Table 4, the distance for the class pair (‘ibm.hardware’, ‘mac.hardware’) is much smaller than that for (‘mac.hardware’, ‘politics.mideast’). Moreover, the distance for the former pair keeps decreasing along with the training of LightMC, while that for the latter keeps increasing because the two classes are unrelated. This result empirically supports the effectiveness of LightMC in optimizing the coding matrix in the right direction.
In summary, these results illustrate the effectiveness and efficiency of LightMC. LightMC can not only strengthen existing ECOC-based methods but also achieve comparable classification accuracy in much less time, since it saves the time needed to find a sound coding matrix. Moreover, LightMC can optimize the coding matrix in a better direction.
We propose a novel dynamic ECOC-based multiclass decomposition algorithm, named LightMC, to solve large-scale classification problems efficiently. To better leverage the correlations among classes, LightMC dynamically optimizes its coding matrix and decoding strategy, jointly with the training of base learners. Specifically, we design a new differentiable decoding strategy to enable direct optimization of the decoding strategy and coding matrix. Experiments on public datasets with the number of classes ranging from twenty to more than ten thousand empirically show the effectiveness and efficiency of LightMC. In the future, we plan to examine how LightMC performs when it replaces the softmax layer in neural networks.