1 Introduction
Multiclass classification is the problem of classifying data instances into one of three or more classes. In a typical learning process of multiclass classification, assuming that there are $k$ classes, i.e., $\{c_1, \dots, c_k\}$, and $n$ training instances, i.e., $\{(x_i, y_i)\}_{i=1}^{n}$, each training instance belongs to one of the $k$ different classes, and the goal is to construct a function which, given a new data instance $x$, can correctly predict the class to which it belongs. Multiclass classification problems are very common in the real world, with a variety of scenarios such as image classification [ciregan2012multi], text classification [nigam2000text], e-commerce product classification [schulten2001commerce], medical diagnosis [panca2017application], etc. Currently, one of the most widely-used solutions for multiclass classification is the family of decomposition methods^1, which split a multiclass problem, or polychotomy, into a series of independent two-class problems (dichotomies) and then recompose the outputs of the dichotomies in order to reconstruct the original polychotomy. In practice, the widespread use of decomposition methods is mainly due to their simplicity and easy adaptation to existing popular learners, e.g., support vector machines, neural networks, gradient boosting trees, etc.

^1 There are also some other efforts that try to solve the multiclass problem directly, e.g., [bredensteiner1999multicategory, choromanska2015logarithmic, mroueh2012multiclass, weston1998multi, hsu2009multi, prabhu2014fastxml, si2017gradient, yen2016pd]. However, they are not as popular as decomposition methods and thus are not in the scope of this paper.
There are a couple of concrete realizations of decomposition methods, including One-Versus-All (OVA) [nilsson1965learning], One-Versus-One (OVO) [hastie1998classification], and Error-Correcting Output Codes (ECOC) [dietterich1995solving]. In particular, OVA trains $k$ different base learners; for the $i$-th of them, the positive examples are all the instances in class $c_i$ and the negative examples are all the instances not in $c_i$. OVO trains $k(k-1)/2$ base learners, one for each pair of classes. While OVA and OVO are simple to implement and widely used in practice, they have some obvious disadvantages. First, both OVA and OVO are based on the assumption that all classes are orthogonal and the corresponding base learners are independent of each other, which nevertheless neglects the latent correlation between classes in real-world applications. For example, in an image classification task, the instances under the 'Cat' class apparently have a stronger correlation to those under the 'Kitty' class than to those under the 'Dog' class. Moreover, the training of OVA and OVO is inefficient due to its high computational complexity when $k$ is large, leading to extremely high training cost when processing large-scale classification datasets.
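To make the scale concrete, the number of base learners grows linearly with $k$ for OVA but quadratically for OVO. A minimal sketch (the helper names are ours, for illustration only):

```python
# Number of base learners required by OVA and OVO for k classes;
# the function names are illustrative, not part of any library.

def ova_learners(k: int) -> int:
    # OVA: one binary learner per class (class i vs. all the rest).
    return k

def ovo_learners(k: int) -> int:
    # OVO: one binary learner per unordered pair of classes.
    return k * (k - 1) // 2

# For k = 12,045 (the class count of LSHTC1 in Sec. 4), OVO already
# needs over 72 million base learners, while OVA needs only 12,045.
print(ova_learners(12045), ovo_learners(12045))
```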
ECOC-based methods, on the other hand, are theoretically preferable to both OVA and OVO, since they can to some extent alleviate these disadvantages. More concretely, ECOC-based methods rely on a coding matrix, which defines a new transformation of instance labeling, to decompose the multiclass problem into dichotomies, and then recompose them in a way that decorrelates the dichotomies and corrects errors. Producing different distances for different pairs of classes indeed enables ECOC-based methods to leverage the correlations among classes in the whole learning process. For example, if the coding matrix assigns codewords to 'Cat', 'Kitty', and 'Dog' such that the codes of 'Cat' and 'Kitty' are closer to each other than to that of 'Dog', the learned model can ensure a closer distance between instance pairs across 'Cat' and 'Kitty' than across 'Cat' and 'Dog'. Moreover, since the length of the code, which is also the number of base learners, can be much smaller than $k$, ECOC-based methods can significantly reduce the computational complexity compared with OVA and OVO, especially when the original class number is very large.
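As an illustration of this point, the following uses hypothetical 6-bit codewords (our own invention, not the ones from the text) to show how codeword distance can encode class correlation:

```python
# Hypothetical 6-bit codewords illustrating how a coding matrix can
# encode class correlations via codeword (Hamming) distance.
codes = {
    "Cat":   [ 1,  1, -1,  1, -1,  1],
    "Kitty": [ 1,  1, -1,  1,  1,  1],   # differs from "Cat" in 1 bit
    "Dog":   [-1, -1,  1,  1, -1, -1],   # differs from "Cat" in 4 bits
}

def hamming(a, b):
    # Count positions where the two codewords disagree.
    return sum(x != y for x, y in zip(a, b))

print(hamming(codes["Cat"], codes["Kitty"]))  # small distance -> correlated classes
print(hamming(codes["Cat"], codes["Dog"]))    # large distance -> unrelated classes
```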
Given the delicate design of class coding, the performance of ECOC-based methods highly depends on the design of the coding matrix and the corresponding decoding strategy. The most straightforward way is to create a random coding matrix for the class transformation together with the Hamming decoding strategy. The accuracy of this simple approach, however, can be highly volatile due to its randomness. To address this problem, many efforts have been made to optimize the coding matrix. However, it is almost impossible to find an optimal coding matrix due to the complexity of the problem, and even finding a suboptimal coding matrix is likely to be quite time-consuming. Such uncertainty and inefficiency in finding a suboptimal coding matrix undoubtedly prevent the broader use of ECOC-based methods in real-world scenarios.
To address this challenge, we propose a new dynamic ECOC-based decomposition approach, named LightMC. Instead of using a fixed coding matrix and decoding strategy, LightMC dynamically optimizes the coding matrix and decoding strategy, toward more accurate multiclass classification, jointly with the training of base learners in an iterative way. To achieve this, LightMC takes advantage of a differentiable decoding strategy, which allows it to perform the optimization by gradient descent, guaranteeing that the training loss can be further reduced. In addition to improving the final classification accuracy by obtaining a coding matrix and decoding strategy more beneficial to classification performance, LightMC can furthermore significantly boost efficiency, since it saves the time otherwise spent searching for a suboptimal coding matrix. As LightMC optimizes the coding matrix together with the model training process, it is not necessary to spend much time tuning an initial coding matrix; as shown by our empirical studies, even a random coding matrix can lead to satisfying results.
To validate the effectiveness and efficiency of LightMC, we conduct experimental analysis on several public large-scale datasets. The results illustrate that LightMC can outperform OVA and existing ECOC-based solutions in terms of both training speed and accuracy.
This paper has the following major contributions:


We propose a new dynamic decomposition algorithm, named LightMC, that can outperform traditional ECOCbased methods in terms of both accuracy and efficiency.

We define a differentiable decoding strategy and derive an effective algorithm to dynamically refine the coding matrix by extending the well-known back propagation algorithm.

We conduct extensive experimental analysis on multiple public large-scale datasets to demonstrate both the effectiveness and the efficiency of the proposed decomposition algorithm.
The rest of the paper is organized as follows. Section 2 introduces ECOC decomposition approaches and related work. Section 3 presents the details of LightMC. Section 4 shows experimental results that validate our proposition on large-scale publicly available multiclass classification datasets. Finally, we conclude the paper in Section 5.
2 Preliminaries
2.1 Error Correcting Output Code (ECOC)
ECOC was first introduced to decompose multiclass classification problems by Dietterich and Bakiri [dietterich1995solving]. In this method, each class $c_i$ is assigned a codeword $m_i \in \{-1, 1\}^l$, where $m_{ij}$ represents the label of data from class $c_i$ when learning the base learner $f_j$. All codewords can be combined to form a coding matrix $M \in \{-1, 1\}^{k \times l}$, where $l$ is the length of one codeword as well as the number of base learners. Given the outputs of the base learners $f_1(x), \dots, f_l(x)$, the final multiclass classification result can be obtained through a decoding strategy:
$$\hat{y} = \arg\min_{c} \sum_{j=1}^{l} \frac{1 - \operatorname{sign}(m_{cj} f_j(x))}{2} \qquad (1)$$

where $\hat{y}$ is the predicted class and $\operatorname{sign}(\cdot)$ is the sign function, which equals 1 if its argument is non-negative and $-1$ otherwise. This decoding strategy is also called Hamming decoding, as it makes the prediction by choosing the class with the lowest Hamming distance. Under such a decoding strategy, the coding matrix is capable of correcting a certain amount of errors made by the base learners [dietterich1995solving].
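As a concrete sketch of Hamming decoding, the following assumes codewords in $\{-1, +1\}$ and real-valued base-learner outputs; the function and the toy coding matrix are our own illustration:

```python
import numpy as np

def hamming_decode(M: np.ndarray, outputs: np.ndarray) -> int:
    """Predict the class whose codeword has the lowest Hamming
    distance to the signed base-learner outputs.

    M:       (k, l) coding matrix with entries in {-1, +1}
    outputs: (l,) real-valued outputs of the l base learners
    """
    signs = np.where(outputs >= 0, 1, -1)   # sign(x) = 1 if x >= 0 else -1
    distances = np.sum(M != signs, axis=1)  # per-class Hamming distance
    return int(np.argmin(distances))

# Toy example: 3 classes, code length 4.
M = np.array([[ 1,  1, -1, -1],
              [ 1, -1,  1, -1],
              [-1,  1,  1,  1]])
print(hamming_decode(M, np.array([0.9, -0.4, 0.7, -0.2])))  # matches class 1's code
```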
ECOC-based methods have many advantages over traditional decomposition approaches. First, the introduction of the coding matrix, which can encode different distances between different class pairs, enables us to integrate the correlations among classes into the classification model so as to further improve the classification accuracy. Moreover, since the code length $l$, i.e., the number of base learners, can be much smaller than the number of classes $k$, ECOC-based methods can be more efficient than OVA and OVO, especially when $k$ is very large.
It is obvious that the classification performance of ECOC-based methods highly depends on the design of the coding matrix. Nevertheless, finding the best coding matrix is NP-complete, as stated in [crammer2002learnability]. Thus, it is almost impossible to find an optimal coding matrix, and even finding a suboptimal coding matrix is likely to be quite time-consuming. Such uncertainty and inefficiency undoubtedly prevent the broader use of ECOC-based methods in real-world applications.
2.2 Related work
Recent years have witnessed many efforts attempting to improve ECOC-based decomposition methods. In particular, many existing studies focused on discovering a more appropriate coding matrix. For example, some efforts made hierarchical partitions of the class space to generate the corresponding code [baro2009traffic, pujol2006discriminant]; some other studies explored genetic algorithms to produce coding matrices with good properties [garcia2008evolving, bautista2012minimal, bagheri2013genetic, bautista2014design]; moreover, a couple of efforts have demonstrated significant improvements on ECOC-based methods by using spectral decomposition to find a good coding matrix [zhang2009spectral] or by relaxing the integer constraint on the coding matrix elements so as to adopt a continuous-valued coding matrix [zhao2013sparse]. In the meantime, some previous studies turned to optimizing the decoding strategy by employing bagging and boosting approaches [hatami2012thinned, rocha2014multiclass] or by assigning deliberate weights to base learners for further aggregation [escalera2006ecoc]. While these previous studies improve ECOC-based methods to some extent, they still suffer from two main challenges: 1) Efficiency: in order to increase multiclass classification accuracy, many previous works like [baro2009traffic, pujol2006discriminant] designed coding matrices with a long code length $l$, which leads to almost as many base learners as needed in OVA and OVO. Such a limitation makes existing ECOC-based methods very inefficient on large-scale classification problems. 2) Scalability: in fact, most of the previous ECOC-based methods were studied on small-scale classification data, usually consisting of, for example, tens of classes and thousands of samples [zhao2013sparse]. To the best of our knowledge, there is no existing in-depth verification of the performance of ECOC-based methods on large-scale classification data. Meanwhile, such an investigation is even quite difficult in principle, since most of these methods cannot scale up to large-scale data due to their long code length and the expected great preprocessing cost.
Because of these major shortcomings, it is quite challenging to apply existing ECOC-based methods to real-world applications, especially large-scale multiclass classification problems.
3 LightMC
To address the major shortcomings of ECOC-based methods stated in Sec. 2, we propose a new multiclass decomposition algorithm, named LightMC. Instead of determining the coding matrix and decoding strategy before training, LightMC attempts to dynamically refine the ECOC decomposition by directly optimizing the global objective function, jointly with the training of base learners. More specifically, LightMC introduces a new differentiable decoding strategy, which enables it to optimize the coding matrix and decoding strategy directly via gradient descent during the training of base learners. As a result, LightMC yields twofold advantages: 1) Effectiveness: rather than separating the design of the coding matrix and decoding strategy from the training of base learners, LightMC can further enhance ECOC-based methods in terms of classification accuracy by jointly optimizing the coding matrix, decoding strategy, and base learners; 2) Efficiency: since the coding matrix is automatically optimized during the subsequent training, LightMC can significantly reduce the time cost of finding a good coding matrix before training.
In this section, we first introduce the overall training algorithm. Then, we present our new decoding model and derive the optimization algorithms for the decoding strategy and the coding matrix based on it. Finally, we further discuss the performance and efficiency of LightMC.
3.1 Overall Algorithm
The general learning procedure of LightMC is summarized in Fig. 1. More specifically, before LightMC starts training, a coding matrix is first initialized by an existing ECOC-based solution. Then, to make full use of the training information from base learners, LightMC employs an alternating optimization algorithm, which alternates the learning of base learners with the coding and decoding optimization: when training base learners, the coding and decoding strategies are fixed, and vice versa. This joint learning procedure runs repeatedly until the whole training converges.
Note that, instead of determining the coding matrix before training, LightMC develops an end-to-end solution to jointly train the base learners and the decomposition models in an iterative way. The details of the LightMC algorithm can be found in Alg. 1. Within this algorithm, there are two essential steps: TrainDecoding optimizes the decoding strategy, the details of which are revealed in Sec. 3.2; and TrainCodingMatrix optimizes the coding matrix, the details of which are introduced in Sec. 3.3.
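The alternating procedure can be sketched as follows; this is a minimal outline of Alg. 1 under our own naming, with the three training routines passed in as placeholders rather than the paper's actual API:

```python
# A minimal sketch of LightMC's alternating optimization loop (Alg. 1).
# train_base_learners, train_decoding, and train_coding_matrix are
# placeholder callables standing in for the routines described in the text.

def lightmc_train(M, rounds, train_base_learners, train_decoding, train_coding_matrix):
    W = M.copy()  # decoding weights, initialized from the coding matrix
    for _ in range(rounds):  # rounds >= 1 assumed
        # Step 1: fit base learners with the coding/decoding fixed.
        learners = train_base_learners(M)
        # Step 2: refine the decoding strategy (softmax weights W).
        W = train_decoding(W, learners)
        # Step 3: refine the coding matrix via class-averaged gradients.
        M = train_coding_matrix(M, learners)
    return learners, M, W
```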
3.2 New Differentiable Decoding Strategy: Softmax Decoding
To find the optimal coding and decoding strategies, it is necessary to optimize the global objective function directly. However, most existing decoding strategies are not differentiable, which prevents us from optimizing the global objective function directly by the widely-used back propagation method. To remove this obstacle, it is critical to design a decoding strategy that is differentiable while preserving the error-correcting property.
A deep dive into the decoding strategy, i.e., Eq. 1, discloses two non-differentiable functions: $\arg\min$ and $\operatorname{sign}$. As introduced in [escalera2010decoding], $\operatorname{sign}$ can be removed directly, since the resulting distance function becomes the Manhattan (L1) distance, which still preserves the error-correcting property. In the meantime, $\arg\min$ can be replaced by the widely-used $\operatorname{softmax}$, which approximates $\arg\max$ while producing continuous probabilities and is thus differentiable. More specifically, we can first replace the $\arg\min$ with $\arg\max$ by reversing the sign of the distance at the same time. In this way, when the output of the $j$-th classifier equals $m_{cj}$, the distance term reaches its maximum value instead of its minimum. After that, we can replace the $\arg\max$ with $\operatorname{softmax}$ directly, and the whole decoding strategy becomes

$$p_c = \frac{e^{d(m_c, f(x))}}{\sum_{c'=1}^{k} e^{d(m_{c'}, f(x))}} \qquad (2)$$
where $d(m_c, f(x))$ denotes the similarity between the classifier outputs and the code of class $c$. Although the L1 distance is applied in our algorithm, the L2 distance or other distance functions mentioned in [escalera2010decoding] are also applicable and should produce similar results. Note that, after all the transformations mentioned above, the decoding strategy assigns the highest score to the class closest to the output vector, which, in other words, is exactly the error-correcting property [escalera2010decoding].
Having such a differentiable error-correcting decoding strategy enables us to employ the widely-used gradient descent algorithm to optimize the decoding strategy directly. Before doing so, we notice that the new decoding function can be rewritten in the form of a single-layer softmax regression. As the distance function in Eq. 2 satisfies

$$d(m_c, f(x)) = -\|f(x) - m_c\|_1 = m_c^\top f(x) - l \quad \text{for } f_j(x) \in [-1, 1]$$

(the constant $-l$ cancels inside the softmax), the decoding strategy can be rewritten as:

$$p_c = \frac{e^{w_c^\top f(x)}}{\sum_{c'=1}^{k} e^{w_{c'}^\top f(x)}} \qquad (3)$$

which has exactly the same form as a single-layer linear model with a softmax activation. As a result, we can use gradient descent to train the softmax's parameters $W$, which are initialized by the coding matrix $M$, in order to reduce the overall loss. Considering the convenience of derivative computation, we choose the multiclass cross entropy, which is commonly used together with the softmax function, as our loss function. The overall loss on a single data point can be formulated as

$$L(x, y) = -\sum_{c=1}^{k} t_c \log p_c,$$
where $t$ is a one-hot vector transformed from the original label, and the softmax parameters are updated as $W \leftarrow W - \alpha\, \partial L/\partial W$, with $\alpha$ being the learning rate. This optimization process is called by TrainDecoding in Alg. 2. As in ordinary gradient descent, the data are partitioned into mini-batches, which are used to calculate the current gradients for a single round of updates. We can also apply L1/L2 regularization here to improve the generalization ability. Note that the validity of gradient descent guarantees that the overall loss decreases through iterations, which ensures that this algorithm is a valid method to refine the decoding strategy.
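A single decoding-optimization step can be sketched as below; the function name and the learning-rate value are our own illustration, assuming the softmax decoding with cross-entropy loss described above:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def train_decoding_step(W, f_x, y, lr=0.1):
    """One gradient-descent step on the softmax decoding weights W.

    W:   (k, l) decoding weights, initialized from the coding matrix M
    f_x: (l,) base-learner outputs for one sample
    y:   true class index
    Cross entropy L = -log p_y gives dL/dW = (p - t) outer f(x).
    """
    p = softmax(W @ f_x)               # class probabilities (Eq. 3)
    t = np.zeros_like(p); t[y] = 1.0   # one-hot target
    grad_W = np.outer(p - t, f_x)
    return W - lr * grad_W
```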
3.3 Coding Matrix Optimization
Besides decoding optimization, it is quite beneficial to optimize the coding matrix through the iterative training as well. We notice that, if the input of the softmax decoding could also be updated via back propagation, we would be able to further lower the overall training loss. The corresponding update process can be defined as $f(x) \leftarrow f(x) - \beta\, \partial L/\partial f(x)$, where $\beta$ is the learning rate. However, $f(x)$ cannot be updated directly, since it is the output of the base learners. Fortunately, optimizing the coding matrix enables us to update $f(x)$ indirectly so as to further reduce the overall training loss.
As stated in Sec. 2.1, $m_{cj}$ determines the label of the data belonging to class $c$ when they are used to train base learner $f_j$. If we assume that the base learners are able to fit the given learning targets perfectly, then for any classifier $f_j$, its output for any data $x$ belonging to class $c$ will always satisfy $f_j(x) = m_{cj}$. Thus, changes of $m_{cj}$ will affect the targets of the base learners, and the outputs of the base learners will change subsequently. Moreover, since the gradient $\partial L/\partial m_{cj}$ is equal to $\partial L/\partial f_j(x)$ in this situation, we can optimize $M$ by gradient descent: $m_{cj} \leftarrow m_{cj} - \beta\, \partial L/\partial f_j(x)$.
However, there is no perfect base learner in practice. As a result, we cannot use the above solution to optimize $M$ directly, since $f_j(x) \neq m_{cj}$ in general. Nevertheless, there are many data samples belonging to a single class $c$; that is, for a class $c$, there are many $x_i$ with $y_i = c$. So, instead of using the unstable point gradient $\partial L/\partial f_j(x_i)$, we can use the average gradient over each class to obtain a more stable estimation of $\partial L/\partial m_{cj}$:

$$\frac{\partial L}{\partial m_{cj}} \approx \frac{1}{|\{i : y_i = c\}|} \sum_{i:\, y_i = c} \frac{\partial L}{\partial f_j(x_i)} \qquad (4)$$
This estimation can then be used to update the coding matrix. The optimization algorithm is described in Alg. 3; it is almost the same as normal back propagation, except that it uses the whole batch of data to calculate the average gradients before performing updates. This method is also empirically proven to be effective by our experiments, as shown in the next section, which means that, by optimizing the global objective function, the coding matrix can indeed be refined to reduce the loss as well as enhance the generalization capability.
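The class-averaged update can be sketched as follows; this is our own illustrative implementation of the idea behind Alg. 3, assuming softmax decoding with cross-entropy loss, and the function name and learning rate are not from the paper:

```python
import numpy as np

def train_coding_matrix(M, W, F, y, lr=0.1):
    """One full-batch update of the coding matrix M (sketch of Alg. 3).

    M: (k, l) coding matrix;  W: (k, l) decoding weights
    F: (n, l) base-learner outputs for the whole batch
    y: (n,) integer class labels
    For each class c, dL/dm_c is estimated by averaging dL/df(x_i)
    over all samples x_i with y_i = c, following Eq. 4.
    """
    k, _ = M.shape
    M_new = M.astype(float)  # continuous (distributed) coding
    for c in range(k):
        idx = np.where(y == c)[0]
        if len(idx) == 0:
            continue  # no samples of this class in the batch
        grads = []
        for i in idx:
            z = W @ F[i]
            p = np.exp(z - z.max()); p /= p.sum()   # softmax probabilities
            t = np.zeros(k); t[y[i]] = 1.0          # one-hot target
            grads.append(W.T @ (p - t))             # dL/df(x_i) for softmax + CE
        M_new[c] -= lr * np.mean(grads, axis=0)     # class-averaged gradient step
    return M_new
```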
3.4 Discussion
In the rest of this section, we further discuss the efficiency and performance of LightMC.


Efficiency: Compared with existing ECOC-based methods, LightMC is more efficient, as it can spend much less time finding a coding matrix before training while still producing comparable performance, since the coding matrix is dynamically refined during the subsequent training. Moreover, LightMC requires only a little additional computation cost for the optimization, equal to that of a single-layer linear model and much smaller than that of powerful base learners such as neural networks and GBDT. The experimental results in the following section further demonstrate the efficiency of LightMC.

Mini-Batch Coding Optimization: One shortage of Alg. 3 is its inefficient memory usage, as it uses the full batch for each update. Actually, it is quite natural to switch to mini-batch updates, since the average gradients can be calculated over mini-batches as well.

Distributed Coding: Binary coding is used in most existing ECOC-based methods. LightMC, on the other hand, employs distributed coding to perform continuous optimization. Apparently, distributed coding, also called embedding, contains more information than binary coding [mikolov2013distributed, zhao2013sparse], which enables LightMC to leverage more information about the correlations among classes.

Alternating Training with Base Learners: As shown in Alg. 1, when the base learner is not a boosting learner, for example a neural network, LightMC can be called at each iteration (epoch). For boosting learners, LightMC starts from a given round and is called once per round thereafter. The reason is that boosting learners have a learning rate that shrinks the output of the model at each iteration; as a result, they need more iterations to fit the new training targets. Therefore, running some initial rounds first and then calling LightMC once per round improves efficiency, since calling LightMC at each iteration is not necessary.

Comparison with the Softmax Layer in Neural Networks: The form of softmax decoding is similar to that of the softmax layer in neural networks. However, they are indeed different: 1) the softmax layer is actually equivalent to OVA decomposition, and it does not use a coding matrix to encode the correlations among classes; 2) they use different optimization schemes: the optimization of the softmax layer reduces the loss per sample, while softmax decoding optimizes the loss per class (see Eq. 4). It is hard to say which one is better in practice for neural networks; some recent works even found that the accuracy is almost the same when using a fixed softmax layer [hoffer2018fix]. This topic, however, is not in the scope of this paper.
4 Experiment
4.1 Experiment Setting
Dataset | #class | #feature | #data
News20 [chang2011libsvm] | 20 | 62,021 | 19,928
Aloi [chang2011libsvm] | 1,000 | 128 | 108,000
Dmoz [yen2016pd] | 11,878 | 833,484 | 373,408
LSHTC1 [yen2016pd] | 12,045 | 347,255 | 83,805
AmazonCat14K [mcauley2015inferring, mcauley2015image]^2 | 14,588^3 | 597,940 | 5,497,775

^2 http://manikvarma.org/downloads/XC/XMLRepository.html
^3 The number of classes is 3,344 after converting to the multiclass format.
In this section, we report the experimental results regarding our proposed LightMC algorithm. We conduct experiments on five public datasets, as listed in Table 1. The table shows a wide range of dataset sizes; the largest dataset has millions of samples with more than ten thousand classes and can be used to validate the scalability of LightMC. Among them, AmazonCat14K is originally a multilabel dataset; we convert it to a multiclass one by randomly sampling one label per data point. As stated in Sec. 2.2, to the best of our knowledge, this is the first time ECOC-based methods are examined on such large-scale datasets.
For the baselines, we use OVA and the evolutionary ECOC proposed in [bautista2012minimal]. OVO is excluded from the baselines due to its extreme inefficiency in the number of base learners. For example, on the LSHTC1 data, OVO needs about 72 million base learners and is estimated to take about 84 days to run an experiment even if the cost of one base learner is only 0.1 second. The initial coding matrix of LightMC is set to be exactly the same as that of the ECOC baseline to make them comparable. Besides, to examine the efficiency of LightMC, we add another LightMC baseline that starts from a random coding matrix, called LightMC(R). As for the code length $l$, a suggested value was given in [allwein2000reducing]; considering that our base learner is more powerful, we use a smaller $l$.
For all decomposition methods, we use LightGBM [ke2017lightgbm] to train the base learners. In all experiments we set learning_rate, num_leaves (the max number of leaves in a single tree), and early_stopping (early stopping rounds, set to 20) accordingly. For AmazonCat14K, we override num_leaves to 300 and early_stopping to 10, since otherwise it would take several weeks to run an experiment. Other parameters remain at their default values. Our experimental environment is a Windows server with two E5-2670 v2 CPUs (20 cores in total) and 256GB memory. All experiments run with multithreading, and the number of threads is fixed to 20.
Regarding the parameters used by LightMC, the starting round and the learning rates are set to fixed values, and the softmax's parameters are trained for one epoch each time the optimization method is called.
4.2 Experiment Result Analysis
Dataset | OVA | ECOC | LightMC(R) | LightMC
News20 | 18.66% | 20.82% ± 0.33% | 20.63% ± 0.57% | 18.63% ± 0.37%
Aloi | 11.44% | 10.72% ± 0.12% | 10.75% ± 0.23% | 9.75% ± 0.12%
Dmoz | N/A | 55.87% ± 0.34% | 55.55% ± 0.44% | 53.95% ± 0.25%
LSHTC1 | N/A | 76.04% ± 0.59% | 76.17% ± 0.73% | 75.63% ± 0.33%
AmazonCat14K | N/A | 27.05% ± 0.11% | 26.98% ± 0.21% | 25.54% ± 0.10%
Dataset | OVA | ECOC | LightMC(R) | LightMC | Coding Matrix
News20 | 71 | 120 | 133 | 100 | 34
Aloi | 1,494 | 717 | 753 | 627 | 201
Dmoz | > 259k | 58,320 | 61,930 | 51,840 | 13,233
LSHTC1 | > 86k | 5,796 | 5,995 | 5,690 | 926
AmazonCat14K | > 969k | 332,280 | 354,480 | 311,040 | 48,715
Class Pairs | 0 | 50 | 100 | 150 | 200 | 300 | 400 | 500 | 1000
ibm.hardware, mac.hardware | 98.9 | 97.4 | 96.8 | 95.9 | 95.1 | 93.1 | 91.1 | 89.3 | 81.8
mac.hardware, politics.mideast | 120.9 | 135.5 | 136.4 | 136.8 | 140.8 | 145.6 | 149.5 | 152.7 | 163.2
The experimental results are reported in Tables 2 and 3. The OVA error results on the Dmoz, LSHTC1, and AmazonCat14K datasets are not reported since their time costs are extremely high; instead, we estimate their convergence times using subsets of the original data.
From these two tables, we find that LightMC outperforms all the others in terms of both accuracy and convergence time. In particular, both ECOC and LightMC converge faster than OVA when $k$ is larger. Furthermore, compared with ECOC, LightMC increases the accuracy by about 3% (relatively), improving by 5.88% in the best case on the LSHTC1 dataset. As for speed, LightMC also uses less time than ECOC to converge. These results clearly indicate that LightMC can further reduce the overall loss by dynamically refining the coding and decoding strategies, as expected.
We can also find that, even when starting from a random coding matrix, the accuracy of LightMC(R) is comparable to that of ECOC. Despite the slower convergence of LightMC(R), its total time cost is still much less than that of ECOC, since ECOC spends a substantial amount of additional time finding a good coding matrix before training. This result further implies the efficiency of LightMC: it can provide comparable accuracy without searching for a suboptimal coding matrix before training.
To show more learning details, we plot the curves of the test error against the training time on the Aloi and LSHTC1 datasets, as shown in Fig. 1(a) and 1(b), respectively. From Fig. 1(a), we can clearly see that the curve of LightMC always stays below the curves of the other two methods and converges earliest at the lowest point. Fig. 1(b) shows a slightly different pattern: LightMC and ECOC have similar accuracy and take comparable time to converge. However, LightMC still always stays below ECOC and converges 1,405 seconds earlier, which also indicates that LightMC succeeds in enhancing existing ECOC methods.
In addition, to illustrate the effect of LightMC in optimizing the coding matrix, we calculate the distances of some class pairs on News20 over the optimized coding matrix. As shown in Table 4, the distance for the class pair ('ibm.hardware', 'mac.hardware') is obviously much smaller than that for ('mac.hardware', 'politics.mideast'). Moreover, the distance for the former pair keeps decreasing along with the training of LightMC, while that for the latter keeps increasing due to the irrelevance of this class pair. This result empirically implies the effectiveness of LightMC in optimizing the coding matrix in the right direction.
In summary, all these results illustrate the effectiveness and efficiency of LightMC. LightMC can not only empower existing ECOC-based methods but also achieve comparable classification accuracy in much less time, since it saves the time for finding a sound coding matrix. Moreover, LightMC can optimize the coding matrix in a better direction.
5 Conclusion
We propose a novel dynamic ECOC-based multiclass decomposition algorithm, named LightMC, to solve large-scale classification problems efficiently. To better leverage the correlations among classes, LightMC dynamically optimizes its coding matrix and decoding strategy, jointly with the training of base learners. Specifically, we design a new differentiable decoding strategy to enable direct optimization over the decoding strategy and the coding matrix. Experiments on public datasets with class numbers ranging from twenty to more than ten thousand empirically show the effectiveness and efficiency of LightMC. In the future, we plan to examine how LightMC works when replacing the softmax layer in neural networks.