I Introduction
Deep learning has become one of the major research areas in the machine learning community. One of the challenges is that the structure of a deep network model is usually complicated. For general multiclass classification problems, the number of required parameters of the deep network grows hyperlinearly with respect to the class number. If the number of classes is large, the classification problem becomes infeasible because the required resources for model computation and storage will be huge. However, today there are many applications that require classification with a huge number of classes, such as word-level language models, image recognition of shopping items in e-commerce (multiple billions of shopping items today at Taobao and Amazon), and handwriting recognition of 10K Chinese characters.
In fact, a general deep neural network classifier of N classes can be treated as the series connection of a complex embedding into the Euclidean space of the last-but-one layer, and a softmax classifier of N classes in the last layer. The complex embedding can be interpreted as a clustering process that clusters data based on their class labels, and the last layer tries to separate the clusters. If the dimension d of the Euclidean space in the last-but-one layer is greater than or equal to N − 1, there exists a softmax classifier that separates those clusters with probability 1. But if d is less than N − 1, there may exist a cluster whose center lies inside the convex closure of the other cluster centers. In this case, no softmax classifier can separate this cluster from the others, because a linear function on a convex set always takes its maximal value at a vertex. (For an illustration, see Figure 2; for more detail, see section III.) In order to solve classification problems with a growing class number N, either the dimension d of the last-but-one layer is fixed, which leads to bad performance, or d grows with N, which leads to the number of parameters in the last two layers growing hyperlinearly with N. The hyperlinear growth of the network size increases the training time and memory usage significantly, which limits many real applications that require a huge number of class labels.
This paper proposes a method called Label Mapping to resolve this contradiction. Our idea is to reduce a multiclass classification problem with a huge number of classes to several multiclass classification problems with middle numbers of classes. Each of these middle-size problems can be trained in parallel. When we train them distributedly, the cost of storage and computation on a single machine increases slowly with the class number. Moreover, no communication between the machines is needed.
A method similar to ours is Error-Correcting Output Codes (ECOC), which is discussed in [42], [7], [30], etc. It reduces an n-class classification problem to several binary classification problems. ECOC typically applies binary classifiers such as SVMs, and therefore binary error-correcting codes are naturally used in this case.
However, if we use a deep learning network as the base learner, it is not necessary to limit the code to be binary. In fact, there is a tradeoff between the class number of one base learner and the number of base learners used. According to information theory, if we use m-class classifiers as base learners to solve a classification problem of N classes, we need at least ⌈log_m N⌉ base learners. For example, if we need to solve a classification problem with one million classes and we use binary classifiers as base learners, we need at least 20 base learners. For some classical applications, for example CNN image classification, we need to build a CNN network for every binary classifier, which is a huge cost in computation and memory resources. But if we combine base learners with 1000 classes each, we need only 2 base learners.
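This base-learner count is easy to check with a short integer-only script (a sketch; the function name is ours):

```python
def min_base_learners(num_classes: int, classes_per_learner: int) -> int:
    """Smallest n with classes_per_learner**n >= num_classes,
    i.e. the ceiling of log_m(N), computed exactly with integers."""
    n, capacity = 0, 1
    while capacity < num_classes:
        capacity *= classes_per_learner
        n += 1
    return n

# A 1M-class problem: binary base learners vs. 1000-class base learners.
print(min_base_learners(10**6, 2))     # -> 20
print(min_base_learners(10**6, 1000))  # -> 2
```

Computing the bound by integer multiplication avoids the floating-point pitfalls of taking logarithms at exact powers.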
In order to combine several multiclass base learners, ECOC is not usable. Our Label Mapping (LM) method is very suitable for this purpose.
We discuss the design principles for LM, called “classes high separable” and “base learners independence”. The principle “classes high separable” ensures that for any two different classes, as many base learners as possible are trained to separate them. The principle “base learners independence” ensures that the overlap between the information learned by any two different base learners is as small as possible.
Then we propose two classes of LM and prove that they conform to these principles.
As numeric experiments, we show the accuracies of LM on three datasets, namely Cifar100, CJK characters, and “Republic”. On all the datasets, the accuracy of LM increases remarkably with the number of base learners. When the class number is bigger than the dimension of the last-but-one layer of the network (CJK characters), the accuracy of LM is better than one-hot encoding with standard softmax and than negative sampling, with the same number of network parameters. When the class number is much bigger than the dimension of the last-but-one layer of the network (“Republic”), the accuracy of LM is much better than one-hot encoding with standard softmax and negative sampling, even when they use a bigger number of network parameters.
The base learners can be trained in parallel, and when we train them distributedly, the cost of storage and computation on a single machine increases slowly (in fact, sublinearly in the class number N). Moreover, no communication between the machines is needed.
We also compare LM with the classical method ECOC; the accuracy of LM+DNN is much greater than the accuracy of ECOC+DNN with the same or even a bigger number of parameters.
This paper is organized as follows. In section 2, we give a literature review. In section 3, we discuss the weak point of the classifier of one-hot encoding with softmax. In section 4, we give the formal definition of LM, discuss the principles for designing LM, and propose two classes of LM, proving that the principles are satisfied. In section 5, we give some numeric examples.
The following symbols are used in this paper:
1). For a positive integer n, [n] denotes the set {0, 1, …, n − 1};
2). for q a power of a prime number, F_q denotes the finite field (Galois field) of q elements;
3). for a field F, F[x] denotes the polynomial ring over F;
4). for a polynomial f ∈ F[x], deg(f) denotes the degree of f.
II Literature review
There is some research on the multiclass classification problem with a huge number of classes using DNNs, for example the hierarchical softmax method [5] and the negative sampling method [6]. These methods can reduce the computational complexity of training, but cannot reduce the number of parameters. Their performance is also worse than the standard one-hot plus softmax method.
There is a method that reduces a multiclass classification problem with a big number of classes to several binary classification problems, namely ECOC. T. G. Dietterich and G. Bakiri [42] introduced ECOC to combine several binary classifiers to solve multiclass classification problems. In that paper, the design principles of ECOC, “row separation” and “column separation”, are proposed, and a decision tree (C4.5) and a shallow neural network with sigmoid output are used as binary base learners. “Exhaustive codes” (equivalent to Hadamard codes) are used for small class numbers, column selection from exhaustive codes for middle class numbers, and random hill climbing or BCH codes for class numbers bigger than 11. For the decoder, it minimizes the L1 distance between the codewords and the output probabilities. In [7], Allwein and Schapire proposed to use symbols from {−1, 0, 1} instead of {−1, 1} in the encoding; the output bits which take value 0 in the encoded label do not appear in the loss. Using this modification, three approaches, namely one-versus-one, one-versus-others, and binary encoding, are collected into a common framework.
In [29], Escalera, Pujol and Radeva discussed the design principles of the modified ECOC whose symbols may take values in {−1, 0, 1}, and gave some examples satisfying these principles.
In [30], Passerini, Pontil and Frasconi discussed the decoding method and the combination of ECOC with SVMs with kernels.
In [31], Langford and Beygelzimer proposed a reduction from cost-sensitive classification to binary classification based on a modification of ECOC.
In [26], [27], [28], [38], some application-dependent ECOCs are proposed, where the codebook is generated based on a discrimination tree.
In [37], another class of application-dependent ECOCs is proposed; it is constructed by considering the neighborhood of samples.
In [33], ECOC is applied to representation learning.
In [34], ECOC is applied to zero-shot action recognition.
In [32], ECOC is applied to the text classification problem with a large number of categories.
Up to now, all the codes used in ECOC are binary, and all the base classifiers used in ECOC are two-class classifiers.
III Analysis of classification using deep neural networks
Generally, a DNN classifier of N classes can be treated as the series connection of a complex mapping into the Euclidean space of the last-but-one layer, and a softmax classifier of N classes in the last layer. The complex mapping can be interpreted as a clustering process that clusters data based on their class labels, and the last layer tries to separate the clusters. But the softmax classifier can separate all the classes in the Euclidean space only if the centers of the clusters satisfy the convex property defined as follows.
Definition 1.
We say that a set S of N points in a Euclidean space satisfies the convex property if and only if the convex closure of S has exactly N vertexes.
For example, the set of the centers of the 4 clusters in Figure 2 has the convex property, but the set of the centers of the 5 clusters in Figure 2 does not.
In other words, the softmax classifier can separate all the clusters in the Euclidean space only if there is no cluster whose center lies in the interior of the convex closure of the centers of the other clusters. The reason is that a linear function on a convex body cannot take its maximal value in the interior of the body (Figure 2).
If the dimension d of the Euclidean space in the last-but-one layer is greater than or equal to N − 1, the centers of the N clusters satisfy the convex property with probability 1 (unless the centers lie in an affine subspace of dimension less than N − 1, which has probability 0), and hence there exists a softmax classifier that separates those clusters with probability 1.
But if the dimension d of the Euclidean space in the last-but-one layer is less than N − 1, the probability that the centers of the clusters satisfy the convex property is less than 1. Moreover, when d is fixed, this probability decreases as N increases. If the class number N is much bigger than the dimension d, it is difficult for the complex mapping in the front layers of the network to map the clusters such that their centers satisfy the convex property, and hence no softmax classifier can separate them.
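This geometric obstruction can be checked numerically. The sketch below (coordinates and random seed are ours) places a fifth center at the mean of four others, so it lies inside their convex closure, and verifies that no linear (softmax) score can make it the strict winner:

```python
import numpy as np

rng = np.random.default_rng(0)
corners = np.array([[1., 1.], [1., -1.], [-1., 1.], [-1., -1.]])
inner = corners.mean(axis=0)  # lies inside the convex closure of the corners

# For any linear score w.x + b (one softmax logit), the inner center's score
# is the average of the corner scores, so some corner always ties or beats it.
for _ in range(1000):
    w, b = rng.normal(size=2), rng.normal()
    assert inner @ w + b <= (corners @ w + b).max() + 1e-12
print("inner center never wins a linear score")
```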
IV Label Mapping
For an N-class classification problem, we define a Label Mapping (LM) as a sequence of maps

φ_i : [N] → [p_i], i = 1, …, n,

where each φ_i is called a “site-position function”, and n is called the “length” of the Label Mapping. If all the p_i are equal to each other, we call it a “simplex LM”; otherwise we call it a “mixed LM”.
Generally, N is a huge number, and the p_i are numbers of middle size. We can reduce an N-class classification problem to n classification problems of middle size through an LM. Suppose the training dataset is {(x_j, y_j)}, where x_j is the feature and y_j is the label. There are two methods to use DNN plus LM. One is to use one network with n outputs (Figure 4). The other is to use n networks, where the i-th network is trained as a base learner on the dataset {(x_j, φ_i(y_j))} for i = 1, …, n (Figure 4). Considering the convenience of distributed training, we use the latter method (Figure 4).
A good LM should satisfy the following properties:
Classes high separable. For two different labels y ≠ y′, there should be as many site-position functions φ_i as possible such that φ_i(y) ≠ φ_i(y′).
Base learners independence. When y is selected uniformly at random from [N], the mutual information of φ_i(y) and φ_j(y) approximates 0 for i ≠ j.
The property “classes high separable” ensures that for any two different classes, as many base learners as possible are trained to separate them. The property “base learners independence” ensures that the common part of the information learned by any two different base learners is as small as possible.
Remark. There are some similarities between LM and ECOC:
An ECOC of length n for N classes can be regarded as a sequence of maps

b_i : [N] → {0, 1}, i = 1, …, n,

where each b_i is called a “bit-position function”.
We can see that, compared to ECOC, our LM does not require that the reduced classification problems be two-class problems. It does not even require that the reduced classification problems have the same number of classes.
In [42], two properties that a good error-correcting output binary code for a multiclass problem should satisfy are proposed ([42], section 2.3):
Row separation. Each codeword should be well separated in Hamming distance from each of the other codewords.
Column separation. Each bit-position function should be uncorrelated with the functions to be learned for the other bit positions.
The property “Row separation” is similar to the property “classes high separable” of LM, and the property “Column separation” is similar to the property “base learners independence”. ∎
We give two classes of LM which satisfy these properties: one is a mixed LM, and the other is a simplex LM.
IV-A Mixed LM
Theorem 2.
(Prime Number Theorem) The density of prime numbers near x is approximately 1/ln x.
For the original label set [N], a small number k like 2 or 3, and a small positive number ε, select n prime numbers p_1 < p_2 < … < p_n in the interval (N^(1/k), (1+ε)N^(1/k)). According to the Prime Number Theorem, there are about εkN^(1/k)/ln N prime numbers in this interval.
We define an LM as

φ_i(l) = l mod p_i, i = 1, …, n,

where l ∈ [N]. Then we have the following proposition:
Theorem 3.
For any l ≠ l′ in [N], there are at most k − 1 indices i such that φ_i(l) = φ_i(l′).
Proof. Suppose there exist k different indices i_1, …, i_k such that φ_{i_j}(l) = φ_{i_j}(l′). Then we have l ≡ l′ (mod p_{i_j}) for j = 1, …, k. Because the p_{i_j} are distinct primes, we have l ≡ l′ (mod p_{i_1}⋯p_{i_k}). But p_{i_1}⋯p_{i_k} > N and l, l′ ∈ [N], hence l = l′. ∎
This theorem tells us that the mixed LM satisfies the “classes high separable” property. In the following, we prove that it satisfies the property “base learners independence”.
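A small sketch of the construction and of the separation guarantee of Theorem 3 (the primes are found by naive trial division; N, k and the labels are illustrative):

```python
def primes_from(lo: int, count: int) -> list:
    """First `count` primes >= lo, by trial division (fine at this scale)."""
    found, n = [], max(lo, 2)
    while len(found) < count:
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            found.append(n)
        n += 1
    return found

N, k = 10**6, 2
ps = primes_from(int(N ** (1 / k)) + 1, 3)  # 3 primes just above N^(1/2)

def mixed_lm(label: int) -> list:
    return [label % p for p in ps]           # the site-position functions

# Two distinct labels can agree on at most k-1 = 1 site: agreement on k
# sites would make their difference divisible by a product of primes > N.
a, b = 123456, 654321
agree = sum(x == y for x, y in zip(mixed_lm(a), mixed_lm(b)))
print(ps, agree)
```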
Theorem 4.
Let l be uniformly randomly selected from [N]. Then for any i ≠ j, the mutual information of φ_i(l) and φ_j(l) approximates 0.
In order to prove this theorem, we first give a lemma.
Lemma 5.
Let l be uniformly randomly selected from [N], and let p < N be a positive integer. Then the probability of l mod p at every point in [p] is ⌊N/p⌋/N or ⌈N/p⌉/N.
Proof. The preimage of every point in [p] under the map l ↦ l mod p is a set of ⌊N/p⌋ or ⌈N/p⌉ elements. ∎
Now we prove Theorem 4.
Proof of Theorem 4: Let X = φ_i(l) and Y = φ_j(l). By Lemma 5, the probability of X at every point in [p_i] is ⌊N/p_i⌋/N or ⌈N/p_i⌉/N, and the probability of Y at every point in [p_j] is ⌊N/p_j⌋/N or ⌈N/p_j⌉/N; by the Chinese Remainder Theorem and Lemma 5 applied to p_i p_j, the probability of (X, Y) at every point in [p_i] × [p_j] is ⌊N/(p_i p_j)⌋/N or ⌈N/(p_i p_j)⌉/N.

The mutual information of X and Y is

I(X; Y) = Σ_{x ∈ [p_i], y ∈ [p_j]} P(X = x, Y = y) log( P(X = x, Y = y) / (P(X = x)P(Y = y)) ).

a.) When p_i p_j ≤ N, every ratio inside the logarithm differs from 1 by at most O(p_i p_j / N), hence I(X; Y) = O(p_i p_j / N) = O(N^(2/k − 1)), which approximates 0 for k ≥ 3.

b.) When p_i p_j > N (which can occur for k = 2), the map l ↦ (X, Y) is injective by the Chinese Remainder Theorem, so H(X, Y) = log N, while H(X) ≈ log p_i and H(Y) ≈ log p_j. Because p_i, p_j ∈ (N^(1/k), (1+ε)N^(1/k)), we have

I(X; Y) = H(X) + H(Y) − H(X, Y) ≈ log(p_i p_j / N) ≤ 2 log(1+ε),

which approximates 0 for small ε. ∎
This theorem tells us that the mixed LM satisfies the property “base learners independence”.
IV-B Simplex LM
IV-B1 p-adic representation
For any prime number p, we can represent any nonnegative integer less than p^k in the unique form a_0 + a_1 p + ⋯ + a_{k−1} p^{k−1} with a_i ∈ [p], which gives a bijection [p^k] → [p]^k.
For the classification problem of N classes and any small positive integer k (for example, k = 2, 3), let ε ∈ (0, 1) be a small positive real number, and take a prime number p in the interval (N^(1/k), (1+ε)N^(1/k)). (By the Prime Number Theorem [35][1], the number of prime numbers in this interval is about εkN^(1/k)/ln N.) We then get an injection [N] → [p]^k by p-adic representation, and we can compose this map with any injective encoder on [p]^k to get a p-ary simplex LM.
IV-B2 Singleton bound and MDS codes
In coding theory, the Singleton bound, named after Richard Collom Singleton [4], is a relatively crude upper bound on the size of an arbitrary q-ary block code with block length n and minimum distance d.
The minimum distance d of a set C of codewords of length n is defined as

d = min_{x, y ∈ C, x ≠ y} d(x, y),

where d(x, y) is the Hamming distance between x and y. The expression A_q(n, d) represents the maximum number of possible codewords in a q-ary block code of length n and minimum distance d. Then the Singleton bound states:
Theorem 6.
(Richard Collom Singleton) A_q(n, d) ≤ q^{n−d+1}.
A code achieving the Singleton bound is called an MDS (maximum distance separable) code.
It is easy to see that, for a fixed original label number N and code length n, MDS codes are the codes that best satisfy the property “classes high separable”. Fortunately, for a big prime number p or prime power q, some nontrivial MDS codes are known, for example the Reed-Solomon code [25].
Theorem 7.
(Reed and Solomon) For n different elements a_1, …, a_n in F_q, the code defined by the composite of the map

F_q^k → F_q[x], (c_0, …, c_{k−1}) ↦ c_0 + c_1 x + ⋯ + c_{k−1} x^{k−1},

and the map

F_q[x] → F_q^n, f ↦ (f(a_1), …, f(a_n)),

is an MDS code.
In this paper, we use only the Reed-Solomon code with q = p a prime number.
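A minimal sketch of Reed-Solomon encoding over F_p (the prime, message and evaluation points are illustrative): the k message symbols are taken as polynomial coefficients, and each code symbol is an evaluation at one field element.

```python
def rs_encode(msg: list, points: list, p: int) -> list:
    """Reed-Solomon over GF(p): evaluate the degree < k polynomial with
    coefficients `msg` at n distinct field elements (Horner's rule)."""
    def poly_eval(x: int) -> int:
        acc = 0
        for c in reversed(msg):
            acc = (acc * x + c) % p
        return acc
    return [poly_eval(x) for x in points]

p = 11                                        # illustrative prime; k = 2, n = 3
codeword = rs_encode([3, 7], [0, 1, 2], p)    # polynomial 3 + 7x
print(codeword)                               # -> [3, 10, 6]
```

Because two distinct polynomials of degree less than k agree at no more than k − 1 points, any two codewords agree in at most k − 1 positions, which is exactly the separation that the “classes high separable” property asks for.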
Remark. In the case of ECOC, the property similar to “classes high separable” is “row separation”. If there existed a nontrivial binary MDS code, it would also be the code that best satisfies “row separation”. But unfortunately, no nontrivial binary MDS code has been found up to now; in fact, for some situations it is proved that no nontrivial binary MDS code exists ([36] and Proposition 9.2 on p. 212 in [10]). This is another advantage of the simplex LM over ECOC.
IV-B3 Separability and independence
We can combine the p-adic representation map with a Reed-Solomon encoder over the field F_p to get a simplex LM for any prime number p. The above theorem ensures that this code satisfies the property “classes high separable”. We will prove that it also satisfies the property “base learners independence”.
Theorem 8.
If l is a random variable with uniform distribution on [N], and X_i, X_j are the i-site and j-site values (i ≠ j) of the codeword of l under the simplex LM described above, then the mutual information of X_i and X_j approaches 0 as N grows.

The proof of this theorem is similar to the proof of Theorem 4; we omit it due to space limitations.
IV-C Decoding Algorithm
Suppose we use the LM

φ_i : [N] → [p_i], i = 1, …, n,

to reduce a classification problem of class number N to n classification problems of class numbers p_i, and train a base learner for every i, where the output of base learner i is a distribution D_i on [p_i]. Now, for an input feature, how do we combine the outputs of the base learners to get the predicted label?

In this paper, we search for the label l ∈ [N] such that ∏_i D_i(φ_i(l)) is maximal, and let this l be the decoded label.

In fact, writing δ_l for the delta distribution at l and (φ_i)_*δ_l for the marginal distribution on [p_i] induced from δ_l by φ_i, this decoding algorithm finds the delta distribution δ_l on [N] such that the induced marginal distributions (φ_i)_*δ_l are as close to the D_i as possible.
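The decoding step can be sketched as a brute-force search (we assume the score is the product of the base-learner probabilities, equivalently a sum of log-probabilities; the toy sites and distributions are ours):

```python
import numpy as np

def decode(outputs, site_fns, num_classes):
    """Return the label l maximizing sum_i log D_i(phi_i(l)), where D_i is
    the distribution output by base learner i and phi_i its site function."""
    best_label, best_score = 0, -np.inf
    for label in range(num_classes):
        score = sum(np.log(D[fn(label)] + 1e-12)
                    for D, fn in zip(outputs, site_fns))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy mixed LM on N = 6 labels with sites l % 2 and l % 3.
site_fns = [lambda l: l % 2, lambda l: l % 3]
outputs = [np.array([0.1, 0.9]),        # learner 1 favours residue 1 mod 2
           np.array([0.1, 0.1, 0.8])]   # learner 2 favours residue 2 mod 3
print(decode(outputs, site_fns, 6))     # -> 5 (5 % 2 == 1, 5 % 3 == 2)
```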
V Numeric experiments
We report the performance of LM on three datasets, namely Cifar100 [44], CJK characters, and “Republic”. The Cifar100 dataset consists of 60000 32x32 color images in 100 classes, with 500 training images and 100 testing images per class. The CJK characters dataset consists of grey-level images of size 139x139 of the 20901 CJK characters (0x4e00 to 0x9fa5) in 8 fonts. The Republic dataset is a text with 118684 words and 7409 unique words in the vocabulary.
We use a simple CNN network, whose last-but-one layer has dimension 128, with one-hot encoding as the baseline for the Cifar100 dataset.
We use an Inception V3 network [43], whose last-but-one layer has dimension 2048, with one-hot encoding as the baseline for the CJK characters dataset.
We use an RNN network, whose last-but-one layer has dimension 100, with one-hot encoding as the baseline for the dataset “Republic”.
We will see that the accuracy of LM increases with its length on all the datasets. But the accuracy of LM on (Cifar100, simple CNN) hardly surpasses one-hot; the accuracy of LM on (CJK characters, Inception V3) is better than one-hot with almost the same number of parameters; and the accuracy of LM on (Republic, RNN) is much better than one-hot, which uses a bigger number of parameters.
Why is there such a big difference of accuracy in the three situations? Because the dimension of the last-but-one layer of the simple CNN is bigger than the class number of the Cifar100 dataset, the one-hot encoding can bring the power of the simple CNN into full play. But the dimension of the last-but-one layer of Inception V3 is less than the class number of the CJK characters dataset, so the one-hot encoding cannot bring the power of Inception V3 into full play. Moreover, the dimension of the last-but-one layer of the RNN is much less than the class number of the Republic dataset, so the one-hot encoding absolutely cannot bring the power of the RNN into full play (according to the relation between the dimension of the last-but-one layer and the class number discussed in section III).
V-A On a dataset with a small class number
We use a simple CNN network on the Cifar100 dataset. The simple CNN includes 3 convolution layers and a fully-connected layer. The widths of the three convolution layers are 32, 64 and 128 respectively. After each convolution layer, an average pooling layer is applied. After the 3rd pooling layer and after the fully-connected layer, we use dropout layers with probability 0.25. The network structure is shown in Figure 6.
Note that the dimension 128 of the last-but-one layer is greater than the class number 100, hence the one-hot encoding can bring the power of the simple CNN into full play. In this experiment it is difficult for LM to surpass one-hot, but we can see that the accuracy increases with the length of the LM. We will also see that the accuracy of LM is greater than the accuracy of ECOC.
V-A1 The performance of simplex LMs of different lengths
We use the simplex LM defined above with p = 11 and k = 2, with increasing lengths n. The simplex LM can be written as

φ_i(l) = f_l(a_i), i = 1, …, n,

where f_l ∈ F_11[x] is the polynomial of degree less than 2 whose coefficients are the 11-adic digits of l, and a_1, …, a_n are distinct elements of F_11.
We train the networks with batch size 128 and 390 batches per epoch. Figure 6 shows the accuracy of these simplex LMs with the simple CNN on the Cifar100 dataset. In Figure 6, the horizontal axis is the training epoch, and the vertical axis is the validation accuracy. The five curves, colored red, yellow, green, blue and black respectively, are the epoch-accuracy curves of the simple CNN with the simplex LMs of increasing lengths defined above and with the one-hot encoding.
We can see that the accuracies of these networks with simplex LM and with one-hot encoding increase in the first 50-80 epochs, and then a little overfitting occurs. The accuracy of the simple CNN with simplex LM increases with the length of the LM, and it approximates the accuracy of one-hot as the length increases, but it is difficult for it to surpass one-hot. The reason is that the dimension of the last-but-one layer of the simple CNN is bigger than the class number of the Cifar100 dataset, and hence the one-hot encoding can bring the power of the simple CNN into full play.
V-A2 The performance of mixed LMs of different lengths
We use the mixed LMs defined above with increasing lengths n. The mixed LM can be written as

φ_i(l) = l mod p_i, i = 1, …, n,

where the p_i are distinct primes just above √100 = 10.
Figure 8 shows the accuracy of these mixed LMs with the simple CNN on the Cifar100 dataset. In Figure 8, the horizontal axis is the training epoch, and the vertical axis is the validation accuracy. The five curves, colored red, yellow, green, blue and black respectively, are the epoch-accuracy curves of the simple CNN with the mixed LMs of increasing lengths defined above and with the one-hot encoding.
We can see that the accuracies of these networks with mixed LM and with one-hot encoding increase in the first 50-80 epochs, and then a little overfitting occurs. The accuracy of the simple CNN with mixed LM increases with the length of the LM, and it approximates the accuracy of one-hot as the length increases, but it is difficult for it to surpass one-hot. The reason is that the dimension of the last-but-one layer of the simple CNN is bigger than the class number of the Cifar100 dataset, and hence the one-hot encoding can bring the power of the simple CNN into full play.
V-A3 Comparing LM with ECOC
In this subsection, we compare the performance of LM and ECOC. We show the accuracies of the following LMs (ECOC is a special case of LM) with the simple CNN network on the Cifar100 dataset:
a. An ECOC of n = 7.
b. A simplex LM of p = 11 and n = 3.
c. A mixed LM of n = 3.
For the ECOC of n = 7, we encode a label into its binary representation of length 7.
The simplex LM can be written as

φ_i(l) = f_l(a_i), i = 1, 2, 3,

where f_l ∈ F_11[x] is the polynomial of degree less than 2 whose coefficients are the 11-adic digits of l, and a_1, a_2, a_3 are distinct elements of F_11.
The mixed LM can be written as

φ_i(l) = l mod p_i, i = 1, 2, 3,

where the p_i are distinct primes just above 10.
The numbers of parameters of the three methods are 1112622, 480321, and 481353 respectively. The accuracies of the three methods are shown in Figure 8. We can see that the accuracies of the LMs are better than that of ECOC, even though the number of parameters of the LMs is much less than that of ECOC.
V-B On the dataset “CJK characters” with a big class number
We use the Inception V3 network and LM on the dataset “CJK characters”. CJK is a collective term for the Chinese, Japanese, and Korean languages, all of which use Chinese characters and derivatives (collectively, CJK characters) in their writing systems. The dataset “CJK characters” consists of grey-level images of size 139x139 of the 20901 CJK characters (0x4e00 to 0x9fa5) in 8 fonts; these fonts are located at the paths listed in Figure 9 of a MacBook Pro with OS X version 10.11.
We use 7 fonts as the training set, and the remaining font as the test set. We use the Inception V3 network as the base learner, and train the networks with batch size 128 and 100 batches per epoch.
Note that the dimension 2048 of the last-but-one layer of Inception V3 is much less than the class number 20901, hence the one-hot encoding cannot bring the power of Inception V3 into full play. In this experiment we can see that the accuracy increases with the length of the LM, and LM easily surpasses one-hot. We will also see that the accuracy of LM is greater than the accuracy of ECOC.
This section is divided into 3 parts. Parts 1 and 2 show the performance of the simplex LM and the mixed LM respectively; their accuracies increase with the length, and surpass one-hot when the length is greater than or equal to 3. Part 3 compares the performance of the LMs with ECOC.
V-B1 The performance of simplex LMs of different lengths
In this part, we show how the accuracy of the simplex LM increases with the length of the LM. We also show that when the length increases, the performance of LM becomes better than the network with one-hot encoding.
We use the simplex LM with p = 181 and n = 2, …, 6. The simplex LM can be written as

φ_i(l) = f_l(a_i), i = 1, …, n,

where f_l ∈ F_181[x] is the polynomial of degree less than 2 whose coefficients are the 181-adic digits of l, and a_1, …, a_n are distinct elements of F_181. The accuracies of these simplex LMs are shown in Table I.
In Table I, the accuracies at epochs 20, 40, 60 and 80 are shown. The column “ep.” gives the epoch number in training, the columns “n sites” give the accuracy of the Inception V3 networks with the simplex LM of n sites for n = 2, …, 6, and the last two columns give the accuracy of Inception V3 with one-hot encoding with softmax and with negative sampling.
We see that the accuracies of LM for all site numbers increase with the training epoch. The accuracy of LM also increases with the number of sites. When the number of sites is equal to 3, the number of parameters of LM approximates that of one-hot, but the accuracy of LM is greater than one-hot with softmax or with negative sampling.
Remark. In the one-hot network in Table I, the last layer has dimension 20901, but the last-but-one layer has dimension only 2048. If we set the dimension of the last-but-one layer to 20900, the performance might be better, but our GPU does not have such huge memory.
Remark. In the column “one-hot with negative sampling” in Table I, the negative sampling ratio used is 10:1.
ep.   2 sites  3 sites  4 sites  5 sites  6 sites  one-hot with softmax  one-hot with negative sampling
20    0.0118   0.0318   0.0604   0.0585   0.0640   0.0325                0.0004
40    0.6657   0.9373   0.9812   0.9865   0.9878   0.5152                0.0007
60    0.8172   0.9840   0.9943   0.9964   0.9968   0.9399                0.0019
80    0.8684   0.9920   0.9978   0.9984   0.9988   0.9854                0.0031
param. num. (×10^7)  -  -  -  -  -                 6.46                  6.46
V-B2 The performance of mixed LMs of different lengths
In this part, we show how the accuracy of the mixed LM increases with the length of the LM. We also show that when the length increases, the performance of LM becomes better than the network with one-hot encoding.
We use the mixed Label Mappings with primes in {149, 151, 157, 163, 167, 173, 179}. The mixed LM can be written as

φ_i(l) = l mod p_i, i = 1, …, n,

where (p_1, …, p_7) = (149, 151, 157, 163, 167, 173, 179) and n = 2, …, 7.
The accuracies are shown in Table II, where the accuracies at epochs 20, 40, 60 and 80 are given. The column “ep.” gives the epoch number in training, the columns “n sites” give the accuracy of the Inception V3 networks with the mixed LM of n sites for n = 2, …, 7, and the last two columns give the accuracy of Inception V3 with one-hot encoding.
We see that the accuracies of the mixed LM for all site numbers increase with the training epoch. The accuracy of the mixed LM also increases with the number of sites. When the number of sites is equal to 3, the number of parameters of LM approximates that of one-hot, but the accuracy of LM is greater than one-hot.
Remark. In the one-hot network in Table II, the last layer has dimension 20901, but the last-but-one layer has dimension only 2048. If we set the dimension of the last-but-one layer to 20900, the performance might be better, but our GPU does not have such huge memory.
Remark. In the column “one-hot with negative sampling” in Table II, the negative sampling ratio used is 10:1.
ep.   2 sites  3 sites  4 sites  5 sites  6 sites  7 sites  one-hot with softmax  one-hot with negative sampling
20    0.0081   0.0101   0.0100   0.0309   0.0585   0.0926   0.0325                0.0004
40    0.6130   0.8100   0.8707   0.9656   0.9851   0.9903   0.5152                0.0007
60    0.7629   0.9765   0.9925   0.9957   0.9967   0.9974   0.9399                0.0019
80    0.8757   0.9912   0.9971   0.9980   0.9982   0.9987   0.9854                0.0031
param. num. (×10^7)  -  -  -  -  -  -                       6.46                  6.46
V-B3 Comparing LM with ECOC
We show the accuracies of the following ensemble methods with the Inception V3 network on the CJK dataset:
a. A 15-bit ECOC corresponding to the binary representation of the label.
b. A 2-site simplex LM with p = 181 and n = 2.
c. A 2-site mixed LM with primes in {149, 151}.
The three settings are the minimal settings for the three methods respectively; that is, if we remove any bit of the encoding or any site of the label mapping, the encoding or the label mapping is no longer an injection. The accuracies are in Table III:
ep.   ECOC of 15 bits  simplex LM of 2 sites  mixed LM of 2 sites
20    0.0069           0.0118                 0.0081
40    0.0795           0.6657                 0.6130
60    0.3660           0.8172                 0.7629
80    0.5740           0.8684                 0.8757
We can see that, even though the number of base learners of LM (2) is much less than the number of base learners of ECOC (15), and the number of parameters of LM is much less than that of ECOC, the performance of LM is better than ECOC.
V-C On the dataset “Republic”
The Republic ([45], [46]) is a Socratic dialogue, written by Plato around 380 BC, concerning justice, the order and character of the just city-state, and the just man.
We first apply the following preprocessing procedure:
a). Replace ‘–’ with a white space.
b). Split words based on white space.
c). Remove all punctuation from words.
d). Remove all words that are not alphabetic to remove standalone punctuation tokens.
e). Normalize all words to lowercase.
After this preprocessing, there are 118684 words in the processed text, and 7409 unique words in the vocabulary.
We construct a network which uses the 50 previous words as input and predicts the current word. Because both the input and the output are categorical with a big number of classes, we use the LM method not only for the output, but also for the input.
In fact, a LM naturally induces a sparse encoding method: using the induced map, every label receives a hot code whose length is the sum of the site sizes, and this code can be used as the input encoding.
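As an illustrative sketch, under our assumption that each site of a mixed LM maps a label to its residue modulo a distinct prime (injective overall whenever the product of the primes exceeds the class number, by the Chinese remainder theorem), the induced hot code concatenates one one-hot block per site, so its length is the sum of the site sizes:

```python
def lm_hot_code(label, primes=(149, 151)):
    """Hot code induced by a mixed LM (assumed site maps: label mod p)."""
    code = []
    for p in primes:
        block = [0] * p        # one one-hot block per site
        block[label % p] = 1
        code.extend(block)
    return code

code = lm_hot_code(300)        # length 149 + 151 = 300, exactly two 1-bits
```

For a 6-site mixed LM with primes {107, 109, 113, 127, 131, 137}, the same construction gives codes of length 107+109+113+127+131+137 = 724.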
The network consists of an input encoding layer, an embedding layer of dimension 150, two LSTM layers of dimension 100, a dense layer of dimension 100, and a dense output layer; a softmax follows every output layer. The structure of the network is shown in Figure 10. In Figure 10 we draw only one encoding unit and one embedding unit; in fact there is an encoding unit and an embedding unit before every LSTM cell in the first LSTM layer, but the encoding units share their weights, as do the embedding units.
The performance is shown in Table IV, where the mixed LM of 2 sites uses , the mixed LM of 4 sites uses , and the mixed LM of 6 sites uses the primes {107, 109, 113, 127, 131, 137}. The simplex LM uses the prime number .
input  output  par. num.  ep. 2  ep. 4  ep. 6  ep. 8  ep. 10 
onehot  onehot softmax  2.9E6  0.0946  0.1265  0.1379  0.1471  0.1540 
onehot  onehot negative sampling  2.9E6  0.048  0.058  0.058  0.058  0.058 
724 bit cutoff  724 bit cutoff  3.7E5  0.0589  0.1076  0.1245  0.1272  0.1321 
mix. LM of 6 sites  mix. LM of 2 sites  6.2E5  0.1331  0.1402  0.1366  0.1358  0.1189 
mix. LM of 6 sites  mix. LM of 4 sites  1.2E6  0.1609  0.1722  0.1795  0.1836  0.1845 
mix. LM of 6 sites  mix. LM of 6 sites  1.9E6  0.1590  0.1731  0.1812  0.1849  0.1865 
sim. LM of 6 sites  sim. LM of 2 sites  6.3E5  0.1453  0.1505  0.1531  0.1522  0.1444 
sim. LM of 6 sites  sim. LM of 4 sites  1.3E6  0.1586  0.1685  0.1759  0.1805  0.1832 
sim. LM of 6 sites  sim. LM of 6 sites  1.9E6  0.1575  0.1694  0.1776  0.1814  0.1851 
We see that the accuracies of the LMs of all site numbers basically increase with the training epoch. Overfitting occurs at epoch 10 when we use the LM of 6 sites as input encoding and the LM of 2 sites as output encoding, but it disappears as the number of sites of the output encoding increases. The accuracies of the LMs also increase with the number of sites. Even though the number of parameters used by LM is much smaller than for onehot with softmax or onehot with negative sampling, the performance of LM is better than onehot.
There is a commonly used method for large vocabularies in language models, the cutoff method: the most frequent words are encoded onehot, and all other words share one common encoding as ’’. If we view the mixed LM of 6 sites, with primes {107, 109, 113, 127, 131, 137}, as a binary encoding, its length is 107+109+113+127+131+137 = 724. We see that the performance of the 724-bit cutoff method is much lower than that of the mixed LM of 6 sites.
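For reference, the cutoff baseline can be sketched as follows (names are ours): keep the most frequent words, and map every other word to one shared slot, which is then encoded onehot like any other index.

```python
from collections import Counter

def build_cutoff_encoder(words, n_slots):
    """Index the n_slots - 1 most frequent words; reserve the last slot
    for all remaining (out-of-vocabulary) words."""
    vocab = [w for w, _ in Counter(words).most_common(n_slots - 1)]
    index = {w: i for i, w in enumerate(vocab)}
    unk = n_slots - 1
    return lambda w: index.get(w, unk)

enc = build_cutoff_encoder(["the", "the", "the", "of", "of", "plato"], 3)
# "the" and "of" keep their own slots; "plato" and unseen words share slot 2
```

Unlike the LM hot code, this encoding collapses every rare word onto the same slot, which is one reason its accuracy lags at equal code length.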
VI Conclusion
We present an ensemble method called Label Mapping (LM), which translates a classification problem with a huge number of classes into several classification subproblems with a medium number of classes, and trains a base learner for every subproblem. The necessary number of base learners grows sublinearly with the number of classes.
We propose two design principles for Label Mappings, namely high separability of classes and independence of base learners, give two classes of Label Mappings, and prove that they satisfy the two principles.
In numerical experiments, we report the accuracies of LM on three datasets: Cifar100, CJK characters, and “Republic”. On all datasets, the accuracy of LM increases with the code length. When the class number is large (CJK characters and “Republic”), and in particular much greater than the dimension of the last-but-one layer of the network, LM plus network outperforms onehot encoding plus network with an almost equal or larger number of parameters. We also compare LM with the classical method ECOC: the accuracy of LM is much higher than that of ECOC, even though ECOC uses more parameters.
References
 [1] Bernhard Riemann. Ueber die Anzahl der Primzahlen unter einer gegebenen Grosse. Monatsberichte der Berliner Akademie, November 1859.
 [2] https://en.wikipedia.org/wiki/Prime_number_theorem.
 [3] Irving S. Reed and Gustav Solomon. Polynomial codes over certain finite fields. J. SIAM, 8:300–304, 1960.
 [4] Richard C. Singleton. Maximum distance q-nary codes. IEEE Transactions on Information Theory, 10(2):116–118, April 1964.
 [5] Morin, F. and Bengio, Y. Hierarchical Probabilistic Neural Network Language Model. In AISTATS, 2005.

 [6] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in NIPS, pages 3111–3119, 2013.
 [7] E. L. Allwein, R. E. Schapire, and Y. Singer. Reducing multiclass to binary: A unifying approach for margin classifiers. The Journal of Machine Learning Research, 1:113–141, 2001.
 [8] Sejnowski, T. J. and Rosenberg, C. R. Parallel networks that learn to pronounce English text. Journal of Complex Systems, 1(1):145–168, 1987.
 [9] Shu Lin and Daniel Costello. Error Control Coding (2nd ed.). Pearson, 2005. ISBN 0130179736. Chapter 4.
 [10] L. R. Vermani. Elements of Algebraic Coding Theory. CRC Press, 1996.
 [11] E. Guerrini and M. Sala. A classification of MDS binary systematic codes. BCRI preprint, www.bcri.ucc.ie 56, UCC, Cork, Ireland, 2006.

 [12] T. G. Dietterich and G. Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, pages 263–286, 1995.
 [13] E. L. Allwein, R. E. Schapire, and Y. Singer. Reducing multiclass to binary: A unifying approach for margin classifiers. The Journal of Machine Learning Research, 1:113–141, 2001.
 [14] A. Passerini, M. Pontil, and P. Frasconi. New results on error correcting output codes of kernel machines. Neural Networks, IEEE Transactions on, 15(1):45–54, 2004.
 [15] Langford, J., and Beygelzimer, A. 2005. Sensitive error correcting output codes. In COLT.
 [16] G. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006
 [17] S. Escalera, O. Pujol, and P. Radeva. ECOC-ONE: A novel coding and decoding strategy. In ICPR, volume 3, pages 578–581, 2006.

 [18] O. Pujol, P. Radeva, and J. Vitria. Discriminant ECOC: a heuristic method for application dependent design of error correcting output codes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(6):1007–1012, 2006.

 [19] O. Pujol, S. Escalera, and P. Radeva. An incremental node embedding technique for error correcting output codes. Pattern Recognition, 41(2):713–725, 2008.
 [20] S. Escalera, O. Pujol, and P. Radeva. Separability of ternary codes for sparse designs of error-correcting output codes. Pattern Recognition Letters, 30(3):285–297, 2009.
 [21] G. Zhong, K. Huang, and C.-L. Liu. Joint learning of error-correcting output codes and dichotomizers from data. Neural Computing and Applications, 21(4):715–724, 2012.
 [22] G. Zhong and M. Cheriet. Adaptive error-correcting output codes. In IJCAI, 2013.

 [23] G. Zhong and C.-L. Liu. Error-correcting output codes based ensemble feature extraction. Pattern Recognition, 46(4):1091–1100, 2013.
 [24] Yang, Luo, Loy, Shum, Tang. Deep Representation Learning with Target Coding. Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015, pages 3848–3854.
 [25] Irving S. Reed and Gustav Solomon. Polynomial codes over certain finite fields. J. SIAM, 8(2):300–304, June 1960.
 [26] S. Escalera, O. Pujol, and P. Radeva. ECOC-ONE: A novel coding and decoding strategy. International Conference on Pattern Recognition, volume 3, 2006, pages 578–581.
 [27] O. Pujol, P. Radeva, and J. Vitria. Discriminant ECOC: a heuristic method for application dependent design of error correcting output codes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(6):1007–1012, 2006.
 [28] O. Pujol, S. Escalera, and P. Radeva. An incremental node embedding technique for error correcting output codes. Pattern Recognition, 41(2):713–725, 2008.
 [29] S. Escalera, O. Pujol, and P. Radeva. Separability of ternary codes for sparse designs of error-correcting output codes. Pattern Recognition Letters, 30(3):285–297, 2009.
 [30] A. Passerini, M. Pontil, and P. Frasconi. New results on error correcting output codes of kernel machines. IEEE Transactions on Neural Networks, 15(1):45–54, 2004.

 [31] Langford, J. and Beygelzimer, A. Sensitive error correcting output codes. International Conference on Computational Learning Theory, 2005, pages 158–172.
 [32] Ghani, R. Using Error-Correcting Codes for Efficient Text Classification with a Large Number of Categories. KDD Lab Project Proposal, 2001.
 [33] Yang, Luo, Loy, Shum, Tang. Deep Representation Learning with Target Coding. Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015, pages 3848–3854.

 [34] Jie Qin, Li Liu, Ling Shao, Fumin Shen, Bingbing Ni, Jiaxin Chen, Yunhong Wang. Zero-Shot Action Recognition With Error-Correcting Output Codes. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pages 2833–2842.
 [35] Apostol, T. M. Introduction to Analytic Number Theory. Springer-Verlag, New York, 1976.
 [36] E. Guerrini and M. Sala. A classification of MDS binary systematic codes. BCRI preprint, 2006. www.bcri.ucc.ie/FILES/PUBS/BCRI_57.pdf
 [37] Niloufar Eghbali and Gholam Ali Montazer. Improving multiclass classification using neighborhood search in error correcting output codes. Pattern Recognition Letters, 100(1):74–82, December 2017.
 [38] Fa Zheng, Hui Xue, Xiaohong Chen, Yunyun Wang. Maximum Margin Tree Error Correcting Output Codes. Pacific Rim International Conference on Artificial Intelligence, 2016, pages 681–691.
 [39] Berger, A. Error-Correcting Output Coding for text classification. In IJCAI, 1999.
 [40] Ghani, R. Using error-correcting codes for text classification. Proceedings of ICML-00, 17th International Conference on Machine Learning, pages 303–310. Stanford, US: Morgan Kaufmann Publishers, San Francisco, US.
 [41] Ghani, R. Using Error-Correcting Codes for Efficient Text Classification with a Large Number of Categories. KDD Lab Project Proposal.
 [42] T. G. Dietterich and G. Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, pages 263–286, 1995.
 [43] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, Zbigniew Wojna. Rethinking the Inception Architecture for Computer Vision. 2015. https://arxiv.org/abs/1512.00567
 [44] Alex Krizhevsky. Learning Multiple Layers of Features from Tiny Images. 2009. https://www.cs.toronto.edu/~kriz/cifar.html
 [45] https://en.wikipedia.org/wiki/Republic_(Plato)
 [46] http://www.gutenberg.org/cache/epub/1497/pg1497.txt