Label Confusion Learning to Enhance Text Classification Models

Representing a true label as a one-hot vector is a common practice in training text classification models. However, the one-hot representation may not adequately reflect the relation between instances and labels, as labels are often not completely independent and instances may relate to multiple labels in practice. Such inadequate one-hot representations tend to train the model to be over-confident, which may result in arbitrary predictions and model overfitting, especially for confused datasets (datasets with very similar labels) or noisy datasets (datasets with labeling errors). While training models with label smoothing (LS) can ease this problem to some degree, it still fails to capture the realistic relations among labels. In this paper, we propose a novel Label Confusion Model (LCM) as an enhancement component to current popular text classification models. LCM can learn label confusion to capture semantic overlap among labels by calculating the similarity between instances and labels during training, and generate a better label distribution to replace the original one-hot label vector, thus improving the final classification performance. Extensive experiments on five text classification benchmark datasets reveal the effectiveness of LCM for several widely used deep learning classification models. Further experiments also verify that LCM is especially helpful for confused or noisy datasets and superior to the label smoothing method.









Text classification is one of the fundamental tasks in natural language processing (NLP), with wide applications such as sentiment analysis, news filtering, spam detection and intent recognition. Many algorithms, especially deep learning based methods, have been applied successfully to text classification, including recurrent neural networks (RNNs) and convolutional neural networks (CNNs) kim2014convolutional. More recently, large pre-trained language models such as ELMo peters2018deep, BERT devlin2018bert, XLNet yang2019xlnet and so on have also shown outstanding performance on all kinds of NLP tasks, including text classification.

Although numerous deep learning models have shown success on text classification problems, they all share the same learning paradigm: a deep model for text representation, a simple classifier to predict the label distribution, and a cross-entropy loss between the predicted probability distribution and the one-hot label vector. However, this learning paradigm has at least two problems: (1) In general text classification tasks, one-hot label representation is based on the assumption that all categories are independent of each other. But in real scenarios, labels are often not completely independent and instances may relate to multiple labels, especially for confused datasets that have similar labels. As a result, simply representing the true label by a one-hot vector fails to take the relations between instances and labels into account, which further limits the learning ability of current deep learning models. (2) The success of deep learning models heavily relies on large annotated datasets; noisy data with labeling errors will severely diminish classification performance, yet such noise is inevitable in human-annotated datasets. Training with one-hot label representations is particularly vulnerable to mislabeled samples, as full probability is assigned to a wrong category. In brief, the limitation of the current learning paradigm leads to confusion in prediction, where the model can hardly distinguish some labels; we refer to this as the label confusion problem (LCP). Label smoothing (LS) has been proposed to remedy the inefficiency of one-hot labeling muller2019does; however, it still fails to capture the realistic relations among labels and is therefore not enough to solve the problem.

In this work, we propose a novel Label Confusion Model (LCM) as an enhancement component to current deep learning text classification models, making them stronger in coping with the label confusion problem. In particular, LCM learns the representations of labels and calculates their semantic similarity with the input text representation to estimate their dependency, which is then transferred to a label confusion distribution (LCD). After that, the original one-hot label vector is added to the LCD with a controlling parameter and normalized by a softmax function to generate a simulated label distribution (SLD). We use the obtained SLD to replace the one-hot label vector and supervise the model training. With the help of LCM, a deep model not only captures the relations between instances and labels, but also learns the overlaps among different labels, and thus performs better on text classification tasks. We summarize our contributions as follows:

  • We propose a novel label confusion model (LCM) as an effective enhancement component for text classification models, which models the relations between instances and labels to cope with the LCP. In addition, LCM is only used during training and does not change the original model structure, which means LCM can improve performance without extra computation cost in the prediction procedure.

  • Extensive experiments on five benchmark datasets (in both English and Chinese) illustrate the effectiveness of LCM on three widely used deep learning structures: LSTM, CNN and BERT. Experiments also verify its advantage over the label smoothing method.

  • We construct four datasets with different confusion degrees, and four datasets with different proportions of noise. Experiments on these datasets demonstrate that LCM is especially helpful for confused or noisy datasets and superior to the label smoothing (LS) method to a large degree.

Related Work

Text Classification With Deep Learning

Deep learning models have been widely used in natural language processing, including text classification problems. Studies of deep text representations fall into two groups. One focuses on word embeddings mikolov2013distributed; le2014distributed; pennington2014glove. The other mainly studies deep learning structures that can learn better text representations. Typical deep structures include recurrent neural network (RNN) based long short-term memory (LSTM) LSTM; liu2016recurrent; wang2018topic, convolutional neural networks (CNNs) kalchbrenner2014convolutional; kim2014convolutional; zhang2015character; shen2017deconvolutional, and context-dependent language models like BERT devlin2018bert. Deep learning methods have become so popular because of their ability to learn sophisticated semantic representations from text, which are much richer than hand-crafted features.

Label Smoothing

Label smoothing (LS) is first proposed in image classification tasks as a regularization technique to prevent the model from predicting the training examples too confidently, and has been used in many state-of-the-art models, including image classification szegedy2016rethinking; zoph2018learning, language translation vaswani2017attention and speech recognition chorowski2016towards. LS improves model accuracy by computing loss not with the “hard” one-hot targets, but with a weighted mixture of these targets with a uniform noise distribution.
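As a concrete illustration, the smoothed target is a mixture of the one-hot vector with a uniform distribution over all classes (a generic numpy sketch of standard label smoothing, not code from any particular system; `epsilon` is the smoothing hyper-parameter):

```python
import numpy as np

def smooth_labels(one_hot, epsilon=0.1):
    """Standard label smoothing: (1 - eps) * one_hot + eps / num_classes.

    Every class receives eps / num_classes probability mass, regardless
    of how similar it is to the true class.
    """
    num_classes = one_hot.shape[-1]
    return (1.0 - epsilon) * one_hot + epsilon / num_classes

# Example: a 4-class one-hot target smoothed with epsilon = 0.1
y = np.array([0.0, 1.0, 0.0, 0.0])
y_ls = smooth_labels(y, epsilon=0.1)  # -> [0.025, 0.925, 0.025, 0.025]
```

Note that the off-target mass is uniform: the same amount goes to every wrong class, which is exactly the limitation discussed next.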

Nevertheless, the label distribution generated from LS cannot reflect the true label distribution for each training sample, since it is obtained by simply adding some noise. The true label distribution should reveal the semantic relation between the instance and each label, and similar labels should have similar degrees in the distribution. In essence, label smoothing encourages the model to learn less, rather than to learn the knowledge in the training samples more accurately, which carries the risk of underfitting.

Label Embedding

Label embedding learns embeddings of the labels in classification tasks and has been proven effective. zhang2017multi convert labels into semantic vectors and thereby turn the classification problem into a vector matching task. Attention mechanisms have then been used to jointly learn the embeddings of words and labels wang2018joint. yang2018sgm use label embedding in a sequence generation model for multi-label classification, which captures the correlation between labels. In our work, we also jointly learn the label embeddings, which can be used to further capture the semantic relations between text and labels.

Label Distribution Learning

Label Distribution Learning (LDL) LDL is a machine learning paradigm for applications where the overall distribution of labels matters. A label distribution covers a certain number of labels, representing the degree to which each label describes the instance. Several algorithms have been given for this kind of task. However, the true label distribution is hard to obtain for many existing classification tasks such as 20NG (a typical text classification dataset) and MNIST (a typical image classification dataset) mnist, where we only have a unique label for each sample. For such classification tasks, LDL is not applicable.

Figure 1: The structure of LCM-based classification model, which is composed of a basic predictor and a label confusion model.
Models 20NG AG's News DBPedia FDCNews THUCNews
LSTM-rand + LCM
LSTM-pre
LSTM-pre + LCM
CNN-rand + LCM
CNN-pre + LCM
Table 1: Test accuracy on different text classification tasks. We report mean ± standard deviation. Bold indicates a significant improvement over the baseline methods under a t-test.

Our Approach

Intuitively, for most classification tasks there exists a label distribution which reflects the degree to which each label describes the current instance. However, in practice we can only obtain a unique label (in single-label classification) or several labels (in multi-label classification) for each sample, rather than the relation degrees between the samples and the labels. There is no natural and verified way to transfer the one-hot label to a proper distribution if no statistical information is provided. Though the theoretical true label distribution cannot be easily obtained, we can still try to simulate it by mining the semantic information behind instances and labels.

Considering that the label confusion problem is usually caused by semantic similarity, we suppose that a label distribution that reflects the similarity relations between labels can help to train a stronger model and address the label confusion problem. A simple idea is to find descriptions of each label and calculate the similarity between every two labels. Then, for each one-hot label representation, we can use the normalized similarity values to create a label distribution. However, the label distributions obtained in this way are all the same for instances with the same label, regardless of their content. In fact, even if two instances have the same label, their content may be quite different, so their label distributions should also be different.

Therefore, we should construct the label distribution using the relations between instances and labels, so that the label distribution changes dynamically across different instances with the same label. For text classification problems, we can simulate the label distribution by the similarity between the representation of the document text and each label. In this way, not only are the relations between instances and labels captured, but the dependency among labels is also reflected by these relations. With this basic idea, we design a label confusion model (LCM) to learn the simulated label distribution (SLD) by calculating the semantic relations between instances and labels. The SLD is then treated as the true label distribution and is compared with the predicted distribution to compute the loss by KL-divergence. In the following, we introduce the LCM-based classification model in detail.

LCM-based Classification Model

An LCM-based classification model is composed of two parts: a basic predictor and a label confusion model. The overall structure is shown in Figure 1.

The basic predictor usually consists of an input encoder (such as an RNN, CNN or BERT) followed by a simple classifier (usually a softmax classifier), and can be chosen from all kinds of mainstream deep learning based text classifiers. As shown in Figure 1, the text to be classified is fed into the input encoder to generate the input text representation, which is then fed to the softmax classifier to predict the label distribution (PLD):

v = f_I(x),    y^(p) = softmax(W v + b),

where f_I is the input encoder function, transforming the input sequence x of length n into the input text representation v of dimension d, and y^(p) is the predicted label distribution (PLD).

The LCM consists of a label encoder and a simulated label distribution computing block (SLD block). The label encoder is a deep neural network (DNN) that generates the label representation matrix. The SLD block is composed of a similarity layer and an SLD computing layer. The similarity layer takes the label representations and the current instance representation as inputs, computes their similarity values by dot product, and then applies a neural net with softmax activation to obtain the label confusion distribution (LCD). The LCD captures the dependency among labels by calculating the similarity between instances and labels. Thus, the LCD is a dynamic, instance-dependent distribution, which is superior to a distribution that solely considers the similarity among labels, or to a simple uniform noise distribution as in LS.

Finally, the original one-hot vector y^(t) is added to the LCD with a controlling parameter α, and then normalized by a softmax function to generate the simulated label distribution (SLD). The controlling parameter α decides how much of the one-hot vector will be changed by the LCD. The above process can be formulated as:

V^l = f_L(l),    y^(c) = softmax(v (V^l)^T),    y^(s) = softmax(α y^(t) + y^(c)),

where f_L is the label encoder function that transfers the labels l = (l_1, ..., l_C) into the label representation matrix V^l, and C is the number of categories. f_L in our case is implemented by an embedding lookup layer followed by a DNN, which can be a multi-layer perceptron (MLP), an LSTM, an attention network and so on. Note that the order of the label sequence input to the LCM should be the same as that of the one-hot targets. y^(c) is the LCD and y^(s) is the SLD.
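The SLD computation can be sketched in a few lines of numpy (a simplified illustration under our notation: `v` is the instance representation, `V_l` the label representation matrix, `y_onehot` the one-hot target and `alpha` the controlling parameter; the extra neural layer of the SLD block is omitted for brevity, and `alpha=4.0` is only an illustrative default):

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def simulated_label_distribution(v, V_l, y_onehot, alpha=4.0):
    """Sketch of the SLD block.

    v        : (d,)   instance representation from the input encoder
    V_l      : (C, d) label representation matrix from the label encoder
    y_onehot : (C,)   original one-hot target
    alpha    : controlling parameter weighting the one-hot part
    """
    sim = V_l @ v                               # dot-product similarity, shape (C,)
    y_lcd = softmax(sim)                        # label confusion distribution (LCD)
    y_sld = softmax(alpha * y_onehot + y_lcd)   # simulated label distribution (SLD)
    return y_sld
```

With a sufficiently large `alpha`, the SLD keeps its largest value at the true-label position while the remaining mass is allocated to semantically similar labels rather than uniformly.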

The SLD is then viewed as the new training target, replacing the one-hot vector to supervise the model training. Since the SLD y^(s) and the predicted label distribution y^(p) are both probability distributions, we use the Kullback–Leibler divergence (KL-divergence) as the loss function to measure their difference:

loss = KL(y^(s) || y^(p)) = Σ_c y^(s)_c log( y^(s)_c / y^(p)_c ).

By training with LCM, the actual targets the model tries to fit change dynamically according to the semantic representations of documents and labels learned by the deep model. The learned simulated label distribution helps the model to better represent instances with different labels, especially for easily confused samples. The SLD is also more robust when facing noisy data, because the probability of the wrong label is allocated to similar labels (which often include the right label), so the model can still learn some useful information from mislabeled data.
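The loss between the SLD and the predicted distribution is a plain KL-divergence (a minimal numpy sketch; the clipping constant `eps` is only there for numerical safety and is our addition):

```python
import numpy as np

def kl_divergence(y_sld, y_pred, eps=1e-12):
    """KL(SLD || PLD) = sum_c y_sld[c] * log(y_sld[c] / y_pred[c]).

    Both inputs are probability distributions over the C classes;
    clipping avoids log(0) when a probability is exactly zero.
    """
    y_sld = np.clip(y_sld, eps, 1.0)
    y_pred = np.clip(y_pred, eps, 1.0)
    return float(np.sum(y_sld * np.log(y_sld / y_pred)))
```

The divergence is zero exactly when the predicted distribution matches the SLD, and positive otherwise, so minimizing it pulls the prediction toward the simulated label distribution.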

(a) Cosine similarity matrix
(b) t-SNE visualization of label representations
Figure 2: Cosine similarity matrix (a) and corresponding t-SNE visualization (b) of label representations of 20NG datasets. The representations are drawn from the embedding layer of LCM. The labels with the same color are of the same label group.
(a) The effect of α in LCM
(b) Early stop of LCM enhancement
Figure 3: Hyper-parameter analysis of LCM-based models, including the effect of α and the early stop strategy for LCM enhancement.


Experiment Setup


To assess the effectiveness of our proposed method, we choose 5 benchmark datasets, including 3 English datasets and 2 Chinese datasets:

The 20NG dataset (bydata version) is an English news dataset that contains 18,846 documents evenly categorized into 20 different categories.
The AG's News dataset is constructed by Xiang Zhang zhang2015character and contains 127,600 samples in 4 classes. We choose a subset of 50,000 samples in our experiments.
The DBPedia dataset is also created by Xiang Zhang zhang2015character. It is an ontology classification dataset with 630,000 samples categorized into 14 classes. We randomly selected 50,000 samples to form our experiment dataset.
The FDCNews dataset is provided by Fudan University and contains 9,833 Chinese news articles categorized into 20 different classes.
The THUCNews dataset is a Chinese news classification dataset collected by Tsinghua University. We constructed a subset of it which contains 39,000 news articles evenly split into 13 news categories.

Choice of Basic Predictors

Our LCM is proposed as an enhancement for current mainstream models; therefore, we select some widely used model structures for text classification tasks.

LSTM: We implement the LSTM model defined in liu2016recurrent, which uses the last hidden state as the text representation. We try LSTM-rand, which uses random weight initialization for the embedding layer, and LSTM-pre, which initializes it with pre-trained word embeddings.
CNN: We use the CNN structure in kim2014convolutional and explore both CNN-rand and CNN-pre, without and with pre-trained word vectors respectively.
BERT: Bidirectional Encoder Representations from Transformers devlin2018bert. For faster training, we apply BERT-tiny bert-tiny for English datasets and ALBERT albert for Chinese datasets.


For LSTM we set the embedding size and hidden size to 64. For CNN, we use 3 filter sizes (3, 10 and 25), with 100 filters for each convolution block. For both LSTM and CNN models, the embedding size is 64 if no pre-trained word embeddings are used; otherwise, the embedding size is 250 for Chinese tasks and 100 for English tasks. The Chinese word embeddings are pre-trained on around 1 GB of Chinese Wikipedia corpus using the skip-gram mikolov2013distributed algorithm. The English word embeddings we choose are the GloVe embeddings pennington2014glove. In BERT models, we obtain text representations from the BERT model and then use a dense layer with 64 units to decrease the dimension of the text representation to 64. The LCM component in our case is implemented using an embedding lookup layer followed by a dense neural net, where the embedding size and the hidden size are kept the same as in the basic predictor. The parameter α decides the importance of LCM relative to the basic predictor. In our main experiments we simply set α to a moderate value, but with careful tuning the performance can be further improved. We train our model's parameters with the Adam optimizer kingma2014adam with an initial learning rate of 0.001 and a batch size of 128. The model is implemented using Keras and is trained on a GeForce GTX 1070 Ti GPU.

Table 2: The labels of each 20NG subset (including, among others, 'alt.atheism', 'sci.crypt', 'sci.electronics', 'talk.politics.guns', 'soc.religion.christian' and 'talk.politics.misc').
Models 8NG-H 8NG-E 4NG-H 4NG-E
Basic Predictor
Basic Predictor + LCM
Table 3: Test accuracy on some subsets of 20NG.

Experimental Results

Most of the datasets have already been split into train and test sets. However, different splits can directly affect the final performance of the model. Therefore, in our experiments, we merge the separate train and test sets into one dataset and randomly re-split it into train and test sets 10 times with a splitting ratio of 7:3. We then assess every model on each of these 10 train/test splits. By doing this, we can better evaluate whether LCM can enhance the basic text classification predictors.
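The evaluation protocol above can be sketched as follows (a simplified illustration; `evaluate` is a hypothetical stand-in for training a basic predictor on one split and returning its test accuracy):

```python
import random

def repeated_split_eval(samples, evaluate, runs=10, train_ratio=0.7, seed=0):
    """Merge-and-resplit protocol: shuffle the pooled dataset `runs` times,
    split at `train_ratio`, and average the accuracy returned by
    `evaluate(train_idx, test_idx)` over all runs.
    """
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    accuracies = []
    for _ in range(runs):
        rng.shuffle(indices)
        cut = int(train_ratio * len(indices))
        accuracies.append(evaluate(indices[:cut], indices[cut:]))
    return sum(accuracies) / len(accuracies)
```

Averaging over several random splits, rather than relying on a single fixed split, also yields the standard deviations reported in Table 1.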

Models Original 20NG 6% Noise 12% Noise 30% Noise
LSTM 0.6822 0.5946 0.5747 0.4681
LSTM + LS 0.7015 0.6155 0.6010 0.4994
LSTM + LCM 0.7242 0.6572 0.6385 0.5178
BERT 0.8853 0.8695 0.8546 0.7916
BERT + LS 0.8855 0.8742 0.8535 0.7932
BERT + LCM 0.8896 0.8789 0.8581 0.7980
Table 4: Experiments on noisy datasets and comparison with label smoothing (LS) method, where the percentage means the proportion of samples randomly being mislabeled.

Test Performance

Table 1 presents the test performance and the improvement of LCM-based models compared with their corresponding basic predictors, grouped by structure. We can see from the results that LCM-based classification models outperform their baselines on all datasets when using the LSTM-rand, CNN-rand and BERT structures. The LCM-based CNN-pre model is slightly worse on the FDCNews and 20NG datasets, and the LCM-based LSTM-pre model is not significantly different from the baseline on AG's News. The overall results on 5 datasets using 3 widely used deep learning structures illustrate that LCM has the ability to enhance the performance of text classification models. We can also see that LCM-based models usually have a lower standard deviation, which reflects their robustness to the dataset splitting.

The biggest improvement was achieved by LCM on the LSTM-rand baseline on the 20NG dataset, with a 4.20% increase in test accuracy. The performance gain of CNN-rand on the same dataset is also quite obvious, with a 1.04% improvement. The natural confusion of the labels in the 20NG dataset can shed light on why LCM performs so well here. Although there are 20 categories in the 20NG dataset, these categories can naturally be divided into several groups, so it is natural that labels in the same group are more difficult for the model to distinguish. We further visualize the learned representations of the 20 labels in the 20NG dataset, as shown in Figure 2. The label representations are extracted from the embedding layer of LCM. Figure 2(a) illustrates the cosine similarity matrix of the label representations, where the elements off the diagonal reflect how similar one label is to another. We then use t-SNE tsne to visualize the high-dimensional representations on a 2D map, shown in Figure 2(b). We find that the labels that are easily confused, especially those in the same group, tend to have similar representations. Since the label representations are all randomly initialized at the beginning, this shows that LCM has the ability to learn very meaningful representations of the labels, which reflect the confusion between labels.

The reasons why LCM-based classification models usually achieve better test performance can be summarized in several aspects: (1) The LCM part learns the simulated label distribution (SLD) during training, which captures the complex relations among labels by considering the semantic similarity between the input document and the labels. This is superior to simply using a one-hot vector to represent the true label. (2) It is common that some mislabeled data exists, especially for datasets with a large number of categories or very similar labels. In this scenario, training with one-hot label representations tends to be influenced by these mislabeled data more severely. However, with the SLD, the value at the index of the wrong label is reduced and allocated to similar labels, so the misleading effect of the wrong label is relatively minor. (3) Apart from mislabeled data, when the given labels share some similarity (for example, "computer" and "electronics" are semantically similar topics and share many keywords in content), it is natural and reasonable to label a text sample with a label distribution that conveys these various aspects of information. However, the current classification learning paradigm ignores the differences between samples, and this important information is lost.

The Effect of α and Early Stop of LCM

The α is a controlling hyper-parameter that decides the importance of LCM for the original model. A larger α gives the original one-hot label more weight when generating the SLD, thus reducing the influence of LCM. To ensure that the largest value in the SLD stays at the same position as in the original one-hot label, α should not be too small. Figure 3(a) shows the accuracy curves for different α on the 20NG dataset using LSTM as the basic structure. We can see from the graph that, in the early stage, LCM-based models learn much faster than the baseline model, and a smaller α leads to faster convergence. However, when α is too small, the model will easily over-fit. In this situation, we can apply an early stop strategy to the LCM, that is, switch off the LCM influence after a certain number of iterations. From the learning curve we find that the accuracy begins to decrease at about 10 epochs, so we can choose to stop LCM at about 10 epochs and continue training the model with the original one-hot label vector. The result shown in Figure 3(b) reveals the effectiveness of early stopping the LCM enhancement: the model avoids over-fitting and continues to learn and improve. The choice of α and the early stop strategy should be based on the nature of the specific dataset and task. In our experience, a smaller α tends to behave better on datasets with similar labels or labeling errors.
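The early stop strategy amounts to a simple training schedule (an illustrative sketch; `train_epoch` is a hypothetical stand-in for one epoch of training against either the SLD or the plain one-hot targets):

```python
def train_with_lcm_early_stop(train_epoch, total_epochs, lcm_epochs=10):
    """Train against the SLD for the first `lcm_epochs` epochs, then
    switch back to the original one-hot targets for the remaining epochs
    (the "early stop of LCM" strategy).
    """
    history = []
    for epoch in range(total_epochs):
        # use_sld tells the training step which targets to fit this epoch
        history.append(train_epoch(epoch, use_sld=epoch < lcm_epochs))
    return history
```

The switch point (`lcm_epochs`) would be chosen from the learning curve, e.g. where accuracy with a small α begins to degrade.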

The Influence of the Dataset Confusion Degree

It is not easy to directly tell the confusion degree of each of our benchmark datasets, due to differences in language, style and number of classes. Note that the 20NG dataset has some inner groups, which means some labels within the same group are naturally similar to each other. We sampled four subsets from 20NG: 8NG-H, 8NG-E, 4NG-H and 4NG-E. The datasets tagged "H" are sampled from within label groups, so the documents are harder to distinguish, while the "E" tag means the samples are selected from different groups and are easier to classify. The detailed labels of each dataset are listed in Table 2. We choose LSTM as the basic predictor, and the results on these four datasets are shown in Table 3. We observe that LCM helps a lot on the 8NG-H and 4NG-H datasets but is less helpful on the "E" datasets, and even slightly decreases the accuracy on 4NG-E. It is straightforward that 8NG-H has a higher confusion degree than 8NG-E, as does 4NG-H compared to 4NG-E. This phenomenon shows that LCM is especially helpful for datasets with a high confusion degree.

Experiments on Noisy Datasets and Comparison with Label Smoothing

The success of deep learning models heavily depends on large annotated datasets; noisy data with labeling errors will severely diminish classification performance and usually leads to an overfitted model. We constructed several noisy datasets with different percentages of noise from the 20NG dataset, since 20NG inherently has some label groups. To make the noisy datasets closer to reality, the mislabeled samples are all assigned labels from within the same label group, such as "comp", "rec" and "talk". We then conduct experiments to verify the effects of label smoothing (LS) and LCM, exploring two deep learning models as basic predictors. The results are shown in Table 4. The smoothing hyper-parameter is set to 0.1 for LS. From the results we can see that LCM outperforms LS to a large degree, both on the original, cleaner dataset and on the datasets with obvious label errors.
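The construction of such noisy datasets can be sketched as follows (an illustrative sketch, not the paper's code; the group contents used in the example below are merely samples of 20NG label groups):

```python
import random

def inject_group_noise(labels, groups, noise_ratio, seed=0):
    """Randomly re-label a fraction of samples with a *different* label
    drawn from the same label group, mimicking realistic annotation errors.

    labels : list of label strings
    groups : dict mapping group name -> list of member labels
    """
    rng = random.Random(seed)
    label_to_group = {l: members for members in groups.values() for l in members}
    noisy = list(labels)
    n_noisy = int(noise_ratio * len(labels))
    for i in rng.sample(range(len(labels)), n_noisy):
        siblings = [l for l in label_to_group[noisy[i]] if l != noisy[i]]
        if siblings:
            noisy[i] = rng.choice(siblings)
    return noisy
```

Because the wrong label always comes from the same group as the right one, the noise is harder to detect than uniformly random flips, which makes it a more realistic stress test.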

The Application of LCM on Images

In fact, LCP is a common and natural problem in all classification tasks, not limited to text classification. For example, in the famous MNIST handwritten digits classification task, the number "0" looks similar to the number "6", and "5" is similar to "8" in many cases. Therefore, simply representing the label with a one-hot vector omits this similarity information among labels, and the idea of LCM might also help with this problem. We choose the MNIST dataset and the Fashion-MNIST dataset xiao2017fashion to evaluate the effectiveness of LCM on image classification tasks.

Models MNIST Fashion MNIST
Basic Predictor 0.9822 0.8929
Basic Predictor + LS 0.9834 0.9009
Basic Predictor + LCM 0.9841 0.9028
Table 5: Test accuracy on image classification tasks. Here the basic predictor is a simple CNN model.

Due to limited time and resources, we only implement a simple CNN network as the basic predictor. The results are shown in Table 5, which indicates that LCM can also be used in image classification tasks to improve performance. More basic predictor networks and datasets will be explored in future work.

Conclusion and Future Work

In this work, we propose the Label Confusion Model (LCM) as an enhancement component for current text classification models to improve their performance. LCM can capture the relations between instances and labels as well as the dependency among labels. Experiments on five benchmark datasets demonstrate LCM's enhancement of several popular deep learning models such as LSTM, CNN and BERT.

Our future work includes the following directions: (i) designing a better LCM structure for computer vision tasks and conducting more experiments on image classification; (ii) generalizing the LCM method to multi-label classification problems and label distribution prediction.