An Effective Label Noise Model for DNN Text Classification

Because large, human-annotated datasets suffer from labeling errors, it is crucial to be able to train deep neural networks in the presence of label noise. While training image classification models with label noise has received much attention, training text classification models has not. In this paper, we propose an approach to training deep networks that is robust to label noise. This approach introduces a non-linear processing layer (noise model) that models the statistics of the label noise into a convolutional neural network (CNN) architecture. The noise model and the CNN weights are learned jointly from noisy training data, which prevents the model from overfitting to erroneous labels. Through extensive experiments on several text classification datasets, we show that this approach enables the CNN to learn better sentence representations and is robust even to extreme label noise. We find that proper initialization and regularization of this noise model are critical. Further, in contrast to results focusing on large batch sizes for mitigating label noise in image classification, we find that altering the batch size does not have much effect on classification performance.


1 Introduction

Deep Neural Networks (DNNs) have led to significant advances in the fields of computer vision He et al. (2016), speech processing Graves et al. (2013), and natural language processing Kim (2014); Young et al. (2018); Devlin et al. (2018). To be effective, supervised DNNs rely on large amounts of carefully labeled training data. However, it is not always realistic to assume that example labels are clean. Humans make mistakes and, depending on the complexity of the task, there may be disagreement even among expert labelers. To cope with noisy labels, we need new training methods that can train DNNs directly from corrupted labels and thereby significantly reduce human labeling effort. Zhu and Wu (2004) perform an extensive study of the effect of label noise on classification performance and find that noise in the input features is less important than noise in the training labels.

In this work, we add a noise model layer on top of our target model to account for label noise in the training set, following Jindal et al. (2016); Sukhbaatar et al. (2014). We provide extensive experiments on several text classification datasets with artificially injected label noise. We study the effect of two different types of label noise: uniform label flipping (Uni), where a clean label is swapped with another label sampled uniformly at random, and random label flipping (Rand), where a clean label is swapped with another label according to flip probabilities sampled randomly over the unit simplex.

We also study the effect of different initializations, regularization, and batch sizes when training with noisy labels. We observe that proper initialization and regularization help the noise model remain robust even to extreme amounts of noise. Finally, we use low-dimensional projections of the features of the training examples to understand the effectiveness of the noise model.

The rest of the paper is organized as follows. Section 2 discusses the various approaches in the literature for handling label noise. In Section 3, we describe the problem statement along with the proposed approach. We describe the experimental setup and datasets in Section 4. We empirically evaluate the proposed approach and discuss the results in Section 5, and finally conclude in Section 6.

2 Related Work

Learning from label noise is a widely studied problem in the classical machine learning setting. Earlier works Brodley and Friedl (1999); Rebbapragada and Brodley (2007); Manwani and Sastry (2013) consider learning from noisy labels for a wide range of classifiers, including SVMs Natarajan et al. (2013) and Fisher discriminants Lawrence (2001). Traditional approaches handle label noise by detecting and eliminating the corrupted labels. More details about these approaches can be found in Frénay and Verleysen (2014).

Recently, DNNs have made huge gains in performance over traditional methods on large datasets with very clean labels. However, large real-world datasets often contain label errors. A number of works have attempted to address this problem of learning from corrupted labels for DNNs. These approaches can be divided into two categories: attempts to mitigate the effect of label noise using auxiliary clean data, and attempts to learn directly from the noisy labels.

Presence of auxiliary clean data: This line of research exploits a small, clean dataset to correct the corrupted labels. For instance, Li et al. (2017) learn a teacher network with clean data to re-weight a noisy label with a soft label in the loss function. Similarly, Veit et al. (2017) use the clean data to train a label correction network. One can also use this auxiliary source of information to do inference over latent clean labels Vahdat (2017). Further, Yao et al. (2018) model the trustworthiness of noisy image labels to alleviate the effect of label noise. Though these methods show very promising results, the absence of clean data in some situations might hinder their applicability.

Learning directly from noisy labels: This line of research learns directly from the noisy labels by designing a robust loss function or by modeling the latent labels. For instance, Reed et al. (2014) apply bootstrapping to the loss function to obtain consistent label predictions for similar images. Similarly, Joulin et al. (2016) alleviate the effect of label noise by adequately weighting the loss function using the sample number. Jiang et al. (2017) propose a sequential meta-learning model that takes in a sequence of loss values and outputs weights for the labels. Ghosh et al. (2017) further explore the conditions under which a loss function is noise tolerant.

A number of approaches learn the transition from latent labels to the noisy labels. For example, Mnih and Hinton (2012) propose a noise adaptation framework for symmetric label noise. Building on this work, several others Sukhbaatar et al. (2014); Jindal et al. (2016); Patrini et al. (2017); Han et al. (2018) account for the label noise by learning a noise layer on top of a DNN, where the learned transition matrix represents the label flip probabilities. Similarly, Xiao et al. (2015) propose a probabilistic, image-conditioned noise model. Azadi et al. (2015) propose an image regularization technique to detect and discard noisily labeled images. Other approaches include building two parallel classifiers Misra et al. (2016), where one classifier deals with image recognition and the other models human reporting bias.

All of these approaches target image classification. In this work, we propose a framework for learning from noisy labels for text classification using a DNN architecture. Similar to Sukhbaatar et al. (2014); Jindal et al. (2016); Patrini et al. (2017), we append a non-linear processing layer on top of this architecture to model the label noise. This layer helps the base architecture learn better representations, even in the presence of label noise. We empirically show that knowledge of the noise transition matrix is not needed for better classification performance; instead, the process forces the DNN to learn better sentence representations.

3 Problem Statement

In a supervised text classification setting, let $\mathbf{w}_i \in \mathbb{R}^d$ be the $d$-dimensional word embedding of the $i$-th word in a sentence of length $L$ (padded wherever necessary). We represent each sample as a temporal embedding matrix $X \in \mathbb{R}^{L \times d}$, which belongs to one of $K$ classes. Let the noise-free training set be denoted by $\mathcal{D} = \{(X_n, y_n)\}_{n=1}^{N}$, where $y_n \in \{1, \dots, K\}$ represents the category of the $n$-th sample, $N$ is the total number of training samples, and there is an unknown joint distribution $p(X, y)$ on the sample/label pairs. This temporal representation of a sample is fed as input to a classifier trained with the sample categories $\{y_n\}$. However, as mentioned in Section 2, we cannot access the true noise-free labels and instead observe noisy labels corrupted by an unknown noise distribution. Let this noisy training set be denoted by $\tilde{\mathcal{D}} = \{(X_n, \tilde{y}_n)\}_{n=1}^{N}$, where $\tilde{y}_n$ represents the corrupted label for the sentence $X_n$. In this work, we suppose the label noise is class-conditional: the noisy label $\tilde{y}$ depends only on the true label $y$, but not on the input $X$ or on the labels of other samples. Under this model, the label noise is characterized by the conditional distribution $p(\tilde{y} \mid y)$, which we describe via the column-stochastic matrix $\Theta^{*} \in [0, 1]^{K \times K}$ with entries $\Theta^{*}_{ij} = p(\tilde{y} = i \mid y = j)$.

In our experiments, we artificially inject label noise into the training and validation sets. We fix the noise distribution $\Theta^{*}$ and, for each training sample, we generate a noisy label by drawing i.i.d. from the column of $\Theta^{*}$ corresponding to its true label. However, we do not alter the test labels.

Though the proposed approach works for any noise distribution, for this study we focus on two different types of label flip distributions. We use a noise model parameterized by the overall probability of a label error, denoted by $p$. For a noise level $p$, we set the noise distribution matrix to

$$\Theta^{*} = (1 - p)\, I + \frac{p}{K - 1}\left(\mathbf{1}\mathbf{1}^{\top} - I\right), \qquad (1)$$

and we call this the uniform label flip noise model. Here, $I$ represents the $K \times K$ identity matrix and $\mathbf{1}\mathbf{1}^{\top}$ denotes the all-ones matrix. Similarly, we describe the random label flip noise model as

$$\Theta^{*} = (1 - p)\, I + p\, \Psi, \qquad (2)$$

where $I$ is the identity matrix and $\Psi$ is a matrix with zeros along the diagonal whose remaining entries in each column are drawn uniformly and independently from the $(K - 1)$-dimensional unit simplex. The label error probability for each class is $p$, while the probability distribution within the erroneous classes is drawn uniformly at random.
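As an illustration, the NumPy sketch below shows one way to instantiate the two flip models in (1) and (2) and to corrupt a label vector column by column; the function names and example parameters are ours and are not part of the original experimental code.

```python
import numpy as np

def uniform_flip_matrix(num_classes, p):
    """Eq. (1): keep a label with probability 1 - p, otherwise flip it
    uniformly to one of the remaining num_classes - 1 labels."""
    K = num_classes
    off_diag = np.ones((K, K)) - np.eye(K)
    return (1.0 - p) * np.eye(K) + (p / (K - 1)) * off_diag

def random_flip_matrix(num_classes, p, rng):
    """Eq. (2): keep a label with probability 1 - p; the flip probabilities
    in each column are drawn uniformly from the unit simplex (Dirichlet(1))."""
    K = num_classes
    psi = np.zeros((K, K))
    for j in range(K):
        psi[np.arange(K) != j, j] = rng.dirichlet(np.ones(K - 1))
    return (1.0 - p) * np.eye(K) + p * psi

def corrupt_labels(labels, theta_star, rng):
    """Draw a noisy label for each clean label y from column y of the
    column-stochastic matrix theta_star (applied to train/validation only)."""
    K = theta_star.shape[0]
    return np.array([rng.choice(K, p=theta_star[:, y]) for y in labels])

rng = np.random.default_rng(0)
theta_star = random_flip_matrix(num_classes=4, p=0.5, rng=rng)  # e.g. 4 classes, 50% noise
noisy = corrupt_labels(np.array([0, 1, 2, 3, 2, 1]), theta_star, rng)
```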

Our objective is to train a classifier on the noisily labeled training set $\tilde{\mathcal{D}}$ such that it jointly makes accurate predictions of the true label $y$ and learns the noise transition matrix, given only $\tilde{\mathcal{D}}$. For the noisy dataset $\tilde{\mathcal{D}}$, it is straightforward to train a classifier that predicts the noisy labels using the conditional distribution for a noisily labeled input sentence $X$:

$$p(\tilde{y} \mid X) = \sum_{k=1}^{K} p(\tilde{y} \mid y = k)\, p(y = k \mid X). \qquad (3)$$

One can learn the classifier associated with $p(\tilde{y} \mid X)$ via standard training on the noisy set $\tilde{\mathcal{D}}$. Predicting the clean labels by learning the conditional distribution $p(y \mid X)$ requires more effort, as we cannot extract the “clean” classifier from the noisy classifier when the label noise distribution is unknown.

3.1 Proposed Framework

We refer to the DNN model without the final noise layer as the base model or network without noise model (WoNM). This model, together with the non-linear noise layer, is trained via back-propagation on the noisy training dataset. The non-linear processing layer in the noise model transforms the base model outputs to better match the noisy labels during the forward pass and presents denoised labels to the base model during the backward pass. The noise layer is parameterized by a square matrix $\Theta \in \mathbb{R}^{K \times K}$. At test time, we remove this learned noise model and use the output of the base model as the final predictions.

We denote the base model parameters by $\omega$. The base model outputs a probability distribution over the $K$ categories, denoted $\hat{p}(y \mid X; \omega)$. During the forward pass, the noise model transforms this output to obtain the distribution over noisy labels as

$$\hat{p}(\tilde{y} \mid X; \omega, \Theta) = \sigma\!\left(\Theta\, \hat{p}(y \mid X; \omega)\right), \qquad (4)$$

where $\sigma(\cdot)$ represents the usual softmax operator. Note that both equations (3) and (4) compute a probability distribution over noisy labels; because of the softmax, our noise model does not learn the noise transition matrix itself. However, we assert that knowledge of the exact noise statistics is neither necessary nor sufficient for better prediction results.
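As a concrete, simplified reading of (4), the following PyTorch sketch implements the noise layer as a full $K \times K$ matrix applied to the base model's output probabilities, followed by a softmax. The class name and the gain argument are illustrative, and the base CNN is abstracted away.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoiseLayer(nn.Module):
    """Non-linear noise model of Eq. (4): a K x K linear map over the base
    model's class probabilities, followed by a softmax."""

    def __init__(self, num_classes, gain=1.0):
        super().__init__()
        # Identity initialization, optionally scaled by a gain (cf. Section 5.2).
        self.theta = nn.Parameter(gain * torch.eye(num_classes))

    def forward(self, clean_probs):
        # clean_probs: (batch, K) base model output, p_hat(y | x)
        # returns:     (batch, K) noisy-label distribution, p_hat(y_tilde | x)
        return F.softmax(clean_probs @ self.theta.t(), dim=-1)
```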

We learn the base model parameters $\omega$ and the noise model parameters $\Theta$ by maximizing the log likelihood (4) over all of the training samples, i.e., by minimizing the cross-entropy loss

$$\mathcal{L}(\omega, \Theta) = -\frac{1}{N} \sum_{n=1}^{N} \log \hat{p}(\tilde{y} = \tilde{y}_n \mid X_n; \omega, \Theta). \qquad (5)$$

Similar to Sukhbaatar et al. (2014), we initialize the noise model weights to the identity matrix. Since DNNs have high capacity, we may encounter the situation in which the base network absorbs all the label noise and, thus, the noise model does not learn anything at all. To avoid this situation, and to prevent overfitting, we apply $\ell_2$ regularization to the noise model; we do, however, want the noise model itself to fit the label noise. In the experiment section, we observe that with proper regularization and weight initialization the noise model absorbs most of the label noise. Finally, we train the entire network according to the following loss function:

$$\mathcal{L}(\omega, \Theta) = -\frac{1}{N} \sum_{n=1}^{N} \log \hat{p}(\tilde{y} = \tilde{y}_n \mid X_n; \omega, \Theta) + \lambda \lVert \Theta \rVert_F^2. \qquad (6)$$

Here, $\lambda$ is a tuning parameter; we choose its value by repeating the experiment with multiple candidate values over different datasets and selecting the one with the best classification performance. A value of $\lambda = 0.1$ works best.
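A minimal sketch of one training step under (6), assuming the regularizer is the squared Frobenius-norm penalty on the noise-layer weights written above; base_model and noise_layer (the NoiseLayer sketched earlier) are placeholders, and lam defaults to the value that worked best in our tuning.

```python
import torch
import torch.nn.functional as F

def noisy_training_step(base_model, noise_layer, optimizer, x, y_noisy, lam=0.1):
    """One optimization step on Eq. (6): cross-entropy against the noisy
    labels plus lam * ||Theta||_F^2 on the noise-layer weights."""
    optimizer.zero_grad()
    clean_probs = base_model(x)                  # p_hat(y | x), shape (batch, K)
    noisy_probs = noise_layer(clean_probs)       # p_hat(y_tilde | x), Eq. (4)
    nll = F.nll_loss(torch.log(noisy_probs + 1e-12), y_noisy)   # Eq. (5)
    reg = lam * noise_layer.theta.pow(2).sum()                  # Frobenius penalty
    loss = nll + reg
    loss.backward()
    optimizer.step()
    # At test time, predictions come from base_model alone (noise layer removed).
    return loss.item()
```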

4 Datasets and Experimental Setup

In this section, we empirically evaluate the performance of the proposed approach for text classification and compare our results with the other methods.

4.1 General Setting

In all the experiments, we use a publicly-available deep learning library, Baseline – a fast model development tool for NLP tasks Pressel et al. (2018). For all the different datasets, we choose a commonly-used, high-performance model from Kim (2014) as the base model. To examine the robustness of the proposed approach, we intentionally flip the class labels at a range of noise levels and observe the effect of different types of label flipping, namely uniform (Uni) and random (Rand) label flipping, along with instance-dependent label noise. For all the experiments, we use early stopping based on validation set accuracy, where the class labels in the validation set are also corrupted.

We report the performance of a standard deep network Without Noise model (WoNM) trained on the noisy-label dataset. We also report results for the stacked Noise Model Without Regularization (NMWoRegu) and the stacked Noise Model With Regularization (NMwRegu). Unless otherwise stated, in all deep networks with a stacked noise model, we initialize the noise layer parameters to an identity matrix. We further analyze the effect of the noise layer initialization on overall performance. We define TDwRegu as the stacked noise model with regularization, initialized with the true injected noise distribution, and RandwRegu as the stacked noise model with regularization, initialized randomly. We run all experiments five times and report the mean accuracy.

4.2 Datasets

Text Data

Dataset    K    L    N       T      Type
SST-2      2    19   76961   1821   Balanced
Trec       6    10   5000    500    Not balanced
AG-News    4    –    110K    10K    Balanced
DBpedia    14   29   504K    70K    Balanced

Table 1: Summary of text classification datasets. K: number of classes; L: average sentence length; N: number of training samples; T: number of test samples; Type: whether the dataset is balanced.

SST-2

Batch Size    50                                      100
Label Flips   Random                                  Random
Noise%        0 (clean)  10  20  30  40  45  47  50   0  10  20  30  40  45  47  50
WoNM          –
TDwRegu01     83.29%  78.53%  74.01%  49.5%  86.88%  84.88%  85.08%  82.41%  76.09%  70.10%  58.98%
NMWoRegu      87.28%  86.2%  55.76%  52.24%
NMwRegu001    86.51%  85.26%
NMwRegu01     87.78%  86.04%  85.04%  82.7%  77.43%  66.96%  61.5%  49.08%  85.10%  81.9%  76.2%  65.47%  58.92%  52.46%

Trec

Batch Size    10
Label Flips   Uniform                                 Random
Noise%        0 (clean)  10  20  30  40  50  60  70   0  10  20  30  40  50  60  70
WoNM          –
TDwRegu01     –
NMWoRegu      –
NMwRegu001    34.87%
NMwRegu01     92.73%  90.8%  89.53%  88.67%  84.93%  79.67%  69.67%  52.4%  92.7%  90.33%  90.6%  86.47%  83.07%  70.93%  65.2%

Batch Size    50
Label Flips   Uniform                                 Random
Noise%        0  10  20  30  40  50  60  70           0  10  20  30  40  50  60  70
WoNM          –
TDwRegu01     –
NMWoRegu      –
NMwRegu001    –
NMwRegu01     92.53%  91.33%  90.27%  88.47%  83.87%  77.87%  68.73%  55.67%  92.53%  90.00%  90.2%  85.93%  82.6%  71.4%  67.33%  37.53%

AG-News

Batch Size    100
Label Flips   Uniform                                 Random
Noise%        0  10  20  30  40  50  60  70           0  10  20  30  40  50  60  70
WoNM          –
TDwRegu01     –
NMWoRegu      77.66%
NMwRegu001    92.62%
NMwRegu01     92.55%  92.23%  92.2%  91.98%  91.7%  91.23%  90.54%  89.78%  92.23%  91.96%  91.69%  91.13%  90.77%  62.04%

Batch Size    1024
Label Flips   Uniform                                 Random
Noise%        0  10  20  30  40  50  60  70           0  10  20  30  40  50  60  70
WoNM          –
TDwRegu01     –
NMWoRegu      –
NMwRegu001    –
NMwRegu01     92.66%  92.2%  92.29%  92.09%  91.7%  91.24%  90.72%  89.88%  92.57%  92.11%  91.99%  91.57%  91.2%  90.5%  77.93%  61.12%

DBpedia

Batch Size    512
Label Flips   Uniform                                 Random
Noise%        0 (clean)  30  50  70  75  80  85  90   0  30  50  70  75  80  85  90
WoNM          –
NMWoRegu      –
NMwRegu001    99.04%  98.94%  98.81%  98.61%  98.52%  98.33%  98.13%  97.53%  99.04%  98.48%  98.33%  89.00%
NMwRegu01     99.01%  98.88%  98.72%  16.27%

Batch Size    1024
Label Flips   Uniform                                 Random
Noise%        0 (clean)  30  50  70  75  80  85  90   0  30  50  70  75  80  85  90
WoNM          –
NMWoRegu      –
NMwRegu001    98.97%  98.9%  98.79%  98.53%  98.50%  98.32%  98.19%  97.27%  98.97%  98.49%  98.32%  83.79%
NMwRegu01     98.88%  98.72%  98.35%  15.94%

Table 2: Test accuracy for different text classification datasets under uniform and random label flips; "–" marks values that are not available.

Here, we describe the text classification datasets used to evaluate the performance of the proposed approach. The base model architecture is the same for all datasets. For each dataset, we tune the number of filter windows and filter lengths using the development set. Along with each description, we also provide the hyper-parameters we selected. Table 1 summarizes the basic statistics of the datasets.

  1. SST-2 (http://nlp.stanford.edu/sentiment/) Socher et al. (2011): The Stanford Sentiment Treebank dataset for predicting the sentiment of movie reviews. The classification task involves detecting positive or negative reviews. Using the base model with clean labels, we obtain the clean-data classification accuracy reported in Table 2. For this dataset, the base model consists of an input and embedding layer plus feature windows with 100 feature maps each, with dropout and batch size 50.

  2. TREC (http://cogcomp.cs.illinois.edu/Data/QA/QC/) Voorhees and Tice (1999): A question classification dataset consisting of fact-based questions divided into broad semantic categories. We use the six-class version of TREC. For this dataset, the base model consists of an input and embedding layer plus one feature window size with 100 feature maps, with dropout and batch size 10.

  3. AG-News (http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html) Zhang et al. (2015): A large-scale, four-class topic classification dataset containing approximately 110K training samples. For this dataset, the base model consists of an input layer and embedding layer plus feature windows with 200 feature maps, with dropout and batch size 100.

  4. DBpedia Zhang et al. (2015): A large-scale, 14-class topic classification dataset, balanced across categories (Table 1). For this dataset, the base model consists of an input layer and embedding layer plus feature windows with 400 feature maps each, with dropout and batch size 1024.

For all the datasets, we use Rectified Linear Units (ReLU) and fix the base model architecture. We use early stopping on the dev sets for all datasets. We run all experiments five times and report the average classification accuracy in Table 2. We train all the networks end-to-end via stochastic gradient descent over shuffled mini-batches with the Adadelta update rule Zeiler (2012), except for DBpedia, where we use SGD. In order to improve base model performance, we initialize the word embedding layer with the publicly available word2vec word vectors Mikolov et al. (2013) for all datasets except DBpedia, where we use GloVe embeddings Pennington et al. (2014).

5 Results and Discussion

We evaluate the performance of our model for each dataset in Table 2 in the presence of uniform and random label noise and compare against the base model (WoNM) as our baseline. For the other datasets, the proposed approach is significantly better than the baseline for both types of label noise, and we observe a clear gain over the baseline in the presence of extreme label noise. Interestingly, assuming an oracle that provides prior knowledge of the true noise distribution (TDwRegu01) does not necessarily improve classification performance, especially for multi-class classification problems. For binary classification on the SST-2 dataset, however, we did observe that the noise model initialized with the true noise distribution works better than all the other models.

5.1 Effect of different regularizers

NMwRegu01 performs better in all cases for both types of label noise. We plot the weight matrices learned by all the noise models in all the noise regimes; for brevity, we only show the weight matrices for the AG-News dataset in Fig. 1. We find that regularization diffuses the diagonal weight elements and learns smoother off-diagonal elements (Fig. 1(d)) that resemble the corresponding input label noise distribution. This also means that, without regularization, the noise model has less ability to diffuse the diagonal elements, which leads to poor classification performance. Therefore, we use the $\ell_2$ regularizer in (6) to diffuse the diagonal entries.

Figure 1: AG-News dataset: (a) input random label noise; (b–f) weight matrices learned by the different noise models.

In some cases, especially for low label noise, we find that regularization with a small penalty works better than a large penalty since, for low label noise, learning a less diffuse noise model is beneficial. The proposed approach also scales to a large number of label categories, as evident from the experiments on the DBpedia dataset in the last rows of Table 2.

5.2 Effect of different scaling factors on noise layer initialization

We initialize the noise model weights as identity matrices scaled by a gain equal to the number of classes for all experiments. We study the effect of different gain values on the overall performance of the proposed network in Fig. 2, plotting the classification performance for the DBpedia dataset with random noise. For each noise model in Fig. 2(a), we find that setting the gain to the number of classes works best and that any other gain results in poorer performance.

Figure 2: Effect of noise model initialization scaling on classification performance: (a) classification accuracy; (b) noise model norm.

In Fig. 2(b) we plot the Frobenius norm of the learned noise model weights for the different gain values. We find that, with high-gain initialization, the model learns noise weights with a large norm, resulting in poor classification performance. This finding supports the claim in Liao et al. (2018) that “higher capacity leads to high test errors.”

5.3 Effect of Batch size

We also study the effect of different batch sizes on performance, as described in Rolnick et al. (2017). For all datasets, we observe small performance gains for highly non-uniform noisy labels, i.e., for the random label flips shown in the second column of Fig. 3. However, for uniform label flips, we do not observe performance gains with increasing batch size.

Figure 3: Effect of batch size on label noise classification for different datasets: (a) Trec [Uniform]; (b) Trec [Random]; (c) AG-News [Uniform]; (d) AG-News [Random]; (e) DBpedia [Uniform]; (f) DBpedia [Random].

5.4 Instance Dependent label noise

We further investigate the performance of the proposed approach on instance-dependent label noise, flipping the labels of each class with a different noise percentage as shown in Fig. 4(a). For brevity, we present results on the AG-News dataset in Fig. 4. On this type of label noise, the performance of the proposed approach is far better than the baseline. The noise model learned by the proposed approach is shown in Fig. 4(b), and we show the column-normalized weight matrix in Fig. 4(c). We observe that the learned noise model captures the input label noise statistics and is highly correlated with the input noise distribution, as measured by the Pearson correlation coefficient.

Figure 4: AG-News dataset: (a) input instance-dependent label noise; (b) weight matrix learned by the proposed approach; (c) column normalization of (b).
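For intuition, the sketch below shows one way to build such a class-wise noise matrix and to quantify the agreement between the injected distribution and a column-normalized learned matrix via the Pearson correlation. The per-class rates and the stand-in learned matrix are illustrative only and are not the values used in Fig. 4.

```python
import numpy as np
from scipy.stats import pearsonr

def classwise_flip_matrix(flip_rates):
    """Column j keeps the true label with probability 1 - flip_rates[j] and
    spreads the remaining mass uniformly over the other classes."""
    K = len(flip_rates)
    theta = np.zeros((K, K))
    for j, p in enumerate(flip_rates):
        theta[:, j] = p / (K - 1)
        theta[j, j] = 1.0 - p
    return theta

# Illustrative per-class noise rates for a 4-class problem (not the rates in Fig. 4).
theta_true = classwise_flip_matrix([0.2, 0.5, 0.3, 0.6])

# theta_learned would be the trained noise-layer weights, column-normalized as in
# Fig. 4(c); a random stand-in keeps the snippet self-contained here.
theta_learned = np.abs(np.random.default_rng(1).normal(size=(4, 4)))
theta_learned /= theta_learned.sum(axis=0, keepdims=True)

corr, _ = pearsonr(theta_true.ravel(), theta_learned.ravel())
print(f"Pearson correlation between injected and learned noise: {corr:.2f}")
```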

5.5 Understanding Noise Model

                    TRB                          TRPr
Data (noise)        WoNM     Noisy    True       NMwRegu01   Noisy    True
SST2 (40%)          70.24    70.95    79.24      82.32       73.90    83.25
AG (70%)            59.70    52.44    79.18      90.33       86.27    89.4
AG (60%)            83.25    68.8     88.28      90.45       87.77    90.78
Trec (40%)          66.80    63.4     79.0       73.40       69.6     83.2
Trec (20%)          83.6     80.0     86.0       87.40       83.6     90.0

Table 3: SVM classification accuracy (%) on clean test data using the TRB and TRPr feature representations. "Noisy" and "True" denote SVMs trained with noisy and true labels, respectively; WoNM and NMwRegu01 give the corresponding network accuracies.
Figure 5: t-SNE visualization of the last-layer activations (before softmax) of the base network for the Trec dataset with corrupted labels at iterations 0, 5, 10, and 18: (a) proposed model; (b) no noise model stacked. The first row in (a) superimposes the true labels on the t-SNE points; the second row in (a) superimposes the noisy labels.

In order to further understand the noise model, we first train the base model and the proposed model on noisy labels. Afterward, we collect the last fully-connected layer's activations for all training samples and treat them as the learned feature representation of the input sentence. This gives two sets of feature representations, one from the base model (TRB) and the other from the proposed model (TRPr). Given these learned feature representations, the artificially injected noisy labels, and the true labels of the training data, we train two different SVMs for each model. For the base model, both SVMs use the TRB representations as inputs; the first SVM is trained with the true labels as targets and the second with the unreliable labels as targets. Similarly, we train two SVMs for the proposed model. After training, we evaluate all the learned SVMs on clean test data in Table 3, where the first column in each block reports the corresponding network's performance and the "Noisy" and "True" columns report the SVM performance when trained on noisy and clean labels, respectively. We run these experiments on different datasets with different levels of label noise.

The SVM trained on TRB with noisy labels performs very close to the base model (Table 3), which suggests that the base model is simply fitting the noisy labels. On the other hand, when we train an SVM on the TRPr representations with the true labels as targets, the SVM matches the proposed model's performance. This means that the proposed approach helps the base model learn better feature representations even with noisy targets, which suggests that the noise model is learning a label denoising operator.
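The probing protocol can be summarized with the scikit-learn sketch below; feats_base and feats_prop stand for the collected TRB and TRPr activations and are assumed to be available, and the SVM hyper-parameters are illustrative rather than those used for Table 3.

```python
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

def probe_with_svm(train_feats, train_labels, test_feats, test_labels):
    """Fit a linear SVM on frozen penultimate-layer features and report
    its accuracy on the clean test set."""
    svm = LinearSVC(C=1.0, max_iter=10000)
    svm.fit(train_feats, train_labels)
    return accuracy_score(test_labels, svm.predict(test_feats))

# feats_base (TRB) and feats_prop (TRPr) are the collected activations;
# y_noisy / y_true are the corrupted and original training labels.
# acc_trb_noisy  = probe_with_svm(feats_base, y_noisy, feats_test, y_test)
# acc_trb_true   = probe_with_svm(feats_base, y_true,  feats_test, y_test)
# acc_trpr_noisy = probe_with_svm(feats_prop, y_noisy, feats_test, y_test)
# acc_trpr_true  = probe_with_svm(feats_prop, y_true,  feats_test, y_test)
```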

We analyze the representation of the training samples in the feature domain by plotting t-SNE embeddings Van Der Maaten (2014) of the TRB and TRPr features. For brevity, we show the t-SNE visualizations for the Trec dataset with corrupted labels in Fig. 5.

For each network, we show two rows of t-SNE plots. For example, in Fig. 5(a) we plot two rows of t-SNE embeddings for the proposed model. In the first row of Fig. 5(a), each training sample is colored by its true label, while in the second row (the noisy-label plot) each training sample is colored by its noisy label. We observe that, as learning progresses, the noise model helps the base model cluster the training samples in the feature domain: with each iteration, we see clusters forming in Row 1. However, in Row 2, when the noisy labels are superimposed, the clusters are not well separated. This means that the noise model denoises the labels and presents the true labels to the base network to learn.

In Fig. 5(b), we plot two rows of t-SNE embeddings of the TRB representations. Here the network appears to learn the noisy labels directly. This provides further evidence for Zhang et al. (2016)'s finding that deep networks can memorize data without knowledge of the true labels. In Row 2 of Fig. 5(b), we observe that the network learns feature representations that cluster well according to the given noisy labels.
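A sketch of how such visualizations can be produced with scikit-learn's t-SNE, assuming the TRPr activations and both label sets are available; the plotting details are illustrative.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, labels, title):
    """Project penultimate-layer activations to 2-D with t-SNE and color
    each training sample by the supplied label set."""
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=4, cmap="tab10")
    plt.title(title)
    plt.show()

# plot_tsne(feats_prop, y_true,  "TRPr, true labels")   # clusters form (Fig. 5(a), row 1)
# plot_tsne(feats_prop, y_noisy, "TRPr, noisy labels")  # clusters look mixed (Fig. 5(a), row 2)
```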

6 Conclusion

In this work, we propose a framework to enable a DNN to learn better sentence representations in the presence of label noise for text classification tasks. To model the label noise, we append a non-linear noise model on top of the base CNN architecture. With proper initialization and regularization, the noise model is able to absorb most of the label noise and helps the base model to learn better sentence representations.

References