1 Introduction
State-of-the-art deep neural networks require large amounts of annotated training data. Though the success of large pre-training models
(devlin2018bert) alleviates such requirements, high-quality labeled data are still crucial to obtain the best performance on downstream tasks. However, it is extremely expensive to acquire large-scale annotated data for every new task. To overcome the challenge of rigorous data requirements, recent works utilize weak labels for supervision, including heuristic rules (augenstein2016stance; bach2019snorkel; awasthi2020learning), large-scale datasets with cheap but noisy labels (li2017webvision; lee2018cleannet), and self-training (park2020improved; wang2020adaptive). In self-training, one trains a model on a labeled dataset and then predicts on a large amount of unlabeled data. The clean labeled data and pseudo-labeled data are then combined to further train the model.
The aforementioned sources of supervision share two important characteristics: first, they learn from noisy labels; second, the noise is dependent on data features. Thus, they can be unified into the framework of noisy label learning, for which numerous approaches have been proposed to reduce the negative impact of noise. However, there are three problems in prior work on noisy label learning. First, many existing approaches are based on heuristics such as sample losses, which are not flexible enough (han2018co; yu2019does; zhou2020robust; song2019selfie). Second, many previous works require prior knowledge of the noise distribution of the dataset to adjust hyperparameters, which is often not available in real-world applications (song2019selfie; song2020learning). Third, meta-learning-based methods avoid the previous problems but suffer from optimization difficulties (ren2018learning; zheng2021meta) such as longer training time, heavy hyperparameter tuning, and an unstable convergence process. To address the above problems, we propose a simple data-driven approach that does not rely on meta learning while remaining flexible. Our contributions are summarized as follows:
- We propose a simple yet effective de-noising approach which avoids the optimization difficulty of meta learning while enjoying the flexibility of being data-driven.
- We unify the settings of both self-training and label corruption into a noisy label learning framework and demonstrate the effectiveness of our approach under both settings.
- Our approach improves performance over state-of-the-art baselines on a wide range of datasets, including text classification and automatic speech recognition (ASR). Last but not least, our approach achieves even larger gains on few-shot learning.
2 Related Work
2.1 Self-training
Self-training is a powerful learning method that enables models to learn from huge amounts of unlabeled data by generating weak labels through either teacher model predictions or heuristic rules. Self-training has been shown to be effective in many scenarios, including image classification (yalniz2019billion), text classification (li2019learning), machine translation (wu2019exploiting), etc. However, noise contained in weak labels can largely hinder the performance of self-training.
Recently, xie2020self improved the performance of self-training on image classification by injecting noise into the student model, an approach called NoisyStudent. park2020improved customized NoisyStudent for automatic speech recognition. One problem related to self-training is error propagation (zou2019confidence); in other words, pseudo labeling on unlabeled data might bring noise into the training set, which degrades further training. Most previous work simply sets a fixed threshold to filter samples with low confidence (sohn2020fixmatch; xie2020self). wang2020adaptive used meta learning for adaptive sample re-weighting to mitigate error propagation from noisy pseudo-labels. zhang2021flexmatch used a curriculum learning approach to re-weight unlabeled data according to the model’s learning status. In our work, we alleviate error propagation from another perspective, by learning a selection model with model-based features on a clean dataset.
2.2 Noisy Label Learning
Learning from noisy labels has long been a research area. One of the most classical works is to add a noise adaptation layer on top of the main model to learn a label transition matrix for label correction (goldberger2017deep). Bootstrapping (reed2014training) introduces the notion of perceptual consistency that a model predicts correct labels for noisy samples before overfitting to noisy labels. Co-Teaching (han2018co) and Co-Teaching+ (yu2019does) train two networks while each network selects its small-loss samples as clean samples for its peer network. However, the aforementioned approaches only deal with class dependent noise (CDN) and make a strong assumption that noise distribution is independent of each instance, which is not flexible enough for many cases.
SEAL (chen2020beyond) goes beyond previous work to consider instance dependent noise (IDN), which is more realistic and common than CDN on real world datasets. SELFIE (song2019selfie) uses a hybrid approach that selects refurbishable samples based on the entropy of model predictions and then refurbishes the labels with model predictions. RoCL (zhou2020robust) utilizes curriculum learning that starts with easy and clean samples and gradually moves to data with pseudo labels produced by a time-ensemble. However, both SELFIE and RoCL require prior knowledge of the noise distribution of the dataset and manual adjustment for hyperparameters. To avoid such efforts, meta learning is introduced to learn selection and refurbishment.
Learning to Re-weight (ren2018learning) is a meta learning algorithm that learns to assign weights to training examples based on their gradient directions. Meta-weight-net (shu2019meta)
parameterizes the reweighting function as a multi-layer perceptron network. Meta Label Correction
(zheng2021meta) trains the target model with corrected labels generated by a label correction model, which is trained on clean validation data by jointly solving a bi-level optimization problem. These meta learning algorithms afford a large degree of flexibility by directly optimizing a reliable objective. However, meta-learning-based models are known to be sensitive to hyperparameter tuning and the quality of support data (agarwal2021sensitivity), and suffer from optimization difficulties as they are trained by propagating second-order gradients (hospedales2020meta). The major differences between our approach and previous methods are as follows. Compared with meta-learning-based models, our approach does not suffer from optimization difficulties. Compared with models built on CDN assumptions, our approach can handle the IDN setting. Compared with other state-of-the-art IDN methods such as SELFIE and RoCL, the selection strategy in our approach is learnable with model-based features and does not require prior knowledge of the noise distribution. Last but not least, we unify the settings of self-training and label corruption in the framework of noisy label learning and conduct extensive experiments on both settings.

3 Method
Now we present our approach, SENT (Selection Enhanced Noisy label Training), that learns to select a subset from a noisy dataset and only uses the selected subset for training to reduce label noise. The core idea is to transfer the noise distribution so that both clean and noisy labels are available on a data subset. A selection model is trained to distinguish clean labels from noisy ones and then applied to selecting a clean subset from a noisy dataset.
Formally, given a noisy training dataset $\tilde{D} = \{(x_i, \tilde{y}_i)\}_{i=1}^{N}$ that is corrupted following some unknown noise distribution $P_{noise}$, our approach is to learn a selection strategy $g$ to select a clean subset $\hat{D} \subset \tilde{D}$ for training. Here $x_i$ is a training sample, $\tilde{y}_i$ is the noisy label, $y_i$ is the corresponding unknown true label, and $N$ is the size of the training set. Let $M$ be our main model, which is trained on the noisy dataset and performs certain tasks such as text classification, speech recognition, and others. We also have a selection model $S$ trained on a small clean development dataset $D_{dev}$. The task of $S$ is to learn the selection strategy $g$. There are two main stages in our approach: noise transfer and selection learning.
3.1 Noise Transfer
Now we describe the noise transfer stage. Given the corrupted training set $\tilde{D}$, we learn the unknown noise distribution $P_{noise}$ and transfer the noise to the clean development set $D_{dev}$. We train the model $M$ on $\tilde{D}$ till full convergence, so that it predicts a (noisy) label given a text input. $M$ is now assumed to have learned to capture the unknown noise distribution by its parameters. We then use $M$ to obtain noisy labels on $D_{dev}$ by making predictions. Here we argue that the noise on $D_{dev}$ approximately follows the noise distribution $P_{noise}$. Formally, the development set now becomes $\tilde{D}_{dev} = \{(x_i, y_i, \tilde{y}_i)\}$, where $(x_i, y_i)$ is the original clean development sample and $\tilde{y}_i$ is the noisy label predicted by $M$.
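As a concrete illustration, a minimal sketch of this stage is shown below; the `fit`/`predict` interface and the data layout are illustrative assumptions, not the authors' actual code.

```python
# Sketch of the noise-transfer stage (Section 3.1), under assumed interfaces.
# train_x, train_y_noisy: the corrupted training set; dev_x, dev_y_clean: the clean dev set.

def transfer_noise(main_model, train_x, train_y_noisy, dev_x, dev_y_clean):
    # 1. Fit the main model M on the noisy training set until convergence,
    #    so that M implicitly captures the unknown noise distribution.
    main_model.fit(train_x, train_y_noisy)

    # 2. Predict on the clean dev set; the predictions serve as transferred noisy
    #    labels, assumed to follow roughly the same noise distribution.
    dev_y_noisy = main_model.predict(dev_x)

    # 3. The dev set now carries both clean and noisy labels, which provides
    #    supervised targets for the selection model of Section 3.2.
    return list(zip(dev_x, dev_y_clean, dev_y_noisy))
```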
3.2 Selection Learning
We model the selection learning stage as a binary classification task. On $\tilde{D}_{dev}$, a selection model $S$ is trained to classify whether a label is clean or noisy. In our approach, $S$ is constructed as a multi-layer perceptron with one hidden layer, which takes a pre-calculated 5-dimensional feature vector as input and outputs a binary classification probability. Given a sample $(x_i, y_i, \tilde{y}_i)$ from the development set $\tilde{D}_{dev}$, the selection training sample is defined as $(f_i, s_i)$, where $f_i$ is a 5-dimensional feature vector for the $i$-th sample. We discuss how to compute the 5-dimensional feature in later sections. Meanwhile, $s_i$ is the corresponding true selection result, defined as $s_i = \mathbb{1}[\tilde{y}_i = y_i]$. In other words, if the $i$-th sample has a clean label, $s_i = 1$, and otherwise zero. Let $p_i = S(f_i)$ denote the probability given by the selection model. The loss function for each sample can be written as:
$\mathcal{L}_i = -\big[\, s_i \log p_i + (1 - s_i) \log(1 - p_i) \,\big]$   (1)
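A possible PyTorch sketch of the selection model and its training loop is given below; the hidden size, optimizer, learning rate, and number of epochs are assumptions for illustration, while the one-hidden-layer MLP over a 5-dimensional input and the binary cross-entropy objective follow the description above.

```python
import torch
import torch.nn as nn

class SelectionModel(nn.Module):
    """MLP with one hidden layer: 5-d feature vector -> probability that the label is clean."""
    def __init__(self, in_dim: int = 5, hidden_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.net(features)).squeeze(-1)

def train_selection_model(features: torch.Tensor, targets: torch.Tensor,
                          epochs: int = 200, lr: float = 1e-3) -> SelectionModel:
    """features: (n, 5) tensor; targets: (n,) tensor with 1 = clean, 0 = noisy."""
    model = SelectionModel()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    bce = nn.BCELoss()  # batched form of Eq. (1)
    for _ in range(epochs):
        opt.zero_grad()
        loss = bce(model(features), targets.float())
        loss.backward()
        opt.step()
    return model
```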
3.3 Model-Based Features
Now we discuss how to compute the 5-dimensional feature vector $f_i$ for each selection sample.
The first feature is called the instant loss. Given a sample $(x_i, \tilde{y}_i)$, let $p_M(\tilde{y}_i \mid x_i)$ be the predicted probability of the (noisy) label from the main model $M$. The instant loss (IL) for the sample is defined as:
$\text{IL}_i = -\log p_M(\tilde{y}_i \mid x_i)$   (2)
Intuitively, a larger instant loss indicates a possibly noisier sample, because the model has low confidence in predicting the label. However, as training proceeds, the model will overfit some of the noisy labels. As a result, the instant loss will decrease for noisy samples as well. To address this issue, following zhou2020robust, we additionally use an exponential moving average (EMA) loss to better differentiate noisy and clean samples. In the $t$-th training epoch, the EMA loss (EMAL) for the $i$-th sample is defined as follows:
$\text{EMAL}_i^{(t)} = \beta \cdot \text{EMAL}_i^{(t-1)} + (1 - \beta) \cdot \text{IL}_i^{(t)}$   (3)
where $\beta$ is a discounting factor. Intuitively, a larger EMAL represents a possibly noisier sample, as the model has lower confidence over the training history.
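For concreteness, the EMA update in Eq. (3), as reconstructed above, can be maintained with a couple of lines; the default value of the discounting factor is an assumption.

```python
def update_ema_loss(prev_ema_loss: float, instant_loss: float, beta: float = 0.9) -> float:
    """One EMA step over per-sample losses (Eq. 3); beta is the discounting factor."""
    return beta * prev_ema_loss + (1.0 - beta) * instant_loss
```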
song2019selfie has shown that the entropy of model predictions is a strong indicator for differentiating noisy and clean samples, as noisy samples tend to have larger entropy. Here we adopt two entropy signals as additional features, Instant Entropy (IE) and History Entropy (HE), whose calculation follows song2019selfie. The IE feature at the $t$-th epoch is the normalized entropy of the model's current predicted distribution:
$\text{IE}_i^{(t)} = -\frac{1}{\delta} \sum_{y} p_M^{(t)}(y \mid x_i) \log p_M^{(t)}(y \mid x_i)$   (4)
where $\delta = \log C$ is a normalizing factor and $C$ is the number of labels. For HE, let $\hat{y}_i^{(t)}$ be the predicted label of the $i$-th sample at epoch $t$, and let $\hat{P}_t(y \mid x_i)$ be the empirical distribution that equals the ratio of epochs among the first $t$ in which label $y$ is predicted. The HE feature at the $t$-th epoch is computed as
$\text{HE}_i^{(t)} = -\frac{1}{\delta} \sum_{y} \hat{P}_t(y \mid x_i) \log \hat{P}_t(y \mid x_i)$   (5)
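The two entropy features, following the reconstruction of Eqs. (4) and (5) above, could be computed as in the following sketch; this is an illustrative NumPy implementation, not the authors' code.

```python
import numpy as np

def instant_entropy(probs: np.ndarray) -> float:
    """Normalized entropy of the model's current predicted distribution (IE, Eq. 4)."""
    ent = -np.sum(probs * np.log(probs + 1e-12))
    return float(ent / np.log(len(probs)))

def history_entropy(predicted_labels: list, num_labels: int) -> float:
    """Normalized entropy of the empirical distribution of predicted labels
    over the first t epochs (HE, Eq. 5)."""
    counts = np.bincount(predicted_labels, minlength=num_labels)
    p_hat = counts / counts.sum()
    nonzero = p_hat[p_hat > 0]
    ent = -np.sum(nonzero * np.log(nonzero))
    return float(ent / np.log(num_labels))
```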
We explore another informative feature inspired by han2019deep, who discovered that the distances between low-level and high-level features in a convolutional model are larger on noisy samples than on clean samples. We find this also holds for transformer models. Thus, we adopt the cosine similarity between the hidden states of the first layer and the last layer as another feature. Formally, FLS (First Last Similarity) is defined as:
$\text{FLS}_i = \frac{1}{L_i} \sum_{k=1}^{L_i} \cos\big(h_k^{\text{first}},\, h_k^{\text{last}}\big)$   (6)
where $h_k^{\text{first}}$ and $h_k^{\text{last}}$ represent the hidden states of the $k$-th token in the first and last layers respectively, $\cos(\cdot,\cdot)$ refers to cosine similarity, and $L_i$ is the number of tokens. We normalize the FLS feature into the range of $[0, 1]$.
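One way to obtain FLS from a BERT-style encoder is sketched below, using the Hugging Face `output_hidden_states` option; averaging over tokens and mapping the score linearly into $[0, 1]$ are our assumptions about the normalization, not a confirmed detail of the paper.

```python
import torch
from transformers import AutoModel, AutoTokenizer

def first_last_similarity(encoder, tokenizer, text: str) -> float:
    """Mean cosine similarity between first- and last-layer hidden states (FLS, Eq. 6)."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = encoder(**inputs, output_hidden_states=True)
    hidden = outputs.hidden_states                 # (embeddings, layer 1, ..., layer L)
    first, last = hidden[1][0], hidden[-1][0]      # (seq_len, dim) each
    cos = torch.nn.functional.cosine_similarity(first, last, dim=-1)
    return float((cos.mean() + 1.0) / 2.0)         # map [-1, 1] -> [0, 1]

# Illustrative usage:
# tok = AutoTokenizer.from_pretrained("bert-base-uncased")
# enc = AutoModel.from_pretrained("bert-base-uncased")
# fls = first_last_similarity(enc, tok, "an example sentence")
```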
Finally, we concatenate all the above features to form the input to the selection model as follows,
$f_i = \big[\, \text{IL}_i \,;\, \text{EMAL}_i \,;\, \text{IE}_i \,;\, \text{HE}_i \,;\, \text{FLS}_i \,\big]$   (7)
where $[\,\cdot\,;\,\cdot\,]$ denotes concatenation.
3.4 Overall Training Procedure
Now we show the training procedure with pseudo code in Algorithm 1 and illustrations in Figure 1. In the first box of Figure 1, we clarify our data setting, where a clean set $D_{dev}$ and a corrupted set $\tilde{D}$ are the input to our method. In the second box, we learn the noise distribution of the training set by fitting model $M$ on $\tilde{D}$ and then transfer the noise to $D_{dev}$. After this step, we have both noisy and clean labels on $D_{dev}$. This box corresponds to lines 2 to 4 in Algorithm 1.
Next, we move on to the repeated training stage for model $M$ and model $S$, which is shown in boxes 3 and 4 of the figure. First, as illustrated in box 3, we use model $M$ to infer the aforementioned signals on $\tilde{D}$ and $\tilde{D}_{dev}$. We then use the signals on $\tilde{D}_{dev}$ as input to train the selection model $S$. Since $\tilde{D}_{dev}$ has both clean and noisy labels, we can easily construct the training targets for $S$ as described in Section 3.2. Once selection learning is done, given the inferred signals on $\tilde{D}$, we use $S$ to predict and select clean samples from $\tilde{D}$ to obtain $\hat{D}$. We then train $M$ on the clean subset $\hat{D}$ and repeat the above steps. $\hat{D}$ is not guaranteed to be entirely clean but is expected to be cleaner than $\tilde{D}$. This repeated training phase corresponds to lines 6 to 13 in Algorithm 1.
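The loop can be summarized by the following sketch (Algorithm 1 itself is not reproduced here); all model interfaces (`fit`, `predict`, `predict_proba`) and the helpers `train_selector` and `compute_features` are placeholders we assume for illustration, not the actual implementation.

```python
def sent_training(main_model, train_selector, compute_features,
                  noisy_train, clean_dev, num_rounds: int = 3, threshold: float = 0.5):
    """Sketch of the SENT training loop (Section 3.4).

    noisy_train: list of (x, y_noisy); clean_dev: list of (x, y_clean).
    compute_features(model, x, y) is assumed to return the 5-d vector of Section 3.3;
    train_selector(features, targets) returns a selector with predict_proba(feature).
    """
    # Noise transfer: fit M on the noisy data, then label the clean dev set with it.
    main_model.fit([x for x, _ in noisy_train], [y for _, y in noisy_train])
    dev = [(x, y, main_model.predict([x])[0]) for x, y in clean_dev]

    for _ in range(num_rounds):
        # Selection learning on the dev set, where clean and noisy labels coexist.
        feats = [compute_features(main_model, x, y_noisy) for x, _, y_noisy in dev]
        targets = [int(y_noisy == y_clean) for _, y_clean, y_noisy in dev]
        selector = train_selector(feats, targets)

        # Select a (hopefully cleaner) subset of the noisy training set.
        subset = [(x, y) for x, y in noisy_train
                  if selector.predict_proba(compute_features(main_model, x, y)) > threshold]

        # Re-train the main model on the selected subset and repeat.
        main_model.fit([x for x, _ in subset], [y for _, y in subset])
    return main_model
```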
3.5 Adaptation to Self-Training
Our above framework can be directly applied to learning scenarios with noisy labels. In this section, we further discuss how to adapt this method to self-training, a classic semi-supervised learning paradigm. In general, self-training trains a teacher model on labeled data, uses it to infer pseudo labels on unlabeled data, and adds them back to the original training set. A student model is then trained on the combined data. After that, the student becomes the new teacher model and the above process is repeated. The process of pseudo labeling inevitably introduces noise, since it cannot ensure 100% accuracy on the predicted labels; it therefore fits naturally into our proposed framework. Specifically, the noise distribution in self-training is known because the source of noise is the teacher model. Thus, it is natural to directly leverage the teacher model to infer noisy labels on $D_{dev}$.
The whole pipeline of adapting our framework to classic self-training is displayed in Figure 2, and the corresponding algorithm is shown in Algorithm 2 in the Appendix. After training the teacher model on the labeled data, we use the teacher model to infer labels on the unlabeled data and the development set $D_{dev}$. This is followed by training the selection model $S$ based on the signals of $D_{dev}$. We then utilize the selection model to predict on the unlabeled data and judge whether to choose each pseudo-labeled sample. Finally, we train the student model on the combination of the original labeled data and the selected samples from the unlabeled data. The above procedure is repeated as in classical self-training.
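A corresponding sketch for the self-training variant is shown below; as before, the model and helper interfaces are assumptions for illustration rather than the actual implementation of Algorithm 2.

```python
def self_training_with_selection(teacher, new_student, train_selector, compute_features,
                                 labeled, unlabeled, dev,
                                 rounds: int = 3, threshold: float = 0.5):
    """Sketch of adapting SENT to self-training (Section 3.5).

    labeled / dev: lists of (x, y); unlabeled: list of x; new_student() builds a fresh model.
    """
    teacher.fit([x for x, _ in labeled], [y for _, y in labeled])
    for _ in range(rounds):
        # The teacher is the (known) noise source: pseudo-label unlabeled data and the dev set.
        pseudo = [(x, teacher.predict([x])[0]) for x in unlabeled]
        dev_noisy = [(x, y, teacher.predict([x])[0]) for x, y in dev]

        # Train the selection model on the dev set, where clean and noisy labels are both known.
        feats = [compute_features(teacher, x, yn) for x, _, yn in dev_noisy]
        targets = [int(yn == y) for _, y, yn in dev_noisy]
        selector = train_selector(feats, targets)

        # Keep only the pseudo-labeled samples that the selector judges to be clean.
        selected = [(x, y) for x, y in pseudo
                    if selector.predict_proba(compute_features(teacher, x, y)) > threshold]

        # Train the student on labeled + selected data; the student becomes the new teacher.
        student = new_student()
        data = labeled + selected
        student.fit([x for x, _ in data], [y for _, y in data])
        teacher = student
    return teacher
```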
4 Experiments
Table 1: Statistics of the text classification datasets.
Split | IMDB | SMS | SMS* | TREC | TREC* | YOUTUBE | AGNEWS
Train | 250 | 69 | 61 | 68 | 60 | 77 | 60 |
Test | 25000 | 500 | 392 | 500 | 500 | 392 | 7600 |
Unlabeled | 24500 | 4502 | 4948 | 4965 | 5409 | 1409 | 11876 |
Dev_eval | 126 | 250 | 31 | 251 | 30 | 40 | 32 |
Dev_select | 124 | 250 | 31 | 249 | 32 | 38 | 32 |
Table 2: Results on text classification under the self-training setting (micro F1 for SMS/SMS*, accuracy for the others).
Method | TREC | TREC* | SMS | SMS* | AGNEWS | IMDB | YOUTUBE
Supervised | 83.4 | 84.2 | 97.8 | 97.5 | 78.1 | 85.1 | 92.9 |
Co-teaching+ | 81.8 | 79.0 | 98.2 | 98.4 | 82.8 | 86.8 | 92.1 |
L2R | 80.8 | 86.0 | 98.6 | 97.2 | 84.2 | 84.7 | 92.9 |
SELF | 81.0 | 81.2 | 98.4 | 98.6 | 82.4 | 83.5 | 92.9 |
Self-train | 84.0 | 84.8 | 97.9 | 98.4 | 82.4 | 87.0 | 93.6 |
Self-train (thres) | 84.2 | 87.8 | 97.9 | 98.4 | 83.3 | 85.6 | 93.1 |
Noisy Student | 85.0 | 88.6 | 98.4 | 99.0 | 85.0 | 87.7 | 93.9 |
Ours | 85.0 | 86.6 | 99.0 | 99.2 | 83.4 | 88.3 | 94.4 |
Ours + noisy | 85.0 | 89.2 | 99.0 | 99.2 | 86.0 | 89.0 | 95.2 |
4.1 Experimental Setup
4.1.1 Overview
To verify the effectiveness of SENT, we conduct extensive experiments on text classification and automatic speech recognition (ASR) benchmarks.
We use text classification tasks to evaluate the self-training setting. We perform finetuning based on the BERT (devlin2018bert)
model. We use ASR tasks to evaluate the label corruption setting. Our approach and other baselines are built on top of an encoder-decoder transformer network. Details of model configuration can be found in Appendix
A.2.
4.1.2 Datasets
For text classification in the self-training setting, we evaluate our framework on the following five benchmark datasets: question classification on TREC-6 (li2002learning), spam classification of SMS messages (almeida2011contributions), spam classification of YouTube comments (alberto2015tubespam), AG's news topic classification (zhang2015character), and sentiment classification on IMDB movie reviews (maas-etal-2011-learning). Our data splits follow previous work (karamanolakis-etal-2021-self). Related details are shown in Table 1. For the SMS and TREC datasets, we consider two separate versions: the datasets marked with * have smaller development sets, while the ones without have larger development sets. Smaller development sets are more challenging for noisy label learning because selection learning has to be performed on a smaller clean set. We use these two versions to test the robustness of our approach.
For ASR in the label corruption setting, we use AISHELL-1 (bu2017aishell) as the benchmark. Following prior work, we model IDN using DNNs' prediction error (du2015modelling; menon2018learning). Specifically, we train three small transformer models to corrupt the training set to different corruption levels: hard, medium, and easy; the higher the error rate, the harder the corrupted dataset. In the following SENT experiments, the prior noise information (i.e., how the training set is corrupted) is assumed to be unknown. Related statistics are shown in Table 8 in the Appendix.
4.1.3 Evaluation
For text classification, we report micro F1 for SMS and accuracy for the rest of the datasets. For ASR, we follow bu2017aishell to use the character error rate (CER) for evaluation.
Since our approach relies on an additional clean set $D_{dev}$, we split the normal development set into two halves: one half (dev_select in Table 1) is used as $D_{dev}$ for selection learning, and the other half (dev_eval) is used for standard model tuning, so as to set up a fair comparison with the baselines.
4.1.4 Baselines
For text classification, we compare with the following baselines: (a) “Supervised” refers to supervised learning using only labeled data; (b) “self-train” is standard self-training that utilizes both labeled and unlabeled data for iterative training; (c) “self-train (thres)” is self-training that uses the development set to select a confidence threshold for filtering pseudo-labeled data; (d) “noisy student” (xie2020self) adds dropout noise to the student model in self-training; (e) “co-teaching+” (yu2019does) uses two neural networks that select small-loss samples for each other and applies a disagreement strategy; (f) “L2R” (ren2018learning) learns to re-weight noisy labels via meta-learning; (g) “SELF” (nguyen2019self) utilizes self-ensemble predictions to progressively remove noisy labels. We also evaluate the performance when combining noisy student and our method, denoted as “ours + noisy”. In the ASR experiments, we include the following baselines: Vanilla (a naive encoder-decoder transformer network), Co-Teaching+ (yu2019does), L2R (ren2018learning), RoCL (zhou2020robust), and SELFIE (song2019selfie). The details of these baselines can be found in Appendix A.1.1.
4.2 Experimental Results
Table 3: Character error rate (CER, lower is better) on AISHELL-1 under three corruption levels.
Model | Hard | Medium | Easy
Vanilla | 32.63 | 21.82 | 16.10 |
Co-Teaching+ | 32.08 | 21.67 | 15.11 |
L2R | 30.43 | 20.07 | 15.15 |
RoCL | 27.16 | 17.87 | 14.95 |
SELFIE | 27.31 | 19.10 | 13.98 |
Ours | 26.91 | 17.67 | 13.47 |
Data | Method | Pse-Acc. | Sel-Pre. | Sel-Rec. | #Selected |
IMDB | Self-train | 85.7 | 85.7 | 100.0 | 24500 |
Self-train (thres) | 85.2 | 94.3 | 58.1 | 12870 | |
Ours | 88.1 | 97.9 | 47.5 | 7742 | |
YOUTUBE | Self-train | 95.0 | 95.0 | 100.0 | 1409 |
Self-train (thres) | 95.7 | 96.1 | 98.7 | 1386 | |
Ours | 94.6 | 96.9 | 94.1 | 1294 | |
SMS* | Self-train | 98.1 | 98.1 | 100.0 | 4948 |
Self-train (thres) | 96.0 | 96.0 | 98.6 | 4880 | |
Ours | 98.3 | 98.3 | 100.0 | 4897 | |
TREC* | Self-train | 77.4 | 77.4 | 100.0 | 4965 |
Self-train (thres) | 77.2 | 77.2 | 99.8 | 4944 | |
Ours | 78.3 | 81.2 | 67.1 | 4897 | |
AGNEWS | Self-train | 84.4 | 84.4 | 100.0 | 11876 |
Self-train (thres) | 83.4 | 83.6 | 99.7 | 11826 | |
Ours | 84.3 | 88.0 | 85.6 | 9728 |
Table 4: The key metrics of pseudo labeling and sample selection during self-training. We report the accuracy of pseudo labeling, the precision and recall of sample selection, and the number of selected samples.
4.2.1 Results in Text Classification.
As shown in Table 2, the self-training baseline improves text classification performance. In comparison, our selection approach consistently lifts performance further, which shows that the selection model has learned an informative selection strategy. Finally, although using SENT alone outperforms self-train and noisy student, our approach can also be combined with the noisy student approach to achieve even better performance. Overall, this combined approach achieves the best performance across all the datasets we consider.
4.2.2 Results in ASR
Table 3 shows that Co-Teaching+ is not able to handle our setting, as it only achieves results similar to the vanilla model. L2R is effective in our problem setting with improved performance, but this meta-learning-based method still underperforms RoCL and SELFIE. Compared with the above baselines, our approach consistently excels at all error levels, which demonstrates its effectiveness.
4.3 Empirical Analysis
Case Study: Key Metrics During Self-Training
In order to gain a deeper understanding of how our method improves over traditional self-training, we investigate some key metrics regarding pseudo labeling and selection performance. We consider self-train, self-train (thres), and our model. Specifically, we report the accuracy of pseudo labeling on unlabeled data (Pse-Acc.), the precision of sample selection (Sel-Pre.), the recall of sample selection (Sel-Rec.), and the number of selected samples in the best round (i.e., the training round that achieves the best performance in the repeated self-training process). As can be seen in Table 4, our approach achieves better pseudo labeling accuracy. This is because our approach obtains a more balanced tradeoff between selection precision and selection recall compared to self-training. Because of a more rigorous selection model, our approach tends to select only samples with a higher probability of having clean labels, i.e., increasing the selection precision. This is also reflected in a small decrease in the number of selected samples. We believe this is crucial for mitigating error propagation (xie2020self) and thus for better performance.
Ablation Study: Substituting with Simpler Models
To further test the robustness of our approach, we replace the BERT base model with simple multi-layer perceptrons (MLPs) without pretraining. As shown in Table 5, performance decreases with the simpler MLPs. However, our approach remains effective compared to the other baselines, and both the relative gain and the absolute improvement from “Supervised” to our approach are still significant.
Table 5: Text classification results when the BERT base model is replaced with a simple MLP.
Method | TREC | SMS | YOUTUBE
Supervised | 66.5 | 93.3 | 91.0 |
Co-teaching+ | 66.8 | 98.3 | 93.3 |
L2R | 66.4 | 98.0 | 93.3 |
SELF | 66.3 | 98.1 | 92.3 |
Self-train | 71.1 | 95.1 | 92.5 |
Self-train (thres) | 70.1 | 98.1 | 92.1 |
Noisy Student | 68.9 | 98.1 | 92.1 |
Ours | 70.3 | 98.2 | 93.3 |
Ours+Noisy | 70.0 | 98.4 | 93.4 |
Table 6: Feature ablation on AISHELL-1 (CER).
Signal | Hard | Medium | Easy
All | 26.91 | 17.67 | 13.47 |
-EMAL | 27.31 | 17.95 | 13.85 |
-IL | 28.56 | 19.32 | 14.94 |
-HE | 29.32 | 19.88 | 15.01 |
Table 7: Feature ablation on the text classification datasets.
Signal | IMDB | YT | TREC | TREC* | SMS | SMS* | AG
All | 85.9 | 95.2 | 85.0 | 89.2 | 99.0 | 99.2 | 85.1 |
-FLS | 88.2 | 92.9 | 83.6 | 82.0 | 99.0 | 99.0 | 84.1 |
-EMAL | 89.0 | 92.1 | 84.2 | 86.8 | 99.0 | 99.0 | 86.0 |
-IL | 87.8 | 92.6 | 83.4 | 85.8 | 99.0 | 98.8 | 82.5 |
-HE | 88.0 | 93.4 | 82.6 | 79.8 | 98.4 | 98.0 | 79.6 |
Ablation Study: Features
We study the effects of the model-based features introduced in Section 3.3 with an ablation. Note that we do not use FLS for ASR because the first layer and the last layer have different sequence lengths. As shown in Table 6 for ASR, all of the features contribute to the final performance. We conduct similar experiments on the text classification tasks, reported in Table 7, where basic feature engineering also improves the final performance. Among all signals, IL leads to the greatest relative gain in performance, but across all the datasets, each signal has its own role and contributes to our final performance.
Performance Under Few-shot Setting
LST (li2019learning) has shown that the self-training paradigm can be customized for few-shot classification. Here, we also investigate the effectiveness of our method when applied to the few-shot setting. Specifically, we evaluate on selected text classification datasets (i.e., YOUTUBE, SMS and IMDB). Figure 5 shows that self-train generally performs better than “Supervised” (using only labeled data), while our model achieves the best performance in most cases, indicating the robustness of our method. More results are displayed in Appendix A.3.2.
5 Conclusions
In this paper, we propose SENT to address the problem of label noise. Compared with meta-learning-based models, our selection model is trained with full supervision using a cross-entropy loss, which facilitates convergence. Meanwhile, we model IDN without prior knowledge of the noise distribution. We also unify the settings of self-training and label corruption in the framework of noisy label learning and conduct extensive experiments on both settings.
6 Limitations
Although our framework has proven effective under the settings of self-training and label corruption on text classification and speech recognition tasks, adapting our approach to more sequence-level tasks such as named entity recognition (NER) and machine translation would also be interesting. In addition, the selection model in our framework is feature-based. These features are informative but might be limited in terms of expressivity, which leaves room for further improvement by learning more data-driven features under our framework.
References
Appendix A Appendix
We provide details of our implementation (Section A.1), experiments (Section A.2), and ablation studies (Section A.3).
Table 8: Statistics of the AISHELL-1 training sets at different corruption levels, and of the original development and test sets.
Stats | OriginalTrain | HardTrain | MediumTrain | EasyTrain | OriginalDev | OriginalTest
CER | - | 37.11 | 26.19 | 16.14 | - | -
MaxLen | 44 | 82 | 89 | 44 | 35 | 37 |
AvgLen | 14.41 | 14.30 | 14.35 | 14.41 | 14.33 | 14.60 |
Num | 120098 | 120098 | 120098 | 120098 | 14326 | 7176 |
A.1 Details of Implementation
A.1.1 Baselines
In the first experiment on text classification, we also include the rule-based baselines and report the same results as used in ASTRA (karamanolakis-etal-2021-self). (a) Majority predicts the majority vote of the rules, with ties resolved by predicting a random class. (b) Snorkel+Labeled (ratner2017snorkel) trains classifiers using weakly-labeled data with a generative model. (c) L2R (ren2018learning) learns to re-weight noisy labels from rules by meta learning. (d) PosteriorReg (hu2016harnessing) uses rules as soft constraints for training via posterior regularization. (e) ImplyLoss (awasthi2020learning) learns both instance-specific and rule-specific weights by minimizing an implication loss. (f) ASTRA (karamanolakis-etal-2021-self) introduces a rule attention network to leverage multiple sources of weak supervision with trainable weights that compute soft weak labels for unlabeled data.
For the ASR baselines, (a) Vanilla is a naive encoder-decoder transformer network without any denoising modules; all the following baselines and our approach are built on top of the vanilla model. (b) Co-Teaching+ trains two networks, with each network selecting its small-loss samples as clean samples for its peer network. (c) L2R is the same as mentioned before. (d) RoCL starts learning with easy and clean samples and gradually moves to learn noisy-labeled data with pseudo labels produced by a time-ensemble of the model and data augmentations. (e) SELFIE selects refurbishable samples based on the entropy of model predictions and refurbishes the samples with model predictions.
A.2 Details of Experiments
For the ASR models, the transformer contains 12 encoder layers and 6 decoder layers. For each transformer block, the number of heads in the multi-head attention module is 4. The dimension of the encoder and decoder input is 256, and the dimension of the feed-forward network is 2048. We use 80-dimensional filter bank coefficients as input features. The hyperparameters for training are shown in Table 9. Batch_size_in_s2 means the maximum allowed length of audio in seconds in one batch. History_length represents the maximum allowed length of the stored history of predicted labels; these history predictions are used to calculate the entropy features.
It should be noted that in both the text classification and ASR tasks, we split the total development set into dev_select and dev_eval, where dev_select is used to train the selection model and dev_eval is used to evaluate the selection model.
Table 9: Hyperparameters for ASR training at each corruption level.
HP | HardTrain | MediumTrain | EasyTrain
Max_LR | 5e-4 | 3e-4 | 3e-4 |
Min_LR | 5e-6 | 1e-6 | 1e-6 |
Warmup_step | 20000 | 20000 | 20000 |
Max_steps | 80000 | 50000 | 50000 |
Batch_size | 300 | 300 | 300 |
Batch_size_in_s2 | 500 | 500 | 500 |
History_length | 18 | 12 | 12 |
There are three details worth noting for ASR. a) We add an additional correction module for ASR. The correction module has the same architecture as the selection module, takes the same signals as input, and outputs three weights that sum to one. The weights are assigned to the noisy label (NL), the model predicted probabilities (Pred), and the accumulated corrected label, respectively; a sketch of this weighted combination is given after these details. The corrected label (CL) at the T-th epoch is:
$\text{CL}^{(T)} = w_{\text{NL}} \cdot \text{NL} + w_{\text{Pred}} \cdot \text{Pred} + w_{\text{CL}} \cdot \text{CL}^{(T-1)}$   (8)
After correction, we apply the normal selection module to select clean labels from the corrected labels. b) Since ASR is a sequence-level problem, we cannot correct and select each token independently, as this would ignore word dependencies. We first align the predicted word sequence to the noisy target sequence according to their longest common subsequence, and then correct and select the tokens that are not common to both sequences. c) As ASR is a generation problem and the lengths of the input and output differ, we do not extract FLS as a feature for our approach.
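A minimal sketch of the correction step in a), under the reconstruction of Eq. (8) above; representing labels as probability (or one-hot) vectors is an illustrative simplification.

```python
import numpy as np

def correct_label(weights, noisy_label, model_probs, prev_corrected):
    """Weighted combination of the noisy label, the model prediction, and the
    previously accumulated corrected label (Eq. 8). `weights` are the three
    correction-module outputs and are assumed to sum to one; all label arguments
    are probability vectors of the same length."""
    w_nl, w_pred, w_cl = weights
    return (w_nl * np.asarray(noisy_label)
            + w_pred * np.asarray(model_probs)
            + w_cl * np.asarray(prev_corrected))
```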
A.3 Details of Ablation Study
Table 10: CER on the AISHELL-1 development set under three corruption levels.
Model | Hard | Medium | Easy
Vanilla | 30.06 | 20.21 | 14.42 |
Co-Teaching+ | 29.67 | 20.02 | 13.71 |
L2R | 27.79 | 19.08 | 14.12 |
RoCL | 24.50 | 16.34 | 13.22 |
SELFIE | 24.40 | 17.22 | 13.01
Ours | 23.97 | 16.01 | 11.96
A.3.1 The Influence of the Selection Threshold on the Final Performance
In practice, we find that choosing a proper threshold for the selection model can influence the final performance. Specifically, we choose the threshold that yields the best FX-score on the dev_eval set, and use this threshold to select samples from the unlabeled data based on the output of the selection model. We investigate the influence on the final performance of varying the X in the FX-score on YOUTUBE and SMS. The metric is computed as follows:
$F_X = (1 + X^2) \cdot \frac{P \cdot R}{X^2 \cdot P + R}$   (9)
where $P$ and $R$ are the selection precision and recall. Note that this metric becomes the F1-score if we set X to 1. X controls the preference between precision and recall: as X approaches 0 the metric approaches precision, and as X approaches infinity it approaches recall. As shown in Figure 4, the performance gradually decreases as X grows, which implicitly indicates that precision matters more than recall when selecting samples from unlabeled data.
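For reference, the FX-score in Eq. (9) can be computed as follows (a direct transcription of the formula, with no edge-case handling):

```python
def fx_score(precision: float, recall: float, x: float) -> float:
    """F_X score (Eq. 9): X -> 0 recovers precision, X -> infinity recovers recall,
    and X = 1 gives the usual F1 score."""
    return (1 + x ** 2) * precision * recall / (x ** 2 * precision + recall)
```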
A.3.2 The Performance Under the Few-shot Setting
We also investigate the performance on IMDB and YOUTUBE under the few-shot learning setting. The results are shown in Figure 5.
A.3.3 Case Study: Signal Difference on Train and Development Sets
We show the difference of the signals on the training and development sets in Figures 6, 7, and 8. All signals show similar statistics on both sets, which indicates that our noise transfer approach holds and performs well. This phenomenon exists at all three error levels.
A.3.4 Influence of Signals on the AISHELL-1 Dev Set
We show the influence of different signals on the AISHELL-1 development set in Table 11. The tendency is the same as that on the test set.
Table 11: Feature ablation on the AISHELL-1 development set (CER).
Signal | Hard | Medium | Easy
All | 23.97 | 16.01 | 11.96 |
-EMAL | 24.35 | 16.43 | 12.45 |
-IL | 25.48 | 17.55 | 13.71 |
-HE | 25.89 | 18.01 | 13.99 |