Official repository for Reliable Label Bootstrapping
Reducing the amount of labels required to train convolutional neural networks without performance degradation is key to effectively reducing human annotation effort. We propose Reliable Label Bootstrapping (ReLaB), an unsupervised preprocessing algorithm that paves the way for semi-supervised learning solutions, enabling them to work with much lower supervision. Given a dataset with few labeled samples, we first exploit a self-supervised learning algorithm to learn unsupervised latent features and then apply a label propagation algorithm on these features and select only correctly labeled samples using a label noise detection algorithm. This enables ReLaB to create a reliable extended labeled set from the initially few labeled samples that can then be used for semi-supervised learning. We show that the selection of the network architecture and the self-supervised method are important to achieve successful label propagation and demonstrate that ReLaB substantially improves semi-supervised learning in scenarios of very limited supervision in CIFAR-10, CIFAR-100, and mini-ImageNet. Code: https://github.com/PaulAlbert31/ReLaB.
Convolutional neural networks (CNNs) are now the established standard for visual representation learning [10, 26, 60], yet one of their most prevalent limitations is the large quantity of labeled data required to better exploit them. Although enormous quantities of unlabeled data are now accessible and can be collected with minimal effort, the annotation process remains limited by human intervention [12, 30, 58, 67]. Representation learning has great potential to address this and the research community is actively developing new algorithms to train CNNs with little to no supervision [7, 19].
Self-supervised learning defines a pretext task where labels are automatically generated and serve as a supervisory training signal. By solving pretext tasks such as colorization of grayscale images, predicting image rotations, or predicting automatically estimated cluster assignments, CNNs can learn general representations that reduce the amount of supervision needed for downstream tasks.
Despite improvements in methods for learning general representations using self-supervision, labels are still required to solve tasks. Automatic annotation of data becomes a plausible answer, but it unavoidably infers some incorrect or noisy labels. To prevent harming the learned representations, label noise-resistant training of CNNs is often necessary [13, 40, 44, 49]. In particular, the small loss trick associates examples with a low (high) training loss to samples with clean (noisy) labels. Distinguishing between clean and noisy samples helps with discarding noisy labels [13, 39], correcting labels [1, 44], or reducing their effect on parameter updates.
Aiming to reduce the labeling effort, semi-supervised learning jointly exploits a small set of labeled samples and large quantities of unlabeled ones. In particular, consistency regularization methods (e.g. [7, 50]) encourage consistency in the predictions for the same sample under different perturbations, while pseudo-labeling methods (e.g. [46, 2]) directly generate labels for the unlabeled samples. Recent work [56, 6] has allowed semi-supervised algorithms to work with very few labels, aiming to minimize human annotation. Berthelot et al. use self-supervised regularization based on RotNet to stabilize network training in cases of extremely few labels, and Wang et al. use the self-supervision approach from AET to regularize the MixMatch algorithm. Finally, Rebuffi et al. make use of RotNet self-supervision to initialize the network before a two-stage semi-supervised training, achieving substantial improvements over a random initialization.
This paper contributes to a further reduction of human supervision by proposing Reliable Label Bootstrapping (ReLaB), a novel approach to exploiting knowledge transfer from self-supervised learning and paving the way for semi-supervised learning with very scarce annotation. We exploit synergies between label noise, self-supervised, and semi-supervised learning to bootstrap additional reliable labels from a small set of seed samples. In particular, we leverage label propagation algorithms in a self-supervised feature space to extend the provided labels to the entirety of the samples, select a trusted clean subset from this noisy dataset and use the selected subset for semi-supervised training. This enables strong performance for very limited supervision, where we outperform direct training of recent semi-supervised methods and reduce the sensitivity to the initial labeled samples.
There have been many attempts in the literature to reduce the amount of strong supervision required to train deep neural networks. These include tasks such as transfer learning or few-shot learning, where supervised pre-trained features are exploited, and semi-supervised learning, self-supervised learning, or label noise, where all features are learned on the same dataset. This paper focuses on the latter; the following reviews some closely related literature.
Semi-supervised learning seeks to reduce human supervision by jointly learning from sparsely labeled data and extensive unlabeled data. It has evolved rapidly in recent years by exploiting two main strategies: consistency regularization and pseudo-labeling. Consistency regularization promotes consistency in the network's predictions for the same unlabeled sample altered by different perturbations. Notable examples perturb samples with virtual adversarial attacks, build a teacher network from the exponential moving average of the student network's weights to produce perturbed predictions, or encourage predictions of interpolated samples to be consistent with the interpolation of the predictions. Recently, Berthelot et al. proposed MixMatch, where perturbed predictions are generated by means of data-augmented sharpened labels, and labeled and unlabeled examples are mixed together using mixup. MixMatch was extended in ReMixMatch by exploiting distribution alignment and an augmentation anchoring policy. Pseudo-labeling, on the other hand, directly exploits the network predictions on unlabeled samples by using them as labels (pseudo-labels) to regularize training. An early attempt at pseudo-labeling is limited to a finetuning stage on a pre-trained network. Later approaches implement a graph-based, weighted pseudo-label generation based on a label propagation algorithm, or derive certainty weights for unlabeled samples from their distance to neighboring samples in the feature space. Recently, Arazo et al. have shown that pure pseudo-labeling without consistency regularization can reach competitive performance when confirmation bias is addressed.
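The two MixMatch ingredients named above, label sharpening and mixup, can be illustrated with a minimal standard-library sketch. The function names, the temperature of 0.5, and the Beta parameter of 0.75 are illustrative assumptions rather than values taken from this paper, and samples are plain lists instead of image tensors.

```python
import random


def sharpen(probs, temperature=0.5):
    """Lower the entropy of a predicted class distribution.

    Each probability is raised to 1/temperature and the result is
    renormalised; the temperature value is an illustrative assumption.
    """
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


def mixup(x1, y1, x2, y2, alpha=0.75, rng=random):
    """Convexly combine two samples and their (soft) labels.

    lam is drawn from Beta(alpha, alpha); taking max(lam, 1 - lam) keeps the
    mixed sample closer to the first input, as done in MixMatch.
    """
    lam = rng.betavariate(alpha, alpha)
    lam = max(lam, 1.0 - lam)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y
```

Sharpening makes the guessed label for an unlabeled sample more confident, while mixup blends inputs and labels with the same coefficient so that the mixed label remains a valid distribution.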
Self-supervised learning defines proxy or pretext tasks to learn useful representations without human intervention. Context prediction, colorization, puzzle solving, instance discrimination, image rotation prediction, and image transformation prediction are some examples of pretext tasks. Some recent efforts on self-supervised learning generate meta-labels via $k$-means clustering in the feature space or by solving an optimal transport problem. Conversely, other approaches explore the feature space by iteratively constructing local neighborhoods with a high instance discrimination consistency to learn useful representations.
Recent contributions show that coupling self-supervised and semi-supervised learning can increase accuracy with fewer labels. Rebuffi et al. use RotNet as a network initialization strategy, ReMixMatch exploits RotNet together with its semi-supervised algorithm to achieve stability with few labels, and EnAET leverages transformation encoding from AET to improve the consistency of predictions on transformed images.
Label propagation transfers the information from labeled data to an unlabeled dataset. The process stems from random walk diffusion for image retrieval [16, 48, 66], where a pairwise affinity matrix is constructed, relating images to each other, before diffusing the affinity values to the entirety of the graph. The diffusion result can be directly used to estimate labels and finetune pre-trained networks in few-shot learning or to define a pseudo-labeling for semi-supervised learning. Other attempts at semi-supervised learning exploit label propagation to dynamically capture the manifold's structure and regularize it to form compact clusters that facilitate class separation, or to encourage random walks that end up in the same class from which they started while penalizing different class endings.
Label noise is a topic of increasing interest for the community that aims at limiting the degradation of CNN representations when learning in label noise conditions. Label noise algorithms can be categorized into three different approaches: loss correction [20, 40, 44], relabeling [49, 59], and semi-supervised [13, 39]. Loss correction seeks to reduce the contribution of the incorrect or noisy labels in the training objective. Some methods define per-sample losses that combine the potentially noisy label with the potentially clean network prediction, and others extend this idea by dynamically defining such combinations in an attempt to fully dismiss the contribution of the noisy labels. Other loss correction approaches multiply the softmax probability by a label noise transition matrix that specifies the probability of one label being flipped to another [21, 40], whereas per-sample weights that reduce the influence of noisy samples have also been explored [20, 54]. Relabeling approaches propose to avoid fitting noisy labels by relabeling all samples using either the network predictions or estimated label distributions as soft-labels. Semi-supervised learning methods detect the noisy samples before discarding their harmful labels and exploiting their content in a semi-supervised setup [13, 28, 39]. Finally, a recurrent observation to identify clean samples is the small loss trick [1, 20, 39, 49], where clean samples exhibit a lower loss as they represent easier patterns. It is worth mentioning that mixup data augmentation has shown good performance when dealing with label noise in real scenarios without explicitly addressing it.
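The transition-matrix form of loss correction mentioned above can be sketched as follows. This is a minimal standard-library illustration, not code from any of the cited methods: the function names are hypothetical, and the matrix `T` is assumed to hold `T[i][j] = P(observed label j | true label i)`.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward_corrected_loss(logits, noisy_label, T):
    """Cross-entropy on the softmax output multiplied by a noise transition matrix.

    The clean class probabilities p are mapped to probabilities of the
    observed (possibly flipped) label, q_j = sum_i p_i * T[i][j], and the
    loss is evaluated against the observed label.
    """
    p = softmax(logits)
    n = len(p)
    q = [sum(p[i] * T[i][j] for i in range(n)) for j in range(n)]
    return -math.log(q[noisy_label])
```

With an identity matrix the loss reduces to standard cross-entropy; with off-diagonal mass, a sample whose observed label disagrees with a confident prediction is penalised less, which is the intended effect of the correction.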
We formulate a semi-supervised classification task for $C$ classes as learning a model $h_\psi$ given a training set $\mathcal{D}$ of $N$ samples. The dataset consists of the labeled set $\mathcal{D}_l = \{(x_i, y_i)\}_{i=1}^{N_l}$ with corresponding one-hot encoded labels $y_i \in \{0, 1\}^C$ and the unlabeled set $\mathcal{D}_u = \{x_i\}_{i=1}^{N_u}$, with $N = N_l + N_u$. We consider a CNN for $h_\psi$, where $\psi$ denotes the model parameters. The network comprises a feature extractor $f_\varphi$ with parameters $\varphi$, which maps the input space into the feature space, and a classifier $g_\theta$ with parameters $\theta$. Substantially decreasing the number of labels significantly decreases semi-supervised learning performance. We therefore propose to bootstrap additional labels for unlabeled samples from $\mathcal{D}_u$. First, label propagation [16, 23, 24, 48, 51] is performed using self-supervised visual representations to estimate labels for the unlabeled set $\mathcal{D}_u$ and create an extended dataset $\tilde{\mathcal{D}}$. Second, the small loss trick from the label noise literature is used to select reliable samples from $\tilde{\mathcal{D}}$ whose label can be trusted (i.e. it is not noisy) to create a reliable extended labeled set $\mathcal{D}_r$. Finally, semi-supervised learning is applied to the extended labeled set $\mathcal{D}_r$ and the remaining unlabeled samples. Figure 1 presents an overview of the proposed approach.
Knowledge transfer from the labeled set to the unlabeled set is implicitly done by semi-supervised learning approaches, as network predictions for the unlabeled samples can be seen as estimated labels. With few labeled samples, however, it is difficult to learn useful initial representations from the labeled set and performance is substantially degraded (see Subsection IV-E).
Although label propagation for semi-supervised learning has previously been studied as a regularisation or as a semi-supervised objective, we propose here to follow an alternative direction, as our goal is first to leverage self-supervised features and second to label only a reliable subset. Given a set of descriptors learnt in an unsupervised manner, we seek a label propagation algorithm capable of efficiently fitting the data manifold. Diffusion [16, 23, 24, 48, 51] is a well documented label propagation algorithm that provides a good solution to our problem, and we reformulate the label estimation under the diffusion algorithm in a similar fashion to previous work. Here we study the estimation of the labels $\hat{y}$ as a label propagation task using unsupervised visual representations learned from all $N$ data samples. In particular, we fit a feature extractor $f_\varphi$ using self-supervision to obtain class-discriminative image representations and subsequently propagate labels from the labeled images to estimate labels for the unlabeled samples. We do so by solving a label propagation problem based on graph diffusion. First, the set of descriptors $\{d_i\}_{i=1}^{N}$, with $d_i = f_\varphi(x_i)$, is used to define the affinity matrix:

$$S = D^{-1/2} A D^{-1/2},$$

where $D$ is the degree matrix of the graph and the adjacency matrix $A$ is computed as $A_{ij} = (d_i^\top d_j)^{\gamma}$ if $i \neq j$ and $A_{ij} = 0$ otherwise. $\gamma$ weighs the affinity term to control the sensitivity to far neighbors and is set to 3 as in previous work. The diffusion process estimates the $N \times C$ matrix $F$ as:

$$F = (I - \alpha S)^{-1} Y,$$

where $\alpha$ denotes the probability of jumping to adjacent vertices in the graph and $Y$ is the $N \times C$ label matrix defined such that $Y_{ij} = 1$ if sample $x_i$ is labeled and $y_i = j$ (i.e. $x_i$ belongs to the $j$-th class) and $Y_{ij} = 0$ otherwise, where $i$ ($j$) indexes the rows (columns) in $Y$. Finally, the estimated one-hot label $\hat{y}_i$ is given by the class

$$\hat{y}_i = \arg\max_j F_{ij}$$

for each unlabeled sample $x_i$. These estimated labels allow the creation of the extended dataset with estimated labels $\tilde{\mathcal{D}} = \{(x_i, \tilde{y}_i)\}_{i=1}^{N}$, where $\tilde{y}_i = y_i$ for the originally labeled samples and $\tilde{y}_i = \hat{y}_i$ for the unlabeled ones. Note that we follow common practices for image retrieval [4, 41] and perform PCA whitening as well as $\ell_2$ normalization on the features.
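The diffusion-based propagation described in this subsection can be sketched on a toy problem as follows. This is a minimal standard-library illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, `alpha=0.9` is an illustrative value, negative similarities are clamped to zero before applying the exponent, PCA whitening is omitted, and the matrix inverse is replaced by the fixed-point iteration F ← αSF + Y, which converges to the same solution when α < 1.

```python
import math


def l2_normalize(v):
    """Scale a descriptor to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def propagate_labels(features, labels, num_classes, gamma=3, alpha=0.9, iters=200):
    """Estimate one class index per sample by graph diffusion.

    features: list of descriptor vectors (one per sample).
    labels:   class index for labeled samples, None for unlabeled ones.
    """
    n = len(features)
    d = [l2_normalize(v) for v in features]

    # Adjacency: similarity raised to gamma, zero on the diagonal.
    # Clamping negative similarities to zero is an assumption of this sketch.
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                sim = sum(a * b for a, b in zip(d[i], d[j]))
                A[i][j] = max(sim, 0.0) ** gamma

    # Symmetrically normalised affinity: degree^{-1/2} * A * degree^{-1/2}.
    deg = [sum(row) or 1.0 for row in A]
    S = [[A[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

    # Label matrix: one row per sample, a 1 in the column of its known class.
    Y = [[1.0 if labels[i] == c else 0.0 for c in range(num_classes)] for i in range(n)]

    # Fixed-point iteration in place of the explicit matrix inverse.
    F = [row[:] for row in Y]
    for _ in range(iters):
        F = [[alpha * sum(S[i][k] * F[k][c] for k in range(n)) + Y[i][c]
              for c in range(num_classes)] for i in range(n)]

    # Estimated label: the class with the largest diffused score.
    return [max(range(num_classes), key=lambda c: F[i][c]) for i in range(n)]
```

For example, with two well-separated clusters of three descriptors each and one seed label per cluster, every unlabeled descriptor inherits the label of the seed in its own cluster. At realistic dataset sizes a sparse solver from a numerical library would replace the dense loops used here.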
Propagating existing labels using self-supervised representations as described in Subsection III-A results in estimated labels that might be incorrect, i.e. label noise. Using noisy labels as a supervised objective on the extended dataset $\tilde{\mathcal{D}}$ leads to performance degradation due to label noise memorization [39, 62] (see Table III in Subsection IV-C). Since the label noise present in $\tilde{\mathcal{D}}$ comes from features extracted from the data, noisy samples tend to be visually similar to the seed sample. Consequently, robust state-of-the-art algorithms for supervised learning with label noise [1, 49, 63], principally designed to work on symmetric noise distributions, underperform (see Subsection IV-C). The small loss trick [1, 39, 49] states that samples with a smaller loss are cleaner than their high-loss counterparts. Previous works utilizing the small loss have proven its efficiency for artificial noise distributions, and our selection of a clean subset followed by training in a semi-supervised manner follows a similar approach [13, 28, 39]. However, our feature-based noise generated after label propagation is unbalanced in the number of samples and the level of noise in each class, thus posing a difficult scenario that has not been addressed in the label noise literature. We therefore propose a different method to identify clean samples using the cross-entropy loss:

$$\ell_i = -\tilde{y}_i^\top \log h_\psi(x_i),$$

with $\tilde{y}_i$ the estimated one-hot label and $h_\psi(x_i)$ the softmax-normalized logits of the network, and training with a high learning rate that helps prevent label noise memorization on the extended dataset $\tilde{\mathcal{D}}$. Samples whose associated loss is low are more likely to have a correct label. The reliable set $\mathcal{D}_r$ of $N_r$ samples is then created by selecting, for each class $c$, the originally labeled samples for that class and the samples of class $c$ from $\tilde{\mathcal{D}}$ with the lowest loss $\ell_i$, i.e. highly reliable samples. The challenging noise present in $\tilde{\mathcal{D}}$ makes the loss $\ell_i$ during any particular epoch unstable (see Figure 2). We therefore propose to average it over the last training epochs to create $\mathcal{D}_r$. We set the number of labeled samples per class equally for all classes, i.e. $N_r / C$, and choose it based on traditionally reported baselines for semi-supervised experiments [2, 6, 38]. For example, the size commonly used in CIFAR-10 usually achieves convergence to reasonable performance. Table II shows that the approach and the noise percentage of the generated dataset are not overly sensitive to this hyperparameter.
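The per-class selection described above can be sketched as follows. This is a minimal standard-library illustration, not the authors' code: the function and argument names are hypothetical, and per-sample losses are assumed to have been recorded for every training epoch.

```python
from collections import defaultdict


def select_reliable(loss_history, est_labels, seed_ids, per_class, last_epochs):
    """Pick a class-balanced set of low-loss samples.

    loss_history: loss_history[e][i] is the loss of sample i at epoch e.
    est_labels:   estimated class index of every sample after propagation.
    seed_ids:     indices of the originally labeled samples (always kept).
    per_class:    number of reliable samples to keep for each class.
    last_epochs:  number of final epochs over which the loss is averaged.
    """
    n = len(est_labels)
    window = loss_history[-last_epochs:]
    # Averaging over several epochs smooths the unstable per-epoch loss.
    avg = [sum(epoch[i] for epoch in window) / len(window) for i in range(n)]

    by_class = defaultdict(list)
    for i, c in enumerate(est_labels):
        by_class[c].append(i)

    seeds = set(seed_ids)
    reliable = []
    for c, ids in sorted(by_class.items()):
        kept = [i for i in ids if i in seeds]
        # Remaining slots go to the lowest-loss samples of this class.
        candidates = sorted((i for i in ids if i not in seeds), key=lambda i: avg[i])
        kept += candidates[: max(per_class - len(kept), 0)]
        reliable.extend(kept)
    return sorted(reliable)
```

Selecting within each class, rather than globally, keeps the reliable set balanced even when the propagated labels are noisier for some classes than for others.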
Unlike traditional learning from the labeled set $\mathcal{D}_l$ and the unlabeled set $\mathcal{D}_u$, ReLaB empowers semi-supervised algorithms with a (larger) reliable labeled set $\mathcal{D}_r$ extended from the original (smaller) labeled set $\mathcal{D}_l$. The extension from $\mathcal{D}_l$ to $\mathcal{D}_r$ is done in a completely unsupervised manner and, as a consequence, we greatly reduce the error rates of SSL algorithms when few labels are given, e.g. the error of ReMixMatch in CIFAR-10 for one labeled sample per class is reduced when using representative labeled samples.
We experiment with three image classification datasets: CIFAR-10, CIFAR-100, and mini-ImageNet. CIFAR (mini-ImageNet) data consists of 60K $32 \times 32$ ($84 \times 84$) RGB images split into 50K training samples and 10K for testing. CIFAR-10 samples are organized in 10 classes, while CIFAR-100 and mini-ImageNet are organized in 100.
We construct the reliable set by training for 60 epochs with a high learning rate (0.1) to prevent label noise memorization and select the lowest loss samples per class at the end of the training. We average the per-sample loss over the last epochs of training to stabilize the reliable sample selection (see Figure 2). Regarding SSL, we always use a standard WideResNet-28-2 for fair comparison with related work. We combine our approach with state-of-the-art pseudo-labeling and consistency regularization-based semi-supervised methods to prove the stability of ReLaB for different semi-supervised strategies. We use the default configuration for pseudo-labeling (https://github.com/EricArazo/PseudoLabeling) except for the network initialization, where we make use of self-supervision and freeze all the layers up to the last convolutional block in a similar fashion to previous work. The network is warmed up on the labeled set and then trained on the whole dataset. For ReMixMatch (https://github.com/google-research/remixmatch) we found no initialisation was necessary. Experiments in Subsection IV-C for the supervised alternatives for dealing with label noise [1, 63] follow the authors' configurations, while cross-entropy training in Table III is done for 150 epochs with an initial learning rate of 0.1 that we divide by 10 in epochs 80 and 130.
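The step schedule used for the cross-entropy baseline (initial learning rate 0.1, divided by 10 at epochs 80 and 130) can be written as a small function. The function name and the convention that epochs are counted from 0, with each drop taking effect at the start of the milestone epoch, are assumptions of this sketch.

```python
def step_lr(epoch, base_lr=0.1, milestones=(80, 130), factor=0.1):
    """Learning rate at a given epoch under a step decay schedule.

    The rate is multiplied by `factor` once for every milestone that has
    already been reached.
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** drops
```

Under these assumptions the rate is 0.1 for epochs 0 to 79, 0.01 for epochs 80 to 129, and 0.001 for epochs 130 to 149.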
Table I: Label noise percentage of the extended dataset after label propagation for different self-supervised methods and architectures. The average error and the standard deviation are reported over 3 runs with different labeled samples. Lower is better.
Table II: Noise (%) and SSL error of ReLaB + RMM for different sizes of the reliable set.
Label propagation relies upon representations extracted from the data and is as such conditioned by the quality of these representations. We propose to exploit unsupervised learning to obtain these representations, which strongly impacts the label propagation proposed in Subsection III-A (see Table I). In particular, we present the label noise percentage of the extended labeled set in CIFAR-10 (100) formed after label propagation of the specified self-supervised representations with 1, 4 and 10 (4, 10 and 25) labeled samples per class in the labeled set. We select RotNet, NPID, UEL, and AND as four recent self-supervised methods, and experiment with the WideResNet-28-2 (WRN-28-2), ResNet-18 (RN-18) and ResNet-50 (RN-50) architectures. We confirm that the architecture has a key impact on the label noise percentage, which agrees with previous observations on the quality of self-supervised features from larger architectures. More capacity does not reduce the noise percentage for RotNet, whereas NPID, UEL, and AND are more stable across architectures and different amounts of labels. We select AND coupled with ResNet-50 for learning self-supervised features suitable for label propagation in the subsequent experiments.
The extended dataset $\tilde{\mathcal{D}}$ obtained after label propagation contains label noise; we proposed in Subsection III-B to reduce this noise by selecting the most reliable samples via the small loss trick. The resulting reliable set $\mathcal{D}_r$ represents an extended labeled set when compared to the small labeled set $\mathcal{D}_l$. Here we analyze the importance of the size of $\mathcal{D}_r$ on its label noise percentage and SSL performance. Table II shows that, although selecting more samples slightly increases the noise percentage, the semi-supervised errors are relatively insensitive to this and are even sometimes reduced due to more samples being considered. This tendency ceases for the largest reliable sets, where more samples do not compensate for the higher noise percentage. Based on this experiment and the typical amounts of labeled samples needed to perform successful SSL [2, 7, 23, 50], we fix the size of $\mathcal{D}_r$ for CIFAR-10 and CIFAR-100 in further experiments.
There are also supervised alternatives for dealing with label noise [1, 63]. Table III compares the proposed approach with standard cross-entropy (CE) training on the extended dataset and recent label noise robust methods such as the noise resistant Mixup (M) augmentation and the Dynamic Bootstrapping (DB) loss correction method. In both CIFAR-10 and CIFAR-100, ReLaB + ReMixMatch (RMM) outperforms supervised alternatives. This does not hold for ReLaB + Pseudo-labeling (PL) in CIFAR-100, which is slightly outperformed by DB. To demonstrate that ReLaB + RMM does not lead to better performance solely because of the stronger data augmentation used in RMM, we equip DB with the strong augmentation policy AutoAugment (AA) (DB + AA). This improved DB is still far from ReLaB + RMM performance, demonstrating the utility of the reliable set selection followed by SSL compared to supervised alternatives.
Table IV shows the benefits of ReLaB for semi-supervised learning with PL and ReMixMatch (RMM) compared to direct application of semi-supervised methods in CIFAR-10/100. Our focus is very low levels of labeled samples: semi-supervised methods already achieve very good performance with larger numbers of labeled samples. ReLaB acts as a pre-processing step that extends the number of available samples, thus enabling better performance of semi-supervised methods. We further study the 1 sample per class scenario in Subsection IV-E.
The high standard deviation using 1 sample per class in CIFAR-10 (Table IV) motivates a more reasonable method to compare against other approaches. To this end, the authors of FixMatch proposed 8 different labeled subsets for 1 sample per class in CIFAR-10, ordered from more representative to less representative. We reduce the experiments to 3 subsets: the most representative, the least representative, and one in the middle. Figure 3 shows the selected subsets; the exact sample ids are available at https://github.com/PaulAlbert31/ReLaB.
Table V reports the performance for each subset and compares against FixMatch and our configuration of ReMixMatch. Note that the results obtained for the less representative samples reflect the results that can be expected on average when drawing labeled samples randomly (see Table IV). Furthermore, although there is a high accuracy variability with 1 sample per class on CIFAR-10, the standard deviation over the CIFAR-100 and mini-ImageNet runs is low enough that it can be directly compared to others even when drawing the labeled samples randomly, and therefore we omit the fixed samples comparison.
ReLaB is a label bootstrapping method that enables the use of standard semi-supervised algorithms with very sparsely labelled data by efficiently leveraging self-supervised learning. We extend the labeled pool through propagation in a self-supervised feature space and properly deal with label noise resulting from the automatic label assignment to extract an extended clean subset of labeled samples before training in a semi-supervised fashion. We demonstrate the direct impact of better unsupervised features for the performance of ReLaB and enable traditional semi-supervised algorithms to reach remarkable and stable accuracies with very few labeled samples on standard datasets.
This publication has emanated from research conducted with the financial support of Science Foundation Ireland (SFI) under grant number [SFI/15/SIRG/3283] and [SFI/12/RC/2289_P2] as well as from the Department of Agriculture, Food and Marine on behalf of the Government of Ireland under Grant Number [16/RC/3835].
International Conference on Machine Learning (ICML).
European Conference on Computer Vision (ECCV).
IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Unsupervised Deep Learning by Neighbourhood Discovery. In International Conference on Machine Learning (ICML).
International Joint Conferences on Artificial Intelligence (IJCAI).
EnAET: Self-Trained Ensemble AutoEncoding Transformations for Semi-Supervised Learning. arXiv: 1911.09265.