MoPro: Webly Supervised Learning
We propose a webly-supervised representation learning method that does not suffer from the annotation unscalability of supervised learning, nor the computation unscalability of self-supervised learning. Most existing works on webly-supervised representation learning adopt a vanilla supervised learning method without accounting for the prevalent noise in the training data, whereas most prior methods in learning with label noise are less effective for real-world large-scale noisy data. We propose momentum prototypes (MoPro), a simple contrastive learning method that achieves online label noise correction, out-of-distribution sample removal, and representation learning. MoPro achieves state-of-the-art performance on WebVision, a weakly-labeled noisy dataset. MoPro also shows superior performance when the pretrained model is transferred to down-stream image classification and detection tasks. It outperforms the ImageNet supervised pretrained model by +10.5 on 1-shot classification on VOC, and outperforms the best self-supervised pretrained model by +17.3 when finetuned on 1% of ImageNet labeled samples. Furthermore, MoPro is more robust to distribution shifts. Code and pretrained models are available at https://github.com/salesforce/MoPro.
Large-scale datasets with human-annotated labels have revolutionized computer vision. Supervised pretraining on ImageNet (Deng et al., 2009) has been the de facto formula of success for almost all state-of-the-art visual perception models. However, it is extremely labor intensive to manually annotate millions of images, which makes it a non-scalable solution. One alternative to reduce annotation cost is self-supervised representation learning, which leverages unlabeled data. However, self-supervised learning methods (Goyal et al., 2019; He et al., 2019; Chen et al., 2020a; Li et al., 2020b) have yet to consistently show superior performance compared to supervised learning, especially when transferred to downstream tasks with limited labels.
With the help of commercial search engines, photo-sharing websites, and social media platforms, there is a near-infinite amount of weakly-labeled images available on the web. Several works have exploited this scalable source of web images and demonstrated promising results with webly-supervised representation learning (Mahajan et al., 2018; Sun et al., 2017; Li et al., 2017; Kolesnikov et al., 2020). However, there exist two competing claims on whether weakly-labeled noisy datasets lead to worse generalization performance. One claim argues that the effect of noise can be overpowered by the scale of data, and simply applies a standard supervised learning method to web datasets (Mahajan et al., 2018; Sun et al., 2017; Li et al., 2017; Kolesnikov et al., 2020). The other claim argues that deep models can easily memorize noisy labels, resulting in worse generalization (Zhang et al., 2017; Ma et al., 2018). In this paper, we show that both claims are partially true. While increasing the size of data does improve the model’s robustness to noise, our method can substantially boost the representation learning performance by addressing noise.
There exists a large body of literature on learning with label noise (Jiang et al., 2018; Han et al., 2018; Guo et al., 2018; Tanaka et al., 2018; Arazo et al., 2019; Li et al., 2020a). However, existing methods have several limitations that make them less effective for webly-supervised representation learning. First, most methods do not consider out-of-distribution (OOD) samples, which are a major source of noise in real-world web datasets. Second, many methods perform computation-heavy procedures for noise cleaning (Jiang et al., 2018; Li et al., 2019, 2020a), or require access to a set of samples with clean labels (Vahdat, 2017; Veit et al., 2017; Lee et al., 2018), which limits their scalability in practice.
We propose a new method for efficient representation learning from weakly-labeled web images. Our method is inspired by recent developments in contrastive learning for self-supervised learning (He et al., 2019; Chen et al., 2020a; Li et al., 2020b). We introduce Momentum Prototypes (MoPro), a simple component which is effective in label noise correction, OOD sample removal, and representation learning. A visual explanation of our method is shown in Figure 1. We use a deep network to project images into normalized low-dimensional embeddings, and calculate the prototype for a class as the moving-average embedding of the clean samples in that class. We train the network such that embeddings are pulled closer to their corresponding prototypes and pushed away from other prototypes. Images with corrupted labels are corrected either as another class or as an OOD sample based on their distance to the momentum prototypes.
We experimentally show that:
MoPro achieves state-of-the-art performance on the upstream weakly-supervised learning task.
MoPro substantially improves representation learning performance when the pretrained model is transferred to downstream image classification and object detection tasks. For the first time, we show that weakly-supervised representation learning achieves performance similar to supervised representation learning, under the same data and computation budget. With a larger web dataset, MoPro outperforms ImageNet supervised learning by a large margin.
MoPro learns a more robust and better-calibrated model that generalizes better to distribution variations.
A number of prior works exploit large web datasets for visual representation learning (Joulin et al., 2016; Mahajan et al., 2018; Sun et al., 2017; Li et al., 2017; Kolesnikov et al., 2020). These datasets contain a considerable amount of noise. Approximately 20% of the labels in the JFT-300M dataset (Sun et al., 2017) are noisy, whereas 34% of images in the WebVision dataset (Li et al., 2017) are considered outliers. Surprisingly, prior works have chosen to ignore the noise and applied a vanilla supervised method, with the claim that the scale of data can overpower the noise (Mahajan et al., 2018; Sun et al., 2017; Li et al., 2017). However, we show that a supervised method cannot fully harvest the power of large-scale weakly-labeled datasets. Our method achieves substantial improvement by addressing noise, and advances the potential of webly-supervised representation learning.
Learning with label noise has been widely studied. Some methods require access to a small set of clean samples (Xiao et al., 2015; Vahdat, 2017; Veit et al., 2017; Lee et al., 2018), and other methods assume that no clean labels are available. There exist two major types of approaches. The first type performs label correction using predictions from the network (Reed et al., 2015; Ma et al., 2018; Tanaka et al., 2018; Yi and Wu, 2019; Yang et al., 2020). The second type separates clean samples from corrupted samples, and trains the model on clean samples (Han et al., 2018; Arazo et al., 2019; Jiang et al., 2018; Wang et al., 2018; Chen et al., 2019; Li et al., 2020a). However, existing methods have yet to show promising results for large-scale weakly-supervised representation learning. The main reasons include: (1) most methods do not consider OOD samples, which commonly occur in real-world web datasets; (2) most methods are computation-heavy due to co-training (Han et al., 2018; Li et al., 2020a; Jiang et al., 2018, 2020), iterative training (Tanaka et al., 2018; Yi and Wu, 2019; Wang et al., 2018; Chen et al., 2019), or meta-learning (Li et al., 2019; Zhang et al., 2019).
Different from existing methods, MoPro achieves both label correction and OOD sample removal on-the-fly with a single step, based on the similarity between an image embedding and the momentum prototypes. MoPro also leverages contrastive learning to learn a robust embedding space.
Self-supervised methods have been proposed for representation learning using unlabeled data. The recent developments in self-supervised representation learning can be attributed to contrastive learning. Most methods (He et al., 2019; Chen et al., 2020a; Oord et al., 2018; Wu et al., 2018) leverage the task of instance discrimination, where augmented crops from the same source image are enforced to have similar embeddings. Prototypical contrastive learning (PCL) (Li et al., 2020b) performs clustering to find prototypical embeddings, and enforces an image embedding to be similar to its assigned prototypes. Different from PCL, we update prototypes on-the-fly in a weakly-supervised setting, where the momentum prototype of a class is the moving average of clean samples’ embeddings. Furthermore, we jointly optimize two contrastive losses and a cross-entropy loss.
Current self-supervised representation learning methods have three limitations: (1) inferior performance in low-shot task adaptation, (2) huge computation cost, and (3) an inability to take advantage of larger datasets. We show that weakly-supervised learning with MoPro addresses these limitations.
In this section, we delineate the details of our method. First, we introduce the components in our representation learning framework. Then, we describe the loss functions. Finally, we explain the noise correction procedure for label correction and OOD sample removal. Pseudo-code for MoPro is provided in Appendix B.
Our proposed framework consists of the following components. Figure 2 gives an illustration.
A noisy training dataset $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i$ is an image and $y_i \in \{1, \dots, K\}$ is its class label.
A pseudo-label $\hat{y}_i$ for each image $x_i$, which is its corrected label. Details of how the pseudo-label is generated are given in Sec 3.3.
A classifier (a fully-connected layer followed by softmax) which receives the representation $v_i$ produced by the encoder as input and outputs class predictions $p_i$.
A projection network, which maps the representation $v_i$ into a low-dimensional embedding $z_i \in \mathbb{R}^{d_p}$. $z_i$ is always normalized to the unit sphere. Following SimCLR (Chen et al., 2020a), we use an MLP with one hidden layer as the projection network.
Momentum embeddings $z'_i$ generated by a momentum encoder. The momentum encoder has the same architecture as the encoder followed by the projection network, and its parameters are the moving average of the encoder’s and the projection network’s parameters. As in MoCo (He et al., 2019), we maintain a queue of momentum embeddings of past samples.
Momentum prototypes $C \in \mathbb{R}^{d_p \times K}$. The momentum prototype of the $k$-th class, $c_k$, is the moving-average embedding for samples with pseudo-label $\hat{y}_i = k$.
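The two moving-average mechanisms in the list above (the momentum encoder and the queue of momentum embeddings) can be sketched as follows. This is a minimal illustration that assumes parameters are plain NumPy arrays; the names `ema_update` and `EmbeddingQueue` are hypothetical, and the default momentum of 0.999 is the value reported in the implementation details.

```python
import numpy as np

def ema_update(momentum_params, online_params, m=0.999):
    """Move each momentum-encoder parameter toward its online counterpart."""
    return [m * pm + (1.0 - m) * po
            for pm, po in zip(momentum_params, online_params)]

class EmbeddingQueue:
    """Fixed-size FIFO buffer of momentum embeddings (hypothetical helper)."""

    def __init__(self, size, dim):
        self.data = np.zeros((size, dim))
        self.ptr = 0      # next slot to overwrite
        self.filled = 0   # number of valid rows stored so far

    def enqueue(self, batch):
        # The oldest entries are overwritten once the buffer is full.
        for row in batch:
            self.data[self.ptr] = row
            self.ptr = (self.ptr + 1) % len(self.data)
            self.filled = min(self.filled + 1, len(self.data))

    def negatives(self):
        # Only rows that have actually been written are returned.
        return self.data[:self.filled]
```

The queue decouples the number of negatives from the batch size, which is why past momentum embeddings can be reused without recomputing them.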
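As illustrated in Figure 1, we aim to learn an embedding space where samples from the same class gather around their class prototype, while samples from different classes are separated. We achieve this with two contrastive losses: (1) a prototypical contrastive loss $\mathcal{L}_{\mathrm{pro}}$, which increases the similarity between an embedding $z_i$ and its corresponding class prototype, $c_{\hat{y}_i}$, in contrast to other prototypes; (2) an instance contrastive loss $\mathcal{L}_{\mathrm{ins}}$, which increases the similarity between two embeddings of the same source image, $z_i$ and $z'_i$, in contrast to embeddings of other images. Specifically, the contrastive losses are defined as:
$$\mathcal{L}_{\mathrm{pro}}^{i} = -\log \frac{\exp(z_i \cdot c_{\hat{y}_i} / \tau)}{\sum_{k=1}^{K} \exp(z_i \cdot c_k / \tau)}, \qquad \mathcal{L}_{\mathrm{ins}}^{i} = -\log \frac{\exp(z_i \cdot z'_i / \tau)}{\sum_{r=0}^{R} \exp(z_i \cdot z'_r / \tau)} \quad (1)$$
where $\tau$ is a temperature parameter, and $\hat{y}_i$ is the pseudo-label. We use $R$ negative momentum embeddings to construct the denominator of the instance contrastive loss; the sum runs over the positive ($r = 0$) and the $R$ negatives.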
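The two contrastive losses can be sketched in NumPy as below. This is an illustration rather than the released implementation: it assumes embeddings and prototypes are already L2-normalized, that negatives come from a queue of momentum embeddings, and that the temperature `tau` is passed in by the caller (no specific value is implied). The function names are hypothetical.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def prototypical_loss(z, prototypes, pseudo_labels, tau):
    """Per-sample -log softmax over prototype similarities at the pseudo-label.

    z: (B, d) embeddings, prototypes: (K, d), pseudo_labels: (B,) ints.
    """
    logits = z @ prototypes.T / tau                      # (B, K)
    return -log_softmax(logits)[np.arange(len(z)), pseudo_labels]

def instance_loss(z, z_mom, queue, tau):
    """Per-sample InfoNCE loss.

    The positive is each sample's own momentum embedding z_mom (B, d);
    the negatives are the R rows of queue (R, d).
    """
    pos = np.sum(z * z_mom, axis=1, keepdims=True)       # (B, 1)
    neg = z @ queue.T                                    # (B, R)
    logits = np.concatenate([pos, neg], axis=1) / tau    # positive at column 0
    return -log_softmax(logits)[:, 0]
```

Both losses share the same softmax form; they differ only in what the embedding is contrasted against (class prototypes versus other images' momentum embeddings).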
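We train the classifier with cross-entropy loss, using pseudo-labels as targets:
$$\mathcal{L}_{\mathrm{ce}}^{i} = -\log p_i^{\hat{y}_i} \quad (2)$$
We jointly optimize the contrastive losses and the classification loss. The training objective is:
$$\mathcal{L} = \sum_{i=1}^{n} \left( \mathcal{L}_{\mathrm{ce}}^{i} + \lambda_{\mathrm{pro}} \mathcal{L}_{\mathrm{pro}}^{i} + \lambda_{\mathrm{ins}} \mathcal{L}_{\mathrm{ins}}^{i} \right) \quad (3)$$
For simplicity, we use a single fixed setting of the loss weights $\lambda_{\mathrm{pro}}$ and $\lambda_{\mathrm{ins}}$ for all experiments.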
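As a small worked example of the joint objective described above, the sketch below computes the cross-entropy against pseudo-labels and the weighted sum of the three per-sample losses. The weight arguments `lambda_pro` and `lambda_ins` are left to the caller, and the function names are hypothetical.

```python
import numpy as np

def cross_entropy(probs, pseudo_labels, eps=1e-12):
    """Per-sample -log p_i[y_hat_i], given softmax outputs probs of shape (B, K)."""
    picked = probs[np.arange(len(probs)), pseudo_labels]
    return -np.log(picked + eps)   # eps guards against log(0)

def total_loss(l_ce, l_pro, l_ins, lambda_pro, lambda_ins):
    """Batch sum of L_ce + lambda_pro * L_pro + lambda_ins * L_ins."""
    return float(np.sum(l_ce + lambda_pro * l_pro + lambda_ins * l_ins))
```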
We propose a simple yet effective method for online noise correction during training, which cleans label noise and removes OOD samples. For each sample, we generate a soft pseudo-label $q_i$ by combining the classifier’s output probability $p_i$ with $s_i$, a probability distribution calculated using the sample’s similarity w.r.t. the momentum prototypes:
$$q_i = \alpha\, p_i + (1 - \alpha)\, s_i, \qquad s_i^{k} = \frac{\exp(z_i \cdot c_k / \tau)}{\sum_{k'=1}^{K} \exp(z_i \cdot c_{k'} / \tau)} \quad (4)$$
where the combination weight $\alpha$ is held at a single fixed value in all experiments.
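We convert $q_i$ into a hard pseudo-label $\hat{y}_i$ based on the following rules: (1) if the highest score of $q_i$ is above a certain threshold $T$, use the class with the highest score as the pseudo-label; (2) otherwise, if the score for the original label $y_i$ is higher than the uniform probability $1/K$, use $y_i$ as the pseudo-label; (3) otherwise, label it as an OOD sample. That is,
$$\hat{y}_i = \begin{cases} \arg\max_k q_i^{k} & \text{if } \max_k q_i^{k} > T, \\ y_i & \text{else if } q_i^{y_i} > 1/K, \\ \text{OOD} & \text{otherwise.} \end{cases} \quad (5)$$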
We remove OOD samples from both the cross-entropy loss and the prototypical contrastive loss so that they do not affect class-specific learning, but include them in the instance contrastive loss to further separate them from in-distribution samples. Examples of OOD images and corrected pseudo-labels are shown in the appendices.
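The noise correction rules above can be sketched as a single vectorized function. This is an illustration under stated assumptions: OOD samples are marked with the sentinel `-1` (a convention of this sketch, not of the paper), inputs are NumPy arrays, and `alpha`, `tau`, and `threshold` are supplied by the caller rather than fixed to any particular values.

```python
import numpy as np

OOD = -1  # sentinel used by this sketch for out-of-distribution samples

def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def correct_labels(p, z, prototypes, y, alpha, tau, threshold):
    """Return (hard pseudo-labels, soft pseudo-labels q).

    p: (B, K) classifier probabilities, z: (B, d) normalized embeddings,
    prototypes: (K, d), y: (B,) original noisy labels.
    """
    s = softmax(z @ prototypes.T / tau)           # prototype-similarity distribution
    q = alpha * p + (1.0 - alpha) * s             # soft pseudo-label
    num_classes = q.shape[1]
    confident = q.max(axis=1) > threshold         # rule (1): trust the top class
    keep = q[np.arange(len(y)), y] > 1.0 / num_classes  # rule (2): keep original label
    hard = np.where(confident, q.argmax(axis=1), np.where(keep, y, OOD))
    return hard, q
```

A boolean mask such as `hard != OOD` can then be used to drop OOD samples from the cross-entropy and prototypical losses while leaving them in the instance contrastive loss.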
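For each class $k$, we calculate its momentum prototype as a moving average of the normalized embeddings for samples with pseudo-label $k$. Specifically, we update $c_k$ by:
$$c_k \leftarrow m\, c_k + (1 - m)\, z_i, \quad \forall i \in \{ i \mid \hat{y}_i = k \} \quad (6)$$
Here $m$ is a momentum coefficient, set to 0.999 in our experiments.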
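A minimal sketch of the momentum prototype update follows. Two details are assumptions of this sketch rather than statements from the text: samples marked OOD (sentinel `-1`) are skipped, and prototypes are re-normalized to unit length after the update so that dot products stay comparable to cosine similarities.

```python
import numpy as np

def update_prototypes(prototypes, z, pseudo_labels, m=0.999, ood=-1):
    """Apply c_k <- m * c_k + (1 - m) * z_i for each sample with pseudo-label k.

    prototypes: (K, d), z: (B, d) normalized embeddings, pseudo_labels: (B,).
    """
    prototypes = prototypes.copy()
    for z_i, k in zip(z, pseudo_labels):
        if k == ood:
            continue  # OOD samples carry no class label, so they update nothing
        prototypes[k] = m * prototypes[k] + (1.0 - m) * z_i
    # Re-normalization is an assumption of this sketch; the guard avoids
    # dividing by zero for a prototype that is still all zeros.
    norms = np.linalg.norm(prototypes, axis=1, keepdims=True)
    return prototypes / np.maximum(norms, 1e-12)
```

With a momentum close to 1, each prototype changes slowly, so a handful of mislabeled samples in one batch cannot move it far.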
We use the WebVision (Li et al., 2017) dataset as the noisy source data. It consists of images crawled from Google and Flickr, using visual concepts from ImageNet as queries. We experiment with three versions of WebVision with different sizes: (1) WebVision-V1.0 contains 2.44m images with the same classes as the ImageNet-1k (ILSVRC 2012) dataset; (2) WebVision-V0.5 is a randomly sampled subset of WebVision-V1.0, which contains the same number of images (1.28m) as ImageNet-1k; (3) WebVision-V2.0 contains 16m images with 5k classes.
We follow standard settings for ImageNet training: batch size is 256; total number of epochs is 90; optimizer is SGD with a momentum of 0.9; initial learning rate is 0.1, decayed at 40 and 80 epochs; weight decay is 0.0001. We use ResNet-50 (He et al., 2016) as the encoder. The MoPro-specific hyperparameters are the temperature $\tau$, the combination weight $\alpha$, and the pseudo-label threshold $T$, with a separate setting for WebVision-V2.0. The momentum for both the momentum encoder and momentum prototypes is set as 0.999. The queue to store momentum embeddings has a size of 8192. We apply standard data augmentation (crop and horizontal flip) to the encoder’s input, and stronger data augmentation (color changes in MoCo (He et al., 2019)) to the momentum encoder’s input. We warm up the model for 10 epochs by training on all samples with original labels, before applying noise correction.
| Method | Architecture | WebVision top-1 | WebVision top-5 | ImageNet top-1 | ImageNet top-5 |
| MentorNet (Jiang et al., 2018) | InceptionResNet-V2 | 70.8 | 88.0 | 62.5 | 83.0 |
| CurriculumNet (Guo et al., 2018) | Inception-V2 | 72.1 | 89.1 | 64.8 | 84.9 |
| CleanNet (Lee et al., 2018) | ResNet-50 | 70.3 | 87.8 | 63.4 | 84.6 |
| CurriculumNet (Guo et al., 2018; Tu et al., 2020) | ResNet-50 | 70.7 | 88.6 | 62.7 | 83.4 |
| SOM (Tu et al., 2020) | ResNet-50 | 72.2 | 89.5 | 65.0 | 85.1 |
In Table 1, we compare MoPro with existing weakly-supervised learning methods trained on WebVision-V1.0, where MoPro achieves state-of-the-art performance. Since the training dataset has an imbalanced number of samples per class, inspired by Kang et al. (2020), we perform the following decoupled training steps to re-balance the classifier: (1) pretrain the model with MoPro; (2) perform noise correction on the training data using the pretrained model, following the method in Section 3.3; (3) keep the pretrained encoder fixed and finetune the classifier on the cleaned dataset, using square-root data sampling (Mahajan et al., 2018), which balances the classes. We finetune for 15 epochs, using a learning rate of 0.01 which is decayed at 5 and 10 epochs. Surprisingly, we find that a vanilla cross-entropy method with decoupled classifier re-balancing can also achieve competitive performance, outperforming most existing baselines.
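One common formulation of square-root sampling draws each class with probability proportional to the square root of its sample count, then picks uniformly within the class. The sketch below computes the resulting per-sample probabilities; it is an assumption about the general technique, not a description of the exact sampler used in these experiments, and the function name is hypothetical.

```python
import numpy as np

def sqrt_sampling_weights(labels, num_classes):
    """Per-sample sampling probabilities under square-root class sampling.

    labels: (N,) non-negative integer class ids of the cleaned dataset.
    """
    labels = np.asarray(labels)
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    class_prob = np.sqrt(counts)
    class_prob /= class_prob.sum()
    # Spread each class's probability evenly over the samples of that class.
    return class_prob[labels] / counts[labels]

# Example: a 9-to-1 imbalance becomes a 3-to-1 imbalance at the class level.
labels = np.array([0] * 9 + [1])
weights = sqrt_sampling_weights(labels, num_classes=2)
batch = np.random.default_rng(0).choice(len(labels), size=4, p=weights)
```

Compared with uniform sampling over images, this reduces the dominance of frequent classes without fully equalizing them.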
In this section, we transfer weakly-supervised learned models to a variety of downstream tasks. We show that MoPro yields superior performance in image classification, object detection, and instance segmentation, and is more robust to domain shifts. Implementation details for the transfer learning experiments are described in Appendix C.
First, we transfer the learned representation to downstream tasks with few training samples. We perform low-shot classification on two datasets: PASCAL VOC2007 (Everingham et al., 2010) for object classification and Places205 (Zhou et al., 2014) for scene recognition. Following the setup by Goyal et al. (2019) and Li et al. (2020b), we train linear SVMs using fixed representations from pretrained models. We vary the number of samples per class and report the average result across 5 independent runs. Table 2 shows the results. When pretrained on weakly-labeled datasets, MoPro consistently outperforms the vanilla CE method. The improvement of MoPro becomes less significant when the number of web images increases from 2.4m to 16m, suggesting that increasing dataset size is a viable solution to combat noise.
When compared with ImageNet pretrained models, MoPro substantially outperforms self-supervised learning (MoCo v2 (Chen et al., 2020b) and PCL v2 (Li et al., 2020b)), and achieves performance comparable to supervised learning when the same number of web images (i.e. WebVision-V0.5) is used. Our results show for the first time that weakly-supervised representation learning can be as powerful as supervised representation learning under the same data and computation budget.
Next, we perform an experiment to evaluate whether the pretrained model provides a good basis for finetuning when the downstream task has limited training data. Following the setup by Chen et al. (2020a), we finetune the pretrained model on 1% or 10% of ImageNet training samples. Table 3 shows the results. MoPro consistently outperforms CE when pretrained on weakly-labeled datasets. Compared to self-supervised learning methods pretrained on ImageNet, weakly-supervised learning achieves significantly better performance with fewer epochs.
Surprisingly, pretraining on the larger WebVision-V2 leads to worse performance compared to V0.5 and V1.0. This is because V0.5 and V1.0 contain the same 1k classes as ImageNet, whereas V2 also contains 4k other classes. Hence, the representations learned from V2 are less task-specific and more difficult to adapt to ImageNet, especially with only 1% of samples for finetuning. This suggests that if the classes for a downstream task are known beforehand, it is more effective to curate a task-specific weakly-labeled dataset with similar classes.
Table 3: Low-resource finetuning on ImageNet. Columns: Pretrain method, Pretrain dataset, #Pretrain, Top-1, Top-5.
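Table 4: Object detection and instance segmentation using Mask-RCNN with R50-FPN fine-tuned on COCO train2017. We evaluate bounding-box AP ($AP^{bb}$) and mask AP ($AP^{mk}$) on val2017. Numbers in parentheses are gains over the ImageNet supervised (CE) baseline. Weakly-supervised learning with MoPro outperforms both supervised learning on ImageNet and self-supervised learning (MoCo (He et al., 2019)) on one billion Instagram images.

(a) 1× schedule
| Method | Pretrain dataset | $AP^{bb}$ | $AP^{bb}_{50}$ | $AP^{bb}_{75}$ | $AP^{mk}$ | $AP^{mk}_{50}$ | $AP^{mk}_{75}$ |
| MoPro | WebVision-V1.0 | 39.7 (+0.8) | 60.9 (+1.3) | 43.1 (+0.4) | 36.1 (+0.7) | 57.5 (+1.0) | 38.6 (+0.5) |
| MoPro | WebVision-V2.0 | 40.7 (+1.8) | 61.7 (+2.1) | 44.5 (+1.8) | 36.8 (+1.4) | 58.4 (+1.9) | 39.6 (+1.5) |

(b) 2× schedule
| Method | Pretrain dataset | $AP^{bb}$ | $AP^{bb}_{50}$ | $AP^{bb}_{75}$ | $AP^{mk}$ | $AP^{mk}_{50}$ | $AP^{mk}_{75}$ |
| random | None | 36.7 | 56.7 | 40.0 | 33.7 | 53.8 | 35.9 |
| CE (Sup.) | ImageNet | 40.6 | 61.3 | 44.4 | 36.8 | 58.1 | 39.5 |
| MoCo | Instagram-1B | 41.1 | 61.8 | 45.1 | 37.4 | 59.1 | 40.2 |
| CE | WebVision-V1.0 | 40.9 | 61.6 | 44.7 | 37.2 | 58.7 | 40.1 |
| MoPro | WebVision-V1.0 | 41.2 (+0.6) | 62.2 (+0.9) | 45.0 (+0.6) | 37.4 (+0.6) | 58.9 (+0.8) | 40.3 (+0.8) |
| MoPro | WebVision-V2.0 | 41.8 (+1.2) | 62.6 (+1.3) | 45.6 (+1.2) | 37.8 (+1.0) | 59.5 (+1.4) | 40.6 (+1.1) |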
We further transfer the pretrained model to object detection and instance segmentation tasks on COCO (Lin et al., 2014). Following the setup by He et al. (2019), we use the pretrained ResNet-50 as the backbone for a Mask-RCNN (He et al., 2017) with FPN (Lin et al., 2017). We finetune all layers end-to-end, including BN. The schedule is the default 1× or 2× in Girshick et al. (2018). Table 4 shows the results. Weakly-supervised learning with MoPro outperforms both supervised learning on ImageNet and self-supervised learning on one billion Instagram images.
It has been shown that deep models trained on ImageNet lack robustness to out-of-distribution samples, often falsely producing over-confident predictions. Hendrycks et al. have curated two benchmark datasets to evaluate models’ robustness to real-world distribution variation: (1) ImageNet-R (Hendrycks et al., 2020), which contains various artistic renditions of object classes from the original ImageNet dataset, and (2) ImageNet-A (Hendrycks et al., 2019), which contains natural images where ImageNet-pretrained models consistently fail due to variations in background elements, color, or texture. Both datasets contain 200 classes, a subset of ImageNet’s 1,000 classes.
We evaluate weakly-supervised trained models on these two robustness benchmarks. We report both accuracy and the calibration error (Kumar et al., 2019). The calibration error measures the misalignment between a model’s confidence and its accuracy. Concretely, a well-calibrated classifier that assigns 80% confidence to a set of examples should be correct on 80% of them. Results are shown in Table 5. Models trained on WebVision show significantly higher accuracy and lower calibration error. The robustness to distribution shift could come from the higher diversity of samples in web images. Compared to vanilla CE, MoPro further improves the model’s robustness on both datasets. Note that we made sure that the training data of WebVision does not overlap with the test data.
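To make the notion of calibration concrete, the sketch below computes a simple equal-width binned expected calibration error. This is a widely used simplified variant chosen for illustration; it is not necessarily the estimator of Kumar et al. (2019) used for the reported numbers, and the function name and bin count are assumptions.

```python
import numpy as np

def expected_calibration_error(confidences, correct, num_bins=10):
    """Weighted average gap between accuracy and mean confidence per bin.

    confidences: (N,) max predicted probabilities in [0, 1].
    correct: (N,) 1 if the prediction was right, else 0.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    # Interior edges give bin ids 0..num_bins-1; confidence 1.0 lands in the last bin.
    bins = np.minimum(np.digitize(confidences, edges[1:-1]), num_bins - 1)
    ece = 0.0
    for b in range(num_bins):
        mask = bins == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap   # weight by the fraction of samples in the bin
    return ece
```

For instance, five predictions made at 0.8 confidence of which four are correct give zero error, matching the 80% example in the text.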
Table 5: Accuracy (↑) and calibration error (↓) on ImageNet-R and ImageNet-A.
We perform an ablation study to verify the effectiveness of three important components in MoPro: (1) the prototypical contrastive loss $\mathcal{L}_{\mathrm{pro}}$, (2) the instance contrastive loss $\mathcal{L}_{\mathrm{ins}}$, and (3) the prototypical similarity $s_i$ used for pseudo-labeling (equation 4). We choose low-resource finetuning on 1% of ImageNet training data as the benchmark, and report the top-1 accuracy for models pretrained on WebVision-V0.5. As shown in Table 6, all three components contribute to the efficacy of MoPro.
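| Variant | MoPro | w/o $\mathcal{L}_{\mathrm{pro}}$ | w/o $\mathcal{L}_{\mathrm{ins}}$ | w/o $\mathcal{L}_{\mathrm{pro}}$ & $\mathcal{L}_{\mathrm{ins}}$ | w/o $s_i$ (i.e. $\alpha = 1$) |
| ImageNet top-1 acc. | 69.3 | 68.0 | 68.2 | 66.9 | 68.4 |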
This paper introduces a new contrastive learning framework for webly-supervised representation learning. We propose momentum prototypes, a simple component that is effective in label noise correction, OOD sample removal, and representation learning. MoPro achieves state-of-the-art performance on the upstream task of learning from noisy data, and superior representation learning performance on multiple down-stream tasks. Webly-supervised learning with MoPro does not require the expensive annotation cost in supervised learning, nor the huge computation budget in self-supervised learning. For future work, MoPro could be extended to utilize other sources of free Web data, such as weakly-labeled videos, for representation learning in other domains.
Chen et al. (2019). Understanding and utilizing deep neural networks trained with noisy labels. In ICML, pp. 1062–1070.
Jiang et al. (2020). Beyond synthetic noise: deep learning on controlled noisy labels. In ICML.
Li et al. (2020a). DivideMix: learning with noisy labels as semi-supervised learning. In ICLR.
Zhou et al. (2014). Learning deep features for scene recognition using places database. In NIPS, pp. 487–495.
In Figure 3, we show example images randomly chosen from the out-of-distribution samples filtered out by our method. In Figure 4, we show random examples whose pseudo-labels differ from the original training labels. By visual examination, we observe that our method removes OOD samples and corrects noisy labels with a high success rate.
Algorithm 1 summarizes the proposed method.
For low-shot image classification on Places and VOC, we follow the procedure in Li et al. (2020b) and train linear SVMs on the global average pooling features of ResNet-50. We preprocess all images by resizing to 256 pixels along the shorter side and taking a center crop. The SVMs are implemented in the LIBLINEAR (Fan et al., 2008) package.
For low-resource finetuning on ImageNet, we adopt different finetuning strategies for different versions of WebVision pretrained models. For WebVision V0.5 and V1.0, since they contain the same 1000 classes as ImageNet, we finetune the entire model including the classification layer. We train with SGD, using a batch size of 256, a momentum of 0.9, a weight decay of 0, and a learning rate of 0.005. We train for 40 epochs, and drop the learning rate by 0.2 at 15 and 30 epochs. For WebVision 2.0, since it contains 5000 classes, we randomly initialize a new classification layer with 1000 output dimensions, and finetune the model end-to-end. We train for 50 epochs, using a learning rate of 0.01, which is dropped by 0.1 at 20 and 40 epochs.
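The step schedules described above can be written as a small helper. The sketch reads "drop the learning rate by 0.2" as multiplying the rate by 0.2 at each milestone, which is an interpretation rather than something the text states explicitly; `step_lr` is a hypothetical name.

```python
def step_lr(epoch, base_lr, milestones, gamma):
    """Learning rate at a given epoch, scaled by gamma at each milestone reached."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed

# WebVision V0.5 / V1.0 setting: 40 epochs, base rate 0.005, milestones at 15 and 30.
v1_schedule = [step_lr(e, 0.005, [15, 30], 0.2) for e in range(40)]
# WebVision 2.0 setting: 50 epochs, base rate 0.01, milestones at 20 and 40.
v2_schedule = [step_lr(e, 0.01, [20, 40], 0.1) for e in range(50)]
```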
For object detection and instance segmentation on COCO, we adopt the same setup as MoCo (He et al., 2019), using the Detectron2 (Girshick et al., 2018) codebase. The image scale is in [640, 800] pixels during training and is 800 at inference. We fine-tune all layers end-to-end. We finetune on the train2017 set (118k images) and evaluate on val2017.
The accompanying table reports the standard deviation for the low-shot image classification experiment in Section 5.1.