Leveraging Large-Scale Uncurated Data for Unsupervised Pre-training of Visual Features

05/03/2019 · by Mathilde Caron, et al.

Pre-training general-purpose visual features with convolutional neural networks without relying on annotations is a challenging and important task. Most recent efforts in unsupervised feature learning have focused on either small or highly curated datasets like ImageNet, whereas using uncurated raw datasets was found to decrease the feature quality when evaluated on a transfer task. Our goal is to bridge the performance gap between unsupervised methods trained on curated data, which are costly to obtain, and massive raw datasets that are easily available. To that effect, we propose a new unsupervised approach which leverages self-supervision and clustering to capture complementary statistics from large-scale data. We validate our approach on 96 million images from YFCC100M, achieving state-of-the-art results among unsupervised methods on standard benchmarks, which confirms the potential of unsupervised learning when only uncurated data are available. We also show that pre-training a supervised VGG-16 with our method achieves 74.6% top-1 accuracy on the validation set of ImageNet classification, an improvement of +0.7% over the same network trained from scratch.


1 Introduction

Pre-trained convolutional neural networks, or convnets, are important components of image recognition applications [6, 7, 38, 47]. They improve the generalization of models trained on a limited amount of data [39] and speed up the training on applications where annotated data is abundant [18]. Convnets produce good generic representations when they are pre-trained on large supervised datasets like ImageNet [10]. However, designing such fully-annotated datasets has required a significant effort from the research community in terms of data cleansing and manual labeling. Scaling up the annotation process to datasets that are orders of magnitude bigger raises important difficulties. Using raw metadata as an alternative has been shown to perform comparatively well [21, 41], even surpassing ImageNet pre-training when trained on billions of images [28]. However, metadata are not always available, and when they are, they do not necessarily cover the full extent of a dataset. These difficulties motivate the design of methods that learn transferable features without using any annotations.

Figure 1: Illustration of our method: we alternate between a hierarchical clustering of the features and learning the parameters of a convnet by predicting both the rotation angle and the cluster assignments in a single hierarchical loss.

Recent works describing unsupervised approaches have reported performances that are closing the gap with their supervised counterparts [5, 14, 52]. However, the best performing unsupervised methods are trained on ImageNet, a carefully curated dataset made of images selected to form well-balanced and diversified classes [10]. Simply discarding the labels does not undo this careful selection, as it only removes part of the human supervision. Because of that, previous works that have experimented with uncurated data report a degradation of the quality of features [5, 11]. In this work, we aim at learning good visual representations from unlabeled and uncurated datasets. We focus on the YFCC100M dataset [42], which contains roughly 99 million images from the Flickr photo-sharing website. This dataset is unbalanced, with a “long-tail” distribution of hashtags contrasting with the well-behaved label distribution of ImageNet. For example, guenon and baseball both correspond to well-populated labels in ImageNet, while the corresponding hashtags are associated with vastly different numbers of images in YFCC100M. Our goal is to understand whether trading manually-curated data for scale leads to an improvement in feature quality.

To that effect, we propose a new unsupervised approach designed to work on large-scale uncurated data. Our method is inspired by two domains of unsupervised learning: self-supervised learning and clustering. Self-supervised learning [9] consists in designing a pretext task by predicting pseudo-labels automatically extracted from input signals [11, 14]. These target labels are designed to induce certain properties in the learned representations, making these approaches robust to changes in the training set statistics. Yet, they may also limit the expressivity of the resulting features, since the pretext task is only a proxy for the subsequent tasks on which visual features will be used. On the other hand, clustering-based approaches infer target labels at the same time as features are learned [2, 5, 51]. As a consequence, target labels evolve during training, making clustering-based approaches inherently more unstable but capable of capturing important information from the dataset. The novelty of our method lies in the combination of these two paradigms so that they benefit from one another. Our approach automatically generates targets by clustering the features of the entire dataset, under constraints derived from self-supervision. Due to the “long-tail” distribution of raw uncurated data, processing huge datasets and learning a large number of targets is necessary, making the problem challenging from a computational point of view. For this reason, we propose a hierarchical formulation that is suitable for distributed training. This enables the discovery of latent categories present in the “tail” of the image distribution. While our framework is general, in practice we focus on combining the large rotation classification task of Gidaris et al [14] with the clustering approach of Caron et al [5]. As we increase the number of training images, the quality of features improves to the point where it surpasses those trained without labels on curated datasets. More importantly, we evaluate the quality of our approach as a pre-training step for ImageNet classification.
Pre-training a supervised VGG-16 with our unsupervised approach leads to a top-1 accuracy of 74.6%, an improvement of +0.7% over a model trained from scratch. This shows the potential of unsupervised pre-training on large uncurated datasets as a way to improve the quality of visual features.

In this paper, we make the following contributions: (i) a novel training process that mixes feature learning methods based on clustering and self-supervision; (ii) evidence that unsupervised pre-training on large uncurated datasets improves the quality of visual features subsequently trained on ImageNet; (iii) state-of-the-art performance on standard evaluation benchmarks for unsupervised methods; (iv) a distributed implementation that scales to tens of millions of images and hundreds of thousands of clusters.

2 Related Work


Self-supervised learning builds a pretext task from the input signal to train a model without annotation [9]. Many pretext tasks have been proposed [20, 29, 45, 49], exploiting, amongst others, spatial context [11, 22, 31, 32, 34], cross-channel prediction [25, 26, 53, 54], or the temporal structure of videos [1, 33, 44]. Some pretext tasks explicitly encourage the representations to be either invariant or discriminative to particular types of input transformations. For example, Dosovitskiy et al [12] consider each image and its transformations as a class to enforce invariance to data transformations. In this paper, we build upon the work of Gidaris et al [14], where the model encourages features to be discriminative for large rotations. Recently, Kolesnikov et al [23] have conducted an extensive evaluation of self-supervised learning methods by benchmarking them with different convnet architectures. As opposed to our work, they use curated datasets for pre-training, namely ImageNet [10] and Places205 [55].

Deep clustering.

Clustering, along with density estimation and dimensionality reduction, is a family of standard unsupervised learning methods. Various attempts have been made to train convnets using clustering [2, 5, 27, 46, 50, 51]. Our paper builds upon the work of Caron et al [5], in which k-means is used to cluster the visual representations. As opposed to our work, they mainly focus on training their approach using ImageNet without labels. Recently, Noroozi et al [32] showed that clustering can also be used as a form of distillation to improve the performance of networks trained with self-supervision. As opposed to our work, they use clustering only as a post-processing step and do not leverage the complementarity between clustering and self-supervision to further improve the quality of features.

Learning on uncurated datasets.

Some methods [8, 16, 30] aim at learning visual features from uncurated data streams. They typically use metadata such as hashtags [21, 41] or geolocalization [48] as a source of noisy supervision. In particular, Mahajan et al [28] train a network to classify billions of Instagram images into predefined and clean sets of hashtags. They show that with little human effort, it is possible to learn features that transfer well to ImageNet, even achieving state-of-the-art performance if finetuned. As opposed to our work, they use an extrinsic source of supervision that had to be cleaned beforehand.

3 Preliminaries

In this work, we refer to the vector obtained at the penultimate layer of the convnet as a feature or representation. We denote by f_θ the feature-extracting function, parametrized by a set of parameters θ. Given a set of N images {x_1, …, x_N}, our goal is then to learn a “good” mapping f_θ. By “good”, we mean a function that produces general-purpose visual features that are useful on downstream tasks. Several types of unsupervised approaches have been developed to this end. Our approach is an attempt to combine two of them, self-supervision and clustering, which we describe in more detail in the following sections.

3.1 Self-supervision

In self-supervised learning, a pretext task is used to extract target labels directly from data [11]. These targets can take a variety of forms. They can be categorical labels associated with a multiclass problem, as when predicting the transformation of an image [14, 52] or the ordering of a set of patches [31]. Or they can be continuous variables associated with a regression problem, as when predicting image color [53] or surrounding patches [34]. In this work, we are interested in the former. We suppose that we are given a set of N images {x_1, …, x_N} and we assign a pseudo-label y_n in {1, …, K} to each input x_n. Given these pseudo-labels, we learn the parameters θ of the convnet jointly with a linear classifier V to predict pseudo-labels by solving the problem

    min_{θ, V} (1/N) Σ_{n=1}^{N} ℓ(y_n, V f_θ(x_n)),     (1)

where ℓ is a loss function. The pseudo-labels (y_n) are fixed during the optimization, and the quality of the learned features entirely depends on their relevance.

Rotation as self-supervision.

Gidaris et al [14] have recently shown that good features can be obtained when training a convnet to discriminate between different image rotations. In this work, we focus on their pretext task since its performance on standard evaluation benchmarks is among the best in self-supervised learning. This pretext task corresponds to a multiclass classification problem with four categories: rotations in {0°, 90°, 180°, 270°}. Each input x_n in Eq. (1) is randomly rotated and associated with a target y_n that represents the angle of the applied rotation.
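The rotation pretext task is straightforward to reproduce. The sketch below is a hypothetical minimal illustration (not the authors' code): it builds the four rotated copies of an image, where pseudo-label k encodes a rotation of 90·k degrees.

```python
def rotate90(img):
    """Rotate a 2D image (a list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def rotation_pretext_examples(img):
    """Return the four (rotated image, pseudo-label) training pairs;
    pseudo-label k stands for a rotation of 90*k degrees."""
    examples, rotated = [], img
    for k in range(4):
        examples.append((rotated, k))
        rotated = rotate90(rotated)
    return examples

img = [[1, 2],
       [3, 4]]
examples = rotation_pretext_examples(img)
# label 1 corresponds to the image rotated by 90 degrees
assert examples[1] == ([[3, 1], [4, 2]], 1)
```

In practice the rotated copies are fed to the convnet and a 4-way classifier is trained to recover k.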

3.2 Deep clustering

Clustering-based approaches for deep networks typically build target classes by clustering visual features produced by convnets. As a consequence, the targets are updated during training along with the representations and are potentially different at each epoch. In this context, we define a latent pseudo-label z_n in {1, …, k} for each image x_n, as well as a corresponding linear classifier W. These clustering-based methods alternate between learning the parameters θ and W and updating the pseudo-labels z_n. Between two reassignments, the pseudo-labels are fixed, and the parameters and classifier are optimized by solving

    min_{θ, W} (1/N) Σ_{n=1}^{N} ℓ(z_n, W f_θ(x_n)),     (2)

which is of the same form as Eq. (1). Then, the pseudo-labels can be reassigned by minimizing an auxiliary loss function. This loss sometimes coincides with Eq. (2) [2, 50], but some works propose to use another objective [5, 51].

Updating the targets with k-means.

In this work, we focus on the framework of Caron et al [5], where latent targets are obtained by clustering the activations with k-means. More precisely, the targets are updated by solving the following optimization problem:

    min_{C ∈ R^{d×k}} (1/N) Σ_{n=1}^{N} min_{z_n ∈ {0,1}^k, z_nᵀ1 = 1} ‖C z_n − f_θ(x_n)‖²,     (3)

where C is the d × k matrix whose columns are the centroids, k is the number of centroids, and z_n is a binary vector with a single non-zero entry. This approach assumes that the number of clusters k is known a priori; in practice, we set it by validation on a downstream task (see Sec. 5.3). The latent targets are updated every T epochs of stochastic gradient descent steps when minimizing the objective (2).

Note that this alternate optimization scheme is prone to trivial solutions, and controlling the way the optimization procedures of both objectives interact is crucial. Re-assigning empty clusters and sampling batches from a uniform distribution over the cluster assignments are workarounds to avoid trivial parametrizations [5].
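The clustering step of this alternate scheme, including the empty-cluster workaround, can be sketched in a few lines. This is a hypothetical single-process illustration (not the paper's distributed implementation): each feature is assigned to its nearest centroid, centroids are recomputed as cluster means, and any cluster that becomes empty is re-seeded from the largest cluster.

```python
import math

def assign(points, centroids):
    """Assign each point to the index of its nearest centroid."""
    return [min(range(len(centroids)),
                key=lambda j: math.dist(p, centroids[j]))
            for p in points]

def update_centroids(points, labels, k):
    """Mean of each cluster; an empty cluster is re-seeded with a point
    taken from the largest cluster (the workaround mentioned above)."""
    clusters = [[p for p, l in zip(points, labels) if l == j] for j in range(k)]
    biggest = max(clusters, key=len)
    centroids = []
    for c in clusters:
        if not c:  # empty cluster: steal a point from the biggest one
            c = [biggest[len(centroids) % len(biggest)]]
        dim = len(c[0])
        centroids.append([sum(p[d] for p in c) / len(c) for d in range(dim)])
    return centroids

points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
centroids = [(0.0, 0.0), (1.0, 1.0)]
for _ in range(5):  # alternate assignment and centroid updates
    labels = assign(points, centroids)
    centroids = update_centroids(points, labels, k=2)
assert labels == [0, 0, 1, 1]
```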


4 Method

In this section, we describe how we combine self-supervised learning with deep clustering in order to scale up to large numbers of images and targets.

4.1 Combining self-supervision and clustering

We assume that the inputs are rotated images, each associated with a target label y_n encoding its rotation angle and a cluster assignment z_n. The cluster assignment changes during training along with the visual representations. We denote by Y the set of possible rotation angles and by Z the set of possible cluster assignments. A way of combining self-supervision with deep clustering is to add the losses defined in Eq. (1) and Eq. (2). However, summing these losses implicitly assumes that classifying rotations and cluster memberships are two independent tasks, which may limit the signal that can be captured. Instead, we work with the Cartesian product space Y × Z, which can potentially capture richer interactions between the two tasks. This leads to the following optimization problem:

    min_{θ, W} (1/N) Σ_{n=1}^{N} ℓ((y_n, z_n), W f_θ(x_n)).     (4)

Note that any clustering or self-supervised approach with a multiclass objective can be combined with this formulation. For example, we could use a self-supervision task that captures information about tile permutations [31] or frame ordering in a video [44]. However, this formulation does not scale in the number of combined targets, i.e., its complexity is O(|Y| × |Z|). This limits the use of a large number of clusters or of a self-supervised task with a large output space [52]. In particular, if we want to capture information contained in the tail of the distribution of an uncurated dataset, we may need a large number of clusters. We thus propose an approximation of our formulation based on a scalable hierarchical loss that is designed to suit distributed training.
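Concretely, working in the product space amounts to flattening the pair (rotation label, cluster label) into a single class index, so the joint classifier needs |Y| × |Z| outputs. A hypothetical sketch of this bookkeeping (the tiny cluster count is for illustration only):

```python
N_ROTATIONS = 4   # |Y|: rotation classes
N_CLUSTERS = 10   # |Z|: cluster assignments (tiny here for illustration)

def joint_label(rotation, cluster):
    """Flatten a (rotation, cluster) pair into one index of the product space."""
    return rotation * N_CLUSTERS + cluster

def split_label(joint):
    """Recover the (rotation, cluster) pair from the flat index."""
    return divmod(joint, N_CLUSTERS)

# The joint classifier needs |Y| * |Z| outputs, which is why this
# formulation does not scale to very large numbers of clusters.
n_outputs = N_ROTATIONS * N_CLUSTERS
assert n_outputs == 40
assert split_label(joint_label(3, 7)) == (3, 7)
```

With hundreds of thousands of clusters, this flat output layer becomes the bottleneck, which motivates the hierarchical loss of the next section.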

4.2 Scaling up to large number of targets

Hierarchical losses are commonly used in language modeling where the goal is to predict a word out of a large vocabulary [4]. Instead of making one decision over the full vocabulary, these approaches split the process in a hierarchy of decisions, each with a smaller output space. For example, the vocabulary can be split into clusters of semantically similar words, and the hierarchical process would first select a cluster and then a word within this cluster.

Following this line of work, we partition the target labels into a 2-level hierarchy where we first predict a super-class and then a sub-class among its associated target labels. The first level is a partition of the images into S super-classes; we denote by y_n the super-class assignment vector in {0,1}^S of image x_n, and by y_{ns} the s-th entry of y_n. This super-class assignment is made with a linear classifier V on top of the features. The second level of the hierarchy is obtained by partitioning within each super-class. We denote by z_n^s the vector in {0,1}^{k_s} of the assignment into k_s sub-classes for an image n belonging to super-class s. There are S sub-class classifiers W_1, …, W_S, each predicting the sub-class memberships within a super-class s. The parameters of the linear classifiers V and (W_s) are jointly learned with θ by minimizing the following loss function:

    L(θ, V, W_1, …, W_S) = (1/N) Σ_{n=1}^{N} [ ℓ(y_n, V f_θ(x_n)) + Σ_{s=1}^{S} y_{ns} ℓ(z_n^s, W_s f_θ(x_n)) ],     (5)

where ℓ is the negative log-softmax function. Note that an image that does not belong to super-class s does not belong to any of its sub-classes either.
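The two-level loss above can be sketched as follows. This is a hypothetical toy version operating directly on classifier scores, with ℓ the negative log-softmax as in Eq. (5): the super-class term is always computed, while only the sub-class term of the image's own super-class (the y_{ns} indicator) contributes.

```python
import math

def neg_log_softmax(scores, label):
    """Negative log-softmax loss of the correct label given raw scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[label]

def hierarchical_loss(super_scores, sub_scores_per_super, super_label, sub_label):
    """Two-level loss of Eq. (5) for one image: super-class term plus the
    sub-class term of the single super-class the image belongs to."""
    return (neg_log_softmax(super_scores, super_label)
            + neg_log_softmax(sub_scores_per_super[super_label], sub_label))

# With uniform scores over 2 super-classes of 2 sub-classes each,
# the loss is log(2) + log(2).
loss = hierarchical_loss([0.0, 0.0], [[0.0, 0.0], [0.0, 0.0]], 1, 0)
assert abs(loss - 2 * math.log(2)) < 1e-9
```

The cost per example is O(S + k_s) instead of O(S · k_s) for the flat product space, which is what makes very large numbers of clusters tractable.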

Choice of super-classes.

A natural partition would be to define the super-classes based on the target labels from the self-supervised task and the sub-classes as the labels produced by clustering. However, this would mean that each image of the entire dataset (96M images for YFCC100M) would be present in each super-class (with a different rotation), which is problematic for distributed training.

Instead, we split the dataset into m sets by running k-means with m centroids on the entire dataset every T epochs. We then use the Cartesian product between the assignment to these m clusters and the 4 rotation classes to form the super-classes. There are thus S = 4m super-classes, each associated with the subset of data belonging to the corresponding cluster (N/m images if the clustering is perfectly balanced). These subsets are then further split with k-means into k sub-classes each. This is equivalent to running a hierarchical k-means with rotation constraints on the full dataset to form our hierarchical loss, for a total of m × k different clusters split in m subsets. Our approach shares similarities with DeepCluster, but we replace their k-means loss by a deeper clustering loss (“DeeperCluster”) that is designed to scale to larger datasets. Figure 1 summarizes DeeperCluster: we alternate between clustering the features of the non-rotated images and training the network to predict both the rotation applied to the input data and its cluster assignment amongst the clusters corresponding to this rotation.

Distributed training.

Building the super-classes based on data splits lends itself to a distributed implementation that scales well in the number of images. Specifically, when optimizing Eq. (5), we form as many distributed communication groups of GPUs as the number of super-classes, i.e., S = 4m. Different communication groups share the convnet parameters θ and the super-class classifier V, while the parameters W_s of a sub-class classifier are only shared within its communication group. Each communication group s deals only with the subset of images and the rotation angle associated with the super-class s.

Distributed k-means.

Every T epochs, we recompute the super- and sub-class assignments by running two consecutive k-means on the entire dataset. This is achieved by first randomly splitting the dataset across different GPUs. Each GPU is in charge of computing cluster assignments for its partition, whereas centroids are updated across GPUs.

We reduce communication between GPUs by sharing only, for each cluster, the number of assigned elements and the sum of their features. The new centroids are then computed from these statistics. We observe empirically that k-means converges within a few iterations. We first cluster the 96M features into the m top-level clusters using the full pool of GPUs. Then, we split this pool of GPUs into m groups; each group clusters the features of one subset into k sub-classes.
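The communication-saving trick (exchanging only per-cluster counts and feature sums rather than raw features) can be illustrated with a small sketch. This is a hypothetical single-process simulation of the reduction step, not the actual multi-GPU code:

```python
def local_stats(points, labels, k, dim):
    """Per-cluster (count, feature sum) computed by one worker."""
    counts = [0] * k
    sums = [[0.0] * dim for _ in range(k)]
    for p, l in zip(points, labels):
        counts[l] += 1
        for d in range(dim):
            sums[l][d] += p[d]
    return counts, sums

def reduce_and_update(all_stats, k, dim):
    """Sum the statistics from every worker and derive the new centroids."""
    counts = [0] * k
    sums = [[0.0] * dim for _ in range(k)]
    for c, s in all_stats:
        for j in range(k):
            counts[j] += c[j]
            for d in range(dim):
                sums[j][d] += s[j][d]
    return [[sums[j][d] / counts[j] for d in range(dim)] if counts[j] else None
            for j in range(k)]

# Two "GPUs", each holding half of the data assigned to 2 clusters.
w1 = local_stats([(0.0,), (2.0,)], [0, 1], k=2, dim=1)
w2 = local_stats([(1.0,), (4.0,)], [0, 1], k=2, dim=1)
centroids = reduce_and_update([w1, w2], k=2, dim=1)
assert centroids == [[0.5], [3.0]]
```

Only k counts and k feature sums cross the network per iteration, independently of the number of images each worker holds.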

4.3 Implementation details

The loss in Eq. (5) is minimized with mini-batch stochastic gradient descent [3]. Each mini-batch is distributed across GPUs, following the large-batch training recipe of [17]. Batch normalization statistics are computed on the full mini-batch. We use dropout, weight decay, momentum and a constant learning rate. We reassign clusters every T epochs. We use the Pascal VOC classification task without finetuning as a downstream task to select hyper-parameters. In order to speed up convergence and experimentation, we initialize our method with a network pre-trained with RotNet on the YFCC100M dataset. Before clustering, we whiten the activations and ℓ2-normalize each of them. Images are rescaled and we use standard data augmentations, i.e., crops of random sizes and aspect ratios and horizontal flips [24]. We use the VGG-16 architecture [40] with batch normalization layers. Following [2, 5, 35], we pre-process images with a Sobel filtering. We train our models on the 96M images from YFCC100M [42] that we managed to download. We use YFCC100M in this paper for research purposes only. As the dataset is entirely publicly available, it permits replication by other research teams.

5 Experiments

In this section we evaluate the quality of the features learned with DeeperCluster on a variety of downstream tasks, such as classification or object detection. We also provide insights about the impact of the number of images and clusters on the performance of our model.

5.1 Evaluating unsupervised features

We evaluate the quality of the features extracted from a convnet trained with DeeperCluster on YFCC100M by considering several standard transfer learning tasks, namely image classification, object detection and scene classification.

Method | Data | Classif. fc6-8 | Classif. all | Detect. fc6-8 | Detect. all
ImageNet labels | INet
Unsupervised on curated data:
Larsson et al [26] | INet+Pl.
Wu et al [49] | INet
Doersch et al [11] | INet
Caron et al [5] | INet
Unsupervised on uncurated data:
Mahendran et al [29] | YFCCv
Wang and Gupta [44] | YT8M
Wang et al [45] | YT9M
DeeperCluster | YFCC
Table 1: Comparison of the proposed approach to state-of-the-art unsupervised feature learning on classification and detection on Pascal VOC 2007. We disassociate methods using curated datasets, like ImageNet, from methods using uncurated datasets. We selected hyper-parameters for each transfer task on the validation set, and then retrained on both training and validation sets. We report results on the test set averaged over several runs. “YFCCv” stands for the videos contained in the YFCC100M dataset. Numbers for other methods are taken from their original papers.

Pascal VOC 2007 [13].

This dataset has small training and validation sets (about 2.5k images each), making it close to the setting of real applications where models trained using large computational resources are adapted to a new task with a small number of instances. We report numbers on the classification and detection tasks with finetuning (“all”) or by only retraining the last two fully connected layers of the network (“fc6-8”). For classification, we use the code of Caron et al [5] (github.com/facebookresearch/deepcluster) and for detection, fast-rcnn [15] (github.com/rbgirshick/py-faster-rcnn). For classification, we train the models with a learning rate decayed by a fixed factor at regular intervals, and we report results averaged over random crops. For object detection, we finetune our network with a step-decayed learning rate (with different initial values for the fc6-8 and all settings) and weight decay. Following Doersch et al [11], we use the multiscale configuration, with multiple scales for training and for testing.

We report results of classification and detection in both settings (all and fc6-8) in Table 1. The fc6-8 setting gives a better measure of the quality of the evaluated features since fewer parameters are retrained. We compare DeeperCluster with two sets of unsupervised methods that use a VGG-16 network: those trained on curated datasets and those trained on uncurated datasets. Previous unsupervised methods that worked on uncurated datasets with a VGG-16 use videos: YouTube8M (“YT8M”), YouTube9M (“YT9M”) or the videos from YFCC100M (“YFCCv”). Our approach achieves state-of-the-art performance among all the unsupervised methods that use a VGG-16 architecture, even those that use ImageNet as a training set. The gap with a supervised network is still important when we freeze the convolutions, but drops substantially for both tasks with finetuning.

Linear classifiers on ImageNet [10] and Places205 [55].

ImageNet (“INet”) and Places205 (“Pl.”) are two large-scale image classification datasets: ImageNet covers objects and animals with 1.3M images, and Places205 covers indoor and outdoor scenes with 2.5M images. We train linear classifiers with a logistic loss on top of frozen convolutional layers at different depths. To reduce the influence of feature dimension in the comparison, we average-pool the features until their dimension is below a fixed threshold [53]. This experiment probes the quality of the features extracted at each convolutional layer.
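The dimensionality control amounts to spatially average-pooling each layer's feature map until the flattened dimension falls below a cap. A hypothetical sketch of the bookkeeping (the 10k cap is an assumed value for illustration, not the paper's exact threshold):

```python
def pooled_dimension(channels, height, width, cap):
    """Apply 2x2 average pooling (conceptually) until channels*h*w <= cap,
    and return the resulting flattened feature dimension."""
    while channels * height * width > cap and min(height, width) > 1:
        height, width = max(1, height // 2), max(1, width // 2)
    return channels * height * width

# e.g. a 512 x 14 x 14 feature map pooled under an assumed 10k-dimension cap:
assert pooled_dimension(512, 14, 14, 10_000) == 512 * 3 * 3
```

This keeps the linear classifiers comparable across layers whose raw activations differ by orders of magnitude in size.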

Figure 2 shows the evolution of classification performance along the layers on ImageNet and Places205. On Places205, DeeperCluster matches the performance of a supervised network for all layers. On ImageNet, it also matches supervised features in the earlier convolutional blocks; deeper in the network, the gap increases noticeably. This is not surprising, since the supervised features are trained on ImageNet while ours are trained on YFCC100M. This means that the “low-level features” captured by DeeperCluster on YFCC100M retain information that is not captured by a network trained on ImageNet with supervision.

Figure 2: Accuracy of linear classifiers on ImageNet and Places205 using the activations from different layers as features. We train a linear classifier on top of frozen convolutional layers at different depths. We compare a VGG-16 trained with supervision on ImageNet to a VGG-16 trained with either RotNet or DeeperCluster on YFCC100M.

5.2 Pre-training for ImageNet

In the previous section, we observed that a VGG-16 trained with DeeperCluster on YFCC100M has better low-level features than the same network trained on ImageNet with supervision. In this experiment, we want to check whether these low-level features pre-trained on YFCC100M can serve as a good initialization for fully-supervised ImageNet classification. To this end, we pre-train a VGG-16 on YFCC100M using either DeeperCluster or RotNet. The resulting weights are then used as the initialization for training a network on ImageNet with supervision. We merge the Sobel weights of a network pre-trained with DeeperCluster into the first convolutional layer during the initialization. We then train the networks on ImageNet with mini-batch SGD, using weight decay, dropout, and a learning rate reduced by a fixed factor at several points during training.

In Table 2, we compare the performance of a network trained from a standard initialization (“Supervised”) to one initialized with weights pre-trained on YFCC100M with either DeeperCluster (“Supervised + DeeperCluster pre-training”) or RotNet (“Supervised + RotNet pre-training”). We report top-1 and top-5 accuracy on the validation set of ImageNet. We see that our pre-training improves the performance of a supervised network by +0.7%, leading to 74.6% top-1 accuracy. This means that our pre-training captures important statistics from YFCC100M that transfer well to ImageNet. Note that we show the results at convergence, instead of after a standard shorter training schedule. Showing results before convergence gives an unfair advantage to pre-trained networks, since they start from a better initialization. We refer the reader to the appendix for the performance earlier in training.

ImageNet | top-1 | top-5
Supervised
Supervised + RotNet pre-training
Supervised + DeeperCluster pre-training
Table 2: Accuracy on the validation set of ImageNet classification for a supervised VGG-16 trained with different initializations: we compare a network trained from a standard initialization to networks trained from weights pre-trained with either DeeperCluster or RotNet on YFCC100M.

5.3 Model analysis

In this final set of experiments, we analyze some components of our model. Since DeeperCluster derives from RotNet and DeepCluster, we first look at the difference between these methods and ours, when trained on curated and uncurated datasets. We then report quantitative and qualitative evaluations of the clusters obtained with DeeperCluster.

Comparison with RotNet and DeepCluster.

In Table 3, we compare DeeperCluster with DeepCluster and RotNet when a linear classifier is trained on top of the last convolutional layer of a VGG-16, on several datasets. For reference, we also report previously published numbers [49] obtained with a VGG-16 architecture. We do not perform any finetuning or layer selection. We average-pool the features of the last convolutional layer. Our approach outperforms both RotNet and DeepCluster, even when they are trained on curated datasets (except for the ImageNet classification task, where DeepCluster trained on ImageNet yields the best performance). More interestingly, we see that the quality of the dataset and its scale have little impact on RotNet, while they have a significant impact on DeepCluster. This confirms that self-supervised methods are more robust than clustering to a change of dataset distribution.

Method | Data | ImageNet | Places | VOC2007
Supervised | ImageNet
Wu et al [49] | ImageNet | | | -
RotNet | ImageNet
DeepCluster | ImageNet
RotNet | YFCC100M
DeepCluster | YFCC100M
DeeperCluster | YFCC100M
Table 3: Comparison between DeeperCluster, RotNet and DeepCluster when pre-trained on curated and uncurated datasets. We report the accuracy on several datasets of a linear classifier trained on top of the features of the last convolutional layer. All the methods use the same architecture. DeepCluster does not scale to the full YFCC100M dataset; we thus train it on a random subset of YFCC100M images.
Figure 3: Influence of the amount of data (left) and the number of clusters (right) on feature quality. We report validation mAP on the Pascal VOC classification task (fc6-8 setting).
Figure 4: Normalized mutual information between our clusterings and different sorts of metadata: hashtags, user IDs, geographic coordinates, and device types. We also plot the NMI with an ImageNet classifier labeling.
Figure 5: Cluster visualisation: we randomly select images for each cluster and indicate the dominant metadata of the cluster (e.g., tag: cat; tag: elephantparadelondon; tag: always; device: CanoScan; and four GPS-localized clusters). In the bottom row we show clusters that are pure for GPS coordinates but impure for user IDs. Not surprisingly, they turn out to correlate with tourist landmarks. Note that no metadata is used during training.

Influence of dataset size and number of clusters.

To measure the influence of the number of images on the features, we train models on increasingly large subsets of YFCC100M for the same number of epochs and report their accuracy on the validation set of the Pascal VOC 2007 classification task (fc6-8 setting). We also train models on a fixed subset of images with a varying number of clusters. For the experiment with the largest total number of clusters, we increase m, which results in more super-classes. The results are shown in Fig. 3.

We observe that the quality of our features improves when scaling both in the number of images and in the number of clusters. As expected, increasing the number of images has a bigger impact than increasing the number of clusters. Yet, the improvement from more clusters is still significant, since it corresponds to a substantial reduction of the relative error w.r.t. the supervised model.

Quality of the clusters.

In addition to features, our method also provides a clustering of the input images. We evaluate the quality of these clusters by measuring their correlation with existing partitions of the data. In particular, YFCC100M comes with many different types of metadata, and we consider a few: hashtags, user IDs, camera devices and GPS coordinates. If an image has several hashtags, we pick the hashtag that is least frequent in the overall hashtag distribution and treat it as its label. We also measure the correlation of our clusters with labels predicted by a classifier trained on ImageNet categories. We use a ResNet network [19], pre-trained on ImageNet, to classify the YFCC100M images, and we select those for which the confidence in the prediction is higher than a threshold. This evaluation omits a large amount of the data, but gives some insight about the quality of our clustering for object classification.

In Figure 4, we show the evolution during training of the normalized mutual information (NMI) between our clustering and the different metadata, as well as the labels predicted from ImageNet. The higher this measure, the more correlated our clusters are with the considered partition. For reference, we also compute the NMI for a clustering of RotNet features and of features from a supervised model. First, it is interesting to observe that our clustering improves over time for every type of metadata. One important factor is that most of these sources of metadata are correlated: a given user tends to take pictures in specific places, probably with a single camera, and to use a preferred fixed set of hashtags. Nonetheless, these plots show that our model captures enough information from the input signal to predict these metadata at least as well as features trained with supervision.
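The NMI used in these plots can be computed directly from the two label assignments. Below is a minimal NumPy sketch using the geometric-mean normalization I(A;B)/sqrt(H(A)H(B)); the function name and implementation are ours, not from the paper (libraries such as scikit-learn provide an equivalent `normalized_mutual_info_score`):

```python
import numpy as np

def nmi(labels_a, labels_b):
    """Normalized mutual information between two labelings,
    I(A;B) / sqrt(H(A) * H(B)). A value of 1 means the two partitions
    are identical up to a relabeling; 0 means they are independent."""
    a = np.asarray(labels_a)
    b = np.asarray(labels_b)
    n = len(a)
    _, ia = np.unique(a, return_inverse=True)
    _, ib = np.unique(b, return_inverse=True)
    # joint contingency table of the two labelings
    joint = np.zeros((ia.max() + 1, ib.max() + 1))
    np.add.at(joint, (ia, ib), 1)
    pxy = joint / n
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    mi = (pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum()
    hx = -(px[px > 0] * np.log(px[px > 0])).sum()
    hy = -(py[py > 0] * np.log(py[py > 0])).sum()
    return mi / np.sqrt(hx * hy) if hx > 0 and hy > 0 else 0.0
```

For instance, comparing a cluster assignment with the least-frequent-hashtag labels described above would be `nmi(cluster_ids, hashtag_labels)`.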

Qualitative results.

We visually assess the consistency of our clusters in Figure 5, where we display random images from manually picked clusters. The first two clusters contain a majority of images associated with a tag from the head (first cluster) and from the tail (second cluster) of the YFCC100M tag distribution. Indeed, many YFCC100M images are associated with the tag cat, whereas only a few contain the tag elephantparadelondon ( of the dataset). We also show a cluster for which the dominant hashtag does not correlate visually with the content of the cluster. As already mentioned, this database is uncurated and contains images that do not depict anything semantic. The dominant metadata of the last cluster in the first row is the device ID CanoScan: as this cluster is about drawings, its images have mostly been acquired with a scanner. Finally, the bottom row depicts clusters that are pure for GPS coordinates but impure for user IDs; this results in clusters of images taken by many different users in the same place, in other words, tourist landmarks.

6 Conclusion

In this paper, we present an unsupervised approach that combines self-supervision with clustering, leading to improvements in feature quality over both approaches alone. Our method is designed for distributed training, which allows training convnets on uncurated datasets of a hundred million images. We show that, with such an amount of data, our approach surpasses unsupervised methods trained on curated datasets, which validates the potential of unsupervised learning in applications where annotations are scarce or curation is not trivial. Finally, we show that unsupervised pre-training improves the performance of a network trained on ImageNet with a VGG-16 architecture. In future work, we plan to validate this finding on more recent architectures [19] and on larger datasets.


We thank Thomas Lucas, Matthijs Douze, Francisco Massa, Ishan Misra, Priya Goyal, Abhinav Gupta, Laurens van der Maaten and Yann LeCun along with the rest of the THOTH and FAIR teams for their feedback and fruitful discussion around this paper. Julien Mairal is funded by the ERC grant number 714381 (SOLARIS project).


  • [1] P. Agrawal, J. Carreira, and J. Malik. Learning to see by moving. In Proceedings of the International Conference on Computer Vision (ICCV), 2015.
  • [2] P. Bojanowski and A. Joulin. Unsupervised learning by predicting noise. In Proceedings of the International Conference on Machine Learning (ICML), 2017.
  • [3] L. Bottou. Stochastic gradient descent tricks. In Neural networks: Tricks of the trade, pages 421–436. Springer, 2012.
  • [4] P. F. Brown, P. V. Desouza, R. L. Mercer, V. J. D. Pietra, and J. C. Lai. Class-based n-gram models of natural language. Computational linguistics, 18(4):467–479, 1992.
  • [5] M. Caron, P. Bojanowski, A. Joulin, and M. Douze. Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
  • [6] J. Carreira, P. Agrawal, K. Fragkiadaki, and J. Malik. Human pose estimation with iterative error feedback. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [7] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. arXiv preprint arXiv:1606.00915, 2016.
  • [8] X. Chen and A. Gupta. Webly supervised learning of convolutional networks. In Proceedings of the International Conference on Computer Vision (ICCV), 2015.
  • [9] V. R. de Sa. Learning classification with unlabeled data. In Advances in Neural Information Processing Systems (NIPS), 1994.
  • [10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
  • [11] C. Doersch, A. Gupta, and A. A. Efros. Unsupervised visual representation learning by context prediction. In Proceedings of the International Conference on Computer Vision (ICCV), 2015.
  • [12] A. Dosovitskiy, P. Fischer, J. T. Springenberg, M. Riedmiller, and T. Brox. Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE transactions on pattern analysis and machine intelligence, 38(9):1734–1747, 2016.
  • [13] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338, 2010.
  • [14] S. Gidaris, P. Singh, and N. Komodakis. Unsupervised representation learning by predicting image rotations. In International Conference on Learning Representations (ICLR), 2018.
  • [15] R. Girshick. Fast r-cnn. In Proceedings of the International Conference on Computer Vision (ICCV), 2015.
  • [16] L. Gomez, Y. Patel, M. Rusiñol, D. Karatzas, and C. Jawahar. Self-supervised learning of visual features through embedding images into text topic spaces. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [17] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
  • [18] K. He, R. Girshick, and P. Dollár. Rethinking imagenet pre-training. arXiv preprint arXiv:1811.08883, 2018.
  • [19] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [20] S. Jenni and P. Favaro. Self-supervised feature learning by learning to spot artifacts. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [21] A. Joulin, L. van der Maaten, A. Jabri, and N. Vasilache. Learning visual features from large weakly supervised data. In Proceedings of the European Conference on Computer Vision (ECCV), 2016.
  • [22] D. Kim, D. Cho, D. Yoo, and I. S. Kweon. Learning image representations by completing damaged jigsaw puzzles. In Winter Conference on Applications of Computer Vision (WACV), 2018.
  • [23] A. Kolesnikov, X. Zhai, and L. Beyer. Revisiting self-supervised visual representation learning. arXiv preprint arXiv:1901.09005, 2019.
  • [24] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), 2012.
  • [25] G. Larsson, M. Maire, and G. Shakhnarovich. Learning representations for automatic colorization. In Proceedings of the European Conference on Computer Vision (ECCV), 2016.
  • [26] G. Larsson, M. Maire, and G. Shakhnarovich. Colorization as a proxy task for visual understanding. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [27] R. Liao, A. Schwing, R. Zemel, and R. Urtasun. Learning deep parsimonious representations. In Advances in Neural Information Processing Systems (NIPS), 2016.
  • [28] D. Mahajan, R. Girshick, V. Ramanathan, K. He, M. Paluri, Y. Li, A. Bharambe, and L. van der Maaten. Exploring the limits of weakly supervised pretraining. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
  • [29] A. Mahendran, J. Thewlis, and A. Vedaldi. Cross pixel optical flow similarity for self-supervised learning. arXiv preprint arXiv:1807.05636, 2018.
  • [30] K. Ni, R. Pearce, K. Boakye, B. Van Essen, D. Borth, B. Chen, and E. Wang. Large-scale deep learning on the yfcc100m dataset. arXiv preprint arXiv:1502.03409, 2015.
  • [31] M. Noroozi and P. Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In Proceedings of the European Conference on Computer Vision (ECCV), 2016.
  • [32] M. Noroozi, A. Vinjimoor, P. Favaro, and H. Pirsiavash. Boosting self-supervised learning via knowledge transfer. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [33] D. Pathak, R. Girshick, P. Dollár, T. Darrell, and B. Hariharan. Learning features by watching objects move. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [34] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [35] M. Paulin, M. Douze, Z. Harchaoui, J. Mairal, F. Perronin, and C. Schmid. Local convolutional features with unsupervised training for image retrieval. In Proceedings of the International Conference on Computer Vision (ICCV), 2015.
  • [36] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2007.
  • [37] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Lost in quantization: Improving particular object retrieval in large scale image databases. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2008.
  • [38] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), 2015.
  • [39] A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. Cnn features off-the-shelf: an astounding baseline for recognition. In CVPR workshops, 2014.
  • [40] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [41] C. Sun, A. Shrivastava, S. Singh, and A. Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the International Conference on Computer Vision (ICCV), pages 843–852, 2017.
  • [42] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L.-J. Li. Yfcc100m: The new data in multimedia research. arXiv preprint arXiv:1503.01817, 2015.
  • [43] G. Tolias, R. Sicre, and H. Jégou. Particular object retrieval with integral max-pooling of cnn activations. arXiv preprint arXiv:1511.05879, 2015.
  • [44] X. Wang and A. Gupta. Unsupervised learning of visual representations using videos. In Proceedings of the International Conference on Computer Vision (ICCV), 2015.
  • [45] X. Wang, K. He, and A. Gupta. Transitive invariance for self-supervised visual representation learning. In Proceedings of the International Conference on Computer Vision (ICCV), 2017.
  • [46] X. Wang, L. Lu, H.-C. Shin, L. Kim, M. Bagheri, I. Nogues, J. Yao, and R. M. Summers. Unsupervised joint mining of deep features and image labels for large-scale radiology image categorization and scene recognition. In Winter Conference on Applications of Computer Vision (WACV), 2017.
  • [47] P. Weinzaepfel, J. Revaud, Z. Harchaoui, and C. Schmid. Deepflow: Large displacement optical flow with deep matching. In Proceedings of the International Conference on Computer Vision (ICCV), 2013.
  • [48] T. Weyand, I. Kostrikov, and J. Philbin. Planet-photo geolocation with convolutional neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), 2016.
  • [49] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [50] J. Xie, R. Girshick, and A. Farhadi. Unsupervised deep embedding for clustering analysis. In Proceedings of the International Conference on Machine Learning (ICML), 2016.
  • [51] J. Yang, D. Parikh, and D. Batra. Joint unsupervised learning of deep representations and image clusters. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [52] L. Zhang, G.-J. Qi, L. Wang, and J. Luo. Aet vs. aed: Unsupervised representation learning by auto-encoding transformations rather than data. arXiv preprint arXiv:1901.04596, 2019.
  • [53] R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. In Proceedings of the European Conference on Computer Vision (ECCV), 2016.
  • [54] R. Zhang, P. Isola, and A. A. Efros. Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [55] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using places database. In Advances in Neural Information Processing Systems (NIPS), 2014.


Method conv1 conv2 conv3 conv4 conv5 conv6 conv7 conv8 conv9 conv10 conv11 conv12 conv13
Table 4: Accuracy of linear classifiers on ImageNet and Places205 using the activations from different layers as features. We train a linear classifier on top of frozen convolutional layers at different depths. We compare a VGG-16 trained with supervision on ImageNet to VGG-16s trained with either RotNet or our approach on YFCC100M.

1 Evaluating unsupervised features

Here, Table 4 reports the numbers plotted in Figure 2.

2 Pre-training for ImageNet

epochs epochs
Supervised + RotNet pre-training
Supervised + DeeperCluster pre-training
Table 5: Top- accuracy on the validation set of a VGG-16 trained on ImageNet with supervision, with different initializations. We compare a network initialized randomly to networks pre-trained with our unsupervised method or with RotNet on YFCC100M.

In Table 5, we compare the performance of a network trained with supervision on ImageNet from a standard initialization (“Supervised”) to one pre-trained with DeeperCluster (“Supervised + DeeperCluster pre-training”) and to one pre-trained with RotNet (“Supervised + RotNet pre-training”). The convnet is finetuned on ImageNet with supervision using mini-batch SGD for epochs (instead of epochs in Table 2). We use a learning rate of , a weight decay of , a batch size of and a dropout of . We reduce the learning rate by a factor of at epochs and . Note that evaluating before convergence is unfair towards the supervised from-scratch baseline: since we start the optimization from a good initialization, we reach convergence earlier. Indeed, we observe that the gap between our pre-training and the baseline shrinks from to when evaluating at convergence instead of before convergence; likewise, the gap for the RotNet pre-training goes from to .
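The step learning-rate schedule described above can be sketched as follows; the base rate, decay factor, and milestone epochs below are illustrative placeholders, since the exact values are elided in this excerpt:

```python
def lr_at_epoch(epoch, base_lr=0.1, decay_factor=0.1, milestones=(30, 60)):
    """Step learning-rate schedule: the base rate is multiplied by
    `decay_factor` each time a milestone epoch has been passed.
    All numeric values here are hypothetical placeholders."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= decay_factor
    return lr
```

At each epoch of fine-tuning, the optimizer's learning rate would simply be set to `lr_at_epoch(epoch)`.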

3 Model analysis

3.1 Instance retrieval

Method Pretraining Oxford5K Paris6K
ImageNet labels ImageNet
Random -
Doersch et al. [11] ImageNet
Wang et al. [45] YouTube M
RotNet ImageNet
DeepCluster ImageNet
RotNet YFCC100M
DeepCluster YFCC100M
DeeperCluster YFCC100M
Table 6: mAP for instance-level image retrieval on the Oxford5K and Paris6K datasets. We apply R-MAC with a resolution of pixels and grid levels [43]. We separate the methods using unsupervised ImageNet from the methods using uncurated datasets. DeepCluster does not scale to the full YFCC100M dataset; we thus train it on a random subset of M images.

Instance retrieval consists of retrieving from a corpus the images most similar to a given query. We follow the experimental setting of Tolias et al. [43]: we apply R-MAC with a resolution of pixels and grid levels, and we report mAP for instance-level image retrieval on the Oxford Buildings [36] and Paris [37] datasets.
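To make the descriptor used in this evaluation concrete, here is a simplified sketch of R-MAC on a single C x H x W feature map: max-pooling over overlapping square regions sampled at a few scales, per-region L2 normalization, then sum-aggregation and re-normalization. The region-sampling scheme below is approximate, and the PCA-whitening of regional descriptors used in the full method [43] is omitted:

```python
import numpy as np

def l2n(v, eps=1e-8):
    """L2-normalize a vector."""
    return v / (np.linalg.norm(v) + eps)

def rmac(fmap, levels=3):
    """Simplified R-MAC on a C x H x W feature map: max-pool over
    overlapping square regions at `levels` scales, L2-normalize each
    regional descriptor, sum them, and re-normalize."""
    C, H, W = fmap.shape
    agg = np.zeros(C)
    for l in range(1, levels + 1):
        # region side chosen so that neighbouring regions overlap
        side = int(np.ceil(2 * min(H, W) / (l + 1)))
        ys = np.linspace(0, H - side, l + 1, dtype=int) if H > side else [0]
        xs = np.linspace(0, W - side, l + 1, dtype=int) if W > side else [0]
        for y in ys:
            for x in xs:
                region = fmap[:, y:y + side, x:x + side]
                agg += l2n(region.max(axis=(1, 2)))  # MAC of the region
    return l2n(agg)
```

Retrieval then ranks database images by the dot product between their R-MAC descriptors and that of the query.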

As described by Dosovitskiy et al. [12], class-level supervision induces invariance to semantic categories. This property may not be beneficial for other computer vision tasks such as instance-level recognition. For that reason, descriptor matching and instance retrieval are tasks for which unsupervised feature learning might provide performance improvements. Moreover, these tasks constitute evaluations that do not require any additional training step, allowing a straightforward comparison across different methods. We evaluate our method and compare it to previous work following the experimental setup proposed by Caron et al. [5]. We report results for the instance retrieval task in Table 6.

We observe that features trained with RotNet perform significantly worse than DeepCluster on both Oxford5K and Paris6K. This discrepancy suggests that the properties acquired by classifying large rotations are not relevant to instance retrieval. An explanation is that all images in Oxford5K and Paris6K have the same orientation, as they picture buildings and landmarks. Since our method combines the two paradigms, it suffers an important performance loss on Oxford5K, but is not much affected on Paris6K. These results emphasize the importance of having a diverse set of benchmarks to evaluate the quality of features produced by unsupervised learning methods.

Figure 6: Sorted standard deviations of clusters to their mean colors. If the standard deviation of a cluster to its mean color is low, the images of this cluster have a similar colorization.
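The statistic plotted in Figure 6 can be sketched as follows; the helper below is hypothetical, assuming images stored as float arrays with values in [0, 1]:

```python
import numpy as np

def cluster_color_spread(images, assignments):
    """For each cluster, compute the standard deviation of its images'
    mean RGB colors around the cluster's mean color. Low values indicate
    clusters with a uniform colorization.

    images:      array of shape (N, H, W, 3), values in [0, 1]
    assignments: array of shape (N,) with cluster ids
    """
    # one mean RGB color per image
    img_colors = images.reshape(len(images), -1, 3).mean(axis=1)  # (N, 3)
    spreads = {}
    for c in np.unique(assignments):
        colors = img_colors[assignments == c]
        mean_color = colors.mean(axis=0)
        # root-mean-square distance to the cluster's mean color
        spreads[c] = np.sqrt(((colors - mean_color) ** 2).sum(axis=1).mean())
    return spreads
```

Sorting the returned values then gives the curves of Figure 6.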

3.2 Influence of data pre-processing

In this section, we experiment with our method on raw RGB inputs and provide some insight into why Sobel filtering is crucial to obtain good performance with our method.

First, in Figure 6, we randomly select a subset of clusters and sort them by the standard deviation to their mean color. If the standard deviation of a cluster to its mean color is low, the images of this cluster tend to have a similar colorization. In Figure 7, we also show some clusters with a low standard deviation to the mean color. We observe in Figure 6 that the clustering of features learned with our method focuses more on color than the clustering of RotNet features. Indeed, clustering by color and other low-level information produces balanced clusters that can easily be predicted by a convnet: clustering by color is a valid solution to our formulation. However, as we want to avoid an uninformative clustering essentially based on colors, we remove part of the input information by feeding the network with image gradients instead of the raw RGB image (see Figure 8). As shown in Table 7, this greatly improves the performance of our features when evaluated on downstream tasks. We also observe that the Sobel filter slightly improves RotNet features as well.
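A minimal version of this pre-processing might look as follows, correlating a grayscale image with the two 3x3 Sobel kernels to obtain the two-channel gradient input; the edge-padding choice is ours:

```python
import numpy as np

def sobel(gray):
    """Apply 3x3 Sobel filters to a grayscale image and return a
    2-channel output: channel 0 holds the horizontal (x) derivative
    approximation, channel 1 the vertical (y) one."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    H, W = gray.shape
    padded = np.pad(gray, 1, mode="edge")
    out = np.zeros((2, H, W))
    for i in range(3):          # correlate with the two kernels
        for j in range(3):
            patch = padded[i:i + H, j:j + W]
            out[0] += kx[i, j] * patch
            out[1] += ky[i, j] * patch
    return out
```

The network is then fed this 2-channel gradient map in place of the 3-channel RGB image.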

Figure 7: We show clusters with a uniform colorization across their images. For each cluster, we show the mean color of the cluster.
RGB Sobel
Figure 8: Visualization of two images preprocessed with a Sobel filter. Sobel gives a two-channel output which at each point contains the vertical and horizontal derivative approximations.
Method Data RGB Sobel
RotNet YFCC 1M
DeeperCluster YFCC 20M
Table 7: Influence of applying a Sobel filter versus using raw RGB input on feature quality. We report validation mAP on the Pascal VOC classification task (fc6-8 setting).