1 Introduction
Learning from unlabelled data can dramatically reduce the cost of deploying algorithms to new applications, thus amplifying the impact of machine learning in the real world. Self-supervision is an increasingly popular framework for learning without labels. The idea is to define pretext learning tasks that can be constructed from raw data alone, but that still result in neural networks that transfer well to useful applications.
Much of the research in self-supervision has focused on designing new pretext tasks. However, given supervised data such as ImageNet (Deng et al., 2009), the standard classification objective of minimizing the cross-entropy loss still results in better pre-training than any of such methods (for a certain amount of data and model complexity). This suggests that the task of classification may be sufficient for pre-training networks, provided that suitable data labels are available. In this paper, we thus focus on the problem of obtaining the labels automatically by designing a self-labelling algorithm.

Learning a deep neural network while simultaneously discovering the data labels can be viewed as simultaneous clustering and representation learning. The latter can be approached by combining cross-entropy minimization with an off-the-shelf clustering algorithm such as K-means. This is precisely the approach adopted by the recent DeepCluster method (Caron et al., 2018), which achieves excellent results in unsupervised representation learning. However, combining representation learning, which is a discriminative task, with clustering is not at all trivial. In particular, we show that the combination of cross-entropy minimization and K-means as adopted by DeepCluster cannot be described as the consistent optimization of an overall learning objective; instead, there exist degenerate solutions that the algorithm avoids only via particular implementation choices.
In order to address this technical shortcoming, we contribute a new principled formulation for simultaneous clustering and representation learning. The starting point is to minimize the cross-entropy loss with respect to both the deep network and the data labels. This is often done in semi-supervised learning and multiple-instance learning. However, when applied naively to the unsupervised case, it immediately leads to a degenerate solution where all data points are mapped to the same cluster.
We solve this issue by adding the constraint that the labels must induce an equipartition of the data, which we show maximizes the information between data indices and labels. We also show that the resulting label assignment problem is an instance of optimal transport, and can therefore be solved in polynomial time as a linear program. However, since we want to scale the algorithm to millions of data points and thousands of labels, standard transport solvers are inadequate. Thus, we also propose to use a fast version of the Sinkhorn-Knopp algorithm for finding an approximate solution to the transport problem efficiently at scale, using fast matrix-vector algebra.
Compared to methods such as DeepCluster, the new formulation is more principled and makes it easier to demonstrate properties of the method such as convergence. Most importantly, via extensive experimentation, we show that our new approach leads to significantly better results than DeepCluster, achieving the new state-of-the-art for representation learning among clustering-based approaches. In fact, the method's performance surpasses others that use a single type of supervisory signal for self-supervision.
2 Related Work
Our paper relates to two broad areas of research: (a) self-supervised representation learning, and (b) more specifically, training a deep neural network using pseudo-labels, i.e. the assignment of a label to each image. We discuss closely related works for each.
Self-supervised learning:
A wide variety of methods that do not require manual annotations have been proposed for the self-training of deep convolutional neural networks. These methods use various cues and proxy tasks, namely inpainting (Pathak et al., 2016), patch context and jigsaw puzzles (Doersch et al., 2015; Noroozi & Favaro, 2016; Noroozi et al., 2018; Mundhenk et al., 2017), clustering (Caron et al., 2018; Huang et al., 2019), noise-as-targets (Bojanowski & Joulin, 2017), colorization (Zhang et al., 2016; Larsson et al., 2017), generation (Jenni & Favaro, 2018; Ren & Lee, 2018; Donahue et al., 2017; Donahue & Simonyan, 2019), predictive coding (Oord et al., 2018; Hénaff et al., 2019), geometry (Dosovitskiy et al., 2016), predicting transformations (Gidaris et al., 2018; Zhang et al., 2019) and counting (Noroozi et al., 2017). In (Feng et al., 2019), predicting rotation (Gidaris et al., 2018) is combined with instance retrieval (Wu et al., 2018). The idea is that the pretext task can be constructed automatically and easily from images alone. Thus, methods often modify information in the images and require the network to recover it. Inpainting and colorization techniques fall in this category. However, these methods have the downside that the features are learned on modified images, which potentially harms the generalization to unmodified ones. For example, colorization uses a grayscale image as input, so the network cannot learn to extract color information, which can be important for other tasks.
Slightly less related are methods that use additional information to learn features. Here, temporal information is often used in the form of videos. Typical pretext tasks are based on temporal context (Misra et al., 2016; Wei et al., 2018; Lee et al., 2017; Sermanet et al., 2018), spatio-temporal cues (Isola et al., 2015; Gao et al., 2016; Wang et al., 2017), foreground-background separation via video segmentation (Pathak et al., 2017), optical flow (Gan et al., 2018; Mahendran et al., 2018), future-frame synthesis (Srivastava et al., 2015), audio prediction from video (de Sa, 1994; Owens et al., 2016), audio-video alignment (Arandjelović & Zisserman, 2017), ego-motion estimation (Jayaraman & Grauman, 2015), slow feature analysis with higher-order temporal coherence (Jayaraman & Grauman, 2016), transformation between frames (Agrawal et al., 2015) and patch tracking in videos (Wang & Gupta, 2015).

Pseudo-labels for images:
In the self-supervised domain, we find a spectrum of methods that either give each data point a unique label (Wu et al., 2018; Dosovitskiy et al., 2016) or train on a flexible number of labels with k-means (Caron et al., 2018), with mutual information (Ji et al., 2018) or with noise (Bojanowski & Joulin, 2017). In (Noroozi et al., 2018), a large network is trained with a pretext task and a smaller network is trained via knowledge transfer on the clustered data. Finally, (Bach & Harchaoui, 2008; Vo et al., 2019) use convex relaxations of regularized, affine-transformation-invariant linear clustering, which do not scale to larger datasets.

Our contribution is a simple method that combines a novel pseudo-label extraction procedure from raw data alone with the training of a deep neural network using a standard cross-entropy loss.
3 Method
We will first derive our self-labelling method, then interpret it as optimizing the labels and targets of a cross-entropy loss, and finally analyze similarities and differences with other clustering-based methods.
3.1 Self-labelling
Neural network pre-training is often achieved via a supervised data classification task. Formally, consider a deep neural network $x \mapsto \Phi(x)$ mapping data $x$ (e.g. images) to feature vectors $\Phi(x) \in \mathbb{R}^D$. The model is trained using a dataset (e.g. ImageNet) of $N$ data points $x_1, \dots, x_N$ with corresponding labels $y_1, \dots, y_N \in \{1, \dots, K\}$, drawn from a space of $K$ possible labels. The model is followed by a classification head $h : \mathbb{R}^D \to \mathbb{R}^K$, usually consisting of a single linear layer, converting the feature vector into a vector of $K$ class scores. The class scores are mapped to class probabilities via the softmax operator:
$$p(y \mid x_i) = \operatorname{softmax}\big(h \circ \Phi(x_i)\big).$$
The model and head parameters are learned by minimizing the average cross-entropy loss
$$E(p \mid y_1, \dots, y_N) = -\frac{1}{N} \sum_{i=1}^{N} \log p(y_i \mid x_i). \qquad (1)$$
Training with objective (1) requires a labelled dataset. When labels are unavailable, we require a self-labelling mechanism to assign the labels automatically.
In semi-supervised learning, self-labelling is often achieved by jointly optimizing (1) with respect to the model $h \circ \Phi$ and the labels $y_1, \dots, y_N$. This can work if at least part of the labels are known, thus constraining the optimization. However, in the fully unsupervised case, it leads to a degenerate solution: eq. 1 is trivially minimized by assigning all data points to a single (arbitrary) label.
To address this issue, we first rewrite eq. 1 by encoding the labels as posterior distributions $q(y \mid x_i)$:
$$E(p, q) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{y=1}^{K} q(y \mid x_i) \log p(y \mid x_i). \qquad (2)$$
If we set the posterior distributions $q(y \mid x_i) = \delta(y - y_i)$ to be deterministic, the formulations in eqs. 1 and 2 are equivalent, in the sense that $E(p, q) = E(p \mid y_1, \dots, y_N)$. In this case, optimizing $q$ is the same as reassigning the labels, which leads to the degeneracy. To avoid this, we add the constraint that the label assignments must partition the data into equally-sized subsets. Formally, the learning objective is thus (we assume for simplicity that $K$ divides $N$ exactly, but the formulation is easily extended to any $N$ by setting the constraints to either $\lfloor N/K \rfloor$ or $\lceil N/K \rceil$, in order to assure that there is a feasible solution):
$$\min_{p, q} E(p, q) \quad \text{subject to} \quad \forall y: \; q(y \mid x_i) \in \{0, 1\} \;\; \text{and} \;\; \sum_{i=1}^{N} q(y \mid x_i) = \frac{N}{K}. \qquad (3)$$
The constraints mean that each data point is assigned to exactly one label and that, overall, the data points are split uniformly among the classes.
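As a concrete illustration of the constraint (a minimal numpy sketch; the matrix sizes, toy posteriors, and helper names are ours, not the paper's), the following checks feasibility of an assignment and evaluates the cross-entropy objective of eq. 3, showing that the degenerate all-one-label solution is infeasible:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 12, 3  # N data points, K labels; K divides N

# Model posteriors p(y|x_i): a K x N column-stochastic matrix.
logits = rng.normal(size=(K, N))
P = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)

def is_equipartition(Q):
    """Check that Q encodes hard labels splitting the data into equal groups."""
    hard = np.isin(Q, (0.0, 1.0)).all()                # q(y|x_i) in {0, 1}
    one_label_each = np.allclose(Q.sum(axis=0), 1.0)   # each point gets one label
    balanced = np.allclose(Q.sum(axis=1), N / K)       # each label gets N/K points
    return hard and one_label_each and balanced

def objective(Q, P):
    """Average cross-entropy E(p, q) of eq. (2)."""
    return -(Q * np.log(P)).sum() / N

# A feasible assignment: labels 0..K-1, each used N/K times.
y = np.repeat(np.arange(K), N // K)
Q = np.eye(K)[y].T
assert is_equipartition(Q)

# The degenerate solution (all points on one label) violates the constraint.
Q_degenerate = np.zeros((K, N))
Q_degenerate[0] = 1.0
assert not is_equipartition(Q_degenerate)

print(objective(Q, P))  # finite, positive cross-entropy
```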
The objective in eq. 3 is combinatorial in $q$ and thus may appear very difficult to optimize. However, this is an instance of the optimal transport problem, which can be solved relatively efficiently. In order to see this more clearly, let $P_{yi} = p(y \mid x_i)\,\frac{1}{N}$ be the $K \times N$ matrix of scaled posterior probabilities estimated by the model. Likewise, let $Q_{yi} = q(y \mid x_i)\,\frac{1}{N}$ be the scaled matrix of label assignments. Using the notation of (Cuturi, 2013), we relax matrix $Q$ to be an element of the transportation polytope
$$U(r, c) := \{\, Q \in \mathbb{R}_{+}^{K \times N} \;\mid\; Q\mathbf{1} = r, \; Q^{\top}\mathbf{1} = c \,\}. \qquad (4)$$
Here $\mathbf{1}$ denotes a vector of all ones of the appropriate dimension, so that $r$ and $c$ are the marginal projections of matrix $Q$ onto its rows and columns, respectively. In our case, we require $Q$ to be a matrix of conditional probability distributions that split the data uniformly, which is captured by:
$$r = \frac{1}{K} \cdot \mathbf{1}, \qquad c = \frac{1}{N} \cdot \mathbf{1}.$$
With this notation, we can rewrite the objective function in eq. 3, up to a constant shift, as
$$E(p, q) + \log N = \langle Q, -\log P \rangle, \qquad (5)$$
where $\langle \cdot, \cdot \rangle$ is the Frobenius dot-product between two matrices and the logarithm is applied elementwise. Hence optimizing eq. 3 with respect to the assignments $Q$ is equivalent to solving the problem:
$$\min_{Q \in U(r, c)} \langle Q, -\log P \rangle. \qquad (6)$$
This is a linear program, and can thus be solved in polynomial time. Furthermore, solving this problem always leads to an integral solution despite having relaxed $Q$ to the continuous polytope $U(r, c)$, guaranteeing exact equivalence to the original problem.
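This integrality can be observed numerically on a toy instance (a sketch using scipy's general-purpose `linprog`, which is not the solver used at scale in the paper; the instance sizes and variable names are ours): the vertex solution found by the LP solver has every entry equal to either 0 or $1/N$, i.e. a hard balanced labelling.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
K, N = 2, 6                                # toy sizes; K divides N
P = rng.dirichlet(np.ones(K), size=N).T    # K x N column-stochastic posteriors

# Variables: Q flattened row-major (K*N entries). Objective <Q, -log P>.
c = (-np.log(P)).ravel()

# Equality constraints: each label receives mass 1/K (row marginals),
# each data point carries mass 1/N (column marginals).
A_eq, b_eq = [], []
for y in range(K):                         # sum_i Q[y, i] = 1/K
    row = np.zeros(K * N)
    row[y * N:(y + 1) * N] = 1.0
    A_eq.append(row), b_eq.append(1.0 / K)
for i in range(N):                         # sum_y Q[y, i] = 1/N
    col = np.zeros(K * N)
    col[i::N] = 1.0
    A_eq.append(col), b_eq.append(1.0 / N)

res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
              bounds=(0, None), method="highs")
Q = res.x.reshape(K, N)

# The vertex solution is integral: every entry is (near) 0 or 1/N.
assert np.all((Q < 1e-6) | (np.abs(Q - 1.0 / N) < 1e-6))
```

Because the scaled marginals are integer (each column sums to one unit of $1/N$, each row to $N/K$ units), the constraint matrix is totally unimodular, which is why the relaxation is tight.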
In practice, however, the resulting linear program is large, involving millions of data points and thousands of classes. Traditional algorithms for solving the transport problem scale badly to instances of this size. We address this issue by adopting a fast version (Cuturi, 2013) of the Sinkhorn-Knopp algorithm. This amounts to introducing a regularization term
$$\min_{Q \in U(r, c)} \langle Q, -\log P \rangle + \frac{1}{\lambda} \operatorname{KL}(Q \,\|\, r c^{\top}), \qquad (7)$$
where $\operatorname{KL}$ is the Kullback-Leibler divergence and $r c^{\top}$ can be interpreted as a $K \times N$ probability matrix. The advantage of this regularization term is that the minimizer of eq. 7 can be written as:
$$Q = \operatorname{diag}(\alpha) \, P^{\lambda} \, \operatorname{diag}(\beta), \qquad (8)$$
where exponentiation is meant elementwise and $\alpha$ and $\beta$ are two vectors of scaling coefficients chosen so that the resulting matrix $Q$ is also a probability matrix (see (Cuturi, 2013) for a derivation). The vectors $\alpha$ and $\beta$ can be obtained, as shown below, via a simple matrix scaling iteration.
For very large $\lambda$, optimizing eq. 7 is of course equivalent to optimizing eq. 6, but even for moderate values of $\lambda$ the two objectives tend to have approximately the same optimizer (Cuturi, 2013). Choosing $\lambda$ trades off convergence speed against closeness to the original transport problem. In our case, using a fixed $\lambda$ is appropriate, as we are ultimately interested in the final clustering and representation-learning results, rather than in solving the transport problem exactly.
Our final algorithm’s core can be described as follows. We learn a model and a label assignment matrix by solving the optimization problem eq. 6 with respect to both the assignment $Q$, which is a probability matrix, and the model $h \circ \Phi$, which determines the predictions $P$. We do so by alternating the following two steps:
Step 1: representation learning.
Given the current label assignment $Q$, the model is updated by minimizing eq. 6 with respect to (the parameters of) $h \circ \Phi$. This is the same as training the model using the common cross-entropy loss for classification.
Step 2: selflabelling.
Given the current model $h \circ \Phi$, we compute the scaled log-probabilities $\log P$. Then, we find $Q$ using eq. 8 by iterating the updates (Cuturi, 2013)
$$\forall y: \; \alpha_y \leftarrow \big[P^{\lambda} \beta\big]_y^{-1}, \qquad \forall i: \; \beta_i \leftarrow \big[\alpha^{\top} P^{\lambda}\big]_i^{-1}.$$
Each update involves a single matrix-vector multiplication with complexity $O(NK)$, so it is relatively quick even for millions of data points and thousands of labels; the cost of the method thus scales linearly with the number of images $N$. In practice, convergence is reached within 2 minutes on ImageNet when computed on a GPU. Also note that the scaling vectors $\alpha$ and $\beta$ can be retained between steps, thus allowing a warm start of Step 2.
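Step 2 can be sketched in a few lines of numpy (a toy illustration following Cuturi's scaling iteration, written in the equivalent form with explicit target marginals $r$ and $c$; the function name, sizes, and the choice $\lambda = 25$ are ours, and a practical implementation would work in log space for numerical stability):

```python
import numpy as np

def sinkhorn_labels(log_P, lam=25.0, n_iters=500):
    """Approximate the regularized transport solution Q = diag(u) M diag(v),
    with M = P**lam, row marginals r = 1/K and column marginals c = 1/N."""
    K, N = log_P.shape
    r = np.full(K, 1.0 / K)
    c = np.full(N, 1.0 / N)
    M = np.exp(lam * log_P)            # P**lam, elementwise (log_P <= 0, so M <= 1)
    v = np.ones(N)
    for _ in range(n_iters):           # alternate cheap matrix-vector scalings
        u = r / (M @ v)
        v = c / (M.T @ u)
    return u[:, None] * M * v[None, :]

rng = np.random.default_rng(0)
K, N = 3, 12
logits = rng.normal(size=(K, N))
log_P = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))

Q = sinkhorn_labels(log_P)
assert np.allclose(Q.sum(axis=1), 1 / K, atol=1e-5)  # equipartition over labels
assert np.allclose(Q.sum(axis=0), 1 / N, atol=1e-5)  # one unit of mass per point
```

Each iteration costs two matrix-vector products, $O(NK)$, which is what makes the method usable at ImageNet scale.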
3.2 Interpretation
As shown above, the formulation in eq. 2 uses scaled versions of the probabilities. We can interpret these by treating the data index $i$ as a random variable with uniform distribution $p(i) = 1/N$, and by rewriting the posteriors $p(y \mid x_i)$ and $q(y \mid x_i)$ as conditional distributions $p(y \mid i)$ and $q(y \mid i)$ with respect to the data index instead of the feature vector. With these changes, we can rewrite eq. 5 as
$$E(p, q) + \log N = -\sum_{y=1}^{K} \sum_{i=1}^{N} Q_{yi} \log P_{yi} = H(Q, P), \qquad (9)$$
which is the cross-entropy between the joint label-index distributions $Q(y, i)$ and $P(y, i)$. The minimum of this quantity w.r.t. $P$ is obtained when $P = Q$, in which case $H(Q, P)$ reduces to the entropy $H(Q)$ of the random variables $y$ and $i$. Additionally, since we assumed that $p(i) = 1/N$, the marginal entropy $H(i)$ is constant and, due to the equipartition condition $q(y) = 1/K$, the marginal entropy $H(y)$ is also constant. Subtracting these two constants from the entropy yields:
$$H(Q) - H(y) - H(i) = -I(y; i).$$
Thus we see that minimizing $E(p, q)$ is the same as maximizing the mutual information $I(y; i)$ between the label $y$ and the data index $i$.
In our formulation, the maximization above is carried out under the equipartition constraint. We could instead relax this constraint and directly maximize the information $I(y; i)$. However, by rewriting the information as the difference $I(y; i) = H(y) - H(y \mid i)$, we see that the optimal solution is given by $H(y \mid i) = 0$, which states that each data point is associated with exactly one label deterministically, and by $H(y) = \log K$, which is another way of stating the equipartition condition.
In other words, our learning formulation can be interpreted as maximizing the information between data indices and labels while explicitly enforcing the equipartition condition, which is implied by maximizing the information in any case. Compared to minimizing the entropy alone, maximizing the information avoids degenerate solutions, as the latter carry no mutual information between labels $y$ and indices $i$.
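The two extreme cases can be checked directly (a small numpy sketch with toy sizes of our choosing): a deterministic, equipartitioned labelling attains the maximum $I(y; i) = \log K$, while the degenerate all-one-label assignment carries zero information.

```python
import numpy as np

def mutual_information(Q):
    """I(y; i) in nats, for a joint distribution Q over (label, index)."""
    py = Q.sum(axis=1, keepdims=True)   # marginal over labels
    pi = Q.sum(axis=0, keepdims=True)   # marginal over indices
    nz = Q > 0                          # skip zero entries (0 log 0 = 0)
    return (Q[nz] * np.log(Q[nz] / (py @ pi)[nz])).sum()

K, N = 4, 12

# Deterministic, equipartitioned labelling: I(y; i) = H(y) = log K.
y = np.repeat(np.arange(K), N // K)
Q_equi = np.eye(K)[y].T / N
assert np.isclose(mutual_information(Q_equi), np.log(K))

# Degenerate labelling (all points on one label): I(y; i) = 0.
Q_deg = np.zeros((K, N))
Q_deg[0] = 1.0 / N
assert np.isclose(mutual_information(Q_deg), 0.0)
```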
3.3 Relation to simultaneous representation learning and clustering
In the discussion above, self-labelling amounts to assigning discrete labels to data and can thus be interpreted as clustering. Most traditional clustering approaches are generative. For example, K-means takes a dataset $x_1, \dots, x_N$ of vectors and partitions it into $K$ classes in order to minimize the reconstruction error
$$E(\mu_1, \dots, \mu_K) = \min_{y_1, \dots, y_N} \frac{1}{N} \sum_{i=1}^{N} \| x_i - \mu_{y_i} \|^2, \qquad (10)$$
where $y_i$ are the data-to-cluster assignments and $\mu_y$ are means approximating the vectors in the corresponding clusters. The K-means energy can thus be interpreted as the average data reconstruction error.
It is natural to ask whether a clustering method such as K-means, which is generative, could be combined with representation learning, which is discriminative. In this setting, the vectors $x_i$ in eq. 10 are the feature vectors extracted by the neural network $\Phi$ from the input data. Unfortunately, optimizing a loss such as eq. 10 with respect to both the clustering and the representation parameters is meaningless: the obvious solution is to let the representation send all data points to the same constant feature vector and to set all the means to coincide with it, in which case the K-means reconstruction error is zero (minimal).
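This collapse is easy to verify numerically (a minimal sketch; the sizes and the constant vector are arbitrary choices of ours): a representation that maps everything to the same vector, with all means placed on it, makes the K-means energy of eq. 10 exactly zero.

```python
import numpy as np

def kmeans_energy(X, mu, y):
    """Average reconstruction error of eq. (10): mean ||x_i - mu_{y_i}||^2."""
    return np.mean(np.sum((X - mu[y]) ** 2, axis=1))

N, D, K = 8, 5, 2

# Degenerate case: the "representation" collapses every input to the same
# constant feature vector, and every mean coincides with that vector.
const = np.ones(D)
X_collapsed = np.tile(const, (N, 1))
mu = np.tile(const, (K, 1))
y = np.zeros(N, dtype=int)

assert kmeans_energy(X_collapsed, mu, y) == 0.0  # global minimum, trivially
```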
Nevertheless, DeepCluster (Caron et al., 2018) does successfully combine K-means with representation learning. DeepCluster can be related to our approach as follows. Step 1 of the algorithm, namely representation learning via cross-entropy minimization, is exactly the same. Step 2, namely self-labelling, differs: where we solve an optimal transport problem to obtain the pseudo-labels, DeepCluster does so by running K-means on the feature vectors extracted by the neural network.
DeepCluster does have an obvious degenerate solution: we can assign all data points to the same label and learn a constant representation, simultaneously achieving a minimum of the cross-entropy loss in Step 1 and of the K-means loss in Step 2. The reason why DeepCluster avoids this pitfall lies in the particular interaction between the two steps. First, during Step 2, the features are fixed, so K-means cannot pull them together. Instead, the means spread out to cover the features as they are, resulting in a balanced partitioning. Second, during the classification step, the cluster assignments are fixed, and optimizing the features with respect to the cross-entropy loss tends to separate them. Lastly, the method in (Caron et al., 2018) also uses other heuristics, such as sampling the training data inversely to the size of their associated clusters, leading to further regularization.
However, a downside of DeepCluster is that it does not have a single, well-defined objective to optimize, which makes it difficult to characterize its convergence properties. By contrast, in our formulation, both Step 1 and Step 2 optimize the same objective, with the advantage that convergence to a (local) optimum is guaranteed.
3.4 Augmenting self-labelling via data transformations
Methods such as DeepCluster extend the training data via augmentations. In vision problems, this amounts to (heavily) distorting and cropping the input images at random. Augmentations are applied so that the neural network is encouraged to learn a labelling function that is transformation invariant. In practice, this is crucial for learning good clusters and representations, so we adopt it here. This is achieved by training the network to predict, for randomly-sampled transformations $t$ of each data point $x_i$, the same pseudo-label $q(\cdot \mid x_i)$ assigned to the untransformed point. In practice, in Step 1 (representation learning), this is implemented by applying the random transformations to data batches during optimization via SGD, which corresponds to the usual data-augmentation scheme for deep neural networks.
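The batching scheme can be sketched as follows (an illustrative numpy snippet; the specific transformations, array shapes, and crop size are our choices, not the paper's exact augmentation pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_transform(img):
    """Sample a random augmentation: horizontal flip plus a random 28x28 crop."""
    if rng.random() < 0.5:
        img = img[:, ::-1]                  # horizontal flip
    h, w = img.shape[:2]
    ch, cw = h - 4, w - 4                   # crop to (h-4) x (w-4)
    top, left = rng.integers(0, 5, size=2)  # random crop offset in [0, 4]
    return img[top:top + ch, left:left + cw]

# During Step 1, each batch pairs a *transformed* image with the pseudo-label
# assigned to the untransformed image, encouraging transformation invariance.
images = rng.random((16, 32, 32, 3))
pseudo_labels = rng.integers(0, 10, size=16)
batch = [(random_transform(x), y) for x, y in zip(images, pseudo_labels)]

assert all(img.shape == (28, 28, 3) for img, _ in batch)
```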
3.5 Multiple simultaneous self-labellings
Intuitively, the same data can often be clustered in many equally good ways. For example, visual objects can be clustered by color, size, typology, viewpoint, and many other attributes. Since our main objective is to use clustering to learn a good data representation $\Phi$, we consider a multi-task setting in which the same representation is shared among several different clustering tasks, which can potentially capture different and complementary clustering axes.
In our formulation, this is easily achieved by considering multiple heads (Ji et al., 2018) $h_1, \dots, h_T$, one for each of $T$ clustering tasks (which may also have different numbers of labels). We then optimize a sum of objective functions of the type of eq. 6, one for each task, while sharing the parameters of the feature extractor $\Phi$ among them.
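The multi-head objective can be sketched as follows (a numpy toy example with made-up sizes; the heads are plain linear layers and the per-task pseudo-labels are random stand-ins for the assignments produced by Step 2):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, T, K = 32, 16, 4, 5  # points, feature dim, heads, labels per head

features = rng.normal(size=(N, D))                       # shared representation
heads = [rng.normal(size=(D, K)) for _ in range(T)]      # one linear head per task
labels = [rng.integers(0, K, size=N) for _ in range(T)]  # per-task pseudo-labels

def head_loss(Z, W, y):
    """Cross-entropy of one clustering head, i.e. one term of the summed objective."""
    scores = Z @ W
    log_p = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -log_p[np.arange(len(y)), y].mean()

# The multi-task objective is the sum of per-head losses: only the heads
# differ, while the features (hence the backbone parameters) are shared.
total = sum(head_loss(features, W, y) for W, y in zip(heads, labels))
assert total > 0
```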
4 Experiments
In this section, we evaluate the quality of the learned representations. We first ablate our hyperparameters and then compare to the state of the art in self-supervised learning, where we find that our method is the best clustering-based feature learner and overall second best on many benchmarks.
4.1 Linear probes and baseline architecture
In order to quantify whether a neural network has learned useful feature representations, we follow the standard approach of using linear probes (Zhang et al., 2017). This amounts to solving a difficult task, such as ImageNet classification, by training a linear classifier on top of pre-trained feature representations, which are kept fixed. Linear classifiers heavily rely on the quality of the representation, since their discriminative power is low.
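A linear probe can be sketched as follows (a self-contained numpy toy: the "frozen features" are synthetic stand-ins with class-dependent means, not real network activations, and the training loop is plain gradient descent on a softmax classifier):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, C = 200, 10, 3  # samples, feature dim, downstream classes

# Stand-ins for frozen pretrained features: noise plus a class-dependent mean.
labels = rng.integers(0, C, size=N)
class_means = rng.normal(size=(C, D))
features = rng.normal(size=(N, D)) + np.eye(C)[labels] @ class_means

# Train only a linear classifier on top; the features are never updated.
W, b = np.zeros((D, C)), np.zeros(C)
for _ in range(500):
    scores = features @ W + b
    p = np.exp(scores - scores.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    grad = (p - np.eye(C)[labels]) / N   # softmax cross-entropy gradient
    W -= 0.1 * (features.T @ grad)
    b -= 0.1 * grad.sum(axis=0)

accuracy = np.mean((features @ W + b).argmax(axis=1) == labels)
assert accuracy > 0.5  # good features make the linear problem easy
```

The probe's accuracy serves as a proxy for feature quality precisely because the classifier itself is too weak to compensate for a bad representation.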
We apply linear probes to all intermediate convolutional blocks of the networks. We train on ImageNet LSVRC-12 (Deng et al., 2009) and other smaller-scale datasets, and transfer to MIT Places (Zhou et al., 2014), all of which are standard benchmarks for evaluation in self-supervised learning. Our base encoder architecture is AlexNet (Krizhevsky et al., 2012), since this is the architecture most often used in other self-supervised learning work for the purpose of benchmarking. We insert the probes right after the ReLU layer in each of the five blocks, and denote these entry points conv1 to conv5. Applying the linear probes at each convolutional layer allows studying the quality of the representation learned at different depths of the network. While linear probes are conceptually straightforward, there are several technical details that can affect the final accuracy. We detail the exact setup in the Appendix.

[Tables 1–5: ablation results for SL. Table 1: accuracy at conv3–conv5 as a function of the number of label optimizations (#opt: 0, 40, 80, 160). Table 2: accuracy at conv3–conv5 for different numbers of clusters. Tables 3–4: Top-1 accuracy for AlexNet (small), AlexNet and ResNet-50. Table 5: Top-1 accuracy when transferring self-labels from a source network (AlexNet or ResNet-50) to a target AlexNet. Numeric entries not recovered.]
4.2 Ablation
Our method has two major hyperparameters: as with any clustering method, the number of clusters (or an equivalent parameter) needs to be chosen; additionally, the number of clustering heads can be specified. Due to the simplicity of the approach, no other parameters, such as loss-balancing weights, are needed. In our experiments we specify these settings in brackets when denoting our self-labelling method as “SL”.
In Tables 1–5 we show extensive ablations of our method. Since conv1 and conv2 are mostly learned from augmentations alone (Asano et al., 2019), we evaluate only the deeper layers. First, in Table 1, we validate that our label-optimization method, rather than the augmentations or random labels (i.e. zero label optimizations), is key to achieving good performance. We further observe that increasing the number of times we optimize the label assignment (#opts) yields diminishing or slightly decreasing returns, indicating saturation.
Next, in Table 2, we compare different choices for the number of clusters. We find that moving from 1k to 3k clusters improves the results, but that larger numbers of clusters decrease the quality slightly.
In Table 3 we observe that increasing the number of heads yields a further strong performance gain.
In Table 4, we show that performance increases with larger architectures: from a smaller variant of AlexNet, which uses (64, 192) filters in its first two convolutional layers (Krizhevsky, 2014), to the standard variant with (96, 256) filters (Krizhevsky et al., 2012), all the way to a ResNet-50. This indicates that the task is hard enough to scale to better architectures, yet simple enough to also be able to train AlexNet, which other methods such as BigBiGAN (Donahue & Simonyan, 2019) or CPC (Hénaff et al., 2019) cannot do.
Lastly, in Table 5, we find that the labels extracted using our method can be used to quickly retrain a network from scratch. For this, we use a shorter 90-epoch schedule and conduct standard supervised training using the self-labels. We find that retraining an AlexNet this way recovers the original performance. This is an interesting result, indicating that the quality of the features depends on the final label assignment and not on the intermediate “label paths” taken during training. Since the labels are independent of the network architecture, one can use them to pre-train any architecture without running the actual method. To verify this assumption, we take the labels obtained from an SL-trained ResNet-50, use them to train an AlexNet, and find that it performs even better than the directly trained AlexNet. For this reason, we will publish our self-labels for the ImageNet dataset together with the code and trained models.

As we show qualitatively in the appendix, the labels identified by our algorithm are highly meaningful and group visually similar concepts in the same clusters, often even capturing whole ImageNet classes.
4.3 Small-scale Datasets
[Table 6: accuracy on CIFAR-10, CIFAR-100 and SVHN for Supervised, Counting, DeepCluster, Instance, AND, and SL, evaluated both with a linear classifier on conv5 features and with a weighted kNN classifier on FC features. Numeric entries not recovered.]

[Table 7: Pascal VOC results for classification, detection and segmentation, comparing ImageNet labels, Random, Random Rescaled, BiGAN, Context, Context 2, CC+VGG, RotNet, DeepCluster, RotNet+retrieval, and SL. Numeric entries not recovered.]
First, we evaluate our method on relatively simple and small datasets, namely CIFAR-10/100 (Krizhevsky et al., 2009) and SVHN (Netzer et al., 2011). For this, we follow the experimental and evaluation protocol of the current state of the art in self-supervised learning on these datasets, AND (Huang et al., 2019). In Table 6, we compare our method, with dataset-specific settings for CIFAR-10, CIFAR-100 and SVHN, to other published methods. We observe that our method outperforms the best previous method by 5.8% on CIFAR-10, by 9.5% on CIFAR-100 and by 0.8% on SVHN, respectively. The relatively minor gain on SVHN can be explained by the gap between the supervised baseline and the self-supervised results already being very small (<3%). Even in the nearest-neighbour retrieval evaluation, which should naturally favour the AND method, as it is based on learning local neighbourhoods, we surpass AND by around 2% consistently across these datasets.
4.4 Large Scale Benchmarks
To compare to the state of the art and concurrent work, we evaluate several architectures using linear probes on public benchmark datasets.
AlexNet.
[Table 8: linear-probe accuracy at conv1–conv5 on ILSVRC-12 and Places. Methods: ImageNet supervised; Places supervised; Random; Inpainting (Pathak et al., 2016); BiGAN (Donahue et al., 2017); Instance retrieval (Wu et al., 2018); RotNet (Gidaris et al., 2018); DeepCluster (RGB) (Caron et al., 2018); AND (Huang et al., 2019); DeepCluster (Caron et al., 2018); AET (Zhang et al., 2019); RotNet+retrieval (Feng et al., 2019); SL. Numeric entries not recovered.]
The main benchmark for feature-learning methods is linear probing of an AlexNet on ImageNet. In Table 8 we compare the performance across layers, also on the Places dataset. We find that, across both datasets, our method outperforms DeepCluster at every layer. From our ablation studies in Tables 1–5, we also note that even our single-head variant outperforms DeepCluster, which cross-validates the number of clusters. Furthermore, we find that our method is either first or second best in all layers and datasets; the only method ahead of it utilizes a combination of two self-supervised modalities, which is known to increase performance (Doersch & Zisserman, 2017) but blurs the sources of the gain. Barring this combined method, we improve upon the latest single-self-supervision benchmark, AutoEncoding Transformations (AET), on both ImageNet and Places.
Larger models.
Method  Architecture  Evaluation details (epochs)  Top-1  Top-5
Supervised, (Donahue & Simonyan, 2019)  ResNet-50  Adam, LR sweeps (135)
Supervised, (Donahue & Simonyan, 2019)  ResNet-101  Adam, LR sweeps (135)
Jigsaw, (Kolesnikov et al., 2019)  ResNet-50  SGD (500)
Rotation, (Kolesnikov et al., 2019)  ResNet-50  SGD (500)
CPC, (Oord et al., 2018)  ResNet-101  SGD (145)
BigBiGAN, (Donahue & Simonyan, 2019)  ResNet-50  Adam, LR sweeps (135)
SL []  ResNet-50  SGD (145)
SL []  ResNet-50  SGD (145)
SL []  ResNet-50  SGD (145)
other architectures
Rotation, (Kolesnikov et al., 2019)  RevNet-50  SGD (500)  53.7
BigBiGAN, (Donahue & Simonyan, 2019)  RevNet-50  Adam, LR sweeps (135)  60.8  81.4
Efficient CPC, (Hénaff et al., 2019)  ResNet-170  SGD (145)  61.0  83.0
Training models larger than AlexNet is not yet standardized in the feature-learning community. In Table 9 we compare a ResNet-50 trained with our method to other works. We perform better than all other methods except the computationally very expensive BigBiGAN.
4.5 Fine-tuning: Classification, object detection and semantic segmentation
Finally, since pre-training is usually aimed at improving downstream tasks, we evaluate the quality of the learned features by fine-tuning the model for three distinct tasks on the Pascal VOC benchmark (Everingham et al.). In Table 7 we compare results for multi-label classification, object detection and semantic segmentation.
As in the linear-probe experiments, we find our method to be better than, or close to, the best performing method. This shows that our trained convolutional network not only learns useful feature representations but is also able to perform well on actual downstream tasks.
5 Conclusion
We present a self-supervised feature learning method that is based on clustering. In contrast to other methods, ours optimizes the same objective during feature learning and during clustering. This becomes possible through the weak assumption that the number of samples should be equal across clusters. This constraint is explicitly encoded in the label assignment step and can be solved efficiently using a modified Sinkhorn-Knopp algorithm. Our method outperforms all other clustering-based feature learning approaches, and the resulting self-labels can be used to learn features for new architectures using simple cross-entropy training.
Acknowledgments
Yuki Asano gratefully acknowledges support from the EPSRC Centre for Doctoral Training in Autonomous Intelligent Machines & Systems (EP/L015897/1). We are also grateful to ERC IDIU638009, AWS Machine Learning Research Awards (MLRA) and the use of the University of Oxford Advanced Research Computing (ARC).
References
 Agrawal et al. (2015) Pulkit Agrawal, Joao Carreira, and Jitendra Malik. Learning to see by moving. In Proc. ICCV, pp. 37–45. IEEE, 2015.
 Arandjelović & Zisserman (2017) R. Arandjelović and A. Zisserman. Look, listen and learn. In Proc. ICCV, 2017.
 Arthur & Vassilvitskii (2007) David Arthur and Sergei Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pp. 1027–1035. Society for Industrial and Applied Mathematics, 2007.
 Asano et al. (2019) Yuki M Asano, Christian Rupprecht, and Andrea Vedaldi. Surprising effectiveness of few-image unsupervised feature learning. arXiv preprint arXiv:1904.13132, 2019.
 Bach & Harchaoui (2008) Francis R. Bach and Zaïd Harchaoui. Diffrac: a discriminative and flexible framework for clustering. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis (eds.), Advances in Neural Information Processing Systems 20, pp. 49–56. Curran Associates, Inc., 2008.
 Bojanowski & Joulin (2017) Piotr Bojanowski and Armand Joulin. Unsupervised learning by predicting noise. In Proc. ICML, pp. 517–526. PMLR, 2017.
 Caron et al. (2018) M. Caron, P. Bojanowski, A. Joulin, and M. Douze. Deep clustering for unsupervised learning of visual features. In Proc. ECCV, 2018.
 Cuturi (2013) Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in neural information processing systems, pp. 2292–2300, 2013.
 de Sa (1994) Virginia R de Sa. Learning classification with unlabeled data. In NIPS, pp. 112–119, 1994.
 Deng et al. (2009) J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proc. CVPR, 2009.
 Doersch & Zisserman (2017) Carl Doersch and Andrew Zisserman. Multi-task self-supervised visual learning. In Proc. ICCV, 2017.
 Doersch et al. (2015) Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In Proc. ICCV, pp. 1422–1430, 2015.
 Donahue & Simonyan (2019) Jeff Donahue and Karen Simonyan. Large scale adversarial representation learning, 2019.
 Donahue et al. (2017) Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. Proc. ICLR, 2017.
 Dosovitskiy et al. (2016) A. Dosovitskiy, P. Fischer, J. T. Springenberg, M. Riedmiller, and T. Brox. Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE PAMI, 38(9):1734–1747, Sept 2016. ISSN 0162-8828. doi: 10.1109/TPAMI.2015.2496141.
 Everingham et al. (2012) M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.

 Feng et al. (2019) Zeyu Feng, Chang Xu, and Dacheng Tao. Self-supervised representation learning by rotation feature decoupling. In Proc. CVPR, pp. 10364–10374, 2019.
 Gan et al. (2018) Chuang Gan, Boqing Gong, Kun Liu, Hao Su, and Leonidas J Guibas. Geometry guided convolutional neural networks for self-supervised video representation learning. In Proc. CVPR, 2018.
 Gao et al. (2016) Ruohan Gao, Dinesh Jayaraman, and Kristen Grauman. Object-centric representation learning from unlabeled videos. In Proc. ACCV, 2016.
 Gidaris et al. (2018) Spyros Gidaris, Praveen Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. In Proc. ICLR, 2018.
 Hénaff et al. (2019) Olivier J Hénaff, Ali Razavi, Carl Doersch, SM Eslami, and Aaron van den Oord. Data-efficient image recognition with contrastive predictive coding. arXiv preprint arXiv:1905.09272, 2019.

Huang et al. (2019)
Jiabo Huang, Q Dong, Shaogang Gong, and Xiatian Zhu.
Unsupervised deep learning by neighbourhood discovery.
In Proceedings of the International Conference on machine learning (ICML), 2019.  Isola et al. (2015) Phillip Isola, Daniel Zoran, Dilip Krishnan, and Edward H Adelson. Learning visual groups from cooccurrences in space and time. In Proc. ICLR, 2015.
 Jayaraman & Grauman (2015) Dinesh Jayaraman and Kristen Grauman. Learning image representations tied to egomotion. In Proc. ICCV, 2015.
 Jayaraman & Grauman (2016) Dinesh Jayaraman and Kristen Grauman. Slow and steady feature analysis: higher order temporal coherence in video. In Proc. CVPR, 2016.
 Jenni & Favaro (2018) Simon Jenni and Paolo Favaro. Selfsupervised feature learning by learning to spot artifacts. In Proc. CVPR, 2018.
 Ji et al. (2018) Xu Ji, João F Henriques, and Andrea Vedaldi. Invariant information distillation for unsupervised image segmentation and clustering. arXiv preprint arXiv:1807.06653, 2018.
 Kolesnikov et al. (2019) Alexander Kolesnikov, Xiaohua Zhai, and Lucas Beyer. Revisiting self-supervised visual representation learning. arXiv preprint arXiv:1901.09005, 2019.
 Krizhevsky et al. (2012) A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pp. 1106–1114, 2012.
 Krizhevsky (2014) Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014.
 Krizhevsky et al. (2009) Alex Krizhevsky et al. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 Larsson et al. (2017) Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Colorization as a proxy task for visual understanding. In Proc. CVPR, 2017.
 Lee et al. (2017) Hsin-Ying Lee, Jia-Bin Huang, Maneesh Kumar Singh, and Ming-Hsuan Yang. Unsupervised representation learning by sorting sequences. In Proc. ICCV, 2017.
 Mahendran et al. (2018) A. Mahendran, J. Thewlis, and A. Vedaldi. Cross pixel optical-flow similarity for self-supervised learning. In Proc. ACCV, 2018.
 Misra et al. (2016) Ishan Misra, C. Lawrence Zitnick, and Martial Hebert. Shuffle and learn: Unsupervised learning using temporal order verification. In Proc. ECCV, 2016.
 Mundhenk et al. (2018) T. Nathan Mundhenk, Daniel Ho, and Barry Y. Chen. Improvements to context based self-supervised learning. In Proc. CVPR, pp. 9339–9348, 2018.
 Netzer et al. (2011) Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Ng. Reading digits in natural images with unsupervised feature learning. NIPS, 01 2011.
 Noroozi & Favaro (2016) Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In Proc. ECCV, pp. 69–84. Springer, 2016.
 Noroozi et al. (2017) Mehdi Noroozi, Hamed Pirsiavash, and Paolo Favaro. Representation learning by learning to count. In Proc. ICCV, 2017.
 Noroozi et al. (2018) Mehdi Noroozi, Ananth Vinjimoor, Paolo Favaro, and Hamed Pirsiavash. Boosting self-supervised learning via knowledge transfer. In Proc. CVPR, 2018.
 Oord et al. (2018) Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
 Owens et al. (2016) Andrew Owens, Phillip Isola, Josh H. McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. Visually indicated sounds. In Proc. CVPR, pp. 2405–2413, 2016.
 Pathak et al. (2016) Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In Proc. CVPR, pp. 2536–2544, 2016.
 Pathak et al. (2017) Deepak Pathak, Ross Girshick, Piotr Dollár, Trevor Darrell, and Bharath Hariharan. Learning features by watching objects move. In Proc. CVPR, 2017.
 Ren & Lee (2018) Zhongzheng Ren and Yong Jae Lee. Cross-domain self-supervised multi-task feature learning using synthetic imagery. In Proc. CVPR, 2018.
 Sermanet et al. (2018) Pierre Sermanet et al. Time-contrastive networks: Self-supervised learning from video. In Proc. Intl. Conf. on Robotics and Automation, 2018.
 Srivastava et al. (2015) N. Srivastava, E. Mansimov, and R. Salakhutdinov. Unsupervised learning of video representations using LSTMs. In Proc. ICML, 2015.
 Vinh et al. (2010) Nguyen Xuan Vinh, Julien Epps, and James Bailey. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. Journal of Machine Learning Research, 11(Oct):2837–2854, 2010.
 Vo et al. (2019) Huy V Vo, Francis Bach, Minsu Cho, Kai Han, Yann LeCun, Patrick Pérez, and Jean Ponce. Unsupervised image matching and object discovery as optimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8287–8296, 2019.
 Wang & Gupta (2015) Xiaolong Wang and Abhinav Gupta. Unsupervised learning of visual representations using videos. In Proc. ICCV, pp. 2794–2802, 2015.
 Wang et al. (2017) Xiaolong Wang, Kaiming He, and Abhinav Gupta. Transitive invariance for selfsupervised visual representation learning. In Proc. ICCV, 2017.
 Wei et al. (2018) D. Wei, J. Lim, A. Zisserman, and W. T. Freeman. Learning and using the arrow of time. In Proc. CVPR, 2018.
 Wu et al. (2018) Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via nonparametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3733–3742, 2018.
 Zhang et al. (2019) Liheng Zhang, Guo-Jun Qi, Liqiang Wang, and Jiebo Luo. AET vs. AED: Unsupervised representation learning by auto-encoding transformations rather than data. In Proc. CVPR, pp. 2547–2555, 2019.
 Zhang et al. (2016) Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In Proc. ECCV, pp. 649–666. Springer, 2016.

 Zhang et al. (2017) Richard Zhang, Phillip Isola, and Alexei A. Efros. Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In Proc. CVPR, 2017.
 Zhou et al. (2014) Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. Learning deep features for scene recognition using places database. In NIPS, pp. 487–495, 2014.
Appendix A
A.1 Imbalanced Data Experiments
  kNN  Linear/conv5
 Training data  CIFAR-10  CIFAR-100  CIFAR-10  CIFAR-100
 CIFAR-10, full
  Supervised
  k-means []
  ours []
 CIFAR-10, light imbalance
  Supervised
  k-means []
  ours []
 CIFAR-10, heavy imbalance
  Supervised
  k-means []
  ours []
To understand whether our equipartition regularization restricts the type of datasets the method can be applied to, we perform multiple ablation experiments on artificially imbalanced datasets in Table A.1. In particular, we compare the performance of our clustering algorithm based on optimal transport with a simple k-means clustering using the same number of clusters. For k-means we use the "k-means++" (Arthur & Vassilvitskii, 2007) initialization method with three initializations.
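For reference, the k-means baseline can be sketched with scikit-learn; the features below are synthetic stand-ins for network activations, not our actual embeddings:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for feature embeddings of two well-separated groups.
rng = np.random.RandomState(0)
feats = np.concatenate([rng.randn(50, 8) + 3.0,
                        rng.randn(50, 8) - 3.0])

# "k-means++" seeding with three restarts, as in our ablation.
km = KMeans(n_clusters=2, init="k-means++", n_init=3,
            random_state=0).fit(feats)
pseudo_labels = km.labels_  # cluster ids used as pseudo-labels
```

The resulting cluster assignments play the role of pseudo-labels against which the network would then be trained with the cross-entropy loss.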
On the full dataset, we find that supervised training yields the best results for both kNN and linear-separation evaluations on the same dataset, while kNN evaluation on CIFAR-100 works better with our self-supervised method. This might be because the last layer of the network heavily overfits to the specific training data under supervised learning signals. Compared to our method, k-means performs significantly worse on all evaluation metrics.
Under the first imbalance scenario, where we leave out half of the training images of the last class of CIFAR10 (truck), we find the same ordering in the methods’ performances. This is not surprising as the change in training data is very small.
Under the much stronger heavy-imbalance scenario, where the number of images per class increases linearly from to of the original size, we find that all methods drop in performance. However, compared to full data and light imbalance, the gap between supervised and self-supervised decreases only slightly for both k-means and our method. While the supervised method still yields better performance, this indicates that the supervised baseline is affected more by the imbalance than the other methods. Even in this heavy-imbalance scenario, our method outperforms a k-means-based objective by more than for the kNN baselines and by percentage points for the linear evaluation.
In conclusion, our method does not rely on the dataset containing the same number of images for every class, and it outperforms a k-means baseline even in very strong imbalance settings.
A.2 Implementation Details
Learning Details
Unless otherwise noted, we train all our self-supervised models with SGD and initial learning rate 0.05 for 400 epochs, dividing the learning rate by ten at epochs 150, 300 and 350. We spread our pseudo-label optimizations throughout the whole training process following a logarithmic distribution: we optimize the label assignment at , where is the user-defined number of optimizations, expressed as a fraction of total training epochs. For the Sinkhorn-Knopp optimization we set the regularization parameter as in (Cuturi, 2013). We use standard data augmentations during training, consisting of randomly resized crops, horizontal flipping and added noise, as in (Wu et al., 2018).
Quantitative Evaluation – Technical Details.
Unfortunately, prior work has used several slightly different setups, so that comparing results between different publications must be done with caution.
In our ImageNet implementation, we follow the original proposal (Zhang et al., 2017) in pooling each representation of conv1–conv5 to a vector with dimensions using adaptive max-pooling, and absorb the batch-normalization weights into the preceding convolutions. For evaluation on ImageNet we follow RotNet to train linear probes: images are resized such that the shorter edge has a length of pixels, random crops of are computed and flipped horizontally at random. Learning lasts for epochs and the learning-rate schedule starts from and is divided by five at epochs , and . The top-1 accuracy of the linear classifier is then measured on the ImageNet validation subset by extracting ten crops for each validation image (four at the corners and one at the center, along with their horizontal flips) and averaging the prediction scores before the accuracy is computed. For CIFAR-10/100 and SVHN we train AlexNet architectures on the resized images with batch size , learning rate and the same image augmentations (random resized crops, color jitter and random grayscale) as used in prior work (Huang et al., 2019). We use the same linear-probing protocol as for our ImageNet experiments, but without crops. For the kNN experiments we use the same number of neighbours and embedding size as previous works.
A.3 Further details
NMI over time
In Figure A.1 we find that most learning takes place in the early epochs, and we reach a final NMI value of around 66%. Similarly, we find that, due to the updating of the pseudo-labels at regular intervals and our data augmentation, the pseudo-label accuracies keep rising continuously without overfitting to these labels.
Clustering metrics
In Table A.2, we report standard clustering metrics (see (Vinh et al., 2010) for detailed definitions) of our trained models with respect to the ImageNet validation-set ground-truth labels. These include the chance-corrected metrics, namely the adjusted normalized mutual information (NMI) and the adjusted Rand index, as well as the default NMI, also reported in DeepCluster (Caron et al., 2018).
 Variant  NMI  adjusted NMI  adjusted Rand index  Top-1 Acc.
 SL [] AlexNet
 SL [] AlexNet
 SL [] AlexNet
 SL [] AlexNet
 SL [] ResNet-50
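As a concrete illustration (with toy labels, not the values of Table A.2), all of these metrics are available in scikit-learn; for two identical partitions whose cluster ids are merely permuted, each metric equals one:

```python
from sklearn.metrics import (adjusted_mutual_info_score,
                             adjusted_rand_score,
                             normalized_mutual_info_score)

true_labels = [0, 0, 1, 1, 2, 2]
pseudo = [1, 1, 0, 0, 2, 2]  # same partition, cluster ids permuted

nmi = normalized_mutual_info_score(true_labels, pseudo)   # default NMI
anmi = adjusted_mutual_info_score(true_labels, pseudo)    # chance-corrected
ari = adjusted_rand_score(true_labels, pseudo)            # chance-corrected
```

The chance-corrected variants are preferable when the number of clusters is large, since the plain NMI is biased upwards in that regime (Vinh et al., 2010).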
Conv1 filters
In Figure A.3 we show the first-layer convolutional filters of two of our trained models. We find the typical Gabor-like edge detectors, as well as color blobs and dot detectors.
Entropy over time
In Figure A.4, we show how the distribution of entropy with respect to the true ImageNet labels changes with training time. At first, all 3000 pseudo-labels contain essentially random real ImageNet labels, yielding a high entropy of around . Towards the end of training we arrive at a broad spectrum of entropies, with some as low as (see Fig. A.5 and A.6 for low-entropy label visualizations) and the mean around (see Fig. A.7 and A.8 for visualizations of randomly chosen labels).
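The per-pseudo-label entropy underlying Figure A.4 can be computed as in the following sketch (the label arrays are hypothetical; entropies are in nats):

```python
import numpy as np

def per_cluster_entropy(true_labels, pseudo_labels):
    """Shannon entropy of the true-label distribution inside each
    pseudo-label; 0 means the cluster is pure."""
    ents = {}
    for c in np.unique(pseudo_labels):
        _, counts = np.unique(true_labels[pseudo_labels == c],
                              return_counts=True)
        p = counts / counts.sum()
        ents[int(c)] = float(-(p * np.log(p)).sum())
    return ents

true_labels = np.array([0, 0, 0, 0, 1, 2])
pseudo_labels = np.array([0, 0, 0, 0, 1, 1])
ents = per_cluster_entropy(true_labels, pseudo_labels)
# cluster 0 is pure (entropy 0); cluster 1 mixes two labels (entropy log 2)
```

A uniform mix of all 1000 ImageNet classes inside one pseudo-label would give the maximal entropy log 1000 ≈ 6.9 nats, so low per-cluster entropies indicate semantically coherent pseudo-labels.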
Further AlexNet baselines
In Table A.3, we report additional linear-probe evaluation details for fully supervised and randomly initialized AlexNet models. Averaging over 10 crops is consistently better.
ILSVRC12  
Method  c1  c2  c3  c4  c5 
ImageNet labels (10crop)  
ImageNet labels (1crop)  
Random (10crop)  
Random (1crop) 
A.4 Complete tables
In the following, we report the unabridged tables with all related work.
ILSVRC12  Places  
Method  c1  c2  c3  c4  c5  c1  c2  c3  c4  c5 
ImageNet supervised  
Places supervised            
Random  
Inpainting, (Pathak et al., 2016)  
BiGAN, (Donahue et al., 2017)  
Context, (Doersch et al., 2015)  
Colorization, (Zhang et al., 2016)  
Jigsaw, (Noroozi & Favaro, 2016)  
Counting, (Noroozi et al., 2017)  
 Split-Brain, (Zhang et al., 2017)  
Instance retrieval, (Wu et al., 2018)  
CC+VGG, (Noroozi et al., 2018)  
Context 2 (Mundhenk et al., 2018)  
RotNet, (Gidaris et al., 2018)  
Artifacts, (Jenni & Favaro, 2018)  
DeepCluster (RGB), (Caron et al., 2018)            
AND,(Huang et al., 2019)            
DeepCluster, (Caron et al., 2018)  
AET,(Zhang et al., 2019)  
RotNet+retrieval, (Feng et al., 2019)  
SL [] 
Method  Architecture  Evaluation details (epochs)  Top1  Top5 
Supervised, (Donahue & Simonyan, 2019)  ResNet50  Adam, LR sweeps (135)  
Supervised, (Donahue & Simonyan, 2019)  ResNet101  Adam, LR sweeps (135)  
Jigsaw, (Kolesnikov et al., 2019)  ResNet50  SGD (500)  
RelPathLoc, (Kolesnikov et al., 2019)  ResNet50  SGD (500)  
Exemplar, (Kolesnikov et al., 2019)  ResNet50  SGD (500)  
Rotation, (Kolesnikov et al., 2019)  ResNet50  SGD (500)  
Multitask, (Doersch & Zisserman, 2017)  ResNet101  unclear  
CPC, (Oord et al., 2018)  ResNet101  SGD (145)  
BigBiGAN, (Donahue & Simonyan, 2019)  ResNet50  Adam, LR sweeps (135)  
SL []  ResNet50  SGD (145)  
SL []  ResNet50  SGD (145)  
SL []  ResNet50  SGD (145)  
other architectures  
Rotation, (Kolesnikov et al., 2019)  RevNet50  SGD (500)  53.7   
BigBiGAN, (Donahue & Simonyan, 2019)  RevNet50  Adam, LR sweeps (135)  60.8  81.4 
Efficient CPC, (Hénaff et al., 2019)  ResNet170  SGD (145)  61.0  83.0 