Unsupervised learning of visual features has attracted wide attention, as it provides an alternative way to efficiently train very deep networks without labeled data. Recent breakthroughs in this direction focus on two categories of methods, contrastive learning [1, 2, 3] and transformation prediction [4, 5, 6, 7], among many alternative unsupervised methods such as generative adversarial networks [8, 9, 10, 11] and auto-encoders [12, 13].
The former category [1, 2, 3, 14, 15, 16, 17] trains a network on a self-training task that distinguishes between different instance classes, each containing the samples augmented from the same instance. Such a contrastive learning problem explores inter-instance discrimination to perform unsupervised learning. In contrast, the other category of transformation prediction methods [4, 5, 6, 7] trains a deep network by predicting the transformations used to augment input instances. It explores the intra-instance variations under multiple augmentations to learn the feature embedding.
A good visual representation ought to combine the advantages of both inter-instance discrimination and intra-instance variations. In particular, the feature embedding should not only capture the significant intra-instance variations among augmented samples from each instance, but also discern the distinction between instances by considering their potential variations to enable the inter-instance discrimination. In other words, the inter-instance discrimination should be performed by matching a query against all potential variants of an instance. To this end, we propose a novel $K$-shot contrastive learning as a first attempt to combine their strengths, and we will show that most existing contrastive learning methods are a special one-shot case.
In particular, we apply multiple augmentations to transform each instance, resulting in an instance subspace spanned by the augmented samples. Each instance subspace learns significant factors of variation from the augmented samples, which configure how these factors can be linearly combined to form the variants of the instance. Then, given a query, the most relevant variant of each instance is retrieved by projecting the query onto the associated subspace [18, 19]. After that, the inter-instance discrimination is conducted by assigning the query to the instance class with the shortest projection distance. An eigenvalue decomposition is performed to configure each instance subspace with the orthonormal eigenvectors as its basis. This configuration of instance subspaces is non-parametric and differentiable, allowing end-to-end training of the embedding network through back-propagation.
Experiment results demonstrate that the proposed $K$-Shot Contrastive Learning (KSCL) can consistently improve the state-of-the-art performance on unsupervised learning. In particular, with the ResNet-50 backbone, it improves the top-1 accuracy of the SimCLR and the MoCo v2 to 68.8% on ImageNet over 200 epochs. It also reaches a higher top-1 accuracy over 800 epochs than the baseline SimCLR and the rerun MoCo v2. For a fair comparison, all these improvements are achieved with the same experiment settings, such as network architecture, data augmentation, training strategy, and the versions of the deep learning framework and libraries. The consistently improved performance under the same model settings suggests that the proposed KSCL can serve as a generic plugin to further increase the accuracy of contrastive learning methods on downstream tasks.
The remainder of the paper is organized as follows. We review the related works in Section 2 and present the proposed $K$-shot contrastive learning in Section 3. Implementation details are described in Section 4. We present the experiment results in Section 5 and conclude the paper in Section 6.
2 Related Works
In this section, we review the works related to the proposed $K$-Shot Contrastive Learning (KSCL) in the following four areas.
2.1 Contrastive Learning
Contrastive learning was first proposed to learn unsupervised representations by maximizing the mutual information between the learned representation and a particular context. It usually focuses on the context of the same instance to learn features by discriminating one example from another in an embedding space [21, 3, 2]. For example, instance discrimination has been used as a pretext task by distinguishing augmented samples from each other in a minibatch [2], over a memory bank [21], or in a dynamic dictionary with a queue [3]. The comparison between the augmented samples of individual instances is usually performed on a pairwise basis. The state-of-the-art performance of contrastive learning has relied on a composite of carefully designed augmentations to prevent the unsupervised training from utilizing side information to accomplish the pretext task. This has been shown to be necessary to reach competitive results on downstream tasks.
2.2 Transformation Prediction
Transformation prediction [5, 4] also constitutes a category of unsupervised methods for learning visual embeddings. In contrast to contrastive learning, which focuses on inter-instance discrimination, it aims to learn representations that equivary against various transformations [6, 22, 23]. These transformations are used to augment images, and the learned representation is trained to capture the visual structures from which these transformations can be recognized. It focuses on modeling the intra-instance variations from which variants of an instance can be leveraged on downstream tasks such as classification [5, 22, 4], object detection [4, 22], semantic segmentation of images [4, 24], and 3D point clouds. This category of methods provides a perspective orthogonal to contrastive learning based on inter-instance discrimination.
2.3 Few-Shot Learning
From an alternative perspective, contrastive learning based on inter-instance discrimination can be viewed as a special case of few-shot learning [25, 26, 27, 28, 29], where each instance is a class with several examples augmented from the instance. The difference lies in that the examples for each class can be much more abundant, since one can apply many augmentations to generate an arbitrary number of examples. Of course, these examples are not statistically independent, as they share the same instance. From this point of view, the non-parametric instance discrimination [21] and several related works [3, 2] can be viewed as an extension of weight imprinting [30]. Such a surprising connection between the non-parametric instance discrimination and the few-shot learning may open a new way to train the contrastive prediction model. In this sense, the proposed $K$-shot contrastive learning generalizes the few-shot learning by imprinting the orthonormal basis of an instance subspace with the embeddings of augmented samples from the instance.
2.4 Capsule Nets
In capsule nets, a group of neurons forms a capsule (vector), whose direction represents different instantiations that equivary against transformations and whose length accounts for the confidence that a particular class of object is detected. From this point of view, the projection of a query example onto an instance subspace in this paper is also analogous to a capsule. Its direction represents the instantiated configuration of how the $K$-shot augmentations from the instance are linearly combined to form the query, while its length gives rise to the likelihood of the query belonging to this instance class, since a longer projection means a shorter distance to the subspace. This idea of using projections onto several capsule subspaces, each corresponding to a class, has shown promising results by effectively training deep networks [32].
3 The Approach
In this section, we define $K$-shot contrastive learning as the pretext task for training unsupervised feature embedding with Multiple Instance Augmentations (MIAs).
3.1 Preliminaries on Contrastive Learning
Suppose we are given a set of $N$ unlabeled instances $\{x_i\}_{i=1}^{N}$ in a minibatch (e.g., in the SimCLR [2]) or from a dictionary (e.g., the memory bank in non-parametric instance discrimination [21] and the dynamic queue in the MoCo [3]). Then contrastive learning can be formulated as classifying a query example $x$ into one of $N$ instance classes, each corresponding to an instance $x_i$.
The goal is to learn a deep network $f_\theta$ embedding each instance $x_i$ and the query $x$ to feature vectors $\mathbf{f}_i = f_\theta(x_i)$ and $\mathbf{f} = f_\theta(x)$. Then the probability of the embedded query $\mathbf{f}$ belonging to an instance class $i$ is defined as
$$p(i \mid \mathbf{f}) = \frac{\exp\big(S(\mathbf{f}, \mathbf{f}_i)/\tau\big)}{\sum_{j=1}^{N} \exp\big(S(\mathbf{f}, \mathbf{f}_j)/\tau\big)} \tag{1}$$
where $S(\cdot, \cdot)$ is a similarity measure (e.g., cosine similarity) defined between two embeddings, and $\tau$ is a positive temperature hyperparameter. When the query $\mathbf{f}$ is the embedding of an augmented sample from $x_i$, $p(i \mid \mathbf{f})$ gives the probability of a relevant embedding being successfully retrieved from the instance class $i$. One can minimize the contrastive loss, called InfoNCE in [1], resulting from the negative log-likelihood of the above probability over a dictionary to train the embedding network.
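The softmax probability and the resulting InfoNCE loss can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: cosine similarity is used as the similarity measure, the temperature value `tau=0.2` is an arbitrary choice rather than the paper's setting, and the function names and random embeddings are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors along `axis` to unit Euclidean length."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def instance_probabilities(query, keys, tau=0.2):
    """Softmax over cosine similarities between one query and N instance embeddings."""
    q = l2_normalize(query)          # shape (D,)
    k = l2_normalize(keys)           # shape (N, D)
    logits = k @ q / tau             # cosine similarity divided by temperature
    logits = logits - logits.max()   # subtract the max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def info_nce_loss(query, keys, positive_index, tau=0.2):
    """Negative log-likelihood of retrieving the positive instance class."""
    return -np.log(instance_probabilities(query, keys, tau)[positive_index])

# Hypothetical data: 8 instance embeddings of dimension 16.
rng = np.random.default_rng(0)
keys = rng.normal(size=(8, 16))
# The query is a slightly perturbed copy of instance 3, standing in for an augmentation.
query = keys[3] + 0.05 * rng.normal(size=16)

probs = instance_probabilities(query, keys)
loss = info_nce_loss(query, keys, positive_index=3)
```

Lowering `tau` sharpens the distribution over instance classes, while raising it flattens the distribution.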
The idea underlying the contrastive learning approach is that a good representation ought to help retrieve the relevant samples from a set of instances given a query. For example, the SimCLR [2] has achieved state-of-the-art performance by applying two separate augmentations to each instance in a minibatch. Then, given a query example, it views the sample augmented from the same instance as the positive example, while treating those augmented from the other instances as negative ones. Alternatively, the MoCo [3] seeks to retrieve relevant samples from a dynamic queue separate from the current minibatch. Both train the embedding network based on the similarity between a query and a candidate sample, which can be viewed as one-shot contrastive learning, as explained later.
However, the discrimination between different instance classes not only relies on their inter-instance similarities, but is also characterized by the distribution of augmented samples from the same instance, i.e., the intra-instance variations. While existing contrastive learning methods explore the inter-instance discrimination to predict instance classes, we believe the intra-instance variations also play an indispensable role. Thus we propose $K$-shot contrastive learning, which matches a query against the variants of each instance in the associated instance subspace spanned by the $K$-shot augmentations.
3.2 $K$-Shot Multiple Instance Augmentations
Let us consider a $K$-Shot Contrastive Learning (KSCL) problem. Suppose that $K$ different augmentations are drawn and applied to each instance $x_i$, resulting in $K$ augmented samples $x_i^k$ and their embeddings $\mathbf{f}_i^k$ for $k = 1, \dots, K$.
As mentioned above, the information contained in the $K$-shot augmentations provides important clues to distinguish between different instance classes. Comparing a query against each augmented sample individually fails to leverage such intra-instance variations, since the most relevant variant could be a combination of factors of variation rather than an individual one. Therefore, we are motivated to explore the intra-instance variations through a linear subspace spanned by the augmented samples of each instance. Given a query, the most relevant instance is retrieved by projecting the query onto the closest subspace.
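As illustrated in Figure 1, consider the embeddings $\{\mathbf{f}_i^k\}_{k=1}^{K}$ of the $K$-shot augmentations for an instance $x_i$. These embeddings are normalized to have unit length and thus reside on the surface of a unit hypersphere. Meanwhile, they span an instance subspace $\mathbb{S}_i$ in the ambient feature space. Then, the projection of the query $\mathbf{f}$ (of unit length) onto the instance subspace is $\Pi_i(\mathbf{f})$, and the squared projection distance of the query from $\mathbb{S}_i$ becomes
$$\|\mathbf{f} - \Pi_i(\mathbf{f})\|_2^2 = \|\mathbf{f}\|_2^2 - 2\langle \mathbf{f}, \Pi_i(\mathbf{f}) \rangle + \|\Pi_i(\mathbf{f})\|_2^2 = \|\mathbf{f}\|_2^2 - \|\Pi_i(\mathbf{f})\|_2^2 \tag{2}$$
where $\mathbf{f} - \Pi_i(\mathbf{f})$ is normal to $\mathbb{S}_i$, and the second equality follows from $\langle \mathbf{f} - \Pi_i(\mathbf{f}), \Pi_i(\mathbf{f}) \rangle = 0$, since the normal vector should be orthogonal to all vectors within the subspace. As the embedding $\mathbf{f}$ has a constant unit length $\|\mathbf{f}\|_2 = 1$, minimizing the projection distance is equivalent to maximizing the projection length $\|\Pi_i(\mathbf{f})\|_2$.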
Let $\alpha_i$ be the acute angle between the query $\mathbf{f}$ and the instance subspace $\mathbb{S}_i$. Then we have $\cos \alpha_i = \|\Pi_i(\mathbf{f})\|_2 / \|\mathbf{f}\|_2 = \|\Pi_i(\mathbf{f})\|_2$, i.e., the projection length can be viewed as the cosine similarity between the query and the whole instance subspace. Compared with the cosine similarity between individual embeddings of instances used in the literature [2, 3, 33], it aims to learn a better representation by discriminating different instance subspaces containing the variations of sample augmentations.
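The relation between projection distance and projection length can be checked numerically. The sketch below assumes random unit-length embeddings as stand-ins for real network outputs and uses a QR factorization merely as a convenient way to obtain an orthonormal basis of the span; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 32, 5  # hypothetical embedding dimension and number of augmentations

# K unit-length embeddings of one instance, stored as rows (random stand-ins).
F = rng.normal(size=(K, D))
F /= np.linalg.norm(F, axis=1, keepdims=True)

# Orthonormal basis (columns of Q) of the subspace spanned by the K embeddings.
Q, _ = np.linalg.qr(F.T)  # Q has shape (D, K)

# A unit-length query embedding.
f = rng.normal(size=D)
f /= np.linalg.norm(f)

proj = Q @ (Q.T @ f)   # projection of the query onto the instance subspace
residual = f - proj    # component of the query normal to the subspace

dist_sq = residual @ residual      # squared projection distance
length_sq = proj @ proj            # squared projection length
cos_angle = np.linalg.norm(proj)   # cosine of the angle between query and subspace
```

Because the query has unit length, `dist_sq` and `1 - length_sq` agree up to floating-point error, so ranking instances by shortest distance and by longest projection gives the same order.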
Now we can define the probability of $\mathbf{f}$ belonging to an instance class $i$ as
$$p(i \mid \mathbf{f}) = \frac{\exp\big(\|\Pi_i(\mathbf{f})\|_2/\tau\big)}{\sum_{j=1}^{N} \exp\big(\|\Pi_j(\mathbf{f})\|_2/\tau\big)} \tag{3}$$
Then the KSCL seeks to train the embedding network by maximizing the log-likelihood of the above probability over minibatches to match a query against the correct instance. In particular, given a query $\mathbf{f}$ of unit norm, its projection length $\|\Pi_i(\mathbf{f})\|_2$ achieves its maximum if $\mathbf{f}$ belongs to $\mathbb{S}_i$, i.e., it is a linear combination of the $K$-shot augmentations $\{\mathbf{f}_i^k\}_{k=1}^{K}$. In other words, the KSCL matches the query against all linear combinations of the augmented samples from each instance $x_i$, and retrieves the most similar one by projecting the query onto the instance subspace with the shortest distance.
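A minimal sketch of this subspace-based classification follows, assuming the probability is a softmax over projection lengths divided by a temperature. The temperature value, the function names, the QR-based bases, and the random data are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def projection_lengths(query, bases):
    """Length of the query's projection onto each instance subspace.

    bases has shape (N, D, K): one matrix with K orthonormal columns per instance.
    """
    coeffs = np.einsum('ndk,d->nk', bases, query)  # coordinates in each basis
    return np.linalg.norm(coeffs, axis=1)

def kscl_probabilities(query, bases, tau=0.2):
    """Softmax over projection lengths (tau is an arbitrary illustrative value)."""
    logits = projection_lengths(query, bases) / tau
    logits -= logits.max()  # numerical stability
    p = np.exp(logits)
    return p / p.sum()

rng = np.random.default_rng(0)
N, D, K = 6, 32, 5  # hypothetical sizes

# Unit-length augmented embeddings: samples[i] holds K column vectors for instance i.
samples = rng.normal(size=(N, D, K))
samples /= np.linalg.norm(samples, axis=1, keepdims=True)

# Orthonormal basis of each instance subspace (QR used here for brevity).
bases = np.stack([np.linalg.qr(s)[0] for s in samples])

# A query built as a linear combination of instance 2's augmented embeddings.
query = samples[2] @ rng.normal(size=K)
query /= np.linalg.norm(query)

lengths = projection_lengths(query, bases)
probs = kscl_probabilities(query, bases)
loss = -np.log(probs[2])  # negative log-likelihood of the correct instance
```

Since the query lies inside the subspace of instance 2, its projection length there is 1 (the maximum for a unit vector), even though it coincides with none of the individual augmented embeddings.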
4 Implementation Details
In this section, we discuss the details of implementing the proposed $K$-Shot Contrastive Learning (KSCL) model.
4.1 Projection onto Instance Subspace via Eigenvalue Decomposition
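Mathematically, there is a closed-form solution to the projection onto the instance subspace $\mathbb{S}_i$ spanned by the $K$-shot augmentations $\mathbf{f}_i^k$. Suppose there exists an orthonormal basis for $\mathbb{S}_i$ denoted by the columns of a matrix $\mathbf{W}_i$. Then the projection of a feature vector $\mathbf{f}$ can be written as $\Pi_i(\mathbf{f}) = \mathbf{W}_i \mathbf{W}_i^\top \mathbf{f}$.
Since we have $\Pi_i(\mathbf{f}_i^k) = \mathbf{f}_i^k$ with $\{\mathbf{f}_i^k\}_{k=1}^{K}$ spanning $\mathbb{S}_i$, the problem of finding $\mathbf{W}_i$ can be formulated by minimizing the following projection residual
$$\min_{\mathbf{W}_i^\top \mathbf{W}_i = \mathbf{I}} \; \sum_{k=1}^{K} \big\| \mathbf{f}_i^k - \mathbf{W}_i \mathbf{W}_i^\top \mathbf{f}_i^k \big\|_2^2 = \mathrm{tr}(\Sigma_i) - \mathrm{tr}\big(\mathbf{W}_i^\top \Sigma_i \mathbf{W}_i\big) \tag{4}$$
where $\Sigma_i = \mathbf{F}_i \mathbf{F}_i^\top$, with $\mathbf{F}_i = [\mathbf{f}_i^1, \dots, \mathbf{f}_i^K]$ containing the embeddings of the augmented samples in its columns.
After conducting an eigenvalue decomposition on the positive semi-definite matrix $\Sigma_i$, the eigenvectors corresponding to the largest eigenvalues give rise to an orthonormal basis of the associated instance subspace, which minimizes (4).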
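The eigenvalue-decomposition step can be sketched as follows. The sketch assumes the decomposed matrix is the product of the embedding matrix (augmented embeddings as columns) with its own transpose; the function name `instance_basis` and the random data are hypothetical.

```python
import numpy as np

def instance_basis(F, rank=None):
    """Top eigenvectors and eigenvalues of F F^T for an embedding matrix F of shape (D, K)."""
    sigma = F @ F.T                           # (D, D), symmetric positive semi-definite
    eigvals, eigvecs = np.linalg.eigh(sigma)  # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]         # reorder to descending
    r = F.shape[1] if rank is None else rank
    return eigvecs[:, order[:r]], eigvals[order[:r]]

rng = np.random.default_rng(0)
D, K = 32, 5  # hypothetical sizes

# K unit-length augmented embeddings as columns (random stand-ins).
F = rng.normal(size=(D, K))
F /= np.linalg.norm(F, axis=0, keepdims=True)

W, lam = instance_basis(F)     # W: (D, K) orthonormal basis of the instance subspace
reconstructed = W @ (W.T @ F)  # projecting the embeddings onto their own span
```

With all `K` eigenvectors kept, projecting the embeddings back onto the basis reproduces them, so the projection residual is zero. When `D` is much larger than `K`, an equivalent and cheaper route would be to decompose the `K x K` Gram matrix `F.T @ F` and map its eigenvectors back through `F`; that variant is a design option, not something the text prescribes.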
Since the eigenvalue decomposition is differentiable, the embedding network can be trained end-to-end through the error back-propagation. However, like the other contrastive learning methods [3, 33], the errors will only be back-propagated through the embedding network of queries to save the computing cost.
4.2 Most Significant Inter-Instance Variations
Usually, we only consider a smaller number $k \le K$ of eigenvectors, corresponding to the largest eigenvalues that account for the most significant factors of variation among the $K$-shot augmentations. This ignores the remaining minor factors of intra-instance variations that may be incurred by noisy augmentations. It also results in a thinner projection matrix $\mathbf{W}_i$ with $k$ rather than $K$ columns, and the projection length becomes $\|\mathbf{W}_i^\top \mathbf{f}\|_2$. Thus, we only need to store and update $\mathbf{W}_i$ in the KSCL.
In practice, rather than setting $k$ to a prefixed number, we choose $k$ such that the $k$ largest eigenvalues cover a preset percentage of the sum of all eigenvalues. The larger the percentage of total eigenvalues preserved, the smaller the projection residual in Eq. (4); when all of them are preserved, the residual vanishes. This allows a distinct number of eigenvectors per instance to flexibly model various degrees of variations among the $K$-shot augmentations.
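Choosing the number of eigenvectors from a preset percentage can be sketched as below. The threshold values, the function name `truncated_basis`, and the synthetic embeddings (two dominant directions plus small noise) are assumptions chosen only to make the behavior visible.

```python
import numpy as np

def truncated_basis(F, keep_ratio):
    """Keep the fewest leading eigenvectors of F F^T whose eigenvalues
    cover at least `keep_ratio` of the eigenvalue sum. F has shape (D, K)."""
    K = F.shape[1]
    eigvals, eigvecs = np.linalg.eigh(F @ F.T)       # ascending order
    eigvals = np.clip(eigvals[::-1][:K], 0.0, None)  # top K, descending, no tiny negatives
    eigvecs = eigvecs[:, ::-1][:, :K]
    covered = np.cumsum(eigvals) / eigvals.sum()     # cumulative fraction covered
    k = int(np.searchsorted(covered, keep_ratio)) + 1
    return eigvecs[:, :min(k, K)]                    # guard against rounding at ratio 1.0

rng = np.random.default_rng(0)
D, K = 32, 5  # hypothetical sizes

# Synthetic embeddings: two dominant factors of variation plus small noise.
B = rng.normal(size=(D, 2))
C = rng.normal(size=(2, K))
F = B @ C + 0.01 * rng.normal(size=(D, K))
F /= np.linalg.norm(F, axis=0, keepdims=True)

W95 = truncated_basis(F, 0.95)   # drops the minor, noise-like directions
W100 = truncated_basis(F, 1.0)   # keeps all K directions

f = rng.normal(size=D)
f /= np.linalg.norm(f)
len95 = np.linalg.norm(W95.T @ f)    # projection length with the thinner basis
len100 = np.linalg.norm(W100.T @ f)  # projection length with the full basis
```

Here the 95% threshold keeps at most two eigenvectors because the noise carries very little of the total eigenvalue mass, while a threshold of 1.0 keeps all five.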
4.3 One-Shot Contrastive Learning when $K = 1$
It is not hard to see that the cosine similarity used in SimCLR and MoCo is a special case when $K = 1$, i.e., they are one-shot contrastive learning of visual embeddings. When $K = 1$, there is a single augmented sample per instance. Its instance subspace collapses to a vector $\mathbf{f}_i$. Since $\mathbf{f}_i$ is $\ell_2$-normalized to have unit length in the SimCLR and the MoCo, the projection length of a query $\mathbf{f}$ onto this single vector becomes $|\mathbf{f}^\top \mathbf{f}_i|$. This is the cosine similarity between two vectors used in existing contrastive learning methods [2, 3, 21], up to an absolute value.
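The one-shot special case is easy to verify numerically. The following sketch uses random unit vectors as hypothetical embeddings and treats the single augmented embedding as a one-column basis.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # hypothetical embedding dimension

# One unit-length augmented embedding for the instance, and a unit-length query.
f_i = rng.normal(size=D)
f_i /= np.linalg.norm(f_i)
f = rng.normal(size=D)
f /= np.linalg.norm(f)

# With a single augmentation, the orthonormal basis is the embedding itself.
W = f_i[:, None]                             # shape (D, 1)
projection_length = np.linalg.norm(W.T @ f)  # length of the projection onto the line
cosine = float(f @ f_i)                      # cosine similarity of two unit vectors
```

The projection length equals the absolute value of the cosine similarity, so the sign of the similarity is the only information lost in this special case.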
5 Experiments
In this section, we perform experiments to compare the KSCL with other state-of-the-art unsupervised learning methods.
5.1 Training Details
We follow the same evaluation protocol with the same hyperparameters.
Specifically, a ResNet-50 network is first pretrained on the ImageNet dataset [34] of 1.28M images without labels, and the performance is evaluated by training a linear classifier on the fixed features. We report top-1 accuracy on the ImageNet validation set with a single crop. The momentum update with the same size of dynamic queue, the MLP head, and the data augmentation (e.g., color distortion and blur augmentation) are also adopted for a fair comparison with the SimCLR and MoCo v2. We adopt the same temperature used in prior work without exhaustively searching for an optimal one, yet still obtain better results. This demonstrates that the proposed KSCL can be used as a universal plugin to consistently improve contrastive learning with no need for further tuning of existing models. We will evaluate the impact of $K$ and the percentage of preserved eigenvalues on the performance later.
5.2 Results on ImageNet Dataset
|Model|epochs|batch size|top-1 accuracy (%)|
|---|---|---|---|
|MoCo v2 (rerun)|200|256|67.5|
|Results under more epochs of unsupervised pretraining||||
|MoCo v2 (rerun)|800|256|70.6|
Table I compares the top-1 accuracy of the proposed KSCL with that of SimCLR and MoCo on ImageNet. We make a direct comparison between the KSCL and the MoCo v2 by running both on the same hardware platform with the same set of software, such as CUDA 10, PyTorch v1.3, and torchvision 1.1.0 (used in the data augmentation that plays a key role in contrastive learning). With 200 epochs of pretraining, our rerun of MoCo v2 reproduces the reported top-1 accuracy. However, its rerun result over 800 epochs is slightly lower than the result reported in the literature (71.1%), which may be due to different versions of deep learning frameworks and drivers that could cause variations in the model performance.
Table I shows that, after unsupervised pretraining with 200 epochs and a batch size of 256, the KSCL achieves a top-1 accuracy of 68.8% for its chosen number of augmentations $K$ and percentage of preserved eigenvalues. It is worth noting that a larger batch size is often required to sufficiently train the SimCLR, while the other models such as KSCL and MoCo maintain a long dynamic queue as the dictionary. Taking the SimCLR trained with a larger batch size as the baseline, the KSCL makes a much larger improvement over this baseline than the MoCo v2 does under 200 epochs. The KSCL also improves the top-1 accuracy on ImageNet over 800 epochs of pretraining. Although a better result may be obtained by fine-tuning the hyperparameters and the data augmentation, we stick to the same experimental setting as the previous methods [2, 33] for a direct comparison.
We also visualize the learned basis images in Figure 2. The last column presents the basis images spanning the underlying instance subspace for a "cat" image. The weight beneath each image is the inner product between the decomposed eigenvector and the embedding of the corresponding augmentation, and each basis image is a weighted combination of the augmented images in the row. The results show that two bases suffice to capture the major variations among the five image augmentations, while the remaining three only model the minor ones that can be discarded as noise.
5.3 Impacts of $K$ and the Percentage of Preserved Eigenvalues on Performance
We also study the impact of different values of $K$ and of the percentage of preserved eigenvalues on the model performance. Table II shows the top-1 accuracy under various settings of both. When $K = 1$, the model reduces to one-shot contrastive learning, which is similar to the MoCo v2. The difference in accuracy between the KSCL ($K = 1$) and the MoCo v2 is probably because we did not fine-tune the temperature for the projection length to optimize the KSCL.
The accuracy increases with a larger number $K$ of augmentations per instance and a smaller percentage of preserved eigenvalues. This implies that eliminating the minor noisy variations (as illustrated in Figure 2) by preserving fewer eigenvalues could improve the performance. Further growing $K$ only marginally improves the performance. This is probably because the data augmentation adopted in the experiments is limited to those used in the compared methods for a direct comparison. Applying more types of augmentations (e.g., jigsaw and rotations) may inject more intra-instance variations that encourage the use of a larger $K$. However, studying the role of more types of augmentations in contrastive learning is beyond the scope of this paper, and we leave it to future research.
|epochs||batch size||top-1 accuracy||time/epoch (min.)|
5.4 Results on VOC Object Detection
Finally, we evaluate the unsupervised representations on the VOC object detection task [36]. The ResNet-50 backbone pretrained on the ImageNet dataset is fine-tuned end-to-end with a Faster R-CNN detector [37] on the VOC 2007+2012 trainval set and evaluated on the VOC 2007 test set. Table III compares the results with both MoCo models. Under the same setting, the proposed KSCL outperforms the compared MoCo v1 and MoCo v2 models. The SimCLR paper does not report results on the VOC object detection task.
6 Conclusion
In this paper, we present a novel $K$-shot contrastive learning to learn unsupervised visual features. It randomly draws $K$-shot augmentations and applies them separately to each instance. This results in an instance subspace modeling how the significant factors of variation learned from the augmented samples can be linearly combined to form the variants of the associated instance. Given a query, the most relevant samples are then retrieved by projecting the query onto individual instance subspaces, and the query is assigned to the instance subspace with the shortest projection distance. The proposed $K$-shot contrastive learning combines the advantages of both the inter-instance discrimination and the intra-instance variations to discriminate between different instances. The experiment results demonstrate its superior performance over state-of-the-art contrastive learning methods under the same experimental setting.
References
- [1] A. v. d. Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.
- [2] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” arXiv preprint arXiv:2002.05709, 2020.
- [3] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” arXiv preprint arXiv:1911.05722, 2019.
- [4] S. Gidaris, P. Singh, and N. Komodakis, “Unsupervised representation learning by predicting image rotations,” arXiv preprint arXiv:1803.07728, 2018.
- [5] L. Zhang, G.-J. Qi, L. Wang, and J. Luo, “Aet vs. aed: Unsupervised representation learning by auto-encoding transformations rather than data,” arXiv preprint arXiv:1901.04596, 2019.
- [6] G.-J. Qi, L. Zhang, C. W. Chen, and Q. Tian, “Avt: Unsupervised learning of transformation equivariant representations by autoencoding variational transformations,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 8130–8139.
- [7] M. Noroozi and P. Favaro, “Unsupervised learning of visual representations by solving jigsaw puzzles,” in European Conference on Computer Vision. Springer, 2016, pp. 69–84.
- [8] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680.
- [9] G.-J. Qi, “Loss-sensitive generative adversarial networks on lipschitz densities,” International Journal of Computer Vision, vol. 128, no. 5, pp. 1118–1140, 2020.
- [10] G.-J. Qi, L. Zhang, H. Hu, M. Edraki, J. Wang, and X.-S. Hua, “Global versus localized generative adversarial nets,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 1517–1525.
- [11] Y. Zhao, Z. Jin, G.-j. Qi, H. Lu, and X.-s. Hua, “An adversarial approach to hard triplet generation,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 501–517.
- [12] J. Masci, U. Meier, D. Cireşan, and J. Schmidhuber, “Stacked convolutional auto-encoders for hierarchical feature extraction,” in International conference on artificial neural networks. Springer, 2011, pp. 52–59.
- [13] G. E. Hinton, A. Krizhevsky, and S. D. Wang, “Transforming auto-encoders,” in International conference on artificial neural networks. Springer, 2011, pp. 44–51.
- [14] P. Bachman, R. D. Hjelm, and W. Buchwalter, “Learning representations by maximizing mutual information across views,” in Advances in Neural Information Processing Systems, 2019, pp. 15 509–15 519.
- [15] O. J. Hénaff, A. Srinivas, J. De Fauw, A. Razavi, C. Doersch, S. Eslami, and A. v. d. Oord, “Data-efficient image recognition with contrastive predictive coding,” arXiv preprint arXiv:1905.09272, 2019.
- [16] R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio, “Learning deep representations by mutual information estimation and maximization,” arXiv preprint arXiv:1808.06670, 2018.
- [17] Y. Tian, D. Krishnan, and P. Isola, “Contrastive multiview coding,” arXiv preprint arXiv:1906.05849, 2019.
- [18] X.-S. Hua and G.-J. Qi, “Online multi-label active annotation: towards large-scale content-based video search,” in Proceedings of the 16th ACM international conference on Multimedia, 2008, pp. 141–150.
- [19] X. Shu, J. Tang, G.-J. Qi, Z. Li, Y.-G. Jiang, and S. Yan, “Image classification with tailored fine-grained dictionaries,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 2, pp. 454–467, 2016.
- [20] S. Chang, G.-J. Qi, C. C. Aggarwal, J. Zhou, M. Wang, and T. S. Huang, “Factorized similarity learning in networks,” in 2014 IEEE International Conference on Data Mining. IEEE, 2014, pp. 60–69.
- [21] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin, “Unsupervised feature learning via non-parametric instance discrimination,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3733–3742.
- [22] G.-J. Qi, “Learning generalized transformation equivariant representations via autoencoding transformations,” arXiv preprint arXiv:1906.08628, 2019.
- [23] X. Gao, W. Hu, and G.-J. Qi, “Graphter: Unsupervised learning of graph transformation equivariant representations via auto-encoding node-wise transformations,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 7163–7172.
- [24] G.-J. Qi, “Hierarchically gated deep networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2267–2275.
- [25] Z. Peng, Z. Li, J. Zhang, Y. Li, G.-J. Qi, and J. Tang, “Few-shot image recognition with knowledge transfer,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 441–449.
- [26] M. A. Jamal and G.-J. Qi, “Task agnostic meta-learning for few-shot learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 11 719–11 727.
- [27] H. Xu, H. Xiong, and G. Qi, “Flat: Few-shot learning via autoencoding transformation regularizers,” arXiv preprint arXiv:1912.12674, 2019.
- [28] Y. Qin, W. Zhang, C. Zhao, Z. Wang, X. Zhu, G. Qi, J. Shi, and Z. Lei, “Prior-knowledge and attention-based meta-learning for few-shot learning,” arXiv, pp. arXiv–1812, 2018.
- [29] G.-J. Qi, W. Liu, C. Aggarwal, and T. Huang, “Joint intermodal and intramodal label transfers for extremely rare or unseen classes,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 7, pp. 1360–1373, 2016.
- [30] H. Qi, M. Brown, and D. G. Lowe, “Low-shot learning with imprinted weights,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 5822–5830.
- [31] S. Sabour, N. Frosst, and G. E. Hinton, “Dynamic routing between capsules,” in Advances in neural information processing systems, 2017, pp. 3856–3866.
- [32] L. Zhang, M. Edraki, and G.-J. Qi, “Cappronet: Deep feature learning via orthogonal projections onto capsule subspaces,” in Advances in Neural Information Processing Systems, 2018, pp. 5814–5823.
- [33] X. Chen, H. Fan, R. Girshick, and K. He, “Improved baselines with momentum contrastive learning,” arXiv preprint arXiv:2003.04297, 2020.
- [34] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition. IEEE, 2009, pp. 248–255.
- [35] Y. Tian, C. Sun, B. Poole, D. Krishnan, C. Schmid, and P. Isola, “What makes for good views for contrastive learning,” arXiv preprint arXiv:2005.10243, 2020.
- [36] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” International journal of computer vision, vol. 88, no. 2, pp. 303–338, 2010.
- [37] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.