1 Introduction
Unsupervised learning of visual features has attracted wide attention as it provides an alternative way to efficiently train very deep networks without labeled data. Recent breakthroughs in this direction focus on two categories of methods: contrastive learning [1, 2, 3] and transformation prediction [4, 5, 6, 7], among many alternative unsupervised methods such as generative adversarial networks [8, 9, 10, 11] and autoencoders [12, 13]. The former category [1, 2, 3, 14, 15, 16, 17] trains a network on a self-training task by distinguishing between different instance classes, each containing the samples augmented from the same instance. Such a contrastive learning problem seeks to exploit inter-instance discrimination to perform unsupervised learning. In contrast, the other category of transformation prediction methods [4, 5, 6, 7] trains a deep network by predicting the transformations used to augment input instances. It attempts to exploit the intra-instance variations under multiple augmentations to learn the feature embedding.
A good visual representation ought to combine the advantages of both inter-instance discrimination and intra-instance variations. In particular, the feature embedding should not only capture the significant intra-instance variations among the augmented samples of each instance, but also discern the distinctions between instances by considering their potential variations to enable inter-instance discrimination. In other words, inter-instance discrimination should be performed by matching a query against all potential variants of an instance. To this end, we propose a novel K-shot contrastive learning as a first attempt to combine their strengths, and we will show that most existing contrastive learning methods are a special one-shot case.
In particular, we apply multiple augmentations to transform each instance, resulting in an instance subspace spanned by the augmented samples. Each instance subspace learns the significant factors of variations from the augmented samples, which configure how these factors can be linearly combined to form the variants of the instance. Then, given a query, the most relevant variant of each instance is retrieved by projecting the query onto the associated subspace [18, 19]. After that, inter-instance discrimination is conducted by assigning the query to the instance class with the shortest projection distance. An eigenvalue decomposition is performed to configure each instance subspace with the orthonormal eigenvectors as its basis. This configuration of instance subspaces is non-parametric and differentiable, allowing an end-to-end training of the embedding network [20] through back-propagation.

Experimental results demonstrate that the proposed K-Shot Contrastive Learning (KSCL) can consistently improve the state-of-the-art performance on unsupervised learning. Particularly, with the ResNet-50 backbone, it improves the top-1 accuracy of the SimCLR and the MoCo v2 to 68.8% on ImageNet over 200 epochs. It also reaches a higher top-1 accuracy of 71.4% over 800 epochs than the baseline SimCLR and the re-run MoCo v2. For the sake of fair comparison, all these improvements are achieved with the same experiment settings, such as the network architecture, data augmentation, training strategy, and the versions of the deep learning framework and libraries. The consistent improvements under the same model settings suggest the proposed KSCL can serve as a generic plug-in to further increase the accuracy of contrastive learning methods on downstream tasks.
The remainder of the paper is organized as follows. We review the related works in Section 2, and present the proposed K-shot contrastive learning in Section 3. Implementation details are described in Section 4. We demonstrate the experiment results in Section 5, and conclude the paper in Section 6.
2 Related Works
In this section, we review the works related to the proposed K-Shot Contrastive Learning (KSCL) in the following four areas.
2.1 Contrastive Learning
Contrastive learning [1] was first proposed to learn unsupervised representations by maximizing the mutual information between the learned representation and a particular context. It usually focuses on the context of the same instance to learn features by discriminating between one example and another in an embedding space [21, 3, 2]. For example, instance discrimination has been used as a pretext task by distinguishing augmented samples from each other within a minibatch [2], over a memory bank [21], or in a dynamic dictionary with a queue [3]. The comparison between the augmented samples of individual instances is usually performed on a pairwise basis. The state-of-the-art performances of contrastive learning have relied on a composite of carefully designed augmentations [2] to prevent the unsupervised training from utilizing side information to accomplish the pretext task. This has been shown necessary to reach competitive results on downstream tasks.
2.2 Transformation Prediction
Transformation prediction [5, 4] constitutes another category of unsupervised methods for learning visual embeddings. In contrast to contrastive learning, which focuses on inter-instance discrimination, it aims to learn representations that equivary against various transformations [6, 22, 23]. These transformations are used to augment images, and the learned representation is trained to capture the visual structures from which these transformations can be recognized. It focuses on modeling the intra-instance variations from which variants of an instance can be leveraged in downstream tasks such as classification [5, 22, 4], object detection [4, 22], semantic segmentation on images [4, 24], and 3D point clouds [23]. This category of methods provides an orthogonal perspective to contrastive learning based on inter-instance discrimination.
2.3 FewShot Learning
From an alternative perspective, contrastive learning based on inter-instance discrimination can be viewed as a special case of few-shot learning [25, 26, 27, 28, 29], where each instance is a class with several examples augmented from that instance. The difference lies in that the examples for each class can be much more abundant, since one can apply many augmentations to generate an arbitrary number of examples. Of course, these examples are not statistically independent, as they share the same instance. From this point of view, the non-parametric instance discrimination [21] and its subsequent works [3, 2] can be viewed as an extension of weight imprinting [30], by initializing the weights of each instance class with the embedded feature vector of an augmented sample, resulting in the inner product and cosine similarity used in these algorithms [21, 3, 2]. Such a surprising connection between the non-parametric instance discrimination and few-shot learning may open a new way to train contrastive prediction models. In this sense, the proposed K-shot contrastive learning generalizes few-shot learning by imprinting the orthonormal basis of an instance subspace with the embeddings of augmented samples from the instance.

2.4 Capsule Nets
The length of a vectorized feature representation has been used in capsule nets pioneered by Hinton et al. [31, 32]. In capsule nets, a group of neurons forms a capsule (vector) whose direction represents different instantiations that equivary against transformations, and whose length accounts for the confidence that a particular class of object is detected. From this point of view, the projection of a query example onto an instance subspace in this paper also carries an analogy to a capsule. Its direction represents the instantiated configuration of how the K-shot augmentations from the instance are linearly combined to form the query, while its length gives rise to the likelihood of the query belonging to this instance class, since a longer projection means a shorter distance to the subspace. This idea of using projections onto several capsule subspaces, each corresponding to a class, has shown promising results in effectively training deep networks [32].

3 The Approach
In this section, we define the K-shot contrastive learning as the pretext task for training an unsupervised feature embedding with Multiple Instance Augmentations (MIAs).
3.1 Preliminaries on Contrastive Learning
Suppose we are given a set of $N$ unlabeled instances in a minibatch (e.g., in the SimCLR [2]) or from a dictionary (e.g., the memory bank in the non-parametric instance discrimination [21] and the dynamic queue in the MoCo [3]). Then contrastive learning can be formulated as classifying a query example $\mathbf{q}$ into one of the $N$ instance classes, each corresponding to an instance $\mathbf{x}_i$. The goal is to learn a deep network embedding each instance $\mathbf{x}_i$ and the query into feature vectors $\mathbf{v}_i$ and $\mathbf{q}$. Then the probability of the embedded query $\mathbf{q}$ belonging to an instance class $i$ is defined as

$p(i \mid \mathbf{q}) = \dfrac{\exp(\mathrm{sim}(\mathbf{q}, \mathbf{v}_i)/\tau)}{\sum_{j=1}^{N} \exp(\mathrm{sim}(\mathbf{q}, \mathbf{v}_j)/\tau)}$   (1)

where $\mathrm{sim}(\cdot,\cdot)$ is a similarity measure (e.g., cosine similarity) defined between two embeddings, and $\tau$ is a positive temperature hyperparameter. When the query $\mathbf{q}$ is the embedding of an augmented sample from $\mathbf{x}_i$, $p(i \mid \mathbf{q})$ gives rise to the probability of a relevant embedding being successfully retrieved from the instance class $i$. One can minimize the contrastive loss, called InfoNCE in [1], resulting from the negative log-likelihood of the above probability over a dictionary to train the embedding network.
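As a concrete illustration, the InfoNCE objective of Eq. (1) can be sketched in a few lines of NumPy. This is a minimal sketch, not the authors' implementation; the function name `info_nce_loss` and the default temperature are illustrative choices.

```python
import numpy as np

def info_nce_loss(q, keys, positive_idx, temperature=0.07):
    """Negative log-likelihood of retrieving the positive key for a query,
    using cosine similarity with temperature scaling, as in Eq. (1)."""
    q = q / np.linalg.norm(q)
    keys = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = keys @ q / temperature      # cosine similarities scaled by tau
    logits -= logits.max()               # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[positive_idx])

# A query identical to its positive key yields a near-zero loss.
q = np.array([1.0, 0.0, 0.0])
keys = np.eye(3)
print(info_nce_loss(q, keys, positive_idx=0) < 0.01)  # → True
```

In practice the query and key embeddings come from the (momentum) encoder; here they are toy unit vectors.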
The idea underlying the contrastive learning approach is that a good representation ought to help retrieve the relevant samples from a set of instances given a query. For example, the SimCLR [2] has achieved the state-of-the-art performance by applying two separate augmentations to each instance in a minibatch. Then, given a query example, it views the sample augmented from the same instance as the positive example, while treating those augmented from the other instances as negative ones. Alternatively, the MoCo [3] seeks to retrieve relevant samples from a dynamic queue separate from the current minibatch. Both rely on the similarity between a query and a candidate sample to train the embedding network, which can be viewed as one-shot contrastive learning, as explained later.
However, the discrimination between different instance classes not only relies on their inter-instance similarities, but is also characterized by the distribution of augmented samples from the same instance, i.e., the intra-instance variations. While existing contrastive learning methods explore inter-instance discrimination to predict instance classes, we believe the intra-instance variations also play an indispensable role. Thus, we propose K-shot contrastive learning by matching a query against the variants of each instance in the associated instance subspace spanned by its K-shot augmentations.
3.2 K-Shot Multiple Instance Augmentations
Let us consider a K-Shot Contrastive Learning (KSCL) problem. Suppose that $K$ different augmentations are drawn and applied to each instance $\mathbf{x}_i$, resulting in $K$ augmented samples and their embeddings $\mathbf{v}_i^k$ for $k = 1, \dots, K$.
As aforementioned, the information contained in the $K$-shot augmentations provides important clues to distinguish between different instance classes. Comparing a query against each augmented sample individually fails to leverage such intra-instance variations, since the most relevant variant could be a combination of, rather than an individual, factor of variations. Therefore, we are motivated to explore the intra-instance variations through a linear subspace spanned by the augmented samples of each instance. Given a query, the most relevant instance is retrieved by projecting it onto the closest subspace.
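The retrieval rule above — project the query onto every instance subspace and pick the closest one — can be sketched in NumPy. This is an illustrative sketch under the assumption that each subspace is represented by an orthonormal basis (obtained here via a reduced QR factorization); the helper names are hypothetical.

```python
import numpy as np

def orthonormal_basis(V):
    """Orthonormal basis of the span of the columns of V (D x K),
    obtained via a reduced QR factorization."""
    Q, _ = np.linalg.qr(V)
    return Q

def assign_query(q, subspace_bases):
    """Assign a unit-norm query to the instance subspace with the longest
    projection (equivalently, the shortest projection distance)."""
    lengths = [np.linalg.norm(W.T @ q) for W in subspace_bases]
    return int(np.argmax(lengths))

rng = np.random.default_rng(0)
# Two toy instance subspaces in a 5-D embedding space,
# each spanned by three augmented-sample embeddings.
V1 = rng.normal(size=(5, 3))
V2 = rng.normal(size=(5, 3))
bases = [orthonormal_basis(V1), orthonormal_basis(V2)]

# A query lying inside the first subspace is assigned to it.
q = V1 @ np.array([0.2, 0.5, 0.3])
q /= np.linalg.norm(q)
print(assign_query(q, bases))  # → 0
```

Since the query lies in the span of `V1`, its projection onto the first subspace preserves its full unit length, so the first class wins.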
As illustrated in Figure 1, consider the embeddings $\mathbf{v}_i^1, \dots, \mathbf{v}_i^K$ of the $K$-shot augmentations for an instance $\mathbf{x}_i$. These embeddings are normalized to have a unit length and thus reside on the surface of a unit hypersphere. Meanwhile, they span an instance subspace $\mathcal{S}_i$ in the ambient feature space. Then, the projection of the query $\mathbf{q}$ (of a unit length) onto the instance subspace is $P_i \mathbf{q}$, and the projection distance of the query from $\mathcal{S}_i$ becomes

$d^2(\mathbf{q}, \mathcal{S}_i) = \|\mathbf{q} - P_i \mathbf{q}\|^2 = \|\mathbf{q}\|^2 - \|P_i \mathbf{q}\|^2$   (2)

where $\mathbf{q} - P_i \mathbf{q}$ is normal to $\mathcal{S}_i$, and the second equality follows from $(\mathbf{q} - P_i \mathbf{q}) \perp P_i \mathbf{q}$, since the normal vector should be orthogonal to all vectors within the subspace. As the embedding $\mathbf{q}$ has a constant unit length, minimizing the projection distance is equivalent to maximizing the projection length $\|P_i \mathbf{q}\|$.
Let $\theta$ be the acute angle between the query $\mathbf{q}$ and the instance subspace $\mathcal{S}_i$. Then we have $\|P_i \mathbf{q}\| = \cos\theta$, i.e., the projection length can be viewed as the cosine similarity between the query and the whole instance subspace. Compared with the cosine similarity between individual embeddings of instances used in the literature [2, 3, 33], it aims to learn a better representation by discriminating between different instance subspaces containing the variations of sample augmentations.
Now we can define the probability of $\mathbf{q}$ belonging to an instance class $i$ as

$p(i \mid \mathbf{q}) = \dfrac{\exp(\|P_i \mathbf{q}\|/\tau)}{\sum_{j=1}^{N} \exp(\|P_j \mathbf{q}\|/\tau)}$   (3)

Then the KSCL seeks to train the embedding network by maximizing the log-likelihood of the above probability over minibatches to match a query against the correct instance. Particularly, given a query $\mathbf{q}$ of a unit norm, its projection length achieves its maximum of one if $\mathbf{q}$ belongs to $\mathcal{S}_i$, i.e., if it is a linear combination of the $K$-shot augmentations $\mathbf{v}_i^1, \dots, \mathbf{v}_i^K$. In other words, it matches the query against all linear combinations of the augmented samples from each instance $\mathbf{x}_i$, and retrieves the most similar one by projecting the query onto the instance subspace with the shortest distance.
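The class probability of Eq. (3) — a softmax over projection lengths — can be sketched as follows. This is a toy NumPy sketch, not the training code; the function name and temperature value are illustrative assumptions.

```python
import numpy as np

def kscl_probs(q, subspace_bases, temperature=0.2):
    """Softmax over projection lengths onto instance subspaces, as in Eq. (3)."""
    q = q / np.linalg.norm(q)
    lengths = np.array([np.linalg.norm(W.T @ q) for W in subspace_bases])
    logits = lengths / temperature
    logits -= logits.max()               # numerical stability
    e = np.exp(logits)
    return e / e.sum()

rng = np.random.default_rng(0)
# Four toy instance subspaces, each an orthonormal 3-D basis in a 6-D space.
bases = [np.linalg.qr(rng.normal(size=(6, 3)))[0] for _ in range(4)]
q = bases[2] @ rng.normal(size=3)        # a query lying in the third subspace
probs = kscl_probs(q, bases)
print(probs.argmax())  # → 2
```

A query inside a subspace attains the maximum projection length of one there, so the softmax concentrates on that instance class.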
4 Implementations
In this section, we discuss the details of implementing the proposed K-Shot Contrastive Learning (KSCL) model.
4.1 Projection onto Instance Subspace via Eigenvalue Decomposition
Mathematically, there is a closed-form solution to the projection onto the instance subspace $\mathcal{S}_i$ spanned by the $K$-shot augmentations $\mathbf{v}_i^k$'s. Suppose there exists an orthonormal basis for $\mathcal{S}_i$ denoted by the columns of a matrix $W_i$; the projection of a feature vector $\mathbf{q}$ can then be written as $P_i \mathbf{q} = W_i W_i^{\top} \mathbf{q}$.

Since the columns of $W_i$ span $\mathcal{S}_i$, the problem of finding $W_i$ can be formulated as minimizing the following projection residual

$\min_{W_i} \|V_i - W_i W_i^{\top} V_i\|_F^2$   (4)

where $W_i^{\top} W_i = I$, with $V_i$ containing the embeddings of the augmented samples in its columns.

After conducting an eigenvalue decomposition of the positive semi-definite matrix $V_i V_i^{\top}$, the eigenvectors corresponding to the largest eigenvalues give rise to an orthonormal basis of the associated instance subspace, which minimizes (4).
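The eigendecomposition step can be sketched in NumPy as below. This is an illustrative sketch under the stated construction (top eigenvectors of $V V^{\top}$); the helper name `subspace_basis_eig` is hypothetical.

```python
import numpy as np

def subspace_basis_eig(V, k):
    """Top-k eigenvectors of V V^T, forming an orthonormal basis of the
    instance subspace spanned by the columns of V (D x K)."""
    eigvals, eigvecs = np.linalg.eigh(V @ V.T)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]            # re-sort to descending
    return eigvecs[:, order[:k]]

rng = np.random.default_rng(0)
V = rng.normal(size=(5, 3))        # K = 3 augmented embeddings in a 5-D space
W = subspace_basis_eig(V, k=3)
# With k = K, projecting any augmented embedding onto the basis recovers it.
v = V[:, 0]
print(np.linalg.norm(W @ W.T @ v - v) < 1e-8)  # → True
```

In a deep learning framework, the same decomposition (e.g., a differentiable `eigh`) lets gradients flow through the basis construction, which is what enables the end-to-end training mentioned above.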
Since the eigenvalue decomposition is differentiable, the embedding network can be trained end-to-end through error back-propagation. However, as in the other contrastive learning methods [3, 33], the errors are only back-propagated through the embedding network of queries to save computing cost.
4.2 Most Significant Intra-Instance Variations
Usually, we only consider a smaller number $k < K$ of eigenvectors, corresponding to the $k$ largest eigenvalues that account for the most significant factors of variations among the $K$-shot augmentations. This ignores the remaining minor factors of intra-instance variations that may be incurred by noisy augmentations. It also results in a thinner projection matrix $W_i \in \mathbb{R}^{D \times k}$ than $\mathbb{R}^{D \times K}$, and the projection length becomes $\|W_i^{\top} \mathbf{q}\|$. Thus, we will only need to store and update $W_i$ in the KSCL.
In practice, rather than setting $k$ to a prefixed number, we choose $k$ such that the $k$ largest eigenvalues cover a preset percentage of the total eigenvalue sum. The larger the percentage of total eigenvalues preserved, the smaller the projection residual in Eq. (4); when $k = K$, the residual vanishes. This allows a distinct number of eigenvectors per instance to flexibly model various degrees of variations among the $K$-shot augmentations.
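The percentage-based rule for picking $k$ amounts to a cumulative-sum threshold on the sorted spectrum. A minimal sketch, assuming a plain eigenvalue array as input; the function name `choose_k` and the 40% default are illustrative.

```python
import numpy as np

def choose_k(eigvals, energy=0.4):
    """Smallest k such that the k largest eigenvalues cover the preset
    fraction `energy` of the total eigenvalue sum."""
    s = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    cum = np.cumsum(s) / s.sum()
    return int(np.searchsorted(cum, energy) + 1)

# Example spectrum: the top eigenvalue alone covers 50% of the total,
# so a 40% threshold keeps one eigenvector and an 80% threshold keeps two.
print(choose_k([5.0, 3.0, 1.0, 0.5, 0.5], energy=0.4))  # → 1
print(choose_k([5.0, 3.0, 1.0, 0.5, 0.5], energy=0.8))  # → 2
```

Because the threshold is applied per instance, different instances naturally end up with different $k$'s, matching the flexibility described above.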
4.3 One-Shot Contrastive Learning when K = 1
It is not hard to see that the cosine similarity used in the SimCLR and the MoCo is a special case when $K = 1$, i.e., they perform one-shot contrastive learning of visual embeddings. When $K = 1$, there is a single augmented sample per instance, and its instance subspace collapses to a single vector $\mathbf{v}_i$. Since $\mathbf{v}_i$ is normalized to have a unit length in the SimCLR and the MoCo, the projection length of a query $\mathbf{q}$ onto this single vector becomes $\|\mathbf{v}_i \mathbf{v}_i^{\top} \mathbf{q}\| = |\mathbf{v}_i^{\top} \mathbf{q}|$. This is the cosine similarity between two vectors used in existing contrastive learning methods [2, 3, 21], up to an absolute value.
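The one-shot reduction can be verified numerically in a couple of lines; this toy check simply confirms that with a single unit basis vector the projection length equals the absolute cosine similarity.

```python
import numpy as np

rng = np.random.default_rng(1)
v = rng.normal(size=8); v /= np.linalg.norm(v)   # single augmented embedding (K = 1)
q = rng.normal(size=8); q /= np.linalg.norm(q)   # unit-norm query

# With K = 1, the basis is the single unit vector v itself, so the
# projection length reduces to |v^T q|, the absolute cosine similarity.
W = v.reshape(-1, 1)
proj_len = np.linalg.norm(W.T @ q)
print(np.isclose(proj_len, abs(v @ q)))  # → True
```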
5 Experiments
In this section, we perform experiments to compare the KSCL with other state-of-the-art unsupervised learning methods.
5.1 Training Details
To ensure a fair comparison with the previous unsupervised methods [2, 3, 33], in particular the SimCLR [2] and the MoCo v2 [33], we follow the same evaluation protocol with the same hyperparameters.

Specifically, a ResNet-50 network is first pretrained on the 1.28M-image ImageNet dataset [34] without labels, and the performance is evaluated by training a linear classifier upon the fixed features. We report the top-1 accuracy on the ImageNet validation set with a single crop. The momentum update with the same size of dynamic queue, the MLP head, and the data augmentations (e.g., color distortion and blur) are also adopted for the sake of fair comparison with the SimCLR and the MoCo v2. We adopt the same temperature as in [33] without exhaustively searching for an optimal one, yet still obtain better results. This demonstrates that the proposed KSCL can be used as a universal plug-in to consistently improve contrastive learning with no need for further tuning of existing models. We will evaluate the impact of $K$ and the percentage of preserved eigenvalues on the performance later.
5.2 Results on ImageNet Dataset
Model  epochs  batch size  top1 accuracy 
SimCLR[2]  200  256  61.9 
SimCLR (baseline)[2]  200  8192  66.6 
MoCo v1[3]  200  256  60.5 
MoCo v2 (rerun)[33]  200  256  67.5 
Proposed KSCL  200  256  68.8 
Results under more epochs of unsupervised pretraining  
SimCLR (baseline)[2]  1000  4096  69.3 
MoCo v2 (rerun) [33]  800  256  70.6 
Proposed KSCL  800  256  71.4 
Table I compares the top-1 accuracy of the proposed KSCL with that of the SimCLR and the MoCo on the ImageNet. We make a direct comparison between the KSCL and the MoCo v2 by running both on the same hardware platform with the same set of software, such as CUDA 10, PyTorch v1.3, and torchvision 1.1.0 (used in the data augmentation that plays a key role in contrastive learning). With 200 epochs of pretraining, the rerun MoCo v2 achieves the same top-1 accuracy as reported. However, its rerun result over 800 epochs is slightly lower than the result (71.1%) reported in the literature [33], which may be due to different versions of deep learning frameworks and drivers causing variations in model performance.

Table I shows that, after unsupervised pretraining of the KSCL for 200 epochs with a batch size of 256, the KSCL achieves a top-1 accuracy of 68.8% with K = 5 augmentations and 40% of preserved eigenvalues. It is worth noting that a larger batch size is often required to sufficiently train the SimCLR, while the other models such as the KSCL and the MoCo maintain a long dynamic queue as the dictionary. Viewing the SimCLR with a larger batch size of 8192 as a baseline, the KSCL makes a much larger improvement of 2.2% than the MoCo v2 (0.9%) on this baseline under 200 epochs. The KSCL also improves the top-1 accuracy to 71.4% on the ImageNet over 800 epochs of pretraining. Although a better result may be obtained by fine-tuning the hyperparameters and the data augmentation [35], we stick to the same experimental settings as the previous methods [2, 33] for a direct comparison.
We also visualize the learned basis images in Figure 2. The last column presents the basis images spanning the underlying instance subspace for a "cat" image. The weight beneath each image is the inner product between the decomposed eigenvector and the embedding of the corresponding augmentation, and each basis image is a weighted combination of the augmented images in its row. The results show that two bases suffice to capture the major variations among the five image augmentations, while the remaining three only model minor ones that can be discarded as noise.
5.3 Impacts of K and the Preserved Eigenvalue Percentage on Performance
We also study the impact of different values of K and of the preserved eigenvalue percentage on the model performance. Table II shows the top-1 accuracy under various settings. When K = 1, the KSCL reduces to one-shot contrastive learning, which is similar to the MoCo v2. The slight difference (67.2% vs. 67.5%) between the KSCL (K = 1) and the MoCo v2 is probably because we did not fine-tune the temperature for the projection length to optimize the KSCL.

The accuracy increases with a larger number of augmentations per instance and a smaller percentage of preserved eigenvalues. This implies that eliminating the minor noisy variations (as illustrated in Figure 2) with a smaller percentage could improve the performance. Further growing K only marginally improves the performance. This is probably because the data augmentations adopted in the experiments are limited to those used in the compared methods for a direct comparison. Applying more types of augmentations (e.g., jigsaw and rotations) may inject more intra-instance variations that encourage the use of a larger K. However, studying the role of more types of augmentations in contrastive learning is beyond the scope of this paper, and we leave it to future research.
K  preserved %  epochs  batch size  top-1 accuracy  time/epoch (min.)
1  –  200  256  67.2  16 
3  40%  200  256  68.5  26 
5  40%  200  256  68.8  37 
5  90%  200  256  68.4  37 
5.4 Results on VOC Object Detection
Finally, we evaluate the unsupervised representations on the VOC object detection task [36]. The ResNet-50 backbone pretrained on the ImageNet dataset is fine-tuned with a Faster R-CNN detector [37] end-to-end on the VOC 2007+2012 trainval set, and evaluated on the VOC 2007 test set. Table III compares the results with both MoCo models. Under the same setting, the proposed KSCL outperforms the compared MoCo v1 and MoCo v2 models. Results of the SimCLR on the VOC object detection task are not reported in [2].
Model  epochs  batch size  AP50  AP  AP75
MoCo v1  200  256  81.5  55.9  62.6 
MoCo v2  200  256  82.4  57.0  63.6 
MoCo v2  800  256  82.5  57.4  64.0 
KSCL  200  256  82.4  57.1  63.9 
KSCL  800  256  82.7  57.5  64.2 
6 Conclusion
In this paper, we present a novel K-shot contrastive learning method to learn unsupervised visual features. It randomly draws K augmentations and applies them separately to each instance. This results in an instance subspace modeling how the significant factors of variations learned from the augmented samples can be linearly combined to form the variants of the associated instance. Given a query, the most relevant variants are then retrieved by projecting the query onto individual instance subspaces, and the query is assigned to the instance subspace with the shortest projection distance. The proposed K-shot contrastive learning combines the advantages of both the inter-instance discrimination and the intra-instance variations to discern the distinctions between different instances. The experiment results demonstrate its superior performance over the state-of-the-art contrastive learning methods under the same experimental settings.

References
 [1] A. v. d. Oord, Y. Li, and O. Vinyals, "Representation learning with contrastive predictive coding," arXiv preprint arXiv:1807.03748, 2018.
 [2] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, "A simple framework for contrastive learning of visual representations," arXiv preprint arXiv:2002.05709, 2020.
 [3] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, "Momentum contrast for unsupervised visual representation learning," arXiv preprint arXiv:1911.05722, 2019.
 [4] S. Gidaris, P. Singh, and N. Komodakis, "Unsupervised representation learning by predicting image rotations," arXiv preprint arXiv:1803.07728, 2018.
 [5] L. Zhang, G.-J. Qi, L. Wang, and J. Luo, "AET vs. AED: Unsupervised representation learning by auto-encoding transformations rather than data," arXiv preprint arXiv:1901.04596, 2019.
 [6] G.-J. Qi, L. Zhang, C. W. Chen, and Q. Tian, "AVT: Unsupervised learning of transformation equivariant representations by autoencoding variational transformations," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 8130–8139.
 [7] M. Noroozi and P. Favaro, "Unsupervised learning of visual representations by solving jigsaw puzzles," in European Conference on Computer Vision. Springer, 2016, pp. 69–84.
 [8] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
 [9] G.-J. Qi, "Loss-sensitive generative adversarial networks on Lipschitz densities," International Journal of Computer Vision, vol. 128, no. 5, pp. 1118–1140, 2020.
 [10] G.-J. Qi, L. Zhang, H. Hu, M. Edraki, J. Wang, and X.-S. Hua, "Global versus localized generative adversarial nets," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1517–1525.
 [11] Y. Zhao, Z. Jin, G.-J. Qi, H. Lu, and X.-S. Hua, "An adversarial approach to hard triplet generation," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 501–517.
 [12] J. Masci, U. Meier, D. Cireşan, and J. Schmidhuber, "Stacked convolutional auto-encoders for hierarchical feature extraction," in International Conference on Artificial Neural Networks. Springer, 2011, pp. 52–59.
 [13] G. E. Hinton, A. Krizhevsky, and S. D. Wang, "Transforming auto-encoders," in International Conference on Artificial Neural Networks. Springer, 2011, pp. 44–51.
 [14] P. Bachman, R. D. Hjelm, and W. Buchwalter, "Learning representations by maximizing mutual information across views," in Advances in Neural Information Processing Systems, 2019, pp. 15509–15519.
 [15] O. J. Hénaff, A. Srinivas, J. De Fauw, A. Razavi, C. Doersch, S. Eslami, and A. v. d. Oord, "Data-efficient image recognition with contrastive predictive coding," arXiv preprint arXiv:1905.09272, 2019.
 [16] R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio, "Learning deep representations by mutual information estimation and maximization," arXiv preprint arXiv:1808.06670, 2018.
 [17] Y. Tian, D. Krishnan, and P. Isola, "Contrastive multiview coding," arXiv preprint arXiv:1906.05849, 2019.
 [18] X.-S. Hua and G.-J. Qi, "Online multi-label active annotation: towards large-scale content-based video search," in Proceedings of the 16th ACM International Conference on Multimedia, 2008, pp. 141–150.
 [19] X. Shu, J. Tang, G.-J. Qi, Z. Li, Y.-G. Jiang, and S. Yan, "Image classification with tailored fine-grained dictionaries," IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 2, pp. 454–467, 2016.
 [20] S. Chang, G.-J. Qi, C. C. Aggarwal, J. Zhou, M. Wang, and T. S. Huang, "Factorized similarity learning in networks," in 2014 IEEE International Conference on Data Mining. IEEE, 2014, pp. 60–69.
 [21] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin, "Unsupervised feature learning via non-parametric instance discrimination," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3733–3742.
 [22] G.-J. Qi, "Learning generalized transformation equivariant representations via autoencoding transformations," arXiv preprint arXiv:1906.08628, 2019.
 [23] X. Gao, W. Hu, and G.-J. Qi, "GraphTER: Unsupervised learning of graph transformation equivariant representations via autoencoding node-wise transformations," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 7163–7172.
 [24] G.-J. Qi, "Hierarchically gated deep networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2267–2275.
 [25] Z. Peng, Z. Li, J. Zhang, Y. Li, G.-J. Qi, and J. Tang, "Few-shot image recognition with knowledge transfer," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 441–449.
 [26] M. A. Jamal and G.-J. Qi, "Task agnostic meta-learning for few-shot learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 11719–11727.
 [27] H. Xu, H. Xiong, and G. Qi, "FLAT: Few-shot learning via autoencoding transformation regularizers," arXiv preprint arXiv:1912.12674, 2019.
 [28] Y. Qin, W. Zhang, C. Zhao, Z. Wang, X. Zhu, G. Qi, J. Shi, and Z. Lei, "Prior-knowledge and attention-based meta-learning for few-shot learning," arXiv preprint, 2018.
 [29] G.-J. Qi, W. Liu, C. Aggarwal, and T. Huang, "Joint intermodal and intramodal label transfers for extremely rare or unseen classes," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 7, pp. 1360–1373, 2016.
 [30] H. Qi, M. Brown, and D. G. Lowe, "Low-shot learning with imprinted weights," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5822–5830.
 [31] S. Sabour, N. Frosst, and G. E. Hinton, "Dynamic routing between capsules," in Advances in Neural Information Processing Systems, 2017, pp. 3856–3866.
 [32] L. Zhang, M. Edraki, and G.-J. Qi, "CapProNet: Deep feature learning via orthogonal projections onto capsule subspaces," in Advances in Neural Information Processing Systems, 2018, pp. 5814–5823.
 [33] X. Chen, H. Fan, R. Girshick, and K. He, "Improved baselines with momentum contrastive learning," arXiv preprint arXiv:2003.04297, 2020.
 [34] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 248–255.
 [35] Y. Tian, C. Sun, B. Poole, D. Krishnan, C. Schmid, and P. Isola, "What makes for good views for contrastive learning," arXiv preprint arXiv:2005.10243, 2020.
 [36] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes (VOC) challenge," International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2010.
 [37] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, 2015, pp. 91–99.