K-Shot Contrastive Learning of Visual Features with Multiple Instance Augmentations

07/27/2020 ∙ by Haohang Xu, et al.

In this paper, we propose the K-Shot Contrastive Learning (KSCL) of visual features by applying multiple augmentations to investigate the sample variations within individual instances. It aims to combine the advantages of inter-instance discrimination by learning discriminative features to distinguish between different instances, as well as intra-instance variations by matching queries against the variants of augmented samples over instances. Particularly, for each instance, it constructs an instance subspace to model the configuration of how the significant factors of variations in K-shot augmentations can be combined to form the variants of augmentations. Given a query, the most relevant variant of instances is then retrieved by projecting the query onto their subspaces to predict the positive instance class. This generalizes the existing contrastive learning that can be viewed as a special one-shot case. An eigenvalue decomposition is performed to configure instance subspaces, and the embedding network can be trained end-to-end through the differentiable subspace configuration. Experiment results demonstrate the proposed K-shot contrastive learning achieves superior performances to the state-of-the-art unsupervised methods.




1 Introduction

Unsupervised learning of visual features has attracted wide attention as it provides an alternative way to efficiently train very deep networks without labeled data. Recent breakthroughs in this direction focus on two categories of methods: contrastive learning [1, 2, 3] and transformation prediction [4, 5, 6, 7], among many alternative unsupervised methods such as generative adversarial networks [8, 9, 10, 11] and auto-encoders [12, 13].

The former category [1, 2, 3, 14, 15, 16, 17] trains a network on a self-training task that distinguishes between different instance classes, each containing the samples augmented from the same instance. Such a contrastive learning problem explores inter-instance discrimination to perform unsupervised learning. In contrast, the other category of transformation prediction methods [4, 5, 6, 7] trains a deep network by predicting the transformations used to augment input instances. It explores the intra-instance variations under multiple augmentations to learn the feature embedding.

A good visual representation ought to combine the advantages of both inter-instance discrimination and intra-instance variations. In particular, the feature embedding should not only capture the significant intra-instance variations among augmented samples from each instance, but also discern the distinction between instances by considering their potential variations to enable the inter-instance discrimination. In other words, the inter-instance discrimination should be performed by matching a query against all potential variants of an instance. To this end, we propose a novel K-shot contrastive learning as a first attempt to combine their strengths, and we will show that most existing contrastive learning methods are a special one-shot case.

In particular, we apply multiple augmentations to transform each instance, resulting in an instance subspace spanned by the augmented samples. Each instance subspace learns significant factors of variations from the augmented samples, and it configures how these factors can be linearly combined to form the variants of the instance. Then, given a query, the most relevant variant of each instance is retrieved by projecting the query onto the associated subspace [18, 19]. After that, the inter-instance discrimination is conducted by assigning the query to the instance class with the shortest projection distance. An eigenvalue decomposition is performed to configure each instance subspace with the orthonormal eigenvectors as its basis. This configuration of instance subspaces is non-parametric and differentiable, allowing an end-to-end training of the embedding network [20] through back-propagation.

Experiment results demonstrate that the proposed K-Shot Contrastive Learning (KSCL) can consistently improve the state-of-the-art performance on unsupervised learning. Particularly, with the ResNet-50 backbone, it improves on the top-1 accuracy of both SimCLR and MoCo v2, reaching 68.8% on ImageNet over 200 epochs. It also reaches a higher top-1 accuracy of 71.4% over 800 epochs than the baseline SimCLR and the rerun MoCo v2. For the sake of fair comparison, all these improvements are achieved with the same experiment settings, such as the network architecture, data augmentation, training strategy, and the versions of the deep learning framework and libraries. The consistently improved performance under the same model settings suggests that the proposed KSCL can serve as a generic plugin to further increase the accuracy of contrastive learning methods on downstream tasks.

The remainder of the paper is organized as follows. We review the related works in Section 2, and present the proposed K-shot contrastive learning in Section 3. Implementation details are described in Section 4. We demonstrate the experiment results in Section 5, and conclude the paper in Section 6.

2 Related Works

In this section, we review the works related to the proposed K-Shot Contrastive Learning (KSCL) in the following four areas.

2.1 Contrastive Learning

Contrastive learning [1] was first proposed to learn unsupervised representations by maximizing the mutual information between the learned representation and a particular context. It usually focuses on the context of the same instance to learn features by discriminating one example from the others in an embedding space [21, 3, 2]. For example, instance discrimination has been used as a pretext task by distinguishing augmented samples from each other in a minibatch [2], over a memory bank [21], or in a dynamic dictionary with a queue [3]. The comparison between the augmented samples of individual instances is usually performed on a pairwise basis. The state-of-the-art performance of contrastive learning has relied on a composite of carefully designed augmentations [2] to prevent the unsupervised training from utilizing side information to accomplish the pretext task. This has been shown to be necessary to reach competitive results on downstream tasks.

2.2 Transformation Prediction

Transformation prediction [5, 4] also constitutes a category of unsupervised methods for learning visual embeddings. In contrast to contrastive learning, which focuses on inter-instance discrimination, it aims to learn representations that are equivariant to various transformations [6, 22, 23]. These transformations are used to augment images, and the learned representation is trained to capture the visual structures from which these transformations can be recognized. It focuses on modeling the intra-instance variations from which variants of an instance can be leveraged on downstream tasks such as classification [5, 22, 4], object detection [4, 22], semantic segmentation on images [4, 24] and 3D point clouds [23]. This category of methods provides an orthogonal perspective to contrastive learning based on inter-instance discrimination.

2.3 Few-Shot Learning

From an alternative perspective, contrastive learning based on inter-instance discrimination can be viewed as a special case of few-shot learning [25, 26, 27, 28, 29], where each instance is a class that has several examples augmented from the instance. The difference is that the examples for each class can be far more abundant, since one can apply many augmentations to generate an arbitrary number of examples. Of course, these examples are not statistically independent as they share the same instance. From this point of view, the non-parametric instance discrimination [21] and several subsequent works [3, 2] can be viewed as an extension of weight imprinting [30], which initializes the weights of each instance class with the embedded feature vector of an augmented sample, resulting in the inner product and cosine similarity used in these algorithms [21, 3, 2]. Such a surprising connection between non-parametric instance discrimination and few-shot learning may open a new way to train the contrastive prediction model. In this sense, the proposed K-shot contrastive learning generalizes few-shot learning by imprinting the orthonormal basis of an instance subspace with the embeddings of augmented samples from the instance.

2.4 Capsule Nets

The length of a vectorized feature representation has been used in capsule nets pioneered by Hinton et al. [31, 32]. In capsule nets, a group of neurons forms a capsule (vector) whose direction represents different instantiations that are equivariant to transformations, and whose length accounts for the confidence that a particular class of object is detected. From this point of view, the projected vector of a query example onto an instance subspace in this paper is also analogous to a capsule. Its direction represents the instantiated configuration of how the K-shot augmentations from the instance are linearly combined to form the query, while its length gives rise to the likelihood of the query belonging to this instance class, since a longer projection means a shorter distance to the subspace. This idea of using projections onto several capsule subspaces, each corresponding to a class, has shown promising results by effectively training deep networks [32].

3 The Approach

In this section, we define a K-shot contrastive learning as the pretext task for training unsupervised feature embedding with Multiple Instance Augmentations (MIAs).

3.1 Preliminaries on Contrastive Learning

Suppose we are given a set of $N$ unlabeled instances $\{x_i \mid i = 1, \dots, N\}$ in a minibatch (e.g., in the SimCLR [2]) or from a dictionary (e.g., the memory bank in non-parametric instance discrimination [21] and the dynamic queue in the MoCo [3]). Then contrastive learning can be formulated as classifying a query example $x$ into one of $N$ instance classes, each corresponding to an instance $x_i$.

The goal is to learn a deep network embedding each instance $x_i$ and the query $x$ to feature vectors $f_i$ and $f$, respectively. Then the probability of the embedded query $f$ belonging to an instance class $i$ is defined as

$$p(i \mid f) = \frac{\exp\big(S(f, f_i)/\tau\big)}{\sum_{j=1}^{N} \exp\big(S(f, f_j)/\tau\big)}, \tag{1}$$

where $S(\cdot, \cdot)$ is a similarity measure (e.g., cosine similarity) defined between two embeddings, and $\tau$ is a positive temperature hyperparameter. When the query $f$ is the embedding of an augmented sample from $x_i$, $p(i \mid f)$ gives rise to the probability of a relevant embedding being successfully retrieved from the instance class $i$. One can minimize the contrastive loss called InfoNCE in [1], which results from the negative log-likelihood of the above probability over a dictionary, to train the embedding network.
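As a rough illustration of the instance-classification probability and the InfoNCE loss described above, the following is a minimal pure-Python sketch. The function names, the toy two-dimensional embeddings and the temperature value `tau=0.2` are assumptions made for illustration only; they are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two plain-list vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def instance_probabilities(query, keys, tau=0.2):
    """Softmax over similarity / tau; tau=0.2 is an assumed value."""
    logits = [cosine(query, k) / tau for k in keys]
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def info_nce_loss(query, keys, positive_index, tau=0.2):
    """Negative log-likelihood of the positive instance class."""
    return -math.log(instance_probabilities(query, keys, tau)[positive_index])

# Toy dictionary of three instance embeddings; the query is closest to key 0.
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
query = [0.9, 0.1]
probs = instance_probabilities(query, keys)
loss = info_nce_loss(query, keys, positive_index=0)
```

Lowering `tau` sharpens the distribution toward the most similar key, which is why the temperature is usually treated as a tunable hyperparameter.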

The idea underlying the contrastive learning approach is that a good representation ought to help retrieve the relevant samples from a set of instances given a query. For example, the SimCLR [2] has achieved state-of-the-art performance by applying two separate augmentations to each instance in a minibatch. Then, given a query example, it views the sample augmented from the same instance as the positive example, while treating those augmented from the other instances as negative ones. Alternatively, the MoCo [3] seeks to retrieve relevant samples from a dynamic queue separate from the current minibatch. Both are based on the similarity between a query and a candidate sample to train the embedding network, which can be viewed as one-shot contrastive learning as explained later.

However, the discrimination between different instance classes not only relies on their inter-instance similarities, but also is characterized by the distribution of augmented samples from the same instance, i.e., the intra-instance variations. While existing contrastive learning methods explore the inter-instance discrimination to predict instance classes, we believe the intra-instance variations also play an indispensable role. Thus we propose K-shot contrastive learning by matching a query against the variants of each instance in the associated instance subspace spanned by K-shot augmentations.

3.2 K-Shot Multiple Instance Augmentations

Fig. 1: The pipeline of the proposed K-Shot Contrastive Learning (KSCL). For each instance $x_i$, an instance subspace $V_i$ is spanned by the $\ell_2$-normalized embeddings of K-shot augmentations on a unit hyper-sphere. A given query embedding $f$ of unit length is projected onto the subspace of each instance, resulting in the projection length $\|\Pi_i(f)\|$ that measures the probability of the query belonging to the associated instance class. The projection length also gives the cosine of the acute angle between the query vector $f$ and the instance subspace $V_i$.

Let us consider a K-Shot Contrastive Learning (KSCL) problem. Suppose that $K$ different augmentations are drawn and applied to each instance $x_i$, resulting in $K$ augmented samples $x_i^k$ and their embeddings $f_i^k$ for $k = 1, \dots, K$.

As mentioned above, the information contained in K-shot augmentations provides important clues to distinguish between different instance classes. Comparing a query against each augmented sample individually fails to leverage such intra-instance variations, since the most relevant variant could be a combination of factors of variations rather than an individual one. Therefore, we are motivated to explore the intra-instance variations through a linear subspace spanned by the augmented samples of each instance. Given a query, the most relevant instance is retrieved by projecting the query onto the closest subspace.

As illustrated in Figure 1, consider the embeddings $\{f_i^k \mid k = 1, \dots, K\}$ of K-shot augmentations for an instance $x_i$. These embeddings are normalized to have unit length and thus reside on the surface of a unit hyper-sphere. Meanwhile, they span an instance subspace $V_i$ in the ambient feature space. Then, the projection of the query $f$ (of unit length) onto the instance subspace $V_i$ is $\Pi_i(f)$, and the squared projection distance of the query from $V_i$ becomes

$$\|f - \Pi_i(f)\|^2 = \|f\|^2 - 2\, f^\top \Pi_i(f) + \|\Pi_i(f)\|^2 = \|f\|^2 - \|\Pi_i(f)\|^2, \tag{2}$$

where $f - \Pi_i(f)$ is normal to $V_i$, and the second equality follows from $\big(f - \Pi_i(f)\big)^\top \Pi_i(f) = 0$, since the normal vector should be orthogonal to all vectors within the subspace. As the embedding $f$ has a constant unit length $\|f\| = 1$, minimizing the projection distance is equivalent to maximizing its projection length $\|\Pi_i(f)\|$.

Let $\theta_i$ be the acute angle between the query $f$ and the instance subspace $V_i$. Then we have $\cos\theta_i = \|\Pi_i(f)\|$, where $\Pi_i(f)$ is the projection of the unit-length query onto $V_i$, i.e., the projection length can be viewed as the cosine similarity between the query and the whole instance subspace. Compared with the cosine similarity between individual embeddings of instances used in the literature [2, 3, 33], it aims to learn a better representation by discriminating different instance subspaces containing the variations of sample augmentations.
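The relation between projection distance, projection length and the angle to the subspace can be checked numerically. The sketch below is only an illustration under assumed settings: random unit vectors stand in for the embeddings of the augmented samples, the dimensions are arbitrary, and a QR factorization is used as one convenient way to obtain an orthonormal basis of their span.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 8, 3                      # assumed feature dimension and number of shots

# Columns of F play the role of the K unit-length augmentation embeddings.
F = rng.normal(size=(D, K))
F /= np.linalg.norm(F, axis=0, keepdims=True)

# Orthonormal basis of span(F); reduced QR gives a (D, K) matrix.
W, _ = np.linalg.qr(F)

# Unit-length query embedding.
f = rng.normal(size=D)
f /= np.linalg.norm(f)

proj = W @ (W.T @ f)             # projection of the query onto the subspace
residual = f - proj              # component normal to the subspace
proj_len = np.linalg.norm(proj)
dist_sq = residual @ residual    # squared projection distance

# Cosine of the angle between the query and its projection.
cos_theta = (f @ proj) / proj_len
```

Because the query has unit norm, `dist_sq` equals `1 - proj_len**2`, so ranking instances by the shortest distance or by the longest projection gives the same result.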

Now we can define the probability of $f$ belonging to an instance class $i$ as

$$p(i \mid f) = \frac{\exp\big(\|\Pi_i(f)\|/\tau\big)}{\sum_{j=1}^{N} \exp\big(\|\Pi_j(f)\|/\tau\big)}. \tag{3}$$

Then the KSCL seeks to train the embedding network by maximizing the log-likelihood of the above probability over mini-batches to match a query against the correct instance. Particularly, given a query $f$ of unit norm, its projection length $\|\Pi_i(f)\|$ achieves its maximum if $f$ belongs to $V_i$, i.e., it is a linear combination of the K-shot augmentations $\{f_i^k\}$. In other words, it matches the query against all linear combinations of the augmented samples from each instance $x_i$, and retrieves the most similar one by projecting the query onto the instance subspace with the shortest distance.
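To make the subspace-based classification concrete, the following NumPy sketch computes projection lengths against several instance subspaces and turns them into probabilities with a softmax. It is a toy illustration, not the authors' implementation: the helper names, the random embeddings, the dimensions and the temperature `tau=0.2` are all assumptions.

```python
import numpy as np

def projection_lengths(f, bases):
    """Length of the projection of f onto each subspace (orthonormal columns)."""
    return np.array([np.linalg.norm(W.T @ f) for W in bases])

def kscl_probabilities(f, bases, tau=0.2):
    """Softmax over projection length / tau; tau=0.2 is an assumed value."""
    logits = projection_lengths(f, bases) / tau
    logits = logits - logits.max()       # numerical stability
    p = np.exp(logits)
    return p / p.sum()

rng = np.random.default_rng(1)
D, K, N = 16, 3, 4                       # assumed dimensions for the toy example

raw, bases = [], []
for _ in range(N):
    F = rng.normal(size=(D, K))
    F /= np.linalg.norm(F, axis=0, keepdims=True)   # unit-length embeddings
    W, _ = np.linalg.qr(F)                          # orthonormal basis of span(F)
    raw.append(F)
    bases.append(W)

# A query built as a linear combination of instance 2's augmentations.
f = raw[2] @ np.array([0.5, 0.3, 0.2])
f /= np.linalg.norm(f)

lengths = projection_lengths(f, bases)
probs = kscl_probabilities(f, bases)
```

Since the query lies inside the subspace of instance 2, its projection length there is 1 (the maximum for a unit vector), and that instance class receives the highest probability.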

4 Implementations

In this section, we discuss the details of implementing the proposed K-Shot Contrastive Learning (KSCL) model.

4.1 Projection onto Instance Subspace via Eigenvalue Decomposition

Mathematically, there is a closed-form solution to the projection onto the instance subspace $V_i$ spanned by the K-shot augmentations $f_i^k$. Suppose there exists an orthonormal basis for $V_i$ denoted by the columns of a matrix $W_i$; then the projection of a feature vector $f$ can be written as $\Pi_i(f) = W_i W_i^\top f$.

Since the embeddings $\{f_i^k \mid k = 1, \dots, K\}$ span $V_i$, the problem of finding $W_i$ can be formulated by minimizing the following projection residual

$$W_i = \arg\min_{W^\top W = I} \sum_{k=1}^{K} \big\|f_i^k - W W^\top f_i^k\big\|^2 = \arg\max_{W^\top W = I} \operatorname{tr}\big(W^\top \Sigma_i W\big), \tag{4}$$

where $\Sigma_i = F_i F_i^\top$, with $F_i = [f_i^1, \dots, f_i^K]$ containing the embeddings of the augmented samples in its columns.

After conducting an eigenvalue decomposition on the positive semi-definite matrix $\Sigma_i$, the eigenvectors corresponding to the largest eigenvalues give rise to an orthonormal basis of the associated instance subspace, which minimizes (4).
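The eigenvalue-decomposition step can be sketched in NumPy as follows. This is a minimal illustration under assumed settings (random unit vectors as embeddings, arbitrary dimensions); it forms the matrix of outer products of the augmentation embeddings, takes its leading eigenvectors as an orthonormal basis, and compares the resulting projection with an independent least-squares projection onto the span of the embeddings.

```python
import numpy as np

rng = np.random.default_rng(2)
D, K = 10, 4                             # assumed dimensions for the toy example

# Columns of F: unit-length embeddings of the K augmented samples.
F = rng.normal(size=(D, K))
F /= np.linalg.norm(F, axis=0, keepdims=True)

# Symmetric positive semi-definite matrix of rank at most K.
Sigma = F @ F.T

# eigh returns eigenvalues in ascending order; reorder to descending.
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
W = eigvecs[:, order[:K]]                # top-K eigenvectors as the basis

# Project a unit-length query with the eigenvector basis.
f = rng.normal(size=D)
f /= np.linalg.norm(f)
proj_eig = W @ (W.T @ f)

# Reference: least-squares projection onto span(F).
coeffs, *_ = np.linalg.lstsq(F, f, rcond=None)
proj_lstsq = F @ coeffs
```

Both projections agree up to numerical precision, which is consistent with the leading eigenvectors spanning the same subspace as the augmentation embeddings.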

Since the eigenvalue decomposition is differentiable, the embedding network can be trained end-to-end through error back-propagation. However, like the other contrastive learning methods [3, 33], the errors are only back-propagated through the embedding network of queries to save computing cost.

4.2 Most Significant Inter-Instance Variations

Usually, we only consider a smaller number of eigenvectors, corresponding to the largest eigenvalues that account for the most significant factors of variations among the K-shot augmentations. This ignores the remaining minor factors of intra-instance variations that may be incurred by noisy augmentations. It also results in a thinner projection matrix $\tilde{W}_i$ than the full basis matrix $W_i$, and the projection length of a query $f$ becomes $\|\tilde{W}_i^\top f\|$. Thus, we only need to store and update $\tilde{W}_i$ in the KSCL.

In practice, rather than fixing the number of eigenvectors in advance, we choose it such that the largest eigenvalues cover a preset percentage of the sum of all eigenvalues. The larger the percentage of total eigenvalues preserved, the smaller the projection residual in Eq. (4); when all eigenvalues are preserved, the residual vanishes. This allows a distinct number of eigenvectors per instance to flexibly model various degrees of variations among the K-shot augmentations.
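One way to realize this percentage-based truncation is sketched below. The function name `truncated_basis`, the tolerance, and the synthetic embeddings (five augmentations dominated by two directions plus small noise) are assumptions for illustration; the paper does not specify these details.

```python
import numpy as np

def truncated_basis(F, keep_fraction):
    """Keep the fewest leading eigenvectors of F F^T whose eigenvalues
    sum to at least keep_fraction of the total (hypothetical helper)."""
    K = F.shape[1]
    eigvals, eigvecs = np.linalg.eigh(F @ F.T)
    order = np.argsort(eigvals)[::-1][:K]            # at most K nonzero eigenvalues
    eigvals = np.clip(eigvals[order], 0.0, None)     # guard tiny negative values
    eigvecs = eigvecs[:, order]
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    # First index whose cumulative share reaches the target (small tolerance).
    n_keep = int(np.searchsorted(cumulative, keep_fraction - 1e-9)) + 1
    return eigvecs[:, :min(n_keep, K)]

def residual(F, W):
    """Total squared projection residual of the columns of F onto span(W)."""
    return float(np.linalg.norm(F - W @ (W.T @ F)) ** 2)

rng = np.random.default_rng(3)
D, K = 6, 5
base = rng.normal(size=(D, 2))                        # two dominant directions
F = base @ rng.normal(size=(2, K)) + 0.01 * rng.normal(size=(D, K))
F /= np.linalg.norm(F, axis=0, keepdims=True)

W_small = truncated_basis(F, 0.40)                    # preserve 40% of eigenvalues
W_full = truncated_basis(F, 1.00)                     # preserve all eigenvalues
```

In this toy setting a single eigenvector already covers 40% of the eigenvalue mass, while preserving everything drives the residual to (numerically) zero, so the number of stored basis vectors adapts to how much the augmentations actually vary.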

4.3 One-Shot Contrastive Learning when K = 1

It is not hard to see that the cosine similarity used in SimCLR and MoCo is a special case when $K = 1$, i.e., they are one-shot contrastive learning of visual embeddings. When $K = 1$, there is a single augmented sample per instance. Its instance subspace collapses to a vector $f_i$. Since $f_i$ is $\ell_2$-normalized to have unit length in the SimCLR and the MoCo, the projection length of a query $f$ onto this single vector becomes $|f^\top f_i|$. This is the cosine similarity between two vectors used in existing contrastive learning methods [2, 3, 21] up to an absolute value.
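The one-shot special case is easy to verify numerically. In the short sketch below (random unit vectors and an arbitrary dimension, both assumed for illustration), the instance subspace is spanned by a single unit-length embedding, so the basis matrix is just that vector as a column.

```python
import numpy as np

rng = np.random.default_rng(4)
D = 8                                    # assumed feature dimension

# Single augmentation embedding and a query, both of unit length.
f_i = rng.normal(size=D)
f_i /= np.linalg.norm(f_i)
f = rng.normal(size=D)
f /= np.linalg.norm(f)

# With one shot, the orthonormal basis is the embedding itself.
W = f_i[:, None]                         # shape (D, 1)
proj_len = np.linalg.norm(W.T @ f)       # projection length onto the 1-D subspace
cos_sim = float(f @ f_i)                 # ordinary cosine similarity
```

The projection length equals the absolute value of the cosine similarity, so the only difference from the usual one-shot similarity is the loss of the sign.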

5 Experiments

In this section, we perform experiments to compare the KSCL with the other state-of-the-art unsupervised learning methods.

5.1 Training Details

To ensure a fair comparison with the previous unsupervised methods [2, 3, 33], in particular SimCLR [2] and MoCo v2 [33], we follow the same evaluation protocol with the same hyperparameters.

Specifically, a ResNet-50 network is first pretrained on the 1.28M-image ImageNet dataset [34] without labels, and the performance is evaluated by training a linear classifier upon the fixed features. We report top-1 accuracy on the ImageNet validation set with a single crop to images. The momentum update with the same size of dynamic queue, the MLP head, and the data augmentation (e.g., color distortion and blur augmentation) are also adopted for the sake of fair comparison with the SimCLR and MoCo v2. We adopt the same temperature as in [33] without exhaustively searching for an optimal one, yet still obtain better results. This suggests the proposed KSCL can be used as a universal plugin to consistently improve contrastive learning with no need for further tuning of existing models. We will evaluate the impact of K and the percentage of preserved eigenvalues on the performance later.

5.2 Results on ImageNet Dataset

| Model | Epochs | Batch size | Top-1 accuracy (%) |
|---|---|---|---|
| SimCLR [2] | 200 | 256 | 61.9 |
| SimCLR (baseline) [2] | 200 | 8192 | 66.6 |
| MoCo v1 [3] | 200 | 256 | 60.5 |
| MoCo v2 (rerun) [33] | 200 | 256 | 67.5 |
| Proposed KSCL | 200 | 256 | 68.8 |
| Results under more epochs of unsupervised pretraining | | | |
| SimCLR (baseline) [2] | 1000 | 4096 | 69.3 |
| MoCo v2 (rerun) [33] | 800 | 256 | 70.6 |
| Proposed KSCL | 800 | 256 | 71.4 |

TABLE I: The top-1 accuracy of different models on ImageNet. The ResNet-50 backbone was pretrained without supervision with a two-layer MLP head by applying the same combination of enhanced data augmentations used in SimCLR, including stronger color distortion and blurring, for a fair comparison. The proposed KSCL is trained with K = 5 and 40% of preserved eigenvalues. The top-1 accuracy is obtained by training a single-layer linear classifier upon the pretrained features.

Table I compares the top-1 accuracy of the proposed KSCL with that of SimCLR and MoCo on ImageNet. We make a direct comparison between the KSCL and the MoCo v2 by running both on the same hardware platform with the same set of software such as CUDA10, pytorch v1.3 and torchvision 1.1.0 (used in the data augmentation that plays a key role in contrastive learning). With 200 epochs of pretraining, the MoCo v2 rerun reproduces the top-1 accuracy reported in the literature. However, its rerun result over 800 epochs is slightly lower than the reported result (71.1%) in [33], and this may be due to different versions of deep learning frameworks and drivers that could cause variations in the model performance.

Table I shows that, after unsupervised pretraining with 200 epochs and a batch size of 256, the KSCL achieves a top-1 accuracy of 68.8% with K = 5 augmentations and 40% of preserved eigenvalues. It is worth noting that a larger batch size is often required to sufficiently train the SimCLR, while the other models such as KSCL and MoCo maintain a long dynamic queue as the dictionary. Viewing the SimCLR with a larger batch size of 8192 as a baseline, the KSCL makes a much larger improvement of 2.2% than the MoCo v2 (0.9%) over the SimCLR baseline under 200 epochs. The KSCL also improves the top-1 accuracy to 71.4% on ImageNet over 800 epochs of pretraining. Although a better result may be obtained by fine-tuning the hyperparameters and the data augmentation [35], we stick to the same experimental setting as the previous methods [2, 33] for a direct comparison.

We also visualize the learned basis images in Figure 2. The last column presents the basis images spanning the underlying instance subspace for a "cat" image. The weight beneath each image is the inner product between the decomposed eigenvector and the embedding of the corresponding augmentation, and each basis image is a weighted combination of the augmented images in the row. The results show that two basis vectors suffice to capture the major variations among the five image augmentations, while the remaining three only model the minor ones that can be discarded as noise.

Fig. 2: The learned basis in an instance subspace. Each of the first five columns is an augmented image from an instance, and the last column is the basis images each of which is synthesized as a linear combination of the five augmented images weighted by the inner product with the corresponding eigenvector in the embedding space.

5.3 Impacts of K and the Percentage of Preserved Eigenvalues on Performance

We also study the impact of different values of K and of the percentage of preserved eigenvalues on the model performance. Table II shows the top-1 accuracy under various settings of both. When K = 1, the KSCL reduces to one-shot contrastive learning, which is similar to the MoCo v2. The difference (67.2% vs. 67.5%) between the KSCL (K = 1) and the MoCo v2 is probably because we did not fine-tune the temperature for the projection length to optimize the KSCL.

The accuracy increases with a larger number K of augmentations per instance and a smaller percentage of preserved eigenvalues. This implies that eliminating the minor noisy variations (as illustrated in Figure 2) with a smaller percentage could improve the performance. Further growing K only marginally improves the performance. This is probably because the data augmentation adopted in the experiments is limited to that used in the compared methods for a direct comparison. Applying more types of augmentations (e.g., jigsaw and rotations) may inject more intra-instance variations that encourage the use of a larger K. However, studying the role of more types of augmentations in contrastive learning is beyond the scope of this paper, and we leave it to future research.

| K | Preserved eigenvalues | Epochs | Batch size | Top-1 accuracy (%) | Time/epoch (min.) |
|---|---|---|---|---|---|
| 1 | n/a | 200 | 256 | 67.2 | 16 |
| 3 | 40% | 200 | 256 | 68.5 | 26 |
| 5 | 40% | 200 | 256 | 68.8 | 37 |
| 5 | 90% | 200 | 256 | 68.4 | 37 |

TABLE II: The top-1 accuracy of the proposed KSCL with varying K and percentages of preserved eigenvalues under 200 epochs of pretraining on ImageNet. The ResNet-50 backbone was pretrained with a two-layer MLP head by applying the same combination of enhanced data augmentations used in SimCLR, including stronger color distortion and blurring, for a fair comparison. The top-1 accuracy is obtained by training a single-layer linear classifier upon the pretrained features. We also compare the computing time used to train the KSCL per epoch on eight V100 GPUs. Note that when K = 1, the percentage of preserved eigenvalues need not be set, as this is the trivial case of one-shot contrastive learning.

5.4 Results on VOC Object Detection

Finally, we evaluate the unsupervised representations on the VOC object detection task [36]. The ResNet-50 backbone pretrained on the ImageNet dataset is fine-tuned with a Faster RCNN detector [37] end-to-end on the VOC 2007+2012 trainval set, and is evaluated on the VOC 2007 test set. Table III compares the results with both MoCo models. Under the same setting, the proposed KSCL matches or outperforms the compared MoCo v1 and MoCo v2 models on all three metrics. No VOC object detection results are reported for the SimCLR model in [2].

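| Model | Epochs | Batch size | AP50 | AP | AP75 |
|---|---|---|---|---|---|
| MoCo v1 | 200 | 256 | 81.5 | 55.9 | 62.6 |
| MoCo v2 | 200 | 256 | 82.4 | 57.0 | 63.6 |
| MoCo v2 | 800 | 256 | 82.5 | 57.4 | 64.0 |
| KSCL | 200 | 256 | 82.4 | 57.1 | 63.9 |
| KSCL | 800 | 256 | 82.7 | 57.5 | 64.2 |
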
TABLE III: The comparison between the proposed KSCL ( and ) and the MoCo models. The pretrained ResNet-50 backbone was transferred to train on VOC 2007+2012 trainval set with a Faster RCNN detector end-to-end, and evaluated on the VOC 2007 test set. The COCO metrics were adopted to evaluate the performance.

6 Conclusion

In this paper, we present a novel K-shot contrastive learning to learn unsupervised visual features. It randomly draws K-shot augmentations and applies them separately to each instance. This results in an instance subspace modeling how the significant factors of variations learned from the augmented samples can be linearly combined to form the variants of the associated instance. Given a query, the most relevant samples are then retrieved by projecting the query onto individual instance subspaces, and the query is assigned to the instance subspace with the shortest projection distance. The proposed K-shot contrastive learning combines the advantages of both inter-instance discrimination and intra-instance variations to discriminate between different instances. The experiment results demonstrate its superior performance over the state-of-the-art contrastive learning methods under the same experimental setting.