Decoupled Contrastive Learning

10/13/2021 ∙ by Chun-Hsiao Yeh, et al.

Contrastive learning (CL) is one of the most successful paradigms for self-supervised learning (SSL). In a principled way, it considers two augmented "views" of the same image as a positive pair to be pulled closer, and all other images as negatives to be pushed further apart. However, behind the impressive success of CL-based techniques, their formulation often relies on heavy-computation settings, including large sample batches, extensive training epochs, etc. We are thus motivated to tackle these issues and aim at establishing a simple, efficient, yet competitive baseline of contrastive learning. Specifically, we identify, from theoretical and empirical studies, a noticeable negative-positive-coupling (NPC) effect in the widely used cross-entropy (InfoNCE) loss, leading to poor learning efficiency with respect to the batch size. The phenomenon tends to be neglected because optimizing the InfoNCE loss with a small batch amounts to solving an easier SSL task, which still appears effective. By properly addressing the NPC effect, we reach a decoupled contrastive learning (DCL) objective function, significantly improving SSL efficiency. DCL achieves competitive performance while requiring neither the large batches of SimCLR, the momentum encoding of MoCo, nor long training epochs. We demonstrate the usefulness of DCL on various benchmarks and show its robustness: it is much less sensitive to suboptimal hyperparameters. Notably, our approach achieves 66.9% ImageNet top-1 accuracy using batch size 256 within 200 epochs of pre-training, outperforming its baseline SimCLR by 5.1%. With further optimized hyperparameters, DCL can improve the accuracy to 68.2%. We believe DCL provides a valuable baseline for future contrastive-learning-based SSL studies.


1 Introduction

As a fundamental task in machine learning, representation learning aims to extract features that faithfully represent the raw data, and it has been a long-standing goal over the past decades. Recent progress in representation learning has reached a significant milestone with self-supervised learning (SSL), which facilitates feature learning by exploiting massive raw data without any annotated supervision. In the early stage of SSL, representation learning focused on pretext tasks, which are addressed by generating pseudo-labels for the unlabeled data through different transformations, such as solving jigsaw puzzles (Noroozi and Favaro, 2016), colorization (Zhang et al., 2016), and rotation prediction (Gidaris et al., 2018). Though these approaches have succeeded in computer vision, there is a large gap between these methods and supervised learning. Recently, there has been significant advancement in using contrastive learning (Wu et al., 2018; van den Oord et al., 2018; Tian et al., 2020a; He et al., 2020; Chen et al., 2020a) for self-supervised pre-training, which significantly closes the gap between SSL methods and supervised learning. Contrastive SSL methods, e.g., SimCLR (Chen et al., 2020a), generally try to pull different views of the same instance close and push different instances far apart in the representation space.

Despite the evident progress of state-of-the-art contrastive SSL methods, several challenges remain in further developing this direction: 1) The SOTA models (He et al., 2020) may require unique structures like the momentum encoder and large memory queues, which can complicate the understanding. 2) The contrastive SSL models (Chen et al., 2020a) may depend on a large batch size and a huge number of epochs to achieve competitive performance, posing a computational challenge for academia to explore this direction. 3) They may be sensitive to hyperparameters and optimizers, introducing additional difficulty in reproducing the results on various benchmarks.

Through our analysis of the widely adopted InfoNCE loss in contrastive learning, we identified a negative-positive-coupling (NPC) multiplier in the gradient, as shown in Proposition 1.

The NPC multiplier modulates the gradient of each sample, and it reduces the learning efficiency when the SSL classification task is easy: a less informative positive view would reduce the gradient from a batch of informative negative samples, or vice versa. Such coupling is exacerbated when smaller batch sizes are used. By removing the coupling term, we reach a new formulation, decoupled contrastive learning (DCL). The new objective function significantly improves the training efficiency and requires neither large batches, momentum encoding, nor long training to achieve competitive performance on various benchmarks. Specifically, DCL reaches 66.9% ImageNet top-1 (linear probing) accuracy with batch size 256 and an SGD optimizer within 200 epochs. Even if DCL is trained for only 100 epochs, it still reaches 64.6% ImageNet top-1 accuracy with batch size 256.

The main contributions of the proposed DCL can be characterized as follows:

  • We provide both theoretical analysis and empirical evidence to show the negative-positive coupling in the gradient of InfoNCE-based contrastive learning;

  • We introduce a new, decoupled contrastive learning (DCL) objective, which casts off the coupling phenomenon between positive and negative samples in contrastive learning, and significantly improves the training efficiency; Additionally, the proposed DCL objective is less sensitive to several important hyperparameters;

  • We demonstrate our approach via extensive experiments and analysis on both large and small-scale vision benchmarks, with an optimal configuration for the standard SimCLR baseline to have a competitive performance within contrastive approaches. This leads to a plug-and-play improvement to the widely adopted InfoNCE contrastive learning methods.

Figure 1: An overview of the batch-size issue in general contrastive approaches: (a) shows the average NPC multiplier q_{B,i} for different batch sizes; as the batch size increases, the average approaches 1 with a small coefficient of variation. (b) illustrates the distribution of q_{B,i} for different batch sizes.

2 Related work

2.1 Self-supervised representation learning

Self-supervised representation learning (SSL) aims to learn a robust embedding space from data without human annotation. Previous work can be roughly categorized into generative and discriminative approaches. Generative approaches, such as autoencoders and adversarial learning, focus on reconstructing images from latent representations (Goodfellow et al., 2014; Radford et al., 2016). Conversely, recent discriminative approaches, especially contrastive learning-based ones, have gained the most ground and achieved state-of-the-art results on standard large-scale image classification benchmarks with increasingly more compute and data augmentation.

2.2 Contrastive learning

Contrastive learning (CL) constructs positive and negative sample pairs to extract information from the data itself. In CL, each anchor image in a batch has only one positive sample with which to construct a positive pair (Hadsell et al., 2006; Chen et al., 2020a; He et al., 2020). CPC (van den Oord et al., 2018) predicts the future output of sequential data by using the current output as prior knowledge, which improves the representational ability of the model. Instance discrimination (Wu et al., 2018) proposes a non-parametric cross-entropy loss to optimize the model at the instance level. Inv. Spread (Ye et al., 2019) makes use of data-augmentation invariance and the spread-out property of instances to learn features. MoCo (He et al., 2020) proposes a dictionary to maintain a negative sample set, thus increasing the number of negative pairs. Different from the aforementioned self-supervised CL approaches, Khosla et al. (2020) propose a supervised CL that treats all samples of the same category as positive pairs to increase the utility of images.

2.3 Collapsing issue via batch size and negative sample

In CL, the objective is to maximize the mutual information between the positive pairs. However, to avoid a collapsed output, vast quantities of negative samples are needed so that the learned representation has maximum similarity within positive pairs and minimum similarity with negative samples. For instance, SimCLR (Chen et al., 2020a) requires many negative samples during training, leading to a large batch size (i.e., 4096); furthermore, to optimize such a huge batch, the specially designed LARS optimizer (You et al., 2017) is used. Similarly, MoCo (He et al., 2020) needs a vast queue (i.e., 65536) to achieve competitive performance. BYOL (Grill et al., 2020) avoids a collapsed output without using any negative samples, relying only on positive pairs and maximizing the similarity between "projection" and "prediction" features. On the other hand, SimSiam (Chen and He, 2021) leverages a Siamese network to introduce inductive biases for modeling invariance; with a small batch size (i.e., 256), SimSiam rivals BYOL (i.e., 4096). Unlike both approaches, which achieved their success through empirical studies, this paper tackles the problem from a theoretical perspective, showing that an intertwined multiplier of the positive and negative terms is the main issue in contrastive learning.

2.4 Contrastive Learning on batch size sensitivity

Recent literature discusses losses for contrastive learning with a focus on batch-size sensitivity. Tsai et al. (2021) start from the contrastive predictive coding objective, which is equivalent to the SimCLR loss (Chen et al., 2020a), and then propose a new relative predictive coding objective. However, this new objective is not the same as the SimCLR loss in essence; it is closer to a ranking loss (Chen et al., 2009), which pulls positive pairs together and pushes negative pairs apart. Since the ranking loss is not stable enough, Tsai et al. (2021) add additional regularization terms to control the magnitude of the network and obtain better results; on the other hand, this brings additional hyperparameters and requires more time to search for the best weight combinations. Hjelm et al. (2019) follow MINE (Belghazi et al., 2018) and extend the idea to mutual information between local and global features; hence, their objective is quite different from the contrastive loss. Ozair et al. (2019) follow a similar mutual-information-based approach and propose a Wasserstein distance to prevent the encoder from learning any other differences between unpaired samples. The starting point of this paper is SimCLR (Chen et al., 2020a), and we provide a theoretical analysis to support why decoupling the positive and negative terms in the contrastive loss is essential. The target problems are different even though the motivations are similar.

3 Decouple negative and positive samples in contrastive learning

Figure 2: Contrastive learning and negative-positive coupling (NPC). (a) In SimCLR, each sample x_i has two augmented views x_i^{(1)}, x_i^{(2)}. They are encoded by the same encoder and further projected to z_i^{(1)}, z_i^{(2)} by a normalized MLP. (b) According to Equation 2, for the view x_i^{(1)} the cross-entropy loss leads to a positive force, which comes from the other view z_i^{(2)}, and a negative force, which is a weighted average of all the negative samples z_j^{(l)}, j ≠ i. However, the gradient is proportional to the NPC multiplier q_{B,i} in Equation 3. (c) We show two cases where the NPC term affects the learning efficiency. On the top, the positive sample is close to the anchor and less informative; however, the gradient from the negative samples is also reduced. On the bottom, when the negative samples are far away and less informative, the learning rate from the positive sample is mistakenly reduced. In general, the NPC multiplier from the InfoNCE loss tends to make the SSL task simpler to solve, which leads to reduced learning efficiency.

We choose to start from SimCLR because of its conceptual simplicity. Given a batch of N samples (e.g., images) {x_i}_{i=1}^N, let x_i^{(1)}, x_i^{(2)} be two augmented views of the sample x_i, and let B be the set of all of the augmented views in the batch, i.e. B = {x_i^{(1)}, x_i^{(2)}}_{i=1}^N. As shown in Figure 2(a), each of the views is sent into the same encoder network, and the output is then projected by an L2-normalized MLP projector to z_i^{(1)}, z_i^{(2)}. For each augmented view x_i^{(1)}, SimCLR solves a classification problem by using the rest of the views in B as targets and assigning the only positive label to x_i^{(2)}, where the logits are the cosine similarities ⟨·,·⟩ scaled by a temperature τ. SimCLR thus creates a cross-entropy loss function L_i^{(1)} for each view x_i^{(1)}, and the overall loss function is L = Σ_{i=1}^N (L_i^{(1)} + L_i^{(2)}):

\mathcal{L}^{(1)}_i = -\log \frac{\exp(\langle z^{(1)}_i, z^{(2)}_i\rangle/\tau)}{\exp(\langle z^{(1)}_i, z^{(2)}_i\rangle/\tau) + \sum_{j\neq i}\sum_{l\in\{1,2\}} \exp(\langle z^{(1)}_i, z^{(l)}_j\rangle/\tau)}.    (1)
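For concreteness, Equation 1 corresponds to the following minimal PyTorch sketch (our own illustration, not the authors' released code); it assumes `z1` and `z2` are the L2-normalized projections of the two views of a batch.

```python
import torch
import torch.nn.functional as F

def simclr_loss(z1, z2, temperature=0.1):
    """Cross-entropy (InfoNCE) loss of Equation 1, averaged over all 2N views.

    z1, z2: (N, d) L2-normalized projections of the two augmented views of a batch.
    """
    N = z1.size(0)
    z = torch.cat([z1, z2], dim=0)                      # (2N, d): every view in the batch
    sim = torch.mm(z, z.t()) / temperature              # pairwise cosine similarities / tau
    sim.fill_diagonal_(float("-inf"))                   # a view never uses itself as a target
    # the positive target of view i is the other view of the same sample
    targets = torch.cat([torch.arange(N) + N, torch.arange(N)]).to(z.device)
    return F.cross_entropy(sim, targets)
```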
Proposition 1.

There exists a negative-positive coupling (NPC) multiplier q_{B,i} in the gradient of L_i^{(1)}:

-\frac{\partial \mathcal{L}^{(1)}_i}{\partial z^{(1)}_i} = \frac{q_{B,i}}{\tau}\Big( z^{(2)}_i - \sum_{j\neq i}\sum_{l\in\{1,2\}} \frac{\exp(\langle z^{(1)}_i, z^{(l)}_j\rangle/\tau)}{\sum_{k\neq i}\sum_{m\in\{1,2\}} \exp(\langle z^{(1)}_i, z^{(m)}_k\rangle/\tau)}\, z^{(l)}_j \Big), \qquad -\frac{\partial \mathcal{L}^{(1)}_i}{\partial z^{(2)}_i} = \frac{q_{B,i}}{\tau}\, z^{(1)}_i,    (2)

where the NPC multiplier is:

q_{B,i} = \frac{\sum_{j\neq i}\sum_{l\in\{1,2\}} \exp(\langle z^{(1)}_i, z^{(l)}_j\rangle/\tau)}{\exp(\langle z^{(1)}_i, z^{(2)}_i\rangle/\tau) + \sum_{j\neq i}\sum_{l\in\{1,2\}} \exp(\langle z^{(1)}_i, z^{(l)}_j\rangle/\tau)}.    (3)

Due to the symmetry, a similar NPC multiplier exists in the gradient of L_i^{(2)}.

As we can see, all of the partial gradients in Equation 2 are modified by the common NPC multiplier in Equation 3. Equation 3 makes intuitive sense: when the SSL classification task is easy, the gradient is reduced by the NPC term. However, the positive samples and negative samples are strongly coupled. When the negative samples are far away and less informative (easy negatives), the gradient from an informative positive sample is reduced by the NPC multiplier q_{B,i}. On the other hand, when the positive sample is close (easy positive) and less informative, the gradient from a batch of informative negative samples is also reduced by the NPC multiplier. When the batch size is smaller, the SSL classification problem can be significantly simpler to solve. As a result, the learning efficiency can be significantly reduced in a small-batch setting.

Figure 1(b) shows how the NPC multiplier distribution shifts w.r.t. different batch sizes for a pre-trained SimCLR baseline model. While all of the shown distributions have prominent fluctuation, smaller batch sizes make q_{B,i} cluster towards smaller values, while larger batch sizes push the distribution towards 1. Figure 1(a) shows how the averaged NPC multiplier changes w.r.t. the batch size, together with its relative fluctuation. Small batch sizes introduce significant NPC fluctuation. Based on this observation, we propose to remove the NPC multipliers from the gradients, which corresponds to the case q_{B,i} ≡ 1. This leads to the decoupled contrastive learning formulation. Wang et al. (2021a) also propose an important loss that does not have the NPC; however, by a similar analysis, it introduces a negative-negative coupling between the negative samples of different positive samples. In Section A.5, we provide a thorough discussion and demonstrate the advantage of the DCL loss.
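The NPC multiplier can be measured directly from a batch of projected embeddings. The sketch below (ours, with assumed tensor shapes) computes q_{B,i} of Equation 3 for every anchor; evaluating it on subsampled batches reproduces the qualitative trend of Figure 1, i.e. the average shrinks and its spread grows as the batch gets smaller.

```python
import torch

def npc_multiplier(z1, z2, temperature=0.1):
    """NPC multiplier q_{B,i} of Equation 3 for every anchor view z1[i].

    q_{B,i} is the share of the softmax denominator taken by the negatives, so it
    shrinks towards 0 when the task is easy and approaches 1 with many hard negatives.
    """
    N = z1.size(0)
    z = torch.cat([z1, z2], dim=0)                           # all 2N views of the batch
    sim = torch.exp(torch.mm(z1, z.t()) / temperature)       # (N, 2N) exp(similarity / tau)
    idx = torch.arange(N, device=z1.device)
    pos = sim[idx, idx + N]                                  # exp(<z_i^(1), z_i^(2)>/tau)
    sim[idx, idx] = 0.0                                      # drop the anchor itself
    sim[idx, idx + N] = 0.0                                  # drop the positive pair
    neg = sim.sum(dim=1)                                     # sum over the 2(N-1) negatives
    return neg / (pos + neg)

# Subsampling a batch of embeddings reproduces the trend of Figure 1, e.g.:
# q_small = npc_multiplier(z1[:32], z2[:32]); q_large = npc_multiplier(z1, z2)
```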

Proposition 2.

Removing the positive pair from the denominator of Equation 1 leads to a decoupled contrastive learning loss. Equivalently, if we remove the NPC multiplier q_{B,i} from Equation 2, we reach a decoupled contrastive learning loss L_DC = Σ_{i=1}^N (L_{DC,i}^{(1)} + L_{DC,i}^{(2)}), where L_{DC,i}^{(1)} is:

\mathcal{L}^{(1)}_{DC,i} = -\log \frac{\exp(\langle z^{(1)}_i, z^{(2)}_i\rangle/\tau)}{\sum_{j\neq i}\sum_{l\in\{1,2\}} \exp(\langle z^{(1)}_i, z^{(l)}_j\rangle/\tau)}    (4)

= -\langle z^{(1)}_i, z^{(2)}_i\rangle/\tau + \log \sum_{j\neq i}\sum_{l\in\{1,2\}} \exp(\langle z^{(1)}_i, z^{(l)}_j\rangle/\tau).    (5)
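As a sketch (ours), the change relative to the InfoNCE loss amounts to masking the positive pair out of the denominator, i.e. out of the log-sum-exp over candidates:

```python
import torch

def dcl_loss(z1, z2, temperature=0.1):
    """Decoupled contrastive loss of Equations 4-5, averaged over all 2N views.

    Identical to the SimCLR loss except that the positive pair is excluded from
    the denominator (the log-sum-exp runs over negatives only).
    """
    N = z1.size(0)
    z = torch.cat([z1, z2], dim=0)                               # (2N, d)
    sim = torch.mm(z, z.t()) / temperature
    idx = torch.arange(2 * N, device=z.device)
    pos_idx = torch.cat([idx[N:], idx[:N]])                      # index of each view's positive
    pos = sim[idx, pos_idx]                                      # <z_i^(u), z_i^(v)>/tau
    mask = torch.eye(2 * N, dtype=torch.bool, device=z.device)   # exclude self-similarity ...
    mask[idx, pos_idx] = True                                    # ... and the positive pair
    neg = sim.masked_fill(mask, float("-inf")).logsumexp(dim=1)  # log-sum-exp over negatives only
    return (neg - pos).mean()
```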

The proofs of Propositions 1 and 2 are given in the Appendix. Further, we can generalize the loss function to L_DCW by introducing a weighting function w(z_i^{(1)}, z_i^{(2)}) for the positive pairs:

\mathcal{L}^{(1)}_{DCW,i} = -\,w\big(z^{(1)}_i, z^{(2)}_i\big)\,\langle z^{(1)}_i, z^{(2)}_i\rangle/\tau + \log \sum_{j\neq i}\sum_{l\in\{1,2\}} \exp(\langle z^{(1)}_i, z^{(l)}_j\rangle/\tau),    (6)

where we can intuitively choose w to be a negative von Mises-Fisher weighting function, so that positive pairs that are already close receive a weight below 1 and positive pairs that are far apart receive a weight above 1, with the weights averaging to 1 over the batch. L_DC is a special case of L_DCW with w ≡ 1. The intuition behind L_DCW is that there is more learning signal when a positive pair of samples are far from each other.
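The extracted text does not preserve the exact normalization of the weighting function, so the sketch below assumes one common instantiation: the weight is two minus a batch softmax of the positive similarities, which has mean 1; sigma is the weighting temperature, and its default value of 0.5 here is also an assumption.

```python
import torch

def dclw_loss(z1, z2, temperature=0.1, sigma=0.5):
    """DCLW (Equation 6): DCL with a von Mises-Fisher re-weighting of the positive term.

    The weighting below is an assumed instantiation: w_i = 2 - N * softmax_i(<z_i^(1), z_i^(2)>/sigma),
    which averages to 1 over the batch and down-weights positive pairs that are already close.
    """
    N = z1.size(0)
    pos_sim = (z1 * z2).sum(dim=1)                               # <z_i^(1), z_i^(2)> per sample
    w = (2.0 - N * torch.softmax(pos_sim / sigma, dim=0)).detach()

    z = torch.cat([z1, z2], dim=0)
    sim = torch.mm(z, z.t()) / temperature
    idx = torch.arange(2 * N, device=z.device)
    pos_idx = torch.cat([idx[N:], idx[:N]])
    mask = torch.eye(2 * N, dtype=torch.bool, device=z.device)
    mask[idx, pos_idx] = True                                    # drop self and positive from the negatives
    neg = sim.masked_fill(mask, float("-inf")).logsumexp(dim=1)
    pos = sim[idx, pos_idx]
    return (neg - w.repeat(2) * pos).mean()                      # weighted positives, decoupled negatives
```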

4 Experiments

This section empirically evaluates our proposed decoupled contrastive learning (DCL) and compares it to general contrastive learning methods. We summarize our experiments and analysis as follows: (1) our proposed work significantly outperforms general contrastive learning on large- and small-scale vision benchmarks; (2) we show that a re-weighted version of DCL, L_DCW, can further improve the representation quality; (3) we further analyze our DCL with ablation studies on ImageNet-1K, hyperparameters, and few learning epochs, which show the fast convergence of the proposed DCL. Detailed experimental settings can be found in the Appendix.

4.1 Implementation details

To understand the effect of the sample decoupling, we consider our proposed DCL on top of general contrastive learning, where model optimization becomes insensitive to the size of batches (i.e., the number of negative samples). Extensive experiments and analysis are demonstrated on the large-scale benchmarks ImageNet-1K (Deng et al., 2009) and ImageNet-100 (Tian et al., 2020a), and the small-scale benchmarks CIFAR (Krizhevsky et al., 2009) and STL10 (Coates et al., 2011). Note that all of our experiments are conducted with 8 Nvidia V100 GPUs on a single machine.

ImageNet

For a fair comparison on ImageNet data, we implement our proposed decoupled objective, DCL, by following SimCLR (Chen et al., 2020a) with ResNet-50 (He et al., 2016) as the encoder backbone, and we use a cosine annealing schedule with the SGD optimizer. We set the temperature τ to 0.1 and the latent vector dimension to 128. Following the OpenSelfSup benchmark (Zhan et al., 2020), we evaluate the pre-trained models by training a linear classifier on the frozen learned embedding on ImageNet data. We further evaluate our approach on ImageNet-100, a selected subset of 100 classes of ImageNet-1K.
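This setup can be sketched as follows (an illustration under stated assumptions: the two-layer MLP projector mirrors SimCLR's design, and the learning rate, momentum, and weight decay values shown are placeholders not given in the text).

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

# Pre-training setup sketch for the ImageNet experiments described above.
backbone = resnet50()
backbone.fc = nn.Identity()                       # keep the 2048-d pooled features
projector = nn.Sequential(                        # SimCLR-style projector, 128-d output
    nn.Linear(2048, 2048), nn.ReLU(inplace=True), nn.Linear(2048, 128))

params = list(backbone.parameters()) + list(projector.parameters())
optimizer = torch.optim.SGD(params, lr=0.03, momentum=0.9, weight_decay=1e-4)   # lr/wd are placeholders
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200)    # cosine annealing, 200 epochs

def project(x):
    """Embed a batch of images; the loss is computed on L2-normalized projections."""
    return nn.functional.normalize(projector(backbone(x)), dim=1)
```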

CIFAR and STL10

For CIFAR10, CIFAR100, and STL10, ResNet-18 (He et al., 2016) is used as the encoder architecture. Following the small-scale benchmark of CLD (Wang et al., 2021b), we set the temperature to 0.07. All models are trained for 200 epochs with the SGD optimizer and the base learning rate of the CLD benchmark, and we use k = 200 for the nearest-neighbor (kNN) classifier. Note that on STL10, we follow CLD and use both the labeled training set and the unlabeled set for model pre-training. We further use ResNet-50 as a stronger backbone by adopting the implementation of (Ren, 2020), with the same backbone and hyperparameters.
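The kNN evaluation with k = 200 can be sketched as a similarity-weighted vote over the training features (our simplified version; the exponential vote weighting with temperature 0.07 follows the common NPID-style protocol and is an assumption here).

```python
import torch

def knn_classify(train_feats, train_labels, test_feats, k=200, temperature=0.07, num_classes=10):
    """Weighted kNN classification on L2-normalized features.

    train_feats: (M, d), train_labels: (M,) int64, test_feats: (T, d).
    Returns the predicted class index for each test sample.
    """
    sim = torch.mm(test_feats, train_feats.t())                  # cosine similarity to the feature bank
    topk_sim, topk_idx = sim.topk(k, dim=1)                      # k nearest training samples
    topk_labels = train_labels[topk_idx]                         # (T, k) labels of the neighbors
    weights = (topk_sim / temperature).exp()
    votes = torch.zeros(test_feats.size(0), num_classes, device=test_feats.device)
    votes.scatter_add_(1, topk_labels, weights)                  # similarity-weighted class votes
    return votes.argmax(dim=1)
```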

4.2 Experiments and analysis

DCL on ImageNet

This section illustrates the effect of our DCL under different batch sizes and queue sizes. The initial setup uses a batch size of 1024 for SimCLR and a queue size of 65536 for MoCo (He et al., 2020); we then gradually reduce the batch size (SimCLR) and the queue size (MoCo) and report the corresponding top-1 accuracy under linear evaluation. Figure 3 indicates that without DCL, the top-1 accuracy drops drastically when the batch size (SimCLR) or queue (MoCo) becomes very small, whereas with DCL the performance stays much steadier than the baselines.

Figure 3: Comparisons on ImageNet-1K with/without DCL under different numbers of (a) batch sizes for SimCLR and (b) queue sizes for MoCo. Without DCL, the top-1 accuracy significantly drops when the batch size (SimCLR) or queue (MoCo) becomes very small. Note that SimCLR and MoCo each use a fixed temperature in this comparison.
Architecture@epoch ResNet-18@200 epoch
Dataset ImageNet-100 (linear) STL10 (kNN)
Batch Size 32 64 128 256 512 32 64 128 256 512
SimCLR 74.2 77.6 79.3 80.7 81.3 74.1 77.6 79.3 80.7 81.3
SimCLR w/ DCL 80.8 82.0 81.9 83.1 82.8 82.0 82.8 81.8 81.2 81.0
Dataset CIFAR10 (kNN) CIFAR100 (kNN)
Batch Size 32 64 128 256 512 32 64 128 256 512
SimCLR 78.9 80.4 81.1 81.4 81.3 49.4 50.3 51.8 52.0 52.4
SimCLR w/ DCL 83.7 84.4 84.4 84.2 83.5 51.1 54.3 54.6 54.9 55.0
Architecture@epoch ResNet-50@500 epoch
SimCLR 82.2 - 88.5 - 89.1 49.8 - 59.9 - 61.1
SimCLR w/ DCL 86.1 - 89.9 - 90.3 54.3 - 61.6 - 62.2
Table 1: Comparisons with/without DCL under batch sizes from 32 to 512. Results show the effectiveness of DCL on four widely used benchmarks. The performance of DCL remains steadier than the SimCLR baseline as the batch size varies.
Dataset CIFAR10 CIFAR100 ImageNet-100 ImageNet-1K
SimCLR 81.8 51.8 79.3 61.8
DCL 84.2 (+3.1) 54.6 (+2.8) 81.9 (+2.6) 65.9 (+4.1)
DCLW 84.8 (+3.7) 54.9 (+3.1) 82.8 (+3.5) 66.9 (+5.1)
Table 2: Comparisons between the SimCLR baseline, DCL, and DCLW. The linear and kNN top-1 (%) results indicate that DCL improves the performance of the baseline, and DCLW provides a further boost. Note that results are under batch size 256 and 200 epochs. All models are trained and evaluated with the same experimental settings.

Specifically, Figure 3 further shows that, in SimCLR, the performance with DCL improves from 61.8% to 65.9% under batch size 256 (see also Table 2); MoCo with DCL likewise improves under a queue size of 256. The comparison fully demonstrates the necessity of DCL, especially when the number of negatives is small. Even when the batch size is increased to 1024, our DCL still improves over the SimCLR baseline.

We further observe the same phenomenon on ImageNet-100 data. Table 1 shows that, with DCL, the accuracy drops by only about 2 points as the batch size shrinks from 512 to 32, compared to a drop of about 7 points for the SimCLR baseline.

In summary, it is worth noting that when the batch size is small, the strength of the negative force, which pushes the negative samples away from the positive sample, is also relatively weak. This phenomenon tends to reduce the efficiency of representation learning, whereas DCL alleviates the performance gap between small and large batch sizes. Hence, through this analysis, we find that DCL directly tackles the batch-size issue in contrastive learning. With this considerable advantage provided by DCL, general SSL approaches can be implemented with fewer computational resources or on lower-end platforms.

DCL on CIFAR and STL10

For STL10, CIFAR10, and CIFAR100, we implement our DCL with ResNet-18 as the encoder backbone. In Table 1, it is observed that DCL also demonstrates its effectiveness on small-scale benchmarks. In summary, under batch size 256, DCL outperforms its baseline by 2.8 points (CIFAR10) and 2.9 points (CIFAR100) and keeps the performance relatively steady across batch sizes. The kNN accuracy of the SimCLR baseline on STL10 is also improved (by 0.5 points at batch size 256).

Further experiments are conducted with a ResNet-50 backbone and longer training (i.e., 500 epochs). With kNN evaluation, batch size 32, and 500 epochs of training, the DCL model reaches 86.1% on CIFAR10 compared to 82.2% for the baseline. The last rows of Table 1 show DCL ResNet-50 performance on CIFAR10 and CIFAR100 while varying the batch size to demonstrate the effectiveness of DCL.

Decoupled objective with re-weighting (DCLW)

We only replace the SimCLR objective with L_DCW, with no possible advantage from additional tricks; that is, both our approach and the baselines follow the same training recipe of the OpenSelfSup benchmark for fairness. Note that the weighting parameter σ is chosen empirically in the experiments. Results in Table 2 indicate that DCLW achieves extra gains over the SimCLR baseline of +5.1% (ImageNet-1K) and +3.5% (ImageNet-100). For the CIFAR data, +3.7% (CIFAR10) and +3.1% (CIFAR100) are gained with DCLW. It is worth noting that, trained for 200 epochs, our DCLW reaches 66.9% with batch size 256, surpassing the SimCLR baseline trained with batch size 8192.

4.3 Small-scale benchmark results: STL10, CIFAR10, and CIFAR100

For STL10, CIFAR10, and CIFAR100, we implement our DCL with ResNet-18 (He et al., 2016) as the encoder backbone, following the small-scale benchmark of CLD (Wang et al., 2021b). All models are trained for 200 epochs with batch size 256 and evaluated using kNN top-1 accuracy (%). Results in Table 3 indicate that our DCLW with multi-cropping (Caron et al., 2020) consistently outperforms the state-of-the-art baselines on CIFAR10, STL10, and CIFAR100. Our DCL also demonstrates its capability when compared against the other baselines. More analysis on large-scale benchmarks can be found in the Appendix.

kNN (top-1) SimCLR MoCo MoCo + CLD NPID NPID + CLD Inv. Spread Exemplar DCL DCLW w/ mcrop
CIFAR10 81.4 82.1 87.5 80.8 86.7 83.6 76.5 84.1 87.8
CIFAR100 52.0 53.1 58.1 51.6 57.5 N/A N/A 54.9 58.8
STL10 77.3 80.8 84.3 79.1 83.6 81.6 79.3 81.2 84.1
Table 3: kNN top-1 accuracy (%) comparison of SSL approaches on small-scale benchmarks: CIFAR10, CIFAR100, and STL10. Results show that DCL consistently improves its SimCLR baseline. With multi-cropping (Caron et al., 2020), our DCLW reaches competitive performance among contrastive learning approaches (Chen et al., 2020a; He et al., 2020; Wu et al., 2018; Ye et al., 2019; Dosovitskiy et al., 2015).
ImageNet-1K (batch size = 256; epoch = 200) Linear Top-1 Accuracy (%)
DCL 65.9
+ optimal hyperparameters (0.2, 0.07) 67.8 (+1.9)
+ BYOL augmentation 68.2 (+0.4)
Table 4: Improving the DCL model performance on ImageNet-1K with better hyperparameters (temperature and learning rate) and stronger augmentation.

4.4 Ablations

We perform extensive ablations on the hyperparameters of our DCL and DCLW on both ImageNet data and the small-scale datasets, i.e., CIFAR10, CIFAR100, and STL10. By empirically seeking better configurations, we see that our approach gives consistent gains over the standard SimCLR baseline. In further ablations, we see that our DCL achieves gains over both contrastive learning baselines, SimCLR and MoCo v2, even when training for only 100 epochs.

Ablations of DCL on ImageNet

In Table 4, we slightly improve the DCL model performance on ImageNet-1K with: 1) better hyperparameters (temperature and learning rate); 2) stronger augmentation (e.g., BYOL's). We conduct an empirical hyperparameter search with batch size 256 and 200 epochs to obtain a stronger baseline. This improves DCL from 65.9% to 67.8% top-1 accuracy on ImageNet-1K. We further adopt the BYOL augmentation policy and improve our DCL from 67.8% to 68.2% top-1 accuracy on ImageNet-1K.

SimCLR SimCLR w/ DCL MoCo v2 MoCo v2 w/ DCL
100 epoch 57.5 64.6 63.6 64.4
200 epoch 61.8 65.9 67.5 67.7
Table 5: ImageNet-1K top-1 accuracy (%) for SimCLR and MoCo v2 with/without DCL under few training epochs. We also list results under 200 epochs for clear comparison. With DCL, the performance of SimCLR trained for 100 epochs nearly reaches its performance at 200 epochs. MoCo v2 with DCL also reaches higher accuracy than the baseline at 100 epochs.

Few learning epochs

Our DCL is motivated by the traditional contrastive learning framework, which needs a large batch size and long training to achieve high performance. The previous state of the art, SimCLR, heavily relies on long training to obtain high top-1 accuracy (e.g., up to 1000 epochs). The purpose of our DCL is to achieve higher learning efficiency with few learning epochs. We demonstrate the effectiveness of DCL in the contrastive learning frameworks SimCLR and MoCo v2 (Chen et al., 2020b). We choose a batch size of 256 (queue of 65536 for MoCo v2) as the baseline and train the model with only 100 epochs instead of the usual 200, keeping the other settings the same for a fair comparison. Table 5 shows the results on ImageNet-1K using linear evaluation. With DCL, SimCLR achieves 64.6% top-1 accuracy with only 100 epochs compared to the SimCLR baseline's 57.5%; MoCo v2 with DCL reaches 64.4% compared to the MoCo v2 baseline's 63.6% with 100 epochs of pre-training.

(a) CIFAR10
(b) CIFAR100
(c) STL10
Figure 4: Comparisons between DCL and SimCLR baseline on (a) CIFAR10, (b) CIFAR100, and (c) STL10 data. During the SSL pre-training, DCL speeds up the model convergence and provides better performance than the baseline on CIFAR and STL10 data.

We further demonstrate that, with DCL, representation learning becomes faster during the early stage of training. The reason is that DCL removes the coupling between positive and negative pairs. Figure 4, on (a) CIFAR10, (b) CIFAR100, and (c) STL10, shows that our DCL improves the speed of convergence and reaches higher performance than the baseline on the CIFAR and STL10 data.

Temperature 0.07 0.1 0.2 0.3 0.4 0.5 Standard deviation
SimCLR 83.6 87.5 89.5 89.2 88.7 89.1 2.04
DCL 88.3 89.4 90.8 89.9 89.6 90.3 0.78
Table 6: Ablation study of the temperature τ on CIFAR10 (kNN top-1 accuracy, %).

Analysis of temperature

In Table 6, we provide an extensive analysis of the temperature τ in the loss to support that the DCL method is less sensitive to hyperparameters than the baselines. Specifically, we pre-train the network with temperatures in {0.07, 0.1, 0.2, 0.3, 0.4, 0.5} for both DCL and the SimCLR baseline on CIFAR10 data, and report results with kNN evaluation, batch size 512, and 500 epochs. As shown in Table 6, compared to the SimCLR baseline, DCL is less sensitive to hyperparameters such as the temperature τ.

5 Conclusion

In this paper, we identify the negative-positive-coupling (NPC) effect in the widely used InfoNCE loss, which makes the SSL task significantly easier to solve as the batch size shrinks and thereby reduces learning efficiency. By removing the NPC effect, we reach a new objective function, decoupled contrastive learning (DCL). The proposed DCL loss requires minimal modification to the SimCLR baseline and provides efficient, reliable, and nontrivial performance improvements on various benchmarks. Given the conceptual simplicity of DCL and the fact that it requires neither momentum encoding, large batch sizes, nor long training epochs to reach competitive performance, we hope that DCL can serve as a strong baseline for contrastive-based SSL methods. Further, an important lesson we learn from the DCL loss is that a more efficient SSL task should maintain its complexity when the batch size becomes smaller.

References

  • A. Baevski, Y. Zhou, A. Mohamed, and M. Auli (2020) Wav2vec 2.0: A framework for self-supervised learning of speech representations. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Eds.), Cited by: §A.6.
  • A. Bardes, J. Ponce, and Y. LeCun (2021) VICReg: variance-invariance-covariance regularization for self-supervised learning. CoRR abs/2105.04906. External Links: 2105.04906 Cited by: §A.6.
  • M. I. Belghazi, A. Baratin, S. Rajeswar, S. Ozair, Y. Bengio, R. D. Hjelm, and A. C. Courville (2018) Mutual information neural estimation. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, J. G. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, pp. 530–539. Cited by: §2.4.
  • M. Caron, P. Bojanowski, A. Joulin, and M. Douze (2018) Deep clustering for unsupervised learning of visual features. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XIV, V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss (Eds.), Lecture Notes in Computer Science, Vol. 11218, pp. 139–156. Cited by: §A.3, Table 7.
  • M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin (2020) Unsupervised learning of visual features by contrasting cluster assignments. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Eds.), Cited by: Table 7, §4.3, Table 3.
  • T. Chen, S. Kornblith, M. Norouzi, and G. E. Hinton (2020a) A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, Proceedings of Machine Learning Research, Vol. 119, pp. 1597–1607. Cited by: §A.3, Table 7, §1, §1, §2.2, §2.3, §2.4, §4.1, Table 3.
  • W. Chen, T. Liu, Y. Lan, Z. Ma, and H. Li (2009) Ranking measures and loss functions in learning to rank. In Advances in Neural Information Processing Systems 22: 23rd Annual Conference on Neural Information Processing Systems 2009. Proceedings of a meeting held 7-10 December 2009, Vancouver, British Columbia, Canada, Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta (Eds.), pp. 315–323. Cited by: §2.4.
  • X. Chen, H. Fan, R. B. Girshick, and K. He (2020b) Improved baselines with momentum contrastive learning. CoRR abs/2003.04297. External Links: 2003.04297 Cited by: Table 7, §4.4.
  • X. Chen and K. He (2021) Exploring simple siamese representation learning. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pp. 15750–15758. Cited by: §A.4, Table 7, §2.3.
  • A. Coates, A. Ng, and H. Lee (2011) An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 215–223. Cited by: §4.1.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §4.1.
  • C. Doersch, A. Gupta, and A. A. Efros (2015) Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE international conference on computer vision, pp. 1422–1430. Cited by: Table 7.
  • A. Dosovitskiy, P. Fischer, J. T. Springenberg, M. Riedmiller, and T. Brox (2015) Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE transactions on pattern analysis and machine intelligence 38 (9), pp. 1734–1747. Cited by: Table 3.
  • D. Dwibedi, Y. Aytar, J. Tompson, P. Sermanet, and A. Zisserman (2021) With a little help from my friends: nearest-neighbor contrastive learning of visual representations. CoRR abs/2104.14548. External Links: 2104.14548 Cited by: §A.6.
  • S. Gidaris, P. Singh, and N. Komodakis (2018) Unsupervised representation learning by predicting image rotations. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, Cited by: Table 7, §1.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §2.1.
  • J. Grill, F. Strub, F. Altché, C. Tallec, P. H. Richemond, E. Buchatskaya, C. Doersch, B. Á. Pires, Z. Guo, M. G. Azar, B. Piot, K. Kavukcuoglu, R. Munos, and M. Valko (2020) Bootstrap your own latent - A new approach to self-supervised learning. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Eds.), Cited by: §A.3, Table 7, §2.3.
  • R. Hadsell, S. Chopra, and Y. LeCun (2006) Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2006), 17-22 June 2006, New York, NY, USA, pp. 1735–1742. Cited by: §2.2.
  • K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2020) Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738. Cited by: Table 7, §1, §1, §2.2, §2.3, §4.2, Table 3.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §4.1, §4.1, §4.3.
  • R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio (2019) Learning deep representations by mutual information estimation and maximization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, Cited by: §2.4.
  • P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan (2020) Supervised contrastive learning. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Eds.), Cited by: §2.2.
  • A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Cited by: §4.1.
  • L. Lugosch, M. Ravanelli, P. Ignoto, V. S. Tomar, and Y. Bengio (2019) Speech model pre-training for end-to-end spoken language understanding. In Interspeech 2019, 20th Annual Conference of the International Speech Communication Association, Graz, Austria, 15-19 September 2019, G. Kubin and Z. Kacic (Eds.), pp. 814–818. Cited by: item ‡.
  • I. Misra and L. v. d. Maaten (2020) Self-supervised learning of pretext-invariant representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6707–6717. Cited by: §A.3, Table 7.
  • A. Nagrani, J. S. Chung, W. Xie, and A. Zisserman (2020) Voxceleb: large-scale speaker verification in the wild. Comput. Speech Lang. 60. Cited by: item †.
  • M. Noroozi and P. Favaro (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VI, B. Leibe, J. Matas, N. Sebe, and M. Welling (Eds.), Lecture Notes in Computer Science, Vol. 9910, pp. 69–84. Cited by: §1.
  • S. Ozair, C. Lynch, Y. Bengio, A. van den Oord, S. Levine, and P. Sermanet (2019) Wasserstein dependency measure for representation learning. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett (Eds.), pp. 15578–15588. Cited by: §2.4.
  • A. Radford, L. Metz, and S. Chintala (2016) Unsupervised representation learning with deep convolutional generative adversarial networks. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), Cited by: §2.1.
  • H. Ren (2020) A pytorch implementation of simclr. GitHub. Note: https://github.com/leftthomas/SimCLR Cited by: §4.1.
  • I. Susmelj, M. Heller, P. Wirth, J. Prescott, M. Ebner, and et al. (2020) Lightly. GitHub. Note: https://github.com/lightly-ai/lightly. Cited by: item †.
  • Y. Tian, D. Krishnan, and P. Isola (2020a) Contrastive multiview coding. In Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XI, A. Vedaldi, H. Bischof, T. Brox, and J. Frahm (Eds.), Lecture Notes in Computer Science, Vol. 12356, pp. 776–794. Cited by: Table 7, §1, §4.1.
  • Y. Tian, C. Sun, B. Poole, D. Krishnan, C. Schmid, and P. Isola (2020b) What makes for good views for contrastive learning?. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Eds.), Cited by: Table 7.
  • Y. H. Tsai, M. Q. Ma, M. Yang, H. Zhao, L. Morency, and R. Salakhutdinov (2021) Self-supervised representation learning with relative predictive coding. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021, Cited by: §2.4.
  • A. van den Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. CoRR abs/1807.03748. External Links: 1807.03748 Cited by: §1, §2.2.
  • T. Wang and P. Isola (2020) Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International Conference on Machine Learning, pp. 9929–9939. Cited by: §A.5, Table 7.
  • X. Wang, Z. Liu, and S. X. Yu (2021a) Unsupervised feature learning by cross-level instance-group discrimination. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pp. 12586–12595. Cited by: §3.
  • X. Wang, Z. Liu, and S. X. Yu (2021b) Unsupervised feature learning by cross-level instance-group discrimination. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pp. 12586–12595. Cited by: §4.1, §4.3.
  • Z. Wu, Y. Xiong, S. X. Yu, and D. Lin (2018) Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3733–3742. Cited by: Table 7, §1, §2.2, Table 3.
  • M. Ye, X. Zhang, P. C. Yuen, and S. Chang (2019) Unsupervised embedding learning via invariant and spreading instance feature. In Proceedings of the IEEE Conference on computer vision and pattern recognition, pp. 6210–6219. Cited by: §2.2, Table 3.
  • Y. You, I. Gitman, and B. Ginsburg (2017) Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888. Cited by: §2.3.
  • X. Zhan, J. Xie, Z. Liu, D. Lin, and C. Change Loy (2020) OpenSelfSup: open mmlab self-supervised learning toolbox and benchmark. GitHub. Note: https://github.com/open-mmlab/openselfsup Cited by: §A.4, §4.1.
  • R. Zhang, P. Isola, and A. A. Efros (2016) Colorful image colorization. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III, B. Leibe, J. Matas, N. Sebe, and M. Welling (Eds.), Lecture Notes in Computer Science, Vol. 9907, pp. 649–666. Cited by: §1.
  • C. Zhuang, A. L. Zhai, and D. Yamins (2019) Local aggregation for unsupervised learning of visual embeddings. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6002–6012. Cited by: Table 7.

Appendix A Appendix

A.1 Proof of Proposition 1

Proposition 1. There exists a negative-positive coupling (NPC) multiplier q_{B,i} in the gradient of L_i^{(1)}, as given in Equation 2, where the NPC multiplier is given by Equation 3. Due to the symmetry, a similar NPC multiplier exists in the gradient of L_i^{(2)}.

Proof.

Write Z_i = \exp(\langle z^{(1)}_i, z^{(2)}_i\rangle/\tau) + \sum_{j\neq i}\sum_{l\in\{1,2\}} \exp(\langle z^{(1)}_i, z^{(l)}_j\rangle/\tau), so that \mathcal{L}^{(1)}_i = -\langle z^{(1)}_i, z^{(2)}_i\rangle/\tau + \log Z_i. Differentiating,

\frac{\partial \mathcal{L}^{(1)}_i}{\partial z^{(1)}_i} = -\frac{z^{(2)}_i}{\tau} + \frac{1}{\tau Z_i}\Big( \exp(\langle z^{(1)}_i, z^{(2)}_i\rangle/\tau)\, z^{(2)}_i + \sum_{j\neq i}\sum_{l\in\{1,2\}} \exp(\langle z^{(1)}_i, z^{(l)}_j\rangle/\tau)\, z^{(l)}_j \Big),

where 1 - \exp(\langle z^{(1)}_i, z^{(2)}_i\rangle/\tau)/Z_i = q_{B,i} and \exp(\langle z^{(1)}_i, z^{(l)}_j\rangle/\tau)/Z_i = q_{B,i}\,\exp(\langle z^{(1)}_i, z^{(l)}_j\rangle/\tau)\big/\sum_{k\neq i}\sum_{m\in\{1,2\}}\exp(\langle z^{(1)}_i, z^{(m)}_k\rangle/\tau). Grouping the terms yields Equation 2 with the common multiplier q_{B,i} of Equation 3. The partial gradients with respect to z^{(2)}_i and each z^{(l)}_j carry the same factor q_{B,i} by the same computation, and the gradient of L_i^{(2)} follows by symmetry. ∎

A.2 Proof of Proposition 2

Proposition 2. Removing the positive pair from the denominator of Equation 1 leads to a decoupled contrastive learning loss. Equivalently, if we remove the NPC multiplier q_{B,i} from Equation 2, we reach a decoupled contrastive learning loss L_DC, where L_{DC,i}^{(1)} is given by Equations 4 and 5.

Proof.

By removing the positive term from the denominator of Equation 1, we obtain Equation 4. Repeating the procedure in the proof of Proposition 1, the normalizer no longer contains the positive term, so the coupling factor q_{B,i} disappears from every partial gradient. ∎
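Both propositions can also be checked numerically with autograd: for a random batch, the InfoNCE gradient with respect to the embeddings equals the DCL gradient scaled by q_{B,i}. The following self-contained check is our own addition, not part of the paper.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, d, tau = 8, 16, 0.1
z1 = F.normalize(torch.randn(N, d), dim=1).requires_grad_(True)
z2 = F.normalize(torch.randn(N, d), dim=1)

i = 0                                                            # anchor view z1[i]
pos = (z1[i] * z2[i]).sum() / tau
negatives = torch.cat([z1[:i], z1[i + 1:], z2[:i], z2[i + 1:]])  # the 2(N-1) negative views
neg = z1[i] @ negatives.t() / tau

loss_infonce = -pos + torch.logsumexp(torch.cat([pos.view(1), neg]), dim=0)   # Equation 1
loss_dcl = -pos + torch.logsumexp(neg, dim=0)                                  # Equations 4-5

g_infonce, = torch.autograd.grad(loss_infonce, z1, retain_graph=True)
g_dcl, = torch.autograd.grad(loss_dcl, z1)

q = neg.exp().sum() / (pos.exp() + neg.exp().sum())              # NPC multiplier, Equation 3
print(torch.allclose(g_infonce, q * g_dcl, atol=1e-4))           # True: gradients differ exactly by q_{B,i}
```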

Method Param. (M) Batch size Epochs Top-1 ()
Relative-Loc. (Doersch et al., 2015) 24 256 200 49.3
Rotation-Pred. (Gidaris et al., 2018) 24 256 200 55.0
DeepCluster (Caron et al., 2018) 24 256 200 57.7
NPID (Wu et al., 2018) 24 256 200 56.5
Local Agg. (Zhuang et al., 2019) 24 256 200 58.8
MoCo (He et al., 2020) 24 256 200 60.6
SimCLR (Chen et al., 2020a) 28 256 200 61.8
CMC (Tian et al., 2020a) 47 256 280 64.1
MoCo v2 (Chen et al., 2020b) 28 256 200 67.5
SwAV (Caron et al., 2020) 28 4096 200 69.1
SimSiam (Chen and He, 2021) 28 256 200 70.0
InfoMin (Tian et al., 2020b) 28 256 200 70.1
BYOL (Grill et al., 2020) 28 4096 200 70.6
Hypersphere (Wang and Isola, 2020) 28 256 200 67.7
DCL 28 256 200 67.8
DCL+BYOL aug. 28 256 200 68.2
PIRL (Misra and Maaten, 2020) 24 256 800 63.6
SimCLR (Chen et al., 2020a) 28 4096 1000 69.3
MoCo v2 (Chen et al., 2020b) 28 256 800 71.1
SwAV (Caron et al., 2020) 28 4096 400 70.7
SimSiam (Chen and He, 2021) 28 256 800 71.3
DCL 28 256 400 69.5
Table 7: ImageNet-1K top-1 accuracies (%) of linear classifiers trained on the representations of different SSL methods with a ResNet-50 backbone. The results in the lower section are the same methods with a larger experimental setting.

A.3 Linear classification on ImageNet-1K

In Table 7, we compare the top-1 linear-evaluation accuracies with state-of-the-art SSL approaches on ImageNet-1K. For fairness, we list each approach's batch size and number of training epochs as reported in the original papers. During pre-training, our DCL is based on a ResNet-50 backbone with two views of size 224 × 224. DCL relies on its simplicity to reach competitive performance without relatively huge batch sizes or other pre-training schemes, i.e., a momentum encoder, clustering, or a prediction head. We report both 200-epoch and 400-epoch versions of our DCL. It achieves 69.5% with a batch size of 256 and 400-epoch pre-training, which is better than SimCLR (Chen et al., 2020a) in its optimal case, i.e., a batch size of 4096 and 1000 epochs (69.3%). Note that SwAV (Caron et al., 2020), BYOL (Grill et al., 2020), SimCLR, and PIRL (Misra and Maaten, 2020) need a huge batch size of 4096, and SwAV further applies multi-cropping to generate extra views to reach optimal performance.

A.4 Implementation details

Default DCL augmentations.

We follow the settings of SimCLR to set up the data augmentations. We use a random resized crop with scale in [0.08, 1.0], followed by a random horizontal flip. Then, color jittering with strength [0.8, 0.8, 0.8, 0.2] is applied with probability 0.8, and random grayscale with probability 0.2. Random Gaussian blur uses a Gaussian kernel with standard deviation in [0.1, 2.0].
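In torchvision terms, this pipeline can be sketched as follows (our reading of the description above; the 224-pixel crop size matches the ImageNet setting, while the blur kernel size and its application probability are assumptions following SimCLR's defaults).

```python
from torchvision import transforms

default_dcl_augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.08, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply([transforms.ColorJitter(0.8, 0.8, 0.8, 0.2)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    # kernel size and probability are assumptions (SimCLR defaults); the sigma range is as described
    transforms.RandomApply([transforms.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0))], p=0.5),
    transforms.ToTensor(),
])
```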

Strong DCL augmentations.

We follow the augmentation pipeline of BYOL to replace the default DCL augmentations in the ablations. Table 4 shows that the ImageNet-1K top-1 accuracy increases from 67.8% to 68.2% when applying BYOL's augmentations.

Linear evaluation.

Following the OpenSelfSup benchmark (Zhan et al., 2020), we first train the linear classifier with batch size 256 for 100 epochs. We use the SGD optimizer with momentum = 0.9 and weight decay = 0. The base learning rate is set to 30.0 and decayed by 0.1 at epochs 60 and 80. We also evaluate with the linear evaluation protocol of SimSiam (Chen and He, 2021), which raises the batch size to 4096 for 90 epochs; the optimizer is switched to LARS with a cosine decay schedule, while the momentum and weight decay remain unchanged. We found that the second protocol slightly improves the performance.
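The first protocol translates to a short optimizer setup (a sketch; `linear_head` stands for the linear classifier trained on frozen 2048-d ResNet-50 features).

```python
import torch
import torch.nn as nn

linear_head = nn.Linear(2048, 1000)                # frozen ResNet-50 features -> 1000 ImageNet classes
optimizer = torch.optim.SGD(linear_head.parameters(), lr=30.0, momentum=0.9, weight_decay=0.0)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[60, 80], gamma=0.1)
# train for 100 epochs with batch size 256, stepping the scheduler once per epoch
```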

A.5 Relation to alignment and uniformity

In this section, we provide a thorough discussion of the connection and difference between DCL and Hypersphere (Wang and Isola, 2020), which does not have negative-positive coupling either. However, there is a critical difference between DCL and Hypersphere: the order of the expectation and the logarithm is swapped. Let us assume the latent embedding vectors are normalized for analytical convenience. When the embeddings are normalized, the cosine similarity and the squared Euclidean distance are equivalent up to a trivial affine rescaling, so the two losses can be written in a similar fashion. With the right weight factor, the positive (alignment) terms can be made exactly the same, so let us focus on the negative terms: DCL takes the expectation over anchors of a per-anchor log-sum-exp of the exponentiated negative similarities, whereas Hypersphere's uniformity term takes a single logarithm of the batch-level expectation of the exponentiated pairwise similarities.

Similar to our earlier analysis in the manuscript, the latter introduces a negative-negative coupling between the negative samples of different positive samples: if the negative pairs of one positive sample dominate the shared normalizer, the gradients contributed by the negative pairs of the other positive samples are attenuated. This behaves similarly to the negative-positive coupling. That being said, while Hypersphere does not have a negative-positive coupling, it has a similarly problematic negative-negative coupling. Next, we provide a comprehensive empirical comparison. The empirical results match our analytical prediction: DCL outperforms Hypersphere by a larger margin under smaller batch sizes.
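The swapped order can be made concrete in a few lines (our sketch; both terms are written in the cosine-similarity form, which on normalized embeddings is equivalent to Hypersphere's squared-distance form up to an affine rescaling).

```python
import torch

def dcl_negative_term(z, temperature=0.1):
    """DCL-style negative term: expectation over anchors of a per-anchor log-sum-exp."""
    sim = torch.mm(z, z.t()) / temperature
    sim.fill_diagonal_(float("-inf"))                      # exclude self-pairs
    return torch.logsumexp(sim, dim=1).mean()              # E_i [ log sum_j exp(sim_ij) ]

def uniformity_style_term(z, temperature=0.1):
    """Hypersphere-style term: one log of a batch-level average, so every pair shares a
    single normalizer, which is what couples negatives across different anchors."""
    sim = torch.mm(z, z.t()) / temperature
    sim.fill_diagonal_(float("-inf"))                      # exclude self-pairs
    n_pairs = z.size(0) * (z.size(0) - 1)
    return torch.logsumexp(sim.flatten(), dim=0) - torch.log(torch.tensor(float(n_pairs)))
```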

The comparisons of DCL to Hypersphere are evaluated on STL10, ImageNet-100, and ImageNet-1K under various settings. For the STL10 data, we implement DCL based on the official code of Hypersphere. The encoder and the hyperparameters are the same as for Hypersphere and have not been optimized for DCL in any way. We found that the Hypersphere authors did a fairly thorough hyperparameter search, so we believe the default hyperparameters are relatively well optimized for Hypersphere.

In Table 8, DCL reaches 84.4% (fc7+Linear) compared to the 83.2% (fc7+Linear) reported by Hypersphere on STL10. In Tables 9 and 10, our DCL achieves better performance than Hypersphere under the same setting (MoCo and MoCo v2) on ImageNet-100 data. Our DCL also shows strong results compared against Hypersphere on ImageNet-1K in Table 11. We further provide STL10 comparisons of our DCL and Hypersphere under different batch sizes in Table 12; the experiment shows that the advantage of DCL grows as the batch size shrinks. Please note that we did not tune the parameters for DCL at all, so this should be a more than fair comparison.

STL10 fc7+Linear fc7+5-NN Output + Linear Output + 5-NN
Hypersphere 83.2 76.2 80.1 79.2
DCL 84.4 (+1.2) 77.3 (+1.1) 81.5 (+1.4) 80.5 (+1.3)
Table 8: STL10 comparisons Hypersphere and DCL under the same experiment setting.
ImageNet-100 Epoch Memory Queue Size Linear Top-1 Accuracy (%)
Hypersphere 240 16384 75.6
DCL 240 16384 76.8 (+1.2)
Table 9: ImageNet-100 comparisons of Hypersphere and DCL under the same setting (MoCo).
ImageNet-100 Epoch Memory Queue Size Linear Top-1 Accuracy (%)
Hypersphere 200 16384 77.7
DCL 200 8192 80.5 (+2.7)
Table 10: ImageNet-100 comparisons of Hypersphere and DCL under the same setting (MoCo v2) except for memory queue size.
ImageNet-1K Epoch Batch Size Linear Top-1 Accuracy (%)
Hypersphere 200 256 (Memory queue = 16384) 67.7
DCL 200 256 68.2 (+0.5)
Table 11: ImageNet-1K comparisons of Hypersphere and DCL under their best settings. In this experiment, both methods used their optimized hyperparameters.
Batch Size 32 64 128 256 768
Hypersphere 78.9 81.0 81.9 82.6 83.2
DCL 81.0 (+2.1) 82.9 (+1.9) 83.7 (+1.8) 84.2 (+1.6) 84.4 (+1.2)
Table 12: STL10 comparisons of Hypersphere and DCL under different batch sizes.

In every single one of the experiments, DCL outperforms Hypersphere. We hope these results show the unique value of DCL compared to Hypersphere.

A.6 Limitations of the proposed DCL

We summarize two limitations of the proposed DCL method. First, DCL appears to give less gain over the SimCLR baseline when the batch size is large. According to Figure 1 and the theoretical analysis, the reason is that the NPC multiplier q_{B,i} approaches 1 when the batch size is large (e.g., 1024); as we have shown in the analysis, the baseline SimCLR loss converges to the DCL loss as the batch size approaches infinity. With 400 training epochs, the ImageNet-1K top-1 accuracy only slightly increases from 69.5% to 69.9% when the batch size is increased from 256 to 1024. Please refer to Table 13.

Second, DCL focuses on contrastive learning-based methods, where we decouple the positive and negative terms to achieve better learning efficiency. Non-contrastive methods, e.g., VICReg (Bardes et al., 2021), do not rely on negative samples. While non-contrastive methods have achieved better performance on large-scale benchmarks like ImageNet, competitive contrastive methods, such as NNCLR (Dwibedi et al., 2021), have recently been proposed, and the DCL method can potentially be combined with NNCLR to achieve further improvement. The SOTA SSL speech models, e.g., wav2vec 2.0 (Baevski et al., 2020), still use a contrastive loss in their objective. In Table 14, we show the effectiveness of DCL with wav2vec 2.0 (Baevski et al., 2020): we replace the contrastive loss with the DCL loss and train a wav2vec 2.0 base model (i.e., 7-Conv + 24-Transformer) from scratch (the experiment is downscaled to 8 V100 GPUs rather than 64). After training, we evaluate the representation on two downstream tasks, speaker identification and intent classification; Table 14 shows the representation improvement.

After all, there is no consensus that non-contrastive methods lead to state-of-the-art results universally across datasets. In fact, in Table 15, we use CIFAR-10 as an example to show that DCL achieves competitive results compared to BYOL, SimSiam, and Barlow Twins.

ImageNet-1K (ResNet-50) Batch Size Epoch Top-1 Accuracy (%)
DCL 256 200 67.8
DCL 256 400 69.5 (+1.7)
DCL 1024 400 69.9 (+0.4)
  • The values are taken from Table 7

Table 13: Results of DCL with large batch size and learning epochs.
Downstream task (Accuracy) Speaker Identification (%) Intent Classification (%)
wav2vec 2.0 Base Baseline 74.9 92.3
wav2vec 2.0 Base w/ DCL 75.2 92.5
  • Speaker identification: in the downstream training process, the pre-trained representations are first mean-pooled and then passed through a fully connected layer trained with a cross-entropy loss on VoxCeleb1 (Nagrani et al., 2020).

  • Intent classification: in the downstream training process, the pre-trained representations are first mean-pooled and then passed through a fully connected layer trained with a cross-entropy loss on Fluent Speech Commands (Lugosch et al., 2019).

Table 14: Results of DCL on wav2vec 2.0, evaluated on two downstream tasks.
CIFAR-10 (ResNet-18) Batch Size Epoch kNN Accuracy (%)
BYOL† 128 200 85
SimSiam 128 200 73
Barlow Twins 128 200 84
DCL 128 200 84
BYOL 512 200 84
SimSiam 512 200 81
Barlow Twins 512 200 78
DCL 512 200 84
  • The method is implemented by (Susmelj et al., 2020).

Table 15: CIFAR-10 as an example to show that DCL achieves competitive results compared to BYOL, SimSiam, and Barlow Twins.