Improving Contrastive Learning by Visualizing Feature Transformation, ICCV 2021 Oral
Contrastive learning, which aims at minimizing the distance between positive pairs while maximizing that of negative ones, has been widely and successfully applied in unsupervised feature learning, where the design of positive and negative (pos/neg) pairs is one of its keys. In this paper, we attempt to devise a feature-level data manipulation, differing from data augmentation, to enhance generic contrastive self-supervised learning. To this end, we first design a visualization scheme for the pos/neg score distribution (the pos/neg score is the cosine similarity of a pos/neg pair), which enables us to analyze, interpret and understand the learning process. To our knowledge, this is the first attempt of its kind. More importantly, leveraging this tool, we gain some significant observations, which inspire our novel Feature Transformation proposals, including the extrapolation of positives. This operation creates harder positives to boost the learning, because hard positives enable the model to be more view-invariant. Besides, we propose the interpolation among negatives, which provides diversified negatives and makes the model more discriminative. It is the first attempt to deal with both challenges simultaneously. Experiment results show that our proposed Feature Transformation can improve accuracy by at least 6.0% on ImageNet-100 over the MoCo baseline, and by about 2.0% on ImageNet-1K over the MoCoV2 baseline. Transferring to the downstream tasks successfully demonstrates that our model is less task-biased. Visualization tools and code are available at https://github.com/DTennant/CL-Visualizing-Feature-Transformation .
Supervised pre-training is a de facto dominant approach in the computer vision community. Recently, however, self-supervised contrastive learning has achieved comparable transfer performance without human-provided annotations. One of the key issues of contrastive learning is to design positive and negative (pos/neg) pairs to learn an embedding space such that the positives stay closer in the space while the negatives are pushed away.
Most existing approaches [4, 6, 42, 7] acquire pos/neg pairs by data augmentation, which exploits various views of the same image to form positive pairs. For example, CMC uses the luminance and chrominance color channels of an image as two views. InfoMin demonstrates that incremental data augmentations indeed lead to decreasing mutual information between views and thus improve transfer performance. In other words, an effective positive pair should convey more variance of one instance. Through a series of such improvements, contrastive learning methods based on data augmentation [4, 6, 42, 7] are approaching fully supervised performance on ImageNet.
Most previous data augmentations (e.g., cropping, color distortion) are directly sourced from human intuition. They offer little interpretability, so their effectiveness cannot be guaranteed. We argue, however, that feature-level data manipulation (i.e., feature transformation) can provide more explainable or effective pos/neg pairs to enhance the feature embedding. To this end, we first design a scheme to visualize the pos/neg pair score distributions during training. We believe that these score distributions can reveal and explain how the model parameter values affect performance. The visualization helps us trace back the training process. Moreover, it enables us to observe the characteristics of the pos/neg pairs, and then devise more effective feature transformations (FT).
Figure 1 demonstrates the motivation of score visualization. By plotting the score distributions under different momentum values of MoCo, we can clearly observe that the setting with smaller positive scores achieves better performance. A small positive score indicates less similarity between the pair, which means this positive pair actually carries a large view variance of one example. This is consistent with the goal of feature learning, which targets a more view-invariant visual representation. Therefore, we conjecture that "hard positives" are the ones conveying large view variance of a sample. Inspired by this observation, we introduce an extrapolation operation on positive pairs to increase view variance and thus acquire hard positives. Figure 1(c) shows that the extrapolation of positives can boost the model performance from the "blue" one to the "orange" one.
Besides, to make full use of negative features, we propose the random interpolation among negatives, which intuitively provides diversified negatives for each training step and makes the model more discriminative.
Unlike traditional data augmentation, our feature transformation does not bring additional training examples. Instead, it aims at reshaping the feature distribution by manipulating both positive and negative pairs. Our feature transformation creates hard positives and diversified negatives to learn a more view-invariant (hard positives) and more discriminative (diversified negatives) representation. It is directly driven by the performance of the learned representation, whereas data augmentation is largely agnostic to it. Furthermore, our feature transformation makes the model less "task-biased", which means we can achieve performance improvements on various downstream tasks. This is verified by our experiments on object detection, instance segmentation, and long-tailed classification, all with significant improvement.
Both our visualization tool and feature transformation are generic, and can be applied to various self-supervised contrastive learning methods including MoCo, SimCLR, InfoMin, SwAV, and SimSiam. In the following sections, we employ the classic model MoCo to demonstrate our framework. To summarize, our contributions include:
We are the first to design a visualization tool to analyze and interpret how the score distribution of pos/neg pairs affects the model's capability. The visualization also leads us to some significant observations.
Inspired by the observations from the model visualization, we propose a simple yet effective feature transformation, which creates both "hard positives" and "diversified negatives" to enhance the training. The feature transformations enable the model to learn more "view-invariant" and discriminative representations.
We conduct thorough experiments and our model achieves state-of-the-art performance. In addition, the experiments on the downstream tasks successfully demonstrate that our model is less task-biased.
Contrastive Learning: Contrastive losses have been widely used in self-supervised learning and brought significant improvements on classification [13, 1, 14, 41, 42, 6, 7, 12, 4, 18, 2, 59, 49, 52, 9, 3, 45, 51, 56, 48, 46, 24] and detection [47, 53, 54, 55]. InfoMin uses the lower bound of NCE to demonstrate that incremental data augmentations lead to decreasing mutual information between views and thus improve transfer performance. In other words, relatively harder data augmentation for contrastive learning boosts the transfer performance [20, 6]. We show that our proposed feature transformation can be easily adopted on current state-of-the-art models.
Mixup and its variants [44, 58] provide highly effective data augmentation strategies when paired with a cross-entropy loss for supervised and semi-supervised learning. Manifold mixup is a feature-level regularization for supervised learning, while Un-mix proposes using mixup in the image/pixel space for self-supervised learning. In MoCHi, the authors propose mixing the negative samples in the embedding space for hard negative augmentation, but this hurts the classification accuracy. i-Mix proposes a strategy that mixes instances in both input and virtual label spaces to regularize contrastive training. In this paper, we propose to use feature transformation rather than data augmentation. Positive features are extrapolated to increase the hardness of positives, and negative features in the memory queue are interpolated to increase the diversity. Our FT provides more efficacy compared with augmentations.
Generating examples for metric learning: The idea of generating new examples for metric learning has been explored by [27, 10, 23]. The Embedding Expansion work uses uniform interpolation between two positive and negative points, creates a set of synthetic points, and then selects the hardest pair as negative. [27, 10] generate new hard examples with generators and improve performance for metric learning. Different from the approaches [27, 10] for supervised metric learning, our pos/neg FTs target self-supervised learning and do not require labels, extra parameters, or extra loss terms to be optimized.
Let us start from the basic procedure of contrastive learning, as shown in Figure 2. Each data sample $x$ passes through two separate data augmentation pipelines $t_q$ and $t_k$, which are randomly sampled from the same data augmentation pool, and two views $v_q$ and $v_k$ are acquired to construct a positive pair [6, 12]. The encoders $f_q$ and $f_k$ (which might be the same network or different networks [14, 12]) respectively map the two views into the feature embedding space. An $\ell_2$ normalization is applied to the feature vectors $f_q(v_q)$ and $f_k(v_k)$ to project them onto the unit sphere (i.e., $z = f / \|f\|_2$) and obtain $z_q$ and $z_{k^+}$. Inner products of these unit vectors produce cosine similarity scores, namely one positive pair score $S_{q\cdot k^+} = z_q^\top z_{k^+}$ and $K$ negative pair scores $S_{q\cdot k^-} = z_q^\top z_{k^-}$, one for each negative feature $z_{k^-}$. These pair scores are the input to the InfoNCE loss (Equation 1) for contrastive learning:
$$\mathcal{L} = -\log \frac{\exp(S_{q\cdot k^+}/\tau)}{\exp(S_{q\cdot k^+}/\tau) + \sum_{K} \exp(S_{q\cdot k^-}/\tau)} \tag{1}$$
where $\tau$ is the temperature hyper-parameter.
Here we loosely define Feature Transformation (FT) as a manipulation of the encoder embeddings $f_q(v_q)$ and $f_k(v_k)$ that reshapes the distribution of the output pos/neg pair scores ($S_{q\cdot k^+}$ and $S_{q\cdot k^-}$), for better contrastive learning in the subsequent InfoNCE loss. The most common FT applied in current SOTA methods [4, 6, 42, 7, 14] is the unit-sphere projection by $\ell_2$ normalization. We provide empirical studies of this standard FT and illustrate its importance in constraining the feature length in Supp F.
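To make the pipeline from embeddings to loss concrete, the following is a minimal NumPy sketch of the unit-sphere projection, the pos/neg score computation, and an InfoNCE loss in log-softmax form. It is an illustration under stated assumptions, not the authors' implementation: the function names `l2_normalize` and `info_nce`, the array shapes, and the temperature default `tau=0.2` are hypothetical choices made for this example.

```python
import numpy as np

def l2_normalize(v, axis=-1, eps=1e-12):
    """Project vectors onto the unit sphere (the 'standard' FT)."""
    norm = np.linalg.norm(v, axis=axis, keepdims=True)
    return v / np.maximum(norm, eps)

def info_nce(z_q, z_k_pos, queue, tau=0.2):
    """InfoNCE over cosine scores.

    z_q, z_k_pos: (B, D) unit vectors (queries and their positives).
    queue:        (K, D) unit vectors (negatives).
    tau:          temperature (hypothetical default).
    Returns (mean loss, positive scores (B,), negative scores (B, K)).
    """
    s_pos = np.sum(z_q * z_k_pos, axis=1, keepdims=True)  # (B, 1)
    s_neg = z_q @ queue.T                                  # (B, K)
    logits = np.concatenate([s_pos, s_neg], axis=1) / tau
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob_pos = logits[:, 0] - np.log(np.exp(logits).sum(axis=1))
    return float(-log_prob_pos.mean()), s_pos[:, 0], s_neg
```

Because any feature transformation only changes `s_pos` and `s_neg` before they enter the softmax, a sketch like this can be reused unchanged once the features are replaced by transformed ones.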
We choose to visualize the score distribution of pos/neg pairs instead of the loss curves and transfer accuracy, because these internal training dynamics reveal the learning capability of the model. Specifically, there are two practical reasons: (1) The basic idea of the InfoNCE loss is to compare the pos/neg scores in a log-softmax manner, so visualizing the input score pairs helps study the contrastive learning process. (2) The normalized feature vectors $z_q$ and $z_k$ are high-dimensional, which is challenging for storage and visualization, and the exponential amplification of scores is too large to observe the detailed characteristics of pos/neg scores. In contrast, the score $S$ is one-dimensional and bounded in $[-1, 1]$, which makes it suitable for observing the contrastive process.
Notice that this visualization tool works offline: it adds negligible computation and does not affect training speed. It remains feasible even with larger datasets and batch sizes. The details of the visualization tool are presented in Supp A.
We choose the computationally efficient model MoCo as an example to demonstrate our visualization design.
Momentum Update Mechanism: The memory queue is an early approach to relieving the computational burden of large batches: it stores negative features in memory and updates them with the output of the encoder at each training step. However, the rapid change of the encoders ($f_q$ and $f_k$) can bring inconsistency into the memory queue, which usually contains outdated features. MoCo solves the inconsistency issue by leveraging a momentum update mechanism, where only $f_q$ (with parameters $\theta_q$) is updated by back-propagation and $f_k$ (with parameters $\theta_k$) is updated by momentum (Equation 2):
$$\theta_k \leftarrow m\,\theta_k + (1 - m)\,\theta_q \tag{2}$$
where $m$ is the momentum coefficient, which has a huge influence on the final transfer accuracy. The memory queue is then updated using the features from $f_k$, because the momentum update of $f_k$ brings a smoother change of features that reduces the inconsistency in the memory queue.
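The momentum update can be sketched in a few lines. This is a minimal illustration assuming the encoder parameters are stored as dictionaries of NumPy arrays; the function name `momentum_update` and the default `m=0.99` are hypothetical choices for this example, not values taken from the paper.

```python
import numpy as np

def momentum_update(theta_k, theta_q, m=0.99):
    """Return updated key-encoder parameters: m * theta_k + (1 - m) * theta_q.

    theta_k, theta_q: dicts mapping parameter names to arrays of equal shape.
    m = 1 freezes the key encoder; m = 0 copies the query encoder.
    """
    return {name: m * theta_k[name] + (1.0 - m) * theta_q[name]
            for name in theta_k}
```

The two extremes are useful when reading the analysis that follows: with `m = 1` the key encoder never changes, and with `m = 0` it is identical to the query encoder after every step.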
In the following sections, we provide thorough experiments and visualization analysis to show how the momentum $m$ affects the contrastive learning process. We try various values of $m$ for MoCo on ImageNet-100 (denoted as IN-100) with the linear readout protocol for evaluation (details in Supp B). As shown in Tab 1, as $m$ decreases (i.e., as the update speed of encoder $f_k$ increases), the accuracy follows an inverse U-shape, peaking at an intermediate value of $m$, and the model collapses when $m$ becomes too small. Here, model collapse means that the transfer accuracy under the linear readout protocol cannot even reach the accuracy of training from random initialization, indicating a negative effect of pre-training. The trend of these results is similar to that of BYOL.
We choose three non-trivial statistics to visualize the score distribution: the mean of pos/neg scores (indicating the approximate average of the pos/neg pair distance) and the variance of negative scores (indicating the degree of fluctuation of the negative samples in the memory queue). As shown in Fig 3(a), when $m$ becomes smaller, the update speed of encoder $f_k$ increases, leading to larger feature differences between training steps. This is reflected in the growing variance of the negative scores of the queue, namely the inconsistency. Specifically, when $m = 1$ (no update of $f_k$ during training), the variance is close to zero (blue line), while the variance of the red curve is larger and less stable. The setting of the grey curve brings even more violent fluctuations/inconsistency in the memory queue, leading to poor transfer accuracy or even model collapse.
Inside Analysis of Model Collapse: Model collapse has several causes. A small $m$ (fast update speed of $f_k$) brings not only inconsistency, but also confusion of negative scores. For the mean of neg scores (lines in Fig 3(b)), the pink and grey curves are much more volatile than that of the best model (green). The mean of neg scores reflects the approximate score of all the negative pairs in the memory queue. If it becomes drastically volatile during training, the corresponding loss value and gradient fluctuate violently, resulting in bad convergence. As shown in Fig 4, the smooth and stable gradient landscape (Fig 4(a)) becomes sharp and messy as $m$ decreases (Fig 4(b) and Fig 4(c)). Details of the gradient landscape are given in Supp C. Basically, to learn a better pre-trained model, we need to prepare negative pairs that maintain the stability and smoothness of the score distribution and gradient during training, which is similar to supervised learning.
Hard Positive Boosts Performance: A small $m$ indicates not only a faster update speed, but also more similarity between encoders $f_q$ and $f_k$. In the extreme case $m = 0$, the parameters $\theta_k$ are exactly the same as $\theta_q$ at each training step. The increasing similarity of encoders $f_q$ and $f_k$ reduces the dissimilarity between $z_q$ and $z_{k^+}$, so that only the view variance brought by data augmentation remains, leading to a higher positive score. Fig 3(c) shows that the high positive scores obtained with small $m$ correspond to easy positive pairs with close distance and little view variance in feature space.
However, in Fig 5(c), when we increase $m$ from the value of the green curve to that of the orange curve, the easy pos pair becomes a hard pos pair (from very similar to less similar), leading to a higher transfer accuracy. Note that this observation (converting easy positives to hard ones) can be explained by the InfoMin principle: raising the view variance between $z_q$ and $z_{k^+}$ corresponds to reducing the mutual information between the two views, which forces the encoder to learn a more robust embedding and thus improves the transfer accuracy.
Provided that the score distribution and gradient remain stable and smooth, we can adopt feature transformation methods that create hard positives by decreasing easy positive scores. Thus, we propose a positive feature extrapolation method to improve transfer accuracy in Section 4.1.
The learning objective of InfoNCE is to draw the positive pair ($z_q$ and $z_{k^+}$) closer while pushing away negative pairs ($z_q$ and every $z_{k^-}$ in the memory queue) in the embedding space. Therefore, we can directly apply feature transformation to the pos/neg features, in order to provide appropriate regularization or make the learning harder. Specifically, we develop positive extrapolation, which pushes the original positive pair further apart to increase hardness, and negative interpolation of the memory queue, which increases the diversity of negative samples, as shown in Fig 6. Notably, our method does not change the loss terms, because it only replaces the original pair scores with the transformed pos/neg scores when calculating the loss.
The discussion in Sec 3.3 indicates that lowering easy positive pair scores to create hard positive pairs during training can benefit the final transfer performance. We therefore explore a way to manipulate the positive features $z_q$ and $z_{k^+}$ to increase the view variance between them during training.
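First, we simply adopt a weighted addition of the two positive features to generate new features (Equation 3):
$$\hat{z}_q = \lambda_{ex} z_q + (1 - \lambda_{ex}) z_{k^+}, \qquad \hat{z}_{k^+} = \lambda_{ex} z_{k^+} + (1 - \lambda_{ex}) z_q \tag{3}$$
where $\hat{z}_q$ and $\hat{z}_{k^+}$ are the transformed new features. Following the design principle of mixup [44, 58], we make sure that the weights sum to $1$. More importantly, we should guarantee that the transformed pos score $\hat{S}_{q\cdot k^+} = \hat{z}_q^\top \hat{z}_{k^+}$ is no larger than the original pos score $S_{q\cdot k^+} = z_q^\top z_{k^+}$, namely $\hat{S}_{q\cdot k^+} \le S_{q\cdot k^+}$. Substituting Equation 3 into the transformed score and using $\|z_q\|_2 = \|z_{k^+}\|_2 = 1$ gives (Equation 4):
$$\hat{S}_{q\cdot k^+} = 2\lambda_{ex}(1 - \lambda_{ex})(1 - S_{q\cdot k^+}) + S_{q\cdot k^+} \tag{4}$$
Because $S_{q\cdot k^+} \in [-1, 1]$, we have $1 - S_{q\cdot k^+} \ge 0$. To ensure the lower score $\hat{S}_{q\cdot k^+} \le S_{q\cdot k^+}$, we need $\lambda_{ex} \ge 1$ so that $2\lambda_{ex}(1 - \lambda_{ex}) \le 0$. So we choose $\lambda_{ex} \sim \mathrm{Beta}(\alpha_{ex}, \alpha_{ex}) + 1$: a value is sampled from a beta distribution and then shifted by $1$, which results in a range of $(1, 2)$ for $\lambda_{ex}$. (We set the two parameters of the beta distribution to be the same because the two mixed features are symmetrical; the same applies to the negative feature interpolation.) The range of the transformed pos score is therefore $[-4 + 5 S_{q\cdot k^+},\ S_{q\cdot k^+}]$.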
Intuitively, this can be seen as a simple way to push $z_q$ and $z_{k^+}$ apart in feature space. After extrapolation, the distance between the extrapolated feature vectors is enlarged. Therefore, extrapolation can serve as a feature transformation that creates hard positives from easy ones. As shown in Fig 6(b), it brings a minor direction change to the two positive vectors while conveying a larger view variance of a sample for better contrastive learning. The visualization of the pos score being lowered by extrapolation is shown in Fig 1(c).
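Positive extrapolation as described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the function name `positive_extrapolation`, the default `alpha_ex=2.0`, and the choice to draw one mixing weight per positive pair (rather than one per batch) are assumptions made for this example.

```python
import numpy as np

def positive_extrapolation(z_q, z_k_pos, alpha_ex=2.0, rng=None):
    """Extrapolate each positive pair to lower its cosine score.

    z_q, z_k_pos: (B, D) unit vectors.
    alpha_ex:     beta-distribution parameter (hypothetical default).
    Returns (z_q_hat, z_k_hat, lam) with lam in [1, 2], shape (B, 1).
    """
    rng = np.random.default_rng() if rng is None else rng
    # lam ~ Beta(alpha, alpha) + 1, so the weights lam and (1 - lam) sum to 1
    lam = rng.beta(alpha_ex, alpha_ex, size=(z_q.shape[0], 1)) + 1.0
    z_q_hat = lam * z_q + (1.0 - lam) * z_k_pos
    z_k_hat = lam * z_k_pos + (1.0 - lam) * z_q
    return z_q_hat, z_k_hat, lam
```

For unit-norm inputs, the inner product of the two returned features equals $2\lambda(1-\lambda)(1-S) + S$, so it never exceeds the original score $S$; the transformed features themselves are generally no longer unit-norm.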
We evaluate the efficacy of positive extrapolation on IN-100 and try various $\alpha_{ex}$ in Tab 2. Positive extrapolation with various $\alpha_{ex}$ consistently improves the accuracy over the MoCo baseline (71.1%), which clearly demonstrates its efficacy. Interestingly, $\alpha_{ex} > 1$ gives better results than $\alpha_{ex} < 1$. This is because the beta distribution with $\alpha_{ex} < 1$ yields extremely large or small $\lambda_{ex}$ with high probability (values close to either end of its range), while the beta distribution with $\alpha_{ex} > 1$ yields a neutral $\lambda_{ex}$ with high probability: it has an inverted-U shape that samples values near 0.5 with greater probability, so $\lambda_{ex}$ is more likely to be close to 1.5. According to Equation 4, an extreme $\lambda_{ex}$ brings too much or too little hardness, so the corresponding performance is less robust than with a neutral one.
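The effect of the beta parameter on the sampled mixing weight can be checked numerically. The sketch below estimates how often a symmetric beta distribution produces a value near either end of $(0, 1)$; the function name `extreme_fraction`, the sample count, and the `margin=0.1` threshold are hypothetical choices for this illustration.

```python
import numpy as np

def extreme_fraction(alpha, n=100_000, margin=0.1, seed=0):
    """Fraction of Beta(alpha, alpha) samples within `margin` of 0 or 1."""
    rng = np.random.default_rng(seed)
    lam = rng.beta(alpha, alpha, size=n)
    return float(np.mean((lam < margin) | (lam > 1.0 - margin)))

# A parameter below 1 gives a U-shaped density (mass near the ends);
# a parameter above 1 gives an inverted-U density (mass near 0.5).
u_shaped = extreme_fraction(0.2)
inverted_u = extreme_fraction(2.0)
```

With these settings, `u_shaped` is well above `inverted_u`, which matches the explanation that a small parameter produces extreme mixing weights far more often.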
|model||α||pos interpolation / pos extrapolation acc (%)|
|MoCo (baseline: 71.1)||0.2||69.1 / 71.6|
|MoCo (baseline: 71.1)||2.0||67.4 / 72.8|
What if Positive Interpolation? To further verify our conjecture that extrapolation can create hard positives while interpolation cannot, we also conduct experiments on the interpolation of positive features, shown in Tab 3. We observe a clear performance drop relative to the baseline in this experiment. The reason is that interpolation between positive features pulls the positive pair together, thus reducing the hardness during training. In other words, the view variance of positive pairs decreases, which tends to produce non-robust features.
Previous contrastive models [6, 14] do not make full use of negative samples. For example, in MoCo, many negative features are stored repeatedly in the memory queue from iteration to iteration. We can therefore design a new strategy to fully utilize negative features and increase the diversity of the memory queue. With sufficient randomness, we propose negative interpolation in the memory queue, which intuitively provides diversified negatives for each training step.
Specifically, we denote the negative memory queue of MoCo as $Z_{neg} = \{z_1, z_2, \dots, z_K\}$, where $K$ is the size of the memory queue, and $Z_{perm}$ as a random permutation of $Z_{neg}$. We propose to use a simple interpolation between the two memory queues to create a new queue $\hat{Z}_{neg}$ (Equation 5):
$$\hat{Z}_{neg} = \lambda_{in} Z_{neg} + (1 - \lambda_{in}) Z_{perm}, \qquad \lambda_{in} \sim \mathrm{Beta}(\alpha_{in}, \alpha_{in}) \tag{5}$$
where $\lambda_{in}$ is in the range of $(0, 1)$, as shown in Fig 6(a). The transformed memory queue $\hat{Z}_{neg}$ provides fresh interpolated negatives for the contrastive loss at every iteration, where the random permutation and $\lambda_{in}$ ensure the diversity of $\hat{Z}_{neg}$ at each training step. This diversity lets the model compare against many more linear combinations of previous negatives in each training step. Positive extrapolation increases the view variance between two pos features, while negative interpolation similarly boosts the "sample variance" (diversity) of the memory queue. We conjecture that the original queue provides a discrete distribution of negative samples, and that our method fills in the missing sample points of the distribution by random interpolation, leading to a more discriminative model.
We evaluate the efficacy of negative interpolation on IN-100 and try various $\alpha_{in}$ in Tab 4. Negative interpolation is fairly robust across various $\alpha_{in}$, consistently improving over the baseline (71.1%). More discussions about negative feature transformation (hard negatives and negative extrapolation) are given in Supp G.
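Negative interpolation of the memory queue can be sketched as below. This is a minimal NumPy illustration under stated assumptions: the function name `negative_interpolation`, the default `alpha_in=1.6`, and the choice to draw a separate mixing weight for each queue entry (a single shared weight would be an equally plausible reading) are hypothetical.

```python
import numpy as np

def negative_interpolation(queue, alpha_in=1.6, rng=None):
    """Mix the queue with a random permutation of itself.

    queue:    (K, D) array of negative features (unit vectors).
    alpha_in: beta-distribution parameter (hypothetical default).
    Returns a new (K, D) queue of interpolated negatives.
    """
    rng = np.random.default_rng() if rng is None else rng
    K = queue.shape[0]
    perm = rng.permutation(K)                        # random pairing of negatives
    lam = rng.beta(alpha_in, alpha_in, size=(K, 1))  # weights in (0, 1)
    return lam * queue + (1.0 - lam) * queue[perm]
```

Calling this once per training step gives a different interpolated queue each time, since both the permutation and the weights are resampled. Each output row is a convex combination of two unit vectors, so its norm is at most 1.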
Previous works have explored image-level and feature-level mixing in contrastive learning. Our method differs from these works in three ways. First, the motivation: our feature transformation strategies are motivated by the observations in Sec 3.2. Second, the way we extrapolate between two positive features is novel and outperforms the other two methods in several experiments in Tab 8 and 9. Third, the negative interpolation aims at fully utilizing negative samples in each training step. Both FT methods focus on finding an effective way to perform feature transformation, rather than simply adding hard negatives to the memory queue or applying image-level mixup. In the following, we provide further discussion of the proposed FT, including: (1) What if we extend the memory queue instead of applying FT? (2) When should FT be added? (3) Dimension-level mixing rather than linear mixup. (4) Could the gains brought by FT vanish if training longer?
|method||α||acc (%)|
|moco+ original queue (K)||-||71.10|
|moco+ original queue (2K)||-||71.40|
|moco+ Neg FT queue||1.6||74.64|
|moco+ Neg FT+original||1.6||74.73|
Extending memory queue instead of FT: Previous works [14, 6] show that increasing the number of negative examples ($K$) in contrastive learning can benefit the final performance, so they use either a memory queue or a large batch size to obtain more negative examples. Specifically, [31, 17, 42] show that increasing $K$ improves the lower bound of the mutual information. The negative interpolation method can also be leveraged to enlarge the number of negative examples: we use the union of the original negatives and the interpolated queue, $Z_{neg} \cup \hat{Z}_{neg}$, which contains twice as many negative examples ($2K$) as $Z_{neg}$.
We compare the performance of using only the interpolated queue $\hat{Z}_{neg}$, the original queue with $K$ / $2K$ negative samples, and their combination $Z_{neg} \cup \hat{Z}_{neg}$ in Tab 5. We find that the combined queue (74.73%) shows negligible improvement over the interpolated queue alone (74.64%). We consider that the interpolated negative features already contain sufficiently diversified negatives compared with the original queue, so even doubling the negative samples (more mutual information) with the extended queue cannot boost the performance. Notably, the extended queue requires twice the computation for each contrastive loss. Thus we recommend feature transformation, which needs less computation and offers more efficacy, rather than feature augmentation.
|FT begin epoch||0||2||30||50||80||-|
|Res18 acc (%)||62.6||63.3||62.9||61.8||59.2||56.2|
|Res50 acc (%)||76.9||76.4||75.9||74.0||72.2||71.1|
When to add feature transformation? Here we analyze the efficacy of FT when it is started at various training stages. As shown in Tab 6, starting FT (pos extrapolation + neg interpolation) from various epochs consistently boosts the accuracy of the baseline, and starting earlier generally improves more (6.4/5.8 points with Res-18/Res-50 when starting at epoch 0). From the visualizations of the score distribution and gradient landscape in Fig 7, we can see that our FT brings hard positives (lowering pos scores in Fig 7(b)) and hard negatives (rising neg scores in Fig 7(a)) simultaneously when the combined FT is inserted at various stages. Besides, by comparing the gradient-norm landscapes, we observe that our FT brings a larger gradient during training, which helps the model escape from local minima and avoid over-fitting. These analyses indicate that our FT is a plug-and-play method and brings persistent view-invariance and discrimination to the training of contrastive models. More detailed discussions and visualizations are given in Supp D.
How about Dimension-level mixing: Besides the proposed linear feature interpolation and extrapolation at the feature level (128-d vector), we also extend the transformation to the dimension level, where the mixing parameter is a vector rather than a scalar. For two features $z_1$ and $z_2$ to be mixed, this dimension-level mixing can be described as follows (Equation 6):
$$\hat{z} = \boldsymbol{\lambda} \odot z_1 + (\mathbf{1} - \boldsymbol{\lambda}) \odot z_2 \tag{6}$$
where $\odot$ stands for the Hadamard product, and $\boldsymbol{\lambda}$ is a vector with the same dimension as the feature vector. The value of each dimension of $\boldsymbol{\lambda}$ is randomly sampled from a beta distribution. This formulation is used for negative interpolation; for positives, $1$ is added to $\boldsymbol{\lambda}$ to perform extrapolation. For neg/pos features, dimension-level mixing can introduce more diversity/more view variance (hardness), because every dimension is transformed with its own weight. Experiments with dimension-level mixing on IN-100 show an improvement over feature-level mixing (the 5th row in Tab 7).
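Dimension-level mixing differs from the scalar version only in the shape of the sampled weight. The sketch below is a hedged NumPy illustration: the function name `dimension_level_mix` and its `extrapolate` flag are hypothetical, and the inputs `a` and `b` stand for whichever two feature arrays are being mixed (a queue and its permutation for negatives, or the two positive features).

```python
import numpy as np

def dimension_level_mix(a, b, alpha, extrapolate=False, rng=None):
    """Mix a and b with an independent weight for every dimension.

    a, b:        arrays of identical shape (..., D).
    alpha:       beta-distribution parameter.
    extrapolate: if True, add 1 to the weights (positive extrapolation);
                 otherwise weights stay in (0, 1) (negative interpolation).
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha, size=a.shape)  # one weight per element
    if extrapolate:
        lam = lam + 1.0
    return lam * a + (1.0 - lam) * b            # element-wise (Hadamard) mixing
```

In interpolation mode every output coordinate lies between the corresponding coordinates of `a` and `b`; in extrapolation mode every coordinate moves away from `b`, past `a`.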
Could the gains brought by FT vanish if training longer? Simply training longer leads to a significant performance boost for contrastive pre-training. We therefore provide the results of MoCov2/MoCov2+FT (500 epochs) on IN-100: 80.7%/81.5%. Compared with the 200-epoch results (75.6%/78.3% in Tab 7), longer training reduces the improvement over the baseline from 2.7 to 0.8 points. More training epochs let the model compare many more pos/neg pairs, which itself increases diversity. Our proposed FT accelerates this process by providing diversity directly, resulting in faster convergence, which is in line with the motivation of learning diversified and discriminative representations.
In this section, we evaluate our Feature Transformation methods from four perspectives: (1) Ablation studies. (2) FT on various contrastive models. (3) Evaluating the representation on ImageNet-1k. (4) Finetuning on various downstream tasks. We keep the experiments fair, especially when comparing with other methods. Notice that the data augmentations follow those of the baseline methods. Details of experiments and datasets are given in Supp B.
We adopt the linear readout protocol to compare performance for image classification on IN-100, where we freeze the features and train a supervised linear classifier using softmax. Tab 7 summarizes the results of the ablation studies. We observe that the positive extrapolation and negative interpolation components are complementary, improving the top-1 accuracy by 5.77%/2.72% when combined on MoCoV1/MoCoV2. The dimension-level mix also shows an improvement on top of the already high performance of both components. The performance boost of the ablation studies over MoCo shows the efficacy of our FT. Notice that the transformed features are not necessarily on the unit sphere (i.e., they do not necessarily have a norm of 1). In these experiments we did not re-apply $\ell_2$ normalization to the transformed features, because the performance difference with and without post-normalization is negligible. More discussions about the vector length are given in Supp F. In general, however, we strongly recommend re-applying $\ell_2$ normalization to the transformed features on all datasets, so that all the scores are contrasted on the unit sphere.
We apply our FT to various contrastive models in Tab 7. Our FT brings improvements over MoCo, SimCLR, InfoMin, SwAV and SimSiam on the IN-100 dataset (200 epochs). It is worth pointing out that each variant of our FT in the ablation studies also boosts the SimCLR model. The experiments show that our FT is generic and robust for various contrastive models.
After ablations on the IN-100 dataset, we use the best settings of $\alpha_{ex}$ and $\alpha_{in}$ to train a model on ImageNet-1k (IN-1K). Note that the dimension-level mix is not used for the experiments on IN-1K due to computational constraints. We apply our method to the baselines MoCo and MoCoV2, which are both trained on IN-1K for 200 epochs. The results and comparison are summarized in Tab 8. Our method improves MoCoV1 and MoCoV2 by 1.3% and 2.1% in Top-1 accuracy respectively, which is significant on a large dataset like IN-1K. UnMix and MoCHi are methods that also leverage mixup to aid the contrastive learning process. Notably, our method with MoCoV2 provides a larger performance gain than both UnMix and MoCHi.
Fine-grained image classification: We evaluate the efficacy on real-world fine-grained classification datasets, e.g., the large-scale long-tailed iNaturalist2018, CUB-200 and FGVC-aircraft. As shown in Tab 8, our FT significantly boosts the transfer performance on iNat-18 for both MoCo and MoCo-V2. Besides, our FT brings consistent improvement on CUB-200 and FGVC-aircraft.
Object Detection: Recent works [47, 53, 54, 55, 59] have shown that the transfer accuracies of state-of-the-art (SOTA) models [4, 6, 42, 7, 14] on classification and detection are inconsistent and have low correlation, which is denoted as "task bias". One important reason is that the pretext tasks of SOTA models are specifically designed and optimized for classification, such as instance discrimination [51, 14] and clustering, leading to substantial enhancement on classification but only slight gains for detection. Therefore we evaluate our FT on detection/instance segmentation tasks. As summarized in Tab 9, our FT boosts the baseline model MoCo-V2 on various datasets and metrics. Our FT strongly improves the transfer accuracy on VOC and MSCOCO. Besides, our FT with MoCo-V2 achieves slightly better accuracy than contrastive models specifically designed for detection tasks, e.g., DetCo and InsLoc. Moreover, our FT achieves much better classification results than DetCo. Notice that our FT does not target local information in the pretext task design, but rather more invariance from feature transformation. These experiments indicate that our FT is less task-biased than the pretext-task-based contrastive models. The performance boosts suggest the efficacy and robustness of our proposed FT, which enables us to learn more "view-invariant" and discriminative representations.
In this work, we have developed a visualization tool to visualize the score distributions of positive and negative pairs. Leveraging this visualization tool, we can understand the inner workings of the contrastive learning process. More specifically, we make significant observations that inspire our novel Feature Transformation, including positive extrapolation, which creates more hard positives for training. Besides, we propose the interpolation among negatives, which makes full use of negatives and provides diversified negatives. The feature transformations enable the model to learn more view-invariant and discriminative representations. Experiments show that our proposed Feature Transformation can improve accuracy by at least 6.0% on ImageNet-100 over MoCo, and by about 2.0% on ImageNet-1K over the MoCoV2 baseline. Transferring to the downstream tasks successfully demonstrates that our model is less task-biased. In future work, we will explore more feature manipulation strategies with the help of our visualization tool.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2780–2789.
Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 297–304.
How does batch normalization help optimization? NeurIPS.
Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. Advances in Neural Information Processing Systems, pp. 1195–1204.
International Conference on Machine Learning, pp. 6438–6447.
What makes instance discrimination good for transfer learning? ICLR.
We choose three non-trivial statistics to visualize the score distribution: the mean of pos/neg scores (denoted as /, indicating the approximate average of the pos/neg pair distance) and the variance of negative scores (denoted as , indicating the fluctuation degree of the negative samples in the memory queue).
Without loss of generality, we randomly choose 64 samples in one batch to calculate the statistics and perform the visualization (we usually apply a batch size of 256 on 4-GPU servers; here we collect one batch on a single GPU for the statistics). (1) For the positive pair scores: as shown in Fig 8(a), we take the 64 query samples and their corresponding positive features, obtain one positive score per pair by inner product, and retain the mean of these positive scores. (2) For the negative pair scores: as shown in Fig 8(b), we take the K negative features in the memory queue, where K is the size of the memory queue (K is a large number for MoCo, while for SimCLR it is limited by the batch; we use K = 65,536 in all the MoCo experiments). Combining each query with the queue creates K negative pair scores in a set. Keeping all the negative scores is challenging (about 4TB of storage for the pair scores), so for each query we retain only the mean and variance of its score set to describe the distribution of its negative scores. More generally, we further average these means and variances over the queries to show the statistical characteristics of the negative samples. These statistics are recorded at each training step to track the score distribution throughout training.
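The statistics described above can be sketched in a few lines. This is a minimal illustration, not the released tool: the function names are hypothetical, features are plain Python lists rather than GPU tensors, and the use of the population variance is an assumption, since the text does not say which variance estimator is used.

```python
import statistics


def dot(u, v):
    """Inner product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def score_statistics(queries, positives, queue):
    """Summarize the pos/neg score distribution of one batch.

    queries, positives: B feature vectors each; positives[i] pairs with queries[i].
    queue: K negative feature vectors (the memory queue).
    Returns (mean pos score, mean of per-query neg means,
             mean of per-query neg variances).
    """
    pos_scores = [dot(q, k) for q, k in zip(queries, positives)]
    neg_means, neg_vars = [], []
    for q in queries:
        neg_scores = [dot(q, n) for n in queue]  # K scores, reduced immediately
        neg_means.append(statistics.fmean(neg_scores))
        # Population variance is an assumption of this sketch.
        neg_vars.append(statistics.pvariance(neg_scores))
    return (
        statistics.fmean(pos_scores),
        statistics.fmean(neg_means),
        statistics.fmean(neg_vars),
    )


# Toy batch: 2 queries, 4 queue entries, 2-D unit features.
queries = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.0], [0.0, 1.0]]
queue = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
pos_mean, neg_mean, neg_var = score_statistics(queries, positives, queue)
```

Only three numbers per training step are kept, which is why the recorded log stays small compared with storing every pair score.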
Our visualization is very practical. It is offline and barely affects the training speed. Instead of storing all K (65,536) pair scores, we save only their mean and variance to represent the score distribution. As a result, it takes only about 20MB of storage and 5 minutes of extra time for a 100-epoch training run with a batch size of 256. It remains feasible even with larger datasets and batch sizes.
The experiments are mainly implemented using the code from InfoMin (https://github.com/HobbitLong/PyContrast). The transfer experiments on object detection and instance segmentation are implemented using Detectron2 (https://github.com/facebookresearch/detectron2). We keep the experiments fair, especially when comparing with other methods. The code of our proposed methods and visualization tools will be made public.
For the experiments that combine our feature transformation module with other contrastive learning methods, we use the same image-level data augmentation strategies as the respective methods. Specifically, for our visualization experiments and other experiments using MoCo, we use the same data augmentation strategies as MoCo, which contain Random Resized Crop, Horizontal Flip, ColorJitter, and Random Gray Scale. For the experiments on MoCoV2 and SimCLR, the data augmentation strategies are the same and contain Random Resized Crop, Horizontal Flip, ColorJitter, Random Gray Scale, and Gaussian Blur.
For training: All the visualization experiments are carried out on ImageNet-100 with ResNet-18 for fast evaluation and parameter tuning. For the visualization experiments (including Table 1, Table 6 (2nd row), and Figures 1(a), 3, 4, 5, 7 in the paper, and Table 11 (2nd row), Table 14, and Figures 9, 10, 11 in the supplementary materials), we use a mini-batch size of 256 on 4 GPUs, with the number of negative examples set to 65,536 and an initial learning rate of 0.03. We use 64 samples to perform the visualizations. For the fast grid experiments, the model is trained for only 100 epochs, with the learning rate multiplied by 0.1 at epochs 60 and 80. We use SGD as the optimizer, with a weight decay of 0.0001 and a momentum of 0.9. For the various unit-sphere projection experiments, we train for 200 epochs to perform the visualization.
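The step schedule used in these runs can be written as a small function. This is a sketch under one assumption: the decay takes effect from the start of the milestone epoch (counting epochs from 0), which the text does not specify. The function name and defaults are illustrative.

```python
def step_lr(epoch, base_lr=0.03, milestones=(60, 80), gamma=0.1):
    """Learning rate at a 0-indexed epoch under multi-step decay.

    The rate is multiplied by `gamma` once for every milestone already reached.
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


# Learning rate for each of the 100 epochs of the fast grid experiments.
schedule = [step_lr(e) for e in range(100)]
```

The same function covers the 200-epoch ImageNet-100 runs with `milestones=(120, 160)`.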
For testing: We use the linear readout protocol to evaluate the trained representation: we fix the learned representation, train a supervised linear classifier on top of it, and report the single-crop top-1 accuracy on the validation set. We use an initial learning rate of 10 and a weight decay of 0. The classifier is trained for 100 epochs, with the learning rate multiplied by 0.1 at epochs 60 and 80.
For training: We use ResNet-50 for the ImageNet-100 implementations, and the momentum parameter is set to 0.99 for our experiments (including Tables 2, 3, 4, 5, 6 (2nd row), and 7 in the paper, and Tables 11 (1st row), 12, 15, and 16 in the supplementary materials). A mini-batch size of 256 is used on 8 GPUs, with the number of negative examples set to 65,536 and an initial learning rate of 0.03. The model is trained for 200 epochs, with the learning rate multiplied by 0.1 at epochs 120 and 160. We use SGD as the optimizer, with a weight decay of 0.0001 and a momentum of 0.9.
For testing: We use the linear readout protocol to evaluate the trained representation: we fix the learned representation, train a supervised linear classifier on top of it, and report the single-crop top-1 accuracy on the validation set. We use an initial learning rate of 10 and a weight decay of 0. Following prior work, the classifier is trained for 60 epochs, with the learning rate multiplied by 0.1 at epochs 30, 40, and 50.
For training: The momentum update parameter for the experiments on ImageNet-1K is set to 0.999, and the other parameters are the same as in the experiments on ImageNet-100. ResNet-50 is used as the encoder (including Table 8 in the paper and Table 13 in the supplementary materials). We observe that the best results for positive extrapolation and negative interpolation are achieved when their respective parameters are set to 1.6 and 2.0. Thus we use these values for the other experiments. Unless otherwise stated, the other hyper-parameters are the same as in MoCo and MoCoV2.
For testing: The same linear readout protocol is used, where the linear classifier is trained for 100 epochs with an initial learning rate of 30, multiplied by 0.1 at epochs 60 and 80.
In addition to object detection and instance segmentation tasks, we also provide a study of fine-grained classification. We choose three challenging fine-grained datasets for the experiments: iNaturalist 2018, CUB-200, and FGVC-aircraft. (1) The iNaturalist 2018 dataset has 437k images and 8142 classes. It is commonly used for fine-grained classification and long-tailed recognition, and several papers use it to evaluate the transfer performance of self-supervised representations. (2) The CUB-200 dataset contains 6033 images belonging to 200 bird species and is used for fine-grained classification. (3) The FGVC-aircraft dataset has 10,200 images of aircraft, with 100 images for each of 102 different aircraft model variants, most of which are airplanes. When transferring to these datasets, the pre-trained model is fine-tuned for 100 epochs with a learning rate of 5e-3 and cosine decay.
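The cosine decay used for fine-tuning can be sketched as follows. The text gives only the initial rate (5e-3) and the epoch count (100), so the final rate of 0, the per-epoch (rather than per-iteration) update, and the absence of warm-up are all assumptions of this sketch.

```python
import math


def cosine_lr(epoch, total_epochs=100, base_lr=5e-3, min_lr=0.0):
    """Cosine-annealed learning rate at a given epoch (0 <= epoch <= total_epochs)."""
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Rate at the start, the midpoint, and the end of fine-tuning.
lr_start, lr_mid, lr_end = cosine_lr(0), cosine_lr(50), cosine_lr(100)
```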
The main goal of self-supervised pre-training is to obtain representations that benefit downstream tasks. Following previous works, we use PASCAL VOC and COCO as our benchmarks for testing the transfer performance of the representation on object detection and instance segmentation tasks. For the PASCAL VOC dataset, we use the trainval07+12 split for fine-tuning and the test2007 split for evaluation. The image scale is set to [480, 800] pixels for training and 800 for testing. For the COCO dataset, we use the train2017 split (118k images) for fine-tuning and the val2017 split for evaluation. The image scale is set the same as for PASCAL VOC.
When transferring to detection tasks, feature normalization has been shown to be crucial during fine-tuning. Therefore, following prior work, the pre-trained backbone is fine-tuned with Synchronized BN (SyncBN), and SyncBN is also added to the FPN layers. We use Faster R-CNN with the R50-C4 architecture for object detection on the PASCAL VOC dataset. All layers of the model are fine-tuned for 24,000 iterations, with each batch consisting of 16 images. The initial learning rate is set to 0.02 and is multiplied by 0.1 at 18,000 and 22,000 iterations. The other hyper-parameters follow prior work.
We also test the transfer ability of the pre-trained model on the instance segmentation task of the MS COCO dataset. Following prior work, we use a Mask R-CNN R50-FPN pipeline. The batch size is set to 16 and the learning rate to 0.02. The model is trained with the 1x and 2x schedules. For the 1x schedule, the model is trained for 90,000 iterations on MS COCO, with the learning rate multiplied by 0.1 at 60,000 and 80,000 iterations. For the 2x schedule, we use 180,000 iterations, with the learning rate multiplied by 0.1 at 120,000 and 160,000 iterations. The transfer results of the 2x schedule are provided in Tab 10. The other hyper-parameters follow prior work.
We provide the details of the gradient landscape of various in Figure 4 of the paper. As shown in Fig 9, we plot the gradient norm of each layer of the encoder (ResNet-18) over the training process. The X axis indicates the layers of the encoder, the Y axis indicates the 100 training epochs, and the Z axis gives the value of the norm. We choose the norm of each layer (the total gradient norm of that layer) because the gradient norm clearly shows the smoothness of the gradient landscape. We can see that small and brings drastic volatility during training: the corresponding loss value and gradient fluctuate violently, resulting in bad convergence. As shown in Fig 9, the smooth and stable gradient landscape of (Fig 9(a)) becomes sharp and messy with the decrease of (Fig 9(b) for and Fig 9(c) for ). Therefore, to learn a better pre-trained model, we need to prepare negative pairs that maintain the stability and smoothness of the score distribution and the gradient during training. The gradient landscape can look spiky in two directions: (1) across the Y axis, which indicates the training epochs; (2) across the X axis, which represents the ResNet layers. The plot shows the gradients of all layers, including the BatchNorm layers, whose gradients are small, and the convolution layers, whose gradients are large, so the landscape appears spiky across the X axis. The spikiness along the X axis does not influence training, whereas the smoothness along the Y axis matters.
We analyze the efficacy of FT by starting it at various training stages. As shown in Tab 11, starting FT (pos extrapolation + neg interpolation) from various epochs boosts the accuracy of the baseline, and starting earlier improves more (/ boosts with Res-18/Res-50). It is worth noting that even adding FT at the 80th epoch brings an improvement over the MoCo baseline (no FT in training). From the visualizations of the score distribution in Fig 10, we can see that our FT brings hard positives (lowering the pos scores in Fig 10(d)) and hard negatives (raising the neg scores in Fig 10(c)) simultaneously when the combined FT is inserted at various stages. The combination of positive extrapolation and negative interpolation helps raise the neg scores during training. Besides, comparing the gradient ( norm) landscapes, we observe that our FT brings larger gradients to the training (adding FT at the 30th epoch in Fig 10(h) and at the 50th epoch in Fig 10(i)), which helps the model escape local minima and avoid over-fitting. These analyses indicate that our FT is a plug-and-play method that brings persistent view-invariance and discrimination to the training of contrastive models.
Because the memory queue is initialized with random vectors at the start of training, the positive and negative scores are confused, as shown in the visualizations of the early training stage (Fig 10(a) and Fig 10(b)). We provide the visualizations of the first 10 epochs to show the score distribution. (1) Adding FT from the 0th epoch brings negative pairs whose scores are very high (blue line in Fig 10(a); the negative score is too large for negative pairs), indicating that applying the feature transformations to random vectors hurts the pair score distribution. From the perspective of the gradient landscape in Fig 10(f), the initial gradient brought by FT is too sharp and not smooth enough for training compared with the baseline MoCo in Fig 10(e). (2) Adding FT from the 2nd epoch (by then, the memory queue is filled with semantic features from the training data rather than random vectors) relieves the overly high negative scores (orange line in Fig 10(a), normal negative score) and meanwhile lowers the positive score, turning easy positives into hard ones (orange line in Fig 10(b), decreasing positive score). The gradient (Fig 10(g)) is smoother and more stable compared with starting FT from the 0th epoch (Fig 10(f)). More importantly, in Tab 11, starting from the 2nd epoch (63.3% with ResNet-18) achieves slightly better accuracy than starting at the beginning (62.6%). However, in the final experiments on ImageNet-1K, we still start FT from the 0th epoch, because there is no obvious performance difference with the ResNet-50 backbone in Tab 11. Future work will focus more on this issue.
|FT begin epoch|0|2|30|50|80|-|
|Res18 acc (%)|62.6|63.3|62.9|61.8|59.2|56.2|
|Res50 acc (%)|76.9|76.4|75.9|74.0|72.2|71.1|
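The gains implied by the table can be worked out directly. The sketch below assumes that the last column ("-") is the MoCo baseline trained without FT, which is how the surrounding text describes it; the variable names are illustrative.

```python
begin_epochs = [0, 2, 30, 50, 80]
res18_acc = [62.6, 63.3, 62.9, 61.8, 59.2]
res50_acc = [76.9, 76.4, 75.9, 74.0, 72.2]
res18_base, res50_base = 56.2, 71.1  # "-" column, assumed to be the no-FT baseline


def gains(accs, base):
    """Accuracy gain over the baseline, in percentage points."""
    return [round(a - base, 1) for a in accs]


res18_gain = gains(res18_acc, res18_base)  # [6.4, 7.1, 6.7, 5.6, 3.0]
res50_gain = gains(res50_acc, res50_base)  # [5.8, 5.3, 4.8, 2.9, 1.1]
```

Under this reading, every starting epoch improves on the baseline, and the gain shrinks as FT is introduced later, except that ResNet-18 peaks when FT starts at the 2nd epoch.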
In this section, we discuss the details of how to apply our feature transformation to other self-supervised methods. We evaluate the performance of feature transformation on three representative methods, namely InfoMin, SwAV, and SimSiam.
For feature transformation on InfoMin, we perform both positive extrapolation and negative interpolation. Note that we perform the feature transformation on both branches of the InfoMin method, i.e., the original branch and the JigSaw branch. For feature transformation on SwAV, we only transform the two features of the input image by positive extrapolation; the rest of the SwAV pipeline is left unchanged. Since SimSiam only uses positive pairs for training, we only apply positive extrapolation as the feature transformation. All the other hyper-parameters are the same as in the original paper of each self-supervised method.
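Positive extrapolation, as applied to the positive pair in the methods above, can be sketched as follows. The exact formulation is not restated in this section, so the sketch assumes a symmetric linear extrapolation with a coefficient lam drawn as Beta(alpha, alpha) + 1, which lies in (1, 2); the function names, the symbol lam, and the value alpha = 2.0 in the example are illustrative assumptions.

```python
import random


def dot(u, v):
    """Inner product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def positive_extrapolation(q, k, alpha, rng=random):
    """Push a positive pair apart along the line through q and k.

    Assumed form: q' = lam*q + (1-lam)*k and k' = lam*k + (1-lam)*q,
    with lam = Beta(alpha, alpha) + 1, so lam lies in (1, 2).
    """
    lam = rng.betavariate(alpha, alpha) + 1.0
    q_new = [lam * a + (1.0 - lam) * b for a, b in zip(q, k)]
    k_new = [lam * b + (1.0 - lam) * a for a, b in zip(q, k)]
    return q_new, k_new, lam


# Two unit-norm views of the same image with original score 0.6.
q, k = [1.0, 0.0], [0.6, 0.8]
q_new, k_new, lam = positive_extrapolation(q, k, alpha=2.0, rng=random.Random(0))
old_score, new_score = dot(q, k), dot(q_new, k_new)
```

For unit-norm q and k with score s, expanding the inner product under this assumed form gives a new score of s - 2*lam*(lam - 1)*(1 - s), which is never larger than s when lam > 1. The transformed pair is therefore a harder positive.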
To demonstrate the effectiveness of our feature transformation methods (negative feature interpolation and positive feature extrapolation), we also provide experimental results on ImageNet-100 for applying our method to another classic contrastive learning model, SimCLR. Instead of using two encoders for the two views as in MoCo, SimCLR directly uses a single network to encode the two views and contrasts them against other negative examples. Because both MoCo and SimCLR are contrastive-based methods, the negative interpolation and positive extrapolation strategies can also be applied to SimCLR. We show the results of combining negative interpolation and positive extrapolation in Tab 12.
Here we complement the results by applying positive extrapolation to non-contrastive models [12, 6]. The models are pre-trained for 100 epochs on IN-1K with the same data augmentation settings as in the original papers. As shown in Table 13, the IN-1K results (100 epochs) of BYOL vs. BYOL+posFT (66.5% vs. 67.2%) and SimSiam vs. SimSiam+posFT (68.1% vs. 68.7%) indicate that positive extrapolation alone can help BYOL and SimSiam. Notice that we did not tune the extrapolation parameter (it is not the optimal one), so the improvement is slight.
Here we provide additional visualization and analysis of the regular Feature Transformation (feature normalization, i.e., L2 normalization), because of its significant constriction (unit-sphere projection), and of whether to add normalization after our proposed FT.
Unit-sphere projection (L2 normalization) constricts the feature vector length from unbounded to 1 while retaining the vector direction. Thus the pair scores are limited to [-1, 1]. A recent paper concludes, from the properties of the loss gradient, that unit-sphere projection plays a key role in ensuring large gradients for hard positives and negatives. Without unit-sphere projection, however, the feature vector length is no longer constricted to 1, and the overly large score distribution leads to bad contrastive learning and poor transfer performance ( v.s ). As shown in our empirical study (Fig 11 and Table 14) of this significant FT, the mean of the positive pair scores is similar to the mean of the negative pair scores when we remove unit-sphere projection, which leads to a poor contrastive learning process: the pos/neg pairs are confused, and the overly large score distribution produces a bad gradient landscape. Meanwhile, with the constriction of unit-sphere projection, the means of the pos/neg pair scores are as expected: and , which can be discriminated by the log-softmax loss function. This limited, small score distribution benefits the later contrastive learning and brings a stable training process. Finally, the variance of the negative pair scores shows that the model with unit-sphere projection provides less volatile negative pairs, which is better for contrastive learning.
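The effect of unit-sphere projection on the score range can be checked with a small example. This is a minimal sketch with hypothetical helper names; the `eps` guard against zero vectors is an addition of the sketch, not something stated in the text.

```python
import math


def l2_normalize(v, eps=1e-12):
    """Project a feature vector onto the unit sphere, keeping its direction."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / max(norm, eps) for x in v]


def score(u, v):
    """Pair score as the inner product of two feature vectors."""
    return sum(a * b for a, b in zip(u, v))


raw_q, raw_k = [3.0, 4.0], [30.0, -10.0]
raw_score = score(raw_q, raw_k)  # 50.0: grows with the vector lengths

q, k = l2_normalize(raw_q), l2_normalize(raw_k)
unit_score = score(q, k)  # cosine similarity, always inside [-1, 1]
```

After projection the score depends only on the angle between the two features, so positive and negative pairs are compared on the same bounded scale.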
|MoCo w/ unit sphere proj|
|MoCo w/o unit sphere proj|
In this section, we provide empirical studies on whether to re-apply the normalization to the transformed features after FT. As shown in Tab 15, the performance difference between the models with and without re-applying the normalization is negligible ( v.s post-norm for negative interpolation, v.s post-norm for positive extrapolation, v.s post-norm for combined FT). So we conclude that the transformed features do not necessarily have to lie on the unit sphere (i.e., have a norm of 1). In the final experiments on ImageNet-1K, we do not re-apply the normalization after the feature transformations. Nevertheless, we strongly recommend re-applying the normalization to the transformed features on all datasets, so that all the scores are contrasted on the unit sphere.
In this section, we provide more discussion of feature manipulation for the negative examples. We have discussed negative interpolation, which fully utilizes the negative features and increases the diversity of the memory queue. Here we examine negative extrapolation in the memory queue and the creation of hard negatives.
|Method (MoCo v1)|Beta parameter|Acc (%)|
We have explored negative interpolation to fully utilize the negative features and increase the diversity of the memory queue. What about negative extrapolation in the memory queue? Will the extrapolated negatives still increase the diversity of the memory queue and improve the performance?
Specifically, we denote the negative memory queue of MoCo as where K is the size of the memory queue, and as the random permutation of . We propose to use a simple extrapolation between the two memory queues to create a new queue :
where is in the range of . The transformed memory queue provides fresh extrapolated negatives for the contrastive loss, iteration by iteration. As shown in Tab 16, negative extrapolation brings a slight improvement over the baseline ( v.s. , improved), while negative interpolation significantly improves to . Both negative interpolation and extrapolation can increase the diversity of the memory queue, so why can extrapolation not boost the performance as much? We conjecture that the original queue provides a discrete distribution of negative samples, and that random interpolation fills in the missing sample points of this distribution, leading to a more discriminative model. The extrapolated sample points, in contrast, may not stay on the previous manifold/distribution. Future work will focus more on this discussion.
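The two queue transformations compared here can be sketched with one function. The sketch assumes a linear mix of the queue with a random permutation of itself, new = lam * queue + (1 - lam) * permuted, where lam in (0, 1) gives interpolation and lam > 1 gives extrapolation; the function name, the symbol lam, and the example values 0.5 and 1.5 are illustrative assumptions.

```python
import math
import random


def mix_queue(queue, lam, rng=random):
    """Mix every queue entry with a randomly chosen partner from the same queue.

    Assumed form: new[i] = lam * queue[i] + (1 - lam) * queue[perm[i]].
    lam in (0, 1) interpolates between the pair; lam > 1 extrapolates past queue[i].
    """
    perm = list(range(len(queue)))
    rng.shuffle(perm)
    return [
        [lam * a + (1.0 - lam) * b for a, b in zip(queue[i], queue[j])]
        for i, j in enumerate(perm)
    ]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


queue = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.6, 0.8]]
rng = random.Random(0)
interpolated = mix_queue(queue, lam=0.5, rng=rng)  # stays inside the unit ball
extrapolated = mix_queue(queue, lam=1.5, rng=rng)  # may leave the unit ball
```

With unit-norm queue entries, an interpolated entry is a convex combination and can never have a norm above 1, while an extrapolated entry can. This is consistent with the conjecture that extrapolated points may fall outside the distribution covered by the original queue.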
Negative interpolation and extrapolation are both performed in the memory queue to increase its diversity. Another feature transformation for negative features is to increase their hardness during training, in the manner of positive extrapolation. Our goal is to increase the scores (similarity) of easy negative pairs to create hard negative pairs during training, which could be beneficial for the final transfer performance. Specifically, we use interpolation between and all the negatives in the memory queue to create a hard negative queue .
This equation indicates that each negative sample in the memory queue is interpolated with to create the hard negative queue , where is in the range of . By this transformation, we can guarantee that the transformed neg score is larger than the original pos score , namely , which means we create a hard negative queue. Intuitively, it can be seen as a simple approach to draw and closer in feature space. After interpolation, the distance between the pos/neg feature vectors is lowered. Therefore, this interpolation can serve as a feature transformation that creates hard negatives from easy ones. As shown in Fig 12, it brings a minor direction change to the positive/negative vectors. As shown in Tab 16, our hard negatives bring consistent boosts over the baseline ( v.s. (, improved), indicating that these hard negatives are effective for contrastive learning. Future work will focus more on this topic. However, we choose negative interpolation rather than the hard negative strategy in the final experiments on IN-1K, because the computational cost of the hard negative strategy is too large (each needs a new hard negative queue, so it takes time for one large batch to produce the hard negative queues).
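A minimal sketch of this hard-negative construction, assuming each negative is interpolated with the query feature as n' = lam * n + (1 - lam) * q with lam in (0, 1). That the interpolation partner is the query, as well as the function names and the value lam = 0.7, are assumptions of the sketch.

```python
def dot(u, v):
    """Inner product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def harden_negatives(query, queue, lam):
    """Pull every negative toward the query.

    Assumed form: n' = lam * n + (1 - lam) * query, with lam in (0, 1).
    Returns a new queue that is specific to this one query.
    """
    return [
        [lam * n_i + (1.0 - lam) * q_i for n_i, q_i in zip(n, query)]
        for n in queue
    ]


query = [1.0, 0.0]
queue = [[0.0, 1.0], [-1.0, 0.0], [0.6, 0.8]]
hard_queue = harden_negatives(query, queue, lam=0.7)

old_scores = [dot(query, n) for n in queue]
new_scores = [dot(query, n) for n in hard_queue]
```

For a unit-norm query, each new score equals lam * old + (1 - lam), so under this assumed form it is never below the original negative score. Because the result depends on the query, a batch of B queries needs B separate queues of K entries each, which illustrates the computational cost mentioned above.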