GPT[radford2018improving] and BERT[devlin2018bert] are two representative works in self-supervised learning (SSL) that use transformers[vaswani2017attention]
for the task of natural language processing (NLP). Motivated by these successes, various efforts on self-supervised representation learning [oord2018representation, hjelm2018learning, bachman2019learning] have been made in the vision domain as well, many of which follow the recent paradigm of instance discrimination that matches representations of different views of the same image under different augmentations [chen2020simple, he2020momentum, grill2020bootstrap, caron2020unsupervised]. Recent self-supervised frameworks have focused on transformer-based models such as ViT [dosovitskiy2020image], which demonstrated performance superior to the conventional ResNet [he2016deep] architectures. MoCo v3 [chen2021empirical] and DINO [caron2021emerging] achieved state-of-the-art performance using ViT in self-supervised learning. MoCo v3 investigated the learning instability of ViT and tackled it to enhance performance, while DINO exploited the characteristics of ViT and proposed a unique MLP head to improve representation learning.
In this work, we propose Self-Distilled Self-Supervised representation Learning (SDSSL), a simple method utilizing knowledge distillation [hinton2015distilling] in SSL to learn more useful representations for downstream tasks. Our work is motivated by the recent interpretation of SSL from the perspective of mutual information (MI) maximization. Prior works [he2020momentum, chen2020simple, sordoni2021decomposed] have focused on maximizing the MI between output representations $f(x_1)$ and $f(x_2)$ from differently augmented input images $x_1$ and $x_2$, like (a) and (b) in Fig. 1 (in this paper, $f^l(x)$ denotes the output feature of a network $f$ with $l$ layers from an input $x$). This is realized by using a contrastive loss (e.g., InfoNCE), in which $f(x_1)$ and $f(x_2)$ are trained to be closer while other samples are trained to be further apart, as the loss sets a lower bound for the MI [oord2018representation]. In our work, rather than focusing on the lower bound, we focus on the upper bound of the MI. Suppose the encoder $f$ has $L$ layers. By the data processing inequality, the MI between the output representations is lower than the MI between the output of an intermediate layer, $f^l(x_1)$, and the output of the last layer, $f^L(x_2)$, i.e., $I(f^L(x_1); f^L(x_2)) \le I(f^l(x_1); f^L(x_2))$. Since existing frameworks only attempt to maximize $I(f^L(x_1); f^L(x_2))$, $I(f^l(x_1); f^L(x_2))$ is not controlled explicitly and may limit $I(f^L(x_1); f^L(x_2))$ from being maximized (Fig. 1(c)). On the other hand, if we can make the upper bound, $I(f^l(x_1); f^L(x_2))$, larger, this may in turn render the optimization of $I(f^L(x_1); f^L(x_2))$ easier (Fig. 1(d)). We discuss this further in Sec. 3.2.
Because our method operates in an orthogonal manner to the baseline SSL method used, we can simply apply our method to other existing works. In this work, we apply our method to three representative SSL frameworks, namely SimCLR[chen2020simple], BYOL[grill2020bootstrap], and MoCo v3[chen2021empirical], using ViT[dosovitskiy2020image]
as the backbone and show that our method improves upon the already competitive baseline. We demonstrate the effectiveness of SDSSL on ImageNet via k-nearest neighbor (k-NN) and linear evaluation. The superiority of SDSSL is also shown in various practical tasks such as copy detection, video segmentation and image retrieval. We also investigate representations found by SDSSL using recently proposed metrics[wang2020understanding] and discover SDSSL uses a broader range of the representation space than the baselines, which may help explain why SDSSL performs particularly well in fine-grained datasets. Finally, similar to [phuong2019distillation, zhang2019your], by allowing the intermediate layers to explicitly learn the pretext task, we show that even the intermediate features outperform the baseline counterparts.
Overall, we propose a self-distillation method that lets the intermediate layers explicitly learn to discriminate instances. We show that our method, when applied on top of the conventional SSL frameworks such as SimCLR, BYOL, and MoCo v3, improves upon the corresponding baseline on various datasets and tasks. Through an ablation study, we also demonstrate that naively applying our self-distillation method leads to performance degradation and show how our approach overcomes these potential pitfalls.
We propose a self-distillation method that is broadly applicable to existing self-supervised frameworks orthogonal to other techniques.
We empirically demonstrate SDSSL outperforms SimCLR, BYOL, and MoCo v3 on various tasks and datasets. In particular, SDSSL performs better at lower layers than its counterparts.
2 Related Work
Self supervised learning
Driven by the capability of deep neural networks (DNNs), various fields including computer vision, natural language processing, and speech processing have advanced rapidly. In general, DNNs need large-scale datasets to avoid overfitting caused by their massive number of parameters. However, collecting and annotating large-scale datasets is exceedingly time-consuming and expensive. To alleviate these issues, self-supervised methods have been widely studied. DIM [hjelm2018learning] maximizes mutual information between input and output. AMDIM [bachman2019learning] creates multiple views and maximizes mutual information between input and output across the different views. CPC [oord2018representation] trains representations of sequential data with a contrastive method and proves that the InfoNCE loss maximizes a lower bound of the mutual information between inputs and representations. Although SimCLR and MoCo [chen2020simple, he2020momentum] improve performance via contrastive learning between different views, they need a large batch size or an additional memory bank. BYOL [grill2020bootstrap] uses only positive samples by mimicking the output of a moving-average network and shows a significant improvement in performance.
Meanwhile, with the advent of the transformer, ViT [dosovitskiy2020image] has been proposed in the vision domain, and self-supervised ViTs such as MoCo v3 and DINO [chen2021empirical, caron2021emerging] have been studied, which outperform and have many advantages compared to CNN-based SSL. While SSL shows promising results in many tasks, sufficient analysis of how it works has not been made. ReLIC [mitrovic2020representation] introduces a causal mechanism to explain SSL, and [wang2020understanding] analyzes SSL using alignment and uniformity. In this work, we also provide an analysis in Sec. 4.3 using both alignment and uniformity to show the effect of SDSSL.
Knowledge distillation Knowledge distillation (KD) is one of the regularization methods widely used to improve performance [hinton2015distilling, NEURIPS2018_6d9cb7de, he2019knowledge]. The conventional offline KD framework utilizes a pre-trained teacher network and a student network with supervised labels to improve the performance of the student network. In contrast, online KD methods do not require any pre-trained teacher network [zhang2018deep, chung2020feature]. In the training phase, a student network and a teacher network are trained concurrently, distilling information from each network. Recently, many self-distillation works that only require the student network have been studied [radosavovic2018data, rebuffi2017icarl], where a model is trained with the knowledge from the previously trained model. Multi-exit [phuong2019distillation]
proposed a learning framework based on distillation for multi-exit architectures. This method encourages lower layers to mimic higher layers by matching their output probabilities.
In terms of training the network only with knowledge generated from multiple views, SDSSL can be categorized as an online self-distillation method. However, unlike the aforementioned methods, SDSSL does not require supervised labels, but instead utilizes unlabeled data through self-supervised learning.
SimCLR creates two views, $x_1$ and $x_2$, of an input image $x$ (a positive pair) by performing random augmentations. After obtaining representations of $x_1$ and $x_2$ using a backbone network, they are projected, and the network is trained with a contrastive loss that increases the cosine similarity between positive samples while decreasing the cosine similarity with negative samples (other images in the batch) using the softmax function [bridle1990training].
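The contrastive objective described above can be made concrete with a short sketch. The following is a minimal NumPy illustration of a one-directional InfoNCE/NT-Xent loss; it is not the authors' implementation, and all names (`nt_xent_pairwise`, `tau`) are illustrative.

```python
import numpy as np

def nt_xent_pairwise(z1, z2, tau=0.2):
    """One-directional InfoNCE/NT-Xent loss sketch.

    z1, z2: (N, D) projected features of two augmented views; row i of z1
    and row i of z2 form a positive pair, all other rows act as negatives.
    """
    # cosine similarity via inner products of L2-normalised rows
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau                      # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # positives sit on the diagonal; minimise their negative log-probability
    return -np.mean(np.diag(log_prob))
```

As expected of a contrastive loss, perfectly matched views yield a lower loss than mismatched ones.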
MoCo v3 learns via a contrastive loss like SimCLR, but instead of using an identical network to generate features for $x_1$ and $x_2$, a teacher network with exponential-moving-average (EMA) parameters is used. Randomly augmented $x_1$ and $x_2$ are forwarded to the student network and the teacher network, respectively, and then projected. The projected output of the student network is further processed through an additional MLP head (the predictor) to perform contrastive learning.
BYOL also has an EMA teacher and a predictor like MoCo v3, but learns by simply increasing the cosine similarity of positive samples without using the contrastive loss. Therefore, unlike the aforementioned SSL frameworks that utilize negative samples, the performance is robust to the batch size.
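The BYOL-style positive-only objective and the EMA teacher update described above might be sketched as follows. This is a minimal NumPy illustration under the description in the text, with hypothetical function names; real implementations operate on network parameters and gradients.

```python
import numpy as np

def byol_loss(p, z_t):
    """Negative cosine similarity between the student prediction p and the
    (stop-gradient) teacher projection z_t; both are (N, D) arrays."""
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    z_t = z_t / np.linalg.norm(z_t, axis=1, keepdims=True)
    return -np.mean(np.sum(p * z_t, axis=1))

def ema_update(teacher, student, m=0.996):
    """Teacher parameters follow an exponential moving average of the student."""
    return {k: m * teacher[k] + (1.0 - m) * student[k] for k in teacher}
```

Because only positive pairs enter the loss, batch composition does not affect the objective, which is consistent with BYOL's robustness to batch size.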
In this section, we first explain some existing analyses of how SSL learns meaningful features from prior works. Then, we introduce the motivation of our method. Mitrovic et al. [mitrovic2020representation] introduced a new perspective to analyze SSL using a causal mechanism. They assume an input image $x$ is generated from two independent variables, the content variable $C$ and the style variable $S$. They also propose that the content variable is relevant for the unknown downstream tasks, which implies that $C$ is a good representation of the image. Because random augmentation intervenes only on the style variable but preserves the content variable, the conditional distribution of the pretext task target $Y$ given the content $C$, $p(Y \mid C)$, is invariant under the augmentations. The goal of [mitrovic2020representation] is to model $p(Y \mid C)$ such that it remains invariant to stochastic augmentations of $x$, which leads to training a neural network to extract the content $C$ from a set of augmented inputs $x_i$'s.
Meanwhile, earlier works in SSL have tried to explain the mechanism of SSL from the perspective of maximizing the mutual information of two randomly augmented images. In other words, they train an encoder to extract the same representations from the positive samples. Given a set of learnable neural networks $\mathcal{F}$, this objective can be written as

$$\max_{f \in \mathcal{F}} \; I\big(f(x_1); f(x_2)\big). \qquad (1)$$
As shown in Fig. 2, adopting the definition of the content variable $C$ from [mitrovic2020representation] (the common invariant information of randomly augmented images), the objective function is bounded by the amount of information in $C$, which is the entropy of $C$: $I(f(x_1); f(x_2)) \le H(C)$. Hence, maximizing the lower bound in Eq. 1 causes $f(x)$ to move towards $C$ as shown in Fig. 1(a) and (b). In other words, the encoder is trained to extract the content $C$, which is useful for various downstream tasks, from $x_1$ and $x_2$.
Now let us use the shorthand notation $f^l(x)$ to denote the output of the $l$-th layer for the input $x$. Because the mapping from an intermediate layer, $f^l$, to the final layer, $f^L$, is deterministic, and assuming discrete inputs, it holds that $H(f^L(x) \mid f^l(x)) = 0$, and the information in $f^L(x)$ is completely enclosed by that in $f^l(x)$ as shown in Fig. 1(c). This in turn induces an upper bound on the mutual information between $f^L(x_1)$ and $f^L(x_2)$ as

$$I\big(f^L(x_1); f^L(x_2)\big) \le I\big(f^l(x_1); f^L(x_2)\big) \le I(x_1; x_2) = H(C), \qquad (2)$$

where the second inequality is from the same analogy and the last equality is from the definition of $C$, which states $I(x_1; x_2) = H(C)$.
This shows that the upper bound for the mutual information between the two outputs from different augmentations depends on the output of the previous layer $f^l$. While existing self-supervised frameworks train $f^L(x)$ explicitly, $f^l(x)$ is updated only implicitly via the loss signal at the final layer. Due to this, adding an explicit signal for $f^l(x)$ may be favorable to the information extraction process. Based on this hypothesis, we provide an additional loss signal by training $f^l(x_1)$ to move towards $f^L(x_2)$ explicitly, as shown in Fig. 1(d). If this leads to an $f^l$ that can extract more of $C$ from $x$, such representations will perform better on unknown downstream tasks than those obtained from only optimizing Eq. 1.
We propose Self-Distilled Self-Supervised Representation Learning (SDSSL), which provides an explicit signal to the intermediate representations by inducing them to mimic the output representation, as illustrated in Fig. 3. Our method can be applied to any existing SSL framework that matches representations from multiple views.
Self-Supervised Learning (SSL) SSL enables the training of a model using only manipulation techniques over the input. The SSL frameworks used as baselines in this paper are SimCLR, MoCo v3, and BYOL, whose objective functions differ. Following common practice, let $z$ denote the output of the student's last MLP head (projector or predictor) and $z'$ denote the output of the teacher's (student's, in SimCLR) projector. Then the objective for BYOL is

$$\mathcal{L}_{ssl} = -\langle z, z'_+ \rangle, \qquad (3)$$

while for the two contrastive methods,

$$\mathcal{L}_{ssl} = -\log \frac{\exp(\langle z, z'_+ \rangle / \tau)}{\exp(\langle z, z'_+ \rangle / \tau) + \sum_{z'_-} \exp(\langle z, z'_- \rangle / \tau)}, \qquad (4)$$

where $\langle \cdot, \cdot \rangle$ denotes the inner product between $\ell_2$-normalized vectors, $\tau$ is a temperature parameter, and $z'_+ / z'_-$ are for the positive/negative samples.
Self-Distilled SSL We define our intermediate self-distillation loss $\mathcal{L}_{sd}^{l}$, which tries to maximize the mutual information between the output of an intermediate layer $f^l(x_1)$ and $f^L(x_2)$ ($l < L$), as the following:

$$\mathcal{L}_{sd}^{l} = \mathcal{L}_{ssl}\big(z^l,\, \mathrm{sg}(z')\big), \qquad (5)$$

where $z^l$ is the representation of the $l$-th layer of the student encoder passed through the MLP heads corresponding to each layer, and $z'$ is the output of the teacher MLP head. The stop-gradient operator, $\mathrm{sg}(\cdot)$, implies that the gradient is not propagated through $z'$, so that $z^l$ only learns to predict $z'$ without affecting it. The objective of SDSSL consists of $\mathcal{L}_{ssl}$ and $\mathcal{L}_{sd} = \sum_{l < L} \mathcal{L}_{sd}^{l}$, resulting in:

$$\mathcal{L} = \mathcal{L}_{ssl} + \lambda \mathcal{L}_{sd}, \qquad (6)$$

where the choice of $\lambda$, which controls the weight of the self-distillation loss, is detailed in Sec. 4.1.
We observe that for frameworks where predictors exist, simply using Eq. 6 leads to some performance improvement, but it can be further enhanced. This is because the predictors of the intermediate layers are only updated using gradients from $\mathcal{L}_{sd}$, as opposed to the encoder, which is able to utilize both $\mathcal{L}_{ssl}$ and $\mathcal{L}_{sd}$. Consequently, the optimality of the predictors at intermediate layers is not guaranteed, which is a key component of SSL training as discussed by [grill2020bootstrap]. Simply enlarging $\lambda$ causes the last predictor to be sub-optimal, because this updates the intermediate backbone layers as well. To alleviate this issue, we employ another loss $\mathcal{L}_{pred}$:

$$\mathcal{L}_{pred} = \sum_{l} \mathcal{L}_{ssl}\big(q^l(\mathrm{sg}(h^l)),\, \mathrm{sg}(z')\big), \qquad (7)$$

where $h^l$ is the representation of the $l$-th layer of the student after passing through the projector, and $q^l$ is the corresponding predictor. To only update the predictors, the $\mathrm{sg}(\cdot)$ operator is applied to $h^l$. By doing so, we attain better predictors, and the final loss for SSL frameworks with predictors is

$$\mathcal{L} = \mathcal{L}_{ssl} + \lambda\big(\mathcal{L}_{sd} + \mathcal{L}_{pred}\big). \qquad (8)$$
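Assuming a BYOL-style similarity term, the composition of the SSL, self-distillation, and predictor losses might be sketched as follows. This is an illustrative NumPy sketch, not the authors' code; the stop-gradient is implicit here because plain arrays carry no gradients, and all argument names are hypothetical.

```python
import numpy as np

def cosine_sim(a, b):
    """Mean cosine similarity between corresponding rows of (N, D) arrays."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.mean(np.sum(a * b, axis=1))

def sdssl_loss(z_last, z_teacher, z_inter, p_inter, lam):
    """Compose the SDSSL objective from its three parts.

    z_last:    student's final head output
    z_teacher: teacher target (treated as a constant, i.e. stop-gradient)
    z_inter:   list of intermediate-layer head outputs (one per distilled layer)
    p_inter:   list of intermediate predictor outputs, trained against the
               (stop-gradient) intermediate targets
    """
    l_ssl = -cosine_sim(z_last, z_teacher)                       # base SSL loss
    l_sd = -np.mean([cosine_sim(z_l, z_teacher) for z_l in z_inter])
    l_pred = -np.mean([cosine_sim(p_l, z_l)
                       for p_l, z_l in zip(p_inter, z_inter)])   # predictor loss
    return l_ssl + lam * (l_sd + l_pred)
```

With perfectly matched features, each similarity term saturates at 1, so the loss reduces to $-1 - 2\lambda$, which matches the weighting in the final objective.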
In this section we describe the details of our implementation. We follow the implementation of MoCo v3 [chen2021empirical] unless otherwise noted. We show that SDSSL outperforms the baselines in various downstream tasks including ImageNet. Furthermore, the ablation study demonstrates the efficacy of each factor of SDSSL.
4.1 Implementation Details
ViT Architecture We adopt the 2-D sine-cosine variant [vaswani2017attention] for positional embedding and freeze the randomly initialized patch projector. We concatenate the patch embeddings with a learnable [CLS] token and add its positional embedding. The representations are the outputs of the [CLS] token after passing through each transformer block and the layer normalization layer [ba2016layer].
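The 2-D sine-cosine positional embedding mentioned above can be sketched as follows; this shows one common construction (half of the channels encode the row index, the other half the column index) and is an assumption rather than the authors' exact code.

```python
import numpy as np

def sincos_posemb_1d(positions, dim):
    """Standard 1-D sine-cosine embedding for a vector of positions."""
    omega = 1.0 / (10000 ** (np.arange(dim // 2) / (dim // 2)))
    out = positions[:, None] * omega[None, :]     # (P, dim // 2)
    return np.concatenate([np.sin(out), np.cos(out)], axis=1)

def sincos_posemb_2d(h, w, dim):
    """2-D variant: half the channels encode the row, half the column index."""
    assert dim % 4 == 0
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    emb_y = sincos_posemb_1d(ys.reshape(-1).astype(float), dim // 2)
    emb_x = sincos_posemb_1d(xs.reshape(-1).astype(float), dim // 2)
    return np.concatenate([emb_y, emb_x], axis=1)  # (h * w, dim)
```

Because the embedding is a fixed function of position, it needs no training, consistent with freezing it alongside the patch projector.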
MLP Heads Following [chen2020simple, grill2020bootstrap]
, projectors are 3-layer MLPs and predictors are 2-layer MLPs. Batch normalization [ioffe2015batch] is applied to all output layers except in BYOL, and to the hidden layers for all methods. The dimension of the hidden layer is 4096 for the last projector and all predictors, but 2048 for intermediate projectors. All outputs have dimension 256. For frameworks using an exponential moving average (EMA) teacher, the teacher's projector is updated from the student's projector via EMA. This is done in SDSSL as well, using only the last projector.
Hyper-parameters We use AdamW [loshchilov2017decoupled] as the optimizer, with a batch size of 1024 for ViT-B/16 and 4096 for ViT-S/32. The learning rate is 1.5e-4 for MoCo v3 and BYOL, and 1.3e-4 for SimCLR. We adopt learning rate warmup for 40 epochs and cosine decay after warmup [goyal2017accurate]. Weight decay is 0.1. For $\lambda$, cosine scheduling [loshchilov2016sgdr] is performed from 0 to 0.8 for ViT-B/16 and from 0 to 0.6 for ViT-S/32.
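One way to realize the cosine schedule for the self-distillation weight is the half-cosine ramp below; this is a sketch of a common scheduling formula, not necessarily the authors' exact schedule.

```python
import math

def lambda_schedule(step, total_steps, lam_max=0.8):
    """Cosine ramp of the self-distillation weight from 0 up to lam_max."""
    return lam_max * (1.0 - math.cos(math.pi * step / total_steps)) / 2.0
```

The weight starts at 0, reaches half of `lam_max` at the midpoint, and ends at `lam_max`, so self-distillation only becomes dominant after the teacher has trained for a while.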
4.2 Main Results
ImageNet Pretraining We experiment with ViT-B/16 and ViT-S/32 on three self-supervised learning frameworks. In Table 1 we validate the representations of the ImageNet [deng2009imagenet] pretrained encoders using k-NN [wu2018unsupervised] and linear evaluation. We follow the protocol of MoCo v3 for linear evaluation and of DINO for k-NN. Across all frameworks, models, and evaluations, applying SDSSL increases performance. The baseline accuracies are lower than those reported in the MoCo v3 paper [chen2021empirical] because we use a batch size of 1024 instead of 4096 due to computation constraints; contrastive frameworks are particularly affected by the batch size. Nevertheless, our method significantly improves upon our reproduced baselines. For ViT-S/32, linear evaluation performance improved more than k-NN, whereas for ViT-B/16, the opposite was true. We used 8 NVIDIA A100 GPUs for five days to train our ViT-B/16 models and 4 NVIDIA A6000 GPUs for three days to train the ViT-S/32 models.
ViT Capacity Here we analyze the effect of model capacity on the performance. When SDSSL is applied to ViT-T/32, ViT-S/32, and ViT-B/32 and compared with the MoCo v3 baseline, the performance gain increases as the model capacity increases, as shown in Table 2. In addition, the gain due to SDSSL is slightly higher when the number of parameters is the same but a smaller patch size is used.
Multi-exit Since self-distillation enables lower layers to learn from the higher layers, we expect the lower layers of SDSSL to learn more meaningful representations than those of the baselines. This is verified in Figure 4, which shows that the lower-layer representations of SDSSL are much more suitable as features than their vanilla MoCo v3 counterparts. We performed linear evaluation on ImageNet using the frozen representations of each layer. In the last layer, accuracy increased by 3%p, and the 7th layer showed the largest performance gap of 25.5%p.
In this subsection, we evaluate the transferability of our method on various downstream tasks. Following DINO [caron2021emerging], we evaluate on the image retrieval task. In addition, we also evaluate on the copy-detection task and the video segmentation task, which uses features of patches
rather than the [CLS] token. The three evaluation protocols do not require additional training of the encoder. Then, we evaluate on other image classification datasets such as CIFAR-10, CIFAR-100 [krizhevsky2009learning], Oxford Flowers-102 [nilsback2008automated], and Oxford-IIIT Pets [parkhi2012cats] via k-NN evaluation and end-to-end fine-tuning [dosovitskiy2020image]. Experiments are performed using all three frameworks.
Copy-detection We report the mean average precision (mAP) of copy detection on the strong subset of the INRIA Copydays dataset [douze2009evaluation]. The goal of copy detection is to recognize the original image when given a distorted (e.g., blurred, inserted, printed, scanned) version of it. Following [berman2019multigrain], we use 10K samples of the YFCC100M dataset [thomee2016yfcc100m] as distractors, while 20K samples are used for whitening [berman2019multigrain] the features. The features of the [CLS] token and the patch tokens are pooled using GeM [radenovic2018fine] and concatenated. We use features of all layers to verify whether a similar trend occurs as in the multi-exit experiment. We observe in Figure 5 that most SD-MoCo v3 intermediate features surpass those of MoCo v3, and that the best-performing layer of SD-MoCo v3 outperforms that of MoCo v3 (the 9th and 11th layer, respectively). We believe that for some tasks that utilize the features of the patches rather than only the [CLS] token, the best-performing features are not formed in the final layer. Moreover, for SDSSL the best-performing layer tends to form at a lower layer than in the baseline. For SD-SimCLR and SimCLR, the best-performing layers are the 11th and 10th, respectively; for SD-BYOL and BYOL, the 12th and 8th, respectively. This may be explained by our motivation, as our method intends to extract more information about the content rather than the style of an image in the lower layers. By providing an explicit loss, our method forms features suitable for copy detection earlier in the layers than the baseline.
Video segmentation We perform video instance segmentation on the DAVIS-2017 dataset [pont20172017]. We follow the experimental protocol of Jabri et al. [jabri2020space] and segment scenes with a nearest neighbor between consecutive frames, as in DINO. When the representations of all layers are tested as in copy detection, a similar trend is observed in the video segmentation task as well. The best-performing layer is the 6th and 10th for SD-MoCo v3 and MoCo v3, respectively, and SD-MoCo v3 outperforms MoCo v3 as shown in Table 3. SD-SimCLR performs slightly better than SimCLR, both with the best performance in the 7th layer. However, in the case of SD-BYOL, the performance is slightly reduced compared to BYOL. In BYOL, the 7th layer is the best, whereas in SD-BYOL, the 2nd layer has the best performance.
Image Retrieval The Revisited [radenovic2018revisiting] Oxford and Paris image retrieval datasets [philbin2008lost] contain three splits of varying difficulty with query and database pairs. We evaluate all baselines and SDSSL on the Medium and Hard splits, directly applying k-NN for image retrieval. As shown in Table 4, SDSSL outperforms the baselines.
Classification In this section, we present the results for image classification on CIFAR-10, CIFAR-100, Oxford Flowers-102, and Oxford-IIIT Pets. Since end-to-end fine-tuning may lead to over-fitting on a particular dataset, it can obscure whether the representations of the pre-trained encoder are actually good [radford2021learning]. Due to this, we also report numbers for k-NN evaluation. Table 5 shows that for Flowers and Pets, both k-NN evaluation and fine-tuning lead to a large performance gap compared to the baseline, while the gap is relatively small, or slightly falls behind the baseline, for CIFAR-10 and CIFAR-100. The two groups of datasets have distinct characteristics: the former (Flowers and Pets) are composed of homogeneous classes, while the latter have distinct classes such as automobile, airplane, deer, etc. In the next subsection, we provide further analysis on why SDSSL performs exceptionally well on such datasets that require fine-grained features.
Wang et al. [wang2020understanding] demonstrated that contrastive learning optimizes two distinct metrics: (1) alignment, which quantifies the compactness of representations of positive samples,

$$\mathcal{L}_{align}(f; \alpha) = \mathbb{E}_{(x, y) \sim p_{pos}}\big[\|f(x) - f(y)\|_2^{\alpha}\big]$$

for some $\alpha > 0$, and (2) uniformity, which measures how dispersed the entire set of representations is on the hypersphere, using the Gaussian potential kernel (also known as the RBF kernel) [cohn2007universally, borodachov2019discrete]:

$$\mathcal{L}_{uniform}(f; t) = \log \mathbb{E}_{x, y \sim p_{data}}\big[e^{-t \|f(x) - f(y)\|_2^2}\big].$$

Here $p_{pos}$ is the distribution of positive pairs generated by random augmentation from the input data, and $p_{data}$ is the input data distribution. They asserted that low alignment signifies that the positive samples are close to each other, while low uniformity signifies that the negative samples are further apart. Thus, low alignment and low uniformity lead to a better representation with high linear separability, although the two metrics are inherently in a trade-off relationship.
Empirically, we observed that SD-MoCo v3 has higher alignment but lower uniformity than vanilla MoCo v3. However, considering their conflicting characteristics, it is difficult to ascertain which representation is better. To answer this question, we propose another metric that modifies the alignment metric to quantify the difference in alignment between negative samples and positive samples. Alignment between negative samples is defined as follows:

$$\mathcal{L}_{neg\text{-}align}(f; \alpha) = \mathbb{E}_{x, y \sim p_{data},\, x \neq y}\big[\|f(x) - f(y)\|_2^{\alpha}\big].$$

A higher $\mathcal{L}_{neg\text{-}align}$ means that the negative samples are further apart from each other, similar to uniformity.
The difference between negative alignment and positive alignment then quantifies the difference between the mean distance among positive samples and that among negative samples. As shown in Figure 7, SD-MoCo v3 has a higher alignment difference than MoCo v3 in almost all layers. In particular, when negative alignment is adequately high along with a high positive alignment, a representation may be sufficiently far from the negative representations while the positive samples are also relatively dispersed, which makes it easier to distinguish positive samples that are potentially different classes in fine-grained datasets. This may explain why SDSSL performs exceptionally well for Flowers and Pets in Table 5, which require more fine-grained representations due to their homogeneous classes compared to the CIFAR datasets.
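The alignment, uniformity, and negative-alignment quantities discussed above can be computed directly from feature matrices. Below is a minimal NumPy sketch under the definitions in the text (features normalized to the unit sphere); it is an illustration, not the authors' evaluation code.

```python
import numpy as np

def alignment(x, y, alpha=2):
    """Mean distance between positive pairs (row i of x with row i of y)."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    return np.mean(np.linalg.norm(x - y, axis=1) ** alpha)

def uniformity(x, t=2):
    """Log of the mean Gaussian potential over distinct pairs; lower = more dispersed."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    sq = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    off = sq[~np.eye(x.shape[0], dtype=bool)]     # drop self-pairs
    return np.log(np.mean(np.exp(-t * off)))

def negative_alignment(x, alpha=2):
    """Mean distance between all distinct (negative) pairs; higher = more spread."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1) ** alpha
    return np.mean(d[~np.eye(x.shape[0], dtype=bool)])
```

The alignment difference in the text is then simply `negative_alignment(x) - alignment(x1, x2)` for a batch of two views.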
4.5 Ablation Study
In this subsection, we show the efficacy of ratio scheduling and the predictor loss through ablation and verify these are necessary factors for optimal performance.
Table 6 shows that ablating the predictor loss results in performance degradation. As discussed, this is consistent with the results in [grill2020bootstrap] showing that the optimality of the predictor is crucial.
Additionally, when only the predictor loss is used without the intermediate distillation loss, the performance change is minimal. This verifies that the intermediate distillation loss is a key component.
During training, we used ratio annealing for $\lambda$ in Eq. 6 and Eq. 8, i.e., $\lambda$ is set very low in the initial iterations and gradually increased afterward, rather than using a fixed $\lambda$ for the entire training. Without ratio annealing, the performance decreases significantly, which shows that self-distilling only after some training has been done is important.
We have explained the mechanism of SDSSL from an information-theoretic perspective using mutual information. Nonetheless, many factors of SDSSL can be interpreted using well-known studies of knowledge distillation. Yosinski et al. [yosinski2014transferable] proposed that representations of higher layers contain more task-specific information than those of the lower layers. Likewise, the output representations of self-supervised networks are more focused on the instance discrimination pretext task. This explains the observation of the multi-exit experiment, in which the lower layers of SDSSL have better representations than the baselines, by allowing the lower layers to explicitly learn the pretext task as well. Additionally, the scheduling of $\lambda$ can also be explained by the performance of the teacher: because the teacher does not have a sufficient representation of the pretext task early in training, $\lambda$ should be low at first and increased later on to distill better representations.
From a more intuitive perspective, the [CLS] token starts from a single representation for all images, and slowly aligns with the representations of the positive samples and distances itself from those of the negative samples as the layers progress. Through the self-distillation loss, SDSSL induces the unaligned intermediate representations to mimic the output representations, which influences the features in Fig. 8(a) to become more like Fig. 8(b). Then, the next layer receives features that are more aligned with the corresponding class and more separated from the other classes than the original representations, which makes the instance discrimination task easier for forthcoming layers, leading to a better representation. In other words, the self-distillation loss makes the representations from the earlier layers more dispersed among negative samples and aligns positive samples more effectively, which means that the representation space is used efficiently. Figure 7 shows this phenomenon quantitatively. Visualizations of the representations in the lower layers using t-SNE (shown in the appendix) also support this phenomenon.
In this work, we proposed a self-distillation method generally applicable to existing self-supervised learning frameworks. From the mutual information maximization perspective, our method is motivated by the hypothesis that maximizing the upper bound of the mutual information between two views may be favorable for representation learning, and through experiments we empirically validated the effectiveness of our method. We showed that SDSSL leads to superior performance not only in the final layer, but also in various lower layers, through the multi-exit experiment. In the future, our method could be combined with other techniques that lead to further performance gains, such as larger model capacity, smaller patches, and multi-crop images. Additionally, more rigorous theoretical analyses could give insight into the empirically superior performance.
Appendix A Representation Visualization
We visualize the representations of each layer of MoCo v3 and SD-MoCo v3 for five random classes among the ImageNet validation set with t-SNE [van2008visualizing] in Fig. 9. We observe that the lower layer representations of SD-MoCo v3 are more cohesive than the representations of the same layer of MoCo v3.
Appendix B Copy detection and Video Segmentation
|SimCLR / 12|32.5|34.1|29.0|18.3|30.9|16.9|14.6|
|SD-SimCLR / 12|31.8|33.6|26.7|17.7|30.1|15.9|15.7|
|BYOL / 12|30.6|32.0|26.4|20.1|29.1|14.6|15.1|
|SD-BYOL / 12|30.7|32.2|26.9|15.5|29.3|14.3|12.5|
|MoCo v3 / 12|35.7|38.0|34.4|15.2|33.3|19.9|13.4|
|SD-MoCo v3 / 12|34.4|36.6|31.5|14.8|32.2|17.3|13.1|
Appendix C Distillation in same view
In SDSSL, the low-layer representations of the student mimic the output representation of the teacher (a different view). However, like prior self-distillation works [phuong2019distillation, zhang2019your], distillation can instead be performed in the same view (the student's own output representation). We use the contrastive loss between the low-layer representations and the output representation of the same view, instead of the self-distillation loss, in SimCLR. Although there is an increase in performance compared to the baseline, the performance is not comparable to that of SD-SimCLR, as shown in Tab. 11.
|in same view|48.4 (-0.6)|