Over the past decade, deep learning has achieved great success in many fields of artificial intelligence. Large amounts of manually labeled data have fueled this success. However, manually labeled data is expensive to obtain and far scarcer than unlabeled data in practice. To relieve the constraint of costly annotations, self-supervised learning (DosovitskiyFSRB16; wu2018unsupervised; cpcv1; 2020_moco; 2020_simclr) aims to learn transferable representations for downstream tasks by training networks on unlabeled data. Great progress has been made with large models, i.e., models bigger than ResNet-50 (2016_ResNet), which has roughly 25M parameters. For example, ReLICv2 (relicv2) achieves 77.1% accuracy on ImageNet (ImageNet) under the linear evaluation protocol with ResNet-50, outperforming the supervised baseline of 76.5%.
In contrast to the success of large-model pre-training, self-supervised learning with small models lags behind. For instance, a supervised ResNet-18 with 12M parameters achieves 72.1% accuracy on ImageNet, but its self-supervised counterpart trained with MoCov2 (mocov2) reaches only 52.5% (seed), a gap of nearly 20 percentage points. To close this gap between supervised and self-supervised small models, previous methods (seed; disco; 2022_bingo) mainly rely on knowledge distillation, i.e., they transfer the knowledge of a self-supervised large model into small ones. Nevertheless, this methodology is a two-stage procedure: first train an additional large model, then train a small model to mimic the large one. Besides, one-time distillation produces only a single small model for a specific computation scenario.
An interesting question naturally arises: can we obtain different small models through one-time pre-training to meet various computation scenarios without extra teachers? Inspired by the success of slimmable networks (2019_slim) in supervised learning, we present a novel one-stage method to obtain pre-trained small models without adding large models: slimmable networks for contrastive self-supervised learning (SlimCLR). A slimmable network consists of a full network and several weight-sharing sub-networks with different widths, where the width denotes the number of channels in a network. Slimmable networks can execute at various widths, permitting flexible deployment on different computing devices. We can thus obtain multiple networks, including small ones for low-compute scenarios, via one-time pre-training. Weight-sharing sub-networks can also inherit knowledge from the larger ones via the shared parameters and thereby achieve better generalization performance.
Weight-sharing networks in a slimmable network interfere with each other when trained simultaneously, and the situation is worse in self-supervised cases. As shown in Figure 1, with supervision, weight-sharing networks have only a slight impact on each other, e.g., the full model achieves 76.6% vs. 76.0% accuracy when trained individually and as part of a slimmable network. Without supervision, the corresponding numbers become 67.2% vs. 64.8%. One observed phenomenon of the interference is gradient imbalance: a small proportion of parameters produces dominant gradients during backpropagation. The imbalance occurs because the shared parameters receive gradients from multiple losses of different networks during optimization. The main parameters may not be fully optimized due to gradient imbalance. Besides, the gradient directions of weight-sharing networks may also diverge from each other and lead to conflicts during training. More explanations and visualizations of the interference of weight-sharing networks can be found in Appendix A.3.
To relieve the gradient imbalance, the main parameters should produce dominant gradients during the optimization process. To avoid conflicts in gradient directions of various networks, sub-networks should have consistent guidance. Following these principles, we introduce three simple yet effective techniques during pre-training to relieve the interference of networks. 1) We adopt a slow start strategy for sub-networks. The networks and pseudo supervision of contrastive learning are both unstable and fast updating at the start of training. To avoid interference making the situation worse, we only train the full model at first. After the full model becomes relatively stable, sub-networks can inherit the knowledge via sharing parameters and start with better initialization. 2) We apply online distillation to make all sub-networks consistent with the full model to eliminate divergence of networks. The predictions of the full model will serve as global guidance for all sub-networks. 3) We re-weight the losses of networks according to their widths to ensure that the full model dominates the optimization process. Besides, we adopt a switchable linear probe layer to avoid interference of weight-sharing linear layers during evaluation. A single slimmable linear layer cannot achieve several complex mappings simultaneously when the data distribution is complicated.
We instantiate two algorithms for SlimCLR with typical contrastive learning frameworks, i.e., MoCov2 and MoCov3 (mocov2; 2021_mocov3). Extensive experiments on the ImageNet (ImageNet) dataset show that our methods achieve significant performance improvements over prior art with fewer parameters and FLOPs.
2 Related Works
Self-supervised learning Self-supervised learning aims to learn transferable representations for downstream tasks from the input data itself. According to 2021_survey, self-supervised methods can be grouped into three main categories by objective: generative, contrastive, and generative-contrastive (adversarial). Methods in the same category can be further classified by their pretext tasks. Given an input $x$, generative methods encode $x$ into an explicit vector $z$ and decode $z$ to reconstruct $x$, e.g., auto-regressive models (2016_pixelcnn; 2016_pixelrnn) and auto-encoding models (1987_ae; 2014_VAE; 2019_bert; 2022_mae). Contrastive learning methods encode the input $x$ into an explicit vector $z$ to measure similarity. The two mainstream approaches in this category are context-instance contrast (InfoMax 2019_infomax, CPC cpcv1, AMDIM 2019_addim) and instance-instance contrast (DeepCluster 2018_deepcluster, MoCo 2020_moco; 2021_mocov3, SimCLR 2020_simclr; 2020_simclrv2, SimSiam 2021_simsiam). Generative-contrastive methods generate a fake sample $x'$ from $x$ and try to distinguish $x'$ from real samples, e.g., DCGANs 2016_dcgans, inpainting 2016_inpaint, and colorization 2016_color.
Slimmable networks Slimmable networks were first proposed to achieve instant and adaptive accuracy-efficiency trade-offs on different devices (2019_slim). They can execute at different widths at runtime. Following this pioneering work, universally slimmable networks (2019_unslim) develop systematic training approaches that allow slimmable networks to run at arbitrary widths. AutoSlim (yu2019autoslim) further achieves one-shot architecture search for channel numbers under a given computation budget. MutualNet (2020_mutualnet) trains slimmable networks using different input resolutions to learn multi-scale representations. Dynamic slimmable networks (Li_2022_pami; Li_2021_CVPR) change the number of channels of each layer on the fly according to the input. In contrast to weight-sharing sub-networks in slimmable networks, some methods train multiple sub-networks with independent parameters (2020_dc). A related concept in network pruning is network slimming (2017_networkslim; 2022_vitslim; pmlr-v139-wang21e), which aims to achieve channel-level sparsity for computation efficiency.
3.1 Description of SlimCLR
We develop two instantiations of SlimCLR with the typical contrastive learning frameworks MoCov2 and MoCov3 (mocov2; 2021_mocov3). As shown in Figure 1(a) (right), a slimmable network with $n$ widths $w_1, \dots, w_n$ contains multiple weight-sharing networks $\{f_{\theta_{w_i}}\}_{i=1}^{n}$, which are parameterized by learnable weights $\{\theta_{w_i}\}_{i=1}^{n}$, respectively. Each network $f_{\theta_{w_i}}$ in the slimmable network has its own set of weights $\Theta_{w_i}$ and $\theta_{w_i} \in \Theta_{w_i}$. A network with a small width shares its weights with the larger ones, namely, $\Theta_{w_j} \subset \Theta_{w_i}$ if $w_j < w_i$. Generally, we assume $w_j < w_i$ if $i < j$, i.e., $w_1, \dots, w_n$ are arranged in descending order, and $\theta_{w_1}$ represents the parameters of the full model.
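The weight sharing described above can be illustrated with a single linear layer. The following is a minimal NumPy sketch, not the authors' implementation: the class name `SlimmableLinear`, the `width` argument, and the convention that a sub-network uses the leading rows and columns of the full weight matrix are assumptions made for illustration.

```python
import numpy as np


class SlimmableLinear:
    """One weight matrix shared by all widths (illustrative, hypothetical API)."""

    def __init__(self, in_features, out_features, seed=0):
        rng = np.random.default_rng(seed)
        self.weight = rng.normal(0.0, 0.02, size=(out_features, in_features))
        self.bias = np.zeros(out_features)

    def _sizes(self, width):
        # Number of output and input channels kept at this width.
        n_out = int(round(width * self.weight.shape[0]))
        n_in = int(round(width * self.weight.shape[1]))
        return n_out, n_in

    def sub_weight(self, width):
        # Basic slicing returns a view: the sub-network owns no separate copy.
        n_out, n_in = self._sizes(width)
        return self.weight[:n_out, :n_in]

    def forward(self, x, width):
        # x must already have round(width * in_features) features.
        n_out, _ = self._sizes(width)
        return x @ self.sub_weight(width).T + self.bias[:n_out]
```

Because `sub_weight` returns a view, any update to a sub-network's slice also changes the full model's weights, which is why jointly trained widths can interfere with each other.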
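We first illustrate the learning process of SlimCLR-MoCov2 in Figure 1(a). Given a set of images $\mathcal{D}$, an image $x$ sampled uniformly from $\mathcal{D}$, and one distribution of image augmentations $\mathcal{T}$, SlimCLR produces two data views $\tilde{x}_1 = t(x)$ and $\tilde{x}_2 = t'(x)$ from $x$ by applying augmentations $t \sim \mathcal{T}$ and $t' \sim \mathcal{T}$, respectively. For the first view, SlimCLR outputs multiple representations $\{h_{\theta_{w_i}}\}_{i=1}^{n}$ and predictions $\{q_{\theta_{w_i}}\}_{i=1}^{n}$, where $h_{\theta_{w_i}} = f_{\theta_{w_i}}(\tilde{x}_1)$ and $q_{\theta_{w_i}} = g_{\theta_{w_i}}(h_{\theta_{w_i}})$. $g$ is a stack of slimmable linear transformation layers, i.e., a slimmable version of the MLP head in MoCov2 and SimCLR (2020_simclr). For the second view, SlimCLR only outputs a single representation from the full model, $h_{\xi} = f_{\xi}(\tilde{x}_2)$, and a prediction $k^{+} = g_{\xi}(h_{\xi})$. We minimize the InfoNCE (cpcv1) loss to maximize the similarity of the positive pair $q_{\theta_{w_i}}$ and $k^{+}$:

$$\mathcal{L}_{\theta_{w_i}} = -\log \frac{\exp(\bar{q}_{\theta_{w_i}} \cdot \bar{k}^{+} / \tau_1)}{\exp(\bar{q}_{\theta_{w_i}} \cdot \bar{k}^{+} / \tau_1) + \sum_{k^{-}} \exp(\bar{q}_{\theta_{w_i}} \cdot \bar{k}^{-} / \tau_1)}$$

where $\bar{q}_{\theta_{w_i}} = q_{\theta_{w_i}} / \|q_{\theta_{w_i}}\|_2$, $\bar{k} = k / \|k\|_2$, $\tau_1$ is a temperature hyper-parameter, and $\{k^{-}\}$ are features of negative samples. For SlimCLR-MoCov2, $\{k^{-}\}$ comes from a queue. Following MoCov2, the queue is updated by $k^{+}$ every iteration during training. The overall objective is the sum of the losses of all networks with various widths:

$$\mathcal{L} = \sum_{i=1}^{n} \mathcal{L}_{\theta_{w_i}}$$

$\xi$ is updated by $\theta_{w_1}$ every iteration: $\xi \leftarrow m\xi + (1 - m)\theta_{w_1}$, where $m$ is a momentum coefficient.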
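The contrastive objective and the momentum update can be sketched in a few lines of NumPy. This is a hedged illustration rather than the authors' code: it assumes the usual MoCo-style form of InfoNCE with L2-normalized features and an exponential moving average for the key encoder, and the function names and arguments are hypothetical.

```python
import numpy as np


def l2_normalize(v, axis=-1, eps=1e-12):
    """Scale vectors to unit L2 norm along `axis`."""
    return v / np.maximum(np.linalg.norm(v, axis=axis, keepdims=True), eps)


def info_nce(q, k_pos, k_neg, tau):
    """InfoNCE loss for one query.

    q:     (d,) prediction from the first view
    k_pos: (d,) key from the second view (the positive)
    k_neg: (n, d) keys of negative samples, e.g. taken from a queue
    tau:   temperature (value chosen by the caller)
    """
    q, k_pos, k_neg = l2_normalize(q), l2_normalize(k_pos), l2_normalize(k_neg)
    logits = np.concatenate(([q @ k_pos], k_neg @ q)) / tau
    logits -= logits.max()  # numerical stability; the loss is unchanged
    return -(logits[0] - np.log(np.exp(logits).sum()))


def total_loss(queries, k_pos, k_neg, tau):
    """Unweighted sum of the InfoNCE losses of all widths."""
    return sum(info_nce(q, k_pos, k_neg, tau) for q in queries)


def momentum_update(xi, theta, m):
    """Exponential moving average of the full model's weights (assumed form)."""
    return m * xi + (1.0 - m) * theta
```

Here `queries` would hold one prediction per width, all compared against the same positive key produced by the momentum encoder.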
Compared to SlimCLR-MoCov2, SlimCLR-MoCov3 has an additional projection process. It first projects the representation to another high dimensional space, then makes predictions. The projector is a stack of slimmable linear transformation layers. SlimCLR-MoCov3 also adopts the InfoNCE loss, but the negative samples come from other samples in the mini-batch.
After contrastive learning, we keep only the slimmable encoder and discard the other components.
3.2 Gradient imbalance and solutions
As shown in Figure 1, a vanilla implementation of the above framework leads to severe performance degradation because weight-sharing networks interfere with each other during pre-training. One piece of evidence of such interference that we observed is gradient imbalance.
Gradient imbalance means that a small proportion of parameters produces dominant gradients during backpropagation. To quantitatively evaluate the phenomenon, we show in Figure 3 the ratios of the gradient norms of the main parameters to those of the minor parameters, where the gradients are taken with respect to the loss function $\mathcal{L}$. These ratios can be compared with the ratio of the numbers of main and minor parameters. Generally, the main parameters dominate the optimization process and produce large gradient norms, i.e., the two ratios should both be large. In Figure 2(a), the two ratios are both around 3.5 when training a normal network. However, in Figure 2(b) and 2(c), when training a slimmable network, gradient imbalance occurs because the shared parameters obtain multiple gradients from different losses. To be specific, if the widths $w_1, \dots, w_n$ of a slimmable network are arranged in descending order and the training loss is $\mathcal{L} = \sum_{i=1}^{n} \mathcal{L}_{\theta_{w_i}}$, then $\theta_{w_n}$, which represents only a small part of the parameters, will receive gradients from $n$ different losses and obtain a large gradient norm:

$$\nabla_{\theta_{w_n}} \mathcal{L} = \sum_{i=1}^{n} \nabla_{\theta_{w_n}} \mathcal{L}_{\theta_{w_i}}$$
Gradient imbalance is more obvious in self-supervised cases. In the supervised case in Figure 2(b), is close to at first, and the former becomes larger along with the training process. By contrast, for vanilla SlimCLR-MoCov2 in Figure 2(c), is smaller than the other most time. A conjecture is that instance discrimination is harder than supervised classification. Consequently, small networks with limited capacity are hard to convergence, produce large losses, and cause more disturbances to other weight-sharing networks.
When gradient imbalance occurs, the main parameters may not be fully optimized. Besides, the gradient directions of weight-sharing networks may also diverge from each other during backpropagation. All these lead to interference between weight-sharing networks and performance degradation.
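The accumulation of gradients on shared parameters can be reproduced with a toy example. The sketch below is an illustration under assumptions, not the paper's experiment: a weight vector is shared by nested linear models (the sizes 4, 2, and 1 are hypothetical), each model uses only the leading entries of the vector, and the training loss is the unweighted sum of the squared errors of all models.

```python
import numpy as np


def shared_gradient(w, x, y, sizes):
    """Gradient of sum_k 0.5 * (w[:k] @ x[:k] - y)**2 over the given sizes k."""
    grad = np.zeros_like(w)
    for k in sizes:
        residual = w[:k] @ x[:k] - y
        grad[:k] += residual * x[:k]  # only the first k weights belong to this model
    return grad


def main_to_minor_ratio(grad, k_minor):
    """Gradient norm outside the smallest model divided by the norm inside it."""
    return np.linalg.norm(grad[k_minor:]) / np.linalg.norm(grad[:k_minor])


w = np.ones(4)
x = np.ones(4)
y = 0.0

# Full model trained alone vs. full model trained jointly with two sub-models.
ratio_single = main_to_minor_ratio(shared_gradient(w, x, y, sizes=[4]), k_minor=1)
ratio_slim = main_to_minor_ratio(shared_gradient(w, x, y, sizes=[4, 2, 1]), k_minor=1)
```

In this toy setting the ratio drops from about 1.73 to about 1.18 once the losses of the sub-models are added, because the shared leading weight collects a gradient term from every loss.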
To avoid gradient imbalance, one natural idea is to make the main parameters dominate the optimization process, i.e., the two ratios in Figure 3 should both be large. To resolve the possible conflicts of gradient directions, networks should have a consistent optimization goal. In order to achieve the above purposes, we develop three simple yet effective techniques during pre-training: slow start, online distillation, and loss reweighting. Besides, we further introduce a switchable linear probe layer to avoid the interference of weight-sharing linear layers during linear evaluation.
slow start At the start of training, the model and the pseudo supervision of contrastive learning are both updating rapidly, and the optimization procedure is unstable. To prevent interference between weight-sharing networks from making the situation harder, during the first $S$ epochs we only train the full model, i.e., only update $\theta_{w_1}$ by $\nabla_{\theta_{w_1}} \mathcal{L}_{\theta_{w_1}}$. In Figure 2(d), the ratios of gradient norms are large before the $S$-th epoch; they then drop sharply once the slow start ends. During the first $S$ epochs, the full model can learn certain knowledge from the data without disturbance, and sub-networks can inherit this knowledge via the shared parameters and start from a better initialization at the $S$-th epoch.
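The slow start strategy amounts to a simple schedule over which widths contribute a loss at each epoch. A minimal sketch, assuming 0-indexed epochs and a hypothetical helper name `active_widths`; the example widths and slow start epoch are placeholders, not values taken from the paper:

```python
def active_widths(epoch, widths, slow_start_epoch):
    """Return the widths whose losses are optimized at `epoch` (0-indexed).

    Before `slow_start_epoch` only the full model (largest width) is trained;
    afterwards all widths are trained jointly, largest first.
    """
    full = max(widths)
    if epoch < slow_start_epoch:
        return [full]
    return sorted(widths, reverse=True)


# Hypothetical configuration: three widths, sub-networks join at epoch 100.
schedule = [active_widths(e, [0.25, 1.0, 0.5], slow_start_epoch=100) for e in (0, 99, 100)]
```

A training loop would compute the loss only for the returned widths, so the iterations before the slow start epoch cost roughly the same as training the full model alone.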
online distillation The full model has the greatest capacity to learn knowledge from the data. The prediction of the full model can serve as consistent guidance for all sub-networks to resolve the conflicts of weight-sharing networks. Following 2019_unslim, we minimize the Kullback-Leibler (KL) divergence between the predictions of sub-networks and the full model:
is a temperature coefficient of distillation. In Figure 2(e), we observe that online distillation helps become larger than . This means that online distillation also relieves the gradient imbalance and helps the main parameters dominate the optimization process.
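A temperature-scaled distillation loss between the full model and a sub-network can be sketched as follows. This is an illustrative NumPy version under assumptions: the inputs are generic score vectors (for contrastive learning these would be similarities to the keys), the teacher distribution is treated as a constant target, and the function names are hypothetical.

```python
import numpy as np


def softmax(scores, tau):
    """Softmax of `scores / tau`, computed in a numerically stable way."""
    s = np.asarray(scores, dtype=float) / tau
    e = np.exp(s - s.max())
    return e / e.sum()


def distill_kl(teacher_scores, student_scores, tau):
    """KL(teacher || student) between temperature-softened distributions."""
    p_teacher = softmax(teacher_scores, tau)
    p_student = softmax(student_scores, tau)
    return float(np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student))))


def online_distillation_loss(full_scores, sub_scores_list, tau):
    """Sum of the distillation losses of all sub-networks against the full model."""
    return sum(distill_kl(full_scores, s, tau) for s in sub_scores_list)
```

The loss is zero when a sub-network reproduces the full model's distribution and positive otherwise, so every sub-network is pulled toward the same target.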
loss reweighting Another straightforward solution to gradient imbalance is to assign larger weights to the losses of networks with larger widths. We adopt a strategy in which the strongest takes control. The weight for the loss of the network with width $w_i$ is:
where $\mathbb{1}\{\cdot\}$ equals 1 if the inner condition is true and 0 otherwise. In Figure 2(f), both ratios become large by a clear margin. Loss reweighting helps the main parameters produce large gradient norms and dominate the optimization process.
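As an illustration of a scheme in which the strongest takes control, the sketch below assumes a hypothetical rule: every network receives a base weight of 1, and the full model additionally receives the sum of the widths of all sub-networks. This rule is an assumption made for demonstration and may differ from the weighting actually used; the widths are placeholders.

```python
def loss_weights(widths):
    """Hypothetical reweighting: base weight 1, plus a bonus for the full model."""
    full = max(widths)
    bonus = sum(w for w in widths if w != full)  # sum of sub-network widths
    return [1.0 + (bonus if w == full else 0.0) for w in widths]


def weighted_total(losses, widths):
    """Weighted sum of per-width losses under the rule above."""
    return sum(lam * loss for lam, loss in zip(loss_weights(widths), losses))


example_weights = loss_weights([1.0, 0.5, 0.25])
```

Under this rule the full model's loss always carries the largest weight, so its gradient is amplified relative to the gradients contributed by the sub-networks.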
The overall pre-training objective of SlimCLR is:
switchable linear probe layer
After pre-training, we observe that a single slimmable linear layer cannot simultaneously realize several complex mappings from different representations to the same object classes. We provide theoretical results on the conditions that the inputs must satisfy when a single slimmable linear layer is used to solve multiple multi-class linear regression problems in Appendix A.1. The failure of a single slimmable linear layer is possibly because the learned representations in Figure 2 do not meet these requirements. We therefore propose a switchable linear probe layer mechanism: each network in the slimmable network has its own linear probe layer for linear evaluation. We also apply the online distillation strategy during fine-tuning.
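The difference from a single slimmable probe is that each width owns an independent classifier. A minimal NumPy sketch, with a hypothetical class name and placeholder feature dimensions:

```python
import numpy as np


class SwitchableLinearProbe:
    """One independent linear classifier per width (illustrative sketch)."""

    def __init__(self, feature_dims, num_classes, seed=0):
        # feature_dims maps a width to the feature dimension of that sub-network.
        rng = np.random.default_rng(seed)
        self.weights = {w: rng.normal(0.0, 0.01, size=(d, num_classes))
                        for w, d in feature_dims.items()}
        self.biases = {w: np.zeros(num_classes) for w in feature_dims}

    def __call__(self, features, width):
        # Select the classifier that belongs to the active width.
        return features @ self.weights[width] + self.biases[width]


# Hypothetical dimensions: the half-width network yields half as many features.
probe = SwitchableLinearProbe({1.0: 8, 0.5: 4}, num_classes=3)
logits_full = probe(np.ones((2, 8)), width=1.0)
logits_half = probe(np.ones((2, 4)), width=0.5)
```

Since no parameters are shared between the classifiers, fitting the probe of one width cannot disturb the probe of another.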
4.1 Experimental details
Dataset We train SlimCLR on ImageNet (ImageNet), which contains 1.28M training and 50K validation images. During pre-training, we use the training images without labels.
Learning strategies of SlimCLR-MoCov2 By default, we use a total batch size 1024, an initial learning rate 0.2, and weight decay . We adopt the SGD optimizer with a momentum 0.9. A linear warm-up and cosine decay policy (2017_PriyaGoyal; 2018_bags_of_tricks) for learning rate is applied, and the warm-up epoch is 10. The temperatures are for InfoNCE and for online distillation. Without special mentions, other settings including data augmentations, queue size (65536), and feature dimension (128) are the same as the counterparts of MoCov2 (mocov2). The slow start epoch of sub-networks is set to be half of the number of total epochs.
Learning strategies of SlimCLR-MoCov3 We use a total batch size 1024, an initial learning rate 1.2, and weight decay . We adopt the LARS (you2017large) optimizer and a cosine learning rate policy with warm-up epoch 10. The temperatures are and . The slow start epoch is half of the total epochs. One different thing is that we increase the initial learning rate to 3.2 after epochs. Pre-training is all done with mixed precision (2018_AMP).
Linear evaluation Following the general linear evaluation protocol (2020_simclr; 2020_moco), we add new linear transformation layers on the backbone and freeze the parameters of the backbone during evaluation. As described above, we also apply online distillation with a temperature when training these linear layers. For linear evaluation of SlimCLR-MoCov2, we use a total batch size 1024, total epochs 100, and an initial learning rate 60, which is decayed by 10 at 60 and 80 epochs. For linear evaluation of SlimCLR-MoCov3, we use a total batch size 1024, total epochs 90, and an initial learning rate 0.4 with cosine decay policy.
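The two learning rate policies mentioned above can be written as small functions. This is a sketch under assumptions: the rate is updated once per epoch, the warm-up rises linearly to the base rate, and the cosine decay is measured from the end of the warm-up. Actual implementations often update per iteration and may define the cosine over the whole run.

```python
import math


def warmup_cosine_lr(epoch, base_lr, warmup_epochs, total_epochs):
    """Linear warm-up to `base_lr`, then cosine decay to zero (assumed form)."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


def step_lr(epoch, base_lr, milestones, factor=10.0):
    """Divide the learning rate by `factor` at every milestone that has passed."""
    drops = sum(epoch >= m for m in milestones)
    return base_lr / (factor ** drops)


# Pre-training example: base rate 0.2, 10 warm-up epochs, 200 epochs in total.
pretrain_lrs = [warmup_cosine_lr(e, 0.2, 10, 200) for e in range(200)]
# Linear evaluation example: rate 60, divided by 10 at epochs 60 and 80.
probe_lrs = [step_lr(e, 60.0, milestones=(60, 80)) for e in range(100)]
```

With these settings the pre-training rate peaks at 0.2 when the warm-up ends, and the linear evaluation rate takes the values 60, 6, and 0.6 in its three phases.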
|Supervised||R-50||✗||76.6||93.2||100||25.6 M||4.1 G|
|R-34||75.0||-||-||21.8 M||3.7 G|
|R-18||72.1||-||-||11.9 M||1.8 G|
|R-50||✗||76.0||92.9||100||25.6 M||4.1 G|
|R-50||74.9||92.3||14.7 M||2.3 G|
|R-50||72.2||90.8||6.9 M||1.1 G|
|R-50||64.4||86.0||2.0 M||278 M|
|Baseline (individual networks trained with MoCov2)||R-50||✗||67.5||-||200||25.6 M||4.1 G|
|R-50||✗||67.2||87.8||200||25.6 M||4.1 G|
|R-50||64.3||85.8||14.7 M||2.3 G|
|R-50||58.9||82.2||6.9 M||1.1 G|
|R-50||47.9||72.8||2.0 M||278 M|
|MoCov2 (mocov2, preprint)||R-50||✗||71.1||-||800||25.6 M||4.1 G|
|MoCov3 (2021_mocov3, ICCV)||R-50||72.8||-||300|
|SEED (seed, ICLR)||R-34||R-50 (67.4)||58.5||82.6||200||21.8 M||3.7 G|
|DisCo (disco, ECCV)||R-34||R-50 (67.4)||62.5||85.4||200|
|BINGO (2022_bingo, ICLR)||R-34||R-50 (67.4)||63.5||85.7||200|
|SEED (seed, ICLR)||R-34||R-50×2 (77.3)||65.7||86.8||800|
|DisCo (disco, ECCV)||R-34||R-50×2 (77.3)||67.6||88.6||200|
|BINGO (2022_bingo, ICLR)||R-34||R-50×2 (77.3)||68.9||88.9||200|
|SlimCLR-MoCov2||R-50||✗||65.5||87.0||200||14.7 M||2.3 G|
|CompRess (compress, NeurIPS)||R-18||R-50 (71.1)||62.6||-||130||11.9 M||1.8 G|
|SEED (seed, ICLR)||R-18||R-50×2 (77.3)||63.0||84.9||800|
|DisCo (disco, ECCV)||R-18||R-50×2 (77.3)||65.2||86.8||200|
|BINGO (2022_bingo, ICLR)||R-18||R-50×2 (77.3)||65.5||87.0||200|
|SEED (seed, ICLR)||R-18||R-152 (74.1)||59.5||65.5||200|
|DisCo (disco, ECCV)||R-18||R-152 (74.1)||65.5||86.7||200|
|BINGO (2022_bingo, ICLR)||R-18||R-152 (74.1)||65.9||87.1||200|
|SlimCLR-MoCov2||R-50||✗||62.5||84.8||200||6.9 M||1.1 G|
|SlimCLR-MoCov2||R-50||✗||55.1||79.5||200||2.0 M||278 M|
4.2 Results of SlimCLR on ImageNet
Results of SlimCLR on ImageNet are shown in Table 1. Even though we take considerable measures to relieve the interference of weight-sharing networks as described in Section 3.2, slimmable training inevitably leads to a drop in performance for the full model. The degradation is more obvious when training for more epochs. However, such degradation also occurs in the supervised case. Considering the advantages of slimmable training discussed below, the degradation is acceptable.
Compared to MoCov2 with individual networks, SlimCLR helps sub-networks achieve significant performance improvements. Specifically, for ResNet-50 and ResNet-50, SlimCLR-MoCov2 achieves 3.5% and 6.6% improvements in performance when pre-training for 200 epochs, respectively. This verifies that sub-networks can inherit knowledge from the full model via sharing parameters to improve their generalization ability. We can also use more powerful contrastive learning framework to further boost the performance of sub-networks, i.e., SlimCLR-MoCov3.
Compared to previous methods that distill the knowledge of large teacher models, sub-networks of ResNet-50 achieve better performance with fewer parameters and FLOPs. SlimCLR also brings small models closer to their supervised counterparts. Furthermore, SlimCLR does not need any additional training of large teacher models, and all networks in SlimCLR are trained jointly. By training only once, we obtain models with various computation costs that suit different devices. This demonstrates the advantage of adopting slimmable networks for contrastive learning to obtain pre-trained small models.
In this section, we will discuss the influences of different components in SlimCLR.
switchable linear probe layer The influence of the switchable linear probe layer is shown in Table 1(a). A switchable linear probe layer brings significant improvements in accuracy compared to a slimmable linear probe layer. With only one slimmable layer, the interference between weight-sharing linear layers is unavoidable. It is also possible that the learned representations of pre-trained models do not meet the requirements discussed in Appendix A.1.
slow start and training time Experiments with and without slow start are shown in Table 1(b). The pre-training times of SlimCLR-MoCov2 without and with slow start on 8 Tesla V100 GPUs are 45 and 33 hours, respectively. For reference, the pre-training time of MoCov2 with ResNet-50 is roughly 20 hours. Slow start thus largely reduces the pre-training time. It also avoids the interference between weight-sharing networks at the early stage of training and helps the system reach a steady point quickly during optimization. As a result, sub-networks start from a good initialization and achieve better performance. We also provide ablations of the slow start epoch when training for a longer time in Table 1(f). Setting $S$ to half of the total epochs is a natural and appropriate choice.
online distillation Here we compare two classical distillation losses: mean-square-error (MSE) and KL divergence (KD), and two other distillation losses from recent works: ATKD (guo2022reducing) and DKD (zhao2022decoupled). ATKD reduces the difference in sharpness between distributions of teacher and student to help the student better mimic the teacher. DKD decouples the classical knowledge distillation objective function (2015_kd) into target class and non-target class knowledge distillation to achieve more effective and flexible distillation. In Table 1(c), we can see that these four distillation losses make trivial differences in our context.
Combining the results of distillation with ResNet in Figure 2(e), we find that distillation mainly improves the performance of the full model, and the improvements of sub-networks are relatively small. This violates the purpose of knowledge distillation: distill the knowledge of large models and improve the performance of small ones. This is possibly because sub-networks in a slimmable network already inherit the knowledge from the full model via the sharing weights, and feature distillation cannot help the sub-networks much in this case. The main function of online distillation in our context is to relieve the interference of sub-networks as shown in Figure 2(e).
We also test the influence of different temperatures in online distillation, i.e., related to in equation 4, and results are shown in Table 1(d). Following classical KD (2015_kd), we choose . The choices of temperatures make a trivial difference in our context. SEED (seed) use a small temperature 0.01 for the teacher to get sharp distribution and a temperature 0.2 for the student. BINGO (2022_bingo) adopts a single temperature 0.2. Their choices are quite different from ours, and SlimCLR is more robust to the choice of temperatures. We further provide an analysis of the influence of temperatures in Appendix A.2.
loss reweighting We compared four loss reweighting manners in Table 1(e). They are
|(1). ,||(2). ,|
|(3). ,||(4). ,|
where equals to if the inner condition is true, 0 otherwise. The corresponding weights of ResNet-50 are , , , and . It is clear that a larger weight for the full model helps the system achieve better performance. This demonstrates again that it is important for the full model to lead the optimization direction during training. The differences of the above four loss reweighting strategies are mainly reflected in the sub-networks with small sizes. To ensure the performance of the smallest network, we adopt the reweighting manner (1) in practice.
ablation study with SlimCLR-MoCov3 Besides the above ablation studies with SlimCLR-MoCov2, we also provide empirical analysis for SlimCLR-MoCov3. Different from SlimCLR-MoCov2, SlimCLR-MoCov3 adopts different temperatures and . When applying slow start, SlimCLR-MoCov3 also increases the initial learning rate at the same time.
In Table 2(a), we show the influence of temperature for online distillation. Different from the choice in SlimCLR (MoCov2), it is better for SlimCLR-MoCov3 to choose a temperature for online distillation, which is close to the temperature of contrastive loss. is the default choice of MoCov3 (2021_mocov3), and we do not modify it.
Another interesting phenomenon is that when the number of training epochs of SlimCLR-MoCov3 becomes larger, we need to increase the learning rate when we start to train the sub-networks. The influence of the learning rate for slimmable training in SlimCLR-MoCov3 is shown in Table 2(b). Here the learning rate refers to the base learning rate; the actual learning rate after the warm-up is computed from this base learning rate and the current training step by the cosine schedule. Unlike SlimCLR-MoCov2 and SlimCLR-MoCov3 with fewer epochs, SlimCLR-MoCov3 performs poorly if we do not change the learning rate when training for more epochs. We attribute the difference to the LARS (you2017large) optimizer we adopt for SlimCLR-MoCov3. LARS normalizes the gradients of the layers in a network to avoid imbalanced gradient magnitudes across layers and to ensure convergence when training with very large batch sizes. LARS is sensitive to changes in the learning rate and helps self-supervised models trained with large batches converge fast (2021_mocov3; 2020_simclr). When training for 300 epochs, the full model can quickly reach a local minimum in the first 150 epochs. In this case, a learning rate of 0.6 (half of the base learning rate) is not able to help the system escape the valley and reach a better local minimum. Consequently, a larger learning rate is needed to give the system more momentum. From Table 2(b), we can also see that SlimCLR-MoCov3 with LARS is sensitive to changes in the learning rate, which is consistent with the observations of previous works (2021_mocov3; 2020_simclr).
In this work, we adapt slimmable networks for contrastive learning to obtain pre-trained small models in a self-supervised manner. By using slimmable networks, we can pre-train once and obtain several models of different sizes that suit devices with different computation resources. Besides, unlike previous distillation-based methods, our methods do not require additional training of large teacher models. However, weight-sharing sub-networks in a slimmable network cause severe interference to each other in self-supervised learning. One piece of evidence of such interference that we observed is the gradient imbalance in the backpropagation process. We develop several techniques to relieve the interference of weight-sharing networks during pre-training and linear evaluation. Two specific algorithms are instantiated in this work, i.e., SlimCLR-MoCov2 and SlimCLR-MoCov3. We conduct extensive experiments on ImageNet and achieve better performance than prior art with fewer network parameters and FLOPs.
We thank Boxi Wu and Minghan Chen for their helpful discussions.
Appendix A
A.1 Conditions of inputs given a slimmable linear layer
We consider the conditions of inputs when only using one slimmable linear transformation layer, i.e., consider solving multiple multi-class linear regression problems with shared weights. The parameters of the linear layer are , is the number of classes, where , , , .
The first input for the full model is , where is the number of samples, , , . The second input is the input feature for the sub-model parameterized by .
Generally, we have . We assume that both and have independent columns, i.e., and are invertible. The ground truth is .
The prediction of the full model is , to minimize the sum-of-least-squares loss between prediction and ground truth, we get
By setting the derivative w.r.t. to 0, we get
In the same way, we can get
For , we have
We denote the inverse of is , , as is a symmetric matrix, its inverse is also symmetric, so . For , we have
Then we can get
At the same time
From equation 14, we get
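Independently of the symbolic derivation, the closed-form least-squares solutions can be checked numerically. The sketch below is a toy experiment under assumptions: it uses the standard normal equations, random Gaussian features with hypothetical sizes, and illustrative variable names (`X`, `X1`, `Y`, `W`, `W1`) that need not match the paper's notation. It tests whether the leading rows of the full model's solution coincide with the solution of the sub-model.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d1, c = 50, 6, 3, 4          # samples, full dim, sub-model dim, classes

X = rng.normal(size=(n, d))        # features seen by the full linear layer
X1 = rng.normal(size=(n, d1))      # features seen by the sub-model
Y = rng.normal(size=(n, c))        # regression targets


def least_squares(A, B):
    """Solve min_W ||A W - B||^2 via the normal equations (A has independent columns)."""
    return np.linalg.solve(A.T @ A, A.T @ B)


W = least_squares(X, Y)            # optimum for the full model, shape (d, c)
W1 = least_squares(X1, Y)          # optimum for the sub-model, shape (d1, c)

# A single slimmable layer would need W[:d1] == W1 to be optimal for both problems.
shared_is_optimal = np.allclose(W[:d1], W1)
```

For unrelated random features the two solutions differ, so one shared weight matrix cannot be optimal for both problems in this setting; agreement requires additional structure in the inputs.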
A.2 Influence of temperatures during distillation
In this section, we analyze the influence of temperatures when applying distillation. One of the previous methods, SEED (seed), uses different temperatures for the student and the teacher; without loss of generality, we adopt the same strategy in our analysis. Specifically, we adopt $\tau_t$ for the teacher and $\tau_s$ for the student.
The predicted probability of the student for a certain category $i$ is $p_i = \frac{\exp(z_i/\tau_s)}{\sum_j \exp(z_j/\tau_s)}$, where $z_i$ is the output of the student model, i.e., the logit of the model. The probability of the teacher for a certain category $i$ is $q_i = \frac{\exp(v_i/\tau_t)}{\sum_j \exp(v_j/\tau_t)}$, where $v_i$ is the output of the teacher model. The loss is the KL divergence:
The gradient of w.r.t. is:
The gradient of w.r.t. is:
Following classical KD (2015_kd), we assume temperatures are much larger than logits and use the first-order Taylor series to approximate the exponential function:
where is the number of classes. Following classical KD (2015_kd), we further assume , we can get:
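As a numerical sanity check of this kind of analysis, the sketch below verifies two standard facts about temperature-scaled KL distillation. The notation is an assumption made for illustration: `z` and `v` are student and teacher logits, `tau_s` and `tau_t` their temperatures. First, the gradient of the KL loss with respect to the student logits equals `(p - q) / tau_s`. Second, for zero-mean logits and large temperatures it is close to `(z / tau_s - v / tau_t) / (C * tau_s)`, where `C` is the number of classes.

```python
import numpy as np


def softmax(logits, tau):
    s = np.asarray(logits, dtype=float) / tau
    e = np.exp(s - s.max())
    return e / e.sum()


def kl_loss(z, v, tau_s, tau_t):
    """KL(teacher || student) with separate temperatures."""
    p, q = softmax(z, tau_s), softmax(v, tau_t)
    return float(np.sum(q * (np.log(q) - np.log(p))))


def exact_grad(z, v, tau_s, tau_t):
    """Analytic gradient of the loss with respect to the student logits."""
    return (softmax(z, tau_s) - softmax(v, tau_t)) / tau_s


def numeric_grad(z, v, tau_s, tau_t, eps=1e-6):
    """Central finite differences, used only to check `exact_grad`."""
    z = np.asarray(z, dtype=float)
    grad = np.zeros_like(z)
    for i in range(z.size):
        step = np.zeros_like(z)
        step[i] = eps
        grad[i] = (kl_loss(z + step, v, tau_s, tau_t)
                   - kl_loss(z - step, v, tau_s, tau_t)) / (2 * eps)
    return grad


def high_temp_grad(z, v, tau_s, tau_t):
    """First-order approximation, assuming zero-mean logits and large temperatures."""
    z, v = np.asarray(z, dtype=float), np.asarray(v, dtype=float)
    return (z / tau_s - v / tau_t) / (z.size * tau_s)


z = np.array([1.0, 0.0, -1.0])     # zero-mean student logits (toy values)
v = np.array([0.5, -1.5, 1.0])     # zero-mean teacher logits (toy values)
```

The approximation makes the role of the temperatures explicit: at high temperature the student is pushed to match `v / tau_t` with `z / tau_s`, so the ratio of the two temperatures rescales the target logits.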
A.3.1 Error surface and optimization trajectory
The performance degradation caused by the interference between weight-sharing networks is more severe in self-supervised learning than in supervised learning. We already know that the gradient imbalance is more significant in self-supervised cases. Besides, the optimization process of self-supervised learning may also be harder, so the gradient directions of weight-sharing networks may diverge greatly from each other and make the training process chaotic. In supervised cases, consistent global guidance relieves such divergence.
To better understand the interference between weight-sharing sub-networks in a slimmable network during optimization, we visualize the error surface and optimization trajectory using the method proposed by 2018_vis. We train slimmable networks in both supervised and self-supervised (MoCo (2020_moco)) manners on CIFAR-10 (krizhevsky2009learning). The base network is a ResNet-20×4, which has 4.3M parameters. We train the model for 100 epochs. At the end of each epoch, we save the weights of the full model and calculate the Top-1 accuracy. For self-supervised cases, we use a $k$-NN predictor (wu2018unsupervised) to obtain the accuracy. After training, we calculate the principal components of the differences between the saved weights at each epoch and the weights of the final model following 2018_vis. Then we use the first two principal components as the directions to plot the error surface and optimization trajectory in Figure 4.
The visualization shows that self-supervised learning is harder than supervised learning. In the left error surfaces in Figure 3(a) and Figure 3(c), we can see that the terrain around the valley is flat in supervised cases; by contrast, the terrain around the valley is more complicated in self-supervised cases. From the trajectory of ResNet-20×4 on the left of Figure 3(b) and Figure 3(d), the contours in supervised cases are denser, i.e., two neighboring contours are closer. In other words, the model in self-supervised cases takes more time to achieve the same improvement in accuracy than the model in supervised cases (the gaps between two contour lines are all the same). In supervised cases, clear global guidance helps the model quickly reach the global minimum. In self-supervised cases, it is harder for the model to reach the global minimum quickly without such global guidance.
The visualization shows that the interference of weight-sharing networks is more significant in self-supervised cases. First, in self-supervised cases, weight-sharing networks bring huge changes to the error surface in Figure 3(c). In contrast, the change is not so obvious in supervised cases. Second, the interference between weight-sharing networks in self-supervised cases shifts the model further away from the global minimum (the origin in the visualization), as shown in Figure 3(d). In Figure 3(b), the maximal offsets from the global minimum along the 2nd PCA component are 21.75 and 28.49 for ResNet-20×4 trained alone and trained as a slimmable network, respectively, an increase of 31.0%. For self-supervised cases in Figure 3(d), the corresponding maximal offsets are 13.26 and 18.75, an increase of 41.4%. It is clear that the interference of weight-sharing networks is more significant in self-supervised cases than in supervised cases.
A.3.2 Gradient imbalance
A.4 More implementation details
We adopt the implementation of slimmable networks described in 2019_slim, which has switchable batch normalization layers. Namely, each network in the slimmable network has its own independent batch normalization process.
We train SlimCLR-MoCov2 on 8 Tesla V100 32GB GPUs without synchronized batch normalization across GPUs. The momentum coefficient is during training.
We train SlimCLR-MoCov3 on 8 Tesla V100 32GB GPUs with synchronized batch normalization across GPUs. Synchronized batch normalization is important for MoCov3 to obtain a better performance in linear evaluation. The momentum coefficient is with a cosine schedule when training for 300 epochs. The data augmentations are the same as augmentations of MoCov3 2021_mocov3.