1 Introduction
In the past decade, deep learning has achieved great success across many fields of artificial intelligence. A large amount of manually labeled data is the fuel behind this success. However, manually labeled data is expensive and far scarcer than unlabeled data in practice. To relieve the constraint of costly annotations, self-supervised learning
(DosovitskiyFSRB16; wu2018unsupervised; cpcv1; 2020_moco; 2020_simclr) aims to learn transferable representations for downstream tasks by training networks on unlabeled data. Great progress has been made with large models, i.e., models at least as large as ResNet50 (2016_ResNet), which has roughly 25M parameters. For example, ReLICv2 (relicv2) achieves 77.1% accuracy on ImageNet
(ImageNet) under the linear evaluation protocol with ResNet50, outperforming the supervised baseline of 76.5%. In contrast to the success of large-model pretraining, self-supervised learning with small models lags behind. For instance, supervised ResNet18 with 12M parameters achieves 72.1% accuracy on ImageNet, but its self-supervised result with MoCov2 (mocov2) is only 52.5% (seed), a gap of nearly 20%. To close this large performance gap between supervised and self-supervised small models, previous methods (seed; disco; 2022_bingo) mainly focus on knowledge distillation: they transfer the knowledge of a self-supervised large model into small ones. Nevertheless, this methodology requires a two-stage procedure: first train an additional large model, then train a small model to mimic the large one. Besides, one-time distillation only produces a single small model for a specific computation scenario.
An interesting question naturally arises: can we obtain different small models through one-time pretraining to meet various computation scenarios without extra teachers? Inspired by the success of slimmable networks (2019_slim) in supervised learning, we present a novel one-stage method to obtain pretrained small models without adding large models: slimmable networks for contrastive self-supervised learning (SlimCLR). A slimmable network consists of a full network and several weight-sharing subnetworks with different widths, where the width denotes the number of channels in a network. Slimmable networks can execute at various widths, permitting flexible deployment on different computing devices. We can thus obtain multiple networks, including small ones suited to low-compute scenarios, via one-time pretraining. The weight-sharing subnetworks can also inherit knowledge from the large ones via the shared parameters to achieve better generalization.
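The weight-sharing mechanism can be sketched in a few lines. The following is a minimal, illustrative pure-Python sketch (the class and the toy initialization are our simplification, not the paper's implementation): a subnetwork at a given width ratio simply uses the leading slice of the full model's weight matrix.

```python
# Illustrative sketch of weight sharing in a slimmable layer: a subnetwork
# at width ratio w uses the leading slice of the full weight matrix, so its
# parameters are literally a subset of the full model's parameters.

class SlimmableLinear:
    def __init__(self, in_features, out_features):
        self.out_features = out_features
        # Full weight matrix: one row per output channel (toy deterministic init).
        self.weight = [[0.01 * (i + j) for j in range(in_features)]
                       for i in range(out_features)]

    def forward(self, x, width_ratio=1.0):
        # Keep only the leading output channels for the chosen width.
        out_ch = max(1, int(self.out_features * width_ratio))
        return [sum(w * xi for w, xi in zip(row, x))
                for row in self.weight[:out_ch]]

layer = SlimmableLinear(in_features=4, out_features=8)
x = [1.0, 2.0, 3.0, 4.0]
full = layer.forward(x, width_ratio=1.0)   # 8 outputs from the full model
half = layer.forward(x, width_ratio=0.5)   # 4 outputs sharing the same rows
```

In an actual slimmable network, the input channels of intermediate layers are sliced as well, and each width keeps its own batch normalization statistics.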
Weight-sharing networks in a slimmable network interfere with each other when trained simultaneously, and the situation is worse in self-supervised cases. As shown in Figure 1, with supervision, weight-sharing networks have only a slight impact on each other, e.g., the full model achieves 76.6% vs. 76.0% accuracy when trained individually vs. jointly. Without supervision, the corresponding numbers become 67.2% vs. 64.8%. One observed phenomenon of the interference is gradient imbalance: a small proportion of parameters produces dominant gradients during backpropagation. The imbalance occurs because the shared parameters receive gradients from the multiple losses of different networks during optimization. As a result, the main parameters may not be fully optimized. Besides, the gradient directions of weight-sharing networks may diverge from each other and lead to conflicts during training. More explanations and visualizations of this interference can be found in Appendix A.3.
To relieve the gradient imbalance, the main parameters should produce dominant gradients during optimization. To avoid conflicts between the gradient directions of different networks, subnetworks should receive consistent guidance. Following these principles, we introduce three simple yet effective techniques during pretraining to relieve the interference between networks. 1) We adopt a slow start strategy for subnetworks. The networks and the pseudo supervision of contrastive learning are both unstable and fast-changing at the start of training. To avoid interference making the situation worse, we train only the full model at first. After the full model becomes relatively stable, subnetworks can inherit its knowledge via the shared parameters and start from a better initialization. 2) We apply online distillation to keep all subnetworks consistent with the full model and eliminate divergence between networks. The predictions of the full model serve as global guidance for all subnetworks. 3) We reweight the losses of the networks according to their widths to ensure that the full model dominates the optimization process. Besides, we adopt a switchable linear probe layer to avoid interference between weight-sharing linear layers during evaluation, since a single slimmable linear layer cannot realize several complex mappings simultaneously when the data distribution is complicated.
We instantiate two algorithms for SlimCLR with typical contrastive learning frameworks, i.e., MoCov2 and MoCov3 (mocov2; 2021_mocov3). Extensive experiments on the ImageNet (ImageNet) dataset show that our methods achieve significant performance improvements over prior art with fewer parameters and FLOPs.
2 Related Works
Self-supervised learning Self-supervised learning aims to learn transferable representations for downstream tasks from the input data itself. According to 2021_survey, self-supervised methods can be summarized into three main categories according to their objectives: generative, contrastive, and generative-contrastive (adversarial). Methods belonging to the same category can be further classified by their pretext tasks. Given an input, generative methods encode it into an explicit vector and decode that vector to reconstruct the input, e.g., autoregressive (2016_pixelcnn; 2016_pixelrnn) and autoencoding models (1987_ae; 2014_VAE; 2019_bert; 2022_mae). Contrastive methods encode the input into an explicit vector to measure similarity. The two mainstream directions in this category are context-instance contrast (infoMax 2019_infomax, CPC cpcv1, AMDIM 2019_addim) and instance-instance contrast (DeepCluster 2018_deepcluster, MoCo 2020_moco; 2021_mocov3, SimCLR 2020_simclr; 2020_simclrv2, SimSiam 2021_simsiam). Generative-contrastive methods generate fake samples and try to distinguish them from real samples, e.g., DCGANs 2016_dcgans, inpainting 2016_inpaint, and colorization
2016_color. Slimmable networks Slimmable networks were first proposed to achieve instant and adaptive accuracy-efficiency trade-offs on different devices (2019_slim); they can execute at different widths at runtime. Following this pioneering work, universally slimmable networks (2019_unslim) develop systematic training approaches that allow slimmable networks to run at arbitrary widths. AutoSlim (yu2019autoslim) further achieves one-shot architecture search for channel numbers under a given computation budget. MutualNet (2020_mutualnet) trains slimmable networks with different input resolutions to learn multi-scale representations. Dynamic slimmable networks (Li_2022_pami; Li_2021_CVPR) change the number of channels of each layer on the fly according to the input. In contrast to the weight-sharing subnetworks in slimmable networks, some methods train multiple subnetworks with independent parameters (2020_dc). A related concept in network pruning is network slimming (2017_networkslim; 2022_vitslim; pmlrv139wang21e), which aims to achieve channel-level sparsity for computation efficiency.
3 Method
3.1 Description of SlimCLR
We develop two concrete algorithms for SlimCLR with the typical contrastive learning frameworks MoCov2 and MoCov3 (mocov2; 2021_mocov3). As shown in Figure 1(a) (right), a slimmable network with several widths contains multiple weight-sharing networks, each parameterized by its own set of learnable weights. A network with a smaller width shares its weights with the larger ones. Generally, we arrange the widths in descending order, so the first width corresponds to the full model.
We first illustrate the learning process of SlimCLRMoCov2 in Figure 1(a). Given an image sampled uniformly from the dataset and a distribution of image augmentations, SlimCLR produces two views of the image by applying two independently sampled augmentations. For the first view, SlimCLR outputs multiple representations and predictions, one per width; the prediction head is a stack of slimmable linear transformation layers, i.e., a slimmable version of the MLP head in MoCov2 and SimCLR (2020_simclr). For the second view, SlimCLR outputs only a single representation and prediction from the full model. We minimize the InfoNCE (cpcv1) loss to maximize the similarity of the positive pair:
(1) $\mathcal{L}_{\mathrm{NCE}} = -\log \frac{\exp(q \cdot k^{+}/\tau)}{\exp(q \cdot k^{+}/\tau) + \sum_{k^{-}} \exp(q \cdot k^{-}/\tau)}$,
where $q$ is the query prediction, $k^{+}$ is the feature of the positive sample, $\tau$ is a temperature hyperparameter, and $k^{-}$ are features of negative samples. For SlimCLRMoCov2, the negative features come from a queue. Following MoCov2, the queue is updated with the momentum encoder's outputs every iteration during training. The overall objective is the sum of the losses of all networks with various widths:
(2) $\mathcal{L} = \sum_{i=1}^{K} \mathcal{L}_{i}$, where $\mathcal{L}_{i}$ is the InfoNCE loss of the network at the $i$-th width and $K$ is the number of widths.
The momentum encoder parameters $\theta_k$ are updated every iteration: $\theta_k \leftarrow m\theta_k + (1-m)\theta_q$, where $m$ is a momentum coefficient and $\theta_q$ denotes the parameters of the query encoder.
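A minimal sketch of one such training step, combining the InfoNCE loss with the momentum update (plain Python for clarity; the function names and toy vectors are ours):

```python
import math

def info_nce(q, k_pos, negatives, tau=0.2):
    """InfoNCE loss for a single query vector (equation 1, schematic form)."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    logits = [dot(q, k_pos) / tau] + [dot(q, k) / tau for k in negatives]
    m = max(logits)  # shift for a numerically stable log-sum-exp
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[0] - log_denom)

def momentum_update(theta_k, theta_q, m=0.999):
    """Momentum encoder update: theta_k <- m * theta_k + (1 - m) * theta_q."""
    return [m * pk + (1 - m) * pq for pk, pq in zip(theta_k, theta_q)]

q = [1.0, 0.0]
# Loss is small when the positive key matches and negatives are dissimilar.
loss = info_nce(q, k_pos=[1.0, 0.0], negatives=[[0.0, 1.0], [-1.0, 0.0]])
theta_k = momentum_update([1.0, 1.0], [0.0, 0.0], m=0.9)  # -> [0.9, 0.9]
```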
Compared to SlimCLRMoCov2, SlimCLRMoCov3 has an additional projection process: it first projects the representation into another high-dimensional space and then makes predictions. The projector is a stack of slimmable linear transformation layers. SlimCLRMoCov3 also adopts the InfoNCE loss, but the negative samples come from the other samples in the minibatch.
After contrastive learning, we keep only the encoder and abandon the other components.
3.2 Gradient imbalance and solutions
As shown in Figure 1, a vanilla implementation of the above framework leads to severe performance degradation, as weight-sharing networks interfere with each other during pretraining. One piece of evidence for this interference is gradient imbalance.
Gradient imbalance means that a small proportion of parameters produces dominant gradients during backpropagation. To quantify the phenomenon, we show in Figure 3 the ratio of the gradient norms of the main and the minor parameters, together with the ratio of their parameter counts as a reference, where the minor parameters are the small shared part and the main parameters are the rest. Generally, the main parameters should dominate the optimization process and produce large gradient norms, i.e., the gradient-norm ratio should be at least as large as the parameter-count ratio. In Figure 2(a), the two ratios are both around 3.5 when training a normal network. However, in Figures 2(b) and 2(c), when training a slimmable network, gradient imbalance occurs because the shared parameters receive multiple gradients from different losses. To be specific, if the widths of a slimmable network are arranged in descending order, the parameters shared by all subnetworks, which represent only a small part of all parameters, receive gradients from every loss and obtain a large gradient norm:
(3) $\frac{\partial \mathcal{L}}{\partial \theta_{s}} = \sum_{i=1}^{K} \frac{\partial \mathcal{L}_{i}}{\partial \theta_{s}}$,
where $\theta_{s}$ denotes the shared parameters and $\mathcal{L}_{i}$ the loss of the $i$-th network.
Gradient imbalance is more obvious in self-supervised cases. In the supervised case in Figure 2(b), the gradient-norm ratio is close to the parameter-count ratio at first, and the former becomes larger as training proceeds. By contrast, for vanilla SlimCLRMoCov2 in Figure 2(c), the gradient-norm ratio is smaller than the parameter-count ratio most of the time. A conjecture is that instance discrimination is harder than supervised classification; consequently, small networks with limited capacity struggle to converge, produce large losses, and cause more disturbance to the other weight-sharing networks.
When gradient imbalance occurs, the main parameters may not be fully optimized. Besides, the gradient directions of weight-sharing networks may also diverge from each other during backpropagation. Both effects lead to interference between weight-sharing networks and performance degradation.
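The gradient-norm diagnostic can be reproduced with a small sketch (the main/minor split and the numbers are illustrative assumptions, not the paper's measurements): when several losses accumulate on the shared parameters, the main-to-minor gradient-norm ratio falls well below the parameter-count ratio.

```python
import math

def grad_norm_ratio(main_grads, minor_grads):
    """L2 gradient-norm ratio between main and minor parameters."""
    norm = lambda gs: math.sqrt(sum(g * g for g in gs))
    return norm(main_grads) / norm(minor_grads)

# "Minor" parameters: the small part shared by every subnetwork.
# "Main" parameters: the rest, used only by the wider networks.
main = [0.1] * 300
minor = [0.1] * 100
healthy = grad_norm_ratio(main, minor)        # ~ sqrt(300/100) ~ 1.73

# With 4 weight-sharing networks, 4 losses accumulate on the shared part,
# so the main-to-minor norm ratio collapses: gradient imbalance.
minor_accumulated = [g * 4 for g in minor]
imbalanced = grad_norm_ratio(main, minor_accumulated)
```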
To avoid gradient imbalance, one natural idea is to make the main parameters dominate the optimization process, i.e., the two ratios in Figure 3 should both be large. To resolve possible conflicts between gradient directions, the networks should share a consistent optimization goal. To achieve these goals, we develop three simple yet effective techniques during pretraining: slow start, online distillation, and loss reweighting. Besides, we introduce a switchable linear probe layer to avoid interference between weight-sharing linear layers during linear evaluation.
slow start At the start of training, the model and the pseudo supervision of contrastive learning both change quickly, and the optimization procedure is unstable. To avoid interference between weight-sharing networks making the situation harder, we train only the full model for an initial number of epochs. In Figure 2(d), the gradient-norm ratios are large before the slow-start epoch and then drop sharply once the subnetworks join. During this initial phase, the full model can learn from the data without disturbance, and the subnetworks then inherit its knowledge via the shared parameters and start from a better initialization.
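The slow start strategy amounts to a simple gate on which widths are trained at each epoch; a sketch (the width list is an assumption based on the model sizes reported in the experiments):

```python
def active_widths(epoch, slow_start_epoch, widths=(1.0, 0.75, 0.5, 0.25)):
    """Before the slow-start epoch, only the full model (width 1.0) trains;
    afterwards, all weight-sharing networks are trained jointly."""
    return widths[:1] if epoch < slow_start_epoch else widths

total_epochs = 200
slow_start = total_epochs // 2   # half of the total epochs, as in the recipe
early = active_widths(10, slow_start)    # only the full model
late = active_widths(150, slow_start)    # all widths join after slow start
```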
online distillation The full model has the greatest capacity to learn from the data, so its predictions can serve as consistent guidance for all subnetworks and resolve the conflicts between weight-sharing networks. Following 2019_unslim, we minimize the Kullback-Leibler (KL) divergence between the estimated probabilities of the subnetworks and those of the full model:
(4) $\mathcal{L}_{kd} = \sum_{i=2}^{K} D_{\mathrm{KL}}\big(p_{1} \,\big\|\, p_{i}\big)$,
where $p_{1}$ denotes the predicted probabilities of the full model and $p_{i}$ those of the $i$-th subnetwork.
A temperature coefficient controls the sharpness of the distributions used for distillation. In Figure 2(e), we observe that online distillation helps the gradient-norm ratio become larger than the parameter-count ratio. This means online distillation also relieves the gradient imbalance and helps the main parameters dominate the optimization process.
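A sketch of the temperature-scaled distillation loss (the logits and temperature are illustrative; the full model's prediction acts as the teacher):

```python
import math

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp((l - m) / tau) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_div(p_teacher, p_student):
    """KL(teacher || student); the full model's prediction is the teacher."""
    return sum(pt * math.log(pt / ps)
               for pt, ps in zip(p_teacher, p_student) if pt > 0)

full_logits = [2.0, 1.0, 0.1]   # teacher: the full model
sub_logits = [1.5, 1.2, 0.3]    # student: a subnetwork
tau = 1.0                        # illustrative distillation temperature
loss = kl_div(softmax(full_logits, tau), softmax(sub_logits, tau))
# Distilling a distribution against itself gives zero loss.
self_loss = kl_div(softmax(full_logits, tau), softmax(full_logits, tau))
```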
loss reweighting Another straightforward remedy for gradient imbalance is to assign larger weights to networks with larger widths. We adopt a strategy in which the strongest takes control; the weight for the loss of the network with a given width is:
(5) 
where the indicator function equals 1 if the inner condition is true and 0 otherwise. In Figure 2(f), both gradient-norm ratios become large and exceed the parameter-count ratio by a clear margin. Loss reweighting thus helps the main parameters produce large gradient norms and dominate the optimization process.
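The exact weights of equation 5 are not reproduced here; the following sketch only illustrates the general "strongest takes control" idea, with an assumed dominant coefficient for the full model:

```python
def reweighted_loss(losses_by_width, full_weight=3.0):
    """Weighted sum of per-width losses with an assumed dominant coefficient
    (full_weight) on the widest network; other networks get weight 1."""
    widths = sorted(losses_by_width, reverse=True)
    return sum((full_weight if w == widths[0] else 1.0) * losses_by_width[w]
               for w in widths)

losses = {1.0: 2.0, 0.5: 3.0, 0.25: 4.0}
total = reweighted_loss(losses)   # 3.0 * 2.0 + 3.0 + 4.0 = 13.0
```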
The overall pretraining objective of SlimCLR is:
(6) 
switchable linear probe layer
After pretraining, we observe that a single slimmable linear layer is not able to realize several complex mappings from different representations to the same object classes simultaneously. We provide theoretical results about the conditions on the inputs when using a single slimmable linear layer to solve multiple multi-class linear regression problems in Appendix A.1. The failure of a single slimmable linear layer is possibly because the learned representations in Figure 2 do not meet these conditions. We therefore propose a switchable linear probe layer: each network in the slimmable network has its own linear probe layer for linear evaluation. We also apply the online distillation strategy during this fine-tuning stage.
4 Experiments
4.1 Experimental details
Dataset We train SlimCLR on ImageNet (ImageNet), which contains 1.28M training and 50K validation images. During pretraining, we use the training images without labels.
Learning strategies of SlimCLRMoCov2 By default, we use a total batch size of 1024, an initial learning rate of 0.2, and weight decay. We adopt the SGD optimizer with momentum 0.9. A linear warmup and cosine decay policy (2017_PriyaGoyal; 2018_bags_of_tricks) is applied to the learning rate, with 10 warmup epochs. Separate temperatures are used for the InfoNCE loss and for online distillation. Unless otherwise specified, other settings, including data augmentations, queue size (65536), and feature dimension (128), are the same as in MoCov2 (mocov2). The slow-start epoch of the subnetworks is set to half of the total number of epochs.
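The warmup-plus-cosine schedule can be written as follows (a standard formulation; the exact endpoint handling in the paper's code may differ):

```python
import math

def learning_rate(epoch, total_epochs, base_lr=0.2, warmup_epochs=10):
    """Linear warmup to base_lr, then cosine decay over the remaining epochs."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    t = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * t))

lr_start = learning_rate(0, 200)    # small value early in warmup
lr_peak = learning_rate(9, 200)     # reaches base_lr at the end of warmup
lr_late = learning_rate(199, 200)   # decays towards zero
```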
Learning strategies of SlimCLRMoCov3 We use a total batch size of 1024, an initial learning rate of 1.2, and weight decay. We adopt the LARS (you2017large) optimizer and a cosine learning rate policy with 10 warmup epochs. Separate temperatures are used for the contrastive and distillation losses, and the slow-start epoch is half of the total epochs. One difference is that we increase the initial learning rate to 3.2 when the subnetworks start training. Pretraining is done with mixed precision (2018_AMP).
Linear evaluation Following the common linear evaluation protocol (2020_simclr; 2020_moco), we add new linear transformation layers on top of the backbone and freeze the backbone parameters during evaluation. As described above, we also apply online distillation when training these linear layers. For the linear evaluation of SlimCLRMoCov2, we use a total batch size of 1024, 100 epochs, and an initial learning rate of 60, which is divided by 10 at epochs 60 and 80. For the linear evaluation of SlimCLRMoCov3, we use a total batch size of 1024, 90 epochs, and an initial learning rate of 0.4 with a cosine decay policy.
Method  Backbone  Teacher  Top1  Top5  Epochs  #Params  #FLOPs 
Supervised  R50  ✗  76.6  93.2  100  25.6 M  4.1 G 
R34  75.0      21.8 M  3.7 G  
R18  72.1      11.9 M  1.8 G  
R50  ✗  76.0  92.9  100  25.6 M  4.1 G 
R50  74.9  92.3  14.7 M  2.3 G  
R50  72.2  90.8  6.9 M  1.1 G  
R50  64.4  86.0  2.0 M  278 M  
Baseline (individual networks trained with MoCov2)  R50  ✗  67.5    200  25.6 M  4.1 G 
R50  ✗  67.2  87.8  200  25.6 M  4.1 G 
R50  64.3  85.8  14.7 M  2.3 G  
R50  58.9  82.2  6.9 M  1.1 G  
R50  47.9  72.8  2.0 M  278 M  
MoCov2 (mocov2, preprint)  R50  ✗  71.1    800  25.6 M  4.1 G 
MoCov3 (2021_mocov3, ICCV)  R50  72.8    300  
SlimCLRMoCov2  R50  67.4  87.9  200  
SlimCLRMoCov2  R50  70.4  89.6  800  
SlimCLRMoCov3  R50  72.3  90.8  300  
SEED (seed, ICLR)  R34  R50 (67.4)  58.5  82.6  200  21.8 M  3.7 G 
DisCo (disco, ECCV)  R34  R50 (67.4)  62.5  85.4  200  
BINGO (2022_bingo, ICLR)  R34  R50 (67.4)  63.5  85.7  200  
SEED (seed, ICLR)  R34  R502 (77.3)  65.7  86.8  800  
DisCo (disco, ECCV)  R34  R502 (77.3)  67.6  88.6  200  
BINGO (2022_bingo, ICLR)  R34  R502 (77.3)  68.9  88.9  200  
SlimCLRMoCov2  R50  ✗  65.5  87.0  200  14.7 M  2.3 G 
SlimCLRMoCov2  R50  68.8  88.8  800  
SlimCLRMoCov3  R50  69.7  89.4  300  
CompRess (compress, NeurIPS)  R18  R50 (71.1)  62.6    130  11.9 M  1.8 G 
SEED (seed, ICLR)  R18  R502 (77.3)  63.0  84.9  800  
DisCo (disco, ECCV)  R18  R502 (77.3)  65.2  86.8  200  
BINGO (2022_bingo, ICLR)  R18  R502 (77.3)  65.5  87.0  200  
SEED (seed, ICLR)  R18  R152 (74.1)  59.5  65.5  200  
DisCo (disco, ECCV)  R18  R152 (74.1)  65.5  86.7  200  
BINGO (2022_bingo, ICLR)  R18  R152 (74.1)  65.9  87.1  200  
SlimCLRMoCov2  R50  ✗  62.5  84.8  200  6.9 M  1.1 G 
SlimCLRMoCov2  R50  65.6  87.2  800  
SlimCLRMoCov3  R50  67.6  88.2  300  
SlimCLRMoCov2  R50  ✗  55.1  79.5  200  2.0 M  278 M 
SlimCLRMoCov2  R50  57.6  81.5  800  
SlimCLRMoCov3  R50  62.4  84.4  300 
4.2 Results of SlimCLR on ImageNet
Results of SlimCLR on ImageNet are shown in Table 1. Despite our efforts to relieve the interference between weight-sharing networks as described in Section 3.2, slimmable training inevitably causes a performance drop for the full model, and the degradation is more obvious when training for more epochs. However, the same degradation also occurs in the supervised case, and considering the advantages of slimmable training discussed below, it is acceptable.
Compared to MoCov2 with individually trained networks, SlimCLR helps the subnetworks achieve significant performance improvements: for ResNet50 subnetworks at reduced widths, SlimCLRMoCov2 achieves improvements of 3.5% and 6.6% when pretraining for 200 epochs. This verifies that subnetworks can inherit knowledge from the full model via the shared parameters to improve their generalization ability. We can also use a more powerful contrastive learning framework, i.e., SlimCLRMoCov3, to further boost the performance of the subnetworks.
Compared to previous methods that distill the knowledge of large teacher models, the subnetworks of ResNet50 achieve better performance with fewer parameters and FLOPs. SlimCLR also brings small models closer to their supervised counterparts. Furthermore, SlimCLR needs no additional training of large teacher models, and all networks in SlimCLR are trained jointly. With a single pretraining run, we obtain several models with different computation costs that suit different devices. This demonstrates the superiority of adopting slimmable networks for contrastive learning to obtain pretrained small models.
4.3 Discussion
In this section, we discuss the influence of the different components of SlimCLR.
switchable linear probe layer The influence of the switchable linear probe layer is shown in Table 1(a). A switchable linear probe layer brings significant accuracy improvements compared to a single slimmable linear probe layer. With only one slimmable layer, interference between the weight-sharing linear layers is unavoidable. It is also possible that the learned representations of the pretrained models do not meet the conditions we discuss in Appendix A.1.
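The switchable probe amounts to keeping one independent linear head per width; a minimal sketch with assumed feature dimensions (2048 for the full ResNet50 and 1024 at half width):

```python
def make_probe_heads(widths, feat_dims, num_classes):
    """One independent linear head per width (zero-initialized for brevity)."""
    return {w: [[0.0] * feat_dims[w] for _ in range(num_classes)]
            for w in widths}

def probe_forward(heads, width, features):
    """Select the head that matches the executing width and apply it."""
    head = heads[width]
    return [sum(wgt * f for wgt, f in zip(row, features)) for row in head]

widths = (1.0, 0.5)
# Assumed feature dimensions: 2048 for full ResNet50, 1024 at half width.
heads = make_probe_heads(widths, {1.0: 2048, 0.5: 1024}, num_classes=1000)
logits = probe_forward(heads, 0.5, [0.0] * 1024)
```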
slow start and training time Experiments with and without slow start are shown in Table 1(b). The pretraining times of SlimCLRMoCov2 without and with slow start on 8 Tesla V100 GPUs are 45 and 33 hours, respectively; for reference, pretraining MoCov2 with ResNet50 takes roughly 20 hours. Slow start thus largely reduces the pretraining time. It also avoids interference between the weight-sharing networks at the start of training and helps the system reach a steady point quickly during optimization, so the subnetworks start from a good initialization and achieve better performance. We also provide ablations of the slow-start epoch when training for longer in Table 1(f). Setting it to half of the total epochs is a natural and appropriate choice.
online distillation Here we compare two classical distillation losses, mean-square error (MSE) and KL divergence (KD), and two distillation losses from recent works, ATKD (guo2022reducing) and DKD (zhao2022decoupled). ATKD reduces the difference in sharpness between the teacher and student distributions to help the student better mimic the teacher. DKD decouples the classical knowledge distillation objective (2015_kd) into target-class and non-target-class knowledge distillation for more effective and flexible distillation. In Table 1(c), we can see that these four distillation losses make little difference in our context.
Combining the distillation results with ResNet in Figure 2(e), we find that distillation mainly improves the performance of the full model, while the improvements for the subnetworks are relatively small. This runs counter to the usual purpose of knowledge distillation: distilling the knowledge of large models to improve small ones. A likely reason is that the subnetworks in a slimmable network already inherit knowledge from the full model via the shared weights, so feature distillation cannot help them much. The main function of online distillation in our context is to relieve the interference between subnetworks, as shown in Figure 2(e).
We also test the influence of different temperatures in online distillation, i.e., the temperature in equation 4, with results shown in Table 1(d). Following classical KD (2015_kd), we choose a moderate temperature; the exact choice makes little difference in our context. SEED (seed) uses a small temperature of 0.01 for the teacher to obtain a sharp distribution and a temperature of 0.2 for the student, while BINGO (2022_bingo) adopts a single temperature of 0.2. Their choices differ considerably from ours, and SlimCLR is more robust to the choice of temperatures. We further analyze the influence of temperatures in Appendix A.2.






loss reweighting We compare four loss-reweighting schemes in Table 1(e), where the indicator function in each scheme equals 1 if its inner condition is true and 0 otherwise, and the schemes differ in how much weight the full model receives. It is clear that a larger weight for the full model helps the system achieve better performance, which again demonstrates that it is important for the full model to lead the optimization direction during training. The differences among the four reweighting strategies are mainly reflected in the subnetworks with small widths. To ensure the performance of the smallest network, we adopt scheme (1) in practice.
ablation study with SlimCLRMoCov3 Besides the above ablation studies with SlimCLRMoCov2, we also provide an empirical analysis of SlimCLRMoCov3. Different from SlimCLRMoCov2, SlimCLRMoCov3 adopts different temperatures for the contrastive and distillation losses. When applying slow start, SlimCLRMoCov3 also increases the initial learning rate at the same time.
In Table 2(a), we show the influence of the temperature for online distillation. Different from the choice in SlimCLRMoCov2, it is better for SlimCLRMoCov3 to choose a distillation temperature close to the temperature of the contrastive loss. The latter is the default choice of MoCov3 (2021_mocov3), and we do not modify it.
Another interesting phenomenon is that when the number of training epochs of SlimCLRMoCov3 grows, we need to increase the learning rate when we start to train the subnetworks. The influence of the learning rate on slimmable training in SlimCLRMoCov3 is shown in Table 2(b). Here the learning rate refers to the base learning rate; the immediate learning rate after warmup is computed from this base learning rate and the current training step by the cosine schedule. Different from SlimCLRMoCov2 and from SlimCLRMoCov3 with fewer epochs, SlimCLRMoCov3 performs poorly if we do not change the learning rate when training for more epochs. We attribute this difference to the LARS (you2017large) optimizer adopted for SlimCLRMoCov3. LARS normalizes the gradients of the layers in a network to avoid imbalanced gradient magnitudes across layers and to ensure convergence when training with very large batch sizes. It is sensitive to changes in the learning rate and helps self-supervised models with large batches converge quickly (2021_mocov3; 2020_simclr). When training for 300 epochs, the full model can quickly reach a local minimum in the first 150 epochs. In this case, a learning rate of 0.6 (half of the base learning rate) is not able to help the system escape the valley and reach a better local minimum; consequently, a larger learning rate is needed to give the system more momentum. From Table 2(b), we can also see that SlimCLRMoCov3 with LARS is sensitive to the change in learning rate, consistent with observations in previous works (2021_mocov3; 2020_simclr).


5 Conclusion
In this work, we adapt slimmable networks to contrastive learning to obtain pretrained small models in a self-supervised manner. With slimmable networks, we can pretrain once and obtain several models of different sizes suited to devices with different computation resources. Besides, unlike previous distillation-based methods, our method does not require an additional training process for large teacher models. However, the weight-sharing subnetworks in a slimmable network interfere severely with each other in self-supervised learning, one piece of evidence being the gradient imbalance during backpropagation. We develop several techniques to relieve this interference during pretraining and linear evaluation, and instantiate two algorithms, SlimCLRMoCov2 and SlimCLRMoCov3. Extensive experiments on ImageNet show better performance than prior art with fewer network parameters and FLOPs.
Acknowledgments
We thank Boxi Wu and Minghan Chen for helpful discussions.
References
Appendix A Appendix
a.1 Conditions of inputs given a slimmable linear layer
We consider the conditions on the inputs when only one slimmable linear transformation layer is used, i.e., when solving multiple multi-class linear regression problems with shared weights. Denote the parameters of the linear layer by $W \in \mathbb{R}^{d \times c}$, where $c$ is the number of classes and $d$ is the input feature dimension. The first input, for the full model, is $X \in \mathbb{R}^{n \times d}$, where $n$ is the number of samples. The second input, $X_s \in \mathbb{R}^{n \times d_s}$ with $d_s \le d$, is the input feature for the sub-model, which is parameterized by the shared sub-matrix $W_s \in \mathbb{R}^{d_s \times c}$ of $W$. We assume that both $X$ and $X_s$ have independent columns, i.e., $X^{\top}X$ and $X_s^{\top}X_s$ are invertible. The ground truth is $Y \in \mathbb{R}^{n \times c}$.
The prediction of the full model is $\hat{Y} = XW$. To minimize the sum-of-least-squares loss between the prediction and the ground truth, we solve
(7) $\min_{W} \lVert XW - Y \rVert_F^2$.
By setting the derivative w.r.t. $W$ to 0, we get
(8) $W = (X^{\top}X)^{-1}X^{\top}Y$.
In the same way, we can get
(9) $W_s = (X_s^{\top}X_s)^{-1}X_s^{\top}Y$.
For , we have
(10) 
We denote the inverse of is , , as is a symmetric matrix, its inverse is also symmetric, so . For , we have
(11) 
Then we can get
(12)  
(13)  
(14)  
(15) 
At the same time
(16) 
and
(17) 
From equation 14, we get
(18)  
(19) 
a.2 Influence of temperatures during distillation
In this section, we analyze the influence of temperatures when applying distillation. One of the previous methods, SEED (seed), uses different temperatures for the student and the teacher; without loss of generality, we adopt the same strategy in our analysis, with temperature $\tau_t$ for the teacher and $\tau_s$ for the student. The student's predicted probability for a category is the softmax of its logits scaled by $\tau_s$, and the teacher's probability is defined analogously from the teacher's logits with $\tau_t$. The loss is the KL divergence between the two distributions:(24)
The gradient of w.r.t. is:
(25)  
(26) 
Similarly,
(27)  
(28) 
The gradient of w.r.t. is:
(29)  
(30)  
(31) 
Following classical KD (2015_kd), we assume the temperatures are much larger than the logits and use the first-order Taylor series to approximate the exponential function, $e^{z/\tau} \approx 1 + z/\tau$:(32)
where $N$ is the number of classes. Following classical KD (2015_kd), we further assume the logits are zero-mean, and we get:(33)
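For reference, under the same assumptions the classical single-temperature KD analysis (2015_kd) reduces to logit matching; the sketch below states that standard result, not the exact form of equation 33:

```latex
% Classical KD with a single temperature T (2015_kd): with the first-order
% approximation e^{z/T} \approx 1 + z/T and zero-mean logits, the gradient
% of the distillation loss w.r.t. a student logit reduces to logit matching:
\frac{\partial \mathcal{L}}{\partial z_i^{s}}
  \approx \frac{1}{T}\left(p_i^{s} - p_i^{t}\right)
  \approx \frac{1}{N T^{2}}\left(z_i^{s} - z_i^{t}\right)
```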




a.3 Visualization
a.3.1 Error surface and optimization trajectory
The performance degradation caused by the interference between weight-sharing networks is more severe in self-supervised learning than in supervised learning. We have already seen that the gradient imbalance is more significant in self-supervised cases. Besides, the optimization process of self-supervised learning may itself be harder, so the gradient directions of weight-sharing networks can diverge greatly from each other and make the training process chaotic. In supervised cases, consistent global guidance relieves such divergence.
To better understand the interference between weight-sharing subnetworks in a slimmable network during optimization, we visualize the error surface and the optimization trajectory using the method proposed by 2018_vis. We train slimmable networks in both supervised and self-supervised (MoCo (2020_moco)) manners on CIFAR10 (krizhevsky2009learning). The base network is a ResNet20 (4× width), which has 4.3M parameters, trained for 100 epochs. At the end of each epoch, we save the weights of the full model and compute the Top-1 accuracy; for the self-supervised case, we use a NN predictor (wu2018unsupervised) to obtain the accuracy. After training, following 2018_vis, we compute the principal components of the differences between the weights saved at each epoch and the weights of the final model, and use the first two principal components as the directions to plot the error surface and optimization trajectory in Figure 4.
The visualization shows that self-supervised learning is harder than supervised learning. In the error surfaces in Figures 3(a) and 3(c), the terrain around the valley is flat in the supervised case, whereas it is more complicated in the self-supervised case. From the trajectories of ResNet20 (4× width) on the left of Figures 3(b) and 3(d), the contours in the supervised case are denser, i.e., neighboring contour lines are closer. Since the accuracy gap between any two contour lines is the same, the self-supervised model needs more time to achieve the same accuracy improvement. In supervised cases, clear global guidance helps the model quickly reach the global minimum; without such guidance, it is harder for the self-supervised model to do so.
The visualization also shows that the interference of weight-sharing networks is more significant in self-supervised cases. First, in self-supervised cases, weight-sharing networks substantially change the error surface in Figure 3(c), whereas the change is far less obvious in supervised cases. Second, the interference between weight-sharing networks in self-supervised cases shifts the model further away from the global minimum (the origin in the visualization), as shown in Figure 3(d). In Figure 3(b), the maximal offsets from the global minimum along the second PCA component are 21.75 and 28.49 for ResNet20 (4× width) trained individually and as a slimmable network, an increase of 31.0%. For the self-supervised cases in Figure 3(d), the corresponding offsets are 13.26 and 18.75, an increase of 41.4%. It is clear that the interference of weight-sharing networks is more significant in self-supervised than in supervised cases.
a.3.2 Gradient imbalance
a.4 More implementation details
Slimmable networks We adopt the implementation of slimmable networks described in 2019_slim, which uses switchable batch normalization layers: each network in the slimmable network has its own independent batch normalization statistics.
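A minimal sketch of the switchable batch normalization bookkeeping (pure Python and illustrative; real implementations normalize per channel over a batch):

```python
import math

class SwitchableBatchNorm:
    """Per-width normalization statistics: features produced at different
    widths follow different distributions, so each width keeps its own
    running mean and variance instead of sharing one set."""
    def __init__(self, widths, eps=1e-5):
        self.eps = eps
        self.stats = {w: {"mean": 0.0, "var": 1.0} for w in widths}

    def forward(self, x, width, momentum=0.1):
        mean = sum(x) / len(x)
        var = sum((v - mean) ** 2 for v in x) / len(x)
        s = self.stats[width]  # only this width's running stats are updated
        s["mean"] = (1 - momentum) * s["mean"] + momentum * mean
        s["var"] = (1 - momentum) * s["var"] + momentum * var
        return [(v - mean) / math.sqrt(var + self.eps) for v in x]

bn = SwitchableBatchNorm(widths=(1.0, 0.5))
out = bn.forward([1.0, 2.0, 3.0], width=1.0)
# The statistics for width 0.5 are untouched by a pass at width 1.0.
```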
SlimCLRMoCov2 We train SlimCLRMoCov2 on 8 Tesla V100 32GB GPUs without synchronized batch normalization across GPUs. The momentum coefficient is kept constant during training.
SlimCLRMoCov3 We train SlimCLRMoCov3 on 8 Tesla V100 32GB GPUs with synchronized batch normalization across GPUs; synchronized batch normalization is important for MoCov3 to obtain good linear evaluation performance. The momentum coefficient follows a cosine schedule when training for 300 epochs. The data augmentations are the same as those of MoCov3 (2021_mocov3).