To make deep neural networks (DNNs) practical to deploy, compact DNNs are in demand, and shrinking the performance gap between compact models and large-scale models is of paramount importance. Knowledge distillation (KD) hinton2015distilling is one of the representative schemes for developing compact models by distilling knowledge from a teacher (large-scale model) to a student (compact model). Many paradigms have since been derived, such as offline distillation hinton2015distilling; romero2014fitnets and online distillation zhang2018deep; anil2018large, etc. More importantly, these derivative knowledge distillation paradigms have been widely used in computer vision peng2019few; wu2020learning; zhang2020knowledge; li2017mimicking; mullapudi2019online and natural language processing (NLP) tasks kim2016sequence; zhou2019understanding; tan2019multilingual; sun2019patient. The KD community attributes the success of knowledge distillation to dark knowledge hinton2015distilling and hence mainly focuses on boosting student performance from the perspective of knowledge quality. To capture the nature of the teacher representation, various types of knowledge have been designed to improve knowledge quality, such as response-based knowledge hinton2015distilling; tian2019contrastive, feature-based knowledge romero2014fitnets; zagoruyko2016paying; wang2020exclusivity, and relation-based knowledge yim2017gift; tung2019similarity; passalis2020heterogeneous.
Despite the great success of KD, few works interpret what the student gains from dark knowledge. cheng2020explaining explains the working mechanism of knowledge distillation by quantifying the knowledge encoded in the intermediate layers, finding that knowledge distillation makes the student learn more visual concepts. In this paper, we do not intend to further demystify dark knowledge, but this interesting observation motivates us to rethink knowledge distillation schemes from the perspective of knowledge quantity rather than knowledge quality. The orange arrows in Figure 1 represent the feature information corresponding to different knowledge quantities under one given knowledge type. cho2019efficacy empirically finds that teacher accuracy is a poor predictor of student performance, which implies that a teacher containing more knowledge may not be a better teacher. TAKD mirzadeh2020improved introduces a teacher assistant with intermediate (less) knowledge to make the student learn better. Thus, we suppose that the knowledge quantity of the teacher has a great influence on the efficacy of knowledge distillation.
Based on the above observations, we survey most of the knowledge distillation literature, covering both offline and online paradigms. We find that the teacher transfers its whole knowledge (logits, intermediate feature maps, etc., obtained with the whole feed-forward computation graph) to the student at once, which is counter-intuitive from a human point of view. Imagine a human teacher imparting everything learned over a lifetime to a student at once: would the student absorb it easily, especially in the early learning phases with poor cognitive ability? Hinton describes GLOM hinton2021represent to parse an image into a part-whole hierarchy, which motivates us to model the knowledge quantity of the teacher in a partial-whole paradigm. We further make an intuitive hypothesis that progressively distilling knowledge from the teacher in a partial-whole paradigm boosts the student better (named the partial-whole hypothesis for short).
To verify the partial-whole hypothesis, we propose a partial to whole knowledge distillation (PWKD) paradigm, as Figure 1 shows. We define a new setup of knowledge decomposition to parse the knowledge of the teacher into partial-whole knowledge. Different from GLOM hinton2021represent, which parses an image into independent partial representations, our goal is much simpler: decompose the knowledge representation into monotonically increasing fragments. Specifically, we reconstruct the teacher into multiple weight-shared sub-networks with the same layer depth but different channel widths, which ensures that knowledge quantity positively correlates with channel width (e.g., the 1.0 teacher contains more knowledge than the 0.5 teacher). Then, all sub-networks are trained jointly to obtain decomposed knowledge pieces. In the distillation process, we split distillation into multiple training stages, in which the student extracts different knowledge fragments. To converge to a local minimum in each stage, a cyclical learning rate scheduler smith2017cyclical is applied to the student to accelerate convergence.
We empirically evaluate PWKD on two benchmark datasets, CIFAR-100 and ImageNet, using PWKD as a general plugin compatible with nine mainstream offline distillation approaches (e.g., FitNet romero2014fitnets, AT zagoruyko2016paying, CRD tian2019contrastive, etc.). Comprehensive experiments show that knowledge distillation plugged with PWKD consistently improves performance by a substantial margin across various datasets, distillation approaches, and teacher-student pairs.
We summarize the main contributions of this paper as follows:
For the first time, we analyze knowledge distillation from a new perspective of teacher knowledge quantity instead of obsessing over knowledge quality.
We present the partial-whole hypothesis and put forward a PWKD scheme to verify this hypothesis. Without loss of generality, PWKD is compatible with almost all offline distillation approaches.
Comprehensive experimental results demonstrate that PWKD is a simple yet effective distillation paradigm, which consistently improves distillation baselines by a substantial margin. The ablation study also strongly supports the intuition of this paper. We believe these results will further inspire a new understanding of knowledge distillation.
Knowledge Distillation (KD).
KD hinton2015distilling has developed into two main distillation schemes: offline distillation hinton2015distilling; romero2014fitnets; huang2017like; heo2019knowledge; passalis2018learning and online distillation (self-distillation is regarded as a special case of online distillation) zhang2018deep; anil2018large; zhang2019your; hou2019learning. Offline distillation is the earliest distillation scheme and includes two sequential training phases: first, a teacher model is trained before distillation; second, the pre-trained teacher is used to guide the optimization of the student. However, a teacher with higher performance is not always available, and training a teacher can be computationally expensive; thus online distillation came into being. Online distillation has only one training phase, during which both teacher and student are optimized from scratch and updated simultaneously. To obtain decomposed knowledge, the teacher must be pre-trained, and hence we mainly focus on the offline distillation scheme.
For offline distillation, two questions are inevitable: (i) what knowledge to distill? and (ii) how to distill knowledge? For the first question, there are three categories of knowledge from the teacher: output logits hinton2015distilling; chen2017learning; zhang2019fast, intermediate feature maps romero2014fitnets; huang2017like; passalis2018learning and relation-based knowledge yim2017gift; lee2019graph. Although the three types of knowledge represent different information, they share the same pattern: knowledge is obtained from the whole computation graph of the teacher. Few works study how to better distill knowledge. Opposed to the all-in distillation paradigm, we follow a rationale similar to curriculum distillation and propose a novel partial to whole knowledge distillation (PWKD) paradigm to answer the second question in this paper.
Adaptive Neural Networks (ANNs).
ANNs have attracted increasing attention because of their advantages in computation efficiency, representation power, etc. han2021dynamic. Opposed to static networks, ANNs dynamically adjust their structures according to input samples wang2018skipnet; huang2017multi or constrained computation budgets yu2018slimmable; wang2020resolution. More specifically, ANNs adjust structures from the perspective of depth wang2018skipnet; huang2017multi, channel width yu2018slimmable; gao2018dynamic, spatial resolution wang2020resolution; yang2020resolution or dynamic routing li2020learning. To answer the second question above (how to distill knowledge?), we instantiate knowledge decomposition with the adjustable attribute of ANNs in the channel dimension. Different from most ANNs, which aim for a trade-off between accuracy and computation efficiency, we reconstruct a static teacher into an ANN to obtain decomposed knowledge. The sub-networks preserve both coarse-grained and fine-grained features, and the relationship between knowledge quantity and sub-network channel width can be clearly defined.
The goal of this paper is to build a new distillation setup: the partial to whole knowledge distillation (PWKD) framework. To achieve this goal, we introduce the concept of knowledge decomposition. To the best of our knowledge, this is the first time knowledge decomposition has been involved in the field of knowledge distillation. We reconstruct the teacher into weight-sharing sub-networks with increasing channel widths and train these networks jointly to obtain decomposed knowledge. Then, we specify the interaction between student and teacher. Finally, we put forward the overall PWKD framework, in which the student progressively extracts partial to whole knowledge from the teacher.
Before introducing knowledge decomposition, we first define the knowledge contained in a neural network. To keep the definition simple, we take classical image classification as an example. Given a training dataset D consisting of image and label tuples (x, y), a neural network f with learnable weights W is built to fit the mapping from x to y. Cross entropy is used as the loss function L_CE, and the optimal W is obtained with Stochastic Gradient Descent (SGD) optimization:
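The objective itself did not survive extraction; in the notation just introduced, it is presumably the standard empirical risk minimization:

```latex
W^{*} \;=\; \arg\min_{W}\; \mathbb{E}_{(x,\,y)\sim D}\!\left[\, L_{CE}\big(f(x;\,W),\, y\big) \,\right]
```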
Then, the well-optimized neural network f is used for inference. Given an input x, f propagates forward, and the output logits, intermediate feature maps, and relationships between layers or samples can be defined as the knowledge K.
Because K is obtained with the whole network computation graph before the knowledge output node, we further define K as the "whole knowledge". In the knowledge distillation field, conventional distillation approaches transfer the "whole knowledge" directly to the student throughout the whole student optimization procedure (named the all-in scheme for short). We argue that the all-in scheme is simple but counter-intuitive. Knowledge distillation should obey the rule of curriculum learning, from easy to difficult, just like humans. Intuitively, we can quantify the difficulty of learning from the perspective of knowledge quantity. Thus, it is necessary to decompose the teacher's knowledge before knowledge distillation.
Adaptive neural networks (ANNs) adapt their structures to the input or to computation constraints, which implies that the network representation can be split into multiple parts. Inspired by the partial activation property of ANNs, we can likewise decompose the knowledge of the teacher by reconstructing the teacher into multiple sub-networks. To clearly quantify knowledge, we need to explicitly define the correspondence between the sub-networks and the amount of knowledge. In general, model representation ability and model computational complexity are positively correlated. Therefore, an intuitive way to reconstruct the teacher is to divide it into multiple sub-networks with the same depth but different channel widths, with parameters shared between different sub-networks. As Figure 2 shows, we decompose the knowledge of the vanilla teacher into four channel width groups (0.25x, 0.5x, 0.75x, and 1.0x channel width in all layers of the vanilla teacher). With bigger channel width, the sub-network has stronger prediction performance, which means that the sub-network contains more knowledge. We define the knowledge of the sub-networks as "partial knowledge".
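As a minimal sketch of this decomposition (illustrative NumPy code with a dense layer for clarity; the paper's teachers are convolutional, and the function name is ours), a sub-network at width ratio r simply uses the leading r-fraction of the input and output channels of the shared weight tensor:

```python
import numpy as np

def sliced_layer(W, b, x, width):
    """Apply a fully connected layer at a fraction of its full channel width.

    W: (out_ch, in_ch) shared weight matrix of the full (1.0x) teacher.
    b: (out_ch,) shared bias.
    x: (in_ch,) input activation.
    width: channel width ratio in (0, 1].
    """
    out_ch = max(1, int(round(W.shape[0] * width)))
    in_ch = max(1, int(round(W.shape[1] * width)))
    # The sub-network reuses the leading slice of the shared parameters.
    return x[:in_ch] @ W[:out_ch, :in_ch].T + b[:out_ch]

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4))
b = np.zeros(8)
x = rng.standard_normal(4)

full = sliced_layer(W, b, x, 1.0)   # the whole teacher
half = sliced_layer(W, b, x, 0.5)   # 0.5x sub-network shares W's leading slice
```

Because every sub-network indexes into the same W, gradients from all width groups accumulate on one shared parameter set, which is what makes joint training of the groups possible.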
After clarifying the correspondence between sub-networks and knowledge quantity, we train these sub-networks to obtain representation knowledge. Due to the weight-sharing strategy, we equip each sub-network with a switchable BN and switchable classifier yu2018slimmable (each sub-network has a private BN ioffe2015batch in each layer and a private classifier layer) to prevent performance collapse. Then, we train the 1.0x teacher with the cross entropy loss:
and train the other sub-networks with both the cross entropy loss and a Kullback-Leibler divergence loss:
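The two loss equations are missing from the extracted text; a reconstruction in the standard form implied by the surrounding description (sigma denotes softmax, z the logits, y the label, and the symbols match the factors named just below) might read:

```latex
\mathcal{L}_{1.0} \;=\; L_{CE}\big(\sigma(z_{1.0}),\, y\big),
\qquad
\mathcal{L}_{s} \;=\; L_{CE}\big(\sigma(z_{s}),\, y\big)
  \;+\; \alpha\, T^{2}\,
  \mathrm{KL}\big(\sigma(z_{1.0}/T)\,\|\,\sigma(z_{s}/T)\big),
\quad s \in \{0.25,\, 0.5,\, 0.75\}
```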
where α, s, and T represent the weighting factor, channel width scale, and temperature factor, respectively. To keep it simple, we keep α and T fixed throughout this paper unless otherwise specified. During each training iteration, all sub-networks propagate forward and backward successively; then the accumulated back-propagation gradients are applied to update the teacher weights. After all sub-networks are trained to convergence, we obtain the decomposed knowledge.
Distillation with Decomposed Knowledge
Given the pre-trained teacher optimized with Eq. 2 and Eq. 3, the student can extract the decomposed knowledge from the sub-network with a pre-defined channel width. Because the only difference between the sub-networks and the full model is the channel width in each layer, the decomposed knowledge can be applied to almost all offline distillation approaches. The whole loss function can be formulated as:
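The equation itself is missing from the extracted text; consistent with the description below, it presumably takes the form:

```latex
L_{student} \;=\; L_{CE}\big(\sigma(z_{student}),\, y\big)
  \;+\; \beta\, L_{distill}\big(K_{s},\, K_{student}\big)
```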
where β represents the weighting factor in the student training phase and L_distill stands for a general distillation loss, such as the similarity loss tung2019similarity, contrastive loss tian2019contrastive, etc. The detailed distillation process is depicted in Figure 3 (a).
Overall Partial to Whole Distillation
Partial to whole multiple stage distillation.
Although the student could distill the decomposed knowledge in an arbitrary order, we argue that progressive knowledge extraction boosts the student better. In the above subsection, we explicitly quantified knowledge through model capability: the more channels there are, the more knowledge the sub-network includes. Consequently, we only need to divide the student distillation process into multiple stages, each of which distills one piece of knowledge, in ascending order. We keep the student training epochs the same as in the vanilla distillation process and simply split the total training epochs equally across the pieces of knowledge. As Figure 3 shows, the teacher transfers knowledge to the student progressively, from the smallest knowledge piece to the whole knowledge.
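The epoch-to-stage assignment can be sketched as follows (a hypothetical helper of ours, using the four-width list and the 320-epoch budget that appear in the experiments):

```python
def stage_widths(total_epochs, widths):
    """Assign each training epoch to the sub-network width distilled at that stage.

    The total epoch budget is split equally across the knowledge pieces, which
    are visited in ascending order of channel width (partial to whole).
    """
    per_stage = total_epochs // len(widths)
    return [widths[min(e // per_stage, len(widths) - 1)]
            for e in range(total_epochs)]

schedule = stage_widths(320, [0.25, 0.5, 0.75, 1.0])
# Epochs 0-79 distill from the 0.25x sub-network, 80-159 from 0.5x, and so on.
```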
Cyclical learning rate scheduler.
Most knowledge distillation approaches use a learning rate scheduler that monotonically decreases during the whole student training procedure. That makes sense when the student only needs to mimic the teacher with a single piece of knowledge. In the PWKD framework, however, there are multiple pieces of knowledge, and the student needs to mimic a different piece in each distillation stage. A conventional monotonic learning rate scheduler optimizes the student with a larger learning rate in early distillation stages and a smaller one in later stages, which prevents the student from fully absorbing each piece of knowledge and limits the performance improvement.
To take full advantage of each piece of knowledge, the student must converge to a local minimum in each distillation stage. A cyclical learning rate (CLR) scheduler makes a model converge fast smith2017cyclical, so we equip PWKD with CLR to converge to multiple local minima. For a fair comparison, we use the same number of training epochs as other knowledge distillation approaches; thus, each local minimum must be reached with only 1/G of the total training epochs, where G is the number of knowledge pieces. Specifically, CLR consists of multiple cycles with the same learning rate policy. In each cycle, the learning rate is constrained within a range and varies between the minimum and maximum boundaries with a certain functional form. In this paper, we instantiate the cyclical function as a triangular window, which linearly increases from the minimum to the maximum bound and then linearly decreases back to the minimum, with equal window sizes.
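The triangular window can be sketched as follows (our own minimal implementation, not code from smith2017cyclical):

```python
def triangular_lr(epoch, cycle_len, lr_min, lr_max):
    """Triangular cyclical learning rate.

    Within each cycle the rate rises linearly from lr_min to lr_max over the
    first half-window, then falls linearly back to lr_min over the second half,
    so the student can converge to a local minimum in every distillation stage.
    """
    t = epoch % cycle_len
    half = cycle_len / 2.0
    frac = t / half if t <= half else (cycle_len - t) / half
    return lr_min + (lr_max - lr_min) * frac
```

With G = 4 knowledge pieces and a 320-epoch budget, each cycle would span 80 epochs; using the CLR boundaries reported in the ImageNet experiments (0.0001 and 0.1), the rate peaks at 0.1 in the middle of each distillation stage and returns to 0.0001 at the stage boundary.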
We evaluate partial to whole knowledge distillation (PWKD) on CIFAR-100 krizhevsky2009learning and ImageNet deng2009imagenet. We compare PWKD with standard KD on both datasets and, to demonstrate its generality, extend PWKD to nine popular knowledge distillation methods on CIFAR-100. Finally, we ablate the effects of PWKD and its hyper-parameters.
Results on CIFAR-100
In this section, we verify the effectiveness of PWKD on the CIFAR-100 dataset. First, we reconstruct the teacher into 4 weight-sharing sub-networks with channel widths 0.25x, 0.5x, 0.75x and 1.0x, respectively. Then, all sub-networks are trained jointly in each iteration to obtain knowledge fragments. Specifically, we set ResNet-20x4, ResNet-32x4, ResNet-44x4 and ResNet-56x4 as teachers (x4 means that the channel width of each layer is expanded 4 times), and the test accuracy of the sub-networks is reported in Table 1. For each teacher, the test accuracy of the sub-networks increases monotonically with channel width, which means knowledge quantity is positively correlated with channel width. Compared with the test accuracy of the teacher without reconstruction, we observe that all sub-networks have inferior performance, even the sub-network with channel width 1.0x. This phenomenon may be caused by the conflicting optimization goals of the sub-networks under the weight-sharing strategy.
With the pre-trained sub-networks, we split the distillation process into four stages, equal to the number of sub-networks. In each distillation stage, the student distills from one sub-network, and the sub-networks are applied to the stages in order of channel width from 0.25x to 1.0x. Besides, the cyclical learning rate also consists of four cycles, and each distillation stage shares the same learning rate decay policy to converge to a local minimum.
We apply the above four teachers to five students: ResNet-20, ResNet-32, ResNet-44, ResNet-56 and ResNet-110. We take the test accuracy of the student trained from scratch and of the student distilled from the teacher with KD hinton2015distilling as baselines. We sum up the evaluation results in Figure 4 with four sub-figures. As Figure 4 shows, PWKD consistently outperforms the two baselines for all teacher-student pairs. An interesting observation is that in Figure 4 (c) and Figure 4 (d), ResNet-20 with KD performs worse (by 1.08% and 1.16%) than its vanilla performance (this phenomenon is interpreted as the teacher-student gap in mirzadeh2020improved), yet PWKD still improves ResNet-20 distilled from ResNet-44x4 and ResNet-56x4 by 1.01% and 1.42%, respectively. Most importantly, as Table 1 shows, all sub-networks in PWKD under-perform the vanilla teacher, yet students distilled from them outperform those distilled from the vanilla teacher. These promising results demonstrate that PWKD boosts students better.
Extend to various distillation approaches
Table 2 (fragment). Homogeneous teacher/student pairs: ResNet-20x4/ResNet-20, ResNet-32x4/ResNet-32, ResNet-44x4/ResNet-44, ResNet-56x4/ResNet-56; student accuracy without distillation: 68.15%, 70.00%, 71.17%, 72.07%, respectively.
Table 3 (fragment). Heterogeneous teacher/student pairs: ResNet-20x4/ShuffleNet-V1, ResNet-20x4/MobileNet-V2, VGG13/ShuffleNet-V1, VGG13/ResNet-20; student accuracy without distillation: 71.06%, 62.20%, 71.06%, 68.15%, respectively.
As described in the methodology, PWKD can be used as a general plugin for various knowledge distillation approaches. To support this generality, we extend PWKD to existing distillation approaches from the perspectives of teacher-student architecture style and distillation method. For teacher-student architecture styles, we consider pairs with homogeneous architectures (e.g., ResNet/ResNet he2016deep) and heterogeneous architectures (e.g., ResNet/ShuffleNet zhang2018shufflenet, ResNet/MobileNet sandler2018mobilenetv2, VGGNet simonyan2014very/ShuffleNet and VGGNet/ResNet). For distillation methods, we introduce the following popular algorithms: KD hinton2015distilling, FitNet romero2014fitnets, AT zagoruyko2016paying, SP tung2019similarity, CC peng2019correlation, VID ahn2019variational, RKD park2019relational, PKT passalis2018learning, NST huang2017like and CRD tian2019contrastive. We keep most training settings the same as in the repository ***https://github.com/HobbitLong/RepDistiller provided with CRD tian2019contrastive, except that we train both teacher and student for 320 epochs with batch size 256 and an initial learning rate of 2.
We report the evaluation results of homogeneous and heterogeneous teacher-student pairs in Table 2 and Table 3, respectively. Each table contains four different teacher-student pairs. As Table 2 and Table 3 show, a teacher plugged with PWKD consistently boosts the student across both architecture styles and nine popular distillation methods. In Table 2, PWKD achieves a maximum gain of 2.93%, with an average improvement of 1.67%. In Table 3, PWKD achieves a maximum gain of 4.71%, with an average improvement of 1.43%. Thus, we conclude that PWKD is a general distillation paradigm, independent of architecture style and knowledge category.
Results on ImageNet
Table 4 (header): teacher/student pair, student, vanilla KD, PWKD.
We further evaluate PWKD on ImageNet. ResNet-50 is adopted as the teacher, and we reconstruct it into 4 sub-networks, each with 50 layers but 0.25x, 0.5x, 0.75x and 1.0x channel width, respectively. We train the four sub-networks jointly to obtain the four knowledge pieces. The top-1 accuracy of the sub-networks is reported in Table 1. One difference from the sub-network results on CIFAR-100 is that the ResNet-50 sub-network with 1.0x channel width outperforms vanilla ResNet-50 trained independently from scratch.
When applying KD to the student, we find that the default KD setting (β=0.9 and T=6.0) cannot improve the student, as also observed in cho2019efficacy. After a comparative analysis with CIFAR-100, we conjecture that this phenomenon is caused by lower training accuracy and the larger number of image categories, so assigning more credit to the ground-truth labels may improve the efficacy of KD. Specifically, we set β in Eq. 4 to 0.1. As Table 4 shows, vanilla KD improves the students ResNet-10, ResNet-18 and ResNet-34 by 0.32%, 0.80% and 0.79%, respectively. Following this hyper-parameter setting, we plug vanilla KD with PWKD, with the maximum and minimum boundaries of CLR set to 0.1 and 0.0001. The performance of the students is further improved by 0.85%, 0.68% and 0.60% compared with vanilla KD.
Effects of PWKD and cyclical scheduler
Our proposed distillation scheme consists of PWKD and a cyclical learning rate scheduler. We choose ResNet-20x4/ResNet-20 as the teacher/student pair to ablate the effects of PWKD and the cyclical learning rate (CLR) in this section. As Table 6 shows, vanilla KD improves ResNet-20 by 0.47%, while PWKD without the cyclical learning rate scheduler obtains a 2.29% improvement over the baseline. Further adding the cyclical learning rate scheduler, PWKD improves ResNet-20 by 2.61%, which implies that converging to multiple local minima boosts the student better.
Table 6 (fragment): None/ResNet-20 (baseline): 68.15 (+0.00).
Considering the benefit of CLR, we further investigate other learning rate decay forms within each cycle. Besides the triangular form presented in smith2017cyclical (denoted "rectangle" in Table 7), we instantiate multi-step, linear and cosine schedulers within CLR. We report the comparison results of the different CLR variants in Table 7. All four CLR forms yield at least a 1.48% improvement, and CLR with the rectangle form works best with PWKD, improving ResNet-44 and ResNet-56 by 2.37% and 2.23%, respectively.
| Scheduler | ResNet-44 (%) | ResNet-56 (%) |
| --- | --- | --- |
| Vanilla student | 71.17 (+0.00) | 72.07 (+0.00) |
| PWKD (multi-step) | 72.65 (+1.48) | 73.55 (+1.48) |
| PWKD (cosine) | 72.89 (+1.72) | 73.92 (+1.85) |
| PWKD (linear) | 73.25 (+2.08) | 74.14 (+2.07) |
| PWKD (rectangle) | 73.54 (+2.37) | 74.30 (+2.23) |
Varying groups of knowledge decomposition
Knowledge decomposition is one of the key components of the PWKD framework, and the number of knowledge decomposition groups has a great influence on the efficacy of knowledge distillation. Therefore, we investigate the effect of knowledge decomposition by varying the number of groups G. Given a teacher, we change the number of sub-networks †††Channel width lists [0.5,1.0], [0.25,0.5,0.75,1.0] and [0.25,0.35,0.45,0.55,0.65,0.75,0.85,1.0] correspond to G=2,4,8, respectively., and train the sub-networks jointly to obtain decomposed knowledge. As Figure 6 shows, for all four teacher-student pairs the test accuracy of the student is positively correlated with G. These results make sense and further support the intuition of this paper that smoother distillation boosts the student better. However, considering the trade-off between teacher training computation and student distillation performance, we set G to 4 by default.
Varying knowledge distillation order
As claimed in the introduction, curriculum learning is significant in the distillation process, and we further investigate the knowledge distillation order in this section. We consider two types of distillation order in PWKD: curriculum order (distilling knowledge from the smallest piece to the whole) and reversed order (distilling knowledge from the whole to the smallest piece). We set ResNet-20x4 as the teacher, and ResNet-20, ResNet-32, ResNet-44, ResNet-56 and ResNet-110 as students. We compare PWKD and reversed PWKD in Figure 6; the results show that PWKD consistently outperforms reversed PWKD across all five teacher-student pairs. In some cases (e.g., ResNet-20x4/ResNet-56 and ResNet-20x4/ResNet-110), reversed PWKD even performs worse than vanilla KD hinton2015distilling. We conclude that curriculum learning from partial to whole is promising for knowledge distillation.
We analyze knowledge distillation from the new perspective of teacher knowledge quantity and propose the PWKD paradigm. Comprehensive evaluation results across datasets, distillation approaches and teacher-student pairs demonstrate the generality and effectiveness of PWKD. We further conclude that the partial-whole hypothesis indeed holds and that knowledge quantity has the potential to improve the efficacy of knowledge distillation. In the future, we plan to explore more knowledge decomposition paradigms from the angles of feature resolution, network depth, etc. Relating knowledge distillation and adaptive neural networks is an interesting topic.