This paper presents a method to interpret the success of knowledge distillation by quantifying and analyzing the task-relevant and task-irrelevant visual concepts encoded in intermediate layers of a deep neural network (DNN). More specifically, we propose the following three hypotheses. 1. Knowledge distillation makes the DNN learn more visual concepts than learning from raw data. 2. Knowledge distillation ensures that the DNN is prone to learning various visual concepts simultaneously, whereas a DNN learning from raw data learns visual concepts sequentially. 3. Knowledge distillation yields more stable optimization directions than learning from raw data. Accordingly, we design three types of mathematical metrics to evaluate feature representations of the DNN. In experiments, we diagnosed various DNNs and verified the above hypotheses.
The success of knowledge distillation has been demonstrated in various studies [31, 45, 11]. It transfers knowledge from a well-learned deep neural network (DNN), namely the teacher network, to another DNN, namely the student network. However, explaining how and why knowledge distillation outperforms learning from raw data remains a challenge.
In this work, we aim to analyze the success of knowledge distillation from a new perspective, i.e. quantifying the knowledge encoded in intermediate layers of a DNN. We quantify and compare the amount of knowledge encoded in a DNN learned via knowledge distillation and in a DNN learned from raw data, respectively. Here, the DNN learned from raw data is termed the baseline network. In this research, the amount of knowledge in a specific layer is measured as the number of visual concepts (e.g. object parts such as tails and heads), as shown in Figure 1. These visual concepts activate the feature map of this layer and are used for prediction.
We design three types of mathematical metrics to analyze task-relevant and task-irrelevant visual concepts. Then, these metrics are used to quantitatively verify three hypotheses as follows.
Hypothesis 1: Knowledge distillation makes the DNN learn more visual concepts. In this paper, a visual concept is defined as an image region whose information is significantly less discarded and is mainly used by the DNN. We distinguish visual concepts that are relevant to the task from other, task-irrelevant concepts. For implementation, let us take the classification task as an example. As shown in Figure 1, visual concepts on the foreground are usually regarded as task-relevant, while those on the background are considered task-irrelevant.
According to the information-bottleneck theory [41, 36], DNNs tend to expose task-relevant visual concepts and discard task-irrelevant concepts to learn discriminative features. Compared to the baseline network (learned from raw data), a well-trained teacher network is usually considered to encode more task-relevant visual concepts and/or fewer task-irrelevant concepts. Because the student network mimics the logic of the teacher network, the student network is supposed to contain more task-relevant visual concepts and fewer task-irrelevant concepts.
Hypothesis 2: Knowledge distillation ensures that the DNN is prone to learning various visual concepts simultaneously. In comparison, the baseline network tends to learn visual concepts sequentially, i.e. learning different concepts in different epochs.
Hypothesis 3: Knowledge distillation usually yields more stable optimization directions than learning from raw data. When learning from raw data, the DNN usually tries to model various visual concepts in early epochs and then discards non-discriminative ones in later epochs [41, 36], which leads to unstable optimization directions. In this paper, we name this phenomenon of inconsistent optimization directions across different epochs "detours" for short, i.e. the DNN tries to model various visual concepts in early epochs and discards non-discriminative ones later. In comparison, during knowledge distillation, the teacher network directly guides the student network to target visual concepts without significant detours. Let us take the classification of birds as an example. The baseline network tends to extract features from the head, belly, tail, and tree-branch parts in early epochs, and later discards features from the tree branch. In contrast, the student network directly learns features from the head and belly parts with fewer detours.
Methods: We propose three types of mathematical metrics to quantify visual concepts hidden in intermediate layers of a DNN, and analyze how visual concepts are learned during the learning procedure. These metrics measure 1. the number of visual concepts, 2. the learning speed of different concepts, and 3. the stability of optimization directions, respectively. We use these metrics to compare the student network and the baseline network in order to verify the three hypotheses. More specifically, the student network is learned via knowledge distillation, and the baseline network, learned from raw data, is constructed to have the same architecture as the student network.
Note that visual concepts should be quantified without manual annotations, for two main reasons. 1) It is impossible for people to annotate all kinds of potential visual concepts in the world. 2) For rigorous research, the subjective bias in human annotations should not affect the quantitative metric. To this end, [14, 26] leverage the entropy to quantify visual concepts encoded in an intermediate layer.
Contributions: Our contributions can be summarized as follows.
1. We propose a method to quantify "dark-matter" concepts encoded in intermediate layers of a DNN.
2. Based on the quantification of visual concepts, we propose three types of metrics to diagnose and interpret the superior performance of knowledge distillation from the view of knowledge representations encoded in a DNN.
3. Three hypotheses about knowledge distillation are proposed and verified, which shed light on the explanation of knowledge distillation.
Although deep neural networks have exhibited superior performance in various tasks, they are still regarded as black boxes. Previous studies of interpreting DNNs can be roughly summarized into semantic explanations and mathematical explanations of the representation capacity.
Semantic explanations for DNNs:
An intuitive way to interpret DNNs is to visualize visual concepts encoded in intermediate layers of DNNs. Feature visualization methods usually show concepts that significantly activate a specific neuron of a certain layer. Gradient-based methods [47, 37, 46, 27] used gradients of outputs w.r.t. the input image to measure the importance of intermediate-layer activation units or input units. Inversion-based methods inverted feature maps of convolutional layers into images. From visualization results, people can roughly understand visual concepts encoded in intermediate layers of DNNs. For example, filters in low layers usually encode simple visual concepts such as edges and textures, while filters in high layers usually encode concepts such as objects and patterns.
Other methods usually estimated the pixel-wise attribution/importance/saliency on an input image, which measured the influence of each input pixel on the final output [30, 25, 20, 9]. Some methods explored the saliency of the input image using intermediate-layer features, such as CAM, Grad-CAM, and Grad-CAM++. Zhou et al. computed the actual image-resolution receptive field of neural activations in a feature map.
Bau et al. disentangled feature representations into semantic concepts using human annotations. Fong and Vedaldi demonstrated that a DNN used multiple filters to represent a specific semantic concept. Zhang et al. used an explanatory graph and a decision tree to represent hierarchical compositional part representations in CNNs. TCAV measured the importance of user-defined concepts to classification.
Another direction of explainable AI is to learn a DNN with interpretable feature representations in an unsupervised or weakly-supervised manner. In the capsule network, activities of each capsule encoded various properties. The interpretable CNN learned object-part features without part annotations. InfoGAN and β-VAE learned interpretable factorised latent representations for generative networks.
In contrast, in this research, the quantification of intermediate-layer visual concepts requires us to design metrics with coherency and generality. I.e., unlike previous studies that compute importance/saliency/attention [47, 37, 46, 27] based on heuristic assumptions, or that use massive human-annotated concepts to explain network features, we quantify visual concepts using the conditional entropy of the input. The entropy is a generic tool with strong connections to various theories, e.g. the information-bottleneck theory [41, 36]. Moreover, the coherency allows the same metric to ensure fair comparisons between layers of a DNN and between DNNs learned in different epochs.
Mathematical explanations for the representation capacity of DNNs: Evaluating the representation capacity of DNNs mathematically provides a new perspective for explanations. The information-bottleneck theory [41, 36] used the mutual information to evaluate the representation capacity of DNNs [13, 43]. The stiffness was proposed to diagnose the generalization of a DNN. The CLEVER score was used to estimate the robustness of neural networks. The Fourier analysis was applied to explain the generalization of DNNs learned by stochastic gradient descent. Novak et al. investigated the correlation between the sensitivity of trained neural networks and generalization. Canonical correlation analysis (CCA) was used to measure the similarity between representations of neural networks. Chen et al. proposed instance-wise feature selection via mutual information for model interpretation. Zhang et al. explored knowledge consistency between DNNs.
Different from previous methods, our research aims to bridge the gap between mathematical explanations and semantic explanations. We use the entropy of the input to measure the number of visual concepts in a DNN. Furthermore, we quantify visual concepts on the background and the foreground of the input image, explore whether a DNN learns various concepts simultaneously or sequentially, and analyze the stability of optimization directions.
Knowledge distillation: Knowledge distillation is a popular and successful technique for knowledge transfer. Hinton et al. considered that "soft targets" led to the superior performance of knowledge distillation. Furlanello et al. explained the dark knowledge transferred from the teacher to the student as importance weighting.
From a theoretical perspective, Lopez-Paz et al.  interpreted knowledge distillation as a form of learning with privileged information. Phuong et al.  explained the success of knowledge distillation from the view of data distribution, optimization bias, and the size of the training set.
However, to the best of our knowledge, mathematical explanations for knowledge distillation are rare. In this paper, we interpret knowledge distillation from a new perspective, i.e. mathematically quantifying, analyzing, and comparing the visual concepts encoded in intermediate layers of DNNs learned by knowledge distillation and DNNs learned purely from raw data.
In this section, we are given a pre-trained DNN (i.e. the teacher network) and then distill it into another DNN (i.e. the student network). In this way, we aim to compare and explain the difference between the student network and the DNN learned from raw data (i.e. the baseline network). To simplify the story, we limit our attention to the task of object classification. Let x denote the input image, and let f_T(x) and f_S(x) denote intermediate-layer features of the teacher network and its corresponding student network, respectively. Knowledge distillation is conducted to force f_S(x) to approximate f_T(x). Classification results of the teacher and the student are given as ŷ_T and ŷ_S, respectively.
We compare visual concepts encoded in the baseline network and those in the student network to explain knowledge distillation. For a fair comparison, the baseline network has the same structure as the student network, and implementation details are shown in Section 4.1.
According to the information-bottleneck theory [41, 36], the information of the input image is gradually discarded through layers. [14, 26] proposed a method to quantify the input information encoded in a specific intermediate layer of a DNN, i.e. measuring how much input information is neglected when the DNN extracts the feature of this layer. The information discarding is formulated as the conditional entropy H(X′) of the input, given the intermediate-layer feature f = f(x), as follows.
X′ denotes a set of images that correspond to the concept of a specific object instance. The concept of the object is assumed to be represented by a small range of features, ‖f(x′) − f(x)‖² ≤ ε, where ε is a small positive scalar. Each image in X′ is modeled as a perturbed input x′ = x + δ, where δ is assumed to follow a Gaussian distribution, δ ∼ N(0, Σ = diag(σ_1², …, σ_n²)), and σ_i controls the magnitude of the perturbation at each i-th pixel. n denotes the number of pixels of the input image. In this way, the assumption of the Gaussian distribution ensures that the entropy of the entire image can be decomposed into pixel-level entropies as follows.

H(X′) = Σ_{i=1}^{n} H_i,  where H_i = log σ_i + (1/2) log(2πe).   (2)
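The pixel-level decomposition above can be sketched in code. This is a minimal illustration, assuming the per-pixel perturbation scales σ_i have already been learned (e.g., by maximizing the entropy subject to the feature-distance constraint); the function and variable names below are hypothetical, not from the paper's implementation.

```python
import numpy as np

def pixelwise_entropy(sigma):
    """Per-pixel entropy of the Gaussian perturbation delta_i ~ N(0, sigma_i^2).

    Under the independent-Gaussian assumption, the entropy of the whole
    perturbed image decomposes into a sum of pixel-level entropies:
        H_i = log(sigma_i) + 0.5 * log(2 * pi * e)
    """
    return np.log(sigma) + 0.5 * np.log(2 * np.pi * np.e)

# Hypothetical example: a 4x4 map of learned perturbation scales.
sigma = np.full((4, 4), 0.1)
H = pixelwise_entropy(sigma)           # pixel-wise entropy map H_i
total_entropy = H.sum()                # H(X') = sum_i H_i
```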
In this section, we aim to compare the number of visual concepts that are encoded in the baseline network and those in the student network, so as to verify the above hypothesis.
Using annotated concepts or not: For comparison, we try to define and quantify visual concepts encoded in the intermediate layer of a DNN (either the student network or the baseline network). Note that, in this study, we do not study visual concepts defined by human annotations. For example, Bau et al. defined visual concepts of objects, parts, textures, scenes, materials, and colors by using manual annotations. However, this research requires us to use and quantify visual concepts without explicit names, which cannot be accurately labeled. These visual concepts are usually referred to as "dark matters".
There are mainly two reasons to use dark-matter visual concepts instead of traditional semantic visual concepts. 1. There exist no standard definitions for semantic visual concepts, and the taxonomy of semantic visual concepts may have significant bias. 2. The cost of annotating all visual concepts is usually unaffordable.
Metric: In this paper, we quantify dark-matter visual concepts from the perspective of information theory. Given a pre-trained DNN, a set of training images I, and an input image x, let us focus on the pixel-wise information discarding w.r.t. the intermediate-layer feature f(x). High pixel-wise entropies H_i, shown in Equation (2), indicate that the DNN neglects more information of these pixels, whereas the DNN mainly utilizes pixels with low entropies to compute the feature f(x). In this way, image regions with low pixel-wise entropies can be considered to represent relatively valid visual concepts. For example, the head and wings of the bird in Figure 2 are mainly used by the DNN for fine-grained classification. Therefore, metrics are defined as follows.
N_bg = Σ_{i∈Λ_bg} 1(H̄ − H_i > b),  N_fg = Σ_{i∈Λ_fg} 1(H̄ − H_i > b),  λ = N_fg / (N_fg + N_bg),   (3)

where N_bg and N_fg denote the number of visual concepts encoded on the background and the foreground, respectively. Λ_bg and Λ_fg are the sets of pixels on the background and the foreground of the input image x, respectively. 1(·) is the indicator function: if the condition inside is valid, it returns 1, and otherwise 0. H̄ = E_{i∈Λ_bg}[H_i] denotes the average entropy value of the background, which measures the significance of information discarding w.r.t. background pixels. Pixels on the background are considered to represent task-irrelevant visual concepts; therefore, we can use H̄ as a baseline entropy. Image regions with entropy values significantly lower than H̄, i.e. H̄ − H_i > b, can be considered valid visual concepts, where b is a positive scalar. The metric λ is used to measure the discriminative power of features. As shown in Figure 2, in order to improve the stability and efficiency of the computation, H_i is computed in grids, i.e. all pixels in each local grid share the same σ_i. The dark color in Figure 2 indicates a low entropy value H_i.
Statistically, visual concepts on the foreground are usually task-relevant, while those on the background are mainly task-irrelevant. In this way, a well-learned DNN is supposed to encode a large number of visual concepts on the foreground and very few on the background. Thus, a larger ratio of foreground concepts to all concepts indicates that the DNN is more discriminative.
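The concept-counting metrics above can be sketched as follows. This is a minimal sketch assuming the pixel-wise entropy map H (Equation (2)) and a binary foreground mask are given; the name `count_concepts` and the toy data are hypothetical.

```python
import numpy as np

def count_concepts(H, fg_mask, b):
    """Count visual concepts from a pixel-wise entropy map H.

    A pixel is treated as a valid concept if its entropy is significantly
    lower than the average background entropy: mean_bg(H) - H_i > b.
    fg_mask is a boolean array (True = foreground); b is a positive scalar.
    """
    h_bg_mean = H[~fg_mask].mean()          # baseline entropy of the background
    is_concept = (h_bg_mean - H) > b        # low-entropy pixels = valid concepts
    n_fg = int(is_concept[fg_mask].sum())   # task-relevant concepts
    n_bg = int(is_concept[~fg_mask].sum())  # task-irrelevant concepts
    ratio = n_fg / max(n_fg + n_bg, 1)      # discriminative power of features
    return n_fg, n_bg, ratio

# Toy entropy map: one low-entropy foreground pixel, uniform background.
H = np.array([[1.0, 1.0], [0.2, 1.0]])
fg = np.array([[False, False], [True, True]])
n_fg, n_bg, ratio = count_concepts(H, fg, b=0.5)
```

A larger `ratio` indicates that most concepts lie on the foreground, i.e. more discriminative features.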
Generality and coherency: The design of a metric should consider both generality and coherency. Generality means that the metric should have strong connections to existing mathematical theories. Coherency ensures comprehensive and fair comparisons across different cases. In this paper, we aim to quantify and compare the number of visual concepts between different network architectures and between different layers. As discussed in [14, 26], existing methods of explaining DNNs usually depend on specific network architectures or specific tasks, such as gradient-based methods [47, 37, 46, 27], perturbation-based methods [9, 20], and inversion-based methods. Unlike previous methods, the conditional entropy of the input ensures fair comparisons between different network architectures and between different layers, as reported in Table 1.
(Rows of Table 1: gradient-based methods [47, 37, 46, 27] and perturbation-based methods [9, 20] are marked "No" on all three compared criteria.)
In this section, we propose two metrics to verify Hypothesis 2. Given a set of training images I, let f^(1), f^(2), …, f^(M) denote the DNNs learned after each of M epochs. This DNN can be either the student network or the baseline network; f^(M), obtained after the last epoch, is regarded as the final DNN. For each specific image x, we quantify the number of visual concepts on the foreground, N_fg(x), encoded in the DNNs learned after different epochs.
In this way, whether or not a DNN learns visual concepts simultaneously can be analyzed in the following two terms: 1. whether the number of foreground concepts increases quickly along with the epoch number; 2. whether the foreground concept counts of different images increase simultaneously. The first term indicates whether a DNN learns various visual concepts of a specific image quickly, while the second term evaluates whether a DNN learns visual concepts of different images simultaneously.
For a rigorous evaluation, as shown in Figure 3, we calculate, for each image, the epoch number m̂ at which the DNN obtains the richest visual concepts on the foreground. Let W_0 and W_m denote the initial parameters and the parameters learned after the m-th epoch, respectively. We utilize the "weight distance" d(m) = Σ_{t=1}^{m} ‖W_t − W_{t−1}‖ / ‖W_0‖ to measure the learning effect at the m-th epoch [12, 7]. Compared to using the epoch number, the weight distance better quantifies the total path of updating the parameters at each epoch. Thus, we use the average value D_mean = E_{x∈I}[d(m̂(x))] and standard deviation D_std = Std_{x∈I}[d(m̂(x))] of weight distances to quantify whether a DNN learns visual concepts simultaneously.
The average weight distance, denoted D_mean, represents the point at which the DNN obtains the richest task-relevant visual concepts; its value indicates whether a DNN learns visual concepts quickly. The standard deviation of weight distances, denoted D_std, describes the variation of the weight distance w.r.t. different images; its value indicates whether a DNN learns the visual concepts of various images simultaneously. Hence, small values of D_mean and D_std indicate that the DNN learns various concepts quickly and simultaneously.
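The two learning-speed metrics can be sketched as follows. This sketch assumes the per-epoch parameter snapshots and the per-image best epochs m̂ are given, and that the weight distance is the accumulated update path normalized by the initialization, as described above; all names are hypothetical.

```python
import numpy as np

def weight_distance(weights, m):
    """Accumulated optimization path up to epoch m:
        sum_{t=1..m} ||W_t - W_{t-1}|| / ||W_0||
    `weights` is a list of flattened parameter vectors; weights[0] is the
    initialization, weights[t] the parameters after the t-th epoch.
    """
    w0_norm = np.linalg.norm(weights[0])
    return sum(np.linalg.norm(weights[t] - weights[t - 1])
               for t in range(1, m + 1)) / w0_norm

def learning_speed_metrics(weights, best_epochs):
    """D_mean / D_std over images; best_epochs[k] is the epoch m_hat at
    which image k exhibits the richest foreground concepts."""
    dists = np.array([weight_distance(weights, m) for m in best_epochs])
    return dists.mean(), dists.std()

# Toy run: three parameter snapshots, two images with m_hat = 1 and 2.
weights = [np.array([1.0, 0.0]), np.array([1.0, 1.0]), np.array([2.0, 1.0])]
d_mean, d_std = learning_speed_metrics(weights, best_epochs=[1, 2])
```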
During knowledge distillation, the teacher network directly guides the student network to learn target visual concepts without significant detours. In comparison, according to the information-bottleneck theory [41, 36], when learning from raw data, the DNN usually tries to model various visual concepts and then discards non-discriminative ones, which leads to unstable optimization directions.
In order to quantify the stability of optimization directions of a DNN, a new metric ρ is proposed. Let S_j(x) (j = 1, 2, …, M) denote the set of visual concepts on the foreground of image x encoded by the DNN learned after the j-th epoch. Here, each visual concept is referred to as a specific pixel i on the foreground of image x that satisfies H̄ − H_i > b. The stability of optimization directions can be measured as follows: ρ = ‖S_M(x)‖ / ‖∪_{j=1}^{M} S_j(x)‖.
The numerator reflects the number of visual concepts that have been ultimately chosen for object classification, shown as the black box in Figure 1. The denominator represents all visual concepts temporarily learned during the learning procedure, shown as the green box in Figure 1; the difference between the two sets consists of the visual concepts that have been tried but finally discarded by the DNN. A high value of ρ indicates that the DNN is optimized with fewer detours and more stably, and vice versa.
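The stability metric can be sketched as follows, assuming the per-epoch foreground concept sets (pixels satisfying the entropy criterion) have already been extracted; `stability_rho` is a hypothetical name.

```python
def stability_rho(concept_sets):
    """rho = |S_M| / |union_j S_j|: the fraction of all foreground concepts
    ever explored during training that survive into the final DNN.

    concept_sets[j] holds the concept pixels encoded after epoch j+1;
    the last entry corresponds to the final model S_M.
    """
    final = concept_sets[-1]
    explored = set().union(*concept_sets)   # every concept tried at any epoch
    return len(final) / len(explored)

# Toy run: the DNN tries four concepts early on but keeps only two,
# i.e. it takes a "detour" through two discarded concepts.
rho = stability_rho([{1, 2, 3, 4}, {1, 2, 3}, {1, 2}])
```

A ρ close to 1 means almost no concepts were modeled and then abandoned, i.e. a stable optimization direction.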
Datasets & DNNs: We designed comparative experiments to verify the three proposed hypotheses. For comprehensive comparisons, we conducted experiments based on AlexNet, VGG-11, VGG-16, VGG-19, ResNet-50, ResNet-101, and ResNet-152. Given each DNN as the teacher network, we distilled knowledge from the teacher network to the student network, which had the same architecture as the teacher network for fair comparisons. Meanwhile, the baseline network was also required to have the same architecture as the teacher network.
The DNNs were pre-trained on the ImageNet dataset and then fine-tuned on each of the three datasets, respectively. For fine-tuning on the ILSVRC-2013 DET dataset, we conducted the classification of terrestrial mammal categories for comparative experiments, considering the high computational burden. For the ILSVRC-2013 DET dataset and the Pascal VOC 2012 dataset, data augmentation was applied to prevent overfitting. For the CUB200-2011 dataset, we used object images cropped by object bounding boxes for both training and testing. In particular, for the Pascal VOC 2012 dataset, images were cropped using a fixed rescaling of the original object bounding box for stable results; for the ILSVRC-2013 DET dataset, each image was likewise cropped using a fixed rescaling of its original object bounding box. Because there were no ground-truth object-segmentation annotations in the ILSVRC-2013 DET dataset, we used the object bounding box as the foreground region: pixels within the object bounding box were regarded as the foreground, and pixels outside the object bounding box were regarded as the background.
Distillation: In the procedure of knowledge distillation, we selected a fully-connected (FC) layer as the target layer. The distillation loss ‖f_S(x) − f_T(x)‖² was used to force the student to mimic the feature of the corresponding layer of the teacher network, where f_T(x) and f_S(x) denote the target-layer features of the teacher network and its corresponding student network, respectively.
Parameters of the student network below the target FC layer were learned exclusively using the distillation loss. Hence, the learning process was not affected by any human annotations beyond the knowledge encoded in the teacher network, which ensured fair comparisons. Then we froze the parameters below the target layer and learned the parameters above the target layer merely using the classification loss.
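The two-stage procedure can be sketched with a toy numpy example: a one-layer "student" is first fitted to a frozen teacher feature using the distillation loss ‖f_S(x) − f_T(x)‖² alone, after which its parameters would be frozen and only the layers above trained with the classification loss. The setup below (a single linear layer, plain gradient descent) is a simplifying assumption for illustration, not the paper's actual training configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: a one-layer "student" mimicking a fixed teacher.
X = rng.normal(size=(64, 8))          # input batch
W_teacher = rng.normal(size=(8, 4))   # frozen teacher weights (target layer)
F_teacher = X @ W_teacher             # teacher's target-FC features f_T(x)

# Stage 1: learn the student's lower layers with the distillation loss only,
# L = ||f_S(x) - f_T(x)||^2; no ground-truth labels are involved.
W_student = rng.normal(size=(8, 4))
lr = 0.01
for _ in range(1000):
    F_student = X @ W_student
    grad = 2 * X.T @ (F_student - F_teacher) / len(X)
    W_student -= lr * grad

# Stage 2 (omitted): freeze W_student; only the layers above the target FC
# layer would now be trained with the classification loss.
distill_loss = np.mean((X @ W_student - F_teacher) ** 2)
```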
Selection of layers: For each pair of the student network and the baseline network, we aimed to quantify visual concepts encoded in FC layers, and thus conducted comparative experiments. The selected DNNs usually had three FC layers, which we named FC1, FC2, and FC3 for short. Note that, for the ILSVRC-2013 DET dataset and the Pascal VOC 2012 dataset, the feature dimension of the FC3 layer was much smaller than those of the FC1 and FC2 layers. Hence, the target layer was chosen from the FC1 and FC2 layers when DNNs were learned on the ILSVRC-2013 DET dataset and the Pascal VOC 2012 dataset. For the CUB200-2011 dataset, all three FC layers were selected as target layers. Note that ResNets usually have only one FC layer; in this case, we replaced the single FC layer with a block of two convolutional layers and three FC layers, each followed by a ReLU layer, so that we could measure visual concepts in the student network and the baseline network w.r.t. each FC layer. The hyper-parameter b (shown in Equation (3)) was set to a different value for AlexNet than for the other DNNs, because AlexNet has far fewer layers.
According to our hypotheses, the teacher network was learned from a large number of training samples and hence had learned better representations, i.e. encoding more visual concepts on the foreground and fewer concepts on the background than the baseline network. Thus, the student network distilled from the teacher was supposed to contain more visual concepts on the foreground than the baseline network. In this section, we aimed to compare the number of visual concepts encoded in the teacher network, the student network, and the baseline network.
We learned a teacher network from scratch on the ILSVRC-2013 DET dataset and the CUB200-2011 dataset. In order to boost the performance of the teacher network, data augmentation was used. The student network was distilled in the same way as in Section 4.1 and had the same architecture as the teacher network and the baseline network. Without loss of generality, VGG-16 was chosen, and results are reported in Table 2. We found that the number of foreground concepts N_fg and the foreground ratio λ of the teacher network were larger than those of the student network; meanwhile, the student network obtained larger N_fg and λ values than the baseline network. In this way, the assumed relationship between the teacher network, the student network, and the baseline network was roughly verified. We also noticed one exception, in which a metric value of the teacher network was smaller than that of the student network; this was because the teacher network had a larger average background entropy H̄ (in Equation (3)) than the student network.
Hypothesis 1 assumed that knowledge distillation ensures that the student network learns more task-relevant visual concepts and fewer task-irrelevant visual concepts. Thus, we utilized the N_fg, N_bg, and λ metrics in Equation (3) to verify this hypothesis.
Values of N_fg, N_bg, and λ, evaluated at the FC1 and FC2 layers of each DNN learned on the CUB200-2011 dataset, the ILSVRC-2013 DET dataset, and the Pascal VOC 2012 dataset, are shown in Table 3. Most results support Hypothesis 1, i.e. the student network tended to encode more visual concepts on the foreground and fewer concepts on the background, thereby exhibiting a larger ratio λ than the baseline network. Figure 5 shows visual concepts encoded in an FC layer of VGG-11, which also supports Hypothesis 1. Note that very few student networks encoded more background visual concepts (a larger N_bg). This was because the DNNs used as teacher networks in Sections 4.3, 4.4, and 4.5 were pre-trained on the ImageNet dataset to verify Hypotheses 1-3; such pre-trained teacher networks encoded visual concepts of far more categories than necessary, which could make the student network exhibit a larger N_bg value than the baseline network.
For Hypothesis 2, we aimed to verify that knowledge distillation enables the student network to have a higher learning speed, i.e. to learn different concepts simultaneously. We used the weight-distance metrics D_mean and D_std to verify this hypothesis.
As shown in Table 3, the D_mean and D_std values of the student network were both smaller than those of the baseline network, which verified Hypothesis 2. Note that there were still failure cases, e.g. when D_mean and D_std were measured at certain FC layers of AlexNet and VGG-11. The reason is that AlexNet and VGG-11 both have relatively shallow architectures; when learning from raw data, DNNs with shallow architectures tend to learn more concepts and avoid overfitting. Nevertheless, apart from these few exceptional cases, knowledge distillation outperformed learning from raw data for most DNNs.
Hypothesis 3 stated that, compared to the baseline network, knowledge distillation makes the student network optimized with fewer detours. The metric ρ, which depicts the stability of optimization directions, was used to verify this hypothesis. Results reported in Table 3 demonstrate that, in most cases, the ρ value of the student network was larger than that of the baseline network. Failure cases emerged when ρ was measured on AlexNet and VGG-11, due to the shallow architectures of these two networks; in these cases, the optimization directions of the student network tended to be unstable and took more detours.
(Fragment of Table 3, student-network row for AlexNet. CUB200-2011 dataset: 36.60, 4.00, 0.90, 8.35, 25.09, 0.57; ILSVRC-2013 DET dataset: 49.46, 0.66, 0.99, 0.48, 0.10, 0.62; Pascal VOC 2012 dataset: 25.84, 5.86, 0.79, 1.14, 0.56, 0.43.)
In this paper, we interpreted the success of knowledge distillation from the perspective of quantifying the knowledge encoded in intermediate layers of a DNN. Three types of metrics were proposed to verify three hypotheses in the scenario of classification, i.e. that, compared to learning from raw data, knowledge distillation ensures that the DNN learns more task-relevant concepts and fewer task-irrelevant concepts, learns various concepts more quickly and simultaneously, and is optimized with fewer detours.
There are several limitations to our work. We focus only on the classification task in this paper; however, applying our methods to other tasks (e.g. object segmentation) or other types of data (e.g. video) is theoretically feasible, although side information may be required for these tasks. Our proposed metrics are implemented via an entropy-based analysis, which has strong connections to the information-bottleneck theory; unlike the information-bottleneck theory, the proposed metrics can measure pixel-wise information discarding. However, the learning procedure of DNNs cannot be precisely divided into a learning phase and a discarding phase: in each epoch, the DNN may simultaneously learn new visual concepts and discard old task-irrelevant concepts. Thus, the target epoch shown in Figure 3 is only a rough estimate of the division between the two learning phases.
The corresponding author, Quanshi Zhang, is with the John Hopcroft Center and the MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University. He thanks the support of the National Natural Science Foundation of China (U19B2043 and 61906120) and Huawei Technologies. Zhefan Rao and Yilan Chen made equal contributions to this work as interns at Shanghai Jiao Tong University.
International Conference on Machine Learning, pages 882–891, 2018.
ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
"Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144. ACM, 2016.
New theory cracks open the black box of deep learning. In Quanta Magazine, 2017.
A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4133–4141, 2017.
Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2921–2929, 2016.