Knowledge Distillation: A Survey

06/09/2020 ∙ by Jianping Gou, et al. ∙ The University of Sydney ∙ Birkbeck, University of London

In recent years, deep neural networks have been very successful in both industry and academia, especially for applications in visual recognition and natural language processing. The great success of deep learning is mainly due to its scalability to both large-scale data and billions of model parameters. However, it also poses a great challenge for the deployment of these cumbersome deep models on devices with limited resources, e.g., mobile phones and embedded devices, not only because of the high computational complexity but also because of the large storage requirements. To this end, a variety of model compression and acceleration techniques have been developed, such as pruning, quantization, and neural architecture search. As a typical model compression and acceleration method, knowledge distillation aims to learn a small student model from a large teacher model and has received increasing attention from the community. In this paper, we provide a comprehensive survey of knowledge distillation from the perspectives of knowledge categories, training schemes, distillation algorithms, and applications. Furthermore, we briefly review challenges in knowledge distillation and provide some insights into future research.


1 Introduction

During the last few years, deep learning has been the basis of many successes in artificial intelligence, including a variety of applications in computer vision (Krizhevsky et al., 2012), reinforcement learning (Silver et al., 2016), and natural language processing (Devlin et al., 2018). With the help of many recent techniques, including residual connections (He et al., 2016) and batch normalization (Ioffe and Szegedy, 2015), we can easily train very deep models with thousands of layers on powerful GPU or TPU clusters. For example, it takes less than ten minutes to train a ResNet model on a popular image recognition benchmark with millions of images (Deng et al., 2009; Sun et al., 2019), and no more than one and a half hours to train a powerful BERT model for language understanding (Devlin et al., 2018; You et al., 2019). Despite their overwhelming performance, large-scale deep models incur huge computational complexity and massive storage requirements, which makes their deployment a great challenge in real-time applications, especially on devices with limited resources, such as embedded facial recognition systems and autonomous cars.

Figure 1: The generic teacher-student framework for knowledge distillation.

To develop efficient deep models, recent works usually focus on 1) efficient basic blocks based on depthwise separable convolution, such as MobileNets (Howard et al., 2017; Sandler et al., 2018) and ShuffleNets (Zhang et al., 2018a; Ma et al., 2018); and 2) model compression and acceleration techniques, mainly including the following categories (Cheng et al., 2018).

  1. Parameter pruning and sharing: These methods focus on removing inessential parameters of deep neural networks without any significant effect on the performance. This category is further divided into model quantization (Wu et al., 2016) and binarization (Courbariaux et al., 2015), parameter sharing (Han et al., 2015), and structural matrices (Sindhwani et al., 2015).

  2. Low-rank factorization: These methods exploit the redundancy in the parameters of deep neural networks via matrix/tensor decomposition (Denton et al., 2014).

  3. Transferred/compact convolutional filters: These methods reduce the inessential parameters through transferring/compressing the convolutional filters (Zhai et al., 2016).

  4. Knowledge distillation (KD): These methods usually distill the knowledge from a larger deep neural network to a small network (Hinton et al., 2015).

Figure 2: The schematic structure of this survey on knowledge distillation. The body of this survey covers the fundamentals of knowledge distillation, the kinds of knowledge, the distillation schemes, the teacher-student architectures, the knowledge distillation algorithms and applications, and a discussion of challenges and future directions. Note that ‘Section’ is abbreviated as ‘Sec.’ in this figure.

A comprehensive review of model compression and acceleration is outside the scope of this paper; we focus on knowledge distillation, which has received increasing attention from the community. A large deep model tends to achieve very good performance in practice, because overparameterization improves the generalization performance (Brutzkus and Globerson, 2019; Allen-Zhu et al., 2019; Arora et al., 2018). Knowledge distillation thus exploits this redundancy of parameters in a deep model at inference time by learning a small student model under the supervision of a large teacher model (Bucilua et al., 2006; Ba and Caruana, 2014; Hinton et al., 2015; Urban et al., 2016), and the key problem of knowledge distillation is how to transfer the knowledge from the large teacher model to the small student model. The generic teacher-student framework for knowledge distillation is shown in Fig. 1. Despite its great success in practice, there are not many works on the theoretical or empirical understanding of knowledge distillation (Cheng et al., 2020; Phuong and Lampert, 2019; Cho and Hariharan, 2019). Specifically, to understand the working mechanisms of knowledge distillation, Phuong and Lampert theoretically justified a generalization bound with fast convergence for learning distilled student networks in the scenario of deep linear classifiers (Phuong and Lampert, 2019). This justification theoretically answers what and how fast the student learns, and reveals the factors determining the success of distillation: successful distillation relies on the data geometry, the optimization bias of the distillation objective, and the strong monotonicity of the student classifier. Cheng et al. quantified the knowledge of visual concepts in the intermediate layers of a deep neural network to explain knowledge distillation (Cheng et al., 2020). Cho and Hariharan empirically analyzed the efficacy of knowledge distillation in detail (Cho and Hariharan, 2019). Their empirical analyses show that a larger model may not be a better teacher because of the model capacity gap (Mirzadeh et al., 2019), and that distillation can then adversely affect student learning. To our knowledge, the empirical evaluation of different kinds of knowledge, different distillation schemes, and the mutual influence between teacher and student is not considered in (Cho and Hariharan, 2019). Besides, the understanding of knowledge distillation has also been explored, with empirical analyses, from the perspectives of label smoothing, the prediction confidence of the teacher, and a prior over the optimal output-layer geometry (Tang et al., 2020).

The idea of knowledge distillation for model compression is very similar to the way humans learn. Inspired by this, recent knowledge distillation methods have been extended not only to teacher-student learning (Hinton et al., 2015), but also to mutual learning (Zhang et al., 2018b), self-learning (Yuan et al., 2019), assistant teaching (Mirzadeh et al., 2019), and lifelong learning (Zhai et al., 2019). Most extensions of knowledge distillation concentrate on compressing deep neural networks, and the resulting lightweight student networks can be easily deployed in applications such as visual recognition, speech recognition, and natural language processing (NLP). Furthermore, the notion of knowledge transfer from one model to another in knowledge distillation can also be extended to other tasks, such as adversarial attacks (Papernot et al., 2016b), data augmentation (Lee et al., 2019a; Gordon and Duh, 2019), and data privacy and security (Wang et al., 2019a).

In this paper, we present a comprehensive survey on knowledge distillation. The main objectives of this survey are to 1) give a full overview of knowledge distillation, including the background and motivations, basic notations and formulations, and several typical kinds of knowledge, distillation schemes, and algorithms; 2) give a thorough review of recent progress in knowledge distillation, including theories, applications, and extensions in different real-world scenarios; and 3) address some challenges and provide insights into knowledge distillation from different perspectives of knowledge transfer, including different types of knowledge, training schemes, distillation algorithms/structures, and applications. An overview of the organization of this paper is shown in Fig. 2. Specifically, the remainder of this paper is structured as follows. The important concepts and the conventional model of knowledge distillation are provided in Section 2. The kinds of knowledge and of distillation are summarized in Sections 3 and 4, respectively. The existing studies on teacher-student structures in knowledge distillation are illustrated in Section 5. Numerous recent knowledge distillation approaches are comprehensively summarized in Section 6. The wide applications of knowledge distillation across different areas are illustrated in Section 7. The challenging problems and future directions of knowledge distillation are discussed in Section 8. Finally, conclusions are given in Section 9.

2 Backgrounds

In this section, we first introduce the backgrounds of knowledge distillation, and then review the notations and formulations of a vanilla knowledge distillation method (Hinton et al., 2015).

Deep neural networks have achieved remarkable success, especially in real-world scenarios with large-scale data. However, due to the limited computational capacity and memory of devices in practice, the deployment of deep neural networks on mobile devices and embedded systems is a great challenge. To address the issue of deploying cumbersome models, Bucilua et al. (2006) first proposed model compression, which transfers the information from a large model or an ensemble of models into a small model without a significant drop in accuracy. The main idea is that the student model mimics the teacher model in order to obtain a competitive or even superior performance. Subsequently, the method of learning a small model from a large model was popularized as knowledge distillation (Hinton et al., 2015).

Figure 3: An intuitive example of hard and soft targets for knowledge distillation in (Liu et al., 2018c).
Figure 4: The specific architecture of the benchmark knowledge distillation.

A vanilla knowledge distillation framework usually contains two models: a large teacher model (or an ensemble of large models) and a small student model. The pretrained teacher model is usually much larger than the student model, and the main idea is to train an efficient student model with comparable accuracy under the guidance of the teacher model. The supervision signal from the teacher model, usually referred to as the “knowledge” learned by the teacher model, helps the student model mimic the behavior of the teacher model. In a typical image classification task, the logits (i.e., the outputs of the last layer of a deep neural network) are used as the carriers of the knowledge from the teacher model, which is not explicitly provided by the training data samples. For example, an image of a cat may be mistakenly classified as a dog with a very low probability, but the probability of such a mistake is still many times higher than that of mistaking the cat for a car (Liu et al., 2018c). Another example is that an image of a hand-written digit may look far more similar to one particular other digit than to the rest. Such knowledge learned by the teacher model is called dark knowledge in (Hinton et al., 2015).

The transfer of dark knowledge in a vanilla knowledge distillation framework can be formulated as follows. Given a vector of logits $z$ as the output of the last fully connected layer of a deep model, with $z_i$ the logit for the $i$-th class, the probability $p_i$ that the input belongs to the $i$-th class can be estimated by a softmax function,

$$p(z_i) = \frac{\exp(z_i)}{\sum_j \exp(z_j)}. \qquad (1)$$

Therefore, the soft targets predicted by the teacher model contain the dark knowledge and can be used as a supervision signal to transfer knowledge from the teacher model to the student model. Similarly, the one-hot label is known as the hard target, and an intuitive example of soft and hard targets is shown in Fig. 3. Furthermore, a temperature factor $T$ is introduced to control the importance of each soft target as

$$p(z_i, T) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}, \qquad (2)$$

where a higher temperature produces a softer probability distribution over classes. Specifically, as $T \to \infty$, all classes share the same probability; as $T \to 0$, the soft targets collapse into one-hot labels, i.e., the hard targets. Both the soft targets from the teacher model and the ground-truth label are of great importance for improving the performance of the student model (Bucilua et al., 2006; Hinton et al., 2015; Romero et al., 2015); they are used for a distillation loss and a student loss, respectively. The distillation loss is defined to match the softened outputs of the teacher and student models:

$$L_D\big(p(z_t, T), p(z_s, T)\big) = -\sum_i p(z_{t,i}, T) \log p(z_{s,i}, T), \qquad (3)$$

where $z_t$ and $z_s$ denote the logits of the teacher and the student models, respectively. This loss matches the logits of the two models: its gradient with respect to the student logit $z_{s,i}$ can be evaluated as

$$\frac{\partial L_D}{\partial z_{s,i}} = \frac{1}{T}\left( \frac{\exp(z_{s,i}/T)}{\sum_j \exp(z_{s,j}/T)} - \frac{\exp(z_{t,i}/T)}{\sum_j \exp(z_{t,j}/T)} \right). \qquad (4)$$

If the temperature $T$ is much higher than the magnitude of the logits, $\exp(z_i/T)$ can be approximated by its Taylor series as $1 + z_i/T$, so that

$$\frac{\partial L_D}{\partial z_{s,i}} \approx \frac{1}{T}\left( \frac{1 + z_{s,i}/T}{N + \sum_j z_{s,j}/T} - \frac{1 + z_{t,i}/T}{N + \sum_j z_{t,j}/T} \right), \qquad (5)$$

where $N$ is the number of classes. If we further assume that the logits of each transfer training sample are zero-mean, i.e., $\sum_j z_{s,j} = \sum_j z_{t,j} = 0$, Eq. (5) can be simplified as

$$\frac{\partial L_D}{\partial z_{s,i}} \approx \frac{1}{N T^2}\left( z_{s,i} - z_{t,i} \right). \qquad (6)$$

Therefore, according to Eq. (6), under a high temperature and zero-mean logits the distillation loss is equivalent to matching the logits of the teacher and student models, i.e., to minimizing $\frac{1}{2}(z_{s,i} - z_{t,i})^2$. Thus, distillation through matching logits at a high temperature also attends to the very negative logits, which may carry useful knowledge learned by the teacher model, and conveys this information to the student model.

The student loss is defined as the cross-entropy between the ground-truth label and the softmax output of the student model:

$$L_S\big(y, p(z_s, T=1)\big) = -\sum_i y_i \log p(z_{s,i}, T=1), \qquad (7)$$

where $y$ is the ground-truth one-hot vector, in which the element corresponding to the ground-truth label of the transfer training sample is 1 and all other elements are 0. The distillation and student losses both use the softmax outputs of the student model, but with different temperatures: $T = 1$ in the student loss and $T > 1$ in the distillation loss. Finally, the benchmark model of vanilla knowledge distillation is trained with the joint of the distillation and student losses:

$$L(x, W) = \alpha\, L_D\big(p(z_t, T), p(z_s, T)\big) + \beta\, L_S\big(y, p(z_s, T=1)\big), \qquad (8)$$

where $x$ is a training input on the transfer set, $W$ denotes the parameters of the student model, and $\alpha$ and $\beta$ are weighting parameters. To make knowledge distillation easier to understand, the specific architecture of vanilla knowledge distillation, which jointly involves the teacher and student models, is shown in Fig. 4. As shown in Fig. 4, the teacher model is first pre-trained, and the student model is then trained using the knowledge extracted from the trained teacher model; this is, in fact, offline knowledge distillation. Moreover, only the soft targets of the trained teacher model are used as knowledge for training the student model. Of course, there are other types of knowledge and distillation, which will be discussed in the following sections.
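To make Eq. (8) concrete, the following is a minimal sketch of the vanilla distillation objective in PyTorch. It follows the common practice of Hinton et al. (2015), including the $T^2$ rescaling that compensates for the $1/T^2$ gradient scale in Eq. (6); the function name, the hyperparameter values, and the use of a KL-divergence term (which differs from the cross-entropy of Eq. (3) only by a constant with respect to the student parameters) are illustrative assumptions rather than a fixed recipe.

```python
import torch
import torch.nn.functional as F

def vanilla_kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7, beta=0.3):
    """Joint loss of Eq. (8): alpha * distillation loss + beta * student loss."""
    # Distillation loss L_D: match softened teacher and student outputs (Eq. (3));
    # KL divergence differs from the cross-entropy only by the teacher entropy,
    # which is constant with respect to the student parameters.
    log_p_s = F.log_softmax(student_logits / T, dim=1)
    p_t = F.softmax(teacher_logits / T, dim=1)
    distill_loss = F.kl_div(log_p_s, p_t, reduction="batchmean") * (T * T)
    # Student loss L_S (Eq. (7)): standard cross-entropy with hard labels (T = 1).
    student_loss = F.cross_entropy(student_logits, labels)
    return alpha * distill_loss + beta * student_loss

# Usage: the teacher is pre-trained and frozen (offline distillation).
# with torch.no_grad():
#     teacher_logits = teacher(x)
# loss = vanilla_kd_loss(student(x), teacher_logits, y)
```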

3 Knowledge

In this section, we focus on different categories of knowledge for knowledge distillation. Vanilla knowledge distillation uses the logits as the teacher knowledge (Hinton et al., 2015; Ba and Caruana, 2014; Kim et al., 2018; Mirzadeh et al., 2019), while the activations, neurons, or features of intermediate layers can also be used as the knowledge to guide the training of the student model (Romero et al., 2015; Zagoruyko and Komodakis, 2016; Huang and Wang, 2017; Ahn et al., 2019; Heo et al., 2019c). The relationships between different activations, neurons, or pairs of samples contain rich information learned by the teacher model (Yim et al., 2017; Lee and Song, 2019; Liu et al., 2019f; Tung and Mori, 2019; Yu et al., 2019). Furthermore, the parameters of the teacher model (or the connections between layers) constitute another kind of knowledge (Liu et al., 2019c). Since model parameters are usually used to fine-tune a model on a new dataset, which is not the focus of model compression, we discuss knowledge in the following categories: response-based knowledge, feature-based knowledge, and relation-based knowledge. An intuitive example of the different categories of knowledge from the teacher model is shown in Fig. 5.

Figure 5: The schematic illustrations of sources of response-based knowledge, feature-based knowledge and relation-based knowledge in the deep teacher network.

3.1 Response-based Knowledge

Response-based knowledge usually refers to the neural response of the last output layer of the teacher model, and the main idea is to directly mimic the final prediction of the teacher model. Response-based knowledge distillation is simple yet effective for model compression and has been widely used in different tasks and applications. The most popular response-based knowledge for image classification is known as the soft targets (Ba and Caruana, 2014; Hinton et al., 2015). For different tasks, the response-based knowledge may correspond to different types of model prediction. For example, the response in an object detection task contains not only the logits but also the offsets of the bounding boxes (Chen et al., 2017); in semantic landmark localization tasks, e.g., human pose estimation, the response of the teacher model may be the heatmap for each landmark (Zhang et al., 2019a). Recently, response-based knowledge has been further explored to incorporate the information of the ground-truth label as conditional targets (Meng et al., 2019).

The idea of response-based knowledge is straightforward and easy to understand, especially in light of the explanation of “dark knowledge”. From another perspective, the effectiveness of the soft targets shares a motivation with label smoothing (Kim and Kim, 2017), i.e., acting as a strong regularizer (Muller et al., 2019; Ding et al., 2019).

3.2 Feature-based Knowledge

Deep neural networks are good at learning multiple levels of feature representation with increasing abstraction, which is known as representation learning (Bengio et al., 2013). Therefore, apart from the output of the last layer, the outputs of intermediate layers, i.e., feature maps, can also be used as the knowledge to supervise the training of the student model.

Intermediate representations were first introduced in FitNets (Romero et al., 2015) as hints to improve the training of the student model, and the main idea is to directly match the features/activations of the teacher and the student. Inspired by this, a variety of other methods have been proposed to match the features indirectly (Zagoruyko and Komodakis, 2016; Passalis and Tefas, 2018; Kim et al., 2018; Heo et al., 2019c). Specifically, Zagoruyko and Komodakis (2016) derived an “attention map” from the original feature maps as the knowledge, which was generalized by Huang and Wang (2017) via neuron selectivity transfer. In (Passalis and Tefas, 2018), the knowledge was transferred by matching the probability distribution in the feature space. To make the teacher knowledge easier to transfer, Kim et al. (2018) introduced “factors” as a more understandable form of intermediate representations. Recently, Heo et al. (2019c) proposed using the activation boundaries of hidden neurons for knowledge transfer. A summary of feature-based knowledge is shown in Table 1, and a minimal code sketch of hint-based feature matching follows the table.

Feature-based knowledge
Method: Description
Feature representations of middle layers (Romero et al., 2015): The middle layer of the student is guided by that of the teacher via hint-based training.
Parameter distributions of layers (Liu et al., 2019c): The parameter distributions of layers are matched between teacher and student.
Relative dissimilarities of hint maps (You et al., 2017): The distance-based dissimilarity relationships of the teacher are transferred to the student.
Attention transfer (Zagoruyko and Komodakis, 2016): The student mimics the attention maps of the teacher.
Factor transfer (Kim et al., 2018): Convolutional modules, as factors, encode the feature representations of the output layer.
Probabilistic knowledge transfer (Passalis and Tefas, 2018): A soft probabilistic distribution of the data defined on the feature representations of the output layer.
Parameter sharing (Zhou et al., 2018): The student shares the same lower layers of the network with the teacher.
Activation boundaries (Heo et al., 2019c): The activation boundaries formed by the hidden neurons of the student match those of the teacher.
Neuron selectivity transfer (Huang and Wang, 2017): The student imitates the distribution of neuron activations from the hint layers of the teacher.
Feature responses and neuron activations (Heo et al., 2019a): The magnitudes of the feature responses, which carry enough feature information, together with the activation status of each neuron.
Table 1: The summary of feature-based knowledge.
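As a concrete illustration of feature-based knowledge transfer, below is a minimal sketch of a FitNets-style hint loss (Romero et al., 2015): an intermediate student feature map is passed through a learned regressor so that its shape matches the teacher's hint layer, and the two are matched with an L2 loss. The layer choice, channel sizes, and the 1x1-convolution regressor are illustrative assumptions, not the exact configuration of the original paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HintLoss(nn.Module):
    """FitNets-style hint training: match an intermediate student feature map
    to a teacher hint layer through a learned regressor (assumed 1x1 conv)."""

    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        # The regressor maps the (usually thinner) student features into the
        # teacher's feature space so that the L2 loss is well defined.
        self.regressor = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_feat, teacher_feat):
        # student_feat: (B, C_s, H, W); teacher_feat: (B, C_t, H', W')
        projected = self.regressor(student_feat)
        if projected.shape[-2:] != teacher_feat.shape[-2:]:
            # Align spatial size if the two networks downsample differently.
            projected = F.interpolate(projected, size=teacher_feat.shape[-2:],
                                      mode="bilinear", align_corners=False)
        return F.mse_loss(projected, teacher_feat.detach())

# Usage (hypothetical shapes): hint = HintLoss(64, 256)
# loss = hint(student_feature_map, teacher_feature_map)
```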

3.3 Relation-based Knowledge

Relation-based knowledge
Method: Description
Multi-head graph (Lee and Song, 2019): The intra-data relations between any two feature maps, captured via a multi-head attention network.
Logits and representation graphs (Zhang and Peng, 2018): The logits graph of multiple self-supervised teachers distills softened prediction knowledge at the classifier level, and the representation graph distills feature knowledge from pairwise ensemble representations of the compact bilinear pooling layers of the teachers.
FSP matrix (Yim et al., 2017; Lee et al., 2018): The relations between any two feature maps from any two layers of the networks.
Mutual relations of data (Park et al., 2019): Mutual relations of the feature representation outputs, captured structure-wise.
Instance relationship graph (Liu et al., 2019f): This graph contains the knowledge about instance features, instance relationships, and the feature-space transformation across layers of the teacher.
Similarities of feature representations (Chen et al., 2020; Tung and Mori, 2019): The locality similarities of the feature representations of the hint layers of the teacher networks (Chen et al., 2020); the similar activations of input pairs in the teacher networks (Tung and Mori, 2019).
Correlation congruence (Peng et al., 2019a): Correlations between instances of data.
Embedding networks (Yu et al., 2019): The distances between the feature embeddings of layers of the teacher and between data pairs.
Similarity transfer (Chen et al., 2018c): Cross-sample similarities.
Table 2: The summary of relation-based knowledge.

Both response-based and feature-based knowledge use the output of specific layers in the teacher model, while relation-based knowledge further explores the relationships/structures between different layers and data samples.

To explore the relationships between different feature maps, Yim et al. (2017) proposed the flow of solution process (FSP), which is defined by the Gram matrix between two layers. The FSP matrix reflects the relations between any two feature maps and is calculated as the inner product of the features from the two layers. Using the correlation between feature maps as the distilled knowledge, knowledge distillation via singular value decomposition was proposed to extract key information from the feature maps (Lee et al., 2018). To use the knowledge from multiple teachers, Zhang and Peng (2018) formed a directed graph with the logits and features of each teacher model as nodes; specifically, the relationship/importance between different teachers is modelled before the transfer of knowledge (Zhang and Peng, 2018). Considering intra-data relations, graph-based knowledge distillation was proposed in (Lee and Song, 2019), and Park et al. (2019) proposed a relational knowledge distillation method.

Traditional knowledge transfer methods often perform individual knowledge distillation, in which the individual soft targets of the teacher are directly distilled into the student. In fact, the distilled knowledge contains not only feature information but also mutual relations between data samples. Liu et al. (2019f) proposed a robust and effective knowledge distillation method via an instance relationship graph; the transferred knowledge in the instance relationship graph contains instance features, instance relationships, and the feature-space transformation across layers. Based on the idea of manifold learning, the student network in (Chen et al., 2020) is learned by feature embedding, which preserves the locality of the feature representations of the hint layers of the teacher network; through feature embedding, the similarities of the data in the teacher network are transferred into the student network. Tung and Mori (2019) proposed a similarity-preserving knowledge distillation method, in which the similarity-preserving knowledge, derived from the similar activations of input pairs in the teacher network, is transferred into the student network, and the student network preserves such pairwise similarities. Peng et al. (2019a) proposed a knowledge distillation method via correlation congruence, in which the distilled knowledge contains both the instance-level information and the correlations between instances; using the correlation congruence for distillation, the student network can learn the correlations between instances.
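As one concrete instance of relation-based knowledge, the sketch below follows the spirit of similarity-preserving distillation (Tung and Mori, 2019): pairwise similarity matrices computed over a mini-batch are matched between the teacher and student features. The row-wise normalization and mean-squared matching below are one common choice and should be read as an illustrative assumption rather than the exact loss of the original paper.

```python
import torch
import torch.nn.functional as F

def similarity_preserving_loss(student_feat, teacher_feat):
    """Match batch-wise pairwise similarity matrices of teacher and student.

    student_feat, teacher_feat: feature maps of shape (B, C, H, W); channel and
    spatial sizes may differ, since only the B x B similarity matrices are compared.
    """
    b = student_feat.size(0)
    # Flatten each sample's features into a vector: (B, C*H*W).
    s = student_feat.reshape(b, -1)
    t = teacher_feat.reshape(b, -1)
    # Pairwise similarity matrices (B x B), followed by row-wise L2 normalization.
    g_s = F.normalize(s @ s.t(), p=2, dim=1)
    g_t = F.normalize(t @ t.t(), p=2, dim=1)
    # Mean squared difference between the two relation matrices.
    return F.mse_loss(g_s, g_t.detach())
```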

Besides, the knowledge for distillation can also be categorized from other perspectives, such as structured knowledge that reflects the structural information of the data (Liu et al., 2019f; Chen et al., 2020; Tung and Mori, 2019; Peng et al., 2019a; Tian et al., 2020) and privileged information of the input features (Vapnik and Izmailov, 2015; Lopez-Paz et al., 2015). A summary of relation-based knowledge is shown in Table 2.

4 Distillation Schemes

In this section, we discuss the training schemes of the teacher and student models. According to whether the teacher model is updated simultaneously with the student model or not, the learning schemes of knowledge distillation can be divided into three main categories: offline distillation, online distillation, and self-distillation, as shown in Fig. 6.

Figure 6: Different distillation schemes. The red color of “pre-trained” means networks are learned before distillation, and the yellow color of “to be trained” means networks are learned during distillation.

4.1 Offline Distillation

Most previous knowledge distillation methods work in an offline manner. In vanilla knowledge distillation (Hinton et al., 2015), the knowledge is transferred from a pre-trained teacher model into a student model. Therefore, the whole training process has two separate stages: 1) the large teacher model is first trained on a set of training samples; and 2) the teacher model is used to extract knowledge, such as logits or intermediate features, which then guides the training of the student model.

In offline distillation, the first stage is usually not discussed as part of knowledge distillation, i.e., the teacher is assumed to be a pre-defined model, and little attention is paid to its structure or its relationship with the student model. Therefore, offline methods mainly focus on improving different parts of the knowledge transfer, including the design of the knowledge (Hinton et al., 2015; Romero et al., 2015) and the loss functions for feature/distribution matching (Huang and Wang, 2017; Zagoruyko and Komodakis, 2016; Passalis and Tefas, 2018; Mirzadeh et al., 2019; Li et al., 2018; Heo et al., 2019b; Asif et al., 2019). The main advantage of offline methods is that they are simple and easy to implement. For example, the teacher may consist of a set of models trained with different software packages, possibly located on different machines, and the extracted knowledge can be cached on the device (whereas in the online setting the teacher model changes dynamically).

Obviously, offline distillation methods always employ one-way knowledge transfer and a two-phase training procedure. Nevertheless, offline distillation requires a complex, high-capacity teacher network and its time-consuming training on large-scale data, and the capacity gap between teacher and student exists before distillation begins.

4.2 Online Distillation

Although offline distillation methods are simple yet effective, some of their issues have attracted increasing attention from the community (Mirzadeh et al., 2019). To overcome the limitations of offline distillation, online distillation has been proposed to further improve the performance of the student model, especially when a high-capacity teacher model with powerful performance is not available (Zhang et al., 2018b; Chen et al., 2019a). In online distillation, both the teacher model and the student model are updated simultaneously, and the whole knowledge distillation framework is end-to-end trainable.

A variety of online knowledge distillation methods have been proposed, especially in the last two years (Zhang et al., 2018b; Chen et al., 2019a; Zhu and Gong, 2018; Xie et al., 2019; Anil et al., 2018; Kim et al., 2019b; Zhou et al., 2018). Specifically, in deep mutual learning (Zhang et al., 2018b), multiple neural networks work in a collaborative way: any one of them can be the student model while the others act as teachers during the whole training process. Chen et al. (2019a) further introduced auxiliary peers and a group leader into deep mutual learning to form a diverse set of peer models. To reduce the computational cost, Zhu and Gong (2018) proposed a multi-branch architecture, in which each branch is a student model and the different branches share the same backbone network. Similar to (Zhu and Gong, 2018), the motivation of rocket launching was to share the common low-level representations of the hint layers (Zhou et al., 2018). Rather than using the ensemble of logits, Kim et al. (2019b) introduced a feature fusion module to construct the teacher classifier. Xie et al. (2019) replaced the convolution layers with cheap convolution operations to form the student model. Anil et al. (2018) employed online distillation to train large-scale distributed neural networks and also proposed a variant of online distillation called codistillation, which trains multiple models with the same architecture in parallel, each model being trained by transferring knowledge from the others. Recently, an online adversarial knowledge distillation method was proposed to simultaneously train multiple networks with discriminators, using the knowledge from both the class probabilities and the feature maps (Chung et al., 2020).
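To illustrate the online setting, below is a minimal sketch in the spirit of deep mutual learning (Zhang et al., 2018b) with two peer networks: each network is trained with its own cross-entropy loss plus a KL term that pushes it toward the other network's current predictions, and both are updated in the same iteration. The two-peer setup, the single shared forward pass, and the loss weighting are illustrative assumptions rather than the exact procedure of the original paper.

```python
import torch
import torch.nn.functional as F

def mutual_learning_step(net1, net2, opt1, opt2, x, y, kl_weight=1.0):
    """One update step of two peers teaching each other (online distillation)."""
    logits1, logits2 = net1(x), net2(x)

    # Peer 1: supervised loss + KL toward peer 2's (detached) predictions.
    loss1 = F.cross_entropy(logits1, y) + kl_weight * F.kl_div(
        F.log_softmax(logits1, dim=1),
        F.softmax(logits2.detach(), dim=1),
        reduction="batchmean")
    # Peer 2: supervised loss + KL toward peer 1's (detached) predictions.
    loss2 = F.cross_entropy(logits2, y) + kl_weight * F.kl_div(
        F.log_softmax(logits2, dim=1),
        F.softmax(logits1.detach(), dim=1),
        reduction="batchmean")

    opt1.zero_grad(); loss1.backward(); opt1.step()
    opt2.zero_grad(); loss2.backward(); opt2.step()
    return loss1.item(), loss2.item()
```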

4.3 Self-Distillation

In self-distillation, the teacher and student models use the same networks (Yuan et al., 2019; Zhang et al., 2019b; Hou et al., 2019; Yang et al., 2019b; Yun et al., 2019; Hahn and Choi, 2019; Lee et al., 2019a), which can be regarded as a special case of online distillation. Specifically, Zhang et al. proposed a new self-distillation method in which the teacher and the student are the same convolutional neural network, and the knowledge from the deeper sections of the network is distilled into its shallower sections (Zhang et al., 2019b). Similarly, a self-attention distillation method was proposed for lane detection, allowing a network to use the attention maps of its own layers as distillation targets for its lower layers (Hou et al., 2019). As a special variant of self-distillation, snapshot distillation transfers the knowledge from earlier epochs of the network (as the teacher) into its later epochs (as the student) to support a supervised training process within the same network (Yang et al., 2019b). Moreover, three further interesting self-distillation methods were recently proposed in (Yuan et al., 2019; Hahn and Choi, 2019; Yun et al., 2019). Yuan et al. proposed teacher-free knowledge distillation methods based on an analysis of label smoothing regularization (Yuan et al., 2019). Hahn and Choi proposed a novel self-knowledge distillation method in which the self-knowledge consists of predicted probabilities instead of the traditional soft targets (Hahn and Choi, 2019); these predicted probabilities are defined by the feature representations of the training model and reflect the similarities of the data in the feature embedding space. Yun et al. proposed class-wise self-knowledge distillation to distill the predicted distributions of the training model between intra-class samples and between augmented samples of the same source into the model itself (Yun et al., 2019). In addition, the self-distillation proposed in (Lee et al., 2019a) was adopted for data augmentation, and the self-knowledge from the augmentations was distilled into the model itself. The idea of self-distillation has also been adopted to optimize deep models (the teacher or student networks) with the same architecture sequentially, where each subsequent network distills the knowledge of the previous one during the teacher-student optimization (Furlanello et al., 2018; Bagherinezhad et al., 2018).
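The sequential variant just mentioned can be sketched roughly as follows, under illustrative assumptions: each generation of a fixed architecture is trained with the previous generation's soft targets as its teacher signal. The generation count, temperature, optimizer, loss weighting, and the hypothetical make_model and train_loader helpers are not taken from any specific paper.

```python
import copy
import torch
import torch.nn.functional as F

def born_again_self_distillation(make_model, train_loader, num_generations=3,
                                 epochs=1, T=2.0, alpha=0.5, lr=0.1):
    """Sequential self-distillation: generation k is taught by generation k-1.

    make_model: callable returning a fresh network of the same architecture
    (hypothetical helper); train_loader yields (inputs, labels) batches.
    """
    teacher = None
    for _gen in range(num_generations):
        student = make_model()
        optimizer = torch.optim.SGD(student.parameters(), lr=lr, momentum=0.9)
        for _ in range(epochs):
            for x, y in train_loader:
                logits = student(x)
                loss = F.cross_entropy(logits, y)
                if teacher is not None:
                    with torch.no_grad():
                        t_logits = teacher(x)
                    # Soft-target term distilled from the previous generation.
                    loss = (1 - alpha) * loss + alpha * (T * T) * F.kl_div(
                        F.log_softmax(logits / T, dim=1),
                        F.softmax(t_logits / T, dim=1),
                        reduction="batchmean")
                optimizer.zero_grad(); loss.backward(); optimizer.step()
        # The trained student becomes the teacher of the next generation.
        teacher = copy.deepcopy(student).eval()
    return teacher
```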

For an intuitive understanding, offline, online, and self-distillation can also be viewed through the lens of human teacher-student learning. Offline distillation means a knowledgeable teacher teaches the knowledge to a new student; online distillation means the teacher and the student learn together, with the teacher providing the main supervision; and self-distillation means the student learns the knowledge by itself, without a teacher. Moreover, just as in human learning, these three kinds of distillation can be combined to complement each other thanks to their respective advantages.

5 Teacher-Student Architecture

In knowledge distillation, the teacher-student architecture is the generic carrier of knowledge transfer. In other words, the quality of knowledge acquisition and distillation from teacher to student is also determined by how the teacher and student networks are designed. In terms of the habits of human learning, we hope that a student can find the right teacher. Thus, to capture and distill knowledge well, how to select or design proper structures of the teacher and the student is an important but difficult problem. Currently, the model setups of the teacher and the student are almost always fixed, with unvaried sizes and structures, before distillation, which easily causes a model capacity gap. However, how to particularly design the architectures of the teacher and the student, and why their architectures are determined by these model setups, has rarely been studied. In this section, we mainly discuss the relationships between the structures of the teacher and student models, as illustrated in Fig. 7.

Figure 7: The relationships between the structures of the teacher and the student.

Knowledge distillation was previously designed to compress an ensemble of deep neural networks in (Hinton et al., 2015). The complexity of deep neural networks mainly comes from two dimensions: depth and width, i.e., we usually need to transfer knowledge from deeper and wider neural networks to shallower and thinner neural networks (Romero et al., 2015). Therefore, the student networks usually can be: 1) a simplified version of the teacher networks with fewer layers and/or fewer channels in each layer (Wang et al., 2018a; Zhu and Gong, 2018); 2) a quantized version of the teacher networks which shares some same structures with the teacher networks (Polino et al., 2018; Mishra and Marr, 2017; Wei et al., 2018; Shin et al., 2019); 3) small networks which are based on efficient basic operations of the teacher (Howard et al., 2017; Zhang et al., 2018a; Huang et al., 2017); 4) small networks with optimized global network structure (Liu et al., 2019h; Xie et al., 2019; Gu and Tresp, 2020); and 5) the same structures as teacher (Zhang et al., 2018b; Furlanello et al., 2018).

The model capacity gap between the large deep neural network and the small student neural network degrades the performance of knowledge transfer (Mirzadeh et al., 2019; Gao et al., 2020). To effectively transfer knowledge to the student network, a variety of methods have been proposed to smoothly reduce the model complexity (Zhang et al., 2018b; Nowak and Corso, 2018; Crowley et al., 2018; Wang et al., 2018a; Liu et al., 2019a, h; Gu and Tresp, 2020). On the one hand, Mirzadeh et al. (2019) introduced a teacher assistant to mitigate the training gap between the teacher model and the student model, which was further improved by residual learning, i.e., an assistant structure is used to learn the residual error (Gao et al., 2020). On the other hand, several recent methods focus on minimizing the structural difference between the student and teacher models. For example, Polino et al. (2018) combined network quantization with knowledge distillation, i.e., the student model is a small, quantized version of the teacher model. Nowak and Corso (2018) proposed a structure compression method that transfers the knowledge learned by multiple layers to a single layer. Wang et al. (2018a) progressively performed block-wise knowledge transfer from teacher networks to student networks while preserving the receptive field. In the online setting, the teacher is usually an ensemble of student networks, in which the student models share similar (or the same) structures (Zhang et al., 2018b; Zhu and Gong, 2018; Furlanello et al., 2018; Chen et al., 2019a).

Recently, depth-wise separable convolution has been widely used in designing efficient neural networks for mobile or embedded devices (Chollet, 2017; Howard et al., 2017; Sandler et al., 2018; Zhang et al., 2018a; Ma et al., 2018). Inspired by the success of neural architecture search (or NAS), the performance of small neural networks has been further improved by searching the global structure based on efficient meta operations or blocks (Wu et al., 2019; Tan et al., 2019; Tan and Le, 2019; Radosavovic et al., 2020). Furthermore, the idea of dynamically searching proper knowledge transfer regime also appears in knowledge distillation, e.g., automatically removing redundant layers in a data-driven way using reinforcement learning (Ashok et al., 2017), and searching the optimal student networks given the teacher networks (Liu et al., 2019h; Xie et al., 2019; Gu and Tresp, 2020). The idea of neural architecture search in knowledge distillation, i.e., a joint search of student structure and knowledge transfer under the guidance of the teacher model, will be an interesting subject of future study.

6 Distillation Algorithms

A simple yet very effective idea for knowledge transfer is to directly match the response/feature-based knowledge (Hinton et al., 2015; Romero et al., 2015) or the distributions in feature space (Passalis and Tefas, 2018) between the teacher model and the student model. Recently, many different algorithms have been proposed to improve the process of transferring knowledge in more complex settings. In this section, we review recently proposed typical types of distillation methods for knowledge transfer within the field of knowledge distillation.

6.1 Adversarial distillation

In knowledge distillation, it is difficult for the teacher model to perfectly capture the true data distribution, and the student model, due to its small capacity, also cannot accurately mimic the teacher model and learn the true data distribution from the transferred teacher knowledge (Mirzadeh et al., 2019). How, then, can the student model be further encouraged to mimic the teacher model? Recently, adversarial learning has received more and more attention due to its great success in generative modelling, i.e., generative adversarial networks (GANs) (Goodfellow et al., 2014). Specifically, the discriminator in a GAN estimates the probability that a sample comes from the training data distribution, while the generator tries to fool the discriminator with generated data samples. Inspired by this, many adversarial knowledge distillation methods have been proposed to enable the teacher/student networks to better understand the true data distribution (Wang et al., 2018d; Xu et al., 2017; Micaelli and Storkey, 2019; Xu et al., 2018; Liu et al., 2018b; Wang et al., 2018e; Chen et al., 2019b; Shen et al., 2019b; Shu et al., 2019; Liu et al., 2018a; Belagiannis et al., 2018).

Figure 8: The different categories of the main adversarial distillation methods. (a) Generator in GAN produces training data to improve KD performance, and the teacher may be used as discriminator. (b) Discriminator in GAN makes the student (also as generator) well mimic teacher. (c) Teacher and student form a generator, and online knowledge distillation is enhanced by discriminator.

As shown in Fig. 8, adversarial learning-based distillation methods, especially those using GANs, can be roughly divided into three main categories. 1) An adversarial generator is trained to produce synthetic data, which is either used directly as the training set (Chen et al., 2019b) or used to augment the training set (Liu et al., 2018b), as shown in Fig. 8(a); Micaelli and Storkey (2019) also used an adversarial generator to produce hard examples for knowledge transfer. 2) A discriminator is introduced to distinguish the samples of the student from those of the teacher, using either the logits (Xu et al., 2017, 2018) or the features (Wang et al., 2018e), as shown in Fig. 8(b); specifically, Belagiannis et al. (2018) used unlabeled data samples to perform the knowledge transfer, and multiple discriminators were used in (Shen et al., 2019b). Furthermore, an effective intermediate supervision, i.e., squeezed knowledge, was used in (Shu et al., 2019) to mitigate the capacity gap between the teacher and the student. 3) Different from the above two categories, in which the teacher model is fixed while the student model is trained, adversarial knowledge distillation can also work in an online manner, i.e., the teacher and the student are jointly optimized in each iteration (Wang et al., 2018d; Chung et al., 2020), as shown in Fig. 8(c). Besides, knowledge distillation can also be used to compress GANs: a small student GAN mimics a larger teacher GAN via knowledge transfer (Aguinaldo et al., 2019).
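A minimal sketch of the second category of Fig. 8(b) is shown below, under illustrative assumptions: a small discriminator is trained to tell teacher logits from student logits, while the student is trained both to match the hard labels and to fool the discriminator. The discriminator architecture, loss weights, and update schedule are assumptions for illustration, not a specific published method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def adversarial_distillation_step(student, teacher, discriminator,
                                  opt_s, opt_d, x, y, adv_weight=0.1):
    """One step of discriminator-based distillation: the discriminator learns to
    separate teacher and student logits; the student learns to fool it."""
    with torch.no_grad():
        t_logits = teacher(x)
    s_logits = student(x)

    # 1) Update the discriminator: teacher logits -> 1, student logits -> 0.
    d_real = discriminator(t_logits)
    d_fake = discriminator(s_logits.detach())
    d_loss = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) Update the student: supervised loss + adversarial loss (fool the discriminator).
    d_fool = discriminator(s_logits)
    s_loss = F.cross_entropy(s_logits, y) + adv_weight * \
             F.binary_cross_entropy_with_logits(d_fool, torch.ones_like(d_fool))
    opt_s.zero_grad(); s_loss.backward(); opt_s.step()
    return d_loss.item(), s_loss.item()

# A toy logit-level discriminator (hypothetical):
# discriminator = nn.Sequential(nn.Linear(num_classes, 64), nn.ReLU(), nn.Linear(64, 1))
```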

In summary, three main points can be drawn from the adversarial distillation methods above: GANs are effective tools for enhancing the student's learning via teacher knowledge transfer; jointly using GANs and KD can generate valuable data, improving KD performance and overcoming the limitation of unavailable or inaccessible data; and KD can be used to compress GANs.

6.2 Multi-teacher distillation

Different teacher architectures can provide their own useful knowledge to improve the performance of the student. Thus, to fully exploit the teacher knowledge in the teacher-student architecture, multiple teacher networks can be used, individually or jointly, to distill knowledge while training a student network. In a typical teacher-student framework, the teacher is a large model or an ensemble of large models. To transfer knowledge from multiple teachers, the simplest way is to use the averaged response of all teachers as the supervision signal (Hinton et al., 2015). To further explore the advantages of multiple diverse teachers, several multi-teacher knowledge distillation methods have recently been proposed (You et al., 2017; Chen et al., 2019c; Furlanello et al., 2018; Yang et al., 2019a; Zhang et al., 2018b; Sau and Balasubramanian, 2016; Park and Kwak, 2019; Papernot et al., 2016a; Fukuda et al., 2017; Ruder et al., 2017; Wu et al., 2019a; Vongkulbhisal et al., 2019; Yang et al., 2020; Lee et al., 2019c). The generic framework of multi-teacher distillation methods is shown in Fig. 9.

Figure 9: The generic framework of multi-teacher distillation methods.

Multiple teacher networks have turned out to be effective for training the student model, using logits and feature representations as the knowledge. Apart from averaging the logits of all teachers, You et al. (2017) further incorporated features from the intermediate layers to encourage the dissimilarity among different training samples. To utilize both logits and intermediate features, Chen et al. (2019c) used two teacher networks, in which one teacher transferred response-based knowledge and the other transferred feature-based knowledge to the student. Fukuda et al. (2017) randomly selected one teacher from the pool of teacher networks at each iteration. To transfer feature-based knowledge from multiple teachers, additional teacher branches are added to the student networks to mimic the intermediate features of the teachers (Park and Kwak, 2019; Asif et al., 2019). Born-again networks address multiple teachers in a step-by-step manner, i.e., the student at step t is used as the teacher of the student at step t+1 (Furlanello et al., 2018), and similar ideas can be found in (Yang et al., 2019a). To efficiently perform knowledge transfer and explore the power of multiple teachers, several alternative methods simulate multiple teachers from a single teacher by adding different kinds of noise (Sau and Balasubramanian, 2016) or by using stochastic blocks and skip connections (Lee et al., 2019c). More interestingly, due to the special characteristics of multi-teacher KD, its extensions are used for domain adaptation via knowledge adaptation (Ruder et al., 2017) and for protecting the privacy and security of data (Papernot et al., 2016a; Vongkulbhisal et al., 2019).
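As a minimal illustration of the simplest multi-teacher scheme mentioned above (averaging the teachers' responses), the sketch below averages the softened probabilities of several frozen teachers and uses the result as the soft target for the student. The temperature and uniform averaging are illustrative assumptions; weighted or randomly sampled teachers are common variants.

```python
import torch
import torch.nn.functional as F

def multi_teacher_soft_targets(teachers, x, T=4.0):
    """Average the temperature-softened predictions of several frozen teachers."""
    with torch.no_grad():
        probs = [F.softmax(teacher(x) / T, dim=1) for teacher in teachers]
    return torch.stack(probs, dim=0).mean(dim=0)

def multi_teacher_kd_loss(student_logits, avg_teacher_probs, labels, T=4.0, alpha=0.7):
    """Distill from the averaged teacher distribution plus the hard-label loss."""
    distill = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                       avg_teacher_probs, reduction="batchmean") * (T * T)
    return alpha * distill + (1 - alpha) * F.cross_entropy(student_logits, labels)
```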

6.3 Cross-modal distillation

Since knowledge distillation naturally transfers knowledge from a teacher to a student, it can also realize knowledge transfer between different modalities in cross-modal scenarios. Moreover, different input modalities provide either similar or complementary information, while the data or labels of some modalities might not be available during the training or testing phases (Gupta et al., 2016; Garcia et al., 2018; Zhao et al., 2018), making it important to transfer knowledge between modalities. In view of these facts, several cross-modal knowledge distillation methods have recently been proposed (Roheda et al., 2018; Garcia et al., 2018; Passalis and Tefas, 2018; Do et al., 2019; Gupta et al., 2016; Albanie et al., 2018; Zhao et al., 2018; Kundu et al., 2019; Hoffman et al., 2016; Chen et al., 2019d; Thoker and Gall, 2019; Su and Maji, 2016). In this subsection, we review several typical settings of cross-modal knowledge transfer.

Given a teacher model pretrained on one modality (e.g., RGB images) with a large number of well-annotated data samples, Gupta et al. (2016) transferred knowledge from the teacher model to a student model operating on a new, unlabeled input modality, such as depth images or optical flow. Specifically, the proposed method relies on unlabeled paired samples from both modalities, i.e., both RGB and depth images, and the features obtained from the RGB images by the teacher are then used as the supervision to train the student (Gupta et al., 2016). The idea behind paired samples is to transfer the annotation/label information via pair-wise sample registration, and it has been widely used in cross-modal applications (Albanie et al., 2018; Zhao et al., 2018; Thoker and Gall, 2019). For example, to perform human pose estimation through walls and occlusions, Zhao et al. (2018) used synchronized radio signals and camera images to transfer knowledge across modalities for radio-based human pose estimation. In (Thoker and Gall, 2019), paired samples from two modalities, RGB videos and skeleton sequences, are used to transfer the knowledge learned on RGB videos to a new skeleton-based human action recognition model. To improve the action recognition performance using only RGB images, Garcia et al. (2018) performed cross-modality distillation on an additional modality, i.e., depth images, to generate a hallucination stream for the RGB image modality. Tian et al. (2020) further introduced a contrastive loss to transfer pair-wise relationships across different modalities. The generic framework of cross-modal distillation methods is shown in Fig. 10.

Figure 10: The generic framework of cross-modal distillation methods. For simplicity, only two modalities are used as an example for the cross-modal distillation methods.

Moreover, Do et al. (2019) proposed a knowledge distillation-based visual question answering method, in which knowledge from a trilinear-interaction teacher model with image-question-answer triplets as inputs is distilled into a bilinear-interaction student model with image-question pairs as inputs. The probabilistic knowledge distillation proposed in (Passalis and Tefas, 2018) was also used for knowledge transfer from the textual modality into the visual modality. In (Hoffman et al., 2016), a modality hallucination architecture based on cross-modality distillation was proposed to improve detection performance.

6.4 Graph-based distillation

Most knowledge distillation algorithms focus on transferring knowledge from the teacher to the student in terms of individual training samples, while several recent methods explore the intra-data relationships (Luo et al., 2018; Chen et al., 2020; Zhang and Peng, 2018; Lee and Song, 2019; Park et al., 2019; Liu et al., 2019f; Tung and Mori, 2019; Peng et al., 2019a; Minami et al., 2019; Ma and Mei, 2019; Yao et al., 2019). The main idea of graph-based methods is 1) to use a graph as the carrier of the teacher knowledge; or 2) to use a graph to control the message passing of the teacher knowledge. The generic framework of graph-based distillation methods is illustrated in Fig. 11. As described in Section 3.3, graph-based knowledge can be seen as a form of relation-based knowledge. In this section, we introduce typical definitions of graph-based knowledge and the graph-based message-passing distillation algorithms.

Figure 11: The generic framework of graph-based distillation methods.

In (Zhang and Peng, 2018), each vertex represents a self-supervised teacher, and two graphs are then constructed using logits and intermediate features, i.e., the logits graph and the representation graph, to transfer knowledge from multiple self-supervised tasks to the student model. In (Chen et al., 2020), the graph is used to maintain the relationships between samples in the high-dimensional space, and knowledge transfer is then performed with the proposed locality-preserving loss function. Lee and Song (2019) considered intra-data relations via a multi-head graph, in which the vertices are the features at different levels of layers in CNNs. Park et al. (2019) directly transferred the mutual relations of data samples, i.e., matching the edges of the graphs of the teacher and the student. Similarly, in (Tung and Mori, 2019), the mutual relations are arranged in a similarity matrix, and the similarity-preserving knowledge is transferred by matching the similarity matrices. Furthermore, Peng et al. (2019a) not only matched the response-based and feature-based knowledge, but also used graph-based knowledge. In (Liu et al., 2019f), the instance features and instance relationships are modeled as the vertices and edges of the graph, respectively.

Rather than using graph-based knowledge, several methods control the process of knowledge transfer using a graph. Specifically, Luo et al. (2018) considered the modality discrepancy to incorporate privileged information from the source domain, and introduced a directed graph, named the distillation graph, to explore the relationships between different modalities: each vertex represents a modality, and an edge indicates the connection strength from one modality to another. Minami et al. (2019) proposed bidirectional graph-based diverse collaborative learning to explore diverse knowledge transfer patterns. Yao et al. (2019) further introduced GNNs to handle knowledge transfer for graph-based knowledge.

6.5 Attention-based distillation

Since attention can well reflect the neuron activations of a convolutional neural network, attention mechanisms have been used in knowledge distillation to improve the performance of the student network (Zagoruyko and Komodakis, 2016; Huang and Wang, 2017; Srinivas and Fleuret, 2018; Crowley et al., 2018; Song et al., 2018). Among these attention-based KD methods (Crowley et al., 2018; Huang and Wang, 2017; Srinivas and Fleuret, 2018; Zagoruyko and Komodakis, 2016), different attention transfer mechanisms are defined for distilling knowledge from the teacher network to the student network. The core of attention transfer is to define attention maps over the feature embeddings of the network layers; that is, the knowledge about the feature embeddings is transferred via attention map functions. Unlike these attention maps, an attentive knowledge distillation method was proposed that uses an attention mechanism to attentively assign confidences to different rules (Song et al., 2018).
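The sketch below illustrates activation-based attention transfer in the spirit of Zagoruyko and Komodakis (2016): an attention map is obtained by aggregating squared activations over the channel dimension, the flattened maps are L2-normalized, and the student is penalized for deviating from the teacher's maps. The power of 2, the channel-wise mean, and the single layer pair are illustrative choices; the original method sums such a loss over several layer pairs.

```python
import torch
import torch.nn.functional as F

def attention_map(feat, eps=1e-6):
    """Activation-based attention: aggregate squared activations over channels,
    then flatten and L2-normalize. feat has shape (B, C, H, W)."""
    amap = feat.pow(2).mean(dim=1)               # (B, H, W)
    amap = amap.flatten(start_dim=1)             # (B, H*W)
    return amap / (amap.norm(p=2, dim=1, keepdim=True) + eps)

def attention_transfer_loss(student_feat, teacher_feat):
    """Match the normalized attention maps of one student/teacher layer pair."""
    return (attention_map(student_feat) -
            attention_map(teacher_feat.detach())).pow(2).mean()
```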

6.6 Data-free distillation

To overcome the issue of unavailable data due to privacy, legality, security, or confidentiality concerns, some data-free KD methods have been proposed (Chen et al., 2019b; Micaelli and Storkey, 2019; Lopes et al., 2017; Nayak et al., 2019). As the name implies, there is no transfer training data for the student network; instead, the data is newly or synthetically generated in different ways. In (Chen et al., 2019b; Micaelli and Storkey, 2019), the transfer data is generated by a GAN. In the data-free knowledge distillation method of (Lopes et al., 2017), the transfer data used to train the student network is reconstructed from the layer activations or layer spectral activations of the teacher network. In (Nayak et al., 2019), a zero-shot knowledge distillation method was proposed that does not use existing data; the transfer data is produced by modelling the softmax space from the parameters of the teacher network. In fact, the target data in (Micaelli and Storkey, 2019; Nayak et al., 2019) is generated using information from the feature representations of the teacher network. Similar to zero-shot learning, a knowledge distillation method with few-shot learning was designed by distilling knowledge from a teacher model based on Gaussian processes into a student neural network, where the teacher uses limited labelled data (Kimura et al., 2018).

6.7 Quantized distillation

Network quantization reduces the computational complexity of neural networks by converting high-precision networks (e.g., 32-bit floating point) into low-bit networks (e.g., 2-bit and 8-bit). Meanwhile, knowledge distillation aims to train a small model that yields performance comparable to a complex model. Inspired by this, some quantized KD methods have been proposed that use the quantization process within the teacher-student framework (Polino et al., 2018; Mishra and Marr, 2017; Wei et al., 2018; Shin et al., 2019; Kim et al., 2019a); the framework of quantized distillation methods is shown in Fig. 12. Specifically, in (Polino et al., 2018), a quantized distillation method was proposed to transfer knowledge to a weight-quantized student network. In (Mishra and Marr, 2017), the proposed quantized KD was called the “apprentice”, in which a high-precision teacher network teaches a low-precision student network through knowledge transfer. To make a small student network mimic a large teacher network well, the full-precision teacher network is first quantized on the feature maps, and then the knowledge is transferred from the quantized teacher to a quantized student network (Wei et al., 2018). In (Kim et al., 2019a), quantization-aware knowledge distillation was proposed, based on self-study of a quantized student network and co-study of the teacher and student networks with knowledge transfer. Furthermore, Shin et al. (2019) performed an empirical analysis of deep neural networks using both distillation and quantization with respect to the hyper-parameters of knowledge distillation, including the size of the teacher networks and the distillation temperature.

Figure 12: The generic framework of quantized distillation methods.
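The following is a minimal PyTorch-style sketch of the two ingredients such methods combine: a straight-through weight quantizer and a distillation loss against the full-precision teacher. The bit width, temperature and loss weighting are illustrative assumptions; published quantized-distillation methods differ in where and how quantization is applied.

```python
import torch
import torch.nn.functional as F

class QuantizeSTE(torch.autograd.Function):
    """Uniform symmetric k-bit weight quantization with a
    straight-through estimator for the backward pass (a sketch)."""
    @staticmethod
    def forward(ctx, w, num_bits=8):
        qmax = 2 ** (num_bits - 1) - 1
        scale = w.abs().max().clamp(min=1e-8) / qmax
        return torch.round(w / scale).clamp(-qmax, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None   # pass gradients straight through

def quantized_distillation_loss(student_logits, teacher_logits, labels,
                                temperature=4.0, alpha=0.7):
    # Soft targets from the full-precision teacher plus the usual
    # cross-entropy on ground-truth labels (the weighting is illustrative).
    soft = F.kl_div(F.log_softmax(student_logits / temperature, dim=1),
                    F.softmax(teacher_logits / temperature, dim=1),
                    reduction="batchmean") * temperature ** 2
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```

In each training step the student’s weight tensors would be quantized with QuantizeSTE.apply before the forward pass (e.g., by temporarily swapping module weights), while the optimizer updates the underlying full-precision copies.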

6.8 Lifelong distillation

Lifelong learning, including continual learning, continuous learning and meta-learning, aims to learn in a way similar to humans: it not only accumulates previously learned knowledge but also transfers the learned knowledge to future learning (Chen and Liu, 2018). Given this characteristic, knowledge distillation can be an effective way to preserve and transfer the learned knowledge and thereby avoid catastrophic forgetting. Recently, an increasing number of KD variants based on lifelong learning have been developed (Jang et al., 2019; Flennerhag et al., 2018; Peng et al., 2019b; Liu et al., 2019d; Lee et al., 2019b; Zhai et al., 2019; Zhou et al., 2019c; Li and Hoiem, 2017; Shmelkov et al., 2017). Among them, the methods in (Jang et al., 2019; Flennerhag et al., 2018; Peng et al., 2019b; Liu et al., 2019d) adopt the basic idea of meta-learning. In (Jang et al., 2019), Jang et al. designed meta-transfer networks that can determine what and where to transfer in the teacher-student architecture. In (Flennerhag et al., 2018), a lightweight framework called Leap was proposed for meta-learning over task manifolds by transferring knowledge across the learning processes. Peng et al. designed a new knowledge transfer network architecture for few-shot image recognition, which incorporates both the visual information of images and prior knowledge (Peng et al., 2019b). In (Liu et al., 2019d), a semantic-aware knowledge preservation method was proposed for image retrieval, in which the teacher knowledge from the image and semantic modalities is preserved and transferred. Moreover, to address the problem of catastrophic forgetting in lifelong learning, global distillation (Lee et al., 2019b), multi-model distillation (Zhou et al., 2019c), knowledge distillation-based lifelong GAN (Zhai et al., 2019) and other KD-based methods (Li and Hoiem, 2017; Shmelkov et al., 2017) have been developed to extract the learned knowledge and to teach the student network on new tasks.
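To illustrate how distillation counters catastrophic forgetting, the sketch below follows the general learning-without-forgetting recipe: a frozen snapshot of the previous model provides soft targets for the old task while the live model is trained on the new task. The two-headed model interface, temperature and loss weight are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def lwf_step(model, old_model, x, y_new, optimizer,
             temperature=2.0, lambda_old=1.0):
    """One training step combining the new-task loss with a distillation
    loss that keeps the old-task outputs close to those of the frozen
    previous model (a minimal LwF-style sketch; the model is assumed to
    return (old_task_logits, new_task_logits))."""
    model.train()
    with torch.no_grad():
        old_targets, _ = old_model(x)        # frozen snapshot of the old model
    old_logits, new_logits = model(x)
    loss_new = F.cross_entropy(new_logits, y_new)
    loss_old = F.kl_div(F.log_softmax(old_logits / temperature, dim=1),
                        F.softmax(old_targets / temperature, dim=1),
                        reduction="batchmean") * temperature ** 2
    loss = loss_new + lambda_old * loss_old
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Before starting the new task, snapshot the previous model, e.g.
#   old_model = copy.deepcopy(model).eval()   # requires `import copy`
```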

6.9 NAS-based distillation

Neural architecture search (NAS), one of the most popular automated machine learning (AutoML) techniques, aims to automatically identify deep neural models and adaptively learn appropriate deep neural structures. In knowledge distillation, meanwhile, the success of knowledge transfer depends not only on the knowledge from the teacher but also on the architecture of the student. However, there might be a capacity gap between the large teacher model and the small student model, making it difficult for the student to learn well from the teacher’s knowledge. To address this issue, neural architecture search has been adopted to find an appropriate student architecture in knowledge distillation, as in oracle-based (Kang et al., 2019) and architecture-aware knowledge distillation (Liu et al., 2019h). Furthermore, knowledge distillation can also be employed to improve the efficiency of neural architecture search itself, as in AdaNAS (Macko et al., 2019), NAS with distilled architecture knowledge (Li et al., 2020) and teacher guided search for architectures, or TGSA (Bashivan et al., 2019). In TGSA, each architecture search step is guided to mimic the intermediate feature representations of the teacher network, so that the structure of the candidate network (or student) is efficiently searched and the feature transfer is effectively supervised by the teacher.
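The sketch below illustrates the general idea in its most naive form: each candidate student architecture is briefly distilled from the teacher, and the one with the best validation accuracy is kept. Real NAS-based distillation methods replace this brute-force loop with far more efficient search strategies (e.g., reinforcement learning or weight sharing); the helper functions here are assumptions, not part of any cited method.

```python
def search_student_by_distillation(candidate_builders, teacher,
                                   train_loader, val_loader,
                                   distill_fn, evaluate_fn,
                                   budget_epochs=5):
    """Naive distillation-guided architecture search (illustrative only):
    each candidate student is briefly trained to mimic the teacher and the
    architecture with the best validation accuracy is returned."""
    best_acc, best_student = -1.0, None
    for build in candidate_builders:      # callables returning fresh students
        student = build()
        for _ in range(budget_epochs):
            distill_fn(student, teacher, train_loader)   # assumed helper
        acc = evaluate_fn(student, val_loader)           # assumed helper
        if acc > best_acc:
            best_acc, best_student = acc, student
    return best_student, best_acc
```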

7 Applications

As an effective technique for the compression and acceleration of deep neural networks, knowledge distillation has been widely used in different fields of artificial intelligence, including visual recognition, speech recognition, natural language processing (NLP), and recommender systems. Furthermore, knowledge distillation can also be used for other purposes, such as protecting data privacy and defending against adversarial attacks. In this section, we briefly review knowledge distillation from the perspective of different applications.

7.1 KD in Visual Recognition

In the last few years, a variety of knowledge distillation methods have been widely used to solve the problem of model compression in different visual recognition applications. Specifically, most knowledge distillation methods were previously developed for image classification (Li and Hoiem, 2017; Peng et al., 2019b; Bagherinezhad et al., 2018; Chen et al., 2018a; Wang et al., 2019b; Mukherjee et al., 2019; Zhu et al., 2019) and then extended to other visual recognition applications, including face recognition (Luo et al., 2016; Kong et al., 2019; Yan et al., 2019; Ge et al., 2018; Wang et al., 2018b, 2019c; Duong et al., 2019; Wu et al., 2020; Wang et al., 2017), action recognition (Luo et al., 2018; Thoker and Gall, 2019; Hao and Zhang, 2019; Garcia et al., 2018; Wang et al., 2019e; Wu et al., 2019b; Zhang et al., 2020), object detection (Li et al., 2017; Hong and Yu, 2019; Shmelkov et al., 2017; Wei et al., 2018; Wang et al., 2019d), lane detection (Hou et al., 2019), pedestrian detection (Shen et al., 2016), video classification (Bhardwaj et al., 2019; Zhang and Peng, 2018), facial landmark detection (Dong and Yang, 2019), person re-identification (Wu et al., 2019a), person search (Munjal et al., 2019), pose estimation (Nie et al., 2019; Zhang et al., 2019a; Zhao et al., 2018), image or video segmentation (He et al., 2019; Liu et al., 2019g; Mullapudi et al., 2019; Siam et al., 2019; Dou et al., 2020), saliency estimation (Li et al., 2019), image retrieval (Liu et al., 2019d), depth estimation (Pilzer et al., 2019; Ye et al., 2019), visual odometry (Saputra et al., 2019) and visual question answering (Mun et al., 2018; Aditya et al., 2019). Since knowledge distillation for classification tasks is fundamental to other tasks, we briefly review knowledge distillation in challenging image classification settings as well as other typical applications of knowledge distillation, such as face recognition and action recognition.

Existing KD-based face recognition methods not only ease the deployment of deep face models but also improve the classification accuracy (Luo et al., 2016; Kong et al., 2019; Yan et al., 2019; Ge et al., 2018; Wang et al., 2018b, 2019c; Duong et al., 2019; Wang et al., 2017). First of all, several methods focus on lightweight face recognition with very satisfactory accuracy (Luo et al., 2016; Wang et al., 2018b, 2019c; Duong et al., 2019). In (Luo et al., 2016), the knowledge from the chosen informative neurons of the top hint layer of the teacher network was transferred into the student network. A teacher weighting strategy with a feature loss on the hint layers was designed for knowledge transfer to avoid incorrect supervision from the teacher (Wang et al., 2018b). A recursive knowledge distillation method was designed by using the previous student to initialize the next one, where the student model was built from group and pointwise convolutions (Yan et al., 2019). Unlike the KD-based face recognition methods for closed-set problems, shrinking teacher-student networks for million-scale lightweight face recognition in open-set problems were proposed by designing different distillation losses (Duong et al., 2019; Wu et al., 2020). Moreover, a typical KD-based face recognition application is low-resolution face recognition (Ge et al., 2018; Wang et al., 2019c; Kong et al., 2019). To improve low-resolution face recognition accuracy, the knowledge distillation framework pairs a high-resolution face teacher with a low-resolution face student for model acceleration and improved classification performance. In (Ge et al., 2018), Ge et al. proposed a selective knowledge distillation method, in which the teacher network for high-resolution face recognition selectively transfers its informative facial features into the student network for low-resolution face recognition through sparse graph optimization. In (Kong et al., 2019), cross-resolution face recognition was realized by designing a resolution-invariant model unifying both face hallucination and heterogeneous recognition sub-nets. To obtain an efficient and effective low-resolution face recognition model, the multi-kernel maximum mean discrepancy between student and teacher networks was adopted as the feature loss (Wang et al., 2019c). In addition, KD-based face recognition was also extended to face alignment and verification by changing the losses in knowledge distillation (Wang et al., 2017).

Recently, knowledge distillation has been successfully applied to complex image classification problems by a number of typical methods (Peng et al., 2019b; Li and Hoiem, 2017; Bagherinezhad et al., 2018; Chen et al., 2018a; Wang et al., 2019b; Mukherjee et al., 2019; Zhu et al., 2019). To deal with incomplete, ambiguous and redundant image labels, the label refinery model based on self-distillation and label progression was proposed to learn soft, informative, collective and dynamic labels for complex image classification (Bagherinezhad et al., 2018). To address catastrophic forgetting with CNNs across a variety of image classification tasks, a learning-without-forgetting method for CNNs, based on both knowledge distillation and lifelong learning, was proposed to learn new image tasks while preserving the original ones (Li and Hoiem, 2017). To improve image classification accuracy, a feature maps-based knowledge distillation method with a GAN was proposed, transferring knowledge from feature maps to teach the student (Chen et al., 2018a). Using knowledge distillation, a visual interpretation and diagnosis framework that unifies teacher-student models for interpretation with a deep generative model for diagnosis was designed for image classifiers (Wang et al., 2019b). Similar to knowledge distillation-based low-resolution face recognition, a low-resolution image classification method was proposed via deep feature distillation, in which the output features of the student match those of the teacher (Zhu et al., 2019).

As argued in Section 6.3, knowledge distillation with the teacher-student structure can transfer and preserve cross-modality knowledge, so efficient and effective action recognition can be realized in cross-modal task scenarios (Thoker and Gall, 2019; Luo et al., 2018; Garcia et al., 2018; Hao and Zhang, 2019; Wu et al., 2019b; Zhang et al., 2020). In essence, these methods are spatiotemporal modality distillation methods with different forms of knowledge transfer for action recognition, such as mutual teacher-student networks (Thoker and Gall, 2019), multiple stream networks (Garcia et al., 2018), the spatiotemporal distilled dense-connectivity network (Hao and Zhang, 2019), multi-teacher to multi-student networks (Wu et al., 2019b; Zhang et al., 2020) and graph distillation (Luo et al., 2018). In these methods, the lightweight student can distill and share the knowledge from the multiple modalities held by the teacher.

From most of the knowledge distillation-based visual recognition applications above, we can summarize two main observations as follows.

  1. Knowledge distillation provides efficient and effective teacher-student architectures that satisfy the requirements of many complex visual recognition tasks.

  2. The knowledge transfer in knowledge distillation enables the full use, preservation and transfer of different kinds of information from complex visual data, such as cross-modality data, multi-domain data, multi-task data and low-resolution image data.

7.2 KD in NLP

Conventional language models such as BERT are time-consuming and resource-intensive, with complex and cumbersome structures. To obtain lightweight language models with good efficiency and effectiveness, knowledge distillation has recently been extensively studied in the field of natural language processing (NLP), and more and more KD methods have been proposed for solving numerous NLP tasks (Liu et al., 2019b; Haidar and Rezagholizadeh, 2019; Yang et al., 2020; Tang et al., 2019; Hu et al., 2018; Sun et al., 2019; Jiao et al., 2019; Nakashole and Flauger, 2017; Wang et al., 2018c; Zhou et al., 2019a; Sanh et al., 2019; Turc et al., 2019; Arora et al., 2019; Clark et al., 2019; Kim and Rush, 2016; Gordon and Duh, 2019; Liu et al., 2019e; Kuncoro et al., 2016; Mou et al., 2016; Tan et al., 2019; Hahn and Choi, 2019; Cui et al., 2017; Freitag et al., 2017; Wei et al., 2019; Shakeri et al., 2019; Aguilar et al., 2019). The existing NLP tasks using KD include neural machine translation (NMT) (Hahn and Choi, 2019; Kim and Rush, 2016; Zhou et al., 2019a; Tan et al., 2019; Gordon and Duh, 2019; Freitag et al., 2017; Wei et al., 2019), question answering systems (Wang et al., 2018c; Arora et al., 2019; Yang et al., 2020; Hu et al., 2018), document retrieval (Shakeri et al., 2019), text generation (Haidar and Rezagholizadeh, 2019), event detection (Liu et al., 2019b) and so on. Most of these KD-based NLP methods belong to natural language understanding (NLU), and many of them are designed as task-specific distillation (Tang et al., 2019; Turc et al., 2019; Mou et al., 2016) or multi-task distillation (Liu et al., 2019e; Yang et al., 2020; Sanh et al., 2019; Clark et al., 2019). In what follows, we mainly describe the research on KD for neural machine translation, and then on KD for extending a typical multilingual representation model, namely bidirectional encoder representations from transformers (BERT) (Devlin et al., 2018), in NLU.

Neural machine translation is one of the most active applications in natural language processing, and many extended knowledge distillation methods have recently been proposed for it (Hahn and Choi, 2019; Zhou et al., 2019a; Tan et al., 2019; Gordon and Duh, 2019; Wei et al., 2019; Freitag et al., 2017; Kim and Rush, 2016). In (Zhou et al., 2019a), an empirical analysis of how knowledge distillation affects non-autoregressive machine translation (NAT) models was conducted, concluding that the translation quality is largely determined by both the capacity of the NAT model and the complexity of the distilled data obtained via knowledge transfer. In the sequence generation scenario of NMT, word-level knowledge distillation was extended to sequence-level knowledge distillation for training a sequence-generation student model that mimics the sequence distribution of the teacher (Kim and Rush, 2016). The good performance of sequence-level knowledge distillation was further explained from the perspective of data augmentation and regularization in (Gordon and Duh, 2019). In (Tan et al., 2019), to handle multilingual diversity, multi-teacher knowledge distillation, with multiple individual models as the teachers for handling bilingual pairs and a multilingual model as the student, was proposed to improve the accuracy of multilingual machine translation. To improve translation quality, a knowledge distillation method was proposed that uses an ensemble teacher model, built from various kinds of NMT models, together with a data filtering method to teach the student model (Freitag et al., 2017). In (Wei et al., 2019), to improve the performance of machine translation and machine reading tasks, a novel online knowledge distillation method was proposed to address the instability of the training process and the decreasing performance on each validation set. In this online KD, the best evaluated model during training is chosen as the teacher model and is updated by any subsequent better model; if the next model performs poorly, the current teacher model continues to guide it.
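To make the word-level versus sequence-level distinction concrete, the sketch below shows a word-level KD loss (a per-token KL between teacher and student next-token distributions), with a note on how sequence-level KD replaces the reference targets with teacher-decoded translations. The tensor shapes, masking convention and temperature are illustrative assumptions.

```python
import torch.nn.functional as F

def word_level_kd_loss(student_logits, teacher_logits, temperature=1.0,
                       pad_mask=None):
    """Per-token KL between teacher and student next-token distributions.
    logits: (batch, seq_len, vocab); pad_mask: (batch, seq_len) with 1 for
    real tokens (an assumed convention)."""
    log_p_s = F.log_softmax(student_logits / temperature, dim=-1)
    p_t = F.softmax(teacher_logits / temperature, dim=-1)
    kl = F.kl_div(log_p_s, p_t, reduction="none").sum(dim=-1)  # (batch, seq)
    if pad_mask is not None:
        kl = kl * pad_mask
        return kl.sum() / pad_mask.sum().clamp(min=1)
    return kl.mean()

# Sequence-level KD (sketch): decode the training sources with the teacher
# (e.g., beam search) and train the student with ordinary cross-entropy on
# the teacher-generated target sequences instead of the references.
```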

As a multilingual representation model, BERT has attracted considerable attention in natural language understanding (Devlin et al., 2018), but it is also a cumbersome deep model that is not easy to deploy. To address this problem, several lightweight variants of BERT (referred to as BERT model compression) using knowledge distillation have been proposed (Sun et al., 2019; Jiao et al., 2019; Tang et al., 2019; Sanh et al., 2019). Sun et al. proposed patient knowledge distillation for BERT model compression (BERT-PKD), which was used for sentiment classification, paraphrase similarity matching, natural language inference, and machine reading comprehension (Sun et al., 2019). In the patient KD method, the feature representations of the [CLS] token from the hint layers of the teacher are transferred to the student. In (Jiao et al., 2019), to accelerate language inference with BERT, TinyBERT was proposed, built on a two-stage transformer knowledge distillation containing general distillation from general-domain knowledge and task-specific distillation from the task-specific knowledge in BERT. In (Tang et al., 2019), a task-specific knowledge distillation from the BERT teacher model into a bidirectional long short-term memory network (BiLSTM) was proposed for sentence classification and matching. In (Sanh et al., 2019), a lightweight student model called DistilBERT, with the same generic structure as BERT, was designed and trained on a variety of NLP tasks with good performance. In (Aguilar et al., 2019), a simplified student BERT was proposed by using the internal representations of a large teacher BERT via internal distillation.
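As a rough sketch of such BERT compression losses (in the spirit of patient knowledge distillation), the function below combines the usual soft-label loss on the classifier outputs with an MSE-style term that matches normalized [CLS] hidden states between selected student and teacher layers. The layer mapping, loss weights and the assumption of equal hidden sizes are illustrative.

```python
import torch.nn.functional as F

def patient_kd_loss(student_hidden, teacher_hidden, student_logits,
                    teacher_logits, labels, layer_map,
                    temperature=2.0, alpha=0.5, beta=10.0):
    """student_hidden / teacher_hidden: lists of (batch, seq_len, dim)
    hidden states per layer; layer_map pairs student layers with the
    teacher layers they should mimic (an assumed mapping, and the paired
    layers are assumed to share the same hidden size)."""
    # Soft-label distillation on the classifier outputs.
    soft = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                    F.softmax(teacher_logits / temperature, dim=-1),
                    reduction="batchmean") * temperature ** 2
    hard = F.cross_entropy(student_logits, labels)
    # "Patient" loss: match normalized [CLS] representations layer by layer.
    patient = 0.0
    for s_idx, t_idx in layer_map:
        cls_s = F.normalize(student_hidden[s_idx][:, 0], dim=-1)
        cls_t = F.normalize(teacher_hidden[t_idx][:, 0], dim=-1)
        patient = patient + (cls_s - cls_t).pow(2).sum(dim=-1).mean()
    return (1 - alpha) * hard + alpha * soft + beta * patient
```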

In addition, several KD methods used in NLP from other perspectives are presented below. For question answering (Hu et al., 2018), to improve the efficiency and robustness of existing machine reading comprehension methods, an attention-guided answer distillation method was proposed that fuses generic distillation and answer distillation to address the problem of confusing answers. For task-specific distillation (Turc et al., 2019), the performance of knowledge distillation under the interactions among pre-training, distillation and fine-tuning of the compact student model was studied, and the proposed pre-trained distillation performed well on sentiment classification, natural language inference and textual entailment. For multi-task distillation in the context of natural language understanding (Clark et al., 2019), a single-multi born-again distillation method, built on born-again neural networks (Furlanello et al., 2018), was proposed that uses single-task teachers to teach a multi-task student. For multilingual representations, knowledge distillation effectively realized knowledge transfer among multi-lingual word embeddings for bilingual dictionary induction (Nakashole and Flauger, 2017), and the effectiveness of knowledge transfer across ensembles of multilingual models was also studied (Cui et al., 2017).

Through the discussions about natural language processing using knowledge distillation above, several observations can be summarized as follows.

  1. Knowledge distillation provides efficient and effective lightweight language models based on the teacher-student architecture. The teacher can transfer the rich knowledge from a large amount of language data to the student, so that the student can quickly complete many language tasks.

  2. The teacher-student knowledge transfer can effectively address many multilingual tasks.

7.3 KD in Speech Recognition

In the field of speech recognition, deep neural acoustic models have attracted much attention due to their powerful performance. However, more and more real-time speech recognition systems need to be deployed on embedded platforms with limited computational resources and strict response-time requirements, which state-of-the-art complex deep models cannot satisfy. To meet this requirement, knowledge distillation, as an effective technique of model compression and acceleration for deep models, has been widely studied and applied in many speech recognition tasks. Recently, numerous knowledge distillation variants have been designed to obtain lightweight deep acoustic models for speech recognition (Fukuda et al., 2017; Albanie et al., 2018; Roheda et al., 2018; Shi et al., 2019a, b; Wong and Gales, 2016; Watanabe et al., 2017; Price et al., 2016; Chebotar and Waters, 2016; Lu et al., 2017; Chan et al., 2015; Asami et al., 2017; Shen et al., 2018, 2019a; Bai et al., 2019b; Huang et al., 2018; Ng et al., 2018; Ghorbani et al., 2018; Gao et al., 2019; Shi et al., 2019c; Perez et al., 2020; Takashima et al., 2018; Oord et al., 2017). In particular, these KD-based speech recognition applications include spoken language identification (Shen et al., 2018, 2019a), text-independent speaker recognition (Ng et al., 2018), audio classification (Gao et al., 2019; Perez et al., 2020), acoustic event detection (Price et al., 2016; Shi et al., 2019a, b), speech enhancement (Watanabe et al., 2017), speech synthesis (Oord et al., 2017) and so on.

Most of the existing knowledge distillation methods for speech recognition leverage teacher-student architectures to improve the efficiency and recognition accuracy of acoustic models (Chan et al., 2015; Chebotar and Waters, 2016; Lu et al., 2017; Price et al., 2016; Shen et al., 2018; Gao et al., 2019; Shen et al., 2019a; Shi et al., 2019c; Perez et al., 2020; Watanabe et al., 2017; Shi et al., 2019a). Using the recurrent nature of recurrent neural networks (RNNs), which can hold the temporal information of speech sequences, the knowledge from a teacher RNN acoustic model was transferred into a small student DNN model (Chan et al., 2015). Since combining multiple acoustic modes improves speech recognition accuracy, ensembles of different RNNs trained with different individual criteria were used to train a student model through knowledge transfer, and the learned student model obtained significant recognition improvements on 2,000-hour large vocabulary continuous speech recognition (LVCSR) tasks in 5 languages (Chebotar and Waters, 2016). To strengthen the generalization of spoken language identification (LID) models on short utterances, the knowledge of feature representations from a long-utterance-based teacher network was transferred into a short-utterance-based student network that can capture discrimination among short utterances and perform well on short-duration-utterance LID tasks (Shen et al., 2018). To further improve the performance of short-utterance-based LID, interactive teacher-student learning, an online distillation method, was proposed to enhance the feature representations of short utterances (Shen et al., 2019a). For audio classification, a multi-level feature distillation method was developed and an adversarial learning strategy was adopted to optimize the knowledge transfer (Gao et al., 2019). To improve noise-robust speech recognition, knowledge distillation was employed as a tool for speech enhancement (Watanabe et al., 2017). In (Perez et al., 2020), an audio-visual multimodal knowledge distillation method was proposed, transferring knowledge from teacher models on visual and acoustic data into a student model on audio data; in essence, this distillation shares cross-modal knowledge among the teachers and students (Perez et al., 2020; Albanie et al., 2018; Roheda et al., 2018). For efficient acoustic event detection, a quantized distillation method was proposed that uses both knowledge distillation and quantization (Shi et al., 2019a); the quantized distillation transfers knowledge from a larger CNN teacher model with better detection accuracy into a quantized RNN student model.
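As a minimal sketch of the multimodal case (an audio-only student distilled from visual and acoustic teachers), the loss below averages the two teachers' softened distributions as the target; the equal-weight fusion and temperature are illustrative choices, whereas published methods typically weight or gate the modalities.

```python
import torch.nn.functional as F

def multimodal_kd_loss(student_logits, teacher_logits_video,
                       teacher_logits_audio, temperature=3.0):
    """Audio-only student distilled from visual and acoustic teachers;
    the two teacher distributions are simply averaged (a sketch)."""
    p_video = F.softmax(teacher_logits_video / temperature, dim=-1)
    p_audio = F.softmax(teacher_logits_audio / temperature, dim=-1)
    target = 0.5 * (p_video + p_audio)
    log_p_s = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_p_s, target, reduction="batchmean") * temperature ** 2
```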

Unlike most of the traditional frame-level KD methods, sequence-level KD can perform better for sequence models in speech recognition, such as connectionist temporal classification (CTC) (Wong and Gales, 2016; Takashima et al., 2018; Huang et al., 2018). In general, sequence-level KD is carried out under the hypothesis of the label sequence using sequence training criteria (Huang et al., 2018). In (Wong and Gales, 2016), the difference in speech recognition performance between frame-level and sequence-level student-teacher training was studied, and a new sequence-level student-teacher training method was proposed that constructs the teacher ensemble via sequence-level rather than frame-level combination. To improve the performance of unidirectional RNN-based CTC for real-time speech recognition, the knowledge of a bidirectional LSTM-based CTC teacher model was transferred into a unidirectional LSTM-based CTC student model via frame-level KD and sequence-level KD (Takashima et al., 2018).

Moreover, knowledge distillation can be used to solve several special issues in speech recognition (Bai et al., 2019b; Asami et al., 2017; Ghorbani et al., 2018). To overcome the overfitting of DNN acoustic models in data-scarce scenarios, knowledge distillation was employed as a regularization method to train the adapted model under the supervision of the source model, and the final adapted model achieved better performance on three real acoustic domains, including two dialects and children’s speech (Asami et al., 2017). To overcome the performance degradation of non-native speech recognition, an advanced multi-accent student model was trained by distilling the knowledge from multiple accent-specific RNN-CTC models (Ghorbani et al., 2018). In essence, the knowledge distillation in (Asami et al., 2017; Ghorbani et al., 2018) realized cross-domain knowledge transfer. To reduce the complexity of fusing an external language model (LM) into a sequence-to-sequence (Seq2seq) model for speech recognition, knowledge distillation was employed as an effective tool to integrate the LM (teacher) into the Seq2seq model (student), and the trained Seq2seq model effectively reduced the character error rate of sequence-to-sequence speech recognition (Bai et al., 2019b).

According to the discussion of knowledge distillation-based speech recognition above, several observations can be summarized as follows.

  1. The lightweight student model can satisfy the practical requirements of speech recognition, such as real-time responses, limited resources and high recognition accuracy.

  2. Many teacher-student architectures are built on RNN models because of the temporal property of speech sequences.

  3. Sequence-level knowledge distillation can be effectively applied to sequence models.

  4. Knowledge distillation based on teacher-student knowledge transfer can effectively address cross-domain and cross-modal speech recognition applications, such as multi-accent recognition.

7.4 KD in Other Applications

In recommender systems, how to fully and correctly leverage external knowledge such as user reviews and product images plays a very important role in the effectiveness of deep recommendation models. Moreover, reducing the complexity and improving the efficiency of deep recommendation models is also necessary. Recently, knowledge distillation has been successfully applied in recommender systems for deep model compression and acceleration (Chen et al., 2018b; Tang and Wang, 2018; Pan et al., 2019). In (Tang and Wang, 2018), knowledge distillation was first introduced into recommender systems, under the name ranking distillation, because recommendation is a ranking problem. In (Chen et al., 2018b), an adversarial knowledge distillation method was designed for efficient recommendation, in which the teacher, a review prediction network, supervised the student, a user-item prediction network (generator), and the student learning was adjusted by adversarial adaptation between the teacher and student networks. Unlike the distillation in (Chen et al., 2018b; Tang and Wang, 2018), Pan et al. designed an enhanced collaborative denoising autoencoder (ECAE) model for recommender systems via knowledge distillation to capture useful knowledge from the generated data of user feedback and to reduce noise (Pan et al., 2019). The unified ECAE framework contains a generation network, a retraining network and a distillation layer that transfers knowledge and reduces noise from the generation network.

Owing to the natural characteristics of the teacher-student architecture, knowledge distillation is also used as an effective strategy against adversarial attacks or perturbations of deep models (Papernot et al., 2016b; Ross and Doshi-Velez, 2018; Goldblum et al., 2019; Gil et al., 2019) and against the unavailability of data caused by privacy, confidentiality and security concerns (Lopes et al., 2017; Vongkulbhisal et al., 2019; Papernot et al., 2016a; Bai et al., 2019a; Wang et al., 2019a). As argued in (Ross and Doshi-Velez, 2018; Papernot et al., 2016b), the perturbations of adversarial samples can be overcome by the robust outputs of the teacher networks and by distillation. To avoid exposing data privacy through direct use of the data, multiple teachers each accessed a subset of the sensitive or unlabelled data and supervised the student (Papernot et al., 2016a; Vongkulbhisal et al., 2019). To address the privacy and security issues arising from data, the data used to train the student network was generated from the layer activations or layer spectral activations of the teacher network via data-free distillation (Lopes et al., 2017). To protect data privacy and prevent intellectual piracy, a private model compression framework via knowledge distillation was proposed that releases the student model trained on public data while the teacher model is trained on both sensitive and public data (Wang et al., 2019a); this private knowledge distillation adopts a privacy loss and a batch loss to further improve the degree of privacy. To balance the compromise between privacy and performance, a few-shot network compression method via a novel layer-wise knowledge distillation with few samples per class was developed and achieved strong performance (Bai et al., 2019a). There are also other interesting applications of knowledge distillation, such as neural architecture search (Macko et al., 2019; Bashivan et al., 2019) and the interpretability of deep neural networks (Liu et al., 2018d).

8 Discussions

In this section, we discuss the challenges of knowledge distillation and provide some insights on the future research of knowledge distillation.

8.1 Challenges

For knowledge distillation, the key is 1) to extract rich knowledge from the teacher and 2) to transfer this knowledge to guide the training of the student. Therefore, we discuss the challenges of knowledge distillation from the following aspects: the quality of the knowledge, the distillation algorithms, the teacher-student architectures, and the theory behind knowledge distillation.

Most KD methods leverage a combination of different kinds of knowledge, including response-based, feature-based and relation-based knowledge. It is therefore important to know the influence of each individual kind of knowledge and how the different kinds of knowledge help each other in a complementary manner. For example, response-based knowledge shares a similar motivation with label smoothing and model regularization (Kim and Kim, 2017; Muller et al., 2019; Ding et al., 2019); feature-based knowledge is usually used to mimic the intermediate process of the teacher, while relation-based knowledge tries to capture the relationships across different samples. To this end, it is still a challenge to model the different kinds of knowledge in a unified framework.
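To make the connection between response-based knowledge and label smoothing explicit, the following sketch places the temperature-softened soft-target loss next to a label-smoothing loss: both train the student against a softened target distribution, but only the former carries the teacher's learned inter-class similarities. The temperature and smoothing factor are illustrative values.

```python
import torch
import torch.nn.functional as F

def response_kd_loss(student_logits, teacher_logits, temperature=4.0):
    # Response-based knowledge: match the teacher's softened class
    # distribution (the classic soft-target loss).
    log_p_s = F.log_softmax(student_logits / temperature, dim=-1)
    p_t = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * temperature ** 2

def label_smoothing_loss(student_logits, labels, epsilon=0.1):
    # Label smoothing: a data-independent "teacher" that mixes the one-hot
    # target with a uniform distribution -- structurally similar soft targets,
    # but without any learned inter-class similarities.
    num_classes = student_logits.size(-1)
    log_p = F.log_softmax(student_logits, dim=-1)
    smoothed = torch.full_like(log_p, epsilon / (num_classes - 1))
    smoothed.scatter_(1, labels.unsqueeze(1), 1.0 - epsilon)
    return -(smoothed * log_p).sum(dim=-1).mean()
```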

How to transfer the captured knowledge from a teacher to a student is a key step in knowledge distillation. The knowledge transfer is realized by the distillation scheme, which determines both the ability of the student to learn and of the teacher to teach. Generally, the existing distillation schemes can be categorized into offline distillation, online distillation and self-distillation. To improve the ability of knowledge transfer, developing new distillation schemes and reasonably integrating different ones could be a practical approach, as sketched below for the online case.
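The snippet below is a minimal sketch of one common online scheme, in which two peer networks teach each other during training: each peer is trained on the ground truth plus a KL term toward the other's softened (and detached) prediction. Equal loss weights and the temperature are illustrative choices, not the settings of any particular cited method.

```python
import torch.nn.functional as F

def mutual_distillation_step(net1, net2, x, y, opt1, opt2, temperature=3.0):
    """One step of online (mutual) distillation between two peer networks."""
    logits1, logits2 = net1(x), net2(x)

    def soft_kl(p_logits, q_logits):
        # KL toward the peer's softened prediction; the target is detached
        # so each loss only updates its own network.
        return F.kl_div(F.log_softmax(p_logits / temperature, dim=1),
                        F.softmax(q_logits.detach() / temperature, dim=1),
                        reduction="batchmean") * temperature ** 2

    loss1 = F.cross_entropy(logits1, y) + soft_kl(logits1, logits2)
    loss2 = F.cross_entropy(logits2, y) + soft_kl(logits2, logits1)
    opt1.zero_grad()
    opt2.zero_grad()
    (loss1 + loss2).backward()
    opt1.step()
    opt2.step()
    return loss1.item(), loss2.item()
```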

Currently, most KD methods focus either on new kinds of knowledge or on new distillation loss functions, leaving the design of the teacher-student architectures poorly investigated (Nowak and Corso, 2018; Crowley et al., 2018; Kang et al., 2019; Liu et al., 2019h; Ashok et al., 2017; Liu et al., 2019a). In fact, apart from the knowledge and the distillation scheme, the relationship between the structures of the teacher and the student also significantly influences the performance of knowledge distillation. For example, as described in (Zhang et al., 2019b), the student model can scarcely learn the knowledge from some teacher models, which is caused by the model capacity gap between the teacher model and the student model (Kang et al., 2019). As a result, how to design an effective student model or find a proper teacher model is still a challenging problem in knowledge distillation.

Despite the huge number of knowledge distillation applications, the understanding of knowledge distillation, including theoretical explanations and empirical evaluations, remains insufficient (Lopez-Paz et al., 2015; Phuong and Lampert, 2019; Cho and Hariharan, 2019). For example, distillation can be viewed as a form of learning with privileged information (Lopez-Paz et al., 2015). Under the assumption of linear teacher and student models, theoretical explanations of the characteristics of student learning via distillation have been studied (Phuong and Lampert, 2019). Furthermore, some empirical evaluations and analyses of the efficacy of knowledge distillation are reported in (Cho and Hariharan, 2019). However, a deep understanding of the generalizability of knowledge distillation, especially how to measure the quality of knowledge or the quality of the teacher-student architecture, is still a very challenging problem.

8.2 Future Directions

To enhance the performance of knowledge distillation, the most important issues are what knowledge to distill from the teacher network, where to inject it into the student network, and which teacher-student architecture to use. In other words, useful knowledge, effective distillation and a proper teacher-student architecture are the three key aspects of knowledge distillation. Thus, given the generality of knowledge distillation for model compression, one of the future topics is still to develop new strategies for capturing and distilling knowledge and to design new teacher-student architectures, in order to solve the challenging problems above.

The model compression and acceleration methods for deep neural networks usually fall into four categories: parameter pruning and sharing, low-rank factorization, transferred/compact convolutional filters, and knowledge distillation (Cheng et al., 2018). Among existing knowledge distillation methods, only a few works discuss the combination of knowledge distillation with other kinds of compression methods. For example, quantized knowledge distillation, which can be viewed as falling under parameter pruning and sharing, integrates network quantization into the teacher-student architecture (Polino et al., 2018; Mishra and Marr, 2017; Wei et al., 2018). Therefore, to learn efficient and effective lightweight deep models for deployment on portable platforms, hybrid compression methods combining knowledge distillation with other compression techniques will be an interesting topic for future study.

Apart from model compression and acceleration for deep neural networks, knowledge distillation can also be used for other problems because of the natural characteristics of knowledge transfer in the teacher-student architecture. Recently, knowledge distillation has been applied to data privacy and security (Wang et al., 2019a), adversarial attacks on deep models (Papernot et al., 2016b), cross-modality learning (Gupta et al., 2016), multiple domains (Asami et al., 2017), catastrophic forgetting (Lee et al., 2019b), accelerating the learning of deep models (Chen et al., 2015), the efficiency of neural architecture search (Bashivan et al., 2019), self-supervision (Noroozi et al., 2018), and data augmentation (Lee et al., 2019a; Gordon and Duh, 2019). Another interesting example is that the knowledge transfer from small teacher networks to a large student network can accelerate the learning of the large network (Chen et al., 2015), which is quite different from the motivation of vanilla knowledge distillation. The feature representations learned from unlabelled data by a large model can also supervise the target model via distillation (Noroozi et al., 2018). To this end, extensions of knowledge distillation for other purposes and applications might be a meaningful future direction.

Since knowledge distillation mirrors certain characteristics of human learning, it is practicable to extend knowledge transfer to classic and traditional machine learning methods (Zhou et al., 2019b; Gong et al., 2018; You et al., 2018). For example, traditional two-stage classification was skillfully cast as a single-teacher single-student problem in terms of the basic idea of knowledge distillation (Zhou et al., 2019b). Furthermore, knowledge distillation can be unified with many kinds of mainstream machine learning, such as adversarial learning (Liu et al., 2018b), automated machine learning (Macko et al., 2019), lifelong learning (Zhai et al., 2019), and reinforcement learning (Ashok et al., 2017). Thus, extending knowledge distillation across these typical machine learning methods could be an interesting direction for future research.

9 Conclusion

In this paper, we have presented a comprehensive review of knowledge distillation from the perspectives of knowledge, distillation schemes, teacher-student architectures, distillation algorithms, applications, as well as challenges and future directions. We hope that this review not only provides an overview of knowledge distillation but also offers some insights for future research.

References

  • Aditya et al. (2019) Aditya, S., Saha, R., Yang, Y. & Baral, C. (2019). Spatial knowledge distillation to aid visual reasoning. In: WACV.
  • Aguilar et al. (2019) Aguilar, G., Ling, Y., Zhang, Y., Yao, B., Fan, X. & Guo, E. (2019). Knowledge distillation from internal representations. arXiv preprint arXiv:1910.03723.
  • Aguinaldo et al. (2019) Aguinaldo, A., Chiang, P. Y., Gain, A., Patil, A., Pearson, K. & Feizi, S. (2019). Compressing gans using knowledge distillation. arXiv preprint arXiv:1902.00159.
  • Ahn et al. (2019) Ahn, S., Hu, S., Damianou, A., Lawrence, N. D. & Dai, Z. (2019). Variational information distillation for knowledge transfer. In: CVPR.
  • Albanie et al. (2018) Albanie, S., Nagrani, A., Vedaldi, A. & Zisserman, A. (2018). Emotion recognition in speech using cross-modal transfer in the wild. In: ACM MM.
  • Allen-Zhu et al. (2019) Allen-Zhu, Z., Li, Y., & Liang, Y. (2019). Learning and generalization in overparameterized neural networks, going beyond two layers. In: NeurIPS.
  • Anil et al. (2018) Anil, R., Pereyra, G., Passos, A., Ormandi, R., Dahl, G. E. & Hinton, G. E. (2018). Large scale distributed neural network training through online distillation. arXiv preprint arXiv:1804.03235.
  • Arora et al. (2018) Arora, S., Cohen, N., & Hazan, E. (2018). On the optimization of deep networks: Implicit acceleration by overparameterization. arXiv preprint arXiv:1802.06509.
  • Arora et al. (2019) Arora, S., Khapra, M. M. & Ramaswamy, H. G. (2019). On knowledge distillation from complex networks for response prediction. In: NAACL-HLT.
  • Asami et al. (2017) Asami, T., Masumura, R., Yamaguchi, Y., Masataki, H. & Aono, Y. (2017). Domain adaptation of dnn acoustic models using knowledge distillation. In: ICASSP.
  • Ashok et al. (2017) Ashok, A., Rhinehart, N., Beainy, F. & Kitani, K. M. (2017). N2n learning: Network to network compression via policy gradient reinforcement learning. arXiv preprint arXiv:1709.06030.
  • Asif et al. (2019) Asif, U., Tang, J. & Harrer, S. (2019). Ensemble knowledge distillation for learning improved and efficient networks. arXiv preprint arXiv:1909.08097.
  • Ba and Caruana (2014) Ba, J. & Caruana, R. (2014). Do deep nets really need to be deep? In: NeurIPS.
  • Bagherinezhad et al. (2018) Bagherinezhad, H., Horton, M., Rastegari, M. & Farhadi, A. (2018). Label refinery: Improving imagenet classification through label progression. arXiv preprint arXiv:1805.02641.
  • Bai et al. (2019a) Bai, H., Wu, J., King, I. & Lyu, M. (2019a). Few shot network compression via cross distillation. arXiv preprint arXiv:1911.09450.
  • Bai et al. (2019b) Bai, Y., Yi, J., Tao, J., Tian, Z. & Wen, Z. (2019b). Learn spelling from teachers: transferring knowledge from language models to sequence-to-sequence speech recognition. arXiv preprint arXiv:1907.06017.
  • Bashivan et al. (2019) Bashivan, P., Tensen, M. & DiCarlo, J. J. (2019). Teacher guided architecture search. In: ICCV.
  • Belagiannis et al. (2018) Belagiannis, V., Farshad, A. & Galasso, F. (2018). Adversarial network compression. In:ECCV.
  • Bengio et al. (2013) Bengio, Y., Courville, A., & Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE TPAMI 35(8):1798–1828.
  • Bhardwaj et al. (2019) Bhardwaj, S., Srinivasan, M. & Khapra, M. M. (2019). Efficient video classification using fewer frames. In: CVPR.
  • Brutzkus and Globerson (2019) Brutzkus, A., & Globerson, A. (2019). Why do Larger Models Generalize Better? A Theoretical Perspective via the XOR Problem. In: ICML.
  • Bucilua et al. (2006) Bucilua, C., Caruana, R. & Niculescu-Mizil, A. (2006). Model compression. In: SIGKDD.
  • Chan et al. (2015) Chan, W., Ke, N. R. & Lane, I. (2015). Transferring knowledge from a rnn to a DNN. arXiv preprint arXiv:1504.01483.
  • Chebotar and Waters (2016) Chebotar, Y. & Waters, A. (2016). Distilling knowledge from ensembles of neural networks for speech recognition. In: Interspeech.
  • Chen et al. (2019a) Chen, D., Mei, J. P., Wang, C., Feng, Y. & Chen, C. (2019a) Online knowledge distillation with diverse peers. arXiv preprint arXiv:1912.00350.
  • Chen et al. (2017) Chen, G., Choi, W., Yu, X., Han, T., & Chandraker, M. (2017). Learning efficient object detection models with knowledge distillation. In: NeurIPS.
  • Chen et al. (2019b) Chen, H., Wang, Y., Xu, C., Yang, Z., Liu, C., Shi, B. & et al. (2019b). Data-free learning of student networks. In: ICCV.
  • Chen et al. (2020) Chen, H., Wang, Y., Xu, C., Xu, C. & Tao, D. (2020). Learning student networks via feature embedding. TNNLS. DOI: 10.1109/TNNLS.2020.2970494.
  • Chen et al. (2015) Chen, T., Goodfellow, I. & Shlens, J. (2015) Net2net: Accelerating learning via knowledge transfer. arXiv preprint arXiv:1511.05641.
  • Chen et al. (2018a) Chen, W. C., Chang, C. C. & Lee, C. R. (2018a). Knowledge distillation with feature maps for image classification. In: ACCV.
  • Chen et al. (2018b) Chen, X., Zhang, Y., Xu, H., Qin, Z. & Zha, H. (2018b). Adversarial distillation for efficient recommendation with external knowledge. ACM TOIS 37(1):1–28.
  • Chen et al. (2019c) Chen, X., Su, J. & Zhang, J. (2019c). A two-teacher framework for knowledge distillation. In: ISNN.
  • Chen et al. (2018c) Chen, Y., Wang, N. & Zhang, Z. (2018c). Darkrank: Accelerating deep metric learning via cross sample similarities transfer. In: AAAI.
  • Chen et al. (2019d) Chen, Y. C., Lin, Y. Y., Yang, M. H., Huang, J. B. (2019d). Crdoco: Pixel-level domain transfer with cross-domain consistency. In: CVPR.
  • Chen and Liu (2018) Chen, Z. & Liu, B. (2018). Lifelong machine learning. Synthesis Lectures on Artificial Intelligence and Machine Learning 12(3):1–207.
  • Cheng et al. (2018) Cheng, Y., Wang, D., Zhou, P. & Zhang, T. (2018). Model compression and acceleration for deep neural networks: The principles, progress, and challenges. IEEE Signal Proc Mag 35(1):126–136.
  • Cheng et al. (2020) Cheng, X., Rao, Z., Chen, Y., & Zhang, Q. (2020). Explaining Knowledge Distillation by Quantifying the Knowledge. In: CVPR.
  • Cho and Hariharan (2019) Cho, J. H. & Hariharan, B. (2019). On the efficacy of knowledge distillation. In: ICCV.
  • Chollet (2017) Chollet, F. (2017). Xception: Deep learning with depthwise separable convolutions. In: CVPR.
  • Chung et al. (2020) Chung, I., Park, S., Kim, J. & Kwak, N. (2020). Feature-map-level online adversarial knowledge distillation. In: ICRL.
  • Clark et al. (2019) Clark, K., Luong, M. T., Khandelwal, U., Manning, C. D. & Le, Q. V. (2019). Bam! born-again multi-task networks for natural language understanding. arXiv preprint arXiv:1907.04829.
  • Courbariaux et al. (2015) Courbariaux, M., Bengio, Y. & David, J. P. (2015). Binaryconnect: Training deep neural networks with binary weights during propagations. In: NeurIPS.
  • Crowley et al. (2018) Crowley, E. J., Gray, G. & Storkey, A. J. (2018). Moonshine: Distilling with cheap convolutions. In: NeurIPS.
  • Cui et al. (2017) Cui, J., Kingsbury, B., Ramabhadran, B., Saon, G., Sercu, T., Audhkhasi, K. & et al. (2017).Knowledge distillation across ensembles of multilingual models for low-resource languages. In: ICASSP.
  • Deng et al. (2009) Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In: CVPR.
  • Denton et al. (2014) Denton, E. L., Zaremba, W., Bruna, J., LeCun, Y. & Fergus, R. (2014). Exploiting linear structure within convolutional networks for efficient evaluation. In: NeurIPS.
  • Devlin et al. (2018) Devlin, J., Chang, M. W., Lee, K. & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Ding et al. (2019) Ding, Q., Wu, S., Sun, H., Guo, J. & Xia, ST. (2019). Adaptive regularization of labels. arXiv preprint arXiv:1908.05474.
  • Do et al. (2019) Do, T., Do, T. T., Tran, H., Tjiputra, E. & Tran, Q. D. (2019). Compact trilinear interaction for visual question answering. In: ICCV.
  • Dong and Yang (2019) Dong, X. & Yang, Y. (2019). Teacher supervises students how to learn from partially labeled images for facial landmark detection. In: ICCV.
  • Dou et al. (2020) Dou, Q., Liu, Q., Heng, P. A., & Glocker, B. (2020). Unpaired Multi-modal Segmentation via Knowledge Distillation. To appear in IEEE TMI.
  • Duong et al. (2019) Duong, C. N., Luu, K., Quach, K. G. & Le, N. (2019). Shrinkteanet: Million-scale lightweight face recognition via shrinking teacher-student networks. arXiv preprint arXiv:1905.10620.
  • Flennerhag et al. (2018) Flennerhag, S., Moreno, P. G., Lawrence, N. D. & Damianou, A. (2018). Transferring knowledge across learning processes. arXiv preprint arXiv:1812.01054.
  • Freitag et al. (2017) Freitag, M., Al-Onaizan, Y. & Sankaran, B. (2017). Ensemble distillation for neural machine translation. arXiv preprint arXiv:1702.01802.
  • Fukuda et al. (2017) Fukuda, T., Suzuki, M., Kurata, G., Thomas, S., Cui, J. & Ramabhadran, B. (2017). Efficient knowledge distillation from an ensemble of teachers. In: Interspeech.
  • Furlanello et al. (2018) Furlanello, T., Lipton, Z., Tschannen, M., Itti, L. & Anandkumar, A. (2018). Born again neural networks. In: ICML.
  • Gao et al. (2019) Gao, L., Mi, H., Zhu, B., Feng, D., Li, Y. & Peng, Y. (2019). An adversarial feature distillation method for audio classification. IEEE Access 7:105319–105330.
  • Gao et al. (2020) Gao, M., Shen, Y., Li, Q., & Loy, C. C. (2020). Residual Knowledge Distillation. arXiv preprint arXiv:2002.09168.
  • Garcia et al. (2018) Garcia, N. C., Morerio, P. & Murino, V. (2018). Modality distillation with multiple stream networks for action recognition. In: ECCV.
  • Ge et al. (2018) Ge, S., Zhao, S., Li, C. & Li, J. (2018). Low-resolution face recognition in the wild via selective knowledge distillation. IEEE TIP 28(4):2051–2062.
  • Ghorbani et al. (2018) Ghorbani, S., Bulut, A. E. & Hansen, J. H. (2018). Advancing multi-accented lstm-ctc speech recognition using a domain specific student-teacher learning paradigm. In: SLTW.
  • Gil et al. (2019) Gil, Y., Chai, Y., Gorodissky, O. & Berant, J. (2019). White-to-black: Efficient distillation of black-box adversarial attacks. arXiv preprint arXiv:1904.02405.
  • Goldblum et al. (2019) Goldblum, M., Fowl, L., Feizi, S. & Goldstein, T. (2019). Adversarially robust distillation. arXiv preprint arXiv:1905.09747.
  • Gong et al. (2018) Gong, C., Chang, X., Fang, M. & Yang, J. (2018). Teaching semi-supervised classifier via generalized distillation. In: IJCAI.
  • Goodfellow et al. (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative adversarial nets. In: NIPS.
  • Gordon and Duh (2019) Gordon, M. A. & Duh, K. (2019). Explaining sequence-level knowledge distillation as data-augmentation for neural machine translation. arXiv preprint arXiv:1912.03334.
  • Gu and Tresp (2020) Gu, J., & Tresp, V. (2020). Search for Better Students to Learn Distilled Knowledge. To appear in ECAI.
  • Gupta et al. (2016) Gupta, S., Hoffman, J. & Malik, J. (2016). Cross modal distillation for supervision transfer. In: CVPR.
  • Hahn and Choi (2019) Hahn, S. & Choi, H. (2019). Self-knowledge distillation in natural language processing. arXiv preprint arXiv:1908.01851.
  • Haidar and Rezagholizadeh (2019) Haidar, M. A. & Rezagholizadeh, M. (2019). Textkd-gan: Text generation using knowledge distillation and generative adversarial networks. In: Canadian Conference on Artificial Intelligence.
  • Han et al. (2015) Han, S., Pool, J., Tran, J. & Dally, W. (2015). Learning both weights and connections for efficient neural network. In: NeurIPS.
  • Hao and Zhang (2019) Hao, W. & Zhang, Z. (2019). Spatiotemporal distilled dense-connectivity network for video action recognition. Pattern Recogn 92:13–24.
  • He et al. (2016) He, K., Zhang, X., Ren, S. & Sun, J. (2016) Deep residual learning for image recognition. In: CVPR.
  • He et al. (2019) He, T., Shen, C., Tian, Z., Gong, D., Sun, C. & Yan, Y. (2019). Knowledge adaptation for efficient semantic segmentation. In: CVPR.
  • Heo et al. (2019a) Heo, B., Kim, J., Yun, S., Park, H., Kwak, N., & Choi, J. Y. (2019a). A comprehensive overhaul of feature distillation. In: ICCV.
  • Heo et al. (2019b) Heo, B., Lee, M., Yun, S. & Choi, J. Y. (2019b). Knowledge distillation with adversarial samples supporting decision boundary. In: AAAI.
  • Heo et al. (2019c) Heo, B., Lee, M., Yun, S. & Choi, J. Y. (2019c). Knowledge transfer via distillation of activation boundaries formed by hidden neurons. In: AAAI.
  • Hinton et al. (2015) Hinton, G., Vinyals, O. & Dean, J. (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
  • Hoffman et al. (2016) Hoffman, J., Gupta, S. & Darrell, T. (2016). Learning with side information through modality hallucination. In: CVPR.
  • Hong and Yu (2019) Hong, W. & Yu, J. (2019). Gan-knowledge distillation for one-stage object detection. arXiv preprint arXiv:1906.08467.
  • Hou et al. (2019) Hou, Y., Ma, Z., Liu, C. & Loy, CC. (2019). Learning lightweight lane detection cnns by self attention distillation. In: ICCV.
  • Howard et al. (2017) Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., & Adam, H. (2017). Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
  • Hu et al. (2018) Hu, M., Peng, Y., Wei, F., Huang, Z., Li, D., Yang, N. & et al. (2018). Attention-guided answer distillation for machine reading comprehension. arXiv preprint arXiv:1808.07644.
  • Huang et al. (2017) Huang, G., Liu, Z., Van, Der Maaten, L. & Weinberger, K. Q. (2017). Densely connected convolutional networks. In: CVPR.
  • Huang et al. (2018) Huang, M., You, Y., Chen, Z., Qian, Y. & Yu, K. (2018). Knowledge distillation for sequence model. In: Interspeech.
  • Huang and Wang (2017) Huang, Z. & Wang, N. (2017). Like what you like: Knowledge distill via neuron selectivity transfer. arXiv preprint arXiv:1707.01219.
  • Ioffe and Szegedy (2015) Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167
  • Jang et al. (2019) Jang, Y., Lee, H., Hwang, S. J. & Shin, J. (2019). Learning what and where to transfer. In: ICML.
  • Jiao et al. (2019) Jiao, X., Yin, Y., Shang, L., Jiang, X., Chen, X., Li, L. & et al. (2019). Tinybert: Distilling bert for natural language understanding. arXiv preprint arXiv:1909.10351.
  • Kang et al. (2019) Kang, M., Mun, J. & Han, B. (2019). Towards oracle knowledge distillation with neural architecture search. arXiv preprint arXiv:1911.13019.
  • Kim et al. (2018) Kim, J., Park, S. & Kwak, N. (2018). Paraphrasing complex network: Network compression via factor transfer. In: NeurIPS.
  • Kim et al. (2019a) Kim, J., Bhalgat, Y., Lee, J., Patel, C., & Kwak, N. (2019a). QKD: Quantization-aware Knowledge Distillation. arXiv preprint arXiv:1911.12491.
  • Kim et al. (2019b) Kim, J., Hyun, M., Chung, I. & Kwak, N. (2019b). Feature fusion for online mutual knowledge distillation. arXiv preprint arXiv:1904.09058.
  • Kim and Kim (2017) Kim, S. W. & Kim, H. E. (2017). Transferring knowledge to smaller network with class-distance loss. In: ICLRW.
  • Kim and Rush (2016) Kim, Y., Rush & A. M. (2016). Sequence-level knowledge distillation. arXiv preprint arXiv:1606.07947.
  • Kimura et al. (2018) Kimura, A., Ghahramani, Z., Takeuchi, K., Iwata, T. & Ueda, N. (2018). Few-shot learning of neural networks from scratch by pseudo example optimization. arXiv preprint arXiv:1802.03039.
  • Kong et al. (2019) Kong, H., Zhao, J., Tu, X., Xing, J., Shen, S. & Feng, J. (2019). Cross-resolution face recognition via prior-aided face hallucination and residual knowledge distillation. arXiv preprint arXiv:1905.10777.
  • Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I. & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In: NeurIPS.
  • Kuncoro et al. (2016) Kuncoro, A., Ballesteros, M., Kong, L., Dyer, C. & Smith, N. A. (2016). Distilling an ensemble of greedy dependency parsers into one mst parser. arXiv preprint arXiv:1609.07561.
  • Kundu et al. (2019) Kundu, J. N., Lakkakula, N. & Babu, R. V. (2019). Um-adapt: Unsupervised multi-task adaptation using adversarial cross-task distillation. In: CVPR.
  • Lee et al. (2019a) Lee, H., Hwang, S. J. & Shin, J. (2019a). Rethinking data augmentation: Self-supervision and self-distillation. arXiv preprint arXiv:1910.05872.
  • Lee et al. (2019b) Lee, K., Lee, K., Shin, J. & Lee, H. (2019b). Overcoming catastrophic forgetting with unlabeled data in the wild. In: ICCV.
  • Lee et al. (2019c) Lee, K., Nguyen, L. T. & Shim, B. (2019c). Stochasticity and skip connections improve knowledge transfer. In: AAAI.
  • Lee and Song (2019) Lee, S. & Song, B. (2019). Graph-based knowledge distillation by multi-head attention network. arXiv preprint arXiv:1907.02226.
  • Lee et al. (2018) Lee, S. H., Kim, D. H. & Song, B. C. (2018). Self-supervised knowledge distillation using singular value decomposition. In: ECCV.
  • Li et al. (2020) Li, C., Peng, J., Yuan, L., Wang, G., Liang, X., Lin, L., & Chang, X. (2020). Blockwisely Supervised Neural Architecture Search with Knowledge Distillation. In: CVPR.
  • Li et al. (2019) Li, J., Fu, K., Zhao, S. & Ge, S. (2019). Spatiotemporal knowledge distillation for efficient estimation of aerial video saliency. IEEE TIP 29:1902–1914.
  • Li et al. (2017) Li, Q., Jin, S. & Yan, J. (2017). Mimicking very efficient network for object detection. In: CVPR.
  • Li et al. (2018) Li, T., Li, J., Liu, Z. & Zhang, C. (2018). Knowledge distillation from few samples. arXiv preprint arXiv:1812.01839.
  • Li and Hoiem (2017) Li, Z. & Hoiem, D. (2017). Learning without forgetting. IEEE TPAMI 40(12):2935–2947.
  • Liu et al. (2019a) Liu, I. J., Peng, J. & Schwing, A. G. (2019a). Knowledge flow: Improve upon your teachers. arXiv preprint arXiv:1904.05878.
  • Liu et al. (2019b) Liu, J., Chen, Y. & Liu, K. (2019b). Exploiting the ground-truth: An adversarial imitation based knowledge distillation approach for event detection. In: AAAI.
  • Liu et al. (2019c) Liu, J., Wen, D., Gao, H., Tao, W., Chen, T. W., Osa, K. & et al. (2019c). Knowledge representing: efficient, sparse representation of prior knowledge for knowledge distillation. In: CVPRW.
  • Liu et al. (2018a) Liu, P., Liu, W., Ma, H., Mei, T. & Seok, M. (2018a). Ktan: knowledge transfer adversarial network. arXiv preprint arXiv:1810.08126.
  • Liu et al. (2019d) Liu, Q., Xie, L., Wang, H. & Yuille, A. L. (2019d). Semantic-aware knowledge preservation for zero-shot sketch-based image retrieval. In: ICCV.
  • Liu et al. (2018b) Liu, R., Fusi, N. & Mackey, L. (2018b). Model compression with generative adversarial networks. arXiv preprint arXiv:1812.02271.
  • Liu et al. (2018c) Liu, X., Wang, X. & Matwin, S. (2018c). Improving the interpretability of deep neural networks with knowledge distillation. In: ICDMW.
  • Liu et al. (2019e) Liu, X., He, P., Chen, W. & Gao, J. (2019e). Improving multi-task deep neural networks via knowledge distillation for natural language understanding. arXiv preprint arXiv:1904.09482.
  • Liu et al. (2019f) Liu, Y., Cao, J., Li, B., Yuan, C., Hu, W., Li, Y. & Duan, Y. (2019f). Knowledge distillation via instance relationship graph. In: CVPR.
  • Liu et al. (2019g) Liu, Y., Chen, K., Liu, C., Qin, Z., Luo, Z. & Wang, J. (2019g). Structured knowledge distillation for semantic segmentation. In: CVPR.
  • Liu et al. (2019h) Liu, Y., Jia, X., Tan, M., Vemulapalli, R., Zhu, Y., Green, B. & et al. (2019h). Search to distill: Pearls are everywhere but not the eyes. In: CVPR.
  • Lopes et al. (2017) Lopes, R. G., Fenu, S. & Starner, T. (2017). Data-free knowledge distillation for deep neural networks. arXiv preprint arXiv:1710.07535.
  • Lopez-Paz et al. (2015) Lopez-Paz, D., Bottou, L., Schölkopf, B. & Vapnik, V. (2015). Unifying distillation and privileged information. arXiv preprint arXiv:1511.03643.
  • Lu et al. (2017) Lu, L., Guo, M. & Renals, S. (2017). Knowledge distillation for small-footprint highway networks. In: ICASSP.
  • Luo et al. (2016) Luo, P., Zhu, Z., Liu, Z., Wang, X. & Tang, X. (2016). Face model compression by distilling knowledge from neurons. In: AAAI.
  • Luo et al. (2018) Luo, Z., Hsieh, J. T., Jiang, L., Carlos Niebles, J. & Fei-Fei, L. (2018). Graph distillation for action detection with privileged modalities. In: ECCV.
  • Macko et al. (2019) Macko, V., Weill, C., Mazzawi, H. & Gonzalvo, J. (2019). Improving neural architecture search image classifiers via ensemble learning. arXiv preprint arXiv:1903.06236.
  • Ma and Mei (2019) Ma, J., & Mei, Q. (2019). Graph representation learning via multi-task knowledge distillation. arXiv preprint arXiv:1911.05700.
  • Ma et al. (2018) Ma, N., Zhang, X., Zheng, H. T., & Sun, J. (2018). Shufflenet v2: Practical guidelines for efficient cnn architecture design. In: ECCV.
  • Meng et al. (2019) Meng, Z., Li, J., Zhao, Y. & Gong, Y. (2019). Conditional teacher-student learning. In: ICASSP.
  • Micaelli and Storkey (2019) Micaelli, P. & Storkey, A. J. (2019). Zero-shot knowledge transfer via adversarial belief matching. In: NeurIPS.
  • Minami et al. (2019) Minami, S., Hirakawa, T., Yamashita, T. & Fujiyoshi, H. (2019). Knowledge transfer graph for deep collaborative learning. arXiv preprint arXiv:1909.04286.
  • Mirzadeh et al. (2019) Mirzadeh, S. I., Farajtabar, M., Li, A. & Ghasemzadeh, H. (2019). Improved knowledge distillation via teacher assistant: Bridging the gap between student and teacher. arXiv preprint arXiv:1902.03393.
  • Mishra and Marr (2017) Mishra, A. & Marr, D. (2017). Apprentice: Using knowledge distillation techniques to improve low-precision network accuracy. arXiv preprint arXiv:1711.05852.
  • Mou et al. (2016) Mou, L., Jia, R., Xu, Y., Li, G., Zhang, L. & Jin, Z. (2016). Distilling word embeddings: An encoding approach. In: CIKM.
  • Mukherjee et al. (2019) Mukherjee, P., Das, A., Bhunia, A. K. & Roy, P. P. (2019). Cogni-net: Cognitive feature learning through deep visual perception. In: ICIP.
  • Mullapudi et al. (2019) Mullapudi, R. T., Chen, S., Zhang, K., Ramanan, D. & Fatahalian, K. (2019). Online model distillation for efficient video inference. In: ICCV.
  • Muller et al. (2019) Muller, R., Kornblith, S. & Hinton, G. E. (2019). When does label smoothing help? In: NeurIPS.
  • Mun et al. (2018) Mun, J., Lee, K., Shin, J. & Han, B. (2018). Learning to specialize with knowledge distillation for visual question answering. In: NeurIPS.
  • Munjal et al. (2019) Munjal, B., Galasso, F. & Amin, S. (2019). Knowledge distillation for end-to-end person search. arXiv preprint arXiv:1909.01058.
  • Nakashole and Flauger (2017) Nakashole, N. & Flauger, R. (2017). Knowledge distillation for bilingual dictionary induction. In: EMNLP.
  • Nayak et al. (2019) Nayak, G. K., Mopuri, K. R., Shaj, V., Babu, R. V. & Chakraborty, A. (2019). Zero-shot knowledge distillation in deep networks. arXiv preprint arXiv:1905.08114.
  • Ng et al. (2018) Ng, R. W., Liu, X. & Swietojanski, P. (2018). Teacher-student training for text-independent speaker recognition. In: SLTW.
  • Nie et al. (2019) Nie, X., Li, Y., Luo, L., Zhang, N. & Feng, J. (2019). Dynamic kernel distillation for efficient pose estimation in videos. In: ICCV.
  • Noroozi et al. (2018) Noroozi, M., Vinjimoor, A., Favaro, P. & Pirsiavash, H. (2018). Boosting self-supervised learning via knowledge transfer. In: CVPR.
  • Nowak and Corso (2018) Nowak, T. S. & Corso, J. J. (2018). Deep net triage: Analyzing the importance of network layers via structural compression. arXiv preprint arXiv:1801.04651.
  • Oord et al. (2017) Oord, A., Li, Y., Babuschkin, I., Simonyan, K., Vinyals, O., Kavukcuoglu, K. & et al. (2017). Parallel wavenet: Fast high-fidelity speech synthesis. arXiv preprint arXiv:1711.10433.
  • Pan et al. (2019) Pan, Y., He, F. & Yu, H. (2019). A novel enhanced collaborative autoencoder with knowledge distillation for top-n recommender systems. Neurocomputing 332:137–148.
  • Papernot et al. (2016a) Papernot, N., Abadi, M., Erlingsson, U., Goodfellow, I. & Talwar, K. (2016a). Semi-supervised knowledge transfer for deep learning from private training data. arXiv preprint arXiv:1610.05755.
  • Papernot et al. (2016b) Papernot, N., McDaniel, P., Wu, X., Jha, S. & Swami, A. (2016b). Distillation as a defense to adversarial perturbations against deep neural networks. In: ISNN.
  • Park and Kwak (2019) Park, S. & Kwak, N. (2019). Feed: feature-level ensemble for knowledge distillation. arXiv preprint arXiv:1909.10754.
  • Park et al. (2019) Park, W., Kim, D., Lu, Y. & Cho, M. (2019). Relational knowledge distillation. In: CVPR.
  • Passalis and Tefas (2018) Passalis, N. & Tefas, A. (2018). Learning deep representations with probabilistic knowledge transfer. In: ECCV.
  • Peng et al. (2019a) Peng, B., Jin, X., Liu, J., Li, D., Wu, Y., Liu, Y. & et al. (2019a). Correlation congruence for knowledge distillation. In: ICCV.
  • Peng et al. (2019b) Peng, Z., Li, Z., Zhang, J., Li, Y., Qi, G. J. & Tang, J. (2019b). Few-shot image recognition with knowledge transfer. In: ICCV.
  • Perez et al. (2020) Perez, A., Sanguineti, V., Morerio, P. & Murino, V. (2020). Audio-visual model distillation using acoustic images. In: WACV.
  • Phuong and Lampert (2019) Phuong, M. & Lampert, C. (2019). Towards understanding knowledge distillation. In: ICML.
  • Pilzer et al. (2019) Pilzer, A., Lathuiliere, S., Sebe, N. & Ricci, E. (2019). Refine and distill: Exploiting cycle-inconsistency and knowledge distillation for unsupervised monocular depth estimation. In: CVPR.
  • Polino et al. (2018) Polino, A., Pascanu, R. & Alistarh, D. (2018). Model compression via distillation and quantization. arXiv preprint arXiv:1802.05668.
  • Price et al. (2016) Price, R., Iso, K. & Shinoda, K. (2016). Wise teachers train better dnn acoustic models. EURASIP J Audio Spee 2016(1):10.
  • Radosavovic et al. (2020) Radosavovic, I., Kosaraju, R. P., Girshick, R., He, K., & Dollar P. (2020). Designing network design spaces. arXiv preprint arXiv:2003.13678.
  • Roheda et al. (2018) Roheda, S., Riggan, B. S., Krim, H. & Dai, L. (2018). Cross-modality distillation: A case for conditional generative adversarial networks. In: ICASSP.
  • Romero et al. (2015) Romero, A., Ballas, N., Kahou, S. E., Chassang, A., Gatta, C., & Bengio, Y. (2015). Fitnets: Hints for thin deep nets. In: ICLR.
  • Ross and Doshi-Velez (2018) Ross, A. S. & Doshi-Velez, F. (2018). Improving the adversarial robustness and interpretability of deep neural networks by regularizing their input gradients. In: AAAI.
  • Ruder et al. (2017) Ruder, S., Ghaffari, P. & Breslin, J. G. (2017). Knowledge adaptation: Teaching to adapt. arXiv preprint arXiv:1702.02052.
  • Sandler et al. (2018) Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., & Chen, L. C. (2018). Mobilenetv2: Inverted residuals and linear bottlenecks. In: CVPR.
  • Sanh et al. (2019) Sanh, V., Debut, L., Chaumond, J. & Wolf, T. (2019). Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
  • Saputra et al. (2019) Saputra, M. R. U., de Gusmao, P. P., Almalioglu, Y., Markham, A. & Trigoni, N. (2019). Distilling knowledge from a deep pose regressor network. In: ICCV.
  • Sau and Balasubramanian (2016) Sau, B. B. & Balasubramanian, V. N. (2016). Deep model compression: Distilling knowledge from noisy teachers. arXiv preprint arXiv:1610.09650.
  • Shu et al. (2019) Shu, C., Li, P., Xie, Y., Qu, Y., Dai, L., & Ma, L. (2019). Knowledge squeezed adversarial network compression. arXiv preprint arXiv:1904.05100.
  • Shakeri et al. (2019) Shakeri, S., Sethy, A. & Cheng, C. (2019). Knowledge distillation in document retrieval. arXiv preprint arXiv:1911.11065.
  • Shen et al. (2016) Shen, J., Vesdapunt, N., Boddeti, V. N. & Kitani, K. M. (2016). In teacher we trust: Learning compressed models for pedestrian detection. arXiv preprint arXiv:1612.00478.
  • Shen et al. (2018) Shen, P., Lu, X., Li, S. & Kawai, H. (2018). Feature representation of short utterances based on knowledge distillation for spoken language identification. In: Interspeech.
  • Shen et al. (2019a) Shen, P., Lu, X., Li, S. & Kawai, H. (2019a). Interactive learning of teacher-student model for short utterance spoken language identification. In: ICASSP.
  • Shen et al. (2019b) Shen, Z., He, Z. & Xue, X. (2019b). Meal: Multi-model ensemble via adversarial learning. In: AAAI.
  • Shi et al. (2019a) Shi, B., Sun, M., Kao, C. C., Rozgic, V., Matsoukas, S. & Wang, C. (2019a). Compression of acoustic event detection models with quantized distillation. arXiv preprint arXiv:1907.00873.
  • Shi et al. (2019b) Shi, B., Sun, M., Kao, C. C., Rozgic, V., Matsoukas, S. & Wang, C. (2019b). Semi-supervised acoustic event detection based on tri-training. In: ICASSP.
  • Shi et al. (2019c) Shi, Y., Hwang, M. Y., Lei, X. & Sheng, H. (2019c). Knowledge distillation for recurrent neural network language modeling with trust regularization. In: ICASSP.
  • Shin et al. (2019) Shin, S., Boo, Y. & Sung, W. (2019). Empirical analysis of knowledge distillation technique for optimization of quantized deep neural networks. arXiv preprint arXiv:1909.01688.
  • Shmelkov et al. (2017) Shmelkov, K., Schmid, C. & Alahari, K. (2017). Incremental learning of object detectors without catastrophic forgetting. In: ICCV.
  • Siam et al. (2019) Siam, M., Jiang, C., Lu, S., Petrich, L., Gamal, M., Elhoseiny, M. & et al. (2019). Video object segmentation using teacher-student adaptation in a human robot interaction (hri) setting. In: ICRA.
  • Sindhwani et al. (2015) Sindhwani, V., Sainath, T. & Kumar, S. (2015). Structured transforms for small-footprint deep learning. In: NeurIPS.
  • Silver et al. (2016) Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., … & Dieleman, S. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489.
  • Song et al. (2018) Song, X., Feng, F., Han, X., Yang, X., Liu, W. & Nie, L. (2018). Neural compatibility modeling with attentive knowledge distillation. In: SIGIR.
  • Srinivas and Fleuret (2018) Srinivas, S. & Fleuret, F. (2018). Knowledge transfer with jacobian matching. arXiv preprint arXiv:1803.00443.
  • Su and Maji (2016) Su, J. C. & Maji, S. (2016). Adapting models to signal degradation using distillation. arXiv preprint arXiv:1604.00433.
  • Sun et al. (2019) Sun, S., Cheng, Y., Gan, Z. & Liu, J. (2019). Patient knowledge distillation for bert model compression. arXiv preprint arXiv:1908.09355.
  • Sun et al. (2019) Sun, P., Feng, W., Han, R., Yan, S., & Wen, Y. (2019). Optimizing network performance for distributed dnn training on gpu clusters: Imagenet/alexnet training in 1.5 minutes. arXiv preprint arXiv:1902.06855.
  • Takashima et al. (2018) Takashima, R., Li, S. & Kawai, H. (2018). An investigation of a knowledge distillation method for ctc acoustic models. In: ICASSP.
  • Tan et al. (2019) Tan, M., Chen, B., Pang, R., Vasudevan, V., Sandler, M., Howard, A., & Le, Q. V. (2019). Mnasnet: Platform-aware neural architecture search for mobile. In: CVPR.
  • Tan and Le (2019) Tan, M., & Le, Q. (2019). EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In: ICML.
  • Tan et al. (2019) Tan, X., Ren, Y., He, D., Qin, T., Zhao, Z. & Liu, T. Y. (2019). Multilingual neural machine translation with knowledge distillation. arXiv preprint arXiv:1902.10461.
  • Tang et al. (2020) Tang, J., Shivanna, R., Zhao, Z., Lin, D., Singh, A., Chi, E. H., & Jain, S. (2020). Understanding and Improving Knowledge Distillation. arXiv preprint arXiv:2002.03532.
  • Tang and Wang (2018) Tang, J. & Wang, K. (2018). Ranking distillation: Learning compact ranking models with high performance for recommender system. In: SIGKDD.
  • Tang et al. (2019) Tang, R., Lu, Y., Liu, L., Mou, L., Vechtomova, O. & Lin, J. (2019). Distilling task-specific knowledge from bert into simple neural networks. arXiv preprint arXiv:1903.12136.
  • Thoker and Gall (2019) Thoker, F. M. & Gall, J. (2019). Cross-modal knowledge distillation for action recognition. In: ICIP.
  • Tian et al. (2020) Tian, Y., Krishnan, D. & Isola, P. (2020). Contrastive representation distillation. arXiv preprint arXiv:1910.10699.
  • Tung and Mori (2019) Tung, F. & Mori, G. (2019). Similarity-preserving knowledge distillation. In: ICCV.
  • Turc et al. (2019) Turc, I., Chang, M. W., Lee, K. & Toutanova, K. (2019). Well-read students learn better: The impact of student initialization on knowledge distillation. arXiv preprint arXiv:1908.08962.
  • Urban et al. (2016) Urban, G., Geras, K. J., Kahou, S. E., Aslan, O., Wang, S., Caruana, R. & et al. (2016). Do deep convolutional nets really need to be deep and convolutional? arXiv preprint arXiv:1603.05691.
  • Vapnik and Izmailov (2015) Vapnik, V. & Izmailov, R. (2015). Learning using privileged information: similarity control and knowledge transfer. J Mach Learn Res 16:2023–2049.
  • Vongkulbhisal et al. (2019) Vongkulbhisal, J., Vinayavekhin, P. & Visentini-Scarzanella, M. (2019). Unifying heterogeneous classifiers with distillation. In: CVPR.
  • Wang et al. (2017) Wang, C., Lan, X. & Zhang, Y. (2017). Model distillation with knowledge transfer from face classification to alignment and verification. arXiv preprint arXiv:1709.02929.
  • Wang et al. (2018a) Wang, H., Zhao, H., Li, X. & Tan, X. (2018a). Progressive blockwise knowledge distillation for neural network acceleration. In: IJCAI.
  • Wang et al. (2019a) Wang, J., Bao, W., Sun, L., Zhu, X., Cao, B. & Philip, SY. (2019a). Private model compression via knowledge distillation. In: AAAI.
  • Wang et al. (2019b) Wang, J., Gou, L., Zhang, W., Yang, H. & Shen, H. W. (2019b). Deepvid: Deep visual interpretation and diagnosis for image classifiers via knowledge distillation. TVCG 25(6):2168–2180.
  • Wang et al. (2018b) Wang, M., Liu, R., Abe, N., Uchida, H., Matsunami, T. & Yamada, S. (2018b). Discover the effective strategy for face recognition model compression by improved knowledge distillation. In: ICIP.
  • Wang et al. (2019c) Wang, M., Liu, R., Hajime, N., Narishige, A., Uchida, H. & Matsunami, T. (2019c). Improved knowledge distillation for training fast low resolution face recognition model. In: ICCVW.
  • Wang et al. (2019d) Wang, T., Yuan, L., Zhang, X. & Feng, J. (2019d). Distilling object detectors with fine-grained feature imitation. In: CVPR.
  • Wang et al. (2018c) Wang, W., Zhang, J., Zhang, H., Hwang, M. Y., Zong, C. & Li, Z. (2018c). A teacher-student framework for maintainable dialog manager. In: EMNLP.
  • Wang et al. (2018d) Wang, X., Zhang, R., Sun, Y. & Qi, J. (2018d). Kdgan: Knowledge distillation with generative adversarial networks. In: NeurIPS.
  • Wang et al. (2019e) Wang, X., Hu, J. F., Lai, J. H., Zhang, J. & Zheng, W. S. (2019e). Progressive teacher-student learning for early action prediction. In: CVPR.
  • Wang et al. (2018e) Wang, Y., Xu, C., Xu, C. & Tao, D. (2018e). Adversarial learning of portable student networks. In: AAAI.
  • Watanabe et al. (2017) Watanabe, S., Hori, T., Le Roux, J. & Hershey, J. R. (2017). Student-teacher network learning with enhanced features. In: ICASSP.
  • Wei et al. (2019) Wei, H. R., Huang, S., Wang, R., Dai, X. & Chen, J. (2019). Online distilling from checkpoints for neural machine translation. In: NAACL-HLT.
  • Wei et al. (2018) Wei, Y., Pan, X., Qin, H., Ouyang, W. & Yan, J. (2018). Quantization mimic: Towards very tiny cnn for object detection. In: ECCV.
  • Wong and Gales (2016) Wong, J. H. & Gales, M. (2016). Sequence student-teacher training of deep neural networks. In: Interspeech.
  • Wu et al. (2019) Wu, B., Dai, X., Zhang, P., Wang, Y., Sun, F., Wu, Y., … & Keutzer, K. (2019). Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search. In: CVPR.
  • Wu et al. (2019a) Wu, A., Zheng, W. S., Guo, X. & Lai, J. H. (2019a). Distilled person re-identification: Towards a more scalable system. In: CVPR.
  • Wu et al. (2016) Wu, J., Leng, C., Wang, Y., Hu, Q. & Cheng, J. (2016). Quantized convolutional neural networks for mobile devices. In: CVPR.
  • Wu et al. (2019b) Wu, M. C., Chiu, C. T. & Wu, K. H. (2019b). Multi-teacher knowledge distillation for compressed video action recognition on deep neural networks. In: ICASSP.
  • Wu et al. (2020) Wu, X., He, R., Hu, Y., & Sun, Z. (2020). Learning an evolutionary embedding via massive knowledge distillation. International Journal of Computer Vision, 1-18.
  • Xie et al. (2019) Xie, J., Lin, S., Zhang, Y. & Luo, L. (2019). Training convolutional neural networks with cheap convolutions and online distillation. arXiv preprint arXiv:1909.13063.
  • Xie et al. (2019) Xie, Q., Hovy, E., Luong, M. T., & Le, Q. V. (2019). Self-training with Noisy Student improves ImageNet classification. arXiv preprint arXiv:1911.04252.
  • Xu et al. (2017) Xu, Z., Hsu, Y. C. & Huang, J. (2017). Training shallow and thin networks for acceleration via knowledge distillation with conditional adversarial networks. arXiv preprint arXiv:1709.00513.
  • Xu et al. (2018) Xu, Z., Hsu, Y. C. & Huang, J. (2018). Training student networks for acceleration with conditional adversarial networks. In: BMVC.
  • Yan et al. (2019) Yan, M., Zhao, M., Xu, Z., Zhang, Q., Wang, G. & Su, Z. (2019). Vargfacenet: An efficient variable group convolutional neural network for lightweight face recognition. In: ICCVW.
  • Yang et al. (2019a) Yang, C., Xie, L., Qiao, S. & Yuille, A. (2019a). Knowledge distillation in generations: More tolerant teachers educate better students. In: AAAI.
  • Yang et al. (2019b) Yang, C., Xie, L., Su, C. & Yuille, A. L. (2019b). Snapshot distillation: Teacher-student optimization in one generation. In: CVPR.
  • Yang et al. (2020) Yang, Z., Shou, L., Gong, M., Lin, W. & Jiang, D. (2020). Model compression with two-stage multi-teacher knowledge distillation for web question answering system. In: WSDM.
  • Yao et al. (2019) Yao, H., Zhang, C., Wei, Y., Jiang, M., Wang, S., Huang, J., Chawla, N. V., & Li, Z. (2019). Graph Few-shot Learning via Knowledge Transfer. arXiv preprint arXiv:1910.03053.
  • Ye et al. (2019) Ye, J., Ji, Y., Wang, X., Ou, K., Tao, D. & Song, M. (2019). Student becoming the master: Knowledge amalgamation for joint scene parsing, depth estimation, and more. In: CVPR.
  • Yim et al. (2017) Yim, J., Joo, D., Bae, J. & Kim, J. (2017). A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In: CVPR.
  • You et al. (2017) You, S., Xu, C., Xu, C. & Tao, D. (2017). Learning from multiple teacher networks. In: SIGKDD.
  • You et al. (2018) You, S., Xu, C., Xu, C. & Tao, D. (2018). Learning with single-teacher multi-student. In: AAAI.
  • You et al. (2019) You, Y., Li, J., Reddi, S., Hseu, J., Kumar, S., Bhojanapalli, S., … & Hsieh, C. J. (2019). Large batch optimization for deep learning: Training bert in 76 minutes. In: ICLR.
  • Yu et al. (2019) Yu, L., Yazici, V. O., Liu, X., Weijer, J., Cheng, Y. & Ramisa, A. (2019). Learning metrics from teachers: Compact networks for image embedding. In: CVPR.
  • Yuan et al. (2019) Yuan, L., Tay, F. E., Li, G., Wang, T. & Feng, J. (2019). Revisit knowledge distillation: a teacher-free framework. arXiv preprint arXiv:1909.11723.
  • Yun et al. (2019) Yun, S., Park, J., Lee, K. & Shin, J. (2019). Regularizing predictions via class wise self knowledge distillation.
  • Zagoruyko and Komodakis (2016) Zagoruyko, S. & Komodakis, N. (2016). Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928.
  • Zhai et al. (2019) Zhai, M., Chen, L., Tung, F., He, J., Nawhal, M. & Mori, G. (2019). Lifelong gan: Continual learning for conditional image generation. In: ICCV.
  • Zhai et al. (2016) Zhai, S., Cheng, Y., Zhang, Z. M. & Lu, W. (2016). Doubly convolutional neural networks. In: NeurIPS.
  • Zhang and Peng (2018) Zhang, C. & Peng, Y. (2018). Better and faster: knowledge transfer from multiple self-supervised learning tasks via graph distillation for video classification. In: IJCAI.
  • Zhang et al. (2019a) Zhang, F., Zhu, X. & Ye, M. (2019a). Fast human pose estimation. In: CVPR.
  • Zhang et al. (2019b) Zhang, L., Song, J., Gao, A., Chen, J., Bao, C. & Ma, K. (2019b). Be your own teacher: Improve the performance of convolutional neural networks via self distillation. In: ICCV.
  • Zhang et al. (2020) Zhang, S., Guo, S., Wang, L., Huang, W., & Scott, M. R. (2020). Knowledge Integration Networks for Action Recognition. To appear in AAAI.
  • Zhang et al. (2018a) Zhang, X., Zhou, X., Lin, M. & Sun, J. (2018a). Shufflenet: An extremely efficient convolutional neural network for mobile devices. In: CVPR.
  • Zhang et al. (2018b) Zhang, Y., Xiang, T., Hospedales, T. M. & Lu, H. (2018b). Deep mutual learning. In: CVPR.
  • Zhao et al. (2018) Zhao, M., Li, T., Abu Alsheikh, M., Tian, Y., Zhao, H., Torralba, A. & Katabi, D. (2018). Through-wall human pose estimation using radio signals. In: CVPR.
  • Zhou et al. (2019a) Zhou, C., Neubig, G. & Gu, J. (2019a). Understanding knowledge distillation in non-autoregressive machine translation. arXiv preprint arXiv:1911.02727.
  • Zhou et al. (2018) Zhou, G., Fan, Y., Cui, R., Bian, W., Zhu, X. & Gai, K. (2018). Rocket launching: A universal and efficient framework for training well-performing light net. In: AAAI.
  • Zhou et al. (2019b) Zhou, J., Zeng, S. & Zhang, B. (2019b). Two-stage image classification supervised by a single teacher single student model. arXiv preprint arXiv:1909.12111.
  • Zhou et al. (2019c) Zhou, P., Mai, L., Zhang, J., Xu, N., Wu, Z. & Davis, L. S. (2019c). M2KD: Multi-model and multi-level knowledge distillation for incremental learning. arXiv preprint arXiv:1904.01769.
  • Zhu et al. (2019) Zhu, M., Han, K., Zhang, C., Lin, J. & Wang, Y. (2019). Low-resolution visual recognition via deep feature distillation. In: ICASSP.
  • Zhu and Gong (2018) Zhu, X. & Gong, S. (2018). Knowledge distillation by on-the-fly native ensemble. In: NeurIPS.