Heterogeneous Knowledge Distillation using Information Flow Modeling

Knowledge Distillation (KD) methods are capable of transferring the knowledge encoded in a large and complex teacher into a smaller and faster student. Early methods were usually limited to transferring the knowledge only between the last layers of the networks, while later approaches were capable of performing multi-layer KD, further increasing the accuracy of the student. However, despite their improved performance, these methods still suffer from several limitations that restrict both their efficiency and flexibility. First, existing KD methods typically ignore that neural networks undergo different learning phases during the training process, each of which often requires a different type of supervision. Furthermore, existing multi-layer KD methods are usually unable to effectively handle networks with significantly different architectures (heterogeneous KD). In this paper we propose a novel KD method that works by modeling the information flow through the various layers of the teacher model and then training a student model to mimic this information flow. The proposed method overcomes the aforementioned limitations by using an appropriate supervision scheme during the different phases of the training process, as well as by designing and training an appropriate auxiliary teacher model that acts as a proxy capable of "explaining" the way the teacher works to the student. The effectiveness of the proposed method is demonstrated using four image datasets and several different evaluation setups.


1 Introduction

Figure 1: Existing knowledge distillation approaches ignore the existence of critical learning periods when transferring the knowledge, even when multi-layer transfer approaches are used. However, as argued in [achille2017critical], the information plasticity rapidly declines after the first few training epochs, reducing the effectiveness of knowledge distillation. On the other hand, the proposed method models the information flow in the teacher network and provides the appropriate supervision during the first few critical learning epochs in order to ensure that the necessary connections between successive layers of the networks will be formed. Note that even though this process initially slows down the convergence of the network slightly (epochs 1-8), it allows for rapidly increasing the rate of convergence after the critical learning period ends (epochs 10-25). The parameter $\alpha$ controls the relative importance of transferring the knowledge from the intermediate layers during the various learning phases, as described in detail in Section 3.

Despite the tremendous success of Deep Learning (DL) in a wide range of domains [lecun2015deep], most DL methods suffer from a significant drawback: powerful hardware is needed for training and deploying DL models. This significantly hinders DL applications in resource-scarce environments, such as embedded and mobile devices, leading to the development of various methods for overcoming these limitations. Among the most prominent methods for this task is knowledge distillation (KD) [hinton2015distilling], which is also known as knowledge transfer (KT) [yim2017gift]. These approaches aim to transfer the knowledge encoded in a large and complex neural network into a smaller and faster one. In this way, it is possible to increase the accuracy of the smaller model, compared to the same model trained without employing KD. Typically, the smaller model is called the student model, while the larger model is called the teacher model.

Early KD approaches focused on transferring the knowledge between the last layer of the teacher and student models [compression-model, hinton2015distilling, passalis2018learning, tang2016recurrent, tzeng2015simultaneous, yu2019learning]. This allowed for providing richer training targets to the student model, which capture more information regarding the similarities between different samples, reducing overfitting and increasing the student's accuracy. Later methods further increased the efficiency of KD by modeling and transferring the knowledge encoded in the intermediate layers of the teacher [romero2014fitnets, yim2017gift, zagoruyko2016paying]. These approaches usually attempt to implicitly model the way information gets transformed through the various layers of a network, providing additional hints to the student model regarding the way that the teacher model processes the information.

Even though these methods were indeed able to further increase the accuracy of models trained with KD, they also suffer from several limitations that restrict both their efficiency and flexibility. First, note that neural networks exhibit an evolving behavior, undergoing several different and distinct phases during the training process. For example, during the first few epochs critical connections are formed [achille2017critical], defining almost permanently the future information flow paths of a network. After fixing these paths, the training process can only fine-tune them, while forming new paths is significantly less probable after the critical learning period ends [achille2017critical]. After forming these critical connections, the fitting and compression (when applicable) phases follow [shwartz2017opening, saxe2018information]. Despite this dynamic, time-dependent behavior of neural networks, virtually all existing KD approaches ignore the phases that neural networks undergo during training. This observation leads us to the first research question of this paper: Is a different type of supervision needed during the different learning phases of the student and is it possible to use a stronger teacher to provide this supervision?

To this end, we propose a simple, yet effective way to exploit KD to train a student that mimics the information flow paths of the teacher, while also providing further evidence confirming the existence of critical learning periods during the training phase of a neural network, as originally described in [achille2017critical]. Indeed, as also demonstrated in the ablation study shown in Fig. 1, providing the correct supervision during the critical learning period of a neural network can have a significant effect on the overall training process, increasing the accuracy of the student model. More information regarding this ablation study is provided in Section 4. It is worth noting that the additional supervision, which is employed to ensure that the student will form information paths similar to the teacher's, actually slows down the learning process until the critical learning period is completed. However, after the information flow paths are formed, the rate of convergence is significantly accelerated compared to student networks that do not take into account the existence of critical learning periods.

Figure 2: Examining the effect of transferring the knowledge from different layers of a teacher model into the third layer of the student model. Two different teachers are used, a strong teacher (ResNet-18, where each layer refers to each layer block) and an auxiliary teacher (CNN-1-A). The nearest centroid classifier (NCC) accuracy is reported for the representations extracted from each layer (in order to provide an intuitive measure of how each layer transforms the representations extracted from the input data). The final precision is reported for a student model trained either without intermediate layer supervision (upper black values) or using different layers of the teacher (4 subsequent precision values). Several different phenomena are observed when the knowledge is transferred from different layers, while the proposed auxiliary teacher allows for achieving the highest precision and provides a straightforward way to match the layers between the models (the auxiliary teacher transforms the data representations in a way that is closer to the student model, as measured by the NCC accuracy).

Another limitation of existing KD approaches that employ multiple intermediate layers is their inability to handle heterogeneous multi-layer knowledge distillation, i.e., to transfer the knowledge between teachers and students with vastly different architectures. Existing methods almost exclusively use network architectures that provide a trivial one-to-one matching between the layers of the student and teacher, e.g., ResNets with the same number of blocks are often used, altering only the number of layers inside each residual block [yim2017gift, zagoruyko2016paying]. Many of these approaches, such as [yim2017gift], are even more restrictive, also requiring the layers of the teacher and student to have the same dimensionality. As a result, it is especially difficult to perform multi-layer KD between networks with vastly different architectures, since even if just one layer of the teacher model is incorrectly matched to a layer of the student model, the accuracy of the student can be significantly reduced, either due to over-regularizing the network or by forcing the student to compress its representations too early. This behavior is demonstrated in Fig. 2, where the knowledge is transferred from various layers of two different teachers to the third layer of the student. These findings lead us to the second research question of this paper: Is it possible to handle heterogeneous KD in a structured way to avoid such phenomena?

To this end, in this work, we propose a simple, yet effective approach for training an auxiliary teacher model, which is closer to the architecture of the student model. This auxiliary teacher is responsible for explaining the way the larger teacher works to the student model. Indeed, this approach can significantly increase the accuracy of the student, as demonstrated both in Fig. 2, as well as in the rest of the experiments conducted in this paper. It is worth noting that during our initial experiments it was almost impossible to find a layer matching that would actually improve the accuracy of the student model without first designing an appropriate auxiliary teacher model, highlighting the importance of using auxiliary teachers in heterogeneous KD scenarios, as also highlighted in [mirzadeh2019improved].

The main contribution of this paper is a KD method that works by modeling the information flow through the teacher model and then training a student model to mimic this information flow. However, as explained previously and experimentally demonstrated in this paper, this process is often very difficult, especially when there is no obvious layer matching between the teacher and student models, which can often process the information in vastly different ways. In fact, even a single layer mismatch, i.e., overly regularizing the network or forcing an early compression of the representation, can significantly reduce the accuracy of the student model. To overcome these limitations, the proposed method works by a) designing and training an appropriate auxiliary teacher model that allows for a direct and effective one-to-one matching between the layers of the student and teacher models, as well as b) employing a critical-learning-aware KD scheme that ensures that critical connections will be formed, allowing the student to effectively mimic the teacher's information flow instead of merely mimicking the teacher's output.

The effectiveness of the proposed method is demonstrated using several different tasks, ranging from metric learning and classification to mimicking handcrafted feature extractors for providing fast neural network-based implementations for low-power embedded hardware. The experimental evaluation also includes an extensive representation learning evaluation, given its increasing importance in many embedded DL and robotic applications, following the evaluation protocol of recently proposed KD methods [passalis2018learning, yu2019learning]. An open-source implementation of the proposed method is provided at https://github.com/passalis/pkth.

The rest of the paper is structured as follows. First, the related work is briefly discussed and compared to the proposed method in Section 2. Then, the proposed method is presented in Section 3, while the experimental evaluation is provided in Section 4. Finally, conclusions are drawn in Section 5.

2 Related Work

A large number of knowledge transfer methods that build upon the neural network distillation approach have been proposed [ahn2019variational, compression-model, hinton2015distilling, tang2016recurrent, tzeng2015simultaneous]. These methods typically use a teacher model to generate soft-labels and then use these soft-labels for training a smaller student network. It is worth noting that several extensions to this approach have been proposed. For example, soft-labels can be used for pre-training a large network [tang2015knowledge] and performing domain adaptation [tzeng2015simultaneous], while an embedding-based approach for transferring the knowledge was proposed in [passalis2018unsupervised]. Also, online distillation methods, such as [anil2018large, zhang2018deep], employ a co-training strategy, training both the student and teacher models simultaneously. However, none of these approaches takes into account that deep neural networks transition through several learning phases, each one with different characteristics, which requires handling them in different ways. On the other hand, the proposed method models the information flow in the teacher model and then employs a weighting scheme that provides the appropriate supervision during the initial critical learning period of the student, ensuring that the critical connections and information paths formed in the teacher model will be transferred to the student.

Furthermore, several methods that support multi-layer KD have been proposed, such as using hints [romero2014fitnets], the flow of solution procedure (FSP) matrix [yim2017gift], attention transfer [zagoruyko2016paying], or singular value decomposition to extract major features from each layer [lee2018self]. However, these approaches usually only target networks with compatible architectures, e.g., residual networks with the same number of residual blocks for both the teacher and student models. Also, it is not straightforward to use them to successfully transfer the knowledge between heterogeneous models, since even a slight layer mismatch can have a devastating effect on the student's accuracy, as demonstrated in Fig. 2. It is also worth noting that we were actually unable to effectively apply most of these methods for heterogeneous KD, since they either do not support transferring the knowledge between layers of different dimensionality, e.g., [yim2017gift], or are prone to over-regularization or representation collapse (as demonstrated in Fig. 2), reducing the overall performance of the student.

In contrast with the aforementioned approaches, the proposed method provides a way to perform heterogeneous multi-layer KD by appropriately designing and training an auxiliary network and exploiting the knowledge encoded in the earlier layers of this network. In this way, the proposed method provides an efficient way to handle any possible network architecture by employing an auxiliary network that is close to the architecture of the student model, regardless of the architecture of the teacher model. Using the proposed auxiliary network strategy ensures that the teacher model will transform the representations extracted from the data in a way compatible with the student model, allowing for a one-to-one matching between the intermediate layers of the networks. It is also worth noting that the use of a similar auxiliary network, employed as an intermediate step for KD, was also proposed in [mirzadeh2019improved]. However, in contrast with the proposed method, the auxiliary network used in [mirzadeh2019improved] was employed merely to improve the performance of KD between the final classification layers, instead of being designed to facilitate efficient multi-layer KD, as proposed in this paper. Finally, to the best of our knowledge, in this work we propose the first architecture-agnostic probabilistic KD approach that works by modeling the information flow through the various layers of the teacher model using a hybrid kernel formulation, can support heterogeneous network architectures, and can effectively supervise the student model during its critical learning period.

3 Proposed Method

Let $\mathcal{T} = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$ denote the transfer set that contains the $N$ transfer samples used to transfer the knowledge from the teacher model $f_t$ to the student model $f_s$. Note that the proposed method can also work in a purely unsupervised fashion and, as a result, unlabeled data samples can also be used for transferring the knowledge. Also, let $\mathbf{y}^{(t,i)}$ denote the representation extracted from the $i$-th layer of the teacher model and $\mathbf{y}^{(s,j)}$ denote the representation extracted from the $j$-th layer of the student model $f_s$. Note that the trainable parameters of the student model are denoted by $\mathbf{W}_s$. The proposed method aims to train the student model $f_s$, i.e., learn the appropriate parameters $\mathbf{W}_s$, in order to "mimic" the behavior of the teacher $f_t$ as closely as possible.

Furthermore, let $\mathcal{Y}^{(t,i)}$ denote the random variable that describes the representation extracted from the $i$-th layer of the teacher model and $\mathcal{Y}^{(s,j)}$ the corresponding random variable for the student model. Also, let $\mathcal{C}$ denote the random variable that describes the training targets for the teacher model. In this work, the information flow of the teacher network is defined as the progression of the mutual information between each layer representation of the network and the training targets, i.e., $I(\mathcal{Y}^{(t,i)}, \mathcal{C})$. Note that even though the training targets are required for modeling the information flow, they are not actually needed during the KD process, as we demonstrate later. Then, we can define the information flow vector that characterizes the way the network processes information as:

$$\boldsymbol{\omega}_t = \left[ I(\mathcal{Y}^{(t,1)}, \mathcal{C}), \dots, I(\mathcal{Y}^{(t,N_{L_t})}, \mathcal{C}) \right]^T \in \mathbb{R}^{N_{L_t}}, \quad (1)$$

where $N_{L_t}$ is the number of layers of the teacher model. Similarly, the information flow vector for the student model is defined as:

$$\boldsymbol{\omega}_s = \left[ I(\mathcal{Y}^{(s,1)}, \mathcal{C}), \dots, I(\mathcal{Y}^{(s,N_{L_s})}, \mathcal{C}) \right]^T \in \mathbb{R}^{N_{L_s}}, \quad (2)$$

where again $N_{L_s}$ is the number of layers of the student model. The proposed method works by minimizing the divergence between the information flow in the teacher and student models, i.e., $D_F(\boldsymbol{\omega}_s, \boldsymbol{\omega}_t)$, where $D_F(\cdot, \cdot)$ is a metric used to measure the divergence between the flows of two, possibly heterogeneous, networks. To this end, the information flow divergence is defined as the sum of squared differences between each paired element of the information flow vectors:

$$D_F(\boldsymbol{\omega}_s, \boldsymbol{\omega}_t) = \sum_{i=1}^{N_{L_s}} \left( [\boldsymbol{\omega}_s]_i - [\boldsymbol{\omega}_t]_{\kappa(i)} \right)^2,$$

where the layer $\kappa(i)$ of the teacher is chosen in order to minimize the divergence with the corresponding layer of the student:

$$\kappa(i) = \underset{j \in \{1, \dots, N_{L_t}\}}{\arg\min} \left( [\boldsymbol{\omega}_s]_i - [\boldsymbol{\omega}_t]_j \right)^2, \quad (3)$$

and the notation $[\boldsymbol{\omega}]_i$ is used to refer to the $i$-th element of vector $\boldsymbol{\omega}$. This definition employs the optimal matching between the layers (considering the discriminative power of each layer), except for the final one, which corresponds to the task at hand. In this way, it allows for measuring the flow divergence between networks with different architectures. At the same time, it is also expected to minimize the impact of over-regularization and/or representation collapse phenomena, such as those demonstrated in Fig. 2, which often occur when there is a large divergence between the layers used for transferring the knowledge. However, this also implies that for networks with vastly different architectures, or for networks not yet trained for the task at hand, the same layer of the teacher may be used for transferring the knowledge to multiple layers of the student model, leading to a significant loss of granularity during the KD as well as to stability issues. In Subsection 3.2 we provide a simple, yet effective way to overcome this issue by using auxiliary teacher models. Note that more advanced methods, such as employing fuzzy assignments between different sets of layers, can also be used.
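To make the matching of Eqs. (1)-(3) concrete, the following minimal NumPy sketch computes the flow divergence between a student and a teacher, assuming that the per-layer mutual information estimates are already available; the function name, the toy values, and the explicit final-layer constraint are illustrative choices of this sketch, not code from the official implementation.

```python
import numpy as np

def flow_divergence(omega_s: np.ndarray, omega_t: np.ndarray) -> float:
    """Information flow divergence of Eqs. (1)-(3).

    omega_s / omega_t: information flow vectors, i.e., one mutual information
    estimate per layer of the student / teacher (last entry = output layer).
    """
    n_ls, n_lt = len(omega_s), len(omega_t)
    divergence = 0.0
    for i in range(n_ls):
        if i == n_ls - 1:
            # The final layers are always matched to each other (task at hand).
            j = n_lt - 1
        else:
            # kappa(i): teacher layer whose MI is closest to that of student layer i.
            j = int(np.argmin((omega_t - omega_s[i]) ** 2))
        divergence += (omega_s[i] - omega_t[j]) ** 2
    return divergence

# Toy example: a 3-layer student matched against a 5-layer (block) teacher.
omega_teacher = np.array([0.9, 1.4, 1.9, 2.2, 2.3])  # hypothetical MI estimates
omega_student = np.array([1.0, 1.8, 2.1])
print(flow_divergence(omega_student, omega_teacher))
```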

3.1 Tractable Information Flow Divergence Measures using Quadratic Mutual Information

In order to effectively transfer the knowledge between two different networks, we have to provide an efficient way to calculate the mutual information, as well as to train the student model to match the mutual information between two layers of different networks. Recently, it has been demonstrated that when the Quadratic Mutual Information (QMI) [torkkola2003feature] is used, it is possible to efficiently minimize the difference between the mutual information of a specific layer of the teacher and student by appropriately relaxing the optimization problem [passalis2018learning]. More specifically, the problem of matching the mutual information between two layers can be reduced to a simpler probability matching problem that involves only the pairwise interactions between the transfer samples. Therefore, to transfer the knowledge between a specific layer of the student and another layer of the teacher, it is adequate to minimize the divergence between the teacher's and student's conditional probability distributions, which can be estimated as [passalis2018learning]:

$$p^{(t,l_t)}_{j|i} = \frac{K\left(\mathbf{y}^{(t,l_t)}_i, \mathbf{y}^{(t,l_t)}_j\right)}{\sum_{k=1, k \neq i}^{N} K\left(\mathbf{y}^{(t,l_t)}_i, \mathbf{y}^{(t,l_t)}_k\right)}, \quad (4)$$

and

$$p^{(s,l_s)}_{j|i} = \frac{K\left(\mathbf{y}^{(s,l_s)}_i, \mathbf{y}^{(s,l_s)}_j\right)}{\sum_{k=1, k \neq i}^{N} K\left(\mathbf{y}^{(s,l_s)}_i, \mathbf{y}^{(s,l_s)}_k\right)}, \quad (5)$$

where $K(\cdot, \cdot)$ is a kernel function and $l_s$ and $l_t$ refer to the student and teacher layers used for the transfer. These probabilities also express how probable it is for each sample to select each of its neighbors [maaten2008visualizing], modeling in this way the geometry of the feature space, while matching these two distributions also ensures that the mutual information between the models and a set of (possibly unknown) classes is maintained [passalis2018learning]. Note that the actual training labels are not required during this process and, as a result, the proposed method can work in a purely unsupervised fashion.

The kernel choice can have a significant effect on the quality of the KD, since it alters how the mutual information is estimated [passalis2018learning]. Apart from the well-known Gaussian kernel, which is however often hard to tune, other kernel choices include cosine-based kernels [passalis2018learning], e.g., $K_{c}(\mathbf{a}, \mathbf{b}) = \frac{1}{2}\left( \frac{\mathbf{a}^T \mathbf{b}}{\|\mathbf{a}\|_2 \|\mathbf{b}\|_2} + 1 \right)$, where $\mathbf{a}$ and $\mathbf{b}$ are two vectors, and the T-student kernel, i.e., $K_{T}(\mathbf{a}, \mathbf{b}) = \frac{1}{1 + \|\mathbf{a} - \mathbf{b}\|_2^d}$, where $d$ is typically set to 1. Selecting the most appropriate kernel for the task at hand can lead to significant performance improvements, e.g., cosine-based kernels perform better for retrieval tasks, while using kernel ensembles, i.e., estimating the probability distribution using multiple kernels, can also improve the robustness of mutual information estimation. Therefore, in this paper a hybrid objective is used that aims at minimizing the divergence calculated using both the cosine kernel, which ensures the good performance of the learned representation for retrieval tasks, and the T-student kernel, which experimentally demonstrated good performance for classification tasks:

$$\mathcal{L}^{(l_t, l_s)} = D\left(\mathbf{P}^{(t,l_t)}_{c} \,\|\, \mathbf{P}^{(s,l_s)}_{c}\right) + D\left(\mathbf{P}^{(t,l_t)}_{T} \,\|\, \mathbf{P}^{(s,l_s)}_{T}\right), \quad (6)$$

where $D(\cdot \| \cdot)$ is a probability divergence metric and the notation $\mathbf{P}^{(t,l_t)}_{c}$ and $\mathbf{P}^{(t,l_t)}_{T}$ is used to denote the conditional probabilities of the teacher calculated using the cosine and T-student kernels, respectively. Again, the representations used for KD are extracted from the $l_t$-th/$l_s$-th layer. The student probability distributions are denoted similarly by $\mathbf{P}^{(s,l_s)}_{c}$ and $\mathbf{P}^{(s,l_s)}_{T}$. The divergence between these distributions can be calculated using a symmetric version of the Kullback-Leibler (KL) divergence, the Jeffreys divergence [jeffreys1946invariant]:

$$D\left(\mathbf{P}^{(t,l_t)} \,\|\, \mathbf{P}^{(s,l_s)}\right) = \sum_{i=1}^{N} \sum_{j=1, i \neq j}^{N} \left( p^{(t,l_t)}_{j|i} - p^{(s,l_s)}_{j|i} \right) \cdot \left( \log p^{(t,l_t)}_{j|i} - \log p^{(s,l_s)}_{j|i} \right),$$
which can be sampled at a finite number of points during the optimization, e.g., using batches of 64-128 samples. This batch-based strategy has been successfully  employed in a number of different works [passalis2018learning, yu2019learning], without any significant effect on the optimization process.
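The following PyTorch sketch illustrates this batch-based objective: it estimates the conditional probabilities of Eqs. (4)-(5) over a batch using both the cosine and T-student kernels and accumulates the Jeffreys divergence of Eq. (6). The function names, the epsilon terms, and the way self-similarities are excluded are implementation choices of this sketch rather than details prescribed by the paper (the official code is available at the repository linked above).

```python
import torch

def cond_probabilities(feats: torch.Tensor, kernel: str, eps: float = 1e-8) -> torch.Tensor:
    """Batch estimate of the conditional probabilities p_{j|i} of Eqs. (4)-(5)."""
    feats = feats.reshape(feats.size(0), -1)
    if kernel == "cosine":
        normed = feats / (feats.norm(dim=1, keepdim=True) + eps)
        sims = 0.5 * (normed @ normed.t() + 1.0)        # cosine kernel, values in [0, 1]
    elif kernel == "tstudent":
        sims = 1.0 / (1.0 + torch.cdist(feats, feats))  # T-student kernel with d = 1
    else:
        raise ValueError(f"unknown kernel: {kernel}")
    sims = sims * (1.0 - torch.eye(sims.size(0), device=sims.device))  # drop self-pairs
    return sims / (sims.sum(dim=1, keepdim=True) + eps)

def jeffreys(p_t: torch.Tensor, p_s: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Symmetric KL (Jeffreys) divergence between teacher and student distributions."""
    return torch.sum((p_t - p_s) * (torch.log(p_t + eps) - torch.log(p_s + eps)))

def hybrid_flow_loss(teacher_feats: torch.Tensor, student_feats: torch.Tensor) -> torch.Tensor:
    """Hybrid objective of Eq. (6): cosine-kernel term plus T-student-kernel term."""
    loss = 0.0
    for kernel in ("cosine", "tstudent"):
        p_t = cond_probabilities(teacher_feats, kernel).detach()  # teacher is frozen
        p_s = cond_probabilities(student_feats, kernel)
        loss = loss + jeffreys(p_t, p_s)
    return loss

# Example with random features for a batch of 128 samples:
t_feats, s_feats = torch.randn(128, 256), torch.randn(128, 64, requires_grad=True)
print(hybrid_flow_loss(t_feats, s_feats))
```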

3.2 Auxiliary Networks and Information Flow

Even though the flow divergence metric defined in (3) takes into account the way different networks process the information, it suffers from a significant drawback: if the teacher processes the information in a significantly different way compared to the student, then the same layer of the teacher model might be used for transferring the knowledge to multiple layers of the student model, leading to a significant loss in the granularity of the information flow used for KD. Furthermore, this problem can also arise even when the student model is capable of processing the information in a way compatible with the teacher, but has not yet been appropriately trained for the task at hand. To better understand this, note that the information flow divergence in (3) is calculated based on the estimated mutual information and not the actual learning capacity of each model. Therefore, directly using the flow divergence definition presented in (3) is not optimal for KD. It is worth noting that this issue is especially critical for every KD method that employs multiple layers, since, as we demonstrate in Section 4, if the layer pairs are not carefully selected, the accuracy of the student model is often lower compared to a model trained without using multi-layer transfer at all.

Unfortunately, due to the poor understanding of the way that neural networks transform the probability distribution of the input data, there is currently no way to select the most appropriate layers for transferring the knowledge a priori. This process can be especially difficult and tedious, especially when the architectures of the student and teacher differ significantly. To overcome this critical limitation, in this work we propose constructing an appropriate auxiliary proxy for the teacher model, which allows for a direct matching between all the layers of the auxiliary model and the student model, as shown in Fig. 3. In this way, the proposed method employs an auxiliary network, which has an architecture compatible with the student model, to better facilitate the process of KD. A simple, yet effective approach for designing the auxiliary network is employed in this work: the auxiliary network follows the same architecture as the student model, but uses twice the neurons/convolutional filters per layer. Thus, the greater learning capacity of the auxiliary network ensures that enough knowledge will always be available to the auxiliary network (when compared to the student model), leading to better results compared to directly transferring the knowledge from the teacher model. Designing the most appropriate auxiliary network is an open research area and significantly better ways than the proposed one might exist. However, even this simple approach was adequate to significantly enhance the performance of KD and demonstrate the potential of information flow modeling, as further demonstrated in the ablation studies provided in Section 4. Also, note that a hierarchy of auxiliary teachers can be trained in this fashion, as also proposed in [mirzadeh2019improved].
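As a concrete illustration of this design rule, the sketch below builds a student CNN and its auxiliary teacher by doubling the number of filters per layer; the depths, widths, and embedding size are placeholders (the actual CNN-1/CNN-1-A architectures are shown in Fig. 4 of the appendix), not the networks used in the experiments.

```python
import torch
import torch.nn as nn

def make_cnn(width: int, out_dim: int = 64) -> nn.Sequential:
    """A small CNN template; only the number of filters per layer changes."""
    return nn.Sequential(
        nn.Conv2d(3, width, 3, padding=1), nn.BatchNorm2d(width), nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Conv2d(width, 2 * width, 3, padding=1), nn.BatchNorm2d(2 * width), nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Conv2d(2 * width, 4 * width, 3, padding=1), nn.BatchNorm2d(4 * width), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(4 * width, out_dim),
    )

student = make_cnn(width=16)     # compact student
auxiliary = make_cnn(width=32)   # same topology, twice the filters per layer

# Sanity check: both models map a 32x32 RGB image to the same embedding size.
x = torch.randn(1, 3, 32, 32)
print(student(x).shape, auxiliary(x).shape)
```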

Figure 3: First, the knowledge is transferred to an appropriate auxiliary teacher, which will better facilitate the process of KD. Then, the proposed method minimizes the information flow divergence between the two models, taking into account the existence of critical learning periods.

The final loss used to optimize the student model, when an auxiliary network is employed, is calculated as:

$$\mathcal{L} = \sum_{i=1}^{N_{L_s}} \alpha_i \mathcal{L}^{(\kappa(i), i)}, \quad (7)$$

where $\alpha_i$ is a hyper-parameter that controls the relative weight of transferring the knowledge from the $\kappa(i)$-th layer of the teacher to the $i$-th layer of the student, and the loss defined in (6) is calculated using the auxiliary teacher, instead of the initial teacher. The value of $\alpha_i$ can be dynamically selected during the training process, to ensure that the applied KD scheme takes into account the current learning state of the network, as further discussed in Subsection 3.3. Finally, stochastic gradient descent is employed to train the student model: $\Delta \mathbf{W}_s = - \eta \frac{\partial \mathcal{L}}{\partial \mathbf{W}_s}$, where $\mathbf{W}_s$ is the matrix with the parameters of the student model and $\eta$ is the employed learning rate.
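A minimal sketch of Eq. (7) follows, reusing the hybrid per-layer loss from the previous sketch; the helper names in the commented usage example (student_features, auxiliary_features, transfer_loader) are hypothetical placeholders for the feature-extraction and data-loading code.

```python
import torch

def total_kd_loss(student_feats, aux_feats, alphas, pair_loss):
    """Weighted multi-layer loss of Eq. (7).

    student_feats / aux_feats: lists of intermediate representations, one per layer,
    ordered from early to late (one-to-one matching enabled by the auxiliary teacher).
    alphas: per-layer weights; the last entry corresponds to the task at hand.
    pair_loss: per-layer objective, e.g., the hybrid loss of Eq. (6).
    """
    loss = 0.0
    for s_feat, t_feat, alpha in zip(student_feats, aux_feats, alphas):
        loss = loss + alpha * pair_loss(t_feat, s_feat)
    return loss

# Hypothetical training step (student and hybrid_flow_loss as in the earlier
# sketches; plain SGD is shown here, while the experiments use Adam):
# optimizer = torch.optim.SGD(student.parameters(), lr=1e-3)
# for x, _ in transfer_loader:
#     s_feats, t_feats = student_features(x), auxiliary_features(x)
#     loss = total_kd_loss(s_feats, t_feats, alphas, hybrid_flow_loss)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
```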

3.3 Critical Period-aware Divergence Minimization

Neural networks transition through different learning phases during the training process, with the first few epochs being especially critical for the later behavior of the network [achille2017critical]. Using a stronger teacher model provides the opportunity of guiding the student model during the initial critical learning period in order to form the appropriate connectivity between the layers, before the information plasticity declines. However, merely minimizing the information flow divergence does not ensure that the appropriate connections will be formed. To better understand this, we have to consider that the gradients back-propagated through the network depend both on the training target, as well as on the initialization of the network. Therefore, for a randomly initialized student, the task of forming the appropriate connections between the intermediate layers might not facilitate the final task at hand (until reaching a certain critical point). This was clearly demonstrated in Fig. 1, where the convergence of the network was initially slower when the proposed method was used, until reaching the point at which the critical learning period ends and the convergence of the network is accelerated.

Therefore, in this work we propose using an appropriate weighting scheme for calculating the value of the hyper-parameter $\alpha_i$ during the training process. More specifically, during the critical learning period a significantly higher weight is given to matching the information flow for the earlier layers, ignoring the task at hand dictated by the final layer of the teacher, while this weight gradually decays to 0 as the training process progresses. Therefore, the parameter $\alpha_i$ is calculated as:

$$\alpha_i = \begin{cases} 1, & \text{if } i = N_{L_s} \\ \alpha_{init} \cdot \gamma^{k}, & \text{otherwise} \end{cases}$$

where $k$ is the current training epoch, $\gamma$ is a decay factor and $\alpha_{init}$ is the initial weight used for matching the information flow in the intermediate layers. The parameter $\alpha_{init}$ was set to 100, while the same decay factor $\gamma$ was used for all the experiments conducted in this paper (unless otherwise stated). Therefore, during the first few epochs (1-10) the final task at hand has a minimal impact on the optimization objective. However, as the training process progresses, the importance of matching the information flow for the intermediate layers gradually diminishes and the optimization switches to fine-tuning the network for the task at hand.
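The schedule can be implemented in a few lines; in the sketch below, alpha_init = 100 follows the ablation study of Section 4, while the default decay factor is an assumed value, since the text only states the decay factor explicitly for the SUN Attribute setup.

```python
def layer_weights(epoch: int, num_student_layers: int,
                  alpha_init: float = 100.0, gamma: float = 0.7) -> list:
    """Per-layer weights alpha_i of the critical-period-aware scheme.

    The final (task) layer always receives weight 1, while the intermediate layers
    start from a large weight (alpha_init) that decays geometrically with the epoch
    index. The default gamma here is an assumed value (only the SUN setup's 0.6 is
    stated in the text).
    """
    intermediate = alpha_init * (gamma ** epoch)
    return [intermediate] * (num_student_layers - 1) + [1.0]

# During the critical period the intermediate layers dominate; later the task does:
print(layer_weights(epoch=1, num_student_layers=4))   # [70.0, 70.0, 70.0, 1.0]
print(layer_weights(epoch=25, num_student_layers=4))  # intermediate weights close to 0
```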

4 Experimental Evaluation

The experimental evaluation of the proposed method is provided in this Section. The proposed method was evaluated using four different datasets (CIFAR-10 [krizhevsky2009learning], STL-10 [coates2011analysis], CUB-200 [welinder2010caltech] and SUN Attribute [patterson2012sun] datasets) and compared to four competitive KD methods: neural network distillation [hinton2015distilling], hint-based transfer [romero2014fitnets], probabilistic knowledge transfer (PKT) [passalis2018learning] and metric knowledge transfer (abbreviated as MKT) [yu2019learning]. A variety of different evaluation setups were used to evaluate various aspects of the proposed method. Please refer to the appendix for a detailed description of the employed networks and evaluation setups.

Method mAP (e) mAP (c) top-100 (e) top-100 (c)
Baseline Models
Teacher (ResNet-18)
Aux. (CNN1-A)
With Contrastive Supervision
Student (CNN1)
Hint.
MKT
PKT
Hint-H
MKT-H
PKT-H
Proposed
Without Contrastive Supervision
Student (CNN1)
Distill.
Hint.
MKT
PKT
Hint-H
MKT-H
PKT-H
Proposed
Table 1: Metric Learning Evaluation: CIFAR-10

First, the proposed method was evaluated in a metric learning setup using the CIFAR-10 dataset (Table 1). The methods were evaluated under two different settings: a) using contrastive supervision (by adding a contrastive loss term in the loss function [hadsell2006dimensionality]), as well as b) using a purely unsupervised setting (cloning the responses of the powerful teacher model). The simple variants (Hint, MKT, PKT) refer to transferring the knowledge only from the penultimate layer of the teacher, while the "-H" variants refer to transferring the knowledge simultaneously from all the layers of the auxiliary model. The abbreviation "e" is used to refer to retrieval using the Euclidean similarity metric, while "c" is used to refer to retrieval using the cosine similarity.

Method Train Accuracy Test Accuracy
Distill
Hint.
MKT
PKT
Hint-H
MKT-H
PKT-H
Proposed
Table 2: Classification Evaluation: CIFAR-10

First, note that using all the layers for distilling the knowledge provides small to no improvements in the retrieval precision, with the exception of the MKT method (when applied without any form of supervision). Actually, in some cases (e.g., when hint-based transfer is employed) the performance when multiple layers are used is worse. This behavior further confirms and highlights the difficulty of applying multi-layer KD methods between heterogeneous architectures. Also, using contrastive supervision seems to provide more consistent results for the competitive methods, especially for the MKT method. Using the proposed method leads to a significant increase in the mAP, as well as in the top-K precision. For example, mAP (c) increases by over 2.5% (relative increase) over the next best performing method (PKT-H). At the same time, note that the proposed method seems to lead to overall better results when there is no additional supervision. This is again linked to the existence of critical learning periods. As explained before, forming the appropriate information flow paths requires little to no supervision from the final layers when the network is randomly initialized (since forming these paths usually changes the way the network processes information, temporarily increasing the loss related to the final task at hand). Similar conclusions can also be drawn from the classification evaluation using the CIFAR-10 dataset. The results are reported in Table 2. Again, the proposed method leads to a relative increase of about 0.7% over the next best-performing method.

Method mAP (e) mAP (c) top-100 (e) top-100 (c)
Teacher (ResNet-18)
Aux. (CNN1-A)
Student (CNN1)
Distill
Hint.
MKT
PKT
Hint-H
MKT-H
PKT-H
Proposed
Table 3: Metric Learning Evaluation: STL Distribution Shift

Next, the proposed method was evaluated under a distribution shift setup using the STL-10 dataset (Table 3). For these experiments, the teacher model was trained using the CIFAR-10 dataset, but the KD was conducted using the unlabeled split of the STL dataset. Again, similar results as with the CIFAR-10 dataset are observed, with the proposed method outperforming the rest of the evaluated methods over all the evaluated metrics. Again, it is worth noting that directly transferring the knowledge between all the layers of the network often harms the retrieval precision for the competitive approaches. This behavior is also confirmed using the more challenging CUB-200 dataset (Table 4), where the proposed method again outperforms the rest of the evaluated approaches both for the retrieval evaluation, as well as for the classification evaluation. For the latter, a quite large improvement is observed, since the accuracy increases by over 1.5% over the next best performing method.

Furthermore, we also conducted a HoG [1467360] cloning experiment, in which the knowledge was transferred from a handcrafted feature extractor, to demonstrate the flexibility of the proposed approach. The same strategy as in the previous experiments was used, i.e., the knowledge was first transferred to an auxiliary model and then further distilled to the student model. It is worth noting that this setup has several emerging applications, as discussed in a variety of recent works [passalis2018learning, chen2018distilling], since it allows for pre-training deep neural networks for domains for which it is difficult to acquire large annotated datasets, as well as providing a straightforward way to exploit the highly optimized deep learning libraries for embedded devices to provide neural network-based implementations of hand-crafted features. The evaluation results for this setup are reported in Table 5, confirming again that the proposed method outperforms the rest of the evaluated methods.

Method mAP (e) mAP (c) top-10 (e) top-10 (c) Acc.
Teacher
Aux.
Student
Distill
Hint.
MDS
PKT
Hint-H
MDS-H
PKT-H
Proposed
Table 4: Metric Learning and Classification  Evaluation: CUB-200
Method mAP (c) top-1 (e) top-10 (c)
HoG
Aux.
Hint
MDS
PKT
Proposed
Table 5: HoG Cloning Network: SUN Dataset

Finally, several ablation studies were conducted. First, in Fig. 1 we evaluated the effect of using the proposed weighting scheme that takes into account the existence of critical learning periods. The proposed scheme indeed leads to faster convergence over both single-layer KD using the PKT method, as well as over the multi-layer PKT-H method. To validate that the improved results arise from the higher weight given to the intermediate layers over the critical learning period, we used the same decaying scheme for the PKT-H method, but with the initial weight $\alpha_{init}$ set to 1 instead of 100. Next, we also demonstrated the impact of matching the correct layers in Fig. 2. Several interesting conclusions can be drawn from the results reported in Fig. 2. For example, note that over-regularization occurs when transferring the knowledge from a teacher layer that has lower MI with the targets (lower NCC accuracy). On the other hand, using a layer with slightly lower discriminative power (Layer 1 of ResNet-18) can have a slightly positive regularization effect. At the same time, using overly discriminative layers (Layers 3 and 4 of ResNet-18) can lead to an early collapse of the representation, harming the precision of the student. The accuracy of the student increases only when the correct layers of the auxiliary teacher are matched to the student (Layers 2 and 3 of CNN-1-A).

Furthermore, we also evaluated the effect of using auxiliary models of different sizes on the precision of the student model trained with the proposed method. The evaluation results are provided in Table 6. Two different student models are used: CNN-1 (15k parameters) and CNN-1-L (6k parameters).  As expected, the auxiliary models that are closer to the complexity of the student lead to improved performance compared both to the more complex and the less complex teachers. That is, when the CNN-1 model is used as student, the CNN-1-A teacher achieves the best results, while when the CNN-1-L is used as student, the weaker CNN-1 teacher achieves the highest precision. Note that as the complexity of the student increases, the efficiency of the KD process declines.

Method mAP (e) mAP (c) top-100 (e) top-100 (c)
CNN1-L CNN1
CNN1-A CNN1
CNN1-H CNN1
CNN1 CNN1-L
CNN-1-A CNN-1-L
CNN-1-H CNN-1-L
Table 6: Effect of using auxiliary networks of different sizes (CNN order in terms of parameters: CNN-1-H > CNN-1-A > CNN-1 > CNN-1-L)

5 Conclusions

In this paper we presented a novel KD method that works by modeling the information flow through the various layers of the teacher model. The proposed method was able to overcome several limitations of existing KD approaches, especially when used for training very lightweight deep learning models with architectures that differ significantly from the teacher, by a) designing and training an appropriate auxiliary teacher model, and b) employing a critical-learning-aware KD scheme that ensures that critical connections will be formed to effectively mimic the information flow paths of the auxiliary teacher.

Acknowledgment

This work was supported by the European Union's Horizon 2020 Research and Innovation Program (OpenDR) under Grant 871449. This publication reflects the authors' views only. The European Commission is not responsible for any use that may be made of the information it contains.

Appendix A Appendix

A.1 Datasets and Evaluation Setups

The proposed method was evaluated using four different datasets: the CIFAR-10 [krizhevsky2009learning] dataset, the STL-10 [coates2011analysis] dataset, the CUB-200 [welinder2010caltech] dataset and the SUN Attribute [patterson2012sun] dataset. For CIFAR-10, the training split was used for training and transferring the knowledge to the student models, while for the retrieval evaluation the training split was also used to compile the database. Then, the test set was used to query the database and measure the retrieval performance of the various representations. For the STL-10 dataset we followed the same setup as for CIFAR-10, but we also used the provided unlabeled training split for transferring the knowledge to the student models. For CUB-200 we also followed the same setup; however, the experiments were conducted using the first 30 classes of the dataset, due to the significantly restricted learning capacity of the employed student models (recall that among the objectives of the paper is to evaluate the performance of KD approaches for ultra-lightweight network architectures and heterogeneous KD setups). Finally, images from the eight most common categories (for which at least 40 images exist) were used for training and evaluating the methods when the SUN Attribute dataset was employed, since a very small number of images exist for the rest of the categories. 80% of the extracted images were used for training the networks and building the database, while the remaining 20% were used to query the database. The evaluation process was repeated 5 times and the mean and standard deviation of the evaluated metrics are reported. For the SUN Attribute dataset, the knowledge was distilled from HoG features.

For the CIFAR-10 and STL datasets we used the supplied images without performing any resizing (the original image resolution was used). However, the training dataset was augmented by randomly performing horizontal flipping and randomly cropping the images using a padding of 4 pixels. A similar augmentation protocol was used for the CUB-200 dataset. However, the images of the CUB-200 dataset were first resized and then randomly cropped (a center crop of the same size was used during the evaluation process). Random rotations were also used when training the models. Finally, the images of the SUN Attribute dataset were resized before feeding them into the network, following the protocol used in [passalis2018learning].
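For reference, a torchvision sketch of the CIFAR-10/STL augmentation described above is given below; the 32x32 crop size is an assumption of this sketch, while the 4-pixel padding and the random horizontal flips follow the stated protocol.

```python
import torchvision.transforms as T

# Assumed 32x32 inputs (CIFAR-10); padding of 4 pixels and random horizontal flips
# follow the protocol described above, while the crop size is an assumption here.
train_transform = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])
test_transform = T.ToTensor()
```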

Figure 4: Network architectures used for the conducted experiments. The green model was used as the student for the conducted experiments (unless otherwise stated), while the red model was used as the auxiliary teacher. For experiments involving classification, an additional fully connected layer with as many neurons as the number of classes was added.

A.2 Network Architectures

The network architectures used for the conducted experiments are shown in Fig. 4. The CNN-1 family was used for the experiments conducted using the CIFAR-10 and STL datasets, the CNN-2 family was used for the experiments conducted using the CUB-200 dataset, while the CNN-3 family was used for the SUN Attribute dataset. The suffix "-A" is used to denote the model that was used as the auxiliary teacher. The auxiliary teacher was trained using the PKT method [passalis2018learning], by transferring the knowledge from the penultimate layer of a ResNet-18 teacher (for the CIFAR-10, STL and CUB-200 datasets) or from handcrafted features (for the SUN Attribute dataset). The ReLU activation function was used for all the layers, while batch normalization was applied after each convolutional layer.

A.3 Training Hyper-parameters

For all the conducted experiments we used the Adam optimizer with the default training hyper-parameters. For the experiments conducted using the CIFAR-10 dataset, the optimization ran for 50 training epochs with a learning rate of 0.001 (batches of 128 samples were used) for all the evaluated methods. For the ablation results reported in Fig. 2 of the main manuscript, the optimization ran for 20 epochs. For the STL dataset, the optimization ran for 30 training epochs with a learning rate of 0.001 and a batch size equal to 128. For the CUB-200 dataset, the optimization ran for 100 training epochs, using a learning rate of 0.001 for the first 50 training epochs and 0.0001 for the subsequent 50 training epochs. Also, for the SUN Attribute dataset, the optimization ran for 20 training epochs. Furthermore, the decay factor $\gamma$ was set to 0.6 for this dataset, due to the smaller number of training epochs. Finally, note that for the experiments conducted with contrastive supervision (CIFAR-10), we employed the contrastive loss with the margin set to 1 and combined it with the KD loss after weighting it by 0.1. Also, for the classification experiments reported in Table 2, all the methods were also trained using a supervised classification term (cross-entropy loss). Finally, for all the experiments conducted using the distillation loss, the same distillation temperature was used.
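The stated hyper-parameters can be summarized in a small configuration sketch; the dictionary layout and key names are ours, and values that are not given in the text (such as the default decay factor) are omitted rather than guessed.

```python
import torch

# Summary of the hyper-parameters stated above (structure and key names are ours).
config = {
    "optimizer": "Adam (default hyper-parameters)",
    "cifar10": {"epochs": 50, "lr": 1e-3, "batch_size": 128},
    "stl": {"epochs": 30, "lr": 1e-3, "batch_size": 128},
    "cub200": {"epochs": 100, "lr_schedule": [(50, 1e-3), (50, 1e-4)]},
    "sun": {"epochs": 20, "decay_factor": 0.6},
    "contrastive": {"margin": 1.0, "loss_weight": 0.1},
}

def make_optimizer(model: torch.nn.Module, lr: float) -> torch.optim.Optimizer:
    """Adam with its default settings, as used for all the experiments."""
    return torch.optim.Adam(model.parameters(), lr=lr)
```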

References