Apart from treating soft labels as distilled knowledge, various other kinds of knowledge have been designed [yim2017gift, heo2019comprehensive, tian2019contrastive, wang2019pay]. For example, Romero et al. [romero2014fitnets] proposed training the intermediate layers of students under the guidance of the corresponding layers of teachers, which initiated the subsequent flourishing of studies on feature-based knowledge distillation (KD). Researchers [yim2017gift, lee2018self, tung2019similarity] also modeled the relations among adjacent feature maps as additional knowledge to assist the training of student networks. Unfortunately, most of these feature-based KD methods focus solely on aligning shallow information and overlook the high-level information of both networks, i.e., the students mechanically mimic the teachers’ actions while neglecting their interior qualities. In other words, previous studies consider networks as black boxes and heuristically select features without regard to any functional properties [tian2019contrastive, wang2019pay, zagoruyko2016paying], which impedes a universal representation of the knowledge to be distilled. To address this problem, we argue that leveraging networks’ functional properties to derive high-level knowledge can strengthen the performance of KD.
In this paper, we incorporate Lipschitz continuity into KD, considering neural networks as functions rather than black boxes. By the definition in Eq. 4, the Lipschitz constant (the maximum norm of a function’s gradient over its domain, which reflects the Lipschitz continuity of the function) is the upper bound of the relationship between input perturbation and output variation for a given distance, representing the robustness and expressiveness of neural networks [bartlett2017spectrally, miyato2018spectral, lyu2020autoshufflenet]. Specifically, the authors of [miyato2018spectral, yoshida2017spectral] demonstrated the effectiveness of the Lipschitz constant by constraining the weights of the discriminator in a generative adversarial network (GAN). Besides, many studies in representation learning [bengio2013representation, tian2019multimodal] demonstrate that deep neural networks are competent in learning high-level information with increasing abstraction. Inspired by this, we devise a scheme to capture the Lipschitz continuity (i.e., calculate the Lipschitz constant for every intermediate block) of the teacher networks and adopt the captured continuity as knowledge to guide the training of student networks. It is worth noting that computing the Lipschitz constant is an NP-hard problem [virmaux2018lipschitz]. We address this problem by proposing an approximation algorithm with a tight upper bound. In particular, we design a Transmitting Matrix (TM) for each block and calculate the spectral norm of each TM through an adopted iteration method to avoid the high complexity of learning large intermediate matrices. We then aggregate all Lipschitz constants calculated from the TMs as the knowledge of Lipschitz continuity that is transferred to student networks. Importantly, the Lipschitz continuity loss function is backpropagation-friendly for training deep networks because of its differentiability.
Overall, the contributions of this paper are four-fold:
To the best of our knowledge, we are the first to utilize a high-level functional property, Lipschitz continuity, in knowledge distillation to supervise the training of student networks. In addition, we theoretically explain the effectiveness of our method from the perspective of network regularization and then empirically consolidate this explanation.
We propose a novel knowledge distillation framework, Lipschitz cONtinuity Guided Knowledge DistillatiON (LONDON) for distilling knowledge from the Lipschitz constant.
To avoid the NP-hard Lipschitz constant calculation, we devise a Transmitting Matrix to numerically approximate the Lipschitz constant of networks in the KD process.
We perform experiments on different knowledge distillation tasks such as classification, object detection, and segmentation. Our proposed method achieves the state-of-the-art results in these tasks on CIFAR-100, ImageNet, and VOC datasets.
2 Related Work
Lipschitz Continuity and Spectral Norm of Neural Network.
The study of adversarial machine learning [kurakin2016adversarial, papernot2016transferability] shows that neural networks are highly vulnerable to attacks based on small modifications of the input to the model at test time, and that estimating the regularity of such architectures is essential for practical applications and generalization improvement. Previous efforts [virmaux2018lipschitz, miyato2018spectral, neyshabur2017exploring] have studied one of the critical characteristics for assessing the regularity of deep networks: the Lipschitz continuity of deep learning architectures.
Lipschitz constants, which upper bound the relationship between input perturbation and output variation for a given distance, are introduced to secure the robustness of neural networks to small perturbations. The Lipschitz constant can be seen as a norm that measures a function’s degree of Lipschitz continuity. Apart from theoretical studies [bartlett2017spectrally, luxburg2004distance, neyshabur2017exploring] explaining that novel generalization bounds critically rely on the Lipschitz constant of the neural network, the Lipschitz continuity of neural networks is widely studied for achieving state-of-the-art performance in many deep learning topics: (i) in image synthesis [miyato2018spectral, yoshida2017spectral], researchers applied spectral normalization to each layer to constrain the Lipschitz constant of the discriminator when training a GAN on ImageNet, which acts as a regularization term that smooths the discriminator function; and (ii) in adversarial machine learning [weng2018evaluating], the authors propose constraining the local Lipschitz constants of neural networks to defend against adversarial attacks.
The aforementioned efforts underline the significance of the Lipschitz constant for neural networks’ expressiveness and robustness. In particular, deliberately constraining Lipschitz continuity (i.e., the Lipschitz constant) within an appropriate range has proven to be a powerful technique for smoothing networks, which can enhance the model’s robustness. On account of this, the Lipschitz constant, as functional information of neural networks, should be introduced into knowledge distillation models to regularize the training of student networks.
Knowledge Distillation. Apart from the seminal design of soft labels [hinton2015distilling], the alignment of intermediate feature maps is also transferred as knowledge to student networks [romero2014fitnets]. Researchers continued digging into feature-based outputs and proposed various designs of feature-map transformation and combination to define feature-based knowledge, which largely improves the performance of KD. For example, Heo et al. [heo2019knowledge, heo2019comprehensive] designated the activation boundaries of hidden neurons at different positions of the networks as knowledge for distillation. In [yim2017gift], Gram matrices of neural networks’ adjacent feature maps, representing the relation between intermediate layers, are also adopted as a form of knowledge. The authors of [lee2018self, chen2020learning, tung2019similarity] constructed a similarity measurement for feature representations using singular value decomposition (SVD) to elicit relations between different layers as transferred knowledge.
Inspired by those ideas, many methods are proposed for precisely capturing feature-wise knowledge by artlessly piling up complicated mechanisms on knowledge distillation model. For instance, Wang et al. [wang2019pay] introduced an attention mechanism to assign weights to different CNNs’ channels for dynamically determining the critical features to distill. Furthermore, Tian et al. [tian2019contrastive] introduced contrastive learning to capture correlations and higher-order output dependencies for supervising student network training. This dynamically aligned knowledge almost fully explores potential of distilling networks’ output information for supervision.
However, all those feature-based knowledge distillation methods treat neural networks as black-boxes, which are deficient in exploring the functional properties of neural networks via capturing the high-level information. This limitation hinders the applicability and impedes performance improvement. To alleviate the limitation, we introduce Lipschitz continuity to knowledge distillation.
In this section, we introduce our proposed knowledge distillation framework. We only elaborate the key derivations in this section due to the limited space. Detailed discussions and technical theorems can be found in the supplemental materials. Here, we focus on capturing the functional property of neural networks as knowledge and transferring it in our distillation method in a numerically accessible way.
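We first define a fully-connected neural network with $L$ layers of widths $d_1, \cdots, d_L$ as a function $f: \mathbb{R}^{d_0} \to \mathbb{R}^{d_L}$:
$$f(x) = \left(T_L \circ \rho_{L-1} \circ \cdots \circ \rho_1 \circ T_1\right)(x), \tag{1}$$
where each $T_k: \mathbb{R}^{d_{k-1}} \to \mathbb{R}^{d_k}$ is an affine function ($d_{k-1}$ and $d_k$ are the sizes of its input and output feature maps) and $\rho_k$ performs element-wise activation on the feature maps. For the $k$-th layer of the network, $T_k(x) = W_k x + b_k$, where $W_k \in \mathbb{R}^{d_k \times d_{k-1}}$ and $b_k \in \mathbb{R}^{d_k}$ stand for the weight matrix and bias vector, respectively. For simplicity, we discard the bias terms of the network, so that the network can be simplified as:
$$f(x) = W_L\, \rho_{L-1}\left(W_{L-1}\, \rho_{L-2}\left(\cdots \rho_1\left(W_1 x\right)\right)\right). \tag{2}$$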
Notably, it is sufficient to consider networks with the most straightforward fully-connected layers, since layers with more complex structures, such as convolutional layers, can also be written in the form of matrix multiplication. Consider a convolutional layer with $c_{in}$ input channels and $c_{out}$ output channels and a kernel of size $k_h \times k_w$, resulting in $c_{in} c_{out} k_h k_w$ parameters. We can re-arrange the parameters into a matrix of size $c_{out} \times (c_{in} k_h k_w)$, such that this convolutional layer can be processed in the same way as the fully-connected layers. Hence, our analysis loses no generality under this configuration of the function $f$.
Following Eq. 2, we define the function form of the teacher network as $f_T$ and that of the student network as $f_S$, such that the feature-based KD paradigm can be interpreted as:
$$\min_{\theta_S} \; d\big(\phi(f_T(x)),\, \phi(f_S(x; \theta_S))\big), \tag{3}$$
where, given the same data $x$, the ultimate goal of the KD paradigm is to minimize the distance between teacher and student by optimizing the latter’s parameters $\theta_S$. In particular, $d(\cdot, \cdot)$ is a distance function, and $\phi(\cdot)$ is a transformation that turns feature maps into more measurable and learnable knowledge. By utilizing such designed knowledge, the student network is forced to mimic the teacher network and hopefully obtains comparable performance with a lighter architecture.
Here, we introduce Lipschitz Continuity into KD paradigm as universal information of neural networks based on the functional property of networks. To make Lipschitz constant calculation numerically feasible, we further propose an approximation for the Lipschitz constant and use power iteration method to calculate this approximation.
3.2 The functional information of neural networks: Lipschitz Continuity
Definition 1. A function $f: \mathbb{R}^n \to \mathbb{R}^m$ is called Lipschitz continuous if there exists a constant $K$ such that:
$$\forall x, y \in \mathbb{R}^n: \quad \|f(x) - f(y)\|_2 \le K\, \|x - y\|_2. \tag{4}$$
The smallest $K$ for which the inequality holds is called the Lipschitz constant of the function $f$, denoted as $\|f\|_{Lip}$. By Definition 1, $\|f\|_{Lip}$ upper bounds the relationship between input perturbation and output variation for a given distance (generally the L2 norm), and is thus considered a metric for evaluating the robustness of neural networks to small perturbations [luxburg2004distance, virmaux2018lipschitz, bartlett2017spectrally]. However, computing the exact Lipschitz constant of neural networks in the knowledge distillation process is an NP-hard problem [virmaux2018lipschitz]. To solve this problem, we propose a feasible and effective method to approximate the Lipschitz constants in KD.
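As a numerical illustration of Definition 1 (not part of the original method), the following sketch uses a 2x2 linear map, for which the Lipschitz constant under the L2 norm is known to be the largest singular value of the matrix. The matrix `W`, the sampling range, and all helper names are assumptions chosen for illustration.

```python
import math
import random

def spectral_norm_2x2(W):
    """Largest singular value of a 2x2 matrix: sqrt of the top eigenvalue of W^T W."""
    (a, b), (c, d) = W
    # Entries of the symmetric matrix W^T W = [[p, q], [q, r]].
    p, q, r = a * a + c * c, a * b + c * d, b * b + d * d
    top_eig = 0.5 * (p + r) + math.sqrt(0.25 * (p - r) ** 2 + q * q)
    return math.sqrt(top_eig)

def linear_map(W, x):
    """Apply the 2x2 matrix W to the 2-vector x."""
    return [W[0][0] * x[0] + W[0][1] * x[1], W[1][0] * x[0] + W[1][1] * x[1]]

def dist(x, y):
    """Euclidean (L2) distance between two 2-vectors."""
    return math.hypot(x[0] - y[0], x[1] - y[1])

def max_observed_ratio(W, trials=2000, seed=0):
    """Largest ||Wx - Wy|| / ||x - y|| seen over random input pairs."""
    rng = random.Random(seed)
    best = 0.0
    for _ in range(trials):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        y = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        if dist(x, y) > 1e-3:  # skip near-identical pairs to limit rounding error
            best = max(best, dist(linear_map(W, x), linear_map(W, y)) / dist(x, y))
    return best

W = [[1.0, 2.0], [3.0, 4.0]]
sn = spectral_norm_2x2(W)      # smallest constant K that satisfies the inequality
ratio = max_observed_ratio(W)  # empirical lower estimate of that constant
print(f"spectral norm = {sn:.6f}, largest sampled ratio = {ratio:.6f}")
```

Every sampled ratio stays below `sn`, and the largest sampled ratio approaches it as more pairs are drawn, which is the sense in which the Lipschitz constant is the smallest valid bound.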
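We first define the affine function for the $k$-th layer, $F^{k} = T_k(F^{k-1}) = W_k F^{k-1}$, in which $F^{k-1}$ and $F^{k}$ are the feature maps out of the $(k-1)$-th and the $k$-th layer, respectively.
By Lemma 1 in the Supplementary Appendix, we have
$$\|T_k\|_{Lip} = \sup_{x} \|\nabla T_k(x)\|_{SN}, \tag{5}$$
where $\|\cdot\|_{SN}$ is the spectral norm of a matrix. The matrix spectral norm is formally defined by
$$\|W\|_{SN} = \max_{x \neq 0} \frac{\|W x\|_2}{\|x\|_2}, \tag{6}$$
where the spectral norm of the matrix $W$ is equivalent to its largest singular value. Thus, for the linear layer, based on Lemma 2 in the Supplementary Appendix, its Lipschitz constant is given by
$$\|T_k\|_{Lip} = \sup_{x} \|\nabla T_k(x)\|_{SN} = \|W_k\|_{SN}. \tag{7}$$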
Additionally, most activation functions, such as ReLU, Leaky ReLU, Tanh, and Sigmoid, as well as max-pooling, have a Lipschitz constant of at most 1 (exactly 1 for ReLU, Leaky ReLU, and Tanh, and 1/4 for Sigmoid). Other common neural network layers, such as dropout, batch normalization, and other pooling methods, all have simple and explicit Lipschitz constants [goodfellow2016deep]. This fixed Lipschitz constant property renders our derivation applicable to most network architectures, such as ResNet [he2016deep] and MobileNet [howard2017mobilenets].
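The Lipschitz constants of common scalar activations can be estimated numerically. The sketch below is an illustration only (the grid range, the grid resolution, and the Leaky ReLU slope are arbitrary choices): it takes the largest finite-difference slope on a uniform grid. For ReLU, Leaky ReLU, and Tanh the estimate is close to 1, while for Sigmoid it is close to 1/4, so all of them are bounded by 1.

```python
import math

def relu(x):
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    return x if x > 0 else slope * x

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def estimate_lipschitz(fn, lo=-5.0, hi=5.0, steps=10001):
    """Largest finite-difference slope of a scalar function on a uniform grid.

    By the mean value theorem this never exceeds the true Lipschitz constant
    (up to rounding), so it is a lower estimate that tightens with finer grids.
    """
    h = (hi - lo) / (steps - 1)
    xs = [lo + i * h for i in range(steps)]
    return max(abs(fn(b) - fn(a)) / (b - a) for a, b in zip(xs, xs[1:]))

estimates = {
    "relu": estimate_lipschitz(relu),
    "leaky_relu": estimate_lipschitz(leaky_relu),
    "tanh": estimate_lipschitz(math.tanh),
    "sigmoid": estimate_lipschitz(sigmoid),
}
for name, value in estimates.items():
    print(f"{name}: {value:.4f}")
```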
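Thereafter, we use the inequality $\|g_1 \circ g_2\|_{Lip} \le \|g_1\|_{Lip} \cdot \|g_2\|_{Lip}$ (concluded by Eq. 7 in [bartlett2017spectrally]) to derive the following bound for $\|f\|_{Lip}$:
$$\|f\|_{Lip} \le \|W_L\|_{SN} \cdot \|\rho_{L-1}\|_{Lip} \cdot \|W_{L-1}\|_{SN} \cdots \|\rho_1\|_{Lip} \cdot \|W_1\|_{SN} \le \prod_{k=1}^{L} \|W_k\|_{SN}. \tag{8}$$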
In this way, we transfer the teacher’s Lipschitz constant to the student through a sequence of spectral norm of intermediate layers in the network. Moreover, the upper bound of Lipschitz constant also ensures the quality of knowledge to be transferred.
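The product-of-spectral-norms bound can be checked on a toy network. The following sketch is a hedged illustration, not the authors' implementation: it builds a two-layer ReLU network with hand-picked weights (`W1`, `W2`, and all helper names are assumptions), estimates each spectral norm by power iteration, and compares the product with the largest output-to-input distance ratio found by random sampling.

```python
import math
import random

def matvec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def norm(x):
    return math.sqrt(sum(v * v for v in x))

def spectral_norm(A, iters=500):
    """Largest singular value of a nonzero matrix A via power iteration on A^T A."""
    At = [list(col) for col in zip(*A)]
    v = [1.0] * len(A[0])
    for _ in range(iters):
        v = matvec(At, matvec(A, v))
        n = norm(v)
        v = [x / n for x in v]
    return norm(matvec(A, v))

# Hypothetical weights of a 2 -> 3 -> 2 network without bias terms.
W1 = [[1.0, -2.0], [0.5, 1.5], [2.0, 0.0]]
W2 = [[1.0, 0.0, -1.0], [0.5, 2.0, 1.0]]

def network(x):
    """f(x) = W2 * relu(W1 * x)."""
    hidden = [max(0.0, h) for h in matvec(W1, x)]
    return matvec(W2, hidden)

def dist(x, y):
    return norm([a - b for a, b in zip(x, y)])

bound = spectral_norm(W1) * spectral_norm(W2)

rng = random.Random(0)
observed = 0.0
for _ in range(2000):
    x = [rng.uniform(-1, 1) for _ in range(2)]
    y = [rng.uniform(-1, 1) for _ in range(2)]
    if dist(x, y) > 1e-3:  # skip near-identical pairs to limit rounding error
        observed = max(observed, dist(network(x), network(y)) / dist(x, y))

print(f"upper bound = {bound:.4f}, largest sampled ratio = {observed:.4f}")
```

The sampled ratio never exceeds the bound; the gap between the two reflects how loose the product bound is for this particular network.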
3.3 Transmitting Matrix
Given the derived tight upper bound of the Lipschitz constant, we design a novel loss to distill Lipschitz continuity from teacher to student by narrowing the distance between the corresponding spectral norms of the two networks. The first problem is how to calculate each spectral norm. Calculating the spectral norm of a weight matrix in a neural network by SVD is computationally prohibitive. In particular, for complex network structures such as convolutional layers or residual modules, computing the spectral norm is impractical even though these layers can be re-arranged into matrices. Therefore, we propose using a Transmitting Matrix (TM) to bypass the complicated calculation of the spectral norm. This approximate calculation makes it feasible to compute the Lipschitz constant for distillation and to use it in a loss function.
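For training data of batch size $N$, after a forward pass through the $(k-1)$-th layer, we have a batch of corresponding feature maps as
$$F^{k-1} = \left(F^{k-1}_1, \cdots, F^{k-1}_N\right), \tag{9}$$
where $F^{k-1}_i \in \mathbb{R}^{d_{k-1}}$ for each $i \in \{1, \cdots, N\}$.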
Studies [chen2017exemplar, tung2019similarity] about similarity of feature maps illustrate that for well-trained networks, their batch of feature maps in the same layer have strong mutual linear independence. We formalize the relevance of feature maps in the same layer as
We further normalize the feature maps by
such that a batch of feature maps can be expressed in a vector representation that
where $I$ is the identity matrix.
With all the aforementioned equations, we are ready to define the transmitting matrix for calculating the spectral norm of matrix as calculating the spectral norm of
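Theorem 1. If a matrix $U$ is an orthogonal matrix such that $U^\top U = I$, where $I$ is the identity matrix, then the largest eigenvalues of $U^\top H U$ and $H$ are equivalent:
$$\sigma_1(U^\top H U) = \sigma_1(H), \tag{14}$$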
where is the largest eigenvalue of a matrix. Based on Theorem 1 and Eq. 13, our defined transmitting matrix has the same largest eigenvalue with , i.e. . Thus, combining the definition of spectral norm , we can achieve the spectral norm of matrix by calculating the largest eigenvalue of , , which is solvable.
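The invariance behind Theorem 1, namely that an orthogonal change of basis leaves the largest eigenvalue of a symmetric matrix unchanged, can be checked numerically on a small example. The sketch below is an illustration only: the rotation angle, the matrix `W`, and all helper names are assumptions, and a 2x2 rotation stands in for a general orthogonal matrix.

```python
import math

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def top_eigenvalue_sym2(H):
    """Largest eigenvalue of a symmetric 2x2 matrix [[p, q], [q, r]] in closed form."""
    p, q, r = H[0][0], H[0][1], H[1][1]
    return 0.5 * (p + r) + math.sqrt(0.25 * (p - r) ** 2 + q * q)

theta = 0.7  # arbitrary rotation angle
U = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta), math.cos(theta)]]  # orthogonal: U^T U = I

W = [[1.0, 2.0], [3.0, 4.0]]
H = matmul(transpose(W), W)                  # symmetric matrix W^T W
H_rot = matmul(transpose(U), matmul(H, U))   # U^T H U

lam_H = top_eigenvalue_sym2(H)
lam_rot = top_eigenvalue_sym2(H_rot)
print(f"largest eigenvalue of H: {lam_H:.6f}, of U^T H U: {lam_rot:.6f}")
print(f"spectral norm of W: {math.sqrt(lam_H):.6f}")
```

Since the spectral norm of `W` is the square root of the largest eigenvalue of `W^T W`, the same value can be read off either matrix.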
For networks with more complicated layers such as residual blocks, by considering the block as an affine mapping from front feature maps to back feature maps, this approximation is applicable to calculate the spectral norm block-by-block instead of layer-by-layer, which makes our spectral norm calculation more efficient. To this end, we define the Transmitting Matrix for residual blocks as
where the and are the front feature maps and latter feature maps of a residual block.
3.4 Approximating the Spectral Norm with Power Iteration Method
Following the aforementioned steps, we next need to calculate the spectral norms of two matrices (teacher and student) and then calculate the loss between the two. The intuitive approach is to use SVD to compute the spectral norm, which results in excessive computation. Importantly, the SVD calculation is non-differentiable, making it impossible to train the deep networks. Instead of using SVD, we utilize the power iteration method [golub2000eigenvalue, yoshida2017spectral, miyato2018spectral] to approximate the spectral norm of the targeted matrix with a small trade-off in accuracy, as presented in Algorithm 1.
In this way, we have a feasible approach to calculate the spectral norms of the TMs, which can faithfully approximate the Lipschitz constant of networks.
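The power iteration idea can be sketched in a few lines of plain Python. This is a generic illustration of the technique rather than the exact procedure of Algorithm 1; the function names, the iteration count, and the random initialization are assumptions.

```python
import math
import random

def matvec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def norm(x):
    return math.sqrt(sum(v * v for v in x))

def normalize(x):
    n = norm(x)
    return [v / n for v in x]

def power_iteration_spectral_norm(A, iters=100, seed=0):
    """Approximate the largest singular value of a nonzero matrix A.

    Alternates u <- A v / ||A v|| and v <- A^T u / ||A^T u||, so that v
    converges to the first right singular vector; ||A v|| is then the
    spectral norm. No SVD is required, only matrix-vector products.
    """
    rng = random.Random(seed)
    At = transpose(A)
    v = normalize([rng.gauss(0.0, 1.0) for _ in range(len(A[0]))])
    for _ in range(iters):
        u = normalize(matvec(A, v))
        v = normalize(matvec(At, u))
    return norm(matvec(A, v))

sn_diag = power_iteration_spectral_norm([[3.0, 0.0], [0.0, 1.0]])
sn_full = power_iteration_spectral_norm([[1.0, 2.0], [3.0, 4.0]])
print(f"diag(3, 1): {sn_diag:.6f}, [[1, 2], [3, 4]]: {sn_full:.6f}")
```

In practice a handful of iterations per training step is often enough when the vectors are carried over between steps, which is the design choice made in the spectral normalization literature cited above.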
3.5 Overall Loss Function
By using Algorithm 1, we obtain the spectral norms for teacher and student networks, respectively: and for each . We define our novel lipschitz continuity loss function as
where is a coefficient greater than . Hence, the decreases with increasing and consequently the increases. In this way, we give more weight on higher layer features since they are closer to the features performing tasks.
Combined with the cross entropy loss and vanilla knowledge distillation loss , we are ready to propose our novel loss function as
where is used to control the degree of distilling the Lipschitz constant. We use because when taking derivative of , the denominator part can be easily eliminated.
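One plausible reading of the layer-weighted loss described above can be sketched as follows. The specifics are assumptions made for illustration and are not confirmed by the text: a squared difference between teacher and student spectral norms, per-layer weights of the form `beta ** (k - L)` with `beta > 1` (so that deeper layers receive larger weights), and a factor of one half on the Lipschitz term so that it cancels when differentiating the square. The names `lipschitz_loss`, `total_loss`, `beta`, and `lam` are hypothetical.

```python
def lipschitz_loss(teacher_norms, student_norms, beta=2.0):
    """Weighted squared distance between per-layer spectral norms.

    Assumed form: sum over k = 1..L of beta**(k - L) * (t_k - s_k)**2.
    With beta > 1 the weight grows with k and equals 1 for the last layer.
    """
    num_layers = len(teacher_norms)
    pairs = zip(teacher_norms, student_norms)
    return sum(beta ** (k - num_layers) * (t - s) ** 2
               for k, (t, s) in enumerate(pairs, start=1))

def total_loss(ce_loss, kd_loss, lip_loss, lam=1.0):
    """Assumed combination: cross entropy + vanilla KD + (lam / 2) * Lipschitz term."""
    return ce_loss + kd_loss + 0.5 * lam * lip_loss

# Toy spectral norms for a two-layer teacher and student.
lip = lipschitz_loss([2.0, 3.0], [1.0, 1.0], beta=2.0)
loss = total_loss(ce_loss=1.0, kd_loss=0.5, lip_loss=lip, lam=2.0)
print(lip, loss)
```

Setting `lam` to zero recovers the vanilla distillation objective, which corresponds to the baseline used in the ablation study.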
3.6 Explaining the Effectiveness from a Regularization Perspective
The derivative of the loss function with respect to :
where , and are respectively the first left and right singular vectors of . For , using SVD, we have
where $r$ is the rank of the matrix, $\sigma_j$ is its $j$-th largest singular value, and $u_j$ and $v_j$ are the corresponding left and right singular vectors, respectively.
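The role of the first singular vectors in the derivative can be verified numerically: for a matrix whose largest singular value is simple, the gradient of the spectral norm with respect to the matrix is the outer product of its first left and right singular vectors. The sketch below compares this closed form with central finite differences on a 2x2 example; the matrix, the step size, and all helper names are assumptions for illustration.

```python
import math

def spectral_norm_2x2(W):
    """Largest singular value of a 2x2 matrix in closed form."""
    (a, b), (c, d) = W
    p, q, r = a * a + c * c, a * b + c * d, b * b + d * d
    return math.sqrt(0.5 * (p + r) + math.sqrt(0.25 * (p - r) ** 2 + q * q))

def top_singular_vectors(W, iters=200):
    """First left and right singular vectors (u1, v1) of a 2x2 matrix by power iteration."""
    v = [1.0, 0.0]
    for _ in range(iters):
        u = [W[0][0] * v[0] + W[0][1] * v[1], W[1][0] * v[0] + W[1][1] * v[1]]  # u = W v
        v = [W[0][0] * u[0] + W[1][0] * u[1], W[0][1] * u[0] + W[1][1] * u[1]]  # v = W^T u
        n = math.hypot(v[0], v[1])
        v = [v[0] / n, v[1] / n]
    u = [W[0][0] * v[0] + W[0][1] * v[1], W[1][0] * v[0] + W[1][1] * v[1]]
    s = math.hypot(u[0], u[1])
    return [u[0] / s, u[1] / s], v

def numeric_gradient(W, eps=1e-6):
    """Central finite-difference gradient of the spectral norm w.r.t. each entry of W."""
    grad = [[0.0, 0.0], [0.0, 0.0]]
    for i in range(2):
        for j in range(2):
            plus = [row[:] for row in W]
            minus = [row[:] for row in W]
            plus[i][j] += eps
            minus[i][j] -= eps
            grad[i][j] = (spectral_norm_2x2(plus) - spectral_norm_2x2(minus)) / (2 * eps)
    return grad

W = [[1.0, 2.0], [3.0, 4.0]]
u1, v1 = top_singular_vectors(W)
analytic = [[u1[i] * v1[j] for j in range(2)] for i in range(2)]  # u1 v1^T
numeric = numeric_gradient(W)
max_gap = max(abs(analytic[i][j] - numeric[i][j]) for i in range(2) for j in range(2))
print(f"largest entry-wise gap between analytic and numeric gradient: {max_gap:.2e}")
```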
In Eq. 18, the first term is the same as the derivative of the loss function of vanilla knowledge distillation. As for the second term, based on Eq. 19, it can be seen as the regularization term penalizing the vanilla knowledge distillation loss with an adaptive regularization coefficient
which constrains the weights of the student networks by utilizing the teacher networks as prior supervision information. In other words, our method prevents the student networks from being trapped in local minima. In this way, it ensures better training of student networks. We demonstrate this with a corresponding experiment in Section 4.4, showing that our proposed method prevents student networks from over-fitting the dataset.
In this section, we conducted experiments on three computer vision tasks, image classification, object detection, and segmentation, to validate the effectiveness of our proposed distillation method. In addition to comparing our method with the state-of-the-art methods, we also designed a series of ablation studies to verify the effectiveness and highlight the regularization property of our proposed technique. All experiments are implemented using PyTorch[paszke2019pytorch].
|Setup||Compression type||Teacher network||Student network||# of params (teacher)||# of params (student)||Compression ratio|
|(a)||Depth||WideResNet 28-4||WideResNet 16-4||5.87M||2.77M||47.2%|
|(b)||Channel||WideResNet 28-4||WideResNet 28-2||5.87M||1.47M||25.0%|
|(c)||Depth & channel||WideResNet 28-4||WideResNet 16-2||5.87M||0.70M||11.9%|
|(d)||Different architecture||WideResNet 28-4||ResNet 56||5.87M||0.86M||14.7%|
|(e)||Different architecture||PyramidNet-200 (240)||WideResNet 28-4||26.84M||5.87M||21.9%|
|(f)||Different architecture||PyramidNet-200 (240)||PyramidNet-110 (84)||26.84M||3.91M||14.6%|
|(g)||Different architecture||PyramidNet-200 (240)||ResNet 56||26.84M||0.86M||5.8%|
We chose CIFAR-100 [krizhevsky2009learning] for classification. This is because it is commonly used for comparing KD methods and its relatively small size provides flexibility of implementing different combinations of teacher and student architectures. Besides CIFAR-100, we conducted experiments on ImageNet [deng2009imagenet], a larger dataset, to verify the stability of our distillation method.
CIFAR-100 [krizhevsky2009learning] is a widely used image classification dataset, which consists of 50K training images and 10K testing images of size $32 \times 32$ divided into 100 classes. Specifically, we designed various combinations of architectures for teacher and student networks. Table 1 summarizes the settings of each experiment, including model size and compression ratio, and involves architectures such as Residual Network (ResNet) [he2016deep], Wide Residual Network (WideResNet) [zagoruyko2016wide], and Deep Pyramidal Residual Networks (PyramidNet) [han2017deep]. Experimental results of the different settings are shown in Table 2, where our method achieves state-of-the-art performance in all seven settings, for both depth and channel compression (a, b, c) and different architectures (d, e, f, g). In particular, in the depth compression and channel compression settings (a) and (b), the student networks trained by LONDON even outperform the teacher networks, which further demonstrates the efficacy of our Lipschitz continuity method as a regularization function.
Overall, our proposed method consistently shows comparable or better performance regardless of different compression rates or other network architecture types, which endows our approach with more implementation flexibility. We noted exciting improvements in student networks along with a high compression ratio. Therefore, our results present the potential of using Lipschitz continuity distillation to compress large networks into more resource-efficient ones with acceptable accuracy drop. For example, when the setting (g) is a compression from teacher network to student network with completely different architecture, the student network still benefits from the teacher network via our method. In general, our proposed method can be applied to small networks (fewer parameters) and large networks with satisfactory performance.
ImageNet [deng2009imagenet] is a large-scale dataset with 1.2 million training images and 50k validation images divided into 1,000 classes. Compared to other classification datasets such as CIFAR-100, ImageNet has greater diversity, and its images are larger in scale. For all experiments, we report both the top-1 and top-5 accuracies. Images are cropped to the size of $224 \times 224$ for training and validation. The student networks are trained for 100 epochs, and the learning rate begins at 0.1 and is multiplied by 0.1 every 30 epochs. To ensure a fair comparison, we used the pre-trained models in the PyTorch library as the teacher networks. Two combinations of network architectures are used for demonstration. For the first combination, we chose ResNet152 [he2016deep] as the teacher network and ResNet50 as the student network. For the second, to test the knowledge distillation capacity across different network architectures, we chose ResNet50 as the teacher network and MobileNet [howard2017mobilenets] as the student network. The results are displayed in Table 3. Compared to strong methods such as [heo2019knowledge, heo2019comprehensive], our method still exhibits a clear improvement. In particular, our method makes ResNet50 outperform the teacher network ResNet152. Besides, regarding compression ability, our method yields a considerable improvement for the lightweight architecture MobileNet, where the error rate of 27.64% achieved by our method is lower than that of any network reported in the MobileNet paper [howard2017mobilenets].
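The step-decay learning rate schedule described for the ImageNet experiments (start at 0.1, multiply by 0.1 every 30 epochs, 100 epochs in total) can be written as a small helper. This is a minimal sketch assuming 0-indexed epochs; the function name `step_lr` and its parameter names are hypothetical.

```python
def step_lr(epoch, base_lr=0.1, gamma=0.1, step_size=30):
    """Learning rate at a given 0-indexed epoch under step decay."""
    return base_lr * gamma ** (epoch // step_size)

# Learning rates for the full 100-epoch run.
schedule = [step_lr(epoch) for epoch in range(100)]
for epoch in (0, 29, 30, 60, 90, 99):
    print(f"epoch {epoch:3d}: lr = {schedule[epoch]:.6f}")
```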
|Network||# of params||Method||Top-1||Top-5|
|Network||# of params||Method||mAP|
4.2 Object Detection
We applied our proposed method to a popular high-speed detector, the Single Shot Detector (SSD) [liu2016ssd]. All models are trained on the training sets of VOC2007 and VOC2012 [everingham2015pascal], where the backbone networks are pre-trained on the ImageNet dataset. All models are trained for 120k iterations with a batch size of 32. We set the SSD trained with no distillation as our baseline and the SSD detector with ResNet50 as the teacher network. As for the student networks, we used SSD with ResNet18 or MobileNet [howard2017mobilenets]. We evaluated the detection performance on the VOC2007 testing set. The results are presented in Table 4. Both student networks trained with our method outperform those trained with other methods, which implies that our method can be applied to object detectors. Furthermore, by comparing the performance of ResNet18 with that of MobileNet, we found that distillation between similar structures yields better quality than distillation between different ones.
|Backbone||# of params||Method||mIoU|
4.3 Semantic Segmentation
In this section, we conduct knowledge distillation on the semantic segmentation task. It is worth noting that implementing KD on semantic segmentation is extremely difficult because the penultimate feature maps of a segmentation model have higher dimensions than those of common network architectures. In particular, the widely used DeepLabV3+ [ChenPSA17] is taken as our case study for semantic segmentation. We used DeepLabV3+ with a ResNet101 backbone as the teacher, and DeepLabV3+ based on ResNet18 [he2016deep] and MobileNetV2 [howard2017mobilenets] as the students. The results shown in Table 5 provide clear evidence that our proposed method can greatly improve the performance of both ResNet18 and MobileNetV2.
In general, most KD studies are only experimentally justified over the task of image classification. In our case, experiments on detection and segmentation verify that our method can be applied to not only image classification but also other computer vision tasks. The flexibility without significant model modifications is an advantage of our high-level knowledge distillation so that our proposed method has a wide range of potential applications.
Mitigate Overfitting. As demonstrated in Section 3.6, our Lipschitz distillation loss can be seen as a regularization term, which constrains the search space around the point inferred by the teacher so as to prevent overfitting the target dataset. To consolidate this theoretical demonstration, we design a corresponding experiment. We use setting (b) in Table 1 to study this regularization phenomenon. The results are shown in Figure 2. It is noteworthy that when the Lipschitz continuity loss module is turned off, the performance on the validation set drops while the training accuracy stays at the same level. This reduction in overfitting verifies that our proposed method improves student network training through regularization.
Ablative Experiments. We conducted an ablation study of our proposed method on CIFAR-100 with the teacher and student architecture pairs in Table 1. We adjust the coefficient that controls the degree of Lipschitz distillation in the loss function (Eqs. 16 and 17); setting it to zero, i.e., distilling no Lipschitz continuity, serves as our baseline. The results are shown in Table 6. As the coefficient increases, the performance improvements show the effectiveness of our designed Lipschitz continuity loss. However, when the Lipschitz continuity loss accounts for more than 20% of the overall loss (on average), LONDON’s performance drops. A well-trained student network should have both the ability to align low-level feature maps and the ability to capture high-level information. Therefore, we believe that putting too much weight on high-level and universal information costs the network the aligning ability it would otherwise have.
We investigate knowledge distillation and the Lipschitz continuity of neural networks. Specifically, we present a novel KD method, named LONDON, which numerically calculates and transfers the Lipschitz constant as knowledge. Compared to standard KD methods that consider neural networks as black boxes, our KD method captures the functional property of neural networks as high-level knowledge for training student networks, which further prevents the student networks from overfitting the datasets by extending the representational capability of KD.
Acknowledgements. This research was partially supported by NSF CNS-1908658, NeTS-2109982 and the gift donation from Cisco. This article solely reflects the opinions and conclusions of its authors and not the funding agents.
6.1 Lemma 1.
Based on Rademacher’s theorem [federer2014geometric], for functions that are Lipschitz when restricted to some neighborhood around any point, the Lipschitz constant can be calculated from their differential operator.
Lemma 1. If a function $f: \mathbb{R}^n \to \mathbb{R}^m$ is a locally Lipschitz continuous function, then $f$ is differentiable almost everywhere. Moreover, if $f$ is Lipschitz continuous, then
$$\|f\|_{Lip} = \sup_{x \in \mathbb{R}^n} \|\nabla f(x)\|_2, \tag{21}$$
where $\|\cdot\|_2$ is the L2 matrix norm.
6.2 Lemma 2.
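Lemma 2. Let $W \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$, and $T(x) = Wx + b$ be an affine function. Then for all $x \in \mathbb{R}^n$, we have
$$\nabla g(x) = W^\top W x, \tag{22}$$
where $g(x) = \frac{1}{2}\|T(x) - T(0)\|_2^2$.
Proof. By definition, $g(x) = \frac{1}{2}\|Wx\|_2^2$, and the derivative of this equation is the desired result.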
6.3 Computing the Exact Lipschitz Constant of Networks is NP-hard
We take a 2-layer fully-connected neural network with ReLU activation function as an example to demonstrate that Lipschitz computation is not achievable in polynomial time. As we denoted in Method Section, this 2-layer fully-connected neural network can be represented as
where and are matrices of first and second layers of neural network, and is the ReLU activation function.
Proof. To prove that computing the exact Lipschitz constant of networks is NP-hard, we only need to prove that the corresponding decision problem, i.e., deciding whether the Lipschitz constant is at most a given value, is NP-hard.
From a clearly NP-hard problem:
where matrix is positive semi-definite with full rank. We denote matrices and as
so that we have
The spectral norm of this 1-rank matrix is . We prove that Eq. 24 is equivalent to the following optimization problem
Because is full rank, is surjective and all are admissible values for which is the equality case. Finally, ReLU activation units take their derivative within and Eq. 29 is its relaxed optimization problem, that has the same optimum points. So that our desired problem is NP-hard.