1 Introduction
Recently, deep learning models have driven great advances in computer vision [he2016deep, girshick2015fast], natural language processing [sutskever2014sequence, pennington2014glove], information retrieval [wei2019neural, MMGCN] and multimodal modelling [hu2021coarse, hu2021video]. To meet the buoyant demand for deploying these cumbersome models on resource-constrained edge devices, researchers have proposed several network compression paradigms, such as network pruning [lecun1989optimal, han2015deep], network quantization [hubara2016binarized] and knowledge distillation (KD) [hinton2015distilling]. Among these compression methods, KD assists the training of a smaller network (student) by transferring knowledge from a larger one (teacher). As one of the first innovators, Hinton et al. [hinton2015distilling] proposed using the soft labels produced by larger networks to supervise the training of smaller ones. These soft labels are usually interpreted as a form of unseen knowledge distilled from teachers. Beyond soft labels, various other kinds of knowledge have been designed [yim2017gift, heo2019comprehensive, tian2019contrastive, wang2019pay]. For example, Romero et al. [romero2014fitnets] proposed training the intermediate layers of students under the guidance of the corresponding layers of teachers, which initiated the subsequent flourishing studies on feature-based knowledge distillation. Researchers [yim2017gift, lee2018self, tung2019similarity] also modelled the relations among adjacent feature maps as additional knowledge to assist the training of student networks. Unfortunately, most of these feature-based KD methods solely focus on aligning shallow information while overlooking the high-level information of both networks, i.e., the students mechanically mimic the teachers' actions while neglecting their interior qualities. Consequently, previous studies treat networks as black boxes and heuristically select features without regard to any functional properties [tian2019contrastive, wang2019pay, zagoruyko2016paying], which impedes a universal representation of the knowledge to be distilled. To address this problem, we argue that leveraging networks' functional properties to derive high-level knowledge can strengthen the performance of KD.
In this paper, we incorporate Lipschitz continuity into KD, considering neural networks as functions rather than black boxes. By the definition in Eq. 4, the Lipschitz constant^1 (^1The Lipschitz constant of a function is the maximum norm of its gradient over the domain set, which reflects the Lipschitz continuity of the function.) upper bounds the relationship between input perturbation and output variation for a given distance, reflecting the robustness and expressiveness of neural networks [bartlett2017spectrally, miyato2018spectral, lyu2020autoshufflenet]. Specifically, the authors of [miyato2018spectral, yoshida2017spectral] demonstrated the effectiveness of the Lipschitz constant by constraining the weights of the discriminator in a generative adversarial network (GAN). Besides, many studies in representation learning [bengio2013representation, tian2019multimodal] demonstrate that deep neural networks are competent at learning high-level information with increasing abstraction. Inspired by this, we devise a scheme to capture the Lipschitz continuity of teacher networks (i.e., calculate the Lipschitz constant of every intermediate block) and adopt the captured continuity as knowledge to guide the training of student networks. It is worth noting that computing the Lipschitz constant is an NP-hard problem [virmaux2018lipschitz]. We address this problem by proposing an approximation algorithm with a tight upper bound. In particular, we design a Transmitting Matrix (TM) for each block and calculate the spectral norm of each TM through an adapted iteration method, avoiding the high complexity of handling large intermediate matrices. We then aggregate all Lipschitz constants calculated from the TMs as the Lipschitz continuity knowledge transferred to student networks. Importantly, the Lipschitz continuity loss function is differentiable and therefore backpropagation-friendly for training deep networks.
Overall, the contributions of this paper are fourfold:


To the best of our knowledge, we are the first to utilize a high-level functional property, Lipschitz continuity, in knowledge distillation to supervise the student network's training process. In addition, we theoretically explain the effectiveness of our method from the perspective of network regularization and then empirically consolidate this explanation.

We propose a novel knowledge distillation framework, Lipschitz cONtinuity Guided Knowledge DistillatiON (LONDON) for distilling knowledge from the Lipschitz constant.

To avoid the NP-hard Lipschitz constant calculation, we devise a Transmitting Matrix (TM) to numerically approximate the Lipschitz constants of networks during the KD process.

We perform experiments on different knowledge distillation tasks, including classification, object detection, and semantic segmentation. Our proposed method achieves state-of-the-art results on the CIFAR-100, ImageNet, and VOC datasets.
2 Related Work
Lipschitz Continuity and Spectral Norm of Neural Network.
The study of adversarial machine learning [kurakin2016adversarial, papernot2016transferability] shows that neural networks are highly vulnerable to attacks based on small modifications of the model's input at test time, and estimating the regularity of such architectures is essential for practical applications and generalization improvement. Previous efforts [virmaux2018lipschitz, miyato2018spectral, neyshabur2017exploring] have studied one of the critical characteristics for assessing the regularity of deep networks: the Lipschitz continuity of deep learning architectures. Lipschitz constants, which upper bound the relationship between input perturbation and output variation for a given distance, were introduced to secure the robustness of neural networks to small perturbations. The Lipschitz constant can be seen as a norm measuring a function's degree of Lipschitz continuity. Apart from theoretical studies [bartlett2017spectrally, luxburg2004distance, neyshabur2017exploring] showing that novel generalization bounds critically rely on the Lipschitz constant of the neural network, the Lipschitz continuity of neural networks is widely exploited to achieve state-of-the-art performance in many deep learning topics: (i) in image synthesis [miyato2018spectral, yoshida2017spectral], researchers applied spectral normalization to each layer of the discriminator, constraining its Lipschitz constant like a regularization term that smooths the discriminator function, to train a GAN on ImageNet; and (ii) in adversarial machine learning [weng2018evaluating], the authors proposed constraining local Lipschitz constants of neural networks to resist adversarial attacks.
The aforementioned efforts underline the significance of the Lipschitz constant for neural networks' expressiveness and robustness. In particular, deliberately constraining the Lipschitz constant within an appropriate range is proven to be a powerful technique for smoothing networks, which enhances a model's robustness. On this account, the Lipschitz constant, as functional information of neural networks, should be introduced into the knowledge distillation model to regularize the training of student networks.
Knowledge Distillation. Apart from the seminal design of soft labels [hinton2015distilling], the alignment of intermediate feature maps has also been transferred as knowledge to student networks [romero2014fitnets]. Researchers continued digging into feature-based outputs and proposed various transformations and combinations of feature maps to define feature-based knowledge, which largely promoted the performance of KD. For example, Heo et al. [heo2019knowledge, heo2019comprehensive] designated activation boundaries of hidden neurons at different positions of the networks as knowledge for distillation. In [yim2017gift], Gram matrices of neural networks' adjacent feature maps, representing the relations between intermediate layers, were also adopted as a form of knowledge. Other authors [lee2018self, chen2020learning, tung2019similarity] constructed similarity measurements over feature representations using singular value decomposition (SVD) to elicit relations between different layers as transferred knowledge.
Inspired by those ideas, many methods have been proposed to capture feature-wise knowledge more precisely by stacking sophisticated mechanisms on top of the knowledge distillation model. For instance, Wang et al. [wang2019pay] introduced an attention mechanism that assigns weights to different CNN channels to dynamically determine the critical features to distill. Furthermore, Tian et al. [tian2019contrastive] introduced contrastive learning to capture correlations and higher-order output dependencies for supervising student network training. Such dynamically aligned knowledge almost fully explores the potential of distilling networks' output information for supervision.
However, all these feature-based knowledge distillation methods treat neural networks as black boxes and thus fail to explore the functional properties of neural networks by capturing their high-level information. This limitation hinders applicability and impedes performance improvement. To alleviate it, we introduce Lipschitz continuity into knowledge distillation.
3 Method
In this section, we introduce our proposed knowledge distillation framework. Due to limited space, we only elaborate on the key derivations; detailed discussions and technical theorems can be found in the supplementary materials. We focus on capturing the functional property of neural networks as knowledge and transferring it in our distillation method in a numerically accessible way.
3.1 Preliminary
We first define a fully-connected neural network with $L$ layers of widths $n_1, \ldots, n_L$ as a function $f: \mathbb{R}^{n_0} \to \mathbb{R}^{n_L}$:

(1) $f = f_L \circ \varphi \circ f_{L-1} \circ \varphi \circ \cdots \circ \varphi \circ f_1,$

where each $f_i$ is an affine function ($n_0$ and $n_L$ are the sizes of the network's input and output feature maps) and $\varphi$ performs element-wise activation on the feature maps. For the $i$-th layer of the network, $f_i(x) = W_i x + b_i$, where $W_i \in \mathbb{R}^{n_i \times n_{i-1}}$ and $b_i \in \mathbb{R}^{n_i}$ stand for the weight matrix and bias vector, respectively. For generality, we discard the bias term of the network, so that the network can be simplified as:

(2) $f = f_L \circ \varphi \circ f_{L-1} \circ \varphi \circ \cdots \circ \varphi \circ f_1, \quad f_i(x) = W_i x.$
Notably, it is sufficient to consider networks with the most straightforward fully-connected layers, since layers with complex structures, such as convolution layers, can also be denoted in the form of matrix multiplication. Consider a convolution layer with $c_{in}$ input channels, $c_{out}$ output channels, and kernels of size $k \times k$, resulting in $c_{out} \times c_{in} \times k \times k$ parameters. We can rearrange these parameters into a matrix of size $c_{out} \times (c_{in} k^2)$, such that this convolution layer can be processed in the same way as the other fully-connected layers. Hence, our analysis loses no generality under this configuration of the function $f$.
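As a concrete illustration, the rearrangement described above can be sketched in a few lines of NumPy; the channel counts and kernel size are arbitrary example values, not settings from the paper:

```python
import numpy as np

# Toy convolution kernel: c_out filters, each spanning c_in channels and a
# k x k spatial window (example sizes only).
c_in, c_out, k = 3, 8, 3
kernel = np.random.randn(c_out, c_in, k, k)

# Flatten each filter into one row: the conv layer becomes an ordinary
# weight matrix of shape (c_out, c_in * k * k), so the spectral-norm
# analysis for fully-connected layers applies unchanged.
W = kernel.reshape(c_out, c_in * k * k)
print(W.shape)  # (8, 27)
```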
Following Eq. 2, we define the teacher network as a function $f^t$ and the student network as $f^s$, such that the feature-based KD paradigm can be interpreted as:

(3) $\min_{\theta_s} D\big(T_t(f^t(x)),\, T_s(f^s(x))\big),$

where, given the same data $x$, the ultimate goal of the KD paradigm is to minimize the distance between teacher and student in order to optimize the latter's parameters $\theta_s$. In particular, $D(\cdot,\cdot)$ is a distance function, and $T_t, T_s$ are transformation approaches that turn feature maps into more measurable and learnable knowledge. By utilizing such designed knowledge, the student network is forced to mimic the teacher network and hopefully obtains comparable performance with a lighter architecture.
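A minimal sketch of this feature-based paradigm, with a plain L2 distance and flatten-and-normalize transforms standing in as placeholders for the method-specific choices of $D$ and $T$ (these are illustrative, not the paper's choices):

```python
import numpy as np

def feature_kd_loss(feat_t, feat_s, transform_t, transform_s):
    # Generic feature-based KD objective in the spirit of Eq. 3: a distance
    # between transformed teacher and student feature maps.
    return np.linalg.norm(transform_t(feat_t) - transform_s(feat_s))

# Placeholder transform: flatten and L2-normalize the feature map.
normalize = lambda f: f.ravel() / np.linalg.norm(f.ravel())

rng = np.random.default_rng(5)
ft = rng.standard_normal((8, 4))                   # "teacher" features
fs_close = ft + 0.1 * rng.standard_normal((8, 4))  # student close to teacher
fs_far = -ft                                       # student far from teacher

# A closer student incurs a smaller distillation loss.
assert feature_kd_loss(ft, fs_close, normalize, normalize) < \
       feature_kd_loss(ft, fs_far, normalize, normalize)
```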
Here, we introduce Lipschitz continuity into the KD paradigm as universal information grounded in the functional property of networks. To make the Lipschitz constant calculation numerically feasible, we further propose an approximation of the Lipschitz constant and use the power iteration method to compute it.
3.2 The Functional Information of Neural Networks: Lipschitz Continuity
Definition 1. A function $f: \mathbb{R}^n \to \mathbb{R}^m$ is called Lipschitz continuous if there exists a constant $K$ such that:

(4) $\forall x_1, x_2 \in \mathbb{R}^n,\ \|f(x_1) - f(x_2)\|_2 \le K \|x_1 - x_2\|_2.$

The smallest $K$ for which the inequality holds is called the Lipschitz constant of $f$, denoted $\mathrm{Lip}(f)$. By Definition 1, $\mathrm{Lip}(f)$ has the excellent property of upper bounding the relationship between input perturbation and output variation for a given distance (generally the L2 norm); thus it is considered a metric to evaluate the robustness of neural networks to small perturbations [luxburg2004distance, virmaux2018lipschitz, bartlett2017spectrally]. However, computing the exact Lipschitz constant of neural networks in the knowledge distillation process is an NP-hard problem [virmaux2018lipschitz]. To solve this problem, we propose a feasible and effective method to approximate Lipschitz constants in KD.
We first define the affine function for the $i$-th layer as $x_i = f_i(x_{i-1}) = W_i x_{i-1}$, in which $x_{i-1}$ and $x_i$ are the feature maps out of the $(i{-}1)$-th and the $i$-th layer, respectively.
By Lemma 1 in the Supplementary Appendix, we have $\mathrm{Lip}(f) = \sup_x \sigma(\nabla f(x))$, where $\sigma(\cdot)$ is the spectral norm of a matrix. The matrix spectral norm is formally defined by

(5) $\sigma(W) = \max_{x \ne 0} \dfrac{\|Wx\|_2}{\|x\|_2},$

which is equivalent to the largest singular value of $W$. Thus, for the linear layer $f_i$, based on Lemma 2 in the Supplementary Appendix, its Lipschitz constant is given by

(6) $\mathrm{Lip}(f_i) = \sigma(W_i).$
Additionally, most activation functions, such as ReLU, Leaky ReLU, Tanh and Sigmoid, as well as max-pooling, have a Lipschitz constant equal to 1. Other common neural network layers, such as dropout, batch normalization and other pooling methods, all have simple and explicit Lipschitz constants [goodfellow2016deep]. This fixed Lipschitz constant property renders our derivation applicable to most network architectures, such as ResNet [he2016deep] and MobileNet [howard2017mobilenets]. Thereafter, we use the inequality $\mathrm{Lip}(f_1 \circ f_2) \le \mathrm{Lip}(f_1) \cdot \mathrm{Lip}(f_2)$ (concluded by Eq. 7 in [bartlett2017spectrally]) to derive the following bound for $\mathrm{Lip}(f)$:

(7) $\mathrm{Lip}(f) \le \prod_{i=1}^{L} \sigma(W_i).$

In this way, we can transfer the teacher's Lipschitz constant to the student through the sequence of spectral norms of the intermediate layers in the network. Moreover, the upper bound of the Lipschitz constant also ensures the quality of the knowledge to be transferred.
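The product bound of Eq. 7 can be checked numerically on a toy ReLU network. This is a sketch: the layer sizes are arbitrary, and the sampling-based lower estimate of the Lipschitz constant is only illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Two random layers of a toy ReLU network f(x) = W2 relu(W1 x).
W1 = rng.standard_normal((16, 8))
W2 = rng.standard_normal((4, 16))

def f(x):
    return W2 @ relu(W1 @ x)

# Upper bound from Eq. 7: Lip(f) <= sigma(W1) * sigma(W2)
# (ReLU contributes a factor of 1). ord=2 gives the spectral norm.
upper = np.linalg.norm(W1, 2) * np.linalg.norm(W2, 2)

# Empirical lower estimate: the largest sampled ratio
# ||f(x1) - f(x2)|| / ||x1 - x2|| can never exceed the bound.
est = 0.0
for _ in range(1000):
    x1, x2 = rng.standard_normal(8), rng.standard_normal(8)
    est = max(est, np.linalg.norm(f(x1) - f(x2)) / np.linalg.norm(x1 - x2))

assert 0 < est <= upper
```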
3.3 Transmitting Matrix
Given the derived tight upper bound of the Lipschitz constant, we design a novel loss that distills Lipschitz continuity from teacher to student by narrowing the distance between the corresponding spectral norms. The first problem is how to calculate each spectral norm. Calculating the spectral norm of a weight matrix in neural networks by SVD is inaccessible. Specifically, for complex network structures such as convolution layers or residual modules, even though they can be rearranged matrix-wise, computing their spectral norms directly is impractical. Therefore, we propose using a Transmitting Matrix (TM) to bypass the complicated calculation of the spectral norm $\sigma(W_i)$. This approximate calculation allows feasible computation for distilling the Lipschitz constant and its further use as a loss function.
For training data of batch size $n$, after a forward pass through the $(i{-}1)$-th layer, we have a batch of corresponding feature maps

(8) $X_{i-1} = [x_{i-1}^1, x_{i-1}^2, \ldots, x_{i-1}^n],$

where $x_{i-1}^j \in \mathbb{R}^{n_{i-1}}$ for each $j \in \{1, \ldots, n\}$.
Studies [chen2017exemplar, tung2019similarity] on the similarity of feature maps illustrate that, for well-trained networks, the feature maps within a batch at the same layer exhibit strong mutual linear independence. We formalize the relevance of feature maps in the same layer as

(9) $\langle x_{i-1}^j, x_{i-1}^k \rangle \approx 0, \quad j \ne k,$

(10) $\langle x_{i-1}^j, x_{i-1}^j \rangle = \|x_{i-1}^j\|_2^2.$

We further normalize the feature maps by $\hat{x}_{i-1}^j = x_{i-1}^j / \|x_{i-1}^j\|_2$, such that the batch of feature maps can be expressed in a matrix representation $\hat{X}_{i-1} = [\hat{x}_{i-1}^1, \ldots, \hat{x}_{i-1}^n]$ satisfying

(11) $\hat{X}_{i-1}^\top \hat{X}_{i-1} \approx I,$

where $I$ is an identity matrix.
With all the aforementioned equations, we are ready to define the Transmitting Matrix, reducing the calculation of the spectral norm of $W_i$ to the calculation of the spectral norm of

(12) $TM_i = W_i \hat{X}_{i-1}.$

Eqs. 11 and 12 together yield

(13) $TM_i^\top TM_i = \hat{X}_{i-1}^\top W_i^\top W_i \hat{X}_{i-1}.$
Theorem 1. If a matrix $U$ is an orthogonal matrix, such that $U^\top U = I$, where $I$ is an identity matrix, then the largest eigenvalues of $A$ and $U^\top A U$ are equivalent:

(14) $\lambda_{\max}(U^\top A U) = \lambda_{\max}(A),$

where $\lambda_{\max}(\cdot)$ is the largest eigenvalue of a matrix. Based on Theorem 1 and Eq. 13, our defined Transmitting Matrix satisfies $\lambda_{\max}(TM_i^\top TM_i) = \lambda_{\max}(W_i^\top W_i)$. Thus, combining the definition of the spectral norm, we can obtain the spectral norm of $W_i$ by calculating the largest eigenvalue of $TM_i^\top TM_i$, i.e., $\sigma(W_i) = \sqrt{\lambda_{\max}(TM_i^\top TM_i)}$, which is solvable.
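The mechanism behind Theorem 1 can be verified numerically. In this sketch the normalized feature-map matrix is made exactly orthonormal via a QR factorization so the identity holds exactly; in the method itself, a batch of normalized feature maps is only approximately orthogonal:

```python
import numpy as np

rng = np.random.default_rng(1)
n_prev, n_batch = 32, 32   # feature dimension and batch size (toy values)

# A weight matrix W_i of the layer under analysis.
W = rng.standard_normal((16, n_prev))

# Build exactly orthonormal "feature maps" for the check (Theorem 1's
# assumption); real normalized batches only approximate this.
X_hat, _ = np.linalg.qr(rng.standard_normal((n_prev, n_batch)))

# Transmitting Matrix: TM_i = W_i X_hat_{i-1}  (Eq. 12).
TM = W @ X_hat

# With orthonormal X_hat, sigma(TM) equals sigma(W).
assert np.isclose(np.linalg.norm(TM, 2), np.linalg.norm(W, 2))
```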
For networks with more complicated layers, such as residual blocks, we consider the whole block as an affine mapping from its front feature maps to its back feature maps; the approximation then applies block-by-block instead of layer-by-layer, which makes our spectral norm calculation more efficient. To this end, we define the Transmitting Matrix for a residual block as

(15) $TM_i = X_i \hat{X}_{i-1}^\top,$

where $X_{i-1}$ and $X_i$ are the front and back feature maps of the residual block, respectively.
3.4 Approximating the Spectral Norm with Power Iteration Method
Following the aforementioned steps, we next need to calculate the spectral norms of the two matrices (the teacher's and the student's) and then compute the loss between them. The intuitive approach is to use SVD to compute the spectral norm, which results in heavy computation. More importantly, the SVD calculation is non-differentiable, making it impossible to train the deep networks. Instead of SVD, we utilize the power iteration method [golub2000eigenvalue, yoshida2017spectral, miyato2018spectral] to approximate the spectral norm of the targeted matrix with a small trade-off in accuracy, as presented in Algorithm 1.
In this way, we have a feasible approach to calculate the spectral norms of the TMs, which faithfully approximate the Lipschitz constants of the networks.
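A NumPy sketch of power iteration for the spectral norm follows. Algorithm 1 itself is not reproduced in this excerpt, so details such as the iteration count and starting vector are assumptions, not the paper's exact pseudo-code:

```python
import numpy as np

def spectral_norm_power_iter(M, n_iters=200, eps=1e-12):
    """Approximate sigma(M) (the largest singular value) by alternating
    multiplications with M and M^T, normalizing each time."""
    rng = np.random.default_rng(0)
    v = rng.standard_normal(M.shape[1])
    for _ in range(n_iters):
        u = M @ v
        u /= (np.linalg.norm(u) + eps)   # left singular direction estimate
        v = M.T @ u
        v /= (np.linalg.norm(v) + eps)   # right singular direction estimate
    return float(u @ M @ v)              # Rayleigh-quotient estimate of sigma

M = np.random.default_rng(2).standard_normal((64, 32))
approx = spectral_norm_power_iter(M)
exact = np.linalg.norm(M, 2)             # SVD-based reference value
assert abs(approx - exact) < 1e-3 * exact
```

Unlike an SVD, every step here is a differentiable matrix-vector product, which is what makes the resulting spectral-norm estimate usable inside a training loss.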
3.5 Overall Loss Function
By using Algorithm 1, we obtain the spectral norms for the teacher and student networks, respectively: $\sigma(TM_i^t)$ and $\sigma(TM_i^s)$ for each $i \in \{1, \ldots, L\}$. We define our novel Lipschitz continuity loss function as

(16) $\mathcal{L}_{Lip} = \sum_{i=1}^{L} \frac{1}{\beta^{L-i}} \big( \log \sigma(TM_i^t) - \log \sigma(TM_i^s) \big)^2,$

where $\beta$ is a coefficient greater than 1. Hence, $\beta^{L-i}$ decreases as $i$ increases, and consequently the weight $1/\beta^{L-i}$ increases. In this way, we give more weight to higher-layer features since they are closer to the features performing the task.
Combined with the cross-entropy loss $\mathcal{L}_{CE}$ and the vanilla knowledge distillation loss $\mathcal{L}_{KD}$, we are ready to propose our overall loss function as

(17) $\mathcal{L} = \mathcal{L}_{CE} + \mathcal{L}_{KD} + \lambda \mathcal{L}_{Lip},$

where $\lambda$ is used to control the degree of distilling the Lipschitz constant. We use $\log(\cdot)$ in Eq. 16 because, when taking the derivative of $\log \sigma(TM_i^s)$, the denominator part can be easily eliminated.
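A sketch of a Lipschitz continuity loss in the spirit of Eq. 16 — a $\beta$-weighted squared difference of log spectral norms. The exact weighting and normalization are assumptions for illustration, not the paper's published hyperparameters:

```python
import numpy as np

def lipschitz_continuity_loss(sigmas_t, sigmas_s, beta=2.0):
    # sigmas_t / sigmas_s: blockwise spectral norms sigma(TM_i) of the
    # teacher and student (e.g. from power iteration), ordered shallow->deep.
    L = len(sigmas_t)
    loss = 0.0
    for i, (st, ss) in enumerate(zip(sigmas_t, sigmas_s), start=1):
        weight = 1.0 / beta ** (L - i)   # deeper blocks get larger weight
        loss += weight * (np.log(st) - np.log(ss)) ** 2
    return loss

# Matching spectral norms give zero loss; mismatched ones do not.
assert lipschitz_continuity_loss([1.0, 2.0], [1.0, 2.0]) == 0.0
assert lipschitz_continuity_loss([1.0, 2.0], [1.5, 2.5]) > 0.0
```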
3.6 Explaining the Effectiveness from a Regularization Perspective
The derivative of the loss function $\mathcal{L}$ with respect to the student's weight $W_i^s$ is:

(18) $\dfrac{\partial \mathcal{L}}{\partial W_i^s} = \dfrac{\partial (\mathcal{L}_{CE} + \mathcal{L}_{KD})}{\partial W_i^s} + \lambda_i\, u_1 v_1^\top,$

where $\lambda_i$ is an adaptive coefficient, and $u_1$ and $v_1$ are respectively the first left and right singular vectors of $TM_i^s$. For $TM_i^s$, using SVD, we have

(19) $TM_i^s = \sum_{j=1}^{r} \sigma_j u_j v_j^\top,$

where $r$ is the rank of $TM_i^s$, $\sigma_j$ is the $j$-th largest singular value, and $u_j$ and $v_j$ are the corresponding left and right singular vectors, respectively.
In Eq. 18, the first term is the same as the derivative of the loss function of vanilla knowledge distillation. As for the second term, based on Eq. 19, it can be seen as a regularization term penalizing the vanilla knowledge distillation loss with an adaptive regularization coefficient

(20) $\lambda_i = \dfrac{2\lambda}{\beta^{L-i}} \cdot \dfrac{\log \sigma(TM_i^s) - \log \sigma(TM_i^t)}{\sigma(TM_i^s)},$
which constrains the weights of the student network by utilizing the teacher network's spectral norms as prior supervision information. In other words, our method prevents the student network from being trapped in local minima and thereby ensures better training. We demonstrate this by designing a corresponding experiment in Section 4.4, showing that our proposed method prevents student networks from overfitting the dataset.
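The key fact behind the second term of Eq. 18 — that the gradient of a (simple) largest singular value with respect to the matrix is the outer product of its leading singular vectors — can be confirmed with a finite-difference check:

```python
import numpy as np

rng = np.random.default_rng(3)
W = rng.standard_normal((6, 4))

# Analytic gradient of the spectral norm: u1 v1^T from the leading
# singular pair (the u1 v1^T factor in Eq. 18, without the coefficient).
U, S, Vt = np.linalg.svd(W)
analytic = np.outer(U[:, 0], Vt[0, :])

# Central finite-difference estimate of d sigma(W) / d W_ij.
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp = W.copy(); Wp[i, j] += eps
        Wm = W.copy(); Wm[i, j] -= eps
        numeric[i, j] = (np.linalg.norm(Wp, 2) - np.linalg.norm(Wm, 2)) / (2 * eps)

assert np.allclose(analytic, numeric, atol=1e-4)
```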
4 Experiments
In this section, we conducted experiments on three computer vision tasks, image classification, object detection, and segmentation, to validate the effectiveness of our proposed distillation method. In addition to comparing our method with the stateoftheart methods, we also designed a series of ablation studies to verify the effectiveness and highlight the regularization property of our proposed technique. All experiments are implemented using PyTorch
[paszke2019pytorch].

Table 1: Settings of each experiment, with model sizes and compression ratios.

| Setup | Compression type | Teacher network | Student network | # of params (teacher) | # of params (student) | Compression ratio |
|---|---|---|---|---|---|---|
| (a) | Depth | WideResNet 28-4 | WideResNet 16-4 | 5.87M | 2.77M | 47.2% |
| (b) | Channel | WideResNet 28-4 | WideResNet 28-2 | 5.87M | 1.47M | 25.0% |
| (c) | Depth & channel | WideResNet 28-4 | WideResNet 16-2 | 5.87M | 0.70M | 11.9% |
| (d) | Different architecture | WideResNet 28-4 | ResNet 56 | 5.87M | 0.86M | 14.7% |
| (e) | Different architecture | PyramidNet-200 (240) | WideResNet 28-4 | 26.84M | 5.87M | 21.9% |
| (f) | Different architecture | PyramidNet-200 (240) | PyramidNet-110 (84) | 26.84M | 3.91M | 14.6% |
| (g) | Different architecture | PyramidNet-200 (240) | ResNet 56 | 26.84M | 0.86M | 5.8% |
Table 2: Results (error rate, %) on CIFAR-100 for the settings in Table 1.

| Setup | Teacher | Baseline | KD [hinton2015distilling] | FitNets [romero2014fitnets] | AT [zagoruyko2016paying] | Jacobian [srinivas2018knowledge] | FT [kim2018paraphrasing] | AB [heo2019knowledge] | OFD [heo2019comprehensive] | AFD [wang2019pay] | LONDON (ours) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| (a) | 21.09 | 22.72 | 21.69 | 21.85 | 22.07 | 22.18 | 21.72 | 21.36 | 20.89 | 21.15 | 20.33 |
| (b) | 21.09 | 24.88 | 23.43 | 23.94 | 23.80 | 23.70 | 23.41 | 23.19 | 21.98 | 21.79 | 20.71 |
| (c) | 21.09 | 27.32 | 26.47 | 26.30 | 26.56 | 26.71 | 25.91 | 26.02 | 24.08 | 24.21 | 23.46 |
| (d) | 21.09 | 27.68 | 26.76 | 26.35 | 26.66 | 26.60 | 26.20 | 26.04 | 24.44 | 24.67 | 23.78 |
| (e) | 15.57 | 21.09 | 20.97 | 22.16 | 19.28 | 20.59 | 19.04 | 20.46 | 17.80 | 18.24 | 17.54 |
| (f) | 15.57 | 22.58 | 21.68 | 23.79 | 19.93 | 23.49 | 19.53 | 20.89 | 18.89 | 19.32 | 18.21 |
| (g) | 15.57 | 27.68 | 26.82 | 26.10 | 26.64 | 26.43 | 26.29 | 25.70 | 24.49 | 24.53 | 23.52 |
4.1 Classification
We chose CIFAR-100 [krizhevsky2009learning] for classification because it is commonly used for comparing KD methods and its relatively small size provides the flexibility to implement different combinations of teacher and student architectures. Besides CIFAR-100, we conducted experiments on ImageNet [deng2009imagenet], a larger dataset, to verify the stability of our distillation method.
CIFAR-100 [krizhevsky2009learning] is a widely-used image classification dataset consisting of 50K training images and 10K testing images of size 32×32, divided into 100 classes. We designed various combinations of architectures for the teacher and student networks. Table 1 summarizes the settings of each experiment, including model size and compression ratio, involving architectures such as Residual Networks (ResNet) [he2016deep], Wide Residual Networks (WideResNet) [zagoruyko2016wide], and Deep Pyramidal Residual Networks (PyramidNet) [han2017deep]. Experimental results for the different settings are shown in Table 2, where our method achieves state-of-the-art performance in all seven settings, for both depth and channel compression (a, b, c) and different architectures (d, e, f, g). Notably, in the depth compression and channel compression settings (a) and (b), the student networks trained by LONDON even outperform their teacher networks, which further demonstrates the efficacy of our Lipschitz continuity method as a regularization function.
Overall, our proposed method consistently shows comparable or better performance regardless of compression rate or network architecture type, which endows our approach with implementation flexibility. We also note encouraging improvements for student networks at high compression ratios. These results demonstrate the potential of Lipschitz continuity distillation for compressing large networks into more resource-efficient ones with an acceptable accuracy drop. For example, in setting (g), a compression from teacher to student with a completely different architecture, the student network still benefits from the teacher network via our method. In general, our proposed method can be applied to both small networks (fewer parameters) and large networks with satisfactory performance.
ImageNet [deng2009imagenet] is a large-scale dataset with 1.2 million training images and 50k validation images divided into 1,000 classes. Compared to other classification datasets such as CIFAR-100, ImageNet has greater diversity, and its images are larger in scale. For all experiments, we report both the top-1 and top-5 error rates. Images are cropped to the size of 224×224 for training and validation. The student networks are trained for 100 epochs, and the learning rate begins at 0.1 and is multiplied by 0.1 every 30 epochs. To ensure a fair comparison, we used the pretrained models in the PyTorch library as the teacher networks. Two combinations of network architectures are selected for demonstration. For the first combination, we chose ResNet152 [he2016deep] as the teacher network and ResNet50 as the student network. For the second, to test the knowledge distillation capacity across different network architectures, we chose ResNet50 as the teacher network and MobileNet [howard2017mobilenets] as the student network. The results are displayed in Table 3. Compared to strong methods such as [heo2019knowledge, heo2019comprehensive], our method still exhibits a clear improvement. In particular, our method makes ResNet50 outperform its teacher network ResNet152, which is a remarkable achievement. Besides, regarding compression ability, our method brings a considerable improvement to the lightweight architecture MobileNet, whose error rate of 27.64% under our method is better than that of any network reported in the MobileNet paper [howard2017mobilenets].

Table 3: Results on ImageNet (error, %).

| Network | # of params (ratio) | Method | Top-1 error | Top-5 error |
|---|---|---|---|---|
| ResNet152 | 60.19M | Teacher | 21.69 | 5.95 |
| ResNet50 | 25.56M (42.5%) | Baseline | 23.72 | 6.97 |
| | | KD [hinton2015distilling] | 22.85 | 6.55 |
| | | AT [zagoruyko2016paying] | 22.75 | 6.35 |
| | | FT [kim2018paraphrasing] | 22.80 | 6.49 |
| | | AB [heo2019knowledge] | 23.47 | 6.94 |
| | | OFD [heo2019comprehensive] | 21.65 | 5.83 |
| | | AFD [wang2019pay] | 22.08 | 6.30 |
| | | LONDON (ours) | 21.12 | 5.47 |
| ResNet50 | 25.56M | Teacher | 23.84 | 7.14 |
| MobileNet | 4.23M (16.5%) | Baseline | 31.13 | 11.24 |
| | | KD [hinton2015distilling] | 31.42 | 11.02 |
| | | AT [zagoruyko2016paying] | 30.44 | 10.67 |
| | | FT [kim2018paraphrasing] | 30.12 | 10.50 |
| | | AB [heo2019knowledge] | 31.11 | 11.29 |
| | | OFD [heo2019comprehensive] | 28.75 | 9.66 |
| | | AFD [wang2019pay] | 28.61 | 9.81 |
| | | LONDON (ours) | 27.64 | 8.97 |
Table 4: Object detection results on the VOC2007 test set (mAP).

| Network | # of params (ratio) | Method | mAP |
|---|---|---|---|
| ResNet50-SSD | 36.7M | Teacher | 76.79 |
| ResNet18-SSD | 20.0M (54.5%) | Baseline | 71.61 |
| | | OFD [heo2019comprehensive] | 73.08 |
| | | AFD [wang2019pay] | 72.78 |
| | | LONDON (ours) | 73.82 |
| MobileNet-SSD | 6.5M (18.7%) | Baseline | 67.58 |
| | | OFD [heo2019comprehensive] | 68.54 |
| | | AFD [wang2019pay] | 68.63 |
| | | LONDON (ours) | 69.09 |
4.2 Object Detection
We applied our proposed method to the most popular high-speed detector, the Single Shot Detector (SSD) [liu2016ssd]. All models were trained on the training sets of VOC2007 and VOC2012 [everingham2015pascal], with backbone networks pretrained on the ImageNet dataset, for 120k iterations with a batch size of 32. We set the SSD trained without distillation as our baseline, and the SSD detector with ResNet50 as the teacher network. For the student networks, we used SSD with ResNet18 or MobileNet [howard2017mobilenets]. We evaluated the detection performance on the VOC2007 test set. The results are presented in Table 4. Both trained student networks outperform other methods, which implies that our method can be applied to object detectors. Furthermore, by comparing the performance of ResNet18 to that of MobileNet, we find that distillation between similar structures yields better quality than between different ones.
Table 5: Semantic segmentation results (mIoU).

| Backbone | # of params (ratio) | Method | mIoU |
|---|---|---|---|
| ResNet101 | 59.3M | Teacher | 77.39 |
| ResNet18 | 16.6M (28.0%) | Baseline | 71.79 |
| | | OFD [heo2019comprehensive] | 73.24 |
| | | AFD [wang2019pay] | 72.81 |
| | | LONDON (ours) | 73.62 |
| MobileNet | 5.8M (9.8%) | Baseline | 68.44 |
| | | OFD [heo2019comprehensive] | 71.36 |
| | | AFD [wang2019pay] | 71.56 |
| | | LONDON (ours) | 71.97 |
4.3 Semantic Segmentation
In this section, we conducted knowledge distillation on the semantic segmentation task. It is worth noting that implementing KD on semantic segmentation is extremely difficult because the penultimate feature maps of a segmentation model have higher dimensions than those of common network architectures. In particular, we take the widely-used DeepLabV3+ [ChenPSA17] as our study case. We used DeepLabV3+ with a ResNet101 backbone as the teacher, and DeepLabV3+ based on ResNet18 [he2016deep] and MobileNetV2 [howard2017mobilenets] as the students. The results shown in Table 5 provide clear evidence that our proposed method can greatly improve the performance of both ResNet18 and MobileNet.
In general, most KD studies are only experimentally justified on the task of image classification. In our case, the experiments on detection and segmentation verify that our method can be applied not only to image classification but also to other computer vision tasks. This flexibility without significant model modifications is an advantage of our high-level knowledge distillation, giving our proposed method a wide range of potential applications.
4.4 Analyses
Table 6: Ablation on the coefficient λ for pairs (a)-(d) from Table 1.

| Pair | λ = 0 | 0.1 | 0.4 | 1.6 | 3.2 | 6.4 |
|---|---|---|---|---|---|---|
| (a) | 21.69 | 21.36 | 21.54 | 21.11 | 20.33 | 21.87 |
| (b) | 23.43 | 22.04 | 22.05 | 21.88 | 21.48 | 22.35 |
| (c) | 26.47 | 24.39 | 23.77 | 23.56 | 23.62 | 24.87 |
| (d) | 27.68 | 24.18 | 24.42 | 23.82 | 23.78 | 25.22 |
Mitigating Overfitting. As demonstrated in Section 3.6, our Lipschitz distillation loss can be seen as a regularization term that constrains the search space around the point inferred by the teacher, so as to prevent overfitting the target dataset. To consolidate this theoretical demonstration, we designed a corresponding experiment using setting (b) in Table 1. The results are shown in Figure 2. Notably, when the Lipschitz continuity loss module is turned off, performance on the validation set drops while training accuracy stays at the same level. This overfitting-reduction phenomenon verifies that our proposed method improves student network training through regularization.
Ablative Experiments. We conducted an ablation study of our proposed method on CIFAR-100 with the teacher and student architecture pairs in Table 1, adjusting the coefficient λ in the loss function (Eqs. 16 and 17), where λ = 0 corresponds to no Lipschitz continuity being distilled and serves as our baseline. The results are shown in Table 6. As λ increases, the performance improvements show the effectiveness of our designed Lipschitz continuity loss. However, when the ratio of the Lipschitz loss within the overall loss exceeds 20% (on average), LONDON's performance drops. A well-trained student network should have both the ability to align low-level feature maps and the ability to capture high-level information. Therefore, we believe that putting too much weight on high-level, universal information sacrifices the aligning ability the network would otherwise have.
5 Conclusion
We investigated knowledge distillation and the Lipschitz continuity of neural networks. Specifically, we presented a novel KD method, named LONDON, which numerically calculates and transfers the Lipschitz constant as knowledge. Compared to standard KD methods that treat neural networks as black boxes, our method captures the functional property of neural networks as high-level knowledge for training student networks, which further prevents the student networks from overfitting the datasets by extending the representational capability of KD.
Acknowledgements. This research was partially supported by NSF CNS-1908658, NeTS-2109982 and a gift donation from Cisco. This article solely reflects the opinions and conclusions of its authors and not the funding agents.
References
6 Appendix
6.1 Lemma 1.
Based on Rademacher's theorem [federer2014geometric], for functions that are Lipschitz when restricted to some neighborhood around any point, the Lipschitz constant can be calculated from the differential operator.
Lemma 1. If a function $f: \mathbb{R}^n \to \mathbb{R}^m$ is locally Lipschitz continuous, then $f$ is differentiable almost everywhere. Moreover, if $f$ is Lipschitz continuous, then

(21) $\mathrm{Lip}(f) = \sup_{x \in \mathbb{R}^n} \|\nabla f(x)\|_2,$

where $\|\cdot\|_2$ is the L2 matrix norm (spectral norm).
6.2 Lemma 2.
Lemma 2. Let $f(x) = Wx$ with $W \in \mathbb{R}^{m \times n}$ be a linear function. Then for all $x \in \mathbb{R}^n$, we have

(22) $\nabla f(x) = W,$

so that $\mathrm{Lip}(f) = \sup_x \|\nabla f(x)\|_2 = \sigma(W)$.
Proof. By definition, $f(x) = Wx$, and taking the derivative of this equation yields the desired result.
6.3 Computing the Exact Lipschitz Constant of Networks is NP-hard
We take a 2-layer fully-connected neural network with the ReLU activation function as an example to demonstrate that Lipschitz constant computation is not achievable in polynomial time. As denoted in the Method section, this 2-layer fully-connected neural network can be represented as

(23) $f(x) = W_2\, \varphi(W_1 x),$

where $W_1$ and $W_2$ are the matrices of the first and second layers of the neural network, and $\varphi$ is the ReLU activation function.
Proof. To prove that computing the exact Lipschitz constant of networks is NP-hard, we only need to prove that deciding whether the Lipschitz constant is below a given threshold is NP-hard.
We start from a clearly NP-hard problem:

(24) $\max_{\sigma}\ \sigma^\top A \sigma$
(25) $\text{subject to } \sigma_i \in \{0, 1\}\ \forall i,$

where the matrix $A$ is positive semi-definite with full rank. We denote the matrices $W_1$ and $W_2$ as

(26) $W_1 = A^{1/2}$, so that $W_1 W_1^\top = A$,

(27) $W_2 = \mathbf{1}^\top = (1, 1, \ldots, 1),$

so that, writing $\Sigma = \mathrm{diag}(\sigma)$ with $\sigma_i \in \{0, 1\}$ for the ReLU activation pattern, we have

(28) $\nabla f = W_2\, \Sigma\, W_1 = \mathbf{1}^\top \Sigma W_1.$

The spectral norm of this rank-1 matrix is $\|\mathbf{1}^\top \Sigma W_1\|_2 = \sqrt{\sigma^\top A \sigma}$. We prove that Eq. 24 is equivalent to the following optimization problem:

(29) $\max_{\sigma}\ \sigma^\top A \sigma$
(30) $\text{subject to } \sigma_i \in [0, 1]\ \forall i.$

Because $A$ is full rank, $W_1$ is surjective and all activation patterns $\Sigma$ are admissible values for which Eq. 28 is the equality case. Finally, ReLU activation units take their derivative within $\{0, 1\}$, and Eq. 29 is the relaxed optimization problem of Eq. 24, which has the same optimum points. Therefore our desired problem is NP-hard.
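The rank-one spectral-norm identity at the heart of this reduction — that $\|\mathbf{1}^\top \Sigma W_1\|_2^2 = \sigma^\top A \sigma$ whenever $W_1 W_1^\top = A$ — can be verified numerically. Using a Cholesky factor as the square root of $A$ is one convenient choice, an assumption of this sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5

# A full-rank PSD matrix A and a binary ReLU activation pattern sigma.
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)          # full-rank PSD by construction
W1 = np.linalg.cholesky(A)           # W1 with W1 @ W1.T == A
W2 = np.ones((1, n))                 # W2 = 1^T as in the reduction

sigma = rng.integers(0, 2, size=n).astype(float)
Sigma = np.diag(sigma)

# Gradient of f(x) = W2 relu(W1 x) on the region with pattern sigma:
G = W2 @ Sigma @ W1                  # a rank-one (row) matrix

# Its squared spectral norm equals the quadratic objective of Eq. 24.
assert np.isclose(np.linalg.norm(G, 2) ** 2, sigma @ A @ sigma)
```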