I Introduction
After the deep Convolutional Neural Network (CNN) proposed by Krizhevsky et al. [1] won the ILSVRC 2012 with a higher score than the conventional methods, CNNs became very popular for image classification and object recognition. Later, with the increase in computer performance, deeper and larger models such as VGG [2] and ResNet [3] were proposed. It is believed that deeper models with more parameters achieve better performance than shallow models. However, the computational cost becomes enormous as the size of the models becomes larger. Various techniques for compressing the size of the models have been proposed to improve performance while reducing computational costs.
One of the methods to compress the size of the models is knowledge distillation (KD). Knowledge distillation is a technique for transferring the knowledge of a deep or ensemble model with many parameters (teacher model) to a smaller, shallower model (student model). Specifically, the learning of the student model is accelerated by using the output of the trained teacher model. For example, Ba et al. [4] used the squared error between the teacher model and the student model as the student model loss. Hinton et al. [5] used the KL-divergence of the softmax outputs of both models as the loss of the student model.
In knowledge distillation, the knowledge of the teacher model is transferred to the student model by making the student model learn to imitate the output of the teacher model. In other words, knowledge distillation can also be considered as a method for increasing the similarity between the outputs of the teacher model and the student model. Therefore, in this paper, we propose to introduce metric learning into knowledge distillation.
In metric learning, researchers develop methods to build a model that increases the similarity of the outputs for similar samples. Representative methods for deep metric learning include the Siamese Network [6, 7, 8] and the Triplet Network [9, 10]. These are trained so that the Euclidean distance between the outputs for similar samples becomes small, while the Euclidean distance between the outputs for dissimilar samples becomes large. That is, they have a function to clarify the difference between dissimilar outputs. The model will try to produce the same output for the same object and will be able to clearly distinguish the differences between different objects. With this feature, metric learning models have performed extremely well in areas such as face verification and person re-identification.
In this paper, we propose to introduce this metric learning concept into knowledge distillation. The functionality of the metric learning to reduce the differences between similar outputs can be used for the knowledge distillation to reduce the differences between the outputs of the teacher model and the student model. Since the outputs of the teacher model for different objects are usually different, the student model needs to distinguish them. We think that metric learning can clarify the difference between the different outputs, and the performance of the student model could be improved.
The effectiveness of the proposed approach is experimentally confirmed on image classification tasks. Experimental results show that our method dramatically improves the performance of the student model. We have also performed experiments to compare the proposed method with state-of-the-art knowledge distillation methods. The results show that the student model obtained by the proposed method gives higher performance than those obtained by the conventional knowledge distillation methods.
II Related Works
II-A Deep Convolutional Neural Network
The deep CNN has been proven to be effective and has been applied to many applications such as image classification [1, 3], object detection [11], image segmentation [12], and so on.
The computation within a convolution layer is regarded as a filtering process of the input image as

(1) y^{(l)}_{ij} = f\Big( \sum_{m=1}^{h} \sum_{n=1}^{h} w^{(l)}_{mn} x_{i+m-1,\, j+n-1} + b^{(l)} \Big),

where w^{(l)}_{mn} is the weight of the neuron indexed as (m, n) in the l-th convolution layer and b^{(l)} is the bias of the l-th convolution layer. The size of the convolution filter is given as h × h. The activation function of each neuron is denoted as f. Usually, pooling layers are added after the convolution layers. The pooling layer performs down-sampling to reduce the computational cost and to enhance robustness against small positional changes. Fully-connected layers, like a multi-layer perceptron, are connected after the convolution layers to construct the classifier.
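As a concrete illustration, the filtering process of Equation (1) can be sketched in a few lines of Python. This is a naive single-channel "valid" convolution; real CNN frameworks vectorize this and handle multiple input and output channels.

```python
import numpy as np

def conv2d(x, w, b, f):
    """Naive single-channel 'valid' convolution followed by an activation,
    in the spirit of Equation (1). x: input image, w: h-by-h filter,
    b: bias, f: activation function applied per output position."""
    H, W = x.shape
    h = w.shape[0]
    out = np.empty((H - h + 1, W - h + 1))
    for i in range(H - h + 1):
        for j in range(W - h + 1):
            # weighted sum over the receptive field, plus bias, then activation
            out[i, j] = f(np.sum(w * x[i:i + h, j:j + h]) + b)
    return out
```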
II-B Metric Learning
Metric learning is a method of learning an embedding for a metric (Euclidean distance, cosine similarity, etc.). Metric learning is applied to a wide range of tasks such as image search, biometric authentication, and anomaly detection.
A typical model for deep metric learning is the Siamese Network [6, 7, 8]. The Siamese Network consists of two identical networks joined at their outputs. The two networks extract feature vectors from two different samples. Usually, the weights of the two networks are shared. The objective function for training the parameters of the networks is defined by using these extracted feature vectors as

(2) L_{contrastive} = \sum_{(i,j) \in P} \big[ t_{ij}\, d_{ij}^2 + (1 - t_{ij}) \max(m - d_{ij}, 0)^2 \big],

where d_{ij} represents the distance between the pair of outputs f(x_i) and f(x_j) of each network for the sample pair x_i and x_j. The distance is defined by

(3) d_{ij} = \| f(x_i) - f(x_j) \|_2,

where m is a parameter indicating the margin between clusters and P is an index set of sample pairs randomly generated from the samples in the minibatch. A label t_{ij} is assigned for each sample pair such that t_{ij} = 1 when the pair x_i and x_j is similar and t_{ij} = 0 when the pair is dissimilar. The scheme of the Siamese Network is shown in Fig. 1.
After the training of the Siamese Network, the distance between the outputs for a dissimilar pair becomes far while the distance between the outputs for a similar pair becomes close. The Siamese Network can only consider the metric between samples pairwise. For this reason, the Siamese Network has to uniquely determine the concept of similarity. For example, if there are two different male images, under the concept of "gender" they should be judged to be similar, but under the concept of "individuals" they should be judged not to be similar. It is difficult to express these multiple concepts in the Siamese Network.
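The per-pair contrastive objective described above can be sketched as follows; this is a minimal single-pair version of Equations (2)–(3), with the function name and margin default chosen for illustration.

```python
import numpy as np

def contrastive_loss(f1, f2, similar, margin=1.0):
    """Contrastive loss for one pair of embeddings: pull similar pairs
    together, push dissimilar pairs at least `margin` apart.
    `similar` is 1 for a similar pair, 0 for a dissimilar pair."""
    d = np.linalg.norm(f1 - f2)  # Euclidean distance between the two outputs
    return similar * d ** 2 + (1 - similar) * max(margin - d, 0.0) ** 2
```

Note that a dissimilar pair already farther apart than the margin contributes no loss, which is what stops the embedding from expanding without bound.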
Wang et al. [9] and Hoffer et al. [10] proposed the Triplet Network (Figure 2), an extension of the Siamese Network to triplet-wise comparisons. The Triplet Network is designed to learn an embedding from a triplet of samples called "anchor", "positive", and "negative". It learns the embedding such that the distance between "anchor" and "positive" is smaller than the distance between "anchor" and "negative". Various losses have been proposed for the Triplet Network. The triplet loss

(4) L_{triplet} = \sum_{(a,p,n) \in T} \max\big( \| f(x_a) - f(x_p) \|_2^2 - \| f(x_a) - f(x_n) \|_2^2 + m, 0 \big)

is generally used, where f denotes the output of the model and m is a parameter indicating the margin between clusters. T is an index set of "anchor", "positive", and "negative" triplets. The Triplet Network learns so that the anchor-positive distance becomes relatively closer than the anchor-negative distance. Therefore, multiple concepts of similarity can be considered without depending on a single one, and the Triplet Network can overcome the drawback of the Siamese Network.
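A minimal single-triplet sketch of the loss in Equation (4), assuming squared Euclidean distances and a default margin chosen only for illustration:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet loss for one (anchor, positive, negative) triplet: the
    anchor-positive distance should beat the anchor-negative distance
    by at least `margin`, otherwise a hinge penalty is paid."""
    d_ap = np.sum((anchor - positive) ** 2)  # anchor-positive distance
    d_an = np.sum((anchor - negative) ** 2)  # anchor-negative distance
    return max(d_ap - d_an + margin, 0.0)
```

Because only the relative ordering of the two distances matters, the loss does not force one global notion of similarity, which is exactly the advantage over the pairwise Siamese objective.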
II-C Knowledge Distillation
Knowledge distillation is a technique for transferring the knowledge of a deep or ensemble model with many parameters (teacher model) to a smaller, shallower model (student model). Various approaches have been proposed for transferring the knowledge.
Ba et al. [4] considered the difference between the input vectors of the softmax activation function (logits) of the teacher model and the student model as the loss (Equation (5)), and proposed to train the student model so that the feature vectors of the student model get close to those of the teacher model (BKD). The loss to measure the similarity between the teacher model and the student model is defined by the mean squared error between the feature vector of the teacher model and the feature vector of the student model as

(5) L_{BKD} = \frac{1}{|S|} \sum_{i \in S} \| z_T(x_i) - z_S(x_i) \|_2^2,

where z_T and z_S are the logits of the teacher and the student model, and S is the set of training samples. As a result, the authors found that the performance of the student model was higher than that of the same network trained alone. They claimed that the uncertainty in the feature vectors of the teacher model is useful for learning the student model.
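A sketch of the logit-matching loss of Equation (5); the exact scaling constant may differ from the paper's formulation.

```python
import numpy as np

def bkd_loss(teacher_logits, student_logits):
    """Ba-style logit matching: squared error between teacher and
    student logits, averaged over the samples in the batch.
    Shapes: (num_samples, num_classes)."""
    diff = teacher_logits - student_logits
    return np.mean(np.sum(diff ** 2, axis=1))
```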
Hinton et al. [5] proposed training the student model so that the softmax outputs (probabilities) of the teacher model and of the student model are close (HKD). They used the KL-divergence of the softmax outputs of both models as the loss for training the student model. The loss based on the KL-divergence is defined as

(6) L_{HKD} = \sum_{i \in S} KL\big( \sigma(z_T(x_i)/T),\; \sigma(z_S(x_i)/T) \big),

where σ is the softmax function and KL is the KL-divergence between the softmax outputs p of the teacher model and the softmax outputs q of the student model, defined by

(7) KL(p, q) = \sum_k p_k \log \frac{p_k}{q_k},

and T is a temperature parameter. They argued that negative values in the logits could have either a positive or a negative effect on learning, depending on the case.
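The temperature-softened KL objective of Equations (6)–(7) can be sketched as follows; averaging over the batch rather than summing is a common implementation choice, not dictated by the paper.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax; subtracting the row max is for
    numerical stability only."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def hkd_loss(teacher_logits, student_logits, T=4.0):
    """Hinton-style soft-target loss: KL-divergence between the
    temperature-softened class distributions of teacher and student,
    averaged over the minibatch."""
    p = softmax(teacher_logits, T)  # teacher distribution (target)
    q = softmax(student_logits, T)  # student distribution
    return np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1))
```

A higher temperature T flattens both distributions, exposing the teacher's relative preferences among the wrong classes ("dark knowledge") to the student.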
In recent years, various further approaches have been proposed for knowledge distillation. Park et al. [13] expressed the relationship between the outputs of the teacher model as the Euclidean distance between two outputs and transferred it to the student model (RKD-D). They expressed the similarity between the pair of outputs f(x_i) and f(x_j) as

(8) \psi_D(x_i, x_j) = \frac{1}{\mu} \| f(x_i) - f(x_j) \|_2,

where μ is the mean distance over the sample pairs and P is a set of sample pairs randomly generated from the samples in the minibatch. They optimized the student model with the Huber loss so that the pairwise similarities of the outputs of the student model got closer to those of the teacher model. The loss function of RKD-D is defined by using the similarity ψ_D as

(9) L_{RKD\text{-}D} = \sum_{(i,j) \in P} l_\delta\big( \psi_D^T(x_i, x_j),\; \psi_D^S(x_i, x_j) \big),

where

(10) l_\delta(a, b) = \begin{cases} \frac{1}{2}(a - b)^2 & \text{if } |a - b| \le 1 \\ |a - b| - \frac{1}{2} & \text{otherwise} \end{cases}

is the Huber loss.
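A sketch of the distance-wise objective of Equations (8)–(10), assuming all pairs in the minibatch are used and the normalizer μ is the mean of the nonzero pairwise distances:

```python
import numpy as np

def huber(x, delta=1.0):
    """Elementwise Huber penalty, as in Equation (10)."""
    ax = np.abs(x)
    return np.where(ax <= delta, 0.5 * x ** 2, delta * (ax - 0.5 * delta))

def pairwise_potential(f):
    """Normalized pairwise distances of Equation (8): all minibatch
    distances divided by their mean, so the overall scale of the
    embedding space cancels out."""
    d = np.linalg.norm(f[:, None, :] - f[None, :, :], axis=-1)
    mu = d[d > 0].mean()
    return d / mu

def rkd_d_loss(f_teacher, f_student):
    """RKD-D loss of Equation (9): Huber penalty on the difference of
    the teacher's and student's normalized pairwise distances."""
    return np.mean(huber(pairwise_potential(f_teacher)
                         - pairwise_potential(f_student)))
```

Because of the normalization, a student embedding that is a uniformly scaled copy of the teacher's incurs zero loss: only the relational structure is transferred.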
They also used the cosine of the angle formed by three outputs as a similarity of the model outputs (RKD-A). The cosine of the angle formed at f(x_j) by the three outputs f(x_i), f(x_j), and f(x_k) is defined by

(11) \psi_A(x_i, x_j, x_k) = \cos \angle(x_i, x_j, x_k) = \langle e^{ij}, e^{kj} \rangle,

where

(12) e^{ij} = \frac{f(x_i) - f(x_j)}{\| f(x_i) - f(x_j) \|_2},

(13) e^{kj} = \frac{f(x_k) - f(x_j)}{\| f(x_k) - f(x_j) \|_2}.

The loss of RKD-A is also defined by using the Huber loss [14] as

(14) L_{RKD\text{-}A} = \sum_{(i,j,k) \in T} l_\delta\big( \psi_A^T(x_i, x_j, x_k),\; \psi_A^S(x_i, x_j, x_k) \big),

where T is a set of sample triplets randomly generated from the samples in the minibatch.
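The angle-wise potential of Equations (11)–(14) can be sketched as follows, with the triplet index set passed explicitly; the interface is illustrative.

```python
import numpy as np

def angle_potential(f, i, j, k):
    """Cosine of the angle at f[j] formed with f[i] and f[k],
    following Equations (11)-(13)."""
    e_ij = f[i] - f[j]
    e_kj = f[k] - f[j]
    e_ij = e_ij / np.linalg.norm(e_ij)
    e_kj = e_kj / np.linalg.norm(e_kj)
    return float(np.dot(e_ij, e_kj))

def huber(x, delta=1.0):
    """Scalar Huber penalty, as in Equation (10)."""
    ax = abs(x)
    return 0.5 * x ** 2 if ax <= delta else delta * (ax - 0.5 * delta)

def rkd_a_loss(f_teacher, f_student, triplets):
    """RKD-A loss of Equation (14) over an explicit list of
    (i, j, k) index triplets."""
    return sum(huber(angle_potential(f_teacher, i, j, k)
                     - angle_potential(f_student, i, j, k))
               for (i, j, k) in triplets)
```

Since angles are invariant to translation and uniform scaling of the embedding, this potential transfers an even higher-order relational property than pairwise distances.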
They also argued that using both the angles and the Euclidean distances for knowledge transfer would further improve the performance of the student model (RKD-DA). The loss for this case is defined as

(15) L_{RKD\text{-}DA} = \lambda_D L_{RKD\text{-}D} + \lambda_A L_{RKD\text{-}A},

where λ_D and λ_A are hyperparameters.
While point-wise transfer is performed in knowledge distillation methods such as Ba's KD [4] or Hinton's KD [5], the pairwise or triplet-wise similarity between the outputs of the teacher model and the outputs of the student model is considered in RKD-D, RKD-A, and RKD-DA. Thus, these knowledge distillation methods succeeded in achieving top performance.
The loss using the output distribution of the teacher model (soft target) can be used alone or combined with the loss using the labels of the original training data (hard target). The combined loss can be defined as

(16) L = L_{hard} + \lambda L_{soft},

where L_{hard} is the hard target loss and L_{soft} is the soft target loss. A hyperparameter λ is introduced to keep the balance between the hard target and the soft target. It is also possible to combine multiple soft target losses; for example, in Park's KD, L_{RKD-D} and L_{RKD-A} are combined.
III Proposed Method
Previously proposed methods of knowledge distillation focused on minimizing the difference between the teacher model and the student model in order to transfer knowledge of the teacher model. For example, Ba et al. [4] considered the difference in model output as loss and Hinton et al. [5] considered the difference in model output distribution as loss. In other words, they are solving the optimization problem in which the similarity between the function of the teacher model and the function of the student model is maximized. There is a possibility that knowledge of the teacher model can be transferred to the student model by using the metric learning method to embed the neighboring relations of the teacher model in the output space of the student model.
The triplet loss [10], one of the deep metric learning methods, is a powerful tool for learning similarity and dissimilarity (Equation (4)). The triplet loss has a function to reduce the output distance of the "anchor-positive" pair and a function to increase the output distance of the "anchor-negative" pair. We propose to apply this technique to knowledge distillation.
We define the triplet loss for knowledge transfer as

(17) L_{ours} = \sum_{(i,j) \in T} \max\big( \| f_T(x_i) - f_S(x_i) \|_2^2 - \| f_T(x_i) - f_S(x_j) \|_2^2 + m, 0 \big),

where f_T is the output of the teacher model, f_S is the output of the student model, and m is the margin. Further, x_i is a training sample extracted at random, and x_j is a sample classified into a different class from x_i in the case of a soft target. T is the index set of the corresponding sample pairs. The output of the teacher model and the output of the student model for the same sample x_i are considered as "anchor" and "positive", respectively. Also, we consider the output of the student model for the sample x_j of a different class as "negative". Since we are training the student model, the weights of the teacher model are not updated. That is, f_T(x_i) is treated as a constant during the training process.
Our loss contains a term that makes the outputs of the teacher model and the student model closer. This is realized by making the squared error between the outputs of the teacher model and the student model a loss, which is similar to the function proposed by Ba et al. [4]. It would also be easy to define the loss using the KL-divergence, similar to Hinton et al. [5]. In addition, there is a term that increases the distance between the outputs of different classes. In other words, in addition to the conventional loss functions, it is possible to add a constraint that makes the student model's outputs for other classes dissimilar. This allows the student model to clarify the differences in the outputs between classes. We show experimentally that this proposal contributes to the performance improvement of the student model. We also seek further performance improvements in combination with other distillation losses.
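A minimal per-triplet sketch of the proposed loss (Equation (17)); the function name and the margin default of 5 (the value used in the experiments) are for illustration.

```python
import numpy as np

def distillation_triplet_loss(t_i, s_i, s_j, margin=5.0):
    """Proposed triplet loss for knowledge transfer. The teacher output
    t_i for sample x_i is the anchor, the student output s_i for the
    same sample is the positive, and the student output s_j for a
    sample of a different class is the negative. t_i is treated as a
    constant since the teacher is not updated."""
    d_pos = np.sum((t_i - s_i) ** 2)  # pull the student toward the teacher
    d_neg = np.sum((t_i - s_j) ** 2)  # push other-class outputs away
    return max(d_pos - d_neg + margin, 0.0)
```

The first term alone recovers a Ba-style squared-error distillation; the hinge on the second term is what adds the metric-learning constraint that outputs of different classes stay dissimilar.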
IV Experiment
We experimentally investigated the effectiveness of the proposed method on the image classification task. CIFAR10 [15] and Tiny ImageNet [16] are used as datasets.

IV-A Experiments Using CIFAR10
Table I. Architectures of the student and teacher models for CIFAR10.

student model (5 layers)        teacher model (8 layers)
conv1 (channel:32, filter:3)    conv1 (channel:32, filter:3)
max pooling (2×2)               batch normalization (32)
conv2 (channel:32, filter:3)    max pooling (2×2)
max pooling (2×2)               conv2 (channel:32, filter:3)
conv3 (channel:64, filter:3)    batch normalization (32)
max pooling (2×2)               max pooling (2×2)
fully-connected (128)           conv3 (channel:64, filter:3)
output (10)                     batch normalization (64)
                                conv4 (channel:64, filter:3)
                                batch normalization (64)
                                conv5 (channel:128, filter:3)
                                batch normalization (128)
                                max pooling (2×2)
                                fully-connected (512)
                                dropout
                                fully-connected (128)
                                dropout
                                output (10)
CIFAR10 is a dataset that includes color images of 10 kinds of objects such as "automobile" and "dog". The size of each image in CIFAR10 is 32×32 pixels. The numbers of training samples and test samples are 50,000 and 10,000, respectively. We conducted comparative experiments with Ba's KD [4] (BKD), Hinton's KD [5] (HKD), and Park's RKD-DA [13].
In the case of CIFAR10, the 8-layer CNN model shown in Table I was used as the teacher model, and the 5-layer CNN model was used as the student model. The number of trainable parameters of the student model is 161,130, and the number of trainable parameters of the teacher model is 1,256,106. The number of parameters in the student model is about 12.8% of the number of parameters in the teacher model.
The activation function ReLU, defined by

(18) f(x) = \max(0, x),

is used for the outputs of each convolution layer and each fully-connected layer except for the last output layer.
In the training, the optimization is done by using stochastic gradient descent (SGD) with momentum. The learning rate of SGD is initially set to 0.01 and multiplied by 0.1 every 100 epochs. The momentum parameter is set to 0.9. We also introduced weight decay to prevent overfitting. We set the hard target loss as the cross entropy with softmax and combined it with the soft target loss. In addition, we introduced the parameter λ shown in Equation (16) to maintain the balance between the hard target loss and the soft target loss. The temperature parameter of Hinton's KD [5] was set to 4, and the margin m of the triplet loss was set to 5.

method  accuracy

Student model  
Ba’s KD  
Hinton’s KD  
RKD-DA  
Ours KD  
Teacher model 
Table II shows the classification accuracy of the student model on the test samples for each method. From the table, it can be seen that, in the case of CIFAR10, the proposed method achieves higher performance than the other methods.
IV-B Experiments Using Tiny ImageNet
We also performed experiments using the Tiny ImageNet dataset. The Tiny ImageNet dataset includes color images of 200 kinds of objects. The image size is 64×64 pixels, and there are 500 training images and 50 test images per class. For Tiny ImageNet, VGG19 with batch normalization and VGG11 were used as the teacher model and the student model, respectively. The numbers of trainable parameters of the teacher model and the student model are 46,028,808 and 35,213,896, respectively. This means that the number of parameters of the student model is about 76.5% of that of the teacher model. We conducted comparative experiments with Ba's KD [4] (BKD), Hinton's KD [5] (HKD), and Park's RKD-DA [13].
The weights of the convolution layers were pretrained on ImageNet and used as the initial weights for the training. Similarly, SGD with momentum was used for the optimization. The learning rate is initially set to 0.001 and multiplied by 0.9 every 3 epochs. The momentum parameter is set to 0.9. We also used mixup [17] to prevent overfitting. Cross entropy with softmax was used as the hard target loss and was combined with the soft target loss. Also, we introduced the parameter λ shown in Equation (16) to maintain the balance between the hard target loss and the soft target loss. The temperature parameter of Hinton's KD [5] was set to 4, and the margin m of the triplet loss was set to 5.
method  accuracy 

Student model  
Ba’s KD  
Hinton’s KD  
RKD-DA  
Ours KD  
Teacher model 
Table III shows the classification accuracy of the student model for the Tiny ImageNet dataset in each method. The results show that the proposed method gives slightly better performance than the other methods.
IV-C Experiments on the Combined Loss
Park et al. [13] succeeded in achieving even higher performance by combining their method (RKD-DA) with Hinton's method [5]. Here we investigate the effectiveness of combining different loss functions. The loss of each method is combined with Hinton's loss as

(19) L = L_{hard} + \lambda_1 L_{soft} + \lambda_2 L_{HKD}.

For the CIFAR10 and Tiny ImageNet datasets, the same network architectures as in the previous subsections were used. The student models were trained, and the classification performances of the trained student models were calculated.
method  CIFAR10  Tiny ImageNet 

Student model  
Hinton’s KD  
Ba’s KD + HKD  
RKD-DA + HKD  
Ours KD + HKD  
Teacher model 
Table IV shows the classification performance on the test samples for each dataset. From this table, it is noticed that the classification accuracy of Park's KD [13] was improved by combining it with Hinton's loss for the CIFAR10 dataset, but the classification accuracy of the proposed method and Ba's KD [4] was not improved. However, the proposed method still gives the best classification accuracy. For the Tiny ImageNet dataset, all methods showed an increase in classification accuracy when combined with Hinton's KD [5]. The proposed method also gives the best accuracy in this case.
We also consider a combination with Park's KD [13]. We investigate the effect of combining the RKD-DA loss with our loss, as well as combinations with both Hinton's KD [5] and RKD-DA. For the CIFAR10 and Tiny ImageNet datasets, the same network architectures as in the previous subsections were used.
method  CIFAR10  Tiny ImageNet 

Student model  
RKD-DA  
Ours KD  
Ours KD + RKD-DA  
Ours KD + HKD + RKD-DA  
Teacher model 
Table V shows the classification accuracy of the student model for each method. In the case of the CIFAR10 dataset, no performance improvement was obtained even when combined with RKD-DA. In the case of the Tiny ImageNet dataset, we succeeded in improving performance by combining our method with RKD-DA. In addition, using both RKD-DA and Hinton's KD in combination with our method resulted in further performance improvements. Although the combination of our method and Hinton's KD had already improved the performance, the combination with both Hinton's KD and RKD-DA achieved even higher performance.
V Conclusion
In this paper, we applied the concept of metric learning to knowledge distillation. By introducing the concept of metric learning, we transfer to the student model not only the knowledge of the teacher model that should be imitated but also the knowledge that should not be imitated. Our experimental results showed that this approach can significantly improve the performance of the student model. Also, the combination with Hinton's KD [5] and Park's KD [13] succeeded in giving the student model even better performance. This indicates that metric learning contributes to transferring the knowledge of the teacher model to the student model.
On the other hand, the concept of metric learning, such as the triplet loss, has a problem with hard sampling. In hard sampling, samples that are difficult to learn are intentionally selected to obtain consistent training results; if only samples that are easy to learn are used, the effectiveness of the training decreases. Since the proposed method is based on the triplet loss, it is necessary to use hard sampling. Thus, we have to introduce a way to address this problem, like Hermans et al. [18].
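One common remedy is batch-hard mining in the spirit of Hermans et al. [18], adapted here to the distillation setting: for each teacher anchor, choose as the negative the student output of a different class that is currently closest to it. The function name and interface below are illustrative, not from the paper.

```python
import numpy as np

def hardest_negatives(t_out, s_out, labels):
    """For each teacher anchor t_out[i], return the index of the
    student output of a *different* class that is closest to it
    (the hardest negative within the minibatch)."""
    labels = np.asarray(labels)
    picks = []
    for i in range(len(labels)):
        d = np.sum((t_out[i] - s_out) ** 2, axis=1)
        d[labels == labels[i]] = np.inf  # exclude same-class samples
        picks.append(int(np.argmin(d)))
    return picks
```

Selecting the hardest in-batch negative keeps the hinge in Equation (17) active, so the gradient does not vanish on triplets that already satisfy the margin.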
Acknowledgment
This work was partly supported by JSPS KAKENHI Grant Number 16K00239.
References
 [1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
 [2] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

 [3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [4] Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Advances in neural information processing systems, pages 2654–2662, 2014.
 [5] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
 [6] Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. Signature verification using a "Siamese" time delay neural network. In Advances in neural information processing systems, pages 737–744, 1994.
 [7] Sumit Chopra, Raia Hadsell, Yann LeCun, et al. Learning a similarity metric discriminatively, with application to face verification. In CVPR (1), pages 539–546, 2005.
 [8] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), volume 2, pages 1735–1742. IEEE, 2006.

 [9] Jiang Wang, Yang Song, Thomas Leung, Chuck Rosenberg, Jingbin Wang, James Philbin, Bo Chen, and Ying Wu. Learning fine-grained image similarity with deep ranking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1386–1393, 2014.
 [10] Elad Hoffer and Nir Ailon. Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, pages 84–92. Springer, 2015.
 [11] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017.
 [12] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
 [13] Wonpyo Park, Dongju Kim, Yan Lu, and Minsu Cho. Relational knowledge distillation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3967–3976, 2019.

 [14] Peter J Huber. Robust estimation of a location parameter. In Breakthroughs in statistics, pages 492–518. Springer, 1992.
 [15] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Citeseer, 2009.
 [16] Ya Le and Xuan Yang. Tiny imagenet visual recognition challenge. CS 231N, 2015.
 [17] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David LopezPaz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
 [18] Alexander Hermans, Lucas Beyer, and Bastian Leibe. In defense of the triplet loss for person reidentification. arXiv preprint arXiv:1703.07737, 2017.