1 Introduction
Recent years have witnessed the impressive success of CNNs in the area of face recognition (Parkhi et al., 2015; Sun et al., 2014a; Taigman et al., 2014; Sun et al., 2014b). However, effective face recognition CNN models typically consume a large amount of storage and computation, making them difficult to deploy on mobile and embedded devices. To resolve this problem, several lightweight CNN models have been proposed, such as MobileID (Luo et al., 2016), ShiftFaceNet (Wu et al., 2018), and MobileFaceNet (Chen et al., 2018b).
Unfortunately, model size reduction usually coincides with performance decline. Triplet loss (Schroff et al., 2015), as a metric learning method, is widely used in face recognition to further improve accuracy (Deng et al., 2018). Triplet loss explicitly maximizes the interclass distance and meanwhile minimizes the intraclass distance, where a margin term is used to determine the decision boundaries between positive and negative pairs.
In the original triplet loss, the margin is set to a constant, which tends to push the decision boundaries among different classes to the same value and thus loses the hidden similarity structures of different identities. Therefore, it is necessary to set a dynamic margin that takes the similarity structures into account. In this vein, Zakharov et al. (2017) set the margin term as a function of angular differences between the poses for pose estimation; Wang et al. (2018) formulate the adaptive margin as a nonlinear mapping of the average distances among different people for person reidentification. However, they obtain the dynamic margins by handcrafted rules rather than learned distances. In this paper, we propose an enhanced version of triplet loss, named triplet distillation, which borrows the idea of knowledge distillation (Hinton et al., 2015) to determine the dynamic margins for face recognition. Specifically, we determine the similarity between two identities according to distances learned by the teacher model. This similarity, as a kind of knowledge, is then used to guide the student model in optimizing its decision boundaries.
The major contributions of this work can be summarized as follows:

We propose the triplet distillation method to transfer knowledge from a teacher model to a student model for face recognition.

We improve the triplet loss with dynamic margins by utilizing the similarity structures among different identities, which is in contrast with the fixed margin of the original triplet loss.
2 Related Work
Triplet loss. The main purpose of triplet loss (Schroff et al., 2015) is to distinguish identities in the projected space with the guidance of distances among an anchor sample, a positive sample, and a negative sample. Several revisions of the original triplet loss have been proposed, which mainly fall into the following three categories: (1) adding new constraints to the objective function to improve the generalization performance (Cheng et al., 2016; Chen et al., 2017); (2) optimizing the selection of triplet samples to make them more informative, which can lead to faster convergence and better performance (Sohn, 2016; Hermans et al., 2017; Ge, 2018; Dong & Shen, 2018; Ming et al., 2017); (3) proposing dynamic margins for different triplet combinations, such as Zakharov et al. (2017) and Wang et al. (2018), which use handcrafted methods to determine the similarities among different identities. Our method belongs to the last category. Different from previous approaches, we exploit a teacher model to obtain the similarity information among identities to set the dynamic margins.
Knowledge distillation. Knowledge distillation, first proposed by Buciluǎ et al. (2006) and then refined by Hinton et al. (2015), is a model compression method that transfers the knowledge of a large teacher network to a small student network. The main idea is to let the student network learn a mapping function similar to that of the teacher network. Most studies follow Hinton et al. (2015) and learn the soft-target outputs of the teacher network (Fukuda et al., 2017; Sau & Balasubramanian, 2016; Zhou et al., 2018; Furlanello et al., 2018). These methods make the student model match the output distributions of the teacher model. Beyond the output distributions of the teacher model, knowledge can also refer to its feature maps. For example, Romero et al. (2014), Huang & Wang (2017), and Chen et al. (2018a) utilize feature maps of the middle layers to guide the knowledge transfer from the teacher model to the student model. Recent works further broaden the definition of knowledge to other attributes such as attention maps (Zagoruyko & Komodakis, 2016; Huang & Wang, 2017) and affinity among training samples (Chen et al., 2018c). In this paper, we also use the feature similarity between identities as knowledge to guide the training of the student model.
3 The Proposed Method
3.1 Teacher and student networks
We employ the widely used ResNet100 (He et al., 2016) as the teacher model. For the student model, we adopt a slim version of MobileFaceNet (Chen et al., 2018b), which has the same architecture as MobileFaceNet, yet with three quarters of the number of channels in each convolutional layer on average. The detailed statistics of the teacher and student models are summarized in Table 1.
Model  Size/MB  Params/  FLOPs/  Time/s

Teacher model
MobileFaceNet
Student model
Table 1: Statistics of the teacher model, MobileFaceNet, and student model. The FLOPs are counted by TFProf, a profiling tool in TensorFlow. The inference time is averaged by runs of forwarding an image of size on Intel Xeon(R) CPU E5-2609 v4 @1.70GHz with single thread.
3.2 Triplet distillation
Triplet loss is applied to a triplet of samples, represented as $(x^a, x^p, x^n)$. Here $x^a$ is the anchor image; $x^p$ is called the positive image, which belongs to the same identity as $x^a$; and $x^n$ is called the negative image, which belongs to a different identity from $x^a$. The triplet loss aims to minimize the distance between the anchor and positive images, and meanwhile maximize the distance between the anchor and negative images. The objective function of triplet loss can be formulated as
$$L = \frac{1}{N}\sum_{i=1}^{N}\max\big(D(x_i^a, x_i^p) - D(x_i^a, x_i^n) + m,\ 0\big), \qquad (1)$$
where $N$ is the number of triplets in a minibatch and $D(\cdot,\cdot)$ denotes the distance between two images. Notably, the hyperparameter $m$ represents a margin enforced between the positive and negative pairs: a triplet stops contributing to the loss only when the distance of the negative pair exceeds the distance of the positive pair by at least $m$. Naturally, the final distances among different identity clusters will be pushed to the margin $m$.
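As a concrete reading of Equation (1), the following minimal NumPy sketch evaluates the fixed-margin triplet loss on a toy minibatch. The squared Euclidean distance, the margin value, and the function name `triplet_loss` are assumptions made for illustration; the text only requires some distance between two images.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.4):
    """Mean hinge triplet loss over a minibatch of embeddings of shape [N, dim]."""
    # Squared Euclidean distance is an assumed choice of metric.
    d_ap = np.sum((anchor - positive) ** 2, axis=1)  # anchor-positive distances
    d_an = np.sum((anchor - negative) ** 2, axis=1)  # anchor-negative distances
    # A triplet contributes only while d_an does not exceed d_ap by the margin.
    return float(np.mean(np.maximum(d_ap - d_an + margin, 0.0)))

# Two toy triplets in a 2-D embedding space (hypothetical values).
a = np.array([[0.0, 0.0], [0.0, 0.0]])
p = np.array([[0.1, 0.0], [0.5, 0.0]])
n = np.array([[1.0, 0.0], [0.6, 0.0]])
loss = triplet_loss(a, p, n, margin=0.4)
print(loss)
```

In this toy batch the first triplet already satisfies the margin and contributes nothing, while the second one still violates it and accounts for the whole loss.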
In the original triplet loss, the margin is the same for all identities. In other words, all identity clusters will be separated by roughly the same distance, which ignores the subtle similarity structures among different identities, since different people are not equally different. For example, if person A looks more similar to person B than to person C, then it should be better to set the margin for {A, B} smaller than the margin for {A, C}, because such a setting will push A and B closer than A and C in the hyperspace of the student model. In a similar spirit to the dark knowledge proposed in knowledge distillation (Hinton et al., 2015), this similarity structure is informative and useful, but not considered in the original triplet loss. Our proposed triplet distillation method exploits knowledge distillation to bridge this gap.
First, the teacher model extracts the features of a triplet and obtains the distances between them. Then, we map these distances into the margin and apply it to the training of the student model. Different from previous methods that compute the margin with handcrafted rules (Zakharov et al., 2017; Wang et al., 2018), our scheme adopts the well-trained teacher model to calculate the face distance, which is better able to capture the similarity structures in its learned representations. With the proposed dynamic margin term, the objective function over $N$ triplets $(x_i^a, x_i^p, x_i^n)$ of anchor, positive, and negative images can be written as
$$L = \frac{1}{N}\sum_{i=1}^{N}\max\big(D_S(x_i^a, x_i^p) - D_S(x_i^a, x_i^n) + F(d_i),\ 0\big), \qquad (2)$$
$$d_i = \max\big(D_T(x_i^a, x_i^n) - D_T(x_i^a, x_i^p),\ 0\big), \qquad (3)$$
where $D_S$ denotes the distance between two images calculated by the student model, $D_T$ represents the distance calculated by the teacher model, $d_i$ denotes the distance gap between the intraclass and interclass pairs measured by the teacher model, and $F$ represents the function of the margin with regard to this distance. We employ a simple increasing linear function for $F$,
$$F(d) = m_{\min} + \frac{m_{\max} - m_{\min}}{d_{\max}}\, d, \qquad (4)$$
where $m_{\min}$ and $m_{\max}$ represent the minimum and maximum values of the margin, and $d_{\max}$ represents the maximum distance in a minibatch. In this way, the margin is constrained between $m_{\min}$ and $m_{\max}$.
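The dynamic margin described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the training code of this work: the teacher-side quantity is taken to be the gap between the teacher's anchor-negative and anchor-positive distances, clipped at zero; the metric is the squared Euclidean distance; and the margin bounds `m_min` and `m_max`, like all function names, are placeholders.

```python
import numpy as np

def sq_dist(u, v):
    """Row-wise squared Euclidean distance (an assumed choice of metric)."""
    return np.sum((u - v) ** 2, axis=1)

def dynamic_margin(d, d_max, m_min, m_max):
    """Increasing linear map from [0, d_max] onto [m_min, m_max]."""
    return m_min + (m_max - m_min) * d / d_max

def triplet_distillation_loss(student, teacher, m_min=0.2, m_max=0.5):
    """student, teacher: tuples (anchor, positive, negative) of [N, dim] arrays."""
    s_a, s_p, s_n = student
    t_a, t_p, t_n = teacher
    # Teacher-side distance gap per triplet, clipped at zero (assumed form).
    d = np.maximum(sq_dist(t_a, t_n) - sq_dist(t_a, t_p), 0.0)
    d_max = max(float(d.max()), 1e-12)  # largest gap in the minibatch, guarded
    margin = dynamic_margin(d, d_max, m_min, m_max)
    # Student-side hinge loss with a per-triplet margin instead of a constant.
    per_triplet = np.maximum(sq_dist(s_a, s_p) - sq_dist(s_a, s_n) + margin, 0.0)
    return float(per_triplet.mean()), margin

# Toy 2-D embeddings for two triplets; the student reuses the teacher's values.
t_a = np.zeros((2, 2))
t_p = np.array([[0.1, 0.0], [0.1, 0.0]])
t_n = np.array([[1.0, 0.0], [0.5, 0.0]])
loss, margin = triplet_distillation_loss((t_a, t_p, t_n), (t_a, t_p, t_n))
print(loss, margin)
```

In this toy batch the teacher places the first negative farther from the anchor than the second one, so the first triplet receives the largest margin and the second a smaller one, which mirrors the intuition that more similar identities should be separated less aggressively.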
4 Experiments
4.1 Implementation details
Preprocessing. We use MTCNN (Zhang et al., 2016) to detect faces and facial landmarks on the MSCeleb1M dataset (Guo et al., 2016), which consists of million photos of celebrities. To obtain data of higher quality, million photos from identities are picked out to make a refined MSCeleb1M dataset (Deng et al., 2018). All the images are aligned based on the detected landmarks and then resized to with normalization (subtracted by mean and divided by standard deviation ).
Training. The architectures of the teacher and student models are described in Section 3.1. Both of them are first trained from scratch with the ArcFace loss (Deng et al., 2018). Stochastic Gradient Descent (SGD) is used with momentum and batch size . The learning rate begins with and is divided by at iteration and , before the training finally ends at iteration .
Then the proposed triplet distillation is used to fine-tune the student model. During this stage, there are classes, images per class in each minibatch. The learning rate is and the training stops at iterations. We randomly sample triplets from the refined MSCeleb1M dataset to obtain different distance values (Equation (3)). Then the largest one is chosen as , and the smallest one as . TensorFlow (Abadi et al., 2016) is used in all our experiments. Our source code and trained models will be made available to the public.
Evaluation. In the evaluation stage, we extract the features of each image and its horizontally flipped image. Then the two features are concatenated as one for face verification using the cosine distance. Three popular face verification datasets are considered here: LFW (Huang et al., 2008), CPLFW (Zheng & Deng, 2018), and AgeDB (Moschoglou et al., 2017). For LFW and CPLFW, we adopt all the provided pairs ( positive and negative pairs for each dataset); for AgeDB, which has different year gaps, we only choose one of them with positive and negative pairs as our evaluation dataset.
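The evaluation protocol above (features of an image and its horizontal flip concatenated, then compared by cosine distance) can be sketched as follows. The embedding function `toy_embed`, the decision threshold, and the helper names are hypothetical stand-ins; in practice the embedding would come from the trained student or teacher network.

```python
import numpy as np

def embed_with_flip(embed_fn, image):
    """Concatenate the features of an image and of its horizontal mirror."""
    flipped = image[:, ::-1, :]  # reverse the width axis of an H x W x C array
    return np.concatenate([embed_fn(image), embed_fn(flipped)])

def cosine_distance(u, v):
    """One minus the cosine similarity of two feature vectors."""
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def same_identity(feat_a, feat_b, threshold):
    """Declare a pair positive when its cosine distance is below the threshold."""
    return cosine_distance(feat_a, feat_b) < threshold

def toy_embed(image):
    """Hypothetical stand-in for a network: mean intensity of each image column."""
    return image.mean(axis=(0, 2))

img = np.arange(24, dtype=float).reshape(2, 4, 3)  # toy 2 x 4 image, 3 channels
feat = embed_with_flip(toy_embed, img)
print(feat.shape, same_identity(feat, feat, threshold=0.5))
```

Concatenating the two views doubles the feature dimension, so the same procedure has to be applied to both images of a verification pair before the distance is computed.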
4.2 Experimental results
As shown in Table 2, the pretrained teacher model reaches 99.73% on LFW, 92.85% on CPLFW, and 98.25% on AgeDB30. The student model trained by ArcFace reaches 98.75% on LFW, 78.53% on CPLFW, and 93.53% on AgeDB30.
For comparison with the original triplet loss, we set the fixed margin as 0.3, 0.4, and 0.5, which are chosen based on our validation for the best performance of triplet loss. After applying the proposed triplet distillation to the student model, its verification accuracy is boosted to 99.27% on LFW, 81.28% on CPLFW, and 94.25% on AgeDB30. In other words, the accuracy of triplet distillation is consistently higher than that of the original triplet loss with any of the fixed margins.
Model  LFW  AgeDB30  CPLFW 

Teacher  99.73%  98.25%  92.85% 
Student  98.75%  93.53%  78.53% 
Student+Triplet loss  
margin=0.3  99.21%  94.08%  80.80% 
margin=0.4  99.23%  94.00%  81.16% 
margin=0.5  99.20%  93.80%  80.38% 
Student+Ours  99.27%  94.25%  81.28% 
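To make the size of the improvement explicit, the short script below recomputes, from the numbers in Table 2 only, the gain of triplet distillation over the best fixed margin on each dataset. The variable names are illustrative.

```python
# Verification accuracy (%) copied from Table 2.
fixed_margin = {
    0.3: {"LFW": 99.21, "AgeDB30": 94.08, "CPLFW": 80.80},
    0.4: {"LFW": 99.23, "AgeDB30": 94.00, "CPLFW": 81.16},
    0.5: {"LFW": 99.20, "AgeDB30": 93.80, "CPLFW": 80.38},
}
ours = {"LFW": 99.27, "AgeDB30": 94.25, "CPLFW": 81.28}

gains = {}
for dataset, acc in ours.items():
    # Best accuracy reached by any fixed margin on this dataset.
    best_fixed = max(row[dataset] for row in fixed_margin.values())
    gains[dataset] = round(acc - best_fixed, 2)

print(gains)  # {'LFW': 0.04, 'AgeDB30': 0.17, 'CPLFW': 0.12}
```

Note that the best fixed margin differs across datasets (0.4 on LFW and CPLFW, 0.3 on AgeDB30), whereas the dynamic margin is compared against the strongest fixed setting in each column.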
5 Conclusion
We propose triplet distillation for deep face recognition, which takes advantage of knowledge distillation to generate dynamic margins to enhance triplet loss. The distance obtained by the teacher model reflects similarity information between different identities, which can be regarded as a new type of knowledge. Our experiments show that the proposed method delivers a consistent accuracy improvement over the original triplet loss with a fixed margin.
References
 Abadi et al. (2016) Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. TensorFlow: A system for large-scale machine learning. In Symposium on Operating Systems Design and Implementation, 2016.
 Buciluǎ et al. (2006) Buciluǎ, C., Caruana, R., and NiculescuMizil, A. Model compression. In SIGKDD, 2006.
 Chen et al. (2018a) Chen, H., Wang, Y., Xu, C., Xu, C., and Tao, D. Learning student networks via feature embedding. arXiv preprint arXiv:1812.06597, 2018a.
 Chen et al. (2018b) Chen, S., Liu, Y., Gao, X., and Han, Z. Mobilefacenets: Efficient cnns for accurate realtime face verification on mobile devices. In Chinese Conference on Biometric Recognition, 2018b.
 Chen et al. (2017) Chen, W., Chen, X., Zhang, J., and Huang, K. Beyond triplet loss: a deep quadruplet network for person reidentification. In CVPR, 2017.
 Chen et al. (2018c) Chen, Y., Wang, N., and Zhang, Z. Darkrank: Accelerating deep metric learning via cross sample similarities transfer. In AAAI, 2018c.

 Cheng et al. (2016) Cheng, D., Gong, Y., Zhou, S., Wang, J., and Zheng, N. Person reidentification by multichannel partsbased cnn with improved triplet loss function. In CVPR, 2016.
 Deng et al. (2018) Deng, J., Guo, J., Xue, N., and Zafeiriou, S. Arcface: Additive angular margin loss for deep face recognition. arXiv preprint arXiv:1801.07698, 2018.
 Dong & Shen (2018) Dong, X. and Shen, J. Triplet loss in siamese network for object tracking. In ECCV, 2018.
 Fukuda et al. (2017) Fukuda, T., Suzuki, M., Kurata, G., Thomas, S., Cui, J., and Ramabhadran, B. Efficient knowledge distillation from an ensemble of teachers. In Interspeech, 2017.
 Furlanello et al. (2018) Furlanello, T., Lipton, Z. C., Tschannen, M., Itti, L., and Anandkumar, A. Born again neural networks. arXiv preprint arXiv:1805.04770, 2018.
 Ge (2018) Ge, W. Deep metric learning with hierarchical triplet loss. In ECCV, 2018.
 Guo et al. (2016) Guo, Y., Zhang, L., Hu, Y., He, X., and Gao, J. Msceleb1m: A dataset and benchmark for largescale face recognition. In ECCV, 2016.
 He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, 2016.
 Hermans et al. (2017) Hermans, A., Beyer, L., and Leibe, B. In defense of the triplet loss for person reidentification. arXiv preprint arXiv:1703.07737, 2017.
 Hinton et al. (2015) Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
 Huang et al. (2008) Huang, G. B., Mattar, M., Berg, T., and Learned-Miller, E. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. In Workshop on Faces in 'Real-Life' Images: Detection, Alignment, and Recognition, 2008.
 Huang & Wang (2017) Huang, Z. and Wang, N. Like what you like: Knowledge distill via neuron selectivity transfer. arXiv preprint arXiv:1707.01219, 2017.

 Luo et al. (2016) Luo, P., Zhu, Z., Liu, Z., Wang, X., and Tang, X. Face model compression by distilling knowledge from neurons. In AAAI, 2016.
 Ming et al. (2017) Ming, Z., Chazalon, J., Luqman, M. M., Visani, M., and Burie, J.C. Simple triplet loss based on intra/interclass metric learning for face verification. In ICCVW, 2017.
 Moschoglou et al. (2017) Moschoglou, S., Papaioannou, A., Sagonas, C., Deng, J., Kotsia, I., and Zafeiriou, S. Agedb: the first manually collected, inthewild age database. In CVPR, 2017.
 Parkhi et al. (2015) Parkhi, O. M., Vedaldi, A., Zisserman, A., et al. Deep face recognition. In BMVC, 2015.
 Romero et al. (2014) Romero, A., Ballas, N., Kahou, S. E., Chassang, A., Gatta, C., and Bengio, Y. Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.
 Sau & Balasubramanian (2016) Sau, B. B. and Balasubramanian, V. N. Deep model compression: Distilling knowledge from noisy teachers. arXiv preprint arXiv:1610.09650, 2016.
 Schroff et al. (2015) Schroff, F., Kalenichenko, D., and Philbin, J. Facenet: A unified embedding for face recognition and clustering. In CVPR, 2015.
 Sohn (2016) Sohn, K. Improved deep metric learning with multiclass npair loss objective. In NeurIPS, 2016.
 Sun et al. (2014a) Sun, Y., Chen, Y., Wang, X., and Tang, X. Deep learning face representation by joint identificationverification. In NeurIPS, 2014a.
 Sun et al. (2014b) Sun, Y., Wang, X., and Tang, X. Deep learning face representation from predicting 10,000 classes. In CVPR, 2014b.
 Taigman et al. (2014) Taigman, Y., Yang, M., Ranzato, M., and Wolf, L. Deepface: Closing the gap to humanlevel performance in face verification. In CVPR, 2014.
 Wang et al. (2018) Wang, J., Zhou, S., Wang, J., and Hou, Q. Deep ranking model by large adaptive margin learning for person reidentification. PR, 74:241–252, 2018.
 Wu et al. (2018) Wu, B., Wan, A., Yue, X., Jin, P., Zhao, S., Golmant, N., Gholaminejad, A., Gonzalez, J., and Keutzer, K. Shift: A zero flop, zero parameter alternative to spatial convolutions. In CVPR, 2018.
 Zagoruyko & Komodakis (2016) Zagoruyko, S. and Komodakis, N. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928, 2016.
 Zakharov et al. (2017) Zakharov, S., Kehl, W., Planche, B., Hutter, A., and Ilic, S. 3d object instance recognition and pose estimation using triplet loss with dynamic margin. In IROS, 2017.
 Zhang et al. (2016) Zhang, K., Zhang, Z., Li, Z., and Qiao, Y. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10):1499–1503, 2016.
 Zheng & Deng (2018) Zheng, T. and Deng, W. Crosspose lfw: A database for studying crosspose face recognition in unconstrained environments. Technical Report 1801, Beijing University of Posts and Telecommunications, February 2018.
 Zhou et al. (2018) Zhou, G., Fan, Y., Cui, R., Bian, W., Zhu, X., and Gai, K. Rocket launching: A universal and efficient framework for training wellperforming light net. In AAAI, 2018.