
Triplet Distillation for Deep Face Recognition

by Yushu Feng, et al.

Convolutional neural networks (CNNs) have achieved great success in face recognition, which unfortunately comes at the cost of massive computation and storage consumption. Many compact face recognition networks have thus been proposed to resolve this problem. Triplet loss is effective in further improving the performance of those compact models. However, it normally applies a fixed margin to all samples, which neglects the informative similarity structures between different identities. In this paper, we propose an enhanced version of triplet loss, named triplet distillation, which exploits the capability of a teacher model to transfer the similarity information to a small model by adaptively varying the margin between positive and negative pairs. Experiments on the LFW, AgeDB, and CPLFW datasets show the merits of our method compared to the original triplet loss.


1 Introduction

Recent years have witnessed the impressive success of CNNs in the area of face recognition (Parkhi et al., 2015; Sun et al., 2014a; Taigman et al., 2014; Sun et al., 2014b). However, effective face recognition CNN models typically consume a large amount of storage and computation, making them difficult to deploy on mobile and embedded devices. To resolve this problem, several lightweight CNN models have been proposed, such as MobileID (Luo et al., 2016), ShiftFaceNet (Wu et al., 2018), and MobileFaceNet (Chen et al., 2018b).

Unfortunately, model size reduction usually coincides with performance decline. Triplet loss (Schroff et al., 2015), as a metric learning method, is widely used in face recognition to further improve accuracy (Deng et al., 2018). Triplet loss explicitly maximizes the inter-class distance and meanwhile minimizes the intra-class distance, where a margin term is used to determine the decision boundaries between positive and negative pairs.

In the original triplet loss, the margin is set to a constant, which tends to push the decision boundaries among different classes to the same value and thus loses the hidden similarity structures of different identities. Therefore, it is necessary to set a dynamic margin that takes the similarity structures into account. In this vein, Zakharov et al. (2017) set the margin term as a function of angular differences between the poses for pose estimation, and Wang et al. (2018) formulate the adaptive margin as a nonlinear mapping of the average distances among different people for person re-identification. However, they obtain the dynamic margins by handcrafted rules rather than learned distances. In this paper, we propose an enhanced version of triplet loss, named triplet distillation, which borrows the idea of knowledge distillation (Hinton et al., 2015) to determine the dynamic margins for face recognition. Specifically, we determine the similarity between two identities according to distances learned by the teacher model. This similarity, as a kind of knowledge, is then used to guide the student model in optimizing its decision boundaries.

The major contributions of this work can be summarized as follows:

  • We propose the triplet distillation method to transfer knowledge from a teacher model to a student model for face recognition.

  • We improve the triplet loss with dynamic margins by utilizing the similarity structures among different identities, which is in contrast with the fixed margin of the original triplet loss.

  • Experiments on LFW (Huang et al., 2008), AgeDB (Moschoglou et al., 2017) and CPLFW (Zheng & Deng, 2018) show that the proposed method performs favorably against the original scheme.

2 Related Work

Triplet loss. The main purpose of triplet loss (Schroff et al., 2015) is to distinguish identities in the projected space with the guidance of distances among an anchor sample, a positive sample, and a negative sample. There are several revisions of the original triplet loss, which mainly fall into the following three categories: (1) adding new constraints to the objective function to improve the generalization performance (Cheng et al., 2016; Chen et al., 2017); (2) optimizing the selection of triplet samples to make them more informative, which can lead to faster convergence and better performance (Sohn, 2016; Hermans et al., 2017; Ge, 2018; Dong & Shen, 2018; Ming et al., 2017); (3) proposing dynamic margins for different triplet combinations, such as (Zakharov et al., 2017; Wang et al., 2018), which use handcrafted methods to determine the similarities among different identities. Our method belongs to the last category. Different from previous approaches, we exploit a teacher model to obtain the similarity information among identities to set the dynamic margins.

Knowledge distillation. Knowledge distillation, first proposed by Buciluǎ et al. (2006) and then refined by Hinton et al. (2015), is a model compression method that transfers the knowledge of a large teacher network to a small student network. The main idea is to let the student network learn a mapping function similar to that of the teacher network. Most studies follow Hinton et al. (2015) and learn the soft-target outputs of the teacher network (Fukuda et al., 2017; Sau & Balasubramanian, 2016; Zhou et al., 2018; Furlanello et al., 2018). These methods make the student model match the output distributions of the teacher model. Not confined to the output distributions of the teacher model, the definition of knowledge can also refer to its feature maps. For example, (Romero et al., 2014; Huang & Wang, 2017; Chen et al., 2018a) utilize feature maps of the middle layers to guide the knowledge transfer from the teacher model to the student model. Recent works further broaden the definition of knowledge to other attributes such as attention maps (Zagoruyko & Komodakis, 2016; Huang & Wang, 2017) and affinity among training samples (Chen et al., 2018c). In this paper, we use the knowledge of feature similarity between identities as guidance to train the student model.

3 The Proposed Method

3.1 Teacher and student networks

We employ the widely-used ResNet-100 (He et al., 2016) as the teacher model. For the student model, we adopt a slim version of MobileFaceNet (Chen et al., 2018b), which has the same architecture as MobileFaceNet, yet with three quarters of the number of channels in each convolutional layer on average. The detailed statistics of the teacher and student model are summarized in Table 1.

Model Size/MB Params/ FLOPs/ Time/s
Teacher model
Student model
Table 1: Comparison between teacher model, MobileFaceNet (Chen et al., 2018b), and student model. The FLOPs are counted by TFProf, a profiling tool in TensorFlow. The inference time is averaged over runs of forwarding an image on an Intel Xeon(R) CPU E5-2609 v4 @ 1.70GHz with a single thread.

3.2 Triplet distillation

Triplet loss is applied to a triplet of samples, represented as $(x_a, x_p, x_n)$. Here $x_a$ is the anchor image; $x_p$ is called the positive image, which belongs to the same identity as $x_a$; and $x_n$ is called the negative image, which belongs to a different identity from $x_a$. The triplet loss aims to minimize the distance between the anchor and positive images, and meanwhile maximize the distance between the anchor and negative images. The objective function of triplet loss can be formulated as

$$\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \max\big( D(x_a^i, x_p^i) - D(x_a^i, x_n^i) + m,\ 0 \big), \tag{1}$$

where $N$ is the number of triplets in a mini-batch and $D(\cdot,\cdot)$ denotes the distance between two images. Notably, the hyper-parameter $m$ represents a margin enforced between the positive and negative pairs; that is, only when the distance of the negative pair exceeds the distance of the positive pair by more than the threshold $m$ will the loss $\mathcal{L}$ not count. Naturally, the final distances among different identity clusters will be pushed to the margin $m$.
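The fixed-margin objective described above can be sketched in a few lines. This is only an illustration under stated assumptions: the Euclidean distance is one possible choice for the distance between two images (the text does not fix it), features are plain Python lists rather than network outputs, and the names `euclidean_distance` and `triplet_loss` as well as the toy values are hypothetical.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length feature vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(triplets, margin, distance=euclidean_distance):
    """Mean hinge loss over (anchor, positive, negative) feature triplets."""
    losses = []
    for anchor, positive, negative in triplets:
        d_ap = distance(anchor, positive)   # anchor-positive distance
        d_an = distance(anchor, negative)   # anchor-negative distance
        # A triplet contributes only while the negative is not yet
        # farther than the positive by at least the margin.
        losses.append(max(d_ap - d_an + margin, 0.0))
    return sum(losses) / len(losses)


# Toy 2-D features (hypothetical values, not taken from the experiments).
easy = ([0.0, 0.0], [0.1, 0.0], [1.0, 0.0])  # negative already far away
hard = ([0.0, 0.0], [0.5, 0.0], [0.6, 0.0])  # negative barely farther than positive

print(triplet_loss([easy, hard], margin=0.3))
```

In this toy batch only the hard triplet violates the margin, so it alone contributes to the loss; the same margin is applied to both triplets regardless of how similar the identities are.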

In the original triplet loss, $m$ is the same for all identities. In other words, all identity clusters will be separated by roughly the same distance, which ignores the subtle similarity structures among different identities, since different people are not equally different. For example, if person $A$ looks more similar to person $B$ than to person $C$, then it should be better to set the $m$ for $\{A, B\}$ smaller than the $m$ for $\{A, C\}$, because such a setting will push $A$ and $B$ closer than $A$ and $C$ in the hyperspace of the student model. In a similar spirit to the dark knowledge proposed in knowledge distillation (Hinton et al., 2015), this similarity structure is informative and useful, but not considered in the original triplet loss. Our proposed triplet distillation method exploits knowledge distillation to bridge this gap.

First, the teacher model extracts two features from a triplet and obtains the distance between them. Then, we map this distance into the margin and apply it to the training of the student model. Different from previous mathematical angle calculation methods (Zakharov et al., 2017; Wang et al., 2018), our scheme adopts the well-trained teacher model to calculate the face distance, which has more capability to capture the similarity structures in its learned representations. With the proposed dynamic margin term, the objective function can be written as


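$$\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \max\big( D_S(x_a^i, x_p^i) - D_S(x_a^i, x_n^i) + F(d_i),\ 0 \big), \tag{2}$$

$$d_i = \max\big( D_T(x_a^i, x_n^i) - D_T(x_a^i, x_p^i),\ 0 \big), \tag{3}$$

where $D_S(\cdot,\cdot)$ denotes the distance between two images calculated by the student model, $D_T(\cdot,\cdot)$ represents the distance calculated by the teacher model, $d$ denotes the distance between intra-class and inter-class features extracted by the teacher model, and $F(\cdot)$ represents the function of the margin with regard to the distance. We employ a simple increasing linear function for $F$,

$$F(d) = \frac{m_{max} - m_{min}}{d_{max}}\, d + m_{min}, \tag{4}$$

where $m_{min}$ and $m_{max}$ represent the minimum and maximum values of the margin, and $d_{max}$ represents the maximum distance in a mini-batch. In this way, the margin is constrained between $m_{min}$ and $m_{max}$.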
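The dynamic margin can be illustrated with a small sketch. It rests on several assumptions that are not confirmed by the text: the teacher distance for a triplet is taken to be the anchor-negative distance minus the anchor-positive distance, clipped at zero; the Euclidean distance is used in both feature spaces; and the maximum distance is taken over the current mini-batch. All function names and toy values are hypothetical.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length feature vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def teacher_gap(anchor, positive, negative):
    """Assumed teacher distance d: how much farther the negative is than the positive."""
    gap = euclidean_distance(anchor, negative) - euclidean_distance(anchor, positive)
    return max(gap, 0.0)


def linear_margin(d, m_min, m_max, d_max):
    """Increasing linear map from d in [0, d_max] to a margin in [m_min, m_max]."""
    if d_max == 0.0:
        return m_min  # degenerate batch: fall back to the smallest margin
    return (m_max - m_min) / d_max * d + m_min


def triplet_distillation_loss(student_triplets, teacher_triplets, m_min, m_max):
    """Mean hinge loss on student features with per-triplet, teacher-derived margins."""
    gaps = [teacher_gap(*t) for t in teacher_triplets]
    d_max = max(gaps)  # maximum teacher distance in this mini-batch
    losses = []
    for (anchor, positive, negative), d in zip(student_triplets, gaps):
        margin = linear_margin(d, m_min, m_max, d_max)
        d_ap = euclidean_distance(anchor, positive)
        d_an = euclidean_distance(anchor, negative)
        losses.append(max(d_ap - d_an + margin, 0.0))
    return sum(losses) / len(losses)


# Toy features (hypothetical): the teacher sees the first negative as a
# look-alike (small gap) and the second as clearly different (large gap).
teacher_batch = [
    ([0.0, 0.0], [0.1, 0.0], [0.5, 0.0]),  # d = 0.4
    ([0.0, 0.0], [0.1, 0.0], [1.1, 0.0]),  # d = 1.0
]
student_batch = [
    ([0.0, 0.0], [0.2, 0.0], [0.4, 0.0]),
    ([0.0, 0.0], [0.2, 0.0], [1.0, 0.0]),
]

print(triplet_distillation_loss(student_batch, teacher_batch, m_min=0.2, m_max=0.5))
```

With these toy values the look-alike pair receives a margin near the lower bound and the dissimilar pair receives the upper bound, so the student is asked to separate similar identities less aggressively than dissimilar ones.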

4 Experiments

4.1 Implementation details

Pre-processing. We use MTCNN (Zhang et al., 2016) to detect faces and facial landmarks on the MS-Celeb-1M dataset (Guo et al., 2016), which consists of  million photos of  celebrities. To obtain data of higher quality, million photos from identities are picked out to make a refined MS-Celeb-1M dataset (Deng et al., 2018). All the images are aligned based on the detected landmarks and then resized to with normalization (subtracted by mean and divided by standard deviation ).
Training. The architectures of the teacher and student models are described in Section 3.1. Both of them are first trained from scratch with the ArcFace loss (Deng et al., 2018). Stochastic Gradient Descent (SGD) is used with momentum and batch size . The learning rate begins with and is divided by at iteration and , before the training finally ends at iteration .

Then the proposed triplet distillation is used to fine-tune the student model. During this stage, there are  classes, images per class in each mini-batch. The learning rate is and the training stops at iterations. We randomly sample triplets from the refined MS-Celeb-1M dataset to obtain different $d$'s (Equation (3)). Then the largest one is chosen as , and the smallest one as . TensorFlow (Abadi et al., 2016) is used in all our experiments. Our source code and trained models will be made available to the public.

Evaluation. In the evaluation stage, we extract the features of each image and its horizontally flipped image. Then the two features are concatenated as one for face verification using the cosine distance. Three popular face verification datasets are considered here: LFW (Huang et al., 2008), CPLFW (Zheng & Deng, 2018), and AgeDB (Moschoglou et al., 2017). For LFW and CPLFW, we adopt all the provided pairs ( positive and negative pairs for each dataset); for AgeDB, which has different year gaps, we only choose one of them with positive and negative pairs as our evaluation dataset.
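The verification procedure described above (embed the image and its horizontal flip, concatenate, compare by cosine distance) can be sketched as follows. The embedding function `toy_embed` is a hypothetical stand-in for the trained student network, images are nested lists of pixel values, and the decision threshold is an assumption, since the text does not state how it is chosen.

```python
import math


def horizontal_flip(image):
    """Mirror an image stored as a list of pixel rows."""
    return [list(reversed(row)) for row in image]


def flip_augmented_feature(image, embed):
    """Concatenate the features of an image and of its horizontally flipped copy."""
    return list(embed(image)) + list(embed(horizontal_flip(image)))


def cosine_distance(u, v):
    """One minus the cosine similarity of two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def same_identity(image_1, image_2, embed, threshold):
    """Declare a pair positive when the cosine distance is within the threshold."""
    f1 = flip_augmented_feature(image_1, embed)
    f2 = flip_augmented_feature(image_2, embed)
    return cosine_distance(f1, f2) <= threshold


def toy_embed(image):
    """Hypothetical embedding: column sums (a real system would run the network here)."""
    return [sum(column) for column in zip(*image)]


face_a = [[1, 2], [3, 4]]
face_b = [[9, 0], [8, 0]]

print(flip_augmented_feature(face_a, toy_embed))
print(same_identity(face_a, face_a, toy_embed, threshold=0.1))
```

Concatenating the two features doubles the descriptor length and makes the descriptor less sensitive to left-right pose differences, at the cost of a second forward pass per image.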

4.2 Experimental results

As shown in Table 2, the pre-trained teacher model reaches 99.73% on LFW, 92.85% on CPLFW, and 98.25% on AgeDB-30. The student model trained by ArcFace reaches 98.75% on LFW, 78.53% on CPLFW, and 93.53% on AgeDB-30.

For comparison with the original triplet loss, we set the fixed margin $m$ to 0.3, 0.4, and 0.5, which are chosen based on our validation for the best performance of triplet loss. After applying the proposed triplet distillation to the student model, its verification accuracy is boosted to 99.27% on LFW, 81.28% on CPLFW, and 94.25% on AgeDB-30. In other words, the accuracy of triplet distillation is consistently higher than that of the original triplet loss with any of the fixed margins.

Model                                LFW      AgeDB-30   CPLFW
Teacher                              99.73%   98.25%     92.85%
Student                              98.75%   93.53%     78.53%
Student+Triplet loss (margin=0.3)    99.21%   94.08%     80.80%
Student+Triplet loss (margin=0.4)    99.23%   94.00%     81.16%
Student+Triplet loss (margin=0.5)    99.20%   93.80%     80.38%
Student+Ours                         99.27%   94.25%     81.28%
Table 2: Comparison of the proposed Triplet Distillation with triplet loss.

5 Conclusion

We propose triplet distillation for deep face recognition, which takes advantage of knowledge distillation to generate dynamic margins that enhance triplet loss. The distance obtained by the teacher model reflects similarity information between different identities, which can be regarded as a new type of knowledge. Compared with the original triplet loss, our experiments on LFW, AgeDB-30, and CPLFW show that the proposed method delivers a consistent, if modest, improvement in verification accuracy.