Minimum Margin Loss for Deep Face Recognition

05/17/2018 ∙ by Xin Wei, et al. ∙ Ulster University

Face recognition has achieved great progress owing to the fast development of deep neural networks in the past few years. As the baton directing the training of a deep neural network, a number of loss functions have been proposed which significantly improve on the state-of-the-art methods. In this paper, we propose a new loss function called Minimum Margin Loss (MML), which aims at enlarging the margin of over-close class centre pairs so as to enhance the discriminative ability of the deep features. MML supervises the training process together with the Softmax loss and the Centre loss, and also compensates for a defect of the Softmax + Centre loss combination. The experimental results on the LFW and YTF datasets show that the proposed method achieves state-of-the-art performance, which demonstrates the effectiveness of the proposed MML.


I Introduction

In the past ten years, Deep Learning-based methods have achieved great progress in various computer vision areas, including face recognition [1, 2, 3, 4], object recognition [5, 6, 7], action recognition [8, 9, 10, 11] and so on. Among these areas, the progress on face recognition is particularly remarkable because of the development of two important aspects – larger face datasets and better loss functions.

As a crucial factor, the scale and the quality of the training dataset directly influence the performance of a DNN model. Currently, there are a few publicly available large-scale face datasets, for example, MS-Celeb-1M [12], VGGFace2 [13], MegaFace [14] and CASIA WebFace [15]. As shown in Table I, CASIA WebFace consists of 0.5M face images; VGGFace2 contains 3M face images in total but only from 9K identities; while MS-Celeb-1M and MegaFace contain many more identities and images, and thus have greater potential for training a better DNN model. However, both MS-Celeb-1M and MegaFace suffer from a long-tailed distribution [16], which means a minority of identities own the majority of face images while a large number of identities have very limited face images. Trained on a dataset with a long-tailed distribution, a model tends to overfit the classes with rich samples and loses generalisation ability on the long-tailed portion [16]. Specifically, in order to separate different classes, the classes with rich samples tend to have relatively larger margins between their class centres; conversely, as the classes with limited samples occupy only a small space and are easily compressed and separated, they tend to have relatively smaller margins between their class centres. Thus, it is reasonable to consider setting a minimum margin to rectify this bias. In this paper, we focus on over-close class centre pairs and propose our loss based on the minimum margin.

Dataset         MS-Celeb-1M  VGGFace2  MegaFace  CASIA
#Identities     100K         9K        672K      11K
#Images         10M          3M        5M        0.5M
Avg per Person  105          323       7         47
TABLE I: Statistics of recent publicly available large-scale face datasets.

Besides the training set, another important aspect is the loss function, which directs the network to optimise its weights during the training process. Nowadays, the best performing loss functions can be roughly divided into two types [17, 18, 19]: loss functions based on Euclidean distance and loss functions based on Cosine distance. Most of them are derived from Softmax Loss by adding a penalty or by modifying softmax directly.

The loss functions based on Euclidean distance include Contrastive Loss [2], Triplet Loss [1], Centre Loss [20], Range Loss [16], Marginal Loss [17] and so on. These methods aim at improving the discriminative ability of features by maximising the inter-class distance or minimising the intra-class distance. Contrastive Loss feeds the network with two types of sample pair – positive pairs (two face images from the same class) and negative pairs (two face images from different classes). It minimises the Euclidean distance of the positive pairs and penalises the negative pairs whose distance is smaller than a threshold. Triplet Loss uses triplets as input, each comprising a positive sample, a negative sample and an anchor; the anchor is itself a positive sample that may initially lie closer to some negative samples than to some positive ones. During training, the anchor-positive pairs are pulled together while the anchor-negative pairs are pushed apart as much as possible (see the sketch below). However, the selection of the sample pairs and the triplets is laborious and time-consuming for both Contrastive Loss and Triplet Loss. Centre Loss, Marginal Loss and Range Loss add another penalty to implement joint supervision with Softmax Loss. Specifically, Centre Loss adds a penalty to Softmax by calculating and restricting the distances between the within-class samples and the corresponding class centre. Marginal Loss considers all the sample pairs in a batch and forces sample pairs from different classes to have a distance larger than a threshold while forcing samples from the same class to have a distance smaller than that threshold. However, it is over-strict to force the two farthest samples in a class to have a distance smaller than that of the two nearest samples from different classes, which makes the training procedure hard to converge. Range Loss calculates the distances of the samples within each class, and chooses the two sample pairs with the largest distances as the intra-class constraint; simultaneously, it calculates the distance of each class centre pair, and forces the centre pair with the smallest distance to have a margin larger than a designated threshold. However, considering only one centre pair at a time is not comprehensive, as more centre pairs may have margins smaller than the designated threshold, and the training procedure is hard to fully converge because of the slow learning speed.
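To make the two pairwise formulations concrete, the sketch below computes the two penalties for a single pair and a single triplet in NumPy. The function names and margin values are illustrative, not taken from the cited papers.

```python
import numpy as np

def contrastive_loss(x1, x2, same_class, margin=1.0):
    """Contrastive penalty for one pair: pull positive pairs together,
    push negative pairs beyond `margin` (illustrative value)."""
    d = np.linalg.norm(x1 - x2)
    if same_class:
        return d ** 2                         # minimise distance of a positive pair
    return max(0.0, margin - d) ** 2          # penalise an over-close negative pair

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet penalty: the anchor-positive distance should undercut the
    anchor-negative distance by at least `margin` (illustrative value)."""
    d_ap = np.sum((anchor - positive) ** 2)
    d_an = np.sum((anchor - negative) ** 2)
    return max(0.0, d_ap - d_an + margin)
```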

The loss functions based on Cosine distance include L-Softmax Loss [21], A-Softmax Loss [22], AM-Softmax Loss [18], ArcFace [19] and so on. L-Softmax reformulates the output of the softmax layer from $W_j^T x_i + b_j$ to $\|W_j\| \|x_i\| \cos\theta_j$ so as to transform the Euclidean distance to the Cosine distance, and also adds a multiplicative angular constraint, $\|W_{y_i}\| \|x_i\| \cos(m\theta_{y_i})$, to enlarge the angular margins between different identities. Based on L-Softmax Loss, A-Softmax applies weight normalisation ($\|W_j\| = 1$), so the target logit is further reformulated to $\|x_i\| \cos(m\theta_{y_i})$, which simplifies the training target. However, after applying the same multiplicative angular constraint, both L-Softmax and A-Softmax Loss are difficult to converge, so an annealing optimisation strategy is adopted by these two methods to help the algorithm converge. To improve the convergence of A-Softmax, Wang et al. [18] propose AM-Softmax, which replaces the multiplicative angular constraint with an additive one, namely, transforms $\cos(m\theta_{y_i})$ to $\cos\theta_{y_i} - m$. Besides, AM-Softmax also applies feature normalisation and introduces a global scaling factor $s$ which makes $\|x_i\| = s$. Hence, the training target is again simplified, to $s(\cos\theta_{y_i} - m)$. ArcFace also utilises an additive angular constraint, but it changes $\cos\theta_{y_i} - m$ to $\cos(\theta_{y_i} + m)$, which gives it a better geometric interpretation. Both AM-Softmax and ArcFace adopt weight normalisation and feature normalisation, which restrict all the features to lie on a hypersphere. However, is it over-strict to force all the features to lie on a hypersphere instead of a wider space? Why and how do weight normalisation and feature normalisation benefit the training procedure? These questions are difficult to answer explicitly, and some evidence shows that "soft" feature normalisation may lead to better results [23].
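The four variants differ only in how they modify the target-class logit. The following sketch contrasts them for a single feature-weight pair; the margin and scale values are illustrative, and the piecewise monotonic version of $\cos(m\theta)$ used in the original papers is simplified to a plain cosine.

```python
import numpy as np

def target_logits(x, w, m_mult=4, m_add=0.35, m_arc=0.5, s=30.0):
    """Target-class logit under the four angular-margin variants
    (a sketch; margin/scale values are illustrative)."""
    cos_t = np.dot(x, w) / (np.linalg.norm(x) * np.linalg.norm(w))
    theta = np.arccos(np.clip(cos_t, -1.0, 1.0))
    return {
        # L-Softmax: multiplicative angular margin, no normalisation
        "L-Softmax": np.linalg.norm(w) * np.linalg.norm(x) * np.cos(m_mult * theta),
        # A-Softmax: adds weight normalisation, i.e. ||W_j|| = 1
        "A-Softmax": np.linalg.norm(x) * np.cos(m_mult * theta),
        # AM-Softmax: additive cosine margin with weight + feature normalisation
        "AM-Softmax": s * (np.cos(theta) - m_add),
        # ArcFace: additive angular margin
        "ArcFace": s * np.cos(theta + m_arc),
    }
```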

Inspired by Softmax Loss, Centre Loss and Marginal Loss, we propose the Minimum Margin Loss (MML) in this paper, which aims at forcing all the class centre pairs to have a distance larger than a specified minimum margin. Different from Range Loss, MML penalises all the 'unqualified' class centre pairs instead of only the centre pair with the shortest distance. MML reuses the centre positions constantly updated by Centre Loss, and directs the training process by joint supervision with Softmax Loss and Centre Loss. In this way, Softmax Loss + Centre Loss + MML achieves better performance than Softmax Loss and Softmax Loss + Centre Loss while adding almost no computing cost. To the best of our knowledge, no existing loss function considers setting a minimum margin between class centres, yet such a constraint is necessary for rectifying the bias introduced by imbalanced data. To prove the effectiveness of the proposed method, experiments are conducted on three public datasets – the Labeled Faces in the Wild (LFW) [24], YouTube Faces (YTF) [25] and MegaFace [14] datasets. Results show that MML achieves superior performance to Softmax Loss and Centre Loss, while achieving competitive results compared with the state-of-the-art methods.

II From Softmax Loss to Minimum Margin Loss

II-A Softmax Loss and Centre Loss

Softmax Loss is the most commonly used loss function, which is presented below:

$$L_S = -\sum_{i=1}^{m} \log \frac{e^{W_{y_i}^T x_i + b_{y_i}}}{\sum_{j=1}^{n} e^{W_j^T x_i + b_j}} \quad (1)$$

where $m$ is the batch size, $n$ is the class number of a batch, $x_i$ denotes the feature of the $i$th sample, which belongs to the $y_i$th class, $W_j$ denotes the $j$th column of the weight matrix $W$ in the final fully connected layer and $b_j$ is the bias term of the $j$th class. From Eq. (1), it can be seen that Softmax Loss is designed to obtain the cross entropy between the predicted labels and the true labels; in other words, the target of Softmax Loss is only to separate the features of different classes in the training set rather than to learn discriminative features. Such a target is appropriate for closed-set tasks, like most application scenarios of object recognition and behaviour recognition. But the application scenarios of face recognition are open-set tasks in most cases, so the discriminative ability of features has considerable influence on the performance of a face recognition system. To enhance the discriminative ability of features, Wen et al. [20] proposed the Centre Loss to minimise the intra-class distance, as shown below:

$$L_C = \frac{1}{2} \sum_{i=1}^{m} \| x_i - c_{y_i} \|_2^2 \quad (2)$$

where $c_{y_i}$ denotes the class centre of the $y_i$th class. Centre Loss calculates all the distances between the class centres and the within-class samples, and is used in conjunction with Softmax Loss:

$$L = L_S + \lambda L_C \quad (3)$$
$$\;\; = -\sum_{i=1}^{m} \log \frac{e^{W_{y_i}^T x_i + b_{y_i}}}{\sum_{j=1}^{n} e^{W_j^T x_i + b_j}} + \frac{\lambda}{2} \sum_{i=1}^{m} \| x_i - c_{y_i} \|_2^2 \quad (4)$$

where $\lambda$ is the hyper-parameter for balancing the two loss functions.
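A minimal NumPy sketch of the joint supervision in Eqs. (1)-(4) for one batch follows; `W`, `b` and `centres` stand for the final-layer parameters and the running class centres, and the λ value is the one reported later in Section III-A.

```python
import numpy as np

def softmax_centre_loss(feats, labels, W, b, centres, lam=5e-5):
    """Joint loss L = L_S + lambda * L_C of Eqs. (3)-(4) (a sketch)."""
    logits = feats @ W + b                            # (m, n): W_j^T x_i + b_j
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    l_s = -log_probs[np.arange(len(labels)), labels].sum()      # Eq. (1)
    l_c = 0.5 * np.sum((feats - centres[labels]) ** 2)          # Eq. (2)
    return l_s + lam * l_c
```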

II-B Marginal Loss and Range Loss

After combining Softmax Loss with Centre Loss, the within-class compactness is significantly enhanced. But Softmax Loss alone is not enough as the inter-class constraint, as it only encourages the separability of features. So Deng et al. [17] proposed Marginal Loss, which also adopts joint supervision with Softmax Loss:

$$L = L_S + \lambda L_M \quad (5)$$
$$L_M = \frac{1}{m^2 - m} \sum_{i,j,\, i \neq j}^{m} \left( \xi - y_{ij} \left( \theta - \| \hat{x}_i - \hat{x}_j \|_2^2 \right) \right)_+ \quad (6)$$

where $x_i$ and $x_j$ are the features of the $i$th and $j$th samples in a batch, respectively; $y_{ij} \in \{\pm 1\}$ indicates whether $x_i$ and $x_j$ belong to the same class; $\hat{x}$ is defined as $\frac{x}{\|x\|}$; $\theta$ is the threshold to separate the positive pairs and the negative pairs, and $\xi$ is the error margin besides the classification hyperplane.

Marginal Loss considers all the possible combinations of sample pairs in a batch and specifies a threshold $\theta$ to constrain all these sample pairs, both positive and negative. It forces the distances of the positive pairs below the threshold $\theta$ while forcing the distances of the negative pairs above $\theta$. But utilising the same threshold $\theta$ to constrain both the positive and the negative pairs is not proper, because it is often the case that the two farthest samples in a class have a distance larger than that of the two nearest samples from two different but closest classes. Forcibly changing this situation makes the training procedure hard to converge.
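The sketch below evaluates Eq. (6) over all sample pairs in a batch; the θ and ξ values are illustrative placeholders rather than the settings of [17].

```python
import numpy as np

def marginal_loss(feats, labels, theta=1.2, xi=0.3):
    """Marginal Loss of Eq. (6) over all pairs in a batch (a sketch;
    theta/xi values are illustrative)."""
    x = feats / np.linalg.norm(feats, axis=1, keepdims=True)   # x_hat = x / ||x||
    m = len(x)
    total = 0.0
    for i in range(m):
        for j in range(m):
            if i == j:
                continue
            y_ij = 1.0 if labels[i] == labels[j] else -1.0
            d2 = np.sum((x[i] - x[j]) ** 2)
            total += max(0.0, xi - y_ij * (theta - d2))        # hinge of Eq. (6)
    return total / (m * m - m)
```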

Similar to the aforementioned methods, the Range Loss proposed by Zhang et al. [16] also works with Softmax Loss as the supervisory signal:

$$L = L_S + \lambda L_R \quad (7)$$

Different from Marginal Loss, Range Loss consists of two independent losses, namely $L_{R_{intra}}$ and $L_{R_{inter}}$, which calculate the intra-class loss and the inter-class loss respectively (see Eq. (8)):

$$L_R = \alpha L_{R_{intra}} + \beta L_{R_{inter}} \quad (8)$$

where $\alpha$ and $\beta$ are two weights for adjusting the influence of $L_{R_{intra}}$ and $L_{R_{inter}}$. Mathematically, $L_{R_{intra}}$ and $L_{R_{inter}}$ are defined as follows:

$$L_{R_{intra}} = \sum_{i \subseteq I} L^i_{R_{intra}} = \sum_{i \subseteq I} \frac{k}{\sum_{j=1}^{k} \frac{1}{d_j}} \quad (9)$$
$$L_{R_{inter}} = \max \left( M - D_{center},\, 0 \right) \quad (10)$$
$$D_{center} = \| c_Q - c_R \|_2^2 \quad (11)$$

where $I$ is the set of classes in the current batch, $d_j$ is the $j$th largest distance among the sample pairs in class $i$, $D_{center}$ is the central distance of the two nearest classes in the current batch, $c_Q$ and $c_R$ denote the class centres of classes $Q$ and $R$, which have the shortest central distance, and $M$ is the margin threshold. $L_{R_{intra}}$ measures all the sample pairs in a class and selects the $k$ pairs with the largest distances to build the loss for controlling the within-class compactness. As described in [16], experiments show that $k = 2$ is the best choice. $L_{R_{inter}}$ aims at forcing the class centre pair with the smallest distance to have a margin larger than the designated threshold. But more centre pairs may have distances smaller than the designated threshold; considering only one centre pair at a time is not comprehensive enough and leads the training procedure to take a long time to fully converge because of the low learning speed.
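The following sketch mirrors Eqs. (8)-(11): the intra-class term is the harmonic mean of the $k = 2$ largest within-class pairwise distances, and the inter-class term is a hinge on the two nearest class centres. The margin and weight values are illustrative.

```python
import numpy as np
from itertools import combinations

def range_loss(feats, labels, k=2, margin=10.0, alpha=1.0, beta=1.0):
    """Range Loss of Eqs. (8)-(11) (a sketch; margin/alpha/beta illustrative)."""
    classes = np.unique(labels)
    # Intra-class term: harmonic mean of the k largest pair distances per class
    l_intra = 0.0
    for c in classes:
        xs = feats[labels == c]
        dists = sorted((np.sum((a - b) ** 2) for a, b in combinations(xs, 2)),
                       reverse=True)[:k]
        if len(dists) == k:                                        # skip tiny classes
            l_intra += k / sum(1.0 / max(d, 1e-12) for d in dists)  # Eq. (9)
    # Inter-class term: hinge on the smallest centre-to-centre distance
    centres = np.stack([feats[labels == c].mean(axis=0) for c in classes])
    d_center = min(np.sum((a - b) ** 2)
                   for a, b in combinations(centres, 2))           # Eq. (11)
    l_inter = max(0.0, margin - d_center)                          # Eq. (10)
    return alpha * l_intra + beta * l_inter
```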

II-C The Proposed Minimum Margin Loss

Inspired by Softmax Loss, Centre Loss and Marginal Loss, we propose the Minimum Margin Loss (MML) in this paper. MML is used in conjunction with Softmax Loss and Centre Loss, where Centre Loss is utilised to enhance the within-class compactness, while Softmax Loss and MML are applied to improve the between-class separability. Specifically, Softmax Loss is in charge of guaranteeing the correctness of classification while MML aims at optimising the between-class margins. The total loss is shown below:

$$L = L_S + \lambda L_C + \gamma L_{MML} \quad (12)$$

where $\lambda$ and $\gamma$ are the hyper-parameters for adjusting the impact of Centre Loss and MML.

MML specifies a threshold called the Minimum Margin. Reusing the class centre positions updated by Centre Loss, MML filters all the class centre pairs based on the specified Minimum Margin. For those pairs whose distances are smaller than the threshold, corresponding penalties are added to the loss value. In detail, MML is formulated as follows:

$$L_{MML} = \sum_{i=1}^{n} \sum_{j=i+1}^{n} \max \left( M - \| c_i - c_j \|_2^2,\, 0 \right) \quad (13)$$

where $n$ is the class number of a batch, $c_i$ and $c_j$ denote the class centres of the $i$th and $j$th classes respectively, and $M$ represents the designated minimum margin. In each training batch, the class centres are updated by Centre Loss with the following two equations:
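Eq. (13) reduces to a hinge over every class-centre pair of the batch, as in the sketch below. The squared-distance form follows our reconstruction of Eq. (13), and M = 280 is the value reported in Section III-A.

```python
import numpy as np

def minimum_margin_loss(centres, margin=280.0):
    """MML of Eq. (13) (a sketch): penalise every class-centre pair whose
    distance falls below the designated minimum margin M."""
    n = len(centres)
    loss = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            d2 = np.sum((centres[i] - centres[j]) ** 2)   # squared centre distance
            loss += max(0.0, margin - d2)                 # hinge penalty
    return loss
```

Because the centres are maintained by Centre Loss anyway, this adds only a pass over the centre pairs of the batch, which is why the extra computing cost is negligible.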

$$\Delta c_j = \frac{\sum_{i=1}^{m} \delta(y_i = j) \,( c_j - x_i )}{1 + \sum_{i=1}^{m} \delta(y_i = j)} \quad (14)$$
$$c_j^{t+1} = c_j^{t} - \alpha \, \Delta c_j^{t} \quad (15)$$

where $\alpha$ is the learning rate of the class centres, $t$ is the iteration number and $\delta(\cdot)$ is a conditional function: $\delta = 1$ if the condition is satisfied, otherwise $\delta = 0$.
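A sketch of the update of Eqs. (14)-(15) for one batch follows; α = 0.5 is an illustrative value, and only the centres of classes present in the batch move.

```python
import numpy as np

def update_centres(centres, feats, labels, alpha=0.5):
    """Class-centre update of Eqs. (14)-(15) (a sketch; alpha illustrative)."""
    new_centres = centres.copy()
    for j in np.unique(labels):
        mask = labels == j                                  # delta(y_i = j)
        delta = (centres[j] - feats[mask]).sum(axis=0) / (1 + mask.sum())  # Eq. (14)
        new_centres[j] = centres[j] - alpha * delta                        # Eq. (15)
    return new_centres
```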

Algorithm 1 shows the basic learning steps in the CNNs with the proposed $L_{MML}$.

Input: Training samples $\{x_i\}$, initialised parameters $\theta_C$ in the convolution layers, parameters $W$ in the final fully connected layer, and initialised class centres $\{c_j \mid j = 1, \dots, n\}$. Learning rate $\mu^t$, hyper-parameters $\lambda$ and $\gamma$, learning rate of the class centres $\alpha$ and the iteration number $t \leftarrow 0$.

Output:

The parameters $\theta_C$.

1: while not converge do
2:     Calculate the total loss by $L^t = L_S^t + \lambda L_C^t + \gamma L_{MML}^t$ (Eq. (12)).
3:     Calculate the backpropagation error $\frac{\partial L^t}{\partial x_i^t}$ for each sample $i$ by $\frac{\partial L^t}{\partial x_i^t} = \frac{\partial L_S^t}{\partial x_i^t} + \lambda \frac{\partial L_C^t}{\partial x_i^t} + \gamma \frac{\partial L_{MML}^t}{\partial x_i^t}$.
4:     Update $W$ by $W^{t+1} = W^t - \mu^t \frac{\partial L^t}{\partial W^t}$.
5:     Update $c_j$ for each centre by Eqs. (14) and (15).
6:     Update $\theta_C$ by $\theta_C^{t+1} = \theta_C^t - \mu^t \sum_{i} \frac{\partial L^t}{\partial x_i^t} \frac{\partial x_i^t}{\partial \theta_C^t}$.
7:     $t \leftarrow t + 1$.
8: end while
Algorithm 1: Learning algorithm in the CNNs with the proposed $L_{MML}$.
(a) Without using MML. (b) After using MML. (c) Comparison between S2 and S3.
Fig. 1: For each class in VGGFace2, its corresponding nearest neighbour class can be found by comparing the positions of the different class centres. (a), (b) and (c) show the distributions of the distances between every class centre and its corresponding nearest class centre. Specifically, (a) shows the distribution using the features generated by Scheme II (without MML); (b) shows the distribution using the features generated by Scheme III (with MML); (c) compares (a) and (b), where S2 and S3 represent Scheme II and Scheme III, respectively.

II-D Discussion

II-D1 Whether MML can truly enlarge the distances of the closest class centre pairs that are smaller than the specified minimum margin

To verify this point, we use the deep models trained with Scheme II (Softmax Loss + Centre Loss) and Scheme III (Softmax Loss + Centre Loss + MML) to extract the features of all the images from a cleaned version of the VGGFace2 dataset [13]. The details of the cleaned dataset and the training process of these two models can be found in Section III-A. The difference between Scheme II and Scheme III is that Scheme III employs MML as a part of the supervision signal while Scheme II does not. With the extracted features, we calculate the centre position of each class and then the distance between each class centre and its closest neighbour centre. The distributions of these distances are shown in Figure 1. Figure 1(a) and Figure 1(b) show the distance distributions of Scheme II and Scheme III, respectively. Figure 1(c) compares Scheme II and Scheme III, from which we can see that Scheme III has smaller values on the first five bins and larger values on the remaining bins. This indicates that MML enlarges the distances of some neighbouring centre pairs, thereby increasing the number of centre pairs with large margins.

II-D2 Whether MML can truly improve the performance of the model on face recognition

To answer this question, we conduct extensive experiments on different benchmark datasets, as illustrated in Section III. The experiment types include face verification, face identification, image-based recognition and video-based recognition. Results show that the proposed method beats the baseline methods as well as some state-of-the-art methods.

III Experiments

In this section, we describe the implementation details of the experiments, investigate the influence of the parameters $\gamma$ and $M$, and evaluate the performance of the proposed method. The evaluations are conducted on the MegaFace [14], FaceScrub [26], LFW [24] and YTF [25] datasets with face identification and face verification tasks. Face identification and face verification are the two main tasks of face recognition. Face verification aims at verifying whether two faces are from the same person, answering 'Yes' or 'No', which is a binary classification problem. Face identification aims at identifying the ID of a face, answering the exact ID, which is a multi-class classification problem.

III-A Experiment Details

Training data. In all experiments, we use VGGFace2 [13] as our training data. To ensure the reliability and the accuracy of the experimental results, we removed all the face images that might overlap with the benchmark datasets. As the label noise in VGGFace2 is very low, no further data cleaning was applied. The final training dataset contains 3.05M face images from 8K identities.

Data preprocessing. MTCNN [27] is applied to all the face images for face detection, landmark location and face alignment. If face detection fails on a training image, we simply discard it; if it fails on a testing image, the provided landmarks are used instead. All the training and testing images are cropped to 160×160 RGB images. To augment the training data, we also perform random horizontal flipping on the training images. To improve the recognition accuracy, we concatenate the features of the original testing image and its horizontally flipped counterpart, as sketched below.
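The test-time flip-and-concatenate step can be sketched as follows, where `embed` is a hypothetical callable standing for the trained network's feature extractor.

```python
import numpy as np

def test_feature(image, embed):
    """Concatenate the features of an image and its horizontal flip
    (a sketch; `embed` maps an HxWx3 array to a feature vector)."""
    flipped = image[:, ::-1, :]                  # flip along the width axis
    return np.concatenate([embed(image), embed(flipped)])
```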

Network settings. Based on Inception-ResNet-v1 [28], we implemented and trained three models with TensorFlow [29] according to three supervision schemes: Softmax Loss (Scheme I), Softmax Loss + Centre Loss (Scheme II), and Softmax Loss + Centre Loss + MML (Scheme III). We train the three models on one GPU (GTX 1080 Ti) with a batch size of 90, an embedding size of 512, a weight decay of 5e-4 and a keep probability of 0.4 for the fully connected layer. The total number of iterations is 275K, costing about 30 hours. The learning rate is initialised to 0.05 and is divided by 10 every 100K iterations. All three schemes use the same parameter settings, except that Scheme III loads the trained model of Scheme II as its pre-trained model before training starts, as this makes Scheme III achieve better recognition performance.

Test settings. During testing, we try our best to find the parameter settings that lead to the highest performance. The $\lambda$ and $\gamma$ in Eq. (12) are set to 5e-5 and 5e-8, respectively. The minimum margin $M$ of MML is set to 280. The deep feature of each image is obtained from the output of the fully connected layer, and we concatenate the features of the original testing image and its horizontally flipped counterpart, so the resulting feature of each image has 512 × 2 = 1024 dimensions. The final verification results are obtained by comparing the Euclidean distance of two features with a threshold.
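Verification then reduces to thresholding the Euclidean distance between two such concatenated features; `tau` below is a tuned threshold, not a value given in the paper.

```python
import numpy as np

def same_person(feat1, feat2, tau):
    """Verification decision (a sketch): 'Yes' iff the Euclidean distance
    between the two concatenated features is below the tuned threshold."""
    return bool(np.linalg.norm(feat1 - feat2) < tau)
```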

III-B Influence Analysis on Parameters $\gamma$ and $M$

$\gamma$ is the hyper-parameter for adjusting the impact of MML in the combination, and $M$ is the designated minimum margin. These two parameters influence the performance of the proposed method, so how to set them is a question worthy of study.

We conduct two experiments on the LFW dataset. In the first experiment, we fix $\gamma$ to 5e-8 and observe the influence of $M$ on the verification performance, as shown in Figure 2(a). In the second experiment, we fix $M$ to 280 and evaluate the relationship between $\gamma$ and the verification accuracy, as shown in Figure 2(b). From Figure 2(a), we can see that setting $M$ to 0, namely not using MML, is not proper, as it leads to low accuracy; the highest accuracy appears when $M$ is 280. From Figure 2(b), we can observe that the verification performance remains stable over a wide range of $\gamma$, but reaches its peak when $\gamma$ is 5e-8. These results show that choosing proper values for $\gamma$ and $M$ can improve the verification performance of the learned model. Therefore, in the subsequent experiments, we fix $M$ and $\gamma$ to 280 and 5e-8, respectively.

Fig. 2: Face verification accuracies on the LFW dataset with two groups of models: (a) fixed $\gamma$ = 5e-8 and varying $M$; (b) fixed $M$ = 280 and varying $\gamma$.
Fig. 3: (a) CMC curves of different methods with 1M distractors on MegaFace Set 1. (b) ROC curves of different methods with 1M distractors on MegaFace Set 1. S1, S2 and S3 represent Scheme I, Scheme II and Scheme III, respectively. Apart from S1 to S3, the results of the other methods are provided by the MegaFace team.
Methods                Rank-1@10^6  Rank-100@10^6  VR@FAR=10^-6  VR@FAR=10^-5  VR@FAR=10^-4
3divi                  33.70%       67.74%         36.76%        51.56%        66.88%
JointBayes             3.02%        13.94%         1.44%         3.70%         9.83%
LBP                    2.33%        7.04%          -             -             2.10%
Scheme I               41.39%       74.77%         49.46%        63.53%        76.97%
Scheme II              41.79%       77.03%         49.12%        63.79%        77.50%
Scheme III (Proposed)  45.30%       78.89%         53.80%        66.54%        79.23%
TABLE II: The identification rates and the verification rates of different methods on the MegaFace and FaceScrub datasets with 1M distractors.

III-C MegaFace Challenge 1 on FaceScrub

In this section, we conduct experiments with the MegaFace dataset [14] and the FaceScrub dataset [26]. The MegaFace dataset consists of a million faces and their respective bounding boxes obtained from Flickr (Yahoo's dataset). The FaceScrub dataset is a publicly available dataset containing 0.1M images of 530 identities. According to the experimental protocol of MegaFace Challenge 1, the MegaFace dataset is used as the distractor set, while the FaceScrub dataset is used as the test set. The experiments are conducted with the provided code [14], which evaluates our methods on only one of the three sets of MegaFace (Set 1). More details about the experimental protocol can be found in [14].

We compare the proposed method (Scheme III) with some existing ones, including (a) 3divi (the deep model from the 3DiVi Company), JointBayes [30] and LBP [31], and (b) our baseline methods (Scheme I and Scheme II). In the face identification experiments, the Cumulative Match Characteristics (CMC) curves [32] are calculated to measure the ranking capabilities of the different methods, as illustrated in Figure 3(a). In the face verification experiments, we use the Receiver Operating Characteristic (ROC) curves, which plot the verification rate of a 1:1 matcher against its False Accept Rate (FAR), as shown in Figure 3(b). Table II lists the numeric results of the different methods on the identification rates and the verification rates with 1M distractors.

From Figure 3(a), Figure 3(b) and Table II, we can observe that JointBayes and LBP perform poorly compared with the deep learning-based methods, and almost fail in excluding the distractors. The proposed Scheme III consistently outperforms Scheme I, Scheme II and 3divi, which confirms the effectiveness of the proposed method.

Fig. 4: Some examples from the LFW dataset (left) and the YTF dataset (right).
Methods                     Images  LFW(%)  YTF(%)
ICCV17' Range Loss [16]     1.5M    99.52   93.7
CVPR17' Marginal Loss [17]  4M      99.48   96.0
CVPR15' DeepID2+ [33]       -       99.47   93.2
BMVC15' VGG Face [34]       2.6M    98.95   97.3
CVPR14' Deep Face [35]      4M      97.35   91.4
CVPR15' Fusion [36]         500M    98.37   -
ICCV15' FaceNet [1]         200M    99.63   95.1
arXiv15' Baidu [37]         1.3M    99.13   -
ECCV16' Centre Loss [20]    0.7M    99.28   94.9
NIPS16' Multibatch [38]     2.6M    98.20   -
ECCV16' Aug [39]            0.5M    98.06   -
arXiv18' ArcFace [19]       7.1M    99.83   -
CVPR17' SphereFace [22]     0.5M    99.42   95.0
arXiv18' CosFace [40]       5M      99.73   97.6
Scheme I                    3.05M   99.43   94.9
Scheme II                   3.05M   99.50   95.1
Scheme III (Proposed)       3.05M   99.63   95.5
TABLE III: Verification performance of state-of-the-art methods on the LFW and YTF datasets.

III-D Comparison with the State-of-the-art Methods on LFW and YTF Datasets

In this section, we evaluate the proposed method on two public benchmark datasets – LFW [24] and YTF [25] – according to the settings in Section III-A. Some preprocessed examples from these two datasets are shown in Figure 4.

The LFW dataset is collected from the web and contains 13,233 face images with large variations in facial paraphernalia, pose and expression. These face images come from 5,749 different identities, of which 4,069 have one image and the remaining 1,680 have two or more. The Viola-Jones face detector is the only constraint on the faces collected. We follow the standard experimental protocol of unrestricted with labelled outside data [41] and test 6,000 face pairs according to the given pair list.

The YTF dataset consists of 3,425 videos of 1,595 identities obtained from YouTube, with an average of 2.15 videos per person. The frame number of the video clips ranges from 48 to 6,070, with an average of 181.3 frames. We again follow the standard experimental protocol of unrestricted with labelled outside data and evaluate the relevant methods on the given 5,000 video pairs.

Table III shows the results of the proposed method and the state-of-the-art methods on the LFW and YTF datasets, from which we can observe the following.

  • The proposed Scheme III outperforms Scheme I (Softmax Loss only) and Scheme II (Softmax Loss + Centre Loss), increasing the verification performance on both the LFW and YTF datasets. On LFW, the accuracy improves from 99.43% and 99.50% to 99.63%, while on YTF, the accuracy increases from 94.9% and 95.1% to 95.5%. This demonstrates the effectiveness of MML, as well as the effectiveness of the combination Softmax Loss + Centre Loss + MML.

  • Compared with the state-of-the-art methods, the proposed method achieves an accuracy higher than most of them on the LFW and YTF datasets. Only ArcFace and CosFace slightly outperform the proposed method, by 0.2% and 0.1% on LFW; however, both of them utilise training data of nearly double the size and require much longer training time and more GPU memory. Using one GPU (GTX 1080 Ti, memory size: 11GB), ArcFace needs at least 200 hours of training (with the batch size set to 64 to fit the memory of the GPU), while the proposed method needs only 30 hours. This shows the advantage of the whole framework, including the data preprocessing, the network settings and MML.

IV Conclusion

In this paper, a new loss function – the Minimum Margin Loss (MML) – is presented to guide deep neural networks to learn highly discriminative face features. MML aims at enlarging the margin of the 'unqualified' class centre pairs. To verify the effectiveness of the proposed method, extensive experiments are conducted. The results on the popular MegaFace, LFW and YTF datasets show that the proposed method achieves state-of-the-art performance.

References