Face verification is a long-standing yet still challenging research topic within the computer vision and image processing community [Bianco2017faceverification, Sengupta2016faceverification, du2014discriminative]. It has a wide range of applications, such as public security systems, human-machine interaction, and e-commerce trading [osadchy2010scifi, Ren2013A]. From the pattern recognition perspective, face verification can be regarded as a 2-class fine-grained visual pattern recognition task that decides whether 2 face images depict the same person [FVface]. It suffers from high intra-person variation in illumination, pose, expression, age and occlusion. The central idea for improving performance is to reduce intra-person variation while enlarging inter-person difference. Metric learning on visual features (e.g., CNN features [Bhattarai2016CP]) is one of the commonly used technologies to address this.
Metric learning aims to learn a discriminative distance function for a specific task. The yielded distance metric enlarges inter-class distance and reduces intra-class distance simultaneously. Besides face verification, metric learning is also widely applied to other fine-grained visual recognition tasks (e.g., person re-identification [dikmen2010pedestrian], plant trait characterization [lu2017towards], and object detection [you2014local, du2016beyond]).
During the past decades, numerous efforts [KISSme, XQDA, xiong2014kernelmetric] have been devoted to the development of metric learning technology from different theoretical perspectives. Generally, they can be categorized into 2 families. One is the linear paradigm [KISSme, XQDA], which learns a linear distance metric transformation matrix $M$. Then, the distance between 2 samples $x_i$ and $x_j$ is calculated as $d_M(x_i, x_j) = (x_i - x_j)^\top M (x_i - x_j)$. On the other hand, nonlinear models (e.g., kernel-based ones [xiong2014kernelmetric, wang2019incorporating]) are also studied to capture the nonlinearity of data. All of the linear and nonlinear approaches above are shallow learners. Essentially, this means that they may suffer from the underfitting problem. To obtain a more reliable metric, some works resort to the ensemble of multiple shallow metrics [dong2017dimensionality, paisitkriangkrai2015learning]. However, the plain ensemble manner cannot effectively alleviate the intrinsic underfitting problem within the shallow metrics. That is, the yielded metric may not be discriminative enough to reveal subtle differences among categories.
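To make the linear paradigm concrete, the following minimal sketch evaluates a distance of the form $(x_i - x_j)^\top M (x_i - x_j)$ with NumPy. The function name `mahalanobis_sq` and the toy vectors are illustrative assumptions, not part of the original work.

```python
import numpy as np

def mahalanobis_sq(x_i, x_j, M):
    """Squared Mahalanobis-style distance (x_i - x_j)^T M (x_i - x_j)."""
    diff = np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float)
    return float(diff @ M @ diff)

x_i = np.array([1.0, 2.0])
x_j = np.array([0.0, 0.0])

# With M = I the distance reduces to the squared Euclidean distance.
d_euclid = mahalanobis_sq(x_i, x_j, np.eye(2))

# A learnt M reweights (and in general also mixes) the feature dimensions.
d_scaled = mahalanobis_sq(x_i, x_j, np.diag([4.0, 1.0]))
```

A linear metric learner only has to produce the matrix `M`; the verification decision then thresholds this distance.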
Recently, deep metric learning approaches [Yi2014Deep, Hoffer2014Deep] have emerged. Leveraging the strong fitting capacity of deep neural networks (DNN) [VGG], these deep models significantly outperform their shallow learning counterparts. In spite of the remarkable performance enhancement, a large number of training samples is required to avoid overfitting [lowshot]. Nevertheless, this requirement may not always be met in practical application scenarios. Meanwhile, training a DNN via back-propagation is expensive in terms of both computational resources and time [VGG], and its convergence is sensitive to parameter settings (e.g., learning rate). Thus, for some lightweight applications (e.g., embedded systems) a more efficient and stable approach is required.
Concerning the defects of the existing shallow and deep metric learning approaches, we propose a novel ensemble cascade metric learning (ECML) mechanism for face verification. The key research motivation is to seek a good tradeoff between underfitting and overfitting. Inspired by the success of the deep metric learning framework, hierarchical linear metric learning is executed on the raw feature in a cascade way. It aims to improve the discriminative power of the learnt metric by resisting underfitting. However, this also tends to lead to overfitting on certain feature dimensions. To alleviate this, we take advantage of the ensemble learning principle [ensembling] by randomly splitting the features into non-overlapping groups at each learning stage except the final one. Then, metric learning is executed on the feature groups individually. Thus, the discriminative information within the feature dimensions can be better maintained, which weakens the overfitting effect caused by the cascade learning scheme. The main technical pipeline of the proposed ensemble cascade metric learning mechanism is shown in Fig. 1.
As aforementioned, an effective linear metric learning approach is the essential element of the proposed ECML mechanism. To ensure learning efficiency and scalability, approaches with a closed-form solution are preferred. To this end, KISSME [KISSme] and XQDA [XQDA] are the outstanding ones. However, we argue that the robustness of KISSME is not satisfactory due to the potential computation failure when inverting the covariance matrix of the Gaussian model, especially in small-scale cases. The main reason is that the face samples from the same person are often highly correlated. In XQDA, this problem is averted by adding a small perturbation to the within-class covariance matrix as a regularizer. But this may lead to inaccurate estimation of the within-class covariance matrix. To address this, we propose a robust Mahalanobis metric learning (RMML) method. It does not need to compute the inverse of the intra-class covariance matrix, and it has a closed-form solution. Meanwhile, the feature distribution characteristics of faces are explicitly considered in RMML to enhance performance. By embedding RMML into the ECML mechanism, our proposed metric learning approach (EC-RMML) can run in a one-pass learning manner with high scalability.
Experiments on the large-scale MS-Celeb-1M dataset [guo2016ms] demonstrate that EC-RMML is superior to state-of-the-art metric learning methods on face verification. The ECML mechanism is also applicable to other metric learning approaches.
The main contributions of this paper include:
ECML: a novel ensemble cascade metric learning mechanism for face verification. It is easy to implement, and able to achieve the good tradeoff between underfitting and overfitting;
RMML: a robust Mahalanobis metric learning approach with closed-form solution.
The source code and supporting materials of our proposition are available at https://github.com/xf1994/ECML.
II Related work
To facilitate face verification, one research avenue is to exploit discriminative visual features. The well-established ones include LBP and its variants [Chen2012Bayesian, Chen2013Blessing, LBP3, LBP4], face attribute learning [Kumar2010Attribute], and Fisher vector encoded dense SIFT [FVface]. Most recently, deep learning-based (i.e., CNN) features [DeepID, FaceNet, centerloss] have achieved great success.
On top of these visual features, metric learning technology [mltrack, clustering1, clustering2] has drawn much attention for further performance enhancement. Davis et al. [ITML] proposed the information theoretic metric learning (ITML) method to minimize the differential relative entropy between two multivariate Gaussians. Guillaumin et al. [LDML] proposed to learn a metric from the probabilistic perspective using a logistic discriminant, which is termed LDML. Weinberger et al. [LMNN] proposed the large-margin nearest neighbor (LMNN) metric learning approach to enhance the performance of KNN classification. All the methods mentioned above suffer from one common defect: they depend heavily on an iterative optimization procedure during training. As a consequence, their scalability is not satisfactory, especially on large-scale data.
To alleviate this, some scalable metric learning approaches have been proposed. Among them, LDA [LDA] runs efficiently with a closed-form solution. However, it cannot work when only pairwise labels are given. To address this, SILD [SILD] estimates the within-class and between-class covariance matrices using side-information. Koestinger et al. [KISSme] proposed KISSME, which is learnt under a Gaussian hypothesis in the feature difference space. With its closed-form solution, KISSME works well on both small-scale and large-scale datasets. Nevertheless, it may occasionally fail to work because the inversion of the covariance matrix of the Gaussian model breaks down. Meanwhile, the embedded Gaussian hypothesis may not hold for high-dimensional features. Taking advantage of KISSME and LDA, Liao et al. [XQDA] proposed the cross-view quadratic discriminant analysis (XQDA) metric learning method. XQDA is also learnt on feature differences. It achieves a closed-form solution using the generalized eigenvalue decomposition. Unfortunately, it suffers from the same computational problem as KISSME. That is, computing the inverse of the within-class covariance matrix is also required. Hence, the implementation of XQDA adds a small perturbation to the within-class covariance matrix to guarantee robustness. However, this may lead to inaccurate estimation of the within-class covariance matrix.
All the metric learning methods above are of shallow learning paradigm. That is, only one global metric is learned from the raw feature. They may often suffer from the underfitting problem. To obtain more reliable metric, some works [dong2017dimensionality, paisitkriangkrai2015learning] address this issue by the ensemble of multiple metrics. However, the ensemble of shallow metrics still cannot effectively alleviate the intrinsic underfitting problem.
Very recently, taking advantage of the strong fitting power of deep neural networks, some deep metric learning approaches [Yi2014Deep, Hoffer2014Deep] have been proposed in an end-to-end manner. Nevertheless, they are data-hungry. When the training data is insufficient, they do not perform well [Li2014DeepReID]. Inspired by the success of the deep metric learning paradigm, we propose ECML as an ensemble cascade metric learning mechanism in the spirit of balancing effectiveness and efficiency. Different from the deep metric learning methods in [Yi2014Deep, Hoffer2014Deep], ECML can run in a one-pass learning manner without an iterative training procedure. To alleviate the overfitting risk, ECML executes a feature shuffle operation [xiangyu2017shufflenet]. Meanwhile, a robust Mahalanobis metric learning method with a closed-form solution is also proposed as the basic unit of ECML. It does not need to compute the inverse of the within-class covariance matrix, and thus avoids the computation failure problem faced by KISSME.
III ECML: a novel ensemble cascade metric learning mechanism
As aforementioned, face verification is indeed a challenging fine-grained visual recognition task. One main difficulty is that different people may exhibit only subtle feature differences. To verify this, Fig. 2 (a) shows the pairwise distance distribution between the subjects within 100,000 randomly sampled matched and unmatched face pairs (55,043 matched pairs, and 44,957 unmatched pairs) from the MS-Celeb-1M dataset [guo2016ms] using the FV-based face representation [FVface]. It can be observed that the subject distance distributions of the matched and unmatched face pairs overlap seriously, which imposes a great challenge for accurate face verification. As a consequence, a metric learning approach with strong fitting capacity is required to improve the discriminative power of the feature. Nevertheless, resisting the overfitting risk should be taken into consideration simultaneously. In the spirit of achieving a good tradeoff between underfitting and overfitting, we propose an ensemble cascade metric learning (ECML) mechanism for face verification. That is, a hierarchical metric learning procedure is executed in a cascade way to enhance the fitting power. And, at each learning stage the feature is randomly shuffled into groups that work in an ensemble manner to alleviate overfitting.
III-A Cascade metric learning to enhance discriminative power
Generally, the existing state-of-the-art linear metric learning approaches (e.g., XQDA [XQDA]) share one common characteristic: they are executed in a shallow learning manner (i.e., one-stage learning). We argue that this paradigm limits the fitting capacity of the metric learning procedure for face verification, leaving it trapped in an underfitting status. To reveal this point intuitively, Fig. 2 (b) shows the pairwise distance distribution between the subjects within the 100,000 face pairs in Fig. 2 (a) after using XQDA. We can see that, even with the improvement brought by XQDA, the subject distance distributions of the matched and unmatched face pairs still overlap heavily. That is to say, underfitting happens. Hence, the fitting capacity of the employed metric learning approaches should be further strengthened to improve the discriminative power. As a consequence, we propose to execute a cascade metric learning procedure to address this, which is inspired by the great success of the deep learning framework that conducts hierarchical nonlinear feature learning [VGG].
As shown in Fig. 3, the proposed cascade metric learning manner consists of hierarchical learning stages. Our intuition is that, as the metric learning stages proceed, samples of the same person are continuously drawn closer while samples of different persons are pushed farther apart in feature space, which alleviates the problem of underfitting. In particular, every learning stage except the last maps the face feature from the previous stage to a new feature space, and the last learning stage yields the final distance metric for face verification.
Linear metric learning is executed in all of the learning stages as the core component, to generate the stage-wise Mahalanobis transformation matrix at each learning stage. We choose the linear metric learning paradigm to build the cascade metric learning mechanism mainly due to its relatively high computational efficiency and scalability, compared to the nonlinear ones [xiong2014kernelmetric].
As aforementioned, each learning stage except the last needs to map the face feature from the previous learning stage to the new feature space. Without loss of generality, assume we are at the $n$-th learning stage with the achieved stage-wise Mahalanobis transformation matrix $M_n$. Let $x_i^n$ denote the yielded face feature that corresponds to the $i$-th person at the $n$-th learning stage. One intuitive way to map $x_i^{n-1}$ from the previous learning stage is to decompose $M_n$ via Cholesky decomposition as
$$M_n = L_n L_n^\top.$$
Then, $L_n$ is used as the mapping matrix to acquire the output feature of the $n$-th learning stage for the $i$-th person by
$$x_i^n = L_n^\top x_i^{n-1}.$$
However, for some existing metric learning approaches (e.g., KISSME) the learnt stage-wise matrix $M_n$ may not be positive definite. In this case, Cholesky decomposition cannot be executed directly. To address this, we propose a modified Cholesky decomposition approach. In particular, Schur decomposition is first conducted on $M_n$ as
$$M_n = U_n T_n U_n^\top,$$
where $U_n$ and $T_n$ are the decomposed matrices obtained by Schur decomposition. In particular, $U_n$ is the unitary matrix obtained by Schur decomposition, and $T_n$ is a diagonal matrix since $M_n$ is a real symmetric matrix.
To make $M_n$ decomposable, $T_n$ is modified as $\tilde{T}_n$ by setting the negative eigenvalues in $T_n$ to 0. Then, $\tilde{T}_n$ can be decomposed as
$$\tilde{T}_n = \tilde{T}_n^{1/2} \big(\tilde{T}_n^{1/2}\big)^\top.$$
In other words, $M_n$ is decomposed as
$$M_n \approx \big(U_n \tilde{T}_n^{1/2}\big)\big(U_n \tilde{T}_n^{1/2}\big)^\top = L_n L_n^\top, \qquad L_n = U_n \tilde{T}_n^{1/2},$$
with equality when $M_n$ has no negative eigenvalues.
We term this decomposition procedure of $M_n$ the modified Cholesky decomposition (MCD). Directly setting the negative eigenvalues in $T_n$ to 0 may hurt the discriminative capacity of metric learning approaches. However, the number of negative eigenvalues is generally small, and the proposed cascade metric learning procedure helps to compensate for this defect.
Meanwhile, we find that during cascade metric learning some feature dimensions of the yielded $x_i^n$ may take much greater values than the other dimensions. This phenomenon tends to lead to overfitting. To alleviate this, we execute square root normalization on $x_i^n$ to suppress the large values as
$$f(x) = \mathrm{sign}(x)\sqrt{|x|},$$
where $f(\cdot)$ indicates the square root normalization function, applied element-wise, and $\mathrm{sign}(\cdot)$ denotes the signum function.
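The per-stage mapping described above (modified Cholesky decomposition followed by square root normalization) can be sketched as follows. This is a minimal illustration under stated assumptions: features are stored as rows, and `numpy.linalg.eigh` is used in place of a Schur routine, which is equivalent for a real symmetric matrix. The function names are hypothetical.

```python
import numpy as np

def modified_cholesky(M):
    """Return L such that L @ L.T equals M with its negative eigenvalues set to 0."""
    M = 0.5 * (M + M.T)                      # enforce symmetry against round-off
    eigvals, U = np.linalg.eigh(M)           # real symmetric M: Schur form = eigendecomposition
    eigvals = np.clip(eigvals, 0.0, None)    # set negative eigenvalues to 0
    return U * np.sqrt(eigvals)              # same as U @ diag(sqrt(eigvals))

def sqrt_normalize(X):
    """Element-wise signed square root: sign(x) * sqrt(|x|)."""
    return np.sign(X) * np.sqrt(np.abs(X))

def cascade_stage(X, M):
    """Map row-wise features X (n_samples x d) through one stage with metric M."""
    L = modified_cholesky(M)
    return sqrt_normalize(X @ L)             # row-wise form of L^T x

# Toy metric with one negative eigenvalue, so plain Cholesky would fail.
M = np.array([[2.0, 0.0], [0.0, -1.0]])
L = modified_cholesky(M)
X = np.array([[4.0, 3.0], [-1.0, 2.0]])
Y = cascade_stage(X, M)
```

In this toy case `L @ L.T` recovers `diag(2, 0)`: the negative direction is discarded, which is the information loss the text attributes to MCD.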
To verify the feasibility of the cascade metric learning mechanism, we apply it to XQDA. Fig. 4 shows the pairwise distance distribution comparison between the subjects within the face pairs in Fig. 2 (a), after using raw XQDA and cascade XQDA respectively (for cascade XQDA, the metric learning stage number is empirically set to 3, besides the final learning stage). It can be observed that the subject distance distribution overlap between the matched and unmatched face pairs is remarkably reduced by cascade XQDA, compared to the raw one. That is, the K-L divergence between the “Pos” and “Neg” distributions increases from 1.2294 to 1.9142. Although the cascade metric learning paradigm is able to enhance the discriminative power of the feature on the training set, it may still lead to overfitting. Next, we illustrate how to alleviate this via ensemble metric learning.
III-B Ensemble metric learning to suppress overfitting risk
As verified in Fig. 4, cascade metric learning helps to improve the discriminative power of the yielded feature on the training set. However, it cannot ensure generalization to the test set. That is, overfitting may happen, which is also often faced by the deep learning paradigm [Dropout]. To suppress overfitting, an ensemble metric learning mechanism is proposed here. Our main idea is that, at each cascade metric learning stage (besides the final one), each input feature vector is randomly shuffled into groups. Then, the metric learning procedure that involves linear metric learning, MCD, and square root normalization proposed in Sec. III-A is applied to each feature group in an ensemble manner. The outputs from all of the ensemble groups are then concatenated to yield the final output feature of the corresponding learning stage. This avoids the problem that only a small number of feature dimensions play the dominant role during metric learning. In this way, the discriminative information within the different feature dimensions can be better maintained, which helps to suppress overfitting.
The main technical pipeline of the proposed ensemble metric learning mechanism at the $n$-th learning stage is shown in Fig. 5. In particular, at the $n$-th cascade learning stage, suppose the dimensionality of the input feature vector $x^{n-1}$ from the previous learning stage is $d$. We randomly shuffle $x^{n-1}$ into $K$ ensemble groups uniformly. That is, each feature group is of dimensionality $d/K$. If $d$ cannot be divided by $K$, zero padding on $x^{n-1}$ is executed. Within each group, the metric learning procedure is executed with the output feature vector $x^{n,k}$, where $k$ is the ensemble group index. Then, the output feature of the $n$-th cascade learning stage will be the concatenation of all $x^{n,k}$ as $x^n = [x^{n,1}; x^{n,2}; \dots; x^{n,K}]$.
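The grouping step of one ensemble stage (zero padding, random shuffle, per-group processing, concatenation) might look like the sketch below. It assumes row-wise features and that padding is applied before shuffling; `group_fn` is a hypothetical placeholder for the per-group metric learning, MCD and square root normalization.

```python
import numpy as np

def split_into_groups(X, n_groups, rng):
    """Randomly shuffle feature columns and split them into equal-size groups.

    X is (n_samples x d). Zero columns are appended when d is not divisible
    by n_groups. Returns a list of n_groups arrays.
    """
    n_samples, d = X.shape
    pad = (-d) % n_groups
    if pad:
        X = np.hstack([X, np.zeros((n_samples, pad))])
    perm = rng.permutation(X.shape[1])       # one permutation shared by all samples
    return np.split(X[:, perm], n_groups, axis=1)

def ensemble_stage(X, n_groups, group_fn, rng):
    """Apply group_fn to every feature group and concatenate the outputs."""
    groups = split_into_groups(X, n_groups, rng)
    return np.hstack([group_fn(g) for g in groups])

rng = np.random.default_rng(0)
X = np.arange(14, dtype=float).reshape(2, 7)     # 7 dimensions, not divisible by 3
groups = split_into_groups(X, 3, rng)
out = ensemble_stage(X, 3, lambda g: g, rng)     # identity stands in for per-group learning
```

In practice the permutation drawn at training time would have to be stored and reused on test features, so that each group sees the same dimensions in both phases.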
When the cascade metric learning procedure goes further, the number of the stage-wise ensemble groups will be gradually decreased. Suppose ensemble metric learning stages exist, the number of the ensemble groups at the -th learning stage is set to (i.e., as the previous stage). In this way, the ensemble learning procedure gradually fuses the yielded weak metrics to generate the stronger one. Being combined with cascade metric learning, ensemble metric learning aims to alleviate the potential overfitting problem. Using the 100,000 matched and unmatched face pairs in Fig. 2 as training set, 10,000 face pairs (4,977 matched, and 5,023 unmatched) are additionally randomly sampled from MS-Celeb-1M dataset as test set to verify the generalization capacity of ensemble metric learning. The comparison on K-L divergence between the “Pos” and “Neg” pairwise distance distribution on training and test set among the different metric learning methods is listed in Table I. XQDA and RMML (proposed in Sec. IV) are employed as the basic metric learning approaches. Cascade learning procedure and ensemble cascaded learning procedure are imposed to them respectively. It can be observed that:
For XQDA and RMML, ensemble metric learning can further enhance K-L divergence on test set when it is appended to cascade metric learning;
Ensemble cascade metric learning mechanism can significantly enlarge K-L divergence on test set, compared to the raw FV feature and basic metric learning approaches.
The results above somewhat verify our propositions that, (1) ensemble metric learning indeed helps to suppress overfitting risk, and (2) ensemble cascade metric learning is essentially effective to improve the discriminative power of raw feature.
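The K-L divergence used above to compare the “Pos” and “Neg” distance distributions can be estimated from histograms. The sketch below is one simple estimator under assumptions of our own: the bin count, the smoothing constant `eps`, and the direction KL(Pos || Neg) are not specified in the text.

```python
import numpy as np

def kl_divergence_hist(pos, neg, bins=50, eps=1e-10):
    """Estimate KL(P_pos || P_neg) from histograms over a shared range."""
    lo = min(pos.min(), neg.min())
    hi = max(pos.max(), neg.max())
    p, _ = np.histogram(pos, bins=bins, range=(lo, hi))
    q, _ = np.histogram(neg, bins=bins, range=(lo, hi))
    p = (p + eps) / (p + eps).sum()          # smooth empty bins, then normalize
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(0)
pos = rng.normal(1.0, 0.5, 10000)            # synthetic matched-pair distances
neg_near = rng.normal(1.5, 0.5, 10000)       # heavily overlapping unmatched distances
neg_far = rng.normal(3.0, 0.5, 10000)        # well separated unmatched distances
kl_near = kl_divergence_hist(pos, neg_near)
kl_far = kl_divergence_hist(pos, neg_far)
```

A larger value indicates less overlap between the two distributions, which is how the divergence is read in Table I.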
IV RMML: a robust Mahalanobis metric learning approach
IV-A Revisit on KISSME
As we can see from Fig. 3 and Fig. 5, the Mahalanobis (i.e., linear) metric learning approach plays the fundamental role within the proposed ECML mechanism. An effective and efficient Mahalanobis metric learning approach is preferred to drive ECML. Among the existing linear ones, KISSME [KISSme] with its closed-form solution is widely used due to its balance between effectiveness and efficiency. Here, we first revisit KISSME.
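Let $X = \{x_i\}$ denote the training sample set, $x_{ij} = x_i - x_j$ represent the feature difference between 2 samples, and $y_{ij}$ indicate whether $x_i$ and $x_j$ belong to the same person ($y_{ij} = 1$) or not ($y_{ij} = 0$). KISSME assumes a zero-mean Gaussian structure of the feature difference space. The likelihood of whether $x_{ij}$ corresponds to a matched pair or not can be estimated as
$$P(x_{ij} \mid H_1) = \frac{1}{\sqrt{(2\pi)^d |\Sigma_1|}} \exp\left(-\tfrac{1}{2}\, x_{ij}^\top \Sigma_1^{-1} x_{ij}\right), \qquad P(x_{ij} \mid H_0) = \frac{1}{\sqrt{(2\pi)^d |\Sigma_0|}} \exp\left(-\tfrac{1}{2}\, x_{ij}^\top \Sigma_0^{-1} x_{ij}\right),$$
where $d$ indicates the feature vector dimensionality; $H_1$ denotes the hypothesis that $x_{ij}$ is a matched pair, and $H_0$ represents the hypothesis that $x_{ij}$ is an unmatched pair. $\Sigma_1$ and $\Sigma_0$ are the covariance matrices for the matched and unmatched cases respectively. They are computed as
$$\Sigma_1 = \frac{1}{N_1} \sum_{y_{ij}=1} x_{ij} x_{ij}^\top, \qquad \Sigma_0 = \frac{1}{N_0} \sum_{y_{ij}=0} x_{ij} x_{ij}^\top,$$
where $N_1$ and $N_0$ are the numbers of matched and unmatched pairs.
By applying the log-likelihood ratio test, the distance function can be simplified as
$$d_M(x_i, x_j) = x_{ij}^\top \left(\Sigma_1^{-1} - \Sigma_0^{-1}\right) x_{ij}.$$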
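The KISSME estimate revisited above amounts to two covariance estimates and two matrix inversions. The sketch below follows the standard formulation of [KISSme], assuming label 1 marks matched pairs, label 0 marks unmatched pairs, and that each covariance is normalized by its pair count; the synthetic data is purely illustrative.

```python
import numpy as np

def kissme(diffs, labels):
    """Estimate the KISSME metric from pairwise feature differences.

    diffs  : (n_pairs x d) array of x_i - x_j
    labels : (n_pairs,) array, 1 for matched pairs and 0 for unmatched pairs
    """
    matched = diffs[labels == 1]
    unmatched = diffs[labels == 0]
    cov_matched = matched.T @ matched / len(matched)
    cov_unmatched = unmatched.T @ unmatched / len(unmatched)
    # np.linalg.inv raises LinAlgError for an exactly singular matrix and is
    # numerically unreliable for a nearly singular one.
    return np.linalg.inv(cov_matched) - np.linalg.inv(cov_unmatched)

rng = np.random.default_rng(0)
matched = rng.normal(0.0, 0.5, size=(500, 4))      # small differences: same person
unmatched = rng.normal(0.0, 2.0, size=(500, 4))    # large differences: different persons
diffs = np.vstack([matched, unmatched])
labels = np.concatenate([np.ones(500, dtype=int), np.zeros(500, dtype=int)])

M = kissme(diffs, labels)
d_matched = np.einsum("nd,de,ne->n", matched, M, matched).mean()
d_unmatched = np.einsum("nd,de,ne->n", unmatched, M, unmatched).mean()
```

The two calls to `np.linalg.inv` are exactly where the method becomes fragile when the matched-pair covariance is close to singular.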
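Table I: K-L divergence between the “Pos” and “Neg” pairwise distance distributions on the training and test sets.
| ML method | Training set | Test set |
|---|---|---|
| Raw FV feature | 0.2209 | 0.1655 |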
KISSME is indeed an effective and efficient linear metric learning method. However, we argue that it still suffers from 2 main defects, according to the revisit above as follows:
First, it needs to fit Gaussian models for the feature difference (not the unsigned distance) of the matched and unmatched sample pairs. However, for faces the feature difference does not always obey a Gaussian distribution. To verify this, Fig. 6 shows the feature difference distribution of the 100,000 matched and unmatched face pairs in Fig. 2 when using the CNN feature. It can be clearly observed that the feature difference of the unmatched face pairs is not Gaussian distributed;
Secondly, KISSME needs to invert the covariance matrices of the Gaussian models that correspond to the feature difference on matched and unmatched face pairs ($\Sigma_1$ for the matched case, and $\Sigma_0$ for the unmatched case). Nevertheless, since the matched face pairs are always highly correlated as shown in Fig. 6, $\Sigma_1$ tends to be singular in some cases. Thus, the computation of $\Sigma_1^{-1}$ may occasionally fail. Although PCA is applied in KISSME to alleviate this problem to some extent, the execution of PCA may lead to a loss of discriminative information.
To address the above problems in KISSME, we propose a robust Mahalanobis metric learning approach (RMML). It aims to push the feature difference of the matched face pairs close to the origin, and pull the feature difference of the unmatched face pairs far from the origin, to enhance discriminative power. Different from KISSME, RMML does not require inverting the covariance matrices of Gaussian models, which enhances robustness. Meanwhile, RMML has a closed-form solution, so that it can run in a one-pass learning manner without a complex iterative optimization procedure.
IV-B RMML formulation
Denote as the training set of face samples, following KISSME we cast RMML into the feature difference space as . Let indicate whether and belong to the same person () or not (). As aforementioned, RMML is proposed to push the feature difference of the matched face pairs close to origin, and pull the feature difference of the unmatched face pairs far from origin. To this end, we propose to search a Mahalanobis matrix able to minimize intra-class distance and maximize inter-class distance. As a consequence, the discriminative term of RMML is defined as
where and are used to normalize the distance of the matched and unmatched face pairs to be of comparable magnitude. Besides the discriminative term, a regularization term is also given by
indicates the identity matrix; andrepresents Frobenius norm of matrix. The regularization term aims to resist feature space distortion. By incorporating and , the learning procedure of RMML is formulated by
where plays the role to balance the effect of and . It is worthy noting that, within RMML we do not require the feature difference distributes in Gaussian form as KISSME.
IV-C Closed-form solution of RMML
Eqn. 16 is actually a convex optimization problem. It can be solved using the existing optimization packages. However, this procedure is time consuming and requires powerful computational resource, especially for large-scale learning tasks. To address this, we propose a closed-form solution of RMML. To solve Eqn. 16, the discriminative term is rewritten as
where of size is feature difference matrix for the matched face pairs; of size is feature difference matrix for the unmatched face pairs; is the feature dimension number; and indicates the trace of a matrix. In particular, the column of corresponds to the feature difference of the samples from the same person (i.e., ), and the column of corresponds to the feature difference of the samples from the different persons (i.e., ). Then taking the derivative of on , we have
Setting , it can be obtained that
In order to make and to be comparable on magnitude, we normalize by the mean of its eigenvalues as
where is the mean of eigenvalues. Finally, we obtain the closed-form solution of RMML as
It is worth noting that no matrix inversion is required to obtain the closed-form solution. The computation failure on matrix inversion in KISSME will therefore not happen to RMML. Compared to KISSME, RMML is indeed more robust.
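To illustrate why a formulation of this kind needs no matrix inversion, the sketch below solves an assumed objective in closed form. The objective, the placement of the balance parameter `lam`, and the normalization by pair counts are our assumptions for illustration and are not claimed to be the exact RMML formulation; the eigenvalue-based normalization step described above is left out.

```python
import numpy as np

def inversion_free_metric(diffs, labels, lam=0.5):
    """Closed-form minimizer of an ASSUMED objective (not confirmed by the text):

        J(M) = mean_matched(x^T M x) - mean_unmatched(x^T M x) + lam * ||M - I||_F^2

    Setting dJ/dM = 0 gives M = I + (C_unmatched - C_matched) / (2 * lam),
    where C_* are second-moment matrices of the feature differences.
    """
    matched = diffs[labels == 1]
    unmatched = diffs[labels == 0]
    c_matched = matched.T @ matched / len(matched)
    c_unmatched = unmatched.T @ unmatched / len(unmatched)
    d = diffs.shape[1]
    return np.eye(d) + (c_unmatched - c_matched) / (2.0 * lam)

rng = np.random.default_rng(0)
# Highly correlated matched differences: every row is a multiple of (1, 1, 0),
# so their second-moment matrix has rank 1 and cannot be inverted.
matched = 0.1 * np.outer(rng.normal(size=20), [1.0, 1.0, 0.0])
unmatched = rng.normal(0.0, 2.0, size=(200, 3))
diffs = np.vstack([matched, unmatched])
labels = np.concatenate([np.ones(20, dtype=int), np.zeros(200, dtype=int)])

rank_matched = np.linalg.matrix_rank(matched.T @ matched / len(matched))
M = inversion_free_metric(diffs, labels)
d_matched = np.einsum("nd,de,ne->n", matched, M, matched).mean()
d_unmatched = np.einsum("nd,de,ne->n", unmatched, M, unmatched).mean()
```

Even though the matched-pair matrix is rank deficient here, the metric is obtained with additions and one scaling only, which is the robustness property the text emphasizes.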
To verify the effectiveness and efficiency of the proposed ensemble cascade metric learning (ECML) mechanism and robust Mahalanobis metric learning approach (RMML), we choose to test them on the recently proposed MS-Celeb-1M dataset [guo2016ms]. This dataset consists of 1 million celebrities, and each person is of over 20 labeled images. The samples are of high intra-person variation as shown in Fig. 7, due to the issues of pose, age, makeup, facial expression, etc. This actually imposes great challenges to accurate face verification. During experiments, all the face images will be regularized to pixels using the face alignment approach in [seetaalign].
To justify the generalization capacity of our propositions, the widely-used deep convolutional neural network (CNN)[VGG] and Fisher vector (FV) [FVface] face characterization manners are employed as the input feature of ECML and RMML respectively.
To demonstrate the scalability and robustness of our approach, the experiments are conducted using the small-scale and large-scale test protocols on CNN feature as follows:
Small-scale test protocol: 6,000 images are randomly selected from 84 persons for training, and 4,000 images are randomly selected from 52 persons for test. 6,000 training pairs are randomly selected, including 3,169 positive pairs and 2,831 negative pairs. 4,000 verification pairs are randomly selected for test, including 2,160 positive pairs and 1,840 negative pairs;
Large-scale test protocol: 450,000 images are randomly selected from 6,700 persons for training, and 50,000 images are randomly selected from 807 persons for test. 450,000 training pairs are randomly selected, including 218,535 positive pairs and 231,465 negative pairs. 50,000 verification pairs are randomly selected for test, including 26,851 positive pairs and 23,149 negative pairs.
Since the raw FV face feature requires a huge amount of memory, the experiments on it are conducted using a moderate-scale protocol as follows:
Moderate-scale test protocol: 100,000 images are randomly selected from 1,267 persons for training, and 10,000 images are randomly selected from 138 persons for test. 100,000 training pairs are randomly selected, including 55,043 positive pairs and 44,957 negative pairs. 10,000 verification pairs are randomly selected for test, including 4,977 positive pairs and 5,023 negative pairs.
Wide-range comparisons with the state-of-the-art metric learning approaches are executed, including LMNN [LMNN], LDML [LDML], ITML [ITML], KISSME [KISSme], SILD [SILD], and XQDA [XQDA]. In addition, the Mahalanobis matrix for genuine pairs [KISSme] is learnt as a baseline. Since LMNN, XQDA and LDML need to specify the identity of each face image, full label information is provided for these methods. The output feature dimensionality of LMNN and LDML is set to be the same as that of the input feature. Parameters within ITML are set to the default ones. Following [XQDA], for XQDA and SILD the dimensions with eigenvalue larger than 1 are preserved. For RMML, when it runs independently the balance parameter is set to 0.5. When it is embedded into ECML, the balance parameter is set to 0.1 for all the learning stages. For ECML, the cascade metric learning stage number is set to 3, plus 1 final linear metric learning stage.
The existing studies [KISSme, XQDA] generally reduce the feature dimensionality to a fixed length chosen empirically. However, according to our experience, the PCA dimensionality can substantially influence the performance of metric learning approaches. Hence, to conduct a thorough investigation of its effect on the different metric learning methods, PCA is applied to the raw feature with different output dimensionalities (i.e., 640, 320, 160, 80 and 40). We also compare our proposition with the other metric learning approaches in each PCA case to demonstrate its superiority.
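The dimensionality sweep described above can be sketched with a plain SVD-based PCA. The random data and the reduced sizes (8, 4, 2) are illustrative stand-ins for the real features and for the 640 to 40 dimensional settings; the helper names are hypothetical.

```python
import numpy as np

def pca_fit(X, n_components):
    """Return the training mean and the top principal directions (d x n_components)."""
    mean = X.mean(axis=0)
    # Rows of Vt are principal directions, sorted by decreasing singular value.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components].T

def pca_transform(X, mean, components):
    """Project features with a projection learnt on the training set only."""
    return (X - mean) @ components

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 16))
X_test = rng.normal(size=(50, 16))

reduced = {}
for k in (8, 4, 2):
    mean, comps = pca_fit(X_train, k)
    reduced[k] = pca_transform(X_test, mean, comps)
```

Each reduced feature set would then be fed to every metric learning method in turn, so that the methods are compared at equal input dimensionality.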
Euclidean distance is employed to measure the similarity between the 2 face samples for verification. Equal Error Rate (EER) [deepface] is reported as the performance evaluation criterion for effectiveness. The running speed of the different approaches is also compared to assess efficiency.
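For reference, EER is the operating point at which the false accept rate equals the false reject rate. The following sketch approximates it by sweeping a distance threshold; the exact evaluation code used in the experiments is not given in the text, and the loop is quadratic in the number of pairs, so it only suits small examples.

```python
import numpy as np

def equal_error_rate(distances, labels):
    """Approximate EER by sweeping a distance threshold.

    distances : (n_pairs,) smaller means more likely the same person
    labels    : (n_pairs,) 1 for matched pairs, 0 for unmatched pairs
    """
    distances = np.asarray(distances, dtype=float)
    labels = np.asarray(labels)
    best_gap, eer = np.inf, 1.0
    for t in np.unique(distances):
        accept = distances <= t
        far = np.mean(accept[labels == 0])       # unmatched pairs wrongly accepted
        frr = np.mean(~accept[labels == 1])      # matched pairs wrongly rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2.0
    return eer

distances = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [1, 1, 1, 0, 1, 0, 0, 0]
eer = equal_error_rate(distances, labels)            # one matched/unmatched pair is swapped
eer_perfect = equal_error_rate([0.1, 0.2, 0.8, 0.9], [1, 1, 0, 0])
```

Lower is better: perfectly separated distances give an EER of 0, while the toy example with one swapped pair gives 0.25.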
Since a random feature shuffle operation exists within ECML (in the ensemble metric learning phase specifically), the ECML-boosted metric learning approaches are run 5 times and the average EER is reported to suppress the effect of randomness. This leads to a fairer comparison among the different metric learning methods.
V-A Performance comparison on CNN feature
The deep learning paradigm, CNN especially, has brought impressive advances to the state-of-the-art performance on the face verification task. Unlike handcrafted features (e.g., SIFT and HOG), CNN possesses a strong capacity for feature learning in a data-driven manner. To conduct experiments on the CNN feature, we train a small CNN on the CASIA dataset [CASIA]. The employed CNN architecture is shown in Fig. 8. The 640-dimensional output of the first fully-connected layer is employed as the face feature. As aforementioned, experiments on the CNN feature are conducted using both the small-scale and large-scale test protocols. Next, we introduce the experimental results that correspond to these 2 protocols respectively.
V-A1 Test using small-scale protocol
The performance comparison of EER among the different metric learning approaches using the small-scale test protocol is listed in Table II. It can be observed that:
When embedding RMML into ECML mechanism (i.e., EC-RMML), it outperforms the other metric learning approaches significantly in all the test cases that correspond to the different PCA dimensionalities. That is, the performance enhancement on EER is at most and at least. This indeed verifies the effectiveness of our proposition that combines ECML and RMML;
ECML mechanism is able to enhance the performance of RMML remarkably. The performance gain yielded by ECML is at most and at least on EER. Hence, this demonstrates that ECML is an effective cascade metric learning mechanism for performance improvement towards face verification;
Without ECML, RMML achieves comparable performance with the other metric learning methods in the test cases. It is worth noting that RMML is derived from KISSME but consistently performs better. The likely reason is that the feature difference of unmatched face pairs does not follow the Gaussian distribution assumed by KISSME (demonstrated in Fig. 6). This also reveals that the choice of the optimal metric learning approach somewhat depends on the raw feature distribution;
When embedding KISSME into the ECML mechanism (i.e., EC-KISSME), it fails to work because of the computation failure on the inversion of the covariance matrices of the Gaussian models. Nevertheless, this does not happen to RMML, which demonstrates the robustness of RMML in application. When embedding XQDA into the ECML mechanism (i.e., EC-XQDA), the performance of XQDA is enhanced when PCA dimensionality is set to 80 and 40. This demonstrates the general effectiveness of the ECML mechanism for boosting the performance of linear metric learning methods;
Generally, all of the involved metric learning approaches can improve the discriminative power of the raw feature. Moreover, as PCA dimensionality is reduced, the performance of all the metric learning methods is enhanced in most cases.
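The PCA step referred to in the observations above can be illustrated with a short sketch. The text does not specify how PCA is implemented, so the SVD-based routine below and the names `fit_pca` and `apply_pca` are assumptions made for illustration only; the projection is learnt on training features and then applied unchanged to test features.

```python
import numpy as np

def fit_pca(train_feats, n_components):
    """Learn a PCA projection from training features (rows are samples)."""
    X = np.asarray(train_feats, dtype=float)
    mean = X.mean(axis=0)
    # The right singular vectors of the centred data are the principal axes,
    # ordered by decreasing variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components].T   # shapes: (d,), (d, n_components)

def apply_pca(feats, mean, components):
    """Project features onto the learnt principal axes."""
    return (np.asarray(feats, dtype=float) - mean) @ components
```

Under this sketch, reducing the 640-dimensional CNN feature to, e.g., 80 or 40 dimensions amounts to calling `fit_pca(train_feats, 80)` or `fit_pca(train_feats, 40)` before metric learning.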
V-A2 Test using large-scale protocol
During the large-scale test, since the gradient descent based metric learning approaches are extremely time-consuming, we only report the results of the scalable methods with closed-form or approximate closed-form solutions. The performance comparison of EER among the different metric learning approaches using the large-scale test protocol is listed in Table III. From the experimental results, we can see that:
In the large-scale test case, EC-RMML still significantly outperforms all the other metric learning approaches. It is worth noting that EC-RMML works effectively in both the large-scale and small-scale cases. This verifies the generalization capacity of EC-RMML with respect to data scale. Accordingly, we conjecture that EC-RMML has strong potential to exploit the discriminative information within big data;
ECML still significantly enhances the performance of RMML in all test cases. This verifies that ECML is an effective ensemble cascade metric learning mechanism suitable for both small-scale and large-scale face data;
RMML outperforms the other scalable metric learning approaches of the closed-form solution. It demonstrates that, RMML is a more suitable metric learning method for face verification when CNN feature is used for face characterization;
It is worth noting that, in large-scale test case and XQDA may even be inferior to the raw CNN feature with PCA only. However, this does not happen to RMML and EC-RMML. In our opinion, the fitting capacity of the former metric learning approaches for large-scale face data is not strong enough on CNN feature.
In addition, to intuitively reveal the effect of EC-RMML towards face verification we draw the CNN face feature distribution before and after using EC-RMML in Fig. 9. In particular, 1276 face images from 15 persons are involved. Obviously, using EC-RMML the distance between the different persons has been enlarged to essentially improve the discriminative power.
V-B Performance comparison on FV feature
Here, we extract the FV-based face feature following the paradigm in [FVface]. In particular, SIFT is used as the low-level feature and 30 Gaussians are involved in the Gaussian mixture model (GMM). Using the moderate-scale test protocol, the performance comparison of EER among the different metric learning approaches is listed in Table IV. Since the gradient descent based metric learning approaches are extremely time-consuming, we only report the results of the scalable methods with closed-form or approximate closed-form solutions. It can be summarized that:
On FV-based face representation, neither EC-RMML nor RMML achieves the best performance. They are generally inferior to KISSME and XQDA. The likely reason is that the pairwise FV feature difference follows a Gaussian distribution on face data, which is beneficial for KISSME and XQDA. To verify this, Fig. 10 shows the feature difference distribution of the 2,000 matched and unmatched face pairs using FV feature. Both the matched and unmatched cases are approximately Gaussian. This also supports our viewpoint in Sec. V-A that the performance of metric learning methods relies on the raw feature distribution, even for the same visual recognition task;
ECML still consistently enhances the performance of RMML on FV feature. This demonstrates the effectiveness and generalization capacity of ECML for the different features. Additionally, we apply ECML to KISSME and XQDA. Actually, EC-XQDA outperforms all the other approaches. This reveals that ECML is not only applicable to RMML, which will be further analyzed next;
Generally speaking, FV feature is inferior to CNN feature for face verification.
V-C Discussion on ECML mechanism
V-C1 Generalization capacity of ECML
In Sec. V-B, it has been revealed that ECML can boost the performance of XQDA on FV feature as well as that of RMML. Since CNN feature has stronger face characterization ability, to better justify the generalization capacity of ECML for face verification we summarize the performance of EC-KISSME, EC-XQDA and EC-RMML, using both the large-scale and small-scale test protocols. The performance comparison among the raw KISSME and XQDA, and their ensemble cascaded versions is listed in Table V. We can see that, if no computation failure occurs, ECML improves the performance on CNN feature in most cases for KISSME, XQDA and RMML alike. The experimental results in Tables II, III, IV and V verify the effectiveness and generalization capacity of ECML across the different metric learning methods and visual features.
Table VI: Average value and standard deviation of EER (%) of EC-RMML on CNN feature, using the large-scale test protocol. In particular, EC-RMML runs 10 times with the random feature shuffle operation. “Avg.” indicates the average value of EER, and “Std.” denotes the standard deviation of EER.
V-C2 Stability of ECML towards random feature shuffle
Since a random operation that splits the input feature into different groups exists in the ensemble learning procedure of ECML, to verify the stability of ECML we run EC-RMML 5 times on CNN feature using the large-scale test protocol. The average value and standard deviation of the EER of EC-RMML are reported in Table VI to justify the effectiveness and stability of ECML simultaneously. The performance of RMML is also reported. It can be observed that ECML significantly enhances the performance of RMML with low standard deviation (i.e., less than ). This demonstrates the stability of ECML towards the random feature shuffle operation during the phase of ensemble metric learning.
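To make the source of randomness concrete, the sketch below shuffles the feature dimensions and splits them into disjoint groups, then summarizes the EER over repeated runs. It is only an illustration under stated assumptions: the number of groups, the seeding scheme, and the names `random_feature_groups` and `summarize_runs` are hypothetical and are not taken from the ECML implementation.

```python
import numpy as np

def random_feature_groups(n_dims, n_groups, seed):
    """Shuffle the feature indices and split them into disjoint groups."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_dims), n_groups)

def summarize_runs(eers):
    """Average and (population) standard deviation of EER over repeated runs."""
    eers = np.asarray(eers, dtype=float)
    return eers.mean(), eers.std()
```

Each run would use a different seed, so that the groups differ from run to run; a small standard deviation from `summarize_runs` then indicates that the result is insensitive to the particular shuffle.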
V-C3 Effectiveness of cascade metric learning and ensemble metric learning mechanism
To reveal the effectiveness of cascade metric learning and ensemble metric learning mechanism, we compare the performance of RMML, its cascade boosted version and ensemble cascaded version using CNN feature with the small-scale and large-scale test protocols respectively in Table VII. We can see that:
In both the large-scale and small-scale test cases, the cascade metric learning mechanism consistently improves the performance of RMML by large margins. This justifies the effectiveness of the cascade metric learning mechanism;
Generally, ensemble metric learning mechanism further enhances the performance of the cascade counterpart when the dimensionality is larger than 40. In these cases, the probability of overfitting is relatively higher. This demonstrates that, ensemble metric learning helps to alleviate the overfitting problem that may happen during the phase of cascade metric learning.
V-C4 Good balance between underfitting and overfitting of ECML
As aforementioned, the key research motivation of ECML is to achieve a good balance between underfitting and overfitting. To reveal this, we conduct experiments using FV feature of 80 PCA dimensionality with different amounts of training samples. The test set follows the moderate-scale test protocol setting. The experiments are executed on XQDA and RMML respectively. The training and test EER are reported simultaneously to reflect the relationship between underfitting and overfitting. The experimental results are shown in Fig. 11. We can see that:
XQDA tends to be trapped in the overfitting problem. That is, its test EER is much higher than its training EER. However, when ECML is applied to XQDA, the test EER is generally reduced, and the gap between training EER and test EER is also narrowed. Thus, the discriminative power and generalization capacity of XQDA are improved by ECML due to its good balance between underfitting and overfitting;
Using FV feature, RMML generally suffers from underfitting problem. That is, both of the training EER and test EER are high but with the relatively small performance gap. When ECML is applied, both of the training EER and test EER have been reduced. And, the performance gap between training and test EER is not significantly enlarged. This also demonstrates that, ECML actually can maintain the balance between underfitting and overfitting.
V-C5 Cascade metric learning stage number setting
The cascade metric learning mechanism is proposed to improve the fitting capacity of the existing metric learning approaches. Intuitively, when the cascade metric learning procedure goes deeper, the discriminative power of the yielded distance metric is further enhanced, but the overfitting risk also increases. Thus, setting a suitable cascade metric learning stage number is essential for a good balance between underfitting and overfitting. To address this, we set the cascade metric learning stage number from 1 to 5, excluding the final metric learning stage. The ensemble cascaded version of RMML (i.e., EC-RMML) runs on CNN feature, using the large-scale and small-scale test protocols respectively. PCA number is set to 640. The performance comparison among the different cascade metric learning stage numbers is listed in Table VIII. We can see that, as the cascade metric learning stage number increases, the performance of EC-RMML is initially enhanced. However, when it is too large (e.g., over 3 for the large-scale protocol), the performance drops instead. That is, overfitting may occur. Accordingly, the cascade metric learning stage number is empirically set to 3.
V-D Discussion on RMML
V-D1 Parameter setting on
Within RMML, this parameter balances the effect of the discriminative term and the regularization term. Theoretically, the larger it is, the more discriminative the learnt distance metric is, but the higher the overfitting risk. To choose it, we vary it from 0 to 1.2 with a stride of 0.1. Fig. 12 (a) shows the performance of RMML with the different parameter values on CNN feature, using the large-scale and small-scale test protocols respectively. PCA number is set to 640. It can be observed that, when the parameter is small (i.e., less than 0.5), the performance of RMML is enhanced remarkably as it increases. However, when it is equal to or larger than 0.5, the performance gain is not significant, or the performance even drops. The likely reason is that overfitting tends to happen when the parameter is too large. Thus, it is set to 0.5 for RMML when it runs solely.
Meanwhile, we also investigate the setting of this parameter for RMML when it is embedded into ECML. Under the same experimental setting as RMML, the performance of EC-RMML for the different parameter values is shown in Fig. 12 (b). It can be seen that the performance of EC-RMML is first enhanced as the parameter increases. However, when it is too large (e.g., over 0.1 in the large-scale protocol), the performance drops instead. Accordingly, it is set to 0.1 for RMML when it is embedded into ECML.
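The parameter search described above (values from 0 to 1.2 with a stride of 0.1) can be sketched as a simple grid sweep. The callable `evaluate_eer` is a hypothetical stand-in for training RMML or EC-RMML with a given parameter value and returning the resulting EER, and picking the lowest EER is an assumed selection rule; the text instead chooses the value at which the gain saturates.

```python
import numpy as np

def sweep_parameter(evaluate_eer, start=0.0, stop=1.2, step=0.1):
    """Evaluate a balancing parameter on a regular grid.

    evaluate_eer : callable mapping a parameter value to an EER (hypothetical)
    Returns the value with the lowest EER, the grid, and all EERs.
    """
    # linspace avoids the floating-point end-point issues of arange.
    n_points = int(round((stop - start) / step)) + 1
    grid = np.linspace(start, stop, n_points)
    eers = np.array([evaluate_eer(v) for v in grid])
    return grid[int(np.argmin(eers))], grid, eers
```

With the default arguments the grid contains 13 values, matching the range and stride stated above.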
V-D2 Comparison with KISSME
The proposition of RMML is derived from KISSME. The essential difference between them is that RMML does not assume that the pairwise feature difference of the matched and unmatched samples follows a Gaussian distribution. Consequently, on FV feature, which obeys the Gaussian distribution assumption, KISSME performs better than RMML as shown in Table IV. But on CNN feature, which does not preserve this distribution property, RMML consistently outperforms KISSME as shown in Tables II and III. Since CNN feature has stronger discriminative power than FV feature for face characterization, we conclude that RMML is a better choice than KISSME for face verification. Meanwhile, EC-RMML avoids the computation failure problem that may happen to EC-KISSME, as shown in Table II, Table III and Table IV. Thus, RMML is more robust for practical applications.
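For reference, the KISSME metric is the difference between the inverse covariance matrices of the matched and unmatched pairwise feature differences, each modelled as a zero-mean Gaussian. The sketch below illustrates this construction and one plausible cause of the computation failure mentioned above, namely a rank-deficient covariance matrix; the function names are hypothetical, the explicit rank check is added for illustration, and the projection onto the positive semi-definite cone used in the original KISSME is omitted for brevity.

```python
import numpy as np

def kissme_metric(matched_diffs, unmatched_diffs):
    """KISSME-style metric: M = inv(Sigma_matched) - inv(Sigma_unmatched).

    Each row of the inputs is a pairwise feature difference x_i - x_j.
    Raises numpy.linalg.LinAlgError if a covariance matrix is singular.
    """
    S = np.asarray(matched_diffs, dtype=float)
    D = np.asarray(unmatched_diffs, dtype=float)
    cov_s = S.T @ S / len(S)   # zero-mean Gaussian assumption
    cov_d = D.T @ D / len(D)
    for name, cov in (("matched", cov_s), ("unmatched", cov_d)):
        # Illustrative guard: a rank-deficient covariance cannot be inverted.
        if np.linalg.matrix_rank(cov) < cov.shape[0]:
            raise np.linalg.LinAlgError(f"{name} covariance is singular")
    return np.linalg.inv(cov_s) - np.linalg.inv(cov_d)

def mahalanobis_sq(x, y, M):
    """Squared distance (x - y)^T M (x - y) under the metric M."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(diff @ M @ diff)
```

In this sketch the failure is triggered, for instance, whenever fewer difference vectors than feature dimensions are available for either class.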
V-E Running time analysis
The running time analysis is listed in Table IX, Table X and Table XI. The run time comparison among the different metric learning methods is listed in Table IX. The experiment is conducted on CNN feature using the small-scale test protocol, with the PCA number of 640. In Table X and Table XI, we report the training and test time consumption of RMML and EC-RMML over each experimental task. In particular, training time indicates the whole time consumption for training on all samples, and test time denotes the time consumption of testing per sample pair. The raw feature extraction time is excluded. All the experiments run on a computer with Intel(R) Xeon(R) E5-2640 @ 2.00GHz (only using one core) in Matlab. We can see that, due to the closed-form solution, both RMML and EC-RMML run fast compared to the other metric learning approaches. Moreover, introducing ECML to RMML does not yield heavy extra time consumption. Considering the remarkable performance gain yielded by ECML, it achieves a good tradeoff between effectiveness and efficiency.
|CNN feature Large-scale||RMML||2.6251||1.0121||0.4711||0.1895||0.0987|
|CNN feature Small-scale||RMML||0.1396||0.0526||0.0255||0.0138||0.0057|
|CNN feature Large-scale||RMML||0.0020||0.0011||0.0006||0.0003||0.0002|
|CNN feature small-scale||RMML||0.0020||0.0010||0.0005||0.0003||0.0002|
In this paper, we propose an ensemble cascade metric learning mechanism (ECML) for face verification. Essentially, ECML has the advantage of achieving a good balance between underfitting and overfitting. Specifically, cascade metric learning is executed to boost discriminative power and thus address the underfitting problem. Meanwhile, ensemble metric learning is conducted jointly to alleviate the underlying overfitting risk. The extensive experiments demonstrate that ECML can improve the performance of the existing metric learning methods on different visual features, without a heavy extra computational burden.
A robust Mahalanobis metric learning approach (RMML) with a closed-form solution is also proposed. Unlike KISSME, RMML does not require the pairwise feature difference of the matched and unmatched samples to follow a Gaussian distribution. It also avoids the potential computation failure problem that happens to KISSME, SILD, XQDA, etc. On CNN feature, RMML and its ensemble cascaded version (EC-RMML) outperform the other metric learning approaches significantly. The running speed is also high.
Currently, the cascade metric learning stage number within ECML is not high due to the overfitting problem. Inspired by the great success of deep residual neural networks with hundreds of layers [resnet], in future work we plan to introduce the idea of feature residual to ECML. We expect this to help deepen the cascade metric learning stages and further enhance the discriminative power of the learnt distance metric.
This work is jointly supported by National Natural Science Foundation of China (Grant No. 61502187, 61772256, and 61876211), the Equipment Pre-research Field Fund of China (Grant No. 61403120405), National Key R&D Program of China (No. 2018YFB1004600), Fundamental Research Funds for the Central Universities (Grant No. 2019kfyXKJC024), and National Key Laboratory Open Fund of China (Grant No. 6142113180211). Joey Tianyi Zhou is supported by Singapore Government’s Research, Innovation and Enterprise 2020 Plan (Advanced Manufacturing and Engineering domain) under Grant A18A1b0045.