Face recognition is a primary technique in computer vision to model and understand the real world. Many methods and enormous datasets [vggface2, msceleb1m, kemelmacher2016megaface, vggface, imdbface, casia_webface] have been introduced, and recently, methods that use deep learning [ArcFace, uniformface, AFRN, SphereFace, CosFace, centerloss, regularface] have greatly improved face recognition accuracy, but the accuracy still falls short of expectations.
To reduce the shortfall, most recent research in face recognition has focused on improving the loss function. The line of work from CenterLoss [centerloss], CosFace [CosFace], ArcFace [ArcFace] and RegularFace [regularface] tries to minimize the intra-class variation and maximize the inter-class variation. These methods are effective and have gradually improved accuracy by refining the learning objective.
Despite the development of loss functions, general-purpose networks that were not devised for face recognition can have difficulty learning to recognize a huge number of person identities. Unlike common problems such as classification, a face-recognition model encounters new identities in the evaluation stage that are not included in the training set. Thus, the model has to embed nearly 100k identities [msceleb1m] in the training set and also account for a huge number of unknown identities. However, most existing methods just attach several fully-connected layers after widely used backbone networks such as VGG [vggface] and ResNet [resnet], without any design that targets the characteristics of face recognition.
Grouping is a key idea for efficiently and flexibly embedding a significant number of people and for briefly describing an unknown person. Each person has his or her own facial characteristics. At the same time, people share common characteristics within a group. In the real world, a group-based description (a man with deep, black eyes and a red beard) that involves the common characteristics of the group can be useful to narrow down the set of candidates, even though it cannot identify the exact person. Unfortunately, explicit grouping requires manual categorization of huge data and may be limited by the finite range of descriptions available to human knowledge. However, by adopting the concept of grouping, the recognition network can reduce the search space and flexibly embed a significant number of identities into an embedding feature.
We propose a novel face-recognition architecture called GroupFace that learns multiple latent groups and constructs group-aware representations to effectively adopt the concept of grouping (Figure 1). We define Latent Groups, which are internally determined as latent variables by comprehensively considering facial factors (e.g., hair, pose, beard) and non-facial factors (e.g., noise, background, illumination). To learn the latent groups, we introduce a self-distributed grouping method that determines group labels by considering the overall distribution of latent groups. The proposed GroupFace structurally ensembles multiple group-aware representations into the original instance-based representation for face recognition.
We summarize the contributions as follows:
GroupFace is a novel face-recognition-specialized architecture that integrates the group-aware representations into the embedding feature and provides well-distributed group-labels to improve the quality of feature representation. GroupFace also suggests a new similarity metric to consider the group information additionally.
We prove the effectiveness of GroupFace in extensive experiments and ablation studies on the behaviors of GroupFace.
GroupFace can be applied to many existing face-recognition methods to obtain a significant improvement with a marginal increase in resources. In particular, a hard-ensemble version of GroupFace can achieve high recognition accuracy by adaptively using only a few additional convolutions.
2 Related Works
Face recognition has been studied for decades. Many researchers proposed machine learning techniques with feature engineering [10.1007/978-3-540-24670-1_36, 6619233, joint_bayesian, fisherface, 5459250, CSML, eigenface, WHT:ECCVW08:DBMW, QiYin:2011:AMF:2191740.2192084]. Recently, deep learning methods have overcome the limitations of traditional face-recognition approaches with public face-recognition datasets [vggface2, msceleb1m, kemelmacher2016megaface, vggface, imdbface, casia_webface]. DeepFace [deepface] used 3D face frontalization to achieve a breakthrough in face recognition methods that use deep learning. FaceNet [FaceNet] proposed triplet loss to maximize the distance between an anchor and its negative sample, and to minimize the distance between the same anchor and its positive sample. CenterLoss [centerloss] proposed center loss to minimize the distance between samples and their class centers. MarginalLoss [marginal_loss] adopted the concept of margin to minimize intra-class variations and to keep inter-class distances with margin. RangeLoss [range_loss] used long-tailed data during the training stage. RingLoss [ring_loss] constrained a feature’s magnitude to be a certain number. NormFace [NormFace] proposed to normalize features and fully connected layer weights; verification accuracy increased after normalization. SphereFace [SphereFace] proposed angular softmax (A-Softmax) loss with a multiplicative angular margin. Based on A-Softmax, CosFace [CosFace] proposed an additive cosine margin and ArcFace [ArcFace] applied an additive angular margin. The authors of RegularFace [regularface] and UniformFace [uniformface] argued that approaches that use an angular margin [ArcFace, SphereFace, CosFace] concentrate on intra-class compactness only, and suggested new losses to increase the inter-class variation. In general, these previous methods focused on improving loss functions to raise face recognition accuracy with a conventional feature representation.
A slight change, such as adding a few layers or increasing the number of channels, commonly did not bring a noticeable improvement. However, GroupFace improves the quality of feature representation and achieves a significant improvement by adding a few more layers in parallel.
Grouping or clustering methods such as k-means internally categorize samples by considering relative metrics such as cosine similarity or Euclidean distance, without explicit class labels. In general, these clustering methods attempt to construct well-distinguished categories by preventing the assignment of most images to one or a few clusters. Recently, several methods that use deep learning [caron2018deep, noroozi2016unsupervised, yang2016joint] have been introduced. These methods are effective; however, like earlier clustering methods, they operate on full batches rather than on the mini-batches used in deep learning. Thus, they cannot readily be incorporated into a deep, end-to-end application framework. To efficiently learn the latent groups, our method introduces a self-distributed grouping method that considers an expectation-normalized probability in a deep manner.
3 Proposed Method
Our GroupFace learns the latent groups by using a self-distributed grouping method, constructs multiple group-aware representations and ensembles them into the standard instance-based representation to enrich the feature representation for face recognition.
We discuss how the scheme of latent groups is effectively integrated into the embedding feature in GroupFace.
In this paper, we refer to the feature vector of conventional face-recognition schemes [ArcFace, CosFace, centerloss, regularface] as an Instance-based Representation (Figure 2). The instance-based representation is commonly trained as an embedding feature by using a softmax-based loss (e.g., CosFace [CosFace] and ArcFace [ArcFace]) and is used to predict an identity as:
$$p(y \mid x) = \mathrm{softmax}\big(g(v_x)\big),$$
where $y$ is an identity label, $v_x$ is the instance-based representation of a given sample $x$, and $g$ is a function that projects an embedding feature of 512 dimensions into an $M$-dimensional space. $M$ is the number of person identities.
Group-aware Representation. GroupFace uses a novel Group-aware Representation as well as the instance-based representation to enrich the embedding features. Each group-aware representation vector is extracted by deploying fully-connected layers for each corresponding group (Figure 2). The embedding feature ($\bar{v}_x$, Final Representation in Figure 2) of GroupFace is obtained by aggregating the instance-based representation $v_x$ and the weighted-summed group-aware representation $v_x^G$. GroupFace predicts an identity by using the enriched final representation $\bar{v}_x$ as:
$$p(y \mid x) = \mathrm{softmax}\big(g(\bar{v}_x)\big), \qquad \bar{v}_x = v_x + v_x^G,$$
where $v_x^G$ is an ensemble of multiple group-aware representations with group probabilities.
Structure. GroupFace calculates and uses the instance-based representation and the group-aware representations concurrently. The instance-based representation is obtained by the same procedures that are used in conventional face recognition methods [ArcFace, CosFace, centerloss, regularface], and the group-aware representations are obtained similarly by deploying a fully-connected layer. Then, the group probabilities are calculated from the instance-based representation vector by deploying a Group Decision Network (GDN) that consists of three fully-connected layers and a softmax layer. Using the group probabilities, the multiple group-aware representations are sub-ensembled in a soft manner (S-GroupFace) or a hard manner (H-GroupFace).
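S-GroupFace aggregates multiple group-aware representations with the corresponding group probabilities as weights, and is defined as:
$$v_x^G = \sum_{k=1}^{K} p(G_k \mid x)\, v_x^{G_k},$$
where $K$ is the number of groups, $p(G_k \mid x)$ is the probability that the sample $x$ belongs to the $k$-th group $G_k$, and $v_x^{G_k}$ is the group-aware representation for that group.
H-GroupFace selects the group-aware representation whose group probability has the highest value, and is defined as:
$$v_x^G = v_x^{G_{k^*}}, \qquad k^* = \arg\max_k\, p(G_k \mid x).$$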
S-GroupFace provides a significant improvement of recognition accuracy with a marginal requirement for additional resources, and H-GroupFace is more suitable for practical applications than S-GroupFace, at the cost of a few additional convolutions. The final representation is enriched by aggregating both the instance-based representation and the sub-ensembled group-aware representation.
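The soft and hard sub-ensembles described above can be illustrated with a minimal sketch in plain Python. It assumes that "aggregating" means an element-wise sum, and it uses toy 4-dimensional vectors and made-up GDN logits; the function names are hypothetical and are not taken from any released implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_ensemble(group_probs, group_reps):
    """S-GroupFace style: probability-weighted sum of all group-aware vectors."""
    dim = len(group_reps[0])
    return [sum(p * rep[d] for p, rep in zip(group_probs, group_reps))
            for d in range(dim)]

def hard_ensemble(group_probs, group_reps):
    """H-GroupFace style: keep only the vector of the most probable group."""
    best = max(range(len(group_probs)), key=group_probs.__getitem__)
    return list(group_reps[best])

def final_representation(instance_rep, group_rep):
    """Aggregate by element-wise sum (an assumption), so the dimension is unchanged."""
    return [a + b for a, b in zip(instance_rep, group_rep)]

# Toy example: 3 groups, 4-dimensional features (real features would be 512-d).
instance_rep = [0.5, -0.2, 0.1, 0.3]
group_reps = [[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]]
group_probs = softmax([2.0, 0.5, -1.0])  # stand-in for the GDN output

soft_final = final_representation(instance_rep, soft_ensemble(group_probs, group_reps))
hard_final = final_representation(instance_rep, hard_ensemble(group_probs, group_reps))
```

In this sketch the hard variant only needs the group-aware vector of the selected group, which is why its extra cost would not grow with the number of groups.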
Group-aware Similarity. We introduce a group-aware similarity, a new similarity that considers both the standard embedding feature and the intermediate feature of GDN in the inference stage. The group-aware similarity is penalized by a distance between the intermediate features of two given instances, because the intermediate feature is not trained on the cosine space and describes only the group identity of a given sample, not its explicit identity. The group-aware similarity between two given images is defined as:
where is a cosine similarity metric, is a distance metric, denotes the intermediate feature of GDN and, and are a constant parameter. The parameters are determined empirically to be and .
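As a rough illustration of a similarity that is penalized by a distance between GDN intermediate features, the sketch below subtracts a scaled, exponentiated Euclidean distance from the cosine similarity of the embeddings. The functional form `beta * distance ** gamma`, the choice of Euclidean distance, and the default values of `beta` and `gamma` are all assumptions made for this example; they are placeholders, not the paper's definition or settings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Euclidean distance, used here as the (assumed) distance metric."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def group_aware_similarity(emb_a, emb_b, gdn_a, gdn_b, beta=0.2, gamma=0.5):
    """Embedding cosine similarity minus a penalty on the GDN feature distance.

    beta and gamma are hypothetical placeholder constants.
    """
    penalty = beta * euclidean_distance(gdn_a, gdn_b) ** gamma
    return cosine_similarity(emb_a, emb_b) - penalty

# Identical embeddings: the score only drops when the GDN features differ.
same_group = group_aware_similarity([1.0, 0.0], [1.0, 0.0], [0.0, 0.0], [0.0, 0.0])
other_group = group_aware_similarity([1.0, 0.0], [1.0, 0.0], [0.0, 0.0], [4.0, 0.0])
```

With these placeholder constants, two samples with identical embeddings score lower when their GDN features are far apart, which is the qualitative behaviour the text describes.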
3.2 Self-distributed Grouping
In this work, we define a group as a set of samples that share any common visual-or-non-visual features that are used for face recognition. Such a group is determined by a deployed GDN. Our GDN is gradually trained in a self-grouping manner that provides a group label by considering the distribution of latent groups without any explicit ground-truth information.
Naïve Labeling. A naïve way to determine a group label is to take the index that has the maximum activation of the softmax outputs. We build a GDN $f$ to determine the group to which a given sample $x$ belongs by deploying an MLP and attaching a softmax function:
$$p(G_k \mid x) = \mathrm{softmax}_k\big(f(x)\big), \qquad G^*(x) = \arg\max_k\, p(G_k \mid x),$$
where $G_k$ is the $k$-th group. Because it does not consider the group distribution, the naïve solution can assign most samples to one or a few groups.
We introduce an efficient labeling method that utilizes a modified probability regulated by a prior probability to generate uniformly distributed group labels in a deep manner. We define an expectation-normalized probability to balance the number of samples among groups:
where the first bounds the normalized probability between 0 and 1. Then, the expectation of the expectation-normalized probability is computed as:
The optimal self-distributed label is obtained as:
The trained GDN estimates a set of group probabilities that represent the degree to which the sample belongs to the latent groups. As the number of samples approaches infinity, the proposed method stably outputs uniformly distributed labels (Figure 3).
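To make the difference between naïve labeling and self-distributed labeling concrete, here is a minimal sketch. It assumes one form of the expectation-normalized probability that is consistent with the description above, namely (p − E[p]) / K + 1 / K for K groups, which stays between 0 and 1 and has expectation 1 / K. The expectation is estimated by a simple sample mean, and the biased "GDN outputs" are simulated; all names and the simulated data are hypothetical.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def naive_labels(probs):
    """Index of the largest probability for every sample."""
    return [max(range(len(p)), key=p.__getitem__) for p in probs]

def expectation_normalized(probs):
    """Assumed form: subtract the per-group mean, scale by 1/K, shift by 1/K."""
    k = len(probs[0])
    n = len(probs)
    mean = [sum(p[j] for p in probs) / n for j in range(k)]
    return [[(p[j] - mean[j]) / k + 1.0 / k for j in range(k)] for p in probs]

def self_distributed_labels(probs):
    """Arg-max over the expectation-normalized probabilities."""
    return naive_labels(expectation_normalized(probs))

def label_share(labels, num_groups):
    """Fraction of samples assigned to each group."""
    return [labels.count(j) / len(labels) for j in range(num_groups)]

rng = random.Random(0)
K, N = 4, 1000
# Simulated GDN outputs that are strongly biased toward group 0.
probs = [softmax([rng.gauss(0, 1) + (3.0 if j == 0 else 0.0) for j in range(K)])
         for _ in range(N)]

naive_share = label_share(naive_labels(probs), K)
balanced_share = label_share(self_distributed_labels(probs), K)
```

On this simulated data the naïve labels collapse onto group 0, while the normalized labels reach every group; how close the result comes to a uniform split depends on the data and on how well the expectation is estimated.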
The network of GroupFace is trained by both the standard classification loss, which is a softmax-based loss to distinguish identities, and the self-grouping loss, which is a softmax loss to train latent groups, simultaneously.
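Loss Function. A softmax-based loss (ArcFace [ArcFace] is mainly used in this work) is used to train a feature representation for identities and is defined as:
$$L_{softmax} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)}+\sum_{j\neq y_i}e^{s\cos\theta_j}},$$
where $N$ is the number of samples in a mini-batch, $\theta$ is the angle between a feature and the corresponding weight, $s$ is a scale factor, and $m$ is a marginal factor. To construct the optimal group-space, a self-grouping loss, which reduces the difference between the prediction and the self-generated label, is defined as:
$$L_{group} = \frac{1}{N}\sum_{i=1}^{N}\mathrm{CrossEntropy}\big(\mathrm{softmax}(f(x_i)),\, G^*(x_i)\big),$$
where $\mathrm{softmax}(f(x_i))$ is the group prediction of the GDN for the sample $x_i$ and $G^*(x_i)$ is its self-generated group label.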
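Training. The whole network is trained by using the aggregation of two losses:
$$L = L_{softmax} + \lambda L_{group},$$
where $L_{softmax}$ is the softmax-based loss, $L_{group}$ is the self-grouping loss, and the parameter $\lambda$, which balances the weights of the two losses, is set empirically. Thus, GDN can learn the group, which is an attribute beneficial to face recognition.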
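The two-term training objective can be sketched in plain Python as follows. The identity term follows the standard additive-angular-margin (ArcFace-style) formulation, the group term is an ordinary softmax cross-entropy against self-generated group labels, and the two are combined with a balancing weight. The default `scale`, `margin`, and `lam` values are placeholders chosen for illustration, and the tiny 2-D inputs are made up.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return [z - log_total for z in logits]

def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def arcface_loss(features, class_weights, labels, scale=64.0, margin=0.5):
    """Mean additive-angular-margin softmax loss over a mini-batch."""
    total = 0.0
    for feat, label in zip(features, labels):
        f = normalize(feat)
        cosines = [sum(a * b for a, b in zip(f, normalize(w))) for w in class_weights]
        logits = []
        for j, cos_t in enumerate(cosines):
            if j == label:
                theta = math.acos(max(-1.0, min(1.0, cos_t)))
                cos_t = math.cos(theta + margin)  # margin only on the target class
            logits.append(scale * cos_t)
        total -= log_softmax(logits)[label]
    return total / len(labels)

def group_loss(group_logits, group_labels):
    """Mean softmax cross-entropy between GDN logits and self-generated labels."""
    total = 0.0
    for logits, label in zip(group_logits, group_labels):
        total -= log_softmax(logits)[label]
    return total / len(group_labels)

def total_loss(id_loss, grp_loss, lam=0.5):
    """Weighted sum of both losses; lam is a hypothetical balancing weight."""
    return id_loss + lam * grp_loss

# Toy mini-batch: two samples, two identities, three latent groups.
features = [[1.0, 0.0], [0.0, 1.0]]
class_weights = [[1.0, 0.0], [0.0, 1.0]]
id_no_margin = arcface_loss(features, class_weights, [0, 1], scale=4.0, margin=0.0)
id_margin = arcface_loss(features, class_weights, [0, 1], scale=4.0, margin=0.5)
grp = group_loss([[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]], [0, 1])
loss = total_loss(id_margin, grp)
```

In this toy batch the margin raises the identity loss even though both samples are perfectly aligned with their class weights, which illustrates how the margin keeps pushing for tighter intra-class angles.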
We describe implementation details and extensively perform experiments and ablation studies to show the effectiveness of GroupFace.
4.1 Implementation Details
Datasets. For training, we use MSCeleb-1M [msceleb1m], which contains about 10M images for 100K identities. Due to the noisy labels of the original MSCeleb-1M dataset, we use the refined version [ArcFace], which contains 3.8M images for 85k identities. For testing, we conduct our experiments with nine commonly used datasets as follows:
LFW [LFW] which contains 13,233 images from 5,749 identities and provides 6,000 pairs from them. CALFW [CALFW] and CPLFW [CPLFWTech] are datasets reorganized from LFW to include higher pose and age variations.
YTF [YTF] which consists of 3,425 videos of 1,595 identities.
MegaFace [kemelmacher2016megaface] which is composed of more than 1 million images from 690K identities for challenge 1 (MF1).
CFP-FP [CFP_FP] which contains 500 subjects, each with 10 frontal and 4 profile images.
AgeDB-30 [age_db] which contains 12,240 images of 440 identities.
IJB-B [IJB_B] which contains 67,000 face images, 7,000 face videos and 10,000 non-face images.
IJB-C [IJB_C] which contains 138,000 face images, 11,000 face videos and 10,000 non-face images.
Metrics. We compare the verification-accuracy for identity-pairs on LFW [LFW], YTF [YTF], CALFW [CALFW], CPLFW [CPLFWTech], CFP-FP [CFP_FP], AgeDB-30 [age_db] and MegaFace [kemelmacher2016megaface] verification task. MegaFace [kemelmacher2016megaface] identification task is evaluated by rank-1 identification accuracy with 1 million distractors. We compare a True Accept Rate at a certain False Accept Rate (TAR@FAR) from 1e-4 to 1e-6 on IJB-B [IJB_B] and IJB-C [IJB_C].
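For readers unfamiliar with TAR@FAR, the sketch below shows one common way to compute it from similarity scores: pick the threshold that lets at most the requested fraction of impostor (different-identity) pairs through, then measure the fraction of genuine (same-identity) pairs above that threshold. This is a generic illustration with made-up scores and a hypothetical function name, not the official IJB-B or IJB-C evaluation code.

```python
import math

def tar_at_far(genuine_scores, impostor_scores, far):
    """Return (TAR, threshold) at the threshold admitting at most `far` of impostors."""
    ranked = sorted(impostor_scores, reverse=True)
    allowed = int(math.floor(far * len(ranked)))  # impostor pairs allowed to pass
    allowed = min(allowed, len(ranked) - 1)
    threshold = ranked[allowed]                   # a score must exceed this value
    accepted = sum(1 for s in genuine_scores if s > threshold)
    return accepted / len(genuine_scores), threshold

# Toy scores: 100 impostor pairs spread over [0, 0.99] and 4 genuine pairs.
impostor = [i / 100 for i in range(100)]
genuine = [0.5, 0.8, 0.9, 0.7]
tar, threshold = tar_at_far(genuine, impostor, far=0.25)
```

A criterion such as FAR=1e-6 is only meaningful when there are at least on the order of a million impostor pairs, since the threshold is otherwise determined by the single highest impostor score.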
Experimental Setting. We construct a normalized face image [ArcFace, SphereFace, CosFace] () by warping a face region using five facial points from the two eyes, the nose and the two corners of the mouth. We employ ResNet-100 [resnet] as the backbone network, similar to recent works [ArcFace, AFRN]. We vectorize the activation and reduce the number of activations to 4096 (shared feature in Figure 2) by a block of BN-FC. Our GroupFace is attached after res5c in ResNet-100, where the activation dimension is $512 \times 7 \times 7$. The MLP in GDN consists of two blocks of BN-FC and an FC for group classification. We follow [ArcFace, CosFace] to set the hyper-parameters of the loss function.
Learning. We train the model with synchronized GPUs and a mini-batch involving images per GPU. To stabilize the group probabilities, the network of GroupFace is initialized from a pre-trained network that is trained by only the softmax-based loss [ArcFace, CosFace]. We used a learning rate of for the first 50k, for the 20k, and for 10k with a weight decay of and a momentum of with stochastic gradient descent (SGD). We compute the expectation of group probabilities by computing the group probabilities of samples on all GPUs and averaging the expectations over the recent -batches to accurately estimate the expectation of the group probabilities on the population; between and empirically shows a similar performance.
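The averaging of per-batch expectations over recent batches can be sketched with a fixed-length window. The class name, the `window` parameter and the toy probabilities are hypothetical, and the sketch ignores the cross-GPU gathering step that a multi-GPU setup would need before each update.

```python
from collections import deque

class RunningGroupExpectation:
    """Average of per-batch mean group probabilities over the most recent batches."""

    def __init__(self, num_groups, window):
        self.num_groups = num_groups
        self.history = deque(maxlen=window)  # oldest batch mean is dropped automatically

    def update(self, batch_probs):
        """Add one batch of group probabilities and return the current estimate."""
        n = len(batch_probs)
        batch_mean = [sum(p[j] for p in batch_probs) / n
                      for j in range(self.num_groups)]
        self.history.append(batch_mean)
        return self.value()

    def value(self):
        """Mean of the stored per-batch means."""
        t = len(self.history)
        return [sum(m[j] for m in self.history) / t for j in range(self.num_groups)]

# Toy run with two groups and a window of two batches.
tracker = RunningGroupExpectation(num_groups=2, window=2)
first = tracker.update([[1.0, 0.0], [0.0, 1.0]])   # batch mean [0.5, 0.5]
second = tracker.update([[1.0, 0.0], [1.0, 0.0]])  # batch mean [1.0, 0.0]
third = tracker.update([[0.0, 1.0], [0.0, 1.0]])   # first batch falls out of the window
```

A longer window gives a smoother estimate of the population expectation at the cost of reacting more slowly as the GDN changes during training.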
4.2 Ablation Studies
To show the effectiveness of the proposed method, we perform ablation studies on its behaviors. For all experiments, we use the same network structure with the hyper-parameters mentioned earlier. To clearly show the effect of each ablation study, the TAR@FAR of the models is compared on the IJB-B dataset [IJB_B]; all models in the ablation studies show similar accuracy on LFW.
Number of Groups. We compare the recognition performance according to the number of groups (Table 1(a)). As the number of groups grows, the performance increases steadily. In particular, the first few groups bring a large benefit, and deploying more groups yields a further significant improvement.
Learning for GDN. We compare the learning methods for GDN (Table 1(b)): (1) without loss (adopting the group-aware network structure only), (2) naive labeling, and (3) self-distributed labeling. Just by applying our novel network structure, the recognition performance is greatly improved. The performance is further increased by adopting the proposed self-distributed labeling method.
Hard vs. Soft. S-GroupFace shows a large improvement in performance because it uses all group-aware representations comprehensively with a reasonable amount of additional resources (Table 1(c)). Since H-GroupFace uses only the single strongest group-aware representation even if many groups are deployed, the cost of increasing the number of groups is fixed at a small amount of additional resources. Thus, H-GroupFace can be applied immediately for high performance gains in practical applications.
Aggregation vs. Concatenation. We compare how to combine the instance-based representation and the group-aware representations into one embedding feature (Table 1(d)): (1) aggregation and (2) concatenation. Concatenation-based GroupFace shows a TAR@FAR=1e-6 that is 0.67 percentage points better than Aggregation-based GroupFace; however, Aggregation-based GroupFace shows a much better TAR@FAR=1e-5, by 1.16 percentage points. We chose the Aggregation-based GroupFace, which generally performs better with fewer feature dimensions.
Group-aware Similarity. The recognition performance is once again improved significantly by evaluating the group-aware similarity (Table 1(e)). Even though the group-aware similarity increases the feature dimension for calculating a similarity, it is easy to extract the required feature because the feature is an intermediate output of the recognition network. In particular, this experiment shows that the group-based information is distinct enough from the conventional identity-based information to improve performance in practical usage. We show more detailed experiments in Table 5.
Lightweight Model. GroupFace is also effective for a lightweight model such as ResNet-34 [resnet], which requires only 8.9 GFLOPS, far less than the 24.2 GFLOPS of ResNet-100 [resnet]. ResNet-34 based GroupFace shows performance similar to ResNet-100 based ArcFace [ArcFace] and greatly outperforms ResNet-100 on the most difficult criterion (FAR=1e-6). In addition, the group-aware similarity significantly exceeds the basic performance of the ResNet-34 model (Table 1(f)).
|Method|Protocol|Identification (%)|Verification (%)|
|SphereFace [SphereFace]|Large / R|97.91|97.91|
|AdaptiveFace [adaptiveface]|Large / R|95.02|95.61|
|CosFace [CosFace]|Large / R|97.91|97.91|
|ArcFace [ArcFace]|Large / R|98.35|98.49|
|GroupFace|Large / R|98.74|98.79|

|Method|TAR on IJB-B|TAR on IJB-C|
LFW, YTF, CALFW, CPLFW, CFP-FP and AgeDB-30. We compare the verification accuracy on LFW [LFW] and YTF [YTF] with the unrestricted with labelled outside data protocol (Table 2). On YTF, we evaluate all the images without excluding noisy images from the image sequences. Even though both datasets are highly saturated, our GroupFace surpasses the other recent methods. We also report the verification accuracy on the variants of LFW (CALFW [CALFW], CPLFW [CPLFWTech]), CFP-FP [CFP_FP] and AgeDB-30 [age_db] (Table 3). Our GroupFace shows better accuracy on all of the above datasets.
MegaFace. We evaluate our GroupFace on MegaFace [kemelmacher2016megaface] under the large-training-set protocol, in which models are trained on a training set containing more than 0.5M images (Table 4). GroupFace is the top-ranked face recognition model among the recently published state-of-the-art methods. On the refined MegaFace [ArcFace], our GroupFace also outperforms the other models.
IJB-B and IJB-C. We compare the proposed method with other methods on the IJB-B [IJB_B] and IJB-C [IJB_C] datasets (Table 5). Recent angular-margin-softmax based methods [ArcFace, CosFace] show great performance on these datasets. We report the improvement of GroupFace in verification accuracy based on both CosFace [CosFace] and ArcFace [ArcFace] without any test-time augmentations such as horizontal flipping. Our GroupFace improves on ArcFace [ArcFace] on all FAR criteria: by 8.5 percentage points at FAR=1e-6, 1.8 percentage points at FAR=1e-5 and 0.2 percentage points at FAR=1e-4 on IJB-B, and by 4.3 percentage points at FAR=1e-6, 1.2 percentage points at FAR=1e-5 and 0.4 percentage points at FAR=1e-4 on IJB-C. The recognition performance is once again improved significantly by applying the group-aware similarity (Eq. 5), especially on the most difficult criterion (TAR@FAR=1e-6) on IJB-B, by 5.3 percentage points.
To show the effectiveness of the proposed method, we visualize the feature representation, the average activation of groups and the visual interpretation of groups.
2D Projection of Representation. Figure 4 shows a quantitative comparison among (a) the final representation of the baseline network (ArcFace [ArcFace]), (b) the instance-based representation of GroupFace and (c) the final representation of GroupFace on a 2D space. We select the first eight identities in the refined MSCeleb-1M dataset [msceleb1m] and map the extracted features onto the angular space by using t-SNE [t_SNE]. The comparison shows that the proposed model generates more distinctive feature representations than the baseline model, and that the proposed model also enhances the instance-based representation.
Activation Distribution of Groups. The proposed self-grouping tries to spread the samples evenly across all groups, and at the same time, the softmax-based loss propagates gradients into GDN so that the identification works best. Thus, the probability distribution is not exactly uniform (Figure 5). Some probabilities of the groups are low and the others are high (e.g., groups 1, 2, 5, 6, 14, 15, 17, 18, 28, 29, 30 and 31). The overall distribution is not as uniform as we expected, but there is no dominant group among the highly activated groups.
Interpretation of Groups. The trained latent groups are not always visually distinguishable because they are categorized by a non-linear function of GDN using a latent feature, not a facial attribute (e.g., hair, glasses, and mustache). However, there are two groups (Groups 5 and 20 in Figure 6) whose visual properties we can clearly see: 95 of 100 randomly selected images are men in Group 5, and 94 of 100 randomly selected images are bald men in Group 20. The other groups cannot be described by a single visual property; however, they seem to be described by multiple visual properties, such as smiling women, right-profile people and scared people in Group 1.
We introduce a new face-recognition-specialized architecture that consists of a group-aware network structure and a self-distributed grouping method to effectively manipulate multiple latent group-aware representations. By extensively conducting ablation studies and experiments, we demonstrate the effectiveness of our GroupFace. The visualization also shows that GroupFace fundamentally enhances the feature representations compared with existing methods and that the latent groups have some meaningful visual descriptions. Our GroupFace provides a significant improvement in recognition performance and is practically applicable to existing recognition systems. The rationale behind the effectiveness of GroupFace is summarized in two main ways: (1) It is well known that additional supervision from different objectives can improve a given task by sharing a network for feature extraction, e.g., a segmentation head can improve accuracy in object detection [bell2016inside, he2017mask]. Likewise, learning the groups can be a helpful cue to train a more generalized feature extractor for face recognition. (2) GroupFace proposes a novel structure that fuses the instance-based representation and the group-based representation, whose effectiveness is shown empirically.
We thank the AI team of Kakao Enterprise, especially Wonjae Kim and Yoonho Lee, for their helpful feedback.