1 Introduction
Recent advances in deep learning have remarkably boosted the performance of face recognition. Some approaches claim to have achieved [34, 4, 17, 46, 44] or even surpassed [29, 38, 45] human performance on several benchmarks. However, those approaches only recognize faces over a single image or video sequence. Such scenarios deviate from reality. In practical face recognition systems (and arguably in the human cortex for face recognition), each subject face to recognize is often enrolled with a set of images and videos captured under varying conditions and acquisition methods. Intuitively, such rich information can benefit face recognition performance, yet it has not been effectively exploited in existing approaches [43, 19, 46].
In this paper, we consider the challenging task of unconstrained set-based face recognition, first introduced in [16], which is more consistent with real-world scenarios. Unconstrained set-based face recognition defines the minimal facial representation unit as a set of images and videos instead of a single medium. Set-based face recognition requires solving a more difficult set-to-set matching problem, where both the probe and gallery are sets of face media. This task raises the necessity to build subject-specific face models for each subject individually, instead of relying on a single multi-class recognition model as before. An illustration of the difference between traditional face recognition over a single input image and the targeted face recognition over a set of unconstrained images/videos is given in Fig. 1. The most significant challenge in the unconstrained set-based face recognition task is how to learn good representations for the media set, even in the presence of large intra-set variance of real-world subject faces caused by varying conditions in illumination, sensor, compression, etc., and subject attributes such as facial pose, expression and occlusion. Solving this problem requires addressing these distracting factors effectively and learning set-level discriminative face representations.
Recently, several set-based face recognition methods have been proposed [3, 5, 7, 11]. They generally adopt one of the following two strategies to obtain a set-level face representation. One is to learn a set of image-level face representations from each face medium in the set individually [7, 21], and use all the information for the subsequent face recognition. Such a strategy is computationally expensive, as it needs to perform exhaustive pairwise matching, and is fragile to outlier media captured under unusual conditions. The other strategy is to aggregate face representations within the set through simple average or max pooling and generate a single representation for each set [5, 28]. However, this suffers from information loss and insufficient exploitation of the image/video set.
To overcome the limitations of existing methods, we propose a novel Multi-Prototype Network (MPNet) model. To learn better set-level representations, MPNet introduces a Dense SubGraph (DSG) learning subnet to implicitly factorize each face media set of a particular subject into a number of disentangled subsets, instead of handcrafting the set partition using some intuitive features. Each dense subgraph discovers a subset (representing a prototype) of face media that have small intra-set variance but are discriminative from other subject faces. MPNet learns to enhance the compactness of the prototypes as well as their coverage of large variance for a single subject face, through which heterogeneous attributes within each face media set are sufficiently considered and flexibly untangled. This significantly helps improve the unconstrained set-based face recognition performance by providing multiple comprehensive and succinct face representations, reducing the impact of media inconsistency. Compared with existing set-based face recognition methods [3, 5, 7, 11], MPNet effectively addresses the large variance challenge and offers more discriminative and flexible face representations with lower computational complexity. Also, in contrast to naive average or max pooling of face features, MPNet effectively preserves the necessary information through the DSG learning for set-based face recognition. The main contributions of this work can be summarized as follows:

We propose a novel and effective multi-prototype discriminative learning architecture, MPNet. To the best of our knowledge, MPNet is the first end-to-end trainable model that adaptively learns multiple prototype face representations from sets of media. It is effective at addressing the large intra-set variance issue that is critical to set-based face recognition.

MPNet introduces a Dense SubGraph (DSG) learning subnet that automatically factorizes each face media set into a number of disentangled prototypes representing consistent face media with sufficient discriminativeness. Through the DSG subnet, MPNet is capable of untangling inconsistent media and dealing robustly with faces captured under challenging conditions.

DSG provides a general loss that encourages compactness around multiple discovered centers with strong discrimination. It offers a new and systematic approach for large variance object recognition in the real world.
Based on the above technical contributions, we present a high-performance model for unconstrained set-based face recognition. It achieves the currently best results on the IJB-A [16], YTF [40] and IJB-C [23] benchmark datasets, with significant improvement over state-of-the-art methods.
2 Related Work
Recent top-performing approaches for face recognition often rely on deep CNNs with advanced architectures. For instance, the VGG-face model [25, 2], as an application of the VGG architecture [30], provides state-of-the-art performance. The DeepFace model [34, 35] also uses a deep CNN coupled with 3D alignment. FaceNet [29] utilizes the inception deep CNN architecture for unconstrained face recognition. DeepID2+ [33] and DeepID3 [32] extend the FaceNet model by including joint Bayesian metric learning and multi-task learning, yielding better unconstrained face recognition performance. SphereFace [18], CosFace [38], AM-Softmax [37] and ArcFace [9] exploit margin-based representation learning to achieve small intra-class distance and large inter-class distance. Those methods enhance their overall performance via carefully designed architectures, which are however not tailored for unconstrained set-based face recognition.
With the introduction of the IJB-A benchmark [16] by NIST, the problem of unconstrained set-based face recognition has attracted increasing attention. Recent solutions to this problem are also based on deep architectures, which are the leading approaches on LFW [14] and YTF [40]. Among them, B-CNN [7] applies a new Bilinear CNN (B-CNN) architecture for face identification. Pooling Faces [11] aligns faces in 3D and partitions them according to facial and imaging properties. PAMs [21] handles pose variability by learning Pose-Aware Models (PAMs) for frontal, half-profile and full-profile poses to perform face recognition in the wild. Those methods often employ separate processing steps without considering the modality variance within one set of face media or the underlying multiple-prototype structure. Therefore, much useful information may be lost, leading to inferior performance.
Our proposed MPNet shares a similar idea with subcategory-aware object classification [10], which considers intra-class variance in building object classifiers. Our method differs from it in the following aspects: 1) the "prototype" is not predefined in MPNet; 2) MPNet is based on deep learning and is end-to-end trainable. It is also interesting to investigate the application of our MPNet architecture to generic object recognition tasks.
3 Multi-Prototype Networks
Fig. 2 visualizes the architecture of MPNet, which takes a pair of face media sets as input and outputs a matching result for unconstrained set-based face recognition. It adopts a modern deep siamese CNN architecture for set-based facial representation learning, and introduces a new DSG subnet for learning the multiple prototypes that model various representative faces under different conditions for the input. MPNet is end-to-end trainable by minimizing the ranking loss and a new DSG loss. We now present each component in detail.
3.1 Set-based Facial Representation Learning
Different from face recognition over a single image, the task of set-based face recognition aims to accept or reject the claimed identity of a subject represented by a face media set containing both images and videos. Performance is assessed using two measures: the percentage of false accepts and that of false rejects. A good model should optimize both metrics simultaneously. MPNet is designed to nonlinearly map the raw sets of faces to multiple prototypes in a low-dimensional space such that the distance between these prototypes is small if the sets belong to the same subject, and large otherwise. The similarity metric learning is achieved by training MPNet with two identical CNN branches that share weights. MPNet handles inputs in a pairwise, set-to-set way so that it explicitly organizes the face media in a way favorable to set-based face recognition.
MPNet learns face representations at multiple scales to gain robustness to scale variance in real-world faces. Specifically, for each medium within a face media set, a multi-scale pyramid is constructed by resizing the image or video frame to four different scales. To handle face detection errors, MPNet performs random cropping to collect local and global patches of a fixed size from each scale of the multi-scale pyramid, as illustrated in Fig. 3. To handle the imbalance of realistic face data (e.g., some subjects are enrolled with scarce media from limited images while others have redundant media from near-duplicate video frames), the data distribution within each set is adjusted by resampling. In particular, a set containing scarce media (i.e., fewer media than a predefined size parameter that is set empirically) is augmented by duplicating and flipping images, which intuitively benefits recognition by supplying more relevant information. A large set with redundant media (i.e., more media than this parameter) is subsampled to that size. The resulting input streams to MPNet are tuples of face media set pairs and the associated ground truth annotations, each consisting of the two sets of a pair and the binary pairwise label.
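The set-balancing step described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names `balance_set` and `flip_rows`, the toy list-of-rows image format, and the fixed random seed are all assumptions made here for the example.

```python
import random


def balance_set(media, target_size, flip, rng=None):
    """Return a list of exactly `target_size` media drawn from `media`.

    Large sets are randomly subsampled; small sets are grown by appending
    flipped duplicates. `flip` is a caller-supplied function (hypothetical
    here) that mirrors one medium horizontally.
    """
    if not media:
        raise ValueError("media set must not be empty")
    rng = rng or random.Random(0)
    if len(media) > target_size:
        # Redundant set: keep a random subset of the requested size.
        return rng.sample(media, target_size)
    balanced = list(media)
    while len(balanced) < target_size:
        # Scarce set: duplicate a random original medium and flip it.
        balanced.append(flip(rng.choice(media)))
    return balanced


def flip_rows(image):
    """Mirror a 2-D image stored as a list of rows."""
    return [row[::-1] for row in image]


small = [[[1, 2], [3, 4]]]                 # one 2x2 toy "image"
large = [[[i, i + 1]] for i in range(10)]  # ten 1x2 toy "images"
print(len(balance_set(small, 4, flip_rows)))  # 4
print(len(balance_set(large, 4, flip_rows)))  # 4
```

Both calls return sets of the same size, so every set passed downstream has a fixed number of media regardless of how many were enrolled.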
The proposed MPNet adopts a siamese CNN architecture in which two branches share weights for pairwise set-based facial representation learning. Each branch is initialized with VGG-face [25], including 13 convolutional layers, 5 pooling layers and 2 fully-connected layers. We make the following architectural design choices for each branch to ensure that the learned deep facial representations are more suitable for multi-prototype representation learning. 1) For activation functions, instead of using ReLU [24] to suppress all the negative responses, we adopt PReLU [12] to allow negative responses. PReLU improves model fitting with nearly zero extra computational cost and little overfitting risk, benefiting the convergence of MPNet. 2) We adopt two local normalization layers, each placed after one of the convolutional layers. The local normalization tends to uniformize the mean and variance of a feature map around a local neighborhood. This is especially useful for correcting non-uniform illumination or shading artifacts. 3) We adopt an average operator for the last pooling layer and a max operator for the previous pooling layers to generate compact and discriminative deep set-based facial representations. Note that our approach is not restricted to the CNN module used, and can also be generalized to other state-of-the-art architectures for performance boosting.
Each branch outputs the learned deep facial representation for its face media set, whose number of elements equals the specified size of the face media set after distribution balancing.
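To make the difference between the two activation functions concrete, the following sketch compares them on a few scalar inputs. The slope value 0.25 is only an illustrative choice here; in PReLU the slope is a learned parameter.

```python
def relu(x):
    """ReLU: negative responses are suppressed to zero."""
    return x if x > 0.0 else 0.0


def prelu(x, slope):
    """PReLU: negative responses are scaled by a (learned) slope instead."""
    return x if x > 0.0 else slope * x


for value in (-2.0, 0.0, 3.0):
    print(value, relu(value), prelu(value, 0.25))
```

For positive inputs the two functions agree; for negative inputs PReLU keeps a scaled response, which is the property the text relies on.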
3.2 Multi-Prototype Discriminative Learning
Throughout this paper, a prototype is defined as a collection of similar face media that are representative of a subject face under certain conditions. Face media forming the same prototype have small variance, so one can safely extract a representation by pooling approaches without worrying about information loss.
To address the critical large variance issue in set-based face recognition, we propose multi-prototype discriminative learning. With this component, each face media set is implicitly factorized into a certain number of prototypes. Multi-prototype learning encourages the output facial representations to be compact w.r.t. certain prototypes and meanwhile discriminative for different subjects. Thus, MPNet is capable of modelling the prototype-level interactions effectively while addressing the large variance and false matching caused by atypical faces. MPNet dynamically offers an optimal trade-off between preserving facial information and computation cost. It does not require exhaustive matching across all of the possible pairs from two sets for accurately recognizing faces in the wild. It learns multiple prototypes through dense subgraph learning as detailed below.
To discover the underlying multiple prototypes of each face media set instead of handcrafting the set partition, we propose a novel DSG learning approach. DSG formulates the similarity of face media within a set through a graph and discovers the prototypes by mining the dense subgraphs. Each subgraph has high internal similarity and small similarity to the outside media. Compared with clustering-based data partition, DSG is advantageous in terms of flexibility and robustness to outliers. Each subgraph provides a prototype for the input subject faces. We then perform face recognition at the prototype level, which is concise and also sufficiently informative.
Given a latent affinity graph characterizing the relations among entities (face media here), each dense subgraph refers to a prototype of the vertices with larger internal connection affinity than other candidates. In this work, learning DSG within each face media set implicitly discovers consistent faces sharing similar conditions such as age, pose, expression, illumination and media modality, through which heterogeneous factors are flexibly untangled.
Formally, suppose the graph is associated with an affinity matrix, each element of which encodes the similarity between two face media. The partition indicator is a binary matrix with one row per medium and one column per prototype (or equivalently, per dense subgraph): an entry of 1 indicates that the corresponding medium is in the corresponding prototype. DSG aims to find the representative subgraphs by optimizing the partition indicator through
(1)
where the constraint is written with an all-1 vector and guarantees that every medium is allocated to exactly one prototype. The allocation after learning maximizes the intra-prototype media similarity. This differs significantly from k-means clustering: DSG does not need to learn centers, and similarity is not defined by the distance to a center.
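As a small illustration of the quantity being maximized, the sketch below scores a hard partition by its total intra-prototype affinity, which is what a trace objective over a one-hot partition indicator computes. The function name, the toy affinity values and the two candidate partitions are assumptions chosen for the example, not values from the paper.

```python
def intra_prototype_affinity(affinity, assignment):
    """Sum affinity[i][j] over all ordered pairs (i, j) in the same prototype.

    `assignment[i]` is the prototype index of medium i, i.e. the position
    of the single 1 in row i of a one-hot partition indicator.
    """
    n = len(affinity)
    return sum(
        affinity[i][j]
        for i in range(n)
        for j in range(n)
        if assignment[i] == assignment[j]
    )


# Toy affinity: media 0 and 1 are similar, media 2 and 3 are similar.
affinity = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
consistent = [0, 0, 1, 1]  # groups the similar media together
mixed = [0, 1, 0, 1]       # splits each similar pair apart

print(intra_prototype_affinity(affinity, consistent) >
      intra_prototype_affinity(affinity, mixed))  # True
```

The partition that keeps similar media together obtains the larger score, which is the behaviour the objective rewards.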
This problem is not easy to solve. We therefore carefully design the DSG layers that form the DSG subnet to optimize it end-to-end. This subnet consists of two layers, which take the set-based facial representations as input and output the reconstructed discriminative features.
The first layer makes the prototype prediction. Given the input deep facial representation, this layer outputs its prototype indicator by dynamically projecting each input latent affinity graph onto the prototypes through a learnable DSG predictor parameter followed by a sigmoid activation function, which maps inputs to the range between 0 and 1. Given the predicted indicator and the input representation, the second layer computes the DSG loss (see Sec. 3.3) to ensure that the reconstructed representations form reasonably compact and discriminative prototypes. The reconstructed representation is obtained by element-wise multiplication of the output of one layer of the DSG subnet with the multi-prototype indicator predicted by its prototype-prediction layer. More details are given in Sec. 4.
Method | Verification: TAR@FAR=0.10 | TAR@FAR=0.01 | TAR@FAR=0.001 | Identification: FNIR@FPIR=0.10 | FNIR@FPIR=0.01 | Rank-1
--- | --- | --- | --- | --- | --- | ---
OpenBR [16] | 0.433±0.006 | 0.236±0.009 | 0.104±0.014 | 0.851±0.028 | 0.934±0.017 | 0.246±0.011
GOTS [16] | 0.627±0.012 | 0.406±0.014 | 0.198±0.008 | 0.765±0.033 | 0.953±0.024 | 0.433±0.021
B-CNNs [7] | - | - | - | 0.659±0.032 | 0.857±0.024 | 0.588±0.020
Pooling faces [11] | 0.631 | 0.309 | - | - | - | 0.846
LSFS [36] | 0.895±0.013 | 0.733±0.034 | 0.514±0.060 | 0.387±0.032 | 0.617±0.063 | 0.820±0.024
Deep Multi-pose [1] | 0.991 | 0.787 | - | 0.250 | 0.48 | 0.846
DCNN+metric [6] | 0.947±0.011 | 0.787±0.043 | - | - | - | 0.852±0.018
Triplet Similarity [28] | 0.945±0.002 | 0.790±0.030 | 0.590±0.050 | 0.246±0.014 | 0.444±0.065 | 0.880±0.015
PAMs [21] | - | 0.826±0.018 | 0.652±0.037 | - | - | 0.840±0.012
DCNN [5] | 0.967±0.009 | 0.838±0.042 | - | 0.210±0.033 | 0.423±0.094 | 0.903±0.012
Masi et al. [22] | - | 0.886 | 0.725 | - | - | 0.906
Triplet Embedding [28] | 0.964±0.005 | 0.900±0.010 | 0.813±0.002 | 0.137±0.014 | 0.247±0.030 | 0.932±0.010
All-In-One [27] | 0.976±0.004 | 0.922±0.010 | 0.823±0.020 | 0.113±0.014 | 0.208±0.020 | 0.947±0.008
Template Adaptation [8] | 0.979±0.004 | 0.939±0.013 | 0.836±0.027 | 0.118±0.016 | 0.226±0.049 | 0.928±0.001
NAN [42] | 0.979±0.004 | 0.941±0.008 | 0.881±0.011 | 0.083±0.009 | 0.183±0.041 | 0.958±0.005
DA-GAN [46] | 0.991±0.003 | 0.976±0.007 | 0.930±0.005 | 0.051±0.009 | 0.110±0.039 | 0.971±0.007
softmax [26] | 0.984±0.002 | 0.970±0.004 | 0.943±0.005 | 0.044±0.006 | 0.085±0.041 | 0.973±0.005
3D-PIM [45] | 0.996±0.001 | 0.989±0.002 | 0.977±0.004 | 0.016±0.005 | 0.064±0.045 | 0.990±0.002
baseline | 0.968±0.009 | 0.871±0.014 | 0.735±0.031 | 0.188±0.011 | 0.372±0.045 | 0.907±0.010
w/o DSG | 0.971±0.006 | 0.887±0.012 | 0.743±0.027 | 0.182±0.010 | 0.367±0.041 | 0.912±0.008
MPNet | 0.971±0.006 | 0.863±0.019 | 0.734±0.033 | 0.189±0.013 | 0.386±0.043 | 0.909±0.007
MPNet | 0.971±0.007 | 0.880±0.015 | 0.740±0.026 | 0.179±0.009 | 0.361±0.044 | 0.913±0.009
MPNet | 0.979±0.004 | 0.924±0.013 | 0.764±0.022 | 0.171±0.012 | 0.350±0.046 | 0.923±0.008
MPNet | 0.980±0.005 | 0.919±0.013 | 0.779±0.021 | 0.169±0.009 | 0.337±0.042 | 0.932±0.008
MPNet | 0.975±0.008 | 0.909±0.017 | 0.757±0.025 | 0.164±0.011 | 0.359±0.040 | 0.926±0.010
MPNet (Ours) | 0.997±0.002 | 0.991±0.003 | 0.984±0.005 | 0.011±0.005 | 0.059±0.040 | 0.994±0.003

Table 1: Face recognition performance comparison on IJB-A. The results are averaged over 10 testing splits. "-" means the result is not reported. Standard deviation is not available for some methods.
3.3 Optimization
We optimize MPNet by jointly minimizing the following two loss functions.
Ranking loss: a ranking loss is designed in MPNet to enforce the distance to shrink for genuine set pairs and to be large for impostor set pairs, so that MPNet explicitly maps input patterns into a target space in which distances approximate the semantic distance in the input space.
We first normalize the outputs from the DSG subnet, so that all the set-based facial representations are within the same range for loss computation. Then, we use the Euclidean distance to measure the fine-grained pairwise dissimilarity between the representations of the two sets in a pair:
(2) 
where .
We further aggregate the distances into one energy-based matching result for each coarse-level set pair:
(3) 
where is a bandwidth parameter.
Our final ranking loss function is formulated as
(4) 
where the margin separates the distance for a genuine pair from the distance for an impostor pair, and the binary pairwise label indicates whether the pair is genuine or impostor, selecting the corresponding distance term in Eq. (4).
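The ranking-loss pipeline described above (normalization, pairwise Euclidean distances, energy-based aggregation, margin-based loss) can be sketched as follows. The specific forms used here are assumptions for illustration only: a softmin-weighted average stands in for the energy-based aggregation, and a hinge on the margin stands in for the ranking loss. The default bandwidth of 10 and margin of 0.8 are the values reported in the implementation details; all function names are hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean norm (zero vectors are left as-is)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0.0 else list(vector)


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def set_distance(set_a, set_b, bandwidth=10.0):
    """Aggregate all pairwise distances into one set-level distance.

    Assumed form: a softmin-weighted average, so that closely matching
    pairs dominate the result.
    """
    a = [l2_normalize(v) for v in set_a]
    b = [l2_normalize(v) for v in set_b]
    dists = [euclidean(u, v) for u in a for v in b]
    weights = [math.exp(-bandwidth * d) for d in dists]
    return sum(w * d for w, d in zip(weights, dists)) / sum(weights)


def ranking_loss(distance, is_genuine, margin=0.8):
    """Assumed hinge-style loss: shrink genuine distances and push
    impostor distances beyond the margin."""
    if is_genuine:
        return distance
    return max(0.0, margin - distance)


same = set_distance([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
different = set_distance([[1.0, 0.0]], [[0.0, 2.0]])
print(same < different)                # True
print(ranking_loss(different, False))  # 0.0, already beyond the margin
```

Under these assumptions, a genuine pair is penalized in proportion to its distance, while an impostor pair contributes no loss once its distance exceeds the margin.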
Dense SubGraph loss: We propose to learn dense prototypes through solving the problem defined in Eq. (1). Expanding the objective in Eq. (1) gives
(5) 
Since the entries of the partition indicator are binary, the product of the indicator entries of two face media for a given prototype is nonzero only if both media are assigned to that prototype, i.e., the two face media are divided into the same prototype. Maximizing the trace in Eq. (1) therefore finds the partition of samples that forms subgraphs such that the samples associated with the same subgraph have the largest total affinity (i.e., density). In practice, this maximization encourages the representations belonging to the same prototype to be close to each other and each resulting cluster to be far away from the others, i.e., they form multiple dense subgraphs.
In Eq. (5), each element of the affinity encodes the similarity between two corresponding media, i.e.,
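As a worked step, the trace objective can be expanded index by index. The symbols below are assumptions chosen for illustration: $A$ for the affinity matrix over $n$ media, $P$ for the binary partition indicator with $K$ columns (one per prototype).

```latex
\operatorname{tr}\!\left(P^{\top} A P\right)
  = \sum_{k=1}^{K} \sum_{i=1}^{n} \sum_{j=1}^{n} P_{ik}\, A_{ij}\, P_{jk}
  = \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij} \Big( \sum_{k=1}^{K} P_{ik} P_{jk} \Big)
```

When each row of $P$ contains exactly one 1, the inner sum $\sum_{k} P_{ik} P_{jk}$ equals 1 if media $i$ and $j$ share a prototype and 0 otherwise, so the trace adds up exactly the affinities between media placed in the same prototype.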
(6) 
Equivalently, the DSG learning can be achieved through the following minimization problem:
(7) 
where the distance term is the Euclidean distance matrix computed in Eq. (2).
Then we define the following DSG loss function to optimize the learned deep setbased facial representations:
(8) 
Thus, minimizing the DSG loss encourages the representations belonging to the same prototype to be close to each other. If one visualizes the learned representations in the high-dimensional space, the learned representations of one face media set form several compact clusters, and each cluster may be far away from the others. In this way, a face media set with large variance is implicitly distributed over several clusters, each of which has a small variance. We also conduct experiments for illustration in Sec. 4.1.
To simplify the above optimization, we propose to relax the binary constraint on the partition indicator to a continuous range between 0 and 1 by using a sigmoid activation function. Thus, the DSG loss is redefined as
(9) 
We adopt the joint supervision of the ranking and DSG losses to train MPNet for multi-prototype discriminative learning:
(10) 
where the weighting parameter balances the two losses.
Clearly, MPNet is end-to-end trainable and can be optimized with back-propagation (BP) and stochastic gradient descent (SGD). We summarize the learning algorithm of MPNet in Algorithm 1.
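A relaxed DSG-style loss and its combination with the ranking loss can be sketched as below. The concrete form is an assumption for illustration: soft prototype memberships come from a sigmoid over a linear projection of each medium's feature, and the loss sums pairwise Euclidean distances weighted by how strongly two media share prototypes. The names `soft_assignments`, `dsg_loss` and `joint_loss`, the projection weights and the toy features are hypothetical; the default weight 0.01 is the trade-off value reported in the implementation details.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def soft_assignments(features, weights):
    """One row per medium, one column per prototype, entries in (0, 1)."""
    return [
        [sigmoid(sum(w * f for w, f in zip(proto_w, feat))) for proto_w in weights]
        for feat in features
    ]


def dsg_loss(features, weights):
    """Sum of pairwise distances, weighted by shared soft membership."""
    p = soft_assignments(features, weights)
    total = 0.0
    for i, feat_i in enumerate(features):
        for j, feat_j in enumerate(features):
            dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(feat_i, feat_j)))
            shared = sum(a * b for a, b in zip(p[i], p[j]))
            total += dist * shared
    return total


def joint_loss(ranking, dsg, weight=0.01):
    """Weighted sum of the two losses."""
    return ranking + weight * dsg


features = [[4.0, 0.0], [4.2, 0.0], [0.0, 4.0], [0.0, 4.2]]  # two tight groups
separating = [[3.0, -3.0], [-3.0, 3.0]]  # sends each group to its own prototype
collapsing = [[3.0, 3.0], [-3.0, -3.0]]  # sends every medium to one prototype
print(dsg_loss(features, separating) < dsg_loss(features, collapsing))  # True
```

With these toy inputs, the projection that places each tight group in its own prototype yields the lower loss, since only small intra-group distances receive a large weight.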
4 Experiments
We evaluate MPNet qualitatively and quantitatively under various settings for unconstrained set-based face recognition on IJB-A [16], YTF [40] and IJB-C [23].
Implementation Details
We initialize the CNN module of MPNet for deep set-based facial representation learning with the VGG-face model [25], and fine-tune it on the target dataset. For each medium with the provided face bounding box, we first crop the facial RoI accordingly and then resize it to multiple sizes to build the multi-scale pyramid. The input size of MPNet is fixed, with local and global patches of compatible size randomly cropped from images/video frames. No 2D or 3D face alignment is used. The threshold for balancing the input data distribution is set to 128 as a trade-off between recognition accuracy and computation cost. The weights of the prototype-prediction layer (implemented with a 1D convolution layer with a sigmoid activation function) of the DSG subnet are initialized from a normal distribution with a standard deviation of 0.001. The total number of prototypes is set to 500. We also conduct experiments to illustrate how the number of prototypes influences the overall performance in Sec. 4.2. The bandwidth parameter in Eq. (3) is set to 10, the margin of the ranking loss is fixed at 0.8, and the trade-off parameter is set to 0.01 by 5-fold cross-validation. Different values of the trade-off parameter lead to different deep feature distributions. With a proper value, the discriminative power of deep features can be significantly enhanced. The chosen value is large enough to balance the scales of the two loss terms, as the subgraph loss sums over more pairs. The proposed network is implemented on the publicly available Caffe platform [15] and trained on three NVIDIA GeForce GTX TITAN X GPUs with 12 GB of memory. During training, the learning rate is initialized to 0.01, and during fine-tuning, the learning rate is initialized to 0.001. We train our model using SGD with a batch size of 1 face media set pair, momentum of 0.9, and weight decay of 0.0005.
4.1 Evaluations on IJB-A Benchmark
IJB-A contains 5,397 images and 2,042 videos from 500 subjects, captured in in-the-wild environments to avoid the near-frontal bias. For training and testing, 10 random splits are provided by each protocol. It contains two tasks, face verification and identification. The performance is evaluated by the TAR@FAR, FNIR@FPIR and Rank metrics, respectively.
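For readers unfamiliar with the verification metric, TAR@FAR can be computed as sketched below: the decision threshold is set so that at most the requested fraction of impostor pairs is accepted, and the true accept rate is then measured on genuine pairs. The function name and the toy similarity scores (higher means more similar) are assumptions for this example and are unrelated to the reported results.

```python
def tar_at_far(genuine_scores, impostor_scores, far):
    """True accept rate at the threshold whose false accept rate is <= `far`."""
    impostors = sorted(impostor_scores, reverse=True)
    allowed = int(far * len(impostors))  # impostor pairs we may accept
    if allowed >= len(impostors):
        threshold = float("-inf")
    else:
        # Only scores strictly above this value are accepted, so at most
        # `allowed` impostor pairs pass.
        threshold = impostors[allowed]
    accepted = sum(1 for score in genuine_scores if score > threshold)
    return accepted / len(genuine_scores)


impostor = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
genuine = [0.5, 0.92, 0.97, 0.99]
print(tar_at_far(genuine, impostor, 0.1))  # 0.75
```

Lowering the allowed false accept rate raises the threshold, so TAR can only stay equal or drop as FAR decreases.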
4.1.1 Ablation Study and Quantitative Comparison
We first investigate different architectures and losses of MPNet to see their respective roles in unconstrained set-based face recognition. We compare 8 variants of MPNet, i.e., baseline (siamese VGG-face [25]), w/o DSG, MPNet with five different numbers of prototypes, and MPNet (backbone: ResNet-101 [13]).
The performance comparison in terms of TAR@FAR, FNIR@FPIR and Rank-1 on IJB-A is reported in the lower panel of Tab. 1. Comparing w/o DSG with the baseline, an improvement can be observed on all evaluation metrics. This confirms the benefits of the basic refinements to the network structure. Compared with w/o DSG, MPNet with a suitable number of prototypes further boosts the performance, which supports the use of the auxiliary DSG loss to enhance the deep set-based facial representation learning. It simplifies unconstrained set-based face recognition, yet preserves discriminative and comprehensive information. By varying the number of prototypes from 3 to 1,000, one can see that the performance on the overall metrics first improves consistently as the number increases. This demonstrates that the affinity-based dense subgraph learning of the proposed DSG subnet can effectively enhance the deep feature capacity for unconstrained set-based face recognition. However, increasing the number of prototypes further does not bring additional improvement and may even harm the performance on the overall metrics. The reason is that an appropriately large number of prototypes yields a sparse prototype partition indicator matrix, which helps reach an optimal trade-off between preserving facial information and computation cost when addressing the large variance and false matching caused by atypical faces. An oversized number of prototypes, however, drives the learned filters to all zeros, which always produces invariant performance without any discriminative information. We hence set the number of prototypes to 500 in all the experiments.
For fair comparison with other state-of-the-art methods (upper panel of Tab. 1), we further replace the backbone from VGG-face with ResNet-101 (bottom row) while keeping the other settings the same. Our MPNet achieves the best results over 10 testing splits on both protocols. This superior performance demonstrates that MPNet is very effective for unconstrained set-based face recognition in the presence of large intra-set variance. Compared with existing set-based face recognition approaches, our MPNet can effectively address the large variance challenge and offer more discriminative and flexible face representations with small computational complexity. Also, in contrast to naive average or max pooling of face features, MPNet effectively preserves the necessary information through the DSG learning for set-based face recognition.
Moreover, exhaustive matching strategies (e.g., DCNN [5]) have a similarity computation cost that grows with the media numbers of both face sets to recognize, and take 1.2s to recognize each probe set. Our MPNet is more efficient because it operates at the prototype level, which reduces the computational complexity to one governed by the prototype number, and takes 0.5s to recognize each probe set. Although naive average or max pooling strategies (e.g., Pooling faces [11]) are slightly faster at test time (0.3s to recognize each probe set), they suffer severely from information loss. Our MPNet effectively preserves the necessary information through DSG learning for unconstrained set-based face recognition.
4.1.2 Qualitative Comparison
We then verify the effectiveness of our deep multi-prototype discriminative learning strategy. The predicted prototypes with relatively larger affinities within set 1311 and set 3038 from the testing data of IJB-A split1 are visualized using t-SNE [20] in Fig. 4. We observe that MPNet explicitly learns to automatically predict the prototype memberships within each coarse-level face set, reflecting different poses (e.g., the first 6 learned prototypes), expressions, illumination, and media modalities. Each learned prototype contains coherent media offering a collective facial representation with specific patterns. Outliers within each face set are detected by MPNet (e.g., the last learned prototypes). MPNet learns to enhance the compactness of the prototypes as well as their coverage of large variance for a single subject face, through which the heterogeneous attributes within each face media set are sufficiently considered and flexibly untangled. Compared with clustering-based data partition, MPNet with DSG learning is advantageous since it is end-to-end trainable, can learn more discriminative features and is robust to outliers. Learning DSG maximizes the intra-prototype media similarity and inter-prototype difference, resulting in discriminative face representations. This is significantly different from clustering methods (e.g., k-means), where only the similarity defined by the distance to the center is considered during learning.
Finally, we visualize the verification results in Fig. 5 for IJB-A split1 to gain insight into unconstrained set-based face recognition. After computing the similarities for all pairs of probe and reference sets, we sort the resulting list. Each row represents a probe and reference set pair. The original sets within IJB-A contain from one to dozens of media. Up to 8 individual media are shown, with the last space showing a mosaic of the remaining media in the set. Between the sets are the set IDs for probe and reference as well as the best matched and best non-matched similarities. Fig. 5 (blue, left) shows the best matched cases. In the top-30 scoring correct matches, we immediately note that every reference set contains dozens of media. The probe sets either contain dozens of media or one medium that matches well. Fig. 5 (blue, right) shows the worst matched cases, representing failed matching. The thirty lowest matched results from single-medium probe sets are all under extremely challenging unconstrained conditions. These extremely difficult cases cannot be solved even using the specific operations designed in MPNet. Fig. 5 (green, left), showing the worst non-matched cases, highlights the understandable errors involving single-medium probe sets representing impostors in challenging orientations. Fig. 5 (green, right), showing the best non-matched cases, presents the most certain non-mates, again often involving large sets with enough guidance from the relevant information of the same subject.
4.2 Evaluations on YTF Benchmark
YTF contains 3,425 videos of 1,595 different subjects. The average length of a video clip is 181.3 frames. All the video sequences were downloaded from YouTube. We follow the unrestricted with labeled outside data protocol and report the result on 5,000 video pairs.
The face recognition performance comparison of the proposed MPNet with other state-of-the-art methods on YTF is reported in Tab. 2. MPNet improves on the previous best result, which verifies the superiority of MPNet for effectively learning set-level discriminative face representations.
4.3 Evaluations on IJB-C Benchmark
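Method | TAR@FAR=1e-5 | TAR@FAR=1e-4 | TAR@FAR=1e-3 | TAR@FAR=1e-2
--- | --- | --- | --- | ---
GOTS [23] | 0.066 | 0.147 | 0.330 | 0.620
FaceNet [29] | 0.330 | 0.487 | 0.665 | 0.817
VGG-face [25] | 0.437 | 0.598 | 0.748 | 0.871
VGGface2_ft [2] | 0.768 | 0.862 | 0.927 | 0.967
MN-vc [41] | 0.771 | 0.862 | 0.927 | 0.968
MPNet | 0.827 | 0.898 | 0.940 | 0.971
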
IJB-C contains 31,334 images and 11,779 videos from 3,531 subjects, i.e., on average 8.87 images and 3.34 videos per subject, with the videos split into 117,542 frames. The media are captured in in-the-wild environments to avoid the near-frontal bias. For fair comparison, we follow the template-based setting and evaluate models on the standard 1:1 verification protocol in terms of TAR@FAR.
The face recognition performance comparison of the proposed MPNet with other state-of-the-art methods on IJB-C is reported in Tab. 3. MPNet outperforms the previous best by 5.60% in TAR at the lowest FAR operating point, which further shows its remarkable generalizability for recognizing faces in the wild and indicates that the learned deep features are robust and disambiguated.
5 Conclusion
We proposed a novel Multi-Prototype Network (MPNet) with a new Dense SubGraph (DSG) learning subnet to address unconstrained set-based face recognition, which adaptively learns compact and discriminative multi-prototype representations. Comprehensive experiments demonstrate the superiority of MPNet over state-of-the-art methods. The proposed framework can be easily extended to other generic object recognition tasks by utilizing area-specific sets. In the future, we will explore a pure MPNet architecture where all components are replaced with well-designed MPNet layers, which can hierarchically exploit the multi-prototype discriminative information to solve complex computer vision problems.
Acknowledgement
The work of Jian Zhao was partially supported by China Scholarship Council (CSC) grant 201503170248.
The work of Junliang Xing was partially supported by the National Science Foundation of China 61672519.
The work of Jiashi Feng was partially supported by NUS IDS R263000C67646, ECRA R263000C87133 and MOE TierII R263000D17112.
References
 [1] W. AbdAlmageed, Y. Wu, S. Rawls, S. Harel, T. Hassner, I. Masi, J. Choi, J. Leksut, J. Kim, P. Natarajan, et al. Face recognition using deep multi-pose representations. In WACV, pages 1–9, 2016.
 [2] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman. VggFace2: a dataset for recognising faces across pose and age. In FG, pages 67–74, 2018.
 [3] R. Chellappa, J.-C. Chen, R. Ranjan, S. Sankaranarayanan, A. Kumar, V. M. Patel, and C. D. Castillo. Towards the design of an end-to-end automated system for image and video-based recognition. arXiv preprint arXiv:1601.07883, 2016.
 [4] J. Chen, V. M. Patel, L. Liu, V. Kellokumpu, G. Zhao, M. Pietikäinen, and R. Chellappa. Robust local features for remote face recognition. IVC, 64:34–46, 2017.
 [5] J.-C. Chen, V. M. Patel, and R. Chellappa. Unconstrained face verification using deep CNN features. In WACV, pages 1–9, 2016.
 [6] J.-C. Chen, R. Ranjan, A. Kumar, C.-H. Chen, V. M. Patel, and R. Chellappa. An end-to-end system for unconstrained face verification with deep convolutional neural networks. In ICCVW, pages 118–126, 2015.
 [7] A. R. Chowdhury, T.-Y. Lin, S. Maji, and E. Learned-Miller. One-to-many face recognition with bilinear CNNs. In WACV, pages 1–9, 2016.
 [8] N. Crosswhite, J. Byrne, C. Stauffer, O. Parkhi, Q. Cao, and A. Zisserman. Template adaptation for face verification and identification. In FG, pages 1–8, 2017.
 [9] J. Deng, J. Guo, N. Xue, and S. Zafeiriou. ArcFace: additive angular margin loss for deep face recognition. arXiv preprint arXiv:1801.07698, 2018.
 [10] J. Dong, W. Xia, Q. Chen, J. Feng, Z. Huang, and S. Yan. Subcategory-aware object classification. In CVPR, pages 827–834, 2013.
 [11] T. Hassner, I. Masi, J. Kim, J. Choi, S. Harel, P. Natarajan, and G. Medioni. Pooling faces: template based face recognition with pooled face images. In CVPRW, pages 59–67, 2016.
 [12] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In ICCV, pages 1026–1034, 2015.
 [13] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
 [14] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: a database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, 2007.
 [15] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: convolutional architecture for fast feature embedding. In ACM MM, pages 675–678, 2014.
 [16] B. F. Klare, B. Klein, E. Taborsky, A. Blanton, J. Cheney, K. Allen, P. Grother, A. Mah, M. Burge, and A. K. Jain. Pushing the frontiers of unconstrained face detection and recognition: IARPA Janus Benchmark A. In CVPR, pages 1931–1939, 2015.
 [17] J. Li, J. Zhao, F. Zhao, H. Liu, J. Li, S. Shen, J. Feng, and T. Sim. Robust face recognition with deep multi-view representation learning. In ACM MM, pages 1068–1072, 2016.
 [18] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song. SphereFace: Deep hypersphere embedding for face recognition. In CVPR, volume 1, page 1, 2017.
 [19] X. Lu, Y. Wang, W. Zhang, S. Ding, and W. Jiang. Deep CNNs for face verification. In CCBR, pages 85–92, 2016.
 [20] L. v. d. Maaten and G. Hinton. Visualizing data using t-SNE. JMLR, 9(Nov):2579–2605, 2008.
 [21] I. Masi, S. Rawls, G. Medioni, and P. Natarajan. Pose-aware face recognition in the wild. In CVPR, pages 4838–4846, 2016.
 [22] I. Masi, A. T. Tran, J. T. Leksut, T. Hassner, and G. Medioni. Do we really need to collect millions of faces for effective face recognition? arXiv preprint arXiv:1603.07057, 2016.
 [23] B. Maze, J. Adams, J. A. Duncan, N. Kalka, T. Miller, C. Otto, A. K. Jain, W. T. Niggel, J. Anderson, J. Cheney, et al. IARPA Janus Benchmark-C: face dataset and protocol. In ICB, 2018.
 [24] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, pages 807–814, 2010.
 [25] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In BMVC, 2015.
 [26] R. Ranjan, C. D. Castillo, and R. Chellappa. L2-constrained Softmax loss for discriminative face verification. arXiv preprint arXiv:1703.09507, 2017.
 [27] R. Ranjan, S. Sankaranarayanan, C. D. Castillo, and R. Chellappa. An all-in-one convolutional neural network for face analysis. arXiv preprint arXiv:1611.00851, 2016.
 [28] S. Sankaranarayanan, A. Alavi, C. D. Castillo, and R. Chellappa. Triplet probabilistic embedding for face verification and clustering. In BTAS, pages 1–8, 2016.
 [29] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: a unified embedding for face recognition and clustering. In CVPR, pages 815–823, 2015.
 [30] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 [31] Y. Sun, Y. Chen, X. Wang, and X. Tang. Deep learning face representation by joint identification-verification. In NIPS, pages 1988–1996, 2014.
 [32] Y. Sun, D. Liang, X. Wang, and X. Tang. DeepID3: Face recognition with very deep neural networks. arXiv preprint arXiv:1502.00873, 2015.
 [33] Y. Sun, X. Wang, and X. Tang. Deeply learned face representations are sparse, selective, and robust. In CVPR, pages 2892–2900, 2015.
 [34] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. DeepFace: closing the gap to humanlevel performance in face verification. In CVPR, pages 1701–1708, 2014.
 [35] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Web-scale training for face identification. In CVPR, pages 2746–2754, 2015.
 [36] D. Wang, C. Otto, and A. K. Jain. Face search at scale: 80 million gallery. arXiv preprint arXiv:1507.07242, 2015.
 [37] F. Wang, J. Cheng, W. Liu, and H. Liu. Additive margin softmax for face verification. IEEE Signal Processing Letters, 25(7):926–930, 2018.
 [38] H. Wang, Y. Wang, Z. Zhou, X. Ji, and W. Liu. CosFace: large margin cosine loss for deep face recognition. In CVPR, pages 5265–5274, 2018.
 [39] Y. Wen, K. Zhang, Z. Li, and Y. Qiao. A discriminative feature learning approach for deep face recognition. In ECCV, pages 499–515, 2016.
 [40] L. Wolf, T. Hassner, and I. Maoz. Face recognition in unconstrained videos with matched background similarity. In CVPR, pages 529–534, 2011.
 [41] W. Xie and A. Zisserman. Multicolumn networks for face recognition. arXiv preprint arXiv:1807.09192, 2018.
 [42] J. Yang, P. Ren, D. Zhang, D. Chen, F. Wen, H. Li, and G. Hua. Neural aggregation network for video face recognition. In CVPR, pages 5216–5225, 2017.
 [43] H. Ye, W. Shao, H. Wang, J. Ma, L. Wang, Y. Zheng, and X. Xue. Face recognition via active annotation and learning. In ACM MM, pages 1058–1062, 2016.
 [44] J. Zhao, Y. Cheng, Y. Xu, L. Xiong, J. Li, F. Zhao, K. Jayashree, S. Pranata, S. Shen, J. Xing, et al. Towards pose invariant face recognition in the wild. In CVPR, pages 2207–2216, 2018.
 [45] J. Zhao, L. Xiong, Y. Cheng, Y. Cheng, J. Li, L. Zhou, Y. Xu, J. Karlekar, S. Pranata, S. Shen, J. Xing, S. Yan, and J. Feng. 3D-aided deep pose-invariant face recognition. In IJCAI, pages 1184–1190, 2018.
 [46] J. Zhao, L. Xiong, P. K. Jayashree, J. Li, F. Zhao, Z. Wang, P. S. Pranata, P. S. Shen, S. Yan, and J. Feng. Dual-agent GANs for photorealistic and identity preserving profile face synthesis. In NIPS, pages 66–76, 2017.