Regressing Robust and Discriminative 3D Morphable Models with a very Deep Neural Network
Pushed by big data and deep convolutional neural networks (CNNs), the performance of face recognition is becoming comparable to that of humans. Using private large scale training datasets, several groups have achieved very high accuracy on LFW, i.e., 97% to 99%. However, none of these large scale face datasets is publicly available. The current situation in the field of face recognition is that data is more important than the algorithm. To address this problem, this paper proposes a semi-automatic way to collect face images from the Internet and builds a large scale dataset containing about 10,000 subjects and 500,000 images, called CASIA-WebFace. Based on this database, we use an 11-layer CNN to learn a discriminative representation and obtain state-of-the-art accuracy on LFW and YTF. The publication of CASIA-WebFace will attract more research groups to this field and accelerate the development of face recognition in the wild.
In the past year, the performance of face recognition algorithms has improved by a large margin. For example, accuracy on LFW, the hardest face dataset at present, has improved from 95% to 99%, which is on a par with human performance. The best methods on LFW can be divided into two categories: wide models and deep models. The essence of a good model is that it should have enough capacity to represent the variations of complex face images. High dimensional LBP is a typical wide model, which flattens the face manifold by transforming the image into a very high dimensional space. The convolutional neural network (CNN) is the state-of-the-art deep model for face recognition and image analysis.
Although a model can increase its complexity along either the "width" or "depth" direction, a deep model is more effective than a wide model with the same number of parameters. Furthermore, an ordinary computer cannot easily handle the high dimensional features extracted by a wide model. In contrast, the dimension of the features in each layer of a deep model is much lower, which makes the memory consumption of a deep model affordable. The power of CNNs to distill knowledge from data has been verified in many fields. Recently, deep CNNs have become the mainstream in face recognition and hold the top positions on LFW. However, limited by the scale of training data, the ceiling of CNNs has not been measured yet.
Many excellent open source implementations [9, 8] of CNNs can be used to learn face representations from big data, but no research team has made its private large dataset public until now. The reason may be that large datasets are very hard to collect and consume a lot of money and manpower. But for academic research, developing algorithms on private data is harmful in two aspects: first, most researchers cannot contribute to large scale face recognition methods for lack of data; second, due to different training sets, many classical methods and CNNs are not comparable.
On LFW, the best methods all use outside data besides LFW, i.e., they fall into the "Unrestricted, Labeled Outside Data" category of LFW. In fact, the methods in this category are hardly methods alone but rather solutions that include at least outside data and an algorithm. To extend the scale of LFW and standardize the evaluation protocol of "Unrestricted, Labeled Outside Data", this paper builds a large scale dataset including about 10,000 subjects and 500,000 images, called CASIA-WebFace (you can apply for the dataset at http://www.cbsr.ia.ac.cn/english/CASIA-WebFace-Database.html). To the best of our knowledge, the size of this dataset ranks second in the literature, smaller only than the private dataset of Facebook (SFC). We encourage data-consuming methods to train on this dataset and report performance on LFW.
Crawling face images from the Internet is easy, but annotating their identities is hard. Thanks to the good structure of the IMDb website (http://www.imdb.com), the crawling and annotation can be done in a semi-automatic way. First, the names of celebrities of interest are crawled from the website, and then the photos on their pages are downloaded. Because most photos usually contain more than one face, a difficulty arises. We therefore propose a simple and fast clustering method to annotate the identities of the faces in the photos. To ensure that the subjects in the dataset do not overlap with LFW, we use the edit distance between names to check for duplicates. Finally, we scan the whole dataset manually and correct false annotations.
The contributions of this paper are summarized as follows:
We build a large scale face dataset and make it public, which will dispel the chaos of evaluation on LFW and make methods fairly comparable;
We propose a semi-automatic pipeline to construct large scale face datasets from the Internet, which will encourage more researchers to build new face datasets or enlarge existing ones.
Data and algorithms are two essential components of pattern recognition. With the successful application of deep learning to face recognition, dataset collection now lags behind algorithms. In this section we review some popular face datasets and representation learning methods.
Early face datasets were mostly collected under controlled environments, such as PIE, FERET and so on. Through the efforts of many researchers, very high performance can be obtained on these ideal datasets. But the models learned from them are difficult to generalize to new environments in practical applications. To improve the generalization of face recognition methods, the interest of the community gradually shifted from controlled environments to uncontrolled environments, i.e., face recognition in the wild. A milestone dataset, LFW, including 5,749 subjects, was then born in 2007.
Compared to previous datasets, the biggest difference of LFW is that its images are crawled from the Internet rather than acquired under several pre-defined environments. Therefore, LFW has more variations in pose, illumination, expression, resolution and imaging device, and these factors are combined in random ways. In 2009, based on the name list of LFW,  collected another good dataset named PubFig. Although PubFig includes just 200 subjects, the number of images per subject is much larger than in LFW, and it supplies 73 attributes to describe the face images. YTF is another dataset based on the name list of LFW, but it was created for video based face recognition. All the videos in YTF were downloaded from YouTube. Because videos on YouTube are compressed at a very high ratio, the quality of the face snapshots is lower than in LFW.
CACD is a large dataset collected for cross-age face recognition in 2014, which includes 2,000 subjects and 163,446 images. The scale of CACD is large enough to train deep models, but the dataset contains much noise and many incorrect identity labels. The reason is that the images are crawled by the Google Image search engine, and only a small subset (200 subjects) is manually checked.
Besides the above publicly accessible datasets, there are three large scale private datasets: Facebook's SFC, CUHK's CelebFaces and MSRA's WDRef. Among them, SFC has the biggest scale, including more than 4,000 subjects with an average of 1,000 images each. Using SFC,  successfully learned an effective face representation robust to face variations in the wild. Although CelebFaces and WDRef are relatively smaller than SFC, they are also good resources for developing high performance algorithms. The current state-of-the-art accuracy on LFW is obtained by training on CelebFaces. It is a pity that these three good datasets are not publicly available; therefore this paper collects CASIA-WebFace to fill the gap.
The first popular face recognition method was Eigenface, proposed in 1991. Nowadays we can see Eigenface as a model with one linear layer. Fisherface, or LDA, is a one-layer linear model too. In the following long period, researchers mainly focused on how to solve for the parameters of the linear layer with respect to cost functions such as reconstruction error and classification error. Much attention was also paid to regularizing the solution of LDA, because LDA is prone to the small sample size (SSS) problem.
Then various local feature based methods emerged, and they were naturally combined with the above linear models, such as Gabor+LDA, LBP+LDA and so on. We can roughly see these methods as two-layer models, though the parameters of the local filters are hand-crafted. The first layer is usually applied to the input image in a local and nonlinear way, such as Gabor magnitude or LBP coding. Many papers show that the "locally nonlinear + fully connected linear" architecture is definitely better than the "fully connected linear" architecture. Gabor+LBP+LDA, proposed in , is even a three-layer model, which obtained good performance with careful tuning.
Deeper models (more than 3 layers) based on hand-crafted filters are rarely reported in the literature, because the filters (or parameters) of each layer are usually designed by hand independently, and the dynamics between layers are hard to handle by human observation. Therefore, learning the parameters of all layers from data is the best way out. With its successful applications in image classification, the CNN rapidly became the mainstream in the field of face recognition. Before CNNs became popular, many good filter learning methods [19, 14] were proposed to learn the parameters of two-layer models, but they are difficult to generalize to deep architectures. Following the current trend, this paper uses a deep CNN to learn face representations from a large scale dataset.
A dataset for face recognition needs only two kinds of data: face images and identities. Randomly crawling face images from the Internet and annotating them is a nearly impossible mission. IMDb is a well structured website containing rich information about celebrities, such as name, gender, birthday and photos. We first search for the celebrities born between 1940 and 2014 on the website, and then crawl their names.
Each celebrity has an independent page on the website. A sample page is shown in Fig. 1, in which we only focus on the "name", "main photo" and "photo gallery" contents. Ignoring celebrities without a "main photo", we get 38,423 subjects and 903,304 images in total. All images are then processed by a multi-view face detector; 844,126 images remain in the dataset and 1,556,352 faces are detected. Because many images appear in the "photo gallery" of several celebrities simultaneously, the actual numbers of unique images and faces are smaller than the above numbers.
The dataset in its current state cannot be used for training; we need to annotate the identity of the faces in each image. The "main photo" usually contains only a single face of the corresponding celebrity, but the majority of photos in the "photo gallery" contain multiple faces belonging to other celebrities. Our task is to assign an identity to each face and divide the faces into groups according to their identities.
Browsing the "photo gallery" of each celebrity, we find that every photo is annotated with several name tags. The name tags can reduce the search space of face-identity correspondence and simplify our annotation task. Two sample photo pages are shown in Fig. 2, which illustrates two kinds of noise on the photo page: missed detections and missed annotations.
Clustering all faces with an existing face recognition engine is a natural way to deal with this large scale task. General clustering methods need to compute the similarity (or distance) matrix of all samples first, but the matrix is too large to be loaded into memory. To annotate the large number of faces effectively, we propose a three-step method that uses name tags and face similarity simultaneously. The proposed method can run on a normal PC and obtains good clustering results. The steps of our tag-similarity clustering method are as follows.
Extract the feature template of each face with a pre-trained face recognition engine;
Use the “main photo” of each celebrity as its seed.
Use the images containing only one face to augment each celebrity's seed images.
For the remaining images in the "photo gallery", find the correspondences between faces and celebrities, constrained by similarity and name tags.
Crop the faces from the images and save them into an independent folder for each celebrity. Manually check the dataset and delete falsely grouped face images.
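The steps above can be sketched as follows. This is only an illustrative assumption of the procedure: the feature-template format, the `0.5` similarity threshold, and the data layout are placeholders, not the authors' actual engine or settings.

```python
# Sketch of the tag-similarity clustering step: assign each detected face
# to the most similar celebrity among those named in the photo's tags.
# Templates, threshold, and data layout are illustrative assumptions.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

def tag_similarity_cluster(photos, seeds, threshold=0.5):
    """Restrict candidates to the celebrities in each photo's name tags
    (the search-space reduction described in the text), then pick the
    seed with the highest similarity above a threshold."""
    groups = {name: [] for name in seeds}
    for photo in photos:
        for face in photo["faces"]:
            best, best_sim = None, threshold
            for name in photo["tags"]:
                sim = cosine(face["template"], seeds[name])
                if sim > best_sim:
                    best, best_sim = name, sim
            if best is not None:
                groups[best].append(face["id"])
    return groups
```

Because only the tagged names are considered as candidates, no full pairwise similarity matrix is ever built, which is what lets the procedure run on an ordinary PC.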
After clustering completes, we remove subjects having fewer than 15 face images. To make this dataset compatible with LFW, we check for duplicate subjects based on the edit distance between the names in CASIA-WebFace and LFW. 1,043 subjects with the same names were found in both CASIA-WebFace and LFW, and these subjects are removed from CASIA-WebFace. Now CASIA-WebFace can be seen as an independent training set for LFW. By combining CASIA-WebFace and LFW, we obtain a new benchmark for large scale face recognition in the wild.
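The name-based deduplication can be sketched with a standard Levenshtein edit distance. The matching threshold used by the authors is not stated; exact match (distance 0) is assumed below for illustration.

```python
# Sketch of the name-deduplication step between CASIA-WebFace and LFW.
# The max_dist=0 threshold (exact name match) is an assumption.

def edit_distance(a, b):
    """Classic Levenshtein distance, rolling single-row DP."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,            # deletion
                        dp[j - 1] + 1,        # insertion
                        prev + (a[i - 1] != b[j - 1]))  # substitution
            prev = cur
    return dp[n]

def find_duplicates(webface_names, lfw_names, max_dist=0):
    """Flag subjects whose names are within max_dist edits of an LFW name."""
    return [name for name in webface_names
            if any(edit_distance(name, ref) <= max_dist for ref in lfw_names)]
```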
After cleaning, CASIA-WebFace finally has 10,575 subjects and 494,414 face images. Because the scale of the dataset is very large, we cannot promise that all faces are detected and annotated correctly. A small number of misclassified samples does not affect the training process and may even improve the robustness of the model. The quality of the dataset will be illustrated by the experimental results.
The statistics of the proposed CASIA-WebFace dataset are shown in Table 1. Except for Facebook's SFC dataset, CASIA-WebFace is the largest. Due to user privacy issues, SFC may never be open to the research community. The features of Microsoft's WDRef dataset have been publicly available since 2012, but feature-only access is inflexible for advanced research. Among the datasets listed in the table, CASIA-WebFace+LFW is the most suitable combination for large scale face recognition in the wild. If you feel that accuracy on LFW has been saturated by the current state-of-the-art methods, BLUFR is a more challenging protocol for reporting your results.
|Dataset|#Subjects|#Images|Availability|
|WDRef|2,995|99,773|Public (features only)|
|CACD|2,000|163,446|Public (partially annotated)|
To illustrate the advantages of the proposed dataset, we learn an effective representation from it with a deep convolutional network using many of the latest tricks.
The baseline deep convolutional network is constructed by combining tricks from recent successful networks, including a very deep architecture, a low dimensional representation and multiple loss functions. Small filters and a very deep architecture reduce the number of parameters and enhance the nonlinearity of the network. A low dimensional representation conforms to the assumption that face images usually lie on a low dimensional manifold, and the low dimensional constraint reduces the complexity of the network. Combining identification and verification loss functions has been analyzed in , and can learn more discriminative representations than Softmax alone.
The dimension of the input layer is 100×100×1, i.e., a gray image. The proposed network includes 10 convolutional layers, 5 pooling layers and 1 fully connected layer; the detailed architecture is shown in Table 2. All filters in the network are 3×3. The first four pooling layers use the max operator and the last pooling layer uses averaging. Limited by the computational power of our GPUs, the architecture is not optimal but determined according to our experience. There is still room for improvement.
|Name|Type|Filter Size / Stride|Output Size|Depth|#Params|
|Conv11|convolution|3×3 / 1|100×100×32|1|0.28K|
|Conv12|convolution|3×3 / 1|100×100×64|1|18K|
|Pool1|max pooling|2×2 / 2|50×50×64|0||
|Conv21|convolution|3×3 / 1|50×50×64|1|36K|
|Conv22|convolution|3×3 / 1|50×50×128|1|72K|
|Pool2|max pooling|2×2 / 2|25×25×128|0||
|Conv31|convolution|3×3 / 1|25×25×96|1|108K|
|Conv32|convolution|3×3 / 1|25×25×192|1|162K|
|Pool3|max pooling|2×2 / 2|13×13×192|0||
|Conv41|convolution|3×3 / 1|13×13×128|1|216K|
|Conv42|convolution|3×3 / 1|13×13×256|1|288K|
|Pool4|max pooling|2×2 / 2|7×7×256|0||
|Conv51|convolution|3×3 / 1|7×7×160|1|360K|
|Conv52|convolution|3×3 / 1|7×7×320|1|450K|
|Pool5|avg pooling|7×7 / 1|1×1×320|0||
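The parameter counts in Table 2 can be reproduced from the filter sizes and channel counts alone: a 3×3 convolution from c_in to c_out channels has 3·3·c_in·c_out weights (biases omitted, matching the table, and "K" read as 1024). A minimal check:

```python
# Reproduce the #Params column of Table 2 from the channel counts.
# channels = input depth followed by the output depth of each conv layer.
channels = [1, 32, 64, 64, 128, 96, 192, 128, 256, 160, 320]

# Each 3x3 conv layer has 3*3*c_in*c_out weights (no bias counted).
params = [3 * 3 * c_in * c_out
          for c_in, c_out in zip(channels[:-1], channels[1:])]

for i, p in enumerate(params, 1):
    print(f"conv layer {i}: {p / 1024:.2f}K weights")

print(f"total: {sum(params) / 1024:.0f}K")
```

The per-layer values (0.28K, 18K, 36K, ..., 450K) match the table, and the convolutional layers together hold about 1.7M weights.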
achieved high performance in the ImageNet 2014 challenge with a 19-layer network. Meanwhile,  obtained slightly better results with a 22-layer network. This paper combines the tricks from these two papers. We use multiple small filters to approximate a large filter and remove redundant fully connected layers to reduce the number of parameters. Finally, our network uses 3×3 filters in all 10 convolutional layers and has just 1 fully connected layer.
The Pool5 layer is used as the face representation; its dimension equals the number of channels of Conv52, i.e., 320. To distinguish the large number of subjects in the training set (10,575), this low dimensional representation must fully distill discriminative information from the face images. As in , the Softmax (identification) and Contrastive (verification) costs are combined to construct the objective function. ReLU neurons are used after all convolutional layers except Conv52. Because the Conv52 outputs are averaged to generate the low dimensional face representation, they should be dense and compact; ReLU tends to produce sparse vectors, so applying it to the face representation would degrade performance.
In the training stage, Pool5 is used as the input of the Contrastive cost function, and Fc6 is used as the input of the Softmax cost function. Because the number of parameters of Fc6 is very large, i.e., 320×10,575, we set the dropout ratio to 0.4 to regularize Fc6. The importance of the two cost functions is balanced by a weight.
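A hedged sketch of this combined objective is below: Softmax (identification) plus Contrastive (verification) cost, balanced by a weight. The margin value and the exact Contrastive formulation are assumptions for illustration; the text does not specify them.

```python
import math

# Sketch of the combined objective: identification (Softmax) plus
# verification (Contrastive) losses balanced by a weight `lam`.
# margin=1.0 and the squared-distance form are illustrative assumptions.

def softmax_loss(logits, label):
    """Cross-entropy of a softmax over class logits (numerically stable)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def contrastive_loss(f1, f2, same, margin=1.0):
    """Pull same-identity pairs together, push others beyond a margin."""
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(f1, f2)))
    if same:
        return 0.5 * d * d
    return 0.5 * max(0.0, margin - d) ** 2

def combined_loss(logits, label, f1, f2, same, lam=3.2e-4):
    # lam starts small and is increased gradually during training.
    return softmax_loss(logits, label) + lam * contrastive_loss(f1, f2, same)
```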
Before being input to the network, all face images are converted to gray scale and normalized to 100×100 according to two landmarks (see Fig. 4). Compared to the commonly used eye centers, the distance between the two selected landmarks is relatively invariant to pose variations in yaw. After normalization, the distance between the two points is 25 pixels. Because the face has a nearly symmetric structure, we double the training set with a mirror operation, which makes the representation more robust to pose variations.
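The 2-point normalization amounts to estimating a similarity transform that makes the two landmarks level and 25 pixels apart. The helper below is only a sketch of the parameter estimation; the actual warping code and canonical landmark positions are not given in the text.

```python
import math

# Sketch of 2-point similarity normalization: find the scale and
# rotation that map the segment between two landmarks onto a
# horizontal segment of 25 pixels. Canonical positions are assumed.

def similarity_params(p1, p2, target_dist=25.0):
    """Return (scale, angle_degrees) aligning p1->p2 with a horizontal
    segment of length target_dist; the image is rotated by -angle."""
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    dist = math.hypot(dx, dy)
    scale = target_dist / dist
    angle = math.degrees(math.atan2(dy, dx))
    return scale, angle
```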
With such a huge amount of data, the current network is unlikely to over-fit, so we set the weight decay of all convolutional layers to 0 and the weight decay of the fully connected layer to 5e-4. The learning rate is set to 1e-2 initially and reduced to 1e-5 gradually. Because the convergence rate of Softmax is faster than that of the Contrastive cost function, the loss weight is set to a small value, 3.2e-4, at first and increased to 6.4e-3 gradually.
The open source implementation cuda-convnet is used to train our network. For the Softmax cost, we just need to input face images and their labels, but for the Contrastive cost, we need to generate face pairs by sampling from the training set. To reduce memory and disk space consumption, we sample the positive and negative face pairs online within each batch; face pairs across batches are not covered. How to generate complete face pairs effectively is left to future work.
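The online within-batch pair sampling can be sketched as enumerating sample pairs inside a mini-batch. Exhaustive enumeration is an assumption here; the text only states that pairs are sampled within each batch and never across batches.

```python
from itertools import combinations

# Sketch of online pair generation within a mini-batch: every pair of
# samples becomes a positive pair (same label) or a negative pair
# (different labels). Pairs never cross batch boundaries.

def pairs_in_batch(labels):
    positives, negatives = [], []
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] == labels[j]:
            positives.append((i, j))
        else:
            negatives.append((i, j))
    return positives, negatives
```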
CASIA-WebFace is used throughout to train the deep network. As described in the last section, all face images in CASIA-WebFace are processed by face detection, face landmarking and alignment. Exploiting facial symmetry, we mirror the 493,456 detected faces to augment the dataset, giving 986,912 training samples in total. The whole process is fully automatic, and falsely aligned faces are left as they are. There are still a small number of missed detections in this dataset; if you have a better face detector than ours, your training set may be slightly larger.
LFW and YTF are the two most popular and challenging datasets for face recognition in the wild. Because they do not overlap with the proposed CASIA-WebFace, it is very reasonable to report performance on LFW and YTF. Besides that, the trained deep network is also evaluated under a more challenging and practical protocol, BLUFR, which reflects the performance of face recognition in real applications more objectively.
LFW includes 5,749 subjects and 13,233 face images. There are three main protocols for reporting performance: the unsupervised, restricted and unrestricted protocols. The unsupervised protocol is used to evaluate the baseline performance of a face representation, and the other two are usually used to evaluate the performance of metric learning or the whole method. For all protocols, the test set is fixed, including 6,000 face pairs in 10 splits. Mean accuracy and the standard error of the mean should be reported.
All images in LFW are processed by the same pipeline as CASIA-WebFace and normalized to 100×100. The representations of all faces are extracted by the deep network trained on CASIA-WebFace (abbreviated DR). First, we evaluate the base performance of the representation directly. Then, we test the influence of unsupervised and supervised learning on the base representation. The following experiments are conducted:
A: DR + Cosine;
B: DR + PCA on CASIA-WebFace + Cosine;
C: DR + Joint Bayes on CASIA-WebFace;
D: DR + PCA on LFW training set + Cosine;
E: DR + Joint Bayes on LFW training set.
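Scheme A above is plain cosine similarity between the 320-d Pool5 representations, thresholded to decide whether two faces match. A minimal sketch follows; the decision threshold would be tuned on View1 of LFW, and 0.5 here is an arbitrary placeholder.

```python
import math

# Sketch of scheme A: cosine similarity between two face
# representations, thresholded for a same/different decision.
# threshold=0.5 is an illustrative placeholder, not a tuned value.

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def verify(a, b, threshold=0.5):
    """Return True if the pair is judged to be the same identity."""
    return cosine_similarity(a, b) >= threshold
```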
Note: in Table 3, the term "unsupervised" means that the model is not trained on LFW in a supervised way.
According to the protocols of LFW, hyper-parameters such as the dimension of PCA and the regularization factor of Joint Bayes are tuned on CASIA-WebFace or on View1 of LFW. The accuracies are evaluated on View2 of LFW and listed in Table 3. Other state-of-the-art results, from DeepFace and DeepID2, are also given for comparison. From the results of our experiments, we can draw the following conclusions:
The base representation has good performance;
Fine-tuning on the training set of LFW improves performance slightly, e.g., B→D, C→E;
On top of the base representation, Joint Bayes improves performance marginally, e.g., B→C, D→E.
Inspecting the results in the unsupervised setting, we can see that our base representation is better than DeepFace: 96.13% vs. 95.92%. After tuning on LFW by PCA, the accuracy improves slightly to 96.33%. Because Joint Bayes cannot deal with pairwise samples directly, we do not conduct experiments under the restricted protocol. Under the unrestricted protocol, our single-network scheme E achieves 97.73%, which is better than DeepFace's 7-network ensemble (97.35%) and comparable to DeepID2's 4-network ensemble (97.75%).
The superiority of our network benefits from its deep architecture. In other aspects, our method and dataset are still inferior to DeepFace: 1) we align face images only by a 2D similarity transformation, which is inferior to DeepFace's 3D alignment; 2) the training set of DeepFace, SFC, is 10X larger than our CASIA-WebFace. Limited by GPU resources, we do not continue to train deeper networks or network ensembles to improve performance here. Once the dataset, CASIA-WebFace, is published, we believe the whole research community can refresh the record more quickly.
The test set of LFW includes just 6,000 face pairs, half genuine and half impostor. As discussed in , this number of negative face pairs is not enough to evaluate performance at low FARs. Therefore,  developed a new benchmark protocol, called BLUFR, to fully use all 13,233 face images in LFW. BLUFR contains both verification and open-set identification scenarios, with a focus on low FARs. There are 10 trials of experiments, each trial containing about 156,915 genuine matching scores and 46,960,863 impostor matching scores on average for performance evaluation.
The representations of the faces in LFW are extracted in the same way as in the previous experiment. The results are then calculated by the standard benchmark toolkit. For simplicity, we report only the results of schemes E and F. The VR (Verification Rate) and DIR (Detection and Identification Rate) of our methods and the compared methods are listed in Table 4. The numbers in the table are reported as μ−σ over the 10 trials, where μ is the mean accuracy and σ is the standard deviation.
 only reported the performance of some conventional shallow (but wide) models under the BLUFR protocol. The best reported method is HD-LBP + JB (High Dimensional LBP + Joint Bayes), with VR=41.66% (at FAR=0.1%). As shown in Table 4, our deep network surpasses the HD-LBP based methods significantly. The superiority of deep models over wide models has been illustrated in previous work, and that conclusion is verified again in this paper.
We find that all numbers in Table 4 are obviously lower than those in Table 3, especially the DIR (at FAR=1% and Rank=1). Because DIR is an important index reflecting the performance of face surveillance (watch-list) systems, we think face recognition algorithms still have a large gap to close before meeting the requirements of surveillance applications.
To test the generalization ability of our network, we evaluate it on a video face dataset, YouTube Faces (YTF). Due to motion blur and high compression ratios, the quality of the images in YTF is much worse than that of web photos. For each video in YTF, we randomly select 15 frames and extract their representations with our deep network (DR). In the training stage, the 15 frames are treated as 15 samples with the same identity. In the testing stage, the similarity score of a video pair is the mean value over the 15×15=225 frame pairs. The following experiments are conducted in unsupervised and supervised settings:
A: DR + Cosine;
D: DR + PCA on YTF training set + Cosine;
E: DR + Joint Bayes on YTF training set.
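The video-pair scoring rule above reduces to averaging similarities over all 15×15 = 225 frame pairs. In the sketch below, `sim` stands in for any frame-level similarity function (e.g., cosine over DR features); it is a placeholder, not the authors' engine.

```python
# Sketch of the video-pair score: mean similarity over all frame pairs
# between two videos (15 frames each gives 15*15 = 225 pairs).
# `sim` is a caller-supplied frame-level similarity (assumption).

def video_pair_score(frames_a, frames_b, sim):
    scores = [sim(fa, fb) for fa in frames_a for fb in frames_b]
    return sum(scores) / len(scores)
```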
DeepFace holds the best result on YTF and is better than other methods by a large margin; therefore, we compare only to DeepFace. Directly matching with the Cosine function, the base representation achieves 88.00% accuracy on YTF. Transforming the representation by PCA on YTF, the accuracy improves remarkably to 90.60%. When the representation is further tuned by Joint Bayes, our method outperforms DeepFace slightly.
This work collected a large scale face dataset from the Internet and made it public to the research community. The new dataset does not overlap with LFW and can be used in conjunction with LFW for large scale face recognition research. This combination can standardize the evaluation protocol of LFW and advance reproducible research. On the other side, unified training and testing sets make various methods comparable. This work also described the whole process of dataset construction and face representation learning with an 11-layer convolutional network. Following the pipeline proposed in this paper, anyone can easily train a high performance face recognition engine. Future work will proceed in three directions: 1) augment the dataset using commercial image search engines; 2) develop more effective annotation tools and algorithms; 3) explore novel methods to train a single network to approach the performance of a big deep network ensemble.
This work was supported by the Chinese National Natural Science Foundation Projects #61105023, #61103156, #61105037, #61203267, #61375037, #61473291, National Science and Technology Support Program Project #2013BAK02B01, Chinese Academy of Sciences Project No.KGZD-EW-102-2, and AuthenMetric R&D Funds.
The Tesla K40 used for this research was donated by the NVIDIA Corporation.