Subject Identification Across Large Expression Variations Using 3D Facial Landmarks

by Sk Rahatul Jannat, et al.

Landmark localization is an important first step towards geometry-based vision research, including subject identification. Considering this, we propose to use 3D facial landmarks for the task of subject identification over a range of expressed emotion. Landmarks are detected using a Temporal Deformable Shape Model and used to train a Support Vector Machine (SVM), Random Forest (RF), and Long Short-term Memory (LSTM) neural network for subject identification. As we are interested in subject identification with large variations in expression, we conducted experiments on 3 emotion-based databases, namely the BU-4DFE, BP4D, and BP4D+ 3D/4D face databases. We show that our proposed method outperforms current state-of-the-art methods for subject identification on BU-4DFE and BP4D. To the best of our knowledge, this is the first work to investigate subject identification on the BP4D+, resulting in a baseline for the community.




1 Introduction

Broadly, face recognition can be categorized as holistic, hybrid matching, or feature-based [39]. Holistic approaches look at the global similarity of the face, such as a 3D morphable model (3DMM) [2]; hybrid matching approaches make use of either multiple methods [14] or multiple modalities [15]; feature-based methods look at local features of the face to find similarities [41]. The work proposed in this paper is feature-based. Due to its non-intrusive nature and wide applicability in security- and defense-related fields, face recognition has been actively researched by many groups in recent decades.

From some of the earliest methods for face recognition [32], [38] to more recent works within the past 10 years [5], [36], 2D face recognition has been an actively researched field. With the recent advances in deep neural networks, we have seen significant jumps in performance [12], [18], [23], [25], [28], [34]. Liu et al. [22] proposed the angular softmax, which gives convolutional neural networks (CNNs) the ability to learn angularly discriminative features; it was designed to handle the problem that the maximal intra-class distance of face features can exceed the minimal inter-class distance. Recently, Tuan et al. [31] proposed regressing 3D morphable model shape and texture parameters from a 2D image using a CNN. Using this approach, they were able to obtain a sufficient amount of training data for their network, showing promising results. Zhu et al. [42] proposed a high-fidelity pose and expression normalization method that made use of a 3DMM to generate natural, frontal-facing, neutral face images. Using this method, they achieved promising results in both constrained and unconstrained (i.e. wild) environments. Although performance has been increasing and groups have been actively working on 2D subject identification, challenges such as pose and lighting remain. 3D faces can help to minimize these challenges [26], and in recent years this research has made significant strides [11], [12], [27] due to the development of powerful, high-fidelity 3D sensors.

Echeagaray-Patron et al. [11] proposed a method for 3D face recognition where conformal mapping is used to map the original face surfaces onto a Riemannian manifold. Comparisons are then made from the conformal and isometric invariants that they compute. This method was shown to have invariance to both expression and pose. Li et al. [21] proposed SIFT-like matching using three 3D keypoint descriptors. These descriptors were fused at the feature level to describe local shapes of detected keypoints. Lei et al. [20] proposed the Angular Radial Signature for 3D face recognition. This signature is extracted from the semi-rigid regions of the face, and mid-level features are then extracted from the signature by kernel principal component analysis. These features were used to train a support vector machine, showing promising results when comparing neutral vs. non-neutral faces. Berretti et al. [1] proposed the use of 3D Weighted Walkthroughs with iso-geodesic facial strips for 3D face recognition, achieving promising results on the FRGC v2.0 [24] and SHREC08 [9] 3D facial datasets. Using multistage hybrid alignment algorithms and an annotated face model, Kakadiaris et al. [17] used a deformable model framework to show robustness to facial expressions when performing 3D face recognition.

Motivated by the above works, we propose to use 3D facial landmarks for subject identification across large variations in expression. We track the facial landmarks using a Temporal Deformable Shape Model (TDSM) [7]. See Fig. 1 for an overview of the proposed approach. The rest of the paper is organized as follows. Section 2 gives a brief overview of the TDSM algorithm, Section 3 details our experimental design and results, and we conclude in Section 4.

Figure 1: Overview of the proposed method. The example shows an unseen 3D mesh model of subject ‘F001’ from BP4D+ [40], who is correctly identified based on training an LSTM [13] on 3D facial landmarks detected by a TDSM.

2 Temporal Deformable Shape Model

The Temporal Deformable Shape Model (TDSM) models the shape variation of 3D facial data. Given a sequence of data (i.e. 4D), it also models the implicit constraints on shape that are imposed (e.g. small changes in motion and shape). To construct a TDSM, a training set of 3D facial landmarks is required. First, the 3D facial landmarks are aligned using a modified version of Procrustes analysis [10]. Given a training set of L 3D faces, where each face has N facial landmarks (aligned with Procrustes analysis), a parameterized model S = {s_1, ..., s_L} is constructed, where s_i = (x_i1, y_i1, z_i1, ..., x_iN, y_iN, z_iN) is the vector of landmarks of the i-th 3D face in the training set, with 1 ≤ i ≤ L and s_i ∈ R^{3N}. Principal component analysis (PCA) is then applied to this model to learn the modes of variation, V, of the training data.
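The alignment step can be sketched in a few lines. The paper uses a modified version of Procrustes analysis [10] whose modifications are not detailed here, so the following is only a minimal sketch of ordinary Procrustes alignment (translation, scale, and rotation via the Kabsch algorithm); the function name and the use of NumPy are our own assumptions.

```python
import numpy as np

def procrustes_align(source, target):
    """Align one (N, 3) landmark set to another, removing translation,
    scale, and rotation differences. Returns the aligned copy of
    `source` in the normalized frame of `target`."""
    # Remove translation: center both shapes at the origin.
    src = source - source.mean(axis=0)
    tgt = target - target.mean(axis=0)

    # Remove scale: normalize each shape to unit Frobenius norm.
    src = src / np.linalg.norm(src)
    tgt = tgt / np.linalg.norm(tgt)

    # Remove rotation: optimal alignment via SVD (Kabsch algorithm).
    u, _, vt = np.linalg.svd(src.T @ tgt)
    r = u @ vt
    if np.linalg.det(r) < 0:  # guard against reflections
        u[:, -1] *= -1
        r = u @ vt
    return src @ r
```

Aligning every training face to a common reference shape in this way puts all landmark sets into one coordinate frame before PCA is applied.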

Given the parameterized model, S, and the modes of variation, V, an offline weight vector, w, is constructed that allows new face shapes to be built offline as a linear combination of the modes, s = s̄ + Vw, where s̄ is the average face shape. These constructed face shapes are constrained to lie within the range −c√λ_i ≤ w_i ≤ c√λ_i, where w_i is the i-th weight, λ_i is the i-th eigenvalue from PCA, and c bounds the allowed variation (commonly c = 3). This constraint is imposed to make sure the new face shape is a plausible 3D face.
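The shape model itself is straightforward to sketch under the same assumptions: PCA over the flattened landmark vectors yields the mean shape, modes of variation, and eigenvalues, and new shapes are synthesized as constrained linear combinations. The helper names and the clipping constant c = 3 are illustrative, not taken from the paper.

```python
import numpy as np

def build_shape_model(faces):
    """PCA shape model from aligned training faces.
    faces: (L, 3N) array, one flattened landmark vector per face.
    Returns the mean shape, the modes of variation, and the eigenvalues."""
    mean = faces.mean(axis=0)
    centered = faces - mean
    # SVD of the centered data gives the PCA modes directly.
    _, sing, modes = np.linalg.svd(centered, full_matrices=False)
    eigvals = sing ** 2 / (len(faces) - 1)  # eigenvalues of the covariance
    return mean, modes, eigvals

def synthesize_shape(mean, modes, eigvals, w, c=3.0):
    """New face shape s = mean + V^T w, with every weight clipped to
    the plausible range |w_i| <= c * sqrt(lambda_i) so the result
    stays face-like."""
    bound = c * np.sqrt(eigvals)
    w = np.clip(w, -bound, bound)
    return mean + modes.T @ w
```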

To fit (i.e. detect landmarks on) a new input mesh, an offline table of weight vectors (w) is constructed with a uniform amount of variance. The Procrustes distance, D, is then computed between each resulting face shape (referred to as an instance of the TDSM) and the new input mesh. The instance with the smallest distance is taken as the detected landmarks. Note that this is not meant to be an exhaustive overview of the TDSM; we refer the reader to the original work [7] for more details.
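This nearest-instance fitting can be sketched as follows, assuming centered unit-norm shapes and the closed-form Procrustes residual (2 − 2Σσ, which allows reflections); the instance table here is simply an array of candidate shapes, and all names are our own.

```python
import numpy as np

def fit_tdsm(instances, mesh_landmarks):
    """Return the index of the offline instance closest to the input
    (in Procrustes distance) together with that distance.
    instances: (K, N, 3) table of candidate shapes.
    mesh_landmarks: (N, 3) landmarks measured on the new input mesh."""
    def normalize(x):
        x = x - x.mean(axis=0)          # remove translation
        return x / np.linalg.norm(x)    # remove scale

    probe = normalize(mesh_landmarks)
    best_idx, best_d = -1, np.inf
    for k, inst in enumerate(instances):
        cand = normalize(inst)
        # For unit-norm centered shapes, the squared Procrustes residual
        # after optimal alignment is 2 - 2 * (sum of singular values).
        sing = np.linalg.svd(cand.T @ probe, compute_uv=False)
        d = max(2.0 - 2.0 * sing.sum(), 0.0)
        if d < best_d:
            best_idx, best_d = k, d
    return best_idx, best_d
```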

3 Experimental Design and Results

Using a TDSM, we detected 83 facial landmarks on 3 publicly available 3D emotion-based face databases: BU4DFE [35], BP4D [37], and BP4D+ [40]. From these facial landmarks, we then conducted subject identification experiments, where the landmarks are used as training data for 3 machine learning classifiers. Using these 83 facial landmarks also reduces the dimensionality of the 3D faces from over 30,000 3D vertices to 83 points, while still retaining important features for subject identification. This reduces storage requirements, as well as the processing time of the 3D face, both of which can be limitations of 3D face recognition [3], [16]. An overview of the databases and the experimental design is detailed in the following subsections.

3.1 3D face databases

One of the main goals of this work is to show subject identification across large variations in expression. Considering this, we needed to evaluate large and varied 3D emotion-based face databases. To facilitate this, we chose 3 state-of-the-art 3D emotion-based face databases, and investigated a total of 282 subjects across the 3 datasets.

BU-4DFE [35]: Consists of 101 subjects displaying 6 prototypic facial expressions plus neutral. The dataset has 58 females and 43 males, including a variety of racial ancestries. The age range of the BU-4DFE is 18-45 years of age.

BP4D [37]: Consists of 41 subjects displaying 8 expressions plus neutral. It consists of 23 females and 18 males; 11 Asian, 4 Hispanic, 6 African-American, and 20 Euro-American ethnicities are represented. The age range of the BP4D is 18-29 years of age. This database was developed to explore spatiotemporal features in facial expressions. Due to its large variation in expression, it is a natural fit for our subject identification study.

BP4D+ [40]: Consists of 140 subjects (82 females and 58 males) ages 18-66. This data corpus consists of ethnic and racial ancestries that include African American, Caucasian, and Asian each with highly varied emotions. These emotions are elicited through tasks designed to elicit dynamic emotions in the subjects such as disgust, sadness, pain, and surprise resulting in a challenging dataset. Like the BP4D database, this dataset was also designed to study emotion classification. Its diversity and number of subjects, as well as large variations in expressions, make it a natural fit for our study.

3.2 Experimental design

To conduct our experiments, we detected 83 facial landmarks on the 3D data using a TDSM. Given the 3D facial landmarks, we then translated them so that the centroid of the face is located at the origin in 3D space, aligning the data. The translated 3D facial landmarks were then used for subject identification. Each of the 3D facial landmarks (x, y, z coordinates) is inserted into a new feature vector; for all 83 landmarks, this gives a feature vector of size 249 (83 × 3). This feature vector is used to train classifiers for subject identification. To ensure our results were not classifier specific, we trained a support vector machine (SVM) [33], a random forest (RF) [4], and a long short-term memory (LSTM) neural network [13]. Our network consists of one LSTM layer with a look-back of two faces (estimated landmarks), followed by dropout of 0.5 and a fully connected layer for classification. The softmax activation function was used, along with the RMSprop [30] optimizer with a learning rate of 0.0001.
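As a rough illustration of this pipeline (not the authors' code), the feature construction and the SVM/RF classifiers can be sketched with scikit-learn; the LSTM is omitted here, and hyperparameters such as the SVM kernel and the number of trees are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

def to_feature_vector(landmarks):
    """Translate a (83, 3) landmark array so its centroid sits at the
    origin, then flatten it into the 249-dimensional feature vector."""
    return (landmarks - landmarks.mean(axis=0)).ravel()

def train_identifiers(faces, subject_ids):
    """Train an SVM and a random forest on landmark feature vectors.
    faces: iterable of (83, 3) arrays; subject_ids: one label per face."""
    X = np.stack([to_feature_vector(f) for f in faces])
    svm = SVC(kernel="linear").fit(X, subject_ids)
    rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, subject_ids)
    return svm, rf
```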

For each classifier, each subject’s identity was used as the class (each 3D face is labeled with a subject ID). Accurate results with an SVM, RF, and LSTM show the robustness of the 3D facial landmarks across multiple machine learning classifiers. We conducted one-to-many subject identification, where all subjects were in both the training and testing sets. These sets were split based on time (i.e. different sections of the sequences available in the datasets) so consecutive (i.e., similar) frames did not appear in both sets.
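A time-based split of this kind might look as follows; the 70/30 proportion is an assumption, as the paper does not state the exact split point.

```python
def split_by_time(frames_per_subject, train_frac=0.7):
    """Split every subject's ordered frame sequence at a fixed point in
    time: earlier frames go to training, later frames to testing, so
    consecutive (similar) frames never appear in both sets.
    frames_per_subject: dict mapping subject id -> ordered frame list."""
    train, test = [], []
    for sid, frames in frames_per_subject.items():
        cut = int(len(frames) * train_frac)
        train += [(sid, f) for f in frames[:cut]]
        test += [(sid, f) for f in frames[cut:]]
    return train, test
```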

3.3 Subject identification results

We achieved an average subject identification accuracy of 99.9% with the random forest and support vector machine, and 99.93% with the long short-term memory network, across all databases. As can be seen in Table 1, an SVM, RF, and LSTM can accurately identify subjects from the BU4DFE, BP4D, and BP4D+ datasets, achieving a maximum accuracy of 100% on BU4DFE and a minimum accuracy of 99.8% on BP4D+. All three of the tested classifiers achieved consistent results across all three datasets, showing these results are not classifier dependent. As each of the datasets contains large variations in expression, these results show the detected 3D landmarks are robust to expression changes for the task of subject identification.

Classifier BU4DFE BP4D BP4D+
SVM 99.9% 99.9% 99.9%
RF 100% 99.9% 99.8%
LSTM 100% 99.9% 99.9%
Table 1: Subject identification accuracies for the 3 tested datasets and classifiers.

3.4 Subject identification with occluded faces

Along with subject identification using all 83 landmarks, we also tested on a smaller number of facial landmarks to simulate occluded faces. For these experiments, we split the 3D facial landmarks (i.e. the face) into 4 quadrants (Fig. 2) and detected a smaller number of landmarks (top right: 23; top left: 23; lower right: 20; lower left: 17) using a TDSM. We then ran the same experiments for each quadrant. As shown in Section 3.3, the results are not classifier specific, as the random forest, SVM, and LSTM network have similar results; because of this, we only used a random forest and support vector machine for these experiments.
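One simple way to simulate such a quadrant split is by the sign of each centered landmark's x (left/right) and y (upper/lower) coordinate. Note the paper instead detects quadrant-specific landmark subsets directly with a TDSM, so this sketch is only an approximation; it also illustrates why real quadrant counts (23/23/20/17) need not be equal, since landmarks are not distributed symmetrically about the centroid.

```python
import numpy as np

def quadrant_landmarks(landmarks):
    """Partition an (N, 3) landmark set into four quadrants by the sign
    of each point's x (left/right) and y (upper/lower) coordinate
    relative to the face centroid."""
    centered = landmarks - landmarks.mean(axis=0)
    x, y = centered[:, 0], centered[:, 1]
    return {
        "top_left": landmarks[(x < 0) & (y >= 0)],
        "top_right": landmarks[(x >= 0) & (y >= 0)],
        "lower_left": landmarks[(x < 0) & (y < 0)],
        "lower_right": landmarks[(x >= 0) & (y < 0)],
    }
```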

When testing on simulated occluded faces on BU4DFE, both the random forest and SVM achieved 99.9% accuracy in all four quadrants, showing robustness to occlusion. Testing on BP4D, the random forest achieved an average accuracy of 99.7% across the four quadrants, and SVM achieved an average accuracy of 93.2% across the four quadrants. On BP4D+, random forest and SVM achieved an average accuracy of 99.4% and 97.5%, respectively across the four quadrants. These results detail the expressive power of the detected 3D facial landmarks to reliably identify subjects under extreme conditions (e.g. large variations in expression and occlusion). See Table 2 for individual quadrant accuracies for BP4D and BP4D+ (BU4DFE not shown as all quadrants had same accuracy of 99.9% for both classifiers).

Figure 2: Detected landmarks (BP4D [37]) used for subject ID (original 3D mesh shown only for display purposes). (a) 83 landmarks with texture (note: texture is shown for display purposes only showing robustness to facial hair); (b) 83 landmarks; (c) top left quadrant; (d) top right quadrant; (e) lower left quadrant; and (f) lower right quadrant.
     BP4D                    BP4D+
     TR   TL   LR   LL      TR   TL   LR   LL
RF   99.7 99.7 99.7 99.7    99.3 99.3 99.6 99.5
SVM  95.1 96.8 93.4 87.5    98.8 99.1 97.5 94.8
Table 2: Subject identification accuracies (percentage) for faces with simulated occlusion. Key: TR: Top Right; TL: Top Left; LR: Lower Right; LL: Lower Left.

3.5 Comparisons to state of the art

We compared our proposed method to the current state of the art on BU-4DFE [35] and BP4D [37] (see Table 3 for both). To the best of our knowledge, this is the first study to perform subject identification on BP4D+ [40]; therefore, we had no works to compare against, and our results form a baseline for the community. In these comparisons, it is important to note that Canavan et al. [6] used 1800 and 2400 frames from BU-4DFE and BP4D, respectively, for their experiments, whereas we used all data in both datasets (60,402 and 367,474 frames, respectively). The work of Sun et al. [29] also requires both spatial and temporal information to achieve their result of 98.61%; while our approach can incorporate temporal information (e.g. the LSTM), it can also identify a subject from a single frame of data, which is useful when temporal information is not available.

Method BU4DFE BP4D
Proposed Method (RF) 100% 99.9%
Proposed Method (LSTM) 100% 99.9%
Proposed Method (SVM) 99.9% 99.9%
Sun et al. [29] 98.61% N/A
Fernandes et al. [19] 96.71% N/A
Canavan et al. [6] 92.7% 93.4%
Table 3: State-of-the-art comparisons.

4 Conclusion

We have shown 3D facial landmarks can be used for subject identification across large variations in expression. We validated our approach on three 3D emotion-based face databases (BU4DFE [35], BP4D [37], and BP4D+ [40]), using a random forest, a support vector machine, and a long short-term memory neural network. The proposed method outperforms the current state of the art on 2 publicly available 3D face databases, achieving a maximum identification accuracy of 100% on BU-4DFE and 99.9% on BP4D. To the best of our knowledge, this is the first work to report subject identification results on the BP4D+. We have also shown the detected landmarks can be used for subject identification in the presence of (simulated) facial occlusion. In future work, we will further investigate this robustness to expression and occlusion on other state-of-the-art 3D face emotion datasets such as 4DFab [8], which was designed with both biometrics studies and large variations in expression in mind.

We are also interested in emotion-invariant multimodal subject identification. In this paper, we have shown that 3D landmarks are invariant to large expression changes for the task of subject identification. Since facial expressions are often physiological responses to emotion, emotion-invariant identification can have a broad range of applications such as medicine and healthcare (e.g., identifying individuals despite expressions of pain). Multimodal approaches are generally more accurate due to the fusion of heterogeneous data, each contributing identifying information. Considering this, we hypothesize a multimodal approach will significantly advance research on emotion-invariant subject identification while yielding new insight on the impact of emotion on novel modalities such as smartphone sensor data (e.g., accelerometer and touch measurements) and other unconstrained and transparently acquired data. Such approaches will be valuable for continuous subject identification.


  • [1] S. Berretti, A. Del Bimbo, and P. Pala (2010) 3D face recognition using isogeodesic stripes. IEEE transactions on pattern analysis and machine intelligence 32 (12), pp. 2162–2177. Cited by: §1.
  • [2] V. Blanz and T. Vetter (2003) Face recognition based on fitting a 3D morphable model. IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (9), pp. 1063–1074. Cited by: §1.
  • [3] K. Bowyer, K. Chang, and P. Flynn (2006) A survey of approaches and challenges in 3d and multi-modal 3d+2d face recognition. Computer Vision and Image Understanding 101 (1), pp. 1–15. Cited by: §3.
  • [4] L. Breiman (2001) Random forests. Machine learning 45 (1), pp. 5–32. Cited by: §3.2.
  • [5] S. Canavan, B. Johnson, M. Reale, Y. Zhang, L. Yin, and J. Sullins (2010) Evaluation of multi-frame fusion based face classification under shadow. In International Conference on Pattern Recognition, pp. 1265–1268. Cited by: §1.
  • [6] S. Canavan, P. Liu, X. Zhang, and L. Yin (2015) Landmark localization on 3d/4d range data using a shape index-based statistical shape model with global and local constraints. CVIU 139, pp. 136–148. Cited by: §3.5, Table 3.
  • [7] S. Canavan, X. Zhang, and L. Yin (2013) Fitting and tracking 3d/4d facial data using a temporal deformable shape model. In International Conference on Multimedia and Expo, pp. 1–6. Cited by: §1, §2.
  • [8] S. Cheng et al. (2018) 4dfab: a large scale 4d database for facial expression analysis and biometric applications. In CVPR, pp. 5117–5126. Cited by: §4.
  • [9] M. Daoudi, F. ter Haar, and R. Veltkamp (2008) SHREC 2008-shape retrieval contest of 3d face scans. Cited by: §1.
  • [10] M. de Bruijne, B. van Ginneken, M. Viergever, and W. Niessen (2003) Adapting active shape models for 3d segmentation of tubular structures in medical images. In Biennial International Conference on Information Processing in Medical Imaging, pp. 136–147. Cited by: §2.
  • [11] B. Echeagaray-Patron, V. Kober, V. Karnaukhov, and V. Kuznetsov (2017) A method of face recognition using 3d facial surfaces. Journal of Communications Technology and Electronics 62 (6), pp. 648–652. Cited by: §1, §1.
  • [12] M. Emambakhsh and A. Evans (2016) Nasal patches and curves for expression-robust 3d face recognition. IEEE transactions on pattern analysis and machine intelligence 39 (5), pp. 995–1007. Cited by: §1.
  • [13] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: Figure 1, §3.2.
  • [14] J. Huang, B. Heisele, and V. Blanz (2003) Component-based face recognition with 3d morphable models. International Conference on Audio and Video-based Person Authentication. Cited by: §1.
  • [15] I. Kakadiaris (2005) Multimodal face recognition: combination of geometry with physiological information. In CVPR, Vol. 2, pp. 1022–1029. Cited by: §1.
  • [16] I. Kakadiaris et al. (2007) Three-dimensional face recognition in the presence of facial expressions: an annotated deformable model approach. IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (4), pp. 640–649. Cited by: §3.
  • [17] I. Kakadiaris, G. Passalis, G. Toderici, M. Murtuza, and T. Theoharis (2006) 3D face recognition.. In BMVC, pp. 869–878. Cited by: §1.
  • [18] I. Kemelmacher-Shlizerman et al. (2016) The megaface benchmark: 1 million faces for recognition at scale. In CVPR, pp. 4873–4882. Cited by: §1.
  • [19] S. L. Fernandes and G. J. Bala (2014) 3D and 4d face recognition: a comprehensive review. Recent Patents on Engineering 8 (2), pp. 112–119. Cited by: Table 3.
  • [20] Y. Lei, M. Bennamoun, M. Hayat, and Y. Guo (2014) An efficient 3d face recognition approach using local geometrical signatures. Pattern Recognition 47 (2), pp. 509–524. Cited by: §1.
  • [21] H. Li et al. (2015) Towards 3d face recognition in the real: a registration-free approach using fine-grained matching of 3d keypoint descriptors. IJCV 113 (2), pp. 128–142. Cited by: §1.
  • [22] W. Liu et al. (2017) Sphereface: deep hypersphere embedding for face recognition. In CVPR, pp. 212–220. Cited by: §1.
  • [23] O. Parkhi, A. Vedaldi, A. Zisserman, et al. (2015) Deep face recognition.. In BMVC, Vol. 1, pp. 6. Cited by: §1.
  • [24] P. Phillips, P. Flynn, T. Scruggs, K. Bowyer, J. Chang, K. Hoffman, J. Marques, J. Min, and W. Worek (2005) Overview of the face recognition grand challenge. In computer vision and pattern recognition, Vol. 1, pp. 947–954. Cited by: §1.
  • [25] J. Saragih, S. Lucey, and J. Cohn (2011) Deformable model fitting by regularized landmark mean-shift. International Journal of Computer Vision 91 (2), pp. 200–215. Cited by: §1.
  • [26] S. Singh and S. Prasad (2018) Techniques and challenges of face recognition: a critical review. Procedia computer science 143, pp. 536–543. Cited by: §1.
  • [27] S. Soltanpour, B. Boufama, and Q. J. Wu (2017) A survey of local feature methods for 3d face recognition. Pattern Recognition 72, pp. 391–406. Cited by: §1.
  • [28] Y. Sun, Y. Chen, X. Wang, and X. Tang (2014) Deep learning face representation by joint identification-verification. In Advances in neural information processing systems, pp. 1988–1996. Cited by: §1.
  • [29] Y. Sun and L. Yin (2008) 3D spatio-temporal face recognition using dynamic range model sequences. In Computer Vision and Pattern Recognition Workshops, pp. 1–7. Cited by: §3.5, Table 3.
  • [30] T. Tieleman and G. Hinton (2012) Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning 4 (2), pp. 26–31. Cited by: §3.2.
  • [31] A. Tuan Tran, T. Hassner, I. Masi, and G. Medioni (2017) Regressing robust and discriminative 3d morphable models with a very deep neural network. In Computer Vision and Pattern Recognition, pp. 5163–5172. Cited by: §1.
  • [32] M. Turk and A. Pentland (1991) Face recognition using eigenfaces. In CVPR, pp. 586–591. Cited by: §1.
  • [33] V. Vapnik (1998) The support vector method of function estimation. In Nonlinear Modeling, pp. 55–85. Cited by: §3.2.
  • [34] Y. Wen et al. (2016) A discriminative feature learning approach for deep face recognition. In ECCV, pp. 499–515. Cited by: §1.
  • [35] L. Yin et al. (2008) A high-resolution 3d dynamic facial expression database. In FG. Cited by: §3.1, §3.5, §3, §4.
  • [36] L. Zhang et al. (2011) Sparse representation or collaborative representation: which helps face recognition?. In ICCV, pp. 471–478. Cited by: §1.
  • [37] X. Zhang et al. (2014) Bp4d-spontaneous: a high-resolution spontaneous 3d dynamic facial expression database. Image and Vision Computing 32 (10), pp. 692–706. Cited by: Figure 2, §3.1, §3.5, §3, §4.
  • [38] W. Zhao et al. (1998) Discriminant analysis of principal components for face recognition. In Face Recognition, pp. 73–85. Cited by: §1.
  • [39] W. Zhao et al. (2003) Face recognition: a literature survey. ACM Computing Surveys 35 (4), pp. 399–458. Cited by: §1.
  • [40] Z. Zheng et al. (2016) Multimodal spontaneous emotion corpus for human behavior analysis. In Computer Vision and Pattern Recognition, pp. 3438–3446. Cited by: Figure 1, §3.1, §3.5, §3, §4.
  • [41] C. Zhong et al. (2007) Robust 3d face recognition using learned visual codebook. In CVPR, pp. 1–6. Cited by: §1.
  • [42] X. Zhu et al. (2015) High-fidelity pose and expression normalization for face recognition in the wild. In CVPR, pp. 787–796. Cited by: §1.