1 Introduction
Recognizing emotion is considered one of the most important parts of human intelligence [25], and it has applications in fields as varied as entertainment, transportation, medicine and health, and psychology. As a result, a great deal of research into human emotion recognition has been conducted over the past decades, and many important advances have been made, due in part to the availability of large, varied, and challenging datasets [6, 13, 20, 23, 27, 28, 30, 33].
There is a large and varied body of work on facial expression recognition. Using a spatio-temporal Hidden Markov Model (HMM), intra- and inter-frame information can be exploited for this task [29]. It has been shown that a random forest [2], combined with a Deformation Vector Field [8] that captures the local deformations of the face over time, can accurately classify expressions. Facial expressions have also been successfully classified using a Support Vector Machine (SVM) with a radial basis function (RBF) kernel applied to geometric coordinates as well as their normals [15]. Lucey et al. [22] analyzed videos of patients with shoulder injuries to automatically recognize pain. In that work, an Active Appearance Model [4] was used to detect facial Action Units to distinguish pain in facial expressions, reporting an area under the ROC curve of 84.7 on the UNBC-McMaster Shoulder Pain Database [26]. This study is encouraging, as it suggests Action Units can be used to recognize emotional states such as pain.
Deep learning has also shown recent success in expression recognition. Using a Boosted Deep Belief Network, Liu et al. [21] trained feature learning, feature selection, and classifier construction iteratively in a unified loopy framework, which increased classification accuracy. De-expression Residue Learning [31] was proposed to generate the corresponding neutral expression for an arbitrary facial expression image. Yang et al. [32] proposed regenerating expressions from input facial images; using a conditional GAN [24], they developed an identity-adaptive feature space that can handle variations across subjects.
Facial expression recognition is a popular approach to recognizing emotion; however, there is also a varied body of work that makes use of multimodal data. Soleymani et al. [27] incorporated electroencephalogram, pupillary response, and gaze distance information from 20 videos, using this data to train an SVM to classify arousal and valence for 24 participants. Kessous et al. [18] also showed an increase of more than 10% when using a multimodal approach, fusing facial expressions with speech data spanning multiple languages (Greek, French, German, and Italian) in a Bayesian classifier.
While these works, and others, have had success detecting expressions and emotion with multimodal data, little work has been done on analyzing their impact on recognition. Motivated by this, we present an analysis of multimodal data and the impact each modality has on emotion recognition. Our contributions can be summarized as follows:
- A detailed analysis of physiological data, 3D landmarks, and facial action units (AUs) [10], both independently and combined at the feature level (unimodal vs. multimodal), for emotion recognition.
- Insight into the impact of physiological data, 3D landmarks, and AUs for positively influencing emotion recognition studies.
- To the best of our knowledge, this is the first work to conduct this type of analysis on the BP4D+ multimodal dataset [35], resulting in a baseline for future analyses.
2 Data Selection and Feature Extraction
We propose to use 3D facial data (landmarks), action units, and physiological data in our analysis. We chose these three modalities based on their complementary nature. First, given movement and changes in the shape of the face (3D landmarks), we can also expect changes in the occurrence of action units [9]. We also chose physiological data as a complementary modality because facial expressions can be faked; it has been observed that people smile during negative emotional experiences [11]. Considering this, physiological data can complement the other two modalities for recognizing emotion.
To conduct the proposed analysis, a suitably large corpus of emotion data is needed that contains 3D facial data, action units, and physiological data. For our experiments we have chosen the BP4D+ multimodal spontaneous emotion corpus [35]. In total, there are over 1.5 million frames of multimodal data available in BP4D+. For this study we use 192,452 frames of multimodal data from all 140 subjects. This subset contains the 4 target emotions: happiness, embarrassment, fear, and pain. We use this subset because it is the largest set of frames in BP4D+ that is annotated with action units.
2.1 3D facial data
For our study we used 83 3D facial landmarks (the same as provided with BP4D+) to represent the face. Each landmark was detected using a shape index-based statistical shape model (SI-SSM) [3], which creates shape index-based patches from global and local features of the face. These global and local features are concatenated into one model, which is then used with a cross-correlation matching technique to match the training data to an input mesh model. Examples of detected 3D facial landmarks can be seen in Fig. 1. For our 3D facial data feature vector, we directly use the (x, y, z) coordinates of the 3D tracked facial landmarks, as they can accurately represent the induced expression seen in the entire 3D model [12], which contains approximately 30k-50k vertices; our reduced feature vector contains 249 features (83 landmarks × 3 coordinates). Using this reduced feature space (relative to the entire 3D mesh) allows for lower-dimensional data without sacrificing recognition accuracy.
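As an illustrative sketch (the function name and array layout below are our own assumptions, not part of BP4D+ or the tracking pipeline), the 3D feature vector amounts to flattening the 83 tracked landmarks into a single 249-dimensional row:

```python
import numpy as np

def landmarks_to_features(landmarks):
    """Flatten 83 tracked 3D facial landmarks into a 249-dim feature vector.

    landmarks : array-like of shape (83, 3) holding the (x, y, z)
                coordinates of the tracked landmarks for one frame.
    """
    pts = np.asarray(landmarks, dtype=np.float64)
    assert pts.shape == (83, 3), "expected 83 landmarks with (x, y, z) coordinates"
    return pts.reshape(-1)  # shape (249,)
```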

2.2 Action units
For each of the 4 emotions that have action units coded, a total of 35 action units (AUs) were annotated by five expert FACS coders. For each task, for all 140 subjects, approximately 20 seconds of the most expressive part of the sequence was annotated, giving the 192,452 frames of multimodal data used in our study. Our AU feature vector encodes the occurrence of all annotated AUs for each frame, where 1 indicates the AU is present and 0 indicates it is not. Some frames in BP4D+ list the AU occurrence as 9, which denotes unknown; for our experiments, 9 is treated as 0 (i.e., not present).
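A minimal sketch of this encoding is shown below; the helper name `au_occurrence_vector` and the dictionary input are illustrative assumptions, but the 1/0/9 coding follows the BP4D+ convention described above:

```python
import numpy as np

def au_occurrence_vector(frame_codes, au_ids):
    """Binary AU-occurrence features for a single frame.

    frame_codes : dict mapping AU id -> coded occurrence
                  (1 = present, 0 = absent, 9 = unknown).
    au_ids      : ordered list of the 35 annotated AU ids.
    """
    vec = np.zeros(len(au_ids), dtype=np.float32)
    for i, au in enumerate(au_ids):
        # unknown (9) is treated the same as absent (0)
        vec[i] = 1.0 if frame_codes.get(au, 0) == 1 else 0.0
    return vec
```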
2.3 Physiological data
For each subject and task, BP4D+ contains 8 separate measurements of physiological data derived from blood pressure (BP), heart rate (HR), respiration (RESP), and skin conductivity (EDA). All physiological data was sampled at 1000 Hz, which required us to synchronize it with the available 3D facial data and corresponding action units so that each frame has an accurate reading. To synchronize, we first divide the total number of frames of physiological data by the total number of frames of 3D facial data for that task (the average sync value), and then average the physiological signal over each block of that size to obtain one value per frame. For example, given a task with 1000 frames of 3D facial data and 40,000 frames of diastolic BP, we would have 40,000 / 1,000 = 40, so we take the average diastolic BP over every 40 physiological samples. Calculating this average over all 40,000 samples results in 1000 frames of diastolic BP matching the 1000 frames of corresponding 3D facial data. In this same task, there are 400 frames that include both 3D facial landmarks and AUs (frames labeled with task, subject, and frame number). We then use the corresponding frame number to extract that exact index from the calculated diastolic BP averages. This gives the resulting 400 frames of synchronized 3D facial data, physiological data, and action units. For our physiological feature vector, we take the average value of each frame over all eight data types (i.e., we fuse the signals).
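The synchronization step can be sketched as follows; the function below is our own illustration of the block-averaging described above (the names and the trimming of any trailing remainder are assumptions):

```python
import numpy as np

def synchronize_signal(phys_samples, num_facial_frames):
    """Downsample a 1000 Hz physiological signal to one value per 3D frame.

    phys_samples      : 1-D array of raw samples for one task (e.g. diastolic BP).
    num_facial_frames : number of 3D facial frames in the same task.
    """
    samples = np.asarray(phys_samples, dtype=np.float64)
    sync = len(samples) // num_facial_frames          # e.g. 40000 // 1000 = 40
    trimmed = samples[:sync * num_facial_frames]      # drop any trailing remainder
    return trimmed.reshape(num_facial_frames, sync).mean(axis=1)

# Frames that also carry AU annotations are then looked up by frame index, e.g.:
# diastolic_per_frame = synchronize_signal(diastolic_bp, 1000)
# value_for_frame_237 = diastolic_per_frame[237]
```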
3 Experimental Design and Results
3.1 Feature analysis
A main contribution of this work is analyzing which modality and features are most important for our 4 target emotions. To do this, we used principal component analysis (PCA) for feature selection, keeping 95% of the original variance. We applied this to each of our unimodal feature vectors, both over all training data and for each individual emotion, to analyze which features are important for emotion recognition in general and for each targeted emotion. This results in 15 rankings in total (3 feature vectors for each of happy, embarrassment, pain, fear, and all emotions). The features were then ranked by eigenvalue, from highest to lowest.
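The ranking step could be reproduced along the following lines. This is only one plausible reading of the procedure (scoring each original feature by its absolute loadings weighted by the component eigenvalues), and the scikit-learn usage is our assumption rather than the authors' code:

```python
import numpy as np
from sklearn.decomposition import PCA

def rank_features_by_pca(X, variance=0.95):
    """Rank the original features of one modality via PCA.

    X : array of shape (n_frames, n_features) for one unimodal feature vector.
    Returns feature indices ordered from most to least important.
    """
    pca = PCA(n_components=variance)   # keep enough components for 95% variance
    pca.fit(X)
    # components_: (n_components, n_features); explained_variance_: eigenvalues
    scores = np.abs(pca.components_).T @ pca.explained_variance_
    return np.argsort(scores)[::-1]
```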

Action Units. The top-ranked action units involved the lips, cheeks, nose, and eye/eyebrow regions. The ranked AUs were similar across each of the target emotions as well as across all combined emotions; the difference is that their rankings change from emotion to emotion (e.g., AU12 was ranked first for happy, while it was ranked second for embarrassed). Table 1, second column, shows the top 5 ranked AUs. As can be seen there, the top AUs for 'Happy' are 12, 6, 10, 11, and 7. In the Emotion Facial Action Coding System (EMFACS) [16], which only considers emotion-related facial actions, 'Happy' is coded as 6+12, which shows a correlation between the PCA rankings and the action units associated with the emotion. We also calculated the normalized AU distribution across each target emotion. This showed that while each emotion had similar occurring action units, they varied in distribution, which contributes complementary information to the other modalities. This can explain the increase in accuracy when a multimodal approach is used (Table 3).
Physiological Data. Most of the top-ranked physiological features were variations on blood pressure (e.g., diastolic and systolic). Pulse rate was also a top feature for each of the target emotions; however, when all emotions were included in the training data, pulse rate was replaced by EDA, suggesting that skin conductivity is important for recognizing multiple emotions. It is interesting to note that for each of the 4 target emotions, the top-ranked features were not only the same, they were also ranked in the same order. Although each emotion shared the same ranked physiological features, the underlying data varied considerably between emotions; this variance allows for a high level of recognition accuracy (Table 2). Table 1, third column, shows the top 5 ranked physiological signals.
3D Facial Data. When analyzing the 3D facial data, each of the target emotions shows variance in the facial regions that contributed the top-ranked features. For example, happy targeted the right eye and eyebrow, while pain spread across the right eyebrow, nose, and left eyebrow. These regions are also consistent with the AUs ranked as top features (e.g., mouth, cheeks, and eyes/eyebrows). See Table 1 for the top 5 ranked 3D facial landmarks (face region for each) and Fig. 2 for examples of these landmarks on corresponding 3D mesh models. As Fig. 2 shows, emotional variance is conveyed in different 3D regions of the face for each of the target emotions.
Table 1. Top ranked action units, physiological signals, and 3D facial landmark regions (number of top-5 landmarks per region in parentheses) for each target emotion.

| Emotion | Action Units | Physiological | 3D Facial Landmarks |
|---|---|---|---|
| Happy | Lip corner puller (12), Cheek raiser (6), Upper lip raiser (10), Nasolabial deepener (11), Lid tightener (7) | Mean BP, Diastolic BP, Raw BP, Pulse rate | Right eye (2), Right eyebrow (3) |
| Embarrassed | Cheek raiser (6), Lip corner puller (12), Upper lip raiser (10), Lid tightener (7), Nasolabial deepener (11) | Mean BP, Diastolic BP, Systolic BP, Raw BP, Pulse rate | Left face contour (2), Left eyebrow (3) |
| Pain | Lip corner puller (12), Cheek raiser (6), Upper lip raiser (10), Nasolabial deepener (11), Lid tightener (7) | Mean BP, Diastolic BP, Systolic BP, Raw BP, Pulse rate | Right eyebrow (2), Nose (2), Left eyebrow (1) |
| Fear | Upper lip raiser (10), Cheek raiser (6), Lid tightener (7), Lip corner puller (12), Nasolabial deepener (11) | Mean BP, Diastolic BP, Systolic BP, Raw BP, Pulse rate | Right eyebrow (5) |
| All | Lip corner puller (12), Upper lip raiser (10), Cheek raiser (6), Lid tightener (7), Nasolabial deepener (11) | Mean BP, Diastolic BP, Systolic BP, Raw BP, EDA | Left eyebrow (5) |
3.2 Emotion recognition
To conduct our emotion recognition experiments, we created a feature vector for each unimodal and multimodal configuration (Tables 2 and 3). We then used each of these feature vectors to train a random forest [2] to recognize the four target emotions. Random forests have been used successfully in a wide variety of classification tasks, such as classifying ecological data [7], real-time hand gesture recognition [36], and head pose estimation [14], which makes them a natural fit for our analysis.

Table 2. Unimodal emotion recognition results.

| | 3D Facial Data | Action Units | Physiological |
|---|---|---|---|
| Accuracy | 99.29% | 61.94% | 99.94% |
| Recall | 98.80% | 60.35% | 99.95% |
| Precision | 99.33% | 61.00% | 99.95% |
Table 3. Multimodal (feature-level fusion) emotion recognition results.

| | 3D Facial Data + Action Units + Physiological | Action Units + Physiological | 3D Facial Data + Physiological | 3D Facial Data + Action Units |
|---|---|---|---|---|
| Accuracy | 99.53% | 99.95% | 99.76% | 99.83% |
| Recall | 99.58% | 99.95% | 99.75% | 99.83% |
| Precision | 99.52% | 99.95% | 99.75% | 99.85% |
Table 4. Confusion matrix for action units.

| | Happy | Embarrassment | Fear | Pain |
|---|---|---|---|---|
| Happy | 32511 | 7730 | 3373 | 7917 |
| Embarrassment | 17561 | 26038 | 3238 | 5282 |
| Fear | 8773 | 5206 | 14652 | 8163 |
| Pain | 1983 | 2334 | 1685 | 46006 |
Unimodal vs. Multimodal Emotion Recognition. We used 10-fold cross validation for each of our experiments. The results for unimodal and multimodal emotion recognition can be seen in Tables 2 and 3, respectively. When physiological data was used, recognition accuracy was highest for both unimodal and multimodal approaches, achieving an accuracy of 99.94% for the 4 target emotions with a unimodal approach. This result is intuitive, as physiological signals are closely tied to human emotion [19, 20]. When AUs were combined with physiological data, we achieved our highest recognition accuracy of 99.95%. This agrees with the literature finding that the fusion of multimodal data, including action units, can provide complementary information and increase recognition accuracy [5]. Although emotion recognition from AUs shows promising results, especially when fused with other modalities, AUs exhibit the lowest classification rate of the unimodal feature vectors, with a recognition accuracy of 61.94%. The confusion matrices for AUs, physiological data, and AUs combined with physiological data are shown in Tables 4, 5, and 6, respectively.
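A sketch of one such evaluation run is given below. The hyperparameters (e.g., number of trees) and the scikit-learn helpers are our own assumptions; the paper specifies only a random forest, feature-level fusion by concatenation, and 10-fold cross validation:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def evaluate_configuration(feature_blocks, labels, n_splits=10, seed=0):
    """Evaluate one unimodal or multimodal configuration with 10-fold CV.

    feature_blocks : list of arrays, each (n_frames, n_features_modality).
                     One block is a unimodal run; several blocks are fused
                     at the feature level by concatenation.
    labels         : (n_frames,) array of the four target emotions.
    """
    X = np.hstack(feature_blocks)                       # feature-level fusion
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    scores = cross_val_score(clf, X, labels, cv=n_splits, scoring="accuracy")
    return scores.mean()

# e.g. accuracy = evaluate_configuration([au_features, phys_features], emotion_labels)
```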
Fusing multimodal data has been found to increase emotion recognition accuracy, including pain recognition in infants [34]. Our results show a similar effect for pain, with accuracy increasing from 99.92% with physiological data alone to 99.98% when AUs were fused with physiological data. It is interesting to note that while the overall recognition accuracy was higher when AUs were combined with physiological data, the recognition rates for both happy and fear decreased, to 99.94% and 99.90% respectively. This can be attributed to some redundant action unit patterns between happy and fear.
Table 5. Confusion matrix for physiological data.

| | Happy | Embarrassment | Fear | Pain |
|---|---|---|---|---|
| Happy | 51512 | 10 | 5 | 4 |
| Embarrassment | 21 | 52080 | 4 | 14 |
| Fear | 4 | 7 | 36780 | 3 |
| Pain | 22 | 13 | 6 | 51967 |
Table 6. Confusion matrix for action units combined with physiological data.

| | Happy | Embarrassment | Fear | Pain |
|---|---|---|---|---|
| Happy | 51504 | 10 | 5 | 4 |
| Embarrassment | 10 | 52100 | 3 | 6 |
| Fear | 14 | 16 | 36758 | 6 |
| Pain | 3 | 9 | 1 | 51995 |
4 Discussion
We have presented an analysis of the impact of 3D facial landmarks, action units, and physiological data on emotion recognition. We conducted experiments in both a unimodal and a multimodal capacity on four target emotions. Our analysis has shown that 3D facial data captures variations across facial regions that allow for accurate emotion recognition. We have also shown that physiological data can be used for emotion recognition due to its changes across emotions. The occurrence of action units shows differences in distribution over the 35 AUs across the four target emotions, which provides complementary information when fusing the AUs with other modalities at the feature level. Although the fusion of AUs is shown to increase accuracy across the four tested emotions, the results also show that directly using AU occurrences for emotion recognition, without fusing other modalities, is still a challenging problem. These results suggest more research is needed to determine the positive impact of using action units in a unimodal approach for emotion recognition.
While these results are encouraging, there are some limitations to the study. First, more multimodal databases need to be investigated, as our study only made use of BP4D+. Second, more detail is needed as to why fusing AU occurrences increased accuracy while using them in a unimodal capacity yielded relatively low accuracy. Lastly, our current study focused on only four emotions due to the limited number of available action unit annotations; a much larger range of emotions is needed to fully test the efficacy of the proposed approach. Considering this, in future work we will detect action units [1] across a larger set of data, as well as use deep neural networks and other fusion methods, including score-level fusion and the fusion of deep and hand-crafted features [17]. We will also test on a larger set of multimodal datasets and investigate the impact of both AU occurrences and intensities for emotion recognition. These experiments will be conducted across a larger set of emotions that include, but are not limited to, surprise, sadness, anger, and disgust. Along with these emotions, we will also investigate subject self-reporting of emotion (i.e., perceived emotion).

Acknowledgment
This material is based on work that was supported in part by an Amazon Machine Learning Research Award.
References
- [1] (2016) OpenFace: an open source facial behavior analysis toolkit. In WACV, pp. 1–10.
- [2] (2001) Random forests. Machine Learning 45(1), pp. 5–32.
- [3] (2015) Landmark localization on 3D/4D range data using a shape index-based statistical shape model with global and local constraints. CVIU 139, pp. 136–148.
- [4] (2001) Active appearance models. IEEE Transactions on PAMI (6), pp. 681–685.
- [5] (2016) Survey on RGB, 3D, thermal, and multimodal approaches for facial expression recognition: history, trends, and affect-related applications. IEEE Transactions on PAMI 38(8), pp. 1548–1568.
- [6] (2011) A FACS valid 3D dynamic action unit database with applications to 3D dynamic morphable facial modeling. In ICCV.
- [7] (2007) Random forests for classification in ecology. Ecology 88(11), pp. 2783–2792.
- [8] (2012) 3D dynamic expression recognition based on a novel deformation vector field and random forest. In ICPR, pp. 1104–1107.
- [9] (1978) The Facial Action Coding System: a technique for the measurement of facial movement. Consulting Psychologists Press.
- [10] (1997) What the face reveals: basic and applied studies of spontaneous expression using the Facial Action Coding System (FACS). Oxford University Press.
- [11] (1989) The argument and evidence about universals in facial expressions. Handbook of Social Psychophysiology, pp. 143–164.
- [12] (2018) Spontaneous and non-spontaneous 3D facial expression recognition using a statistical model with global and local constraints. In ICIP.
- [13] (2010) A 3-D audio-visual corpus of affective communication. IEEE Transactions on Multimedia 12(6), pp. 591–598.
- [14] (2011) Real time head pose estimation with random regression forests. In CVPR, pp. 617–624.
- [15] (2012) 3D/4D facial expression analysis: an advanced annotated face model approach. IVC 30(10), pp. 738–749.
- [16] EMFACS-7: Emotional Facial Action Coding System. University of California at San Francisco 2(36), pp. 1.
- [17] (2019) Fusion of hand-crafted and deep features for empathy prediction. In FG Workshops.
- [18] (2010) Multimodal emotion recognition in speech-based interaction using facial expression, body gesture and acoustic analysis. Journal on Multimodal User Interfaces 3(1-2), pp. 33–48.
- [19] (2011) Physiological signals and their use in augmenting emotion recognition for human–machine interaction. In Emotion-Oriented Systems, pp. 133–159.
- [20] (2011) DEAP: a database for emotion analysis using physiological signals. IEEE Transactions on Affective Computing 3(1), pp. 18–31.
- [21] (2014) Facial expression recognition via a boosted deep belief network. In CVPR, pp. 1805–1812.
- [22] (2010) The extended Cohn-Kanade dataset (CK+): a complete dataset for action unit and emotion-specified expression. In CVPRW, pp. 94–101.
- [23] (2011) The SEMAINE database: annotated multimodal records of emotionally colored conversations between a person and a limited agent. IEEE Transactions on Affective Computing 3(1), pp. 5–17.
- [24] (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.
- [25] (2001) Toward machine emotional intelligence: analysis of affective physiological state. IEEE Transactions on PAMI (10), pp. 1175–1191.
- [26] (2008) The structure, reliability and validity of pain expression: evidence from patients with shoulder pain. Pain 139(2), pp. 267–274.
- [27] (2011) Multimodal emotion recognition in response to videos. IEEE Transactions on Affective Computing 3(2), pp. 211–223.
- [28] (2012) Exploring the effect of illumination on automatic expression recognition using the ICT-3DRFE database. IVC 30(10), pp. 728–737.
- [29] (2010) Tracking vertex flow and model adaptation for 3D spatio-temporal face analysis. IEEE Transactions on SMC-A 40(3), pp. 461–474.
- [30] (2010) A natural visible and infrared facial expression database for expression recognition and emotion inference. IEEE Transactions on Multimedia 12(7), pp. 682–691.
- [31] (2018) Facial expression recognition by de-expression residue learning. In CVPR.
- [32] (2018) Identity-adaptive facial expression recognition through expression regeneration using conditional generative adversarial networks. In FG.
- [33] (2008) A high-resolution 3D dynamic facial expression database. In FG, Vol. 126.
- [34] (2016) Machine-based multimodal pain assessment tool for infants: a review. arXiv preprint arXiv:1607.00331.
- [35] (2016) Multimodal spontaneous emotion corpus for human behavior analysis. In CVPR, pp. 3438–3446.
- [36] (2012) Real-time hand gesture detection and recognition by random forest. In Communications and Information Processing, pp. 747–755.