Impact of multiple modalities on emotion recognition: investigation into 3d facial landmarks, action units, and physiological data

by Diego Fabiano, et al.

To fully understand the complexities of human emotion, the integration of multiple physical features from different modalities can be advantageous. Considering this, we present an analysis of 3D facial data, action units, and physiological data as they relate to emotion recognition. We analyze each modality independently, as well as fused together, for recognizing human emotion. This analysis includes which features are most important for specific emotions (e.g., happy). Our analysis indicates that both 3D facial landmarks and physiological data are encouraging for expression/emotion recognition. On the other hand, while action units can positively impact emotion recognition when fused with other modalities, the results suggest it is difficult to detect emotion using them in a unimodal fashion.




1 Introduction

Recognizing emotion is considered one of the most important parts of human intelligence [25] and it has applications in fields as varied as entertainment, transportation, medicine and health, and psychology. Due to this, there has been a great deal of research into human emotion recognition in the past decades, where many important advances have been made. This is due in part to the new availability of large, varied, and challenging datasets [6, 13, 20, 23, 27, 28, 30, 33].

There is a large and varied body of work on facial expression recognition. Using a spatio-temporal Hidden Markov Model (HMM), intra- and inter-frame information can be exploited for this task [29]. It has been shown that a random forest, along with a Deformation Vector Field to obtain the local deformations of the face over time, can accurately classify expressions [8]. Facial expressions have also been successfully classified using a Support Vector Machine (SVM) with a radial basis function (RBF) kernel on geometric coordinates, as well as the normals of the coordinates [15]. Lucey et al. [22] analyzed videos of patients with shoulder injuries to automatically recognize pain. In this work, an Active Appearance Model [4] was used to detect facial action units to distinguish pain from facial expressions. They report an area under the ROC curve of 84.7 on the UNBC-McMaster Shoulder Pain Database [26]. This study is encouraging, as it suggests action units can be used to recognize emotional states such as pain.

Deep learning has shown recent success in expression recognition. Using a Boosted Deep Belief Network, Liu et al. [21] trained feature learning, feature selection, and classifier construction iteratively in a unified loopy framework, which increased classification accuracy. De-expression Residue Learning [31] was also proposed, which generates a corresponding neutral expression from an arbitrary facial expression image. Yang et al. [32] proposed regenerating expressions from input facial images; using a conditional GAN [24], they developed an identity-adaptive feature space that can handle variations across subjects.

Facial expression recognition is a popular approach to recognizing emotion; however, there is also a varied body of work that makes use of multimodal data for emotion recognition. Soleymani et al. [27] incorporated electroencephalogram, pupillary response, and gaze distance information from 20 videos. They used this data to train an SVM to classify arousal and valence for 24 participants. Kessous et al. [18] also showed an increase of more than 10% when using a multimodal approach. They used a Bayesian classifier and fused facial expression with speech data spanning multiple languages, including Greek, French, German, and Italian.

While these works, and others, have had success detecting expressions and emotion with multimodal data, little work has been done on analyzing their impact on recognition. Motivated by this, we present an analysis of multimodal data and the impact each modality has on emotion recognition. Our contributions can be summarized as follows:

  1. A detailed analysis of physiological data, 3D landmarks, and facial action units (AUs) [10], both independently and combined at the feature level (unimodal vs. multimodal), for emotion recognition.

  2. Insight into the impact of physiological data, 3D landmarks, and AUs for positively influencing emotion recognition studies.

  3. To the best of our knowledge, this is the first work to conduct this type of analysis on the BP4D+ multimodal dataset [35], resulting in a baseline for future analyses.

2 Data Selection and Feature Extraction

We propose to use 3D facial data (landmarks), action units, and physiological data in our analysis. We chose these 3 modalities based on their complementary nature. First, given movement and changes in the shape of the face (3D landmarks), we can assume there will also be a change in the occurrence of action units [9]. We have also chosen physiological data as a complementary modality, since facial expressions can be faked; it has been observed that people smile during negative emotional experiences [11]. Considering this, physiological data can complement the other two modalities for recognizing emotion.

To conduct the proposed analysis, a suitably large corpus of emotion data is needed that contains 3D facial data, action units, and physiological data. For our experiments we have chosen the BP4D+ multimodal spontaneous emotion corpus [35]. In total, over 1.5 million frames of multimodal data are available in the BP4D+. For this study we use 192,452 frames of multimodal data from all 140 subjects. This subset contains 4 target emotions: happiness, embarrassment, fear, and pain. We use this subset because it is the largest set of frames in BP4D+ that is encoded with action units.

2.1 3D facial data

For our study we used 83 3D facial landmarks (the same as in BP4D+) to represent the face. Each landmark was detected using a shape index-based statistical shape model (SI-SSM) [3], which creates shape index-based patches from global and local features of the face. These global and local features are concatenated into one model, which is then used with a cross-correlation matching technique to match the training data to an input mesh model. Examples of detected 3D facial landmarks can be seen in Fig. 1. For our 3D facial data feature vector, we directly use the (x, y, z) coordinates of the 3D tracked facial landmarks, as they accurately represent the induced expression seen in the entire 3D model [12], which contains approximately 30k-50k vertices; our reduced feature vector contains 249 features (83 landmarks × 3 coordinates). Using this reduced feature space (relative to the entire 3D mesh) gives lower-dimensional data without sacrificing recognition accuracy.
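The 3D feature vector described above is simply the flattened landmark coordinates. A minimal sketch (the landmark array here is random, standing in for SI-SSM detections):

```python
import numpy as np

# Hypothetical input: 83 detected 3D facial landmarks for one frame,
# shaped (83, 3) with columns (x, y, z). Real values come from the
# SI-SSM detector [3]; random numbers stand in for them here.
rng = np.random.default_rng(0)
landmarks = rng.normal(size=(83, 3))

# Per-frame feature vector: 83 landmarks x 3 coordinates = 249 features.
features = landmarks.reshape(-1)
assert features.shape == (249,)
```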

Figure 1: 3D facial landmarks on corresponding 3D mesh model for our targeted emotions of happiness, embarrassment, pain, and fear from the BP4D+ [35].

2.2 Action units

For each of the 4 emotions that have action units coded, a total of 35 action units (AUs) were coded by five expert FACS coders. For each task, for all 140 subjects, approximately 20 seconds of the most expressive part of the sequence was annotated, giving the 192,452 frames of multimodal data used in our study. For our AU feature vector, we include the occurrence of all annotated AUs for each frame, where 1 corresponds to the AU being present and 0 to the AU being absent in the current frame. In some instances in the BP4D+, the AU occurrence is listed as 9, which denotes unknown; for our experiments, 9 is treated as 0 (i.e., not present).
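The per-frame AU vector and the 9-to-0 mapping can be sketched as follows (the raw codes here are made up for illustration):

```python
import numpy as np

# Hypothetical per-frame codes for the 35 coded AUs:
# 1 = present, 0 = absent, 9 = unknown (treated as absent, per Sec. 2.2).
raw_codes = np.array([1, 0, 9, 1, 0] + [0] * 30)

# Map the "unknown" code 9 to 0 so the feature vector is strictly binary.
au_features = np.where(raw_codes == 9, 0, raw_codes)
print(au_features[:5])  # [1 0 0 1 0]
```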

2.3 Physiological data

For each subject and task, the BP4D+ contains 8 separate measurements of physiological data derived from blood pressure (BP), heart rate (HR), respiration (RESP), and skin conductivity (EDA). All physiological data were sampled at 1000 Hz, which required us to synchronize them with the available 3D facial data and corresponding action units to obtain accurate readings for each frame. To synchronize, we first divide the total number of frames of physiological data by the total number of frames of 3D facial data for that task (the sync value). We then average the physiological signal over each window of that length to obtain one value per frame. For example, given a task with 1,000 frames of 3D facial data and 40,000 frames of diastolic BP, we would have 40,000 / 1,000 = 40, so we take the average diastolic BP over every 40 frames. Computing this average over all 40,000 frames results in 1,000 frames of diastolic BP matching the 1,000 frames of corresponding 3D facial data. In this same task, there are 400 frames that include both 3D facial landmarks and AUs (frames labeled with task, subject, and frame number). We then use the corresponding frame number to extract that exact index from the calculated diastolic BP averages, giving us 400 frames of synchronized 3D facial data, physiological data, and action units. For our physiological feature vector, we take the average value of each frame over all eight of the data types (i.e., fuse the signals).
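The windowed-averaging synchronization above can be sketched directly; the function name and toy signal are illustrative, but the numbers mirror the paper's example (40,000 samples against 1,000 frames):

```python
import numpy as np

def sync_signal(signal, n_video_frames):
    """Downsample a high-rate physiological signal to one value per
    video frame by averaging consecutive windows (Sec. 2.3)."""
    window = len(signal) // n_video_frames  # e.g. 40000 // 1000 = 40
    trimmed = signal[: window * n_video_frames]
    return trimmed.reshape(n_video_frames, window).mean(axis=1)

# Toy diastolic-BP trace: 40,000 samples at 1000 Hz vs. 1,000 3D frames.
bp = np.arange(40_000, dtype=float)
per_frame = sync_signal(bp, 1_000)
assert per_frame.shape == (1_000,)
print(per_frame[0])  # 19.5 (mean of samples 0..39)
```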

3 Experimental Design and Results

3.1 Feature analysis

A main contribution of this work is analyzing which modality and features are most important for our 4 target emotions. To do this, we used principal component analysis (PCA) for feature selection, keeping 95% of the original variance. We did this for each of our unimodal feature vectors on all the training data, as well as on each individual emotion. This allowed us to analyze which features are important for emotion recognition in a general sense, and for each targeted emotion, resulting in 15 rankings in total (3 feature vectors for each of: happy, embarrassment, pain, fear, and all emotions). The features were then ranked by highest eigenvalue.
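One way to realize the procedure above is to keep components until 95% of the variance is explained and then score each input feature by its eigenvalue-weighted loadings; the paper does not spell out its exact ranking rule, so this is a sketch under that assumption, on synthetic data:

```python
import numpy as np

def pca_feature_ranking(X, var_keep=0.95):
    """Rank input features via PCA. Components are kept until they explain
    `var_keep` of the variance; each feature is then scored by its
    eigenvalue-weighted absolute loadings on those components.
    (A sketch -- the paper's exact ranking rule may differ.)"""
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(eigvals)[::-1]            # descending eigenvalues
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    k = np.searchsorted(np.cumsum(eigvals) / eigvals.sum(), var_keep) + 1
    scores = (np.abs(eigvecs[:, :k]) * eigvals[:k]).sum(axis=1)
    return np.argsort(scores)[::-1]              # best feature first

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))
X[:, 3] *= 10.0  # make feature 3 dominate the variance
ranking = pca_feature_ranking(X)
print(ranking[0])  # 3
```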

Figure 2: Top 5 ranked 3D facial features across the 4 emotions.

Action Units. The top-ranked action units involved the lips, cheeks, nose, and eye/eyebrow regions. The ranked AUs were similar across each of the target emotions, as well as across all emotions combined; the difference is that their order changes across emotions (e.g., AU12 was ranked first for happy but second for embarrassed). Table 1, second column, shows the top 5 ranked AUs. As can be seen there, the top AUs for 'Happy' are 12, 6, 11, and 7. In the Emotional Facial Action Coding System (EMFACS) [16], which only considers emotion-related facial actions, 'Happy' is 6+12. This shows a correlation between the PCA rankings and the action units associated with the emotion. We also calculated the normalized AU distribution for each target emotion. This showed that while each emotion had similar occurring action units, their distributions varied, which contributes complementary information to the other modalities. This can explain the increase in accuracy when a multimodal approach is used (Table 3).
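The normalized AU distribution mentioned above is just per-emotion occurrence counts scaled to sum to one; a small sketch on made-up binary AU matrices (real values come from the BP4D+ coding):

```python
import numpy as np

# Hypothetical binary AU occurrence matrices per emotion,
# shaped (n_frames, 35). Random data stands in for BP4D+ annotations.
rng = np.random.default_rng(2)
frames = {
    "happy": rng.integers(0, 2, size=(100, 35)),
    "pain": rng.integers(0, 2, size=(80, 35)),
}

# Normalized AU distribution: per emotion, each AU's occurrence count
# divided by the total number of occurrences, so the vector sums to 1.
dist = {emo: m.sum(axis=0) / m.sum() for emo, m in frames.items()}
for emo, d in dist.items():
    assert d.shape == (35,) and np.isclose(d.sum(), 1.0)
```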

Physiological Data. Most of the top-ranked physiological features were variations on blood pressure (e.g., diastolic and systolic). Pulse rate was also a top feature for each of the target emotions; however, when all emotions were included in the training data, pulse rate was replaced by EDA. This suggests that skin conductivity is important for recognizing multiple emotions. It is interesting to note that for each of the 4 target emotions, the top-ranked features were not only the same, but also ranked in the same order. Although each emotion shared the same top-ranked physiological features, the values of those features varied widely between emotions. This variance allows for a high level of recognition accuracy (Table 2). Table 1, third column, shows the top 5 ranked physiological signals.

3D Facial Data. When analyzing the 3D facial data, each of the target emotions show variance in the regions of the face that were ranked for the top features. For example, happy targeted the right eye and eyebrow, and pain was across the right eyebrow, nose, and left eyebrow. These regions of the face are also consistent with the AUs ranked as the top features (e.g. mouth, face, eyes/eyebrows). See Table 1 for the top 5 ranked 3D facial landmarks (face region for each) and Fig. 2 for examples of these landmarks on corresponding 3D mesh models. It can be seen, in Fig. 2, that emotional variance is conveyed in different 3D regions of the face for each of the target emotions.

Happy:
  Action Units — Lip corner puller (12), Cheek raiser (6), Upper lip raiser (10), Nasolabial deepener (11), Lid tightener (7)
  Phys — Mean BP, Diastolic BP, Raw BP, Pulse Rate
  3D Facial Landmarks — Right eye (2), Right eyebrow (3)

Embarrassed:
  Action Units — Cheek raiser (6), Lip corner puller (12), Upper lip raiser (10), Lid tightener (7), Nasolabial deepener (11)
  Phys — Mean BP, Diastolic BP, Systolic BP, Raw BP, Pulse Rate
  3D Facial Landmarks — Left face contour (2), Left eyebrow (3)

Pain:
  Action Units — Lip corner puller (12), Cheek raiser (6), Upper lip raiser (10), Nasolabial deepener (11), Lid tightener (7)
  Phys — Mean BP, Diastolic BP, Systolic BP, Raw BP, Pulse Rate
  3D Facial Landmarks — Right eyebrow (2), Nose (2), Left eyebrow (1)

Fear:
  Action Units — Upper lip raiser (10), Cheek raiser (6), Lid tightener (7), Lip corner puller (12), Nasolabial deepener (11)
  Phys — Mean BP, Diastolic BP, Systolic BP, Raw BP, Pulse Rate
  3D Facial Landmarks — Right eyebrow (5)

All:
  Action Units — Lip corner puller (12), Upper lip raiser (10), Cheek raiser (6), Lid tightener (7), Nasolabial deepener (11)
  Phys — Mean BP, Diastolic BP, Systolic BP, Raw BP, EDA
  3D Facial Landmarks — Left eyebrow (5)

Table 1: PCA rankings for each feature vector, for each individual emotion and for all 4 target emotions combined, shown in ranked order. NOTE: the number in parentheses under 3D Facial Landmarks is the total number of landmarks in that region.

3.2 Emotion recognition

To conduct our emotion recognition experiments, we created a feature vector for each unimodal and multimodal configuration (Tables 2 and 3). We then used each of these feature vectors to train a random forest [2] to recognize the four target emotions. Random forests have been used successfully in a wide variety of classification tasks, such as classifying ecological data [7], real-time hand gesture recognition [36], and head pose estimation [14], which makes them a natural fit for our analysis.
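The evaluation protocol (Sec. 3.2) is 10-fold cross validation over a feature matrix. A minimal, dependency-free sketch of that loop is below; a simple nearest-centroid rule stands in for the classifier (the paper trains a random forest [2]), and the two-cluster data is synthetic:

```python
import numpy as np

def ten_fold_accuracy(X, y, fit_predict, k=10, seed=0):
    """k-fold cross validation: shuffle, split into k folds, train on
    k-1 folds, test on the held-out fold, and average the accuracies."""
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, k)
    accs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        pred = fit_predict(X[train], y[train], X[test])
        accs.append((pred == y[test]).mean())
    return float(np.mean(accs))

def nearest_centroid(Xtr, ytr, Xte):
    """Illustrative stand-in classifier: predict the class whose
    training centroid is closest to each test point."""
    classes = np.unique(ytr)
    cents = np.stack([Xtr[ytr == c].mean(axis=0) for c in classes])
    d = ((Xte[:, None, :] - cents[None]) ** 2).sum(axis=2)
    return classes[d.argmin(axis=1)]

# Toy, well-separated two-class data.
rng = np.random.default_rng(3)
X = np.concatenate([rng.normal(0, 1, (100, 5)), rng.normal(6, 1, (100, 5))])
y = np.concatenate([np.zeros(100), np.ones(100)])
print(ten_fold_accuracy(X, y, nearest_centroid))  # ~1.0 on separable data
```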

3D Facial Data Action Units Physiological
Accuracy 99.29% 61.94% 99.94%
Recall 98.80% 60.35% 99.95%
Precision 99.33% 61.00% 99.95%
Table 2: Unimodal emotion recognition from BP4D+.
3D + AUs  AUs + Phys  3D + Phys  3D + AUs + Phys
Accuracy 99.53% 99.95% 99.76% 99.83%
Recall 99.58% 99.95% 99.75% 99.83%
Precision 99.52% 99.95% 99.75% 99.85%
Table 3: Multimodal (feature-level fusion) emotion recognition from BP4D+.
Happy Embarrassment Fear Pain
Happy 32511 7730 3373 7917
Embarrassment 17561 26038 3238 5282
Fear 8773 5206 14652 8163
Pain 1983 2334 1685 46006
Table 4: Confusion matrix of action units.

Unimodal vs. Multimodal Emotion Recognition. We used 10-fold cross validation for each of our experiments. The results for unimodal and multimodal emotion recognition can be seen in Tables 2 and 3, respectively. Physiological data gave the highest recognition accuracy for both unimodal and multimodal approaches, achieving 99.94% on the 4 target emotions with a unimodal approach. This result is intuitive, as physiological signals are closely tied to human emotion [19, 20]. When AUs were combined with physiological data, we achieved our highest recognition accuracy of 99.95%. This agrees with the literature that the fusion of multimodal data, including action units, can provide complementary information and increase recognition accuracy [5]. Although emotion recognition from AUs shows promising results, especially when fused with other modalities, they exhibit the lowest classification rate of the unimodal feature vectors, with a recognition accuracy of 61.94%. The confusion matrices for AUs, physiological data, and AUs combined with physiological data are shown in Tables 4, 5, and 6, respectively.
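As a sanity check, the unimodal AU accuracy can be recomputed from the confusion matrix in Table 4 (trace over total), and it matches the 61.94% reported in Table 2:

```python
import numpy as np

# AU confusion matrix from Table 4, rows/columns ordered
# happy, embarrassment, fear, pain, over all 192,452 frames.
cm = np.array([
    [32511,  7730,  3373,  7917],
    [17561, 26038,  3238,  5282],
    [ 8773,  5206, 14652,  8163],
    [ 1983,  2334,  1685, 46006],
])

accuracy = np.trace(cm) / cm.sum()
print(round(accuracy * 100, 2))  # 61.94
```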

Fusing multimodal data has been found to increase emotion recognition accuracy, including for pain in infants [34]. Our results show a similar trend for pain, with accuracy increasing from 99.92% using physiological data alone to 99.98% when AUs were fused with physiological data. It is interesting to note that while the overall recognition accuracy was higher when AUs were combined with physiological data, the recognition rates for happy and fear decreased to 99.94% and 99.90%, respectively. This can be attributed to redundant action unit patterns between happy and fear.

Happy Embarrassment Fear Pain
Happy 51512 10 5 4
Embarrassment 21 52080 4 14
Fear 4 7 36780 3
Pain 22 13 6 51967
Table 5: Confusion matrix of physiological data.
Happy Embarrassment Fear Pain
Happy 51504 10 5 4
Embarrassment 10 52100 3 6
Fear 14 16 36758 6
Pain 3 9 1 51995
Table 6: Confusion matrix of action units and physiological.

4 Discussion

We have presented an analysis of the impact of 3D facial landmarks, action units, and physiological data on emotion recognition. We have conducted experiments in both a unimodal and multimodal capacity on four target emotions. Our analysis shows that 3D facial data exhibits variations across facial regions that allow for accurate emotion recognition. We have also shown that physiological data can be used for emotion recognition due to its changes across emotions. The occurrence of action units shows differences in distribution over the 35 AUs across the four target emotions, which provides complementary information when fusing the AUs with other modalities at the feature level. Although the fusion of AUs is shown to increase accuracy across the four tested emotions, the results also show that directly using AU occurrences for emotion recognition, without fusing other modalities, is still a challenging problem. These results suggest more research is needed to determine the positive impact of using action units in a unimodal approach to emotion recognition.

While these results are encouraging, there are some limitations to the study. First, more multimodal databases need to be investigated, as our study only made use of BP4D+. Secondly, more details are needed as to why the fusion of AU occurrences increased accuracy, while using them in a unimodal capacity produced a relatively low accuracy. Lastly, our current study focused on only four emotions due to the limited number of available action unit annotations; a much larger range of emotions is needed to fully test the efficacy of the proposed approach. Considering this, in future work we will detect action units [1] across a larger set of data, as well as use deep neural networks and other fusion methods, including score-level fusion and the fusion of deep and hand-crafted features [17]. We will also test on a larger set of multimodal datasets, and we will investigate the impact of both AU occurrences and intensities on emotion recognition. These experiments will be conducted across a larger set of emotions including, but not limited to, surprise, sadness, anger, and disgust. Along with these emotions, we will also investigate subject self-reporting of emotion (i.e., perceived emotion).


This material is based on work that was supported in part by an Amazon Machine Learning Research Award.


  • [1] T. Baltrušaitis et al. (2016) OpenFace: an open source facial behavior analysis toolkit. In WACV, pp. 1–10. Cited by: §4.
  • [2] L. Breiman (2001) Random forests. Machine learning 45 (1), pp. 5–32. Cited by: §1, §3.2.
  • [3] S. Canavan et al. (2015) Landmark localization on 3d/4d range data using a shape index-based stat shape model with global and local constraints. CVIU 139, pp. 136–148. Cited by: §2.1.
  • [4] T. Cootes et al. (2001) Active appearance models. IEEE Transactions on PAMI (6), pp. 681–685. Cited by: §1.
  • [5] C. Corneanu et al. (2016) Survey on rgb, 3d, thermal, and multimodal approaches for facial expression recognition: history, trends, and affect-related applications. IEEE transactions on PAMI 38 (8), pp. 1548–1568. Cited by: §3.2.
  • [6] D. Cosker, E. Krumhuber, and A. Hilton (2011) A facs valid 3d dynamic action unit database with applications to 3d dynamic morphable facial modeling. In ICCV, Cited by: §1.
  • [7] D. Cutler et al. (2007) Random forests for classification in ecology. Ecology 88 (11), pp. 2783–2792. Cited by: §3.2.
  • [8] H. Drira and others. (2012) 3D dynamic expression recognition based on a novel deformation vector field and random forest. In ICPR, pp. 1104–1107. Cited by: §1.
  • [9] P. Ekman and W. Friesen (1978) The facial action coding system: a technique for the measurement of facial movement. Consulting Psychologists Press. Cited by: §2.
  • [10] P. Ekman and E. Rosenberg (1997) What the face reveals: basic and applied studies of spon exp using the facial action coding system (facs). Oxford Uni Press. Cited by: item 1.
  • [11] P. Ekman (1989) The argument and evidence about universals in facial expressions. Handbook of social psychophysiology, pp. 143–164. Cited by: §2.
  • [12] D. Fabiano and S. Canavan (2018) Spon and non-spontaneous 3d facial expression recognition using a statistical model with global and local constraints. ICIP. Cited by: §2.1.
  • [13] G. Fanelli et al. (2010) A 3-d audio-vis corpus of affect comm. IEEE Trans on Multimedia 12 (6), pp. 591–598. Cited by: §1.
  • [14] G. Fanelli, J. Gall, and L. Van Gool (2011) Real time head pose estimation with random regression forests. In CVPR 2011, pp. 617–624. Cited by: §3.2.
  • [15] T. Fang et al. (2012) 3D/4d facial expression analysis: an advanced annotated face model approach. IVC 30 (10), pp. 738–749. Cited by: §1.
  • [16] W. Friesen and P. Ekman EMFACS-7: emotional facial action coding system. Uni of Cali at SF 2 (36), pp. 1. Cited by: §3.1.
  • [17] S. Hinduja et al. (2019) Fusion of hand-crafted and deep features for empathy prediction. FG Workshops. Cited by: §4.
  • [18] L. Kessous et al. (2010) Multimodal emotion recognition in speech-based interaction using facial expression, body gesture and acoustic analysis. Journal on Multimodal User Interfaces 3 (1-2), pp. 33–48. Cited by: §1.
  • [19] R. Knapp et al. (2011) Phys signals and their use in augmenting emotion recognition for human–machine interaction. In Emotion-oriented systems, pp. 133–159. Cited by: §3.2.
  • [20] S. Koelstra et al. (2011) Deap: a database for emotion analysis; using physiological signals. IEEE Transactions on Affective Computing 3 (1), pp. 18–31. Cited by: §1, §3.2.
  • [21] P. Liu et al. (2014) Facial expression recognition via a boosted deep belief network. In CVPR, pp. 1805–1812. Cited by: §1.
  • [22] P. Lucey et al. (2010) The extended cohn-kanade dataset (ck+): a complete dataset for action unit and emotion-specified expression. In CVPRW, pp. 94–101. Cited by: §1.
  • [23] G. McKeown et al. (2011) The semaine database: annotated multimodal records of emotionally colored conversations between a person and a limited agent. IEEE Transactions on Affective Computing 3 (1), pp. 5–17. Cited by: §1.
  • [24] M. Mirza and S. Osindero (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784. Cited by: §1.
  • [25] R. W. Picard, E. Vyzas, and J. Healey (2001) Toward machine emotional intelligence: analysis of affective physiological state. IEEE Trans on PAMI (10), pp. 1175–1191. Cited by: §1.
  • [26] P. Prkachin (2008) The structure, reliability and validity of pain expression: evidence from patients with shoulder pain. Pain 139 (2), pp. 267–274. Cited by: §1.
  • [27] M. Soleymani, M. Pantic, and T. Pun (2011) Multimodal emotion recognition in response to videos. IEEE transactions on affective computing 3 (2), pp. 211–223. Cited by: §1, §1.
  • [28] G. Stratou et al. (2012) Exploring the effect of illumination on automatic expression recognition using the ict-3drfe database. IVC 30 (10), pp. 728–737. Cited by: §1.
  • [29] Y. Sun et al. (2010) Tracking vertex vlow and model adaptation for 3d spatio-temporal face analysis. IEEE Transactions on SMC-A 40 (3), pp. 461–474. Cited by: §1.
  • [30] S. Wang et al. (2010) A natural visible and infrared facial expression database for expression rec and emotion inference. IEEE Trans on Multimedia 12 (7), pp. 682–691. Cited by: §1.
  • [31] H. Yang et al. (2018) Facial expression recognition by de-expression residue learning. In CVPR, Cited by: §1.
  • [32] H. Yang et al. (2018) Identity-adaptive facial expression recognition through expression regeneration using conditional generative adversarial networks. In FG, Cited by: §1.
  • [33] L. Yin et al. (2008) A high-resolution 3d dynamic facial expression database. In FG, Vol. 126. Cited by: §1.
  • [34] G. Zamzmi et al. (2016) Machine-based multimodal pain assessment tool for infants: a review. arXiv preprint arXiv:1607.00331. Cited by: §3.2.
  • [35] Z. Zhang et al. (2016) Multi spon emotion corpus for human behavior analysis. In CVPR, pp. 3438–3446. Cited by: item 3, Figure 1, §2.
  • [36] X. Zhao et al. (2012) Real-time hand gesture detection and recognition by random forest. In Communications and information processing, pp. 747–755. Cited by: §3.2.