Improving Speech Related Facial Action Unit Recognition by Audiovisual Information Fusion

06/29/2017
by Zibo Meng, et al.

It is challenging to recognize facial action units (AUs) from spontaneous facial displays, especially when they are accompanied by speech. The major reason is that, in current practice, information is extracted from a single source, i.e., the visual channel. However, facial activity is highly correlated with voice in natural human communication. Instead of solely improving visual observations, this paper presents a novel audiovisual fusion framework, which makes the best use of visual and acoustic cues in recognizing speech-related facial AUs. In particular, a dynamic Bayesian network (DBN) is employed to explicitly model the semantic and dynamic physiological relationships between AUs and phonemes, as well as measurement uncertainty. A pilot audiovisual AU-coded database has been collected to evaluate the proposed framework; it consists of a "clean" subset containing frontal faces under well-controlled conditions and a challenging subset with large head movements and occlusions. Experiments on this database demonstrate that the proposed framework yields significant improvement in recognizing speech-related AUs compared to state-of-the-art visual-based methods, especially for those AUs whose visual observations are impaired during speech. More importantly, it also outperforms feature-level fusion methods by explicitly modeling and exploiting the physiological relationships between AUs and phonemes.
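To illustrate the kind of fusion the abstract describes, the sketch below runs a forward pass over a two-state (AU off/on) dynamic model in which the AU state at each frame depends on the previous frame's state and on a phoneme-conditioned prior, while a noisy visual detector provides the observation likelihood. This is a minimal, hypothetical illustration, not the authors' DBN: the transition matrix, phoneme priors, and the `obs_conf` parameter (a simple stand-in for measurement uncertainty, which flattens the visual likelihood toward uniform) are all assumed values chosen for the example.

```python
# Minimal sketch of DBN-style audiovisual fusion for one binary AU.
# All probabilities are illustrative assumptions, not values from the paper.
import numpy as np

def forward_fusion(phonemes, visual_probs, trans, phoneme_prior, obs_conf):
    """Forward-pass belief over a binary AU (off/on), one value per frame.

    phonemes:      phoneme id per frame (acoustic channel)
    visual_probs:  P(AU active | visual) from a visual detector, per frame
    trans:         2x2 temporal transition matrix P(AU_t | AU_{t-1})
    phoneme_prior: dict phoneme id -> P(AU active | phoneme)
    obs_conf:      confidence in the visual channel in [0, 1]; models
                   measurement uncertainty by flattening the likelihood
    """
    belief = np.array([0.5, 0.5])  # uninformative initial belief [off, on]
    out = []
    for ph, pv in zip(phonemes, visual_probs):
        # Temporal prediction combined with the phoneme-conditioned prior
        pred = trans.T @ belief
        p_on = phoneme_prior.get(ph, 0.5)
        pred = pred * np.array([1.0 - p_on, p_on])
        pred /= pred.sum()
        # Visual likelihood, flattened toward uniform when confidence is low
        lik = obs_conf * np.array([1.0 - pv, pv]) + (1.0 - obs_conf) * 0.5
        belief = pred * lik
        belief /= belief.sum()
        out.append(float(belief[1]))
    return out

# Example: a bilabial phoneme ("p") raises belief in a lip-related AU even
# when the visual detector is uninformative (pv = 0.5), e.g. under occlusion.
trans = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
phoneme_prior = {"p": 0.9, "aa": 0.2}  # assumed AU-phoneme association
probs = forward_fusion(["p", "p", "aa"], [0.5, 0.5, 0.5],
                       trans, phoneme_prior, obs_conf=0.6)
```

The point of the sketch is the abstract's claim about fusion level: the acoustic channel enters through an explicit AU-phoneme dependency in the graphical model, rather than by concatenating audio and visual features before classification.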


