Backchannel Detection and Agreement Estimation from Video with Transformer Networks

06/02/2023
by   Ahmed Amer, et al.

Listeners use short interjections, so-called backchannels, to signify attention or express agreement. The automatic analysis of this behavior is of key importance for human conversation analysis and interactive conversational agents. Current state-of-the-art approaches for backchannel analysis from visual behavior make use of two types of features: features based on body pose and features based on facial behavior. At the same time, transformer neural networks have been established as an effective means to fuse input from different data sources, but they have not yet been applied to backchannel analysis. In this work, we conduct a comprehensive evaluation of multi-modal transformer architectures for automatic backchannel analysis based on pose and facial information. We address both the detection of backchannels and the estimation of the agreement expressed in a backchannel. In evaluations on the MultiMediate'22 backchannel detection challenge, we reach 66.4% accuracy with a one-layer transformer architecture, outperforming the previous state of the art. With a two-layer transformer architecture, we furthermore set a new state of the art (0.0604 MSE) on the task of estimating the amount of agreement expressed in a backchannel.
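The two results quoted above are measured differently: backchannel detection is a binary classification task, while agreement estimation is a regression task scored by mean squared error (MSE). The following is a minimal sketch of both measures, assuming the detection score is plain classification accuracy. The function names and toy values are hypothetical illustrations and are not taken from the paper or the challenge code.

```python
from typing import Sequence


def accuracy(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Fraction of samples whose predicted label equals the true label."""
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


def mean_squared_error(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Average of the squared differences between prediction and target."""
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


# Toy data (hypothetical): 1 = backchannel, 0 = no backchannel.
detection_score = accuracy([1, 0, 0, 1], [1, 0, 1, 1])

# Toy agreement values (hypothetical), one per backchannel sample.
agreement_error = mean_squared_error([0.5, 0.25], [0.0, 0.75])

print(f"accuracy: {detection_score:.2f}, MSE: {agreement_error:.2f}")
```

Higher is better for accuracy, lower is better for MSE, so the two reported numbers cannot be compared with each other directly.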

research
03/15/2023

Multi-Modal Facial Expression Recognition with Transformer-Based Fusion Networks and Dynamic Sampling

Facial expression recognition is important for various purposes such as e...
research
10/23/2022

Anticipative Feature Fusion Transformer for Multi-Modal Action Anticipation

Although human action anticipation is a task which is inherently multi-m...
research
09/20/2022

MultiMediate '22: Backchannel Detection and Agreement Estimation in Group Interactions

Backchannels, i.e. short interjections of the listener, serve important ...
research
07/26/2022

Bodily Behaviors in Social Interaction: Novel Annotations and State-of-the-Art Evaluation

Body language is an eye-catching social signal and its automatic analysi...
research
03/16/2023

Vision Transformer for Action Units Detection

Facial Action Units detection (FAUs) represents a fine-grained classific...
research
06/10/2022

Transformer-Graph Neural Network with Global-Local Attention for Multimodal Rumour Detection with Knowledge Distillation

Misinformation spreading becomes a critical issue in online conversation...
research
07/31/2023

DCTM: Dilated Convolutional Transformer Model for Multimodal Engagement Estimation in Conversation

Conversational engagement estimation is posed as a regression problem, e...
