Detecting expressions with multimodal transformers

11/30/2020
by Srinivas Parthasarathy, et al.

Developing machine learning algorithms to understand person-to-person engagement can result in natural user experiences for communal devices such as Amazon Alexa. Among other cues such as voice activity and gaze, a person's audio-visual expression, which includes tone of voice and facial expression, serves as an implicit signal of engagement between parties in a dialog. This study investigates deep-learning algorithms for audio-visual detection of a user's expression. We first implement an audio-visual baseline model with recurrent layers that shows competitive results compared to the current state of the art. Next, we propose a transformer architecture with encoder layers that better integrates audio-visual features for expression tracking. Performance on the Aff-Wild2 database shows that the proposed methods outperform the baseline architecture with recurrent layers, with absolute gains of approximately 2%. The multimodal architectures also show significant improvements over models trained on single modalities, with gains of up to 3.6% for expression detection on this database.
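The paper does not ship reference code, so the minimal sketch below only illustrates the general fusion idea described in the abstract: per-frame audio and visual features are projected to a shared size, concatenated, and passed through standard transformer encoder layers. The module names, feature dimensions (40-dimensional audio frames, 512-dimensional face embeddings), and seven-class output are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class AVTransformer(nn.Module):
    """Frame-level audio-visual expression detector (illustrative sketch).

    Audio and visual frame features are projected to a shared dimension,
    concatenated, and fed to transformer encoder layers; a linear head
    emits per-frame expression logits. Positional encodings are omitted
    for brevity but would be added in practice.
    """

    def __init__(self, audio_dim=40, visual_dim=512, d_model=256,
                 num_layers=4, num_heads=4, num_classes=7):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.visual_proj = nn.Linear(visual_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=2 * d_model,
                                           nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(2 * d_model, num_classes)

    def forward(self, audio, visual):
        # audio:  (batch, time, audio_dim), e.g. log-mel frames
        # visual: (batch, time, visual_dim), e.g. CNN face embeddings
        fused = torch.cat([self.audio_proj(audio),
                           self.visual_proj(visual)], dim=-1)
        return self.head(self.encoder(fused))  # (batch, time, num_classes)

# Example: two 100-frame clips.
model = AVTransformer()
logits = model(torch.randn(2, 100, 40), torch.randn(2, 100, 512))

Concatenation is only one possible fusion strategy; once the two streams are joined, the encoder's self-attention mixes audio and visual information across all time steps, which is the property the abstract credits for the gains over recurrent baselines.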

Related research

03/26/2021 · DBATES: DataBase of Audio features, Text, and visual Expressions in competitive debate Speeches
In this work, we present a database of multimodal communication features...

03/25/2022 · Facial Expression Recognition with Swin Transformer
The task of recognizing human facial expressions plays a vital role in v...

09/23/2020 · Learning Visual Voice Activity Detection with an Automatically Annotated Dataset
Visual voice activity detection (V-VAD) uses visual features to predict ...

10/02/2020 · Training Strategies to Handle Missing Modalities for Audio-Visual Expression Recognition
Automatic audio-visual expression recognition can play an important role...

10/26/2022 · End-to-End Multimodal Representation Learning for Video Dialog
Video-based dialog task is a challenging multimodal learning task that h...

09/22/2021 · Audio-Visual Grounding Referring Expression for Robotic Manipulation
Referring expressions are commonly used when referring to a specific tar...

07/11/2023 · A Modular Multimodal Architecture for Gaze Target Prediction: Application to Privacy-Sensitive Settings
Predicting where a person is looking is a complex task, requiring to und...
