Learning Visual Voice Activity Detection with an Automatically Annotated Dataset

09/23/2020
by Sylvain Guy, et al.

Visual voice activity detection (V-VAD) uses visual features to predict whether or not a person is speaking. V-VAD is useful whenever audio VAD (A-VAD) is ineffective, either because the acoustic signal is difficult to analyze or because it is simply missing. We propose two deep architectures for V-VAD, one based on facial landmarks and one based on optical flow. The datasets currently available for training and testing V-VAD lack content variability. We therefore introduce a novel methodology, based on combining A-VAD with face detection and tracking, to automatically create and annotate very large in-the-wild datasets; we call the resulting dataset WildVVAD. A thorough empirical evaluation shows the advantage of training the proposed deep V-VAD models with this dataset.
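The abstract does not detail how A-VAD output is combined with face tracks, so the following is only a minimal sketch of one way such automatic annotation could work, not the authors' pipeline. It assumes that A-VAD yields non-overlapping speech intervals in seconds, that a face tracker yields one time span per track, and that each clip shows a single visible face. The names `FaceTrack`, `speech_fraction`, `label_track` and the thresholds `hi` and `lo` are hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds


@dataclass
class FaceTrack:
    """Hypothetical face-tracker output: one tracked face over a time span."""
    track_id: int
    start: float
    end: float


def overlap(a: Interval, b: Interval) -> float:
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def speech_fraction(track: FaceTrack, speech: List[Interval]) -> float:
    """Fraction of the track's duration covered by A-VAD speech intervals.

    Assumes the speech intervals do not overlap each other.
    """
    duration = track.end - track.start
    if duration <= 0:
        return 0.0
    covered = sum(overlap((track.start, track.end), seg) for seg in speech)
    return covered / duration


def label_track(track: FaceTrack, speech: List[Interval],
                hi: float = 0.9, lo: float = 0.1) -> Optional[str]:
    """Label a track from audio alone; thresholds are illustrative assumptions.

    Returns "speaking", "silent", or None for ambiguous tracks to be discarded.
    """
    frac = speech_fraction(track, speech)
    if frac >= hi:
        return "speaking"
    if frac <= lo:
        return "silent"
    return None


# Toy example: two speech intervals and three face tracks.
speech_segments = [(0.0, 2.0), (6.0, 8.0)]
tracks = [FaceTrack(0, 0.0, 2.0), FaceTrack(1, 3.0, 5.0), FaceTrack(2, 1.0, 3.0)]
labels = {t.track_id: label_track(t, speech_segments) for t in tracks}
```

Discarding tracks that are only partly covered by speech trades dataset size for label purity; whether the actual method makes this trade-off is not stated in the abstract.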

research
07/08/2023

FTFDNet: Learning to Detect Talking Face Video Manipulation with Tri-Modality Interaction

DeepFake based digital facial forgery is threatening public media securi...
research
11/30/2020

Detecting expressions with multimodal transformers

Developing machine learning algorithms to understand person-to-person en...
research
04/24/2023

Deep Audio-Visual Singing Voice Transcription based on Self-Supervised Learning Models

Singing voice transcription converts recorded singing audio to musical n...
research
08/21/2020

RespVAD: Voice Activity Detection via Video-Extracted Respiration Patterns

Voice Activity Detection (VAD) refers to the task of identification of r...
research
09/28/2021

The VVAD-LRS3 Dataset for Visual Voice Activity Detection

Robots are becoming everyday devices, increasing their interaction with ...
research
12/04/2022

Tragic Talkers: A Shakespearean Sound- and Light-Field Dataset for Audio-Visual Machine Learning Research

3D audio-visual production aims to deliver immersive and interactive exp...
research
10/07/2021

Transferring Voice Knowledge for Acoustic Event Detection: An Empirical Study

Detection of common events and scenes from audio is useful for extractin...
