RespVAD: Voice Activity Detection via Video-Extracted Respiration Patterns

08/21/2020
by Arnab Kumar Mondal, et al.

Voice Activity Detection (VAD) refers to the task of identifying regions of human speech in digital signals such as audio and video. While VAD is a necessary first step in many speech processing systems, it poses challenges when there are high levels of ambient noise during the audio recording. To improve the performance of VAD in such conditions, several methods have been proposed that utilize visual information extracted from the mouth/lip region of the speakers' video recordings. Although these approaches provide advantages over audio-only methods, they depend on faithful extraction of the lip/mouth region. Motivated by this, a new paradigm for VAD is proposed, based on the fact that respiration forms the primary source of energy for speech production. Specifically, an audio-independent VAD technique is developed using the respiration pattern extracted from the speaker's video. The respiration pattern is first extracted from video focusing on the abdominal-thoracic region of a speaker using an optical flow based method. Subsequently, voice activity is detected from the respiration pattern signal using neural sequence-to-sequence prediction models. The efficacy of the proposed method is demonstrated through experiments on a challenging dataset recorded in real acoustic environments and compared with four previous methods based on audio and visual cues.
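The abstract describes a two-stage pipeline: a one-dimensional respiration signal is estimated from optical flow in the abdominal-thoracic region of the video, and a neural sequence-to-sequence model then maps that signal to per-frame voice-activity labels. The Python sketch below is only an illustration of such a pipeline under stated assumptions: the Farneback optical flow, the fixed region-of-interest coordinates, the bidirectional GRU architecture, and all names (extract_respiration_pattern, RespSeq2SeqVAD, speaker.mp4) are hypothetical and are not the authors' implementation.

import cv2
import numpy as np
import torch
import torch.nn as nn


def extract_respiration_pattern(video_path, roi):
    """Average the vertical optical-flow component inside an assumed
    abdominal-thoracic ROI, yielding one respiration value per frame pair."""
    x, y, w, h = roi  # ROI assumed to be located beforehand (e.g. manually)
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    prev = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)[y:y + h, x:x + w]
    signal = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        curr = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)[y:y + h, x:x + w]
        # Dense Farneback optical flow between consecutive ROI crops
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        signal.append(float(flow[..., 1].mean()))  # mean vertical motion
        prev = curr
    cap.release()
    return np.asarray(signal, dtype=np.float32)


class RespSeq2SeqVAD(nn.Module):
    """Per-frame voice-activity scores from the respiration signal.
    A bidirectional GRU stands in for the paper's sequence-to-sequence model."""

    def __init__(self, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(input_size=1, hidden_size=hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, resp):                 # resp: (batch, time)
        h, _ = self.rnn(resp.unsqueeze(-1))  # (batch, time, 2*hidden)
        return torch.sigmoid(self.head(h)).squeeze(-1)  # scores in [0, 1]


if __name__ == "__main__":
    resp = extract_respiration_pattern("speaker.mp4", roi=(100, 200, 160, 120))
    model = RespSeq2SeqVAD()  # would need training on labeled data in practice
    with torch.no_grad():
        vad = model(torch.from_numpy(resp).unsqueeze(0))
    print((vad > 0.5).int())  # 1 = speech, 0 = non-speech, per frame

In practice the sequence model would be trained on respiration signals paired with ground-truth voice-activity labels; thresholding the sigmoid output at 0.5 is shown only as a simple way to obtain binary speech/non-speech decisions.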

