Crossmodal learning for audio-visual speech event localization

03/09/2020
by Rahul Sharma, et al.

An objective understanding of media depictions, such as inclusive portrayals of how much someone is heard and seen on screen in film and television, requires machines to discern automatically who is talking, when, how, and where. Media content is rich in multiple modalities, such as visuals and audio, which can be used to learn speaker activity in videos. In this work, we present visual representations that carry implicit information about when and where someone is talking. We propose a crossmodal neural network for audio speech event detection using the visual frames. We use the learned representations for two downstream tasks: i) audio-visual voice activity detection and ii) active speaker localization in video frames. We present a state-of-the-art audio-visual voice activity detection system and demonstrate that the learned embeddings can effectively localize active speakers in the visual frames.
