Audio-Visual Target Speaker Extraction on Multi-Talker Environment using Event-Driven Cameras

12/05/2019
by   Ander Arriandiaga, et al.

In this work, we propose a new method for audio-visual target speaker extraction in multi-talker environments using event-driven cameras. All existing audio-visual speech separation approaches extract visual features from frame-based video. However, frame-based cameras usually operate at 30 frames per second, which makes it difficult to process an audio-visual signal with low latency. To overcome this limitation, we propose using event-driven cameras, which offer high temporal resolution and low latency. Recent work showed that landmark motion features are very important for obtaining good results in audio-visual speech separation. We therefore use event-driven vision sensors, from which motion can be extracted at lower latency and computational cost. A stacked Bidirectional LSTM is trained to predict an Ideal Amplitude Mask, which is then post-processed to obtain a clean audio signal. The performance of our model is close to that of frame-based approaches.
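
As a rough illustration of the training target named in the abstract, the sketch below computes an ideal amplitude mask in its commonly used form, the ratio of the target speaker's magnitude spectrogram to the mixture's magnitude spectrogram, and applies it to the mixture while reusing the mixture phase. This is not the authors' code: the function names, the `eps` and `max_value` parameters, the clipping step, and the random spectrograms standing in for real STFTs are all assumptions made for the example.

```python
import numpy as np


def ideal_amplitude_mask(target_mag, mixture_mag, eps=1e-8, max_value=10.0):
    """Ratio of target to mixture magnitude per time-frequency bin.

    eps and max_value are illustrative choices (assumptions), used only to
    avoid division by zero and to bound the mask where the mixture is weak.
    """
    mask = target_mag / (mixture_mag + eps)
    return np.clip(mask, 0.0, max_value)


def apply_mask(mask, mixture_spec):
    """Scale the mixture magnitude by the mask and keep the mixture phase."""
    return mask * mixture_spec


# Random complex spectrograms stand in for real STFTs (freq bins x frames).
rng = np.random.default_rng(0)
shape = (257, 100)
target = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
interferer = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
mixture = target + interferer

mask = ideal_amplitude_mask(np.abs(target), np.abs(mixture))
estimate = apply_mask(mask, mixture)
```

In a system like the one described, a network would predict the mask from the mixture and the visual features, and the masked spectrogram would then be inverted back to a waveform; here the oracle mask is computed directly, so the estimate's magnitude matches the target wherever the mask was not clipped.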

research
07/31/2018

DNN driven Speaker Independent Audio-Visual Mask Estimation for Speech Separation

Human auditory cortex excels at selectively suppressing background noise...
research
07/09/2022

Dual-path Attention is All You Need for Audio-Visual Speech Extraction

Audio-visual target speech extraction, which aims to extract a certain s...
research
03/08/2022

VoViT: Low Latency Graph-based Audio-Visual Voice Separation Transformer

This paper presents an audio-visual approach for voice separation which ...
research
02/27/2023

Fast Trajectory End-Point Prediction with Event Cameras for Reactive Robot Control

Prediction skills can be crucial for the success of tasks where robots h...
research
11/23/2022

Data-driven Feature Tracking for Event Cameras

Because of their high temporal resolution, increased resilience to motio...
research
02/02/2021

Multimodal Attention Fusion for Target Speaker Extraction

Target speaker extraction, which aims at extracting a target speaker's v...
research
02/19/2019

Low-Latency Deep Clustering For Speech Separation

This paper proposes a low algorithmic latency adaptation of the deep clu...
