USEV: Universal Speaker Extraction with Visual Cue

09/30/2021
by   Zexu Pan, et al.

A speaker extraction algorithm seeks to extract the target speaker's voice from a multi-talker speech mixture. An auxiliary reference, such as a video recording or pre-recorded speech, is usually used as a cue to form top-down auditory attention. Prior studies have focused mostly on speaker extraction from multi-talker speech mixtures with highly overlapping speakers. However, a multi-talker speech mixture is often sparsely overlapped, and the target speaker may even be absent at times. In this paper, we propose a universal speaker extraction network that works for all multi-talker scenarios, whether the target speaker is absent or present. When the target speaker is present, the network performs over a wide range of target-interference speaker overlapping ratios, from 0% to 100%. The speech in such universal multi-talker scenarios is generally described as sparsely overlapped speech. We advocate that a visual cue, i.e., lip movement, is more informative as the auxiliary reference than an audio cue, i.e., pre-recorded speech. In addition, we propose a scenario-aware differentiated loss function for network training. The experimental results show that our proposed network outperforms various competitive baselines in disentangling sparsely overlapped speech in terms of signal fidelity and perceptual evaluations.


research
09/15/2023

Audio-Visual Active Speaker Extraction for Sparsely Overlapped Multi-talker Speech

Target speaker extraction aims to extract the speech of a specific speak...
research
10/15/2020

Muse: Multi-modal target speaker extraction with visual cues

Speaker extraction algorithm relies on the speech sample from the target...
research
09/19/2023

USED: Universal Speaker Extraction and Diarization

Speaker extraction and diarization are two crucial enabling techniques f...
research
10/31/2022

ImagineNET: Target Speaker Extraction with Intermittent Visual Cue through Embedding Inpainting

The speaker extraction technique seeks to single out the voice of a targ...
research
01/14/2021

Speaker activity driven neural speech extraction

Target speech extraction, which extracts the speech of a target speaker ...
research
03/31/2022

Speaker Extraction with Co-Speech Gestures Cue

Speaker extraction seeks to extract the clean speech of a target speaker...
research
10/09/2022

VCSE: Time-Domain Visual-Contextual Speaker Extraction Network

Speaker extraction seeks to extract the target speech in a multi-talker ...
