WASE: Learning When to Attend for Speaker Extraction in Cocktail Party Environments

06/13/2021
by   Yunzhe Hao, et al.

In the speaker extraction problem, additional information about the target speaker, including voiceprint, lip movement, facial expression, and spatial information, has been shown to aid the tracking and extraction of that speaker. However, the cue of sound onset, long emphasized in auditory scene analysis and psychology, has received little attention. Inspired by this, we explicitly model the onset cue and verify its effectiveness for the speaker extraction task. Extending the model to combined onset/offset cues yields a further performance improvement. From the perspective of tasks, our onset/offset-based model solves a composite task: a complementary combination of speaker extraction and speaker-dependent voice activity detection. We also combine voiceprint with onset/offset cues: the voiceprint models the voice characteristics of the target, while onset/offset cues model the start and end of the target's speech. From the perspective of auditory scene analysis, combining these two perceptual cues promotes the integrity of the auditory object. Our experimental results approach state-of-the-art performance while using nearly half the parameters. We hope that this work will inspire the speech processing and psychology communities and foster communication between them. Our code will be available at https://github.com/aispeech-lab/wase/.
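To make the onset/offset idea concrete, the sketch below derives frame-level onset and offset cues from a waveform using a simple short-time-energy heuristic. This is an illustrative assumption on our part, not the paper's method: WASE learns when to attend within the network, whereas here a fixed energy threshold merely shows what an onset/offset cue looks like (the function name, frame length, and threshold are hypothetical).

```python
import numpy as np

def onset_offset_cues(waveform, frame_len=160, threshold=0.01):
    """Derive a frame-level activity mask from short-time energy.

    A frame counts as 'active' when its mean energy exceeds
    `threshold`; the first and last active frames give the onset
    and offset frame indices (None if no frame is active).
    """
    n_frames = len(waveform) // frame_len
    frames = waveform[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.mean(frames ** 2, axis=1)       # per-frame mean energy
    mask = energy > threshold                   # speech-activity mask
    active = np.flatnonzero(mask)
    if active.size == 0:
        return mask, None, None
    return mask, int(active[0]), int(active[-1])

# Toy signal: 0.25 s silence, 0.5 s of a 440 Hz tone, 0.25 s silence.
sr = 16000
sig = np.concatenate([
    np.zeros(sr // 4),
    0.5 * np.sin(2 * np.pi * 440 * np.arange(sr // 2) / sr),
    np.zeros(sr // 4),
])
mask, onset, offset = onset_offset_cues(sig)
# The tone occupies frames 25..74 (10 ms frames at 16 kHz),
# so onset == 25 and offset == 74.
```

In the paper's composite-task view, such a mask is exactly the output of speaker-dependent voice activity detection, while the extraction network uses the same temporal information to decide when to attend to the target.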


