L-SpEx: Localized Target Speaker Extraction

02/21/2022
by Meng Ge, et al.

Speaker extraction aims to extract the target speaker's voice from a multi-talker speech mixture given an auxiliary reference utterance. Recent studies show that speaker extraction benefits from knowing the location or direction of the target speaker. However, these studies assume that the target speaker's location is known in advance or is detected from an extra visual cue, e.g., a face image or video. In this paper, we propose L-SpEx, an end-to-end localized target speaker extraction approach that relies purely on speech cues. Specifically, we design a speaker localizer, driven by the target speaker's embedding, to extract spatial features, including the direction-of-arrival (DOA) of the target speaker and the beamforming output. The spatial cues and the target speaker's embedding are then used together to form a top-down auditory attention to the target speaker. Experiments on the multi-channel reverberant dataset MC-Libri2Mix show that our L-SpEx approach significantly outperforms the baseline system.
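
To make the link between a DOA estimate and a beamforming output concrete, here is a minimal sketch of a classical narrowband delay-and-sum beamformer. It is not the localizer described in the abstract, which is learned and driven by the speaker embedding. The uniform linear array, the microphone count and spacing, the 1 kHz analysis frequency, and all function names are assumptions made only for illustration.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed room-temperature value


def steering_vector(doa_deg, n_mics=4, spacing_m=0.05, freq_hz=1000.0):
    """Far-field steering vector of a hypothetical uniform linear array.

    doa_deg is measured from the array axis (0 degrees = endfire).
    """
    theta = math.radians(doa_deg)
    vec = []
    for m in range(n_mics):
        # Arrival-time offset of microphone m relative to microphone 0.
        delay = m * spacing_m * math.cos(theta) / SPEED_OF_SOUND
        vec.append(cmath.exp(-2j * math.pi * freq_hz * delay))
    return vec


def delay_and_sum(snapshot, doa_deg, spacing_m=0.05, freq_hz=1000.0):
    """Phase-align one narrowband snapshot toward doa_deg and average it."""
    a = steering_vector(doa_deg, len(snapshot), spacing_m, freq_hz)
    return sum(w.conjugate() * x for w, x in zip(a, snapshot)) / len(snapshot)


def beam_response(steer_deg, source_deg):
    """Output magnitude for a unit plane wave arriving from source_deg."""
    return abs(delay_and_sum(steering_vector(source_deg), steer_deg))


on_target = beam_response(60.0, 60.0)    # beam steered at the source
off_target = beam_response(60.0, 120.0)  # interferer from another direction
```

With the beam steered at the true direction, the channels add coherently and `on_target` is 1 up to rounding, while a source from another direction is attenuated. This is the sense in which a known or estimated DOA can help isolate the target speaker.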

research
03/15/2023

Beamformer-Guided Target Speaker Extraction

We propose a Beamformer-guided Target Speaker Extraction (BG-TSE) method...
research
10/28/2022

Local-global speaker representation for target speaker extraction

Target speaker extraction is to extract the target speaker's voice from ...
research
06/13/2021

WASE: Learning When to Attend for Speaker Extraction in Cocktail Party Environments

In the speaker extraction problem, it is found that additional informati...
research
10/31/2022

ImagineNET: Target Speaker Extraction with Intermittent Visual Cue through Embedding Inpainting

The speaker extraction technique seeks to single out the voice of a targ...
research
04/11/2022

Listen only to me! How well can target speech extraction handle false alarms?

Target speech extraction (TSE) extracts the speech of a target speaker i...
research
06/16/2022

Strategies to Improve Robustness of Target Speech Extraction to Enrollment Variations

Target speech extraction is a technique to extract the target speaker's ...
research
12/07/2022

MIMO-DBnet: Multi-channel Input and Multiple Outputs DOA-aware Beamforming Network for Speech Separation

Recently, many deep learning based beamformers have been proposed for mu...
