SpEx: Multi-Scale Time Domain Speaker Extraction Network

04/17/2020
by   Chenglin Xu, et al.

Speaker extraction aims to mimic humans' selective auditory attention by extracting a target speaker's voice from a multi-talker environment. It is common to perform the extraction in the frequency domain and to reconstruct the time-domain signal from the extracted magnitude and estimated phase spectra. However, such an approach is adversely affected by the inherent difficulty of phase estimation. Inspired by Conv-TasNet, we propose a time-domain speaker extraction network (SpEx) that converts the mixture speech into multi-scale embedding coefficients instead of decomposing the speech signal into magnitude and phase spectra. In this way, we avoid phase estimation. The SpEx network consists of four network components: speaker encoder, speech encoder, speaker extractor, and speech decoder. Specifically, the speech encoder converts the mixture speech into multi-scale embedding coefficients, while the speaker encoder learns to represent the target speaker with a speaker embedding. The speaker extractor takes the multi-scale embedding coefficients and the target speaker embedding as input and estimates a receptive mask. Finally, the speech decoder reconstructs the target speaker's speech from the masked embedding coefficients. We also propose a multi-task learning framework and a multi-scale embedding implementation. Experimental results show that the proposed SpEx achieves 37.3%, 37.7%, and 15.0% relative improvements over the best baseline in terms of signal-to-distortion ratio (SDR), scale-invariant SDR (SI-SDR), and perceptual evaluation of speech quality (PESQ), respectively, under an open evaluation condition.
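To make the idea of multi-scale embedding coefficients and masking more concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that each scale is a bank of filters with its own window length, that all scales share one hop size so their frames line up in time, and that a ReLU follows each filter. The window lengths (4, 8, 16), the hop of 2, the random filters, and the constant mask are hypothetical values chosen only for illustration. The learned speaker encoder, speaker extractor, and speech decoder are omitted.

```python
import math
import random


def frame(signal, win, hop, n_frames):
    """Slice `signal` into `n_frames` windows of length `win`, zero-padding the tail."""
    pad = max(0, (n_frames - 1) * hop + win - len(signal))
    padded = list(signal) + [0.0] * pad
    return [padded[i * hop: i * hop + win] for i in range(n_frames)]


def encode(signal, filters, hop):
    """Toy multi-scale encoder (illustrative assumption, not the paper's code).

    `filters` maps a window length to a list of filters of that length.
    Every scale uses the same `hop`, so all scales yield the same number of
    frames and can be concatenated frame by frame.
    """
    shortest = min(filters)
    n_frames = max(1, math.ceil((len(signal) - shortest) / hop) + 1)
    per_scale = []
    for win, bank in sorted(filters.items()):
        frames = frame(signal, win, hop, n_frames)
        # ReLU of the dot product between each frame and each filter
        per_scale.append(
            [[max(0.0, sum(x * w for x, w in zip(f, flt))) for flt in bank]
             for f in frames]
        )
    # Concatenate the scales along the channel axis for every frame
    return [sum((scale[t] for scale in per_scale), []) for t in range(n_frames)]


def apply_mask(coeffs, mask):
    """Element-wise product of embedding coefficients and a mask of equal shape."""
    return [[c * m for c, m in zip(cf, mf)] for cf, mf in zip(coeffs, mask)]


rng = random.Random(0)
mixture = [rng.uniform(-1, 1) for _ in range(20)]  # stand-in for mixture speech
filters = {
    win: [[rng.uniform(-1, 1) for _ in range(win)] for _ in range(3)]
    for win in (4, 8, 16)  # hypothetical short, middle, and long windows
}
coeffs = encode(mixture, filters, hop=2)
mask = [[0.5] * len(row) for row in coeffs]  # placeholder for the extractor's output
masked = apply_mask(coeffs, mask)
print(len(coeffs), len(coeffs[0]))  # frames, channels
```

In a real system the mask would be estimated from the coefficients and the speaker embedding, and a decoder would map the masked coefficients back to a waveform; the constant mask here only shows where that estimate plugs in.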

research
04/29/2020

Time-domain speaker extraction network

Speaker extraction is to extract a target speaker's voice from multi-tal...
research
05/10/2020

SpEx+: A Complete Time Domain Speaker Extraction Network

Speaker extraction aims to extract the target speech signal from a multi...
research
06/28/2023

MC-SpEx: Towards Effective Speaker Extraction with Multi-Scale Interfusion and Conditional Speaker Modulation

The previous SpEx+ has yielded outstanding performance in speaker extrac...
research
03/24/2019

Optimization of Speaker Extraction Neural Network with Magnitude and Temporal Spectrum Approximation Loss

The SpeakerBeam-FE (SBF) method is proposed for speaker extraction. It a...
research
10/25/2020

Speakerfilter-Pro: an improved target speaker extractor combines the time domain and frequency domain

This paper introduces an improved target speaker extractor, referred to ...
research
03/06/2022

Single microphone speaker extraction using unified time-frequency Siamese-Unet

In this paper we present a unified time-frequency method for speaker ext...
research
10/19/2020

Attention-based scaling adaptation for target speech extraction

The target speech extraction has attracted widespread attention in recen...
