X-SepFormer: End-to-end Speaker Extraction Network with Explicit Optimization on Speaker Confusion

03/09/2023
by Kai Liu, et al.

Target speech extraction (TSE) systems are designed to extract target speech from a multi-talker mixture. Most prior TSE networks are trained to maximize the reconstruction performance of the extracted speech waveform. However, it has been reported that a TSE system that delivers high reconstruction performance may still give a poor experience in practice. One such problem is extraction of the wrong speaker (called speaker confusion, SC), which creates a strongly negative experience and hampers effective conversation. To mitigate the SC issue, we reformulate the training objective and propose two novel loss schemes that explore a reconstruction-improvement metric defined at the small-chunk level and leverage the distribution information associated with that metric. Both loss schemes encourage a TSE network to pay attention to SC chunks, identified from this distribution information. On this basis, we present X-SepFormer, an end-to-end TSE model that combines the proposed loss schemes with a SepFormer backbone. Experimental results on the benchmark WSJ0-2mix dataset validate the effectiveness of our proposals, showing consistent improvements on SC errors (by 14.8 [...] best system significantly outperforms the current SOTA systems and offers the top TSE results reported to date on WSJ0-2mix.
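
To make the chunk-level idea concrete, here is a minimal sketch in plain Python. It rests on assumptions that are not stated in the abstract: SI-SDR improvement over the mixture stands in for the "reconstruction improvement" metric, chunks are non-overlapping, and a chunk scoring below a threshold is treated as a possible SC chunk. The names `si_sdr`, `chunk_si_sdri`, and `sc_penalty` are hypothetical, and the penalty shown is an illustrative stand-in, not the loss schemes proposed in the paper.

```python
import math

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB between two equal-length sample lists."""
    # Remove the mean so the measure ignores any DC offset.
    est_mean = sum(estimate) / len(estimate)
    ref_mean = sum(reference) / len(reference)
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]
    # Project the estimate onto the reference to get the scaled target part.
    dot = sum(e * r for e, r in zip(est, ref))
    scale = dot / (sum(r * r for r in ref) + eps)
    target = [scale * r for r in ref]
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(n * n for n in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))

def chunk_si_sdri(estimate, mixture, reference, chunk_len):
    """SI-SDR improvement (dB) over the mixture for each non-overlapping chunk."""
    scores = []
    for start in range(0, len(reference) - chunk_len + 1, chunk_len):
        end = start + chunk_len
        ref = reference[start:end]
        gain = si_sdr(estimate[start:end], ref) - si_sdr(mixture[start:end], ref)
        scores.append(gain)
    return scores

def sc_penalty(scores, threshold=0.0):
    """Assumed penalty: mean shortfall of chunk scores below `threshold`."""
    shortfalls = [max(0.0, threshold - s) for s in scores]
    return sum(shortfalls) / len(shortfalls) if shortfalls else 0.0

# Toy two-talker example: the estimate follows the target in the first chunk
# and the interfering talker in the second (a speaker-confusion chunk).
target = [math.sin(0.1 * n) for n in range(400)]
interferer = [math.sin(0.37 * n + 1.0) for n in range(400)]
mixture = [t + i for t, i in zip(target, interferer)]
estimate = target[:200] + interferer[200:]

scores = chunk_si_sdri(estimate, mixture, target, chunk_len=200)
penalty = sc_penalty(scores)
```

In this toy case the first chunk scores well above zero and the second well below it, so only the second chunk contributes to the penalty. An utterance-level average could hide that second chunk, which is the motivation the abstract gives for working at chunk level.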

