Target Confusion in End-to-end Speaker Extraction: Analysis and Approaches

04/04/2022
by   Zifeng Zhao, et al.

Recently, end-to-end speaker extraction has attracted increasing attention and shown promising results. However, its performance is often inferior to that of a blind source separation (BSS) counterpart with a similar network architecture, because the auxiliary speaker encoder may sometimes generate ambiguous speaker embeddings. Such ambiguous guidance may confuse the separation network and lead to wrong extraction results, which degrades overall performance. We refer to this as the target confusion problem. In this paper, we analyze this problem and address it in two stages. In the training phase, we propose integrating metric learning methods to improve the distinguishability of the embeddings produced by the speaker encoder. For inference, we design a novel post-filtering strategy to correct wrong results. Specifically, we first identify confusion samples by measuring the similarity between the output estimates and the enrollment utterances, and then recover the true target sources by a subtraction operation. Experiments show an improvement of more than 1 dB SI-SDRi, which validates the effectiveness of our methods and underlines the impact of the target confusion problem.
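
The inference-time post-filtering idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a two-speaker mixture (so that subtracting a wrong estimate from the mixture leaves roughly the target), cosine similarity as the similarity measure, and a fixed decision threshold. The names `post_filter`, `embed`, and `threshold` are hypothetical, and `embed` stands in for any speaker-embedding function supplied by the caller.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def post_filter(
    mixture: Vector,
    estimate: Vector,
    enrollment: Vector,
    embed: Callable[[Vector], Vector],  # hypothetical speaker-embedding function
    threshold: float = 0.5,             # assumed value, not taken from the paper
) -> Tuple[List[float], bool]:
    """Return (output_signal, was_confused).

    Step 1: compare the embedding of the extracted estimate with the
            embedding of the enrollment utterance.
    Step 2: if they are too dissimilar, treat the sample as a confusion
            case and return mixture - estimate instead (assumes exactly
            two speakers, so the residual approximates the target).
    """
    similarity = cosine_similarity(embed(estimate), embed(enrollment))
    if similarity >= threshold:
        return list(estimate), False
    residual = [m - e for m, e in zip(mixture, estimate)]
    return residual, True
```

In practice `embed` would be a trained speaker encoder and the threshold would need tuning on held-out data; with more than two speakers, plain subtraction would no longer isolate the target.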


research
01/23/2020

Improving speaker discrimination of target speech extraction with time-domain SpeakerBeam

Target speech extraction, which extracts a single target source in a mix...
research
03/09/2023

X-SepFormer: End-to-end Speaker Extraction Network with Explicit Optimization on Speaker Confusion

Target speech extraction (TSE) systems are designed to extract target sp...
research
05/18/2023

Attention-based Encoder-Decoder Network for End-to-End Neural Speaker Diarization with Target Speaker Attractor

This paper proposes a novel Attention-based Encoder-Decoder network for ...
research
02/01/2022

New Insights on Target Speaker Extraction

In recent years, researchers have become increasingly interested in spea...
research
03/07/2023

TS-SEP: Joint Diarization and Separation Conditioned on Estimated Speaker Embeddings

Since diarization and source separation of meeting data are closely rela...
research
08/15/2022

Analysis of impact of emotions on target speech extraction and speech separation

Recently, the performance of blind speech separation (BSS) and target sp...
research
12/10/2022

GPU-accelerated Guided Source Separation for Meeting Transcription

Guided source separation (GSS) is a type of target-speaker extraction me...
