Strategies to Improve Robustness of Target Speech Extraction to Enrollment Variations

06/16/2022
by Hiroshi Sato, et al.

Target speech extraction is a technique for extracting a target speaker's voice from a mixture signal using a pre-recorded enrollment utterance that characterizes the target speaker's voice. One major difficulty of target speech extraction lies in handling "intra-speaker" variability, i.e., the mismatch in characteristics between the target speech and the enrollment utterance. While most conventional approaches focus on improving average performance over a set of enrollment utterances, here we propose to guarantee worst-case performance, which we believe is of great practical importance. In this work, we propose an evaluation metric called worst-enrollment source-to-distortion ratio (SDR) to quantitatively measure robustness to enrollment variations. We also introduce a novel training scheme that directly optimizes worst-case performance by focusing training on difficult enrollment cases, where extraction does not perform well. In addition, we investigate the effectiveness of an auxiliary speaker identification loss (SI-loss) as another way to improve robustness to enrollment variations. Experiments show that both worst-enrollment target training and SI-loss training improve robustness to enrollment variations by increasing speaker discriminability.
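
The idea behind a worst-enrollment metric can be illustrated with a minimal sketch. The abstract does not give the exact definition, so everything below is an assumption made for illustration: `sdr_db` uses a simplified energy-ratio SDR rather than any particular evaluation toolkit, and `worst_enrollment_sdr` (a hypothetical name) simply takes the minimum SDR over the signals extracted from one mixture with different enrollment utterances of the same speaker.

```python
import math
from typing import List, Sequence, Tuple

Signal = Sequence[float]


def sdr_db(reference: Signal, estimate: Signal, eps: float = 1e-12) -> float:
    """Simplified SDR in dB: 10 * log10(||s||^2 / ||s - s_hat||^2).

    Assumption: a plain energy ratio, not the paper's exact SDR variant.
    """
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10((signal_energy + eps) / (error_energy + eps))


def worst_enrollment_sdr(
    reference: Signal, estimates: List[Signal]
) -> Tuple[float, int]:
    """Return (lowest SDR, index of the enrollment that produced it).

    `estimates[i]` is the signal extracted from the same mixture when the
    i-th enrollment utterance of the target speaker is used.
    """
    scores = [sdr_db(reference, est) for est in estimates]
    worst_index = min(range(len(scores)), key=scores.__getitem__)
    return scores[worst_index], worst_index


# Toy data (made up): one reference and two extraction results.
reference = [1.0, 0.0, -1.0, 0.0]
estimates = [
    [1.0, 0.0, -1.0, 0.1],   # good extraction with enrollment 0
    [0.5, 0.0, -0.5, 0.0],   # poor extraction with enrollment 1
]
worst_sdr, worst_idx = worst_enrollment_sdr(reference, estimates)
print(f"worst-enrollment SDR: {worst_sdr:.2f} dB (enrollment {worst_idx})")
```

Averaging the scores instead of taking the minimum would give the conventional average-performance view that the abstract contrasts with. The returned index also identifies the most difficult enrollment for a given mixture, which is one plausible way to select hard cases during training; the actual selection rule used by the authors is not stated in the abstract.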

research
01/23/2020

Improving speaker discrimination of target speech extraction with time-domain SpeakerBeam

Target speech extraction, which extracts a single target source in a mix...
research
02/21/2022

L-SpEx: Localized Target Speaker Extraction

Speaker extraction aims to extract the target speaker's voice from a mul...
research
04/11/2022

Listen only to me! How well can target speech extraction handle false alarms?

Target speech extraction (TSE) extracts the speech of a target speaker i...
research
06/23/2021

Enrollment-less training for personalized voice activity detection

We present a novel personalized voice activity detection (PVAD) learning...
research
10/28/2022

Local-global speaker representation for target speaker extraction

Target speaker extraction is to extract the target speaker's voice from ...
research
10/31/2022

ImagineNET: Target Speaker Extraction with Intermittent Visual Cue through Embedding Inpainting

The speaker extraction technique seeks to single out the voice of a targ...
research
05/03/2021

AvaTr: One-Shot Speaker Extraction with Transformers

To extract the voice of a target speaker when mixed with a variety of ot...