Closing the Gap between Single-User and Multi-User VoiceFilter-Lite

02/24/2022
by Rajeev Rikhye, et al.

VoiceFilter-Lite is a speaker-conditioned voice separation model that plays a crucial role in improving speech recognition and speaker verification by suppressing overlapping speech from non-target speakers. However, one limitation of VoiceFilter-Lite, and of speaker-conditioned speech models in general, is that they usually support only a single target speaker. This is undesirable, as most smart home devices now support multiple enrolled users. To extend the benefits of personalization to multiple users, we previously developed an attention-based speaker selection mechanism and applied it to VoiceFilter-Lite. However, the original multi-user VoiceFilter-Lite model suffers from significant performance degradation compared with single-user models. In this paper, we devised a series of experiments to improve the multi-user VoiceFilter-Lite model. By incorporating a dual learning rate schedule and by using feature-wise linear modulation (FiLM) to condition the model on the attended speaker embedding, we successfully closed the performance gap between multi-user and single-user VoiceFilter-Lite models on single-speaker evaluations. At the same time, the new model can also be easily extended to support any number of users, and it significantly outperforms our previously published model on multi-speaker evaluations.
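
The abstract names two model-side ideas: attending over the enrolled speakers' embeddings to form a per-frame "attended" embedding, and conditioning the separation network on that embedding via feature-wise linear modulation (FiLM). The sketch below is not the authors' implementation; it is a minimal PyTorch illustration of both pieces, where the class names, dot-product attention scoring, layer sizes, and the point at which FiLM is applied are illustrative assumptions rather than details from the paper.

```python
# Minimal sketch (not the paper's code) of attentive speaker selection + FiLM
# conditioning. All shapes and module names are illustrative assumptions.
import torch
import torch.nn as nn


class AttentiveSpeakerSelection(nn.Module):
    """Attend over N enrolled speaker embeddings using the current frame features."""

    def __init__(self, feat_dim: int, emb_dim: int):
        super().__init__()
        self.query = nn.Linear(feat_dim, emb_dim)  # project frame features to a query

    def forward(self, frames: torch.Tensor, speaker_embs: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, feat_dim); speaker_embs: (batch, num_enrolled, emb_dim)
        q = self.query(frames)                                      # (B, T, E)
        scores = torch.einsum("bte,bne->btn", q, speaker_embs)      # per-frame similarity to each speaker
        weights = torch.softmax(scores, dim=-1)                     # attention over enrolled speakers
        return torch.einsum("btn,bne->bte", weights, speaker_embs)  # attended embedding, (B, T, E)


class FiLM(nn.Module):
    """Feature-wise linear modulation: scale and shift features using the conditioning embedding."""

    def __init__(self, emb_dim: int, feat_dim: int):
        super().__init__()
        self.to_gamma = nn.Linear(emb_dim, feat_dim)
        self.to_beta = nn.Linear(emb_dim, feat_dim)

    def forward(self, features: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # features: (B, T, F); cond: (B, T, E) -> gamma, beta: (B, T, F)
        return self.to_gamma(cond) * features + self.to_beta(cond)


if __name__ == "__main__":
    B, T, F, E, N = 2, 50, 128, 256, 3   # batch, frames, feature dim, embedding dim, enrolled users
    frames = torch.randn(B, T, F)        # e.g. filterbank features of the noisy input
    enrolled = torch.randn(B, N, E)      # d-vectors of the enrolled users

    attended = AttentiveSpeakerSelection(F, E)(frames, enrolled)  # (B, T, E)
    modulated = FiLM(E, F)(frames, attended)                      # (B, T, F), fed to the separation layers
    print(modulated.shape)  # torch.Size([2, 50, 128])
```

Because the attention is computed over however many enrolled embeddings are supplied, the same module works with one user or several, which is the property that lets the model be "easily extended to support any number of users" as claimed in the abstract.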
