
Closing the Gap between Single-User and Multi-User VoiceFilter-Lite

by Rajeev Rikhye, et al.

VoiceFilter-Lite is a speaker-conditioned voice separation model that plays a crucial role in improving speech recognition and speaker verification by suppressing overlapping speech from non-target speakers. However, one limitation of VoiceFilter-Lite, and other speaker-conditioned speech models in general, is that these models are usually limited to a single target speaker. This is undesirable as most smart home devices now support multiple enrolled users. In order to extend the benefits of personalization to multiple users, we previously developed an attention-based speaker selection mechanism and applied it to VoiceFilter-Lite. However, the original multi-user VoiceFilter-Lite model suffers from significant performance degradation compared with single-user models. In this paper, we devised a series of experiments to improve the multi-user VoiceFilter-Lite model. By incorporating a dual learning rate schedule and by using feature-wise linear modulation (FiLM) to condition the model with the attended speaker embedding, we successfully closed the performance gap between multi-user and single-user VoiceFilter-Lite models on single-speaker evaluations. At the same time, the new model can also be easily extended to support any number of users, and significantly outperforms our previously published model on multi-speaker evaluations.
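The two ingredients named above, attention-based speaker selection over multiple enrolled embeddings and feature-wise linear modulation (FiLM), can be illustrated with a minimal sketch. This is a hypothetical toy implementation in plain Python, not the authors' model: the function names, the dot-product attention scoring, and the linear projections `w_gamma`/`w_beta` are illustrative assumptions, and the real system operates on learned network activations rather than raw vectors.

```python
import math

# Toy sketch (illustrative, not the paper's implementation):
#  1) soft attention selects among multiple enrolled-speaker embeddings
#  2) FiLM conditions a feature vector with the attended embedding:
#     y = gamma(e) * x + beta(e), with gamma and beta linear in e.

def softmax(scores):
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, speaker_embeddings):
    """Soft-select one embedding: weights come from query-embedding scores,
    and the output is the attention-weighted sum of enrolled embeddings.
    Works for any number of enrolled speakers."""
    weights = softmax([dot(query, e) for e in speaker_embeddings])
    dim = len(speaker_embeddings[0])
    return [sum(w * e[i] for w, e in zip(weights, speaker_embeddings))
            for i in range(dim)]

def film(features, embedding, w_gamma, w_beta):
    """FiLM layer: scale and shift each feature dimension, where the
    scale (gamma) and shift (beta) are linear functions of the embedding."""
    gamma = [dot(row, embedding) for row in w_gamma]
    beta = [dot(row, embedding) for row in w_beta]
    return [g * x + b for g, x, b in zip(gamma, features, beta)]
```

Because `attend` reduces any set of enrolled embeddings to a single fixed-size vector before FiLM is applied, the same conditioning pathway covers one user or many, which matches the paper's claim that the model extends to any number of users.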


