Personal VAD 2.0: Optimizing Personal Voice Activity Detection for On-Device Speech Recognition

04/08/2022
by   Shaojin Ding, et al.
0

Personalization of on-device speech recognition (ASR) has seen explosive growth in recent years, largely due to the increasing popularity of personal assistant features on mobile devices and smart home speakers. In this work, we present Personal VAD 2.0, a personalized voice activity detector that detects the voice activity of a target speaker, as part of a streaming on-device ASR system. Although previous proof-of-concept studies have validated the effectiveness of Personal VAD, there are still several critical challenges to address before this model can be used in production: first, the quality must be satisfactory in both enrollment and enrollment-less scenarios; second, it should operate in a streaming fashion; and finally, the model size should be small enough to fit a limited latency and CPU/Memory budget. To meet the multi-faceted requirements, we propose a series of novel designs: 1) advanced speaker embedding modulation methods; 2) a new training paradigm to generalize to enrollment-less conditions; 3) architecture and runtime optimizations for latency and resource restrictions. Extensive experiments on a realistic speech recognition system demonstrated the state-of-the-art performance of our proposed method.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
08/12/2019

Personal VAD: Speaker-Conditioned Voice Activity Detection

In this paper, we propose "personal VAD", a system to detect the voice a...
research
09/09/2020

VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device Speech Recognition

We introduce VoiceFilter-Lite, a single-channel source separation model ...
research
07/02/2021

Multi-user VoiceFilter-Lite via Attentive Speaker Embedding

In this paper, we propose a solution to allow speaker conditioned speech...
research
11/23/2020

Streaming Multi-speaker ASR with RNN-T

Recent research shows end-to-end ASR systems can recognize overlapped sp...
research
11/07/2021

Retrieving Speaker Information from Personalized Acoustic Models for Speech Recognition

The widespread of powerful personal devices capable of collecting voice ...
research
05/10/2022

Separator-Transducer-Segmenter: Streaming Recognition and Segmentation of Multi-party Speech

Streaming recognition and segmentation of multi-party conversations with...
research
04/06/2021

Flexi-Transducer: Optimizing Latency, Accuracy and Compute forMulti-Domain On-Device Scenarios

Often, the storage and computational constraints of embeddeddevices dema...

Please sign up or login with your details

Forgot password? Click here to reset