Personal VAD: Speaker-Conditioned Voice Activity Detection

08/12/2019
by   Shaojin Ding, et al.
0

In this paper, we propose "personal VAD", a system to detect the voice activity of a target speaker at the frame level. This system is useful for gating the inputs to a streaming speech recognition system, such that it only triggers for the target user, which helps reduce the computational cost and battery consumption. We achieve this by training a VAD-alike neural network that is conditioned on the target speaker embedding or the speaker verification score. For every frame, personal VAD outputs the scores for three classes: non-speech, target speaker speech, and non-target speaker speech. With our optimal setup, we are able to train a 130KB model that outperforms a baseline system where individually trained standard VAD and speaker recognition network are combined to perform the same task.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
10/11/2018

VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking

In this paper, we present a novel system that separates the voice of a t...
research
04/08/2022

Personal VAD 2.0: Optimizing Personal Voice Activity Detection for On-Device Speech Recognition

Personalization of on-device speech recognition (ASR) has seen explosive...
research
09/26/2019

Self-Adaptive Soft Voice Activity Detection using Deep Neural Networks for Robust Speaker Verification

Voice activity detection (VAD), which classifies frames as speech or non...
research
09/21/2020

End-to-End Speaker-Dependent Voice Activity Detection

Voice activity detection (VAD) is an essential pre-processing step for t...
research
02/24/2022

Closing the Gap between Single-User and Multi-User VoiceFilter-Lite

VoiceFilter-Lite is a speaker-conditioned voice separation model that pl...
research
06/28/2021

Sparsely Overlapped Speech Training in the Time Domain: Joint Learning of Target Speech Separation and Personal VAD Benefits

Target speech separation is the process of filtering a certain speaker's...
research
10/28/2022

Laugh Betrays You? Learning Robust Speaker Representation From Speech Containing Non-Verbal Fragments

The success of automatic speaker verification shows that discriminative ...

Please sign up or login with your details

Forgot password? Click here to reset