Voice activity detection in the wild: A data-driven approach using teacher-student training

05/10/2021
by Heinrich Dinkel, et al.

Voice activity detection (VAD) is an essential pre-processing component for speech-related tasks such as automatic speech recognition (ASR). Traditional supervised VAD systems obtain frame-level labels from an ASR pipeline by using, e.g., a Hidden Markov model. These ASR models are commonly trained on clean and fully transcribed data, which restricts VAD training to clean or synthetically noised datasets. A major challenge for supervised VAD systems is therefore their generalization to noisy, real-world data. This work proposes a data-driven teacher-student approach for VAD, which utilizes vast and unconstrained audio data for training. Unlike previous approaches, only weak labels are required during teacher training, enabling the use of any real-world, potentially noisy dataset. Our approach first trains a teacher model on a source dataset (Audioset) using clip-level supervision. After training, the teacher provides frame-level guidance to a student model on an unlabeled target dataset. We investigate student models trained on mid- to large-sized datasets (Audioset, Voxceleb, NIST SRE) and evaluate our approach on clean, artificially noised, and real-world data. We observe significant performance gains in artificially noised and real-world scenarios. Lastly, we compare our approach against other unsupervised and supervised VAD methods, demonstrating our method's superiority.
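The two training stages described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the mean pooling used to turn frame-level outputs into a clip-level score, the binary cross-entropy loss, the 0.5 threshold, and all function names are assumptions chosen for illustration, and the "models" here are just lists of per-frame speech probabilities.

```python
import math


def clip_probability(frame_probs):
    """Pool frame-level speech probabilities into one clip-level score.

    Mean pooling is an assumption; the abstract does not name the pooling used.
    """
    return sum(frame_probs) / len(frame_probs)


def binary_cross_entropy(predictions, targets, eps=1e-7):
    """Average binary cross-entropy between predictions and targets."""
    total = 0.0
    for p, t in zip(predictions, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(predictions)


def teacher_loss(teacher_frame_probs, clip_label):
    """Stage 1: weak supervision, one label per clip and none per frame."""
    pooled = clip_probability(teacher_frame_probs)
    return binary_cross_entropy([pooled], [clip_label])


def frame_pseudo_labels(teacher_frame_probs, threshold=0.5, hard=True):
    """Turn teacher frame outputs into frame-level targets for the student.

    hard=True thresholds the probabilities; hard=False keeps them as soft labels.
    """
    if hard:
        return [1.0 if p >= threshold else 0.0 for p in teacher_frame_probs]
    return list(teacher_frame_probs)


def student_loss(student_frame_probs, teacher_frame_probs, hard=True):
    """Stage 2: the student is supervised per frame by the teacher's outputs."""
    targets = frame_pseudo_labels(teacher_frame_probs, hard=hard)
    return binary_cross_entropy(student_frame_probs, targets)


if __name__ == "__main__":
    # Hypothetical per-frame speech probabilities for one four-frame clip.
    teacher_out = [0.1, 0.2, 0.9, 0.8]
    student_out = [0.2, 0.1, 0.7, 0.9]

    print("teacher clip-level loss:", teacher_loss(teacher_out, clip_label=1.0))
    print("frame pseudo-labels:", frame_pseudo_labels(teacher_out))
    print("student frame-level loss:", student_loss(student_out, teacher_out))
```

The point of the sketch is the asymmetry in supervision: the teacher only ever sees one label per clip, while the student receives a target for every frame, which is why the student's dataset needs no labels at all.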


