GPVAD: Towards noise robust voice activity detection via weakly supervised sound event detection

03/27/2020
by Heinrich Dinkel, et al.

Traditional voice activity detection (VAD) methods work well in clean and controlled scenarios, but their performance degrades severely in real-world applications. One possible bottleneck is that such supervised VAD training requires clean training data and frame-level labels. In contrast, we propose the GPVAD framework, which can be easily trained from noisy data in a weakly supervised fashion, requiring only clip-level labels. We propose two GPVAD models: a full model (GPV-F), which outputs all possible sound events, and a binary model (GPV-B), which only separates speech from noise. We evaluate the two GPVAD models and a CRNN-based standard VAD model (VAD-C) under three evaluation protocols (clean, synthetic noise, real). Results show that GPV-F is competitive with the traditional VAD-C in clean and noisy scenarios. In the real-world evaluation, GPV-F largely outperforms VAD-C on both frame-level and segment-level metrics. Despite its much lower data requirements, the naive binary clip-level GPV-B model still achieves performance comparable to VAD-C in real-world scenarios.
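To make the idea of training from clip-level labels more concrete, the following is a minimal sketch of how per-frame speech probabilities could be pooled into a single clip-level prediction and scored against a clip label. The abstract does not state the pooling function or the loss, so linear softmax pooling and binary cross-entropy are assumptions here, and the names `clip_probability`, `clip_bce`, and `speech_frames` are hypothetical.

```python
import math


def clip_probability(frame_probs):
    """Pool per-frame probabilities into one clip-level probability.

    Assumption: linear softmax pooling, sum(p^2) / sum(p), which weights
    confident frames more heavily. The abstract does not name a pooling method.
    """
    total = sum(frame_probs)
    if total == 0.0:
        return 0.0
    return sum(p * p for p in frame_probs) / total


def clip_bce(frame_probs, clip_label, eps=1e-7):
    """Binary cross-entropy between the pooled probability and a clip label.

    Only the clip label (0 or 1) is needed; no frame-level labels are used.
    """
    p = min(max(clip_probability(frame_probs), eps), 1.0 - eps)
    return -(clip_label * math.log(p) + (1 - clip_label) * math.log(1.0 - p))


def speech_frames(frame_probs, threshold=0.5):
    """At inference time, threshold the frame probabilities to get a VAD mask."""
    return [p >= threshold for p in frame_probs]
```

In this sketch the loss depends only on the clip label, while the frame-level output remains available at inference time for thresholding into speech and non-speech frames.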

research
03/27/2020

Voice activity detection in the wild via weakly supervised sound event detection

Traditional supervised voice activity detection (VAD) methods work well ...
research
05/10/2021

Voice activity detection in the wild: A data-driven approach using teacher-student training

Voice activity detection is an essential pre-processing component for sp...
research
01/19/2021

Towards duration robust weakly supervised sound event detection

Sound event detection (SED) is the task of tagging the absence or presen...
research
12/17/2020

DenoiSpeech: Denoising Text to Speech with Frame-Level Noise Modeling

While neural-based text to speech (TTS) models can synthesize natural an...
research
08/18/2019

Weakly Supervised Segmentation by A Deep Geodesic Prior

The performance of the state-of-the-art image segmentation methods heavi...
research
02/18/2023

One-Pot Multi-Frame Denoising

The performance of learning-based denoising largely depends on clean sup...
research
09/14/2022

Few Clean Instances Help Denoising Distant Supervision

Existing distantly supervised relation extractors usually rely on noisy ...
