Cross-domain Voice Activity Detection with Self-Supervised Representations

09/22/2022
by Sina Alisamir, et al.

Voice Activity Detection (VAD) aims at detecting speech segments in an audio signal, a necessary first step for many of today's speech-based applications. Current state-of-the-art methods train a neural network on features computed directly from the acoustic signal, such as Mel Filter Banks (MFBs). Such methods therefore require an extra normalisation step to adapt to a new domain where the acoustics differ, for example because of a change of speaker, microphone, or environment. In addition, this normalisation step is usually rudimentary and highly sensitive to the amount of data available for the new domain. Here, we use the crowd-sourced Common Voice (CV) corpus to show that representations based on Self-Supervised Learning (SSL) adapt well to different domains, because they are contextualised representations learned from speech across multiple domains. SSL-based systems also achieve better results than systems based on hand-crafted features (MFBs) and off-the-shelf VADs, with significant improvements in cross-domain settings.
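To make the contrast between the two front ends concrete, the following is a minimal sketch, not the authors' exact system, of a frame-level VAD built once on frozen SSL representations (wav2vec 2.0 via torchaudio) and once on hand-crafted MFB features. The FrameVAD head, the 40-band MFB front end, and the crude mean/variance normalisation are illustrative assumptions.

```python
# Minimal sketch: frame-level VAD on SSL features vs. Mel Filter Banks (MFBs).
# Assumes PyTorch + torchaudio; the classifier head and feature settings are illustrative.

import torch
import torch.nn as nn
import torchaudio


class FrameVAD(nn.Module):
    """Per-frame speech/non-speech classifier on top of a frozen feature extractor."""

    def __init__(self, feat_dim: int, hidden: int = 128):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(feat_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # one logit per frame: speech vs. non-speech
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames, feat_dim) -> (batch, frames) speech probabilities
        return torch.sigmoid(self.head(feats)).squeeze(-1)


# --- SSL front end: frozen wav2vec 2.0 base encoder (downloads weights on first run) ---
bundle = torchaudio.pipelines.WAV2VEC2_BASE
ssl_model = bundle.get_model().eval()          # frozen SSL encoder, 768-dim outputs
vad_ssl = FrameVAD(feat_dim=768)

waveform = torch.randn(1, 16000)               # 1 s of dummy audio at the 16 kHz bundle rate
with torch.no_grad():
    ssl_feats, _ = ssl_model(waveform)         # (1, frames, 768) contextualised features
speech_prob_ssl = vad_ssl(ssl_feats)           # per-frame speech probability

# --- Hand-crafted front end: MFBs, which typically need per-domain normalisation ---
mfb = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=40)
mfb_feats = mfb(waveform).transpose(1, 2)      # (1, frames, 40)
mfb_feats = (mfb_feats - mfb_feats.mean()) / (mfb_feats.std() + 1e-8)  # crude normalisation
vad_mfb = FrameVAD(feat_dim=40)
speech_prob_mfb = vad_mfb(mfb_feats)
```

In the cross-domain setting the abstract describes, it is the MFB branch whose normalisation statistics would have to be re-estimated on the new domain, which is the limitation the paper points to; the frozen SSL features are already contextualised across varied recording conditions.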

Related research

10/26/2022 · Multitask Detection of Speaker Changes, Overlapping Speech and Voice Activity Using wav2vec 2.0
Self-supervised learning approaches have lately achieved great success o...

06/11/2023 · Mandarin Electrolaryngeal Speech Voice Conversion using Cross-domain Features
Patients who have had their entire larynx removed, including the vocal f...

04/24/2023 · Deep Audio-Visual Singing Voice Transcription based on Self-Supervised Learning Models
Singing voice transcription converts recorded singing audio to musical n...

10/27/2021 · Neural Analysis and Synthesis: Reconstructing Speech from Self-Supervised Representations
We present a neural analysis and synthesis (NANSY) framework that can ma...

05/10/2022 · Domain Invariant Masked Autoencoders for Self-supervised Learning from Multi-domains
Generalizing learned representations across significantly different visu...

07/03/2023 · RobustL2S: Speaker-Specific Lip-to-Speech Synthesis exploiting Self-Supervised Representations
Significant progress has been made in speaker dependent Lip-to-Speech sy...
