X-Vector based voice activity detection for multi-genre broadcast speech-to-text

12/09/2021
by Misa Ogura, et al.

Voice Activity Detection (VAD) is a fundamental preprocessing step in automatic speech recognition. It is especially important in the broadcast industry, where a wide variety of audio materials and recording conditions are encountered. Building on previous studies indicating that x-vector embeddings can be applied to a diverse set of audio classification tasks, we investigate the suitability of x-vectors for discriminating speech from noise. We find that the proposed x-vector based VAD system achieves the best reported score in detecting clean speech on AVA-Speech, whilst retaining robust VAD performance in the presence of noise and music. Furthermore, we integrate the x-vector based VAD system into an existing speech-to-text (STT) pipeline and compare its performance on multiple broadcast datasets against a baseline system with WebRTC VAD. Crucially, our proposed x-vector based VAD improves the accuracy of STT transcription on real-world broadcast audio.

research
10/24/2022

Brouhaha: multi-task training for voice activity detection, speech-to-noise ratio, and C50 room acoustics estimation

Most automatic speech processing systems are sensitive to the acoustic e...
research
08/02/2018

AVA-Speech: A Densely Labeled Dataset of Speech Activity in Movies

Speech activity detection (or endpointing) is an important processing st...
research
02/03/2020

End-to-End Automatic Speech Recognition Integrated With CTC-Based Voice Activity Detection

This paper integrates a voice activity detection (VAD) function with end...
research
12/14/2020

AV Taris: Online Audio-Visual Speech Recognition

In recent years, Automatic Speech Recognition (ASR) technology has appro...
research
03/04/2021

A Neural Text-to-Speech Model Utilizing Broadcast Data Mixed with Background Music

Recently, it has become easier to obtain speech data from various media ...
research
07/13/2022

MM-ALT: A Multimodal Automatic Lyric Transcription System

Automatic lyric transcription (ALT) is a nascent field of study attracti...
research
12/06/2022

BC-VAD: A Robust Bone Conduction Voice Activity Detection

Voice Activity Detection (VAD) is a fundamental module in many audio app...
