AVA-Speech: A Densely Labeled Dataset of Speech Activity in Movies

08/02/2018
by Sourish Chaudhuri, et al.

Speech activity detection (or endpointing) is an important processing step for applications such as speech recognition, language identification and speaker diarization. Both audio- and vision-based approaches have been used for this task in various settings, often tailored toward end applications. However, much of the prior work reports results in synthetic settings, on task-specific datasets, or on datasets that are not openly available. This makes it difficult to compare approaches and understand their strengths and weaknesses. In this paper, we describe a new dataset, which we will release publicly, containing densely labeled speech activity in YouTube videos, with the goal of creating a shared, available dataset for this task. The labels in the dataset annotate three different speech activity conditions: clean speech, speech co-occurring with music, and speech co-occurring with noise. These conditions enable analysis of model performance in more challenging settings, where speech overlaps with music or noise. We report benchmark performance numbers on AVA-Speech using off-the-shelf, state-of-the-art audio and vision models, which serve as baselines to facilitate future research.
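As a rough illustration of how condition-specific labels could support the analysis described above, the sketch below breaks down the recall of a binary speech detector by condition. Everything in it is an assumption made for illustration: the label strings, the `(start, end, label)` segment layout, the frame size, and the helper names `labels_to_frames` and `recall_by_condition` are hypothetical and do not describe the released file format or the authors' evaluation protocol.

```python
from collections import defaultdict

# Hypothetical label names for the three speech conditions in the abstract.
SPEECH_CONDITIONS = ("CLEAN_SPEECH", "SPEECH_WITH_MUSIC", "SPEECH_WITH_NOISE")


def labels_to_frames(segments, frame_sec, total_sec):
    """Expand (start, end, label) segments into one label per fixed-size frame.

    Frames not covered by any segment are treated as NO_SPEECH (an assumption).
    """
    n_frames = int(round(total_sec / frame_sec))
    frames = ["NO_SPEECH"] * n_frames
    for start, end, label in segments:
        first = int(round(start / frame_sec))
        last = min(n_frames, int(round(end / frame_sec)))
        for i in range(first, last):
            frames[i] = label
    return frames


def recall_by_condition(frame_labels, predictions):
    """Fraction of speech frames detected as speech, split by condition."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for label, pred in zip(frame_labels, predictions):
        if label in SPEECH_CONDITIONS:
            totals[label] += 1
            hits[label] += int(bool(pred))
    return {cond: hits[cond] / totals[cond] for cond in totals}


if __name__ == "__main__":
    # Toy annotation for a 4-second clip, using 0.5-second frames.
    segments = [
        (0.0, 1.0, "CLEAN_SPEECH"),
        (1.0, 2.0, "SPEECH_WITH_MUSIC"),
        (3.0, 4.0, "SPEECH_WITH_NOISE"),
    ]
    frame_labels = labels_to_frames(segments, frame_sec=0.5, total_sec=4.0)
    # Toy detector output: 1 means "speech", 0 means "no speech".
    predictions = [1, 1, 1, 0, 0, 1, 0, 0]
    print(recall_by_condition(frame_labels, predictions))
```

In this toy example the detector finds every clean-speech frame, half of the speech-with-music frames, and none of the speech-with-noise frames, which is the kind of per-condition gap that dense, condition-aware labels would make visible.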

research
11/02/2021

AVASpeech-SMAD: A Strongly Labelled Speech and Music Activity Detection Dataset with Label Co-Occurrence

We propose a dataset, AVASpeech-SMAD, to assist speech and music activit...
research
12/09/2021

X-Vector based voice activity detection for multi-genre broadcast speech-to-text

Voice Activity Detection (VAD) is a fundamental preprocessing step in au...
research
04/13/2018

Voices Obscured in Complex Environmental Settings (VOICES) corpus

This paper introduces the Voices Obscured In Complex Environmental Setti...
research
09/06/2023

RoDia: A New Dataset for Romanian Dialect Identification from Speech

Dialect identification is a critical task in speech processing and langu...
research
03/01/2019

KT-Speech-Crawler: Automatic Dataset Construction for Speech Recognition from YouTube Videos

In this paper, we describe KT-Speech-Crawler: an approach for automatic ...
research
10/06/2020

Digital Voicing of Silent Speech

In this paper, we consider the task of digitally voicing silent speech, ...
research
03/10/2022

KSoF: The Kassel State of Fluency Dataset – A Therapy Centered Dataset of Stuttering

Stuttering is a complex speech disorder that negatively affects an indiv...
