AVA-ActiveSpeaker: An Audio-Visual Dataset for Active Speaker Detection

01/05/2019
by Joseph Roth, et al.

Active speaker detection is an important component in video analysis algorithms for applications such as speaker diarization, video re-targeting for meetings, speech enhancement, and human-robot interaction. The absence of a large, carefully labeled audio-visual dataset for this task has constrained algorithm evaluations with respect to data diversity, environments, and accuracy, which has made comparisons and improvements difficult. In this paper, we present the AVA Active Speaker detection dataset (AVA-ActiveSpeaker), which will be released publicly to facilitate algorithm development and enable comparisons. The dataset contains temporally labeled face tracks in video, where each face instance is labeled as speaking or not, and whether the speech is audible. It comprises about 3.65 million human-labeled frames, or about 38.5 hours of face tracks, together with the corresponding audio. We also present a new audio-visual approach for active speaker detection and analyze its performance, demonstrating both its strength and the contributions of the dataset.
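As an illustration only, the labeling scheme described in the abstract could be represented along the following lines. This is a minimal sketch, not the dataset's actual file format: the names `SpeakingLabel`, `FaceInstance`, `track_id`, `timestamp_s`, `speaking_fraction`, and `implied_frame_rate` are all hypothetical, and the three label values are an assumption drawn from the phrase "speaking or not, and whether the speech is audible". The last line simply divides the two figures quoted in the abstract to show the average labeling rate they imply.

```python
from dataclasses import dataclass
from enum import Enum


class SpeakingLabel(Enum):
    # Hypothetical label names; the abstract only says each face is labeled
    # as speaking or not, and whether the speech is audible.
    NOT_SPEAKING = "not_speaking"
    SPEAKING_AUDIBLE = "speaking_audible"
    SPEAKING_NOT_AUDIBLE = "speaking_not_audible"


@dataclass(frozen=True)
class FaceInstance:
    track_id: str        # hypothetical identifier for the face track
    timestamp_s: float   # time of the frame within the video, in seconds
    label: SpeakingLabel


def speaking_fraction(track):
    """Fraction of instances in a track labeled as speaking (audible or not)."""
    if not track:
        return 0.0
    speaking = sum(1 for f in track if f.label is not SpeakingLabel.NOT_SPEAKING)
    return speaking / len(track)


def implied_frame_rate(num_frames, hours):
    """Average labeled frames per second implied by a frame count and duration."""
    return num_frames / (hours * 3600.0)


# A toy four-frame track (made-up values, not taken from the dataset).
track = [
    FaceInstance("t0", 0.00, SpeakingLabel.NOT_SPEAKING),
    FaceInstance("t0", 0.04, SpeakingLabel.SPEAKING_AUDIBLE),
    FaceInstance("t0", 0.08, SpeakingLabel.SPEAKING_AUDIBLE),
    FaceInstance("t0", 0.12, SpeakingLabel.SPEAKING_NOT_AUDIBLE),
]
print(speaking_fraction(track))                    # 0.75
print(round(implied_frame_rate(3.65e6, 38.5), 1))  # 26.3
```

Under these assumptions, the quoted totals work out to roughly 26 labeled frames per second of face track on average; the abstract does not state the source video frame rates, so this is only a consistency check on the two numbers.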
