STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events

06/15/2023
by   Kazuki Shimada, et al.
While the direction of arrival (DOA) of sound events is generally estimated from multichannel audio data recorded with a microphone array, sound events usually derive from visually perceptible source objects, e.g., the sound of footsteps comes from the feet of a walker. This paper proposes an audio-visual sound event localization and detection (SELD) task, which uses multichannel audio and video information to estimate the temporal activation and DOA of target sound events. Audio-visual SELD systems can detect and localize sound events using both signals from a microphone array and audio-visual correspondence. We also introduce an audio-visual dataset, Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23), which consists of multichannel audio data recorded with a microphone array, video data, and spatiotemporal annotations of sound events. Sound scenes in STARSS23 are recorded under instructions that guide participants to ensure adequate activity and occurrences of sound events. STARSS23 also provides human-annotated temporal activation labels and human-confirmed DOA labels, which are based on the tracking results of a motion capture system. Our benchmark results show that the audio-visual SELD system achieves lower localization error than the audio-only system. The data is available at https://zenodo.org/record/7880637.
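As a concrete illustration of the localization side of SELD, DOA labels given as azimuth/elevation angles are commonly converted to unit Cartesian vectors, and localization error is then the angle between an estimated and a reference vector. The sketch below is illustrative only (the function names and the degree-based azimuth/elevation convention are assumptions, not the dataset's official API):

```python
import numpy as np

def doa_to_cartesian(azimuth_deg, elevation_deg):
    """Convert a DOA given as azimuth/elevation (degrees) to a unit 3D vector.

    Assumed convention: azimuth measured in the horizontal plane from the
    x-axis, elevation measured upward from that plane.
    """
    az = np.radians(azimuth_deg)
    el = np.radians(elevation_deg)
    return np.array([
        np.cos(el) * np.cos(az),  # x
        np.cos(el) * np.sin(az),  # y
        np.sin(el),               # z
    ])

def angular_error_deg(v_est, v_ref):
    """Angle in degrees between an estimated and a reference DOA vector."""
    cos_sim = np.dot(v_est, v_ref) / (np.linalg.norm(v_est) * np.linalg.norm(v_ref))
    # Clip to guard against floating-point values slightly outside [-1, 1].
    return np.degrees(np.arccos(np.clip(cos_sim, -1.0, 1.0)))

# Example: a source straight ahead vs. one 90 degrees to the side.
err = angular_error_deg(doa_to_cartesian(0, 0), doa_to_cartesian(90, 0))
# err is 90.0 degrees
```

A benchmark's "localization error" is typically an average of such angular errors over correctly detected events, so a lower value means the system points closer to the true source direction.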

Related research

06/04/2022
STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events
This report presents the Sony-TAu Realistic Spatial Soundscapes 2022 (ST...

10/02/2020
AVECL-UMONS database for audio-visual event classification and localization
We introduce the AVECL-UMons dataset for audio-visual event classificati...

07/20/2021
FoleyGAN: Visually Guided Generative Adversarial Network-Based Synchronous Sound Generation in Silent Videos
Deep learning based visual to sound generation systems essentially need ...

04/04/2022
Dual Quaternion Ambisonics Array for Six-Degree-of-Freedom Acoustic Representation
Spatial audio methods are gaining a growing interest due to the spread o...

03/14/2020
Audio-Visual Spatial Alignment Requirements of Central and Peripheral Object Events
Immersive audio-visual perception relies on the spatial integration of b...

08/18/2023
Spatial LibriSpeech: An Augmented Dataset for Spatial Audio Learning
We present Spatial LibriSpeech, a spatial audio dataset with over 650 ho...

06/01/2021
Dual Normalization Multitasking for Audio-Visual Sounding Object Localization
Although several research works have been reported on audio-visual sound...
