Audio-Visual Spatial Integration and Recursive Attention for Robust Sound Source Localization

08/11/2023
by   Sung Jin Um, et al.
0

The objective of the sound source localization task is to enable machines to detect the location of sound-making objects within a visual scene. While the audio modality provides spatial cues to locate the sound source, existing approaches only use audio as an auxiliary role to compare spatial regions of the visual modality. Humans, on the other hand, utilize both audio and visual modalities as spatial cues to locate sound sources. In this paper, we propose an audio-visual spatial integration network that integrates spatial cues from both modalities to mimic human behavior when detecting sound-making objects. Additionally, we introduce a recursive attention network to mimic human behavior of iterative focusing on objects, resulting in more accurate attention regions. To effectively encode spatial information from both modalities, we propose audio-visual pair matching loss and spatial region alignment loss. By utilizing the spatial cues of audio-visual modalities and recursively focusing objects, our method can perform more robust sound source localization. Comprehensive experimental results on the Flickr SoundNet and VGG-Sound Source datasets demonstrate the superiority of our proposed method over existing approaches. Our code is available at: https://github.com/VisualAIKHU/SIRA-SSL

READ FULL TEXT

page 1

page 3

page 7

research
08/09/2023

Induction Network: Audio-Visual Modality Gap-Bridging for Self-Supervised Sound Source Localization

Self-supervised sound source localization is usually challenged by the m...
research
11/06/2022

Hear The Flow: Optical Flow-Based Self-Supervised Visual Sound Source Localization

Learning to localize the sound source in videos without explicit annotat...
research
03/07/2022

HRTF measurement for accurate identification of binaural sound localization cues

Although various research institutes have measured head-related transfer...
research
02/13/2022

Visual Sound Localization in the Wild by Cross-Modal Interference Erasing

The task of audio-visual sound source localization has been well studied...
research
03/09/2020

Semantic Object Prediction and Spatial Sound Super-Resolution with Binaural Sounds

Humans can robustly recognize and localize objects by integrating visual...
research
09/06/2021

Binaural SoundNet: Predicting Semantics, Depth and Motion with Binaural Sounds

Humans can robustly recognize and localize objects by using visual and/o...
research
08/18/2023

Spatial LibriSpeech: An Augmented Dataset for Spatial Audio Learning

We present Spatial LibriSpeech, a spatial audio dataset with over 650 ho...

Please sign up or login with your details

Forgot password? Click here to reset