Induction Network: Audio-Visual Modality Gap-Bridging for Self-Supervised Sound Source Localization

08/09/2023
by Tianyu Liu, et al.

Self-supervised sound source localization is often hindered by inconsistency between modalities. In recent studies, contrastive learning based strategies have shown promise in establishing a consistent correspondence between audio and sound sources in visual scenes. Unfortunately, insufficient attention to the influence of heterogeneity across the features of different modalities still limits further improvement of this scheme, which motivates our work. In this study, an Induction Network is proposed to bridge the modality gap more effectively. By decoupling the gradients of the visual and audio modalities, discriminative visual representations of sound sources can be learned with the designed Induction Vector in a bootstrap manner, which also enables the audio modality to be aligned consistently with the visual modality. In addition to a visual weighted contrastive loss, an adaptive threshold selection strategy is introduced to enhance the robustness of the Induction Network. Extensive experiments conducted on the SoundNet-Flickr and VGG-Sound Source datasets demonstrate superior performance compared to other state-of-the-art works in different challenging scenarios. The code is available at https://github.com/Tahy1/AVIN
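For readers unfamiliar with the contrastive baseline that the abstract builds on, the following is a minimal sketch of a generic audio-visual contrastive objective, not the Induction Network itself. It assumes a batch of paired visual feature maps and audio embeddings, and it does not implement gradient decoupling, the Induction Vector, the visual weighting, or the adaptive threshold. All function names, shapes, and the temperature value are hypothetical choices made for illustration.

```python
import numpy as np

def l2_normalize(x, axis, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def similarity_maps(visual, audio):
    """Cosine similarity between every audio clip and every image location.

    visual: (B, C, H, W) visual feature maps (assumed shape)
    audio:  (B, C) audio embeddings (assumed shape)
    returns (B, B, H, W); maps[i, j] compares audio j with image i.
    """
    v = l2_normalize(visual, axis=1)
    a = l2_normalize(audio, axis=1)
    return np.einsum("ichw,jc->ijhw", v, a)

def contrastive_loss(maps, temperature=0.07):
    """InfoNCE-style loss: each image should score highest with its own audio.

    The score of an (image, audio) pair is the maximum over spatial locations.
    """
    batch = maps.shape[0]
    scores = maps.reshape(batch, batch, -1).max(axis=-1) / temperature
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

# Toy data: the "sound source" of image i is placed at location (3, 3).
rng = np.random.default_rng(0)
visual = rng.normal(size=(4, 16, 7, 7))
audio = visual[:, :, 3, 3].copy()

maps = similarity_maps(visual, audio)
loss = contrastive_loss(maps)
# The diagonal map maps[i, i] serves as the localization heatmap for pair i.
peak = np.unravel_index(maps[0, 0].argmax(), maps[0, 0].shape)
```

In this toy setup the heatmap for a matched pair peaks where the audio embedding was copied from. In a trainable framework, the gradient decoupling described in the abstract would presumably involve stopping gradients through one branch of such a similarity computation, but that detail is an assumption and is not shown here.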


