A Closer Look at Weakly-Supervised Audio-Visual Source Localization

08/30/2022
by Shentong Mo, et al.

Audio-visual source localization is a challenging task that aims to predict the location of visual sound sources in a video. Because collecting ground-truth annotations of sounding objects is costly, many weakly-supervised localization methods have been proposed in recent years; these learn from datasets with no bounding-box annotations by leveraging the natural co-occurrence of audio and visual signals. Despite significant interest, popular evaluation protocols have two major flaws. First, they allow a fully annotated dataset to be used for early stopping, which significantly increases the annotation effort required for training. Second, current evaluation metrics assume that sound sources are present at all times. This assumption is unrealistic, so better metrics are needed to capture a model's performance on negative samples, that is, samples with no visible sound sources. To this end, we extend the test sets of two popular benchmarks, Flickr SoundNet and VGG-Sound Source, to include negative samples, and measure performance using metrics that balance localization accuracy and recall. Using the new protocol, we conducted an extensive evaluation of prior methods and found that most are not capable of identifying negatives and suffer from significant overfitting (they rely heavily on early stopping for their best results). We also propose a new approach for visual sound source localization that addresses both of these problems. In particular, we found that, through extreme visual dropout and the use of momentum encoders, the proposed approach combats overfitting effectively and establishes new state-of-the-art performance on both Flickr SoundNet and VGG-Sound Source. Code and pre-trained models are available at https://github.com/stoneMo/SLAVC.
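To make two ideas from the abstract concrete, here is a minimal sketch. It is not taken from the paper or the SLAVC repository. It illustrates (1) one plausible way to score a test set that contains negative samples, using an F1-style balance of precision and recall, and (2) the generic exponential-moving-average update usually meant by "momentum encoder". The names `Prediction`, `f1_at_threshold`, and `ema_update`, the IoU threshold of 0.5, and the momentum value are all assumptions made for illustration; the paper's exact metric definitions may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Prediction:
    """One test sample (hypothetical structure, for illustration only)."""
    confidence: float  # model's score that a visible sound source exists
    iou: float         # overlap with the ground-truth region; 0.0 for negatives
    has_source: bool   # True if the sample really contains a visible source


def f1_at_threshold(preds: List[Prediction],
                    conf_thresh: float,
                    iou_thresh: float = 0.5) -> float:
    """F1-style score that penalizes both missed sources and false alarms.

    A sample counts as a true positive only if the model predicts a source
    (confidence >= conf_thresh), a source is actually present, and the
    predicted location overlaps the ground truth well enough.
    """
    predicted_pos = [p for p in preds if p.confidence >= conf_thresh]
    true_pos = [p for p in predicted_pos
                if p.has_source and p.iou >= iou_thresh]
    n_pos = sum(1 for p in preds if p.has_source)
    if not predicted_pos or n_pos == 0 or not true_pos:
        return 0.0
    precision = len(true_pos) / len(predicted_pos)
    recall = len(true_pos) / n_pos
    return 2 * precision * recall / (precision + recall)


def ema_update(target: List[float],
               online: List[float],
               momentum: float = 0.99) -> List[float]:
    """Generic momentum (EMA) update: target <- m * target + (1 - m) * online."""
    return [momentum * t + (1.0 - momentum) * o
            for t, o in zip(target, online)]


if __name__ == "__main__":
    samples = [
        Prediction(confidence=0.9, iou=0.8, has_source=True),   # correct
        Prediction(confidence=0.8, iou=0.3, has_source=True),   # poorly localized
        Prediction(confidence=0.7, iou=0.0, has_source=False),  # false alarm
        Prediction(confidence=0.2, iou=0.0, has_source=False),  # correctly rejected
    ]
    print(f1_at_threshold(samples, conf_thresh=0.5))
    print(ema_update([1.0, 2.0], [0.0, 4.0], momentum=0.9))
```

Under this scoring, a model that always predicts a source is penalized through precision on the negative samples, which a metric computed only on positive samples cannot capture.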
