Localizing Visual Sounds the Hard Way

04/06/2021
by Honglie Chen, et al.

The objective of this work is to localize sound sources that are visible in a video without using manual annotations. Our key technical contribution is to show that training the network to explicitly discriminate challenging image fragments, even in images that do contain the object emitting the sound, significantly boosts localization performance. We do so by introducing a mechanism that automatically mines hard samples and adds them to a contrastive learning formulation. We show that our algorithm achieves state-of-the-art performance on the popular Flickr SoundNet dataset. Furthermore, we introduce the VGG-Sound Source (VGG-SS) benchmark, a new set of annotations for the recently introduced VGG-Sound dataset, in which the sound sources visible in each video clip are explicitly marked with bounding box annotations. This dataset is 20 times larger than analogous existing ones, contains 5K videos spanning over 200 categories, and, unlike Flickr SoundNet, is video-based. On VGG-SS, we also show that our algorithm achieves state-of-the-art performance against several baselines.
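The idea of mining hard samples inside the sounding image itself can be illustrated with a small sketch. The NumPy code below is not the authors' implementation; it is one plausible reading of the abstract, and the function names, the two thresholds, the soft-mask sharpness, and the temperature are all assumptions chosen for illustration. It scores each image location against the audio embedding, treats high-response regions of the paired image as positives, low-response regions of that same image as hard negatives, and other images in the batch as easy negatives.

```python
import numpy as np


def l2_normalize(x, axis, eps=1e-8):
    """Scale vectors along `axis` to unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def hard_negative_contrastive_loss(visual, audio, pos_thresh=0.6,
                                   neg_thresh=0.4, sharpness=0.03,
                                   temperature=0.07):
    """Contrastive loss with hard negatives mined from the paired image.

    visual: (B, C, H, W) feature maps; audio: (B, C) embeddings.
    All hyperparameter defaults are illustrative assumptions.
    """
    v = l2_normalize(visual, axis=1)
    a = l2_normalize(audio, axis=1)

    # sim[i, j, h, w]: cosine similarity of audio i with location (h, w) of image j.
    sim = np.einsum("ic,jchw->ijhw", a, v)
    batch = sim.shape[0]
    idx = np.arange(batch)
    own = sim[idx, idx]  # (B, H, W): each audio clip against its own frame

    # Soft masks: confident regions are positives, low-response regions of
    # the same image act as hard negatives; the band in between is ignored.
    m_pos = sigmoid((own - pos_thresh) / sharpness)
    m_neg = 1.0 - sigmoid((own - neg_thresh) / sharpness)

    pos = (m_pos * own).sum(axis=(1, 2)) / m_pos.sum(axis=(1, 2))
    hard = (m_neg * own).sum(axis=(1, 2)) / m_neg.sum(axis=(1, 2))

    # Easy negatives: average response of audio i over every other image j.
    easy = sim.mean(axis=(2, 3))
    easy = np.where(np.eye(batch, dtype=bool), -np.inf, easy)

    logits = np.concatenate([pos[:, None], hard[:, None], easy], axis=1)
    logits = logits / temperature

    # Numerically stable -log softmax of the positive entry.
    peak = logits.max(axis=1, keepdims=True)
    log_norm = peak[:, 0] + np.log(np.exp(logits - peak).sum(axis=1))
    return float(np.mean(log_norm - logits[:, 0]))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(4, 16, 7, 7))   # hypothetical visual features
    clips = rng.normal(size=(4, 16))         # hypothetical audio embeddings
    print(hard_negative_contrastive_loss(feats, clips))
```

In this sketch the region between the two thresholds contributes to neither mask, so ambiguous locations neither attract nor repel the audio embedding. Whether that matches the paper's exact formulation should be checked against the full text.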
