Learning to Localize Sound Sources in Visual Scenes: Analysis and Applications

11/20/2019
by Arda Senocak, et al.

Visual events are usually accompanied by sounds in our daily lives. But can machines learn to correlate a visual scene with its sound and localize the sound source merely by observing them, as humans do? To investigate this question empirically, we first present a novel unsupervised algorithm for localizing sound sources in visual scenes. To this end, we develop a two-stream network that processes each modality and couples them with an attention mechanism for sound source localization; the network reveals localized responses in the scene without any human annotation. In addition, we build a new sound source dataset for performance evaluation. However, our empirical evaluation shows that the unsupervised method draws false conclusions in some cases, and that these cannot be corrected without human prior knowledge, since the model cannot distinguish correlation from causation on its own. To address this issue, we extend our network to supervised and semi-supervised settings through a simple modification enabled by the general architecture of the two-stream network. We show that the false conclusions can be effectively corrected even with a small amount of supervision, i.e., in the semi-supervised setup. Furthermore, we demonstrate the versatility of the learned audio and visual embeddings on cross-modal content alignment, and we extend the proposed algorithm to a new application: sound-saliency-based automatic camera view panning in 360-degree videos.
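To make the described architecture concrete, below is a minimal sketch of a two-stream audio-visual network with an attention mechanism for localization, in the spirit of the abstract. This is not the authors' implementation: the class name, backbones, feature dimensions, and the cosine-similarity attention map are all illustrative assumptions for the unsupervised setting.

```python
# Hypothetical sketch of a two-stream audio-visual attention model
# (names, dimensions, and backbones are assumptions, not the paper's exact design).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamLocalizer(nn.Module):
    def __init__(self, embed_dim=512):
        super().__init__()
        # Visual stream: a small conv encoder standing in for a CNN backbone;
        # it outputs a spatial feature map (one embedding per location).
        self.visual = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, embed_dim, 3, stride=2, padding=1),
        )
        # Audio stream: encodes a (1-channel) spectrogram into a single vector.
        self.audio = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, embed_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )

    def forward(self, image, spectrogram):
        v = F.normalize(self.visual(image), dim=1)       # (B, D, H, W)
        a = F.normalize(self.audio(spectrogram), dim=1)  # (B, D)
        # Attention map: similarity between the audio embedding and every
        # spatial location; high values indicate the likely sound source.
        att = torch.einsum('bdhw,bd->bhw', v, a)         # (B, H, W)
        att = torch.softmax(att.flatten(1), dim=1).view_as(att)
        # Attended visual vector, scored against the audio embedding; a
        # contrastive loss over matched/mismatched pairs could train this
        # without location labels (the unsupervised setting).
        z = torch.einsum('bdhw,bhw->bd', v, att)
        score = (z * a).sum(dim=1)
        return att, score

# Usage: the attention map serves as the localization output.
model = TwoStreamLocalizer()
att, score = model(torch.randn(2, 3, 224, 224), torch.randn(2, 1, 128, 128))
print(att.shape, score.shape)  # torch.Size([2, 28, 28]) torch.Size([2])
```

The supervised and semi-supervised extensions mentioned in the abstract would, under this sketch, add a loss on the attention map for the subset of samples with human-annotated source locations, which is what a shared two-stream architecture makes straightforward.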
