Mix and Localize: Localizing Sound Sources in Mixtures

11/28/2022
by Xixi Hu, et al.

We present a method for simultaneously localizing multiple sound sources within a visual scene. This task requires a model both to group a sound mixture into individual sources and to associate each source with a visual signal. Our method solves both tasks jointly, using a formulation inspired by the contrastive random walk of Jabri et al. We create a graph in which images and separated sounds correspond to nodes, and train a random walker to transition between nodes from different modalities with high return probability. The transition probabilities for this walk are determined by an audio-visual similarity metric that is learned by our model. We show through experiments with musical instruments and human speech that our model can successfully localize multiple sounds, outperforming other self-supervised methods. Project site: https://hxixixh.github.io/mix-and-localize
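
The cross-modal walk described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the cosine-similarity metric, the temperature value, and the visual-to-audio-to-visual walk order are all assumptions chosen for illustration, and the embeddings are assumed to come from some upstream encoders. Each transition matrix is a softmax over audio-visual similarities, and the loss rewards a walker that returns to its starting node after one round trip.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def row_softmax(scores, temperature):
    """Numerically stable softmax over the last axis, with a temperature."""
    z = scores / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def cycle_walk_loss(visual_emb, audio_emb, temperature=0.07):
    """Return-probability loss for a visual -> audio -> visual walk.

    visual_emb: (n, d) array, one row per visual node (hypothetical encoder output).
    audio_emb:  (k, d) array, one row per separated-sound node (hypothetical).
    """
    v = l2_normalize(visual_emb)
    a = l2_normalize(audio_emb)
    sim = v @ a.T                               # (n, k) audio-visual similarity
    p_va = row_softmax(sim, temperature)        # visual -> audio transitions
    p_av = row_softmax(sim.T, temperature)      # audio -> visual transitions
    p_return = p_va @ p_av                      # (n, n) round-trip probabilities
    # Cross-entropy against the identity: each walker should end where it began.
    loss = -np.mean(np.log(np.diag(p_return) + 1e-12))
    return loss, p_return


# Distinct, matching embeddings give a near-zero loss; identical embeddings
# give a uniform walk whose return probability is 1/n.
matched_loss, _ = cycle_walk_loss(np.eye(3), np.eye(3))
uniform_loss, uniform_return = cycle_walk_loss(np.ones((3, 4)), np.ones((3, 4)))
```

In a trainable version the embeddings would be produced by learned audio and visual networks, and this loss would be minimized by gradient descent in an automatic-differentiation framework; the NumPy form here only shows how the transition and return probabilities are assembled.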

research
03/29/2023

Audio-Visual Grouping Network for Sound Localization from Mixtures

Sound source localization is a typical and challenging task that predict...
research
04/26/2022

Sound Localization by Self-Supervised Time Delay Estimation

Sounds reach one microphone in a stereo pair sooner than the other, resu...
research
03/20/2023

Sound Localization from Motion: Jointly Learning Sound Direction and Camera Rotation

The images and sounds that we perceive undergo subtle but geometrically ...
research
01/20/2022

Learning Pixel Trajectories with Multiscale Contrastive Random Walks

A range of video modeling tasks, from optical flow to multiple object tr...
research
06/25/2020

Space-Time Correspondence as a Contrastive Random Walk

This paper proposes a simple self-supervised approach for learning repre...
research
04/04/2022

Object Permanence Emerges in a Random Walk along Memory

This paper proposes a self-supervised objective for learning representat...
research
08/30/2019

Recursive Visual Sound Separation Using Minus-Plus Net

Sounds provide rich semantics, complementary to visual data, for many ta...
