Learning to Separate Object Sounds by Watching Unlabeled Video

04/05/2018
by   Ruohan Gao, et al.

Perceiving a scene most fully requires all the senses. Yet modeling how objects look and sound is challenging: most natural scenes and events contain multiple objects, and the audio track mixes all the sound sources together. We propose to learn audio-visual object models from unlabeled video, then exploit the visual context to perform audio source separation in novel videos. Our approach relies on a deep multi-instance multi-label learning framework to disentangle the audio frequency bases that map to individual visual objects, even without observing/hearing those objects in isolation. We show how the recovered disentangled bases can be used to guide audio source separation to obtain better-separated, object-level sounds. Our work is the first to study audio source separation in large-scale general "in the wild" videos. We obtain state-of-the-art results on visually-aided audio source separation and audio denoising.
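The idea of using per-object frequency bases to guide separation can be illustrated with a small sketch. The abstract does not say how the bases are computed, so this example assumes non-negative matrix factorization (NMF) of a magnitude spectrogram with multiplicative updates. It also assumes the bases for each object are already known, and it leaves out the multi-instance multi-label network that would assign bases to objects. The names `nmf`, `separate`, and `bases_per_object` are hypothetical and are not taken from the paper.

```python
import numpy as np

def nmf(V, k, n_iter=200, W_fixed=None, seed=0, eps=1e-9):
    """Factor a nonnegative matrix V (freq x time) as W @ H.

    Uses multiplicative updates for the squared-error objective.
    If W_fixed is given, the bases stay fixed and only H is learned.
    """
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], k)) + eps if W_fixed is None else W_fixed
    H = rng.random((W.shape[1], V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        if W_fixed is None:
            W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def separate(V_mix, bases_per_object, n_iter=200):
    """Split a mixture spectrogram using known per-object bases.

    bases_per_object maps an object name to a (freq x k_i) basis matrix
    (assumed given here). Returns one magnitude spectrogram per object,
    obtained with a soft (ratio) mask.
    """
    # Stack all object bases side by side and fit activations only.
    W = np.concatenate(list(bases_per_object.values()), axis=1)
    _, H = nmf(V_mix, W.shape[1], n_iter=n_iter, W_fixed=W)
    full = W @ H + 1e-9  # reconstruction from all bases together
    estimates, start = {}, 0
    for name, B in bases_per_object.items():
        stop = start + B.shape[1]
        # Each object's share of the reconstruction masks the mixture.
        estimates[name] = V_mix * (B @ H[start:stop]) / full
        start = stop
    return estimates
```

In practice `V_mix` would be the magnitude of a short-time Fourier transform of the audio track, and each estimate would be combined with the mixture phase to recover a waveform; both steps are omitted here.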


