Visual Scene Graphs for Audio Source Separation

09/24/2021
by   Moitreya Chatterjee, et al.

State-of-the-art approaches for visually-guided audio source separation typically assume sources that have characteristic sounds, such as musical instruments. These approaches often ignore the visual context of the sound sources, or avoid modeling object interactions that could better characterize the sources, especially when the same object class may produce varied sounds from distinct interactions. To address this challenging problem, we propose Audio Visual Scene Graph Segmenter (AVSGS), a novel deep learning model that embeds the visual structure of the scene as a graph and segments this graph into subgraphs, each associated with a unique sound obtained by co-segmenting the audio spectrogram. At its core, AVSGS uses a recursive neural network that emits mutually-orthogonal subgraph embeddings of the visual graph using multi-head attention. These embeddings condition an audio encoder-decoder towards source separation. Our pipeline is trained end-to-end via a self-supervised task: separating audio sources from artificially mixed sounds using the visual graph. In this paper, we also introduce an "in the wild" video dataset for sound source separation that contains multiple non-musical sources, which we call Audio Separation in the Wild (ASIW). This dataset is adapted from the AudioCaps dataset and provides a challenging, natural, daily-life setting for source separation. Thorough experiments on the proposed ASIW and the standard MUSIC datasets demonstrate state-of-the-art sound separation performance of our method compared to recent prior approaches.
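The "mix-and-separate" self-supervision the abstract describes can be illustrated with a toy sketch: two single-source spectrograms are summed into an artificial mixture, and the separator (here replaced by ideal ratio masks, since the AVSGS model itself is not reproduced) must recover each source from the mix. All shapes and names below are illustrative assumptions, not the authors' code.

```python
import numpy as np

# Toy mix-and-separate setup (hypothetical shapes, not the authors' code):
# two single-source magnitude spectrograms are summed into a mixture, and
# training asks the model to predict per-source masks that recover each one.
rng = np.random.default_rng(0)
F, T = 256, 64                      # frequency bins x time frames (assumed)
spec_a = rng.random((F, T))         # magnitude spectrogram of source A
spec_b = rng.random((F, T))         # magnitude spectrogram of source B
mixture = spec_a + spec_b           # artificial mix fed to the separator

# Ideal ratio masks: the targets a conditioned decoder's predicted masks
# are compared against during self-supervised training.
eps = 1e-8
mask_a = spec_a / (mixture + eps)
mask_b = spec_b / (mixture + eps)

# Applying each mask to the mixture recovers the corresponding source.
recovered_a = mask_a * mixture
recovered_b = mask_b * mixture
assert np.allclose(recovered_a, spec_a, atol=1e-4)
assert np.allclose(recovered_b, spec_b, atol=1e-4)
```

In AVSGS the masks are not ideal but predicted by the audio encoder-decoder, with each decoding pass conditioned on one of the mutually-orthogonal subgraph embeddings so that each predicted mask corresponds to one visual sound source.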


