Self-Supervised Learning from Automatically Separated Sound Scenes

05/05/2021
by Eduardo Fonseca, et al.

Real-world sound scenes consist of time-varying collections of sound sources, each generating characteristic sound events that are mixed together in audio recordings. The association of these constituent sound events with their mixture and with each other is semantically constrained: a sound scene contains the union of its source classes, and not all classes naturally co-occur. With this motivation, this paper explores the use of unsupervised automatic sound separation to decompose unlabeled sound scenes into multiple semantically linked views for use in self-supervised contrastive learning. We find that learning to associate input mixtures with their automatically separated outputs yields stronger representations than past approaches that use the mixtures alone. Further, we discover that optimal source separation is not required for successful contrastive learning: a range of separation system convergence states all lead to useful and often complementary example transformations. Our best system incorporates these unsupervised separation models into a single augmentation front-end and jointly optimizes similarity maximization and coincidence prediction objectives across the views. The result is an unsupervised audio representation that rivals state-of-the-art alternatives on the established shallow AudioSet classification benchmark.
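To make the idea of associating a mixture with its separated outputs concrete, the following is a minimal sketch of a generic InfoNCE-style contrastive loss. It is an illustration under stated assumptions, not the paper's objective: the abstract does not specify the loss, encoder, or separation model. The function names (`cosine`, `info_nce`), the temperature value, and the toy two-dimensional embeddings are all hypothetical. Each mixture embedding is treated as the positive partner of the embedding of one of its own separated channels, and the other examples in the batch serve as negatives.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(mix_emb, sep_emb, temperature=0.1):
    """Mean InfoNCE loss over a batch (hypothetical helper).

    mix_emb[i] is the embedding of mixture i; sep_emb[i] is the embedding
    of a channel separated from that same mixture (the positive pair).
    Every sep_emb[j] with j != i acts as a negative for mixture i.
    """
    n = len(mix_emb)
    total = 0.0
    for i in range(n):
        logits = [cosine(mix_emb[i], sep_emb[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability assigned to the true (positive) pair.
        total += log_denominator - logits[i]
    return total / n


# Toy embeddings (assumed values, for illustration only).
mixtures = [[1.0, 0.0], [0.0, 1.0]]
separated_matched = [[0.9, 0.1], [0.1, 0.9]]   # channel i came from mixture i
separated_swapped = [[0.1, 0.9], [0.9, 0.1]]   # pairs deliberately misaligned

loss_matched = info_nce(mixtures, separated_matched)
loss_swapped = info_nce(mixtures, separated_swapped)
```

In this toy setup the loss is lower when each mixture is paired with a channel separated from it than when the pairs are swapped, which is the property a contrastive learner would exploit when using separation as a view-generating augmentation.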

research
04/05/2018

Learning to Separate Object Sounds by Watching Unlabeled Video

Perceiving a scene most fully requires all the senses. Yet modeling how ...
research
04/10/2018

Audio-Visual Scene Analysis with Self-Supervised Multisensory Features

The thud of a bouncing ball, the onset of speech as lips open -- when vi...
research
08/30/2019

Recursive Visual Sound Separation Using Minus-Plus Net

Sounds provide rich semantics, complementary to visual data, for many ta...
research
10/29/2022

Learning Audio-Visual Dynamics Using Scene Graphs for Audio Source Separation

There exists an unequivocal distinction between the sound produced by a ...
research
11/18/2022

Self-Remixing: Unsupervised Speech Separation via Separation and Remixing

We present Self-Remixing, a novel self-supervised speech separation meth...
research
05/11/2020

Foreground-Background Ambient Sound Scene Separation

Ambient sound scenes typically comprise multiple short events occurring ...
research
11/06/2017

Unsupervised Learning of Semantic Audio Representations

Even in the absence of any explicit semantic annotation, vast collection...
