
Deep Clustering For General-Purpose Audio Representations

by Sreyan Ghosh et al.

We introduce DECAR, a self-supervised pre-training approach for learning general-purpose audio representations. Our system is based on clustering: it uses an offline clustering step to produce target assignments that act as pseudo-labels for a prediction task. Building on recent advances in self-supervised learning for computer vision, we design a lightweight, easy-to-use self-supervised pre-training scheme. We pre-train DECAR embeddings on a balanced subset of the large-scale AudioSet dataset and transfer those representations to 9 downstream classification tasks, including speech, music, animal sounds, and acoustic scenes. Furthermore, we conduct ablation studies identifying key design choices, and we make all our code and pre-trained models publicly available.
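The cluster-then-predict recipe described above can be sketched in a few lines. This is a minimal illustration with synthetic stand-in embeddings and scikit-learn k-means, not the authors' actual encoder or training pipeline: an offline clustering step assigns pseudo-labels, and a classifier is then trained to predict them.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Stand-in for audio embeddings produced by an encoder (synthetic here).
embeddings = rng.normal(size=(200, 16))

# Offline clustering step: cluster assignments serve as pseudo-labels.
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0)
pseudo_labels = kmeans.fit_predict(embeddings)

# Prediction task: fit a classifier to predict the pseudo-labels.
# (In the real system this would be the encoder trained end to end.)
clf = LogisticRegression(max_iter=1000).fit(embeddings, pseudo_labels)
train_acc = clf.score(embeddings, pseudo_labels)
```

In practice the clustering and prediction steps alternate over training epochs, so the pseudo-labels are refreshed as the representations improve.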

