A Framework for Contrastive and Generative Learning of Audio Representations

10/22/2020
by   Prateek Verma, et al.
In this paper, we present a framework for contrastive learning of audio representations in a self-supervised setting, without access to any ground truth labels. The core idea of self-supervised contrastive learning is to map an audio signal and its various augmented versions (which preserve salient aspects of the audio, such as pitch and timbre) close together in an embedding space, while separating them from other, different signals. In addition, we explore generative models based on state-of-the-art transformer architectures for learning latent spaces of audio signals, again without access to any labels. Here, we map audio signals on a small scale to discrete dictionary elements and train transformers to predict the next dictionary element. The data itself serves as the only supervision signal, bypassing the need for labels to train the deep neural networks. We then evaluate both the self-supervised contrastive representations and the generative transformer representations using a linear classifier head. Our system achieves considerable performance compared to a fully supervised method trained with access to ground truth labels. Given the availability of large-scale audio data, these representations show promise across a variety of audio understanding tasks.
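The contrastive objective sketched above, pulling augmented views of the same clip together and pushing other clips apart, is commonly implemented as a normalized temperature-scaled cross-entropy (NT-Xent) loss over a batch of paired views. The abstract does not give the exact loss used, so the following NumPy sketch shows the standard SimCLR-style formulation; the `temperature` hyperparameter and batch layout are assumptions, not details from the paper.

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.1):
    """SimCLR-style contrastive loss (a sketch, not the paper's exact loss).

    z1, z2: (N, D) embeddings of two augmented views of the same N audio clips.
    For each of the 2N embeddings, the positive is its paired view; all other
    2N - 2 embeddings in the batch act as negatives.
    """
    n = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)                 # (2N, D)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)     # unit-normalize
    sim = (z @ z.T) / temperature                        # cosine similarities
    np.fill_diagonal(sim, -np.inf)                       # exclude self-pairs
    # Positive for row i is its other view: i + N (first half) or i - N.
    pos_idx = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    pos = sim[np.arange(2 * n), pos_idx]
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    return float(np.mean(logsumexp - pos))
```

When the two views of each clip embed close together, the positive similarity dominates the log-sum-exp and the loss is small; with unrelated embeddings the loss grows, which is the gradient signal that shapes the latent space.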
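For the generative branch, the abstract describes mapping audio to discrete dictionary elements and training a transformer on next-element prediction. The quantization step can be sketched as nearest-neighbour assignment against a codebook; note that the codebook, the frame features, and the helper names below are hypothetical placeholders, since the abstract does not specify how the dictionary is learned.

```python
import numpy as np

def quantize_frames(frames, codebook):
    """Assign each feature frame to its nearest codebook entry (squared L2).

    frames:   (T, D) per-frame audio features (e.g. spectrogram frames)
    codebook: (K, D) dictionary vectors
    returns:  (T,) integer token ids in [0, K)
    """
    d2 = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def next_token_pairs(tokens):
    """Shift the token sequence to form (input, target) pairs for
    next-dictionary-element prediction, as a language model would consume."""
    return tokens[:-1], tokens[1:]
```

The resulting integer sequences are what a transformer would be trained on with a standard next-token cross-entropy objective, exactly as in text language modeling but over audio dictionary elements.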

Related research

- 10/19/2020, CLAR: Contrastive Learning of Auditory Representations. Learning rich visual representations using contrastive self-supervised l...
- 09/11/2023, Optimizing Audio Augmentations for Contrastive Learning of Health-Related Acoustic Signals. Health-related acoustic signals, such as cough and breathing sounds, are...
- 06/15/2023, Rosetta Neurons: Mining the Common Units in a Model Zoo. Do different neural networks, trained for various vision tasks, share so...
- 10/09/2021, Visually Exploring Multi-Purpose Audio Data. We analyse multi-purpose audio using tools to visualise similarities wit...
- 10/25/2019, SPICE: Self-supervised Pitch Estimation. We propose a model to estimate the fundamental frequency in monophonic a...
- 05/23/2020, S3VAE: Self-Supervised Sequential VAE for Representation Disentanglement and Data Generation. We propose a sequential variational autoencoder to learn disentangled re...
- 09/03/2022, Equivariant Self-Supervision for Musical Tempo Estimation. Self-supervised methods have emerged as a promising avenue for represent...
