Self-supervised audio representation learning for mobile devices

05/24/2019
by Marco Tagliasacchi, et al.

We explore self-supervised models that can potentially be deployed on mobile devices to learn general-purpose audio representations. Specifically, we propose methods that exploit temporal context in the spectrogram domain. One method estimates the temporal gap between two short audio segments extracted at random from the same audio clip. The other methods are inspired by Word2Vec, a popular technique for learning word embeddings, and aim either to reconstruct a temporal spectrogram slice from past and future slices or, conversely, to reconstruct the context of surrounding slices from the current slice. We focus our evaluation on small encoder architectures, which can potentially run on mobile devices during both inference (re-using a common learned representation across multiple downstream tasks) and training (capturing the true data distribution without compromising users' privacy when combined with federated learning). We evaluate the quality of the embeddings produced by the self-supervised models and show that they can be re-used for a variety of downstream tasks and, for some tasks, even approach the performance of fully supervised models of similar size.
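The two families of pretext tasks described above can be illustrated by how their training examples might be cut from a spectrogram. The following is a minimal sketch in Python with NumPy, not the authors' implementation. The function names, the `(frames, frequency bins)` array layout, the gap normalization by the largest possible start offset, and the use of contiguous slices with no spacing between them are all assumptions made for illustration.

```python
import numpy as np


def sample_temporal_gap_pair(spec, seg_len, rng):
    """Cut two random segments of `seg_len` frames from one clip.

    `spec` is assumed to have shape (frames, frequency_bins).
    Returns (earlier, later, gap). `gap` is the absolute distance between
    the segment starts, scaled to [0, 1] (assumed normalization); a model
    would be trained to predict it from the two segments.
    """
    max_start = spec.shape[0] - seg_len
    if max_start < 1:
        raise ValueError("clip is too short for the requested segment length")
    a, b = (int(v) for v in rng.integers(0, max_start + 1, size=2))
    first, second = min(a, b), max(a, b)
    gap = (second - first) / max_start
    return spec[first:first + seg_len], spec[second:second + seg_len], gap


def split_cbow_example(spec, center_start, slice_len, n_context):
    """Word2Vec-style split of a clip into context slices and a target slice.

    Returns (context, target): `context` stacks `n_context` past slices and
    `n_context` future slices, `target` is the slice between them.
    Swapping the roles (target as input, context as output) gives the
    skip-gram-like variant.
    """
    first = center_start - n_context * slice_len
    last = center_start + (n_context + 1) * slice_len
    if first < 0 or last > spec.shape[0]:
        raise ValueError("context window does not fit inside the clip")
    # Slice start positions: past slices, then the target, then future slices.
    starts = [center_start + k * slice_len for k in range(-n_context, n_context + 1)]
    slices = [spec[s:s + slice_len] for s in starts]
    target = slices.pop(n_context)  # middle slice is the reconstruction target
    return np.stack(slices), target


# Usage on a synthetic "spectrogram" of 100 frames and 4 frequency bins.
rng = np.random.default_rng(0)
spec = np.arange(100 * 4, dtype=float).reshape(100, 4)
seg_a, seg_b, gap = sample_temporal_gap_pair(spec, seg_len=10, rng=rng)
context, target = split_cbow_example(spec, center_start=40, slice_len=10, n_context=2)
```

In this sketch the two segments of a temporal-gap pair are allowed to overlap, and the context slices directly abut the target slice. Both are simplifying choices, not details confirmed by the abstract.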
