BYOL for Audio: Exploring Pre-trained General-purpose Audio Representations

04/15/2022
by Daisuke Niizumi, et al.

Pre-trained models are essential as feature extractors in modern machine learning systems in various domains. In this study, we hypothesize that representations effective for general audio tasks should provide multiple aspects of robust features of the input sound. For recognizing sounds regardless of perturbations such as varying pitch or timbre, features should be robust to these perturbations. For serving the diverse needs of tasks such as recognizing emotions or music genres, representations should provide multiple aspects of these robust features, such as local and global features and their statistics. To implement this principle, we propose a self-supervised learning method: Bootstrap Your Own Latent (BYOL) for Audio (BYOL-A, pronounced "viola"). BYOL-A pre-trains representations of the input sound that are invariant to audio data augmentations by minimizing the difference between a pair of augmented variants of the input, which makes the learned representations robust to perturbations of sounds. In the BYOL-A encoder, global pooling forms multi-aspect representations by combining statistics of frequency- and channel-wise, local, and global features. As a result, the learned representations should provide multi-aspect robust features of the input and serve the diverse needs of various tasks. We evaluated general audio task performance against previous state-of-the-art methods; BYOL-A showed competitive results in all tasks, with the best average result of 72.4% and a VoxCeleb1 result of 63.8%. Ablation experiments further validated the contributions of BYOL-A components. Our code is available online.
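
The two mechanisms named in the abstract, the BYOL-style objective on a pair of augmented views and the multi-aspect global pooling, can be sketched in a few lines. The snippet below is a minimal PyTorch illustration under assumed names (MultiAspectPool, byol_loss, training_step); it is not the authors' released implementation, which is linked from the paper.

```python
# Minimal sketch of the ideas described above (illustrative, not the official code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiAspectPool(nn.Module):
    """Pools a CNN feature map (batch, channels, freq, time) into one vector by
    combining frequency-/channel-wise statistics of local (per-frame) features
    with global summary statistics over time."""

    def forward(self, x):  # x: (B, C, F, T)
        B, C, Fq, T = x.shape
        # Fold the frequency axis into the channel axis, keeping per-frame features.
        local = x.permute(0, 3, 1, 2).reshape(B, T, C * Fq)  # (B, T, C*F)
        # Concatenate mean and max statistics over time as the global representation.
        return torch.cat([local.mean(dim=1), local.max(dim=1).values], dim=-1)


def byol_loss(online_pred, target_proj):
    """Negative cosine similarity between the online network's prediction of one
    view and the stop-gradient target projection of the other view."""
    p = F.normalize(online_pred, dim=-1)
    z = F.normalize(target_proj.detach(), dim=-1)  # stop-gradient on the target branch
    return 2 - 2 * (p * z).sum(dim=-1).mean()


def training_step(online, target, predictor, view1, view2):
    """Symmetrized BYOL objective on two augmented views of the same log-mel
    spectrogram; the target network is kept as an EMA copy of the online one."""
    loss = byol_loss(predictor(online(view1)), target(view2)) \
         + byol_loss(predictor(online(view2)), target(view1))
    return loss
```

As a usage note, `online` and `target` stand for encoder-plus-projector stacks whose encoders end with a pooling module like `MultiAspectPool`; minimizing `training_step` pulls the representations of the two augmented views together, which is what makes them invariant to the applied augmentations.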


Related research

Masked Spectrogram Modeling using Masked Autoencoders for Learning General-purpose Audio Representation (04/26/2022)
Recent general-purpose audio representations show state-of-the-art perfo...

Masked Autoencoders with Multi-Window Attention Are Better Audio Learners (06/01/2023)
Several recent works have adapted Masked Autoencoders (MAEs) for learnin...

An empirical study of weakly supervised audio tagging embeddings for general audio representations (09/30/2022)
We study the usability of pre-trained weakly supervised audio tagging (A...

BYOL-S: Learning Self-supervised Speech Representations by Bootstrapping (06/24/2022)
Methods for extracting audio and speech features have been studied since...

DeLoRes: Decorrelating Latent Spaces for Low-Resource Audio Representation Learning (03/25/2022)
Inspired by the recent progress in self-supervised learning for computer...

A Study on Robustness to Perturbations for Representations of Environmental Sound (03/20/2022)
Audio applications involving environmental sound analysis increasingly u...

Bootstrap Confidence Regions for Learned Feature Embeddings (02/01/2022)
Algorithmic feature learners provide high-dimensional vector representat...
