Masked Spectrogram Modeling using Masked Autoencoders for Learning General-purpose Audio Representation

04/26/2022
by Daisuke Niizumi, et al.

Recent general-purpose audio representations show state-of-the-art performance on various audio tasks. These representations are pre-trained by self-supervised learning methods that create training signals from the input. For example, typical audio contrastive learning uses temporal relationships among input sounds to create training signals, whereas other methods use differences among input views created by data augmentations. However, these training signals do not provide information derived from the intact input sound, which we think is suboptimal for learning representations that describe the input as it is. In this paper, we seek to learn audio representations from the input itself as supervision, using the pretext task of auto-encoding masked spectrogram patches: Masked Spectrogram Modeling (MSM, a variant of Masked Image Modeling applied to audio spectrograms). To implement MSM, we use Masked Autoencoders (MAE), an image self-supervised learning method. MAE learns to efficiently encode a small number of visible patches into latent representations that carry the essential information for reconstructing a large number of masked patches. During training, MAE minimizes the reconstruction error, which uses the input as the training signal, consequently achieving our goal. We conducted experiments on our MSM using MAE (MSM-MAE) models under the evaluation benchmark of the HEAR 2021 NeurIPS Challenge. Our MSM-MAE models outperformed the HEAR 2021 Challenge results on seven out of 15 tasks (e.g., an accuracy of 73.4%), while showing moderate performance on other tasks where specialized models perform better. We also investigate how the design choices of MSM-MAE impact performance, and we conduct a qualitative analysis of visualization outcomes to gain an understanding of the learned representations. We make our code available online.
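The pretext task described above reduces to a compact recipe: split the spectrogram into patches, encode only a small visible subset, and reconstruct the masked majority, taking the reconstruction error on the masked patches as the loss. Below is a minimal PyTorch sketch of that recipe. The class name MaskedSpectrogramMAE, the 0.75 mask ratio, and all model sizes are illustrative assumptions, not the paper's configuration; the authors' actual implementation is released separately.

# Minimal sketch of MAE-style masked spectrogram modeling.
# All names, dims, and the mask ratio are illustrative assumptions.
import torch
import torch.nn as nn


class MaskedSpectrogramMAE(nn.Module):
    def __init__(self, patch_dim=256, embed_dim=192, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.patch_embed = nn.Linear(patch_dim, embed_dim)  # flattened patch -> token
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True),
            num_layers=4,
        )
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.head = nn.Linear(embed_dim, patch_dim)  # reconstruct patch values

    def forward(self, patches):
        """patches: (B, N, patch_dim) flattened spectrogram patches."""
        B, N, D = patches.shape
        n_keep = int(N * (1 - self.mask_ratio))

        # Per-sample random shuffle; the first n_keep patches stay "visible".
        noise = torch.rand(B, N, device=patches.device)
        shuffle = noise.argsort(dim=1)   # random permutation per sample
        restore = shuffle.argsort(dim=1) # indices that undo the shuffle
        keep = shuffle[:, :n_keep]
        visible = torch.gather(patches, 1, keep.unsqueeze(-1).expand(-1, -1, D))

        # Encode only the visible patches (the efficiency trick of MAE).
        latent = self.encoder(self.patch_embed(visible))

        # Append mask tokens, unshuffle to original order, and decode.
        masks = self.mask_token.expand(B, N - n_keep, -1)
        full = torch.cat([latent, masks], dim=1)
        full = torch.gather(full, 1, restore.unsqueeze(-1).expand(-1, -1, full.size(-1)))
        pred = self.head(self.decoder(full))

        # Reconstruction loss on masked patches only: the input is the supervision.
        masked = torch.ones(B, N, device=patches.device)
        masked.scatter_(1, keep, 0.0)  # 1 where a patch was masked
        loss = (((pred - patches) ** 2).mean(dim=-1) * masked).sum() / masked.sum()
        return loss


# Example: 8 spectrograms, each cut into 128 flattened 16x16 patches.
spec_patches = torch.randn(8, 128, 256)
loss = MaskedSpectrogramMAE()(spec_patches)
loss.backward()

Encoding only the visible patches is what makes this pre-training efficient: at a 0.75 mask ratio the encoder sees a quarter of the sequence, while the lightweight decoder handles the full length only for reconstruction.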



