Masked Spectrogram Prediction For Self-Supervised Audio Pre-Training

04/27/2022
by   Dading Chong, et al.

Transformer-based models attain excellent results and generalize well when trained on sufficient amounts of data. However, because labeled data in the audio domain is limited, most transformer-based models for audio tasks are fine-tuned from models pre-trained in other domains (e.g., images), which differ notably from the audio domain. Other methods apply self-supervised learning directly in the audio domain, but they do not yet perform well on downstream tasks. In this paper, we present a novel self-supervised learning method for transformer-based audio models, called masked spectrogram prediction (MaskSpec), which learns powerful audio representations from unlabeled audio data (AudioSet in this paper). Our method masks random patches of the input spectrogram and reconstructs the masked regions with an encoder-decoder architecture. Without using extra model weights or supervision, experimental results on multiple downstream datasets demonstrate that MaskSpec achieves a significant performance gain over supervised methods and outperforms previous pre-trained models. In particular, our best model reaches 0.471 (mAP) on AudioSet, 0.854 (mAP) on OpenMIC2018, 0.982 (accuracy) on ESC-50, 0.976 (accuracy) on SCV2, and 0.823 (accuracy) on DCASE2019 Task1A.
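The masking step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the patch size (16), the mask ratio (0.75), the function names `random_patch_mask` and `masked_mse`, and the choice to score reconstruction error only on the masked patches are all assumptions made for illustration, and the encoder-decoder itself is left out.

```python
import numpy as np

def random_patch_mask(spec, patch=16, mask_ratio=0.75, seed=0):
    """Split a (freq, time) spectrogram into square patches and pick a random subset to mask.

    patch, mask_ratio and seed are illustrative assumptions, not values from the paper.
    """
    n_f, n_t = spec.shape[0] // patch, spec.shape[1] // patch
    # Drop any remainder so the spectrogram tiles exactly into patches.
    spec = spec[: n_f * patch, : n_t * patch]
    # (n_f, patch, n_t, patch) -> (n_f, n_t, patch, patch) -> (n_f * n_t, patch * patch)
    patches = (
        spec.reshape(n_f, patch, n_t, patch)
        .transpose(0, 2, 1, 3)
        .reshape(n_f * n_t, patch * patch)
    )
    n_mask = int(round(mask_ratio * len(patches)))
    perm = np.random.default_rng(seed).permutation(len(patches))
    masked_idx = np.sort(perm[:n_mask])    # patches hidden from the encoder
    visible_idx = np.sort(perm[n_mask:])   # patches the encoder would see
    return patches, visible_idx, masked_idx

def masked_mse(pred, target, masked_idx):
    """Mean squared error over the masked patches only (an assumed objective)."""
    return float(np.mean((pred[masked_idx] - target[masked_idx]) ** 2))

# Usage with a synthetic 64 x 96 "spectrogram" (24 patches of 16 x 16).
spec = np.random.default_rng(1).standard_normal((64, 96))
patches, visible_idx, masked_idx = random_patch_mask(spec)
# A hypothetical model would encode patches[visible_idx] and predict all patches;
# here an all-zero prediction stands in for the decoder output.
loss = masked_mse(np.zeros_like(patches), patches, masked_idx)
```

In a real pipeline, only `patches[visible_idx]` would be passed to the encoder, and the decoder output would take the place of the all-zero placeholder.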


research
10/14/2021

Conformer-Based Self-Supervised Learning for Non-Speech Audio Tasks

Representation learning from unlabeled data has been of major interest i...
research
10/26/2022

AVES: Animal Vocalization Encoder based on Self-Supervision

The lack of annotated training data in bioacoustics hinders the use of l...
research
12/16/2021

Masked Feature Prediction for Self-Supervised Visual Pre-Training

We present Masked Feature Prediction (MaskFeat) for self-supervised pre-...
research
01/20/2023

Self-Supervised Learning for Data Scarcity in a Fatigue Damage Prognostic Problem

With the increasing availability of data for Prognostics and Health Mana...
research
10/25/2019

Learning audio representations via phase prediction

We learn audio representations by solving a novel self-supervised learni...
research
10/25/2022

MOFormer: Self-Supervised Transformer model for Metal-Organic Framework Property Prediction

Metal-Organic Frameworks (MOFs) are materials with a high degree of poro...
