Masked Autoencoders that Listen

07/13/2022
by   Po-Yao Huang, et al.

This paper studies a simple extension of image-based Masked Autoencoders (MAE) to self-supervised representation learning from audio spectrograms. Following the Transformer encoder-decoder design in MAE, our Audio-MAE first encodes audio spectrogram patches with a high masking ratio, feeding only the non-masked tokens through encoder layers. The decoder then re-orders and decodes the encoded context padded with mask tokens, in order to reconstruct the input spectrogram. We find it beneficial to incorporate local window attention in the decoder, as audio spectrograms are highly correlated in local time and frequency bands. We then fine-tune the encoder with a lower masking ratio on target datasets. Empirically, Audio-MAE sets new state-of-the-art performance on six audio and speech classification tasks, outperforming other recent models that use external supervised pre-training. The code and models will be at https://github.com/facebookresearch/AudioMAE.
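The encode-then-restore flow described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names `random_mask` and `restore_with_mask_tokens`, the 0.75 masking ratio, the 64 x 8 token array, the zero-valued mask token, and the identity stand-in for the encoder are all assumptions made for the example.

```python
import numpy as np


def random_mask(tokens, mask_ratio, rng):
    """Randomly drop a fraction of patch tokens.

    Returns the kept tokens, their original positions, and a boolean
    mask that is True where a token was dropped.
    """
    n_tokens = tokens.shape[0]
    n_keep = int(n_tokens * (1.0 - mask_ratio))
    # Pick n_keep random positions; sort them so kept tokens stay in order.
    keep_idx = np.sort(rng.permutation(n_tokens)[:n_keep])
    mask = np.ones(n_tokens, dtype=bool)
    mask[keep_idx] = False
    return tokens[keep_idx], keep_idx, mask


def restore_with_mask_tokens(encoded, keep_idx, n_tokens, mask_token):
    """Put encoded tokens back at their original positions and fill
    every masked position with the shared mask token."""
    full = np.tile(mask_token, (n_tokens, 1))
    full[keep_idx] = encoded
    return full


rng = np.random.default_rng(0)

# Hypothetical input: 64 spectrogram patches, each embedded in 8 dimensions.
tokens = rng.standard_normal((64, 8))

# Only the non-masked tokens are passed to the encoder.
kept, keep_idx, mask = random_mask(tokens, mask_ratio=0.75, rng=rng)

# Stand-in for the Transformer encoder (identity, for illustration only).
encoded = kept

# The decoder input is the encoded context, re-ordered and padded with
# mask tokens. In a real model the mask token would be a learned vector.
mask_token = np.zeros(8)
decoder_input = restore_with_mask_tokens(encoded, keep_idx, 64, mask_token)

print(kept.shape, decoder_input.shape)
```

With a 0.75 ratio the encoder in this sketch processes 16 of the 64 tokens, which is where the compute saving of encoding only the visible patches comes from. The decoder still receives a full-length sequence.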

research
03/30/2022

MAE-AST: Masked Autoencoding Audio Spectrogram Transformer

In this paper, we propose a simple yet powerful improvement over the rec...
research
05/28/2022

Point-M2AE: Multi-scale Masked Autoencoders for Hierarchical Point Cloud Pre-training

Masked Autoencoders (MAE) have shown great potentials in self-supervised...
research
06/01/2023

Masked Autoencoders with Multi-Window Attention Are Better Audio Learners

Several recent works have adapted Masked Autoencoders (MAEs) for learnin...
research
04/17/2023

Learning to Compress Prompts with Gist Tokens

Prompting is now the primary way to utilize the multitask capabilities o...
research
10/26/2022

Masked Modeling Duo: Learning Representations by Encouraging Both Networks to Model the Input

Masked Autoencoders is a simple yet powerful self-supervised learning me...
research
03/23/2023

LMCodec: A Low Bitrate Speech Codec With Causal Transformer Models

We introduce LMCodec, a causal neural speech codec that provides high qu...
research
10/31/2022

Audio Time-Scale Modification with Temporal Compressing Networks

We proposed a novel approach in the field of time-scale modification on ...
