MAE-AST: Masked Autoencoding Audio Spectrogram Transformer

03/30/2022
by Alan Baade, et al.

In this paper, we propose a simple yet powerful improvement over the recent Self-Supervised Audio Spectrogram Transformer (SSAST) model for speech and audio classification. Specifically, we leverage the insight that the SSAST uses a very high masking ratio (75%) during pretraining, meaning that the vast majority of self-attention compute is performed on mask tokens. We address this by integrating the encoder-decoder architecture from Masked Autoencoders Are Scalable Vision Learners (MAE) into the SSAST, where a deep encoder operates on only unmasked input, and a shallow decoder operates on encoder outputs and mask tokens. We find that MAE-like pretraining can provide a 3x speedup and 2x memory usage reduction over the vanilla SSAST using current audio pretraining strategies with ordinary model and input sizes. When fine-tuning on downstream tasks, which uses only the encoder, we find that our approach outperforms the SSAST on a variety of downstream tasks. We further conduct comprehensive evaluations into different strategies of pretraining and explore differences in MAE-style pretraining between the visual and audio domains.
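To make the encoder-decoder split concrete, below is a minimal PyTorch sketch of MAE-style masked pretraining on flattened spectrogram patches: a deep encoder attends only over the roughly 25% of patches that remain unmasked, and a shallow decoder receives the encoder outputs plus learned mask tokens to reconstruct the full input. All module names, layer counts, and dimensions here are illustrative assumptions, not the authors' released implementation, and the loss is a simplified plain reconstruction objective.

```python
# Illustrative MAE-style pretraining sketch for spectrogram patches.
# Sizes, names, and the single reconstruction loss are assumptions for clarity.
import torch
import torch.nn as nn


class MAEASTSketch(nn.Module):
    def __init__(self, patch_dim=256, embed_dim=768, enc_layers=12,
                 dec_dim=384, dec_layers=2, num_patches=512, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.patch_embed = nn.Linear(patch_dim, embed_dim)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
        # Deep encoder: only ever sees the unmasked (visible) patches.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(embed_dim, nhead=12, batch_first=True),
            num_layers=enc_layers)
        self.enc_to_dec = nn.Linear(embed_dim, dec_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
        self.dec_pos_embed = nn.Parameter(torch.zeros(1, num_patches, dec_dim))
        # Shallow decoder: sees encoder outputs plus mask tokens for masked positions.
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dec_dim, nhead=6, batch_first=True),
            num_layers=dec_layers)
        self.recon_head = nn.Linear(dec_dim, patch_dim)

    def random_mask(self, x):
        B, N, D = x.shape
        keep = int(N * (1 - self.mask_ratio))
        noise = torch.rand(B, N, device=x.device)
        shuffle = noise.argsort(dim=1)      # random permutation per example
        restore = shuffle.argsort(dim=1)    # indices that undo the shuffle
        keep_idx = shuffle[:, :keep]
        x_visible = torch.gather(x, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
        return x_visible, restore

    def forward(self, patches):
        # patches: (B, N, patch_dim) flattened spectrogram patches
        B, N, _ = patches.shape
        x = self.patch_embed(patches) + self.pos_embed[:, :N]
        x_visible, restore = self.random_mask(x)
        latent = self.encoder(x_visible)    # attention over ~25% of the sequence
        y = self.enc_to_dec(latent)
        mask_tokens = self.mask_token.expand(B, N - y.size(1), -1)
        y_full = torch.cat([y, mask_tokens], dim=1)
        y_full = torch.gather(
            y_full, 1, restore.unsqueeze(-1).expand(-1, -1, y_full.size(-1)))
        y_full = y_full + self.dec_pos_embed[:, :N]
        recon = self.recon_head(self.decoder(y_full))
        # Simplified: mean squared error over all patches
        # (MAE computes the loss on masked patches only).
        return ((recon - patches) ** 2).mean()


# Example usage with random data standing in for spectrogram patches:
model = MAEASTSketch()
patches = torch.randn(2, 512, 256)   # (batch, num_patches, patch_dim)
loss = model(patches)
loss.backward()
```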


Related research

01/27/2021 - Scheduled Sampling in Vision-Language Pretraining with Decoupled Encoder-Decoder Network
Despite having impressive vision-language (VL) pretraining with BERT-bas...

11/11/2021 - Masked Autoencoders Are Scalable Vision Learners
This paper shows that masked autoencoders (MAE) are scalable self-superv...

07/13/2022 - Masked Autoencoders that Listen
This paper studies a simple extension of image-based Masked Autoencoders...

05/23/2023 - Difference-Masking: Choosing What to Mask in Continued Pretraining
Self-supervised learning (SSL) and the objective of masking-and-predicti...

06/08/2022 - Towards Understanding Why Mask-Reconstruction Pretraining Helps in Downstream Tasks
For unsupervised pretraining, mask-reconstruction pretraining (MRP) appr...

03/14/2023 - AdPE: Adversarial Positional Embeddings for Pretraining Vision Transformers via MAE+
Unsupervised learning of vision transformers seeks to pretrain an encode...

12/30/2022 - An Analysis of Attention via the Lens of Exchangeability and Latent Variable Models
With the attention mechanism, transformers achieve significant empirical...
