MAST: Multiscale Audio Spectrogram Transformers

11/02/2022
by Sreyan Ghosh, et al.

We present the Multiscale Audio Spectrogram Transformer (MAST) for audio classification, which brings the concept of multiscale feature hierarchies to the Audio Spectrogram Transformer (AST). Given an input audio spectrogram, we first patchify and project it into an initial temporal resolution and embedding dimension, after which the multiple stages in MAST progressively expand the embedding dimension while reducing the temporal resolution of the input. This pyramid structure allows the early layers of MAST, which operate at a high temporal resolution but with a low-dimensional embedding, to model simple low-level acoustic information, and the deeper, temporally coarse layers to model high-level acoustic information with high-dimensional embeddings. We also extend our approach to a new Self-Supervised Learning (SSL) method called SS-MAST, which calculates a symmetric contrastive loss between latent representations from a student and a teacher encoder. In practice, MAST significantly outperforms AST by an average accuracy of 3.4 [...] LAPE Benchmark. Moreover, SS-MAST achieves an absolute average improvement of 2.6 [...] on GitHub at the time of publication.
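
The stage-wise trade-off described in the abstract (temporal resolution shrinks while embedding dimension grows) can be illustrated with a small shape calculation. This is only a sketch under stated assumptions: the function name `stage_schedule`, the per-stage factors of 2, and the example values of 256 time steps and 96 dimensions are hypothetical choices for illustration and are not taken from the paper.

```python
from typing import List, Tuple


def stage_schedule(
    num_frames: int,
    embed_dim: int,
    num_stages: int,
    time_stride: int = 2,  # assumed temporal reduction factor per stage
    dim_mult: int = 2,     # assumed embedding expansion factor per stage
) -> List[Tuple[int, int]]:
    """Return (temporal_resolution, embedding_dim) for each stage.

    The first entry is the resolution and dimension right after the
    initial patchify-and-project step; each later stage divides the
    temporal resolution (rounding up) and multiplies the embedding
    dimension.
    """
    schedule = [(num_frames, embed_dim)]
    for _ in range(num_stages - 1):
        # Ceiling division, so odd lengths do not lose the last frame.
        num_frames = max(1, -(-num_frames // time_stride))
        embed_dim *= dim_mult
        schedule.append((num_frames, embed_dim))
    return schedule


if __name__ == "__main__":
    # Hypothetical starting point: 256 time steps, 96-dim embeddings.
    for i, (t, d) in enumerate(stage_schedule(256, 96, 4), start=1):
        print(f"stage {i}: {t} time steps x {d} dims")
```

Under these assumed factors, early stages keep many time steps with narrow embeddings and later stages keep few time steps with wide embeddings, which is the pyramid shape the abstract refers to. The actual stage count and scaling factors used by MAST should be taken from the full text.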

research
04/22/2021

Multiscale Vision Transformers

We present Multiscale Vision Transformers (MViT) for video and image rec...
research
03/19/2023

Multiscale Audio Spectrogram Transformer for Efficient Audio Classification

Audio event has a hierarchical architecture in both time and frequency a...
research
02/08/2022

CALM: Contrastive Aligned Audio-Language Multirate and Multimodal Representations

Deriving multimodal representations of audio and lexical inputs is a cen...
research
10/10/2022

Automated Audio Captioning via Fusion of Low- and High- Dimensional Features

Automated audio captioning (AAC) aims to describe the content of an audi...
research
05/30/2023

Dual Transformer Decoder based Features Fusion Network for Automated Audio Captioning

Automated audio captioning (AAC) which generates textual descriptions of...
research
10/27/2022

Opening the Black Box of wav2vec Feature Encoder

Self-supervised models, namely, wav2vec and its variants, have shown pro...
research
10/29/2019

Depa: Self-supervised audio embedding for depression detection

Depression detection research has increased over the last few decades as...
