Multiscale Audio Spectrogram Transformer for Efficient Audio Classification

03/19/2023
by Wentao Zhu, et al.

Audio events have a hierarchical structure in both time and frequency and can be grouped together to form more abstract semantic audio classes. In this work, we develop a multiscale audio spectrogram Transformer (MAST) that employs hierarchical representation learning for efficient audio classification. Specifically, MAST applies one-dimensional (and two-dimensional) pooling operators along the time (and frequency) domains in different stages, progressively reducing the number of tokens and increasing the feature dimensions. MAST significantly outperforms AST <cit.> by 22.2%, 4.4% and 4.7% in top-1 accuracy on Kinetics-Sounds, Epic-Kitchens-100 and VGGSound, respectively, without external training data. On the downloaded AudioSet dataset, in which over 20% of the audio clips are missing, MAST also achieves slightly better accuracy than AST. In addition, MAST is 5x more efficient in terms of multiply-accumulates (MACs) and has 42% fewer parameters than AST. Through clustering metrics and visualizations, we demonstrate that MAST learns semantically more separable feature representations from audio signals.
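The stage-wise pooling described in the abstract can be illustrated with a small shape-tracking sketch. This is not the authors' implementation: the stage schedule, the initial grid size (64 time steps by 8 frequency bins), the channel widths, the use of average pooling, and the helper names `pool_tokens`, `expand_channels` and `multiscale_shapes` are all assumptions chosen for illustration, and the attention layers are omitted entirely. It only shows how pooling along time (1D) or along time and frequency (2D) shrinks the token count while a projection widens the features.

```python
import numpy as np

def pool_tokens(x, pool_t, pool_f):
    """Average-pool a (T, F, C) token grid with window and stride (pool_t, pool_f)."""
    t, f, c = x.shape
    t_out, f_out = t // pool_t, f // pool_f
    x = x[: t_out * pool_t, : f_out * pool_f]  # drop any ragged edge
    x = x.reshape(t_out, pool_t, f_out, pool_f, c)
    return x.mean(axis=(1, 3))

def expand_channels(x, out_dim, rng):
    """Random linear projection standing in for a learned layer (assumption)."""
    w = rng.standard_normal((x.shape[-1], out_dim)) / np.sqrt(x.shape[-1])
    return x @ w

def multiscale_shapes(t=64, f=8, dim=96, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((t, f, dim))
    # Hypothetical schedule: (pool_t, pool_f, out_dim) per stage.
    # (2, 1) pools along time only; (2, 2) pools along time and frequency.
    stages = [(1, 1, 96), (2, 1, 192), (2, 2, 384), (2, 2, 768)]
    shapes = []
    for pool_t, pool_f, out_dim in stages:
        x = pool_tokens(x, pool_t, pool_f)
        x = expand_channels(x, out_dim, rng)
        shapes.append(x.shape)
    return shapes

shapes = multiscale_shapes()
for s in shapes:
    print(s, "tokens:", s[0] * s[1])
```

Under these assumed settings the token count falls from 512 to 16 across the four stages while the feature width grows from 96 to 768, which is the general trade-off the abstract describes.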


research
09/25/2008

Audio Classification from Time-Frequency Texture

Time-frequency representations of audio signals often resemble texture i...
research
11/02/2022

MAST: Multiscale Audio Spectrogram Transformers

We present Multiscale Audio Spectrogram Transformer (MAST) for audio cla...
research
02/02/2022

HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection

Audio classification is an important task of mapping audio samples into ...
research
10/18/2021

SpecTNT: a Time-Frequency Transformer for Music Audio

Transformers have drawn attention in the MIR field for their remarkable ...
research
02/08/2020

A Time-Frequency Perspective on Audio Watermarking

Existing audio watermarking methods usually treat the host audio signals...
research
01/31/2019

End-to-End Probabilistic Inference for Nonstationary Audio Analysis

A typical audio signal processing pipeline includes multiple disjoint an...
research
05/25/2023

SoundSieve: Seconds-Long Audio Event Recognition on Intermittently-Powered Systems

A fundamental problem of every intermittently-powered sensing system is ...
