ASiT: Audio Spectrogram vIsion Transformer for General Audio Representation

11/23/2022
by   Sara Atito, et al.
0

Vision transformers, which were originally developed for natural language processing, have recently generated significant interest in the computer vision and audio communities due to their flexibility in learning long-range relationships. Constrained by data hungry nature of transformers and limited labelled data most transformer-based models for audio tasks are finetuned from ImageNet pretrained models, despite the huge gap between the natural images domain and audio domain. This has motivated the research in self-supervised pretraining of audio transformers, which reduces the dependency on large amounts of labeled data and focuses on extracting concise representation of the audio spectrograms. In this paper, we propose ASiT, a novel self-supervised transformer for general audio representations that captures local and global contextual information employing group masked model learning and self-distillation. We evaluate our pretrained models on both audio and speech classification tasks including audio event classification, keyword spotting, and speaker identification. We further conduct comprehensive ablation studies, including evaluations of different pretraining strategies. The proposed ASiT framework significantly boosts the performance on all tasks and sets a new state-of-the-art performance on five audio and speech classification tasks, outperforming recent methods, including the approaches that use additional datasets for pretraining. The code and pretrained weights will be made publicly available for the scientific community.

READ FULL TEXT
research
10/13/2021

Study of positional encoding approaches for Audio Spectrogram Transformers

Transformers have revolutionized the world of deep learning, specially i...
research
05/30/2022

GMML is All you Need

Vision transformers have generated significant interest in the computer ...
research
07/15/2022

Position Prediction as an Effective Pretraining Strategy

Transformers have gained increasing popularity in a wide range of applic...
research
02/02/2022

HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection

Audio classification is an important task of mapping audio samples into ...
research
05/11/2023

Extending Audio Masked Autoencoders Toward Audio Restoration

Audio classification and restoration are among major downstream tasks in...
research
10/29/2019

Depa: Self-supervised audio embedding for depression detection

Depression detection research has increased over the last few decades as...
research
03/14/2023

CAT: Causal Audio Transformer for Audio Classification

The attention-based Transformers have been increasingly applied to audio...

Please sign up or login with your details

Forgot password? Click here to reset