HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection

02/02/2022
by Ke Chen, et al.

Audio classification is an important task of mapping audio samples to their corresponding labels. Recently, transformer models with self-attention mechanisms have been adopted in this field. However, existing audio transformers require large amounts of GPU memory and long training times, and they rely on pretrained vision models to achieve high performance, which limits their scalability in audio tasks. To address these problems, we introduce HTS-AT: an audio transformer with a hierarchical structure that reduces model size and training time. It is further combined with a token-semantic module that maps the final outputs into class feature maps, enabling the model to perform audio event detection (i.e., localization in time). We evaluate HTS-AT on three audio classification datasets, where it achieves new state-of-the-art (SOTA) results on AudioSet and ESC-50 and equals the SOTA on Speech Command V2. It also achieves better event localization performance than previous CNN-based models. Moreover, HTS-AT requires only 35% of the model parameters and 15% of the training time of the previous audio transformer. These results demonstrate the high performance and high efficiency of HTS-AT.
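The idea of mapping final token outputs into class feature maps can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `token_semantic_head`, the frequency-major token ordering, and the use of a per-token linear projection followed by plain averaging are all assumptions made for illustration. The sketch only shows how a grid of output tokens can yield both framewise (time-localized) and clip-level class scores.

```python
import numpy as np

def token_semantic_head(tokens, freq_bins, time_frames, weight, bias):
    """Hypothetical sketch of a token-to-class-map head.

    tokens: (batch, freq_bins * time_frames, dim) final transformer outputs,
            assumed to be ordered frequency-major (an assumption of this sketch).
    weight: (num_classes, dim) projection matrix.
    bias:   (num_classes,) projection bias.
    Returns framewise logits (batch, time_frames, num_classes)
    and clip-level logits (batch, num_classes).
    """
    batch, n_tokens, dim = tokens.shape
    if n_tokens != freq_bins * time_frames:
        raise ValueError("token count must equal freq_bins * time_frames")

    # Restore the 2D time-frequency layout of the token sequence.
    grid = tokens.reshape(batch, freq_bins, time_frames, dim)

    # Project every token to class logits (equivalent to a 1x1 convolution).
    class_map = grid @ weight.T + bias  # (batch, freq, time, classes)

    # Pool over frequency to get one score per class per time frame.
    framewise = class_map.mean(axis=1)  # (batch, time, classes)

    # Pool over time to get one score per class for the whole clip.
    clipwise = framewise.mean(axis=1)  # (batch, classes)
    return framewise, clipwise
```

The framewise output is what makes event detection possible in this sketch: each time frame keeps its own class scores, while the clip-level output serves ordinary classification.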

11/23/2022

ASiT: Audio Spectrogram vIsion Transformer for General Audio Representation

Vision transformers, which were originally developed for natural languag...
03/14/2023

CAT: Causal Audio Transformer for Audio Classification

The attention-based Transformers have been increasingly applied to audio...
07/08/2022

BAST: Binaural Audio Spectrogram Transformer for Binaural Sound Localization

Accurate sound localization in a reverberation environment is essential ...
11/01/2017

Reducing Model Complexity for DNN Based Large-Scale Audio Classification

Audio classification is the task of identifying the sound categories tha...
03/07/2023

AST-SED: An Effective Sound Event Detection Method Based on Audio Spectrogram Transformer

In this paper, we propose an effective sound event detection (SED) metho...
03/19/2023

Multiscale Audio Spectrogram Transformer for Efficient Audio Classification

Audio event has a hierarchical architecture in both time and frequency a...
12/20/2022

Visual Transformers for Primates Classification and Covid Detection

We apply the vision transformer, a deep machine learning model build aro...

Code Repositories

HTS-Audio-Transformer

The official code repo of "HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection"

