Streaming Audio Transformers for Online Audio Tagging

05/29/2023
by   Heinrich Dinkel, et al.
0

Transformers have emerged as a prominent model framework for audio tagging (AT), boasting state-of-the-art (SOTA) performance on the widely-used Audioset dataset. However, their impressive performance often comes at the cost of high memory usage, slow inference speed, and considerable model delay, rendering them impractical for real-world AT applications. In this study, we introduce streaming audio transformers (SAT) that combine the vision transformer (ViT) architecture with Transformer-Xl-like chunk processing, enabling efficient processing of long-range audio signals. Our proposed SAT is benchmarked against other transformer-based SOTA methods, achieving significant improvements in terms of mean average precision (mAP) at a delay of 2s and 1s, while also exhibiting significantly lower memory usage and computational overhead. Checkpoints are publicly available https://github.com/RicherMans/SAT.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
11/09/2022

Efficient Large-scale Audio Tagging via Transformer-to-CNN Knowledge Distillation

Audio Spectrogram Transformer models rule the field of Audio Tagging, ou...
research
10/11/2021

Efficient Training of Audio Transformers with Patchout

The great success of transformer-based models in natural language proces...
research
06/01/2023

Adapting a ConvNeXt model to audio classification on AudioSet

In computer vision, convolutional neural networks (CNN) such as ConvNeXt...
research
10/22/2022

GCT: Gated Contextual Transformer for Sequential Audio Tagging

Audio tagging aims to assign predefined tags to audio clips to indicate ...
research
03/03/2023

Unified Keyword Spotting and Audio Tagging on Mobile Devices with Transformers

Keyword spotting (KWS) is a core human-machine-interaction front-end tas...
research
03/31/2022

Deep Hyperspectral Unmixing using Transformer Network

Currently, this paper is under review in IEEE. Transformers have intrigu...
research
07/13/2022

Trans4Map: Revisiting Holistic Top-down Mapping from Egocentric Images to Allocentric Semantics with Vision Transformers

Humans have an innate ability to sense their surroundings, as they can e...

Please sign up or login with your details

Forgot password? Click here to reset