Efficient Large-scale Audio Tagging via Transformer-to-CNN Knowledge Distillation

11/09/2022
by Florian Schmid, et al.

Audio Spectrogram Transformer models currently dominate the field of Audio Tagging, outperforming the previously dominant Convolutional Neural Networks (CNNs). Their superiority rests on their ability to scale up and exploit large-scale datasets such as AudioSet. However, compared to CNNs, Transformers are demanding in terms of model size and computational requirements. We propose a training procedure for efficient CNNs based on offline Knowledge Distillation (KD) from high-performing yet complex transformers. The proposed training scheme and the efficient CNN design based on MobileNetV3 result in models that outperform previous solutions in terms of parameter efficiency, computational efficiency, and prediction performance. We provide models at different complexity levels, scaling from low-complexity models up to a new state-of-the-art performance of .483 mAP on AudioSet. Source code available at: https://github.com/fschmid56/EfficientAT
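To make the offline-KD idea concrete, the sketch below shows one common formulation for multi-label audio tagging: the student is trained on a weighted sum of a binary cross-entropy loss against the ground-truth labels and a binary cross-entropy loss against the (precomputed) teacher predictions. The function names and the weighting value `lam` are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np


def bce(pred, target, eps=1e-7):
    # Element-wise binary cross-entropy, averaged over all labels.
    # pred and target are probabilities in [0, 1] (multi-label setting).
    pred = np.clip(pred, eps, 1 - eps)
    return float(np.mean(-(target * np.log(pred) + (1 - target) * np.log(1 - pred))))


def distillation_loss(student_probs, teacher_probs, labels, lam=0.1):
    # Offline KD objective: weighted sum of the hard-label loss and the
    # distillation loss against the stored teacher predictions.
    # `lam` is a hypothetical trade-off weight for illustration only.
    return lam * bce(student_probs, labels) + (1 - lam) * bce(student_probs, teacher_probs)
```

Because the distillation is offline, the teacher predictions can be computed once for the whole dataset and stored, so the expensive transformer never has to run during student training.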


Related research

- 10/11/2021, Efficient Training of Audio Transformers with Patchout: The great success of transformer-based models in natural language proces...
- 06/01/2023, Adapting a ConvNeXt model to audio classification on AudioSet: In computer vision, convolutional neural networks (CNN) such as ConvNeXt...
- 05/29/2023, Streaming Audio Transformers for Online Audio Tagging: Transformers have emerged as a prominent model framework for audio taggi...
- 08/23/2023, CED: Consistent ensemble distillation for audio tagging: Augmentation and knowledge distillation (KD) are well-established techni...
- 02/27/2022, Transformer-based Knowledge Distillation for Efficient Semantic Segmentation of Road-driving Scenes: For scene understanding in robotics and automated driving, there is a gr...
- 04/07/2022, Solving ImageNet: a Unified Scheme for Training any Backbone to Top Results: ImageNet serves as the primary dataset for evaluating the quality of com...
- 11/10/2022, InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions: Compared to the great progress of large-scale vision transformers (ViTs)...
