Adapting a ConvNeXt model to audio classification on AudioSet

06/01/2023
by   Thomas Pellegrini, et al.
0

In computer vision, convolutional neural networks (CNN) such as ConvNeXt, have been able to surpass state-of-the-art transformers, partly thanks to depthwise separable convolutions (DSC). DSC, as an approximation of the regular convolution, has made CNNs more efficient in time and memory complexity without deteriorating their accuracy, and sometimes even improving it. In this paper, we first implement DSC into the Pretrained Audio Neural Networks (PANN) family for audio classification on AudioSet, to show its benefits in terms of accuracy/model size trade-off. Second, we adapt the now famous ConvNeXt model to the same task. It rapidly overfits, so we report on techniques that improve the learning process. Our best ConvNeXt model reached 0.471 mean-average precision on AudioSet, which is better than or equivalent to recent large audio transformers, while using three times less parameters. We also achieved positive results in audio captioning and audio retrieval with this model. Our PyTorch source code and checkpoint models are available at https://github.com/topel/audioset-convnext-inf.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
11/09/2022

Efficient Large-scale Audio Tagging via Transformer-to-CNN Knowledge Distillation

Audio Spectrogram Transformer models rule the field of Audio Tagging, ou...
research
10/11/2021

Efficient Training of Audio Transformers with Patchout

The great success of transformer-based models in natural language proces...
research
05/29/2023

Streaming Audio Transformers for Online Audio Tagging

Transformers have emerged as a prominent model framework for audio taggi...
research
08/08/2022

Efficient Neural Net Approaches in Metal Casting Defect Detection

One of the most pressing challenges prevalent in the steel manufacturing...
research
12/15/2022

Vision Transformers are Parameter-Efficient Audio-Visual Learners

Vision transformers (ViTs) have achieved impressive results on various c...
research
04/30/2021

Determining Chess Game State From an Image

Identifying the configuration of chess pieces from an image of a chessbo...
research
11/05/2022

Effective Audio Classification Network Based on Paired Inverse Pyramid Structure and Dense MLP Block

Recently, massive architectures based on Convolutional Neural Network (C...

Please sign up or login with your details

Forgot password? Click here to reset