AudioTagging Done Right: 2nd comparison of deep learning methods for environmental sound classification

03/25/2022
by Juncheng B. Li, et al.

After their sweeping success in vision and language tasks, pure attention-based neural architectures (e.g. DeiT) are rising to the top of audio tagging (AT) leaderboards, seemingly rendering traditional convolutional neural networks (CNNs), feed-forward networks, and recurrent networks obsolete. On closer inspection, however, published research varies widely: models initialized with pretrained weights perform drastically differently from those trained without pretraining, training time for a model ranges from hours to weeks, and the essentials are often hidden in seemingly trivial details. This urgently calls for a comprehensive study, since our first comparison is half a decade old. In this work, we perform extensive experiments on AudioSet, the largest weakly-labeled sound event dataset available, and we also present an analysis based on data quality and efficiency. We compare a few state-of-the-art baselines on the AT task and study the performance and efficiency of two major categories of neural architectures: CNN variants and attention-based variants. We also closely examine their optimization procedures. Our open-sourced experimental results offer both practitioners and researchers insight into the trade-offs among performance, efficiency, and the optimization process. Implementation: https://github.com/lijuncheng16/AudioTaggingDoneRight

research
08/02/2018

DCASE 2018 Challenge baseline with convolutional neural networks

The Detection and Classification of Acoustic Scenes and Events (DCASE) i...
research
04/06/2019

Cross-task learning for audio tagging, sound event detection and spatial localization: DCASE 2019 baseline systems

The Detection and Classification of Acoustic Scenes and Events (DCASE) 2...
research
12/10/2019

Sound Event Detection of Weakly Labelled Data with CNN-Transformer and Automatic Threshold Optimization

Sound event detection (SED) is a task to detect sound events in an audio...
research
03/02/2019

Weakly labelled AudioSet Classification with Attention Neural Networks

Audio tagging is the task of predicting the presence or absence of sound...
research
01/29/2019

Comprehensive Evaluation of Deep Learning Architectures for Prediction of DNA/RNA Sequence Binding Specificities

Motivation: Deep learning architectures have recently demonstrated their...
research
08/24/2022

Improved Zero-Shot Audio Tagging Classification with Patchout Spectrogram Transformers

Standard machine learning models for tagging and classifying acoustic si...
