MuSLCAT: Multi-Scale Multi-Level Convolutional Attention Transformer for Discriminative Music Modeling on Raw Waveforms

04/06/2021
by   Kai Middlebrook, et al.

In this work, we aim to improve the expressive capacity of waveform-based discriminative music networks by modeling both sequential (temporal) and hierarchical information in an efficient end-to-end architecture. We present MuSLCAT, or Multi-Scale and Multi-Level Convolutional Attention Transformer, a novel architecture for learning robust representations of complex music tags directly from raw waveform recordings. We also introduce a lightweight variant of MuSLCAT called MuSLCAN, short for Multi-Scale and Multi-Level Convolutional Attention Network. Both MuSLCAT and MuSLCAN model features from multiple scales and levels through a frontend-backend architecture. The frontend targets different frequency ranges while modeling long-range dependencies and multi-level interactions using two convolutional attention networks built from attention-augmented convolution (AAC) blocks. The backend dynamically recalibrates the multi-scale and multi-level features extracted by the frontend using self-attention. The two architectures differ only in their backends: MuSLCAT's backend is a modified version of BERT, while MuSLCAN's is a simple AAC block. We validate the proposed MuSLCAT and MuSLCAN architectures by comparing them to state-of-the-art networks on four benchmark datasets for music tagging and genre recognition. Our experiments show that MuSLCAT and MuSLCAN consistently yield results competitive with state-of-the-art waveform-based models while requiring considerably fewer parameters.
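The attention-augmented convolution (AAC) blocks mentioned above combine a standard convolution with multi-head self-attention, concatenating the two outputs along the channel dimension (Bello et al., 2019). The sketch below illustrates this idea for 1D audio features; the class name, projection layer, and hyperparameters are illustrative assumptions, not details taken from the MuSLCAT paper.

```python
import torch
import torch.nn as nn

class AACBlock1d(nn.Module):
    """Hypothetical sketch of an attention-augmented convolution block
    for 1D audio features: part of the output channels come from a
    standard convolution, the rest from multi-head self-attention over
    the time axis. Hyperparameters are illustrative only."""

    def __init__(self, in_ch, out_ch, attn_ch=16, heads=4, kernel=9):
        super().__init__()
        assert attn_ch % heads == 0 and attn_ch < out_ch
        # Convolution supplies the remaining (out_ch - attn_ch) channels.
        self.conv = nn.Conv1d(in_ch, out_ch - attn_ch, kernel, padding=kernel // 2)
        # 1x1 projection of the input into the attention embedding space.
        self.to_attn = nn.Conv1d(in_ch, attn_ch, 1)
        self.attn = nn.MultiheadAttention(attn_ch, heads, batch_first=True)

    def forward(self, x):                      # x: (batch, in_ch, time)
        conv_out = self.conv(x)                # (batch, out_ch - attn_ch, time)
        a = self.to_attn(x).transpose(1, 2)    # (batch, time, attn_ch)
        attn_out, _ = self.attn(a, a, a)       # self-attention over time steps
        attn_out = attn_out.transpose(1, 2)    # (batch, attn_ch, time)
        # Concatenate convolutional and attentional channels.
        return torch.cat([conv_out, attn_out], dim=1)

x = torch.randn(2, 32, 128)                    # 2 clips, 32 channels, 128 frames
y = AACBlock1d(32, 64)(x)
print(y.shape)                                 # torch.Size([2, 64, 128])
```

Concatenating rather than summing lets the block devote a tunable fraction of its channels to global (attention) versus local (convolution) context, which is the property the frontend exploits when modeling long-range dependencies.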

