Audio Transformers: Transformer Architectures For Large Scale Audio Understanding. Adieu Convolutions

05/01/2021
by   Prateek Verma, et al.

Over the past two decades, CNN architectures have produced compelling models of sound perception and cognition, learning hierarchical organizations of features. Analogous to successes in computer vision, audio feature classification can be optimized for a particular task of interest, over a wide variety of datasets and labels. In fact, similar architectures designed for image understanding have proven effective for acoustic scene analysis. Here we propose applying Transformer-based architectures without convolutional layers to raw audio signals. On a standard dataset, Free Sound 50K, comprising 200 categories, our model outperforms convolutional models to produce state-of-the-art results. This is significant because, unlike in natural language processing and computer vision, we do not perform unsupervised pre-training to outperform convolutional architectures. On the same training set, we show a significant improvement with respect to mean average precision benchmarks. We further improve the performance of Transformer architectures by using techniques such as pooling, inspired by convolutional networks designed in the past few years. In addition, we show how multi-rate signal processing ideas inspired by wavelets can be applied to the Transformer embeddings to improve the results. We also show how our models learn a non-linear, non-constant bandwidth filter-bank, which provides an adaptable time-frequency front-end representation for the task of audio understanding, different from those for other tasks, e.g. pitch estimation.
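
As a rough illustration of two ideas mentioned in the abstract, the sketch below splits a raw waveform into fixed-length frames (the kind of token sequence a Transformer could consume) and then average-pools the tokens along time, which halves the sequence length. This is only a toy sketch under stated assumptions: the function names `frame_signal` and `avg_pool_time`, the frame length, and the pooling stride are hypothetical and are not taken from the paper, and no learned front end or attention layer is included.

```python
from typing import List


def frame_signal(samples: List[float], frame_len: int) -> List[List[float]]:
    """Split a waveform into non-overlapping frames, dropping any trailing remainder.

    Hypothetical helper: the paper's actual framing is not given in the abstract.
    """
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


def avg_pool_time(tokens: List[List[float]], stride: int = 2) -> List[List[float]]:
    """Average each group of `stride` consecutive token vectors along the time axis.

    Assumed pooling scheme for illustration; leftover tokens that do not fill
    a complete group are dropped.
    """
    pooled = []
    for start in range(0, len(tokens) - stride + 1, stride):
        group = tokens[start:start + stride]
        dim = len(group[0])
        pooled.append([sum(vec[d] for vec in group) / stride for d in range(dim)])
    return pooled


if __name__ == "__main__":
    wave = [float(i) for i in range(16)]       # stand-in for raw audio samples
    tokens = frame_signal(wave, frame_len=4)   # 4 tokens of dimension 4
    pooled = avg_pool_time(tokens, stride=2)   # 2 tokens of dimension 4
    print(len(tokens), len(pooled))
```

In a real model, each frame would pass through a learned projection and attention layers before any pooling; the sketch shows only how pooling shortens the token sequence between stages.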

research
03/18/2023

Content Adaptive Front End For Audio Signal Processing

We propose a learnable content adaptive front end for audio signal proce...
research
10/18/2021

SpecTNT: a Time-Frequency Transformer for Music Audio

Transformers have drawn attention in the MIR field for their remarkable ...
research
11/25/2022

Learning General Audio Representations with Large-Scale Training of Patchout Audio Transformers

The success of supervised deep learning methods is largely due to their ...
research
07/19/2023

Improving Domain Generalization for Sound Classification with Sparse Frequency-Regularized Transformer

Sound classification models' performance suffers from generalizing on ou...
research
02/10/2020

Unsupervised Learning of Audio Perception for Robotics Applications: Learning to Project Data to T-SNE/UMAP space

Audio perception is a key to solving a variety of problems ranging from ...
research
08/14/2023

Active Bird2Vec: Towards End-to-End Bird Sound Monitoring with Transformers

We propose a shift towards end-to-end learning in bird sound monitoring ...
