Attention is All You Need? Good Embeddings with Statistics are enough: Large Scale Audio Understanding without Transformers/ Convolutions/ BERTs/ Mixers/ Attention/ RNNs or ....

10/07/2021
by Prateek Verma, et al.

This paper presents a way of doing large-scale audio understanding without traditional state-of-the-art neural architectures. Since the introduction of deep learning for audio understanding in the past decade, convolutional architectures have achieved state-of-the-art results, surpassing traditional hand-crafted features. More recently, there has been a similar shift away from convolutional and recurrent neural networks towards purely end-to-end Transformer architectures. In this work, we explore an approach based on a Bag-of-Words model, with no convolutions, recurrence, attention, Transformers, or BERT-style architectures. We utilize micro- and macro-level clustered vanilla embeddings, and an MLP head for classification. We use only feed-forward encoder-decoder models to obtain bottleneck representations of spectral envelopes, spectral patches and slices, as well as multi-resolution spectra. A classification head (a feed-forward layer), similar to the approach in SimCLR, is trained on the learned representation. Using simple codes learned on these latent representations, we show how we surpass traditional convolutional neural network architectures and come strikingly close to outperforming powerful Transformer architectures. This work will hopefully pave the way for exciting advancements in representation learning without massive, end-to-end neural architectures.
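The recipe the abstract outlines lends itself to a compact sketch: feed-forward autoencoder bottlenecks on spectrogram patches, k-means clustering of the embeddings into a codebook of "acoustic words", a bag-of-words histogram per clip, and a small feed-forward classification head. The following is a minimal, hypothetical illustration in PyTorch and scikit-learn; all shapes and hyperparameters (patch dimensions, codebook size, layer widths) are assumptions for illustration, not the authors' actual configuration.

```python
# Hedged sketch of a bag-of-words audio classifier in the spirit of the paper.
# PATCH_DIM, BOTTLENECK, NUM_CODES, NUM_CLASSES are assumed values, not the
# paper's reported settings.
import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

PATCH_DIM = 64 * 8      # assumed: 64 mel bins x 8 frames, flattened
BOTTLENECK = 16         # assumed bottleneck size of the autoencoder
NUM_CODES = 256         # assumed codebook size (number of k-means "words")
NUM_CLASSES = 10        # assumed number of audio classes

# 1) Feed-forward encoder-decoder; the bottleneck supplies the embeddings.
class PatchAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(PATCH_DIM, 128), nn.ReLU(),
                                 nn.Linear(128, BOTTLENECK))
        self.dec = nn.Sequential(nn.Linear(BOTTLENECK, 128), nn.ReLU(),
                                 nn.Linear(128, PATCH_DIM))
    def forward(self, x):
        z = self.enc(x)
        return self.dec(z), z

def train_autoencoder(patches, epochs=10):
    """patches: (N, PATCH_DIM) float32 array of spectrogram patches."""
    model = PatchAutoencoder()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    x = torch.from_numpy(patches)
    for _ in range(epochs):
        recon, _ = model(x)
        loss = nn.functional.mse_loss(recon, x)
        opt.zero_grad(); loss.backward(); opt.step()
    return model

# 2) Cluster bottleneck embeddings into a codebook of "acoustic words".
def build_codebook(model, patches):
    with torch.no_grad():
        _, z = model(torch.from_numpy(patches))
    return KMeans(n_clusters=NUM_CODES, n_init=4).fit(z.numpy())

# 3) Each clip becomes a normalized histogram of its code assignments.
def bag_of_words(model, kmeans, clip_patches):
    with torch.no_grad():
        _, z = model(torch.from_numpy(clip_patches))
    codes = kmeans.predict(z.numpy())
    hist = np.bincount(codes, minlength=NUM_CODES).astype(np.float32)
    return hist / max(hist.sum(), 1.0)

# 4) A feed-forward classification head trained on the frozen histograms,
#    in the spirit of SimCLR's downstream evaluation protocol.
head = nn.Sequential(nn.Linear(NUM_CODES, 128), nn.ReLU(),
                     nn.Linear(128, NUM_CLASSES))
```

In use, one would train the autoencoder and codebook once on unlabeled patches, convert every clip to its histogram via `bag_of_words`, and fit only `head` with cross-entropy on the labeled histograms; the representation itself stays frozen.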

research
11/25/2022

Learning General Audio Representations with Large-Scale Training of Patchout Audio Transformers

The success of supervised deep learning methods is largely due to their ...
research
05/01/2021

Audio Transformers: Transformer Architectures For Large Scale Audio Understanding. Adieu Convolutions

Over the past two decades, CNN architectures have produced compelling mo...
research
02/04/2023

Greedy Ordering of Layer Weight Matrices in Transformers Improves Translation

Prior work has attempted to understand the internal structures and funct...
research
09/15/2023

Diverse Neural Audio Embeddings – Bringing Features back!

With the advent of modern AI architectures, a shift has happened towards...
research
06/03/2022

Exploring Transformers for Behavioural Biometrics: A Case Study in Gait Recognition

Biometrics on mobile devices has attracted a lot of attention in recent ...
research
07/23/2015

Deep Fishing: Gradient Features from Deep Nets

Convolutional Networks (ConvNets) have recently improved image recogniti...
research
06/13/2016

Deep Image Homography Estimation

We present a deep convolutional neural network for estimating the relati...
