Assessment of Self-Attention on Learned Features For Sound Event Localization and Detection

Joint sound event localization and detection (SELD) is an emerging audio signal processing task adding spatial dimensions to acoustic scene analysis and sound event detection. A popular approach to modeling SELD jointly is using convolutional recurrent neural network (CRNN) models, where CNNs learn high-level features from multi-channel audio input and the RNNs learn temporal relationships from these high-level features. However, RNNs have some drawbacks, such as a limited capability to model long temporal dependencies and slow training and inference times due to their sequential processing nature. Recently, a few SELD studies used multi-head self-attention (MHSA), among other innovations in their models. MHSA and the related transformer networks have shown state-of-the-art performance in various domains. While they can model long temporal dependencies, they can also be parallelized efficiently. In this paper, we study in detail the effect of MHSA on the SELD task. Specifically, we examined the effects of replacing the RNN blocks with self-attention layers. We studied the influence of stacking multiple self-attention blocks, using multiple attention heads in each self-attention block, and the effect of position embeddings and layer normalization. Evaluation on the DCASE 2021 SELD (task 3) development data set shows a significant improvement in all employed metrics compared to the baseline CRNN accompanying the task.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
05/13/2020

Memory Controlled Sequential Self Attention for Sound Recognition

In this paper we investigate the importance of the extent of memory in s...
research
06/07/2021

PILOT: Introducing Transformers for Probabilistic Sound Event Localization

Sound event localization aims at estimating the positions of sound sourc...
research
05/31/2020

CNRL at SemEval-2020 Task 5: Modelling Causal Reasoning in Language with Multi-Head Self-Attention Weights based Counterfactual Detection

In this paper, we describe an approach for modelling causal reasoning in...
research
02/02/2020

Sound Event Detection with Depthwise Separable and Dilated Convolutions

State-of-the-art sound event detection (SED) methods usually employ a se...
research
04/05/2019

Convolutional Self-Attention Networks

Self-attention networks (SANs) have drawn increasing interest due to the...
research
07/23/2021

SALADnet: Self-Attentive multisource Localization in the Ambisonics Domain

In this work, we propose a novel self-attention based neural network for...
research
10/31/2021

A Simple Approach to Image Tilt Correction with Self-Attention MobileNet for Smartphones

The main contributions of our work are two-fold. First, we present a Sel...

Please sign up or login with your details

Forgot password? Click here to reset