Selective Kernel Attention for Robust Speaker Verification

04/03/2022
by   Sung Hwan Mun, et al.

Recent state-of-the-art speaker verification architectures adopt multi-scale processing and frequency-channel attention techniques. However, their full potential may not have been exploited, because the receptive fields of these techniques are fixed: most convolutional layers operate with a predetermined kernel size such as 1, 3, or 5. We aim to further improve this line of research by introducing a selective kernel attention (SKA) mechanism. The SKA mechanism allows each convolutional layer to adaptively select its kernel size in a data-driven fashion, using an attention mechanism that exploits both the frequency and channel domains of the previous layer's output. We propose three module variants using the SKA mechanism: two are applied in front of an ECAPA-TDNN model, and the third is combined with the Res2Net backbone block. Experimental results demonstrate that our proposed model consistently outperforms its conventional counterpart on three different evaluation protocols in terms of both equal error rate and minimum detection cost function. In addition, we present a detailed analysis that helps explain how the SKA module works.
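
To make the idea of attention-based kernel selection concrete, the following is a minimal single-channel sketch in plain Python. It is not the authors' SKA modules: it follows the general split, fuse, and select pattern of selective kernel networks, and it omits the frequency-wise and channel-wise attention described in the abstract. The function names (`conv1d_same`, `selective_kernel`) and the `score_weights` argument, which stands in for learned attention parameters, are hypothetical and chosen only for illustration.

```python
import math


def conv1d_same(x, kernel):
    """Zero-padded 1-D cross-correlation; output has the same length as x.

    Assumes an odd kernel length so the padding is symmetric.
    """
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(x))
    ]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def selective_kernel(x, kernels, score_weights):
    """Toy selective-kernel layer for one channel.

    kernels:       one list of filter taps per branch (different sizes).
    score_weights: hypothetical stand-in for learned attention parameters,
                   one scalar per branch.
    """
    # Split: run the input through one convolution per kernel size.
    branches = [conv1d_same(x, k) for k in kernels]
    # Fuse: sum the branches, then average over time into one descriptor.
    fused = [sum(vals) for vals in zip(*branches)]
    descriptor = sum(fused) / len(fused)
    # Select: score each branch from the descriptor, normalise across branches.
    attention = softmax([w * descriptor for w in score_weights])
    # Output: attention-weighted sum of the branch outputs.
    out = [
        sum(a * b[i] for a, b in zip(attention, branches))
        for i in range(len(x))
    ]
    return out, attention


if __name__ == "__main__":
    signal = [1.0, 2.0, 3.0, 4.0]
    kernels = [[0.0, 1.0, 0.0], [0.2] * 5]  # size-3 identity, size-5 average
    out, attention = selective_kernel(signal, kernels, [1.0, -1.0])
    print(attention, out)
```

Because the attention weights are computed from the input itself, the effective receptive field changes from one input to the next: a branch with a high weight dominates the output, and with equal scores the layer reduces to a plain average of the branches.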

research
07/10/2022

Multi-Frequency Information Enhanced Channel Attention Module for Speaker Representation Learning

Recently, attention mechanisms have been applied successfully in neural ...
research
10/31/2022

Convolution-Based Channel-Frequency Attention for Text-Independent Speaker Verification

Deep convolutional neural networks (CNNs) have been applied to extractin...
research
10/13/2021

Simple Attention Module based Speaker Verification with Iterative noisy label detection

Recently, the attention mechanism such as squeeze-and-excitation module ...
research
03/15/2019

Selective Kernel Networks

In standard Convolutional Neural Networks (CNNs), the receptive fields o...
research
02/19/2021

Frequency-Temporal Attention Network for Singing Melody Extraction

Musical audio is generally composed of three physical properties: freque...
research
10/16/2019

Frequency and temporal convolutional attention for text-independent speaker recognition

Majority of the recent approaches for text-independent speaker recogniti...
research
04/01/2020

Improved RawNet with Feature Map Scaling for Text-independent Speaker Verification using Raw Waveforms

Recent advances in deep learning have facilitated the design of speaker ...
