Studying squeeze-and-excitation used in CNN for speaker verification

09/13/2021
by Mickael Rouvier, et al.

In speaker verification, the extraction of voice representations is mainly based on the Residual Neural Network (ResNet) architecture. ResNet is built upon convolution layers, which learn filters that capture local spatial patterns across the whole input and then generate feature maps that jointly encode spatial and channel information. Unfortunately, all feature maps in a convolution layer are learnt independently (the convolution layer does not exploit the dependencies between feature maps) and locally. This problem was first tackled in image processing. A channel attention mechanism, called squeeze-and-excitation (SE), has recently been proposed for convolution layers and applied to speaker verification. This mechanism re-weights the information extracted across feature maps. In this paper, we first propose an original qualitative study of the influence and role of the SE mechanism applied to the speaker verification task at different stages of the ResNet, and then evaluate several SE architectures. Finally, we propose to improve the SE approach with a new pooling variant based on the concatenation of mean pooling and standard-deviation pooling. Results show that applying SE only to the first stages of the ResNet better captures speaker information for the verification task, and that the proposed pooling variant yields significant discrimination gains on the Voxceleb1-E, Voxceleb1-H and SITW evaluation tasks.
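To make the channel re-weighting idea concrete, the following is a minimal sketch of an SE gate whose squeeze step concatenates per-channel mean and standard deviation, as the abstract describes. Everything beyond that concatenation is an assumption for illustration: the bottleneck of two fully connected layers with ReLU and sigmoid follows the usual SE formulation, the standard deviation is the population form, and the function names (`squeeze_mean_std`, `dense`, `se_mean_std`) and the flattened channel layout are hypothetical rather than taken from the paper.

```python
import math


def squeeze_mean_std(feature_maps):
    """Squeeze each channel into its mean and standard deviation.

    feature_maps: list of C channels, each a flat list of activations
    (frequency x time positions flattened; this layout is an assumption).
    Returns a vector of length 2*C: all means first, then all std values.
    """
    means, stds = [], []
    for channel in feature_maps:
        mu = sum(channel) / len(channel)
        # Population variance (divide by N); the paper may use another form.
        var = sum((x - mu) ** 2 for x in channel) / len(channel)
        means.append(mu)
        stds.append(math.sqrt(var))
    return means + stds


def dense(vector, weights, bias):
    """Fully connected layer; weights holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def se_mean_std(feature_maps, w1, b1, w2, b2):
    """Re-weight channels with a gate computed from mean+std statistics.

    w1/b1 map the 2*C statistics to a bottleneck, w2/b2 map the bottleneck
    back to C gates (assumed standard SE excitation: ReLU then sigmoid).
    """
    z = squeeze_mean_std(feature_maps)                # squeeze: 2*C values
    hidden = [max(0.0, h) for h in dense(z, w1, b1)]  # bottleneck + ReLU
    gates = [1.0 / (1.0 + math.exp(-g))               # sigmoid, one per channel
             for g in dense(hidden, w2, b2)]
    scaled = [[g * x for x in channel]
              for g, channel in zip(gates, feature_maps)]
    return scaled, gates


if __name__ == "__main__":
    # Two channels, two positions each; hand-picked illustrative weights.
    maps = [[1.0, 3.0], [2.0, 2.0]]
    w1, b1 = [[0.5, -0.5, 1.0, 0.0]], [0.0]   # 4 statistics -> 1 hidden unit
    w2, b2 = [[1.0], [-1.0]], [0.0, 0.0]      # 1 hidden unit -> 2 gates
    scaled, gates = se_mean_std(maps, w1, b1, w2, b2)
    print(gates)
    print(scaled)
```

Compared with mean-only pooling, the squeeze vector here is twice as long, so the first fully connected layer takes 2*C inputs instead of C; the gate still outputs one weight per channel.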

research
05/14/2020

ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification

Current speaker verification techniques rely on a neural network to extr...
research
10/31/2022

Convolution-Based Channel-Frequency Attention for Text-Independent Speaker Verification

Deep convolutional neural networks (CNNs) have been applied to extractin...
research
02/16/2019

RES-SE-NET: Boosting Performance of Resnets by Enhancing Bridge-connections

One of the ways to train deep neural networks effectively is to use resi...
research
04/01/2020

Improved RawNet with Feature Map Scaling for Text-independent Speaker Verification using Raw Waveforms

Recent advances in deep learning have facilitated the design of speaker ...
research
04/01/2020

Improved RawNet with Filter-wise Rescaling for Text-independent Speaker Verification using Raw Waveforms

Recent advances in deep learning have facilitated the design of speaker ...
research
10/13/2021

Duality Temporal-channel-frequency Attention Enhanced Speaker Representation Learning

The use of channel-wise attention in CNN based speaker representation ne...
research
03/01/2023

PCF: ECAPA-TDNN with Progressive Channel Fusion for Speaker Verification

ECAPA-TDNN is currently the most popular TDNN-series model for speaker v...
