ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification

05/14/2020
by   Brecht Desplanques, et al.

Current speaker verification techniques rely on a neural network to extract speaker representations. The successful x-vector architecture is a time delay neural network (TDNN) that applies statistics pooling to project variable-length utterances into fixed-length speaker-characterizing embeddings. In this paper, we propose multiple enhancements to this architecture based on recent trends in the related fields of face verification and computer vision. Firstly, the initial frame layers can be restructured into 1-dimensional Res(2)Net modules with impactful skip connections. Similar to SE-ResNet, we introduce Squeeze-and-Excitation (SE) blocks in these modules to explicitly model channel interdependencies. The SE block expands the temporal context of the frame layer by rescaling the channels according to global properties of the recording. Secondly, neural networks are known to learn hierarchical features, with each layer operating on a different level of complexity. To leverage this complementary information, we aggregate and propagate features of different hierarchical levels. Finally, we improve the statistics pooling module with channel-dependent frame attention. This enables the network to focus on a different subset of frames when estimating each channel's statistics. The proposed ECAPA-TDNN architecture significantly outperforms state-of-the-art TDNN-based systems on the VoxCeleb test sets and the 2019 VoxCeleb Speaker Recognition Challenge.
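
To make three of the ideas in the abstract concrete, here is a minimal NumPy sketch of plain statistics pooling, Squeeze-and-Excitation channel rescaling, and channel-dependent attentive statistics pooling. It is an illustration under stated assumptions, not the authors' implementation: the function names, the weight arguments (w1, b1, w2, b2) and the `scores` input are hypothetical, and in a real system the attention scores and SE weights would be produced by learned layers.

```python
import numpy as np

def statistics_pooling(h):
    """Plain statistics pooling: (channels, frames) -> (2 * channels,)."""
    return np.concatenate([h.mean(axis=1), h.std(axis=1)])

def se_rescale(h, w1, b1, w2, b2):
    """Squeeze-and-Excitation sketch: rescale each channel by a gate
    computed from a global (whole-recording) summary of all channels.
    The weights are hypothetical stand-ins for learned parameters."""
    z = h.mean(axis=1)                            # squeeze: mean over time, (C,)
    s = np.maximum(w1 @ z + b1, 0.0)              # bottleneck with ReLU, (R,)
    g = 1.0 / (1.0 + np.exp(-(w2 @ s + b2)))      # sigmoid gates in (0, 1), (C,)
    return h * g[:, None]                         # excite: per-channel rescaling

def channel_attentive_statistics_pooling(h, scores):
    """Weighted mean and standard deviation with one attention weight per
    channel AND frame. `scores` has shape (C, T) and is assumed to come
    from a small learned network that is not shown here."""
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    alpha = e / e.sum(axis=1, keepdims=True)      # softmax over time, per channel
    mean = (alpha * h).sum(axis=1)
    var = (alpha * h ** 2).sum(axis=1) - mean ** 2
    std = np.sqrt(np.clip(var, 1e-9, None))       # guard against tiny negatives
    return np.concatenate([mean, std])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    C, R = 8, 4                                   # channels, SE bottleneck size
    w1, b1 = rng.normal(size=(R, C)), np.zeros(R)
    w2, b2 = rng.normal(size=(C, R)), np.zeros(C)
    for T in (50, 120):                           # two utterance lengths
        h = se_rescale(rng.normal(size=(C, T)), w1, b1, w2, b2)
        scores = rng.normal(size=(C, T))
        emb = channel_attentive_statistics_pooling(h, scores)
        print(T, emb.shape)                       # same output size for both
```

With all scores equal, the softmax weights are uniform and the attentive version reduces to plain statistics pooling, which is a convenient sanity check. Because the weights differ per channel, each channel's mean and standard deviation can be dominated by a different subset of frames.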


research
07/07/2021

MACCIF-TDNN: Multi aspect aggregation of channel and context interdependence features in TDNN-based speaker verification

Most of the recent state-of-the-art results for speaker verification are...
research
10/08/2021

TitaNet: Neural Model for speaker representation with 1D Depth-wise separable convolutions and global context

In this paper, we propose TitaNet, a novel neural network architecture f...
research
09/13/2021

Studying squeeze-and-excitation used in CNN for speaker verification

In speaker verification, the extraction of voice representations is main...
research
05/18/2023

Validation of an ECAPA-TDNN system for Forensic Automatic Speaker Recognition under case work conditions

Different variants of a Forensic Automatic Speaker Recognition (FASR) sy...
research
05/10/2021

Study on the temporal pooling used in deep neural networks for speaker verification

The x-vector architecture has recently achieved state-of-the-art results...
research
05/20/2023

ACA-Net: Towards Lightweight Speaker Verification using Asymmetric Cross Attention

In this paper, we propose ACA-Net, a lightweight, global context-aware s...
research
12/23/2021

Graph attentive feature aggregation for text-independent speaker verification

The objective of this paper is to combine multiple frame-level features ...
