Dual-stream Time-Delay Neural Network with Dynamic Global Filter for Speaker Verification

03/20/2023
by Yangfu Li, et al.

The time-delay neural network (TDNN) is one of the state-of-the-art models for text-independent speaker verification. However, a conventional TDNN struggles to capture global context, which many recent works have shown to be critical for robust speaker representations and long-duration speaker verification. Moreover, common solutions such as self-attention have complexity quadratic in the number of input tokens, which makes them computationally unaffordable when applied to the large feature maps in a TDNN. To address these issues, we propose the Global Filter for TDNN, which applies the log-linear-complexity FFT/IFFT and a set of differentiable frequency-domain filters to efficiently model the long-term dependencies in speech. In addition, a dynamic filtering strategy and a sparse regularization method are specially designed to enhance the performance of the global filter and prevent it from overfitting. Furthermore, we construct a dual-stream TDNN (DS-TDNN), which splits the basic channels to reduce complexity and employs the global filter to increase recognition performance. Experiments on Voxceleb and SITW databases show that the DS-TDNN achieves approximate 10 decline over 28 ECAPA-TDNN. It also offers the best trade-off between efficiency and effectiveness among popular baseline systems on long-duration speech. Finally, visualizations and a detailed ablation study further reveal the advantages of the DS-TDNN.
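The frequency-domain filtering idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `global_filter`, the tensor layout (channels by frames), and the random filter initialization are assumptions made for illustration, and the dynamic filtering strategy and sparse regularization are not modeled. The sketch only shows the general pattern of FFT along the time axis, element-wise multiplication by a per-channel filter, then IFFT, and checks that this equals a circular convolution spanning the whole utterance.

```python
import numpy as np

def global_filter(x, freq_filter):
    """Filter each channel over the full time axis in the frequency domain.

    x:           real array of shape (channels, frames)
    freq_filter: complex array of shape (channels, frames // 2 + 1);
                 in a trained model these would be learnable weights
                 (hypothetical layout, not taken from the paper).
    """
    n_frames = x.shape[-1]
    spectrum = np.fft.rfft(x, axis=-1)        # O(T log T) per channel
    spectrum = spectrum * freq_filter         # element-wise filtering
    return np.fft.irfft(spectrum, n=n_frames, axis=-1)

def circular_conv(x, kernel):
    """Reference O(T^2) circular convolution along the time axis."""
    out = np.zeros_like(x)
    for shift in range(x.shape[-1]):
        # np.roll(x, shift)[t] == x[(t - shift) mod T]
        out += kernel[:, shift:shift + 1] * np.roll(x, shift, axis=-1)
    return out

rng = np.random.default_rng(0)
channels, frames = 4, 200                     # illustrative sizes only
x = rng.standard_normal((channels, frames))

# A time-domain kernel as long as the input, i.e. a global receptive field.
kernel = rng.standard_normal((channels, frames))
freq_filter = np.fft.rfft(kernel, axis=-1)

y = global_filter(x, freq_filter)
matches = np.allclose(y, circular_conv(x, kernel))

# An all-ones filter leaves the input unchanged.
identity = np.ones((channels, frames // 2 + 1), dtype=complex)
unchanged = np.allclose(global_filter(x, identity), x)
```

Because multiplication in the frequency domain corresponds to circular convolution in time, every output frame depends on every input frame, while the cost grows as T log T in the number of frames T rather than T squared as with self-attention.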


