Data-driven Attention and Data-independent DCT based Global Context Modeling for Text-independent Speaker Recognition

08/04/2022
by Wei Xia, et al.

Learning an effective speaker representation is crucial for achieving reliable performance in speaker verification tasks. Speech signals are high-dimensional, long, variable-length sequences with a complex hierarchical structure, and they may carry diverse information at each time-frequency (TF) location. For example, it may be more beneficial to focus on high-energy regions for phoneme classes such as fricatives. A standard convolutional layer, which operates only on neighboring local regions, cannot capture this complex global TF context. In this study, a general global time-frequency context modeling framework is proposed to leverage such context information specifically for speaker representation modeling. First, a data-driven attention-based context model is introduced to capture long-range, non-local relationships across different time-frequency locations. Second, a data-independent 2D-DCT based context model is proposed to improve model interpretability, and a multi-DCT attention mechanism is presented to increase modeling power with alternative DCT basis forms. Finally, the global context information is used to recalibrate salient time-frequency locations by computing the similarity between the global context and the local features. The proposed lightweight blocks can be easily incorporated into a speaker model with little additional computational cost and improve speaker verification performance by a large margin over the standard ResNet model and the Squeeze-and-Excitation block. Detailed ablation studies are also performed to analyze the factors that may affect the performance of the individual proposed modules. Experimental results show that the proposed global context modeling framework efficiently improves the learned speaker representations by achieving channel-wise and time-frequency feature recalibration.
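To make the data-independent context model concrete, the sketch below shows how a fixed 2D-DCT basis can pool a global time-frequency context vector from a feature map and use it to recalibrate local features. This is a minimal PyTorch sketch under stated assumptions, not the paper's implementation: the module name DCTGlobalContext, the choice of four low-frequency DCT components, the bottleneck transform, and the purely channel-wise gating are illustrative assumptions; the paper's multi-DCT attention and time-frequency recalibration may be realized differently.

```python
# Hypothetical sketch of a data-independent 2D-DCT global context block (PyTorch assumed).
import math
import torch
import torch.nn as nn


def dct_basis_2d(height, width, freqs):
    """Build fixed 2D-DCT basis functions for the given (u, v) frequency pairs.
    Returns a tensor of shape (len(freqs), height, width)."""
    t = torch.arange(height).float()
    f = torch.arange(width).float()
    bases = []
    for (u, v) in freqs:
        basis_t = torch.cos(math.pi * u * (2 * t + 1) / (2 * height))  # (H,)
        basis_f = torch.cos(math.pi * v * (2 * f + 1) / (2 * width))   # (W,)
        bases.append(torch.outer(basis_t, basis_f))                    # (H, W)
    return torch.stack(bases, dim=0)


class DCTGlobalContext(nn.Module):
    """Pools a global context vector with fixed 2D-DCT bases (a few low-frequency
    components, combined by a learned mixer) and recalibrates the local feature
    map with a gate derived from that context. Names and defaults are illustrative."""

    def __init__(self, channels, height, width,
                 freqs=((0, 0), (0, 1), (1, 0), (1, 1)), reduction=8):
        super().__init__()
        self.register_buffer("bases", dct_basis_2d(height, width, freqs))  # (K, H, W)
        self.mix = nn.Linear(len(freqs), 1)          # combine the K DCT components
        self.transform = nn.Sequential(              # bottleneck transform of the context
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        # x: (B, C, H, W) time-frequency feature map from a convolutional stage
        b, c, h, w = x.shape
        # Project the feature map onto each fixed DCT basis -> (B, C, K)
        ctx = torch.einsum("bchw,khw->bck", x, self.bases)
        # Combine the K DCT components into one global context vector -> (B, C)
        ctx = self.mix(ctx).squeeze(-1)
        # Bottleneck transform and sigmoid gate, then recalibrate local features
        gate = torch.sigmoid(self.transform(ctx)).view(b, c, 1, 1)
        return x * gate
```

As a usage example, a block like this could be appended after a ResNet stage, e.g. DCTGlobalContext(channels=128, height=10, width=25) for a 128-channel feature map of that fixed TF size; because the DCT bases are precomputed for a fixed (height, width), variable-length inputs would need pooling or interpolation first.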


Related research

09/02/2020 · Speaker Representation Learning using Global Context Guided Channel and Time-Frequency Transformations
In this study, we propose the global context guided channel and time-fre...

10/13/2021 · Duality Temporal-channel-frequency Attention Enhanced Speaker Representation Learning
The use of channel-wise attention in CNN based speaker representation ne...

07/10/2022 · Multi-Frequency Information Enhanced Channel Attention Module for Speaker Representation Learning
Recently, attention mechanisms have been applied successfully in neural ...

12/14/2021 · Explore Long-Range Context feature for Speaker Verification
Capturing long-range dependency and modeling long temporal contexts is p...

03/01/2023 · PCF: ECAPA-TDNN with Progressive Channel Fusion for Speaker Verification
ECAPA-TDNN is currently the most popular TDNN-series model for speaker v...

03/20/2023 · Dual-stream Time-Delay Neural Network with Dynamic Global Filter for Speaker Verification
The time-delay neural network (TDNN) is one of the state-of-the-art mode...

06/12/2021 · Structure-Regularized Attention for Deformable Object Representation
Capturing contextual dependencies has proven useful to improve the repre...
