Explore Long-Range Context feature for Speaker Verification

12/14/2021
by Zhuo Li, et al.

Capturing long-range dependencies and modeling long temporal contexts have been shown to benefit speaker verification. In this paper, we propose combining the Hierarchical-Split block (HS-block) and the Depthwise Separable Self-Attention (DSSA) module to capture richer multi-range context in speaker features, from a local and a global perspective respectively. Specifically, the HS-block splits the feature maps and filters into several groups and stacks them within one block, which enlarges the receptive fields (RFs) locally. The DSSA module improves on the multi-head self-attention mechanism with a depthwise-separable strategy and an explicit sparse attention strategy, modeling pairwise relations globally and capturing effective long-range dependencies in each channel. Experiments are conducted on VoxCeleb and SITW. Our best system, combining the HS-block and the DSSA module, achieves 1.27% (EER) on SITW.
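The HS-block's local multi-scale idea can be illustrated in code. Below is a minimal, hypothetical PyTorch sketch, not the authors' implementation: the class name `HSBlock` and the exact split-and-stack wiring are assumptions simplified from the description above. Channels are split into groups, and each group after the first is convolved together with the previous group's output, so later groups accumulate progressively larger receptive fields within a single block.

```python
# Hypothetical simplified sketch of a Hierarchical-Split block (HSBlock is
# an assumed name; the paper's exact wiring may differ).
import torch
import torch.nn as nn


class HSBlock(nn.Module):
    """Split input channels into `splits` groups; each group after the
    first is filtered together with the previous group's output, so the
    effective receptive field grows from group to group."""

    def __init__(self, channels, splits=4):
        super().__init__()
        assert channels % splits == 0, "channels must divide evenly into splits"
        self.splits = splits
        width = channels // splits
        # One small conv per group (the first group is passed through as-is).
        self.convs = nn.ModuleList(
            nn.Conv2d(width, width, kernel_size=3, padding=1)
            for _ in range(splits - 1)
        )

    def forward(self, x):
        # Split the feature map along the channel dimension.
        chunks = torch.chunk(x, self.splits, dim=1)
        outputs = [chunks[0]]
        prev = chunks[0]
        for conv, chunk in zip(self.convs, chunks[1:]):
            # Each group sees its own channels plus the previous group's
            # output, hierarchically enlarging the receptive field.
            prev = conv(chunk + prev)
            outputs.append(prev)
        # Stack the groups back into one feature map.
        return torch.cat(outputs, dim=1)
```

The output has the same shape as the input, so such a block can drop into a residual backbone; the DSSA module described above would complement it by modeling pairwise relations globally rather than through stacked local convolutions.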

Related research

04/03/2018: Bi-Directional Block Self-Attention for Fast and Memory-Efficient Sequence Modeling
Recurrent neural networks (RNN), convolutional neural networks (CNN) and...

04/07/2023: PSLT: A Light-weight Vision Transformer with Ladder Self-Attention and Progressive Shift
Vision Transformer (ViT) has shown great potential for various visual ta...

06/16/2021: Invertible Attention
Attention has been proved to be an efficient mechanism to capture long-r...

01/11/2021: ORDNet: Capturing Omni-Range Dependencies for Scene Parsing
Learning to capture dependencies between spatial positions is essential ...

08/04/2022: Data-driven Attention and Data-independent DCT based Global Context Modeling for Text-independent Speaker Recognition
Learning an effective speaker representation is crucial for achieving re...

03/18/2020: PIC: Permutation Invariant Convolution for Recognizing Long-range Activities
Neural operations as convolutions, self-attention, and vector aggregatio...

09/02/2020: Speaker Representation Learning using Global Context Guided Channel and Time-Frequency Transformations
In this study, we propose the global context guided channel and time-fre...
