Self-attention encoding and pooling for speaker recognition

08/03/2020
by Pooyan Safari, et al.

The computing power of mobile devices limits end-user applications in terms of storage size, processing, memory, and energy consumption. These limitations motivate researchers to design more efficient deep models. On the other hand, self-attention networks based on the Transformer architecture have attracted remarkable interest due to their high parallelization capability and strong performance on a variety of Natural Language Processing (NLP) tasks. Inspired by the Transformer, we propose a tandem Self-Attention Encoding and Pooling (SAEP) mechanism to obtain a discriminative speaker embedding from variable-length speech utterances. SAEP is a stack of identical blocks that rely solely on self-attention and position-wise feed-forward networks to create vector representations of speakers. This approach encodes short-term speaker spectral features into speaker embeddings to be used in text-independent speaker verification. We have evaluated this approach on both the VoxCeleb1 and VoxCeleb2 datasets. The proposed architecture outperforms the baseline x-vector and shows competitive performance against other convolution-based benchmarks, with a significant reduction in model size: it employs 94% fewer parameters than x-vector. This indicates that the proposed fully attention-based architecture is more efficient at extracting time-invariant features from speaker utterances.
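To make the pipeline concrete, here is a minimal NumPy sketch of the SAEP idea described above: a stack of self-attention + position-wise feed-forward blocks encodes a variable-length sequence of frame-level features, and an attentive pooling layer collapses the time axis into a fixed-size speaker embedding. All function names, shapes, and the use of single-head attention with randomly initialized weights are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch only: single-head attention, random weights standing in
# for learned parameters, no layer normalization or training loop.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_block(x, rng, d_ff=64):
    """One encoder block: self-attention + position-wise feed-forward,
    each with a residual connection. x has shape (T, d)."""
    T, d = x.shape
    # Hypothetical random projections standing in for learned weights.
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(d))   # (T, T) frame-to-frame weights
    h = attn @ v + x                       # residual connection
    # Position-wise feed-forward network (applied to each frame).
    W1 = rng.standard_normal((d, d_ff)) / np.sqrt(d)
    W2 = rng.standard_normal((d_ff, d)) / np.sqrt(d_ff)
    return np.maximum(h @ W1, 0.0) @ W2 + h

def attentive_pooling(x, rng):
    """Collapse the time axis with attention weights -> fixed-size vector."""
    T, d = x.shape
    w = rng.standard_normal(d) / np.sqrt(d)  # hypothetical scoring vector
    alpha = softmax(x @ w)                   # (T,) weights over frames
    return alpha @ x                         # (d,) utterance-level embedding

def saep_embedding(frames, n_blocks=2, seed=0):
    """Encode (T, d) frame features into a d-dimensional speaker embedding."""
    rng = np.random.default_rng(seed)
    x = frames
    for _ in range(n_blocks):
        x = self_attention_block(x, rng)
    return attentive_pooling(x, rng)

# Utterances of different lengths map to embeddings of the same size.
e1 = saep_embedding(np.random.default_rng(1).standard_normal((50, 24)))
e2 = saep_embedding(np.random.default_rng(2).standard_normal((120, 24)))
print(e1.shape, e2.shape)
```

The key property shown is length invariance: because attention and pooling operate over whatever number of frames arrives, a 50-frame and a 120-frame utterance both yield a 24-dimensional embedding that can be compared directly for verification.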


Related research

- Self Multi-Head Attention for Speaker Recognition (06/24/2019): Most state-of-the-art Deep Learning (DL) approaches for speaker recognit...
- Double Multi-Head Attention for Speaker Verification (07/26/2020): Most state-of-the-art Deep Learning systems for speaker verification are...
- Attention Back-end for Automatic Speaker Verification with Multiple Enrollment Utterances (04/04/2021): A back-end model is a key element of modern speaker verification systems...
- AM-MobileNet1D: A Portable Model for Speaker Recognition (03/31/2020): Speaker Recognition and Speaker Identification are challenging tasks wit...
- Masked cross self-attention encoding for deep speaker embedding (01/28/2020): In general, speaker verification tasks require the extraction of speaker...
- End-to-End Neural Speaker Diarization with Self-attention (09/13/2019): Speaker diarization has been mainly developed based on the clustering of...
- End-to-End Trainable Self-Attentive Shallow Network for Text-Independent Speaker Verification (08/14/2020): Generalized end-to-end (GE2E) model is widely used in speaker verificati...
