Self Multi-Head Attention for Speaker Recognition

06/24/2019
by Miquel India, et al.

Most state-of-the-art Deep Learning (DL) approaches for speaker recognition operate at the short-utterance level. Given a speech signal, these algorithms extract a sequence of speaker embeddings from short segments and average them to obtain an utterance-level speaker representation. In this work we propose the use of an attention mechanism to obtain a discriminative speaker embedding from variable-length speech utterances. Our system is based on a Convolutional Neural Network (CNN) that encodes short-term speaker features from the spectrogram, and a self multi-head attention model that maps these representations into a long-term speaker embedding. The proposed attention model produces multiple alignments from different subsegments of the CNN-encoded states over the sequence. This mechanism therefore works as a pooling layer that selects the most discriminative features over the sequence to obtain an utterance-level representation. We have tested this approach on the verification task of the VoxCeleb1 dataset. The results show that self multi-head attention outperforms both temporal and statistical pooling methods, with an 18% relative improvement in EER. The results also show a 58% relative improvement in EER compared to i-vector+PLDA.
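
The pooling idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the scoring rule (a scaled dot product between each head's sub-vector and a per-head query vector), the names `multi_head_attention_pooling`, `temporal_pooling`, `states` and `queries`, and the random toy data are all assumptions made for illustration. Each encoded frame is split into K sub-vectors, every head computes its own softmax alignment over time, and the per-head weighted averages are concatenated into one utterance-level vector.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multi_head_attention_pooling(states, queries):
    """Pool a (T x D) sequence of encoded states into one D-dim vector.

    states:  list of T frames, each a list of D floats (stand-in for CNN output).
    queries: list of K per-head query vectors, each of length D // K.
             In a real model these would be trained; the dot-product
             scoring rule used here is an assumption of this sketch.
    """
    num_heads = len(queries)
    dim = len(states[0])
    if dim % num_heads != 0:
        raise ValueError("feature dimension must be divisible by the number of heads")
    head_dim = dim // num_heads

    pooled = []
    for k, query in enumerate(queries):
        lo, hi = k * head_dim, (k + 1) * head_dim
        # Sub-vector of every frame that belongs to head k.
        sub = [frame[lo:hi] for frame in states]
        # Scaled dot-product score of each frame against the head's query.
        scores = [
            sum(a * b for a, b in zip(s, query)) / math.sqrt(head_dim)
            for s in sub
        ]
        weights = softmax(scores)  # one alignment over time per head
        # Weighted average over time for this head, appended to the output.
        for d in range(head_dim):
            pooled.append(sum(w * s[d] for w, s in zip(weights, sub)))
    return pooled


def temporal_pooling(states):
    """Baseline: plain average over time."""
    num_frames = len(states)
    return [sum(frame[d] for frame in states) / num_frames
            for d in range(len(states[0]))]


# Toy data: T frames of dimension D, pooled with K heads.
random.seed(0)
T, D, K = 50, 8, 4
states = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]
queries = [[random.gauss(0, 1) for _ in range(D // K)] for _ in range(K)]
embedding = multi_head_attention_pooling(states, queries)
print(len(embedding))
```

The output dimension does not depend on T, so utterances of any length map to a fixed-size vector. With all-zero queries every head assigns uniform weights, and the result reduces to plain temporal (average) pooling, which is one way to see attention pooling as a generalization of that baseline.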

research
07/26/2020

Double Multi-Head Attention for Speaker Verification

Most state-of-the-art Deep Learning systems for speaker verification are...
research
02/20/2019

Utterance-level end-to-end language identification using attention-based CNN-BLSTM

In this paper, we present an end-to-end language identification framewor...
research
08/03/2020

Self-attention encoding and pooling for speaker recognition

The computing power of mobile devices limits the end-user applications i...
research
02/22/2018

Neural Predictive Coding using Convolutional Neural Networks towards Unsupervised Learning of Speaker Characteristics

Learning speaker-specific features is vital in many applications like sp...
research
02/05/2020

Identification of Indian Languages using Ghost-VLAD pooling

In this work, we propose a new pooling strategy for language identificat...
research
08/21/2018

Exploring a Unified Attention-Based Pooling Framework for Speaker Verification

The pooling layer is an essential component in the neural network based ...
research
12/21/2020

Multi-stream Convolutional Neural Network with Frequency Selection for Robust Speaker Verification

Speaker verification aims to verify whether an input speech corresponds ...
