Supervised attention for speaker recognition

11/10/2020
by Seong Min Kye, et al.

The recently proposed self-attentive pooling (SAP) has shown good performance in several speaker recognition systems. In SAP systems, the context vector is trained end-to-end together with the feature extractor, and its role is to select the most discriminative frames for speaker recognition. However, SAP underperforms the temporal average pooling (TAP) baseline in some settings, which implies that the attention is not learnt effectively in end-to-end training. To tackle this problem, we introduce strategies for training the attention mechanism in a supervised manner, in which the context vector is learnt using classified samples. With our proposed methods, the context vector can be trained to select the most informative frames. We show that our method outperforms existing methods in various experimental settings, including short-utterance speaker recognition, and achieves performance competitive with existing baselines on the VoxCeleb datasets.
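To make the contrast between the two pooling layers concrete, the sketch below implements TAP and a commonly used form of SAP in plain Python. The abstract does not give the formulas, so the scoring function (a tanh projection compared against the context vector, followed by a softmax over time) is an assumption based on the usual SAP formulation, not the authors' exact method. The names `weight`, `bias`, and `context` are illustrative, and the supervised training strategy proposed in the paper is not shown.

```python
import math


def temporal_average_pooling(frames):
    """TAP: unweighted mean over time of frame-level features (T x D)."""
    num_frames = len(frames)
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / num_frames for d in range(dim)]


def self_attentive_pooling(frames, weight, bias, context):
    """SAP sketch (assumed formulation, not taken from the paper).

    frames:  T x D frame-level features
    weight:  H x D projection matrix (hypothetical parameter)
    bias:    length-H projection bias (hypothetical parameter)
    context: length-H context vector that scores each frame
    Returns the pooled D-dimensional vector and the T attention weights.
    """
    scores = []
    for x in frames:
        # Hidden representation of the frame: tanh(W x + b).
        hidden = [
            math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weight, bias)
        ]
        # Relevance of the frame: dot product with the context vector.
        scores.append(sum(h * c for h, c in zip(hidden, context)))

    # Numerically stable softmax over the time axis.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    attn = [e / total for e in exps]

    # Attention-weighted mean of the original frame features.
    dim = len(frames[0])
    pooled = [sum(a * f[d] for a, f in zip(attn, frames)) for d in range(dim)]
    return pooled, attn


if __name__ == "__main__":
    frames = [[1.0, 2.0], [3.0, 1.0], [5.0, 6.0]]
    weight = [[0.5, -0.5], [0.1, 0.2]]
    bias = [0.0, 0.0]
    pooled, attn = self_attentive_pooling(frames, weight, bias, [1.0, 0.5])
    print("attention weights:", attn)
    print("SAP embedding:", pooled)
    print("TAP embedding:", temporal_average_pooling(frames))
```

In this formulation, a context vector of all zeros gives every frame the same score, so the attention weights become uniform and SAP reduces to TAP. This illustrates why a poorly trained context vector brings no benefit over the TAP baseline.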


research
01/03/2017

End-to-End Attention based Text-Dependent Speaker Verification

A new type of End-to-End system for text-dependent speaker verification ...
research
04/14/2018

Exploring the Encoding Layer and Loss Function in End-to-End Speaker and Language Recognition System

In this paper, we explore the encoding/pooling layer and loss function i...
research
10/06/2017

End-to-end DNN Based Speaker Recognition Inspired by i-vector and PLDA

Recently several end-to-end speaker verification systems based on deep n...
research
09/25/2018

Attention Mechanism in Speaker Recognition: What Does It Learn in Deep Speaker Embedding?

This paper presents an experimental study on deep speaker embedding with...
research
08/13/2020

Cross attentive pooling for speaker verification

The goal of this paper is text-independent speaker verification where ut...
research
06/28/2022

Attention-based conditioning methods using variable frame rate for style-robust speaker verification

We propose an approach to extract speaker embeddings that are robust to ...
research
10/31/2018

Discriminatively Re-trained i-vector Extractor for Speaker Recognition

In this work we revisit discriminative training of the i-vector extracto...
