Serialized Multi-Layer Multi-Head Attention for Neural Speaker Embedding

07/14/2021
by   Hongning Zhu, et al.

This paper proposes a serialized multi-layer multi-head attention mechanism for neural speaker embedding in text-independent speaker verification. In prior work, frame-level features from a single layer are aggregated to form an utterance-level representation. Inspired by the Transformer network, our proposed method exploits the hierarchical architecture of stacked self-attention mechanisms to derive refined features that are more correlated with speakers. The serialized attention mechanism contains a stack of self-attention modules that create fixed-dimensional representations of speakers. Instead of applying multi-head attention in parallel, the proposed serialized multi-layer multi-head attention aggregates and propagates attentive statistics from one layer to the next in a serialized manner. In addition, we employ an input-aware query for each utterance, derived via statistics pooling. With more layers stacked, the neural network can learn more discriminative speaker embeddings. Experimental results on the VoxCeleb1 and SITW datasets show that our proposed method outperforms other baseline methods, including x-vectors and x-vectors with conventional attentive pooling approaches, by 9.7%.
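To make the idea concrete, the following is a minimal NumPy sketch, not the authors' implementation: each layer pools attentive statistics (weighted mean and standard deviation) using an input-aware query obtained from the layer's own frames, the pooled statistics are accumulated across layers in a serialized manner, and a random linear map stands in for the paper's Transformer-style feed-forward blocks. All function names and the toy layer transform are assumptions for illustration.

```python
import numpy as np

def attentive_stats_pooling(frames, query):
    """Attentive statistics pooling: score each frame against an
    utterance-level (input-aware) query, softmax the scores over time,
    and return the attention-weighted mean and std, concatenated."""
    scores = frames @ query                        # (T,) similarity per frame
    scores = scores - scores.max()                 # numerical stability
    w = np.exp(scores) / np.exp(scores).sum()      # softmax over time
    mean = (w[:, None] * frames).sum(axis=0)       # weighted mean, shape (D,)
    var = (w[:, None] * (frames - mean) ** 2).sum(axis=0)
    std = np.sqrt(var + 1e-9)                      # weighted std, shape (D,)
    return np.concatenate([mean, std])             # shape (2D,)

def serialized_attention(frames, n_layers=3, rng=None):
    """Toy serialized multi-layer attention: pooled statistics from each
    layer are aggregated into one utterance-level vector, and the frames
    are transformed before being propagated to the next layer.
    The tanh(x @ W) step is a hypothetical stand-in for the paper's
    Transformer blocks."""
    rng = np.random.default_rng(0) if rng is None else rng
    T, D = frames.shape
    utterance = np.zeros(2 * D)
    x = frames
    for _ in range(n_layers):
        query = x.mean(axis=0)                     # input-aware query via statistics pooling
        utterance += attentive_stats_pooling(x, query)  # aggregate this layer's statistics
        W = rng.standard_normal((D, D)) / np.sqrt(D)
        x = np.tanh(x @ W)                         # propagate refined features onward
    return utterance                               # fixed-dimensional speaker representation
```

For an utterance of 50 frames with 8-dimensional features, the result is a fixed 16-dimensional vector regardless of utterance length, which is the property the pooling is designed to provide.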


Related research

- 03/29/2018: Attentive Statistics Pooling for Deep Speaker Embedding
  This paper proposes attentive statistics pooling for deep speaker embedd...

- 08/11/2020: S-vectors: Speaker Embeddings based on Transformer's Encoder for Text-Independent Speaker Verification
  X-vectors have become the standard for speaker-embeddings in automatic s...

- 10/10/2021: Poformer: A simple pooling transformer for speaker verification
  Most recent speaker verification systems are based on extracting speaker...

- 07/27/2020: Self-Attentive Multi-Layer Aggregation with Feature Recalibration and Normalization for End-to-End Speaker Verification System
  One of the most important parts of an end-to-end speaker verification sy...

- 06/26/2022: Transport-Oriented Feature Aggregation for Speaker Embedding Learning
  Pooling is needed to aggregate frame-level features into utterance-level...

- 10/11/2021: Multi-query multi-head attention pooling and Inter-topK penalty for speaker verification
  This paper describes the multi-query multi-head attention (MQMHA) poolin...

- 04/15/2019: Multi-Head Multi-Layer Attention to Deep Language Representations for Grammatical Error Detection
  It is known that a deep neural network model pre-trained with large-scal...
