Deep Segment Attentive Embedding for Duration Robust Speaker Verification

11/01/2018
by Bin Liu, et al.

LSTM-based speaker verification usually trains on a fixed-length local segment randomly truncated from an utterance to learn the utterance-level speaker embedding, but at test time verifies the speaker with the average embedding of all segments of the test utterance, which creates a critical mismatch between training and testing. This mismatch degrades verification performance, especially when the durations of training and testing utterances differ greatly. To alleviate this issue, we propose the deep segment attentive embedding method, which learns unified speaker embeddings for utterances of variable duration. Each utterance is segmented by a sliding window, and an LSTM extracts the embedding of each segment. Instead of using only one local segment, we use the whole utterance to learn the utterance-level embedding by applying attentive pooling to the embeddings of all segments. Moreover, a similarity loss on the segment-level embeddings is introduced to guide the segment attention toward the more speaker-discriminative segments, and it is jointly optimized with the similarity loss on the utterance-level embeddings. Systematic experiments on Tongdun and VoxCeleb show that the proposed method significantly improves robustness to duration variation and achieves the relative Equal Error Rate reduction of 50 respectively.
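
The segment-then-pool pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the LSTM encoder is replaced by a simple mean over frames, the attention score is assumed to be a dot product with a parameter vector `w`, and the window and hop sizes, the toy frames, and all function names are hypothetical.

```python
import math

def sliding_windows(frames, win, hop):
    """Split a list of frames into fixed-length, overlapping segments.
    Trailing frames that do not fill a whole window are dropped."""
    if len(frames) < win:
        return [frames]  # short utterance: keep it as a single segment
    return [frames[s:s + win] for s in range(0, len(frames) - win + 1, hop)]

def embed_segment(segment):
    """Stand-in for the LSTM encoder (assumption): mean of the frame vectors."""
    dim = len(segment[0])
    return [sum(f[d] for f in segment) / len(segment) for d in range(dim)]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attentive_pool(embeddings, w):
    """Weight each segment embedding by softmax(w . e) and sum them.
    The dot-product scoring form is an assumption made for illustration."""
    scores = [sum(wi * ei for wi, ei in zip(w, e)) for e in embeddings]
    alphas = softmax(scores)
    dim = len(embeddings[0])
    pooled = [sum(a * e[d] for a, e in zip(alphas, embeddings)) for d in range(dim)]
    return pooled, alphas

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

# Toy utterance: 10 frames of 2-dimensional features (hypothetical data).
frames = [[float(t), 1.0] for t in range(10)]
segments = sliding_windows(frames, win=4, hop=2)
embeddings = [embed_segment(s) for s in segments]
w = [0.1, 0.0]  # hypothetical attention parameters; learned in a real system
utt, alphas = attentive_pool(embeddings, w)
utt = l2_normalize(utt)
```

In this sketch the same pooling runs over however many segments an utterance yields, so utterances of different durations map to embeddings of the same size. Setting all attention weights equal would reduce it to the plain segment averaging that the abstract describes for test time.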

research
05/07/2020

Segment Aggregation for short utterances speaker verification using raw waveforms

Most studies on speaker verification systems focus on long-duration utte...
research
10/22/2020

Graph Attention Networks for Speaker Verification

This work presents a novel back-end framework for speaker verification u...
research
11/08/2022

BER: Balanced Error Rate For Speaker Diarization

DER is the primary metric to evaluate diarization performance while faci...
research
02/08/2019

Speaker diarisation using 2D self-attentive combination of embeddings

Speaker diarisation systems often cluster audio segments using speaker e...
research
08/13/2020

Cross attentive pooling for speaker verification

The goal of this paper is text-independent speaker verification where ut...
research
06/01/2023

Speaker verification using attentive multi-scale convolutional recurrent network

In this paper, we propose a speaker verification method by an Attentive ...
research
03/20/2020

Detecting Mismatch between Text Script and Voice-over Using Utterance Verification Based on Phoneme Recognition Ranking

The purpose of this study is to detect the mismatch between text script ...
