T-vectors: Weakly Supervised Speaker Identification Using Hierarchical Transformer Model

10/29/2020
by Yanpei Shi, et al.

Identifying multiple speakers without knowing where a speaker's voice is in a recording is a challenging task. This paper proposes a hierarchical network with transformer encoders and a memory mechanism to address this problem. The proposed model contains a frame-level encoder and a segment-level encoder, both of which use the transformer encoder block. The multi-head attention mechanism in the transformer structure can better capture different speaker properties when the input utterance contains multiple speakers. The memory mechanism used in the frame-level encoders builds a recurrent connection that better captures long-term speaker features. The experiments are conducted on artificial datasets based on the Switchboard Cellular Part 1 (SWBC) and VoxCeleb1 datasets. In different data construction scenarios (Concat and Overlap), the proposed model performs better than four strong baselines, reaching 13.3 H-vectors and S-vectors. Using the memory mechanism yields a 10.6 relative improvement compared with not using it.
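
The two-level structure described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the function names (attend, mean_pool, encode_utterance) are hypothetical, attention is single-head with identity projections instead of learned multi-head weights, pooling is a plain mean, and the memory is modelled as the previous segment's hidden states prepended to the attention context.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def attend(queries, context):
    # Single-head scaled dot-product attention with identity projections
    # (assumption: a real transformer block uses learned multi-head weights).
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in context]
        weights = softmax(scores)
        out.append([sum(w * k[i] for w, k in zip(weights, context)) for i in range(d)])
    return out


def mean_pool(vectors):
    # Average a list of equal-length vectors into one vector.
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def encode_utterance(frames, segment_len, use_memory=True):
    memory = []  # hidden states cached from the previous segment (assumed form of memory)
    segment_vectors = []
    for start in range(0, len(frames), segment_len):
        segment = frames[start:start + segment_len]
        context = memory + segment if use_memory else segment
        hidden = attend(segment, context)  # stand-in for the frame-level encoder
        segment_vectors.append(mean_pool(hidden))
        memory = hidden  # recurrent connection carried to the next segment
    # Stand-in for the segment-level encoder over the sequence of segment vectors.
    refined = attend(segment_vectors, segment_vectors)
    return mean_pool(refined)


frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
utterance_vector = encode_utterance(frames, segment_len=2)
```

In this sketch the memory only changes the result from the second segment onward, since the first segment has nothing cached; a trained model would add a classifier on top of the utterance vector to predict which speakers are present.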


research
05/15/2020

Weakly Supervised Training of Hierarchical Attention Networks for Speaker Identification

Identifying multiple speakers without knowing where a speaker's voice is...
research
10/17/2019

H-VECTORS: Utterance-level Speaker Embedding Using A Hierarchical Attention Model

In this paper, a hierarchical attention network to generate utterance-le...
research
08/11/2020

S-vectors: Speaker Embeddings based on Transformer's Encoder for Text-Independent Speaker Verification

X-vectors have become the standard for speaker-embeddings in automatic s...
research
02/18/2020

Hierarchical Transformer Network for Utterance-level Emotion Recognition

While there have been significant advances in detecting emotions in tex...
research
12/29/2020

A Hierarchical Transformer with Speaker Modeling for Emotion Recognition in Conversation

Emotion Recognition in Conversation (ERC) is a more challenging task tha...
research
05/26/2022

DT-SV: A Transformer-based Time-domain Approach for Speaker Verification

Speaker verification (SV) aims to determine whether the speaker's identi...
research
07/10/2023

Automatic Piano Transcription with Hierarchical Frequency-Time Transformer

Taking long-term spectral and temporal dependencies into account is esse...
