T-vectors: Weakly Supervised Speaker Identification Using Hierarchical Transformer Model

by Yanpei Shi, et al.

Identifying multiple speakers without knowing where each speaker's voice occurs in a recording is a challenging task. This paper proposes a hierarchical network with transformer encoders and a memory mechanism to address this problem. The proposed model contains a frame-level encoder and a segment-level encoder, both of which use the transformer encoder block. The multi-head attention mechanism in the transformer structure can better capture different speaker properties when the input utterance contains multiple speakers. The memory mechanism used in the frame-level encoder builds a recurrent connection that better captures long-term speaker features. The experiments are conducted on artificial datasets based on the Switchboard Cellular part1 (SWBC) and VoxCeleb1 datasets. In different data construction scenarios (Concat and Overlap), the proposed model performs better than four strong baselines, reaching up to 13.3% relative improvement compared with H-vectors and S-vectors. Using the memory mechanism yields up to 10.6% relative improvement compared with not using it.
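The abstract reports its gains as relative improvements. As a minimal sketch of how such a figure is usually computed, assuming a higher-is-better metric such as identification accuracy (the abstract does not name the metric), the helper below divides the gain by the baseline score. The function name `relative_improvement` and the example accuracies are hypothetical and are not taken from the paper.

```python
def relative_improvement(baseline: float, proposed: float) -> float:
    """Return the relative improvement of `proposed` over `baseline`, in percent.

    Assumes a higher-is-better metric (e.g. accuracy); this is an
    illustrative convention, not a definition taken from the paper.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (proposed - baseline) / baseline


if __name__ == "__main__":
    # Hypothetical accuracies chosen for illustration only.
    baseline_acc = 80.0
    proposed_acc = 88.0
    gain = relative_improvement(baseline_acc, proposed_acc)
    print(f"relative improvement: {gain:.1f}%")
```

Under this convention, a relative improvement differs from the absolute difference between the two scores: in the hypothetical example the absolute gain is 8 points, while the relative gain is measured against the baseline of 80.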





Weakly Supervised Training of Hierarchical Attention Networks for Speaker Identification

Identifying multiple speakers without knowing where a speaker's voice is...

H-VECTORS: Utterance-level Speaker Embedding Using A Hierarchical Attention Model

In this paper, a hierarchical attention network to generate utterance-le...

S-vectors: Speaker Embeddings based on Transformer's Encoder for Text-Independent Speaker Verification

X-vectors have become the standard for speaker-embeddings in automatic s...

Serialized Multi-Layer Multi-Head Attention for Neural Speaker Embedding

This paper proposes a serialized multi-layer multi-head attention for ne...

Weakly Supervised Training of Speaker Identification Models

We propose an approach for training speaker identification models in a w...

HeadPosr: End-to-end Trainable Head Pose Estimation using Transformer Encoders

In this paper, HeadPosr is proposed to predict the head poses using a si...

A Hierarchical Transformer with Speaker Modeling for Emotion Recognition in Conversation

Emotion Recognition in Conversation (ERC) is a more challenging task tha...