DT-SV: A Transformer-based Time-domain Approach for Speaker Verification

05/26/2022
by   Nan Zhang, et al.
0

Speaker verification (SV) aims to determine whether the speaker's identity of a test utterance is the same as the reference speech. In the past few years, extracting speaker embeddings using deep neural networks for SV systems has gone mainstream. Recently, different attention mechanisms and Transformer networks have been explored widely in SV fields. However, utilizing the original Transformer in SV directly may have frame-level information waste on output features, which could lead to restrictions on capacity and discrimination of speaker embeddings. Therefore, we propose an approach to derive utterance-level speaker embeddings via a Transformer architecture that uses a novel loss function named diffluence loss to integrate the feature information of different Transformer layers. Therein, the diffluence loss aims to aggregate frame-level features into an utterance-level representation, and it could be integrated into the Transformer expediently. Besides, we also introduce a learnable mel-fbank energy feature extractor named time-domain feature extractor that computes the mel-fbank features more precisely and efficiently than the standard mel-fbank extractor. Combining Diffluence loss and Time-domain feature extractor, we propose a novel Transformer-based time-domain SV model (DT-SV) with faster training speed and higher accuracy. Experiments indicate that our proposed model can achieve better performance in comparison with other models.

READ FULL TEXT
research
08/11/2020

S-vectors: Speaker Embeddings based on Transformer's Encoder for Text-Independent Speaker Verification

X-vectors have become the standard for speaker-embeddings in automatic s...
research
04/07/2020

Multi-Scale Aggregation Using Feature Pyramid Module for Text-Independent Speaker Verification

Currently, the most widely used approach for speaker verification is the...
research
10/25/2018

Short utterance compensation in speaker verification via cosine-based teacher-student learning of speaker embeddings

Input utterance with short duration is one of the most critical threats ...
research
12/21/2020

Multi-stream Convolutional Neural Network with Frequency Selection for Robust Speaker Verification

Speaker verification aims to verify whether an input speech corresponds ...
research
10/10/2021

Poformer: A simple pooling transformer for speaker verification

Most recent speaker verification systems are based on extracting speaker...
research
06/01/2023

A Teacher-Student approach for extracting informative speaker embeddings from speech mixtures

We introduce a monaural neural speaker embeddings extractor that compute...
research
10/29/2020

T-vectors: Weakly Supervised Speaker Identification Using Hierarchical Transformer Model

Identifying multiple speakers without knowing where a speaker's voice is...

Please sign up or login with your details

Forgot password? Click here to reset