Learning Speaker Embedding from Text-to-Speech

10/21/2020
by   Jaejin Cho, et al.
0

Zero-shot multi-speaker Text-to-Speech (TTS) generates target speaker voices given an input text and the corresponding speaker embedding. In this work, we investigate the effectiveness of the TTS reconstruction objective to improve representation learning for speaker verification. We jointly trained end-to-end Tacotron 2 TTS and speaker embedding networks in a self-supervised fashion. We hypothesize that the embeddings will contain minimal phonetic information since the TTS decoder will obtain that information from the textual input. TTS reconstruction can also be combined with speaker classification to enhance these embeddings further. Once trained, the speaker encoder computes representations for the speaker verification task, while the rest of the TTS blocks are discarded. We investigated training TTS from either manual or ASR-generated transcripts. The latter allows us to train embeddings on datasets without manual transcripts. We compared ASR transcripts and Kaldi phone alignments as TTS inputs, showing that the latter performed better due to their finer resolution. Unsupervised TTS embeddings improved EER by 2.06% absolute with regard to i-vectors for the LibriTTS dataset. TTS with speaker classification loss improved EER by 0.28% and 0.73% absolutely from a model using only speaker classification loss in LibriTTS and Voxceleb1 respectively.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
08/06/2019

An End-to-End Text-independent Speaker Verification Framework with a Keyword Adversarial Network

This paper presents an end-to-end text-independent speaker verification ...
research
04/06/2019

Self-supervised speaker embeddings

Contrary to i-vectors, speaker embeddings such as x-vectors are incapabl...
research
01/11/2023

Improving And Analyzing Neural Speaker Embeddings for ASR

Neural speaker embeddings encode the speaker's speech characteristics th...
research
02/07/2021

U-vectors: Generating clusterable speaker embedding from unlabeled data

Speaker recognition deals with recognizing speakers by their speech. Str...
research
10/25/2022

Adapitch: Adaption Multi-Speaker Text-to-Speech Conditioned on Pitch Disentangling with Untranscribed Data

In this paper, we proposed Adapitch, a multi-speaker TTS method that mak...
research
07/19/2023

An analysis on the effects of speaker embedding choice in non auto-regressive TTS

In this paper we introduce a first attempt on understanding how a non-au...
research
01/07/2020

Learning Speaker Embedding with Momentum Contrast

Speaker verification can be formulated as a representation learning task...

Please sign up or login with your details

Forgot password? Click here to reset