Building Synthetic Speaker Profiles in Text-to-Speech Systems

02/07/2022
by Jie Pu, et al.

The diversity of speaker profiles in multi-speaker TTS systems is a crucial aspect of their performance, as it measures how many different speaker profiles a TTS system can synthesize. However, this aspect is often overlooked when building multi-speaker TTS systems, and there is no established framework for evaluating it. The reason is that most multi-speaker TTS systems are limited to generating speech signals with the same speaker profiles as their training data. They often use discrete speaker embedding vectors that have a one-to-one correspondence with individual speakers. This correspondence hinders their ability to generate unseen speaker profiles that did not appear during training. In this paper, we aim to build multi-speaker TTS systems that have a greater variety of speaker profiles and can generate new synthetic speaker profiles that differ from the training data. To this end, we propose to use generative models with a triplet loss and a specific shuffle mechanism. Our experiments demonstrate the effectiveness and advantages of the proposed method in terms of both the distinctiveness and intelligibility of the synthesized speech signals.
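
The abstract names a triplet loss but does not give its form. For orientation only, the following is a minimal sketch of the standard margin-based triplet loss on speaker embedding vectors, which may differ from the formulation used in the paper. The function names, the Euclidean distance, and the default margin of 1.0 are assumptions made for illustration; the shuffle mechanism is not described in the abstract and is not sketched here.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard margin-based triplet loss (assumed form, not taken from the paper).

    anchor and positive are embeddings of the same speaker; negative is an
    embedding of a different speaker. The loss is zero once the negative is
    at least `margin` farther from the anchor than the positive is.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Hypothetical 2-D embeddings, chosen only to make the arithmetic easy to follow.
anchor = [0.0, 0.0]
same_speaker = [0.0, 1.0]
far_speaker = [3.0, 4.0]
near_speaker = [0.0, 1.5]

loss_far = triplet_loss(anchor, same_speaker, far_speaker)
loss_near = triplet_loss(anchor, same_speaker, near_speaker)
```

With these values, the distant negative already satisfies the margin and contributes no loss, while the nearby negative is penalized. This is the property that pushes embeddings of different speakers apart.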


research
04/01/2019

Training Multi-Speaker Neural Text-to-Speech Systems using Speaker-Imbalanced Speech Corpora

When the available data of a target speaker is insufficient to train a h...
research
07/23/2020

Version Control of Speaker Recognition Systems

This paper discusses one of the most challenging practical engineering p...
research
11/08/2018

Speaker-adaptive neural vocoders for statistical parametric speech synthesis systems

This paper proposes speaker-adaptive neural vocoders for statistical par...
research
11/24/2020

Synth2Aug: Cross-domain speaker recognition with TTS synthesized speech

In recent years, Text-To-Speech (TTS) has been used as a data augmentati...
research
07/07/2021

Effective and Differentiated Use of Control Information for Multi-speaker Speech Synthesis

In multi-speaker speech synthesis, data from a number of speakers usuall...
research
04/12/2017

Trainable Referring Expression Generation using Overspecification Preferences

Referring expression generation (REG) models that use speaker-dependent ...
research
02/02/2023

Site-specific Deep Learning Path Loss Models based on the Method of Moments

This paper describes deep learning models based on convolutional neural ...
