Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation

05/16/2020
by   Tao Tu, et al.

Recently, end-to-end multi-speaker text-to-speech (TTS) systems have achieved success in settings where large amounts of high-quality speech and the corresponding transcriptions are available. However, the laborious process of collecting paired data prevents many institutes from building high-performance multi-speaker TTS systems. In this work, we propose a semi-supervised learning approach for multi-speaker TTS. A multi-speaker TTS model can learn from untranscribed audio via the proposed encoder-decoder framework with discrete speech representation. The experimental results demonstrate that with only an hour of paired speech data, whether that paired data comes from multiple speakers or a single speaker, the proposed model can generate intelligible speech in different voices. We also find that the model benefits from the proposed semi-supervised learning approach even when part of the unpaired speech data is noisy. In addition, our analysis reveals that the speaker characteristics of the paired data affect the effectiveness of semi-supervised TTS.
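The abstract's key ingredient is a discrete speech representation learned from untranscribed audio. A common way to obtain such a representation is vector quantization: each acoustic frame is assigned to its nearest entry in a learned codebook, turning continuous features into a sequence of discrete codes. The sketch below is not the paper's actual model; it is a minimal illustration (with a random, untrained codebook) of the nearest-neighbor quantization step that a VQ-style bottleneck would perform.

```python
import numpy as np

def quantize(frames, codebook):
    """Assign each acoustic frame to its nearest codebook entry.

    frames:   (T, D) array of frame-level features
    codebook: (K, D) array of K discrete code vectors
    Returns the code indices (T,) and the quantized frames (T, D).
    """
    # Squared Euclidean distance between every frame and every code: (T, K)
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)          # discrete code index per frame
    return idx, codebook[idx]           # quantized (discretized) frames

# Toy data: 10 frames of 4-dim features, a codebook of 8 codes (untrained).
rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))
frames = rng.normal(size=(10, 4))

idx, quantized = quantize(frames, codebook)
print(idx)  # one discrete code per frame, usable as pseudo-transcription units
```

In a semi-supervised setup of the kind the abstract describes, such discrete codes extracted from untranscribed audio can stand in for text-side inputs when training the decoder, so the large unpaired corpus still contributes a training signal.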


