UnitSpeech: Speaker-adaptive Speech Synthesis with Untranscribed Data

06/28/2023
by   Heeseung Kim, et al.

We propose UnitSpeech, a speaker-adaptive speech synthesis method that fine-tunes a diffusion-based text-to-speech (TTS) model using minimal untranscribed data. To achieve this, we use the self-supervised unit representation as a pseudo transcript and integrate the unit encoder into the pre-trained TTS model. We train the unit encoder to provide speech content to the diffusion-based decoder and then fine-tune the decoder for speaker adaptation to the reference speaker using a single <unit, speech> pair. UnitSpeech performs speech synthesis tasks such as TTS and voice conversion (VC) in a personalized manner without requiring model re-training for each task. UnitSpeech achieves results comparable or superior to previous baselines on personalized TTS and any-to-any VC tasks. Our model also shows widespread adaptive performance on real-world data and other tasks that use a unit sequence as input.

research
05/30/2022

Guided-TTS 2: A Diffusion Model for High-quality Adaptive Text-to-Speech with Untranscribed Data

We propose Guided-TTS 2, a diffusion-based generative model for high-qua...
research
04/20/2021

AdaSpeech 2: Adaptive Text to Speech with Untranscribed Data

Text to speech (TTS) is widely used to synthesize personal voice for a t...
research
12/15/2021

Textless Speech-to-Speech Translation on Real Data

We present a textless speech-to-speech translation (S2ST) system that ca...
research
10/25/2022

Adapitch: Adaption Multi-Speaker Text-to-Speech Conditioned on Pitch Disentangling with Untranscribed Data

In this paper, we proposed Adapitch, a multi-speaker TTS method that mak...
research
10/31/2020

AGAIN-VC: A One-shot Voice Conversion using Activation Guidance and Adaptive Instance Normalization

Recently, voice conversion (VC) has been widely studied. Many VC systems...
research
06/01/2022

AdaVITS: Tiny VITS for Low Computing Resource Speaker Adaptation

Speaker adaptation in text-to-speech synthesis (TTS) is to finetune a pr...
research
04/01/2022

Residual-guided Personalized Speech Synthesis based on Face Image

Previous works derive personalized speech features by training the model...
