Expressive Neural Voice Cloning

01/30/2021
by Paarth Neekhara, et al.

Voice cloning is the task of learning to synthesize the voice of an unseen speaker from a few samples. While current voice cloning methods achieve promising results in Text-to-Speech (TTS) synthesis for a new voice, they lack the ability to control the expressiveness of the synthesized audio. In this work, we propose a controllable voice cloning method that allows fine-grained control over various style aspects of the synthesized speech for an unseen speaker. We achieve this by explicitly conditioning the speech synthesis model on a speaker encoding, a pitch contour, and latent style tokens during training. Through both quantitative and qualitative evaluations, we show that our framework can be used for various expressive voice cloning tasks using only a few transcribed or untranscribed speech samples from a new speaker. These cloning tasks include style transfer from a reference speech sample, synthesizing speech directly from text, and fine-grained style control by manipulating the style conditioning variables during inference.
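The conditioning idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`style_embedding`, `condition_frames`), the tiny vector sizes, and the choice to combine the signals by simple concatenation are all assumptions made for illustration. It shows a style vector formed as a softmax-weighted mix of latent style tokens, then attached, together with a fixed speaker encoding and a per-step pitch value, to each time step of a feature sequence.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def style_embedding(token_bank: List[Vector], scores: Vector) -> Vector:
    """Mix latent style tokens with softmax(scores) as the weights (hypothetical helper)."""
    weights = softmax(scores)
    dim = len(token_bank[0])
    return [sum(w * tok[d] for w, tok in zip(weights, token_bank)) for d in range(dim)]


def condition_frames(frames: List[Vector], speaker: Vector,
                     style: Vector, pitch: Vector) -> List[Vector]:
    """Concatenate speaker, style, and one pitch value onto every time step.

    Concatenation is an assumption of this sketch, not a detail from the abstract.
    """
    if len(frames) != len(pitch):
        raise ValueError("pitch contour must have one value per time step")
    return [frame + speaker + style + [f0] for frame, f0 in zip(frames, pitch)]


# Toy inputs: 3 time steps of 2-dim features, a 2-dim speaker encoding,
# two 2-dim style tokens, and a pitch contour in Hz (0.0 marks an unvoiced step).
frames = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
speaker = [1.0, -1.0]
token_bank = [[1.0, 0.0], [0.0, 1.0]]
pitch = [220.0, 0.0, 233.1]

style = style_embedding(token_bank, [0.0, 0.0])  # equal scores give an even mix
conditioned = condition_frames(frames, speaker, style, pitch)

# Style control at inference time: raise the pitch contour by 20 percent
# while keeping the speaker encoding and style vector fixed.
raised_pitch = [f0 * 1.2 for f0 in pitch]
conditioned_raised = condition_frames(frames, speaker, style, raised_pitch)
```

Because the speaker encoding, style vector, and pitch contour enter as separate inputs, each can be changed independently at inference time, which is the kind of control the abstract describes.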

research
10/26/2019

Mellotron: Multispeaker expressive voice synthesis by conditioning on rhythm, pitch and global style tokens

Mellotron is a multispeaker voice synthesis model based on Tacotron 2 GS...

research
09/08/2021

Referee: Towards reference-free cross-speaker style transfer with low-quality data for expressive speech synthesis

Cross-speaker style transfer (CSST) in text-to-speech (TTS) synthesis ai...

research
04/07/2022

Expressive Singing Synthesis Using Local Style Token and Dual-path Pitch Encoder

This paper proposes a controllable singing voice synthesis system capabl...

research
08/04/2021

Daft-Exprt: Robust Prosody Transfer Across Speakers for Expressive Speech Synthesis

This paper presents Daft-Exprt, a multi-speaker acoustic model advancing...

research
03/17/2021

STYLER: Style Factor Modeling with Rapidity and Robustness via Speech Decomposition for Expressive and Controllable Neural Text to Speech

Previous works on neural text-to-speech (TTS) have been addressed on lim...

research
11/06/2018

Robust and fine-grained prosody control of end-to-end speech synthesis

We propose prosody embeddings for emotional and expressive speech synthe...

research
05/30/2023

Make-A-Voice: Unified Voice Synthesis With Discrete Representation

Various applications of voice synthesis have been developed independentl...