Mellotron: Multispeaker expressive voice synthesis by conditioning on rhythm, pitch and global style tokens

10/26/2019
by Rafael Valle, et al.

Mellotron is a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data. By explicitly conditioning on rhythm and continuous pitch contours from an audio signal or music score, Mellotron is able to generate speech in a variety of styles ranging from read speech to expressive speech, from slow drawls to rap and from monotonous voice to singing voice. Unlike other methods, we train Mellotron using only read speech data without alignments between text and audio. We evaluate our models using the LJSpeech and LibriTTS datasets. We provide F0 Frame Errors and synthesized samples that include style transfer from other speakers, singers and styles not seen during training, procedural manipulation of rhythm and pitch and choir synthesis.
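
The abstract reports F0 Frame Errors without defining the metric. As a minimal sketch, assuming the commonly used definition (a frame counts as an error if the voiced/unvoiced decision disagrees with the reference, or if both frames are voiced and the pitch deviates by more than a relative tolerance), it could be computed as below. The function name `f0_frame_error`, the 20% tolerance, and the convention that unvoiced frames are stored as 0 Hz are assumptions for illustration, not details taken from the paper.

```python
def f0_frame_error(f0_ref, f0_syn, tol=0.2):
    """Fraction of frames with a voicing error or a gross pitch error.

    Assumptions (not from the paper): both inputs are frame-aligned
    sequences of F0 values in Hz, unvoiced frames are encoded as 0,
    and a gross pitch error is a relative deviation above `tol`.
    """
    if len(f0_ref) != len(f0_syn):
        raise ValueError("pitch tracks must have the same number of frames")
    if len(f0_ref) == 0:
        raise ValueError("pitch tracks must not be empty")

    errors = 0
    for ref, syn in zip(f0_ref, f0_syn):
        ref_voiced = ref > 0
        syn_voiced = syn > 0
        if ref_voiced != syn_voiced:
            # Voicing decision error: one track is voiced, the other is not.
            errors += 1
        elif ref_voiced and abs(syn - ref) > tol * ref:
            # Gross pitch error: both voiced, but pitch is too far off.
            errors += 1
    return errors / len(f0_ref)


# Frame 2 is a gross pitch error (30% off), frame 4 is a voicing error.
reference = [0.0, 100.0, 200.0, 0.0]
synthesized = [0.0, 130.0, 205.0, 50.0]
print(f0_frame_error(reference, synthesized))  # 0.5
```

Lower values mean the synthesized pitch contour follows the reference more closely, which is the property of interest when a model is conditioned on pitch from an audio signal or music score.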

