UniTTS: Residual Learning of Unified Embedding Space for Speech Style Control

06/21/2021
by Minsu Kang, et al.

We propose UniTTS, a novel high-fidelity expressive speech synthesis model that learns and controls overlapping style attributes without interference. UniTTS represents multiple style attributes in a single unified embedding space, encoding each attribute as the residual between the phoneme embeddings before and after the attribute is applied. The method is especially effective for controlling attributes that are difficult to separate cleanly, such as speaker ID and emotion: it minimizes redundancy when adding variance for speaker ID and emotion, and it additionally predicts duration, pitch, and energy conditioned on both. In experiments, visualizations show that UniTTS learns multiple attributes harmoniously, in a way that allows them to be separated again easily. UniTTS also synthesizes high-fidelity speech while controlling multiple style attributes. Synthesized speech samples are available at https://jackson-kang.github.io/paper_works/UniTTS/demos.
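
The abstract describes two mechanisms: each style attribute is encoded as the residual between the phoneme embeddings before and after applying that attribute, and duration, pitch, and energy are then predicted from the attribute-conditioned sequence. The sketch below is a minimal, hypothetical illustration of that idea in PyTorch; it is not the authors' code, and the module names, dimensions, and FastSpeech 2-style variance adaptor are assumptions.

```python
# Minimal sketch (assumed, not the authors' implementation) of residual
# attribute conditioning in a shared embedding space.
import torch
import torch.nn as nn

class ResidualAttribute(nn.Module):
    """Adds one style attribute (e.g. speaker ID or emotion) to the phoneme
    hidden sequence as a residual in the shared embedding space."""
    def __init__(self, num_classes: int, hidden_dim: int = 256):
        super().__init__()
        self.table = nn.Embedding(num_classes, hidden_dim)

    def forward(self, h: torch.Tensor, attr_id: torch.Tensor) -> torch.Tensor:
        # h: (batch, time, hidden_dim) phoneme encodings before the attribute.
        residual = self.table(attr_id).unsqueeze(1)  # (batch, 1, hidden_dim)
        return h + residual                          # "after" = "before" + residual

class VarianceAdaptor(nn.Module):
    """Predicts duration, pitch, and energy from the conditioned sequence
    (a FastSpeech 2-like adaptor is assumed here for illustration)."""
    def __init__(self, hidden_dim: int = 256):
        super().__init__()
        self.duration = nn.Linear(hidden_dim, 1)
        self.pitch = nn.Linear(hidden_dim, 1)
        self.energy = nn.Linear(hidden_dim, 1)

    def forward(self, h: torch.Tensor):
        return self.duration(h), self.pitch(h), self.energy(h)

# Usage: condition on speaker ID, then emotion, in the same embedding space.
encoder_out = torch.randn(2, 50, 256)  # dummy phoneme encodings
speaker = ResidualAttribute(num_classes=10)
emotion = ResidualAttribute(num_classes=5)
h = emotion(speaker(encoder_out, torch.tensor([0, 3])), torch.tensor([1, 4]))
dur, pitch, energy = VarianceAdaptor()(h)
```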

Related research

03/02/2022
U-Singer: Multi-Singer Singing Voice Synthesizer that Controls Emotional Intensity
We propose U-Singer, the first multi-singer emotional singing voice synt...

05/10/2020
From Speaker Verification to Multispeaker Speech Synthesis, Deep Transfer with Feedback Constraint
High-fidelity speech can be synthesized by end-to-end text-to-speech mod...

06/25/2023
DSE-TTS: Dual Speaker Embedding for Cross-Lingual Text-to-Speech
Although high-fidelity speech can be obtained for intralingual speech sy...

03/15/2023
Cross-speaker Emotion Transfer by Manipulating Speech Style Latents
In recent years, emotional text-to-speech has shown considerable progres...

04/12/2022
VoiceFixer: A Unified Framework for High-Fidelity Speech Restoration
Speech restoration aims to remove distortions in speech signals. Prior m...

07/20/2023
SC VALL-E: Style-Controllable Zero-Shot Text to Speech Synthesizer
Expressive speech synthesis models are trained by adding corpora with di...

05/13/2022
FontNet: Closing the gap to font designer performance in font synthesis
Font synthesis has been a very active topic in recent years because manu...
