ComedicSpeech: Text To Speech For Stand-up Comedies in Low-Resource Scenarios

05/20/2023
by   Yuyue Wang, et al.

Text-to-speech (TTS) models can generate natural, high-quality speech, but they are not expressive enough when synthesizing highly dramatic speech such as stand-up comedy. Because comedians have diverse personal speech styles, including individual prosody, rhythm, and fillers, synthesizing their speech requires real-world datasets and strong speech style modeling capabilities, which poses challenges. In this paper, we construct a new dataset and develop ComedicSpeech, a TTS system tailored for stand-up comedy synthesis in low-resource scenarios. First, we extract a prosody representation with a prosody encoder and condition the TTS model on it in a flexible way. Second, we enhance personal rhythm modeling with a conditional duration predictor. Third, we model personal fillers by introducing comedian-related special tokens. Experiments show that ComedicSpeech achieves better expressiveness than baselines with only ten minutes of training data per comedian. Audio samples are available at https://xh621.github.io/stand-up-comedy-demo/

research
11/17/2022

Low-Resource Mongolian Speech Synthesis Based on Automatic Prosody Annotation

While deep learning-based text-to-speech (TTS) models such as VITS have ...
research
07/06/2021

AdaSpeech 3: Adaptive Text to Speech for Spontaneous Style

While recent text to speech (TTS) models perform very well in synthesizi...
research
02/13/2022

Distribution augmentation for low-resource expressive text-to-speech

This paper presents a novel data augmentation technique for text-to-spee...
research
02/16/2022

ProsoSpeech: Enhancing Prosody With Quantized Vector Pre-training in Text-to-Speech

Expressive text-to-speech (TTS) has become a hot research topic recently...
research
11/26/2022

Contextual Expressive Text-to-Speech

The goal of expressive Text-to-speech (TTS) is to synthesize natural spe...
research
05/22/2023

ViT-TTS: Visual Text-to-Speech with Scalable Diffusion Transformer

Text-to-speech (TTS) has undergone remarkable improvements in performance...
research
03/10/2023

An End-to-End Neural Network for Image-to-Audio Transformation

This paper describes an end-to-end (E2E) neural architecture for the aud...
