CLAPSpeech: Learning Prosody from Text Context with Contrastive Language-Audio Pre-training

05/18/2023
by   Zhenhui Ye, et al.

Improving text representation has attracted much attention in expressive text-to-speech (TTS). However, existing works only implicitly learn prosody through masked-token reconstruction tasks, which leads to low training efficiency and difficulty in prosody modeling. We propose CLAPSpeech, a cross-modal contrastive pre-training framework that explicitly learns the prosody variance of the same text token under different contexts. Specifically, 1) we encourage the model to connect the text context with its corresponding prosody pattern in a joint multi-modal space, through careful design of the encoder inputs and the contrastive loss; 2) we introduce a multi-scale pre-training pipeline to capture prosody patterns at multiple levels. We show how to incorporate CLAPSpeech into existing TTS models for better prosody. Experiments on three datasets not only show that CLAPSpeech improves prosody prediction for existing TTS methods, but also demonstrate its ability to generalize to multiple languages and to multi-speaker TTS. We also analyze in depth the principle behind the performance of CLAPSpeech. Ablation studies demonstrate the necessity of each component in our method. Source code and audio samples are available at https://clapspeech.github.io.
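The cross-modal contrastive objective described above pairs a text-context embedding with its corresponding prosody (audio) embedding and pushes apart mismatched pairs. As a rough illustration only (the paper's actual encoders, batching, and loss details are not reproduced here), a CLIP-style symmetric contrastive loss over a batch of paired embeddings can be sketched as follows; the function name and temperature value are illustrative assumptions:

```python
import numpy as np

def contrastive_loss(text_emb, audio_emb, temperature=0.07):
    """Symmetric CLIP-style contrastive loss (illustrative sketch).

    text_emb, audio_emb: (B, D) arrays; row i of each forms a positive
    pair, and all other rows in the batch act as negatives.
    """
    # L2-normalize so the dot product is cosine similarity
    t = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=-1, keepdims=True)
    logits = (t @ a.T) / temperature  # (B, B) similarity matrix

    def xent_diagonal(l):
        # softmax cross-entropy with the matched pair (diagonal) as target
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # average the text-to-audio and audio-to-text directions
    return (xent_diagonal(logits) + xent_diagonal(logits.T)) / 2
```

With perfectly matched, well-separated embeddings the loss approaches zero, while mismatched batches yield a large positive value; in CLAPSpeech this pressure is what forces the text encoder to encode context-dependent prosody.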


