EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis

08/10/2023
by   Tu Anh Nguyen, et al.
Recent work has shown that it is possible to resynthesize high-quality speech based not on text but on low-bitrate discrete units that have been learned in a self-supervised fashion and can therefore capture expressive aspects of speech that are hard to transcribe (prosody, voice styles, non-verbal vocalization). The adoption of these methods is still limited by the fact that most speech synthesis datasets consist of read speech, which severely limits spontaneity and expressivity. Here, we introduce Expresso, a high-quality expressive speech dataset for textless speech synthesis that includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles. We illustrate the challenges and potential of this dataset with an expressive resynthesis benchmark where the task is to encode the input in low-bitrate units and resynthesize it in a target voice while preserving content and style. We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders, and explore tradeoffs between quality, bitrate and invariance to speaker and style. The dataset, evaluation metrics and baseline models are all open source.
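To make the notion of "low-bitrate units" concrete, the sketch below shows one common way to estimate the bitrate of a discrete unit stream: units per second multiplied by the entropy of the unit distribution, with units per second times log2 of the vocabulary size as an upper bound. This is an illustrative assumption rather than the paper's evaluation code; the function names `unit_bitrate` and `max_bitrate`, and the example vocabulary size and unit rate, are hypothetical.

```python
import math
from collections import Counter


def unit_bitrate(units, duration_s):
    """Entropy-based bitrate (bits per second) of a discrete unit sequence.

    Assumption: bitrate = (units per second) * (empirical entropy in bits).
    """
    if not units or duration_s <= 0:
        raise ValueError("need a non-empty unit sequence and a positive duration")
    total = len(units)
    counts = Counter(units)
    # Empirical Shannon entropy of the unit distribution, in bits per unit.
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return (total / duration_s) * entropy


def max_bitrate(vocab_size, units_per_second):
    """Upper bound on bitrate: every unit equally likely."""
    return units_per_second * math.log2(vocab_size)


if __name__ == "__main__":
    # Hypothetical encoder: 256-unit vocabulary at 50 units per second.
    print(max_bitrate(256, 50))
    # Toy stream: two equally frequent units, 4 units over 2 seconds.
    print(unit_bitrate([0, 1, 0, 1], 2.0))
```

Under these assumptions, a smaller vocabulary or a lower unit rate reduces the bitrate, which is the axis along which quality and invariance to speaker and style would be traded off.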

research
10/26/2019

Mellotron: Multispeaker expressive voice synthesis by conditioning on rhythm, pitch and global style tokens

Mellotron is a multispeaker voice synthesis model based on Tacotron 2 GS...
research
09/08/2021

Referee: Towards reference-free cross-speaker style transfer with low-quality data for expressive speech synthesis

Cross-speaker style transfer (CSST) in text-to-speech (TTS) synthesis ai...
research
01/31/2023

InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt

Expressive text-to-speech (TTS) aims to synthesize different speaking st...
research
11/26/2022

Contextual Expressive Text-to-Speech

The goal of expressive Text-to-speech (TTS) is to synthesize natural spe...
research
12/19/2022

Speaking Style Conversion With Discrete Self-Supervised Units

Voice Conversion (VC) is the task of making a spoken utterance by one sp...
research
05/30/2022

StyleTTS: A Style-Based Generative Model for Natural and Diverse Text-to-Speech Synthesis

Text-to-Speech (TTS) has recently seen great progress in synthesizing hi...
research
07/20/2023

SC VALL-E: Style-Controllable Zero-Shot Text to Speech Synthesizer

Expressive speech synthesis models are trained by adding corpora with di...
