On the Use of Self-Supervised Speech Representations in Spontaneous Speech Synthesis

07/11/2023
by Siyang Wang, et al.

Self-supervised learning (SSL) speech representations, learned from large amounts of diverse, mixed-quality speech data without transcriptions, are gaining ground in many speech technology applications. Prior work has shown that SSL representations are an effective intermediate representation in two-stage text-to-speech (TTS) for both read and spontaneous speech. However, it remains unclear which SSL model, and which layer within each model, is best suited for spontaneous TTS. We address this gap by extending the comparison of SSL representations in spontaneous TTS to 6 different SSL models and 3 layers within each model. SSL has also shown potential for predicting the mean opinion scores (MOS) of synthesized speech, but so far only for read speech. We extend an SSL-based MOS prediction framework previously developed for scoring read speech synthesis and evaluate its performance on synthesized spontaneous speech. All experiments are run on each of two different spontaneous corpora in order to find generalizable trends. Overall, we present comprehensive experimental results on the use of SSL in spontaneous TTS and MOS prediction, to further quantify and understand how SSL can be used in spontaneous TTS. Audio samples: https://www.speech.kth.se/tts-demos/sp_ssl_tts
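
To give a sense of the size of the comparison described in the abstract, the following is a minimal sketch that enumerates the experimental conditions, assuming every combination of SSL model, layer, and corpus is evaluated (the abstract implies this but does not state it outright). The model, layer, and corpus identifiers are placeholders invented for illustration; the abstract gives only the counts, not the names.

```python
from itertools import product

# Placeholder identifiers (assumptions): the abstract states only the counts,
# i.e. 6 SSL models, 3 layers per model, and 2 spontaneous corpora.
SSL_MODELS = [f"ssl_model_{i}" for i in range(1, 7)]
LAYERS = ["layer_a", "layer_b", "layer_c"]
CORPORA = ["corpus_1", "corpus_2"]


def build_grid(models, layers, corpora):
    """Return one (corpus, model, layer) tuple per condition in a full grid."""
    return list(product(corpora, models, layers))


grid = build_grid(SSL_MODELS, LAYERS, CORPORA)
print(len(grid))  # 2 corpora * 6 models * 3 layers = 36
```

Under this assumption the full grid contains 36 conditions, 18 per corpus, which is why repeating the comparison on a second corpus helps separate generalizable trends from corpus-specific effects.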


