A Comparative Study of Self-Supervised Speech Representations in Read and Spontaneous TTS

03/05/2023
by Siyang Wang, et al.

Recent work has explored using self-supervised learning (SSL) speech representations such as wav2vec2.0 as the intermediate representation in standard two-stage TTS, in place of the conventionally used mel-spectrograms. It is, however, unclear which speech SSL is the better fit for TTS, and whether performance differs between read and spontaneous TTS, the latter of which is arguably more challenging. This study addresses these questions by testing several speech SSLs, including different layers of the same SSL, in two-stage TTS on both read and spontaneous corpora, while keeping the TTS model architecture and training settings constant. Results from listening tests show that the 9th layer of the 12-layer wav2vec2.0 (ASR finetuned) outperforms the other tested SSLs and the mel-spectrogram in both read and spontaneous TTS. Our work sheds light both on how speech SSL can readily improve current TTS systems and on how SSLs compare in the challenging generative task of TTS. Audio examples can be found at https://www.speech.kth.se/tts-demos/ssr_tts
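
As a rough illustration of what "a layer of an SSL model as the intermediate representation" means, the sketch below picks one layer's frame sequence out of a stack of per-layer encoder outputs. It is not the authors' code: `toy_ssl_encoder` and `select_layer` are hypothetical names, the encoder is a stand-in that computes placeholder features, and the convention that index 0 holds the pre-transformer embedding is an assumption of this toy.

```python
from typing import List, Sequence

Frames = List[List[float]]  # one feature vector per frame


def toy_ssl_encoder(waveform: Sequence[float],
                    num_layers: int = 12,
                    hop: int = 4) -> List[Frames]:
    """Hypothetical stand-in for an SSL speech encoder.

    Returns num_layers + 1 frame sequences. By assumption, index 0 is the
    pre-transformer embedding and index k is the output of layer k.
    """
    num_frames = len(waveform) // hop
    # Placeholder "embedding": the mean of each hop-sized chunk of samples.
    base = [[sum(waveform[i * hop:(i + 1) * hop]) / hop]
            for i in range(num_frames)]
    outputs = [base]
    for layer in range(1, num_layers + 1):
        # Placeholder "layer": shift the features so layers are distinguishable.
        outputs.append([[x + layer for x in frame] for frame in base])
    return outputs


def select_layer(hidden_states: List[Frames], layer: int) -> Frames:
    """Return the output of transformer layer `layer` (1-based)."""
    if not 1 <= layer < len(hidden_states):
        raise ValueError(f"layer must be in 1..{len(hidden_states) - 1}")
    return hidden_states[layer]


# In a two-stage setup, the first stage would be trained to predict these
# frames from text and the second stage to turn them back into audio.
waveform = [0.0] * 16
hidden_states = toy_ssl_encoder(waveform)
layer9 = select_layer(hidden_states, 9)
```

Comparing layers or SSL models under fixed architecture and training settings then amounts to changing only the argument passed to `select_layer` (or swapping the encoder) and retraining both stages on the new targets.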

research
07/11/2023

On the Use of Self-Supervised Speech Representations in Spontaneous Speech Synthesis

Self-supervised learning (SSL) speech representations learned from large...
research
10/27/2022

Training Autoregressive Speech Recognition Models with Limited in-domain Supervision

Advances in self-supervised learning have significantly reduced the amou...
research
07/27/2023

The Effect of Spoken Language on Speech Enhancement using Self-Supervised Speech Representation Loss Functions

Recent work in the field of speech enhancement (SE) has involved the use...
research
06/13/2023

A Novel Scheme to classify Read and Spontaneous Speech

The COVID-19 pandemic has led to an increased use of remote telephonic i...
research
09/26/2022

The Ability of Self-Supervised Speech Models for Audio Representations

Self-supervised learning (SSL) speech models have achieved unprecedented...
research
11/29/2022

Model Extraction Attack against Self-supervised Speech Models

Self-supervised learning (SSL) speech models generate meaningful represe...
research
04/07/2022

MAESTRO: Matched Speech Text Representations through Modality Matching

We present Maestro, a self-supervised training method to unify represent...
