A Text-to-Speech Pipeline, Evaluation Methodology, and Initial Fine-Tuning Results for Child Speech Synthesis

03/22/2022
by   Rishabh Jain, et al.
0

Speech synthesis has come a long way as current text-to-speech (TTS) models can now generate natural human-sounding speech. However, most of the TTS research focuses on using adult speech data and there has been very limited work done on child speech synthesis. This study developed and validated a training pipeline for fine-tuning state-of-the-art (SOTA) neural TTS models using child speech datasets. This approach adopts a multi-speaker TTS retuning workflow to provide a transfer-learning pipeline. A publicly available child speech dataset was cleaned to provide a smaller subset of approximately 19 hours, which formed the basis of our fine-tuning experiments. Both subjective and objective evaluations were performed using a pretrained MOSNet for objective evaluation and a novel subjective framework for mean opinion score (MOS) evaluations. Subjective evaluations achieved the MOS of 3.95 for speech intelligibility, 3.89 for voice naturalness, and 3.96 for voice consistency. Objective evaluation using a pretrained MOSNet showed a strong correlation between real and synthetic child voices. Speaker similarity was also verified by calculating the cosine similarity between the embeddings of utterances. An automatic speech recognition (ASR) model is also used to provide a word error rate (WER) comparison between the real and synthetic child voices. The final trained TTS model was able to synthesize child-like speech from reference audio samples as short as 5 seconds.

READ FULL TEXT

page 6

page 7

page 10

research
11/01/2022

Adapter-Based Extension of Multi-Speaker Text-to-Speech Model for New Speakers

Fine-tuning is a popular method for adapting text-to-speech (TTS) models...
research
03/31/2022

JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech

In neural text-to-speech (TTS), two-stage system or a cascade of separat...
research
08/21/2022

Visualising Model Training via Vowel Space for Text-To-Speech Systems

With the recent developments in speech synthesis via machine learning, t...
research
01/13/2021

Whispered and Lombard Neural Speech Synthesis

It is desirable for a text-to-speech system to take into account the env...
research
06/29/2022

Simple and Effective Multi-sentence TTS with Expressive and Coherent Prosody

Generating expressive and contextually appropriate prosody remains a cha...
research
03/31/2022

Manipulation of oral cancer speech using neural articulatory synthesis

We present an articulatory synthesis framework for the synthesis and man...
research
10/29/2021

VRAIN-UPV MLLP's system for the Blizzard Challenge 2021

This paper presents the VRAIN-UPV MLLP's speech synthesis system for the...

Please sign up or login with your details

Forgot password? Click here to reset