Investigating Content-Aware Neural Text-To-Speech MOS Prediction Using Prosodic and Linguistic Features

11/01/2022
by Alexandra Vioni, et al.

Current state-of-the-art methods for automatic synthetic speech evaluation are based on neural MOS prediction models. These include MOSNet and LDNet, which take spectral features as input, and SSL-MOS, which relies on a pretrained self-supervised learning model that operates directly on the speech signal. In modern high-quality neural TTS systems, prosodic appropriateness with regard to the spoken content is a decisive factor for speech naturalness. For this reason, we propose to include prosodic and linguistic features as additional inputs in MOS prediction systems, and evaluate their impact on the prediction outcome. We consider phoneme-level F0 and duration features as prosodic inputs, and Tacotron encoder outputs, POS tags, and BERT embeddings as higher-level linguistic inputs. All MOS prediction systems are trained on SOMOS, a neural TTS-only dataset with crowdsourced naturalness MOS evaluations. Results show that the proposed additional features benefit the MOS prediction task, improving the correlation of the predicted MOS scores with the ground truth at both the utterance level and the system level.
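
To make the distinction between utterance-level and system-level evaluation concrete, here is a minimal sketch of how the two correlations could be computed. It is an illustration under stated assumptions, not the authors' evaluation code: the abstract does not name the correlation measure, so Pearson correlation is assumed here, and the function names and toy scores are hypothetical. Utterance-level correlation compares scores one utterance at a time, while system-level correlation first averages the scores of all utterances produced by the same TTS system.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length score sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def system_means(system_ids, scores):
    """Average the per-utterance scores of each TTS system."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for system_id, score in zip(system_ids, scores):
        totals[system_id] += score
        counts[system_id] += 1
    return {system_id: totals[system_id] / counts[system_id] for system_id in totals}


def utterance_and_system_correlation(system_ids, true_mos, pred_mos):
    """Return (utterance-level, system-level) Pearson correlation."""
    utterance_level = pearson(true_mos, pred_mos)
    true_by_system = system_means(system_ids, true_mos)
    pred_by_system = system_means(system_ids, pred_mos)
    systems = sorted(true_by_system)  # fixed order so the two lists align
    system_level = pearson(
        [true_by_system[s] for s in systems],
        [pred_by_system[s] for s in systems],
    )
    return utterance_level, system_level


# Hypothetical toy data: two systems, two utterances each.
toy_systems = ["A", "A", "B", "B"]
toy_true = [3.0, 4.0, 2.0, 3.0]
toy_pred = [4.0, 3.0, 3.0, 2.0]
toy_utt_corr, toy_sys_corr = utterance_and_system_correlation(
    toy_systems, toy_true, toy_pred
)
```

In this toy case the predictor ranks the two systems correctly but gets the ordering of utterances within each system wrong, so the system-level correlation is high while the utterance-level correlation is not. This is why the two levels are reported separately.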

research
06/14/2023

SpeechGLUE: How Well Can Self-Supervised Speech Models Capture Linguistic Knowledge?

Self-supervised learning (SSL) for speech representation has been succes...
research
11/06/2020

Improving Prosody Modelling with Cross-Utterance BERT Embeddings for End-to-end Speech Synthesis

Despite prosody is related to the linguistic information up to the disco...
research
04/05/2017

Automatic Measurement of Pre-aspiration

Pre-aspiration is defined as the period of glottal friction occurring in...
research
05/14/2023

Self-supervised Neural Factor Analysis for Disentangling Utterance-level Speech Representations

Self-supervised learning (SSL) speech models such as wav2vec and HuBERT ...
research
06/16/2022

Acoustic Modeling for End-to-End Empathetic Dialogue Speech Synthesis Using Linguistic and Prosodic Contexts of Dialogue History

We propose an end-to-end empathetic dialogue speech synthesis (DSS) mode...
research
05/20/2020

Investigation of learning abilities on linguistic features in sequence-to-sequence text-to-speech synthesis

Neural sequence-to-sequence text-to-speech synthesis (TTS) can produce h...
research
08/02/2018

Investigating accuracy of pitch-accent annotations in neural network-based speech synthesis and denoising effects

We investigated the impact of noisy linguistic features on the performan...
