
Deep Learning Based Assessment of Synthetic Speech Naturalness

by Gabriel Mittag, et al.

In this paper, we present a new objective prediction model for synthetic speech naturalness. It can be used to evaluate Text-To-Speech (TTS) or Voice Conversion systems and works independently of language. The model is trained end-to-end and is based on a CNN-LSTM network that was previously shown to give good results for speech quality estimation. We trained and tested the model on 16 different datasets, including data from the Blizzard Challenge and the Voice Conversion Challenge. Further, we show that the reliability of deep learning-based naturalness prediction can be improved by transfer learning from speech quality prediction models that are trained on objective POLQA scores. The proposed model is made publicly available and can, for example, be used to evaluate different TTS system configurations.
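The abstract mentions using the model to compare TTS system configurations. The following is a minimal sketch of how such a comparison might look once per-utterance naturalness predictions are available. It does not call the published model: the function name `rank_systems`, the configuration names, and the scores are all hypothetical, and the scores are made up for illustration rather than taken from the paper.

```python
from collections import defaultdict
from statistics import mean


def rank_systems(predictions):
    """Average per-utterance scores for each system and sort best-first.

    `predictions` is an iterable of (system_name, predicted_score) pairs.
    The pair format is an assumption of this sketch, not the model's output format.
    """
    by_system = defaultdict(list)
    for system, score in predictions:
        by_system[system].append(score)
    averages = {system: mean(scores) for system, scores in by_system.items()}
    # Higher predicted naturalness is treated as better.
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Made-up predicted scores for two hypothetical TTS configurations.
example = [
    ("config_a", 3.0), ("config_a", 4.0),
    ("config_b", 2.0), ("config_b", 3.0),
]

for system, avg in rank_systems(example):
    print(f"{system}: {avg:.2f}")
```

Averaging per system before ranking is one common choice for this kind of comparison; whether it matches the evaluation protocol used in the paper is not stated in the abstract.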




Perceptually Guided End-to-End Text-to-Speech

Several fast text-to-speech (TTS) models have been proposed for real-tim...

NISQA: A Deep CNN-Self-Attention Model for Multidimensional Speech Quality Prediction with Crowdsourced Datasets

In this paper, we present an update to the NISQA speech quality predicti...

DeepFry: Identifying Vocal Fry Using Deep Neural Networks

Vocal fry or creaky voice refers to a voice quality characterized by irr...

Improving Self-Supervised Learning-based MOS Prediction Networks

MOS (Mean Opinion Score) is a subjective method used for the evaluation ...

Towards Learning a Universal Non-Semantic Representation of Speech

The ultimate goal of transfer learning is to reduce labeled data require...

Visualising and Explaining Deep Learning Models for Speech Quality Prediction

Estimating quality of transmitted speech is known to be a non-trivial ta...

Full-Reference Speech Quality Estimation with Attentional Siamese Neural Networks

In this paper, we present a full-reference speech quality prediction mod...

Code Repositories


NISQA - Non-Intrusive Speech Quality and TTS Naturalness Assessment
