Investigating accuracy of pitch-accent annotations in neural network-based speech synthesis and denoising effects

08/02/2018
by Hieu-Thi Luong, et al.

We investigated the impact of noisy linguistic features on the performance of a neural-network-based Japanese speech synthesis system that uses a WaveNet vocoder. We compared an ideal system, which uses manually corrected linguistic features (including phoneme and prosodic information) in both the training and test sets, against several other systems that use corrupted linguistic features. Both subjective and objective results demonstrate that corrupted linguistic features, especially those in the test set, had a statistically significant effect on the ideal system's performance because of the mismatch between the training and test conditions. Interestingly, while an utterance-level Turing test showed that listeners had difficulty distinguishing synthetic speech from natural speech, it further indicated that adding noise to the linguistic features in the training set can partially reduce the effect of the mismatch, regularize the model, and help the system perform better when the linguistic features of the test set are noisy.
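
To make the idea of a noisy training set concrete, here is a minimal Python sketch of one way pitch-accent annotations could be corrupted at a controlled rate. It is an illustration only: the abstract does not describe the corruption procedure, so the binary label set, the function name `corrupt_accent_labels`, and the `noise_rate` and `seed` parameters are all assumptions rather than the authors' method.

```python
import random

# Hypothetical label set (an assumption, not from the paper):
# 0 = no accent on this mora, 1 = accent nucleus on this mora.
ACCENT_LABELS = (0, 1)


def corrupt_accent_labels(labels, noise_rate, seed=0):
    """Return a copy of `labels` in which each entry is replaced by a
    different label with probability `noise_rate`."""
    if not 0.0 <= noise_rate <= 1.0:
        raise ValueError("noise_rate must be between 0 and 1")
    rng = random.Random(seed)  # seeded so the corruption is reproducible
    noisy = []
    for label in labels:
        if rng.random() < noise_rate:
            # Pick any label other than the current one.
            alternatives = [a for a in ACCENT_LABELS if a != label]
            noisy.append(rng.choice(alternatives))
        else:
            noisy.append(label)
    return noisy


# Example: corrupt about 20% of the accent labels of one utterance.
clean = [0, 1, 0, 0, 1, 0, 0, 0]
noisy = corrupt_accent_labels(clean, noise_rate=0.2, seed=1)
```

Under these assumptions, applying the function to the training annotations while leaving the test annotations untouched (or the reverse) would reproduce the kind of matched and mismatched conditions the abstract compares.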
