An objective evaluation of the effects of recording conditions and speaker characteristics in multi-speaker deep neural speech synthesis

06/03/2021
by Beata Lorincz, et al.

Multi-speaker spoken datasets enable the creation of text-to-speech synthesis (TTS) systems which can output several voice identities. The multi-speaker (MSPK) scenario also enables the use of fewer training samples per speaker. However, in the resulting acoustic model, not all speakers exhibit the same synthetic quality, and some of the voice identities cannot be used at all. In this paper we evaluate the influence of the recording conditions, speaker gender, and speaker particularities on the quality of the synthesised output of a deep neural TTS architecture, namely Tacotron2. The evaluation is made possible by a large parallel Romanian spoken corpus containing over 81 hours of data. Within this setup, we also evaluate the influence of different types of text representations: orthographic, phonetic, and phonetic extended with syllable boundaries and lexical stress markings. We evaluate the results of the MSPK system using the objective measures of equal error rate (EER) and word error rate (WER), and also examine the distances between the t-SNE projections of natural and synthesised speaker embeddings computed by an accurate speaker verification network. The results show a strong correlation between the recording conditions and the quality of a speaker's synthetic voice. Speaker gender does not influence the output, and extending the input text representation with syllable boundaries and lexical stress information does not enhance the generated audio equally across all speaker identities. The visualisation of the t-SNE projections of the natural and synthesised speaker embeddings shows that the acoustic model shifts the neural representation of some, but not all, speakers; the output speech for these shifted speakers is of lower quality.
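As a rough illustration of the two objective measures named in the abstract, the sketch below computes EER by sweeping verification-score thresholds and WER via a word-level Levenshtein distance. The score arrays and sentences are invented for the example and are not taken from the paper.

```python
import numpy as np

def compute_eer(target_scores, impostor_scores):
    """Equal error rate: the operating point where the false-acceptance
    rate (FAR) equals the false-rejection rate (FRR), approximated by
    trying every observed score as a decision threshold."""
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    far = np.array([(impostor_scores >= t).mean() for t in thresholds])
    frr = np.array([(target_scores < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0

def word_error_rate(reference, hypothesis):
    """WER: word-level Levenshtein distance (substitutions, insertions,
    deletions) divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[-1][-1] / len(ref)

# Invented, well-separated scores: EER comes out at 0.
eer = compute_eer(np.array([0.9, 0.8, 0.85]), np.array([0.1, 0.2, 0.15]))
# One substituted word out of six: WER = 1/6.
wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
print(eer, wer)
```

In the paper these measures are applied to synthesised speech: EER over embeddings from a speaker verification network, and WER over automatic transcriptions of the generated audio.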

research · 10/22/2020
AISHELL-3: A Multi-speaker Mandarin TTS Corpus and the Baselines
In this paper, we present AISHELL-3, a large-scale and high-fidelity mul...

research · 07/07/2020
X-vectors: New Quantitative Biomarkers for Early Parkinson's Disease Detection from Speech
Many articles have used voice analysis to detect Parkinson's disease (PD...

research · 06/03/2019
Problem-Agnostic Speech Embeddings for Multi-Speaker Text-to-Speech with SampleRNN
Text-to-speech (TTS) acoustic models map linguistic features into an aco...

research · 02/28/2020
Comparison of Speech Representations for Automatic Quality Estimation in Multi-Speaker Text-to-Speech Synthesis
We aim to characterize how different speakers contribute to the perceive...

research · 06/09/2023
Speaker Embeddings as Individuality Proxy for Voice Stress Detection
Since the mental states of the speaker modulate speech, stress introduce...

research · 06/07/2022
The Influence of Dataset Partitioning on Dysfluency Detection Systems
This paper empirically investigates the influence of different data spli...

research · 07/19/2023
An analysis on the effects of speaker embedding choice in non auto-regressive TTS
In this paper we introduce a first attempt on understanding how a non-au...
