Vocal effort modeling in neural TTS for improving the intelligibility of synthetic speech in noise

03/20/2022
by Tuomo Raitio, et al.

We present a neural text-to-speech (TTS) method that models natural vocal effort variation to improve the intelligibility of synthetic speech in the presence of noise. The method first measures the spectral tilt of unlabeled conventional speech data, and then conditions a neural TTS model on the normalized spectral tilt along with other prosodic factors. Changing the spectral tilt parameter while keeping the other prosodic factors fixed enables effective control of vocal effort at synthesis time, independent of those factors. By extrapolating the spectral tilt values beyond the range seen in the original data, we can generate speech with high vocal effort levels, thus improving the intelligibility of speech in the presence of masking noise. We evaluate the intelligibility and quality of normal speech and of speech with increased vocal effort under various masking noise conditions, and compare these to well-known speech intelligibility-enhancing algorithms. The evaluations show that the proposed method can improve the intelligibility of synthetic speech with little loss in speech quality.
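
The measurement and normalization steps described above can be illustrated with a minimal sketch. The abstract does not say how spectral tilt is measured or normalized, so everything here is an assumption made for illustration: tilt is taken as the slope (in dB per kHz) of a straight line fitted to a frame's log-magnitude spectrum, normalization is a plain z-score over the corpus, and the function names `spectral_tilt_db_per_khz` and `normalize_tilt` are hypothetical. Under this convention a flatter (less negative) slope corresponds to higher vocal effort.

```python
import numpy as np


def spectral_tilt_db_per_khz(frame, sample_rate):
    """Slope of a least-squares line fitted to the log-magnitude spectrum.

    Assumed tilt measure for illustration only; not taken from the paper.
    """
    windowed = frame * np.hanning(len(frame))
    magnitude = np.abs(np.fft.rfft(windowed))
    level_db = 20.0 * np.log10(magnitude + 1e-10)  # small offset avoids log(0)
    freqs_khz = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate) / 1000.0
    slope, _intercept = np.polyfit(freqs_khz, level_db, deg=1)
    return float(slope)


def normalize_tilt(tilts):
    """Z-score normalization over the corpus (an assumed normalization scheme)."""
    tilts = np.asarray(tilts, dtype=float)
    return (tilts - tilts.mean()) / (tilts.std() + 1e-8)


# Toy stand-in for a corpus: random frames instead of real speech.
rng = np.random.default_rng(0)
frames = [rng.standard_normal(512) for _ in range(8)]
tilts = [spectral_tilt_db_per_khz(f, 16000) for f in frames]
normalized = normalize_tilt(tilts)

# Extrapolation: request a conditioning value above anything in the data.
high_effort_value = normalized.max() + 1.0
```

In a real system the normalized value would be fed to the TTS model as a conditioning input alongside the other prosodic factors; the last line only shows what asking for a value outside the observed range looks like numerically.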
