Controllable Emphasis with zero data for text-to-speech

07/13/2023
by Arnaud Joly, et al.

We present a scalable method to produce high-quality emphasis for text-to-speech (TTS) that does not require recordings or annotations. Many TTS models include a phoneme duration model. A simple but effective way to achieve emphasized speech is to increase the predicted duration of the emphasized word. We show that this is significantly better than spectrogram modification techniques, improving naturalness by 7.3% and testers' correct identification of the emphasized word in a sentence by 40% on a reference female en-US voice. We show that this technique significantly closes the gap to methods that require explicit recordings. The method proved to be scalable and was preferred in all four languages tested (English, Spanish, Italian, German), for different voices and multiple speaking styles.
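
The duration-based idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `emphasize_durations`, the per-phoneme `word_ids` alignment, and the scaling factor of 1.5 are all assumptions made for illustration, since the abstract does not state how much the durations are increased or how the model represents them.

```python
from typing import List, Sequence


def emphasize_durations(
    durations: Sequence[float],
    word_ids: Sequence[int],
    target_word: int,
    scale: float = 1.5,  # hypothetical factor; not taken from the paper
) -> List[float]:
    """Return a copy of `durations` with the target word's phonemes lengthened.

    `durations` holds one predicted duration per phoneme (e.g. in frames),
    and `word_ids` gives the index of the word each phoneme belongs to.
    """
    if len(durations) != len(word_ids):
        raise ValueError("durations and word_ids must have the same length")
    if scale <= 0:
        raise ValueError("scale must be positive")
    # Scale only the phonemes of the emphasized word; leave the rest untouched.
    return [
        d * scale if w == target_word else d
        for d, w in zip(durations, word_ids)
    ]


# Made-up durations for a three-word utterance (values are illustrative only).
durations = [4.0, 6.0, 5.0, 8.0, 3.0, 7.0]
word_ids = [0, 0, 1, 1, 1, 2]
emphasized = emphasize_durations(durations, word_ids, target_word=1)
```

In a real system the modified durations would be passed on to the acoustic model in place of the duration model's raw predictions; that step depends on the specific TTS architecture and is not shown here.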

research
09/22/2017

Techniques and Challenges in Speech Synthesis

The aim of this project was to develop and implement an English language...
research
05/12/2023

Using Deepfake Technologies for Word Emphasis Detection

In this work, we consider the task of automated emphasis detection for s...
research
04/04/2019

In Other News: A Bi-style Text-to-speech Model for Synthesizing Newscaster Voice with Limited Data

Neural text-to-speech synthesis (NTTS) models have shown significant pro...
research
10/06/2021

Emphasis control for parallel neural TTS

The semantic information conveyed by a speech signal is strongly influen...
research
07/07/2022

BibleTTS: a large, high-fidelity, multilingual, and uniquely African speech corpus

BibleTTS is a large, high-quality, open speech dataset for ten languages...
research
02/19/2021

Alternate Endings: Improving Prosody for Incremental Neural TTS with Predicted Future Text Input

The prosody of a spoken word is determined by its surrounding context. I...
research
06/24/2021

Non-Autoregressive TTS with Explicit Duration Modelling for Low-Resource Highly Expressive Speech

Whilst recent neural text-to-speech (TTS) approaches produce high-qualit...
