Using previous acoustic context to improve Text-to-Speech synthesis

12/07/2020
by Pilar Oplustil-Gallegos, et al.

Many speech synthesis datasets, especially those derived from audiobooks, naturally comprise sequences of utterances. Nevertheless, such data are commonly treated as individual, unordered utterances both when training a model and at inference time. This discards important prosodic phenomena above the utterance level. In this paper, we leverage the sequential nature of the data using an acoustic context encoder that produces an embedding of the previous utterance audio. This is input to the decoder in a Tacotron 2 model. The embedding is also used for a secondary task, providing additional supervision. We compare two secondary tasks: predicting the ordering of utterance pairs, and predicting the embedding of the current utterance audio. Results show that the relation between consecutive utterances is informative: our proposed model significantly improves naturalness over a Tacotron 2 baseline.
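As a rough illustration of what treating the data as a sequence could involve, the sketch below prepares two kinds of training examples from an ordered list of utterances: each utterance paired with the audio of the one before it, and utterance pairs labelled as in order or swapped for the ordering task. This is a minimal sketch under stated assumptions, not the authors' pipeline. The abstract does not specify how pairs are sampled, how the first utterance of a sequence is handled, or any data format, so the names `Example`, `with_previous_context`, `ordering_pairs`, the `utt_00N` identifiers, and the 50% swap rate are all hypothetical. The acoustic context encoder and the Tacotron 2 model themselves are not shown.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Example:
    text: str                     # text of the utterance to synthesise
    target_audio: str             # identifier of its audio (hypothetical id)
    context_audio: Optional[str]  # audio of the previous utterance, if any


def with_previous_context(utterances: List[Tuple[str, str]]) -> List[Example]:
    """Pair each (text, audio_id) with the audio id of the utterance before it.

    Assumption: the first utterance of a sequence gets no context (None).
    """
    examples = []
    previous_audio = None
    for text, audio_id in utterances:
        examples.append(Example(text, audio_id, previous_audio))
        previous_audio = audio_id
    return examples


def ordering_pairs(audio_ids: List[str],
                   rng: random.Random) -> List[Tuple[str, str, int]]:
    """Build (first, second, label) triples from consecutive utterances.

    label 1 means the pair is in its original order, 0 means it was swapped.
    Assumption: each pair is swapped with probability 0.5.
    """
    pairs = []
    for a, b in zip(audio_ids, audio_ids[1:]):
        if rng.random() < 0.5:
            pairs.append((a, b, 1))
        else:
            pairs.append((b, a, 0))
    return pairs


# Toy sequence of three consecutive utterances from one chapter.
chapter = [
    ("It was late.", "utt_001"),
    ("Nobody answered.", "utt_002"),
    ("She knocked again.", "utt_003"),
]
examples = with_previous_context(chapter)
pairs = ordering_pairs([audio for _, audio in chapter], random.Random(0))
```

In a full system, `context_audio` would be passed through the acoustic context encoder to produce the embedding that conditions the decoder, and the labelled pairs would supervise the ordering task. The alternative secondary task, predicting the embedding of the current utterance audio, would use `target_audio` as its supervision source instead.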

