OverFlow: Putting flows on top of neural transducers for better TTS

11/13/2022
by Shivam Mehta, et al.

Neural HMMs are a type of neural transducer recently proposed for sequence-to-sequence modelling in text-to-speech. They combine the best features of classic statistical speech synthesis and modern neural TTS, requiring less data and fewer training updates, and are less prone to gibberish output caused by neural attention failures. In this paper, we combine neural HMM TTS with normalising flows for describing the highly non-Gaussian distribution of speech acoustics. The result is a powerful, fully probabilistic model of durations and acoustics that can be trained using exact maximum likelihood. Compared to dominant flow-based acoustic models, our approach integrates autoregression for improved modelling of long-range dependences such as utterance-level prosody. Experiments show that a system based on our proposal gives more accurate pronunciations and better subjective speech quality than comparable methods, whilst retaining the original advantages of neural HMMs. Audio examples and code are available at https://shivammehta25.github.io/OverFlow/
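
As a rough illustration of what "exact maximum likelihood" means for an HMM with a flow on top, the sketch below computes the exact log-likelihood of a short sequence. It rests on simplifying assumptions that are not taken from the paper: scalar observations, a left-to-right no-skip HMM with fixed (non-autoregressive, non-neural) Gaussian emission and transition parameters, and a single elementwise affine map standing in for the normalising flow. All function and parameter names are hypothetical. The point is only that the change-of-variables term and the HMM forward recursion are both exact, so their sum is an exact log-likelihood that could be maximised directly.

```python
import math


def log_normal(z, mean, std):
    """Log-density of a scalar Gaussian N(mean, std^2) at z."""
    return -0.5 * math.log(2 * math.pi) - math.log(std) - 0.5 * ((z - mean) / std) ** 2


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) that tolerates -inf entries."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def flow_to_latent(x, log_scale, shift):
    """Toy invertible map x -> z (an assumption, standing in for a real flow).

    Returns z and log|dz/dx|, the term required by the change-of-variables formula.
    """
    z = (x - shift) * math.exp(-log_scale)
    return z, -log_scale


def sequence_log_likelihood(xs, means, stds, stay_probs, log_scale, shift):
    """Exact log p(x_1..x_T) for a left-to-right, no-skip HMM whose Gaussian
    emissions are defined in the latent space of the toy flow.

    stay_probs[j] is the probability of remaining in state j (strictly between
    0 and 1); 1 - stay_probs[j] is the probability of moving to state j + 1,
    or of ending the sequence when j is the last state.
    """
    n = len(means)

    # Map every observation to the latent space and accumulate log-determinants.
    zs, log_det = [], 0.0
    for x in xs:
        z, ld = flow_to_latent(x, log_scale, shift)
        zs.append(z)
        log_det += ld

    # Forward algorithm in the log domain; the model always starts in state 0.
    alpha = [-math.inf] * n
    alpha[0] = log_normal(zs[0], means[0], stds[0])
    for z in zs[1:]:
        new_alpha = []
        for j in range(n):
            terms = [alpha[j] + math.log(stay_probs[j])]  # stayed in state j
            if j > 0:
                # advanced from state j - 1
                terms.append(alpha[j - 1] + math.log(1.0 - stay_probs[j - 1]))
            new_alpha.append(logsumexp(terms) + log_normal(z, means[j], stds[j]))
        alpha = new_alpha

    # The sequence must end by leaving the final state.
    latent_ll = alpha[n - 1] + math.log(1.0 - stay_probs[n - 1])
    return latent_ll + log_det


if __name__ == "__main__":
    ll = sequence_log_likelihood(
        xs=[1.0, 3.0, 2.5],
        means=[0.0, 1.0],
        stds=[1.0, 0.5],
        stay_probs=[0.6, 0.7],
        log_scale=math.log(2.0),
        shift=1.0,
    )
    print(f"exact log-likelihood: {ll:.4f}")
```

In a system like the one described above, the fixed parameters would instead be produced by neural networks conditioned on the text and on previously generated frames, and the affine map would be replaced by a learned invertible network; the structure of the likelihood computation stays the same.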

