ItôTTS and ItôWave: Linear Stochastic Differential Equation Is All You Need For Audio Generation

05/17/2021
by Shoule Wu, et al.
In this paper, we propose to unify the two aspects of voice synthesis, namely text-to-speech (TTS) and the vocoder, in one framework based on a pair of forward and reverse-time linear stochastic differential equations (SDEs). The solutions of this SDE pair are two stochastic processes. One turns the distribution of the mel spectrogram (or wave) that we want to generate into a simple and tractable distribution. The other is the generation procedure, which turns this simple, tractable signal into the target mel spectrogram (or wave). The model that generates mel spectrograms is called ItôTTS, and the model that generates waves is called ItôWave. Both use the Wiener process as a driver to gradually subtract the excess signal from a noise signal, generating realistic, meaningful mel spectrograms and audio respectively, conditioned on the original text (ItôTTS) or the mel spectrogram (ItôWave). Experimental results show that the mean opinion scores (MOS) of ItôTTS and ItôWave can exceed those of current state-of-the-art methods, reaching 3.925±0.160 and 4.35±0.115 respectively.
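
To make the forward and reverse-time pairing concrete, here is a minimal sketch on a one-dimensional toy problem, not the paper's implementation. The abstract does not give the SDE coefficients, so the zero drift and the diffusion g(t) = sigma^t below are assumptions, as are all function names and parameter values. The toy "data" is a Gaussian N(mu, s0^2), so the score of every noised marginal is known in closed form and stands in for whatever conditional model ItôTTS or ItôWave would learn.

```python
import numpy as np

def diffusion(t, sigma=25.0):
    # Assumed diffusion coefficient g(t) = sigma**t for the forward SDE dx = g(t) dW.
    return sigma ** t

def marginal_var(t, sigma=25.0):
    # Variance added by the forward SDE up to time t: integral of g(s)^2 over [0, t].
    return (sigma ** (2.0 * t) - 1.0) / (2.0 * np.log(sigma))

def reverse_sample(mu, s0, n_samples=20000, n_steps=500, sigma=25.0, seed=0):
    # Euler-Maruyama integration of the reverse-time SDE from t = 1 down to t = 0.
    rng = np.random.default_rng(seed)
    dt = 1.0 / n_steps
    # Start from the simple, tractable distribution N(0, marginal_var(1)).
    x = rng.normal(0.0, np.sqrt(marginal_var(1.0, sigma)), size=n_samples)
    for i in range(n_steps, 0, -1):
        t = i * dt
        g = diffusion(t, sigma)
        # Analytic score of the noised marginal N(mu, s0^2 + marginal_var(t)).
        # A real system would replace this with a learned, conditioned model.
        score = -(x - mu) / (s0 ** 2 + marginal_var(t, sigma))
        # One backward step: drift toward the data plus fresh Wiener noise.
        x = x + g ** 2 * score * dt + g * np.sqrt(dt) * rng.normal(size=n_samples)
    return x

samples = reverse_sample(mu=2.0, s0=0.5)
print(samples.mean(), samples.std())
```

Under these assumptions the printed mean and standard deviation should land near the toy data's 2.0 and 0.5, illustrating how the reverse process carries samples from the tractable noise distribution back to the target distribution.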

research
01/29/2022

ItôWave: Itô Stochastic Differential Equation Is All You Need For Wave Generation

In this paper, we propose a vocoder based on a pair of forward and rever...
research
04/06/2021

NU-Wave: A Diffusion Probabilistic Model for Neural Audio Upsampling

In this work, we introduce NU-Wave, the first neural audio upsampling mo...
research
10/17/2019

Nearly unstable family of stochastic processes given by stochastic differential equations with time delay

Let a be a finite signed measure on [-r, 0] with r ∈ (0, ∞). Consider a ...
research
02/28/2023

Reducing the Prior Mismatch of Stochastic Differential Equations for Diffusion-based Speech Enhancement

Recently, score-based generative models have been successfully employed ...
research
06/17/2022

NU-Wave 2: A General Neural Audio Upsampling Model for Various Sampling Rates

Conventionally, audio super-resolution models fixed the initial and the ...
research
06/14/2021

CRASH: Raw Audio Score-based Generative Modeling for Controllable High-resolution Drum Sound Synthesis

In this paper, we propose a novel score-base generative model for uncond...
research
07/17/2021

STRODE: Stochastic Boundary Ordinary Differential Equation

Perception of time from sequentially acquired sensory inputs is rooted i...
