Embedding a Differentiable Mel-cepstral Synthesis Filter to a Neural Speech Synthesis System

11/21/2022
by   Takenori Yoshimura, et al.
0

This paper integrates a classic mel-cepstral synthesis filter into a modern neural speech synthesis system towards end-to-end controllable speech synthesis. Since the mel-cepstral synthesis filter is explicitly embedded in neural waveform models in the proposed system, both voice characteristics and the pitch of synthesized speech are highly controlled via a frequency warping parameter and fundamental frequency, respectively. We implement the mel-cepstral synthesis filter as a differentiable and GPU-friendly module to enable the acoustic and waveform models in the proposed system to be simultaneously optimized in an end-to-end manner. Experiments show that the proposed system improves speech quality from a baseline system maintaining controllability. The core PyTorch modules used in the experiments will be publicly available on GitHub.

READ FULL TEXT
research
04/05/2019

WaveCycleGAN2: Time-domain Neural Post-filter for Speech Waveform Generation

WaveCycleGAN has recently been proposed to bridge the gap between natura...
research
09/24/2020

Timbre Space Representation of a Subtractive Synthesizer

In this study, we produce a geometrically scaled perceptual timbre space...
research
05/28/2021

Differentiable Artificial Reverberation

We propose differentiable artificial reverberation (DAR), a family of ar...
research
08/05/2023

A Systematic Exploration of Joint-training for Singing Voice Synthesis

There has been a growing interest in using end-to-end acoustic models fo...
research
11/15/2018

Comprehensive evaluation of statistical speech waveform synthesis

Statistical TTS systems that directly predict the speech waveform have r...
research
10/04/2021

On the Interplay Between Sparsity, Naturalness, Intelligibility, and Prosody in Speech Synthesis

Are end-to-end text-to-speech (TTS) models over-parametrized? To what ex...
research
03/05/2022

NeuralDPS: Neural Deterministic Plus Stochastic Model with Multiband Excitation for Noise-Controllable Waveform Generation

The traditional vocoders have the advantages of high synthesis efficienc...

Please sign up or login with your details

Forgot password? Click here to reset