On the Interplay Between Sparsity, Naturalness, Intelligibility, and Prosody in Speech Synthesis

10/04/2021
by Cheng-I Jeff Lai, et al.

Are end-to-end text-to-speech (TTS) models over-parametrized? To what extent can these models be pruned, and what happens to their synthesis capabilities? This work serves as a starting point for exploring pruning of both spectrogram prediction networks and vocoders. We thoroughly investigate the tradeoff between sparsity and its effects on synthetic speech. Additionally, we explore several aspects of TTS pruning: the amount of finetuning data versus sparsity, TTS-Augmentation to utilize unspoken text, and combining knowledge distillation with pruning. Our findings suggest not only that end-to-end TTS models are highly prunable, but also, perhaps surprisingly, that pruned TTS models can produce synthetic speech with equal or higher naturalness and intelligibility, and with similar prosody. All of our experiments are conducted on publicly available models, and the findings in this work are backed by large-scale subjective tests and objective measures. Code and 200 pruned models are made available to facilitate future research on efficiency in TTS.
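The pruning studied in the abstract can be illustrated with a minimal sketch of unstructured magnitude pruning, the common baseline for sparsifying neural network weights: the smallest-magnitude fraction of a weight matrix is zeroed out. This is an illustrative example only, not the paper's exact procedure; the `magnitude_prune` helper and the 90% sparsity target are assumptions for demonstration.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude `sparsity` fraction of weights
    (unstructured magnitude pruning)."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)  # number of weights to remove
    if k == 0:
        return weights.copy()
    # k-th smallest magnitude serves as the pruning threshold
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask

# Example: prune a random 8x8 weight matrix to ~90% sparsity
rng = np.random.default_rng(0)
w = rng.normal(size=(8, 8))
pruned = magnitude_prune(w, 0.9)
achieved = 1.0 - np.count_nonzero(pruned) / pruned.size
```

In a real TTS pipeline this mask would be applied to each layer of the spectrogram predictor or vocoder, typically followed by finetuning to recover quality, as the abstract's sparsity-versus-finetuning-data experiments suggest.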


