WavThruVec: Latent speech representation as intermediate features for neural speech synthesis

03/31/2022
by Hubert Siuzdak, et al.

Recent advances in neural text-to-speech research have been dominated by two-stage pipelines that use low-level intermediate speech representations such as mel-spectrograms. However, such predetermined features are fundamentally limited, because they prevent the model from exploiting the full potential of a data-driven approach through learned hidden representations. For this reason, several end-to-end methods have been proposed. However, such models are harder to train and require a large number of high-quality recordings with transcriptions. Here, we propose WavThruVec, a two-stage architecture that resolves this bottleneck by using high-dimensional Wav2Vec 2.0 embeddings as the intermediate speech representation. Since these hidden activations provide high-level linguistic features, they are more robust to noise. This allows us to use lower-quality annotated speech datasets to train the first-stage module. At the same time, the second-stage component can be trained on large-scale untranscribed audio corpora, as Wav2Vec 2.0 embeddings are time-aligned and speaker-independent. This results in better generalization both to out-of-vocabulary words and to unseen speakers. We show that the proposed model not only matches the quality of state-of-the-art neural models, but also has useful properties that enable tasks such as voice conversion and zero-shot synthesis.
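
The two-stage data flow described in the abstract can be pictured with a minimal sketch. This is an illustration only, not the authors' implementation: every function below is a hypothetical stand-in that returns zeros, and only the shapes and the order of the calls are meaningful. The 16 kHz input rate and the 20 ms frame stride are properties of Wav2Vec 2.0. The embedding width, the integer speaker ID, the fixed number of frames per character, and the output rate matching the input rate are assumptions made for the sketch.

```python
from typing import List

SAMPLE_RATE = 16_000   # Wav2Vec 2.0 consumes 16 kHz audio
HOP_SAMPLES = 320      # one embedding frame per 20 ms of audio
EMBED_DIM = 1024       # assumption: depends on the Wav2Vec 2.0 checkpoint used

Frame = List[float]


def stage1_text_to_embeddings(text: str, frames_per_char: int = 4) -> List[Frame]:
    """Hypothetical stage 1: text -> sequence of Wav2Vec 2.0-like frames.

    Trained on transcribed speech; a real model predicts durations
    instead of using a fixed number of frames per character.
    """
    n_frames = len(text) * frames_per_char
    return [[0.0] * EMBED_DIM for _ in range(n_frames)]


def extract_embeddings(waveform: List[float]) -> List[Frame]:
    """Hypothetical stand-in for a pretrained Wav2Vec 2.0 encoder.

    The frame count is approximated as one frame per HOP_SAMPLES samples.
    """
    n_frames = len(waveform) // HOP_SAMPLES
    return [[0.0] * EMBED_DIM for _ in range(n_frames)]


def stage2_embeddings_to_waveform(frames: List[Frame], speaker_id: int) -> List[float]:
    """Hypothetical stage 2: frames + speaker -> audio samples.

    Needs only (embedding, audio) pairs for training, so no transcripts.
    Assumption: output uses the same rate as the encoder input.
    """
    return [0.0] * (len(frames) * HOP_SAMPLES)


def synthesize(text: str, speaker_id: int) -> List[float]:
    """Text-to-speech: stage 1 followed by stage 2."""
    return stage2_embeddings_to_waveform(stage1_text_to_embeddings(text), speaker_id)


def convert_voice(source: List[float], target_speaker_id: int) -> List[float]:
    """Voice conversion: skip stage 1 and re-render extracted frames."""
    return stage2_embeddings_to_waveform(extract_embeddings(source), target_speaker_id)
```

In this picture, voice conversion reuses the second stage unchanged: because the intermediate frames are time-aligned and speaker-independent, embeddings extracted from a source recording can be rendered with a different target speaker.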

research
07/02/2022

Learning Noise-independent Speech Representation for High-quality Voice Conversion for Noisy Target Speakers

Building a voice conversion system for noisy target speakers, such as us...
research
07/05/2022

Glow-WaveGAN 2: High-quality Zero-shot Text-to-speech Synthesis and Any-to-any Voice Conversion

The zero-shot scenario for speech generation aims at synthesizing a nove...
research
04/01/2022

AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios

Adaptive text to speech (TTS) can synthesize new voices in zero-shot sce...
research
04/07/2022

Self supervised learning for robust voice cloning

Voice cloning is a difficult task which requires robust and informative ...
research
05/09/2023

Zero-shot personalized lip-to-speech synthesis with face image based voice control

Lip-to-Speech (Lip2Speech) synthesis, which predicts corresponding speec...
research
09/23/2022

ControlVC: Zero-Shot Voice Conversion with Time-Varying Controls on Pitch and Rhythm

Recent developments in neural speech synthesis and vocoding have sparked...
research
10/21/2020

Grapheme or phoneme? An Analysis of Tacotron's Embedded Representations

End-to-end models, particularly Tacotron-based ones, are currently a pop...
