HierVST: Hierarchical Adaptive Zero-shot Voice Style Transfer

07/30/2023
by Sang Hoon Lee, et al.

Despite rapid progress in the voice style transfer (VST) field, recent zero-shot VST systems still lack the ability to transfer the voice style of a novel speaker. In this paper, we present HierVST, a hierarchical adaptive end-to-end zero-shot VST model. We train the model on speech data alone, without any text transcripts, using hierarchical variational inference and self-supervised representations. In addition, we adopt a hierarchical adaptive generator that generates the pitch representation and the waveform audio sequentially. Moreover, we use unconditional generation to improve the speaker-relative acoustic capacity of the acoustic representation. With this hierarchical adaptive structure, the model can adapt to a novel voice style and convert speech progressively. The experimental results demonstrate that our method outperforms other VST models in zero-shot VST scenarios. Audio samples are available at <https://hiervst.github.io/>.

research
05/15/2022

GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech Synthesis

Style transfer for out-of-domain (OOD) speech synthesis aims to generate...
research
05/19/2022

End-to-End Zero-Shot Voice Style Transfer with Location-Variable Convolutions

Zero-shot voice conversion is becoming an increasingly popular research ...
research
09/14/2023

Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer

Direct speech-to-speech translation (S2ST) with discrete self-supervised...
research
09/21/2023

Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts

Zero-shot text-to-speech (TTS) synthesis aims to clone any unseen speake...
research
05/14/2019

Zero-Shot Voice Style Transfer with Only Autoencoder Loss

Non-parallel many-to-many voice conversion, as well as zero-shot voice c...
research
05/15/2020

ConVoice: Real-Time Zero-Shot Voice Style Transfer with Convolutional Network

We propose a neural network for zero-shot voice conversion (VC) without ...
research
08/09/2023

VAST: Vivify Your Talking Avatar via Zero-Shot Expressive Facial Style Transfer

Current talking face generation methods mainly focus on speech-lip synch...
