Few Shot Adaptive Normalization Driven Multi-Speaker Speech Synthesis

12/14/2020
by   Neeraj Kumar, et al.

Speaking style varies from person to person: every speaker has a characteristic style shaped by language, geography, culture, and other factors. Style is best captured by the prosody of the speech signal. High-quality multi-speaker speech synthesis that accounts for prosody in a few-shot manner is an area of active research with many real-world applications. While multiple efforts have been made in this direction, it remains an interesting and challenging problem. In this paper, we present a novel few-shot multi-speaker speech synthesis approach (FSM-SS) that combines an adaptive normalization architecture with a non-autoregressive multi-head attention model. Given an input text and a reference speech sample of an unseen person, FSM-SS can generate speech in that person's style in a few-shot manner. Additionally, we demonstrate how the affine parameters of the normalization help capture prosodic features such as energy and fundamental frequency in a disentangled fashion, and how they can be used to generate morphed speech output. We demonstrate the efficacy of our proposed architecture on the multi-speaker VCTK and LibriTTS datasets, using multiple quantitative metrics that measure distortion of the generated speech and mean opinion score (MOS), along with a speaker embedding analysis comparing generated speech with actual speech samples.
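
To make the idea of style-dependent affine parameters concrete, the following is a minimal sketch of a generic adaptive normalization layer, not the FSM-SS implementation. It assumes features shaped (channels, time) and a style vector taken from a reference utterance; the function name `adaptive_norm` and the projection matrices `w_gamma` and `w_beta` are hypothetical names introduced only for illustration, and here they are random rather than learned.

```python
import numpy as np

def adaptive_norm(features, style, w_gamma, w_beta, eps=1e-5):
    """Generic adaptive normalization (illustrative assumption, not the paper's code).

    features: (channels, time) hidden representation of the text/speech.
    style:    (style_dim,) vector derived from a reference speech sample.
    w_gamma, w_beta: (channels, style_dim) hypothetical projection matrices.
    """
    # Remove per-channel statistics of the content features.
    mean = features.mean(axis=1, keepdims=True)
    std = features.std(axis=1, keepdims=True)
    normalized = (features - mean) / (std + eps)

    # Predict the affine parameters (scale and shift) from the style vector.
    gamma = 1.0 + w_gamma @ style  # shape: (channels,)
    beta = w_beta @ style          # shape: (channels,)

    # Re-inject style through the affine transform.
    return gamma[:, None] * normalized + beta[:, None]

# Toy usage with random data standing in for real features.
rng = np.random.default_rng(0)
features = rng.normal(size=(4, 50))
style = rng.normal(size=8)
w_gamma = 0.1 * rng.normal(size=(4, 8))
w_beta = 0.1 * rng.normal(size=(4, 8))
out = adaptive_norm(features, style, w_gamma, w_beta)
```

In this sketch the output's per-channel mean and spread are set entirely by the style-derived shift and scale, which is one way to see why such parameters could carry prosody-like information separately from the content.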

research
11/30/2022

SNAC: Speaker-normalized affine coupling layer in flow-based architecture for zero-shot multi-speaker text-to-speech

Zero-shot multi-speaker text-to-speech (ZSM-TTS) models aim to generate ...
research
11/02/2022

Multi-Speaker Multi-Style Speech Synthesis with Timbre and Style Disentanglement

Disentanglement of a speaker's timbre and style is very important for st...
research
06/06/2021

Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation

With rapid progress in neural text-to-speech (TTS) models, personalized ...
research
12/14/2020

Robust One Shot Audio to Video Generation

Audio to Video generation is an interesting problem that has numerous ap...
research
09/23/2021

Unet-TTS: Improving Unseen Speaker and Style Transfer in One-shot Voice Cloning

One-shot voice cloning aims to transform speaker voice and speaking styl...
research
11/19/2022

Multi-Speaker Expressive Speech Synthesis via Multiple Factors Decoupling

This paper aims to synthesize target speaker's speech with desired speak...
research
06/20/2020

Speaker Independent and Multilingual/Mixlingual Speech-Driven Talking Head Generation Using Phonetic Posteriorgrams

Generating 3D speech-driven talking head has received more and more atte...
