SATTS: Speaker Attractor Text to Speech, Learning to Speak by Learning to Separate

07/13/2022
by   Nabarun Goswami, et al.

The mapping from text to speech (TTS) is non-deterministic: letters may be pronounced differently depending on context, and phonemes can vary with physiological and stylistic factors such as gender, age, accent, and emotion. Neural speaker embeddings, trained to identify or verify speakers, are typically used to represent such characteristics and transfer them from reference speech to synthesized speech. Speech separation, on the other hand, is the challenging task of separating individual speakers from an overlapping mixture of several speakers. Speaker attractors are high-dimensional embedding vectors that pull the time-frequency bins of each speaker's speech towards themselves while repelling those belonging to other speakers. In this work, we explore the possibility of using these powerful speaker attractors for zero-shot speaker adaptation in multi-speaker TTS synthesis and propose speaker attractor text to speech (SATTS). Through various experiments, we show that SATTS can synthesize natural speech from text using an unseen target speaker's reference signal, even when that signal has less-than-ideal recording conditions, i.e. it is reverberant or mixed with other speakers.
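
To make the notion of a speaker attractor concrete, the following is a minimal NumPy sketch in the style of the classic deep-attractor formulation. It is an illustration under stated assumptions, not the SATTS implementation: the function names `compute_attractors` and `estimate_masks`, the toy 2-dimensional embeddings, and the one-hot bin assignments are all hypothetical. Each attractor is taken as the membership-weighted mean of the bin embeddings of one speaker, and each bin is then softly assigned to the attractor it is most similar to.

```python
import numpy as np


def compute_attractors(embeddings, assignments):
    """Return one attractor per speaker, shape (C, D).

    embeddings:  (N, D) array, one embedding per time-frequency bin.
    assignments: (N, C) array, one-hot (or soft) speaker membership per bin.
    Each attractor is the membership-weighted mean embedding of a speaker.
    """
    weights = assignments.sum(axis=0)[:, None]  # (C, 1) total membership
    return assignments.T @ embeddings / np.maximum(weights, 1e-8)


def estimate_masks(embeddings, attractors):
    """Return soft masks, shape (N, C): softmax over speakers of the
    dot product between each bin embedding and each attractor."""
    logits = embeddings @ attractors.T
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)


# Toy data (assumption): 200 bins from 2 speakers whose embeddings
# cluster around two different directions in a 2-D embedding space.
rng = np.random.default_rng(0)
centers = np.array([[3.0, 0.0], [0.0, 3.0]])
labels = rng.integers(0, 2, size=200)
embeddings = centers[labels] + 0.1 * rng.standard_normal((200, 2))
assignments = np.eye(2)[labels]

attractors = compute_attractors(embeddings, assignments)
masks = estimate_masks(embeddings, attractors)
```

In this toy setting, bins belonging to a speaker receive a mask value close to 1 for that speaker's attractor and close to 0 for the other, which is the "pull" and "repel" behaviour described above. A real system would learn the embeddings with a neural network rather than sample them.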

research
06/12/2018

Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis

We describe a neural network-based system for text-to-speech (TTS) synth...
research
06/24/2019

Single-Channel Speech Separation with Auxiliary Speaker Embeddings

We present a novel source separation model to decompose a single-channel ...
research
07/07/2021

Effective and Differentiated Use of Control Information for Multi-speaker Speech Synthesis

In multi-speaker speech synthesis, data from a number of speakers usuall...
research
05/08/2021

Zero-Shot Personalized Speech Enhancement through Speaker-Informed Model Selection

This paper presents a novel zero-shot learning approach towards personal...
research
07/31/2019

Quantifying Cochlear Implant Users' Ability for Speaker Identification using CI Auditory Stimuli

Speaker recognition is a biometric modality that uses underlying speech ...
research
06/10/2021

Improving multi-speaker TTS prosody variance with a residual encoder and normalizing flows

Text-to-speech systems recently achieved almost indistinguishable qualit...
