Speaker Conditional WaveRNN: Towards Universal Neural Vocoder for Unseen Speaker and Recording Conditions

08/09/2020
by Dipjyoti Paul, et al.

Recent advancements in deep learning have led to human-level performance in single-speaker speech synthesis. However, such systems still show limitations in speech quality when generalized to multi-speaker models, especially for unseen speakers and unseen recording conditions. For instance, conventional neural vocoders are tuned to the training speaker and generalize poorly to unseen speakers. In this work, we propose a variant of WaveRNN, referred to as speaker conditional WaveRNN (SC-WaveRNN), targeting an efficient universal vocoder even for unseen speakers and recording conditions. In contrast to standard WaveRNN, SC-WaveRNN exploits additional information given in the form of speaker embeddings. Trained on publicly available data, SC-WaveRNN achieves significantly better performance than baseline WaveRNN on both subjective and objective metrics. In MOS, SC-WaveRNN achieves an improvement of about 23% for the seen speaker and seen recording condition, and up to 95% for the unseen condition. Finally, we extend our work by implementing multi-speaker text-to-speech (TTS) synthesis similar to zero-shot speaker adaptation; our system has been preferred over the baseline TTS system by 60%.
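To illustrate the core idea of speaker conditioning, the sketch below shows a toy autoregressive vocoder step in which the recurrent cell receives the previous sample, an acoustic (mel) frame, and a fixed speaker embedding concatenated together. This is a minimal illustrative assumption, not the paper's implementation: the actual SC-WaveRNN uses a WaveRNN (GRU-based, dual-softmax) architecture, and all dimensions, weights, and the plain tanh cell here are hypothetical.

```python
import math
import random

# Hypothetical sketch of speaker-conditional autoregressive vocoding:
# at each time step the recurrent cell is conditioned on the previous
# sample, an acoustic feature frame, and a fixed speaker embedding.
# (Toy tanh-RNN cell and dimensions are illustrative assumptions; the
# real SC-WaveRNN is a GRU-based WaveRNN with a dual-softmax output.)

random.seed(0)

H, MEL, SPK = 8, 4, 3   # hidden size, mel dims, speaker-embedding dims
IN = 1 + MEL + SPK      # prev sample + mel frame + speaker embedding

W_in = [[random.uniform(-0.1, 0.1) for _ in range(IN)] for _ in range(H)]
W_h = [[random.uniform(-0.1, 0.1) for _ in range(H)] for _ in range(H)]
W_out = [random.uniform(-0.1, 0.1) for _ in range(H)]

def step(prev_sample, mel_frame, spk_embed, h):
    # Concatenate all conditioning signals into one input vector.
    x = [prev_sample] + mel_frame + spk_embed
    h_new = [math.tanh(sum(w * xi for w, xi in zip(W_in[i], x)) +
                       sum(w * hj for w, hj in zip(W_h[i], h)))
             for i in range(H)]
    # Project the hidden state to the next waveform sample in (-1, 1).
    y = math.tanh(sum(w * hj for w, hj in zip(W_out, h_new)))
    return y, h_new

spk_embed = [0.2, -0.5, 0.1]           # would come from a speaker encoder
mel = [[0.0] * MEL for _ in range(5)]  # placeholder acoustic frames
h, sample, out = [0.0] * H, 0.0, []
for frame in mel:
    sample, h = step(sample, frame, spk_embed, h)
    out.append(sample)

print(len(out))  # one generated sample per conditioning frame
```

Because the speaker embedding enters every step as an input, swapping in an embedding from an unseen speaker changes the generated waveform without retraining the vocoder, which is the zero-shot behavior the paper targets.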


Related research

04/02/2021 — SC-GlowTTS: an Efficient Zero-Shot Multi-Speaker Text-To-Speech Model
In this paper, we propose SC-GlowTTS: an efficient zero-shot multi-speak...

11/15/2018 — Robust universal neural vocoding
This paper introduces a robust universal neural vocoder trained with 74 ...

02/15/2022 — SpeechPainter: Text-conditioned Speech Inpainting
We propose SpeechPainter, a model for filling in gaps of up to one secon...

04/01/2022 — AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios
Adaptive text to speech (TTS) can synthesize new voices in zero-shot sce...

01/05/2023 — Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
We introduce a language modeling approach for text to speech synthesis (...

02/24/2021 — Triplet loss based embeddings for forensic speaker identification in Spanish
With the advent of digital technology, it is more common that committed ...

03/09/2023 — X-SepFormer: End-to-end Speaker Extraction Network with Explicit Optimization on Speaker Confusion
Target speech extraction (TSE) systems are designed to extract target sp...
