Voice Filter: Few-shot text-to-speech speaker adaptation using voice conversion as a post-processing module

02/16/2022
by Adam Gabrys, et al.

State-of-the-art text-to-speech (TTS) systems require several hours of recorded speech data to generate high-quality synthetic speech. When trained on reduced amounts of data, standard TTS models suffer from degraded speech quality and intelligibility, which makes training low-resource TTS systems problematic. In this paper, we propose a novel extremely low-resource TTS method called Voice Filter that uses as little as one minute of speech from a target speaker. It uses voice conversion (VC) as a post-processing module appended to a pre-existing high-quality TTS system and marks a conceptual shift in the existing TTS paradigm, framing the few-shot TTS problem as a VC task. Furthermore, we propose to use a duration-controllable TTS system to create a parallel speech corpus to facilitate the VC task. Results show that, given one minute of speech, the Voice Filter outperforms state-of-the-art few-shot speech synthesis techniques on both objective and subjective metrics across a diverse set of voices, while remaining competitive with a TTS model built on 30 times more data.
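
The two ideas in the abstract (a parallel corpus produced by a duration-controllable TTS system, and VC applied as a post-processing step after TTS) can be illustrated with a minimal sketch. This is not the authors' implementation. Every name here (`Utterance`, `build_parallel_pair`, `voice_filter_inference`, `toy_tts`, `toy_vc`) is hypothetical, the acoustic features are placeholder lists of floats, and the sketch assumes that per-phoneme durations are taken from the target speaker's recording, which the abstract does not state.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Frames = List[List[float]]  # one feature vector per frame (placeholder for e.g. mel frames)
TTS = Callable[[List[str], List[int]], Frames]
VC = Callable[[Frames], Frames]


@dataclass
class Utterance:
    phonemes: List[str]
    durations: List[int]    # frames per phoneme; ASSUMED to come from the target recording
    target_frames: Frames   # features of the target speaker's recording


def build_parallel_pair(utt: Utterance, synthesize: TTS) -> Tuple[Frames, Frames]:
    """Synthesize the same phonemes with the same durations, so that the
    synthetic (source) and recorded (target) sequences have equal length."""
    source_frames = synthesize(utt.phonemes, utt.durations)
    if len(source_frames) != len(utt.target_frames):
        raise ValueError("synthesized length does not match target length")
    return source_frames, utt.target_frames


def voice_filter_inference(phonemes: List[str], durations: List[int],
                           synthesize: TTS, convert: VC) -> Frames:
    """TTS first, then voice conversion as a post-processing module."""
    return convert(synthesize(phonemes, durations))


# Toy stand-ins (hypothetical) so the sketch runs without any speech models.
def toy_tts(phonemes: List[str], durations: List[int]) -> Frames:
    frames: Frames = []
    for phoneme, n_frames in zip(phonemes, durations):
        frames.extend([[float(len(phoneme))] for _ in range(n_frames)])
    return frames


def toy_vc(frames: Frames) -> Frames:
    # Placeholder "conversion": shift every feature value by a constant.
    return [[value + 1.0 for value in frame] for frame in frames]


if __name__ == "__main__":
    utt = Utterance(["HH", "AH"], [2, 3], [[0.0] for _ in range(5)])
    source, target = build_parallel_pair(utt, toy_tts)
    print(len(source), len(target))
    print(voice_filter_inference(["HH", "AH"], [2, 3], toy_tts, toy_vc))
```

In this sketch, the frame-aligned pairs returned by `build_parallel_pair` would serve as training data for the VC model, and `voice_filter_inference` shows where that model sits at synthesis time.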


