DRSpeech: Degradation-Robust Text-to-Speech Synthesis with Frame-Level and Utterance-Level Acoustic Representation Learning

03/29/2022
by   Takaaki Saeki, et al.
0

Most text-to-speech (TTS) methods use high-quality speech corpora recorded in a well-designed environment, incurring a high cost for data collection. To solve this problem, existing noise-robust TTS methods are intended to use noisy speech corpora as training data. However, they only address either time-invariant or time-variant noises. We propose a degradation-robust TTS method, which can be trained on speech corpora that contain both additive noises and environmental distortions. It jointly represents the time-variant additive noises with a frame-level encoder and the time-invariant environmental distortions with an utterance-level encoder. We also propose a regularization method to attain clean environmental embedding that is disentangled from the utterance-dependent information such as linguistic contents and speaker characteristics. Evaluation results show that our method achieved significantly higher-quality synthetic speech than previous methods in the condition including both additive noise and reverberation.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
10/26/2022

Text-to-speech synthesis from dark data with evaluation-in-the-loop data selection

This paper proposes a method for selecting training data for text-to-spe...
research
12/17/2020

DenoiSpeech: Denoising Text to Speech with Frame-Level Noise Modeling

While neural-based text to speech (TTS) models can synthesize natural an...
research
03/07/2023

Do Prosody Transfer Models Transfer Prosody?

Some recent models for Text-to-Speech synthesis aim to transfer the pros...
research
03/02/2022

Speaker Adaption with Intuitive Prosodic Features for Statistical Parametric Speech Synthesis

In this paper, we propose a method of speaker adaption with intuitive pr...
research
02/03/2020

Within-sample variability-invariant loss for robust speaker recognition under noisy environments

Despite the significant improvements in speaker recognition enabled by d...
research
07/16/2019

Towards Adapting NMF Dictionaries Using Total Variability Modeling for Noise-Robust Acoustic Features

We propose an algorithm to extract noise-robust acoustic features from n...
research
10/12/2021

Multi-Modal Pre-Training for Automated Speech Recognition

Traditionally, research in automated speech recognition has focused on l...

Please sign up or login with your details

Forgot password? Click here to reset