Environment Aware Text-to-Speech Synthesis

10/08/2021
by Daxin Tan, et al.

This study aims to design an environment-aware text-to-speech (TTS) system that can generate speech suited to specific acoustic environments. It is also motivated by the desire to leverage large amounts of speech audio from heterogeneous sources in TTS system development. The key idea is to model the acoustic environment in speech audio as a factor of data variability and to incorporate it as a condition in neural network based speech synthesis. Two embedding extractors are trained on two purposely constructed datasets to characterize and disentangle the speaker and environment factors in speech. A neural network model is then trained to generate speech from the extracted speaker and environment embeddings. Objective and subjective evaluation results demonstrate that the proposed TTS system effectively disentangles speaker and environment factors and synthesizes speech audio that carries the designated speaker characteristics and environment attributes. Audio samples are available online for demonstration: https://daxintan-cuhk.github.io/Environment-Aware-TTS/
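The conditioning idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how the two embeddings enter the synthesis network, so the code below assumes one common choice: concatenating the speaker and environment embeddings and appending them to every frame of the text-encoder output. The function name, the embedding sizes, and the random stand-in vectors are all hypothetical and are not taken from the paper.

```python
import numpy as np


def condition_encoder_outputs(text_enc, spk_emb, env_emb):
    """Append speaker and environment embeddings to every encoder frame.

    Assumed (hypothetical) shapes:
      text_enc: (T, D_text) text-encoder outputs
      spk_emb:  (D_spk,)    speaker embedding
      env_emb:  (D_env,)    environment embedding
    Returns an array of shape (T, D_text + D_spk + D_env).
    """
    cond = np.concatenate([spk_emb, env_emb])
    # Repeat the same conditioning vector for each of the T frames.
    cond = np.broadcast_to(cond, (text_enc.shape[0], cond.shape[0]))
    return np.concatenate([text_enc, cond], axis=1)


# Random stand-ins for embeddings that trained extractors would produce.
rng = np.random.default_rng(0)
text_enc = rng.standard_normal((5, 8))
spk_a = rng.standard_normal(4)
env_clean = rng.standard_normal(3)
env_reverb = rng.standard_normal(3)

# Same text and speaker, two different target environments.
out_clean = condition_encoder_outputs(text_enc, spk_a, env_clean)
out_reverb = condition_encoder_outputs(text_enc, spk_a, env_reverb)
```

Because the two factors are supplied as separate vectors, swapping only the environment embedding changes the environment part of the decoder input while the text and speaker parts stay identical. That independence is what disentangled embeddings are meant to make possible.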


