A Neural Text-to-Speech Model Utilizing Broadcast Data Mixed with Background Music

03/04/2021
by Hanbin Bae, et al.

Recently, it has become easier to obtain speech data from various media such as the internet or YouTube, but directly using such data to train a neural text-to-speech (TTS) model is difficult. The proportion of clean speech is insufficient and the remainder includes background music, which remains a problem even with the global style token (GST). Therefore, we propose the following method to successfully train an end-to-end TTS model with limited broadcast data. First, the background music is removed from the speech by introducing a music filter. Second, the GST-TTS model with an auxiliary quality classifier is trained with the filtered speech and a small amount of clean speech. In particular, the quality classifier makes the embedding vector of the GST layer focus on representing the speech quality (filtered or clean) of the input speech. The experimental results verified that the proposed method synthesized speech of much higher quality than conventional methods.
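
The role of the auxiliary quality classifier can be illustrated with a minimal sketch. It assumes, purely for illustration, that the classifier is a single linear layer on the GST embedding trained with binary cross-entropy, and that its loss is added to the TTS reconstruction loss with a weight. The abstract does not give the classifier architecture, the loss weighting, or any of the names used here (`quality_logit`, `joint_loss`, `aux_weight`); all of them are hypothetical.

```python
import math


def quality_logit(style_embedding, weights, bias):
    """Hypothetical linear quality classifier applied to the GST embedding."""
    return sum(w * e for w, e in zip(weights, style_embedding)) + bias


def binary_cross_entropy(logit, label):
    """Numerically stable BCE on a raw logit; label is 1.0 (clean) or 0.0 (filtered)."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def joint_loss(reconstruction_loss, style_embedding, is_clean,
               weights, bias, aux_weight=1.0):
    """Assumed training objective: TTS loss plus a weighted quality-classification loss.

    Penalizing misclassified quality labels pushes the style embedding to
    encode whether the input speech was clean or music-filtered.
    """
    label = 1.0 if is_clean else 0.0
    logit = quality_logit(style_embedding, weights, bias)
    return reconstruction_loss + aux_weight * binary_cross_entropy(logit, label)
```

Under this assumption, an embedding that already separates clean from filtered speech adds almost nothing to the total loss, while an uninformative embedding is penalized. At synthesis time, conditioning on an embedding from the clean region would then be the natural way to request clean-sounding output.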

