Perceptually Guided End-to-End Text-to-Speech

11/02/2020
by Yeunju Choi et al.

Several fast text-to-speech (TTS) models have been proposed for real-time processing, but there is room for improvement in speech quality. Meanwhile, there is a mismatch between the loss function for training and the mean opinion score (MOS) for evaluation, which may limit the speech quality of TTS models. In this work, we propose a method that can improve the speech quality of a fast TTS model while maintaining the inference speed. To do so, we train a TTS model using a perceptual loss based on the predicted MOS. Under the supervision of a MOS prediction model, a TTS model can learn to increase the perceptual quality of speech directly. In experiments, we train FastSpeech on our internal Korean dataset using the MOS prediction model pre-trained on the Voice Conversion Challenge 2018 evaluation results. The MOS test results show that our proposed approach outperforms FastSpeech in speech quality.
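The training objective described above can be sketched as a standard spectrogram reconstruction loss combined with a perceptual term from a frozen MOS predictor. The sketch below is a minimal, hypothetical illustration: `TinyTTS` and `TinyMOSPredictor` are stand-ins for FastSpeech and the VCC 2018-trained MOS model, and the loss weighting `alpha` is an assumed hyperparameter, none of which come from the paper itself.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins: in the paper the TTS model is FastSpeech and the
# MOS predictor is pre-trained on VCC 2018 listening-test scores. Here both
# are tiny dummy networks, just to illustrate the combined objective.
class TinyTTS(nn.Module):
    def __init__(self, vocab=32, mel_dim=80):
        super().__init__()
        self.embed = nn.Embedding(vocab, 64)
        self.proj = nn.Linear(64, mel_dim)

    def forward(self, tokens):            # (B, T) -> (B, T, mel_dim)
        return self.proj(self.embed(tokens))

class TinyMOSPredictor(nn.Module):
    def __init__(self, mel_dim=80):
        super().__init__()
        self.score = nn.Linear(mel_dim, 1)

    def forward(self, mel):               # (B, T, mel_dim) -> (B,) predicted MOS
        return self.score(mel).mean(dim=(1, 2))

def perceptual_tts_loss(mel_pred, mel_target, mos_model, alpha=1.0):
    """Reconstruction loss plus a perceptual term: the predicted MOS is
    maximized, so its negative is added to the minimized loss."""
    recon = nn.functional.l1_loss(mel_pred, mel_target)
    mos = mos_model(mel_pred).mean()
    return recon - alpha * mos

tts = TinyTTS()
mos_net = TinyMOSPredictor()
for p in mos_net.parameters():            # the MOS predictor stays frozen
    p.requires_grad_(False)

tokens = torch.randint(0, 32, (2, 10))
target = torch.randn(2, 10, 80)
loss = perceptual_tts_loss(tts(tokens), target, mos_net)
loss.backward()                           # gradients reach only the TTS model
```

Freezing the MOS predictor matters: if it were updated jointly, the TTS model could trivially inflate its predicted scores instead of improving the speech itself.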

Related research

Deep Learning Based Assessment of Synthetic Speech Naturalness (04/23/2021)
In this paper, we present a new objective prediction model for synthetic...

Deep Voice: Real-time Neural Text-to-Speech (02/25/2017)
We present Deep Voice, a production-quality text-to-speech system constr...

Polyphone disambiguation and accent prediction using pre-trained language models in Japanese TTS front-end (01/24/2022)
Although end-to-end text-to-speech (TTS) models can generate natural spe...

MFCCGAN: A Novel MFCC-Based Speech Synthesizer Using Adversarial Learning (06/22/2023)
In this paper, we introduce MFCCGAN as a novel speech synthesizer based ...

Individually amplified text-to-speech (12/03/2020)
Text-to-speech (TTS) offers the opportunity to compensate for a hearing ...

MOSRA: Joint Mean Opinion Score and Room Acoustics Speech Quality Assessment (04/04/2022)
The acoustic environment can degrade speech quality during communication...

VITS2: Improving Quality and Efficiency of Single-Stage Text-to-Speech with Adversarial Learning and Architecture Design (07/31/2023)
Single-stage text-to-speech models have been actively studied recently, ...
