VITS2: Improving Quality and Efficiency of Single-Stage Text-to-Speech with Adversarial Learning and Architecture Design

07/31/2023
by   Jungil Kong, et al.
0

Single-stage text-to-speech models have been actively studied recently, and their results have outperformed two-stage pipeline systems. Although the previous single-stage model has made great progress, there is room for improvement in terms of its intermittent unnaturalness, computational efficiency, and strong dependence on phoneme conversion. In this work, we introduce VITS2, a single-stage text-to-speech model that efficiently synthesizes a more natural speech by improving several aspects of the previous work. We propose improved structures and training mechanisms and present that the proposed methods are effective in improving naturalness, similarity of speech characteristics in a multi-speaker model, and efficiency of training and inference. Furthermore, we demonstrate that the strong dependence on phoneme conversion in previous works can be significantly reduced with our method, which allows a fully end-to-end single-stage approach.

READ FULL TEXT
research
06/11/2021

Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

Several recent end-to-end text-to-speech (TTS) models enabling single-st...
research
05/19/2020

Improving Accent Conversion with Reference Encoder and End-To-End Text-To-Speech

Accent conversion (AC) transforms a non-native speaker's accent into a n...
research
11/02/2020

Perceptually Guided End-to-End Text-to-Speech

Several fast text-to-speech (TTS) models have been proposed for real-tim...
research
06/02/2021

A Preliminary Study of a Two-Stage Paradigm for Preserving Speaker Identity in Dysarthric Voice Conversion

We propose a new paradigm for maintaining speaker identity in dysarthric...
research
02/28/2023

Automatic Heteronym Resolution Pipeline Using RAD-TTS Aligners

Grapheme-to-phoneme (G2P) transduction is part of the standard text-to-s...
research
04/07/2023

ArmanTTS single-speaker Persian dataset

TTS, or text-to-speech, is a complicated process that can be accomplishe...
research
11/04/2020

Prosodic Representation Learning and Contextual Sampling for Neural Text-to-Speech

In this paper, we introduce Kathaka, a model trained with a novel two-st...

Please sign up or login with your details

Forgot password? Click here to reset