Automatic Tuning of Loss Trade-offs without Hyper-parameter Search in End-to-End Zero-Shot Speech Synthesis

05/26/2023
by Seongyeon Park, et al.

Recently, zero-shot TTS and VC methods have gained attention for their practicality: they can generate voices of speakers unseen during training. Among these methods, zero-shot modifications of the VITS model have shown superior performance while retaining the useful properties inherited from VITS. However, the performance of VITS and VITS-based zero-shot models varies dramatically depending on how the losses are balanced. This can be problematic, as it requires a burdensome search over loss-balance hyper-parameters to find the optimal balance. In this work, we propose a novel framework that finds this optimum without search, by inducing the decoder of VITS-based models to its full reconstruction ability. With our framework, we outperform baselines in zero-shot TTS and VC, achieving state-of-the-art performance. Furthermore, we show the robustness of our framework in various settings and provide an explanation for the results in the discussion.
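To make the tuning problem concrete, below is a minimal sketch (not the paper's implementation) of how VITS-style training objectives are typically combined: several loss terms are summed with balance weights that are conventionally treated as hyper-parameters and found by trial and error. The loss names and weight values are illustrative assumptions.

```python
# Illustrative sketch only: the loss terms and weights below are assumptions,
# not the paper's method. VITS-style models optimize a weighted sum of several
# losses, and the balance weights normally have to be tuned by hand or by search.
import torch


def total_loss(losses: dict[str, torch.Tensor],
               weights: dict[str, float]) -> torch.Tensor:
    """Weighted sum of individual loss terms."""
    return sum(weights[name] * value for name, value in losses.items())


# Hypothetical per-term loss values for a VITS-like model
# (mel reconstruction, KL divergence, adversarial, duration).
losses = {
    "recon": torch.tensor(1.2),
    "kl": torch.tensor(0.8),
    "adv": torch.tensor(0.5),
    "dur": torch.tensor(0.3),
}

# Conventional recipe: fixed balance weights chosen by hyper-parameter search.
# The framework described in the abstract aims to remove this search step.
weights = {"recon": 45.0, "kl": 1.0, "adv": 1.0, "dur": 1.0}

print(total_loss(losses, weights))
```

Because each weight shifts how much gradient each term contributes, a poor balance can dominate training with one objective; this is the sensitivity the abstract refers to.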
