TGAVC: Improving Autoencoder Voice Conversion with Text-Guided and Adversarial Training

08/08/2022
by Huaizhen Tang, et al.

Non-parallel many-to-many voice conversion remains an interesting but challenging speech processing task. Recently, AutoVC, a conditional autoencoder-based method, achieved excellent conversion results by disentangling speaker identity and speech content using information-constraining bottlenecks. However, because AutoVC is trained purely as an autoencoder, it is difficult to evaluate how well content and speaker identity are actually separated. In this paper, a novel voice conversion framework, named Text Guided AutoVC (TGAVC), is proposed to separate content and timbre from speech more effectively: an expected content embedding, produced from the text transcriptions, guides the extraction of voice content. In addition, adversarial training is applied to eliminate speaker identity information from the estimated content embedding extracted from speech. Under the guidance of the expected content embedding and the adversarial training, the content encoder is trained to extract a speaker-independent content embedding from speech. Experiments on the AIShell-3 dataset show that the proposed model outperforms AutoVC in terms of the naturalness and similarity of the converted speech.
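To make the training idea in the abstract concrete, here is a minimal sketch of how the three signals it mentions (autoencoder reconstruction, guidance from the expected content embedding, and an adversarial speaker term) might be combined into one content-encoder objective. The abstract does not give the loss functions, so everything specific here is an assumption for illustration: the L1 distances, the sign-flipped cross-entropy used as the adversarial term, the weights `w_guide` and `w_adv`, and all function names are hypothetical and not taken from the paper.

```python
import math


def l1_distance(a, b):
    """Mean absolute difference between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cross_entropy(logits, target):
    """Softmax cross-entropy for one example; target is a class index."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def encoder_loss(recon, mel, est_content, exp_content, spk_logits, spk_id,
                 w_guide=1.0, w_adv=0.1):
    """Hypothetical content-encoder objective (assumed form, not the paper's).

    recon / mel         : reconstructed and target spectrogram values
    est_content         : content embedding estimated from speech
    exp_content         : expected content embedding derived from text
    spk_logits / spk_id : speaker classifier output on est_content, true speaker
    """
    recon_term = l1_distance(recon, mel)
    guide_term = l1_distance(est_content, exp_content)
    # Assumed adversarial term: the encoder is rewarded when the speaker
    # classifier does badly, so its cross-entropy enters with a minus sign.
    adv_term = cross_entropy(spk_logits, spk_id)
    return recon_term + w_guide * guide_term - w_adv * adv_term


loss = encoder_loss(
    recon=[0.1, 0.4], mel=[0.0, 0.5],
    est_content=[1.0, 2.0], exp_content=[1.0, 2.5],
    spk_logits=[0.0, 0.0], spk_id=0,
)
print(loss)
```

In a setup like this, the speaker classifier itself would be trained in a separate step to minimise the same cross-entropy, which is what makes the two objectives adversarial.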

