Cotatron: Transcription-Guided Speech Encoder for Any-to-Many Voice Conversion without Parallel Data

05/07/2020
by   Seung-won Park, et al.
0

We propose Cotatron, a transcription-guided speech encoder for speaker-independent linguistic representation. Cotatron is based on the multispeaker TTS architecture and can be trained with conventional TTS datasets. We train a voice conversion system to reconstruct speech with Cotatron features, which is similar to the previous methods based on Phonetic Posteriorgram (PPG). By training and evaluating our system with 108 speakers from the VCTK dataset, we outperform the previous method in terms of both naturalness and speaker similarity. Our system can also convert speech from speakers that are unseen during training, and utilize ASR to automate the transcription with minimal reduction of the performance. Audio samples are available at https://mindslab-ai.github.io/cotatron, and the code with a pre-trained model will be made available soon.

READ FULL TEXT
research
04/02/2021

Assem-VC: Realistic Voice Conversion by Assembling Modern Speech Synthesis Techniques

In this paper, we pose the current state-of-the-art voice conversion (VC...
research
04/18/2019

TTS Skins: Speaker Conversion via ASR

We present a fully convolutional wav-to-wav network for converting betwe...
research
02/22/2022

DRVC: A Framework of Any-to-Any Voice Conversion with Self-Supervised Learning

Any-to-any voice conversion problem aims to convert voices for source an...
research
12/11/2019

Voice Conversion for Whispered Speech Synthesis

We present an approach to synthesize whisper by applying a handcrafted s...
research
02/06/2023

Autodecompose: A generative self-supervised model for semantic decomposition

We introduce Autodecompose, a novel self-supervised generative model tha...
research
03/31/2022

DeepFry: Identifying Vocal Fry Using Deep Neural Networks

Vocal fry or creaky voice refers to a voice quality characterized by irr...
research
01/09/2023

Introducing Model Inversion Attacks on Automatic Speaker Recognition

Model inversion (MI) attacks allow to reconstruct average per-class repr...

Please sign up or login with your details

Forgot password? Click here to reset