Voice Conversion by Cascading Automatic Speech Recognition and Text-to-Speech Synthesis with Prosody Transfer

09/03/2020
by   Jing-Xuan Zhang, et al.
0

With the development of automatic speech recognition (ASR) and text-to-speech synthesis (TTS) technique, it's intuitive to construct a voice conversion system by cascading an ASR and TTS system. In this paper, we present a ASR-TTS method for voice conversion, which used iFLYTEK ASR engine to transcribe the source speech into text and a Transformer TTS model with WaveNet vocoder to synthesize the converted speech from the decoded text. For the TTS model, we proposed to use a prosody code to describe the prosody information other than text and speaker information contained in speech. A prosody encoder is used to extract the prosody code. During conversion, the source prosody is transferred to converted speech by conditioning the Transformer TTS model with its code. Experiments were conducted to demonstrate the effectiveness of our proposed method. Our system also obtained the best naturalness and similarity in the mono-lingual task of Voice Conversion Challenge 2020.

READ FULL TEXT
research
03/29/2022

A single speaker is almost all you need for automatic speech recognition

We explore the use of speech synthesis and voice conversion applied to a...
research
07/20/2021

On Prosody Modeling for ASR+TTS based Voice Conversion

In voice conversion (VC), an approach showing promising results in the l...
research
10/06/2020

The Sequence-to-Sequence Baseline for the Voice Conversion Challenge 2020: Cascading ASR and TTS

This paper presents the sequence-to-sequence (seq2seq) baseline system f...
research
03/11/2019

Singing voice conversion with non-parallel data

Singing voice conversion is a task to convert a song sang by a source si...
research
02/16/2023

QuickVC: Any-to-many Voice Conversion Using Inverse Short-time Fourier Transform for Faster Conversion

With the development of automatic speech recognition (ASR) and text-to-s...
research
03/24/2022

Disentangleing Content and Fine-grained Prosody Information via Hybrid ASR Bottleneck Features for Voice Conversion

Non-parallel data voice conversion (VC) have achieved considerable break...
research
12/21/2021

Voice Quality and Pitch Features in Transformer-Based Speech Recognition

Jitter and shimmer measurements have shown to be carriers of voice quali...

Please sign up or login with your details

Forgot password? Click here to reset