Text-free non-parallel many-to-many voice conversion using normalising flows

03/15/2022
by   Thomas Merritt, et al.
0

Non-parallel voice conversion (VC) is typically achieved using lossy representations of the source speech. However, ensuring only speaker identity information is dropped whilst all other information from the source speech is retained is a large challenge. This is particularly challenging in the scenario where at inference-time we have no knowledge of the text being read, i.e., text-free VC. To mitigate this, we investigate information-preserving VC approaches. Normalising flows have gained attention for text-to-speech synthesis, however have been under-explored for VC. Flows utilize invertible functions to learn the likelihood of the data, thus provide a lossless encoding of speech. We investigate normalising flows for VC in both text-conditioned and text-free scenarios. Furthermore, for text-free VC we compare pre-trained and jointly-learnt priors. Flow-based VC evaluations show no degradation between text-free and text-conditioned VC, resulting in improvements over the state-of-the-art. Also, joint-training of the prior is found to negatively impact text-free VC quality.

READ FULL TEXT

page 2

page 3

research
02/11/2019

A Vocoder-free WaveNet Voice Conversion with Non-Parallel Data

In a typical voice conversion system, vocoder is commonly used for speec...
research
03/29/2019

Joint training framework for text-to-speech and voice conversion using multi-source Tacotron and WaveNet

We investigated the training of a shared model for both text-to-speech (...
research
08/08/2022

TGAVC: Improving Autoencoder Voice Conversion with Text-Guided and Adversarial Training

Non-parallel many-to-many voice conversion remains an interesting but ch...
research
05/10/2021

MASS: Multi-task Anthropomorphic Speech Synthesis Framework

Text-to-Speech (TTS) synthesis plays an important role in human-computer...
research
06/11/2020

FastPitch: Parallel Text-to-speech with Pitch Prediction

We present FastPitch, a fully-parallel text-to-speech model based on Fas...
research
09/09/2019

Evaluating Long-form Text-to-Speech: Comparing the Ratings of Sentences and Paragraphs

Text-to-speech systems are typically evaluated on single sentences. When...

Please sign up or login with your details

Forgot password? Click here to reset