A unified one-shot prosody and speaker conversion system with self-supervised discrete speech units

11/12/2022
by   Li-Wei Chen, et al.
0

We present a unified system to realize one-shot voice conversion (VC) on the pitch, rhythm, and speaker attributes. Existing works generally ignore the correlation between prosody and language content, leading to the degradation of naturalness in converted speech. Additionally, the lack of proper language features prevents these systems from accurately preserving language content after conversion. To address these issues, we devise a cascaded modular system leveraging self-supervised discrete speech units as language representation. These discrete units provide duration information essential for rhythm modeling. Our system first extracts utterance-level prosody and speaker representations from the raw waveform. Given the prosody representation, a prosody predictor estimates pitch, energy, and duration for each discrete unit in the utterance. A synthesizer further reconstructs speech based on the predicted prosody, speaker representation, and discrete units. Experiments show that our system outperforms previous approaches in naturalness, intelligibility, speaker transferability, and prosody transferability. Code and samples are publicly available.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
11/03/2021

A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion

The goal of voice conversion is to transform source speech into a target...
research
12/19/2022

Speaking Style Conversion With Discrete Self-Supervised Units

Voice Conversion (VC) is the task of making a spoken utterance by one sp...
research
11/14/2021

Textless Speech Emotion Conversion using Decomposed and Discrete Representations

Speech emotion conversion is the task of modifying the perceived emotion...
research
06/18/2021

VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion

One-shot voice conversion (VC), which performs conversion across arbitra...
research
01/02/2023

Analysing Discrete Self Supervised Speech Representation for Spoken Language Modeling

This work profoundly analyzes discrete self-supervised speech representa...
research
03/03/2023

WESPER: Zero-shot and Realtime Whisper to Normal Voice Conversion for Whisper-based Speech Interactions

Recognizing whispered speech and converting it to normal speech creates ...
research
03/28/2022

Analyzing Language-Independent Speaker Anonymization Framework under Unseen Conditions

In our previous work, we proposed a language-independent speaker anonymi...

Please sign up or login with your details

Forgot password? Click here to reset