Stylebook: Content-Dependent Speaking Style Modeling for Any-to-Any Voice Conversion using Only Speech Data

09/06/2023
by Hyungseob Lim, et al.

While many recent any-to-any voice conversion models succeed in transferring some of the target speech's style information to the converted speech, they still lack the ability to faithfully reproduce the speaking style of the target speaker. In this work, we propose a novel method to extract rich style information from target utterances and to efficiently transfer it to source speech content without requiring text transcriptions or speaker labels. Our proposed approach introduces an attention mechanism that uses a self-supervised learning (SSL) model to collect the speaking styles of a target speaker, each corresponding to different phonetic content. The styles are represented by a set of embeddings called a stylebook. In the next step, the stylebook is attended to using the source speech's phonetic content to determine the final target style for each piece of source content. Finally, the content information extracted from the source speech and the content-dependent target style embeddings are fed into a diffusion-based decoder to generate the mel-spectrogram of the converted speech. Experimental results show that our proposed method, combined with a diffusion-based generative model, achieves better speaker similarity in any-to-any voice conversion tasks than baseline models, while suppressing the growth in computational complexity for longer utterances.
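
The two attention steps described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not give architectural details, so the learnable `prototypes`, the single-head scaled dot-product attention, all dimensions, and the function names `build_stylebook` and `content_dependent_style` are assumptions made for illustration. Random arrays stand in for SSL content features and style features.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(queries, keys, values):
    # Single-head scaled dot-product attention (assumed, not from the paper).
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    return softmax(scores, axis=-1) @ values


def build_stylebook(prototypes, target_content, target_style):
    # Step 1 (hypothetical form): K content prototypes query the target
    # utterance, so each stylebook slot collects the style of the target
    # frames whose phonetic content matches that prototype.
    # Shapes: (K, d), (T_tgt, d), (T_tgt, d_s) -> (K, d_s)
    return attend(prototypes, target_content, target_style)


def content_dependent_style(source_content, prototypes, stylebook):
    # Step 2 (hypothetical form): each source frame queries the stylebook
    # through the same prototypes to pick a style for its own content.
    # Shapes: (T_src, d), (K, d), (K, d_s) -> (T_src, d_s)
    return attend(source_content, prototypes, stylebook)


rng = np.random.default_rng(0)
K, d, d_s = 8, 16, 4                        # illustrative sizes only
prototypes = rng.normal(size=(K, d))        # would be learned in practice
target_content = rng.normal(size=(50, d))   # stand-in for SSL features
target_style = rng.normal(size=(50, d_s))   # stand-in for style features
source_content = rng.normal(size=(30, d))   # stand-in for SSL features

stylebook = build_stylebook(prototypes, target_content, target_style)
frame_styles = content_dependent_style(source_content, prototypes, stylebook)
print(stylebook.shape, frame_styles.shape)
```

In this sketch the stylebook always has `K` rows regardless of the target utterance length, so the second attention step costs on the order of `T_src * K` rather than `T_src * T_tgt`. That is one plausible reading of the abstract's claim about suppressed complexity growth, not a confirmed detail of the method. The per-frame style embeddings would then be passed, together with the source content features, to the decoder.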

research
07/13/2022

Subband-based Generative Adversarial Network for Non-parallel Many-to-many Voice Conversion

Voice conversion is to generate a new speech with the source content and...
research
02/22/2022

DRVC: A Framework of Any-to-Any Voice Conversion with Self-Supervised Learning

Any-to-any voice conversion problem aims to convert voices for source an...
research
09/06/2023

Highly Controllable Diffusion-based Any-to-Any Voice Conversion Model with Frame-level Prosody Feature

We propose a highly controllable voice manipulation system that can perf...
research
03/30/2022

Learn2Sing 2.0: Diffusion and Mutual Information-Based Target Speaker SVS by Learning from Singing Teacher

Building a high-quality singing corpus for a person who is not good at s...
research
12/19/2022

Speaking Style Conversion With Discrete Self-Supervised Units

Voice Conversion (VC) is the task of making a spoken utterance by one sp...
research
10/25/2022

MetaSpeech: Speech Effects Switch Along with Environment for Metaverse

Metaverse expands the physical world to a new dimension, and the physica...
research
01/13/2021

Whispered and Lombard Neural Speech Synthesis

It is desirable for a text-to-speech system to take into account the env...
