A Hierarchical Speaker Representation Framework for One-shot Singing Voice Conversion

06/28/2022
by Xu Li, et al.

Singing voice conversion (SVC) typically depends on a single embedding vector, extracted from either a speaker lookup table (LUT) or a speaker recognition network (SRN), to model speaker identity. However, singing contains more expressive speaker characteristics than conversational speech. A single embedding vector is suspected to capture only averaged, coarse-grained speaker characteristics, which is insufficient for the SVC task. To address this, this work proposes a novel hierarchical speaker representation framework for SVC that captures fine-grained speaker characteristics at different granularities. The framework consists of an up-sampling stream and three down-sampling streams. The up-sampling stream transforms the linguistic features into audio samples, while one of the three down-sampling streams operates in the reverse direction. The temporal statistics of each down-sampling block are expected to represent speaker characteristics at a different granularity, and they are fed into the up-sampling blocks to enhance speaker modeling. Experimental results verify that the proposed method outperforms both the LUT-based and SRN-based SVC systems. Moreover, the proposed system supports one-shot SVC with only a few seconds of reference audio.
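The idea of taking temporal statistics at several down-sampling levels can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the block design or which statistics are used, so the sketch assumes average pooling by a factor of 2 as a stand-in for a learned down-sampling block and per-channel mean and standard deviation as the statistics. The function names `downsample`, `temporal_stats`, and `hierarchical_speaker_stats` are hypothetical.

```python
import math


def downsample(frames, factor=2):
    """Average-pool consecutive frames along time.

    Assumption: a stand-in for a learned down-sampling block.
    `frames` is a list of time steps, each a list of channel values.
    """
    pooled = []
    for start in range(0, len(frames) - factor + 1, factor):
        chunk = frames[start:start + factor]
        pooled.append([sum(col) / factor for col in zip(*chunk)])
    return pooled


def temporal_stats(frames):
    """Return per-channel means followed by per-channel standard deviations,
    both computed over the time axis (assumed choice of statistics)."""
    n = len(frames)
    channels = list(zip(*frames))
    means = [sum(col) / n for col in channels]
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in col) / n)
        for col, m in zip(channels, means)
    ]
    return means + stds


def hierarchical_speaker_stats(frames, num_levels=3):
    """Collect one statistics vector per down-sampling level."""
    stats = []
    current = frames
    for _ in range(num_levels):
        current = downsample(current)
        stats.append(temporal_stats(current))
    return stats


# Toy reference "audio features": 8 time steps, 2 channels.
reference = [[float(t), 1.0] for t in range(8)]
levels = hierarchical_speaker_stats(reference)
```

Each level yields a vector whose length depends only on the number of channels, not on the duration of the reference, which is one plausible reason such statistics suit a one-shot setting with only a few seconds of reference audio.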

research
11/17/2020

Optimizing voice conversion network with cycle consistency loss of speaker identity

We propose a novel training scheme to optimize voice conversion network ...
research
06/07/2020

VQVC+: One-Shot Voice Conversion by Vector Quantization and U-Net architecture

Voice conversion (VC) is a task that transforms the source speaker's tim...
research
06/24/2022

Prosody Cloning in Zero-Shot Multispeaker Text-to-Speech

The cloning of a speaker's voice using an untranscribed reference sample...
research
04/30/2020

CopyCat: Many-to-Many Fine-Grained Prosody Transfer for Neural Text-to-Speech

Prosody Transfer (PT) is a technique that aims to use the prosody from a...
research
11/23/2022

Voice-preserving Zero-shot Multiple Accent Conversion

Most people who have tried to learn a foreign language would have experi...
research
06/28/2022

Speaker Verification in Multi-Speaker Environments Using Temporal Feature Fusion

Verifying the identity of a speaker is crucial in modern human-machine i...
research
07/17/2023

Vocoder drift compensation by x-vector alignment in speaker anonymisation

For the most popular x-vector-based approaches to speaker anonymisation,...
