Face-Driven Zero-Shot Voice Conversion with Memory-based Face-Voice Alignment

09/18/2023
by   Zheng-Yan Sheng, et al.
0

This paper presents a novel task, zero-shot voice conversion based on face images (zero-shot FaceVC), which aims at converting the voice characteristics of an utterance from any source speaker to a newly coming target speaker, solely relying on a single face image of the target speaker. To address this task, we propose a face-voice memory-based zero-shot FaceVC method. This method leverages a memory-based face-voice alignment module, in which slots act as the bridge to align these two modalities, allowing for the capture of voice characteristics from face images. A mixed supervision strategy is also introduced to mitigate the long-standing issue of the inconsistency between training and inference phases for voice conversion tasks. To obtain speaker-independent content-related representations, we transfer the knowledge from a pretrained zero-shot voice conversion model to our zero-shot FaceVC model. Considering the differences between FaceVC and traditional voice conversion tasks, systematic subjective and objective metrics are designed to thoroughly evaluate the homogeneity, diversity and consistency of voice characteristics controlled by face images. Through extensive experiments, we demonstrate the superiority of our proposed method on the zero-shot FaceVC task. Samples are presented on our demo website.

READ FULL TEXT

page 7

page 8

research
03/18/2022

DGC-vector: A new speaker embedding for zero-shot voice conversion

Recently, more and more zero-shot voice conversion algorithms have been ...
research
05/09/2023

Zero-shot personalized lip-to-speech synthesis with face image based voice control

Lip-to-Speech (Lip2Speech) synthesis, which predicts corresponding speec...
research
11/06/2021

SIG-VC: A Speaker Information Guided Zero-shot Voice Conversion System for Both Human Beings and Machines

Nowadays, as more and more systems achieve good performance in tradition...
research
05/31/2021

StarGAN-ZSVC: Towards Zero-Shot Voice Conversion in Low-Resource Contexts

Voice conversion is the task of converting a spoken utterance from a sou...
research
10/27/2021

Zero-shot Voice Conversion via Self-supervised Prosody Representation Learning

Voice Conversion (VC) for unseen speakers, also known as zero-shot VC, i...
research
04/13/2021

NoiseVC: Towards High Quality Zero-Shot Voice Conversion

Voice conversion (VC) is a task that transforms voice from target audio ...
research
06/23/2021

Zero-Shot Joint Modeling of Multiple Spoken-Text-Style Conversion Tasks using Switching Tokens

In this paper, we propose a novel spoken-text-style conversion method th...

Please sign up or login with your details

Forgot password? Click here to reset