Zero-shot personalized lip-to-speech synthesis with face-image-based voice control

05/09/2023
by   Zheng-Yan Sheng, et al.

Lip-to-Speech (Lip2Speech) synthesis, which predicts the corresponding speech from talking-face images, has seen significant progress across a series of independent studies exploring various models and training strategies. However, existing studies cannot achieve voice control under the zero-shot condition, because extra speaker embeddings must be extracted from natural reference speech, which is unavailable when only the silent video of an unseen speaker is given. In this paper, we propose a zero-shot personalized Lip2Speech synthesis method in which face images control speaker identity. A variational autoencoder is adopted to disentangle the speaker identity and linguistic content representations, which enables the speaker embeddings to control the voice characteristics of synthetic speech for unseen speakers. Furthermore, we propose associated cross-modal representation learning to improve the ability of face-based speaker embeddings (FSE) to control the voice. Extensive experiments verify the effectiveness of the proposed method, whose synthetic utterances are more natural and better match the personality of the input video than those of the compared methods. To the best of our knowledge, this paper makes the first attempt at zero-shot personalized Lip2Speech synthesis with a face image rather than reference audio controlling voice characteristics.
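The abstract describes a two-branch design: a face-derived speaker embedding controls voice identity, while a variational autoencoder supplies a disentangled linguistic-content representation via the standard reparameterization trick; the two are combined to condition the speech decoder. Below is a minimal NumPy sketch of that conditioning idea. All function names, dimensions, and the linear face encoder are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def face_speaker_embedding(face_img, W):
    # Hypothetical face encoder: project a flattened face image to a
    # fixed-size speaker embedding (a stand-in for the paper's FSE).
    return np.tanh(face_img.reshape(-1) @ W)

def reparameterize(mu, log_var, rng):
    # VAE reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I).
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

# Toy dimensions (assumptions for illustration only).
face = rng.standard_normal((16, 16))   # silent-video face frame
W = rng.standard_normal((256, 64))     # face-encoder projection

spk = face_speaker_embedding(face, W)  # speaker-identity branch, (64,)
mu, log_var = rng.standard_normal((2, 32))
content = reparameterize(mu, log_var, rng)  # content branch, (32,)

# Condition every content frame on the same speaker embedding by
# broadcasting it over time and concatenating before decoding.
T = 10  # number of decoder frames
decoder_input = np.concatenate(
    [np.tile(content, (T, 1)), np.tile(spk, (T, 1))], axis=1
)
print(decoder_input.shape)  # (10, 96)
```

Because the speaker embedding comes from a face image rather than reference audio, the same conditioning path works for unseen speakers at inference time, which is what enables the zero-shot setting.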

Related research

09/18/2023
Face-Driven Zero-Shot Voice Conversion with Memory-based Face-Voice Alignment
This paper presents a novel task, zero-shot voice conversion based on fa...

02/27/2023
Imaginary Voice: Face-styled Diffusion Model for Text-to-Speech
The goal of this work is zero-shot text-to-speech synthesis, with speaki...

02/22/2022
nnSpeech: Speaker-Guided Conditional Variational Autoencoder for Zero-shot Multi-speaker Text-to-Speech
Multi-speaker text-to-speech (TTS) using a few adaption data is a challe...

03/30/2022
Robust Disentangled Variational Speech Representation Learning for Zero-shot Voice Conversion
Traditional studies on voice conversion (VC) have made progress with par...

02/18/2022
VCVTS: Multi-speaker Video-to-Speech synthesis via cross-modal knowledge transfer from voice conversion
Though significant progress has been made for speaker-dependent Video-to...

03/31/2022
WavThruVec: Latent speech representation as intermediate features for neural speech synthesis
Recent advances in neural text-to-speech research have been dominated by...

04/30/2023
StyleLipSync: Style-based Personalized Lip-sync Video Generation
In this paper, we present StyleLipSync, a style-based personalized lip-s...
