Imaginary Voice: Face-styled Diffusion Model for Text-to-Speech

02/27/2023
by   Jiyoung Lee, et al.
The goal of this work is zero-shot text-to-speech synthesis, with speaking styles and voices learnt from facial characteristics. Inspired by the observation that people can imagine someone's voice when they look at that person's face, we introduce Face-TTS, a face-styled diffusion text-to-speech (TTS) model within a unified framework learnt from visible attributes. This is the first time that face images are used as a condition to train a TTS model. We jointly train cross-modal biometrics and TTS models to preserve speaker identity between face images and generated speech segments. We also propose a speaker feature binding loss to enforce the similarity of the generated and the ground-truth speech segments in speaker embedding space. Since the biometric information is extracted directly from the face image, our method does not require extra fine-tuning steps to generate speech for unseen and unheard speakers. We train and evaluate the model on the LRS3 dataset, an in-the-wild audio-visual corpus containing background noise and diverse speaking styles. The project page is https://facetts.github.io.
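
The abstract does not give the form of the speaker feature binding loss, so the following is only an illustrative sketch of the general idea: penalise the distance between the speaker embedding of the generated speech and that of the ground-truth speech. The choice of cosine distance, the function names, and the plain-list embeddings are all assumptions made for this example, not the authors' formulation.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def speaker_binding_loss(generated_emb, reference_emb):
    """Hypothetical binding loss: cosine distance in speaker embedding space.

    generated_emb: embedding of the synthesised speech (assumed to come
                   from some speaker encoder, not shown here).
    reference_emb: embedding of the ground-truth speech from the same encoder.
    Returns 0 when the embeddings point the same way, up to 2 when opposite.
    """
    return 1.0 - cosine_similarity(generated_emb, reference_emb)
```

In this sketch, a loss near zero means the generated segment and the ground-truth segment map to nearly the same direction in embedding space, which is the property the abstract says the loss is meant to enforce.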
