Residual-guided Personalized Speech Synthesis based on Face Image

04/01/2022
by Jianrong Wang, et al.

Previous works derive personalized speech features by training a model on a large dataset of the target speaker's audio. Face information has been reported to correlate strongly with speech sound. In this work, we therefore extract personalized speech features from human faces and synthesize personalized speech with a neural vocoder. We design a Face-based Residual Personalized Speech Synthesis Model (FR-PSS) for personalized speech synthesis (PSS), consisting of a speech encoder, a speech synthesizer and a face encoder. Using two speech priors, the model introduces a residual-guided strategy that steers the face feature toward the true speech feature during training. Moreover, to account for both the error in the features' absolute values and their directional bias, we formulate a novel tri-item loss function for the face encoder. Experimental results show that the speech synthesized by our model is comparable to the personalized speech that previous works synthesize by training on large amounts of audio data.
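The abstract does not spell out the three terms of the tri-item loss, so the following is only an illustrative sketch of how a loss could penalize both absolute-value error and directional bias between a face feature and a speech feature. The choice of terms (mean absolute error, mean squared error, cosine distance), the function name `tri_term_loss`, and the weights are all assumptions made for this example, not the formulation used in FR-PSS.

```python
import math
from typing import Sequence


def tri_term_loss(
    face_feat: Sequence[float],
    speech_feat: Sequence[float],
    w_abs: float = 1.0,   # hypothetical weight for the absolute-error term
    w_sq: float = 1.0,    # hypothetical weight for the squared-error term
    w_dir: float = 1.0,   # hypothetical weight for the directional term
    eps: float = 1e-8,
) -> float:
    """Illustrative three-term loss between two feature vectors.

    Assumed terms (not taken from the paper):
      1. mean absolute error  -> magnitude of the value error
      2. mean squared error   -> penalizes large value errors more strongly
      3. cosine distance      -> directional mismatch between the vectors
    """
    if len(face_feat) != len(speech_feat) or len(face_feat) == 0:
        raise ValueError("feature vectors must be non-empty and equal in length")

    n = len(face_feat)
    diffs = [f - s for f, s in zip(face_feat, speech_feat)]

    mae = sum(abs(d) for d in diffs) / n
    mse = sum(d * d for d in diffs) / n

    dot = sum(f * s for f, s in zip(face_feat, speech_feat))
    norm_f = math.sqrt(sum(f * f for f in face_feat))
    norm_s = math.sqrt(sum(s * s for s in speech_feat))
    cos_dist = 1.0 - dot / (norm_f * norm_s + eps)

    return w_abs * mae + w_sq * mse + w_dir * cos_dist
```

In this sketch, two vectors that point in the same direction but differ in scale are penalized only by the first two terms, while vectors of similar magnitude that point in different directions are penalized mainly by the cosine term.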

research
07/31/2023

Audio-visual video-to-speech synthesis with synthesized input audio

Video-to-speech synthesis involves reconstructing the speech signal of a...
research
07/09/2020

Attention-based Residual Speech Portrait Model for Speech to Face Generation

Given a speaker's speech, it is interesting to see if it is possible to ...
research
10/14/2022

Empirical Study Incorporating Linguistic Knowledge on Filled Pauses for Personalized Spontaneous Speech Synthesis

We present a comprehensive empirical study for personalized spontaneous ...
research
03/23/2022

Towards Expressive Speaking Style Modelling with Hierarchical Context Information for Mandarin Speech Synthesis

Previous works on expressive speech synthesis mainly focus on current se...
research
06/28/2023

UnitSpeech: Speaker-adaptive Speech Synthesis with Untranscribed Data

We propose UnitSpeech, a speaker-adaptive speech synthesis method that f...
research
06/17/2021

PixInWav: Residual Steganography for Hiding Pixels in Audio

Steganography comprises the mechanics of hiding data in a host media tha...
research
08/23/2020

Geometry-guided Dense Perspective Network for Speech-Driven Facial Animation

Realistic speech-driven 3D facial animation is a challenging problem due...