Crossmodal Voice Conversion

04/09/2019
by Hirokazu Kameoka, et al.

Humans can imagine a person's voice from that person's appearance, and a person's appearance from their voice. In this paper, we make the first attempt to develop a method that exploits the correlation between faces and voices to (i) convert speech into a voice that matches an input face image and (ii) generate a face image that matches the voice of the input speech. We propose a model consisting of a speech converter, a face encoder/decoder, and a voice encoder. The face encoder maps an input face image to a latent code, which is fed to the speech converter as an auxiliary input; the speech converter is trained so that the voice encoder can recover the original latent code from the generated speech. We also train the face decoder jointly with the face encoder to ensure that the latent code contains sufficient information to reconstruct the input face image. We confirmed experimentally that a speech converter trained in this way could convert input speech into a voice that matched an input face image, and that the voice encoder and face decoder could be used to generate a face image that matches the voice of the input speech.
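The data flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the four components are replaced by random linear maps, and the dimensions, the mean-pooling over time, the squared-error losses, and all function names are assumptions made only for this example. The abstract does not specify any of them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
FACE_DIM, SPEECH_DIM, LATENT_DIM = 64, 40, 8

# Random linear stand-ins for the four networks named in the abstract.
W_face_enc = rng.normal(scale=0.1, size=(LATENT_DIM, FACE_DIM))
W_face_dec = rng.normal(scale=0.1, size=(FACE_DIM, LATENT_DIM))
W_conv = rng.normal(scale=0.1, size=(SPEECH_DIM, SPEECH_DIM + LATENT_DIM))
W_voice_enc = rng.normal(scale=0.1, size=(LATENT_DIM, SPEECH_DIM))


def face_encoder(face):
    """Map a flattened face image to a latent code."""
    return W_face_enc @ face


def face_decoder(z):
    """Reconstruct a flattened face image from a latent code."""
    return W_face_dec @ z


def speech_converter(speech_frames, z):
    """Convert speech frames (T, SPEECH_DIM), conditioned on latent code z."""
    z_tiled = np.tile(z, (speech_frames.shape[0], 1))  # same code at every frame
    return np.concatenate([speech_frames, z_tiled], axis=1) @ W_conv.T


def voice_encoder(speech_frames):
    """Estimate a latent code from speech (assumed: mean-pool over time)."""
    return W_voice_enc @ speech_frames.mean(axis=0)


def training_losses(face, speech_frames):
    """Compute the two illustrative objectives for one (face, speech) pair."""
    z = face_encoder(face)
    converted = speech_converter(speech_frames, z)
    # The voice encoder should recover the face's latent code from the output speech.
    latent_loss = float(np.mean((z - voice_encoder(converted)) ** 2))
    # The latent code should carry enough information to reconstruct the face.
    face_loss = float(np.mean((face - face_decoder(z)) ** 2))
    return latent_loss, face_loss, converted


def face_from_voice(speech_frames):
    """Voice-to-face path: voice encoder followed by face decoder."""
    return face_decoder(voice_encoder(speech_frames))


face = rng.normal(size=FACE_DIM)
speech = rng.normal(size=(100, SPEECH_DIM))
latent_loss, face_loss, converted = training_losses(face, speech)
generated_face = face_from_voice(speech)
```

In a real system each stand-in would be a trainable network and the two losses would be minimized jointly; the sketch only shows which component feeds which.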


research
09/30/2020

Transfer Learning from Speech Synthesis to Voice Conversion with Non-Parallel Training Data

This paper presents a novel framework to build a voice conversion (VC) s...
research
07/16/2021

Controlled AutoEncoders to Generate Faces from Voices

Multiple studies in the past have shown that there is a strong correlati...
research
07/11/2021

A Deep-Bayesian Framework for Adaptive Speech Duration Modification

We propose the first method to adaptively modify the duration of a given...
research
04/18/2019

TTS Skins: Speaker Conversion via ASR

We present a fully convolutional wav-to-wav network for converting betwe...
research
05/23/2019

Speech2Face: Learning the Face Behind a Voice

How much can we infer about a person's looks from the way they speak? In...
research
09/01/2021

FaVoA: Face-Voice Association Favours Ambiguous Speaker Detection

The strong relation between face and voice can aid active speaker detect...
research
04/13/2020

From Inference to Generation: End-to-end Fully Self-supervised Generation of Human Face from Speech

This work seeks the possibility of generating the human face from voice ...
