Facetron: Multi-speaker Face-to-Speech Model based on Cross-modal Latent Representations

07/26/2021
by Se-Yun Um, et al.

In this paper, we propose an effective method for synthesizing speaker-specific speech waveforms conditioned on videos of an individual's face. Using a generative adversarial network (GAN) with linguistic and speaker characteristic features as auxiliary conditions, our method directly converts face images into speech waveforms in an end-to-end training framework. The linguistic features are extracted from lip movements using a lip-reading model, and the speaker characteristic features are predicted from face images through cross-modal learning with a pre-trained acoustic model. Since these two features are uncorrelated and can be controlled independently, we can flexibly synthesize speech waveforms whose speaker characteristics vary with the input face images. Our method can therefore be regarded as a multi-speaker face-to-speech waveform model. We show that the proposed model outperforms conventional methods in both objective and subjective evaluations. Specifically, we evaluate the linguistic feature and speaker characteristic generation modules by measuring the accuracy of automatic speech recognition and automatic speaker/gender recognition, respectively, and we evaluate the naturalness of the synthesized speech waveforms with a mean opinion score (MOS) test.
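
To make the conditioning scheme concrete, below is a minimal PyTorch sketch of a generator that receives frame-level linguistic features (from a lip-reading front end) and a time-invariant speaker embedding (predicted from face images) as two independent auxiliary conditions and upsamples them to a waveform. All module names, dimensions, and upsampling factors here (e.g. FaceToSpeechGenerator, ling_dim, spk_dim) are illustrative assumptions, not the authors' actual architecture.

```python
# Minimal sketch of GAN conditioning on linguistic + speaker features.
# All names, dimensions, and upsampling factors are illustrative assumptions.
import torch
import torch.nn as nn

class FaceToSpeechGenerator(nn.Module):
    """Generates a waveform from frame-level linguistic features (lip reading)
    and a speaker embedding predicted from face images, combined as two
    independently controllable auxiliary conditions."""
    def __init__(self, ling_dim=256, spk_dim=128, hidden=512):
        super().__init__()
        # Project each condition stream to a common channel size.
        self.ling_proj = nn.Conv1d(ling_dim, hidden, kernel_size=1)
        self.spk_proj = nn.Linear(spk_dim, hidden)
        # Transposed convolutions upsample frame-rate features toward sample rate.
        self.upsample = nn.Sequential(
            nn.ConvTranspose1d(hidden, hidden // 2, kernel_size=16, stride=8, padding=4),
            nn.LeakyReLU(0.2),
            nn.ConvTranspose1d(hidden // 2, hidden // 4, kernel_size=16, stride=8, padding=4),
            nn.LeakyReLU(0.2),
            nn.Conv1d(hidden // 4, 1, kernel_size=7, padding=3),
            nn.Tanh(),
        )

    def forward(self, ling_feats, spk_embed):
        # ling_feats: (batch, ling_dim, frames) from a lip-reading front end
        # spk_embed:  (batch, spk_dim) predicted from face images
        h = self.ling_proj(ling_feats)
        # Broadcast the time-invariant speaker embedding over all frames,
        # keeping the two condition streams independently controllable.
        h = h + self.spk_proj(spk_embed).unsqueeze(-1)
        return self.upsample(h)  # (batch, 1, samples)

# Example: swap speaker embeddings to change voice identity for the same lips.
gen = FaceToSpeechGenerator()
ling = torch.randn(2, 256, 100)   # dummy linguistic features, 100 video frames
spk = torch.randn(2, 128)         # dummy speaker embeddings
wav = gen(ling, spk)              # (2, 1, 6400) waveform samples
```

Because the speaker embedding enters only as an additive, time-broadcast bias, swapping it changes the voice identity while leaving the linguistic content path untouched, mirroring the independent control of the two feature streams described above.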

Related research

03/25/2019 · Wav2Pix: Speech-conditioned Face Generation using Generative Adversarial Networks
Speech is a rich biometric signal that contains information about the id...

03/29/2022 · NeuraGen-A Low-Resource Neural Network based approach for Gender Classification
Human voice is the source of several important information. This is in t...

10/30/2021 · Speaker conditioning of acoustic models using affine transformation for multi-speaker speech recognition
This study addresses the problem of single-channel Automatic Speech Reco...

07/09/2020 · Attention-based Residual Speech Portrait Model for Speech to Face Generation
Given a speaker's speech, it is interesting to see if it is possible to ...

10/28/2022 · The Importance of Speech Stimuli for Pathologic Speech Classification
Current findings show that pre-trained wav2vec 2.0 models can be success...

04/26/2021 · Phrase break prediction with bidirectional encoder representations in Japanese text-to-speech synthesis
We propose a novel phrase break prediction method that combines implicit...

10/27/2020 · Parallel waveform synthesis based on generative adversarial networks with voicing-aware conditional discriminators
This paper proposes voicing-aware conditional discriminators for Paralle...
