Speech2Video: Cross-Modal Distillation for Speech to Video Generation

07/10/2021
by Shijing Si, et al.

This paper investigates a novel task: generating talking face videos solely from speech. Speech-to-video generation can enable interesting applications in the entertainment, customer service, and human–computer interaction industries. Indeed, the timbre, accent, and speed of speech can carry rich information about a speaker's appearance. The main challenge lies in disentangling the distinct visual attributes from the audio signal. In this article, we propose a lightweight, cross-modal distillation method to extract disentangled emotional and identity information from unlabelled video inputs. The extracted features are then integrated by a generative adversarial network into talking face video clips. With carefully crafted discriminators, the proposed framework achieves realistic generation results. Experiments with observed individuals demonstrate that the proposed framework captures emotional expressions solely from speech and produces spontaneous facial motion in the output video. Compared to the baseline method, in which speech is combined with a static image of the speaker, the results of the proposed framework are almost indistinguishable. User studies also show that the proposed method outperforms existing algorithms in terms of the emotional expressiveness of the generated videos.
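The abstract does not spell out the training objective, but the core idea of cross-modal distillation can be sketched as follows: a student audio encoder is trained so that its emotion and identity heads match embeddings produced by frozen teacher encoders that operate on video frames. The sketch below is purely illustrative and makes several assumptions not stated in the paper: the teachers are stood in for by fixed target vectors, the student is a pair of linear heads, the dimensions (`AUDIO_DIM`, `EMB_DIM`) are invented, and plain NumPy gradient descent replaces whatever optimizer the authors used.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): input audio feature
# size and the shared embedding size of teacher and student.
AUDIO_DIM, EMB_DIM = 80, 16

# Teacher side: pretrained visual encoders would emit disentangled
# emotion/identity embeddings from video frames; fixed random vectors
# stand in for those frozen teacher outputs here.
teacher_emotion = rng.normal(size=EMB_DIM)
teacher_identity = rng.normal(size=EMB_DIM)

# Student side: a linear audio encoder with two heads, one per factor.
W_emo = rng.normal(size=(EMB_DIM, AUDIO_DIM)) * 0.01
W_id = rng.normal(size=(EMB_DIM, AUDIO_DIM)) * 0.01

audio = rng.normal(size=AUDIO_DIM)  # e.g. one mel-spectrogram frame


def distill_loss(W_emo, W_id):
    # Sum of L2 distillation losses against the frozen teacher targets.
    e = W_emo @ audio - teacher_emotion
    i = W_id @ audio - teacher_identity
    return 0.5 * (e @ e + i @ i)


lr = 0.01
losses = []
for _ in range(200):
    # Closed-form gradients of the quadratic loss w.r.t. each head.
    g_emo = np.outer(W_emo @ audio - teacher_emotion, audio)
    g_id = np.outer(W_id @ audio - teacher_identity, audio)
    W_emo -= lr * g_emo
    W_id -= lr * g_id
    losses.append(distill_loss(W_emo, W_id))
```

After training, the distilled emotion and identity embeddings (here `W_emo @ audio` and `W_id @ audio`) would be passed to the GAN generator as conditioning; the adversarial stage is omitted from this sketch.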

Related research

01/08/2021
VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency
We introduce a new approach for audio-visual speech separation. Given a ...

12/01/2022
Audio-Visual Activity Guided Cross-Modal Identity Association for Active Speaker Detection
Active speaker detection in videos addresses associating a source face, ...

12/27/2021
Responsive Listening Head Generation: A Benchmark Dataset and Baseline
Responsive listening during face-to-face conversations is a critical ele...

04/27/2022
Talking Head Generation Driven by Speech-Related Facial Action Units and Audio-Based on Multimodal Representation Fusion
Talking head generation is to synthesize a lip-synchronized talking head...

03/13/2022
Depth-Aware Generative Adversarial Network for Talking Head Video Generation
Talking head video generation aims to produce a synthetic human face vid...

11/20/2022
Audio-visual video face hallucination with frequency supervision and cross modality support by speech based lip reading loss
Recently, there have been numerous breakthroughs in face hallucination ta...
