End-to-End Lip Synchronisation

05/18/2020
by You Jin Kim, et al.

The goal of this work is to synchronise the audio and video of a talking face using deep neural network models. Existing works train networks on proxy tasks such as cross-modal similarity learning, and then compute similarities between audio and video frames using a sliding-window approach. While these methods demonstrate satisfactory performance, the networks are not trained directly on the synchronisation task. To this end, we propose an end-to-end trained network that directly predicts the offset between an audio stream and the corresponding video stream. The similarity matrix between the two modalities is first computed from the features; inferring the offset can then be treated as a pattern recognition problem in which the matrix is regarded as an image. The feature extractor and the classifier are trained jointly. We demonstrate that the proposed approach outperforms previous work by a large margin on the LRS2 and LRS3 datasets.
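To make the similarity-matrix idea concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the feature extractors, the similarity measure, or the classifier, so this example assumes cosine similarity over hypothetical per-frame embeddings and replaces the learned classifier with a simple, non-learned stand-in: it picks the diagonal of the matrix with the highest mean similarity. The names `similarity_matrix` and `estimate_offset`, and the synthetic data, are illustrative only.

```python
import numpy as np


def similarity_matrix(video_feats, audio_feats):
    """Cosine similarity between every video frame and every audio frame."""
    v = video_feats / np.linalg.norm(video_feats, axis=1, keepdims=True)
    a = audio_feats / np.linalg.norm(audio_feats, axis=1, keepdims=True)
    return v @ a.T  # shape: (num_video_frames, num_audio_frames)


def estimate_offset(sim, max_offset):
    """Non-learned stand-in for the classifier (an assumption of this sketch):
    return the offset k whose diagonal sim[i, i + k] has the highest mean."""
    offsets = list(range(-max_offset, max_offset + 1))
    scores = [np.diagonal(sim, offset=k).mean() for k in offsets]
    return offsets[int(np.argmax(scores))]


# Synthetic, hypothetical embeddings shared by both modalities.
rng = np.random.default_rng(0)
base = rng.normal(size=(60, 16))

true_offset = 3
video = base[true_offset:true_offset + 50]  # video frame i matches base[i + 3]
audio = base[:50]                           # audio frame j matches base[j]

sim = similarity_matrix(video, audio)
estimated = estimate_offset(sim, max_offset=10)
print(estimated)  # 3 for this synthetic example
```

In this toy setting the matching frames form a bright off-diagonal stripe in `sim`, which is the kind of pattern an image-style classifier could learn to localise. In the approach described above, that classifier is trained jointly with the feature extractor rather than hand-coded as it is here.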

