Speech Prediction in Silent Videos using Variational Autoencoders

11/14/2020
by   Ravindra Yadav, et al.
0

Understanding the relationship between the auditory and visual signals is crucial for many different applications ranging from computer-generated imagery (CGI) and video editing automation to assisting people with hearing or visual impairments. However, this is challenging since the distribution of both audio and visual modality is inherently multimodal. Therefore, most of the existing methods ignore the multimodal aspect and assume that there only exists a deterministic one-to-one mapping between the two modalities. It can lead to low-quality predictions as the model collapses to optimizing the average behavior rather than learning the full data distributions. In this paper, we present a stochastic model for generating speech in a silent video. The proposed model combines recurrent neural networks and variational deep generative models to learn the auditory signal's conditional distribution given the visual signal. We demonstrate the performance of our model on the GRID dataset based on standard benchmarks.

READ FULL TEXT
research
07/05/2022

A survey of multimodal deep generative models

Multimodal learning is a framework for building models that make predict...
research
03/06/2016

Variational methods for Conditional Multimodal Deep Learning

In this paper, we address the problem of conditional modality learning, ...
research
06/14/2019

Video-Driven Speech Reconstruction using Generative Adversarial Networks

Speech is a means of communication which relies on both audio and visual...
research
12/11/2019

Multimodal Generative Models for Compositional Representation Learning

As deep neural networks become more adept at traditional tasks, many of ...
research
06/30/2019

Analyzing Utility of Visual Context in Multimodal Speech Recognition Under Noisy Conditions

Multimodal learning allows us to leverage information from multiple sour...
research
01/26/2018

Improving Bi-directional Generation between Different Modalities with Variational Autoencoders

We investigate deep generative models that can exchange multiple modalit...
research
01/27/2021

Self-Calibrating Active Binocular Vision via Active Efficient Coding with Deep Autoencoders

We present a model of the self-calibration of active binocular vision co...

Please sign up or login with your details

Forgot password? Click here to reset