Sound2Sight: Generating Visual Dynamics from Sound and Context

07/23/2020
by Anoop Cherian, et al.

Learning associations across modalities is critical for robust multimodal reasoning, especially when a modality may be missing during inference. In this paper, we study this problem in the context of audio-conditioned visual synthesis, a task that is important, for example, in occlusion reasoning. Specifically, our goal is to generate future video frames and their motion dynamics conditioned on audio and a few past frames. To tackle this problem, we present Sound2Sight, a deep variational framework that is trained to learn a per-frame stochastic prior conditioned on a joint embedding of audio and past frames. This embedding is learned via a multi-head attention-based audio-visual transformer encoder. The learned prior is then sampled to further condition a video forecasting module to generate future frames. The stochastic prior allows the model to sample multiple plausible futures that are consistent with the provided audio and the past context. Moreover, to improve the quality and coherence of the generated frames, we propose a multimodal discriminator that differentiates between a synthesized and a real audio-visual clip. We empirically evaluate our approach, vis-à-vis closely related prior methods, on two new datasets, viz. (i) Multimodal Stochastic Moving MNIST with a Surprise Obstacle and (ii) YouTube Paintings, as well as on the existing AudioSet Drums dataset. Our extensive experiments demonstrate that Sound2Sight significantly outperforms the state of the art in generated video quality, while also producing diverse video content.
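To make the pipeline concrete, the sketch below mirrors the three components the abstract names: a transformer encoder that fuses audio and past-frame features into a stochastic prior, a recurrent forecaster that rolls out future frame features conditioned on latents sampled from that prior, and a multimodal discriminator that scores audio-visual coherence. All class names, layer sizes, and wiring here are illustrative assumptions, not the authors' released implementation.

```python
# A minimal, hypothetical sketch of the Sound2Sight pipeline in PyTorch.
# Everything below (names, dimensions, wiring) is an assumption for
# illustration; it is not the paper's actual code.
import torch
import torch.nn as nn

class AudioVisualPrior(nn.Module):
    """Fuses audio and past-frame features with a multi-head attention
    transformer encoder, then predicts a Gaussian prior over a latent z."""
    def __init__(self, feat_dim=128, n_heads=4, n_layers=2, z_dim=32):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.to_mu = nn.Linear(feat_dim, z_dim)
        self.to_logvar = nn.Linear(feat_dim, z_dim)

    def forward(self, audio_feats, frame_feats):
        # audio_feats, frame_feats: (B, T, feat_dim)
        joint = self.encoder(torch.cat([audio_feats, frame_feats], dim=1))
        ctx = joint.mean(dim=1)                       # pooled joint embedding
        mu, logvar = self.to_mu(ctx), self.to_logvar(ctx)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        return z, mu, logvar

class FrameForecaster(nn.Module):
    """Rolls out future frame features one step at a time, conditioned on
    a sampled latent and the previous frame feature."""
    def __init__(self, feat_dim=128, z_dim=32, hidden=256):
        super().__init__()
        self.rnn = nn.GRUCell(feat_dim + z_dim, hidden)
        self.out = nn.Linear(hidden, feat_dim)

    def forward(self, prev_feat, z, h):
        h = self.rnn(torch.cat([prev_feat, z], dim=-1), h)
        return self.out(h), h

class MultimodalDiscriminator(nn.Module):
    """Scores whether a paired (video, audio) feature clip looks real,
    pushing synthesized frames to stay coherent with the audio."""
    def __init__(self, feat_dim=128, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden),
            nn.LeakyReLU(0.2),
            nn.Linear(hidden, 1))

    def forward(self, frame_feats, audio_feats):
        joint = torch.cat([frame_feats.mean(1), audio_feats.mean(1)], dim=-1)
        return self.net(joint)

# Toy rollout: sampling different latents yields multiple plausible futures.
B, T, D, H = 2, 8, 128, 256
audio, frames = torch.randn(B, T, D), torch.randn(B, T, D)
prior, forecaster = AudioVisualPrior(), FrameForecaster()
disc = MultimodalDiscriminator()
h, feat, future = torch.zeros(B, H), frames[:, -1], []
for _ in range(4):                                    # forecast four steps
    # a fresh z per step approximates the per-frame stochastic prior; the
    # full model would also fold generated frames back into the context
    z, _, _ = prior(audio, frames)
    feat, h = forecaster(feat, z, h)
    future.append(feat)
score = disc(torch.stack(future, dim=1), audio)
```

Training such a model would combine a reconstruction loss on forecast frames, a KL term between the learned prior and an inference posterior, and an adversarial term from the discriminator; the exact objective and feature extractors are design choices the abstract does not pin down.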


Related research

12/09/2022 · Motion and Context-Aware Audio-Visual Conditioned Video Prediction
Existing state-of-the-art method for audio-visual conditioned video pred...

04/06/2021 · Strumming to the Beat: Audio-Conditioned Contrastive Video Textures
We introduce a non-parametric approach for infinite video texture synthe...

10/19/2019 · Coordinated Joint Multimodal Embeddings for Generalized Audio-Visual Zeroshot Classification and Retrieval of Videos
We present an audio-visual multimodal approach for the task of zeroshot ...

12/07/2018 · An Attempt towards Interpretable Audio-Visual Video Captioning
Automatically generating a natural language sentence to describe the con...

08/23/2023 · An Initial Exploration: Learning to Generate Realistic Audio for Silent Video
Generating realistic audio effects for movies and other media is a chall...

03/04/2022 · Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning
Most methods for conditional video synthesis use a single modality as th...

12/31/2020 · A Multi-modal Deep Learning Model for Video Thumbnail Selection
Thumbnail is the face of online videos. The explosive growth of videos b...
