Sound-Guided Semantic Video Generation

04/20/2022
by   Seung-Hyun Lee, et al.

The recent success of StyleGAN demonstrates that a pre-trained StyleGAN latent space is useful for realistic video generation. However, the motion in generated videos is usually not semantically meaningful because it is difficult to determine the direction and magnitude of movement in the StyleGAN latent space. In this paper, we propose a framework that generates realistic videos by leveraging a multimodal (sound-image-text) embedding space. Since sound provides the temporal context of a scene, our framework learns to generate video that is semantically consistent with the sound. First, a sound inversion module maps the audio directly into the StyleGAN latent space. We then incorporate a CLIP-based multimodal embedding space to further capture audio-visual relationships. Finally, the proposed frame generator learns to find a trajectory in the latent space that is coherent with the corresponding sound and generates the video in a hierarchical manner. We also provide a new high-resolution landscape video dataset with paired audio for the sound-guided video generation task. Experiments show that our model outperforms state-of-the-art methods in terms of video quality. We further demonstrate several applications, including image and video editing, to verify the effectiveness of our method.
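The pipeline described above (sound inversion into the StyleGAN latent space, followed by a trajectory through that space, one latent per frame) can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the dimensions, the random linear projection standing in for the learned sound inversion module, and the linear interpolation standing in for the learned frame generator are all assumptions made for clarity.

```python
import numpy as np

# Assumed dimensions for illustration: StyleGAN2's W-space is 512-d;
# the 128-d audio feature size is a placeholder.
LATENT_DIM = 512
AUDIO_DIM = 128

rng = np.random.default_rng(0)

# Stand-in for the learned sound inversion module: a fixed random
# linear projection from audio features into W-space.
W_inv = rng.normal(scale=1.0 / np.sqrt(AUDIO_DIM), size=(AUDIO_DIM, LATENT_DIM))

def invert_sound(audio_feat):
    """Map an audio feature vector into the StyleGAN latent space."""
    return audio_feat @ W_inv

def latent_trajectory(w_start, w_sound, num_frames=16):
    """Walk from an initial latent toward the sound-conditioned latent,
    producing one latent code per video frame (linear interpolation
    stands in for the learned frame generator)."""
    alphas = np.linspace(0.0, 1.0, num_frames)[:, None]
    return (1.0 - alphas) * w_start + alphas * w_sound

w0 = rng.normal(size=LATENT_DIM)          # latent of the initial frame
w_audio = invert_sound(rng.normal(size=AUDIO_DIM))
traj = latent_trajectory(w0, w_audio, num_frames=16)

# Each row of `traj` would be decoded by a pretrained StyleGAN
# generator into one frame of the output video.
print(traj.shape)  # (16, 512)
```

In the actual method, both the inversion module and the trajectory are learned so that the resulting frame sequence stays semantically consistent with the sound, rather than being a straight line between two latents.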


