APLA: Additional Perturbation for Latent Noise with Adversarial Training Enables Consistency

08/24/2023
by Yupu Yao, et al.

Diffusion models have shown promising progress in video generation, yet they often fail to retain consistent details in local regions across frames. One underlying cause is that traditional diffusion models approximate the Gaussian noise distribution through predicted noise alone, without fully accounting for the information inherent in the input itself. Moreover, these models emphasize the discrepancy between predictions and references while neglecting information intrinsic to the videos. To address these limitations, and inspired by the self-attention mechanism, we propose a novel text-to-video (T2V) generation network structure based on diffusion models, dubbed Additional Perturbation for Latent noise with Adversarial training (APLA). Our approach requires only a single video as input and builds upon a pre-trained Stable Diffusion network. Notably, we introduce an additional compact network, the Video Generation Transformer (VGT), designed to extract perturbations from the information inherent in the input, thereby refining inconsistent pixels during temporal prediction. We adopt a hybrid architecture of transformers and convolutions to capture temporal intricacies and enhance consistency between frames. Experiments demonstrate a noticeable improvement, both qualitatively and quantitatively, in the consistency of the generated videos.
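The idea sketched in the abstract can be illustrated in a few lines of PyTorch: a frozen pre-trained noise predictor is augmented with a small trainable branch that derives a perturbation from the input latent itself and adds it to the predicted noise. This is a minimal sketch only; the class names (`VideoGenerationTransformer`, `APLADenoiser`), tensor shapes, layer sizes, and the way the perturbation is combined with the base prediction are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the APLA idea (all names, shapes, and signatures are
# assumptions for illustration, not the paper's exact design).
import torch
import torch.nn as nn

class VideoGenerationTransformer(nn.Module):
    """Hypothetical compact VGT: a conv/attention hybrid that predicts a
    small perturbation from the input latent itself."""
    def __init__(self, channels: int = 4, dim: int = 64, heads: int = 4):
        super().__init__()
        self.proj_in = nn.Conv2d(channels, dim, kernel_size=3, padding=1)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj_out = nn.Conv2d(dim, channels, kernel_size=3, padding=1)

    def forward(self, latent: torch.Tensor) -> torch.Tensor:
        # latent: (batch * frames, channels, H, W)
        x = self.proj_in(latent)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)          # (b, H*W, dim)
        tokens, _ = self.attn(tokens, tokens, tokens)  # self-attention over spatial positions
        x = tokens.transpose(1, 2).reshape(b, c, h, w)
        return self.proj_out(x)                        # per-pixel perturbation

class APLADenoiser(nn.Module):
    """Frozen pre-trained noise predictor plus a trainable perturbation branch."""
    def __init__(self, pretrained_unet: nn.Module):
        super().__init__()
        self.unet = pretrained_unet.eval()             # placeholder for a Stable Diffusion UNet
        for p in self.unet.parameters():
            p.requires_grad_(False)                    # backbone stays frozen
        self.vgt = VideoGenerationTransformer()

    def forward(self, noisy_latent, timestep, text_emb):
        # The unet call signature here is illustrative.
        base_noise = self.unet(noisy_latent, timestep, text_emb)
        perturbation = self.vgt(noisy_latent)          # extracted from the input itself
        return base_noise + perturbation
```

Under this reading, only the perturbation branch is trained (with an adversarial objective, as the method's name suggests), while the pre-trained backbone remains frozen, which is consistent with the single-video, fine-tuning-style setup described above.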


research · 07/26/2023
VideoControlNet: A Motion-Guided Video-to-Video Translation Framework by Using Diffusion Model with ControlNet
Recently, diffusion models like StableDiffusion have achieved impressive...

research · 05/23/2023
Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models
This paper presents a controllable text-to-video (T2V) diffusion model, ...

research · 06/02/2023
Video Colorization with Pre-trained Text-to-Image Diffusion Models
Video colorization is a challenging task that involves inferring plausib...

research · 05/30/2023
Video ControlNet: Towards Temporally Consistent Synthetic-to-Real Video Translation Using Conditional Image Diffusion Models
In this study, we present an efficient and effective approach for achiev...

research · 08/19/2023
MeDM: Mediating Image Diffusion Models for Video-to-Video Translation with Temporal Correspondence Guidance
This study introduces an efficient and effective method, MeDM, that util...

research · 05/22/2023
ControlVideo: Training-free Controllable Text-to-Video Generation
Text-driven diffusion models have unlocked unprecedented abilities in im...

research · 06/06/2019
Scaling Autoregressive Video Models
Due to the statistical complexity of video, the high degree of inherent ...
