Towards Consistent Video Editing with Text-to-Image Diffusion Models

05/27/2023
by Zicheng Zhang et al.

Existing works have adapted Text-to-Image (TTI) diffusion models for video editing in a one-shot learning manner. Despite their low data and computation requirements, these methods can produce results that are inconsistent both with the text prompt and across the temporal sequence, limiting their real-world applicability. In this paper, we propose to address these issues with a novel EI^2 model towards Enhancing vIdeo Editing consIstency of TTI-based frameworks. Specifically, our analysis shows that the inconsistency is caused by the modules newly added to TTI models to learn temporal information: these modules introduce covariate shift in the feature space, which harms the editing capability. We therefore design EI^2 to tackle these drawbacks with two modules: a Shift-restricted Temporal Attention Module (STAM) and a Fine-coarse Frame Attention Module (FFAM). First, through theoretical analysis, we demonstrate that the covariate shift is closely tied to Layer Normalization, so STAM replaces it with an Instance Centering layer to preserve the distribution of temporal features. STAM additionally employs an attention layer with a normalized mapping to transform temporal features while constraining the variance shift. Second, we combine STAM with a novel FFAM, which efficiently leverages fine-coarse spatial information from all frames to further enhance temporal consistency. Extensive experiments demonstrate the superiority of the proposed EI^2 model for text-driven video editing.
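The abstract does not spell out the Instance Centering layer, but its stated role (centering features the way Layer Normalization does, while skipping the variance rescaling blamed for the covariate shift) suggests a minimal sketch like the one below. This is an illustrative PyTorch reading, not the authors' implementation; the class name InstanceCentering, the learned bias, and the (batch, frames, channels) layout are all assumptions.

```python
import torch
import torch.nn as nn

class InstanceCentering(nn.Module):
    """Sketch of a centering-only substitute for LayerNorm.

    Like LayerNorm it removes the per-token mean, but it does NOT
    divide by the standard deviation, so the scale (variance) of the
    temporal features is preserved. Details are hypothetical.
    """

    def __init__(self, dim: int):
        super().__init__()
        # Learned shift, analogous to LayerNorm's bias (assumed detail).
        self.bias = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim) temporal features.
        # Center over the feature dimension only; no variance rescaling.
        return x - x.mean(dim=-1, keepdim=True) + self.bias

if __name__ == "__main__":
    ic = InstanceCentering(dim=320)
    feats = torch.randn(2, 8, 320)  # (batch, frames, channels)
    out = ic(feats)
    # Per-token mean is ~0, but the per-token std matches the input's,
    # unlike LayerNorm, which would force it to 1.
    print(out.mean(-1).abs().max(), out.std(-1).mean(), feats.std(-1).mean())
```

The small demo at the bottom checks the property the abstract attributes to the layer: centering shifts each token's features to zero mean while leaving their standard deviation, and hence the feature distribution's scale, unchanged.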


Related research

08/18/2023 · StableVideo: Text-driven Consistency-aware Diffusion Video Editing
Diffusion-based methods can generate realistic images and videos, but th...

03/14/2023 · Edit-A-Video: Single Video Editing with Object-Aware Consistency
Despite the fact that text-to-video (TTV) model has recently achieved re...

03/30/2023 · Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models
Large-scale text-to-image diffusion models achieve unprecedented success...

08/17/2023 · Edit Temporal-Consistent Videos with Image Diffusion Model
Large-scale text-to-image (T2I) diffusion models have been extended for ...

02/02/2023 · Dreamix: Video Diffusion Models are General Video Editors
Text-driven image and video diffusion models have recently achieved unpr...

08/21/2023 · EVE: Efficient zero-shot text-based Video Editing with Depth Map Guidance and Temporal Consistency Constraints
Motivated by the superior performance of image diffusion models, more an...

11/17/2021 · DiverGAN: An Efficient and Effective Single-Stage Framework for Diverse Text-to-Image Generation
In this paper, we present an efficient and effective single-stage framew...
