Intelligent Grimm – Open-ended Visual Storytelling via Latent Diffusion Models

06/01/2023
by Chang Liu, et al.

Generative models have recently exhibited exceptional capabilities in various scenarios, for example, image generation from text descriptions. In this work, we focus on the task of generating a coherent image sequence from a given storyline, which we denote open-ended visual storytelling. We make the following three contributions: (i) to fulfill the task of visual storytelling, we introduce two modules into a pre-trained Stable Diffusion model and construct an auto-regressive image generator, termed StoryGen, which generates the current frame by conditioning on both a text prompt and the preceding frame; (ii) to train the proposed model, we collect paired image and text samples from diverse online sources, such as videos and e-books, and establish a data processing pipeline for constructing a diverse dataset, named StorySalon, with a far larger vocabulary than existing animation-specific datasets; (iii) we adopt a three-stage curriculum training strategy that enables style transfer, visual context conditioning, and human feedback alignment, respectively. Quantitative experiments and human evaluation validate the superiority of the proposed model in terms of image quality, style consistency, content consistency, and visual-language alignment. We will make the code, model, and dataset publicly available to the research community.
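The auto-regressive generation scheme described above can be sketched as a simple loop: each frame is sampled conditioned on its text prompt and the previously generated frame, so visual context propagates through the story. The sketch below is a minimal, hypothetical illustration of that control flow only; `generate_frame` is a stand-in for an actual StoryGen-style diffusion sampling call, not the authors' implementation.

```python
from typing import List, Optional

def generate_frame(prompt: str, prev_frame: Optional[str]) -> str:
    """Placeholder for a diffusion sampling step conditioned on a text
    prompt and, when available, the preceding frame (visual context)."""
    context = prev_frame if prev_frame is not None else "<none>"
    return f"frame(prompt={prompt!r}, context={context!r})"

def tell_story(storyline: List[str]) -> List[str]:
    """Generate one frame per prompt, auto-regressively threading each
    generated frame into the conditioning of the next step."""
    frames: List[str] = []
    prev: Optional[str] = None
    for prompt in storyline:
        frame = generate_frame(prompt, prev)
        frames.append(frame)
        prev = frame  # the new frame becomes context for the next prompt
    return frames

story = tell_story(["a wolf enters the forest", "the wolf meets a girl"])
```

The first frame has no visual context, matching the paper's setting where only the text prompt is available at the start of a story; every later frame sees both its prompt and its predecessor.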


