Text-Only Training for Visual Storytelling

08/17/2023
by   Yuechen Wang, et al.
0

Visual storytelling aims to generate a narrative based on a sequence of images, necessitating both vision-language alignment and coherent story generation. Most existing solutions predominantly depend on paired image-text training data, which can be costly to collect and challenging to scale. To address this, we formulate visual storytelling as a visual-conditioned story generation problem and propose a text-only training method that separates the learning of cross-modality alignment and story generation. Our approach specifically leverages the cross-modality pre-trained CLIP model to integrate visual control into a story generator, trained exclusively on text data. Moreover, we devise a training-free visual condition planner that accounts for the temporal structure of the input image sequence while balancing global and local visual content. The distinctive advantage of requiring only text data for training enables our method to learn from external text story data, enhancing the generalization capability of visual storytelling. We conduct extensive experiments on the VIST benchmark, showcasing the effectiveness of our approach in both in-domain and cross-domain settings. Further evaluations on expression diversity and human assessment underscore the superiority of our method in terms of informativeness and robustness.

READ FULL TEXT

page 1

page 8

research
05/14/2021

Plot and Rework: Modeling Storylines for Visual Storytelling

Writing a coherent and engaging story is not easy. Creative writers use ...
research
01/26/2023

Multimodal Event Transformer for Image-guided Story Ending Generation

Image-guided story ending generation (IgSEG) is to generate a story endi...
research
05/13/2018

Hierarchical Neural Story Generation

We explore story generation: creative systems that can build coherent an...
research
01/11/2019

From Plots to Endings: A Reinforced Pointer Generator for Story Ending Generation

We introduce a new task named Story Ending Generation (SEG), whic-h aims...
research
03/01/2022

CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP

Training a text-to-image generator in the general domain (e.g., Dall.e, ...
research
11/14/2022

Learning to Model Multimodal Semantic Alignment for Story Visualization

Story visualization aims to generate a sequence of images to narrate eac...
research
11/11/2019

Keep it Consistent: Topic-Aware Storytelling from an Image Stream via Iterative Multi-agent Communication

Visual storytelling aims to generate a narrative paragraph from a sequen...

Please sign up or login with your details

Forgot password? Click here to reset