Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization

10/21/2021
by Adyasha Maharana, et al.

While text-to-image synthesis has been studied extensively, little work has explored how to use the linguistic structure of the input text. Such information is even more important for story visualization, since its inputs have an explicit narrative structure that needs to be translated into an image sequence (or visual story). Prior work in this domain has shown that there is ample room for improvement in the generated image sequence in terms of visual quality, consistency and relevance. In this paper, we first explore the use of constituency parse trees, using a Transformer-based recurrent architecture to encode the structured input. Second, we augment the structured input with commonsense information and study the impact of this external knowledge on the generation of the visual story. Third, we incorporate visual structure via bounding boxes and dense captioning to provide feedback about the characters/objects in generated images within a dual learning setup. We show that off-the-shelf dense-captioning models trained on Visual Genome can improve the spatial structure of images from a different target domain without needing fine-tuning. We train the model end-to-end using an intra-story contrastive loss (between words and image sub-regions) and show significant improvements in several metrics (and human evaluation) for multiple datasets. Finally, we provide an analysis of the linguistic and visuo-spatial information. Code and data: https://github.com/adymaharana/VLCStoryGan.
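
The abstract does not spell out the form of the intra-story contrastive loss, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes an InfoNCE-style objective in which each caption's matching frame is the positive and the other frames of the same story are negatives. It also assumes that a caption-image score is built by letting each word pick its most similar image sub-region. The function names, the scoring rule and the temperature value are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def caption_image_score(words, regions):
    """Assumed scoring rule: each word picks its best-matching
    sub-region, and the per-word maxima are averaged."""
    return sum(max(cosine(w, r) for r in regions) for w in words) / len(words)


def intra_story_contrastive_loss(story_words, story_regions, temperature=0.1):
    """InfoNCE-style loss over the frames of a single story.

    story_words[i]   : list of word vectors for caption i
    story_regions[j] : list of sub-region vectors for image j
    Caption i is treated as matching image i; every other image
    in the same story serves as a negative (an assumption).
    """
    n = len(story_words)
    total = 0.0
    for i in range(n):
        logits = [
            caption_image_score(story_words[i], story_regions[j]) / temperature
            for j in range(n)
        ]
        # Numerically stable log-sum-exp over all images in the story.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability of the matching image.
        total += log_denom - logits[i]
    return total / n
```

Under these assumptions, the loss is small when each caption's words align with sub-regions of its own frame and grows when captions and frames within a story are mismatched.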


research
05/20/2021

Improving Generation and Evaluation of Visual Stories via Semantic Consistency

Story visualization is an under-explored task that falls at the intersec...
research
12/03/2019

Knowledge-Enriched Visual Storytelling

Stories are diverse and highly personalized, resulting in a large possib...
research
09/13/2022

StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation

Recent advances in text-to-image synthesis have led to large pretrained ...
research
11/14/2022

Learning to Model Multimodal Semantic Alignment for Story Visualization

Story visualization aims to generate a sequence of images to narrate eac...
research
01/09/2023

An Impartial Transformer for Story Visualization

Story Visualization is an advanced task of computer vision that targets ...
research
08/03/2022

Word-Level Fine-Grained Story Visualization

Story visualization aims to generate a sequence of images to narrate eac...
research
10/16/2022

Character-Centric Story Visualization via Visual Planning and Token Alignment

Story visualization advances the traditional text-to-image generation by...
