Improving Generation and Evaluation of Visual Stories via Semantic Consistency

by   Adyasha Maharana, et al.

Story visualization is an under-explored task that falls at the intersection of many important research directions in both computer vision and natural language processing. In this task, given a series of natural language captions which compose a story, an agent must generate a sequence of images that correspond to the captions. Prior work has introduced recurrent generative models which outperform text-to-image synthesis models on this task. However, there is room for improvement of generated images in terms of visual quality, coherence and relevance. We present a number of improvements to prior modeling approaches, including (1) the addition of a dual learning framework that utilizes video captioning to reinforce the semantic alignment between the story and generated images, (2) a copy-transform mechanism for sequentially-consistent story visualization, and (3) MART-based transformers to model complex interactions between frames. We present ablation studies to demonstrate the effect of each of these techniques on the generative power of the model for both individual images as well as the entire narrative. Furthermore, due to the complexity and generative nature of the task, standard evaluation metrics do not accurately reflect performance. Therefore, we also provide an exploration of evaluation metrics for the model, focused on aspects of the generated frames such as the presence/quality of generated characters, the relevance to captions, and the diversity of the generated images. We also present correlation experiments of our proposed automated metrics with human evaluations. Code and data available at:



There are no comments yet.


page 1

page 7

page 8

page 9

page 14

page 16


Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization

While much research has been done in text-to-image synthesis, little wor...

StoryGAN: A Sequential Conditional GAN for Story Visualization

In this work we propose a new task called Story Visualization. Given a m...

A Pipeline for Creative Visual Storytelling

Computational visual storytelling produces a textual description of even...

Semantic Frame Forecast

This paper introduces semantic frame forecast, a task that predicts the ...

Discriminative Latent Semantic Graph for Video Captioning

Video captioning aims to automatically generate natural language sentenc...

A Skeleton-Based Model for Promoting Coherence Among Sentences in Narrative Story Generation

Narrative story generation is a challenging problem because it demands t...

Sort Story: Sorting Jumbled Images and Captions into Stories

Temporal common sense has applications in AI tasks such as QA, multi-doc...

Code Repositories

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.