Reading Between the Lines: Exploring Infilling in Visual Narratives

10/26/2020
by Khyathi Raghavi Chandu, et al.

Generating long-form narratives such as stories and procedures from multiple modalities has been a long-standing dream for artificial intelligence. In such narratives, crucial subtext is often derived from the surrounding contexts. Standard seq2seq training leaves models ill-equipped to bridge the gaps between these neighbouring contexts. In this paper, we tackle this problem with infilling techniques, which involve predicting missing steps in a narrative while generating textual descriptions from a sequence of images. We also present a new large-scale visual procedure telling (ViPT) dataset, rich in such contextual dependencies, with a total of 46,200 procedures and around 340k pairs of images and textual descriptions. Generating steps with the infilling technique proves effective on visual procedures, yielding more coherent texts. We report a METEOR score of 27.51 on procedures, which is higher than the state of the art on visual storytelling. We also demonstrate the effects of interposing new text when images are missing during inference. The code and the dataset will be publicly available at https://visual-narratives.github.io/Visual-Narratives/.
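To make the infilling idea concrete, here is a minimal sketch of how a training example might be built: some image steps are hidden while every textual step is kept as a target, so a model has to infer the missing step from its neighbours. This is an illustration under stated assumptions, not the authors' implementation; the names `mask_steps` and `make_infilling_example`, the use of `None` as the mask placeholder, and the uniform random choice of masked positions are all hypothetical.

```python
import random
from typing import List, Optional, Sequence, Tuple


def mask_steps(
    image_feats: Sequence[object],
    num_masked: int = 1,
    rng: Optional[random.Random] = None,
) -> Tuple[List[Optional[object]], List[int]]:
    """Hide `num_masked` steps of an image sequence.

    Returns the sequence with masked positions set to None, plus the
    sorted list of masked indices. (Hypothetical helper, for illustration.)
    """
    rng = rng or random.Random()
    n = len(image_feats)
    if not 0 <= num_masked <= n:
        raise ValueError("num_masked must be between 0 and the sequence length")
    masked = sorted(rng.sample(range(n), num_masked))
    hidden = set(masked)
    visible = [None if i in hidden else feat for i, feat in enumerate(image_feats)]
    return visible, masked


def make_infilling_example(
    image_feats: Sequence[object],
    step_texts: Sequence[str],
    num_masked: int = 1,
    rng: Optional[random.Random] = None,
) -> dict:
    """Pair a partially masked image sequence with the full text targets."""
    if len(image_feats) != len(step_texts):
        raise ValueError("each image must have exactly one textual step")
    visible, masked = mask_steps(image_feats, num_masked, rng)
    return {
        "images": visible,            # inputs, with None at masked steps
        "targets": list(step_texts),  # all steps remain generation targets
        "masked_indices": masked,
    }


example = make_infilling_example(
    ["img0", "img1", "img2", "img3"],
    ["Gather ingredients.", "Mix the batter.", "Pour into a pan.", "Bake."],
    num_masked=1,
    rng=random.Random(0),
)
```

In this sketch the text targets are never masked, so the description of a hidden step still has to be generated from the surrounding visible steps.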

