TextCaps: a Dataset for Image Captioning with Reading Comprehension

03/24/2020
by Oleksii Sidorov, et al.

Image descriptions can help visually impaired people quickly understand image content. While we have made significant progress in automatically describing images and in optical character recognition, current approaches are unable to include written text in their descriptions, although text is omnipresent in human environments and frequently critical to understanding our surroundings. To study how to comprehend text in the context of an image, we collect a novel dataset, TextCaps, with 145k captions for 28k images. Our dataset challenges a model to recognize text, relate it to its visual context, and decide what part of the text to copy or paraphrase, requiring spatial, semantic, and visual reasoning between multiple text tokens and visual entities, such as objects. We study baselines and adapt existing approaches to this new task, which we refer to as image captioning with reading comprehension. Our analysis with automatic and human studies shows that our new TextCaps dataset provides many new technical challenges over previous datasets.


