Understanding Image and Text Simultaneously: a Dual Vision-Language Machine Comprehension Task

12/22/2016
by   Nan Ding, et al.
0

We introduce a new multi-modal task for computer systems, posed as a combined vision-language comprehension challenge: identifying the most suitable text describing a scene, given several similar options. Accomplishing the task entails demonstrating comprehension beyond just recognizing "keywords" (or key-phrases) and their corresponding visual concepts. Instead, it requires an alignment between the representations of the two modalities that achieves a visually-grounded "understanding" of various linguistic elements and their dependencies. This new task also admits an easy-to-compute and well-studied metric: the accuracy in detecting the true target among the decoys. The paper makes several contributions: an effective and extensible mechanism for generating decoys from (human-created) image captions; an instance of applying this mechanism, yielding a large-scale machine comprehension dataset (based on the COCO images and captions) that we make publicly available; human evaluation results on this dataset, informing a performance upper-bound; and several baseline and competitive learning approaches that illustrate the utility of the proposed task and dataset in advancing both image and language comprehension. We also show that, in a multi-task learning setting, the performance on the proposed task is positively correlated with the end-to-end task of image captioning.

READ FULL TEXT

page 5

page 6

research
07/10/2023

SITTA: A Semantic Image-Text Alignment for Image Captioning

Textual and semantic comprehension of images is essential for generating...
research
06/23/2016

VideoMCC: a New Benchmark for Video Comprehension

While there is overall agreement that future technology for organizing, ...
research
05/06/2019

Image Captioning with Clause-Focused Metrics in a Multi-Modal Setting for Marketing

Automatically generating descriptive captions for images is a well-resea...
research
03/21/2021

#PraCegoVer: A Large Dataset for Image Captioning in Portuguese

Automatically describing images using natural sentences is an important ...
research
12/19/2018

Generating Diverse and Meaningful Captions

Image Captioning is a task that requires models to acquire a multi-modal...
research
05/16/2018

Defoiling Foiled Image Captions

We address the task of detecting foiled image captions, i.e. identifying...
research
11/24/2021

Universal Captioner: Inducing Content-Style Separation in Vision-and-Language Model Training

While captioning models have obtained compelling results in describing n...

Please sign up or login with your details

Forgot password? Click here to reset