FOIL it! Find One mismatch between Image and Language caption

05/03/2017
by Ravi Shekhar, et al.

In this paper, we aim to understand whether current language and vision (LaVi) models truly grasp the interaction between the two modalities. To this end, we propose an extension of the MSCOCO dataset, FOIL-COCO, which associates images with both correct and "foil" captions, that is, descriptions of the image that are highly similar to the original ones but contain a single mistake (the "foil word"). We show that current LaVi models fall into the traps of this data and perform badly on three tasks: a) caption classification (correct vs. foil); b) foil word detection; c) foil word correction. Humans, in contrast, achieve near-perfect performance on these tasks. We demonstrate that language cues alone are not enough to model FOIL-COCO, and that the dataset challenges the state of the art by requiring a fine-grained understanding of the relation between text and image.
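To make the notion of a foil caption concrete, here is a minimal Python sketch of the data format only. It assumes whitespace tokenization and a foil that differs from the original caption in exactly one word; the function names `make_foil` and `find_foil_word` and the example caption are hypothetical and are not taken from the paper or from FOIL-COCO. The sketch compares a foil against its original caption, which is not how the evaluated models work: they must decide from the image and the caption alone.

```python
from typing import Optional, Tuple


def make_foil(caption: str, target: str, replacement: str) -> str:
    """Build a foil caption by swapping one word (hypothetical helper).

    Replaces the first occurrence of `target` with `replacement`.
    Raises ValueError if `target` does not occur in the caption.
    """
    words = caption.split()
    idx = words.index(target)
    words[idx] = replacement
    return " ".join(words)


def find_foil_word(correct: str, foil: str) -> Optional[Tuple[int, str, str]]:
    """Locate the single differing word between two captions.

    Returns (position, original_word, foil_word), or None when the
    captions are identical. Assumes both captions have the same number
    of whitespace-separated tokens and differ in at most one of them.
    """
    a, b = correct.split(), foil.split()
    if len(a) != len(b):
        raise ValueError("captions must have the same number of tokens")
    diffs = [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]
    if not diffs:
        return None
    if len(diffs) > 1:
        raise ValueError("captions differ in more than one word")
    return diffs[0]


# Invented example caption, for illustration only.
original = "a dog sits on a bench"
foil = make_foil(original, "dog", "cat")
print(foil)
print(find_foil_word(original, foil))
```

In terms of the three tasks, a classifier would label `foil` as incorrect for the image (task a), the returned position marks the foil word (task b), and the original word at that position is the expected correction (task c).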


