Learning to Generate Grounded Image Captions without Localization Supervision

06/01/2019
by Chih-Yao Ma, et al.

When generating a sentence description for an image, it frequently remains unclear how well the generated caption is grounded in the image or if the model hallucinates based on priors in the dataset and/or the language model. The most common way of relating image regions with words in caption models is through an attention mechanism over the regions that is used as input to predict the next word. The model must therefore learn to predict the attention without knowing the word it should localize. In this work, we propose a novel cyclical training regimen that forces the model to localize each word in the image after the sentence decoder generates it and then reconstruct the sentence from the localized image region(s) to match the ground-truth. The initial decoder and the proposed reconstructor share parameters during training and are learned jointly with the localizer, allowing the model to regularize the attention mechanism. Our proposed framework only requires learning one extra fully-connected layer (the localizer), a layer that can be removed at test time. We show that our model significantly improves grounding accuracy without relying on grounding supervision or introducing extra computation during inference.
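The cyclical regimen described above (decode a word, localize it back to image regions through a single extra fully-connected layer, then reconstruct the sentence from the localized regions with a decoder that shares parameters) can be sketched in PyTorch. This is a minimal illustration under assumed shapes and invented names (`CyclicalCaptioner`, `localize` flag), not the authors' implementation; teacher-forcing details and word shifting are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CyclicalCaptioner(nn.Module):
    """Hypothetical sketch: decoding pass and localize-then-reconstruct pass
    share all parameters except the one extra localizer layer."""

    def __init__(self, vocab_size, region_dim=2048, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.region_proj = nn.Linear(region_dim, hidden_dim)
        # decoder and reconstructor share the LSTM, output, and attention weights
        self.lstm = nn.LSTMCell(hidden_dim * 2, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)
        self.attn = nn.Linear(hidden_dim, hidden_dim)
        # the localizer: the single extra fully-connected layer,
        # removable at test time
        self.localizer = nn.Linear(hidden_dim, hidden_dim)

    def _attend(self, query, regions):
        # soft attention over projected region features
        scores = torch.einsum('bd,brd->br', query, regions)
        alpha = F.softmax(scores / regions.size(-1) ** 0.5, dim=-1)
        attended = torch.einsum('br,brd->bd', alpha, regions)
        return attended, alpha

    def decode(self, regions, captions, localize=False):
        B, T = captions.shape
        R = self.region_proj(regions)
        h = c = regions.new_zeros(B, self.lstm.hidden_size)
        logits, alphas = [], []
        for t in range(T):
            w = self.embed(captions[:, t])
            if localize:
                # reconstruction pass: ground the word itself in the image,
                # then regenerate the sentence from the localized regions
                q = self.localizer(w)
            else:
                # decoding pass: attention is predicted from the hidden state,
                # i.e. before the word to be localized is known
                q = self.attn(h)
            ctx, alpha = self._attend(q, R)
            h, c = self.lstm(torch.cat([w, ctx], dim=-1), (h, c))
            logits.append(self.out(h))
            alphas.append(alpha)
        return torch.stack(logits, 1), torch.stack(alphas, 1)

# toy forward/backward pass with random region features and token ids
torch.manual_seed(0)
model = CyclicalCaptioner(vocab_size=100, region_dim=64, hidden_dim=32)
regions = torch.randn(2, 5, 64)            # 2 images, 5 regions each
caps = torch.randint(0, 100, (2, 7))       # 2 captions of length 7
logits_d, _ = model.decode(regions, caps)                  # decoding pass
logits_r, _ = model.decode(regions, caps, localize=True)   # cyclical pass
loss = (F.cross_entropy(logits_d.reshape(-1, 100), caps.reshape(-1))
        + F.cross_entropy(logits_r.reshape(-1, 100), caps.reshape(-1)))
loss.backward()
```

Because both passes share the decoder weights, the reconstruction loss regularizes the same attention mechanism used at inference, and dropping `localizer` at test time adds no computation.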


Related research

- 04/05/2017 · Generating Descriptions with Grounded and Co-Referenced People
  Learning how to generate descriptions of images or videos received major...
- 11/12/2015 · Grounding of Textual Phrases in Images by Reconstruction
  Grounding (i.e. localizing) arbitrary, free-form textual phrases in visu...
- 12/02/2017 · Improving Visually Grounded Sentence Representations with Self-Attention
  Sentence representation models trained only on language could potentiall...
- 08/02/2021 · Distributed Attention for Grounded Image Captioning
  We study the problem of weakly supervised grounded image captioning. Tha...
- 06/13/2023 · Top-Down Viewing for Weakly Supervised Grounded Image Captioning
  Weakly supervised grounded image captioning (WSGIC) aims to generate the...
- 12/06/2019 · Connecting Vision and Language with Localized Narratives
  We propose Localized Narratives, an efficient way to collect image capti...
- 11/23/2017 · Self-view Grounding Given a Narrated 360° Video
  Narrated 360 videos are typically provided in many touring scenarios to ...
