Align2Ground: Weakly Supervised Phrase Grounding Guided by Image-Caption Alignment

03/27/2019
by Samyak Datta, et al.

We address the problem of grounding free-form textual phrases using weak supervision from image-caption pairs. We propose a novel end-to-end model that uses caption-to-image retrieval as a 'downstream' task to guide the process of phrase localization. As a first step, our method infers the latent correspondences between regions-of-interest (RoIs) and phrases in the caption, and creates a discriminative image representation from these matched RoIs. In a subsequent step, this learned representation is aligned with the caption. Our key contribution lies in building this 'caption-conditioned' image encoding, which tightly couples the two tasks and allows the weak supervision to effectively guide visual grounding. We provide an extensive empirical and qualitative analysis to investigate the different components of our proposed model and compare it with competitive baselines. For phrase localization, we report an improvement of 4.9% (absolute) over the prior state of the art on the VisualGenome dataset. We also report results that are on par with the state of the art on the downstream caption-to-image retrieval task on the COCO and Flickr30k datasets.
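
The two-step pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how RoIs are matched to phrases or how matched RoIs are aggregated, so the hard argmax matching, the mean pooling, the cosine scoring, and all function names below are assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_phrases_to_rois(phrases, rois):
    """Step 1 (assumed form): for each phrase embedding, pick the index of
    the most similar RoI embedding. Hard argmax is a placeholder for the
    latent correspondence inference mentioned in the abstract."""
    return [
        max(range(len(rois)), key=lambda j: cosine(p, rois[j]))
        for p in phrases
    ]


def caption_conditioned_encoding(rois, matched):
    """Build an image vector from the matched RoIs only.
    Mean pooling is an assumption, not the paper's aggregation."""
    dim = len(rois[0])
    return [sum(rois[j][d] for j in matched) / len(matched) for d in range(dim)]


def alignment_score(phrases, rois, caption):
    """Step 2 (assumed form): score the caption-conditioned image vector
    against the caption embedding. Returns (matched RoI indices, score)."""
    matched = match_phrases_to_rois(phrases, rois)
    image_vec = caption_conditioned_encoding(rois, matched)
    return matched, cosine(image_vec, caption)


# Toy 2-D embeddings, invented purely for illustration.
rois = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
phrases = [[0.9, 0.1], [0.1, 0.9]]
caption = [1.0, 1.0]

matched, score = alignment_score(phrases, rois, caption)
print(matched, score)
```

Because the image vector depends on which RoIs the phrases select, a retrieval loss applied to the final score would send a training signal back through the matching step. This is the coupling the abstract refers to as the 'caption-conditioned' encoding. A trainable version would need a differentiable matching step in place of the argmax used here.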

research
10/12/2020

MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding

Phrase localization is a task that studies the mapping from textual phra...
research
11/12/2015

Grounding of Textual Phrases in Images by Reconstruction

Grounding (i.e. localizing) arbitrary, free-form textual phrases in visu...
research
08/20/2019

Phrase Localization Without Paired Training Examples

Localizing phrases in images is an important part of image understanding...
research
08/08/2019

Semi Supervised Phrase Localization in a Bidirectional Caption-Image Retrieval Framework

We introduce a novel deep neural network architecture that links visual ...
research
04/20/2021

Detector-Free Weakly Supervised Grounding by Separation

Nowadays, there is an abundance of data involving images and surrounding...
research
03/14/2023

Medical Phrase Grounding with Region-Phrase Context Contrastive Alignment

Medical phrase grounding (MPG) aims to locate the most relevant region i...
research
11/22/2017

Conditional Image-Text Embedding Networks

This paper presents an approach for grounding phrases in images which jo...
