MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding

10/12/2020
by Qinxin Wang, et al.

Phrase localization is the task of mapping textual phrases to regions of an image. Given the difficulty of annotating phrase-to-object datasets at scale, we develop a Multimodal Alignment Framework (MAF) that leverages more widely available caption-image datasets as a form of weak supervision. We first present algorithms to model phrase-object relevance by leveraging fine-grained visual representations and visually-aware language representations. By adopting a contrastive objective, our method uses information in caption-image pairs to boost performance in weakly-supervised scenarios. Experiments conducted on the widely adopted Flickr30k dataset show a significant improvement over existing weakly-supervised methods. With the help of the visually-aware language representations, we can also improve the previous best unsupervised result by 5.56%. We conduct ablation studies to show that both our novel model and our weakly-supervised strategies significantly contribute to our strong results.
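To make the idea of a contrastive objective over caption-image pairs concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes that phrase-object relevance is a dot product between embedding vectors, that a caption-image score is the mean over phrases of the best-matching object, and that the loss is a softmax cross-entropy that ranks the paired image above the other images in a batch. The function names (`caption_image_score`, `contrastive_loss`, `ground`) and the toy vectors are hypothetical.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def caption_image_score(phrases, objects):
    """Assumed aggregation: mean over phrases of the best phrase-object dot product."""
    return sum(max(dot(p, o) for o in objects) for p in phrases) / len(phrases)

def contrastive_loss(phrases, images, positive_index):
    """Softmax cross-entropy that favors the paired image over the others."""
    scores = [caption_image_score(phrases, objs) for objs in images]
    m = max(scores)  # subtract the max for numerical stability
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_norm - scores[positive_index]

def ground(phrases, objects):
    """At inference, pick the index of the highest-scoring object for each phrase."""
    return [max(range(len(objects)), key=lambda j: dot(p, objects[j]))
            for p in phrases]

# Toy data: two phrase embeddings and two images with two object embeddings each.
phrases = [[1.0, 0.0], [0.0, 1.0]]
paired_image = [[0.9, 0.1], [0.1, 0.9]]
other_image = [[0.1, 0.1], [0.2, 0.0]]
images = [paired_image, other_image]

loss_correct = contrastive_loss(phrases, images, positive_index=0)
loss_wrong = contrastive_loss(phrases, images, positive_index=1)
grounding = ground(phrases, paired_image)
```

In this toy setup the loss is lower when the paired image is treated as the positive. Training only ever sees which caption goes with which image, yet `ground` can still return a phrase-to-object assignment, which is the sense in which the supervision is weak.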

research
03/27/2019

Align2Ground: Weakly Supervised Phrase Grounding Guided by Image-Caption Alignment

We address the problem of grounding free-form textual phrases by using w...
research
08/20/2019

Phrase Localization Without Paired Training Examples

Localizing phrases in images is an important part of image understanding...
research
04/07/2022

Adapting CLIP For Phrase Localization Without Further Training

Supervised or weakly supervised methods for phrase localization (textual...
research
11/26/2022

Who are you referring to? Weakly supervised coreference resolution with multimodal grounding

Coreference resolution aims at identifying words and phrases which refer...
research
10/17/2022

Weakly Supervised Face Naming with Symmetry-Enhanced Contrastive Loss

We revisit the weakly supervised cross-modal face-name alignment task; t...
research
05/18/2023

Weakly-Supervised Visual-Textual Grounding with Semantic Prior Refinement

Using only image-sentence pairs, weakly-supervised visual-textual ground...
research
07/05/2022

Toward Explainable and Fine-Grained 3D Grounding through Referring Textual Phrases

Recent progress on 3D scene understanding has explored visual grounding ...
