Weakly-Supervised Visual-Textual Grounding with Semantic Prior Refinement

05/18/2023
by Davide Rigoni, et al.

Weakly-supervised visual-textual grounding aims to learn region-phrase correspondences for the entity mentions in a sentence using only image-sentence pairs. Learning is harder than in the supervised setting because the ground-truth correspondences between bounding boxes and textual phrases are unavailable. To address this, we propose the Semantic Prior Refinement Model (SPRM), whose predictions combine the outputs of two main modules. The first module, which is untrained, returns a rough alignment between textual phrases and bounding boxes. The second module, which is trained, consists of two sub-components that refine the rough alignment to improve the accuracy of the final phrase-bounding box alignments. The model is trained to maximize the multimodal similarity between an image and its sentence while minimizing the multimodal similarity between the same sentence and a new, unrelated image, carefully selected to help the most during training. Our approach achieves state-of-the-art results on two popular datasets, Flickr30k Entities and ReferIt, with an especially large gain on ReferIt: a 9.6-point absolute improvement. Moreover, thanks to the untrained component, it reaches competitive performance using only a small fraction of the training examples.
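The training objective described in the abstract can be illustrated with a minimal sketch. This is not the SPRM implementation: the abstract does not specify how the multimodal similarity or the loss is computed, so the cosine-based phrase-box score, the max-then-average aggregation, and the names `image_sentence_similarity`, `ground`, and `contrastive_loss` are all assumptions made for illustration, and the toy embeddings are invented.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors (0.0 if either is zero).
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def image_sentence_similarity(phrases, boxes):
    # Assumed aggregation: each phrase is scored by its best-matching box,
    # and the sentence-level score is the mean over phrases.
    return sum(max(cosine(p, b) for b in boxes) for p in phrases) / len(phrases)

def ground(phrases, boxes):
    # Predicted alignment: index of the highest-scoring box for each phrase.
    return [max(range(len(boxes)), key=lambda j: cosine(p, boxes[j]))
            for p in phrases]

def contrastive_loss(phrases, pos_boxes, neg_boxes):
    # Lower is better: high similarity with the paired image (pos_boxes),
    # low similarity with the unrelated image (neg_boxes).
    return (image_sentence_similarity(phrases, neg_boxes)
            - image_sentence_similarity(phrases, pos_boxes))

# Hypothetical toy embeddings: two phrases, two boxes per image.
phrases = [[1.0, 0.0], [0.0, 1.0]]
pos_boxes = [[0.9, 0.1], [0.1, 0.9]]    # boxes from the paired image
neg_boxes = [[-1.0, 0.0], [0.0, -1.0]]  # boxes from an unrelated image

print(ground(phrases, pos_boxes))
print(contrastive_loss(phrases, pos_boxes, neg_boxes))
```

In this sketch the loss is negative when the sentence matches its paired image better than the unrelated one. How the unrelated image is selected, which the abstract describes as a careful choice, is not modeled here.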

research
07/03/2020

Improving Weakly Supervised Visual Grounding by Contrastive Knowledge Distillation

Weakly supervised phrase grounding aims at learning region-phrase corres...
research
11/26/2022

Who are you referring to? Weakly supervised coreference resolution with multimodal grounding

Coreference resolution aims at identifying words and phrases which refer...
research
10/12/2020

MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding

Phrase localization is a task that studies the mapping from textual phra...
research
06/19/2022

What is Where by Looking: Weakly-Supervised Open-World Phrase-Grounding without Text Inputs

Given an input image, and nothing else, our method returns the bounding ...
research
08/11/2021

A Better Loss for Visual-Textual Grounding

Given a textual phrase and an image, the visual grounding problem is def...
research
03/29/2018

Interpretable and Globally Optimal Prediction for Textual Grounding using Image Concepts

Textual grounding is an important but challenging task for human-compute...
research
11/21/2016

Phrase Localization and Visual Relationship Detection with Comprehensive Image-Language Cues

This paper presents a framework for localization or grounding of phrases...
