What is Where by Looking: Weakly-Supervised Open-World Phrase-Grounding without Text Inputs

06/19/2022
by   Tal Shaharabany, et al.

Given an input image, and nothing else, our method returns the bounding boxes of objects in the image together with phrases that describe them. This is achieved within an open-world paradigm, in which the objects in the input image may not have been encountered during the training of the localization mechanism. Moreover, training takes place in a weakly supervised setting, where no bounding boxes are provided. To achieve this, our method combines two pre-trained networks: the CLIP image-to-text matching score and the BLIP image captioning tool. Training takes place on COCO images and their captions and is based on CLIP. Then, during inference, BLIP is used to generate hypotheses for various regions of the current image. Our work generalizes weakly supervised segmentation and phrase grounding and is shown empirically to outperform the state of the art in both domains. It also shows convincing results on the novel task of weakly-supervised open-world purely visual phrase-grounding introduced in this work. For example, on the datasets used for benchmarking phrase-grounding, our method incurs only a modest degradation compared to methods that employ human captions as an additional input. Our code is available at https://github.com/talshaharabany/what-is-where-by-looking and a live demo can be found at https://talshaharabany/what-is-where-by-looking.
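The inference pipeline the abstract describes — generate a caption hypothesis for each candidate region, then rank the regions by an image-text matching score — can be sketched as follows. This is a minimal illustration, not the authors' implementation: `caption_region` and `clip_score` are hypothetical placeholders standing in for the actual BLIP captioning and CLIP scoring calls, and the toy score used here (relative crop area) exists only to make the example self-contained and runnable.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, w, h) in pixels


def caption_region(image_size: Tuple[int, int], box: Box) -> str:
    # Placeholder for BLIP: caption the crop defined by `box`.
    return f"object at {box}"


def clip_score(image_size: Tuple[int, int], box: Box, phrase: str) -> float:
    # Placeholder for CLIP: image-text matching score between the crop
    # and the phrase. Here a toy score (relative area) keeps the sketch
    # deterministic and dependency-free.
    x, y, w, h = box
    iw, ih = image_size
    return (w * h) / float(iw * ih)


def ground_phrases(image_size: Tuple[int, int],
                   proposals: List[Box],
                   top_k: int = 2) -> List[Tuple[Box, str]]:
    """Caption each proposed region, score it, and keep the best top_k."""
    scored = []
    for box in proposals:
        phrase = caption_region(image_size, box)
        scored.append((clip_score(image_size, box, phrase), box, phrase))
    scored.sort(key=lambda t: t[0], reverse=True)
    return [(box, phrase) for _, box, phrase in scored[:top_k]]


boxes = [(0, 0, 320, 240), (100, 100, 64, 64), (10, 10, 600, 400)]
results = ground_phrases((640, 480), boxes)
```

With real models, the two placeholders would be replaced by a BLIP generation call on the cropped region and a CLIP similarity between the crop embedding and the phrase embedding; the surrounding loop-and-rank structure stays the same.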


