Adapting CLIP For Phrase Localization Without Further Training

04/07/2022
by Jiahao Li, et al.

Supervised and weakly supervised methods for phrase localization (textual grounding) rely either on human annotations or on other supervised models, e.g., object detectors. Obtaining these annotations is labor-intensive and may be difficult to scale in practice. We propose to leverage recent advances in contrastive language-vision models, specifically CLIP, which is pre-trained on image and caption pairs collected from the internet. In its original form, CLIP outputs only an image-level embedding without any spatial resolution. We adapt CLIP to generate high-resolution spatial feature maps. Importantly, we can extract feature maps from both ViT and ResNet CLIP models while maintaining the semantic properties of an image embedding. This provides a natural framework for phrase localization. Our method for phrase localization requires no human annotations or additional training. Extensive experiments show that our method outperforms existing no-training methods in zero-shot phrase localization, and in some cases it even outperforms supervised methods. Code is available at https://github.com/pals-ttic/adapting-CLIP.
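To make the idea of localizing a phrase from a spatial feature map concrete, here is a minimal sketch. It is not the authors' implementation: it assumes a feature map of shape (H, W, D) and a phrase embedding of shape (D,) have already been obtained from some model, scores each location by cosine similarity, and takes the tightest box around high-scoring cells. The function names, the min-max normalization, and the 0.5 threshold are all illustrative assumptions.

```python
import numpy as np


def similarity_map(feature_map, text_embedding):
    """Cosine similarity between every spatial feature and the phrase embedding.

    feature_map: array of shape (H, W, D), assumed to come from an adapted
    image encoder; text_embedding: array of shape (D,).
    """
    eps = 1e-8
    f = feature_map / (np.linalg.norm(feature_map, axis=-1, keepdims=True) + eps)
    t = text_embedding / (np.linalg.norm(text_embedding) + eps)
    return f @ t  # shape (H, W)


def box_from_map(sim, threshold=0.5):
    """Tightest box (x_min, y_min, x_max, y_max) around high-scoring cells.

    Scores are min-max normalized to [0, 1] first. The threshold is an
    illustrative assumption, not a value taken from the paper.
    """
    lo, hi = sim.min(), sim.max()
    norm = (sim - lo) / (hi - lo + 1e-8)
    ys, xs = np.nonzero(norm >= threshold)
    if xs.size == 0:
        # Flat map: no cell stands out, so fall back to the whole grid.
        h, w = sim.shape
        return 0, 0, w - 1, h - 1
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```

The returned box is in feature-map grid coordinates; mapping it back to pixels would require scaling by the stride of whichever encoder produced the map.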


