ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension

04/12/2022
by   Sanjay Subramanian, et al.

Training a referring expression comprehension (ReC) model for a new visual domain requires collecting referring expressions, and potentially corresponding bounding boxes, for images in the domain. While large-scale pre-trained models are useful for image classification across domains, it remains unclear if they can be applied in a zero-shot manner to more complex tasks like ReC. We present ReCLIP, a simple but strong zero-shot baseline that repurposes CLIP, a state-of-the-art large-scale model, for ReC. Motivated by the close connection between ReC and CLIP's contrastive pre-training objective, the first component of ReCLIP is a region-scoring method that isolates object proposals via cropping and blurring, and passes them to CLIP. However, through controlled experiments on a synthetic dataset, we find that CLIP is largely incapable of performing spatial reasoning off-the-shelf. Thus, the second component of ReCLIP is a spatial relation resolver that handles several types of spatial relations. On RefCOCOg, we reduce the gap between zero-shot baselines from prior work and supervised models by as much as 29%. On RefGTA (video game imagery), ReCLIP's relative improvement over supervised ReC models trained on real images is 8%.
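To make the region-scoring idea concrete, below is a minimal sketch of how one might score object proposals against a referring expression with off-the-shelf CLIP, in the spirit of ReCLIP's first component. This is not the authors' released code: the helper names, the blur-based isolation (ReCLIP also uses cropping), the blur radius, and the example box coordinates are all illustrative assumptions. It uses only the public API of the open-source `clip` package (github.com/openai/CLIP) and PIL.

```python
# Minimal sketch of ReCLIP-style region scoring (illustrative, not the authors' code).
# Assumes object proposals are given as (x1, y1, x2, y2) integer boxes.
import torch
import clip
from PIL import Image, ImageFilter

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def isolate_proposal(image, box, blur_radius=10):
    """Isolate one proposal by blurring everything outside its box.

    The sharp region stays in place, so some scene context is preserved.
    blur_radius is an assumed hyperparameter, not a value from the paper.
    """
    blurred = image.filter(ImageFilter.GaussianBlur(blur_radius))
    blurred.paste(image.crop(box), box)  # paste the sharp crop back over the blur
    return blurred

def score_proposals(image, boxes, expression):
    """Return a CLIP cosine-similarity score for each proposal vs. the expression."""
    text = clip.tokenize([expression]).to(device)
    batch = torch.stack(
        [preprocess(isolate_proposal(image, b)) for b in boxes]
    ).to(device)
    with torch.no_grad():
        img_feats = model.encode_image(batch)
        txt_feats = model.encode_text(text)
        img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
        txt_feats = txt_feats / txt_feats.norm(dim=-1, keepdim=True)
        return (img_feats @ txt_feats.T).squeeze(-1)  # one score per box

# Hypothetical usage:
# image = Image.open("scene.jpg")
# boxes = [(10, 20, 120, 200), (150, 40, 300, 260)]
# scores = score_proposals(image, boxes, "the man on the left holding a cup")
# best_box = boxes[int(scores.argmax())]
```

This sketch covers only region scoring. Per the abstract, ReCLIP's second component, the spatial relation resolver, is needed on top of this because CLIP alone handles spatial reasoning poorly: it parses spatial relations out of the expression and combines per-box CLIP scores with rule-based spatial predicates (e.g., comparing box positions for relations like "left of"), rather than relying on CLIP to resolve them.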


