VGDiffZero: Text-to-image Diffusion Models Can Be Zero-shot Visual Grounders

09/03/2023
by Xuyang Liu, et al.

Large-scale text-to-image diffusion models have shown impressive capabilities across various generative tasks, enabled by strong vision-language alignment obtained through pre-training. However, most vision-language discriminative tasks require extensive fine-tuning on carefully labeled datasets to acquire such alignment, at great cost in time and computing resources. In this work, we explore directly applying a pre-trained generative diffusion model to the challenging discriminative task of visual grounding, without any fine-tuning or additional training data. Specifically, we propose VGDiffZero, a simple yet effective zero-shot visual grounding framework based on text-to-image diffusion models. We also design a comprehensive region-scoring method that considers both the global and local contexts of each isolated proposal. Extensive experiments on RefCOCO, RefCOCO+, and RefCOCOg show that VGDiffZero achieves strong performance on zero-shot visual grounding.
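
The abstract does not spell out how region scoring is computed, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that each proposal box is isolated in two ways: cropped (local context) and masked so that everything outside the box is blacked out (global context). It also assumes that a hypothetical `error_fn(image, text)` returns a text-conditioned denoising error from a pre-trained diffusion model, with lower error meaning a better match. The names `crop_proposal`, `mask_proposal`, `select_box`, and `error_fn` are illustrative and do not come from the paper.

```python
from typing import Callable, List, Sequence, Tuple

import numpy as np

# A proposal box in pixel coordinates: (x0, y0, x1, y1), with x1/y1 exclusive.
Box = Tuple[int, int, int, int]


def crop_proposal(image: np.ndarray, box: Box) -> np.ndarray:
    """Local context (assumed): keep only the pixels inside the box."""
    x0, y0, x1, y1 = box
    return image[y0:y1, x0:x1].copy()


def mask_proposal(image: np.ndarray, box: Box) -> np.ndarray:
    """Global context (assumed): keep the box in place, zero out the rest."""
    x0, y0, x1, y1 = box
    masked = np.zeros_like(image)
    masked[y0:y1, x0:x1] = image[y0:y1, x0:x1]
    return masked


def select_box(
    image: np.ndarray,
    boxes: Sequence[Box],
    text: str,
    error_fn: Callable[[np.ndarray, str], float],
) -> Tuple[int, List[float]]:
    """Score every proposal and return the index of the best one.

    error_fn is a hypothetical stand-in for a text-conditioned denoising
    error computed with a frozen diffusion model. Summing the local and
    global errors is an assumption made for this sketch.
    """
    scores = [
        error_fn(crop_proposal(image, box), text)
        + error_fn(mask_proposal(image, box), text)
        for box in boxes
    ]
    # Lower combined error is treated as a better text-region match.
    return int(np.argmin(scores)), scores
```

In practice `error_fn` would wrap a frozen text-to-image diffusion model; passing it in as a callable keeps the selection logic independent of any particular model or library.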

research
09/24/2021

CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models

Pre-Trained Vision-Language Models (VL-PTMs) have shown promising capabi...
research
03/27/2023

Text-to-Image Diffusion Models are Zero-Shot Classifiers

The excellent generative capabilities of text-to-image diffusion models ...
research
03/20/2023

CLIP goes 3D: Leveraging Prompt Tuning for Language Grounded 3D Recognition

Vision-Language models like CLIP have been widely adopted for various ta...
research
01/12/2023

Guiding Text-to-Image Diffusion Model Towards Grounded Generation

The goal of this paper is to augment a pre-trained text-to-image diffusi...
research
03/13/2023

ODIN: On-demand Data Formulation to Mitigate Dataset Lock-in

ODIN is an innovative approach that addresses the problem of dataset con...
research
06/01/2023

UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning

Recent advances in vision-language pre-training have enabled machines to...
research
03/02/2023

Human Motion Diffusion as a Generative Prior

In recent months, we witness a leap forward as denoising diffusion model...
