Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts

07/21/2023
by Mayug Maniparambil, et al.

Contrastive pretrained large Vision-Language Models (VLMs) like CLIP have revolutionized visual representation learning by providing strong performance on downstream datasets. VLMs are adapted to a downstream dataset in a 0-shot manner by designing prompts relevant to that dataset, a prompt-engineering process that relies on domain expertise and a validation set. Meanwhile, recent developments in generative pretrained models like GPT-4 mean they can be used as advanced internet search tools and can be steered to produce visual information in any desired structure. In this work, we show that GPT-4 can be used to generate text that is visually descriptive and how this text can be used to adapt CLIP to downstream tasks. We show considerable improvements in 0-shot transfer accuracy on specialized fine-grained datasets like EuroSAT (∼7%) and DTD (∼7%) when compared to CLIP's default prompt. We also design a simple few-shot adapter that learns to choose the best possible sentences to construct generalizable classifiers that outperform the recently proposed CoCoOp by ∼2% on average and by over 4% on specialized fine-grained datasets. We will release the code, prompts, and auxiliary text dataset upon acceptance.
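The released code is not part of this abstract, but the pipeline it describes (query GPT-4 for visually descriptive sentences per class, then use those sentences as CLIP prompts) is straightforward to sketch. The snippet below is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the `openai` (v1+) and `clip` (github.com/openai/CLIP) packages, and the prompt wording, description count `k`, and ViT-B/32 backbone are placeholder choices. Class scores come from cosine similarity between the image embedding and the mean of the normalized description embeddings for each class.

```python
# Minimal sketch (not the authors' released code): generate visually
# descriptive sentences for each class with GPT-4, then build a CLIP
# zero-shot classifier by averaging the per-sentence text embeddings.
# The prompt text, k, and the backbone are illustrative assumptions.
import torch
import clip
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def gpt4_descriptions(class_name: str, k: int = 5) -> list[str]:
    """Ask GPT-4 for k short, visually grounded sentences about a class."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": f"Give {k} short sentences describing the visual "
                       f"appearance of a {class_name} in a photo, one per line.",
        }],
    )
    lines = resp.choices[0].message.content.splitlines()
    return [s.strip("-• ").strip() for s in lines if s.strip()][:k]

@torch.no_grad()
def build_classifier(class_names: list[str]) -> torch.Tensor:
    """One averaged, L2-normalized text embedding per class."""
    weights = []
    for name in class_names:
        tokens = clip.tokenize(gpt4_descriptions(name)).to(device)
        emb = model.encode_text(tokens).float()
        emb = emb / emb.norm(dim=-1, keepdim=True)
        mean = emb.mean(dim=0)
        weights.append(mean / mean.norm())
    return torch.stack(weights)  # (num_classes, embed_dim)

@torch.no_grad()
def predict(image, classifier: torch.Tensor) -> int:
    """0-shot prediction via cosine similarity to the class embeddings."""
    x = preprocess(image).unsqueeze(0).to(device)
    feat = model.encode_image(x).float()
    feat = feat / feat.norm(dim=-1, keepdim=True)
    return int((feat @ classifier.T).argmax(dim=-1))
```

The few-shot adapter mentioned in the abstract would then operate on top of the per-sentence embeddings, learning which sentences to weight rather than taking a plain mean; its exact form is not specified in this abstract.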
