ECO: Ensembling Context Optimization for Vision-Language Models

07/26/2023
by Lorenzo Agnolucci, et al.

Image recognition has recently undergone a paradigm shift: vision-language models are now used to perform few-shot classification based on textual prompts. Among these, CLIP has shown remarkable zero-shot transfer capabilities by matching an image and a custom textual prompt in its latent space. This has paved the way for several works that engineer or learn textual contexts to maximize CLIP's classification capabilities. In this paper, we follow this trend by learning an ensemble of prompts for image classification. We show that learning diverse and possibly shorter contexts improves results considerably and consistently compared with relying on a single trainable prompt. In particular, we report better few-shot capabilities with no additional cost at inference time. We demonstrate the capabilities of our approach on 11 different benchmarks.
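The abstract does not say how the ensemble is combined, so the following is only a minimal sketch of one way a prompt ensemble could add no inference cost. It assumes that each prompt yields one text embedding per class and that these are averaged once, ahead of time, into a single embedding per class. The vectors are hand-written stand-ins rather than real CLIP outputs, and all function names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def ensemble_class_embeddings(prompt_embs):
    """Merge per-prompt text embeddings into one embedding per class.

    prompt_embs[p][c] is the (stand-in) text embedding of class c under
    prompt p. Averaging is an assumption of this sketch, not something
    the abstract states.
    """
    n_prompts = len(prompt_embs)
    n_classes = len(prompt_embs[0])
    merged = []
    for c in range(n_classes):
        normed = [l2_normalize(prompt_embs[p][c]) for p in range(n_prompts)]
        mean = [sum(column) / n_prompts for column in zip(*normed)]
        merged.append(l2_normalize(mean))
    return merged


def classify(image_emb, class_embs):
    """Return the index of the class with the highest cosine similarity."""
    image = l2_normalize(image_emb)
    scores = [sum(a * b for a, b in zip(image, c)) for c in class_embs]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy data: two prompts, two classes, three-dimensional embeddings.
prompt_embs = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],  # prompt 0: class 0, class 1
    [[0.8, 0.0, 0.6], [0.0, 0.6, 0.8]],  # prompt 1: class 0, class 1
]

# Done once, offline: the ensemble collapses to one vector per class.
class_embs = ensemble_class_embeddings(prompt_embs)

# At inference, the work is the same as with a single prompt:
# one similarity per class.
image_emb = [0.9, 0.1, 0.3]
predicted = classify(image_emb, class_embs)
print(predicted)
```

Under this assumption, the number of prompts affects only the offline merging step; classification still compares the image embedding against exactly one vector per class.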

research
09/12/2022

VL-Taboo: An Analysis of Attribute-based Zero-shot Capabilities of Vision-Language Models

Vision-language models trained on large, randomly collected data had sig...
research
02/07/2023

Boosting Zero-shot Classification with Synthetic Data Diversity via Stable Diffusion

Recent research has shown it is possible to perform zero-shot classifica...
research
10/09/2022

Learning to Decompose Visual Features with Latent Textual Prompts

Recent advances in pre-training vision-language models like CLIP have sh...
research
04/13/2023

What does CLIP know about a red circle? Visual prompt engineering for VLMs

Large-scale Vision-Language Models, such as CLIP, learn powerful image-t...
research
10/18/2022

Perceptual Grouping in Vision-Language Models

Recent advances in zero-shot image recognition suggest that vision-langu...
research
07/03/2023

Evaluating Shutdown Avoidance of Language Models in Textual Scenarios

Recently, there has been an increase in interest in evaluating large lan...
research
03/20/2023

MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

We propose MM-REACT, a system paradigm that integrates ChatGPT with a po...
