Visual Classification via Description from Large Language Models

10/13/2022
by Sachit Menon, et al.

Vision-language models (VLMs) such as CLIP have shown promising performance on a variety of recognition tasks using the standard zero-shot classification procedure: computing similarity between the query image and the embedded words for each category. By using only the category name, this procedure neglects the rich context of additional information that language affords. It gives no intermediate understanding of why a category is chosen, and furthermore provides no mechanism for adjusting the criteria used toward this decision. We present an alternative framework for classification with VLMs, which we call classification by description. We ask VLMs to check for descriptive features rather than broad categories: to find a tiger, look for its stripes, its claws, and more. By basing decisions on these descriptors, we can provide additional cues that encourage the model to use the features we want it to use. In the process, we get a clear idea of which features the model uses to construct its decision; the model gains some level of inherent explainability. We query large language models (e.g., GPT-3) for these descriptors to obtain them in a scalable way. Extensive experiments show our framework has numerous advantages beyond interpretability. We show improvements in accuracy on ImageNet across distribution shifts; demonstrate the ability to adapt VLMs to recognize concepts unseen during training; and illustrate how descriptors can be edited to effectively mitigate bias compared to the baseline.
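At inference time, classification by description amounts to scoring each category by the average image-text similarity over its descriptors rather than over the bare category name. The sketch below is a minimal illustration, not the paper's implementation: it assumes OpenAI's clip package, and the hand-written descriptor dictionary and the "{category}, which has {descriptor}" prompt template are stand-ins for the GPT-3-generated descriptors and exact prompt format.

# Minimal sketch of zero-shot "classification by description" with CLIP.
# Assumptions: OpenAI's `clip` package is installed; `descriptors` is a small
# hand-written stand-in for the lists obtained by prompting GPT-3.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

descriptors = {  # hypothetical descriptor lists
    "tiger": ["orange fur with black stripes", "sharp claws", "a long striped tail"],
    "zebra": ["a black and white striped coat", "a horse-like body", "a short upright mane"],
}

def classify(image_path: str) -> str:
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    with torch.no_grad():
        img = model.encode_image(image)
        img = img / img.norm(dim=-1, keepdim=True)

        scores = {}
        for category, descs in descriptors.items():
            # One prompt per descriptor; the template is an assumed paraphrase
            # of the paper's "{category}, which has {descriptor}" style.
            prompts = [f"{category}, which has {d}" for d in descs]
            txt = model.encode_text(clip.tokenize(prompts).to(device))
            txt = txt / txt.norm(dim=-1, keepdim=True)
            # Category score = mean cosine similarity over its descriptors,
            # so each descriptor contributes an inspectable sub-score.
            scores[category] = (img @ txt.T).mean().item()

    return max(scores, key=scores.get)

print(classify("query.jpg"))  # hypothetical image file

Because each descriptor yields its own similarity score, the per-descriptor contributions can be read off directly; this is the source of the interpretability and of the ability to edit descriptors, for example to remove a cue that encodes unwanted bias.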

Related research

06/12/2023 · Waffling around for Performance: Visual Classification with Random Words and Broad Concepts
07/05/2023 · A ChatGPT Aided Explainable Framework for Zero-Shot Medical Image Diagnosis
09/07/2022 · What does a platypus look like? Generating customized prompts for zero-shot image classification
09/08/2023 · Context-Aware Prompt Tuning for Vision-Language Model with Dual-Alignment
08/08/2023 · Few-shot medical image classification with simple shape and texture text descriptors using vision-language models
08/22/2023 · Knowledge-Aware Prompt Tuning for Generalizable Vision-Language Models
08/07/2023 · Learning Concise and Descriptive Attributes for Visual Recognition
