Visually-Grounded Descriptions Improve Zero-Shot Image Classification

06/05/2023
by Michael Ogezi, et al.

Vision-language models like CLIP have made significant progress on zero-shot vision tasks such as zero-shot image classification (ZSIC). However, generating specific and expressive class descriptions remains a major challenge, and existing approaches suffer from granularity and label-ambiguity issues. To address these challenges, we propose V-GLOSS (Visual Glosses), a novel method that leverages modern language models and semantic knowledge bases to produce visually-grounded class descriptions. We demonstrate V-GLOSS's effectiveness by achieving state-of-the-art results on benchmark ZSIC datasets, including ImageNet and STL-10. In addition, we introduce a silver dataset of class descriptions generated by V-GLOSS and show its usefulness for vision tasks. We make our code and dataset available.
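The abstract describes using expressive, visually-grounded class descriptions as the text prompts for CLIP-style zero-shot classification, in place of bare class names. Below is a minimal sketch of that idea, assuming the Hugging Face transformers CLIP API; the class descriptions, model checkpoint, and image path are illustrative placeholders, not the actual V-GLOSS outputs or the authors' code.

    # Minimal sketch: zero-shot image classification with CLIP, where each class
    # name is replaced by a visually-grounded description. The description strings,
    # model name, and image path are hypothetical stand-ins, not the paper's code.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    # Hypothetical visually-grounded class descriptions (ImageNet-style classes).
    descriptions = {
        "tiger": "a large striped orange-and-black wild cat with a white underside",
        "tabby cat": "a small domestic cat with a striped grey-brown coat",
        "zebra": "a horse-like animal covered in black and white stripes",
    }

    image = Image.open("example.jpg")  # placeholder image path
    inputs = processor(
        text=list(descriptions.values()),
        images=image,
        return_tensors="pt",
        padding=True,
    )
    with torch.no_grad():
        # Image-to-text similarity scores; higher means a better description match.
        logits = model(**inputs).logits_per_image
    probs = logits.softmax(dim=-1).squeeze(0)
    predicted = list(descriptions.keys())[int(probs.argmax())]
    print(predicted, probs.max().item())

The intended contrast is with the standard "a photo of a {class name}" prompt: richer descriptions can disambiguate visually similar or ambiguous labels, which is the granularity and label-ambiguity problem the abstract refers to.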
