I2MVFormer: Large Language Model Generated Multi-View Document Supervision for Zero-Shot Image Classification

12/05/2022
by Muhammad Ferjad Naeem, et al.

Recent works have shown that unstructured text (documents) from online sources can serve as useful auxiliary information for zero-shot image classification. However, these methods require access to a high-quality source like Wikipedia and are limited to a single source of information. Large Language Models (LLMs) trained on web-scale text show impressive abilities to repurpose their learned knowledge for a multitude of tasks. In this work, we provide a novel perspective on using an LLM to provide text supervision for a zero-shot image classification model. The LLM is given a few text descriptions from different annotators as examples and is conditioned on these examples to generate multiple text descriptions for each class (referred to as views). Our proposed model, I2MVFormer, learns multi-view semantic embeddings for zero-shot image classification from these class views. We show that each text view of a class provides complementary information, allowing a model to learn a highly discriminative class embedding. Moreover, we show that I2MVFormer is better at consuming the multi-view text supervision from the LLM than baseline models. I2MVFormer establishes a new state-of-the-art on three public benchmark datasets for zero-shot image classification with unsupervised semantic embeddings.
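The view-generation step described in the abstract can be pictured as building one few-shot prompt per annotator, so that each LLM completion yields a different view of the same class. The sketch below is only an illustration of that idea: the prompt layout, the function names `build_view_prompt` and `build_multi_view_prompts`, and the example descriptions are all assumptions, not the template or data used by the authors. The LLM call itself is left out.

```python
from typing import List, Sequence, Tuple


def build_view_prompt(examples: Sequence[Tuple[str, str]], target_class: str) -> str:
    """Assemble a few-shot prompt asking an LLM to describe `target_class`.

    `examples` holds (class_name, description) pairs from a single annotator.
    The "Class: / Description:" layout is a hypothetical format chosen for
    illustration only.
    """
    parts = []
    for class_name, description in examples:
        parts.append(f"Class: {class_name}\nDescription: {description}\n")
    # The final entry is left open so the LLM completes the description.
    parts.append(f"Class: {target_class}\nDescription:")
    return "\n".join(parts)


def build_multi_view_prompts(
    annotator_examples: Sequence[Sequence[Tuple[str, str]]],
    target_class: str,
) -> List[str]:
    """Return one prompt per annotator; each completion would be one view."""
    return [build_view_prompt(ex, target_class) for ex in annotator_examples]


# Made-up example descriptions from two hypothetical annotators.
annotators = [
    [("cardinal", "A bright red songbird with a pointed crest and a black face mask.")],
    [("cardinal", "Found in woodlands and gardens; eats seeds and insects.")],
]
prompts = build_multi_view_prompts(annotators, "blue jay")
```

Each string in `prompts` would be sent to the LLM separately, and the resulting completions collected as the views of the target class.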

research
12/10/2019

Zero-shot Text Classification With Generative Language Models

This work investigates the use of natural language to enable zero-shot m...
research
06/05/2023

Visually-Grounded Descriptions Improve Zero-Shot Image Classification

Language-vision models like CLIP have made significant progress in zero-...
research
11/28/2016

Gaze Embeddings for Zero-Shot Image Classification

Zero-shot image classification using auxiliary information, such as attr...
research
07/07/2022

Improving Few-Shot Image Classification Using Machine- and User-Generated Natural Language Descriptions

Humans can obtain the knowledge of novel visual concepts from language d...
research
07/14/2021

HTLM: Hyper-Text Pre-Training and Prompting of Language Models

We introduce HTLM, a hyper-text language model trained on a large-scale ...
research
01/19/2022

CM3: A Causal Masked Multimodal Model of the Internet

We introduce CM3, a family of causally masked generative models trained ...
research
01/26/2023

SemSup-XC: Semantic Supervision for Zero and Few-shot Extreme Classification

Extreme classification (XC) involves predicting over large numbers of cl...