Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners

03/03/2023
by   Renrui Zhang, et al.

Visual recognition in low-data regimes requires deep neural networks to learn generalized representations from limited training samples. Recently, CLIP-based methods have shown promising few-shot performance, benefiting from contrastive language-image pre-training. We then ask whether more diverse pre-training knowledge can be cascaded to further assist few-shot representation learning. In this paper, we propose CaFo, a Cascade of Foundation models that incorporates diverse prior knowledge from various pre-training paradigms for better few-shot learning. CaFo combines CLIP's language-contrastive knowledge, DINO's vision-contrastive knowledge, DALL-E's vision-generative knowledge, and GPT-3's language-generative knowledge. Specifically, CaFo works by 'Prompt, Generate, then Cache'. First, we leverage GPT-3 to produce textual inputs for prompting CLIP with rich downstream linguistic semantics. Then, we generate synthetic images via DALL-E to expand the few-shot training data without any manual effort. Finally, we introduce a learnable cache model to adaptively blend the predictions from CLIP and DINO. Through this collaboration, CaFo fully unleashes the potential of different pre-training methods and unifies them to achieve state-of-the-art few-shot classification. Code is available at https://github.com/ZrrSkywalker/CaFo.
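The cache step described above, a learnable key-value cache built from few-shot (and DALL-E-generated) training features whose prediction is blended with CLIP's zero-shot logits, can be sketched in the style of a Tip-Adapter cache. The snippet below is a minimal NumPy illustration only: the feature dimensions, the random stand-in features, and the `alpha`/`beta` hyperparameters are assumptions for demonstration, not the paper's exact configuration, and the adaptive CLIP/DINO blending is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a 16-shot cache over 10 classes, 512-dim features.
num_classes, shots, dim = 10, 16, 512

def normalize(x):
    # L2-normalize feature vectors so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Cache keys: normalized features of the few-shot training images
# (expanded with DALL-E synthetic images in the full method).
# Cache values: their one-hot labels.
cache_keys = normalize(rng.standard_normal((num_classes * shots, dim)))
cache_values = np.eye(num_classes)[np.repeat(np.arange(num_classes), shots)]

# Zero-shot classifier weights from CLIP's text encoder (random stand-ins
# here; in CaFo they are prompted with GPT-3-produced descriptions).
text_weights = normalize(rng.standard_normal((num_classes, dim)))

def cache_blend_logits(test_feat, alpha=1.0, beta=5.5):
    """Blend a key-value cache prediction with CLIP's zero-shot logits."""
    test_feat = normalize(test_feat)
    affinity = test_feat @ cache_keys.T              # similarity to cache keys
    cache_logits = np.exp(-beta * (1.0 - affinity)) @ cache_values
    clip_logits = test_feat @ text_weights.T         # zero-shot prediction
    return clip_logits + alpha * cache_logits

query = rng.standard_normal((4, dim))
print(cache_blend_logits(query).shape)  # (4, 10)
```

In the full method the cache keys are learnable and the final prediction adaptively weights CLIP and DINO; the sketch keeps only the core retrieve-then-blend mechanism.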


Related research

- Collaboration of Pre-trained Models Makes Better Few-shot Learner (09/25/2022)
- Not All Features Matter: Enhancing Few-shot CLIP with Adaptive Prior Refinement (04/03/2023)
- Layer Grafted Pre-training: Bridging Contrastive Learning And Masked Image Modeling For Label-Efficient Representations (02/27/2023)
- Contrastive Learning for Prompt-Based Few-Shot Language Learners (05/03/2022)
- Knowledge Boosting: Rethinking Medical Contrastive Vision-Language Pre-Training (07/14/2023)
- NeuroCLIP: Neuromorphic Data Understanding by CLIP and SNN (06/21/2023)
- HOICLIP: Efficient Knowledge Transfer for HOI Detection with Vision-Language Models (03/28/2023)
