Few-Shot Learning with Visual Distribution Calibration and Cross-Modal Distribution Alignment

05/19/2023
by Runqi Wang et al.

Pre-trained vision-language models have inspired much research on few-shot learning. However, with only a few training images, two crucial problems arise: (1) the visual feature distributions are easily distracted by class-irrelevant information in images, and (2) aligning the visual and language feature distributions is difficult. To deal with the distraction problem, we propose a Selective Attack module, which consists of trainable adapters that generate spatial attention maps of images to guide attacks on class-irrelevant image areas. Perturbing these areas forces the model to capture the critical features and calibrates the visual distributions of image features. To better align the visual and language feature distributions that describe the same object class, we propose a cross-modal distribution alignment module, in which we introduce a vision-language prototype for each class to align the distributions, and adopt the Earth Mover's Distance (EMD) to optimize the prototypes. For efficient computation, an upper bound of the EMD is derived. In addition, we propose an augmentation strategy that increases the diversity of the images and the text prompts, reducing overfitting to the few-shot training images. Extensive experiments on 11 datasets demonstrate that our method consistently outperforms prior art in few-shot learning. The implementation code will be available at https://github.com/bhrqw/SADA.
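The cross-modal alignment idea can be illustrated with a small numerical sketch. The snippet below is not the paper's implementation: it approximates the Earth Mover's Distance between a set of visual features and a set of text-prompt features using entropy-regularized Sinkhorn iterations with uniform weights and cosine distance as the ground cost, whereas the paper derives and optimizes an upper bound of the EMD for efficiency. All function names and parameters here are illustrative.

```python
import numpy as np

def emd_sinkhorn(visual, text, reg=0.05, n_iters=300):
    """Approximate EMD between two feature sets (illustrative sketch).

    visual: (n, d) image features; text: (m, d) prompt features.
    Uses uniform marginal weights and cosine distance as the ground cost.
    """
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    t = text / np.linalg.norm(text, axis=1, keepdims=True)
    cost = 1.0 - v @ t.T                      # cosine distance, in [0, 2]
    a = np.full(len(v), 1.0 / len(v))         # uniform source weights
    b = np.full(len(t), 1.0 / len(t))         # uniform target weights
    K = np.exp(-cost / reg)                   # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):                  # Sinkhorn fixed-point updates
        w = b / (K.T @ u)
        u = a / (K @ w)
    plan = u[:, None] * K * w[None, :]        # (approximate) transport plan
    return float((plan * cost).sum())
```

Well-aligned visual and text distributions yield a distance near zero, so minimizing this quantity pulls the per-class vision-language prototypes of the two modalities together; deriving an EMD upper bound, as the abstract states, serves efficient computation by avoiding an iterative solver at training time.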

Related research

- SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification (11/28/2022)
- Temporal and cross-modal attention for audio-visual zero-shot learning (07/20/2022)
- Text-to-feature diffusion for audio-visual few-shot learning (09/07/2023)
- Black Box Few-Shot Adaptation for Vision-Language models (04/04/2023)
- Deeply Coupled Cross-Modal Prompt Learning (05/29/2023)
- Gestalt-Guided Image Understanding for Few-Shot Learning (02/08/2023)
- Detecting Human-Object Interactions with Object-Guided Cross-Modal Calibrated Semantics (02/01/2022)
