Multimodal Few-Shot Object Detection with Meta-Learning Based Cross-Modal Prompting

04/16/2022
by   Guangxing Han, et al.

We study multimodal few-shot object detection (FSOD) in this paper, using both few-shot visual examples and class semantic information for detection. Most previous works focus on either few-shot or zero-shot object detection, ignoring the complementarity of visual and semantic information. We first show that meta-learning and prompt-based learning, the most commonly used methods for few-shot learning and for zero-shot transfer from pre-trained vision-language models to downstream tasks, respectively, are conceptually similar: both reformulate the objective of the downstream task to match the pre-training task, mostly without tuning the parameters of the pre-trained model. Based on this observation, we propose to combine meta-learning with prompt-based learning for multimodal FSOD without fine-tuning, by learning transferable class-agnostic multimodal FSOD models over many-shot base classes. Specifically, to better exploit pre-trained vision-language models, we propose meta-learning-based cross-modal prompting, which generates soft prompts conditioned on the few-shot visual examples; these prompts are then used to extract the semantic prototype. The extracted semantic prototype and the few-shot visual prototype are fused to produce the multimodal prototype for detection. Our model efficiently fuses visual and semantic information at both the token level and the feature level. We comprehensively evaluate the proposed multimodal FSOD models on multiple few-shot object detection benchmarks, achieving promising results.
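To make the pipeline concrete, below is a minimal sketch (not the authors' released code) of cross-modal prompting and prototype fusion, assuming a frozen CLIP-style text encoder. All module and function names here (PromptGenerator, MultimodalPrototype, build_class_prototype) are hypothetical illustrations of the idea described in the abstract: soft prompts are generated from the few-shot visual features (token-level fusion), passed through the text encoder to obtain a semantic prototype, and fused with the visual prototype (feature-level fusion).

```python
import torch
import torch.nn as nn


class PromptGenerator(nn.Module):
    """Generates soft prompt tokens conditioned on few-shot visual features."""

    def __init__(self, visual_dim: int, token_dim: int, num_prompts: int = 4):
        super().__init__()
        self.num_prompts = num_prompts
        self.token_dim = token_dim
        # Map the averaged visual prototype to a sequence of soft prompt tokens.
        self.proj = nn.Linear(visual_dim, num_prompts * token_dim)

    def forward(self, support_feats: torch.Tensor) -> torch.Tensor:
        # support_feats: (K, visual_dim) features of the K few-shot examples.
        visual_proto = support_feats.mean(dim=0)          # (visual_dim,)
        prompts = self.proj(visual_proto)                 # (num_prompts * token_dim,)
        return prompts.view(self.num_prompts, self.token_dim)


class MultimodalPrototype(nn.Module):
    """Fuses the semantic and visual prototypes at the feature level."""

    def __init__(self, dim: int):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, semantic_proto: torch.Tensor,
                visual_proto: torch.Tensor) -> torch.Tensor:
        return self.fuse(torch.cat([semantic_proto, visual_proto], dim=-1))


def build_class_prototype(text_encoder, prompt_gen, fuser,
                          class_token_embeds, support_feats):
    """text_encoder: frozen module mapping token embeddings (L, D) -> (D,).
    class_token_embeds: (T, D) embedded class-name tokens.
    support_feats: (K, D) few-shot visual features for one class."""
    soft_prompts = prompt_gen(support_feats)                  # (P, D)
    # Token-level fusion: prepend the visually conditioned soft prompts
    # to the class-name token embeddings before encoding.
    tokens = torch.cat([soft_prompts, class_token_embeds], dim=0)
    with torch.no_grad():                                     # encoder stays frozen
        semantic_proto = text_encoder(tokens)                 # (D,)
    visual_proto = support_feats.mean(dim=0)                  # (D,)
    # Feature-level fusion into the final multimodal prototype,
    # which the detection head would match against query proposals.
    return fuser(semantic_proto, visual_proto)
```

In this reading, only PromptGenerator and MultimodalPrototype carry trainable parameters; meta-training them over many-shot base classes is what makes the prompting class-agnostic and transferable to novel classes without fine-tuning.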


