OmDet: Language-Aware Object Detection with Large-scale Vision-Language Multi-dataset Pre-training

09/10/2022
by Tiancheng Zhao, et al.

Advancing object detection to open-vocabulary and few-shot transfer has long been a challenge for computer vision research. This work explores a continual learning approach that enables a detector to expand its zero/few-shot capabilities via multi-dataset vision-language pre-training. Using natural language as the knowledge representation, we explore methods to accumulate a "visual vocabulary" from different training datasets and unify the task under a language-conditioned detection framework. Specifically, we propose OmDet, a novel language-aware detector, together with a novel training mechanism. The proposed multimodal detection network resolves the technical challenges of multi-dataset joint training and generalizes to an arbitrary number of training datasets without requiring manual merging of label taxonomies. Experimental results on COCO, Pascal VOC, and Wider Face/Pedestrian confirm the efficacy of the approach: joint training achieves on-par or higher scores than training on each dataset separately. Moreover, we pre-train on more than 20 million images with a vocabulary of 4 million unique object names and evaluate the resulting model on 35 downstream tasks of ODinW. The results show that OmDet achieves state-of-the-art fine-tuned performance on ODinW, and analysis shows that scaling up the proposed pre-training method continues to improve zero/few-shot tuning performance, suggesting a promising route to further scaling.
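To make the idea of language-conditioned detection concrete, the sketch below shows one way such an interface could look. It is a minimal illustration under assumptions, not the paper's actual architecture: the class name LanguageConditionedDetector, the stand-in encoders, and the tensor shapes are all hypothetical. The point it illustrates is that classification is done by matching region features against embeddings of the current dataset's label strings, so each dataset keeps its own label list and no taxonomy merging is needed.

```python
import torch
import torch.nn as nn

class LanguageConditionedDetector(nn.Module):
    """Minimal sketch of a language-conditioned detector (hypothetical).

    The detector is conditioned on a free-form "visual vocabulary":
    a list of label strings taken from whichever dataset the current
    batch comes from. Classification is done by similarity between
    region features and label embeddings, so datasets with different
    label taxonomies can be trained jointly without merging them.
    """

    def __init__(self, image_encoder, text_encoder, hidden_dim=256):
        super().__init__()
        self.image_encoder = image_encoder   # maps images -> (B, Q, D) region features
        self.text_encoder = text_encoder     # maps label strings -> (C, D) embeddings
        self.box_head = nn.Linear(hidden_dim, 4)  # predicts normalized box coordinates

    def forward(self, images, label_names):
        region_feats = self.image_encoder(images)       # (B, Q, D)
        label_embeds = self.text_encoder(label_names)   # (C, D)
        logits = region_feats @ label_embeds.t()        # (B, Q, C) class scores
        boxes = self.box_head(region_feats).sigmoid()   # (B, Q, 4) boxes in [0, 1]
        return boxes, logits


# Toy usage with random stand-in encoders (illustration only).
img_enc = lambda x: torch.randn(x.shape[0], 100, 256)   # 100 region queries per image
txt_enc = lambda names: torch.randn(len(names), 256)    # one embedding per label string
detector = LanguageConditionedDetector(img_enc, txt_enc)

# A COCO-style batch and a face-detection batch simply pass different vocabularies.
boxes, logits = detector(torch.randn(2, 3, 224, 224), ["person", "dog", "car"])
boxes, logits = detector(torch.randn(2, 3, 224, 224), ["face", "pedestrian"])
```

In a setup like this, continually adding a new dataset only extends the accumulated vocabulary of label strings; the detection head never needs a fixed-size classification layer tied to one merged taxonomy.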

Related research:

06/30/2023
Zero-shot Nuclei Detection via Visual-Language Pre-trained Models
Large-scale visual-language pre-trained models (VLPM) have proven their ...

02/26/2021
Learning Transferable Visual Models From Natural Language Supervision
State-of-the-art computer vision systems are trained to predict a fixed ...

04/10/2023
Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition
This work proposes POMP, a prompt pre-training method for vision-languag...

04/16/2022
Multimodal Few-Shot Object Detection with Meta-Learning Based Cross-Modal Prompting
We study multimodal few-shot object detection (FSOD) in this paper, usin...

08/18/2023
RLIPv2: Fast Scaling of Relational Language-Image Pre-training
Relational Language-Image Pre-training (RLIP) aims to align vision repre...

04/10/2023
Defense-Prefix for Preventing Typographic Attacks on CLIP
Vision-language pre-training models (VLPs) have exhibited revolutionary ...

06/21/2021
GAIA: A Transfer Learning System of Object Detection that Fits Your Needs
Transfer learning with pre-training on large-scale datasets has played a...
